An algorithm for matching text (possibly original). - comp.compression

Ross Williams   Jan 27 1992, 3:32 pm

Dear Compressor Heads,

Here is a description of an idea I just had for text matching. It may
have some compression application (see the end of the description of
the idea). I don't know if it is original. Whether it is or not, it
could be of interest to compressor heads and those involved with large
text databases and text matching.

Mainly though, I'm just posting this idea here publicly to impede
anyone who might independently have had the idea from patenting it. If
you like this idea and you want it to be in the public domain, do
something to make it even clearer that the idea has been exposed
publicly (and is therefore more clearly prior art).

Ross Williams
r...@spam.adelaide.edu.au
27-Jan-1992.
(Header for my Guru text database program).
--<Guru>--
D=27-Jan-1992
F=superdiff.mes
S=An idea for a super differences algorithm/program.
K=differences superdifferences text matching algorithm string searching
--<Guru>--

IDEA FOR SUPERDIFFERENCES ALGORITHM/PROGRAM

Author: Ross Williams.
Date  : Monday 27-Jan-1992, 10:40pm CST Australia (when I had the
idea).

The Problem

Quite often one finds, because of various backup activities and so
on, that one has multiple large directory trees full of files, many of
which are identical or share large identical chunks. From this grows
the need for a superdifferences program, that not only compares files,
but also directories of files.

To construct a program to find all identical files in a file system is
easy --- just compute a table of checksums, one for each file, and
then perform full comparisons on files that share the same checksum.
(If you wish, for "checksum" read "hash value").
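
As a concrete illustration of this file-level pass, here is a minimal
sketch in Python (mine, not from the original post; the choice of MD5
and the read-everything-into-memory style are just for brevity):

    import hashlib
    import os

    def find_identical_files(root):
        """Group files under 'root' by checksum, then confirm duplicates."""
        by_sum = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                with open(path, 'rb') as f:
                    digest = hashlib.md5(f.read()).hexdigest()
                by_sum.setdefault(digest, []).append(path)
        for paths in by_sum.values():
            if len(paths) > 1:
                # A shared checksum only makes files *candidates*; do the
                # full comparison the text calls for before declaring them
                # identical.
                contents = [open(p, 'rb').read() for p in paths]
                if all(c == contents[0] for c in contents):
                    print("identical:", paths)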

A harder problem is identifying files that are NEARLY identical or
which share large slabs of text. For example, two .C files being
nearly the same versions of a program will share most of their text.
The problem proposed is to construct an algorithm/utility that will
find such shared slabs of text in the file system. The result would be
a very useful utility.

A Brute Force Solution

An extremely thorough brute force approach would be to take every
subsequence of N characters (e.g. with N=1000 say) at every alignment
(i.e. commencing at every byte) in every file and form a huge table of
checksums in the same manner as the table of checksums at the file
level. The table would identify all common strings of 1000 characters
in different files and these connections could be used to explore and
identify longer matches.

The only problem with this idea is that the table would be about four
times the size of the file system (because there would be (say) a
4-byte checksum for each byte (being the checksum for the string of N
characters commencing at that byte)).
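
A sketch of that brute-force table in Python (again mine; CRC-32
stands in for the unspecified 4-byte checksum):

    import zlib

    def brute_force_table(paths, N=1000):
        """Checksum the N-byte window at EVERY byte offset of every file."""
        table = {}  # checksum -> list of (path, position)
        for path in paths:
            data = open(path, 'rb').read()
            for pos in range(len(data) - N + 1):
                c = zlib.crc32(data[pos:pos + N])
                table.setdefault(c, []).append((path, pos))
        return table  # about one 4-byte checksum per input byte, as noted

Recomputing each window from scratch like this also costs O(N) work
per byte; the sliding window checksum mentioned later removes that
factor.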

To cut down on the number of checksums required, we might think of
only recording a checksum every M bytes (where M=50 say). If M<<N this
should pose no problem. Unfortunately, this scheme fails totally
because of alignment problems. Two files sharing a span of text may
not have those two spans aligned at MOD 50 boundaries.

Example:
File1:
!This is some shared text.

File2:
This is some shared text.

In the above, although the two files share a slab of text, the two
slabs are not aligned (because one has the exclamation mark). If
checksums are performed every M bytes, the match will not be detected.
Only checksums on every byte boundary will do the trick.

The Good Part (My Idea)

My idea provides a stochastic solution that does not guarantee to find
all matches, but will most likely find most matches. It could
certainly be used to construct a practical tool.

The idea is this: Run through each file computing a checksum of C bits
(e.g. C=32) of all N-byte strings on ALL byte boundaries as with the
brute force approach. HOWEVER, only store a (checksum,file,position)
tuple in the checksum table if the bottom B bits (e.g. with (say)
B=10) of the checksum are zero (or some other B-bit constant).

B can be chosen to taste, with the size of the table and the
probability of finding each match being inversely related to B. A
value of B=0 corresponds to the brute force approach which will find
all matches, but produce a massive table. A value of B=C will produce
a table of at most one entry and will most likely find no matches. The
number 1/(2^B) is the probability that the checksum commencing at any
particular byte will be entered into the checksum table.

A sliding window checksum could be used so that only about two
arithmetic operations are required per byte.
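
Here is one way the selection pass might look in Python (my sketch;
the post does not specify the checksum, so a simple polynomial rolling
hash plays the part of the sliding window checksum):

    def select_anchors(data, N=1000, B=10, BASE=257, MOD=1 << 32):
        """Yield (checksum, position) for every N-byte window of the
        bytes object 'data' whose bottom B checksum bits are zero --
        roughly one window in 2**B."""
        if len(data) < N:
            return
        mask = (1 << B) - 1
        top = pow(BASE, N - 1, MOD)   # weight of the byte leaving the window
        h = 0
        for x in data[:N]:            # checksum of the first window, O(N)
            h = (h * BASE + x) % MOD
        if h & mask == 0:
            yield h, 0
        for pos in range(1, len(data) - N + 1):
            # Slide one byte: drop the old byte, take in the new one.
            # A few arithmetic operations per byte; an additive sliding
            # checksum would get closer to the post's two.
            h = ((h - data[pos - 1] * top) * BASE + data[pos + N - 1]) % MOD
            if h & mask == 0:
                yield h, pos

Because the test "bottom B bits are zero" depends only on the window's
contents, two files sharing a slab select anchors at the same offsets
within the slab regardless of how the slab is aligned in each file,
which is exactly what defeats the MOD 50 problem above.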

Discussion

The technique that the algorithm uses is to collect a "random" sample
of checksums of N-byte sequences from all of the files, but to define
"random" deterministically so that the same "random" strings will tend
to be collected from the different files. Thus, if the algorithm scans
over a span of identical text in two different files, it may not pick
up any checksums from the texts (although if B is not set too high
this should not happen too often), but if it does pick up any
checksums, it will pick them up at the SAME points in the strings -
and so a match will be detected!

The idea behind this approach was inspired by the way that humans
might perform the task. A human, asked to roughly compare several
documents, would not compare every substring against every other, but
would visually scan each document for "interesting" features and these
would be used as reference points to see if large slabs of the
document have been seen before. For example, if you put "BISHOP IN SEX
SCANDAL" in a comment in the middle of one of your Pascal source
files, a human reader would probably consider that string a relatively
"interesting" feature of that file, and, if the same feature was seen
in another file, the reader would instantly recognise it and try to
compare that file against the earlier one seen with the same feature.
Note that the reader's definition of "interesting" means that the
reader (whom we will temporarily think of as a semi-automaton) does
not remember "ISHOP IN SEX SCANDAL ", or " BISHOP IN SEX SCANDA", but
"BISHOP IN SEX SCANDAL". The interestingness or otherwise of the
information automatically, deterministically, creates a set of
alignments at which the various scraps of the different files can be
compared.

To get the computer to do the same thing, we merely need to define
"interesting". A simple definition, and the one I have chosen to use
in the above algorithm, is "random" - just carve out a subspace of the
checksum space! The algorithm then stores these points as
"interesting". So long as the total number of interesting things
recorded is eclipsed by the size of the reduced checksum space, this
algorithm forms an efficient method for identifying documents sharing
large slabs of text. B, which determines the density of interesting
things, can be set so an interesting feature appears only every couple
of thousand bytes or so. And remember that we can set N (the number of
bytes used in the checksum) up high too (e.g. 1000).

Once the algorithm has identified different files containing text
slabs having the same checksum, it can explore around the shared parts
of the files to see just how long the match is. The result could be a
comprehensive report outlining (nearly all of) the shared texts in
one's file system.
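
The exploration step might be as simple as the following Python sketch
(mine): given two buffers that share an anchor, extend the match
byte-by-byte in both directions, comparing actual bytes rather than
trusting the (possibly colliding) checksums.

    def extend_match(a, pa, b, pb):
        """a[pa:] and b[pb:] start at a shared anchor; grow the match
        as far as the surrounding bytes agree."""
        lo = 0
        while pa - lo > 0 and pb - lo > 0 and a[pa-lo-1] == b[pb-lo-1]:
            lo += 1
        hi = 0
        while pa + hi < len(a) and pb + hi < len(b) and a[pa+hi] == b[pb+hi]:
            hi += 1
        return pa - lo, pb - lo, lo + hi   # starts in a and b, match length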

It would be easy to calculate the performance probabilities (e.g.
failure to match, size of tables etc) for this system.
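
As a rough worked example of such a calculation (my arithmetic,
treating window checksums as independent): a shared slab of L bytes
contains L-N+1 complete windows, each entered into the table with
probability 1/(2^B), so the slab is missed with probability about
(1 - 1/(2^B))^(L-N+1). With B=10, N=1000 and L=11000 that is roughly
(1 - 1/1024)^10001, or about e^-10, so the slab is almost certain to
be found. The table holds about one entry per 2^B input bytes.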

A Compression Application

A file system that ran this idea continuously could identify large
tracts of shared text in the file system and replace them by
references to their copies. This would be substring macro compression
on a grand scale. The files being compared would have to be in
uncompressed form though, because ordinary compression scrambles
everything, making comparisons of substrings difficult.
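
In Python, the replacement step might be sketched like this (entirely
my illustration; the tuple format is made up):

    def macro_compress(data, matches):
        """matches: non-overlapping (pos, ref_file, ref_pos, length)
        tuples sorted by pos. Emit literal runs and reference tokens."""
        out, cur = [], 0
        for pos, ref_file, ref_pos, length in matches:
            out.append(('lit', data[cur:pos]))
            out.append(('ref', ref_file, ref_pos, length))
            cur = pos + length
        out.append(('lit', data[cur:]))
        return out

Decompression would walk the list, copying literals and resolving each
reference by reading length bytes at ref_pos from ref_file.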

Conclusion

It is possible that this is an original idea. It would be fun to write
a utility that uses this algorithm! The utility would be extremely
useful for locating old backup copies of programs in one's file system
or any number of other searching tasks.

--<End of Idea>--

Follow-ups:
Dwight Wilson    Jan 27 1992, 3:41 pm
Patrick Smith    Jan 27 1992, 11:41 pm