An algorithm for matching text (possibly original). - comp.compression | Google Groups
Ross Williams    Jan 27 1992, 3:32 pm

Dear Compressor Heads,
Here is a description of an idea I just had for text matching. It may
have some compression application (see the end of the description of
the idea). I don't know if it is original. Whether it is or not, it
could be of interest to compressor heads and those involved with large
text databases and text matching.

Mainly though, I'm just posting this idea here publicly to impede
anyone who might independently have had the idea from patenting it. If
you like this idea and you want it to be in the public domain, do
something to make it even clearer that the idea has been exposed
publicly (and is therefore more clearly prior art).

Ross Williams
r...@spam.adelaide.edu.au
27-Jan-1992.
(Header for my Guru text database program).
--<Guru>--
D=27-Jan-1992
F=superdiff.mes
S=An idea for a super differences algorithm/program.
K=differences superdifferences text matching algorithm string searching
--<Guru>--
IDEA FOR SUPERDIFFERENCES ALGORITHM/PROGRAM

Author: Ross Williams.
Date  : Monday 27-Jan-1992, 10:40pm CST Australia (when I had the
idea).
The Problem

Quite often one finds, because of various backup activities and so on,
that one has multiple large directory trees full of files, many of
which are identical or share large identical chunks. From this grows
the need for a superdifferences program that compares not only files,
but also directories of files.

To construct a program to find all identical files in a file system is
easy --- just compute a table of checksums, one for each file, and
then perform full comparisons on files that share the same checksum.
(If you wish, for "checksum" read "hash value".)

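The file-level pass described above can be sketched as follows (a
minimal sketch, not the post's actual Guru code; the choice of MD5 as
the checksum and os.walk for the traversal are my own):

```python
import hashlib
import os
from collections import defaultdict

def find_identical_files(root):
    """Group files under `root` by checksum, then confirm candidate
    groups with a full byte-for-byte comparison."""
    by_checksum = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.md5(open(path, "rb").read()).hexdigest()
            by_checksum[digest].append(path)
    groups = []
    for paths in by_checksum.values():
        if len(paths) > 1:
            # Files sharing a checksum are only *candidates*: verify.
            # (For simplicity this sketch drops groups with a partial
            # collision instead of subdividing them.)
            data = [open(p, "rb").read() for p in paths]
            if all(d == data[0] for d in data):
                groups.append(paths)
    return groups
```

The full comparison at the end is what makes the checksum merely an
accelerator: a hash collision can propose a false candidate pair, but
it can never cause a false report.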
groups.google.com/group/comp.compression/browse_thread/thread/.../646d449de1cddd?q="construc...
EMCVMW 1027
7/3/12

A harder problem is identifying files that are NEARLY identical or
which share large slabs of text. For example, two .C files that are
nearly the same versions of a program will share most of their text.
The problem proposed is to construct an algorithm/utility that will
find such shared slabs of text in the file system. The result would be
a very useful utility.
A Brute Force Solution

An extremely thorough brute force approach would be to take every
subsequence of N characters (e.g. with N=1000 say) at every alignment
(i.e. commencing at every byte) in every file and form a huge table of
checksums in the same manner as the table of checksums at the file
level. The table would identify all common strings of 1000 characters
in different files, and these connections could be used to explore and
identify longer matches.

The only problem with this idea is that the table would be about four
times the size of the file system (because there would be (say) a
4-byte checksum for each byte (being the checksum for the string of N
characters commencing at that byte)).

To cut down on the number of checksums required, we might think of
only recording a checksum every M bytes (where M=50 say). If M<<N this
should pose no problem. Unfortunately, this scheme fails totally
because of alignment problems. Two files sharing a span of text may
not have those two spans aligned at MOD 50 boundaries.

Example:
File1:
!This is some shared text.

File2:
This is some shared text.

In the above, although the two files share a slab of text, the two
slabs are not aligned (because one has the exclamation mark). If
checksums are performed every M bytes, the match will not be detected.
Only checksums on every byte boundary will do the trick.
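The alignment failure is easy to demonstrate. In this sketch (my own
illustration, not from the post; raw substrings stand in for checksums,
and the window and stride sizes are scaled down for the demo), sampling
at fixed M-byte offsets misses the shifted copy, while sampling at
every byte boundary finds it:

```python
def strided_checksums(data, n, m):
    """'Checksum' of each n-byte window, sampled only every m bytes.
    (Raw substrings stand in for real checksums in this demo.)"""
    return {data[i:i + n] for i in range(0, len(data) - n + 1, m)}

shared = b"This is some shared text."
file1 = b"!" + shared   # the same slab, shifted by one byte
file2 = shared

# Fixed-stride sampling (M=4) finds no common window...
assert strided_checksums(file1, 8, 4).isdisjoint(strided_checksums(file2, 8, 4))
# ...but sampling at every byte boundary (M=1) does.
assert strided_checksums(file1, 8, 1) & strided_checksums(file2, 8, 1)
```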

The Good Part (My Idea)

My idea provides a stochastic solution that does not guarantee to find
all matches, but will most likely find most matches. It could
certainly be used to construct a practical tool.

The idea is this: Run through each file computing a checksum of C bits
(e.g. C=32) of all N-byte strings on ALL byte boundaries as with the
brute force approach. HOWEVER, only store a (checksum, file, position)
tuple in the checksum table if the bottom B bits (e.g. with (say)
B=10) of the checksum are zero (or some other B-bit constant).

B can be chosen to taste, with the size of the table and the
probability of finding each match being inversely related to B. A
value of B=0 corresponds to the brute force approach, which will find
all matches, but produce a massive table. A value of B=C will produce
a table of at most one entry and will most likely find no matches. The
number 1/(2^B) is the probability that the checksum commencing at any
particular byte will be entered into the checksum table.

A sliding window checksum could be used so that only about two
arithmetic operations are required per byte.
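A sketch of the sampling pass (my own construction, not from the post:
a Rabin-Karp style polynomial hash stands in for the unspecified
sliding-window checksum, and N=16, B=4 are scaled down for
illustration):

```python
MASK = 0xFFFFFFFF          # C = 32: keep checksums to 32 bits

def sampled_fingerprints(data, n=16, b=4, base=257):
    """Checksum every n-byte window with a rolling polynomial hash,
    but record (position -> checksum) only when the bottom b bits of
    the checksum are zero -- the post's "interesting" criterion."""
    if len(data) < n:
        return {}
    high = pow(base, n - 1, MASK + 1)   # weight of the outgoing byte
    keep, h = {}, 0
    for byte in data[:n]:               # checksum of the first window
        h = (h * base + byte) & MASK
    for i in range(len(data) - n + 1):
        if h & ((1 << b) - 1) == 0:     # bottom b bits zero: store it
            keep[i] = h
        if i + n < len(data):           # slide one byte onward
            h = ((h - data[i] * high) * base + data[i + n]) & MASK
    return keep
```

Because the decision to keep a checksum depends only on the window's
contents, two files sharing a slab of text will tend to select the
same sample points inside that slab, regardless of how the slab is
aligned within each file.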
Discussion

The technique that the algorithm uses is to collect a "random" sample
of checksums of N-byte sequences from all of the files, but to define
"random" deterministically so that the same "random" strings will tend
to be collected from the different files. Thus, if the algorithm scans
over a span of identical text in two different files, it may not pick
up any checksums from the texts (although if B is not set too high
this should not happen too often), but if it does pick up any
checksums, it will pick them up at the SAME points in the strings -
and so a match will be detected!

The idea behind this approach was inspired by the way that humans
might perform the task. A human, asked to roughly compare several
documents, would not compare every substring against every other, but
would visually scan each document for "interesting" features, and
these would be used as reference points to see if large slabs of the
document have been seen before. For example, if you put "BISHOP IN SEX
SCANDAL" in a comment in the middle of one of your Pascal source
files, a human reader would probably consider that string a relatively
"interesting" feature of that file, and, if the same feature was seen
in another file, the reader would instantly recognise it and try to
compare that file against the earlier one seen with the same feature.
Note that the reader's definition of "interesting" means that the
reader (whom we will temporarily think of as a semi-automaton) does
not remember "ISHOP IN SEX SCANDAL ", or " BISHOP IN SEX SCANDA", but
"BISHOP IN SEX SCANDAL". The interestingness or otherwise of the
information automatically, deterministically, creates a set of
alignments at which the various scraps of the different files can be
compared.

To get the computer to do the same thing, we merely need to define
"interesting". A simple definition, and the one I have chosen to use
in the above algorithm, is "random" - just carve out a subspace of the
checksum space! The algorithm then stores these points as
"interesting". So long as the total number of interesting things
recorded is eclipsed by the size of the reduced checksum space, this
algorithm forms an efficient method for identifying documents sharing
large slabs of text. B, which determines the density of interesting
things, can be set so an interesting feature appears only every couple
of thousand bytes or so. And remember that we can set N (the number of
bytes used in the checksum) up high too (e.g. 1000).

Once the algorithm has identified different files containing text
slabs having the same checksum, it can explore around the shared parts
of the files to see just how long the match is. The result could be a
comprehensive report outlining (nearly all of) the shared texts in
one's file system.
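Exploring around a shared anchor can be as simple as growing the match
byte by byte in both directions (a sketch; the function name and
interface are mine, not the post's):

```python
def extend_match(data1, i, data2, j, n):
    """Given an n-byte anchor match data1[i:i+n] == data2[j:j+n],
    grow it left and right to the maximal shared run."""
    lo1, lo2 = i, j
    while lo1 > 0 and lo2 > 0 and data1[lo1 - 1] == data2[lo2 - 1]:
        lo1, lo2 = lo1 - 1, lo2 - 1
    hi1, hi2 = i + n, j + n
    while hi1 < len(data1) and hi2 < len(data2) and data1[hi1] == data2[hi2]:
        hi1, hi2 = hi1 + 1, hi2 + 1
    # the shared run is data1[lo1:hi1] == data2[lo2:hi2]
    return lo1, hi1, lo2, hi2
```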

It would be easy to calculate the performance probabilities for this
system (e.g. failure to match, size of tables, etc.).
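For instance (a sketch of the calculation, mine rather than the
post's; treating overlapping windows as independent is an
approximation): each window is recorded with probability 1/2^B, so the
table size and the chance of catching a shared span follow directly.

```python
def sampling_stats(total_bytes, span_len, n=1000, b=10):
    """Expected behaviour of the B-bit sampling rule: table size for a
    file system of total_bytes, and the chance that a shared span of
    span_len bytes (span_len >= n) yields at least one stored anchor.
    Treats overlapping windows as independent, which is approximate."""
    p = 1.0 / 2 ** b                    # chance any window is recorded
    expected_entries = total_bytes * p
    windows = span_len - n + 1          # whole n-byte windows in the span
    p_detect = 1.0 - (1.0 - p) ** windows
    return expected_entries, p_detect

# e.g. with B=10, a 1 GB file system stores about 1e9/1024 entries,
# and a shared 5000-byte span is found with probability near 0.98.
```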

A Compression Application

A file system that ran this idea continuously could identify large
tracts of shared text in the file system and replace them by
references to their copies. This would be substring macro compression
on a grand scale. The files being compared would have to be in
uncompressed form though, because ordinary compression scrambles
everything, making comparisons of substrings difficult.

Conclusion

It is possible that this is an original idea. It would be fun to write
a utility that uses this algorithm! The utility would be extremely
useful for locating old backup copies of programs in one's file system
or any number of other searching tasks.

--<End of Idea>--

Dwight Wilson    Jan 27 1992, 3:41 pm
Patrick Smith    Jan 27 1992, 11:41 pm