IPR2013-00083, No. 1016 Exhibit - U Manber, “Finding Similar Files in a Large File System”, University of Arizona Technical Report 1994 (P.T.A.B. Dec. 15, 2012)

Finding Similar Files in a Large File System
`
`Udi Manber
`
`TR 93-33
`
`October 1993
`
`DEPARTMENT OF COMPUTER SCIENCE
`

`To appear in the
`1994 Winter USENIX Technical Conference
`
`FINDING SIMILAR FILES IN A LARGE FILE SYSTEM
`
`Udi Manber1
`
`Department of Computer Science
`University of Arizona
`Tucson, AZ 85721
`udi@cs.arizona.edu
`
`ABSTRACT
`
`We present a tool, called sif, for ﬁnding all similar ﬁles in a large ﬁle system. Files are considered similar if they
have a significant number of common pieces, even if they are very different otherwise. For example, one file may be
`contained, possibly with some changes, in another ﬁle, or a ﬁle may be a reorganization of another ﬁle. The run-
`ning time for ﬁnding all groups of similar ﬁles, even for as little as 25% similarity, is on the order of 500MB to
`1GB an hour. The amount of similarity and several other customized parameters can be determined by the user at
`a post-processing stage, which is very fast. Sif can also be used to very quickly identify all similar ﬁles to a query
file using a preprocessed index. Applications of sif can be found in file management, information collecting (to
`remove duplicates), program reuse, ﬁle synchronization, data compression, and maybe even plagiarism detection.
`
`1. Introduction
`Our goal is to identify ﬁles that came from the same source or contain parts that came from the same source. We
`say that two ﬁles are similar if they contain a signiﬁcant number of common substrings that are not too small. We
`would like to ﬁnd enough common substrings to rule out chance, without requiring too many so that we can detect
`similarity even if signiﬁcant parts of the ﬁles are different. The two ﬁles need not even be similar in size; one ﬁle
`may be contained, possibly with some changes, in the other. The user should be able to indicate the amount of
`similarity that is sought and also the type of similarity (e.g., ﬁles of very different sizes may be ruled out). Similar
`ﬁles may be different versions of the same program, different programs containing a similar procedure, different
`drafts of an article, etc.
`
`1 Supported in part by an NSF Presidential Young Investigator Award (grant DCR-8451397), with matching funds from AT&T, by NSF grants
`CCR-9002351 and CCR-9301129, and by the Advanced Research Projects Agency under contract number DABT63-93-C-0052. Part of this
`work was done while the author was visiting the University of Washington.
`
`The information contained in this paper does not necessarily reﬂect the position or the policy of the U.S. Government or other sponsors of this
`research. No ofﬁcial endorsement should be inferred.
`
`

`
`The problem of computing the similarity between two ﬁles has been studied extensively and many pro-
`grams, such as UNIX diff, have been developed to solve it. But using diff for all pairwise comparisons among,
`say, 5000 ﬁles would require more than 12 million comparisons taking about 5 months of CPU time, assuming 1
`second per comparison. Even an order of magnitude improvement in comparison time will still make this
`approach much too slow. We present a new approach for this problem based on what we call approximate ﬁnger-
`prints. Approximate ﬁngerprints provide a compact representation of a ﬁle such that, with high probability, the
`ﬁngerprints of two similar ﬁles are similar (but not necessarily equal), and the ﬁngerprints of two non-similar ﬁles
`are different. Sif works in two different modes: all-against-all and one-against-all. The ﬁrst mode ﬁnds all groups
`of similar ﬁles in a large ﬁle system and gives a rough indication of the similarity. The running time is essentially
`linear in the total size of all ﬁles2 and thus sif can be used for large ﬁle systems. The second mode compares a
`given ﬁle to a preprocessed approximate index of all other ﬁles, and determines very quickly (e.g., in 3 seconds for
`4000 ﬁles of 60MB) all ﬁles that are similar to the given ﬁle. In both cases, similarity can be detected even if the
`similar portions constitute as little as 25% of the size of the smaller ﬁle.
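The cost estimate for the pairwise approach can be checked with a few lines of arithmetic. The sketch below uses the figures from the text (5000 files, 1 second per comparison); the 30-day month is our own assumption.

```python
# Back-of-the-envelope cost of running diff on every pair of files.
n_files = 5000
seconds_per_comparison = 1.0           # assumption stated in the text

pairs = n_files * (n_files - 1) // 2   # unordered pairs: n choose 2
cpu_seconds = pairs * seconds_per_comparison
cpu_days = cpu_seconds / 86_400
cpu_months = cpu_days / 30             # assuming 30-day months

print(pairs)                  # 12497500
print(round(cpu_days, 1))     # 144.6
print(round(cpu_months, 1))   # 4.8
```

The quadratic growth is the point: doubling the number of files roughly quadruples the number of comparisons.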
`We foresee several applications for sif. The most obvious one is to help in ﬁle management. We tested per-
`sonal ﬁle systems and found groups of similar ﬁles of many different kinds. The most common ones were dif-
`ferent versions of articles and programs (including ‘‘temporary’’ ﬁles that became permanent), some of which
`were generated by the owner of the ﬁle system, but some were obtained through the network (in which case, it is
`much harder to discover that they contain similar information). This information gave us a very interesting view
`of the ﬁle system (e.g., similarity between seemingly unrelated directories) that could not have been obtained oth-
`erwise. System administrators can ﬁnd many uses for sif, from saving space to determining whether a version of a
`given program is already stored somewhere to detecting unauthorized copying. We plan to use sif in our work on
`developing general Internet resource discovery tools [BDMS93]. Identifying similar ﬁles (which abound in the
`Internet FTP space) can improve searching facilities by keeping less to search and giving the users less to browse
`through. Sif can be used as part of a global compression scheme to group similar ﬁles together before they are
`compressed. Yet another important application is in ﬁle synchronization for users who keep ﬁles on several
`machines (e.g., work, home, and portable). Sif can also be used by professors to detect cheating in homework
`assignments (although it would be relatively easy to beat it if one wants to put an effort into it), by publishers to
`detect plagiarism, by politicians to detect many copies of essentially the same form letter they receive from consti-
`tuents, and so on.
`Our notion of similarity throughout this paper is completely syntactic. We make no effort to understand the
`contents of the ﬁles. Files containing similar information but using different words will not be considered similar.
`This approach is therefore very different from the approach taken in the information retrieval literature, and can-
`not be applied to discover semantic similarities. In a sense, this paper extends the work on approximate string
`matching (see, for example, our work on agrep [WM92a, WM92b]), except that instead of matching strings to
`large texts, we match parts of large texts to other parts of large texts on a very large scale. Another major differ-
`ence is that we also solve the all-against-all version of the problem.
`A different approach to identifying similarity of source code was taken by Baker [Ba93] who deﬁned two
`source codes to be similar if one can be obtained from the other by changing parameter names. Baker called this
similarity checking parameterized pattern matching and presented several algorithms to identify similar source
`
`2 A sort, which is not a linear-time routine, is required, but we do not expect it to dominate the running time unless we compare more than 1-
`2GB of ﬁles.
`
`

`
`codes. No other differences in the codes were allowed, however. It would be interesting to combine the two
`approaches.
`
`2. Approximate Fingerprints
`The idea of computing checksums to detect equal ﬁles has been used in many contexts. The addition of duplicate
detection to DIALOG was hailed as ‘‘a searcher’s dream come true’’ [Mi90]. The UNIX sum program outputs a
`16-bit checksum and the approximate size of a given ﬁle. This information is commonly used to ensure that ﬁles
`are received undamaged and untouched. A similar notion of ‘‘ﬁngerprinting’’ a ﬁle has been suggested by Rabin
`[Ra81] as a way to protect the ﬁle from unauthorized modiﬁcations. The idea is essentially to use a function that
`maps any size string to a number in a reasonably random way (not unlike hashing), with the use of a secret key.
`Any change to the ﬁle will produce a different ﬁngerprint with high probability. Rabin suggested using 63-bit
`numbers which lead to extremely low probabilities of false positives. (He also designed a special function that has
`provable security properties.)
`But ﬁngerprints and checksums are good only for exact equality testing. Our goal is to identify similar ﬁles.
`We want to be able to detect that two ﬁles are similar even if their similarity covers as little as 25% of their con-
`tent. Of course, we would like the user to be able to indicate how much similarity is sought. The basic idea is to
`use ﬁngerprints on several small parts of the ﬁle and have several ﬁngerprints rather than just one. But we cannot
`use ﬁxed parts of a ﬁle (e.g., the middle 10%), because any insertion or deletion from that ﬁle will make those
`parts completely different. We need to be able to ‘‘synchronize’’ the equal parts in two different ﬁles and to do
that without knowing a priori which files or which parts are involved. We will present techniques that are very
`effective for natural language texts, source codes, and other types of texts that appear in practice.
`We achieve the kind of synchronization described above with the use of what we call anchors. An anchor is
`simply a string of characters, and we will use a ﬁxed set of anchors. The idea is to achieve synchronization by
`extracting from the text strings that start with anchors. If two ﬁles contain an identical piece, and if the piece con-
`tains an anchor, then the string around the anchor is identical in the two ﬁles. For example, suppose that the string
`acte is an anchor. We search the ﬁle for all occurrences of acte. We may ﬁnd the word character in which acte
`appears. We then scan the text for a ﬁxed number of characters, say 50, starting from acte, and compute a check-
`sum of these 50 characters. We call this checksum a ﬁngerprint. The same ﬁngerprint will be generated from all
`ﬁles that contain the same 50 characters, no matter where they are located. Of course, acte may not appear in the
`ﬁle at all, or it may appear only in places that have been modiﬁed, in which case no common ﬁngerprints will be
`found. The trick is to use several anchors and to choose them such that they span the ﬁles in a reasonably uniform
`fashion. We devised two different ways to use anchors. The ﬁrst is by analyzing text from many different ﬁles
`and selecting a ﬁxed set of representative strings, which are quite common but not too common. The string acte is
`an example. Once we have a set of anchors, we scan the ﬁles we want to compare and search for all occurrences
of all anchors. Fortunately, we can do it reasonably quickly using our multiple-pattern matching algorithm
`(which is part of agrep [WM92a]). We will not elaborate too much here on this method of anchor selection,
`because the second method is much simpler.
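A minimal sketch of the first method may make the synchronization idea concrete. It is not sif's implementation: the function name is hypothetical, CRC-32 from Python's zlib stands in for the unspecified checksum, and a plain substring search replaces the multiple-pattern matcher of agrep.

```python
import zlib

WINDOW = 50  # number of characters fingerprinted after each anchor, as in the text

def anchor_fingerprints(text: bytes, anchors, window: int = WINDOW):
    """Return one checksum for every full window that starts at an anchor occurrence."""
    prints = []
    for anchor in anchors:
        start = text.find(anchor)
        while start != -1:
            piece = text[start:start + window]
            if len(piece) == window:           # skip truncated windows at end of file
                prints.append(zlib.crc32(piece))
            start = text.find(anchor, start + 1)
    return prints

# Two hypothetical files that share one passage at different offsets.
shared = b"the character set used by this program is described in the manual page"
file_a = b"Intro. " + shared + b" End."
file_b = b"Some longer and quite different preamble. " + shared + b" More text."

common = set(anchor_fingerprints(file_a, [b"acte"])) & set(anchor_fingerprints(file_b, [b"acte"]))
print(len(common) >= 1)
```

Because the window starts at the anchor rather than at a fixed offset, the shared passage yields the same fingerprint in both files regardless of what precedes it.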
`The second method computes ﬁngerprints of essentially all possible substrings of a certain length and
`chooses a subset of these ﬁngerprints based on their values. Again, since two equal substrings will generate the
`same ﬁngerprints, no matter where they are in the text, we have the synchronization that we need. Note that we
`cannot simply divide the text into groups of 50 bytes and use their ﬁngerprints, because a single insertion at the
`beginning of the ﬁle will shift everything by 1 and cause all groups, and therefore all ﬁngerprints, to be different.
`
`

`
We need to consider all 50-byte substrings, including all overlaps. We now present an efficient method to compute
all these fingerprints. Denote the text string by $t_1 t_2 \ldots t_n$. The fingerprint for the first 50-byte substring will
be

$$F_1 = (t_1 \cdot p^{49} + t_2 \cdot p^{48} + \cdots + t_{50}) \bmod M,$$

where $p$ and $M$ are constants. The best way to evaluate a polynomial given its coefficients is by Horner’s rule:

$$F_1 = (p \cdot (\cdots (p \cdot (p \cdot t_1 + t_2) + t_3) \cdots) + t_{50}) \bmod M.$$

If we now want to compute $F_2$, then we need only to add the last coefficient and remove the first one:

$$F_2 = (p \cdot F_1 + t_{51} - t_1 \cdot p^{50}) \bmod M.$$

We compute a table of all possible values of $(t_i \cdot p^{50}) \bmod M$ for all 256 byte values and use it throughout.
Overall, computing all fingerprints is proportional to the number of characters but not to the size of the fingerprint.
Deciding which fingerprints to select can be done in many ways, the simplest of which is to take those whose
last $k$ bits are equal to 0. Approximately one fingerprint out of $2^k$ characters will be selected. We use a prime number
for $p$, $2^{30}$ for $M$, and $k = 8$. Since all selected fingerprints have their 8 least significant bits equal to 0, their values
should be shifted by 8 before storing them to save space. If the number of files is very large, we may need to use
larger fingerprints (i.e., select $M = 2^{31}$ or $2^{32}$) to minimize the number of equal fingerprints by chance.
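The incremental computation can be sketched as follows. This is an illustration under stated assumptions, not sif's code: the value of P is a hypothetical choice (the text only says that p is prime), and the function names are ours. The table stores the weight that the outgoing byte carries once the previous fingerprint has been multiplied by p, that is, the byte value times p raised to the window length, modulo M.

```python
WINDOW = 50          # substring length from the text
P = 1_000_003        # hypothetical prime; the text does not give the value of p
M = 1 << 30          # modulus 2^30, as in the text
K = 8                # select fingerprints whose K low bits are zero

# DROP[b] = (b * P^WINDOW) mod M: what must be subtracted after multiplying
# the old fingerprint by P to cancel the byte that leaves the window.
P_POW = pow(P, WINDOW, M)
DROP = [(b * P_POW) % M for b in range(256)]

def horner(window: bytes) -> int:
    """Fingerprint of a single window, evaluated directly by Horner's rule."""
    f = 0
    for b in window:
        f = (f * P + b) % M
    return f

def all_fingerprints(text: bytes):
    """Yield (offset, fingerprint) for every WINDOW-byte substring of text."""
    if len(text) < WINDOW:
        return
    f = horner(text[:WINDOW])
    yield 0, f
    for i in range(1, len(text) - WINDOW + 1):
        # Add the incoming byte and remove the outgoing one in constant time.
        f = (P * f + text[i + WINDOW - 1] - DROP[text[i - 1]]) % M
        yield i, f

def selected_fingerprints(text: bytes):
    """Keep fingerprints whose K low bits are zero, stored shifted right by K."""
    mask = (1 << K) - 1
    return [(i, f >> K) for i, f in all_fingerprints(text) if f & mask == 0]
```

Since selection depends only on the value of a substring's fingerprint, inserting a byte at the beginning of a file shifts the offsets of the selected fingerprints but leaves their values unchanged.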
`The second method is easier to use, because the anchors are in a sense universal. They are selected truly at
`random. It relieves the user from the task of adjusting the anchors to the text. With the ﬁrst method, anchors that
`are optimized for Wall Street Journal articles may not be as good for medical articles or computer programs.
`Anchors for one language may not be good for another language. On the other hand, some users may want to
have the ability to fine tune the anchors. For example, with hashing, there is a 1 in $2^k$ chance (1 in 256 in our case) that
`a string of 50 blanks is selected. If it is, the corresponding ﬁngerprint may appear many times in the ﬁle, and it
`will hardly be representative of the ﬁle. The same holds for many other non-representative strings. (We actually
`encountered that problem; a string of 50 underline symbols turned out to be selected.) One can change the hash
`function (e.g., by changing p), but there is very little control over the results. One precaution that we take is for-
`bidding overlaps of ﬁngerprints. In other words, once a ﬁngerprint is identiﬁed, the text is shifted to the end of it.
`This way, if, for example, 50 underline symbols form a ﬁngerprint, and the text contains 70 underline symbols, we
`will not generate 21 duplicate ﬁngerprints.
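The no-overlap precaution can be illustrated in isolation. The sketch below assumes the offsets of the selected fingerprints are already known; the function name is hypothetical.

```python
def drop_overlaps(selected_offsets, window=50):
    """Keep a selected offset only if it starts at or after the end of the last kept window."""
    kept = []
    next_free = 0
    for offset in sorted(selected_offsets):
        if offset >= next_free:
            kept.append(offset)
            next_free = offset + window   # skip to the end of this fingerprint
    return kept

# A run of 70 identical characters contains 21 windows of 50 characters,
# all with the same fingerprint; suppose every one of them is selected.
run_offsets = list(range(70 - 50 + 1))
print(len(run_offsets))             # 21
print(drop_overlaps(run_offsets))   # [0]
```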
Both methods are susceptible to bad fingerprints, even for strings that seem representative. The worst
examples we encountered are fingerprints that are contained in the headers of Postscript files. These headers are
`large, similar, and ubiquitous; they make many unrelated Postscript ﬁles, especially small ones, look very similar.
`The best solution in this case is to identify the Postscript ﬁle when the ﬁle is opened and disregard the headers.
`We discuss handling special ﬁles in Section 6.
Although neither method is perfect, both are good. Having spurious fingerprints is not a major problem
as long as there are enough representative fingerprints. Typically, the probability of the same string of 50 bytes
appearing in two unrelated files is quite low. And since we require several shared fingerprints, we are quite
assured of filtering noise. If a sufficient number of fingerprints are common to two files, then this is good enough
evidence that the two files are similar in some way.
`
`

`
`3. Finding Similar Files to a Given File
`In this mode, a given ﬁle, let’s call it the query ﬁle, is compared to a large set of ﬁles that have already been
`‘‘ﬁngerprinted.’’ The collection of all ﬁngerprints, which we will denote by All_Fingers, is maintained in one
`large ﬁle. With each ﬁngerprint we must associate the ﬁle it came from. We do that by maintaining the names of
`all ﬁles that were ﬁngerprinted, and associating with each ﬁngerprint the number of the corresponding ﬁle. The
`ﬁrst thing we do is generate the set Query_Fingers of all ﬁngerprints for the query ﬁle. We now have to look in
`All_Fingers and compare all ﬁngerprints there to those of Query_Fingers. Searching a set is one of the most basic
data structure problems and there are many ways to handle it; the most common techniques use hashing or tree
`structures. In this case, we can also sort both sets and intersect them. But we found that a simple solution using
`multi-pattern matching was just as effective. We store the ﬁngerprints in All_Fingers as we obtained them without
`providing any other structure, putting one ﬁngerprint together with its ﬁle number per line. Then we use agrep to
`search the ﬁle All_Fingers using Query_Fingers as the set of patterns. The output of the search is the list of all
`common ﬁngerprints to Query_Fingers and All_Fingers. As long as the set All_Fingers is no more than a few
`megabytes, this search is very effective. (We plan to provide other options for very large indexes.)
`Once we have the list of all ﬁles containing ﬁngerprints common to the query ﬁle, we output those that have
more than a given percentage of common fingerprints (with a default of 50%). This fingerprint percentage number
`gives a rough estimate for the similarity of the two ﬁles. More precisely, it gives an indication of how much of the
`query ﬁle is contained in the ﬁle we found. It is interesting to note that this ratio can be greater than 1. That is, the
`number of common ﬁngerprints to All_Fingers and Query_Fingers can actually be higher than the total number of
`ﬁngerprints in Query_Fingers. The reason for that is that some ﬁngerprints may appear more than once, and will
`thus be counted more than once.3
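A small sketch shows how the percentage can exceed 100%. It assumes one plausible counting rule, namely that every stored line of All_Fingers whose fingerprint occurs in the query counts as one hit for its file; the data and the function name are hypothetical, and a hash set replaces the agrep search.

```python
from collections import Counter

def similarity_to_query(all_fingers, query_fingers):
    """all_fingers: iterable of (fingerprint, file_number) pairs, one per stored line.
    query_fingers: list of fingerprints of the query file.
    Returns {file_number: percentage}; repeated matches are counted each time."""
    query_set = set(query_fingers)
    hits = Counter(file_no for fp, file_no in all_fingers if fp in query_set)
    total = len(query_fingers)
    return {file_no: 100.0 * n / total for file_no, n in hits.items()}

# Hypothetical index: file 7 holds two copies of the same text, so each of
# its fingerprints is stored twice; file 9 shares a single fingerprint.
all_fingers = [(101, 7), (102, 7), (103, 7), (101, 7), (102, 7), (103, 7),
               (101, 9), (555, 9)]
query = [101, 102, 103, 104]

result = similarity_to_query(all_fingers, query)
print(result[7])   # 150.0
print(result[9])   # 25.0
```

With the default threshold of 50%, only file 7 would be reported here.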
`We also provide 32-bit checksums for all ﬁles to allow exact comparisons. We compute such checksums
`together with the ﬁngerprints and store them (along with the ﬁle sizes in bytes for extra safety and for more infor-
`mation in the output) with the list of ﬁles. We also compute the checksum for the query ﬁle and determine
`whether some ﬁles are exactly equal to it. This whole process normally takes 2-3 seconds.
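The exact-equality test can be sketched in a few lines. CRC-32 is used here only as a stand-in 32-bit checksum (the text does not say which checksum sif computes), and the catalog layout and function names are hypothetical.

```python
import zlib

def file_signature(data: bytes):
    """Return (32-bit checksum, size in bytes) for the contents of a file."""
    return zlib.crc32(data) & 0xFFFFFFFF, len(data)

def exact_matches(query: bytes, catalog: dict):
    """catalog maps file name -> (checksum, size).
    Return the names whose signature equals that of the query file."""
    signature = file_signature(query)
    return sorted(name for name, s in catalog.items() if s == signature)

catalog = {
    "a.txt": file_signature(b"hello world\n"),
    "b.txt": file_signature(b"hello world!\n"),
    "copy_of_a.txt": file_signature(b"hello world\n"),
}
print(exact_matches(b"hello world\n", catalog))   # ['a.txt', 'copy_of_a.txt']
```

Comparing the size along with the checksum rules out collisions between files of different lengths, which is the extra safety the text mentions.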
`
`4. Comparing All Against All
`Comparing all ﬁles against all other ﬁles is more complicated. It turns out that providing a good way to view the
`output is one of the major difﬁculties. The output is a set of sets, or a hypergraph, with some similarity relation-
`ships among the elements. Hypergraphs are very hard to view (see Harel [Ha88]). We discuss here one approach
`that is an extension of the one-against-all paradigm, and therefore quite intuitive to view. We experimented with
`other more complicated approaches and we mention one of them brieﬂy at the end.
`The input to the problem is now a directory (or several directories), which we will traverse recursively
`checking all the ﬁles below it, and a threshold number T indicating how similar we want the ﬁles to be. The ﬁrst
`stage is identical to the preprocessing described in the previous section. All ﬁles are scanned, all ﬁngerprints are
`recorded in the ﬁngerprint ﬁle All_Fingers, and the names, checksums, and sizes of all ﬁles are recorded in
`another ﬁle. It will be useful in this stage to separate ﬁles according to some types (e.g., text, binary, compressed),
`
`3 In fact, this happened to us the very ﬁrst time we tested sif. The query ﬁle was an edited version of a call for papers and it matched with 160%
`"similarity" another ﬁle that happened to contain (unintentionally) two slightly different versions of the same call for papers (saved at different
`times from two mail messages into the same ﬁle). Of course, this kind of information is also very useful.
`
`

`
`using a program such as Essence [HS93], such that ﬁles of different types will not be compared. In the current
`implementation we use only two types, text ﬁles and non-text ﬁles.
`Given All_Fingers, we now sort all ﬁngerprints and collate, for each ﬁngerprint associated with more than
`one ﬁle, the list of all ﬁle numbers sharing it. The value of the ﬁngerprint itself is not important any more and it is
`discarded. An example is shown in Figure 1. Removing the ﬁngerprints that appear in no more than one ﬁle
`reduces the size of All_Fingers signiﬁcantly (for my ﬁle system, it was by a factor of 30), so it is much easier to
`work with. We now have a list of sets and we sort it again, lexicographically, and for sets that appear more than
`once (meaning the same ﬁles share more than one ﬁngerprint) we replace the many copies by one copy and a
counter. Figure 2 shows the output of this step, which we denote by Sorted_Fingers. Sets with large counters
(that is, sets that share a significant number of fingerprints) should definitely be part of the output, but how exactly
`
`10 1174 129 196 647
`10 129 196 647
`12 213 30 3023 40 4207 44 649 733 942 962
`12 213 30 3023 40 4207 44 649 733 942 962
`10 129 196 647
`11 212 3021 4203 648 732 76 942 961
`12 213 30 3023 40 4207 44 649 733 942 962
`11 212 3021 4203 648 732 942 961
`10 129 196 647
`11 212 3021 4203 648 732 942 961
`12 213 30 3023 40 4207 44 649 733 77 942 962
`11 212 3021 4203 648 732 76 942 961
`
`Figure 1: Typical (partial) output of sets of ﬁle numbers that share common ﬁngerprints.
`
`1 10 1174 129 196 647
`3 10 129 196 647
`3 12 213 30 3023 40 4207 44 649 733 942 962
`2 11 212 3021 4203 648 732 76 942 961
`2 11 212 3021 4203 648 732 942 961
`1 12 213 30 3023 40 4207 44 649 733 77 942 962
`
Figure 2: Lines of file numbers that share common fingerprints after sorting and collating.
The first number on each line is the counter (the number of fingerprints shared by this group of file numbers).
`
`

`
`to organize the output turns out to be a very difﬁcult problem.
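The sort-and-collate step that turns All_Fingers into the lines of Figure 2 can be sketched as follows. The sketch groups with dictionaries instead of sorting a file on disk, so it shows the result of the step rather than how sif performs it; the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def collate(all_fingers):
    """all_fingers: iterable of (fingerprint, file_number) pairs.
    Returns a list of (file_group, counter) pairs, where counter is the number
    of fingerprints shared by exactly that group of files."""
    files_by_print = defaultdict(set)
    for fp, file_no in all_fingers:
        files_by_print[fp].add(file_no)
    # Drop fingerprints that occur in a single file; from here on the
    # fingerprint value itself is no longer needed.
    groups = Counter(
        tuple(sorted(files, key=str))        # file numbers ordered as strings, as in Figure 1
        for files in files_by_print.values()
        if len(files) > 1
    )
    return sorted(groups.items(), key=lambda item: [str(n) for n in item[0]])

sample = [(1, 10), (1, 129), (2, 10), (2, 129),
          (3, 10), (3, 129), (3, 1174), (4, 55)]
for group, counter in collate(sample):
    print(counter, *group)
# 1 10 1174 129
# 2 10 129
```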
`We are dealing with sets of sets and as a result many complicated scenarios are possible. The similarity
`relation as we deﬁned it is not transitive. It is possible that ﬁle A is similar to ﬁle B, which is similar to ﬁle C, but
`A and C have no similarity. You can also have A similar to B, B similar to C, and A similar to C, but the three of
`them together, A, B, and C, share no common ﬁngerprints (recall that we allow similarity to correspond to as little
`as 25% of the ﬁles). Or, A, B, C, and D can share 20 ﬁngerprints in common (which is signiﬁcant), A and C share
`40 ﬁngerprints, B and D share 65 ﬁngerprints, A, B, and D share 38 ﬁngerprints, and so on. Does the user really
`want to see all combinations?
In the first version of sif, each file f that appears somewhere in Sorted_Fingers is considered separately.
All sets (lines in Figure 2) containing f are collected. (This is done while the sets are constructed by associating
the corresponding set numbers with each file.) Denote these sets by $S_1, S_2, \ldots, S_k$ and their counters by
$c_1, c_2, \ldots, c_k$. For each other file that appears in any of the $S_i$’s, the sum of the corresponding $c_i$’s is computed. If
that sum, as a percentage of the total number of fingerprints that were found for f, is more than the threshold,
then the file is considered similar to f. The output consists of all files similar to f, the similarity for each file,
and the size of each file. It is easy at this point to skip files whose sizes differ substantially from that of f (if that
`is what the user wants to see), or ﬁles that fall under some other rules speciﬁed by the user; for example, only ﬁles
`with the same sufﬁx may be considered similar. One way to reduce the size of this output without losing
`signiﬁcant information is to eliminate duplicate sets. If, for example, 7 ﬁles are similar, the list of these ﬁles will
`appear 7 times, one for each of them. The similarity percentages may be different in each instance because they
`are computed as percentages, but the differences are usually minor. So, to reduce the output, any set of ﬁles is
`output no more than once. An example of a partial output is shown in Figure 3.
`
`The following groups of files are similar. Minimum similarity = 25%
`
`R100 /u1/udi/xyz/abc/foo.c 10763
`79 /u1/udi/qwe/ewq/bar.c 10979
`75 /u1/udi/uuu/xxx/foobar.c 9560
`
`R indicates the ﬁle with which all others are compared. The ﬁrst number is the percentage of
`similarity with the R ﬁle. The numbers at the end indicate the ﬁle sizes. (Except for the ﬁle
`names, this is an actual output.)
`
`Figure 3: An example of a group of similar ﬁles as output by sif.
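The per-file grouping of the first version can be sketched using the first two lines of Figure 2. The reference file is called f here; the total number of fingerprints found for it (8) is a made-up value for illustration, since the figures do not report it, and the function name is hypothetical.

```python
from collections import defaultdict

def similar_to(f, sorted_fingers, total_prints, threshold_pct):
    """sorted_fingers: list of (counter, set_of_file_numbers), as in Figure 2.
    total_prints: number of fingerprints found for file f.
    Returns {other_file: similarity percentage} for files above the threshold."""
    shared = defaultdict(int)
    for counter, files in sorted_fingers:
        if f in files:
            for other in files:
                if other != f:
                    shared[other] += counter   # sum the counters of all sets containing both
    percentages = {other: 100.0 * n / total_prints for other, n in shared.items()}
    return {other: pct for other, pct in percentages.items() if pct > threshold_pct}

sorted_fingers = [
    (1, {10, 1174, 129, 196, 647}),
    (3, {10, 129, 196, 647}),
    (3, {12, 213, 30, 3023, 40, 4207, 44, 649, 733, 942, 962}),
]
print(similar_to(10, sorted_fingers, total_prints=8, threshold_pct=25))
```

Files 129, 196, and 647 each share 4 of the 8 fingerprints with file 10 (50%), while file 1174 shares only 1 (12.5%) and is filtered out.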
`
`The second version of sif will include as an option a list of interesting similar ﬁles, where a set is considered
interesting if all members of the set have a sufficient number of common fingerprints and the set is not a subset of a
`larger interesting set. The problem of generating interesting sets turns out to be NP-complete, but we devised an
`algorithm that we hope will work reasonably well in practice.
`
`

`
`5. Experience and Performance
`As expected, sif performs very well on random tests. For example, we took a ﬁle containing a C program of size
`30K, and made 300 random substitutions (with repetition), each of size 50, thus changing about 50% of the ﬁle
`(but leaving signiﬁcant chunks unchanged). We then ran sif (in the one-against-all mode) setting the threshold at a
`very low 5%. We ran this experiment 50 times. Each time sif found the right ﬁle (among 4000 other ﬁles of about
`60MB) and only it. The similarity that sif reported ranged from 37% to 62%, averaging 52%. The average run-
`ning time for one test (user + system time) was 3.1 seconds (not counting, of course, the time it took originally to
`build the index). (All experiments were run on a DEC 5000/240 workstation running Ultrix.)
The real question, however, is the performance on real data. We tried sif on several file systems. The running
time for computing all fingerprints was 3-6 seconds per MB, which is on the order of 500MB to 1GB per hour.
It takes longer if there are many files and directories and they are small. The sorting of the fingerprints takes from
a third to a half of this time (sorting is not a linear-time algorithm, so it takes longer for a large number of
fingerprints; we tried up to 200MB, which generated 800,000 fingerprints). The rest of the algorithm takes much less
time.
`
`The ﬁrst run was on a ﬁle system with 2750 (uncompressed) text ﬁles of about 40MB. It took 127.4 seconds
`(user + system time) to generate all ﬁngerprints (and determine that 800 other ﬁles are non-text ﬁles), 74.3 seconds
`to sort the ﬁngerprints, and 14.6 seconds to perform the rest of the computation (only 1 second of which depends
`on the similarity parameter, so changing it will take just one more second literally). Another test was performed
on a large collection of ‘‘Frequently Asked Questions’’ (FAQ) files extracted automatically from many newsgroups.
For example, two FAQs that were archived under different names (misc.quotes and
misc.quotations.sources) turned out to be almost the same.
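The timing figures reported for the first run can be cross-checked against the quoted throughput with some simple arithmetic. The numbers are taken from the text; measuring the sort against the total of the three stages is our own reading of the measurements.

```python
def mb_per_hour(seconds_per_mb):
    """Convert a per-megabyte running time into hourly throughput."""
    return 3600 / seconds_per_mb

print(mb_per_hour(6), mb_per_hour(3))      # 600.0 1200.0

# First run: 40MB of text files.
fingerprint_s, sort_s, rest_s = 127.4, 74.3, 14.6
size_mb = 40

print(round(fingerprint_s / size_mb, 1))   # 3.2 seconds per MB for fingerprinting
total_s = fingerprint_s + sort_s + rest_s
print(round(100 * sort_s / total_s))       # 34 (percent of total time spent sorting)
```

The first run therefore sits at the fast end of the 3-6 seconds per MB range, with sorting taking about a third of the total time.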
`The next collection of ﬁles presented us with almost the worst case. We obtained, through the Alex system
`[Ca92], a large collection of 21249 README ﬁles taken from thousands of ftp sites across the Internet. Their
`total size was 73MB, which puts the average size at a very low 3K. We found 3620 groups of equal ﬁles and 2810
`other groups of similar ﬁles (the similarity threshold was set at 50%). In many cases it would be impossible to tell
`that the ﬁles are similar by looking only at their names (e.g., draco.ccs.yorku.ca:pub/doc/tmp/read.me and
`ee.utah.edu:rfc/READ.ME are very similar but not the same). The most challenging experiment was the whole
X11R5 distribution (380MB, ~200MB of which were 21288 ASCII files). There were 657 groups of equal files,
`3445 groups of similar ﬁles with 25% similarity threshold, and 2915 groups with 50% threshold. For most of the
`experiments described here (with the exception of the FAQ ﬁles), we used the ﬁrst method with a list of very fre-
`quent anchors to allow high precision. On the average, one ﬁngerprint was generated for about every 200 bytes of
`text.
`
`The space required to hold the ﬁngerprints for the one-against-all tool is currently about 5% of the total
`space. By a better encoding of the ﬁngerprints this ﬁgure will be reduced to about 2%.
`
`6. Future Work
`We are extending sif in four areas. The ﬁrst is adapting the ﬁngerprint generation to ﬁle types. We already men-
`tioned Postscript ﬁles, for which the headers should be excluded from generating ﬁngerprints. This is relatively
`easy to do because Postscript ﬁles and their headers can be easily recognized. Another example is removing for-
`matting statements from ﬁles (e.g., troff or TeX ﬁles). Other types present more difﬁcult problems. The two most
`notable types we would like to handle are executables and compressed ﬁles. Both types are very sensitive to
`
`

`
`change. Adding one statement to a program can change all addresses in the executable. In compressed ﬁles that
`translate strings to dictionary indices (e.g., Lempel-Ziv compression) one change in the dictionary can change all
`indices. The challenge is to ﬁnd invariants and generate ﬁngerprints accordingly (e.g., ignoring addresses alto-
`gether, exploring the relationships between dictionary indices).
`The second area is allowing different treatments of small and large ﬁles. Currently, we treat all ﬁles
`equally. A noble idea, but sometimes not effective. We ﬁgured that at least 5-10 shared ﬁngerprints are needed
for strong evidence of similarity (the exact number depends on the file type). If we seek 50% similarity, then
`each ﬁle needs at least 10-20 ﬁngerprints. In the current setting (which can be easily changed), sif generates 3-4
`ﬁngerprints per 1K, which makes sif only marginally effective for ﬁles of less than 5K. On the other hand, a ﬁle
`of 1MB can generate 4,000 ﬁngerprints. If we adjust the number of ﬁngerprints to the size of the ﬁle, we may lose
`the ability to determine whether a small ﬁle is contained in a large ﬁle, but this is not always needed. Adjusting
`the number of ﬁngerprints is easy with both methods of anchor selection (by decreasing the number of anchors
`with the ﬁrst method or increasing the value of k with the second method).
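The size limits quoted above follow from simple arithmetic, sketched below. The helper name is hypothetical; the densities are the 3-4 fingerprints per 1K given in the text, and the link to k assumes the second method, where about one fingerprint is selected per 2^k bytes.

```python
def min_file_size_kb(shared_needed, similarity, prints_per_kb):
    """Smallest file size (in KB) that yields enough fingerprints for
    `shared_needed` of them to be shared at the given similarity fraction."""
    return shared_needed / similarity / prints_per_kb

# Best case: 5 shared fingerprints needed, 4 fingerprints per KB.
print(min_file_size_kb(5, 0.5, 4))               # 2.5
# Worst case: 10 shared fingerprints needed, 3 fingerprints per KB.
print(round(min_file_size_kb(10, 0.5, 3), 1))    # 6.7

# With the second method and k = 8, about one fingerprint per 2^8 bytes.
k = 8
prints_per_kb = 1024 / 2 ** k
print(prints_per_kb)                             # 4.0
print(int(1024 * prints_per_kb))                 # 4096 fingerprints for a 1MB file

# Raising k by one halves the density.
print(1024 / 2 ** (k + 1))                       # 2.0
```

The 2.5K to 6.7K range brackets the 5K figure given in the text, and the 1MB estimate matches the quoted 4,000 fingerprints.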
`The third area is providing convenient facilities for comparing two directories. Current tools for comparing
directories (such as the System V dircmp program) rely on file names and checksums. For files that do not match
exactly, dircmp will only list them as not being equal. Comparing two directories based on content can be done
`with essentially the same tools we already have, but we need to allow the use of the ﬁlenames in the similarity
`measure.
`The fourth area is customizing the output generation. The most difﬁcult problem is how to provide users
`with ﬂexible means to extract only the similarity they seek. We could attach a large number of options to sif (a
`popular UNIX solution) and provide hooks to external ﬁlter routines (such as ones using Essence [HS93]). We
`would like to have something more general. The problem of ﬁnding interesting similar ﬁles (as deﬁned at the end
`of Section 4) is very intriguing from a practical and also a theoretical point of view.
`
`Acknowledgements
`We thank Vincent Cate for supplying us with the README ﬁles collected by Alex. Richard Schroeppel sug-
`gested the use of hashing to select anchors. Gregg Townsend wrote the ﬁlter that collects the FAQ ﬁles.
`
`References
[Ba93]
Baker, B. S., ‘‘A theory of parameterized pattern matching: Algorithms and applications,’’ 25th Annual
ACM Symposium on Theory of Computing, San Diego, CA (May 1993), pp. 71-80.
[BDMS93]
Bowman, C. M., P. B. Danzig, U. Manber, and M. F. Schwartz, ‘‘Scalable Internet Resource Discovery:
Research Problems and Approaches,’’ University of Colorado Technical Report CU-CS-679-93 (October
1993), submitted for publication.
[Ca92]
Cate, V., ‘‘Alex - a global filesystem,’’ Proceedings of the Usenix File Systems Workshop (May 1992),
pp. 1-11.
`
`

`
[Ha88]
Harel, D., ‘‘On Visual Formalisms,’’ Communications of the ACM, 31 (May 1988), pp. 514-530.
[HS93]
Hardy, D. R., and M. F. Schwartz, ‘‘Essence: A resource discovery system based on semantic file indexing,’’
USENIX Winter 1993 Technical Conference, San Diego (January 1993), pp. 361-374.
[Mi90]
Miller, C., ‘‘Detecting duplicates: a searcher’s dream come true,’’ Online, 14, 4 (July 1990), pp. 27-34.
[Ra81]
Rabin, M. O., ‘‘Fingerprinting by Random Polynomials,’’ Center for Research in Computing Technology,
Harvard University, Report TR-15-81, 1981.
[WM92a]
Wu, S., and U. Manber, ‘‘Agrep - A Fast Approximate Pattern-Matching Tool,’’ Usenix Winter 1992
Technical Conference, San Francisco (January 1992), pp. 153-162.
[WM92b]
Wu, S., and U. Manber, ‘‘Fast Text Searching Allowing Errors,’’ Communications of the ACM, 35 (October
1992), pp. 83-91.