`
`DROPB OX EX. 1020
`
`
`
`
`
`Computing Science Technical Report #41
`
`An Algorithm for Differential File Comparison
`
`J. W. Hunt and M. D.McI/roy
`
`1976
`
`Dropbox Ex. 1020
`
`
`
`An Algorithm for Differential File Comparison
`
`1. W. Hunt
`Department of Electrical Engineering,
`Stanford University, Stanford, California
`
`M. D. Mcilroy
`Bell Laboratories,
`Murray Hill, New Jersey 07974
`
`,II
`
`-,
`
`ABSTRACT
`
`The program diff reports differences
`as a
`two files, expressed
`between
`file into agreement with the other.
`minimal
`list of line changes
`to bring either
`Diff has been engineered
`to make efficient use of time and space on typical
`inputs
`that arise in getting version-to-version
`changes
`in computer-maintained
`or computer-generated
`documents.
`Time and space usage are observed to vary
`about as the sum of the file lengths on real data, although they are known to
`vary as the product of the file lengths in the worst case.
`algorithm of diff solves
`common subsequence
`The central
`the 'longest
`problem'
`to find the lines that do not change between files. Practical efficiency
`is gained by attending only to certain critical
`'candidate' matches between the
`files, the breaking of which would shorten the longest subsequence
`common to
`some pair of initial segments of the two files. Various techniques of hashing,
`presorting
`into equivalence
`classes, merging by binary search,
`and dynamic
`storage allocation are used to obtain good performance.
`
`Dropbox Ex. 1020
`
`
`
`An Algorithm for Differential File Comparison
`
`J. W. Hunt
`of Electrical Engineering,
`Department
`Stanford University, Stanford, California
`
`M. D. McIlroy
`Bell Laboratories,
`Murray Hill, New Jersey 07974
`
`lines of one file have to be changed to bring it into
`a list of what
`The program d(ffcreates
`It is based on ideas from several sourcesll.z.r.Bl.
`agreement with a second file or vice versa.
`As an example of its work, consider
`the two files, listed horizontally for brevity:
`abcdefg
`w a b x
`e
`z
`y
`It is easy to see that
`the first file can be made into the second by the following prescription,
`which an imaginary line 0 is understood at the beginning of each:
`append after line 0: w,
`change lines 3 through 4 from:
`c d
`to: x y z,
`delete lines 6 through 7, which were:
`f g.
`Going the other way, the first file can be made from the second this way:
`delete line 1, which was: w,
`change lines 4 through 6 from: x y z
`to: c d,
`
`in
`
`append after line 7:
`f g.
`them by 1-
`It indicates
`available to diff.
`Delete, change
`and append are the only operations
`of the qed text editorl I] in a form from which both directions
`letter abbreviations
`reminiscent
`of change
`can be read off. By exchanging
`'a'
`for 'd' and line numbers of the first file with
`those of a second, we get a recipe for going the other way.
`In these recipes lines of the origi-
`nal file are flagged with '<', lines of the derived file are flagged with ">':
`o all
`1,1 d 0
`,>w
`<w
`3 4 c 4 6
`4,6 c 3,4
`,
`,
`<c
`<x
`<y<z
`<d
`>x
`>c
`>y
`>z
`>d
`6,7 d 7
`<f
`
`>g
`
`Dropbox Ex. 1020
`
`Page 1
`
`
`
`-2-
`
`the goal of d(ffis
`the minimum number of line changes
`to report
`terms,
`In mathematical
`necessary to convert one file into the other. Equivalently,
`the goal is to maximize the number
`of lines left unchanged, or to find the longest common subsequence of lines that occurs in both
`files.
`
`1. Solving the longest common subsequence problem
`problem is known.
`No uniformly good way of solving the longest common subsequence
`then search forward
`The simplest
`idea-go
`through both files line by line until
`they disagree,
`somehow in both until a matching
`pair of
`lines is encountered,
`and continue
`similarly-
`in
`reduces the problem to implementing
`the 'somehow', which doesn't help much. However,
`practical
`terms,
`the first step of stripping matching lines from the beginning (and end)
`is help-
`ful, for when changes are not
`too pervasive stripping can make inroads
`into the (nonlinear)
`running time of the hard part of the problem.
`An extremely simple heuristic for the 'somehow', which works well when there are rela-
`tively few differences between files and relatively few duplications of lines within one file, has
`been used by Johnson and othersll.l Il: Upon encountering
`a difference, compare the kth line
`ahead in each file with the k lines following the mismatch in the other
`for k =1,2,... until a
`match is found. On more difficult problems,
`the method can missynchronize
`badly. To keep a
`lid on time and space, k is customarily limited, with the result
`that
`longer changed passages .
`defeat .resynchronization,
`for the longest common subsequence
`There is a simple dynamic programming scheme
`file Ai' i = 1,...,m and the
`probteml-l.Sl.
`lines of
`the
`second
`Call
`the
`lines of
`the
`first
`B.i' j = l .....n. Let Pij be the length of the longest. subsequence
`i lines of
`common to the first
`the first file and the' first j lines of the second. Evidently Po satisfies
`Pio = 0
`i =O, m,
`j =O,
`.n,
`POj = 0
`if Ai =Bj
`r.. =
`A --<'B
`u
`I
`max
`ir-
`i.j-I
`i-I.j'
`j
`Pi' array
`Then Pmn is the length of the desired longest common subsequence.
`From thewhole
`that was generated in calculating Pmn,
`it is easy to recover
`the indices of'the
`elements of a
`longest common subsequence.
`in
`in time, and-even
`Unfortunately
`the dynamic program is Otmn)
`worse-O(mn)
`space. Noting that. each row Pi of the difference equation is simply determined
`from Pi -I' D.
`S. Hirschberg invented a clever scheme that first calculates P in Otn)
`space and then recov-
`m
`.
`ers the sequence using no more space and about as much time again as is needed to find Pm [6].
`The diff algorithm improves on the simple dynamic program by attending only to essen-
`'k-
`the breaking of which would change P. The essential matches,
`dubbed
`tial matches,
`candidates' by Hirschberg]"], occur where Ai =B. and p..> max (P'_I
`.•P, '-1)' A k-candidate
`(i,j) such that (I) Ai =Bj, ch a lori~est comm~n '~ub~equence of length k
`is ~ pair of indices
`exists between the first
`of the first file and the first jelements
`of the second
`and
`ielements
`of length k exists when either i or j
`is reduced. A candidat~ is a
`(3) no common subsequence
`is a k-candidate for some k. Evidently a longest common subsequence
`can
`pair of indices that
`be found among a complete list of.candidates,
`then h >h. For if h = h,
`If (il,h ) .and (i2,h!. with il < i2 are b~t? k-candi~ates,
`..
`and If h <h then the common subse-
`(/2,h) would violate ~ondlhon (3) of the definition;
`quence of length k ending with (il,h ) could be extended to a common subsequence
`of
`k+ 1ending with (i2,h).
`
`1~i~m,
`
`1Q~n.
`
`11 + Pi-I.j-I
`
`(P
`
`P
`
`)'f
`
`Dropbox Ex. 1020
`
`Page 2
`
`
`
`-3-
`
`In Figure 1 dots mark grid
`interpretation.
`The candidate methods have a simple graphical
`(i,j) for which Ai = Ri. Because the dots portray an equivalence relation, any two hor-
`points
`izontal
`lines or any two vertical
`lines on the figure have either no dots in common or carry ex-
`actly the same dots. A common subsequence
`is a set of dots that can be threaded by a strictly
`monotone
`increasing curve. Four such curves have been drawn in .the figure. These particular
`curves have been chosen to thread only (and all) dots that are candidates. The values of k for
`are indicated by transecting curves of constant k. These latter curves, shown
`these candidates
`is obviously less than mn,
`dashed, must all decrease monotonically.
`The number of candidates
`except
`in trivial cases, and in practical
`file comparison turns out
`to be very much less, so the
`list of candidates usually can be stored quite comfortably.
`
`c
`
`a"b
`a
`"
`
`b c
`
`b
`
`c
`
`a
`
`b
`
`b
`
`a
`
`a
`
`6
`
`5 .--+--+--+-- .....
`
`4
`
`2
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`in comparing
`
`and candidates
`
`Figure I. Common subsequences
`abcabba
`c b a b a c
`2. The method of diff
`The dots of Figure 1 are stored in linear space as follows:
`These lists occu-
`(1) Construct
`lists of the equivalence classes of elements
`in thesecondfile.
`py Oin)
`space. They can be made by sorting the lines of the. second file.
`(2) Associate the appropriate equivalence
`class with each element of the first file. This asso-
`ciation can be stored in 0 im)
`space.
`In effect now we have a list of the dots for .each
`vertical.
`K be a vector
`Having this setup, we proceed to generate the candidatesleft-to-right.Let
`yet seen for eachk .. To simplify what follows, pad the
`designating the rightmostk-candidate
`that-do notyet
`havea candidate,
`vector outtoinclude
`a dummyO-candidate
`(O,O)and, for allk
`1,n + 1), whose. components will compare. high against
`the
`a dummy'fence'candidate(m+
`cornPonents of any other candidate. KbeginsemptY,exceptforpadding,
`and gets updated as
`we move right.i.Thus
`after processing the 4th "ertical,marked
`'a'
`in Figure 1, the list ofright-
`most candidates
`is
`.:.
`. (8,7)
`(4,3)(4,5)
`(0,0)
`(3,1)
`.•·between the·
`is ••the lowest dot.tha(fal.lsproperly
`the next vertical
`new··k-candidateon
`previous .•.(k '-1 )-and .k-.candidates.Two.
`ordinates
`such dots are on the 5th. vertical
`in
`displace the 2-candidate and 3-candidate entries to give
`new vectorK:
`
`Dropbox Ex. 1020
`
`Page 3
`
`
`
`-4-
`
`...
`(8,7)
`(5,4)
`(O,O),(3,l)~S,2)
`in this list and so are
`than between, ordinates
`The two dots on the 6th vertical fall on, rather
`not candidates. Each new k-candidate is chained to the .previous
`(k -1)-candidate
`to facilitate
`later recovery of the longest common subsequence.
`(For more detail seethe Appendix.)
`The determination
`of candidates on a given vertical
`is thusa
`specialized merge of the list
`of dots on that vertical
`into the current
`list of rightmost candidates. When the number of dots
`is 0(1), binary search in the list of at most min im.n) candidates will do the merge in time
`O(Jogm).
`Since this case of very few dots per vertical
`is typical
`in practice, we are led to
`merge each dot separately by binary search, even though the worst case time to process a verti-
`as against O(m+n)
`for ordinary merging.
`cal becomes O(nlogm),
`
`3. Hashing
`of lines) possible in random
`large files (thousands
`To make comparison of reasonably
`access memory, diff hashes each line into one computer word. This may cause some unequal
`lines to. compare
`equal.
`•Assuming the hash function is truly random,
`the. probability of a
`is 1/M, where
`spurious equality on a given comparison that should have turned out unequal
`the hash values range from 1 to M. A longest common subsequence
`of length k determined
`from hash values can thus be expected to contain about k/ M spurious matches when k« M,
`so a sequence of length k will be a spurious 'jackpot' sequence with probability about k/ M. On
`our I6-bit machine jackpots on SOOO-linefiles should happen less than 10% of the time and on
`SOO-linefiles less than 1% of the time.
`Diff guards against
`in
`jackpots by checking the purported longest common subsequence
`the original
`files. What
`remains
`after spurious
`equalities ..are edited out
`is accepted as an
`answer even though there is a small possibility that
`it is not actually a longest common subse-
`quence. Diff announces
`jackpots,
`so these cases tend to get scrutinized
`fairly hard.
`In two
`years we have had brought
`to our attention only one jackpot where an edited longest subse-
`quence was actually short-in
`that
`instance short by one.
`
`4. Complexity
`the diff algorithm doesn't perform substantially
`In the worst case,
`than the trivial
`better
`dynamic program. From Section 2 it follows that
`the worst case time complexity is dominated
`by the merging and is in fact O(mnlogm)
`(although Oim (m+n»
`could be achieved). Worst
`case space complexity
`is dominated
`by the space required
`for
`the candidate
`list, which is
`o imn) as can be seen by counting the candidates
`that arise in comparing the two files
`abc
`abc
`a b-' c
`..
`a c. b ac
`b a c b ..
`This problem is illustrated in Figure 2. When m =n the kite-shaped area in which the candi-
`dates lie is Il2 the total area of the diagram, and (asymptotically) 1/3 of the grid points in the
`the number of candidates approaches n2J6 asymptotically."
`kite are candidates,so
`In practice,d(tfworksmuch
`worst
`bounds would
`rarely are more than min (m,n)
`In fact
`with
`provided space
`
`Only
`naive storage
`.after two
`Thus
`
`months
`have good evidellce
`space.
`
`Dropbox Ex. 1020
`
`Page 4
`
`
`
`- 5 -
`
`J
`V J
`V J
`I /
`V
`I J
`V /
`V J
`/
`V
`I /
`I /
`J
`J
`V J
`V
`/
`/
`V
`V
`V J
`/
`/
`r
`'&v
`/
`/
`/
`I£,v
`V
`V
`/
`/
`If I V /
`/
`Ib /
`
`/
`
`I
`/
`
`V-
`
`/
`
`~
`
`/"
`
`~
`
`(
`
`and candidates in comparing
`
`Figure 2. Common subsequences
`abcabcabc
`.
`acbacbacb
`.
`the central algorithm of diff
`is so fast that even in the
`As for practical
`time complexity,
`biggest cases our implementation
`can handle {about 3500 lines} almost half the run time is still
`absorbed by simple character handling for hashing,
`jackpot checking, etc., that
`is linear
`in the
`total number of characters
`in the two tiles. Typical
`times for comparing 3500-line tiles range
`from 114 to 3/4 cpu minutes on a PDP11145. By contrast, a speeded-up variant of Hirschberg's
`dynamic programming atgorithmlel
`took about 5 cpu minutes on 3500-line tiles. The heuristic
`algorithm sketched at the beginning of Section 1 typically runs about 2 or 3 times as fast as d(fJ
`on long but
`trivially different
`tiles, but
`loses much of that advantage on more difficult cases
`that arewithin
`the competence of both methods. Since the failure modes of the two programs
`8M qune different,
`it is useful
`to have both on hang.
`
`Dropbox Ex. 1020
`
`Page 5
`
`
`
`-6;..
`
`... References
`
`Program,' Bell Laboratories
`
`internal
`
`[l]
`
`'[3J
`[4J
`
`'ALTER -,A .ComdeckCornparing
`S. C. Johnson,
`memorandum 1971.
`[2J Generalizing from a special case solved by 'L'G'Szyrnanskild], H. S. Stone proposed and 1.
`W. Hunt
`refined and implemented
`the first version of the candidate-listing
`algorithm
`used by diff, and embedded it in an older framework due to M. D. McIlroy. A variant of
`this algorithm was also elaborated by Szymanskill.Ol. We have had many useful discus-
`sions with A. V. Aho and 1. D Ullman. M. E. Lesk moved the program from UNIX to
`05/360.
`to QED Text Editor,' Murray Hill Computing Center MHCC-002.
`Introduction
`'Tutorial
`S. B. Needleman
`and C. D. Wunsch,
`'A General Method Applicable
`to the Search for
`Similarities in the Amino Acid Sequence,'J Mol BioI 48 (1970) 443-53.
`[5J D. Sankoff,
`'Matching Sequences Under Deletion/InsertionConstraints,
`USA 69 (1972) 4-6.
`[6J D. S. Hirschberg,
`'A Linear Space Algorithm for Computing Maximal Common Subse-
`quences,' CACM 18 (1975) 341-3.
`[7J D. S. Hirschberg,
`'The Longest Common Subsequence Problem,' Doctoral Thesis, Prince-
`ton 1975.
`'A Special Case of the Maximal Common Subsequence Problem,' Com-
`[8J T. G Szymanski,
`puter Science Lab TR-170, Princeton University 1975
`[9J Michael L. Fredrnan..
`'On Computing
`the Length of Longest
`Discrete Math 11 (1975) 29-35.
`[tOJ T. G. Szymanski,
`'A Note on the Maximal Common Subsequence Problem,'submitted
`publication.
`[I I] The programs called proof, written byE. N. Pinson andM. E. Lesk for UNIX and eJECOS
`use the heuristic algorithm for differential. file comparison.
`
`Proc Nat Acad Sci
`
`Increasing Subsequences,'
`
`for
`
`Dropbox Ex. 1020
`
`Page 6
`
`
`
`-7-
`
`Appendix
`
`of lines common to file 1, whose length is m lines,
`
`structured
`
`(serial, hash), where serial is a line number and
`
`A.I Summary of the diff algorithm
`Algorithm to find the longest subsequence
`and file 2, n lines.
`equivalence classes in file 2 and associate them with lines in file 1
`Steps 1 through 4 determine
`in preparation
`for the central algorithm.
`(The diff program that
`is in actual use does the work
`of these steps somewhat differently.)
`Let V be a vector of elements
`1.
`hash is an integer. Set
`j = l,....n,
`VLi] -
`(i.H(j»
`where H(j)
`is the hash value of line j in file 2.
`Sort V into ascending order on hash as primary key and serial as secondary key.
`Let E be a vector of elements
`isertal.last), Then set
`structured
`j = l,....n,
`E(j]
`-
`(V(jJ.serial,f(j»,
`E[O] -
`(O.true),
`
`2.
`3.
`
`4.
`
`where
`
`. _I true
`
`or V[j].hash~ V[j+ 1].hash
`ifj=n
`-
`f(J)
`false otherwise.
`classes of lines in file 2, with last = true on the last element of
`E lists all the equivalence
`each class. The elements are ordered by serial within.classes.
`Let P be a vector of integers. For i = l ,....m set
`. _1 js.uch that Eli-.Il.last = true and H(;) = V[j].hash
`10 If no such j exists,
`
`, P[(l
`where H(i)
`is the hash value of line i of file 1. The j values can be found by binary
`search in V.
`P[i], if nonzero, now points in E to the beginning of the class of lines in file 2 equivalent
`to line i in file 1.
`Steps 5 and 6 are the longest common subsequence
`algorithm proper.
`Let candtdatet.ab.previous)
`constructor, where a and b are line
`5.
`be a reference-valued
`in file 1 and file 2 respectively and previous is oil or a reference to a candidate.
`numbers
`Let K [O:min im.n) + 1] be a vector of references
`to candidates. Let k be the index of the
`last usefully filled element of K. Initialize
`K[O] -
`candidate (O.O.oiO.
`K[Il
`-
`candidateim-r
`l,n+ Lnll):
`k-O.
`K [I ]•is a fence beyond'
`,,'
`the last usefully filled element.
`For, i. = 1~....m, ifP[i]~O do merge (K,k,i.E,P[;) }toupdateK
`6.
`Steps 7 and 8 get a more convenient
`representation
`for the longest common subsequence.
`
`Dropbox Ex. 1020
`
`Page 7
`
`
`
`-8-
`
`7.
`
`8.
`
`referred to by K[k] and linked by previous
`
`Let J be a vector of integers.
`Initialize
`I == O..... m.
`J[/]
`-
`0
`c of the chain of candidates
`For each element
`references
`set
`J [c.a]
`c.b.
`-
`The nonzero elements of J now pick out a longest common subsequence,
`possibly includ-
`ing spurious 'jackpot' coincidences. The pairings between the two files are given by
`I J[/]¢O}.
`{(I.J[/])
`The next step weeds out jackpots.
`For I = 1,....In, if J[i]¢O and line I in file 1 is not equal to line J[i]
`9.
`- O.
`J[i]
`This step requires one synchronized
`
`in file 2, set
`
`pass through both files.
`
`A.2 Storage management
`To maximize capacity, storage is managed in diff per the following notes, which are keyed to
`the steps in the preceding summary. After each step appear
`the number of words then in use,
`except for a small additive constant, assuming that an integer or a pointer occupy one word.
`I.
`Storage for V can be grown as file 2 is read and hashed. The. value of'.» need not be
`[211 words]
`known in advance.
`already in V, it is more compact because the last field
`Though E contains
`information
`only takes one bit, and can be packed into the same word with the serial field. E can be
`overlaid on V.serlal.
`[211 words]
`P can be grown as was V in step I. [211 + In words]
`V is dead after this step. Storage can be compacted to contain only the live information,
`[11+ m words]
`E and P.
`.
`Candidates
`can be allocated as needed from the free storage obtained in the previous
`compaction,
`and from space grown beyond that
`if necessary. Because they are chained,
`candidates do not have to be contiguous.
`During the Ith invocation of merge, the first i elements of P are dead, and at most
`the first
`1+ 2 elements of K are in use, so with suitable equivalencing K can be overlaid on P.
`[11+m+ 3 x (number of candidatesl]
`P and K are dead, so J can be overlaid on them. E is dead also.
`didaies)]
`
`3.
`
`4.
`
`5.
`
`6.
`
`7.
`
`[m+ 3)( (number of can-
`
`A.3 Summary of merge step
`procedure merge (K,k,i,E,p )
`K is as defined in step 5 above, by reference
`k is index of last filled element of K; by reference
`I is current
`index in file 1, by value
`E is as defined in Step 3 above, by reference
`pis
`i:~~:einEoffirsreleTent
`of class of lines in file iequivalent
`
`to lineilof~lel,
`
`Dropbox Ex. 1020
`
`Page 8
`
`
`
`-9-
`
`c will always refer to the last can-
`Let, be an integer and c be a reference to a candidate.
`didate found, which will always be an r-candidate, K[r] will be updated with this refer-
`Initialize
`ence once the previous value of K [,] is no longer needed.
`,-0,
`c r- K[O].
`(By handling the equivalence class in reverse order, Szyrnanskilf.Ol circumvents
`that waste space.)
`to delay updating K [r], but generates extra 'candidates'
`Do steps 3 through 6 repeatedly.
`. Let j = E[p ].seria/.
`3.
`such that K[s]-b <I and K[s+ ll-b~j.
`Search K[,:k]
`for an element K[s]
`(Note that K is ordered on K[.]-b,
`so binary search will work.)
`If such an element
`is found do steps 4 and 5.
`If K[s+ 1]-b »t. simultaneously set
`4.
`K[,]
`-
`c,
`r rr s+I,
`candidate U,.i,K [sJ ).
`c -
`
`the need
`
`5.
`
`6.
`
`If s = k do:
`Simultaneously set
`- K[k+ 1]
`K[k+2]
`k- k+1.
`Break out of step 2's loop.
`If E[p ].last = true, break out of step 2's loop.
`p+ 1.
`Otherwise set p -
`c.
`Set K[r]
`-
`
`(move fence),
`
`1.
`
`2.
`
`7.
`
`Dropbox Ex. 1020
`
`Page 9