throbber
DROPBOX EX. 1020
`
`DROPB OX EX. 1020
`
`
`
`

`
`Computing Science Technical Report #41
`
`An Algorithm for Differential File Comparison
`
`J. W. Hunt and M. D.McI/roy
`
`1976
`
`Dropbox Ex. 1020
`
`

`
`An Algorithm for Differential File Comparison
`
`1. W. Hunt
`Department of Electrical Engineering,
`Stanford University, Stanford, California
`
`M. D. Mcilroy
`Bell Laboratories,
`Murray Hill, New Jersey 07974
`
`,II
`
`-,
`
`ABSTRACT
`
`The program diff reports differences
`as a
`two files, expressed
`between
`file into agreement with the other.
`minimal
`list of line changes
`to bring either
`Diff has been engineered
`to make efficient use of time and space on typical
`inputs
`that arise in getting version-to-version
`changes
`in computer-maintained
`or computer-generated
`documents.
`Time and space usage are observed to vary
`about as the sum of the file lengths on real data, although they are known to
`vary as the product of the file lengths in the worst case.
`algorithm of diff solves
`common subsequence
`The central
`the 'longest
`problem'
`to find the lines that do not change between files. Practical efficiency
`is gained by attending only to certain critical
`'candidate' matches between the
`files, the breaking of which would shorten the longest subsequence
`common to
`some pair of initial segments of the two files. Various techniques of hashing,
`presorting
`into equivalence
`classes, merging by binary search,
`and dynamic
`storage allocation are used to obtain good performance.
`
`Dropbox Ex. 1020
`
`

`
`An Algorithm for Differential File Comparison
`
`J. W. Hunt
`of Electrical Engineering,
`Department
`Stanford University, Stanford, California
`
`M. D. McIlroy
`Bell Laboratories,
`Murray Hill, New Jersey 07974
`
`lines of one file have to be changed to bring it into
`a list of what
`The program d(ffcreates
`It is based on ideas from several sourcesll.z.r.Bl.
`agreement with a second file or vice versa.
`As an example of its work, consider
`the two files, listed horizontally for brevity:
`abcdefg
`w a b x
`e
`z
`y
`It is easy to see that
`the first file can be made into the second by the following prescription,
`which an imaginary line 0 is understood at the beginning of each:
`append after line 0: w,
`change lines 3 through 4 from:
`c d
`to: x y z,
`delete lines 6 through 7, which were:
`f g.
`Going the other way, the first file can be made from the second this way:
`delete line 1, which was: w,
`change lines 4 through 6 from: x y z
`to: c d,
`
`in
`
`append after line 7:
`f g.
`them by 1-
`It indicates
`available to diff.
`Delete, change
`and append are the only operations
`of the qed text editorl I] in a form from which both directions
`letter abbreviations
`reminiscent
`of change
`can be read off. By exchanging
`'a'
`for 'd' and line numbers of the first file with
`those of a second, we get a recipe for going the other way.
`In these recipes lines of the origi-
`nal file are flagged with '<', lines of the derived file are flagged with ">':
`o all
`1,1 d 0
`,>w
`<w
`3 4 c 4 6
`4,6 c 3,4
`,
`,
`<c
`<x
`<y<z
`<d
`>x
`>c
`>y
`>z
`>d
`6,7 d 7
`<f
`
`>g
`
`Dropbox Ex. 1020
`
`Page 1
`
`

`
`-2-
`
`the goal of d(ffis
`the minimum number of line changes
`to report
`terms,
`In mathematical
`necessary to convert one file into the other. Equivalently,
`the goal is to maximize the number
`of lines left unchanged, or to find the longest common subsequence of lines that occurs in both
`files.
`
`1. Solving the longest common subsequence problem
`problem is known.
`No uniformly good way of solving the longest common subsequence
`then search forward
`The simplest
`idea-go
`through both files line by line until
`they disagree,
`somehow in both until a matching
`pair of
`lines is encountered,
`and continue
`similarly-
`in
`reduces the problem to implementing
`the 'somehow', which doesn't help much. However,
`practical
`terms,
`the first step of stripping matching lines from the beginning (and end)
`is help-
`ful, for when changes are not
`too pervasive stripping can make inroads
`into the (nonlinear)
`running time of the hard part of the problem.
`An extremely simple heuristic for the 'somehow', which works well when there are rela-
`tively few differences between files and relatively few duplications of lines within one file, has
`been used by Johnson and othersll.l Il: Upon encountering
`a difference, compare the kth line
`ahead in each file with the k lines following the mismatch in the other
`for k =1,2,... until a
`match is found. On more difficult problems,
`the method can missynchronize
`badly. To keep a
`lid on time and space, k is customarily limited, with the result
`that
`longer changed passages .
`defeat .resynchronization,
`for the longest common subsequence
`There is a simple dynamic programming scheme
`file Ai' i = 1,...,m and the
`probteml-l.Sl.
`lines of
`the
`second
`Call
`the
`lines of
`the
`first
`B.i' j = l .....n. Let Pij be the length of the longest. subsequence
`i lines of
`common to the first
`the first file and the' first j lines of the second. Evidently Po satisfies
`Pio = 0
`i =O, m,
`j =O,
`.n,
`POj = 0
`if Ai =Bj
`r.. =
`A --<'B
`u
`I
`max
`ir-
`i.j-I
`i-I.j'
`j
`Pi' array
`Then Pmn is the length of the desired longest common subsequence.
`From thewhole
`that was generated in calculating Pmn,
`it is easy to recover
`the indices of'the
`elements of a
`longest common subsequence.
`in
`in time, and-even
`Unfortunately
`the dynamic program is Otmn)
`worse-O(mn)
`space. Noting that. each row Pi of the difference equation is simply determined
`from Pi -I' D.
`S. Hirschberg invented a clever scheme that first calculates P in Otn)
`space and then recov-
`m
`.
`ers the sequence using no more space and about as much time again as is needed to find Pm [6].
`The diff algorithm improves on the simple dynamic program by attending only to essen-
`'k-
`the breaking of which would change P. The essential matches,
`dubbed
`tial matches,
`candidates' by Hirschberg]"], occur where Ai =B. and p..> max (P'_I
`.•P, '-1)' A k-candidate
`(i,j) such that (I) Ai =Bj, ch a lori~est comm~n '~ub~equence of length k
`is ~ pair of indices
`exists between the first
`of the first file and the first jelements
`of the second
`and
`ielements
`of length k exists when either i or j
`is reduced. A candidat~ is a
`(3) no common subsequence
`is a k-candidate for some k. Evidently a longest common subsequence
`can
`pair of indices that
`be found among a complete list of.candidates,
`then h >h. For if h = h,
`If (il,h ) .and (i2,h!. with il < i2 are b~t? k-candi~ates,
`..
`and If h <h then the common subse-
`(/2,h) would violate ~ondlhon (3) of the definition;
`quence of length k ending with (il,h ) could be extended to a common subsequence
`of
`k+ 1ending with (i2,h).
`
`1~i~m,
`
`1Q~n.
`
`11 + Pi-I.j-I
`
`(P
`
`P
`
`)'f
`
`Dropbox Ex. 1020
`
`Page 2
`
`

`
`-3-
`
`In Figure 1 dots mark grid
`interpretation.
`The candidate methods have a simple graphical
`(i,j) for which Ai = Ri. Because the dots portray an equivalence relation, any two hor-
`points
`izontal
`lines or any two vertical
`lines on the figure have either no dots in common or carry ex-
`actly the same dots. A common subsequence
`is a set of dots that can be threaded by a strictly
`monotone
`increasing curve. Four such curves have been drawn in .the figure. These particular
`curves have been chosen to thread only (and all) dots that are candidates. The values of k for
`are indicated by transecting curves of constant k. These latter curves, shown
`these candidates
`is obviously less than mn,
`dashed, must all decrease monotonically.
`The number of candidates
`except
`in trivial cases, and in practical
`file comparison turns out
`to be very much less, so the
`list of candidates usually can be stored quite comfortably.
`
`c
`
`a"b
`a
`"
`
`b c
`
`b
`
`c
`
`a
`
`b
`
`b
`
`a
`
`a
`
`6
`
`5 .--+--+--+-- .....
`
`4
`
`2
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`in comparing
`
`and candidates
`
`Figure I. Common subsequences
`abcabba
`c b a b a c
`2. The method of diff
`The dots of Figure 1 are stored in linear space as follows:
`These lists occu-
`(1) Construct
`lists of the equivalence classes of elements
`in thesecondfile.
`py Oin)
`space. They can be made by sorting the lines of the. second file.
`(2) Associate the appropriate equivalence
`class with each element of the first file. This asso-
`ciation can be stored in 0 im)
`space.
`In effect now we have a list of the dots for .each
`vertical.
`K be a vector
`Having this setup, we proceed to generate the candidatesleft-to-right.Let
`yet seen for eachk .. To simplify what follows, pad the
`designating the rightmostk-candidate
`that-do notyet
`havea candidate,
`vector outtoinclude
`a dummyO-candidate
`(O,O)and, for allk
`1,n + 1), whose. components will compare. high against
`the
`a dummy'fence'candidate(m+
`cornPonents of any other candidate. KbeginsemptY,exceptforpadding,
`and gets updated as
`we move right.i.Thus
`after processing the 4th "ertical,marked
`'a'
`in Figure 1, the list ofright-
`most candidates
`is
`.:.
`. (8,7)
`(4,3)(4,5)
`(0,0)
`(3,1)
`.•·between the·
`is ••the lowest dot.tha(fal.lsproperly
`the next vertical
`new··k-candidateon
`previous .•.(k '-1 )-and .k-.candidates.Two.
`ordinates
`such dots are on the 5th. vertical
`in
`displace the 2-candidate and 3-candidate entries to give
`new vectorK:
`
`Dropbox Ex. 1020
`
`Page 3
`
`

`
`-4-
`
`...
`(8,7)
`(5,4)
`(O,O),(3,l)~S,2)
`in this list and so are
`than between, ordinates
`The two dots on the 6th vertical fall on, rather
`not candidates. Each new k-candidate is chained to the .previous
`(k -1)-candidate
`to facilitate
`later recovery of the longest common subsequence.
`(For more detail seethe Appendix.)
`The determination
`of candidates on a given vertical
`is thusa
`specialized merge of the list
`of dots on that vertical
`into the current
`list of rightmost candidates. When the number of dots
`is 0(1), binary search in the list of at most min im.n) candidates will do the merge in time
`O(Jogm).
`Since this case of very few dots per vertical
`is typical
`in practice, we are led to
`merge each dot separately by binary search, even though the worst case time to process a verti-
`as against O(m+n)
`for ordinary merging.
`cal becomes O(nlogm),
`
`3. Hashing
`of lines) possible in random
`large files (thousands
`To make comparison of reasonably
`access memory, diff hashes each line into one computer word. This may cause some unequal
`lines to. compare
`equal.
`•Assuming the hash function is truly random,
`the. probability of a
`is 1/M, where
`spurious equality on a given comparison that should have turned out unequal
`the hash values range from 1 to M. A longest common subsequence
`of length k determined
`from hash values can thus be expected to contain about k/ M spurious matches when k« M,
`so a sequence of length k will be a spurious 'jackpot' sequence with probability about k/ M. On
`our I6-bit machine jackpots on SOOO-linefiles should happen less than 10% of the time and on
`SOO-linefiles less than 1% of the time.
`Diff guards against
`in
`jackpots by checking the purported longest common subsequence
`the original
`files. What
`remains
`after spurious
`equalities ..are edited out
`is accepted as an
`answer even though there is a small possibility that
`it is not actually a longest common subse-
`quence. Diff announces
`jackpots,
`so these cases tend to get scrutinized
`fairly hard.
`In two
`years we have had brought
`to our attention only one jackpot where an edited longest subse-
`quence was actually short-in
`that
`instance short by one.
`
`4. Complexity
`the diff algorithm doesn't perform substantially
`In the worst case,
`than the trivial
`better
`dynamic program. From Section 2 it follows that
`the worst case time complexity is dominated
`by the merging and is in fact O(mnlogm)
`(although Oim (m+n»
`could be achieved). Worst
`case space complexity
`is dominated
`by the space required
`for
`the candidate
`list, which is
`o imn) as can be seen by counting the candidates
`that arise in comparing the two files
`abc
`abc
`a b-' c
`..
`a c. b ac
`b a c b ..
`This problem is illustrated in Figure 2. When m =n the kite-shaped area in which the candi-
`dates lie is Il2 the total area of the diagram, and (asymptotically) 1/3 of the grid points in the
`the number of candidates approaches n2J6 asymptotically."
`kite are candidates,so
`In practice,d(tfworksmuch
`worst
`bounds would
`rarely are more than min (m,n)
`In fact
`with
`provided space
`
`Only
`naive storage
`.after two
`Thus
`
`months
`have good evidellce
`space.
`
`Dropbox Ex. 1020
`
`Page 4
`
`

`
`- 5 -
`
`J
`V J
`V J
`I /
`V
`I J
`V /
`V J
`/
`V
`I /
`I /
`J
`J
`V J
`V
`/
`/
`V
`V
`V J
`/
`/
`r
`'&v
`/
`/
`/
`I£,v
`V
`V
`/
`/
`If I V /
`/
`Ib /
`
`/
`
`I
`/
`
`V-
`
`/
`
`~
`
`/"
`
`~
`
`(
`
`and candidates in comparing
`
`Figure 2. Common subsequences
`abcabcabc
`.
`acbacbacb
`.
`the central algorithm of diff
`is so fast that even in the
`As for practical
`time complexity,
`biggest cases our implementation
`can handle {about 3500 lines} almost half the run time is still
`absorbed by simple character handling for hashing,
`jackpot checking, etc., that
`is linear
`in the
`total number of characters
`in the two tiles. Typical
`times for comparing 3500-line tiles range
`from 114 to 3/4 cpu minutes on a PDP11145. By contrast, a speeded-up variant of Hirschberg's
`dynamic programming atgorithmlel
`took about 5 cpu minutes on 3500-line tiles. The heuristic
`algorithm sketched at the beginning of Section 1 typically runs about 2 or 3 times as fast as d(fJ
`on long but
`trivially different
`tiles, but
`loses much of that advantage on more difficult cases
`that arewithin
`the competence of both methods. Since the failure modes of the two programs
`8M qune different,
`it is useful
`to have both on hang.
`
`Dropbox Ex. 1020
`
`Page 5
`
`

`
`-6;..
`
`... References
`
`Program,' Bell Laboratories
`
`internal
`
`[l]
`
`'[3J
`[4J
`
`'ALTER -,A .ComdeckCornparing
`S. C. Johnson,
`memorandum 1971.
`[2J Generalizing from a special case solved by 'L'G'Szyrnanskild], H. S. Stone proposed and 1.
`W. Hunt
`refined and implemented
`the first version of the candidate-listing
`algorithm
`used by diff, and embedded it in an older framework due to M. D. McIlroy. A variant of
`this algorithm was also elaborated by Szymanskill.Ol. We have had many useful discus-
`sions with A. V. Aho and 1. D Ullman. M. E. Lesk moved the program from UNIX to
`05/360.
`to QED Text Editor,' Murray Hill Computing Center MHCC-002.
`Introduction
`'Tutorial
`S. B. Needleman
`and C. D. Wunsch,
`'A General Method Applicable
`to the Search for
`Similarities in the Amino Acid Sequence,'J Mol BioI 48 (1970) 443-53.
`[5J D. Sankoff,
`'Matching Sequences Under Deletion/InsertionConstraints,
`USA 69 (1972) 4-6.
`[6J D. S. Hirschberg,
`'A Linear Space Algorithm for Computing Maximal Common Subse-
`quences,' CACM 18 (1975) 341-3.
`[7J D. S. Hirschberg,
`'The Longest Common Subsequence Problem,' Doctoral Thesis, Prince-
`ton 1975.
`'A Special Case of the Maximal Common Subsequence Problem,' Com-
`[8J T. G Szymanski,
`puter Science Lab TR-170, Princeton University 1975
`[9J Michael L. Fredrnan..
`'On Computing
`the Length of Longest
`Discrete Math 11 (1975) 29-35.
`[tOJ T. G. Szymanski,
`'A Note on the Maximal Common Subsequence Problem,'submitted
`publication.
`[I I] The programs called proof, written byE. N. Pinson andM. E. Lesk for UNIX and eJECOS
`use the heuristic algorithm for differential. file comparison.
`
`Proc Nat Acad Sci
`
`Increasing Subsequences,'
`
`for
`
`Dropbox Ex. 1020
`
`Page 6
`
`

`
`-7-
`
`Appendix
`
`of lines common to file 1, whose length is m lines,
`
`structured
`
`(serial, hash), where serial is a line number and
`
`A.I Summary of the diff algorithm
`Algorithm to find the longest subsequence
`and file 2, n lines.
`equivalence classes in file 2 and associate them with lines in file 1
`Steps 1 through 4 determine
`in preparation
`for the central algorithm.
`(The diff program that
`is in actual use does the work
`of these steps somewhat differently.)
`Let V be a vector of elements
`1.
`hash is an integer. Set
`j = l,....n,
`VLi] -
`(i.H(j»
`where H(j)
`is the hash value of line j in file 2.
`Sort V into ascending order on hash as primary key and serial as secondary key.
`Let E be a vector of elements
`isertal.last), Then set
`structured
`j = l,....n,
`E(j]
`-
`(V(jJ.serial,f(j»,
`E[O] -
`(O.true),
`
`2.
`3.
`
`4.
`
`where
`
`. _I true
`
`or V[j].hash~ V[j+ 1].hash
`ifj=n
`-
`f(J)
`false otherwise.
`classes of lines in file 2, with last = true on the last element of
`E lists all the equivalence
`each class. The elements are ordered by serial within.classes.
`Let P be a vector of integers. For i = l ,....m set
`. _1 js.uch that Eli-.Il.last = true and H(;) = V[j].hash
`10 If no such j exists,
`
`, P[(l
`where H(i)
`is the hash value of line i of file 1. The j values can be found by binary
`search in V.
`P[i], if nonzero, now points in E to the beginning of the class of lines in file 2 equivalent
`to line i in file 1.
`Steps 5 and 6 are the longest common subsequence
`algorithm proper.
`Let candtdatet.ab.previous)
`constructor, where a and b are line
`5.
`be a reference-valued
`in file 1 and file 2 respectively and previous is oil or a reference to a candidate.
`numbers
`Let K [O:min im.n) + 1] be a vector of references
`to candidates. Let k be the index of the
`last usefully filled element of K. Initialize
`K[O] -
`candidate (O.O.oiO.
`K[Il
`-
`candidateim-r
`l,n+ Lnll):
`k-O.
`K [I ]•is a fence beyond'
`,,'
`the last usefully filled element.
`For, i. = 1~....m, ifP[i]~O do merge (K,k,i.E,P[;) }toupdateK
`6.
`Steps 7 and 8 get a more convenient
`representation
`for the longest common subsequence.
`
`Dropbox Ex. 1020
`
`Page 7
`
`

`
`-8-
`
`7.
`
`8.
`
`referred to by K[k] and linked by previous
`
`Let J be a vector of integers.
`Initialize
`I == O..... m.
`J[/]
`-
`0
`c of the chain of candidates
`For each element
`references
`set
`J [c.a]
`c.b.
`-
`The nonzero elements of J now pick out a longest common subsequence,
`possibly includ-
`ing spurious 'jackpot' coincidences. The pairings between the two files are given by
`I J[/]¢O}.
`{(I.J[/])
`The next step weeds out jackpots.
`For I = 1,....In, if J[i]¢O and line I in file 1 is not equal to line J[i]
`9.
`- O.
`J[i]
`This step requires one synchronized
`
`in file 2, set
`
`pass through both files.
`
`A.2 Storage management
`To maximize capacity, storage is managed in diff per the following notes, which are keyed to
`the steps in the preceding summary. After each step appear
`the number of words then in use,
`except for a small additive constant, assuming that an integer or a pointer occupy one word.
`I.
`Storage for V can be grown as file 2 is read and hashed. The. value of'.» need not be
`[211 words]
`known in advance.
`already in V, it is more compact because the last field
`Though E contains
`information
`only takes one bit, and can be packed into the same word with the serial field. E can be
`overlaid on V.serlal.
`[211 words]
`P can be grown as was V in step I. [211 + In words]
`V is dead after this step. Storage can be compacted to contain only the live information,
`[11+ m words]
`E and P.
`.
`Candidates
`can be allocated as needed from the free storage obtained in the previous
`compaction,
`and from space grown beyond that
`if necessary. Because they are chained,
`candidates do not have to be contiguous.
`During the Ith invocation of merge, the first i elements of P are dead, and at most
`the first
`1+ 2 elements of K are in use, so with suitable equivalencing K can be overlaid on P.
`[11+m+ 3 x (number of candidatesl]
`P and K are dead, so J can be overlaid on them. E is dead also.
`didaies)]
`
`3.
`
`4.
`
`5.
`
`6.
`
`7.
`
`[m+ 3)( (number of can-
`
`A.3 Summary of merge step
`procedure merge (K,k,i,E,p )
`K is as defined in step 5 above, by reference
`k is index of last filled element of K; by reference
`I is current
`index in file 1, by value
`E is as defined in Step 3 above, by reference
`pis
`i:~~:einEoffirsreleTent
`of class of lines in file iequivalent
`
`to lineilof~lel,
`
`Dropbox Ex. 1020
`
`Page 8
`
`

`
`-9-
`
`c will always refer to the last can-
`Let, be an integer and c be a reference to a candidate.
`didate found, which will always be an r-candidate, K[r] will be updated with this refer-
`Initialize
`ence once the previous value of K [,] is no longer needed.
`,-0,
`c r- K[O].
`(By handling the equivalence class in reverse order, Szyrnanskilf.Ol circumvents
`that waste space.)
`to delay updating K [r], but generates extra 'candidates'
`Do steps 3 through 6 repeatedly.
`. Let j = E[p ].seria/.
`3.
`such that K[s]-b <I and K[s+ ll-b~j.
`Search K[,:k]
`for an element K[s]
`(Note that K is ordered on K[.]-b,
`so binary search will work.)
`If such an element
`is found do steps 4 and 5.
`If K[s+ 1]-b »t. simultaneously set
`4.
`K[,]
`-
`c,
`r rr s+I,
`candidate U,.i,K [sJ ).
`c -
`
`the need
`
`5.
`
`6.
`
`If s = k do:
`Simultaneously set
`- K[k+ 1]
`K[k+2]
`k- k+1.
`Break out of step 2's loop.
`If E[p ].last = true, break out of step 2's loop.
`p+ 1.
`Otherwise set p -
`c.
`Set K[r]
`-
`
`(move fence),
`
`1.
`
`2.
`
`7.
`
`Dropbox Ex. 1020
`
`Page 9

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket