`
`US00707905l B2
`
`(I2)
`
`United States Patent
`Storer et al.
`
`um Patent No.:
`(45; Date of Patent:
`
`US 7,079,051 B2
`Jul. 18., 2006
`
`(54)
`
`(76)
`
`IN-PLACE D1FFERE.\"l"IAL (TOMPRESSION
`
`ltwcnlnrsr James Andrew Storer. 89 S. Circ:-11
`Rd.. Lincultl. MA (US) 01773: Dana
`Shapira. 80 Pleasant St.. Apt. 53.
`Brooklinc. MA (US) 02-'-146
`
`1*)
`
`Notice:
`
`Subject to any disclaimer. the term ufthis
`patent is extended or adjusted under 35
`U.S.(‘. 154(1)) by 0 days.
`
`Appl. No.: 10/803507
`
`Filed:
`
`M ar. 18. 2004
`
`Prior Publication Data
`
`US 2005/0219075 A1
`
`Oct. 6. 2005
`
`Int. (‘L
`1. 2006.01)
`lI03.M 7B8
`U.S. (‘1.
`......................................... .. 341/51;3411S0
`Field of (flassificatiun Search
`341150,
`341151
`
`See application tile for complete search history.
`
`References Cited
`
`U.S. PA'fEN'l' DOCUMENTS
`
`-1.366.551 A
`4.558.302 A
`4.646.061 A
`4.701.745 A
`4.730.343 A
`4.314.746 A
`4.876.541 A
`4.905.297 A
`4.906.991 A
`5.003.307 A
`5.0l6.0U‘) A
`
`12.1932 H0112
`12.1985 Wclch
`Z 1987 Blodsoe
`1031987 Watcrworth
`3 1988 N/[acCrisken
`31989 Miller at al.
`10i‘1|).\'!) Slater
`Z.‘1‘)90
`I.:1ngdon at al.
`3‘ 1990 Finln ct .11.
`331991 George at Al.
`51991 Whiting el‘ al.
`
`341451
`
`3-4|/51
`........... .. 341151
`
`6 I992 (imrg-: at al.
`5.136.739 A
`1011902 (‘lurk
`5.153.591 A
`12-‘I992 Lam?
`5.175.543 A
`1 I993 Ranganathan C1 31.
`5.179.378 A “
`101993 Clark
`4.153.335 A
`1 I995 Smrcr
`5.379.036 A "‘
`-'1 1995 Anderson ct al.
`3.406.179 A “
`111995 Squibb
`5.479.654 A
`31997 Chang at :11.
`5.608.396 A
`41998 Squibb
`5.145.906 A
`‘)‘|‘)93 Morris
`5.313.017 A
`1. 2000 Burns ct al.
`6.013.747 A
`52001 Murz11id11:u'
`6.233.559 Bl
`-412002 Ajlai et al.
`6.374.250 Bl
`12. 20113 Tltnmpsnn ct :11
`f1.(a71.'7¢,1,‘1 BI
`0THF.R PU B1,.1C'ATIONS
`
`"(.'nmpuct1_\' Encoding
`Ajlai. Burns. 1"agiIL and Long.
`Unstructured Inputs with 1)iITcrL‘ntia1 (‘ontprcssiutfl .1. of
`the ACM 49:3. 318-367. (11.3). vol. 49. N0. 3. May 2002,
`Apnstolicu. Browne. and Guam: [1992]. “Fast L inear-Space
`Co111pLtt:ttiuns 01' Longest Common Subsequenccs“. Thea-
`retical Computer Science. 92:1. 3-17 (U.S.). vol. ‘)2. No. 1.
`Jan. 6. I992.
`
`(Ccintinued)
`
`Prfmu‘r_l' l'§xz1mIner'
`
`Jean Bruner Jczmglatldc
`
`(S71
`
`ABSTR.-\("l"
`
`To etthunce the distribution or backup of data similar tn
`previously distributed or backed up data. to provide simi-
`larity testing. and other benefits. :1 first body ofdata *1‘ 01' size
`11 is cmnpressed with respect to a second body of data S 01'
`size 111
`in—p1z1ce‘. that is. the memory containing S is over-
`written fmm left to right so that at nn time is more than £1
`lulu] ml" MAX{l‘I].l1}+U( 1) l‘l‘1t.‘1‘1l0I‘_\' used.
`
`23 Claims. 3 Drawing Sheets
`
`
`<3 Sis of size m
`T13 Of S]-Z6 H‘
`
` L“ "7 ‘til
`13:31
`
`window of size MAX{m,nI
`
`T
`
`
`dccoderx‘ 4—- partially decoded T—><— remainder of S —>
`.
`m -x
`31*’?!
`‘T
`
`IPSW method
`
`ORACLE 1019
`
`
`
`US 7,079,051 B2
`Page 2
`
`OTHER PUBLICATIONS
`
`Burns and Long [I997]. "A Linear Time. Constant Space
`Dif1"ercucing.1\lgoritluu". Proc. IEEE Int. Conf. on Pcrfonn..
`Comp.. and Communications. (U.S.). Feb. 1997.
`Burns and Long [I997b]. “Elficient Distributed Backup and
`Restore with Delta Comprt.-ssion"_. Workshop on H0 in
`Parallel and Distributed Systems (IOPADS3.
`(U.S.). no
`month.
`Bums. Stockmeycr. and D. D. E. Long. [2002]. “Experimen-
`tally Evaluating In-Place Delta Reconstruction". NASA
`Cool‘. on Mass Storage Sys. and Tech..(U.S.), no month.
`Connode. Paterson. Sahinalp and VlSl1l\’lI‘I [2000]. "Commu-
`nication Complexity of Doctunent Excliange". Pmc. llth
`Symp. on Disenzte Algorithms. I97-206. (U.S.). no month.
`Factor. Sheinwald. and Yassour [200I ]. “Software Compres-
`sion in the Client/Server Environment", Proc. Data Com-
`pression Conf.. IEEE Comp. Society Press. 233-242. (U.S.).
`Mar. 27-29. 2001.
`Fraser and Myers [ 1987]. “An Editor for Revision Control“.
`ACM Transactions on Programming Innguaggcs and Sys-
`tems 9:2. 277-295. (U.S.). vol. 9 No. 2. Apr. 1997.
`Ilurtt, Vo. and Tichy [I998]. “Delta ulgoritlmisz An Empiri-
`cal Analysis". ACM Trans. on Software Engineering and
`Methodology 7. 192-214. (U.S.). vol. 2, No. 2, Apr. 1998.
`Hufliuan [I952]. “A Method for the Construction of Mini-
`mum-Redundancy Codes". Proceedings of the IRE 40.
`I098-I101. (U.S.). Sep. 1952.
`Letnpel and Ziv [1976]. “On the Complexity of Finite
`Sequences".
`IEEE Transactions on Information 'l'heory
`22:], 75-8]. (U.S.). no month.
`Miller and Myers [I985]. “A File Comparison Program".
`Sofiware— PFZICIICE and Experience 15:11.
`1025-1040.
`(U ,S.)_. unknown month.
`Rick [I995]. “A New Flexible Algorithm for the Longest
`Common Sub-sequence Problem”, Proc. 6th Annual Sympo-
`sium on Combinatorial Patten: Matching. Espoo. Finland.
`5-7. (U.S.). unknown month.
`Shapira. and Slorer [2002]. "Edit Distance with Move
`Operations", Proceedings Conference of Combinatorial Pat-
`tern Matching (CPM) Springer. 85-98. (U.S.). unknown
`month.
`
`Reicheuberger [I991]. "Delta Storage for Arbitrary Non-
`Text Files“, Proc. 3rd Int. Workshop on Software Configu-
`ration Management. Trondheim. Norway,
`I2-14 (U.S.).
`unknown month.
`Shapiro. and Storer [Z003]. “In-Place I)itl”erential File Com-
`pression”. Proc. 2003 Data Comp. Conf. (published Mar. 25.
`2003 J, IEIZEZ Comp. Soc. Press. 263-272. (U.S.).
`Storer and T. G. Szymauski [I978]. "The Macro Model for
`Data Compression". Proc. 10th Atu1ualACM Symposium of
`the Theory of Computing. San Diego. CA. 30-39. (U.S.).
`unknown month.
`
`Storer and Szymanski [1982]. “Data Compression Via Tex-
`tual Substitution". Journal of the ACM 29:4. 928-951.
`(U.S.). unknown month.
`Storcr [1988] Data Compression: Methods and Theory.
`Computer Science Press (a subsidiary of W. B. Freeman
`Press). (U 5.]. unknown month.
`Storcr [2002]. “An Introduction to Data Structures and
`AIgoritl1.tus“. Birkltttuser—S;-ringer.
`(U.S.).
`unknown
`month.
`
`Tichy [ I 984]. “The String to String Conection Problem with
`Block Moves“, ACM Transactions on Computer Systems
`2:4. 309-321. (U.S.). unknown month.
`Wagner and Fischer [1973]. “The String-to-String Correc-
`tion Problem”. Journal of the ACM 21:1, I68-I73. (U.S.).
`unknown month.
`
`We-iner[1973]. “Linear Pattern Matching Algorithms". Pro-
`ceedings of the 14th Annual IEEE. Symposium on Switching
`and Automnta Theory 1-] 1. (U.S.). unknown month.
`Ziv and Lempel [I977]. “A Universal Algorithm for Sequen-
`tial Dato Compression". IEEE Transactions on Information
`Theory 23:3. 337-343. (U.S.). unknown month.
`Ziv and I/etnpel
`[I978]. “(boipression of Individual
`Sequences Via Variable-Rate Coding". IEEE Transactions
`on Information Theory 24:5. 530-536. (U.S.). unknown
`month.
`Burns and Long [I998]. "lo-place Reconstruction of Delta
`Compressed Files". Proe. AC’M Cont". Principles of Distrib-
`uted Comp. (PODC) (U.S.). no month.
`
`"' cited by examiner
`
`
`
`U.S. Patent
`
`Jul. 18,2005
`
`Sheet 1 of3
`
`US 7,079,051 B2
`
`-(T Sis of size m Tis Of SiZ6 ll >
`
`
`/ window of size MAX{m,r1}
`¢
`
`encoder
`
`deCoder% <(-- partially decoded T—>~(— remainder of S —)~
`
`I F
`
`IG. 1
`
`[PSW method.
`
`
`
`U.S. Patent
`
`Jul. 18, 2005
`
`Sheet 2 of3
`
`US 7.079.051 B2
`
`etext already seen
`
`- ELEPHANT
` 53°
`\
`
`incoming
`characters
`
`Suppose that "ELEPHANT"
`has already been seen 500
`characters back in the text.
`
`Encode the 8 characters
`"ELEPHANT"
`as the pair of numbers (500.8).
`
`FIG. 2 (prior art)
`
`sliding window compression
`
`
`
`U.S. Patent
`
`Jul. 18, 2005
`
`Sheet 3 of}
`
`Us 7,079,051 B2
`
`--:-'-er:-:.s:>.-—-=a:1.~a:=*.:
`m=3:sz:c:2:B7-Was.-Argcx _ ,
`v-
`_ -
`J-‘H-.
`. 9
`'
`,
`-cg;
`'
`"
`
`- _'.',.
`
`FIG. 3
`
`Iinking ofgaps
`
`
`
`US 7,079,051 B2
`
`1
`IN-PLACE DIFFERENTIAL COMPRESSION
`
`BACKGROUND OF THE INVENTION
`
`The present invention relates to in-place dillcrcntial corn-
`prcssion, where a body of data '1‘ is compressed with respect
`to a body of data S by performing copies from S in such a
`way that no additional memory is used beyond what
`is
`needed to store the longer of S or '1': that is. when decoding.
`S is overwritten from left to right as T is constmcted in its
`place.
`There have been many patents and tecltnical papers that
`pertain to data compression. Many relate to techniques
`diflerent
`than ones that employ string copying such as
`Hutlinan coding (e.g.. US. Pat. No. 4.646.061 ) or arithmetic
`coding (e.g.. U.S. Pat. No. 4.905.297). Many relate to
`techniques that employ string copies but in a traditional data
`compression model where a single body of data is com-
`pressed not in-place dilfcrential compression ofa firsl body
`of data with respect to a second body of data; for example.
`U.S. patents such as I-loltz [U.S. Pat. No. 4.366.551]. Welch
`[U.S. Pat. No. 4.558.302]. Watetworlb [U.S. Pat. No. 4.701.
`7-15]. MacCt-isken [U.S. Pat. No. 4.730.348]. Miller and
`Wegman [U.S. Pat. No. 4.814.746]. Storer [U.S. Pat. Nos.
`4.876.541. 5.379.036]. Fiala and Greene [U.S. Pat. No.
`4.906.991]. George.
`lvey. and Whiting [U.S. Pat. Nos.
`5.003.307. 5.016.009. 5.126.739]. Rubow and Wachel [U.S.
`Pat. No. 5.023.610]. Clark [U.S. Pat. Nos. 5.153.591. 5.253.
`325]. Lantz [U.S. Pat. No. 5.175.543]. Ranganathan and
`llcm-iques [U.S. Pat. No. 5.179.378]. Cheng. Cra ft. Ciaribay.
`and Karnin |U.S. Pat. No. 5.608.396]. and technical articles
`such as Letnpel and Ziv [1977. 1979] and Storer [1978.
`1988. 1982. 2002].
`There have also been a number of patents and technical
`papers relating to differential compression that do not per-
`form decoding in—place: for example: Squibb [U.S. Pat. Nos.
`5.479.654. 5.745.906]. Mon'is [U.S. Pat. No. 5.813.017].
`Muralidhar and (‘handnn [U.S. Pat. No. 6.233.589]. 'l'lton1p-
`son. Peterson. and Mohammadioun [U.S. Pat. No. 6.671.
`703]. and technical articles such as Wciner [1973] (who
`developed a linear time and space greedy copy/insert algo-
`rithm using a suflix tree to search for matching substrings).
`Wagner and Fischer [1973] (who considered the string»to«
`string correction problem). Heckel [1978] (who presented a
`linear time algorithm for detecting block moves using long-
`est comtnon substring techniqucs).Ticl1y [1984] [who used
`edit-distance techniques for differencing and considered the
`string to string correction problem with block moves].
`Miller and Myers |1985]
`(who presented a comparison
`program for producing delta files). Fraser and Myers [1987]
`(who integrated version control into a line editor so that on
`every change a minimal delta is retained"). Reichenberger
`[1991] (who presented a greedy algorithm for ditl’crcncing).
`Apostolico. Browne. and Guena [1992] and Rick [1995]
`(who considered methods for computing longest common
`subsequences). Burns and Long [1997b] (use delta compres-
`sion to modify ADSM. Adstar Distributed Storage Manager
`of IBM. to transmit compact encodings of versioned data.
`where the client maintains a store of reference files). ilunt.
`Tichy and Vi) [1998] (who combine Lempel-Ziv type com-
`pression and diflcrential compression to compute a delta file
`by using a reference file as pan of the dictionary to compress
`a target file). Factor. Sheinwald and Yassour [2001] (who
`present a Lempel Ziv based compression with an extended
`dictionary with shared data). Shapiro and Storer [2002] [who
`give theoretical evidence that determining the optimal set of
`move operations is not computationally tractable. and
`
`ll]
`
`15
`
`H1
`
`35
`
`40
`
`45
`
`'
`
`U1 ‘J:
`
`611
`
`ti.‘-
`
`2
`
`present an approximation algorithm for a block edit-distance
`problem). Agarwal. Amalapurapu. and Jain [2003] (who
`speed up dilferential compression with hashing techniques
`and additional data structures such as suliix arrays).
`There has also been commonly available software avail-
`able for dilfcrcncing that does not employ in-place decoding
`with string copying. such as the UNIX dill. xdelta and zdelta
`utilities.
`Burns and Long [1997]. M. Ajtai. R. Bums. R. Fagin. and
`D. 1). E. Long |2002]. and the 11.8. patent of Ajtai. Bums.
`Fagin. and Stockmeycr [U.S. Pat. No. 6.374.250] use a hash
`table with Karp-Rabin footprints to perform ditfcrential
`compression of one file with respect
`to another. using
`constant space in addition to that used by both files. but do
`not provide for in-place decoding.
`Bums and [hug [1998]. Burns. Stoclcmcycr. and Long
`[2002]. and the U.S. Patent oi'Burns and Long [U.S. Pat. No.
`6.018.747] present an in-place reconstruction of differential
`compressed data. but do not perform the reconstruction with
`copies that overwrite front left to right. They begin with a
`traditional delta file and work to detect and eliminate write-
`before-rend conflicts (increasing the size of the delta cod-
`ing).
`The invention disclosed here is in part motivated by the
`rescarch presented in Shapiro and Storer [2003].
`
`BRIEF SUMMARY OF THE INVENTION
`
`is an object of this invention to perfomt
`it
`Therefore.
`in-place dillerential compression. With dilferential co1nprcs-
`sion. at body of data 1‘ is compressed with respect to a body
`of data S. That is, an encoder and decoder have available
`identical copies of S. and the encoder may copy substrings
`from S to fomt T. A pointer that copies a string w from S
`achieves data compression when the bits representing that
`pointer are fewer than the bits representing w. We use the
`variable :1 to denote the size of '1' and in to denote the sire
`of S.
`l)itTcrcntial compression is in-place if the memory
`containing S is overwritten when decoding T so that at no
`time is more than a total of MAX{m.n] memory used. in
`addition to the constant amount of memory to store the
`program itself along with its local variables.
`It is another object of this invention to rearrange the
`substrings of S to better align S and '1‘ to enhance the
`cflizctiveness of copying substrings from S to Tin-place.
`An in-place dilfcrential compression method according to
`the present
`invention includes the steps of rearranging
`substrings of S to improve the alignment of subslrings of T
`with substrings of S and decoding in-place by copying
`substrings ol'S to fonn portions of T in a way that overwrites
`S.
`
`BRIEF DESCRIPTION OF SEVERAL VIEWS OF
`THE DRAWING
`
`FJG. 1 shows encoding on the top and on the bottom the
`way in which in—place decoding overwrites memory.
`FIG. 2 shows tltc well known method of sliding window
`compression (prior art).
`FIG. 3 shows the of linking gaps when ofl'-the-shelf
`compression follows aligned moves.
`
`DETAILED DESCR]1"TlON OF THE
`INVENTION
`
`With dillbrentiol compression. :1 body of data S of size in
`is compressed with respect to a body of data '1‘ of size it. That
`
`
`
`US 7,0'r’9,05l B2
`
`3
`is. both the encoder and the decoder have a copy of S and
`then it new string 'l' may be encoded and subsequently
`decoded by making use of S (e.g. copying substrings of S].
`There are many practical applications where new infor-
`mation is received or generated that
`is highly similar to
`information already present. When a software revision is
`released to licensed users. the distributor can require that a
`user must perform the upgrade on the licensed copy of the
`existing version. When incremental backups are performed
`for a computing system. dilfcrential compression can be
`used to compress a tile with respect
`to its version in a
`previous backup, with respect to ta similar tile in a previous
`backup. or with respect to a tile already processed in the
`current backup. Diflerential file compression can also be a
`powerful string similarity test
`in browsing and filtering
`applications.
`One of ordinary skill in the art can understand that there
`are many ways that compressed data may be transmitted
`from an encoder to a decoder. including on a communica-
`tions link. over a wireless connection. through a bu ifcr. by
`transfer to and from a storage device. etc.. and that the lonn
`of compressed data transmission from the encoder to
`decoder does not lintit the invention herein disclosed.
`
`ill
`
`15
`
`Ditlerential compression is in-place if the memory cott-
`taining S is overwritten when decoding. T so that at no time
`is more than a total of MAX{m.n} memory used. Of course.
`the decoder has stored somewhere the executable code for
`the decoding program (possibly in read—only memory or
`hardware), which not is part of memory used when we refer
`to the computation being in-place; that is. in-place refers to .
`the amount of memory used above and beyond the program
`code. The program code may also make use of some lixed
`number of local program variables (indiccs of for loops. etc.)
`which are also not part ofthe memory used when referring
`to the computation as being in~placc. We allow the encoder
`to use more memory if needed. The restriction that the
`decoder must operate in-place is desirable because it reliects
`practical applications when: the decoder may have unknown
`or limited resources.
`
`in-place
`is an object of this invention to perfortlt
`It
`diltc-vrential compression with methods that are both power-
`ful (i.c.. typically achieve high compression) and provide for
`fast and space eflicient decoding.
`MAX and MIN Notation
`
`We use the notation MIN{x.y} to denote the smaller of it
`and y and MAX {x.y} to denote the larger of x and y.
`
`40
`
`45
`
`Big 0 Notation
`It is standard in the computer science literature to use Big
`O notation to specify how the amount of time or memory '
`ttscd by a method increases as a ftutction of the input size.
`For two functions f and g. both of which map non-negative
`integers to non-negative integers. fix) is said to be 0(g(x))
`if there exist two constants at and it such that for all integers
`xéa. l(x)§b*g(x). For example. ifa parameter K is chosen
`in such a way that K is O(log3(MlN{m.n})). then it must be
`true that
`there exists two constants a and b which are
`independent ofm and n (a and b remain the same no matter
`what the values of m and n are) such that for all values of
`m and n for which MIN {m.n}§a. l(§b"‘ log2('MIN{m_.n}].
`The notation 0(1) denotes a fixed constant. That is. f(x)
`is O( l ) ifthere exists two constants a and b such that for all
`integers xéa. f(x)§b; if we let c be the constant equal to the
`maximum value off(it) in the range Oéxéb. then f(x_)§c for
`all integers X30. A big 0 constraint can be combined with
`other constraints. For example. for a parameter K. saying
`that “l(<MlN{m.n} and K is O(\.MlN{m.n})” means that
`
`fit]
`
`ti.‘-
`
`4
`
`although K may be chosen to be a function of m and n (i.e._.
`K is larger when MlN{m.n} is larger). for all m and n
`K<MlN {m.n}. and also. there exists two constants a and b
`such that K§b*\,'MlN{tu.n}. for all m and n for which
`MIN {rn.n} Ea.
`
`Memory
`We use the term memory to refer to any type ofcomputcr
`storage. such as the computer's intental RAM memory.
`rt
`hard drive. a read-writable disc, a communications butler,
`etc. We use the term character to refer to the basic unit of
`data to which compression is being applied to a body ofdata
`in some computer memory. A common example of a char-
`acter is a byte (8 bits) or a set of items that are stored one
`per byte (e.g. 7-bit ASCII codes). However. all of what we
`describe applies as well to other types of characters such as
`audio.
`image. and video data (e.g.. for audio data that is
`stored using I2 bits per sample. the set of possible characters
`is the set of 4.096 distinct 12-bit values).
`Although it is common for a body of data to be stored in
`sequential locations of memory. it is not required for this
`invention, it could be that diiferenl portions of the data are
`stored in diiferettt portions of memory. We assume that there
`is a linear ordering to a body of data (at first character. a
`second character. a third character. ctc.). lf a character x
`occurs earlier than a character y in the sequence of charac-
`ters that comprisc a body of data. then we say x is to the left
`of y. and y is to the right of x.
`The size of a body of data is the number of cltamctcrs that
`cornprisc it. and corresponds to the sire o t‘the memory used
`to store it (if the body of data resides in two or more
`disconnected portions of memory, the size of the body oi‘
`data is the sum of the size of each of the portions of memory
`it occupies].
`When we refer to the amount of memory used when
`decoding for in-place dilfcrential compression of a body of
`data '1' with respect
`to a body of data
`it
`is always
`understood that we are referring to the memory used to store
`portions ofS and T. and that there may be an additional fixed
`amount of memory used to contain decoding instnictions
`and local variables used for decoding (indices for loops.
`etc.).
`
`Sliding Window Compression
`Refening to FIG. 2. to help describe our invention. we
`lirst review the well known method in the prior art of sliding
`window compression. The standard UNIX gzip utility is an
`example of a sliding window compressor/decomprcssor.
`Given a string. sliding window compression represents it
`with a smaller string by replacing substrings by a pointer
`consisting of a pair of integers (d.l) where d is a displace-
`ment back in a window of the last N characters and l is the
`length of an identical substring. The reduction in size
`achieved on a string depends on how often suhstrings are
`repeated in the text. how the pairs (d.l) are coded. etc.
`Typically “greedy” matching is used (always take the long-
`est match possible).
`There are many ways that have been proposed to encode
`pointers. A simple method based on fixed length codes is:
`Displacerneitls are represented with |' log1(N)] bits.
`Lengths range from 3 to some upper limit Maxlcngtlt.
`which can be represented by the MaxLcngth-2 integers
`0 through MaxI.ength-3. and use |' log3(Maxl..ength-2)]
`bits.
`
`An initial flag bit distinguish a pointer (to a substring of
`3 or more characters) front a code for a single character.
`wlterc the leading bit is a O ifthe next [ log3(A)] bits
`to follow are a character and a I if the next [ log;,(N)]+[
`
`
`
`5
`
`6
`
`US 7,0?9,05] B2
`
`log3(_MaxLength-2)] bits to follow are a displacement
`and length. where A denotes the size of the input
`alphabet.
`For example, ifN=4.095. MaxLength'—-10, zutd we use the
`I28 character ASCII alphabet (A=l23]. a single char-
`acter is represented with one byte (a [lag bit and 7
`ASCII bits) and a pointer uses two bytes (21 flag bit. 12
`displacement bits. and 3 length bits).
`()ther methods may use variable lettgth codes [some of
`which may be able to represent a string of two bytes with
`less than 16 bits).
`It
`is also possible to employ an ofl'-the-shelf variable
`length coder to encode fields of a pointer. For example. to
`compress data where very large matches may be present. one
`could use the following coding method for pointers that
`employs an arithmetic coder:
`The pointer format begins with a control integer between
`0 and 4 that indicates one of the following cases:
`Case 0: No displacement or length: the hits to ibllow
`indicate a single raw character.
`(‘else 1; Displacctncnts <4,096 that point to a match of
`length <256.
`Case 2: Displacements between 4 KB and 36 KB that
`point to a match of length <256.
`Casc 3: Displacements larger than 36 [(13 that point to
`a match of length <256.
`Casc 4: Displacements larger than 36 KB that point to
`a match of length >256.
`Separate oil‘-the-shelf arithmetic encoders are used to
`encode control integers. the raw character for Case 0,
`and the length fields for Cases I through 3.
`A fixed code of either 12 or 15 bits is used for the
`
`displacements for Cases 1 and 2. A fixed length [
`log2(N)] code is used for the displacement fields of
`Cases 3 and 4 and for the and length field of Case 4.
`The Present Invention
`
`A problem with previous methods for ditferential cont-
`pression is their use of more computational resources {e.g..
`time and memory) than may be necessary and/or the need to
`sacrifice achievable compression in order to achieve accept-
`able resources. Here we propose a high performance fast
`method for perfonning differential compression in place.
`The solution proposed here uses the ln-Place Sliding
`Window (_lPSW) method to compress a body of data 'I' of
`size n with respect to a body of data S of size in:
`IPSW Encoding Algorithm:
`Step 1: Append T to the end of 3.
`Step 2: Compress T with a sliding window of size
`MAX{rn.n}.
`When the encoder slides a window ofsizc MAX{n1.n}. if
`is equal to or larger than n. then only at the start can
`111
`pointers reach all the way back to the start of S. and as we
`move to the right. more of 8 becomes inaccessible. If m is
`less than n. then after producing the lirst n—m characters of
`T. it again becomes the case that pointers cannot reach. all the
`way back to the start of S. In either case. by overwriting. the
`no longer accessible portion of S from lcti to right, decoding
`can be done in-place. using MAX{m.n} memory (encoding
`may use m+n memory). Tltat
`is, each time an additional
`pointer is received by the decoder.
`it
`is decoded and the
`window slides right to add characters to the right‘ of T and
`remove them from the left of S. If m<n. then decoding
`increases the size m ofthc memory to store S to a sin: 11 of
`memory to store '1'. lfm>n, then alter decoding is complete.
`
`T resides i11 what used to be the rightmost 11 characters ofS.
`and the memory for what used to be the first m—n characters
`of S can be reclaimed.
`
`the decoding may use a fixed amount of
`Of cotmse.
`memory for the decoding instructions (which could be
`read-only rncrttory or hardware) and a fixed amount of
`memory to contain a fixed number of local variables used for
`decoding (indices of loops. etc._). This fixed amount of
`memory is independent of m and n. Using big 0 notation.
`the memory used for decoding. instructions is O(l ) and the
`number of local variables is 0(1). When we say that decod-
`ing can be done in-place using MAX{m,n} memory. it is
`always understood that
`the memory used for decoding
`instructions and local variables is
`in addition to this
`
`ll]
`
`15
`
`MAX{tn.n} memory.
`Referring to FIG. 1. depicted is encoding (top) and
`decoding (bottom) for when the size of T is 50 percent larger
`than the size of S. and we are at the point when n—m+x
`characters ofT have been encoded. The hatched region is all
`. J
`1: of S on the top and the remaining ponion ol'S on the bottom.
`The lightly shaded region is the n—m characters of T that
`have already been encoded and decoded by the decoder
`without having to overwrite S. The dark shaded region is the
`portion ofT that has already been encoded by overwriting
`the first x characters of S. The remaining 111-): characters of
`T have yet to be encoded. and hence the rightmost 111-):
`characters of S are still available to the decoder. The
`dccoder‘s window can be viewed as two pieces: the already
`decompressed poniou of T (the lightly shaded and darkly
`shaded regions) that
`is to the lcft of the pointer and the
`remaining portion of S that is to the right of the pointer (but
`was to the Ielt of the lightly shaded and darkly shaded
`regions when encoding). So for each pointer decoded. at
`most two string copies must be pcrformect a single copy
`_ when the match is contained entirely in S (hatched region)
`or entirely in T (lightly shaded and darkly shaded regions)
`or two copies when the match starts in S and ends in T (the
`encoder encodes a match that crosses the boundary between
`the hatched region and the lightly shaded region). Since in
`many practical applications matches that cross this boundary
`are really just “lucl.y" (i .e.. we may be primarily looking for
`large matches to S and if they are not found then a match in
`T that is a shorter distance away). an alternate embodiment
`ofthis invention is to forbid copies that cross this boundary.
`in order to further simplify decoding.
`Any reasonable implementation of sliding window encod-
`ing can be used that allows for large matches. For example.
`the UNIX grip utility that uses a window of size 32K and
`maximum match length 256; it is easy to modify the UNIX
`grip utility to use a simple escape code for long pointers
`(e.g. reserve the longest possible length or displacement for
`the escape code) so that the same algorithm can be used for
`normal matches and the increased pointer length is only
`associated with large matches.
`Another object of this invention is. when in some appli-
`cations it may be that there is some additional amount of
`memory K available. to take advantage of this additional
`memory to improve the amottnt of compression achieved. It
`could be that K is a relatively small amount of memory that
`grows larger as m and it grow larger. such as K being
`O(y'IVllN{[t1.l]}) or even K being O(log2(MlN{rrt.n})). or it
`could be that K is relatively large, but still not large enough
`to retain a separate copy of S while decoding (that
`is.
`K<MlN {m.n} ). such as 50 percent of the additional memory
`needed {that is. K IMIN{m.n}/2) or even 90% of the addi-
`tional memory needed (that
`is. K=(9/l0)MlN{m.n}). An
`additional amount of memory K can be utilized to poten-
`
`30
`
`40
`
`45
`
`so
`
`55
`
`EAIJ
`
`65
`
`
`
`US 7.0?9,05l B2
`
`7
`tially improve compression by lengthening the sliding win-
`dow to size MAX{nt.n}+K. thus delaying by an additional
`K characters the point at which S begins to be overwritten
`when decoding. That
`is. encoding begins with S as the
`rightmost characters of a window ofsize MAX(_n1.n)-t-K.
`In many practical applications. such as distribution of
`successive versions of a software application, S and '1' are
`highly similar and reasonably well aligned: that is. large
`matches between S and '1' occur in approximately the same
`relative order. In this case. IPSW can be a fast method that
`performs in-place as well as methods that are not in-place.
`Another object of this invention is
`to achieve good
`perfomtance even when S and Tare not well aligned. by
`preceding IPSW with an additional step to improve align-
`ment. Poor alignment could happen, for example, with a
`sofiware update where for some reason the compiler moved
`large blocks of code around. An extreme case is when the
`first and second halves of S have been exchanged to form T
`(S—:uv and T:vu): to decompress T the IPSW algorithm
`overwrites u as it copies v. and then the ability to represent
`v with a single copy is lost. Rather than modify IPSW
`(which is already a very fast and practical method that
`suflices for many and perhaps most practical inputs). we
`propose a preprocessing stage for IPSW that moves sub-
`strings within S to better align S with T. The compressed
`format can incorporate a way for the decoder to determine
`whether preprocessing was performed (e.g.. compressed
`data can be begin with an initial bit indicating whether
`preprocessing has occurred). The encoder can compress T
`with IPSW and compare that to compressing T with respect
`to S not in place (so at all times all substrings of S are
`available to be copied). If the difference is significant, this
`initial bit can be set. alignment preprocessing performed.
`and a list of moves prepended the normal IPSW encoding.
`The decoder can perform the moves and then proceed
`in-place with normal IPSW decoding.
`lfthe encoder detemiines that S and ‘litre not well aligned.
`then the goal of prcprocessing for the IPSW algorithm is to
`find a minimal set of substring moves to convert S to a new
`string S‘ that is well aligned with T. We litnit our attention
`to moves that are non-overlapping. where the moves define
`a parsing ofS and T into matched blocks. {b,.}_,.
`1
`g
`A
`g
`,, and
`.l““k b]°°k5- lxvyri/=-1 .
`.
`.
`r-3 11131 i5- S=Xo‘baut'7‘t‘bo(2r'x2 -
`-
`-
`x,_.-bum-x,. and 'f=yu-b,~y,-b3~y._, .
`.
`. y,,_, -hr-y,. When using
`the sliding window method. we would like to copy sub-
`strings ofS only from the pan that was not yet overwritten
`by the reconstmction of T. That is. we would like to perform
`only left copies. ie.. a copy (s,. d,.. I,) that copies a substring
`with 1, characters from position s, to position (1,. that satisfies
`s,§d,..
`We can improve upon the idea of greedy block-move edit
`distance computation proposed in Shapira and Storer [2002]
`by using two different kinds of rearrangements of the blocks.
`Moves rearrange the blocks {b,},._.. _
`_ .,. so that they occur in
`S and T at the same order. Jigglings move the junk blocks of
`S backwards. so that. as a side affect. the matched blocks are
`moved forwards.
`
`Move Preprocessing Algorithm:
`Step l: Find Non Overlapping Common Subsrrings
`(NOCS) ofS and T from longest to shortest down to a
`tninimurn length stoplength. These are the {b,},. .1 ‘
`_
`_
`,.
`blocks.
`
`Step 2: Compute the minimum number ofrearrangetnents
`(moves attdjigglirtgs) in S so that the blocks {bi},-__,
`_
`_ ,
`are left copies within S and T.
`
`S
`
`lit
`
`15
`
`8
`Step 3: Generate 3' according to step 2.
`Step 4: Compute lPSW(S'.T)_
`Step 1 can be pcrfonned by repeatedly applying a longest
`common substring computation (see. for example. the text
`book Arr Immducrion to Data Slmc-lures undA1goriIhm.t'. by
`J. A. Storer. Birkhauserr‘Springer. 2002). or by using the
`linear time method of M. Meycrovich, D. Shapira, and J. A.
`Store: [“l’inding Non-Overlapping Common Substrings in
`Linear Time". Technical Report CS-O3-236, Comp. Science
`Dcpt.. Brandeis University): minlength can be tuned for
`diifcr