throbber
||||||||||||||||||||||||||||||||||l|||||||||l||||||||||||||||||||||||||||||
`
`US00707905l B2
`
`(I2)
`
`United States Patent
`Storer et al.
`
`um Patent No.:
`(45; Date of Patent:
`
`US 7,079,051 B2
`Jul. 18., 2006
`
`(54)
`
`(76)
`
`IN-PLACE D1FFERE.\"l"IAL (TOMPRESSION
`
`ltwcnlnrsr James Andrew Storer. 89 S. Circ:-11
`Rd.. Lincultl. MA (US) 01773: Dana
`Shapira. 80 Pleasant St.. Apt. 53.
`Brooklinc. MA (US) 02-'-146
`
`1*)
`
`Notice:
`
`Subject to any disclaimer. the term ufthis
`patent is extended or adjusted under 35
`U.S.(‘. 154(1)) by 0 days.
`
`Appl. No.: 10/803507
`
`Filed:
`
`M ar. 18. 2004
`
`Prior Publication Data
`
`US 2005/0219075 A1
`
`Oct. 6. 2005
`
`Int. (‘L
`1. 2006.01)
`lI03.M 7/18
`U.S. (‘1.
`......................................... .. 341/51;3411S0
`Field of (flassificatiun Search
`341150,
`341151
`
`See application me for complete search history.
`
`References Cited
`
`U.S. PA'fEN'l' DOCUMENTS
`
`-1.366.551 A
`4.558.302 A
`4.646.061 A
`4.701.745 A
`4.730.343 A
`4.314.746 A
`4.876.541 A
`4.905.297 A
`4.906.991 A
`5.003.307 A
`5.(1l6.0U‘) A
`
`12.1932 H0112
`12.1985 Wclch
`Z 1987 Blodsoe
`1031987 Watcrworth
`3 1988 N/[acCrisken
`31989 Miller at al.
`10i‘1|).\'!) Slater
`Z.‘1‘)90
`I.:1ngdon at al.
`3‘ 1990 Finln ct .11.
`331991 George at Al.
`51991 Whiting el‘ al.
`
`341451
`
`3-4|/51
`........... .. 341151
`
`6 I992 (imrg-: at al.
`5.136.739 A
`1011902 (‘lurk
`5.153.591 A
`12-‘I992 Lam?
`5.175.543 A
`1 I993 Ranganathan C1 31.
`5.179.378 A “
`101993 Clark
`4.153.335 A
`1 I995 Smrcr
`5.379.036 A "‘
`-'1 1995 Anderson ct al.
`1.406.179 A “
`111995 Squibb
`5.479.654 A
`31997 Chang at :11.
`5.608.396 A
`41998 Squibb
`5.145.906 A
`‘)‘|‘)93 Morris
`5.313.017 A
`1. 2000 Burns ct al.
`6.013.747 A
`52001 Murz11id11:u'
`6.233.559 Bl
`-412002 Ajlai et a].
`6.374.251) Bl
`12. 20113 Tltnmpsnn c1 :11
`(1.671 31.1.1 BI
`0THF.R PU B1,.1C'ATIONS
`
`"(.'nmpuct1_\' Encoding
`Ajlai. Burns. 1"agiIL and Long.
`Unstructured Inputs with 1)i1TcrL‘ntia1 (‘ontprcssiutfl .1. of
`the ACM 49:3. 318-367. (11.3). vol. 49. N0. 3. May 2002,
`Apnstolicu. Browne. and Guam: [1992]. “Fast L inear-Space
`Co111pLtt:ttiuns 01' Longest Common Subsequenccs“. Thea-
`retical Computer Science. 92:1. 3-17 (U.S.). vol. ‘)2. No. 1.
`Jan. 6. I992.
`
`(Ccintinued)
`
`Prfmu‘r_l' l'§xz1mIner'
`
`Jean Bruner Jczmglatldc
`
`(S71
`
`ABSTR.-\("l"
`
`To etthunce the distribution or backup of data similar tn
`previously distributed or backed up data. to provide simi-
`larity testing. and other benefits. :1 first body ofdata *1‘ 01' size
`11 is cmnpressed with respect to a second body of data S 01'
`size 111
`in—p1z1ce‘. that is. the memory containing S is over-
`written fmm left to right so that at nn time is more than £1
`lulu] ml" MAX{l‘I].l1}+U( 1) l‘l‘1t.‘1‘110I‘_\' used.
`
`23 Claims. 3 Drawing Sheets
`
`
`<3 Sis of size m
`T13 Of 5126 H‘
`
` L“ "7 ‘til
`13:31
`
`window of size MAX{m,nI
`
`T
`
`
`dccoderx‘ 4—- partially decoded T—><— remainder of S —>
`.
`m -x
`31*’?!
`‘T
`
`IPSW method
`
`ORACLE 1118
`
`

`
`US 7,079,051 B2
`Page 2
`
`OTHER PUBLICATIONS
`
`Burns and Long [I997]. "A Linear Time. Constant Space
`Dif1"ercucing.1\lgoritluu". Proc. IEEE Int. Conf. on Pcrfonn..
`Comp.. and Communications. (U.S.). Feb. 1997.
`Burns and Long [I997b]. “Elficient Distributed Backup and
`Restore with Delta Comprt.-ssion"_. Workshop on H0 in
`Parallel and Distributed Systems (IOPADS3.
`(U.S.). no
`month.
`Bums. Stockmeycr. and D. D. E. Long. [2002]. “Experimen-
`tally Evaluating In-Place Delta Reconstruction". NASA
`Cool‘. on Mass Storage Sys. and Tech..(U.S.), no month.
`Connode. Paterson. Sahinalp and VlSl1l\’lI‘I [2000]. "Commu-
`nication Complexity of Doctunent Excliange". Pmc. llth
`Symp. on Disenzte Algorithms. I97-206. (U.S.). no month.
`Factor. Sheinwald. and Yassour [200I ]. “Software Compres-
`sion in the Client/Server Environment", Proc. Data Com-
`pression Conf.. IEEE Comp. Society Press. 233-242. (U.S.).
`Mar. 27-29. 2001.
`Fraser and Myers [ 1987]. “An Editor for Revision Control“.
`ACM Transactions on Programming Innguaggcs and Sys-
`tems 9:2. 277-295. (U.S.). vol. 9 No. 2. Apr. 1997.
`Ilurtt, Vo. and Tichy [I998]. “Delta ulgoritlmisz An Empiri-
`cal Analysis". ACM Trans. on Software Engineering and
`Methodology 7. 192-214. (U.S.). vol. 2, No. 2, Apr. 1998.
`Hufliuan [I952]. “A Method for the Construction of Mini-
`mum-Redundancy Codes". Proceedings of the IRE 40.
`I098-I101. (U.S.). Sep. 1952.
`Letnpel and Ziv [1976]. “On the Complexity of Finite
`Sequences".
`IEEE Transactions on Information 'l'heory
`22:], 75-8]. (U.S.). no month.
`Miller and Myers [I985]. “A File Comparison Program".
`Sofiware— PFZICIICE and Experience 15:11.
`1025-1040.
`(U ,S.)_. unknown month.
`Rick [I995]. “A New Flexible Algorithm for the Longest
`Common Sub-sequence Problem”, Proc. 6th Annual Sympo-
`sium on Combinatorial Patten: Matching. Espoo. Finland.
`5-7. (U.S.). unknown month.
`Shapira. and Slorer [2002]. "Edit Distance with Move
`Operations", Proceedings Conference of Combinatorial Pat-
`tern Matching (CPM) Springer. 85-98. (U.S.). unknown
`month.
`
`Reicheuberger [I991]. "Delta Storage for Arbitrary Non-
`Text Files“, Proc. 3rd Int. Workshop on Software Configu-
`ration Management. Trondheim. Norway,
`I2-14 (U.S.).
`unknown month.
`Shapiro. and Storer [Z003]. “In-Place I)itl”erential File Com-
`pression”. Proc. 2003 Data Comp. Conf. (published Mar. 25.
`2003 J, IEIZEZ Comp. Soc. Press. 263-272. (U.S.).
`Storer and T. G. Szymauski [I978]. "The Macro Model for
`Data Compression". Proc. 10th Atu1ualACM Symposium of
`the Theory of Computing. San Diego. CA. 30-39. (U.S.).
`unknown month.
`
`Storer and Szymanski [1982]. “Data Compression Via Tex-
`tual Substitution". Journal of the ACM 29:4. 928-951.
`(U.S.). unknown month.
`Storcr [1988] Data Compression: Methods and Theory.
`Computer Science Press (a subsidiary of W. B. Freeman
`Press). (U 5.]. unknown month.
`Storcr [2002]. “An Introduction to Data Structures and
`AIgoritl1.tus“. Birkltttuser—S;-ringer.
`(U.S.).
`unknown
`month.
`
`Tichy [ I 984]. “The String to String Conection Problem with
`Block Moves“, ACM Transactions on Computer Systems
`2:4. 309-321. (U.S.). unknown month.
`Wagner and Fischer [1973]. “The String-to-String Correc-
`tion Problem”. Journal of the ACM 21:1, I68-I73. (U.S.).
`unknown month.
`
`We-iner[1973]. “Linear Pattern Matching Algorithms". Pro-
`ceedings of the 14th Annual IEEE. Symposium on Switching
`and Automnta Theory 1-] 1. (U.S.). unknown month.
`Ziv and Lempel [I977]. “A Universal Algorithm for Sequen-
`tial Dato Compression". IEEE Transactions on Information
`Theory 23:3. 337-343. (U.S.). unknown month.
`Ziv and I/etnpel
`[I978]. “(boipression of Individual
`Sequences Via Variable-Rate Coding". IEEE Transactions
`on Information Theory 24:5. 530-536. (U.S.). unknown
`month.
`Burns and Long [I998]. "lo-place Reconstruction of Delta
`Compressed Files". Proe. AC’M Cont". Principles of Distrib-
`uted Comp. (PODC) (U.S.). no month.
`
`"' cited by examiner
`
`

`
`U.S. Patent
`
`Jul. 18,2005
`
`Sheet 1 of3
`
`US 7,079,051 B2
`
`-(T Sis of size m Tis Of SiZ6 ll >
`
`
`/ window of size MAX{m,r1}

`
`encoder
`
`deCoder% <(-- partially decoded T—>~(— remainder of S —)~
`
`I F
`
`IG. 1
`
`[PSW method.
`
`

`
`U.S. Patent
`
`Jul. 18, 2005
`
`Sheet 2 of3
`
`US 7.079.051 B2
`
`etext already seen
`
`- ELEPHANT
` 53°
`\
`
`incoming
`characters
`
`Suppose that "ELEPHANT"
`has already been seen 500
`characters back in the text.
`
`Encode the 8 characters
`"ELEPHANT"
`as the pair of numbers (500.8).
`
`FIG. 2 (prior art)
`
`sliding window compression
`
`

`
`U.S. Patent
`
`Jul. 18, 2005
`
`Sheet 3 of}
`
`Us 7,079,051 B2
`
`--:-'-er:-:.s:>.-—-=a:1.~a:=*.:
`m=3:sz:c:2:B7-Was.-Argcx _ ,
`v-
`_ -
`J-‘H-.
`. 9
`'
`,
`-cg;
`'
`"
`
`- _'.',.
`
`FIG. 3
`
`Iinking ofgaps
`
`

`
`US 7,079,051 B2
`
`1
`IN-PLACE DIFFERENTIAL COMPRESSION
`
`BACKGROUND OF THE INVENTION
`
`The present invention relates to in-place dillcrcntial corn-
`prcssion, where a body of data '1‘ is compressed with respect
`to a body of data S by performing copies from S in such a
`way that no additional memory is used beyond what
`is
`needed to store the longer of S or '1': that is. when decoding.
`S is overwritten from left to right as T is constmcted in its
`place.
`There have been many patents and tecltnical papers that
`pertain to data compression. Many relate to techniques
`diflerent
`than ones that employ string copying such as
`Hutlinan coding (e.g.. US. Pat. No. 4.646.061 ) or arithmetic
`coding (e.g.. U.S. Pat. No. 4.905.297). Many relate to
`techniques that employ string copies but in a traditional data
`compression model where a single body of data is com-
`pressed not in-place dilfcrential compression ofa firsl body
`of data with respect to a second body of data; for example.
`U.S. patents such as I-loltz [U.S. Pat. No. 4.366.551]. Welch
`[U.S. Pat. No. 4.558.302]. Watetworlb [U.S. Pat. No. 4.701.
`7-15]. MacCt-isken [U.S. Pat. No. 4.730.348]. Miller and
`Wegman [U.S. Pat. No. 4.814.746]. Storer [U.S. Pat. Nos.
`4.876.541. 5.379.036]. Fiala and Greene [U.S. Pat. No.
`4.906.991]. George.
`lvey. and Whiting [U.S. Pat. Nos.
`5.003.307. 5.016.009. 5.126.739]. Rubow and Wachel [U.S.
`Pat. No. 5.023.610]. Clark [U.S. Pat. Nos. 5.153.591. 5.253.
`325]. Lantz [U.S. Pat. No. 5.175.543]. Ranganathan and
`llcm-iques [U.S. Pat. No. 5.179.378]. Cheng. Cra ft. Ciaribay.
`and Karnin |U.S. Pat. No. 5.608.396]. and technical articles
`such as Letnpel and Ziv [1977. 1979] and Storer [1978.
`1988. 1982. 2002].
`There have also been a number of patents and technical
`papers relating to differential compression that do not per-
`form decoding in—place: for example: Squibb [U.S. Pat. Nos.
`5.479.654. 5.745.906]. Mon'is [U.S. Pat. No. 5.813.017].
`Muralidhar and (‘handnn [U.S. Pat. No. 6.233.589]. 'l'lton1p-
`son. Peterson. and Mohammadioun [U.S. Pat. No. 6.671.
`703]. and technical articles such as Wciner [1973] (who
`developed a linear time and space greedy copy/insert algo-
`rithm using a suflix tree to search for matching substrings).
`Wagner and Fischer [1973] (who considered the string»to«
`string correction problem). Heckel [1978] (who presented a
`linear time algorithm for detecting block moves using long-
`est comtnon substring techniqucs).Ticl1y [1984] [who used
`edit-distance techniques for differencing and considered the
`string to string correction problem with block moves].
`Miller and Myers |1985]
`(who presented a comparison
`program for producing delta files). Fraser and Myers [1987]
`(who integrated version control into a line editor so that on
`every change a minimal delta is retained"). Reichenberger
`[1991] (who presented a greedy algorithm for ditl’crcncing).
`Apostolico. Browne. and Guena [1992] and Rick [1995]
`(who considered methods for computing longest common
`subsequences). Burns and Long [1997b] (use delta compres-
`sion to modify ADSM. Adstar Distributed Storage Manager
`of IBM. to transmit compact encodings of versioned data.
`where the client maintains a store of reference files). ilunt.
`Tichy and Vi) [1998] (who combine Lempel-Ziv type com-
`pression and diflcrential compression to compute a delta file
`by using a reference file as pan of the dictionary to compress
`a target file). Factor. Sheinwald and Yassour [2001] (who
`present a Lempel Ziv based compression with an extended
`dictionary with shared data). Shapiro and Storer [2002] [who
`give theoretical evidence that determining the optimal set of
`move operations is not computationally tractable. and
`
`ll]
`
`15
`
`H1
`
`35
`
`40
`
`45
`
`'
`
`U1 ‘J:
`
`611
`
`ti.‘-
`
`2
`
`present an approximation algorithm for a block edit-distance
`problem). Agarwal. Amalapurapu. and Jain [2003] (who
`speed up dilferential compression with hashing techniques
`and additional data structures such as suliix arrays).
`There has also been commonly available software avail-
`able for dilfcrcncing that does not employ in-place decoding
`with string copying. such as the UNIX dill. xdelta and zdelta
`utilities.
`Burns and Long [1997]. M. Ajtai. R. Bums. R. Fagin. and
`D. 1). E. Long |2002]. and the 11.8. patent of Ajtai. Bums.
`Fagin. and Stockmeycr [U.S. Pat. No. 6.374.250] use a hash
`table with Karp-Rabin footprints to perform ditfcrential
`compression of one file with respect
`to another. using
`constant space in addition to that used by both files. but do
`not provide for in-place decoding.
`Bums and [hug [1998]. Burns. Stoclcmcycr. and Long
`[2002]. and the U.S. Patent oi'Burns and Long [U.S. Pat. No.
`6.018.747] present an in-place reconstruction of differential
`compressed data. but do not perform the reconstruction with
`copies that overwrite front left to right. They begin with a
`traditional delta file and work to detect and eliminate write-
`before-rend conflicts (increasing the size of the delta cod-
`ing).
`The invention disclosed here is in part motivated by the
`rescarch presented in Shapiro and Storer [2003].
`
`BRIEF SUMMARY OF THE INVENTION
`
`is an object of this invention to perfomt
`it
`Therefore.
`in-place dillerential compression. With dilferential co1nprcs-
`sion. at body of data 1‘ is compressed with respect to a body
`of data S. That is, an encoder and decoder have available
`identical copies of S. and the encoder may copy substrings
`from S to fomt T. A pointer that copies a string w from S
`achieves data compression when the bits representing that
`pointer are fewer than the bits representing w. We use the
`variable :1 to denote the size of '1' and in to denote the sire
`of S.
`l)itTcrcntial compression is in-place if the memory
`containing S is overwritten when decoding T so that at no
`time is more than a total of MAX{m.n] memory used. in
`addition to the constant amount of memory to store the
`program itself along with its local variables.
`It is another object of this invention to rearrange the
`substrings of S to better align S and '1‘ to enhance the
`cflizctiveness of copying substrings from S to Tin-place.
`An in-place dilfcrential compression method according to
`the present
`invention includes the steps of rearranging
`substrings of S to improve the alignment of subslrings of T
`with substrings of S and decoding in-place by copying
`substrings ol'S to fonn portions of T in a way that overwrites
`S.
`
`BRIEF DESCRIPTION OF SEVERAL VIEWS OF
`THE DRAWING
`
`FJG. 1 shows encoding on the top and on the bottom the
`way in which in—place decoding overwrites memory.
`FIG. 2 shows tltc well known method of sliding window
`compression (prior art).
`FIG. 3 shows the of linking gaps when ofl'-the-shelf
`compression follows aligned moves.
`
`DETAILED DESCR]1"TlON OF THE
`INVENTION
`
`With dillbrentiol compression. :1 body of data S of size in
`is compressed with respect to a body of data '1‘ of size it. That
`
`

`
`US 7,0'r’9,05l B2
`
`3
`is. both the encoder and the decoder have a copy of S and
`then it new string 'l' may be encoded and subsequently
`decoded by making use of S (e.g. copying substrings of S].
`There are many practical applications where new infor-
`mation is received or generated that
`is highly similar to
`information already present. When a software revision is
`released to licensed users. the distributor can require that a
`user must perform the upgrade on the licensed copy of the
`existing version. When incremental backups are performed
`for a computing system. dilfcrential compression can be
`used to compress a tile with respect
`to its version in a
`previous backup, with respect to ta similar tile in a previous
`backup. or with respect to a tile already processed in the
`current backup. Diflerential file compression can also be a
`powerful string similarity test
`in browsing and filtering
`applications.
`One of ordinary skill in the art can understand that there
`are many ways that compressed data may be transmitted
`from an encoder to a decoder. including on a communica-
`tions link. over a wireless connection. through a bu ifcr. by
`transfer to and from a storage device. etc.. and that the lonn
`of compressed data transmission from the encoder to
`decoder does not lintit the invention herein disclosed.
`
`ill
`
`15
`
`Ditlerential compression is in-place if the memory cott-
`taining S is overwritten when decoding. T so that at no time
`is more than a total of MAX{m.n} memory used. Of course.
`the decoder has stored somewhere the executable code for
`the decoding program (possibly in read—only memory or
`hardware), which not is part of memory used when we refer
`to the computation being in-place; that is. in-place refers to .
`the amount of memory used above and beyond the program
`code. The program code may also make use of some lixed
`number of local program variables (indiccs of for loops. etc.)
`which are also not part ofthe memory used when referring
`to the computation as being in~placc. We allow the encoder
`to use more memory if needed. The restriction that the
`decoder must operate in-place is desirable because it reliects
`practical applications when: the decoder may have unknown
`or limited resources.
`
`in-place
`is an object of this invention to perfortlt
`It
`diltc-vrential compression with methods that are both power-
`ful (i.c.. typically achieve high compression) and provide for
`fast and space eflicient decoding.
`MAX and MIN Notation
`
`We use the notation MIN{x.y} to denote the smaller of it
`and y and MAX {x.y} to denote the larger of x and y.
`
`40
`
`45
`
`Big 0 Notation
`It is standard in the computer science literature to use Big
`O notation to specify how the amount of time or memory '
`ttscd by a method increases as a ftutction of the input size.
`For two functions f and g. both of which map non-negative
`integers to non-negative integers. fix) is said to be 0(g(x))
`if there exist two constants at and it such that for all integers
`xéa. l(x)§b*g(x). For example. ifa parameter K is chosen
`in such a way that K is O(log3(MlN{m.n})). then it must be
`true that
`there exists two constants a and b which are
`independent ofm and n (a and b remain the same no matter
`what the values of m and n are) such that for all values of
`m and n for which MIN {m.n}§a. l(§b"‘ log2('MIN{m_.n}].
`The notation 0(1) denotes a fixed constant. That is. f(x)
`is O( l ) ifthere exists two constants a and b such that for all
`integers xéa. f(x)§b; if we let c be the constant equal to the
`maximum value off(it) in the range Oéxéb. then f(x_)§c for
`all integers X30. A big 0 constraint can be combined with
`other constraints. For example. for a parameter K. saying
`that “l(<MlN{m.n} and K is O(\.MlN{m.n})” means that
`
`fit]
`
`ti.‘-
`
`4
`
`although K may be chosen to be a function of m and n (i.e._.
`K is larger when MlN{m.n} is larger). for all m and n
`K<MlN {m.n}. and also. there exists two constants a and b
`such that K§b*\,'MlN{tu.n}. for all m and n for which
`MIN {rn.n} Ea.
`
`Memory
`We use the term memory to refer to any type ofcomputcr
`storage. such as the computer's intental RAM memory.
`rt
`hard drive. a read-writable disc, a communications butler,
`etc. We use the term character to refer to the basic unit of
`data to which compression is being applied to a body ofdata
`in some computer memory. A common example of a char-
`acter is a byte (8 bits) or a set of items that are stored one
`per byte (e.g. 7-bit ASCII codes). However. all of what we
`describe applies as well to other types of characters such as
`audio.
`image. and video data (e.g.. for audio data that is
`stored using I2 bits per sample. the set of possible characters
`is the set of 4.096 distinct 12-bit values).
`Although it is common for a body of data to be stored in
`sequential locations of memory. it is not required for this
`invention, it could be that diiferenl portions of the data are
`stored in diiferettt portions of memory. We assume that there
`is a linear ordering to a body of data (at first character. a
`second character. a third character. ctc.). lf a character x
`occurs earlier than a character y in the sequence of charac-
`ters that comprisc a body of data. then we say x is to the left
`of y. and y is to the right of x.
`The size of a body of data is the number of cltamctcrs that
`cornprisc it. and corresponds to the sire o t‘the memory used
`to store it (if the body of data resides in two or more
`disconnected portions of memory, the size of the body oi‘
`data is the sum of the size of each of the portions of memory
`it occupies].
`When we refer to the amount of memory used when
`decoding for in-place dilfcrential compression of a body of
`data '1' with respect
`to a body of data
`it
`is always
`understood that we are referring to the memory used to store
`portions ofS and T. and that there may be an additional fixed
`amount of memory used to contain decoding instnictions
`and local variables used for decoding (indices for loops.
`etc.).
`
`Sliding Window Compression
`Refening to FIG. 2. to help describe our invention. we
`lirst review the well known method in the prior art of sliding
`window compression. The standard UNIX gzip utility is an
`example of a sliding window compressor/decomprcssor.
`Given a string. sliding window compression represents it
`with a smaller string by replacing substrings by a pointer
`consisting of a pair of integers (d.l) where d is a displace-
`ment back in a window of the last N characters and l is the
`length of an identical substring. The reduction in size
`achieved on a string depends on how often suhstrings are
`repeated in the text. how the pairs (d.l) are coded. etc.
`Typically “greedy” matching is used (always take the long-
`est match possible).
`There are many ways that have been proposed to encode
`pointers. A simple method based on fixed length codes is:
`Displacerneitls are represented with |' log1(N)] bits.
`Lengths range from 3 to some upper limit Maxlcngtlt.
`which can be represented by the MaxLcngth-2 integers
`0 through MaxI.ength-3. and use |' log3(Maxl..ength-2)]
`bits.
`
`An initial flag bit distinguish a pointer (to a substring of
`3 or more characters) front a code for a single character.
`wlterc the leading bit is a O ifthe next [ log3(A)] bits
`to follow are a character and a I if the next [ log;,(N)]+[
`
`

`
`5
`
`6
`
`US 7,0?9,05] B2
`
`log3(_MaxLength-2)] bits to follow are a displacement
`and length. where A denotes the size of the input
`alphabet.
`For example, ifN=4.095. MaxLength'—-10, zutd we use the
`I28 character ASCII alphabet (A=l23]. a single char-
`acter is represented with one byte (a [lag bit and 7
`ASCII bits) and a pointer uses two bytes (21 flag bit. 12
`displacement bits. and 3 length bits).
`()ther methods may use variable lettgth codes [some of
`which may be able to represent a string of two bytes with
`less than 16 bits).
`It
`is also possible to employ an ofl'-the-shelf variable
`length coder to encode fields of a pointer. For example. to
`compress data where very large matches may be present. one
`could use the following coding method for pointers that
`employs an arithmetic coder:
`The pointer format begins with a control integer between
`0 and 4 that indicates one of the following cases:
`Case 0: No displacement or length: the hits to ibllow
`indicate a single raw character.
`(‘else 1; Displacctncnts <4,096 that point to a match of
`length <256.
`Case 2: Displacements between 4 KB and 36 KB that
`point to a match of length <256.
`Casc 3: Displacements larger than 36 [(13 that point to
`a match of length <256.
`Casc 4: Displacements larger than 36 KB that point to
`a match of length >256.
`Separate oil‘-the-shelf arithmetic encoders are used to
`encode control integers. the raw character for Case 0,
`and the length fields for Cases I through 3.
`A fixed code of either 12 or 15 bits is used for the
`
`displacements for Cases 1 and 2. A fixed length [
`log2(N)] code is used for the displacement fields of
`Cases 3 and 4 and for the and length field of Case 4.
`The Present Invention
`
`A problem with previous methods for ditferential cont-
`pression is their use of more computational resources {e.g..
`time and memory) than may be necessary and/or the need to
`sacrifice achievable compression in order to achieve accept-
`able resources. Here we propose a high performance fast
`method for perfonning differential compression in place.
`The solution proposed here uses the ln-Place Sliding
`Window (_lPSW) method to compress a body of data 'I' of
`size n with respect to a body of data S of size in:
`IPSW Encoding Algorithm:
`Step 1: Append T to the end of 3.
`Step 2: Compress T with a sliding window of size
`MAX{rn.n}.
`When the encoder slides a window ofsizc MAX{n1.n}. if
`is equal to or larger than n. then only at the start can
`111
`pointers reach all the way back to the start of S. and as we
`move to the right. more of 8 becomes inaccessible. If m is
`less than n. then after producing the lirst n—m characters of
`T. it again becomes the case that pointers cannot reach. all the
`way back to the start of S. In either case. by overwriting. the
`no longer accessible portion of S from lcti to right, decoding
`can be done in-place. using MAX{m.n} memory (encoding
`may use m+n memory). Tltat
`is, each time an additional
`pointer is received by the decoder.
`it
`is decoded and the
`window slides right to add characters to the right‘ of T and
`remove them from the left of S. If m<n. then decoding
`increases the size m ofthc memory to store S to a sin: 11 of
`memory to store '1'. lfm>n, then alter decoding is complete.
`
`T resides i11 what used to be the rightmost 11 characters ofS.
`and the memory for what used to be the first m—n characters
`of S can be reclaimed.
`
`the decoding may use a fixed amount of
`Of cotmse.
`memory for the decoding instructions (which could be
`read-only rncrttory or hardware) and a fixed amount of
`memory to contain a fixed number of local variables used for
`decoding (indices of loops. etc._). This fixed amount of
`memory is independent of m and n. Using big 0 notation.
`the memory used for decoding. instructions is O(l ) and the
`number of local variables is 0(1). When we say that decod-
`ing can be done in-place using MAX{m,n} memory. it is
`always understood that
`the memory used for decoding
`instructions and local variables is
`in addition to this
`
`ll]
`
`15
`
`MAX{tn.n} memory.
`Referring to FIG. 1. depicted is encoding (top) and
`decoding (bottom) for when the size of T is 50 percent larger
`than the size of S. and we are at the point when n—m+x
`characters ofT have been encoded. The hatched region is all
`. J
`1: of S on the top and the remaining ponion ol'S on the bottom.
`The lightly shaded region is the n—m characters of T that
`have already been encoded and decoded by the decoder
`without having to overwrite S. The dark shaded region is the
`portion ofT that has already been encoded by overwriting
`the first x characters of S. The remaining 111-): characters of
`T have yet to be encoded. and hence the rightmost 111-):
`characters of S are still available to the decoder. The
`dccoder‘s window can be viewed as two pieces: the already
`decompressed poniou of T (the lightly shaded and darkly
`shaded regions) that
`is to the lcft of the pointer and the
`remaining portion of S that is to the right of the pointer (but
`was to the Ielt of the lightly shaded and darkly shaded
`regions when encoding). So for each pointer decoded. at
`most two string copies must be pcrformect a single copy
`_ when the match is contained entirely in S (hatched region)
`or entirely in T (lightly shaded and darkly shaded regions)
`or two copies when the match starts in S and ends in T (the
`encoder encodes a match that crosses the boundary between
`the hatched region and the lightly shaded region). Since in
`many practical applications matches that cross this boundary
`are really just “lucl.y" (i .e.. we may be primarily looking for
`large matches to S and if they are not found then a match in
`T that is a shorter distance away). an alternate embodiment
`ofthis invention is to forbid copies that cross this boundary.
`in order to further simplify decoding.
`Any reasonable implementation of sliding window encod-
`ing can be used that allows for large matches. For example.
`the UNIX grip utility that uses a window of size 32K and
`maximum match length 256; it is easy to modify the UNIX
`grip utility to use a simple escape code for long pointers
`(e.g. reserve the longest possible length or displacement for
`the escape code) so that the same algorithm can be used for
`normal matches and the increased pointer length is only
`associated with large matches.
`Another object of this invention is. when in some appli-
`cations it may be that there is some additional amount of
`memory K available. to take advantage of this additional
`memory to improve the amottnt of compression achieved. It
`could be that K is a relatively small amount of memory that
`grows larger as m and it grow larger. such as K being
`O(y'IVllN{[t1.l]}) or even K being O(log2(MlN{rrt.n})). or it
`could be that K is relatively large, but still not large enough
`to retain a separate copy of S while decoding (that
`is.
`K<MlN {m.n} ). such as 50 percent of the additional memory
`needed {that is. K IMIN{m.n}/2) or even 90% of the addi-
`tional memory needed (that
`is. K=(9/l0)MlN{m.n}). An
`additional amount of memory K can be utilized to poten-
`
`30
`
`40
`
`45
`
`so
`
`55
`
`EAIJ
`
`65
`
`

`
`US 7.0?9,05l B2
`
`7
`tially improve compression by lengthening the sliding win-
`dow to size MAX{nt.n}+K. thus delaying by an additional
`K characters the point at which S begins to be overwritten
`when decoding. That
`is. encoding begins with S as the
`rightmost characters of a window ofsize MAX(_n1.n)-t-K.
`In many practical applications. such as distribution of
`successive versions of a software application, S and '1' are
`highly similar and reasonably well aligned: that is. large
`matches between S and '1' occur in approximately the same
`relative order. In this case. IPSW can be a fast method that
`performs in-place as well as methods that are not in-place.
`Another object of this invention is
`to achieve good
`perfomtance even when S and Tare not well aligned. by
`preceding IPSW with an additional step to improve align-
`ment. Poor alignment could happen, for example, with a
`sofiware update where for some reason the compiler moved
`large blocks of code around. An extreme case is when the
`first and second halves of S have been exchanged to form T
`(S—:uv and T:vu): to decompress T the IPSW algorithm
`overwrites u as it copies v. and then the ability to represent
`v with a single copy is lost. Rather than modify IPSW
`(which is already a very fast and practical method that
`suflices for many and perhaps most practical inputs). we
`propose a preprocessing stage for IPSW that moves sub-
`strings within S to better align S with T. The compressed
`format can incorporate a way for the decoder to determine
`whether preprocessing was performed (e.g.. compressed
`data can be begin with an initial bit indicating whether
`preprocessing has occurred). The encoder can compress T
`with IPSW and compare that to compressing T with respect
`to S not in place (so at all times all substrings of S are
`available to be copied). If the difference is significant, this
`initial bit can be set. alignment preprocessing performed.
`and a list of moves prepended the normal IPSW encoding.
`The decoder can perform the moves and then proceed
`in-place with normal IPSW decoding.
`lfthe encoder detemiines that S and ‘litre not well aligned.
`then the goal of prcprocessing for the IPSW algorithm is to
`find a minimal set of substring moves to convert S to a new
`string S‘ that is well aligned with T. We litnit our attention
`to moves that are non-overlapping. where the moves define
`a parsing ofS and T into matched blocks. {b,.}_,.
`1
`g
`A
`g
`,, and
`.l““k b]°°k5- lxvyri/=-1 .
`.
`.
`r-3 11131 i5- S=Xo‘baut'7‘t‘bo(2r'x2 -
`-
`-
`x,_.-bum-x,. and 'f=yu-b,~y,-b3~y._, .
`.
`. y,,_, -hr-y,. When using
`the sliding window method. we would like to copy sub-
`strings ofS only from the pan that was not yet overwritten
`by the reconstmction of T. That is. we would like to perform
`only left copies. ie.. a copy (s,. d,.. I,) that copies a substring
`with 1, characters from position s, to position (1,. that satisfies
`s,§d,..
`We can improve upon the idea of greedy block-move edit
`distance computation proposed in Shapira and Storer [2002]
`by using two different kinds of rearrangements of the blocks.
`Moves rearrange the blocks {b,},._.. _
`_ .,. so that they occur in
`S and T at the same order. Jigglings move the junk blocks of
`S backwards. so that. as a side affect. the matched blocks are
`moved forwards.
`
`Move Preprocessing Algorithm:
`Step l: Find Non Overlapping Common Subsrrings
`(NOCS) ofS and T from longest to shortest down to a
`tninimurn length stoplength. These are the {b,},. .1 ‘
`_
`_
`,.
`blocks.
`
`Step 2: Compute the minimum number ofrearrangetnents
`(moves attdjigglirtgs) in S so that the blocks {bi},-__,
`_
`_ ,
`are left copies within S and T.
`
`S
`
`lit
`
`15
`
`8
`Step 3: Generate 3' according to step 2.
`Step 4: Compute lPSW(S'.T)_
`Step 1 can be pcrfonned by repeatedly applying a longest
`common substring computation (see. for example. the text
`book Arr Immducrion to Data Slmc-lures undA1goriIhm.t'. by
`J. A. Storer. Birkhauserr‘Springer. 2002). or by using the
`linear time method of M. Meycrovich, D. Shapira, and J. A.
`Store: [“l’inding Non-Overlapping Common Substrings in
`Linear Time". Technical Report CS-O3-236, Comp. Science
`Dcpt.. Brandeis University): minlength can be tuned for
`diifcrcnt

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket