throbber
RESEARCH CONTRJBt/TIONS
`
`Algorithms and
`Data Structures
`
`Da11iel Sltator
`Editor
`
`Data Compression with Finite
`Windows
`
`EDWARD R. FIALA and DANIEL H. GREENE
`
`ABSTRACT: Several methods are presented for adaptive,
`i11vertiblt data comJ•rtssiorr in the style of Lempel's and
`Ziv's first textual substitution proposal. For the first two
`methods, the article describes modifications of McCreight's
`suffix tree data strudure that support cyclic maintena11ce of
`a window on the most recent source characters. A
`percolating update i,; used to keep node positions within the
`window, 11nd the upol11ting process is shown to have constant
`11mortized cost. Otha methods explore the tradeoffs between
`compression time, eJpansion time, data structure size, and
`amount of compress1on achieved. The article includes a
`graph-theoretic anal1sis of the compression penalty incurred
`by our codeword sel• ction policy in comparison with a11
`optimal policy, and 11 iucludes empirical studies of the
`performance of various adaptive compressors from the
`literature.
`
`1. INTRODUCTION
`Compression is the coding of data to minimize its repre(cid:173)
`sentation. In this a1 tide, we are concerned with fast.
`one-pass, adaptive, invertible (or lossless) methods of
`digital compression which have reasonable memory
`requirements. Suet methods can be used, for example,
`to reduce the storaHe requirements for files, to increase
`the communicatior rate over a channel, or to reduce
`redundancy prior t•> encryption for greater security.
`By "adaptive" WE mean that a compression method
`should be widely a:>plicable to different kinds of source
`data. Ideally, it sho J!d adapt rapidly to the source to
`achieve significant compression on small files, and it
`should adapt to an} subsequent internal changes in the
`nature of the sourco:. In addition, it should achieve very
`high compression a;ymptotically on large regions with
`stationary statistics
`All the compression methods developed in this article
`are subslilution11l. T.,rpically. a substitutional compressor
`functions by replac: ng large blocks of text with shorter
`references to earlie: occurrences of identical text (3, 5,
`29, 34. 36. 39. 41-4: ]. (This is often called Ziv-Lempel
`compression, in rec>gnilion of their pioneering ideas.
`Ziv and Lempel. in £act, proposed two methods. The
`unqualified use of tbe phrase "Ziv-Lempel compres(cid:173)
`sion" usually refers to their second proposal (43). In this
`
`C 1989 ACM 0001-0782/ 811 '0300.()490 $1.50
`
`article, we will be primarily concerned with. their first
`proposal [42).) A popular alternative to a substitutional
`compressor is a statistical compressor. A symbol wise
`statistical compressor functions by accurately predict(cid:173)
`ing the probability of individual symbols, and then en(cid:173)
`coding these symbols with space close to -log2 of the
`predicted probabilities. The encoding is accomplished
`with either Huffman compression (17] which has re(cid:173)
`cently been made one-pass and adaptive (11, 22, 37),
`or with arithmetic coding, as described in [1. 14, 20, 25,
`26, 31-33]. The major challenge of a statistical com(cid:173)
`pressor is to predict the symbol prob&bilities. Simple
`strategies, such as keeping zero-order (single symbol) or
`first-order (symbol pair) statistics of the input, do not
`compress English text very well. Several authors have
`had success gathering higher-order statistics. but this
`necessarily involves higher memory costs and addi(cid:173)
`tional mechanisms for de11ling with situatiuns when~
`higher-order statistics are not available (6. 7, 26).
`It is hard to give a rigorous foundation to the substi(cid:173)
`tutional vs. statistical distinction described above. Sev(cid:173)
`eral authors have observed that statistical methods can
`be used lo simulate textual substitution, suggesting that
`the statistical category includes the substitutional cate(cid:173)
`gory [4, 24). However. this takes no account of the sim(cid:173)
`plicity of mechanism; the virtue of textual substitution
`is that it recognizes and removes coherence on a large
`scale, oftentimes ignoring the smaller scale statistics. As
`a result, most textual substitution compressors pro-
`cess their compressed representation in larger blocks
`than their statistical counterparts. thereby gaining a sig(cid:173)
`nificant speed advantage. II was previously believed
`that the speed gained by textual substitution would
`necessarily cost something in compression achieved.
`We were surprised to discover that with careful atten(cid:173)
`tion to coding, textual substitution compressors can
`match the compression performance of the best statisti(cid:173)
`cal methods.
`Consider the following scheme, which we wiU im(cid:173)
`pro\'e later in the article. Compressed files contain two
`types of codewords:
`
`literal x
`
`pass the next ;r characters directly into the
`uncompressed output
`copy x, - y go back y characters in the output and copy
`x characters forward to the current position.
`
`490
`
`Communications of tht ~CM
`
`April 1989 Volumt n Numbtr 4
`
`Page 1 of 16
`
`

`
`Research Contributions
`
`So, for example, the following piece of literature:
`
`IT WAS THE BEST OF TIMES,
`IT WAS THE WORST OF TIMES
`
`would compress to
`
`(literal 26)IT WAS THE BEST OF TIMES,
`(copy 11-26)(literal 3)WOR(copy 11 -27)
`
`The compression achieved depends on the space re(cid:173)
`quired for the copy and literal codewords. Our simplest
`scheme, hereafter denoted At, uses 8 bits for a literal
`codeword and 16 for a copy codeword. If the first 4 bits
`are 0, then the codeword is a literal; the next 4 bits
`encode a length x in the range (1 .. 16] and the follow(cid:173)
`ing x characters are literal (one byte per character).
`Otherwise, the codeword is a copy; the first 4 bits
`encode a length x in the range [2 .. 16] and the next
`12 bits are a displacement y in the range [1 .. 4096]. At
`each step, the policy by which the compressor chooses
`between a literal and a copy is as follows: If the com(cid:173)
`pressor is idle (just finished a copy, or terminated a
`literal because of the 16-character limit), then the long(cid:173)
`est copy of length 2 or more is issued; otherwise, if the
`longest copy is less than 2 long, a literal is started. Once
`started, a literal is extended across subsequent charac(cid:173)
`ters until a copy of length 3 or more can be issued or
`until the length limit is reached.
`At would break the first literal in the above example
`into two literals and compress the source from 51 bytes
`down to 36. At is close to Ziv and Lempel's first textual
`substitution proposal [42]. One difference is that At
`uses a separate literal codeword, while Ziv and Lempel
`combine each copy codeword with a single literal char(cid:173)
`acter. We have found it useful to have longer literals
`during the startup transient; after the startup, it is bet(cid:173)
`ter to have no literals consuming space in the copy
`codewords.
`Our empirical studies showed that, for source code
`and English text, the field size choices for At are good;
`reducing the size of the literal length field by 1 bit
`increases compression slightly but gives up the byte(cid:173)
`alignment property of the At codewords. In short, if
`one desires a simple method based upon the copy and
`literal idea, At is a good choice.
`At was designed for 8-bit per character text or pro(cid:173)
`gram sources, but, as we will see shortly, it achieves
`good compression on other kinds of source data, such as
`compiled code and images, where the word model does
`not match the source data particularly well, or where
`no model of the source is easily perceived. At is, in
`fact, an excellent approach to general purpose data
`compression. In the remainder of this article, we will
`study At and several more powerful variations.
`
`2. OVERVIEW OF THE DATA STRUCTURE
`The fixed window suffix tree of this article is a modifi(cid:173)
`cation of McCreight's suffix tree [28] (see also [21, 34,
`38]), which is itself a modification of Morrison's PATRI(cid:173)
`CIA tree [30], and Morrison's tree is ultimately based on
`a Trie data structure [22, page 481 ). We will review
`each of these data structures briefly.
`
`A Trie is a tree structure where the branching occurs
`according to "digits" of the keys, rather than according
`to comparisons of the keys. In English, for example, the
`most natural "digits" are individual letters, with the lth
`level of the tree branching according to the lth letter of
`the words in the tree.
`In Figure 1, many internal nodes are superfluous,
`having only one descendant. If we are building an
`index for a file, we can save space by eliminating the
`superfluous nodes and putting pointers to the file into
`the nodes rather than including characters in the data
`structure. In Figure 2, the characters in parentheses are
`not actually represented in the data structure, but they
`can be recovered from the (position, level) pairs in the
`nodes. Figure 2 also shows a suffix pointer (as a dark
`right arrow) that will be explained later.
`Figure 2 represents some, but not all, of the innova(cid:173)
`tions in Morrison's PATRICIA trees. He builds the trees
`with binary "digits" rather than full characters, and this
`allows him to save more space by folding the leaves
`into the internal nodes. Our "digits" are bytes, so the
`branching factor can be as large as 256. Since there are
`rarely 256 descendants of a node, we do not reserve
`that much space in each node, but instead hash the
`arcs. There is also a question about when the strings in
`parentheses are checked in the searching process. In
`what follows, we usually check characters immediately
`when we cross an arc. Morrison's scheme can avoid file
`access by skipping the characters on the arcs, and doing
`only one file access and comparison at the end of the
`search. However, our files will be in main memory, so
`this consideration is unimportant. We will use the sim(cid:173)
`plified tree depicted in Figure 2.
`For At, we wish to find the longest (up to 16 charac(cid:173)
`ter) match to the current string beginning anywhere in
`the preceding 4096 positions. If all preceding 4096
`strings were stored in a PATRICIA tree with depth
`d = 16, then finding this match would be straightfor(cid:173)
`ward. Unfortunately, the cost of inserting these strings
`can be prohibitive, for if we have just descended d
`levels in the tree to insert the string starting at position
`i then we will descend at least d - 1 levels inserting the
`string at i + 1. In the worst case this can lead to O(nd)
`
`A
`
`S
`
`S
`
`T
`
`R
`
`A
`
`STRAY
`
`STRIDE
`
`T
`
`R
`
`A
`
`ASTRAY
`
`ASTRIDE
`
`FIGURE 1. A Trie
`
`Apri/1989 Volume 32 Number 4
`
`Communications of the ACM
`
`491
`
`Page 2 of 16
`
`

`
`Research Contributions
`
`insertion time for a file of size n. Since later encodings
`will use much laqer values ford than 16, it is impor(cid:173)
`tant to eliminate tl from the running time.
`To insert the strings in O(n) time, McCreight added
`additional suffix l'ointers to the tree. Each internal
`node, representing the string aX on the path from the
`root to the intern< I node, has a pointer to the node
`representing X, th 3 string obtained by stripping a single
`letter from the be1~nning of aX. If a string starting at i
`has just been inse:·ted at level d we do not need to
`return to the root to insert the string at i + 1; instead,
`a nearby suffix po .nter will lead us to the relevant
`branch of the tree.
`Figure 3 shows how suffix links are created and used.
`On the previous it3ration, we have matched the string
`aXY, where a is a single character, X and Yare strings,
`and b is the first u 1matched character after Y. Figure 3
`shows a complicat 3d case where a new internal node,
`a, has been added to the tree, and the suffix link of a
`must be computed. We insert the next string XYb by
`going up the tree t'J node (3, representing the string aX,
`and crossing its su:'fix link to ')1, representing X. Once
`we have crossed tl.e suffix link, we descend again in
`the tree, first by "nscanning" the string Y, and then by
`"scanning" from o mtil the new string is inserted. The
`first part is called ' rescanning" because it covers a por(cid:173)
`tion of the string that was covered by the previous
`insert, and so it do,~s not require checking the internal
`strings on the arcs. (In fact, avoiding these checks is
`essential to the linoar time functioning of the algo(cid:173)
`rithm.) The rescan either ends at an existing node o, or
`o is created to insert the new string XYb; either way we
`have the destination for the suffix link of a. We have
`restored the invariant that every internal node, except
`possibly the one ju:;t created, has a suffix link.
`For the At compressor, with a 4096-byte fixed win(cid:173)
`dow, we need a way to delete and reclaim the storage
`for portions of the ~ uffix tree representing strings fur(cid:173)
`ther back than 4091i in the file. Several things must be
`added to the suffix tree data structure. The leaves of
`the tree are placed in a circular buffer, so that the
`oldest leaf can be identified and reclaimed, and the
`internal nodes are !:iven "son count" fields. When an
`
`9
`10
`1
`2
`File: ASTRIDE ASTRAY
`
`FIGURE 2. A IJATRICIA Tree with a Suffix Pointer
`
`A
`
`' X
`\
`'
`
`aX
`
`;
`I
`;
`
`y
`
`' \ y
`'
`
`rescan
`
`... " .... , ...... ,." ................. .,., ........... ," .. A'
`' 'o
`
`scan
`
`' \
`
`K
`
`FIGURE 3. Building a Suffix Tree
`
`internal "son count" falls to one, the node is deleted
`and two consecutive arcs are combined. In Section 3, it
`is shown that this approach will never leave a "dan(cid:173)
`gling" suffix link pointing to deleted nodes. Unfortu(cid:173)
`nately, this is not the only problem in maintaining a
`valiid suffix tree. The modifications that avoided a re(cid:173)
`turn to the root for each new insertion create havoc for
`deletions. Since we have not always returned to the
`root, we may have consistently entered a branch of the
`tree sideways. The pointers (to strings in the 4096-byte
`window) in the higher levels of such a branch can be(cid:173)
`come out-of-date. However, traversing the branch and
`updating the pointers would destroy any advantage
`gained by using the suffix links.
`We can keep valid pointers and avoid extensive up(cid:173)
`dating by partially updating according to a percolating
`update. Each internal node has a single "update" bit. If
`the update bit is true when we are updating a node,
`then we set the bit false and propagate the update re(cid:173)
`cursively to the node's parent. Otherwise, we set the bit
`true and stop the propagation. In the worst case, a long
`string of true updates can cause the update to propagate
`to the root. However, when amortized over all new
`leaves, the cost of updating is constant, and the effect of
`updating is to keep all internal pointers on positions
`within the last 4096 positions of the file. These facts
`will be shown in Section 3.
`We can now summarize the operation of the inner
`loop, using Figure 3 again. If we have just created node
`a, then we use a's parent's suffix link to find 'Y· From 'Y
`we move down in the tree, first rescanning, and then
`scanning. At the end of the scan, we percolate an up(cid:173)
`date from the leaf, moving toward the root, setting the
`position fields equal to the current position, and setting
`the update bits false, until we find a node with an
`update bit that is already false, whereupon we set that
`node's update bit true and stop the percolation. Finally,
`we go to the circular buffer of leaves and replace the
`oldest leaf with the new leaf. If the oldest leaf's parent
`has only one remaining son, then it must also be de-
`
`492
`
`Communications of the 1lCM
`
`April 1989 Volume 32 Number 4
`
`Page 3 of 16
`
`

`
`Research Contributions
`
`{3 farthest from the root that does not propagate an
`update to its parent. If {3 has at least two children that
`have remained for the entire period, then {3 must have
`received updates from these nodes: they are farther
`from the root. If {3 has only one remaining child, then
`it must have a new child, and so it will still get
`two updates. (Every newly created arc causes a son
`to update a parent, percolating if necessary.) Similarly,
`two new children also cause two updates. By every
`accounting, {3 will receive two updates during the
`period, and thus propagate an update-contradicting
`our assumption of {J's failure to update its parent.
`
`There is some flexibility on how updating is handled.
`We could propagate the current position upward before
`rescanning, and then write the current position into
`those nodes passed during the rescan and scan; in this
`case, the proof of Theorem 3 is conservative. Alterna(cid:173)
`tively, a similar, symmetric proof can be used to show
`that updating can be omitted when new arcs are added
`so long as we propagate an update after every arc is
`deleted. The choice is primarily a matter of implemen(cid:173)
`tation convenience, although the method used above is
`slightly faster.
`The last major theoretical consideration is the effec(cid:173)
`tiveness of the At policy in choosing between literal
`and copy codewords. We have chosen the following
`one-pass policy for At: When the encoder is idle, issue
`a copy if it is possible to copy two or more characters;
`otherwise, start a literal. If the encoder has previously
`started a literal, then terminate the literal and issue a
`copy only if the copy is of length three or greater.
`Notice that this policy can sometimes go astray. For
`example, suppose that the compressor is idle at position
`i and has the following copy lengths available at subse(cid:173)
`quent positions:
`i + 1
`3
`
`i+2 i+3 i+4 i+5
`16
`15
`14
`13
`
`(1)
`
`1
`
`leted; in this case, the remaining son is attached to its
`grandparent, and the deleted node's position is perco(cid:173)
`lated upward as before, only at each step the position
`being percolated and the position already in the node
`must be compared and the more recent of these sent
`upward in the tree.
`
`3. THEORETICAL CONSIDERATIONS
`The correctness and linearity of suffix tree construction
`follows from McCreight's original paper [28]. Here we
`will concern ourselves with the correctness and the
`linearity of suffix tree destruction-questions raised in
`Section 2.
`
`THEOREM 1. Deleting leaves in FIFO order and deleting
`internal nodes with single sons will never leave dangling
`suffix pointers.
`
`PROOF. Assume the contrary. We have a node a
`with a suffix pointer to a node o that has just been
`deleted. The existence of a means that there are at
`least two strings that agree for l positions and then
`differ at I + 1. Assuming that these two strings start at
`positions i and j, where both i and j are within the
`window of recently scanned strings and are not equal
`to the current position, then there are are two even
`younger strings at i + 1 and j + 1 that differ first at I.
`This contradicts the assumption that o has one son. {If
`either i or j are equal to the current position, then a is
`a new node, and can temporarily be without a suffix
`pointer.)
`
`There are two issues related to the percolating update:
`its cost and its effectiveness.
`
`THEOREM 2. Each percolated update has constant amor(cid:173)
`tized cost.
`
`PROOF. We assume that the data structure contains
`a "credit" on each internal node where the "update"
`flag is true. A new leaf can be added with two "credits."
`One is spent immediately to update the parent, and the
`other is combined with any credits remaining at the
`parent to either: 1) obtain one credit to leave at the
`parent and terminate the algorithm or 2) obtain two
`credits to apply the algorithm recursively at the parent.
`This gives an amortized cost of two updates for each
`new leaf.
`
`For the next theorem, define the "span" of a suffix tree
`to be equal to the size of its fixed window. So far we
`have used examples with "span" equal to 4096, but the
`value is flexible.
`
`THEOREM 3. Using the percolating update, every inter(cid:173)
`nal node will be updated at least once during every period of
`length "span."
`
`It is useful to prove the slightly stronger re(cid:173)
`PROOF.
`sult that every internal node (that remains for an entire
`period) will be updated twice during a period, and thus
`propagate at least one update to its parent. To show a
`contradiction, we find the earliest period and the node
`
`Under the policy, the compressor encodes position i
`with a literal codeword, then takes the copy of length 3,
`and finally takes a copy of length 14 at position i + 4. It
`uses 6 bytes in the encoding:
`
`(literal1)X(copy 3 - y)(copy 14 - y)
`
`If the compressor had foresight it could avoid the
`copy of length 3, compressing the same material into
`5 bytes:
`
`(literal 2)XX(copy 16 - y)
`
`The optimal solution can be computed by dynamic
`programming [36]. One forward pass records the length
`of the longest possible copy at each position (as in equa(cid:173)
`tion 1) and the displacement for the copy (not shown in
`equation 1). A second backward pass computes the
`optimal way to finish compressing the file from each
`position by recording the best codeword to use and the
`length to the end-of-file. Finally, another forward pass
`reads off the solution and outputs the compressed file.
`However, one would probably never want to use dy-
`
`April1989 Volume 32 Number 4
`
`Communications of the ACM
`
`493
`
`Page 4 of 16
`
`

`
`Research Contributions
`
`namic programming since the one-pass heuristic is a lot
`faster, and we estimated for several typical files that
`the heuristically compressed output was only about
`1 percent larger than the optimum. Furthermore, we
`will show in the Iemainder of this section that the size
`of the compressed file is never worse than % the size of
`the optimal soluti:m for the specific At encoding.
`This will require ·ieveloping some analytic tools, so the
`non-mathematica. reader should feel free to skip to
`Section 4.
`
`The following def.nitions are useful:
`Definition. F(i I is the longest feasible copy at position i
`in the file.
`
`Sample F(i)'s were· given above in equation 1. They are
`dependent on the encoding used. For now, we are as(cid:173)
`suming that they are limited in magnitude to 16, and
`must correspond tJ copy sources within the last 4096
`characters.
`
`Definition. B(i 1 is the size of the best way to compress
`the remainder of th1~ file, starting at position i.
`
`B(i)'s would be computed in the reverse pass of the
`optimal algorithm outlined above.
`
`The following T 1eorems are given without proof:
`
`THEOREM. F(i -t 1) ~ F(i) - 1.
`
`Ther.~ exists an optimal solution where copies
`THIWREM.
`are longest possible (i.e., only copies corresponding to F(i)'s
`are used).
`
`THEOREM. B(i) lS monotone decreasing.
`
`THEOREM. Any :wlution can be modified, without affect(cid:173)
`ing length, so that (literal Xt) followed immediately by
`(literal x2 ) implies lhat x1 is maximum (in this case 16).
`We could continue to reason in this vein, but there is
`an abstract way of looking at the problem that is both
`clearer and more general. Suppose we have a non(cid:173)
`deterministic finitE' automaton where each transition is
`given a cost. A simple example is shown in Figure 4.
`The machine accepts (a+ b)*, with costs as shown in
`parentheses.
`The total cost of accepting a string is the sum of the
`
`Start I b (3)
`IX
`
`FIGURE 4. A Nondeterministic Automaton with Transition Costs
`
`transition costs for each character. (While it is not
`important to our problem, the optimal solution can be
`computed by forming a transition matrix for each let(cid:173)
`te:r, using the costs shown in parentheses, and then
`multiplying the matrices for a given string, treating the
`coefficients as elements of the closed semiring with op(cid:173)
`erations of addition and minimization.) We can obtain a
`solution that approximates the minimum by deleting
`transitions in the original machine until it becomes a
`deterministic machine. This corresponds to choosing a
`policy in our original data compression problem. A pol(cid:173)
`icy for the machine in Figure 4 is shown in Figure 5.
`We now wish to compare, in the worst case, the dif(cid:173)
`ference between optimally accepting a string with the
`non-deterministic machine, and deterministically ac(cid:173)
`cepting the same string with the "policy" machine. This
`is done by taking a cross product of the two machines,
`as shown in Figure 6.
`In Figure 6 there are now two weights on each transi(cid:173)
`tion; the first is the cost in the non-deterministic graph,
`and the second is the cost in the policy graph. Asymp(cid:173)
`totically, the relationship of the optimal solution to the
`poiicy solution is dominated by the smallest ratio on a
`cycle in this graph. In the case of Figure 6, there is a
`cycle from 1, 1' to 1, 2' and back that has cost in the
`non-deterministic graph of 2 + 1 = 3, and cost in the
`policy graph of 3 + 3 = 6, giving a ratio of%. That is,
`
`Start
`
`)'''
`
`FIGURE 5. A Deterministic "Policy" Automaton for Figure 4
`
`the policy solution can be twice as bad as the optimum
`on the string ababababab . ...
`In general, we can find the cycle with the smallest
`ratio mechanically, using well known techniques (8,
`27]. The idea is to conjecture a ratio r and then reduce
`the pairs of weights (x, y) on the arcs to single weights
`x - ry. Under this reduction, a cycle with zero weight
`has ratio exactly r. If a cycle has negative weight, then r
`is too large. The ratio on the negative cycle is used as a
`new conjecture, and the process is iterated. !Negative
`cycles are detected by running a shortest path algo(cid:173)
`rithm and checking for convergence.) Once we have
`found the minimum ratio cycle, we can create a worst
`case string in the original automata problem by finding
`a path from the start state to the cycle and then repeat(cid:173)
`ing the cycle indefinitely. The ratio of the costs of ac-
`
`494
`
`Communications of the ACM
`
`Apri/1989 Volume 32 Number 4
`
`Page 5 of 16
`
`

`
`Start
`
`Research Contributions
`
`below, eliminates many more transitions:
`
`lo ~ /15 start a literal if i :s 1
`lx ~ lx-1 continue a literal if x 2: 1 and i :S 2
`'f .
`x ~ C;-1 start a copy I 1 2: 3 or x = 0 and i = 2
`l Pi(2)
`p,(O)
`•
`Ct ~ Cy-1 contmue a copy
`
`Finally, we add one more machine to guarantee that
`the strings of p; are realistic. In this machine, state s;
`means that the previous character wasp;, so the index
`of the next character must be at least p;- 1 :
`S; ..J!4 Si
`
`(j 2: i - 1)
`
`(5)
`
`The cross product of these three machines has ap(cid:173)
`proximately 17K states and was analyzed mechanically
`to prove a minimum ratio cycle of%. Thus the policy
`we have chosen is never off by more than 25 percent,
`and the worst case is realized on a string that repeats a
`p; pattern as follows:
`
`FIGURE 6. The Cross Product
`
`1
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`cepting the string non-deterministically and determin(cid:173)
`istically will converge to the ratio of the cycle. (The
`path taken in the cross product graph will not necessar(cid:173)
`ily bring us to the same cycle, due to the initial path
`fragment; we will, nevertheless, do at least as well.)
`Conversely, if we have a sufficiently long string with
`non-deterministic to deterministic ratio r, then the
`string will eventually loop in the cross product graph. If
`we remove loops with ratio greater than r we only im(cid:173)
`prove the ratio of the string, so we must eventually find
`a loop with ratio at least as small as r.
`The above discussion gives us an algorithmic way of
`analyzing our original data compression problem. The
`possible values of F(i) are encoded in a 17 character
`alphabet Po ... P16. representing the length of copy
`available at each position. The compression algorithm
`is described by a non-deterministic machine that ac(cid:173)
`cepts strings of Pi; this machine has costs equal to the
`lengths of the codewords used by the algorithm. There
`are two parameterized states in this machine: lx means
`that there is a literal codeword under construction with
`x spaces still available; cy means that a copy is in prog(cid:173)
`ress withy characters remaining to copy. The idle state
`is lo =Co. In the non-deterministic machine, the possi(cid:173)
`ble transitions are:
`
`start a literal
`continue a literal {x 2: 1)
`start a copy
`continue a copy
`
`(2)
`
`(An asterisk is used as a wild card to denote any state.)
`Based on the theorems above we have already elimi(cid:173)
`nated some transitions to simplify what follows. For
`example,
`

`I f
`'d
`l't
`p,(zJ l
`t
`Cy ~ 15 s art a I era
`rom msi e a copy (y 2: 1)
`
`(3)
`
`is unnecessary. The deterministic machine, given
`
`14 15
`10 11 12 13
`9
`8
`pg
`P4 P3 Pz
`P1
`Pz P10 P10
`
`(6)
`
`(There is nothing special about 10; it was chosen to
`illustrate a long copy and to match the example in
`Appendix A.) The deterministic algorithm takes a copy
`of length 10 in the first position, and then switches to a
`literal for positions 11 and 12. Five bytes are used in
`each repetition of the pattern. The optimal solution is
`one position out of phase. It takes a copy of length 10 in
`the second position, and then finds a copy of length 2 at
`position 12, for a total of four bytes on each iteration.
`We have abstracted the problem so that the possible
`copy operations are described by a string of Pi• and we
`have shown a pathological pattern of Pi that results in
`% of the optimal encoding. There might still be some
`doubt that such a string exists, since the condition that
`our third machine (5) guarantees, F(i + 1) 2: F(i) - 1, is
`a necessary but not sufficient condition. Nevertheless,
`the details of an actual pathological string can be found
`in Appendix A.
`
`4. A SIMPLER DATA STRUCTURE
`Although the quantity of code associated with A1 is not
`enormous, it is complicated, and the data structures are
`fairly large. In this section, we present simpler methods
`for finding the suffix and for propagating the window
`position.
`The alternative to a percolating update is to update
`the positions in all nodes back to the root whenever a
`new leaf is inserted. Then no updates are needed when
`nodes are deleted. The update flags can be eliminated.
`The alternative to suffix pointers is more compli(cid:173)
`cated. The cost of movement in a tree is not uniform;
`moving deeper requires a hash table lookup, which is
`more expensive than following a parent pointer. So we
`can determine the suffix by starting at the suffix leaf
`and following parent pointers back toward the root un-
`
`Apri/1989 Volume 32 Number 4
`
`Communications of the ACM
`
`495
`
`Page 6 of 16
`
`

`
`Research Contributions
`
`til the suffix nodo is reached. The suffix leaf is known
`because the strin.~ at i matched the string at some ear(cid:173)
`lier window position j; the suffix leaf j + 1 is the next
`entry in the leaf array. With this change, the suffix
`pointers can be e: imina ted.
`From a theoretical perspective, these modifications,
`which have O(nd 1 worst case performance for a file of
`size nand cut-off depth d, are inferior to the O(n) per(cid:173)
`formance of the Slffix tree. For At, with a cutoff of 16,
`these modifications improve average performance, but
`the A2 method di >cussed in the next section has such a
`deep cut-off that wfllx pointers and percolated updates
`are preferable.
`
`5. A MORE POWERFUL ENCODING
`The 4,096-byte window of At is roughly optimal for
`fixed size copy and literal codewords. Longer copies
`would, on average, be found in a larger window, but a
`larger displacement fleld would be required to encode
`them. To exploit a larger window, we must use a
`variable-width en,;oding statistically sensitive to the
`fact that recent window positions are more likely to be
`used by copy codewords than those positions further
`back. Similarly, it is advantageous to use variable(cid:173)
`width encodings for copy and literal lengths.
`There are several approaches we might use for vari(cid:173)
`able-length encoding. We could use fixed or adaptive
`Huffman coding, arithmetic encoding, a variable-length
`encoding of the in·:egers, or a manageable set of hand(cid:173)
`designed codeworcls. We eliminated from consideration
`adaptive Huffman and arithmetic coding because they
`are slow. Moreove.:, we felt they would provide (at best)
`a secondary adaptive advantage since the "front end"
`textual substitution is itself adapting to the input. We
`experimented witf. a fixed Huffman encoding, a hand(cid:173)
`designed family of codewords, and a variable-length
`encoding of the integers, so we will compare these
`options briefly:
`
`Hand-Designed Co:lewords. This is a direct generaliza(cid:173)
`ti

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket