`
`Algorithms and
`Data Structures
`
`Da11iel Sltator
`Editor
`
`Data Compression with Finite
`Windows
`
`EDWARD R. FIALA and DANIEL H. GREENE
`
`ABSTRACT: Several methods are presented for adaptive,
`i11vertiblt data comJ•rtssiorr in the style of Lempel's and
`Ziv's first textual substitution proposal. For the first two
`methods, the article describes modifications of McCreight's
`suffix tree data strudure that support cyclic maintena11ce of
`a window on the most recent source characters. A
`percolating update i,; used to keep node positions within the
`window, 11nd the upol11ting process is shown to have constant
`11mortized cost. Otha methods explore the tradeoffs between
`compression time, eJpansion time, data structure size, and
`amount of compress1on achieved. The article includes a
`graph-theoretic anal1sis of the compression penalty incurred
`by our codeword sel• ction policy in comparison with a11
`optimal policy, and 11 iucludes empirical studies of the
`performance of various adaptive compressors from the
`literature.
`
`1. INTRODUCTION
`Compression is the coding of data to minimize its repre(cid:173)
`sentation. In this a1 tide, we are concerned with fast.
`one-pass, adaptive, invertible (or lossless) methods of
`digital compression which have reasonable memory
`requirements. Suet methods can be used, for example,
`to reduce the storaHe requirements for files, to increase
`the communicatior rate over a channel, or to reduce
`redundancy prior t•> encryption for greater security.
`By "adaptive" WE mean that a compression method
`should be widely a:>plicable to different kinds of source
`data. Ideally, it sho J!d adapt rapidly to the source to
`achieve significant compression on small files, and it
`should adapt to an} subsequent internal changes in the
`nature of the sourco:. In addition, it should achieve very
`high compression a;ymptotically on large regions with
`stationary statistics
`All the compression methods developed in this article
`are subslilution11l. T.,rpically. a substitutional compressor
`functions by replac: ng large blocks of text with shorter
`references to earlie: occurrences of identical text (3, 5,
`29, 34. 36. 39. 41-4: ]. (This is often called Ziv-Lempel
`compression, in rec>gnilion of their pioneering ideas.
`Ziv and Lempel. in £act, proposed two methods. The
`unqualified use of tbe phrase "Ziv-Lempel compres(cid:173)
`sion" usually refers to their second proposal (43). In this
`
`C 1989 ACM 0001-0782/ 811 '0300.()490 $1.50
`
`article, we will be primarily concerned with. their first
`proposal [42).) A popular alternative to a substitutional
`compressor is a statistical compressor. A symbol wise
`statistical compressor functions by accurately predict(cid:173)
`ing the probability of individual symbols, and then en(cid:173)
`coding these symbols with space close to -log2 of the
`predicted probabilities. The encoding is accomplished
`with either Huffman compression (17] which has re(cid:173)
`cently been made one-pass and adaptive (11, 22, 37),
`or with arithmetic coding, as described in [1. 14, 20, 25,
`26, 31-33]. The major challenge of a statistical com(cid:173)
`pressor is to predict the symbol prob&bilities. Simple
`strategies, such as keeping zero-order (single symbol) or
`first-order (symbol pair) statistics of the input, do not
`compress English text very well. Several authors have
`had success gathering higher-order statistics. but this
`necessarily involves higher memory costs and addi(cid:173)
`tional mechanisms for de11ling with situatiuns when~
`higher-order statistics are not available (6. 7, 26).
`It is hard to give a rigorous foundation to the substi(cid:173)
`tutional vs. statistical distinction described above. Sev(cid:173)
`eral authors have observed that statistical methods can
`be used lo simulate textual substitution, suggesting that
`the statistical category includes the substitutional cate(cid:173)
`gory [4, 24). However. this takes no account of the sim(cid:173)
`plicity of mechanism; the virtue of textual substitution
`is that it recognizes and removes coherence on a large
`scale, oftentimes ignoring the smaller scale statistics. As
`a result, most textual substitution compressors pro-
`cess their compressed representation in larger blocks
`than their statistical counterparts. thereby gaining a sig(cid:173)
`nificant speed advantage. II was previously believed
`that the speed gained by textual substitution would
`necessarily cost something in compression achieved.
`We were surprised to discover that with careful atten(cid:173)
`tion to coding, textual substitution compressors can
`match the compression performance of the best statisti(cid:173)
`cal methods.
`Consider the following scheme, which we wiU im(cid:173)
`pro\'e later in the article. Compressed files contain two
`types of codewords:
`
`literal x
`
`pass the next ;r characters directly into the
`uncompressed output
`copy x, - y go back y characters in the output and copy
`x characters forward to the current position.
`
`490
`
`Communications of tht ~CM
`
`April 1989 Volumt n Numbtr 4
`
`Page 1 of 16
`
`
`
`Research Contributions
`
`So, for example, the following piece of literature:
`
`IT WAS THE BEST OF TIMES,
`IT WAS THE WORST OF TIMES
`
`would compress to
`
`(literal 26)IT WAS THE BEST OF TIMES,
`(copy 11-26)(literal 3)WOR(copy 11 -27)
`
`The compression achieved depends on the space re(cid:173)
`quired for the copy and literal codewords. Our simplest
`scheme, hereafter denoted At, uses 8 bits for a literal
`codeword and 16 for a copy codeword. If the first 4 bits
`are 0, then the codeword is a literal; the next 4 bits
`encode a length x in the range (1 .. 16] and the follow(cid:173)
`ing x characters are literal (one byte per character).
`Otherwise, the codeword is a copy; the first 4 bits
`encode a length x in the range [2 .. 16] and the next
`12 bits are a displacement y in the range [1 .. 4096]. At
`each step, the policy by which the compressor chooses
`between a literal and a copy is as follows: If the com(cid:173)
`pressor is idle (just finished a copy, or terminated a
`literal because of the 16-character limit), then the long(cid:173)
`est copy of length 2 or more is issued; otherwise, if the
`longest copy is less than 2 long, a literal is started. Once
`started, a literal is extended across subsequent charac(cid:173)
`ters until a copy of length 3 or more can be issued or
`until the length limit is reached.
`At would break the first literal in the above example
`into two literals and compress the source from 51 bytes
`down to 36. At is close to Ziv and Lempel's first textual
`substitution proposal [42]. One difference is that At
`uses a separate literal codeword, while Ziv and Lempel
`combine each copy codeword with a single literal char(cid:173)
`acter. We have found it useful to have longer literals
`during the startup transient; after the startup, it is bet(cid:173)
`ter to have no literals consuming space in the copy
`codewords.
`Our empirical studies showed that, for source code
`and English text, the field size choices for At are good;
`reducing the size of the literal length field by 1 bit
`increases compression slightly but gives up the byte(cid:173)
`alignment property of the At codewords. In short, if
`one desires a simple method based upon the copy and
`literal idea, At is a good choice.
`At was designed for 8-bit per character text or pro(cid:173)
`gram sources, but, as we will see shortly, it achieves
`good compression on other kinds of source data, such as
`compiled code and images, where the word model does
`not match the source data particularly well, or where
`no model of the source is easily perceived. At is, in
`fact, an excellent approach to general purpose data
`compression. In the remainder of this article, we will
`study At and several more powerful variations.
`
`2. OVERVIEW OF THE DATA STRUCTURE
`The fixed window suffix tree of this article is a modifi(cid:173)
`cation of McCreight's suffix tree [28] (see also [21, 34,
`38]), which is itself a modification of Morrison's PATRI(cid:173)
`CIA tree [30], and Morrison's tree is ultimately based on
`a Trie data structure [22, page 481 ). We will review
`each of these data structures briefly.
`
`A Trie is a tree structure where the branching occurs
`according to "digits" of the keys, rather than according
`to comparisons of the keys. In English, for example, the
`most natural "digits" are individual letters, with the lth
`level of the tree branching according to the lth letter of
`the words in the tree.
`In Figure 1, many internal nodes are superfluous,
`having only one descendant. If we are building an
`index for a file, we can save space by eliminating the
`superfluous nodes and putting pointers to the file into
`the nodes rather than including characters in the data
`structure. In Figure 2, the characters in parentheses are
`not actually represented in the data structure, but they
`can be recovered from the (position, level) pairs in the
`nodes. Figure 2 also shows a suffix pointer (as a dark
`right arrow) that will be explained later.
`Figure 2 represents some, but not all, of the innova(cid:173)
`tions in Morrison's PATRICIA trees. He builds the trees
`with binary "digits" rather than full characters, and this
`allows him to save more space by folding the leaves
`into the internal nodes. Our "digits" are bytes, so the
`branching factor can be as large as 256. Since there are
`rarely 256 descendants of a node, we do not reserve
`that much space in each node, but instead hash the
`arcs. There is also a question about when the strings in
`parentheses are checked in the searching process. In
`what follows, we usually check characters immediately
`when we cross an arc. Morrison's scheme can avoid file
`access by skipping the characters on the arcs, and doing
`only one file access and comparison at the end of the
`search. However, our files will be in main memory, so
`this consideration is unimportant. We will use the sim(cid:173)
`plified tree depicted in Figure 2.
`For At, we wish to find the longest (up to 16 charac(cid:173)
`ter) match to the current string beginning anywhere in
`the preceding 4096 positions. If all preceding 4096
`strings were stored in a PATRICIA tree with depth
`d = 16, then finding this match would be straightfor(cid:173)
`ward. Unfortunately, the cost of inserting these strings
`can be prohibitive, for if we have just descended d
`levels in the tree to insert the string starting at position
`i then we will descend at least d - 1 levels inserting the
`string at i + 1. In the worst case this can lead to O(nd)
`
`A
`
`S
`
`S
`
`T
`
`R
`
`A
`
`STRAY
`
`STRIDE
`
`T
`
`R
`
`A
`
`ASTRAY
`
`ASTRIDE
`
`FIGURE 1. A Trie
`
`Apri/1989 Volume 32 Number 4
`
`Communications of the ACM
`
`491
`
`Page 2 of 16
`
`
`
`Research Contributions
`
`insertion time for a file of size n. Since later encodings
`will use much laqer values ford than 16, it is impor(cid:173)
`tant to eliminate tl from the running time.
`To insert the strings in O(n) time, McCreight added
`additional suffix l'ointers to the tree. Each internal
`node, representing the string aX on the path from the
`root to the intern< I node, has a pointer to the node
`representing X, th 3 string obtained by stripping a single
`letter from the be1~nning of aX. If a string starting at i
`has just been inse:·ted at level d we do not need to
`return to the root to insert the string at i + 1; instead,
`a nearby suffix po .nter will lead us to the relevant
`branch of the tree.
`Figure 3 shows how suffix links are created and used.
`On the previous it3ration, we have matched the string
`aXY, where a is a single character, X and Yare strings,
`and b is the first u 1matched character after Y. Figure 3
`shows a complicat 3d case where a new internal node,
`a, has been added to the tree, and the suffix link of a
`must be computed. We insert the next string XYb by
`going up the tree t'J node (3, representing the string aX,
`and crossing its su:'fix link to ')1, representing X. Once
`we have crossed tl.e suffix link, we descend again in
`the tree, first by "nscanning" the string Y, and then by
`"scanning" from o mtil the new string is inserted. The
`first part is called ' rescanning" because it covers a por(cid:173)
`tion of the string that was covered by the previous
`insert, and so it do,~s not require checking the internal
`strings on the arcs. (In fact, avoiding these checks is
`essential to the linoar time functioning of the algo(cid:173)
`rithm.) The rescan either ends at an existing node o, or
`o is created to insert the new string XYb; either way we
`have the destination for the suffix link of a. We have
`restored the invariant that every internal node, except
`possibly the one ju:;t created, has a suffix link.
`For the At compressor, with a 4096-byte fixed win(cid:173)
`dow, we need a way to delete and reclaim the storage
`for portions of the ~ uffix tree representing strings fur(cid:173)
`ther back than 4091i in the file. Several things must be
`added to the suffix tree data structure. The leaves of
`the tree are placed in a circular buffer, so that the
`oldest leaf can be identified and reclaimed, and the
`internal nodes are !:iven "son count" fields. When an
`
`9
`10
`1
`2
`File: ASTRIDE ASTRAY
`
`FIGURE 2. A IJATRICIA Tree with a Suffix Pointer
`
`A
`
`' X
`\
`'
`
`aX
`
`;
`I
`;
`
`y
`
`' \ y
`'
`
`rescan
`
`... " .... , ...... ,." ................. .,., ........... ," .. A'
`' 'o
`
`scan
`
`' \
`
`K
`
`FIGURE 3. Building a Suffix Tree
`
`internal "son count" falls to one, the node is deleted
`and two consecutive arcs are combined. In Section 3, it
`is shown that this approach will never leave a "dan(cid:173)
`gling" suffix link pointing to deleted nodes. Unfortu(cid:173)
`nately, this is not the only problem in maintaining a
`valiid suffix tree. The modifications that avoided a re(cid:173)
`turn to the root for each new insertion create havoc for
`deletions. Since we have not always returned to the
`root, we may have consistently entered a branch of the
`tree sideways. The pointers (to strings in the 4096-byte
`window) in the higher levels of such a branch can be(cid:173)
`come out-of-date. However, traversing the branch and
`updating the pointers would destroy any advantage
`gained by using the suffix links.
`We can keep valid pointers and avoid extensive up(cid:173)
`dating by partially updating according to a percolating
`update. Each internal node has a single "update" bit. If
`the update bit is true when we are updating a node,
`then we set the bit false and propagate the update re(cid:173)
`cursively to the node's parent. Otherwise, we set the bit
`true and stop the propagation. In the worst case, a long
`string of true updates can cause the update to propagate
`to the root. However, when amortized over all new
`leaves, the cost of updating is constant, and the effect of
`updating is to keep all internal pointers on positions
`within the last 4096 positions of the file. These facts
`will be shown in Section 3.
`We can now summarize the operation of the inner
`loop, using Figure 3 again. If we have just created node
`a, then we use a's parent's suffix link to find 'Y· From 'Y
`we move down in the tree, first rescanning, and then
`scanning. At the end of the scan, we percolate an up(cid:173)
`date from the leaf, moving toward the root, setting the
`position fields equal to the current position, and setting
`the update bits false, until we find a node with an
`update bit that is already false, whereupon we set that
`node's update bit true and stop the percolation. Finally,
`we go to the circular buffer of leaves and replace the
`oldest leaf with the new leaf. If the oldest leaf's parent
`has only one remaining son, then it must also be de-
`
`492
`
`Communications of the 1lCM
`
`April 1989 Volume 32 Number 4
`
`Page 3 of 16
`
`
`
`Research Contributions
`
`{3 farthest from the root that does not propagate an
`update to its parent. If {3 has at least two children that
`have remained for the entire period, then {3 must have
`received updates from these nodes: they are farther
`from the root. If {3 has only one remaining child, then
`it must have a new child, and so it will still get
`two updates. (Every newly created arc causes a son
`to update a parent, percolating if necessary.) Similarly,
`two new children also cause two updates. By every
`accounting, {3 will receive two updates during the
`period, and thus propagate an update-contradicting
`our assumption of {J's failure to update its parent.
`
`There is some flexibility on how updating is handled.
`We could propagate the current position upward before
`rescanning, and then write the current position into
`those nodes passed during the rescan and scan; in this
`case, the proof of Theorem 3 is conservative. Alterna(cid:173)
`tively, a similar, symmetric proof can be used to show
`that updating can be omitted when new arcs are added
`so long as we propagate an update after every arc is
`deleted. The choice is primarily a matter of implemen(cid:173)
`tation convenience, although the method used above is
`slightly faster.
`The last major theoretical consideration is the effec(cid:173)
`tiveness of the At policy in choosing between literal
`and copy codewords. We have chosen the following
`one-pass policy for At: When the encoder is idle, issue
`a copy if it is possible to copy two or more characters;
`otherwise, start a literal. If the encoder has previously
`started a literal, then terminate the literal and issue a
`copy only if the copy is of length three or greater.
`Notice that this policy can sometimes go astray. For
`example, suppose that the compressor is idle at position
`i and has the following copy lengths available at subse(cid:173)
`quent positions:
`i + 1
`3
`
`i+2 i+3 i+4 i+5
`16
`15
`14
`13
`
`(1)
`
`1
`
`leted; in this case, the remaining son is attached to its
`grandparent, and the deleted node's position is perco(cid:173)
`lated upward as before, only at each step the position
`being percolated and the position already in the node
`must be compared and the more recent of these sent
`upward in the tree.
`
`3. THEORETICAL CONSIDERATIONS
`The correctness and linearity of suffix tree construction
`follows from McCreight's original paper [28]. Here we
`will concern ourselves with the correctness and the
`linearity of suffix tree destruction-questions raised in
`Section 2.
`
`THEOREM 1. Deleting leaves in FIFO order and deleting
`internal nodes with single sons will never leave dangling
`suffix pointers.
`
`PROOF. Assume the contrary. We have a node a
`with a suffix pointer to a node o that has just been
`deleted. The existence of a means that there are at
`least two strings that agree for l positions and then
`differ at I + 1. Assuming that these two strings start at
`positions i and j, where both i and j are within the
`window of recently scanned strings and are not equal
`to the current position, then there are are two even
`younger strings at i + 1 and j + 1 that differ first at I.
`This contradicts the assumption that o has one son. {If
`either i or j are equal to the current position, then a is
`a new node, and can temporarily be without a suffix
`pointer.)
`
`There are two issues related to the percolating update:
`its cost and its effectiveness.
`
`THEOREM 2. Each percolated update has constant amor(cid:173)
`tized cost.
`
`PROOF. We assume that the data structure contains
`a "credit" on each internal node where the "update"
`flag is true. A new leaf can be added with two "credits."
`One is spent immediately to update the parent, and the
`other is combined with any credits remaining at the
`parent to either: 1) obtain one credit to leave at the
`parent and terminate the algorithm or 2) obtain two
`credits to apply the algorithm recursively at the parent.
`This gives an amortized cost of two updates for each
`new leaf.
`
`For the next theorem, define the "span" of a suffix tree
`to be equal to the size of its fixed window. So far we
`have used examples with "span" equal to 4096, but the
`value is flexible.
`
`THEOREM 3. Using the percolating update, every inter(cid:173)
`nal node will be updated at least once during every period of
`length "span."
`
`It is useful to prove the slightly stronger re(cid:173)
`PROOF.
`sult that every internal node (that remains for an entire
`period) will be updated twice during a period, and thus
`propagate at least one update to its parent. To show a
`contradiction, we find the earliest period and the node
`
`Under the policy, the compressor encodes position i
`with a literal codeword, then takes the copy of length 3,
`and finally takes a copy of length 14 at position i + 4. It
`uses 6 bytes in the encoding:
`
`(literal1)X(copy 3 - y)(copy 14 - y)
`
`If the compressor had foresight it could avoid the
`copy of length 3, compressing the same material into
`5 bytes:
`
`(literal 2)XX(copy 16 - y)
`
`The optimal solution can be computed by dynamic
`programming [36]. One forward pass records the length
`of the longest possible copy at each position (as in equa(cid:173)
`tion 1) and the displacement for the copy (not shown in
`equation 1). A second backward pass computes the
`optimal way to finish compressing the file from each
`position by recording the best codeword to use and the
`length to the end-of-file. Finally, another forward pass
`reads off the solution and outputs the compressed file.
`However, one would probably never want to use dy-
`
`April1989 Volume 32 Number 4
`
`Communications of the ACM
`
`493
`
`Page 4 of 16
`
`
`
`Research Contributions
`
`namic programming since the one-pass heuristic is a lot
`faster, and we estimated for several typical files that
`the heuristically compressed output was only about
`1 percent larger than the optimum. Furthermore, we
`will show in the Iemainder of this section that the size
`of the compressed file is never worse than % the size of
`the optimal soluti:m for the specific At encoding.
`This will require ·ieveloping some analytic tools, so the
`non-mathematica. reader should feel free to skip to
`Section 4.
`
`The following def.nitions are useful:
`Definition. F(i I is the longest feasible copy at position i
`in the file.
`
`Sample F(i)'s were· given above in equation 1. They are
`dependent on the encoding used. For now, we are as(cid:173)
`suming that they are limited in magnitude to 16, and
`must correspond tJ copy sources within the last 4096
`characters.
`
`Definition. B(i 1 is the size of the best way to compress
`the remainder of th1~ file, starting at position i.
`
`B(i)'s would be computed in the reverse pass of the
`optimal algorithm outlined above.
`
`The following T 1eorems are given without proof:
`
`THEOREM. F(i -t 1) ~ F(i) - 1.
`
`Ther.~ exists an optimal solution where copies
`THIWREM.
`are longest possible (i.e., only copies corresponding to F(i)'s
`are used).
`
`THEOREM. B(i) lS monotone decreasing.
`
`THEOREM. Any :wlution can be modified, without affect(cid:173)
`ing length, so that (literal Xt) followed immediately by
`(literal x2 ) implies lhat x1 is maximum (in this case 16).
`We could continue to reason in this vein, but there is
`an abstract way of looking at the problem that is both
`clearer and more general. Suppose we have a non(cid:173)
`deterministic finitE' automaton where each transition is
`given a cost. A simple example is shown in Figure 4.
`The machine accepts (a+ b)*, with costs as shown in
`parentheses.
`The total cost of accepting a string is the sum of the
`
`Start I b (3)
`IX
`
`FIGURE 4. A Nondeterministic Automaton with Transition Costs
`
`transition costs for each character. (While it is not
`important to our problem, the optimal solution can be
`computed by forming a transition matrix for each let(cid:173)
`te:r, using the costs shown in parentheses, and then
`multiplying the matrices for a given string, treating the
`coefficients as elements of the closed semiring with op(cid:173)
`erations of addition and minimization.) We can obtain a
`solution that approximates the minimum by deleting
`transitions in the original machine until it becomes a
`deterministic machine. This corresponds to choosing a
`policy in our original data compression problem. A pol(cid:173)
`icy for the machine in Figure 4 is shown in Figure 5.
`We now wish to compare, in the worst case, the dif(cid:173)
`ference between optimally accepting a string with the
`non-deterministic machine, and deterministically ac(cid:173)
`cepting the same string with the "policy" machine. This
`is done by taking a cross product of the two machines,
`as shown in Figure 6.
`In Figure 6 there are now two weights on each transi(cid:173)
`tion; the first is the cost in the non-deterministic graph,
`and the second is the cost in the policy graph. Asymp(cid:173)
`totically, the relationship of the optimal solution to the
`poiicy solution is dominated by the smallest ratio on a
`cycle in this graph. In the case of Figure 6, there is a
`cycle from 1, 1' to 1, 2' and back that has cost in the
`non-deterministic graph of 2 + 1 = 3, and cost in the
`policy graph of 3 + 3 = 6, giving a ratio of%. That is,
`
`Start
`
`)'''
`
`FIGURE 5. A Deterministic "Policy" Automaton for Figure 4
`
`the policy solution can be twice as bad as the optimum
`on the string ababababab . ...
`In general, we can find the cycle with the smallest
`ratio mechanically, using well known techniques (8,
`27]. The idea is to conjecture a ratio r and then reduce
`the pairs of weights (x, y) on the arcs to single weights
`x - ry. Under this reduction, a cycle with zero weight
`has ratio exactly r. If a cycle has negative weight, then r
`is too large. The ratio on the negative cycle is used as a
`new conjecture, and the process is iterated. !Negative
`cycles are detected by running a shortest path algo(cid:173)
`rithm and checking for convergence.) Once we have
`found the minimum ratio cycle, we can create a worst
`case string in the original automata problem by finding
`a path from the start state to the cycle and then repeat(cid:173)
`ing the cycle indefinitely. The ratio of the costs of ac-
`
`494
`
`Communications of the ACM
`
`Apri/1989 Volume 32 Number 4
`
`Page 5 of 16
`
`
`
`Start
`
`Research Contributions
`
`below, eliminates many more transitions:
`
`lo ~ /15 start a literal if i :s 1
`lx ~ lx-1 continue a literal if x 2: 1 and i :S 2
`'f .
`x ~ C;-1 start a copy I 1 2: 3 or x = 0 and i = 2
`l Pi(2)
`p,(O)
`•
`Ct ~ Cy-1 contmue a copy
`
`Finally, we add one more machine to guarantee that
`the strings of p; are realistic. In this machine, state s;
`means that the previous character wasp;, so the index
`of the next character must be at least p;- 1 :
`S; ..J!4 Si
`
`(j 2: i - 1)
`
`(5)
`
`The cross product of these three machines has ap(cid:173)
`proximately 17K states and was analyzed mechanically
`to prove a minimum ratio cycle of%. Thus the policy
`we have chosen is never off by more than 25 percent,
`and the worst case is realized on a string that repeats a
`p; pattern as follows:
`
`FIGURE 6. The Cross Product
`
`1
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`cepting the string non-deterministically and determin(cid:173)
`istically will converge to the ratio of the cycle. (The
`path taken in the cross product graph will not necessar(cid:173)
`ily bring us to the same cycle, due to the initial path
`fragment; we will, nevertheless, do at least as well.)
`Conversely, if we have a sufficiently long string with
`non-deterministic to deterministic ratio r, then the
`string will eventually loop in the cross product graph. If
`we remove loops with ratio greater than r we only im(cid:173)
`prove the ratio of the string, so we must eventually find
`a loop with ratio at least as small as r.
`The above discussion gives us an algorithmic way of
`analyzing our original data compression problem. The
`possible values of F(i) are encoded in a 17 character
`alphabet Po ... P16. representing the length of copy
`available at each position. The compression algorithm
`is described by a non-deterministic machine that ac(cid:173)
`cepts strings of Pi; this machine has costs equal to the
`lengths of the codewords used by the algorithm. There
`are two parameterized states in this machine: lx means
`that there is a literal codeword under construction with
`x spaces still available; cy means that a copy is in prog(cid:173)
`ress withy characters remaining to copy. The idle state
`is lo =Co. In the non-deterministic machine, the possi(cid:173)
`ble transitions are:
`
`start a literal
`continue a literal {x 2: 1)
`start a copy
`continue a copy
`
`(2)
`
`(An asterisk is used as a wild card to denote any state.)
`Based on the theorems above we have already elimi(cid:173)
`nated some transitions to simplify what follows. For
`example,
`
`·
`I f
`'d
`l't
`p,(zJ l
`t
`Cy ~ 15 s art a I era
`rom msi e a copy (y 2: 1)
`
`(3)
`
`is unnecessary. The deterministic machine, given
`
`14 15
`10 11 12 13
`9
`8
`pg
`P4 P3 Pz
`P1
`Pz P10 P10
`
`(6)
`
`(There is nothing special about 10; it was chosen to
`illustrate a long copy and to match the example in
`Appendix A.) The deterministic algorithm takes a copy
`of length 10 in the first position, and then switches to a
`literal for positions 11 and 12. Five bytes are used in
`each repetition of the pattern. The optimal solution is
`one position out of phase. It takes a copy of length 10 in
`the second position, and then finds a copy of length 2 at
`position 12, for a total of four bytes on each iteration.
`We have abstracted the problem so that the possible
`copy operations are described by a string of Pi• and we
`have shown a pathological pattern of Pi that results in
`% of the optimal encoding. There might still be some
`doubt that such a string exists, since the condition that
`our third machine (5) guarantees, F(i + 1) 2: F(i) - 1, is
`a necessary but not sufficient condition. Nevertheless,
`the details of an actual pathological string can be found
`in Appendix A.
`
`4. A SIMPLER DATA STRUCTURE
`Although the quantity of code associated with A1 is not
`enormous, it is complicated, and the data structures are
`fairly large. In this section, we present simpler methods
`for finding the suffix and for propagating the window
`position.
`The alternative to a percolating update is to update
`the positions in all nodes back to the root whenever a
`new leaf is inserted. Then no updates are needed when
`nodes are deleted. The update flags can be eliminated.
`The alternative to suffix pointers is more compli(cid:173)
`cated. The cost of movement in a tree is not uniform;
`moving deeper requires a hash table lookup, which is
`more expensive than following a parent pointer. So we
`can determine the suffix by starting at the suffix leaf
`and following parent pointers back toward the root un-
`
`Apri/1989 Volume 32 Number 4
`
`Communications of the ACM
`
`495
`
`Page 6 of 16
`
`
`
`Research Contributions
`
`til the suffix nodo is reached. The suffix leaf is known
`because the strin.~ at i matched the string at some ear(cid:173)
`lier window position j; the suffix leaf j + 1 is the next
`entry in the leaf array. With this change, the suffix
`pointers can be e: imina ted.
`From a theoretical perspective, these modifications,
`which have O(nd 1 worst case performance for a file of
`size nand cut-off depth d, are inferior to the O(n) per(cid:173)
`formance of the Slffix tree. For At, with a cutoff of 16,
`these modifications improve average performance, but
`the A2 method di >cussed in the next section has such a
`deep cut-off that wfllx pointers and percolated updates
`are preferable.
`
`5. A MORE POWERFUL ENCODING
`The 4,096-byte window of At is roughly optimal for
`fixed size copy and literal codewords. Longer copies
`would, on average, be found in a larger window, but a
`larger displacement fleld would be required to encode
`them. To exploit a larger window, we must use a
`variable-width en,;oding statistically sensitive to the
`fact that recent window positions are more likely to be
`used by copy codewords than those positions further
`back. Similarly, it is advantageous to use variable(cid:173)
`width encodings for copy and literal lengths.
`There are several approaches we might use for vari(cid:173)
`able-length encoding. We could use fixed or adaptive
`Huffman coding, arithmetic encoding, a variable-length
`encoding of the in·:egers, or a manageable set of hand(cid:173)
`designed codeworcls. We eliminated from consideration
`adaptive Huffman and arithmetic coding because they
`are slow. Moreove.:, we felt they would provide (at best)
`a secondary adaptive advantage since the "front end"
`textual substitution is itself adapting to the input. We
`experimented witf. a fixed Huffman encoding, a hand(cid:173)
`designed family of codewords, and a variable-length
`encoding of the integers, so we will compare these
`options briefly:
`
`Hand-Designed Co:lewords. This is a direct generaliza(cid:173)
`ti