`
Algorithms and Data Structures
Daniel Sleator, Editor

Data Compression with Finite Windows

EDWARD R. FIALA and DANIEL H. GREENE
`
ABSTRACT: Several methods are presented for adaptive, invertible data compression in the style of Lempel's and Ziv's first textual substitution proposal. For the first two methods, the article describes modifications of McCreight's suffix tree data structure that support cyclic maintenance of a window on the most recent source characters. A percolating update is used to keep node positions within the window, and the updating process is shown to have constant amortized cost. Other methods explore the tradeoffs between compression time, expansion time, data structure size, and amount of compression achieved. The article includes a graph-theoretic analysis of the compression penalty incurred by our codeword selection policy in comparison with an optimal policy, and it includes empirical studies of the performance of various adaptive compressors from the literature.
`
1. INTRODUCTION
Compression is the coding of data to minimize its representation. In this article, we are concerned with fast, one-pass, adaptive, invertible (or lossless) methods of digital compression which have reasonable memory requirements. Such methods can be used, for example, to reduce the storage requirements for files, to increase the communication rate over a channel, or to reduce redundancy prior to encryption for greater security.
By "adaptive" we mean that a compression method should be widely applicable to different kinds of source data. Ideally, it should adapt rapidly to the source to achieve significant compression on small files, and it should adapt to any subsequent internal changes in the nature of the source. In addition, it should achieve very high compression asymptotically on large regions with stationary statistics.
`in this article
All the compression methods developed in this article are substitutional. Typically, a substitutional compressor functions by replacing large blocks of text with shorter references to earlier occurrences of identical text [3, 5, 29, 34, 36, 39, 41-43]. (This is often called Ziv-Lempel compression, in recognition of their pioneering ideas. Ziv and Lempel, in fact, proposed two methods. The unqualified use of the phrase "Ziv-Lempel compression" usually refers to their second proposal [43].
`
`0,989
`
`ACM OOOl-0782/89,‘03OO-0490
`
`$1.50
`
In this article, we will be primarily concerned with their first proposal [42].) A popular alternative to a substitutional compressor is a statistical compressor. A symbolwise statistical compressor functions by accurately predicting the probability of individual symbols, and then encoding these symbols with space close to -log2 of the predicted probabilities. The encoding is accomplished with either Huffman compression [17], which has recently been made one-pass and adaptive [11, 22, 37], or with arithmetic coding, as described in [1, 14, 20, 25, 26, 31-33]. The major challenge of a statistical compressor is to predict the symbol probabilities. Simple strategies, such as keeping zero-order (single symbol) or first-order (symbol pair) statistics of the input, do not compress English text very well. Several authors have had success gathering higher-order statistics, but this necessarily involves higher memory costs and additional mechanisms for dealing with situations where higher-order statistics are not available [6, 7, 26].
It is hard to give a rigorous foundation to the substitutional vs. statistical distinction described above. Several authors have observed that statistical methods can be used to simulate textual substitution, suggesting that the statistical category includes the substitutional category [4, 24]. However, this takes no account of the simplicity of mechanism; the virtue of textual substitution is that it recognizes and removes coherence on a large scale, oftentimes ignoring the smaller scale statistics. As a result, most textual substitution compressors process their compressed representation in larger blocks than their statistical counterparts, thereby gaining a significant speed advantage. It was previously believed that the speed gained by textual substitution would necessarily cost something in compression achieved. We were surprised to discover that with careful attention to coding, textual substitution compressors can match the compression performance of the best statistical methods.
Consider the following scheme, which we will improve later in the article. Compressed files contain two types of codewords:

literal x     pass the next x characters directly into the uncompressed output.
copy x, -y    go back y characters in the output and copy x characters forward to the current position.
`
`
So, for example, the following piece of literature:

IT WAS THE BEST OF TIMES,
IT WAS THE WORST OF TIMES

would compress to

(literal 26)IT WAS THE BEST OF TIMES,
(copy 11 -26)(literal 3)WOR(copy 11 -27)
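To make the copy semantics concrete, the example can be checked mechanically. The following decoder is our own sketch, not part of the original article; it represents codewords as Python tuples rather than packed bits. Note that a copy may reach back into characters it is itself producing, since characters are copied forward one at a time.

    def expand(codewords):
        # Decode a sequence of ('literal', bytes) and ('copy', x, y) codewords.
        out = bytearray()
        for cw in codewords:
            if cw[0] == 'literal':
                out += cw[1]                # pass the characters through directly
            else:
                _, x, y = cw                # copy x characters from y back
                for _ in range(x):
                    out.append(out[-y])     # overlapping copies are legal
        return bytes(out)

    # Decoding the compressed example reproduces the original 51 characters:
    assert expand([
        ('literal', b'IT WAS THE BEST OF TIMES,\n'),
        ('copy', 11, 26), ('literal', b'WOR'), ('copy', 11, 27),
    ]) == b'IT WAS THE BEST OF TIMES,\nIT WAS THE WORST OF TIMES'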
`
The compression achieved depends on the space required for the copy and literal codewords. Our simplest scheme, hereafter denoted A1, uses 8 bits for a literal codeword and 16 for a copy codeword. If the first 4 bits are 0, then the codeword is a literal; the next 4 bits encode a length x in the range [1..16] and the following x characters are literal (one byte per character). Otherwise, the codeword is a copy; the first 4 bits encode a length x in the range [2..16] and the next 12 bits are a displacement y in the range [1..4096]. At each step, the policy by which the compressor chooses between a literal and a copy is as follows: If the compressor is idle (just finished a copy, or terminated a literal because of the 16-character limit), then the longest copy of length 2 or more is issued; otherwise, if the longest copy is less than 2 long, a literal is started. Once started, a literal is extended across subsequent characters until a copy of length 3 or more can be issued or until the length limit is reached.
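For concreteness, here is one plausible bit-level packing of the A1 codewords, written in Python. This is our sketch under assumed field layouts; the article fixes only the field widths (4 + 4 bits for a literal, 4 + 12 bits for a copy), not the exact encodings.

    def literal_codeword(chars: bytes) -> bytes:
        # Literal: high 4 bits zero, low 4 bits the length x in [1..16],
        # stored as x - 1, followed by the x uncompressed characters.
        x = len(chars)
        assert 1 <= x <= 16
        return bytes([x - 1]) + chars

    def copy_codeword(x: int, y: int) -> bytes:
        # Copy: high 4 bits the length x in [2..16], stored as x - 1 so the
        # nibble is never zero; low 12 bits the displacement y in [1..4096],
        # stored as y - 1.  The result is 16 bits, preserving byte alignment.
        assert 2 <= x <= 16 and 1 <= y <= 4096
        word = ((x - 1) << 12) | (y - 1)
        return bytes([word >> 8, word & 0xFF])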
A1 would break the first literal in the above example into two literals and compress the source from 51 bytes down to 36. A1 is close to Ziv and Lempel's first textual substitution proposal [42]. One difference is that A1 uses a separate literal codeword, while Ziv and Lempel combine each copy codeword with a single literal character. We have found it useful to have longer literals during the startup transient; after the startup, it is better to have no literals consuming space in the copy codewords.
Our empirical studies showed that, for source code and English text, the field size choices for A1 are good; reducing the size of the literal length field by 1 bit increases compression slightly but gives up the byte-alignment property of the A1 codewords. In short, if one desires a simple method based upon the copy and literal idea, A1 is a good choice.
A1 was designed for 8-bit per character text or program sources, but, as we will see shortly, it achieves good compression on other kinds of source data, such as compiled code and images, where the word model does not match the source data particularly well, or where no model of the source is easily perceived. A1 is, in fact, an excellent approach to general purpose data compression. In the remainder of this article, we will study A1 and several more powerful variations.
`
2. OVERVIEW OF THE DATA STRUCTURE
The fixed window suffix tree of this article is a modification of McCreight's suffix tree [28] (see also [21, 34, 38]), which is itself a modification of Morrison's PATRICIA tree [30], and Morrison's tree is ultimately based on a Trie data structure [22, page 481]. We will review each of these data structures briefly.
`
`
A Trie is a tree structure where the branching occurs according to "digits" of the keys, rather than according to comparisons of the keys. In English, for example, the most natural "digits" are individual letters, with the lth level of the tree branching according to the lth letter of the words in the tree.
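As a small illustration (ours, not the article's), such a Trie can be built from nested dictionaries, branching level by level on successive letters:

    def trie_insert(root: dict, word: str) -> None:
        # The lth level of the tree branches on the lth letter of the word.
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node['$'] = word    # end marker, so one word may be a prefix of another

    trie = {}
    for w in ("ASTRAY", "ASTRIDE"):
        trie_insert(trie, w)
    # The two words share the path A-S-T-R and branch at the fifth level.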
In Figure 1, many internal nodes are superfluous, having only one descendant. If we are building an index for a file, we can save space by eliminating the superfluous nodes and putting pointers to the file into the nodes rather than including characters in the data structure. In Figure 2, the characters in parentheses are not actually represented in the data structure, but they can be recovered from the (position, level) pairs in the nodes. Figure 2 also shows a suffix pointer (as a dark right arrow) that will be explained later.
Figure 2 represents some, but not all, of the innovations in Morrison's PATRICIA trees. He builds the trees with binary "digits" rather than full characters, and this allows him to save more space by folding the leaves into the internal nodes. Our "digits" are bytes, so the branching factor can be as large as 256. Since there are rarely 256 descendants of a node, we do not reserve that much space in each node, but instead hash the arcs. There is also a question about when the strings in parentheses are checked in the searching process. In what follows, we usually check characters immediately when we cross an arc. Morrison's scheme can avoid file access by skipping the characters on the arcs, and doing only one file access and comparison at the end of the search. However, our files will be in main memory, so this consideration is unimportant. We will use the simplified tree depicted in Figure 2.
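A node of this simplified tree might be declared as follows. This is our sketch and the field names are hypothetical; the "son count" and update-bit fields used in later sections are included for completeness.

    class Node:
        # PATRICIA-style node: characters along the arcs are not stored;
        # they are recovered from a (position, level) pair into the file.
        def __init__(self, position: int, level: int):
            self.position = position   # an occurrence of this substring in the file
            self.level = level         # depth in characters from the root
            self.arcs = {}             # branching byte -> child node (hashed arcs)
            self.parent = None
            self.suffix = None         # suffix pointer (internal nodes only)
            self.sons = 0              # "son count" consulted when deleting
            self.update = False        # bit driving the percolating update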
For A1, we wish to find the longest (up to 16 character) match to the current string beginning anywhere in the preceding 4096 positions. If all preceding 4096 strings were stored in a PATRICIA tree with depth d = 16, then finding this match would be straightforward. Unfortunately, the cost of inserting these strings can be prohibitive, for if we have just descended d levels in the tree to insert the string starting at position i then we will descend at least d - 1 levels inserting the string at i + 1. In the worst case this can lead to O(nd)
`
FIGURE 1. A Trie (indexing the words ASTRAY and ASTRIDE)
`
`
insertion time for a file of size n. Since later encodings will use much larger values for d than 16, it is important to eliminate d from the running time.
To insert the strings in O(n) time, McCreight added additional suffix pointers to the tree. Each internal node, representing the string aX on the path from the root to the internal node, has a pointer to the node representing X, the string obtained by stripping a single letter from the beginning of aX. If a string starting at i has just been inserted at level d we do not need to return to the root to insert the string at i + 1; instead, a nearby suffix pointer will lead us to the relevant branch of the tree.
Figure 3 shows how suffix links are created and used. On the previous iteration, we have matched the string aXY, where a is a single character, X and Y are strings, and b is the first unmatched character after Y. Figure 3 shows a complicated case where a new internal node, α, has been added to the tree, and the suffix link of α must be computed. We insert the next string XYb by going up the tree to node β, representing the string aX, and crossing its suffix link to γ, representing X. Once we have crossed the suffix link, we descend again in the tree, first by "rescanning" the string Y, and then by "scanning" from b until the new string is inserted. The first part is called "rescanning" because it covers a portion of the string that was covered by the previous insert, and so it does not require checking the internal strings on the arcs. (In fact, avoiding these checks is essential to the linear time functioning of the algorithm.) The rescan either ends at an existing node δ, or δ is created to insert the new string XYb; either way we have the destination for the suffix link of α. We have restored the invariant that every internal node, except possibly the one just created, has a suffix link.
For the A1 compressor, with a 4096-byte fixed window, we need a way to delete and reclaim the storage for portions of the suffix tree representing strings further back than 4096 in the file. Several things must be added to the suffix tree data structure. The leaves of the tree are placed in a circular buffer, so that the oldest leaf can be identified and reclaimed, and the internal nodes are given "son count" fields.

FIGURE 2. A PATRICIA Tree with a Suffix Pointer
`
FIGURE 3. Building a Suffix Tree
`
When an internal node's "son count" falls to one, the node is deleted and two consecutive arcs are combined. In Section 3, it is shown that this approach will never leave a "dangling" suffix link pointing to deleted nodes. Unfortunately, this is not the only problem in maintaining a valid suffix tree. The modifications that avoided a return to the root for each new insertion create havoc for deletions. Since we have not always returned to the root, we may have consistently entered a branch of the tree sideways. The pointers (to strings in the 4096-byte window) in the higher levels of such a branch can become out-of-date. However, traversing the branch and updating the pointers would destroy any advantage gained by using the suffix links.
We can keep valid pointers and avoid extensive updating by partially updating according to a percolating update. Each internal node has a single "update" bit. If the update bit is true when we are updating a node, then we set the bit false and propagate the update recursively to the node's parent. Otherwise, we set the bit true and stop the propagation. In the worst case, a long string of true updates can cause the update to propagate to the root. However, when amortized over all new leaves, the cost of updating is constant, and the effect of updating is to keep all internal pointers on positions within the last 4096 positions of the file. These facts will be shown in Section 3.
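In code, the percolating update is a short loop. The sketch below is ours, using the node fields shown earlier:

    def percolate_update(node, current_position: int) -> None:
        # Write the current position into nodes on the way toward the root.
        # A true update bit means "keep going": clear it and move to the
        # parent; a false bit absorbs the update and stops the propagation.
        while node is not None:
            node.position = current_position
            if node.update:
                node.update = False
                node = node.parent
            else:
                node.update = True
                break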
We can now summarize the operation of the inner loop, using Figure 3 again. If we have just created node α, then we use α's parent's suffix link to find γ. From γ we move down in the tree, first rescanning, and then scanning. At the end of the scan, we percolate an update from the leaf, moving toward the root, setting the position fields equal to the current position, and setting the update bits false, until we find a node with an update bit that is already false, whereupon we set that node's update bit true and stop the percolation. Finally, we go to the circular buffer of leaves and replace the oldest leaf with the new leaf. If the oldest leaf's parent has only one remaining son, then it must also be de-
`
`492
`
`Communications of the 1lCA4
`
`April 1989 Volume 32 Number 4
`
`Apple Inc. Exhibit 1004 Page 3
`
`
`
`Research Contributions
`
leted; in this case, the remaining son is attached to its grandparent, and the deleted node's position is percolated upward as before, only at each step the position being percolated and the position already in the node must be compared and the more recent of these sent upward in the tree.
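The deletion-time variant differs only in carrying the more recent position upward (again our sketch; positions are assumed to be comparable file offsets, larger meaning more recent):

    def percolate_after_delete(node, position: int) -> None:
        # As percolate_update, but at each step send upward the more recent
        # of the percolated position and the position already in the node.
        while node is not None:
            position = max(position, node.position)
            node.position = position
            if node.update:
                node.update = False
                node = node.parent
            else:
                node.update = True
                break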
`
3. THEORETICAL CONSIDERATIONS
The correctness and linearity of suffix tree construction follows from McCreight's original paper [28]. Here we will concern ourselves with the correctness and the linearity of suffix tree destruction, the questions raised in Section 2.
`
THEOREM 1. Deleting leaves in FIFO order and deleting internal nodes with single sons will never leave dangling suffix pointers.
`
PROOF. Assume the contrary. We have a node α with a suffix pointer to a node δ that has just been deleted. The existence of α means that there are at least two strings that agree for l positions and then differ at l + 1. Assuming that these two strings start at positions i and j, where both i and j are within the window of recently scanned strings and are not equal to the current position, then there are two even younger strings at i + 1 and j + 1 that differ first at l. This contradicts the assumption that δ has one son. (If either i or j are equal to the current position, then α is a new node, and can temporarily be without a suffix pointer.)
`
`There are two issues related to the percolating update:
`its cost and its effectiveness.
`
THEOREM 2. Each percolated update has constant amortized cost.
`
`PROOF. We assume that the data structure contains
`a “credit” on each internal node where the “update”
`flag is true. A new leaf can be added with two “credits.”
`One is spent immediately
`to update the parent, and the
`other is combined with any credits remaining at the
`parent to either: 1) obtain one credit to leave at the
`parent and terminate the algorithm or 2) obtain two
`credits to apply the algorithm recursively at the parent.
`This gives an amortized cost of two updates for each
`new leaf.
`
For the next theorem, define the "span" of a suffix tree to be equal to the size of its fixed window. So far we have used examples with "span" equal to 4096, but the value is flexible.

THEOREM 3. Using the percolating update, every internal node will be updated at least once during every period of length "span."
`
PROOF. It is useful to prove the slightly stronger result that every internal node (that remains for an entire period) will be updated twice during a period, and thus propagate at least one update to its parent. To show a contradiction, we find the earliest period and the node β farthest from the root that does not propagate an update to its parent. If β has at least two children that have remained for the entire period, then β must have received updates from these nodes: they are farther from the root. If β has only one remaining child, then it must have a new child, and so it will still get two updates. (Every newly created arc causes a son to update a parent, percolating if necessary.) Similarly, two new children also cause two updates. By every accounting, β will receive two updates during the period, and thus propagate an update, contradicting our assumption of β's failure to update its parent.
There is some flexibility on how updating is handled. We could propagate the current position upward before rescanning, and then write the current position into those nodes passed during the rescan and scan; in this case, the proof of Theorem 3 is conservative. Alternatively, a similar, symmetric proof can be used to show that updating can be omitted when new arcs are added so long as we propagate an update after every arc is deleted. The choice is primarily a matter of implementation convenience, although the method used above is slightly faster.
The last major theoretical consideration is the effectiveness of the A1 policy in choosing between literal and copy codewords. We have chosen the following one-pass policy for A1: When the encoder is idle, issue a copy if it is possible to copy two or more characters; otherwise, start a literal. If the encoder has previously started a literal, then terminate the literal and issue a copy only if the copy is of length three or greater. Notice that this policy can sometimes go astray. For example, suppose that the compressor is idle at position i and has the following copy lengths available at subsequent positions:

Position:      i    i+1   i+2   i+3   i+4   i+5
Copy length:   1    3     16    15    14    13          (1)
`
Under the policy, the compressor encodes position i with a literal codeword, then takes the copy of length 3, and finally takes a copy of length 14 at position i + 4. It uses 6 bytes in the encoding:

(literal 1)X(copy 3 -y)(copy 14 -y)

If the compressor had foresight it could avoid the copy of length 3, compressing the same material into 5 bytes:

(literal 2)XX(copy 16 -y)
`
The optimal solution can be computed by dynamic programming [36]. One forward pass records the length of the longest possible copy at each position (as in equation 1) and the displacement for the copy (not shown in equation 1). A second backward pass computes the optimal way to finish compressing the file from each position by recording the best codeword to use and the length to the end-of-file. Finally, another forward pass reads off the solution and outputs the compressed file. However, one would probably never want to use dynamic programming since the one-pass heuristic is a lot faster, and we estimated for several typical files that the heuristically compressed output was only about 1 percent larger than the optimum. Furthermore, we will show in the remainder of this section that the size of the compressed file is never worse than 5/4 the size of the optimal solution for the specific A1 encoding. This will require developing some analytic tools, so the non-mathematical reader should feel free to skip to Section 4.
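The backward pass is easy to state precisely. The sketch below is our illustration, using the A1 costs (1 header byte plus one byte per character for a literal, 2 bytes for a copy); F[i] is the longest copy available at position i, as in equation 1, and a theorem below justifies trying only the longest copy at each position.

    def optimal_parse(F, max_lit=16):
        # B[i] = fewest bytes needed to compress the rest of the file from
        # position i; best[i] records the codeword that achieves it.
        n = len(F)
        B = [0] * (n + 1)
        best = [None] * n
        for i in range(n - 1, -1, -1):
            B[i], best[i] = min(
                (1 + x + B[i + x], ('literal', x))       # 1 header byte + x chars
                for x in range(1, min(max_lit, n - i) + 1))
            if F[i] >= 2 and 2 + B[i + F[i]] < B[i]:     # longest copy, 2 bytes
                B[i], best[i] = 2 + B[i + F[i]], ('copy', F[i])
        return B, best

    # The situation of equation (1): the one-pass policy spends 6 bytes,
    # while the optimal parse starts with a 2-character literal and uses 5.
    F = [1, 3, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
    B, best = optimal_parse(F)
    assert B[0] == 5 and best[0] == ('literal', 2)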
`
The following definitions are useful:

Definition. F(i) is the longest feasible copy at position i in the file.

Sample F(i)'s were given above in equation 1. They are dependent on the encoding used. For now, we are assuming that they are limited in magnitude to 16, and must correspond to copy sources within the last 4096 characters.

Definition. B(i) is the size of the best way to compress the remainder of the file, starting at position i.

B(i)'s would be computed in the reverse pass of the optimal algorithm outlined above.
`
The following theorems are given without proof:

THEOREM. F(i + 1) >= F(i) - 1.

THEOREM. There exists an optimal solution where copies are longest possible (i.e., only copies corresponding to F(i)'s are used).

THEOREM. B(i) is monotone decreasing.

THEOREM. Any solution can be modified, without affecting length, so that (literal x1) followed immediately by (literal x2) implies that x1 is maximum (in this case 16).
We could continue to reason in this vein, but there is an abstract way of looking at the problem that is both clearer and more general. Suppose we have a non-deterministic finite automaton where each transition is given a cost. A simple example is shown in Figure 4. The machine accepts (a + b)*, with costs as shown in parentheses.

FIGURE 4. A Nondeterministic Automaton with Transition Costs

The total cost of accepting a string is the sum of the transition costs for each character. (While it is not important to our problem, the optimal solution can be computed by forming a transition matrix for each letter, using the costs shown in parentheses, and then multiplying the matrices for a given string, treating the coefficients as elements of the closed semiring with operations of addition and minimization.) We can obtain a solution that approximates the minimum by deleting transitions in the original machine until it becomes a deterministic machine. This corresponds to choosing a policy in our original data compression problem. A policy for the machine in Figure 4 is shown in Figure 5.
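The parenthetical remark can be made concrete in a few lines. In this sketch (ours; the two-state machine and its costs are made up, not those of Figure 4), matrices are multiplied in the (min, +) semiring, where entries combine with ordinary addition and accumulate by minimization.

    INF = float('inf')

    def min_plus_multiply(A, B):
        # Matrix "product" in the closed semiring (min, +).
        n = len(A)
        return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    # Hypothetical per-letter cost matrices: M[c][i][j] is the cost of
    # reading character c while moving from state i to state j (INF = no arc).
    M = {'a': [[1, 2], [INF, 1]],
         'b': [[3, INF], [1, 2]]}

    def optimal_cost(s, start=0):
        n = len(M['a'])
        T = [[0 if i == j else INF for j in range(n)] for i in range(n)]
        for c in s:
            T = min_plus_multiply(T, M[c])
        return min(T[start])    # cheapest accepting run from the start state

    print(optimal_cost("abab"))  # 6 for these made-up costs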
We now wish to compare, in the worst case, the difference between optimally accepting a string with the non-deterministic machine, and deterministically accepting the same string with the "policy" machine. This is done by taking a cross product of the two machines, as shown in Figure 6.

FIGURE 5. A Deterministic "Policy" Automaton for Figure 4

In Figure 6 there are now two weights on each transition; the first is the cost in the non-deterministic graph, and the second is the cost in the policy graph. Asymptotically, the relationship of the optimal solution to the policy solution is dominated by the smallest ratio on a cycle in this graph. In the case of Figure 6, there is a cycle from 1,1' to 1,2' and back that has cost in the non-deterministic graph of 2 + 1 = 3, and cost in the policy graph of 3 + 3 = 6, giving a ratio of 1/2. That is, the policy solution can be twice as bad as the optimum on the string ababababab....
In general, we can find the cycle with the smallest ratio mechanically, using well known techniques [8, 27]. The idea is to conjecture a ratio r and then reduce the pairs of weights (x, y) on the arcs to single weights x - ry. Under this reduction, a cycle with zero weight has ratio exactly r. If a cycle has negative weight, then r is too large. The ratio on the negative cycle is used as a new conjecture, and the process is iterated. (Negative cycles are detected by running a shortest path algorithm and checking for convergence.) Once we have found the minimum ratio cycle, we can create a worst case string in the original automata problem by finding a path from the start state to the cycle and then repeating the cycle indefinitely.
`
`494
`
`Communications of the AC.M
`
`April 1989 Volume 32 Number 4
`
`Apple Inc. Exhibit 1004 Page 5
`
`
`
`Start
`
`Research Contributions
`
FIGURE 6. The Cross Product

The ratio of the costs of accepting the string non-deterministically and deterministically will converge to the ratio of the cycle. (The path taken in the cross product graph will not necessarily bring us to the same cycle, due to the initial path fragment; we will, nevertheless, do at least as well.) Conversely, if we have a sufficiently long string with non-deterministic to deterministic ratio r, then the string will eventually loop in the cross product graph. If we remove loops with ratio greater than r we only improve the ratio of the string, so we must eventually find a loop with ratio at least as small as r.
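The conjecture-and-reduce iteration can be sketched as follows. This is our own rendering of the technique in [8, 27]; it assumes every cycle has positive total policy cost y (true of the compression machines) and uses standard Bellman-Ford negative cycle detection, walking predecessor arcs to recover the cycle.

    def negative_cycle(n, arcs, r):
        # Bellman-Ford over reduced weights x - r*y with all distances
        # starting at zero (a virtual source).  Returns the arcs of some
        # negative cycle, or None if no cycle is negative under r.
        dist, pred, last = [0.0] * n, [None] * n, None
        for _ in range(n):
            last = None
            for arc in arcs:
                u, v, x, y = arc
                if dist[u] + x - r * y < dist[v] - 1e-12:
                    dist[v], pred[v], last = dist[u] + x - r * y, arc, v
        if last is None:
            return None
        for _ in range(n):          # step back until we sit on the cycle itself
            last = pred[last][0]
        cycle, v = [], last
        while True:
            cycle.append(pred[v])
            v = pred[v][0]
            if v == last:
                return cycle

    def min_ratio_cycle(n, arcs):
        # arcs: (u, v, x, y), x = non-deterministic cost, y = policy cost.
        # A cycle that is negative under the reduction x - r*y has ratio
        # sum(x)/sum(y) < r; that ratio becomes the next conjecture, and
        # we iterate until no negative cycle remains.
        r = 1.0 + sum(x for _, _, x, _ in arcs)   # safely above any cycle ratio
        while True:
            cyc = negative_cycle(n, arcs, r)
            if cyc is None:
                return r
            r = sum(a[2] for a in cyc) / sum(a[3] for a in cyc)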
`
The above discussion gives us an algorithmic way of analyzing our original data compression problem. The possible values of F(i) are encoded in a 17 character alphabet p_0 ... p_16, representing the length of copy available at each position. The compression algorithm is described by a non-deterministic machine that accepts strings of p_i; this machine has costs equal to the lengths of the codewords used by the algorithm. There are two parameterized states in this machine: l_x means that there is a literal codeword under construction with x spaces still available; c_y means that a copy is in progress with y characters remaining to copy. The idle state is l_0 = c_0. In the non-deterministic machine, the possible transitions are:

p_*(2): l_0 -> l_15      start a literal
p_*(1): l_x -> l_(x-1)   continue a literal   (x >= 1)
p_i(2): *   -> c_(i-1)   start a copy
p_*(0): c_y -> c_(y-1)   continue a copy                             (2)

(An asterisk is used as a wild card to denote any state.) Based on the theorems above we have already eliminated some transitions to simplify what follows. For example,

p_*(2): c_y -> l_15      start a literal from inside a copy (y >= 1)  (3)

is unnecessary. The deterministic machine, given below, eliminates many more transitions:

p_i(2): l_0 -> l_15      start a literal      if i <= 1
p_i(1): l_x -> l_(x-1)   continue a literal   if x >= 1 and i <= 2
p_i(2): l_x -> c_(i-1)   start a copy         if i >= 3, or x = 0 and i = 2
p_i(0): c_y -> c_(y-1)   continue a copy                             (4)

Finally, we add one more machine to guarantee that the strings of p_i are realistic. In this machine, state s_i means that the previous character was p_i, so the index of the next character must be at least i - 1:

p_j: s_i -> s_j   (j >= i - 1)                                       (5)

The cross product of these three machines has approximately 17K states and was analyzed mechanically to prove a minimum ratio cycle of 4/5. Thus the policy we have chosen is never off by more than 25 percent, and the worst case is realized on a string that repeats a p_i pattern as follows:
`
`8
`
`p4
`
`9
`
`p3
`
`10
`
`pz
`
`11
`
`p1
`
`12
`
`p2
`
`13
`
`14
`
`p10
`
`p10
`
`15
`
`p9
`
`*.
`
`.
`
`(6)
`
(There is nothing special about 10; it was chosen to illustrate a long copy and to match the example in Appendix A.) The deterministic algorithm takes a copy of length 10 in the first position, and then switches to a literal for positions 11 and 12. Five bytes are used in each repetition of the pattern. The optimal solution is one position out of phase. It takes a copy of length 10 in the second position, and then finds a copy of length 2 at position 12, for a total of four bytes on each iteration.
We have abstracted the problem so that the possible copy operations are described by a string of p_i, and we have shown a pathological pattern of p_i that results in 5/4 of the optimal encoding. There might still be some doubt that such a string exists, since the condition that our third machine (5) guarantees, F(i + 1) >= F(i) - 1, is a necessary but not sufficient condition. Nevertheless, the details of an actual pathological string can be found in Appendix A.
`
4. A SIMPLER DATA STRUCTURE
Although the quantity of code associated with A1 is not enormous, it is complicated, and the data structures are fairly large. In this section, we present simpler methods for finding the suffix and for propagating the window position.
The alternative to a percolating update is to update the positions in all nodes back to the root whenever a new leaf is inserted. Then no updates are needed when nodes are deleted. The update flags can be eliminated.
The alternative to suffix pointers is more complicated. The cost of movement in a tree is not uniform; moving deeper requires a hash table lookup, which is more expensive than following a parent pointer. So we can determine the suffix by starting at the suffix leaf and following parent pointers back toward the root un-
`
`
til the suffix node is reached. The suffix leaf is known because the string at i matched the string at some earlier window position j; the suffix leaf j + 1 is the next entry in the leaf array. With this change, the suffix pointers can be eliminated.
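A sketch of this replacement for suffix pointers (ours; leaves is the circular array of leaves indexed by window position, and nodes carry the parent and level fields shown earlier):

    def find_suffix_node(leaves, j, target_level):
        # The current string matched the string at window position j, so the
        # desired suffix lies on the path above leaf j + 1.  Parent moves are
        # cheap compared with hashed downward moves, but the worst case for a
        # file of size n and cut-off depth d is O(nd).
        node = leaves[j + 1]
        while node.level > target_level:
            node = node.parent
        return node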
From a theoretical perspective, these modifications, which have O(nd) worst case performance for a file of size n and cut-off depth d, are inferior to the O(n) performance of the suffix tree. For A1, with a cutoff of 16, these modifications improve average performance, but the A2 method discussed in the next section has such a deep cut-off that suffix pointers and percolated updates are preferable.
`
5. A MORE POWERFUL ENCODING
The 4,096-byte window of A1 is roughly optimal for fixed size copy and literal codewords. Longer copies would, on average, be found in a larger window, but a larger displacement field would be required to encode them. To exploit a larger window, we must use a variable-width encoding statistically sensitive to the fact that recent window positions are more likely to be used by copy codewords than those positions further back. Similarly, it is advantageous to use variable-width encodings for copy and literal lengths.
There are several approaches we might use for variable-length encoding. We could use fixed or adaptive Huffman coding, arithmetic encoding, a variable-length encoding of the integers, or a manageable set of hand-designed codewords. We eliminated from consideration adaptive Huffman and arithmetic coding because they are slow. Moreover, we felt they would provide (at best) a secondary adaptive advantage since the "front end" textual substitution is itself adapting to the input. We experimented with a fixed Huffman encoding, a hand-designed family of codewords, and a variable-length encoding of the integers, so we will compare these options briefly:
`
Hand-Designed Codewords. This is a direct generalization of A1, with short copies that use fewer bits but cannot address the full window, and longer copies that can address larger blocks further back in the window. With a few codewords, this is fast and relatively easy to implement.