throbber
RESEARCH CONTRlBllTlONS
`
`Algorithms and
`
`Data Structures Data Compression with Finite
`Windows
`
`Daniel Sleator
`Editor
`
`EDWARD R. FIALA and DANIEL H. GREENE
`
`Sevt,ral methods are presented for adaptive,
`ABSTRACT:
`invertible data complression in the style of Lempel’s and
`Ziv’s first textual substitution proposal. For the first two
`methods, the article describes modifications of McCreight’s
`suffix tree data strut ture that support cyclic maintenance of
`a window on the most recent source characters. A
`the
`percolating update i:; used to keep node positions within
`window, and the updating process is shown to have constant
`amortized cost. Other methods explore the tradeoffs between
`compression time, expansion time, data structure size, and
`amount of compression achieved. The article includes a
`graph-theoretic analysis of the compression penalty
`incurred
`by our codeword selt’ction policy in comparison with an
`optimal policy, and It includes empirical studies of the
`performance of various adaptive compressors from the
`literature.
`
`1. INTRODUCTION
`its repre-
`Compression
`is the coding of data to minimize
`sentation.
`In this al,ticle, we are concerned with
`fast,
`one-pass, adaptive,
`invertible
`(or lossless) methods of
`digital compression which have reasonable memory
`requirements. SUCK. methods can be used, for example,
`to reduce
`the storal;e requirements
`for files, to increase
`the communication
`rate over a channel, or to reduce
`redundancy
`prior 11) e.ncryption
`for greater security.
`By “adaptive” WE mean that a compression method
`should be widely a:,plicable
`to different kinds of source
`data. Ideally,
`it sho,lld adapt rapidly
`to the source to
`achieve significant
`compression on small
`files, and it
`should adapt to an]’ subsequent
`internal
`changes in the
`nature of the SOUKI?. In addition,
`it should achieve very
`high compression asymptotically
`on large regions with
`stationary
`statistics
`in this article
`All
`the compression methods developed
`compressor
`are substitutional. T:rpically,
`a substitutional
`functions by replac: ng large blocks of text with shorter
`references
`to earlie:, occurrences of identical
`text
`[3, 5,
`29, 34, 36, 39, 41-4:].
`(This is often called Ziv-Lempel
`compression,
`in ret Jgnition of their pioneering
`ideas.
`Ziv and Lempel,
`in fact, proposed
`two methods. The
`unqualified
`use of the phrase “Ziv-Lempel
`compres-
`sion” usually
`refers to their second proposal
`[43]. In this
`
`0,989
`
`ACM OOOl-0782/89,‘03OO-0490
`
`$1.50
`
`first
`concerned with. their
`article, we will be primarily
`proposal
`[42].) A popular alternative
`to a substitutional
`compressor
`is a statistical
`compressor. A symbolwise
`statistical compressor
`functions by accurately predict-
`ing the probability
`of individual
`symbols, an.d then en-
`coding these symbols with space close to -log, of the
`predicted probabilities.
`The encoding
`is accomplished
`with either Huffman compression
`[17] which has re-
`cently been made one-pass and adaptive
`[ll, 22, 371,
`or with arithmetic
`coding, as described
`in [l, 14, 20, 25,
`26, 31-331. The major challenge of a statistical com-
`pressor is to predict
`the symbol probabilities.
`Simple
`strategies, such as keeping zero-order
`(single symbol) or
`first-order
`(symbol pair) statistics of the input, do not
`compress English
`text very well. Several authors have
`had success gathering higher-order
`statistics, but this
`necessarily
`involves higher memory costs an.d addi-
`tional mechanisms
`for dealing with situations where
`higher-order
`statistics are not available
`[6, 7: 261.
`It is hard to give a rigorous
`foundation
`to the substi-
`tutional vs. statistical distinction
`described above. Sev-
`eral authors have observed
`that statistical methods can
`be used to simulate
`textual substitution,
`suggesting
`that
`the statistical
`category
`includes
`the substitutional
`cate-
`gory [4, 241. However,
`this takes no account of the sim-
`plicity of mechanism;
`the virtue of textual substitution
`is that it recognizes and removes coherence on a large
`scale, oftentimes
`ignoring
`the smaller scale statistics. As
`a result, most textual substitution
`compressors pro-
`cess their compressed
`representation
`in larger blocks
`than their statistical counterparts,
`thereby gaining a sig-
`nificant speed advantage.
`It was previously
`believed
`that the speed gained by textual substitution would
`necessarily cost something
`in compression achieved.
`We were surprised
`to discover
`that with careful atten-
`tion to coding,
`textual substitution
`compressors can
`mat’ch the compression performance
`of the best statisti-
`cal methods.
`im-
`scheme, which we will
`Consider
`the following
`prove
`later in the article. Compressed
`files contain
`two
`types of codewords:
`
`literal x
`
`copy x, -y
`
`pass the next x characters direc-tly
`uncompressed output
`go back y characters
`in the output and copy
`x characters
`forward
`to the current position.
`
`into
`
`the
`
`Communications of the .4CM
`
`April 1989 Volume 3;! Number 4
`
`Apple Inc. Exhibit 1004 Page 1
`
`

`
`So, for example, the following piece of literature:
`
`IT WAS THE BEST OF TIMES,
`IT WAS THE WORST OF TIMES
`
`would compress to
`
`WAS THE BEST OF TIMES,
`(literal26)IT
`(copy 11-26)(literal
`3)WOR(copy 11-27)
`
`The compression achieved depends on the space re-
`quired for the copy and literal codewords. Our simplest
`scheme, hereafter denoted Al, uses 8 bits for a literal
`codeword and 16 for a copy codeword. If the first 4 bits
`are 0, then the codeword is a literal; the next 4 bits
`encode a length x in the range [l . . 161 and the follow-
`ing x characters are literal (one byte per character).
`Otherwise, the codeword is a copy; the first 4 bits
`encode a length x in the range [2 . . 161 and the next
`12 bits are a displacement y in the range [l . .4096]. At
`each step, the policy by which the compressor chooses
`between a literal and a copy is as follows: If the com-
`pressor is idle (just finished a copy, or terminated a
`literal because of the 16-character limit), then the long-
`est copy of length 2 or more is issued; otherwise, if the
`longest copy is less than 2 long, a literal is started. Once
`started, a literal is extended across subsequent charac-
`ters until a copy of length 3 or more can be issued or
`until the length limit is reached.
`Al would break the first literal in the above example
`into two literals and compress the source from 51 bytes
`down to 36. Al
`is close to Ziv and Lempel’s first textual
`substitution proposal 1421. One difference is that Al
`uses a separate literal codeword, while Ziv and Lempel
`combine each copy codeword with a single literal char-
`acter. We have found it useful to have longer literals
`during the startup transient; after the startup, it is bet-
`ter to have no literals consuming space in the copy
`codewords.
`Our empirical studies showed that, for source code
`and English text, the field size choices for Al are good;
`reducing the size of the literal length field by 1 bit
`increases compression slightly but gives up the byte-
`alignment property of the Al codewords. In short, if
`one desires a simple method based upon the copy and
`literal idea, Al
`is a good choice.
`Al was designed for &bit per character text or pro-
`gram sources, but, as we will see shortly, it achieves
`good compression on other kinds of source data, such as
`compiled code and images, where the word model does
`not match the source data particularly well, or where
`no model of the source is easily perceived. Al
`is, in
`fact, an excellent approach to general purpose data
`compression. In the remainder of this article, we will
`study Al and several more powerful variations.
`
`2. OVERVIEW OF THE DATA STRUCTURE
`The fixed window suffix tree of this article is a modifi-
`cation of McCreight’s suffix tree [28] (see also [21, 34,
`38]), which is itself a modification of Morrison’s PATRI-
`CIA tree [30], and Morrison’s tree is ultimately based on
`a Trie data structure [22, page 4811. We will review
`each of these data structures briefly.
`
`Research Contributions
`
`A Trie is a tree structure where the branching occurs
`according to “digits” of the keys, rather than according
`to comparisons of the keys. In English, for example, the
`most natural “digits” are individual
`letters, with the Zth
`level of the tree branching according to the Ith letter of
`the words in the tree.
`In Figure 1, many internal nodes are superfluous,
`having only one descendant. If we are building an
`index for a file, we can save space by eliminating
`the
`superfluous nodes and putting pointers to the file into
`the nodes rather than including characters in the data
`structure. In Figure 2, the characters in parentheses are
`not actually represented in the data structure, but they
`can be recovered from the (position, level) pairs in the
`nodes. Figure 2 also shows a suffix pointer (as a dark
`right arrow) that will be explained later.
`Figure 2 represents some, but not all, of the innova-
`tions in Morrison’s PATRICIA trees. He builds the trees
`with binary “digits” rather than full characters, and this
`allows him to save more space by folding the leaves
`into the internal nodes. Our “digits” are bytes, so the
`branching factor can be as large as 256. Since there are
`rarely 256 descendants of a node, we do not reserve
`that much space in each node, but instead hash the
`arcs. There is also a question about when the strings in
`parentheses are checked in the searching process. In
`what follows, we usually check characters immediately
`when we cross an arc. Morrison’s scheme can avoid file
`access by skipping the characters on the arcs, and doing
`only one file access and comparison at the end of the
`search. However, our files will be in main memory, so
`this consideration is unimportant. We will use the sim-
`plified tree depicted in Figure 2.
`For Al, we wish to find the longest (up to 16 charac-
`ter] match to the current string beginning anywhere in
`the preceding 4096 positions. If all preceding 4096
`strings were stored in a PATRICIA tree with depth
`d = 16, then finding this match would be straightfor-
`ward. Unfortunately,
`the cost of inserting these strings
`can be prohibitive,
`for if we have just descended d
`levels in the tree to insert the string starting at position
`i then we will descend at least d - 1 levels inserting the
`string at i + 1. In the worst case this can lead to O(nd)
`
`ASTRAY
`
`ASTRIDE
`
`FIGURE 1. A Trie
`
`April 1989 Volume 32 Number 4
`
`Communications of the ACM
`
`491
`
`Apple Inc. Exhibit 1004 Page 2
`
`

`
`Research Contributions
`
`insertion time for a file of size n. Since later encodings
`will use much larger values for d than 16, it is impor-
`tant to eliminate 11 from the running time.
`To insert the strings in O(n) time, McCreight added
`additional suffix pointers to the tree. Each internal
`node, representing the string aX on the path from the
`root to the internal node, has a pointer to the node
`representing X, th: string obtained by stripping a single
`letter from the beginning of ax. If a string starting at i
`has just been inse:.ted at level d we do not need to
`return to the root to insert the string at i + 1; instead,
`a nearby suffix po .nter will
`lead us to the relevant
`branch of the tree.
`Figure 3 shows how suffix links are created and used.
`On the previous it oration, we have matched the string
`aXY, where a is a single character, X and Y are strings,
`and b is the first u nmatched character after Y. Figure 3
`shows a complicat sd case where a new internal node,
`(Y, has been added to the tree, and the suffix link of (Y
`must be computed. We insert the next string XYb by
`going up the tree tl, node ,f3, representing the string ax,
`and crossing its su:fix link to y, representing X. Once
`we have crossed t1.e suffix link, we descend again in
`the tree, first by “rsscanning”
`the string Y, and then by
`“scanning” from 6 -rntil the new string is inserted. The
`first part is called ‘ rescanning” because it covers a por-
`tion of the string that was covered by the previous
`insert, and so it dol?s not require checking the internal
`strings on the arcs. (In fact, avoiding these checks is
`essential to the linear time functioning of the algo-
`rithm.) The rescan either ends at an existing node 6, or
`6 is created to insermt the new string XYb; either way we
`have the destination for the suffix link of LY. We have
`restored the invariant
`that every internal node, except
`possibly the one ju:;t cfreated, has a suffix link.
`For the Al compressor, with a 4096-byte fixed win-
`dow, we need a way to delete and reclaim the storage
`for portions of the I uffix tree representing strings fur-
`ther back than 4096 in the file. Several things must be
`added to the suffix tree data structure. The leaves of
`the tree are placed in a circular buffer, so that the
`oldest leaf can be identified and reclaimed, and the
`internal nodes are given “son count” fields. When an
`
`File:
`
`ZTRI
`
`DE
`
`9
`10
`ASTRAY
`
`(SW >
`
`FIGURE 2. A DATRICIA Tree with a Suffix Pointer
`
`R
`
`\ x
`\ \
`
`aX 1
`I
`
`FIGURE 3. Building a Suffix Tree
`
`internal “son count” falls to one, the node is deleted
`and two consecutive arcs are combined. In Section 3, it
`is shown that this approach will never leave a “dan-
`gling” suffix link pointing to deleted nodes. .Unfortu-
`nately, this is not the only problem in maintaining a
`valid suffix tree. The modifications
`that avo:ided a re-
`turn to the root for each new insertion create havoc for
`deletions. Since we have not always returned to the
`root, we may have consistently entered a branch of the
`tree sideways. The pointers (to strings in the 4096-byte
`win.dow) in the higher levels of such a branc:h can be-
`come out-of-date. However, traversing the b:ranch and
`updating the pointers would destroy any advantage
`gained by using the suffix links.
`We can keep valid pointers and avoid extensive up-
`dating by partially updating according to a percolating
`update. Each internal node has a single “update” bit. If
`the update bit is true when we are updating a node,
`then we set the bit false and propagate the update re-
`cursively to the node’s parent. Otherwise, we set the bit
`true and stop the propagation. In the worst case, a long
`string of true updates can cause the update to propagate
`to the root. However, when amortized over all new
`leaves, the cost of updating is constant, and the effect of
`upd,ating is to keep all internal pointers on positions
`within
`the last 4096 positions of the file. These facts
`will be shown in Section 3.
`We can now summarize the operation of the inner
`loop, using Figure 3 again. If we have just created node
`LY, then we use (Y’S parent’s suffix link to find y. From y
`we move down in the tree, first rescanning, a.nd then
`scanning. At the end of the scan, we percolate an up-
`date from the leaf, moving toward the root, setting the
`posnion fields equal to the current position, and setting
`the update bits false, until we find a node wit-h an
`update bit that is already false, whereupon we set that
`node’s update bit true and stop the percolation. Finally,
`we go to the circular buffer of leaves and replace the
`oldest leaf with the new leaf. If the oldest leaf’s parent
`has only one remaining son, then it must also be de-
`
`492
`
`Communications of the 1lCA4
`
`April 1989 Volume 32 Number 4
`
`Apple Inc. Exhibit 1004 Page 3
`
`

`
`Research Contributions
`
`leted; in this case, the remaining son is attached to its
`grandparent, and the deleted node‘s position is perco-
`lated upward as before, only at each step the position
`being percolated and the position already in the node
`must be compared and the more recent of these sent
`upward in the tree.
`
`CONSIDERATIONS
`3. THEORETICAL
`The correctness and linearity of suffix tree construction
`follows from McCreight’s original paper [28]. Here we
`will concern ourselves with the correctness and the
`linearity of suffix tree destruction-questions
`raised in
`Section 2.
`
`THEOREM 1. Delefing leaves in FIFO order and delefing
`infernal nodes with single sons will never leave dangling
`suffix pointers.
`
`PROOF. Assume the contrary. We have a node a!
`with a suffix pointer to a node 6 that has just been
`deleted. The existence of (Y means that there are at
`least two strings that agree for 1 positions and then
`differ at 1 + 1. Assuming that these two strings start at
`positions i and j, where both i and j are within
`the
`window of recently scanned strings and are not equal
`to the current position, then there are are two even
`younger strings at i + 1 and j + 1 that differ first at 1.
`This contradicts the assumption that 6 has one son. (If
`either i or j are equal to the current position, then (r is
`a new node, and can temporarily be without a suffix
`pointer.)
`
`There are two issues related to the percolating update:
`its cost and its effectiveness.
`
`THEOREM 2. Each percolated update has constant amor-
`tized cost.
`
`PROOF. We assume that the data structure contains
`a “credit” on each internal node where the “update”
`flag is true. A new leaf can be added with two “credits.”
`One is spent immediately
`to update the parent, and the
`other is combined with any credits remaining at the
`parent to either: 1) obtain one credit to leave at the
`parent and terminate the algorithm or 2) obtain two
`credits to apply the algorithm recursively at the parent.
`This gives an amortized cost of two updates for each
`new leaf.
`
`For the next theorem, define the “span” of a suffix tree
`to be equal to the size of its fixed window. So far we
`have used examples with “span” equal to 4096, but the
`value is flexible.
`
`THEOREM 3. Using the percolating update, every infer-
`nal node will be updated at least once during every period of
`length “span .”
`
`It is useful to prove the slightly stronger re-
`PROOF.
`sult that every internal node (that remains for an entire
`period) will be updated twice during a period, and thus
`propagate at least one update to its parent. To show a
`contradiction, we find the earliest period and the node
`
`p farthest from the root that does not propagate an
`update to its parent. If /3 has at least two children
`that
`have remained for the entire period, then p must have
`received updates from these nodes: they are farther
`from the root. If B has only one remaining child, then
`it must have a new child, and so it will still get
`two updates. (Every newly created arc causes a son
`to update a parent, percolating if necessary.) Similarly,
`two new children also cause two updates. By every
`accounting, p will receive two updates during the
`period, and thus propagate an update-contradicting
`our assumption of /3’s failure to update its parent.
`There is some flexibility on how updating is handled.
`We could propagate the current position upward before
`rescanning, and then write the current position into
`those nodes passed during the rescan and scan; in this
`case, the proof of Theorem 3 is conservative. Alterna-
`tively, a similar, symmetric proof can be used to show
`that updating can be omitted when new arcs are added
`so long as we propagate an update after every arc is
`deleted. The choice is primarily a matter of implemen-
`tation convenience, although the method used above is
`slightly faster.
`The last major theoretical consideration is the effec-
`tiveness of the Al policy in choosing between literal
`and copy codewords. We have chosen the following
`one-pass policy for Al: When the encoder is idle, issue
`a copy if it is possible to copy two or more characters;
`otherwise, start a literal. If the encoder has previously
`started a literal, then terminate the literal and issue a
`copy only if the copy is of length three or greater.
`Notice that this policy can sometimes go astray. For
`example, suppose that the compressor is idle at position
`i and has the following copy lengths available at subse-
`quent positions:
`
`i
`1
`
`i+l
`3
`
`i+2
`16
`
`i+3
`15
`
`i+4
`14
`
`i+5
`13
`
`(1)
`
`Under the policy, the compressor encodes position i
`with a literal codeword, then takes the copy of length 3,
`and finally
`takes a copy of length 14 at position i + 4. It
`uses 6 bytes in the encoding:
`
`(literal
`
`l)X(copy 3 - y)(copy 14 - y)
`
`If the compressor had foresight it could avoid the
`copy of length 3, compressing the same material into
`5 bytes:
`
`[literal 2)XX(copy 16 - y)
`
`The optimal solution can be computed by dynamic
`programming [36]. One forward pass records the length
`of the longest possible copy at each position (as in equa-
`tion 1) and the displacement for the copy (not shown in
`equation 1). A second backward pass computes the
`optimal way to finish compressing the file from each
`position by recording the best codeword to use and the
`length to the end-of-file. Finally, another forward pass
`reads off the solution and outputs the compressed file.
`However, one would probably never want to use dy-
`
`April 1989 Volume 32 Number 4
`
`Communications of the ACM
`
`493
`
`Apple Inc. Exhibit 1004 Page 4
`
`

`
`Research Contributions
`
`namic programming since the one-pass heuristic is a lot
`faster, and we estimated for several typical files that
`the heuristically compressed output was only about
`1 percent larger than the optimum. Furthermore, we
`will show in the Iemiainder of this section that the size
`of the compressed file is never worse than % the size of
`the optimal solution for the specific Al encoding.
`This will require ,leveloping some analytic tools, so the
`non-mathematics.
`re!ader should feel free to skip to
`Section 4.
`
`The following def nitions are useful:
`
`F(i) is the longest feasible copy at position i
`
`Definition.
`in the file.
`Sample F(i)‘s were, given above in equation 1. They are
`dependent on the encoding used. For now, we are as-
`suming that they iire limited
`in magnitude to 16, and
`must correspond t3 copy sources within
`the last 4096
`characters.
`
`B(i 1 is the size of the best way to compress
`Definition.
`the remainder of thl, file, starting at position i.
`B(i)‘s would be computed in the reverse pass of the
`optimal algorithm outlined above.
`
`The following T neorems are given without proof:
`
`THEOREM. F(i t 1) 1 F(i) - 1.
`
`THEOREM. Then? exists an optimal solution where copies
`are longest possible (i.e., only copies corresponding to F(i)‘s
`are used).
`
`THEOREM. B(i) IS monotone decreasing.
`
`THEOREM. Any :;olution can be modified, without affect-
`ing length, so that (literal x1) followed
`immediately by
`(literal x2) implies lhaf x1 is maximum (in this case 26).
`We could continue to reason in this vein, but there is
`an abstract way of looking at the problem that is both
`clearer and more general. Suppose we have a non-
`deterministic
`finite! automaton where each transition
`given a cost. A simple example is shown in Figure 4.
`The machine accepts (a + b)*, with costs as shown in
`parentheses.
`The total cost of accepting a string is the sum of the
`
`is
`
`Start
`
`FIGURE 4. A Nondeierministic Automaton with Transition Costs
`
`transition costs for each character. (While it is not
`important to our problem, the optimal solution can be
`computed by forming a transition matrix for each let-
`te:r, using the costs shown in parentheses, and then
`multiplying
`the matrices for a given string, treating the
`coefficients as elements of the closed semiring with op-
`erations of addition and minimization.) We can obtain a
`solution that approximates the minimum by deleting
`transitions in the original machine until it becomes a
`deterministic machine. This corresponds to choosing a
`policy in our original data compression pro’blem. A pol-
`icy for the machine in Figure 4 is shown in Figure 5.
`We now wish to compare, in the worst case, the dif-
`ference between optimally accepting a string with the
`non-deterministic machine, and deterministically ac-
`cepting the same string with the “policy” machine. This
`is (done by taking a cross product of the two machines,
`as shown in Figure 6.
`In Figure 6 there are now two weights on each transi-
`tion; the first is the cost in the non-deterministic graph,
`and the second is the cost in the policy graph. Asymp-
`totically,
`the relationship of the optimal solution to the
`policy solution is dominated by the smallest ratio on a
`cycle in this graph. In the case of Figure 6, there is a
`cycle from 1, 1’ to 1, 2' and back that has cost in the
`no:n-deterministic graph of 2 + 1 = 3, and cost in the
`policy graph of 3 + 3 = 6, giving a ratio of Vi. That is,
`
`a (1)
`
`>
`
`FIGURE 5. A Deterministic “Policy” Automaton for Figure 4
`
`the policy solution can be twice as bad as the optimum
`on t.he string ababababab. , . .
`In general, we can find the cycle with the smallest
`ratio mechanically, using well known techni.ques [8,
`271. The idea is to conjecture a ratio r and th.en reduce
`the pairs of weights (x, y) on the arcs to single weights
`x - ry. Under this reduction, a cycle with zero weight
`has ratio exactly r. If a cycle has negative weight, then r
`is too large. The ratio on the negative cycle is used as a
`new conjecture, and the process is iterated. (Negative
`cycl.es are detected by running a shortest path algo-
`rithm and checking for convergence.) Once we have
`found the minimum ratio cycle, we can create a worst
`case string in the original automata problem by finding
`a path from the start state to the cycle and then repeat-
`ing .the cycle indefinitely. The ratio of the costs of ac-
`
`494
`
`Communications of the AC.M
`
`April 1989 Volume 32 Number 4
`
`Apple Inc. Exhibit 1004 Page 5
`
`

`
`Start
`
`Research Contributions
`
`below, eliminates many more transitions:
`
`if i 5 1
`115 start a literal
`lo ‘2
`l,-, continue a literal
`if x 2 1 and i I z
`1, “3
`if i 5 3 or x = 0 and i = 2
`I, “3 ci-I start a copy
`Pm
`cf +
`cYml continue a copy
`
`that
`to guarantee
`Finally, we add one more machine
`the strings of pi are realistic.
`In this machine, state si
`means that the previous character was pi, so the index
`of the next character must be at least pi-l:
`
`Si -% S,
`
`(j 2
`
`i - 1)
`
`(51
`
`The cross product of these three machines has ap-
`proximately
`17K states and was analyzed mechanically
`to prove a minimum
`ratio cycle of VS. Thus the policy
`we have chosen is never off by more than 25 percent,
`and the worst case is realized on a string
`that repeats a
`pi pattern as follows:
`
`FIGURE 6. The Cross Product
`
`7
`6
`5
`4
`3
`2
`1
`PI0 PI0 Pg Pa P7 Ps P5
`
`and determin-
`the string non-deterministically
`cepting
`istically will converge
`to the ratio of the cycle. (The
`path taken
`in the cross product graph will not necessar-
`ily bring us to the same cycle, due to the initial path
`fragment; we will, nevertheless,
`do at least as well.)
`Conversely,
`if we have a sufficiently
`long string with
`non-deterministic
`to deterministic
`ratio r, then the
`string will eventually
`loop in the cross product graph. If
`we remove
`loops with
`ratio greater
`than r we only im-
`prove the ratio of the string, so we must eventually
`find
`a loop with
`ratio at least as small as r.
`The above discussion gives us an algorithmic way of
`analyzing our original data compression problem. The
`possible values of F(i) are encoded
`in a 17 character
`alphabet pO . . . p16, representing
`the length of copy
`available at each position. The compression algorithm
`is described by a non-deterministic
`machine
`that ac-
`cepts strings of pi; this machine has costs equal
`to the
`lengths of the codewords used by the algorithm. There
`are two parameterized
`states in this machine: 1, means
`that there is a literal codeword under construction with
`x spaces still available;
`cY means that a copy is in prog-
`ress with y characters
`remaining
`to copy. The idle state
`is IO = co. In the non-deterministic
`machine,
`the possi-
`ble transitions are:
`Pm
`-
`lo
`P.(l)
`+
`1,
`l
`Pi@)
`*
`---*
`Ci-I
`P*W
`c, --* c,-,
`
`115
`I,-,
`
`start a literal
`continue a literal
`start a copy
`continue a copy
`
`(x 2 1)
`
`(2)
`
`is used as a wild card to denote any state.)
`(An asterisk
`Based on the theorems above we have already elimi-
`nated some transitions
`to simplify what
`follows. For
`example,
`
`c, “3 115 start a literal
`
`from
`
`inside a copy (y 2 1)
`
`(3)
`
`is unnecessary. The deterministic machine, given
`
`8
`
`p4
`
`9
`
`p3
`
`10
`
`pz
`
`11
`
`p1
`
`12
`
`p2
`
`13
`
`14
`
`p10
`
`p10
`
`15
`
`p9
`
`*.
`
`.
`
`(6)
`
`is nothing special about 10; it was chosen to
`(There
`illustrate a long copy and to match
`the example
`in
`Appendix A.) The deterministic
`algorithm
`takes a copy
`of length 10 in the first position, and then switches
`to a
`literal
`for positions 11 and 12. Five bytes are used in
`each repetition of the pattern. The optimal solution
`is
`one position out of phase. It takes a copy of length 10 in
`the second position, and then finds a copy of length 2 at
`position 12, for a total of four bytes on each iteration.
`We have abstracted
`the problem so that the possible
`copy operations are described by a string of pi, and we
`have shown a pathological
`pattern of p, that results
`in
`% of the optimal encoding. There might still be some
`doubt that such a string exists, since the condition
`that
`our third machine
`(5) guarantees, F(i + 1) 1 F(i)
`- 1, is
`a necessary but not sufficient condition. Nevertheless,
`the details of an actual pathological
`string can be found
`in Appendix A.
`
`4. A SIMPLER DATA STRUCTURE
`is not
`Although
`the quantity of code associated with Al
`enormous,
`it is complicated,
`and the data structures are
`fairly
`large. In this section, we present simpler methods
`for finding
`the suffix and for propagating
`the window
`position.
`is to update
`update
`to a percolating
`The alternative
`the positions
`in all nodes back to the root whenever a
`new leaf is inserted. Then no updates are needed when
`nodes are deleted. The update
`flags can be eliminated.
`The alternative
`to suffix pointers
`is more compli-
`cated. The cost of movement
`in a tree is not uniform;
`moving deeper requires a hash table lookup, which
`is
`more expensive
`than following a parent pointer. So we
`can determine
`the suffix by starting at the suffix
`leaf
`and following parent pointers back toward
`the root un-
`
`April 1989 Volume 32 Number 4
`
`Communications of the ACM
`
`495
`
`Apple Inc. Exhibit 1004 Page 6
`
`

`
`Research Contributions
`
`til the suffix node is reached. The suffix leaf is known
`because the strin,; at i matched the string at some ear-
`lier window position j; the suffix leaf j + 1 is the next
`entry in the leaf array. With this change, the suffix
`pointers can be e: iminated.
`From a theoretical perspective, these modifications,
`which have O(nd I worst case performance for a file of
`size n and cut-off depth d, are inferior to the O(n) per-
`formance of the suffix tree. For Al, with a cutoff of 16,
`these modifications
`improve average performance, but
`the A2 method discussed in the next section has such a
`deep cut-off that : uffix pointers and percolated updates
`are preferable.
`
`5. A MORE POWERFUL ENCODING
`The 4,096-byte window of Al
`is roughly optimal for
`fixed size copy and literal codewords. Longer copies
`would, on average, be found in a larger window, but a
`larger displacement field would be required to encode
`them. To exploit a larger window, we must use a
`variable-width em:od.ing statistically sensitive to the
`fact that recent window positions are more likely to be
`used by copy codewords than those positions further
`back. Similarly,
`it is advantageous to use variable-
`width encodings for copy and literal lengths.
`There are severid approaches we might use for vari-
`able-length encoding. We could use fixed or adaptive
`Huffman coding, arithmetic encoding, a variable-length
`encoding of the inegers, or a manageable set of hand-
`designed codewords. We eliminated
`from consideration
`adaptive Huffman and arithmetic coding because they
`are slow. Moreove.:, we felt they would provide (at best)
`a secondary adaptive advantage since the “front end”
`textual substitution
`is itself adapting to the input. We
`experimented witf. a fixed Huffman encoding, a hand-
`designed family of codewords, and a variable-length
`encoding of the integers, so we will compare these
`options briefly:
`
`Hand-Designed Coc3ewords. This is a direct generaliza-
`tion of Al, with shllrt copies that use fewer bits but
`cannot address the full window, and longer copies that
`can address larger ‘,locks further back in the window.
`With a few codewords, this is fast and relatively easy to
`implement.

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket