throbber
Data Compression via Textual Substitution
`
`JAMES A. STORER AND THOMAS G. SZYMANSKI
`
`Princeton University, Princeton, New Jersey
`
`Abstract. A general model for data compression which includes most data compression systems in the
`literature as special cases is presented. Macro schemes are based on the principle of finding redundant
`strings or patterns and replacing them by pointers to a common copy. Difierent varieties of macro schemes
`may be defined by specifying the meaning of a pointer, that is, a pointer may indicate a substring of the
`compressed string. a substring of the original string, or it sul:-string ofsome other string such as an external
`dictionary. Other varieties of macro schemes may be defined by restricting the type of overlapping or
`recursion that may be used. Trade-offs between different varieties of macro schemes, exact lower bounds
`on the amount of compression obtainable, and the complexity of encoding and decoding are discussed, as
`well as how the work of other authors relates to this model.
`
`Categories and Subject Descriptors: E14 [Data]: Coding and Information Theory—~dam cofilpflction and
`conrprerrion; F. 1.3 [Computation by Abstract Devices]: Complexity Classes—reducr'bi‘h'ry and corripletenci-i;
`1'-12.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Probleins—parrern
`matching.
`
`General Terms: Algorithms, Theory
`
`Additional Key Words and Phrases: Textual substitution, macro expansion, dictionary, NP-completeness
`
`1. Introduction
`
`On-line secondary storage space is one of the most restricting resources in many
`modern computer installations, particularly in those employing multiuser time-shar-
`ing systems. Fast algorithms for compressing and restoring data files can do much to
`alleviate this problem. Some of the more popular data compression schemes described
`in the literature include statistical encoding techniques such as Huffman codes [8],
`which typically encode a block of source data as a variable-length string of bits
`determined by various statistical properties of the source data; incremental encoding
`methods (e.g., [21, 34]), which typically compress a file by recording only the
`difference between successive records; and textual substitution or macro encoding
`schemes (e.g., [6, 7, 12-14, 19, 20, 25-28, 30, 31, 33, 35, 37-39]), which factor out
`duplicate occurrences of data, replacing the repeated elements with some sort of
`special marker identifying the data to be replaced at that point. In addition, many
`ad-hoc methods for handling data with certain known characteristics appear in the
`literature.
`
`This research was supported in part by the National Science Foundation under Grant MCS 74-2l939 and
`in part by Bell Laboratories.
`Authors’ present addresses: J. A. Storer, Department of Computer Science, Brandeb University, Waltham,
`MA 02254; T. G. Szymanski, Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07914.
`Permission to copy without fee all or put of this material is granted provided that the copies are not made
`or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication
`and its date appear, and notice is given that copying is by permission of the Association for Computing
`Machinery. To copy otherwise, or to republish, requires a fee and]or specific permission.
`© 1982 ACM 0004-5411]82/1000-0928 $00.75
`
`Journal of the Association for Computing Machinery, Vol. 29, No at, October I982, pp 928-95]
`
`APPLE 1016
`
`APPLE 1016
`
`1
`
`

`
`Data Compression via Textual Substitution
`
`929
`
`This paper is devoted exclusively to the properties of the macro model for data
`compression. We study two major types of macro schemes, the types being differ-
`entiated by the location where the factored-out text is stored. Section 2 contains a
`discussion of our model along with some basic definitions, Sections 3 and 4 present
`our results for the two major types of schemes considered, and Section 5 examines
`the relative performance of the various compression schemes introduced in the
`preceding sections. To reduce the size of this paper, NP-completeness results are
`presented in [30]. However, the more important results of [30] are summarized here.
`Before proceeding to the next section, we define the following notation:
`
`(1) If s and s, denote strings and n 2 1 is an integer, sis; denotes the concatenation
`of s1 with S2, H§‘.1 3, denotes 51.92 - -
`- s,,, and 5” denotes I]?..1 S. 5° denotes the
`empty string.
`(2) We use the term collection to mean multiset.‘
`(3) If s is a string, |s| denotes the length of s, and ifs is a collection, fsl denotes the
`number of elements in s (with each element being counted as many times as it
`appears in s).
`(4) We extend the min function to strings by defining
`.
`'f
`
`s» = {:;
`
`5
`
`;.,.l::.L.. '32‘,
`
`(5) For a real number lz, |'h'| denotes the least integer greater than or equal to h.
`
`2. The Model and Basic Definitions
`
`We shall treat the source data as a finite string over some alphabet. With external
`macro schemes, a source string is encoded as a pair of strings, a dictionary and a
`skeleton. The skeleton contains characters of the input alphabet interspersed with
`pointers to substrings of the dictionary. The dictionary is also allowed to contain
`pointers to substrings of the dictionary. The source string is recovered by substituting
`dictionary strings for pointers. With internal macro schemes, a string is compressed
`by replacing duplicate instances of substrings with pointers to other occurrences of
`the same substrings. The result is a single string of characters and pointers.
`Throughout this paper let p 2 1 denote the implementation-dependent size of a
`pointer.” If x is a string containing pointers, the length of x, denoted Ix |, is defined
`to be the number of characters in x plus p times the number of pointers in x. We
`shall treat a pointer as an indivisible object which, in some unspecified fashion,
`uniquely and unambiguously identifies some string which is referred to as the target
`of that pointer. The way a pointer is written is not important; the only assumption
`we make is that it is always possible to determine by inspection of a pointer the
`length of its target?‘ For simplicity we shall write a pointer as a pair (11, m), where 11
`indicates the position of the first character in the target,‘ m indicates the length of the
`target, and |(n, m)} is the pointer size p. Without loss of generality it will always be
`assumed that m > p.
`
`‘ A multiset is a set in which repetitions are allowed For example [a, a, b} is a multiset.
`2 We assume that all pointers within a given string have a uniform size. (Variable-length pointers are
`considered in [30]) We also assume p to be an integer, although our results generalize to nonintegral
`pointer SIZES
`3 It is not always necessary to make this assumption and, in fact, it can be useful to remove it See [30] for
`a dlSCllSS10n of this.
`
`‘ n can be either an absolute location or a displacement. For example, with internal schemes, 1: could be
`the distance from the pointer to its target.
`
`2
`
`

`
`930
`
`J. A. sronnn AND T. G. SZYMANSKI
`
`As an example of these ideas, let p = 1, and consider the string
`
`to = aaBccDaacEaccFacac,
`
`which might be encoded under the external macro model as
`
`J: - aacc#(l, 2)B(3, 2)D(l, 3)E(2, 3)F(2, 2)(2, 2),
`
`where # separates the dictionary from the skeleton. For convenience, we assume
`|#| - 0. The compression achieved by the string x (i.e., the ratio |x |/| w |) is lg. Using
`the internal macro model, w could be encoded as
`
`y = aaBccD(l, 2)cEa(4, 2)Fac(l3, 2),
`
`achieving a compression of 15.
`Implementation considerations motivate us to describe a number of variations on
`our basic models. A scheme is recursive if a macro body (i.e., a string that is a target
`of a pointer) is allowed to itself contain pointers. Two pointers overlap if their targets
`overlap. Whether overlapping pointers are permitted in the external model depends
`highly on the implementation chosen for the dictionary.” An original pointer is one
`which denotes a substring of the original source string, whereas a compressedpointer
`denotes a substring of the compressed representation itself. The string y of the
`previous example contains compressed pointers. Using original pointers, we could
`encode w as
`
`2 -= aaBccD(l, 2)cEa(4, 2)!‘(8, 2)(8, 2),
`
`achieving a compression of {-3. Original pointers are more natural for one-pass
`decoding. Compressed pointers allow the recovery of portions of the source string
`without requiring the implicit decompression of the entire string. A left (right) pointer
`is one which denotes a substring occurring earlier (later) in the string. Considering
`the strings x, y, and 2 presented above, only 1: uses overlapping pointers, only 2 uses
`recursion, and none of these strings use right pointers. By using both left and right
`pointers it is possible to save additional space over the use ofjust one direction. For
`example, using both right and left pointers, the compressed forms y and 2 presented
`above could be replaced by
`
`y = (5, 2)B(l0, 2)DaacEaccF(6, 2)(6, 2),
`z = (7, 2)B(i2, 2)DaacE(8, 2)cF(8, 2)(8, 2),
`
`achieving a compression of it and 13 respectively. We discuss recursion in relation to
`original pointers primarily to study the “power” of various methods. With original
`pointers, a pointer is recursive if all or part of the string it represents is represented
`by a pointer.
`Cycles cannot occur in compressed forms with compressed pointers, but using
`original pointers, cycles can often make sense. For example, the compressed form
`ab(5, 2)a{l, 3) uniquely determines the palindrome crbaaaaba even though the two
`pointers in this compressed form comprise a cycle. Here the pointers (5, 2) and (1, 3)
`are a cycle in the sense that each points to a portion of the string represented by the
`other. An example of a degenerate cycle is given by the compressed form a(l, n),
`which uniquely determines the string a""‘1. Schemes which allow recursion but not
`cycles are said to have topological recursion. From the above discussion it should be
`clear that topological recursion is not necessary for a compressed form to be uniquely
`
`5 Certain implementation considerations can lead to the placement of various restrictions on the kinds of
`overlapping permitted. Some of these restrictions are described in [30, 32].
`
`3
`
`

`
`Data Compression via Textual Substitution
`
`931
`
`decodable. However, it can be useful to consider topological recursion for three
`reasons. First, authors in the past (such as Lempel and Ziv) have assumed this.
`Second, study of such schemes leads to a deeper understanding of the power of
`original pointers. Third, topological recursion may model some practical considera-
`tions in the design of efficient original pointer compression methods.
`The above discussion leads us to formally define four basic macro schemes and
`three types of restrictions which may be placed on any of these schemes. Throughout
`this paper, 2 denotes the underlying alphabet from which the data in question is
`constructed.
`
`Definition 1. A compressed form of a string s using the EPM (external pointer
`macro) scheme is any string t = so#s1 satisfying“
`
`(1) so and s1 consist of characters from E and pointers to substrings of so.
`(2) s can be obtained from s; by performing the following two steps:
`
`(a) Replace each pointer in S1 with its target.
`03) Repeat step A until .91 contains no pointers.
`
`Definition 2. A compressedform of a string s using the CPM (compressedpointer
`macro) scheme is any string t satisfying
`
`t consists of characters from E and pointers to substrings oft.
`(1)
`(2) s can be obtained from t by forming the string stir and then decoding as with the
`EPM scheme.
`
`Definition 3. A compressed form of a string 3 using the OPM (original pointer
`macro) scheme is any string 1 satisfying
`
`t consists of characters from 2 and pointers representing substrings of s.
`(1)
`(2) s can be obtained from r by replacing each pointer (n, m) by the sequence of
`pointers (n, 1), (n + 1, 1), ..., (n + m — 1, 1) and then decoding as with the
`CPM scheme, with the stipulation that pointers are considered to have length 1.
`
`Definition 4. A compressedform of a string .9 using the OEPM (original external
`pointer macro) scheme is any string 1 = so#s1 satisfying
`
`t consists of characters from 2 and pointers.
`(1)
`(2) so may be decoded using the OPM scheme to produce a string r. Furthermore,
`pointers in st point to substrings of r.
`(3) s may be obtained by replacing each pointer in .91 with its target in r.
`
`A contraction of a string s for pointer size p according to a given scheme is a
`shortest compressed form of s using that scheme with pointer size p. A contraction of
`a string s will be denoted by A(s).’ We shall refer to the process of replacing a string
`r by a pointer asfactoring out r and often refer to a string that is a target or potential
`target as afactor.
`
`Definition 5. A CPM (OPM) pointer ql depends on pointer qr,» if the target of q;
`contains 42 (all or part of the string represented by qz) or if there is a pointer q.-5 such
`that qr depends on :13 and qa depends on qz. A macro scheme is restricted to no
`recursion if dependent pointers are forbidden, and to topological recursion if no
`
`6 For convenience we assume throughout this paper that [#1 - 0.
`’A string may have more than one ruinimal-length compressed form. For formal dkcussions we can
`always ensure that A(s) is unique by assuming a lexicographic ordering.
`
`4
`
`

`
`932
`
`I. A. STORER AND T. G. SZYMANSKI
`
`pointer may depend on itself; that is, it must be possible to sort topologically” the
`pointers of a compressed form according to their dependencies.
`
`Definition 6. Two pointers overlap if their targets overlap and strictly overlap if
`their targets overlap but neither target is a substring of the other. A macro scheme is
`restricted to no overlapping if overlapping pointers are forbidden.
`
`Definition 7. A CPM (OPM) pointer q points to the left if the leftmost character
`of its target is to the left of q (the leftmost character of the string represented by q).
`A right pointer is similarly defined. A macro scheme is restricted to unidirectional
`pointers if all pointers must point in the same direction (of course, with the EPM and
`OEPM schemes, this only applies to the external dictionary). As a special case of
`this, we can restrict a macro scheme to have only left or right pointers.
`
`The different combinations of the four basic macro schemes we have defined and
`
`the recursion, overlapping, and pointer direction restrictions provide us with a large
`number of data compression methods. The combinations are sufficiently general to
`cover virtually all of the text substitution schemes proposed in the literature.
`Discussion of the utility and appropriateness of various restrictions to the models are
`deferred until later in the paper.
`We have not discussed the concept of adding to pointers" arguments which allow
`the specification of modifications to be made on a factor before it is substituted.
`Macro schemes with arguments generally have more power than ones without. Also,
`a few data compression methods presented in the literature require a macro scheme
`with arguments to model them, for example, the subsequence and supersequence
`compression methods discussed in [16]. Macro schemes with arguments will not be
`discussed in this paper, but it it should be noted that the macro model can be
`extended to allow this generality.
`
`3. The External Macro Model
`
`The external macro model views the collection of macro bodies as residing outside
`the remainder of the compressed string. This makes external schemes ideal for
`compressing collections of strings using a common dictionary. There are several
`reasons why it is more natural to treat all pointers as compressed pointers when
`discussing this model. First, authors in the past have used the EPM scheme and not
`the OEPM scheme (the authors mentioned in the introduction who used external
`schemes all considered variants of the EPM scheme). Second, it allows us to
`decompress arbitrary portions of the data without first having to produce the entire
`string. Third, compressed pointers often require less space than original pointers. For
`these reasons, we concentrate our attention in this section on the EPM model and
`then indicate how our results can be extended to the OEPM model. As we shall see
`
`in the next section, there are advantages to using original pointers over compressed
`pointers that justify consideration of the OEPM model. In many cases the extension
`of results to the OEPM scheme is trivial, since if recursion is forbidden, original and
`compressed pointers become equivalent in power. Similarly, if overlapping is forbid-
`den, unidirectional and bidirectional pointers become equivalent.
`
`THEOREM 1. For all strings s, only topological recursion is allowed, then (assuming
`.9 is compressible) both lAEPM(S)| and |AoEp_M(.‘u‘)| are
`
`(a) 2p lag2(ls|/p) + 1.9;).
`
`” For a discussion of topological sorting, see [10].
`
`5
`
`

`
`Data Compression via Textual Substitution
`
`933
`
`(b) 23}: loga(|s|/p) - 0.02p when overlapping isforbidden.
`(c) 22(plsI)"" when recursion isforbidden.
`(cl) 22(p|s |)"2 when both recursion and overlapping areforbidden.
`(e)
`(a)—(d) hold even ifpointers are required to be unidirectional.
`
`If nontopologicai recursion is allowed, then
`
`(f) The bounds of (a)—(e) holdfor the EPM scheme.
`
`But
`
`(3) |AoEpM(S)| 1: 2p + 1, independently of what overlapping and pointer direction
`restrictiofls are made.
`
`Furthermore, all if the bounds in (a)—( g) are tight; that is, each is attainedfor infinitely
`many strings s.”
`
`PROOF. For (a)—(e), since |AoEpM(s)| :5 |AEPM(S') |, it is sufficient to show that the
`OEPM scheme satisfies these bounds and the EPM scheme can attain them infinitely
`often.
`
`First let us consider (a). We can assume that s = a"' (for some a in 2) because
`fAogpM(a"')| 5 I AuEpM(.5') |. It is easy to show that for some p < k 5 2p and n,
`
`Ao1n=n(s) = 0* E q.#qu+1.
`
`where q,, 1 5 i < n, points to the string represented by everything to the lefi of it and
`q,.+1 points to some substring of length Is| in the dictionary. Since qt points to a
`string of k characters, we must have It > p, or else a shorter contraction of s could be
`produced. Similarly, if k > 2p, we could produce a shorter contraction by representing
`a" by a"""‘ followed by a pointer to this string. Thus we have
`
`lAonrvM(s)| 2 k + np +p
`2p1og2(|s},’lc) + lc -1-}:
`2 mi11{pI'log2([s|/i)'|+i+p:p < is 2p}
`2p log2(|s[/p) + min{p(l + h -— log2li): l < h 5 2}
`> P1082051/P) + 1-9p-
`
`For anyp < is 2p and n, the bound rnin{pilog2(|s|/i)| + i + ptp < i 5 2p} is
`achieved exactly by the EPM scheme on the strings = a'’'.
`We now consider 0:). Again we can assume that s = a"'. Since overlapping is
`forbidden, the pointers of AonpM(s) can be divided into a sequence of sets S1, . .
`. , S...
`such that the pointers in S., 1 < is m, have targets whose compressed representation
`consists of pointers in some s_,, j < i; in fact, since we are concerned only with worst-
`case performance,’° we can assume that j = i — 1. We can also assume that S.-, 1 _<_ i
`5 m, contains at most three pointers. This is because for any k 7: 4 there are an i and
`j such that 21' + 3f 5. k 5 2'3 ’, and so we can replace a set S. of four or more pointers
`by a sequence of sets having at most three pointers, where the last set in the sequence
`will represent a string at least as long as that represented by the original sequence of
`four or more. Hence, for some x 5 2 we can assume that for all i > x, S; contains
`
`“ Actually, the bound in (a) is just an approximation for the expression min{p|log».x(|s|;i)|+ i + p:p < i
`_<_ 2;}, which is attained exactly infinitely often. Similarly, the bound 111 (b) 15 just an approximation for
`the expression min{3p| loga{| sl/i) I + i: 2p < I 5 4p], which IS attained exactly for infinitely many strings.
`1" It 15 not necessary to consider a compressed form oflength x for a stung a’ if there is a compressed form
`of length <1: (sx) for a string a‘ where z 2 y (z > y).
`
`6
`
`

`
`934
`
`J. A. STORER AND T. G. SZYMANSKI
`
`exactly three pointers. This is because we can assume that the sets of two come first
`and a sequence of three two-pointer sets can be replaced by two three-pointer sets.
`Given this, it can be assumed that for some 2p < k 5 4p, 0 ..<_ M 5 L 5 l, and n,
`Aoiipm(s) is of the form
`
`a*(q?»)‘(qi)”(_Inl2 q?)#q5‘»+u
`
`where qo points to a‘; qr points to qfi; qg points to a”, qfi, or qi, depending on the
`values of L and M; and for 3 5 is n + 1, qt points to q‘:’—i. Thus we have
`
`+
`iAoEpM(.S')| Z k + 31110 +
`2 3p|'log3(|s|/k2L2M)'I + k + 2Lp + 2Mp
`2 3pflog3(|s|/k)'| + 1: + 3p(§ — log32)(L + M)
`3 3PT10g3(|S|/k)'| + k
`2 min{3pflog3(js|/i)‘] + i':2p < is 4p}
`2 3p log;z.(|s|/p) + min{p(h -- 3 log3h):2 < h S 4}
`> 3;! loga(|s|/p) - 0.02;).
`
`For any 2]; < is 4p and n > 1, the bound of rnin[3p|log3(|s|/i)| + i‘:2p < i 5 4p}
`is achieved using the EPM scheme with no overlapping on the strings = aw‘ .
`The proofs for (c) and (d) appear in [30]. All of the proofs of (a)—(d) make use of
`left pointers only, and so (e) follows. (f) follows trivially because compressed pointers
`cannot form cycles, and hence with the EPM scheme all recursion must be topological.
`For (g) we may again assume that s = am for some a in E. If s is compressible using
`the EOPM scheme, then .9 must contain at least two pointers and at least one
`character. Hence 2p + l is a lower bound. It is also tight, since a(l, ls] — l)#(1, ISD
`is a compressed form for s = a"'. D
`
`For topological recursion, Theorem 1 says just what one would expect; there is an
`S2(_p log| s |) lower bound on the size of a compressed string when recursion is allowed
`(log; with overlapping and logs without) and an S2((pIs])1/2) lower bound when
`recursion is not allowed. For nontopological recursion the bound of (g) may seem
`unnatural. Clearly we cannot represent a string of arbitrary length using a constant
`amount of space. The bound of (g) simply illustrates the need for the pointer size to
`be a function of the string size to model situations where pointers may indicate strings
`of arbitrary length.
`It should also be pointed out that the bounds of Theorem 1 apply primarily to
`“pathological” strings; in practice, reducing the size of a file by a small constant
`factor may be very significant. However, much of the utility of Theorem 1 comes
`from the fact that it provides exact bounds which are needed in several of our NP»
`completeness" proofs.
`The next theorem considers encoding algorithms for the EPM model. Wagner [35]
`presents a polynomial-time algorithm for compressing a string, assuming that the
`dictionary of macro bodies is given as input to the encoding algorithm. However, no
`mention is made as to how the selection of the best possible dictionary is accom-
`plished. Several heuristic methods for constructing dictionaries have been presented
`in [20] and [25}. Neither of these guarantees optimal compression or even provides
`bounds on the compression that is obtainable. The reason for this gap in the literature
`is the NP-completeness of finding AEpM(S).
`
`“ For a definition of NP-completeness and related terms, see [1]. All ofour proofs show NP-completeness
`in the sense of Karp [9] (which implies that of [3]).
`
`7
`
`

`
`Data Compression via Textual Substitution
`
`935
`
`THEOREM 2. Given a string s and an integer K, it is NP-complete to determine
`whether IAE;-M/(s)| 5 K in any of the following situations:
`
`(a) both recursion and overlapping allowed;
`(b) recursion allowed, overlappingforbidden;
`(c)
`recursion forbidden, overlapping allowed;
`(d) both recursion and overlappingforbidden;
`(e) unidirectional pointers and any of (a)-(d).
`
`Furthermore, the above are true regardless of whether p is part of the problem input or
`is constrained to be a fixed integer greater than 1. In fact, we show (b) and (d) to be
`true even zfp = l.
`
`It should be clear that no one part of the theorem directly implies any
`PROOF.
`other. Thus several reductions are used. The reductions employed include the node
`cover problem [9], the restricted node cover problem [17], the K-node cover problem
`[30], and the superstring problem [4, 17]. As mentioned in the introduction, proofs of
`all NP-completeness results appear in [30]. However, we shall include the proof of
`(b) and (d) as a “sample proof.”
`
`. , e,.,}), K
`.
`. , v,,}, E = {e1, .
`.
`.
`Proofof(b) and (d)forp >1. Let G = (V= {v1,
`be an instance of the node cover problem, and let p = po. Let S be a special symbol,
`and let @ denote a new, distinct symbol each time it occurs. For v. in V, let V. =
`$v§”‘$, and for e, = (v,, v;,) in E, let E. = $v,”1$v‘;',“$. Now let
`p rt.
`rn
`
`s= (H 11 v;@)(n to).
`
`1-1 1-1
`
`1-1
`
`We claim that G has a node cover of size SK if and only if |A(s)| 5 |s| + K — m.
`First suppose that G has a node cover X g V of size K. We shall construct a
`compressed form t for s (having length |sI + K — m), where t is'of the form
`s0#(]]-§’.1 ]]j'., F7} @)(flI’i1 E. @), where so contains those V, for which V. is in X, and
`I7, is t; if v,- is not in X and a pointer to v, in so if v, is in X. If E. is sv;7“sv*,:"s,
`then E‘.
`is either rv‘,§"1$ or $v,‘1q, where r is a pointer to v, in st; and q is
`a pointer to vs in so. Since X is a node cover, this can always be done. If we now com-
`pute the length of I, Isol = K(p + 1), IHf—1H?—: 77; @l = lH‘»'—1 H5’-1 V,@I -pK,
`and [11:11 E.@| = |f[:’11 E.@| — m. Hence |A(s)] 5 M = ls] + K(p + 1) —pK— m
`=|s| + K - m, as was to be shown.
`Conversely, suppose that |A(s)| < |s| + K — m. We shall show that G has a
`node cover of size at most K. First observe that since overlapping of pointer
`targets is forbidden, no pointer in A(s) can refer, for any strings x and y, to a string
`of the form xv.$v,y, x@y, or x@y, since such a string can occur at most once
`in s and no gain can be achieved by factoring it out. Thus A(s) is of the form
`so#(H{’.1
`".1 I7’,@)(]'[i'i1 E‘,@), where so is a dictionary of macro bodies and the
`1775 and E35 are the shortest compressed forms of the W5 and E.-’s, respectively,
`using so. As mentioned earlier, without loss of generality we are assuming throughout
`this paper that every pointer references a string of length at least p + 1. Thus, since
`| K] = p + I, we can infer that each I7. is either V, itself or a pointer to an occurrence
`of v. in so. Similarly, since |E,| = 2p + 1, each E. must either consist of E. itself or else
`be a string of the form rv,"'$ or $v, '1r, where r is a pointer to some V, in so. Now let
`L be the number of E,’s such that E. = E,, that is, the number of 5'33 that have not
`had a factor removed. Then | ]'[I’.'., E,@| = |]'[;':, E.@| — (m — L), since removing a
`factor from an E, saves one character. Let .1’ be the number of V.-’s in so. Then
`
`8
`
`

`
`936
`
`J. A. sronmr. AND T. G. SZYMANSK1
`
`{sol - 1(1) + l) and |]]l’.1 H}-‘.1 l7}@I == []]‘.-'.1 I]}‘.1 V,-@l - Jp, because each it.» that
`is replaced by a pointer saves one character. Thus |A(s)| - J(p + 1) + l.s| — Jp -
`(m-L)=-[s|+J+L—m,andsoJ+LsK.W'enowclaimthatGhasanode
`cover of size J + L formed by taking the J nodes represented in so and one node
`from each of the L edges not factored in Ms}. Therefore 6? has a node cover of size
`K, as was to be shown.
`
`Proof of(b) and (d)forp = l. The following proof is similar to the original proof
`of this result [30] in that it employs a reduction to the node cover problem for degree
`three graphs. However, although the actual the actual construction is slightly longer,
`the proof of its correctness is simpler. This simplified proof is due to J. Gallant.
`LetG=(V=* {V1,---.V,.},E== {e1,...,e».}),Kbea.ninstanceot'thenodecover
`problem for which all nodes have degree 3." As in the proof for p > I, let 3 be a
`special symbol, and let @ denote a new, distinct symbol each time it occurs. For v,
`in Vlet
`
`V. = Sv.-3
`
`and
`
`W. = $”v.-$2.
`
`and for e. = (v,, v;,) in E let
`
`Now let
`
`E5 = 32:5-S’v.t$2.
`
`II
`
`M
`
`S =
`
`(V=@W-@)’)(II1 E@)-
`
`The claim is that Ghas a node cover ofsize Kifand only if|A(s)l 5 |s| + K— 1411.
`The first half of the proof argues that if X is a minimal node cover for G, then a
`compressed form for s can be constructed as follows:
`
`(I) The two copies of each V. go to a copy of V; in the dictionary and two pointers
`in the skeleton.
`
`(2) The two copies of each Wt go to two copies of 3 followed by a pointer to V.-
`followed by a $ in the skeleton if v. is not in X; otherwise the two copies of W.
`go to W; in the dictionary and two pointers in the skeleton.
`(3) Each E; representing e. = (v,-, vi.) goes to a pointer to 32v,-S’ followed by a pointer
`to $v,t$ followed by 3 if v,- is chosen to cover the edge; otherwise E; goes to $
`followed by a pointer to $v,$ followed by a pointer to $2v;.$’.
`
`A compressed form for s as constructed above saves one character for each pair of
`Vis for a total of it: four characters for each pair of Wis when 1:. is not in X, three
`characters for each pair of W.-’s when 11,- is in X for a total of 4:1 — K, and six
`characters for each E; for a total of 6m. Since In -= gn, this yields |A(s)| 5 Is] -
`n--4n+K-—6m=|sI+K—l4n.
`The other half of the proof requires more work. Here it must be argued that the
`method of compressing S as described above is the best possible. This is done by
`considering several cases that rest heavily on the fact that all nodes in G have degree
`3. The degree-3 restriction makes it unprofitable to factor out many strings that might
`otherwise be factored if nodes with large degrees were present. We leave the details
`of this half of the proof to the reader. Note that the above reduction is for the case
`where recursion is forbidden. If recursion is allowed, the same construction may be
`used, except that one copy of W. should be used instead of two. El
`
`12 Using a result from [SL it is easy to show this restriction of the node cover problem to be NP-complete;
`see [17].
`
`9
`
`

`
`Data Compression via Textual Substitution
`
`937
`
`In [30], cases (b)—(e) of Theorem 2 are shown to hold for the EOPM scheme. Case
`(a) is shown for the problem of compressing collections, but the single string problem
`remains open at the time of the writing of this paper. In addition, in [30] it is
`conjectured that all of (a)—(e) of Theorem 2 can be shown for p = 1.
`Throughout this paper we assume that the alphabets over which strings are written
`are unbounded in size. However, results concerning lower bounds on encoding
`complexity are stronger if they apply to the case where all strings are assumed to be
`written over some fixed size alphabet, and, although unbounded size alphabets model
`many practical situations (such as when entries in a system dictionary are taken to be
`the basic characters), there are certainly many situations in which it is more realistic
`to consider strings to be written over some fixed finite alphabet. Thus it is worthwhile
`to consider the complexity of compressing strings when it is assumed that all strings
`are written over a fixed size alphabet. Since our motivation for doing this is to model
`practical situations, when discussing fixed size alphabets we also require that pointers
`of a given size can only encode a finite amount of information. This requirement is
`met by stipulating that the pointer size 1: be dependent on the string being processed.
`Because complexity results concerning fixed size alphabets are more technical, we
`shall only state a few sample theorems to indicate the flavor of these results and only
`for the case when both recursion and overlapping are forbidden. Suppose we allow
`pointers to be able to indicate any substring of the source. Then a pointer’s length
`must be some implementation-dependent multiple of the logarithm of the string
`length.
`
`If recursion and overlapping areforbidden, then, giving a string 3 over
`THEOREM 3.
`any alphabet with at least three Symbols, an integer K, and a real It > 0, it is NP-
`complete to determine whether 5 has an EPM campressedform t satisfying M 5 K
`when the pointer size is [h Iog2| t['|.
`
`Two other natural ways to determine pointer size are either to require that the
`information content of a pointer be sufficient to distinguish all the pointers in an
`encoding or to require that a pointer be able to identify any substring of the
`dictionary. To this end, if t is an encoding of some string using the EPM scheme,
`let 80) be the number of distinct pointers in t, and let do?) be the dictionary
`portion of t.
`
`If recursion and overlapping areforbidden, then, given a string s over
`THEOREM 4.
`any alphabet with at least three symbols, an integer K, and a real It 2 1, it is NP-
`complete to determine whether S has an EPM compressedform I satisjjiing [t| :1 K in
`thefollowing situations:
`
`(a) p is [h log28(t)'|.
`(b) p is [' h logzl d(t) | 1.
`
`In [30], results similar

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket