`
`JAMES A. STORER AND THOMAS G. SZYMANSKI
`
`Princeton University, Princeton, New Jersey
`
`Abstract. A general model for data compression which includes most data compression systems in the
`literature as special cases is presented. Macro schemes are based on the principle of finding redundant
`strings or patterns and replacing them by pointers to a common copy. Difierent varieties of macro schemes
`may be defined by specifying the meaning of a pointer, that is, a pointer may indicate a substring of the
`compressed string. a substring of the original string, or it sul:-string ofsome other string such as an external
`dictionary. Other varieties of macro schemes may be defined by restricting the type of overlapping or
`recursion that may be used. Trade-offs between different varieties of macro schemes, exact lower bounds
`on the amount of compression obtainable, and the complexity of encoding and decoding are discussed, as
`well as how the work of other authors relates to this model.
`
`Categories and Subject Descriptors: E14 [Data]: Coding and Information Theory—~dam cofilpflction and
`conrprerrion; F. 1.3 [Computation by Abstract Devices]: Complexity Classes—reducr'bi‘h'ry and corripletenci-i;
`1'-12.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Probleins—parrern
`matching.
`
`General Terms: Algorithms, Theory
`
`Additional Key Words and Phrases: Textual substitution, macro expansion, dictionary, NP-completeness
`
`1. Introduction
`
`On-line secondary storage space is one of the most restricting resources in many
`modern computer installations, particularly in those employing multiuser time-shar-
`ing systems. Fast algorithms for compressing and restoring data files can do much to
`alleviate this problem. Some of the more popular data compression schemes described
`in the literature include statistical encoding techniques such as Huffman codes [8],
`which typically encode a block of source data as a variable-length string of bits
`determined by various statistical properties of the source data; incremental encoding
`methods (e.g., [21, 34]), which typically compress a file by recording only the
`difference between successive records; and textual substitution or macro encoding
`schemes (e.g., [6, 7, 12-14, 19, 20, 25-28, 30, 31, 33, 35, 37-39]), which factor out
`duplicate occurrences of data, replacing the repeated elements with some sort of
`special marker identifying the data to be replaced at that point. In addition, many
`ad-hoc methods for handling data with certain known characteristics appear in the
`literature.
`
`This research was supported in part by the National Science Foundation under Grant MCS 74-2l939 and
`in part by Bell Laboratories.
`Authors’ present addresses: J. A. Storer, Department of Computer Science, Brandeb University, Waltham,
`MA 02254; T. G. Szymanski, Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07914.
`Permission to copy without fee all or put of this material is granted provided that the copies are not made
`or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication
`and its date appear, and notice is given that copying is by permission of the Association for Computing
`Machinery. To copy otherwise, or to republish, requires a fee and]or specific permission.
`© 1982 ACM 0004-5411]82/1000-0928 $00.75
`
`Journal of the Association for Computing Machinery, Vol. 29, No at, October I982, pp 928-95]
`
`APPLE 1016
`
`APPLE 1016
`
`1
`
`
`
`Data Compression via Textual Substitution
`
`929
`
`This paper is devoted exclusively to the properties of the macro model for data
`compression. We study two major types of macro schemes, the types being differ-
`entiated by the location where the factored-out text is stored. Section 2 contains a
`discussion of our model along with some basic definitions, Sections 3 and 4 present
`our results for the two major types of schemes considered, and Section 5 examines
`the relative performance of the various compression schemes introduced in the
`preceding sections. To reduce the size of this paper, NP-completeness results are
`presented in [30]. However, the more important results of [30] are summarized here.
`Before proceeding to the next section, we define the following notation:
`
`(1) If s and s, denote strings and n 2 1 is an integer, sis; denotes the concatenation
`of s1 with S2, H§‘.1 3, denotes 51.92 - -
`- s,,, and 5” denotes I]?..1 S. 5° denotes the
`empty string.
`(2) We use the term collection to mean multiset.‘
`(3) If s is a string, |s| denotes the length of s, and ifs is a collection, fsl denotes the
`number of elements in s (with each element being counted as many times as it
`appears in s).
`(4) We extend the min function to strings by defining
`.
`'f
`
`s» = {:;
`
`5
`
`;.,.l::.L.. '32‘,
`
`(5) For a real number lz, |'h'| denotes the least integer greater than or equal to h.
`
`2. The Model and Basic Definitions
`
`We shall treat the source data as a finite string over some alphabet. With external
`macro schemes, a source string is encoded as a pair of strings, a dictionary and a
`skeleton. The skeleton contains characters of the input alphabet interspersed with
`pointers to substrings of the dictionary. The dictionary is also allowed to contain
`pointers to substrings of the dictionary. The source string is recovered by substituting
`dictionary strings for pointers. With internal macro schemes, a string is compressed
`by replacing duplicate instances of substrings with pointers to other occurrences of
`the same substrings. The result is a single string of characters and pointers.
`Throughout this paper let p 2 1 denote the implementation-dependent size of a
`pointer.” If x is a string containing pointers, the length of x, denoted Ix |, is defined
`to be the number of characters in x plus p times the number of pointers in x. We
`shall treat a pointer as an indivisible object which, in some unspecified fashion,
`uniquely and unambiguously identifies some string which is referred to as the target
`of that pointer. The way a pointer is written is not important; the only assumption
`we make is that it is always possible to determine by inspection of a pointer the
`length of its target?‘ For simplicity we shall write a pointer as a pair (11, m), where 11
`indicates the position of the first character in the target,‘ m indicates the length of the
`target, and |(n, m)} is the pointer size p. Without loss of generality it will always be
`assumed that m > p.
`
`‘ A multiset is a set in which repetitions are allowed For example [a, a, b} is a multiset.
`2 We assume that all pointers within a given string have a uniform size. (Variable-length pointers are
`considered in [30]) We also assume p to be an integer, although our results generalize to nonintegral
`pointer SIZES
`3 It is not always necessary to make this assumption and, in fact, it can be useful to remove it See [30] for
`a dlSCllSS10n of this.
`
`‘ n can be either an absolute location or a displacement. For example, with internal schemes, 1: could be
`the distance from the pointer to its target.
`
`2
`
`
`
`930
`
`J. A. sronnn AND T. G. SZYMANSKI
`
`As an example of these ideas, let p = 1, and consider the string
`
`to = aaBccDaacEaccFacac,
`
`which might be encoded under the external macro model as
`
`J: - aacc#(l, 2)B(3, 2)D(l, 3)E(2, 3)F(2, 2)(2, 2),
`
`where # separates the dictionary from the skeleton. For convenience, we assume
`|#| - 0. The compression achieved by the string x (i.e., the ratio |x |/| w |) is lg. Using
`the internal macro model, w could be encoded as
`
`y = aaBccD(l, 2)cEa(4, 2)Fac(l3, 2),
`
`achieving a compression of 15.
`Implementation considerations motivate us to describe a number of variations on
`our basic models. A scheme is recursive if a macro body (i.e., a string that is a target
`of a pointer) is allowed to itself contain pointers. Two pointers overlap if their targets
`overlap. Whether overlapping pointers are permitted in the external model depends
`highly on the implementation chosen for the dictionary.” An original pointer is one
`which denotes a substring of the original source string, whereas a compressedpointer
`denotes a substring of the compressed representation itself. The string y of the
`previous example contains compressed pointers. Using original pointers, we could
`encode w as
`
`2 -= aaBccD(l, 2)cEa(4, 2)!‘(8, 2)(8, 2),
`
`achieving a compression of {-3. Original pointers are more natural for one-pass
`decoding. Compressed pointers allow the recovery of portions of the source string
`without requiring the implicit decompression of the entire string. A left (right) pointer
`is one which denotes a substring occurring earlier (later) in the string. Considering
`the strings x, y, and 2 presented above, only 1: uses overlapping pointers, only 2 uses
`recursion, and none of these strings use right pointers. By using both left and right
`pointers it is possible to save additional space over the use ofjust one direction. For
`example, using both right and left pointers, the compressed forms y and 2 presented
`above could be replaced by
`
`y = (5, 2)B(l0, 2)DaacEaccF(6, 2)(6, 2),
`z = (7, 2)B(i2, 2)DaacE(8, 2)cF(8, 2)(8, 2),
`
`achieving a compression of it and 13 respectively. We discuss recursion in relation to
`original pointers primarily to study the “power” of various methods. With original
`pointers, a pointer is recursive if all or part of the string it represents is represented
`by a pointer.
`Cycles cannot occur in compressed forms with compressed pointers, but using
`original pointers, cycles can often make sense. For example, the compressed form
`ab(5, 2)a{l, 3) uniquely determines the palindrome crbaaaaba even though the two
`pointers in this compressed form comprise a cycle. Here the pointers (5, 2) and (1, 3)
`are a cycle in the sense that each points to a portion of the string represented by the
`other. An example of a degenerate cycle is given by the compressed form a(l, n),
`which uniquely determines the string a""‘1. Schemes which allow recursion but not
`cycles are said to have topological recursion. From the above discussion it should be
`clear that topological recursion is not necessary for a compressed form to be uniquely
`
`5 Certain implementation considerations can lead to the placement of various restrictions on the kinds of
`overlapping permitted. Some of these restrictions are described in [30, 32].
`
`3
`
`
`
`Data Compression via Textual Substitution
`
`931
`
`decodable. However, it can be useful to consider topological recursion for three
`reasons. First, authors in the past (such as Lempel and Ziv) have assumed this.
`Second, study of such schemes leads to a deeper understanding of the power of
`original pointers. Third, topological recursion may model some practical considera-
`tions in the design of efficient original pointer compression methods.
`The above discussion leads us to formally define four basic macro schemes and
`three types of restrictions which may be placed on any of these schemes. Throughout
`this paper, 2 denotes the underlying alphabet from which the data in question is
`constructed.
`
`Definition 1. A compressed form of a string s using the EPM (external pointer
`macro) scheme is any string t = so#s1 satisfying“
`
`(1) so and s1 consist of characters from E and pointers to substrings of so.
`(2) s can be obtained from s; by performing the following two steps:
`
`(a) Replace each pointer in S1 with its target.
`03) Repeat step A until .91 contains no pointers.
`
`Definition 2. A compressedform of a string s using the CPM (compressedpointer
`macro) scheme is any string t satisfying
`
`t consists of characters from E and pointers to substrings oft.
`(1)
`(2) s can be obtained from t by forming the string stir and then decoding as with the
`EPM scheme.
`
`Definition 3. A compressed form of a string 3 using the OPM (original pointer
`macro) scheme is any string 1 satisfying
`
`t consists of characters from 2 and pointers representing substrings of s.
`(1)
`(2) s can be obtained from r by replacing each pointer (n, m) by the sequence of
`pointers (n, 1), (n + 1, 1), ..., (n + m — 1, 1) and then decoding as with the
`CPM scheme, with the stipulation that pointers are considered to have length 1.
`
`Definition 4. A compressedform of a string .9 using the OEPM (original external
`pointer macro) scheme is any string 1 = so#s1 satisfying
`
`t consists of characters from 2 and pointers.
`(1)
`(2) so may be decoded using the OPM scheme to produce a string r. Furthermore,
`pointers in st point to substrings of r.
`(3) s may be obtained by replacing each pointer in .91 with its target in r.
`
`A contraction of a string s for pointer size p according to a given scheme is a
`shortest compressed form of s using that scheme with pointer size p. A contraction of
`a string s will be denoted by A(s).’ We shall refer to the process of replacing a string
`r by a pointer asfactoring out r and often refer to a string that is a target or potential
`target as afactor.
`
`Definition 5. A CPM (OPM) pointer ql depends on pointer qr,» if the target of q;
`contains 42 (all or part of the string represented by qz) or if there is a pointer q.-5 such
`that qr depends on :13 and qa depends on qz. A macro scheme is restricted to no
`recursion if dependent pointers are forbidden, and to topological recursion if no
`
`6 For convenience we assume throughout this paper that [#1 - 0.
`’A string may have more than one ruinimal-length compressed form. For formal dkcussions we can
`always ensure that A(s) is unique by assuming a lexicographic ordering.
`
`4
`
`
`
`932
`
`I. A. STORER AND T. G. SZYMANSKI
`
`pointer may depend on itself; that is, it must be possible to sort topologically” the
`pointers of a compressed form according to their dependencies.
`
`Definition 6. Two pointers overlap if their targets overlap and strictly overlap if
`their targets overlap but neither target is a substring of the other. A macro scheme is
`restricted to no overlapping if overlapping pointers are forbidden.
`
`Definition 7. A CPM (OPM) pointer q points to the left if the leftmost character
`of its target is to the left of q (the leftmost character of the string represented by q).
`A right pointer is similarly defined. A macro scheme is restricted to unidirectional
`pointers if all pointers must point in the same direction (of course, with the EPM and
`OEPM schemes, this only applies to the external dictionary). As a special case of
`this, we can restrict a macro scheme to have only left or right pointers.
`
`The different combinations of the four basic macro schemes we have defined and
`
`the recursion, overlapping, and pointer direction restrictions provide us with a large
`number of data compression methods. The combinations are sufficiently general to
`cover virtually all of the text substitution schemes proposed in the literature.
`Discussion of the utility and appropriateness of various restrictions to the models are
`deferred until later in the paper.
`We have not discussed the concept of adding to pointers" arguments which allow
`the specification of modifications to be made on a factor before it is substituted.
`Macro schemes with arguments generally have more power than ones without. Also,
`a few data compression methods presented in the literature require a macro scheme
`with arguments to model them, for example, the subsequence and supersequence
`compression methods discussed in [16]. Macro schemes with arguments will not be
`discussed in this paper, but it it should be noted that the macro model can be
`extended to allow this generality.
`
`3. The External Macro Model
`
`The external macro model views the collection of macro bodies as residing outside
`the remainder of the compressed string. This makes external schemes ideal for
`compressing collections of strings using a common dictionary. There are several
`reasons why it is more natural to treat all pointers as compressed pointers when
`discussing this model. First, authors in the past have used the EPM scheme and not
`the OEPM scheme (the authors mentioned in the introduction who used external
`schemes all considered variants of the EPM scheme). Second, it allows us to
`decompress arbitrary portions of the data without first having to produce the entire
`string. Third, compressed pointers often require less space than original pointers. For
`these reasons, we concentrate our attention in this section on the EPM model and
`then indicate how our results can be extended to the OEPM model. As we shall see
`
`in the next section, there are advantages to using original pointers over compressed
`pointers that justify consideration of the OEPM model. In many cases the extension
`of results to the OEPM scheme is trivial, since if recursion is forbidden, original and
`compressed pointers become equivalent in power. Similarly, if overlapping is forbid-
`den, unidirectional and bidirectional pointers become equivalent.
`
`THEOREM 1. For all strings s, only topological recursion is allowed, then (assuming
`.9 is compressible) both lAEPM(S)| and |AoEp_M(.‘u‘)| are
`
`(a) 2p lag2(ls|/p) + 1.9;).
`
`” For a discussion of topological sorting, see [10].
`
`5
`
`
`
`Data Compression via Textual Substitution
`
`933
`
`(b) 23}: loga(|s|/p) - 0.02p when overlapping isforbidden.
`(c) 22(plsI)"" when recursion isforbidden.
`(cl) 22(p|s |)"2 when both recursion and overlapping areforbidden.
`(e)
`(a)—(d) hold even ifpointers are required to be unidirectional.
`
`If nontopologicai recursion is allowed, then
`
`(f) The bounds of (a)—(e) holdfor the EPM scheme.
`
`But
`
`(3) |AoEpM(S)| 1: 2p + 1, independently of what overlapping and pointer direction
`restrictiofls are made.
`
`Furthermore, all if the bounds in (a)—( g) are tight; that is, each is attainedfor infinitely
`many strings s.”
`
`PROOF. For (a)—(e), since |AoEpM(s)| :5 |AEPM(S') |, it is sufficient to show that the
`OEPM scheme satisfies these bounds and the EPM scheme can attain them infinitely
`often.
`
`First let us consider (a). We can assume that s = a"' (for some a in 2) because
`fAogpM(a"')| 5 I AuEpM(.5') |. It is easy to show that for some p < k 5 2p and n,
`
`Ao1n=n(s) = 0* E q.#qu+1.
`
`where q,, 1 5 i < n, points to the string represented by everything to the lefi of it and
`q,.+1 points to some substring of length Is| in the dictionary. Since qt points to a
`string of k characters, we must have It > p, or else a shorter contraction of s could be
`produced. Similarly, if k > 2p, we could produce a shorter contraction by representing
`a" by a"""‘ followed by a pointer to this string. Thus we have
`
`lAonrvM(s)| 2 k + np +p
`2p1og2(|s},’lc) + lc -1-}:
`2 mi11{pI'log2([s|/i)'|+i+p:p < is 2p}
`2p log2(|s[/p) + min{p(l + h -— log2li): l < h 5 2}
`> P1082051/P) + 1-9p-
`
`For anyp < is 2p and n, the bound rnin{pilog2(|s|/i)| + i + ptp < i 5 2p} is
`achieved exactly by the EPM scheme on the strings = a'’'.
`We now consider 0:). Again we can assume that s = a"'. Since overlapping is
`forbidden, the pointers of AonpM(s) can be divided into a sequence of sets S1, . .
`. , S...
`such that the pointers in S., 1 < is m, have targets whose compressed representation
`consists of pointers in some s_,, j < i; in fact, since we are concerned only with worst-
`case performance,’° we can assume that j = i — 1. We can also assume that S.-, 1 _<_ i
`5 m, contains at most three pointers. This is because for any k 7: 4 there are an i and
`j such that 21' + 3f 5. k 5 2'3 ’, and so we can replace a set S. of four or more pointers
`by a sequence of sets having at most three pointers, where the last set in the sequence
`will represent a string at least as long as that represented by the original sequence of
`four or more. Hence, for some x 5 2 we can assume that for all i > x, S; contains
`
`“ Actually, the bound in (a) is just an approximation for the expression min{p|log».x(|s|;i)|+ i + p:p < i
`_<_ 2;}, which is attained exactly infinitely often. Similarly, the bound 111 (b) 15 just an approximation for
`the expression min{3p| loga{| sl/i) I + i: 2p < I 5 4p], which IS attained exactly for infinitely many strings.
`1" It 15 not necessary to consider a compressed form oflength x for a stung a’ if there is a compressed form
`of length <1: (sx) for a string a‘ where z 2 y (z > y).
`
`6
`
`
`
`934
`
`J. A. STORER AND T. G. SZYMANSKI
`
`exactly three pointers. This is because we can assume that the sets of two come first
`and a sequence of three two-pointer sets can be replaced by two three-pointer sets.
`Given this, it can be assumed that for some 2p < k 5 4p, 0 ..<_ M 5 L 5 l, and n,
`Aoiipm(s) is of the form
`
`a*(q?»)‘(qi)”(_Inl2 q?)#q5‘»+u
`
`where qo points to a‘; qr points to qfi; qg points to a”, qfi, or qi, depending on the
`values of L and M; and for 3 5 is n + 1, qt points to q‘:’—i. Thus we have
`
`+
`iAoEpM(.S')| Z k + 31110 +
`2 3p|'log3(|s|/k2L2M)'I + k + 2Lp + 2Mp
`2 3pflog3(|s|/k)'| + 1: + 3p(§ — log32)(L + M)
`3 3PT10g3(|S|/k)'| + k
`2 min{3pflog3(js|/i)‘] + i':2p < is 4p}
`2 3p log;z.(|s|/p) + min{p(h -- 3 log3h):2 < h S 4}
`> 3;! loga(|s|/p) - 0.02;).
`
`For any 2]; < is 4p and n > 1, the bound of rnin[3p|log3(|s|/i)| + i‘:2p < i 5 4p}
`is achieved using the EPM scheme with no overlapping on the strings = aw‘ .
`The proofs for (c) and (d) appear in [30]. All of the proofs of (a)—(d) make use of
`left pointers only, and so (e) follows. (f) follows trivially because compressed pointers
`cannot form cycles, and hence with the EPM scheme all recursion must be topological.
`For (g) we may again assume that s = am for some a in E. If s is compressible using
`the EOPM scheme, then .9 must contain at least two pointers and at least one
`character. Hence 2p + l is a lower bound. It is also tight, since a(l, ls] — l)#(1, ISD
`is a compressed form for s = a"'. D
`
`For topological recursion, Theorem 1 says just what one would expect; there is an
`S2(_p log| s |) lower bound on the size of a compressed string when recursion is allowed
`(log; with overlapping and logs without) and an S2((pIs])1/2) lower bound when
`recursion is not allowed. For nontopological recursion the bound of (g) may seem
`unnatural. Clearly we cannot represent a string of arbitrary length using a constant
`amount of space. The bound of (g) simply illustrates the need for the pointer size to
`be a function of the string size to model situations where pointers may indicate strings
`of arbitrary length.
`It should also be pointed out that the bounds of Theorem 1 apply primarily to
`“pathological” strings; in practice, reducing the size of a file by a small constant
`factor may be very significant. However, much of the utility of Theorem 1 comes
`from the fact that it provides exact bounds which are needed in several of our NP»
`completeness" proofs.
`The next theorem considers encoding algorithms for the EPM model. Wagner [35]
`presents a polynomial-time algorithm for compressing a string, assuming that the
`dictionary of macro bodies is given as input to the encoding algorithm. However, no
`mention is made as to how the selection of the best possible dictionary is accom-
`plished. Several heuristic methods for constructing dictionaries have been presented
`in [20] and [25}. Neither of these guarantees optimal compression or even provides
`bounds on the compression that is obtainable. The reason for this gap in the literature
`is the NP-completeness of finding AEpM(S).
`
`“ For a definition of NP-completeness and related terms, see [1]. All ofour proofs show NP-completeness
`in the sense of Karp [9] (which implies that of [3]).
`
`7
`
`
`
`Data Compression via Textual Substitution
`
`935
`
`THEOREM 2. Given a string s and an integer K, it is NP-complete to determine
`whether IAE;-M/(s)| 5 K in any of the following situations:
`
`(a) both recursion and overlapping allowed;
`(b) recursion allowed, overlappingforbidden;
`(c)
`recursion forbidden, overlapping allowed;
`(d) both recursion and overlappingforbidden;
`(e) unidirectional pointers and any of (a)-(d).
`
`Furthermore, the above are true regardless of whether p is part of the problem input or
`is constrained to be a fixed integer greater than 1. In fact, we show (b) and (d) to be
`true even zfp = l.
`
`It should be clear that no one part of the theorem directly implies any
`PROOF.
`other. Thus several reductions are used. The reductions employed include the node
`cover problem [9], the restricted node cover problem [17], the K-node cover problem
`[30], and the superstring problem [4, 17]. As mentioned in the introduction, proofs of
`all NP-completeness results appear in [30]. However, we shall include the proof of
`(b) and (d) as a “sample proof.”
`
`. , e,.,}), K
`.
`. , v,,}, E = {e1, .
`.
`.
`Proofof(b) and (d)forp >1. Let G = (V= {v1,
`be an instance of the node cover problem, and let p = po. Let S be a special symbol,
`and let @ denote a new, distinct symbol each time it occurs. For v. in V, let V. =
`$v§”‘$, and for e, = (v,, v;,) in E, let E. = $v,”1$v‘;',“$. Now let
`p rt.
`rn
`
`s= (H 11 v;@)(n to).
`
`1-1 1-1
`
`1-1
`
`We claim that G has a node cover of size SK if and only if |A(s)| 5 |s| + K — m.
`First suppose that G has a node cover X g V of size K. We shall construct a
`compressed form t for s (having length |sI + K — m), where t is'of the form
`s0#(]]-§’.1 ]]j'., F7} @)(flI’i1 E. @), where so contains those V, for which V. is in X, and
`I7, is t; if v,- is not in X and a pointer to v, in so if v, is in X. If E. is sv;7“sv*,:"s,
`then E‘.
`is either rv‘,§"1$ or $v,‘1q, where r is a pointer to v, in st; and q is
`a pointer to vs in so. Since X is a node cover, this can always be done. If we now com-
`pute the length of I, Isol = K(p + 1), IHf—1H?—: 77; @l = lH‘»'—1 H5’-1 V,@I -pK,
`and [11:11 E.@| = |f[:’11 E.@| — m. Hence |A(s)] 5 M = ls] + K(p + 1) —pK— m
`=|s| + K - m, as was to be shown.
`Conversely, suppose that |A(s)| < |s| + K — m. We shall show that G has a
`node cover of size at most K. First observe that since overlapping of pointer
`targets is forbidden, no pointer in A(s) can refer, for any strings x and y, to a string
`of the form xv.$v,y, x@y, or x@y, since such a string can occur at most once
`in s and no gain can be achieved by factoring it out. Thus A(s) is of the form
`so#(H{’.1
`".1 I7’,@)(]'[i'i1 E‘,@), where so is a dictionary of macro bodies and the
`1775 and E35 are the shortest compressed forms of the W5 and E.-’s, respectively,
`using so. As mentioned earlier, without loss of generality we are assuming throughout
`this paper that every pointer references a string of length at least p + 1. Thus, since
`| K] = p + I, we can infer that each I7. is either V, itself or a pointer to an occurrence
`of v. in so. Similarly, since |E,| = 2p + 1, each E. must either consist of E. itself or else
`be a string of the form rv,"'$ or $v, '1r, where r is a pointer to some V, in so. Now let
`L be the number of E,’s such that E. = E,, that is, the number of 5'33 that have not
`had a factor removed. Then | ]'[I’.'., E,@| = |]'[;':, E.@| — (m — L), since removing a
`factor from an E, saves one character. Let .1’ be the number of V.-’s in so. Then
`
`8
`
`
`
`936
`
`J. A. sronmr. AND T. G. SZYMANSK1
`
`{sol - 1(1) + l) and |]]l’.1 H}-‘.1 l7}@I == []]‘.-'.1 I]}‘.1 V,-@l - Jp, because each it.» that
`is replaced by a pointer saves one character. Thus |A(s)| - J(p + 1) + l.s| — Jp -
`(m-L)=-[s|+J+L—m,andsoJ+LsK.W'enowclaimthatGhasanode
`cover of size J + L formed by taking the J nodes represented in so and one node
`from each of the L edges not factored in Ms}. Therefore 6? has a node cover of size
`K, as was to be shown.
`
`Proof of(b) and (d)forp = l. The following proof is similar to the original proof
`of this result [30] in that it employs a reduction to the node cover problem for degree
`three graphs. However, although the actual the actual construction is slightly longer,
`the proof of its correctness is simpler. This simplified proof is due to J. Gallant.
`LetG=(V=* {V1,---.V,.},E== {e1,...,e».}),Kbea.ninstanceot'thenodecover
`problem for which all nodes have degree 3." As in the proof for p > I, let 3 be a
`special symbol, and let @ denote a new, distinct symbol each time it occurs. For v,
`in Vlet
`
`V. = Sv.-3
`
`and
`
`W. = $”v.-$2.
`
`and for e. = (v,, v;,) in E let
`
`Now let
`
`E5 = 32:5-S’v.t$2.
`
`II
`
`M
`
`S =
`
`(V=@W-@)’)(II1 E@)-
`
`The claim is that Ghas a node cover ofsize Kifand only if|A(s)l 5 |s| + K— 1411.
`The first half of the proof argues that if X is a minimal node cover for G, then a
`compressed form for s can be constructed as follows:
`
`(I) The two copies of each V. go to a copy of V; in the dictionary and two pointers
`in the skeleton.
`
`(2) The two copies of each Wt go to two copies of 3 followed by a pointer to V.-
`followed by a $ in the skeleton if v. is not in X; otherwise the two copies of W.
`go to W; in the dictionary and two pointers in the skeleton.
`(3) Each E; representing e. = (v,-, vi.) goes to a pointer to 32v,-S’ followed by a pointer
`to $v,t$ followed by 3 if v,- is chosen to cover the edge; otherwise E; goes to $
`followed by a pointer to $v,$ followed by a pointer to $2v;.$’.
`
`A compressed form for s as constructed above saves one character for each pair of
`Vis for a total of it: four characters for each pair of Wis when 1:. is not in X, three
`characters for each pair of W.-’s when 11,- is in X for a total of 4:1 — K, and six
`characters for each E; for a total of 6m. Since In -= gn, this yields |A(s)| 5 Is] -
`n--4n+K-—6m=|sI+K—l4n.
`The other half of the proof requires more work. Here it must be argued that the
`method of compressing S as described above is the best possible. This is done by
`considering several cases that rest heavily on the fact that all nodes in G have degree
`3. The degree-3 restriction makes it unprofitable to factor out many strings that might
`otherwise be factored if nodes with large degrees were present. We leave the details
`of this half of the proof to the reader. Note that the above reduction is for the case
`where recursion is forbidden. If recursion is allowed, the same construction may be
`used, except that one copy of W. should be used instead of two. El
`
`12 Using a result from [SL it is easy to show this restriction of the node cover problem to be NP-complete;
`see [17].
`
`9
`
`
`
`Data Compression via Textual Substitution
`
`937
`
`In [30], cases (b)—(e) of Theorem 2 are shown to hold for the EOPM scheme. Case
`(a) is shown for the problem of compressing collections, but the single string problem
`remains open at the time of the writing of this paper. In addition, in [30] it is
`conjectured that all of (a)—(e) of Theorem 2 can be shown for p = 1.
`Throughout this paper we assume that the alphabets over which strings are written
`are unbounded in size. However, results concerning lower bounds on encoding
`complexity are stronger if they apply to the case where all strings are assumed to be
`written over some fixed size alphabet, and, although unbounded size alphabets model
`many practical situations (such as when entries in a system dictionary are taken to be
`the basic characters), there are certainly many situations in which it is more realistic
`to consider strings to be written over some fixed finite alphabet. Thus it is worthwhile
`to consider the complexity of compressing strings when it is assumed that all strings
`are written over a fixed size alphabet. Since our motivation for doing this is to model
`practical situations, when discussing fixed size alphabets we also require that pointers
`of a given size can only encode a finite amount of information. This requirement is
`met by stipulating that the pointer size 1: be dependent on the string being processed.
`Because complexity results concerning fixed size alphabets are more technical, we
`shall only state a few sample theorems to indicate the flavor of these results and only
`for the case when both recursion and overlapping are forbidden. Suppose we allow
`pointers to be able to indicate any substring of the source. Then a pointer’s length
`must be some implementation-dependent multiple of the logarithm of the string
`length.
`
`If recursion and overlapping areforbidden, then, giving a string 3 over
`THEOREM 3.
`any alphabet with at least three Symbols, an integer K, and a real It > 0, it is NP-
`complete to determine whether 5 has an EPM campressedform t satisfying M 5 K
`when the pointer size is [h Iog2| t['|.
`
`Two other natural ways to determine pointer size are either to require that the
`information content of a pointer be sufficient to distinguish all the pointers in an
`encoding or to require that a pointer be able to identify any substring of the
`dictionary. To this end, if t is an encoding of some string using the EPM scheme,
`let 80) be the number of distinct pointers in t, and let do?) be the dictionary
`portion of t.
`
`If recursion and overlapping areforbidden, then, given a string s over
`THEOREM 4.
`any alphabet with at least three symbols, an integer K, and a real It 2 1, it is NP-
`complete to determine whether S has an EPM compressedform I satisjjiing [t| :1 K in
`thefollowing situations:
`
`(a) p is [h log28(t)'|.
`(b) p is [' h logzl d(t) | 1.
`
`In [30], results similar