`
`DEBRA A. LELEWER and DANIEL S. HIRSCHBERG
`
`Department of Information and Computer Science, University of California, Irvine, California 92717
`
`This paper surveys a variety of data compression methods spanning almost 40 years of
`research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique
`developed in 1986. The aim of data compression is to reduce redundancy in stored or
`communicated data, thus increasing effective data density. Data compression has
`important application in the areas of file storage and distributed systems. Concepts from
`information theory as they relate to the goals and evaluation of data compression
`methods are discussed briefly. A framework for evaluation and comparison of methods is
`constructed and applied to the algorithms presented. Comparisons of both theoretical and
`empirical natures are reported, and possibilities for future research are suggested.
`
`Categories and Subject Descriptors: E.4 [Data]: Coding and Information Theory-data
`compaction and compression
`
`General Terms: Algorithms, Theory
`
`Additional Key Words and Phrases: Adaptive coding, adaptive Huffman codes, coding,
`coding theory, file compression, Huffman codes, minimum-redundancy codes, optimal
`codes, prefix codes, text compression
`
`INTRODUCTION
`
Data compression is often referred to as coding, where coding is a general term encompassing any special representation of data that satisfies a given need. Information theory is defined as the study of efficient coding and its consequences in the form of speed of transmission and probability of error [Ingels 1971]. Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted. The purpose of this paper is to present and analyze a variety of data compression algorithms.

A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII) into a new string (e.g., of bits) that contains the same information but whose length is as small as possible. Data compression has important application in the areas of data transmission and data storage. Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium.
`
`
`
`
`262
`
`•
`
`D. A. Lelewer and D. S. Hirschberg
`
`CONTENTS
`
`INTRODUCTION
`1. FUNDAMENTAL CONCEPTS
`1.1 Definitions
`1.2 Classification of Methods
`1.3 A Data Compression Model
`1.4 Motivation
2. SEMANTIC-DEPENDENT METHODS
`3. STATIC DEFINED-WORD SCHEMES
`3.1 Shannon-Fano Code
`3.2 Static Huffman Coding
3.3 Universal Codes and Representations of the Integers
`3.4 Arithmetic Coding
`4. ADAPTIVE HUFFMAN CODING
`4.1 Algorithm FGK
`4.2 Algorithm V
`5. OTHER ADAPTIVE METHODS
`5.1 Lempel-Ziv Codes
`5.2 Algorithm BSTW
`6. EMPIRICAL RESULTS
`7. SUSCEPTIBILITY TO ERROR
7.1 Static Codes
`7.2 Adaptive Codes
`8. NEW DIRECTIONS
`9. SUMMARY
`REFERENCES
`
It may then become feasible to store the data at a higher, thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.

Many of the methods discussed in this paper are implemented in production systems. The UNIX¹ utilities compact and compress are based on methods discussed in Sections 4 and 5, respectively [UNIX 1984]. Popular file archival systems such as ARC and PKARC use techniques presented in Sections 3 and 5 [ARC 1986; PKARC 1987]. The savings achieved by data compression can be dramatic; reduction as high as 80% is not uncommon [Reghbati 1981]. Typical values of compression provided by compact are text (38%), Pascal source (43%), C source (36%), and binary (19%). Compress generally achieves better compression (50-60% for text such as source code and English) and takes less time to compute [UNIX 1984]. Arithmetic coding (Section 3.4) has been reported to reduce a file to anywhere from 12.1 to 73.5% of its original size [Witten et al. 1987]. Cormack reports that data compression programs based on Huffman coding (Section 3.2) reduced the size of a large student-record database by 42.1% when only some of the information was compressed. As a consequence of this size reduction, the number of disk operations required to load the database was reduced by 32.7% [Cormack 1985]. Data compression routines developed with specific applications in mind have achieved compression factors as high as 98% [Severance 1983].

¹ UNIX is a trademark of AT&T Bell Laboratories.
Although coding for purposes of data security (cryptography) and codes that guarantee a certain level of data integrity (error detection/correction) are topics worthy of attention, they do not fall under the umbrella of data compression. With the exception of a brief discussion of the susceptibility to error of the methods surveyed (Section 7), a discrete noiseless channel is assumed. That is, we assume a system in which a sequence of symbols chosen from a finite alphabet can be transmitted from one point to another without the possibility of error. Of course, the coding schemes described here may be combined with data security or error-correcting codes.

Much of the available literature on data compression approaches the topic from the point of view of data transmission. As noted earlier, data compression is of value in data storage as well. Although this discussion is framed in the terminology of data transmission, compression and decompression of data files are essentially the same tasks as sending and receiving data over a communication channel. The focus of this paper is on algorithms for data compression; it does not deal with hardware aspects of data transmission. The reader is referred to Cappellini [1985] for a discussion of techniques with natural hardware implementation.

Background concepts in the form of terminology and a model for the study of data compression are provided in Section 1. Applications of data compression are also discussed in Section 1 to provide motivation for the material that follows.
`
`
`
`
Although the primary focus of this survey is data compression methods of general utility, Section 2 includes examples from the literature in which ingenuity applied to domain-specific problems has yielded interesting coding techniques. These techniques are referred to as semantic dependent, since they are designed to exploit the context and semantics of the data to achieve redundancy reduction. Semantic-dependent techniques include the use of quadtrees, run-length encoding, or difference mapping for storage and transmission of image data [Gonzalez and Wintz 1977; Samet 1984].

General-purpose techniques, which assume no knowledge of the information content of the data, are described in Sections 3-5. These descriptions are sufficiently detailed to provide an understanding of the techniques. The reader will need to consult the references for implementation details. In most cases only worst-case analyses of the methods are feasible. To provide a more realistic picture of their effectiveness, empirical data are presented in Section 6. The susceptibility to error of the algorithms surveyed is discussed in Section 7, and possible directions for future research are considered in Section 8.
`
1. FUNDAMENTAL CONCEPTS

A brief introduction to information theory is provided in this section. The definitions and assumptions necessary to a comprehensive discussion and evaluation of data compression methods are discussed. The following string of characters is used to illustrate the concepts defined: EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg".
`
1.1 Definitions

A code is a mapping of source messages (words from the source alphabet α) into codewords (words of the code alphabet β). The source messages are the basic units into which the string to be represented is partitioned. These basic units may be single symbols from the source alphabet, or they may be strings of symbols. For string EXAMPLE, α = {a, b, c, d, e, f, g, space}. For purposes of explanation, β is taken to be {0, 1}.
`
Source message    Codeword
a                 000
b                 001
c                 010
d                 011
e                 100
f                 101
g                 110
space             111

Figure 1. A block-block code for EXAMPLE.
`
Source message    Codeword
aa                0
bbb               1
cccc              10
ddddd             11
eeeeee            100
fffffff           101
gggggggg          110
space             111

Figure 2. A variable-variable code for EXAMPLE.
`
Codes can be categorized as block-block, block-variable, variable-block, or variable-variable, where block-block indicates that the source messages and codewords are of fixed length and variable-variable codes map variable-length source messages into variable-length codewords. A block-block code for EXAMPLE is shown in Figure 1, and a variable-variable code is given in Figure 2. If the string EXAMPLE were coded using the Figure 1 code, the length of the coded message would be 120; using Figure 2 the length would be 30.
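To see where these totals come from, the short Python sketch below (ours, not part of the original survey) tallies the coded length of EXAMPLE under both mappings; the parse into source messages for the Figure 2 code is taken directly from that figure.

```python
# Sketch: check the coded lengths quoted above for the Figure 1 and
# Figure 2 codes (120 bits and 30 bits, respectively).
EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg"

# Figure 1: block-block code, every source symbol maps to 3 bits.
block_block = {c: f"{i:03b}" for i, c in enumerate("abcdefg ")}
len_fig1 = sum(len(block_block[c]) for c in EXAMPLE)

# Figure 2: variable-variable code; the ensemble is parsed into the
# source messages listed in the figure.
var_var = {"aa": "0", "bbb": "1", "cccc": "10", "ddddd": "11",
           "eeeeee": "100", "fffffff": "101", "gggggggg": "110", " ": "111"}
messages = ["aa", " ", "bbb", " ", "cccc", " ", "ddddd", " ",
             "eeeeee", " ", "fffffff", "gggggggg"]
len_fig2 = sum(len(var_var[m]) for m in messages)

print(len_fig1, len_fig2)   # expected: 120 30
```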
The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords. These are not discussed, since they do not provide compression. The codes featured in this survey are of the block-variable, variable-variable, and variable-block types.
When source messages of variable length are allowed, the question of how a message ensemble (sequence of messages) is parsed into individual messages arises. Many of the algorithms described here are defined-word schemes. That is, the set of source messages is determined before the
`
invocation of the coding scheme. For example, in text file processing, each character may constitute a message, or messages may be defined to consist of alphanumeric and nonalphanumeric strings. In Pascal source code, each token may represent a message. All codes involving fixed-length source messages are, by default, defined-word codes. In free-parse methods, the coding algorithm itself parses the ensemble into variable-length sequences of symbols. Most of the known data compression methods are defined-word schemes; the free-parse model differs in a fundamental way from the classical coding paradigm.
A code is distinct if each codeword is distinguishable from every other (i.e., the mapping from source messages to codewords is one to one). A distinct code is uniquely decodable if every codeword is identifiable when immersed in a sequence of codewords. Clearly, each of these features is desirable. The codes of Figures 1 and 2 are both distinct, but the code of Figure 2 is not uniquely decodable. For example, the coded message 11 could be decoded as either "ddddd" or "bbbbbb". A uniquely decodable code is a prefix code (or prefix-free code) if it has the prefix property, which requires that no codeword be a proper prefix of any other codeword. All uniquely decodable block-block and variable-block codes are prefix codes. The code with codewords {1, 100000, 00} is an example of a code that is uniquely decodable but that does not have the prefix property. Prefix codes are instantaneously decodable; that is, they have the desirable property that the coded message can be parsed into codewords without the need for lookahead. In order to decode a message encoded using the codeword set {1, 100000, 00}, lookahead is required. For example, the first codeword of the message 1000000001 is 1, but this cannot be determined until the last (tenth) symbol of the message is read (if the string of zeros had been of odd length, the first codeword would have been 100000).
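The contrast can be made concrete with a small sketch (in Python, our illustration rather than anything from the paper): a greedy decoder that emits a message as soon as the bits read so far match a codeword needs no lookahead on a prefix code, but the same strategy fails on the non-prefix set {1, 100000, 00}. The prefix code used here, and the message names x, y, z, are hypothetical.

```python
def greedy_decode(bits, code):
    """Decode left to right, emitting a message as soon as the bits read
    so far match a codeword.  This is safe for any prefix code."""
    inverse = {cw: msg for msg, cw in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # correct only if no codeword prefixes another
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("leftover bits: " + buf)
    return out

# A hypothetical prefix code: no codeword is a prefix of another.
prefix_code = {"a": "0", "b": "10", "c": "110", "d": "111"}
print(greedy_decode("0101101110", prefix_code))   # ['a', 'b', 'c', 'd', 'a']

# Greedy decoding fails on the non-prefix set {1, 100000, 00}: for the
# message 100000001 it commits to the codeword 1 too early, although the
# only valid parse is 100000, 00, 1.
nonprefix = {"x": "1", "y": "100000", "z": "00"}
greedy_decode("100000001", nonprefix)             # raises ValueError
```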
A minimal prefix code is a prefix code such that, if x is a proper prefix of some codeword, then xa is either a codeword or a proper prefix of a codeword for each letter a in β. The set of codewords {00, 01, 10} is an example of a prefix code that is not minimal. The fact that 1 is a proper prefix of the codeword 10 requires that 11 be either a codeword or a proper prefix of a codeword, and it is neither. Intuitively, the minimality constraint prevents the use of codewords that are longer than necessary. In the above example the codeword 10 could be replaced by the codeword 1, yielding a minimal prefix code with shorter codewords. The codes discussed in this paper are all minimal prefix codes.
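The minimality condition can also be checked mechanically. The following Python sketch (ours, not from the paper) tests every nonempty proper prefix of every codeword against the definition above; it rejects {00, 01, 10} and accepts {00, 01, 1}.

```python
def is_minimal_prefix_code(codewords, alphabet="01"):
    """Check the minimality condition quoted above: whenever x is a proper
    prefix of some codeword, x followed by each code letter must itself be
    a codeword or a proper prefix of a codeword."""
    cws = set(codewords)
    prefixes = {cw[:i] for cw in cws for i in range(1, len(cw))}
    for x in prefixes:
        for a in alphabet:
            xa = x + a
            is_proper_prefix = any(c.startswith(xa) and c != xa for c in cws)
            if xa not in cws and not is_proper_prefix:
                return False
    return True

print(is_minimal_prefix_code(["00", "01", "10"]))   # False: 11 is neither a codeword nor a prefix
print(is_minimal_prefix_code(["00", "01", "1"]))    # True
```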
In this section a code has been defined to be a mapping from a source alphabet to a code alphabet; we now define related terms. The process of transforming a source ensemble into a coded message is coding or encoding. The encoded message may be referred to as an encoding of the source ensemble. The algorithm that constructs the mapping and uses it to transform the source ensemble is called the encoder. The decoder performs the inverse operation, restoring the coded message to its original form.
`
1.2 Classification of Methods

Not only are data compression schemes categorized with respect to message and codeword lengths, but they are also classified as either static or dynamic. A static method is one in which the mapping from the set of messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble. The classic static defined-word scheme is Huffman coding [Huffman 1952]. In Huffman coding, the assignment of codewords to source messages is based on the probabilities with which the source messages appear in the message ensemble. Messages that appear frequently are represented by short codewords; messages with smaller probabilities map to longer codewords. These probabilities are determined before transmission begins. A Huffman code for the ensemble EXAMPLE is given in Figure 3. If EXAMPLE were coded using this Huffman mapping, the length of the coded message would be 117. Static Huffman coding is discussed in Section 3.2; other static schemes are discussed in Sections 2 and 3.
`
`
`
Source message    Probability    Codeword
a                 2/40           1001
b                 3/40           1000
c                 4/40           011
d                 5/40           010
e                 6/40           111
f                 7/40           110
g                 8/40           00
space             5/40           101

Figure 3. A Huffman code for the message EXAMPLE (code length = 117).
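The assignment in Figure 3 is one of several equally good Huffman codes for these frequencies. As an illustration of how such a code arises (a modern Python sketch using a heap, not a construction taken from the paper), repeatedly merging the two least probable entries yields codeword lengths whose weighted total is again 117 bits.

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code by repeatedly merging the two lowest-weight
    subtrees.  Returns a dict mapping each message to its codeword."""
    heap = [(w, i, [m]) for i, (m, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    code = {m: "" for m in freqs}
    while len(heap) > 1:
        w1, _, ms1 = heapq.heappop(heap)
        w2, i2, ms2 = heapq.heappop(heap)
        for m in ms1:                 # members of the lighter subtree gain a leading 0
            code[m] = "0" + code[m]
        for m in ms2:                 # members of the heavier subtree gain a leading 1
            code[m] = "1" + code[m]
        heapq.heappush(heap, (w1 + w2, i2, ms1 + ms2))
    return code

freqs = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7, "g": 8, " ": 5}
code = huffman_code(freqs)
total_bits = sum(freqs[m] * len(code[m]) for m in freqs)
print(total_bits)    # 117, matching Figure 3 (codeword bits may differ, lengths agree)
```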
`
A code is dynamic if the mapping from the set of messages to the set of codewords changes over time. For example, dynamic Huffman coding involves computing an approximation to the probabilities of occurrence "on the fly," as the ensemble is being transmitted. The assignment of codewords to messages is based on the values of the relative frequencies of occurrence at each point in time. A message x may be represented by a short codeword early in the transmission because it occurs frequently at the beginning of the ensemble, even though its probability of occurrence over the total ensemble is low. Later, when the more probable messages begin to occur with higher frequency, the short codeword will be mapped to one of the higher probability messages, and x will be mapped to a longer codeword. As an illustration, Figure 4 presents a dynamic Huffman code table corresponding to the prefix "aa bbb" of EXAMPLE. Although the frequency of space over the entire message is greater than that of b, at this point b has higher frequency and therefore is mapped to the shorter codeword.
Dynamic codes are also referred to in the literature as adaptive, in that they adapt to changes in ensemble characteristics over time. The term adaptive is used for the remainder of this paper; the fact that these codes adapt to changing characteristics is the source of their appeal. Some adaptive methods adapt to changing patterns in the source [Welch 1984], whereas others exploit locality of reference [Bentley et al. 1986]. Locality of reference is the tendency, common in a wide variety of text types, for a particular word to occur frequently for short periods of time and then fall into disuse for long periods.
`
Source message    Probability    Codeword
a                 2/6            10
b                 3/6            0
space             1/6            11

Figure 4. A dynamic Huffman code table for the prefix "aa bbb" of message EXAMPLE.
`
All of the adaptive methods are one-pass methods; only one scan of the ensemble is required. Static Huffman coding requires two passes: one pass to compute probabilities and determine the mapping, and a second pass for transmission. Thus, as long as the encoding and decoding times of an adaptive method are not substantially greater than those of a static method, the fact that an initial scan is not needed implies a speed improvement in the adaptive case. In addition, the mapping determined in the first pass of a static coding scheme must be transmitted by the encoder to the decoder. The mapping may preface each transmission (i.e., each file sent), or a single mapping may be agreed upon and used for multiple transmissions. In one-pass methods the encoder defines and redefines the mapping dynamically during transmission. The decoder must define and redefine the mapping in sympathy, in essence "learning" the mapping as codewords are received. Adaptive methods are discussed in Sections 4 and 5.
An algorithm may also be a hybrid, neither completely static nor completely dynamic. In a simple hybrid scheme, sender and receiver maintain identical codebooks containing k static codes. For each transmission the sender must choose one of the k previously agreed upon codes and inform the receiver of the choice (by transmitting first the "name" or number of the chosen code). Hybrid methods are discussed further in Sections 2 and 3.2.
`
1.3 A Data Compression Model

In order to discuss the relative merits of data compression techniques, a framework for comparison must be established. There are two dimensions along which each of the schemes discussed here may be measured: algorithm complexity and amount of compression. When data compression is used in a data transmission application, the goal is speed. Speed of transmission depends on the number of bits sent, the time required for the encoder to generate the coded message, and the time required for the decoder to recover the original ensemble. In a data storage application, although the degree of compression is the primary concern, it is nonetheless necessary that the algorithm be efficient in order for the scheme to be practical. For a static scheme, there are three algorithms to analyze: the map construction algorithm, the encoding algorithm, and the decoding algorithm. For a dynamic scheme, there are just two algorithms: the encoding algorithm and the decoding algorithm.
Several common measures of compression have been suggested: redundancy [Shannon and Weaver 1949], average message length [Huffman 1952], and compression ratio [Rubin 1976; Ruth and Kreutzer 1972]. These measures are defined below. Related to each of these measures are assumptions about the characteristics of the source. It is generally assumed in information theory that all statistical parameters of a message source are known with perfect accuracy [Gilbert 1971]. The most common model is that of a discrete memoryless source: a source whose output is a sequence of letters (or messages), each letter being a selection from some fixed alphabet a_1, ..., a_n. The letters are taken to be random, statistically independent selections from the alphabet, the selection being made according to some fixed probability assignment p(a_1), ..., p(a_n) [Gallager 1968]. Without loss of generality, the code alphabet is assumed to be {0, 1} throughout this paper. The modifications necessary for larger code alphabets are straightforward.

We assume that any cost associated with the code letters is uniform. This is a reasonable assumption, although it omits applications like telegraphy, where the code symbols are of different durations. The assumption is also important, since the problem of constructing optimal codes over unequal code letter costs is a significantly different and more difficult problem. Perl et al. [1975] and Varn [1971] have developed algorithms for minimum-redundancy prefix coding in the case of arbitrary symbol cost and equal codeword probability. The assumption of equal probabilities mitigates the difficulty presented by the variable symbol cost. For the more general model of unequal letter costs and unequal probabilities, Karp [1961] has proposed an integer linear programming approach. There have been several approximation algorithms proposed for this more difficult problem [Krause 1962; Cot 1977; Mehlhorn 1980].
When data are compressed, the goal is to reduce redundancy, leaving only the informational content. The measure of information of a source message a_i (in bits) is -log p(a_i), where log denotes the base 2 logarithm and p(a_i) denotes the probability of occurrence of the message a_i.² This definition has intuitive appeal; in the case in which p(a_i) = 1, it is clear that a_i is not at all informative since it had to occur. Similarly, the smaller the value of p(a_i), the more unlikely a_i is to appear, and hence the larger its information content. The reader is referred to Abramson [1963, pp. 6-13] for a longer, more elegant discussion of the legitimacy of this technical definition of the concept of information. The average information content over the source alphabet can be computed by weighting the information content of each source letter by its probability of occurrence, yielding the expression Σ_{i=1}^{n} [-p(a_i) log p(a_i)]. This quantity is referred to as the entropy of a source letter or the entropy of the source and is denoted by H. Since the length of a codeword for message a_i must be sufficient to carry the information content of a_i, entropy imposes a lower bound on the number of bits required for the coded message.

² Note that throughout this paper all logarithms are to the base 2.
`
`
`
`
The total number of bits must be at least as large as the product of H and the length of the source ensemble. Since the value of H is generally not an integer, variable-length codewords must be used if the lower bound is to be achieved. Given that message EXAMPLE is to be encoded one letter at a time, the entropy of its source can be calculated using the probabilities given in Figure 3: H = 2.894, so that the minimum number of bits contained in an encoding of EXAMPLE is 116. The Huffman code given in Section 1.2 does not quite achieve the theoretical minimum in this case.
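These figures are easy to reproduce. The sketch below (ours, not from the paper) computes H from the Figure 3 probabilities and multiplies by the ensemble length of 40 characters.

```python
import math

# Frequencies of the source messages of EXAMPLE (out of 40 characters).
freqs = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7, "g": 8, " ": 5}
total = sum(freqs.values())                      # 40

# Entropy H = sum over the alphabet of -p(a_i) * log2 p(a_i).
H = -sum((n / total) * math.log2(n / total) for n in freqs.values())

print(round(H, 3))                  # 2.894 bits per source letter
print(math.ceil(H * total))         # 116, the lower bound on the coded length
```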
Both of these definitions of information content are due to Shannon [1949]. A derivation of the concept of entropy as it relates to information theory is presented by Shannon [1949]. A simpler, more intuitive explanation of entropy is offered by Ash [1965].
The most common notion of a "good" code is one that is optimal in the sense of having minimum redundancy. Redundancy can be defined as Σ p(a_i) l_i - Σ [-p(a_i) log p(a_i)], where l_i is the length of the codeword representing message a_i. The expression Σ p(a_i) l_i represents the lengths of the codewords weighted by their probabilities of occurrence, that is, the average codeword length. The expression Σ [-p(a_i) log p(a_i)] is entropy H. Thus, redundancy is a measure of the difference between average codeword length and average information content. If a code has minimum average codeword length for a given discrete probability distribution, it is said to be a minimum redundancy code.
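For the Figure 3 code the numbers work out as follows: the average codeword length is 117/40 = 2.925 bits, the entropy is about 2.894 bits, and the redundancy is therefore roughly 0.031 bits per source letter. A small sketch of the arithmetic (ours, using the Figure 3 frequencies and codeword lengths):

```python
import math

freqs = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7, "g": 8, " ": 5}
lengths = {"a": 4, "b": 4, "c": 3, "d": 3, "e": 3, "f": 3, "g": 2, " ": 3}  # Figure 3
total = sum(freqs.values())
p = {m: n / total for m, n in freqs.items()}

avg_codeword_len = sum(p[m] * lengths[m] for m in p)     # Σ p(a_i) * l_i
H = -sum(p[m] * math.log2(p[m]) for m in p)              # entropy
print(round(avg_codeword_len, 3), round(H, 3), round(avg_codeword_len - H, 3))
# 2.925 2.894 0.031 -- redundancy of the Figure 3 code, in bits per source letter
```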
We define the term local redundancy to capture the notion of redundancy caused by local properties of a message ensemble, rather than its global characteristics. Although the model used for analyzing general-purpose coding techniques assumes a random distribution of the source messages, this may not actually be the case. In particular applications the tendency for messages to cluster in predictable patterns may be known. The existence of predictable patterns may be exploited to minimize local redundancy. Examples of applications in which local redundancy is common and methods for dealing with local redundancy are discussed in Sections 2 and 6.2.
Huffman uses average message length, Σ p(a_i) l_i, as a measure of the efficiency of a code. Clearly, the meaning of this term is the average length of a coded message. We use the term average codeword length to represent this quantity. Since redundancy is defined to be average codeword length minus entropy, and entropy is constant for a given probability distribution, minimizing average codeword length minimizes redundancy.
A code is asymptotically optimal if it has the property that, for a given probability distribution, the ratio of average codeword length to entropy approaches 1 as entropy tends to infinity. That is, asymptotic optimality guarantees that average codeword length approaches the theoretical minimum (entropy represents information content, which imposes a lower bound on codeword length).
The amount of compression yielded by a coding scheme can be measured by a compression ratio. Compression ratio has been defined in several ways. The definition C = (average message length)/(average codeword length) captures the common meaning, which is a comparison of the length of the coded message to the length of the original ensemble [Cappellini 1985]. If we think of the characters of the ensemble EXAMPLE as 6-bit ASCII characters, then the average message length is 6 bits. The Huffman code of Section 1.2 represents EXAMPLE in 117 bits, or 2.9 bits per character. This yields a compression ratio of 6/2.9, representing compression by a factor of more than 2. Alternatively, we may say that Huffman encoding produces a file whose size is 49% of the original ASCII file, or that 49% compression has been achieved.
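The same arithmetic in a few lines of Python (our sketch, using the 40-character EXAMPLE and the 117-bit Huffman encoding quoted above):

```python
source_bits = 40 * 6          # EXAMPLE as 6-bit ASCII characters
coded_bits = 117              # Huffman encoding from Figure 3
avg_message_len = 6
avg_codeword_len = coded_bits / 40            # about 2.9 bits per character

print(avg_message_len / avg_codeword_len)     # about 2.05: compression by a factor of 2
print(round(100 * coded_bits / source_bits))  # 49: coded file is 49% of the original
```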
A somewhat different definition of compression ratio, by Rubin [1976], C = (S - O - OR)/S, includes the representation of the code itself in the transmission cost. In this definition, S represents the length of the source ensemble, O the length of the output (coded message), and OR the size of the "output representation" (e.g., the number of bits required for the encoder to transmit the code mapping to the decoder). The quantity OR constitutes a "charge" to an algorithm for transmission of information about the coding scheme. The intention is to measure the total size of the transmission (or file to be stored).
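As an illustration only, with hypothetical numbers (a 240-bit source, a 117-bit coded message, and an assumed 60-bit description of the code mapping), Rubin's measure charges the algorithm for transmitting the mapping:

```python
def rubin_ratio(S, O, OR):
    """Compression ratio in Rubin's sense: fraction of the source saved
    after paying for both the coded message and the code description."""
    return (S - O - OR) / S

# Hypothetical values; OR = 0 corresponds to a mapping the decoder already knows.
print(rubin_ratio(240, 117, 60))   # 0.2625 -- about 26% saved once OR is charged
print(rubin_ratio(240, 117, 0))    # 0.5125 -- just over half saved if the mapping is free
```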
`
1.4 Motivation

As discussed in the Introduction, data compression has wide application in terms of information storage, including representation of the abstract data type string [Standish 1980] and file compression. Huffman coding is used for compression in several file archival systems [ARC 1986; PKARC 1987], as is Lempel-Ziv coding, one of the adaptive schemes to be discussed in Section 5. An adaptive Huffman coding technique is the basis for the compact command of the UNIX operating system, and the UNIX compress utility employs the Lempel-Ziv approach [UNIX 1984].
In the area of data transmission, Huffman coding has been passed over for years in favor of block-block codes, notably ASCII. The advantage of Huffman coding is in the average number of bits per character transmitted, which may be much smaller than the log n bits per character (where n is the source alphabet size) of a block-block system. The primary difficulty associated with variable-length codewords is that the rate at which bits are presented to the transmission channel will fluctuate, depending on the relative frequencies of the source messages. This fluctuation requires buffering between the source and the channel. Advances in technology have both overcome this difficulty and contributed to the appeal of variable-length codes. Current data networks allocate communication resources to sources on the basis of need and provide buffering as part of the system. These systems require significant amounts of protocol, and fixed-length codes are quite inefficient for applications such as packet headers. In addition, communication costs are beginning to dominate storage and processing costs, so that variable-length coding schemes that reduce communication costs are attractive even if they are more complex. For these reasons, one could expect to see even greater use of variable-length coding in the future.
It is interesting to note that the Huffman coding algorithm, originally developed for the efficient transmission of data, also has a wide variety of applications outside the sphere of data compression. These include construction of optimal search trees [Zimmerman 1959; Hu and Tucker 1971; Itai 1976], list merging [Brent and Kung 1978], and generating optimal evaluation trees in the compilation of expressions [Parker 1980]. Additional applications involve search for jumps in a monotone function of a single variable, sources of pollution along a river, and leaks in a pipeline [Glassey and Karp 1976]. The fact that this elegant combinatorial algorithm has influenced so many diverse areas underscores its importance.
`
2. SEMANTIC-DEPENDENT METHODS

Semantic-dependent data compression techniques are designed to respond to specific types of local redundancy occurring in certain applications. One area in which data compression is of great importance is image representation and processing