Data Compression

DEBRA A. LELEWER and DANIEL S. HIRSCHBERG

Department of Information and Computer Science, University of California, Irvine, California 92717

This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.

Categories and Subject Descriptors: E.4 [Data]: Coding and Information Theory - data compaction and compression

General Terms: Algorithms, Theory

Additional Key Words and Phrases: Adaptive coding, adaptive Huffman codes, coding, coding theory, file compression, Huffman codes, minimum-redundancy codes, optimal codes, prefix codes, text compression

INTRODUCTION

Data compression is often referred to as coding, where coding is a general term encompassing any special representation of data that satisfies a given need. Information theory is defined as the study of efficient coding and its consequences in the form of speed of transmission and probability of error [Ingels 1971]. Data compression may be viewed as a branch of information theory in which the primary objective is to minimize the amount of data to be transmitted. The purpose of this paper is to present and analyze a variety of data compression algorithms.

A simple characterization of data compression is that it involves transforming a string of characters in some representation (such as ASCII) into a new string (e.g., of bits) that contains the same information but whose length is as small as possible. Data compression has important application in the areas of data transmission and data storage. Many data processing applications require storage of large volumes of data, and the number of such applications is constantly increasing as the use of computers extends to new disciplines. At the same time, the proliferation of computer communication networks is resulting in massive transfer of data over communication links. Compressing data to be stored or transmitted reduces storage and/or communication costs. When the amount of data to be transmitted is reduced, the effect is that of increasing the capacity of the communication channel. Similarly, compressing a file to half of its original size is equivalent to doubling the capacity of the storage medium. It may then become feasible to store the data at a higher, thus faster, level of the storage hierarchy and reduce the load on the input/output channels of the computer system.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1988 ACM 0360-0300/87/0900-0261 $1.50

CONTENTS

INTRODUCTION
1. FUNDAMENTAL CONCEPTS
   1.1 Definitions
   1.2 Classification of Methods
   1.3 A Data Compression Model
   1.4 Motivation
2. SEMANTIC-DEPENDENT METHODS
3. STATIC DEFINED-WORD SCHEMES
   3.1 Shannon-Fano Code
   3.2 Static Huffman Coding
   3.3 Universal Codes and Representations of the Integers
   3.4 Arithmetic Coding
4. ADAPTIVE HUFFMAN CODING
   4.1 Algorithm FGK
   4.2 Algorithm V
5. OTHER ADAPTIVE METHODS
   5.1 Lempel-Ziv Codes
   5.2 Algorithm BSTW
6. EMPIRICAL RESULTS
7. SUSCEPTIBILITY TO ERROR
   7.1 Static Codes
   7.2 Adaptive Codes
8. NEW DIRECTIONS
9. SUMMARY
REFERENCES

Many of the methods discussed in this paper are implemented in production systems. The UNIX¹ utilities compact and compress are based on methods discussed in Sections 4 and 5, respectively [UNIX 1984]. Popular file archival systems such as ARC and PKARC use techniques presented in Sections 3 and 5 [ARC 1986; PKARC 1987]. The savings achieved by data compression can be dramatic; reduction as high as 80% is not uncommon [Reghbati 1981]. Typical values of compression provided by compact are text (38%), Pascal source (43%), C source (36%), and binary (19%). Compress generally achieves better compression (50-60% for text such as source code and English) and takes less time to compute [UNIX 1984].

¹ UNIX is a trademark of AT&T Bell Laboratories.

Arithmetic coding (Section 3.4) has been reported to reduce a file to anywhere from 12.1 to 73.5% of its original size [Witten et al. 1987]. Cormack reports that data compression programs based on Huffman coding (Section 3.2) reduced the size of a large student-record database by 42.1% when only some of the information was compressed. As a consequence of this size reduction, the number of disk operations required to load the database was reduced by 32.7% [Cormack 1985]. Data compression routines developed with specific applications in mind have achieved compression factors as high as 98% [Severance 1983].

Although coding for purposes of data security (cryptography) and codes that guarantee a certain level of data integrity (error detection/correction) are topics worthy of attention, they do not fall under the umbrella of data compression. With the exception of a brief discussion of the susceptibility to error of the methods surveyed (Section 7), a discrete noiseless channel is assumed. That is, we assume a system in which a sequence of symbols chosen from a finite alphabet can be transmitted from one point to another without the possibility of error. Of course, the coding schemes described here may be combined with data security or error-correcting codes.

Much of the available literature on data compression approaches the topic from the point of view of data transmission. As noted earlier, data compression is of value in data storage as well. Although this discussion is framed in the terminology of data transmission, compression and decompression of data files are essentially the same tasks as sending and receiving data over a communication channel. The focus of this paper is on algorithms for data compression; it does not deal with hardware aspects of data transmission. The reader is referred to Cappellini [1985] for a discussion of techniques with natural hardware implementation.

Background concepts in the form of terminology and a model for the study of data compression are provided in Section 1. Applications of data compression are also discussed in Section 1 to provide motivation for the material that follows.

Although the primary focus of this survey is data compression methods of general utility, Section 2 includes examples from the literature in which ingenuity applied to domain-specific problems has yielded interesting coding techniques. These techniques are referred to as semantic dependent since they are designed to exploit the context and semantics of the data to achieve redundancy reduction. Semantic-dependent techniques include the use of quadtrees, run-length encoding, or difference mapping for storage and transmission of image data [Gonzalez and Wintz 1977; Samet 1984].

General-purpose techniques, which assume no knowledge of the information content of the data, are described in Sections 3-5. These descriptions are sufficiently detailed to provide an understanding of the techniques. The reader will need to consult the references for implementation details. In most cases only worst-case analyses of the methods are feasible. To provide a more realistic picture of their effectiveness, empirical data are presented in Section 6. The susceptibility to error of the algorithms surveyed is discussed in Section 7, and possible directions for future research are considered in Section 8.

1. FUNDAMENTAL CONCEPTS

A brief introduction to information theory is provided in this section. The definitions and assumptions necessary to a comprehensive discussion and evaluation of data compression methods are discussed. The following string of characters is used to illustrate the concepts defined: EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg".

1.1 Definitions

A code is a mapping of source messages (words from the source alphabet α) into codewords (words of the code alphabet β). The source messages are the basic units into which the string to be represented is partitioned. These basic units may be single symbols from the source alphabet, or they may be strings of symbols. For string EXAMPLE, α = {a, b, c, d, e, f, g, space}. For purposes of explanation, β is taken to be {0, 1}.

Source message    Codeword
a                 000
b                 001
c                 010
d                 011
e                 100
f                 101
g                 110
space             111

Figure 1. A block-block code for EXAMPLE.

Source message    Codeword
aa                0
bbb               1
cccc              10
ddddd             11
eeeeee            100
fffffff           101
gggggggg          110
space             111

Figure 2. A variable-variable code for EXAMPLE.

Codes can be categorized as block-block, block-variable, variable-block, or variable-variable, where block-block indicates that the source messages and codewords are of fixed length and variable-variable codes map variable-length source messages into variable-length codewords. A block-block code for EXAMPLE is shown in Figure 1, and a variable-variable code is given in Figure 2. If the string EXAMPLE were coded using the Figure 1 code, the length of the coded message would be 120; using Figure 2 the length would be 30.

The oldest and most widely used codes, ASCII and EBCDIC, are examples of block-block codes, mapping an alphabet of 64 (or 256) single characters onto 6-bit (or 8-bit) codewords. These are not discussed, since they do not provide compression. The codes featured in this survey are of the block-variable, variable-variable, and variable-block types.
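
As an illustrative check of these figures, the short Python sketch below encodes the 40-character EXAMPLE string under the Figure 1 and Figure 2 codes and confirms the coded lengths of 120 and 30 bits; the parsing of EXAMPLE into runs of identical letters and single spaces is assumed, as implied by Figure 2.

import re

# Illustrative sketch: EXAMPLE and both code tables are taken from the text and
# from Figures 1 and 2 above.
EXAMPLE = "aa bbb cccc ddddd eeeeee fffffffgggggggg"

# Figure 1: block-block code, one 3-bit codeword per character.
block_code = {c: f"{i:03b}" for i, c in enumerate("abcdefg ")}
block_bits = "".join(block_code[c] for c in EXAMPLE)

# Figure 2: variable-variable code, one codeword per run of letters or per space.
vv_code = {"aa": "0", "bbb": "1", "cccc": "10", "ddddd": "11",
           "eeeeee": "100", "fffffff": "101", "gggggggg": "110", " ": "111"}
messages = re.findall(r" |a+|b+|c+|d+|e+|f+|g+", EXAMPLE)  # parse into source messages
vv_bits = "".join(vv_code[m] for m in messages)

print(len(block_bits), len(vv_bits))  # -> 120 30
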
When source messages of variable length are allowed, the question of how a message ensemble (sequence of messages) is parsed into individual messages arises. Many of the algorithms described here are defined-word schemes.
That is, the set of source messages is determined before the invocation of the coding scheme. For example, in text file processing, each character may constitute a message, or messages may be defined to consist of alphanumeric and nonalphanumeric strings. In Pascal source code, each token may represent a message. All codes involving fixed-length source messages are, by default, defined-word codes. In free-parse methods, the coding algorithm itself parses the ensemble into variable-length sequences of symbols. Most of the known data compression methods are defined-word schemes; the free-parse model differs in a fundamental way from the classical coding paradigm.

A code is distinct if each codeword is distinguishable from every other (i.e., the mapping from source messages to codewords is one to one). A distinct code is uniquely decodable if every codeword is identifiable when immersed in a sequence of codewords. Clearly, each of these features is desirable. The codes of Figures 1 and 2 are both distinct, but the code of Figure 2 is not uniquely decodable. For example, the coded message 11 could be decoded as either "ddddd" or "bbbbbb". A uniquely decodable code is a prefix code (or prefix-free code) if it has the prefix property, which requires that no codeword be a proper prefix of any other codeword. All uniquely decodable block-block and variable-block codes are prefix codes. The code with codewords {1, 100000, 00} is an example of a code that is uniquely decodable but that does not have the prefix property. Prefix codes are instantaneously decodable; that is, they have the desirable property that the coded message can be parsed into codewords without the need for lookahead. In order to decode a message encoded using the codeword set {1, 100000, 00}, lookahead is required. For example, the first codeword of the message 1000000001 is 1, but this cannot be determined until the last (tenth) symbol of the message is read (if the string of zeros had been of odd length, the first codeword would have been 100000).
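
A small sketch of instantaneous decoding: because no codeword is a proper prefix of another, the decoder below can emit a source message the moment the bits read so far match a codeword, with no lookahead. The Figure 1 code is used for illustration.

def decode_prefix_code(bits, code):
    """Greedily decode a bit string under a prefix code {message: codeword}."""
    inverse = {cw: msg for msg, cw in code.items()}
    decoded, current = [], ""
    for b in bits:
        current += b
        if current in inverse:       # safe only because of the prefix property
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string does not end on a codeword boundary")
    return decoded

block_code = {c: f"{i:03b}" for i, c in enumerate("abcdefg ")}  # the Figure 1 code
print(decode_prefix_code("000000111001001001", block_code))
# -> ['a', 'a', ' ', 'b', 'b', 'b']
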
A minimal prefix code is a prefix code such that, if x is a proper prefix of some codeword, then xa is either a codeword or a proper prefix of a codeword for each letter a in β. The set of codewords {00, 01, 10} is an example of a prefix code that is not minimal. The fact that 1 is a proper prefix of the codeword 10 requires that 11 be either a codeword or a proper prefix of a codeword, and it is neither. Intuitively, the minimality constraint prevents the use of codewords that are longer than necessary. In the above example the codeword 10 could be replaced by the codeword 1, yielding a minimal prefix code with shorter codewords. The codes discussed in this paper are all minimal prefix codes.
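
The two properties can also be tested mechanically. The sketch below checks the prefix property and the minimality condition just stated over the code alphabet {0, 1}, confirming that {00, 01, 10} is a prefix code but not minimal, while {00, 01, 1} is both.

def is_prefix_code(codewords):
    return not any(x != y and y.startswith(x) for x in codewords for y in codewords)

def is_minimal(codewords, alphabet="01"):
    # every proper prefix x of a codeword, extended by any code letter, must again
    # be a codeword or a proper prefix of a codeword
    prefixes = {cw[:i] for cw in codewords for i in range(len(cw))}
    return all(x + a in codewords or x + a in prefixes
               for x in prefixes for a in alphabet)

print(is_prefix_code({"00", "01", "10"}), is_minimal({"00", "01", "10"}))  # True False
print(is_prefix_code({"00", "01", "1"}), is_minimal({"00", "01", "1"}))    # True True
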
In this section a code has been defined to be a mapping from a source alphabet to a code alphabet; we now define related terms. The process of transforming a source ensemble into a coded message is coding or encoding. The encoded message may be referred to as an encoding of the source ensemble. The algorithm that constructs the mapping and uses it to transform the source ensemble is called the encoder. The decoder performs the inverse operation, restoring the coded message to its original form.

1.2 Classification of Methods

Not only are data compression schemes categorized with respect to message and codeword lengths, but they are also classified as either static or dynamic. A static method is one in which the mapping from the set of messages to the set of codewords is fixed before transmission begins, so that a given message is represented by the same codeword every time it appears in the message ensemble. The classic static defined-word scheme is Huffman coding [Huffman 1952]. In Huffman coding, the assignment of codewords to source messages is based on the probabilities with which the source messages appear in the message ensemble. Messages that appear frequently are represented by short codewords; messages with smaller probabilities map to longer codewords. These probabilities are determined before transmission begins. A Huffman code for the ensemble EXAMPLE is given in Figure 3. If EXAMPLE were coded using this Huffman mapping, the length of the coded message would be 117. Static Huffman coding is discussed in Section 3.2; other static schemes are discussed in Sections 2 and 3.

Source message    Probability    Codeword
a                 2/40           1001
b                 3/40           1000
c                 4/40           011
d                 5/40           010
e                 6/40           111
f                 7/40           110
g                 8/40           00
space             5/40           101

Figure 3. A Huffman code for the message EXAMPLE (code length = 117).
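
Although the construction itself is the subject of Section 3.2, the standard Huffman procedure of repeatedly merging the two least probable subtrees can be sketched briefly to verify the 117-bit figure. Because of ties, the codewords produced may differ from those in Figure 3, but any Huffman code for these frequencies encodes EXAMPLE in 117 bits.

import heapq
from itertools import count

freqs = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7, "g": 8, " ": 5}  # counts out of 40

def huffman_code(freqs):
    """Illustrative sketch: return {message: codeword} by repeatedly merging the
    two least frequent subtrees; the tie counter keeps heap entries comparable."""
    ties = count()
    heap = [(f, next(ties), {m: ""}) for m, f in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {m: "0" + cw for m, cw in left.items()}
        merged.update({m: "1" + cw for m, cw in right.items()})
        heapq.heappush(heap, (f1 + f2, next(ties), merged))
    return heap[0][2]

code = huffman_code(freqs)
print(sum(freqs[m] * len(cw) for m, cw in code.items()))  # -> 117
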
A code is dynamic if the mapping from the set of messages to the set of codewords changes over time. For example, dynamic Huffman coding involves computing an approximation to the probabilities of occurrence "on the fly," as the ensemble is being transmitted. The assignment of codewords to messages is based on the values of the relative frequencies of occurrence at each point in time. A message x may be represented by a short codeword early in the transmission because it occurs frequently at the beginning of the ensemble, even though its probability of occurrence over the total ensemble is low. Later, when the more probable messages begin to occur with higher frequency, the short codeword will be mapped to one of the higher probability messages, and x will be mapped to a longer codeword. As an illustration, Figure 4 presents a dynamic Huffman code table corresponding to the prefix "aa bbb" of EXAMPLE. Although the frequency of space over the entire message is greater than that of b, at this point b has higher frequency and therefore is mapped to the shorter codeword.

Source message    Probability    Codeword
a                 2/6            10
b                 3/6            0
space             1/6            11

Figure 4. A dynamic Huffman code table for the prefix "aa bbb" of message EXAMPLE.

Dynamic codes are also referred to in the literature as adaptive, in that they adapt to changes in ensemble characteristics over time. The term adaptive is used for the remainder of this paper; the fact that these codes adapt to changing characteristics is the source of their appeal. Some adaptive methods adapt to changing patterns in the source [Welch 1984], whereas others exploit locality of reference [Bentley et al. 1986]. Locality of reference is the tendency, common in a wide variety of text types, for a particular word to occur frequently for short periods of time and then fall into disuse for long periods.

All of the adaptive methods are one-pass methods; only one scan of the ensemble is required. Static Huffman coding requires two passes: one pass to compute probabilities and determine the mapping, and a second pass for transmission. Thus, as long as the encoding and decoding times of an adaptive method are not substantially greater than those of a static method, the fact that an initial scan is not needed implies a speed improvement in the adaptive case. In addition, the mapping determined in the first pass of a static coding scheme must be transmitted by the encoder to the decoder. The mapping may preface each transmission (i.e., each file sent), or a single mapping may be agreed upon and used for multiple transmissions. In one-pass methods the encoder defines and redefines the mapping dynamically during transmission. The decoder must define and redefine the mapping in sympathy, in essence "learning" the mapping as codewords are received. Adaptive methods are discussed in Sections 4 and 5.
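
The sketch below illustrates this "learning in sympathy" with a deliberately naive one-pass scheme; it is not algorithm FGK or algorithm V of Section 4, and rebuilding the code table from scratch before every symbol would be far too slow in practice. Both sides start from identical pseudo-counts of 1 for an agreed-upon alphabet, derive the same Huffman table from the counts seen so far, and update the counts after each symbol, so no mapping is ever transmitted.

import heapq
from collections import Counter
from itertools import count

ALPHABET = "abcdefg "  # assumed agreed upon in advance by sender and receiver

def huffman_code(freqs):
    # same deterministic construction as in the earlier sketch
    ties = count()
    heap = [(f, next(ties), {m: ""}) for m, f in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {m: "0" + cw for m, cw in left.items()}
        merged.update({m: "1" + cw for m, cw in right.items()})
        heapq.heappush(heap, (f1 + f2, next(ties), merged))
    return heap[0][2]

def adaptive_encode(message):
    counts, bits = Counter({s: 1 for s in ALPHABET}), []
    for sym in message:
        bits.append(huffman_code(counts)[sym])  # code the symbol with the current table
        counts[sym] += 1                        # then update, exactly as the decoder will
    return "".join(bits)

def adaptive_decode(bits):
    counts, decoded, i = Counter({s: 1 for s in ALPHABET}), [], 0
    while i < len(bits):
        inverse = {cw: m for m, cw in huffman_code(counts).items()}
        j = i + 1
        while bits[i:j] not in inverse:         # prefix property: the first match is correct
            j += 1
        sym = inverse[bits[i:j]]
        decoded.append(sym)
        counts[sym] += 1                        # decoder "learns" the mapping in sympathy
        i = j
    return "".join(decoded)

text = "aa bbb cccc ddddd eeeeee fffffffgggggggg"
assert adaptive_decode(adaptive_encode(text)) == text
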
An algorithm may also be a hybrid, neither completely static nor completely dynamic. In a simple hybrid scheme, sender and receiver maintain identical codebooks containing k static codes. For each transmission the sender must choose one of the k previously agreed upon codes and inform the receiver of the choice (by transmitting first the "name" or number of the chosen code). Hybrid methods are discussed further in Sections 2 and 3.2.
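
A sketch of such a hybrid, with illustrative code tables supplied by the caller: the sender evaluates each of the k previously agreed-upon codes on the ensemble, transmits the number of the cheapest one in ceil(log2 k) bits, and then transmits the ensemble under that code.

from math import ceil, log2

def hybrid_encode(message, codebooks):
    """codebooks: a list of k prefix codes {symbol: codeword}, assumed identical
    at the receiver. Returns the chosen code's number followed by the coded message."""
    best = min(range(len(codebooks)),
               key=lambda i: sum(len(codebooks[i][s]) for s in message))
    name_bits = max(1, ceil(log2(len(codebooks))))  # bits spent naming the chosen code
    return format(best, f"0{name_bits}b") + "".join(codebooks[best][s] for s in message)

With a codebook containing just the Figure 1 and Figure 3 codes, for instance, the sender would name the Huffman code with a single bit and send EXAMPLE in 1 + 117 = 118 bits rather than 120.
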
1.3 A Data Compression Model

In order to discuss the relative merits of data compression techniques, a framework for comparison must be established.
There are two dimensions along which each of the schemes discussed here may be measured: algorithm complexity and amount of compression. When data compression is used in a data transmission application, the goal is speed. Speed of transmission depends on the number of bits sent, the time required for the encoder to generate the coded message, and the time required for the decoder to recover the original ensemble. In a data storage application, although the degree of compression is the primary concern, it is nonetheless necessary that the algorithm be efficient in order for the scheme to be practical. For a static scheme, there are three algorithms to analyze: the map construction algorithm, the encoding algorithm, and the decoding algorithm. For a dynamic scheme, there are just two algorithms: the encoding algorithm and the decoding algorithm.

Several common measures of compression have been suggested: redundancy [Shannon and Weaver 1949], average message length [Huffman 1952], and compression ratio [Rubin 1976; Ruth and Kreutzer 1972]. These measures are defined below. Related to each of these measures are assumptions about the characteristics of the source. It is generally assumed in information theory that all statistical parameters of a message source are known with perfect accuracy [Gilbert 1971]. The most common model is that of a discrete memoryless source: a source whose output is a sequence of letters (or messages), each letter being a selection from some fixed alphabet a_1, ..., a_n. The letters are taken to be random, statistically independent selections from the alphabet, the selection being made according to some fixed probability assignment p(a_1), ..., p(a_n) [Gallager 1968]. Without loss of generality, the code alphabet is assumed to be {0, 1} throughout this paper. The modifications necessary for larger code alphabets are straightforward.

We assume that any cost associated with the code letters is uniform. This is a reasonable assumption, although it omits applications like telegraphy where the code symbols are of different durations.
The assumption is also important, since the problem of constructing optimal codes over unequal code letter costs is a significantly different and more difficult problem. Perl et al. [1975] and Varn [1971] have developed algorithms for minimum-redundancy prefix coding in the case of arbitrary symbol cost and equal codeword probability. The assumption of equal probabilities mitigates the difficulty presented by the variable symbol cost. For the more general unequal letter costs and unequal probabilities model, Karp [1961] has proposed an integer linear programming approach. There have been several approximation algorithms proposed for this more difficult problem [Krause 1962; Cot 1977; Mehlhorn 1980].

When data are compressed, the goal is to reduce redundancy, leaving only the informational content. The measure of information of a source message a_i (in bits) is -log p(a_i), where log denotes the base 2 logarithm and p(a_i) denotes the probability of occurrence of the message a_i.² This definition has intuitive appeal; in the case in which p(a_i) = 1, it is clear that a_i is not at all informative since it had to occur. Similarly, the smaller the value of p(a_i), the more unlikely a_i is to appear, and hence the larger its information content. The reader is referred to Abramson [1963, pp. 6-13] for a longer, more elegant discussion of the legitimacy of this technical definition of the concept of information. The average information content over the source alphabet can be computed by weighting the information content of each source letter by its probability of occurrence, yielding the expression Σ_{i=1}^{n} [-p(a_i) log p(a_i)]. This quantity is referred to as the entropy of a source letter or the entropy of the source and is denoted by H. Since the length of a codeword for message a_i must be sufficient to carry the information content of a_i, entropy imposes a lower bound on the number of bits required for the coded message.

² Note that throughout this paper all logarithms are to the base 2.

The total number of bits must be at least as large as the product of H and the length of the source ensemble. Since the value of H is generally not an integer, variable-length codewords must be used if the lower bound is to be achieved. Given that message EXAMPLE is to be encoded one letter at a time, the entropy of its source can be calculated using the probabilities given in Figure 3: H = 2.894, so that the minimum number of bits contained in an encoding of EXAMPLE is 116. The Huffman code given in Section 1.2 does not quite achieve the theoretical minimum in this case.
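
The arithmetic is easy to verify; a short sketch using the Figure 3 probabilities:

from math import ceil, log2

counts = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7, "g": 8, " ": 5}  # from Figure 3
total = sum(counts.values())                               # the 40 letters of EXAMPLE
H = sum(-(c / total) * log2(c / total) for c in counts.values())
print(round(H, 3), ceil(total * H))                        # -> 2.894 116
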
Both of these definitions of information content are due to Shannon [1949]. A derivation of the concept of entropy as it relates to information theory is presented by Shannon [1949]. A simpler, more intuitive explanation of entropy is offered by Ash [1965].

The most common notion of a "good" code is one that is optimal in the sense of having minimum redundancy. Redundancy can be defined as Σ p(a_i)l_i - Σ [-p(a_i) log p(a_i)], where l_i is the length of the codeword representing message a_i. The expression Σ p(a_i)l_i represents the lengths of the codewords weighted by their probabilities of occurrence, that is, the average codeword length. The expression Σ [-p(a_i) log p(a_i)] is entropy H. Thus, redundancy is a measure of the difference between average codeword length and average information content. If a code has minimum average codeword length for a given discrete probability distribution, it is said to be a minimum redundancy code.
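
Continuing the sketch above, the redundancy of the Figure 3 Huffman code works out to about 0.031 bits per message: 117/40 = 2.925 bits of average codeword length against an entropy of roughly 2.894 bits.

from math import log2

counts = {"a": 2, "b": 3, "c": 4, "d": 5, "e": 6, "f": 7, "g": 8, " ": 5}   # from Figure 3
lengths = {"a": 4, "b": 4, "c": 3, "d": 3, "e": 3, "f": 3, "g": 2, " ": 3}  # Figure 3 codeword lengths
total = sum(counts.values())
avg_len = sum((counts[m] / total) * lengths[m] for m in counts)   # average codeword length
H = sum(-(c / total) * log2(c / total) for c in counts.values())  # entropy
print(round(avg_len, 3), round(avg_len - H, 3))                   # -> 2.925 0.031
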
We define the term local redundancy to capture the notion of redundancy caused by local properties of a message ensemble, rather than its global characteristics. Although the model used for analyzing general-purpose coding techniques assumes a random distribution of the source messages, this may not actually be the case. In particular applications the tendency for messages to cluster in predictable patterns may be known. The existence of predictable patterns may be exploited to minimize local redundancy. Examples of applications in which local redundancy is common and methods for dealing with local redundancy are discussed in Sections 2 and 6.2.

Huffman uses average message length, Σ p(a_i)l_i, as a measure of the efficiency of a code. Clearly, the meaning of this term is the average length of a coded message. We use the term average codeword length to represent this quantity. Since redundancy is defined to be average codeword length minus entropy and entropy is constant for a given probability distribution, minimizing average codeword length minimizes redundancy.

A code is asymptotically optimal if it has the property that for a given probability distribution, the ratio of average codeword length to entropy approaches 1 as entropy tends to infinity. That is, asymptotic optimality guarantees that average codeword length approaches the theoretical minimum (entropy represents information content, which imposes a lower bound on codeword length).

The amount of compression yielded by a coding scheme can be measured by a compression ratio. Compression ratio has been defined in several ways. The definition C = (average message length)/(average codeword length) captures the common meaning, which is a comparison of the length of the coded message to the length of the original ensemble [Cappellini 1985]. If we think of the characters of the ensemble EXAMPLE as 6-bit ASCII characters, then the average message length is 6 bits. The Huffman code of Section 1.2 represents EXAMPLE in 117 bits, or 2.9 bits per character. This yields a compression ratio of 6/2.9, representing compression by a factor of more than 2. Alternatively, we may say that Huffman encoding produces a file whose size is 49% of the original ASCII file, or that 49% compression has been achieved.
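
These figures can be checked directly; the 6-bit-per-character assumption and the 117-bit Huffman encoding are the ones quoted above.

ascii_bits = 40 * 6                        # EXAMPLE as 6-bit ASCII characters
huffman_bits = 117                         # the Figure 3 Huffman encoding of EXAMPLE
C = 6 / (huffman_bits / 40)                # average message length / average codeword length
percent = 100 * huffman_bits / ascii_bits  # coded size as a percentage of the original
print(round(C, 2), round(percent, 1))      # -> 2.05 48.8 (the "49%" quoted above)
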
A somewhat different definition of compression ratio by Rubin [1976], C = (S - O - OR)/S, includes the representation of the code itself in the transmission cost. In this definition, S represents the length of the source ensemble, O the length of the output (coded message), and OR the size of the "output representation" (e.g., the number of bits required for the encoder to transmit the code mapping to the decoder). The quantity OR constitutes a "charge" to an algorithm for transmission of information about the coding scheme. The intention is to measure the total size of the transmission (or file to be stored).
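
For illustration only, Rubin's ratio applied to the EXAMPLE numbers above; the 80-bit charge for transmitting the eight codewords is an assumed figure, not one given in the paper.

def rubin_ratio(S, O, OR):
    """C = (S - O - OR) / S: the fraction of the source saved, net of the cost
    of shipping the code mapping itself."""
    return (S - O - OR) / S

# S = 240 bits of 6-bit ASCII, O = 117 bits of Huffman output, OR = 80 bits (assumed)
print(round(rubin_ratio(S=240, O=117, OR=80), 3))  # -> 0.179
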
1.4 Motivation

As discussed in the Introduction, data compression has wide application in terms of information storage, including representation of the abstract data type string [Standish 1980] and file compression. Huffman coding is used for compression in several file archival systems [ARC 1986; PKARC 1987], as is Lempel-Ziv coding, one of the adaptive schemes to be discussed in Section 5. An adaptive Huffman coding technique is the basis for the compact command of the UNIX operating system, and the UNIX compress utility employs the Lempel-Ziv approach [UNIX 1984].

In the area of data transmission, Huffman coding has been passed over for years in favor of block-block codes, notably ASCII. The advantage of Huffman coding is in the average number of bits per character transmitted, which may be much smaller than the log n bits per character (where n is the source alphabet size) of a block-block system. The primary difficulty associated with variable-length codewords is that the rate at which bits are presented to the transmission channel will fluctuate, depending on the relative frequencies of the source messages. This fluctuation requires buffering between the source and the channel. Advances in technology have both overcome this difficulty and contributed to the appeal of variable-length codes. Current data networks allocate communication resources to sources on the basis of need and provide buffering as part of the system. These systems require significant amounts of protocol, and fixed-length codes are quite inefficient for applications such as packet headers. In addition, communication costs are beginning to dominate storage and processing costs, so that variable-length coding schemes that reduce communication costs are attractive even if they are more complex. For these reasons, one could expect to see even greater use of variable-length coding in the future.

It is interesting to note that the Huffman coding algorithm, originally developed for the efficient transmission of data, also has a wide variety of applications outside the sphere of data compression. These include construction of optimal search trees [Zimmerman 1959; Hu and Tucker 1971; Itai 1976], list merging [Brent and Kung 1978], and generating optimal evaluation trees in the compilation of expressions [Parker 1980]. Additional applications involve search for jumps in a monotone function of a single variable, sources of pollution along a river, and leaks in a pipeline [Glassey and Karp 1976]. The fact that this elegant combinatorial algorithm has influenced so many diverse areas underscores its importance.

2. SEMANTIC-DEPENDENT METHODS

Semantic-dependent data compression techniques are designed to respond to specific types of local redundancy occurring in certain applications. One area in which data compression is of great importance is image representation and processing