`
`DEBRA A. LELEWER and DANIEL S. HIRSCHBERG
`
`Department
`
`of Information
`
`and Computer Science, University
`
`of California,
`
`Irvine, California 92717
`
`This paper surveys a variety of data compression methods spanning almost 40 years of
`research,
`from
`the work of Shannon, Fano, and Huffman
`in the late 1940s to a technique
`developed
`in 1986. The aim of data compression
`is to reduce redundancy
`in stored or
`communicated
`data, thus increasing effective data density. Data compression has
`important
`application
`in the areas of file storage and distributed
`systems. Concepts
`information
`theory as they relate
`to the goals and evaluation of data compression
`methods are discussed briefly. A framework
`for evaluation
`and comparison of methods
`constructed and applied
`to the algorithms presented. Comparisons of both theoretical
`empirical natures are reported, and possibilities
`for future
`research are suggested
`
`from
`
`is
`and
`
`Categories and Subject Descriptors: E.4 [Data]: Coding and Information
`compaction and compression
`
`Theory-data
`
`General Terms: Algorithms,
`
`Theory
`
`Additional Key Words and Phrases: Adaptive coding, adaptive Huffman
`coding theory,
`tile compression, Huffman
`codes, minimum-redundancy
`codes, prefix codes, text compression
`
`codes, coding,
`codes, optimal
`
`INTRODUCTION
`
`Data compression is often referred to as
`coding, where coding is a general term en-
`compassing any special representation of
`data that satisfies a given need. Informa-
`tion
`theory
`is defined as the study of
`efficient coding and its consequences in
`the
`form of speed of transmission and
`probability of error [ Ingels 19711. Data com-
`pression may be viewed as a branch of
`information
`theory
`in which the primary
`objective is to minimize the amount of data
`to be transmitted. The purpose of this pa-
`per is to present and analyze a variety of
`data compression algorithms.
`of data
`A simple
`characterization
`compression is that it involves transform-
`ing a string of characters in some represen-
`tation (such as ASCII)
`into a new string
`(e.g., of bits) that contains the same infor-
`
`is as small as
`mation but whose length
`possible. Data compression has important
`application
`in the areas of data transmis-
`sion and data storage. Many data process-
`ing applications
`require storage of large
`volumes of data, and the number of such
`applications is constantly increasing as the
`use of computers extends to new disci-
`plines. At the same time, the proliferation
`of computer communication networks
`is
`resulting
`in massive transfer of data over
`communication
`links. Compressing data to
`be stored or transmitted
`reduces storage
`and/or communication costs. When
`the
`amount of data to be transmitted
`is re-
`duced, the effect is that of increasing the
`capacity of the communication channel.
`Similarly, compressing a file to half of its
`original size is equivalent
`to doubling the
`capacity of the storage medium. It may then
`become feasible to store the data at a
`
`the copies are not made or
`that
`is granted provided
`fee all or part of this material
`to copy without
`Permission
`for direct commercial advantage,
`the ACM copyright notice and the title of the publication
`and its
`distributed
`date appear, and notice
`is given that copying
`is by permission of the Association
`for Computing Machinery.
`To
`copy otherwise, or to republish,
`requires a fee and/or specific permission.
`0 1966 ACM 0360-0300/87/0900-0261$1.50
`
`ACM Computing SUWSYS, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 1
`
`
`
`262
`
`.
`
`D. A. Lelewer and D. S. Hirschberg
`
`CONTENTS
`
`been reported to reduce a file to anywhere
`from 12.1 to 73.5% of its original size [Wit-
`ten et al. 1987 1. Cormack reports that data
`compression programs based on Huffman
`coding (Section 3.2) reduced the size of a
`large student-record database by 42.1%
`when only some of the information was
`compressed. As a consequence of this size
`reduction,
`the number of disk operations
`required to load the database was reduced
`by 32.7%
`[Cormack 19851. Data com-
`pression routines developed with specific
`applications
`in mind have achieved com-
`pression factors as high as 98% [Severance
`19831.
`Although coding for purposes of data se-
`curity (cryptography) and codes that guar-
`antee a certain level of data integrity (error
`detection/correction)
`are topics worthy of
`attention,
`they do not
`fall under
`the
`umbrella of data compression. With
`the
`exception of a brief discussion of the sus-
`ceptibility
`to error of the methods surveyed
`(Section 7), a discrete noiseless channel is
`assumed. That is, we assume a system in
`which a sequence of symbols chosen from
`a finite alphabet can be transmitted
`from
`one point to another without the possibility
`of error. Of course, the coding schemes
`described here may be combined with data
`security or error-correcting codes.
`Much of the available literature on data
`compression approaches the topic from the
`point of view of data transmission. As noted
`earlier, data compression is of value in data
`storage as well. Although this discussion is
`framed in the terminology of data trans-
`mission, compression and decompression
`of data files are essentially the same tasks
`as sending and receiving data over a com-
`munication
`channel. The
`focus of this
`paper is on algorithms
`for data compres-
`sion; it does not deal with hardware aspects
`of data transmission. The reader is referred
`to Cappellini
`[1985] for a discussion of
`techniques with natural hardware imple-
`mentation.
`Background concepts in the form of ter-
`minology and a model for the study of data
`compression are provided in Section 1. Ap-
`plications of data compression are also dis-
`cussed in Section 1 to provide motivation
`for the material that follows.
`
`METHODS
`SCHEMES
`
`of the
`
`CODING
`
`CONCEPTS
`
`INTRODUCTION
`1. FUNDAMENTAL
`1.1 Definitions
`of Methods
`1.2 Classification
`1.3 A Data Compression Model
`1.4 Motivation
`DEPENDENT
`2. SEMANTIC
`3. STATIC DEFINED-WORD
`3.1 Shannon-Fan0
`Code
`3.2 Static Huffman Coding
`3.3 Universal Codes and Representations
`Integers
`Coding
`3.4 Arithmetic
`HUFFMAN
`4. ADAPTIVE
`FGK
`4.1 Algorithm
`V
`4.2 Algorithm
`METHODS
`5. OTHER ADAPTIVE
`5.1 Lempel-Ziv
`Codes
`5.2 Algorithm
`BSTW
`6. EMPIRICAL
`RESULTS
`7. SUSCEPTIBILITY
`TO ERROR
`7.1 Static Codes
`7.2 Adaptive Codes
`8. NEW DIRECTIONS
`9. SUMMARY
`REFERENCES
`
`higher, thus faster, level of the storage hi-
`erarchy and reduce the load on the input/
`output channels of the computer system.
`Many of the methods discussed in this
`paper are
`implemented
`in production
`systems. The UNIX1 utilities
`compact
`and compress are based on methods dis-
`cussed in Sections 4 and 5, respectively
`[UNIX 19841. Popular file archival systems
`such as ARC and PKARC use techniques
`presented in Sections 3 and 5 [ARC 1986;
`PKARC 19871. The savings achieved by
`data compression can be dramatic; reduc-
`tion as high as 80% is not uncommon
`[Reghbati 19811. Typical values of com-
`pression provided by compact are text
`(38%), Pascal source (43%), C source
`(36%), and binary (19%). Compress gener-
`ally achieves better compression (50-60%
`for text such as source code and English)
`and takes less time
`to compute
`[UNIX
`19841. Arithmetic coding (Section 3.4) has
`
`1 UNIX
`
`is a trademark of AT&T Bell Laboratories.
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 2
`
`
`
`focus of this survey
`the primary
`Although
`is data compression methods of general
`utility,
`Section 2 includes examples
`from
`the literature
`in which
`ingenuity applied
`to
`domain-specific
`problems has yielded
`inter-
`esting coding techniques. These techniques
`are referred
`to as semantic dependent since
`they are designed
`to exploit
`the context
`and semantics of the data
`to achieve
`re-
`dundancy
`reduction. Semantic-dependent
`techniques
`include
`the use of quadtrees,
`run-length encoding, or difference mapping
`for storage and transmission
`of image data
`[Gonzalez and Wintz 1977; Samet 19841.
`General-purpose
`techniques, which as-
`sume no knowledge
`of
`the
`information
`content
`of
`the data, are described
`in
`Sections 3-5. These descriptions are suffi-
`ciently detailed
`to provide an understand-
`ing of the techniques. The reader will need
`to consult
`the references
`for implementa-
`tion details.
`In most cases only worst-case
`analyses of the methods are feasible. To
`provide a more realistic
`picture of their
`effectiveness, empirical data are presented
`in Section 6. The susceptibility
`to error of
`the algorithms surveyed is discussed in Sec-
`tion 7, and possible directions
`for
`future
`research are considered
`in Section 8.
`
`1. FUNDAMENTAL CONCEPTS
`
`theory
`to information
`introduction
`A brief
`is provided
`in this section. The definitions
`and assumptions
`necessary
`to a compre-
`hensive discussion and evaluation
`of data
`compression methods are discussed. The
`following
`string of characters
`is used to
`illustrate
`the concepts defined: EXAMPLE
`= “au bbb cccc ddddd eeeeee fffffffgggggggg”.
`
`1.1 Definitions
`
`A code is a mapping of source messages
`(words
`from
`the source alphabet a)
`into
`codewords (words of the code alphabet 0).
`The source messages are the basic units
`into which
`the string
`to be represented
`is
`partitioned. These basic units may be single
`symbols
`from the source alphabet, or they
`may be strings
`of symbols. For string
`cy = (a, b, c, d, e, f, g, space).
`EXAMPLE,
`For purposes of explanation,
`/3 is taken
`to
`
`Data Compression
`
`l
`
`263
`
`Source message
`
`Codeword
`
`a
`b
`ii
`
`7
`g
`space
`
`000
`001
`011 010
`
`100 101
`
`110
`111
`
`Figure 1. A block-block
`
`code for EXAMPLE.
`
`Source message
`
`Codeword
`
`;ib
`cccc
`ddddd
`eeeeee
`fffffff
`&%%Nmz!
`space
`
`0 1
`10
`11
`100
`101
`110
`111
`
`Figure 2. A variable-variable
`
`code for EXAMPLE.
`
`be (0,l). Codes can be categorized as block-
`block, block-variable,
`variable-block,
`or
`variable-variable,
`where block-block
`in-
`dicates
`that
`the source messages and
`codewords are of fixed length and variable-
`variable codes map variable-length
`source
`into variable-length
`codewords. A
`messages
`block-block
`code for EXAMPLE
`is shown
`in Figure 1, and a variable-variable
`code is
`given in Figure 2. If the string EXAMPLE
`were coded using
`the Figure 1 code, the
`length of the coded message would be 120;
`using Figure 2 the length would be 30.
`The oldest and most widely used codes,
`ASCII
`and EBCDIC,
`are examples
`of
`block-block
`codes, mapping an alphabet of
`64 (or 256) single characters onto 6-bit (or
`8-bit) codewords. These are not discussed,
`since
`they do not provide
`compression.
`The codes featured
`in
`this survey are of
`the block-variable,
`variable-variable,
`and
`variable-block
`types.
`length
`When source messages of variable
`are allowed,
`the question of how a mes-
`sage ensemble
`(sequence of messages)
`is
`parsed
`into
`individual
`messages arises.
`Many of the algorithms described here are
`defined-word
`schemes. That
`is, the set of
`source messages is determined before
`the
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 3
`
`
`
`264
`
`.
`
`D. A. Lelewer and D. S. Hirschberg
`
`scheme. For
`the coding
`of
`invocation
`text
`file processing,
`each
`in
`example,
`character may constitute
`a message, or
`messages may be defined
`to
`consist
`of alphanumeric
`and
`nonalphanumeric
`strings.
`In Pascal source code, each token
`may represent a message. All codes involv-
`ing
`fixed-length
`source messages are, by
`default, defined-word
`codes. In free-parse
`methods,
`the coding algorithm
`itself parses
`the
`ensemble
`into
`variable-length
`se-
`quences of symbols. Most of the known
`data compression methods are defined-
`word schemes; the free-parse model differs
`in a fundamental
`way
`from
`the classical
`coding paradigm.
`is
`if each codeword
`A code is distinct
`distinguishable
`from every other
`(i.e., the
`mapping
`from source messages to code-
`words
`is one to one). A distinct
`code is
`uniquely
`decodable
`if every codeword
`is
`identifiable when
`immersed
`in a sequence
`of codewords. Clearly, each of these
`fea-
`tures
`is desirable. The codes of Figures 1
`and 2 are both distinct,
`but
`the code of
`Figure 2 is not uniquely
`decodable. For
`example,
`the coded message 11 could be
`decoded as either
`“ddddd”
`or “bbbbbb”. A
`uniquely decodable code is a prefix code (or
`prefix-free
`code) if it has the prefix prop-
`erty, which
`requires
`that no codeword be a
`proper prefix of any other codeword. All
`uniquely
`decodable block-block
`and vari-
`able-block codes are prefix codes. The code
`(1, 100000, 00) is an ex-
`with codewords
`ample of a code that
`is uniquely decodable
`but that does not have the prefix property.
`Prefix codes are instantaneously
`decodable;
`that
`is, they have
`the desirable property
`that
`the coded message can be parsed into
`codewords without
`the need for lookahead.
`In order to decode a message encoded using
`the codeword set (1, 100000, 001, lookahead
`is required. For example,
`the first codeword
`of the message 1000000001
`is 1, but
`this
`cannot be determined until
`the last (tenth)
`symbol of the message is read (if the string
`of zeros had been of odd length,
`the first
`codeword would have been 100000).
`A minimal prefix code is a prefix code
`such that,
`if x is a proper prefix of some
`codeword,
`then xu is either a codeword or
`a proper prefix of a codeword
`for each letter
`
`(00, 01, 10) is
`u in ,B. The set of codewords
`an example of a prefix
`code that
`is not
`minimal. The fact that 1 is a proper prefix
`of the codeword 10 requires
`that 11 be
`either a codeword or a proper prefix of a
`codeword, and it is neither.
`Intuitively,
`the
`minimality
`constraint
`prevents
`the use of
`codewords
`that are longer
`than necessary.
`In the above example the codeword 10 could
`be replaced by the codeword 1, yielding a
`minimal
`prefix
`code with
`shorter
`code-
`words. The codes discussed
`in this paper
`are all minimal prefix codes.
`In this section a code has been defined
`to be a mapping
`from a source alphabet
`to a code alphabet; we now define
`related
`terms. The process of transforming
`a source
`ensemble into a coded message is coding or
`encoding. The encoded message may be re-
`ferred
`to as an encoding of the source en-
`semble. The algorithm
`that constructs
`the
`mapping and uses it to transform
`the source
`ensemble is called the encoder. The decoder
`performs
`the
`inverse operation,
`restoring
`the coded message to its original
`form.
`
`1.2 Classification
`
`of Methods
`
`schemes
`Not only are data compression
`categorized with
`respect
`to message and
`codeword
`lengths, but they are also classi-
`fied as either static or dynamic. A static
`method
`is one in which
`the mapping
`from
`the set of messages to the set of codewords
`is fixed before transmission
`begins, so that
`a given message is represented by the same
`codeword every time it appears in the mes-
`sage ensemble. The classic static defined-
`word scheme is Huffman
`coding
`[Huffman
`19521. In Huffman
`coding,
`the assignment
`of codewords
`to source messages is based
`on the probabilities with which
`the source
`messages appear in the message ensemble.
`Messages
`that appear
`frequently
`are rep-
`resented by short codewords; messages with
`smaller probabilities map to longer code-
`words. These probabilities
`are determined
`before
`transmission
`begins. A Huffman
`code for the ensemble EXAMPLE
`is given
`in Figure 3. If EXAMPLE
`were coded using
`this Huffman mapping,
`the
`length of the
`coded message would be 117. Static Huff-
`man coding
`is discussed
`in Section 3.2;
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 4
`
`
`
`Source message
`
`Probability
`
`Codeword
`
`Source message
`
`Probability
`
`Codeword
`
`Data Compression
`
`l
`
`265
`
`;
`
`i
`e
`f
`g
`space
`
`
`
`2140 3140
`
`4140 5/40
`
`6140
`7/40
`a/40
`5140
`
`1001
`1000
`011
`010
`111
`110
`00
`101
`
`code for the message EXAM-
`Figure 3. A Huffman
`PLE (code length = 117).
`
`in Sec-
`
`other static schemes are discussed
`tions 2 and 3.
`from
`if the mapping
`A code is dynamic
`the set of messages to the set of codewords
`changes over time. For example, dynamic
`Huffman coding involves computing an ap-
`proximation
`to the probabilities
`of occur-
`rence “on the fly,” as the ensemble is being
`transmitted.
`The assignment of codewords
`to messages is based on the values of the
`relative
`frequencies of occurrence at each
`point
`in time. A message x may be repre-
`sented by a short codeword early
`in
`the
`transmission
`because it occurs
`frequently
`at
`the beginning
`of the ensemble, even
`though
`its probability
`of occurrence over
`the total ensemble
`is low. Later, when the
`more probable messages begin to occur with
`higher
`frequency,
`the short codeword will
`be mapped to one of the higher probability
`messages, and x will be mapped to a longer
`codeword. As an
`illustration,
`Figure
`4
`presents a dynamic Huffman
`code table
`corresponding
`to
`the prefix
`“aa bbb ” of
`EXAMPLE.
`Although
`the
`frequency
`of
`space over
`the entire message is greater
`than
`that of b, at this point b has higher
`frequency and therefore
`is mapped
`to the
`shorter codeword.
`to in the
`Dynamic codes are also referred
`literature as adaptive,
`in that they adapt to
`changes
`in ensemble characteristics
`over
`time, The
`term adaptive
`is used for
`the
`remainder of this paper; the fact that these
`codes adapt
`to changing characteristics
`is
`the source of their appeal. Some adaptive
`methods adapt to changing patterns
`in the
`source
`[Welch 19841, whereas others ex-
`of reference
`[Bentley
`et al.
`ploit
`locality
`19861. Locality of reference is the tendency,
`
`a
`b
`space
`
`216
`3/6
`l/6
`
`10
`0
`11
`
`code
`Figure 4. A dynamic Huffman
`“an bbb” of message EXAMPLE.
`prefix
`
`table
`
`for
`
`the
`
`in a wide variety of text types, for
`common
`a particular word
`to occur
`frequently
`for
`short periods of time and
`then
`fall
`into
`disuse for long periods.
`All of the adaptive methods are one-pass
`methods; only one scan of the ensemble
`is
`required. Static Huffman
`coding
`requires
`two passes: one pass to compute probabili-
`ties and determine
`the mapping, and a sec-
`ond pass for transmission. Thus, as long as
`the encoding and decoding
`times of an
`adaptive method
`are not
`substantially
`greater
`than
`those of a static method,
`the
`fact that an initial
`scan is not needed im-
`plies a speed improvement
`in the adaptive
`case. In addition,
`the mapping determined
`in ,the first pass of a static coding scheme
`must be transmitted
`by the encoder to the
`decoder. The mapping may preface each
`transmission
`(i.e., each file sent), or a single
`mapping may be agreed upon and used for
`multiple
`transmissions.
`In one-pass meth-
`ods the encoder defines and redefines
`the
`mapping dynamically
`during
`transmission.
`The decoder must define and redefine
`the
`mapping
`in sympathy,
`in essence “learn-
`ing”
`the mapping as codewords are re-
`ceived. Adaptive methods are discussed
`in
`Sections 4 and 5.
`An algorithm may also be a hybrid,
`neither
`completely
`static nor completely
`dynamic.
`In a simple hybrid scheme, sender
`and receiver maintain
`identical
`codebooks
`containing
`lz static codes. For each trans-
`mission,
`the sender must choose one of the
`k previously agreed upon codes and inform
`the receiver of the choice (by transmitting
`first
`the “name” or number of the chosen
`code). Hybrid methods are discussed
`fur-
`ther in Sections 2 and 3.2.
`
`1.3 A Data Compression Model
`
`the relative merits of
`to discuss
`In order
`data compression
`techniques, a framework
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 5
`
`
`
`266
`
`.
`
`D. A. Lelewer and D. S. Hirschberg
`
`for comparison must be established. There
`are two dimensions
`along which each of
`the schemes discussed here may be mea-
`sured: algorithm complexity and amount of
`compression. When data compression
`is
`used in a data transmission
`application,
`the
`goal is speed. Speed of transmission
`de-
`pends on the number of bits sent, the time
`required
`for
`the encoder
`to generate
`the
`coded message, and the time required
`for
`the decoder to recover the original ensem-
`ble. In a data storage application,
`although
`the degree of compression
`is the primary
`concern,
`it
`is nonetheless
`necessary
`that
`the algorithm
`be efficient
`in order
`for the
`scheme to be practical. For a static scheme,
`there are three algorithms
`to analyze:
`the
`map construction
`algorithm,
`the encoding
`algorithm, and the decoding algorithm. For
`a dynamic scheme, there are just two algo-
`rithms:
`the encoding algorithm
`and
`the
`decoding algorithm.
`Several common measures of compres-
`sion have been suggested:
`redundancy
`[Shannon and Weaver 19491, average mes-
`sage length
`[Huffman
`19521, and compres-
`sion ratio
`[Rubin 1976; Ruth and Kreutzer
`19721. These measures are defined below.
`Related
`to each of
`these measures are
`about
`the
`characteristics
`assumptions
`of the source.
`It
`is generally assumed
`in
`information
`theory
`that
`all
`statistical
`parameters of a message source are known
`with perfect accuracy
`[Gilbert
`19711. The
`most common model
`is that of a discrete
`memoryless source; a source whose output
`is a sequence of letters
`(or messages), each
`letter being a selection
`from some fixed
`- - - , a,,. The
`letters are taken
`alphabet al,
`to be random, statistically
`independent
`se-
`lections
`from
`the alphabet,
`the selection
`being made according
`to some fixed prob-
`ability assignment p(al),
`. . ., p(a,)
`[Gal-
`lager 19681. Without
`loss of generality,
`the
`code alphabet
`is assumed
`to be
`(0, 1)
`throughout
`this paper. The modifications
`necessary
`for
`larger code alphabets
`are
`straightforward.
`that any cost associated
`We assume
`with
`the code letters
`is uniform. This
`is a
`reasonable assumption,
`although
`it omits
`applications
`like
`telegraphy
`where
`the
`
`durations.
`code symbols are of different
`The assumption
`is also
`important,
`since
`the
`problem
`of
`constructing
`optimal
`codes over unequal
`code
`letter
`costs
`is
`a significantly
`different
`and more diffi-
`cult problem. Perl et al. [1975] and Varn
`[1971] have developed algorithms
`for min-
`imum-redundancy
`prefix coding in the case
`of arbitrary
`symbol cost and equal code-
`word probability.
`The assumption of equal
`probabilities mitigates
`the difficulty
`pre-
`sented by the variable symbol cost. For the
`more general unequal
`letter costs and un-
`equal probabilities model, Karp
`[1961] has
`proposed an
`integer
`linear programming
`approach. There have been several approx-
`imation algorithms proposed
`for this more
`difficult
`problem
`[Krause 1962; Cot 1977;
`Mehlhorn
`19801.
`the goal is
`When data are compressed,
`reduce
`redundancy,
`leaving only
`the
`to
`informational
`content.
`The measure of
`information
`of a source message oi (in
`bits)
`is -log p(ai), where
`log denotes
`the
`base 2 logarithm
`and p(ai) denotes
`the
`probability
`of occurrence of the message
`oi.2 This definition
`has intuitive
`appeal; in
`the case in which p(ai) = 1, it is clear that
`ai is not at all informative
`since it had to
`occur. Similarly,
`the smaller
`the value of
`p(ai),
`the more unlikely ai is to appear, and
`hence
`the
`larger
`its
`information
`content.
`The reader is referred
`to Abramson
`[1963,
`pp. 6-131 for a longer, more elegant discus-
`sion of
`the
`legitimacy
`of
`this
`technical
`definition
`of the concept of information.
`The average information
`content over the
`source alphabet
`can be computed
`by
`weighting
`the information
`content of each
`source
`letter by
`its probability
`of occur-
`yielding
`the
`expression
`xi”=1
`rence,
`[-p(oi)log
`p(ai)]. This quantity
`is referred
`to as the entropy of a source letter or the
`entropy of the source and is denoted by H.
`Since the length of a codeword
`for message
`oi must be sufficient
`to carry
`the informa-
`tion content of oi, entropy
`imposes a lower
`bound on the number of bits required
`for
`the coded message. The
`total number of
`
`throughout
`that
`* Note
`to the base 2.
`
`this paper all logarithms
`
`are
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 6
`
`
`
`bits must be at least as large as the product
`of H and the length of the source ensemble.
`Since
`the value of H is generally not an
`integer, variable-length
`codewords must be
`used if the lower bound
`is to be achieved.
`that message EXAMPLE
`Given
`is to be
`encoded one letter at a time, the entropy of
`its source can be calculated using the prob-
`abilities given
`in Figure 3: H = 2.894, so
`that
`the minimum
`number of bits con-
`in an encoding of EXAMPLE
`tained
`is 116.
`The Huffman code given in Section 1.2 does
`not quite achieve the theoretical minimum
`in this case.
`of information
`Both of these definitions
`[ 19491. A der-
`content are due to Shannon
`ivation of the concept of entropy as it re-
`lates to information
`theory
`is presented by
`Shannon
`[1949]. A simpler, more intuitive
`explanation
`of entropy
`is offered by Ash
`[1965].
`The most common notion of a “good”
`is optimal in the sense of
`code is one that
`redundancy. Redundancy
`having minimum
`can be defined as c p(ai)Zi - 2 [-p(ai)log
`p(oi)], where li is the length of the codeword
`representing message oi. The expression
`z p(ai)Zi represents
`the lengths of the code-
`words weighted by
`their probabilities
`of
`occurrence,
`that
`is, the average codeword
`length. The expression z [-p(ai)log
`p(oi)]
`is entropy H. Thus,
`redundancy
`is a mea-
`sure of
`the difference
`between average
`codeword
`length and average information
`content.
`If a code has minimum
`average
`codeword
`length
`for a given discrete prob-
`it is said to be a mini-
`ability distribution,
`mum redundancy code.
`local redundancy to
`We define
`the term
`capture
`the notion of redundancy
`caused
`by local properties of a message ensemble,
`rather
`than
`its global characteristics.
`Al-
`though
`the model used
`for analyzing
`general-purpose coding techniques assumes
`a random distribution
`of the source mes-
`sages, this may not actually be the case. In
`particular
`applications
`the
`tendency
`for
`messages to cluster
`in predictable patterns
`may be known. The existence of predictable
`patterns may be exploited
`to minimize
`local
`redundancy. Examples of applications
`in
`which
`local
`redundancy
`is common and
`
`Data Compression
`
`l
`
`267
`
`it
`
`local redundancy
`for dealing with
`methods
`are discussed in Sections 2 and 6.2.
`uses average message length,
`Huffman
`2 p(ai)Zi, as a measure of the efficiency of
`a code. Clearly,
`the meaning of this term is
`the average length of a coded message. We
`term average codeword length to
`use the
`represent
`this quantity. Since redundancy
`is defined
`to be average codeword
`length
`minus entropy
`and entropy
`is constant
`for a given probability
`distribution, mini-
`mizing average codeword
`length minimizes
`redundancy.
`is asymptotically optimal
`A code
`if
`has the property
`that for a given probability
`distribution,
`the ratio of average codeword
`length
`to entropy approaches 1 as entropy
`tends to infinity.
`That
`is, asymptotic opti-
`mality guarantees
`that average codeword
`length approaches
`the
`theoretical mini-
`mum
`(entropy
`represents
`information
`content, which
`imposes a lower bound on
`codeword
`length).
`The amount of compression yielded by a
`coding
`scheme can be measured by a
`compression ratio. Compression
`ratio has
`been defined
`in several ways. The defini-
`tion C = (average message length)/(average
`codeword
`length)
`captures
`the common
`meaning, which
`is a comparison
`of
`the
`length of the coded message to the length
`of the original ensemble
`[Cappellini
`19851.
`If we think of the characters of the ensem-
`ble EXAMPLE as 6-bit ASCII characters,
`then
`the average message length
`is 6 bits.
`The Huffman
`code of Section 1.2 repre-
`sents EXAMPLE
`in 117 bits, or 2.9 bits
`per character. This yields a compression
`ratio of 612.9, representing
`compression by
`a factor of more than 2. Alternatively,
`we
`may say that Huffman
`encoding produces
`a file whose size is 49% of the original
`ASCII
`file, or that 49% compression has
`been achieved.
`of
`definition
`different
`A
`somewhat
`[1976], C =
`compression
`ratio by Rubin
`(S - 0 - OR)/S, includes
`the representa-
`tion of the code itself
`in the transmission
`cost.
`In
`this definition,
`S represents
`the
`length of the source ensemble, 0 the length
`(coded message), and OR the
`of the output
`size of the “output
`representation”
`(e.g., the
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 7
`
`
`
`260
`
`l
`
`D. A. Lelewer and D. S. Hirschberg
`
`number of bits required for the encoder to
`transmit the code mapping to the decoder).
`The quantity OR constitutes a “charge* to
`an algorithm
`for transmission of informa-
`tion about the coding scheme. The inten-
`tion
`is to measure the total size of the
`transmission (or file to be stored).
`
`1.4 Motivation
`data
`Introduction,
`the
`As discussed in
`compression has wide application
`in terms
`of information storage, including represen-
`tation of the abstract data type string
`[Standish 19801 and
`file compression.
`Huffman coding is used for compression in
`several file archival systems [ARC 1986;
`PKARC 19871, as is Lempel-Ziv coding,
`one of the adaptive schemes to be discussed
`in Section 5. An adaptive Huffman coding
`technique is the basis for the compact com-
`mand of the UNIX operating system, and
`the UNIX compress utility employs the
`Lempel-Ziv approach [UNIX 19841.
`In the area of data transmission, Huff-
`man coding has been passed over for years
`in
`favor of block-block
`codes, notably
`ASCII. The advantage of Huffman coding
`is in the average number of bits per char-
`acter transmitted, which may be much
`smaller than the log n bits per character
`(where n is the source alphabet size) of a
`block-block system. The primary difficulty
`associated with variable-length codewords
`is that the rate at which bits are presented
`to the transmission channel will fluctuate,
`depending on the relative frequencies of the
`source messages. This fluctuation
`requires
`buffering between the source and the chan-
`nel. Advances
`in
`technology have both
`overcome this difficulty and contributed
`to
`the appeal of variable-length codes. Cur-
`rent data networks allocate communication
`resources to sources on the basis of need
`and provide buffering as part of the system.
`These systems require significant amounts
`of protocol, and fixed-length codes are quite
`inefficient
`for applications such as packet
`headers. In addition, communication costs
`are beginning
`to dominate storage and
`processing costs, so that variable-length
`coding schemes that reduce communication
`costs are attractive even if they are more
`
`complex. For these reasons, one could ex-
`pect to see even greater use of variable-
`length coding in the future.
`It is interesting to note that the Huffman
`coding algorithm, originally developed for
`the efficient transmission of data, also has
`a wide variety of applications outside the
`sphere of data compression. These include
`construction of optimal search trees [Zim-
`merman 1959; Hu and Tucker 1971; Itai
`19761, list merging [Brent and Kung 19781,
`and generating optimal evaluation trees in
`the compilation of expressions
`[Parker
`19801. Additional
`applications
`involve
`search for jumps in a monotone function of
`a single variable, sources of pollution along
`a river, and leaks in a pipeline [Glassey and
`Karp 19761. The fact that this elegant com-
`binatorial
`algorithm has
`influenced
`so
`many diverse areas underscores its impor-
`tance.
`
`METHODS
`2. SEMANTIC-DEPENDENT
`data
`Semantic-dependent
`compression
`techniques are designed to respond to spe-
`cific types of local redundancy occurring in
`certain applications. One area in which
`data compression is of great importance
`is image representation and processing.
`There are two major reasons for this. The
`first is that digitized images contain a large
`amount of local redundancy. An image is
`usually captured in the form of an array of
`pixels, whereas methods that exploit
`the
`tendency for pixels of like color