throbber
Data Compression
`
`DEBRA A. LELEWER and DANIEL S. HIRSCHBERG
`
`Department
`
`of Information
`
`and Computer Science, University
`
`of California,
`
`Irvine, California 92717
`
`This paper surveys a variety of data compression methods spanning almost 40 years of
`research,
`from
`the work of Shannon, Fano, and Huffman
`in the late 1940s to a technique
`developed
`in 1986. The aim of data compression
`is to reduce redundancy
`in stored or
`communicated
`data, thus increasing effective data density. Data compression has
`important
`application
`in the areas of file storage and distributed
`systems. Concepts
`information
`theory as they relate
`to the goals and evaluation of data compression
`methods are discussed briefly. A framework
`for evaluation
`and comparison of methods
`constructed and applied
`to the algorithms presented. Comparisons of both theoretical
`empirical natures are reported, and possibilities
`for future
`research are suggested
`
`from
`
`is
`and
`
`Categories and Subject Descriptors: E.4 [Data]: Coding and Information
`compaction and compression
`
`Theory-data
`
`General Terms: Algorithms,
`
`Theory
`
`Additional Key Words and Phrases: Adaptive coding, adaptive Huffman
`coding theory,
`tile compression, Huffman
`codes, minimum-redundancy
`codes, prefix codes, text compression
`
`codes, coding,
`codes, optimal
`
`INTRODUCTION
`
`Data compression is often referred to as
`coding, where coding is a general term en-
`compassing any special representation of
`data that satisfies a given need. Informa-
`tion
`theory
`is defined as the study of
`efficient coding and its consequences in
`the
`form of speed of transmission and
`probability of error [ Ingels 19711. Data com-
`pression may be viewed as a branch of
`information
`theory
`in which the primary
`objective is to minimize the amount of data
`to be transmitted. The purpose of this pa-
`per is to present and analyze a variety of
`data compression algorithms.
`of data
`A simple
`characterization
`compression is that it involves transform-
`ing a string of characters in some represen-
`tation (such as ASCII)
`into a new string
`(e.g., of bits) that contains the same infor-
`
`is as small as
`mation but whose length
`possible. Data compression has important
`application
`in the areas of data transmis-
`sion and data storage. Many data process-
`ing applications
`require storage of large
`volumes of data, and the number of such
`applications is constantly increasing as the
`use of computers extends to new disci-
`plines. At the same time, the proliferation
`of computer communication networks
`is
`resulting
`in massive transfer of data over
`communication
`links. Compressing data to
`be stored or transmitted
`reduces storage
`and/or communication costs. When
`the
`amount of data to be transmitted
`is re-
`duced, the effect is that of increasing the
`capacity of the communication channel.
`Similarly, compressing a file to half of its
`original size is equivalent
`to doubling the
`capacity of the storage medium. It may then
`become feasible to store the data at a
`
`the copies are not made or
`that
`is granted provided
`fee all or part of this material
`to copy without
`Permission
`for direct commercial advantage,
`the ACM copyright notice and the title of the publication
`and its
`distributed
`date appear, and notice
`is given that copying
`is by permission of the Association
`for Computing Machinery.
`To
`copy otherwise, or to republish,
`requires a fee and/or specific permission.
`0 1966 ACM 0360-0300/87/0900-0261$1.50
`
`ACM Computing SUWSYS, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 1
`
`

`

`262
`
`.
`
`D. A. Lelewer and D. S. Hirschberg
`
`CONTENTS
`
`been reported to reduce a file to anywhere
`from 12.1 to 73.5% of its original size [Wit-
`ten et al. 1987 1. Cormack reports that data
`compression programs based on Huffman
`coding (Section 3.2) reduced the size of a
`large student-record database by 42.1%
`when only some of the information was
`compressed. As a consequence of this size
`reduction,
`the number of disk operations
`required to load the database was reduced
`by 32.7%
`[Cormack 19851. Data com-
`pression routines developed with specific
`applications
`in mind have achieved com-
`pression factors as high as 98% [Severance
`19831.
`Although coding for purposes of data se-
`curity (cryptography) and codes that guar-
`antee a certain level of data integrity (error
`detection/correction)
`are topics worthy of
`attention,
`they do not
`fall under
`the
`umbrella of data compression. With
`the
`exception of a brief discussion of the sus-
`ceptibility
`to error of the methods surveyed
`(Section 7), a discrete noiseless channel is
`assumed. That is, we assume a system in
`which a sequence of symbols chosen from
`a finite alphabet can be transmitted
`from
`one point to another without the possibility
`of error. Of course, the coding schemes
`described here may be combined with data
`security or error-correcting codes.
`Much of the available literature on data
`compression approaches the topic from the
`point of view of data transmission. As noted
`earlier, data compression is of value in data
`storage as well. Although this discussion is
`framed in the terminology of data trans-
`mission, compression and decompression
`of data files are essentially the same tasks
`as sending and receiving data over a com-
`munication
`channel. The
`focus of this
`paper is on algorithms
`for data compres-
`sion; it does not deal with hardware aspects
`of data transmission. The reader is referred
`to Cappellini
`[1985] for a discussion of
`techniques with natural hardware imple-
`mentation.
`Background concepts in the form of ter-
`minology and a model for the study of data
`compression are provided in Section 1. Ap-
`plications of data compression are also dis-
`cussed in Section 1 to provide motivation
`for the material that follows.
`
`METHODS
`SCHEMES
`
`of the
`
`CODING
`
`CONCEPTS
`
`INTRODUCTION
`1. FUNDAMENTAL
`1.1 Definitions
`of Methods
`1.2 Classification
`1.3 A Data Compression Model
`1.4 Motivation
`DEPENDENT
`2. SEMANTIC
`3. STATIC DEFINED-WORD
`3.1 Shannon-Fan0
`Code
`3.2 Static Huffman Coding
`3.3 Universal Codes and Representations
`Integers
`Coding
`3.4 Arithmetic
`HUFFMAN
`4. ADAPTIVE
`FGK
`4.1 Algorithm
`V
`4.2 Algorithm
`METHODS
`5. OTHER ADAPTIVE
`5.1 Lempel-Ziv
`Codes
`5.2 Algorithm
`BSTW
`6. EMPIRICAL
`RESULTS
`7. SUSCEPTIBILITY
`TO ERROR
`7.1 Static Codes
`7.2 Adaptive Codes
`8. NEW DIRECTIONS
`9. SUMMARY
`REFERENCES
`
`higher, thus faster, level of the storage hi-
`erarchy and reduce the load on the input/
`output channels of the computer system.
`Many of the methods discussed in this
`paper are
`implemented
`in production
`systems. The UNIX1 utilities
`compact
`and compress are based on methods dis-
`cussed in Sections 4 and 5, respectively
`[UNIX 19841. Popular file archival systems
`such as ARC and PKARC use techniques
`presented in Sections 3 and 5 [ARC 1986;
`PKARC 19871. The savings achieved by
`data compression can be dramatic; reduc-
`tion as high as 80% is not uncommon
`[Reghbati 19811. Typical values of com-
`pression provided by compact are text
`(38%), Pascal source (43%), C source
`(36%), and binary (19%). Compress gener-
`ally achieves better compression (50-60%
`for text such as source code and English)
`and takes less time
`to compute
`[UNIX
`19841. Arithmetic coding (Section 3.4) has
`
`1 UNIX
`
`is a trademark of AT&T Bell Laboratories.
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 2
`
`

`

`focus of this survey
`the primary
`Although
`is data compression methods of general
`utility,
`Section 2 includes examples
`from
`the literature
`in which
`ingenuity applied
`to
`domain-specific
`problems has yielded
`inter-
`esting coding techniques. These techniques
`are referred
`to as semantic dependent since
`they are designed
`to exploit
`the context
`and semantics of the data
`to achieve
`re-
`dundancy
`reduction. Semantic-dependent
`techniques
`include
`the use of quadtrees,
`run-length encoding, or difference mapping
`for storage and transmission
`of image data
`[Gonzalez and Wintz 1977; Samet 19841.
`General-purpose
`techniques, which as-
`sume no knowledge
`of
`the
`information
`content
`of
`the data, are described
`in
`Sections 3-5. These descriptions are suffi-
`ciently detailed
`to provide an understand-
`ing of the techniques. The reader will need
`to consult
`the references
`for implementa-
`tion details.
`In most cases only worst-case
`analyses of the methods are feasible. To
`provide a more realistic
`picture of their
`effectiveness, empirical data are presented
`in Section 6. The susceptibility
`to error of
`the algorithms surveyed is discussed in Sec-
`tion 7, and possible directions
`for
`future
`research are considered
`in Section 8.
`
`1. FUNDAMENTAL CONCEPTS
`
`theory
`to information
`introduction
`A brief
`is provided
`in this section. The definitions
`and assumptions
`necessary
`to a compre-
`hensive discussion and evaluation
`of data
`compression methods are discussed. The
`following
`string of characters
`is used to
`illustrate
`the concepts defined: EXAMPLE
`= “au bbb cccc ddddd eeeeee fffffffgggggggg”.
`
`1.1 Definitions
`
`A code is a mapping of source messages
`(words
`from
`the source alphabet a)
`into
`codewords (words of the code alphabet 0).
`The source messages are the basic units
`into which
`the string
`to be represented
`is
`partitioned. These basic units may be single
`symbols
`from the source alphabet, or they
`may be strings
`of symbols. For string
`cy = (a, b, c, d, e, f, g, space).
`EXAMPLE,
`For purposes of explanation,
`/3 is taken
`to
`
`Data Compression
`
`l
`
`263
`
`Source message
`
`Codeword
`
`a
`b
`ii
`
`7
`g
`space
`
`000
`001
`011 010
`
`100 101
`
`110
`111
`
`Figure 1. A block-block
`
`code for EXAMPLE.
`
`Source message
`
`Codeword
`
`;ib
`cccc
`ddddd
`eeeeee
`fffffff
`&%%Nmz!
`space
`
`0 1
`10
`11
`100
`101
`110
`111
`
`Figure 2. A variable-variable
`
`code for EXAMPLE.
`
`be (0,l). Codes can be categorized as block-
`block, block-variable,
`variable-block,
`or
`variable-variable,
`where block-block
`in-
`dicates
`that
`the source messages and
`codewords are of fixed length and variable-
`variable codes map variable-length
`source
`into variable-length
`codewords. A
`messages
`block-block
`code for EXAMPLE
`is shown
`in Figure 1, and a variable-variable
`code is
`given in Figure 2. If the string EXAMPLE
`were coded using
`the Figure 1 code, the
`length of the coded message would be 120;
`using Figure 2 the length would be 30.
`The oldest and most widely used codes,
`ASCII
`and EBCDIC,
`are examples
`of
`block-block
`codes, mapping an alphabet of
`64 (or 256) single characters onto 6-bit (or
`8-bit) codewords. These are not discussed,
`since
`they do not provide
`compression.
`The codes featured
`in
`this survey are of
`the block-variable,
`variable-variable,
`and
`variable-block
`types.
`length
`When source messages of variable
`are allowed,
`the question of how a mes-
`sage ensemble
`(sequence of messages)
`is
`parsed
`into
`individual
`messages arises.
`Many of the algorithms described here are
`defined-word
`schemes. That
`is, the set of
`source messages is determined before
`the
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 3
`
`

`

`264
`
`.
`
`D. A. Lelewer and D. S. Hirschberg
`
`scheme. For
`the coding
`of
`invocation
`text
`file processing,
`each
`in
`example,
`character may constitute
`a message, or
`messages may be defined
`to
`consist
`of alphanumeric
`and
`nonalphanumeric
`strings.
`In Pascal source code, each token
`may represent a message. All codes involv-
`ing
`fixed-length
`source messages are, by
`default, defined-word
`codes. In free-parse
`methods,
`the coding algorithm
`itself parses
`the
`ensemble
`into
`variable-length
`se-
`quences of symbols. Most of the known
`data compression methods are defined-
`word schemes; the free-parse model differs
`in a fundamental
`way
`from
`the classical
`coding paradigm.
`is
`if each codeword
`A code is distinct
`distinguishable
`from every other
`(i.e., the
`mapping
`from source messages to code-
`words
`is one to one). A distinct
`code is
`uniquely
`decodable
`if every codeword
`is
`identifiable when
`immersed
`in a sequence
`of codewords. Clearly, each of these
`fea-
`tures
`is desirable. The codes of Figures 1
`and 2 are both distinct,
`but
`the code of
`Figure 2 is not uniquely
`decodable. For
`example,
`the coded message 11 could be
`decoded as either
`“ddddd”
`or “bbbbbb”. A
`uniquely decodable code is a prefix code (or
`prefix-free
`code) if it has the prefix prop-
`erty, which
`requires
`that no codeword be a
`proper prefix of any other codeword. All
`uniquely
`decodable block-block
`and vari-
`able-block codes are prefix codes. The code
`(1, 100000, 00) is an ex-
`with codewords
`ample of a code that
`is uniquely decodable
`but that does not have the prefix property.
`Prefix codes are instantaneously
`decodable;
`that
`is, they have
`the desirable property
`that
`the coded message can be parsed into
`codewords without
`the need for lookahead.
`In order to decode a message encoded using
`the codeword set (1, 100000, 001, lookahead
`is required. For example,
`the first codeword
`of the message 1000000001
`is 1, but
`this
`cannot be determined until
`the last (tenth)
`symbol of the message is read (if the string
`of zeros had been of odd length,
`the first
`codeword would have been 100000).
`A minimal prefix code is a prefix code
`such that,
`if x is a proper prefix of some
`codeword,
`then xu is either a codeword or
`a proper prefix of a codeword
`for each letter
`
`(00, 01, 10) is
`u in ,B. The set of codewords
`an example of a prefix
`code that
`is not
`minimal. The fact that 1 is a proper prefix
`of the codeword 10 requires
`that 11 be
`either a codeword or a proper prefix of a
`codeword, and it is neither.
`Intuitively,
`the
`minimality
`constraint
`prevents
`the use of
`codewords
`that are longer
`than necessary.
`In the above example the codeword 10 could
`be replaced by the codeword 1, yielding a
`minimal
`prefix
`code with
`shorter
`code-
`words. The codes discussed
`in this paper
`are all minimal prefix codes.
`In this section a code has been defined
`to be a mapping
`from a source alphabet
`to a code alphabet; we now define
`related
`terms. The process of transforming
`a source
`ensemble into a coded message is coding or
`encoding. The encoded message may be re-
`ferred
`to as an encoding of the source en-
`semble. The algorithm
`that constructs
`the
`mapping and uses it to transform
`the source
`ensemble is called the encoder. The decoder
`performs
`the
`inverse operation,
`restoring
`the coded message to its original
`form.
`
`1.2 Classification
`
`of Methods
`
`schemes
`Not only are data compression
`categorized with
`respect
`to message and
`codeword
`lengths, but they are also classi-
`fied as either static or dynamic. A static
`method
`is one in which
`the mapping
`from
`the set of messages to the set of codewords
`is fixed before transmission
`begins, so that
`a given message is represented by the same
`codeword every time it appears in the mes-
`sage ensemble. The classic static defined-
`word scheme is Huffman
`coding
`[Huffman
`19521. In Huffman
`coding,
`the assignment
`of codewords
`to source messages is based
`on the probabilities with which
`the source
`messages appear in the message ensemble.
`Messages
`that appear
`frequently
`are rep-
`resented by short codewords; messages with
`smaller probabilities map to longer code-
`words. These probabilities
`are determined
`before
`transmission
`begins. A Huffman
`code for the ensemble EXAMPLE
`is given
`in Figure 3. If EXAMPLE
`were coded using
`this Huffman mapping,
`the
`length of the
`coded message would be 117. Static Huff-
`man coding
`is discussed
`in Section 3.2;
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 4
`
`

`

`Source message
`
`Probability
`
`Codeword
`
`Source message
`
`Probability
`
`Codeword
`
`Data Compression
`
`l
`
`265
`
`;
`
`i
`e
`f
`g
`space
`
`
`
`2140 3140
`
`4140 5/40
`
`6140
`7/40
`a/40
`5140
`
`1001
`1000
`011
`010
`111
`110
`00
`101
`
`code for the message EXAM-
`Figure 3. A Huffman
`PLE (code length = 117).
`
`in Sec-
`
`other static schemes are discussed
`tions 2 and 3.
`from
`if the mapping
`A code is dynamic
`the set of messages to the set of codewords
`changes over time. For example, dynamic
`Huffman coding involves computing an ap-
`proximation
`to the probabilities
`of occur-
`rence “on the fly,” as the ensemble is being
`transmitted.
`The assignment of codewords
`to messages is based on the values of the
`relative
`frequencies of occurrence at each
`point
`in time. A message x may be repre-
`sented by a short codeword early
`in
`the
`transmission
`because it occurs
`frequently
`at
`the beginning
`of the ensemble, even
`though
`its probability
`of occurrence over
`the total ensemble
`is low. Later, when the
`more probable messages begin to occur with
`higher
`frequency,
`the short codeword will
`be mapped to one of the higher probability
`messages, and x will be mapped to a longer
`codeword. As an
`illustration,
`Figure
`4
`presents a dynamic Huffman
`code table
`corresponding
`to
`the prefix
`“aa bbb ” of
`EXAMPLE.
`Although
`the
`frequency
`of
`space over
`the entire message is greater
`than
`that of b, at this point b has higher
`frequency and therefore
`is mapped
`to the
`shorter codeword.
`to in the
`Dynamic codes are also referred
`literature as adaptive,
`in that they adapt to
`changes
`in ensemble characteristics
`over
`time, The
`term adaptive
`is used for
`the
`remainder of this paper; the fact that these
`codes adapt
`to changing characteristics
`is
`the source of their appeal. Some adaptive
`methods adapt to changing patterns
`in the
`source
`[Welch 19841, whereas others ex-
`of reference
`[Bentley
`et al.
`ploit
`locality
`19861. Locality of reference is the tendency,
`
`a
`b
`space
`
`216
`3/6
`l/6
`
`10
`0
`11
`
`code
`Figure 4. A dynamic Huffman
`“an bbb” of message EXAMPLE.
`prefix
`
`table
`
`for
`
`the
`
`in a wide variety of text types, for
`common
`a particular word
`to occur
`frequently
`for
`short periods of time and
`then
`fall
`into
`disuse for long periods.
`All of the adaptive methods are one-pass
`methods; only one scan of the ensemble
`is
`required. Static Huffman
`coding
`requires
`two passes: one pass to compute probabili-
`ties and determine
`the mapping, and a sec-
`ond pass for transmission. Thus, as long as
`the encoding and decoding
`times of an
`adaptive method
`are not
`substantially
`greater
`than
`those of a static method,
`the
`fact that an initial
`scan is not needed im-
`plies a speed improvement
`in the adaptive
`case. In addition,
`the mapping determined
`in ,the first pass of a static coding scheme
`must be transmitted
`by the encoder to the
`decoder. The mapping may preface each
`transmission
`(i.e., each file sent), or a single
`mapping may be agreed upon and used for
`multiple
`transmissions.
`In one-pass meth-
`ods the encoder defines and redefines
`the
`mapping dynamically
`during
`transmission.
`The decoder must define and redefine
`the
`mapping
`in sympathy,
`in essence “learn-
`ing”
`the mapping as codewords are re-
`ceived. Adaptive methods are discussed
`in
`Sections 4 and 5.
`An algorithm may also be a hybrid,
`neither
`completely
`static nor completely
`dynamic.
`In a simple hybrid scheme, sender
`and receiver maintain
`identical
`codebooks
`containing
`lz static codes. For each trans-
`mission,
`the sender must choose one of the
`k previously agreed upon codes and inform
`the receiver of the choice (by transmitting
`first
`the “name” or number of the chosen
`code). Hybrid methods are discussed
`fur-
`ther in Sections 2 and 3.2.
`
`1.3 A Data Compression Model
`
`the relative merits of
`to discuss
`In order
`data compression
`techniques, a framework
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 5
`
`

`

`266
`
`.
`
`D. A. Lelewer and D. S. Hirschberg
`
`for comparison must be established. There
`are two dimensions
`along which each of
`the schemes discussed here may be mea-
`sured: algorithm complexity and amount of
`compression. When data compression
`is
`used in a data transmission
`application,
`the
`goal is speed. Speed of transmission
`de-
`pends on the number of bits sent, the time
`required
`for
`the encoder
`to generate
`the
`coded message, and the time required
`for
`the decoder to recover the original ensem-
`ble. In a data storage application,
`although
`the degree of compression
`is the primary
`concern,
`it
`is nonetheless
`necessary
`that
`the algorithm
`be efficient
`in order
`for the
`scheme to be practical. For a static scheme,
`there are three algorithms
`to analyze:
`the
`map construction
`algorithm,
`the encoding
`algorithm, and the decoding algorithm. For
`a dynamic scheme, there are just two algo-
`rithms:
`the encoding algorithm
`and
`the
`decoding algorithm.
`Several common measures of compres-
`sion have been suggested:
`redundancy
`[Shannon and Weaver 19491, average mes-
`sage length
`[Huffman
`19521, and compres-
`sion ratio
`[Rubin 1976; Ruth and Kreutzer
`19721. These measures are defined below.
`Related
`to each of
`these measures are
`about
`the
`characteristics
`assumptions
`of the source.
`It
`is generally assumed
`in
`information
`theory
`that
`all
`statistical
`parameters of a message source are known
`with perfect accuracy
`[Gilbert
`19711. The
`most common model
`is that of a discrete
`memoryless source; a source whose output
`is a sequence of letters
`(or messages), each
`letter being a selection
`from some fixed
`- - - , a,,. The
`letters are taken
`alphabet al,
`to be random, statistically
`independent
`se-
`lections
`from
`the alphabet,
`the selection
`being made according
`to some fixed prob-
`ability assignment p(al),
`. . ., p(a,)
`[Gal-
`lager 19681. Without
`loss of generality,
`the
`code alphabet
`is assumed
`to be
`(0, 1)
`throughout
`this paper. The modifications
`necessary
`for
`larger code alphabets
`are
`straightforward.
`that any cost associated
`We assume
`with
`the code letters
`is uniform. This
`is a
`reasonable assumption,
`although
`it omits
`applications
`like
`telegraphy
`where
`the
`
`durations.
`code symbols are of different
`The assumption
`is also
`important,
`since
`the
`problem
`of
`constructing
`optimal
`codes over unequal
`code
`letter
`costs
`is
`a significantly
`different
`and more diffi-
`cult problem. Perl et al. [1975] and Varn
`[1971] have developed algorithms
`for min-
`imum-redundancy
`prefix coding in the case
`of arbitrary
`symbol cost and equal code-
`word probability.
`The assumption of equal
`probabilities mitigates
`the difficulty
`pre-
`sented by the variable symbol cost. For the
`more general unequal
`letter costs and un-
`equal probabilities model, Karp
`[1961] has
`proposed an
`integer
`linear programming
`approach. There have been several approx-
`imation algorithms proposed
`for this more
`difficult
`problem
`[Krause 1962; Cot 1977;
`Mehlhorn
`19801.
`the goal is
`When data are compressed,
`reduce
`redundancy,
`leaving only
`the
`to
`informational
`content.
`The measure of
`information
`of a source message oi (in
`bits)
`is -log p(ai), where
`log denotes
`the
`base 2 logarithm
`and p(ai) denotes
`the
`probability
`of occurrence of the message
`oi.2 This definition
`has intuitive
`appeal; in
`the case in which p(ai) = 1, it is clear that
`ai is not at all informative
`since it had to
`occur. Similarly,
`the smaller
`the value of
`p(ai),
`the more unlikely ai is to appear, and
`hence
`the
`larger
`its
`information
`content.
`The reader is referred
`to Abramson
`[1963,
`pp. 6-131 for a longer, more elegant discus-
`sion of
`the
`legitimacy
`of
`this
`technical
`definition
`of the concept of information.
`The average information
`content over the
`source alphabet
`can be computed
`by
`weighting
`the information
`content of each
`source
`letter by
`its probability
`of occur-
`yielding
`the
`expression
`xi”=1
`rence,
`[-p(oi)log
`p(ai)]. This quantity
`is referred
`to as the entropy of a source letter or the
`entropy of the source and is denoted by H.
`Since the length of a codeword
`for message
`oi must be sufficient
`to carry
`the informa-
`tion content of oi, entropy
`imposes a lower
`bound on the number of bits required
`for
`the coded message. The
`total number of
`
`throughout
`that
`* Note
`to the base 2.
`
`this paper all logarithms
`
`are
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 6
`
`

`

`bits must be at least as large as the product
`of H and the length of the source ensemble.
`Since
`the value of H is generally not an
`integer, variable-length
`codewords must be
`used if the lower bound
`is to be achieved.
`that message EXAMPLE
`Given
`is to be
`encoded one letter at a time, the entropy of
`its source can be calculated using the prob-
`abilities given
`in Figure 3: H = 2.894, so
`that
`the minimum
`number of bits con-
`in an encoding of EXAMPLE
`tained
`is 116.
`The Huffman code given in Section 1.2 does
`not quite achieve the theoretical minimum
`in this case.
`of information
`Both of these definitions
`[ 19491. A der-
`content are due to Shannon
`ivation of the concept of entropy as it re-
`lates to information
`theory
`is presented by
`Shannon
`[1949]. A simpler, more intuitive
`explanation
`of entropy
`is offered by Ash
`[1965].
`The most common notion of a “good”
`is optimal in the sense of
`code is one that
`redundancy. Redundancy
`having minimum
`can be defined as c p(ai)Zi - 2 [-p(ai)log
`p(oi)], where li is the length of the codeword
`representing message oi. The expression
`z p(ai)Zi represents
`the lengths of the code-
`words weighted by
`their probabilities
`of
`occurrence,
`that
`is, the average codeword
`length. The expression z [-p(ai)log
`p(oi)]
`is entropy H. Thus,
`redundancy
`is a mea-
`sure of
`the difference
`between average
`codeword
`length and average information
`content.
`If a code has minimum
`average
`codeword
`length
`for a given discrete prob-
`it is said to be a mini-
`ability distribution,
`mum redundancy code.
`local redundancy to
`We define
`the term
`capture
`the notion of redundancy
`caused
`by local properties of a message ensemble,
`rather
`than
`its global characteristics.
`Al-
`though
`the model used
`for analyzing
`general-purpose coding techniques assumes
`a random distribution
`of the source mes-
`sages, this may not actually be the case. In
`particular
`applications
`the
`tendency
`for
`messages to cluster
`in predictable patterns
`may be known. The existence of predictable
`patterns may be exploited
`to minimize
`local
`redundancy. Examples of applications
`in
`which
`local
`redundancy
`is common and
`
`Data Compression
`
`l
`
`267
`
`it
`
`local redundancy
`for dealing with
`methods
`are discussed in Sections 2 and 6.2.
`uses average message length,
`Huffman
`2 p(ai)Zi, as a measure of the efficiency of
`a code. Clearly,
`the meaning of this term is
`the average length of a coded message. We
`term average codeword length to
`use the
`represent
`this quantity. Since redundancy
`is defined
`to be average codeword
`length
`minus entropy
`and entropy
`is constant
`for a given probability
`distribution, mini-
`mizing average codeword
`length minimizes
`redundancy.
`is asymptotically optimal
`A code
`if
`has the property
`that for a given probability
`distribution,
`the ratio of average codeword
`length
`to entropy approaches 1 as entropy
`tends to infinity.
`That
`is, asymptotic opti-
`mality guarantees
`that average codeword
`length approaches
`the
`theoretical mini-
`mum
`(entropy
`represents
`information
`content, which
`imposes a lower bound on
`codeword
`length).
`The amount of compression yielded by a
`coding
`scheme can be measured by a
`compression ratio. Compression
`ratio has
`been defined
`in several ways. The defini-
`tion C = (average message length)/(average
`codeword
`length)
`captures
`the common
`meaning, which
`is a comparison
`of
`the
`length of the coded message to the length
`of the original ensemble
`[Cappellini
`19851.
`If we think of the characters of the ensem-
`ble EXAMPLE as 6-bit ASCII characters,
`then
`the average message length
`is 6 bits.
`The Huffman
`code of Section 1.2 repre-
`sents EXAMPLE
`in 117 bits, or 2.9 bits
`per character. This yields a compression
`ratio of 612.9, representing
`compression by
`a factor of more than 2. Alternatively,
`we
`may say that Huffman
`encoding produces
`a file whose size is 49% of the original
`ASCII
`file, or that 49% compression has
`been achieved.
`of
`definition
`different
`A
`somewhat
`[1976], C =
`compression
`ratio by Rubin
`(S - 0 - OR)/S, includes
`the representa-
`tion of the code itself
`in the transmission
`cost.
`In
`this definition,
`S represents
`the
`length of the source ensemble, 0 the length
`(coded message), and OR the
`of the output
`size of the “output
`representation”
`(e.g., the
`
`ACM Computing Surveys, Vol. 19, No. 3, September 1987
`
`NetApp Exhibit 1006 Page 7
`
`

`

`260
`
`l
`
`D. A. Lelewer and D. S. Hirschberg
`
`number of bits required for the encoder to
`transmit the code mapping to the decoder).
`The quantity OR constitutes a “charge* to
`an algorithm
`for transmission of informa-
`tion about the coding scheme. The inten-
`tion
`is to measure the total size of the
`transmission (or file to be stored).
`
`1.4 Motivation
`data
`Introduction,
`the
`As discussed in
`compression has wide application
`in terms
`of information storage, including represen-
`tation of the abstract data type string
`[Standish 19801 and
`file compression.
`Huffman coding is used for compression in
`several file archival systems [ARC 1986;
`PKARC 19871, as is Lempel-Ziv coding,
`one of the adaptive schemes to be discussed
`in Section 5. An adaptive Huffman coding
`technique is the basis for the compact com-
`mand of the UNIX operating system, and
`the UNIX compress utility employs the
`Lempel-Ziv approach [UNIX 19841.
`In the area of data transmission, Huff-
`man coding has been passed over for years
`in
`favor of block-block
`codes, notably
`ASCII. The advantage of Huffman coding
`is in the average number of bits per char-
`acter transmitted, which may be much
`smaller than the log n bits per character
`(where n is the source alphabet size) of a
`block-block system. The primary difficulty
`associated with variable-length codewords
`is that the rate at which bits are presented
`to the transmission channel will fluctuate,
`depending on the relative frequencies of the
`source messages. This fluctuation
`requires
`buffering between the source and the chan-
`nel. Advances
`in
`technology have both
`overcome this difficulty and contributed
`to
`the appeal of variable-length codes. Cur-
`rent data networks allocate communication
`resources to sources on the basis of need
`and provide buffering as part of the system.
`These systems require significant amounts
`of protocol, and fixed-length codes are quite
`inefficient
`for applications such as packet
`headers. In addition, communication costs
`are beginning
`to dominate storage and
`processing costs, so that variable-length
`coding schemes that reduce communication
`costs are attractive even if they are more
`
`complex. For these reasons, one could ex-
`pect to see even greater use of variable-
`length coding in the future.
`It is interesting to note that the Huffman
`coding algorithm, originally developed for
`the efficient transmission of data, also has
`a wide variety of applications outside the
`sphere of data compression. These include
`construction of optimal search trees [Zim-
`merman 1959; Hu and Tucker 1971; Itai
`19761, list merging [Brent and Kung 19781,
`and generating optimal evaluation trees in
`the compilation of expressions
`[Parker
`19801. Additional
`applications
`involve
`search for jumps in a monotone function of
`a single variable, sources of pollution along
`a river, and leaks in a pipeline [Glassey and
`Karp 19761. The fact that this elegant com-
`binatorial
`algorithm has
`influenced
`so
`many diverse areas underscores its impor-
`tance.
`
`METHODS
`2. SEMANTIC-DEPENDENT
`data
`Semantic-dependent
`compression
`techniques are designed to respond to spe-
`cific types of local redundancy occurring in
`certain applications. One area in which
`data compression is of great importance
`is image representation and processing.
`There are two major reasons for this. The
`first is that digitized images contain a large
`amount of local redundancy. An image is
`usually captured in the form of an array of
`pixels, whereas methods that exploit
`the
`tendency for pixels of like color

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket