US005389922A

Seroussi et al.

[11] Patent Number: 5,389,922
[45] Date of Patent: Feb. 14, 1995

[54] COMPRESSION USING SMALL DICTIONARIES WITH
     APPLICATIONS TO NETWORK PACKETS

[75] Inventors: Gadiel Seroussi, Cupertino, Calif.;
                Abraham Lempel, Haifa, Israel

[73] Assignee: Hewlett-Packard Company, Palo Alto, Calif.

[21] Appl. No.: 46,548

[22] Filed: Apr. 13, 1993

[51] Int. Cl.5 ...................................... H03M 7/30
[52] U.S. Cl. ................................ 341/51; 341/106
[58] Field of Search ................. 341/51, 106, 52, 55,
                                     341/87, 90, 94, 95
`
[56]                References Cited

          U.S. PATENT DOCUMENTS

  4,464,650  8/1984  Eastman et al. ............. 340/347
  4,558,302 12/1985  Welch ...................... 340/347
  4,814,746  3/1989  Miller et al. .............. 341/95
  5,003,307  3/1991  Whiting et al. ............. 341/51
  5,150,430  9/1992  Chu ........................ 341/51 X
`
OTHER PUBLICATIONS

Bunton et al., "Practical Dictionary Management for Hardware Data Compression, Development of a Theme by Ziv and Lempel", Communications of the ACM, 35:95-104, Jan. 1992.
"On the Complexity of Finite Sequences", IEEE Transactions on Information Theory, IT-22:75-81, Jan. 1976.
"A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, IT-23:337-343, May 1977.
"Compression of Individual Sequences via Variable Rate Coding", IEEE Transactions on Information Theory, IT-24:530-536.
U.S. patent application Ser. No. 07/892,546, filing date Jun. 1, 1992, Lempel et al., "Lempel-Ziv Compression Scheme with Enhanced Adaptation".
`
Primary Examiner: Sharon D. Logan

[57]                    ABSTRACT
`
The invention is a dictionary initialization scheme
adaptive to changes in the type and structure of input
data. The compression ratio is increased by minimizing
the number of data entries used to represent single char-
acters in the input data. By using fewer codes than are
normally used to represent characters in an array of
input data, the dictionary can have fewer entries than
the alphabet size. A further aspect of the invention
implements a type of run-length encoding in the LZ
methodology which exploits the redundant structure
existing in the compressed stream in the presence of a
long run. Some of the codewords in the compressed
stream are deleted but can be recovered at the decom-
pression site. The foregoing LZE method is used alone,
or in combination with other methods, to form a
compression scheme that is especially useful for trans-
mitting network packets.

23 Claims, 24 Drawing Sheets
`
[Drawing sheets 1-24: the printed figures are not reproducible in this text scan; only the figure labels and legible box captions are retained below. The front page carries a representative excerpt of FIG. 5.]

FIG. 1 (Sheet 1): block diagram of the compression/decompression system.
FIG. 2 (Sheet 1): maximal match of a current pattern against a previous occurrence in the input stream.
FIG. 3 (Sheet 2): special code 48 plus character 46; current code length 12 bits, character 8 bits.
FIG. 4 (Sheet 2): special code 48 plus partial character 49; current code length 12 bits, partial character m-k (3) bits.
FIG. 5 (Sheet 3): flow chart of the LZE intermediate initialization process (initialize empty dictionary 50; NAC <- c0 + 2^k; W <- null string; read next input character; "process new char"; enter "a" in dictionary with code NAC; output code of W; enter Wa in dictionary with code NAC).
FIG. 6 (Sheet 4): flow chart of the "new character" encoding subprocess (output c1 using b bits 74; output c2 using m-k bits 76).
FIG. 7 (Sheet 5): encoder circuitry for the new-character subprocess.
FIG. 8 (Sheet 6): flow chart of the "new character" decoding subprocess (read code c1, b bits; process new char; process control code; process regular code).
FIG. 9 (Sheet 7): decoder circuitry (input buffer; m-k fixed field; discard b-k bits; enter single-character string in dictionary; m bits to output buffer).
FIG. 10 (Sheet 8): table of compression results for different initialization schemes (values not recoverable).
FIG. 11 (Sheet 9): conventional data compression system (input engine 116: read input characters; compression engine 118: get input chars, maintain dictionary, produce codes; output engine 120: do code output).
FIG. 12 (Sheet 10): compression subsystem with run enhancement engine (get input chars; maintain dictionary; produce codes; intercept codes; check for runs; disable output for some codes; run enhancement engine holds RUNCODE).
FIG. 13 (Sheet 11): flow chart of run-enhancement encoding (code C; RUNCODE <- C; disable output; test RUNCODE = NULL; output RUNCODE; RUNCODE <- NULL).
FIG. 14 (Sheet 12): conventional data decompression system (decoder input engine 150: do code input; decompression engine: get codes, maintain dictionary, produce output chars; decoder output engine: do char output).
FIG. 15 (Sheet 13): decompression subsystem with run-enhancement engine (intercept codes; check for runs; generate modified code stream).
FIG. 16 (Sheet 14): flow chart of run-enhancement decoding (code C; send C to decompression engine; RUNCODE <- NAC; send RUNCODE to decompression engine; RUNCODE <- RUNCODE + 1).
FIGS. 17A-17B (Sheets 15-17): graphical depiction of intermediate initialization and run-length encoding (content not recoverable).
FIGS. 18A-18B (Sheet 18): optimal Huffman code assignments and plot of compression ratios (content not recoverable).
FIG. 19 (Sheet 19): table of probability approximations (column values partially legible, 1.0000/0.0000 through 1.0000/0.0059; structure not recoverable).
FIG. 20 (Sheet 20): parallel "best wins" compression system (input packet 200; address processing 202; LZE, COPY and Z-HUF paths 204, 208, 210).
FIG. 21 (Sheet 21): table of compression results for network packet files (content not recoverable).
FIG. 22 (Sheet 22): plot of compression ratio vs. time for real network packet data (content not recoverable).
FIG. 23 (Sheet 23): distribution of the FIG. 22 data by packet length (content not recoverable).
FIG. 24 (Sheet 24): compression ratio of the FIG. 22 data as a function of packet length (content not recoverable).
`
`COMPRESSION USING SMALL DICTIONARIES
`WITH APPLICATIONS TO NETWORK PACKETS
`
`BACKGROUND OF INVENTION
`
`This invention relates generally to compression and
`decompression of digital data and more particularly to
`implementations of lossless compression and decom-
`pression methods and apparatus using a dictionary to
`store compression data, and applications of compres-
`sion/decompression techniques to network packet com-
munications.
`A major class of compression schemes encode multi-
`ple-character strings using binary sequences or “code-
`words” not otherwise used to encode individual charac-
`ters. The strings are composed of an “alphabet,” or
`single-character strings. This alphabet represents the
`smallest unique piece of information the compressor
processes. Thus, an algorithm which uses eight bits to
represent its characters has 256 unique characters in its
`alphabet. Compression is effective to the degree that the
`multiple-character strings represented in the encoding
`scheme are encountered in a given file of the data
`stream. By analogy with bilingual dictionaries used to
`translate between human languages, the device that
`embodies the mapping between uncompressed code and
`compressed code is commonly referred to as a “dictio-
`nary.”
`Generally, the usefulness of a dictionary-based com-
`pression scheme is dependent on the frequency with
`which the dictionary entries for multiple-character
`strings are used. If a fixed dictionary is optimized for
`one file type it is unlikely to be optimized for another.
`For example, a dictionary which includes a large num-
`ber of character combinations likely to be found in
`newspaper text files, is unlikely to efficiently compress
`data base files, spreadsheet files, bit-mapped graphics
`files, computer-aided design files, et cetera.
`Adaptive compression schemes are known in which
`the dictionary used to compress given input data is
`created while that input data is being compressed.
`Codewords representing every single character possible
`in the uncompressed input data are put into the dictio-
`nary. Additional entries are added to the dictionary as
`multiple-character strings are encountered in the file.
`The additional dictionary entries are used to encode
`subsequent occurrences of
`the multiple-character
`strings. For example, matching of current input patterns
`is attempted only against phrases currently residing in
`the dictionary. After each failed match, a new phrase is
`added to the dictionary. The new phrase is formed by
`extending the matched phrase by one symbol (e.g., the
`input symbol that “breaks” the match). Compression is
effected to the extent that the multiple-character strings
`occurring most frequently in the file are encountered as
`the dictionary is developing.
`During decompression, the dictionary is built in a like
`manner. Thus, when a codeword for a character string
`is encountered in the compressed file, the dictionary
`contains the necessary information to reconstruct the
`corresponding character string. Widely-used compres-
`sion algorithms that use a dictionary to store compres-
`sion and decompression information are the first and
`second methods of Lempel and Ziv, called LZ1 and
`LZ2 respectively. The Lempel-Ziv (LZ) algorithm was
`originally described by Lempel and Ziv in “On the
`Complexity of Finite Sequences” IEEE Transactions on
`Information Theory, IT-22:75-81, January 1976; and in
`“A Universal Algorithm for Sequential Data Compres-
`sion” IEEE Transactions on Information Theory, IT-
`23:337—343, May 1977; and “Compression of Individual
`Sequences via Variable Rate Coding” IEEE Transac-
`tions on Information Theory, IT-24:530-536. Dictionary
`usage is also disclosed in U.S. Pat. No. 4,464,650 to
`Eastman et al., and various improvements in the algo-
`rithms are disclosed in U.S. Pat. Nos. 4,558,302 to
`Welch, and 4,814,746 to Miller et al.
`When working on a practical implementation, the
`amount of memory available for compression/decom-
`pression is finite. Therefore, the number of entries in the
`dictionary is finite and the length of the codewords used
`to encode the entries is bounded. Typically, the length
`of codewords varies between 12 and 16 bits. When the
`input data sequence is sufficiently long, the dictionary
`will eventually “fill up.” Several courses of action are
`possible at this point. For example, the dictionary can
`be frozen in its current state, and used for the remainder
`of the input sequence. In a second approach, the dictio-
`nary is reset and a new dictionary created from scratch.
`In a third approach, the dictionary is frozen for some
`time, until the compression ratio deteriorates, then the
`dictionary is reset. Alternate strategies for dictionary
`reset are described in U.S. application Ser. No.
`07/892,546, filed Jun. 1, 1992 entitled “Lempel-Ziv
`Compression Scheme with Enhanced Adaptation”, as is
`hereby incorporated by reference herein, and by Bun-
`ton, S. et al., in “Practical Dictionary Management for
`Hardware Data Compression” Communications of the
ACM, 35:95-104, January 1992.
`In the LZW process, the dictionary must be initial-
`ized for the single-character strings that are used to
`build the compression dictionary. These characters are
`assigned unique codes within the compression/decom-
`pression system. This implies that the number of bits in
`any additional output code sent out by the encoder (e.g.,
codes that represent multiple character strings) is con-
`trolled by the number of single-character strings. For
`example, the shortest bit length for a multiple character
`string is determined by the number of single-character
`strings. The number of bits in subsequent codes repre-
senting multiple characters increases by one bit
every time the number of entries in the dictionary reaches
`the next power of 2. Using more bits to represent single-
`character codewords proportionally decreases
`the
`overall compression performance.
`The initialization of single input characters as de-
`scribed above is inefficient for input data with a large
`alphabet size or when only an unknown subset of the
`alphabet is expected to occur in the input data. For
`example, when the “natural” alphabet for the input data
`consists of 16-bit symbols, the initial dictionary size
would have 65,536 entries. Therefore, the minimal
`length of any output code generated, in addition to the
`characters from the “natural” alphabet (e.g., codes rep-
resenting multi-character strings) is at least 17 bits. Al-
`ternatively, if the block of input data (i.e., the data to be
`compressed) is small relative to the alphabet size, there
`is an unnecessarily high overhead in time, memory
`space, and compression ratio that comes from initial-
izing, storing, and encoding, respectively, single-
character strings from the input data.
`To overcome these problems, some variants of the
LZ algorithm employ an empty initial dictionary. When
`a new input character is encountered, the compressor
`outputs a special code, followed by a copy of the new
`character. This allows the decompressor to keep track
of a subset of the input alphabet that is actually in use,
`allowing decoding to proceed as usual. The main prob-
`lem with this strategy is the high cost of encoding new
`characters. For short files over large alphabets, this
`overhead cost might become unacceptably high. For
`instance, with 8-bit symbols and 12-bit output codes, 20
bits are required to let the decoder know a new charac-
`ter has occurred. In addition, often there is redundancy
`within the encoded character strings output by the LZ
algorithm. For example, a string of the same input char-
acters (i.e., a "run") produces a sequence of encoded
strings with a predictable and redundant structure.
`redundancy is not presently leveraged to further in-
`crease the compression ratio of standard compression
`algorithms.
`Accordingly, a need remains for a data compression
`initialization process that is adaptable to different types
`of input data and different data structures to increase
`the data compression ratio and to reduce the amount of
`memory required in a dictionary based compression/-
`decompression system.
`SUMMARY OF THE INVENTION
`
`It is, therefore, an object of the invention to improve
`the compression and decompression of digital data in a
`dictionary-based system.
`Another object of the invention is to increase the data
`compression ratio for
`compression/decompression
`schemes by reducing the number of bits used in repre-
`senting encoded character strings.
`Another object of the invention is to reduce the over-
`head of initializing a dictionary in a dictionary-based
`compression and decompression system.
`A further object of the invention is to more efficiently
`compress digital data which either occurs in small files
`or which is represented by a subset of a large single-
`character alphabet.
`A further object of the invention is to recompress
`encoded character strings that represent
`input data
`character runs to further increase the compression ratio
`of a compression/decompression system.
`The invention is a dictionary based initialization
`scheme that is adaptive to changes in the type and struc-
`ture of input data. The initialization scheme increases
`the compression ratio by minimizing the number of data
`entries used in a dictionary based compression/decom-
`pression system to represent single-character data
`strings. The reduced number of data entries reduces the
`bit-length of codewords in a compressed data string.
`Reducing the codeword bit-length in the compressed
`data string increases the overall compression ratio.
`The invention uses a variable number of special
`codes. The total number of special codes is, typically,
`selected to be less than the total number of character
`codes used for representing single-characters in the
`input data array. Each special code carries part of the
information on a new character. Additional bits that
further identify a new character are then transmitted in
`a separate partial character. This process reduces the
`‘cost’ (i.e., bandwidth and storage space) of transmitting
a new symbol. The process is adaptable anywhere be-
tween no initialization (e.g., an empty initial
dictionary) and a full alphabet initialization where each
`unique character in the alphabet is assigned an entry in
`the dictionary.
`The number of special codes is adaptable to the appli-
`cation presently being performed. Thus, the number of
`
`4
`special codes is predetermined for specific types of data
`to maximize the compression ratio. This method allows
`the dictionary to have fewer entries than the alphabet
`size. Thus, compression is possible with very small dic-
tionaries that require very little memory. This is partic-
`ularly useful in applications where the input data blocks
`are short, and each block has to be compressed indepen-
`dently. The initialization scheme also provides com-
`pression of data sources with large alphabet sizes (e.g.,
`16-bit symbols), while maintaining moderate size dictio-
`naries.
`
`A further aspect of the invention implements a type
`of run-length encoding in the LZ methodology (LZE).
In conventional LZ2 data compression, a run, which is
`a string of repeated occurrences of the same character
`in the input data, is encoded as a series of codes. Each
`successive code is built upon the previous code, fol-
`lowed by a code for the remainder or tail of the run.
`The decompressor then receives and decodes each of
`these codes in sequential order. The system sends a
`shortened sequence by transmitting a single code (rep-
resenting most of the character run) and the preceding
`and tail codes.
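For intuition, the redundant structure referred to above can be seen by tracing the phrases an LZ2-style parser adds while reading a run. The following Python fragment is a sketch of standard LZ2/LZW parsing behavior only, not of the claimed run-enhancement engine, and it assumes the single character is already in the dictionary:

    # Phrases added while parsing a run of one character, e.g. "aaaaaaaaaa":
    run = "a" * 10
    phrases, w, seen = [], "", {"a"}
    for ch in run:
        if w + ch in seen:
            w += ch                  # keep extending the current match
        else:
            phrases.append(w + ch)   # new phrase = previous match plus "a"
            seen.add(w + ch)
            w = ""
    print(phrases)                   # ['aa', 'aaa', 'aaaa'] -- each phrase is
                                     # the previous one extended by one "a"

The codewords emitted for such a run follow the same predictable staircase, which is the structure that allows some of them to be deleted and later recovered at the decompression site.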
`The foregoing LZE method is used alone, or in com-
`bination with other methods to form a compression
`scheme especially useful in transmitting network pack-
`ets. In the combined system, the LZE compression
`scheme is applied to an input data stream in parallel
with one or more other data compression methods, for
example a Huffman variable-length coding scheme or
`an uncompressed transmission scheme. The output of
`the method providing the best compression ratio is then
`used for data transmission and storage. A high compres-
`sion ratio is obtained for real network packet data con-
`taining packets with a wide distribution of lengths. The
`high compression ratio is consistently maintained across
`the entire range of packet lengths even in data having a
`large proportion of short packets.
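A minimal sketch of the parallel "best method wins" selection follows. The candidate compressors here are standard-library stand-ins rather than the LZE and Z-HUF engines of FIG. 20, and the one-byte method tag is an assumed framing convention, not part of the disclosure:

    import zlib

    def compress_packet(packet: bytes) -> bytes:
        # Apply each candidate method to the same packet and keep the
        # shortest result; method 0 is the uncompressed "copy" fallback.
        candidates = {
            0: packet,
            1: zlib.compress(packet, 9),   # dictionary-based stand-in
        }
        method, body = min(candidates.items(), key=lambda kv: len(kv[1]))
        return bytes([method]) + body      # tag tells the receiver how to decode

    def decompress_packet(blob: bytes) -> bytes:
        method, body = blob[0], blob[1:]
        return body if method == 0 else zlib.decompress(body)

Because short packets often do not compress at all, the uncompressed path guarantees the scheme never expands a packet by more than the one-byte tag.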
`The foregoing and other objects, features and advan-
`tages of the invention will become more readily appar-
`ent from the following detailed description of a pre-
`ferred embodiment of the invention which proceeds
`with reference to the accompanying drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a generalized block diagram of a compres-
`sion/decompression system in which the enhancements
`of the present invention are implemented.
`FIG. 2 is a diagram illustrating the basic principle of
`LZ compression.
`FIG. 3 is a diagram illustrating the transmission of a
`new character in conventional LZ compression with an
empty initial dictionary.
`FIG. 4 is a diagram illustrating the transmission of a
`new character with enhanced LZ compression using
`intermediate dictionary initialization according to the
`invention.
`FIG. 5 is a flow chart of the basic LZE intermediate
`initialization process.
`FIG. 6 is a flow chart of the “new character” encod-
`ing subprocess of FIG. 5.
`FIG. 7 is a functional block diagram of circuitry for
`implementing the encoding subprocess of FIG. 6 in the
`compression subsystem of FIG. 1.
`FIG. 8 is a flow chart of a “new character” decoding
`subprocess for decoding codes produced by the sub-
`process of FIG. 6.
`FIG. 9 is a functional block diagram of circuitry for
`implementing the decoding subprocess of FIG. 8 in the
`decompression subsystem of FIG. 1.
`FIG. 10 is a table showing the compression results for
`different initialization schemes according to the present
`invention.
`
`FIG. 11 is a block diagram of a conventional data
`compression system.
`FIG. 12 is a block diagram of circuitry for imple-
`menting an enhanced Lempel-Ziv run-length encoding
`scheme in the compression subsystem shown in FIG. 1.
`FIG. 13 is a flow chart showing a process for imple-
menting an enhanced Lempel-Ziv run-length encoding
`method according to the invention.
`FIG. 14 is a block diagram of a conventional data
`decompression system.
`FIG. 15 is a block diagram of circuitry for imple-
`menting enhanced Lempel-Ziv run-length decoding in
`the decompression subsystem of FIG. 1 according to
`the invention.
`
`FIG. 16 is a flow chart of the enhanced Lempel-Ziv
`run-length decoding method according to the inven-
`tion.
`
`FIGS. 17A and 17B are a graphical depiction of the
`method for performing intermediate initialization and
`run length encoding.
`FIG. 18A is a table illustrating optimal Huffman
`coded assignments.
`FIG. 18B is a plot of compression ratios for Huffman
`encoded data.
`
`FIG. 19 is a table showing probability distributions
`for different bit lengths.
`FIG. 20 is a block diagram of a parallel “best wins”
`compression system according to the invention.
`FIG. 21 is a table showing compression results for
`various network packet files.
`FIG. 22 is a graph of compression ratio vs. time using
`the system of FIG. 20 on real network packet data.
`FIG. 23 is a graph showing the distribution of the
`data in FIG. 22 by packet length.
`FIG. 24 is a graph of the compression ratio of the data
`in FIG. 22 as a function of packet length.
`DETAILED DESCRIPTION
`
`The general arrangement and operation of Lempel-
`Ziv compression/decompression systems are well-
`known and are, therefore, described only in general
`terms with reference to FIGS. 1 and 2. The system 22 in
`FIG. 1 includes a compression subsystem 24 and a de-
`compression subsystem 26 interconnected by a digital
data communications (or storage) channel 28. In prac-
`tice, both terminals of a system will include both com-
`pression and decompression subsystems and the hard-
`ware is typically designed to operate interchangeably to
`compress/send or to receive/decompress data.
`Each subsystem includes, in addition to conventional
`communications (or storage) circuitry (not shown), a
`compression engine 30 which implements the basic
`Lempel-Ziv compression algorithm, memory 32 imple-
`menting one or more dictionaries in which data entries
`encoding the character string data are stored, and sup-
`porting circuits implementing the enhancements further
`described below. The supporting circuits include the
`intermediate initialization encoder 34 and counterpart
`decoder 36, which are further detailed in FIGS. 7 and 9,
and the encoder run enhancement engine 38 and coun-
terpart decoder run enhancement engine 40 which are
`shown in FIGS. 12 and 15.
`
`FIG. 2 illustrates the Lempel-Ziv (LZ) algorithm, for
`lossless compression of digital data (i.e., the original
`data is completely recoverable from its compressed
`image). The LZ method matches a current pattern in an
`input data stream to patterns occurring previously. For
`example, a current pattern 42 (ABC) in input stream 46
`is the same as a pattern 44 (ABC) that was previously
`transmitted. The compression subsystem 24 (FIG. 1),
`substitutes a description (i.e., codeword) of the maximal
`match for the matched input symbols (ABC). The de-
`compression subsystem 26 (FIG. 1) can then recon-
`struct the original symbols from the match codeword,
`and from previously decompressed data segments. In
`redundant data sources, the descriptions or the code-
`words describing a multiple character match tend to be
`shorter than the matched patterns, thus achieving data
`compression.
`The main feature of LZ2 is incremental parsing. The
`input data sequence is parsed into phrases, which are
`collected in a dictionary. Maximal matching of current
`input patterns is attempted, as described above, only
`against phrases in the dictionary. After each match, a
`new phrase is formed by extending the matched phrase
with the input symbol that “breaks” the match. This and
other variants of the algorithm are asymptotically opti-
`mal, (i.e., achieve, in the limit, the best compression
`ratio theoretically possible). The algorithm is also
`highly adaptive, learning the statistical characteristics
`of the input data “on the fly”. In LZ2, this “knowledge”
`is stored in the dictionary, whose entries parse the input
`data sequence.
`The compressor implementation in LZ2 can be infor-
`mally described as follows:
`1. A dictionary is initialized with all single-letter
`words that exist in the input alphabet (e.g. the 256 one-
`byte strings) and a distinct index codeword is then as-
`signed to each single-letter word.
`2. A current phrase is initialized with the first charac-
`ter from an input data stream.
`3. Characters from the input data stream are continu-
`ously read, extending the current phrase, as long as a
`matching phrase exists in the dictionary.
`4. The process is stopped when the current phrase is
`of the form wa, where “a” is the last character read
`from the input data stream, W is a phrase in the dictio-
`nary, while Wa does not match an entry in the phrase
`dictionary.
`5. The codeword for W is output.
`6. Wa is added to the dictionary, assigning it the next
`available codeword.
`
7. The current phrase is set to “a” and the process
returns to Step 3.
`This implementation is known as LZW.
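The seven steps above correspond, informally, to the following Python sketch. It is illustrative only: the 8-bit byte alphabet, the code numbering starting at 256, and the function name are assumptions, not the claimed hardware implementation.

    def lzw_compress(data: bytes):
        # Step 1: initialize the dictionary with all single-character strings.
        dictionary = {bytes([c]): c for c in range(256)}
        next_code = 256                      # next available codeword
        codes = []
        # Step 2: the current phrase starts as the first input character.
        w = data[:1]
        for c in data[1:]:
            wa = w + bytes([c])
            if wa in dictionary:             # Step 3: keep extending the match.
                w = wa
            else:
                codes.append(dictionary[w])  # Step 5: output the code for W.
                dictionary[wa] = next_code   # Step 6: add Wa to the dictionary.
                next_code += 1
                w = bytes([c])               # Step 7: restart from "a".
        if w:
            codes.append(dictionary[w])      # flush the final phrase
        return codes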
`In the decompressor subsystem 26 in FIG. 1, a similar
`phrase dictionary is built. The decompressor is first
`initialized as in Step 1 above and new phrases are then
`added to the dictionary as the data is being decom-
`pressed. When the decompressor receives a code for a
`phrase W followed by the code for a phrase starting
`with the character “a”, it adds the phrase Wa to the
`dictionary and assigns it the next available code. Thus,
`the decompressor can reconstruct the dictionary built
`by the compressor, without the latter having to send the
`dictionary along with the compressed data.
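A companion decompressor sketch, under the same assumptions; the branch handling a code that refers to the phrase being defined in the same step is the standard LZW special case, which the text above leaves implicit:

    def lzw_decompress(codes):
        if not codes:
            return b""
        # Start from the same single-character dictionary as the compressor.
        dictionary = {c: bytes([c]) for c in range(256)}
        next_code = 256
        w = dictionary[codes[0]]
        output = [w]
        for code in codes[1:]:
            if code in dictionary:
                entry = dictionary[code]
            elif code == next_code:          # code defined by this very step
                entry = w + w[:1]
            else:
                raise ValueError("corrupt code stream")
            output.append(entry)
            # Add W extended by the first character of the next phrase,
            # mirroring the dictionary built by the compressor.
            dictionary[next_code] = w + entry[:1]
            next_code += 1
            w = entry
        return b"".join(output)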
`In a practical implementation, the amount of memory
`available to the encoder (and similarly to the decoder) is
`limited. Therefore, the number of phrases in the dictio-
`nary is also limited, and the output codes are of bounded
`length. Typically, the upper bound on the code length is
`between 12 and 16 bits. When the input data sequence is
`sufficiently long,
`the dictionary will eventually “fill
`up”. At this point, the LZW dictionary is either “fro-
`zen” or “reinitialized”.
`
`Step 1 in the above outlined LZW compression pro-
`cedure calls for the initialization of the dictionary with
all single-character strings. Let m denote the size, in
bits, of the single characters from the input data string
(e.g., m = 8, or one byte, in the most common case). The
phrases in the dictionary are assigned codes c0, c0+1,
c0+2, . . . , c0+(2^m - 1), for some initial non-
negative number c0. This implies that the first code sent
`out by the encoder (i.e., the first code representing a
multiple character string) must be at least m+1 bits
`long. In practical implementations, it is customary to
`use output codes of length m+1 at the beginning of the
`compression process, and subsequently increase the
`length of the output codes by one bit every time the
`number of entries in the dictionary reaches the next
power of 2. Hence, the length of the output codes varies
between m+1 and b, where 2^b is the maximum size of
the dictionary, and b >= m+1. For simplicity, it is
`assumed that the maximum dictionary size is a power of
`2. This is the case in most practical implementations,
`although it is not a necessary requirement. Clearly, the
`length of the output codes directly impact the compres-
`sion ratio. Specifically, the shorter the output codes, the
`better the compression ratio.
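A minimal sketch of that code-width rule, assuming the entry count includes the 2^m initial single-character codes and that the width is bumped exactly when the count reaches the next power of 2 (implementations differ on the precise moment):

    def output_code_width(num_entries: int, m: int, b: int) -> int:
        # Codes start at m+1 bits; the width grows by one bit each time the
        # number of dictionary entries reaches the next power of 2, and is
        # capped at b bits, 2**b being the maximum dictionary size.
        return min(max(m + 1, num_entries.bit_length()), b)

    # For m = 8 and b = 12: 256..511 entries -> 9 bits, 512 -> 10 bits,
    # and the width never exceeds 12 bits.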
`The initialization in Step 1 above works well in many
`applications, however, it is inefficient in applications
`where the alphabet size is large. This initialization pro-
cess is also inefficient if only an unknown subset of the
`input alphabet is expected to occur in the data to be
`compressed. For example, in an application where the
`“natural” alphabet for the input data consists of 16-bit
`symbols, the initial dictionary size has 65,536 entries,
`and the minimal length of an output code is 17 bits. In an
`application where the block of data to be compressed is
`small relative to the alphabet size, it is often unnecessary
`to encode each potential single-character string.
`To overcome these problems, some variations of the
`LZ algorithm employ an empty initial dictionary. When
`a new input character is encountered, the compressor
`outputs a special code, followed by a copy of the new
`character. In this method, the decompressor keeps track
`of the subset of the input alphabet that is actually in use,
`and decoding proceeds as usual. The main problem with
`this process is the high cost of encoding new characters.
`For short files over large alphabets, this overhead cost
`becomes unacceptably high. For example, in FIG. 3
`both a character code 46 and a special code 48, are
`required to indicate to the decompressor engine 30
`(FIG. 1) which new character has occurred. Assuming
`an 8-bit character length and a 12-bit current code
`length, a total of 20 bits must be transmitted to the
`decoder to identify each new character.
`
`Intermediate Dictionary Initialization
`
`To eliminate the bit length and memory problems
stated above, 2^k different special codes c0, c0+1, c0+2,
. . . , c0+(2^k - 1) are used, where 0 <= k <= m. This
assumes the numbers 0, 1, . . . , c0-1 are used for other
`control codes. In this manner, a special code carries k
`bits of information on the new character, and exactly
`m-k additional bits are needed to identify the new char-
`acter. This is shown in FIG. 4 with k=5. The current
`code length for a special code 48 is 12 bits and a partial
character code 49 has a length of 3 bits (i.e., m - k =
8 - 5 = 3). Thus, the ‘cost’ of transmitting a new
`single-character string is reduced from 20 bits to 15 bits.
`When k=0, the method reduces to an empty initializa-
`tion, and when k=m, the system operates as a full al-
`phabet initialization (i.e., each character in the alphabet
`is represented by a codeword).
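The arithmetic of the last two paragraphs can be sketched as follows. This is a toy illustration only; whether the special code carries the high-order or low-order k bits of the character, and the value of c0, are implementation choices not fixed by the description above:

    def encode_new_character(char: int, m: int, k: int, c0: int,
                             current_code_bits: int):
        # k bits of the m-bit character select one of the 2**k special codes
        # c0 .. c0 + 2**k - 1; the remaining m-k bits are sent separately as
        # the "partial character" field of FIG. 4.
        high = char >> (m - k)
        low = char & ((1 << (m - k)) - 1)
        special_code = c0 + high
        cost_in_bits = current_code_bits + (m - k)
        return special_code, low, cost_in_bits

    # FIG. 4 example: m = 8, k = 5, 12-bit current codes give
    # 12 + (8 - 5) = 15 bits per new character instead of 12 + 8 = 20.
    # k = 0 reduces to the empty initialization, k = m to full initialization.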
`Referring to FIG. 5, operation according to the in-
`vention begins with an empty initial dictionary, and
`proceeds generally in accordance with the LZ algo-
`rithm modified as next described. Operation is initial-
`ized at block 50 by setting the next available code
(NAC) to the value c0 + 2^k. In the next step, block 52, a
`null string is set as the current phrase W. Then, at block
`54, the next input character is input as “a”. The step in
`block 56 queries whether the string Wa is already stored
`in the dictionary. If so, block 58 sets W equal to Wa and
`returns to block 54. This process repeats as long as a
`match is found in the dictionary, w