throbber
United States Patent [19]
`Seroussi et al.
`Feb. 14, 1995
`[45] Date of Patent:
`
`[11] Patent Number:
`
`5,389,922
`
`USOOS389922A
`
`[54] COMPRESSION USING SMALL
`DICTIONARIES WITH APPLICATIONS TO
`NETWORK PACKETS
`
`[75]
`
`Inventors: Gadiel Seroussi, Cupertino, Ca1if.;
`Abraham Lempel, Haifa, Israel
`
`[73] Assignee: Hewlett-Packard Company, Palo
`Alto, Calif.
`
`[21] Appl. No.2 46,548
`
`[22] Filed:
`
`Apr. 13, 1993
`
`Int. Cl.5 .............................................. H03M 7/30
`[51]
`
`..... .. 341/51; 341/106
`[52]
`[58] Field of Search ..................... 341/51, 106, 52, 55,
`341/87, 90, 94, 95
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`4,464,650 8/1984 E$astman et al.
`4,558,302 12/1985 Welch ........
`
`.
`
`340/347
`
`. 340/347 4,8l4,7._46 3/1989 Miller et a1.
`
`.. 341/95
`....... 341/51
`5,003,307 3/1991 Whiting et a1.
`5,150,430 9/1992 Chu ................................... 341/51 X
`
`OTHER PUBLICATIONS
`
`Bunton et al., “Practical Dictionary Management for
`Hardware Data Compression, Development of a
`Theme by Ziv and Lempel”, Communications of the
`ACM, 35:95—104, Jan. 1992.
`“On the Complexity of Finite Sequences”, IEEE Trans-
`actions on Information Theory, IT-22:75-81, Jan. 1976.
`“A Universal Algorithm for Sequential Data Compres-
`
`sion”, IEEE Transactions on Information Theory,
`IT—23:337-343, May 1977.
`“Compression of Individual Sequences via Variable
`Rate Coding”, IEEE Transactions on Information The-
`ory, IT—24:530—S36.
`U.S. patent application Ser. No. 07/892,546, filing date
`Jun. 1, 1992, Lempel et al., “Lempe1i—Ziv Compression
`Scheme with Enhanced Adaption”.
`
`Primary Examiner——Sharon D. Logan
`
`[57]
`
`I ABSTRACI‘
`
`The invention is a dictionary initialization scheme
`adaptive to changes in the type and structure of input
`data. The compression ratio is increased by minimizing
`the number of data entries used to represent single char-
`acters in the input data. By using fewer codes than what
`is normally used to represent characters in an array of
`input data, the dictionary can have fewer entries than
`the alphabet size. A further aspect of the invention
`implements a type of nm-length encoding in the LZ
`methodology which exploits the redundant structure
`existing in the compressed stream in the presence of a
`long run. Some of the codewords in the compressed
`stream are deleted but can be recovered at the decom-
`pression site. The foregoing LZE method is used alone,
`or used in combination with other methods to form a
`compression scheme that is especially useful for trans-
`mitting network packets.
`
`23 Claims, 24 Drawing Sheets
`
`INITIALIZE EMPTYkDICT. 50
`NAc-—c0+2
`
`W*-- NULL STRING
`
`
`
`a <—NEXT INPUT CHAR
`
`55
`
`
`
` N0
`
`
`60
`
`N PROCESS H
`
`NEW gHAR ENTER "a"
`
`IN DICT.
`WITH cooa
`NAC
`
`ORACLE 1 103
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 1 of 24
`
`5,389,922
`
`E
`
`E
`
`
`
`MAXIMALMATCH Fig.2
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 2 of 24
`
`5,389,922
`
`48
`
`46
`
`SPECIAL CODE
`
`CHAR
`
` ¢—-————>
` CURRENT CODE
`m
`LENGTH (12 BITS)
`(8 BITS)
`
`Fig.3
`
`48
`
`49
`
`SPECIAL CODE
`
`PARTIAL CHAR
`
`
`CURRENT CODE
`m—k
`LENGTH (12 BITS)
`(:5 BITS)
`
`Fig. 4
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 3 of 24
`
`5,389,922
`
`INITIALIZE EMPTY DICT. 50
`NAC+-CO+2k
`A
`
`W<-- NULL STRING
`
`52
`
`a <—NEXT INPUT CHAR
`
`58
`
`56 @
`
`NO
`
`60
`
`“PROCESS”
`NEW CHAR
`
`
`
`
`
`Cl
`55
`
`
`ENTER "0"
`IN DICT.
`
`WITH CODE
`NAC
`
`
`
`
`67
`
`A
`NO
`OUTPUT CODE or-' w
`
`
`
`ENTER Wa IN DICT.
`WITH CODE NAC
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 4 of 24
`
`5,389,922
`
`
`
`
`
`
`74
`
`76
`
`OUTPUT
`c1 USING
`b BITS
`
`OUTPUT
`02 USING Fig.6
`m—k BITS
`
`
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 5 of 24
`
`5,389,922
`
`CHARACTER 0
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 6 of 24
`
`5,389,922
`
` READ
`
`“:EEE‘I¥’IES
`
`PROCESS
`NEW CHAR
`
`
`
`
`
`CODEC1
`(b BITS)
`
`NO
`
`
` PROCESS
`
`CONTROL
`CODE
`
`NO
`
`PROCESS
`REGULAR
`
`CODE
`
`95
`
`98
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 7 of 24
`
`5,389,922
`
`FROM INPUT
`BUFFER
`
`m—k
`
`100 (FIXED)
`
`102
`
`b
`
`A 9 B
`B—A
`
`36\
`
`108
`
`\__\’.__J
`b—k
`
`(DISCARD)
`
`110
`O
`O
`112 1
`
`1
`
`ENTER SINGLE
`CHAR STRING
`IN DICT.
`
`m BITS
`
`114
`
`Fig.9
`
`OUTPUT
`BUFFER
`
`

`
`U.S. Patent
`
`5,389,922
`
`E32fig2%Ii
`
`Q83E.A2:3E:2.83E
`
`2.mE
`
`
`
`.sEEzH.m».2327:5m.m._.S.._._53V22
`
`muemmbfizmmtm
`
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 9 of 24
`
`5,389,922
`
`116
`
`ENGINE
`
`INPUT
`
`COMPRESSION
`ENGINE
`
`
`
`READ INPUT
`CHARACTERS
`
`118
`
`
`
`°GET INPUT CHARS
`
`°MAINTAIN DICT.
`
`°PRODUCE CODES
`
`122
`
`OUTPUT
`ENGINE
`
`
`
`°DO CODE OUTPUT
`
`
`
`120
`
`Fig.11
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 10 of 24
`
`5,389,922
`
`58
`
`INPUT
`
`%
`
`,e424
`
`\\ ENGINE
`
`GET INPUT CHARS
`
`COMPRESSION -MAINTAIN DICT.
`ENGINE
`-PRODUCE CODES
`
`
`
`TO NAC
`
`-INTERCEPT CODES
`
`°CHECK FOR RUNS
`RUN
`ENHANCEMENT -DISABLE OUTPUT
`ENGINE
`FOR SOME CODES
`
`152
`
`RUN
`ENHANCEMENT
`ACCESS :
`ENGINE HAS
`RUNCQDE
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 11 of 24
`
`5,389,922
`
`CODE C
`
`RUNCODE.—— c
`
`DISABLE
`OUTPUT
`
`140
`
`
`
`RUNCODE
`
`= NULL
`
`?
`
`
`
`144
`
`OUTPUT
`RUNCODE
`
`RUNCODE <——NULL
`
`
`
`146
`
`

`
`U.S. Patént
`
`Feb. 14, 1995
`
`Sheet 12 of 24
`
`5,389,922
`
`DECODER
`INPUT
`
`150
`°DO CODE INPUT
`
`ENGINE 152
`
`CODES
`
`°GET CODES
`
`
`
`
`DECOMPRESSION
`
`ENGINE
`
`DECODER
`OUTPUT
`
`ENGINE
`
`°MAINTAIN DICT.
`
`-PRODUCE OUTPUT
`CHARS
`
`156
`
`-DO CHAR OUTPUT
`
`
`
`Fig.14
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 13 of 24
`
`5,389,922
`
`DECODER
`INPUT
`ENGINE
`
`RUN ENH.
`
`158
`
`‘D0 CODE INPUT
`
`160
`
`°INTERCEPT CODES
`
`°CHECK FOR RUNS
`
`
`
`
`
` DECODER
`
`
`ENGINE
`
`
`°GENERATE
`
`MODIFIED CODE
`STREAM
`
`
`
`DECOMPRESSION
`
` °GET CODES
`
`°MAINTAIN DICT.
`
`
`
`°PRODUCE OUTPUT
`CHARS
`
`ENGINE
`
`
`
`DECODER
`OUTPUT
`ENGINE
`
`°DO CHAR OUTPUT
`
`Fig. 15
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 14 of 24
`
`5,389,922
`
`CODE 0
`
`170
`
`NO
`
`172
`
`YES
`
`SEND 0 TO
`DECOMPRESSION
`ENGINE
`
`RUNCODE‘—-NAC
`
`174
`
`SEND RUNCODE To 175
`DECOMPRESSION
`ENGINE
`
`
`
`RUNCODE+1
`
`RUNCODE._.
`
`178
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 15 of 24
`
`5,389,922
`
`<5.mE
`
`mm~53
`
`Tm~53
`
`mm~53
`
`on~Eo<
`
`hmmon_<
`
`mm_..OH..._O._.
`
`Sooo:ou__o__
`
`
`
` 0.onmnmn+_nxN+oouo<zSum.23
`
`
`
`s_<mEm<._.<n_>23
`
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`5,389,922
`
`
`
`M...hn_EEEEEEEEEEHEMn\s_<m_Em<29ommmmmazoomo%—.\|/I
`
`<m<ommmmmaz
`
`mS.mE.
`
`
`
`
`
`._.Z.:..:2moZ<IZm_I52“:zamxzoNH._.HP<S.u:>Em_
`
`20¢...
`
` <N—.OH:._
`
`
`
`
`
`s_<E.m<53ommmmmazooZOH._.<NHu_<H._.HZHm_._.<HQm:>_mm._.ZH
`
`m?JhHHfiEHE
`
`
`

`
`U.S. Pateilt
`
`Feb. 14, 1995
`
`Sheet 17 of 24
`
`5,389,922
`
`
`
`
`
`
`
`\|.\rn/\|\.r/]l...\r|I/\I|.\/.|/
`
`NlmNNI
`
`00:.
`
`
`
`_.—._....:._._...._.eO...O—.O...OO—._...._.—.
`
`
`
`NllmNNl.®.N\I|L/.|.l/\.||\/I}\||\/l|)\||\(J
`
`—
`
`<2
`mfi
`
`
`
`_._...._._.O_....:O._.OO...OO—O...OO
`
`
`
`
`
`>m<zHm_maoo
`
`
`
`moooEozm:
`
`mtrm
`
`m3._<>
`
`
`
`—..._:NI..®NAN..lNNv
`
`NNVO._.P
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 18 of 24
`
`5,389,922
`
`ZZZZ
`
`0.1
`
`0.2
`
`0.3
`
`0.4
`
`0.5
`
`Fig.18B
`
`4321D.=__=__
`
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 19 of 24
`
`5,389,922
`
`
`
`APPROX.
`92
`
`APPROX.
`R (2,132)
`
`1.0000
`
`0.0000
`
`0.5542
`
`0.7905
`
`0.2006
`
`0.8989
`
`gifg‘
`
`""— 0.1115
`
`0.9755
`
`-"‘
`
`0.0590
`
`0.9843
`
`0.0304
`
`0.9952
`
`0.0154
`
`0.9990
`
`0.0078
`
`1.0000
`
`1.0000
`
` 0.0059
`
`Fig. 19
`
`flflflflflflfllflflNN()1—sCD0)\l
`
`
`M..."‘-0I0mjouCD”“N3'53
`
`"""
`
`|\3-P-O) CA
`l\3f\3U'| (O
`
`"“"
`
`"“"
`N U‘! \l
`
`
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 20 of 24
`
`5,389,922
`
`INPUT'PACKET
`
`200
`
`
`
`ADDRESS
`PROCESSING
`
`
`
`202
`
`208
`
`LZE
`
`COPY
`
`204
`
`Z—HUF
`
`210
`
`
`
`
`
`
`Fig. 20
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 21 of 24
`
`5,389,922
`
`m._.:.._ozm
`
`
`
`m_.:.._a_n_I
`
`m.Exo<n_®m.:
`
`mExo<n_ooom
`
`mmtrmmmmdvn
`
`
`mmtrmm¢m.nmm._
`
`
`
`Exo<n_.0><
`
`¢nNHZm._
`
`
`
`._.m_xO<a.o><
`
`mmnuzfl
`
`Hm.mE
`
`zoHmmmEn_2oo
`
`QOI._.m:2
`
`
`
`ZOmmmmazooXHZD
`
`
`
`.m:mm:mi...m_.._OI>>
`
`
`
`ZOmmmmmzooXHZD
`
`
`
`.93NSm_.:.._m._O_.._>>
`
`
`
`zomzo..<.mN._
`
`
`
`mExo<a._<3QH>HQzH
`
`ZO.u_DI|Noz<m_N._
`
`
`
`mEv_o<n_.._<3n_H>HoZH
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 22 of 24
`
`5,389,922
`
`
`
`om;m3._.202zam._.<mHw_..._ma:ants
`
`flow.om_om_om_o§oN_om_o§
`
`mm.mE
`
`_
`
`We
`
`m.o
`
`Bo
`
`o.o
`
`.m.o
`
`To
`
`nd
`
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 23 of 24
`
`5,389,922
`
`4oo16oo180o2ooo
`
`
` E g “
`
`’.“f’°.“!“-
`ooooO
`
`
`
`

`
`U.S. Patent
`
`Feb. 14, 1995
`
`Sheet 24 of 24
`
`5,389,922
`
`Fig.24
`
`§§
`
`_
`
`
`
` §§ “
`§ §
`
`>.“‘£V"°.“!“.
`oooooO
`
`

`
`~
`
`1
`
`5,389,922
`
`2
`
`.
`
`COMPRESSION USING SMALL DICTIONARIES
`WITH APPLICATIONS TO NETWORK PACKETS
`
`BACKGROUND OF INVENTION
`
`This invention relates generally to compression and
`decompression of digital data and more particularly to
`implementations of lossless compression and decom-
`pression methods and apparatus using a dictionary to
`store compression data, and applications of compres-
`sion/decompression techniques to network packet com-
`mnnications.
`A major class of compression schemes encode multi-
`ple-character strings using binary sequences or “code-
`words” not otherwise used to encode individual charac-
`ters. The strings are composed of an “alphabet,” or
`single-character strings. This alphabet represents the
`smallest unique piece of information the compressor
`processes. Thus, an algorithm which uses eight bits, to
`represent its characters, has 256 unique characters in its
`alphabet. Compression is effective to the degree that the
`multiple-character strings represented in the encoding
`scheme are encountered in a given file of the data
`stream. By analogy with bilingual dictionaries used to
`translate between human languages, the device that
`embodies the mapping between uncompressed code and
`compressed code is commonly referred to as a “dictio-
`nary.”
`Generally, the usefulness of a dictionary-based com-
`pression scheme is dependent on the frequency with
`which the dictionary entries for multiple-character
`strings are used. If a fixed dictionary is optimized for
`one file type it is unlikely to be optimized for another.
`For example, a dictionary which includes a large num-
`ber of character combinations likely to be found in
`newspaper text files, is unlikely to efficiently compress
`data base files, spreadsheet files, bit-mapped graphics
`files, computer-aided design files, et cetera.
`Adaptive compression schemes are known in which
`the dictionary used to compress given input data is
`created while that input data is being compressed.
`Codewords representing every single character possible
`in the uncompressed input data are put into the dictio-
`nary. Additional entries are added to the dictionary as
`multiple-character strings are encountered in the file.
`The additional dictionary entries are used to encode
`subsequent occurrences of
`the multiple-character
`strings. For example, matching of current input patterns
`is attempted only against phrases currently residing in
`the dictionary. After each failed match, a new phrase is
`added to the dictionary. The new phrase is formed by
`extending the matched phrase by one symbol (e.g., the
`input symbol that “breaks” the match). Compression is
`effected to the extent that the mirltiple-character strings
`occurring most frequently in the file are encountered as
`the dictionary is developing.
`During decompression, the dictionary is built in a like
`manner. Thus, when a codeword for a character string
`is encountered in the compressed file, the dictionary
`contains the necessary information to reconstruct the
`corresponding character string. Widely-used compres-
`sion algorithms that use a dictionary to store compres-
`sion and decompression information are the first and
`second methods of Lempel and Ziv, called LZ1 and
`LZ2 respectively. The Lempel-Ziv (LZ) algorithm was
`originally described by Lempel and Ziv in “On the
`Complexity of Finite Sequences” IEEE Transactions on
`Information Theory, IT-22:75-81, January 1976; and in
`
`5
`
`l0
`
`15
`
`20
`
`25
`
`30
`
`35
`
`45
`
`50
`
`55
`
`60
`
`65
`
`“A Universal Algorithm for Sequential Data Compres-
`sion” IEEE Transactions on Information Theory, IT-
`23:337—343, May 1977; and “Compression of Individual
`Sequences via Variable Rate Coding” IEEE Transac-
`tions on Information Theory, IT-24:530-536. Dictionary
`usage is also disclosed in U.S. Pat. No. 4,464,650 to
`Eastman et al., and various improvements in the algo-
`rithms are disclosed in U.S. Pat. Nos. 4,558,302 to
`Welch, and 4,814,746 to Miller et al.
`When working on a practical implementation, the
`amount of memory available for compression/decom-
`pression is finite. Therefore, the number of entries in the
`dictionary is finite and the length of the codewords used
`to encode the entries is bounded. Typically, the length
`of codewords varies between 12 and 16 bits. When the
`input data sequence is sufficiently long, the dictionary
`will eventually “fill up.” Several courses of action are
`possible at this point. For example, the dictionary can
`be frozen in its current state, and used for the remainder
`of the input sequence. In a second approach, the dictio-
`nary is reset and a new dictionary created from scratch.
`In a third approach, the dictionary is frozen for some
`time, until the compression ratio deteriorates, then the
`dictionary is reset. Alternate strategies for dictionary
`reset are described in U.S. application Ser. No.
`07/892,546, filed Jun. 1, 1992 entitled “Lempel-Ziv
`Compression Scheme with Enhanced Adaptation”, as is
`hereby incorporated by reference herein, and by Bun-
`ton, S. et al., in “Practical Dictionary Management for
`Hardware Data Compression” Communications of the
`ACM, 5:95-104, January 1992.
`In the LZW process, the dictionary must be initial-
`ized for the single-character strings that are used to
`build the compression dictionary. These characters are
`assigned unique codes within the compression/decom-
`pression system. This implies that the number of bits in
`any additional output code sent out by the encoder (e.g.,
`codes that represent multiple character strings) are con-
`trolled by the number of single-character strings. For
`example, the shortest bit length for a multiple character
`string is determined by the number of single-character
`strings. The number of bits in subsequent codes repre-
`senting multiple characters, increase in length by one bit
`every time the number of entries in the dictionary reach
`the next power of 2. Using more bits to represent single-
`character codewords proportionally decreases
`the
`overall compression performance.
`The initialization of single input characters as de-
`scribed above is inefficient for input data with a large
`alphabet size or when only an unknown subset of the
`alphabet is expected to occur in the input data. For
`example, when the “natural” alphabet for the input data
`consists of 16-bit symbols, the initial dictionary size
`would have 65,536 entries. Therefore,
`the minimal
`length of any output code generated, in addition to the
`characters from the “natural” alphabet (e.g., codes rep-
`resenting multi-charaeter strings) is at least 17 bits. Al-
`ternatively, if the block of input data (i.e., the data to be
`compressed) is small relative to the alphabet size, there
`is an unnecessarily high overhead in time, memory
`space, and compression ratio that comes from initial-
`izing,
`storing, and encoding,
`respectively,
`single-
`character strings from the input data.
`To overcome these problems, some variants of the
`LZ algorithm employ an empty
`dictionary. When
`a new input character is encountered, the compressor
`outputs a special code, followed by a copy of the new
`
`

`
`5,389,922
`
`3
`character. This allows the decompressor to keep track
`of a subset of the input. alphabet that is actually in use,
`allowing decoding to proceed as usual. The main prob-
`lem with this strategy is the high cost of encoding new
`characters. For short files over large alphabets, this
`overhead cost might become unacceptably high. For
`instance, with 8-bit symbols and 12-bit output codes, 20
`bits are required to let the decoder know a-new charac-
`ter has occurred. In addition, often there is redundancy
`within the encoded character strings output by the LZ
`algorithm. For example, a string of the same input char-
`P
`‘ acters i.e., a “run”
`roduces a se uence of encoded
`strings with a predictable and redundant structure. This
`redundancy is not presently leveraged to further in-
`crease the compression ratio of standard compression
`algorithms.
`Accordingly, a need remains for a data compression
`initialization process that is adaptable to different types
`of input data and different data structures to increase
`the data compression ratio and to reduce the amount of
`memory required in a dictionary based compression/-
`decompression system.
`SUMMARY OF THE INVENTION
`
`It is, therefore, an object of the invention to improve
`the compression and decompression of digital data in a
`dictionary-based system.
`Another object of the invention is to increase the data
`compression ratio for
`compression/decompression
`schemes by reducing the number of bits used in repre-
`senting encoded character strings.
`Another object of the invention is to reduce the over-
`head of initializing a dictionary in a dictionary-based
`compression and decompression system.
`A further object of the invention is to more efficiently
`compress digital data which either occurs in small files
`or which is represented by a subset of a large single-
`character alphabet.
`A further object of the invention is to recompress
`encoded character strings that represent
`input data
`character runs to further increase the compression ratio
`of a compression/decompression system.
`The invention is a dictionary based initialization
`scheme that is adaptive to changes in the type and struc-
`ture of input data. The initialization scheme increases
`the compression ratio by minimizing the number of data
`entries used in a dictionary based compression/decom-
`pression system to represent single-character data
`strings. The reduced number of data entries reduces the
`bit-length of codewords in a compressed data string.
`Reducing the codeword bit-length in the compressed
`data string increases the overall compression ratio.
`The invention uses a variable number of special
`codes. The total number of special codes is, typically,
`selected to be less than the total number of character
`codes used for representing single-characters in the
`input data array. Each special code carries part of the
`information on a new character. Additional bits, that
`further identify a new character, are then transmitted in
`a separate partial character. This process reduces the
`‘cost’ (i.e., bandwidth and storage space) of transmitting
`a new symbol. The process is adaptable anywhere be-
`tween a no initialization process (e.g., empty initial
`dictionary) to a full alphabet initialization where each
`unique character in the alphabet is assigned an entry in
`the dictionary.
`The number of special codes is adaptable to the appli-
`cation presently being performed. Thus, the number of
`
`4
`special codes is predetermined for specific types of data
`to maximize the compression ratio. This method allows
`the dictionary to have fewer entries than the alphabet
`size. Thus, compression is possible with very small dic-
`tionaries, that require very little memory. This is partic-
`ularly useful in applications where the input data blocks
`are short, and each block has to be compressed indepen-
`dently. The initialization scheme also provides com-
`pression of data sources with large alphabet sizes (e.g.,
`16-bit symbols), while maintaining moderate size dictio-
`naries.
`
`A further aspect of the invention implements a type
`of run-length encoding in the LZ methodology (LZE).
`In conventional LZ2 data compression, a rim, which is
`a string of repeated occurrences of the same character
`in the input data, is encoded as a series of codes. Each
`successive code is built upon the previous code, fol-
`lowed by a code for the remainder or tail of the run.
`The decompressor then receives and decodes each of
`these codes in sequential order. The system sends a
`shortened sequence by transmitting a single code (rep-
`resenting most of the character rim) and the proceeding
`and tail codes.
`The foregoing LZE method is used alone, or in com-
`bination with other methods to form a compression
`scheme especially useful in transmitting network pack-
`ets. In the combined system, the LZE compression
`scheme is applied to an input data stream in parallel
`with one or more other data compression methods. For
`example, a Huffman variable-length coding scheme or
`an uncompressed transmission scheme. The output of
`the method providing the best compression ratio is then
`used for data transmission and storage. A high compres-
`sion ratio is obtained for real network packet data con-
`taining packets with a wide distribution of lengths. The
`high compression ratio is consistently maintained across
`the entire range of packet lengths even in data having a
`large proportion of short packets.
`The foregoing and other objects, features and advan-
`tages of the invention will become more readily appar-
`ent from the following detailed description of a pre-
`ferred embodiment of the invention which proceeds
`with reference to the accompanying drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a generalized block diagram of a compres-
`sion/decompression system in which the enhancements
`of the present invention are implemented.
`FIG. 2 is a diagram illustrating the basic principle of
`LZ compression.
`FIG. 3 is a diagram illustrating the transmission of a
`new character in conventional LZ compression with an
`empty
`dictionary.
`FIG. 4 is a diagram illustrating the transmission of a
`new character with enhanced LZ compression using
`intermediate dictionary initialization according to the
`invention.
`FIG. 5 is a flow chart of the basic LZE intermediate
`initialization process.
`FIG. 6 is a flow chart of the “new character” encod-
`ing subprocess of FIG. 5.
`FIG. 7 is a functional block diagram of circuitry for
`implementing the encoding subprocess of FIG. 6 in the
`compression subsystem of FIG. 1.
`FIG. 8 is a flow chart of a “new character” decoding
`subprocess for decoding codes produced by the sub-
`process of FIG. 6.
`
`5
`
`l0
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`

`
`5,389,922
`
`5
`FIG. 9 is a functional block diagram of circuitry for
`implementing the decoding subprocess of FIG. 8 in the
`decompression subsystem of FIG. 1.
`FIG. 10 is a table showing the compression results for
`different initialization schemes according to the present
`invention.
`
`FIG. 11 is a block diagram of a conventional data
`compression system.
`FIG. 12 is a block diagram of circuitry for imple-
`menting an enhanced Lempel-Ziv run-length encoding
`scheme in the compression subsystem shown in FIG. 1.
`FIG. 13 is a flow chart showing a process for imple-
`menting an enhanced Lempel-Ziv rim-length encoding
`method according to the invention.
`FIG. 14 is a block diagram of a conventional data
`decompression system.
`FIG. 15 is a block diagram of circuitry for imple-
`menting enhanced Lempel-Ziv run-length decoding in
`the decompression subsystem of FIG. 1 according to
`the invention.
`
`FIG. 16 is a flow chart of the enhanced Lempel-Ziv
`run-length decoding method according to the inven-
`tion.
`
`FIGS. 17A and 17B are a graphical depiction of the
`method for performing intermediate initialization and
`run length encoding.
`FIG. 18A is a table illustrating optimal Huffman
`coded assignments.
`FIG. 18B is a plot of compression ratios for Huffman
`encoded data.
`
`FIG. 19 is a table showing probability distributions
`for different bit lengths.
`FIG. 20 is a block diagram of a parallel “best wins”
`compression system according to the invention.
`FIG. 21 is a table showing compression results for
`various network packet files.
`FIG. 22 is a graph of compression ratio vs. time using
`the system of FIG. 20 on real network packet data.
`FIG. 23 is a graph showing the distribution of the
`data in FIG. 22 by packet length.
`FIG. 24 is a graph of the compression ratio of the data
`in FIG. 22 as a function of packet length.
`DETAILED DESCRIPTION
`
`The general arrangement and operation of Lempel-
`Ziv compression/decompression systems are well-
`known and are, therefore, described only in general
`terms with reference to FIGS. 1 and 2. The system 22 in
`FIG. 1 includes a compression subsystem 24 and a de-
`compression subsystem 26 interconnected by a digital
`data communications (or storage) charmel 28. In prac-
`tice, both terminals of a system will include both com-
`pression and decompression subsystems and the hard-
`ware is typically designed to operate interchangeably to
`compress/send or to receive/decompress data.
`Each subsystem includes, in addition to conventional
`communications (or storage) circuitry (not shown), a
`compression engine 30 which implements the basic
`Lempel-Ziv compression algorithm, memory 32 imple-
`menting one or more dictionaries in which data entries
`encoding the character string data are stored, and sup-
`porting circuits implementing the enhancements further
`described below. The supporting circuits include the
`intermediate initialization encoder 34 and counterpart
`decoder 36, which are further detailed in FIGS. 7 and 9,
`and the encoder nm enhancement engine 38 and coun-
`terpart decoder rtm enhancement engine 40 which are
`shown in FIGS. 12 and 15.
`
`6
`FIG. 2 illustrates the Lempel-Ziv (LZ) algorithm, for
`lossless compression of digital data (i.e., the original
`data is completely recoverable from its compressed
`image). The LZ method matches a current pattern in an
`input data stream to patterns occurring previously. For
`example, a current pattern 42 (ABC) in input stream 46
`is the same as a pattern 44 (ABC) that was previously
`transmitted. The compression subsystem 24 (FIG. 1),
`substitutes a description (i.e., codeword) of the maximal
`match for the matched input symbols (ABC). The de-
`compression subsystem 26 (FIG. 1) can then recon-
`struct the original symbols from the match codeword,
`and from previously decompressed data segments. In
`redundant data sources, the descriptions or the code-
`words describing a multiple character match tend to be
`shorter than the matched patterns, thus achieving data
`compression.
`The main feature of LZ2 is incremental parsing. The
`input data sequence is parsed into phrases, which are
`collected in a dictionary. Maximal matching of current
`input patterns is attempted, as described above, only
`against phrases in the dictionary. After each match, a
`new phrase is formed by extending the matched phrase
`with the input symbol that “breaks” the match. This and
`other variants of the algorithm, are asymptotically opti-
`mal, (i.e., achieve, in the limit, the best compression
`ratio theoretically possible). The algorithm is also
`highly adaptive, learning the statistical characteristics
`of the input data “on the fly”. In LZ2, this “knowledge”
`is stored in the dictionary, whose entries parse the input
`data sequence.
`The compressor implementation in LZ2 can be infor-
`mally described as follows:
`1. A dictionary is initialized with all single-letter
`words that exist in the input alphabet (e.g. the 256 one-
`byte strings) and a distinct index codeword is then as-
`signed to each single-letter word.
`2. A current phrase is initialized with the first charac-
`ter from an input data stream.
`3. Characters from the input data stream are continu-
`ously read, extending the current phrase, as long as a
`matching phrase exists in the dictionary.
`4. The process is stopped when the current phrase is
`of the form wa, where “a” is the last character read
`from the input data stream, W is a phrase in the dictio-
`nary, while Wa does not match an entry in the phrase
`dictionary.
`5. The codeword for W is output.
`6. Wa is added to the dictionary, assigning it the next
`available codeword.
`
`7. The current phrase is set to “a” and the process
`returned to Step 3.
`This implementation is known as LZW.
`In the decompressor subsystem 26 in FIG. 1, a similar
`phrase dictionary is built. The decompressor is first
`initialized as in Step 1 above and new phrases are then
`added to the dictionary as the data is being decom-
`pressed. When the decompressor receives a code for a
`phrase W followed by the code for a phrase starting
`with the character “a”, it adds the phrase Wa to the
`dictionary and assigns it the next available code. Thus,
`the decompressor can reconstruct the dictionary built
`by the compressor, without the latter having to send the
`dictionary along with the compressed data.
`In a practical implementation, the amount of memory
`available to the encoder (and similarly to the decoder) is
`limited. Therefore, the number of phrases in the dictio-
`nary is also limited, and the output codes are of bounded
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`65
`
`

`
`7
`length. Typically, the upper bound on the code length is
`between 12 and 16 bits. When the input data sequence is
`sufficiently long,
`the dictionary will eventually “fill
`up”. At this point, the LZW dictionary is either “fro-
`zen” or “reinitialized”.
`
`Step 1 in the above outlined LZW compression pro-
`cedure calls for the initialization of the dictionary with
`all single-character strings. Let in denote the size, in
`bits, of the single characters from the input data string
`(e.g., m= 8, or one byte, in the most common case). The
`phrases in the dictionary are assigned codes co,
`co+1, co-i-2, .
`.
`.
`.
`, c030 (2"1—l), for some initial non-
`negative number co. This implies that the first code sent
`out by the encoder (i.e., the first code representing a
`multiple character string) must be at least m+l bits
`long. In practical implementations, it is customary to
`use output codes of length m+1 at the beginning of the
`compression process, and subsequently increase the
`length of the output codes by one bit every time the
`number of entries in the dictionary reaches the next
`power of 2. Hence, the length of the output codes vary
`between m+l and b, where 2" is the maximum size of
`the dictionary, and b> =m+l. For simplicity,
`it is
`assumed that the maximum dictionary size is a power of
`2. This is the case in most practical implementations,
`although it is not a necessary requirement. Clearly, the
`length of the output codes directly impact the compres-
`sion ratio. Specifically, the shorter the output codes, the
`better the compression ratio.
`The initialization in Step 1 above works well in many
`applications, however, it is inefficient in applications
`where the alphabet size is large. This initialization pro-
`cess is also inefticient if only an unknown subset of the
`input alphabet is expected to occur in the data to be
`compressed. For example, in an application where the
`“natural” alphabet for the input data consists of 16-bit
`symbols, the initial dictionary size has 65,536 entries,
`and the minimal length of an output code is 17 bits. In an
`application where the block of data to be compressed is
`small relative to the alphabet size, it is often unnecessary
`to encode each potential single-character string.
`To overcome these problems, some variations of the
`LZ algorithm employ an empty initial dictionary. When
`a new input character is encountered, the compressor
`outputs a special code, followed by a copy of the new
`character. In this method, the decompressor keeps track
`of the subset of the input alphabet that is actually in use,
`and decoding proceeds as usual. The main problem with
`this process is the high cost of encoding new characters.
`For short files over large alphabets, this overhead cost
`becomes unacceptably high. For example, in FIG. 3
`both a character code 46 and a special code 48, are
`required to indicate to the decompressor engine 30
`(FIG. 1) which new character has occurred. Assuming
`an 8-bit character length and a 12-bit current code
`length, a total of 20 bits must be transmitted to the
`decoder to identify each new character.
`
`Intermediate Dictionary Initialization
`
`To eliminate the bit length and memory problems
`stated above, 2" different special codes co, co+ 1, co+2,
`. . . , co+(2’<—1) are used where 0<=k<=m. This
`assumes the numbers 0, 1, . . . , co—1 are used for other
`control codes. In this manner, a special code carries k
`bits of information on the new character, and exactly
`m-k additional bits are needed to identify the new char-
`acter. This is shown in FIG. 4 with k=5. The current
`code length for a special code 48 is 12 bits and a partial
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`65
`
`5,389,922
`
`8
`character code 49 has a length of 3 bits (i.e., m-
`k=8—5=3). Thus,
`the ‘cost’ of transmitting a new
`single-character string is reduced from 20 bits to 15 bits.
`When k=0, the method reduces to an empty initializa-
`tion, and when k=m, the system operates as a full al-
`phabet initialization (i.e., each character in the alphabet
`is represented by a codeword).
`Referring to FIG. 5, operation according to the in-
`vention begins with an empty initial dictionary, and
`proceeds generally in accordance with the LZ algo-
`rithm modified as next described. Operation is initial-
`ized at block 50 by setting the next available code
`(NAC) to the value co+2k. In the next step, block 52, a
`null string is set as the current phrase W. Then, at block
`54, the next input character is input as “a”. The step in
`block 56 queries whether the string Wa is already stored
`in the dictionary. If so, block 58 sets W equal to Wa and
`returns to block 54. This process repeats as long as a
`match is found in the dictionary, w

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket