`Franaszek et al.
`
`USOOS 87003 6A
`[11] Patent Number:
`[45] Date of Patent:
`
`5,870,036
`*Feb. 9, 1999
`
`[54] ADAPTIVE MULTIPLE DICTIONARY DATA
`COMPRESSION
`
`341/51
`2/1995 Serussi et a1. .
`5,389,922
`5,467,087 11/1995 Chu ......................................... .. 341/51
`
`[75] Inventors: Peter Anthony Franaszek, Mt. Kisco;
`John Timothy Robinson, Y0rkt0Wn_
`gliliitsltsgl‘fglfl glgyslus Thomas’ Whlte
`
`’
`'
`'
`[73] Assignee: International Business Machines
`Corporation, Armonk, NY.
`
`[ * ]
`
`Notice:
`
`The terminal 17 months of this patent has
`been disclaimed-
`
`[21] Appl' NO‘: 393’967
`[22] Filed:
`Feb. 24, 1995
`H03M 7/00
`[51] Int C16
`_______ __ 341/51
`[52]
`[58] Field of Search ................................ .. 341/51, 65, 59,
`341/106, 107
`
`[56]
`
`_
`References Clted
`U.S. PATENT DOCUMENTS
`
`Primary Examiner—Brian K. Young
`Attorney, Agent, or Firm—Richard M. LudWin; Heslin &
`
`Rothenberg, PC.
`
`[57]
`
`ABSTRACT
`
`A system and method for compressing and decompressing
`data using a plurality of data compression mechanisms.
`Representative samples of each block of data are tested to
`select an appropriate one of the data compression mecha
`nisms to apply to the block. The block is then compressed
`using the selected one of the mechanisms and the com
`pressed block is provided With an identi?er of the selected
`mechanism For decompression, the identi?er is examined to
`select an appropriate one of the data decompression mecha
`nisms to apply to the block. The block is then decompressed
`using the selected one of the mechanisms.
`
`5,204,756
`
`4/1993 Chevion et al. .
`
`17 Claims, 9 Drawing Sheets
`
`205
`\ DATA BLETCKS
`TYPE
`TYPE
`
`EEO
`\
`
`CUMPRESSED
`BLEICKS
`835V CMD
`CMD
`
`0 0 0
`
`—>
`
`CUMPRESSEIR
`
`—>
`
`O I O
`
`\F
`210
`TYPEI DATA TYPE
`(TEXT, IMAGE, ETC.)
`(UPTIETNAL)
`
`CUMPRESSIEIN
`METHUD TABLE
`
`O 0 o
`
`RUN-LENGTH
`240/
`
`L21
`
`O O .
`
`COMPRESSED
`DATA BLEICKS
`
`840 /
`Q
`
`CMD
`
`CMD J 270x ,
`
`' ' ‘
`
`—’
`
`DATA
`DE-CIJMPRESSEIR
`
`—>
`
`v
`230
`CMD= CDMPRESSIUN
`METHEID DESCRIPTIUN
`
`DICTIUNARY
`BLETCKS
`850
`
`UNCUMPRESSED
`BLIIICKS
`
`' ' '
`
`280/
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 001
`
`
`
`U.S. Patent
`
`Feb. 9, 1999
`
`Sheet 1 of9
`
`5,870,036
`
`
`
`mml/ wow J
`
`X ow.
`
`mmwwmmazmolwm
`
`mmmwmmninu
`
`mmwwmmmzmu
`Qmwwmm
`
`SEQ
`
`x0515
`
`Kg
`
`InEUUZD
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 002
`
`
`
`Feb. 9, 1999
`
`Sheet2 0f9
`
`5,870,036
`
`>M¢ZDHHQHQ
`
`1525123.9%
`
`mmwmmmmzmuzs
`
`863mEEEmmm
`
`Qmmmmmmzmu
`
`
`
`MXQDJm¢H¢m
`
`..H--
`
`US. Patent mammmmazmuamm..l.C.UI‘|‘l0
`
`
`
`ZDHHQHMQMMQQDIHMZ409mduquPXMHV
`
`
`ZDwammazmo.mzoZDHmwmmmzmumm>H¢H¢Q.mm>H
`.mgmqhQDIHMZ.
`
`
`
`mmmmmmmzmo
`
`wquJmwquJm{qu
`
`mom
`
`
`
`on.mmwwmmmzmuoco.{Eqm_
`
`0mmDam
`
`A4¢ZUHHQDV
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 003
`
`
`
`US. Patent
`
`Feb. 9, 1999
`
`Sheet3 0f9
`
`5,870,036
`
`¢H¢Q
`
`memmmmzmu
`
`wwwmazmo
`
`Hum92¢
`
`QED
`
`Vanjm\omm
`
`on
`
`mmm+_
`
`HDZDav
`
`Amwmmmzmo
`
`
`
`mmwwmmmzmu<H¢Q
`
`Fougmm
`
`mem
`
`QDIHmz
`
`QFQHQJ
`
`HXMH
`
`Iu¢m
`
`QDIHmz
`
`QHQHQQ
`
`mZHZMMHum
`
`Jzo
`
`mmwwmmmzmoz:
`
`
`
`xume{HQQ
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 004
`
`
`
`U.S. Patent
`
`Feb. 9, 1999
`
`Sheet 4 of9
`
`5,870,036
`
`FIG. 4A
`
`DATA
`TYPE AVAILABLE
`'2
`
`SET CML=
`SET CML=
`404\ CML-TABLE
`409
`DEFAULT-CML
`<TYPE<B>>
`\
`/
`407 ____sET E Tl] ETRsT____—|
`CML ENTRY
`
`METHUD (E)
`DICTIEINARY
`BASED '?
`
`414
`
`DATA TYPE
`
`419\DESFEATULDT_—LDI—SEI=ST
`T
`SET M: METHDD<E>J
`REMUVEi E FRDM CML"—
`T
`EUR EACH D
`424
`\__ IN D—LIST:
`
`421
`
`'
`
`SET IHJST:
`D-LIST-TABLE
`<TYPE<B>>
`
`ADD (MD)
`TU CML
`
`427j
`
`WAS E
`LAST ENTRY IN
`CML '?
`
`YES
`
`TEST
`
`SET E Tl]
`NEXT ENTRY 431
`IN CML
`-/
`l
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 005
`
`
`
`U.S. Patent
`
`Feb. 9, 1999
`
`Sheet 5 of9
`
`5,870,036
`
`FIG. 48
`
`T3“
`
`434
`SET E TU FIRST
`CML. ENTRY,- 1:1 _/
`
`437
`
`NE]
`
`METHUD (E)
`DICTIONARY
`BASED '2
`
`YES
`
`439
`
`/
`
`SET M=METHE|D (E);
`D=DICTIE1NARY (E)
`
`444
`/
`CUMPRESS B001, I02)
`USING METHEID (E)
`INTEI K BYTES
`
`CUMF’RESS B(k31, be)
`USING METHUD M &
`DICTIUNARY D
`INTI] K BYTES
`
`K 441
`
`447 \
`
`SET
`CRTT<I>=K
`
`NE]
`
`451 \
`SET E TEI
`NEXT ENTRY IN
`CML,‘ I=I+1
`
`WAS E
`LAST ENTRY IN
`CML '?
`
`YES
`
`FIND Q SUCH
`THAT CRTT<Q>
`IS MIN(CRTT)
`
`454
`
`457
`
`IS
`CRTT<Q> <
`THRESHULD
`7
`
`YES
`
`CEIMPRESS
`
`459
`DEI NUT
`\
`CEIMPRESS B
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 006
`
`
`
`U.S. Patent
`
`Feb. 9, 1999
`
`Sheet 6 of9
`
`5,870,036
`
`CEIMPRESS
`
`461
`SET E TU Q’TH
`ENTRY IN CML _/
`
`NI]
`
`470
`\
`CUMPRESS B INTII B’
`USING METHUD <E>
`
`METHUD (E)
`DICTIEINARY
`BASED 'e
`
`464
`
`YES
`
`467
`f
`CDMPRESS B INTU B’
`USING METHIJD <E> &
`DICTIUNARY (E)
`
`_1l___
`
`STURE E IN [IMD
`AREA IIIF B’ /475
`
`I
`
`OUTPUT B’
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 007
`
`
`
`U.S. Patent
`
`Feb. 9, 1999
`
`Sheet 7 0f 9
`
`5,870,036
`
`0mm
`
`m .9;
`
`
`
` mmwwmmazmulmm qkqm K ohm
`
`
`
`\lomm WQHW
`
`mmwwmmmzmo / 0mm
`
`.lllvGmIm 3km
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 008
`
`
`
`U.S. Patent
`
`Feb. 9, 1999
`
`Sheet 8 of9
`
`5,870,036
`
`210\/UNCDMPRES/\SED BLEICKS
`T
`STlIIRE (220
`
`P
`RETRIIEVE/270
`
`‘
`DATA
`CUMPRESSUR =
`
`"
`
`DATA
`DEI-CIJMPRESSDR
`
`COMPRESSION
`METHEID TABLE 111cm
`
`y
`
`r
`
`ARITHMETIC
`
`N
`
`/.
`600
`/- RUN-LENGTH N
`
`BLEICK ID
`__
`M’ZEDR'
`
`mm 601
`~I
`/
`DIRECTORY 60g STATIC DI
`
`>
`
`@T.
`
`Y
`
`230
`W‘
`
`COMPRESSED
`BLOCKS
`
`\
`
`,
`
`j
`630
`620
`/
`MEMEIRY
`DICTIONARY
`BLEICKS
`
`640
`
`DICTF-‘lS‘b
`
`Y
`
`603/ SUB-BLOCK
`0 D1cT.=cATENATE Y
`6 14/ SUB-BLOCKS
`/‘
`240
`
`/
`240a
`
`2250
`,1
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 009
`
`
`
`Feb. 9, 1999
`
`Sheet 9 0f 9
`
`5,870,036
`
`_wmmou¢ZDQZQ
`
`
`
`mxomjm_mvajm0menu>m<zmfibammmwmmmzmu
`
`_>
`
`
`
`_->M<ZDEQEQEQHM
`
`>w=uzmz>JZDQ<mmmmwwmmazmuz:
`
`zEmmmmmzmo
`
`>m¢ZDHHoE
`
`ZEMMMMQZDQ
`
`
`
`
`
`[ENFZDQ
`
`US. Patent
`a25:5quTEmZEEm_a15%:le
`
`
`
`IHUZMITZDMhe:W7355:38%222%flflmaalzmwzmwumama___“é:
`zEwmmmazBIan
`
`BEZEE,‘E5555mmmmmmzHam
`IImxomjm<F¢QN.
`
`
`
`
`ll_EEG:4EEG:mn_.J738$222$
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 010
`
`
`
`1
`ADAPTIVE MULTIPLE DICTIONARY DATA
`COMPRESSION
`
`I. BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`This invention relates in general to data processing sys
`tems and, more speci?cally, to a system and method for
`compressing data.
`2. Background of the Invention
`In general, data compression involves taking a stream of
`symbols and transforming them into codes. If the compres
`sion is effective, the resulting stream of codes Will be smaller
`than the original symbol stream. The decision to output a
`certain code for a certain symbol or set of symbols is based
`on a model. The model is simply a collection of data and
`rules used to process input symbols and determine Which
`code(s) to output. A computer program may use the model
`to accurately de?ne the probabilities for each symbol in
`order to produce an appropriate code based on those prob
`abilities.
`Data compression techniques often need to be lossless
`(Without incidences of error or loss of data). Exceptions to
`this include, for example, certain applications pertaining to
`graphic images or digitiZed voice. Lossless compression
`consists of those techniques guaranteed to generate an exact
`duplicate of the input data stream after a compress/expand
`cycle. This is the type of compression often used When
`storing database records, spreadsheets, or Word processing
`?les. In these applications, the loss of even a single bit can
`be catastrophic.
`Lossless data compression is generally implemented
`using one of tWo different types of modeling: statistical or
`dictionary-based. Statistical modeling reads in and encodes
`a single symbol at a time using the probability of that
`character’s appearance. Statistical models achieve compres
`sion by encoding symbols into bit strings that use feWer bits
`than the original symbols. The quality of the compression
`goes up or doWn depending on hoW good the program is at
`developing a model. The model has to predict the correct
`probabilities for the symbols. The farther these probabilities
`are from a uniform distribution, the more compression that
`can be achieved.
`In dictionary-based modeling, the coding problem is
`reduced in signi?cance, making the model supremely impor
`tant. The dictionary-based compression processes use a
`completely different method to compress data. This family
`of processes does not encode single symbols as variable
`length bit strings; it encodes variable-length strings of
`symbols as single pointers. The pointers form an index to a
`phrase dictionary. If the pointers are smaller than the phrases
`they replace, compression occurs. In many respects,
`dictionary-based compression is easier for people to under
`stand. In every day life, people use phone numbers, DeWey
`Decimal numbers, and postal codes to encode larger strings
`of text. This is essentially What a dictionary-based encoder
`does.
`In general, dictionary-based compression replaces
`phrases With pointers. If the number of bits in the pointer is
`less than the number of bits in the phrase, compression Will
`occur. HoWever, the methods for building and maintaining a
`dictionary are varied.
`A static dictionary is built up before compression occurs,
`and it does not change While the data is being compressed.
`For example, a database containing all motor-vehicle reg
`istrations for a state could use a static dictionary With only
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`55
`
`60
`
`65
`
`5,870,036
`
`2
`a feW thousand entries that concentrate on Words such as
`“Ford,” “Jones,” and “1994.” Once this dictionary is
`compiled, it is used by both the encoder and the decoder as
`required.
`There are advantages and disadvantages to static dictio
`naries. Nevertheless, dictionary-based compression schemes
`using static dictionaries are mostly ad hoc, implementation
`dependent, and not general purpose.
`Many of the Well-knoWn dictionary-based processes are
`adaptive. Instead of having a completely de?ned dictionary
`When compression begins, adaptive schemes start out either
`With no dictionary or With a default baseline dictionary. As
`compression proceeds, the processes add neW phrases to be
`used later as encoded tokens.
`For a further discussion of data compression in general,
`please refer to The Data Compression Book, by Mark
`Nelson, @ 1992 by M&T Publishing, Inc., Which is hereby
`incorporated by reference herein.
`As mentioned, the history of past symbols of a sequence
`often provides valuable information about the behavior of
`the sequence in the future. Various universal techniques have
`been devised to use this information for data compression or
`prediction. For example, the Lempel-Ziv (“LZ”) compres
`sion process, Which is discussed Within Compression of
`Individual Sequences by Variable Rate Coding, by J. Ziv and
`A. Lempel, IEEE Trans. Inform. Theory, IT-24: 530—536,
`1978 (Which is incorporated by reference herein), uses the
`past symbols to build up a dictionary of phrases and com
`presses the string using this dictionary. As Lempel and Ziv
`have shoWn, this process is universally optimal in that the
`compression ratio converges to the entropy for all stationary
`ergodic (of or related to a process in Which every sequence
`or siZeable sample is equally representative of the Whole)
`sequences. Thus, given an arbitrarily long sequence, such
`compression operates as Well as if the distribution of the
`sequence Was knoWn in advance.
`The Lempel-Ziv compression method has achieved great
`popularity because of its simplicity and ease of implemen
`tation (actually, Lempel-Ziv is often used to denote any
`dictionary based universal coding scheme, as a result, the
`standard method described herein is only one of this large
`class). It asymptotically achieves the entropy limit for data
`compression. HoWever, the rate of convergence may be sloW
`and there is scope for improvement for short sequences. In
`particular, at the end of each phrase, the process returns to
`the root of the phrase tree, so that contextual information is
`lost. One approach to this problem Was suggested by
`Plotnik, Weinberger and Ziv for ?nite state sources, as
`described Within Upper Bounds on the Probability of
`Sequences Emitted by Finite-State Sources and on the
`Redundancy of the Lempel-ZivAlgorithm, by E. Plotnik, M.
`J. Weinberger and J. Ziv, IEEE Trans. Inform. Theory,
`IT-38(1): 66—72, January 1992, Which is incorporated by
`reference herein. Their idea Was to maintain separate LZ like
`trees for each source state (or estimated state) of a ?nite state
`model of the source. Plotnik, Weinberger and Ziv shoWed
`that this procedure is asymptotically optimal.
`U.S. patent application Ser. No. 08/253,047, ?led on Jun.
`2, 1994 and assigned to the same assignee as the present
`invention describes a compression system Wherein contex
`tual information is implemented in conjunction With a
`dictionary-based compression process Within a data process
`ing system. Before encoding a next phrase Within an input
`string of data, the compression system, through dynamic
`programming, derives the statistically best set of dictionaries
`for utiliZation in encoding the next phrase. The selection of
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 011
`
`
`
`5,870,036
`
`3
`this set of dictionaries is dependent upon each dictionary’s
`past utilization Within the process. Thus, the choice of
`dictionaries is directly related to each of their performance
`on compression, and not on a presumed model for the
`sequence as is required in a compression technique utilizing
`a state machine description of the source, Which is available
`to both the encoder and decoder. The selected set of dictio
`naries is implemented by the data processing system to
`encode the next phrase of data. The context of the phrase is
`used to select a particular dictionary Within the selected set
`for the encoding process, Which computes a pointer to the
`phrase Within the dictionary that corresponds to the next
`phrase.
`The above-technique has the property that the compressor
`and de-compressor can each compute from the preceding
`string, the dictionary Which Will be used for the next phrase.
`This has the advantage that the identity of the dictionary
`need not be included. HoWever, the chosen dictionary may
`not in fact be the best one to use, as the decision of Which
`to use is based on estimates. Further, in some cases
`improved compression could result from using other types
`of compression techniques.
`
`II. SUMMARY OF THE INVENTION
`
`It is an object of this invention to dynamically compress
`data blocks by using the best of a given set of methods, and
`for dictionary-based compression, by using the best of a
`given set of dictionaries.
`In accordance With one aspect of the present invention,
`there is provided a system and method for compressing data
`using a plurality of data compression mechanisms. Repre
`sentative samples of each block of data are tested to select
`an appropriate one of the data compression mechanisms to
`apply to the block. The block is then compressed using the
`selected one of the mechanisms and the compressed block is
`provided With an identi?er of the selected mechanism.
`In accordance With another aspect of the present invention
`there is provided a system and method for decompressing
`data. using a plurality of data decompression mechanisms.
`Each block of data includes a coding identi?er Which is
`indicative of the method or mechanism used to compress the
`block. The coding identi?er is examined to select an appro
`priate one of the data decompression mechanisms to apply
`to the block. The block is then decompressed using the
`selected one of the mechanisms.
`
`III. BRIEF DESCRIPTION OF THE DRAWING
`
`FIG. 1 shoWs a system structure suitable for use in
`conjunction With the present invention;
`FIG. 2 shoWs a data compressor and data de-compressor
`270 in accordance With the principles of the present inven
`tion;
`FIG. 3 shoWs the overall structure of the data compressor
`of FIG. 2;
`FIGS. 4A, 4B and 4C are a How chart of the operation of
`the data compressor of FIG. 3;
`FIG. 5 shoWs the overall structure of the data decompres
`sor of FIG. 2;
`FIG. 6 shoWs a storage system incorporating compression
`& decompression in accordance With the principles of the
`present invention; and
`FIG. 7 is a block diagram of an encoder/decoder in a
`compression system, in accordance With the principles of the
`present invention.
`Like reference numerals appearing in multiple ?gures
`represent like elements.
`
`4
`IV. DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENT
`
`Asystem structure suitable for use in conjunction With the
`present invention is shoWn in FIG. 1. The system includes a
`CPU 5 Which accesses a ?rst memory 10 containing uncom
`pressed data blocks 15. Within the same information pro
`cessing system as the CPU 5 or Within another “remote”
`system, there is a second memory 20. The memories 10, 20
`can be semiconductor memories, magnetic storage (such as
`a disk or tape), optical storage or any other suitable type of
`information storage media. The data blocks are transferred
`betWeen the ?rst and second memories.
`To increase the number of data blocks that can be stored
`in the second memory 20 given a ?xed second memory siZe,
`data blocks 25 may be stored in a compressed format in the
`second memory. For this purpose there is a compressor 30
`that compresses data blocks as they are transferred to the
`second memory, and a de-compressor 40 that de-compresses
`data blocks as they are transferred to the ?rst memory.
`In the present description, the Word “coding” is used to
`refer generically to any of encoding (compressing) or decod
`ing (decompressing). The Word “CODEC” refers to an
`apparatus that performs both encoding and decoding.
`FIG. 2 shoWs a data compressor 220 and data
`de-compressor 270 Which operate in accordance With the
`principles of the present invention. The data compressor 220
`and de-compressor 270 and the associated logic take the
`places of, respectively, the compressor 30 and decompressor
`40 of FIG. 1. In this embodiment, the uncompressed data
`blocks 210 that can optionally contain type information 205.
`The type information can be, for examples, image data
`encoded in a given format, source code for a given pro
`gramming language, etc. The data blocks 210 are input to the
`data compressor 220.
`The data compressor 220 and data de-compressor 270
`share a compression method table 240 and a memory 250
`containing a number of dictionary blocks 1,2,n. It should be
`understood that the table 240 and the memory 250 can each
`be a single shared entity or, alternatively, can be duplicated
`such that one of each is located locally at the data compres
`sor 220 and the data decompressor 270. The compression
`method table 240 can include, for example, of an array of
`pointers to compression routines or device address of
`compression/decompression hardWare components. Each
`method entry in the table 240 includes an identi?er 240a
`indicated Whether the method is non-dictionary-based or
`dictionary-based.
`Dictionary blocks are identi?ed by an index specifying
`the offset of the block in the dictionary block memory. As
`explained more fully beloW, the compressor 220 selects a
`compression method. For dictionary-based methods the
`compressor 220 also selects a dictionary block, to compress
`the data. The compressor outputs compressed data blocks
`230, With an index (M) 232 identifying the selected com
`pression method, and for dictionary-based methods, dictio
`nary block identi?er (D), encoded in a compression method
`description (CMD) area 235 in the compressed block.
`The use of a dictionary for data compression and decom
`pression is Well knoWn in the art and described, for example,
`in US. patent application Ser. No. 08/253,047, ?led on Jun.
`2, 1994 and in US. Pat. No. 4,814,746, both of Which are
`incorporated by reference herein as if printed in full beloW.
`Compressed data blocks 230, With the compression
`method identi?er M and for dictionary-based methods dic
`tionary block identi?er D encoded in the CMD area 235 are
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`55
`
`60
`
`65
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 012
`
`
`
`5,870,036
`
`5
`input to the de-compressor 270. The de-compressor 270
`de-compresses the block using the speci?ed method found in
`the compression method table 240 (using the compression
`method identi?er as an index), and for dictionary-based
`methods, speci?ed dictionary block found in the dictionary
`block memory 250, and outputs uncompressed data blocks
`280.
`FIG. 3 shoWs the overall structure of the data compressor
`220. Each component is explained more fully beloW. Given
`an uncompressed block 205 as input, ?rst, in block 310, a
`compression method list (CML) is determined. The CML is
`an array of entries of the form M or (M,D), Where the latter
`form is used for dictionary-based methods. M is an index
`into the compression method table 240 previously described
`With respect to FIG. 2, and D is a dictionary block identi?er
`for a dictionary block contained in the dictionary block
`memory 250, also previously described With respect to FIG.
`2.
`
`Next, in block 320, each method or (method,dictionary)
`pair is tested on a sample taken from the uncompressed data
`block 205, and the resulting sample compression is saved. In
`block 330, the best method or (method,dictionary) pair is
`found (i.e., the one giving the most compression). The best
`method is readily determined by comparing the siZe of the
`uncompressed same to the siZe of the compressed sample for
`each method. In a preferred embodiment, the siZe of the
`sample block as compressed by each of a number of methods
`is stored in a table and the best method is determined by
`examining the table to ?nd the smallest compressed block.
`If the best sample compression does not satisfy a threshold
`condition (eg 30% compression as compared to the uncom
`pressed sample), then it is decided at this point not to
`compress the block as indicated by 335. In this case the
`block Will be stored in uncompressed format. OtherWise, in
`block 340, the block is compressed using the best method
`(and if applicable dictionary) found in block 330, the method
`or (method,dictionary) pair is encoded in the CMD area 235
`of the block, and the compressed data block 230 results as
`output.
`For dictionary compression methods, one Way to test a
`representative sample is to determine the compression ratios
`of a given number (eg 30) phrases. It should be understood
`that all of the phrases could be tested. For symbol based
`compression (such as Huffman or Arithmetic coding) the
`compression of a given number (eg 150) or percentage (eg
`10% to 20%) of the symbols can be tested.
`The operation of the data compressor 220 is shoWn in
`FIGS. 4A, 4B, and 4C.
`In step 401, if a data type (eg text, image, etc.) for a
`given uncompressed block B is available, in step 404 the
`Compression Method List (CML) is set to a list of com
`pression methods that have been preselected for that data
`type. OtherWise, if no data type is available, in step 407 the
`CML is set to a default list of compression methods.
`Next, the folloWing steps are done for each dictionary
`based method M in the CML. This involves checking each
`entry E in the CML. First, in step 409, E is set to the ?rst
`CML entry. Next, in step 411, it is determined if the method
`of E is dictionary-based. If it is not, step 429 is executed,
`Where it is checked if E Was the last CML entry. If it Was not,
`in step 431 E is set to the next CML entry, and the method
`returns to step 411. If step 429 determines that E Was the last
`CML entry, the loop is exited, and the method proceeds to
`step 434.
`If the test in step 411 determines that the method of E Was
`dictionary-based, the folloWing steps are performed, With
`control passing to step 429 at the completion of these steps.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`55
`
`60
`
`65
`
`6
`In step 414, it is determined if a data type is available (ie
`the block includes a “data type” entry in the type ?eld 205),
`If a data type is available, in steps 417, 421, 424, and 427,
`the CML is expanded by replacing E With the list (M,D1),
`(M,D2), .
`.
`.
`, (M,Dj), Where (D1, .
`.
`.
`, Dj) is a list of
`dictionary block identi?ers that have been preselected for
`the data type When using compression method M.
`OtherWise, if no data type is available), steps 419, 421, 424,
`and 427, replace E With the list (M,D1‘), (M,D2‘), .
`.
`.
`,
`(M,Dk‘), Where (D1‘, .
`.
`. , Dk‘) is a default list of dictionary
`block identi?ers for compression method M.
`Next, steps 437, 439, 441, 444, and 447, are done for each
`entry E of the form M or (MD) in the CML (Where the latter
`entry is for dictionary-based methods).
`In step 434, E is set to the ?rst entry in the CML, and I,
`Which Will be used as the entry number, is set to 1. At the
`completion of steps 437—447, in step 449 it is checked if E
`Was the last CML entry. If it Was not, E is set to the next
`CML entry and I is incremented in step 451, With control
`returning to block 437. OtherWise the loop is exited, With
`control passing to step 454.
`Steps 437—447 comprise the body of the loop. Steps 437,
`439, 441, and 444, apply compression method M (and
`dictionary D if M is dictionary-based) on a sample from B,
`B(b1,b2) consisting of bytes b1 through b2 consecutively in
`B, resulting in a compressed sample of length K bytes. Next,
`in step 447, the resulting compressed sample length K is
`saved as entry CRTT(I) in compression ratio test table
`CRTT, Where this table is an array of integers. Note: the
`above compression tests are independent, and although the
`?gure shoWs a sequential loop, implementations are possible
`in Which all these tests are performed in parallel.
`Next, in step 454, the smallest compressed length in the
`table CRTT is found, say entry CRTT(Q). In step 457, the
`smallest compressed length is compared against a threshold.
`If all uncompressed input blocks are a given ?xed siZe, the
`threshold can be constant; otherWise, the threshold can be
`computed as a function of the input block siZe. For example,
`the threshold can be computed as a given fraction of the siZe
`of the block currently being compressed.
`If step 457 determines that CRTT(Q) is not suf?ciently
`small, in step 459 the data block B is not compressed.
`OtherWise, if step 457 determines that CRTT(Q) is suf?
`ciently small, in step 461, E is set equal to the Q’th entry in
`the CML, Where E is of the form M or (M,D). In steps 464,
`467, and 470, B is compressed using M, and if M is
`dictionary-based, dictionary D, into a block B‘. Next, in step
`457, E is recorded in the CMD (compression method
`description) pre?x area of the compressed block (B‘), and the
`resulting block B‘ is output.
`FIG. 5 shoWs the structure of the decompressor 270.
`Given a compressed block 230 as input, ?rst, in block 510
`the compression method used to compress the block is found
`(by decoding the CMD ?eld), along With the dictionary
`identi?er of the dictionary used if the method is dictionary
`based. Next, in block 520, the compression method and
`dictionary (if applicable) is used to decompress the block,
`resulting in the uncompressed data block 280 as output.
`FIG. 6 illustrates an application of the principles of the
`present invention to a storage system in Which there is a
`random access memory 620 into Which arbitrary blocks of
`data are stored and retrieved. In order to increase the total
`number of bytes of uncompressed data blocks that can be
`stored before the memory becomes full, uncompressed
`blocks 210 are compressed by a data compressor 220 before
`being stored as compressed blocks 230, and de-compressed
`by a data de-compressor 270 When they are retrieved.
`
`Veritas Techs. LLC
`Exhibit 1004
`Page 013
`
`
`
`5,870,036
`
`7
`In this example, part of the memory 620 is used to store
`dictionary blocks 250. The dictionary blocks include some
`number of ?xed, static blocks, determined before hand, and
`also of some number of blocks that are created and subse
`quently deleted in a fashion subsequently to be described.
`For simplicity, assume that all uncompressed data blocks
`are a given ?xed siZe, say 4096 bytes. Each such block can
`be considered to consist of 8 512-byte sub-blocks, for
`example. In some cases it may happen that using the ?rst
`such sub-block as a dictionary may yield better compression
`than any of the ?xed static dictionaries. This method is
`shoWn as 602 in the compression method table 240.
`Alternatively, take the ?rst 64 bytes of each sub-block, and
`concatenating these into a 512-byte dictionary, could be the
`best dictionary for some blocks having certain phrase pat
`terns. This method is indicated by 603 in the ?gure. The
`other three methods indicated in the ?gure are arithmetic
`coding 600, run-length coding 601, and LZ1 using one of the
`?xed set of static dictionaries 602.
`The data compressor operates as previously described.
`HoWever, if the best compression method is found to be that
`using the ?rst sub-block 602 or that using a concatenation of
`multiple sub-blocks 603, the dictionary is stored as one of
`the dictionary blocks 250 at this time, the dictionary direc
`tory 640 is updated accordingly using standard techniques,
`and the identity of the dictionary block as given by the
`dictionary directory is stored in the CMD. In any case, the
`compressed data block 230 is stored in the memory 620, and
`the mapping of block identi?ers to memory addresses 630 is
`updated using standard techniques. Conversely, the data
`de-compressor operates as previously described, hoWever
`When a compressed block is retrieved, if it is found that
`either the ?rst sub-block or concatenation of multiple sub
`blocks methods Was used, then if the compressed data block
`is also being removed from the memory, the dictionary is
`also removed from the set of dictionary blocks at this time.
`FIG. 7 is a block diagram of an encoder/decoder in a
`compression system, in accordance With the principles of the
`present invention. The system of FIG. 7 includes a micro
`processor (Up) 708 Which controls the encoding and decod
`ing of the data blocks under control of program code 712
`stored in a read only memory 710. The program code
`implements the methods of FIGS. 4A—4C and the decom
`pression of FIG. 5. The system also includes CODECs 702,
`704, 706 Which perform both the encoding and decoding
`functions as determined by control data sent from the
`microprocessor 708. The CODECs can be embodied in
`hardWare logic circuits or as program code executed by
`microprocessor 708. The compression method table 240 is
`stored in a read only memory 714 Which is addressed and
`read by the microprocessor. The block address ID map 630,
`the dictionary directory 640, the dictionary blocks 250 and
`the compressed blocks 230 are stored in random access
`memories 716, 718, 720 (Which can be multiple memories or
`address spaces in a single memory).
`The present system and method can be modi?ed and
`extended in a number of Ways. For example multiple dic
`tionary blocks can be used as a logical single dictionary. In
`such an embodiment, a set of dictionary blocks, say D1, D2,
`.
`.
`.
`, Dn, can be used by a dictionary-based compression
`method, say IJempel-Ziv 1 (LZ1), as if the ordered set is
`logically a single dictionary. In this case D1, D2, .
`.
`.
`, Dn
`are considered to be catenated into a single logical dictio
`nary D, and pointers in the compressed output refer to
`substrings in one of the n actual dictionary blocks.
`As another modi?cation, the set of dictionary blocks used
`by the compressor and de-compressor need not be ?xed to
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`50
`
`a O
`
`65
`
`8
`a predetermined set. For example, this set could be initially
`empty, and dictionaries generated by a dictionary-based
`compression method could be added to the set as subsequent
`blocks are compressed, until a maximum number of dictio
`nary blocks is reached.
`Furthermore, if there are currently no compressed blocks
`requiring a given dictionary block for de-compression (this
`can be determined by examining the CMD ?elds), the given
`dictionary block can be removed from the current set of
`dictionary blocks.
`Among other possibilities, the set of dictionary blocks
`could be managed using an LRU type policy: in order to add
`a neW dictionary block, the least-recently-used dictionary
`block (not currently required for de-compression) could be
`selected to be replaced.
`Although the present invention and its advantages have
`been described in detail, it should be understood that various
`changes, substitutions and alterations can be made herein
`Without departing from the spirit and scope of the invention
`as de?ned by the app