Fallon et al.

(10) Patent No.:     US 6,597,812 B1
(45) Date of Patent: Jul. 22, 2003

(54) SYSTEM AND METHOD FOR LOSSLESS
     DATA COMPRESSION AND
     DECOMPRESSION

(75) Inventors: James J. Fallon, Armonk, NY (US);
                Steven L. Bo, Bayside, NY (US)

(73) Assignee: Realtime Data, LLC, New York, NY (US)

( * ) Notice: Subject to any disclaimer, the term of this
              patent is extended or adjusted under 35
              U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/579,922

(22) Filed: May 26, 2000

             Related U.S. Application Data
(60) Provisional application No. 60/136,561, filed on May 28,
     1999.

(51) Int. Cl.7 ................................... G06K 9/36
(52) U.S. Cl. ....................... 382/232; 382/245; 341/51
(58) Field of Search .................... 382/232-251; 341/51,
                                 79; 348/420-424; 375/240-241

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,379,036 A  *           Storer ................... 341/51
     5,883,975 A  *   2/1999  Franaszek et al. ........ 382/232
     5,955,970 A  *           Narita et al.
     ......... .. *           Heath .................... 341/87
     6,195,024 B1 *   2/2001  Fallon ................... 341/51
     6,439,902 B2 *  12/2002  Heath .................... 341/87

     * cited by examiner

Primary Examiner—Jingge Wu
(74) Attorney, Agent, or Firm—F. Chau & Associates, LLP;
     Frank V. DeRosa, Esq.

(57)                      ABSTRACT

Systems and methods for providing lossless data compres-
sion and decompression are disclosed which exploit various
characteristics of run-length encoding, parametric dictionary
encoding, and bit packing to comprise an encoding/decoding
process having an efficiency that is suitable for use in
real-time lossless data compression and decompression
applications. In one aspect, a method for compressing input
data comprising a plurality of data blocks comprises the
steps of: detecting if the input data comprises a run-length
sequence of data blocks; outputting an encoded run-length
sequence, if a run-length sequence of data blocks is detected;
maintaining a dictionary comprising a plurality of code
words, wherein each code word in the dictionary is associ-
ated with a unique data block string; building a data block
string from at least one data block in the input data that is not
part of a run-length sequence; searching for a code word in
the dictionary having a unique data block string associated
therewith that matches the built data block string; and
outputting the code word representing the built data block
string.

30 Claims, 6 Drawing Sheets

[Front-page figure: excerpt of FIG. 1 showing the input data stream
entering the run-length encoder within the encoder to produce the
encoded output.]
`
`
`
U.S. Patent          Jul. 22, 2003          Sheet 1 of 6          US 6,597,812 B1

[FIGURE 1 (drawing sheet 1 of 6): block diagram of the data compression
system 10 described below, showing the input data stream buffered by the
input buffer and processed by the encoder (run-length encoder, dictionary
encoder, code word dictionary, hash table and bit packing) to produce the
encoded output data stream.]
`
U.S. Patent          Jul. 22, 2003          Sheet 2 of 6          US 6,597,812 B1

[FIGURE 2A (drawing sheet 2 of 6): first part of the compression flow
diagram. Recoverable blocks: Initialize Dictionary And Hash Table (200);
Initialize Pstring To Empty (201); Are There Input Bytes To Process? (202);
if not, Output Code Corresponding To Pstring (226) and Output Code Denoting
End Of Input Data (227); otherwise Read Next Input Byte And Store In C (203);
Check Next Consecutive Input Bytes (204); Are There At Least s Consecutive
Matching Input Bytes For Run Length?; Is Pstring Empty?; Output Code For
Pstring; Output Run Length Sequence; Initialize Pstring To Empty (208).]
`
`
`
U.S. Patent          Jul. 22, 2003          Sheet 3 of 6          US 6,597,812 B1

[FIGURE 2B (drawing sheet 3 of 6): continuation of the compression flow
diagram. Recoverable blocks: Generate Pstring+C (210); Search Dictionary For
Pstring+C (211); Is Pstring+C In Dictionary? (212); Store Dictionary Index Of
Matching String In Mcode and Set Pstring = Pstring+C (213, 214); Output Code
For Pstring (215); Create Dictionary Entry For Pstring+C (216); Would
Addition Of Entry To Dictionary Exceed Threshold?; Reset Dictionary To
Initial State (220); Reset Hash Table; Output Code Word Indicating Dictionary
Initialization (222); Search Dictionary To Find Entry For Pstring (224);
Store Dictionary Index Of Matching Entry In Mcode; Add New Entry To End Of
Dictionary; Update Hash Table; Set Pstring = C.]
`
`
`
U.S. Patent          Jul. 22, 2003          Sheet 4 of 6          US 6,597,812 B1

[FIGURE 3 (drawing sheet 4 of 6): block diagram of a system for providing
lossless data decompression according to an embodiment of the invention
(input buffer for the encoded data stream, run-length decoder, dictionary
decoder and decoder dictionary producing the decoded output).]
`
U.S. Patent          Jul. 22, 2003          Sheet 5 of 6          US 6,597,812 B1

[FIGURE 4A (drawing sheet 5 of 6): first part of the decompression flow
diagram. Recoverable blocks: Initialize Dictionary; Initialize Pstring And
Cstring To Empty; Read Code Word From Encoded Input Stream And Store It In
Ccode; If Ccode = 2, End Decoding Process; If Ccode = 1, Process Next
Successive Run Length Encoded Bytes In Encoded Input Stream and Output
Decoded Characters Corresponding To Run Length Encoded Sequence; If
Ccode = 0, Return To Step 400; Lookup Cstring (Dictionary Entry For Ccode)
And Output It (408).]
`
`
`
U.S. Patent          Jul. 22, 2003          Sheet 6 of 6          US 6,597,812 B1

[FIGURE 4B (drawing sheet 6 of 6): continuation of the decompression flow
diagram. Recoverable blocks: Set Pcode = Ccode (409); Read Next Code From
Encoded Input Stream And Store It In Ccode (410); Is Ccode A Control Code?
(411); If Ccode = 2, End Decoding Process (412); If Ccode = 1, Process Next
Successive Run Length Encoded Bytes In Encoded Input Stream and Output
Decoded Characters Corresponding To Run Length Encoded Sequence; If
Ccode = 0, Return To Step 400; Is Ccode In Dictionary?; Lookup Cstring
(Dictionary Entry For Ccode) And Output It; Take First Character From
Cstring And Store It In C; Add Pstring+C To Dictionary (417, 418, 419);
Store First Character Of Cstring In C; Add Pstring+C To Dictionary; Output
Pstring+C (420, 421, 422).]
`
`
`
`SYSTEM AND METHOD FOR LOSSLESS
`DATA COMPRESSION AND
`DECOMPRESSION
`
`CROSS-REFERENCE TO RELATED
`APPLICATION
`
`This application is based on provisional application U.S.
`Application Ser. No. 60/136,561 filed on May 28, 1999.
`
`BACKGROUND
`
`1. Technical Field
`
`The present invention relates generally to data compres-
sion and decompression and, more particularly, to systems
`and methods for providing lossless data compression and
`decompression using a combination of dictionary and run
`length encoding.
`2. Description of Related Art
`Information may be represented in a variety of manners.
Discrete information such as text and numbers is easily
`represented in digital data. This type of data representation
`is known as symbolic digital data. Symbolic digital data is
`thus an absolute representation of data such as a letter,
`figure, character, mark, machine code, or drawing.
`Continuous information such as speech, music, audio,
`images and video frequently exists in the natural world as
`analog information. As is well-known to those skilled in the
`art, recent advances in very large scale integration (VLSI)
`digital computer technology have enabled both discrete and
`analog information to be represented with digital data.
`Continuous information represented as digital data is often
`referred to as diffuse data. Diffuse digital data is thus a
`representation of data that is of low information density and
`is typically not easily recognizable to humans in its native
`form.
`
`There are many advantages associated with digital data
`representation. For instance, digital data is more readily
`processed, stored, and transmitted due to its inherently high
`noise immunity. In addition, the inclusion of redundancy in
`digital data representation enables error detection and/or
`correction. Error detection and/or correction capabilities are
`dependent upon the amount and type of data redundancy,
`available error detection and correction processing, and
`extent of data corruption.
`One outcome of digital data representation is the continu-
`ing need for increased capacity in data processing, storage,
`retrieval and transmittal. This is especially true for diffuse
`data where continuing increases in fidelity and resolution
`create exponentially greater quantities of data. Within the
`current art, data compression is widely used to reduce the
`amount of data required to process, transmit, store and/or
`retrieve a given quantity of information. In general, there are
`two types of data compression techniques that may be
`utilized either separately or jointly to encode and decode
`data: lossy and lossless data compression.
`Lossy data compression techniques provide for an inexact
`representation of the original uncompressed data such that
`the decoded (or reconstructed) data differs from the original
`unencoded/uncompressed data. Lossy data compression is
`also known as irreversible or noisy compression. Negent-
`ropy is defined as the quantity of information in a given set
`of data. Thus, one obvious advantage of lossy data com-
`pression is that the compression ratios can be larger than that
`dictated by the negentropy limit, all at
`the expense of
`information content. Many lossy data compression tech-
`niques seek to exploit various traits within the human senses
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`to eliminate otherwise imperceptible data. For example,
`lossy data compression of visual imagery might seek to
`delete information content in excess of the display resolution
`or contrast ratio of the target display device.
`On the other hand, lossless data compression techniques
`provide an exact representation of the original uncom-
`pressed data. Simply stated, the decoded (or reconstructed)
`data is identical to the original unencoded/uncompressed
`data. Lossless data compression is also known as reversible
`or noiseless compression. Thus, lossless data compression
`has, as its current limit, a minimum representation defined
`by the negentropy of a given data set.
`It is well known within the current art that data compres-
`sion provides several unique benefits. First, data compres-
`sion can reduce the time to transmit data by more efficiently
`utilizing low bandwidth data links. Second, data compres-
`sion economizes on data storage and allows more informa-
`tion to be stored for a fixed memory size by representing
`information more efficiently.
`A rich and highly diverse set of lossless data compression
`and decompression algorithms exist within the current art.
These range from the simplest “ad hoc” approaches to highly
`sophisticated formalized techniques that span the sciences of
`information theory, statistics, and artificial intelligence. One
`fundamental problem with almost all modern approaches is
the compression ratio versus the encoding and decoding
`speed achieved. As previously stated, the current theoretical
`limit for data compression is the entropy limit of the data set
`to be encoded. However, in practice, many factors actually
`limit the compression ratio achieved. Most modern com-
`pression algorithms are highly content dependent. Content
`dependency exceeds the actual statistics of individual ele-
`ments and often includes a variety of other factors including
`their spatial location within the data set.
`Within the current art there also presently exists a strong
`inverse relationship between achieving the maximum
`(current) theoretical compression ratio, referred to as “algo-
`rithmic effectiveness”, and requisite processing time. For a
`given single algorithm the “effectiveness” over a broad class
`of data sets including text, graphics, databases, and execut-
`able object code is highly dependent upon the processing
`effort applied. Given a baseline data set, processor operating
`speed and target architecture, along with its associated
supporting memory and peripheral set, “algorithmic effi-
`ciency” is defined herein as the time required to achieve a
`given compression ratio. Algorithmic efficiency assumes
`that a given algorithm is implemented in an optimum object
`code representation executing from the optimum places in
`memory. This is virtually never achieved in practice due to
`limitations within modern optimizing software compilers. In
`addition, an optimum algorithmic implementation for a
`given input data set may not be optimum for a different data
`set. Much work remains in developing a comprehensive set
`of metrics for measuring data compression algorithmic
performance; however, for present purposes the previously
`defined terms of algorithmic effectiveness and efficiency
`should suffice.
`
`Of the most widely utilized compression techniques,
`arithmetic coding possesses the highest degree of algorith-
`mic effectiveness but, as expected, is the slowest to execute.
`This is followed in turn by dictionary compression, Huffman
`coding, and run-length coding techniques with respectively
`decreasing execution times. What is not apparent from these
algorithms, and what is also one major deficiency within the
current art,
`is knowledge of their algorithmic efficiency.
More specifically, given a compression ratio that is within
`the effectiveness of multiple algorithms, the question arises
`as to their corresponding efficiency on various data sets.
`SUMMARY OF THE INVENTION
`
`if a run-
`
`The present invention is directed to systems and methods
`for providing lossless data compression and decompression.
`The present
`invention exploits various characteristics of
`run-length encoding, parametric dictionary encoding, and
`bit packing to comprise an encoding/decoding process hav-
`ing an efficiency that is suitable for use in real-time lossless
`data compression and decompression applications.
`In one aspect of the present
`invention, a method for
`compressing input data comprising a plurality of data blocks
`comprises the steps of:
`detecting if the input data comprises a run-length
`sequence of data blocks;
`outputting an encoded run-length sequence,
`length sequence of data blocks is detected;
`maintaining a dictionary comprising a plurality of code
`words, wherein each code word in the dictionary is
`associated with a unique data block string;
`building a data block string from at least one data block
`in the input data that
`is not part of a run-length
`sequence;
`searching for a code word in the dictionary having a
`unique data block string associated therewith that
`matches the built data block string; and
`outputting the code word representing the built data block
`string.
`In another aspect of the present invention, the dictionary
`is dynamically maintained and updated during the encoding
`process by generating a new code word corresponding to a
`built data block string, if the built data block string does not
`match a unique data block string in the dictionary, and then
`adding the new code word in the dictionary.
`In yet another aspect of the present invention, the dictio-
`nary is initialized during the encoding process if the number
`of code words (e.g., dictionary indices) in the dictionary
`exceeds a predetermined threshold. When the dictionary is
`initialized, a code word is output in the encoded data stream
`to indicate that the dictionary has been initialized at that
`point
`in the encoding process. An initialization process
`further comprises resetting the dictionary to only include
`each possible code word corresponding to a unique data
`block string comprising a single data block. By way of
`example, if each data block comprises a byte of data, there
`will be 256 possible code words for a data block string
`comprising a single byte. In this instance, the dictionary
`reset to its initial state will comprise 256 entries.
`In another aspect of the present invention, the dictionary
`further comprises a plurality of control code words, wherein
`a control code word is designated to represent a dictionary
`initialization, a run-length encoded sequence, and the end of
`the input data (or completion of the encoding process).
`These control words are used in the decoding process to
`re-create the input data.
`In yet another aspect of the present invention, a bit-
`packing process is employed to pack the bits of successive
`output code words representing encoded run-length
`sequences and data block strings.
`In another aspect of the present invention, a method for
`decompressing an encoded data stream comprising a plu-
`rality of code words, which is generated using the encoding
`method, comprises the steps of:
`maintaining a dictionary comprising a plurality of code
words utilized to generate the encoded data stream,
`wherein the code words in the dictionary comprise
`control code words and code words that are each
`
`associated with a unique data block string;
`decoding and outputting a run-length sequence of data
`blocks associated with an input code word of the
`encoded data stream, if the input code word is a control
`code word in the dictionary that indicates an encoded
`run-length sequence;
`outputting a unique data block string in the dictionary that
`is associated with an input code word of the encoded
`data stream, if the input code word is found in the
`dictionary; and
`if the input code word is not found in the dictionary,
`building a new data block string comprising (1) the
`unique data block string associated with a previous
code word found in the dictionary and (2) the first
`data block of the unique data block string, adding the
`new string to the dictionary, and outputting the new
`string.
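
By way of illustration, the following minimal Python sketch implements the
dictionary portion of the decoding steps listed above; handling of the
control code words for dictionary reset and run-length sequences is omitted,
and the function and variable names are illustrative assumptions rather than
part of the disclosed method.

    def dictionary_decode(codes, dictionary):
        # `dictionary` is assumed to map code words (integers) to byte
        # strings and to be seeded exactly as on the encoding side (one
        # entry per possible single data block).  Control code words are
        # not handled in this sketch.
        out = bytearray()
        next_code = max(dictionary) + 1   # next free dictionary index
        prev = None                       # previously decoded data block string
        for code in codes:
            if code in dictionary:
                entry = dictionary[code]
            else:
                # Code word not yet in the dictionary: rebuild the string
                # from the previous string plus its first data block.
                entry = prev + prev[:1]
            out += entry
            if prev is not None:
                # Mirror the encoder: previous string plus the first data
                # block of the current string becomes the next entry.
                dictionary[next_code] = prev + entry[:1]
                next_code += 1
            prev = entry
        return bytes(out)

    # Example (assumed seeding: single-byte strings at indices 3-258):
    seed = {3 + b: bytes([b]) for b in range(256)}
    assert dictionary_decode([68, 69, 259, 261], seed) == b"ABABABA"
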
`These and other aspects, features and advantages of the
`present invention will become apparent from the following
`detailed description of preferred embodiments, which is to
`be read in connection with the accompanying drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram of a system for providing
`lossless data compression according to an embodiment of
`the present invention;
`FIGS. 2a and 2b comprise a flow diagram of a method for
`providing lossless data compression according to one aspect
`of the present invention;
`FIG. 3 is a block diagram of a system for providing
`lossless data decompression according to an embodiment of
`the present invention; and
`FIGS. 4A and 4B comprise a flow diagram of a method for
`providing lossless data decompression according to one
`aspect of the present invention.
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`
`The present invention is directed to systems and methods
`for providing lossless data compression and decompression.
`It is to be understood that the present invention may be
`implemented in various forms of hardware, software,
`firmware, or a combination thereof. In particular, the present
`invention may be implemented in hardware comprising
`general purpose microprocessors, digital signal processors,
`and/or dedicated finite state machines. Preferably,
`the
`present invention is implemented as an application program,
`tangibly embodied on one or more data storage mediums,
`which is executable on any machine, device or platform
`comprising suitable architecture. It is to be further under-
`stood that, because the present invention is preferably imple-
`mented as software, the actual system configurations and
`process flow illustrated in the accompanying Figures may
`differ depending upon the manner in which the invention is
`programmed. Given the teachings herein, one of ordinary
`skill in the related art will be able to contemplate these and
`similar implementations or configurations of the present
`invention.
`
`Data Compression
`Referring now to FIG. 1, a block diagram illustrates a
`system 10 for providing lossless data compression according
`to an embodiment of the present invention. In general, the
data compression system 10 comprises an input buffer 11 for
`temporarily buffering an input data stream and an encoder 12
`for compressing the input data stream. It is to be understood
`that the compressed data stream output from the encoder
`may,
`for example, be stored in a storage medium for
`subsequent retrieval and decoded using a decompression
`method described below, or transmitted over a local or
`global computer network (for purposes of increased band-
`width transmission) and decompressed at a desired location.
`It is to be further understood that the input buffer 11 is an
`optional component that may be employed, for example, in
`real-time compression applications where the rate of com-
`pression of the encoder 12 is slower than the bandwidth of
`the input data stream.
`In general, the encoder 12 employs a unique combination
`of compression techniques preferably including run-length
`encoding and hash table dictionary encoding to compress an
`input data stream, as well as bit-packing to increase the final
`compression ratio. More specifically, the encoder 12 com-
`prises a run-length encoder 13 and dictionary encoder 14,
`both of which utilize a code word dictionary 15 to output one
`or more “code words” representing a “character string”
`identified by the respective encoder 13, 14 in the input data
`stream. It is to be understood that the term “character” as
`
`used herein refers to an input byte of data that can take on
`any one of 256 values, and the term “string” as used herein
`refers to a grouping of one or more characters (bytes).
`Furthermore, as described in further detail below,
`in a
`preferred embodiment, a “code word” for a given character
`string comprises a dictionary index (denoted herein as D[i])
`of the character string in the dictionary 15.
`During an encoding process in which bytes of data in the
`input stream are input to the encoder 12,
`the run-length
`encoder 13 will identify a run-length sequence in the data
`stream,
`i.e., a character string comprising a plurality of
`consecutively similar characters (bytes), and output one or
`more code words from the dictionary 15 to represent the
`run-length sequence (as explained in detail below).
`Moreover, the dictionary encoder 14 will build a character
`string comprising two or more characters (which does not
`comprise a run-length sequence), search the dictionary 15
`for a code word that corresponds to the character string, and
`then output the code word representing the character string.
`In addition,
`if the character string that
`is built by the
`dictionary encoder 14 does not match a character string in
`the dictionary 15, the dictionary encoder 14 will cause the
`character string to be added to the dictionary and a new code
`word (e.g., dictionary index) will be associated with that
`string. An encoding process according to one aspect of the
`present invention will be described in detail below with
`reference, for example, to the flow diagram of FIGS. 2A and
`2B.
`
`The encoder 12 utilizes a plurality of data storage struc-
`tures 16 for temporarily storing data during an encoding
`process. For example, in the illustrative embodiment of FIG.
`1, a Pstring data structure 17 is employed for temporarily
`storing a working character string, Pstring. A C data struc-
`ture 18 is employed for temporarily storing a next character
`(byte) C in the input stream. In addition, a Pstring+C data
`structure 19 is used for temporarily storing a character string
`Pstring+C which is a string comprising all of the characters
`in Pstring plus the character in C. Moreover, an Mcode data
`structure 23 is used for temporarily storing a code word
`(Mcode) (e.g., dictionary index) corresponding to a previous
`successful string match in the dictionary. The use of these
`data structures will be discussed in further detail below.
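
By way of illustration, the following minimal Python sketch shows how these
working structures (Pstring, C, Pstring+C and Mcode) interact in the
dictionary-encoding path; the run-length check, the dictionary-reset
threshold, the hash-table lookup and the bit packing are omitted, and the
dict-based dictionary and the function and variable names are assumptions
for the sketch rather than the patented implementation.

    def dictionary_encode(data: bytes, dictionary: dict) -> list:
        # `dictionary` is assumed to map byte strings to code words and to
        # already contain every single-byte string.
        output = []                                # emitted code words
        next_code = max(dictionary.values()) + 1   # next free dictionary index
        pstring = b""                              # Pstring: working character string
        mcode = None                               # Mcode: code of last successful match
        for i in range(len(data)):
            c = data[i:i + 1]                      # C: next input character (byte)
            candidate = pstring + c                # Pstring+C
            if candidate in dictionary:
                pstring = candidate                # keep extending the match
                mcode = dictionary[candidate]
            else:
                output.append(mcode)               # output code word for Pstring
                dictionary[candidate] = next_code  # add Pstring+C to the dictionary
                next_code += 1
                pstring = c                        # restart the working string with C
                mcode = dictionary[c]
        if pstring:
            output.append(mcode)                   # flush the final Pstring
        return output

    # Example: a dictionary seeded with the 256 single-byte strings at indices 3-258.
    seed = {bytes([b]): 3 + b for b in range(256)}
    codes = dictionary_encode(b"ABABABA", seed)    # codes == [68, 69, 259, 261]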
`
`The code word dictionary 15 comprises a plurality of
dictionary indices D[i], wherein each index in the dictionary
15 is mapped (via a mapping module 20) to either a
`predefined control code or a different code word correspond-
`ing to a character (byte) string. The mapping module 20
`preferably employs a hash function to, inter alia, map each
`character string (e.g., strings of one or more bytes) into a
`unique index D[i]
`in the dictionary 15 (although other
`mapping techniques known to those skilled in the art may be
`employed). As indicated above, in a preferred embodiment,
`the dictionary indices D[i] are output as the “code words”
(also referred to herein as “Mcodes”) by the encoder to create
`an encoded file. These code words are processed by a
`decoder to decompress an encoded file (as discussed below
`with reference to FIGS. 3, 4a and 4b.)
`In a preferred embodiment, the first three entries in the
dictionary 15, indices D[0], D[1], and D[2], are reserved as
`control codes. In particular, the entry for the dictionary index
`D[0], or code word “0”, is output to indicate (to the decoder)
`that the dictionary 15 has been reset to its initial state. As
`explained in detail below, the dictionary 15 is preferably
`reset at the commencement of an encoding process before a
`new input stream is processed and, preferably, during an
`encoding process when the total number of entries D[i] in
`the dictionary 15 exceeds a predetermined limit. In addition,
`the dictionary index D[1], or code word “1”, is utilized for
`the run-length encoding process. More specifically, the code
`word “1” is output to indicate that the next two consecutive
`output numbers (in the encoded sequence) represent a run-
`length encoding sequence comprising (1) a character code
`and (2) a number denoting the amount of consecutive
`characters found in the data stream corresponding to the
`character code. Furthermore, the dictionary index D[2], or
`code word “2” is output to indicate the end of the data stream
`and completion of the encoding process.
The next 256 entries in the dictionary 15 (i.e., index
numbers 3 through 258) each comprise a single character
string (e.g., one byte) corresponding to one of the 256
`possible character codes. Accordingly,
`in a preferred
`embodiment, the dictionary indices D[0] through D[258] are
`the only entries that exist in the dictionary 15 upon initial-
`ization of the dictionary 15. Any additional character strings
`that are dynamically added to the dictionary 15 by the
`dictionary encoder 14 during an encoding process will be
consecutively added beginning at index D[259].
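
By way of illustration, the initial dictionary state described above might
be modeled as follows (a minimal Python sketch in which the three control
codes are named constants and only the single-byte strings are stored as
keyed entries; this is one possible reading, not the patented data
structure).

    # Illustrative constants for the three reserved control code words.
    CODE_DICT_RESET = 0    # D[0]: dictionary has been reset to its initial state
    CODE_RUN_LENGTH = 1    # D[1]: next two numbers are a (character code, count) pair
    CODE_END_OF_DATA = 2   # D[2]: end of the input data stream

    def init_dictionary() -> dict:
        # Map every single-byte string to dictionary indices 3 through 258;
        # indices 0-2 are reserved for the control codes above.  Strings
        # added during encoding are assigned the next consecutive indices.
        return {bytes([b]): 3 + b for b in range(256)}
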
`It is to be appreciated that, as indicated above, for a given
`character string under consideration,
`the encoder 12 will
`output (as a code word) the dictionary index number D[i]
`corresponding to a matching character string. Since the
`dictionary index number is usually less than two bytes and
`the input character strings are typically longer than six bytes,
`the reduction in the number of bits output can be significant.
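
For example, if the dictionary is small enough that an index fits in 12 bits
(as in the bit-packing example below), replacing a six-byte (48-bit) input
string with a single 12-bit code word is a 4:1 reduction for that string.
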
`In one embodiment of the present invention, the dictio-
`nary encoder 14 can search the code word dictionary 15 for
`a matching character string therein by comparing each entry
`in the dictionary 15 to the input character string under
consideration. In certain instances, however, the number of
entries D[i] in the dictionary 15 can increase significantly,
`potentially rendering this search process slow, inefficient
`and computationally intensive. Accordingly, the data com-
`pression system 10 preferably comprises a hash table 21
`which is utilized by the dictionary encoder 14 during an
`encoding process to reduce the search time for finding a
`matching character string in the dictionary 15.
`More specifically, in one embodiment, the hash table 21
`comprises a plurality of arrays Array[N], wherein each array
`comprises every dictionary index number D[i] in the dic-
tionary 15 having an entry (i.e., character strings) that begins
`with a character code corresponding to the array index. For
example, the third hash table array Array[3] comprises all
`the dictionary indices D[i] having a dictionary entry in
`which the first character (byte) of the string has decimal
`value of “three.” In the preferred embodiment where the
`encoder processes individual bytes of data in the input
`stream, since there are 256 possible characters, there are 256
arrays, i.e., Array[N], where N=1 . . . 256. Advantageously,
`the use of the hash table 21 for finding matching strings in
the dictionary reduces the number of string comparisons by
a factor of 256.
`
`In another embodiment, the hash table 21 comprises a
`plurality of nested hash tables. For example, a first level of
`hashing can use the first character to subdivide the dictio-
`nary 15 into 256 sub-dictionaries and a second level of
hashing may use the 2nd character of the input string to
`further subdivide each of the initial 256 entries. Each
`
`additional level of hashing subdivides each dictionary into
`an additional 256 sub-dictionaries. For example, 2 levels of
hashing yields 256^2 sub-dictionaries and n levels yields 256^n
`sub-dictionaries. The purpose of this hashing function is to
`reduce the time for searching the dictionary 15. For
`example, using an n level hashing scheme reduces the search
time by 256^n-(n*256).
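
The nested variant can be pictured as a table of tables keyed on successive
leading bytes; the following two-level Python sketch is an illustration of
the idea only, not the patented structure.

    def build_two_level_table(dictionary_entries: list) -> dict:
        # Two levels of hashing: bucket dictionary indices by the first and
        # then the second byte of each entry, so a lookup only compares
        # entries sharing a two-byte prefix (table[first][second]).
        # Single-byte entries are left out here for brevity.
        table = {}
        for index, entry in enumerate(dictionary_entries):
            if entry and len(entry) >= 2:
                table.setdefault(entry[0], {}).setdefault(entry[1], []).append(index)
        return table
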
`Furthermore, as explained in detail below with reference
`to the process depicted in FIGS. 2a and 2b, the hash table is
`dynamically modified to incorporate new entries D[i] that
`are added to the dictionary 15 during the encoding process.
`In addition, the data compression system 10 optionally
`comprises a bit packing module 22 for providing additional
`compression of the encoded data stream. As explained
`above, the maximum size (i.e., number of entries D[i]) of the
`dictionary 15 is predefined and, consequently, the maximum
`number of bits of information needed to represent any index
`in the dictionary 15 is known a priori. For example, if the
`maximum dictionary size is 4000 entries, only 12 bits are
`needed to represent any index number. Since data is typi-
`cally transferred in groups of 8 or 16 bits, in the above
example where 12 bits maximum are needed to represent the
`index number, 4 bits out of every 16 bits would be wasted.
`Accordingly,
`to provide additional compression,
`the
`encoder 12 preferably implements the bit-packing module
`22 to pack the bits of successive output code words. It is to
`be understood that any suitable bit-packing technique known
`to those skilled in the art may be employed. In a preferred
`embodiment, the bit-packing module employs a shift regis-
`ter to output at least 16 bits of data when the data is ready
`for output. By way of example, assume a 12-bit code word
`is initially input to the shift register. The next 12-bit code
`word that is output is also placed in the shift register, and the
`shift register would contain 24 bits of information. Then, 16
`bits would be output from the shift register, leaving 8 bits
`remaining. When the next 12-bit code word is input to the
`shift register, the shift register will contain 20 bits, and 16
`will be output. This bit packing process is repeated for every
`output code word until the encoding process is complete.
`Advantageously, the bit packing process according to the
`present invention improves the compression by a factor of
16/12, or 1.33. Moreover,
`it is to be appreciated that the
`processing time required for the bit-packing is negligible.
`Consequently, the bit packing process provides increased
`compression (“algorithmic effectiveness”) without a signifi-
`cant
`increase in processing overhead (“algorithmic
`efficiency”).
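
The shift-register style bit packing described above can be sketched as
follows; this is a minimal Python illustration in which the function name,
the big-endian byte order and the zero-padded final flush are assumptions
not specified in the text.

    def pack_codes(codes, bits_per_code=12):
        # Pack fixed-width code words into a byte stream.  Codes accumulate
        # in an integer acting as the shift register; whenever at least 16
        # bits are available, 16 bits (two bytes) are emitted.  Any
        # remainder is flushed, zero-padded, at the end.  (A production
        # implementation would also mask the accumulator to bound its size.)
        out = bytearray()
        acc = 0          # bit accumulator (the "shift register")
        nbits = 0        # number of valid bits currently accumulated
        for code in codes:
            acc = (acc << bits_per_code) | code
            nbits += bits_per_code
            while nbits >= 16:
                nbits -= 16
                out += ((acc >> nbits) & 0xFFFF).to_bytes(2, "big")
        if nbits:
            out += ((acc << (16 - nbits)) & 0xFFFF).to_bytes(2, "big")
        return bytes(out)

For instance, pack_codes([0xABC, 0xDEF]) yields the four bytes AB CD EF 00
(hexadecimal), matching the 24-bits-in, 16-bits-out behavior walked through
above, with the 8 leftover bits flushed in a final zero-padded group.
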
`Referring now to FIGS. 2a and 2b, a flow diagram
illustrates a method for compressing data according to one
`aspect of the present invention. In particular, the encoding
`process depicted in FIGS. 2a and 2b illustrates a mode of
`operation of the system 10 of FIG. 1. Initially, the dictionary
`15 and hash table 21 are initialized (step 200). For example,
`as noted above, the dictionary 15 is initialized to include 259
`entries, i.e., the first three entries D[0]—D[2] comprise the
control codes and the next 256 entries D[3]—D[258] com-
`prise the 256 possible character codes (assuming, of course,
`that the encoder processes data blocks each comprising a
`byte). Furthermore, the hash table will be initialized such
that each array Array[1]—[N] comprises one entry—the
`dictionary index D[i] for the corresponding character code.
Next, the Pstring data structure 17 (or “Pstring”) is initialized
`to be empty (i.e., it contains no characters at initialization)
`(step 201). It is to be understood that neither the C data
`structure 18 (or “C”) nor the Mcode data structure 23 (or
“Mcode”) requires initialization.
`After the initialization process, a determination is made as
`to whether there are any input characters for processing (step
`202). If there is input data (affirmative result in step 202), the
`first (or next) character (e.g., byte) in the input stream will
`be read and temporarily stored in C (step 203). Then, the
`next consecutive characters in the input stream are checked
`(step 204) to determine if there is a string of at least s
`consecutive characters that match the character stored in C
`
`to trigger a run-length sequence (step 205), where s is a
`predetermined minimum number of consecutive characters
`that are required to trigger a run-length encoding sequence.
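
A minimal sketch of this run-length check (steps 203 through 205) might look
as follows; the default value of s and the function name are illustrative
assumptions, since the text treats s as a predetermined design parameter.

    def detect_run(data: bytes, pos: int, s: int = 4):
        # Count how many consecutive copies of the byte at `pos` follow it.
        # Returns (byte_value, run_count) if the run reaches the minimum
        # length s, otherwise None.
        c = data[pos]
        count = 1
        while pos + count < len(data) and data[pos + count] == c:
            count += 1
        return (c, count) if count >= s else None

    # Example: detect_run(b"aaaaab", 0) returns (97, 5); per the control
    # codes described above, the run would be emitted as code word 1
    # followed by the character code and the count.
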
`If there are at least s consecutively similar characters in
`the input stream (affirmative determination in step 205), then
`a determination is made as to whether Pstring is empty (step
`206). If Pstring is empty (affirmative determination in step
`206),