`
`USU(}6253264Bl
`
`(12; United States Patent
`Sebastian
`
`(10) Patent No.:
`:45) Date nf Patent:
`
`US 6,253,264 B1
`*Jun. 26, 2001
`
`3-11151
`3115035
`
`341 -‘In’?
`.’
`341.-107
`'F'U"?1'll'12
`H17-"ll'H
`395.’! 14
`
`1 1111997 Panauussis
`5.1111-11.403
`5.110.823 1
`1_1wt.1.=1 Coleman
`
`tjvrmish et «'11-
`1rI9‘_J8
`§.-?l<15='»2 *
`: 1}-figgg Eulhwr --------
`Dmes
`3,
`_.
`3
`5.367.114 1‘
`2.11000 Barhir
`.............
`
`
`
`5,001,515
`
`11110021 Fall-slat.
`
`,
`,
`,
`,,
`_
`,
`[DRl;lGN FAILNI DUC UMEN F5
`11 1129 117 A2
`1111990
`(1211)
`.
`
`OTHER PUB! ICA-I-IONS
`'
`R1'1:::,R,lT,t.-.1a1.,"VLSl uuiw.-rsa1Noisc1..-asCoc1cr,"1\rr15'
`Tech Notes 3301. p. 334 (Mar. 1.1990}.
`
`(54) CODING NE'[‘WORKGR()UPING DATA or
`SAME um); Type mu) 1;: pcxs usmc
`
`1-‘LLI-L 1')/\‘[‘A STRlJC'l‘URE. AND SELI-JCTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`;
`_
`rm
`-
`“ASL 0” NDCK DATA IYPF‘
`lnvenlor: William Seb1’|slifln.I7almnulh. MA
`(US)
`
`(75)
`
`(73)
`
`A.-ssignt-11::
`
`_
`{ ‘ J Nntlr.-1::
`
`Intelligent Compression ’l‘eclIn0l0gles.
`Hlmomh‘ MA (US)
`I
`I
`‘
`_
`_
`Hus patent Lssucd on :1 conlmucd prim-
`cculion upplicaliun filed under 37 CTR
`l.53(d), and is subj::cI lo the lw-:1'11_v year
`""“'i5k'”5 “F 35 U'S‘c'
`
`Subject“*"3"Ydi$l“i“1°r'lh“'°"T'°“hi"3
`patent is mended or adjusted under 35
`U.S.C.
`l54(h)l1y U d.'t_\«'5-;.
`
`"Competitive Parallel Prtsccssing for
`cl al.
`Diner DB.
`C'nmprt:5.5iL'11J of 131111," .N1".'S T91:1‘1N0fe.5' 2301. p. 379111/111;»
`1’ _199m_
`
`(211 App]. No.: 09103154152
`(22)
`I-'ilc.-Ll:
`Mar. 6, 1998
`
`. med by flxamim
`
`'
`
`'1’
`
`_
`
`(60)
`
`_
`(51,)
`
`‘
`Int. (,1.
`
`7
`
`(52)
`
`11.5. CI.
`
`Related U_S_ Applicafion Dam
`Provisional application No t1l'J;'fl3E1 ‘I48
`filed on Mar
`191;-;_‘
`I
`'"
`I
`I
`‘
`............................ ,. GOISP 17130: G06? 31110:
`G06]: SE00; I-103M "NOD; HUJM 7130
`............................... .. 710163; 710.105; 341151;
`141,10?
`‘
`'
`3-—1l1'5l. E15. IU7;
`{5H) Field of Search
`7r_m'102. 503, 504. 505. 5012; 3701474;
`710%‘
`()8
`"
`
`(S6)
`‘
`
`Refumnces Cited
`US. PATENT DOCTIMENTS
`‘
`311904 Carr
`.................................... 3'1"[I1"J4.l
`3-11151
`1111095 Chu
`J‘J51’I£lI0.l3
`4-,"l'E4FJ'? Alhancsc er
`305177!)
`5!l‘}0? R110 el :11.
`......
`631907 'l')'lcr cl :11.
`.
`.. 3953117
`
`5_2‘J3,3".-"9
`5,«1m_.r11'1'!
`*
`5.61 1.54]
`*
`5,632,lI1‘J
`5,638,408 '
`
`Pri1111?ry L'.\':I1111T.r_11:1‘—Tl_10mas Lee
`AS'“'5'mm Exmmner—1aDh Nguyen
`(74) .4Imr1Ir.’_\j. .4g{.’fl£_. or F1'r111—IIa1‘11ilto1:1, Brook, Smilh &
`Reynolds. PC.
`
`{S7}
`
`d\BSTRAC'l‘
`
`A prcfcned coding, nclwork uses an architcclure called 3
`Baa"-0—I*'ilIcr—Rcs11urcc (BI.-R) sys1cn1.
`'l“his. approach inte-
`«
`1 .
`,
`‘
`_
`_.
`1
`.
`.
`’?"""‘?5_ 1"“ "d‘.".‘”“fg"S °f .f°7""“' S',’°‘_’fi°- ‘°'""'°-§5'"" T” 3
`gcnc.1'._1 -purpose <.omprc:~s1t_J:1 loo s1..1'v1ng, 3 W1 1: 1:ar1_1,c_ 01
`dala 1L‘|l'lTlflL‘5. Sourci: (1313 19 parsed 1111:: blocks 01 s1m1[z1r
`data and each parsed blacks are compressed using a respec»
`lively selected cnmprcssinn algorithm. Thu: algnrilhrn can be
`chmacn from 11 5111110 model 01‘I|'1v.-: data 01' can be adaplivc 10
`[he dam In H]: parsed block‘
`[be parsed blocks are men
`combined inlu an uncudcd data file.
`l"or d1:1:0ding, Ihc
`‘
`.
`_
`pmccsa. 1s rcvcrscd.
`
`
`
`29 Claims. 8 Drawing Sheets
`
`—_, ..—.»i~. 1
`-.
`*-
`.\-orm/.;’
`.
`_.?-.
`;m.u"- 1
`'
`my —
`
`
`‘Ca
`-2.0:’:
`r.)-1
`..a:n.e..
`
`an
`
`Oracle 1004
`Oracle 1004
`
`
`
`U.S. Patent
`
`Jun. 20', 2001
`
`ghee: 1 of8
`
`US 6,253,264 B1
`
`SENDING APPLICATION
`
`SOURCE DATA 2
`
`ENCODER
`
`ENCODED DATA 4
`
`DECODER
`
`§
`
`§
`
`DECODED DATA 6
`
`Z
`
`RECEIVING APPLICATION
`
`FIG. 1
`
`PRIOR ART
`
`
`
`
`
`zofiomzmm._
`
`s_mB>mzofiumdm..
`
`
`
`
`N<._.¢Dmomnow
`
`NN.xN_m_._.:__n_
`
`U.S.Patent
`
`<_mmEmo
`
`~§_.._.§.....mm:
`
`m><mw_<
`
`ommm<m.er
`
`vmxwz_mmqn_
`
`s_mB>m
`
`Jun.2(._.2001
`
`mzo_5:w:mz_
`
`oz_wm<n_
`
`
`
`onn_mN__2o_.w:u
`
`><%.._<
`
`Sheet2of3
`
`
`
`m2w_o“_mz<E_20._.m3o
`
`mm
`
`._.zmzOn_S_OU
`
`mm::oo2
`
`mi
`
`mm
`
`omooozmmmm:m_2<m<n_
`
`
`<559N0_u_
`mm.
`
`ozfioozm/
`
`Us6,253,264B1
`
`
`
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 3 of 3
`
`US 6,253,264 B1
`
`FIG. 3
`
`
`
`P5U
`
`.m
`
`%
`
`N.4
`
`2,33.6S
`
`m
`
`
`
`
`
`PMmmooo>¢.mm<20¢”.:m_oo_2oEEm.
`
`
`
`
`
`MV.O_.u_
`
`
`
`
`
`ummaoo><w_m<Emmooo.sm»
`
`
`
` .3.+mmaooI022moz_m:m320222:
`
`
`
`
`
`S.K.
`
`
`
`
`
`.w5.5o2_Emom_5_am_mn_n_OE:
`
`n.moz_Em
`
`wmm_>__E
`
`
`
`mm__._o»<s_oz_Ew
`
`
`
`
`U.S. Patent
`
`Jun. 2(._. 2001
`
`Sheet 5 of 8
`
`I.B462.352.,6SU
`
`cmX%
`
`<29Sod
`
`5
`
`EMDOONu.
`
`
`
` mm_Em>zooQ_.wm¢m#mm3<>n_m:o_.<_2z:
`
`
`
`
`
`NEDOO><mm<.201“.
`
`
`
`m;¢mmF_._wmmmzo_wmm_>2oowhzmzoaxmozw<wm:z§_Sn
`
`
`
`
`
`m.O_n_
`
`8mmmmmm
`
`
`
`
`
`
`
`MMDOOozztwO._.mmooo><«_m<O...mmooo><mm_¢O._.mmooo><w_m<O._.
`
`
`
`
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Jun. 2(._. 2001
`
`Sheet 6 of 3
`
`US 6,253,264 B1
`
`SAMPLE DATA BASES
`
`1
`
`GENERATE PARSED FILES
`
`PARSED FILES
`
`l
`
`ANALYSIS RESULT FILES
`
`ANALYZE PARSED FILES
`GENERATE PREDICTOR DATA
`
`PDAT FILE
`
`1
`
`FIG. 6
`
`
`
`U.S. Patent
`
`Ju11.26_.2(llll
`
`Sheet 7 or 8
`
`Us 6,253,264 B1
`
`25
`
`ARRAY TO BE CODED
`ARRAY DATA
`‘V
`BUILD MODEL
`
`T1
`
`MODEL
`
`ARRAY DATA
`
`70
`/
`
`73
`
`ENCODED MODEL INFO
`
`75
`
`was
`
`SPLITFER
`
`-77
`
`TWO OR MORE ARRAYS
`
`PROCESS SEPARATELY
`
`YES
`
`MATCHING
`ALGORITHMS
`
`81
`
`
`
` USE
`INITIAL
`SPLITTER
`
`
`
`
`No
`
`79
`
`
`
`MATCHING
`* LGORITHM
`
`85 ~.
`
`NO
`
`UNMATCH ED
`VALUES
`
`OVE§§:qZ§Lg:RAY
`
`ARRAYS CONTAINING
`
`MATCH DATA
`
`83
`
`CONVERT TO |NT32
`
`
`
`
`USE
`NUMERIC
`TRANSFORM
`
`’39
`
`NUMERIC
`ALGORITHMS
`
`91
`
`OFFSETS FROM
`MODEL
`
`95 SPLI'I'I'ER
`
`TWO OR MORE ARRAYS
`
`PROCESS SEPARATELY
`
`
`
`YES
`
`REPEAT
`TRANSFORM
`
`99
`
`100
`
`FIG 7
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 3 of 3
`
`US 6,253,264 B1
`
`101
`
`ARRAY DATA IN
`
`103
`
`RANGE
`
`REMAPPER
`
`
`
`
`
`
`1 O5
`
`EXTRA BITS
`
`TO PACK
`
`MTF CODER
`
`109
`
`
`
`RANGE CODES
`
`NO
`
`
`
`RANGE CODES TO
`
`LOW LEVEL CODER
`
`FIG. 8
`
`
`
`US 6,253,264 B1
`
`1
`CODING NETWORK GROUPING l)A'I'A OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTIJ RE AND SEI.F.C'I'I NG
`COMPRESSION FOR INDIVIDUAL BUOCK
`BASE ON BLOCK DATA TYPE
`
`RELATED APPLICATIONS
`
`This application claims the benefit of US. Provisional
`Serial No. (tU,=’0.'l6,S43 filed Mar. 7. 1997. the teachings of
`which are incorporated herein by reference in their entirety.
`BA(TI\"(}R()UNl) ()I- THE l'NVl£N’l'I()N
`
`ID
`
`High performance data compression] systems use models
`of the data to increase their ability to predict values. which
`in turrt leads to greater compression. The best models can he
`achieved by puilding a compression system to support a
`specific data format. Instead of trying to deduce a crude
`model from the data within it specific lile, rt form at-spec'il'.lc
`compression system can provide it precise prc-determined
`model. The model can take advantage not just of the file
`formal structure, but also of statistical data gathered from
`sarnplc databases.
`Previous efforts at form at—spe<:ific compression have been
`focused on solutions to a few individual formats rather than
`on the development of at generalized method that could be
`adapted to many formats. The models which have been
`created typically involve a small number of components.
`This works adequately when most of the data is included in
`a few components, such as an image file having mostly red,
`blue. and green pixel values. Hut mny formats are best
`modeled using a large number of components, and the
`previous systems are not designed to build or encode such
`models.
`
`SUMMARY U1" Tl-IE INVIENTIIJN
`
`A preferred coding system in accordance with the inven-
`tion solves both of these problems in the prior art: it provides
`a generalized solution which can he adapted to handle a wide
`range of formats. and it effectively handles models that use
`large numbers of components. The system involves new
`approaches at many levels: front the highest (interface) level
`to the lowest (core encoding algorithms) level. The coding
`network includes a compression network for encoding data
`and a decompression network for decoding data.
`At the highest level, a pr-:.l'e.rrcd compression system in
`accordance with the invention uses an architecture called a
`Base-Filter-Resource (BFR) system. This approach inte-
`grates the advantages of format-specific com prcssion into a
`gcrIeral—purposc compression too] serving a wide range of
`data formats. The system includes filters which each support
`a specific data format, such as for Excel XLS worksheets or
`Word DOC files. The base includes the system control
`modules and a library used by all the filters. The resources
`include routines which are used by more than one liltcr, but
`which are not part of the base. lfa filter is installed which
`matches the format ofthe data to he encoded, the advantages
`of forntnt-specific compression can be realized for that data.
`Otherwise. at "generic" filter is used which achieves perfor-
`mance similar to other non—speciIic data compression sys-
`tems {such as Pl(Zip, Stacker. etc).
`At the next level, the system preferably includes il method
`for parsing source data into individual components. The
`basic approach, called "structure flipping," provides a key to
`converLi rig format
`information into compression models.
`SLructurc flipping rcorganizcs the inform:tLion in a file so that
`similar cornpone-nts that are normally separated are grouped
`together.
`
`IL}
`
`3!]
`
`3-5
`
`4!)
`
`St)
`
`Ln no
`
`oil
`
`($5
`
`2
`Upon this foundation are a numbers of tools. such as:
`a language to simplify the creation of parsing routines;
`tools to parse the source data using this method into
`separate components; and
`tools to generate models for individual components by
`automated analysis of sample data bases.
`These tools can be applied to the filters for a wide range of
`fllc and data types.
`The compression system preferably includes tools called
`customized array transforms for specific liltr.-rs and for
`certain types of filters. These tecltniqucs handle a specific
`file type or certain data constructions used by several file
`types, such as encoding two dimensional arrays of data as
`found in many database formats.
`At the low-level of the system are preferably a number of
`mechanisms for encoding data urrnys, including:
`new low-level encoding algorithms;
`melltods for integrating a large number of transforms and
`encoding algorithms:
`methods for eliminating overhead so that small data
`blocks can he ellicicnfly coded; and
`a new method for
`integrating both static models
`(determined from statistical analysis of sample
`databases) and dynamic models (adapted to the data
`vvilliin a panicula: array) into the encoding of each
`component.
`A preferred method of encoding source data: comprising
`parsing the source data into at plurality of blocks. The parsed
`blocks typically have it rlifferent format than the source data
`format. In particular, similar data from the source data are
`collected and grouped into respective blocks.
`For each block, a compression algorithm is selected from
`a plurality of candidate com prcssion algorithms and applied
`to the hlock. The compression algorithms can be determined
`based on the amount of data in the respective block.
`Iittrthermore, the compression algorithm can be adapted to
`the respective block,
`including the use of a customized
`transfonn. The selection of an algorithm can also be based
`on a compression model, which is derived from the format
`of the source data. The compressed data from the plurality
`of blocks are then combined into encoded data.
`
`The coding network can also include a decompression
`network to convert the encoded data back into the source
`
`data. First the data is decoded and then the parsing is
`reversed. In a lussless system, the resulting data is identical
`to the source data.
`Embodiments of the invention preferably take the form of
`machine executable instructions embedded in a machine-
`rcatlahte format on a Cl)—R()M. floppy disk or hard disk, or
`another machine-readable distribution medium. These
`instructions are executed by one or more processing units to
`implement the compression and decompression networks.
`BRIEF [)l_'"..‘~3r’..'.llll3"l'l(3l'Vl 0|’ 'l'l{E DRAWINGS
`
`The foregoing and other objects. features and advantages
`of the invention will be apparent from the following more
`particular description of preferred embodiments of the
`invention. as illustrated in the accornpanying drawings in
`which like reference characters refer to the same parts
`throughout the different views. The rlrawirtgs are not nec-
`essarily to scale, emphasis instead being placed upon illus-
`trating the principles ofthe invention.
`FIG. I is a high-level
`lrloclr diagram of a typical data
`compression system.
`l.’l{j. 2 is a block diagram ofa preferrccl encoder using the
`BFR system in accordance with the invention.
`
`
`
`
`
`US 6,253,264 B1
`
`3
`FIG. 3 is it How chart of an array coder.
`FIG. -'I- is it block diagram of the string coder 40 of FIG.
`
`3
`
`FIG. 5 is El block diagram of the Host coder 50 of FIG. 3.
`FIG. 6 is a flowchart of a preferred autnmztterl analysis
`system.
`FIG. 7 is at llowchttrt ovcrviewing the high-level of the
`integer coder.
`FIG. 8 is a flowchart ovcrviewing the mid-level of the
`integer coder.
`
`DETAILED l)I3SC'RlP'I'IONS OF PREI-‘ERRED
`EMBODIMENTS OF THE INVENTION
`
`5
`
`ll}
`
`ELI
`
`3!]
`
`4
`invention. The encoder 3’ is based on the use of :1 plurality
`of filters ltln. .
`.
`. , l'I.lx,
`.
`.
`. , II}: which serve specific tile
`formats. For example, one filter 10:1 might support several
`versions ofthe DBI’ database forrrtat, ‘While another lilter 10:
`might support seve rat] versions of the DOC‘ format used by
`the Microsoft Word software program. The individual filters
`provide respective selection criteria 12 to at filter selection
`system 22.
`The filter selection system 22 receives the source data 2
`and checks the selection criteria l2.n, .
`.
`.
`, 11:‘, .
`.
`.
`, 12; of
`all filters I0rt,. .
`. , 10x, .
`.
`. , I0: installed in the system to
`see it’ any of them support the source data's format. It" not.
`a "generic" litter is used which provides compression per-
`formance similar to other generic compression systems,
`such as Lcmpcl-Ziv (L2) engines. In it panieular preferred
`errthodimeut of the invention. the generic compression sys-
`tem employs an S211’ engine as described by Mr. Schindler
`in US. application Ser. No. t]8.’970,32D fded Nov.
`I-1, 1997.
`the teachings of which are incorporated herein by reference
`in their entirety. The descriptions of the network will pri-
`marily cover the situations in which a filter to support the
`date format is successfully found.
`Briefly. it parsing system 24 parses the source data 2 into
`:1 large number of parsed Arrays 25, which each include all
`the instances of it particular component in the source. tile.
`The term “Arraty", described in further detail below,
`is
`capitalized to indicate that this refers to a type of structure
`specific to the network, as opposed to the usual use of the
`term "array“. The algorithm used to parse the source data 2
`is called "Structure Flipping“, an important ttspect of the
`network, described in further detail below.
`The BFR system uses an automated analysis system (also
`described in detail below) to build ti. model for each Array
`25, which will he used when the Arrays 25 are encoded in
`an array coder 28. In some cases, better compression can be
`achieved if customized array translorrns 26 are first used.
`While the models generated by the automated analysis
`systc m process each component Array 25 independently. the
`customized routines can take advantage of inter-component
`relationships.
`The array coder 28 encodes the Arrays 25 using :1 count-
`depcndent modeling system. which includes El mixture of
`static models {constant
`for a ll databases) and dynamic
`adjustment to each particular Array 25.
`The tiller lllx has components to support each ofthe four
`sections of the encoder 3’. The selection criteria llx, deter-
`mined by the file format specification, includes inforrtiation
`needed to recognize files served by the filter 10.x‘ such as byte
`values at the beginning of the tile or tile title sutfices. Parsing
`instructions 14, also determined by the file format
`specification,
`include the information needed to parse the
`source tile into parsed Arrays 25 having all the instances of
`a particular com poncnt
`in the source tile. The custom
`in 1:
`_ component models 16 are constructed by referencing both
`the file formal specification and sample databases. They can
`take advantage of inter-component relationships to trans-
`form some of the component Arrays 25 to make them more
`compressible. Encoding parameters 18 are generated by an
`autornuted analysis system which examines component
`Arrays 25 parsed from :1 large number of sample databases.
`The encoding parameters 18 provide the data necessary to
`support the count-dependent modeling system in the array
`coder 28.
`l~‘ll.'1‘LiR Sl:1l.E.(.'l1DN SYS'l‘T_~‘.M
`A preferred compression system uses an cxptintlablc
`architecture that allows users to support new or updated
`
`FIG. 1 is :1 high—levcl block diagram of :1 typical data
`compression system. The primary components ofthe system
`are a sending application 1, an encoder 3, a clccodcr 5. and
`a receiving application 7.
`The sending application 1 provides source data 2 to be
`encoded hy the encoder 3. The sending application can be a
`file archivcr, a communication system, or any other appli-
`cation with a need to represent data using fewer bytes than
`in the native format.
`The encoder 3 receives the source data 2 and uses data
`compression algorithms to represent the data using, fewer
`bytes. The typical implementation ol‘ the encoder 3 is as
`software running on a general purpose computer, although
`the algnritlims to be described may also be implemented on
`specialized hardware.
`The output 01' the encoder is encoded data 4, which can be
`stored in memory (in which case the encoding allows more
`source data to be stored within the same Itardwarc), or
`transmitted to the decoder5 (in which case the encoding the
`source data to be transmitted faster when the transmission _
`channel bandwidth is limited).
`The decoder 5 retrieves or receives the encoded data 4 and
`reverses the algorithms used to encode the data at its to
`reconstruct the source data as decoded Llaltt 6. In a lossless
`system, the decoded data 6 is identical to the source data.
`Depending on the application, hewt:-ver, some data loss may
`be acceptable. As with the encoder 5,
`the decoder 5 is
`typically embodied as software running on a general purpose
`corn pttter, but can be implemented in specialized hardware.
`The receiving application 7 receives the decoded data 6
`for processing.
`A preferred compression engine includes liltcr-based
`encoders and decoders which can be utilized by a large
`number of st: riding ztpptieations l and receiving applications
`7. While the applications [,7 that storetretrievc or transmit.-'
`receive the encoded data are not part of the compression
`system, they may adjust encoding and decoding opera! ions
`according to system conditions such as available transmis-
`sion channel bandwidth.
`
`4!)
`
`St)
`
`The descriptions to follow will mostly cover the encoder.
`The dccodcrjust reverses. the ::ncoder‘s actions,so rt descrip-
`tion of the encoder is su Ificicnt
`to describe the overall
`eneodingjdccocling system. The term “File" will be used to
`describe at block of source data, which matches the conven-
`tional usngc, but which should also he understood to be
`applicable to many other types ofdatta blocks such as a data
`exchange in it Client-Server application.
`ENCODER
`
`FIG. 2 is a block diagram of it preferred encoder using :1
`Base-Filter-Resource (BI 11) network in accordance with the
`
`
`
`till
`
`65
`
`
`
`US 6,253,264 B1
`
`5
`formats by adding or replacing filters it]. As noted above, a
`litter 10 can support more than one format. For example. :1
`filter called “l.D'l'US123v-‘l-" might support all tile formats
`associated with the Lotus [23 program, such as the current
`WK-at format and the earlier WK3 and EMS formats. If :1
`
`Filter was later developed for a new Lotus WK5 formal, a
`user replaces the old filter with one supporting the new
`format and which would also he hackwardly compatible
`with the earlier filter with respect to the older formats.
`In the Microsoft Windows environment, each litter is :1
`separate dynamically linked lihrary (t)t.l.) tile. In addition
`to data tables. a filter Includes executable code.
`
`It]
`
`If a filter is found for :1 file, then it is dynamically linked
`and then called to encode the file. The encoded data stream
`includes an identification (ID) code indicating which filter
`was used to encode the data, and the decoder has to have a
`corresponding decoding filter.
`To implement
`the filter selection system 22, the hase
`module maintains a registry of the currently installed files,
`which includes the data need to identify source liter. of that
`type as well as the path to the DLL. This registry is updated
`each tirne at filter is added or replaced. The identification data
`can include both a list of tile title. sufficcs and a method to
`identify files via byte values near the beginning of the Ede.
`The identification procedure uses a search tree method to
`reduce the time needctl to check the list when many filters
`are installed. From this registry, the appropriate filter It) is
`invoked for the file type.
`To handle compound file formats such as Microsoft's
`01.132, one filter can invoke another filter to encode a
`sub-file or stream within the compound tile. For cxa rnplc, :1
`Winword Filter might call an Excel Filter to encode some
`Excel data which was embedded within a compound docu-
`ment tile.
`PARSING SYSTEM
`
`large
`As noted above, the parsing system 24 creates ti
`number of pa rscd Arrays 25. Each parsed Array 25 includes
`all instances of a particular cornponcnl in 11 File. A separate
`component
`is created for each member of each dc-lined
`structure type.
`
`EXAMPLE 1
`
`For example, consider the following, description of a record
`dclined in a file format description.
`(Tell Value:
`
`The ti:-rmat uses a special method of coding numbers in an
`internal numher type. called an Rh’. number, to save
`memory and disk space.
`Record Data:
`
`Olisct
`
`Name
`
`Size
`
`Contents
`
`4
`0
`8
`
`IEJ
`
`we
`col
`frnl
`
`nun-r
`
`3‘.
`3‘.
`1
`
`4
`
`Row numhrrr
`ffolurrtrt nurrtlaer
`Index to the record that includr.-5
`the cell format
`the coded number
`
`The parsing system 24 creates parsed Arrays 25 for each
`of the four components of this record. Each time an R.l-3'.
`record is found, an entry is added to each array. The parsed
`Arrays 25 are much more compressible than the raw source
`data, as shown by data from five records:
`
`35
`
`4U
`
`5U
`
`an an
`
`t'.iU
`
`65
`
`Rec
`
`Row
`
`1
`3
`3
`4
`5
`
`L'rxtZIl)t3l
`Uxlltlul
`L'IxE|t}lJl
`tixtZ|tJI't1
`EJ.\:t]rJCl]
`
`Col
`
`Uxtlml‘
`llxtltltlj
`nxtJrJn-1
`tIt)clJ'DlJu
`lJ)ctJ(Jt)‘t
`
`lndett
`
`Oxf)DS5
`0x005 f
`0x005}
`rlxousf
`riicr:-Usf
`
`Nulrt Vol
`
`Ux3tT1e 100
`f}|x3tI1a3tJt)
`nx3tf1duuo
`ox}-t1'1.-ant:
`ax3.a'1.-Jana
`
`The source data would appear almost random to a byte-
`matching algorithm (shown in hex):
`0.| UOD2D0550ftfl0elli3fDl{l0D30D3fDDD[la3fl
`3l'0l [tt]U|4[Jt]S3DUCl0r_l0li 3f
`01 {)0 06 00 Sf [10 D0 «:4 cl fl 3l‘U1{lfJ{t7 Of} 51' 00 (ID :13
`I] fit"
`But when handled as four separate Arrays 215 (Row, Col,
`Index, and Nam) using algorithms. optimized for each
`component,
`the data can compress significantly. This
`approach is referred to herein as "structure flipping,” because
`of tilt‘: way it rearranges the data.
`To encode together the instances of the same component
`i.n a database is a common compression technique, and the
`file format is often used to identify such components. But in
`the prior art. this approach has heen limited to isolating a few
`major components of a specific file formal. and the file
`format description used only as a technique of finding, these
`major components. In accordance. with preferred embodi-
`ment.-=. of the invention, a file format is actually converted
`into a compression model, which is then interfaced into an
`automated system for analyzing the data and optimizing the
`coding performance of each coniponent. While this com-
`prcssion model is usually entranced by the customized array
`transforms 26. it is still an aspect of the overall system.
`The extent
`to which preferred approaeltes ditfer from
`prior art systems is re flecled in the new tools constructed in
`order to implement it. This approach creates a large number
`of!\rrays 25. Eacli Array 25 requires a dilfe rent compression
`model, and the number of entries in each can change
`significantly from tile to tile. In order to work elI'ect_ively in
`such a case, prefe-rred embodiments of the invention employ
`new approaches in how the low-level coding is handled such
`as the elimination of overhead, count-dependent processing,
`and the use ofa processing network instead of fixed coding
`algorithms. Tht:s.c tech fliques, described above, are pro fcrrcd
`in order to handle the output of the parsing system 24.
`All of these implementation features are integrated at a
`high level. so that high-lcvel calls can handle all of the
`operations needed to initialize the system. parse the data,
`and encode the Arrays.
`In accordance with it particular preferred embodiment of
`the invention,
`the lite parsing system 24 uses three sult-
`systems:
`
`FLL Li_fl L'l"Fi:'. R
`A R R AYE
`t'vs't'ttt.'c'T'
`
`[npLtI‘.O'LttpttI InIL-t'1't|ce
`To store the data for It single component
`To paw: the input into Arrays or to write the decndecl
`An‘u_\':i into the Output stream
`
`A plurality of FILE_BUF'f'ER routines provide a llexible
`inputtoutput interface using l*‘lI.E _BUFFl3.R structures. so
`the same set of parsing routines can he used regardless of the
`inputtoutput formats. This allows the compression system to
`be independent of the lit] format. The file parsing system
`preferably supports two file formats: [305 and DLI-L2, but
`this can expanded to other file formats as well as to inter-
`
`
`
`
`
`US 6,253,264 B1
`
`7
`faces to communication ports and other U0 options. 1"or
`DOS files, the logical stream positions are usually the same
`as the raw file positions. The t'tLt:'Z format requires an
`extensive layer of code to follow an OIEZ stream within the
`container file.
`A plurality of ARRAY routines provide at system for
`managing the process of initializing, adding data to.
`encoding, and freeing arrays of data. Each ARRAY structure
`includes one array of data and the information needed to
`manage these processes. such as the number of entries in the
`Array 25, the size of the bulTer currently allocated for the
`array. and informatiort which will be useful in encoding the
`array.
`In particular, ARRAY_XX structure stores all the data
`from a component, and includes mechanisms to sttpport fast
`sequential
`rcadfwrite access to the data, an expandable
`hul’l'er'for the data. and storage ofrnoclel data. This structure
`also allows a simple set of tooLs to handle all allocation,
`parsing, and coding functions.
`the
`XX nomenclature
`indicates that slightly different ARRAY structures are
`dclirted for different data types, such as dilierent integer
`word sizes, strings, and flouting point types, although many
`of the processing routines work on all of the variations.
`A plurality of [’VS'l‘RU(.‘l' routines make up an auto—
`mated system for parsing the file data. During encoding.
`they read data from the FTLE_ BUFFER structure and write
`data into Arrays 25. During decoding, they take data from
`the Arrays 25 and write it into the l~TI.E _BLIt~‘t-"ER struc-
`tures.
`routines can take a high-level description of a
`file format and handle all the tasks associated with managing
`the parsing as well as the encoding ofthe parsed data. Many
`of the [Ville-rs
`II} will also include custramiwcd parsing
`routines which call the FILE BUF'l'~’El1 and ARRAY rou-
`tines rlirectly instead of using the PVSTRUCT routines. But
`an aspect of the parsing system 24 is the PVSTRUCT
`routines, and this discussion will be concerned primarily
`with their operation.
`A PVSTRIJCT structure includes all information needed
`to parse and code a data structure defined in a file format.
`This usually involves multiple components. so it provides
`and handles individual Arrays 25 for each component. The
`term "pvstruct"' comes from Predictor Variable Structure,
`where Predictor refers to the usc of statistical data gained
`from analyzing a large. number ofsamplc files to assist in the
`encoding of the data, and Variable indicates that it supports
`structures where the component counts can vary at run time
`for each instance (called "dynamic structures” in some
`textbooks).
`A parsing language is a defined instruction set which
`provides a concise way to specify the parsing of a
`PVS']‘RU("l‘ structure. The instruction set includes support
`for a wide range of dynamic variation, including where the
`counts of components are dependent on lile counts of other
`components as well as on external factors. It incorporates
`other lilc forirtat data, such as defined values or range
`restrictions. It also provides a means of exchanging data
`between the parser and the filter.
`The compression system is based on parsing a file so as
`to place items together that are similar. In the following
`description ofthe compression system, examples illustrating
`some structures which might be found in a typical spread-
`sheet file t"on'nat.
`
`EJCAMPLE 2
`
`Spreadsheet files typically include rt series of records.
`Each record |)e-gins. with a 16-bit word indicating the type of
`record followed by a 16-bit word giving the number ofbytes
`
`1U
`
`EL}
`
`.10
`
`35
`
`4!]
`
`5!}
`
`111 an
`
`(it)
`
`{i5
`
`8
`in the body of the record. The file specification then
`describes the contents of each record. A spccillcation for a
`TABI..E SIZE record might be:
`
`RecType-=-Ox4{l'_’
`Rec Body Sizc=12
`Re-cord Data:
`
`Ctfliaet Name
`4
`Minnow
`is
`Maxliow
`-3
`Minuet
`Jt't
`Mnxtfoi
`13
`(Reserve-rl'}
`
`Size Cnntcntti
`2
`First defined row on the sheet
`2
`Last defmed row on the sheet, plus I
`2
`First defined column on the sheet
`2
`Last dctined cnlutnn on the sheet, plus 1
`4
`Reserved
`
`The re. can he more than one such record in the data. When
`there is more than one record,
`the values of the same
`component in each record are going to be similar to each
`other. Tl'tl.'~‘. enables the compression system to take advan-
`tage of the similarities to increase the compression ratio.
`The compression system is preferably based on the
`ARRAY_)O{ structures, including:
`
`r\R.R_AY_ [19
`
`ARRAY lt'.1:
`
`ARIUW _3.“.
`.»\RItAY__F33:
`AI-‘.RJ\‘i'_Fu4
`Alv‘.R.i\‘f__l<'ttiJ:
`AltR..?tY_STR.:
`
`for S-bit darn its-.1115, treated by the unending systern as
`being unsigned
`for ltivhir. data items, Lrtcatcd by the encoding system as
`being signed
`for .32-bit data items. treated by the encoding system as
`being signed
`for .’-2‘-hit llonls
`for ti-t-hit tlonts ("doubles")
`for 80-bit floats ("long duublr:s"::
`each entry includes several bytes whose count is given
`hy n size (string size}.
`
`For the 'l"ABlJ3. SIZE record, live arrays would be created
`{referring to the record specification). Each time a record is
`parsed. an entry is added into each of the live arrays.
`When all
`the records had heen parsed. a [u action is
`invoked to encode the items in each array. This function
`includes at large number of algorithms, which can exploit a
`wide variety of relationships to achieve maximum compres-
`sion. The irnptementation is extremely ulficicnt
`in that it
`requires virtually no overhead: an array with only one entry
`can cven be ctliciuntly compressed. This is necessary
`because the parsing system can transform a file into ti huge
`number of these small arrays. The encoding function can
`also free the data after it finishes encoding it.
`The decoding process reverses the sequence. First, each of
`the component arrays is decoded. Then the parsing is
`reversed.
`
`As this type of parsing has to he done extensively. a
`number of tools are used to simplify the process. They are
`based on the use of instructions called vstruct_defs. To
`continue with the TABLE SIZE example, the instruction set
`to parse such :1 record would be:
`
`If the lir.-at 3 instruction.-. always provide general info about the record:
`8,
`if nLII:t1 instructions in this set
`J2,
`U size of the record body, which is a fixed sat of IE bytes
`5.
`.-3' hum arrays to he created
`It the rcntaining iruztructious tell how to parse the record:
`l‘.’\“I't t'i_.
`.-'1' Parse it
`to hit integer [or the first component t“.\«'linRov.'“}
`INTII5.
`
`
`
`
`
`US 6,253,264 B1
`
`9
`
`-oonlinucd
`
`lNTlo,
`IN'T]b.
`INTJ-1 l:
`
`.-".t Parse a 3'2 bit integer for lhc. last component E“ Rescn‘ed"]
`
`Tht: vslrLIct_de1' instructions art: 16 bit integers. The first
`three words always include the same general info about the
`record. The rust of the words include instructions which tell
`the automated parsing system how to proceed when encoun-
`tering such a record.
`in this case. the instructions are this
`simplest possible—just a list of data types for each compo-
`nent of the record. The parser would create an array for each
`component, and load one entry of that type into the array
`each lime it parscs a rccord of that type;
`Functions are then provided to automate the parsing and
`encoding of records based on thesr: instructions. Again, the
`decoding process reverses the actions of the encoder.
`The vstrucl_ clcf instructions can hand]: situations