`
`US('J(}6Z’._‘-'s3264Bl
`
`(12; United States Patent
`Sebastian
`
`(10) Patent No.:
`145) Date nf Patent:
`
`US 6,253,264 B1
`*Jun. 26, 2001
`
`341151
`3*J:‘\_r7S5
`
`341_-'1”?
`.’
`3411-107
`'?U"?1'll'12
`”.’IT".*ll'H
`3951'] 14
`
`5.ns4.4?aa * 11:1-"an Panauus-sis
`5.708.828 *
`l_1l‘J9R (‘oicmau
`
`
`
`1rI9‘_J8
`tjvrmish et a1-
`fiulhficr
`--------
`.3
`‘D mes
`2.11000 Barhir
`.............
`
`1111000 Fallelal.
`
`§.-?10.5=j»2 *
`”
`3,
`_.
`*
`5.367.114 *
`
`5.0u1,51:'~
`
`,
`,
`1
`,
`,
`, _‘,
`_
`J
`[DRl;lGN FAILNI DUC UMbN rs
`1: 729 117 A2
`311990 (129) .
`
`OT} IER PU Bl JCATIONS
`
`Rica, 11.11, :1 a1.,"V'[_.Sl Universal Noisclcss Cocl1:r,"1'\«"H.5'
`Tech Notes 231.31‘ . p. 334 (Mar. 1.1990}.
`
`(54) conmc NE'I‘WORKGR()UP[NG DATA or
`SAME DATA TYPE I_N'|‘() BI;()C](S USING
`
`FILE l').«‘\‘I‘A S’l‘RlJC'l‘URE AND SELI-ZCTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`‘-
`_
`v 1
`,-
`“ASL 0” BLOCK DATA Wm‘
`lnvenlor: William Se-I:mslian.|7almnu1h. MA
`(U5)
`
`(75)
`
`(73) Assignuc:
`
`_
`(‘J Hutu.-.0:
`
`Intelligent Compression ’l‘eclInul0gles.
`mmomh‘ MA (US)
`_
`_
`Thxs patent Lssucd on :4 conlinucd pros-
`cculiorl uppliculiun filed under 37 CTR
`l.53(d'), and is subjccl lo the lwunly year
`""“'i5i°”5 “F 35 U'S‘C'
`
`Sub-leul many di”"‘l3i"1°r'lh“ l°"T' Oflhifi
`patent is mended or adjnslcd under 35
`U.S.C.
`|54(h} by 0 clays.
`
`"Competitive Parallel Processing for
`el al.
`Diner DJ}.
`Compression of De1ta,".N1"!S T9u::'1N0fe'.5' 2301. p. 379111/111;:
`I‘ _199[])_
`
`(211 App]. No; 09;035,1152
`
`(22)
`
`I-'ilc.-Ll:
`
`Mar. 6, 1998
`
`“ Cilcd by examiner
`
`Prirrrary .'L’.\':I111£.r_:e1‘—Tl_10mas Lee
`Asslwlfllll Exmmner—1aDh Nguyen
`(74) A!mr1Ir.’_\j. Agcm_. w" I-'1'rm—IIa111ilton, Brook, Smilh &
`Rcynnlds. PC.
`
`(57)
`
`ABSTRACT
`
`A prcslbtrcd coding, nclwork uses an .1r-;:hit::r:lI.11'c.=. called 3
`Ba:ac:—I*'il1cr—Rc.-murcc {B1-'11) sys1cn1.
`'l“I1is. approach inte-
`.
`, .
`,
`,
`_
`_-
`,
`-
`-
`3',‘"°S_ 1"“ "d‘.".‘"“fg"S “f .&T''““' S'f°‘_’fi°. ‘°m"'°i5'"" TE“
`gc.nc1'._1 -purpose <.omprc:~sJr_Jn loo s1..1'v1n_g a w1c 1:
`ran_1,<._ 0
`dala lL‘|l'lTlfll5. Source (1013 19 parsed 1111:: blocks 01 s1r111[z1r
`data and each parsed blacks are. compressed using a respec»
`lively selected cnmprcssinn algorithm. Thu: algnrilhrn can be
`chosen from a stalic m0d1:l0l‘I|'1v.-.- data or can be adaplivc 10
`[he data In ll]: Pa-rsfll block‘ The PMS‘:-d blocks are lhlm
`combined inlu an encoded data file.
`l"or dumding,
`Ihc:
`.
`pmccss as rcvcrsccl.
`
`29 Claims. 8 Drawing Sheets
`
`'
`
`‘.3
`
`_
`
`(60)
`
`_
`(51)
`
`‘
`Int. (,1.
`
`7
`
`(52) U.s. Cl.
`
`Related U_S_ Applicafim Dam
`Provisional application No Dl'J;'f]3E1 ‘I48
`filed on Mar
`19:;-;_‘
`I
`‘E
`I
`I
`‘
`............................ .. GUISP 17,130: G06? 3300:
`[£06]: SE00; I-103M "NOD; HUJM 7130
`............................... .. 71D!68; 710..-'05; 341151;
`141 ,Im.
`‘
`'
`3-—1U5l. E15. IU7:
`{5H) Field nf S-Barcll
`7r.m'102. 503, 504. 505. 505; 3701474;
`_/lmfis (J8
`"
`
`References Cited
`US. PATENT DDCIIMENWE
`5.293.379 ‘
`3r}l‘J‘J=1 Carr
`.................................... _T.u"[I1"J4.l
`341.-'51
`5.-1£’.Ll'|R’.-"
`“)5 C
`'
`l
`'
`H“
`b"
`395f.1lI0.l3
`“
`5.61154]
`4,"l9‘J'I-‘ Alhancsc at
`395177!)
`‘
`5,f_v32,lI1‘J
`5!l‘l0? R00 el al.
`.... ..
`5,638,408 '
`631907 'l')'lcr cl :11.
`.
`.. 3953117
`
`
`
`I
`
`(56)
`‘
`
`
`
`Oracle 1104
`
`Oracle 1 104
`
`
`
`U.S. Patent
`
`Jun. 20, 2001
`
`,sheet 1 of8
`
`US 6,253,264 B1
`
`SENDING APPLICATION
`
`SOURCE DATA 2
`
`ENCODER
`
`ENCODED DATA 4
`
`DECODER
`
`§
`
`§
`
`DECODED DATA 6
`
`Z
`
`RECEIVING APPLICATION
`
`FIG. ‘I
`
`PRIOR ART
`
`
`
`
`
`zofiomzmm__
`
`s_Ew>mzotomdm..
`
`
`
`
`N<._.¢Dmomnom
`
`NN.xN_m_._.:__n_
`
`U.S.Patent
`
`¢_mm_:mo
`
`~§__._.§.....mm:
`
`m><mw_<
`
`ommm<a.qr
`
`vmxwz_mmqn_
`
`s_Em>m
`
`Jun.2(._.2001
`
`mzo_E:w:mz_
`
`oz_wm<n_
`
`
`
`onnmN__2o_.w:o
`
`><mm<
`
`Sheet2of8
`
`
`
`m2mon_mz<m.__20._.m3U
`
`mm
`
`._.zmzOn_S_OU
`
`Smm::aos_
`
`mm
`
`omooozmmmm:m_s_<m<n_
`
`
`E59N0_u_
`mm.
`
`.wz_n_oozmx
`
`Us6,253,264B1
`
`
`
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 3 of 3
`
`US 6,253,264 B1
`
`FIG. 3
`
`
`
`P3U
`
`.m
`
`m
`
`%
`
`N.4
`
`2,33,6S
`
`m
`
`
`
`
`
`Pm.mmooo>¢mm<Eomm:m_oo_2oEEw.
`
`
`
`
`
`S.K.
`
`
`
`
`
`n.mozihmm<29w2_Emom_5_am_mn_n_O5:
`
`MS
`
`
`
`mm._._oEs_0235
`
`Mmms__E
`
`MV.O_.u_
`
`
`
`
`
`ummaoo><w_m<Emmooota»
`
`
`
` .3.+$9001022mwz_Em310222:
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Jun. 2(._. 2001
`
`Sheet 5 of 8
`
`I.B462.352.,6SU
`
`cmX%
`
`<55ba.O.E
`
`S
`
`KMDOONu.
`
`
`
` mm_Em>zooo_.mm¢m#mm3<>n_m:o_.<_2z:
`
`
`
`
`
`NEQOO><mm<EOE“.
`
`
`
`w;<mmF_._wmmmzo_wmm>zoowpzmzoaxm.o_bw<wm:z§_Sb
`
`
`
`
`
`m.O_n_
`
`8mmmmmm
`
`
`
`
`
`
`
`MMDODozztwO._.mmoou><«_m.qO._.mmooo><mm_¢O._.mmooo><zm<O._.
`
`
`
`
`
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Jun. 2(._. 2001
`
`Sheet as of 3
`
`US 6,253,264 B1
`
`SAMPLE DATA BASES
`
`1
`
`GENERATE PARSED FILES
`
`PARSED FILES
`
`l
`
`ANALYSIS RESULT FILES
`
`ANALYZE PARSED FILES
`GENERATE PREDICTOR DATA
`
`PDAT FILE
`
`1
`
`FIG. 6
`
`
`
`U.S. Patent
`
`Ju11.26_.20lll
`
`Sheet 7 of 8
`
`Us 6,253,264 B1
`
`25
`
`ARRAY TO BE CODED
`ARRAY DATA
`V
`BUILD MODEL
`
`T1
`
`MODEL
`
`ARRAY DATA
`
`‘FD
`/
`
`73
`
`ENCODED MODEL INFO
`
`
`
`75
`
`YES
`
`SPLITTER
`
`-77
`
`TWO OR MORE ARRAYS
`PROCESS SEPARATELY
`
`
`
`
`USE
`NUMERIC
`TRANSFORM
`
`"39
`
`NUMERIC
`ALGORITHMS
`
`91
`
`OFFSETS FROM
`MODEL
`
`95 SPLI'I'I'ER
`
`TWO OR MORE ARRAYS
`
`PROCESS SEPARATELY
`
`
`
`TRANSFORM
`
`100
`
`FIG 7
`
`79x
`
`YES
`
`MATCHING
`ALGORITHMS
`
`81
`
`
`
`MATCHING
`* LGORITHM
`
`35 A.
`
`NO
`
`UNMATCH ED
`VALUES
`
`OVEEELZSLEEMY
`
`ARRAYS CONTAINING
`
`MATCH DATA
`
`83
`
`CONVERT TO INT32
`
`
`
`U.S. Patent
`
`Jun. 26, mm
`
`Sheet 3 of 8
`
`US 6,253,264 B1
`
`101
`
`ARRAY DATA IN
`
`103
`
`RANGE
`
`REMAPPER
`
`
`
`
`
`
`1 O5
`
`EXTRA BITS
`
`TO PACK
`
`MTF CODER
`
`109
`
`
`
`RANGE CODES
`
`NO
`
`
`
`RANGE CODES TO
`
`LOW LEVEL CODER
`
`FIG. 8
`
`
`
`US 6,253,264 B1
`
`1
`CODING NETWORK GROUPING l)A'I'A OF
`SAME DATA TYPE INFO BLOCKS USING
`I-‘ILIE DATA STRUCTIJ RE AND SI-ZI.EC'I'I NG
`[IOM]’RESSION FOR INDIVIDUAL BDOCK
`BASE ON BLOCK DATA TYPE
`
`RELATED APPLICATIONS
`
`This application claims the benefit of US. Provisional
`Serial No. (iU,=’036,S43 filed Mar. 7. 1997. the teachings of
`which are incorporated herein by reference in their entirety.
`BA(Tl\"(}R()UNl) 01- THE INV'liN’l'I()N
`
`1U
`
`l-Iigh perfonnance data compression systems use models
`of the data to increase their ability to predict values. which
`in turn leads to greater compression. The best models can be
`achieved by building a compression system to support a
`specific data format. Instead of trying to deduce a crude
`model from the data within it specific tile, ti format-specific
`compression system can provide it precise prc-determined
`model. The model can take advantage not just of the file
`formal structure. but also of statistical data gathered from
`sample databases.
`Previous efforts at form at-specific compression have heart
`focused on solutions to a few individual formats rather than
`on the development of a generalized method that could lac
`adapted to many formats. The models which have been
`created typically involve a small number of components.
`This works adequately when most of the data is included in
`a few components, such as an image file having mostly red,
`blue. and green pixel values.
`l-lut may formats are best
`modeled using a large number of components. and the
`previous systems are not designed to build or encode such
`models.
`
`SUMMARY O1" Tl-IE INVENTIIJN
`
`A preferred coding system in accordance with the inven-
`tion solves both of these problems in the prior art: it provides
`a generalized solution which can he adapted to handle a wide
`range of formats. and it etfeotivcly handles models that use
`large numbers of components. The system involves new
`approacltcs at many levels: from the highest (interface) level
`to the lowest (core encoding algorithms) level. The coding
`network includes a compression network for encoding data
`and a decompression network [or decoding data.
`At the highest level, a pr-:.l'r:rrcd compression system in
`accordance with the invention uses an architecture called a
`Bast:-Filter-Resource (BFR) system. This approach inte-
`grates the advantages of forntat-specific compression into a
`gcrIeral—purpose compression too] serving a wide range of
`data formats. The system includes filters which each support
`a specific data format. such as for Excel XLS worksheets or
`Word DOC tiles. The base includes the system control
`modules and a library used by all the filters. The resources
`include routines which are used by more than one lilter, but
`which are not part of the base. If a filter is installed which
`matches the Format ofthe data to be encoded. the advantages
`of l'orn1nt-specific compression can be realized for that data.
`Otherwise. at "generic" filter is used which achieves perfor-
`mance similar to otlier non—speciI':c data compression sys-
`tems {such as Pl(Zip, Stacker. etc).
`At the next level, the system pre lerably includes il method
`for parsing source data into individual components. The
`basic approach, called "structure flipping," provides a key to
`converLi ng format
`infornuition into compression models.
`SLructurc flipping reorgallizes the inform:1Lion in a file so that
`similar cornponc-uts that are normally separated are grouped
`together.
`
`EL}
`
`3!]
`
`3-5
`
`all}
`
`St)
`
`till
`
`0'5
`
`2
`Upon this Foundation are :1 numbers of tools. such as:
`a language to simplify the creation of parsing routines;
`tools to parse the source data using this method into
`separate componcnLs; and
`tools to generate models for individual components by
`automated analysis of sample data bases.
`Tliusr; tools can be applied in the filters for -El wide range of
`llle and data types.
`The compression system preferably includes tools‘ called
`customized array transforms l'or specific litters and for
`certain types of filters. These techniques handle at specific
`file type or certain data constructions used by several file
`types, such as encoding two dimensional arrays of data as
`found in many database tbrmats.
`At the low-level of the system are preferably a number of
`mechanisms for encoding data arrays, including:
`new low-level encoding algorithms;
`methods for integrating a large number oftransforms and
`c-rtcoding algorithms;
`methods for eliminating overhead so that small data
`blocks can he eificicntly coded; and
`a new method for
`integrating both static models
`(determined from statistical analysis of sample
`databases) and dynamic models (adapted to the data
`within a panicula: array) into the encoding of each
`component.
`A preferred method of encoding source data: comprising
`parsing the source data into at plurality of blocks. The parsed
`blocks typically have it rlifferent format than the source data
`format. In particular, similar data from the source data are
`collected and grouped into respective blocks.
`For each block, a compression algorithm is selected from
`a plurality of candidate compression algorithms and applied
`to the block. The compression algorithms can be determined
`based on the amount of data in the respective block.
`liurthcrmore, the compression algorithm can be adapted to
`the respective block,
`including the use of at customized
`transfonn. The selection of an algorithm can also be based
`on a compression model. which is rteriverl from the format
`of the source data. The compressed data from the plurality
`of blocks are then combined into encoded data.
`
`The coding network can also include a decompression
`network to convert the encoded data back into the source
`
`data. First the data is decoded and then the parsing is
`reversed. In a lossless system, the resulting data is identical
`to the source data.
`Embodiments of the invention preferably take the form of
`machine executable instructions embedded in a machine-
`rcarlahtc format on a (‘l)—R()M. floppy disk or hard disk, or
`another machine-readable distribution medium. These
`instructions are executed by one or more processing units to
`implement the compression and decompression networks.
`BRIEF [)E.S{.'.RlP'l'l0N OI" '['l{E DRAWINGS
`
`The foregoing and other objects, features and advantages
`of the invention will be apparent from the following more
`particular description of preferred embodiments of the
`invention. as illustrated in the accompanying drawings in
`which like reference characters refer to the same parts
`throughout the different views. The rlrawings are not nec-
`essarily to scale, emphasis instead being placed upon illus-
`trating the principles ofthc invention.
`FIG. I is a high-level hlock diagram of a typical data
`compression system.
`Fit}. 2 is a block diagram nfa preferred encoder using the
`BFR system in accordance with the invention.
`
`
`
`
`
`US 6,253,264 B1
`
`3
`FIG. 3 is a llow chart of an array coder.
`FIG. -'I- is :1 block diagram of the string coder 40 of FIG.
`
`3
`
`FIG. 5 is El block diagram of the float coder 50 of FIG. 3.
`FIG. 6 is a flowchart of a preferred alttomtlled analysis
`system.
`FIG. 7 is at llowchttrt ovcrviewing the higli-level of the
`integer coder.
`l"l(.i. 8 is a flowchart overviewing the mid-level of the
`integer coder.
`
`5
`
`ll}
`
`DETAILED IDESCRIPTIONS OF PREFERRED
`EMBODIMENTS OF THE INVENTION
`
`FIG. 1 is it higb—|cvcl block diagram of :1 typical data
`corn pression system. The primary components ofthe system
`are a sending application I, an encoder 3, a -clc-coder 5. and
`a receiving application 7.
`The sending application 1 ])]T.Nlt.llt:S source data 2 to be
`encoded hy the encoder 3. 'l‘he sending application can be a
`file arcltivcr. a communication system. or any other appli-
`cation with a need to represent data using fewer bytes than
`in the native format.
`The encoder 3 receives the source data 2 and uses data
`compression algorithms to represent the data using, fewer
`bytes.
`'l'l1c typical implementation of the encoder 3 is as
`software running on a general purpose computer, although
`the algorithms to be described may also be implemented on
`specialized hardware.
`The output of the encoder is encoded data 4. which can be
`stored in rnemory (in which case the encoding allows more
`source data to be stored within the same Iiardwarc), or
`transmitted to the dccocler5 (in which case the encoding the
`source data to be transmitted faster when the transmission
`
`channel bandwidth is limited).
`The decorler 5 retrieves or receives the encoded data 4 and
`reverses the algorithms used to encode the data so as to
`reconstruct the source data as decoded data 6. In a lossless
`system, the decoded data 6 is identical to the source data.
`Depending on the application, however, some data loss may
`be acceptable. As with the encoder 5,
`the decoder 5 is
`typically embodied as software running on a general purpose
`com puter, but can be implemented in specialized hardware.
`The receiving application 7 receives the decoded data 6
`for processing.
`A preferred compression engine includes filter-based
`encoders and decoders which can be utilized by a large
`number of st: riding ztpptications l and receiving applications
`7. While the applications 1.7 that storctretricvc or transmit.-'
`receive the encoded data are not part of the compression
`system, they may adjust encoding and decoding operations
`according to system conditions such as available transmis-
`sion channel bandwidth.
`
`The descriptions to follow will mostly cover the encoder.
`The decoderjust reverses the encoder‘s actions.so H descrip-
`tion of the encoder is nu liicicnt
`to describe the overall
`cncodingjducotling system. The term “File" will be used to
`describe a block of source data, which matches the conven-
`tional usage, but which should also be understood to be
`applicable to many other types ofdala blocks such as a data
`exchange in it Client-Server application.
`ENCODER
`
`FIG. 2 is 2!. block. diagram of it preferred encoder using :1
`Base-Filter-Resource {l3I'"R) network in accordance with the
`
`3!]
`
`35
`
`4!)
`
`5!)
`
`till
`
`65
`
`4
`invention. The encoder 3’ is based on the use of :1 plurality
`of filters ltln. .
`.
`. . ltlx, .
`.
`. , II}: which serve specific file
`formats. for example, one filter 10:: might support several
`versions ofthe DEF database format, While another filter 10:
`might support several versions of the DOC‘ format used by
`the Microsoft Word software program. The individual filters
`provide respective selection criteria 12 to at tiller seloction
`system 22.
`The filter selection system 22 receives the source data 2
`and checks the selection criteria l2n, .
`.
`.
`. 11:‘, .
`.
`.
`, 12; of
`all filters 10:1,. .
`. , 103:,
`.
`.
`. , I0: installed in the system to
`see if any of them support the son roe data's format. If not,
`a "generic" filter is used which provides compression per-
`formance similar to other generic compression systems,
`such as Lcmpcl-Ziv (L2) engines. In a particular preferred
`embodiment of the invcn1inn.1l1e generic compression sys-
`tem employs an S211’ engine as described by Mr. Schindler
`in US. application Ser. No. C|8.’970,32D fded Nov.
`I-1, 1997.
`the teachings of which are incorporated he rein by reference
`in their entirety. The descriptions of the network will pri-
`marily cover the situations in which a filter to support the
`data format is successfully found.
`Briefly. it parsing system 24 parses the source data 2 into
`:1 large number of parsed Arrays 25, which each include all
`the instances of a particular component in the source. tile.
`The term “Array", described in further detail below,
`is
`capitalized to indicate that this refers to a type of structure
`specific to the network, as opposed to the usual use of the
`term "array“. The algorithm used to parse the source data 2
`is called "Structure Flipping“, an important aspect of the
`network. described in further detail below.
`The BFR system uses an automated analysis system (also
`described in detail below) to build ti. model for each Array
`25, which will he used when the Arrays 25 are encoded in
`an array coder 28. In some cases, butter compression can be
`achieved if customized array transforms 26 are first used.
`While the models generated by the automated analysis
`syste tn proccs-'. each component Array 25 independently. the
`customized routines can take advantage of inter-component
`relationships.
`The array coder 28 encodes the Arrays 25 using a count-
`depcndent modeling system. which includes a mixture of
`static models {constant
`for all databases) and dynamic
`adjustment to each particular Array 25.
`The tiller lllx has components to support each of the four
`sections of the encoder 3'. The selection criteria 12x. deter-
`mined by the file format specification. includes information
`needed to recognize files served by the filter 10.x‘ such as byte
`values at the beginning of the tile or file Iitlc sulficcrs. Parsing
`instructions 14, also determined by tht.‘
`file format
`specification.
`include the information needed to parse the
`source file into parsed Arrays 25 having all the instances of
`a particular component
`in the source tile. The custom
`component models 16 are constructed by referencing both
`the file form at specification and sample databases. They can
`take advantage of inter-component relationships to trans-
`form some of the component Arrays 25 to make them more
`compressible. Encoding parameters til are generated by an
`automated analysis system which examines component
`Arrays 25 parsed from a large number ofsamplc databases.
`The encoding parameters 18 provide the data necessary to
`support the count-dependent modeling system in the array
`coder 28.
`l~‘1l.'1‘LiR Sl:Il.B(.'l1DN SYS'l"l:‘.M
`A preferred compression system uses an cxptintlablc
`architecture that allows users to support new or updated
`
`
`
`
`
`US 6,253,264 B1
`
`5
`formats by adding or replacing filters [I]. As noted above, a
`filter 10 can support more than one t.'orn.'tat. For example, a
`filter called “l.tJ'['US123v-'1" might support all tile formats
`associated with the Lotus [23 program, such as the current
`WK-at format and the earlier WK3 antl FM3 formats. If :1
`
`Filter was later developed for a new Lotus WK5 formal, a
`user replaces the old filter with one supporting the new
`format and which would also he hacl€w.:t.rdly'
`compatible
`with the earlier filter with respect to the older fortrtats.
`lo the Microsoft Windows environment, each lifter is :1
`separate dynamically linked lihrary (t)t.l.) tile. In addition
`to data tables. a filter includes executable code.
`
`If a filter is found for :1 lite, then it is dynamically linked
`and then called to encode the file. The encoded data stre ttlTt
`includes an identification (ID) code indicating which filter
`was used to encode the data, and the decoder has to have a
`corresponding decoding filter.
`To implement
`the filter selection system 22, the base
`module maintains a registry of the currently installed files,
`which includes the data need to identify source tiles of that
`type as well as the path to the DLL. This registry is updated
`each tirne at filter is added or replaced. The identification data
`can include both it list of file title. su fficcs and a method to
`identify files vial byte values near the beginning of the file.
`The identification procedure uses H search tree method to
`reduce the time needed to check the list when many filters
`are installed. From this registry, the appropriate filter It} is
`invoked for the file type.
`To handle compound file formats such as Microsoft's
`01.132, one filter can invoke another filter to encode a
`sub-iilc or stream within the compound file. For cxa mple, :1
`Winword I'ilter might call an Excel Filter to encode some
`Excel data which was embedded withiti a compound docu-
`ment lile.
`PARSING SYSTEM
`
`large
`As noted above, the parsing system 24 creates it
`nu mhcr of parsed Arrays 25. E.:ich parsed Array 25 includes
`all instances of a particular cornponcnt in it tile. A separate
`component
`is created for each member of each de-lined
`structure type.
`
`EXAMPLE 1
`
`For example, consider the following, description of a record
`dclincd in a file format description.
`(Tell Value:
`
`The format uses a special method of coding numbers in an
`internal numher type. called an Rh’. number, to save
`memory and disk space.
`Record Data:
`
`Offset
`
`Nnmc
`
`Size
`
`Contents
`
`4
`0
`3
`
`1U
`
`M
`col
`frrtt
`
`nun-t
`
`Z‘.
`2
`1
`
`4
`
`Row number
`Folumrt nurnlaer
`Index to the record that includes;
`the cell format
`the coded nttrnber
`
`The parsing system 24 creates parsed Arrays 25 for each
`of the four components of this record. Each time ttn R]-3'.
`record is found, an entry is added to each array. The parsed
`Arm ys 25 are much more com prcssible than the raw source
`data, as shown by data from live records:
`
`IE]
`
`EL]
`
`3-5
`
`40
`
`SU
`
`an ‘J’:
`
`t'.iU
`
`65
`
`Rec
`
`Row
`
`1
`2
`3
`4
`5
`
`L'Ix|'.|l}L'll
`L'.Ix[Iht_'rl
`L"Ix|I|t}IJl
`L.'IxEltJl'l1
`E.'Ix|I]tJO]
`
`Col
`
`Uxtltltll
`tlxtlfltlj
`nxnot)-t
`l})ctJt'Jt)fi
`IJ‘.-ctJtJt)‘t
`
`index
`
`OKODSS
`0x005 f
`0xnos3
`f.lJ(UU5f
`fJi(tJU5f
`
`Nuttt Val
`
`Etx3tT1e ltifl
`ftx3tI1a3t)t)
`fix}-tT1dDt)|}
`OX3-l1'1c-‘IUD
`E.t)(.3-fl"l-'.I31}D
`
`The source data would appear almost random to a byte-
`matching algorithm (shown in hex):
`0| (JOD2D0550tlf|'0t:—l fl3fDl {I003-0D5fDODD a3i'l
`3t'0l [JUO-1[tt]53DUUftd0ii 3f
`01 {)0 (I6 00 5f [)0 DU «:4 cl fl 3fUl {I0 {J7 Gt} 51' 00 (ID d3
`ll 3?
`But when handled as four separate Arrays 215 (Row, Cc],
`Index, and Num) using algorithms optimized for each
`component,
`the data can compress significantly. This
`approach is referred to herein as "structure llipping” because
`of the way it rearranges the data.
`To encode together the instances of the same component
`i.n a database is ti common compression technique. and the
`file format is often usetl to identify such coroponents. But in
`the prior art. this approach has been limited to isolating a few
`major components of it specific file formal. and the file
`format description used only as a technique of finding these
`major components. In ttccordance. with preferred embodi-
`niertts of the invention, a file format is actually converted
`into a compression model, which is then interfaced into an
`‘automated system for analyzing the data and opljmizuig the
`ending performance of each component. While this com-
`pression model is usually enlianccd by the customized array
`transforms 26. it is still an aspect of the overall system.
`The extent
`to which preferred approacltes differ frorrt
`prior art systems is reflected in the new tools constructed in
`order to implement it. This approach creates a large number
`ofhrrays 25. Each Array 25 requires a dilie rent cornpression
`model, and the number of entries in each can change
`significantly from tile to tile. In order to work clliecljvcly in
`such a case, prefe-rred ernbodiments of the invention employ
`new approaches in how the low—level coding is handled such
`as the elirrtinatinn of overhead, count-dependent processing,
`and the use ofa processing network instead of fixed coding
`algorithms. 'l‘ht:s.r.-. tech niques, described above, are pre fcrred
`in order to handle the output of the parsing system 24.
`All of these implementation features are integrated at a
`high level, so that high-level calls can handle all of the
`operations needed to initialize the system. parse the data,
`and encode the Arrays.
`In accordance with it particular preferred embodiment of
`the invention,
`the life parsing system 24 uses three sult-
`systems:
`
`FIL L‘l_fl L7l"H:'. R
`A R R AYES
`t‘\«':i't't{t.'C‘t"
`
`[npLtI‘.O'utpltI Intcrfticc
`To stoic the data for it single component
`To paint: the input into Armys or to write the decoded
`At'ru_\':i into the Output strunm
`
`A plurality of FlLE._BUFFER routines provide a flexible
`inputtoutput intcrtaot: using FILE _ BUFFISR structures. so
`the same set of parsing routines can he used regardless of the
`inputioutput formats. This allows the compression system to
`be independent of the UL} format. The file parsing system
`preferably supports two lilo thrmatsz DOS and ULE2, but
`this can expanded to other file formats as well as to inter-
`
`
`
`
`
`US 6,253,264 B1
`
`7
`faces to communication ports and other [IO options. 1"or
`DOS tiles, the logcal stream positions are usually the same
`as the raw file positinms. The t'tL|:'Z format requires an
`extensive layer of code to follow an OLEZ stream within the
`container file.
`A plurality of ARRAY routines provide at system for
`managing the process of initializing. adding data to.
`encoding, and freeing arrays ofdata. Each ARRAY structure
`includes one array of data and the information needed to
`manage these processes. such as the number of entries in the
`Array 25, the size of the hul1"er currently allocated for the
`array. and information which will be useful in encoding the
`array.
`In particular, ARRAY_XX structure stores all the data
`from a component, and includes mechanisms to stlpport first
`sequential
`rcadrwrite access to the data, an expandable
`bufier for the data. and storage ofrnoclel data. This structure
`also allows a simple set of tools to handle all allocation,
`parsing, and coding functions.
`the
`XX nomenclature
`indicates that slightly dilferent ARRAY structures are
`delirted for different data types, such as dilierent integer
`word sizes, strings. and floating point types, although many
`of the processing routines work on all of the variations.
`A plurality of [’VS'l‘RU(.‘l' routines make up an auto—
`mated system for parsing the file data. During encoding.
`they read data from the F'[I.E_ BUFFER structure and Write
`data into Arrays 25. During decoding, they take data from
`the Arrays 25 and write it into the t~‘tI.t-L _BLIt~‘t-‘ER struc-
`tures.
`routines can take a high-level description of a
`tile format and handle all the tasks associated with managing
`the parsing as well as the encoding ofthe parsed data. Many
`of the [Filters
`It} will also include custtamiwcd parsing
`routines which call the l-‘Il.E BUFFER and ARRAY rou-
`tines directly instead of using the PVSTRUCT routines. But
`an aspect of the parsing system 24 is the PVSTRUCT
`routines, and this discussion will be concerned primarily
`with their operation.
`A PVSTRUCT structure includes all information needed
`to parse and code a data structure defined in a file format.
`This usually involves multiple components. so it provides
`and handles individual Arrays 25 for each component. The
`term "pvstruct"' comes from Predictor Variahlc Structure,
`where Predictor refers to the use of statistical data gained
`from analyzing a large number ofsamplc files to assist in the
`encoding of the data, and Variable indicates that it supports
`structures where the component counts can vary at run time
`for each instance (called "dynamic stnictures” in some
`textbooks).
`A parsing language is a defined instruction St.‘-T. which
`provides a concise way to specify the parsing of a
`PVSTR UCT structure. The instruction set includes support
`for a wide range of dynamic variation, including where the
`counts of components are dependent on tile counts of other
`components as well as on external factors. It incorporates
`other file forirtat data, such as defined values or range
`restrictions. It also provides a menns of exchanging data
`between the parser and the tiller.
`The compression system is based on parsing :1 file so as
`to place items together that art: similar. In the following
`description ofthe compression system, examples illustrating
`some structures which might be found in a typical spread-
`sheet file fonnat.
`
`EXAMPLE 2
`
`Spreadsheet files typically include :1 series of records.
`Each record begins with it 16-bit word indicating the type of
`record followed by a 16-bit word giving the number ofhytes
`
`1U
`
`EL?
`
`.10
`
`35
`
`4U
`
`St}
`
`in -J.
`
`till
`
`65
`
`8
`in the body of the record. The file specification then
`describes the contents of each record. A speci.llcation for a
`TABI_.l£ SIZE record might hc:
`
`RecType-=-0x402
`Rec Body Sizc=12
`Re-cord Data:
`
`Ufliaet Name
`4
`Minnow
`is Maxkow
`ti
`Mlucol
`Jt't
`MnxCol
`12
`(R:Escr\'r.-d}
`
`Size Contents
`1
`First defined row on the about
`2
`Last defined row on the sheet, plus 1
`2
`First defined colunm on the sheet
`2
`Last dctincd cnltlrrtn on the sheet, plus I
`4
`Regen-ed
`
`There. can be more than one such record in the data. When
`there is more than one record,
`the values of the same
`component in each record are going to be similar to each
`other. T|1is enables the compression systcni to lake advan-
`tage of the similarities to increase the compression ratio.
`The compression system is prel'cr:tbly based on the
`ARRAY_)O{ structures, including:
`
`r\R.R_.t\Y_ [19
`
`ARRAY lit:
`
`ARILAY __32.:
`.»\RItAY_.F33:
`J\RR.1\‘i’_Fo4
`Attth-’\Y__l<‘tttJ:
`ARR.AY_STR.:
`
`for 8-bit data it¢‘.T!15, treated by the encoding system as
`being unsigned
`for ltivbir. data items, Lrtcatcd by the encoding system as
`being signed
`for 32-bit data items. treated by the encoding sy.-nt:n1 as
`being signed
`for 32-bit tloals
`tor ti-t-bit tlonts ("douhlcs"J
`for 80-bit floats ("long douhleeft
`each entry includes sevenal bytes whose count is given
`hy n size (string size}.
`
`For the TABIJE. SIZE. record, live arrays would be created
`{referring to the record specification). Each time a record is
`parsed. an entry is added into each of the Five arrays.
`is
`When all
`the records had been parsed.
`it
`[11 uctiort
`invoked to encode the items in each array. This function
`includes at large number of algorithms, which can exploit a
`wide variety of relationships to achieve maximum compres-
`sion. The implementation is extremely eflicicnt
`in that it
`requires virtually no overhead: an array with only one entry
`can even be efliciuntly compressed. This is necessary
`because the parsing system can transform tl li.lc into ti huge
`number of these small arrays. The encoding fttnction can
`also free the data after it finishes encoding it.
`The decoding process reverses the sequence. First, each of
`lhe component arrays is decoded. Then the parsing is
`reversed.
`
`As this type of parsing has to he done extensively. a
`number of tools are used to simplify the process. They are
`based on the use of instructions called vstruct_defs. To
`continue with the TABLE SIZE example, the instruction set
`to p.-trse such at record would be:
`
`Jr‘ the lir.-at 3 instruction.-. always provide general info about the record:
`8.
`it near i.n.stt't.tction.s in this set
`J2,
`U’ size oi the record body, wtiich is n lixcd st‘. of IE bytes
`5.
`.-3' num arrays to he created
`It the rentaining lt1.t:tt't.I(.‘|.iOtJ£ tell how to parse the record:
`l‘.’\“I't ti_.
`If Parse it
`to hit integer [or the first component t“l\«'IirLi+'.ov«’“}
`INTIG.
`
`
`
`
`
`US 6,253,264 B1
`
`9
`
`-continued
`
`tNTia__
`I.N'T]t:.
`[HTS-1 l:
`
`.-".t Parse a 3'3 bit integer for the. last component If“ Rest:n‘ecl"]
`
`The vstrLIct_de1' instructions are 16 bit integers. The lirst
`three words always include the same general info about the
`record. The rest of the words include instructions which tell
`the automated parsing system how to proceed when encou n-
`tering such a record. In this case. the instructions are the
`simplest possible—just a list of data types for each compo-
`nent of the record. The parser would create an array for each
`component, and load one entry of that type into the array
`each time it parses a record of that type;
`Functions are then provided to automate the parsing and
`encoding of records based on these instructions. Again, the
`decoding process reverses the actions of the encoder.
`The vstruct_ def instructions can handle situations much
`more complex than the previous example. For convenience,
`a summary of the construction of these inst