`(10) Patent N0.:
`US 6,253,264 B1
`9
`
`Sebastian
`45 Date of Patent:
`*Jun. 26 2001
`
`U5006253264B1
`
`(75)
`
`(54) CODING NETWORK GROUPING DATA OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTURE AND SELECTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`BASE 0N BLOCK DATA TYPE
`.
`.
`.
`.
`Inventor‘ Wllham Sebastlan, Falmouth> MA
`(Us)
`:Itll'tC
`'Thl'
`131311128151? Mznggggssmn ec “0 Ogles’
`This patent issued on a continued prOS-
`ecution application filed under 37 CFR
`1.53(d), and is subject to the twenty year
`t
`t
`t
`'
`'
`f 35 U.S.C.
`11)::(Iéll)(2)erm pr0v1510ns 0
`.
`
`73 AS'
`)
`(
`Slgnee
`(*) Notice:
`
`5,684,478 * 11/1997 Panaoussis ............................. 341/51
`1/1998 Coleman .............. 395/785
`5,708,828 *
`
`1/1998 Gormish et a1.
`.
`..... 341/107
`5,710,562 *
`
`3:23:23 * 131333 3119554 ~~~~~~~~~~~~~~ 333%;
`
`..
`,
`,
`*
`0 mes
`2/1999 Barbir ...................... 341/107
`5,867,114 *
`
`.
`..... 707/102
`3/1999 Mochizuki et al.
`5,881,380 *
`
`5,983,236 * 11/1999 Yager et a1.
`..... 707/104
`............................. 395/114
`5,991,515 * 11/1999 Fall et al.
`
`FOREIGN PATENT DOCUMENTS
`0 729 237 A2
`8/1996
`EP .
`(
`)
`OTHER PUBLICATIONS
`
`Rice, R.F., et al., “VLSI Universal Noiseless Coder,” NTIS
`Tech Notes 2301, p. 234 (Mar. 1,1990).
`
`Subjectto any disclaimer, the term Of this
`patent 15 extended or adjusted under 35
`
`Diner, D.B., et al., “Competitive Parallel Processing for
`Compression Of Data,” NTIS Tech Notes 2301, p. 379 (May
`
`(21) Appl. No.2 09/036,052
`(22)
`Filed:
`Mar. 6, 1998
`
`* cited by examiner
`
`60
`
`)
`(
`(51)
`
`Related US. Application Data
`M . 7
`.
`.
`P
`1
`1.
`t'
`N . 60 036 548 fil d
`’
`at
`on
`13395101121 app lea Ion
`0
`/
`’
`’
`e
`Int. Cl.7 .............................. G06F 17/30; G06F 3/00;
`G06F 5/00; H03M 7/00; H03M 7/30
`................................. 710/68; 710/65;33:11//15017;
`(52) US. Cl.
`.
`.
`(58) Fleld 0f Seargg7/102503504501534ég581’ 36756#1ng
`’
`’
`’
`’
`710/65 68
`’
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`5,293,379 *
`5,467,087
`5,617,541 *
`5,632,009 *
`5,638,498 *
`
`3/1994 Carr
`.................................... 370/941
`11/1995 Chu ............................. 341/51
`
`.....
`.. 395/200.13
`4/1997 Albanese et al.
`
`5/1997 Rao et a1.
`.........
`395/770
`.......................... 395/117
`6/1997 Tyler et a1.
`
`firintary Lgcaminer—rlihortrlla; Lee
`sszstant xammer— an
`guyen
`(74) Attorney, Agent, or Firm—Hamilton, Brook, Smith &
`Re nOldS, RC.
`y
`(57)
`ABSTRACT
`A preferred coding network uses an architecture called a
`Base-Filter-Resource (BFR) system. This approach inte-
`grates the advantages Of format-Specific compression into a
`general-purpose compression tOOl serving a Wide range Of
`data formats. Source data is parsed into blocks Of Similar
`data and each parsed blocks are compressed using a respec-
`tively selected compression algorithm. The algorithm can be
`chosen from a static model Of the data or can be adaptive to
`the data in the parsed block. The parsed blocks are then
`combined into an encoded data file. For decoding,
`the
`process is reversed.
`
`29 Claims, 8 Drawing Sheets
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`\ ENCODING
`PARAMETERS
`15
`
`24
`
`
`
`
`
`/ SOURCE DATA2
`4FlLTER
`922
`sELEcfloN
`SYSTEM
`,2
`
`v
`PARSING
`svsTEM
`., PARSED
`ARRAVS
`25
`
`v
`CUSTOMIZED ,2,
`ARRAV
`>
`
`TRANSFORMS
`
`
`/25
`23
`ARR/W
`CODER
`DATA29
`,, ENCODED
`V
`
`
`CODER 5‘3
`
`SWNG \
`CODER Au
`
`FLOAT
`
`
`
`\
`
`Commvault Ex. 1023
`Commvault v. Realtime
`
`US Patent No. 9,054,728
`
`Page 1
`
`
`
`34\
`
`v
`
`v
`
`32.\ STRING
`ARRAY
`N
`FLOAT
`ARRAY
`N
`MEGER \
`
`CODER 7‘0
`
`
`
`Page 1
`
`Commvault Ex. 1023
`Commvault v. Realtime
`US Patent No. 9,054,728
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 1 0f 8
`
`US 6,253,264 B1
`
`SENDING APPLICATION
`
`SOURCE DATA 2
`
`I
`
`ENCODER
`
`ENCODED DATA 4
`
`DECODER
`
`§
`
`é
`
`DECODED DATA 6
`
`RECEIVING APPLICATION Z
`
`FIG. 1
`
`PRIOR ART
`
`Page 2
`
`Page 2
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 2 0f 8
`
`US 6,253,264 B1
`
`mm
`
`mm><mm<
`
`mmZDDOE
`
`,9.
`
`
`
`mmoooozaoozm
`
`N<._.<Qm0m30w¥
`
`NNKNEE
`
`ZO_._.Om_._m_m
`
`VN
`
`Oz_wm<n_
`
`_>_m:[m>m
`
`oz_mm<n_
`
`><mm<
`
`ommm<6
`
`m><mm<3M3
`30.5352.lmmbi
`
`mmENE/55:0?..
`mm..
`
`
`
`_>_m_._.w>mZOFOmfim—w
`
`<Emtmo
`
`NNr-ulnXNF—u-n—NNr
`
`/_m
`
`szZOQEOO
`
`20.530
`
`m_>_m_0n_mz<w:.0—.
`
`
`
`omooozmmmmfiEEE
`
`m:
`
`<meN.66
`
`Page 3
`
`Page 3
`
`
`
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 3 0f 8
`
`US 6,253,264 B1
`
`
`
`FIG. 3
`
`Page 4
`
`Page 4
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 4 0f 8
`
`2,352,6S
`
`1
`
`
`Ummooo><mm<OF$80CAB
`
`
`mmS+$0001022moz_Em810222:
`
`
`<F<DOZEFm05.05meuOkw:
`
`ov\
`
`B.mv.OE
`
`Page 5
`
`IV
`
`
`
`MMIOHQE02me
`
`mm_>__mn_
`
`
`
`
`
`mmooo><mm<20m;
`
`m02_m._.w
`
`mv
`
`
`
`:moozoEfimV
`
`Page 5
`
`
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 5 0f 8
`
`US 6,253,264 B1
`
`om\a
`
`SE20:
`
`rm
`
`mmDOONI.
`
`
`
` mmzflw>ZOOowmm<m»mm=j<>DMIOF<EZD
`
`
`
`
`
`mmEMQOO><mm<20m”.
`
`
`
`mI_<mm_._._l_mmmmZO_mW_m_>ZOOm._.Zm_ZOn_Xm_OEm<mm_._.Z<S_OE
`
`
`
`
`
`
`
`
`
`
`
`ENDOOOz_m._.wO._.MMQOO><mm<O._.mmDOO><mm<O._.Mmooo><mm<O._.
`
`
`
`
`
`
`
`
`
`m.0_n_
`
`owmmmmmm
`
`Page 6
`
`Page 6
`
`
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 6 0f 8
`
`US 6,253,264 B1
`
`SAMPLE DATA BASES
`
`I
`
`GENERATE PARSED FILES
`
`PARSED FILES
`
`I
`
`ANALYSIS RESULT FILES
`
`ANALYZE PARSED FILES
`GENERATE PREDICTOR DATA
`
`PDAT FILE
`
`I
`
`FIG. 6
`
`Page 7
`
`Page 7
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 7 0f 8
`
`US 6,253,264 B1
`
`25
`
`ARRAY TO BE CODED
`ARRAY DATA
`V
`BUILD MODEL
`
`71
`
`70
`/
`
`73
`
`ENCODED MODEL INFO
`
`MODEL
`
`ARRAY DATA
`
`
` USE
`
`'N'T'AL
`
`
`NO
`
`SPLITTER
`
`YES
`
`TWO OR MORE ARRAYS
`
`PROCESS SEPARATELY
`
`79
`
`YES
`
`MATCHING
`ALGORITHMS
`
`81
`
`UNMATCHED
`
`VALUES
`
`ARRAYS CONTAINING
`
`MATCH DATA
`
`83
`
`
`
`
`MATCHING
`ALGORITHM
`
`
`
`85
`NO
`
`
`OVEHRELZEELEERAY
`
`
`CONVERT TO |NT32
`
` USE
` 89
`NUMERIC
`
`SFORM
`
`
`NUMERIC
`ALGORITHMS
`
`91
`
`OFFSETS FROM
`MODEL
`
`95$ SPLITTER
`TRANSFORM
`
`USE
`REPEAT
`
`YES
`
`REPEAT
`
`99
`
`TWO OR MORE ARRAYS
`
`PROCESS SEPARATELY
`
`NO
`
`100
`
`FIG 7
`
`Page 8
`
`Page 8
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 8 0f 8
`
`US 6,253,264 B1
`
`101
`
`ARRAY DATA IN
`
`103
`
`RANGE
`
`
`
`
`REMAPPER
`
`
`EXTRA BITS
`
`TO PACK
`
`105
`
`
`
`109
`
`MTF CODER
`
`NO
`
` RANGE CODES
`
`
`RANGE CODES TO
`
`LOW LEVEL CODER
`
`
`
`FIG. 8
`
`Page 9
`
`Page 9
`
`
`
`US 6,253,264 B1
`
`1
`CODING NETWORK GROUPING DATA OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTURE AND SELECTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`BASE ON BLOCK DATA TYPE
`
`RELATED APPLICATIONS
`
`This application claims the benefit of US. Provisional
`Serial No. 60/036,548 filed Mar. 7, 1997, the teachings of
`which are incorporated herein by reference in their entirety.
`BACKGROUND OF THE INVENTION
`
`High performance data compression systems use models
`of the data to increase their ability to predict values, which
`in turn leads to greater compression. The best models can be
`achieved by building a compression system to support a
`specific data format. Instead of trying to deduce a crude
`model from the data within a specific file, a format-specific
`compression system can provide a precise pre-determined
`model. The model can take advantage not just of the file
`format structure, but also of statistical data gathered from
`sample databases.
`Previous efforts at format-specific compression have been
`focused on solutions to a few individual formats rather than
`
`on the development of a generalized method that could be
`adapted to many formats. The models which have been
`created typically involve a small number of components.
`This works adequately when most of the data is included in
`a few components, such as an image file having mostly red,
`blue, and green pixel values. But may formats are best
`modeled using a large number of components, and the
`previous systems are not designed to build or encode such
`models.
`
`SUMMARY OF THE INVENTION
`
`Apreferred coding system in accordance with the inven-
`tion solves both of these problems in the prior art: it provides
`a generalized solution which can be adapted to handle a wide
`range of formats, and it effectively handles models that use
`large numbers of components. The system involves new
`approaches at many levels: from the highest (interface) level
`to the lowest (core encoding algorithms) level. The coding
`network includes a compression network for encoding data
`and a decompression network for decoding data.
`At the highest level, a preferred compression system in
`accordance with the invention uses an architecture called a
`
`Base-Filter-Resource (BFR) system. This approach inte-
`grates the advantages of format-specific compression into a
`general-purpose compression tool serving a wide range of
`data formats. The system includes filters which each support
`a specific data format, such as for Excel XLS worksheets or
`Word DOC files. The base includes the system control
`modules and a library used by all the filters. The resources
`include routines which are used by more than one filter, but
`which are not part of the base. If a filter is installed which
`matches the format of the data to be encoded, the advantages
`of format-specific compression can be realized for that data.
`Otherwise, a “generic” filter is used which achieves perfor-
`mance similar to other non-specific data compression sys-
`tems (such as PKZip, Stacker, etc.).
`At the next level, the system preferably includes a method
`for parsing source data into individual components. The
`basic approach, called “structure flipping,” provides a key to
`converting format
`information into compression models.
`Structure flipping reorganizes the information in a file so that
`similar components that are normally separated are grouped
`together.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`Upon this foundation are a numbers of tools, such as:
`a language to simplify the creation of parsing routines;
`tools to parse the source data using this method into
`separate components; and
`tools to generate models for individual components by
`automated analysis of sample data bases.
`These tools can be applied to the filters for a wide range of
`file and data types.
`The compression system preferably includes tools called
`customized array transforms for specific filters and for
`certain types of filters. These techniques handle a specific
`file type or certain data constructions used by several file
`types, such as encoding two dimensional arrays of data as
`found in many database formats.
`At the low-level of the system are preferably a number of
`mechanisms for encoding data arrays, including:
`new low-level encoding algorithms;
`methods for integrating a large number of transforms and
`encoding algorithms;
`methods for eliminating overhead so that small data
`blocks can be efficiently coded; and
`a new method for
`integrating both static models
`(determined from statistical analysis of sample
`databases) and dynamic models (adapted to the data
`within a particular array) into the encoding of each
`component.
`A preferred method of encoding source data comprising
`parsing the source data into a plurality of blocks. The parsed
`blocks typically have a different format than the source data
`format. In particular, similar data from the source data are
`collected and grouped into respective blocks.
`For each block, a compression algorithm is selected from
`a plurality of candidate compression algorithms and applied
`to the block. The compression algorithms can be determined
`based on the amount of data in the respective block.
`Furthermore, the compression algorithm can be adapted to
`the respective block,
`including the use of a customized
`transform. The selection of an algorithm can also be based
`on a compression model, which is derived from the format
`of the source data. The compressed data from the plurality
`of blocks are then combined into encoded data.
`
`The coding network can also include a decompression
`network to convert the encoded data back into the source
`
`the data is decoded and then the parsing is
`data. First
`reversed. In a lossless system, the resulting data is identical
`to the source data.
`
`Embodiments of the invention preferably take the form of
`machine executable instructions embedded in a machine-
`
`readable format on a CD-ROM, floppy disk or hard disk, or
`another machine-readable distribution medium. These
`
`instructions are executed by one or more processing units to
`implement the compression and decompression networks.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The foregoing and other objects, features and advantages
`of the invention will be apparent from the following more
`particular description of preferred embodiments of the
`invention, as illustrated in the accompanying drawings in
`which like reference characters refer to the same parts
`throughout the different views. The drawings are not nec-
`essarily to scale, emphasis instead being placed upon illus-
`trating the principles of the invention.
`FIG. 1 is a high-level block diagram of a typical data
`compression system.
`FIG. 2 is a block diagram of a preferred encoder using the
`BFR system in accordance with the invention.
`
`Page 10
`
`Page 10
`
`
`
`US 6,253,264 B1
`
`3
`FIG. 3 is a flow chart of an array coder.
`FIG. 4 is a block diagram of the string coder 40 of FIG.
`
`3
`
`FIG. 5 is a block diagram of the float coder 50 of FIG. 3.
`FIG. 6 is a flowchart of a preferred automated analysis
`system.
`FIG. 7 is a flowchart overviewing the high-level of the
`integer coder.
`FIG. 8 is a flowchart overviewing the mid-level of the
`integer coder.
`
`DETAILED DESCRIPTIONS OF PREFERRED
`EMBODIMENTS OF THE INVENTION
`
`FIG. 1 is a high-level block diagram of a typical data
`compression system. The primary components of the system
`are a sending application 1, an encoder 3, a decoder 5, and
`a receiving application 7.
`The sending application 1 provides source data 2 to be
`encoded by the encoder 3. The sending application can be a
`file archiver, a communication system, or any other appli-
`cation with a need to represent data using fewer bytes than
`in the native format.
`The encoder 3 receives the source data 2 and uses data
`
`compression algorithms to represent the data using fewer
`bytes. The typical implementation of the encoder 3 is as
`software running on a general purpose computer, although
`the algorithms to be described may also be implemented on
`specialized hardware.
`The output of the encoder is encoded data 4, which can be
`stored in memory (in which case the encoding allows more
`source data to be stored within the same hardware), or
`transmitted to the decoder 5 (in which case the encoding the
`source data to be transmitted faster when the transmission
`
`channel bandwidth is limited).
`The decoder 5 retrieves or receives the encoded data 4 and
`
`reverses the algorithms used to encode the data so as to
`reconstruct the source data as decoded data 6. In a lossless
`
`system, the decoded data 6 is identical to the source data.
`Depending on the application, however, some data loss may
`be acceptable. As with the encoder 3,
`the decoder 5 is
`typically embodied as software running on a general purpose
`computer, but can be implemented in specialized hardware.
`The receiving application 7 receives the decoded data 6
`for processing.
`A preferred compression engine includes filter-based
`encoders and decoders which can be utilized by a large
`number of sending applications 1 and receiving applications
`7. While the applications 1,7 that store/retrieve or transmit/
`receive the encoded data are not part of the compression
`system, they may adjust encoding and decoding operations
`according to system conditions such as available transmis-
`sion channel bandwidth.
`
`The descriptions to follow will mostly cover the encoder.
`The decoder just reverses the encoder’s actions, so a descrip-
`tion of the encoder is sufficient
`to describe the overall
`
`encoding/decoding system. The term “File” will be used to
`describe a block of source data, which matches the conven-
`tional usage, but which should also be understood to be
`applicable to many other types of data blocks such as a data
`exchange in a Client-Server application.
`
`ENCODER
`
`4
`invention. The encoder 3' is based on the use of a plurality
`of filters 10a, .
`.
`.
`, 10x, .
`.
`.
`, 102 which serve specific file
`formats. For example, one filter 10a might support several
`versions of the DBF database format, while another filter 102
`might support several versions of the DOC format used by
`the Microsoft Word software program. The individual filters
`provide respective selection criteria 12 to a filter selection
`system 22.
`The filter selection system 22 receives the source data 2
`and checks the selection criteria 12a, .
`.
`.
`, 12x, .
`.
`.
`, 122 of
`all filters 10a, .
`.
`.
`, 10x, .
`.
`.
`, 102 installed in the system to
`see if any of them support the source data’s format. If not,
`a “generic” filter is used which provides compression per-
`formance similar to other generic compression systems,
`such as Lempel-Ziv (LZ) engines. In a particular preferred
`embodiment of the invention, the generic compression sys-
`tem employs an SZIP engine as described by Mr. Schindler
`in US. application Ser. No. 08/970,220 filed Nov. 14, 1997,
`the teachings of which are incorporated herein by reference
`in their entirety. The descriptions of the network will pri-
`marily cover the situations in which a filter to support the
`data format is successfully found.
`Briefly, a parsing system 24 parses the source data 2 into
`a large number of parsed Arrays 25, which each include all
`the instances of a particular component in the source file.
`The term “Array”, described in further detail below,
`is
`capitalized to indicate that this refers to a type of structure
`specific to the network, as opposed to the usual use of the
`term “array”. The algorithm used to parse the source data 2
`is called “Structure Flipping”, an important aspect of the
`network, described in further detail below.
`
`The BFR system uses an automated analysis system (also
`described in detail below) to build a model for each Array
`25, which will be used when the Arrays 25 are encoded in
`an array coder 28. In some cases, better compression can be
`achieved if customized array transforms 26 are first used.
`While the models generated by the automated analysis
`system process each component Array 25 independently, the
`customized routines can take advantage of inter-component
`relationships.
`The array coder 28 encodes the Arrays 25 using a count-
`dependent modeling system, which includes a mixture of
`static models (constant for all databases) and dynamic
`adjustment to each particular Array 25.
`The filter 10x has components to support each of the four
`sections of the encoder 3'. The selection criteria 12x, deter-
`mined by the file format specification, includes information
`needed to recognize files served by the filter 10x such as byte
`values at the beginning of the file or file title suffices. Parsing
`instructions 14, also determined by the file format
`specification, include the information needed to parse the
`source file into parsed Arrays 25 having all the instances of
`a particular component
`in the source file. The custom
`component models 16 are constructed by referencing both
`the file format specification and sample databases. They can
`take advantage of inter-component relationships to trans-
`form some of the component Arrays 25 to make them more
`compressible. Encoding parameters 18 are generated by an
`automated analysis system which examines component
`Arrays 25 parsed from a large number of sample databases.
`The encoding parameters 18 provide the data necessary to
`support the count-dependent modeling system in the array
`coder 28.
`FILTER SELECTION SYSTEM
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`FIG. 2 is a block diagram of a preferred encoder using a
`Base-Filter-Resource (BFR) network in accordance with the
`
`A preferred compression system uses an expandable
`architecture that allows users to support new or updated
`
`Page11
`
`Page 11
`
`
`
`US 6,253,264 B1
`
`5
`formats by adding or replacing filters 10. As noted above, a
`filter 10 can support more than one format. For example, a
`filter called “LOTUS123V4” might support all file formats
`associated with the Lotus 123 program, such as the current
`WK4 format and the earlier WK3 and FM3 formats. If a
`
`Filter was later developed for a new Lotus WKS format, a
`user replaces the old filter with one supporting the new
`format and which would also be backwardly compatible
`with the earlier filter with respect to the older formats.
`In the Microsoft Windows environment, each Filter is a
`separate dynamically linked library (DLL) file. In addition
`to data tables, a filter includes executable code.
`If a filter is found for a file, then it is dynamically linked
`and then called to encode the file. The encoded data stream
`
`includes an identification (ID) code indicating which filter
`was used to encode the data, and the decoder has to have a
`corresponding decoding filter.
`the base
`To implement the filter selection system 22,
`module maintains a registry of the currently installed files,
`which includes the data need to identify source files of that
`type as well as the path to the DLL. This registry is updated
`each time a filter is added or replaced. The identification data
`can include both a list of file title suffices and a method to
`
`identify files via byte values near the beginning of the file.
`The identification procedure uses a search tree method to
`reduce the time needed to check the list when many filters
`are installed. From this registry, the appropriate filter 10 is
`invoked for the file type.
`To handle compound file formats such as Microsoft’s
`OLE2, one filter can invoke another filter to encode a
`sub-file or stream within the compound file. For example, a
`Winword Filter might call an Excel Filter to encode some
`Excel data which was embedded within a compound docu-
`ment file.
`PARSING SYSTEM
`
`As noted above, the parsing system 24 creates a large
`number of parsed Arrays 25. Each parsed Array 25 includes
`all instances of a particular component in a file. A separate
`component
`is created for each member of each defined
`structure type.
`
`EXAMPLE 1
`
`For example, consider the following description of a record
`defined in a file format description.
`Cell Value:
`
`The format uses a special method of coding numbers in an
`internal number type, called an RK number, to save
`memory and disk space.
`Record Data:
`
`Offset
`
`Name
`
`Size
`
`Contents
`
`4
`6
`8
`
`10
`
`rw
`col
`fmt
`
`num
`
`2
`2
`2
`
`4
`
`Row number
`Column number
`Index to the record that includes
`the cell format
`the coded number
`
`The parsing system 24 creates parsed Arrays 25 for each
`of the four components of this record. Each time an RK
`record is found, an entry is added to each array. The parsed
`Arrays 25 are much more compressible than the raw source
`data, as shown by data from five records:
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`Rec
`
`Row
`
`1
`2
`3
`4
`5
`
`0x0001
`0x0001
`0x0001
`0x0001
`0x0001
`
`Col
`
`0x0002
`0x0003
`0x0004
`0x0006
`0x0007
`
`Index
`
`0x0055
`0x005f
`0x0053
`0x005f
`0x005f
`
`Num Val
`
`0x3ff1e100
`0x3ff1a300
`0x3ff1d000
`0x3ff1c400
`0x3ff1d300
`
`The source data would appear almost random to a byte-
`matching algorithm (shown in hex):
`01 00 02 00 55 00 00 e1 fl 3f01 00 03 00 5f 00 00 a3 fl
`3f 01 00 04 00 53 00 00 d0 fl 3f
`01 00 06 00 5f 00 00 c4 el fl 3f 01 00 07 00 5f 00 00 d3
`fl 3f
`
`But when handled as four separate Arrays 25 (Row, Col,
`Index, and Num) using algorithms optimized for each
`component,
`the data can compress significantly. This
`approach is referred to herein as “structure flipping” because
`of the way it rearranges the data.
`To encode together the instances of the same component
`in a database is a common compression technique, and the
`file format is often used to identify such components. But in
`the prior art, this approach has been limited to isolating a few
`major components of a specific file format, and the file
`format description used only as a technique of finding these
`major components. In accordance with preferred embodi-
`ments of the invention, a file format is actually converted
`into a compression model, which is then interfaced into an
`automated system for analyzing the data and optimizing the
`coding performance of each component. While this com-
`pression model is usually enhanced by the customized array
`transforms 26, it is still an aspect of the overall system.
`The extent
`to which preferred approaches differ from
`prior art systems is reflected in the new tools constructed in
`order to implement it. This approach creates a large number
`of Arrays 25. Each Array 25 requires a different compression
`model, and the number of entries in each can change
`significantly from file to file. In order to work effectively in
`such a case, preferred embodiments of the invention employ
`new approaches in how the low-level coding is handled such
`as the elimination of overhead, count-dependent processing,
`and the use of a processing network instead of fixed coding
`algorithms. These techniques, described above, are preferred
`in order to handle the output of the parsing system 24.
`All of these implementation features are integrated at a
`high level, so that high-level calls can handle all of the
`operations needed to initialize the system, parse the data,
`and encode the Arrays.
`In accordance with a particular preferred embodiment of
`the invention, the file parsing system 24 uses three sub-
`systems:
`
`FILEiBUFFER
`ARRAYS
`PVSTRUCT
`
`Input\Output Interface
`To store the data for a single component
`To parse the input into Arrays or to write the decoded
`Arrays into the Output stream
`
`A plurality of FILEiBUFFER routines provide a flexible
`input/output interface using FILEiBUFFER structures, so
`the same set of parsing routines can be used regardless of the
`input/output formats. This allows the compression system to
`be independent of the I/O format. The file parsing system
`preferably supports two file formats: DOS and OLE2, but
`this can expanded to other file formats as well as to inter-
`
`Page 12
`
`Page 12
`
`
`
`US 6,253,264 B1
`
`7
`faces to communication ports and other I/O options. For
`DOS files, the logical stream positions are usually the same
`as the raw file positions. The OLE2 format requires an
`extensive layer of code to follow an OLE2 stream within the
`container file.
`
`A plurality of ARRAY routines provide a system for
`managing the process of initializing, adding data to,
`encoding, and freeing arrays of data. Each ARRAY structure
`includes one array of data and the information needed to
`manage these processes, such as the number of entries in the
`Array 25, the size of the buffer currently allocated for the
`array, and information which will be useful in encoding the
`array.
`In particular, ARRAYiXX structure stores all the data
`from a component, and includes mechanisms to support fast
`sequential read/write access to the data, an expandable
`buffer for the data, and storage of model data. This structure
`also allows a simple set of tools to handle all allocation,
`parsing, and coding functions. The iXX nomenclature
`indicates that slightly different ARRAY structures are
`defined for different data types, such as different integer
`word sizes, strings, and floating point types, although many
`of the processing routines work on all of the variations.
`A plurality of PVSTRUCT routines make up an auto-
`mated system for parsing the file data. During encoding,
`they read data from the FILEiBUFFER structure and write
`data into Arrays 25. During decoding, they take data from
`the Arrays 25 and write it into the FILEiBUFFER struc-
`tures. These routines can take a high-level description of a
`file format and handle all the tasks associated with managing
`the parsing as well as the encoding of the parsed data. Many
`of the Filters 10 will also include customized parsing
`routines which call the FILEiBUFFER and ARRAY rou-
`tines directly instead of using the PVSTRUCT routines. But
`an aspect of the parsing system 24 is the PVSTRUCT
`routines, and this discussion will be concerned primarily
`with their operation.
`A PVSTRUCT structure includes all information needed
`
`to parse and code a data structure defined in a file format.
`This usually involves multiple components, so it provides
`and handles individual Arrays 25 for each component. The
`term “pvstruct” comes from Predictor Variable Structure,
`where Predictor refers to the use of statistical data gained
`from analyzing a large number of sample files to assist in the
`encoding of the data, and Variable indicates that it supports
`structures where the component counts can vary at run time
`for each instance (called “dynamic structures” in some
`textbooks).
`A parsing language is a defined instruction set which
`provides a concise way to specify the parsing of a
`PVSTRUCT structure. The instruction set includes support
`for a wide range of dynamic variation, including where the
`counts of components are dependent on file counts of other
`components as well as on external factors. It incorporates
`other file format data, such as defined values or range
`restrictions. It also provides a means of exchanging data
`between the parser and the filter.
`The compression system is based on parsing a file so as
`to place items together that are similar. In the following
`description of the compression system, examples illustrating
`some structures which might be found in a typical spread-
`sheet file format.
`
`EXAMPLE 2
`
`Spreadsheet files typically include a series of records.
`Each record begins with a 16-bit word indicating the type of
`record followed by a 16-bit word giving the number of bytes
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`8
`in the body of the record. The file specification then
`describes the contents of each record. A specification for a
`TABLE SIZE record might be:
`
`RecType=0x402
`Rec Body Size=12
`Record Data:
`
`Offset Name
`
`Size Contents
`
`4
`6
`8
`10
`12
`
`MinRow
`MaxRow
`MinCol
`MaxCol
`(Reserved)
`
`First defined row on the sheet
`2
`Last defined row on the sheet, plus 1
`2
`First defined column on the sheet
`2
`Last defined column on the sheet, plus 1
`2
`4 Reserved
`
`There can be more than one such record in the data. When
`there is more than one record,
`the values of the same
`component in each record are going to be similar to each
`other. This enables the compression system to take advan-
`tage of the similarities to increase the compression ratio.
`The compression system is preferably based on the
`ARRAYiXX structures, including:
`
`ARRAYiUS:
`
`ARRAY716:
`
`ARRAY732:
`
`ARRAY7F32:
`ARRAY7F64
`ARRAYiFSO:
`ARRAY,STR:
`
`for 8-bit data items, treated by the encoding system as
`being unsigned
`for 16-bit data items, treated by the encoding system as
`being signed
`for 32-bit data items, treated by the encoding system as
`being signed
`for 32-bit floats
`for 64-bit floats (“doubles”)
`for 80-bit floats (“long doubles”)
`each entry includes several bytes whose count is given
`by a size (string size).
`
`For the TABLE SIZE record, five arrays would be created
`(referring to the record specification). Each time a record is
`parsed, an entry is added into each of the five arrays.
`When all
`the records had been parsed, a function is
`invoked to encode the items in each array. This function
`includes a large number of algorithms, which can exploit a
`wide variety of relationships to achieve maximum compres-
`sion. The implementation is extremely efficient in that it
`requires virtually no overhead: an array with only one entry
`can even be efficiently compressed. This is necessary
`because the parsing system can transform a file into a huge
`number of these small arrays. The encoding function can
`also free the data after it finishes encoding it.
`The decoding process reverses the sequence. First, each of
`the component arrays is decoded. Then the parsing is
`reversed.
`
`As this type of parsing has to be done extensively, a
`number of tools are used to simplify the process. They are
`based on the use of instructions called vstructidefs. To
`continue with the TABLE SIZE example, the instruction set
`to parse such a record would be:
`
`// the first 3 instructions always provide general info about the record:
`8,
`// num instructions in this set
`12,
`// size of the record body, which is a fixed sz of 12 bytes
`5,
`// num arrays to be created
`// the remaining instructions tell how to parse the record:
`INT16,
`// Parse a 16 bit integer for the first component (“MinRow”)
`INT16,
`
`Page 13
`
`Page 13
`
`
`
`US 6,253,264 B1
`
`9
`
`-continued
`
`10
`
`-continued
`#define FLOAT32
`#define FLOAT64
`#define FLOAT80
`#define STRING
`#define STRINGZ
`#define SYS,Sl
`#define SYS,SZ1
`#define SYS,SZ
`#define SYS,SZ2
`
`3
`4
`5
`6
`7
`8
`9
`
`INT16,
`INT16,
`INT32 };
`
`// Parse a 32 bit integer for the last component (“Reserved”)
`
`The vstructidef instructions are 16 bit integers. The first
`three words always include the same general info about the
`record. The rest of the words include instructions which tell
`
`the automated parsing system how to proceed when encoun-
`tering such a record. In this case, the instructions are the
`simplest possible—just a list of data types for each compo-
`nent of the record. The parser would create an array for each
`component, and load one entry of that type into the array
`each time it parses a record of that type:
`Functions are then provided to automate the parsing and
`encoding of records based on these instructions.