`US 6,253,264 B1
`(10) Patent No.:
`(12)
`Sebastian
`(45) Date of Patent:
`*Jun. 26, 2001
`
`
`US006253264B1
`
`(75)
`
`73)
`
`Assi
`
`:
`
`ion
`
`Technologi
`
`(54) CODING NETWORK GROUPING DATA OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTURE AND SELECTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`BASE ON BLOCK DATA TYPE
`.
`ene
`:
`Inventor: William Sebastian, Falmouth, MA
`(US)
`Intelligent C
`Assignee Fakaoeth MA(US) EMMIS,
`(73)
`(*) Notice:
`This patent issued on a continued pros-
`ecution application filed under 37 CFR
`1.53(d), and is subject to the twenty year
`tent
`t
`isi
`f 35 US.C.
`"34a)(2).mm provisions ©
`‘
`Subjectto any disclaimer, the term ofthis
`patent is extended or adjusted under 35
`
`5,684,478 * 11/1997 Panaoussis ......ssessesseeeeees 341/51
`
`cece 395/785
`5,708,828 *
`1/1998 Coleman...
`1/1998 Gormish et al... 341/107
`5,710,562 *
`
`
`oeee : vit900 purser sesensentnsnennnnennsenee sOnor
`olmes ....
`.
`864,
`2/1999 Barbir oe eeeeereeeee 341/107
`5,867,114 *
`
`3/1999 Mochizuki et al... 707/102
`5,881,380 *
`secsssesssssseonen 707/104
`5,983,236 * 11/1999 Yager et al.
`
`5.991.515 * 11/1999 Fall et abe cccceccsccsssseesseeern 305/114
`FOREIGN PATENT DOCUMENTS
`(EP).
`(er)
`OTHER PUBLICATIONS
`
`8/1996
`
`0 729 237 A2
`
`Rice, R.F., et al., “VLSI Universal Noiseless Coder,” NTIS
`Tech Notes 2301, p. 234 (Mar. 1,1990).
`Diner, D.B., et al., “Competitive Parallel Processing for
`Compression of Data,” NTIS Tech Notes 2301, p. 379 (May
`
`* cited by examiner
`
`(21) Appl. No.: 09/036,052
`(22)
`Filed:
`Mar. 6, 1998
`animar,y bxaminernoms Lee
`Related U.S. Application Data
`
`
`
`
`60) filedonMar. 7Provisional application No. 60/036.548. ssistant Examiner—Tanh Nguyen
`
`
`(60) 1997 application
`No.
`60/036.548,
`(74) Attorney, Agent, or Firm—Hamilton, Brook, Smith &
`filed
`on
`Mar.
`7,
`Reynolds, P.C.
`(SL) Unt. C0 eeeeeeecccssssenneees G06F 17/30; GO6F 3/00;
`y
`GO6F 5/00; HO3M 7/00; H03M 7/30
`(57)
`ABSTRACT
`(52) U.S. C1. ceeeeeeeseeeseeennees 710/68; eesaay A preferred coding network uses an architecture called a
`:
`.
`Base-Filter-Resource (BFR) system. This approach inte-
`(58) Field of seaTIOSSOs,SokS08S08;eSa7 grates the advantages of format-specific compression into a
`?
`?
`?
`?
`410/65 68
`general-purpose compression tool serving a wide range of
`;
`data formats. Source data is parsed into blocks of similar
`References Cited
`data and each parsed blocks are compressed using a respec-
`tively selected compression algorithm. The algorithm can be
`U.S. PATENT DOCUMENTS
`chosen from a static model of the data or can be adaptive to
`the data in the parsed block. The parsed blocks are then
`oerosy . siyto0s on sereeeseneseseseseneeesescesseseeees ete combined into an encoded data file. For decoding,
`the
`:
`467,
`Us ieeecceeccssecesescenceeeecneecneeees
`
`4/1997 Albaneseetal......
`"395/200.13
`5,617,541 *
`Processis reversed.
`
`5,632,009 *
`5/1997 Raoetal.
`........
`.. 395/770
`5,638,498 *
`6/1997 Tyler et al. oo eeeceeeeseeeeeee 395/117
`
`(56)
`
`29 Claims, 8 Drawing Sheets
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
` N
`
`28
`ARRAY
`x
`
`<a
`2g
`‘
`32, STRINGS,
`Y
`ARRAY,7”|
`N
`STRING |}.
`cover
`|40
`FLOATS,
`ARRAY,
`
`4,
`
`¥
`
`
`CODER |50
`
`FLOAT
`
`
`
`INTEGER|,
`
`CODER [3g
`
`
`
`|S
`
`Commvault Ex. 1023
`Commvault v. Realtime
`
`US Patent No. 9,054,728
`
`Page1
`
`Page 1
`
`Commvault Ex. 1023
`Commvault v. Realtime
`US Patent No. 9,054,728
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 1 of 8
`
`US 6,253,264 B1
`
`SENDING APPLICATION
`
`SOURCE DATA 2
`
`ENCODER
`
`ENCODED DATA 4
`
`DECODER
`
`3
`
`°
`
`DECODED DATA 6
`
`RECEIVING APPLICATION f
`
`FIG. 1
`PRIOR ART
`
`Page 2
`
`Page 2
`
`
`
`U.S. Patent
`
`Jun.26, 2001
`
`Sheet 2 of 8
`
`US 6,253,264 Bl
`
`cVIVOsounos-t
`
`zz/|_MALS
`
`NOILOaTaS
`
`
`
`WALSASNOILOa1g$
`
`VIYSLIYS
`
`ZELOUXSEeSL
`
`8c
`
`G2
`
`QAqdOON3
`
`Vivd
`
`6¢
`
`SAVEYV
`
`Ge
`
`
`
`9¢°]qaZIWOLSNO
`
`AVY
`
`SWYHOASNVALL
`
`WOLSND
`
`LNANOdWOO
`
`SaINGOW
`
`OL
`
`ONIGOONA
`
`SYSLANVaVd
`
`81
`
`¢Old
`
`ve
`
`ONISHVd
`
`qasuvdvl
`
`SNOILONUYLSNI
`
`WALSASONISdVd
`
`dyatd
`
`Page 3
`
`Page 3
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 3 of 8
`
`US 6,253,264 B1
`
`
`
`FIG. 3
`
`Page 4
`
`Page 4
`
`
`
`U.S. Patent
`
`Jun.26, 2001
`
`Sheet 4 of 8
`
`US 6,253,264 Bl
`
`Ov7
`
`OILVLS)
`YsdqdO0OAVeeVWOwds(JAGOW
`
`
`
`
`
`
`
`VIVdGONIYLSQSL0IdsddAOLSI]
`
`Lv
`
`
`
`YAHOLVWONIYLS
`
`SONIYLS
`
`dsSWldd
`
`8c
`
`
`
`
`
`
`
`Lv4S3dO0HOLVWWSONINLSGSHO.LVWNN
`
`
`
`YAqdO9AVaeevVOL
`
`
`
`Ysaqd09LxXal
`
`vOlas
`
`Page 5
`
`Page 5
`
`
`
`
`
`U.S. Patent
`
`Jun.26, 2001
`
`Sheet 5 of 8
`
`OG|Vivd
`
`1vO14
`
`LS
`
`Ysd0921
`
`
`
`92Ysaq0oAVaeVWOds
`
`SANTWAGSHOLVAINN
`
`
`
`YsaLYSANOOOLASVE
`
`
`
`STIVYsSLITSYysNOISHSANO9SLNANOdxad019SVSSILNVW019
`
`
`
`
`
`
`
`
`
`YAGOOONIYLSOLYAd0oAVaeeVOLYAdOoAVeavVOLYad0o0AvaavOL
`
`US 6,253,264 Bl
`
`
`
`
`
`GS‘Old
`
`OV828c82
`
`Page 6
`
`Page 6
`
`
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 6 of 8
`
`US 6,253,264 B1
`
`SAMPLE DATA BASES
`
`PARSED FILES
`
`GENERATE PARSED FILES
`ANALYZE PARSED FILES
`GENERATE PREDICTOR DATA
`
`ANALYSIS RESULT FILES
`
`PDAT FILE
`
`FIG. 6
`
`Page 7
`
`Page 7
`
`
`
`U.S. Patent
`
`Jun.26, 2001
`
`Sheet 7 of 8
`
`US 6,253,264 B1
`
`ARRAY TO BE CODED
`
`No a
`
`ARRAY DATA
`
`iBUILD MODEL
`
`
`MODEL
`
`ARRAY DATA
`
` USE
`
`INITIAL
`
`SPLITTER
`
`NO
`
`71
`
`73
`
`ENCODED MODEL INFO
`
`SPLITTER
`
`YES
`
`TWO OR MORE ARRAYS
`PROCESS SEPARATELY
`
`19
`
`
`
`YES
`
`MATCHING
`ALGORITHMS
`
`81
`
`UNMATCHED
`VALUES
`
`ARRAYS CONTAINING
`MATCH DATA
`
`\’23
`
`CONVERTTO INT32
`
`
`
`MATCHING
`
`ALGORITHM
`
`85
`NO
`
`
`OVERSIZE ARRAY|
`HANDLER
`
`
`89
`
`NUMERIC
`ALGORITHMS
`
`TWO OR MORE ARRAYS
`PROCESS SEPARATELY
`
`YES
`
`REPEAT
`TRANSEORM
`
`99
`
`TO MID-LEVEL CODER
`
`100
`
`FIG. 7
`
`Page 8
`
`91
`
`95
`
`OFFSETS FROM
`MODEL
`
`SPLITTER
`
` USE
`
`
`NUMERIC
`SFORM
`
`<>
`;
`
`
`Page 8
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 8 of 8
`
`US 6,253,264 B1
`
`101
`
`ARRAY DATA IN
`
`103
`
`RANGE
`
`REMAPPER
`
`
`EXTRA BITS
`TO PACK
`
`105
`
`
`
` RANGE CODES
`
`
`109
`
`MTF CODER
`
`NO
`
`RANGE CODES TO
`LOW LEVEL CODER
`
`
`
`FIG. 8
`
`Page 9
`
`Page 9
`
`
`
`US 6,253,264 B1
`
`1
`CODING NETWORK GROUPING DATA OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTURE AND SELECTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`BASE ON BLOCK DATA TYPE
`
`RELATED APPLICATIONS
`
`This application claims the benefit of U.S. Provisional
`Serial No. 60/036,548 filed Mar. 7, 1997, the teachings of
`which are incorporated herein by reference in their entirety.
`BACKGROUND OF THE INVENTION
`
`High performance data compression systems use models
`of the data to increase their ability to predict values, which
`in turn leads to greater compression. The best models can be
`achieved by building a compression system to support a
`specific data format. Instead of trying to deduce a crude
`model from the data within a specific file, a format-specific
`compression system can provide a precise pre-determined
`model. The model can take advantage not just of the file
`format structure, but also of statistical data gathered from
`sample databases.
`Previous efforts at format-specific compression have been
`focused on solutions to a few individual formats rather than
`
`on the development of a generalized method that could be
`adapted to many formats. The models which have been
`created typically involve a small number of components.
`This works adequately when mostof the data is included in
`a few components, such as an imagefile having mostly red,
`blue, and green pixel values. But may formats are best
`modeled using a large number of components, and the
`previous systems are not designed to build or encode such
`models.
`
`SUMMARYOF THE INVENTION
`
`A preferred coding system in accordance with the inven-
`tion solves both of these problemsin the priorart: it provides
`a generalized solution which can be adapted to handle a wide
`range of formats, and it effectively handles models that use
`large numbers of components. The system involves new
`approachesat many levels: from the highest(interface) level
`to the lowest (core encoding algorithms) level. The coding
`network includes a compression network for encoding data
`and a decompression network for decoding data.
`At the highest level, a preferred compression system in
`accordance with the invention uses an architecture called a
`Base-Filter-Resource (BFR) system. This approach inte-
`grates the advantages of format-specific compression into a
`general-purpose compression tool serving a wide range of
`data formats. The system includesfilters which each support
`a specific data format, such as for Excel XLS worksheets or
`Word DOC files. The base includes the system control
`modules and a library used byall the filters. The resources
`include routines which are used by more than onefilter, but
`which are not part of the base. If a filter is installed which
`matches the format of the data to be encoded, the advantages
`of format-specific compression can be realized for that data.
`Otherwise, a “generic” filter is used which achieves perfor-
`mance similar to other non-specific data compression sys-
`tems (such as PKZip, Stacker, etc.).
`Atthe next level, the system preferably includes a method
`for parsing source data into individual components. The
`basic approach,called “structure flipping,” provides a key to
`converting format
`information into compression models.
`Structure flipping reorganizes the information ina file so that
`similar components that are normally separated are grouped
`together.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`Upon this foundation are a numbersof tools, such as:
`a language to simplify the creation of parsing routines;
`tools to parse the source data using this method into
`separate components; and
`tools to generate models for individual components by
`automated analysis of sample data bases.
`These tools can be applied to the filters for a wide range of
`file and data types.
`The compression system preferably includes tools called
`customized array transforms for specific filters and for
`certain types of filters. These techniques handle a specific
`file type or certain data constructions used by several file
`types, such as encoding two dimensional arrays of data as
`found in many database formats.
`At the low-level of the system are preferably a number of
`mechanisms for encoding data arrays, including:
`new low-level encoding algorithms;
`methods for integrating a large numberof transforms and
`encoding algorithms;
`methods for eliminating overhead so that small data
`blocks can be efficiently coded; and
`a new method for
`integrating both static models
`(determined from statistical analysis of sample
`databases) and dynamic models (adapted to the data
`within a particular array) into the encoding of each
`component.
`A preferred method of encoding source data comprising
`parsing the source data into a plurality of blocks. The parsed
`blocks typically have a different format than the source data
`format. In particular, similar data from the source data are
`collected and grouped into respective blocks.
`For each block, a compression algorithm is selected from
`a plurality of candidate compression algorithms and applied
`to the block. The compression algorithms can be determined
`based on the amount of data in the respective block.
`Furthermore, the compression algorithm can be adapted to
`the respective block,
`including the use of a customized
`transform. The selection of an algorithm can also be based
`on a compression model, which is derived from the format
`of the source data. The compressed data from the plurality
`of blocks are then combined into encoded data.
`
`The coding network can also include a decompression
`network to convert the encoded data back into the source
`data. First
`the data is decoded and then the parsing is
`reversed. In a lossless system, the resulting data is identical
`to the source data.
`
`Embodiments of the invention preferably take the form of
`machine executable instructions embedded in a machine-
`
`readable format on a CD-ROM,floppy disk or hard disk, or
`another machine-readable distribution medium. These
`instructions are executed by one or moreprocessing units to
`implement the compression and decompression networks.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The foregoing and other objects, features and advantages
`of the invention will be apparent from the following more
`particular description of preferred embodiments of the
`invention, as illustrated in the accompanying drawings in
`which like reference characters refer to the same parts
`throughout the different views. The drawings are not nec-
`essarily to scale, emphasis instead being placed upon illus-
`trating the principles of the invention.
`FIG. 1 is a high-level block diagram of a typical data
`compression system.
`FIG. 2 is a block diagram of a preferred encoder using the
`BFR system in accordance with the invention.
`
`Page 10
`
`Page 10
`
`
`
`US 6,253,264 B1
`
`3
`FIG. 3 is a flow chart of an array coder.
`FIG. 4 is a block diagram ofthe string coder 40 of FIG.
`
`3
`
`FIG. 5 is a block diagram ofthe float coder 50 of FIG. 3.
`FIG. 6 is a flowchart of a preferred automated analysis
`system.
`FIG. 7 is a flowchart overviewing the high-level of the
`integer coder.
`FIG. 8 is a flowchart overviewing the mid-level of the
`integer coder.
`
`DETAILED DESCRIPTIONS OF PREFERRED
`EMBODIMENTS OF THE INVENTION
`
`4
`invention. The encoder 3' is based on the use of a plurality
`of filters 10a,..., 10x, ... , 10z which serve specific file
`formats. For example, one filter 10a might support several
`versions of the DBF database format, while anotherfilter 10z
`might support several versions of the DOC format used by
`the Microsoft Word software program. The individualfilters
`provide respective selection criteria 12 to a filter selection
`system 22.
`The filter selection system 22 receives the source data 2
`and checksthe selection criteria 12a,...,12x,..., 12z of
`all filters 10a, ..., 10x, ..., 10z installed in the system to
`see if any of them support the source data’s format. If not,
`a “generic” filter is used which provides compression per-
`formance similar to other generic compression systems,
`such as Lempel-Ziv (LZ) engines. In a particular preferred
`FIG. 1 is a high-level block diagram of a typical data
`embodimentof the invention, the generic compression sys-
`compression system. The primary components of the system
`tem employs an SZIP engine as described by Mr. Schindler
`are a sending application 1, an encoder 3, a decoder 5, and
`in US. application Ser. No. 08/970,220 filed Nov. 14, 1997,
`a receiving application 7.
`the teachings of which are incorporated herein by reference
`The sending application 1 provides source data 2 to be
`in their entirety. The descriptions of the network will pri-
`encoded by the encoder 3. The sending application can be a
`marily cover the situations in whichafilter to support the
`file archiver, a communication system, or any other appli-
`data format is successfully found.
`cation with a need to represent data using fewer bytes than
`in the native format.
`Briefly, a parsing system 24 parses the source data 2 into
`a large numberof parsed Arrays 25, which each includeall
`The encoder 3 receives the source data 2 and uses data
`the instances of a particular component in the source file.
`The term “Array”, described in further detail below,
`is
`capitalized to indicate that this refers to a type of structure
`specific to the network, as opposed to the usual use of the
`term “array”. The algorithm used to parse the source data 2
`is called “Structure Flipping”, an important aspect of the
`network, described in further detail below.
`The BFR system uses an automated analysis system (also
`described in detail below) to build a model for each Array
`25, which will be used when the Arrays 25 are encoded in
`an array coder 28. In somecases, better compression can be
`achieved if customized array transforms 26 are first used.
`While the models generated by the automated analysis
`system process each component Array 25 independently, the
`customized routines can take advantage of inter-component
`relationships.
`The array coder 28 encodes the Arrays 25 using a count-
`dependent modeling system, which includes a mixture of
`static models (constant for all databases) and dynamic
`adjustment to each particular Array 25.
`Thefilter 10x has components to support each of the four
`sections of the encoder 3'. The selection criteria 12x, deter-
`mined bythefile format specification, includes information
`neededto recognizefiles served by the filter 10x such as byte
`values at the beginning ofthefile orfile title suffices. Parsing
`instructions 14, also determined by the file format
`specification, include the information needed to parse the
`source file into parsed Arrays 25 havingall the instances of
`a particular component
`in the source file. The custom
`component models 16 are constructed by referencing both
`the file format specification and sample databases. They can
`take advantage of inter-component relationships to trans-
`form someof the component Arrays 25 to make them more
`compressible. Encoding parameters 18 are generated by an
`automated analysis system which examines component
`Arrays 25 parsed from a large number of sample databases.
`The encoding parameters 18 provide the data necessary to
`support the count-dependent modeling system in the array
`coder 28.
`FILTER SELECTION SYSTEM
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`compression algorithms to represent the data using fewer
`bytes. The typical implementation of the encoder 3 is as
`software running on a general purpose computer, although
`the algorithms to be described may also be implemented on
`specialized hardware.
`The output of the encoderis encoded data 4, which can be
`stored in memory (in which case the encoding allows more
`source data to be stored within the same hardware), or
`transmitted to the decoder 5 (in which case the encoding the
`source data to be transmitted faster when the transmission
`channel bandwidth is limited).
`The decoder5 retrieves or receives the encoded data 4 and
`reverses the algorithms used to encode the data so as to
`reconstruct the source data as decoded data 6. In a lossless
`
`system, the decoded data 6 is identical to the source data.
`Depending on the application, however, some data loss may
`be acceptable. As with the encoder 3,
`the decoder 5 is
`typically embodied as software running on a general purpose
`computer, but can be implemented in specialized hardware.
`The receiving application 7 receives the decoded data 6
`for processing.
`A preferred compression engine includes filter-based
`encoders and decoders which can be utilized by a large
`numberof sending applications 1 and receiving applications
`7. While the applications 1,7 that store/retrieve or transmit/
`receive the encoded data are not part of the compression
`system, they may adjust encoding and decoding operations
`according to system conditions such as available transmis-
`sion channel bandwidth.
`
`The descriptions to follow will mostly cover the encoder.
`The decoderjust reverses the encoder’s actions, so a descrip-
`tion of the encoder is sufficient
`to describe the overall
`
`encoding/decoding system. The term “File” will be used to
`describe a block of source data, which matches the conven-
`tional usage, but which should also be understood to be
`applicable to many other types of data blocks such as a data
`exchange in a Client-Server application.
`
`ENCODER
`
`FIG. 2 is a block diagram of a preferred encoder using a
`Base-Filter-Resource (BFR) network in accordance with the
`
`A preferred compression system uses an expandable
`architecture that allows users to support new or updated
`
`Page 11
`
`Page 11
`
`
`
`US 6,253,264 B1
`
`5
`formats by adding orreplacing filters 10. As noted above, a
`filter 10 can support more than one format. For example, a
`filter called “LOTUS123v4” might support all file formats
`associated with the Lotus 123 program, such as the current
`WK4format and the earlier WK3 and FM3 formats. If a
`
`Filter was later developed for a new Lotus WKS5 format, a
`user replaces the old filter with one supporting the new
`format and which would also be backwardly compatible
`with the earlier filter with respect to the older formats.
`In the Microsoft Windows environment, each Filter is a
`separate dynamically linked library (DLL) file. In addition
`to data tables, a filter includes executable code.
`If a filter is found fora file, then it is dynamically linked
`and then called to encode the file. The encoded data stream
`
`identify files via byte values near the beginning ofthefile.
`The identification procedure uses a search tree method to
`reduce the time needed to check the list when manyfilters
`are installed. From this registry, the appropriate filter 10 is
`invoked for the file type.
`To handle compound file formats such as Microsoft’s
`OLE2, one filter can invoke another filter to encode a
`sub-file or stream within the compoundfile. For example, a
`Winword Filter might call an Excel Filter to encode some
`Excel data which was embedded within a compound docu-
`mentfile.
`PARSING SYSTEM
`
`As noted above, the parsing system 24 creates a large
`numberof parsed Arrays 25. Each parsed Array 25 includes
`all instances of a particular componentin a file. A separate
`component
`is created for each member of each defined
`structure type.
`
`EXAMPLE1
`
`For example, consider the following description of a record
`defined in a file format description.
`Cell Value:
`
`The format uses a special method of coding numbersin an
`internal number type, called an RK number, to save
`memory and disk space.
`Record Data:
`
`Rec
`
`Row
`
`1
`2
`3
`4
`5
`
`0x0001
`0x0001
`0x0001
`0x0001
`0x0001
`
`Col
`
`0x0002
`0x0003
`0x0004
`0x0006
`0x0007
`
`Index
`
`0x0055
`Ox005f
`0x0053
`Ox005f
`Ox005f
`
`Num Val
`
`Ox3ff1e100
`0x3ff1a300
`0x3ff1d000
`Ox3ff1c400
`0x3ff1d300
`
`10
`
`15
`
`20
`
`30
`
`35
`
`40
`
`45
`
`50
`
`The source data would appear almost random to a byte-
`matching algorithm (shown in hex):
`01 00 02 00 55 00 00 el fi 3f01 00 03 00 5f 00 00 a3 fl
`3f 01 00 04 00 53 00 00 dO fi 3f
`01 00 06 00 5f 00 00 c4 el fl 3f 01 00 07 00 5f 00 00 d3
`fl 3f
`includes an identification (ID) code indicating which filter
`But when handled as four separate Arrays 25 (Row, Col,
`was used to encode the data, and the decoder has to have a
`Index, and Num) using algorithms optimized for each
`corresponding decodingfilter.
`component,
`the data can compress significantly. This
`the base
`To implement the filter selection system 22,
`approachis referred to herein as “structure flipping” because
`module maintains a registry of the currently installed files,
`of the way it rearranges the data.
`which includes the data need to identify source files of that
`To encode together the instances of the same component
`type as well as the path to the DLL. This registry is updated
`in a database is a common compression technique, and the
`
`each timeafilter is added or replaced. The identification data file format is often used to identify such components. But in
`can include bothalist offile title suffices and a method to
`25
`the priorart, this approach has been limitedto isolating a few
`major components of a specific file format, and the file
`format description used only as a technique offinding these
`major components. In accordance with preferred embodi-
`ments of the invention, a file format is actually converted
`into a compression model, which is then interfaced into an
`automated system for analyzing the data and optimizing the
`coding performance of each component. While this com-
`pression model is usually enhanced by the customized array
`transforms 26, it is still an aspect of the overall system.
`The extent
`to which preferred approaches differ from
`prior art systemsis reflected in the new tools constructed in
`order to implementit. This approach creates a large number
`of Arrays 25. Each Array 25 requires a different compression
`model, and the number of entries in each can change
`significantly from file to file. In order to work effectively in
`such a case, preferred embodimentsof the invention employ
`new approachesin how the low-level coding is handled such
`as the elimination of overhead, count-dependent processing,
`and the use of a processing network instead of fixed coding
`algorithms. These techniques, described above, are preferred
`in order to handle the output of the parsing system 24.
`All of these implementation features are integrated at a
`high level, so that high-level calls can handle all of the
`operations needed to initialize the system, parse the data,
`and encode the Arrays.
`In accordance with a particular preferred embodiment of
`the invention, the file parsing system 24 uses three sub-
`systems:
`
`Offset
`
`Name
`
`Size
`
`Contents
`
`4
`6
`8
`
`10
`
`rw
`col
`fmt
`
`num
`
`2
`2
`2
`
`4
`
`Row number
`Column number
`Index to the record that includes
`the cell format
`the coded number
`
`The parsing system 24 creates parsed Arrays 25 for each
`of the four components of this record. Each time an RK
`record is found, an entry is added to each array. The parsed
`Arrays 25 are much more compressible than the raw source
`data, as shown bydata from five records:
`
`55
`
`60
`
`65
`
`FILE_BUFFER
`ARRAYS
`PVSTRUCT
`
`Input\Output Interface
`To store the data for a single component
`To parse the input into Arrays or to write the decoded
`Arrays into the Output stream
`
`A plurality of FILE_BUFFERroutines provide a flexible
`input/output interface using FILE__BUFFERstructures, so
`the sameset of parsing routines can be used regardlessof the
`input/output formats. This allows the compression system to
`be independent of the I/O format. The file parsing system
`preferably supports two file formats: DOS and OLE2, but
`this can expanded to other file formats as well as to inter-
`
`Page 12
`
`Page 12
`
`
`
`US 6,253,264 B1
`
`7
`faces to communication ports and other I/O options. For
`DOSfiles, the logical stream positions are usually the same
`as the raw file positions. The OLE2 format requires an
`extensive layer of code to follow an OLE2 stream within the
`containerfile.
`
`8
`in the body of the record. The file specification then
`describes the contents of each record. A specification for a
`TABLE SIZE record might be:
`
`RecType=0x402
`Rec Body Size=12
`Record Data:
`
`Offset Name
`
`Size Contents
`
`4.
`6
`8
`10
`12
`
`MinRow
` MaxRow
` MinCol
` MaxCol
`(Reserved)
`
`First defined row on the sheet
`2
`Last defined row on the sheet, plus 1
`2
`First defined column on the sheet
`2
`Last defined column onthe sheet, plus 1
`2
`4 Reserved
`
`10
`
`15
`
`20
`
`A plurality of ARRAY routines provide a system for
`managing the process of initializing, adding data to,
`encoding, and freeing arrays of data. Each ARRAYstructure
`includes one array of data and the information needed to
`manage these processes, such as the numberof entries in the
`Array 25, the size of the buffer currently allocated for the
`array, and information which will be useful in encoding the
`array.
`In particular, ARRAYXX structure stores all the data
`from a component, and includes mechanismsto supportfast
`sequential read/write access to the data, an expandable
`buffer for the data, and storage of model data. This structure
`There can be more than one such record in the data. When
`also allows a simple set of tools to handle all allocation,
`there is more than one record,
`the values of the same
`parsing, and coding functions. The _XX nomenclature
`component in each record are going to be similar to each
`indicates that slightly different ARRAY structures are
`other. This enables the compression system to take advan-
`defined for different data types, such as different integer
`tage of the similarities to increase the compressionratio.
`wordsizes, strings, and floating point types, although many
`The compression system is preferably based on the
`of the processing routines work on all of the variations.
`ARRAY_XX structures, including:
`A plurality of PVSTRUCT routines make up an auto-
`25
`mated system for parsing the file data. During encoding,
`they read data from the FILE_BUFFERstructure and write
`data into Arrays 25. During decoding, they take data from
`for 8-bit data items, treated by the encoding system as
`being unsigned
`the Arrays 25 and write it into the FILEBUFFERstruc-
`for 16-bit data items, treated by the encoding system as
`tures. These routines can take a high-level description of a
`being signed
`file format and handle all the tasks associated with managing
`for 32-bit data items, treated by the encoding system as
`the parsing as well as the encoding of the parsed data. Many
`being signed
`for 32-bit floats
`ARRAY_F32:
`of the Filters 10 will also include customized parsing
`for 64-bit floats (“doubles”)
`ARRAY_F64
`routines which call the FILE.BUFFER and ARRAYrou-
`for 80-bit floats (“long doubles”)
`ARRAY_F80:
`tines directly instead of using the PVSTRUCTroutines. But
`ARRAY_STR:_each entry includes several bytes whose countis given
`35
`an aspect of the parsing system 24 is the PVSTRUCT
`by a size (string size).
`routines, and this discussion will be concerned primarily
`with their operation.
`A PVSTRUCTstructure includes all information needed
`
`ARRAY_U8:
`
`30
`
`ARRAY_16:
`
`ARRAY_32:
`
`For the TABLE SIZErecord, five arrays would be created
`(referring to the record specification). Each time a record is
`to parse and code a data structure defined in a file format.
`parsed, an entry is added into each ofthe five arrays.
`This usually involves multiple components, so it provides
`When all
`the records had been parsed, a function is
`and handles individual Arrays 25 for each component. The
`invoked to encode the items in each array. This function
`term “pvstruct” comes from Predictor Variable Structure,
`includes a large number of algorithms, which can exploit a
`where Predictor refers to the use of statistical data gained
`wide variety of relationships to achieve maximum compres-
`from analyzing a large numberof samplefiles to assist in the
`sion. The implementation is extremely efficient in that it
`encoding of the data, and Variable indicates that it supports
`requires virtually no overhead: an array with only one entry
`structures where the component counts can vary at run time
`can even be efficiently compressed. This is necessary
`for each instance (called “dynamic structures” in some
`because the parsing system can transformafile into a huge
`textbooks).
`number of these small arrays. The encoding function can
`A parsing language is a defined instruction set which
`also free the data after it finishes encodingit.
`provides a concise way to specify the parsing of a
`The decoding process reverses the sequence. First, each of
`PVSTRUCTstructure. The instruction set includes support
`the component arrays is decoded. Then the parsing is
`for a wide range of dynamic variation, including where the
`reversed.
`counts of components are dependent on file counts of other
`As this type of parsing has to be done extensively, a
`components as well as on external factors. It incorporates
`numberof tools are used to simplify the process. They are
`other file format data, such as defined values or range
`based on the use of instructions called vstruct_defs. To
`restrictions. It also provides a means of exchanging data
`continue with the TABLE SIZE example, the instruction set
`between the parser andthefilter.
`to parse such a record would be:
`The compression system is based on parsingafile so as
`to place items together that are similar. In the following
`description of the compression system, examples illustrating
`some structures which might be found in a typical spread-
`sheet file format.
`
`40
`
`45
`
`50
`
`55
`
`60
`
`EXAMPLE 2
`
`Spreadsheet files typically include a series of records.
`Each record begins with a 16-bit word indicating the type of
`record followed by a 16-bit word giving the numberof bytes
`
`65
`
`// the first 3 instructions always provide general info about the record:
`8
`// num instructions in this set
`12,
`// size of the record body, which is a fixed sz of 12 bytes
`5,
`// num arrays to be created
`// the remaining instructions tell how to parse the record:
`INT16,
`// Parse a 16 bit integer for the first component (“MinRow’)
`INT16,
`
`Page 13
`
`Page 13
`
`
`
`US 6,253,264 B1
`
`9
`
`-continued
`
`10
`
`-continued
`#define FLOAT32
`#define FLOAT64
`#define FLOAT80
`#define STRING
`#define STRINGZ
`#define SYS_S1
`#define SYS_SZ1
`#define SYS_S2
`#define SYS_SZ2
`
`BAINMBW
`
`9
`Oxa
`Oxb
`
`INT16,
`INT16,
`INT32 };
`
` // Parse a 32 bit integer for the last component (“Reserved”)
`
`The vstruct_def instructions are 16 bit integers. The first
`three words always include the same general info about the
`record. The rest of the words include instructions which tell
`
`the automated parsing system how to proceed when encoun-
`tering such a record. In this case, the instructions are the
`simplest possible—just a list of data types for each compo-
`nent of the record. The parser would create an array for each
`component, and load one entry of that type into the array
`each time it parses a record of that type:
`Functions are then provided to automate the parsing and
`encoding of records based on these instructions. Again, the
`decoding process reverses the actions of the encoder.
`The vstruct_def instructions can handle situations much
`more complex than the previous example. For convenience,
`a summary of the construction of these instructions will be
`summarized next. The upper byte of the 16 bit instruction is
`an argumentas required by the lower b