`Sebastian
`
`(10) Patent N0.:
`(45) Date 0f Patent:
`
`US 6,253,264 B1
`*Jun. 26, 2001
`
`US006253264B1
`
`(54) CODING NETWORK GROUPING DATA OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTURE AND SELECTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`BASE 0N BLOCK DATA TYPE
`
`5,684,478 * 11/1997 Panaoussis ........................... .. 341/51
`5,708,828 * 1/1998 Coleman ....... ..
`395/785
`5,710,562 * 1/1998 Gormish et al. .
`341/107
`i
`311E591 ------ -
`, , 5,867,114 * 2/1999 Barbir ............... .. 0 mes
`
`
`
`
`
`
`341/107
`
`_
`.
`.
`.
`(75) Inventor' W‘lham Sebastlam Fa1m°uth> MA
`(Us)
`
`707/102
`5,881,380 * 3/1999 MochiZuki et al.
`707/104
`5,983,236 * 11/1999 Yager et al.
`5,991,515 * 11/1999 Falleial ........................... .. 395/114
`
`'Thl'
`73 As' :Itll'tC
`(
`)
`slgnee
`Mgngggssmn ec “0 Ogles’
`
`FOREIGN PATENT DOCUMENTS
`
`(*) Notice:
`
`This patent issued on a continued pros-
`ecution application ?led under 37 CFR
`1.53(d), and is subject to the tWenty year
`t
`t
`t
`'
`'
`f 35 U.S.C.
`I1); ‘fax; rm provlslons 0
`'
`subject_to any disclaimeri the term of this
`Patent 15 extended or adlusted under 35
`
`0 729 237 A2
`
`8/1996 EP .
`(
`)
`OTHER PUBLICATIONS
`
`Rice, R.F., et al., “VLSI Universal Noiseless Coder,” NTIS
`Tech Notes 2301, p. 234 (Mar. 1,1990).
`Diner, D.B., et al., “Competitive Parallel Processing for
`Compression of Data,” NTIS Tech Notes 2301, p. 379 (May
`
`(21) Appl. NO.Z 09/036,052
`(22) Filed:
`Mar. 6, 1998
`Related US. Application Data
`t. N ' 60 036 548 ?l d M ' 7
`l.
`l
`.
`P .
`lggv?lona applcalon 0
`/
`’
`’
`e on at
`’
`
`60
`(
`)
`
`(51) Im. C17 ............................ .. G06F 17/30; G06F 3/00;
`G06F 5/00; H03M 7/00; H03M 7/30
`
`(52) US. Cl. ............................... .. 710/68; 710/65;33;l11//15017;
`,
`_
`(58) Fleld of
`its}
`’
`7’10/65 6g
`’
`
`’
`
`’
`
`’
`
`’
`
`* Cited by examiner
`
`grim”); Lgmml?er —1¥1OItI11aI\SI Lee
`guyen
`sststant xamtner— an
`(74) Attorney, Agent, or Firm—Hamilton, Brook, Smith &
`Re nolds, PC.
`y
`(57)
`
`ABSTRACT
`
`A preferred Coding network uses an architecture Called a
`Base-Filter-Resource (BFR) system. This approach inte
`grates the advantages of format-speci?c compression into a
`general-purpose compression tool serving a Wide range of
`data formats. Source data is parsed into blocks of similar
`data and each parsed blocks are compressed using a respec
`tively selected compression algorithm. The algorithm can be
`chosen from a static model of the data or can be adaptive to
`the data in the parsed block. The parsed blocks are then
`combined into an encoded data ?le. For decoding, the
`
`process 15 reversed‘
`
`29 Claims, 8 Drawing Sheets
`
`(56)
`
`References Cited
`
`US, PATENT DOCUMENTS
`*
`
`""""""""""""""""""" " 37335‘;
`
`5,617,541 * 4/1997 Albanese et al. ............. .. 395/20013
`5,632,009 * 5/1997 R80 et al. ....... ..
`395/770
`5,638,498 * 6/1997 Tyler et al. ........................ .. 395/117
`
`l/ SOURCE DATAZ
`
`\ SELECT/ON
`CRlTERIA
`12a. _12x.. ‘21
`
`FlLTER @ g
`
`10
`
`\ ENC ING
`PARA
`ERS
`1s
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 001
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 1 0f 8
`
`US 6,253,264 B1
`
`SENDING APPLICATION
`
`SOURCE DATA 2
`
`I
`
`ENCODER
`
`ENCODED DATA 4
`
`I
`
`DECODER
`
`DECODED DATA 6
`
`I
`I
`
`I03
`
`I01
`
`RECEIVING APPLICATION
`
`Z
`
`FIG. 1
`PRIOR ART
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 002
`
`
`
`US. Patent
`
`Jun. 26, 2001
`
`Sheet 2 0f 8
`
`US 6,253,264 B1
`
`mmESE
`
`zOfiomjmm
`
`35$
`
`memdd
`
`m><mm<
`
`mm
`
`02—mm<n_
`
`_>_m:.w>m
`
`ONE—20.530
`
`><mm<
`
`m_>_mOn_mZ<m._.
`
`ZOEbmfimw
`
`<Emtm0
`
`NNF....wa.:._m.N_‘
`
`OZ_mm<n_
`
`mZOFODmHmE
`
`3.
`
`20.530
`
`.rzsznzzOO
`
`mmfiDDOE
`
`9.
`
`N<._.<Qm0m30w\a
`
`ESE
`
`DMOOOZM
`
`<._.<D
`
`mm
`
`GZEOOZm
`
`wKMFmEdidd
`
`m:
`
`N.OE
`
`mg
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 003
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 3 0f 8
`
`US 6,253,264 B1
`
`STRING
`CODER 40
`
`Y
`
`FLOAT
`CODER 50
`
`INTEGER
`CODER 70
`
`FIG. 3
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 004
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 4 0f 8
`
`US 6,253,264 B1
`
`
`
`@000 I022
`
`mmOOO ><Mm< OP
`
`@N\ + $\
`
`
`
`WOZFEM OMIOEQEZD
`
`+
`
`
`
`NEQOU Cm;
`
`w .OE
`
`mm
`
`
`
`
`
`KMOOO ><EM< 20mm
`
`wOZEPw
`
`OmFOEmEQ m0 kw:
`
`
`
`jwooz QEPQ
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 005
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 5 0f 8
`
`US 6,253,264 B1
`
`%\
`
`ww
`
`mm
`
`
`
`
`
`EMOOO ><MM< EONE
`
`EmFEw>ZOO o Pmm<m W +
`
`mm
`
`mm
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 006
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 6 0f 8
`
`US 6,253,264 B1
`
`SAMPLE DATA BASES
`
`I
`
`GENERATE PARSED FILES
`
`PARSED FILES
`
`ANALYZE PARSED FILES
`
`ANALYSIS RESULT FILES
`
`GENERATE PREDICTOR DATA
`
`/62
`
`/64
`
`/66
`
`PDAT FILE
`
`l
`
`FIG. 6
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 007
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 7 0f 8
`
`US 6,253,264 B1
`
`25
`
`73
`
`—>@NCCDED MODEL INF@/
`
`(ARRAY TO BE CODED)
`ARRAY DATA
`Y
`BUILD MODEL
`ARRAY DATA
`
`71
`
`MODEL
`
`SPLITTER
`
`77
`
`TWO OR MORE ARRAYS
`PROCESS SEPARATELY
`
`MATCHING
`ALGORITHM
`
`MATCHING
`ALGORITHMS
`
`81
`
`85
`NO +
`OVEHREILIZSLQERAY
`7
`CONVERT TO |NT32
`
`87
`
`UNMATCHED
`VALUES
`ARRAYS CONTAINING 83
`MATCH DATA
`
`USE
`NUMERIC
`TRANSFORM
`
`91
`
`NUMERIC
`ALGORITHMS
`OFFSETS FROM
`MODEL
`
`SPLITTER
`
`95
`
`TWO OR MORE ARRAYS
`PROCESS SEPARATELY
`
`REPEAT
`TRANSFORM
`
`99
`
`FIG. 7
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 008
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 8 0f 8
`
`US 6,253,264 B1
`
`101
`
`ARRAY DATA IN
`
`/103
`
`105
`
`RANGE
`REMAPPER
`
`EXTRA BITS
`TO PACK
`
`RANGE CODES
`
`USE MTF
`CODER
`
`MTF CODER
`
`/ 109
`
`'
`RANGE CODES TO
`LOW LEVEL CODER
`
`110
`
`FIG. 8
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 009
`
`
`
`US 6,253,264 B1
`
`1
`CODING NETWORK GROUPING DATA OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTURE AND SELECTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`BASE ON BLOCK DATA TYPE
`
`RELATED APPLICATIONS
`This application claims the bene?t of US. Provisional
`Serial No. 60/036,548 ?led Mar. 7, 1997, the teachings of
`Which are incorporated herein by reference in their entirety.
`
`BACKGROUND OF THE INVENTION
`High performance data compression systems use models
`of the data to increase their ability to predict values, Which
`in turn leads to greater compression. The best models can be
`achieved by building a compression system to support a
`speci?c data format. Instead of trying to deduce a crude
`model from the data Within a speci?c ?le, a format-speci?c
`compression system can provide a precise pre-determined
`model. The model can take advantage not just of the ?le
`format structure, but also of statistical data gathered from
`sample databases.
`Previous efforts at format-speci?c compression have been
`focused on solutions to a feW individual formats rather than
`on the development of a generaliZed method that could be
`adapted to many formats. The models Which have been
`created typically involve a small number of components.
`This Works adequately When most of the data is included in
`a feW components, such as an image ?le having mostly red,
`blue, and green piXel values. But may formats are best
`modeled using a large number of components, and the
`previous systems are not designed to build or encode such
`models.
`
`SUMMARY OF THE INVENTION
`Apreferred coding system in accordance With the inven
`tion solves both of these problems in the prior art: it provides
`a generaliZed solution Which can be adapted to handle a Wide
`range of formats, and it effectively handles models that use
`large numbers of components. The system involves neW
`approaches at many levels: from the highest (interface) level
`to the loWest (core encoding algorithms) level. The coding
`netWork includes a compression netWork for encoding data
`and a decompression netWork for decoding data.
`At the highest level, a preferred compression system in
`accordance With the invention uses an architecture called a
`Base-Filter-Resource (BFR) system. This approach inte
`grates the advantages of format-speci?c compression into a
`general-purpose compression tool serving a Wide range of
`data formats. The system includes ?lters Which each support
`a speci?c data format, such as for EXcel XLS Worksheets or
`Word DOC ?les. The base includes the system control
`modules and a library used by all the ?lters. The resources
`include routines Which are used by more than one ?lter, but
`Which are not part of the base. If a ?lter is installed Which
`matches the format of the data to be encoded, the advantages
`of format-speci?c compression can be realiZed for that data.
`OtherWise, a “generic” ?lter is used Which achieves perfor
`mance similar to other non-speci?c data compression sys
`tems (such as PKZip, Stacker, etc.).
`At the neXt level, the system preferably includes a method
`for parsing source data into individual components. The
`basic approach, called “structure ?ipping,” provides a key to
`converting format information into compression models.
`Structure ?ipping reorganiZes the information in a ?le so that
`similar components that are normally separated are grouped
`together.
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`2
`Upon this foundation are a numbers of tools, such as:
`a language to simplify the creation of parsing routines;
`tools to parse the source data using this method into
`separate components; and
`tools to generate models for individual components by
`automated analysis of sample data bases.
`These tools can be applied to the ?lters for a Wide range of
`?le and data types.
`The compression system preferably includes tools called
`customiZed array transforms for speci?c ?lters and for
`certain types of ?lters. These techniques handle a speci?c
`?le type or certain data constructions used by several ?le
`types, such as encoding tWo dimensional arrays of data as
`found in many database formats.
`At the loW-level of the system are preferably a number of
`mechanisms for encoding data arrays, including:
`neW loW-level encoding algorithms;
`methods for integrating a large number of transforms and
`encoding algorithms;
`methods for eliminating overhead so that small data
`blocks can be ef?ciently coded; and
`a neW method for integrating both static models
`(determined from statistical analysis of sample
`databases) and dynamic models (adapted to the data
`Within a particular array) into the encoding of each
`component.
`A preferred method of encoding source data comprising
`parsing the source data into a plurality of blocks. The parsed
`blocks typically have a different format than the source data
`format. In particular, similar data from the source data are
`collected and grouped into respective blocks.
`For each block, a compression algorithm is selected from
`a plurality of candidate compression algorithms and applied
`to the block. The compression algorithms can be determined
`based on the amount of data in the respective block.
`Furthermore, the compression algorithm can be adapted to
`the respective block, including the use of a customiZed
`transform. The selection of an algorithm can also be based
`on a compression model, Which is derived from the format
`of the source data. The compressed data from the plurality
`of blocks are then combined into encoded data.
`The coding netWork can also include a decompression
`netWork to convert the encoded data back into the source
`data. First the data is decoded and then the parsing is
`reversed. In a lossless system, the resulting data is identical
`to the source data.
`Embodiments of the invention preferably take the form of
`machine executable instructions embedded in a machine
`readable format on a CD-ROM, ?oppy disk or hard disk, or
`another machine-readable distribution medium. These
`instructions are eXecuted by one or more processing units to
`implement the compression and decompression netWorks.
`BRIEF DESCRIPTION OF THE DRAWINGS
`The foregoing and other objects, features and advantages
`of the invention Will be apparent from the folloWing more
`particular description of preferred embodiments of the
`invention, as illustrated in the accompanying draWings in
`Which like reference characters refer to the same parts
`throughout the different vieWs. The draWings are not nec
`essarily to scale, emphasis instead being placed upon illus
`trating the principles of the invention.
`FIG. 1 is a high-level block diagram of a typical data
`compression system.
`FIG. 2 is a block diagram of a preferred encoder using the
`BER system in accordance With the invention.
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 010
`
`
`
`US 6,253,264 B1
`
`3
`FIG. 3 is a ?oW chart of an array coder.
`FIG. 4 is a block diagram of the string coder 40 of FIG.
`
`3
`
`FIG. 5 is a block diagram of the ?oat coder 50 of FIG. 3.
`FIG. 6 is a ?owchart of a preferred automated analysis
`system.
`FIG. 7 is a ?owchart overvieWing the high-level of the
`integer coder.
`FIG. 8 is a ?oWchart overvieWing the mid-level of the
`integer coder.
`
`DETAILED DESCRIPTIONS OF PREFERRED
`EMBODIMENTS OF THE INVENTION
`
`FIG. 1 is a high-level block diagram of a typical data
`compression system. The primary components of the system
`are a sending application 1, an encoder 3, a decoder 5, and
`a receiving application 7.
`The sending application 1 provides source data 2 to be
`encoded by the encoder 3. The sending application can be a
`?le archiver, a communication system, or any other appli
`cation With a need to represent data using feWer bytes than
`in the native format.
`The encoder 3 receives the source data 2 and uses data
`compression algorithms to represent the data using feWer
`bytes. The typical implementation of the encoder 3 is as
`softWare running on a general purpose computer, although
`the algorithms to be described may also be implemented on
`specialiZed hardWare.
`The output of the encoder is encoded data 4, Which can be
`stored in memory (in Which case the encoding alloWs more
`source data to be stored Within the same hardWare), or
`transmitted to the decoder 5 (in Which case the encoding the
`source data to be transmitted faster When the transmission
`channel bandWidth is limited).
`The decoder 5 retrieves or receives the encoded data 4 and
`reverses the algorithms used to encode the data so as to
`reconstruct the source data as decoded data 6. In a lossless
`system, the decoded data 6 is identical to the source data.
`Depending on the application, hoWever, some data loss may
`be acceptable. As With the encoder 3, the decoder 5 is
`typically embodied as softWare running on a general purpose
`computer, but can be implemented in specialiZed hardWare.
`The receiving application 7 receives the decoded data 6
`for processing.
`A preferred compression engine includes ?lter-based
`encoders and decoders Which can be utiliZed by a large
`number of sending applications 1 and receiving applications
`7. While the applications 1,7 that store/retrieve or transmit/
`receive the encoded data are not part of the compression
`system, they may adjust encoding and decoding operations
`according to system conditions such as available transmis
`sion channel bandWidth.
`The descriptions to folloW Will mostly cover the encoder.
`The decoder just reverses the encoder’s actions, so a descrip
`tion of the encoder is suf?cient to describe the overall
`encoding/decoding system. The term “File” Will be used to
`describe a block of source data, Which matches the conven
`tional usage, but Which should also be understood to be
`applicable to many other types of data blocks such as a data
`exchange in a Client-Server application.
`
`ENCODER
`
`FIG. 2 is a block diagram of a preferred encoder using a
`Base-Filter-Resource (BFR) netWork in accordance With the
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`4
`invention. The encoder 3‘ is based on the use of a plurality
`of ?lters 10a, .
`.
`. , 10x, .
`.
`.
`, 102 Which serve speci?c ?le
`formats. For example, one ?lter 10a might support several
`versions of the DBF database format, While another ?lter 102
`might support several versions of the DOC format used by
`the Microsoft Word softWare program. The individual ?lters
`provide respective selection criteria 12 to a ?lter selection
`system 22.
`The ?lter selection system 22 receives the source data 2
`and checks the selection criteria 12a, .
`.
`. , 12x, .
`.
`. , 122 of
`
`, 102 installed in the system to
`.
`.
`. , 10x, .
`.
`all ?lters 10a, .
`see if any of them support the source data’s format. If not,
`a “generic” ?lter is used Which provides compression per
`formance similar to other generic compression systems,
`such as Lempel-Ziv (LZ) engines. In a particular preferred
`embodiment of the invention, the generic compression sys
`tem employs an SZIP engine as described by Mr. Schindler
`in US. application Ser. No. 08/970,220 ?led Nov. 14, 1997,
`the teachings of Which are incorporated herein by reference
`in their entirety. The descriptions of the netWork Will pri
`marily cover the situations in Which a ?lter to support the
`data format is successfully found.
`Brie?y, a parsing system 24 parses the source data 2 into
`a large number of parsed Arrays 25, Which each include all
`the instances of a particular component in the source ?le.
`The term “Array”, described in further detail beloW, is
`capitaliZed to indicate that this refers to a type of structure
`speci?c to the netWork, as opposed to the usual use of the
`term “array”. The algorithm used to parse the source data 2
`is called “Structure Flipping”, an important aspect of the
`netWork, described in further detail beloW.
`The BFR system uses an automated analysis system (also
`described in detail beloW) to build a model for each Array
`25, Which Will be used When the Arrays 25 are encoded in
`an array coder 28. In some cases, better compression can be
`achieved if customiZed array transforms 26 are ?rst used.
`While the models generated by the automated analysis
`system process each component Array 25 independently, the
`customiZed routines can take advantage of inter-component
`relationships.
`The array coder 28 encodes the Arrays 25 using a count
`dependent modeling system, Which includes a mixture of
`static models (constant for all databases) and dynamic
`adjustment to each particular Array 25.
`The ?lter 10x has components to support each of the four
`sections of the encoder 3‘. The selection criteria 12x, deter
`mined by the ?le format speci?cation, includes information
`needed to recogniZe ?les served by the ?lter 10x such as byte
`values at the beginning of the ?le or ?le title suffices. Parsing
`instructions 14, also determined by the ?le format
`speci?cation, include the information needed to parse the
`source ?le into parsed Arrays 25 having all the instances of
`a particular component in the source ?le. The custom
`component models 16 are constructed by referencing both
`the ?le format speci?cation and sample databases. They can
`take advantage of inter-component relationships to trans
`form some of the component Arrays 25 to make them more
`compressible. Encoding parameters 18 are generated by an
`automated analysis system Which examines component
`Arrays 25 parsed from a large number of sample databases.
`The encoding parameters 18 provide the data necessary to
`support the count-dependent modeling system in the array
`coder 28.
`FILTER SELECTION SYSTEM
`A preferred compression system uses an expandable
`architecture that alloWs users to support neW or updated
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 011
`
`
`
`US 6,253,264 B1
`
`5
`formats by adding or replacing ?lters 10. As noted above, a
`?lter 10 can support more than one format. For example, a
`?lter called “LOTUS123v4” might support all ?le formats
`associated With the Lotus 123 program, such as the current
`WK4 format and the earlier WK3 and FM3 formats. If a
`Filter Was later developed for a neW Lotus WKS format, a
`user replaces the old ?lter With one supporting the neW
`format and Which Would also be backWardly compatible
`With the earlier ?lter With respect to the older formats.
`In the Microsoft WindoWs environment, each Filter is a
`separate dynamically linked library (DLL) ?le. In addition
`to data tables, a ?lter includes executable code.
`If a ?lter is found for a ?le, then it is dynamically linked
`and then called to encode the ?le. The encoded data stream
`includes an identi?cation (ID) code indicating Which ?lter
`Was used to encode the data, and the decoder has to have a
`corresponding decoding ?lter.
`To implement the ?lter selection system 22, the base
`module maintains a registry of the currently installed ?les,
`Which includes the data need to identify source ?les of that
`type as Well as the path to the DLL. This registry is updated
`each time a ?lter is added or replaced. The identi?cation data
`can include both a list of ?le title suf?ces and a method to
`identify ?les via byte values near the beginning of the ?le.
`The identi?cation procedure uses a search tree method to
`reduce the time needed to check the list When many ?lters
`are installed. From this registry, the appropriate ?lter 10 is
`invoked for the ?le type.
`To handle compound ?le formats such as Microsoft’s
`OLE2, one ?lter can invoke another ?lter to encode a
`sub-?le or stream Within the compound ?le. For example, a
`WinWord Filter might call an Excel Filter to encode some
`Excel data Which Was embedded Within a compound docu
`ment ?le.
`PARSING SYSTEM
`As noted above, the parsing system 24 creates a large
`number of parsed Arrays 25. Each parsed Array 25 includes
`all instances of a particular component in a ?le. A separate
`component is created for each member of each de?ned
`structure type.
`
`10
`
`15
`
`25
`
`35
`
`EXAMPLE 1
`For example, consider the folloWing description of a record
`de?ned in a ?le format description.
`Cell Value:
`
`45
`
`The format uses a special method of coding numbers in an
`internal number type, called an RK number, to save
`memory and disk space.
`Record Data:
`
`Rec
`
`RoW
`
`Col
`
`Index
`
`Num Val
`
`1
`2
`3
`4
`5
`
`OxOOOl
`OxOOOl
`OxOOOl
`OxOOOl
`OxOOOl
`
`0x0002
`0x0003
`0x0004
`0x0006
`0x0007
`
`0x0055
`0x005f
`0x0053
`0x005f
`0x005f
`
`Ox3ff1e100
`Ox3ff1a300
`Ox3ff1d000
`Ox3ff1c400
`Ox3ff1d300
`
`The source data Would appear almost random to a byte
`matching algorithm (shoWn in hex):
`01 00 02 00 55 00 00 el ? 3f01 00 03 00 5f 00 00 a3 ?
`3f 01 00 04 00 53 00 00 d0 ? 3f
`01 00 06 00 5f 00 00 c4 el ? 3f 01 00 07 00 5f 00 00 d3
`
`But When handled as four separate Arrays 25 (RoW, Col,
`Index, and Num) using algorithms optimiZed for each
`component, the data can compress signi?cantly. This
`approach is referred to herein as “structure ?ipping” because
`of the Way it rearranges the data.
`To encode together the instances of the same component
`in a database is a common compression technique, and the
`?le format is often used to identify such components. But in
`the prior art, this approach has been limited to isolating a feW
`major components of a speci?c ?le format, and the ?le
`format description used only as a technique of ?nding these
`major components. In accordance With preferred embodi
`ments of the invention, a ?le format is actually converted
`into a compression model, Which is then interfaced into an
`automated system for analyZing the data and optimiZing the
`coding performance of each component. While this com
`pression model is usually enhanced by the customiZed array
`transforms 26, it is still an aspect of the overall system.
`The extent to Which preferred approaches differ from
`prior art systems is re?ected in the neW tools constructed in
`order to implement it. This approach creates a large number
`of Arrays 25. Each Array 25 requires a different compression
`model, and the number of entries in each can change
`signi?cantly from ?le to ?le. In order to Work effectively in
`such a case, preferred embodiments of the invention employ
`neW approaches in hoW the loW-level coding is handled such
`as the elimination of overhead, count-dependent processing,
`and the use of a processing netWork instead of ?xed coding
`algorithms. These techniques, described above, are preferred
`in order to handle the output of the parsing system 24.
`All of these implementation features are integrated at a
`high level, so that high-level calls can handle all of the
`operations needed to initialiZe the system, parse the data,
`and encode the Arrays.
`In accordance With a particular preferred embodiment of
`the invention, the ?le parsing system 24 uses three sub
`systems:
`
`Offset
`
`Name
`
`Size
`
`Contents
`
`4
`6
`8
`
`rW
`col
`frnt
`
`10
`
`num
`
`2
`2
`2
`
`4
`
`RoW number
`Column number
`Index to the record that includes
`the cell format
`the coded number
`
`The parsing system 24 creates parsed Arrays 25 for each
`of the four components of this record. Each time an RK
`record is found, an entry is added to each array. The parsed
`65
`Arrays 25 are much more compressible than the raW source
`data, as shoWn by data from ?ve records:
`
`55
`
`FILEiB UFFER
`ARRAYS
`PVSTRUCT
`
`Input\Output Interface
`To store the data for a single component
`To parse the input into Arrays or to Write the decoded
`Arrays into the Output stream
`
`A plurality of FILEiBUFFER routines provide a ?exible
`input/output interface using FILEiBUFFER structures, so
`the same set of parsing routines can be used regardless of the
`input/output formats. This alloWs the compression system to
`be independent of the I/O format. The ?le parsing system
`preferably supports tWo ?le formats: DOS and OLE2, but
`this can expanded to other ?le formats as Well as to inter
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 012
`
`
`
`US 6,253,264 B1
`
`7
`faces to communication ports and other I/O options. For
`DOS ?les, the logical stream positions are usually the same
`as the raW ?le positions. The OLE2 format requires an
`extensive layer of code to folloW an OLE2 stream Within the
`container ?le.
`A plurality of ARRAY routines provide a system for
`managing the process of initializing, adding data to,
`encoding, and freeing arrays of data. Each ARRAY structure
`includes one array of data and the information needed to
`manage these processes, such as the number of entries in the
`Array 25, the size of the buffer currently allocated for the
`array, and information Which Will be useful in encoding the
`array.
`In particular, ARRAYiXX structure stores all the data
`from a component, and includes mechanisms to support fast
`sequential read/Write access to the data, an expandable
`buffer for the data, and storage of model data. This structure
`also alloWs a simple set of tools to handle all allocation,
`parsing, and coding functions. The iXX nomenclature
`indicates that slightly different ARRAY structures are
`de?ned for different data types, such as different integer
`Word sizes, strings, and ?oating point types, although many
`of the processing routines Work on all of the variations.
`A plurality of PVSTRUCT routines make up an auto
`mated system for parsing the ?le data. During encoding,
`they read data from the FILEiBUFFER structure and Write
`data into Arrays 25. During decoding, they take data from
`the Arrays 25 and Write it into the FILEiBUFFER struc
`tures. These routines can take a high-level description of a
`?le format and handle all the tasks associated With managing
`the parsing as Well as the encoding of the parsed data. Many
`of the Filters 10 Will also include customized parsing
`routines Which call the FILEiBUFFER and ARRAY rou
`tines directly instead of using the PVSTRUCT routines. But
`an aspect of the parsing system 24 is the PVSTRUCT
`routines, and this discussion Will be concerned primarily
`With their operation.
`A PVSTRUCT structure includes all information needed
`to parse and code a data structure de?ned in a ?le format.
`This usually involves multiple components, so it provides
`and handles individual Arrays 25 for each component. The
`term “pvstruct” comes from Predictor Variable Structure,
`Where Predictor refers to the use of statistical data gained
`from analyzing a large number of sample ?les to assist in the
`encoding of the data, and Variable indicates that it supports
`structures Where the component counts can vary at run time
`for each instance (called “dynamic structures” in some
`textbooks).
`A parsing language is a de?ned instruction set Which
`provides a concise Way to specify the parsing of a
`PVSTRUCT structure. The instruction set includes support
`for a Wide range of dynamic variation, including Where the
`counts of components are dependent on ?le counts of other
`components as Well as on external factors. It incorporates
`other ?le format data, such as de?ned values or range
`restrictions. It also provides a means of exchanging data
`betWeen the parser and the ?lter.
`The compression system is based on parsing a ?le so as
`to place items together that are similar. In the folloWing
`description of the compression system, examples illustrating
`some structures Which might be found in a typical spread
`sheet ?le format.
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`EXAMPLE 2
`Spreadsheet ?les typically include a series of records.
`Each record begins With a 16-bit Word indicating the type of
`record folloWed by a 16-bit Word giving the number of bytes
`
`65
`
`8
`in the body of the record. The ?le speci?cation then
`describes the contents of each record. A speci?cation for a
`TABLE SIZE record might be:
`RecType=0x402
`Rec Body Size=12
`Record Data:
`
`Offset Name
`
`Size Contents
`
`4
`6
`8
`10
`12
`
`MinRoW
`MaxRoW
`MinCol
`MaxCol
`(Reserved)
`
`2 First de?ned roW on the sheet
`2 Last de?ned roW on the sheet, plus 1
`2 First de?ned column on the sheet
`2 Last de?ned column on the sheet, plus 1
`4 Reserved
`
`There can be more than one such record in the data. When
`there is more than one record, the values of the same
`component in each record are going to be similar to each
`other. This enables the compression system to take advan
`tage of the similarities to increase the compression ratio.
`The compression system is preferably based on the
`ARRAYiXX structures, including:
`
`ARRAYiU8:
`
`ARRAYi16:
`
`ARRAYi32:
`
`for 8-bit data items, treated by the encoding system as
`being unsigned
`for 16-bit data items, treated by the encoding system as
`being signed
`for 32-bit data items, treated by the encoding system as
`being signed
`for 32-bit ?oats
`ARRAYiF32:
`for 64-bit ?oats (“doubles”)
`ARRAYiF64
`for 80-bit ?oats (“long doubles”)
`ARRAYiF80:
`ARRAYiSTR: each entry includes several bytes Whose count is given
`by a size (string size).
`
`For the TABLE SIZE record, ?ve arrays Would be created
`(referring to the record speci?cation). Each time a record is
`parsed, an entry is added into each of the ?ve arrays.
`When all the records had been parsed, a function is
`invoked to encode the items in each array. This function
`includes a large number of algorithms, Which can exploit a
`Wide variety of relationships to achieve maximum compres
`sion. The implementation is extremely ef?cient in that it
`requires virtually no overhead: an array With only one entry
`can even be ef?ciently compressed. This is necessary
`because the parsing system can transform a ?le into a huge
`number of these small arrays. The encoding function can
`also free the data after it ?nishes encoding it.
`The decoding process reverses the sequence. First, each of
`the component arrays is decoded. Then the parsing is
`reversed.
`As this type of parsing has to be done extensively, a
`number of tools are used to simplify the process. They are
`based on the use of instructions called vstructidefs. To
`continue With the TABLE SIZE example, the instruction set
`to parse such a record Would be:
`
`// the ?rst 3 instructions alWays provide general info about the record:
`8,
`// num instructions in this set
`12,
`// size of the record body, Which is a ?xed sz of 12 bytes
`5,
`// num arrays to be created
`// the remaining instructions tell hoW to parse the record:
`INT16,
`// Parse a 16 bit integer for the ?rst component (“MinRoW”)
`INT16,
`
`Veritas Techs. LLC
`Exhibit 1030
`Page 013
`
`
`
`9
`
`-continued
`
`US 6,253,264 B1
`
`10
`
`-continued
`
`INT16,
`INT16,
`INT32 };
`
`// Parse a 32 bit integer for the last component (“Reserved”)
`
`The vstructidef instructions are 16 bit integers. The ?rst
`three words always include the same general info about the
`record. The rest of the words include instructions which tell
`the automated parsing system how to proceed when encoun
`tering such a record. In this case, the instructions are the
`simplest possible—just a list of data types for each compo
`nent of the record. The parser would create an array for each
`component, and load one entry of that type into the array
`each time it parses a record of that type:
`Functions are then provided to automate the parsing and
`encoding of records based on these instructions. Again, the
`decoding process reverses the actions of the encoder.
`The vstructidef