`Sebastian
`
`I lllll llllllll Ill lllll lllll lllll lllll lllll 111111111111111111111111111111111
`US006253264Bl
`US 6,253,264 Bl
`*Jun.26,2001
`
`(10) Patent No.:
`(45) Date of Patent:
`
`(54) CODING NETWORK GROUPING DATA OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTURE AND SELECTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`BASE ON BLOCK DATA TYPE
`
`(75)
`
`Inventor: William Sebastian, Falmouth, MA
`(US)
`
`(73) Assignee: Intelligent Compression Technologies,
`Falmouth, MA (US)
`
`( *) Notice:
`
`This patent issued on a continued pros(cid:173)
`ecution application filed under 37 CFR
`1.53( d), and is subject to the twenty year
`patent term provisions of 35 U.S.C.
`154(a)(2).
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 09/036,052
`
`(22) Filed:
`
`Mar. 6, 1998
`
`Related U.S. Application Data
`( 60) Provisional application No. 60/036,548, filed on Mar. 7,
`1997.
`
`(51)
`
`Int. Cl.7 .............................. G06F 17/30; G06F 3/00;
`G06F 5/00; H03M 7/00; H03M 7/30
`(52) U.S. Cl. ................................. 710/68; 710/65; 341/51;
`341/107
`(58) Field of Search ................................ 341/51, 65, 107;
`707/102, 503, 504, 505, 508; 370/474;
`710/65, 68
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`5,293,379 *
`5,467,087
`5,617,541 *
`5,632,009 *
`5,638,498 *
`
`Carr ... ... ... ... .... ... ... ... ... ... ..... 370/94.1
`Chu ........................................ 341/51
`Albanese et al. ............... 395/200.13
`Rao et al.
`... ... ... ... ... .... ... ... ... 395 /770
`Tyler et al.
`.......................... 395/117
`
`3/1994
`11/1995
`4/1997
`5/1997
`6/1997
`
`5,684,478 * 11/1997 Panaoussis ............................. 341/51
`5,708,828 * 1/1998 Coleman .............................. 395/785
`5,710,562 * 1/1998 Gormish et al. ..................... 341/107
`5,838,964 * 11/1998 Gubser ................................. 395/612
`5,864,860 * 1/1999 Holmes ................................ 707/101
`5,867,114 * 2/1999 Barbir .................................. 341/107
`5,881,380 * 3/1999 Mochizuki et al.
`................. 707/102
`5,983,236 * 11/1999 Yager et al.
`......................... 707/104
`5,991,515 * 11/1999 Fall et al. ............................. 395/114
`
`FOREIGN PATENT DOCUMENTS
`
`0 729 237 A2
`
`8/1996 (EP) .
`
`OTHER PUBLICATIONS
`
`Rice, R.F., et al., "VLSI Universal Noiseless Coder," NTIS
`Tech Notes 2301, p. 234 (Mar. 1,1990).
`
`Diner, D.B., et al., "Competitive Parallel Processing for
`Compression of Data," NTIS Tech Notes 2301, p. 379 (May
`1, 1990).
`
`* cited by examiner
`
`Primary Examiner-Thomas Lee
`Assistant Examiner-Tanh Nguyen
`(74) Attorney, Agent, or Firm-Hamilton, Brook, Smith &
`Reynolds, P.C.
`
`(57)
`
`ABSTRACT
`
`A preferred coding network uses an architecture called a
`Base-Filter-Resource (BFR) system. This approach inte(cid:173)
`grates the advantages of format-specific compression into a
`general-purpose compression tool serving a wide range of
`data formats. Source data is parsed into blocks of similar
`data and each parsed blocks are compressed using a respec(cid:173)
`tively selected compression algorithm. The algorithm can be
`chosen from a static model of the data or can be adaptive to
`the data in the parsed block. The parsed blocks are then
`combined into an encoded data file. For decoding, the
`process is reversed.
`
`29 Claims, 8 Drawing Sheets
`
`NetApp; Rackspace Exhibit 1005 Page 1
`
`
`
`U.S. Patent
`
`Jun.26,2001
`
`Sheet 1 of 8
`
`US 6,253,264 Bl
`
`SENDING APPLICATION
`
`I
`SOURCE DATA 2
`i
`
`ENCODER
`
`ENCODED DATA4
`
`+
`
`DECODER
`
`1
`
`3
`
`5
`
`DECODED DATA6
`
`i
`
`RECEIVING APPLICATION
`
`7
`
`FIG. 1
`PRIOR ART
`
`NetApp; Rackspace Exhibit 1005 Page 2
`
`
`
`lo-"
`~
`.i;;..
`O'I
`'N
`~
`(It
`'N
`O'I
`rJ'J.
`
`e
`
`0
`N
`~ .....
`'Jl =(cid:173)~
`
`"""' 00
`
`"""'
`N c c
`
`~
`
`~~
`N
`
`~
`
`~ =
`
`......
`~ ......
`~
`•
`\JJ.
`d •
`
`---r:E'NcoDED
`
`DATA
`
`29
`
`!
`
`CODER
`ARRAY v28
`
`~
`
`l/25
`
`..
`
`TRANSFORMS
`CUSTOMIZED V 26
`
`ARRAY
`
`25
`
`•
`
`ARRAYS
`I/ PARSED
`
`SYSTEM
`PARSING v24
`
`[/ 2
`
`..
`
`SYSTEM
`
`""!SELECTION
`
`FILTER 1,,22
`
`ysoURCE DATA 2
`
`18
`
`FIG. 2
`
`PARAMETERS
`"'-ENCODING
`
`-
`
`16
`
`MODULES
`
`COMPONENT
`"'-CUSTOM
`
`14
`
`INSTRUCTIONS
`"'-PARSING
`
`I--
`
`..__
`
`FILTER
`
`10a
`
`12a, ... , 12x, ... 12z
`
`CRITERIA
`SELECTION
`
`"
`
`I
`
`3'""
`
`I---
`
`10x 1-----,
`
`10
`/
`
`··.I
`
`··.I __ ~ 10z
`r--i
`
`NetApp; Rackspace Exhibit 1005 Page 3
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 3 of 8
`
`US 6,253,264 Bl
`
`y
`
`y
`
`N
`
`N
`
`INTEGER
`CODER 70
`
`STRING
`CODER 40
`
`FLOAT
`CODER 50
`
`FIG. 3
`
`NetApp; Rackspace Exhibit 1005 Page 4
`
`
`
`lo-"
`~
`.i;;..
`O'I
`'N
`~
`(It
`'N
`O'I
`rJ'J.
`
`e
`
`00
`0 .....,
`.i;;..
`
`~ .....
`'Jl =(cid:173)~
`
`'"""'
`N c c
`
`~
`
`~~
`N
`
`~
`
`~ =
`
`......
`~ ......
`~
`•
`\JJ.
`d •
`
`,--40
`
`FIG. 4
`
`TO ARRAY CODER
`
`TEXT CODER
`
`28
`
`MATCH CODES
`
`/47
`UNMATCHED STRINGS
`
`t
`
`PRIMER
`
`STRING MATCHER
`
`~
`
`-
`
`/41
`
`, r
`
`STRING DATA
`
`LIST OF PREDICTED
`
`STRI NGS
`
`FROM ARRAY CODER
`
`(STATIC MODEL)
`
`/28
`
`43
`
`NetApp; Rackspace Exhibit 1005 Page 5
`
`
`
`U.S. Patent
`
`Jun.26,2001
`
`Sheet 5 of 8
`
`US 6,253,264 Bl
`
`0
`L.()
`
`\
`
`co
`N
`"'--
`~ w
`0
`0 u
`>-
`~
`
`~
`<(
`:?:
`0
`~ u.
`
`-
`
`<(
`
`~
`~ w
`0
`0
`1---. 0
`u
`<(
`0
`N
`_J
`_J
`u.
`
`('i)
`L.()
`
`"'--
`~ w
`I-
`~ w
`>
`z
`0
`u
`0 ......
`w
`Cf)
`<( co
`
`-
`
`-
`
`Cf)
`w
`:J
`_J
`~
`0
`_w____.
`I u
`~
`:?: z
`
`:J
`
`Cf)
`_J
`
`....
`~ w
`
`1-
`_J
`
`-
`
`Cf)
`~
`~ w
`z
`....
`0
`Cf)
`~ w
`> z
`0 u
`
`~
`
`Cf)
`
`1-z
`w z
`....
`0 a_
`>< w
`0 ......
`..c
`
`Cf)
`<(
`Cf)
`Cf)
`
`....
`
`1-z
`
`<(
`~
`0 ......
`..c
`
`"-
`
`.
`(9
`u..
`
`"-
`
`co
`N
`
`co
`N
`
`"'
`
`co
`N
`
`"'
`
`~ w
`0
`0 u
`(9 z
`~
`I-
`Cf)
`0
`I-
`
`~ w
`0
`0 u
`>-
`~
`
`~
`<(
`0
`I-
`
`~ w
`0
`0 u
`>-
`~
`
`~
`<(
`0
`I-
`
`~ w
`0
`0 u
`>-
`~
`
`~
`<(
`0
`I-
`
`NetApp; Rackspace Exhibit 1005 Page 6
`
`
`
`U.S. Patent
`
`Jun.26,2001
`
`Sheet 6 of 8
`
`US 6,253,264 Bl
`
`SAMPLE DATA BASES
`
`i
`
`GENERATE PARSED FILES
`
`PARSED FILES
`
`!
`
`ANALYZE PARSED FILES
`
`ANALYSIS RESULT FILES
`
`1 r
`
`GENERATE PREDICTOR DATA
`
`POAT FILE
`
`i
`
`FIG. 6
`
`NetApp; Rackspace Exhibit 1005 Page 7
`
`
`
`U.S. Patent
`
`Jun.26,2001
`
`Sheet 7 of 8
`
`US 6,253,264 Bl
`
`~-----~ 25
`0RRAY TO BE CODEo/
`I
`ARRAY DATA
`•
`BUILD MODEL
`
`71
`
`-:::( 73
`@NCODED MODEL INF'?;
`
`70
`)
`
`MODEL
`
`ARRAY DATA
`
`SPLITTER
`
`77
`
`TWO OR MORE ARRAYS
`PROCESS SEPARATELY
`
`MATCHING
`ALGORITHMS
`
`81
`
`85
`OVERSIZE ARRAY
`HANDLER
`
`87
`
`UNMATCHED
`VALUES
`
`ARRAYS CONTAINING
`MATCH DATA
`
`83
`
`91
`
`NUMERIC
`ALGORITHMS
`OFFSETS FROM
`MODEL
`
`SPLITTER
`
`95
`
`TWO OR MORE ARRAYS
`~---------'PROCESS SEPARATELY
`
`REPEAT
`TRANSFORM
`
`99
`
`NO
`
`TO MID-LEVEL CODER
`
`100
`
`FIG. 7
`
`NetApp; Rackspace Exhibit 1005 Page 8
`
`
`
`U.S. Patent
`
`Jun. 26, 2001
`
`Sheet 8 of 8
`
`US 6,253,264 Bl
`
`101
`
`103
`
`RANGE
`RE MAPPER
`
`EXTRA BITS
`TO PACK
`
`RANGE CODES
`
`YES
`
`MTF CODER
`
`105
`
`109
`
`NO
`
`RANGE CODES TO
`LOW LEVEL CODER
`
`110
`
`FIG. 8
`
`NetApp; Rackspace Exhibit 1005 Page 9
`
`
`
`US 6,253,264 Bl
`
`1
`CODING NETWORK GROUPING DATA OF
`SAME DATA TYPE INTO BLOCKS USING
`FILE DATA STRUCTURE AND SELECTING
`COMPRESSION FOR INDIVIDUAL BLOCK
`BASE ON BLOCK DATA TYPE
`
`RELATED APPLICATIONS
`This application claims the benefit of U.S. Provisional
`Serial No. 60/036,548 filed Mar. 7, 1997, the teachings of
`which are incorporated herein by reference in their entirety.
`
`5
`
`15
`
`BACKGROUND OF THE INVENTION
`High performance data compression systems use models
`of the data to increase their ability to predict values, which
`in turn leads to greater compression. The best models can be
`achieved by building a compression system to support a
`specific data format. Instead of trying to deduce a crude
`model from the data within a specific file, a format-specific
`compression system can provide a precise pre-determined
`model. The model can take advantage not just of the file 20
`format structure, but also of statistical data gathered from
`sample databases.
`Previous efforts at format-specific compression have been
`focused on solutions to a few individual formats rather than
`on the development of a generalized method that could be 25
`adapted to many formats. The models which have been
`created typically involve a small number of components.
`This works adequately when most of the data is included in
`a few components, such as an image file having mostly red,
`blue, and green pixel values. But may formats are best 30
`modeled using a large number of components, and the
`previous systems are not designed to build or encode such
`models.
`
`2
`Upon this foundation are a numbers of tools, such as:
`a language to simplify the creation of parsing routines;
`tools to parse the source data using this method into
`separate components; and
`tools to generate models for individual components by
`automated analysis of sample data bases.
`These tools can be applied to the filters for a wide range of
`file and data types.
`The compression system preferably includes tools called
`10 customized array transforms for specific filters and for
`certain types of filters. These techniques handle a specific
`file type or certain data constructions used by several file
`types, such as encoding two dimensional arrays of data as
`found in many database formats.
`At the low-level of the system are preferably a number of
`mechanisms for encoding data arrays, including:
`new low-level encoding algorithms;
`methods for integrating a large number of transforms and
`encoding algorithms;
`methods for eliminating overhead so that small data
`blocks can be efficiently coded; and
`a new method for integrating both static models
`(determined from statistical analysis of sample
`databases) and dynamic models (adapted to the data
`within a particular array) into the encoding of each
`component.
`A preferred method of encoding source data comprising
`parsing the source data into a plurality of blocks. The parsed
`blocks typically have a different format than the source data
`format. In particular, similar data from the source data are
`collected and grouped into respective blocks.
`For each block, a compression algorithm is selected from
`a plurality of candidate compression algorithms and applied
`35 to the block. The compression algorithms can be determined
`based on the amount of data in the respective block.
`Furthermore, the compression algorithm can be adapted to
`the respective block, including the use of a customized
`transform. The selection of an algorithm can also be based
`40 on a compression model, which is derived from the format
`of the source data. The compressed data from the plurality
`of blocks are then combined into encoded data.
`The coding network can also include a decompression
`network to convert the encoded data back into the source
`data. First the data is decoded and then the parsing is
`reversed. In a lossless system, the resulting data is identical
`to the source data.
`Embodiments of the invention preferably take the form of
`machine executable instructions embedded in a machine(cid:173)
`readable format on a CD-ROM, floppy disk or hard disk, or
`another machine-readable distribution medium. These
`instructions are executed by one or more processing units to
`implement the compression and decompression networks.
`
`SUMMARY OF THE INVENTION
`A preferred coding system in accordance with the inven(cid:173)
`tion solves both of these problems in the prior art: it provides
`a generalized solution which can be adapted to handle a wide
`range of formats, and it effectively handles models that use
`large numbers of components. The system involves new
`approaches at many levels: from the highest (interface) level
`to the lowest (core encoding algorithms) level. The coding
`network includes a compression network for encoding data
`and a decompression network for decoding data.
`At the highest level, a preferred compression system in 45
`accordance with the invention uses an architecture called a
`Base-Filter-Resource (BFR) system. This approach inte(cid:173)
`grates the advantages of format-specific compression into a
`general-purpose compression tool serving a wide range of
`data formats. The system includes filters which each support 50
`a specific data format, such as for Excel XLS worksheets or
`Word DOC files. The base includes the system control
`modules and a library used by all the filters. The resources
`include routines which are used by more than one filter, but
`which are not part of the base. If a filter is installed which 55
`matches the format of the data to be encoded, the advantages
`of format-specific compression can be realized for that data.
`Otherwise, a "generic" filter is used which achieves perfor(cid:173)
`mance similar to other non-specific data compression sys(cid:173)
`tems (such as PKZip, Stacker, etc.).
`At the next level, the system preferably includes a method
`for parsing source data into individual components. The
`basic approach, called "structure flipping," provides a key to
`converting format information into compression models.
`Structure flipping reorganizes the information in a file so that 65
`similar components that are normally separated are grouped
`together.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`The foregoing and other objects, features and advantages
`of the invention will be apparent from the following more
`particular description of preferred embodiments of the
`invention, as illustrated in the accompanying drawings in
`60 which like reference characters refer to the same parts
`throughout the different views. The drawings are not nec(cid:173)
`essarily to scale, emphasis instead being placed upon illus(cid:173)
`trating the principles of the invention.
`FIG. 1 is a high-level block diagram of a typical data
`compression system.
`FIG. 2 is a block diagram of a preferred encoder using the
`BFR system in accordance with the invention.
`
`NetApp; Rackspace Exhibit 1005 Page 10
`
`
`
`US 6,253,264 Bl
`
`3
`FIG. 3 is a flow chart of an array coder.
`FIG. 4 is a block diagram of the string coder 40 of FIG.
`
`3.
`
`FIG. 5 is a block diagram of the float coder 50 of FIG. 3.
`FIG. 6 is a flowchart of a preferred automated analysis
`system.
`FIG. 7 is a flowchart overviewing the high-level of the
`integer coder.
`FIG. 8 is a flowchart overviewing the mid-level of the 10
`integer coder.
`
`5
`
`4
`invention. The encoder 3' is based on the use of a plurality
`of filters lOa, ... , lOx, ... , lOz which serve specific file
`formats. For example, one filter lOa might support several
`versions of the DEF database format, while another filter lOz
`might support several versions of the DOC format used by
`the Microsoft Word software program. The individual filters
`provide respective selection criteria 12 to a filter selection
`system 22.
`The filter selection system 22 receives the source data 2
`and checks the selection criteria 12a, ... , 12x, ... , 12z of
`all filters lOa, ... , lOx, ... , lOz installed in the system to
`see if any of them support the source data's format. If not,
`a "generic" filter is used which provides compression per(cid:173)
`formance similar to other generic compression systems,
`15 such as Lempel-Ziv (LZ) engines. In a particular preferred
`embodiment of the invention, the generic compression sys(cid:173)
`tem employs an SZIP engine as described by Mr. Schindler
`in U.S. application Ser. No. 08/970,220 filed Nov. 14, 1997,
`the teachings of which are incorporated herein by reference
`20 in their entirety. The descriptions of the network will pri(cid:173)
`marily cover the situations in which a filter to support the
`data format is successfully found.
`Briefly, a parsing system 24 parses the source data 2 into
`a large number of parsed Arrays 25, which each include all
`the instances of a particular component in the source file.
`The term "Array", described in further detail below, is
`capitalized to indicate that this refers to a type of structure
`specific to the network, as opposed to the usual use of the
`term "array". The algorithm used to parse the source data 2
`30 is called "Structure Flipping", an important aspect of the
`network, described in further detail below.
`The BFR system uses an automated analysis system (also
`described in detail below) to build a model for each Array
`25, which will be used when the Arrays 25 are encoded in
`an array coder 28. In some cases, better compression can be
`achieved if customized array transforms 26 are first used.
`While the models generated by the automated analysis
`system process each component Array 25 independently, the
`40 customized routines can take advantage of inter-component
`relationships.
`The array coder 28 encodes the Arrays 25 using a count(cid:173)
`dependent modeling system, which includes a mixture of
`static models (constant for all databases) and dynamic
`45 adjustment to each particular Array 25.
`The filter lOx has components to support each of the four
`sections of the encoder 3'. The selection criteria 12x, deter(cid:173)
`mined by the file format specification, includes information
`needed to recognize files served by the filter lOx such as byte
`values at the beginning of the file or file title suffices. Parsing
`instructions 14, also determined by the file format
`specification, include the information needed to parse the
`source file into parsed Arrays 25 having all the instances of
`a particular component in the source file. The custom
`55 component models 16 are constructed by referencing both
`the file format specification and sample databases. They can
`take advantage of inter-component relationships to trans(cid:173)
`form some of the component Arrays 25 to make them more
`compressible. Encoding parameters 18 are generated by an
`60 automated analysis system which examines component
`Arrays 25 parsed from a large number of sample databases.
`The encoding parameters 18 provide the data necessary to
`support the count-dependent modeling system in the array
`coder 28.
`65 FILTER SELECTION SYSTEM
`A preferred compression system uses an expandable
`architecture that allows users to support new or updated
`
`DETAILED DESCRIPTIONS OF PREFERRED
`EMBODIMENTS OF THE INVENTION
`
`FIG. 1 is a high-level block diagram of a typical data
`compression system. The primary components of the system
`are a sending application 1, an encoder 3, a decoder 5, and
`a receiving application 7.
`The sending application 1 provides source data 2 to be
`encoded by the encoder 3. The sending application can be a
`file archiver, a communication system, or any other appli(cid:173)
`cation with a need to represent data using fewer bytes than
`in the native format.
`The encoder 3 receives the source data 2 and uses data 25
`compression algorithms to represent the data using fewer
`bytes. The typical implementation of the encoder 3 is as
`software running on a general purpose computer, although
`the algorithms to be described may also be implemented on
`specialized hardware.
`The output of the encoder is encoded data 4, which can be
`stored in memory (in which case the encoding allows more
`source data to be stored within the same hardware), or
`transmitted to the decoder 5 (in which case the encoding the
`source data to be transmitted faster when the transmission 35
`channel bandwidth is limited).
`The decoder 5 retrieves or receives the encoded data 4 and
`reverses the algorithms used to encode the data so as to
`reconstruct the source data as decoded data 6. In a lossless
`system, the decoded data 6 is identical to the source data.
`Depending on the application, however, some data loss may
`be acceptable. As with the encoder 3, the decoder 5 is
`typically embodied as software running on a general purpose
`computer, but can be implemented in specialized hardware.
`The receiving application 7 receives the decoded data 6
`for processing.
`A preferred compression engine includes filter-based
`encoders and decoders which can be utilized by a large
`number of sending applications 1 and receiving applications
`7. While the applications 1, 7 that store/retrieve or transmit/
`receive the encoded data are not part of the compression
`system, they may adjust encoding and decoding operations
`according to system conditions such as available transmis(cid:173)
`sion channel bandwidth.
`The descriptions to follow will mostly cover the encoder.
`The decoder just reverses the encoder's actions, so a descrip(cid:173)
`tion of the encoder is sufficient to describe the overall
`encoding/decoding system. The term "File" will be used to
`describe a block of source data, which matches the conven(cid:173)
`tional usage, but which should also be understood to be
`applicable to many other types of data blocks such as a data
`exchange in a Client-Server application.
`
`50
`
`ENCODER
`
`FIG. 2 is a block diagram of a preferred encoder using a
`Base-Filter-Resource (BFR) network in accordance with the
`
`NetApp; Rackspace Exhibit 1005 Page 11
`
`
`
`US 6,253,264 Bl
`
`6
`
`Rec
`
`Row
`
`Col
`
`Index
`
`Num Val
`
`Ox0001
`Ox0001
`Ox0001
`Ox0001
`Ox0001
`
`2
`3
`4
`5
`
`Ox0002
`Ox0003
`Ox0004
`Ox0006
`Ox0007
`
`Ox0055
`Ox005f
`Ox0053
`Ox005f
`Ox005f
`
`Ox3ffle100
`Ox3ffla300
`Ox3ffld000
`Ox3fflc400
`Ox3ffld300
`
`10
`
`5
`formats by adding or replacing filters 10. As noted above, a
`filter 10 can support more than one format. For example, a
`filter called "LOTUS123v4" might support all file formats
`associated with the Lotus 123 program, such as the current
`WK4 format and the earlier WK3 and FM3 formats. If a 5
`Filter was later developed for a new Lotus WKS format, a
`user replaces the old filter with one supporting the new
`format and which would also be backwardly compatible
`with the earlier filter with respect to the older formats.
`In the Microsoft Windows environment, each Filter is a
`separate dynamically linked library (DLL) file. In addition
`to data tables, a filter includes executable code.
`If a filter is found for a file, then it is dynamically linked
`and then called to encode the file. The encoded data stream 15
`includes an identification (ID) code indicating which filter
`was used to encode the data, and the decoder has to have a
`corresponding decoding filter.
`To implement the filter selection system 22, the base
`module maintains a registry of the currently installed files,
`which includes the data need to identify source files of that
`type as well as the path to the DLL. This registry is updated
`each time a filter is added or replaced. The identification data
`can include both a list of file title suffices and a method to
`identify files via byte values near the beginning of the file.
`The identification procedure uses a search tree method to
`reduce the time needed to check the list when many filters
`are installed. From this registry, the appropriate filter 10 is
`invoked for the file type.
`To handle compound file formats such as Microsoft's
`OLE2, one filter can invoke another filter to encode a
`sub-file or stream within the compound file. For example, a
`Winword Filter might call an Excel Filter to encode some
`Excel data which was embedded within a compound docu(cid:173)
`ment file.
`PARSING SYSTEM
`As noted above, the parsing system 24 creates a large
`number of parsed Arrays 25. Each parsed Array 25 includes
`all instances of a particular component in a file. A separate
`component is created for each member of each defined
`structure type.
`
`The source data would appear almost random to a byte(cid:173)
`matching algorithm (shown in hex):
`01 00 02 00 SS 00 00 el fl 3f01 00 03 00 Sf 00 00 a3 fl
`3f 01 00 04 00 S3 00 00 dO fl 3f
`01 00 06 00 Sf 00 00 c4 el fl 3f 01 00 07 00 Sf 00 00 d3
`fl 3f
`But when handled as four separate Arrays 25 (Row, Col,
`Index, and Num) using algorithms optimized for each
`component, the data can compress significantly. This
`approach is referred to herein as "structure flipping" because
`of the way it rearranges the data.
`To encode together the instances of the same component
`in a database is a common compression technique, and the
`file format is often used to identify such components. But in
`the prior art, this approach has been limited to isolating a few
`major components of a specific file format, and the file
`format description used only as a technique of finding these
`major components. In accordance with preferred embodi(cid:173)
`ments of the invention, a file format is actually converted
`30 into a compression model, which is then interfaced into an
`automated system for analyzing the data and optimizing the
`coding performance of each component. While this com(cid:173)
`pression model is usually enhanced by the customized array
`transforms 26, it is still an aspect of the overall system.
`The extent to which preferred approaches differ from
`prior art systems is reflected in the new tools constructed in
`order to implement it. This approach creates a large number
`of Arrays 25. Each Array 25 requires a different compression
`model, and the number of entries in each can change
`40 significantly from file to file. In order to work effectively in
`such a case, preferred embodiments of the invention employ
`new approaches in how the low-level coding is handled such
`as the elimination of overhead, count-dependent processing,
`and the use of a processing network instead of fixed coding
`algorithms. These techniques, described above, are preferred
`in order to handle the output of the parsing system 24.
`All of these implementation features are integrated at a
`high level, so that high-level calls can handle all of the
`operations needed to initialize the system, parse the data,
`and encode the Arrays.
`In accordance with a particular preferred embodiment of
`the invention, the file parsing system 24 uses three sub-
`systems:
`
`20
`
`25
`
`35
`
`EXAMPLE 1
`For example, consider the following description of a record
`defined in a file format description.
`Cell Value:
`The format uses a special method of coding numbers in an
`internal number type, called an RK number, to save
`memory and disk space.
`Record Data:
`
`45
`
`50
`
`Offset
`
`Name
`
`Size
`
`Contents
`
`4
`
`8
`
`rw
`col
`fmt
`
`10
`
`num
`
`2
`2
`2
`
`4
`
`Row number
`Column number
`Index to the record that includes
`the cell format
`the coded number
`
`The parsing system 24 creates parsed Arrays 25 for each
`of the four components of this record. Each time an RK
`record is found, an entry is added to each array. The parsed
`Arrays 25 are much more compressible than the raw source
`data, as shown by data from five records:
`
`55
`
`60
`
`FILE_BUFFER
`ARRAYS
`PVSTRUCT
`
`Input\Output Interface
`To store the data for a single component
`To parse the input into Arrays or to write the decoded
`Arrays into the Output stream
`
`A plurality of FILE_BUFFER routines provide a flexible
`input/output interface using FILE_BUFFER structures, so
`the same set of parsing routines can be used regardless of the
`input/output formats. This allows the compression system to
`65 be independent of the 1/0 format. The file parsing system
`preferably supports two file formats: DOS and OLE2, but
`this can expanded to other file formats as well as to inter-
`
`NetApp; Rackspace Exhibit 1005 Page 12
`
`
`
`US 6,253,264 Bl
`
`8
`in the body of the record. The file specification then
`describes the contents of each record. A specification for a
`TABLE SIZE record might be:
`
`RecType=Ox402
`Rec Body Size=12
`Record Data:
`
`Offset Name
`
`Size Contents
`
`4 MinRow
`MaxRow
`8 Min Col
`10 MaxCol
`12
`(Reserved)
`
`First defined row on the sheet
`2
`2 Last defined row on the sheet, plus 1
`2
`First defined column on the sheet
`2 Last defined column on the sheet, plus 1
`4 Reserved
`
`There can be more than one such record in the data. When
`there is more than one record, the values of the same
`component in each record are going to be similar to each
`other. This enables the compression system to take advan(cid:173)
`tage of the similarities to increase the compression ratio.
`The compression system is preferably based on the
`ARRAY _XX structures, including:
`
`ARRAY_ US:
`
`ARRAY_ 16:
`
`ARRAY_ 32:
`
`ARRAY_ F32:
`ARRAY_ F64
`ARRAY_ FSO:
`ARRAY_ STR:
`
`for 8-bit data items, treated by the encoding system as
`being unsigned
`for 16-bit data items, treated by the encoding system as
`being signed
`for 32-bit data items, treated by the encoding system as
`being signed
`for 32-bit floats
`for 64-bit floats ("doubles")
`for 80-bit floats ("long doubles")
`each entry includes several bytes whose count is given
`by a size (string size).
`
`5
`
`10
`
`15
`
`30
`
`35
`
`7
`faces to communication ports and other 1/0 options. For
`DOS files, the logical stream positions are usually the same
`as the raw file positions. The OLE2 format requires an
`extensive layer of code to follow an OLE2 stream within the
`container file.
`A plurality of ARRAY routines provide a system for
`managing the process of initializing, adding data to,
`encoding, and freeing arrays of data. Each ARRAY structure
`includes one array of data and the information needed to
`manage these processes, such as the number of entries in the
`Array 25, the size of the buffer currently allocated for the
`array, and information which will be useful in encoding the
`array.
`In particular, ARRAY _XX structure stores all the data
`from a component, and includes mechanisms to support fast
`sequential read/write access to the data, an expandable
`buffer for the data, and storage of model data. This structure
`also allows a simple set of tools to handle all allocation,
`parsing, and coding functions. The _XX nomenclature
`indicates that slightly different ARRAY structures are 20
`defined for different data types, such as different integer
`word sizes, strings, and floating point types, although many
`of the processing routines work on all of the variations.
`A plurality of PVSTRUCT routines make up an auto(cid:173)
`mated system for parsing the file data. During encoding, 25
`they read data from the FILE_BUFFER structure and write
`data into Arrays 25. During decoding, they take data from
`the Arrays 25 and write it into the FILE_BUFFER struc(cid:173)
`tures. These routines can take a high-level description of a
`file format and handle all the tasks associated with managing
`the parsing as well as the encoding of the parsed data. Many
`of the Filters 10 will also include customized parsing
`routines which call the FILE_BUFFER and ARRAY rou(cid:173)
`tines directly instead of using the PVSTRUCT routines. But
`an aspect of the parsing system 24 is the PVSTRUCT
`routines, and this discussion will be concerned primarily
`with their operation.
`A PVSTRUCT structure includes all information needed
`to parse and code a data structure defined in a file format.
`This usually involves multiple components, so it provides
`and handles individual Arrays 25 for each component. The
`term "pvstruct" comes from Predictor Variable Structure,
`where Predictor refers to the use of statistical data gained
`from analyzing a large number of sample files to assist in the
`encoding of the data, and Variable indicates that it supports
`structures where the component counts can vary at run time
`for each instance (called "dynamic structures" in some
`textbooks).
`A parsing language is a defined instruction set which
`provides a concise way to specify the parsing of a
`PVSTRUCT structure. The instruction set includes support
`for a wide range of dynamic variation, including where the
`counts of components are dependent on file counts of other
`components as well as on external factors. It incorporates
`other file format data, such as defined values or range
`restrictions. It also provides a means of exchanging data
`between the parser and the filter.
`The compression system is based on parsing a file so as
`to place items together that are similar. In the following
`description of the compression system, examples illustrating 60
`some structures which might be found in a typical spread(cid:173)
`sheet file format.
`
`For the TABLE SIZE record, five arrays would be created
`(referring to the record specification). Each time a record is
`40 parsed, an entry is added into each of the five arrays.
`When all the records had been parsed, a function is
`invoked to encode the items in each array. This function
`includes a large number of algorithms, which can exploit a
`wide variety of relationships to achieve maximum compres-
`45 sion. The implementation is extremely efficient in that it
`requires virtually no overhead: an array with only one entry
`can even be efficiently compressed. This is necessary
`because the parsing system can transform a file into a huge
`number of these small arrays. The encoding