`
`(19) World Intellectual Property Organization
`International Bureau
`
`(43) International Publication Date
`12 July 2001 (12.07.2001)
`
`
`
`PCT
`
`(10) International Publication Number
`WO 01/50325 A2
`
`(51) International Patent Classification7:
`
`G06F 17/00
`
`(21) International Application Number:
`
`PCT/US00/35672
`
`(22) International Filing Date:
`29 December 2000 (29.12.2000)
`
`(81) Designated States (national): AE, AG, AL, AM, AT, AU,
`AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CR, CU, CZ,
`DE, DK, DM, DZ, EE, ES, FI, GB, GD, GE, GH, GM, HR,
`HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC, LK, LR,
`LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, MZ,
`NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI, SK, SL, TJ, TM,
`TR, TT, TZ, UA, UG, UZ, VN, YU, ZA, ZW.
`
`(25) Filing Language:
`
`(26) Publication Language:
`
`English
`
`English
`
`(30) Priority Data:
`60/174,305
`
`3 January 2000 (03.01.2000)
`
`US
`
`(71) Applicant: EFECKTA TECHNOLOGIES CORPO-
`RATION [US/US]; 1370 Bob Adams Drive, Steamboat
`Springs, CA 80487 (US).
`
`(72) Inventors: AVERY, Caleb; —. TOBELMANN, Ralph; —.
`
`(84) Designated States (regional): ARIPO patent (GH, GM,
`KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZW), Eurasian
`patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European
`patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE,
`IT, LU, MC, NL, PT, SE, TR), OAPI patent (BF, BJ, CF,
`CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG).
`
`Published:
`
`Without international search report and to be republished
`upon receipt of that report.
`
`(74) Agents: SACHS, Robert et al.; Fenwick & West LLP, Two
`Palo Alto Square, Palo Alto, CA 94306 (US).
`
`For two-letter codes and other abbreviations, refer to the "Guid-
`ance Notes on Codes andAbbreviations ” appearing at the begin-
`ning ofeach regular issue ofthe PCT Gazette.
`
`(54) Title: EFFICIENT AND LOSSLESS CONVERSION FOR TRANSMISSION OR STORAGE OF DATA
`
`102
`
` Binary Data
`
`Segment Data
`
`Select Transform For
`Segment
`
`104
`
`106
`
`WO01/50325A2
`
`108
`
`110
`
`Package Each
`Transform
`
`
`
`Store/Transmit
`
`(57) Abstract: A system and method for lossless data compression. A mathematical transform equivalent to the content value of
`the data, and taking fewer bits to represent, is found.
`
`NetApp; Rackspace
`
`Exhibit 1019
`
`Page 1
`
`NetApp; Rackspace Exhibit 1019 Page 1
`
`
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`EFFICIENT AND LOSSLESS CONVERSION FOR TRANSMISSION OR
`
`STORAGE OF DATA
`
`This application claims priority under 35 U.S.C. §119(e)
`
`from US. Patent
`
`Application serial No. 60/174,305, filed January 3, 2000, which is incorporated by reference
`
`in its entirety.
`
`BACKGROUND
`
`Field of the Invention
`
`This invention relates to data transformation, more particularly, to lossless data
`
`compression.
`
`Background of the Invention
`
`Existing compression technology focuses on finding and removing redundancy in
`
`the input binary data. Early compression approaches focused on the format of the data.
`
`These format approaches utilize run—length encoding (RLE) and variations of frequency
`
`mapping methods. These pattern-coding approaches work well for ASCII character data,
`
`but never reached the compression potential for other data formats.
`
`Advances in compression technology evolved from information theory, particularly
`
`Claude Shannon’s work on information entropy. The bulk of this work is statistical in
`
`nature.
`
`Shannon-Fano and Huffman encoding build probability trees of symbols in
`
`descending order of their occurrence in the source data, allowing the generation of “good”
`
`variable—size codes.
`
`This is often referred to as entropy coding.
`
`Compression is
`
`accomplished because more frequently occurring binary patterns are assigned shorter codes,
`
`allowing for a reduction in the overall average of bits required for a message.
`
`Shannon-Fano and Huffman encoding are optimal only when the probability of a
`
`pattern’s occurrence is a negative power of 2. These methods engendered a number of
`
`adaptive versions that optimize the probability trees as the data varies.
`
`10
`
`15
`
`20
`
`25
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page2
`
`NetApp; Rackspace Exhibit 1019 Page 2
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`Arithmetic Coding overcame the negative power of 2 probabilities problem by
`
`assigning one (normally long) code to the entire data. This method reads the data, symbol by
`
`symbol, and appends bits to the output code each time more patterns are recognized.
`
`The need for more efficiency in text encoding led to the development and evolution of
`
`dictionary encoding, typified by LZ family of algorithms developed by J. Ziv and A. Lempel.
`
`These methods spawned numerous variations.
`
`In these methods, strings of symbols (a
`
`dictionary) are built up as they are encountered, and then coded as tokens. Output is then a
`
`mix of an index and raw data.
`
`10
`
`15
`
`As with entropy coding, dictionary methods can be static, or adaptive. Variants of the
`
`LZ family make use of different techniques to optimize the dictionary and its index. These
`
`techniques include:' search buffers, look-ahead buffers, history buffers, sliding windows, hash
`
`tables, pointers, and circular queues. These techniques serve to reduce the bloat of seldom-
`
`used dictionary entries. The popularity of these methods is due to their simplicity, speed,
`
`reasonable compression rates, and low memory requirements.
`
`Different types of information tend to create specific binary patterns. Redundancy or
`
`entropy compression methods are directly dependent upon symbolic data, and the inherent
`
`patterns that can be recognized, mapped, and reduced. As a result, different methods must be
`
`optimized for different types of information. The compression is as efficient as the method of
`
`modeling the underlying data. However, there are limits to the structures that can be mapped
`
`20
`
`and reduced.
`
`The redundancy-based methodologies are limited in application and/or performance.
`
`In general, entropy coding either compromises speed or compression when addressing the
`limited redundancy that can be efficiently removed. Typically, these methods have very 1c; (
`
`compression gain. The primary advantage is that entropy coding can be implemented to
`
`25
`
`remain lossless.
`
`Lossy compression can often be applied to diffuse data such as data representing
`
`speech, audio,
`
`image, and video. Lossy compression implies that the data cannot be
`
`reconstructed exactly. Certain applications can afford to lose data during compression and
`
`reconstitution because of the limitations of human auditory and visual systems in interpreting
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page3
`
`NetApp; Rackspace Exhibit 1019 Page 3
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`the information. Perceptual coding techniques are used to exploit these limitations of the
`
`human eyes and ears. A perceptual coding model followed by entropy encoding, which uses
`
`one of the previously discussed techniques, produces effective compression. However, a
`
`unique model (and entropy coder) is needed for each type of data because the requirements
`
`are so different. Further, the lossy nature of such compression techniques mean the results
`
`lose some fidelity, at times noticable, from the original, and make them unsuitable for many
`
`purposes.
`
`Thus, a method for compression that is both lossless and capable of high compression
`
`gain is needed.
`
`10
`
`15
`
`20
`
`25
`
`SUMMARY OF THE INVENTION
`
`The present invention compresses binary data. The data is split into segments. Each
`
`of these segments has a numerical value. A transform, along with state information for that
`
`transform, is selected for each segment. The numerical value of the transform with its state
`
`information is equal to the numerical value of the segment. The transform, state information
`
`and packet overhead are packaged into a transform packet. The bit—length of the transform
`
`packet is compared to the bit-length of a segment packet that includes the raw segment and
`
`any necessary packet overhead. The packet with the smaller bit-length is chosen and stored
`
`or transmitted. After reception of the packets, or retrieval of the packets from storage, the
`
`numerical value of each segment is recalculated from the transform and state information, if
`
`necessary. The segments are recombined to reconstitute the original binary data.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`Fig. 1 is a flow chart showing an overview of the compression process.
`
`Fig. 2 is a flow chart showing an overview of the recovery process.
`
`Fig. 3 is a block diagram of the compression system and illustrates the general flow of
`
`data through the compression system.
`
`Fig. 4 is a flow chart of the pre-processor.
`
`Fig. 5 is a flow chart of the transform engine.
`
`Fig. 6 is a flow chart of a transform process.
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page4
`
`NetApp; Rackspace Exhibit 1019 Page 4
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`Fig. 7 illustrates finding a set of favorable state information without testing every
`
`possible set of state information.
`
`Fig. 8 illustrates partial solutions Within a segment.
`
`Detailed Description of the Preferred Embodiments
`
`Fig. 1 is a flow chart showmg an overview of the compression process 100. The
`
`initial input 102 is binary data. Any piece of binary data is simply a number expressed in
`
`binary form. Thus, any piece of binary data has a numerical value. This numerical value is
`
`the data’s “content value.”
`
`The input binary data 102 is split 104 into segments, if necessary. The input binary
`
`data may be short enough that it is not further split. Each segment has a content value. The
`
`content value of each segment is identified and a transform along with appropriate state
`
`information is selected and tested 106 for each segment. A general transform is capable of
`
`representing many values. The state information provides the information necessary to
`
`specify an exact value for the transform. The term “state information” includes any variables,
`
`coefficients, remainders, or any other information necessary to set the specific numerical
`
`value for the transform. In some embodiments, “packet overhea ,” is added to the transform
`
`and state information. The packet overhead includes any information beyond the transform
`
`and state information needed to allow later recalculation of the original segment and
`
`reconstitution of the original input binary data.
`
`The transform With its state information has the identical numerical value as the
`
`corresponding segment. The following equation represents the transform concept:
`
`M=T(State Information),
`
`where M is the content value of the segment, and T is the transform. The transform is an
`
`arithmetic transform, a logical transform, or another mathematical transform.
`
`The transform with its state information and packet overhead has a representation
`
`efficiency gain (“REG”). The REG is a measurement of the efficiency of the transform, state
`
`information, and any packet overhead. The REG is defined as the ratio of LogzM/ LogZT,
`
`where LogzM is the number of binary bits required to represent M, and Long is the number of
`
`binary bits required to decodably represent the transform, state information, and packet
`
`10
`
`15
`
`20
`
`25
`
`30
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page5
`
`NetApp; Rackspace Exhibit 1019 Page 5
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`overhead. Thus, if the REG value is greater than one, the transform plus state information
`
`and packet overhead takes up fewer bits than the segment.
`
`For example, a 9-bit size message is capable of representing the transform ab: , where
`
`each of a, b, and c is a 3-bit message. In this case, abc is the transform, and a, b, and c are
`
`each a variable of the transform. The values of a, b, and c make up the state information.
`
`Both the transform and the state information are necessary to make an expression whose
`
`numerical value is equal to the content value of the segment.
`
`With the a”c example above, nine bits are capable of representing an integer over
`
`700,000 decimal digits long, equivalent to more than 2.3 million binary bits. As an example,
`
`10
`
`this transform can represent the content value of 150,094,635,296,999,121, or 362 using only
`
`nine bits. This number would occupy fifty—eight bits if conventionally represented in binary
`
`form. Thus, by using this transform, the 9-bit message transmits 58 bits of content value.
`
`The 58-bit data segment has been greatly compressed.
`
`Each transform and accompanying state information are next packaged 108 into a
`
`packet. The packet includes additional packet overhead that provides any other information
`
`necessary to allow later unpackaging and recovery of the content value of the packet. This
`
`packet overhead can include information that identifies the segment, identifies the transform,
`
`or any other necessary information. The packet typically takes fewer bits to express than the
`
`original segment. Thus, the segment has been compressed.
`
`Each packet is then stored or transmitted 110. Because the packet is typically smaller
`
`than the segment, storing the packet takes up less space than the segment, and transmitting
`
`the packet takes less time than transmitting the segment.
`
`Fig. 2 is a flow chart showing an overview of the recovery process 200, in which the
`
`original data is recovered from packets. The packet is received 202 if it has been transmitted,
`
`or retrieved fi‘om storage if it has been stored. Next, the packet coding is decoded 204 to
`
`determine the identity of the transform and how to use the state information to recover the
`
`original segment. Using the decoded information, the original segment’s content value is
`
`recalculated 206. Finally, all recomputed segments are put together again in their original
`
`order to reconstitute 208 the original input binary data. Thus, the output 210 of the recovery
`
`process 200 is identical to the input 102 of the compression process.
`
`15
`
`20
`
`25
`
`30
`
`5
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page6
`
`NetApp; Rackspace Exhibit 1019 Page 6
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`System Overview
`
`Fig. 3 is a block diagram of the compression system 300 and illustrates the general
`
`’ flow of data through the compression system 300. The compression system 300 performs the
`
`processes of Figs. 1 and 2. The flow through the compression system 300 starts with the
`
`input of a binary string 302 that is to be compressed.
`
`The control program 306 serves as the process controller for all compression system
`
`300 functions. The control program 306 monitors other compression system 300 components
`
`and tracks all information related to the input binary string throughout processing. The
`
`control program 306 also interacts with and terminates any process according to time, trial, or
`
`10
`
`other parameters.
`
`The pre-processor 304 takes the input binary string 302 and splits the binary string
`
`302 into segments. The length of each segment is appropriate to a given data type, transforms
`
`likely to be used, processor capacity, and application parameters. In some situations, the pre—
`
`processor 304 also mutates the segments in order to give more favorable variations of a
`
`15
`
`segment to the transform engine 308.
`
`The transform engine 308 receives the segments from the pre-processor 304 and
`
`computes permutations of transforms and state information, each permutation equivalent to
`
`the content value of the segment. The size of each permutation, including associated packet
`
`overhead, is analyzed, and the most efficient combination of transform, state information, and
`
`packet overhead is selected. If the permutations show no data compression, the segment is
`
`sent unmodified to the packager 310 as raw data.
`
`The packager 310 receives the output of the transform engine 308. For each segment,
`
`the packager receives from the control program 306 all relevant information, including state
`
`information, as to how the segment was transformed. From the transform, state information,
`
`and other associated information, the packager 310 creates a packet. Thus, the packet
`
`includes the transform, state information, and packet overhead information that together
`
`allows later decoding of the packet.
`
`20
`
`25
`
`At this point, the input binary string 302 has been segmented, transformed, and
`
`packaged into packets. Together, the packets are smaller than the original input binary string
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page7
`
`NetApp; Rackspace Exhibit 1019 Page 7
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`302. These packets can be either stored or transmitted through established storage and
`
`transmission protocols 312. Because the packets are smaller than the original input binary
`
`string, they have the advantage of taking less storage space and being transmitted more
`
`quickly than the original binary data.
`
`The unpackager 314 receives the packet if the packet has been transmitted, or retrieves
`
`the packet from storage if the packet has been stored. The unpackager 314 interprets the
`
`packet overhead instructions, unpacks the packet components and associated information for
`
`use by the decoder 316, and unpacks any unmodified data as raw data segments.
`
`The decoder 316 receives the unpackaged information from the unpackager 314 and
`
`recalculates the content value of the original segment. The decoder 316 applies the state
`
`information to the appropriate transform(s) to recalculate the content value of the segment. If
`the data was mutated, the process is reversed for that given segment to recover the original
`
`format of the segment.
`
`The reconstitutor 318 receives the segments from the decoder 316 and concatenates
`
`the segments in sequence to reconstitute the original binary string. Based on application
`
`parameters, this can be done to an entire file or streamed as needed by the application. The
`
`reconstitutor 318 then outputs the binary string 320. The binary string 320 output by the
`
`reconstitutor 318 is identical to the input binary string. Thus, the compression system 300
`
`provides lossless compression.
`
`Control Program
`
`10
`
`15
`
`20
`
`The control program 306 tracks data related to each data segment and transformation
`
`of the segment, as well as information regarding the pre—processor 304, the transform engine
`
`308, and all other compression system 300 functions. This information can include, but is not
`
`25
`
`limited to:
`
`l. The identification of each segment, including information providing the
`
`location (of the segment within the original input binary string
`
`2. The size of each segment
`
`3. The data type of each segment
`
`30
`
`4. Computational power of the transform engine 308
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page8
`
`NetApp; Rackspace Exhibit 1019 Page 8
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`Computational power of the decoder 316
`
`Application requirements, such as time constraints for real-time streaming
`
`applications
`
`Data type requirements
`
`Whether the segment has been mutated, and the technique used to mutate the
`
`segment
`
`Time spent on the segment by the pre-processor 304
`
`Computational cycles spent on the segment by the pre-processor 304
`
`Time spent on the segment by the transform engine 308
`
`Computational cycles spent on the segment by the transform engine 308
`
`The identity of the transform or transforms used for the segment
`
`Any Combinatorial Optimization Heuristics used to select state information
`
`variables
`
`Tracking information related to any heuristic used
`
`Successful transform(s)
`
`State information of successful transform(s)
`
`Bit—length of transform reference(s) and associated state information for
`
`successful transform(s)
`
`Bit—length of associated packet overhead for successful transform(s)
`
`Partial solution transform(s)
`
`State information of partial solutions (including offsets)
`
`Bit—length of transform references and associated state information for partial
`
`solutions
`
`Bit-length ofpacket overhead for transforms used by partial solutions
`
`Finite State Machine(s) used
`
`Finite State Machine information, such as reference data and Finite State
`
`Machine tree data
`
`de Bruijn Sequence starting point and log of index
`
`3-D Graph Tree tracking information
`
`N—space curve data
`
`9.
`
`10.
`
`ll.
`
`12.
`
`13.
`
`14.
`
`15.
`
`16.
`
`17.
`
`18.
`
`19.
`
`20.
`
`21.
`
`22.
`
`23.
`
`24.
`
`25.
`
`26.
`
`27.
`
`28.
`
`29.
`
`“BOTS” used
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page9
`
`10
`
`15
`
`20
`
`25
`
`30
`
`NetApp; Rackspace Exhibit 1019 Page 9
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`30. BOTS information, such as content value locations of BOTS and deltas from
`
`BOTS to segment content value
`
`The process monitor (not shown) is a major subpart of the control program 306. The
`
`process monitor monitors the pre—processor 304 and the transform engine 308. The process
`
`monitor has parameters that constrain the operations of the pre—processor 304 and transform
`
`engine 308. These constraint parameters can include a target REG, the time spent processing,
`
`the number of computational cycles spent processing, or other parameters. A target REG is a
`
`REG value that is sufficiently high so that the compression system 300 stops searching for
`
`higher REG transforms when a combination of transform, state information, and packet
`
`overhead with the target REG is found. The constraint parameters for time spent processing
`
`and computational cycles spent processing ensure that the compression system 300 does not
`
`spend an indefinite amount of time and resources searching for the best combination of
`
`transform and state information. The parameters are preset or changed by the control
`
`program 306. The parameters are changed if the data type changes, if the computational
`
`resources change, if the application changes, or by human or other external intervention.
`
`If the pre—processor 304 or the transform engine 308 exceeds a constraint parameter,
`
`the process monitor signals the control program 306 to terminate pre—processor 304 or
`
`transform engine 308 operations. When pre—processor 304 operations are terminated, the
`
`binary string is split into segments. The length of the segments are the best approximation of
`
`optimum length thus far determined by the pre-processor 304. When transform engine 308
`
`operations are terminated, the transform engine 308 will output the current best transform and
`
`state information, or, if no transforms compress the data, send the segment as raw data.
`
`Pre—Processor
`
`Fig. 4 is a flow chart of the pre~processor 304. First, the pre—processor 304 analyzes
`
`400 the data type of the binary string 302.
`
`In analyzing the data type, the pre—processor 304
`
`attempts to identify or characterize the data type of the binary string 302. The data type is
`
`used to help determine which segment size, which transform, and which state information for
`
`the transform, which will be used to test and transform the binary data. The pre—processor
`
`304 characterizes the binary string 302 by either knowing which application generated the
`
`binary string 302 or by analyzing the data itself. One method for analyzing the data to
`
`determine the data type is comparing the binary string 302 to an existing database of data
`
`10
`
`15
`
`20
`
`25
`
`30
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page10
`
`NetApp; Rackspace Exhibit 1019 Page 10
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`types. One method for generating such a database is by sampling information generated by
`
`known digital applications.
`
`If the pre-processor 304 characterizes the binary string 302, the control program 306
`
`stores the data type associated with that string. If the pro-processor 304 cannot characterize
`
`the binary string 302, the control program 306 keeps a record that the binary string’s 302 data
`
`type is unknown.
`
`The pre-processor 304 next determines 402 whether the size of the input binary string
`
`302 is greater than a minimum optimum processing block size. The minimum optimum
`
`processing size is different for different data types. The control program 306 stores each
`
`known minimum optimum processing size associated with the corresponding data type.
`
`If
`
`the pre—processor 304 has characterized the data type and the minimum optimum processing
`
`size for that data type is known, the pro-processor 304 simply compares the input binary
`
`string 302 to the stored minimum optimum processing size.
`
`If the data type is unknown, or an optimum size for a data type is unknown, the binary
`
`string 302 is split 404 into segments of a size that has worked well previously with many
`
`types of data. Alternatively, the binary string 302 is initially split 404 into segments of
`
`differing bit—lengths. As these segments of differing bit-lengths are processed by the
`
`compression system 300, the control program 306 keeps a record of which segment sizes are
`
`most easily processed. By trying different sizes, and tracking which sizes work best, the
`
`control program 306 builds a record of an optimum processing size for each data type, so that
`
`the optimum processing size for that type of data becomes known. If the control program 306
`
`develops a record of an optimum processing size for a data type, that segment size can be
`
`incorporated for future use.
`
`If the input binary string 302 is larger than the minimum optimum processing block
`
`size, the binary string 302 is split 404 into segments. The bit-length of the segments is
`
`determined by several factors.
`
`The computational power of the transform engine 308 affects the bit-length of the
`
`segments. If the transform engine 308 has little computational power available, the segments
`
`must be shorter than if the transform engine 308 has a great deal of computational power.
`
`Similarly, if the computational power of the decoder 316 is lmown, it will influence the bit-
`
`length of the segments. The more computational power the decoder 316 has, the greater the
`
`bit-length of the segment can be.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`10
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page11
`
`NetApp; Rackspace Exhibit 1019 Page 11
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`a“
`i/The application using the coinpression system 300 also affects the bit-lengths of the
`
`segments. For instance, applications such as video conferencing, which require human-
`
`perceptible real-time encoding, require shorter segments than those of off—line archiving
`
`where larger segments can be transformed and encoded over longer timeframes and more
`
`processing cycles.
`
`The data type also affects the bit-lengths of the segments. For instance, audio data
`
`may lend itself to different segment sizes than ASCII data. Another factor related to data type
`
`is what transforms will likely be used with the data type. The transforms used with a
`
`particular data type may perform better with a specific segment length. The control program
`
`306 monitors the transform engine 308 and stores which transforms work well with specific
`
`data types, which segments sizes work well with those transforms, and which segment sizes
`
`can be incorporated for future use.
`
`Thus, when the pre-processor 304 successfully
`
`characterizes a data type, the control program 306 is capable of retrieving information as to
`
`the optimum bit-length of the segment.
`
`If the input binary string 302 is equal to or smaller than the minimum optimum
`
`processing block size, the binary string 302 is treated as a single segment. Each segment,
`
`whether it is a portion of the split binary string 302, or the entire binary string, is treated the
`
`same throughout the rest of the compression system 300.
`
`The segment size is output 406 to the control program 306. The control program 306
`
`stores the segment size and uses it to compare to the size of possible combinations of
`
`transform, state information and packet overhead.
`
`The control program 306 assigns to each segment specific identification information
`
`so that the original binary data may be reconstituted later, and tracks each of the segments,
`
`along with the identification information, associated data type, settings for the process
`
`monitor, and any other associated data.
`
`Each of the segments may next be mutated 408. The segment is mutated 408 if the
`
`control program 306 has stored information indicating that previously, mutated segments of
`
`that data type and segment size resulted in successful transforms, or that unmutated segments
`
`of that data type and segment size did not result in successful transforms.
`
`If the control
`
`program 306 has no information stored that indicate that a mutated segment would result in
`
`successfiil transforms, the segment is not mutated. However, the segment may subsequently
`
`10
`
`15
`
`20
`
`25
`
`30
`
`‘11
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page12
`
`NetApp; Rackspace Exhibit 1019 Page 12
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`be returned 412 from the transform engine 308 to be mutated if a successfill transform for the
`
`unmutated segment is not found.
`
`The pre—processor 304 mutates 408 segments by creating different versions of the
`
`segment to provide the transform engine 308 With more permutations of the same segment.
`
`By providing more permutations of the segment,
`
`the likelihood of finding an efficient
`
`transform is increased. Some segments for which the transform engine 308 failed to generate
`
`an efficient transform 412 are also returned to the pre-processor 304 and mutated 408.
`
`By monitoring and storing information on which data type and segment sizes are
`
`returned to the pre-processor 304 to be mutated 408,
`
`the control program 306 gains
`
`information on which data types and segment sizes should be initially mutated. Further, by
`
`monitoring mutated segments through the compression system 300, the control program 306
`
`gains information on which mutations result in successful transforms. This information can
`
`be re—incorporated into the control program.
`
`There are several ways to mutate the segment. If the control program 306 has a record
`
`stored of which mutating technique has resulted in successful transforms, that mutating
`
`technique is used.
`
`If the control program 306 does not have such a record, the mutating
`
`technique is chosen by stepping through a stored library ofpossible mutations.
`
`In a first mutating technique, the bit—length of the segment is adjusted. This involves
`
`splitting a group of segments differently. By doing so, the pre—processor 306 sends different
`
`content values to the transform engine 308 to process. Another technique is “cutting and
`
`shuffling” the data. A predefined shuffle library is used. The index of the shuffle is included
`
`as part of the packet overhead. Yet another technique is finding the complement of the
`
`segment and sending the complement to the transform engine 308. In another technique, the
`
`bits within the segment are shifted or rotated. When the bits are shifted or rotated, the
`
`direction and number of bits to rotate or shift is sent as part of the packet overhead. The
`
`segment can be modified by an arithmetic (scaling) or logical (XORing) operation.
`
`In an
`
`alternate technique, other traditional compressions are used to change the content value of the
`
`segment.
`
`The control program 306 tracks mutated segments, and how the'mutated segments
`
`were mutated. This information is provided to the packager 310 so that the packager 310 can
`
`add the associated appropriate packet overhead codes. These codes identify the data as
`
`mutated data, as well as how the data has been mutated.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`12
`
`NetApp;Rackspace
`
`Exhibit1019
`
`Page13
`
`NetApp; Rackspace Exhibit 1019 Page 13
`
`
`
`WO 01/50325
`
`PCT/US00/35672
`
`The segments, mutated or not, are next output 410 to the transform engine 308.
`
`Transform Engine
`
`Fig. 5 is a flow chart of the transform engine 308. The segment output by the pre-
`
`processor 304 is input 410 to the transform engine 308. The segment is sent to as many
`
`transform processes 502 as are available and as computational resources allow. The segment
`
`is processed by the transform processes 502 either serially or in parallel.
`
`Each transform process 502 provides another transform (or transforms) available to
`
`represent the content value of the segment. The transform engine 308 has multiple transform
`
`processes 502 available to it since a single transform may not be appropriate to the content
`
`value of every input segment. This is seen from studying the example, discussed previously,
`
`of the 9-bit size transform abc , where each of a, b, and c is a 3-bit message. Using this
`
`transform, the content value of an integer over 700,000 decimal digits long can be transmitted
`
`using only nine bits. Such a number would occupy over 2.3 million bits if conventionally
`
`represented. While this transform provides the capability for nine bits to transmit a 2.3
`
`million-bit number, the nine bits still are only capable of conveying less than 512 different
`
`messages, corresponding to 512 different content values. Therefore, there exist a significant
`
`number of values between 0 and 777 that the transform ab” cannot represent.
`
`Normally, arithmetic transforms include a remainder R, so that t