US006667700B1

(10) Patent No.: US 6,667,700 B1
(45) Date of Patent: Dec. 23, 2003

McCanne et al.
`
`(54) CONTENT-BASED SEGMENTATION
`SCHEME FOR DATA COMPRESSION IN
`STORAGE AND TRANSMISSION
`INCLUDING HIERARCHICAL SEGMENT
`REPRESENTATION
`
`(75)
`
`Inventors: Steven McCanne, Berkeley, CA (US);
`Michael J. Demmer, San Francisco,
`CA (US)
`
`(73) Assignee: NBT Technology, Inc., San Francisco,
`CA (US)
`
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.
`
(21) Appl. No.: 10/285,330
`
`(22)
`
`Filed:
`
`Oct. 30, 2002
`
(51) Int. Cl.7 ................................................. H03M 7/34
(52) U.S. Cl. ......................................... 341/51; 341/50
(58) Field of Search .............................. 341/50, 51, 67,
341/55; 709/246, 247
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`5/1995 Whiting ...................... 395/700
`5,414,850 A *
`OTHER PUBLICATIONS
`
Sayood, Khalid et al., “Recursively Indexed Quantization of
Memoryless Sources”, IEEE Transactions on Information
Theory, Sep. 1992, pp. 1602-1609, vol. IT-38, No. 5.
Manber, Udi et al., “GLIMPSE: A Tool to Search Through
Entire File Systems”, Department of Computer Science, Oct.
1993, pp. 1-10, Technical Report #93-34, University of
Arizona, Tucson, Arizona.
Manber, Udi et al., “Finding Similar Files in a Large File
System”, Department of Computer Science, Oct. 1993, pp.
1-10, Technical Report #93-33, University of Arizona,
Tucson, Arizona.
Manber, Udi et al., “A Text Compression Scheme That
Allows Fast Searching Directly in the Compressed File”,
Department of Computer Science, Mar. 1993, pp. 1-12,
Technical Report #93-07, University of Arizona, Tucson,
Arizona.
`
`* cited by examiner
`
`Primary Examiner—Brian Young
`Assistant Examiner—Joseph Lauture
`(74) Attorney, Agent, or Firm—Philip H. Albert; Townsend
`and Townsend and Crew, LLP
`
`(57)
`
`ABSTRACT
`
In a coding system, input data within a system is encoded.
The input data might include sequences of symbols that
repeat in the input data or occur in other input data encoded
in the system. The encoding includes determining a target
segment size, determining a window size, identifying a
fingerprint within a window of symbols at an offset in the
input data, determining whether the offset is to be designated
as a cut point and segmenting the input data as indicated by
the set of cut points. For each segment so identified, the
encoder determines whether the segment is to be a refer-
enced segment or an unreferenced segment, replacing the
segment data of each referenced segment with a reference
label and storing a reference binding in a persistent segment
store for each referenced segment, if needed. Hierarchically,
the process can be repeated by grouping references into
groups, replacing the grouped references with a group label,
storing a binding between the grouped references and group
label, if one is not already present, and repeating the process.
The number of levels of hierarchy can be fixed in advance
or it can be determined from the content encoded.
`
`6 Claims, 9 Drawing Sheets
`
[Cover figure: Input Data is segmented (Segment Delimiting) into Segmented
Data with First Level Bindings; the resulting references are grouped again
to form Second Level Bindings and the Output Data.]

RIV-1004 - Page 1 of 21
`
U.S. Patent          Dec. 23, 2003          Sheet 1 of 9          US 6,667,700 B1

[FIG. 1: Encoder with inputs Input Data, Window Size, Target Segment Size,
and Other Control Parameters; it produces Output Data (references to
segments, references, transmitted bindings, residual data).]

[FIG. 2: Decoder 300 taking Encoded Data (references, segments, extracted
bindings) and producing Output Data.]
`
`
[FIG. 3: Encoding detail: input data in input buffer 220 passes through
segment delimiter 222; bindings are stored in PSS 210 as needed, and output
buffer 230 holds the compressed data (references, unreferenced segments
such as segment 224, bindings).]

[FIG. 4: Decoding detail: compressed data in input buffer 320 is decoded
using PSS 310; references are replaced with segment data, unreferenced
segments pass through, and output buffer 330 holds the reconstructed data.]
`
`
`
[FIG. 5: Hierarchical encoding: Input Data is segmented (Segment
Delimiting) into Segmented Data with First Level Bindings; the resulting
references are grouped to form Second Level Bindings and the Output Data.]
`
`
[FIG. 7: An input of length L encodes to references <R1> <R2> ... <RM>,
then <R1^2> <R2^2> ... <RM^2>, and so on through K levels, with
(L / avgss) * rlen = L1, (L1 / avgss) * rlen = L2, ...,
(L(K-1) / avgss) * rlen = LK, so that LK = L / (avgss / rlen)^K, where
avgss = average segment size, rlen = reference length, and K = number of
encoding levels.]
`
`
[FIG. 8: A span of data bytes of length L wrapped in successive single
references <R1>, <R1^2>, ..., <R1^(K-1)>; space overhead = K * rlen.]

[FIG. 9: Persistent segment store entries carrying a leaf/content bit,
organized to hold references of arbitrary depth.]
`
`
[FIG. 10: Decoding with the persistent segment store of FIG. 9: decoder
input references such as <RA> and <RB> are recursively expanded through
intermediate references down to segment data, forming the decoder output.]
`
`
[FIG. 11: Hierarchical content-induced segmentation of input data, showing
the segment table and the Level 2 encoding.]
[FIG. 12: Networked client-server pair with traffic routed through a
client-side transaction accelerator (CTA) and a server-side transaction
accelerator (STA).]
`
`
[FIG. 13: File system using hierarchical content-induced segmentation: the
decoder stores <filename, file data>; operations retrieve
<filename><timestamp> and delete <filename><time range> against a near-line
store indexed by filename and timestamp.]

[FIG. 14: Near-line file system with a file server front end and whole-file
caching.]
`
`CONTENT-BASED SEGMENTATION
`SCHEME FOR DATA COMPRESSION IN
`STORAGE AND TRANSMISSION
`INCLUDING HIERARCHICAL SEGMENT
`REPRESENTATION
`
`CROSS-REFERENCES TO RELATED
`APPLICATIONS
`
U.S. patent application Ser. No. 10/285,315 entitled
“Transaction Accelerator for Client-Server Communication
Systems” (hereinafter “McCanne I”) was filed of even date
with this application and is incorporated by reference herein
for all purposes.
`
`BACKGROUND OF THE INVENTION
`
`The present invention relates generally to data compres-
`sion and more specifically to segmentation used for com-
`pression.
`Data compression is useful for more efficiently storing
`and transmitting data. Data compression is a process of
`representing input data as compressed data such that the
`compressed data comprises fewer bits or symbols than the
`input data and is such that
`the compressed data can be
`decompressed into at least a suitable approximation of the
`original input data. Compression allows for more efficient
`transmission of data, as fewer bits need to be sent to allow
`a receiver to recover the original set of bits (exactly or
`approximately) and compression allows for more efficient
`storage as fewer bits need be stored.
`“Compression ratio” refers to the ratio of the number of
`bits or symbols in the original data to the number of bits or
`symbols in the compressed data. For example, if a sequence
`of 100 bytes of data is representable by 5 bytes of data, the
`compression ratio in that example is 20:1. If the input data
need not be recovered exactly, so-called “lossy compression”
`can be used, generally resulting in greater compression
`ratios than “lossless” compression. In a typical application
`where the compression is to be transparent, the compression
`should be lossless.
`
`Compression based on the structure and statistics of the
`input content is common. A typical compressor receives an
`input stream or block of data and produces a compressed
`stream or block, taking into account the symbol values in the
`input, the position of particular symbol values in the input,
`relationships among various symbol values in the input, as
`well as the expected nature of the source of input data. For
`example, where the input data is expected to be English text,
`it is highly likely that the output of the source following a “.”
`(period) symbol is a “ ” (blank space) symbol. This char-
`acteristic of the source can be exploited by the compressor.
`For example, the blank space might be represented by no
`symbol at all in the compressed data, thus reducing the data
`by one symbol. Of course, in order to have the compressed
data be decompressible losslessly, the compressor would
`have to encode special notations for each instance where a
`period is not followed by a blank space. However, given
`their relative frequency of occurrence, many more omissions
`can be expected than special notations, so the overall result
`is net compression.
`One method of compression used with sources that are
`likely to contain repeated sequences of input characters is
`the dictionary approach. With this approach, a dictionary of
`symbol sequences is built up and each occurrence of one of
`the symbol sequences in the dictionary is replaced with the
`index into the dictionary. Where the compressor and the
decompressor have access to the same dictionary, the
decompressor can losslessly decompress the compressed
data by replacing each dictionary reference with the corre-
sponding entry. Generally, dictionary compression assumes
that an input stream can be divided into sequences and that
those sequences will recur later in the input stream.
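As a concrete illustration of the dictionary approach, the following sketch replaces each dictionary sequence with its index and inverts the mapping on the other side. The dictionary contents, token format, and greedy matching below are illustrative assumptions, not the scheme of this patent.

```python
# Minimal sketch of dictionary compression: both sides share the same
# dictionary, so the decompressor can invert each index losslessly.
DICTIONARY = ["the ", "compress", "data"]          # shared by both ends

def compress(text: str) -> list:
    out = []
    i = 0
    while i < len(text):
        for idx, seq in enumerate(DICTIONARY):
            if text.startswith(seq, i):            # replace sequence with its index
                out.append(idx)
                i += len(seq)
                break
        else:                                      # no match: emit literal symbol
            out.append(text[i])
            i += 1
    return out

def decompress(tokens: list) -> str:
    # Replace each dictionary reference with the corresponding entry.
    return "".join(DICTIONARY[t] if isinstance(t, int) else t for t in tokens)

text = "the compressed data"
tokens = compress(text)                            # [0, 1, 'e', 'd', ' ', 2]
assert decompress(tokens) == text                  # lossless round trip
```

Note that the round trip is lossless only because both ends hold the identical dictionary, which is exactly the requirement discussed above.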
Of course, for the dictionary approach to work, the
decompressor has to have a copy of the dictionary used by
the compressor. Where the compression is for reducing
transmission efforts, the compressor and the decompressor
are normally separated by the transmission channel over
which efforts are being reduced, but the load on the channel
may be increased if the dictionary is sent over that channel.
A similar issue arises where compression is to be applied for
reducing storage, as the dictionary needs to be stored so the
decompressor has access to it and that adds to the storage
effort. In some schemes, the dictionary is a fixed dictionary
and thus it can be amortized over many compressions to
reduce the per-compression cost of the dictionary to where
the overhead is insignificant. In other schemes, the dictio-
nary is adaptive, but is reconstructable from data already
available to the decompressor, such as previously decom-
pressed symbols.
`Compression is useful in networks where network traffic
`is limited by bandwidth constraints. One example is a wide
`area network (WAN), such as the Internet, which generally
`has less free bandwidth per use than other networks, such as
`a dedicated local area network (LAN) or a dedicated WAN.
`For cost reasons, many would like to use nondedicated
`WAN’s instead of relying only on LAN’s or adding dedi-
`cated WAN’s, but are constrained by the performance of
`nondedicated WAN’s. Compression can potentially make it
`feasible to use a low bandwidth link for high bandwidth
`applications since it reduces the number of actual bits
`required to represent a larger input sequence. Similarly,
`compression can potentially enhance performance or capac-
`ity of a file system by reducing the number of bits required
`to represent all of the files in the system.
`In general, data stored and communicated across enter-
`prise systems and networks often has high degrees of
`information redundancy present. For example, e-mail mes-
`sages and attachments sent to large numbers of recipients in
`a corporation generate many redundant copies of the mes-
`sage data in storage systems as well as cause redundant
`traffic to be sent across the network. Likewise, many elec-
`tronic documents within an enterprise share very high
`degrees of commonality as different employees work with
`similar pieces of corporate information in different settings.
`If such data were compressed, network performance
`would improve and effective storage capacity would
`increase. Traditional compression schemes can exploit some
`of these redundancies by detecting statistical correlations in
`an input symbol stream and encoding the stream’s symbols
`in as few bits as possible based on the statistical correlations.
`Some dictionary-based compression schemes are known as
`“universal codes” in that
`they converge to the optimal
`compression scheme (the Shannon limit) under various
`assumptions including the assumption that the input sym-
`bols conform to a stationary random process. This would
`imply then that one could achieve optimal performance
`simply by deploying a universal coding system that per-
`formed optimal compression of network traffic in a network
`or of file data in a storage system.
`However, this approach does not necessarily work well in
`practice. For example, it is well known that enabling com-
`pression on the network interface of a router improves
performance, but only marginally (30% is typical but it
depends on the underlying traffic). One problem with tradi-
`tional universal coding schemes is that they do not neces-
`sarily converge to optimal rate if the underlying data input
`has non-stationary statistics. Moreover, if the underlying
`statistics are stationary but they exhibit “long range depen-
`dence” (LRD), the rate of convergence of the universal code
`to optimality could be impractically slow (perhaps exponen-
`tially slow). This has important consequences as many
`studies have provided evidence that network traffic exhibits
`LRD, and in fact, there is an open controversy as to whether
`the underlying data processes are best modeled as LRD
`random processes or non-stationary processes. Other studies
`have shown that file statistics (like size distributions, etc.)
`also exhibit LRD. In short, this all means that traditional
`methods of universal coding are not necessarily the best
`practical solution, and a technique that exploits long-range
`dependence of typical data sources is likely to do better.
`One brute-force approach to detecting long-range corre-
`lations is to employ a dictionary-based compression scheme
`that searches with great breadth over a data source (a file, a
`communication stream, etc.) for patterns that are repeated,
`represent those patterns with a name or label and store the
`corresponding data in a table or database in association with
`the name or label. To exploit LRD, a very large window of
`data could be kept that allows the system to peer arbitrarily
`far back in the input (or in time) to detect long-range
`dependent patterns. This simple model intuitively matches
`the structure of information in an enterprise. That is, many
`similar sources of information both change slowly over time
`and appear in different contexts (email, file systems, Web,
`etc). As underlying technology improves (e.g., disks and
`memory become increasingly less expensive), this approach
`becomes even more practical. However,
`the brute-force
`approach still has shortcomings.
`One shortcoming is that searching for arbitrary patterns of
`matching data in a bit stream is computationally expensive
`and the general problem of finding the optimal solution
`quickly and efficiently in the presence of LRD statistics has
`not been adequately solved. An alternative approach is to
`abandon the ideal of finding an optimal solution and instead
`focus on approximate solutions or heuristics that perform
`well in the light of LRD and are practical and feasible.
`One tool that proves useful in this framework is a pro-
`posed heuristic for finding repeated patterns in data by
`segmenting the data based on the input content itself, rather
`than some externally imposed blocking or framing scheme.
`See,
`for example, Muthitacharoen, A., et al., “A Low-
`Bandwidth Network File System”, in Proceedings of the 18th
`ACM Symposium on Operating Systems Principles (SOSP
`’01), pp. 174—187 (Chateau Lake Louise, Banff, Canada,
`October 2001) (in vol. 35, 5 of ACM SIGOPS Operating
`Systems Review, ACM Press).
`In the LBFS system
`described therein, portions of transmitted files are replaced
`with hashes, and the recipient uses the hashes to reconstruct
`which portion of which file on a file system corresponds to
`the replaced data. Another example of segmentation based
`on input content is described in the context of matching
`portions of files, as described by Manber, “Finding Similar
`Files in a Large File System”, USENIX Proceedings, San
`Francisco 1994 (available as University of Arizona Dept. of
`Comp. Sci. Technical Report TR93-33).
`Other attempts to reduce network traffic through dictio-
`nary style compression techniques have been applied at the
`network layer. One such technique includes representing
`portions of network traffic with tokens and maintaining
`tables of tokens at each end of a connection. See, for
example, Spring, N., et al., “A Protocol-Independent Tech-
nique for Eliminating Redundant Network Traffic”, in Pro-
`ceedings of ACM SIGCOMM (August 2000). As described
`in that reference, network traffic that contains redundancies
`can be reduced by identifying repeated strings and replacing
`the repeated strings with tokens to be resolved from a shared
`table at either end of a connection. Because it operates solely
`on individual packets, the performance gains that accrue
`from this approach are limited by the ratio of the packet
`payload size to the packet header (since the packet header is
`generally not compressible using the described technique).
`Also, because the mechanism is implemented at the packet
`level, it only applies to regions of the network where two
`ends of a communicating path have been configured with the
`device. This configuration can be difficult to achieve, if not
`impractical,
`in certain environments. Also, by indexing
`network packets using a relatively small memory-based
`table with a first-in first-out replacement policy (without the
`aid of, for instance, a large disk-based backing store), the
`efficacy of the approach is limited to detecting and exploit-
`ing communication redundancies that are fairly localized in
`time, i.e., the approach cannot exploit LRD properties of the
`underlying data stream.
`An alternative approach to reduce network traffic involves
`caching, where a request for data is not sent over the network
`if a copy of the data is available locally in a cache. As used
`herein, the terms “near”, “far”, “local” and “remote” might
`refer to physical distance, but more typically they refer to
effective distance. The effective distance between two
computers, computing devices, servers, clients, peripherals,
`etc. is, at least approximately, a measure of the difficulty of
`getting data between the two computers.
`While caching is good for blocks of data that do not
`change and are not found in similar forms under different
`names, improvements are still needed in many cases. In file
`caching, the unit of caching is typically a block of a file or
`the whole file. If the same data is present in a different file,
`or two files have only small differences, caching will not
`remove the redundancies or exploit them to reduce commu-
`nication costs. Even if a data object is segmented into many
`blocks and each of the blocks is cached separately, the net
result is still inefficient because a small insertion or deletion
of data in the underlying object will cause the data to shift
`through many (if not all) of the blocks and thus nullify the
`benefits of caching. This is due to the fact that the blocks are
`imposed arbitrarily on the input stream, and so it is impos-
`sible to detect that only a small change has been made to the
`underlying data.
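The block-shifting effect described above can be demonstrated with a small sketch. The toy cut rule below (cut wherever the symbol “d” appears) merely stands in for a real content-based indicator function and is an assumption for illustration: after a one-byte insertion, every offset-based block changes, while content-based boundaries realign past the edit.

```python
def fixed_blocks(data: bytes, size: int = 8):
    # Externally imposed blocking: boundaries depend only on byte offsets.
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_blocks(data: bytes):
    # Content-induced blocking: a toy indicator cuts after each b"d",
    # standing in for a windowed fingerprint indicator function.
    blocks, start = [], 0
    for i, byte in enumerate(data):
        if byte == ord("d"):
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])
    return blocks

a = b"abcdefghijklmnopqrstuvwxyz" * 3
b = b"X" + a                          # a single inserted byte shifts everything

# Offset-based blocks all change; content-based blocks realign after the edit.
assert not set(fixed_blocks(a)) & set(fixed_blocks(b))
assert set(content_blocks(a)) & set(content_blocks(b))
```

Because the boundaries are derived from the data itself, all blocks after the insertion point are byte-for-byte identical in both versions and could be deduplicated, which is precisely what fixed blocking cannot achieve.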
In view of the above, improvements can be made in
compressing data in a network environment, in storage
systems, and elsewhere.
`
`BRIEF SUMMARY OF THE INVENTION
`
`In a coding system according to one embodiment of the
`present invention, input data within a system is encoded. The
`input data might include sequences of symbols that repeat in
`the input data or occur in other input data encoded in the
`system. The encoding includes determining one or more
`target segment sizes, determining one or more window sizes,
`identifying a fingerprint within a window of symbols at an
`offset in the input data, determining whether the offset is to
`be designated as a cut point and segmenting the input data
`as indicated by the set of cut points. For each segment so
`identified, the encoder determines whether the segment is to
`be a referenced segment or an unreferenced segment, replac-
`ing the segment data of each referenced segment with a
reference label and storing a reference binding in a persistent
segment store for each referenced segment, if needed.
`Hierarchically, the process can be repeated by segmenting
`the reference label strings into groups, replacing the grouped
`references with a group label, storing a binding between the
`grouped references and group label, if one is not already
`present, and repeating the process. The number of levels of
hierarchy can be fixed in advance or it can be determined
`from the content encoded.
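The hierarchical step can be sketched as follows. The fixed group size of two and the “G”-prefixed label format are illustrative assumptions; the patent segments reference label strings by content rather than by a fixed count.

```python
# Sketch of hierarchical label grouping: runs of references are replaced
# with group labels, and a binding is stored only for unseen groups.
def encode_level(labels, store, prefix, group=2):
    out = []
    for i in range(0, len(labels), group):
        chunk = tuple(labels[i:i + group])
        name = f"{prefix}{len(store)}"
        key = store.setdefault(chunk, name)   # bind only if not already present
        out.append(key)
    return out

store = {}
level1 = ["R1", "R2", "R3", "R4", "R1", "R2"]     # first-level references
level2 = encode_level(level1, store, "G")         # ["G0", "G1", "G0"]

# Decoding inverts the stored bindings back to the reference string.
bindings = {name: chunk for chunk, name in store.items()}
decoded = [r for g in level2 for r in bindings[g]]
assert decoded == level1
```

The repeated pair ("R1", "R2") collapses to a single reused label G0, which is how each additional level multiplies the compression of repeated runs.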
`
`Other features and advantages of the invention will be
`apparent in view of the following detailed description and
`preferred embodiments.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram of an encoder that encodes a
`data stream or block using segmentation based on content
`and segment references.
`FIG. 2 is a block diagram of a decoder that decodes a data
`stream or block using segmentation bindings from a persis-
`tent segment store.
`FIG. 3 is a diagram illustrating an encoding process in
`more detail.
`
`FIG. 4 is a diagram illustrating a decoding process in
`more detail.
`
`FIG. 5 is a diagram illustrating a hierarchical encoding
`process.
`
`FIG. 6 is an illustration of a persistent segment store
`having a plurality of level segment stores.
`FIG. 7 diagrammatically illustrates a problem with using
`too few encoding levels.
FIG. 8 diagrammatically illustrates a problem with using
too many encoding levels.
`FIG. 9 is an illustration of a persistent segment store
`organized to hold arbitrary depth of references.
`FIG. 10 illustrates a decoding process using the persistent
`segment store of FIG. 9.
`FIG. 11 illustrates hierarchical content-induced segmen-
`tation.
`
FIG. 12 is a block diagram of a networked client-server
pair where some traffic between the client and the server is
routed through a client-side transaction accelerator (“CTA”)
and a server-side transaction accelerator (“STA”).
`FIG. 13 illustrates a file system using hierarchical
`content-induced segmentation.
`FIG. 14 illustrates a near-line file system (NLFS) and file
`server front end.
`
`DETAILED DESCRIPTION OF THE
`INVENTION
`
`The present invention has many applications, as will be
`apparent after reading this disclosure.
`In describing an
`embodiment of a compression system according to the
`present invention, only a few of the possible variations are
`described. Other applications and variations will be apparent
`to one of ordinary skill in the art, so the invention should not
`be construed as narrowly as the examples, but rather in
`accordance with the appended claims.
`Coding (encoding, decoding) can be done by a number of
`different devices, which might involve hardware, software,
`or both. Coding can be done with a computer, computing
`device, peripheral, electronics, or the like, and/or using an
`application being executed or controlled by such element.
`Coding might be done incident to a transport process, such
as that described in McCanne I. Using the coding apparatus
and processes described herein, the responsiveness of trans-
`actions over a network can be improved by lessening the
number of bits that need to be transmitted at critical times
over critical channels. Also, the coding system may be
integrated into a stand-alone storage system to optimize
`capacity. In this environment, the effective capacity of a
`storage system can be enhanced as the coding system
`described herein extracts common sequences of data and
`stores them just once, even if the sequence appears poten-
`tially many times, in potentially many different files.
`FIG. 1 illustrates the particular inputs that an encoder
`might have. As shown there, an encoder has an input for
`receiving input data and inputs for parameters, such as target
`segment size, window size and other control parameters, as
`well as outputs for the output data and bindings generated in
`the encoding process, which are stored in a persistent
`segment store (PSS) 210 as needed. In operation, encoder
`140 would process input data, identify segments of data,
`replace the segment’s data with a reference, provide the
`segment data and a segment reference to PSS 142 in the
`form of a binding and output the encoded data. The encoded
`output data might
`include unreferenced segment data,
`embedded bindings and/or reference labels for referenced
`segments. In the simplest case, the output data is entirely
`reference labels.
`
`The target segment size parameter is used by the encoder
`to control the average size of a segment. In general, seg-
`ments are variable length and there are design tradeoffs in
`selection of this size as described below.
`
`FIG. 2 illustrates a decoder 300 and a PSS 310, which
`together perform decoding that is the inverse of the encoding
`done by encoder 200. As described above, the encoded data
`might comprise references, bindings and unreferenced
`residual data. When decoder 150 encounters a binding in
`received data, it can use the segment data in that binding to
`reconstruct the original data and it can also store the binding
`in its PSS. When decoder 150 encounters a reference with-
`
`out a binding, it can use the reference to obtain segment data
`from PSS 152 to reconstruct the segment. If the segment
`reference is not found in PSS 152, decoder 150 can send a
`request for the segment data.
`FIG. 3 is a diagram illustrating an encoding process in
`more detail. As shown there, input data to be encoded is
`stored in buffer 220, segmented by segment delimiter 222,
`which creates the output references and bindings for PSS
`210. The output references, unreferenced segments, such as
`segment 224, and bindings as needed are provided to an
`output buffer 230 to form the output of the encoder.
`FIG. 4 is a diagram illustrating a decoding process in
more detail. As shown there, encoded data is buffered in an
`input buffer 320. From the buffer contents, bindings are
extracted and stored in PSS 310. Reference labels are
provided to a replacer 325 that replaces them with the
`referenced segment data and places the data in an output
`buffer 330. Unreferenced segment data is output directly to
`output buffer 330 for eventual output as the reconstructed
`data.
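The complementary flows of FIGS. 3 and 4 can be sketched as a round trip in which the encoder emits a label-and-data binding the first time a segment is seen and a bare reference thereafter. The label format and the in-band transport of bindings are illustrative assumptions; in practice a decoder may instead request missing bindings from its peer.

```python
# Sketch of the FIG. 3 / FIG. 4 flow with a persistent segment store (PSS)
# on each side; label format "R0", "R1", ... is an illustrative assumption.
def encode(segments, pss):
    output = []
    for seg in segments:
        if seg not in pss:                    # new segment: bind label to data
            pss[seg] = f"R{len(pss)}"
            output.append(("bind", pss[seg], seg))
        else:                                 # repeated segment: reference only
            output.append(("ref", pss[seg]))
    return output

def decode(stream, pss):
    out = []
    for item in stream:
        if item[0] == "bind":                 # store binding, emit its data
            _, label, seg = item
            pss[label] = seg
            out.append(seg)
        else:                                 # resolve reference from the PSS
            out.append(pss[item[1]])
    return b"".join(out)

enc_pss, dec_pss = {}, {}
segments = [b"hello ", b"world", b"hello "]
stream = encode(segments, enc_pss)            # third item is just ("ref", "R0")
assert decode(stream, dec_pss) == b"hello worldhello "
```

The second occurrence of b"hello " travels as a short reference rather than as data, and the decoder's PSS accumulates the same bindings as the encoder's, so later streams can consist entirely of references.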
`
`As the above-described figures illustrate, one aspect of the
`encoding process is the segmentation of input data. In a
`process for segmenting, identifying “cut points”, such as
`offsets in the input data where one segment ends and the next
`segment begins, is equivalent to segregating the input data
`into separate data structures, or the like.
`If a segmentation scheme is designed appropriately, seg-
`ment boundaries should always appear in the same place for
the same sequences of data, regardless of the context in
which that data appeared. If that were the case, commonly
`repeated patterns of data would be segmented in the same
`way over each repetition of data and a system can be
`designed to efficiently identify these repeated patterns. For
`example, a particular data sequence (such as pieces of a
`widely used GIF image or pieces of a bitmap representing a
`commonly used graphical icon) that appears in many dif-
`ferent locations in larger files (such as a word processing
`document, slide presentation document, or on a Web page)
`will always be found and segmented in the same way,
`regardless of what data surrounds it.
`To achieve this property, the segmentation scheme herein
`uses information in the data itself to guide the segmentation
`process rather than externally imposed parameters like block
`sizes, transaction boundaries, etc. As input data is consumed
`by the coding process, the various values and structure of the
`input symbols guide the segmentation process (as described
`below). Consequently,
`the system has a “self-
`synchronizing” property where if similar data patterns are
`presented to its input,
`the same segment boundaries are
`detected, inducing the formation of the same segments that
`have been seen in the past. Moreover, this means that the
`system is robust to insertions, deletions, or other changes in
`the underlying input data, in that once a previously present
`segment boundary is found, the new segment will match an
`existing segment from that point on (assuming the data
`pattern from that point on has been seen in the past).
`Because this scheme is guided by the content of the input,
`it is described herein as “content-induced segmentation”. By
`applying content-induced segmentation to an input stream
`with repeated patterns across very large time scales (i.e.,
`exhibiting LRD statistics), the same segments are created
`without having to keep track of the entire history of data
`patterns that have been seen in the past. That
`is,
`the
`segmentation process can simply segment the input based on
`the input itself without having to search over the existing
`data patterns already found. While, in general, this approach
`does not produce the optimal segmentation (i.e., maximizing
`the size of segments while simultaneously maximizing the
`number of repeated segments found), it provides a good
`tradeoff between complexity and performance—the scheme
`is effective at exploiting certain types of LRD while leading
`to an efficient and practical implementation.
`Content-Induced Segmentation
A segmenter, such as segment delimiter 222 in FIG. 3,
`operates on a portion (or all) of a set of input data. The
`segmenter batches up a certain number of input symbols
`(determined by the current offset and the window size
`parameter) and computes a hash function over the window.
`This is referred to as the fingerprint of the window. A
`deterministic fingerprint indicator function returns a Bool-
`ean value for each fingerprint indicating whether the offset
`is to be considered a cut point and thereby defines a segment
`boundary. Preferably, the window does not extend beyond
`either end of an application data unit (ADU), so that each
`window contribution is from exactly one ADU. Thus,
`preferably, the encoder is provided an indication in an input
`data stream where one ADU ends and another one starts.
`
`However, such indication is not necessary for correct opera-
`tion of the system. Rather,
`these indications allow the
`encoder to force a segment boundary in the stream where an
ADU boundary ends and thus prevent practical inefficiencies
`such as extra delays waiting for data that might not ever
`arrive (e.g., relying upon an implementation time-out to
`force a segment boundary instead). Additionally, restricting
`segment boundaries to be within a single ADU prevents the
`system from detecting and storing segments that are unlikely
`to be repeated in the future as may happen if a segment spans
`two ADU’s.
`
`Where the fingerprint indicator function values are 1 and
`0, 1 might represent a cut point and 0 represent other than
`a cut point. Thus, under one convention, when the function
`evaluates to 1 for a given fingerprint having a given offset
`and window, a new segment is created at the symbol defined
`by the first symbol of the input data (in order according to
`an input data order). Under that convention, if the function
`evaluates to 0, then one more input symbol is consumed and
`the window is advanced to cover this new symbol (and the
`least current symbol
`is removed from the window but
`remains in the current segment). The target segment size can
`thus be controlled as a parameter of the fingerprint indicator
`function.
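Under that convention, the segmenter can be sketched as follows. The polynomial hash and the "fingerprint mod M == 0" indicator are illustrative stand-ins for the fingerprint function and deterministic indicator function described above, not the patent's specific choices.

```python
# Sketch of content-induced segmentation: slide a window of W symbols,
# fingerprint it, and cut when the indicator function evaluates to 1.
W = 4          # window size
M = 16         # target-segment-size parameter: ~1 in M windows is a cut

def fingerprint(window: bytes) -> int:
    # Toy polynomial hash over the window (stand-in fingerprint function).
    h = 0
    for b in window:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def segment(data: bytes):
    segments, start = [], 0
    for i in range(W, len(data) + 1):
        if fingerprint(data[i - W:i]) % M == 0:   # indicator evaluates to 1
            segments.append(data[start:i])
            start = i
    if start < len(data):
        segments.append(data[start:])             # trailing partial segment
    return segments

data = bytes(range(256)) * 2
segs = segment(data)
assert b"".join(segs) == data                     # segmentation is lossless
assert segment(data) == segs                      # same content, same cuts
```

Because cut points depend only on the W symbols in the window, the scheme is self-synchronizing: the same data produces the same boundaries regardless of surrounding context, and raising M raises the expected segment size.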
`
Since a well-chosen fingerprint function will tend to
create random bit patterns in the fingerprint, the Bernoulli
distribution function of a random variable, parameterized by
`distribution function of a random variable, parameterized by
`the target segment size, can be used to map the hash value
`into the indicator value. For example, the parameter can be
`chosen such that on average, assuming a random distribution
`of fingerprint inputs, 1 out of M times, the function evaluates
`to true. Thus, on average, the segment size is going to be
`M+W bytes, where W is the window size in use (because a
`segment is at least W bytes large). Thus, the value of M
`deter