(12) United States Patent
(10) Patent No.: US 6,667,700 B1
McCanne et al.
(45) Date of Patent: Dec. 23, 2003

(54) CONTENT-BASED SEGMENTATION SCHEME FOR DATA COMPRESSION IN STORAGE AND TRANSMISSION INCLUDING HIERARCHICAL SEGMENT REPRESENTATION

(75) Inventors: Steven McCanne, Berkeley, CA (US); Michael J. Demmer, San Francisco, CA (US)

(73) Assignee: NBT Technology, Inc., San Francisco, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 10/285,330

(22) Filed: Oct. 30, 2002

(51) Int. Cl.7 .............. H03M 7/34
(52) U.S. Cl. .............. 341/51; 341/50
(58) Field of Search .............. 341/50, 51, 67, 341/55; 709/246, 247

(56) References Cited

U.S. PATENT DOCUMENTS

5,414,850 A    5/1995    Whiting ...................... 395/700

OTHER PUBLICATIONS

Sayood, Khalid et al., “Recursively Indexed Quantization of Memoryless Sources”, IEEE Transactions on Information Theory, Sep. 1992, pp. 1602–1609, vol. IT-38, No. 5.

Manber, Udi et al., “GLIMPSE: A Tool to Search Through Entire File Systems”, Department of Computer Science, Oct. 1993, pp. 1–10, Technical Report #93-34, University of Arizona, Tucson, Arizona.

Manber, Udi et al., “Finding Similar Files in a Large File System”, Department of Computer Science, Oct. 1993, pp. 1–10, Technical Report #93-33, University of Arizona, Tucson, Arizona.

Manber, Udi et al., “A Text Compression Scheme That Allows Fast Searching Directly in the Compressed File”, Department of Computer Science, Mar. 1993, pp. 1–12, Technical Report #93-07, University of Arizona, Tucson, Arizona.

* cited by examiner

Primary Examiner—Brian Young
Assistant Examiner—Joseph Lauture
(74) Attorney, Agent, or Firm—Philip H. Albert; Townsend and Townsend and Crew, LLP

(57) ABSTRACT

In a coding system, input data within a system is encoded. The input data might include sequences of symbols that repeat in the input data or occur in other input data encoded in the system. The encoding includes determining a target segment size, determining a window size, identifying a fingerprint within a window of symbols at an offset in the input data, determining whether the offset is to be designated as a cut point and segmenting the input data as indicated by the set of cut points. For each segment so identified, the encoder determines whether the segment is to be a referenced segment or an unreferenced segment, replacing the segment data of each referenced segment with a reference label and storing a reference binding in a persistent segment store for each referenced segment, if needed. Hierarchically, the process can be repeated by grouping references into groups, replacing the grouped references with a group label, storing a binding between the grouped references and group label, if one is not already present, and repeating the process. The number of levels of hierarchy can be fixed in advance or it can be determined from the content encoded.

6 Claims, 9 Drawing Sheets
[Cover figure: input data in a buffer passes through segment delimiting into raw data and segmented data; referenced data and segment bindings form first level bindings, which are in turn grouped under second level bindings to produce the output data.]

[Drawing sheet 1 of 9: FIG. 1, an encoder with inputs for input data, window size, target segment size, and other control parameters, producing output data (references to segments, references, transmitted bindings, residual data); FIG. 2, a decoder (300) turning encoded data (references, segments, extracted bindings) into output data.]

[Drawing sheet 2 of 9: FIG. 3, the encoding process: input data in an input buffer (220) feeds a segment delimiter (222), which stores bindings as needed (210, 224) and emits compressed data (references, unreferenced segments, bindings) via an output buffer (230); FIG. 4, the decoding process: compressed data in an input buffer (320) is split into unreferenced segments and segmented data (310), which an output buffer (330) assembles into reconstructed data.]

[Drawing sheet 3 of 9: FIG. 5, a hierarchical encoding process: input data passes through segment delimiting into segmented data and segment bindings, whose first level bindings bind reference labels to segment data and are grouped under second level bindings to form the output data.]

[Drawing sheet 4 of 9: FIG. 7, hierarchical encoding of an input of length L into reference strings <R1> <R2> ... <RM> at each of K levels, with level sizes L1 = (L / avgss) · rlen, L2 = (L1 / avgss) · rlen, ..., LK = (LK-1 / avgss) · rlen, so that LK = L · (rlen / avgss)^K, where avgss = average segment size, rlen = reference length, and K = number of encoding levels.]

[Drawing sheet 5 of 9: FIG. 8, overhead of a K-level encoding: space overhead = K · rlen, nonspace waste = K − 1; FIG. 9, a persistent segment store whose entries carry a leaf/content bit so references of arbitrary depth can be held.]

[Drawing sheet 6 of 9: FIG. 10, a decoding process using the persistent segment store of FIG. 9: decoder input references (e.g., <RD>, <RE>, <RB>, <RC>, <RF>, <RG>) are expanded level by level until only data segments (<data>) remain in the decoder output.]

[Drawing sheet 7 of 9: FIG. 11, hierarchical content-induced segmentation of input data, showing a segment table and the resulting level 2 encoding.]

[Drawing sheet 8 of 9: FIG. 12, a networked client-server pair in which traffic between the client and the server crosses a WAN through a client-side transaction accelerator and a server-side transaction accelerator.]

[Drawing sheet 9 of 9: FIG. 13, a file system using hierarchical content-induced segmentation: a decoder and HCS exchange store <filename, file data>, retrieve <filename><timestamp>, and delete <filename><time range> operations with a file system keyed by filename/label/timestamp; FIG. 14, a near-line file system behind a whole file caching front end.]

CONTENT-BASED SEGMENTATION SCHEME FOR DATA COMPRESSION IN STORAGE AND TRANSMISSION INCLUDING HIERARCHICAL SEGMENT REPRESENTATION

CROSS-REFERENCES TO RELATED APPLICATIONS

U.S. patent application Ser. No. 10/285,315 entitled “Transaction Accelerator for Client-Server Communication Systems” (hereinafter “McCanne I”) was filed of even date with this application and is incorporated by reference herein for all purposes.

BACKGROUND OF THE INVENTION

The present invention relates generally to data compression and more specifically to segmentation used for compression.

Data compression is useful for more efficiently storing and transmitting data. Data compression is a process of representing input data as compressed data such that the compressed data comprises fewer bits or symbols than the input data and is such that the compressed data can be decompressed into at least a suitable approximation of the original input data. Compression allows for more efficient transmission of data, as fewer bits need to be sent to allow a receiver to recover the original set of bits (exactly or approximately), and compression allows for more efficient storage as fewer bits need be stored.

“Compression ratio” refers to the ratio of the number of bits or symbols in the original data to the number of bits or symbols in the compressed data. For example, if a sequence of 100 bytes of data is representable by 5 bytes of data, the compression ratio in that example is 20:1. If the input data need not be recovered exactly, so-called “lossy compression” can be used, generally resulting in greater compression ratios than “lossless” compression. In a typical application where the compression is to be transparent, the compression should be lossless.

Compression based on the structure and statistics of the input content is common. A typical compressor receives an input stream or block of data and produces a compressed stream or block, taking into account the symbol values in the input, the position of particular symbol values in the input, relationships among various symbol values in the input, as well as the expected nature of the source of input data. For example, where the input data is expected to be English text, it is highly likely that the output of the source following a “.” (period) symbol is a “ ” (blank space) symbol. This characteristic of the source can be exploited by the compressor. For example, the blank space might be represented by no symbol at all in the compressed data, thus reducing the data by one symbol. Of course, in order to have the compressed data be decompressable losslessly, the compressor would have to encode special notations for each instance where a period is not followed by a blank space. However, given their relative frequency of occurrence, many more omissions can be expected than special notations, so the overall result is net compression.
One method of compression used with sources that are likely to contain repeated sequences of input characters is the dictionary approach. With this approach, a dictionary of symbol sequences is built up and each occurrence of one of the symbol sequences in the dictionary is replaced with the index into the dictionary. Where the compressor and the decompressor have access to the same dictionary, the decompressor can losslessly decompress the compressed data by replacing each dictionary reference with the corresponding entry. Generally, dictionary compression assumes that an input stream can be divided into sequences and that those sequences will recur later in the input stream.
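
By way of illustration, the dictionary approach can be reduced to a short sketch. The following Python fragment is illustrative only (the token granularity, dictionary policy, and names are choices made here, not taken from this description); it replaces each whitespace-delimited token with an index into a table shared by both ends:

    # Illustrative sketch of dictionary-style compression: each repeated
    # token is replaced by its index into a dictionary shared by the
    # compressor and decompressor.
    def dict_compress(tokens):
        dictionary = []        # index -> sequence, shared with decompressor
        index_of = {}          # sequence -> index
        encoded = []
        for tok in tokens:
            if tok not in index_of:
                index_of[tok] = len(dictionary)
                dictionary.append(tok)
            encoded.append(index_of[tok])
        return encoded, dictionary

    def dict_decompress(encoded, dictionary):
        # Lossless inverse: replace each index with its dictionary entry.
        return [dictionary[i] for i in encoded]

    data = "the cat sat on the mat the cat".split()
    codes, table = dict_compress(data)
    assert dict_decompress(codes, table) == data

As the sketch makes plain, decompression works only if both sides hold the same table, which raises the distribution problem discussed next.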
Of course, for the dictionary approach to work, the decompressor has to have a copy of the dictionary used by the compressor. Where the compression is for reducing transmission efforts, the compressor and the decompressor are normally separated by the transmission channel over which efforts are being reduced, but the load on the channel may be increased if the dictionary is sent over that channel. A similar issue arises where compression is to be applied for reducing storage, as the dictionary needs to be stored so the decompressor has access to it and that adds to the storage effort. In some schemes, the dictionary is a fixed dictionary and thus it can be amortized over many compressions to reduce the per-compression cost of the dictionary to where the overhead is insignificant. In other schemes, the dictionary is adaptive, but is reconstructable from data already available to the decompressor, such as previously decompressed symbols.
Compression is useful in networks where network traffic is limited by bandwidth constraints. One example is a wide area network (WAN), such as the Internet, which generally has less free bandwidth per use than other networks, such as a dedicated local area network (LAN) or a dedicated WAN. For cost reasons, many would like to use nondedicated WAN’s instead of relying only on LAN’s or adding dedicated WAN’s, but are constrained by the performance of nondedicated WAN’s. Compression can potentially make it feasible to use a low bandwidth link for high bandwidth applications since it reduces the number of actual bits required to represent a larger input sequence. Similarly, compression can potentially enhance performance or capacity of a file system by reducing the number of bits required to represent all of the files in the system.

In general, data stored and communicated across enterprise systems and networks often has high degrees of information redundancy present. For example, e-mail messages and attachments sent to large numbers of recipients in a corporation generate many redundant copies of the message data in storage systems as well as cause redundant traffic to be sent across the network. Likewise, many electronic documents within an enterprise share very high degrees of commonality as different employees work with similar pieces of corporate information in different settings.

If such data were compressed, network performance would improve and effective storage capacity would increase. Traditional compression schemes can exploit some of these redundancies by detecting statistical correlations in an input symbol stream and encoding the stream’s symbols in as few bits as possible based on the statistical correlations. Some dictionary-based compression schemes are known as “universal codes” in that they converge to the optimal compression scheme (the Shannon limit) under various assumptions, including the assumption that the input symbols conform to a stationary random process. This would imply then that one could achieve optimal performance simply by deploying a universal coding system that performed optimal compression of network traffic in a network or of file data in a storage system.
However, this approach does not necessarily work well in practice. For example, it is well known that enabling compression on the network interface of a router improves performance, but only marginally (30% is typical, but it depends on the underlying traffic). One problem with traditional universal coding schemes is that they do not necessarily converge to the optimal rate if the underlying data input has non-stationary statistics. Moreover, if the underlying statistics are stationary but they exhibit “long range dependence” (LRD), the rate of convergence of the universal code to optimality could be impractically slow (perhaps exponentially slow). This has important consequences as many studies have provided evidence that network traffic exhibits LRD, and in fact, there is an open controversy as to whether the underlying data processes are best modeled as LRD random processes or non-stationary processes. Other studies have shown that file statistics (like size distributions, etc.) also exhibit LRD. In short, this all means that traditional methods of universal coding are not necessarily the best practical solution, and a technique that exploits long-range dependence of typical data sources is likely to do better.
One brute-force approach to detecting long-range correlations is to employ a dictionary-based compression scheme that searches with great breadth over a data source (a file, a communication stream, etc.) for patterns that are repeated, represent those patterns with a name or label and store the corresponding data in a table or database in association with the name or label. To exploit LRD, a very large window of data could be kept that allows the system to peer arbitrarily far back in the input (or in time) to detect long-range dependent patterns. This simple model intuitively matches the structure of information in an enterprise. That is, many similar sources of information both change slowly over time and appear in different contexts (email, file systems, Web, etc.). As underlying technology improves (e.g., disks and memory become increasingly less expensive), this approach becomes even more practical. However, the brute-force approach still has shortcomings.

One shortcoming is that searching for arbitrary patterns of matching data in a bit stream is computationally expensive and the general problem of finding the optimal solution quickly and efficiently in the presence of LRD statistics has not been adequately solved. An alternative approach is to abandon the ideal of finding an optimal solution and instead focus on approximate solutions or heuristics that perform well in the light of LRD and are practical and feasible.
One tool that proves useful in this framework is a proposed heuristic for finding repeated patterns in data by segmenting the data based on the input content itself, rather than some externally imposed blocking or framing scheme. See, for example, Muthitacharoen, A., et al., “A Low-Bandwidth Network File System”, in Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01), pp. 174–187 (Chateau Lake Louise, Banff, Canada, October 2001) (in vol. 35, 5 of ACM SIGOPS Operating Systems Review, ACM Press). In the LBFS system described therein, portions of transmitted files are replaced with hashes, and the recipient uses the hashes to reconstruct which portion of which file on a file system corresponds to the replaced data. Another example of segmentation based on input content is described in the context of matching portions of files, as described by Manber, “Finding Similar Files in a Large File System”, USENIX Proceedings, San Francisco 1994 (available as University of Arizona Dept. of Comp. Sci. Technical Report TR93-33).
Other attempts to reduce network traffic through dictionary-style compression techniques have been applied at the network layer. One such technique includes representing portions of network traffic with tokens and maintaining tables of tokens at each end of a connection. See, for example, Spring, N., et al., “A Protocol-Independent Technique for Eliminating Redundant Network Traffic”, in Proceedings of ACM SIGCOMM (August 2000). As described in that reference, network traffic that contains redundancies can be reduced by identifying repeated strings and replacing the repeated strings with tokens to be resolved from a shared table at either end of a connection. Because it operates solely on individual packets, the performance gains that accrue from this approach are limited by the ratio of the packet payload size to the packet header (since the packet header is generally not compressible using the described technique). Also, because the mechanism is implemented at the packet level, it only applies to regions of the network where two ends of a communicating path have been configured with the device. This configuration can be difficult to achieve, if not impractical, in certain environments. Also, by indexing network packets using a relatively small memory-based table with a first-in first-out replacement policy (without the aid of, for instance, a large disk-based backing store), the efficacy of the approach is limited to detecting and exploiting communication redundancies that are fairly localized in time, i.e., the approach cannot exploit LRD properties of the underlying data stream.
An alternative approach to reduce network traffic involves caching, where a request for data is not sent over the network if a copy of the data is available locally in a cache. As used herein, the terms “near”, “far”, “local” and “remote” might refer to physical distance, but more typically they refer to effective distance. The effective distance between two computers, computing devices, servers, clients, peripherals, etc. is, at least approximately, a measure of the difficulty of getting data between the two computers.
While caching is good for blocks of data that do not change and are not found in similar forms under different names, improvements are still needed in many cases. In file caching, the unit of caching is typically a block of a file or the whole file. If the same data is present in a different file, or two files have only small differences, caching will not remove the redundancies or exploit them to reduce communication costs. Even if a data object is segmented into many blocks and each of the blocks is cached separately, the net result is still inefficient because a small insertion or deletion of data in the underlying object will cause the data to shift through many (if not all) of the blocks and thus nullify the benefits of caching. This is due to the fact that the blocks are imposed arbitrarily on the input stream, and so it is impossible to detect that only a small change has been made to the underlying data.
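
The shifting effect can be seen directly in a small sketch (the block size and data here are arbitrary choices for illustration):

    # Why arbitrary fixed-size blocking defeats block-level caching:
    # a one-byte insertion shifts the contents of every later block.
    def fixed_blocks(data, block_size=4):
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    original = b"ABCDEFGHIJKLMNOP"
    edited = b"ABCD!EFGHIJKLMNOP"   # one byte inserted after the first block

    before = fixed_blocks(original)
    after = fixed_blocks(edited)
    # Only the block ahead of the insertion still matches; all later
    # blocks shifted and would miss in a cache.
    print(sum(1 for block in after if block in before))   # prints 1

A single inserted byte leaves only one of five blocks cacheable, even though nearly all of the underlying data is unchanged.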
In view of the above, improvements can be made in compressing data in a network environment, in storage systems, and elsewhere.

BRIEF SUMMARY OF THE INVENTION

In a coding system according to one embodiment of the present invention, input data within a system is encoded. The input data might include sequences of symbols that repeat in the input data or occur in other input data encoded in the system. The encoding includes determining one or more target segment sizes, determining one or more window sizes, identifying a fingerprint within a window of symbols at an offset in the input data, determining whether the offset is to be designated as a cut point and segmenting the input data as indicated by the set of cut points. For each segment so identified, the encoder determines whether the segment is to be a referenced segment or an unreferenced segment, replacing the segment data of each referenced segment with a reference label and storing a reference binding in a persistent segment store for each referenced segment, if needed. Hierarchically, the process can be repeated by segmenting the reference label strings into groups, replacing the grouped references with a group label, storing a binding between the grouped references and group label, if one is not already present, and repeating the process. The number of levels of hierarchy can be fixed in advance or it can be determined from the content encoded.
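
One pass of that hierarchical step might be sketched as follows. This is a simplified illustration: the fixed group size and the “G”-prefixed labels are assumptions made here for brevity, whereas the process described above would segment the reference label strings themselves by content:

    # One hierarchical pass: replace groups of reference labels with a
    # group label, storing a binding for each new group just once.
    def group_references(labels, group_size, bindings, next_id):
        seen = {}                                # group tuple -> group label
        out = []
        for i in range(0, len(labels), group_size):
            group = tuple(labels[i:i + group_size])
            if group not in seen:
                label = "G%d" % next_id
                next_id += 1
                seen[group] = label
                bindings[label] = group          # store binding if not present
            out.append(seen[group])
        return out, next_id

    bindings = {}
    level1 = ["R1", "R2", "R3", "R1", "R2", "R3"]      # first-level labels
    level2, _ = group_references(level1, 3, bindings, next_id=1)
    print(level2)      # ['G1', 'G1'] -- the repeated group collapses
    print(bindings)    # {'G1': ('R1', 'R2', 'R3')}

Repeating the pass over the group labels produces further levels until a fixed depth is reached or the label string stops shrinking.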

Other features and advantages of the invention will be apparent in view of the following detailed description and preferred embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an encoder that encodes a data stream or block using segmentation based on content and segment references.

FIG. 2 is a block diagram of a decoder that decodes a data stream or block using segmentation bindings from a persistent segment store.

FIG. 3 is a diagram illustrating an encoding process in more detail.

FIG. 4 is a diagram illustrating a decoding process in more detail.

FIG. 5 is a diagram illustrating a hierarchical encoding process.

FIG. 6 is an illustration of a persistent segment store having a plurality of level segment stores.

FIG. 7 diagrammatically illustrates a problem with using too few encoding levels.

FIG. 8 diagrammatically illustrates a problem with using too many encoding levels.

FIG. 9 is an illustration of a persistent segment store organized to hold arbitrary depth of references.

FIG. 10 illustrates a decoding process using the persistent segment store of FIG. 9.

FIG. 11 illustrates hierarchical content-induced segmentation.

FIG. 12 is a block diagram of a networked client-server pair where some traffic between the client and the server is routed through a client-side transaction accelerator (“CTA”) and a server-side transaction accelerator (“STA”).

FIG. 13 illustrates a file system using hierarchical content-induced segmentation.

FIG. 14 illustrates a near-line file system (NLFS) and file server front end.

DETAILED DESCRIPTION OF THE INVENTION

The present invention has many applications, as will be apparent after reading this disclosure. In describing an embodiment of a compression system according to the present invention, only a few of the possible variations are described. Other applications and variations will be apparent to one of ordinary skill in the art, so the invention should not be construed as narrowly as the examples, but rather in accordance with the appended claims.

Coding (encoding, decoding) can be done by a number of different devices, which might involve hardware, software, or both. Coding can be done with a computer, computing device, peripheral, electronics, or the like, and/or using an application being executed or controlled by such element. Coding might be done incident to a transport process, such as that described in McCanne I. Using the coding apparatus and processes described herein, the responsiveness of transactions over a network can be improved by lessening the number of bits that need to be transmitted at critical times over critical channels. Also, the coding system may be integrated into a stand-alone storage system to optimize capacity. In this environment, the effective capacity of a storage system can be enhanced as the coding system described herein extracts common sequences of data and stores them just once, even if the sequence appears potentially many times, in potentially many different files.
FIG. 1 illustrates the particular inputs that an encoder might have. As shown there, an encoder has an input for receiving input data and inputs for parameters, such as target segment size, window size and other control parameters, as well as outputs for the output data and bindings generated in the encoding process, which are stored in a persistent segment store (PSS) 210 as needed. In operation, encoder 140 would process input data, identify segments of data, replace the segment data with a reference, provide the segment data and a segment reference to PSS 142 in the form of a binding and output the encoded data. The encoded output data might include unreferenced segment data, embedded bindings and/or reference labels for referenced segments. In the simplest case, the output data is entirely reference labels.
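
A minimal persistent segment store can be sketched as a mapping from reference labels to segment data. Deriving the label from a hash of the segment content is an assumption made here for concreteness; the description leaves the labeling scheme open:

    import hashlib

    # Sketch of a persistent segment store (PSS): bindings map a
    # reference label to the segment data it names.
    class SegmentStore:
        def __init__(self):
            self._bindings = {}                  # label -> segment bytes

        def bind(self, segment):
            # Store a <label, data> binding if absent; return the label.
            label = hashlib.sha1(segment).hexdigest()[:8]
            self._bindings.setdefault(label, segment)
            return label

        def lookup(self, label):
            return self._bindings[label]         # KeyError if unbound

    pss = SegmentStore()
    ref = pss.bind(b"some segment data")
    assert pss.lookup(ref) == b"some segment data"

With content-derived labels, identical segments bind to identical labels, so a repeated segment is stored just once however many times it appears.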

The target segment size parameter is used by the encoder to control the average size of a segment. In general, segments are variable length and there are design tradeoffs in selection of this size as described below.

FIG. 2 illustrates a decoder 300 and a PSS 310, which together perform decoding that is the inverse of the encoding done by encoder 200. As described above, the encoded data might comprise references, bindings and unreferenced residual data. When decoder 150 encounters a binding in received data, it can use the segment data in that binding to reconstruct the original data and it can also store the binding in its PSS. When decoder 150 encounters a reference without a binding, it can use the reference to obtain segment data from PSS 152 to reconstruct the segment. If the segment reference is not found in PSS 152, decoder 150 can send a request for the segment data.
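
That decode loop can be sketched as follows. The three item kinds and the fetch callback are framing assumptions made for this illustration; the description above does not fix a wire format:

    # Sketch of the decoder: bindings are remembered, references are
    # resolved from the PSS (fetching missing segments on demand), and
    # unreferenced residual data passes through unchanged.
    def decode(items, pss, fetch_missing):
        out = bytearray()
        for kind, payload in items:
            if kind == "binding":                # (label, segment data)
                label, data = payload
                pss[label] = data                # store and reconstruct
                out += data
            elif kind == "ref":                  # reference label only
                if payload not in pss:
                    pss[payload] = fetch_missing(payload)
                out += pss[payload]
            else:                                # unreferenced segment data
                out += payload
        return bytes(out)

    pss = {}
    stream = [("binding", ("L1", b"hello ")), ("ref", "L1"), ("raw", b"world")]
    print(decode(stream, pss, fetch_missing=lambda label: b""))
    # prints b'hello hello world'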

FIG. 3 is a diagram illustrating an encoding process in more detail. As shown there, input data to be encoded is stored in buffer 220, segmented by segment delimiter 222, which creates the output references and bindings for PSS 210. The output references, unreferenced segments, such as segment 224, and bindings as needed are provided to an output buffer 230 to form the output of the encoder.

FIG. 4 is a diagram illustrating a decoding process in more detail. As shown there, encoded data is buffered in an input buffer 320. From the buffer contents, bindings are extracted and stored in PSS 310. Reference labels are provided to a replacer 325 that replaces them with the referenced segment data and places the data in an output buffer 330. Unreferenced segment data is output directly to output buffer 330 for eventual output as the reconstructed data.

As the above-described figures illustrate, one aspect of the encoding process is the segmentation of input data. In a process for segmenting, identifying “cut points”, such as offsets in the input data where one segment ends and the next segment begins, is equivalent to segregating the input data into separate data structures, or the like.

If a segmentation scheme is designed appropriately, segment boundaries should always appear in the same place for the same sequences of data, regardless of the context in which that data appeared. If that were the case, commonly repeated patterns of data would be segmented in the same way over each repetition of data and a system can be designed to efficiently identify these repeated patterns. For example, a particular data sequence (such as pieces of a widely used GIF image or pieces of a bitmap representing a commonly used graphical icon) that appears in many different locations in larger files (such as a word processing document, slide presentation document, or on a Web page) will always be found and segmented in the same way, regardless of what data surrounds it.
To achieve this property, the segmentation scheme herein uses information in the data itself to guide the segmentation process rather than externally imposed parameters like block sizes, transaction boundaries, etc. As input data is consumed by the coding process, the various values and structure of the input symbols guide the segmentation process (as described below). Consequently, the system has a “self-synchronizing” property where if similar data patterns are presented to its input, the same segment boundaries are detected, inducing the formation of the same segments that have been seen in the past. Moreover, this means that the system is robust to insertions, deletions, or other changes in the underlying input data, in that once a previously present segment boundary is found, the new segment will match an existing segment from that point on (assuming the data pattern from that point on has been seen in the past).

Because this scheme is guided by the content of the input, it is described herein as “content-induced segmentation”. By applying content-induced segmentation to an input stream with repeated patterns across very large time scales (i.e., exhibiting LRD statistics), the same segments are created without having to keep track of the entire history of data patterns that have been seen in the past. That is, the segmentation process can simply segment the input based on the input itself without having to search over the existing data patterns already found. While, in general, this approach does not produce the optimal segmentation (i.e., maximizing the size of segments while simultaneously maximizing the number of repeated segments found), it provides a good tradeoff between complexity and performance—the scheme is effective at exploiting certain types of LRD while leading to an efficient and practical implementation.
Content-Induced Segmentation

A segmenter, such as segment delimiter 222 in FIG. 3, operates on a portion (or all) of a set of input data. The segmenter batches up a certain number of input symbols (determined by the current offset and the window size parameter) and computes a hash function over the window. This is referred to as the fingerprint of the window. A deterministic fingerprint indicator function returns a Boolean value for each fingerprint indicating whether the offset is to be considered a cut point and thereby defines a segment boundary. Preferably, the window does not extend beyond either end of an application data unit (ADU), so that each window contribution is from exactly one ADU. Thus, preferably, the encoder is provided an indication in an input data stream where one ADU ends and another one starts. However, such indication is not necessary for correct operation of the system. Rather, these indications allow the encoder to force a segment boundary in the stream where an ADU boundary ends and thus prevent practical inefficiencies such as extra delays waiting for data that might not ever arrive (e.g., relying upon an implementation time-out to force a segment boundary instead). Additionally, restricting segment boundaries to be within a single ADU prevents the system from detecting and storing segments that are unlikely to be repeated in the future, as may happen if a segment spans two ADU’s.

Where the fingerprint indicator function values are 1 and 0, 1 might represent a cut point and 0 represent other than a cut point. Thus, under one convention, when the function evaluates to 1 for a given fingerprint having a given offset and window, a new segment is created at the symbol defined by the first symbol of the input data (in order according to an input data order). Under that convention, if the function evaluates to 0, then one more input symbol is consumed and the window is advanced to cover this new symbol (and the least current symbol is removed from the window but remains in the current segment). The target segment size can thus be controlled as a parameter of the fingerprint indicator function.
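
Under that convention, the whole segmenter reduces to a short sketch. CRC-32 stands in here for the fingerprint function and a simple modulus for the indicator function; both are illustrative stand-ins, and a practical encoder would use a rolling fingerprint so that advancing the window by one symbol costs constant time:

    import random
    import zlib

    # Content-induced segmentation: cut wherever the fingerprint of the
    # window ending at an offset satisfies the indicator function, so cut
    # points depend only on local content, roughly one per `target` offsets.
    def segment(data, window=8, target=64):
        cuts = []
        for offset in range(window, len(data) + 1):
            fingerprint = zlib.crc32(data[offset - window:offset])
            if fingerprint % target == 0:        # indicator evaluates to 1
                cuts.append(offset)
        bounds = [0] + cuts + [len(data)]
        return [data[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]

    random.seed(0)
    data = bytes(random.randrange(256) for _ in range(4096))
    segments = segment(data)
    assert b"".join(segments) == data            # segmentation is lossless

    # Self-synchronization: an insertion near the front perturbs nearby
    # cut points only; later windows see the same content and realign.
    edited = data[:100] + b"X" + data[100:]
    print(segment(edited)[-3:] == segments[-3:]) # True in practice

Because the cut decision looks only at the window contents, identical data presented in different surroundings yields identical segments, which is the self-synchronizing property described above.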
Since a well-chosen fingerprint function will tend to create random bit patterns in the fingerprint, the Bernoulli distribution function of a random variable, parameterized by the target segment size, can be used to map the hash value into the indicator value. For example, the parameter can be chosen such that on average, assuming a random distribution of fingerprint inputs, 1 out of M times
