throbber
I lllll llllllll Ill lllll lllll lllll lllll lllll 111111111111111111111111111111111
`US008028106B2
`
`c12) United States Patent
`Bondurant et al.
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 8,028,106 B2
`Sep.27,2011
`
`(54) HARDWARE ACCELERATION OF
`COMMONALITY FACTORING WITH
`REMOVABLE MEDIA
`
`(75)
`
`Inventors: Matthew D. Bondurant, Superior, CO
`(US); Steven W. Scroggs, Louisville, CO
`(US)
`
`(73) Assignee: Proster Systems, Inc., Boulder, CO
`(US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term ofthis
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 254 days.
`
`(21) Appl. No.: 12/167,867
`
`(22) Filed:
`
`Jul. 3, 2008
`
`(65)
`
`Prior Publication Data
`
`US 2009/0013140Al
`
`Jan. 8, 2009
`
`Related U.S. Application Data
`
`(60) Provisional application No. 60/948,387, filed on Jul. 6,
`2007, provisional application No. 60/948,394, filed on
`Jul. 6, 2007.
`
`(51)
`
`Int. Cl.
`(2006.01)
`G06F 13112
`(52) U.S. Cl. ................. 710/68; 710/62; 710/65; 710/74
`(58) Field of Classification Search ........................ None
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`5,990,810 A
`11/ 1999 Williams
`6,704,730 B2
`3/2004 Moulton et al.
`6,810,398 B2
`10/2004 Moulton
`7,065,619 Bl
`612006 Zhu et al.
`7,197,189 B2
`3/2007 Adelmann
`7,403,451 B2
`7/2008 Goodman et al.
`
`7,533,323 B2
`2006/0059207 Al
`200910013129 Al
`
`512009 Alaimo et al.
`3/2006 Hirsch et al.
`1/2009 Bondurant
`
`OTHER PUBLICATIONS
`
`Broder, Andrei Z., "Some applications of Rabin's fingerprinting
`method", no date, pp. 1-10.
`Cox, Landon P. et al., "Pastiche: Making Backup Cheap and Easy",
`Department of Electrical Engineering and Computer Science, Univ.
`of Michigan, Ann Arbor, MI, Proceedings of the 5th Symposium on
`Operating Systems Design and Implementation, Boston, MA, Dec.
`9-11, 2002, 14 pages.
`
`(Continued)
`
`Primary Examiner - Eron J Sorrell
`(74) Attorney, Agent, or Firm - Kilpatrick Townsend &
`Stockton LLP
`
`ABSTRACT
`(57)
`Systems and methods for commonality factoring for storing
`data on removable storage media are described. The systems
`and methods allow for highly compressed data, e.g., data
`compressed using archiving or backup methods including
`de-duplication, to be stored in an efficient manner on portable
`memory devices such as removable storage cartridges. The
`methods include breaking data, e.g., data files for backup, into
`unique chunks and calculating identifiers, e.g., hash identifi(cid:173)
`ers, based on the unique chunks. Redundant chunks can be
`identified by calculating identifiers and comparing identifiers
`of other chunks to the identifiers of unique chunks previously
`calculated. When a redundant chunk is identified, a reference
`to the existing unique chunk is generated such that the chunk
`can be reconstituted in relation to other chunks in order to
`recreate the original data. The method further includes storing
`one or more of the unique chunks, the identifiers and/or the
`references on the removable storage medium. The accelera(cid:173)
`tion hardware and/or software can reside in multiple devices,
`depending on the embodiment. For example, hardware and/or
`software forthe chunking and/or hashing functions can reside
`in one or more of a host computer, a removable storage
`device, a removable cartridge holder and the removable stor(cid:173)
`age cartridge.
`
`20 Claims, 5 Drawing Sheets
`
`~200
`
`Storage -
`
`I 108
`
`Cartridge
`........................................
`
`,202
`_,,
`
`Chunking
`~
`Module
`
`[204
`
`-
`
`Hashing
`Module
`
`208
`
`206-1
`
`214
`212
`210
`208 -
`206-2
`
`~
`__ , Chunk/ID
`
`~
`
`Ref.
`Database Database
`..,...
`
`~ -r
`
`222 .)
`
`220
`
`CSCO-1010
`Page 1 of 14
`
`

`

`US 8,028,106 B2
`Page 2
`
`OTHER PUBLICATIONS
`
`Denehy, Timothy E. et al., "Duplicate Management for Reference
`Data", RJ 10305, Oct. 7, 2003, Computer Science, IBM Research
`Report, Duplicate Management for Reference Data, pp. 1-14.
`Douglis, Fred et al., "Application-specific Delta-encoding via
`Resemblance Detection", Mar. 31, 2003, 19 pages.
`Karp, Richard M. et al., Efficient randomized pattern-matching algo(cid:173)
`rithms, IBMJ. Res. Develop., vol. 31, No. 2, Mar. 1987, pp. 249-260.
`Korn, David G. et al., "Engineering a Differencing and Compression
`Data Format", AT&T Laboratories-Research, Proceedings of the
`2002 USENIX Annual Technical Conference, Monterey, CA, Jun.
`10-15, 2002, pp. 1-10.
`Kulkarni, Purushottam et al., "Redundancy Elimination Within
`Large Collections of Files", Proceedings of the General Track: 2004
`USENIX Annual Technical Conference, Boston, MA, Jun. 27-Jul. 2,
`2004, 14 pages.
`
`Moreton, Tim D. et al., "Storage, Mutability and Naming in Pasta",
`Univ. of Cambridge Computer Laboratory, Cambridge UK, no date,
`5 pages.
`Muthitacharoen, Athicha et al., "A Low-bandwidth Network File
`System", MIT Laboratory for Computer Science, Cambridge, MA
`02139, USA, no date, 2 pages.
`Policroniades, Calicrates et al., "Alternatives for Detecting Redun(cid:173)
`dancy in Storage Systems Data", Computer Laboratory, Cambridge
`University, Proceedings of the General Track: 2004 USENIX Annual
`Technical Conference, Boston, MA, Jun. 27-Jul. 2, 2004, 14 pages.
`Rabin, Michael 0., "Fingerprinting by Randon Polynomials",
`Department of Mathematics, The Hebrew Univ. of Jerusalem, no
`date, 14 pages.
`You, Lawrence L. et al., "Evaluation of Efficient Archival Storage
`Techniques", no date, pp. 1-6.
`U.S. Appl. No. 12/167,872, filed Jul. 3, 2008, Office Action mailed
`May 12, 2010, 17 pages.
`
`Page 2 of 14
`
`

`

`U.S. Patent
`
`Sep.27,2011
`
`Sheet 1of5
`
`US 8,028,106 B2
`
`f102
`.-------_,-M
`
`Host Computer
`
`0
`112l Interface Cable
`
`110""'\ .i-.--------1---------r
`
`"°Removable
`Cartridge
`Device
`
`Cart Holder
`
`Storage
`..__ ___ __..-t-,--'1 Cartridge
`106)
`
`(108
`.
`
`~100
`
`04
`
`~200
`
`r1oa
`
`FIG.1
`
`,204
`
`-
`
`,202
`
`206
`
`Chunking
`Module
`
`208 ~
`
`206-1
`
`Hashing
`Module
`....._ ____ _. 206-2
`
`214 -
`212
`210
`
`Storage
`Cartridge
`.............................................................. i:,
`--·····--····
`~
`
`~
`~ __.. r:: __ ...:_1
`Ref.
`--J Chunk/ID
`Database Database
`
`r
`
`208 -...
`,..../ ___ .... _ _,... __
`
`FIG.2
`
`220
`
`Page 3 of 14
`
`

`

`U.S. Patent
`
`Sep.27,2011
`
`Sheet 2 of 5
`
`US 8,028,106 B2
`
`"300
`
`202
`
`208
`
`Chunking
`Module
`
`Hashing
`Module
`
`208
`206-2
`
`./308
`..........................•. ': ............................. ,
`302
`304
`
`214
`
`306
`
`214
`
`208
`306-1
`
`Encryp.
`Module
`
`Comp.
`Module
`
`.
`"····································-·······················=
`FIG.3
`
`308
`
`Data
`Processing
`Module
`
`3oa-2
`
`204-1
`--------······-----······------·····------··---li.. ............................ .
`402
`404
`
`208
`
`Hash
`206-1 Calculator
`
`212-1
`
`210-1
`208-1
`206-3
`
`Searching
`Module
`
`214-1
`
`208-1
`
`206-4
`
`406
`
`Identifier
`Database
`···-··············-···· ......................................•.......................
`
`FIG.4
`
`Page 4 of 14
`
`

`

`U.S. Patent
`
`Sep.27,2011
`
`Sheet 3 of 5
`
`US 8,028,106 B2
`
`rio2
`___ _____ .J....,
`
`;V'500-1
`
`Host Computer
`
`Interface Cable
`
`510·11
`~-e--------------4-~------~---~
`Removable
`Cartridge
`Device
`
`504-1
`
`H
`
`Cart Holder
`
`Storage
`Cartridge
`
`-
`
`~ C/H
`506-1)
`502-1/o/.__ ..
`
`. (508-1
`
`FIG. SA
`
`Page 5 of 14
`
`

`

`U.S. Patent
`
`Sep.27,2011
`
`Sheet 4 of 5
`
`US 8,028,106 B2
`
`,tl'500-2
`
`Host Computer
`
`510-2 ...
`
`Interface Cable
`
`502-2
`
`,.. 504-2
`
`j l
`/
`1lr /V
`
`C/H I
`
`Removable
`Cartridge
`Device
`
`'Ir
`
`Cart Holder
`
`Storage
`Cartridge
`
`(508-2
`.
`
`FIG. SB
`
`,tl'500-3
`
`502-3
`
`510-3
`
`Interface Cable
`
`504-3
`
`Removable
`Cartridge
`Device
`
`Cart Holder
`
`Storage
`Cartridge ...,....._.__ ___ _.
`508-3
`
`506-3
`
`FIG.SC
`
`Page 6 of 14
`
`

`

`U.S. Patent
`
`Sep.27,2011
`
`Sheet 5 of 5
`
`US 8,028,106 B2
`
`"600
`
`Receive Data Stream
`
`604
`
`Breaks Data into a
`Chunk
`
`Calculate Hash
`Identifier
`
`608
`
`Store Identifier in
`Database
`
`NO
`
`YES
`
`616
`
`612
`
`Discard Redundant Data
`Chunk and Create a Reference
`
`Perform Additional Data Processing
`(Comp., Encrypt., & ECC)
`
`618
`
`614
`
`Store Reference on the
`,___--1 Removable Medium
`
`Store Unique Chunk, its Associated
`Identifier, on the Removable Medium
`
`FIG. 6
`
`Page 7 of 14
`
`

`

`US 8,028, 106 B2
`
`1
`HARDWARE ACCELERATION OF
`COMMONALITY FACTORING WITH
`REMOVABLE MEDIA
`
`This application claims the benefit of and is a non-provi(cid:173)
`sional of both co-pending U.S. Provisional Application Ser.
`No. 60/948,394 filed on Jul. 6, 2007; and U.S. Provisional
`Application Ser. No. 60/948,387 filed on Jul. 6, 2007, which
`are hereby expressly incorporated by reference in their
`entirety for all purposes.
`This application expressly incorporates by reference U.S.
`application Ser. No. 12/167,872, filed on even date herewith,
`entitled "Commonality Factoring For Removable Media", in
`its entirety for all purposes.
`
`BACKGROUND OF THE DISCLOSURE
`
`The present invention generally relates to data storage sys(cid:173)
`tems and, but not by way oflimitation, to data storage systems
`that store information on removable media.
`Conventional backup involves of a series of full, incremen(cid:173)
`tal or differential backups that saves multiple copies of iden(cid:173)
`tical or slowly changing data. This approach to backup leads
`to a high level of data redundancy.
`For years, there has been a considerable disparity between
`the prices of tape and disk-based storage systems with tape(cid:173)
`based storage being less expensive. Therefore, conventional
`data storage solutions have been tape based storage systems
`that compress data using conventional algorithms for an aver(cid:173)
`age compression ratio of about 2:1. Advantageously, tape(cid:173)
`based storage systems use removable tape cartridges that can
`be taken to off-site location for disaster recovery. However,
`the process of recovering data in a tape based storage system
`is slow, complex and unreliable.
`Data de-duplication, kuown as commonality factoring, is a
`process of reducing storage needs by eliminating redundant
`data. Data de-duplication is a disk-based data storage system
`that greatly reduces disk space requirements. However, disk(cid:173)
`based data storage systems including de-duplication methods
`are not easily exported to removable media. In order to export
`de-duplicated data to removable media, the de-duplicated
`data has to be first reformulated to its original form and then
`be recorded on removable tape cartridges, thereby, requiring
`more storage space than the de-duplicated version.
`Data de-duplication is a resource intensive process, which
`is implemented in software as part of the commonality fac(cid:173)
`toring solutions. Due to the intensive computational process,
`top of the line multi-core/multi-processor servers are used to
`provide adequate performance to perform the de-duplication
`process. The amount of performance gained by the use of
`multi-core/multi-processor servers depends on the algo(cid:173)
`rithms used and their implementation in software. However,
`the overall cost and power consumption of these multi-core/
`multi-processor servers are high.
`
`SUMMARY
`
`In various embodiments, systems and methods for com(cid:173)
`monality factoring for storing data on removable storage
`media are described. The systems and methods allow for
`highly compressed data, e.g., data compressed using
`archiving or backup methods including de-duplication, to be
`stored in an efficient manner on portable memory devices
`such as removable storage cartridges. The methods include
`breaking data, e.g., data files for backup, into unique chunks
`and calculating identifiers, e.g., hash identifiers, based on the
`unique chunks. Redundant chunks can be identified by cal-
`
`2
`culating identifiers and comparing identifiers of other chunks
`to the identifiers of unique chunks previously calculated.
`When a redundant chunk is identified, a reference to the
`existing unique chunk is generated such that the chunk can be
`reconstituted in relation to other chunks in order to recreate
`the original data. The method further includes storing one or
`more of the unique chunks, the identifiers and/or the refer(cid:173)
`ences on the removable storage medium.
`In some aspects, hardware and/or software can be used to
`10 accelerate the commonality factoring process. The accelera(cid:173)
`tion hardware and/or software can reside in multiple devices,
`depending on the embodiment. For example, hardware and/or
`software forthe chunking and/or hashing functions can reside
`in one or more of a host computer, a removable storage
`15 device, a removable cartridge holder (e.g., a socket) and the
`removable storage cartridge.
`In one embodiment, a system for commonality factoring
`for storing data with a removable storage cartridge is dis(cid:173)
`closed. The system includes a processor, an expansion bus
`20 coupled to the processor and a socket coupled to the expan(cid:173)
`sion bus. The socket is configured to accept the removable
`storage cartridge. An expansion module is removably
`coupled to the expansion bus. The expansion module is con(cid:173)
`figured to transfer data to the removable storage cartridge.
`25 The expansion module includes a chunking module and a
`hashing module. The chunking module is configured to break
`an original data stream into a number of chunks. The hashing
`module is coupled to the chunking module in a pipeline
`fashion such that at least a portion of input to the hashing
`30 module comprises output from the chunking module. The
`hashing module is configured to determine if each chunk is
`unique, and forward chunks determined to be unique toward
`the removable storage cartridge.
`In another embodiment, a method for commonality factor-
`35 ing for storing data with a removable storage cartridge is
`disclosed. In one step, at an expansion module removably
`coupled to a host computer, an original data stream is
`received. The expansion module includes a chunking module
`and a hashing module. The hashing module and the chunking
`40 module are configured in a pipeline architecture such that at
`least a portion of input to the hashing module includes output
`from the chunking module. At the chunking module, the
`original data stream is broken into a number of chunks. The
`chunks are forwarded toward the hashing module. The hash-
`45 ing module calculates an identifier for each forwarded chunk;
`storing the identifiers; and determines, based on the identifi(cid:173)
`ers, whether each chunk is unique. At least one of the unique
`chunks and the identifier is forwarded to the removable stor(cid:173)
`age cartridge. The removable storage cartridge includes a
`50 storage drive.
`In yet another embodiment, an expansion card for com(cid:173)
`monality factoring for storing data with a removable storage
`cartridge is disclosed. The expansion card includes a chunk(cid:173)
`ing module and a hashing module. The chunking module is
`55 configured to receive an original data stream from the host
`computer and break the original data stream into a plurality of
`chunks. The expansion card is configured to be removably
`coupled to a host computer and the removable storage car(cid:173)
`tridge and store data on the removable storage cartridge. The
`60 hashing module is coupled to the chunking module in a pipe(cid:173)
`line fashion such that at least a portion of input to the hashing
`module comprises output from the chunking module. The
`hashing module is configured to: receive the plurality of
`chunks from the chunking module; calculate an identifier for
`65 each of the received chunks; determine, based on the identi(cid:173)
`fiers, if each chunk is unique; and store the unique chunks on
`the removable storage cartridge.
`
`Page 8 of 14
`
`

`

`US 8,028, 106 B2
`
`3
`Further areas of applicability of the present disclosure will
`become apparent from the detailed description provided here(cid:173)
`inafter. It should be understood that the detailed description
`and specific examples, while indicating various embodi(cid:173)
`ments, are intended for purposes of illustration only and are
`not intended to necessarily limit the scope of the disclosure.
`
`BRIEF DESCRIPTION OF THE DRAWING
`
`FIG. 1 depicts a block diagram of an embodiment of a data
`storage system.
`FIG. 2 depicts a block diagram of an embodiment of a
`system for performing commonality factoring.
`FIG. 3 depicts a block diagram of an alternative embodi(cid:173)
`ment of a system for performing commonality factoring.
`FIG. 4 depicts a block diagram of an alternative embodi(cid:173)
`ment of a system for performing commonality factoring.
`FIGS. SA, SB, and SC illustrate schematic diagrams of
`alternative embodiments of data storage systems for perform(cid:173)
`ing commonality factoring.
`FIG. 6 illustrates a flowchart of an example of a process for
`storing data on a removable data cartridge.
`In the appended figures, similar components and/or fea(cid:173)
`tures may have the same reference label. Further, various
`components of the same type may be distinguished by fol(cid:173)
`lowing the reference label by a dash and a second label that
`distinguishes among the similar components. If only the first
`reference label is used in the specification, the description is
`applicable to any one the similar components having the same
`first reference label irrespective of the second reference label.
`
`DETAILED DESCRIPTION OF THE INVENTION
`
`The ensuing description provides preferred exemplary
`embodiment(s) only, and is not intended to limit the scope,
`applicability or configuration of the disclosure. Rather, the
`ensuing description of the preferred exemplary embodiment
`(s) will provide those skilled in the art with an enabling
`description for implementing a preferred exemplary embodi(cid:173)
`ments of the disclosure. It should be understood that various 40
`changes may be made in the function and arrangement of
`elements without departing from the spirit and scope of the
`invention as set forth in the appended claims.
`This disclosure relates in general to data storage systems
`used for data backup, restore and archive applications. It
`specifically relates to a new generation of removable storage
`cartridges housing a hard disk drive (HDD) as the storage
`medium. Throughout the specification, HDD may be used to
`describe the storage medium but it is to be understood that
`flash memory or a solid state disk (SSD) drive could be used
`in the alternative.
`Embodiments of the present invention are directed to a
`system for storing more data on a single storage cartridge than
`the use of the conventional Lempel-Ziv (LZ) compression
`methods would allow. This is achieved through implementa- 55
`ti on of commonality factoring (or de-duplication). In particu(cid:173)
`lar, the system according to the present invention accelerates
`the process so that the data reduction is performed at a rate
`competitive with a Linear Tape Open (LTO) tape drive with(cid:173)
`out requiring a high end server to perform the processing.
`According to one embodiment of the present invention,
`there is provided a system for accelerating commonality fac(cid:173)
`toring for storing data with a storage cartridge. The system
`includes a chunking module for breaking an original data
`stream into chunks. In the chunking module, pipelining and 65
`table lookups are used for optimization. The system also
`includes a hashing module for determining if each chunk is
`
`4
`unique or a duplicate of any of the previously stored chunks.
`The first byte of each chunk is processed by the hashing
`module before the last byte of the chunk has been processed
`by the chunking module to achieve parallelism.
`In this embodiment, the chunking module may comprise a
`section for Rabin fingerprinting or a section for performing a
`sliding window checksum. Further, in this embodiment, the
`hashing module may comprise one or more of a section for
`Message Digest Algorithm 5 (MD5) hashing, a section for
`10 Secure HashAlgorithm-1 (SHA-1) hashing and a section for
`Secure HashAlgorithm-2 (SHA-2) hashing.
`According to another embodiment of the present invention,
`there is provided another system for accelerating commonal(cid:173)
`ity factoring for storing data with a storage cartridge. The
`15 system includes the chunking module and the hashing mod(cid:173)
`ule as above, and further includes an additional data process(cid:173)
`ing module.
`In this embodiment, the additional data processing module
`may comprise one or more of a data compression module, an
`20 encryption module and an error correction coding (ECC)
`module. Furthermore, the data compression module may
`comprise a section for performing a Lempel-Ziv Stac (LZS)
`algorithm. Additionally, the encryption module may com(cid:173)
`prise a section for performing a Triple Data Encryption Stan-
`25 dard (3DES) algorithm, an Advanced Encryption Standard-
`128 (AES-128) algorithm or an Advanced Encryption
`Standard-256 (AES-256) algorithm.
`According to yet another embodiment of the present inven(cid:173)
`tion, there is provided yet another system for accelerating
`30 commonality factoring for storing data with a storage car(cid:173)
`tridge. The system includes the chunking module, the hashing
`module and the additional data processing module as above,
`and further includes a database search module followed by
`the additional data processing module for performing a
`35 search of the chunk database based on outputs from the hash(cid:173)
`ing module and passing only the unique chunks to the addi(cid:173)
`tional processing module. The objective is to reduce band(cid:173)
`width requirements for the additional data processing
`module.
`According to yet another embodiment of the present inven-
`tion, there is provided yet another system for accelerating
`commonality factoring for storing data with a storage car(cid:173)
`tridge. The system includes the chunking module and asso(cid:173)
`ciated modules, wherein multiple data paths in parallel are
`45 utilized. The objective is to further accelerate the commonal(cid:173)
`ity factoring process.
`In this embodiment, the multiple data paths may comprise
`a single data stream split across multiple instances by trun(cid:173)
`cating the data stream at locations that do not necessarily
`50 align with chunk boundaries as calculated by the chunking
`module, wherein the size of the truncated portions of the data
`stream is either fixed or variable.
`Referring first to FIG. 1, an embodiment of a data storage
`system 100 is shown. The data storage system 100 may
`include a host computer 102 and a removable drive bay 104.
`The host computer 102 includes a processor and an expansion
`bus. The expansion bus is coupled to the processor and is
`configured to transfer data to the drive bay 104 via standard
`interface. The removable drive bay 104 may include a remov-
`60 able cartridge device 110 and a removable cartridge holder
`106. The host computer 102 may be communicatively
`coupled with removable cartridge device 110. By way of
`example, the removable cartridge device 110 interface to the
`host computer 102 may be any version of Small Computer
`System interface (SCSI), a Fiber Channel (FC) interface, an
`Ethernet interface, an Advanced Technology Attachment
`(ATA) interface, or any other type of interface that allows the
`
`Page 9 of 14
`
`

`

`US 8,028, 106 B2
`
`5
`removable cartridge device 110 to communicate with the host
`computer 1 02.The cartridge holder 106 can be a plastic
`socket and can physically mount to a circuit board of the
`removable cartridge device 110. The cartridge holder 106
`may further include an eject and lock mechanism. A remov(cid:173)
`able storage cartridge 108 provides storage capability for the
`data storage system 100, wherein the storage cartridge 108 is
`removably coupled to the removable cartridge device 110.
`The portable storage cartridge 108 is also optionally locked in
`the cartridge holder 106. In an alternative embodiment, the
`host computer 102 may be communicatively coupled with
`cartridge holder 106 through an interface cable 112.
`As will be described further bellow in various embodi(cid:173)
`ments, the commonality factoring function may be imple(cid:173)
`mented as an expansion module in one or more of the follow(cid:173)
`ing locations: 1) in the storage cartridge 108, 2) in the
`removable cartridge device 110 and outside the cartridge
`holder 106, and 3) in the host computer 102.
`As explained above, the present invention identifies dupli- 20
`cate portions in an original data stream which have previously
`been stored, so that a reference to the data portion can be
`stored in place of the duplicate portion itself. There are sev(cid:173)
`eral steps for performing this process as follows: (1) a step of
`breaking the original data stream into small chunks (data 25
`portions) which can be analyzed for redundancy; (2) a step of
`calculating an identifier for each chunk; (3) a step of deter(cid:173)
`mining, by searching a database of the identifiers, if each
`chunk is unique in that the same chunk has not been found in
`the previous chunks; and ( 4) a step of organizing the unique 30
`chunks, identifiers and associated metadata so that the origi(cid:173)
`nal data stream can be regenerated. The original data stream
`can represent any form of data such as audio, video, textual
`and can be a plurality of files or objects.
`Steps (1) and (2) of the above process are more processor 35
`intensive, and appropriate to apply hardware acceleration to.
`Also, these steps may be combined with other data modifica(cid:173)
`tion steps such as conventional data compression and encryp(cid:173)
`tion as part of the overall data storage process. All of these
`steps are considered in terms of an integration to provide the 40
`maximum system throughput.
`Rabin Fingerprinting is a method of breaking the incoming
`data stream into smaller chunks of data which can be analyzed
`for redundancy. This method has tractable statistical proper(cid:173)
`ties that simpler methods such as a rolling checksum do not 45
`exhibit, but any chunking algorithm could be used in various
`embodiments. This method has been implemented in soft(cid:173)
`ware as part of the commonality factoring solutions, which
`may be done in order to accelerate time-to-market for these
`products at the expense of cost and/or performance. Top-of- 50
`the-line multi-core/multi-processor servers are used to pro(cid:173)
`vide adequate performance of the software algorithms.
`Instead of implementing this method in software, one
`embodiment of the present invention implements this method
`in a solution that uses hardware, which provides increased 55
`performance with lower cost and lower power dissipation.
`The detail ofhardware implementation of Rabin Fingerprint(cid:173)
`ing method in a pipeline fashion is described in U.S. Provi(cid:173)
`sional Patent Application Ser. No. 60/948,394, filed on Jul. 6,
`2007. By implementing the hardware in a pipelined fashion, 60
`high throughputs can be obtained at reasonable clock rates
`with minimal logic.
`Rabin fingerprinting is fundamentally an operation on
`polynomials a single-bit at a time in a data stream. Because
`most systems work well with data aligned to 8-bit byte bound- 65
`aries, the result of the polynomial operations is only relevant
`for every eighth bit. Since the intermediate calculations are
`
`6
`not considered, we can optimize the calculations by directly
`calculating the next fingerprint value 8-bits at a time.
`Rabin fingerprints are calculated on a sliding window of
`data, e.g., 48 bytes in a buffer array. For each calculation, the
`oldest byte in the array is replaced with the newest byte. The
`first pipeline stage replaces the oldest byte with the newest
`byte and performs a lookup based on the oldest byte which
`provides a value that can be used to remove the oldest byte's
`effect from the fingerprint. The next pipeline stage uses the
`10 input to remove the oldest data from the fingerprint and then
`combines the fingerprint with the new data using another
`table lookup to generate the new fingerprint. The final pipe(cid:173)
`line stage determines whether a portion of the new fingerprint
`matches a predetermined check value used for determining
`15 chunk boundaries and verifies that the chunk size fits within a
`minimum/maximum range.
`The output of the chunking step using either Rabin finger(cid:173)
`printing or simpler methods such as a sliding window check(cid:173)
`sum is a sequence of data called a chunk which can be ana(cid:173)
`lyzed to determine if it has previously been stored by the
`storage system. One way to efficiently determine whether the
`chunk has been previously stored is to compute a one way
`function on the data called a hash which allows determination
`to be made with very high statistical likelihood of whether the
`data is a duplicate of any of the previously stored data. Many
`hash algorithms are available for this purpose such as MD5,
`SHA-1 and the SHA-2 family. The goal is to select an algo(cid:173)
`rithm which has a statistically small enough chance of colli(cid:173)
`sions that it can be assumed that it will not produce false
`matches. The hash algorithm is resistant to intentional or
`malicious attempts to cause collisions. The hash algorithm
`should be secure; MD5 is not truly considered secure and
`SHA-1 has some potential vulnerabilities, but these vulner(cid:173)
`abilities may not apply to some applications. The type of hash
`algorithm may be chosen depending on the application. Fur(cid:173)
`ther, the use of multiple hash algorithms is possible in some
`embodiments.
`Referring next to FIG. 2, a block diagram of an embodi(cid:173)
`ment of a system 200 for performing commonality factoring
`is shown. The system 200 includes chunking module 202
`coupled directly to a hashing module 204. In this embodi(cid:173)
`ment, the chunking module 202 performs the step of breaking
`the original data stream 206 into small chunks using Rabin
`fingerprinting algorithm on a sliding window of data stream
`206. Other embodiments may use different methods and algo(cid:173)
`rithms such as sliding window checksum. Referring back to
`the FIG. 1, the original data stream 206 can be provided from
`different sources depending on the location of the expansion
`module in various embodiments as discussed below. For
`example, the host computer 102, the removable cartridge
`device 110, or the storage cartridge 108 can all forward the
`original data stream 206 to the chunking module 202 in vari(cid:173)
`ous embodiments.
`The chunking module 202 outputs a sequence of data bytes
`called chunks 206-1 along with an indication 208 whether a
`chunk boundary has been reached for each sequence of data
`bytes. The end of each sequence indication 208 is also
`referred to as an end-of-record or EOR. This allows the EOR
`208 and the data chunks 206-1 to be synchronized as they pass
`on to the hashing module 204. In this embodiment, the chunk(cid:173)
`ing module 202 is coupled to the hashing module 204 in a
`pipeline fashion such that at least a portion of the input to the
`hashing module 204 comprises output from the chunking
`module 202. In one embodiment, the hashing module 204
`processes a first byte of each chunk from the sequence of data
`bytes 206-1 before a last byte of the same chunk is processed
`by the chunking module 202. Other embodiments may obtain
`
`Page 10 of 14
`
`

`

`US 8,028, 106 B2
`
`7
`the complete chunk from the chunking module 202 and then
`run the chunk through the hashing module 204.
`The hashing module 204 performs steps of calculating an
`identifier for each chunk from the sequence of data bytes
`206-1, and then determining the uniqueness of the chunk. The
`determination step can be performed by storing the identifiers
`into a database and searching the database of identifiers to
`determine whether each chunk is unique. Where the chunk is
`found to be unique, the unique chunk and its identifier are
`stored in a chunk/ID database 220 on the removable storage
`cartridge 108. Table I shows an example of chunk/ID data(cid:173)
`base 220 where streams of unique chunks and their identifiers
`are stored on the removable storage cartridge 108.
`
`8
`ules, e.g., the length of unique chunks 210, and hash values
`212 can optionally be passed to the compression module 302
`and the encryption module 304. This will provide a synchro(cid:173)
`nized output for each module.
`A system like the one shown in FIG. 3 has the benefit of
`reducing bus traffic between the expansion module and the
`main system memory where the data is probably stored while
`not being processed. Table III indicates an example of the
`savings in the bus bandwidth whe

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket