`US008028106B2
`
(12) United States Patent
`Bondurant et al.
`
(10) Patent No.: US 8,028,106 B2
(45) Date of Patent: Sep. 27, 2011
`
`(54) HARDWARE ACCELERATION OF
`COMMONALITY FACTORING WITH
`REMOVABLE MEDIA
`
(75) Inventors: Matthew D. Bondurant, Superior, CO
(US); Steven W. Scroggs, Louisville, CO
(US)
`
`(73) Assignee: Proster Systems, Inc., Boulder, CO
`(US)
`
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 254 days.
`
`(21) Appl. No.: 12/167,867
`
(22) Filed: Jul. 3, 2008

(65) Prior Publication Data

US 2009/0013140 A1  Jan. 8, 2009
`
`Related U.S. Application Data
`
`(60) Provisional application No. 60/948,387, filed on Jul. 6,
`2007, provisional application No. 60/948,394, filed on
`Jul. 6, 2007.
`
(51) Int. Cl.
G06F 13/12 (2006.01)
`(52) U.S. Cl. ................. 710/68; 710/62; 710/65; 710/74
`(58) Field of Classification Search ........................ None
`See application file for complete search history.
`
(56) References Cited
`
U.S. PATENT DOCUMENTS

5,990,810 A   11/1999  Williams
6,704,730 B2   3/2004  Moulton et al.
6,810,398 B2  10/2004  Moulton
7,065,619 B1   6/2006  Zhu et al.
7,197,189 B2   3/2007  Adelmann
7,403,451 B2   7/2008  Goodman et al.
7,533,323 B2   5/2009  Alaimo et al.
2006/0059207 A1   3/2006  Hirsch et al.
2009/0013129 A1   1/2009  Bondurant
`
`OTHER PUBLICATIONS
`
`Broder, Andrei Z., "Some applications of Rabin's fingerprinting
`method", no date, pp. 1-10.
`Cox, Landon P. et al., "Pastiche: Making Backup Cheap and Easy",
`Department of Electrical Engineering and Computer Science, Univ.
`of Michigan, Ann Arbor, MI, Proceedings of the 5th Symposium on
`Operating Systems Design and Implementation, Boston, MA, Dec.
`9-11, 2002, 14 pages.
`
`(Continued)
`
`Primary Examiner - Eron J Sorrell
`(74) Attorney, Agent, or Firm - Kilpatrick Townsend &
`Stockton LLP
`
(57) ABSTRACT
Systems and methods for commonality factoring for storing
data on removable storage media are described. The systems
and methods allow for highly compressed data, e.g., data
compressed using archiving or backup methods including
de-duplication, to be stored in an efficient manner on portable
memory devices such as removable storage cartridges. The
methods include breaking data, e.g., data files for backup, into
unique chunks and calculating identifiers, e.g., hash identifiers,
based on the unique chunks. Redundant chunks can be
identified by calculating identifiers and comparing identifiers
of other chunks to the identifiers of unique chunks previously
calculated. When a redundant chunk is identified, a reference
to the existing unique chunk is generated such that the chunk
can be reconstituted in relation to other chunks in order to
recreate the original data. The method further includes storing
one or more of the unique chunks, the identifiers and/or the
references on the removable storage medium. The acceleration
hardware and/or software can reside in multiple devices,
depending on the embodiment. For example, hardware and/or
software for the chunking and/or hashing functions can reside
in one or more of a host computer, a removable storage
device, a removable cartridge holder and the removable storage
cartridge.
`
`20 Claims, 5 Drawing Sheets
`
[Front-page drawing: FIG. 2 — system 200 with a Chunking Module 202 pipelined to a Hashing Module 204, feeding a Storage Cartridge 108 that holds a Chunk/ID Database 220 and a Ref. Database 222.]

CSCO-1010
Page 1 of 14
`
`
`
`
`OTHER PUBLICATIONS
`
`Denehy, Timothy E. et al., "Duplicate Management for Reference
`Data", RJ 10305, Oct. 7, 2003, Computer Science, IBM Research
`Report, Duplicate Management for Reference Data, pp. 1-14.
`Douglis, Fred et al., "Application-specific Delta-encoding via
`Resemblance Detection", Mar. 31, 2003, 19 pages.
Karp, Richard M. et al., "Efficient randomized pattern-matching algorithms",
IBM J. Res. Develop., vol. 31, No. 2, Mar. 1987, pp. 249-260.
`Korn, David G. et al., "Engineering a Differencing and Compression
`Data Format", AT&T Laboratories-Research, Proceedings of the
`2002 USENIX Annual Technical Conference, Monterey, CA, Jun.
`10-15, 2002, pp. 1-10.
`Kulkarni, Purushottam et al., "Redundancy Elimination Within
`Large Collections of Files", Proceedings of the General Track: 2004
`USENIX Annual Technical Conference, Boston, MA, Jun. 27-Jul. 2,
`2004, 14 pages.
`
`Moreton, Tim D. et al., "Storage, Mutability and Naming in Pasta",
`Univ. of Cambridge Computer Laboratory, Cambridge UK, no date,
`5 pages.
`Muthitacharoen, Athicha et al., "A Low-bandwidth Network File
`System", MIT Laboratory for Computer Science, Cambridge, MA
`02139, USA, no date, 2 pages.
Policroniades, Calicrates et al., "Alternatives for Detecting Redundancy
in Storage Systems Data", Computer Laboratory, Cambridge
University, Proceedings of the General Track: 2004 USENIX Annual
Technical Conference, Boston, MA, Jun. 27-Jul. 2, 2004, 14 pages.
Rabin, Michael O., "Fingerprinting by Random Polynomials",
`Department of Mathematics, The Hebrew Univ. of Jerusalem, no
`date, 14 pages.
`You, Lawrence L. et al., "Evaluation of Efficient Archival Storage
`Techniques", no date, pp. 1-6.
`U.S. Appl. No. 12/167,872, filed Jul. 3, 2008, Office Action mailed
`May 12, 2010, 17 pages.
`
`
`
`
[Sheet 1 of 5]

FIG. 1 — data storage system 100: Host Computer 102 connected by Interface Cable 112 to a Removable Cartridge Device 110 containing a Cartridge Holder 106 that accepts a Storage Cartridge 108.

FIG. 2 — system 200: Chunking Module 202 pipelined to Hashing Module 204 (signals 206, 208, 210, 212, 214), feeding a Storage Cartridge 108 holding a Chunk/ID Database 220 and a Ref. Database.
`
`
`
[Sheet 2 of 5]

FIG. 3 — system 300: Chunking Module 202 and Hashing Module 204 followed by a Data Processing Module 308 containing a Compression Module 302 and an Encryption Module 304 (signals 206, 208, 214, 306).

FIG. 4 — detail of hashing module 204-1: Hash Calculator 402, Searching Module 404 and Identifier Database 406 (signals 206, 208, 210, 212, 214).
`
`
`
[Sheet 3 of 5]

FIG. 5A — data storage system 500-1: commonality factoring hardware (C/H) 502-1 located in the Host Computer 102, connected by Interface Cable 510-1 to Removable Cartridge Device 504-1 with Cartridge Holder 506-1 and Storage Cartridge 508-1.
`
`
`
[Sheet 4 of 5]

FIG. 5B — data storage system 500-2: an alternative placement of the C/H module 502-2 (Host Computer, Interface Cable 510-2, Removable Cartridge Device 504-2, Cartridge Holder, Storage Cartridge 508-2).

FIG. 5C — data storage system 500-3: another alternative placement of the C/H module 502-3 (Interface Cable 510-3, Removable Cartridge Device 504-3, Cartridge Holder 506-3, Storage Cartridge 508-3).
`
`
`
[Sheet 5 of 5]

FIG. 6 — flowchart of process 600: Receive Data Stream; Break Data into a Chunk (604); Calculate Hash Identifier; Store Identifier in Database (608); a decision on uniqueness then leads either to Discard Redundant Data Chunk and Create a Reference (612) and Store Reference on the Removable Medium (614), or to Perform Additional Data Processing (Comp., Encrypt., & ECC) (616) and Store Unique Chunk, its Associated Identifier, on the Removable Medium (618).
`
`
`
`HARDWARE ACCELERATION OF
`COMMONALITY FACTORING WITH
`REMOVABLE MEDIA
`
This application claims the benefit of and is a non-provisional
of both co-pending U.S. Provisional Application Ser.
No. 60/948,394 filed on Jul. 6, 2007; and U.S. Provisional
Application Ser. No. 60/948,387 filed on Jul. 6, 2007, which
are hereby expressly incorporated by reference in their
entirety for all purposes.
This application expressly incorporates by reference U.S.
application Ser. No. 12/167,872, filed on even date herewith,
entitled "Commonality Factoring For Removable Media", in
its entirety for all purposes.
`
`BACKGROUND OF THE DISCLOSURE
`
The present invention generally relates to data storage systems
and, but not by way of limitation, to data storage systems
that store information on removable media.
Conventional backup involves a series of full, incremental
or differential backups that saves multiple copies of identical
or slowly changing data. This approach to backup leads
to a high level of data redundancy.
For years, there has been a considerable disparity between
the prices of tape and disk-based storage systems, with tape-based
storage being less expensive. Therefore, conventional
data storage solutions have been tape-based storage systems
that compress data using conventional algorithms for an average
compression ratio of about 2:1. Advantageously, tape-based
storage systems use removable tape cartridges that can
be taken to an off-site location for disaster recovery. However,
the process of recovering data in a tape-based storage system
is slow, complex and unreliable.
Data de-duplication, also known as commonality factoring, is a
process of reducing storage needs by eliminating redundant
data. Data de-duplication is a disk-based data storage technique
that greatly reduces disk space requirements. However, disk-based
data storage systems including de-duplication methods
are not easily exported to removable media. In order to export
de-duplicated data to removable media, the de-duplicated
data first has to be reformulated to its original form and then
recorded on removable tape cartridges, thereby requiring
more storage space than the de-duplicated version.
Data de-duplication is a resource-intensive process, which
is implemented in software as part of commonality factoring
solutions. Due to the intensive computational process,
top-of-the-line multi-core/multi-processor servers are used to
provide adequate performance for the de-duplication
process. The amount of performance gained by the use of
multi-core/multi-processor servers depends on the algorithms
used and their implementation in software. However,
the overall cost and power consumption of these multi-core/
multi-processor servers are high.
`
`SUMMARY
`
In various embodiments, systems and methods for commonality
factoring for storing data on removable storage
media are described. The systems and methods allow for
highly compressed data, e.g., data compressed using
archiving or backup methods including de-duplication, to be
stored in an efficient manner on portable memory devices
such as removable storage cartridges. The methods include
breaking data, e.g., data files for backup, into unique chunks
and calculating identifiers, e.g., hash identifiers, based on the
unique chunks. Redundant chunks can be identified by calculating
identifiers and comparing identifiers of other chunks
to the identifiers of unique chunks previously calculated.
When a redundant chunk is identified, a reference to the
existing unique chunk is generated such that the chunk can be
reconstituted in relation to other chunks in order to recreate
the original data. The method further includes storing one or
more of the unique chunks, the identifiers and/or the references
on the removable storage medium.
In some aspects, hardware and/or software can be used to
accelerate the commonality factoring process. The acceleration
hardware and/or software can reside in multiple devices,
depending on the embodiment. For example, hardware and/or
software for the chunking and/or hashing functions can reside
in one or more of a host computer, a removable storage
device, a removable cartridge holder (e.g., a socket) and the
removable storage cartridge.
In one embodiment, a system for commonality factoring
for storing data with a removable storage cartridge is disclosed.
The system includes a processor, an expansion bus
coupled to the processor and a socket coupled to the expansion
bus. The socket is configured to accept the removable
storage cartridge. An expansion module is removably
coupled to the expansion bus. The expansion module is configured
to transfer data to the removable storage cartridge.
The expansion module includes a chunking module and a
hashing module. The chunking module is configured to break
an original data stream into a number of chunks. The hashing
module is coupled to the chunking module in a pipeline
fashion such that at least a portion of input to the hashing
module comprises output from the chunking module. The
hashing module is configured to determine if each chunk is
unique, and forward chunks determined to be unique toward
the removable storage cartridge.
In another embodiment, a method for commonality factoring
for storing data with a removable storage cartridge is
disclosed. In one step, at an expansion module removably
coupled to a host computer, an original data stream is
received. The expansion module includes a chunking module
and a hashing module. The hashing module and the chunking
module are configured in a pipeline architecture such that at
least a portion of input to the hashing module includes output
from the chunking module. At the chunking module, the
original data stream is broken into a number of chunks. The
chunks are forwarded toward the hashing module. The hashing
module calculates an identifier for each forwarded chunk,
stores the identifiers, and determines, based on the identifiers,
whether each chunk is unique. At least one of the unique
chunks and the identifier is forwarded to the removable storage
cartridge. The removable storage cartridge includes a
storage drive.
In yet another embodiment, an expansion card for commonality
factoring for storing data with a removable storage
cartridge is disclosed. The expansion card includes a chunking
module and a hashing module. The expansion card is
configured to be removably coupled to a host computer and
the removable storage cartridge and to store data on the
removable storage cartridge. The chunking module is configured
to receive an original data stream from the host computer
and break the original data stream into a plurality of chunks.
The hashing module is coupled to the chunking module in a
pipeline fashion such that at least a portion of input to the
hashing module comprises output from the chunking module.
The hashing module is configured to: receive the plurality of
chunks from the chunking module; calculate an identifier for
each of the received chunks; determine, based on the identifiers,
if each chunk is unique; and store the unique chunks on
the removable storage cartridge.
`
`
`
`
Further areas of applicability of the present disclosure will
become apparent from the detailed description provided hereinafter.
It should be understood that the detailed description
and specific examples, while indicating various embodiments,
are intended for purposes of illustration only and are
not intended to necessarily limit the scope of the disclosure.
`
BRIEF DESCRIPTION OF THE DRAWINGS
`
FIG. 1 depicts a block diagram of an embodiment of a data
storage system.
FIG. 2 depicts a block diagram of an embodiment of a
system for performing commonality factoring.
FIG. 3 depicts a block diagram of an alternative embodiment
of a system for performing commonality factoring.
FIG. 4 depicts a block diagram of an alternative embodiment
of a system for performing commonality factoring.
FIGS. 5A, 5B, and 5C illustrate schematic diagrams of
alternative embodiments of data storage systems for performing
commonality factoring.
FIG. 6 illustrates a flowchart of an example of a process for
storing data on a removable data cartridge.
In the appended figures, similar components and/or features
may have the same reference label. Further, various
components of the same type may be distinguished by following
the reference label by a dash and a second label that
distinguishes among the similar components. If only the first
reference label is used in the specification, the description is
applicable to any one of the similar components having the same
first reference label irrespective of the second reference label.
`
`DETAILED DESCRIPTION OF THE INVENTION
`
The ensuing description provides preferred exemplary
embodiment(s) only, and is not intended to limit the scope,
applicability or configuration of the disclosure. Rather, the
ensuing description of the preferred exemplary embodiment(s)
will provide those skilled in the art with an enabling
description for implementing a preferred exemplary embodiment
of the disclosure. It should be understood that various
changes may be made in the function and arrangement of
elements without departing from the spirit and scope of the
invention as set forth in the appended claims.
This disclosure relates in general to data storage systems
used for data backup, restore and archive applications. It
specifically relates to a new generation of removable storage
cartridges housing a hard disk drive (HDD) as the storage
medium. Throughout the specification, HDD may be used to
describe the storage medium, but it is to be understood that
flash memory or a solid state disk (SSD) drive could be used
in the alternative.
Embodiments of the present invention are directed to a
system for storing more data on a single storage cartridge than
the use of conventional Lempel-Ziv (LZ) compression
methods would allow. This is achieved through implementation
of commonality factoring (or de-duplication). In particular,
the system according to the present invention accelerates
the process so that the data reduction is performed at a rate
competitive with a Linear Tape Open (LTO) tape drive without
requiring a high-end server to perform the processing.
According to one embodiment of the present invention,
there is provided a system for accelerating commonality factoring
for storing data with a storage cartridge. The system
includes a chunking module for breaking an original data
stream into chunks. In the chunking module, pipelining and
table lookups are used for optimization. The system also
includes a hashing module for determining if each chunk is
unique or a duplicate of any of the previously stored chunks.
The first byte of each chunk is processed by the hashing
module before the last byte of the chunk has been processed
by the chunking module to achieve parallelism.
In this embodiment, the chunking module may comprise a
section for Rabin fingerprinting or a section for performing a
sliding window checksum. Further, in this embodiment, the
hashing module may comprise one or more of a section for
Message Digest Algorithm 5 (MD5) hashing, a section for
Secure Hash Algorithm-1 (SHA-1) hashing and a section for
Secure Hash Algorithm-2 (SHA-2) hashing.
According to another embodiment of the present invention,
there is provided another system for accelerating commonality
factoring for storing data with a storage cartridge. The
system includes the chunking module and the hashing module
as above, and further includes an additional data processing
module.
In this embodiment, the additional data processing module
may comprise one or more of a data compression module, an
encryption module and an error correction coding (ECC)
module. Furthermore, the data compression module may
comprise a section for performing a Lempel-Ziv Stac (LZS)
algorithm. Additionally, the encryption module may comprise
a section for performing a Triple Data Encryption Standard
(3DES) algorithm, an Advanced Encryption Standard-128
(AES-128) algorithm or an Advanced Encryption
Standard-256 (AES-256) algorithm.
According to yet another embodiment of the present invention,
there is provided yet another system for accelerating
commonality factoring for storing data with a storage cartridge.
The system includes the chunking module, the hashing
module and the additional data processing module as above,
and further includes a database search module followed by
the additional data processing module for performing a
search of the chunk database based on outputs from the hashing
module and passing only the unique chunks to the additional
processing module. The objective is to reduce bandwidth
requirements for the additional data processing
module.
According to yet another embodiment of the present invention,
there is provided yet another system for accelerating
commonality factoring for storing data with a storage cartridge.
The system includes the chunking module and associated
modules, wherein multiple data paths in parallel are
utilized. The objective is to further accelerate the commonality
factoring process.
In this embodiment, the multiple data paths may comprise
a single data stream split across multiple instances by truncating
the data stream at locations that do not necessarily
align with chunk boundaries as calculated by the chunking
module, wherein the size of the truncated portions of the data
stream is either fixed or variable.
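The split described in this embodiment can be sketched in software. The following Python sketch is illustrative only, not the patented hardware design: the segment size and worker count are assumed values, and a trivial fixed-size chunker stands in for the content-defined chunking module. Each fixed-size segment, whose edges need not align with chunk boundaries, is chunked by a separate instance in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

def fixed_chunks(segment, size=4096):
    """Stand-in chunker: fixed-size chunks. A content-defined chunker
    (e.g., Rabin fingerprinting) could be substituted here."""
    return [segment[i:i + size] for i in range(0, len(segment), size)]

def parallel_chunk(data, instances=4, segment_size=1 << 20, chunker=fixed_chunks):
    """Truncate the stream into fixed-size segments, run the chunker on
    each segment in parallel, and re-concatenate the chunk lists in
    stream order so the original data remains reconstructible."""
    segments = [data[i:i + segment_size]
                for i in range(0, len(data), segment_size)]
    with ThreadPoolExecutor(max_workers=instances) as pool:
        results = pool.map(chunker, segments)   # preserves segment order
    return [chunk for chunks in results for chunk in chunks]
```

Because `map` preserves segment order, concatenating all chunks always reproduces the original stream, regardless of how the work was distributed.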
Referring first to FIG. 1, an embodiment of a data storage
system 100 is shown. The data storage system 100 may
include a host computer 102 and a removable drive bay 104.
The host computer 102 includes a processor and an expansion
bus. The expansion bus is coupled to the processor and is
configured to transfer data to the drive bay 104 via a standard
interface. The removable drive bay 104 may include a removable
cartridge device 110 and a removable cartridge holder
106. The host computer 102 may be communicatively
coupled with the removable cartridge device 110. By way of
example, the removable cartridge device 110 interface to the
host computer 102 may be any version of Small Computer
System Interface (SCSI), a Fiber Channel (FC) interface, an
Ethernet interface, an Advanced Technology Attachment
(ATA) interface, or any other type of interface that allows the
`
removable cartridge device 110 to communicate with the host
computer 102. The cartridge holder 106 can be a plastic
socket and can physically mount to a circuit board of the
removable cartridge device 110. The cartridge holder 106
may further include an eject and lock mechanism. A removable
storage cartridge 108 provides storage capability for the
data storage system 100, wherein the storage cartridge 108 is
removably coupled to the removable cartridge device 110.
The portable storage cartridge 108 is also optionally locked in
the cartridge holder 106. In an alternative embodiment, the
host computer 102 may be communicatively coupled with the
cartridge holder 106 through an interface cable 112.
As will be described further below in various embodiments,
the commonality factoring function may be implemented
as an expansion module in one or more of the following
locations: 1) in the storage cartridge 108, 2) in the
removable cartridge device 110 and outside the cartridge
holder 106, and 3) in the host computer 102.
As explained above, the present invention identifies duplicate
portions in an original data stream which have previously
been stored, so that a reference to the data portion can be
stored in place of the duplicate portion itself. There are several
steps for performing this process as follows: (1) a step of
breaking the original data stream into small chunks (data
portions) which can be analyzed for redundancy; (2) a step of
calculating an identifier for each chunk; (3) a step of determining,
by searching a database of the identifiers, if each
chunk is unique in that the same chunk has not been found in
the previous chunks; and (4) a step of organizing the unique
chunks, identifiers and associated metadata so that the original
data stream can be regenerated. The original data stream
can represent any form of data, such as audio, video or text,
and can be a plurality of files or objects.
Steps (1) and (2) of the above process are the most processor
intensive and are appropriate targets for hardware acceleration.
Also, these steps may be combined with other data modification
steps such as conventional data compression and encryption
as part of the overall data storage process. All of these
steps are considered in terms of an integration to provide the
maximum system throughput.
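The four steps above can be sketched in a few lines of Python. This is a minimal software illustration of the process, not the accelerated hardware implementation; SHA-1 is used here as one of the identifier algorithms named later in this description, and the chunks are assumed to come from a separate chunking step.

```python
import hashlib

def commonality_factor(chunks):
    """Steps (2)-(4): hash each chunk, search the identifier database,
    keep unique chunks, and record references for duplicates."""
    identifier_db = {}   # identifier -> index of the stored unique chunk
    unique_chunks = []   # unique chunk data to be written to the medium
    recipe = []          # per-chunk metadata: ("unique" | "ref", index)
    for chunk in chunks:
        ident = hashlib.sha1(chunk).hexdigest()        # step (2)
        if ident in identifier_db:                     # step (3)
            recipe.append(("ref", identifier_db[ident]))
        else:
            identifier_db[ident] = len(unique_chunks)
            unique_chunks.append(chunk)
            recipe.append(("unique", identifier_db[ident]))
    return unique_chunks, recipe

def regenerate(unique_chunks, recipe):
    """Step (4) in reverse: rebuild the original stream from the
    unique chunks and the recorded references."""
    return b"".join(unique_chunks[index] for _, index in recipe)
```

Round-tripping any chunk sequence through `commonality_factor` and `regenerate` reproduces the original data while storing each distinct chunk only once.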
Rabin fingerprinting is a method of breaking the incoming
data stream into smaller chunks of data which can be analyzed
for redundancy. This method has tractable statistical properties
that simpler methods such as a rolling checksum do not
exhibit, but any chunking algorithm could be used in various
embodiments. This method has been implemented in software
as part of commonality factoring solutions, which
may be done in order to accelerate time-to-market for these
products at the expense of cost and/or performance. Top-of-the-line
multi-core/multi-processor servers are used to provide
adequate performance of the software algorithms.
Instead of implementing this method in software, one
embodiment of the present invention implements this method
in a solution that uses hardware, which provides increased
performance with lower cost and lower power dissipation.
The details of a hardware implementation of the Rabin fingerprinting
method in a pipeline fashion are described in U.S. Provisional
Patent Application Ser. No. 60/948,394, filed on Jul. 6,
2007. By implementing the hardware in a pipelined fashion,
high throughputs can be obtained at reasonable clock rates
with minimal logic.
Rabin fingerprinting is fundamentally an operation on
polynomials, one bit at a time, over a data stream. Because
most systems work well with data aligned to 8-bit byte boundaries,
the result of the polynomial operations is only relevant
for every eighth bit. Since the intermediate calculations are
not considered, we can optimize the calculations by directly
calculating the next fingerprint value 8 bits at a time.
Rabin fingerprints are calculated on a sliding window of
data, e.g., 48 bytes in a buffer array. For each calculation, the
oldest byte in the array is replaced with the newest byte. The
first pipeline stage replaces the oldest byte with the newest
byte and performs a lookup based on the oldest byte, which
provides a value that can be used to remove the oldest byte's
effect from the fingerprint. The next pipeline stage uses the
input to remove the oldest data from the fingerprint and then
combines the fingerprint with the new data using another
table lookup to generate the new fingerprint. The final pipeline
stage determines whether a portion of the new fingerprint
matches a predetermined check value used for determining
chunk boundaries and verifies that the chunk size fits within a
minimum/maximum range.
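The three-stage structure above can be mimicked in software. The sketch below is illustrative rather than faithful: it uses a simple multiplicative rolling hash instead of true GF(2) polynomial arithmetic, and the window size, check mask and chunk-size limits are assumed values. What it does mirror is the pipeline: a per-byte lookup table (`POP`) that removes the oldest byte's effect, a combine step that folds in the new byte, and a boundary test against a check value with minimum/maximum size enforcement.

```python
WINDOW = 48                     # sliding window, as in the example above
BASE = 257
MOD = (1 << 61) - 1
CHECK_MASK = (1 << 12) - 1      # boundary check value (illustrative)
MIN_CHUNK, MAX_CHUNK = 1024, 65536

# Table lookup: contribution of a byte that is about to leave the window.
POP = [(b * pow(BASE, WINDOW - 1, MOD)) % MOD for b in range(256)]

def chunk_boundaries(data):
    """Yield end offsets of content-defined chunks in `data`."""
    fp = 0
    start = 0
    for i, byte in enumerate(data):
        # Stage 1: look up the oldest byte's contribution.
        oldest = data[i - WINDOW] if i >= WINDOW else 0
        # Stage 2: remove the oldest byte, then fold in the new byte.
        fp = ((fp - POP[oldest]) * BASE + byte) % MOD
        # Stage 3: boundary check plus min/max chunk-size enforcement.
        size = i - start + 1
        if (size >= MIN_CHUNK and (fp & CHECK_MASK) == CHECK_MASK) \
                or size >= MAX_CHUNK:
            yield i + 1
            start = i + 1
    if start < len(data):
        yield len(data)         # final partial chunk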
The output of the chunking step, using either Rabin fingerprinting
or simpler methods such as a sliding window checksum,
is a sequence of data called a chunk, which can be analyzed
to determine if it has previously been stored by the
storage system. One way to efficiently determine whether the
chunk has been previously stored is to compute a one-way
function on the data called a hash, which allows a determination
to be made, with very high statistical likelihood, of whether the
data is a duplicate of any of the previously stored data. Many
hash algorithms are available for this purpose, such as MD5,
SHA-1 and the SHA-2 family. The goal is to select an algorithm
which has a statistically small enough chance of collisions
that it can be assumed that it will not produce false
matches. Ideally, the hash algorithm is also resistant to intentional
or malicious attempts to cause collisions. The hash algorithm
should be secure; MD5 is no longer considered secure and
SHA-1 has some potential vulnerabilities, but these vulnerabilities
may not apply to some applications. The type of hash
algorithm may be chosen depending on the application. Further,
the use of multiple hash algorithms is possible in some
embodiments.
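As a sketch, a selectable-algorithm identifier calculation might look like the following. The `hashlib` names stand in for the hardware hashing sections described above, SHA-256 (from the SHA-2 family) is an assumed default, and the in-memory set is a toy stand-in for the identifier database.

```python
import hashlib

class HashingModule:
    """Toy model of the hashing step: compute an identifier per chunk
    with a selectable algorithm and track which identifiers were seen."""

    def __init__(self, algo="sha256"):
        self.algo = algo       # "md5", "sha1", "sha256", "sha512", ...
        self.seen = set()      # stand-in for the identifier database

    def identifier(self, chunk):
        """One-way function on the chunk data."""
        return hashlib.new(self.algo, chunk).digest()

    def is_unique(self, chunk):
        """True the first time a chunk's identifier is seen."""
        ident = self.identifier(chunk)
        if ident in self.seen:
            return False       # duplicate: store a reference instead
        self.seen.add(ident)
        return True            # unique: store chunk and identifier
```

Swapping `algo` changes the collision/security trade-off discussed above without touching the rest of the pipeline.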
Referring next to FIG. 2, a block diagram of an embodiment
of a system 200 for performing commonality factoring
is shown. The system 200 includes a chunking module 202
coupled directly to a hashing module 204. In this embodiment,
the chunking module 202 performs the step of breaking
the original data stream 206 into small chunks using a Rabin
fingerprinting algorithm on a sliding window of the data stream
206. Other embodiments may use different methods and algorithms,
such as a sliding window checksum. Referring back to
FIG. 1, the original data stream 206 can be provided from
different sources depending on the location of the expansion
module in various embodiments, as discussed below. For
example, the host computer 102, the removable cartridge
device 110, or the storage cartridge 108 can all forward the
original data stream 206 to the chunking module 202 in various
embodiments.
The chunking module 202 outputs a sequence of data bytes
called chunks 206-1 along with an indication 208 of whether a
chunk boundary has been reached for each sequence of data
bytes. The end of each sequence indication 208 is also
referred to as an end-of-record or EOR. This allows the EOR
208 and the data chunks 206-1 to be synchronized as they pass
on to the hashing module 204. In this embodiment, the chunking
module 202 is coupled to the hashing module 204 in a
pipeline fashion such that at least a portion of the input to the
hashing module 204 comprises output from the chunking
module 202. In one embodiment, the hashing module 204
processes a first byte of each chunk from the sequence of data
bytes 206-1 before a last byte of the same chunk is processed
by the chunking module 202. Other embodiments may obtain
`
`Page 10 of 14
`
`
`
`US 8,028, 106 B2
`
`7
`the complete chunk from the chunking module 202 and then
`run the chunk through the hashing module 204.
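In software terms, the byte-synchronized EOR handshake and the pipelined hand-off can be modeled with generators. This is an illustrative sketch only: chunk boundaries are supplied externally (e.g., from a chunking step), and SHA-1 is an assumed identifier algorithm.

```python
import hashlib

def chunking_module(data, boundaries):
    """Emit (byte, eor) pairs; eor=True marks the last byte of a chunk.
    `boundaries` are chunk end offsets, e.g. from a fingerprint chunker."""
    ends = set(boundaries)
    for i, byte in enumerate(data):
        yield byte, (i + 1) in ends

def hashing_module(stream):
    """Consume bytes as they arrive, updating the hash incrementally so
    the first byte of a chunk is hashed before the chunk is complete.
    Emit (chunk, identifier) at each end-of-record."""
    h = hashlib.sha1()
    buf = bytearray()
    for byte, eor in stream:
        h.update(bytes([byte]))
        buf.append(byte)
        if eor:
            yield bytes(buf), h.hexdigest()
            h = hashlib.sha1()
            buf.clear()
```

Because the hashing generator pulls bytes as the chunking generator produces them, the two stages overlap in the manner of the hardware pipeline, rather than waiting for a complete chunk.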
The hashing module 204 performs the steps of calculating an
identifier for each chunk from the sequence of data bytes
206-1, and then determining the uniqueness of the chunk. The
determination step can be performed by storing the identifiers
in a database and searching the database of identifiers to
determine whether each chunk is unique. Where the chunk is
found to be unique, the unique chunk and its identifier are
stored in a chunk/ID database 220 on the removable storage
cartridge 108. Table I shows an example of a chunk/ID database
220 where streams of unique chunks and their identifiers
are stored on the removable storage cartridge 108.
`
ules, e.g., the length of unique chunks 210, and hash values
212 can optionally be passed to the compression module 302
and the encryption module 304. This will provide a synchronized
output for each module.
A system like the one shown in FIG. 3 has the benefit of
reducing bus traffic between the expansion module and the
main system memory, where the data is probably stored while
not being processed. Table III indicates an example of the
savings in the bus bandwidth whe