`UNIFIED PATENTS
`
`EXHIBIT 1003
`
`Unfied Patents Exhibit 1003
`Pg. 1
`
`
`
`(12) United States Patent
`(10) Patent N0.:
`US 6,233,589 B1
`
`Balcha et al.
`(45) Date of Patent:
`May 15, 2001
`
`USOO6233589B1
`
`(54) METHOD AND SYSTEM FOR REFLECTING
`7
`DIFFERENCES BETWEEN T“ O FILES
`Inventors: Muralidhar Balcha, Shrewsbury, MA
`(US); Chandan Kudige, Karnalaka
`(IN)
`
`(75)
`
`(73) Assignee: Novell, Inc., Provo, UT (us)
`
`( * ) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U’S‘C’ 154(1)) by 0 days.
`
`(21) Appl. No.: 09/127,523
`.
`Jul. 31’ 1998
`Flled3
`(22)
`
`Int. Cl? ...................................................... G06F 17/30
`(1)
`(52) U.S.C1.
`..................
`707/203
`
`(
`) Field of Search
`707/203—10, 201,
`707/204, 217, 2, 6, 8, 101, 202, 205
`
`(56)
`‘
`’
`
`References Cited
`U.S. PATENT DOCUMENTS
`12/1995 Squibb ................................ 707/201
`11/1996 Morris
`
`. 707/1
`5/1997 Morris ..............................
`707/1
`
`707/204
`7/1998 Whiting etal.
`
`709/217
`.
`7/1999 Van Hoff et a1.
`4/2000 Waldin, Jr. et a1.
`...... 707/10
`
`55479554
`5,574.906
`5,634,052
`5,778,395 *
`5,919,247 *
`6,052,531
`
`OTHER PUBLICATIONS
`ICS 161: Design and Analysis of Algorithms Lecture Notes
`for Feb. 19, 1996, “Longest Common Sequences”, World
`Wide Web PUthfl‘im
`.,
`Longest Common Subsequence Problem , Pruhs, World
`Wide Web publication, Oct. 6, 1997,
`“Longest Common Subsequence”, World Wide Web publi—
`cation, 2 pp., author and publlcatlon date unknown,
`“EXCCUHVC Summary”, WOFId Wide WGb publication, 5 1113-,
`author and publication date UnkIIOWH.
`* Cited bV examiner
`
`Primary Examiner—Hosain T. Alam
`Assistant ExamineriSanjiv Shah
`(74) Attorney, Agent, or Firm—Dinsmore & Shohl LLP
`(57)
`ABSTRACT
`
`Amethod and system for reflecting differences between two
`files. The method includes generating a base signature file
`having a plurality of base bit patterns, each base bit pattern
`
`being generated as a function of a portion of data in a first
`file. A second file contaimng a plurallty of reV1sed bit
`patterns is generated from a second file. Each revised bit
`pattern IS compared to and matches at least one of the base
`bit patterns. A delta file reflecting the differences between a
`first file and the second file based on the base signature file,
`the delta signature file, and the second file is created.
`
`20 Claims, 4 Drawing Sheets
`
`72 ’
`
`94
`ADD CRC T0 REVISED
`
`STGNATJQE FTLE
`
`
`OPEN REVISED TILE
`{{T THE BLOCK SZE FROM THE EASE
`
`_
`SWTURE FIL; iNiTIALIZE REVTSED Sim FILE
`
`
`READ PEXT BLOCK
`
`
`
`
`
`CAI (llLATE CRC 0E TtE BLOCK
`
`
`YES
`
`ANV CFC TN BAE
`STCMTURE FTLE?
`
`
`
`
`
` DOES (RC MATCH
`
`
`
`
`92
`
`_
`56 J
`
`RED/[M v53 mom THE are
`COMPUTATION
`
`1
`ADD NEX' BYTE T0 at 010
`commnow
`mspncmrm H
`
`r55
`
`
`
`
`
`
`ADD DTSPLACFMENT OF THE BLOCK
`TO REVISED STGNATLRE FTLE
`
`
`
`DOES CRO MATCH
`
`
`13M CRC IN BASE
`SIGNAllRE FILE?
`
`
`
`Unfied Patents Exhibit 1003
`
`Pg. 2
`
`Unfied Patents Exhibit 1003
`Pg. 2
`
`
`
`US. Patent
`
`May 15, 2001
`
`Sheet 1 0f 4
`
`US 6,233,589 B1
`
`22
`
`24-
`
`LAN
`
`SERVER
`
`LAN
`
`SERVER
`
`FIG.1
`
`
`
`FIG. 2
`
`Unfied Patents Exhibit 1003
`
`Pg. 3
`
`Unfied Patents Exhibit 1003
`Pg. 3
`
`
`
`US. Patent
`
`May 15, 2001
`
`Sheet 2 0f 4
`
`US 6,233,589 B1
`
`58
`
`42
`
`FILE
`
`
`BASE
` SIGNATURE
`GENERATOR
`
`
` REVISED
`SIGNATLRE
`FILE
`
`
`
`OPEN BASE FILE.
`
`
`
`INITIALIZE BASE
`SIGNATLRE FILE.
`
`56
`
`
`
`
`
`
`
`60
`
`FIG. 4
`
`ADD CRC TO THE
`BASE SIGNATURE
`
`62
`
`
`
`CALCULATE CRC
`OF THE BLOCK
`
`Unfied Patents Exhibit 1003
`
`Pg. 4
`
`Unfied Patents Exhibit 1003
`Pg. 4
`
`
`
`US. Patent
`
`May 15, 2001
`
`Sheet 3 0f 4
`
`US 6,233,589 B1
`
`72
`
`OPEN REVISED FILE
`GET THE BLOCK SIZE FROM THE BASE
`
`SIGNATURE FILE. INITIALIZE REVISED SIGN FILE
`
`READ NEXT BLOCK
`
`74
`
`76
`
`® 61D 78
`
`N0
`
`CALCULATE CRC OF THE BLOCK
`
`50
`
`_
`
`YES
`
`82
`DOES CRC MATCH
`
`ANY CRC |N BASE
`SIGNATURE FILE?
`
`NO
`
`DISPLACEMENT = 0
`
`54
`
`94
`
`ADD CRC To REVISED
`
`SIGNATURE FILE
`
`
`
`56
`
`92
`
`REMOVE MSB FROM THE CRC
`COMPUTATION
`
`
`
`ADD NEXT BYTE TO THE CRC
`COMPUTATION
`DISPLACEMENT ++
`
`<93
`
`ADD DISPLACEMENT OF THE BLOCK
`TO REVISED SIGNATLRE FILE
`
`YES
`
`90
`
`NO
`
`SIGNATURE FILE?
`
`
` ANY CRC IN BASE
`DOES CRC MATCH
`
`
`FIG. 5
`
`Unfied Patents Exhibit 1003
`
`Pg. 5
`
`Unfied Patents Exhibit 1003
`Pg. 5
`
`
`
`US. Patent
`
`May 15, 2001
`
`Sheet 4 0f 4
`
`US 6,233,589 B1
`
`
`
`
`
`L1 : LIST BASE SIGNATURE CRCS
`L2: LIST OF REVISED SIGNATURE
`CRCS
`E : FIRST ELEMENT IN L1,
`E2; FIRST ELEMENT IN L2,
`|=J=0;E =E
`=NJLL
`l]
`21
`
`
`102
`
`
`
`
`
`100
`
`F l
`
`6' 6
`
`@ I
`
`
`
`
`
`
`
`
`
`
`E1=NEXT(E1)
`
`E2= NEXT(E2)
`
`l++, J++; E 11: E 11E21=E 2
`
`
`
`
`
`
`
`
`
`104
`YES
`
`
`DOES ElAND E2MACTH?
`
`106
`YES
`
`
`
`
`110
`
`
`PREVIE I=E
`YES
`11
`838‘
`INSIS, BASE(E1), SIBASEINEXTIE 21)),
`
`
`PREWE2“: E 21
`DISPLACEMENT(E2)]) TO DELTA FILE
`
`
`
`N0
`
`
`112
`
`
`
`PREV1E2) = E21
`
`&&
`
`
`
`
`
`
`
`YES
`
`DELIS, BASE(NEXT(E11]). BASE (E 1)
`-BASE(NEXT(E11)]) TO DELTA FILE
`
`116
`
`114
`
`&&
`
`PREV(E2) I = E 21
`
`YES
`
`
`
`DELIS, BASE(NEXT(E11]), BASE(E1)
`
`
`
`-BASE(NEXT(E11))) TO DELTA TILE.
`
`
`INSIS, BASE(NEXT(E 11)).
`
`SIBASEINEXTIE 21),
`BASE(E2 I-BASEINEXTIE21m TO DELTA FILE
`
`
`
`
`115
`
`N0
`
`NO
`
`YES
`
`NO
`
`122
`
`LCSLEN[I+1, J]<
`LCSLENII, J+1]
`
`l++;
`E1 =NEXT(E1)
`
`120
`
`1 24
`
`J++;
`E2=NEXT(E2)
`
`Unfied Patents Exhibit 1003
`
`Pg. 6
`
`Unfied Patents Exhibit 1003
`Pg. 6
`
`
`
`US 6,233,589 B1
`
`1
`METHOD AND SYSTEM FOR REFLECTING
`DIFFERENCES BETWEEN TWO FILES
`
`FIELD OF THE INVENTION
`
`The present invention relates generally to backup and
`synchronization of files, and in particular relates to a method
`and system for reflecting differences between two files.
`BACKGROUND OF THE INVENTION
`
`Copies of files are frequently transmitted over a network
`from one computer to another computer. One reason to copy
`a file is for backup purposes. If a file created on one
`computer has been backed up on another computer, it can be
`easily recovered in the event the hard drive of the first
`computer fails. Because the loss of a file could mean the loss
`of extremely important data, and/or result
`in significant
`effort to recreate the file, file backup processes are very
`common. However, file backup has at least two problems
`associated with it: first, it can require significant network
`bandwidth to transfer file data from the first computer to the
`backup computer, and second,
`it can require significant
`storage space to maintain copies of files. Both of these
`problems can be alleviated to some extent through the use of
`an incremental backup. An incremental backup copies only
`those files that have been changed since the previous ’
`backup. Incremental backups can significantly reduce the
`number of files that are backed up on a periodic basis.
`Typically, when a file is modified, only a small portion of
`the file is actually changed from the previous version. While
`an incremental backup can reduce network bandwidth and
`save storage space compared to a complete backup, it is still
`inefficient in that a complete file is backed 11p even though
`it is possible that only a small portion of the file was actually
`modified.
`In an attempt
`to improve upon incremental
`backups, backup processes exist that identify the dillerences
`between two versions of a file, and attempt to backup only
`those dilIerences. This is referred to as a dillerencing pro—
`cess. Differencing processes can reduce network bandwidth
`and storage requirements because only portions of the file
`are backed up.
`Copies of files are also frequently made for purposes of
`synchronization or replication. A synchronized file exists in
`two dilferent locations, such as on two dilferent servers, and
`changes made to one file must be reflected in the other file.
`Synchronization usually occurs by periodically copying the
`file from one location to the other location.
`
`10
`
`o;O
`
`35
`
`40
`
`45
`
`U.S. Pat. No. 5,634,052 discloses a system for reducing
`
`storage requirements in a backup subsystem. The system
`‘erences _
`includes creating a delta file reflecting the di
`between a base file and a modified version of the base file,
`and transmitting the delta file to a server [or backup pur—
`poses. One problem associated with this system is that the
`base file is necessary to create the delta file that re ects the
`differences between the base file and the revised file. Thus,
`if the delta file is to be created on another computer, such as
`the server, the base file must first be transmitted to tie server
`where the differencing operation is carried out. Moreover,
`the ’052 patent does not disc ose optimal mechanisms for
`creating the delta file.
`In a differencing backup system, the differencing mecha—
`nism used to create the delta ile can be quite important. It
`is not uncommon for files to be many megabytes in size. A
`differencing mechanism that processes a file multiple times,
`
`
`or processes a file in an ine icient manner can result in
`excessive backup times. Moreover, an inellicient dillerenc-
`ing mechanism can result in more data being backed up than
`
`
`
`
`
`60
`
`
`
`2
`necessary. In other words, two dilferencing mechanisms can
`vary in their ability to efficiently recognize and reflect
`dillerences between two files. Also, it would be preferable
`for a differencing mechanism to be able to determine dif-
`ferences between a base file and a modified version of the
`base file without actually having to repeatedly process the
`base file, so that the differencing operation can be performed
`on a remote computer, without the need to process the entire
`base file.
`
`US, Pat. No. 5,574,906 discloses a system and method
`for reducing storage requirements in backup subsystems.
`The ’906 patent discloses a system similar to that disclosed
`in the ‘052 patent, with the enhancement that the base file
`from which the differencing operation is derived can be
`compressed. In certain files, a compressed base file will
`utilize less bandwidth and less storage space on a computer
`than would an uncompressed based file. One problem with
`this approach is that
`the compressibility of files differs
`greatly. While compression can significantly reduce the size
`of some files, compression algorithms do not obtain signifi-
`
`cant reduction with other type of files. Additionally,
`the
`
`
`
`di ‘erencing mechanism of the ’906 patent works by first
`compressing the revised version of the file, and upon deter-
`mining that compressed portions of the base file and the
`revised file differ, both the base file and the revised file are
`uncompressed at
`those locations so that
`the differences
`between the two files can be determined. The overhead
`involved in such compression/decompression algorithms
`can be significant.
`US. Pat. No. 5,479,654 discloses an apparatus and
`method for generating a set of representations of the changes
`made in a computer file during a period of time. The process
`disclosed in the ’654 patent makes multiple passes through
`portions of the most recent version of the file to determine
`the differences between it and the previous version of the
`file.
`Thus, it is apparent that a differencing system that reduces
`network trallic, elliciently determines and rellects diller-
`ences between two files quickly, and reduces storage
`requirements would be highly desirable.
`SUMMARY OF TIIE INVENTION
`
`It is one object of the present invention to provide a
`method and system for determining the differences between
`a base file and a modified file without the need for a copy of
`the base file.
`
`It is another object of the present invention to provide a
`method and system for reflecting the differences between
`two files that is highly efficient and reduces network traffic.
`It is yet another object of the present invention to provide
`a method and system that can determine the differences
`between a base file and a modified file either on a client
`computer or a server computer without a need for a copy of
`the base file.
`It is still another object of the present invention to provide
`a method and system for creating a signature of a base file
`that can be used to determine the differences between a base
`file and a modified file.
`Additional objects, advantages and other novel features of
`the invention will be set forth in part in the description that
`follows and, in part, will become apparent to those skilled in
`the art upon examination of the invention. To achieve the
`foregoing and other objects, and in accordance with the
`purposes of the present invention as described above, a
`method and system for reflecting dilIerences between two
`versions of a file is provided. The method includes
`
`Unfied Patents Exhibit 1003
`
`Pg. 7
`
`Unfied Patents Exhibit 1003
`Pg. 7
`
`
`
`3
`from a base file, a base signature file that
`generating,
`includes a plurality of base bit patterns. Each bit pattern is
`generated as a function of a portion of data in the base file.
`A revised version of the base file is created. A revised
`signature file, including a plurality of revised bit patterns, is
`generated from the revised file. Each revised bit pattern
`matches at least one of the base bit patterns. Based on
`differences between the base signature file and the revised
`signature file, the revised file is accessed and a delta file
`reflecting the differences between the base file and the
`revised file is generated.
`Once the base signature file is generated from the base
`file, the base file need not be accessed to generate the delta
`file reflecting the differences between the base file and the
`revised file. Thus, according to the present invention, the
`base signature file, which is a small fraction in size of the
`original base file, can be transmitted over a network to
`another computer, such as a server, and the differencing
`operation to generate the delta file can be carried out on the
`server, without the need to transmit the original base file to
`the server.
`According to one embodiment of the invention, the base
`signature file is generated by reading fame size blocks of
`data from the base file, and for each block of data, calcu—
`lating and generating a bit pattern, such as a cyclic redun-
`dancy check (CRC) value that is stored in the base signature .
`file. After the base file is processed, a base signature file
`exists containing a plurality of CRC values derived from the
`data contained in the base file. Amodified, or revised version
`of the base file is made, and a revised signature file is created
`by reading frame size blocks of data from the revised file and
`generating a revised CRC value for each data block. The
`revised CRC value is compared to the CRC values in the
`base file. If the revised CRC value matches a CRC value in
`the base signature file, the revised CRC value is stored in the
`revised signature file. Amatch is an indication that the block
`of data matches a block of data in the base file. If no CRC
`values in the base signature file match the revised CRC bit
`pattern, a new block of data olIset one byte from the
`previous block of data is read. This process is repeated until
`a revised CRC bit pattern is generated that matches a bit
`pattern from the base signature file. The revised bit pattern
`is then saved to the revised signature file, along with the
`offset of the block of data from the beginning of the revised
`file, and the next block of data is read from the revised file,
`this block of data beginning at the end of the previous block
`of data. At the end of the process, a revised signature file
`exists with CRC values that match CRC values in the base
`signature file, and offsets indicating the location in the
`revised file of the matching bloc (s of data.
`
`The base signature file and revised signature file are then _
`
`processed, and based on the di erences between the two
`files, and the olIsets in the revised signature file, the revised
`file is accessed and a delta file is created reflecting the
`differences between the base file and the revised file. The
`delta file preferably contains primitives, such as insert,
`modify, and delete primitives, which are commands that can
`be applied to a previous version of the file to generate the
`revised file.
`According to one embodiment of the present invention,
`the revised signature file and the base signature file are
`processed in the order of the longest common subsequence
`of revised bit patterns with respect to the base bit patterns in
`the base signature file. Determining differences between the
`base file and the revised file based on the longest common
`subsequence of bit patterns ensures a minimal number of
`primitives are used in the delta file to reflect the dilIerences
`between the base file and the revised file.
`
`
`
`40
`
`45
`
`60
`
`US 6,233,589 B1
`
`10
`
`(,4O
`
`35
`
`
`
`
`4
`The present invention reduces the amount of data neces—
`sary to reflect differences between two files, reducing both
`network utilization and storage requirements. The present
`invention is useful not only in backup systems, but also in
`re alication and synchronization systems. Moreover, once a
`base signature file has been generated, a delta file reflecting
`di erences between the base file and a subsequent version of
`the base file can be generated without the base file. Still other
`ob'ects of the present invention will become apparent to
`those skilled in this art from the following description
`wherein there is saown and described preferred embodi-
`ments of this invention. As will be realized, the invention is
`
`
`capable of other di erent obvious aspects all without depart-
`ing from the invention. Accordingly,
`the drawings and
`description will be regarded as illustrative in nature and not
`as restrictive.
`
`
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The accompanying drawings incorporated in and forming
`a part of the specification, illustrate several aspects of the
`present invention, and together with the description serve to
`explain the principals of the invention. In the drawings:
`FIG. 1 is a block diagram illustrating an environment in
`which the present invention has application;
`FIG. 2 is a block diagram illustrating another environment
`in which the present invention has application;
`FIG. 3 is a schematic diagram illustrating the relationship
`between various files and a delta file according to one
`embodiment of the present invention;
`FIG. 4 is a flow diagram illustrating a process for gener—
`ating a base signature file according to one embodiment of
`the present invention;
`FIG. 5 is a flow diagram illustrating a process for gener-
`ating a delta signature file according to one embodiment of
`the present invention; and
`FIG. 6 is a flow diagram illustrating a process for gener—
`ating a delta file according to one embodiment of the present
`invention.
`
`Reference will now be made in detail to present preferred
`embodiments of the invention, examples of which are illus—
`trated in the accompanying drawings, wherein like numerals
`indicate the same elements throughout the views.
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`
`
`The present invention is directed to a differencing mecha-
`
`
`nism that quickly and e ieiently determines the differences
`
`between two files, and generates a delta file reflecting those
`
`
`
`di erences. Referring to FIG. 1, an environment using file
`synchronization or file replication is shown. Each of servers
`22 and 24 maintain a copy of a base file 21, 27, respectively,
`and are interconnected via a network 26. Base files 21, 27
`should be identical to one another and thus changes made to
`one copy should eventually be reflected in the other copy. A
`base signature file, as described in more detail herein, is
`generated on one of the servers 22, 24 and copied to the
`other server. The base signature file is maintained on server
`22 as file 20, and maintained on server 24 as file 28. Either
`base file 21 or 27 can be modified at either server. Upon
`detection of a modification to the file, the detecting server,
`for example server 22, uses the respective base signature
`file, for example, base signature file 20, to generate a new
`delta file, and communicates the delta file over network 26
`to server 24. Server 24 then uses the delta file to update the
`base file 27, and recalculates the base signature file 28. In
`
`Unfied Patents Exhibit 1003
`
`Pg. 8
`
`Unfied Patents Exhibit 1003
`Pg. 8
`
`
`
`5
`this manner, base files 21, 27 on servers 22 and 24 can stay
`in synchronization with minimal transfer of data over net-
`work 26.
`
`6
`Any version of the file can be derived from the base
`version of the file F and the given deltas between the base
`version and the desired version.
`
`US 6,233,589 B1
`
`5
`
`10
`
`FIG. 2 is a block diagram illustrating another environment
`in which the present invention has application. Aplurality of
`client computers 32 are connected over a network 26 to a
`backup server 30. Backup server 30 is in communication
`with a storage medium 34 for storage of files backed up from
`computers 32. Each new file created on computers 32 is
`preferably communicated to backup server 30 and backed up
`on storage medium 34. For each new file generated, a base
`signature file is created. The base signature file can be
`generated either on computer 32 or backup server 30. Auser
`of computer 32 can modify the base file and generate a
`revised file The differencing mechanism of the current
`invention then processes the base signature file and the
`revised file to generate a revised signature file. The revised
`signature lile and base signature file in conjunction with the
`revised file are used to create a delta file, which can then be
`communicated over network 26 to backup server 30, and
`stored on storage medium 34.
`Multiple versions of delta files can be maintained so that
`any particular version of a file can be restored. To accom-
`plish this, a new base signature file is generated from the
`revised file. In essence, the revised file becomes the base file ’
`for the next version of the revised file. If the base signature
`files are being maintained on backup server 30, and only the
`delta file is communicated from computers 32 to backup
`server 30, a new base signature file can be created on backup
`server 30 by first applying the delta file received from
`computer 32 to the previous base file to create the revised
`file. (This revised file becomes the base file for use with the
`next received delta file.) The base signature file can then be
`generated from the revised file. If the base signature files are
`being maintained on computer 32, the new base signature
`file can be generated from the actual revised file that was
`used in creating the delta file. In this manner, multiple delta
`files can be maintained to allow the recreation of any version
`of the base file.
`
`o;O
`
`35
`
`40
`
`Rf generates the required version of the file from a given
`base file and a forward delta.
`Rb generates the required version of the file from a given
`base file and a backward delta.
`If F,- represents the im version of file F and dz. represents the
`forward delta then
`
`Fi:Rf(E di)
`
`If F} represents the j”‘ version of the file F and di represents
`the backward delta then
`
`F,=R..(F, 41-)
`Streams
`Assume S is a stream of bytes. If the function len(S)
`returns the length of stream S, then the following conven—
`tions hold true:
`
`i. S[i] Where léiéleii(S) represents a single byte at
`position i within the stream S.
`ii. S[a, b] where léa, bélen(S) represents a substream of
`S starting at position a and the next b—l bytes
`iii. S[1, len(S)]=S
`iv. S[n, #]=S[n, len(S)]
`v. if Sl[i]=S;[i] Vi and len(Sl)=len(Sz) then S1 and S2 are
`same.
`
`vi. S1+S2 concatenates S2 to S1.
`Blocks
`Each stream is divided into blocks of fixed length such
`that following conventions hold true:
`i. BS is a fixed integer, used to represent the block size in
`the stream S
`
`ii. Given a stream S, N (S) represents the number of blocks
`in the stream S such that
`
`N(S)=[Ien(S:)/BS]
`
`If network bandwidth consumption is not a problem, and
`client processing power is limited, the revised file can be
`communicated to backup server 30, and the delta file can be
`created on backup server 30. Alternatively, the base signa—
`ture file can be stored by backup server 30 on storage
`medium 34, and then communicated to computer 32 upon its
`request. Computer 32 can then use the base signature file to
`create the delta file, and transmit the delta file to backup
`server 30.
`
`45
`
`Certain terminology will be used to describe the invention
`herein. The term ‘stream’ will be used to refer to contiguous
`bytes of data stored in a file. The phrase ‘delta‘ will be used
`to refer to differences between two files.
`Deltas
`A versioned sequence of a file can be represented as
`follows:
`
`iii. B"(S) represents n”1 block in the stream S such that,
`Enl:.5')=3[BS*n+1, BS"(11+1)], 0§11§N(S)—1
`
`Signature of Stream
`A signature is calculated with each stream of data. The
`signature of the stream enables the identification of changed
`blocks in the stream. The signature of the stream is described
`as follows:
`
`i. CRC(B,L(S)) represents the CRC of the block B” of the
`stream S.
`
`ii. CRC(A)=CRC(R) iff A=B, where A and B are two
`blocks in the stream S (CRC(A)=CRC of block A).
`iii. Sig(S) represents the signature of the stream S such
`that
`
`{‘11) d2; d3; -
`
`-
`
`-
`
`, dm: Edm+1> dm+2.'
`
`-
`
`' dm+n}>
`
`Sig(5)={CRC(3i(S>>: CRCfB2(SD, -
`
`-
`
`-
`
`s CRC(BN(5)(S>>}
`
`, dmm
`.
`.
`where F is the base version of the file and d1, ell, .
`are deltas of file F. Deltas d1, d2, .
`.
`.
`, d,” represent reverse
`deltas and deltas dmfl, dmu, .
`.
`.
`, dmm
`represent forward
`deltas with respect to file F.
`Given two versions of file F, Fl. and Fr, a diff function is
`defined such that
`
`d=dUflFp m
`
`where d is the delta changes between 17,- and Ff.
`
`60
`
`If C,- is the CRC of a block i in a stream S then
`i. Base(C,7) is the starting position of the block within the
`stream S
`
`ii. Next(Ci) returns the CRC of the next block in the
`signature Sig(S)
`
`The present invention generates a delta file reflecting the
`
`
`
`di erences between a first and second file. The delta file
`preferably comprises primitives, such as insert, modify and
`delete commands, which reflect the operations to apply to
`
`Unfied Patents Exhibit 1003
`
`Pg. 9
`
`Unfied Patents Exhibit 1003
`Pg. 9
`
`
`
`US 6,233,589 B1
`
`7
`the llrst file to derive the second file. According to one
`embodiment of this invention,
`three primitives are used.
`Each primitive is dellned as a function that returns the
`resultant strem(S*) after applying the operation to the base
`stream S.
`Insertion primitive:
`S*=ins(S, i, S')=S[1, i]+S'+S[i, #] where 5' represents the
`stream that was inserted within the stream S.
`Deletion primitive:
`S*=del(S, i, len)=S[l, i]+S[i+len, #]
`Modification primitive:
`S*=m0d(S, i, S‘)=S[l, i]+S'+S[i+len(S'), #] where 8' rep-
`resents modification to the substream S[i, len(S‘)]
`A sequence of primitive operations on a stream can be
`represented as 1312) _
`_
`_ ”(S) such that
`W _
`_
`_ 1,,(5‘)=P,,(P,,,1( .
`.
`. (P1(S)))) where PiE[ins, del, mod]
`
`Inverse Primitive:
`Give S, S* , P such that:
`
`S*=P(S)
`
`we define inverse primitive P— such that
`
`S=P7(S*)
`
`If P=ins(S,i,S') then P—=del(S*, i, 1511(9))
`
`If P=del(S,i,len) then P7=ins(Sx, i, S[i+l, len])
`
`If P=mod(S, i,S') then P—=mod(S*, i, S')
`
`If P=P,,(P,,,1( .
`
`.
`
`. (P1(S)))j then
`
`Pmpjmzp .
`
`. (133(5))
`
`10
`
`o;O
`
`35
`
`8
`and revised signature file 48, preferably in the longest
`common subsequence order of revised bit patterns. Based on
`comparisons between base signature file 42 and revised
`signature file 48, and using the offsets contained in revised
`signature file 48 and the data in revised file 44, the delta file
`52 is generated. Revised file 44 can then be treated as a new
`base file, and a new base signature file can be generated from
`revised file 44. Alternately, the next revision of revised file
`44 can be processed against the original base signature file
`
`42, and a delta file can be generated that reflects the
`
`
`
`di erences between original base file 38 and the newest
`version of the revised file. The method and system of the
`present invention eliminate the need to retain the base file for
`purposes of generating a delta file after generation of the
`base signature file. Because the base signature file comprises
`bit patterns which are significantly smaller in size than the
`blocks of data which they represent, the base signature file
`42 is a small fraction of the size of the original base file 38.
`Preferably, the bit patterns used in the present invention
`are cyclic redundancy code (“CRC”) patterns. A CRC is
`derived from a block of data and the calculation and use of
`CRCs (for the purpose of detecting errors in data
`transmission) is well known to those skilled in the art, and
`will not be discussed herein in detail. The block size chosen
`earl be arbitrary, but must be the same for the base file and
`for subsequent revisions of the base file. The block size
`should not be so small that the process loses elliciency, and
`not so large that large blocks of data are backed up from
`minor modifications. For example, for a large file on the
`order of about one gigabyte, a suitable block size would be
`about 64 Kb. Assuming that the CRC polynomial is 32 bytes
`long, the size of the signature file would be approximately
`128 Kb. Thus, the signature file is significantly smaller than
`the data file on which it is based. Thus, a trade off exists
`between the size of the signature file and the block size so
`that effective performance is achieved.
`A CRC algorithm treats a message, or block of data, as a
`single binary number and divides it by another fixed binary
`number. The remainder of this division is known as the
`CRC. The fixed binary number is normally called the
`polynomial of the CRC. While a normal CRC algorithm
`could be applied to the present invention, performance may
`not be satisfactory. Preferably, a table-driven CRC algorithm
`is used so that a mere shift, OR, XOR and a table lookup per
`byte can be used to derive the CRC.
`The use of a CRC has several advantages. First, a CRC
`can serve as a unique identity that represents the contents of
`a logical block of data. When the contents are changed, the
`CRC value should change. Ideally, a CRC algorithm has
`polynomial time complexity and very little space complex-
`ity. In other words, if the size of the file is N, the time
`required by the CRC algorithm to perform the calculation is
`K*N (where K is a constant), and has relatively small
`memory requirements. The CRC value itself should con-
`sume very little space relative to the amount of data that it
`represents. Given a stream of data where the logical blocks
`of data represented by a set of function outputs is embedded,
`it should be possible to determine the boundaries of the
`logical blocks represented by these outputs in polynomial
`time in a single pass of the file. A table-driven CRC is
`extremely fast and does not depend on the size of the
`polynomial. The purpose of one aspect of the present
`invention is to back up portions of files that have changed
`with respect to a previous version of the file. When data is
`inserted or deleted in a file, the position of blocks of data in
`the revised file dilfers from the same blocks of data in the
`previous version of the file. The differencing mechanism of
`
`Unfied Patents Exhibit 1003
`
`Pg. 10
`
`If P represents the forward delta for the base stream to
`generate the current version of the stream, then P— repre-
`sents the backward delta for the current version of the stream
`to generate the base version of the stream.
`In general, one aspect of the present invention relates to
`solving the following problem: given two streams S1 and $2
`llnd P1,2 .
`_
`_ ’ n such that Sz=P1,2 _
`_
`. .n (Si) using Sig(Sl) and
`Sm
`With this notational background, a high—level schematic
`of the present invention can be described with reference to
`FIG. 3. A generator 40 processes a base file 38 and generates
`a base signature file 42. Base signature file 42 comprises a
`plurality of bit patterns, each of which is derived as a
`function of a block of data from base file 38. Base file 44 is
`then modified, or revised, by a riser and a revised file 44 is .
`created. Generator 46 processes the revised file 44, gener-
`ating bit patterns as a function of blocks of data from revised
`file 44. To distinguish these bit patterns from those contained
`in base signature file 42, the bit patterns generated as a
`function of data from revised file 44 will be referred to as
`revised bit patterns, and those contained in base signature
`file 42 will be referred to as base bit patterns. Each revised
`bit pattern is compared to the base bit patterns in base
`signature file 42. For each revised bit pattern that matches a
`base bit pattern in base signature file 42,
`it is stored in
`revised signature file 48, along with an offset indicating the
`location in revised file 44 of the beginning of the block of
`data represented by the revised bit pattern. In this manner,
`generator 46 creates a revised signature file 48 containing a
`plurality of revised bit patterns. To generate a delta file 52
`rellecting the differences between base file 38 and revised
`file 44, a generator 50 processes the base signature f