`Balcha et al.
`
`111111111111111111111111111111111111111111111111111111111111111111111111111
`US006233589Bl
`US 6,233,589 BI
`May 15,2001
`
`(10) Patent No.:
`(45) Date of Patent:
`
`(54) METHOD AND SYSTEM FOR REFLECTING
`DIFFERENCES BETWEEN TWO FILES
`
`(75)
`
`Inventors: Muralidhar Balcha, Shrewsbury, MA
`(US); Chandan Kudige, Karnataka
`(IN)
`
`(73) Assignee: Novell, Inc., Provo, UT (US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.c. 154(b) by 0 days.
`
`(21)
`
`(22)
`
`(51)
`(52)
`(58)
`
`(56)
`
`Appl. No.: 09/127,523
`
`Filed:
`
`Jul. 31, 1998
`
`G06F 17/30
`Int. CI?
`707/203
`U.S. Cl.
`707/203-10,201,
`Field of Search
`707/204, 217, 2, 6, 8, 101, 202, 205
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,479,654
`5,574,906
`5,634,052
`5,778,395 *
`5,919,247 *
`6,052,531 *
`
`12/1995
`11/1996
`5/1997
`7/1998
`7/1999
`4/2000
`
`Squibb ..
`Morris
`Morris
`Whiting et al.
`Van Hoff et al.
`Waldin, Jr. et al.
`
`707/201
`707/1
`707/1
`707/204
`709/217
`707/10
`
`72
`
`94
`
`OTHER PUBLICATIONS
`
`ICS 161: Design and Analysis of Algorithms Lecture Notes
`for Feb. 19, 1996, "Longest Common Sequences", World
`Wide Web publication.
`"Longest Common Subsequence Problem", Pruhs, World
`Wide Web publication, Oct. 6, 1997.
`"Longest Common Subsequence", World Wide Web publi(cid:173)
`cation, 2 pp., author and publication date unknown.
`"Executive Summary", World Wide Web publication,S pp.,
`author and publication date unknown.
`* cited by examiner
`Primary Examiner-Hosain T. Alam
`Assistant Examiner---8anjiv Shah
`(74) Attorney, Agent, or Firm-Dinsmore & Shohl LLP
`
`(57)
`
`ABSTRACT
`
`A method and system for reflecting differences between two
`files. The method includes generating a base signature file
`having a plurality of base bit patterns, each base bit pattern
`being generated as a function of a portion of data in a first
`file. A second file containing a plurality of revised bit
`patterns is generated from a second file. Each revised bit
`pattern is compared to and matches at least one of the base
`bit patterns. A delta file reflecting the differences between a
`first file and the second file based on the base signature file,
`the delta signature file, and the second file is created.
`
`20 Claims, 4 Drawing Sheets
`
`78
`
`YES
`
`DOES CRC MATCH
`AllY CRC IN BASE
`SIIJ<ATURE FILE'
`
`86
`
`92
`
`YES
`
`DOES CRc MATCH
`my CRC IN BASE
`SIGNATURE FILE?
`
`Oracle Exhibit 1003, pg 1
`
`
`
`u.s. Patent
`
`May 15, 2001
`
`Sheet 1 of 4
`
`US 6,233,589 BI
`
`22
`
`LAN
`SERVER
`
`24
`
`LAN
`SERVER
`
`20
`
`21
`
`28
`
`27
`
`FIG.1
`
`34
`
`32
`
`32
`
`32
`
`32
`
`32
`
`DOS
`
`WIN 95
`
`NETWARE t---~
`
`OS/2
`
`WIN NT
`
`FIG.2
`
`Oracle Exhibit 1003, pg 2
`
`
`
`u.s. Patent
`
`May 15, 2001
`
`Sheet 2 of 4
`
`US 6,233,589 BI
`
`FIG. 3
`
`50
`
`52
`
`OPEN BASE FILE.
`INITIALIZE BASE
`SIGNAT~E FILE.
`
`56
`
`GET THE BLOCK SIZE
`FROM BACKUP SERVER
`
`58
`
`60
`
`READ NEXT BLOCK
`IN THE FILE . . . . - - - - - - - -
`
`66
`
`ADD CRC TO THE
`BASE SIGNATURE
`
`62
`
`----.J
`
`NO
`CALCULATE CRC
`>----1 OF THE BLOCK 1--
`
`4-
`6-
`
`68
`
`FIG.4
`
`Oracle Exhibit 1003, pg 3
`
`
`
`u.s. Patent
`
`May 15, 2001
`
`Sheet 3 of 4
`
`US 6,233,589 BI
`
`72
`
`OPEN REVISED FILE
`GET THE BLOCK SIZE FROM THE BASE
`SIGNATURE FILE. INITIALIZE REVISED SIGN FILE.
`
`94
`
`ADD CRC TO REVISED
`SIGNAT~E FILE
`
`78
`
`CALCULATE CRC OF THE BLOCK
`
`YES
`
`86
`
`REMOVE MSB FROM THE CRC
`COMPUTATION
`
`92
`
`ADD DISPLACEMENT OF THE BLOCK
`TO REVISED SIGNAT~E FILE
`
`ADD NEXT BYTE TO THE CRC
`COMPUTATION
`DISPLACEMENT ++
`
`88
`
`YES
`
`NO
`
`FIG. 5
`
`Oracle Exhibit 1003, pg 4
`
`
`
`u.s. Patent
`
`May 15, 2001
`
`Sheet 4 of 4
`
`US 6,233,589 BI
`
`100
`
`Ll : LIST BASE SIGNATURE CRCS
`~ :LIST OF REVISED SIG"lATURE
`CRCS
`E : FIRST ELEMENT IN L11
`E2; FIRST ELEMENT IN L2,
`I=J=O; ElI =E21=NJLL
`102
`
`NO
`
`FIG. 6
`
`E1=NEXT(E1)
`Er NEXT(E2)
`1++, J++; ElI=ElIE21=E 2
`
`108
`
`YES
`
`INS(S, BASE(E 1), S[BASE(NEXT(E 2P),
`DISPLACEMENT(E2)]) TO DELTA FILE
`
`YES
`
`YES
`
`112
`
`DEllS, BASE(NEXT(E 11)), BASE (E 1)
`-BASE(NEXT(E 11))) TO DELTA FILE
`
`116
`
`DEUS, BASE(NEXT(E 11)), BASE(E 1)
`-BASE(NEXT(E 11))) TO DELTA FILE.
`INS(S, BASE(NEXT(E 11)),
`S(BASE(NEXT(E 21),
`BASE(E2 )-BASE(NEXT(E21)]) TO DELTA FILE
`
`NO
`
`NO
`LCSLEN[I+l, Jl<
`LCSLEN[I, J+ll
`
`122
`
`1++;
`E1=NEXT(E1)
`
`124
`
`120
`
`J++;
`E2=NEXT(E2)
`
`Oracle Exhibit 1003, pg 5
`
`
`
`US 6,233,589 Bl
`
`1
`METHOD AND SYSTEM FOR REFLECTING
`DIFFERENCES BETWEEN TWO FILES
`
`FIELD OF THE INVENTION
`
`The present invention relates generally to backup and
`synchronization of files, and in particular relates to a method
`and system for reflecting differences between two files.
`
`BACKGROUND OF THE INVENTION
`
`5
`
`2
`necessary. In other words, two differencing mechanisms can
`vary in their ability to efficiently recognize and reflect
`differences between two files. Also, it would be preferable
`for a differencing mechanism to be able to determine dif-
`ferences between a base file and a modified version of the
`base file without actually having to repeatedly process the
`base file, so that the differencing operation can be performed
`on a remote computer, without the need to process the entire
`base file.
`U.S. Pat. No. 5,574,906 discloses a system and method
`for reducing storage requirements in backup subsystems.
`The '906 patent discloses a system similar to that disclosed
`in the '052 patent, with the enhancement that the base file
`from which the differencing operation is derived can be
`15 compressed. In certain files, a compressed base file will
`utilize less bandwidth and less storage space on a computer
`than would an uncompressed based file. One problem with
`the compressibility of files differs
`this approach is that
`greatly. While compression can significantly reduce the size
`20 of some files, compression algorithms do not obtain signifi(cid:173)
`cant reduction with other type of files. Additionally, the
`differencing mechanism of the '906 patent works by first
`compressing the revised version of the file, and upon deter(cid:173)
`mining that compressed portions of the base file and the
`25 revised file differ, both the base file and the revised file are
`uncompressed at
`those locations so that
`the differences
`between the two files can be determined. The overhead
`involved in such compression/decompression algorithms
`can be significant.
`U.S. Pat. No. 5,479,654 discloses an apparatus and
`method for generating a set of representations of the changes
`made in a computer file during a period of time. The process
`disclosed in the ' 654 patent makes multiple passes through
`portions of the most recent version of the file to determine
`35 the differences between it and the previous version of the
`file.
`Thus, it is apparent that a differencing system that reduces
`network traffic, efficiently determines and reflects differ(cid:173)
`ences between two files quickly, and reduces storage
`40 requirements would be highly desirable.
`
`Copies of files are frequently transmitted over a network 10
`from one computer to another computer. One reason to copy
`a file is for backup purposes. If a file created on one
`computer has been backed up on another computer, it can be
`easily recovered in the event the hard drive of the first
`computer fails. Because the loss of a file could mean the loss
`of extremely important data, and/or result
`in significant
`effort to recreate the file, file backup processes are very
`common. However, file backup has at least two problems
`associated with it: first, it can require significant network
`bandwidth to transfer file data from the first computer to the
`backup computer, and second,
`it can require significant
`storage space to maintain copies of files. Both of these
`problems can be alleviated to some extent through the use of
`an incremental backup. An incremental backup copies only
`those files that have been changed since the previous
`backup. Incremental backups can significantly reduce the
`number of files that are backed up on a periodic basis.
`Typically, when a file is modified, only a small portion of
`the file is actually changed from the previous version. While 30
`an incremental backup can reduce network bandwidth and
`save storage space compared to a complete backup, it is still
`inefficient in that a complete file is backed up even though
`it is possible that only a small portion of the file was actually
`to improve upon incremental
`modified.
`In an attempt
`backups, backup processes exist that identify the differences
`between two versions of a file, and attempt to backup only
`those differences. This is referred to as a differencing pro(cid:173)
`cess. Differencing processes can reduce network bandwidth
`and storage requirements because only portions of the file
`are backed up.
`Copies of files are also frequently made for purposes of
`synchronization or replication. A synchronized file exists in
`two different locations, such as on two different servers, and
`changes made to one file must be reflected in the other file. 45
`Synchronization usually occurs by periodically copying the
`file from one location to the other location.
`U.S. Pat. No. 5,634,052 discloses a system for reducing
`storage requirements in a backup subsystem. The system
`includes creating a delta file reflecting the differences 50
`between a base file and a modified version of the base file,
`and transmitting the delta file to a server for backup pur(cid:173)
`poses. One problem associated with this system is that the
`base file is necessary to create the delta file that reflects the
`differences between the base file and the revised file. Thus, 55
`if the delta file is to be created on another computer, such as
`the server, the base file must first be transmitted to the server
`where the differencing operation is carried out. Moreover,
`the '052 patent does not disclose optimal mechanisms for
`creating the delta file.
`In a differencing backup system, the differencing mecha(cid:173)
`nism used to create the delta file can be quite important. It
`is not uncommon for files to be many megabytes in size. A
`differencing mechanism that processes a file multiple times,
`or processes a file in an inefficient manner can result in 65
`excessive backup times. Moreover, an inefficient differenc(cid:173)
`ing mechanism can result in more data being backed up than
`
`60
`
`SUMMARY OF IRE INVENTION
`It is one object of the present invention to provide a
`method and system for determining the differences between
`a base file and a modified file without the need for a copy of
`the base file.
`It is another object of the present invention to provide a
`method and system for reflecting the differences between
`two files that is highly efficient and reduces network traffic.
`It is yet another object of the present invention to provide
`a method and system that can determine the differences
`between a base file and a modified file either on a client
`computer or a server computer without a need for a copy of
`the base file.
`It is still another object of the present invention to provide
`a method and system for creating a signature of a base file
`that can be used to determine the differences between a base
`file and a modified file.
`Additional objects, advantages and other novel features of
`the invention will be set forth in part in the description that
`follows and, in part, will become apparent to those skilled in
`the art upon examination of the invention. To achieve the
`foregoing and other objects, and in accordance with the
`purposes of the present invention as described above, a
`method and system for reflecting differences between two
`versions of a file is provided. The method includes
`
`Oracle Exhibit 1003, pg 6
`
`
`
`US 6,233,589 Bl
`
`5
`
`4
`The present invention reduces the amount of data neces(cid:173)
`sary to reflect differences between two files, reducing both
`network utilization and storage requirements. The present
`invention is useful not only in backup systems, but also in
`replication and synchronization systems. Moreover, once a
`base signature file has been generated, a delta file reflecting
`differences between the base file and a subsequent version of
`the base file can be generated without the base file. Still other
`objects of the present invention will become apparent to
`10 those skilled in this art from the following description
`wherein there is shown and described preferred embodi(cid:173)
`ments of this invention. As will be realized, the invention is
`capable of other different obvious aspects all without depart(cid:173)
`ing from the invention. Accordingly,
`the drawings and
`15 description will be regarded as illustrative in nature and not
`as restrictive.
`
`3
`generating, from a base file, a base signature file that
`includes a plurality of base bit patterns. Each bit pattern is
`generated as a function of a portion of data in the base file.
`A revised version of the base file is created. A revised
`signature file, including a plurality of revised bit patterns, is
`generated from the revised file. Each revised bit pattern
`matches at
`least one of the base bit patterns. Based on
`differences between the base signature file and the revised
`signature file, the revised file is accessed and a delta file
`reflecting the differences between the base file and the
`revised file is generated.
`Once the base signature file is generated from the base
`file, the base file need not be accessed to generate the delta
`file reflecting the differences between the base file and the
`revised file. Thus, according to the present invention, the
`base signature file, which is a small fraction in size of the
`original base file, can be transmitted over a network to
`another computer, such as a server, and the differencing
`operation to generate the delta file can be carried out on the
`server, without the need to transmit the original base file to
`the server.
`According to one embodiment of the invention, the base
`signature file is generated by reading fame size blocks of
`data from the base file, and for each block of data, calcu(cid:173)
`lating and generating a bit pattern, such as a cyclic redun(cid:173)
`dancy check (CRe) value that is stored in the base signature 25
`file. After the base file is processed, a base signature file
`exists containing a plurality of CRC values derived from the
`data contained in the base file. A modified, or revised version
`of the base file is made, and a revised signature file is created
`by reading frame size blocks of data from the revised file and 30
`generating a revised CRC value for each data block. The
`revised CRC value is compared to the CRC values in the
`base file. If the revised CRC value matches a CRC value in
`the base signature file, the revised CRC value is stored in the
`revised signature file. A match is an indication that the block 35
`of data matches a block of data in the base file. If no CRC
`values in the base signature file match the revised CRC bit
`pattern, a new block of data offset one byte from the
`previous block of data is read. This process is repeated until
`a revised CRC bit pattern is generated that matches a bit 40
`pattern from the base signature file. The revised bit pattern
`is then saved to the revised signature file, along with the
`offset of the block of data from the beginning of the revised
`file, and the next block of data is read from the revised file,
`this block of data beginning at the end of the previous block 45
`of data. At the end of the process, a revised signature file
`exists with CRC values that match CRC values in the base
`signature file, and offsets indicating the location in the
`revised file of the matching blocks of data.
`The base signature file and revised signature file are then 50
`processed, and based on the differences between the two
`files, and the offsets in the revised signature file, the revised
`file is accessed and a delta file is created reflecting the
`differences between the base file and the revised file. The
`delta file preferably contains primitives, such as insert, 55
`modify, and delete primitives, which are commands that can
`be applied to a previous version of the file to generate the
`revised file.
`According to one embodiment of the present invention,
`the revised signature file and the base signature file are 60
`processed in the order of the longest common subsequence
`of revised bit patterns with respect to the base bit patterns in
`the base signature file. Determining differences between the
`base file and the revised file based on the longest common
`subsequence of bit patterns ensures a minimal number of 65
`primitives are used in the delta file to reflect the differences
`between the base file and the revised file.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The accompanying drawings incorporated in and forming
`20 a part of the specification, illustrate several aspects of the
`present invention, and together with the description serve to
`explain the principals of the invention. In the drawings:
`FIG. 1 is a block diagram illustrating an environment in
`which the present invention has application;
`FIG. 2 is a block diagram illustrating another environment
`in which the present invention has application;
`FIG. 3 is a schematic diagram illustrating the relationship
`between various files and a delta file according to one
`embodiment of the present invention;
`FIG. 4 is a flow diagram illustrating a process for gener(cid:173)
`ating a base signature file according to one embodiment of
`the present invention;
`FIG. 5 is a flow diagram illustrating a process for gener(cid:173)
`ating a delta signature file according to one embodiment of
`the present invention; and
`FIG. 6 is a flow diagram illustrating a process for gener(cid:173)
`ating a delta file according to one embodiment of the present
`invention.
`Reference will now be made in detail to present preferred
`embodiments of the invention, examples of which are illus(cid:173)
`trated in the accompanying drawings, wherein like numerals
`indicate the same elements throughout the views.
`
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`
`The present invention is directed to a differencing mecha(cid:173)
`nism that quickly and efficiently determines the differences
`between two files, and generates a delta file reflecting those
`differences. Referring to FIG. 1, an environment using file
`synchronization or file replication is shown. Each of servers
`22 and 24 maintain a copy of a base file 21, 27, respectively,
`and are interconnected via a network 26. Base files 21, 27
`should be identical to one another and thus changes made to
`one copy should eventually be reflected in the other copy. A
`base signature file, as described in more detail herein, is
`generated on one of the servers 22, 24 and copied to the
`other server. The base signature file is maintained on server
`22 as file 20, and maintained on server 24 as file 28. Either
`base file 21 or 27 can be modified at either server. Upon
`detection of a modification to the file, the detecting server,
`for example server 22, uses the respective base signature
`file, for example, base signature file 20, to generate a new
`delta file, and communicates the delta file over network 26
`to server 24. Server 24 then uses the delta file to update the
`base file 27, and recalculates the base signature file 28. In
`
`Oracle Exhibit 1003, pg 7
`
`
`
`US 6,233,589 Bl
`
`5
`
`5
`this manner, base files 21, 27 on servers 22 and 24 can stay
`in synchronization with minimal transfer of data over net(cid:173)
`work 26.
`FIG. 2 is a block diagram illustrating another environment
`in which the present invention has application. A plurality of
`client computers 32 are connected over a network 26 to a
`backup server 30. Backup server 30 is in communication
`with a storage medium 34 for storage of files backed up from
`computers 32. Each new file created on computers 32 is
`preferably communicated to backup server 30 and backed up 10
`on storage medium 34. For each new file generated, a base
`signature file is created. The base signature file can be
`generated either on computer 32 or backup server 30. A user
`of computer 32 can modify the base file and generate a
`revised file. The differencing mechanism of the current 15
`invention then processes the base signature file and the
`revised file to generate a revised signature file. The revised
`signature file and base signature file in conjunction with the
`revised file are used to create a delta file, which can then be
`communicated over network 26 to backup server 30, and 20
`stored on storage medium 34.
`Multiple versions of delta files can be maintained so that
`any particular version of a file can be restored. To accom(cid:173)
`plish this, a new base signature file is generated from the
`revised file. In essence, the revised file becomes the base file 25
`for the next version of the revised file. If the base signature
`files are being maintained on backup server 30, and only the
`delta file is communicated from computers 32 to backup
`server 30, a new base signature file can be created on backup
`server 30 by first applying the delta file received from 30
`computer 32 to the previous base file to create the revised
`file. (This revised file becomes the base file for use with the
`next received delta file.) The base signature file can then be
`generated from the revised file. If the base signature files are
`being maintained on computer 32, the new base signature 35
`file can be generated from the actual revised file that was
`used in creating the delta file. In this manner, multiple delta
`files can be maintained to allow the recreation of any version
`of the base file.
`If network bandwidth consumption is not a problem, and
`client processing power is limited, the revised file can be
`communicated to backup server 30, and the delta file can be
`created on backup server 30. Alternatively, the base signa(cid:173)
`ture file can be stored by backup server 30 on storage 45
`medium 34, and then communicated to computer 32 upon its
`request. Computer 32 can then use the base signature file to
`create the delta file, and transmit the delta file to backup
`server 30.
`Certain terminology will be used to describe the invention 50
`herein. The term 'stream' will be used to refer to contiguous
`bytes of data stored in a file. The phrase 'delta' will be used
`to refer to differences between two files.
`Deltas
`A versioned sequence of a file can be represented as 55
`follows:
`
`40
`
`6
`Any version of the file can be derived from the base
`version of the file F and the given deltas between the base
`version and the desired version.
`Rf generates the required version of the file from a given
`base file and a forward delta.
`Rb generates the required version of the file from a given
`base file and a backward delta.
`If Fi represents the ith version of file F and di represents the
`forward delta then
`
`Fi~R/.F, d,)
`
`If Fj represents the jth version of the file F and dj represents
`the backward delta then
`
`Streams
`Assume S is a stream of bytes. If the function len(S)
`returns the length of stream S, then the following conven(cid:173)
`tions hold true:
`i. SCi] where l~i~len(S) represents a single byte at
`position i within the stream S.
`ii. S[a, b] where 1 ~ a, b ~ len(S) represents a substream of
`S starting at position a and the next b-l bytes
`iii. S[I, len(S)]=S
`iv. S[n, #]=S[n, len(S)]
`v. if Sl[i]=S2[i] Vi and len(Sl)=len(S2) then Sl and S2 are
`same.
`vi. Sl+S2 concatenates S2 to Sl'
`Blocks
`Each stream is divided into blocks of fixed length such
`that following conventions hold true:
`i. BS is a fixed integer, used to represent the block size in
`the stream S
`ii. Given a stream S, N(S) represents the number of blocks
`in the stream S such that
`
`N(SHlen(S)/BS]
`
`iii. Bn(S) represents nth block in the stream S such that,
`
`Bn(S)~S[BS*n+1, BS*(n+1)], O~n~N(S)-l
`
`Signature of Stream
`A signature is calculated with each stream of data. The
`signature of the stream enables the identification of changed
`blocks in the stream. The signature of the stream is described
`as follows:
`i. CRC(Bn(S)) represents the CRC of the block Bn of the
`stream S.
`ii. CRC(A)=CRC(B) iff A=B, where A and B are two
`blocks in the stream S (CRC(A)=CRC of block A).
`iii. Sig(S) represents the signature of the stream S such
`that
`
`Sig(S)~{CRC(B,(S)), CRC(B 2(S)), ... , CRC(BN(s/S))}
`
`where F is the base version of the file and d1 , d2, ... , dm + n
`are deltas of file F. Deltas d1 , d2, ... , dm represent reverse 60
`deltas and deltas dm + 1 , dm + 2 , . . . , dm + n represent forward
`deltas with respect to file F.
`Given two versions of file F, Fi and Fp a diff function is
`defined such that
`
`d~diff(Fi' F)
`
`where d is the delta changes between Fi and Fj .
`
`If C i is the CRC of a block i in a stream S then
`i. Base(C;) is the starting position of the block within the
`stream S
`ii. Next(C;) returns the CRC of the next block in the
`signature Sig(S)
`The present invention generates a delta file reflecting the
`65 differences between a first and second file. The delta file
`preferably comprises primitives, such as insert, modify and
`delete commands, which reflect the operations to apply to
`
`Oracle Exhibit 1003, pg 8
`
`
`
`US 6,233,589 Bl
`
`7
`the first file to derive the second file. According to one
`three primitives are used.
`embodiment of this invention,
`Each primitive is defined as a function that returns the
`resultant strem(S*) after applying the operation to the base
`stream S.
`Insertion primitive:
`S*=ins(S, i, S')=S[1, i]+S'+S[i, #] where S' represents the
`stream that was inserted within the stream S.
`Deletion primitive:
`S*=del(S, i, len)=S[1, i]+S[i+len, #]
`Modification primitive:
`S*=mod(S, i, S')=S[1, i]+S'+S[i+len(S'), #] where S' rep(cid:173)
`resents modification to the substream SCi, len(S')]
`A sequence of primitive operations on a stream can be
`represented as P1,2, ... ,n(S) such that
`
`Pi.2 • . . . •n(S)~Pn(Pn-i( ... (Pi(S)))) where PiE[ins, del, mod]
`l~i~n
`
`Inverse Primitive:
`Give S, S* , P such that:
`
`we define inverse primitive P- such that
`
`S~P-(S*)
`
`If P~ins(S,i,S') then P-~del(S*, i, len(S'))
`
`If P~del(S,i,len) then P-~ins(S*, i, S[i+1, len])
`
`If P~mod(S, i,S') then P-~mod(S*, i, S')
`
`If P~Pn(Pn-i(' .. (Pi(S)))) then
`
`P-~P-i(P-2( ... (P-n(S))))
`
`If P represents the forward delta for the base stream to
`generate the current version of the stream, then P- repre(cid:173)
`sents the backward delta for the current version of the stream
`to generate the base version of the stream.
`In general, one aspect of the present invention relates to
`solving the following problem: given two streams Sl and S2
`find P12
`n such that S2=P12
`n (Sl) using Sig(Sl) and
`, ... ,
`, ... ,
`S2'
`
`With this notational background, a high-level schematic
`of the present invention can be described with reference to
`FIG. 3. A generator 40 processes a base file 38 and generates
`a base signature file 42. Base signature file 42 comprises a
`plurality of bit patterns, each of which is derived as a
`function of a block of data from base file 38. Base file 44 is
`then modified, or revised, by a user and a revised file 44 is
`created. Generator 46 processes the revised file 44, gener(cid:173)
`ating bit patterns as a function of blocks of data from revised
`file 44. To distinguish these bit patterns from those contained
`in base signature file 42, the bit patterns generated as a
`function of data from revised file 44 will be referred to as
`revised bit patterns, and those contained in base signature
`file 42 will be referred to as base bit patterns. Each revised
`bit pattern is compared to the base bit patterns in base
`signature file 42. For each revised bit pattern that matches a
`base bit pattern in base signature file 42, it is stored in
`revised signature file 48, along with an offset indicating the
`location in revised file 44 of the beginning of the block of
`data represented by the revised bit pattern. In this manner,
`generator 46 creates a revised signature file 48 containing a
`plurality of revised bit patterns. To generate a delta file 52
`reflecting the differences between base file 38 and revised
`file 44, a generator 50 processes the base signature file 42
`
`5
`
`8
`and revised signature file 48, preferably in the longest
`common subsequence order of revised bit patterns. Based on
`comparisons between base signature file 42 and revised
`signature file 48, and using the offsets contained in revised
`signature file 48 and the data in revised file 44, the delta file
`52 is generated. Revised file 44 can then be treated as a new
`base file, and a new base signature file can be generated from
`revised file 44. Alternately, the next revision of revised file
`44 can be processed against the original base signature file
`10 42, and a delta file can be generated that
`reflects the
`differences between original base file 38 and the newest
`version of the revised file. The method and system of the
`present invention eliminate the need to retain the base file for
`purposes of generating a delta file after generation of the
`15 base signature file. Because the base signature file comprises
`bit patterns which are significantly smaller in size than the
`blocks of data which they represent, the base signature file
`42 is a small fraction of the size of the original base file 38.
`Preferably, the bit patterns used in the present invention
`20 are cyclic redundancy code ("CRC") patterns. A CRC is
`derived from a block of data and the calculation and use of
`CRCs (for
`the purpose of detecting errors in data
`transmission) is well known to those skilled in the art, and
`will not be discussed herein in detail. The block size chosen
`25 can be arbitrary, but must be the same for the base file and
`for subsequent revisions of the base file. The block size
`should not be so small that the process loses efficiency, and
`not so large that large blocks of data are backed up from
`minor modifications. For example, for a large file on the
`30 order of about one gigabyte, a suitable block size would be
`about 64 Kb. Assuming that the CRC polynomial is 32 bytes
`long, the size of the signature file would be approximately
`128 Kb. Thus, the signature file is significantly smaller than
`the data file on which it is based. Thus, a trade off exists
`35 between the size of the signature file and the block size so
`that effective performance is achieved.
`A CRC algorithm treats a message, or block of data, as a
`single binary number and divides it by another fixed binary
`number. The remainder of this division is known as the
`40 CRe. The fixed binary number is normally called the
`polynomial of the CRe. While a normal CRC algorithm
`could be applied to the present invention, performance may
`not be satisfactory. Preferably, a table-driven CRC algorithm
`is used so that a mere shift, OR, XOR and a table lookup per
`45 byte can be used to derive the CRe.
`The use of a CRC has several advantages. First, a CRC
`can serve as a unique identity that represents the contents of
`a logical block of data. When the contents are changed, the
`CRC value should change. Ideally, a CRC algorithm has
`50 polynomial time complexity and very little space complex(cid:173)
`ity. In other words, if the size of the file is N, the time
`required by the CRC algorithm to perform the calculation is
`K*N (where K is a constant), and has relatively small
`memory requirements. The CRC value itself should con-
`55 sume very little space relative to the amount of data that it
`represents. Given a stream of data where the logical blocks
`of data represented by a set of function outputs is embedded,
`it should be possible to determine the boundaries of the
`logical blocks represented by these outputs in polynomial
`60 time in a single pass of the file. A table-driven CRC is
`extremely fast and does not depend on the size of the
`polynomial. The purpose of one aspect of the present
`invention is to back up portions of files that have changed
`with respect to a previous version of the file. When data is
`65 inserted or deleted in a file, the position of blocks of data in
`the revised file differs from the same blocks of data in the
`previous version of the file. The differencing mechanism of
`
`Oracle Exhibit 1003, pg 9
`
`
`
`US 6,233,589 Bl
`
`9
`the present invention determines the locations of identical
`blocks of data in two versions of the file. The following
`illustration will be useful
`in discussing the differencing
`mechanism according to one embodiment of the invention.
`
`~
`
`rn
`S ~l~JJ
`
`Unmodified stream S
`r---,
`
`Modified stream S'
`
`\
`
`Consider a stream S which has two consecutive logical
`blocks A and B. The base address of blockA is baseA and the
`base address of block B is baseB' Data D is inserted between
`the logical blocks A and B. After data D is inserted, the base
`address of block B will be changed to baseB+len(D). When
`the generator reads first block A from stream S', the revised
`CRC derived from block A will match one of the base CRCs
`in the signature file associated with stream S, i.e., the CRC
`of block A of stream S. Since the revised CRC matches a
`base CRC, the next data block read from stream S' begins
`with the first byte past the end of block A. The block of data
`read from the modified stream S' will be referred to as a
`frame of reference. The current frame of reference is shown
`in the above example by a dashe