`Miller
`
`[19]
`
`111111111111111111111111111111111111111111111111111111111111111111111111111
`US005832520A
`Patent Number:
`Date of Patent:
`
`[11]
`
`[45]
`
`5,832,520
`Nov. 3, 1998
`
`[54] AUTOMATIC FILE DIFFERENCING AND
`UPDATING SYSTEM
`
`"Comparing and Merging Files", from the World Wide Web
`(Jun. 8, 1996).
`
`[75]
`
`Inventor: William A. Miller, Gualala, Calif.
`
`[73] Assignee: Miller, Call, Planck and Miller,
`Portola Valley, Calif.
`
`[21] Appi. No.: 754,486
`
`[22]
`
`Filed:
`
`Nov. 22, 1996
`
`Int. C1.6
`[51]
`[52] U.S. Cl.
`Field of Search
`[58]
`
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,278,979
`5,418,945
`5,604,853
`5,623,656
`5,644,709
`
`1/1994 Foster et al.
`5/1995 Carter et al.
`2/1997 Nagashima
`4/1997 Lyons.
`7/1997 Austin
`
`G06F 17/30
`707/203
`707/203
`
`707/203
`.... 707/8
`707/540
`707/10
`395/187.01
`
`OlliER PUBLICATIONS
`
`Cormen, et aI., Introduction to Algorithms, The MIT Press
`(1990).
`Pocket Soft, Inc., White Paper re .RTPatch Professional
`Binary Update System, from the World Wide Web (http://
`www.pocketsoft.com) (Nov. 14, 1996).
`
`Primary Examiner-Wayne Amsbury
`Attorney, Agent, or Firm-Townsend and Townsend and
`Crew LLP
`
`[57]
`
`ABSTRACT
`
`A method and file structure for generating an efficient
`difference files from and old file and a new file so that a
`difference file can be transmitted to a second computer
`system where the difference file and a duplicate of the old
`file can quickly be used to create a copy of the new file is
`disclosed. A differencing process compares an old file and a
`new file to generate a difference file in which the old file is
`used as a database of byte strings. The differencing process
`reads strings of data from the new file, searches for the
`existence of those strings in the old file, and notes the
`locations in the old file in which strings in the new file are
`found and stores in a difference file an indication of the
`location where a matching string is found and an indication
`of the length. A specific file structure for the difference file
`is disclosed.
`
`28 Claims, 13 Drawing Sheets
`
`COpy
`
`T6
`
`DETERMINE WHETHER
`A COPY, INSERT OR INSERT/COPY
`OPERATION WILL BE PERFORMED
`(AND THE COUNT
`AND POSITION
`INFORMATION).
`
`INSERT
`
`NO
`
`....,T10
`, -__--"
`CREATE INSERT COMMAND
`AND INSERT INTO RAW DIFF
`FILE
`
`UPDATE STATISTICS
`FOR COUNT AND
`POSITION BITS
`REQUIRED FOR THIS
`COPY
`
`T12
`
`NO NO = SEARCH HAS ALREADY
`RETURNED THE NEXT COPY
`AFTER AN INSERT
`
`T16
`'------~~ENOOF FILE?
`
`YES
`
`YES
`
`L I-----+(-C RAWDlFFFILEIS )'~4--------,1
`
`_
`
`COMPLETE
`
`_ ~
`
`Oracle Exhibit 1004, pg 1
`
`
`
`u.s. Patent
`5,832,520
`Sheet 1 of 13
`Nov. 3, 1998
`1-----------------------------------
`
`II
`
`OLD FILE
`
`NEW FILE
`
`10
`
`20
`
`DIFFERENCING
`PROCESS
`
`100
`
`30
`- - - - - - - - - - - -
`
`_
`
`1
`
`I
`
`~IIIIIIIIIIIIIIIIIII
`
`1
`
`COMMUNICATION
`5 -----',.
`PATH
`r----------------------
`
`2
`
`-----------
`
`DIFFERENCE
`FILE
`
`30
`
`15
`
`REVISION
`PROCESS
`
`200
`
`NEW* FILE
`
`5
`
`I
`I
`J
`
`FIG. 1
`
`~IIIIIIIIIII1IIIIIIIII :
`
`I
`I
`
`Oracle Exhibit 1004, pg 2
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 2 of 13
`
`5,832,520
`
`OLD FILE
`
`NEW FILE
`
`10
`
`20
`
`TSI
`
`106
`
`lSI
`
`OISD
`
`~
`113
`
`~
`110
`
`105
`
`INDEX
`BUILDER
`
`SEARCH
`ENGINE
`
`3
`
`10
`
`10~
`
`10~
`
`OISD
`BUILDER
`
`COMMAND
`ENCODE
`
`COUNT
`ENCODE
`
`3
`
`FINAL
`DIFFERENCE
`FILE
`
`FIG. 2
`
`Oracle Exhibit 1004, pg 3
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 3 of 13
`
`5,832,520
`
`82
`
`84
`
`FILE
`
`( START THE DIFFIT
`PROCESS•
`I INPUT OLD FILE AND NEW
`!
`IFORM CHECKSUMS FOR 11""""'86
`
`OLD FILE AND NEW FILE
`~
`BUILD THE TSI (TEXT STRING ,--..810
`INDEX) STRUCTURE FOR THE
`OLD FILE
`
`~
`
`812
`
`v-
`
`814
`
`BUILD THE RAW DIFF (DIFFERENCE) FILE
`INCLUDING A SEQUENCE OF COPY
`COMMANDS. INSERT COMMANDS AND
`INSERTION STRINGS
`~
`BUILD THE OISD (OPTIMIZED INSERTION r---
`STRING DATABASE). THE OISD IS A
`DATABASE OF INSERTION STRINGS WITH
`CERTAIN REDUNDANT STRINGS REMOVED.
`~
`ENCODE THE COMMAND SEQUENCE FROM
`816
`THE RAW DIFFERENCE FILE AND PLACE IT ,--..'
`IN THE DIFFERENCE FILE.
`~
`APPEND THE (HUFFMAN) COMMAND
`DECODE TREES AND THE INITIAL COMMAND
`STATE TO THE DIFFERENCE FILE.
`~
`APPEND THE OPTIMIZED INSERTION TEXT
`STRINGS (FROM THE OISD) TO THE
`DIFFERENCE FILE
`
`817
`,--..'
`
`818
`Ir---
`
`+
`ADD A HEADER TO THE DIFF FILE WHICH /--.8
`CONTAINS CHECKSUMS AND OTHER
`INFORMATION ABOUT THE OLD, NEW AND
`DIFF FILES.
`~
`FORM THE CHECKSUM FOR THE DIFF FILE
`AND ADD TO THE HEADER. THE DIFF FILE IS
`NOW COMPLETE. OPTIONALLY COMPRESS
`THE DIFFERENCE FILE.
`
`r---8
`
`20
`
`22
`
`+
`
`DIFFIT PROCESS
`COMPLETED
`
`824
`
`FIG. 3
`
`Oracle Exhibit 1004, pg 4
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 4 of 13
`
`5,832,520
`
`AN EXAMPLE
`OFAN
`UNSORTED
`TSI
`
`AN EXAMPLE
`OFA
`SORTED TSI
`
`FOUR BYTES OF THE OLD
`FILE DATA STARTING AT THE
`OFFSET.
`
`BYTE OFFSET INTO THE OLD
`FILE (FIELD IS 2,3 OR 4 BYTES
`LONG DEPENDING UPON OLD
`FILE SIZE).
`
`1--_11_22_3_3_44.....;1=0=l-
`22334455
`33445566
`
`000001
`000002
`
`1_09--1}106
`
`OADF3452
`....... DF345278
`
`N-4
`N-3
`
`LAST FOUR BYTES IN THE
`FILE.
`
`(N IS THE OLD FILE
`LENGTH)
`
`00000000
`00000000
`00000001
`
`OF3538
`003780
`13BC73D
`
`FFFFFFFE
`FFFFFFFF
`
`003827
`183F56
`
`FIG.4A
`
`Oracle Exhibit 1004, pg 5
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 5 of 13
`
`5,832,520
`
`HASH
`FUNCTION
`
`OLD FILE STRING
`STARTING AT
`POSITION SHOWN
`
`PRODUCES
`POINTER INTO
`THE HASH HEAD
`TABLE
`
`1Q (OLD FILE)
`.1..Q8. (HASH CHAIN TABLE)
`CORRESPONDS TO
`
`123
`
`NULL POINTER
`
`CORRESPONDS TO
`
`CORRESPONDS T
`
`107 (HASH HEAD TABLE)
`
`CORRESPONDS TO
`
`CORRESPONDS TO
`
`x
`
`x
`
`131b
`
`NULL
`
`FIG. 48
`
`Oracle Exhibit 1004, pg 6
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 6 of 13
`
`5,832,520
`
`START BUILD OF THE
`RAW DIFF FILE
`
`INITIALIZE P_OLD AND P_NEW
`TO THE BEGINNING OF EACH
`FILE
`
`SEARCH THE OLD FILE FOR
`r------------~ THE STRING AT THE CURRENT
`POSITION IN THE NEW FILE
`(SEE FIG. 5B).
`
`T2
`
`:r4
`
`COpy
`
`DETERMINE WHETHER
`A COPY, INSERT OR INSERT/COPY
`OPERATION WILL BE PERFORMED
`(AND THE COUNT
`AND POSITION
`INFORMATION).
`
`INSERT
`
`NO
`
`CREATE COpy
`COMMAND AND INSERT
`INTO RAW DIFF FILE
`
`T8
`
`....,T10
`.--__--x
`CREATE INSERT COMMAND
`AND INSERT INTO RAW DIFF
`FILE
`
`UPDATE STATISTICS
`FOR COUNT AND
`POSITION BITS
`REQUIRED FOR THIS
`COpy
`
`T12
`
`NO NO = SEARCH HAS ALREADY
`RETURNED THE NEXT COPY
`AFTER AN INSERT
`
`YES
`
`L . . . I
`
`YES
`~.(__RA_W_D_IF_F_F_IL_E_IS_)I"ll~I----------'1
`_
`COMPLETE
`_
`
`FIG.5A
`
`Oracle Exhibit 1004, pg 7
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 7 of 13
`
`5,832,520
`
`T18
`
`NO
`
`.------''-------,T40
`USE THE TSI OR
`SEQUENTIALLY
`SEARCH THE OLD FILE
`FOR ALL MATCHING
`STRINGS LONGER
`THAN X BYTES
`
`YES
`
`KEEP THE LENGTH
`AND LOCATION OF T60
`THE LONGEST STRING
`FOUND
`
`T62
`
`T58
`
`INCREMENT THE
`INSERT LENGTH
`COUNTER AND OLD
`FILE AND NEW FILE
`POINTERS
`
`T25
`
`NBYTE
`STRING AT P_OLD
`ANDP_NEW
`MATCH?
`
`T35
`
`NO
`
`RETURN LENGTH OF
`STRING WITH "COPY
`IMMEDIATE" STATUS
`
`T85
`RETURN LENGTH OF
`INSERT AND COPY, WITH
`"INSERT AND COPY
`(IMMEDIATE OR NEW
`POSITION)" STATUS
`
`RETURN FROM SEARCH
`FUNCTION
`T90
`
`ATTEMPT
`RESYNCHRONIZATION
`INCREMENT POINTERS, ADD
`8 TO INSERT COST.
`
`YES
`
`NO
`
`IGNORE
`RESYNCHRONlZATION
`AND USE ORIGINAL
`"INSERT AND COPY FROM
`NEW POSITION"
`PARAMETERS
`
`T68
`
`IGNORE THE ORIGINAL
`"INSERT AND COPY
`FROM NEW POSITION",
`AND USE THE "INSERT
`AND COPY IMMEDIATE"
`PARAMETERS
`
`T80
`YES
`~--------i FIND THE LENGTH T75
`OF THE MATCH
`
`FIG. 58
`
`Oracle Exhibit 1004, pg 8
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 8 of 13
`
`5,832,520
`
`W10
`CREATE LONG COUNT (SET THE FIRST
`BYTE TO ZERO, APPEND THE LENGTH,
`USING LONGFIELDWIDTH NUMBER OF
`BYTES)
`
`W50
`CREATE LONG COUNT (SET
`THE FIRST BYTE TO DXeD,
`APPEND THE LENGTH, USING
`LONGFIELDWIDTH NUMBER OF
`BYTES)
`
`START CREATE COpy
`COMMAND
`
`W4
`
`W6
`
`W8
`
`NO
`
`W12
`
`APPEND THE COMMAND TO THE RAW DIFF
`FILE (SEE FIG. 6C FOR COpy COMMAND
`FORMAT)
`
`W14
`
`END OF CREATE COPY
`COMMAND
`
`FIG.6A
`
`START CREATE INSERT
`COMMAND
`
`W44
`
`NO
`
`W46
`
`INSERT COMMAND BYTE
`EQUALS THE LENGTH +
`DXeD
`
`W48
`
`APPEND THE TEXT TO
`BE INSERTED
`
`W52
`
`APPEND THE COMMAND TO THE RAW DIFF
`FILE (SEE FIG. 6D FOR INSERT COMMAND
`FORMAT)
`
`W54
`
`FIG. 68
`
`Oracle Exhibit 1004, pg 9
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 9 of 13
`
`5,832,520
`
`COpy COMMAND FORMATS
`I OXXXXXXX I OLD FILE POSITION
`
`1ST BYTE
`
`# OF BYTES = LONGFIELDWIDTH
`
`SHORT COUNT COPY
`(NORMALIZED LENGTH<=127)
`
`(XXXXXXX = NORMALIZED LENGTH OF THE COPY)
`
`LONG COUNT COpy
`(NORMALIZED LENGTH>127)
`
`# OF BYTES =
`LONGFIELDWIDTH
`
`1ST BYTE
`
`I00000000 I OLD FILE POSITION
`FIG.6C
`
`# OF BYTES =
`LONGFIELDWIDTH
`
`COpy LENGTH
`
`INSERT COMMAND FORMATS
`
`SHORT COUNT INSERT
`(NORMALIZED LENGTH<=127)
`
`1ST BYTE
`
`11XXXXXXX I
`
`(XXXXXXX = NORMALIZED LENGTH OF THE INSERT)
`
`# OF BYTES = LENGTH OF INSERTION TEXT
`
`INSERT TEXT
`
`LONG COUNT INSERT
`(NORMALIZED LENGTH>127)
`
`# OF BYTES =
`LONGFIELDWIDTH
`
`1ST BYTE
`
`110000000 I INSERT LENGTH
`FIG. 60
`
`# OF BYTES = LENGTH OF
`INSERTION TEXT
`
`INSERT TEXT
`
`/'100
`
`Q'11O
`~
`
`o
`
`o 0 0
`
`FIG. 10
`
`Oracle Exhibit 1004, pg 10
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 10 of 13
`
`5,832,520
`
`U2
`
`END
`(OISO COMPLETED)
`
`SEARCH THE RAW DIFF FILE FOR INSERT
`COMMANDS AND COPY STRING LENGTH
`AND A POINTER TO INSERTION STRING
`INDEX (lSI)
`
`U4
`
`FOR ANY INSERT COMMAND
`CODE INDICATING POINTER
`NOT RESOLVED, REPLACE THE
`POINTER WITH AN OISD
`OFFSET
`
`U30
`
`USING THE COMPLETED lSI,
`CHECK THE LENGTH OF EACH
`STRING OF INSERTION TEXT 14-----YES----<
`(START AT THE FIRST STRING)
`
`U10
`
`U12
`
`U26
`
`UPDATE STATISTICS
`FOR COUNT BITS
`REQUIRED FOR THIS
`INSERT
`
`NO
`
`GET THE STRING FROM THE RAW U14
`DIFF FILE AND SEQUENTIALLY
`SEARCH THE EXISTING OISD FOR
`THE STRING
`
`REPLACE THE FIRST BYTE
`OF THE INSERTION TEXT IN
`THE RAW DIFF FILE WITH
`THE COMMAND CODE FOR
`THAT INSERT COMMAND
`
`U2
`
`U 7
`
`NO
`
`USE THE lSI TO SEARCH THE U18
`INSERT STRINGS IN THE RAW
`DIFF FILE AFTER THE CURRENT
`STRING
`
`CHANGE THE lSI POINTER
`TO AN OFFSET INTO THE
`OISD WHERE THE STRING I--.-..l
`IS LOCATED AND SET
`COMMAND CODE TO IMP
`
`U32
`
`CHANGE THE lSI POINTER TO
`POINT TO THE BEGINNING OF
`OTHER lSI ENTRY WITH AN
`OFFSET INDICATING THE
`START OF THE INCLUDED
`STRING AND SET COMMAND
`CODE TO IMP (POINTER NOT
`RESOLVED)
`
`ADD THE STRING TO THE OISD
`AND SET COMMAND CODE TO ICP 1-
`(INSERT AT THE CURRENT
`POSITION)
`
`---1
`
`FIG.7A
`
`Oracle Exhibit 1004, pg 11
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 11 of 13
`
`5,832,520
`
`\ J
`
`:GzW.
`
`.J
`
`151 (INSERTION STRING INDEX) PRIOR
`TO BUILDING 0150
`I
`
`MEMORY
`
`MOSTRE~
`INSER~ENTJ
`
`COMMAND
`
`SECOND INSERT
`COMMAND
`
`------~
`
`FIG. 78
`
`151 (INSERTION STRING INDEX)
`DURING BUILDING OF 0150
`
`/
`
`0150
`I
`
`THE 0150 GROWS U~,.
`••••
`
`/
`
`• •••
`
`\
`
`151 GROWS
`DOWN
`
`FIG.7C
`
`SECOND
`INSERT
`COMMAND
`
`THIRD INSERT
`COMMAND
`
`FIRST
`INSERT
`COMMAND
`
`REPLACE FIRST BYTE OF
`THE INSERTION TEXT
`
`COMMAND CODE.
`
`r.. STRING WITH THE
`
`INSERTION
`TEXT
`STRING
`
`2
`
`CC =COPY COMMANDS
`IC = INSERT COMMANDS
`RAW DIFF FILE DURING CONSTRUCTION OF OISD
`
`ENDS HERE
`
`FIG. 70
`
`Oracle Exhibit 1004, pg 12
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 12 of 13
`
`5,832,520
`
`HUFFMAN
`DECODE
`HEADERI TREES I
`
`INITIAL
`COMMAND
`STATE
`
`I
`
`ENCODED COMMAND
`SEQUENCE
`
`OISD (INSERTION
`TEXT)
`
`CCP rcp CCP CMP IMP CCP CMP CMP -
`
`r
`
`MINIMIZED
`COMMANDS
`
`-
`
`-136749BC3DF67
`
`r
`
`THE OPTIMIZED INSERTION
`STRING DATABASE
`
`I
`
`FIG.8A
`
`NEXT STATE
`
`CCP
`
`0/10·
`0/10·
`
`CMP
`1
`11
`11
`11
`
`CCP
`CMP
`rcp
`IMP
`
`ICP
`
`0
`
`IMP
`0
`10
`
`CURRENT
`STATE
`
`= PRESENT STATE
`(PRESENT COMMAND INTERPRETATION)
`
`=COMMAND CODE THAT TRANSITIONS
`TO THE NEXT STATE
`
`FIG. 88
`
`Oracle Exhibit 1004, pg 13
`
`
`
`u.s. Patent
`
`Nov. 3, 1998
`
`Sheet 13 of 13
`
`5,832,520
`
`405
`
`408
`
`START REVIT PROCESS
`
`DECOMPRESS DIFFERENCE FILE
`
`410 VALIDATE CHECKSUMS OF OLD AND DIFF
`FILES
`
`415
`
`425
`
`L...-
`
`UNPACK HUFFMAN DECODE TREES
`
`INITIALIZE POINTERS
`--,-
`
`-----l
`
`430
`
`UNPACK FIRST COMMAND AND INITIALIZE
`STATE MACHINE
`
`710
`
`UNPACK NEXT
`COMMAND
`
`435
`
`DECODE THE COMMAND TYPE
`
`500
`
`NO 510
`
`COpy "LENGTH
`COUNr' BYTES FROM
`P_OLD FILE TO P_NEW
`AND ADJUST
`POINTERS
`
`COPY "LENGTH
`COUNr' BYTES FROM
`NEW COpy POSITION
`IN OLD FILE TO P_NEW
`AND ADJUST
`POINTERS
`
`COPY "LENGTH
`COUNr' BYTES FROM
`P-'T TO P_NEW AND
`ADJUST POINTERS
`
`COpy "LENGTH
`COUNr' BYTES FROM
`NEW INSERT
`POSITION TO P_NEW
`AND ADJUST
`POINTERS
`
`YES
`
`---
`
`NO
`
`YES
`
`FIG. 9
`
`Oracle Exhibit 1004, pg 14
`
`
`
`5,832,520
`
`5
`
`2
`Another method useful for transmitting updated files is
`using a technique known as a differencing program or
`comparator program to compare an old file to a new revised
`file in order to determine how the files differ. One such file
`system is distributed as part of the GNU UNIX-like oper(cid:173)
`ating system through tools referred to as diff and patch and
`described in standard GNU documentation. The described
`system discusses a way to use a differencing program to
`generate a patch file, and then using that patch file in
`10 combination with the old file to generate a newly revised
`file. While the GNU revision system has some applications
`within the UNIX-like operating system within which it was
`developed, it has not been generalizable in the new envi(cid:173)
`ronment of personal computer systems. The most commonly
`15 available versions of the system are limited to text files, and
`achieve only limited compression. These programs cannot
`effectively handle files where a number of fairly complex
`changes have occurred, such as a number of block moves or
`random shuffling of text strings. These programs also do not
`20 produce the smallest patch file possible.
`What is needed is a method and system for generating a
`difference file from an old file and a new file, where that
`in minimal number of bytes,
`difference file indicates,
`changes between the old file and the new file. The needed
`25 system would allow users to then transmit the difference file
`to a second computer system or to a backup or archive
`storage system (system 2), and to use that difference file and
`the old file along with a decoding process to generate a
`newly revised file. The difference file could also be stored
`30 locally, allowing a number of versions of the same file to be
`saved without duplicating redundant information. Ideally,
`the difference file would be the smallest possible difference
`file, achieving compression density of perhaps 10% or less
`of the original file, even with a moderate number of changes
`35 between the two files.
`
`1
`AUTOMATIC FILE DIFFERENCING AND
`UPDATING SYSTEM
`BACKGROUND OF THE INVENTION
`This application claims the benefit of a provisional appli(cid:173)
`cation Ser. No. 60/021,457, filed Jul. 3, 1996 now pending
`filed with a source code appendix consisting of 22 pages.
`The appendix contains program source code in the C++
`programming language for software modules that embody
`aspects of the current invention.
`The present invention relates to the field of data files used
`by computers. More specifically,
`the present
`invention
`relates to a system for creating, updating or revising large
`computer files by using only a small file containing indica(cid:173)
`tions of the differences between the large computer files and
`a preexisting computer file.
`The present invention is motivated in part by changes that
`have been occurring in the personal computer industry over
`the last several years.
`Increases in performance and
`decreases in cost have led to a proliferation of computer
`equipment in homes and offices. This computer equipment
`has in turn spawned a burgeoning market for software
`modules that cause the equipment to operate in a desired
`manner. In recent years, the software modules have become
`larger and larger as the price of computer memory and the
`storage space needed to hold these software modules have
`become cheaper and cheaper. This has allowed for the
`development and sale of far more complex executable
`program codes to accomplish various functions such as word
`processing, spreadsheets, multimedia or any other use for a
`computer. In addition to executable files, more and more
`complex text and multimedia files, as well as database files,
`are commonly being used and distributed or archived in
`home and office computer systems.
`These large files are distributed from software manufac(cid:173)
`turers to users via a number of different means, including
`being preloaded on a computer's hard drive before the
`computer is purchased, being shipped on a fixed medium
`such as a floppy disk or CD ROM, or being distributed
`through a transmission medium such as a dial-up telephone 40
`service, a BBS, or the Internet.
`It is the nature of computer software and other large files
`that it is often desirable to update or revise files in order to
`correct errors or add features. Sometimes these revisions
`may be relatively minor, involving changes in only a small 45
`percentage of the data that makes up the file.
`One obstacle to the frequent revision of large computer
`files by a manufacturer is the cost of delivering the updated
`If an entire new revised file must be
`file to the user.
`delivered, the amount of data can be substantial. Large files 50
`typically are as large as ten million characters (10
`Megabytes) or larger. Distribution of such files on floppy
`disk can require a relatively large amount of disk space.
`Distribution of such large files over a medium such as the
`Internet can take an undesirably long time from the point of 55
`view of the customer and can consume a large amount of
`server resources from the point of view of the file provider.
`One solution to the problem of distributing large com(cid:173)
`puter files is use of compression. A number of standard
`compression algorithms are in existence and are commonly 60
`used today. These algorithms typically achieve compression
`of a large executable file down to between 40% to 60% of
`its original file size and can compress some types of text files
`even further, thereby reducing the transaction costs of ship(cid:173)
`ping the file. However, for very large computer files or 65
`collections of files, even a compressed file reduced to 40%
`still represents a substantial transmission cost.
`
`SUMMARY OF THE INVENTION
`The present invention comprises a software system with
`several components, a method, and a file structure for
`generating very efficient difference files (sometimes abbre(cid:173)
`viated DIFF file) from an old file and a new file so that a
`difference file can be transmitted to a second computer
`system where the difference file and a duplicate of the old
`file can quickly be used to create a copy of the new file,
`duplicating the new file as it existed on the first computer
`system. The difference file could also be stored locally to the
`first computer system, allowing the new file to be duplicated
`from the old file without storing the new file.
`According to the present invention, a differencing process
`on a first computer system compares an old file and a new
`file to generate a difference file. In this process, the old file
`is used essentially as a database of byte strings. The differ(cid:173)
`encing process reads strings of data from the new file and
`searches for the existence of those strings in the old file. The
`differencing process notes the locations in the old file in
`which strings in the new file are found and stores in a
`difference file an indication of the location where a matching
`string is found and an indication of the length of the
`matching string found in the old file. This information is
`stored in the difference file in a copy command. When the
`differencing process encounters strings of characters in the
`new file that are not found in the old file, the differencing
`process adds those strings to an insert database and adds an
`indication in the difference file of the location in the insert
`database where the strings can be found and the length of the
`string. This location and length are stored in an insert
`command.
`
`Oracle Exhibit 1004, pg 15
`
`
`
`5,832,520
`
`10
`
`3
`According to a specific embodiment of the invention, the
`differencing process, upon opening the old file, creates an
`index (or hash table) of all of the character strings of
`predetermined length found in the old file, along with the
`locations at which those character strings were found, in 5
`order to facilitate searching for character strings from the
`new file. According to a further embodiment, the index (or
`hash table) is created only if the differencing process detects
`that there is sufficient memory on the first computer to hold
`the index.
`The present invention also comprises a file structure for
`the difference file that allows the difference file to hold
`information from which to construct a copy of the new file
`using a duplicate of the old file while occupying the least
`number of bytes. According to specific embodiments, this 15
`difference file is built in a multistep process to minimize the
`size of the difference file.
`The method according to the invention attempts to mini(cid:173)
`mize the size of the difference file by a variety of techniques,
`any group of which may be incorporated into specific 20
`embodiments. According to a specific embodiment,
`the
`entire old file, rather than a limited portion of the data, is
`used as a database. This helps produce a smaller difference
`file by potentially finding more data to copy from the old file.
`In general, as much data as possible is copied from the old 25
`file, unless it takes fewer bits to insert the data. Copying data
`from the old file requires only a command code, whereas
`inserting data requires a command, plus the actual insertion
`data. The invention makes decisions about whether to copy
`or insert data by using a search algorithm that can do the 30
`following: use a "current positions" pointer into the old file
`for copying data, which eliminates an explicit position field
`if data can be copied from the current position; favor
`copying from the "current position" in the old file; search the
`entire old file for data not found at the "current position" and 35
`if the data is found elsewhere in the old file, copy from that
`position (unless it "costs more" than inserting enough data
`to allow the next copy from the "current position"); and
`insert data if a copy was not chosen.
`According to an embodiment, the length of the minimum 40
`data string searched for in the old file is selected as "N" bytes
`and is dependent on where the data is copied from. The copy
`from current position command uses a shorter minimum
`data string than copy from another position because it has
`been encoded to require fewer bits.
`According to further embodiments, commands and count
`fields of the difference file are encoded. Commands may be
`encoded using a "state machine", where certain commands
`are implied by the sequence of previous commands, and
`count fields may be encoded with a "cascaded" count field 50
`method. Huffman encoding of smaller count values may be
`employed to further reduce the difference file size, and
`counts larger than the Huffman-encoded counts use three
`progressively larger count fields. The length of these larger
`count fields is variable for each command type, and is 55
`modified based on statistics gathered during the construction
`of each individual difference file.
`According to further embodiments, all command, count
`and position fields are "bit-packed" to eliminate unused
`"filler" bits, and redundant data strings are removed from the 60
`insertion data before being appended to the difference file.
`The entire minimized difference file (minimized by the
`techniques mentioned above) may be finally compressed
`(using a "well-known" compression algorithm-like "zip"-or
`proprietary compression technique) to reduce the file size.
`According to an embodiment of the invention, execution
`time of the differencing step is important only in that the
`
`45
`
`65
`
`4
`difference method should not take an unreasonable time to
`execute (overnight may be OK in many cases). An index or
`is not
`hash table may be used to speed searching, but
`necessary for the differencing process (and will not be built
`if sufficient memory is unavailable).
`invention is
`In this patent application,
`the present
`described with reference to specific embodiments. It will be
`understood by anyone skilled in the programming art that
`many variations on the basic system and method of the
`present invention are possible within a computer environ(cid:173)
`ment. The invention therefore should not be construed as
`limited except as provided in the attached claims.
`For example, conventional computer systems today
`encode data as a collection of two-state binary units known
`at bits. Most current computers group these bits into 8-bit
`groups known as bytes, also referred to as characters. A
`sequence of bytes or characters is commonly referred to as
`a string. These terms are used in accordance with their
`accepted meaning in the art in this application, but it should
`be understood that the techniques of the invention could be
`used in different types of computing systems having differ(cid:173)
`ent means for encoding and organizing data.
`Also, within the art, the terms "text" and "string" are
`sometimes used in a particular way to describe computer
`encoded alphanumeric data, and at other times these terms
`are used very broadly to denote a sequence of data values
`that could represent anything: text, a number, a piece of an
`image, sound, etc. In the present description, these terms and
`other terms used in the art are intended to be given their
`broadest meaning.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram showing the general differenc(cid:173)
`ing and revision process according to the current invention.
`FIG. 2 is a block diagram of a differencing process and its
`components according to an embodiment of the invention.
`FIG. 3 is a flow chart illustrating a method used by the
`differencing processor according to the present invention.
`FIG. 4A is a diagram of a Text String Index (TSI)
`according to one embodiment of the present invention.
`FIG. 4B is a diagram of a hash table TSI according to a
`preferred embodiment of the present invention.
`FIG. 5A is a flow chart illustrating a method used by the
`differencing processor according to the present invention to
`build the raw difference file.
`FIG. 5B is a flow chart illustrating a method used by the
`differencing processor to search for strings in the old file
`according to the present invention.
`FIGS. 6A-6D show flow charts and command structures
`illustrating copy and insert commands inserted into the raw
`difference file according to the present invention.
`FIGS. 7A-7D show a flow chart and index and file
`structures illustrating creating the optimized insert string
`database for including into the final difference file according
`to the present invention.
`FIGS. SA-SB illustrate the file structure for the final
`difference file and a state machine and command format for
`command encoding in the final difference file according to
`an embodiment of the present invention.
`FIG. 9 is a flow chart showing the method of the revision
`processor according to the current invention.
`FIG. 10 shows a computer system incorporating the
`invention.
`
`Oracle Exhibit 1004, pg 16
`
`
`
`5,832,520
`
`5
`
`5
`DETAILED DESCRIPTION OF THE DRAWINGS
`Overview
`An overview of the process according to the invention is
`illustrated in FIG. 1. FIG. 1 illustrates a first computer
`system 1 and a second computer system 2 which commu-
`nicate via a communication path 5. Both computer systems
`1 and 2 can be any collection of computing devices oper(cid:173)
`ating together whatsoever as is known in the art. Computer
`systems 1 and 2 can also refer to the same computer system,
`or components within the same computer system. Commu- 10
`nication path 5 may be any means by which a file can be
`communicated between computer system 1 and computer
`system 2, including a removable fixed medium such as a
`floppy disk or CD ROM disk, or a communication medium
`such as an interoffice network, a telephone line, a bus, or the
`Internet. Path 5 also might encompass an electronic maillS
`message.
`As shown in FIG. 1, computer system 1 includes an old
`file 10, a new file 20, and a differencing processor 100 which
`generates a difference file 30. New file 20 is generally
`somewhat similar to old file 10, containing some changes in 20
`data which reflect a revision of new file 20. Old file 10 and
`new file 20 could be any collection of computer data
`whatsoever, including an executable application program, an
`operating system program, a text file, a spreadsheet or other
`data file, an image file, or any other type of computer- 25
`readable data.
`Difference processor (DiffIt) 100 reads new file 20 and
`compares it to old file 10 by a process described below.
`Difference process 100 then stores indications of the data to
`be copied from old file 10 or inserted from new file 20 into 30
`difference file 30. According to the invention, when new file
`20 is a revised version of old file 10, difference file 30 will
`be substantially smaller than either new file 20 or old file 10,
`in some cases, only ten percent or less than the size of new
`file 20.
`According to the invention, difference file 30 may then be 35
`transmitted over path 5 to computer system 2 where a
`revision process (RevIt) 200 reads a duplicate 15 of old file
`10 and difference file 30 and creates a copy 25 of new file
`20 on computer system 2. According to the invention, copy
`25 (designated the New* file) is identical to new file 20 40
`created on computer system 1.
`Differencing Process Overview
`FIG. 2 is a block diagram illustrating the process of
`differencing process (Diffit process) 100. Diffit process 100
`uses old file 10 and new file 20 to generate difference file 30. 45
`An optional index builder 105 may be used to build a text
`string index (TSI) 106 to speed building the difference file.
`Search engine 104 reads strings of data from new file 20
`and attempts to locate string matches in old file 10 for each
`string found in new file 20. If an index 106 is present, it is 50
`used by search engine 104 to increase searching speed.
`When search engine 104 finds a match in old file 10 for
`a string from new file 20, it indicates this by placing a copy
`command into a raw difference file 31 (RawDiff). The copy
`command includes an indication of where the string is found 55
`in 10 and the length of the string. When search engine 104
`does not find a match in old file 10 for a string from new file
`20, it indicates this by placing an insert command into a raw
`difference file 31 (RawDifl). The insert command includes
`the text for the string that was not found in old file 10.
`According to an embodiment of the invention, an opti(cid:173)
`mized insertion string database (OISD) engine 101 examines
`the raw DIFF file 31 to create an OISD 110, using an
`insertion string index 113. A command encoder 102 and a
`count encoder 103 encode command codes and count fields 65
`into the final difference file 30 which also includes OISD
`110.
`
`60
`
`6
`The details of the operation of the elements shown in FIG.
`2 according to specific embodiments will now be described.
`Differencing Process Flowchart
`FIG. 3 is a detailed flow chart of the operation of
`differencing processor 100. The process begins (S2) and old
`file 10 and new file 20 are opened by the processor (S4).
`Checksums are formed for both the old file and the new file
`and are stored (S6). The processor then checks to see if there
`is sufficient memory to create a text string index and if so a
`text string index (TSI) is built (S10). A raw difference file is
`created by searching for strings from the new file in the old
`file, using the TSI if one was created (S12).
`After the raw difference file is created, an Optimized
`Insertion String Database (OISD) is generated from the
`insertion text and commands in the raw difference file as
`illustrated in FIGS. 7A, 7B, 7C and 7D (S14). The com(cid:173)
`mands in the raw difference file are encoded to minimize
`their size (S16) by various possible encoding techniques
`including a state-machine for command encoding and Huff(cid:173)
`man encoding for string length values. The encoded com(cid:173)
`mands are then placed in the final difference file (S16). The
`Huffman decode tables and initial state of the command
`decode state machine are then appended to the difference file
`(S17) followed by the OISD (S18). A header is added to the
`difference file containing the check sums and other infor(cid:173)
`mation about the old file, the new file and the difference file
`(S20). Once the difference file is complete a checksum may
`be formed for the completed difference file and that check(cid:173)
`sum added to the header (S22). This final difference file may
`then be compressed using either a proprietary or publicly
`available compression algorithm (S23). The differencing
`process is then complete (S24).
`Building the Text String Index
`According to one embodiment, an index builder 105 may
`be employed to create an index of the old file 10 prior to old
`file 10 being searched for matching new file strings. FIGS.
`4A and 4B illustrate two different examples of text string
`index (TSI) 106 according to specific embodiments of the
`invention. The Text String Index (TSI) is a data structure that
`decreases the search time for text strings in the old file. The
`TSI is an optional element of the present invention; search(cid:173)
`ing of the old file can take place without one. However,
`searching will generally be faster if a TSI is created and
`used.
`FIG. 4Ashows a v