`UNIFIED PATENTS
`
`EXHIBIT 1004
`
`Unfied Patents Exhibit 1004
`Pg. 1
`
`
`
`U8005832520A
`
`Umted States Patent
`[19]
`[11] Patent Number:
`5,832,520
`
`Miller
`[45] Date of Patent:
`Nov. 3, 1998
`
`[54] AUTOMATIC FILE DIFFERENCING AND
`UPDATING SYSTEM
`
`“Comparing and Merging Files”, from the World Wide Web
`(Jun 8, 1996).
`
`[75]
`
`Inventor: William A. Miller, Gualala, Calif.
`
`[73] Assignee: Miller, Cal], Planck and Miller,
`P0rt0]a Valley, Calif.
`
`An b
`' —W
`Ex
`P '
`Am’m’y‘ AW’W” FiyucT 15 “yd
`Home},
`gent, or
`trm— ownsen
`Crew LLP
`
`d T
`ownsen
`
`d
`
`an
`
`a
`
`an
`
`[21] Appl. N0.: 754,486
`[22]
`Filed:
`Nov. 22, 1996
`
`............................
`3
`n .
`.
`[”1]
`I t C] 6
`G06F 17/30
`[52] US. Cl.
`.. 707/203
`
`[58] Field of Search
`......................... 707/203
`
`[56]
`
`References Cited
`
`7
`9,278,979
`3,418,945
`5,604,853
`5,623,656
`5,644,709
`
`US. PATENT DOCUMENTS
`
`11/1994 1105“” at al‘
`707/293
`5/1995 Latter et al.
`7U//8
`
`
`707/540
`[”97 Nagashima
`
`...... 707/10
`/1997 Lyons
`---------------------- 39318701
`7/1997 Austin
`OTHER PUBLICATIONS
`
`Corrnen, et al., Introduction to Algorithms, The MIT Press
`(1990).
`Pocket Soft, Inc., White Paper re .RTPatch Professional
`Binary Update System, from the World Wide Web (http://
`www.pocketsoft.com) (Nov, 14, 1996).
`
`ABSTRACT
`
`[57]
`
`.
`A method and file structure for generating an efficient
`
`:1 erence igllles frornband old filedand a new fille so that a
`
`
`1 erence
`e can. e transmitte
`to a secon
`computer
`system where 1e dlfference file and a duphcate of the old
`
`file can quickly be used 0 create a copy of the new file is
`
`
`disclosed. Adi erencing process compares an old file and-a
`
`
`new file to generate a di erence file in wnch the 01d file 1s
`
`
`
`used as a database of by e strings. The di ‘ereneing process
`.
`reads strings of data from the new file, searches for the
`existence of those strings in the old file, and notes the
`
`locations in the old file in which strings in the new file are
`
`
`found and stores in a di
`erence file an indication of the
`
`location where a matching
`of the length. A specific fi
`is disclosed,
`
`
`
`string is found and an indication
`e structure for the difference file
`
`28 Claims, 13 Drawing Sheets
`
`START BUILD OF THE
`RAW DIFI= FILE
`
`
`
`INITIALIZE P_OLD AND P_NEW T2
`TO THE BEGINNING OF EACH
`
`
`
`SEARCH THE OLD I=II.E FOR
`THE STRING AT THE CURRENT
`POSITION IN THE NEW FILE
`
`(SEE FIG. 53).
`
`
`
`
`DETERMINE WHETHER
`A COPY. INSERT 0R INSERT/COPY
`OPERATION WILL BE PERFORMED
`
`(AND THE COUNT
`
`
`AND POSITION
`
`
`INFORMATION).
`
`
`
`
`INSERT—
`
`CREATE COPY
`COMMAND AND INSERT
`INTO RAW DIFF FILE
`
`CREATE INSERT CDIVMAND
`AND INSERT INTO RAW DIFF
`
` T10
`
`
`
` UPDATE STATISTICS
`NO N0 = SEARCH HAS ALREADY
`FOR COUNT AND
`RETURNED THE NEXT COPY
`POSITION BITS
`AFTER AN INSERT
`
`
`REQUIRED FOR THIS
`COPY
`
`FILE
`
`
`
`RAW DIFF FILE IS
`COMPLETE
`
`
`
`YES
`
`Unfied Patents Exhibit 1004
`
`Pg. 2
`
`Unfied Patents Exhibit 1004
`Pg. 2
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 1 0f 13
`
`5,832,520
`
`\‘V 9V
`
`DUPLICATE OF
`OLD FILE
`
`15
`
`REVISION
`PROCESS
`
`200
`
`FIG. 1
`
`Unfied Patents Exhibit 1004
`
`Pg. 3
`
`lIl|I||l| :
`
`1
`:
`
`|llllIIIlI
`
`Unfied Patents Exhibit 1004
`Pg. 3
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 2 0f 13
`
`5,832,520
`
`- 105
`V eV v
`
`10
`
`BUILDER
`
`TSI
`
`106
`
`04
`
`SEARCH
`ENGINE
`
`9/
`
`
`RAW
`
`
`DIFFERENCE
`
`
`FILE
`
`
`w
`
`20
`
`own
`
`BUlDER
`
`\\13
`
`1o
`
`10
`
`z’/
`
`10
`
`K//
`
`3'
`
`COMMAND
`
`ENCODE
`
`COUNT
`ENCODE
`
`
`
`FILE
`
`FINAL
`DIFFERENCE
`
`FIG. 2
`
`Unfied Patents Exhibit 1004
`
`Pg. 4
`
`Unfied Patents Exhibit 1004
`Pg. 4
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 3 0f 13
`
`5,832,520
`
`START THE DIFFIT
`PROCESS
`
`82
`
`INPUT OLD FILE AND NEW
`FILE
`
`S4
`
`FORM CHECKSUMS FOR {\35
`OLD FILE AND NEW FILE
`
`BUILD THE TSI (TEXT STRING
`INDEX) STRUCTURE FOR THE
`
`OLD FILE
`
`810
`
`BUILD THE RAW DIFF (DIFFERENCE) FILE
`INCLUDING A SEQUENCE OF COPY
`COMMANDS, INSERT COMMANDS AND
`
`INSERTION STRINGS
`
`BUILD THE OISD (OPTIMIZED INSERTION
`STRING DATABASE). THE OISD IS A
`DATABASE 0F INSERTION STRINGS WITH
`CERTAIN REDUNDANT STRINGS REMOVED.
`
`ENCODE THE COMMAND SEQUENCE FROM
`THE RAW DIFFERENCE FILE AND PLACE IT
`IN THE DIFFERENCE FILE.
`
`APPEND THE (HUFFMAN) COMMAND
`DECODE TREES AND THE INITIAL COMMAND
`STATE TO THE DIFFERENCE FILE.
`
`APPEND THE OPTIMIZED INSERTION TEXT
`
`STRINGS (FROM THE OISD) TO THE DIFFERENCE FILE
`
`S12
`
`S14
`
`S16
`
`S17
`
`$18
`
`ADD A HEADER TO THE DIFF FILE WHICH
`
`
`CONTAINS CHECKSUMS AND OTHER
`INFORMATION ABOUT THE OLD. NEW AND
`DIFF FILES.
`
`
`FORM THE CHECKSUM FOR THE DIFF FILE
`AND ADD TO THE HEADER. THE DIFF FILE IS
`NOW COMPLETE. OPTIONALLY COMPRESS
`THE DIFFERENCE FILE.
`
`
`
`$20
`
`$22
`
`DIFFIT PROCESS
`COMPLETED
`
`824'
`
`FIG. 3
`
`Unfied Patents Exhibit 1004
`
`Pg. 5
`
`Unfied Patents Exhibit 1004
`Pg. 5
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 4 0f 13
`
`5,832,520
`
`BYTE OFFSET INTO THE OLD
`FOUR BYTES OF THE OLD
`FILE (FIELD IS 2,3 0R 4 BYTES
`FILE DATA STARTING AT THE
`LONG DEPENDWG UPON OLD
`OFFSET-
`FILE SIZE).
`
`
`
`
`
`1122334410.
`
`oooooo m
`
`}106
`
`
`
`
`
`AN EXAMPLE
`i
`I
`OF AN
`N-4
`OADF3452
`UNSORTED
`
`
`TSI —_
`
`
`
`
`LAST FOUR BYTES IN THE
`FILE.
`
`(N IS THE OLD FILE
`LENGTH)
`
`00000000
`
`. 00000000
`
`00000001
`
`
`
`
`
`
`
`
`
`¢
`
`
`
`
`
`OF3538
`
`003780
`
`13B073D
`
`$
`003827
`
`AN EXAMPLE
`OF A
`SORTED TSI
`
`FIG. 4A
`
`Unfied Patents Exhibit 1004
`
`Pg. 6
`
`Unfied Patents Exhibit 1004
`Pg. 6
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 5 0f 13
`
`5,832,520
`
`
`
`
`
`1 1 5
`
`r‘
`HASH
`FUNCTION
`
`OLD FILE STRING
`STARTING AT
`POSITION SHOWN
`
`m (HASH CHAIN TABLE)
`
`‘I_Q (OLD FILE)
`
`
`PRODUCES
`
`POINTER INTO
`THE HASH HEAD
`
`
`CORRESPONDS T0
`CORRESPONDS TO
`
`NU LL POI NTE R
`
`
`
`CORRESPONDS T0
`
`
`
`
`IIIIIIIIII
`
`
`
`
`
`
`
`CORRESPONDS TO
`
`
`
`
`CORRESPONDS TO
`
`NULL
`
`X
`
`IIIIIII
`
`
`
`FIG. 4B
`
`Unfied Patents Exhibit 1004
`
`Pg. 7
`
`Unfied Patents Exhibit 1004
`Pg. 7
`
`
`
`
`
`
`NO
`
`CREATE COPY
`COMMAND AND INSERT
`
`T10
`
`CREATE INSERT COMMAND
`
`
`AND INSERT INTO RAW DIFF
`
`FILE
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 6 0f 13
`
`5,832,520
`
`START BUILD OF THE
`RAW DIFF FILE
`
`
`
`INITIALIZE P__OLD AND P__NE
`
`
`TO THE BEGINNING OF EACH
`FILE
`
`
`SEARCH THE OLD FILE FOR
`
`THE STRING AT THE CURRENT
`
`
`POSITION IN THE NEW FILE
`
`(SEE FIG. EB).
`
`DETERMINE WHETHER
`
`
`A COPY. INSERT 0R INSERT/COPY
`
`
`OPERATION WILL BE PERFORMED
` INSERT
`
`(AND THE COUNT
`
`AND POSITION
`
`INFORMATION).
`
`
`INTO RAW DIFF FILE
`
` YES
`
`UPDATE STATISTI
`
`
`CS
`FOR COUNT AND
`
`POSITION BITS
`REQUIRED FOR THIS
`
`COPY
`
`
`
`
`
`
`N0 N0 = SEARCH HAS ALREADY
`RETURNED THE NEXT COPY
`AFTER AN INS
`
`ERT
`
`
`
`RAW DIFF FILE IS
`COMPLETE
`
`FIG. 5A
`
`Unfied Patents Exhibit 1004
`
`Pg. 8
`
`Unfied Patents Exhibit 1004
`Pg. 8
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 7 0f 13
`
`5,832,520
`
`START SEARCH
`
`
`INITIALIZE INSERT
`
`
`COUNTER TO ZERO
`
`
`T18
`
`
`
`N BYTE
`T25
`
`
`FIND LENGTH
`STRING AT P_OLD
`
`
`AND P_NEW
`
`MATCH?
`
`USE THE TSI OR
`
`
`SEQUENTIALLY
`
`SEARCH THE OLD FILE
`
`FOR ALL MATCHING
`
`STRINGS LONGER
`
`THAN X BYTES
`
`
`RETURN LENGTH OF
`
`STRING WITH "COPY
`
`IMMEDIATE" STATUS
`
`MATCHING
`
`
`STRINGS
`
`RETURN LENGTH OF
`FOUND?
`
`INSERT AND COPY. WITH
`"INSERT AND COPY
`
`
`KEEP THE LENGTH
`(IMMEDIATE OR NEW
`
`
`
`
`POSITION)" STATUS
`AND LOCATION OF
`INCREMENT THE
`THE LONGEST STRING
`INSERT LENGTH
`
`
`
`FOUND
`COUNTER AND OLD
`
`
`FILE AND NEW FILE
` POINTERS
` CALCULATE THE COPY
`
`COST AND INSERT COST
`
`
`FUNCTION
`
` RETURN FROM SEARCH
`T90
` ATTEMPT
`
`RESYNCHRONIZATION
`INCREMENT POINTERS. ADD
`8 T0 INSERT COST,
`
`INSERT COST
`
`< COPY COST?
`
`
`
`IGNORE
`
`RESYNCHRONIZATION
`
`
`AND USE ORIGINAL
`
`"INSERT AND COPY FROM
`
`DID Z BY'TES
`NEW POSITION"
`
`
`
`MATCH?
`PARAMETERS
`
`
`
`
`
`IGNORE THE ORIGINAL
`
`"INSERT AND COPY
`
`T80
`YES
`
`FROM NEW POSITION",
`FIND THE LENGTH T75
`AND USE THE "INSERT
`
`OF THE MATCH
`AND COPY IMMEDIATE"
`PARAMETERS
`
`
`
`FIG. 58
`
`Unfied Patents Exhibit 1004
`
`Pg. 9
`
`Unfied Patents Exhibit 1004
`Pg. 9
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 8 0f 13
`
`5,832,520
`
`
`
`COMMAND
`
`
`W10
`
` CREATE LONG COUNT (SET THE FIRST
`
`
`LENGTH
`BYTE TO ZERO. APPEND THE LENGTH.
`OF COPY >127
`YES
`USING LONGFIELDWIDTH NUMBER OF
`
`
`BYTES?
`BYTES)
`
`
`
`W4
`
`NO
`
`COPY COMMAND BYTE
`EQUALS THE LENGTH
`
`APPEND THE POSITION
`BYTES
`
`
`W6
`
`
`
`
`
`
`APPEND THE COMMAND TO THE RAW DIFF
`FILE (SEE FIG. GO FOR COPY COMMAND
`FORMAT)
`
`W14
`
`END OF CREATE COPY
`COMMAND
`
`FIG. 6A
`
`START CREATE INSERT
`COMMAND
`
`W44
`
`
`
`LENGTH
`OF INSERT > 127
`BYTES?
`
`W50
`
`CREATE LONG COUNT (SET
`THE FIRST BYTE TO 0X80,
`APPEND THE LENGTH, USING
`LONGFIELDWIDTH NUMBER OF
`BYTES)
`
`
`
`
`
`
`
`
` YES
`
`
`
`
`NO
`
`W46
`
`
`
`
`
`INSERT COMMAND BYTE
`EQUALS THE LENGTH +
`0X80
`
`W48
`
`APPEND THE TEXT TO
`BE INSERTED
`
`APPEND THE COMMAND TO THE RAW DIFF
`FILE (SEE FIG. 60 FOR INSERT COMMAND
`FORMAT)
`
`
`
`
`
`END OF CREATE INSERT
`COMMAND
`
`W12
`
`W52
`
`FIG. 63
`
`Unfied Patents Exhibit 1004
`
`Pg. 10
`
`Unfied Patents Exhibit 1004
`Pg. 10
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 9 0f 13
`
`5,832,520
`
`COPYCOMMANDFORMATS
`IST BYTE
`# 0F BYTES = LONGFIELDWIDTH
`
`SHORT COUNT COPY
`OLD FILE POSITION
`(NORMALIZED LENGTH<=127)
`(XXXXXXX = NORMALIZED LENGTH OF THE COPY)
`
`# 0F BYTES =
`# OF BYTES =
`1ST BYTE
`LONGFIELDWIDTH
`LONGFIELDWIDTH
`LONGCOUNTCOPY ——
`(NORMALIZED LENGTH>127)
`00000000
`OLD FILE POSITION
`COPY LENGTH
`
`FIG. 6C
`
`INSERT COMMAND FORMATS
`
`SHORTCOUNT'NSERT
`(NORMALIZED LENGTH<=127)
`(XXXXXXX = NORMALIZED LENGTH OF THE INSERT)
`
`INSERT TEXT
`
`IST BYTE
`
`# OF BYTES = LENGTH OF INSERTION TEXT
`
`# OF BYTES =
`# 0F BYTES = LENGTH OF
`‘57 BYTE
`LONGFIELDWIDTH
`INSERTION TEXT
`LONGCOUNT'NSERT ——
`(NORMALIZED LENGTH>127)
`10000000
`INSERT LENGTH
`INSERT TEXT
`
`FIG. 60
`
`
`
`FIG. 10
`
`Unfied Patents Exhibit 1004
`
`Pg. 11
`
`Unfied Patents Exhibit 1004
`Pg. 11
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 10 0f 13
`
`5,832,520
`
`‘ '
`B I
`I I
`I
`I
`
`
`(OPTIMIZED INSERTION STRING
`
`
`U2
`
`DATABASE
`
`SEARCH THE RAW DIFF FILE FOR INSERT
`
`
`U4
`COMMANDS AND COPY STRING LENGTH
`
`
`AND A POINTER TO INSERTION STRING
`
`
`INDEX (ISI)
`
`USING THE COMPLETED ISI,
`
`CHECK THE LENGTH OF EACH
`STRING OF INSERTION TEXT
`
`
`(START AT THE FIRST STRING)
`
`U10
`
`
`
`END
`(OISD COMPLETED)
`
`U30
`
`FOR ANY INSERT COMMAND
`CODE INDICATING POINTER
`NOT RESOLVED, REPLACE THE
`POINTER WITH AN OISD
`OFFSET
`
`
`
`YES
`
`MORE ISI ENTRIES?
`
`
`
`
`
`
`
`
` INSERT
`
`
`
`OISD? NO
`U32
`
`GET THE STRING FROM THE RAW [J14
`DIFF FILE AND SEQUENTIALLY
`
`SEARCH THE EXISTING OISD FOR
`
`
`THE STRING
`
`
`STRING
`
`
`FOUND IN THE
`
`
`CHANGE THE ISI POINTER
`TO AN OFFSET INTO THE
`OISD WHERE THE STRING
`IS LOCATED AND SET
`COMMAND CODE TO IMP
`
`USE THE ISI TO SEARCH THE
`INSERT STRINGS IN THE RAw
`DIFF FILE AFI'ER THE CURRENT
`STRING
`
`U18
`
`.
`
`U20
`
`STRING
`
`
`FOUND AS PART 0
`
`
`ANOTHER INSERT STRING
`
`
`IN THE RAW
`
`
`DIFF FILE?
`
`
`
` "
`
`U22
`
`CHANGE THE ISI POINTER TO
`POINT TO THE BEGINNING OF
`OTHER ISI ENTRY WITH AN
`OFFSET INDICATING THE
`START OF THE INCLUDED
`STRING AND SET COMMAND
`CODE TO IMP (POINTER NOT
`RESOLVED)
`
`ADD THE STRING TO THE OISD
`AND SET COMMAND CODE TO ICP
`(INSERT AT THE CURRENT
`
`POSITION)
`
`FIG. 7A
`
`Unfied Patents Exhibit 1004
`
`Pg. 12
`
`U12
`
`
`
`LENGTH <=
`
`
`OSITION LENGTH?
`MINIMUN INSERT
`
`YES
`
`
`
`NO
`
`U26
`
`UPDATE STATISTICS
`FOR COUNT BITS
`REQUIRED FOR THIS
`
`REPLACE THE FIRST BYTE
`
`OF THE INSERTION TEXT IN
`THE RAW DIFF FILE WITH
`THE COMMAND CODE FOR
`THAT INSERT COMMAND
`
`Unfied Patents Exhibit 1004
`Pg. 12
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 11 of 13
`
`5,832,520
`
`ISI (INSERTION STRING INDEX) PRIOR
`TO BUILDING OISD
`
`I:
`Li
`E
`D
`D
`E
`3 E
`3 E
`a g
`5 I2
`EH
`E E
`z
`z
`0
`:9 I
`==
`t 9
`E
`*5
`HIE
`'5
`{3
`LIJE
`ISIGROWS
`0
`{LE
`1'—
`E w
`E
`DOWN
`E E
`E
`E E
`E
`| ‘5 a I ‘ I
`o z
`-'
`8 z
`4
`
`MEMORY
`. . . 0 .
`
`MOST RECENT
`INSERT
`COMMAND
`
`SECOND INSERT
`COMMAND
`
`FIRST INSERT
`COMMAND
`
`ISI (INSERTION STRING INDEX)
`DURING BUILDING OF OISD
`
`o I-
`9 ><
`O E
`9 5
`I— -
`3,4 IE
`1L}... 3
`
`03
`
`a
`a
`3
`'5
`5
`
`O'
`
`9
`-
`E
`5
`E a E
`l‘c's z I-
`c, E "’
`5 5 5
`E E E
`o
`'2 E E z
`5 w :I: 8
`“ZS"-
`
`o I-
`<2 35
`8 "
`E
`t 5
`I:
`g
`“(,1 m
`Lu
`E $
`00.. OZ "
`
`0
`
`\
`THE OISD GROWS UP
`'
`
`F47
`
`0000
`
`OISD
`
`/
`
`
`
`
`
`
`
`DOWN
`INSERT
`SECOND
`
`COMMAND N
`INSERT
`COMMAND
`
`THIRD INSERT
`
`COMMAND
`FIRST
`
`'
`FIG 7C
`
`COMMAND
`INSERT
`
`REPLACE FIRST BYTE OF
`THE INSERTION TEXT
`STRING WITH THE
`COMMAND CODE.
`
`cc
`
`'61
`
`2
`
`INSERTION
`TEXT
`STRING
`
`3
`'c 2
`CC
`CC = COPY COMMANDS
`
`INSERTION
`TEXT
`4
`'c 3
`CC
`STRING
`IC = INSERT COMMANDS
`
`INSERTION
`TEXT
`STRING
`
`' ° ' .
`ENDS HERE
`
`RAW DIFF FILE DURING CONSTRUCTION OF 0150
`
`FIG. 70
`
`Unfied Patents Exhibit 1004
`
`Pg. 13
`
`Unfied Patents Exhibit 1004
`Pg. 13
`
`
`
`US. Patent
`
`Nov. 3, 1998
`
`Sheet 12 0f 13
`
`5,832,520
`
`HUFFMAN
`DECODE
`HEADER TREES
`
`INITIAL
`COMMAND
`STATE
`
`ENCODED COMMAND.
`
`SEQUENCE
`
`OISD (INSERTION
`TEXT)
`
`CCP ICP CCP CMP IMP CCP CMP CMP I I O 3674QBC3DF67 I I O I
`
`I
`
`COMMANDS
`MINIMIZED
`
`T
`
`THE OPTIMIZED INSERTION
`STRING DATABASE
`
`FIG. 8A
`
`NEXT STATE
`
`mm-
`
`CURRENT
`STATE
`
`
`(FOLLOWED BY
`
`ICP)
`
` CCP
`
`
`1 o
`
`o
`
`1 1 Q) = PRESENT STATE
`
`(PRESENT COMMAND INTERPRETATION)
`
`NOTES:
`
`/X\ = COMMAND CODE THAT TRANSITIONS
`TO THE NEXT STATE
`
`FIG. 88
`
`Unfied Patents Exhibit 1004
`
`Pg. 14
`
`Unfied Patents Exhibit 1004
`Pg. 14
`
`
`
`US. Patent
`
`Nov.
`
`3, 1998
`
`Sheet 13 0f 13
`
`5,832,520
`
`405
`
`408
`
`START REVIT PROCESS
`
`DECOMPRESS DIFFERENCE FILE
`
`410
`
`VALIDATE CHECKSUMS OF OLD AND DIFF
`FILES
`
`415
`
`425
`
`430
`
`UNPACK HUFFMAN DECODE TREES
`
`INITIALIZE POINTERS
`
`UNPACK FIRST COMMAND AND INITIALIZE
`STATE MACHINE
`
`710
`
`UNPACK NEXT
`COMMAND
`
`435
`
`DECODE THE COMMAND TYPE
`
` 500
`
`_
`
`660
`UNPACK NEw INSERT
`POSITION OFFSET
`
`DECODE LENGTH
`COUNT
`
`550
`DECODE LENGTH
`COUNT
`
`650
`600
`
`DECODE LENGTH
`DECODE LENGTH
`
`
`COUNT
`
`560
`
`UNPACK NEW COPY
`POSITION OFFSET
`
`570
`
`670
`
`NO 510
`
`COPY "LENGTH
`COPY "LENGTH
`COPY "LENGTH
`COUNT" BYTES FROM
`
`
`COUNT" BYTES FROM
`COUNT" BYTES FROM
`
`NEw COPY POSITION
`
`
`
`P_0LD FILE TO P_NEW
`
`
`NEW INSERT
`COUNT' BYTES FROM
`IN OLD FILE TO P NEw
`AND ADJUST
`
`
`AND ADJUS?
`P_IT TO P_NEW AND
`POSITION TO P_NEW
`
`
`
`POINTERS
`ADJUST POINTERS
`AND ADJUST
`
`POINTERS
`
`
`POINTERS
`
`
`COPY "LENGTH
`
`
`
`
`
`'VE ALL TH
`WAS LAST
`
`
`COMMANDS IN THE
`
`
`COMMAND CCP
`DIFF FILE BEEN
`
`FOLLOWED BY |CP7
`EXECUTED?
`
`
`
`YES
`
`800
`
`VALIDATE CHECKSUM
`FOR THE COMPLETED
`NEW FILE.
`
`
`
`NEW FILE
`850
`
`RECONSTRUCTION IS
`COMPLETED.
`
`
`
`FIG. 9
`
`Unfied Patents Exhibit 1004
`
`Pg. 15
`
`Unfied Patents Exhibit 1004
`Pg. 15
`
`
`
`This application claims the benefit of a provisional appli—
`cation Ser. No. 60/021,457, filed Jul. 3, 1996 now pending
`filed with a source code appendix consisting of 22 pages.
`The appendix contains program source code in the C++
`programming language for software modules that embody
`aspects of the current invention.
`The present invention relates to the field of data files used
`by computers. More specifically,
`the present
`invention
`relates to a system for creating, updating or revising large
`computer files by using only a small file containing indica-
`tions of the differences between the large computer files and
`a preexisting computer file.
`The present invention is motivated in part by changes that
`have been occurring in the personal computer industry over
`the last several years.
`Increases in performance and
`decreases in cost have led to a proliferation of computer
`equipment in homes and ollices. This computer equipment
`has in turn spawned a burgeoning market for software
`modules that cause the equipment to operate in a desired
`manner. In recent years, the software modules have become
`larger and larger as the price of computer memory and the
`storage space needed to hold these software modules have
`become cheaper and cheaper. This has allowed for the
`development and sale of far more complex executable
`program codes to accomplish various functions such as word
`processing, spreadsheets, multimedia or any other use for a
`computer. In addition to executable files, more and more
`complex text and multimedia files, as well as database files,
`are commonly being used and distributed or archived in
`home and office computer systems.
`These large files are distributed from software manufac-
`turers to users via a number of different means, including
`being preloaded on a computer’s hard drive before the
`computer is purchased, being shipped on a fixed medium
`such as a floppy disk or CD ROM, or being distributed
`through a transmission medium such as a dial-up telephone
`service, a BBS, or the Internet.
`It is the nature of computer software and other large files
`that it is often desirable to update or revise files in order to
`correct errors or add features. Sometimes these revisions
`may be relatively minor, involving changes in only a small
`percentage of the data that makes up the file.
`One obstacle to the frequent revision of large computer
`files by a manufacturer is the cost of delivering the updated
`file to the user. If an entire new revised file must be
`delivered, the amount of data can be substantial. Large files _
`typically are as large as ten million characters (10
`Megabytes) or larger. Distribution of such liles on floppy
`disk can require a relatively large amount of disk space.
`Distribution of such large files over a medium such as the
`Internet can take an undesirably long time from the point of
`view of the customer and can consume a large amount of
`server resources from the point of view of the file provider.
`One solution to the problem of distributing large com-
`puter files is use of compression. A number of standard
`compression algorithms are in existence and are commonly
`used today. These algorithms typically achieve compression
`of a large executable file down to between 40% to 60% of
`its original file size and can compress some types of text files
`even further, thereby reducing the transaction costs of ship-
`ping the file. However, for very large computer files or
`collections of files, even a compressed file reduced to 40%
`still represents a substantial transmission cost.
`
`10
`
`o;O
`
`35
`
`40
`
`45
`
`60
`
`1
`AUTOMATIC FILE DIFFERENCING AND
`UPDATING SYSTEM
`BACKGROUND OF THE INVENTION
`
`5,832,520
`
`2
`Another method useful for transmitting updated liles is
`using a technique known as a differencing program or
`comparator program to compare an old file to a new revised
`file in order to determine how the files differ. One such file
`system is distributed as part of the GNU UNIX—like oper—
`ating system through tools referred to as diff and patch and
`
`described in standard GNU documentation. The described
`
`
`
`system discusses a way to use a di erencing program to
`generate a patch file, and then using that patch file in
`combination with the old file to generate a newly revised
`file. While the GNU revision system has some applications
`Within the UN IX-like operating system within which it was
`developed, it has not been generalizable in the new envi-
`ronment of personal computer systems. The most commonly
`available versions of the system are limited to text files, and
`achieve only limited compression. These programs cannot
`effectively handle files where a number of fairly complex
`
`
`changes have occurred, such as a number of block moves or
`
`random shu ling of text strings. These programs also do not
`produce the smallest patch file possible.
`
`What is needed is a method and system for generating a
`
`
`di erence fi e from an old file and a new file, where that
`
`di ference ile indicates,
`in minimal number of bytes,
`changes between the old file and the new file. The needed
`system would allow users to then transmit the difference file
`to a seconc computer system or to a backup or archive
`storage system (system 2), and to use that difference file and
`
`
`the old file along with a decoding process to generate a
`
`
`newly revised file. The di erence file could also be stored
`locally, allowing a number of versions of the same file to be
`saved without duplicating redundant information. Ideally,
`the difference file would be the smallest possible difference
`file, achieving compression density of perhaps 10% or less
`of the original file, even with a moderate number of changes
`between the two files.
`SUMMARY OF THE INVENTION
`
`
`
`
`
`The present invention comprises a software system with
`
`several components, a method, and a file structure for
`
`
`
`generating very e icient difference files (sometimes abbre-
`viated DIFF file) from an old file and a new file so that a
`difference file can be transmitted to a second computer
`system where the difference file and a duplicate of the old
`file can quickly be used to create a copy of the new file,
`duplicating the new file as it existed on the first computer
`system. The difference file could also be stored locally to the
`first computer system, allowing the new file to be duplicated
`from the old file without storing the new file.
`According to the oresent invention, a differencing process
`
`on a first computer system compares an old file and a new
`
`
`
`
`file to generate a di erence file. In this process, the old file
`
`
`is used essentially as a database of byte strings. The di er-
`encing process reaes strings of data from the new file and
`
`searches for the existence of those strings in the old file. The
`di erencing process notes the locations in the old file in
`
`which strings in the new file are found and stores in a
`
`di erence file an inc ication of the location where a matching
`string is found and an indication of the length of the
`matching string found in the old file. This information is
`stored in the difference file in a copy command. When the
`di erencing process encounters strings of characters in the
`new file that are not found in the old file, the differencing
`process adds those strings to an insert database and adds an
`inc ication in the difference file of the location in the insert
`database where the strings can be found and the length of the
`string. This location and length are stored in an insert
`command.
`
`
`
`
`
`
`Unfied Patents Exhibit 1004
`
`Pg. 16
`
`Unfied Patents Exhibit 1004
`Pg. 16
`
`
`
`5,832,520
`
`
`
`3
`According to a specilic embodiment of the invention, the
`differencing process, upon opening the old file, creates an
`index (or hash table) of all of the character strings of
`predetermined length found in the old file, along with the
`locations at which those character strings were found, in
`order to facilitate searching for character strings from the
`new file. According to a further embodiment, the index (or
`hash table) is created only if the differencing process detects
`that there is sufficient memory on the first computer to hold
`the index.
`
`
`The present invention also comprises a file structure for
`
`
`the difference file that allows the di erence file to hold
`information from which to construct a copy of the new file
`using a duplicate of the old file while occupying the least
`number of bytes. According to specific embodiments, this
`difference file is built in a multistep process to minimize the
`size of the difference ile.
`
`The method according to the invention attempts to mini-
`
`mize the size of the di ‘erence file by a variety of techniques,
`any group of which may be incorporated into specific
`embodiments. According to a specific embodiment,
`the
`entire old file, rather han a limited portion of the data, is
`used as a database. This helps produce a smaller dilIerence
`file by potentially finding more data to copy from the old file.
`In general, as much data as possible is copied from the old
`file, unless it takes fewer bits to insert the data. Copying data
`from the old file requires only a command code, whereas
`inserting data requires a command, plus the actual insertion
`data. The invention makes decisions about whether to copy
`or insert data by using a search algorithm that can do the
`following: use a “current positions” pointer into the old file
`for copying data, which eliminates an explicit position field
`if data can be copied from the current position; favor
`copying from the “current position” in the old file; search the
`entire old file for data not found at the “current position” and
`if the data is found elsewhere in the old file, copy from that
`position (unless it “costs more” than inserting enough data
`to allow the next copy from the “current position”); and
`insert data if a copy was not chosen.
`According to an embodiment, the length of the minimum
`data string searched for in the old file is selected as “N" bytes
`and is dependent on where the data is copied from. The copy
`from current position command uses a shorter minimum
`data string than copy from another position because it has
`been encoded to require fewer bits.
`According to further embodiments, commands and count
`fields of the difference file are encoded. Commands may be
`encoded using a “state machine”, where certain commands
`are implied by the sequence of previous commands, and
`count fields may be encoded with a “cascaded” count field
`method. Hullman encoding of smaller count values may be
`employed to further reduce the difference file size, and
`counts larger than the HuJIman—encoded counts use three
`progressively larger count fields. The length of these larger
`count fields is variable for each command type, and is
`modified based on statistics gathered during the construction
`of each individual difference file.
`
`According to further embodiments, all command, count
`and position fields are “bit-packed” to eliminate unused
`
`“filler” bits, and redundant data strings are removed from the
`
`
`insertion data before being appended to the di erence file.
`The entire minimized difference file (minimized by the
`techniques mentioned above) may be finally compressed
`(using a “well-known” compression algorithm- ike “zip”-or
`proprietary compression technique) to reduce the file size.
`According to an embodiment of the invention, execution
`time of the differencing step is important only in that the
`
`
`
`10
`
`o;O
`
`35
`
`40
`
`45
`
`LIIo
`
`60
`
`4
`dillerence method should not take an unreasonable time to
`execute (overnight may be OK in many cases). An index or
`hash table may be used to speed searching, but is not
`necessary for the differencing process (and will not be built
`if sufficient memory is unavailable).
`invention is
`In this patent application,
`the present
`described with reference to specific embodiments. It will be
`understood by anyone skilled in the programming art that
`many variations on the basic system and method of the
`present invention are possible within a computer environ-
`ment. The invention therefore should not be construed as
`limited except as provided in the attached claims.
`For example, conventional computer systems today
`encode data as a collection of two-state biliary units known
`at bits. Most current computers group these bits into 8—bit
`groups known as bytes, also referred to as characters. A
`sequence of bytes or characters is commonly referred to as
`a string. These terms are used in accordance with their
`accepted meaning in the art in this application, but it should
`
`be understood that the techniques of the invention could be
`
`
`
`used in different types of computing systems having di er-
`ent means for encoding and organizing data.
`Also, within the art, the terms “text” and “string” are
`sometimes used in a particular way to describe computer
`encoded alphanumeric data, and at other times these terms
`are used very broadly to denote a sequence of data values
`that could represent anything: text, a number, a piece of an
`image, sound, etc. In the present description, these terms and
`other terms used in the art are intended to be given their
`broadest meaning.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram showing the general differenc-
`ing and revision process according to the current invention.
`FIG. 2 is a block diagram of a differencing process and its
`components according to an embodiment of the invention.
`
`
`FIG. 3 is a flow chart illustrating a method used by the
`
`di
`erencing processor according to the present invention.
`FIG. 4A is a diagram of a Text String Index (TSI)
`according to one embodiment of the present invention.
`FIG. 4B is a diagram of a hash table TSI according to a
`preferred embodiment of the present invention.
`
`FIG. 5A is a flow chart illustrating a method used by the
`i erencing processor according to the present invention to
`build the raw dilIerence file.
`
`FIG. 5B is a flow chart illustrating a method used by the
`' erencing processor to search for strings in the old file
`according to the present invention.
`FIGS. 6A—6D show flow charts and command structures
`i lustrating copy and insert commands inserted into the raw
`' erence file according to the present invention.
`FIGS. 7A77D show a flow chart and index and file
`structures illustrating creating the optimized insert string
`catabase for including into the final difference file according
`to the present invention.
`FIGS. SA—SB illustrate the file structure for the final
`
`
`
`ci erence file and a state machine and command format for
`command encoding in the final difference file according to
`an embodiment of the present invention.
`FIG. 9 is a flow chart showing the method of the revision
`processor according to the current invention.
`FIG. 10 shows a computer system incorporating the
`invention.
`
`
`
`
`
`Unfied Patents Exhibit 1004
`
`Pg. 17
`
`Unfied Patents Exhibit 1004
`Pg. 17
`
`
`
`5,832,520
`
`10
`
`5
`DETAILED DESCRIPTION OF THE DRAWINGS
`Overview
`An overview of the process according to the invention is
`illustrated in FIG. 1. FIG. 1 illustrates a first computer
`system 1 and a second computer system 2 which commu—
`nicate via a communication path 5. Both computer systems
`1 and 2 can be any collection of computing devices oper-
`ating together whatsoever as is known in the art. Computer
`systems 1 and 2 can also refer to the same computer system,
`or components within the same computer system. Commu-
`nication path 5 may be any means by which a file can be
`communicated between computer system 1 and computer
`system 2, including a removable fixed medium such as a
`Iloppy disk or CD ROM disk, or a communication medium
`such as an interoffice network, a telephone line, a bus, or the
`Internet. Path 5 also might encompass an electronic mail
`message.
`As shown in FIG. 1, computer system 1 includes an old
`file 10, a new file 20, and a differencing processor 100 which
`generates a difference file 30. New file 20 is generally
`somewhat similar to old file 10, containing some changes in
`data which reflect a revision of new file 20. Old file 10 and
`new file 20 could be any collection of computer data
`whatsoever, including an executable application program, an
`operating system program, a text file, a spreadsheet or other
`data file, an image file, or any other type of computer-
`
`readable data.
`
`
`
`Di erence processor (Difflt) 100 reads new file 20 and
`compares it to old file 10 by a process described below.
`DiJIerence process 100 then stores indications of the data to
`be copied from old file 10 or inserted from new file 20 into
`difference file 30. According to the invention, when new file
`20 is a revised version of old file 10, dill'erenc/e file 30 will
`be substantially smaller than either new file 20 or old file 10,
`in some cases, only ten percent or less than the size of new
`file 20.
`According to the invention, difference file 30 may then be
`transmitted over path 5 to computer system 2 where a
`revision process (RevIt) 200 reads a duplicate 15 of old file
`10 and difference file 30 and creates a copy 25 of new file
`20 on computer system 2, According to the invention, copy
`25 (designated the New* file) is identical to new file 20
`created on computer system 1.
`DiJIerencing Process Overview
`FIG. 2 is a block diagram illustrating the process of
`differencing process (DiffIt process) 100. DiftIt process 100
`uses old file 10 and new file 20 to generate difference file 30.
`An optional index builder 105 may be used to build a text
`string index (TSI) 106 to speed building the difference file.
`Search engine 104 reads strings of data from new file 20
`and attempts to locate string matches in old file 10 for each
`string found in new file 20, If an index 106 is present, it is .
`used by search engine 104 to increase searching speed.
`When search engine 104 finds a match in old file 10 for
`a string from new file 20, it indicates this by placing a copy
`command into a raw difference file 31 (RawDiff), The copy
`command includes an indication of where the string is found
`in 10 and the length of the string. When search engine 104
`does not find a match in old file 10 for a string from new file
`20, it indicates this by placing an insert command into a raw
`difference file 31 (RawDiff). The insert command includes
`the text for the string that was not found in old file 10.
`According to an embodiment of the invention, an opti—
`mized insertion string database (OISD) engine 101 examines
`the raw DIFF file 31 to create an OISD 110, using an
`insertion string index 113. A command encoder 102 and a
`count encoder 103 encode command codes and count fields
`into the final diIIerence file 30 which also includes OISD
`110.
`
`
`
`6
`The details of the operation of the elements shown in FIG.
`2 according to specific embodiments will now be described.
`Dilerencing Process Flowchart
`FIG. 3 is a detailed flow chart of the operatio