`
`DROPBOX EX. 1007
`
`
`
`
`
`Efficient Distributed Backup with Delta Compression
`
`Randal C. Burns
`Department of Computer Science
`IBM Almaden Research Center
`randal@almaden.ibm.com
`
`Darrell D. E. Longf
`Department of Computer Science
`University of California, Santa Cruz
`darrell@cs.ucsc.edu
`
`Abstract
`
`Inexpensive storage and more powerful processors have re-
`sulted in a proliferation of data that needs to be reliably
`backed up. Network resource limitations make it increas-
`ingly difficult to backup a distributed file system on a nightly
`or even weekly basis. By using delta compression algo-
`rithms, which minimally encode a version of a file using only
`the bytes that have changed, a backup system can compress
`the data sent to a server. With the delta backup technique,
`we can achieve significant savings in network transmission
`time over previous techniques.
`Our measurements indicate that file system data may, on
`average, be compressed to within 10% of its original size
`with this method and that approximately 45% of all changed
`files have also been backed up in the previous week. Based
`on our measurements, we conclude that a small file store on
`the client that contains copies of previously backed up files
`can be used to retain versions in order to generate delta files.
`To reduce the load on the backup server, we implement a
`modified version storage architecture, version jumping, that
`allows us to restore delta encoded file versions with at most
`two accesses to tertiary storage. This minimizes server work-
`load and network transmission time on file restore.
`
`1
`
`Introduction
`
`Currently, file system backup takes too long and has a pro-
`hibitive storage cost. Resource constraints impede the reli-
`able and frequent backup of the increasing amounts of data
`spinning on disks today. The time required to perform back-
`up is in direct proportion to the amount of data transmitted
`
`tThe work of this author was performed while a Visiting Scientist at the
`IBM Almaden Research Center.
`
`Pcm+sion to make digitnl/l~nrd copk of all or part ofthis mnkrial Ibr
`Ill:11 tllc capics
`pcrsonnl or classroom USC is granted willlout I k pr&dCd
`are not made or distributed for prolit or comnvrckd ndvar~~ngc. IIW copy-
`right notice, tla title of lhe publicalion and ils thlc :ippcnr. and nolicc is
`given lllnt copyright is by pcnnksion of Ihe AChl.
`Ills. ‘I’0 Cupy otll~nuise,
`to republish, to post on sc\r~r\rs or to rcdistrihuk 10 lists. rquircs spccilic
`pcnnission and/or fk
`IOPADS 97 Son *Jose (2 1JS4
`1997.4CM o-89791-966-l/97/1
`copyight
`
`I..SB.W
`
`over the network from backup clients to a backup server. By
`using delta compression, compactly encoding a file version
`as a set of changes from a previous version, our backup sys-
`tem architecture reduces the size of data to be transmitted,
`reducing both time to perform backup and storage required
`on a backup server.
`Backup and restore can be limited by both network band-
`width, often 10 Mb/s, and poor throughput to tertiary storage
`devices, as slow as 500 KBls to tape storage. Since resource
`limitations frequently make backing up changed files infca-
`sible over a single night or even weekend, delta file com-
`pression has great potential to alleviate bandwidth problems
`by using available client processor cycles and disk storage to
`reduce the amount of data transferred. This technology cn-
`ables us to perform file backup over lower bandwidth chan-
`nels than were previously possible, for example a subscrip
`tion based backup service over the Internet.
`Early efforts to reduce the amount of data to bc backed up
`produced a simple optimization, incremental backup, which
`backs up only those files that have been modified since the
`end of the last epoch, the period betsvcen two backups. While
`incremental backup only detects changes at the granularity
`of a file, delta backup refines this concept, transmitting only
`the altered bytes in the files to be incrementally backed up
`[12]. Consequently, if only a few bytes arc modified in a
`large file, the backup system saves the expense of transmit-
`ting the large file in favor of transmitting only the changed
`bytes.
`Recent advances in differencing algorithms [ 1,4], allow
`nearly optimally compressed encodings of binary inputs in
`linear time. We use such an algorithm to generate delta cn-
`codings of versions.
`A differencing algorithm finds and outputs the changes
`made between two versions of the same file by locating com-
`mon strings to be copied and unique strings to be added cx-
`plicitly. A delfafire (A) is the encoding of the output of a
`differencing algorithm. An algorithm that creates a delta file
`takes as input two versions of a file, a rej%rencefile and a ver-
`sionjle to be encoded, and outputs a delta file rcprcscnting
`the modifications made between versions:
`
`Dropbox Ex. 1007
`
`
`
`Reconstruction, the inverse operation, requires the reference
`file and a delta file to rebuild a version:
`
`A backup system can leverage delta files to generate min-
`imal file representations (see Figure 1). We enhanced the
`client/server architecture of the AdStar Distributed Storage
`Manager (ADSM) backup system [5] to transmit delta files
`when a backup client has retained two versions of the same
`file. Furthermore, both uncompressed files and delta en-
`coded files still realize benefits from the standard file com-
`pression methods that ADSM already utilizes [I]. We inte-
`grate delta backup into ADSM and have a backwardly com-
`patible system with optimizations to transmit and store a re-
`duced amount of data at the server.
`The server storage and network transmission benefits are
`paid for in the client processor cycles to generate delta files
`and in additional disk storage at a backup client to retain sec-
`ond copies of files that are used to generate delta files. This
`architecture optimizes the network and storage bottleneck at
`the server in favor of the distributed resources of a server’s
`multiple clients.
`We will describe previous work in the use of delta com-
`pression for version storage in $2. Our modiications to the
`ADSM system architecture are presented in 53. In 54, we
`describe the delta storage problem in the presence of a large
`number of versions. In $5, we analyze the performance of
`the version jumping storage method and compare it to the
`optimally compact storage method of linear delta chains. In
`56 potential future work is discussed and we present our con-
`clusions in $7.
`
`2 Origins of Delta Backup
`
`Delta backup emerged from many applications, the first in-
`stance appearing in database technology. The database pages
`that are written between backup epochs are marked as “dirty’
`using a single bit [ 10, 173. At the end of an epoch, only
`the dirty pages need to be backed up. This concept paral-
`lels delta backup but operates only at page granularity. For
`file systems, there are no guarantees that modifications have
`page alignment. While dirty bits are effective for databases,
`they may not apply well to file system backup.
`To improve on the granularity of backup, logging meth-
`ods have been used to record the changes to a database [9,
`141 and a file system [16]. A logging system records every
`write during an epoch to a log file. This can be used to re-
`cover the version as it existed at the start of an epoch into its
`current state. While semantically similar to delta compres-
`sion, logging does not provide the compressibility guaran-
`tees of differencing algorithms. In the database example, a
`r
`
`27
`
`log or journal of changes grows in proportion to the number
`of writes made. If the same bytes are modified more than
`once or if adjacent bytes are modified at different times, this
`will not be a minimal representation. For an extremely active
`page, the log will likely exceed the page size. Differential
`compression also has the guarantee that a log of modifica-
`tions to a tile will be no smaller than the corresponding delta
`[41.
`Recently, several commercial systems have appeared on
`the marketplace that claim to perform delta backup [ 1 I, 6,
`181. While the exact methods these systems use have not
`been disclosed, the product literature implies that they either
`perform logging [ 1 I] or difference at the granularity of a file
`block [18, 61. We perform delta backup at the granularity
`of a byte. By running differential compression algorithms
`on the changed versions at this granularity, we generate and
`backup minimal delta files.
`
`2.1 Previous Work in Version Management
`Despite file restoration being the infrequent operation for
`backup and restore, optimizing this process has great util-
`ity. Often, restore is performed when file system compo-
`nents have been lost or damaged. Unavailable data and non-
`functional systems cost businesses, universities and home
`users lost time and revenue. This contrasts the backup oper-
`ation which is generally performed at night or in other low
`usage periods.
`Restoring files that have been stored using delta backup
`generates additional concerns in delta file management. Tra-
`ditional methods for storing deltas require the decompres-
`sion agent to examine either all of the versions of a given
`file [ 131 or all versions in between the first version and the
`version being restored [15]. In either case, the time to re-
`construct a given file grows at least linearly (see $4.1) with
`respect to the number of versions involved.
`A backup system has the additional limitation that any
`given delta file may reside on a physically distinct media
`device and device access can be slow, generally several sec-
`onds to load a tape in a tape robot [8]. Consequently, having
`many distinct deltas participate in the restoration of a single
`file becomes costly in device latency. An important goal of
`our system is to minimize the number of deltas participating
`in any given restore operation.
`Previous efforts in the efficient restoration of file versions
`have provided restoration execution times independent of the
`number of intermediate versions. These include methods
`based on AM Dags [7], linked data structures [19, 21, or
`string libraries [3]. However, these require all delta versions
`of a given file to be present at restore time and are conse-
`quently infeasible for a backup system. Such a backup sys-
`tem would require all prior versions of a file to be recalled
`from long term storage for that file to be reconstructed.
`As previous methods in efficient restoration fail to ap-
`
`/
`
`I
`
`I
`
`I :
`
`.
`
`I
`
`t ’
`i
`
`Dropbox Ex. 1007
`
`
`
`ADSM
`Delta
`Client
`
`Delta Compression Agent
`RCfCLYUC
`
`DcRa
`
`,
`:
`
`so&p
`
`\
`
`Server Hlemchlcal
`Storage Manager
`
`ADSM
`Delta
`
`Fue
`SYSlOll
`
`\
`
`I
`
`Relerence
`Fik
`StOlX
`
`\
`
`/
`
`.
`
`!
`
`Figure 1: Client/server schematic for a delta backup system.
`
`ply to the backup and restore application, we describe a new
`technique called version jumping and an associated system
`architecture. Version jumping takes many delta versions off
`of the same reference version and consequently requires at
`most two files to perform the restore operation on a given
`version. The restoration time is also independent of the total
`number of stored versions.
`
`3 System Architecture
`
`We modified the architecture and design of the AdStar Dis-
`tributed Storage Manager (ADSM) from IBM to add delta
`compression capabilities. The modified ADSM client delta
`compresses file data at the granularity of a byte and trans-
`mits delta files to the server. The modified ADSM server has
`enhanced file deletion capabilities to ensure that each delta
`file stored on the server also has the corresponding reference
`file.
`ADSM is a client/server backup and restore system cur-
`rently available for over 30 client platforms. The ADSM
`server is available for several operating systems including
`Windows NT and various UNIX and mainframe operating
`systems. The key features of ADSM are scheduled policy
`management for file backup, both client request and server
`polling, and hierarchical storage management of server me-
`dia devices including tape, disk drive and optical drives.
`
`3.1 Delta Compression at the Client
`We have modified the standard ADSM client to perform delta
`backup and restore operations. The modifications include
`the differencing and reconstruction algorithms (see 3 1) and
`the addition of a referencefile store.
`In order to compute deltas, the current version of the file
`must be compared with a reference version. We have the
`choice of storing the reference version on the client or fetch-
`
`ing it from the server. Obtaining the reference file from the
`server is unviable since it would increase both network traf-
`fic and server load, adversely affecting the time to perform
`the backup. By storing the reference file at the client, WC in-
`cur a small local storage cost in exchange for a large benefit
`in decreased backup time.
`We considered several options for maintaining the refcr-
`ence files, including copy-on-write, shadowing, and file sys-
`tem logging. Each of these options had to be rejected since
`they violated the design criterion that no file system modifi-
`cations could be made. Since ADSM supports more than
`30 client operating systems, any file system modification
`presents sign&ant portability and maintenance concerns.
`Instead, we chose to keep copies of recently backed up files
`in a reference file store.
`The reference file store is a reserved area on disk whcrc
`reference versions of files are kept (see Figure 1). When
`sending an uncompressed file to its server, the backup client
`copies this file to its reference file store. At this point, the file
`system holds two copies of the same file: one active version
`in fde system space and a static reference version in the rcf-
`erence file store. When the reference file store fills, the client
`selects a file to be ejected. We choose the victim file with a
`simple weighted least recently used (LRU) tcchniquc, In a
`backup system, many files are equally old, as they have been
`backed up at the same epoch. In order to discern among
`these multiple potential victims, our backup client uses an
`additional metric to weight tiles of equal LRU value WC
`select as a victim the reference file that achieved the worst
`relative compression on its previous usage, i.e. the file with
`the highest delta file size to uncompressed file size ratio at
`the last backup. This allows us to discern from many potcn-
`tial victims to increase the utility of our refercnco file store,
`At the same time, this weighting does not violate the tried
`and true LRU principle and consequently can bc cxpcctcd to
`realize all of the benefits of locality.
`
`28
`
`Dropbox Ex. 1007
`
`
`
`.2 c
`5 i%
`s 5
`,o
`z!
`
`0.6
`0.5
`0.4
`
`0.3
`
`0.2
`0.1
`nl
`"0
`
`20
`
`40
`
`60
`
`80
`Days
`
`I
`100 120 140 160
`
`Figure 2: Fraction of files modiied today also modified 1,4
`and 7 days ago.
`
`When the backup client sends a file that has a prior ver-
`sion in the reference file store, the client delta compresses
`this file and transmits the delta version. The old version of
`the file is retained in the reference file store as a common
`reference tile for a version jumping system.
`We do not expect the storage requirements of the refer-
`ence file store to be excessive. In order to evaluate the ef-
`fectiveness of the reference file store, we collected data for
`five months from the file servers at the School of Engineer-
`ing at the University of California, Santa Cruz. Each night,
`we would search for all files that had either been modified
`or newly created during the past 24 hours. Our measure-
`ments indicate that of approximately 2,356,400 files, less
`than 0.55% are newly created or modiied on an average day
`and less than 2% on any given day. These recently modified
`files contain approximately 1% of the file system data.
`In Figure 2, we estimate the effectiveness of the reference
`file store using a sliding window. We considered windows of
`l-7 days. The z-axis indicates the day in the trace, while the
`y-axis denotes the fraction of files that were modiied on that
`day and that were also created or modified in that window.
`Files that are not in the window are either newly created or
`have not been modified recently. An average of 45% and a
`maximum of 87% of files that are modified on a given day
`had also been modified in the last 7 days. This locality of file
`system access verifies the usefulness of a reference file store
`and we expect that a small fraction, approximately 5%, of
`local file system space will be adequate to maintain copies
`of recently modified file versions.
`
`3.2 Delta Versions at the Server
`
`In addition to our client modifications, delta version stor-
`age requires some enhanced file retention and file garbage
`collection policies at a backup server, For example, a naiive
`
`server might discard the reference files for delta files that can
`be recalled. The files that these deltas encode would then be
`lost from the backup storage pool. The file retention poli-
`cies we will describe require additional metadata to resolve
`the dependencies between a delta file and its corresponding
`reference file.
`A backup server accepts and retains files from backup
`clients (see Figure 1). These files are available for restora-
`tion by the clients that backed them up for the duration of
`their residence on the server. Files that are available for
`restoration are active tiles. A typical file retention policy
`would be: “hold the four most recent versions of a file.”
`While file retention policies may be far more complex, this
`turns out to have no bearing on our analysis and consequent
`requirements for backup servers. We only concern ourselves
`with the existence of deletion policies on an ADSM server
`and that files on the server are only active for a finite number
`of backup epochs.
`To reconstruct a version file to a client from a backup
`server, when the server has a delta encoding of this file, the
`client must restore both the delta encoding and the corre-
`sponding reference file. Under this constraint, a modified
`retention policy dictates that a backup server must retain all
`reference files for its active files that are delta encoded. In
`other words, a file cannot be deleted from the backup server
`until all active files that use it as a reference file have also
`been deleted. This relationship easily translates to a depen-
`dency digmph with dependencies represented by directed
`edges and files by nodes. This digraph is used both to encode
`dependent files which need to bc retained and to garbage col-
`lect the reference files when the referring delta versions are
`deleted.
`For a version chain storage scheme, we may store a delta
`file, AA*,A~, which depends on file AZ. However, the backup
`server may not store AZ. Instead it stores a delta file repre-
`sentation. In this event, we have AA~,A~ depend upon the
`delta encoding of As, AA, ,A* (see Figure 3(a)).
`The dependency digraphs for delta chains never branch
`and do not have cycles. Furthermore, each node in the di-
`graph has at most one inbound edge and one outbound edge.
`In this example, to restore file Al, the backup server needs
`to keep its delta encoding, AA*,A~, and it needs to be able
`to construct the file it depends upon, AZ. Since, we only
`have the delta encoding of AZ, we must retain this encod-
`ing, AA,,A~, and all files it requires to reconstruct A2. The
`obvious conclusion is: given a dependency digraph, to re-
`construct the file at a node, the backup server must retain all
`files in nodes reachable from the node to be reconstructed.
`For version jumping, as in version chains, the version
`digraphs do not branch. However, any node in the version
`digraph may have multiple incoming edges, i.e. a file may
`be a common reference file among many deltas (see Figure
`3(b)).
`The dependency digraphs for both version jumping and
`
`Dropbox Ex. 1007
`
`
`
`(a) Simple delta chain.
`
`(AA~.A;)
`
`(b) Jumping version chain.
`
`Figure 3: Dependency graphs for garbage collection and
`server deletion policies.
`
`delta chains are acyclic. This results directly from delta
`tiles being generated in a strict temporal order; the refer-
`ence file for any delta file is always from a previous backup
`epoch. Since these digraphs are acyclic, we can implicitly
`solve the dependency problem with local information. We
`decide whether or not the node is available for deletion with-
`out traversing the dependency digraph. This method works
`for all acyclic digraphs.
`At each node in the dependency digraph, i.e. for each
`file, we maintain a pointer to the reference file and a refer-
`ence counter. The pointer to the reference file indicates a
`dependency; a delta encoded version points to its reference
`file and an uncompressed file has no value for this pointer.
`The reference counter stores the number of inbound edges
`to a node and is used for garbage collection. A node has no
`knowledge of what files depend on it, only that dependent
`files exist.
`When backing up a delta file, we require a client to trans-
`mit its file identifier and the identifier of its reference file.
`When we store a new delta file at the ADSM server, we store
`the name of its reference file as its dependent reference and
`initialize its reference counter to zero. We must also update
`the metadata of the referenced file by incrementing its refer-
`ence counter. With these two metadata fields, the modiied
`backup server has enough information to retain dependent
`files and can guarantee that the reference files for its active
`files are available.
`
`3.3 Two-phase deletion
`
`When storing delta files, the backup server often needs to
`retain files that are no longer active. We modify deletion in
`this system to place files into three states: active files, inac-
`tive dependent files, and inactive independent files. Inactive
`dependent files are those files that can no longer be restored
`as a policy criterion for deletion has been met but need to bc
`retained as active delta versions depend on them. Inactive
`independent files are those files suitable for removal from
`storage as they cannot be restored and no active delta files
`depend on them.
`We add one more piece of metadata, an activity bit, to
`each file backed up on our server. When a file is received
`at the sewer, it is stored as usual and its activity bit is set.
`This file is now marked as active. The backup server implc-
`ments deletion policies as usual. However, when a file rc-
`tention policy indicates that a file can no longer be restored,
`we clear the file’s activity bit instead of deleting it. Its state
`is now inactive. Clearing the activity bit is the first phase of
`deletion. If no other files depend on this newly inactive file,
`it may be removed from storage, the second phase.
`Marking a file inactive may result in other files becoming
`independent as well. The newly independent reference file is
`garbage collected through the reference pointer of the delta
`file. The following rules govern file deletion:
`l When a file has no referring deltas, i.e. its rcfcrcncc
`counter equals zero, delete the file.
`l When deleting a delta file, decrement the refercncc
`counter of its reference file and garbage collect the rcf-
`erence file if appropriate.
`Reference counting, reference file pointers, and activity
`bits correctly implement the reachability dependence rela-
`tionship. The two phase deletion technique operates locally,
`never traversing the implicit dependency digraph, and con-
`sequently incurs little execution time overhead. The backup
`server only traverses this digraph when restoring files, It
`follows dependency pointers to determine the set of files rc-
`quired to restore the delta version in question.
`
`4 Delta Storage and Transmission
`
`The time to transmit files to and from a server is directly pro-
`portional to the amount of data sent. For a delta backup and
`restore system, the amount of data is also related to the man-
`ner in which delta files are generated. We develop an analy-
`sis to compare the version jumping storage method with stor-
`ing delta files as version to version incremental changes. WC
`show that version jumping pays a small compression penalty
`for file system backup when compared to the optimal linear
`delta chains. In exchange for this lost compression, version
`jumping allows a delta file to be rebuilt with at most two
`accesses to tertiary storage.
`
`30
`
`,.
`, 1...
`
`.-r
`.:.<‘.,,
`
`E-
`
`_, :I j,:
`-._.
`
`..‘.‘:
`__
`
`.\
`.-.>:
`
`.
`
`,Y-‘.
`j
`
`..‘i-.
`,
`
`,..,;
`-:,-;.
`
`?.I
`
`\
`
`_
`.
`
`(
`
`*:
`
`.
`,*
`
`_r--
`
`$y------T--T.
`
`,
`
`:
`
`4:
`
`,-,,;7-
`-,
`
`.
`
`*
`
`,
`
`”
`
`8
`
`_I-
`
`,.T$
`-.
`
`,
`
`Dropbox Ex. 1007
`
`
`
`The version storage problem for backup and restore dif-
`fers due to special storage requirements and the distributed
`nature of the application. Since AJXM stores files on mul-
`tiple and potentially slow media devices, not all versions of
`a file are readily available. This unavailability and other de-
`sign considerations shape our methods for storing delta ver-
`sions. If a backup system stores files on a tape robot then,
`when reconstructing a file, the server may have to load a
`separate tape for every delta it must access. Access to each
`tape may require several seconds. Consequently, multiple
`tape motions are intolerable. Our version jumping design
`guarantees that at most two accesses to the backing store are
`needed to rebuild a delta file.
`Also, we minimize the server load by performing all of
`the differencing tasks on the client. Since a server processes
`requests from many clients, a system that attempted to run
`compression algorithms at this location would quickly be-
`come processor bound and achieve poor data throughput.
`For client side differencing, we generate delta files at the
`client and transmit them to the server where they are stored
`unaltered. Using client differencing, the server incurs no ad-
`ditional processor overhead.
`
`4.1 Linear Delta Chains
`
`Serializability and lack of concurrency in file systems result
`in each file having a single preceeding and single following
`version. This linear sequence of versions forms a history of
`modifications to a file which we call a version chain. We
`now develop a notation for delta storage and analyze a linear
`version chain stored using traditional methods [15].
`Linear delta chains are the most compact version storage
`scheme as the inter-version modification are smallest when
`differencing between consecutive versions. We will use the
`optimal@ of linear delta chains later as a basis to compare
`the compression of the version jumping scheme.
`We denote the uncompressed
`ith version of a file by
`Vi. The difference between two versions Vi and Vi is in-
`dicated by A(,,,).
`The file A(K,v~) should be considered
`the differentially compressed encoding of Vj with respect to
`Vi such that Vj can be restored by the inverse differencing
`operation applied to Vi and A(% ,Q) . We indicate the differ-
`encing operation by
`
`and the inverse differencing or reconstruction operation by
`
`For our analysis, we consider a linear sequence of ver-
`sions of the same file that continues indefinitely,
`
`vl,%...
`
`,K-1,%K+1,...
`
`.
`
`The traditional way to store this version chain as a series
`of deltas is, for two adjacent versions Vi and Vi+l, to store
`the difference between these two files,-Avi,vi+, [13]. This
`produces the following “delta chain”
`
`Under this system, to reconstruct an arbitrary version J$, the
`algorithm must apply the inverse differencing algorithm re-
`cursively for all intermediate versions 2 through i. This re-
`lation can be compactly expressed as a recurrence. Vi rep-
`resents the contents of the ith version of a file and & is the
`recurrent file version. So when rebuilding Vi, Vi = & and
`
`Ri = ~-l(A(vl.-l,v&L~);
`
`RI = K.
`
`The time required to restore a version depends upon the time
`to restore all of the intermediate versions. In general, restora-
`tion time grows linearly in the number of intermediate ver-
`sions. In a system that retains multiple versions, the cost of
`restoring the most remote version quickly becomes exorbi-
`tant.
`
`4.2 Reverse Delta Chains
`
`Some version control systems solve the problem of long delta
`chains with reverse delta chain storage [15]. A reverse delta
`chain keeps the most recent version of a file present and un-
`compressed. The version chain is then stored as a set of
`backward deltas. For most applications, the most recent ver-
`sions are accessed far more frequently than older versions
`and the cost of restoring an old version with many interme-
`diate versions is offset by the low probability of that version
`being requested.
`We have seen that linear delta chains fail to provide ac-
`ceptable performance for delta storage (see $4.1). Our de-
`sign constraints of client-side differencing and two tape ac-
`cesses for delta restore also eliminate the use of reverse delta
`chains. We show this by examining the steps taken in a re-
`verse delta system to transmit the next version to the backup
`server.
`At some point, a server stores a reverse delta chain of the
`form
`
`By convention, Vi is created by modification of Vi-l. For
`versions Vi and Vj in a linear version chain, these versions
`are adjacent if they obey the property j - i = 1 and an
`intermediate version Vk obeys the property i < L < j.
`
`the
`In order to backup its new version of the file, V&,
`client generates a difference file, A(vn,v,,+l) and transmits
`this difference to the server. However, this delta is not the file
`that the server needs to store. It needs to generate and store
`
`Dropbox Ex. 1007
`
`
`
`A(v,+~,~,). Upon receiving A~v,,,v,++ the server must ap-
`ply the difference to V, to create Vn+l and then run the dif-
`ferencing algorithm to create A(v,+r,v,). To update a single
`version in the reverse delta chain, the server must store two
`new files, recall one old file, and perform both the differ-
`encing and reconstruction operations. Reverse delta chains
`fail to meet our design criteria as they implement neither
`minimal server processor load nor reconstruction with the
`minimum number of participating files.
`
`4.3 Version Jumping Delta Chains
`
`Our solution to the version storage implements what we call
`jumpirlg deltas. This design uses a minimum number of files
`for reconstruction and performs differencing on the backup
`client.
`In a version jumping system, the server stores versions in
`a modified forward delta chain with an occasional whole file
`rather than a delta. Such a chain looks like:
`
`K, A(v,,vz), A(v,,vs), - - - , A(K,I+-~% A,,,,,,,,
`
`. - -
`
`Storing this sequence of files allows any given version to be
`reconstructed by accessing at most two files from the version
`chain.
`When performing delta compression at the backup client,
`the files transmitted to the server may be stored directly with-
`out additional manipulation. This immediate storage at the
`server limits the processor overhead associated with each
`client session and optimizes the backup server.
`An obvious concern with these methods is that one ex-
`pects compression to degrade when taking the difference be-
`tween two non-adjacent versions, i.e. for versions Vi and
`Vj, IA(vi,~jjl’ increases as j - i increases. Since the com-
`pression is likely to degrade as the version distance, j - i,
`increases, we require an occasional whole file to limit the
`maximum version distance. This raises the question: what is
`the optimal number of versions between whole files?
`
`5 Performance Analysis
`
`We analyze the storage and transmission cost of backing up
`files using a version jumping policy. We already know that
`version jumping far outperforms other version storage meth-
`ods on restore, since it requires only two files to be accessed
`from tertiary storage to restore a delta file. Now, by showing
`that the compression loss with version jumping is small as
`compared to linear delta chains, the optimal storage method
`for delta compression, we conclude version jumping to be a
`superior policy.
`The analysis of transmission time and server storage is
`identical, since our backup server immediately stores all files,
`
`‘For a file If, we use IV1 to denote the size of the file. Since files are
`one dimensional streilllls of bytes, this is synonymous to the length of V.
`
`including deltas, that it receives and transmission time is in
`direct proportion to the amount of data sent. We choose to
`examine storage for ease of understanding the conten