Weighted Voting for Replicated Data
David K. Gifford
Stanford University and Xerox Palo Alto Research Center

In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r + w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file's voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.

Key Words and Phrases: weighted voting, replicated data, quorum, file system, file suite, representative, weak representative, transaction, locking, computer network

CR Categories: 4.3, 4.35, 4.33, 3.81
The work reported here was supported in part by the Xerox Corporation, and by the Fannie and John Hertz Foundation.
Author's present address: Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1979 ACM 0-89791-009-5/79/1200/0150 $00.75
1. Introduction

The requirements of distributed computer systems are stimulating interest in keeping copies of the same information at different nodes in a computer network. Replication of data allows information to be located close to its point of use, either by statically locating copies in high use areas, or by dynamically creating temporary copies as dictated by demand. Replication of data also increases the availability of data, by allowing many nodes to service requests for the same information in parallel, and by masking partial system failures. Thus, in some cases, the cost of maintaining copies is offset by the performance, communication cost, and reliability benefits that replicated data affords.
We present a new algorithm for the maintenance of replicated files. The algorithm can be briefly characterized by the following description:
• Every copy of a replicated file is assigned some number of votes.
• Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r + w is greater than the total number of votes assigned to the file.
• This ensures that there is a non-null intersection between every read quorum and every write quorum (a short sketch of this property follows the list). There is always a subset of the representatives of a file whose votes total to w that are current.
• Thus, any read quorum that is gathered is guaranteed to have a current copy.
• Version numbers make it possible to determine which copies are current.
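The intersection guarantee can be checked mechanically. The following Python fragment is an editorial illustration, not part of the original paper: for an assumed vote assignment it enumerates every set of representatives carrying at least r votes and every set carrying at least w votes, and confirms that they always share a member when r + w exceeds the total vote count.

from itertools import combinations

# Hypothetical vote assignment for three representatives (not one of the paper's examples).
votes = {"rep1": 2, "rep2": 1, "rep3": 1}
total = sum(votes.values())
r, w = 2, 3            # read and write quorum sizes; r + w > total must hold
assert r + w > total

def quorums(threshold):
    """All sets of representatives whose combined votes reach the threshold."""
    reps = list(votes)
    for k in range(1, len(reps) + 1):
        for subset in combinations(reps, k):
            if sum(votes[x] for x in subset) >= threshold:
                yield set(subset)

# Every read quorum must intersect every write quorum.
for rq in quorums(r):
    for wq in quorums(w):
        assert rq & wq, "quorum intersection violated"
print("every read quorum intersects every write quorum")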
The algorithm has a number of desirable properties:
• It continues to operate correctly with inaccessible copies.
• It consists of a small amount of extra machinery that runs on top of a transactional file system. Although "voting" occurs, as will become evident later in the paper, no complicated message-based coordination mechanisms are needed.
• It provides serial consistency. In other words, it appears to each transaction that it alone is running. The most current version of data is always provided to a user.
• By manipulating r, w, and the voting structure of a replicated file, a system administrator can alter the file's performance and reliability characteristics.
• All of the extra copies of a file that are created, including temporary copies on users' local disks, can be incorporated into our framework.
The remainder of the paper is organized as five sections. Section 2 describes related work, and how the algorithm differs from previous solutions. The algorithm's environment, interface, and basic structure are introduced in Section 3. Refinements are offered in Section 4, including the introduction of temporary copies and a new locking technique. The Violet System, which contains an implementation of this proposal, and some performance considerations are discussed in Section 5. The final section is a brief conclusion. The appendix demonstrates that our algorithm maintains serial consistency [1].

The ideas in this paper are illustrated in Mesa, a programming language developed at the Xerox Palo Alto Research Center [8]. Mesa is well suited for this task because it contains integrated support for processes, monitors, and condition variables [6]. To simplify this presentation some nonessential details have been omitted from the Mesa examples.
2. Related Work

Previous algorithms for maintaining replicated data fall into two classes. Some insist that every object has a primary site which assumes responsibility for update arbitration. Distributed INGRES [10] is such a system. This technique is simple, but relatively inflexible. Others do not employ distinguished sites for objects, and are more complex than primary site algorithms. SDD-1 [9] keeps all copies of an object up to date by sending updates via a communication system that will buffer messages over machine crashes. Thomas' proposal [11] only requires that a majority of an object's copies be updated, and includes voting.

Although we share the notion of voting, it is difficult to directly compare our algorithm with Thomas' because the two provide different services. Notably:
- We guarantee serial consistency for queries (read-only transactions), while Thomas' algorithm does not.
- We do not insist that a majority of an object's copies be updated.
- Thomas' algorithm does not employ weighted votes, which limits its flexibility.
- Thomas' algorithm is more complex because it addresses consistency issues as well as replication issues. We have separated the two, resulting in an algorithm that is easier to reason about and to implement.
Our structure allows for the inclusion of temporary copies.
3. The Basic Algorithm

3.1 Environment
The concepts necessary for the implementation of our algorithm are modeled below as a stable file system. In Section 3.3 we build our algorithm for replicated data assuming the existence of such a system.

Our exposition uses two kinds of objects, files and containers. Files are arrays of bytes, addressed by read and write operations as described below. Containers are storage repositories for files: they are intended to represent storage devices such as disk drives. These objects, and others introduced later in the paper, have unique names. No two objects will ever be assigned the same name, even if they are on different machines. We will not concern ourselves further with how programs acquire names, but will assume that the names of containers and files of interest are at hand.

A file is logically an array of bytes that can be created, deleted, read, and written.
File.Create: PROCEDURE [container: Container.ID] RETURNS [file: File.ID];
File.Delete: PROCEDURE [file: File.ID];
File.Read: PROCEDURE [file: File.ID, startByte, count: INTEGER, buffer: POINTER];
File.Write: PROCEDURE [file: File.ID, startByte, count: INTEGER, buffer: POINTER];
To keep the discussion simple, we assume that file system primitives operate on remote and local files alike. This can be accomplished by encoding a file's location or container in its unique identifier, or by maintaining location hints for remote files. These details will not be considered further.

Transactions are used to define the scope of concurrency control and failure recovery. A transaction is a group of related file operations bracketed by a begin transaction call and a commit transaction call.

Transaction.Begin: PROCEDURE;
Transaction.Commit: PROCEDURE;
A transaction hides concurrency by making it appear to its file operations that there is no other activity in the
system, a property known as serial consistency [1]. A transaction hides undesirable events that can be recovered from, such as a detected disk read error, or a server crash. A transaction also guarantees that either all of its write operations are performed, or none of them are. Furthermore, once a transaction has committed, its effects must be resilient to hardware failures, such as a server crash. Every process has a single current transaction. Thus, for an application program to use two transactions it must create at least two processes. A forked process can join its parent's transaction by calling:

Transaction.JoinParentsTransaction: PROCEDURE;
A file may be unavailable if the server it resides on is down, or if there is a communication failure. If a read operation is directed to a file that is unavailable, the corresponding File.Read call will never return. Multiple processes are used by our algorithm to allow it to proceed in this case. Outstanding uncompleted reads, because they never occurred, do not affect the ability of a transaction to commit. The transaction system only guarantees serial consistency for reads that have actually completed when the transaction is committed. Likewise, if a write operation is directed to a file that is unavailable, the corresponding File.Write call will never return. However, a transaction that attempts to commit with unfinished writes will remain uncommitted until all of its writes complete.

It is possible that a user will want to abort a transaction in progress. A transaction abort, which can be initiated by a user as shown below, will discard all of a transaction's writes, and terminate the transaction.

Transaction.Abort: PROCEDURE;
It is also possible that the file system will spontaneously abort a transaction because of a server crash, communication failure, or lock conflict.

This concludes our model set of primitive objects and operations. The model abstracts a confederation of cooperating computers into a structure that has uniform naming and a distributed transactional file system. As we shall see in following sections, the abstractions introduced here make the replication algorithm straightforward to explain. Of course we believe that the model that we have described is realizable and practical; in fact, the ideas necessary for an implementation have received a great deal of attention. Gray [4] provides a nice discussion of two-phase commit protocols, locking, and synchronization primitives. Lampson and Sturgis [5, 7] describe an implemented system that has all of the properties our model requires.
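For concreteness, the stable file system model above can be summarized as a small interface. The Python sketch below is an editorial aid, not part of the paper; the class name and in-memory representation are invented, and a real system would add containers, transactions, durable storage, two-phase commit, and locking as described by Gray [4] and Lampson and Sturgis [5, 7].

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class StableFileSystem:
    """Toy stand-in for the transactional file system assumed in Section 3.1."""
    files: Dict[str, bytearray] = field(default_factory=dict)

    def create(self, file_id: str) -> str:
        # Transactions, containers, and crash recovery are deliberately omitted.
        self.files[file_id] = bytearray()
        return file_id

    def read(self, file_id: str, start: int, count: int) -> bytes:
        return bytes(self.files[file_id][start:start + count])

    def write(self, file_id: str, start: int, data: bytes) -> None:
        buf = self.files[file_id]
        if len(buf) < start + len(data):
            buf.extend(b"\0" * (start + len(data) - len(buf)))
        buf[start:start + len(data)] = data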
3.2 Interface
Our algorithm uses the facilities described in Section 3.1 to provide an abstraction called a file suite. This is a file that is realized by a collection of copies, which we call representatives because of the democratic way in which update decisions are reached. When a file suite is created, a description of its configuration must be supplied, which includes the number of representatives, r, w, the containers where they should be stored, and the number of votes each should be accorded.

Configuration: TYPE = RECORD [
  r: INTEGER,
  w: INTEGER,
  v: ARRAY OF RECORD [container: Container.ID, votes: INTEGER]];
File.CreateSuite: PROCEDURE [configuration: Configuration] RETURNS [suite: File.ID];
File.CreateSuite stores a suite's configuration in stable storage. The structures stored would depend on the algorithm's implementation, but Figure 1 shows one possible alternative. A suite is cataloged by directory entries, preferably more than one in case one of them is unavailable. Each representative has a prefix that identifies all the other representatives in the suite and their voting strength.

Once created, a file suite can be treated like an ordinary file. The File.Read, File.Write, and File.Delete operations specified in Section 3.1 can be used to manipulate the abstract array of bytes represented by a file suite. Like file operations, all file suite operations are part of some transaction. A file suite appears to be an ordinary file in almost every respect.

Differences arise because a file suite can have performance and reliability characteristics that are impossible for a file. It is possible to tailor the reliability and performance of a file suite by manipulating its voting configuration. A high performance suite results by heavily weighting high performance representatives, and a very reliable suite results by heavily weighting reliable representatives. A file suite can also be made very reliable by having many equally weighted representatives. A completely decentralized structure results from equally weighting representatives, and a completely centralized scheme results from assigning all of the votes to one representative. Thus the algorithm falls into both of the classes described in Section 2.
Once the general reliability and performance of a suite is established by its voting configuration, the relative reliability and performance of Read and Write can be controlled by adjusting r and w. As w decreases, the reliability and performance of writes increases. As r decreases, the reliability and performance of reads increases. The choice of r and w will depend on an application's read to write ratio, the cost of reading and writing, and the desired reliability and performance.

The following examples suggest the diverse mix of properties that can be created by appropriately setting r and w. In the table below we assume that the probability that a representative is unavailable is .01.

Example 1 is configured for a file with a high read to write ratio in a single server, multiple user environment. Replication is used to enhance the performance of the system, not the reliability. There is one server on a local network that can be accessed in 75 milliseconds. Two users have chosen to make copies on their personal disks
by creating weak representatives, or representatives with no votes (see Section 4.1 for a complete discussion of weak representatives). This allows them to access the copy on their local disk, resulting in lower latency and less traffic to the shared server.

Example 2 is configured for a file with a moderate read to write ratio that is primarily accessed from one local network. The server on the local network is assigned two votes, with the two servers on remote networks assigned one vote apiece. Reads can be satisfied from the local server, and writes must access the local server and one remote server. The system will continue to operate in read-only mode if the local server fails. Users could create additional weak representatives for lower read latency.

Example 3 is configured for a file with a very high read to write ratio, such as a system directory, in a three server environment. Users can read from any server, and the probability that the file will be unavailable is very small. Updates must be applied to all copies. Once again, users could create additional weak representatives on their local machines for lower read latency.
                             Example 1    Example 2    Example 3
Latency (msec)
  Representative 1               75           75           75
  Representative 2               65          100          750
  Representative 3               65          750          750
Voting Configuration          <1,0,0>      <2,1,1>      <1,1,1>
r                                 1            2            1
w                                 1            3            3
Read
  Latency (msec)                 65           75           75
  Blocking Probability      1.0×10⁻²     2.0×10⁻⁴     1.0×10⁻⁶
Write
  Latency (msec)                 75          100          750
  Blocking Probability      1.0×10⁻²     1.0×10⁻²     1.0×10⁻²
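The blocking probabilities in the table can be reproduced with a short calculation. The Python fragment below is an editorial illustration, not part of the paper: it assumes each representative fails independently with probability .01 and enumerates which up/down patterns of the representatives can still supply the required quorum of votes.

from itertools import product

P_DOWN = 0.01  # assumed probability that any one representative is unavailable

def blocking_probability(votes, quorum):
    """Probability that the available representatives cannot supply `quorum` votes."""
    blocked = 0.0
    # Enumerate every up/down pattern of the representatives.
    for pattern in product([True, False], repeat=len(votes)):
        p = 1.0
        available = 0
        for up, v in zip(pattern, votes):
            p *= (1 - P_DOWN) if up else P_DOWN
            if up:
                available += v
        if available < quorum:
            blocked += p
    return blocked

# Example 2 from the table: votes <2,1,1>, r = 2, w = 3.
print(f"read blocking  ~ {blocking_probability([2, 1, 1], 2):.1e}")   # ~2.0e-04
print(f"write blocking ~ {blocking_probability([2, 1, 1], 3):.1e}")   # ~1.0e-02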
3.3 The Algorithm

We present the basic algorithm in prose and fragments of Mesa code. The prose is meant to be a complete explanation, with the Mesa code provided so the reader can check his understanding of the ideas. All the Mesa procedures shown below are part of a single monitor called FileSuite. There is a separate instance of FileSuite for each transaction accessing a given suite. ENTRY procedures manipulate shared data, and thus lock the monitor. Careful use of public non-entry procedures has been made so the monitor is never locked while input or output is in progress, allowing FileSuite to process simultaneous requests.

FileSuite: MONITOR [suiteName: File.ID] = BEGIN
VersionNumber: TYPE = {unknown, 1, 2, 3, 4, ...};
Set: TYPE = ARRAY OF BOOLEAN;
SuiteEntry: TYPE = RECORD [
  name: File.ID,
  versionNumber: VersionNumber,
  votes: INTEGER];
suite: ARRAY OF SuiteEntry;      -- one entry per representative
currentVersionNumber: VersionNumber;
firstResponded: BOOLEAN;         -- true when the first representative has responded
r: INTEGER;                      -- read quorum of votes for this suite
w: INTEGER;                      -- write quorum of votes for this suite

When FileSuite is instantiated, the number of representatives, their names, their version numbers, their voting strengths, r, and w must be copied from some representative's prefix into the data structure shown above. This information must be obtained with the same transaction that is later used to access the file suite, in order to guarantee that it accurately reflects the suite's configuration. Additional information, such as the speed of a representative, has been omitted from a SuiteEntry to make the basic algorithm easier to understand.

To read from a file suite, a read quorum must be gathered to ensure that a current representative is included. After a file suite is first accessed, collecting a quorum never encounters any delays. The operation of the collector which gathers a quorum is described in detail below. From the quorum, any current representative can actually be read. Ideally, one would like to read from the representative that will respond fastest.

-- read from the file suite
Read: PROCEDURE [file: File.ID, firstByte, count: INTEGER, buffer: POINTER] =
BEGIN
quorum: Set ← CollectReadQuorum[];
best: INTEGER ← SelectFastestCurrentRepresentative[quorum];
-- read from the fastest representative that is current
File.Read[suite[best].name, firstByte, count, buffer];
END;
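For readers who prefer a modern notation, here is an editorial Python sketch of the same read path; it is not the paper's implementation. The class and field names are invented: each representative is a dict with name, votes, version, latency_ms, and storage keys. The logic mirrors the Mesa fragment above: gather representatives in speed order until r votes are collected, then read from the fastest member of the quorum whose version number is current.

class FileSuiteSketch:
    """Editorial sketch of the read path; not the paper's Mesa implementation."""

    def __init__(self, reps, r, w):
        # reps: list of dicts with keys name, votes, version, latency_ms, storage
        self.reps, self.r, self.w = reps, r, w

    def collect_read_quorum(self):
        """Accumulate representatives in speed order until r votes are gathered."""
        quorum, votes = [], 0
        for rep in sorted(self.reps, key=lambda x: x["latency_ms"]):
            quorum.append(rep)
            votes += rep["votes"]
            if votes >= self.r:
                return quorum
        raise RuntimeError("read quorum unavailable")

    def read(self, start, count):
        quorum = self.collect_read_quorum()
        # The intersection property guarantees at least one current copy is present.
        current_version = max(rep["version"] for rep in quorum)
        current = [rep for rep in quorum if rep["version"] == current_version]
        best = min(current, key=lambda x: x["latency_ms"])   # fastest current copy
        return bytes(best["storage"][start:start + count])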
To write to a file suite, a write quorum is assembled; all of the representatives in the quorum must be current so updates are not applied to obsolete representatives. All of the writes to the quorum are done in parallel. The first write of a transaction increments the version numbers of its write quorum. Thus, all subsequent writes will be directed to the same quorum, because it will be the only one that is current. Determining which write is the first one must be done under the protection of the monitor, and is not shown in the Mesa code. With the procedure below, the result of issuing two concurrent writes that update the same portion of a file is undefined.
-- write to the file suite
Write: PROCEDURE [file: File.ID, firstByte, count: INTEGER, buffer: POINTER] =
BEGIN
quorum: Set ← CollectWriteQuorum[];
i, n: INTEGER ← 0;
process: ARRAY OF PROCESS;
-- fork a process to write each representative in the write quorum
FOR i IN [1..LENGTH[suite]]
DO
  IF quorum[i] THEN
  BEGIN
    n ← n + 1;
    process[n] ← FORK
      RepresentativeWrite[i, firstByte, count, buffer];
  END;
ENDLOOP;
FOR i IN [1..n]
DO
  JOIN process[i];
ENDLOOP;
END;

RepresentativeWrite: PROCEDURE [i, firstByte, count: INTEGER, buffer: POINTER] =
BEGIN
-- this procedure runs in a new (forked) process for the duration of the write
Transaction.JoinParentsTransaction[];
UpdateVersionNumber[i];
File.Write[suite[i].name, firstByte, count, buffer];
END;
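As a further editorial aid, here is a Python rendering of the same write path under the data layout used in the earlier sketch (representatives as dicts on a FileSuiteSketch). It is a simplification, not the paper's code: threads stand in for Mesa processes, the quorum is assumed to contain only current representatives, and the bookkeeping that restricts the version increment to the first write of a transaction is reduced to a comment.

import threading

def collect_write_quorum(suite):
    """Gather current representatives until w votes are reached."""
    current = max(rep["version"] for rep in suite.reps)
    quorum, votes = [], 0
    for rep in sorted(suite.reps, key=lambda x: x["latency_ms"]):
        if rep["version"] == current:
            quorum.append(rep)
            votes += rep["votes"]
            if votes >= suite.w:
                return quorum
    raise RuntimeError("write quorum unavailable")

def write(suite, start, data):
    """Editorial sketch: apply one write to every representative in a write quorum."""
    quorum = collect_write_quorum(suite)
    # In the paper only the FIRST write of a transaction bumps the version numbers;
    # this sketch omits that bookkeeping.
    new_version = quorum[0]["version"] + 1

    def write_one(rep):
        rep["version"] = new_version          # analogous to UpdateVersionNumber
        buf = rep["storage"]
        if len(buf) < start:
            buf.extend(b"\0" * (start - len(buf)))
        buf[start:start + len(data)] = data

    threads = [threading.Thread(target=write_one, args=(rep,)) for rep in quorum]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                              # analogous to Mesa JOIN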
It is possible that a representative will become unavailable while a file suite is in use, perhaps due to a server crash. A simple solution to this problem, not shown in the procedures above, is to abort the current transaction if Read or Write take more than a specified length of time. This will restart the suite, as described below.

Quorum sizes are the minimum number of votes that must be collected for read and write operations to proceed. It is possible to increase the performance of a file suite by artificially expanding a quorum with additional representatives. Once again, to reduce complexity, the procedures shown above do not use this approach.

When a file suite is first accessed, version number inquiries are sent to representatives. The information that results is used as the basis for future collector decisions. To determine the correct value of a file suite's current version number a read quorum must be established before the file suite can entertain requests. All representatives might not contain the current voting rules, but the algorithm will stabilize with the correct rules before a read quorum is established, as shown in Section 4.6. If a representative is unreachable its version number read will never return. This does not prohibit a user's transaction from committing, as described in Section 3.1.
-- send version number inquiries to every representative
InitiateInquiries: PROCEDURE =
BEGIN
i: INTEGER;
-- fork one inquiry per representative
FOR i IN [1..LENGTH[suite]]
DO
  Detach[FORK Inquiry[i]];
ENDLOOP;
[] ← CollectReadQuorum[];
END;

Inquiry: PROCEDURE [i: INTEGER] =
BEGIN
-- this procedure runs in a new (forked) process
Transaction.JoinParentsTransaction[];
-- find out the state of representative i
NewRepresentative[ReadPrefixInformation[i]];
END;

ReadPrefixInformation: PROCEDURE [i: INTEGER] RETURNS [i, version, rP, wP: INTEGER, v: ARRAY OF INTEGER] =
BEGIN < read version number, r, w, and array of voting strengths from the prefix of representative i >

NewRepresentative: ENTRY PROCEDURE [i, version, rP, wP: INTEGER, v: ARRAY OF INTEGER] =
BEGIN
j: INTEGER;
-- note what we now know about representative i
suite[i].versionNumber ← version;
-- adopt the most recent voting configuration
IF version > currentVersionNumber THEN
BEGIN
  currentVersionNumber ← version;
  r ← rP; w ← wP;
  FOR j IN [1..LENGTH[suite]]
  DO
    suite[j].votes ← v[j];
  ENDLOOP;
END;
firstResponded ← TRUE;
BROADCAST CrowdLarger;
END;
The collector is used by every file suite operation to gather a quorum of representatives. Normally the collector selects what it considers to be the quorum that will respond the fastest, and returns immediately to its caller. Occasionally one of two problems will arise. First, it is possible that a read quorum of the suite's representatives have not reported their version numbers. In this case the collector can only wait for one of them to report in. The second potential problem is that a read quorum have reported their version numbers, but there is not a current write quorum. This can only occur if some representatives have not reported their version numbers. In this case if r < w the collector will initiate a background process to copy the contents of the suite into one of the obsolete representatives that has reported in. It is always legal to copy the current contents of the file suite to an obsolete representative. Note that the copy process will be reading from the suite, in effect a recursive call, but there will be enough votes for this read-only operation to proceed. To minimize lock conflicts the background process should be run in a separate transaction. The background process signifies its completion by breaking the transaction of its parent.
-- the collector waits on this condition for more representatives to report in
CrowdLarger: CONDITION;

CollectReadQuorum: ENTRY PROCEDURE RETURNS [quorum: Set] =
BEGIN
i, j, votes: INTEGER;
index: ARRAY OF INTEGER;
-- wait for the first representative to report in; it supplies the voting configuration
UNTIL firstResponded DO WAIT CrowdLarger; ENDLOOP;
-- keep trying to assemble a quorum from the representatives that have reported,
-- waiting for more representatives whenever there are not yet enough votes
DO
  index ← SortRepresentativesBySpeed[];
  quorum ← ALL[FALSE]; votes ← 0;
  -- take representatives in speed order until r votes are gathered
  FOR i IN [1..LENGTH[suite]]
  DO
    j ← index[i];
    IF suite[j].versionNumber # unknown THEN
    BEGIN
      quorum[j] ← TRUE;
      votes ← votes + suite[j].votes;
      IF votes >= r THEN RETURN[quorum];
    END;
  ENDLOOP;
  -- not enough votes yet; wait for another representative to report
  WAIT CrowdLarger;
ENDLOOP;
END;
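The case where a read quorum is known but no current write quorum exists deserves a concrete illustration. The Python fragment below is an editorial sketch built on the data layout of the earlier sketches, not the paper's code: it brings one obsolete representative up to date by copying the suite's current contents into it, after which a current write quorum exists under the same voting rules. In the paper this copy runs as a background process under its own transaction.

def refresh_obsolete_representative(suite):
    """Editorial sketch: copy the suite's current contents into one obsolete representative."""
    known = [rep for rep in suite.reps if rep["version"] is not None]
    if not known:
        return
    current = max(rep["version"] for rep in known)
    donors = [rep for rep in known if rep["version"] == current]
    stale = [rep for rep in known if rep["version"] != current]
    if not stale:
        return
    target = stale[0]
    # Copying current data to an obsolete representative is always legal.
    target["storage"] = bytearray(donors[0]["storage"])
    target["version"] = current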
If a user decides to abort his transaction, or if the system spontaneously aborts a user's transaction, the suite is no longer in a well defined state. The version number information it is holding is no longer guaranteed to be serially consistent with ensuing operations, and must be discarded and replaced by new inquiries. Asynchronous representative writes or version number reads still in progress must be aborted.
Initialize: PROCEDURE =
-- restore the suite to a well defined state
BEGIN
-- discard old version number information
Reset[];
-- stop outstanding work
AbortFileSuiteProcesses[];
Transaction.Begin[];
InitiateInquiries[];
END;

Reset: ENTRY PROCEDURE =
BEGIN
-- forget everything we knew about the representatives
i: INTEGER;
FOR i IN [1..LENGTH[suite]]
DO
  suite[i].versionNumber ← unknown;
  suite[i].votes ← 0;
ENDLOOP;
firstResponded ← FALSE;
currentVersionNumber ← unknown;
END;

At transaction commit the module instance is deleted.
4. Refinements

4.1 Weak Representatives
We can incorporate temporary copies into the algorithm by introducing representatives with no votes, called weak representatives. Such a representative will not change the quorums of a file suite, and thus can be introduced at any time. However, it can be included in any quorum and, when placed on a high speed storage device, can improve the performance of a file suite.

Because a weak representative has no votes, it bears no responsibility for the long term safekeeping of data. There will always be a write quorum of other representatives that contain current data. Thus, if an error is detected while accessing a weak representative an acceptable means of recovery is to invalidate it, by setting its version number to be unknown. This property allows weak representatives to be stored outside of the stable file system. We have made the further simplification of insisting that weak representatives be unshared. To insure that users do not share weak representatives, exclusive locks are used.

The simplified concurrency and recovery requirements of weak representatives allow us to store them in a less general file system than the one outlined in Section 3.1. In particular, they can be stored on a user's personal computer using a very simple mechanism. After a crash on a user's personal machine it is sufficient to invalidate all weak representatives that are locked.
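To make the zero-vote idea concrete, the short editorial sketch below (again using the data layout of the earlier sketches, not the paper's code) attaches a weak representative to a suite. Because it carries no votes it changes neither the quorum sizes nor the intersection property, but once it has been brought up to date the speed-ordered collector will happily read from it.

def add_weak_representative(suite, name, latency_ms):
    """Editorial sketch: attach a zero-vote (weak) representative to a suite."""
    weak = {
        "name": name,
        "votes": 0,                 # contributes nothing to any quorum count
        "version": None,            # unknown until it is brought up to date
        "latency_ms": latency_ms,   # typically a fast local disk
        "storage": bytearray(),
    }
    suite.reps.append(weak)
    return weak

# Usage: after refresh_obsolete_representative (sketched earlier) has made the weak
# copy current, reads can be served from it while r and w remain unchanged; on any
# error or crash the copy is simply invalidated by resetting its version to None.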
4.2 Lock Compatibility
A disadvantage that our algorithm has in comparison with Thomas' is that locks are set by the stable file system to guarantee serial consistency, which reduces the amount of concurrency in the system. For example, a typical locking structure that is used to guarantee serial consistency has two types of locks, read and write. These locks are set on data items implicitly in response to file operations. The compatibility of locks is specified by the matrix:
            No Lock   Read   Write
No Lock       Yes      Yes    Yes
Read          Yes      Yes    No
Write         Yes      No     No

A transaction is suspended if it attempts to set a lock that is incompatible. This matrix corresponds to the familiar rule that either n readers or one writer are permitted to access a file simultaneously.
This locking rule potentially can introduce long periods of time when information is unavailable. For example, if a user controls the length of a transaction, he can hold a write lock for a long period of time. This may naturally occur as a user thinks at the keyboard. To insure that no user monopolizes a file, a transaction will be timed out if other users are waiting for a file it has locked. A transaction that times out leaves files unchanged, because it is aborted. The same mechanism insures that cyclic lock dependencies (deadlocks) will be resolved by aborting some transaction. Time-outs did not provide an adequate solution in our environment, as we describe in Section 5.1.

A property of serial consistency is that all of a transaction's writes appear to occur at transaction commit time. We can take advantage of this property to increase the concurrency in our system. Writes appear to occur when they are issued, but in fact are buffered until commit time by the stable file system. A read following a write will receive the write's data. When a user writes a datum, an I-Write lock is set, for intention to write. At commit time I-Write locks are converted to Commit locks, and the writing actually takes place. The new lock compatibility matrix is:
            No Lock   Read   I-Write   Commit
No Lock       Yes      Yes     Yes       Yes
Read          Yes      Yes     Yes       No
I-Write       Yes      Yes     No        No
Commit        Yes      No      No        No
With this revised locking matrix, data is only unavailable for predictably short periods of time, during commit processing. This results in increased concurrency, as we discuss in Section 5.1. However, it may cause the later abortion of a transaction. We chose to make multiple I-Write locks incompatible, because eventually one of the two transactions would probably commit, and become incompatible. Thus we chose not to postpone the inevitable.
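A lock manager built around this matrix needs only a small compatibility table. The Python fragment below is an editorial sketch of such a check; the lock names match the matrix above, while the table representation and function name are invented for illustration.

# Compatibility of a requested lock (row) against a lock already held (column).
COMPATIBLE = {
    "NoLock": {"NoLock": True, "Read": True,  "IWrite": True,  "Commit": True},
    "Read":   {"NoLock": True, "Read": True,  "IWrite": True,  "Commit": False},
    "IWrite": {"NoLock": True, "Read": True,  "IWrite": False, "Commit": False},
    "Commit": {"NoLock": True, "Read": False, "IWrite": False, "Commit": False},
}

def may_grant(requested, held_locks):
    """True if `requested` is compatible with every lock already held by others."""
    return all(COMPATIBLE[requested][held] for held in held_locks)

# A reader may proceed alongside an I-Write lock, but is suspended once the
# writer converts that I-Write lock to a Commit lock at commit time.
assert may_grant("Read", ["IWrite"]) is True
assert may_grant("Read", ["Commit"]) is False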
4.3 Lower Degrees of Consistency
We have assumed that the algorithm must be capable of providing serial consistency. Lower degrees of consistency are possible, and allow liberties to be taken, for example, setting r to be 0. This corresponds to the notion "give me the latest version you can find, but I don't care if it isn't fresh". Certain applications that have self-correcting characteristics, such as name lookup, can use lower degrees of consistency. However, the inexplicable behavior of lower degrees precludes their widespread use. Gray et al. [3] have explored the properties of various degrees of consistency in detail. Their Degree 3 consistency corresponds to serial consistency. If our algorithm is run on top of a file system that ensures Degree 0 or Degree 1 consistency, the algorithm will guarantee the same consistency it sees, a fact we will not prove here.
4.4 Size of Replicated Objects
The size of an object that is replicated should be chosen to match the needs of an intended application. For example, a data base manager might choose to replicate relations, tuples, or indexes. Each replicated object requires a version number. Because our algorithm depends on a file's version number remaining unchanged throughout a transaction, the smallest unit that can be individually locked is a file.
4.5 Broken Read Locks
If the stable file system can break read locks to resolve conflicts instead of aborting an entire transaction, two positive effects result. First, fewer transactions are aborted. Second, it is not necessary for a version number to remain unchanged during a transaction. If it does change, the broken read lock mechanism informs the algorithm. The smallest lockable unit is no longer a file, but instead is the smallest lockable unit supported by the stable file system.
4.6 Reassigning Votes
As the focus of references to a file suite changes, it is possible to update the file suite's voting configuration for optimum performance. Unless the refinement proposed in Section 4.5 is adopted, updating a suite's voting configuration conflicts with any other use of the suite. Section 3.3's inquiry process protects itself against such changes by reading a suite's voting structure under the protection of a transaction.

To change r, w, or the voting structure of a file suite it is necessary to ensure that there is a write quorum under the new rules that is current. The change can then be effected by updating the prefixes of a set of representatives that is the union of a current write quorum and a future (under the new voting rules) write quorum. We claim that regardless of the order representatives are examined, the most recent voting rules will be discovered before a read quorum is established. Imagine that a transaction incorrectly assumes that an obsolete generation of voting rules, G, is current. Then there is a set of voting rules, G+1, that is one generation more recent. But when