Introduction to a System for Distributed Databases (SDD-1)

J. B. ROTHNIE, JR., P. A. BERNSTEIN, S. FOX, N. GOODMAN, M. HAMMER,
T. A. LANDERS, C. REEVE, D. W. SHIPMAN, and E. WONG
Computer Corporation of America
`
The declining cost of computer hardware and the increasing data processing needs of geographically dispersed organizations have led to substantial interest in distributed data management. SDD-1 is a distributed database management system currently being developed by Computer Corporation of America. Users interact with SDD-1 precisely as if it were a nondistributed database system because SDD-1 handles all issues arising from the distribution of data. These issues include distributed concurrency control, distributed query processing, resiliency to component failure, and distributed directory management. This paper presents an overview of the SDD-1 design and its solutions to the above problems.

This paper is the first of a series of companion papers on SDD-1 (Bernstein and Shipman [2], Bernstein et al. [4], and Hammer and Shipman [14]).
Key Words and Phrases: distributed database system, relational data model, concurrency control, distributed query processing, database reliability
CR Categories: 3.5, 4.33
`1. INTRODUCTION
SDD-1 is a distributed database management system under development by Computer Corporation of America. SDD-1 is a system for managing databases whose storage is distributed over a network of computers. Functionally, SDD-1 provides the same capabilities that one expects of any modern database management system (DBMS), and users interact with it precisely as if it were not distributed.
`
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
This research was supported by the Advanced Research Projects Agency of the Department of Defense under Contract N00039-77-C-0674, ARPA Order No. 3175-6. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the U.S. Government.
Authors' present addresses: J. B. Rothnie, Jr., S. Fox, T. A. Landers, C. Reeve, and D. W. Shipman, Computer Corporation of America, 575 Technology Square, Cambridge, MA 02139; P. A. Bernstein and N. Goodman, Center for Research in Computing Technology, Aiken Computation Laboratory, Harvard University, Cambridge, MA 02138; M. Hammer, Massachusetts Institute of Technology, Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139; E. Wong, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720.
© 1980 ACM 0362-5915/80/0300-0001 $00.75
ACM Transactions on Database Systems, Vol. 5, No. 1, March 1980, Pages 1-17.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2008-1
`
`
`
`2
`
`-
`
`J. 6. Rothnie et al.
`
Systems like SDD-1 are appropriate for applications which exhibit two characteristics: First, the activity requires an integrated database. That is, the activity entails access to a single pool of information by multiple persons, organizations, or programs. And second, either the users of the information or its sources are distributed geographically. Many data processing applications have these characteristics, including

(1) inventory control, accounting, personnel information, and similar data processing systems in large companies;
(2) point-of-sale accounting systems, electronic banking, and other consumer-oriented on-line processing systems;
(3) large-scale data resources, e.g., census, climatology, toxicology, and similar databases;
(4) military intelligence databases, and command and control systems;
(5) report-generating systems for businesses with multiple data processing centers; and so forth.
`
Decentralized processing is desirable in these applications for reasons of performance, reliability, and flexibility of function. Centralized control is needed to ensure operation in accordance with overall policy and goals. By meeting both these goals in one system, distributed database management offers unique benefits.
However, distributed database systems pose new technical challenges owing to their inherent requirements for data communication and their inherent potential for parallel processing. The principal bottleneck in these systems is data communication. All economically feasible long-distance communication media incur lengthy delays and/or low bandwidth. Moreover, the cost of moving data through a network is comparable to the cost of storing the data locally for many days. Parallel processing is also an inherent aspect of distributed systems and mitigates to some extent the communication factor. However, it is often difficult to construct algorithms that can exploit parallelism.
For these reasons, the techniques used to implement centralized DBMSs must be reexamined in the distributed DBMS context. We have done this in developing SDD-1, and this paper outlines our main results.
Section 2 describes SDD-1's overall architecture and the flow of events in processing transactions. Sections 3 through 5 then introduce the techniques used by SDD-1 for solving the most difficult problems in distributed data management: concurrency control, query processing, and reliability. Detailed discussions of these techniques are presented in [2-4, 12, 14, 25]. Section 6 explains how these techniques are used to handle the management of system directories. The paper concludes with a brief history of SDD-1 and a summary of its principal contributions to the field.
`
`2. SYSTEM ORGANIZATION
`
`2.1 Data Model
SDD-1 supports a relational data model [8]. Users interact with SDD-1 in a high-level language called Datalanguage [9] which is illustrated in Figure 1. Datalanguage differs from relational languages such as QUEL [15] or SEQUEL [7]
`
`
Relation:  CUSTOMER (Name, Branch, Acct#, SavBal, ChkBal, LoanBal)

Command:  Update C in CUSTOMER with C.Name = "Adams"
          Begin
              C.SavBal = C.SavBal - 100
              C.ChkBal = C.ChkBal + 100
          End;

Fig. 1. A Datalanguage command.
`
primarily in its use of "declared" variables. This construct and related control structures expand the power of Datalanguage to that of a general-purpose programming language. For purposes of this paper, the differences between Datalanguage and QUEL or SEQUEL are not important, and for pedagogic ease, we adopt QUEL terminology.
Datalanguage may be used as a query language for end-users but is more typically invoked by host programs. Datalanguage is embedded in host programs in essentially the same manner as QUEL or SEQUEL. That is, the host program issues self-contained Datalanguage commands to SDD-1, which processes these commands exactly as if entered by an end-user.
A single Datalanguage command is called a transaction (e.g., the command shown in Figure 1 is a transaction). Transactions are the units of atomic interaction between SDD-1 and the external world. This concept of transaction is similar to that of INGRES [15] and System R [1].
An SDD-1 database consists of (logical) relations. Each SDD-1 relation is partitioned into subrelations called logical fragments, which are the units of data distribution. Logical fragments are defined in two steps. First, the relation is partitioned horizontally into subsets defined by "simple" restrictions.¹ Then each horizontal subset is partitioned into subrelations defined by projections (see Figures 2 and 3). To reconstruct the logical relation from its fragments, a unique tuple identifier is appended to each tuple and included in every fragment [10, 21].
Logical fragments are the units of data distribution, meaning that each may be stored at any one or several sites in the system. Logical fragments are defined, and the assignment of fragments to sites is made, when the database is designed. A stored copy of a logical fragment is called a stored fragment.
Note that user transactions are unaware of data distribution or redundancy. They reference only relations, not fragments. It is SDD-1's responsibility to translate from relations to logical fragments, and then to select the stored fragments to access in processing any given transaction.
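The two-step fragmentation and tuple-identifier reconstruction described above can be sketched in a few lines of Python. This is an illustrative model only; the relation-as-list-of-dicts representation and all helper names are invented here and are not part of SDD-1:

```python
# Sketch of SDD-1-style logical fragmentation (hypothetical names).
# A relation is a list of dicts; a tuple identifier "_tid" is appended
# so the relation can be reconstructed from its vertical fragments.

def horizontal(relation, predicate):
    """Subset of tuples satisfying a simple restriction."""
    return [t for t in relation if predicate(t)]

def vertical(fragment, attributes):
    """Projection of a horizontal subset; always keeps the tuple id."""
    return [{a: t[a] for a in attributes + ["_tid"]} for t in fragment]

def reconstruct(frag1, frag2):
    """Rejoin two vertical fragments on the tuple identifier."""
    by_tid = {t["_tid"]: dict(t) for t in frag1}
    for t in frag2:
        by_tid[t["_tid"]].update(t)
    return list(by_tid.values())

customer = [
    {"_tid": 1, "Name": "Adams", "Branch": 3, "SavBal": 1000, "ChkBal": 0},
    {"_tid": 2, "Name": "Monroe", "Branch": 1, "SavBal": 200, "ChkBal": 50},
]
cust_3 = horizontal(customer, lambda t: t["Branch"] == 3)   # Fig. 2 step
cust_3_1 = vertical(cust_3, ["Name", "Branch"])             # Fig. 3 step
cust_3_2 = vertical(cust_3, ["SavBal", "ChkBal"])
assert reconstruct(cust_3_1, cust_3_2) == cust_3
```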
`
`2.2 General Architecture
SDD-1 is a collection of three types of virtual machines [16]: Transaction Modules (TMs), Data Modules (DMs), and a Reliable Network (RelNet), configured as in Figure 4.

¹ A simple restriction is a Boolean expression whose clauses are of the form (attribute) (rel-op) (constant), where (rel-op) is =, ≠, >, <, etc.
`
`
CUSTOMER (Name, Branch, Acct#, SavBal, ChkBal, LoanBal)

CUST-1  = CUSTOMER where Branch = 1
CUST-2  = CUSTOMER where Branch = 2
CUST-3a = CUSTOMER where Branch = 3 and LoanBal ≠ 0
CUST-3b = CUSTOMER where Branch = 3 and LoanBal = 0

Fig. 2. Horizontal partitioning.
`
CUSTOMER (Name, Branch, Acct#, SavBal, ChkBal, LoanBal)

CUST-1.1 = CUST-1 [Name, Branch]
CUST-1.2 = CUST-1 [Acct#, SavBal, ChkBal, LoanBal]
etc.

Fig. 3. Vertical partitioning. In order to reconstruct CUSTOMER from its fragments, a unique tuple identifier is appended to each tuple and included in every fragment [21].
`
`
`Fig. 4. SDD-1 configuration.
`
`
All data managed by SDD-1 are stored by Data Modules (DMs). DMs are, in effect, back-end DBMSs that respond to commands from Transaction Modules. DMs respond to four types of commands: (1) read part of the DM's database into a local workspace at that DM; (2) move part of a local workspace from this DM to another DM; (3) manipulate data in a local workspace at the DM; (4) write part of the local workspace into the permanent database stored at the DM.
Transaction Modules (TMs) plan and control the distributed execution of transactions. Each transaction processed by SDD-1 is supervised by some TM which performs the following tasks:

(1) Fragmentation. The TM translates queries on relations into queries on logical fragments and decides which instances of stored fragments to access.
(2) Concurrency control. The TM synchronizes the transaction with all other active transactions in the system.
(3) Access planning. The TM compiles the transaction into a parallel program which can be executed cooperatively by several DMs.
(4) Distributed query execution. The TM coordinates execution of the compiled access plan, exploiting parallelism whenever possible.

The third SDD-1 virtual machine is the Reliable Network (RelNet) which interconnects TMs and DMs in a robust fashion. The RelNet provides four services:

(1) guaranteed delivery, allowing messages to be delivered even if the recipient is down at the time the message is sent, and even if the sender and receiver are never up simultaneously;
(2) transaction control, a mechanism for posting updates at multiple DMs, guaranteeing that either all DMs post the update or none do;
(3) site monitoring, to keep track of which sites have failed, and to inform sites impacted by failures;
(4) network clock, a virtual clock kept approximately synchronized at all sites.
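As a rough illustration of the four DM command types, the following Python sketch models a DM's workspace discipline. All class and method names are invented for this example; SDD-1's actual DMs were full back-end DBMSs, not anything this simple:

```python
# Illustrative sketch of the four DM command types (invented names).
class DataModule:
    def __init__(self, database):
        self.database = database      # permanent storage: name -> data
        self.workspace = {}           # temporary, per-transaction data

    def read(self, name):
        """(1) Read part of the database into a local workspace."""
        self.workspace[name] = list(self.database[name])

    def move(self, name, other_dm):
        """(2) Move part of the workspace to another DM's workspace."""
        other_dm.workspace[name] = self.workspace.pop(name)

    def manipulate(self, name, fn):
        """(3) Manipulate data in the local workspace."""
        self.workspace[name] = fn(self.workspace[name])

    def write(self, name):
        """(4) Write workspace data into the permanent database."""
        self.database[name] = self.workspace.pop(name)

dm1 = DataModule({"CUST-1": [("Adams", 1000)]})
dm2 = DataModule({})
dm1.read("CUST-1")
dm1.move("CUST-1", dm2)
dm2.manipulate("CUST-1", lambda rows: [(n, b - 100) for n, b in rows])
dm2.write("CUST-1")
assert dm2.database["CUST-1"] == [("Adams", 900)]
```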
`
This architecture divides the distributed DBMS problem into three pieces: database management, management of distributed transactions, and distributed DBMS reliability. By implementing each of these pieces as a self-contained virtual machine, the overall SDD-1 design is substantially simplified.
`
`2.3 Run-Time Structure
Among the functions required to execute a transaction in a distributed DBMS, three are especially difficult: concurrency control, distributed query processing, and reliable posting of updates. SDD-1 handles each of these problems in a distinct processing phase, so that each can be solved independently. Consider transaction T of Figure 1. When T is submitted to SDD-1 for processing, the system invokes a three-phase processing procedure. The phases are called Read, Execute, and Write.
The first phase is the Read phase and exists for purposes of concurrency control. The TM that is supervising T analyzes it to determine which portions of the (logical) database it reads, called its read-set. In this case the TM would
`
Fig. 5. Version to storage mapping. T and T' share each page until one transaction or the other modifies the page. (The figure shows a page map for T and a page map for T', each associating page numbers with storage locations.)
`
determine that T's read-set is

{C.SavBal, C.ChkBal | C.Name = "Adams"}.

In addition, the TM decides which stored fragments to access to obtain that data. Then the TM issues Read commands to the DMs that house those fragments, instructing each DM to set aside a private copy of that fragment for use during subsequent processing phases.
The private copies obtained by the Read phase are guaranteed to be consistent even though the copies reside at distributed sites. The techniques for guaranteeing consistency are described in Section 3. Since the data are consistent when read, and since the copies are private, subsequent phases can operate freely on these data without fear of interference from other transactions.
`the
`We emphasize that no data are actually
`transferred between sites during
`Read phase. Each DM simply sets aside the specified data in a workspace at the
`DM. Moreover,
`in each DM,
`the private workspace
`is implemented
`using a
`differential
`file mechanism
`[22], so data are not actually copied. This mechanism
`operates as follows. The primary organization of a stored fragment
`is a paged
`file, much like a UNIX
`[20] or TENEX
`[6] file. Apage
`is a unit of logical storage;
`a page map is a function
`that associates a physical storage location with each
`page (see Figure 5). The “private copy” set aside by a Read command is in reality
`a page map. Page maps behave
`like private copies because pages are never
`updated
`in place;
`if page P is modified on behalf of transaction T’, say, a new
`block of secondary storage is allocated, and the modified page is written
`there. T’
`is able to access the modified page because its page map is also modified
`to reflect
`
P's new storage location. Other transactions are unaffected because their page maps remain unchanged. A similar mechanism is described in [18].
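The copy-on-write behavior of page maps can be modeled in a short sketch. This is a simplified illustration of the differential file idea with invented names; real storage allocation is, of course, far more involved:

```python
# Copy-on-write page maps, sketched after the differential file idea.
# "storage" is a growing list of physical page images; a page map sends
# logical page numbers to storage slots. A Read command hands a
# transaction a copy of the (small) map, not of the data itself.

storage = [b"page0-v0", b"page1-v0"]          # physical blocks
master_map = {0: 0, 1: 1}                     # logical page -> slot

def private_copy(page_map):
    """Set aside a 'private copy': just duplicate the map."""
    return dict(page_map)

def write_page(page_map, page, data):
    """Never update in place: allocate a new block, repoint the map."""
    storage.append(data)
    page_map[page] = len(storage) - 1

t_map = private_copy(master_map)              # transaction T's view
t2_map = private_copy(master_map)             # transaction T''s view
write_page(t2_map, 0, b"page0-v1")            # T' modifies page 0

assert storage[t2_map[0]] == b"page0-v1"      # T' sees its new version
assert storage[t_map[0]] == b"page0-v0"       # T is unaffected
```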
The second phase, called the Execute phase, implements distributed query processing. At this time, the TM compiles T into a distributed program that takes as input the distributed workspace created by the Read phase. This compilation procedure is described in Section 4. The compiled program consists of Move and Manipulate commands which cause the DMs to perform the intent of T in a distributed fashion. The compiled program is supervised by the TM to ensure that commands are sent to DMs in the correct order and to handle run-time errors.
The output of the program is a list of data items to be written into the database (in the case of update transactions) or displayed to the user (in the case of retrievals). In our example, this output list would contain a unique tuple identifier for Adams' tuple, identifiers for the field names SavBal and ChkBal, and the new values for these fields. This output list is produced in a workspace (i.e., temporary file) at one DM, and is not yet installed into the permanent database. Consequently, problems of concurrency control and reliable writing are irrelevant during this phase.
The final phase, called the Write phase, installs data modified by T into the permanent database and/or displays data retrieved by T to the user. For each entry in the output list, the TM determines which DM(s) contain copies of that data item. The TM orders the final DM that holds the output list to send the appropriate entries of the output list to each DM; it then issues Write commands to each of these DMs, thereby causing the new values to be installed into the database. Techniques described in Section 5 are used during the Write phase to ensure that partial results are not installed or displayed even if multiple sites or communication links fail in midstream. This is the most difficult aspect of distributed DBMS reliability, and by separating it into a distinct phase, we simplify both it and the other phases.
The three-phase processing of transactions in SDD-1 neatly partitions the key technical problems of distributed database management. The next sections of this paper explain how SDD-1 solves each of these independent problems.
`
3. CONCURRENCY CONTROL

The problems that arise when multiple users access a shared database are well known. Generically there are two types of problems: (1) If transaction T1 is reading a portion of the database while transaction T2 is updating it, T1 might read inconsistent data (see Figure 6). (2) If transactions T3 and T4 are both updating the database, race conditions can produce erroneous results (see Figure 7). These problems arise in all shared databases, centralized or distributed, and are conventionally solved using database locking. However, we have developed a new technique for SDD-1.
`
`3.1 Methodology
SDD-1, like most other DBMSs, adopts serializability as its criterion for concurrent correctness. Serializability requires that whenever transactions execute concurrently, their effect must be identical to some serial (i.e., noninterleaved)
`
Consider the database of Figures 2 and 3, and assume fragments CUST-3a.2 and CUST-3a.3 are stored at different DMs.
Let transaction T1 be
    Range of C is CUSTOMER;
    Retrieve C (SavBal + ChkBal) where C.Name = "Adams";
Let transaction T2 be
    Range of C is CUSTOMER;
    Replace C (SavBal = SavBal - $100, ChkBal = ChkBal + $100) where C.Name = "Adams";
And suppose T1 and T2 execute in the following concurrent order:
    T1 reads Adams' SavBal (= $1000) from fragment CUST-3a.2
    T2 writes Adams' SavBal (= $900) into fragment CUST-3a.2
    T2 writes Adams' ChkBal (= $100) into fragment CUST-3a.3
    T1 reads Adams' ChkBal (= $100) from fragment CUST-3a.3
T1's output will be $1000 + $100 = $1100, which is incorrect.

Fig. 6. Reading inconsistent data.
`
Given the database of Figures 2 and 3.
Let transaction T3 be
    Range of C is CUSTOMER;
    Replace C (ChkBal = ChkBal + $100) where C.Name = "Monroe";
Let transaction T4 be
    Range of C is CUSTOMER;
    Replace C (ChkBal = ChkBal - $50) where C.Name = "Monroe";
And suppose T3 and T4 execute in the following concurrent order:
    T3 reads Monroe's ChkBal (= $50)
    T4 reads Monroe's ChkBal (= $50)
    T4 writes Monroe's ChkBal (= $0)
    T3 writes Monroe's ChkBal (= $50 + $100 = $150)
The value of ChkBal left in the database is $150, which is incorrect. The final balance should be $50 - $50 + $100 = $100.

Fig. 7. Race condition producing erroneous update.
`
execution of those same transactions. This criterion is based on the assumption that each transaction maps a consistent database state into another consistent state. Given this assumption, every serial execution preserves consistency. Since a serializable execution is equivalent to a serial one, it too preserves database consistency.
Most DBMSs ensure serializability through database locking. By locking, we mean a synchronization method in which transactions dynamically reserve data before accessing them [11].
SDD-1 uses two synchronization mechanisms that are distinctly different from locking [5]. The first mechanism, called conflict graph analysis, is a technique for analyzing "classes" of transactions to detect those transactions that require little or no synchronization. The second mechanism consists of a set of synchronization protocols based on "timestamps," which synchronize those transactions that need it.
`
Define transactions T1 and T2 as in Figure 6:
    read-set(T1) = {C.SavBal, C.ChkBal where C.Name = "Adams"}
    write-set(T1) = {}
    read-set(T2) = read-set(T1)
    write-set(T2) = read-set(T1)

Define transactions T3 and T4 as in Figure 7:
    read-set(T3) = {C.ChkBal where C.Name = "Monroe"}
    write-set(T3) = read-set(T3)
    read-set(T4) = read-set(T3)
    write-set(T4) = read-set(T3)

(Nodes ri and wi denote the read-set and write-set of transaction Ti. The edges among the nodes of T1 and T2 form no cycle, whereas the edges among the nodes of T3 and T4 do.)

Fig. 8. Conflict graphs.
`
`3.2 Conflict Graph Analysis
The read-set of a transaction is the portion of the database it reads, and its write-set is the portion of the database it updates. Two transactions conflict if the read-set or write-set of one intersects the write-set of the other. In a system that uses locking, each transaction locks data before accessing them, so conflicting transactions never run concurrently. However, not all conflicts violate serializability; that is, some conflicting transactions can safely be run concurrently. More concurrency can be attained by checking whether or not a given conflict is troublesome, and only synchronizing those that are. Conflict graph analysis is a technique for doing this.
The nodes of a conflict graph represent the read-sets and write-sets of transactions, and edges represent conflicts among these sets. (There is also an edge between the read-set and write-set of each transaction.) Figure 8 shows sample conflict graphs. The important property is that different kinds of edges require different levels of synchronization, and that synchronization as strong as locking is required only for edges that participate in cycles [2]. In Figure 8, for example, transactions T1 and T2 do not require synchronization as strong as locking, whereas T3 and T4 do.
It is impractical to use conflict graph analysis at run-time because too much intersite communication would be required to exchange information about conflicts. Instead, during database design we apply the technique off-line as follows: The database administrator defines transaction classes, which are named groups of commonly executed transactions. Each class is defined by its name, a read-set, a write-set, and the TM at which it runs; a transaction is a member of a class if
`
`the transaction’s read-set and write-set are contained in the class’s read-set and
`write-set, respectively. Conflict graph analysis is actually performed on these
`transaction classes, not on individual
`transactions as in Figure 8. (Notice that
`transactions from different classes can conflict only if their classes conflict.) The
`output of the analysis is a table which tells (a) for each class, which other classes
`it conflicts with, and (b) for each such conflict, how much synchronization (if
`any) is required to ensure serializability.
It is convenient to assume that each TM is only permitted to supervise transactions from one class, and vice versa.² At run-time, when transaction T is
`submitted, the system determines which class(es) T is a member of, and sends T
`to the TM that supervises one of these classes. The TM synchronizes T against
`other transactions in its class using a local mechanism similar to locking. To
`synchronize T against transactions in other classes, the TM uses the synchroni-
`zation method(s) specified by the conflict graph analysis. These methods are
`called “protocols” and are described below.
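The cycle criterion at the heart of conflict graph analysis can be illustrated with a small sketch. This is an idealized model with invented names; SDD-1's actual analysis distinguishes several edge types and protocol strengths [2]. Here an edge needs locking-strength synchronization only if it lies on a cycle:

```python
# Off-line conflict graph analysis over transaction classes (a sketch).
from itertools import combinations

def conflict_graph(classes):
    """classes: name -> (read_set, write_set). Returns the edge set."""
    edges = set()
    for c in classes:                          # internal r-w edge per class
        edges.add(((c, "r"), (c, "w")))
    for a, b in combinations(classes, 2):
        ra, wa = classes[a]
        rb, wb = classes[b]
        if ra & wb:
            edges.add(((a, "r"), (b, "w")))    # read-write conflict
        if rb & wa:
            edges.add(((b, "r"), (a, "w")))
        if wa & wb:
            edges.add(((a, "w"), (b, "w")))    # write-write conflict
    return edges

def edges_on_cycles(edges):
    """An edge lies on a cycle iff its endpoints stay connected without it."""
    def connected(u, v, es):
        seen, stack = {u}, [u]
        while stack:
            n = stack.pop()
            if n == v:
                return True
            for x, y in es:
                for nxt in ((y,) if x == n else (x,) if y == n else ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
        return False
    return {e for e in edges if connected(e[0], e[1], edges - {e})}

# T3/T4 of Figure 8: both read and write Monroe's ChkBal -> cycle.
assert edges_on_cycles(conflict_graph(
    {"T3": ({"ChkBal"}, {"ChkBal"}), "T4": ({"ChkBal"}, {"ChkBal"})}))

# T1/T2 of Figure 8: T1 is read-only, so no cycle arises.
assert not edges_on_cycles(conflict_graph(
    {"T1": ({"SavBal", "ChkBal"}, set()),
     "T2": ({"SavBal", "ChkBal"}, {"SavBal", "ChkBal"})}))
```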
`
3.3 Timestamp-Based Protocols
`To synchronize two transactions that conflict dangerously, one must be run first,
`and the other delayed until it can safely proceed. In locking systems, the execution
`order is determined by the order in which transactions request conflicting locks.
`In SDD-1, the order is determined by a total ordering of transactions induced by
`timestamps. Each transaction submitted to SDD-1 is assigned a globally unique
timestamp by its TM. Timestamps are generated by concatenating a TM identifier to the right of the network clock time, so that timestamps from different TMs always differ in their low-order bits. This means of generating unique timestamps was proposed in [23].
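This timestamp construction is easy to sketch; the bit width chosen for the TM identifier below is an arbitrary illustrative value:

```python
# Globally unique timestamps: the TM identifier is concatenated to the
# right of the network clock time, so timestamps from different TMs
# always differ in their low-order bits while the clock dominates order.

TM_BITS = 8                                    # illustrative width

def make_timestamp(network_clock_time, tm_id):
    assert 0 <= tm_id < (1 << TM_BITS)
    return (network_clock_time << TM_BITS) | tm_id

ts_a = make_timestamp(1000, tm_id=3)
ts_b = make_timestamp(1000, tm_id=7)
assert ts_a != ts_b                  # unique even at the same clock tick
assert ts_a < ts_b < make_timestamp(1001, tm_id=0)   # clock dominates
```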
`The timestamp of a transaction is attached to all Read and Write commands
`sent to DMs on the behalf of that transaction. In addition, each Read command
`contains a list of classes that conflict dangerously with the transaction issuing the
`Read (this list was determined by the conflict graph analysis). When a DM
`receives a Read command, it defers the command until
`it has processed all
`earlier Write commands (i.e., those with smaller timestamps) and no later Write
`commands (i.e., those with larger ones) from the TMs for the specified classes.
`The DM can determine how long to wait because of a DM-TM communication
`discipline called piping.
`Piping requires that each TM send its Write commands to DMs in timestamp
`order. In addition, the Reliable Network guarantees that messages are received
`in the order sent. Thus when a DM receives a Write from (say) TMx timestamped
`(say) TSx, the DM knows it has received all Write commands from TMx with
`timestamps less than TSx. So, to process a Read command with timestamp TSR,
`the DM proceeds as follows:
`
`For each class specified in the Read command, the DM processes all Write
`commands from that class’s TM up to (but not beyond) TSR. If, however,
`
² This assumption engenders no loss of generality since several TMs can be multiprogrammed at one site, and several classes can be defined with identical read-sets and write-sets.
`
the DM has already processed a Write command with timestamp beyond TSR from one of these TMs, the Read is rejected.
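The piping discipline can be sketched as follows. The queue-based model and all names here are invented for illustration; it shows only how a DM decides to apply, defer, or reject a Read, including the role of null Writes:

```python
# Sketch of the DM's piping discipline (invented names). Each TM sends
# its Writes in timestamp order and the network preserves order, so the
# DM can tell when it has seen every Write up to a Read's timestamp.
import heapq

class DM:
    def __init__(self, tm_ids):
        self.pending = {tm: [] for tm in tm_ids}       # per-TM Write queues
        self.last_applied = {tm: -1 for tm in tm_ids}  # highest ts applied
        self.applied = []

    def receive_write(self, tm, ts, data=None):
        heapq.heappush(self.pending[tm], (ts, data))   # arrives in ts order

    def process_read(self, ts_r, conflicting_tms):
        """Apply all Writes up to (not beyond) TSR, or reject the Read."""
        for tm in conflicting_tms:
            if self.last_applied[tm] > ts_r:
                return "reject"            # a later Write was already applied
            while self.pending[tm] and self.pending[tm][0][0] <= ts_r:
                ts, data = heapq.heappop(self.pending[tm])
                self.last_applied[tm] = ts
                if data is not None:       # null Writes carry no data
                    self.applied.append((tm, ts, data))
        if any(not self.pending[tm] for tm in conflicting_tms):
            return "defer"     # must wait for a Write (or null Write) > TSR
        return "ok"

dm = DM(tm_ids=["TM1"])
dm.receive_write("TM1", 5, "w5")
assert dm.process_read(10, ["TM1"]) == "defer"   # Writes <= 10 may be missing
dm.receive_write("TM1", 12, None)                # null Write: clock passed 10
assert dm.process_read(10, ["TM1"]) == "ok"
```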
`
To avoid excessive delays in waiting for Write commands, idle TMs periodically send null (empty) timestamped Write commands; also, an impatient DM can explicitly request a null Write from a TM that is slow in sending them.
The synchronization protocol we have just described roughly corresponds to locking and is designed to avoid "race conditions" [5]. However, there are several variations of this protocol, depending on the type of timestamp attached to Read commands and the interpretation of the timestamps by DMs. For example, read-only transactions can use a less expensive protocol in which the DM selects the timestamp, thereby avoiding the possibility of rejection and reducing delay. The variety of available synchronization protocols is an important feature of SDD-1's concurrency control.
The SDD-1 concurrency control mechanism is described in greater detail in [3, 4]; its correctness is formally proved in [2].
When all Read commands have been processed, consistent private copies of the read-set have been set aside at all necessary DMs. At this point, the Read phase is complete.
`
4. DISTRIBUTED QUERY PROCESSING
`
Having obtained a consistent copy of a transaction's read-set, the next step is to compile the transaction into a parallel program and execute it. The key part of the compilation is access planning, an optimization procedure that minimizes the object program's intersite communication needs while maximizing its parallelism. Access planning is discussed in Section 4.1, and execution of compiled transactions is explained in Section 4.2.
`
`4.1 Access Planning
Perhaps the simplest way to execute a distributed transaction T is to move all of T's read-set to a single DM, and then execute T at that DM (see Figure 9). This approach works but suffers two drawbacks: (1) T's read-set might be very large, and moving it between sites could be exorbitantly expensive; (2) little use is made of parallel processing. Access planning overcomes these drawbacks.
The access planner produces object programs with two phases, called reduction and final processing. The reduction phase eliminates from T's read-set as much data as is economically feasible without changing T's answer. Then, during final processing, the reduced read-set is moved to a designated "final" DM where T is executed. This structure mirrors the simple approach described above but lowers communication cost and increases parallelism via reduction.
Reduction employs the familiar restriction and projection operators, plus an operator called semi-join, defined as follows: let R(A, B) and S(C, D) be relations; the semi-join of R by S on a qualification q (e.g., R.B = S.C) equals the join of R and S on q, projected back onto the attributes of R (see Figure 10). If R and S are stored at different DMs, this semi-join is computed by projecting S onto the attributes of q (i.e., S.C), and moving the result to R's DM.
We define the cost of an operation to be the amount of data (e.g., number of
`
Given the database of Figures 2 and 3.

Let transaction T5 be
    Range of C is CUSTOMER;
    Replace C (ChkBal = ChkBal + LoanBal) where LoanBal < 0;
(The effect of T5 is to credit loan overpayments to customers' checking accounts.)

Simple strategy: Move every fragment that could potentially contribute to T5's result to a designated site. Process T5 locally at that site.

Fig. 9. Simple execution strategy.
`
Given:

CUST (Name, ChkBal, LoanBal)          AUTO-PAY (Name, Amount)
Jeff.      $300   $30000              Jeff.      $300
Adams      $100   $20000              Adams      $200
Polk       $250   $20000              Polk       $200
Tyler      $100   $15000              Tyler      $150
Buchanan   $700   $40000              Buchanan   $400
Johnson    $200   $20000              Johnson    $200

Example (i)
The semi-join of CUST by AUTO-PAY on CUST.ChkBal = AUTO-PAY.Amount
equals the join of CUST and AUTO-PAY on CUST.ChkBal = AUTO-PAY.Amount

(Name,    ChkBal,  LoanBal,  Name,     Amount)
Jeff.     $300     $30000    Jeff.     $300
Johnson   $200     $20000    Adams     $200
Johnson   $200     $20000    Polk      $200
Johnson   $200     $20000    Johnson   $200

projected onto (Name, ChkBal, LoanBal):
Jeff.     $300     $30000
Johnson   $200     $20000

Example (ii)
The semi-join of CUST by AUTO-PAY on CUST.ChkBal < AUTO-PAY.Amount and CUST.Name = AUTO-PAY.Name equals:

(Name,   ChkBal,  LoanBal)
Adams    $100     $20000
Tyler    $100     $15000

(These are customers whose balances are insufficient for their automatic loan payments.)

Fig. 10. Semi-join examples.
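The semi-join and its use in reduction can be sketched over the (abridged) Figure 10 data. The list-of-dicts representation is an illustrative assumption, not SDD-1's storage format:

```python
# Semi-join sketch over (abridged) Figure 10 relations.
def semi_join(r, s, qual):
    """Tuples of r that join with at least one tuple of s on qual."""
    return [t for t in r if any(qual(t, u) for u in s)]

cust = [
    {"Name": "Jeff.", "ChkBal": 300, "LoanBal": 30000},
    {"Name": "Adams", "ChkBal": 100, "LoanBal": 20000},
    {"Name": "Tyler", "ChkBal": 100, "LoanBal": 15000},
    {"Name": "Johnson", "ChkBal": 200, "LoanBal": 20000},
]
auto_pay = [
    {"Name": "Jeff.", "Amount": 300},
    {"Name": "Adams", "Amount": 200},
    {"Name": "Tyler", "Amount": 150},
    {"Name": "Johnson", "Amount": 200},
]

# Example (ii): balances insufficient for the automatic loan payment.
short = semi_join(
    cust, auto_pay,
    lambda c, a: c["ChkBal"] < a["Amount"] and c["Name"] == a["Name"],
)
assert [t["Name"] for t in short] == ["Adams", "Tyler"]

# Distributed execution ships only the projection of S's qualification
# attributes to R's DM, then filters locally (example (i)):
amounts = {u["Amount"] for u in auto_pay}               # moved to R's DM
reduced = [t for t in cust if t["ChkBal"] in amounts]
assert [t["Name"] for t in reduced] == ["Jeff.", "Johnson"]
```

Shipping only the projected qualification attributes is what makes the semi-join cheap: the reduced relation, not the full read-set, is then moved for final processing.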
`
bytes) that must be sent between sites to execute it, while its benefit is the amount by which it reduces the size of its opera