throbber
in a System
`Query Processing
`for Distributed Databases (SDD-1)
`
`PHILIP A. BERNSTEIN and NATHAN GOODMAN
`Harvard University
`EUGENE WONG
`University of California at Berkeley
`CHRISTOPHER
`L. REEVE
`Massachusetts
`Institute of Technology
`and
`JAMES B. ROTHNIE, Jr.
`Computer Corporation of America
`
`in the SDD-1 distributed
`queries
`relational
`the techniques used to optimize
`Thii paper describes
`language called Datalan-
`database system. Queries are submitted
`to SDD-1
`in a high-level procedural
`guage. Optimization
`begins by translating
`each Datalanguage
`query
`into a relational
`calculus
`form
`called an envelope, which
`is essentially
`an aggregate-free QUEL query. This paper
`is primarily
`concerned with
`the optimization
`of envelopes.
`operations at various
`Envelopes are processed
`in two phases. The first phase executes relational
`that contains all data
`sites of the distributed
`database
`in order
`to delimit a subset of the database
`relevant
`to the envelope. This subset is called a reduction of the database. The second phase transmits
`the reduction
`to one designated site, and the query
`is executed
`locally at that site.
`The critical optimization
`problem
`is to perform
`the reduction phase efficiently. Success depends on
`designing a good repertoire
`of operators
`to use during
`this phase, and an effective algorithm
`for
`deciding which of these operators
`to use in processing a given envelope against a given database. The
`principal
`reduction operator
`that we employ
`is called a sem@oin. In this paper we define the semijoin
`operator, explain why semijoin
`is an effective
`reduction operator, and present an algorithm
`that
`constructs a cost-effective
`program of semijoins, given an envelope and a database.
`
`Key Words and Phrases: distributed
`mization, semijoins
`CR Categories: 3.70,4.33
`
`databases, relational
`
`databases, query processing, query opti-
`
`that the copies are not
`is granted provided
`fee all or part of this material
`to copy without
`Permission
`made or distributed
`for direct commercial advantage,
`the ACM copyright notice and the title of the
`publication
`and its date appear, and notice
`is given that copying
`is by permission of the Association
`for Computing Machinery.
`To copy otherwise,
`or
`to republish,
`requires a fee and/or
`specific
`permission.
`of
`This
`research was supported by the Advanced Research Projects Agency of the Department
`Defense and was monitored by the Naval Electronic System Command under Contract NO6039-77-C-
`0074, ARPA Order 3175-6.
`in Computing Technology,
`for Research
`Authors’ addresses: P.A. Bernstein and N. Goodman, Center
`Aiken Computation
`Laboratory, Harvard University, Cambridge, MA 02138; E. Wong, Department
`of Electrical Engineering and Computer Sciences, University
`of California at Berkeley, Berkeley, CA
`94720; C.L. Reeve, Laboratory
`for Computer Science, Massachusetts
`Institute
`of Technology,
`545
`Technology Square, Cambridge, MA 02139; J.B. Rothnie, Jr., Computer Corporation
`of America, 675
`Massachusetts Avenue, Cambridge, MA 02139.
`0 1981 ACM 0362-5915/81/1200-osoz6602 $00.75
`
`ACM Transactions on Database Systems, Vol. 6, No. 4, December 1961, Pages 602-625
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-1
`
`

`

`Query Processing
`
`in a System for Distributed Databases
`
`*
`
`603
`
`INTRODUCTION
`1.
`SDD-1 is a distributed database system developed by the Computer Corporation
`of America [23]. SDD-1 permits a relational database to be distributed among
`the sites of a computer network, yet accessed as if it were stored at a single site.
`Users interact with SDD-1 by submitting queries coded in a high-level procedural
`language called Datalanguage [20]. Figures 1 and 2 illustrate an SDD-1 database
`and a Datalanguage query. This paper is concerned with efficient execution of
`such queries. Other aspects of SDD-1 are discussed in [5, 6, 14, 231.
`Our objective is to process queries with a minimum quantity of intersite data
`transfer. That is, we assume network bandwidth to be the system bottleneck and
`seek to minimize use of this resource; all other resources are assumed to be free.’
`This assumption is appropriate in SDD-1 because the network is the slowest
`system component by two orders of magnitude.2 This assumption has been
`
`Database D
`
`name,
`S(s#,
`1, Acme,
`2,
`Best,
`3,
`Mid,
`4,
`Nadir,
`
`location)
`MA
`MA
`NY
`CA
`
`yw,
`1,
`1,
`3,
`4,
`4,
`
`p#, GY)
`1,
`20
`2,
`50
`3,
`50
`1,
`10
`5,
`75
`
`P(P#, name, type)
`1,
`LSI,
`micro
`2,
`Pll,
`mini
`3,
`360,
`main
`4,
`CRI,
`huge
`5,
`8080, micro
`
`S describes suppliers.
`P describes parts.
`Y tells which suppliers supply which parts and in what quantity.
`
`Assume that S, Y, and P are sorted at sites 1,2, and 3, respectively.
`
`Fig. 1. Example database.
`
`of quey
`Description
`for all parts supplied by a
`List the supplier name, part name, and quantity supplied
`Massachusetts supplier. Also, print how many of these are minis.
`
`:= 0;
`
`Query Q
`Begin
`Count
`For S
`then for Y
`If S.location = “MA”
`If S.s# = Y.s# then for P
`If Y.p# = P.p# then begin
`Print S.name, P.name, Y.qty;
`If P.type = “mini”
`then Count
`Print “Number of minis is”, Count
`;
`End.
`
`:= Count + 1; end;;;
`
`Fig. 2. Example datalanguage query. Datalanguage used in this example
`in the appendix.
`
`is defined
`
`’ In practice, database processing within sites is considered as a secondary objective. For expository
`clarity, we shall not treat
`this issue.
`long-
`is a packet-switched
`the network
`(PDP-lOs), while
`’ Sites in SDD-1 are mainframe computers
`distance network
`(Arpanet). Sustainable bandwidth
`on the network
`is at most 10 kbits per second
`(see [24-261).
`
`ACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-2
`
`

`

`604
`
`l
`
`P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.
`
`Envelope E
`Retrieve
`(S.s#, S.name, Nocation)
`Retrieve
`(Y.s#, Y.p#, Y.qty)
`Retrieve
`(P.p#, P.name, P.type)
`
`where qualification
`where qualification
`where qualification
`
`qualification:
`
`Slocation = “MA” A S.s# = Y.s# A Y.p# = P.p#
`
`for query of Figure 2. Envelopes are
`Fig. 3. Envelope
`defied
`in Section 2. Intuitively,
`an envelope specifies a
`subset of each relation
`in the database. We express enve-
`lopes in a relational
`calculus similar
`to QUEL
`[15].
`
`The result of envelope E is to retrieve any superset of the data specified by E. For
`example,
`
`S(s#,
`1,
`2,
`
`name,
`Acme,
`Best,
`
`location)
`MA
`MA
`
`Y(s#,
`1,
`1,
`
`P#, @Y)
`1,
`20
`2,
`50
`
`P(p#,
`1,
`2,
`3,
`4,
`5,
`
`type)
`name,
`micro
`LSI,
`mini
`Pll,
`main
`360,
`huge
`CRI,
`8080, micro
`
`The specific superset retrieved
`
`is determined by efficiency considerations.
`
`The retrieved
`
`relations are also transmitted
`
`to a single site, for example, site 3.
`
`Fig. 4. Processing envelope of Figure 3.
`
`it is not
`[7,8,13, 16,17, 33,341, although naturally
`adopted by other researchers
`appropriate
`in every system [lo, 19, 271. Section 5 discusses the impact of this
`assumption on our approach.
`Our algorithm has three main steps. Step 1 maps a Datalanguage query Q into
`form (an envelope) that specifies a superset of the database
`a relational calculus
`needed to answer Q (see Figure 3). Step 1 depends on details of Datalanguage
`and is of general interest only insofar as Datalanguage
`resembles other procedural
`query languages. This step is described in [ll].
`Step 2 evaluates the envelope. This step retrieves a superset of the database
`specified by the envelope, assembling
`the result at a single site S, (see Figure 4).
`(The specific superset retrieved and the “assembly site” S, are determined by
`efficiency considerations.) Step 2 is accomplished by translating
`the envelope into
`(a reducer), followed by commands
`a program P containing
`relational operations
`to move the results of P to S, (see Figure 5). The goal is to construct a reducer P
`and select a site S, such that the cost of computing P and moving
`the results
`to
`S, is minimum over aLl reducers and sites. This optimization
`problem constitutes
`the core of the SDD-1 query processing algorithm and is the focus of this paper.
`Step 3 executes Q at S, using the data assembled by Step 2. Since Step 3 only
`involves
`local query processing,
`it will not be discussed further. Steps l-3 are
`outlined
`in Figure 6.
`The paper has five sections. Section 2 defines envelopes and the operations
`used to process envelopes. Section 3 presents techniques
`for estimating
`the cost
`and effect of a reducer composed of these operations. Section 4 presents a
`heuristic algorithm
`that compiles envelopes into efficient
`(though not necessarily
`optimal)
`reducers. Section 5 discusses related work and suggests directions
`for
`
`ACM Transactions
`
`on Database Systems, Vol. 6, No. 4, December 1981
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-3
`
`

`

`Query Processing
`
`in a System for Distributed Databases
`
`l
`
`605
`
`Program P
`1. S := S[location = “MA”]
`2. Y := Y(s# = s#]S
`
`; restrict S to MA suppliers
`; this operation
`is semijoin
`-it
`computes
`the set of Y
`tuples that corresponds
`to
`MA suppliers.
`
`end
`
`Figure 4 shows the result of applying P to the database of Figure 1.
`
`Fig. 5. Program
`
`for envelope of Figure 3.
`
`pT--+
`
`1
`
`Q (Datalanguage
`
`query)
`
`0 (Distributed
`
`database)
`
`1
`
`E (Relational envelope)
`
`and select assemblv
`
`site
`
`execution
`Distributed
`of reducer
`
`S, (Assembly
`
`site)
`
`\
`
`-//
`
`t-l D ’ (Superset of database
`
`needed
`
`to compute Q(D))
`
`I
`
`Step 3
`
`4
`
`Fig. 6. Main steps of query processing algorithm.
`
`ACM Transactions
`
`on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-4
`
`

`

`606
`
`l
`
`P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.
`
`(a) Relational Data Objects
`Term
`
`Definition
`
`A set of values
`Domain
`An alternate name for a domain
`Attribute
`Relation schema A description of a relation, consisting of a relation name and
`list of attributes
`A subset of the Cartesian product of the domains of the
`attributes of the corresponding relation schema
`An element (or row) of a relation
`A set of relations
`Attributes of relation R
`
`Tuple
`Database
`attr(R)
`
`Relation
`
`Also
`Projection:
`
`(b) Relational Algebraic Operations
`Restriction:
`R[A = k] = (r E R ] r.A = k}
`where r.A is the value of the A-domain in tuple r
`R[A = B] = (r E R ] r.A = r.B)
`R[AI, AZ, . . . , An]
`= ((r.Ar, r.Az,. . . , r.A,) ] r E R)
`R[A=B]S=((r,s)]rER,sES,andr.A=s.B)
`R(A = B]S = (RCA = BIS) [attr(R)]
`= {r]rERAr.AEs[B])
`
`Join:
`Semijoin:
`
`Fig. 7. Relational terminology.
`
`relational databases at the
`research. We assume reader familiarity with
`future
`level of [9]. A review of relational
`terminology appears in Figure 7.
`
`2. QUERY PROCESSING STRATEGY
`
`2.1 Envelopes
`
`of
`The attributes of relation R are denoted attr(R). Relation RI is a subrelation
`relation Ri, if attr(RI) _C attr(Ri) and Rf c Ri[attr(Ri)].
`Let D = {RI,
`. . . , R,}
`andD’=
`{R;,...
`, Rh} be databases. D’ is a subdcztabase of D, denoted D’ 5 D,
`if Rf is a subrelation of Ri for i = 1, . . . , n. An envelope is a relational
`calculus
`expression
`that maps a database into a subdatabase. We express envelopes
`in a
`language similar
`to QUEL
`[15].
`. . . , t,. The
`tl,
`lists
`q and target
`An envelope E consists of a qualification
`term q is a Boolean
`formula with clauses of the form Ri. A = Rj .B or Ri. A = k.3
`The terms Ri . A and Rj . B are called indexed variables. Each ti is a set of variables
`indexed by Ri: that
`is, ti is of the form
`{Ri. Ail,
`. . . , Ri. Ail}. Envelope E maps
`database D into subdatabase D’ defined by the following
`collection of QUEL
`queries.
`
`Retrieve
`.
`.
`Retrieve
`
`into R; (tr ) where q.
`
`into Rb(t,) where q.
`
`We limit
`
`the form of envelopes
`
`in two additional ways. One, qualifications
`
`are
`
`3 Note that we avoid tuple variables. These can be accommodated by (conceptually) duplicating a
`relation, thereby having two relation names ranging over the same relation.
`
`ACM Transactions
`
`on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-5
`
`

`

`Query Processing
`
`in a System for Distributed Databases
`
`-
`
`607
`
`Join graph for envelope of figure 1.
`
`0 P.type
`
`the qualifi-
`is handled by placing
`assumed to be pure conjunctions; disjunction
`cation in disjunctive normal
`form and treating each conjunct separately. Two, if
`Ri. A is a term of q, then 6 must contain Ri. A.
`E is an envelope for Datalanguage query Q if for all databases D, Q(E(D)) =
`
`1 Q(D). I t iti n u ve y, an envelope for Q “envelopes” or delimits
`
`the portions of the
`database needed to answer Q. In general, there are many envelopes for a given Q;
`a good envelope is one that tightly delimits
`the data needed by Q. Finding good
`envelopes is an optimization
`problem
`that depends on details of Datalanguage,
`and our approach to this problem
`is described in [ll].
`(a join graph) is useful. The nodes of
`A graph representation of qualifications
`a join graph represent
`indexed variables and constants, and the edges represent
`clauses. A join graph contains
`the edge {Ri. A, Rj .B} (respectively,
`{Ri. A, k} ) iff
`the qualification
`contains Ri .A = Rj .B (respectively, Ri. A = k) (see Figure 8).
`The connected components of a join graph characterize
`the clauses implied by
`the qualification:
`Let N and N’ be nodes of the join graph for q; q implies N = N’
`iff N and N’ are in the same connected component.
`(Proofs appear in [2,3]. Note
`that if N and N’ represent distinct constants, q is unsatisfiable.)
`
`2.2 Reducers
`A reduction of database D with respect to E is any D’ such that E(D) 5 D’ 5 D.
`A reducer for E is a sequential program4 P of relational operations such that for
`
`4 A reducer
`
`is executed as a parallel program, however
`
`(see [23]).
`
`ACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-6
`
`

`

`608
`
`*
`
`P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.
`
`Given
`
`S(s#,
`1,
`2,
`
`name,
`Acme,
`Best,
`
`location)
`MA
`MA
`
`yw,
`1,
`1,
`3,
`4,
`4,
`
`Pd @Y)
`20
`1,
`50
`2,
`50
`3,
`10
`1,
`75
`5,
`
`WP#,
`1,
`2,
`3,
`4,
`5,
`
`type)
`name,
`micro
`LSI,
`mini
`Pll,
`main
`360,
`huge
`CRI,
`8080, micro
`
`l Y(s# = s#]S = Y( s#,
`1,
`1,
`l P(p# = p#]Y = P( p#,
`
`p#,
`1,
`2,
`
`qty) = (Y tuples that correspond
`20
`50
`
`to a MA supplier)
`
`name,
`
`that are supplied by some MA
`type) = {parts
`supplier}
`
`micro
`mini
`
`location)
`MA
`
`= {MA
`thing)
`
`suppliers who
`
`supply
`
`any
`
`1,
`2,
`l S(s# = s#]Y = S(s#,
`1,
`these semijoins are profitable. However, Y(p#
`*All of
`profitable.
`
`LSI,
`Pll,
`
`name,
`Acme,
`
`= p#]P would not be
`
`Fig. 9. Semijoin.
`
`all databases D, P(D) is a reduction of D with respect to E. Given E and D, our
`optimization
`task is to construct a reducer P and select a site S, such that
`the
`cost of computing P(D) and moving the results to S, is minimum over all reducers
`and sites. A reduction operation
`for E is an operation
`that
`is permitted
`in a
`reducer
`for E. A reduction operation
`reduces the size of D by eliminating
`data
`not specified by E(D). The benefit of a reduction operation
`is the amount of data
`the cost is the amount of intersite data transfer required
`to compute
`it eliminates;
`the operation.
`and projections have zero cost and nonnegative benefit, and so
`Restrictions
`every restriction
`and projection permitted by E should be included
`in every
`reducer
`for E. The projections permitted by E are Ri[ti],
`for i = 1, . . . , n. The
`restrictions
`permitted by E can be determined
`from
`its join graph: E permits
`Ri[A = B] (respectively, Ri[A = k])
`iff q implies Ri.A = Ri.B
`(respectively,
`Ri. A = k) iff Ri. A and Ri. B (respectively, k) are connected
`in the join graph.
`To reduce
`the database
`further, data from
`two or more relations must be
`for this purpose isjoin. However, our algorithm
`combined. The obvious operation
`uses an operation called semijoin, which we deem to be superior. A semijoin
`is
`the semijoin of relation Ri by relation Rj on clause Ri. A = Rj. B,
`“half of a join”;
`denoted Ri(A = B]Rj, equals the join of Ri and Rj on that clause projected back
`onto attr(Ri)
`(see Figure 9). (Notice
`that semijoin, unlike
`join,
`is asymmetric;
`that
`is, Ri( A = B]Rj # Rj( B = AIRi. The
`former reduces Ri, while
`the latter
`reduces Rj .) As with restrictions,
`the semijoins permitted by E can be determined
`from its join graph: E permits Ri( A = B]Rj and Rj (B = A]Ri
`iff Ri. A and Rj . B
`are connected
`in the join graph.
`We prefer semijoins
`to joins for three reasons. First, Ri( A = B]Rj c Ri, and so
`joins can
`semijoins monotonically
`reduce the size of the database. By contrast,
`increase the size of the database; in the worst case, 1 Ri[A = B]Rj 1 = ) Ri 1 * ] Rj I.
`ACM Transactions
`on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-7
`
`

`

`Query Processing
`
`in a System for Distributed Databases
`
`-
`
`609
`
`l Let D be the database
`
`B)
`%(A,
`1
`0
`0
`1
`site 1
`
`C)
`MB,
`1
`1
`0
`0
`site 2
`
`C)
`MA,
`1
`1
`0
`0
`site 3
`
`D, E, F, G, H,
`MC,
`0000000
`1
`1
`
`1)
`
`1
`
`1
`
`1
`
`1
`
`1
`site 4
`
`l Let E be the envelope
`q: R,.A = &.A A R1.B = R2.B A R2.C = R3.C A R3.C = R.C
`for
`i = 1, . . . ,4.
`t, = attr(RJ
`
`is to move RI, Rz, and Ra to site
`the optimal evaluation of E(D)
`l Using semijoins,
`4-that
`is, no semijoins should be used. This requires
`the transmission of 12 data
`items.
`l Using joins, the optimal evaluation
`
`is
`
`RIP := R$B = B]RI at site 2-cost = 4
`RI= := RnJA = A A C = C] at site 2-cost = 4
`Note that RIZS = ( )
`% := Rs[C = C]Rn3 at site 4-cost
`Total cost = 8.
`
`= 0.
`
`Fig. 10. Bad case for semijoins. This example
`
`is adapted
`
`from
`
`[2].
`
`than joins.
`less intersite data transfer
`Second, semijoins can be computed with
`To compute Ri(A = B]Rj, we need only transmit a projection of a relation
`(viz.,
`Rj[B]), whereas to compute Ri[A = B]Rj we must transmit an entire relation. Of
`course, the semijoin may also have less effect than
`the join, since Ri(A = B]Rj
`only reduces Ri, whereas Ri[A = B]Rj simultaneously
`reduces Ri and Rj. However,
`the third advantage of semijoins
`is that the “reductive effect” of any single join
`can be attained by two semijoins, usually at lower cost, as follows.
`Let Rij = RJA = B]Rj. The reductive effect of this join is its effect on Ri and
`Rj, namely, Ru[attr(Ri)]
`and Rij[attr(Rj)].
`By definition of projection,
`
`Rd[attr(Ri)]
`
`= {ri ] 3 (ri, rj) E Rij}
`= {ri E Ri] (3 rj E Rj)(ri.A = rj.B)},
`= Ri(A = B]Rj,
`
`by definition of join
`by definition of semijoin.
`
`= Rj( B = AIR. Thus the reductive effect of Ri[A = B]Rj
`Similarly, Rij[attr(Rj)]
`is attained by two semijoins as claimed.
`Now let us compare the cost of the join versus the two semijoins. To compute
`Ri[A = B]Rj, one of the relations, Rj say, must be transmitted
`to the other’s site.
`This has cost 1 Rj I* width(Rj), where width(Rj) equals the number of bits in each
`tuple of Rj. TO compute the semijoins, we transmit Rj [B] to Ri and Ri[A]
`to Rj,
`But Ri[A] c Rj[B] after
`for a cost of I Rj [B] I *width(B)
`I *width(A).
`+ I Ri[A]
`Ri(A = B]Rj is executed, and so if we execute the semijoins
`in sequence, the cost
`is less than or equal to I Rj[B]
`) * (width(A)
`+ width(B)).
`This quantity
`is less
`than or equal to I Rj \ * width(Rj) under the reasonable assumption
`that width(A)
`+ width(B)
`I width(Rj). Given this assumption,
`the cost of the semijoins
`is less
`than or equal the cost of the join, as claimed.
`Our arguments
`in support of semijoins are heuristic and there are cases in
`which
`joins outperform semijoins. Figure 10 illustrates such a case. An optimal
`query processing algorithm would almost certainly
`include both joins and semi-
`ACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-8
`
`

`

`610
`
`.
`
`P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.
`
`Let D =
`
`(Ss#, name,
`Acme,
`(1,
`2, Best,
`
`location),
`MA
`MA
`
`Yb#,
`1,
`1,
`
`P#,
`1,
`2,
`
`qty),
`10
`20
`
`PW,
`1,
`2,
`
`name,
`LSI,
`Pll,
`
`type))
`micro
`mini
`
`lk, Mid,
`lOk, Nadir,
`
`NY
`CA
`
`lk,
`
`lk,
`
`50
`
`lk,
`lOk,
`
`370,
`470,
`
`main
`main
`
`(a) Attributty domains dom(S.s#) = dom(Y.s#) = {idA’s from 1 to 10k)
`dom(S.name) = dom(P.name)
`= (names of length < 10)
`dom(S.location) = {states of U.S.}
`v (provinces of Canada)
`
`etc.
`(b) Auxiliary domains X1 = (strings of length < 10)
`XP = (integers from 1 to IOk}
`
`(c) Subset hierarchies
`
`dom(S.name) = dom(P.name) dom(S.location) dom(P.type)
`dom(S.s#) = dom(Y.s#)
`= dom(Y.p# = dom(P.p#)
`
`dom(Y.qty)
`
`Fig. 11. Domains.
`
`integration of these tactics is an open problem, however, and
`joins. The graceful
`our algorithm only uses semijoins.
`
`3. COST AND BENEFIT ESTIMATION
`the cost
`reducer, we need to estimate
`To compile an envelope
`into an efficient
`and benefit of reduction operations. This section presents an estimation procedure
`based on a statistical model of the database. We only consider
`the estimation
`problem
`for semijoins; estimation
`techniques
`for restrictions and projections are
`described by [28].
`of a set theoretic model of the
`is an approximation
`Our statistical model
`database and is described
`in Section 3.1. Section 3.2 presents a technique
`for
`the effect of set operations. Section 3.3 extends
`this
`technique
`to
`estimating
`estimate
`the effect of a sequence of semijoins.
`
`3.1 Database Model
`Let D = {RI,
`. . . . R,} be a database. Associated with each attribute
`of each
`relation,
`for example, Ri. A, is a finite domain of values, dom(Ri. A). Ri[A]
`is
`constrained
`to be a subset of dom(Ri. A) (see Figure lla). The model also contains
`auxiliary
`domains
`(see Figure
`llb). The set of all domains
`is partitioned
`into
`domain hierarchies,
`each of which contains a maximum domain X,,, and all
`domains X such that X _C X,,,
`(see Figure 11~). Domains Xi and Xj are joinable
`ACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-9
`
`

`

`Query Processing
`
`in a System for Distributed Databases
`
`*
`
`611
`
`domains
`
`XI
`
`X2
`
`dom(S.s#)
`dom(Y.s#)
`dom(P.p#)
`dom(Y.i#)
`10k
`2
`
`dom(S.location)
`
`59
`2
`
`Y.s#
`lk
`
`Y.p# * * *
`lk
`
`10k
`2
`
`c(domain)
`w(domain)
`
`attributes
`c(attribute)
`
`I
`
`relations
`c(relation)
`
`26’”
`10
`
`S.s#
`10k
`
`S
`1Ok
`
`S.location
`50
`
`Y
`lOOk
`
`P.p#
`10k
`
`P
`1Ok
`
`I
`
`I
`
`Fig. 12. Statistical model of database of Figure 12.
`
`If a qualification
`if they are members of the same domain hierarchy.
`Ri.A = Rj.B,
`then dom(Ri.A) and dom(Rj.B) must be joinable.
`We approximate D by the following statistics, called a database profile.
`
`contains
`
`(1) For each domain X
`
`(i) c(A) = the estimated cardinality of X,
`(ii) w(A) = the “width” of X, that is, the number of bits, words, and so forth,
`used to represent an arbitrary element of X.
`
`(2) For each relation Ri, c(Ri) = the estimated cardinality of Ri.
`(3) For each relation Ri and each A E attr(Ri),
`c(Ri.A) = the estimated cardi-
`nality of Ri[A].
`
`l(i) and l(ii) are fixed a priori by the database administrator, while
`Parameters
`the other parameters are updated by the system
`to reflect changes
`in
`the
`database. To reduce overhead, these parameters are updated off-line on a periodic
`basis.
`the domain hierarchies by specifying which
`indicates
`The statistical model
`domains are subsets of which other domains. The model also includes
`the
`following assumptions.
`
`(1) If Xi c Xj, then Xi is a randomly selected subset of Xj; operationally
`means that the probability
`of x E Xi is identical
`for all x E Xj .
`(2) For each relation Ri and A E attr(Ri)
`
`this
`
`is a randomly selected subset of dom(Ri.A).
`(i) RJA]
`this means
`(ii) Tuples of Ri are uniform2y distributed
`over values of Ri[A];
`that the probability
`of ri. A = a is identical
`for all ri E Ri and a E Ri[A].
`
`(3) For each Ri and distinct A, B E attr(Rti), Ri[A] and Ri[B] are independent,
`meaning that the probability
`of ri. A = a is unaffected by the value of ri.B.
`
`is a crude
`These assumptions are quite strong and this statistical model
`approximation. However,
`it is difficult
`to devise better models without knowledge
`of the processes placing data in the database. Figure 12 illustrates our model.
`
`ACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-10
`
`

`

`612
`
`*
`
`P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.
`
`l Let U = X2 from Figure 11
`l Let X be the following sequence of operations
`Y, = select (U, 1)
`; YI might represent S[s#]
`Yz = select (LJ, l/10)
`; YZ might represent Y[s#]
`Y3 = select (Y1, l/50)
`; YZ might represent the effect on S[s#] of
`S[location = “MA”]
`; Yd might represent the effect on Y[s#] of
`Y(s# = s#]S.
`; YE, might represent the effect on S[s#] of
`S(s# = s#]Y
`
`Y4 = Yz n Y3
`
`Yg=Y3r3Y,
`
`l
`
`G(Z)
`
`=
`
`Fig. 13. Graph representations of set operations.
`
`3.2 Effect of Set Operations
`Consider
`the following problem. We are given a universe U of objects and two
`for constructing subsets of U-random
`selection (defined below) and
`operations
`set intersection. The problem
`is to estimate the cardinality of any set that can be
`constructed by a sequence of these operations. Let X be such a set. The selectivity
`of X is the probability
`that an arbitrary
`x E U is also an element
`of X. The
`expected 1 X 1 is just its selectivity
`times 1 U I, and so to estimate
`I X I it is sufficient
`to estimate
`its selectivity.
`for “x E X,” and
`Let X c U and x E U. We use X(x) as an abbreviation
`Prob(X(x))
`denotes the probability
`of X(x)
`(i.e., the selectivity of X). Similarly,
`if S = {X,,
`. . . . X,}
`is a family of subsets of U, S(x) is an abbreviation
`for
`At, Xi(x), and Prob(S(x)) denotes the joint probability
`of x being an element of
`every Xi E S.
`the random selection operation. Let, X L U and 0 I cr 5 1;
`We now define
`select(X,
`(u) constructs a set X’ c X in which Prob(X’(x))
`= a*Prob(X(x))
`for all
`x E u.
`Let Z be a sequence of selection and intersection operations. We can represent.
`Z as an edge labeled DAG, G(Z), whose edges represent operations and whose
`nodes represent sets constructed by those operations
`(see Figure 13). Formally,
`G(Z) = (V(Zj, E(Z), label), where
`
`(1) V(Z) = (U} U S, where S is the family of sets constructed by Z;
`(2) E(Z) contains
`the following edges:
`
`label (Y if X’ = select(X, a);
`(i) (X, X’) with
`(ii)
`(X, X’) and (Y, X’) with
`label 1 if X’ = X rl Y.
`
`G(X) provides an efficient. means of calculating Prob(S(x))
`of sets constructed by Z.
`ACM Transactions
`on Database Systems, Vol. 6, No. 4, December 1981.
`
`for arbitrary
`
`families
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-11
`
`

`

`Query Processing
`
`in a System
`
`for Distributed Databases
`
`*
`
`613
`
`{U,Y,,
`l LetS=
`. . , Y,) . Lemma 1 states that Prob(S(x)) equals the product of all
`edge labels that precede S in G(Z).
`l Selectivity of
`U = 1
`YI = 1
`Yz = l/10
`Y3 =
`l/50
`Y4 =
`l/500
`=
`l/500
`l Note that YS = Y3 n Y, = Ya n (Yz n Y3) = YZ n YI = Yq, so it is not coincidental
`that Yg and Y, have identical selectivities.
`
`Y5
`
`Fig. 14. Calculating selectivities
`
`for Figure 13.
`
`LEMMA 1. Let Z be any sequence of selections and intersections
`operating
`initially
`on U, let S c V(Z), and let E(S) = {E E E(Z) 1 E precedes some node
`XES}.ThenforallxE
`U
`
`Prob(S(x))
`
`=
`
`fl
`EEELS)
`
`label(E).
`
`PROOF. See the appendix. Cl
`
`in Figure 14.
`The lemma is illustrated
`The main result of this section follows a corollary.
`
`PROPOSITION 1. Let B be a sequence of selection and intersection operations
`on U, and define G(Z) as above. (i) If X’ = select (X, a) is an
`operating
`initially
`operation of Z, then the selectivity of X’ equals a times the selectivity of X. (ii)
`If X’ = X n Y is an operation of 2, then the selectivity of X’ equals the product
`over all edges E that precede X or Y in G(Z) of label(E).
`
`3.3 Effect of Semijoins
`is analyzed as several sequences of set operations, one
`A sequence of semijoins
`per domain hierarchy. Let & be the sequence for hierarchy Hk. The universe
`for
`I%‘:k is initialized
`to contain
`the following
`IZ:k is the maximum domain of Hk, X,,,.
`selections.
`
`for each X E Hk.
`a), where (Y = c(X)/c(X,,)
`(1) X = select(X,,,,
`(2) R[A]
`= select(dom(Ri.A),
`a) where a! = c(Ri.A)/c(dom(Ri.A))
`relation and attribute such that dom(Ri.A) E Hk.
`
`for each
`
`The effect of a semijoin, Ri( A = B]Rj is analyzed in three steps.
`
`into Ri[A] n Rj[B]. Suppose dom(Ri.A) E Hk. To
`(1) The semijoin maps RJA]
`estimate
`the new cardinality
`of Ri[A], we append RJA] = Ri[A] n Rj[B]
`to &
`and use Proposition 1 to estimate the new selectivity of RJA]. c(Ri.A)
`is updated
`to the new selectivity
`times c(Xmax).
`(2) The semijoin eliminates some tuples from Ri. Since tuples of Ri are assumed
`to be uniformly distributed over values of Ri[A],
`the estimated new cardinality of
`Ri is (new value of c(Ri.A))*(old
`value of c(R))/(old
`value of c(Ri.A)).
`(3) The elimination
`of tuples from Ri causes some values to be deleted from
`RJA’],
`for all A’ E attr(Ri)
`-
`{A}. This effect can be analyzed as a hit ratio
`problem: Let t be the old value of c(Ri); these t tuples are assumed to be uniformly
`ACM Transactions on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-12
`
`

`

`614
`
`l
`
`P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. Rothnie, Jr.
`
`l Consider R(A, B) and buppose 1 R 1 = 6 and 1 R[B] 1 = 3
`l We can partition R into 3 blocks based on B values.
`
`- , bl
`- , bl
`- t b2
`- t b2
`- , b3
`- , b3
`El
`
`l
`
`l
`
`l
`
`If we select one tuple of R we will, of course, hit one block.
`If we select two tuples, we will probably hit two blocks, but we might only hit one.
`If we select three
`tuples, we might hit three blocks, but it is more likely
`that we
`will only hit two.
`And so forth.
`
`Fig. 15. Hit ratio problem.
`
`“blocks,” where each block contains all tuples with
`distributed over b = c(Ri.A’)
`the same A’ value (see Figure 15). The question
`is: “How many blocks do we
`expect to hit if we randomly select n = c(Ri) tuples?” An efficient
`formula
`that
`is given by Yao [32]:
`answers this question
`
`the expected number of blocks =
`
`Yh b, t) = b * ,!! ‘“,“,~i’~~’ ,
`
`where
`
`d = 1 -
`
`l/b.
`
`In practice,
`
`it is reasonable
`
`to approximate Y by
`
`for n < +b
`for+btn<2b
`for 2b < n.
`
`P(n, b, t) =
`
`iin + b),
`i
`b,
`Y and y are graphed in Figure 16.
`to P(new value of c(R;), old value of c(Ri.A’), old
`Thus we update c(Ri.A’)
`value of c(Ri)).
`a) to the sequence for the
`:= select(Ri[A’],
`In addition, we append Ri[A’]
`domain hierarchy
`that contains dom(I&.A’), where (Y = (new value of c(Ri.A’))/
`(old value of c(Ri.A’)). This selection
`is not used in estimating
`the effect of the
`current semijoin, but is needed to estimate the effects of later ones.
`
`to be the amount of intersite data transfer
`
`3.4 Cost and Benefit of Semijoins
`The cost of Ri (A = B]Rj is defined
`required
`to compute
`it. This equals
`if Ri and Rj are stored at the same site
`otherwise.
`
`0,
`c(Rj.B)*w(dom(Rj.B)),
`
`l
`from the database.
`The benefit of Ri (A = B]Rj is the amount of data eliminated
`This equals ((c(Ri) before the semijoin)
`- (c(Ri) after the semijoin))*(the width
`of Ri) = ((c(Ri) before) - (c(Ri) after))* &Eattr(q) w(dom(Ri.A)).
`
`ACM Transactions
`
`on Database Systems, Vol. 6, No. 4, December 1981.
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2016-13
`
`

`

`Query Processing
`
`in a System for Distributed Databases
`
`l
`
`615
`
`Hit Ration
`
`l Given t tuples distributed over b blocks.
`
`l How many blocks will we hit if we select n tuples?
`
`b
`
`1OK
`
`10K
`
`1K
`
`Fig. 16. Yao function.
`
`4. OPTIMIZATION
`
`ALGORITHM
`
`is an envelope E and
`This section presents our optimization algorithm. The input
`database profile D. The algorithm compiles E into a reducer P, which is estimated
`to be profitable
`in any database modeled by D. In addition,
`the algorithm selects
`an assembly site S, and appends to P commands to move the reduced database
`to s,.
`two
`Section 4.1 presents our “basic” algorithm, and Section 4.2 describes
`enhancements
`to the basic a

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket