`
`MODERN
`
`DATABASE
`
`MANAGEMENT
`
Fifth Edition

Fred R. McFadden

Jeffrey A. Hoffer

Mary B. Prescott
`
`
`
`Apple, Exhibit 1026, Page 1
`
`
`
`
`
`
`
`
■ In memory of my valued colleague Daniel Conger. — F. R. M.
`
■ To Patty, for her sacrifices, encouragement, and support.
To my students, for being receptive and critical, and challenging me to
be a better teacher. — J. A. H.
`
■ To Larry, Mike, and Ivan. Their love and support provide a foundation
to my life which makes efforts such as writing this book possible. And
to Jeff and Fred, who gave me the opportunity to write with them and
patiently provided invaluable guidance along the way. — M. B. P.
`
`Executive Editor: Michael Roche
`
`Developmental Editor: Maureen Allaire Spada
`Assistant Editor: Ruth Berry
`Editorial Assistant: Adam Hamel
`
`Senior Marketing Manager: Tom Ziolkowski
`Senior Marketing Coordinator: Deanna Storey
`Senior Production Supervisor: Patty Mahtani
`Manufacturing Buyer: Sheila Spinney
`Design Director: Regina Hagen
`Cover Designer: Linda Manly Wade
`Text Designer: Sandra Rigney
`Composition and Project Coordination: Elm Street Publishing Services, Inc.
`Cover Image: © Mick Tarel for Artville
`
`Library of Congress Cataloging-in-Publication Data
`
McFadden, Fred R., 1933–
    Modern database management / Fred R. McFadden, Jeffrey A. Hoffer,
Mary B. Prescott. — 5th ed.
    p.  cm.
    Includes bibliographical references and index.
    ISBN 0-8053-6054-9
    1. Database management.  I. Hoffer, Jeffrey A.  II. Prescott, Mary B.
III. Title.
    QA76.9.D3M395  1999
    005.74—dc21                                             98-4120
                                                                CIP

Reprinted with corrections, May 1999
Copyright © 1999 by Addison-Wesley Educational Publishers, Inc.
`
`All rights reserved. No part of this publication may be reproduced, stored in a retrieval system,
`or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or
`otherwise, without the prior written permission of the publisher. Printed in the United States of
`America.
`
5 6 7 8 9 10-WCT-02 01 00
`
`
`
`
`
`
`
`
`Distributed Databases
`
`LEARNING OBJECTIVES
`
`
`
`
■ Explain four strategies for the design of distributed databases, options within
each strategy, and the factors to consider in selecting among these strategies.

■ State the relative advantages of synchronous and asynchronous data
replication and partitioning as three major approaches for distributed
database design.

■ Outline the steps involved in processing a query in a distributed database
and several approaches used to optimize distributed query processing.

■ Explain the salient features of several distributed database management
systems.
`
`INTRODUCTION
`
`When an organization is geographically dispersed, it may choose to store its data-
`bases on a central computer or to distribute them to local computers (or a combina-
`tion of both). A distributed database is a single logical database that is spread
`
Distributed database: A single logical database that is spread physically
across computers in multiple locations that are connected by a data
communications link.
`
`
`
`
`
`
`418
`
`Chapter 11 —— Distributed Databases
`
`»
`
`’
`
`l
`
`..
`
`‘
`
`i
`
`‘
`
`r
`
`'
`
`physically across computers in multiple locations that are connected by a data com-
`munications network. We emphasize that a distributed database is truly a database,
`not a loose collection of files. The distributed database is still centrally administered
`as a corporate resource while providing local flexibility and customization. The net—
`work must allow the users to share the data; thus a user (or program) at location A
must be able to access (and perhaps update) data at location B. The sites of a
distributed system may be spread over a large area (such as the United States or the world),
`or over a small area (such as a building or campus). The computers may range from
`microcomputers to large-scale computers or even supercomputers.
`A distributed database requires multiple database management systems, run-
`ning at each remote site. The degree to which these different DBMSs cooperate, or
`work in partnership, and whether there is a master site that coordinates requests
`involving data from multiple sites distinguish different types of distributed data-
`base environments.
`
It is important to distinguish between distributed and decentralized databases. A
decentralized database is also stored on computers at multiple locations; however,
the computers are not interconnected by a network, so that users at the various sites
cannot share data. Thus a decentralized database is best regarded as a collection of
independent databases, rather than having the geographical distribution of a single
database.

Decentralized database: A database that is stored on computers at multiple
locations; however, the computers are not interconnected by a network, so
that users at the various sites cannot share data.

Various business conditions encourage the use of distributed databases:

• Distribution and autonomy of business units  Divisions, departments, and facilities
in modern organizations are often geographically (and possibly internationally)
distributed. Often each unit has the authority to create its own information
systems, and often these units want local data over which they can have control.

• Data sharing  Even moderately complex business decisions require sharing data
across business units, so it must be convenient to consolidate data across local
databases on demand.

• Data communications costs and reliability  The cost to ship large quantities of
data across a communications network or to handle a large volume of transactions
from remote sources can be high. It is often more economical to locate
data and applications close to where they are needed. Also, dependence on
data communications can be risky, so keeping local copies or fragments of data
can be a reliable way to support the need for rapid access to data across the
organization.
`
The ability to create a distributed database has existed for over a decade. As
you might expect, a variety of distributed database options exist (Bell and Grimson,
1992). Figure 11-1 outlines the range of distributed database environments. These
environments are briefly explained by the following:
`
`I. Homogeneous The same DBMS is used at each node.
`
`A. Autonomous Each DBMS works independently, passing messages back and
`forth to share data updates.
`
`B. Non-autonomous A central, or master, DBMS coordinates database access
`
`and update across the nodes.
`
II. Heterogeneous  Potentially different DBMSs are used at each node.
`
`A. Systems Supports some or all of the functionality of one logical database.
`
`
`
`
`41 9
`Intro duction
`
`
`
`
`gr2
`
`
`
`
`Distributed database environments
`
`
`
`Homogeneous
`Heterogeneous
`
`Figure 11-1
`Distributed database
`
`environments (adapted from
`Bell and Grimson, 1992)
`
`Autonomous
`
`Non-Autonomous
`
`Gateways
`
`Systems
`/\'
`Full DBMS functionality
`Partial—Multidatabase
`/\
`Federated
`Unfederated
`
`/\
`Loose integration
`Tight integration
`
`ii
`
`1
`
`till
`
1. Full DBMS Functionality  Supports all of the functionality of a distributed
database, as discussed in the remainder of this chapter.

2. Partial-Multidatabase  Supports some features of a distributed database,
as discussed in the remainder of this chapter.
`
`a. Federated Supports local databases for unique data requests.
`
i. Loose Integration  Many schemas exist, one for each local database, and
each local DBMS must communicate with all local schemas.
`
`ii. Tight Integration One global schema exists that defines all the data
`across all local databases.
`
`b. Unfederated Requires all access to go through a central coordinating
`module.
`
`B. Gateways Simple paths are created to other databases, without the benefits
`of one logical database.
`
`A homogeneous distributed database environment is depicted in Figure 11-2.
`This environment is typically defined by the following characteristics (related to the
`non-autonomous category described above):
`
• Data are distributed across all the nodes.

• The same DBMS is used at each location.

• All data are managed by the distributed DBMS (so there are no exclusively
local data).

• All users access the database through one global schema or database definition.

• The global schema is simply the union of all the local database schemas.
`
`It is difficult in most organizations to force a homogeneous environment, yet hetero-
`geneous environments are much more difficult to manage.
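The "union of local schemas" idea can be sketched in a few lines of Python. The book itself gives no code, so the site names, table names, and column lists below are invented for illustration:

```python
# Illustrative sketch: in a homogeneous distributed DBMS, the global schema
# is simply the union of the local schemas at each node. All names here are
# hypothetical examples, not from the text.
local_schemas = {
    "lakeview": {"Customer": ["Acct_Number", "Customer_Name", "Balance"]},
    "valley":   {"Customer": ["Acct_Number", "Branch_Name"]},
}

def build_global_schema(local_schemas):
    """Union all local schemas into one global schema (table -> columns)."""
    global_schema = {}
    for schema in local_schemas.values():
        for table, columns in schema.items():
            # Merge column lists, preserving order and avoiding duplicates.
            merged = global_schema.setdefault(table, [])
            for col in columns:
                if col not in merged:
                    merged.append(col)
    return global_schema

# The global Customer definition contains every column known at any site.
assert build_global_schema(local_schemas)["Customer"] == [
    "Acct_Number", "Customer_Name", "Balance", "Branch_Name"]
```

Because the union is mechanical, a homogeneous environment needs no mapping or translation layer, which is one reason it is easier to manage than a heterogeneous one.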
`
`
`
`
`
`
`420
`
`Chapter 11 — Distributed Databases
`
`Figure 11-2
`Homogeneous distributed
`database environment
`(adapted from Bell and
`
`Global User
`
`Glohal User
`
`Grimson, 1992)
`
`
`Distributed
`DBMS
`
`
`
`Global
`Schema
`
`Figure 11-3
`Heterogeneous distributed
`database environment
`(adapted from Bell and
`Grimson, 1999)
`
`'
`
`ii.
`331
`
`
`
`As listed above, there are so many variations of heterogeneous distributed data-
`base environments. In the remainder of the chapter, however, a heterogeneous envi-
`ronment will be defined by the following characteristics (as depicted in Figure 11-3):
`
`0 Data are distributed across all the nodes.
`
`0 Different DBMSs may be used at each node.
`
`Local User
`
`Global User
`
`Local User
`
`
`
`
`
`
`
• Some users require only local access to databases, which can be accomplished by
using only the local DBMS and schema.

• A global schema exists, which allows local users to access remote data.

Objectives and Trade-offs
`
`A major objective of distributed databases is to provide ease of access to data for users
`at many different locations. To meet this objective, the distributed database system
`must provide what is called location transparency, which means that a user (or user
`program) using data for querying or updating need not know the location of the data.
`Any request to retrieve or update data from any site is automatically forwarded by
`the system to the site or sites related to the processing request. Ideally, the user is
`unaware of the distribution of data, and all data in the network appear as a single log-
`ical database stored at one site. In this ideal case, a single query can join data from
`tables in multiple sites as if the data were all in one site.
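Location transparency can be sketched as a catalog lookup that the user never sees. This is an illustrative Python model, not from the text; the catalog structure, site names, and sample rows are invented:

```python
# Illustrative sketch of location transparency: the user queries by table
# name only; a catalog maps each table to the site that holds it, and the
# system forwards the request. All names here are hypothetical.
catalog = {"Customer": "lakeview_server", "Order": "valley_server"}

sites = {
    "lakeview_server": {"Customer": [(200, "Jones", 1000)]},
    "valley_server":   {"Order": [(7001, 200, "pending")]},
}

def query(table):
    """Route a request to whichever site holds the table;
    the caller never names a location."""
    site = catalog[table]   # the location lookup is hidden from the user
    return sites[site][table]

# The user asks for Customer data without knowing where it lives.
assert query("Customer") == [(200, "Jones", 1000)]
```

In the ideal case described above, the same routing would let one query join tables held at different sites as if they were local.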
`A second objective of distributed databases is local autonomy, which is the capa-
`bility to administer a local database and to operate independently when connections
to other nodes have failed (Date, 1995). With local autonomy, each site has the
capability to control local data, administer security, log transactions, recover
when local failures occur, and provide full access to local data to local users when
any central or coordinating site cannot operate. In this case, data are locally owned
and managed, even though they are accessible from remote sites. This implies that
there is no reliance on a central site.
`
A significant trade-off in designing a distributed database environment is
whether to use synchronous or asynchronous distributed technology. With syn-
chronous distributed database technology, all data across the network are continu-
ously kept up-to-date so that a user at any site can access data anywhere on the
network at any time and get the same answer. With synchronous technology, if any
copy of a data item is updated anywhere on the network, the same update is imme-
diately applied to all other copies or aborted. Synchronous technology ensures data
integrity and minimizes the complexity of knowing where the most recent copy of
data is located. Synchronous technology can, however, result in unsatisfactorily slow
response time, since the distributed DBMS spends considerable time checking that an
update is accurately and completely propagated across the network.

Asynchronous distributed database technology keeps copies of replicated data
at different nodes so that local servers can access data without reaching out across the
network. With asynchronous technology, there is usually some delay in propagating
data updates across the remote databases, so some degree of at least temporary
inconsistency is tolerated. Asynchronous technology tends to have acceptable
response time, since updates happen locally and data replicas are synchronized in
batches at predetermined intervals, but it may be more complex to plan and design to
ensure exactly the right level of data integrity and consistency across the nodes.
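The contrast between the two technologies can be sketched as follows. This is an illustrative Python model with invented names; a real DBMS would use locking and commit protocols rather than plain dictionaries:

```python
# Illustrative sketch of synchronous vs. asynchronous propagation.
# Synchronous: apply the update to every replica before returning, or abort.
# Asynchronous: commit locally, queue the change for later batch propagation.
# All structures here are hypothetical simplifications.
replicas = {"A": {}, "B": {}, "C": {}}
pending = []   # asynchronous propagation queue

def update_sync(key, value, up_nodes):
    """All-or-nothing: if any replica is unreachable, abort the update."""
    if set(up_nodes) != set(replicas):
        return "aborted"
    for node in replicas:
        replicas[node][key] = value
    return "committed"

def update_async(local_node, key, value):
    """Local commit is immediate; other copies catch up later."""
    replicas[local_node][key] = value
    pending.append((key, value))

def propagate_batch():
    """Synchronize replicas at a predetermined interval."""
    while pending:
        key, value = pending.pop(0)
        for node in replicas:
            replicas[node][key] = value
```

With `update_async`, replicas disagree until `propagate_batch` runs, which is exactly the temporary inconsistency the text says asynchronous technology tolerates.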
Compared to centralized databases, there are numerous advantages to either
form of a distributed database. The most important are the following:

• Increased reliability and availability  When a centralized system fails, the database
is unavailable to all users. A distributed system will continue to function at some
reduced level, however, even when a component fails. The reliability and
availability will depend (among other things) on how the data are distributed
(discussed in the following sections).
`
Location transparency: A design goal for a distributed database, which says
that a user (or user program) using data need not know the location of the
data.

Local autonomy: A design goal for a distributed database, which says that a
site can independently administer and operate its database when connections
to other nodes have failed.

Synchronous distributed database: A form of distributed database technology
in which all data across the network are continuously kept up-to-date so that
a user at any site can access data anywhere on the network at any time and
get the same answer.

Asynchronous distributed database: A form of distributed database technology
in which copies of replicated data are kept at different nodes so that local
servers can access data without reaching out across the network.
`
`
`
`
`
• Local control  Distributing the data encourages local groups to exercise greater
control over “their” data, which promotes improved data integrity and adminis-
tration. At the same time, users can access nonlocal data when necessary.
Hardware can be chosen for the local site to match the local, not global, data
processing work.

• Modular growth  Suppose that an organization expands to a new location or
adds a new work group. It is often easier and more economical to add a local
computer and its associated data to the distributed network than to expand a
large central computer. Also, there is less chance of disruption to existing users
than is the case when a central computer system is modified or expanded.

• Lower communication costs  With a distributed system, data can be located closer
to their point of use. This can reduce communication costs, compared to a central
system.
`
• Faster response  Depending on how data are distributed, most requests for data
by users at a particular site can be satisfied by data stored at that site. This
speeds up query processing since communication and central computer delays
are minimized. It may also be possible to split complex queries into subqueries
that can be processed in parallel at several sites, providing even faster response.
`
A distributed database system also faces certain costs and disadvantages:

• Software cost and complexity  More complex software (especially the DBMS) is
required for a distributed database environment. We discuss this software later
in the chapter.

• Processing overhead  The various sites must exchange messages and perform
additional calculations to ensure proper coordination among data at the different
sites.

• Data integrity  A by-product of the increased complexity and need for coordina-
tion is the additional exposure to improper updating and other problems of data
integrity.

• Slow response  If the data are not distributed properly according to their usage,
or if queries are not formulated correctly, response to requests for data can be
extremely slow. These issues are discussed later in the chapter.
`
`OPTIONS FOR DISTRIBUTING A DATABASE
`
`How should a database be distributed among the sites (or nodes) of a network? We
`discussed this important issue of physical database design in Chapter 7, which intro-
`duced an analytical procedure for evaluating alternative distribution strategies. In
`that chapter we noted that there are four basic strategies for distributing databases:
`
`1. Data replication
`
`2. Horizontal partitioning
`3. Vertical partitioning
`4. Combinations of the above
`
`
`
`
`
Figure 11-4
Customer relation for a bank

Acct_Number   Customer_Name   Branch_Name   Balance
200           Jones           Lakeview      1000
324           Smith           Valley        250
153           Gray            Valley        38
426           Dorman          Lakeview      796
500           Green           Valley        168
683           McIntyre        Lakeview      1500
252           Elmore          Lakeview      330

We will explain and illustrate each of these approaches using relational databases.
The same concepts apply (with some variations) for other data models, such as hier-
archical and network.

Suppose that a bank has numerous branches located throughout a state. One of
the base relations in the bank’s database is the Customer relation. Figure 11-4 shows
the format for an abbreviated version of this relation. For simplicity, the sample data
in the relation apply to only two of the branches (Lakeview and Valley). The primary
key in this relation is account number (Acct_Number). Branch_Name is the name of
the branch where customers have opened their accounts (and therefore where they
presumably perform most of their transactions).
`
`Data Replication
`
`An increasingly popular option for data distribution is to store a separate copy of the
`database at each of two or more sites. Replication may allow an IS organization to
`move a database off a centralized mainframe onto less expensive, departmental or
`location-specific servers, close to end users (Koop, 1995). Replication may use either
synchronous or asynchronous distributed database technologies, although asyn-
chronous technologies are more typical in a replicated environment. The Customer
relation in Figure 11-4 could be stored at Lakeview or Valley, for example. If a copy is
stored at every site, we have the case of full replication (which may be impractical
except for only relatively small databases).
There are five advantages to data replication:

1. Reliability  If one of the sites containing the relation (or database) fails, a copy
can always be found at another site without network traffic delays. Also,
available copies can all be updated as soon as possible as transactions occur,
and unavailable nodes will be updated once they return to service.

2. Fast response  Each site that has a full copy can process queries locally, so
queries can be processed rapidly.

3. Possible avoidance of complicated distributed transaction integrity routines
Replicated databases are usually refreshed at scheduled intervals, so most
forms of replication are used when some relaxing of synchronization across
database copies is acceptable.

4. Node decoupling  Each transaction may proceed without coordination across
the network. Thus, if nodes are down, busy, or disconnected (e.g., in the case of
mobile personal computers), a transaction is handled when the user desires. In
`
`
`
`
`
`
`
the place of real-time synchronization of updates, a behind-the-scenes process
coordinates all data copies.
`
5. Reduced network traffic at prime time  Often updating data happens during prime
business hours, when network traffic is highest and the demands for rapid
response greatest. Replication, with delayed updating of copies of data, moves
network traffic for sending updates to other nodes to non-prime time hours.
`
`Replication has two primary disadvantages:
`
`1. Storage requirements Each site that has a full copy must have the same storage
`capacity that would be required if the data were stored centrally. Each copy
`requires storage space (the cost for which is constantly decreasing), and
`processing time is required to update each copy on each node.
`
`2. Complexity and cost of updating Whenever a relation is updated, it must
`(eventually) be updated at each site that holds a copy. Synchronizing updating
`near real time can require careful coordination, as will be clear later under the
`topic of commit protocol.
`
`For these reasons, data replication is favored where most process requests are
`read-only and where the data are relatively static, as in catalogs, telephone directo-
`ries, train schedules, and so on. CD-ROM and DVD storage technology has promise
`as an economical medium for replicated databases. Replication is used for "noncol-
`laborative data,” where one location does not need a real-time update of data main-
`tained by other locations (The, 1994). In these applications, data eventually need to be
`synchronized, as quickly as is practical. Replication is not a viable approach for
`online applications such as airline reservations, automated teller machine transac-
`tions, and other financial activities—applications for which each user wants data
`about the same, nonsharable resource.
`
Snapshot Replication  Different schemes exist for updating data copies. Some appli-
cations, like those for decision support and data warehousing or mining—which do
not require current data—are supported by simple table copying or periodic snap-
shots. This might work as follows, assuming multiple sites are updating the same
`data. First, updates from all replicated sites are periodically collected at a master or
`primary site, where all the updates are made to form a consolidated record of all
`changes. With some distributed DBMSs, this list of changes is collected in a snapshot
`log, which is a table of row identifiers for the records to go into the snapshot. Then a
`read-only snapshot of the replicated portion of the database is taken at the master
`site. Finally, the snapshot is sent to each site where there is a copy (it is often said that
`these other sites ”subscribe” to the data owned at the primary site). This is called a
`full refresh of the database (Edelstein, 1995a). Alternatively, only those pages that
`have changed since the last snapshot can be sent, which is called a differential refresh.
In this case, a snapshot log for each replicated table is joined with the associated
base table to form the set of changed rows to be sent to the replicated sites.
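A differential refresh can be sketched as a join of the snapshot log with the base table. This is an illustrative Python sketch; the row identifiers and contents are invented, echoing the Customer relation of Figure 11-4:

```python
# Illustrative sketch of a differential refresh: the snapshot log records
# the row identifiers changed since the last snapshot; joining it with the
# base table yields only the changed rows to send to subscribing sites.
# The data below are hypothetical examples.
base_table = {
    200: ("Jones", "Lakeview", 1000),
    324: ("Smith", "Valley", 250),
    683: ("McIntyre", "Lakeview", 1500),
}
snapshot_log = [324, 683]   # rows updated since the last snapshot

def differential_refresh(base, log):
    """Join the snapshot log with the base table to form the changed-row set."""
    return {row_id: base[row_id] for row_id in log if row_id in base}

changes = differential_refresh(base_table, snapshot_log)
# Only the changed rows travel to the subscribing sites; unchanged row 200
# is not retransmitted, which is the saving over a full refresh.
```

A full refresh, by contrast, would ship the entire `base_table` to every subscriber.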
`Some forms of replication management allow dynamic ownership of data, in
`which the right to update replicated data moves from site to site, but at any point in
`time, only one site owns the right. Dynamic ownership would be appropriate as
business activities move across time zones, or where the processing of data follows a
work flow across business units supported by different database servers.
A final form of replication management allows shared ownership of data. Shared
updates introduce significant issues for managing update conflicts across sites. For
`
`3
`
`
`
`
`
`
`
`
`Options for Distributing a Database 425
`
example, what if tellers at two bank branches try to update a customer’s address at
the same time? Asynchronous technology will allow conflicts to exist temporarily.
This may be fine as long as the updates are not critical to business operations, and
such conflicts can be detected and resolved before real business problems arise.
The cost to perform a snapshot refresh may depend on whether the snapshot is
simple or complex. A simple snapshot is one that references all or a portion of only
one table. A complex snapshot involves multiple tables, usually from transactions
that involve joins (such as the entry of a customer order and associated line items).
With some distributed DBMSs, a simple snapshot can be handled by a differential
refresh, whereas complex snapshots require more time-consuming full refreshes.
Some distributed DBMSs support only simple snapshots.
`
Near Real-Time Replication  For near real-time requirements, store-and-forward
messages for each completed transaction can be broadcast across the network inform-
ing all nodes to update data as soon as is possible, without forcing a confirmation to
`the originating node (as is the case with a coordinated commit protocol, discussed
`below) before the database at the originating node is updated (Schussel, 1994). One
`way to generate such messages is by using triggers (discussed in Chapter 9). A trigger
`can be stored at each local database so that when a piece of replicated data is updat-
`ed, the trigger executes corresponding update commands against remote database
`replicas (Edelstein, 1993). With the use of triggers, each database update event can be
`handled individually and transparently to programs and users. If network connec-
`tions to a node are down or the node is busy, these messages informing the node to
`update its database are held in a queue to be processed when possible.
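The trigger-and-queue mechanism can be sketched as follows. This is an illustrative Python model; the node names, queue structure, and trigger function are invented, not a real DBMS API:

```python
# Illustrative sketch of near real-time replication: a local update fires a
# trigger that broadcasts messages to per-node queues; messages for nodes
# that are down wait in the queue until the node is reachable again.
# All names here are hypothetical.
queues = {"node_b": [], "node_c": []}
node_up = {"node_b": True, "node_c": False}
remote_data = {"node_b": {}, "node_c": {}}

def on_update_trigger(key, value):
    """Fires after a local update; broadcast without awaiting confirmation."""
    for node, queue in queues.items():
        queue.append((key, value))
        if node_up[node]:
            drain(node)

def drain(node):
    """Apply every queued update message at the remote replica."""
    while queues[node]:
        key, value = queues[node].pop(0)
        remote_data[node][key] = value

on_update_trigger("acct_200_balance", 1100)
# node_b applies the change immediately; node_c's message waits in its
# queue while the node is down.
node_up["node_c"] = True
drain("node_c")   # the node catches up once it is reachable again
```

Because the originating site never waits for confirmation, this differs from the coordinated commit protocol discussed later in the chapter.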
`
`Pull Replication The schemes explained above for synchronizing replicas are all
`examples of push strategies. Pull strategies also exist. In a pull strategy, the target, not
`the source node, controls when a local database is updated. With pull strategies, the
`local database determines when it needs to be refreshed, and requests a snapshot or
`the emptying of an update message queue. Pull strategies have the advantage that
`the local site controls when it needs and can handle updates. Thus, synchronization is
`less disruptive and occurs only when needed by each site, not when a central master
`site thinks it is best to update.
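A pull refresh can be sketched in the same style; here the target, not the source, initiates the transfer. Illustrative Python with invented names:

```python
# Illustrative sketch of a pull strategy: the target site decides when to
# refresh, emptying the source's update queue on its own schedule.
# The data below are hypothetical examples.
source_updates = [("acct_153_balance", 38), ("acct_500_balance", 168)]
local_copy = {}

def pull_refresh(local, source_queue):
    """Target-initiated: empty the source's update message queue only when
    the local site is ready to handle the updates."""
    while source_queue:
        key, value = source_queue.pop(0)
        local[key] = value

# The local site calls this when convenient, e.g., during off-peak hours.
pull_refresh(local_copy, source_updates)
```

The design choice is simply who holds the timer: in a push scheme the source decides when replicas synchronize; here the target does, so refreshes arrive only when the local site can absorb them.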
`
`Database Integrity with Replication For both periodic and near real-time replica-
`tion, consistency across the distributed, replicated database is compromised. Whether
`delayed or near real-time, the DBMS managing replicated databases still must ensure
`the integrity of the database. Decision support applications permit synchronization
on a table-by-table basis, whereas near real-time applications require transaction-by-
transaction synchronization. But in both cases, the DBMS must ensure that copies are
synchronized per application requirements.
`The difficulty of handling updates with a replicated database also depends on
`the number of nodes at which updates may occur (Froemming, 1996). In a single-
`updater environment, updates will usually be handled by periodically sending read-
`only database snapshots of updated database segments to the nonupdater nodes. In
`this case the effects of multiple updates are effectively batched for the read-only sites.
`This would be the situation for product catalogs, price lists, and other reference data
`for a mobile sales force. In a multiple-updater environment, the most obvious issue is
`data collisions. Data collisions arise when the independently operating updating
`nodes are each attempting to update the same data at the same time. In this case, the
`DBMS must include mechanisms to detect and handle data collisions. For example,
`
`
`
`
`
`
`
`
`426
`
`Chapter 11 — Distributed Databases
`
`
`
`the DBMS must decide if processing at nodes in conflict should be suspended Until
`the collision is resolved.
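Collision detection can be sketched with a simple version-number scheme. This is one common approach, shown as an illustrative Python sketch, and is not necessarily the mechanism any particular DBMS uses; all names are invented:

```python
# Illustrative sketch of data collision detection in a multiple-updater
# environment: each replicated row carries a version number; two sites that
# both update from the same base version collide, and the DBMS must detect
# this rather than silently apply both changes.
master = {"cust_252_address": ("12 Elm St", 1)}   # (value, version)

def apply_update(store, key, new_value, base_version):
    """Apply an update only if no one else changed the row first."""
    value, version = store[key]
    if base_version != version:
        return "collision"   # another site updated first: flag for resolution
    store[key] = (new_value, version + 1)
    return "applied"

# Two tellers both read version 1, then both try to update the address.
assert apply_update(master, "cust_252_address", "9 Oak Ave", 1) == "applied"
assert apply_update(master, "cust_252_address", "4 Pine Rd", 1) == "collision"
```

What happens after a collision is flagged (suspend, retry, or apply a resolution rule) is the policy decision described in the text.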
`
`When to Use Replication Whether replication is a viable alternative design for a
`distributed database depends on several factors (Froemming, 1996):
`
1. Data timeliness  Applications that can tolerate out-of-date data (whether this be
for a few seconds or a few hours) are better candidates for replication.
`
`2. DBMS capabilities An important DBMS capability is whether it will support a
`query that references data from more than one node. If not, then replication is a
`better candidate than the partitioning schemes, which are discussed in the
`following sections.
`
`3. Performance implications Replication means that each node is periodically
`refreshed. When this refreshing occurs, the distributed node may be very busy
handling a large volume of updates. If the refreshing occurs by event triggers
(for example, when a certain volume of changes accumulates), refreshing could
occur at a time when the remote node is busy doing local work.
`
`4. Heterogeneity in the network Replication can be complicated if different nodes
`use different operating systems and DBMSs, or, more commonly, use different
`database designs. Mapping changes from one site to n other sites could mean n
`different routines to translate the changes from the originating node into the
`scheme for processing at the other nodes.
`
`5. Communications network capabilities Transmission speeds and capacity in a
`data communications network may prohibit frequent, complete refreshing of
`very large tables. Replication does not require a dedicated communications
`connection, however, so less expensive, shared networks could be used for
`database snapshot transmissions.
`
`Horizontal Partitioning
`
`With horizontal partitioning (see Chapter 7 for a description of different forms of table
`partitioning), some of the rows of a table (or relation) are put into a base relation at
`one site, and other rows are put into a base relation at another site. More generally,
`the rows of a relation are distributed to many sites.
`Figure 11-5 shows the result of taking horizontal partitions of the Customer rela-
`tion. Each row is now located at its home branch. If customers actually conduct most
`of their transactions at the home branch, the transactions are processed locally and
`response times are minimized. When a customer initiates a transaction at another
branch, the transaction must be transmitted to the home branch for processing and
the response transmitted back to the initiating branch (this is the normal pattern for
persons using automated teller machines, or ATMs). If a customer’s usage pattern
changes (perhaps because of a move), the system may be able to detect this change
and dynamically move the record to the location where most transactions are being
initiated. In summary, horizontal partitions for a distributed database have four
`major advantages:
`
1. Efficiency  Data are stored close to where they are used and separate from
other data used by other users or applications.

2. Local optimization  Data can be stored to optimize performance for local access.
`
`
`
`
`
`
`
`
Figure 11-5
Horizontal partitions

(a) Lakeview Branch

Acct_Number   Customer_Name   Branch_Name   Balance
200           Jones           Lakeview      1000
426           Dorman          Lakeview      796
683           McIntyre        Lakeview      1500
252           Elmore          Lakeview      330

(b) Valley Branch

Acct_Number   Customer_Name   Branch_Name   Balance
324           Smith           Valley        250
153           Gray            Valley        38
500           Green           Valley        168
`
`3. Security Data not relevant to usage at a particular site are not made available.
`
`4. Ease of querying Combining data across horizontal partitions is easy since
`rows are simply merged by unions across the partitions.
`
`Thus, horizontal partitions are usually used when an organizational function is dis-
`tributed, but each site is concerned with only a subset of the entity instances (fre-
`quently based on geography).
`Horizontal partitions also have two primary disadvantages:
`
`1. Inconsistent access speed When data from several partitions are required, the
`access time can be significantly different from local-only data access.
`
`2. Backup vulnerability Since data are not replicated, when data at one site
`become inaccessible or damaged, usage cannot switch to another site where a
`copy exists; data may be lost if proper backup is not performed at each site.
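Using the data of Figure 11-5, horizontal partitioning and a cross-partition union query can be sketched as follows. This illustrative Python is not from the text; the function names are invented:

```python
# Illustrative sketch of horizontal partitioning: each Customer row lives at
# its home branch (data from Figure 11-5), and a query spanning branches is
# simply the union of the partitions.
lakeview = [(200, "Jones", 1000), (426, "Dorman", 796),
            (683, "McIntyre", 1500), (252, "Elmore", 330)]
valley = [(324, "Smith", 250), (153, "Gray", 38), (500, "Green", 168)]

def local_lookup(partition, acct):
    """Fast path: most requests are satisfied at the home branch."""
    return next((row for row in partition if row[0] == acct), None)

def global_query():
    """Combining horizontal partitions is easy: a union of the rows."""
    return lakeview + valley

# A Lakeview customer's transaction is handled entirely at Lakeview ...
assert local_lookup(lakeview, 683) == (683, "McIntyre", 1500)
# ... while a statewide report unions both partitions.
assert len(global_query()) == 7
```

The sketch also shows the disadvantages: `global_query` must reach both sites (inconsistent access speed), and if `valley` becomes unavailable, no other site holds its rows (backup vulnerability).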
`
`Vertical Partitioning
`
`With the vertical partitioning approach (again, see Chapter 7), some of the columns of
`a relation are projected into a base relation at one of the sites, and other columns are
`projected into a base relation at another site (more generally, columns may be pro-
`jected to several sites). The relations at each of the sites must share a common
`domain, so that the original table can be reconstructed.
To illustrate vertical partitioning, we use an application for the manufacturing
company shown in Figure 11-6. Figure 11-7 shows the Part relation with Part_Number
as the primary key. Some of these data are used primarily by manufacturing, while
others are used mostly by engineering. The data are distributed to the respective
departmental computers using vertical partitioning.