`
`An Approach to Decentralized
`Computer Systems
`
`Part Number: 87611
`
`Technical Report 85.4
`June 1985
`
`1
`
`LYFT 1033
`
`LYFT 1033
`
`1
`
`
`
`2
`
`
`
`”'ITANDEMCOMPUTERS
`
`Technical Report 85.4
`June 1985
`
`An Approach to Decentralized
`Computer Systems
`
`PN87611
`
`3
`
`
`
`4
`
`
`
`AN APPROACH TO DECENTRALIZED COMPUTER SYSTEMS
`
`Jim Gray
`
`June 1985
`
`Tandem Technical report 85.4
`
`5
`
`
`
`6
`
`
`
`Tandem TR 85.4
`
`AN APPROACH TO DECENTRALIZED COMPUTER SYSTEMS
`
`Jim Gray
`
`June 1985
`
`Revised January 1986
`
`ABSTRACT
`
`The technology
`
`for distributed computing
`
`is available.
`
`However,
`
`decentralized systems still
`
`pose design and management problems.
`
`Decentralized systems will
`
`always
`
`require more
`
`careful
`
`design,
`
`planning, and management than their centralized counterparts.
`
`This paper begins with the
`
`rational for and against decentralization.
`
`Then, a technical approach to decentralized systems is sketched. This
`
`approach contrasts with the popular
`
`concept
`
`of
`
`a
`
`distributed
`
`integrated database which transparently provides
`
`remote
`
`10
`
`against
`
`single system image. Rather, it proposes that function be distributed
`
`as "servers" which abstract data
`
`as high—level operations
`
`on objects
`
`and communicate with 'requestors' via
`
`a
`
`standard message protocol.
`
`The requestor-server approach has the
`
`advantages of modularity and
`
`performance.
`
`—-—-————.——.—u.—-—--—
`
`This paper has been submitted for publication to the IEEE Transactions
`on Software Engineering.
`
`7
`
`
`
`8
`
`
`
`
`TABLE g CONTENTS
`
`1.
`
`2.
`
`What
`
`is a decentralized system?
`
`1.1. Why build decentralized systems?
`
`1.2.
`
`Integrated system or integrated database?
`
`Technical aspects of distributed systems
`
`2.1. Computational model -- Objects-processes~messages
`
`2.2. Dictionary: naming, authorization and control
`
`2.3. Database
`
`2.3.1. Data definition
`
`2.3.2. Data manipulation
`
`2.3.3. Data independence
`
`2.3.4. Database design
`
`2.4. Networking
`
`2.4.1. Network protocols
`
`2.4.2. Terminal support -- terminal
`
`independence
`
`2.4.3. Network management
`
`2.5. Transaction management
`
`2.5.1. The transaction concept
`
`2.5.2. Direct and queued transaction processing
`
`An approach to designing and managing decentralization
`
`Summary
`
`Acknowledgments
`
`References
`
`9
`
`
`
`10
`
`10
`
`
`
`
`1.1. What is a decentralized system?
`
`A decentralized computer system, as opposed to a centralized one,
`
`is
`
`collection of autonomous computers which communicate with one
`
`another
`
`to perform a common service.
`
`A decentralized system might
`
`occupy
`
`a
`
`single room, but more typically decentralized systems have geographic
`
`and organizational diversity.
`
`The world telephone
`
`system is
`
`the biggest
`
`and best
`
`example of a
`
`decentralized system. It consists of
`
`thousands of
`
`computers,
`
`and
`
`almost a billion terminals.
`
`SOme of the nodes of the system are tiny
`
`local PBX's while others
`
`are quite large, able to handle many calls
`
`per second. Different parts of the system are operated by cooperating
`
`organizations with different
`
`hardware,
`
`different
`
`languages
`
`and
`
`different ideologies. They
`
`have
`
`agreed to protocols which
`
`allow
`
`direct dialing from anywhere to anywhere and the consequent automatic
`
`routing and billing.
`
`[Amos].
`
`Other examples of decentralized systems can
`
`be
`
`found
`
`in the world
`
`travel
`
`industry system,
`
`inter—connecting travel
`
`agents with hotels,
`
`airlines, and other reservation systems, and the
`
`emerging electronic
`
`financial system, connecting
`
`financial
`
`institutions,
`
`businesses and
`
`governments.
`
`11
`
`11
`
`
`
`These systems have the following common features:
`
`* diverse organizations and organizational procedures,
`
`* diverse computer architectures, both hardware and software,
`
`* diverse terminal types,
`
`* diverse system sizes,
`
`from tiny to large. and
`
`* diverse site environments.
`
`The "glue'I that holds
`
`each of
`
`these
`
`systems together is a message
`
`protocol. For each of these systems,
`
`the participating organizations
`
`agreed that:
`
`"When I send you this message,
`
`it means thus
`
`and so“.
`
`They also agreed to the bit—level format of each such message.
`
`The thesis of this article is that these systems could have been built
`
`as an “integrated database",
`
`but
`
`that
`
`such
`
`a
`
`system would
`
`be a
`
`management nightmare.
`
`In reality,
`
`the parts of these systems are not
`
`tightly integrated —- each part
`
`is generally quite different from the
`
`others.
`
`It is "integrated" by having the parts agree to a
`
`common but
`
`arms—length message protocol.
`
`But, before getting into the
`
`the
`
`technical details of distrubted
`
`computing, it is worthwhile
`
`to
`
`review why we distribute computer
`
`systems in the first place.
`
`12
`
`12
`
`
`
`1.1.
`
`Ehy decentralize s stems?
`
`There is no "best'I form of organization.
`
`Each form has its advantages
`
`and disadvantages.
`
`The structure of computing is just one
`
`asPeCt
`
`0f
`
`the structure of
`
`an organization.
`
`Centralized organizations will
`
`continue with centralized computing
`
`and decentralized organizations
`
`will adopt appropriate degrees of decentralized computing [March].
`
`For such decentralized organizations, distributed computer systems are
`
`likely to give
`
`the operational units control of their data.
`
`If done
`
`correctly, a decentralized system allows more flexibility for
`
`growth
`
`of capacity and function.
`
`In addition, a decentralized system can
`
`be
`
`more maintainable since
`
`each part
`
`can change anything so long as it
`
`continues to support its external interfaces.
`
`Expanding on
`
`this
`
`argument,
`
`there
`
`are both
`
`organizational
`
`and
`
`technical motives for decentralizing a system.
`
`The main organizational reasons for decentralized computing are:
`
`* Integration 93 Ere-existing system :
`
`As existing computer
`
`systems
`
`are connected to provide
`
`common
`
`services,
`
`one naturally gets
`
`a
`
`distributed computer
`
`system.
`
`The world phone
`
`system,
`
`travel
`
`reservation systems and finance systems give good examples
`
`of this
`
`decentralization via integration of pre—existing systems.
`
`* Organizational autonomy: Managers want administrative control of
`
`facilities critical to their effectiveness.
`
`To make
`
`this
`
`control
`
`13
`
`13
`
`
`
`real, most administrators also want operational
`
`control of
`
`their
`
`computers and their databases.
`
`Increasingly. computer
`
`systems
`
`are
`
`being designed to reflect the organization structure.
`
`There are several technological reasons for building a decentralized
`
`system.
`
`It is sometimes not feasible to build a centralized system
`
`with the required capacity, response-time, availability or seCurity of
`
`a distributed system. Typical
`
`issues are:
`
`* Capacity:
`
`Some applications have needs far in excess of the largest
`
`computer. They can't fit on a single computer or a single site.
`
`
`* Response-time:
`
`If requestors are distributed over a large area,
`
`the
`
`transit times for requests and replies may be prohibitive. Long haul
`
`commuications can add a second
`
`to response time. Putting processing
`
`and the needed data close to the requestor eliminates such delays.
`
`* Availability: Having geographically distributed autonOmous sites
`
`processing parts of the same application has
`
`the effect
`
`of
`
`error
`
`containment -- a failure or disaster is
`
`likely to be limited to a
`
`single site.
`
`In
`
`addition,
`
`the
`
`remaining
`
`sites may
`
`be
`
`able
`
`to
`
`substitute for a failed node.
`
`
`* Cost: Decentralization may reduce long-haul
`
`communications traffic
`
`and hence reduce communications costs.
`
`The
`
`communications
`
`savings
`
`may outweigh the extra cost of distributed facilites and staff.
`
`14
`
`14
`
`
`
`* Security:
`
`Some orgainzations
`
`feel more
`
`secure when
`
`they have
`
`physical control over the
`
`site that
`
`stores
`
`their data.
`
`(I
`
`am
`
`skeptical of this argument but it is frequently made).
`
`Modularity is another reason for building a distributed system —-
`
`the
`
`desire for modular
`
`growth and change of function and capacity as well
`
`as the desire for the manageability which derives
`
`from a modular
`
`structure.
`
`A decentralized system is necessarily modular:
`
`the autonomous nodes of
`
`the computer network
`
`communicate with one
`
`another via
`
`a message
`
`protocol.
`
`In such a system it is easy to add capacity by adding nodes
`
`or by growing a node.
`
`If done properly, such change is non-disruptive
`
`-- changing one
`
`service does not
`
`interrupt other services so long as
`
`the change is
`
`upward compatible.
`
`Centralized systems are much more
`
`limited in their ability to grow and change.
`
`15
`
`15
`
`
`
`1.2
`
`Integrated system 9; integrated database
`
`As a preview of the thesis of this paper, observe that a major
`
`thrust
`
`of the 19?0's was towards the centralization of data
`
`and application
`
`definition so that diverse parts of
`
`the organization could
`
`share
`
`information -- the goal was
`
`an “integrated-database“.
`
`In most cases
`
`this has resulted in a bureaucracy and
`
`a monolithic
`
`system.
`
`The
`
`corporate-wide Data Base Administrator (DBA)
`
`is a centralized function
`
`and has all
`
`the problems
`
`associated with centralized systems. This
`
`structure can be a source of delays and organizational friction.
`
`A more workable approach to decentralized systems
`
`is
`
`to let
`
`each
`
`department have its own DBA. Rather than exposing the detailed record
`
`formats of its database --
`
`the
`
`traditional
`
`view of
`
`an
`
`integrated
`
`database -- each function externalizes a protocol for manipulating the
`
`data it maintains.
`
`For
`
`example,
`
`the order
`
`entry function of
`
`a
`
`business might support messages to lookup, add, and alter orders.
`
`The
`
`database and accounting
`
`rules for managing orders would
`
`be
`
`hidden
`
`within the procedures supporting these messages.
`
`Such
`
`an
`
`interface
`
`allows a department to change its internal implementation and
`
`to add
`
`new function without
`
`impacting other departments, so long as
`
`the old
`
`message interface is still supported.
`
`In addition,
`
`the department can
`
`enforce integrity and authority constraints
`
`on the data
`
`by assuring
`
`that only it's programs alter the data. This gives modular change and
`
`modular growth to the whole
`
`system.
`
`Management of
`
`these protocol
`
`definitions becomes the responsibility of the network DBA.
`
`16
`
`16
`
`
`
`2.
`
`Technical aspects 9E decentralized systems
`
`So far,
`
`the
`
`rational for decentralized systems has
`
`been
`
`sketched.
`
`This section proposes a technical approach to decentralized systems.
`
`The main technical problem unique to decentralized systems is the lack
`
`of global (centralized) knowledge. It
`
`is difficult to know everything
`
`about the rest of
`
`the network.
`
`Yet global
`
`knowledge seems to be
`
`required to answer questions such as:
`
`“Where is file A?", or "What is
`
`the best way
`
`to reach node N?". Most other technical problems of
`
`decentralized systems are also present
`
`in centralized systems.
`
`It
`
`is
`
`simply a matter of degree. Messages in a distributed system travel
`
`more slowly,
`
`less reliably and
`
`at greater cost.
`
`In
`
`a centralized
`
`system, a data item is rarely more than a disc seek away (~30 ms),
`
`in
`
`a distributed system it may be several satellite hops away (~l sec).
`
`Moreover,
`
`long—haul communication systems sometimes lose messages
`
`and
`
`always charge a lot to deliver one. The lack of global
`
`knowledge adds
`
`an entirely new dimension
`
`to the problems of data placement
`
`and
`
`program execution -- where to keep the data and where to
`
`run
`
`the
`
`programs.
`
`The following model deals with decentralization of knowledge by making
`
`each component a module which may be connected to any other module.
`
`17
`
`17
`
`
`
`2.1
`
`
`Computational model: objects—processes—messages
`
`Much in the style of modern programming languages such
`
`as Smalltalk,
`
`Modula, and Ada,
`
`the system supports the concept of object "type" and
`
`object "instance". Primitive object
`
`types
`
`are Process, Terminal,
`
`File. and Node. Application developers implement new types by writing
`
`programs in a
`
`conventional
`
`language
`
`such as Cobol or PLI.
`
`The
`
`programs are executed as processes —* programs running on computers.
`
`Externally, each process
`
`appears
`
`to be
`
`an object
`
`instance which
`
`supports the verbs of that object
`
`type.
`
`Internally,
`
`processes are
`
`structured as a collection of subvroutines which implement that.type.
`
`A process may address other objects
`
`(processes,
`
`files,
`
`terminals,
`
`etc.)
`
`by OPENing them. The open step takes a
`
`symbolic name
`
`for the
`
`object,
`
`locates the
`
`named object.
`
`checks
`
`the authorization of
`
`the
`
`process to the object and returns a reference for
`
`a "session“
`
`to the
`
`object -- in operating systems terms,
`
`it returns a capability for the
`
`object. Thereafter the process may operate on the object
`
`by
`
`sending
`
`messages via the
`
`session (WRITE),
`
`and
`
`reading the
`
`replies
`
`(READ).
`
`Ultimately the session is terminated (CLOSE).
`
`This discussion is
`
`symmetric,
`
`a process may be OPENed.
`
`In that case
`
`it will receive an OPEN message. which it may accept or reject, and if
`
`accepted it will
`
`receive
`
`a
`
`sequence
`
`of requests_which it READs and
`
`generates a corresponding sequence of replies
`
`(WRITES). Ultimately,
`
`it may receive a CLOSE message or unilaterally close the session.
`
`18
`
`18
`
`
`
`Object creation is accomplished by opening
`
`a
`
`session to a process
`
`managing the object type. This session is then used to communicate the
`
`name and attributes of the new object.
`
`To give a simple example. One
`
`opens a session to a file to access the
`
`file. The session is opened with a process (file server)
`
`representing
`
`that file.
`
`Sending messages via
`
`the
`
`session instructs
`
`the
`
`server
`
`process to position within the file, and
`
`insert,
`
`update or
`
`extract
`
`records in the file.
`
`Because every object has a name, a process can
`
`address any object
`
`in.
`
`the network -- subject to authorization restrictions. This means that
`
`the process is unaware of the physical location of the object.
`
`It can
`
`write on any terminal, read or write any file,
`
`and address
`
`any other
`
`process. This is one key to structuring a decentralized system.
`
`In order to fit into the CALL structure of
`
`conventional
`
`programming
`
`languages,
`
`the concept
`
`of
`
`“remote procedure
`
`call“
`
`is
`
`introduced.
`
`Rather than the
`
`program having
`
`to OPEN, WRITE, READ and CLOSE the
`
`object, it may
`
`CALL
`
`an operation on the object.
`
`A remote procedure
`
`call has the format:
`
`<operation>(<ohj ect>,<p1>,<p2> , . . . , <pn>)RETURNS(<r1>, . . . ,<rm>)
`
`where the <pi) are by-value parameters and <rib are returned results.
`
`This request
`
`is translated as follows:
`
`(1) If the process managing the object
`
`is not OPEN, an open message is
`
`sent to it.
`
`(2) The following message is sent to the object manager:
`
`19
`
`19
`
`
`
`(<operation>,<object>,<pl),<p2>,...,<pn>)
`(3) The object manager
`performs
`the operation and
`
`replies with
`
`message:
`
`(<r1>,<r2>,...,<rm>).
`
`(4) The reply message is unpacked into <r1>,
`
`...,<rm>.
`
`Typically,
`
`the underlying software “saves“
`
`the
`
`open so
`
`that
`
`later
`
`operations on the same object will have low overhead.
`
`This structuring allows functions to be executed within a process,
`
`or
`
`in a different process perhaps at a remote node.
`
`It gives
`
`a uniform
`
`way for a
`
`node
`
`to export
`
`it's primitive types, and abstractions of
`
`these types. As will be argued later, nodes generally export
`
`high-
`
`level abstractions rather than primitive types.
`
`The message formats
`
`(2)
`
`and (3) above, describe the interface to the
`
`type. Defining a decentralized system consists
`
`of defining these
`
`message formats
`
`and
`
`their
`
`semantics.
`
`Thereafter,
`
`independent
`
`organizations can implement these protocols
`
`to provide services or to
`
`make requests of others.
`
`This model
`
`is
`
`the basis for the Tandem system [Tandem]. Argus System
`
`[Liskov] and the R*
`
`System [Lindsay].
`
`It also forms the basis for
`
`IBM's SNA Logical Unit Type 6, which defines the types DL/l Database,
`
`Queue, Process, etc.,
`
`and will
`
`grow to
`
`include many more
`
`types
`
`[Hoffman].
`
`The airlines
`
`reservation system to manage
`
`seats,
`
`the
`
`electronic funds transfer systems also use the message model.
`
`20
`
`10
`
`20
`
`
`
`2.2 Dictionary: naming. authorization and control
`
`The system dictionary plays a
`
`key
`
`role in object-oriented systems.
`
`The dictionary maps symbolic names to object descriptors, and performs
`
`authorization checks.
`
`In addition,
`
`the dictiOnary provides
`
`some
`
`reporting and control functions.
`
`In order to structure the name space, symbolic names are organized as
`
`a hierarchy. The
`
`name 'A.B.C"
`
`names an object but may also name a
`
`subtree of objects
`
`each with ”A.B.C"
`
`as
`
`a prefix of their names.
`
`Interior nodes of the hierarchy act as directory objects. Any object
`
`can act as a directory object. Allowing
`
`names to grow on
`
`the right
`
`allows objects to have sub-objects. Allowing names
`
`to grow on
`
`the
`
`left allows addressing of other
`
`name
`
`spaces
`
`for
`
`inter-networking.
`
`Prefixs and suffixs on telephone numbers exemplify this principal.
`
`The dictionary is partitioned among the nodes
`
`of
`
`the
`
`network,
`
`each
`
`node has a part of the dictionary.
`
`The dictionary stores the generic
`
`attributes of each object,
`
`its
`
`name.
`
`type, owner,
`
`and authorization,
`
`in the name
`
`space.
`
`The manager
`
`for
`
`that object type stores type-
`
`specific attributes in catalogs which parallel
`
`the name
`
`space.
`
`In
`
`this way,
`
`a
`
`new
`
`object
`
`type
`
`can
`
`be
`
`added
`
`by
`
`creating
`
`programs/processes which maintain the catalogs of that type and a
`
`new
`
`type instance can be added by requesting one of these processes to add
`
`records to the catalogs for that type as well as making
`
`an
`
`entry in
`
`the name Space.
`
`21
`
`11
`
`21
`
`
`
`When an object name is presented to OPEN, it is looked up in the name
`space.
`The prefix of
`the
`name designates the partition of the name
`
`space, for example Lookup 'A.B.C.D" might
`
`look for "B.C.D" in the name
`
`space at node
`
`“A".
`
`The alias object type allows renaming of objects
`
`or directories. Aliases are generally followed until
`
`a
`
`non—alias
`
`object or error is encountered.
`
`Lookup returns a descriptor
`
`for
`
`an
`
`object.
`
`The OPEN procedure uses the information in this descriptor to
`
`route the Open message to the appropriate object.
`
`Node autonomy implies
`
`that
`
`each
`
`node must
`
`be
`
`able
`
`to operate in
`
`isolation. Hence, each object or object
`
`fragment at
`
`a
`
`node
`
`is
`
`described by the
`
`name
`
`space
`
`at
`
`that
`
`node.
`
`In addition,
`
`the node
`
`dictionary may have descriptions
`
`of objects
`
`at other nodes.
`
`For
`
`example,
`
`if a
`
`file is partitioned among several nodes of the network,
`
`each node will certainly store the description of the local partition;
`
`but, each node may also store a
`
`replica of
`
`the description of the
`
`whole file.
`
`Authorization is one of the most difficult areas of decentralized
`
`systems. Of course, all network traffic is encrypted and
`
`low level
`
`protocols do node authentication.
`
`Each process
`
`executes under
`
`some
`
`authority (ID). Each object has
`
`an
`
`access control
`
`list associated
`
`with its descriptor in the dictionary.
`
`The entries of
`
`each
`
`access
`
`list are <who,what> pairs.
`
`The “who"
`
`is either an ID or the ID of a
`
`GROUP of
`
`IDs.
`
`The “what"
`
`is a
`
`list of
`
`authorities allowed
`
`to that
`
`object.
`
`22
`
`12
`
`22
`
`
`
`When a process tries to OPEN the object “Z",
`
`the name "Z“ is looked up
`
`in the dictionary to find its descriptor and access control list.
`
`The
`
`name server first checks the access list to see
`
`if the
`
`requestor has
`
`the authority to open
`
`the object.
`
`If not,
`
`the
`
`lookup signals a
`
`security violation.
`
`The issue of authenticating the requestor arises when the OPEN travels
`
`across the network.
`
`At best,
`
`the
`
`server knows that requestor R at
`
`node N made the request ——
`
`i.e. node N will
`
`vouch for R. Either of
`
`the following two approaches deal with this:
`
`either the
`
`access
`
`list
`
`can be structured as "ID at NUDE" elements for the "who" fields or
`
`a
`
`bidirectional password scheme can be required for remote requestors.
`
`As explained so far,
`
`the dictionary implements and exports objects
`
`of
`
`type directory, alias,
`
`type, access
`
`control
`
`list, and group.
`
`The
`
`support of types, allows others
`
`to
`
`implement basic
`
`types
`
`such as
`
`record, file,
`
`terminal,
`
`and application types
`
`such as
`
`invoice
`
`and
`
`customer .
`
`In addition to providing the basic naming,
`
`authorization and type
`
`extension facilities,
`
`the dictionary also provides
`
`a
`
`structure
`
`for
`
`reporting and control.
`
`The dictionary implements an object
`
`of type
`
`"dependency". Each dependency is a binary relation among objects. When
`
`a file definition is based on a record definition,
`
`the relationship
`
`< FILE-A , RECORD—R>
`
`is entered into the DEPENDS—ON relation. When an index is added
`
`to a
`
`file,
`
`the relationship
`
`23
`
`13
`
`23
`
`
`
`< FILE-A ,
`
`INDEX-I
`
`5'
`
`is added. When
`
`a
`
`program is compiled from source Sl,SZ,S3,.. SN,
`
`the
`
`relationships:
`
`< O , 51> and 4 O , $2 >,
`
`...
`
`, < 0 , Sn)
`
`are added.
`
`Centralized systems are able
`
`to recompute
`
`relationships
`
`on
`
`demand
`
`rather then maintain them as they are created. A decentralized system
`
`cannot economically
`
`search
`
`the
`
`entire
`
`system
`
`looking
`
`for
`
`relationships. Hence,
`
`the relationships must be explicitly asserted
`
`and maintained.
`
`Relationships can be used for reporting and for control:
`
`* Reporting: Relationships can tell what objects will be affected by a
`
`change and in turn can be used to automatically propagate the change
`
`in rebuilding an application —— recompile all affected programs. and
`
`reorganize all affected files and displays.
`
`* Control: Relationships
`
`can be used for change control -* to freeze
`
`them and all the objects they depend upon.
`
`24
`
`14
`
`24
`
`
`
`2.3
`
`
`Data Management
`
`
`2.3.1 Data definition
`
`Database systems
`
`implement
`
`the
`
`”record"
`
`type
`
`and operations
`
`on
`
`records.
`
`A record definition specifies
`
`the record fields, each field
`
`storing atoms of a designated type.
`
`In addition,
`
`integrity constraints
`
`may be associated with the
`
`record definition. Record
`
`instances must
`
`satisfy these constraints. Uniqueness, value
`
`range,
`
`and
`
`referential
`
`integrity are typical
`
`integrity constraints
`
`[Date].
`
`In
`
`addition,
`
`record definitions may have physical attributes
`
`such as
`
`location,
`
`recovery attributes, and block size.
`
`The instances of a
`
`record type
`
`are variously called a file, relation, set or table. We use
`
`the term
`
`file here.
`
`A file may be partitioned among nodes
`
`of the
`
`network based
`
`on some
`
`predicate-to-node map.
`
`The
`
`following
`
`is
`
`a
`
`sample
`
`partitioning
`
`criterion:
`
`ACCOUNT.NUMBER < 1000000 <2> WEST
`
`ACCOUNT.NUMBER 3 1000000 <=> EAST
`
`A record instance will
`
`be
`
`stored at
`
`the
`
`node
`
`(partition) which
`
`satisfies the predicate.
`
`Data is partitioned for several reasons:
`
`
`* Load sharing: distributing traffic on the data,
`
`* Locality: placing it closer to the data consumers thus improving
`
`local availability and response time and,
`
`25
`
`15
`
`25
`
`
`
`* Autonomy:
`
`allowing local control of each partition of the data.
`
`Partitioning has long been used to distribute data
`
`across multiple
`
`disk arms at a single node.
`
`Geographic distribution is
`
`a
`
`simple
`
`generalization of
`
`that
`
`concept.
`
`The
`
`implementation
`
`details
`
`are
`
`slightly different since the record definition must be
`
`replicated at
`
`each dictionary of each involved node.
`
`A record may
`
`be
`
`replicated at several nodes. Replication at a node
`
`has long been used for availability. If one copy
`
`is lost,
`
`the others
`
`can be used in its stead.
`
`In
`
`a geographically- distributed system
`
`replication can
`
`have
`
`the
`
`added benefit
`
`of
`
`improved
`
`response
`
`by
`
`eliminating long—haul message delays.
`
`But replication at multiple nodes connected via
`
`communications
`
`lines
`
`presents several novel problems if updates must be broadcast
`
`to all
`
`the replicas. This may cause
`
`substantial delays, and may
`
`inhibit
`
`update if some
`
`replicas
`
`are
`
`inaccessible.
`
`The presumption
`
`in
`
`a
`
`centralized system is that I'if a replica is unavailable
`
`to me,
`
`it is
`
`unavailable to everyone".
`
`This
`
`assumption
`
`is
`
`not valid
`
`in a
`
`decentralized system. Unavailability of a
`
`replica can be
`
`caused by
`
`replica failure or by communication failure.
`
`In the
`
`latter
`
`case,
`
`a
`
`replica may be available to one part of the network but not
`
`to others.
`
`The concepts of current and stale information helps to understand the
`
`algorithms designed to deal with the maintenance of
`
`replicas. Quite
`
`general algorithms to maintain data currency are known, but they
`
`have
`
`26
`
`16
`
`26
`
`
`
`substantial message delay cost.
`
`Other
`
`algorithms
`
`are
`
`known which
`
`tradeoff data
`
`currency in exchange
`
`for
`
`better
`
`performance
`
`or
`
`availability. Three kinds of replication are worth considering:
`
`* Current Global Replicas:
`
`In this scheme a majority of the replicas
`
`are updated as part of each transaction (the definition of majority
`
`differs from proposal
`
`to proposal).
`
`The best of
`
`these algorithms
`
`tolerate some node unavailability but they have the problem that for
`
`a small number nodes (two)
`
`they do not tolerate data unavailability
`
`and for a large number of nodes
`
`(more than
`
`2)
`
`they
`
`introduce long
`
`delays in transaction commit [Gifford],
`
`[Abadi].
`
`
`* éfi Soon gs Possible (ASAP) Updates: This scheme postulates a master
`
`node(per record type or instance) which is allowed
`
`to unilaterally
`
`update the record at his node. The updates are
`
`then asynchronously
`
`sent
`
`to the other replicas. This approach sacrifices consistency for
`
`availability and response time [Norman].
`
`If the system is Operating
`
`correctly, replicas will be only seconds out of date.
`
`* Snapshots: There is
`
`a master
`
`file somewhere
`
`and
`
`slave
`
`copies
`
`elsewhere.
`
`Each slave copy is
`
`a snapshot
`
`of the master as of a
`
`certain time. The
`
`slave copies
`
`are periodically updated.
`
`This is
`
`appropriate for
`
`files which
`
`change very slowly
`
`and
`
`for which
`
`currency is not critical [Adiba].
`
`In this design, replicas may be a
`
`day or a week out of date.
`
`27
`
`17
`
`27
`
`
`
`Each of these approaches to replication has its place. Notice that the
`
`dictionary must
`
`use
`
`a
`
`current
`
`data algorithm for managing
`
`the
`
`replication of object descriptions because its users may not
`
`be
`
`able
`
`to tolerate inconsistencies.
`
`The replication criterion and
`
`replication algorithms
`
`are
`
`specified
`
`when defining or redefining a record. When a program runs, it must be
`
`unaware that
`
`a
`
`record
`
`is partitioned or
`
`replicated because
`
`the
`
`partitioning and replication may change after the program is written.
`
`In order to deal with this,
`
`the program may specify that it is willing
`
`to accept stale data
`
`CURRENCY := FALSE
`
`or that it wants the most current copy
`
`CURRENCY := TRUE.
`
`This approach allows data independence
`
`and also allows performance
`
`improvements which make replicated data workable.
`
`28
`
`18
`
`28
`
`
`
`
`2.3.2 Data manipulation
`
`Record instances are accessed and manipulated by a data manipulation
`
`language (DML) which allows the requestor
`
`to add,
`
`read, alter
`
`and
`
`delete sets of records. The
`
`DML is
`
`really a
`
`compiler for
`
`a limited
`
`for: of
`
`set
`
`theory expressions.
`
`It must
`
`contend with
`
`records
`
`distributed over several nodes of the network.
`
`The DML compiler produces a plan to perform the desired operation. The
`
`compiler is aware of the data distribution criterion and based on that
`
`it picks a global plan which minimizes a cost
`
`function. It
`
`then sub—
`
`contracts the access
`
`to the
`
`local data to the remote nodes. These
`
`nodes develop their own plans for doing the data manipulation at their
`
`nodes.
`
`If the data organization at a node changes,
`
`it recompiles its
`
`plan without affecting the other nodes.
`
`Subcontracting setvoriented operations
`
`rather
`
`than record—at-a-time
`
`operations produces substantial savings in response time
`
`and message
`
`traffic. However,
`
`even greater
`
`savings
`
`can
`
`be
`
`achieved
`
`by
`
`subcontracting the application logic itself. For example, reserving an
`
`airline seat at
`
`a remote node will involve updates to multiple record
`
`types and several integrity checks. Application level protocols
`
`cast
`
`this as a single remote procedure call to a server process which then
`
`does local DML operations.
`
`This
`
`is more efficient
`
`than
`
`invoking
`
`several DML operations across the network.
`
`It also allows the
`
`server
`
`node better control of access to the data.
`
`29
`
`19
`
`29
`
`
`
`
`2.3.3 Data independence
`
`It must be possible to move, partition, replicate,
`
`and physically
`
`reorganize data without
`
`invalidating programs which
`
`access
`
`it.
`
`The
`
`DML plan is automatically reevaluated if the physical structure of the
`
`data changes. This is called data independence.
`
`In addition,
`
`it must be possible to alter the
`
`logical structure
`
`of a
`
`record ~- for
`
`example add
`
`fields to a record type or
`
`replace
`
`one
`
`record type with two others which can be joined together to form the
`
`original. There are limits to what can
`
`be done here, but
`
`the basic .
`
`trick is to define the "old“ interface as a view of the new interface.
`
`A view is
`
`a materialization rule which allows instances of one record
`
`type to be materialized from others.
`
`Unless the view is very simple,
`
`a
`
`one—to-one map
`
`from the base
`
`records, it may not be possible to reflect updates on the view down to
`
`the base files.
`
`For this reason, views are primarily used to subset a
`
`record type to provide access control to data.
`
`They allow one to hide
`
`data from programs unless they have a “need to know".
`
`To repeat
`
`the
`
`theme,
`
`having
`
`a procedural
`
`interface to the data (a
`
`server) allows the provider of the service much more
`
`flexability in
`
`redefining the data without
`
`impacting others.
`
`The view constructs of
`
`data management systems are very restrictive
`
`in
`
`the
`
`kinds
`
`of data
`
`independence they support.
`
`30
`
`20
`
`30
`
`
`
`2.3.4 Database design
`
`Logical database design specifies which
`
`records will be available and
`
`the integrity constraints applied to the records.
`
`Physical database
`
`design specifies
`
`how the
`
`records
`
`are
`
`stored:
`
`file organization,
`
`indices, partitioning criteria, replication criteria.
`
`Distribution has little affect on
`
`logical database design unless the
`
`desired design has poor performance.
`
`In that
`
`case it may have
`
`to be
`
`changed as a concession to performance.
`
`Physical database design is intimately related to data distribution.
`
`At present database design is a largely manual process but much of the
`
`design work could be mechanized.
`
`The first step is to instrument the
`
`system so
`
`that
`
`one
`
`can get
`
`meaningful measurements of
`
`record statistics_ and activity.
`
`Given
`
`these statistics and the response time requirements of the system,
`
`the
`
`database design problem can be mechanized as follows: Since
`
`the DML
`
`can predict the cost
`
`of
`
`a particular plan for a particular database
`
`design, one can evaluate the cost of a particular database design on a
`
`particular application workload —— it is the weighted sum of the costs
`
`of the individual queries.
`
`By
`
`evaluating all possible designs, one
`
`can find acceptable or
`
`least-cost designs.
`
`This
`
`search can
`
`be
`
`mechanized.
`
`31
`
`21
`
`31
`
`
`
`2.4
`
`Networking and terminal support
`
`2.4.1 Network protocols
`
`The discussion so far presumes
`
`a very powerful networking capability.
`
`It assumes that any process in any node can open a
`
`session with any
`
`process in any other
`
`node.
`
`Files
`
`and
`
`terminals are viewed as the
`
`processes which control
`
`them.
`
`Implicitly,
`
`it
`
`assumes
`
`that
`
`these
`
`sessions may be established with relatively little overhead.
`
`This view is implicit
`
`in IBM's System Network Architecture [Hoffman],
`
`Tandem's Guardian-Expand architecture [Tandem] and the
`
`ISO reference
`
`model for Open Systems Interconnection.
`
`These protocols go
`
`on
`
`to
`
`architect higher level objects such as processes, queues
`
`and files
`
`based on the session layer.
`
`In
`
`addition they specify many
`
`layers
`
`below the session layer.
`
`The point of this discussion is that it is reasonable to assume a very
`
`powerful networking capability in which any process in any
`
`node
`
`can
`
`open a session with any process in any other node.
`
`The only debatable
`
`point is the cost of such sessions. with present systems,
`
`the cost
`
`is
`
`high, but it
`
`seems
`
`to be acceptable given the benefits which accrue.
`
`Several commercial systems operate in this way today [CICS].
`
`[Tandem].
`
`32
`
`22
`
`32
`
`
`
`2.4.2 Terminal support -— terminal
`
`independence
`
`The real revolution in data management has
`
`been in the area
`
`of data
`
`display, not data storage.
`
`Information storage has not changed much in the last twenty years. We
`
`had random access memories, discs and tapes then and we have them now.
`
`The cost per byte has declined dramatically, but the lagical interface
`
`to storage has not changed much.
`
`In the same
`
`time, we
`
`have
`
`gone
`
`from card—readers to teletypes to
`
`alpha-numeric displays. We are about to go to bitmap displays with the
`
`heavy use of icons and graphics.
`
`When discs get bigger or
`
`faster,
`
`the
`
`files
`
`grow and the programs
`
`continue to work. When terminals
`
`change from card—readers
`
`to bitmap
`
`displays the programs
`
`need
`
`to be overhauled.
`
`This overhaul
`
`is
`
`expensive. Thus one
`
`frequently sees
`
`a
`
`desk