`International Conference on the
`Management of Data
`Portland, Oregon
`
`~
`I
`
`SIGMOD RECORD
`Volume 18, Number 2, June 1989
`
`Editors: James Clifford, Bruce Lindsay
`and David Maier
`
`
`
`) '
`QA 7,. ~
`.03
`A3><
`1~!7
`
`The Association for Computing Machinery
`11 West 42nd Street
`New York, New York 10036
`
`c 1989 by the Association for Computing Machinery,
`Inc. Copying without fee is permitted provided that the
`copies are not made or distributed for direct commercial
`advantage, and credit to the source is given. Abstract(cid:173)
`ing with credit is permitted. For other copying of
`articles that carry a code at the bottom of the first
`page, copying is permitted provided that the per - copy
`fee indicated in the code is paid through the Copyright
`Clearance Center , 27 Congress Street , Salem, MA 01970.
`For p~.~ ssion to republish write
`to: Director of
`Pul:tli cations, Association for Computing Machinery. To
`c,o
`otherwise, or republish, requires a fee and/or
`.~,.,._.~;. .·.s 'P.!'l «?,ific permission .
`··~'tl
`'-r ·•
`b·:
`r
`
`.. ; ··. "
`. .
`
`'to\
`
`ISBN 0-89791-317-5
`
`Additional copies may be ordered prepaid from:
`
`ACM Order Department
`P.O. Box 64145
`Baltimore, MD 21264
`
`Price:
`Members .... . .. .
`All others
`
`$25.00
`$33.00
`
`ACM Order Number:
`
`472890
`
`i i
`
`
`
`Extending a Relational Database with
`Deferred Referential Integrity Checking and Intelligent Joins1
`
`Stephanie Cammarata (steph@rand.org)
`Prasadram Ramachandra (ram@rand.org)
`Darrell Shane (shane@rand.org)
`
`The RAND Corporation
`1700 Main Street
`Santa Monica, CA 90406-2138
`(213) 393-0411
`
`ABSTRACT
`
`1. Introduction
`
`Interactive use of relational database management
`systems (DBMS) requires a user to be knowledgeable about the
`semantics of the application represented in the database. In
`many cases, however, users are not trained in the application
`field and are not DBMS experts. Two categories of
`functionality are problematic for such users: (1) updating a
`database without violating integrity constraints imposed by the
`domain and (2) using join operations to retrieve data from more
`than one relation. We have been conducting research to help an
`uninformed or casual user interact with a relational DBMS.
`
`This paper describes two capabilities to aid an
`interactive database user who
`is neither an application
`specialist nor a DBMS expert. We have developed deferred
`Referential Integrity Checking (RIC) and Intelligent Join (IJ)
`which extend the operations of a relational DBMS. These
`facilities are made possible by explicit representation of
`database semantics combined with a relational schema.
`Deferred RIC is a static validation procedure that checks
`uniqueness of tuples, non-null keys, uniqueness of keys, and
`inclusion dependencies. IJ allows a user to identify only the
`"target" data which is to be retrieved without the need to
`additionally specify "join clauses". In this paper we present the
`motivation for these facilities, describe the features of each, and
`present examples of their use.
`
`' This :macarch was sponsored by the Defense Advanced Research Projects
`
`Agency under the auspices of RAND's National Defense Research Instiwte, a
`
`Federally Funded Research and Development Center sponsored by the Office of
`
`the Secretary of Defense. Views and concluSions contained in this document are
`those of the authors and should not be inte2preted as rcpm<enting the official
`opinion of DARPA, the U.S. Government, or any person or agency connected
`with them.
`
`Permission to copy without fee all or part of this material is granted provided that
`the copies are not made or distributed for direct commercial advantage, the ACM
`copyright notice and the title of the publication and its date appear, and notice is
`given that copying is by permission of the Association for Computing Machinery.
`To copy otherwise, or to republish, requires a fee and/ or specific permission.
`© 1989 ACM 0-89791-317-5/ 89/0005/0088$1.50
`
`88
`
`the advent of workstation environments,
`With
`interactive software, and public domain databases, the use of
`DBMS is no longer limited to database administrators (DBAs),
`operations managers, and application programmers. Personnel
`in many different facets of a workplace are experimenting with
`DBMS for organizing, maintaining, and sharing information
`[McCa82). In many cases, little or no database design is
`undertaken before a database is generated. Concerns for update
`anomalies and consistency maintenance, studied in theoretical
`discussions of relational database management, are rarely
`addressed in the practical data management activities of many
`organizations. Often, a novice user · simply "relationalizes" a
`flat file into an intuitive set of tables. The resulting first normal
`form database
`implicitly relates
`tables
`through common
`attributes among the relations.
`
`A complete representation of a user's application
`database should attempt to encode 1) the schema for every
`relation, 2) the data stored in the relations, and 3) the semantic
`relationships among relations. The first two categories are
`captured in every relational system. However, the semantics of
`the application is seldom expressed explicitly and is usually left
`to an individual user to interpret. Unfortunately, few DBMS
`tools and languages have facilities to store semantics and aid
`users
`in
`interpreting
`these
`semantics
`[Tou82,Blum87,Neuh88, Jone87].
`Although
`common
`attributes between tables are based on underlying semantic
`relationships between relations, the relational model and its
`various implementations place no restrictions on the naming of
`attributes. Experienced users may establish
`their own
`conventions for relation and attribute names but no relational
`DBMS represents or enforces such conventions. Therefore,
`without an explicit conceptual model, it is difficult for users to
`access and validate the information they need [Curt81).
`
`Developing an information model during database
`design is one means of expressing this information [Nava86).
`However, commitment to such a formal effort is infrequent and
`the resulting model is usually a paper documentation aid,
`unavailable to interactive users. Another DBMS support tool,
`the data dictionary,
`interfaces a database
`to external
`applications by defining
`interface ent!I.Jes,
`application
`transactions, and generating reports, but is not suitable for a
`casual
`interactive user
`(Alle82, Dolk87). Our approach
`recommends generating a knowledge base or information
`dictionary to capture previously implicit semantics of an
`existing relational schema and database. Research efforts
`toward integrating DBMS with expert systems have also
`
`
`
`adopted similar techniques [Reha85, Al-Z87, Schu88]. The
`universal
`relation model
`indirectly
`represents
`relational
`to achieve complete access path
`metadata by aiming
`independence [Maie82].
`
`two capabilities, deferred
`We have developed
`Referential Integrity Checking (RIC) and Intelligent Join (IJ),
`which extend
`the operations of a relational database
`management system by utilizing explicit semantics and
`supplemental metadata combined with a relational schema.
`Deferred RIC is a database validation process that checks
`uniqueness of tuples, non-null keys, uniqueness of keys, and
`inclusion dependencies. This procedure can be invoked by the
`user or the system at specified intervals. U allows a user to
`identify only the "target" data which is to be retrieved without
`the need to additionally specify "join clauses". IJ subsequently
`navigates through the relations to generate the necessary join
`operations. These capabilities are supported by a metadata
`network constructed from both the relational schema and
`extended mctadata stored in an information dictionary.
`
`In the next section we present our previous work
`focussing on relational metadata and introduce an example
`database that we refer to throughout the remainder of the paper.
`Section 3 discusses a network representation for relational
`metadata which facilitates both Referential Integrity Checking
`and Intelligent Join algorithms. Discussion and examples of
`RIC are presented in section 4, and section 5 details IJ. We
`conclude with a discussion of the limitations, and suggestions
`for future work.
`
`1. liD: An extended lnformatlob dictionary
`
`In previous work we have developed an Intelligent
`Information Dictionary (liD) to address the issues described
`above. no serves as an interface between an interactive user
`and the query language of a relational DBMS [Camm88]. no
`is
`implemented in Franzlisp Flavors running on a Sun
`Microsystems workstation. The dictionary communicates
`directly with the Ingres relational DBMS (also resident on a
`Sun machine) through the Lingres system, a Lisp to lngres
`interface we previously implemented. Lingres provides the full
`functionality of the Quel query language, accessed from Lisp
`and Flavors.
`
`1.1IID functionality
`no aids a user in understanding the organization of a
`relational database by representing both the constructs of a
`relational database, and domain specific knowledge acquired
`to augment a
`from an application specialist. liD serves
`relational schema with supplemental metadata. By combining
`domain knowledge with knowledge of relational database
`concepts, no supports
`interactive
`tools
`for browsing,
`customized data manipulation, and interactive value checking.
`Figure 1 shows the schema and extensional data of our test
`database, atlas. This geography database consists of seven
`relations, each with a primary key (which is underlined,
`"===", in figure 1). In many relational database applications,
`users are supplied with only the information shown in figure 1.
`The atlas database includes many of the typical anomalies
`found in first normal form databases.
`
`In figure 2 we present portions of liD metadata,
`supplied by a domain expert, for the relations animal and
`country including a metadata description for the atlas
`database. In addition to metadata entries for relations, liD also
`represents column metadata such as value constraints, units
`
`information and conversion, and default values. no is intended
`to serve the users' need for extended database capabilities, and
`not as a data model. However, because liD represents and uses
`the semantics of the database, considerable overlap exists
`between the capabilities of liD and ER (Entity-Relationship) or
`semantic modeling [Hull87]. In the future, we plan to develop
`a more complete modeling environment based on the existing
`no framework.
`
`One important component of an no knowledge base
`information about "interlinks". Interlinks represent the
`is
`semantic information prescribing exactly how two or more
`relations are implicitly related. This information is generally
`referred
`to as
`integrity constraints, expressing structural
`conditions of a relational database. Knowledge of semantic
`is essential for
`interlink
`information and key attributes
`interactive users. However, without a facility like liD, there is
`no repository for this information. In figure 3, we show
`interlink metadata describing the relationships between the
`relations weather,
`and
`country,
`vegetation.
`Interlink identifiers correspond to the attribute "interlink-list" in
`figure 2. Specification of common columns or "join fields" are
`indicated
`in
`the "from-column-list" and "to-column-list".
`Information contained in interlinks combined with the attribute
`"key-list" found in the relation metadata make explicit the
`information needed by users for checking the referential
`consistency of a relational database and for manipulating the
`underlying data.
`
`Until now, interlink information in liD was strictly
`passive. Facilities were available for a user to browse through
`an TID knowledge base to learn about the database. However,
`information about
`interlinks and keys did not actively
`of
`the DBMS
`contribute
`during
`the manipulation
`[Koss87, Gray88]. Our objective for the work described in this
`paper, was to build an operational extension to the capabilities
`of a relational DBMS which would automatically use this
`information to aid the user in retrieving data and checking for
`structural integrity.
`
`2.2. Extended capabilities supported by liD
`
`One important reason for representing relational
`metadata is to enable referential integrity checking (RIC).
`Validating referential integrity involves two categories of
`constraints: key constraints and referential constraints. A key
`constraint is implied by the existence of candidate keys and
`requires unique and non-null key values. Referential
`constraints are entailed by the relationship between a key in
`one relation and a foreign key in another. At any given time,
`the value of a foreign key in the first relation must be either
`null, or must be a key value in some tuple of the other relation.
`
`Much research has addressed referential integrity,
`however, with the exception of Sybase, we know of no other
`(excluding PC-based DBMS) which
`commercial DBMS
`enforces referential integrity in real time during database
`manipulation [Casa88]. Our philosophy, however, is not to
`provide "immediate" referential checking. Instead, we are
`promoting "deferred" referential checking. In many cases, the
`benefits of immediate checking do not justify the excessive
`overhead [Lafu82, Hato88]. Deferred RIC can be initiated by
`the user or by the system at specified points in time. For
`example, many application databases arc not designed using
`theoretical principles of relational normalization. Therefore, the
`consistency of such databses is questionable. Furthermore,
`users are not prevented from changing the value of a key
`attribute or violating inclusion dependencies. Deferred RIC
`
`89
`
`
`
`•,
`
`fauna table
`
`1
`I distfeature
`Jlono
`tlat
`I animal
`I country
`1---------- ----- - - ------ - -J ••••••••.......••••••••••••.. • .••• /
`I india
`1 tioer
`1
`601
`751 burnino eyes
`1
`I
`I
`tussr
`Jpenouin
`701
`100Jmajestic
`I usa
`I doo
`I
`40 I
`100 I up-to-tricks
`I
`I canada
`I penouin
`I
`60 I
`100 I sloppy
`I
`I china
`I doo
`I
`40 I
`125/listless
`1
`I australia
`130 I jumps
`1 kanoaroo
`1
`301
`1
`!elephant
`I
`601
`75/trunked
`I
`I
`!india
`I
`I
`601
`751
`I
`I
`I
`I
`60 I
`751
`I
`I india
`I doo
`I
`60 I
`751 hooded
`I
`I japan
`I tioer
`I
`7 51
`10 I burnino-eyes
`I
`I
`Jindia
`Jtioer
`601
`75Jbody-stripes
`1
`1--- ---------------------------------------------------------J
`
`country table
`
`I country
`
`Jlatnorth
`
`tlatsouth
`
`Jlonoeast
`
`Jlonowest
`
`1
`
`I ------------1 ... .................................................... I
`
`90 I
`60 I
`8 I
`78 I
`I
`!india
`I
`1901
`201
`401
`751
`tussr
`65/
`125/
`30/
`50/
`1
`tusa
`I
`601
`125/
`501
`751
`!canada
`I
`1351
`75/
`251
`55/
`!china
`I
`1551
`115/
`401
`101
`!australia
`I
`I
`150/
`1001
`30/
`50/
`0 I
`0 I
`0 I
`0 I
`I
`/india
`J--------------------------------------------------------------------1
`
`weather table
`
`Jlatnorth
`
`tlatsouth
`
`tlonoeast
`
`Jlonowest
`
`1 zone
`
`1 avorain
`
`1
`
`I -------------------------------------------------------1 ............................. I
`
`I
`I
`781
`81
`60 I
`90 I tropic
`60 I
`I
`I
`751
`401
`20\
`190Jtundra
`201
`I
`I
`50 I
`30 I
`50 I
`1251
`651 temperate
`I
`1
`751
`501
`1251
`60Jtundra
`401
`I
`1
`551
`251
`751
`135Jtropic
`601
`I
`I
`101
`401
`1151
`155jtemperate
`201
`I
`I
`40 I
`451
`351
`201
`0 !mediterranean
`I
`I
`751
`OJ
`501
`301
`501
`1-------------------------------------------------------------------------------------l
`natwildlife table
`economy table
`
`jcountry
`
`Jnatanimal
`
`jnatbird
`
`I
`
`I ------------1 ......................... I
`
`I
`I india
`1 peacock
`1 tioer
`I ussr
`I
`I
`I penouin
`I
`I eaole
`I
`I usa
`I
`Jkiwi
`Jkanoaroo
`!australia
`I
`I
`I
`I china
`I
`I
`I
`I canada
`J--------------------------------------1
`
`animal table
`
`I
`1 percapita
`Jonp
`1 country
`I ==--=------- 1 •...••...•...••.••...•..••. I
`210 I
`300 I
`1 india
`1
`I
`tussr
`24001
`83701
`jusa
`1
`42001
`172201
`!canada
`1
`4501
`174301
`260 I
`1 china
`1
`2751
`12000 I
`1 australia
`1
`2311
`I
`jusa
`42001
`172201
`1---------------- ----------------------- -J
`
`1 animal
`
`J distfeat ure
`
`J country found
`
`Jmaxspeed
`
`1 anmtype
`
`I avoheioht
`
`I avoweioht
`
`I avo life
`
`I
`
`I ------------1 ...................................................... ............................................. · · · · I
`
`I
`/tiger
`Jburnino-eyes
`!india
`1
`40/carnivorous
`361
`1501
`201
`I
`!elephant
`Jtrunk
`tchina
`1
`10Jherbivorous
`1001
`5001
`301
`/kangaroo
`Jjumps
`/australia
`1
`25/pouched
`841
`J
`1801
`151
`I
`100 I
`30 I
`1 pe nouin
`1 sloppy
`1 ussr
`1
`21 polar
`361
`I doo
`1 listless
`I usa
`I
`20 I omnivorous
`361
`I
`80 I
`121
`1--------------------------------------------------------------------------------------------------------------------l
`vegetation table
`
`I zone
`
`I avorain
`
`I maxtemp
`
`I mintemp
`
`I tree type
`
`I main crop
`
`I
`
`I ---------------1 ... ................................................................ · · · · .. I
`
`I
`I corn
`-30 I deciduous
`1001
`30 I
`I
`I temperate
`I tropic
`I
`I rice
`0 I everoreen
`llO I
`60 I
`I
`I
`I none
`I tundra
`-601 coniferous
`801
`201
`1
`I
`I bamboo
`40 I evergreen
`110 1
`751
`I
`I equatorial
`J---------- ----------------- ------------------------------ ---------- - ----- ----------------1
`Figure 1: Atlas database with anomalies
`
`90
`
`
`
`relation-list:
`interlink-list:
`
`Database-supplement: "ATLAS"
`name:
`11 ATLAS"
`11 atlas database contains an example of information about
`description:
`different countries. This is used to illustrate the procedures
`developed for Referential Inteority checkino and Intellioent
`Join operation."
`(WEATHER COUNTRY ECONOMY NATWILDLIFE AN I MAL VEGETATION FAUNA)
`(exhibits/is-exhibited-by is-inhabited-by/lives-in
`has/is-associated-with has-inhabitants/lives-at
`supports/is-supported-by is-conducive-to/orows-in
`is-represented-by/represents
`has-characteristics/is-distinou ished-by
`has-same-distinouishino-features-as/
`has-same-distinouishino-features-as)
`(POSITION-VECTOR COUNTRY-NAME)
`ATLAS
`
`oroup-li st:
`ldatabase:
`
`description:
`
`Relation-supplement: "COUNTRY"
`name:
`11 COUNTRY"
`name-explanation: "This relation 'country• contains the position variables
`for every country • "
`"Each record describes a country's approx. boundaries on
`the four sides."
`(COUNTRY LONGWE.ST LONGE.AST LATSOUTH LATNORTH)
`( (COUNTRY) )
`(POSITION-VECTOR COUNTRY-NAME)
`COUNTRY
`ATLAS
`(supports/is-supported-by has-inhabitants/lives-at
`has/is-associated-with is-inhabited-by/lives-in
`exhibits/is-exhibited-by)
`
`column-list:
`key-list:
`oroup-1 i st :
`lrelation:
`mydatabase:
`interlink-list:
`
`Relation-supplement "ANIMAL"
`name:
`name-explanation:
`description:
`column-list:
`
`key-list:
`oroup-1 i st :
`lrelation:
`mydatabase:
`interlink-list:
`
`11 ANIMAL 11
`•animal is a relation containinQ all animals"
`"Each record details an animal's characteristics"
`(AVGLIFE AVGWEIGHT AVGHEIGHT ANMTYPE MAXSPEED COUNTRYFOUND
`DISTFEATURE ANIMAL)
`((ANIMAL))
`nil
`ANIMAL
`ATLAS
`(has-same-distinouishino-features-as/has-same-distinouishinQ
`-features-as has-characteristics/is-distinouished-by
`is-represented-by/represents is-inhabited-by/lives-in)
`
`Figure 2: liD metadata
`
`does not prevent these anomalies, however, it will subsequently
`detect the errors and notify the user of inconsistencies.
`Detailed discussion of RIC procedures and examples applying
`RIC to the atlas database are found in section 4.
`
`Another compelling reason for needing to know
`interlink information focuses on the "select" or "retrieve"
`command. In both SQL and Que!, the user composes the
`desired selection using a qualification clause to effect an
`equijoin operation. In many cases, more than one equijoin over
`the relations is necessary to retrieve the desired information.
`For example, in the atlas database, suppose a user wishes to
`retrieve the treetype found in India. Ideally, the user would
`like to submit a query such as:
`
`retrieve (country.country, vegetation.treetype)
`where country.country = "india"
`
`However, this request requires a join across an intermediate
`relation, namely, weather. Therefore, to retrieve the desired
`data, the following query is necessary:
`
`retrieve (country.country, vegetation.treetype)
`where country.country = "india" and
`country.latnorth weather.latnorth and
`country.latsouth weather.latsouth and
`country.longeast weather.longeast and
`country.longwest weather.longwes t and
`weather.zone = vegetation.zone
`
`In this example, the user must know implicit "interlink"
`and
`information
`relating
`the
`relations
`country
`vegetation and the user additionally needs to compose the
`query to join the relations country, weather, and
`vegetation. (In the remainder of this paper, use of the term
`"join" implies "equijoin".)
`
`In effect, the user must navigate through the relations
`to establish the desired correspondences. It is interesting to note
`that when the relational model was introduced, declarative
`query languages were cited as an important benefit. Indeed, we
`have come a long way from tracing record pointers and
`maintaining currency indicators. Nevertheless, a notion of
`"navigation" still remains. Navigation in a relational context
`implies navigating through a "conceptual" model rather than a
`"physical" model. The Intelligent Join capability we have
`developed uses explicit metadata in liD to perform the
`
`91
`
`
`
`Interlink: •exhibits/is-exhibited-by"
`from-relation:
`COUNTRY
`from-column-list:
`(LATNORTH LATSOUTH
`LONGEAST LONGWEST )
`WEATHER
`(LATNORTH LATSOUTH
`LONGEAST LONGWEST)
`cardinality: :ONE to 1.
`
`to-relation:
`to- column-list:
`
`from-relation:
`from-column-list:
`
`WEATHER
`(LATNORTH LATSOUTH
`LONGEAST LONGWEST)
`COUNTRY
`(LATNORTH LATSOUTH
`LONGEAST LONGWEST)
`cardinality: :ZERO-MANY to 1.
`
`to-relation:
`to-column-list:
`
`Interlink: "is-conduclve-to/orows-in"
`from-relation:
`WEATHER
`from-column-list:
`(ZONE)
`to-relation:
`VEGETATION
`to-column-list:
`(ZONE)
`cardinality: :ONE to 1.
`
`VEGETATION
`from-relation:
`(ZONE)
`from-column-list:
`to-relation:
`WEATHER
`to-column-1 ist: (ZONE)
`cardinality: :ZERO-MANY to 1.
`
`Figure 3: llD interlinks
`
`navigation and compose join clauses automatically for the user.
`In section 5 we provide a more detailed explanation and
`examples of IJ queries.
`
`3. Metadata as a network structure
`
`RIC and IJ capabilities require explicit representation
`?f s~~antic metadata, which is traditionally recognized as
`1mphc1t. user knowledge. Metadata in llD, such as figures 2
`and 3, 1S represented as object-oriented instances of classes
`such as "Relation-supplement", "Column-supplement", and
`''lnt~rlink". However, to better facilitate navigation through a
`relat10nal database, we augment our object-oriented llD
`representation with data structures corresponding
`to
`the
`network-like organization of a relational database. A network
`llD
`representation
`is generated directly
`from selective
`metadata. The network is constant throughout the life of the
`database (unless schema modifications are allowed); therefore,
`the overhead incurred by building the network is a one-time
`cost.
`
`Given a relational schema, exactly one network can
`be produced which represents the database schema. The
`the network helps contribute
`to
`the
`uniqueness of
`computational efficiency when traversing and processing the
`network by eliminating the problem of recognizing and
`manipulating isomorphic networks. Although each relational
`schema maps to a single network, it is possible for more than
`one schema to map to a common network. This situation occurs
`when the databases have the same "meta-schema", that is, the
`same number of interlinks and the same kinds of constraints
`over those interlinks. In the remainder of this paper, references
`to a "relational schema" include supplemental metadata, such
`as keys and interlinks, and references to a "network" denote an
`llD metadata network.
`
`3.1. Building the metadata network
`
`Nodes in the network represent relations, and links
`capture interlink information between the nodes. If n interlinks
`are defined between two relations, then the corresponding
`nodes are connected by n links. For example, in figure 2 there
`are nine interlinks declared in the interlink-list of the
`atlas database-supplement; therefore, nine links are
`represented in the corresponding network. Links not only
`represent an abstract relationship but also encode attributes
`which the two relations have in common. The network supports
`three different kinds of links depending on the characteristics of
`the common attributes in the relations denoted by the connected
`nodes. These characteristics distinguish between common
`attributes which are a key in one, both, or neither of the
`relations. We identify the corresponding link types as singly
`directed, doubly directed, or undirected links. Below we detail
`the specifications of each link type.
`
`In a database with k relations, R l' R 2, ... , Rk, the
`corresponding metadata network has nodes n 1, n 2, ... , nk where
`the ith relation, R;o is represented by n1• An undirected link
`between two nodes, n1 and n i' indicates that there exists a set of
`attributes common to both relations, R1 and R ., which is a key
`in neither relation. A doubly directed link ~tween n. and n .
`signifies that there exists a common set of attributes whlch is ~
`key in both R1 and Rr A singly directed link emanating from n1
`indicates that there exists a set of common attributes between
`R1 and Ri that is a key in Ri. A singly directed link also
`implies that the common attributes are a foreign key in R ..
`While there can be at most one undirected link between ty)o
`nodes, there can exist zero or more singly and doubly directed
`links. Multiple singly directed links denote multiple foreign
`key/key relationships between two relations. More than one
`doubly directed link between two relations occurs only when
`there are multiple candidate keys in both relations.
`
`The network corresponding to the atlas database
`is shown in figure 4, indicating nodes, links, and common
`attribute sets prescribed by the above rules. We have also
`underlined key attributes participating in each link. In this
`example, only one doubly directed link is found between any
`pair of nodes because we have defined only a single primary
`key for each relation in the atlas database. For some links
`the common attributes have the same name, for instance,
`country identifies
`the common attribute
`in both
`the
`country and natwildlife relations. However, attribute
`names are assigned for convenience and there is no restriction
`on naming conventions. For example, the common attributes
`between the relations animal and natwildlife are
`named animal and natanimal respectively. Therefore,
`the network must explicitly name the common attributes for
`both relations, and if there is more than one link between two
`nodes, proper correspondences must be maintained. It is also
`important to note that the contents of the database are not
`reflected in the network. Instead, the network captures the
`structure of the database among abstract entities such as
`relations and attributes. When it is necessary to refer to the
`contents of the database, for example, during RIC checking, we
`access the relations directly through Ingres.
`
`So far, we have discussed only the foundations for
`the RIC and IJ capabilities. The network representation and
`encapsulation of a relational schema has proven to be an
`expressive and powerful formalism for explicit representation
`of relational metadata. In the subsequent sections, we present
`our methodology for irttegrity checking and intelligent joins
`within the framework of a metadata network.
`
`92
`
`
`
`Figure 4: Metadata network with common attribute sets
`
`4. Deferred referential Integrity checking
`
`This section describes work addressing one category
`of integrity constraints, namely, referential integrity. These
`constraints help describe the structural integrity of a relational
`schema and database. Unlike "value" constraints which encode
`information about allowable values and are enforced with
`predicates, referential integrity constraints focus on allowable
`mappings between relations. Although the particular values of
`an attribute arc irrelevant, the use of those values for mapping
`across relations reflects the structural requirements of the
`database schema.
`
`Two different approaches can be applied to enforce
`referential integrity. One method prevents integrity violations
`through the use of triggers and demons, and is intimately
`integrated into the data manipulation routines of the database
`[Ston85]. While this ensures a consistent database at all times,
`the associated overhead is usually high. The second option,
`which we have adopted, supports "deferred" referential
`integrity checking (RIC). In this approach, update anomalies
`are not blocked or prevented. Rather, deferred RIC performs a
`complete "sweep" of the database at regular intervals notifying
`the user of inconsistencies.
`
`4.1. Background
`
`We have observed the use of first normal form
`databases for simulation and modeling applications. In these
`domains, global databases are acquired as flat files from outside
`agencies. When corresponding lngres databases are generated,
`
`they are already inconsistent and erroneous. Furthermore, users
`select and combine subsets of the global data to create their
`own local databases for their application models. Once their
`local databases are satisfactorily derived, few modifications are
`made to the data. In these simulation laboratories, deferred
`RIC is being used to validate entire global databases and also
`local derived databases. In addition, users can modify the
`integrity constraints of their local data by changing liD
`metadata. In this context, deferred RIC is an ideal solution for
`"scrubbing"
`inconsistent global databases and enforcing
`referential integrity within local versions.
`
`Opponents of deferred validation argue
`that
`processing an entire database is expensive. Indeed, deferred
`checking is a dedicated, time-intensive process. However, in
`the scenario described above, validating the entire database is a
`task performed overnight when the database is idle. Local
`databases are usually small enough that they can be validated
`during a lunch hour! Although we have not yet performed any
`systematic studies or timings, our intuition suggests that many
`interactive database applications resemble the situations we
`have observed.
`
`4.2. RIC algorithms
`
`The current implementation to validate referential
`integrity entails three rules: The first rule requires non-null key
`values. The second restriction, also regarding key attributes,
`insists that key values be unique. The third rule addresses
`inclusion dependencies between foreign key and key attributes.
`However, before applying these rules to a database, deferred
`RIC first checks for duplicate tuples in each relation. Although
`
`93
`
`
`
`relational databases
`theory underlying
`set-oriented
`the
`precludes duplicate "tuples", relational DBMS implementations
`do not enforce their uniqueness. The need for this type of
`validation check was motivated directly by the simulation and
`modeling applications discussed above, where duplicate tuples
`were responsible for extreme errors during selections which
`involved aggregation functions, such as, count and sum.
`
`We have implemented RIC algorithms corresponding
`the above
`four criteria. These procedures utilize
`to
`information, such as keys and interlinks, encoded in the
`metadata network described in section 3. Because interlink
`mappings are identified by syntactic pattern matching across
`common attributes of relations, the RIC facility automatically
`to validate referential
`generates the procedures necessary
`integrity for any given relational schema.
`
`In the following four subsections, we explain the
`method we have adopted for implementing each RIC rule. We
`also provide examples in each subsection demonstrating an
`application of the rule's procedure to the atlas database.
`The entire RIC process produces four error flles corresponding
`to each RIC rule. The examples shown below are extracted
`from these output error files.
`
`4.2..1. Unique tuples
`
`The algorithm for validating uniqueness of tuples
`uses the Lingres (and corresponding Ingres) functions "count"
`and "countu[nique]". Both coun