Avi Silberschatz, Michael Stonebraker, and Jeff Ullman, Editors

The history of database system research is one of exceptional productivity and startling economic impact. Barely 20 years old as a basic science research field, database research has fueled an information services industry estimated at $10 billion per year in the U.S. alone. Achievements in database research underpin fundamental advances in communications systems, transportation and logistics, financial management, knowledge-based systems, accessibility to scientific literature, and a host of other civilian and defense applications. They also serve as the foundation for considerable progress in basic science in various fields ranging from computing to biology.

As impressive as the accomplishments of basic database research have been, there is a growing awareness and concern that only the surface has been scratched in developing an understanding of the database principles and techniques required to support the advanced information management applications that are expected to revolutionize industrialized economies early in the next century. Rapid advances in areas such as manufacturing science, scientific visualization, robotics, optical storage, and high-speed communications already threaten to overwhelm the existing substrate of database theory and practice.

In February 1990, the National Science Foundation convened a workshop in Palo Alto, California, for the purpose of identifying the technology pull factors that will serve as forcing functions for advanced database technology and the corresponding basic research needed to enable that technology. The participants included representatives from the academic and industrial sides of the database research community.¹ The primary conclusions of the workshop participants can be summarized as follows:

1. A substantial number of the advanced technologies that will underpin industrialized economies in the early 21st century depend on radically new database technologies that are currently not well understood, and that require intensive and sustained basic research.

2. Next-generation database applications will have little in common with today's business data processing databases. They will involve much more data, require new capabilities including type extensions, multimedia support, complex objects, rule processing, and archival storage, and will necessitate rethinking the algorithms for almost all DBMS operations.

3. The cooperation among different organizations on common scientific, engineering, and commercial problems will require large-scale, heterogeneous, distributed databases. Very difficult problems lie ahead in the areas of inconsistent databases, security, and massive scale-up of distributed DBMS technology.

This article provides further information about these topics, as well as a brief description of some of the important achievements of the database research community.

¹ The workshop was attended by Michael Brodie, Peter Buneman, Mike Carey, Ashok Chandra, Hector Garcia-Molina, Jim Gray, Ron Fagin, Dave Lomet, Dave Maier, Marie Ann Niemat, Avi Silberschatz, Michael Stonebraker, Irv Traiger, Jeff Ullman, Gio Wiederhold, Carlo Zaniolo, and Maria Zemankova.

Background and Scope

The database research community has been in existence since the late 1960s.
Starting with modest representation, mostly in industrial research laboratories, it has expanded dramatically over the last two decades to include substantial efforts at major universities, government laboratories and research consortia.

Initially, database research centered on the management of data in business applications such as automated banking, record keeping, and reservation systems. These applications have four requirements that characterize database systems:

• Efficiency in the access to and modification of very large amounts of data;
• Resilience, or the ability of the data to survive hardware crashes and software errors, without sustaining loss or becoming inconsistent;
• Access control, including simultaneous access of data by multiple users in a consistent manner and assuring only authorized access to information; and
• Persistence, the maintenance of data over long periods of time, independent of any programs that access the data.

Database systems research has centered around methods for designing systems with these characteristics, and around the languages and conceptual tools that help users to access, manipulate, and design databases.

Database management systems (DBMSs) are now used in almost every computing environment to organize, create and maintain important collections of information. The technology that makes these systems possible is the direct result of a successful program of database research. This article will highlight some important achievements of the database research community over the past two decades, including the scope and significance of the technological transfer of database research results to industry. We focus on the major accomplishments of relational databases, transaction management, and distributed databases.

Today, we stand at the threshold of applying database technology in a variety of new and important directions, including scientific databases, design databases, and universal access to information. Later in this article we pinpoint two key areas in which research will make a significant impact in the next few years: next-generation database applications and heterogeneous, distributed databases.

Accomplishments of the Last Two Decades

From among the various directions
that the database research community has explored, the following three have perhaps had the most impact: relational database systems, transaction management, and distributed database systems. Each has fundamentally affected users of database systems, offering either radical simplifications in dealing with data, or great enhancement of their capability to manage information.

Relational Databases

In 1970 there were two popular approaches used to construct database management systems. The first approach, exemplified by IBM's Information Management System (IMS), has a data model (mathematical abstraction of data and operations on data) that requires all data records to be assembled into a collection of trees. Consequently, some records are root records and all others have unique parent records. The query language permitted an application programmer to navigate from root records to the records of interest, accessing one record at a time.

The second approach was typified by the standards of the Conference on Data Systems Languages (CODASYL). They suggested that the collection of DBMS records be arranged into a directed graph. Again, a navigational query language was defined, by which an application program could move from a specific entry point record to desired information.

Both the tree-based (called hierarchical) and graph-based (network) approaches to data management have several fundamental disadvantages. Consider the following examples:

1. To answer a specific database request, an application programmer, skilled in performing disk-oriented optimization, must write a complex program to navigate through the database. For example, the company president cannot, at short notice, receive a response to the query "How many employees in the Widget department will retire in the next three years?" unless a program exists to count departmental retirees.

2. When the structure of the database changes, as it will whenever new kinds of information are added, application programs usually need to be rewritten.

As a result, the database systems of 1970 were costly to use because of the low-level interface between the application program and the DBMS, and because the dynamic nature of user data mandated continued program maintenance.

The relational data model, pioneered by E. F. Codd in a series of papers in 1970-72, offered a fundamentally different approach to data storage. Codd suggested that conceptually all data be represented by simple tabular data structures (relations), and that users access data through a high-level, nonprocedural (or declarative) query language. Instead of writing an algorithm to obtain desired records one at a time, the application programmer is only required to specify a predicate that identifies the desired records or combination of records. A query optimizer in the DBMS translates the predicate specification into an algorithm to perform database access to solve the query. These nonprocedural languages are dramatically easier to use than the navigation languages of IMS and CODASYL; they lead to higher programmer productivity and facilitate direct database access by end users. A brief illustration follows the list below.

During the 1970s the database research community extensively investigated the relational DBMS concept. They:

• Invented high-level relational query languages to ease the use of the DBMS by both end users and application programmers.
The theory of higher-level query languages has been developed to provide a firm basis for understanding and evaluating the expressive power of database language constructs.

• Developed the theory and algorithms necessary to optimize queries, that is, to translate queries in the high-level relational query languages into plans that are as efficient as what a skilled programmer would have written using one of the earlier DBMSs for accessing the data. This technology probably represents the most successful experiment in optimization of very high-level languages among all varieties of computer systems.

• Formulated a theory of normalization to help with database design by eliminating redundancy and certain logical anomalies from the data.

• Constructed algorithms to allocate tuples of relations to pages (blocks of records) in files on secondary storage, to minimize the average cost of accessing those tuples.

• Constructed buffer management algorithms to exploit knowledge of access patterns for moving pages back and forth between disk and a main memory buffer pool.

• Constructed indexing techniques to provide fast associative access to random single records and/or sets of records specified by values or value ranges for one or more attributes.

• Implemented prototype relational DBMSs that formed the nucleus for many of the present commercial relational DBMSs.
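
To make the contrast with navigational access concrete, here is a minimal sketch of the declarative style, expressed in present-day SQL through Python's sqlite3 module. The employee table, its columns, and the retirement rule are invented for this illustration; they are not part of the original discussion.

```python
# A hypothetical illustration of a declarative query: the programmer states a
# predicate, and the query optimizer decides how to visit the stored records.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, birth_year INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ada", "Widget", 1929), ("Bob", "Widget", 1960), ("Cy", "Gadget", 1928)],
)

# The president's question as a single predicate, not a navigation program.
count, = conn.execute(
    """
    SELECT COUNT(*)
    FROM employee
    WHERE dept = 'Widget'
      AND birth_year <= ? - 65 + 3   -- will reach retirement age within three years
    """,
    (1991,),
).fetchone()
print(count)   # -> 1
```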

As a result of this research in the 1970s, numerous commercial products based on the relational concept appeared in the 1980s. Not only were the ideas identified by the research community picked up and used by the vendors, but also, several of the commercial developments were led by implementors of the earlier research prototypes. Today, commercial relational database systems are available on virtually any hardware platform from personal computer to mainframe, and are standard software on all new computers in the 1990s.

There is a moral to be learned from the success of relational database systems. When the relational data model was first proposed, it was regarded as an elegant theoretical construct but implementable only as a toy. It was only with considerable research, much of it focused on basic principles of relational databases, that large-scale implementations were made possible. The next generation of databases calls for continued research into the foundations of database systems, in the expectation that other such useful "toys" will emerge.

Transaction Management

During the last two decades, database researchers have also pioneered the transaction concept. A transaction is a sequence of operations that must appear "atomic" when executed. For example, when a bank customer moves $100 from account A to account B, the database system must ensure that either both of the operations (Debit A and Credit B) happen or that neither happens (and the customer is informed). If only the first occurs, then the customer has lost $100, and an inconsistent database state results.

To guarantee that a transaction transforms the database from one consistent state to another requires that:

1. The concurrent execution of transactions must be such that each transaction appears to execute in isolation. Concurrency control is the technique used to provide this assurance.

2. System failures, either of hardware or software, must not result in inconsistent database states. A transaction must execute in its entirety or not at all. Recovery is the technique used to provide this assurance.

Concurrent transactions in the system must be synchronized correctly in order to guarantee that consistency is preserved. For instance, while we are moving $100 from A to B, a simultaneous movement of $300 from account B to account C should result in a net deduction of $200 from B. The view of correct synchronization of transactions is that they must be serializable; that is, the effect on the database of any number of transactions executing in parallel must be the same as if they were executed one after another, in some order.
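
The all-or-nothing property of the transfer described above can be made concrete with a small sketch, again using Python's sqlite3 module purely as an illustration; the account table and the overdraft check are assumptions of the sketch, not of the article.

```python
# A minimal sketch of an atomic transfer: either both the debit and the credit
# are committed, or neither survives.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 500), ('B', 100)")
conn.commit()

def transfer(amount):
    """Move `amount` from A to B as one atomic unit of work."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = 'A'", (amount,))
        if conn.execute("SELECT balance FROM account WHERE id = 'A'").fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = 'B'", (amount,))
        conn.commit()       # both updates become visible together
    except Exception:
        conn.rollback()     # neither update survives
        raise

transfer(100)
print(conn.execute("SELECT * FROM account ORDER BY id").fetchall())
# -> [('A', 400), ('B', 200)]
```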

During the 1970s and early 1980s the DBMS research community worked extensively on the transaction model. First, the theory of serializability was worked out in detail, and precise definitions of the correctness of schedulers (algorithms for deciding when transactions could execute) were produced. Second, numerous concurrency control algorithms were invented that ensure serializability. These included algorithms based on:

• Locking data items to prohibit conflicting accesses. Especially important is a technique called two-phase locking, which guarantees serializability by requiring a transaction to obtain all the locks it will ever need before releasing any locks.

• Timestamping accesses so the system could prevent violations of serializability.

• Keeping multiple versions of data objects available.

The various algorithms were subjected to rigorous experimental studies and theoretical analysis to determine the conditions under which each was preferred.

Recovery is the other essential component of transaction management. We must guarantee that all the effects of a transaction are installed in the database, or that none of them are, and this guarantee must be kept even when a system crash loses the contents of main memory. During the late 1970s and early 1980s, two major approaches to this service were investigated, namely:

Write-ahead logging. A summary of the effects of a transaction is stored in a sequential file, called a log, before the changes are installed in the database itself. The log is on disk or tape where it can survive system crashes and power failures. When a transaction completes, the logged changes are then posted to the database. If a transaction fails to complete, the log is used to restore the prior database state.

Shadow-file techniques. New copies of entire data items, usually disk pages, are created to reflect the effects of a transaction and are written to the disk in entirely new locations. A single atomic action remaps the data pages, so as to substitute the new versions for the old when the transaction completes. If a transaction fails, the new versions are discarded.
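
A toy sketch of the write-ahead-logging idea follows: a change is appended to a sequential log and forced to stable storage before the database state is touched, and recovery rolls the log forward. The file names, the single-dictionary "database," and the checkpoint step are all invented for illustration; a real recovery manager also logs transaction begin/commit records and supports undo.

```python
# A deliberately simplified write-ahead log for a key-value "database".
import json, os

LOG, DB = "toy.log", "toy.db"

def load_db():
    db = {}
    if os.path.exists(DB):
        with open(DB) as f:
            db = json.load(f)
    if os.path.exists(LOG):              # recovery: roll the log forward
        with open(LOG) as f:
            for line in f:
                rec = json.loads(line)
                db[rec["key"]] = rec["value"]
    return db

def put(db, key, value):
    with open(LOG, "a") as f:            # 1. append the log record first ...
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush(); os.fsync(f.fileno())  #    ... and force it to stable storage
    db[key] = value                      # 2. only then change the database state

def checkpoint(db):
    with open(DB, "w") as f:             # 3. periodically install the changes ...
        json.dump(db, f)
        f.flush(); os.fsync(f.fileno())
    if os.path.exists(LOG):              #    ... after which the log can be truncated
        os.remove(LOG)

db = load_db()
put(db, "account_A", 400)
checkpoint(db)
```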

Recovery techniques have been extended to cope with the failure of the stable medium as well. A backup copy of the data is stored on an entirely separate device. Then, with logging, the log can be used to roll forward the backup copy to the current state.

Distributed Databases

A third area in which the DBMS research community played a vital role is distributed databases. In the late 1970s there was a realization that organizations are fundamentally decentralized and require databases at multiple sites. For example, information about the California customers of a company might be stored on a machine in Los Angeles, while data about the New England customers could exist on a machine in Boston. Such data distribution moves the data closer to the people who are responsible for it and reduces remote communication costs.

Furthermore, the decentralized system is more likely to be available when crashes occur. If a single, central site goes down, all data is unavailable. However, if one of several regional sites goes down, only part of the total database is inaccessible. Moreover, if the company chooses to pay the cost of multiple copies of important data, then a single site failure need not cause data inaccessibility.

In a multidatabase environment we strive to provide location transparency. That is, all data should appear to the user as if they are located at his or her particular site. Moreover, the user should be able to execute normal transactions against such data. Providing location transparency required the DBMS research community to investigate new algorithms for distributed query optimization, concurrency control, crash recovery, and support of multiple copies of data objects for higher performance and availability.

In the early 1980s the research community rose to this challenge. Distributed concurrency control algorithms were designed, implemented and tested. Again, simulation studies and analysis compared the candidates to see which algorithms were dominant. The fundamental notion of a two-phase commit to ensure the possibility of crash recovery in a distributed database was discovered. Algorithms were designed to recover from processor and communication failures, and data patch schemes were put forward to rejoin distributed databases that had been forced to operate independently after a network failure. Technology for optimizing distributed queries was developed, along with new algorithms to perform the basic operations on data in a distributed environment. Finally, various algorithms for the update of multiple copies of a data item were invented. These ensure that all copies of each item are consistent.

All the major DBMS vendors are presently commercializing distributed DBMS technology. Again we see the same pattern discussed earlier for relational databases and transactions, namely aggressive research support by government and industry, followed by rapid technology transfer from research labs to commercial products.
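
The two-phase commit notion mentioned above can be sketched schematically as follows. The Participant class and its vote are invented for illustration; a real protocol must also force-write log records at each step so that every site can reconstruct the decision after a crash.

```python
# A schematic sketch of two-phase commit: the coordinator asks every site to
# prepare, and commits only if all sites vote yes; otherwise all sites abort.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit

    def prepare(self):            # phase 1: promise to commit if asked
        return self.can_commit

    def commit(self):             # phase 2a
        print(f"{self.name}: committed")

    def abort(self):              # phase 2b
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):
        for p in participants:                    # phase 2: everyone commits ...
            p.commit()
        return True
    for p in participants:                        # ... or everyone aborts
        p.abort()
    return False

two_phase_commit([Participant("Los Angeles"), Participant("Boston")])
```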

The Next Challenges

Some might argue that database systems are a mature technology and it is therefore time to refocus research onto other topics. Certainly relational DBMSs, both centralized and distributed, are well studied, and commercialization is well along. Object management ideas, following the philosophy of object-oriented programming, have been extensively investigated over the last few years and should allow more general kinds of data elements to be placed in databases than the numbers and character strings supported in traditional systems. The relentless pace of advances in hardware technology makes CPUs, memory and disks drastically cheaper each year. Current databases will therefore become progressively cheaper to deploy as the 1990s unfold. Perhaps the DBMS area should be declared solved, and energy and research efforts allocated elsewhere.

We argue strongly here that such a turn of events would be a serious mistake. Rather, we claim that solutions to the important database problems of the year 2000 and beyond are not known. Moreover, hardware advances of the next decade will not make brute force solutions economical, because the scale of the prospective applications is simply too great.

In this section we highlight two key areas in which we feel important research contributions are required in order to make future DBMS applications viable: next-generation database applications and heterogeneous, distributed databases. In addition to being important intellectual challenges in their own right, their solutions offer products and technology of great social and economic importance, including improved delivery of medical care, advanced design and manufacturing systems, enhanced tools for scientists, greater per capita productivity through increased personal access to information, and new military applications.

The Research Agenda

Next-Generation DBMS Applications

To motivate the discussion of research problems that follows, in this section we present several examples of the kinds of database applications that we expect will be built during the next decade.

1. For many years, NASA scientists
have been collecting vast amounts of information from space. They estimate that they require storage for 10¹⁶ bytes of data (about 10,000 optical disk jukeboxes) just to maintain a few years' worth of satellite image data they will collect in the 1990s. Moreover, they are very reluctant to throw anything away, lest it be exactly the data set needed by a future scientist to test some hypothesis. It is unclear how this database can be stored and searched for relevant images using current or soon-to-be available technology.

2. Databases serve as the backbone of computer-aided design systems. For example, civil engineers envision a facilities-engineering design system that manages all information about a project, such as a skyscraper. This database must maintain and integrate information about the project from the viewpoints of hundreds of subcontractors. For example, when an electrician puts a hole in a beam to let a wire through, the load-bearing soundness of the structure could be compromised. The design system should, ideally, recalculate the stresses, or at the least, warn the cognizant engineer that a problem may exist.

3. The National Institutes of Health (NIH) and the U.S. Department of Energy (DOE) have embarked on a joint national initiative to construct the DNA sequence corresponding to the human genome. The gene sequence is several billion elements long, and its many subsequences define complex and variable objects. The matching of individuals' medical problems to differences in genetic makeup is a staggering problem and will require new technologies of data representation and search.

4. Several large department stores already record every product-code-scanning action of every cashier in every store in their chain. Buyers run ad-hoc queries on this historical database in an attempt to discover buying patterns and make stocking decisions. This application taxes the capacity of available disk systems. Moreover, as the cost of disk space declines, the retail chain will keep a larger and larger history to track buying habits more accurately. This process of "mining" data for hidden patterns is not limited to commercial applications. We foresee similar applications, often with even larger databases, in science, medicine, intelligence gathering, and many other areas.

5. Most insurance firms have a substantial on-line database that records the policy coverage of the firm's customers. These databases will soon be enhanced with multimedia data such as photographs of property damaged, digitized images of handwritten claim forms, audio transcripts of appraisers' evaluations, images of specially insured objects, and so on. Since image data is exceedingly large, such databases will become enormous. Moreover, future systems may well store video walk-throughs of houses in conjunction with a homeowners policy, further enlarging the size of this class of databases. Again, applications of this type are not limited to commercial enterprises.

These applications not only introduce problems of size, they also introduce problems with respect to all conventional aspects of DBMS technology (e.g., they pose fundamentally new requirements for access patterns, transactions, concurrency control, and data representation). These applications have in common the property that they will push the limits of available technology for the foreseeable future.
As computing resources become cheaper, these problems are all likely to expand at the same or at a faster rate. Hence, they cannot be overcome simply by waiting for the technology to bring computing costs down to an acceptable level.

We now turn to the research problems that must be solved to make such next-generation applications work. Next-generation applications require new services in several different areas in order to succeed.

New Kinds of Data

Many next-generation applications entail storing large and internally complex objects. The insurance example (5) requires storage of images. Scientific and design databases often deal with very large arrays or sequences of data elements. A database for software engineering might store program statements, and a chemical database might store protein structures. We need solutions to two classes of problems: data access and data type management.

Current databases are optimized for delivering small records to an application program. When fields in a record become very large, this paradigm breaks down. The DBMS should read a large object only once and place it directly at its final destination. Protocols must be designed to chunk large objects into manageable size pieces for the application to process. A new generation of query languages will be required to support querying of array and sequence data, as will mechanisms for easily manipulating disk and archive representations of such objects. In addition, extended storage structures and indexing techniques will be needed to support efficient processing of such data.

A second class of problems concerns type management. There must be a way for the programmer to construct the types appropriate for his or her application. The need for more flexible type systems has been one of the major forces in the development of object-oriented databases.
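
As a small illustration of the kind of type extension being argued for, here is a sketch using the adapter/converter hooks of Python's sqlite3 module. The Point type and its storage format are invented, and this user-level mechanism is only a stand-in for the richer, statically checked type systems the text calls for.

```python
# A hypothetical application-defined type stored in and retrieved from a DBMS.
import sqlite3
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

# Teach the DBMS interface how to store and reconstruct the new type.
sqlite3.register_adapter(Point, lambda p: f"{p.x};{p.y}")
sqlite3.register_converter("point", lambda b: Point(*map(float, b.split(b";"))))

conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("CREATE TABLE site (name TEXT, location point)")
conn.execute("INSERT INTO site VALUES (?, ?)", ("well 7", Point(3.5, 4.25)))
print(conn.execute("SELECT location FROM site").fetchone()[0])
# -> Point(x=3.5, y=4.25)
```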

One of the drawbacks of the systems developed so far is that type-checking is largely dynamic, which means that programming errors tend to show up at run time, not during compilation. In order to provide the database application designer with the same safety nets that are provided by modern high-level programming languages, we must determine how we can combine static type disciplines with persistent data and evolution of the database structure over time.

Rule Processing

Next-generation applications will frequently involve a large number of rules, which take declarative ("if A is true, then B is true") and imperative ("if A is true, then do C") forms. For example, a design database should notify the proper designer if a modification by a second designer may have affected the subsystem that is the responsibility of the first designer. Such rules may include elaborate constraints that the designer wants enforced, triggered actions that require processing when specific events take place, and complex deductions that should be made automatically within the system. It is common to call such systems knowledge-base systems, although we prefer to view them as a natural, although difficult, extension of DBMS technology.

Rules have received considerable attention as the mechanism for triggering, data mining (as discussed in the department store example), and other forms of reasoning about data. Declarative rules are advantageous because they provide a logical declaration of what the user wants rather than a detailed specification of how the results are to be obtained. Similarly, imperative rules allow for a declarative specification of the conditions under which a certain action is to be taken. The value of declarativeness in relational query languages like SQL (the most common such language) has been amply demonstrated, and an extension of the idea to the next generation of query languages is desirable.

Traditionally, rule processing has been performed by separate subsystems, usually called expert system shells. However, applications such as the notification example cannot be done efficiently by a separate subsystem, and such rule processing must be performed directly by the DBMS. Research is needed on how to specify the rules and on how to process a large rule base efficiently. Although considerable effort has been directed at these topics by the artificial intelligence (AI) community, the focus has been on approaches that assume all relevant data structures are in main memory, such as RETE networks. Next-generation applications are far too big to be amenable to such techniques.

We also need tools that will allow us to validate and debug very large collections of rules. In a large system, the addition of a single rule can easily introduce an inconsistency in the knowledge base or cause chaotic and unexpected effects, and can even end up repeatedly firing itself. We need techniques to decompose sets of rules into manageable components and prevent (or control in a useful way) such inconsistencies and repeated rule firing.
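
A minimal sketch of an imperative rule expressed as a database trigger follows, echoing the design-database notification example. The tables, the trigger body, and the use of Python's sqlite3 module are illustrative assumptions, not something the article prescribes.

```python
# An imperative rule: "if a beam's load rating changes, then record a notice."
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE beam   (id INTEGER PRIMARY KEY, load_rating REAL, owner TEXT);
    CREATE TABLE notice (beam_id INTEGER, message TEXT);

    -- the rule, stored and enforced inside the DBMS rather than in a shell
    CREATE TRIGGER beam_changed AFTER UPDATE OF load_rating ON beam
    BEGIN
        INSERT INTO notice VALUES (NEW.id,
            'load rating changed; please re-check stresses for ' || NEW.owner);
    END;
""")
conn.execute("INSERT INTO beam VALUES (1, 120.0, 'structural engineer')")
conn.execute("UPDATE beam SET load_rating = 95.0 WHERE id = 1")
print(conn.execute("SELECT * FROM notice").fetchall())
# -> [(1, 'load rating changed; please re-check stresses for structural engineer')]
```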

New Concepts in Data Models

Many of the new applications will involve primitive concepts not found in most current applications, and there is a need to build them cleanly into specialized or extended query languages. Issues range from efficiency of implementation to the fundamental theory underlying important primitives. For example, we need to consider:

Spatial Data. Many scientific databases have two- or three-dimensional points, lines, and polygons as data elements. A typical search is to find the 10 closest neighbors to some given data element. Solving such queries will require sophisticated, new multidimensional access methods. There has been substantial research in this area, but most has been oriented toward main memory data structures, such as quad trees and segment trees. The disk-oriented structures, including K-D-B trees and R-trees, do not perform particularly well when given real-world data.

Time. In many exploratory applications, one might wish to retrieve and explore the database state as of some point in the past or to retrieve the time history of a particular data value. Engineers, shopkeepers, and physicists all require different notions of time. No support for an algebra over time exists in any current commercial DBMS, although research prototypes and special-purpose systems have been built. However, there is not even an agreement across systems on what a "time interval" is; for example, is it discrete or continuous, open-ended or closed?

Uncertainty. There are applications, such as identification of features
from satellite photographs, for which we need to attach a likelihood that data represents a certain phenomenon. Reasoning under uncertainty, especially when a conclusion must be derived from several interrelated partial or alternative results, is a problem that the AI community has addressed for many years, with only modest success. Further research is essential, as we must learn not only to cope with data of limited reliability, but to do so efficiently, with massive amounts of data.

Scaling Up

It will be necessary to scale all DBMS algorithms to operate effectively on databases of the size contemplated by next-generation applications, often several orders of magnitude bigger than the largest databases found today. Databases larger than a terabyte (10¹² bytes) will not be unusual. The current architecture of DBMSs will not scale to such magnitudes. For example, curren