Document Management, Digital Libraries and the Web
June 9, 1995
Larry Masinter <>

Document management systems are used by individuals, office workgroups and enterprises to organize
and keep track of the documents being produced as a part of their work. Digital library technology is
being developed by many organizations to make the world's knowledge available through computers and
communication technology. The World-Wide Web is an Internet application being used by individuals,
companies and other organizations to promote themselves and their products, to conduct electronic
commerce, and to provide information to the vast number of Internet users around the world. These three
application areas have much in common and also significant differences. This paper notes the common
elements and some of the technical issues common to these areas, and explores the opportunities for
synergy when these applications merge.

• 1. Introduction
    o 1.1 Document Management Overview
    o 1.2 Digital Libraries Overview
    o 1.3 The Web: an Overview
• 2. Common Elements
    o 2.1 Document Identifiers
    o 2.2 MetaData
    o 2.3 Authentication, Authorization and Accounting
    o 2.4 Document types
    o 2.5 Searching
• 3. Opportunities
• References
• Acknowledgments

1. Introduction

The terms "document management system", "digital library" and "World-Wide Web" describe applications
with a number of common architectural elements, though they are distinct in many of their features, in
their domains of use, and in the systems and protocols they involve. This first section of the paper
describes each of the areas, their critical properties, some examples of their use, and the systems,
standards, and organizations involved in developing them. Section 2 then explores many of the common
design issues facing developers in each of the areas. Finally, Section 3 sets out some of the
opportunities for integrating the three application areas.

1.1 Document Management Overview

Document management systems are software packages designed to help individuals, workgroups and large
enterprises manage their growing number of documents stored in electronic form[1][2]. Document
management is seen as a way to help companies manage the intellectual property that is locked up in the
company's documents, currently hidden away in a morass of directories and subdirectories in scattered
file servers across their networks. Document management systems may be used for a workgroup (a group of
users connected via a local area network) or an enterprise (everyone in a company, connected via a
corporate network).

Document management is used to manage the entire life cycle of a document, from creation through
multiple revisions and finally into long-term storage and records management. For example, workgroup
document management systems often offer library services for preserving update consistency, similar to
the check-out and check-in capabilities of software source code control systems. When a user checks out
a document, the system locks the document against other users' changes. When the document is checked
back in, the document management system makes it available for others to revise. Along with maintaining
update consistency, the document management application tracks revisions in a multi-author/editor
setting.
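
The check-out and check-in cycle described above can be sketched in a few lines of Python. This is only
an illustration under simple assumptions (a single in-memory repository, one lock per document); the
names Library, ManagedDocument and LockedError are hypothetical and do not correspond to any vendor's
product.

```python
from dataclasses import dataclass, field
from typing import Optional


class LockedError(Exception):
    """Raised when a user tries to act on a document someone else holds."""


@dataclass
class ManagedDocument:
    content: str
    revisions: list = field(default_factory=list)  # (user, content) pairs, oldest first
    locked_by: Optional[str] = None                # user currently holding the lock


class Library:
    """Hypothetical in-memory library service, for illustration only."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, user, content):
        self._docs[doc_id] = ManagedDocument(content, [(user, content)])

    def check_out(self, doc_id, user):
        doc = self._docs[doc_id]
        # Refuse the check-out while another user holds the lock.
        if doc.locked_by is not None and doc.locked_by != user:
            raise LockedError(f"{doc_id} is checked out by {doc.locked_by}")
        doc.locked_by = user
        return doc.content

    def check_in(self, doc_id, user, new_content):
        doc = self._docs[doc_id]
        if doc.locked_by != user:
            raise LockedError(f"{user} does not hold the lock on {doc_id}")
        # Record the revision, then release the lock for other editors.
        doc.content = new_content
        doc.revisions.append((user, new_content))
        doc.locked_by = None

    def history(self, doc_id):
        return list(self._docs[doc_id].revisions)
```

The revision list is what lets such a system answer "who changed what" in a multi-author setting; a
production system would also persist it and record timestamps.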

Document management systems usually feature searching in repositories of documents both by externally
applied information about the documents (e.g., user who entered it, date of revision, or version
relationship) and by content (e.g., search on words contained within the document).

Frequently, document management systems are integrated with imaging capabilities: the ability to deal
with scanned raster images (fax quality or higher) of documents that originated in paper form, as well as
with documents that originated in electronic form. While imaging applications traditionally had been a
separate domain, the line between image management and general document management has been
increasingly blurred in recent years. In image document management systems, optical character
recognition (OCR) is used to analyze the document content and index the corpus for content retrieval, even
when the documents themselves are retained in image form.

Document management systems are usually integrated with desktop applications. That means that the
user's application program -- word processor, spreadsheet, graphic editor -- is modified to work directly
with the document management system. For example, if a user running WordPerfect pulls down the
"File/Open" menu, a search interface to the document management repository might appear rather than the
standard file system dialog interface.

Document management systems are sometimes connected to or integrated with workflow systems, though
the latter is, strictly speaking, a different application. While document management systems deal with
storing and searching documents in repositories, workflow systems are organized around work processes.
Thus, a workflow system contains a model of the tasks of an organization and the roles that individuals
play in that organization, and routes the work according to the model of the work process. Of course, the
results of that process are often stored in document management repositories, and document management
operations are often steps in the tasks managed by the workflow system.

Applications of Document Management Systems

To make clear the function of document management applications, it may help to give some typical
examples of how these systems are used:

• A large multinational law firm manages all of its correspondence and contracts in a document
  management system. Because the firm believes it has an obligation to offer similar legal advice to all
  clients in similar situations, the firm wants the system to keep track of all correspondence,
  contracts, and so forth as produced in each of its offices.
• A large aerospace company finds that almost every plane off its assembly line is different in
  configuration. The documentation for the repair and maintenance of the plane needs to match the
  configuration shipped. The document management system allows the configuration of the shipped
  documentation to match the product. As more and more manufacturers move into custom product
  delivery and just-in-time manufacturing, it has become increasingly important to have a system that
  can allow documentation to track the changes in the products.
• Offices accumulate large repositories of general correspondence and often look for smaller
  document management applications for tracking correspondence and business documents.

There are a large number of vendors of document management systems. Some of the major products and
vendors include Documentum, PC Docs, SoftSolutions from WordPerfect/Novell, FileNet, Visual Recall
from Xerox, and Mezzanine from Saros. Many other products include document management capabilities,
including offerings from Verity, Oracle, and Lotus (Notes).

As document management products have developed, there has been a growing demand for standards to
allow interoperability between them. Large enterprises discover that different workgroups within their
organization have, for various reasons, chosen different document management products. As they attempt
to integrate these products across the enterprise, enterprise-wide standard interfaces and interoperability
become increasingly important.

To this end, consortia have organized to define standards for document management. For example, the
Open Document Management API (ODMA) is a simple Application Program Interface (API) designed to
let desktop applications (such as an editor or spreadsheet) integrate with any of a number of document
management systems[3][4][5]. It redefines file access menu items such as "Open", "Save", and "Save as..."
to call the document management system (if one is installed) instead of the file system.
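
The redirection of a menu item to an installed document management system might look roughly like the
following Python sketch. It does not use or reproduce the actual ODMA functions; DesktopApp, install_dms
and file_open are hypothetical names chosen only to show the "use the document management system if one
is installed, otherwise fall back to the file system" dispatch.

```python
class DesktopApp:
    """Hypothetical desktop application with a redirectable "Open" menu item."""

    def __init__(self, file_dialog):
        # file_dialog: callable standing in for the standard file system dialog.
        self._file_dialog = file_dialog
        self._dms_dialog = None

    def install_dms(self, dms_dialog):
        # dms_dialog: callable standing in for the repository's search interface.
        self._dms_dialog = dms_dialog

    def file_open(self):
        # Prefer the document management system when one has been installed.
        handler = self._dms_dialog if self._dms_dialog is not None else self._file_dialog
        return handler()
```

The application code that handles "Open" stays the same in both cases; only the handler behind it
changes, which is what lets one desktop program work with several document management products.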

At another level, there have been recent attempts by industry groups to define a middleware layer between
the user interface and back-end document repositories, so that users in an enterprise can access documents
stored in multiple document management systems across their enterprise. The two efforts by the Shamrock
Document Management Coalition (Shamrock's Enterprise Library Services) and the Document Enabled
Networking[6] specification are being merged into a new Document Management Alliance (DMA)[7] to
promote a single standard interface. These initiatives are creating a set of standard interfaces that define
system elements such as "document", "repository", and "attribute" as well as operations such as
searching, checking out a document, and retrieving it.

1.2 Digital Libraries

What is a digital library? The term is sometimes used in a relatively literal way to refer to a system or
application whose function is chiefly to extend the reach of a conventional library, for example by making
its collection available in electronic form to remote users. More abstractly, the term is used to describe any
application or system aimed at providing access and services for a large electronic document corpus.
Usually the users of such corpora are thought of as members of a general or specialized public, rather than
the personnel of an organization or enterprise. Over the last few years there have been research and
development projects of both types; see, for example, [8][9][10][11] and special issues of journals[12]. For
all their differences and particularities, these projects have certain general characteristics in common.

Key Features of Digital Libraries

Digital libraries usually possess large corpora of information of generally high value. Not only is the
material of high quality, but care is also taken in cataloging the material and in making sure that the
origin, date, and other external descriptive information are accurate. Many digital library projects are
concerned with providing digital access to material that already exists within traditional library collections,
and thus concentrate on material that was originally intended for analog media: libraries of scanned images
of photographs or printed texts, digitized video segments and so forth. Other projects extend the library
metaphor to other collections such as scientific data sets, software libraries or multimedia works. A great
deal of work in this area concentrates on providing enhanced content or access methods, with the problem
often couched as one of providing a way of satisfying the individual's particular "information needs". This
might be a chemistry graduate student looking for information for a research project, a high-school student
downloading a multi-media chemistry text, or a market researcher looking for information about chemical

Digital library systems and standards

While much digital library work is in its early phase of development, there is a rich tradition in the library
community that has influenced the thinking and design of systems for digital libraries. Historically,
library automation has taken the form of Online Public Access Catalogs (OPACs). The standards for online
library catalogs include MARC[13] and Z39.50[27]. Another kind of metadata is represented by the
Scientific and Technical Attribute Set (STAS), which defines a standard for metadata elements to describe
scientific datasets as opposed to traditional bibliographic material.

More recently, a number of research initiatives have proposed systems and mechanisms for future digital
libraries, including the six NSF/ARPA/NASA joint initiative projects and initiatives of the national
libraries and library system vendors. Previous work in copyright management[14][15], document
identifiers[16], and the Computer Science Technical Report project[17] also contributes to digital library
technology.

1.3 The Web: an Overview

These days, it is hardly necessary to define "the web" at an Internet conference. (It's hardly necessary to
define "the web" to the cab driver who takes you to the conference from the airport.) For the sake of
contrast, though, it will be useful to lay out the web's key features here.

Key Features of the web

By "the web", I mean information on the Internet, as accessed by individuals using a World-Wide Web
browser or some other network information access tool. The web is accessed using one of the many web
browsers now available. The web provides a document interface to information. That is, a user is presented
with a document which includes links to follow and forms to fill out. By interacting with the document, the
user causes a new document to be presented. The web, as an Internet service, is primarily public. A web
site can provide access to a very large number of users across the world.

Example applications of the web

The web is used for institutional public relations and product information, personal communication, online
publishing, and scientific, technical and scholarly interchange. For example, companies put up web sites
about their products and services; a growing number of newspapers and information service providers are
producing web sites. Students put up 'home pages' covering their hobbies. Professional organizations and
educational institutions give out information about their organizations and their resources.

Web systems and standards

There are a growing number of web systems and software packages, including those produced by
sponsored research, university researchers and commercial vendors. Dozens of start-ups compete for

The web systems and protocols, originally defined in the research community, are being refined by a
number of companies and consortia (the W3C consortium, for example) and being standardized by
working groups of the Internet Engineering Task Force (IETF). The IETF is developing standards for
Uniform Resource Locators (URLs), Uniform Resource Names (URNs), the HyperText Transfer Protocol
(HTTP), and the HyperText Markup Language (HTML). These are the principal elements of the
World Wide Web. The web also includes other network search protocols and access systems. For example,
the Gopher protocol defined by the University of Minnesota is part of the web, while the Internet use of the
Z39.50 standard is defined by the Z39.50 Implementors Group (ZIG)[18].

2. Common Elements in Document Management, Digital Libraries and the Web

The three application areas of document management, digital libraries and the web share common
technology elements. This section describes some of these common elements, how they're deployed in
each area, and the general design problems that are shared by all three areas. With more coordination
between the groups designing the systems and protocols in these areas, solutions that are deployed for one
set of applications might be reapplied in others, duplicate effort avoided, and the opportunities for synergy

2.1 Document Identifiers

In any computer system for manipulating information, it is important to allow objects to contain persistent
references to other objects. These references are used from inside databases, in bibliographies, hypertext
links, and in a variety of other ways. The approaches used in document management, digital libraries and
the web have differed.

Identifiers in Document Management systems

Commercial document management systems all employ some kind of document identifier mechanism, so
that pointers to documents in the document management system can be saved and referenced independent
of that system. For example, ODMA has a document ID -- a persistent, portable identifier for a document --
that is accepted or returned by ODMA functions. It is used to save away references to documents and to
refer to documents in electronic mail or by other processes. Other examples of document identifiers include
those used in OpenDoc[19] and OLE. The OpenDoc standard uses the Bento file format[20], which
incorporates globally unique identifiers to make references from one document to another. OLE uses a
variety of identifiers to keep permanent references valid between composite objects[21].

Identifiers in Libraries

Traditionally, the library community has developed a number of mechanisms to uniquely identify a work.
These mechanisms include "call numbers" (e.g., the Library of Congress Call Number system which yields
identifiers that are printed like PS35660815.W4.1987), ISBN numbers (originally intended for inventory)
and ISSN numbers (which identify serials, i.e., material that is updated regularly). More recently, librarians
have tried to apply this apparatus to digital works, which do not always lend themselves to traditional
treatment and which raise a number of design issues involving the use of document identifiers[22].

Identifiers on the Web

In the World-Wide Web, the most common kind of identifier is a URL. URLs are probably familiar to
anyone who has used a web browser or read the papers in this conference, where the references include
URLs. While the name "URL" seems to indicate that it locates the object (says 'where it is'), in fact, a URL
is more like an 'access method': it tells you how, on the Internet, to access the object. As many have
observed, there is a serious problem using URLs when information or web resources move. There is a
strong desire to create a new scheme for URNs that name an object independent of its location. Some kind
of distributed URN -> URL location service (for which there is not yet an accepted design) would then be
employed to find out the actual location of objects. Several proposals have been brought forward and are
being evaluated.
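
The idea behind a URN -> URL location service can be illustrated with a toy lookup table. The sketch
below is not any of the proposals under evaluation; LocationService is a hypothetical name, the table is
a single in-memory dictionary rather than a distributed service, and the URN and URL strings in the usage
note are made-up examples.

```python
class LocationService:
    """Toy URN -> URL table; a real service would be distributed."""

    def __init__(self):
        self._urls = {}

    def register(self, urn, url):
        # Registering an existing URN again records that the object has moved.
        self._urls[urn] = url

    def resolve(self, urn):
        try:
            return self._urls[urn]
        except KeyError:
            raise LookupError(f"no known location for {urn}") from None
```

A document that cites "urn:example:report-42" keeps working after the report moves from one server to
another, because only the table entry changes; a document that cited the old URL directly would not.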

Issues in Document Identifiers

There are a number of open design issues in the area of document identifiers. These design issues are
present for dealing with electronic documents, whether in a library, a workgroup, or on the Internet.

Fragments, relationships

How does one identify a piece of something else? For example, if there is a volume of collected papers, do
the individual papers get separate identifiers? If so, is the identifier for each element somehow
syntactically related to the identifier for the whole? If not, how is the relationship established? Is there a
database that links the part to the whole?

When an object is revised, does it retain its identifier? For example, in System 33[23], every document had
two identifiers: one that was assigned to 'this version' and another that specified 'the latest version of
whatever this becomes'.
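
The distinction between a 'this version' identifier and a 'latest version' identifier can be sketched as
follows. This is not how System 33 worked internally; VersionStore and the "name@number" syntax are
assumptions made for the illustration, and the sketch assumes document names never contain "@".

```python
class VersionStore:
    """Hypothetical store in which each document has two kinds of identifier."""

    def __init__(self):
        self._history = {}  # document name -> list of contents, oldest first

    def create(self, name, content):
        self._history[name] = [content]
        return f"{name}@1"  # identifier for 'this version'

    def revise(self, name, content):
        versions = self._history[name]
        versions.append(content)
        return f"{name}@{len(versions)}"

    def fetch(self, identifier):
        name, sep, number = identifier.partition("@")
        versions = self._history[name]
        if sep:
            # "name@n" is pinned to one specific revision.
            return versions[int(number) - 1]
        # A bare name means 'the latest version of whatever this becomes'.
        return versions[-1]
```

A citation that must stay stable would store the pinned form, while a link meant to follow the document
through its revisions would store the bare name.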

In the office environment, a document with a cover memo attached might be considered a different object.
However, in some situations, the 'cover' material is merely an external attribute, and the document hasn't
changed and should not get a different identifier.

In general, there are a large number of relationships between objects that can be expressed as relationships
of the identifiers of the objects, and relevant design decisions are currently made in an ad hoc fashion.
Publishers are allowed to retain the same ISBN number for minor printing revisions, but the paperback and
hardcover of a book are given different ISBN numbers. On the web, the URL of a document doesn't
change if the content changes. Moreover, different vendors' document management systems seem to take
different approaches to dealing with revision and identity.

There are a variety of methods used to ensure that different documents do not get the same identifier, even
when different entities are assigning names. These methods rely either on a distributed hierarchy or on a
probabilistic method of name assignment.

In a hierarchical uniqueness system, there is a tree of 'naming authorities'. Every naming authority
guarantees that it will not give out the same identifier to two different documents. If it delegates some of
the naming authority to sub-authorities, it also delegates that promise. ("Here, you can give out names, but
you make sure you never give out the same name twice.") For example, the Internet's Domain Name
Service is a hierarchical service; the owner of "" can hand out unique names under that suffix,
and can delegate the naming system underneath to the owner of "". Many of the proposals for
URNs on the Web are hierarchical.
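
A tree of naming authorities that each keep the "never give out the same name twice" promise can be
sketched like this. NamingAuthority, the "/" separator and the label strings are assumptions made for
illustration; they are not taken from the Domain Name Service or from any URN proposal.

```python
class NamingAuthority:
    """Hypothetical naming authority that can delegate part of its name space."""

    def __init__(self, prefix=""):
        self._prefix = prefix
        self._issued = set()  # labels this authority has already handed out

    def assign(self, label):
        if "/" in label:
            raise ValueError("labels may not contain the separator '/'")
        if label in self._issued:
            raise ValueError(f"{label!r} was already assigned by this authority")
        self._issued.add(label)
        return self._qualify(label)

    def delegate(self, label):
        # The delegated label is consumed like any other name, so the parent
        # can never reuse it; the child inherits the uniqueness promise.
        return NamingAuthority(self.assign(label))

    def _qualify(self, label):
        return f"{self._prefix}/{label}" if self._prefix else label
```

Two sub-authorities may both assign the label "report" without conflict, because each full name carries
the path of authorities that issued it.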

Some distributed naming systems are hierarchical but have a fixed depth of the hierarchy. For example,
ISBN numbers have three parts: a country code (the country of registry for the publisher), the publisher
identifier, and, for each publisher, the document identifier. Each publisher is allowed to assign its own
ISBN numbers. Some naming systems are not distributed, but guarantee uniqueness by keeping a single
source of identifiers; for example, the Library of Congress Control Number is assigned uniquely by the
U.S. Library of Congress.

A random naming authority is one in which names are given out using random numbers; each authority
uses enough information to make the probability of two documents getting the same identifier quite small.
For example, some schemes use the one-way hash (MD5, SHA) of the document as the document
identifier. The LIFN system[24] uses a randomly assigned document identifier in this way.
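
Both probabilistic approaches are easy to illustrate with Python's standard library. The sketch below
substitutes SHA-256 for the MD5 and SHA hashes named above, which is an assumption made here rather than
something the schemes in the text specify, and it does not reproduce the identifier format of the LIFN
system; the function names are hypothetical.

```python
import hashlib
import secrets


def content_identifier(data: bytes) -> str:
    # One-way hash of the document: identical bytes always get the same name,
    # and two different documents collide only with negligible probability.
    return hashlib.sha256(data).hexdigest()


def random_identifier(num_bytes: int = 16) -> str:
    # Randomly assigned name: independent of the content, so a revised
    # document can keep its identifier if the application wants that.
    return secrets.token_hex(num_bytes)
```

The two choices answer the revision question differently: a content hash changes whenever a single byte
of the document changes, while a random identifier says nothing about the content at all.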

Given a name for an object, how does one go about finding information about that object? How much
information is packed into the name? For example, ISBN numbers give you some clue about who the
publisher is, and there is a global registry of publishers. If you can't find the document in your catalog, you
can check the publisher. On the other hand, the random schemes give no hints. Using URLs, the identifier
contains nearly complete information to access a resource across the global Internet. Usually, though, the
more information contained in the identifier, the harder it is for the resolution system to find objects
when they have moved.

2.2 MetaData

In document management, digital libraries and the web, it is common to want to record information about
documents that is not part of the documents themselves. These assertions are sometimes called 'document
attributes'; sometimes they are called 'metadata' to signify that they are data about data rather than the
information itself. Metadata assists in the description and organization of network information resources,
and in discovering and gaining access to them.

Metadata in Document Management

Most document management systems include mechanisms that permit at least the system administrator to
define, according to the application, a set of attributes that are common to the documents in a repository or
at least a variety of classes of documents. For example, many systems record the user identity of the
originator of the document, the date and time of origination, other information external to the documents
themselves, or some other attributes of the documents in the repository, as determined by the system
administrator. A law office might index its documents by the name of the client; a manufacturer, by the
product or parts codes affected within.

Metadata in Digital Libraries

Libraries have traditionally been quite concerned with cataloging -- a process which associates metadata
with bibliographic material. The card catalog entries for an item in the library provide metadata about the
item. There are a variety of standards used for online cataloging. The most prominent is USMARC.
Various attempts have been made to extend and enhance USMARC to deal with online material[25][26].
The Z39.50 standard contains extensive mechanisms for communicating both search parameters (requested
metadata) and document attributes (output metadata). More recently, attempts to define online document
standards for the humanities arrived at a standard set of metadata for humanities texts[28].

Metadata on the Internet

The Internet community has several efforts to define a set of metadata tags useful for information on the
network. For example, the Internet Anonymous FTP Archives working group of the Internet Engineering
Task Force attempted to set a standard for describing FTP-accessible data[29]. In fact, one could think of
the standard headers of an Internet electronic mail message as identifying attributes for each message[30].
Every Internet message has required attributes; for example, it must identify who it is "From" and "To" and
the "Date" it was sent. In addition, there are optional attributes, such as "Subject" and "Comments". There
are rules that specify the kinds of values each attribute can have.
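
Treating mail headers as document attributes can be tried directly with Python's standard email package.
The sketch below follows the required and optional attributes as they are listed above; it is not a
complete check against the mail standard, and the function name and sample addresses are hypothetical.

```python
from email import message_from_string

REQUIRED = ("From", "To", "Date")
OPTIONAL = ("Subject", "Comments")


def message_attributes(raw_message: str) -> dict:
    """Return the header attributes of a message, insisting on the required ones."""
    msg = message_from_string(raw_message)
    missing = [name for name in REQUIRED if msg[name] is None]
    if missing:
        raise ValueError("missing required attributes: " + ", ".join(missing))
    # Keep only attributes that are actually present in this message.
    return {name: msg[name] for name in REQUIRED + OPTIONAL if msg[name] is not None}
```

The result is an attribute-value record of the same flat kind that a document management system or a
catalog might keep, obtained here without reading the body of the message at all.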

The Uniform Resource Identifier working group[31] has been trying to develop a standard syntax and
representation for information citations in a scheme called Uniform Resource Citations (URCs). URCs are
intended to describe information on the Internet, both as a way of discovering or describing more about a
referenced resource (via URL or URN) before retrieving the item and as a way of cataloging Internet
information.

Issues in Metadata

There are a number of design issues in representing metadata for online information, some semantic (what
does it mean and how do you say it?), some structural (does metadata have structure?) and some syntactic
(how do the semantics and structure get represented as a sequence of characters or bytes?). These issues
span the three application areas.

Semantic issues

Are there well-known attributes? MARC takes a strong stand: MARC defines a set of well-known
attributes with descriptions of each. Some of them take on values within a controlled vocabulary. There are
standards for the completeness and quality of a catalog entry. The set of attributes is defined and used
universally by nearly all online library catalogs. In document management systems, on the other hand, the
system administrator for a workgroup generally establishes conventions for the attributes used and what
they mean. When multiple document management systems are brought together, though, combining the
semantics of the disparate sources is a serious problem. The Internet community is struggling with
standardization of semantics for attribute sets. While there are some attributes that are well-known (content
attributions in mail messages, mapping to ISO protocols in X.400), these are by no means universal.

If there is not a single well-known set of attributes that spans all known objects, then it is still possible to
create a system of entities -- classes of documents which share the same schema of attributes. For each
class, the attribute set can then be defined. For example, a document management system might allow for
'memo' and 'spreadsheet' and 'expense report'. Every memo might be catalogued by its distribution list,
while an expense report might be required to have a budget center and a signature status. More complex
schema systems allow for inheritance and specialization of classes, as is found in object-oriented
programming. There are variations among different implementations, just as there are in different
object-oriented programming systems.
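
A class-based attribute schema with inheritance can be sketched with Python dataclasses. The class and
field names below are taken from the memo and expense report example in the text where possible; the
shared base attributes (title, originator) and the helper schema() are assumptions added for the
illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class DocumentRecord:
    # Attributes assumed to be shared by every class of document.
    title: str
    originator: str


@dataclass
class Memo(DocumentRecord):
    distribution_list: tuple


@dataclass
class ExpenseReport(DocumentRecord):
    budget_center: str
    signature_status: str


def schema(document_class):
    """List the attribute names defined for a class, inherited ones first."""
    return [f.name for f in fields(document_class)]
```

Each specialized class inherits the common attributes and adds its own, so a search on "originator"
spans every class while a search on "budget_center" applies only to expense reports.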

Structural issues

Frequently it is difficult to tell the 'boundaries' of an online electronic work. If one describes a site's 'home
page', does the description apply to the site, or just to the introductory 'splash page'? If an object contains
parts, do the parts have separate attributes? For example, if a report in a document management system has
a cover memo, in what way are the author of the report and the author of the cover memo distinguished or
reported in the description of the overall object?

Metadata itself can also have structure. It is sometimes necessary and occasionally critical to know the
author of an attribute or the time when the attribute was assigned. If metadata itself can be updated and
revised, then the history of its editing may be of relevance. How does one distinguish between 'the title',
'the title, translated into French', and 'the title, translated into English from Italian by D.H. Lawrence'?
The relationships between elements of the metadata are problematic for some flat attribute-value
representation schemes like MARC.
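
One way to picture structured metadata is to attach qualifiers and provenance to every attribute value
instead of storing a flat attribute-value pair. The sketch below is only one possible shape for such a
record; the Assertion class, its field names and the select() helper are hypothetical and are not drawn
from MARC or from any system discussed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    attribute: str
    value: str
    asserted_by: str                        # who made the statement
    asserted_on: str                        # when it was made (ISO date string)
    language: Optional[str] = None          # language of the value
    translated_from: Optional[str] = None   # source language, if a translation
    translator: Optional[str] = None


def select(assertions, attribute, **qualifiers):
    """Return the assertions for an attribute that match every given qualifier."""
    return [
        a for a in assertions
        if a.attribute == attribute
        and all(getattr(a, name) == wanted for name, wanted in qualifiers.items())
    ]
```

With records of this kind, 'the title' and 'the title, translated into English from Italian' are two
assertions about the same attribute that differ only in their qualifiers.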

Syntactic and system issues with metadata

While it might seem straightforward, standardization of the syntactic mechanisms for representing the
semantics and structure of attributes is quite difficult. First, attributes might have a fixed, extensible, or
uncontrolled set of values. The mechanisms for assigning the allowable elements of the controlled set are
difficult to establish. Each attribute or field might need to deal with alternative syntaxes (e.g., for names, is
it last name first or given name first?), multiple character sets (names in Chinese or Arabic), or even
non-textual data.

2.3 Authentication, Authorization, Accounting (AAA) and Related Issues

There are several related issues having to do with security, rights, privacy, confidentiality and access that
arise in all of the application areas. Authentication is the process by which the identity of a person (or
system) is ascertained and assured. Authorization is the process of determining whether a given operation
is allowed, such as reading a document or updating metadata. Accounting is the process of recording
operations and the payment due for them. An audit trail of records of past operations might be kept, as a
way of checking the integrity of the system.
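
The three functions can be kept apart in code as well as in definition. The following sketch is a
deliberately small illustration, assuming a single in-memory controller; AccessController and its methods
are hypothetical names, and passwords are held in plain text only to keep the example short, which a
real system should not do.

```python
import hmac


class AccessController:
    """Hypothetical controller separating authentication, authorization, accounting."""

    def __init__(self):
        self._passwords = {}   # user -> password (plain text: sketch only)
        self._grants = set()   # (user, operation, doc_id) triples that are allowed
        self.audit_trail = []  # (user, operation, doc_id, allowed, charge) records

    def add_user(self, user, password):
        self._passwords[user] = password

    def grant(self, user, operation, doc_id):
        self._grants.add((user, operation, doc_id))

    def authenticate(self, user, password):
        # Authentication: is this really the user they claim to be?
        expected = self._passwords.get(user)
        return expected is not None and hmac.compare_digest(
            expected.encode(), password.encode())

    def perform(self, user, password, operation, doc_id, fee=0):
        # Authorization: is this operation allowed for this user and document?
        allowed = (self.authenticate(user, password)
                   and (user, operation, doc_id) in self._grants)
        # Accounting: record every attempt, and the charge if it went ahead.
        self.audit_trail.append((user, operation, doc_id, allowed, fee if allowed else 0))
        return allowed

    def amount_due(self, user):
        return sum(entry[4] for entry in self.audit_trail if entry[0] == user)
```

Because refused attempts are recorded alongside successful ones, the same audit trail serves both for
billing and for checking the integrity of the system after the fact.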

AAA in Document Management

In document management systems, the critical elements of AAA are concerned with managing the
permissions to access the information in a set of documents and maintaining the integrity of these
documents. Some documents are confidential, others are public, others belong to particular workgroups.
Most of the early work in authorization followed the military model of classified information and clearance
levels; this model has been found to be inappropriate for many non-military applications. Frequently, the
authorization system of the document management system is inadequate to represent and enforce the
company's access control needs; for example, the actual work practice in many organizations will relax
rules and guidelines in specific situations.

Despite the more complex needs, some document management systems rely on either their database
manager or the host network operating system to provide authentication and access control, if for no other
reason than to avoid providing separate authentication and administrative domains.

AAA in Digital Libraries

In the library setting, the requirements for AAA often focus on copyright, payment methods, and usage
rights; in addition, there is a significant concern for the privacy of the reader and information about what is
being read by whom. The situation is made more complex by the difficulties in interpreting copyright law
originally designed for physical material in a world of electronic reproduction and distribution. In many
countries, copyright law and the practice around it are being reexamined in the age of electronic
distribution.

http://larry.masinter.net/docweblib.html

