CU-CS-732-94

C. Mic Bowman
Peter B. Danzig
Darren R. Hardy
Udi Manber
Michael F. Schwartz

University of Colorado at Boulder

DEPARTMENT OF COMPUTER SCIENCE
`
`
`
`
Harvest: A Scalable, Customizable
Discovery and Access System

C. Mic Bowman
Peter B. Danzig
Darren R. Hardy
Udi Manber
Michael F. Schwartz

CU-CS-732-94

July 1994

University of Colorado at Boulder

Technical Report CU-CS-732-94
Department of Computer Science
Campus Box 430
University of Colorado
Boulder, Colorado 80309
`
`
`
ANY OPINIONS, FINDINGS, AND CONCLUSIONS OR RECOMMENDATIONS EXPRESSED IN THIS PUBLICATION ARE THOSE OF THE AUTHOR(S) AND DO NOT NECESSARILY REFLECT THE VIEWS OF THE AGENCIES NAMED IN THE ACKNOWLEDGMENTS SECTION.
`
`
`
Harvest: A Scalable, Customizable Discovery and Access System

C. Mic Bowman, Transarc Corp. (mic@transarc.com)
Peter B. Danzig, University of Southern California (danzig@usc.edu)
Darren R. Hardy, University of Colorado - Boulder (hardy@cs.colorado.edu)
Udi Manber, University of Arizona (udi@cs.arizona.edu)
Michael F. Schwartz, University of Colorado - Boulder (schwartz@cs.colorado.edu)

July 1994
`
Abstract

Rapid growth in data volume, user base, and data diversity render Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides a set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with Mosaic and with HTTP, FTP, and Gopher information resources. We discuss the design and implementation of each subsystem, and provide measurements indicating that Harvest can reduce server load, network traffic, and index space requirements significantly compared with previous indexing systems. We also discuss a half dozen indexes we have built using Harvest, underscoring both the customizability and scalability of the system.
`
`
`
1 Introduction
`
Over the past few years a progression of Internet publishing tools has appeared. Until 1992, FTP [43] and NetNews [39] were the principal publishing tools. Around 1992, Gopher [38] and WAIS [31] gained popularity because they simplified network interactions and provided better ways to navigate through information. With the introduction of Mosaic [2] in 1993, publishing information on the World Wide Web [3] gained widespread use, because of Mosaic's attractive graphical interface and ease of use for accessing multimedia data reachable via WWW links.

While Internet publishing has become easy and popular, making effective use of Internet-accessible information has become more difficult. As the volume of Internet-accessible information grows, it is increasingly difficult to locate relevant information. Moreover, current information systems experience serious server and network bottlenecks as a rapidly growing user populace attempts to access networked information. Finally, current systems primarily support text and graphics intended for end user viewing; they provide little support for more structured, complex data. For a more detailed discussion of these problems, the reader is referred to [10].

In this paper we introduce a system that addresses these problems using a variety of techniques. We call the system Harvest, to connote its focus on reaping the growing crop of Internet information. Harvest supports resource discovery through topic-specific content indexing made possible by a very efficient distributed information gathering architecture. It resolves bottlenecks through topology-adaptive index replication and object caching. Finally, Harvest supports structured data through a combination of structure-preserving indexes, flexible search engines, and data type-specific manipulation and integration mechanisms. Because it is highly customizable, Harvest can be used in many different situations.
`
The remainder of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we present the Harvest system, including a variety of performance measurements. In Section 4 we offer several demonstrations of Harvest, including WWW pointers where readers can try these demonstrations. In Section 5 we discuss work in progress, and in Section 6 we summarize Harvest's contributions to the state of resource discovery.
`
`2 Related Work
`
While it is impossible to discuss all related work, we touch on some of the better-known efforts here.
`
Resource Discovery

Because of the difficulty of keeping a large information space organized, the labor intensity of traversing large information systems, and the subjective nature of organization, many resource discovery systems create indexes of network-accessible information.

Many of the early indexing tools fall into one of two categories: file/menu name indexes of widely distributed information (such as Archie [18], Veronica [19], or WWWW [37]); and full content indexes of individual sites (such as Gifford's Semantic File System [21], WAIS [31], and local Gopher [38] indexes). Name-only indexes are very space efficient, but support limited queries. For example, it is only possible to query Archie for "graphics packages" whose file names happen to reflect their contents. Moreover, global flat indexes are less useful as the information space grows (causing queries to match too much information). In contrast, full content indexes support powerful queries but require so much space that they cannot hold more than a few sites' worth of data. Harvest achieves space-efficient content indexes of widely distributed data.

Recently, a number of efforts have been initiated to create content indexes of widely distributed sites (e.g., Gifford's Content Router [45] and Pinkerton's WebCrawler [40]). One of the goals of Harvest is to provide a flexible and efficient system upon which such systems can be built. We discuss this point in Section 3.1.
`
The WHOIS and Network Information Look Up Service Working Group in the Internet Engineering Task Force is defining an Internet standard called "WHOIS++", which gathers concise descriptions (called "centroids") of each indexed database [46]. In contrast to our approach, WHOIS++ does not provide an automatic data gathering architecture, nor many of the scaling features (such as caching and replication) that Harvest provides. However, Harvest can be used to build WHOIS++.

Many of the ideas and some of the code for Harvest were derived from our previous work on the agrep string search tool [47], the Essence customized information extraction system [26], the Indie distributed indexing system [15], and the Univers attribute-based name service [12].
`
Caching and Replication

A great deal of caching and replication research has been carried out in the operating systems community over the past 20 years (e.g., the Andrew File System [28] and ISIS [5]). More recently, a number of efforts have begun to build object caches into HTTP servers (e.g., Lagoon [41]), and to support replication [22] in Internet information systems such as Archie.

In contrast to the flat organization of existing Internet caches, Harvest supports a hierarchical architecture of object caches, modeled after the Domain Naming System's caching architecture. We believe this is the most appropriate arrangement because of a simulation study we performed using NSFNET backbone trace data [14]. We also note that one of the biggest scaling problems facing the Andrew File System is its callback-based invalidation protocol, the cost of which would be reduced if caches were arranged hierarchically.
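The hierarchical arrangement can be sketched in a few lines. The following is an illustrative model only (the class and method names are ours, not Harvest's; the real cache is a full implementation with TTLs and disk storage): a miss at one cache level is referred to the parent cache before falling back to the origin server.

```python
# Illustrative sketch of a DNS-style cache hierarchy (names are ours, not
# Harvest's). A miss at one level is referred to the parent cache; only
# the root cache contacts the origin server.

class HierCache:
    def __init__(self, parent=None):
        self.parent = parent          # next cache up the hierarchy, or None
        self.store = {}               # URL -> cached object

    def fetch(self, url, origin_fetch):
        if url in self.store:                     # hit at this level
            return self.store[url]
        if self.parent is not None:               # miss: ask the parent
            obj = self.parent.fetch(url, origin_fetch)
        else:                                     # root: go to the origin
            obj = origin_fetch(url)
        self.store[url] = obj                     # fill caches on the way down
        return obj

# A site-level cache whose parent is a regional cache:
regional = HierCache()
site = HierCache(parent=regional)
calls = []
def origin(url):
    calls.append(url)
    return "body-of-" + url

site.fetch("http://example.edu/a", origin)
site.fetch("http://example.edu/a", origin)        # second fetch is a cache hit
print(len(calls))   # 1: the origin server was contacted only once
```

Arranged this way, a popular object crosses the wide area once per subtree rather than once per client, which is the load reduction the simulation study argues for.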
Current replication systems (such as USENET news [39] and the Archie replication system) require replicas to be placed and configured manually. Harvest's approach is to derive the replica configuration automatically, adapting to measured physical network changes. We believe this approach is more appropriate in the large, dynamic Internet environment.
`
Structured Data Support

At present there is essentially no support for structured data in Internet information systems. When Mosaic encounters an object with internal structure (such as a relational database or some time series data), it simply asks the user where to save the file on the local disk, for later perusal outside of Mosaic. The database community has done a great deal of work with more structured distributed information systems, but to date has fielded no widespread systems on the Internet. Similarly, a number of object-oriented systems have been developed (e.g., CORBA [23] and OLE [29]), but have yet to be deployed on the Internet at large.

At present, Harvest supports structured data through the use of attribute-value structured indexes. We are currently extending the system to support a more object-oriented paradigm to index and access complex data.
`
`
`
3 The Harvest System

To motivate the Harvest design, Figure 1 illustrates two important inefficiencies experienced by most of the indexing systems described in Section 2. In this figure and the remainder of the paper, a Provider indicates a server running one or more of the standard Internet information services (FTP, Gopher, and HTTP). Bold boxes indicate excessive load being placed on servers, primarily because the indexing systems gather information from Providers that fork a separate process for each retrieved object. Similarly, bold edges indicate excessive traffic being placed on network links, primarily because the indexing systems retrieve entire objects and then discard most of their contents (for example, retaining only HTML anchors and links in the index). These inefficiencies are compounded by the fact that current indexing systems gather information independently, without coordinating the effort among each other.
`
Figure 1: Inefficiencies Caused by Uncoordinated Gathering of Information from Object Retrieval Systems
`
Figure 2 illustrates the Harvest approach to information gathering. A Gatherer collects indexing information from a Provider, while a Broker provides an indexed query interface to the gathered information. Brokers retrieve information from one or more Gatherers or other Brokers, and incrementally update their indexes.

Gatherers and Brokers can be arranged in various ways to achieve much more flexible and efficient use of the network and servers than the approach illustrated in Figure 1. The leftmost part of Figure 2 shows a Gatherer accessing a Provider from across the network (using the native FTP, Gopher, or HTTP protocols). This arrangement mimics the configuration shown in Figure 1, and is primarily useful for interoperating with systems that do not run the Harvest software. It experiences the same inefficiencies as discussed above, although the gathered information can be shared by many Brokers.

In the second part of Figure 2, the Gatherer is run at the Provider site. As suggested by the lack of bold boxes and lines in this part of the figure, doing this can save a great deal of server load and network traffic. We discuss the techniques used by the Gatherer to achieve these efficiency gains in Section 3.1.
Figure 2: Harvest Information Gathering Approach
`
The ellipses in Figure 2 indicate that a Broker can collect information from many Gatherers (to build an index of widely distributed information) and that a Gatherer can feed information to many Brokers (thereby saving repeated gathering costs). By retrieving information from other Brokers (illustrated in the rightmost part of Figure 2), Brokers can also cascade indexed views from one another, using the Broker's query interface to filter/refine the information from one Broker to the next.

The final element illustrated in Figure 2 is the Harvest Server Registry (HSR). This is a distinguished Broker instance that registers information about each Harvest Gatherer, Broker, Cache, and Replicator in the Internet. The HSR is useful when constructing new Gatherers and Brokers, to avoid duplication of effort. It can also be used when looking for an appropriate Broker at search time.
`
In effect, this Gatherer/Broker design provides an efficient network pipe facility, allowing information to be gathered, transmitted, and shared by many different types of indexes, with a great deal of flexibility in how information is filtered and interconnected between Providers and Brokers. Efficiency and flexibility are important because we want to enable the construction of many different topic-specific Brokers. For example, one Broker might index scientific papers for microbiologists, while another Broker indexes PC software archives. By focusing index contents per topic and per community, a Broker can avoid many of the vocabulary [20] and scaling problems of unfocused global indexes (such as Archie and WWWW).

Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF) [9], intended to be easily parsed yet sufficiently expressive to handle many kinds of objects. It delineates streams of object summaries, and allows for multiple levels of nested detail. SOIF is based on a combination of the Internet Anonymous FTP Archives (IAFA) IETF Working Group templates [17] and BibTeX [33]. Each template contains a type, a Uniform Resource Locator (URL) [4], and a list of byte-count delimited field name/field value pairs. We define a set of mandatory and recommended attributes for Harvest system components.

For example, attributes for a Broker describe the server's administrator, location, software version, and the type of objects it contains. An example SOIF record is illustrated in Figure 3.¹
`
@DOCUMENT{ ftp://ftp.cs.psu.edu/pub/techreps/TR123.ps
Title{28}: Type Structured File Systems
Author-Name{33}: C. Dharap, R. Balay and M. Bowman
Author-Organization-Type{11}: educational
Author-Department{32}: Computer Science and Engineering
Publication-Status{19}: Accepted, To appear.
Keywords{42}: types object-oriented semantic filesystems
Access-Protocol-v*{3}: FTP
Description{457}: The paper describes the design of Nebula. Nebula is a dynamic,
object based typed file system. Nebula uses types to create well
structured files as well as to export existing files in the internet
environment. This allows for backward scalability. Files in Nebula are
abstractions and are defined by objects. Nebula assumes a flat global
namespace and operations are provided to manipulate logical namespaces.
}

Figure 3: Example SOIF Template
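The byte-count delimited layout above makes SOIF simple to read without quoting rules: the brace count says exactly how many bytes of value follow, so values may contain newlines or binary data. A minimal reader for this attribute syntax might look as follows (illustrative only; a real SOIF parser also handles the @TYPE{ URL ... } template framing):

```python
# Minimal sketch of parsing SOIF's byte-count delimited attributes
# ("Name{len}: value"). Illustrative only; real SOIF streams also carry
# the template framing (@TYPE{ URL ... }) around these attribute lists.
import re

ATTR = re.compile(r'([\w*.-]+)\{(\d+)\}:\s')

def parse_soif_attrs(text):
    attrs = {}
    pos = 0
    while True:
        m = ATTR.search(text, pos)
        if m is None:
            break
        name, length = m.group(1), int(m.group(2))
        start = m.end()
        attrs[name] = text[start:start + length]   # exactly `length` bytes
        pos = start + length                       # skip past the value
    return attrs

body = ("Title{28}: Type Structured File Systems\n"
        "Author-Organization-Type{11}: educational\n")
attrs = parse_soif_attrs(body)
print(attrs["Title"])   # Type Structured File Systems
```

Because the length prefix, not a terminator, ends each value, the parser never has to guess where a multi-line Description stops.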
`
Figure 4 illustrates the Harvest architecture in more detail, showing the Gatherer/Broker interconnection arrangements and SOIF protocol, the internal components of the Broker (discussed in Section 3.2), and the caching, replication, and client interface subsystems.

The Replicator can be used to replicate servers, to enhance user-base scalability. For example, the HSR will likely become heavily replicated, since it acts as a point of first contact for searches and new server deployment efforts. The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.

The Object Cache reduces network load, server load, and response latency when accessing located information objects. Additionally, it will be used for caching access methods in the future, when we incorporate an object-oriented access mechanism into Harvest. At present Harvest simply retrieves and displays objects using the standard "hard coded" Mosaic procedures, but we are extending the system to allow publishers to associate access methods with objects (see Section 5). Given this support, it will be possible to perform much more flexible display and access operations. We discuss each of the Harvest subsystems below.
`
3.1 The Gatherer Subsystem

As indexing the Web becomes increasingly popular, two types of problems arise: data collection inefficiencies and duplication of coding effort. Harvest addresses these problems by providing an efficient and flexible system upon which to build many different indexes. Rather than building one indexer from scratch to collect WAIS headlines, another to index Computer Science technical reports, etc., Harvest can be configured in various ways to produce all of these indexes, at great savings of server load, network traffic, and index space.

¹While the attributes in this example are limited to ASCII values, SOIF allows arbitrary binary data.

[Figure 4 depicts the Broker's internal modules (Registry, Storage Manager, Query Manager, Collector and Indexer) alongside the Gatherer, Provider, Replication Manager, and Object Cache; a Client (1) searches a Broker, then (2) retrieves the object and its access methods through the Object Cache.]

Figure 4: Harvest System Architecture
`
Efficient Indexing Support

Koster offers a number of guidelines for constructing Web "robots" [32]. For example, he suggests using breadth-first rather than depth-first traversal, to minimize rapid-fire requests to a single server. Rather than working around the existing shortcomings, we believe a better approach is to provide a much more efficient index collection system.

The biggest inefficiency in current Web indexing tools arises from collecting indexing information from systems designed to support object retrieval. For example, to build an index of HTML documents, each document must be retrieved from a server and scanned for HTML links. This causes a great deal of load, because each object retrieval results in creating a TCP connection, forking a UNIX process, changing directories several levels deep, transmitting the object, and terminating the connection. An index of FTP file names can be built more efficiently using recursive directory listing operations, but this still requires an expensive file system traversal.
The Gatherer dramatically reduces these inefficiencies through the use of Provider site-resident software optimized for indexing. This software scans the objects periodically and maintains a cache of indexing information, so that separate traversals are not required for each request.² More importantly, it allows a Provider's indexing information to be retrieved in a single stream, rather than requiring separate requests for each object. This results in a huge reduction in server load. The combination of traversal caching and response streaming can reduce server load by several orders of magnitude.

²Some FTP sites provide a similar optimization by placing a file containing the results of a recent recursive directory listing in a high-level directory. The Gatherer uses these files when they are available.
As a simple quantification of these savings, we measured the time taken to gather information three different ways: via individual FTP retrievals, via individual local file system retrievals, and from the Gatherer's cache. To focus these measurements on server load, we collected only empty files and used the network loopback interface in the case where data would normally go across the network. For a more precise quantification we could have measured CPU, memory, and disk resource usage, but measuring elapsed time provides a first order approximation of overall load for each of the gathering methods. We collected the measurements over 1,000 iterations on an otherwise unloaded DEC Alpha workstation. Our results indicate that gathering via the local file system causes 6.99 times less load than gathering via FTP. More importantly, retrieving an entire stream of records from the Gatherer's cache incurs 2.72 times less server load than retrieving a single object via FTP (and 2.81 times less load for retrieving all objects at a site, because of a common-case optimization we implemented that keeps a compressed cache of this information around). Thus, assuming an average archive site size of 2,300 files,³ serving indexing information via a Gatherer will reduce server load per data collection attempt by a factor of 6,429. Collecting all objects at the site increases this reduction to 6,631, because of the aforementioned common-case cache.
`
The Gatherer also reduces network traffic substantially, based on three techniques. First, it uses the Essence system [26] to extract content summaries before passing the data to a remote Indexer. Essence uses type-specific procedures to extract the most relevant parts of documents as content summaries (for example, extracting author and title information from LaTeX documents, and routine names from executable files). Hence, where current robots might retrieve a set of HTML documents and extract anchors and URLs from each, Essence extracts content summaries at the Provider site, often reducing the amount of data to be transmitted by a factor of 20-50 or more.⁴ Beyond generating content summaries at the Provider site, the Gatherer also allows Brokers to request only those summaries that have changed since a specified time, and transmits the response stream in a compressed form.⁵ The combination of pre-filtering, incremental updates, and compression can reduce network traffic by several orders of magnitude.
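The incremental-update check just described (timestamps where the access protocol supplies them, MD5 checksums where it does not) can be sketched as follows; the function name and parameters here are illustrative, not the Gatherer's actual interface:

```python
# Sketch of the Gatherer's update check (illustrative names): use the
# object's modification time when the access protocol supplies one, and
# fall back to comparing an MD5 checksum of the content for protocols
# without timestamps.
import hashlib

def needs_update(obj_bytes, mtime, since, old_md5):
    if mtime is not None:              # protocol supplies timestamps
        return mtime > since
    digest = hashlib.md5(obj_bytes).hexdigest()
    return digest != old_md5           # no timestamp: compare checksums

# Timestamped object, unchanged since the last gathering run:
print(needs_update(b"...", mtime=100, since=200, old_md5=None))       # False
# No timestamp available: a content change is caught by the checksum.
old = hashlib.md5(b"version 1").hexdigest()
print(needs_update(b"version 2", mtime=None, since=0, old_md5=old))   # True
```

Only objects for which this check returns true need to appear in the compressed response stream, which is what makes the updates incremental.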
As an example of these savings, our content index of Computer Science technical reports (see Section 4) starts with 2.44 GB of data distributed around the Internet, and extracts 111 MB worth of content summaries. It then transmits these data as a 41 MB compressed stream to requesting Brokers, for a 59.4-fold network transmission savings over collecting the documents and building a full-text index. The network transmission savings are actually higher still if one considers the number of network links that must be traversed by each byte of data. For the above example, we used traceroute [30] to count hops to each Provider site. In the measured example, the average hop count was 18, magnifying the importance of network transmission savings. This savings corresponds to running local Gatherers at each Provider site, rather than having to pull full objects across the
³We computed this average from measurements of the total number of sites and files indexed by Archie [44].
⁴The reduction factor depends on data type. For example, the reduction for PostScript files is quite high, while the reduction for HTML files is fairly low.
⁵For access methods that do not support time stamps (like the current version of HTTP), the Gatherer extracts and compares MD5 cryptographic checksums of each object before transmitting updates.
`
`
`
`
`
`
`
[Figure 5 depicts the Essence pipeline: customized type recognition, presentation unnesting, candidate selection, and summarizing methods applied to example files (README, doc/essence.1, doc/Essence.ms, doc/Essence.ps, src/pstext/pstext/dvi.c), producing SOIF records containing keywords from README, the usage line from essence.1, and the author, title, and abstract fields from Essence.ms.]

Figure 5: Example of Essence Customized Information Extraction
`
Internet before summarizing at a central site. Note that these computations do not include the additional savings from doing incremental updates.
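As a rough check of the arithmetic in this example (using the rounded megabyte figures quoted above, so the results only approximate the reported values):

```python
# Rough check of the transmission-savings figures quoted above, using the
# rounded sizes from the text (so the ratios only approximate the paper's
# reported 59.4-fold figure).
full_data_mb = 2.44 * 1000     # ~2.44 GB of source documents
summaries_mb = 111             # Essence content summaries
compressed_mb = 41             # compressed stream sent to Brokers

extraction_ratio = full_data_mb / summaries_mb
transmission_savings = full_data_mb / compressed_mb
print(round(extraction_ratio, 1))       # ~22.0, within the 20-50x range cited
print(round(transmission_savings, 1))   # ~59.5 with these rounded inputs
```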
While one might suspect that Essence's use of content summaries causes the resulting indexes to lose useful keywords, our measurements indicate that Essence achieves nearly the same precision⁶ and 70% of the recall of WAIS, but requires only 3-11% as much index space and 70% as much summarizing and indexing time as WAIS [27]. For users who need the added recall and precision, however, Essence also supports a full-text extraction option.
`
Flexibility of Index Collection

In addition to reducing the amount of indexing data that must be transmitted across the network, Essence's summarizing mechanism permits a great deal of flexibility in how information is extracted. Site administrators can customize how object types are recognized (typically by examining file naming conventions and contents); which objects get indexed (e.g., excluding executable code for which there are corresponding source files); and how information is extracted from each type of object (e.g., extracting title and author information from bibliographic databases, and copyright strings from executable programs). Essence can also extract files with arbitrarily deep presentation-layer nesting (e.g., compressed "tar" files). An example of Essence processing is illustrated in Figure 5.
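The customization hooks just described amount to a type-recognition step followed by a per-type extraction step. A toy sketch of that dispatch (the type names and extraction rules below are invented for illustration; Essence's real recognizers and summarizers are configurable site by site):

```python
# Toy sketch of Essence-style dispatch: recognize a type from naming
# conventions, then apply that type's summarizer. The types and rules
# below are invented for illustration, not Essence's actual tables.

def recognize_type(name):
    if name.endswith(".tex"):
        return "latex"
    if name.endswith((".c", ".h")):
        return "csource"
    return "text"

def summarize_latex(text):
    # keep only title/author lines
    return [ln for ln in text.splitlines()
            if ln.startswith(("\\title", "\\author"))]

def summarize_text(text):
    return text.splitlines()[:5]          # first few lines as a summary

SUMMARIZERS = {
    "latex": summarize_latex,
    "csource": lambda text: [],           # e.g., extract routine names here
    "text": summarize_text,
}

def summarize(name, text):
    return SUMMARIZERS[recognize_type(name)](text)

print(summarize("paper.tex", "\\title{Harvest}\nbody text\n\\author{Bowman}"))
# ['\\title{Harvest}', '\\author{Bowman}']
```

A site administrator customizes behavior by editing the recognition rules and the per-type table, which is exactly the shape of flexibility the paragraph above describes.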
⁶Intuitively, precision is the probability that all of the retrieved documents will be relevant, while recall is the probability that all of the relevant documents will be retrieved [7].

The Gatherer allows objects to be specified either individually (which we call "LeafNodes") or through type-specific recursive enumeration (which we call "RootNodes"). For example, it enumerates FTP objects using a recursive directory listing, and enumerates HTTP objects by recursively extracting the URL links that point to other HTML objects within the same server as the original RootNode URL.⁷
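The HTTP RootNode enumeration just described, restricted to the RootNode's server and capped at a maximum number of LeafNode URLs, can be sketched as a breadth-first walk. Here fetch(url) is a stand-in for an HTTP retrieval, and the default cap value is invented for illustration:

```python
# Sketch of HTTP RootNode enumeration (illustrative): a breadth-first walk
# that follows only links on the same server as the RootNode URL, stopping
# at a configurable cap on LeafNode URLs. fetch(url) is a stand-in for an
# HTTP retrieval returning the page's HTML.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def enumerate_root(root_url, fetch, max_leaves=250):
    host = urlparse(root_url).netloc
    seen, queue = {root_url}, deque([root_url])
    while queue and len(seen) < max_leaves:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)              # stays within the server
                queue.append(absolute)
    return seen

pages = {
    "http://a.edu/": '<a href="/x.html">x</a><a href="http://b.org/y">y</a>',
    "http://a.edu/x.html": '<a href="/">home</a>',
}
leaves = enumerate_root("http://a.edu/", lambda u: pages.get(u, ""))
print(sorted(leaves))   # ['http://a.edu/', 'http://a.edu/x.html']
```

The same-server test is what keeps a misconfigured Gatherer from wandering off across the Web, and the cap bounds the enumeration even within one large server.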
The combination of Gatherer enumeration support and content extraction flexibility allows the site administrator to specify a Gatherer's configuration very easily. Given this specification, the system automatically enumerates, gathers, unnests, and summarizes a wide variety of information resources, generating content summaries that range in detail from full text to highly abridged and specialized "signatures". The Gatherer also allows users to include manually edited information (such as hand-chosen keywords or taxonometric classification references) along with the Essence-extracted information. This allows users to improve index quality for data that warrants the added manual labor. For example, Harvest can incorporate site markup records specified using IETF-specified "IAFA" templates [17]. The Gatherer stores the manual and automatically generated information separately, so that periodic automated summarizing will not overwrite manually created information.
`
The other type of flexibility supported by the Gatherer is in the placement of Gatherer processes. Ideally, a Gatherer will be placed at each Provider site, to support efficient data gathering. Because this requires cooperation from remote sites, we also allow Gatherers to be run remotely, accessing objects using FTP/Gopher/HTTP. As an optimization, we are currently building a translator that will translate FTP/Gopher/HTTP URLs to local file system references if the Gatherer is run locally. Doing so will provide a factor of 6.99 reduction in server load while gathering locally, as indicated by the server load measurements discussed above.
`
Because remote gathering imposes so much load, we implemented a connection caching mechanism, which attempts to keep connections open across individual object retrievals.⁸ When the available UNIX file descriptors are exhausted (or time out with inactivity), connections are closed according to a replacement policy. This change resulted in a 700% performance improvement for FTP transfers.
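The connection cache can be sketched as a least-recently-used pool. The class name and the connect/close callbacks below are illustrative; Harvest's implementation manages real FTP control connections and UNIX file descriptors:

```python
# Sketch of the connection-caching mechanism (illustrative): keep open
# connections in a least-recently-used pool and evict the oldest when the
# descriptor budget is exhausted. connect/close stand in for opening and
# closing real FTP control connections.
from collections import OrderedDict

class ConnectionCache:
    def __init__(self, max_open, connect, close):
        self.max_open = max_open
        self.connect, self.close = connect, close
        self.pool = OrderedDict()              # host -> open connection

    def get(self, host):
        if host in self.pool:
            self.pool.move_to_end(host)        # reuse: mark recently used
            return self.pool[host]
        if len(self.pool) >= self.max_open:    # descriptors exhausted
            _, oldest = self.pool.popitem(last=False)
            self.close(oldest)                 # replacement policy: drop LRU
        self.pool[host] = self.connect(host)
        return self.pool[host]

opened, closed = [], []
cache = ConnectionCache(2,
                        connect=lambda h: opened.append(h) or h,
                        close=lambda c: closed.append(c))
cache.get("ftp.a.edu"); cache.get("ftp.b.edu")
cache.get("ftp.a.edu")        # reused: no new connection opened
cache.get("ftp.c.edu")        # pool full: least recently used (b) is closed
print(opened, closed)   # ['ftp.a.edu', 'ftp.b.edu', 'ftp.c.edu'] ['ftp.b.edu']
```

Amortizing connection setup across retrievals is what yields the large FTP speedup reported above, since each avoided setup saves a TCP handshake and a forked server process.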
`
3.2 The Broker Subsystem

As illustrated in Figure 4, the Broker consists of four software modules: the Collector, the Registry, the Storage Manager, and the Query Manager.

The Registry and Storage Manager maintain the authoritative list of summary objects that exist in the Broker. For each object, the Registry records a unique identifier and a time-to-live. The unique object identifier (OID) consists of the resource's URL plus information about the Gatherer that generated the summary object. The Storage Manager archives a collection of summary objects on disk, storing each as a file in the underlying file system.

The Collector periodically requests updates from each Gatherer or Broker specified in its configuration file. The Gatherer responds with a list of object summaries to be created, removed, or updated, based on a timestamp passed from the requesting Broker. The Collector parses the response and forwards it to the Registry for processing.

⁷This restriction prevents a misconfigured Gatherer from traversing huge parts of the Web. As a second limitation, Gatherers refuse to enumerate RootNode URLs beyond a (configurable) maximum number of resulting LeafNode URLs.

⁸At present this is only possible with the FTP control connection. Gopher and HTTP close connections after each retrieval.
`
`
`
`
When an object is created, an OID is added to the Registry, the summary object is archived by the Storage Manager, and a handle is given to the Indexer. Finally, the Registry is written to disk to complete the transaction. When an object is destroyed, the OID is removed and the summary object is deleted by the Storage Manager and Indexer. When an object is refreshed, its time-to-live is recomputed. If the time-to-live expires for an object, the object is removed from the Broker. Since the state of the Indexer and Storage Manager may become inconsistent with the Registry, periodic garbage collection removes any unreferenced objects from the Storage Manager and Indexer.
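The lifecycle above (create, refresh, expire, garbage-collect) can be sketched with a small time-to-live table. The class name and the OID encoding below are invented for illustration, and the real Registry also writes itself to disk to complete each transaction:

```python
# Sketch of the Registry's time-to-live bookkeeping (illustrative names
# and OID encoding; the real Registry also journals itself to disk to
# complete each transaction).
class Registry:
    def __init__(self, default_ttl):
        self.default_ttl = default_ttl
        self.objects = {}                 # OID -> expiry time

    def create(self, oid, now):
        self.objects[oid] = now + self.default_ttl

    def refresh(self, oid, now):
        if oid in self.objects:           # refresh recomputes the TTL
            self.objects[oid] = now + self.default_ttl

    def sweep(self, now):
        """Garbage collection: remove objects whose time-to-live expired."""
        expired = [oid for oid, exp in self.objects.items() if exp <= now]
        for oid in expired:
            del self.objects[oid]         # Storage Manager/Indexer cleanup
        return expired

reg = Registry(default_ttl=100)
reg.create("ftp://a.edu/x+gatherer1", now=0)
reg.create("ftp://a.edu/y+gatherer1", now=0)
reg.refresh("ftp://a.edu/y+gatherer1", now=90)
print(reg.sweep(now=120))   # only the unrefreshed OID has expired
```

An object that a Gatherer keeps refreshing thus stays in the Broker indefinitely, while one that disappears from its Provider ages out within one time-to-live period.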
`
The Query Manager exports objects to the network. It accepts a query, translates it into an intermediate representation, and passes it to the search engine. The search engine responds with a list of OIDs and some search engine-specific data for each. For example, one of our search engines (Glimpse; see Section 3.3) returns the matching attribute for each summary object in the result list. The Query Manager constructs a short object description from the OID, the search engine-specific data, and the summary object. It then returns the list of object descriptions to the client.

A Broker can gather objects directly from another Broker using a bulk transfer protocol within the Query Manager. When a query is sent to the source Broker, instead of returning the usual short description, the Query Manager returns the entire summary object.

The Query Manager supports remote administration of the Broker's configuration. Several Broker parameters can be manipulated, including the list of Gatherers to contact, the frequency of contact with a Gatherer, the default time-to-live for summary objects, and the frequency of garbage collection.
`
3.3 The Index and Search Subsystem

To accommodate diverse indexing and searching needs, Harvest defines a general Broker-Indexer interface that can accommodate a variety of "backend" search engines. The principal requirements are that the backend support boolean combinations of attribute-based queries, and that it support incremental updates. One could therefore use a variety of different backends inside a Broker, such as WAIS or Ingres (although some versions of WAIS do not support all the needed features).

We have developed two particular search and index subsystems for Harvest, each optimized for different uses. Glimpse [35] supports space-efficient indexes and flexible interactive queries, while Nebula [11] supports fast searches and complex standing queries that scan the data on a regular basis and extract relevant information.
`
Glimpse Index/Search Subsystem

The main novelty of Glimpse is that it requires a very small index (as low as 2-4% of the size of the data), yet allows very flexible content queries, including the use of Boolean expressions, regular expressions, and approximate matching (i.e., allowing misspellings). Another advantage of Glimpse is that the index can be built quickly and modified incrementally.

The index used by Glimpse is similar in principle to inverted indexes, with one additional feature. The main part of the index consists of a list of all words that appear in all files. For each word there is a pointer to a list of occurrences of the word. But instead of pointing to the exact occurrence, Glimpse points to an adjustable-sized area that contains that occurrence. This area can be simply the file that contains the word, a group of files, or perhaps a paragraph.⁹ By combining pointers for several occurrences of each word and by minimizing the size of the pointers (since it is not necessary to store the exact location), the index becomes rather small. This allows