`
`CU-CS-732-94
`
`C. Mic Bowman
`Peter B. Danzig
`Darren R. Hardy
`Udi Manber
`Michael F. Schwartz
`
University of Colorado at Boulder
`
`
`
`
`
`
`DEPARTMENT OF COMPUTER SCIENCE
`
`
`
`
`Harvest: A Scalable, Customizable
`Discovery and Access System
`
`C. Mic Bowman
`Peter B. Danzig
`Darren R. Hardy
`Udi Manber
`Michael F. Schwartz
`
`CU-CS-732-94
`
`July 1994
`
`
`University of Colorado at Boulder
`
`Technical Report CU-CS-732-94
`Department of Computer Science
Campus Box 430
University of Colorado
Boulder, Colorado 80309
`
`
`
ANY OPINIONS, FINDINGS, AND CONCLUSIONS OR RECOMMENDATIONS
EXPRESSED IN THIS PUBLICATION ARE THOSE OF THE AUTHOR(S) AND DO NOT
NECESSARILY REFLECT THE VIEWS OF THE AGENCIES NAMED IN THE
ACKNOWLEDGMENTS SECTION.
`
`
`
`Harvest: A Scalable, Customizable Discovery and Access System
`
C. Mic Bowman
Transarc Corp.
mic@transarc.com

Peter B. Danzig
University of Southern California
danzig@usc.edu

Darren R. Hardy
University of Colorado - Boulder
hardy@cs.colorado.edu

Udi Manber
University of Arizona
udi@cs.arizona.edu

Michael F. Schwartz
University of Colorado - Boulder
schwartz@cs.colorado.edu
`
`July 1994
`
`Abstract
`
`Rapid growth in data volume, user base, and data diversity render Internet-accessible in-
`formation increasingly difficult to use effectively. In this paper we introduce Harvest, a system
`that provides a set of customizable tools for gathering information from diverse repositories,
`building topic-specific content indexes, flexibly searching the indexes, widely replicating them,
`and caching objects as they are retrieved across the Internet. The system interoperates with
`Mosaic and with HTTP, FTP, and Gopher information resources. We discuss the design and
`implementation of each subsystem, and provide measurements indicating that Harvest can re-
`duce server load, network traffic, and index space requirements significantly compared with
previous indexing systems. We also discuss a half dozen indexes we have built using Harvest,
`underscoring both the customizability and scalability of the system.
`
`
`
`1
`
`Introduction
`
`Over the past few years a progression of Internet publishing tools have appeared. Until 1992,
`FTP [43] and NetNews [39] were the principal publishing tools. Around 1992, Gopher [38] and
`WAIS [31] gained popularity because they simplified network interactions and provided better
`ways to navigate through information. With the introduction of Mosaic [2] in 1993, publishing
`information on the World Wide Web [3] gained widespread use, because of Mosaic’s attractive
graphical interface and ease of use for accessing multimedia data reachable via WWW links.
While Internet publishing has become easy and popular, making effective use of Internet-accessible
information has become more difficult. As the volume of Internet-accessible information grows, it is
increasingly difficult to locate relevant information. Moreover, current information systems experience
serious server and network bottlenecks as a rapidly growing user populace attempts to access networked
information. Finally, current systems primarily support text and graphics intended for end user viewing;
they provide little support for more structured, complex data. For a more detailed discussion of these
problems, the reader is referred to [10].
In this paper we introduce a system that addresses these problems using a variety of techniques.
We call the system Harvest, to connote its focus on reaping the growing crop of Internet information.
Harvest supports resource discovery through topic-specific content indexing made possible by a very
efficient distributed information gathering architecture. It resolves bottlenecks through
topology-adaptive index replication and object caching. Finally, Harvest supports structured data
through a combination of structure-preserving indexes, flexible search engines, and data type-specific
manipulation and integration mechanisms. Because it is highly customizable, Harvest can be used in many
different situations.
The remainder of the paper is organized as follows. In Section 2 we discuss related work. In
Section 3 we present the Harvest system, including a variety of performance measurements. In
Section 4 we offer several demonstrations of Harvest, including WWW pointers where readers can
try these demonstrations. In Section 5 we discuss work in progress, and in Section 6 we summarize
Harvest's contributions to the state of resource discovery.
`
`2 Related Work
`
While it is impossible to discuss all related work, we touch on some of the better-known efforts here.
`
`Resource Discovery
`
Because of the difficulty of keeping a large information space organized, the labor intensity of
traversing large information systems, and the subjective nature of organization, many resource
discovery systems create indexes of network-accessible information.
Many of the early indexing tools fall into one of two categories: file/menu name indexes of
widely distributed information (such as Archie [18], Veronica [19], or WWWW [37]); and full
content indexes of individual sites (such as Gifford's Semantic File System [21], WAIS [31], and
local Gopher [38] indexes). Name-only indexes are very space efficient, but support limited queries.
For example, it is only possible to query Archie for "graphics packages" whose file names happen to
reflect their contents. Moreover, global, flat indexes are less useful as the information space grows
(causing queries to match too much information). In contrast to full content indexes that support
`
`
`
powerful queries but require so much space that they cannot hold more than a few sites' worth of
data, Harvest achieves space-efficient content indexes of widely distributed data.
Recently, a number of efforts have been initiated to create content indexes of widely distributed
sites (e.g., Gifford's Content Router [45] and Pinkerton's WebCrawler [40]). One of the goals of
Harvest is to provide a flexible and efficient system upon which such systems can be built. We
discuss this point in Section 3.1.
The WHOIS and Network Information Look Up Service Working Group in the Internet Engineering
Task Force is defining an Internet standard called "WHOIS++", which gathers concise descriptions
(called "centroids") of each indexed database [46]. In contrast to our approach, WHOIS++ does not
provide an automatic data gathering architecture, nor many of the scaling features (such as caching
and replication) that Harvest provides. However, Harvest can be used to build WHOIS++.
`Many of the ideas and some of the code for Harvest were derived from our previous work on
`the agrep string search tool [47], the Essence customized information extraction system [26], the
`Indie distributed indexing system [15], and the Univers attribute-based nameservice [12].
`
`Caching and Replication
`
A great deal of caching and replication research has been carried out in the operating systems
community over the past 20 years (e.g., the Andrew File System [28] and ISIS [5]). More recently,
a number of efforts have begun to build object caches into HTTP servers (e.g., Lagoon [41]), and
to support replication [22] in Internet information systems such as Archie.
In contrast to the flat organization of existing Internet caches, Harvest supports a hierarchical
architecture of object caches, modeled after the Domain Naming System's caching architecture.
We believe this is the most appropriate arrangement because of a simulation study we performed
using NSFNET backbone trace data [14]. We also note that one of the biggest scaling problems
facing the Andrew File System is its callback-based invalidation protocol, which would be reduced
if caches were arranged hierarchically.
Current replication systems (such as USENET news [39] and the Archie replication system)
`require replicas to be placed and configured manually. Harvest’s approach is to derive the replica
`configuration automatically, adapting to measured physical network changes. We believe this ap-
`proach is more appropriate in the large, dynamic Internet environment.
`
`Structured Data Support
`
At present there is essentially no support for structured data in Internet information systems. When
Mosaic encounters an object with internal structure (such as a relational database or some time
series data), it simply asks the user where to save the file on the local disk, for later perusal outside
of Mosaic. The database community has done a great deal of work with more structured distributed
information systems, but to date has fielded no widespread systems on the Internet. Similarly, a
number of object-oriented systems have been developed (e.g., CORBA [23] and OLE [29]), but have
yet to be deployed on the Internet at large.
`At present, Harvest supports structured data through the use of attribute-value structured
`indexes. We are currently extending the system to support a more object-oriented paradigm to
`index and access complex data.
`
`
`
`3 The Harvest System
`
To motivate the Harvest design, Figure 1 illustrates two important inefficiencies experienced by
most of the indexing systems described in Section 2. In this figure and the remainder of the paper,
a Provider indicates a server running one or more of the standard Internet information services
(FTP, Gopher, and HTTP). Bold boxes indicate excessive load being placed on servers, primarily
because the indexing systems gather information from Providers that fork a separate process for
each retrieved object. Similarly, bold edges indicate excessive traffic being placed on network
links, primarily because the indexing systems retrieve entire objects and then discard most of their
contents (for example, retaining only HTML anchors and links in the index). These inefficiencies are
compounded by the fact that current indexing systems gather information independently, without
coordinating the effort among each other.
`
`
`
`
`
`Figure 1: Inefficiencies Caused by Uncoordinated Gathering of Information from Object Retrieval
`Systems
`
`Figure 2 illustrates the Harvest approach to information gathering. A Gatherer collects in-
`dexing information from a Provider, while a Broker provides an indexed query interface to the
`gathered information. Brokers retrieve information from one or more Gatherers or other Brokers,
`and incrementally update their indexes.
`Gatherers and Brokers can be arranged in various ways to achieve much more flexible and
`efficient use of the network and servers than the approach illustrated in Figure 1. The leftmost
`part of Figure 2 shows a Gatherer accessing a Provider from across the network (using the native
FTP, Gopher, or HTTP protocols). This arrangement mimics the configuration shown in Figure 1,
and is primarily useful for interoperating with systems that do not run the Harvest software.
It experiences the same inefficiencies as discussed above, although the gathered information can
be shared by many Brokers.
In the second part of Figure 2, the Gatherer is run at the Provider site. As suggested by the
lack of bold boxes and lines in this part of the figure, doing this can save a great deal of server load
`and network traffic. We discuss the techniques used by the Gatherer to achieve these efficiency
`gains in Section 3.1.
`
`
`
`
`
`
`
`
`
`
`
`
`Figure 2: Harvest Information Gathering Approach
`
The ellipses in Figure 2 indicate that a Broker can collect information from many Gatherers
(to build an index of widely distributed information) and that a Gatherer can feed information
to many Brokers (thereby saving repeated gathering costs). By retrieving information from other
Brokers (illustrated in the rightmost part of Figure 2), Brokers can also cascade indexed views from
one another, using the Broker's query interface to filter/refine the information from one Broker to
the next.
`The final element illustrated in Figure 2 is the Harvest Server Registry (HSR). This is a dis-
`tinguished Broker instance that registers information about each Harvest Gatherer, Broker, Cache,
`and Replicator in the Internet. The HSR is useful when constructing new Gatherers and Brokers,
`to avoid duplication of effort. It can also be used when looking for an appropriate Broker at search
`time.
In effect, this Gatherer/Broker design provides an efficient network pipe facility, allowing
information to be gathered, transmitted, and shared by many different types of indexes, with a
great deal of flexibility in how information is filtered and interconnected between Providers and
Brokers. Efficiency and flexibility are important because we want to enable the construction of
many different topic-specific Brokers. For example, one Broker might index scientific papers for
microbiologists, while another Broker indexes PC software archives. By focusing index contents
per topic and per community, a Broker can avoid many of the vocabulary [20] and scaling problems
of unfocused global indexes (such as Archie and WWWW).
Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary
Object Interchange Format (SOIF) [9], intended to be easily parsed yet sufficiently expressive
to handle many kinds of objects. It delineates streams of object summaries, and allows for multiple
levels of nested detail. SOIF is based on a combination of the Internet Anonymous FTP Archives
(IAFA) IETF Working Group templates [17] and BibTeX [33]. Each template contains a type,
a Uniform Resource Locator (URL) [4], and a list of byte-count delimited field name/field value
pairs. We define a set of mandatory and recommended attributes for Harvest system components.
`
`
`
For example, attributes for a Broker describe the server's administrator, location, software version,
and the type of objects it contains. An example SOIF record is illustrated in Figure 3.[1]
`
@DOCUMENT{ ftp://ftp.cs.psu.edu/pub/techreps/TR123.ps
Title{28}: Type Structured File Systems
Author-Name{33}: C. Dharap, R. Balay and M. Bowman
Author-Organization-Type{11}: educational
Author-Department{32}: Computer Science and Engineering
Publication-Status{19}: Accepted, To appear.
Keywords{42}: types object-oriented semantic filesystems
Access-Protocol{3}: FTP
Description{457}: The paper describes the design of Nebula. Nebula is a dynamic,
object based typed file system. Nebula uses types to create well
structured files as well as to export existing files in the internet
environment. This allows for backward scalability. Files in Nebula are
abstractions and are defined by objects. Nebula assumes a flat global
namespace and operations are provided to manipulate logical namespaces.
}
`
`Figure 3: Example SOIF Template
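As a concrete illustration of the byte-count delimited framing, the following is a minimal parsing sketch for a single template of this form. It is our own simplification (one header line, ASCII values, no nesting), not Harvest's actual parser; the helper name and example URL are hypothetical.

```python
def parse_soif(text):
    """Parse one SOIF-style template: an '@TYPE{ url' header line
    followed by 'Attribute{byte-count}: value' pairs.  The byte count
    delimits the value exactly, so values may contain newlines."""
    header, _, body = text.partition("\n")
    ttype, _, url = header.lstrip("@").partition("{")
    record = {"Template-Type": ttype.strip(), "URL": url.strip(" }")}
    pos = 0
    while True:
        brace = body.find("{", pos)
        if brace == -1:
            break
        name = body[pos:brace].strip()
        close = body.index("}", brace)
        count = int(body[brace + 1:close])
        start = body.index(":", close) + 2   # skip over '}: '
        record[name] = body[start:start + count]
        pos = start + count                  # next attribute follows the value
    return record

# Hypothetical two-attribute record, framed the same way as Figure 3.
record = parse_soif(
    "@DOCUMENT{ ftp://example.org/tr/TR123.ps\n"
    "Title{5}: Hello\n"
    "Keywords{10}: alpha beta\n")
```

Because the byte counts, not line boundaries, delimit the values, a stream of such templates can be parsed without any quoting or escaping of the summary text.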
`
`Figure 4 illustrates the Harvest architecture in more detail, showing the Gatherer/Broker in-
`terconnection arrangements and SOIF protocol, the internal components of the Broker (discussed
`in Section 3.2), and the caching, replication, and client interface subsystems.
The Replicator can be used to replicate servers, to enhance user-base scalability. For example,
the HSR will likely become heavily replicated, since it acts as a point of first contact for searches and
new server deployment efforts. The Replication subsystem can also be used to divide the gathering
process among many servers (e.g., letting one server index each U.S. regional network), distributing
the partial updates among the replicas.
The Object Cache reduces network load, server load, and response latency when accessing
located information objects. Additionally, it will be used for caching access methods in the future,
when we incorporate an object-oriented access mechanism into Harvest. At present Harvest simply
retrieves and displays objects using the standard "hard coded" Mosaic procedures, but we are
extending the system to allow publishers to associate access methods with objects (see Section 5).
Given this support, it will be possible to perform much more flexible display and access operations.
We discuss each of the Harvest subsystems below.
`
`3.1 The Gatherer Subsystem
`
`As indexing the Web becomes increasingly popular, two types of problems arise: data collection
`inefficiencies and duplication of coding effort. Harvest addresses these problems by providing an
`efficient and flexible system upon which to build many different indexes. Rather than building
`
[1] While the attributes in this example are limited to ASCII values, SOIF allows arbitrary binary data.
`
`
`
`
`
`
`Figure 4: Harvest System Architecture
`
`one indexer from scratch to collect WAIS headlines, another to index Computer Science technical
`reports, etc., Harvest can be configured in various ways to produce all of these indexes, at great
`savings of server load, network traffic, and index space.
`
`Efficient Indexing Support
`
Koster offers a number of guidelines for constructing Web "robots" [32]. For example, he suggests
using breadth-first rather than depth-first traversal, to minimize rapid-fire requests to a single
server. Rather than working around the existing shortcomings, we believe a better approach is to
provide a much more efficient index collection system.
The biggest inefficiency in current Web indexing tools arises from collecting indexing information
from systems designed to support object retrieval. For example, to build an index of HTML
documents, each document must be retrieved from a server and scanned for HTML links. This
causes a great deal of load, because each object retrieval results in creating a TCP connection,
forking a UNIX process, changing directories several levels deep, transmitting the object, and
terminating the connection. An index of FTP file names can be built more efficiently using recursive
directory listing operations, but this still requires an expensive file system traversal.
The Gatherer dramatically reduces these inefficiencies through the use of Provider site-resident
software optimized for indexing. This software scans the objects periodically and maintains a
cache of indexing information, so that separate traversals are not required for each request.[2] More
importantly, it allows a Provider's indexing information to be retrieved in a single stream, rather
than requiring separate requests for each object. This results in a huge reduction in server load.
The combination of traversal caching and response streaming can reduce server load by several
orders of magnitude.

[2] Some FTP sites provide a similar optimization by placing a file containing the results of a recent
recursive directory listing in a high-level directory. The Gatherer uses these files when they are available.
As a simple quantification of these savings, we measured the time taken to gather information
three different ways: via individual FTP retrievals, via individual local file system retrievals, and
from the Gatherer's cache. To focus these measurements on server load, we collected only empty
files and used the network loopback interface in the case where data would normally go across
the network. For a more precise quantification we could have measured CPU, memory, and disk
resource usage, but measuring elapsed time provides a first order approximation of overall load
for each of the gathering methods. We collected the measurements over 1,000 iterations on an
otherwise unloaded DEC Alpha workstation. Our results indicate that gathering via the local
file system causes 6.99 times less load than gathering via FTP. More importantly, retrieving an
entire stream of records from the Gatherer's cache incurs 2.72 times less server load than retrieving
a single object via FTP (and 2.81 times less load for retrieving all objects at a site, because of
a common-case optimization we implemented that keeps a compressed cache of this information
around). Thus, assuming an average archive site size of 2,300 files,[3] serving indexing information
via a Gatherer will reduce server load per data collection attempt by a factor of 6,429. Collecting
all objects at the site increases this reduction to 6,631, because of the aforementioned common-case
cache.
The Gatherer also reduces network traffic substantially, based on three techniques. First, it
uses the Essence system [26] to extract content summaries before passing the data to a remote
Indexer. Essence uses type-specific procedures to extract the most relevant parts of documents as
content summaries, for example extracting author and title information from LaTeX documents,
and routine names from executable files. Hence, where current robots might retrieve a set of
HTML documents and extract anchors and URLs from each, Essence extracts content summaries
at the Provider site, often reducing the amount of data to be transmitted by a factor of 20-50
or more.[4] Beyond generating content summaries at the Provider site, the Gatherer also allows
Brokers to request only those summaries that have changed since a specified time, and transmits
the response stream in a compressed form.[5] The combination of pre-filtering, incremental updates,
and compression can reduce network traffic by several orders of magnitude.
As an example of these savings, our content index of Computer Science technical reports (see
Section 4) starts with 2.44 GB of data distributed around the Internet, and extracts 111 MB worth
of content summaries. It then transmits these data as a 41 MB compressed stream to requesting
Brokers, for a 59.4-fold network transmission savings over collecting the documents and building
a full-text index. The network transmission savings are actually higher still if one considers the
number of network links that must be traversed by each byte of data. For the above example, we
used traceroute [30] to count hops to each Provider site. In the measured example, the average hop
count was 18, magnifying the importance of network transmission savings. This savings corresponds
to running local Gatherers at each Provider site, rather than having to pull full objects across the
`
`
[3] We computed this average from measurements of the total number of sites and files indexed by Archie [44].
[4] The reduction factor depends on data type. For example, the reduction for PostScript files is quite high,
while the reduction for HTML files is fairly low.
[5] For access methods that do not support time stamps (like the current version of HTTP), the Gatherer
extracts and compares "MD5" cryptographic checksums of each object before transmitting updates.
`
`
`
[Figure 5 diagram: Essence.tar.Z flows through Type Recognition, Candidate Selection, Unnesting,
and Summarizing stages, each driven by customized methods, producing SOIF records containing:
keywords from README; the usage line from essence.1; and author, title, and abstract fields from
Essence.ms.]
`Figure 5: Example of Essence Customized Information Extraction
`
`Internet before summarizing at a central site. Note that these computations do not include the
`additional savings from doing incremental updates.
While one might suspect that Essence's use of content summaries causes the resulting indexes to
lose useful keywords, our measurements indicate that Essence achieves nearly the same precision[6]
and 70% the recall of WAIS, but requires only 3-11% as much index space and 70% as much
summarizing and indexing time as WAIS [27]. For users who need the added recall and precision,
however, Essence also supports a full-text extraction option.
`
`Flexibility of Index Collection
`
In addition to reducing the amount of indexing data that must be transmitted across the network,
Essence's summarizing mechanism permits a great deal of flexibility in how information is extracted.
Site administrators can customize how object types are recognized (typically by examining file
naming conventions and contents); which objects get indexed (e.g., excluding executable code for
which there are corresponding source files); and how information is extracted from each type of
object (e.g., extracting title and author information from bibliographic databases, and copyright
strings from executable programs). Essence can also extract files with arbitrarily deep presentation-layer
nesting (e.g., compressed "tar" files). An example of Essence processing is illustrated in
Figure 5.
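The customization hooks described above amount to a table of per-type recognizers and extraction procedures. The sketch below illustrates the idea; the file types, naming patterns, and extraction rules are illustrative stand-ins of ours, not Essence's actual method set.

```python
import re

# Illustrative type recognition by file-naming convention, standing in
# for Essence's customizable recognizers (which may also examine contents).
TYPE_PATTERNS = [
    (re.compile(r"README$"), "README"),
    (re.compile(r"\.1$"), "manpage"),
    (re.compile(r"\.tex$"), "LaTeX"),
]

def recognize(filename):
    for pattern, ftype in TYPE_PATTERNS:
        if pattern.search(filename):
            return ftype
    return None  # unrecognized objects are excluded from the index

# Per-type summarizers: README files keep their words as keywords, manual
# pages keep only the first (usage) line, LaTeX keeps \title and \author.
SUMMARIZERS = {
    "README": lambda text: {"Keywords": " ".join(sorted(set(text.split())))},
    "manpage": lambda text: {"Usage": text.splitlines()[0]},
    "LaTeX": lambda text: {
        field: m.group(1)
        for field, m in (("Title", re.search(r"\\title\{([^}]*)\}", text)),
                         ("Author", re.search(r"\\author\{([^}]*)\}", text)))
        if m
    },
}

def summarize(filename, text):
    """Recognize the object's type, then apply that type's extractor."""
    ftype = recognize(filename)
    if ftype is None:
        return None
    summary = SUMMARIZERS[ftype](text)
    summary["Type"] = ftype
    return summary

summary = summarize("doc/paper.tex", "\\title{Nebula}\n\\author{C. Bowman}\nbody")
```

A site administrator customizes the system by editing the two tables, without touching the traversal or transmission machinery.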
The Gatherer allows objects to be specified either individually (which we call "LeafNodes")
or through type-specific recursive enumeration (which we call "RootNodes"). For example, it
enumerates FTP objects using a recursive directory listing, and enumerates HTTP objects by
recursively extracting the URL links that point to other HTML objects within the same server as
the original RootNode URL.[7]

[6] Intuitively, precision is the probability that all of the retrieved documents will be relevant, while
recall is the probability that all of the relevant documents will be retrieved [7].
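RootNode-style HTTP enumeration can be sketched as a breadth-first traversal with a same-server restriction and a cap on resulting LeafNodes. This is our own sketch, not Harvest code; fetch_links stands in for an HTML link extractor, and the cap default is arbitrary.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def enumerate_rootnode(root_url, fetch_links, max_leafnodes=100):
    """Breadth-first enumeration of objects reachable from root_url,
    restricted to the same server and capped at max_leafnodes URLs.
    fetch_links(url) must return the href values found in that page."""
    server = urlparse(root_url).netloc
    seen = {root_url}
    queue = deque([root_url])
    leafnodes = []
    while queue and len(leafnodes) < max_leafnodes:
        url = queue.popleft()
        leafnodes.append(url)
        for href in fetch_links(url):
            target = urljoin(url, href)       # resolve relative links
            if target not in seen and urlparse(target).netloc == server:
                seen.add(target)
                queue.append(target)
    return leafnodes

# Hypothetical two-page site; links to b.edu are outside the server
# boundary and are therefore not enumerated.
site = {
    "http://a.edu/": ["/x.html", "http://b.edu/y.html"],
    "http://a.edu/x.html": ["/"],
}
urls = enumerate_rootnode("http://a.edu/", lambda u: site.get(u, []))
```

The same-server test and the LeafNode cap correspond to the two safeguards mentioned in the footnote: together they keep a misconfigured Gatherer from wandering across the Web.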
`The combination of Gatherer enumeration support and content extraction flexibility allows the
`site administrator to specify a Gatherer’s configuration very easily. Given this specification, the
`system automatically enumerates, gathers, unnests, and summarizes a wide variety of information
`resources, generating content summaries that range in detail from full-text to highly abridged and
`specialized “signatures”. The Gatherer also allows users to include manually edited information
`(such as hand-chosen keywords or taxonometric classification references) along with the Essence-
`extracted information. This allows users to improve index quality for data that warrants the added
`manual labor.. For example, Harvest can incorporate site markup records specified using IETF-
`specified “IAFA” templates [17]. The Gatherer stores the manual and automatically generated
`information separately, so that periodic automated summarizing will not overwrite manually created
`information.
The other type of flexibility supported by the Gatherer is in the placement of Gatherer processes.
Ideally, a Gatherer will be placed at each Provider site, to support efficient data gathering. Because
this requires cooperation from remote sites, we also allow Gatherers to be run remotely, accessing
objects using FTP/Gopher/HTTP. As an optimization, we are currently building a translator that
will translate FTP/Gopher/HTTP URLs to local file system references if the Gatherer is run locally.
Doing so will provide a factor of 6.99 reduction in server load while gathering locally, as indicated
by the server load measurements discussed above.
Because remote gathering imposes so much load, we implemented a connection caching mechanism,
which attempts to keep connections open across individual object retrievals.[8] When the
available UNIX file descriptors are exhausted (or timed out with inactivity), connections are closed
according to a replacement policy. This change resulted in a 700% performance improvement for
FTP transfers.
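A minimal sketch of such a connection cache follows, using least-recently-used replacement as one plausible policy (the text does not specify which policy Harvest uses); dial and close stand in for real FTP connection setup and teardown.

```python
from collections import OrderedDict

class ConnectionCache:
    """Keep connections open across retrievals, closing the least
    recently used one when the file-descriptor budget is exhausted."""
    def __init__(self, dial, close, max_open=4):
        self.dial, self.close, self.max_open = dial, close, max_open
        self.open = OrderedDict()             # host -> connection, LRU order

    def get(self, host):
        if host in self.open:
            self.open.move_to_end(host)       # mark most recently used
            return self.open[host]
        if len(self.open) >= self.max_open:   # descriptor budget exhausted:
            victim, conn = self.open.popitem(last=False)
            self.close(conn)                  # evict least recently used
        self.open[host] = self.dial(host)
        return self.open[host]

# Hypothetical usage with stub connections and a budget of two descriptors.
closed = []
cache = ConnectionCache(dial=lambda h: "conn-" + h,
                        close=closed.append, max_open=2)
cache.get("a"); cache.get("b"); cache.get("a"); cache.get("c")
```

A production version would also need the inactivity timeout mentioned above, which this sketch omits.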
`
3.2 The Broker Subsystem
`
As illustrated in Figure 4, the Broker consists of four software modules: the Collector, the Registry,
the Storage Manager, and the Query Manager.
The Registry and Storage Manager maintain the authoritative list of summary objects that exist
in the Broker. For each object, the Registry records a unique identifier and a time-to-live. The
unique object identifier (OID) consists of the resource's URL plus information about the Gatherer
that generated the summary object. The Storage Manager archives a collection of summary objects
on disk, storing each as a file in the underlying file system.
The Collector periodically requests updates from each Gatherer or Broker specified in its configuration
file. The Gatherer responds with a list of object summaries to be created, removed,
or updated, based on a timestamp passed from the requesting Broker. The Collector parses the
response and forwards it to the Registry for processing.
`
`"This restriction prevents a misconfigured Gatherer from traversing huge parts of the Web. As a second limitation,
`Gatherers refuse to enumerate RootNode URLs beyond a (configurable) maximum number of resulting LeafNode
`URLs.
`® At present this is only possible with the FTP control connection. Gopher and HTTP close connections after each
`retrieval.
`
`10
`
`
`
When an object is created, an OID is added to the Registry, the summary object is archived
by the Storage Manager, and a handle is given to the Indexer. Finally, the Registry is written
to disk to complete the transaction. When an object is destroyed, the OID is removed and the
summary object is deleted by the Storage Manager and Indexer. When an object is refreshed, its
time-to-live is recomputed. If the time-to-live expires for an object, the object is removed from
the Broker. Since the state of the Indexer and Storage Manager may become inconsistent with the
Registry, periodic garbage collection removes any unreferenced objects from the Storage Manager
and Indexer.
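The create/refresh/expire cycle can be sketched as follows. The class and method names are ours, not the Broker's actual interface, and explicit timestamps replace wall-clock time for clarity.

```python
class Registry:
    """Track object summaries by OID with a time-to-live, mirroring the
    create / refresh / expire cycle described above."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.expires = {}           # oid -> expiration time

    def create(self, oid, now):
        self.expires[oid] = now + self.ttl

    refresh = create                # refreshing recomputes the time-to-live

    def expire(self, now):
        """Remove objects whose time-to-live has passed; the Storage
        Manager and Indexer drop these OIDs at garbage collection."""
        dead = [oid for oid, t in self.expires.items() if t <= now]
        for oid in dead:
            del self.expires[oid]
        return dead

# Hypothetical OIDs combining a URL with the originating Gatherer's name.
reg = Registry(ttl=10)
reg.create("url1+gathererA", now=0)
reg.create("url2+gathererA", now=5)
reg.refresh("url1+gathererA", now=8)   # url1 now expires at time 18
expired = reg.expire(now=16)           # url2 (expires at 15) is removed
```

In the real Broker the Registry is also written to disk on each transaction, so that the authoritative object list survives restarts.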
The Query Manager exports objects to the network. It accepts a query, translates it into an
intermediate representation, and passes it to the search engine. The search engine responds with a
list of OIDs and some search engine-specific data for each. For example, one of our search engines
(Glimpse; see Section 3.3) returns the matching attribute for each summary object in the result list.
The Query Manager constructs a short object description from the OID, the search engine-specific
data, and the summary object. It then returns the list of object descriptions to the client.
A Broker can gather objects directly from another Broker using a bulk transfer protocol within
the Query Manager. When a query is sent to the source Broker, instead of returning the usual
short description, the Query Manager returns the entire summary object.
The Query Manager supports remote administration of the Broker's configuration. Several
Broker parameters can be manipulated, including the list of Gatherers to contact, the frequency of
contact with a Gatherer, the default time-to-live for summary objects, and the frequency of garbage
collection.
`
3.3 The Index and Search Subsystem
`
To accommodate diverse indexing and searching needs, Harvest defines a general Broker-Indexer
interface that can accommodate a variety of "backend" search engines. The principal requirements
are that the backend support boolean combinations of attribute-based queries, and that it support
incremental updates. One could therefore use a variety of different backends inside a Broker, such
as WAIS or Ingres (although some versions of WAIS do not support all the needed features).
We have developed two particular search and index subsystems for Harvest, each optimized for
different uses. Glimpse [35] supports space-efficient indexes and flexible interactive queries, while
Nebula [11] supports fast searches and complex standing queries that scan the data on a regular
basis and extract relevant information.
`
`Glimpse Index/Search Subsystem
`
`The main novelty of Glimpse is that it requires a very small index (as low as 2-4% of the size of
`the data), yet allows very flexible content queries, including the use of Boolean expressions, regular
`expressions, and approximate matching (i.e., allowing misspelling). Another advantage of Glimpse
`is that the index can be built quickly and modified incrementally.
The index used by Glimpse is similar in principle to inverted indexes, with one additional
feature. The main part of the index consists of a list of all words that appear in all files. For
each word there is a pointer to a list of occurrences of the word. But instead of pointing to the
exact occurrence, Glimpse points to an adjustable-sized area that contains that occurrence. This
area can be simply the file that contains the word, a group of files, or perhaps a paragraph.[9] By
combining pointers for several occurrences of each word and by minimizing the size of the pointers
(since it is not necessary to store the exact location), the index becomes rather small. This allows
`Glimpse to use agrep [47] to search the index, and then search the ar