Harvest: A Scalable, Customizable Discovery and Access System

C. Mic Bowman
Peter B. Danzig
Darren R. Hardy
Udi Manber
Michael F. Schwartz

CU-CS-732-94

July 1994

University of Colorado at Boulder
Technical Report CU-CS-732-94
Department of Computer Science
Campus Box 430
University of Colorado
Boulder, Colorado 80309

ANY OPINIONS, FINDINGS, AND CONCLUSIONS OR RECOMMENDATIONS EXPRESSED IN THIS PUBLICATION ARE THOSE OF THE AUTHOR(S) AND DO NOT NECESSARILY REFLECT THE VIEWS OF THE AGENCIES NAMED IN THE ACKNOWLEDGMENTS SECTION.

Harvest: A Scalable, Customizable Discovery and Access System

C. Mic Bowman (Transarc Corp., mic@transarc.com)
Peter B. Danzig (University of Southern California, danzig@usc.edu)
Darren R. Hardy (University of Colorado - Boulder, hardy@cs.colorado.edu)
Udi Manber (University of Arizona, udi@cs.arizona.edu)
Michael F. Schwartz (University of Colorado - Boulder, schwartz@cs.colorado.edu)

July 1994

Abstract

Rapid growth in data volume, user base, and data diversity renders Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides a set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with Mosaic and with HTTP, FTP, and Gopher information resources. We discuss the design and implementation of each subsystem, and provide measurements indicating that Harvest can reduce server load, network traffic, and index space requirements significantly compared with previous indexing systems. We also discuss a half dozen indexes we have built using Harvest, underscoring both the customizability and scalability of the system.
1 Introduction

Over the past few years a progression of Internet publishing tools has appeared. Until 1992, FTP [43] and NetNews [39] were the principal publishing tools. Around 1992, Gopher [38] and WAIS [31] gained popularity because they simplified network interactions and provided better ways to navigate through information. With the introduction of Mosaic [2] in 1993, publishing information on the World Wide Web [3] gained widespread use, because of Mosaic's attractive graphical interface and ease of use for accessing multimedia data reachable via WWW links.

While Internet publishing has become easy and popular, making effective use of Internet-accessible information has become more difficult. As the volume of Internet-accessible information grows, it is increasingly difficult to locate relevant information. Moreover, current information systems experience serious server and network bottlenecks as a rapidly growing user populace attempts to access networked information. Finally, current systems primarily support text and graphics intended for end user viewing; they provide little support for more structured, complex data. For a more detailed discussion of these problems, the reader is referred to [10].

In this paper we introduce a system that addresses these problems using a variety of techniques. We call the system Harvest, to connote its focus on reaping the growing crop of Internet information. Harvest supports resource discovery through topic-specific content indexing made possible by a very efficient distributed information gathering architecture. It resolves bottlenecks through topology-adaptive index replication and object caching. Finally, Harvest supports structured data through a combination of structure-preserving indexes, flexible search engines, and data type-specific manipulation and integration mechanisms. Because it is highly customizable, Harvest can be used in many different situations.

The remainder of the paper is organized as follows. In Section 2 we discuss related work. In Section 3 we present the Harvest system, including a variety of performance measurements. In Section 4 we offer several demonstrations of Harvest, including WWW pointers where readers can try these demonstrations. In Section 5 we discuss work in progress, and in Section 6 we summarize Harvest's contributions to the state of resource discovery.
`
2 Related Work

While it is impossible to discuss all related work, we touch on some of the better-known efforts here.

Resource Discovery

Because of the difficulty of keeping a large information space organized, the labor intensity of traversing large information systems, and the subjective nature of organization, many resource discovery systems create indexes of network-accessible information.

Many of the early indexing tools fall into one of two categories: file/menu name indexes of widely distributed information (such as Archie [18], Veronica [19], or WWWW [37]); and full content indexes of individual sites (such as Gifford's Semantic File System [21], WAIS [31], and local Gopher [38] indexes). Name-only indexes are very space efficient, but support limited queries. For example, it is only possible to query Archie for "graphics packages" whose file names happen to reflect their contents. Moreover, global flat indexes are less useful as the information space grows (causing queries to match too much information). In contrast to full content indexes that support powerful queries but require so much space that they cannot hold more than a few sites' worth of data, Harvest achieves space-efficient content indexes of widely distributed data.

Recently, a number of efforts have been initiated to create content indexes of widely distributed sites (e.g., Gifford's Content Router [45] and Pinkerton's WebCrawler [40]). One of the goals of Harvest is to provide a flexible and efficient system upon which such systems can be built. We discuss this point in Section 3.1.

The WHOIS and Network Information Look Up Service Working Group in the Internet Engineering Task Force is defining an Internet standard called "WHOIS++", which gathers concise descriptions (called "centroids") of each indexed database [46]. In contrast to our approach, WHOIS++ does not provide an automatic data gathering architecture, nor many of the scaling features (such as caching and replication) that Harvest provides. However, Harvest can be used to build WHOIS++.

Many of the ideas and some of the code for Harvest were derived from our previous work on the agrep string search tool [47], the Essence customized information extraction system [26], the Indie distributed indexing system [15], and the Univers attribute-based name service [12].

Caching and Replication

A great deal of caching and replication research has been carried out in the operating systems community over the past 20 years (e.g., the Andrew File System [28] and ISIS [5]). More recently, a number of efforts have begun to build object caches into HTTP servers (e.g., Lagoon [41]), and to support replication [22] in Internet information systems such as Archie.

In contrast to the flat organization of existing Internet caches, Harvest supports a hierarchical architecture of object caches, modeled after the Domain Naming System's caching architecture. We believe this is the most appropriate arrangement because of a simulation study we performed using NSFNET backbone trace data [14]. We also note that one of the biggest scaling problems facing the Andrew File System is its callback-based invalidation protocol, which would be reduced if caches were arranged hierarchically.

Current replication systems (such as USENET news [39] and the Archie replication system) require replicas to be placed and configured manually. Harvest's approach is to derive the replica configuration automatically, adapting to measured physical network changes. We believe this approach is more appropriate in the large, dynamic Internet environment.

Structured Data Support

At present there is essentially no support for structured data in Internet information systems. When Mosaic encounters an object with internal structure (such as a relational database or some time series data), it simply asks the user where to save the file on the local disk, for later perusal outside of Mosaic. The database community has done a great deal of work with more structured distributed information systems, but to date has fielded no widespread systems on the Internet. Similarly, a number of object-oriented systems have been developed (e.g., CORBA [23] and OLE [29]), but have yet to be deployed on the Internet at large.

At present, Harvest supports structured data through the use of attribute-value structured indexes. We are currently extending the system to support a more object-oriented paradigm to index and access complex data.
`
3 The Harvest System

To motivate the Harvest design, Figure 1 illustrates two important inefficiencies experienced by most of the indexing systems described in Section 2. In this figure and the remainder of the paper, a Provider indicates a server running one or more of the standard Internet information services (FTP, Gopher, and HTTP). Bold boxes indicate excessive load being placed on servers, primarily because the indexing systems gather information from Providers that fork a separate process for each retrieved object. Similarly, bold edges indicate excessive traffic being placed on network links, primarily because the indexing systems retrieve entire objects and then discard most of their contents (for example, retaining only HTML anchors and links in the index). These inefficiencies are compounded by the fact that current indexing systems gather information independently, without coordinating the effort among each other.
`
[Figure 1: Inefficiencies Caused by Uncoordinated Gathering of Information from Object Retrieval Systems]
`
Figure 2 illustrates the Harvest approach to information gathering. A Gatherer collects indexing information from a Provider, while a Broker provides an indexed query interface to the gathered information. Brokers retrieve information from one or more Gatherers or other Brokers, and incrementally update their indexes.

Gatherers and Brokers can be arranged in various ways to achieve much more flexible and efficient use of the network and servers than the approach illustrated in Figure 1. The leftmost part of Figure 2 shows a Gatherer accessing a Provider from across the network (using the native FTP, Gopher, or HTTP protocols). This arrangement mimics the configuration shown in Figure 1, and is primarily useful for interoperating with systems that do not run the Harvest software. This arrangement experiences the same inefficiencies as discussed above, although the gathered information can be shared by many Brokers in this arrangement.

In the second part of Figure 2, the Gatherer is run at the Provider site. As suggested by the lack of bold boxes and lines in this part of the figure, doing this can save a great deal of server load and network traffic. We discuss the techniques used by the Gatherer to achieve these efficiency gains in Section 3.1.
`

`

`
`
`
`l
`
`l
`
`l
`
`Broker
`
`Gatherer
`
`
`
`
`
`l
`
`l
`
`I
`
`'
`
`I
`
`'
`
`
`
`
`
`
`
`Broker
`
`Gatherer
`Provider
`
`
`
`
`
`Gatherer
`Provider
`
`
`
`
`Broker
`
`,
`
`T
`
`Broker
`
`i
`
`
`
`Figure 2: Harvest Information Gathering Approach
`
The ellipses in Figure 2 indicate that a Broker can collect information from many Gatherers (to build an index of widely distributed information) and that a Gatherer can feed information to many Brokers (thereby saving repeated gathering costs). By retrieving information from other Brokers (illustrated in the rightmost part of Figure 2), Brokers can also cascade indexed views from one another, using the Broker's query interface to filter/refine the information from one Broker to the next.

The final element illustrated in Figure 2 is the Harvest Server Registry (HSR). This is a distinguished Broker instance that registers information about each Harvest Gatherer, Broker, Cache, and Replicator in the Internet. The HSR is useful when constructing new Gatherers and Brokers, to avoid duplication of effort. It can also be used when looking for an appropriate Broker at search time.

In effect, this Gatherer/Broker design provides an efficient network pipe facility, allowing information to be gathered, transmitted, and shared by many different types of indexes, with a great deal of flexibility in how information is filtered and interconnected between Providers and Brokers. Efficiency and flexibility are important because we want to enable the construction of many different topic-specific Brokers. For example, one Broker might index scientific papers for microbiologists, while another Broker indexes PC software archives. By focusing index contents per topic and per community, a Broker can avoid many of the vocabulary [20] and scaling problems of unfocused global indexes (such as Archie and WWWW).

Gatherers and Brokers communicate using an attribute-value stream protocol called the Summary Object Interchange Format (SOIF) [9], intended to be easily parsed yet sufficiently expressive to handle many kinds of objects. It delineates streams of object summaries, and allows for multiple levels of nested detail. SOIF is based on a combination of the Internet Anonymous FTP Archives (IAFA) IETF Working Group templates [17] and BibTeX [33]. Each template contains a type, a Uniform Resource Locator (URL) [4], and a list of byte-count delimited field name/field value pairs. We define a set of mandatory and recommended attributes for Harvest system components. For example, attributes for a Broker describe the server's administrator, location, software version, and the type of objects it contains. An example SOIF record is illustrated in Figure 3.¹
`
@DOCUMENT{ ftp://ftp.cs.psu.edu/pub/techreps/TR123.ps
Title{28}: Type Structured File Systems
Author-Name{33}: C. Dharap, R. Balay and M. Bowman
Author-Organization-Type{11}: educational
Author-Department{32}: Computer Science and Engineering
Publication-Status{19}: Accepted, To appear.
Keywords{42}: types object-oriented semantic filesystems
Access-Protocol-v*{3}: FTP
Description{457}: The paper describes the design of Nebula. Nebula is a dynamic,
object based typed file system. Nebula uses types to create well
structured files as well as to export existing files in the internet
environment. This allows for backward scalability. Files in Nebula are
abstractions and are defined by objects. Nebula assumes a flat global
namespace and operations are provided to manipulate logical namespaces.
}

Figure 3: Example SOIF Template

¹ While the attributes in this example are limited to ASCII values, SOIF allows arbitrary binary data.
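For concreteness, the following is a minimal sketch of a reader for one SOIF template of the kind shown in Figure 3. It is written for illustration and is not taken from the Harvest sources: the grammar is simplified (no nested templates or multi-valued attributes), and the stream is treated as text even though SOIF counts bytes.

    import re

    def parse_soif(stream):
        """Parse one simplified SOIF template:
              @TYPE{ url
              Attribute{byte-count}: value...
              }
        Returns (template_type, url, {attribute: value}).
        Illustrative sketch only."""
        header, _, body = stream.partition("\n")
        m = re.match(r"@(\S+)\{\s*(\S+)", header)
        if not m:
            raise ValueError("not a SOIF template")
        ttype, url = m.group(1), m.group(2)
        attrs, pos = {}, 0
        # Each attribute declares how many bytes of value follow the
        # colon, so values may safely contain newlines (or binary data).
        field = re.compile(r"(?m)^([\w\-\*]+)\{(\d+)\}:\s?")
        while True:
            m = field.search(body, pos)
            if m is None:
                break
            name, count = m.group(1), int(m.group(2))
            attrs[name] = body[m.end():m.end() + count]
            pos = m.end() + count
        return ttype, url, attrs

Applied to the record in Figure 3, this returns the template type DOCUMENT, the ftp:// URL, and a dictionary of the eight listed attributes.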
`
Figure 4 illustrates the Harvest architecture in more detail, showing the Gatherer/Broker interconnection arrangements and SOIF protocol, the internal components of the Broker (discussed in Section 3.2), and the caching, replication, and client interface subsystems.

[Figure 4: Harvest System Architecture. A Client searches a Broker (whose internal components include the Collector, Registry, Storage Manager, Query Manager, and Indexer), then retrieves objects and access methods through the Object Cache; Gatherers feed Brokers SOIF streams collected from Providers, and the Replication Manager replicates Brokers.]

The Replicator can be used to replicate servers, to enhance user-base scalability. For example, the HSR will likely become heavily replicated, since it acts as a point of first contact for searches and new server deployment efforts. The Replication subsystem can also be used to divide the gathering process among many servers (e.g., letting one server index each U.S. regional network), distributing the partial updates among the replicas.

The Object Cache reduces network load, server load, and response latency when accessing located information objects. Additionally, it will be used for caching access methods in the future, when we incorporate an object-oriented access mechanism into Harvest. At present Harvest simply retrieves and displays objects using the standard "hard coded" Mosaic procedures, but we are extending the system to allow publishers to associate access methods with objects (see Section 5). Given this support, it will be possible to perform much more flexible display and access operations.

We discuss each of the Harvest subsystems below.
`
3.1 The Gatherer Subsystem

As indexing the Web becomes increasingly popular, two types of problems arise: data collection inefficiencies and duplication of coding effort. Harvest addresses these problems by providing an efficient and flexible system upon which to build many different indexes. Rather than building one indexer from scratch to collect WAIS headlines, another to index Computer Science technical reports, etc., Harvest can be configured in various ways to produce all of these indexes, at great savings of server load, network traffic, and index space.
`
Efficient Indexing Support

Koster offers a number of guidelines for constructing Web "robots" [32]. For example, he suggests using breadth-first rather than depth-first traversal, to minimize rapid-fire requests to a single server. Rather than working around the existing shortcomings, we believe a better approach is to provide a much more efficient index collection system.

The biggest inefficiency in current Web indexing tools arises from collecting indexing information from systems designed to support object retrieval. For example, to build an index of HTML documents, each document must be retrieved from a server and scanned for HTML links. This causes a great deal of load, because each object retrieval results in creating a TCP connection, forking a UNIX process, changing directories several levels deep, transmitting the object, and terminating the connection. An index of FTP file names can be built more efficiently using recursive directory listing operations, but this still requires an expensive file system traversal.

The Gatherer dramatically reduces these inefficiencies through the use of Provider site-resident software optimized for indexing. This software scans the objects periodically and maintains a cache of indexing information, so that separate traversals are not required for each request.² More importantly, it allows a Provider's indexing information to be retrieved in a single stream, rather than requiring separate requests for each object. This results in a huge reduction in server load. The combination of traversal caching and response streaming can reduce server load by several orders of magnitude.

² Some FTP sites provide a similar optimization by placing a file containing the results of a recent recursive directory listing in a high-level directory. The Gatherer uses these files when they are available.
As a simple quantification of these savings, we measured the time taken to gather information three different ways: via individual FTP retrievals, via individual local file system retrievals, and from the Gatherer's cache. To focus these measurements on server load, we collected only empty files and used the network loopback interface in the case where data would normally go across the network. For a more precise quantification we could have measured CPU, memory, and disk resource usage, but measuring elapsed time provides a first order approximation of overall load for each of the gathering methods. We collected the measurements over 1,000 iterations on an otherwise unloaded DEC Alpha workstation. Our results indicate that gathering via the local file system causes 6.99 times less load than gathering via FTP. More importantly, retrieving an entire stream of records from the Gatherer's cache incurs 2.72 times less server load than retrieving a single object via FTP (and 2.81 times less load for retrieving all objects at a site, because of a common-case optimization we implemented that keeps a compressed cache of this information around). Thus, assuming an average archive site size of 2,300 files,³ serving indexing information via a Gatherer will reduce server load per data collection attempt by a factor of 6,429. Collecting all objects at the site increases this reduction to 6,631, because of the aforementioned common-case cache.
`
The Gatherer also reduces network traffic substantially, based on three techniques. First, it uses the Essence system [26] to extract content summaries before passing the data to a remote Indexer. Essence uses type-specific procedures to extract the most relevant parts of documents as content summaries (for example, extracting author and title information from LaTeX documents, and routine names from executable files). Hence, where current robots might retrieve a set of HTML documents and extract anchors and URLs from each, Essence extracts content summaries at the Provider site, often reducing the amount of data to be transmitted by a factor of 20-50 or more.⁴ Beyond generating content summaries at the Provider site, the Gatherer also allows Brokers to request only those summaries that have changed since a specified time, and transmits the response stream in a compressed form.⁵ The combination of pre-filtering, incremental updates, and compression can reduce network traffic by several orders of magnitude.
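As a rough sketch of how these three techniques compose, consider the following illustrative code for a Gatherer answering a Broker's update request. The function name, the summary_db layout, and the use of gzip are assumptions made for the example, not details of the Harvest implementation.

    import gzip

    def answer_update_request(summary_db, since):
        """Illustrative sketch: return only the SOIF summaries that
        changed after time `since`, as one compressed stream.
        `summary_db` maps URL -> (last_update_time, soif_text);
        this layout is hypothetical."""
        changed = [soif for (mtime, soif) in summary_db.values()
                   if mtime > since]
        stream = "\n".join(changed)             # pre-filtered summaries
        return gzip.compress(stream.encode())   # compressed response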
As an example of these savings, our content index of Computer Science technical reports (see Section 4) starts with 2.44 GB of data distributed around the Internet, and extracts 111 MB worth of content summaries. It then transmits these data as a 41 MB compressed stream to requesting Brokers, for a 59.4-fold network transmission savings over collecting the documents and building a full-text index. The network transmission savings are actually higher still if one considers the number of network links that must be traversed by each byte of data. For the above example, we used traceroute [30] to count hops to each Provider site. In the measured example, the average hop count was 18, magnifying the importance of network transmission savings. This savings corresponds to running local Gatherers at each Provider site, rather than having to pull full objects across the Internet before summarizing at a central site. Note that these computations do not include the additional savings from doing incremental updates.

³ We computed this average from measurements of the total number of sites and files indexed by Archie [44].
⁴ The reduction factor depends on data type. For example, the reduction for PostScript files is quite high, while the reduction for HTML files is fairly low.
⁵ For access methods that do not support time stamps (like the current version of HTTP), the Gatherer extracts and compares "MD5" cryptographic checksums of each object before transmitting updates.
[Figure 5: Example of Essence Customized Information Extraction. An input file stream (e.g., README, doc/essence.1, doc/Essence.ms, doc/Essence.ps, src/pstext/pstext/dvi.c) flows through customized Type Recognition, Candidate Selection, Presentation Unnesting, and Summarizing methods, yielding SOIF records containing keywords from README, the usage line from essence.1, and author, title, and abstract fields from Essence.ms.]
While one might suspect that Essence's use of content summaries causes the resulting indexes to lose useful keywords, our measurements indicate that Essence achieves nearly the same precision⁶ and 70% the recall of WAIS, but requires only 3-11% as much index space and 70% as much summarizing and indexing time as WAIS [27]. For users who need the added recall and precision, however, Essence also supports a full-text extraction option.

⁶ Intuitively, precision is the probability that all of the retrieved documents will be relevant, while recall is the probability that all of the relevant documents will be retrieved [7].
`
Flexibility of Index Collection

In addition to reducing the amount of indexing data that must be transmitted across the network, Essence's summarizing mechanism permits a great deal of flexibility in how information is extracted. Site administrators can customize how object types are recognized (typically by examining file naming conventions and contents); which objects get indexed (e.g., excluding executable code for which there are corresponding source files); and how information is extracted from each type of object (e.g., extracting title and author information from bibliographic databases, and copyright strings from executable programs). Essence can also extract files with arbitrarily deep presentation-layer nesting (e.g., compressed "tar" files). An example of Essence processing is illustrated in Figure 5.
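The following is a minimal sketch of this kind of type-specific dispatch, written to illustrate the idea rather than reproduce Essence itself; the rule table, type names, and summarizer functions are all assumptions made for the example.

    import re

    # Hypothetical type-recognition rules: file naming conventions
    # first (Essence also examines file contents as a fallback).
    NAMING_RULES = [
        (re.compile(r"\.tex$"), "LaTeX"),
        (re.compile(r"\.ps$"), "PostScript"),
        (re.compile(r"(^|/)README$"), "README"),
    ]

    def summarize_latex(text):
        """Pull author/title lines out of LaTeX source (illustrative)."""
        attrs = {}
        for field in ("title", "author"):
            m = re.search(r"\\%s\{([^}]*)\}" % field, text)
            if m:
                attrs[field.capitalize()] = m.group(1)
        return attrs

    def summarize_readme(text):
        """Keep the first few words as keywords (illustrative)."""
        return {"Keywords": " ".join(text.split()[:40])}

    SUMMARIZERS = {"LaTeX": summarize_latex, "README": summarize_readme}

    def summarize(path, text):
        for pattern, ftype in NAMING_RULES:
            if pattern.search(path):
                # A site administrator could also exclude types here.
                handler = SUMMARIZERS.get(ftype)
                return handler(text) if handler else {}
        return {}

A site administrator customizes the index simply by editing the rule table and summarizer set, which is the kind of per-site control described above.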
The Gatherer allows objects to be specified either individually (which we call "LeafNodes") or through type-specific recursive enumeration (which we call "RootNodes"). For example, it enumerates FTP objects using a recursive directory listing, and enumerates HTTP objects by recursively extracting the URL links that point to other HTML objects within the same server as the original RootNode URL.⁷

⁷ This restriction prevents a misconfigured Gatherer from traversing huge parts of the Web. As a second limitation, Gatherers refuse to enumerate RootNode URLs beyond a (configurable) maximum number of resulting LeafNode URLs.
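As an illustration of such bounded, same-server enumeration (a sketch under assumed names, not the Gatherer's actual code):

    from urllib.parse import urljoin, urlparse
    import re

    def enumerate_rootnode(root_url, fetch, max_leafnodes=250):
        """Breadth-first, same-server HTML enumeration with a
        configurable cap on resulting LeafNode URLs. `fetch(url)` is an
        assumed helper returning a page's HTML; the default cap is an
        arbitrary example value."""
        host = urlparse(root_url).netloc
        seen, queue, leafnodes = {root_url}, [root_url], []
        while queue and len(leafnodes) < max_leafnodes:
            url = queue.pop(0)
            leafnodes.append(url)
            for href in re.findall(r'href="([^"]+)"', fetch(url)):
                link = urljoin(url, href)
                # Stay within the original RootNode's server.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return leafnodes

The breadth-first queue also reflects Koster's guideline, cited earlier, of avoiding rapid-fire depth-first requests to a single server.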
The combination of Gatherer enumeration support and content extraction flexibility allows the site administrator to specify a Gatherer's configuration very easily. Given this specification, the system automatically enumerates, gathers, unnests, and summarizes a wide variety of information resources, generating content summaries that range in detail from full-text to highly abridged and specialized "signatures". The Gatherer also allows users to include manually edited information (such as hand-chosen keywords or taxonometric classification references) along with the Essence-extracted information. This allows users to improve index quality for data that warrants the added manual labor. For example, Harvest can incorporate site markup records specified using IETF-specified "IAFA" templates [17]. The Gatherer stores the manual and automatically generated information separately, so that periodic automated summarizing will not overwrite manually created information.

The other type of flexibility supported by the Gatherer is in the placement of Gatherer processes. Ideally, a Gatherer will be placed at each Provider site, to support efficient data gathering. Because this requires cooperation from remote sites, we also allow Gatherers to be run remotely, accessing objects using FTP/Gopher/HTTP. As an optimization, we are currently building a translator that will translate FTP/Gopher/HTTP URLs to local file system references if the Gatherer is run locally. Doing so will provide a factor of 6.99 reduction in server load while gathering locally, as indicated by the server load measurements discussed above.

Because remote gathering imposes so much load, we implemented a connection caching mechanism, which attempts to keep connections open across individual object retrievals.⁸ When the available UNIX file descriptors are exhausted (or timed out with inactivity), they are closed according to a replacement policy. This change resulted in a 700% performance improvement for FTP transfers.

⁸ At present this is only possible with the FTP control connection. Gopher and HTTP close connections after each retrieval.
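A minimal sketch of such a connection cache, with an LRU-style replacement policy and with limits and timeouts chosen arbitrarily for the example (the actual Gatherer policy is not specified here):

    import time
    from collections import OrderedDict
    from ftplib import FTP

    class ConnectionCache:
        """Keep FTP control connections open across retrievals, closing
        the least recently used connection when a descriptor limit is
        reached, and expiring idle connections. Illustrative sketch."""
        def __init__(self, max_open=32, idle_timeout=300):
            self.max_open, self.idle_timeout = max_open, idle_timeout
            self.conns = OrderedDict()  # host -> (connection, last use)

        def get(self, host):
            now = time.time()
            # Close connections that have been idle too long.
            for h, (conn, used) in list(self.conns.items()):
                if now - used > self.idle_timeout:
                    conn.quit()
                    del self.conns[h]
            if host in self.conns:
                conn, _ = self.conns.pop(host)  # reuse; mark most recent
            else:
                if len(self.conns) >= self.max_open:
                    _, (old, _) = self.conns.popitem(last=False)  # LRU
                    old.quit()
                conn = FTP(host)
                conn.login()  # anonymous FTP
            self.conns[host] = (conn, now)
            return conn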
`
3.2 The Broker Subsystem

As illustrated in Figure 4, the Broker consists of four software modules: the Collector, the Registry, the Storage Manager, and the Query Manager.

The Registry and Storage Manager maintain the authoritative list of summary objects that exist in the Broker. For each object, the Registry records a unique identifier and a time-to-live. The unique object identifier (OID) consists of the resource's URL plus information about the Gatherer that generated the summary object. The Storage Manager archives a collection of summary objects on disk, storing each as a file in the underlying file system.

The Collector periodically requests updates from each Gatherer or Broker specified in its configuration file. The Gatherer responds with a list of object summaries to be created, removed, or updated, based on a timestamp passed from the requesting Broker. The Collector parses the response and forwards it to the Registry for processing.
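To make the Registry's bookkeeping concrete, here is a small sketch of the per-object record described above; the field names and the default TTL are assumptions for the example, not Harvest's on-disk format.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class OID:
        """Unique object identifier: the resource's URL plus information
        about the Gatherer that produced the summary (sketch only)."""
        url: str
        gatherer_host: str
        gatherer_name: str

    @dataclass
    class RegistryEntry:
        oid: OID
        time_to_live: int = 7 * 24 * 3600  # assumed one-week default
        summary_file: str = ""             # path kept by Storage Manager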
`
`
When an object is created, an OID is added to the Registry, the summary object is archived by the Storage Manager, and a handle is given to the Indexer. Finally, the Registry is written to disk to complete the transaction. When an object is destroyed, the OID is removed and the summary object is deleted by the Storage Manager and Indexer. When an object is refreshed, its time-to-live is recomputed. If the time-to-live expires for an object, the object is removed from the Broker. Since the state of the Indexer and Storage Manager may become inconsistent with the Registry, periodic garbage collection removes any unreferenced objects from the Storage Manager and Indexer.
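A condensed sketch of that create/remove/refresh/expire cycle, with illustrative names (the real Broker's transaction and disk layout are not shown in this sketch; `storage` and `indexer` are assumed components with store/delete/list methods):

    import time

    def apply_update(registry, storage, indexer, action, oid,
                     summary=None, ttl=86400):
        """Apply one update from a Gatherer response. `registry` is a
        dict mapping OID -> expiry time. Sketch only."""
        if action == "create":
            registry[oid] = time.time() + ttl
            storage.store(oid, summary)
            indexer.add(oid, summary)
        elif action == "remove":
            registry.pop(oid, None)
            storage.delete(oid)
            indexer.delete(oid)
        elif action == "refresh":
            registry[oid] = time.time() + ttl  # recompute time-to-live

    def expire_and_gc(registry, storage, indexer):
        """Drop expired objects; periodic garbage collection removes
        unreferenced ones from the Storage Manager and Indexer."""
        now = time.time()
        for oid in [o for o, exp in registry.items() if exp < now]:
            apply_update(registry, storage, indexer, "remove", oid)
        for oid in set(storage.list()) - set(registry):
            storage.delete(oid)
            indexer.delete(oid)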
`
The Query Manager exports objects to the network. It accepts a query, translates it into an intermediate representation, and passes it to the search engine. The search engine responds with a list of OIDs and some search engine-specific data for each. For example, one of our search engines (Glimpse; see Section 3.3) returns the matching attribute for each summary object in the result list. The Query Manager constructs a short object description from the OID, the search engine-specific data, and the summary object. It then returns the list of object descriptions to the client.
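In code, that flow might be sketched as follows; all names here are ours, and the query translation and search-engine internals are elided.

    def handle_query(query_text, translate, search_engine, storage):
        """Sketch of the Query Manager flow: translate the query to an
        intermediate form, search, then build short object descriptions
        from the OID, engine-specific data, and stored summary."""
        intermediate = translate(query_text)
        results = search_engine.search(intermediate)  # [(oid, data)]
        descriptions = []
        for oid, engine_data in results:
            summary = storage.load(oid)
            descriptions.append({
                "oid": oid,
                "engine_data": engine_data,  # e.g., matching attribute
                "summary": summary[:200],    # short description (sketch)
            })
        return descriptions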
A Broker can gather objects directly from another Broker using a bulk transfer protocol within the Query Manager. When a query is sent to the source Broker, instead of returning the usual short description, the Query Manager returns the entire summary object.

The Query Manager supports remote administration of the Broker's configuration. Several Broker parameters can be manipulated, including the list of Gatherers to contact, the frequency of contact with a Gatherer, the default time-to-live for summary objects, and the frequency of garbage collection.
`
3.3 The Index and Search Subsystem

To accommodate diverse indexing and searching needs, Harvest defines a general Broker-Indexer interface that can accommodate a variety of "backend" search engines. The principal requirements are that the backend support boolean combinations of attribute-based queries, and that it support incremental updates. One could therefore use a variety of different backends inside a Broker, such as WAIS or Ingres (although some versions of WAIS do not support all the needed features).
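Expressed as code, the interface contract might look like the following sketch; the method names and query encoding are ours, not Harvest's.

    from abc import ABC, abstractmethod

    class IndexBackend(ABC):
        """Minimal Broker-Indexer contract as described above: boolean
        combinations of attribute-based queries plus incremental
        updates. Illustrative sketch only."""

        @abstractmethod
        def add(self, oid, summary):
            """Incrementally insert one summary object."""

        @abstractmethod
        def delete(self, oid):
            """Incrementally remove one summary object."""

        @abstractmethod
        def search(self, query):
            """`query` is a boolean tree over (attribute, value) terms,
            e.g. ("and", ("Author", "Manber"),
                         ("or", ("Keywords", "index"),
                                ("Title", "glimpse"))).
            Returns a list of (oid, engine_specific_data)."""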
We have developed two particular search and index subsystems for Harvest, each optimized for different uses. Glimpse [35] supports space-efficient indexes and flexible interactive queries, while Nebula [11] supports fast searches and complex standing queries that scan the data on a regular basis and extract relevant information.
`
Glimpse Index/Search Subsystem

The main novelty of Glimpse is that it requires a very small index (as low as 2-4% of the size of the data), yet allows very flexible content queries, including the use of Boolean expressions, regular expressions, and approximate matching (i.e., allowing misspelling). Another advantage of Glimpse is that the index can be built quickly and modified incrementally.
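The paragraph below describes the index structure in prose; as a rough sketch of the block-pointer idea (our simplification, not Glimpse's code), an index might map each word to a small set of coarse-grained areas rather than to exact positions:

    from collections import defaultdict

    def build_block_index(files, blocks_per_index=256):
        """Map each word to the set of 'areas' containing it (here:
        file numbers grouped into at most `blocks_per_index` blocks).
        Pointers stay small because they name an adjustable-sized area,
        not an exact occurrence. Illustrative sketch only."""
        nfiles = max(len(files), 1)
        group = max(1, nfiles // blocks_per_index)  # files per block
        index = defaultdict(set)
        for fileno, text in enumerate(files):
            block = fileno // group
            for word in set(text.lower().split()):
                index[word].add(block)              # coarse pointer
        return index, group

    def candidate_files(index, group, word, nfiles):
        """File numbers a query must actually scan for `word`."""
        return [f for b in index.get(word.lower(), ())
                for f in range(b * group, min((b + 1) * group, nfiles))]

A query first consults the small index to find candidate areas, then scans only those areas (e.g., with an agrep-style search) for exact or approximate matches.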
The index used by Glimpse is similar in principle to inverted indexes, with one additional feature. The main part of the index consists of a list of all words that appear in all files. For each word there is a pointer to a list of occurrences of the word. But instead of pointing to the exact occurrence, Glimpse points to an adjustable-sized area that contains that occurrence. This area can be simply the file that contains the word, a group of files, or perhaps a paragraph.⁹ By combining pointers for several occurrences of each word and by minimizing the size of the pointers (since it is not necessary to store the exact location), the index becomes rather small. This allows