`Kirsch
`
`US005855020A
[11] Patent Number: 5,855,020
[45] Date of Patent: Dec. 29, 1998
`
`[54] WEB SCAN PROCESS
`
[75] Inventor: Steven T. Kirsch, Los Altos, Calif.

[73] Assignee: Infoseek Corporation, Sunnyvale, Calif.
`
`[21] Appl. No.: 604,584
`
[22] Filed: Feb. 21, 1996

[51] Int. Cl.6 ............................ G06F 17/30
[52] U.S. Cl. ............................. 707/10; 707/104; 707/2; 395/200.33
[58] Field of Search ...................... 395/326, 602, 395/793, 610, 800,
     187.01, 200.36, 200.48, 200.33; 345/335; 707/5, 9; 702/2, 104, 10
`
[56] References Cited
`
`U.S. PATENT DOCUMENTS
`
5,572,643   11/1996   Judson ............... 395/793
5,710,918    1/1998   Lagarde et al. ....... 707/10
5,751,956    5/1998   Kirsch ............... 395/200.33
5,752,246    5/1998   Rogers et al. ........ 707/10
5,761,499    6/1998   Sonderegger .......... 707/10
`
OTHER PUBLICATIONS
`
`Cole et al. "Oracle Spins Web Strategy", Network World,
`v12, n3, pp. 1 & 49, Jan. 16, 1995.
`Davis, Jessica "EMail World/Internet Expo to Feature Web
`Solutions", v18, n8, p. 6, Feb. 19, 1996.
`Berners-Lee "The World-Wide Web", Communications of
`the ACM, v37, n8, pp. 76-82, Aug. 1994.
`
`Nadile, Lisa "Adobe Targets Mac Web Development", PC
`Week, v12, n40, p62(1), Oct. 9, 1995.
`
Snell, Jason "Webtop Publishing Here at Last", MacUser,
v11, n12, p44(2), Dec. 1995.
`
`Primary Examiner-Wayne Amsbury
Assistant Examiner-Charles L. Rones
Attorney, Agent, or Firm-Fliesler, Dubb, Meyer & Lovejoy
`
[57] ABSTRACT
`
An information locator system providing for the expedient
acquisition, validation and updating of information locators
in a heterogeneous network protocol environment. The locator
system includes an information location discrimination
engine coupleable to a network operating in the heterogeneous
network protocol environment, a validation engine
coupled to the information location discrimination engine to
receive information locators and a database providing for the
storage of information locators as discrete searchable
resource locators. The validation engine is also connected to
the database for retrieving and storing resource locators.
The validation engine provides for the autonomous interrogation
of the heterogeneous network protocol environment
to validate a predetermined information locator as a
corresponding resource locator that is unique to the discrete
searchable resource locators then stored by the database.
Where a valid and inferred unique information locator is
found, the validation engine provides a corresponding
resource locator to the database for subsequently searchable
storage.
`
`10 Claims, 3 Drawing Sheets
`
[Front-page drawing. Recoverable labels: World Wide Web; Gopher;
Other Static Information Sources; Net News; ListServ; Other Dynamic
Information Sources; Database.]
`
`
`
U.S. Patent    Dec. 29, 1998    Sheet 1 of 3    5,855,020

[Drawing sheet 1: FIG. 1 and FIG. 2. Recoverable labels: Console (18);
Internet; FTP; Gopher; World Wide Web; Other Static Information
Sources; Net News; ListServ; Other Dynamic Information Sources;
User 0 ... User N; Validation & Search Engine; Database.]
`
`
`
U.S. Patent    Dec. 29, 1998    Sheet 2 of 3    5,855,020

[FIG. 3: flow diagram of process 50. Recoverable box labels: Receive
Dynamic Data Feed (52); Filter for Information Resource Locators;
IRL Found? (No: Discard Information); Convert to Universal Resource
Locator Form (60); Validate URL, Capture Context (66); Discard URL
(70); Save URL (72); end of process (76).]
`
`
`
U.S. Patent    Dec. 29, 1998    Sheet 3 of 3    5,855,020

[FIG. 4: flow diagram of process 80. Recoverable box labels:
Revalidate URL; Select URL to Revisit (84); Validate URL, Re-capture
Context (88); Update URL and Context in ... (94); end of process
(102).]
`
`
`
`WEB SCAN PROCESS
`
`Typical protocol identifiers include FTP, Gopher, HTTP,
`and News. The protocol server address typically is of the
`form "prefix.domain," where the prefix is typically "www"
`for web servers and "ftp" for FTP servers. The "domain" is
the standard Internet sub-domain.top-level-domain of the
`server. Optional qualifiers may be provided to specify, for
`example, a particular hypertext page maintained by a web
`server or a sub-directory accessible through an FTP server.
Internet protocols such as FTP, Gopher and HTTP provide
access typically to generally static information sources. The
information is not entirely static, but rather typified by a
static basic URL that provides referential access to information
that is substantially persistent and typically updated
or expanded on a periodic basis. Other Internet transport
protocols exist to support dynamic information sources.
These dynamic information sources are typified as highly
fluid streams of information, often defined as articles or
messages, exchanged via the Internet. In general, the content
of these information streams is not persistent, at least in the
sense that the information is not immediately organized and
accessible, if ever, through generally static URLs.
A principal dynamic information source is the network
news as transported over the Internet using the network
news transfer protocol (NNTP). The network news system,
historically referred to as Usenet, provides for the successively
upstream and downstream propagation of news
articles between interconnected computer systems.
Specifically, news articles are posted to logically defined
news groups and are propagated generally via the Internet to
other computer systems that temporarily store the articles
subject to expiration rules. Each participating computer
system also serves to propagate the articles to other computer
systems that have not previously received the propagating
news articles.
Another and again historically older dynamic information
source is provided by independently operating list servers
(ListServ) residing on computer systems that are, in general,
connected to the Internet. A list server is a typically automated
service that functions autonomously to repeat electronic
mail messages received by a publicly-known list
server E-Mail account to an established list of subscribers
known to the list server by explicit or fully qualified E-Mail
addresses. The list server is thus an automated electronic
remailer that allows a one-to-many distribution of E-Mail
messages through the indirection operation of the list server.
The remailing of E-Mail messages is typically dynamic and,
therefore, persistent messages are maintained, if at all,
selectively by the subscribers of a particular mailing list.
Furthermore, the list servers are themselves subject to
extreme variability in location and operation since only a
publicly available dedicated E-Mail address is required in
substance to operate a list server.
The ability to simply track, if not expediently search for,
information available via the Internet has not kept pace with
the rapid expansion of information available via the Internet.
One predominant source of new information appears as
essentially static web pages. Various automatons, often
generally referred to as "web crawlers," have been developed
to incrementally trace through URLs embedded in the
various web pages and thereby develop an information map
of available information resources within the logical web
space. Since the Web is not entirely static, but rather greatly
increasing in its extent and complexity on a continuing basis,
web crawlers face a daunting task in repeatedly tracing out
and maintaining a web space map of URLs.
`Simply tracing through all URLs available via the web is
`not practical if only in terms of the time and cost required to
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
The present application is related to the following
Applications, assigned to the Assignee of the present
Application:
1) METHOD AND APPARATUS FOR REDIRECTION
OF SERVER EXTERNAL HYPER-LINK REFERENCES,
invented by Kirsch, U.S. Pat. No. 5,751,956, filed
concurrently herewith, and
2) SECURE, CONVENIENT AND EFFICIENT SYSTEM
AND METHOD OF PERFORMING TRANS-INTERNET
PURCHASE TRANSACTIONS, invented by
Kirsch, application Ser. No. 08/604,506, filed concurrently
herewith.
`
`BACKGROUND OF THE INVENTION
`1. Field of the Invention
The present invention is generally related to systems for
discriminating and organizing informational locators or key
references obtained from source information and, in
particular, to a system and process for expediently developing
locators of independently distributed information accessible
through a heterogeneous protocol network, such as the
Internet.
`2. Description of the Related Art
The national and international packet switched public
network generically referred to as the Internet has existed for
some time. Although often referred to as a single technological
entity, the Internet is represented by a substantial
complex of communication systems ranging from conventional
analog and digital telephone lines through fiber optic,
microwave and satellite communications links. The physical
structure of the Internet is logically unified through the
establishment of common information transport protocols
and addressing and resource referencing schemes that allow
quite disparate computer systems to communicate both
locally and internationally with one another.
`Common information transport protocols include the
`basic file transfer protocol (FTP) and simple mail transfer
`protocol (SMTP). Other information transport protocols that
`are progressively more interactive, particularly in a visual
`manner, include the comparatively simple telnet protocol
`and the typically telnet based gopher information request
`and retrieval service.
Recently, a new information transport protocol, known as
the hypertext transfer protocol (HTTP), has been widely
accepted on the Internet. This transport protocol is utilized
to support a graphically interactive distributed information
system variously known as the World Wide Web (WWW or
W3) or simply as "the Web." The HTTP protocol provides
for the transfer of both textual and graphical information via
the Internet in a coordinated manner based on a system of
client web page browser requests and remote web page
server information responses. An HTTP session is established
between a client browser and page server based on an
HTTP transaction initiated in response to a browser reference
to a uniform resource locator (URL). The URL system
was comparatively recently established to provide a convenient
and de-facto standardized format by which different
Internet based or accessed information sources can be identified
by type, and therefore inferentially by access transport
protocol. In general, URLs have the following form:
<protocol identifier>://<protocol server address>/<qualifier>
`
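As an illustration of the URL form described in the text, the following minimal sketch splits a URL into its three named parts using Python's standard `urllib.parse`. The function name `describe_url` and the dictionary keys are hypothetical labels chosen to mirror the wording of the text; they are not part of the patent.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the three parts named in the text (illustrative only)."""
    parts = urlsplit(url)
    return {
        # e.g. "http", "ftp", "gopher", "news"
        "protocol_identifier": parts.scheme,
        # e.g. "www.example.com" (prefix.domain)
        "protocol_server_address": parts.netloc,
        # optional page or sub-directory, without the leading slash
        "qualifier": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(describe_url("http://www.example.com/docs/index.html"))
```

The host name `www.example.com` is a placeholder; any well-formed URL can be passed in.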
`
`
This is achieved by the present invention through an
information locator system providing for the expedient
acquisition and validation of information locators in a
heterogeneous network protocol environment. The locator
system includes an information location discrimination engine
coupleable to a network operating in the heterogeneous
network protocol environment, a validation engine coupled
to the information location discrimination engine to receive
information locators and a database providing for the storage
of information locators as discrete searchable resource
locators. The validation engine is also connected to the database
for retrieving and storing resource locators. The validation
engine provides for the autonomous interrogation of the
heterogeneous network protocol environment to validate a
predetermined information locator as a corresponding
resource locator that is unique among discrete searchable
resource locators then stored by the database. Where a valid
and inferred unique information locator is found, the
validation engine provides a corresponding resource locator to
the database for subsequently searchable storage.
`To support the currentness of the database, an update and
`purge algorithm is also associated with the validation engine
`for periodically updating or removing obsolete or invalid
`resource locators from the database.
Thus, an advantage of the present invention is that a
dynamic source of information is used to identify new,
rapidly changing and frequently referenced resource
locators.
`Another advantage of the present invention is that one or
`more dynamic sources of information can be mutually
`referenced to identify potential resource locators and that
`existing sources of information and database stores of
`resource locators can be utilized to screen for and verify
`unique resource locators that are then added to the resource
`locator database.
`A further advantage of the present invention is that the
`resource locator database is searchable both for supporting
`the validation of unique resource locators and for supporting
`contextually based database searches for resource locator
`references.
`Still another advantage of the present invention is that
`multiple sources of information, each transported via a
`corresponding network protocol, can be dynamically filtered
`for potential resource locators.
`
actually complete a trace before substantial portions of the
map are antiquated by the addition and gradual revision of
web URLs. Some estimates of the size of the Web place the
number of presently active URLs at greater than about 50
million and growing rapidly. Furthermore, any such incremental
tracing must be, by any practical definition, incomplete.
A URL trace must contend with problems of infinite
depth due to URL mutual references and reference looping,
made further complex by the existence of URL aliases. A
trace must also deal with discrete discontinuities that inherently
exist at any given time in the basic structure of the
URL defined web space. Normally a self contained or only
outwardly directed island (connected group) of URL references
may exist either by choice or as a consequence of the
delay in the ponderous operation of web crawlers before
discovering a URL trace that leads to a URL island. This
tracing delay is conventionally reduced by trimming the
depth at which URLs are traced from a base URL. However,
this strategy actually results in an increased likelihood of
more islands existing with a greater distribution of and even
larger islands of URLs being excluded from the URL map
created by a web crawler.
A class of Internet business services (IBS) has developed
to deal with the problems of locating information available
through the Internet. These business services characteristically
utilize web crawlers to establish searchable web space
maps. These maps, in turn, are made available on the
Internet typically through an advertising supported or
user-fee based search engine interface accessible via a defined
web page. One well-known and one of the oldest Web
searching systems is provided by Lycos, Inc.®
(www.lycos.com). Completeness and timeliness of the listing
of information resources available through the Internet is
of paramount concern to such Internet business services.
These problems are of particular importance since the newest
sources of information are often the most important to
subscribers of such Internet business services. A related
problem is in identifying for the subscriber the most active
of current interest information sources. The ability to ensure
the completeness, timeliness and currentness of the searchable
information available through an Internet business
service is therefore highly desirable. However, because of
the fundamental nature of web crawlers and the fully distributed
nature of the web space, no direct method or system
of achieving these goals is conventionally known. For
example, Lycos has developed a search strategy based on
conducting an essentially random search of URLs tempered
by preferences. These preferences allow for the explicit or
manual specification of starting URLs to include in the
search and generally automated efforts by the search engine
to identify and traverse Web server home pages, Web pages
with substantial external links, user home pages and URLs
that are short, suggestive of a logical if not actual server
hierarchy of Web pages. However, the Lycos search system
is otherwise limited to the identification of URLs from the
pages selected for traversal. The application of these
preferences, the practical limitation of the depth of URL
search and the randomness of the URL tracing operation
may all act to inadvertently limit or at least substantially
delay the inclusion of new Web URLs and even entire Web
islands into the Web map space traced by the Lycos Web
crawler.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
These and other advantages and features of the present
invention will become better understood upon consideration
of the following detailed description of the invention when
considered in connection with the accompanying drawings, in
which like reference numerals designate like parts throughout
the figures thereof, and wherein:
FIG. 1 illustrates a client/server system architecture utilizing
heterogeneous protocols in a networking environment;
FIG. 2 illustrates multiple static and dynamic data sources
as available through the Internet network;
FIG. 3 is a flow diagram of the discrimination and
validation of new resource locators from dynamic data
sources in accordance with a preferred embodiment of the
present invention; and
FIG. 4 is a flow diagram of a purge process through
selective revalidation of resource locators previously stored
in the resource locator database in accordance with a preferred
embodiment of the present invention.
`
`SUMMARY OF THE INVENTION
`Thus, a general purpose of the present invention is to
`provide a system and method of identifying and verifying
`new resource locators of static or comparatively static
`information culled from dynamic sources of information.
`
`
`
DETAILED DESCRIPTION OF THE
INVENTION
`
A typical environment 10 utilizing the Internet for network
services is shown in FIG. 1. A client computer system
12 is coupled directly or through an Internet service provider
(ISP) to the Internet 14. By logical reference, a uniform
resource locator corresponding to an Internet server system
16, 18 may be accessed. Provided that a common protocol
is supported and mutual access permissions are met, a
transaction between the client 12 and server 16 can be
initiated.
As graphically illustrated in FIG. 2, client users 20_0-20_n
have logically transparent access via the Internet 14 to a
wide array of information sources disparately served by
servers coupled to the Internet 14. Different information
sources including FTP 22, Gopher 24, World Wide Web 26
and other static information sources 28 exist as persistent
information sources available to the users 20_0-n. Net News
30, ListServ 32, and other dynamic information sources 34
provide typically subscriber based information on an
on-going basis to the users 20_0-n according to respective
subscription profiles.
`An Internet business service 36, in accordance with a
`preferred embodiment of the present invention, is coupled to
`the Internet 14 to obtain access to both the static and
`dynamic sources of information. By the same connection,
`the Internet business service 36 is also itself accessible as a
static information source to the users 20_0-n via the Internet
`14.
In accordance with a preferred embodiment of the present
invention, a discrimination engine 38 is provided to process
the dynamic information sources 30, 32, 34 to identify
information resource locators typically in the form of URLs.
Preferably, a full feed of Network news 30 is routed to the
discrimination engine 38 by subscription established by the
Internet business service 36 with an upstream Internet
service provider. Net News articles are thereby directed to
the discrimination engine 38 on an as propagated basis via
the Internet 14. At present, a full Net News feed 30 transports
up to 1 gigabyte or more of information per day.
The discrimination engine 38 preferably implements a
conventional regular expression parser that filters the Net
News article stream for occurrences of information resource
locators. In the preferred embodiment of the present
invention, properly formed uniform resource locators are
identified by the parser and extracted from the Net News
article stream by the discrimination engine 38. Additionally,
the discrimination engine 38 may implement the parser to
recognize incompletely formed URLs. For example, a text
sequence constructed as www.sub-domain.top-level-domain
(where sub-domain is an identifier and top-level-domain can
be edu, gov, org, com or a two-letter country code) may be
recognized as an implied HTTP URL. Similarly, a text
stream of the form ftp.sub-domain.top-level-domain may be
recognized as an implied FTP resource locator. Accordingly,
the parser within the discrimination engine 38 can be made
to utilize assumptions about the proper form of an information
resource locator generally consistent with the assumptions
that a conventional end user 20_0-n might reasonably
make.
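The regular-expression filtering described above can be sketched roughly as follows. This is an illustrative approximation, not the parser used by the discrimination engine: the pattern names (`EXPLICIT`, `IMPLIED`, `LOCATOR`), the function `extract_locators`, and the exact character classes are all assumptions made for the example.

```python
import re

# Top-level domains named in the text, plus any two-letter country code.
TLD = r"(?:edu|gov|org|com|[a-z]{2})"
# Properly formed URLs for a few protocol identifiers (assumed set).
EXPLICIT = r'(?:http|ftp|gopher|news)://[^\s<>"]+'
# Incompletely formed locators: www.sub-domain.tld or ftp.sub-domain.tld.
IMPLIED = rf"\b(?:www|ftp)\.[a-z0-9-]+(?:\.[a-z0-9-]+)*\.{TLD}\b"
LOCATOR = re.compile(rf"{EXPLICIT}|{IMPLIED}", re.IGNORECASE)


def extract_locators(text):
    """Return URLs found in text, converting implied forms to full URLs."""
    urls = []
    for match in LOCATOR.finditer(text):
        # Drop sentence punctuation that trails the locator.
        raw = match.group(0).rstrip(".,;:)")
        if "://" in raw:
            urls.append(raw)                      # already a full URL
        elif raw.lower().startswith("ftp."):
            urls.append("ftp://" + raw + "/")     # implied FTP locator
        else:
            urls.append("http://" + raw + "/")    # implied HTTP locator
    return urls


if __name__ == "__main__":
    article = "See www.example.com and ftp.archive.org, or http://host.edu/page.html."
    print(extract_locators(article))
```

Treating a bare `www.` host as HTTP and a bare `ftp.` host as FTP follows the assumptions the text attributes to a conventional end user; a production filter would likely need a broader pattern set.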
All information resource locators identified by the
discrimination engine 38 are provided to the validation and
search engine 40. A corresponding URL reference is
constructed, if need be, and a search is performed against a
local database 42 containing a list of URLs as constructed by
the Internet business service 36 utilizing web crawler
techniques and the on-going operation of the present invention.
Where the corresponding URL is unlisted in the database 42,
the validation and search engine 40 issues a corresponding
URL client request via the Internet to determine whether an
information server provides a valid response. Responses
indicating that the request is barred due to insufficient access
privileges or that the requested information no longer exists
are treated as indicating that the URL reference is invalid.
Equally, the failure of any server to respond is treated as an
invalidating response. If a reference is determined to be
invalid for some number of consecutive attempts by the
validation engine 40 to validate the reference over some
time period, the information resource locator is marked as a
"dead" URL and any contextual information stored by the
database 42 in association with the URL is effectively
purged from the database 42. Preferably, the purge threshold
is set at failure of five consecutive validation attempts made
within a ten day period.
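The purge threshold just described (five consecutive failed validation attempts within a ten day period) can be illustrated with a small bookkeeping class. This is a sketch under stated assumptions: the class name `ValidationRecord`, its fields, and the choice to reset the failure run on any success are hypothetical and not taken from the patent.

```python
from datetime import datetime, timedelta

PURGE_FAILURES = 5                   # consecutive failures, per the text
PURGE_WINDOW = timedelta(days=10)    # period in which they must occur


class ValidationRecord:
    """Tracks validation attempts for one URL (illustrative only)."""

    def __init__(self):
        self.failures = []   # timestamps of the current run of failures
        self.dead = False

    def record_attempt(self, when, ok):
        if ok:
            # Assumption: a valid response ends the run of failures.
            self.failures.clear()
            return
        self.failures.append(when)
        recent = self.failures[-PURGE_FAILURES:]
        if (len(recent) == PURGE_FAILURES
                and recent[-1] - recent[0] <= PURGE_WINDOW):
            self.dead = True   # caller would then purge stored context


if __name__ == "__main__":
    record = ValidationRecord()
    start = datetime(1996, 2, 21)
    for day in (0, 2, 4, 6, 8):
        record.record_attempt(start + timedelta(days=day), ok=False)
    print(record.dead)
```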
`Where a valid information resource locator is found, the
`corresponding URL and selected contextual information
`received as part of the validity verification are then stored in
`the database 42.
In a similar manner, the Internet business service 36
preferably subscribes to independently identified mailing
lists managed and propagated by list servers 32 and other
dynamic information sources 34. Once subscribed, the list
servers 32 and other dynamic information sources 34 provide
logically parallel dynamic information streams to the
discrimination engine 38 of the Internet business service 36.
This information is again parsed by the discrimination
engine 38 to identify potential information resource locators.
The database 42 is initially built, in accordance with a
preferred embodiment of the present invention, through the
operation of a conventional web crawler modified in a
conventional manner to limit recursive crawl to a URL
reference depth of five. Although other crawl depths could
be used, a depth of five has been empirically established as
adequate when used in conjunction with the present invention.
New URLs identified from the dynamic information
sources are provided in an effective manner to the web
crawler of the present invention for further exploration.
Consequently, the direct operation of the depth-five web
crawler is sufficient and appropriate for identifying new
information resources that exist in active areas. The present
invention, by operation on dynamic information sources,
serves to rapidly identify new, changed and currently active
information resources as they are announced dynamically.
Furthermore, multiple references and changed or corrected
resource locators are also expediently collected from the
dynamic information sources. The database 42 developed
through the operation of the present invention is thereby
maintained in a complete, timely and current manner.
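A depth-limited crawl of the kind described above might be sketched as follows. The text does not say how the crawler orders its traversal, so the breadth-first strategy here is an assumption, as are the names `crawl` and `get_links`; `get_links` stands in for whatever fetches a page and returns the URLs it references.

```python
from collections import deque


def crawl(start_urls, get_links, max_depth=5):
    """Collect URLs reachable within max_depth references of the start URLs.

    get_links(url) is a caller-supplied function (hypothetical) returning
    the URLs referenced by the page at url.
    """
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue                 # do not follow references any deeper
        for link in get_links(url):
            if link not in seen:     # guards against reference loops
                seen.add(link)
                queue.append((link, depth + 1))
    return seen


if __name__ == "__main__":
    # Toy link graph: a chain a0 -> a1 -> ... -> a7 with a loop back to a0.
    graph = {f"a{i}": [f"a{i + 1}", "a0"] for i in range(7)}
    print(sorted(crawl(["a0"], lambda u: graph.get(u, []))))
```

Seeding `start_urls` with locators harvested from the dynamic feeds is what lets a shallow crawl of this kind reach pages that would otherwise sit in unreferenced islands.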
The preferred method 50 of processing data received via
dynamic information sources is shown in FIG. 3. Information
received from a dynamic data feed 52 is processed
through a general regular expression parser to filter and
identify information resource locators 54 within the data
feed. Where an information resource locator (IRL) is not
found within a packet of data received from the feed 56, the
data packet is discarded 58 and the next packet is examined
54.
Where an information resource locator is identified, the
form of the resource locator is converted as necessary and if
possible to a uniform resource locator form 60. The database
42 is then searched 62 to determine whether the URL
previously exists in the database 42. If the URL exists in the
database 42, the IRL is discarded 64. The database 42 may,
nonetheless, be updated to reflect a repeated reference of
the URL, thereby indicating degree of current activity and
the interest in and relative importance of the URL.
Accordingly, a repeated reference count field associated
with the URL in the database 42 can be incremented with
each repeated dynamic URL reference.
Where the URL is not found in the database 42, a client
request is made to the Internet 14 to retrieve information
from the URL at 66. If no valid response to the URL client
request is received 68, the IRL is again discarded 70.
Where the URL is determined to be valid, the URL and a
contextually appropriate sampling of the information
returned by the URL client request are saved to the database
42 at 72. If any information packets from the dynamic data
feed remain 74 the next data packet is examined 54.
Otherwise, the process terminates 76 generally until the
dynamic data feed 52 resumes.
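The flow of FIG. 3 as described in the preceding paragraphs can be summarized in a short sketch. The function and parameter names (`process_feed`, `extract_locators`, `fetch`) and the dictionary-based database are hypothetical stand-ins; the step numbers in the comments refer to the reference numerals used in the text. Starting the repeat count at zero is an assumption.

```python
def process_feed(packets, extract_locators, database, fetch):
    """Illustrative pass over a dynamic data feed (process 50).

    packets          -- iterable of text packets from the feed (52)
    extract_locators -- function returning URLs found in a packet (54, 60)
    database         -- dict mapping URL -> stored fields (database 42)
    fetch            -- function returning page context, or None if the
                        URL gives no valid response (66, 68)
    """
    for packet in packets:
        for url in extract_locators(packet):
            if url in database:                      # search at 62
                database[url]["repeat_count"] += 1   # repeated reference
                continue                             # IRL discarded at 64
            context = fetch(url)                     # client request at 66
            if context is None:                      # no valid response, 68
                continue                             # IRL discarded at 70
            database[url] = {"context": context, "repeat_count": 0}  # save, 72
    return database


if __name__ == "__main__":
    feed = ["a http://x/ b", "http://x/ http://dead/"]
    find = lambda p: [w for w in p.split() if w.startswith("http://")]
    get = lambda u: None if "dead" in u else "page for " + u
    print(process_feed(feed, find, {}, get))
```

Packets with no locator simply yield an empty list from `extract_locators` and are skipped, which corresponds to the discard at 58.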
In accordance with a preferred embodiment of the present
invention, the URLs identified from the dynamic information
sources via the process 50 are further explored by the
depth-five web crawler in combination with the execution of
a revalidation process 80, as shown in FIG. 4. The modified
web crawler is initiated to revalidate the URL database 82 on
a periodic basis, if not continuously. As part of the programmed
operation of the web crawler, a URL is selected
from the database 42 for consideration as to whether to
purge the selected URL from the database 42 at 84. The
determination is made based on an initial evaluation of the
purge characteristics established with the URL. These
characteristics are stored as data fields associated with the URL
in the database 42. These characteristic fields may store
information relating to the URL including an indication of
the age of the URL since the URL was first identified by the
service 36, the frequency that the content associated with the
URL changes as discovered through the process of
validation, the frequency that the URL has moved, and the
number of failed responses within the current threshold
purge period. These and other similar characteristics may be
utilized in combination to determine how frequently the
modified web crawler should operate to revalidate a particular
URL. Where the characteristics necessary for the web
crawler to revisit the URL as part of the validation/purge
process are not met 86, a next URL is selected 84.
Where a URL has been newly added to the database 42,
a default period of approximately one week is established as
the frequency of revalidating the URL. However, the first
time that the modified web crawler considers a newly added
URL, the revisit and database update characteristics are by
definition met, in order to force revalidation and to ensure
that any deeper URLs associated with this new URL are
immediately explored by the modified web crawler and, as
appropriate, are each in turn added to the database 42.
Thus, the process 80 operates to revalidate a new or
appropriately aged URL at 88. A URL client request is issued
to the Internet 14 and any appropriate server response is
captured and filtered for context for comparison against any
prior version of the URL context as stored in the database
42. Where the selected URL is valid and the received context
has not been changed, the age and other characteristics
relating to the revisit/purge criteria determination are
adjusted or updated at 98 in the database 42. If any unexplored
URLs remain in the database 42 at 100, another URL
is selected 84. Otherwise, the current iteration of revalidation
of the database 42 is complete 98.
Where no valid response is received back from the URL
server, or the context derived from the response received
differs from the context stored by the database 42, the
process 80 then determines whether, for an invalid response
at 92, the purge threshold criteria for the URL have been
reached. Where the purge criteria have not been met or only
the context associated with the URL has changed, the URL
revisit related data and update frequency data associated
with the URL are modified in the database 42 at 94.
Specifically, a new period for revisiting the URL is calculated
based on an average of the rate of change of the URL
context, the number of invalid responses in the current
validation period is accounted for or reset, and any new
context is updated to the database 42. Where the context has
changed, any URLs referenced in the new context are
explored by the modified web crawler beginning at 84.
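The text says only that the new revisit period is based on an average of the rate of change of the URL context; it gives no formula. One plausible reading, offered purely as an assumption, is to use the mean interval between observed context changes. The function name `revisit_period` and the clamping bounds `minimum` and `maximum` below are hypothetical; the one-week default comes from the text.

```python
from datetime import datetime, timedelta

DEFAULT_PERIOD = timedelta(days=7)   # "approximately one week" default


def revisit_period(change_times,
                   minimum=timedelta(days=1),
                   maximum=timedelta(days=30)):
    """Estimate a revisit period from the times a URL's context changed.

    The minimum and maximum bounds are assumptions added for the sketch.
    """
    if len(change_times) < 2:
        return DEFAULT_PERIOD        # not enough history to average
    ordered = sorted(change_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average = sum(gaps, timedelta()) / len(gaps)
    return max(minimum, min(maximum, average))


if __name__ == "__main__":
    start = datetime(1996, 2, 21)
    changes = [start, start + timedelta(days=2), start + timedelta(days=6)]
    print(revisit_period(changes))
```

Under this reading, a page whose context changes often is revisited more often, and a page that rarely changes drifts toward the upper bound.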
Once the purge threshold criteria have been met following
an invalid URL server response, the URL is marked as
"dead" and the associated context is purged from the database
42 at 96. The process 80 then resumes with the
selection of a next URL from the database 94 to potentially
revisit at 84.
Thus, a comprehensive system for maintaining a resource
locator map describing information resources accessible
through the Internet and identified through the combined
examination of both static and dynamic information sources
has been described.
`While the invention has been particularly shown and
`described with reference to preferred embodiments thereof it
`will be understood by those skilled in the art that various
`changes in form and details may be made therein without
`departing from the spirit and scope of the invention as
`defined by the appended claims.
`I claim:
`1. A system of autonomously maintaining a searchable
`database of information accessible over the Internet, said
`system comprising:
a) a discrimination system coupleable to the Internet to
receive messages including electronic mail messages
and network news messages, said discrimination processing
said electronic mail and network news messages
to identify embedded URLs; and
b) a validation system coupleable to the Internet, said
validation system coupled to said discrimination system
to receive a predetermined embedded URL, said
validation system enabling an access of the Internet to
retrieve Web page information associated with said
predetermined embedded URL; and
`c) a database for searchably storing said predetermined
`embedded URL in association with the Web page
`informa