Kirsch

US005855020A
[11] Patent Number: 5,855,020
[45] Date of Patent: Dec. 29, 1998

[54] WEBSCAN PROCESS
[75] Inventor: Steven T. Kirsch, Los Altos, Calif.
[73] Assignee: Infoseek Corporation, Sunnyvale, Calif.
[21] Appl. No.: 604,584
[22] Filed: Feb. 21, 1996
[51] Int. Cl.: G06F 17/30
[52] U.S. Cl.: 707/10; 707/104; 707/2; 395/200.33
[58] Field of Search: 395/326, 602, 395/793, 610, 800, 187.01, 200.36, 200.48, 200.33; 345/335; 707/5, 9; 702/2, 104, 10

[56] References Cited

U.S. PATENT DOCUMENTS
5,572,643   11/1996   Judson            395/793
5,710,918   1/1998    Lagarde et al.    707/10
5,751,956   5/1998    Kirsch            395/200.33
5,752,246   5/1998    Rogers et al.     707/10
5,761,499   6/1998    Sonderegger       707/10

OTHER PUBLICATIONS
Cole et al., “Oracle Spins Web Strategy”, Network World, v12, n3, pp. 1 & 49, Jan. 16, 1995.
Davis, Jessica, “EMail World/Internet Expo to Feature Web Solutions”, v18, n8, p. 6, Feb. 19, 1996.
Berners-Lee, “The World-Wide Web”, Communications of the ACM, v37, n8, pp. 76–82, Aug. 1994.
Nadile, Lisa, “Adobe Targets Mac Web Development”, PC Week, v12, n40, p. 62(1), Oct. 9, 1995.
Snell, Jason, “Webtop Publishing Here at Last”, MacUser, v11, n12, p. 44(2), Dec. 1995.

Primary Examiner: Wayne Amsbury
Assistant Examiner: Charles L. Rones
Attorney, Agent, or Firm: Fliesler, Dubb, Meyer & Lovejoy
[57] ABSTRACT
An information locator system providing for the expedient acquisition, validation and updating of information locators in a heterogeneous network protocol environment. The locator system includes an information location discrimination engine coupleable to a network operating in the heterogeneous network protocol environment, a validation engine coupled to the information location discrimination engine to receive information locators and a database providing for the storage of information locators as discrete searchable resource locators. The validation engine is also connected to the database for retrieving and storing resource locators. The validation engine provides for the autonomous interrogation of the heterogeneous network protocol environment to validate a predetermined information locator as a corresponding resource locator that is unique to the discrete searchable resource locators then stored by the database. Where a valid and inferred unique information locator is found, the validation engine provides a corresponding resource locator to the database for subsequently searchable storage.

10 Claims, 3 Drawing Sheets
`
[Front-page drawing not reproduced. Recoverable labels: Other Static Information Sources, Net News, ListServ, Other Dynamic Information Sources, World Wide Web, Gopher, Validation & Search Engine, Discrimination Engine.]
`Petitioner Google Ex-1022, 0001
`
`
`
U.S. Patent    Dec. 29, 1998    Sheet 1 of 3    5,855,020

[FIG. 1 and FIG. 2: drawings not reproduced. Recoverable labels: Console, FTP, Other Static Information Sources, Net News, Other Dynamic Information Sources, Validation & Search Engine, Discrimination Engine, Database.]
`
`
`
U.S. Patent    Dec. 29, 1998    Sheet 2 of 3    5,855,020

[FIG. 3: flow diagram not reproduced. Recoverable box labels: Receive Dynamic Data Feed; Filter for Information Resource Locators; Discard Information; Convert to Universal Resource Locator Form; Is URL in Database?; Discard URL; Validate URL, Capture Context; Is URL Valid?; Discard URL; Save URL & Context to Database; Any Information Remain?]
`
`
`
U.S. Patent    Dec. 29, 1998    Sheet 3 of 3    5,855,020

[FIG. 4: flow diagram not reproduced. Recoverable box labels: Revalidate URL Database; Select URL to Revisit; Revisit Criteria; Validate URL, Re-capture Context; URL Invalid or Context Changed?; Purge Threshold Reached; Update URL and Context in Database; Mark URL Dead, Remove from Database; Adjust Revisit Criteria Data.]
`
`
`
WEBSCAN PROCESS

CROSS-REFERENCE TO RELATED APPLICATIONS
The present application is related to the following Applications, assigned to the Assignee of the present Application:
1) METHOD AND APPARATUS FOR REDIRECTION OF SERVER EXTERNAL HYPER-LINK REFERENCES, invented by Kirsch, U.S. Pat. No. 5,751,956, filed concurrently herewith, and
2) SECURE, CONVENIENT AND EFFICIENT SYSTEM AND METHOD OF PERFORMING TRANS-INTERNET PURCHASE TRANSACTIONS, invented by Kirsch, application Ser. No. 08/604,506, filed concurrently herewith.
`
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is generally related to systems for discriminating and organizing informational locators or key references obtained from source information and, in particular, to a system and process for expediently developing locators of independently distributed information accessible through a heterogeneous protocol network, such as the Internet.
2. Description of the Related Art
The national and international packet switched public network generically referred to as the Internet has existed for some time. Although often referred to as a single technological entity, the Internet is represented by a substantial complex of communication systems ranging from conventional analog and digital telephone lines through fiber optic, microwave and satellite communications links. The physical structure of the Internet is logically unified through the establishment of common information transport protocols and addressing and resource referencing schemes that allow quite disparate computer systems to communicate both locally and internationally with one another.
Common information transport protocols include the basic file transfer protocol (FTP) and simple mail transfer protocol (SMTP). Other information transport protocols that are progressively more interactive, particularly in a visual manner, include the comparatively simple telnet protocol and the typically telnet based gopher information request and retrieval service.
Recently, a new information transport protocol, known as the hypertext transfer protocol (HTTP), has been widely accepted on the Internet. This transport protocol is utilized to support a graphically interactive distributed information system variously known as the World Wide Web (WWW or W3) or simply as “the Web.” The HTTP protocol provides for the transfer of both textual and graphical information via the Internet in a coordinated manner based on a system of client web page browser requests and remote web page server information responses. An HTTP session is established between a client browser and page server based on an HTTP transaction initiated in response to a browser reference to a uniform resource locator (URL). The URL system was comparatively recently established to provide a convenient and de-facto standardized format by which different Internet based or accessed information sources can be identified by type, and therefore inferentially by access transport protocol. In general, URLs have the following form:
<protocol identifier>://<protocol server address>/<qualifiers>
Typical protocol identifiers include FTP, Gopher, HTTP, and News. The protocol server address typically is of the form “prefix.domain,” where the prefix is typically “www” for web servers and “ftp” for FTP servers. The “domain” is the standard Internet sub-domain.top-level-domain of the server. Optional qualifiers may be provided to specify, for example, a particular hypertext page maintained by a web server or a sub-directory accessible through an FTP server.
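As an illustrative aside that is not part of the original disclosure, the general URL form described above can be decomposed with Python's standard `urllib.parse` module. The helper name `describe_url`, the dictionary keys, and the example addresses are assumptions chosen only for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the three parts named in the text (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        # e.g. "http", "ftp" or "gopher"
        "protocol_identifier": parts.scheme,
        # e.g. "www.example.com" (prefix.domain)
        "server_address": parts.hostname or "",
        # optional page or sub-directory, without the leading slash
        "qualifiers": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(describe_url("http://www.example.com/docs/index.html"))
    print(describe_url("ftp://ftp.example.com/pub"))
```

The qualifiers part is empty when the URL names only a server, which matches the description of qualifiers as optional.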
Internet protocols such as FTP, Gopher and HTTP provide access typically to generally static information sources. The information is not entirely static, but rather typified by a static basic URL that provides referential access to information that is substantially persistent and typically updated or expanded on a periodic basis. Other Internet transport protocols exist to support dynamic information sources. These dynamic information sources are typified as highly fluid streams of information, often defined as articles or messages, exchanged via the Internet. In general, the content of these information streams is not persistent at least in the sense that the information is not immediately organized and accessible, if ever, through generally static URLs.
A principal dynamic information source is the network news as transported over the Internet using the network news transfer protocol (NNTP). The network news system, historically referred to as Usenet, provides for the successively upstream and downstream propagation of news articles between interconnected computer systems. Specifically, news articles are posted to logically defined news groups and are propagated generally via the Internet to other computer systems that temporarily store the articles subject to expiration rules. Each participating computer system also serves to propagate the articles to other computer systems that have not previously received the propagating news articles.
Another and again historically older dynamic information source is provided by independently operating list servers (ListServ) residing on computer systems that are, in general, connected to the Internet. A list server is a typically automated service that functions autonomously to repeat electronic mail messages received by a publicly-known list server E-Mail account to an established list of subscribers known to the list server by explicit or fully qualified E-Mail addresses. The list server is thus an automated electronic remailer that allows a one-to-many distribution of E-Mail messages through the indirection operation of the list server. The remailing of E-Mail messages is typically dynamic and, therefore, persistent messages are maintained, if at all, selectively by the subscribers of a particular mailing list. Furthermore, the list servers are themselves subject to extreme variability in location and operation since only a publicly available dedicated E-Mail address is required in substance to operate a list server.
The ability to simply track if not expediently search for information available via the Internet has not kept pace with the rapid expansion of information available via the Internet. One predominant source of new information appears as essentially static web pages. Various automatons, often generally referred to as “web crawlers,” have been developed to incrementally trace through URLs embedded in the various web pages and thereby develop an information map of available information resources within the logical web space. Since the Web is not entirely static, but rather greatly increasing in its extent and complexity on a continuing basis, web crawlers face a daunting task in repeatedly tracing out and maintaining a web space map of URLs.
Simply tracing through all URLs available via the web is not practical if only in terms of the time and cost required to actually complete a trace before substantial portions of the map are antiquated by the addition and gradual revision of web URLs. Some estimates of the size of the Web place the number of presently active URLs at greater than about 50 million and growing rapidly. Furthermore, any such incremental tracing must be, by any practical definition, incomplete. A URL trace must contend with problems of infinite depth due to URL mutual references and reference looping, made further complex by the existence of URL aliases. A trace must also deal with discrete discontinuities that inherently exist at any given time in the basic structure of the URL defined web space. Normally a self contained or only outwardly directed island (connected group) of URL references may exist either by choice or as a consequence of the delay in the ponderous operation of web crawlers before discovering a URL trace that leads to a URL island. This tracing delay is conventionally reduced by trimming the depth at which URLs are traced from a base URL. However, this strategy actually results in an increased likelihood of more islands existing with a greater distribution of and even larger islands of URLs being excluded from the URL map created by a web crawler.
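The tracing problems described above (reference loops, depth trimming, and unreachable islands) can be seen in a small sketch. This is an illustration under stated assumptions, not a crawler from the disclosure: the function name `crawl`, the in-memory `LINKS` graph, and its page names are all hypothetical, and a real crawler would fetch pages over the network.

```python
from collections import deque


def crawl(links: dict, base: str, max_depth: int) -> set:
    """Breadth-first trace of URL references, at most max_depth hops from base."""
    seen = {base}
    queue = deque([(base, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth trimming: do not follow references any deeper
        for ref in links.get(url, ()):
            if ref not in seen:  # guards against mutual references and loops
                seen.add(ref)
                queue.append((ref, depth + 1))
    return seen


# Hypothetical link graph: "a" and "b" refer to each other (a loop), a chain
# runs a -> c -> d -> e, and {"x", "y"} is an island that nothing reachable
# from "a" refers to.
LINKS = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["d"],
    "d": ["e"],
    "x": ["y"],
    "y": ["x"],
}

if __name__ == "__main__":
    print(sorted(crawl(LINKS, "a", 10)))  # island x, y is never reached
    print(sorted(crawl(LINKS, "a", 2)))   # trimming also drops "e"
```

Lowering `max_depth` shortens the trace but enlarges the set of pages that are never mapped, which is the trade-off the passage describes.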
A class of Internet business services (IBS) has developed to deal with the problems of locating information available through the Internet. These business services characteristically utilize web crawlers to establish searchable web space maps. These maps, in turn, are made available on the Internet typically through an advertising supported or user fee based search engine interface accessible via a defined web page. One of the oldest and best-known Web searching systems is provided by Lycos, Inc. (www.lycos.com). Completeness and timeliness of the listing of information resources available through the Internet is of paramount concern to such Internet business services. These problems are of particular importance since the newest sources of information are often the most important to subscribers of such Internet business services. A related problem is in identifying for the subscriber the most active of current interest information sources. The ability to ensure the completeness, timeliness and currentness of the searchable information available through an Internet business service is therefore highly desirable. However, because of the fundamental nature of web crawlers and the fully distributed nature of the web space, no direct method or system of achieving these goals is conventionally known. For example, Lycos has developed a search strategy based on conducting an essentially random search of URLs tempered by preferences. These preferences allow for the explicit or manual specification of starting URLs to include in the search and generally automated efforts by the search engine to identify and traverse Web server home pages, Web pages with substantial external links, user home pages and URLs that are short, suggestive of a logical if not actual server hierarchy of Web pages. However, the Lycos search system is otherwise limited to the identification of URLs from the pages selected for traversal. The application of these preferences, the practical limitation of the depth of URL search and the randomness of the URL tracing operation may all act to inadvertently limit or at least substantially delay the inclusion of new Web URLs and even entire Web islands into the Web map space traced by the Lycos Web crawler.
`
SUMMARY OF THE INVENTION
Thus, a general purpose of the present invention is to provide a system and method of identifying and verifying new resource locators of static or comparatively static information culled from dynamic sources of information.
This is achieved by the present invention through an information locator system providing for the expedient acquisition and validation of information locators in a heterogeneous network protocol environment. The locator system includes an information location discrimination engine coupleable to a network operating in the heterogeneous network protocol environment, a validation engine coupled to the information location discrimination engine to receive information locators and a database providing for the storage of information locators as discrete searchable resource locators. The validation engine is also connected to the database for retrieving and storing resource locators. The validation engine provides for the autonomous interrogation of the heterogeneous network protocol environment to validate a predetermined information locator as a corresponding resource locator that is unique among discrete searchable resource locators then stored by the database. Where a valid and inferred unique information locator is found, the validation engine provides a corresponding resource locator to the database for subsequently searchable storage.
To support the currentness of the database, an update and purge algorithm is also associated with the validation engine for periodically updating or removing obsolete or invalid resource locators from the database.
Thus, an advantage of the present invention is that a dynamic source of information is used to identify new, rapidly changing and frequently referenced resource locators.
Another advantage of the present invention is that one or more dynamic sources of information can be mutually referenced to identify potential resource locators and that existing sources of information and database stores of resource locators can be utilized to screen for and verify unique resource locators that are then added to the resource locator database.
A further advantage of the present invention is that the resource locator database is searchable both for supporting the validation of unique resource locators and for supporting contextually based database searches for resource locator references.
Still another advantage of the present invention is that multiple sources of information, each transported via a corresponding network protocol, can be dynamically filtered for potential resource locators.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other advantages and features of the present invention will become better understood upon consideration of the following detailed description of the invention when considered in connection with the accompanying drawings, in which like reference numerals designate like parts throughout the figures thereof, and wherein:
FIG. 1 illustrates a client/server system architecture utilizing heterogeneous protocols in a networking environment;
FIG. 2 illustrates multiple static and dynamic data sources as available through the Internet network;
FIG. 3 is a flow diagram of the discrimination and validation of new resource locators from dynamic data sources in accordance with a preferred embodiment of the present invention; and
FIG. 4 is a flow diagram of a purge process through selective revalidation of resource locators previously stored in the resource locator database in accordance with a preferred embodiment of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
A typical environment 10 utilizing the Internet for network services is shown in FIG. 1. A client computer system 12 is coupled directly or through an Internet service provider (ISP) to the Internet 14. By logical reference, a uniform resource locator corresponding to an Internet server system 16, 18 may be accessed. Provided that a common protocol is supported and mutual access permissions are met, a transaction between the client 12 and server 16 can be initiated.
As graphically illustrated in FIG. 2, client users 20 have logically transparent access via the Internet 14 to a wide array of information sources disparately served by servers coupled to the Internet 14. Different information sources including FTP 22, Gopher 24, World Wide Web 26 and other static information sources 28 exist as persistent information sources available to the users 20. Net News 30, ListServ 32, and other dynamic information sources 34 provide typically subscriber based information on an on-going basis to the users 20, according to respective subscription profiles.
An Internet business service 36, in accordance with a preferred embodiment of the present invention, is coupled to the Internet 14 to obtain access to both the static and dynamic sources of information. By the same connection, the Internet business service 36 is also itself accessible as a static information source to the users 20 via the Internet 14.
In accordance with a preferred embodiment of the present invention, a discrimination engine 38 is provided to process the dynamic information sources 30, 32, 34 to identify information resource locators typically in the form of URLs. Preferably, a full feed of Network news 30 is routed to the discrimination engine 38 by subscription established by the Internet business service 36 with an upstream Internet service provider. Net News articles are thereby directed to the discrimination engine 38 on an as-propagated basis via the Internet 14. At present, a full Net News feed 30 transports up to 1 gigabyte or more of information per day.
The discrimination engine 38 preferably implements a conventional regular expression parser that filters the Net News article stream for occurrences of information resource locators. In the preferred embodiment of the present invention, properly formed uniform resource locators are identified by the parser and extracted from the Net News article stream by the discrimination engine 38. Additionally, the discrimination engine 38 may implement the parser to recognize incompletely formed URLs. For example, a text sequence constructed as www.sub-domain.top-level-domain (where sub-domain is an identifier and top-level-domain can be edu, gov, org, com or a two-letter country code) may be recognized as an implied HTTP URL. Similarly, a text stream of the form ftp.sub-domain.top-level-domain may be recognized as an implied FTP resource locator. Accordingly, the parser within the discrimination engine 38 can be made to utilize assumptions about the proper form of an information resource locator generally consistent with the assumptions that a conventional end user 20 might reasonably make.
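A minimal sketch of such a regular expression filter follows, assuming Python's standard `re` module. The patterns, the function name `find_locators`, and the example text are illustrative assumptions; the disclosure does not give the actual expressions used by the discrimination engine 38.

```python
import re

# Properly formed URLs for some of the protocol identifiers named in the text.
EXPLICIT = re.compile(r"(?:http|ftp|gopher)://[^\s<>]+", re.IGNORECASE)

# Implied locators: www.<sub-domain>.<tld> or ftp.<sub-domain>.<tld>, where the
# top-level domain is edu, gov, org, com or a two-letter country code. The
# lookbehind keeps the pattern from firing inside an explicit URL or address.
IMPLIED = re.compile(
    r"(?<![\w./@-])(www|ftp)\.(?:[a-z0-9-]+\.)+(?:edu|gov|org|com|[a-z]{2})\b",
    re.IGNORECASE,
)


def find_locators(text: str) -> list:
    """Return explicit URLs first, then implied ones converted to URL form."""
    found = [m.group(0).rstrip(".,;:!?)") for m in EXPLICIT.finditer(text)]
    for m in IMPLIED.finditer(text):
        scheme = "http" if m.group(1).lower() == "www" else "ftp"
        found.append(f"{scheme}://{m.group(0).lower()}/")
    return found


if __name__ == "__main__":
    article = ("See http://www.example.com/page.html, or try "
               "www.example.org and ftp.example.edu.")
    for locator in find_locators(article):
        print(locator)
```

The sketch drops any path that follows an implied host name; whether a production parser should keep it is a design choice the text leaves open.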
All information resource locators identified by the discrimination engine 38 are provided to the validation and search engine 40. A corresponding URL reference is constructed, if need be, and a search is performed against a local database 42 containing a list of URLs as constructed by the Internet business service 36 utilizing web crawler techniques and the on-going operation of the present invention. Where the corresponding URL is unlisted in the database 42, the validation and search engine 40 issues a corresponding URL client request via the Internet to determine whether an information server provides a valid response. Responses indicating that the request is barred due to insufficient access privileges or that the requested information no longer exists are treated as indicating that the URL reference is invalid. Equally, the failure of any server to respond is treated as an invalidating response. If a reference is determined to be invalid for some number of consecutive attempts by the validation engine 40 to validate the reference over some time period, the information resource locator is marked as a “dead” URL and any contextual information stored by the database 42 in association with the URL is effectively purged from the database 42. Preferably, the purge threshold is set at failure of five consecutive validation attempts made within a ten-day period.
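The preferred purge threshold can be sketched as a small bookkeeping record. The threshold values (five consecutive failures within ten days) come from the text; the class name `UrlRecord`, its fields, and the decision to store failure timestamps are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

PURGE_FAILURES = 5                 # consecutive failed validations (from the text)
PURGE_WINDOW = timedelta(days=10)  # period in which they must fall (from the text)


@dataclass
class UrlRecord:
    """Hypothetical per-URL record kept alongside the stored context."""
    url: str
    dead: bool = False
    failures: list = field(default_factory=list)  # times of consecutive failures

    def record_attempt(self, ok: bool, when: datetime) -> None:
        if ok:
            self.failures.clear()  # a valid response breaks the run of failures
            return
        self.failures.append(when)
        recent = self.failures[-PURGE_FAILURES:]
        if len(recent) == PURGE_FAILURES and recent[-1] - recent[0] <= PURGE_WINDOW:
            self.dead = True       # mark "dead"; stored context would be purged


if __name__ == "__main__":
    rec = UrlRecord("http://www.example.com/")
    start = datetime(1996, 2, 21)
    for day in range(5):
        rec.record_attempt(False, start + timedelta(days=day))
    print(rec.dead)
```

Access-denied responses, missing pages, and server timeouts would all be passed in as `ok=False`, since the text treats each as an invalidating response.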
Where a valid information resource locator is found, the corresponding URL and selected contextual information received as part of the validity verification are then stored in the database 42.
In a similar manner, the Internet business service 36 preferably subscribes to independently identified mailing lists managed and propagated by list servers 32 and other dynamic information sources 34. Once subscribed, the list servers 32 and other dynamic information sources 34 provide logically parallel dynamic information streams to the discrimination engine 38 of the Internet business service 36. This information is again parsed by the discrimination engine 38 to identify potential information resource locators.
The database 42 is initially built, in accordance with a preferred embodiment of the present invention, through the operation of a conventional web crawler modified in a conventional manner to limit recursive crawl to a URL reference depth of five. Although other crawl depths could be used, a depth of five has been empirically established as adequate when used in conjunction with the present invention. New URLs identified from the dynamic information sources are provided in an effective manner to the web crawler of the present invention for further exploration. Consequently, the direct operation of the depth-five web crawler is sufficient and appropriate for identifying new information resources that exist in active areas. The present invention, by operation on dynamic information sources, serves to rapidly identify new, changed and currently active information resources as they are announced dynamically. Furthermore, multiple references and changed or corrected resource locators are also expediently collected from the dynamic information sources. The database 42 developed through the operation of the present invention is thereby maintained in a complete, timely and current manner.
The preferred method 50 of processing data received via dynamic information sources is shown in FIG. 3. Information received from a dynamic data feed 52 is processed through a general regular expression parser to filter and identify information resource locators 54 within the data feed. Where an information resource locator (IRL) is not found within a packet of data received from the feed 56, the data packet is discarded 58 and the next packet is examined 54.
Where an information resource locator is identified, the form of the resource locator is converted as necessary and if possible to a uniform resource locator form 60. The database 42 is then searched 62 to determine whether the URL already exists in the database 42. If the URL exists in the database 42, the IRL is discarded 64. The database 42 may, nonetheless, be updated to reflect a repeated reference of the URL, thereby indicating the degree of current activity and the interest in and relative importance of the URL. Accordingly, a repeated reference count field associated with the URL in the database 42 can be incremented with each repeated dynamic URL reference.
Where the URL is not found in the database 42, a client request is made to the Internet 14 to retrieve information from the URL at 66. If no valid response to the URL client request is received 68, the IRL is again discarded 70.
Where the URL is determined to be valid, the URL and a contextually appropriate sampling of the information returned by the URL client request are saved to the database 42 at 72. If any information packets from the dynamic data feed remain 74, the next data packet is examined 54. Otherwise, the process terminates 76 generally until the dynamic data feed 52 resumes.
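The flow of FIG. 3 can be summarized in a short sketch. The function `process_feed`, its callback parameters, the dictionary used as a stand-in for the database 42, and the `repeat_count` field name are all hypothetical; the numerals in the comments refer to the steps named in the description, and network access is abstracted behind the `fetch` callback so the sketch stays self-contained.

```python
from typing import Callable, Dict, Iterable, Optional


def process_feed(
    packets: Iterable[str],
    find_locator: Callable[[str], Optional[str]],
    fetch: Callable[[str], Optional[str]],
    database: Dict[str, dict],
) -> None:
    """Process a dynamic data feed, one packet at a time."""
    for packet in packets:                    # 52, 54: receive and filter
        url = find_locator(packet)            # 60: assumed to return URL form
        if url is None:
            continue                          # 58: no locator, discard packet
        if url in database:                   # 62: already known?
            database[url]["repeat_count"] += 1  # note the repeated reference
            continue                          # 64: discard
        context = fetch(url)                  # 66: issue the client request
        if context is None:
            continue                          # 70: no valid response, discard
        database[url] = {"context": context, "repeat_count": 0}  # 72: save


if __name__ == "__main__":
    db: Dict[str, dict] = {}
    process_feed(
        ["hello", "http://a.example/", "http://a.example/"],
        find_locator=lambda p: p if p.startswith("http://") else None,
        fetch=lambda u: "sample of " + u,
        database=db,
    )
    print(db)
```

Passing the locator finder and the fetcher as callbacks is a choice made here so that each stage can be exercised in isolation; the disclosure does not prescribe this decomposition.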
In accordance with a preferred embodiment of the present invention, the URLs identified from the dynamic information sources via the process 50 are further explored by the depth-five web crawler in combination with the execution of a revalidation process 80, as shown in FIG. 4. The modified web crawler is initiated to revalidate the URL database 82 on a periodic basis, if not continuously. As part of the programmed operation of the web crawler, a URL is selected from the database 42 for consideration as to whether to purge the selected URL from the database 42 at 84. The determination is made based on an initial evaluation of the purge characteristics established with the URL. These characteristics are stored as data fields associated with the URL in the database 42. These characteristic fields may store information relating to the URL including an indication of the age of the URL since the URL was first identified by the service 36, the frequency that the content associated with the URL changes as discovered through the process of validation, the frequency that the URL has moved, and the number of failed responses within the current threshold purge period. These and other similar characteristics may be utilized in combination to determine how frequently the modified web crawler should operate to revalidate a particular URL. Where the characteristics necessary for the web crawler to revisit the URL as part of the validation/purge process are not met 86, a next URL is selected 84.
Where a URL has been newly added to the database 42, a default period of approximately one week is established as the frequency of revalidating the URL. However, the first time that the modified web crawler considers a newly added URL, the revisit and database update characteristics are by definition met, in order to force revalidation and to ensure that any deeper URLs associated with this new URL are immediately explored by the modified web crawler and, as appropriate, are each in turn added to the database 42.
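The revisit rule for newly added URLs can be expressed as a simple check. The one-week default comes from the text; the names `RevisitState` and `revisit_due`, and the use of a missing last-visit time to represent a never-considered URL, are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_REVISIT = timedelta(weeks=1)  # "approximately one week" in the text


@dataclass
class RevisitState:
    """Hypothetical revisit fields stored with a URL."""
    period: timedelta = DEFAULT_REVISIT
    last_visit: Optional[datetime] = None  # None: not yet considered by the crawler


def revisit_due(state: RevisitState, now: datetime) -> bool:
    if state.last_visit is None:
        return True  # newly added URL: criteria are met by definition
    return now - state.last_visit >= state.period


if __name__ == "__main__":
    now = datetime(1996, 3, 1)
    print(revisit_due(RevisitState(), now))
    print(revisit_due(RevisitState(last_visit=now - timedelta(days=3)), now))
```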
Thus, the process 80 operates to revalidate a new or appropriately aged URL at 88. A URL client request is issued to the Internet 14 and any appropriate server response is captured and filtered for context for comparison against any prior version of the URL context as stored in the database 42. Where the selected URL is valid and the received context has not been changed, the age and other characteristics relating to the revisit/purge criteria determination are adjusted or updated at 98 in the database 42. If any unexplored URLs remain in the database 42 at 100, another URL is selected 84. Otherwise, the current iteration of revalidation of the database 42 is complete 98.
Where no valid response is received back from the URL server, or the context derived from the response received differs from the context stored by the database 42, the process 80 then determines whether, for an invalid response at 92, the purge threshold criteria for the URL have been reached. Where the purge criteria have not been met or only the context associated with the URL has changed, the URL revisit related data and update frequency data associated with the URL are modified in the database 42 at 94. Specifically, a new period for revisiting the URL is calculated based on an average of the rate of change of the URL context, the number of invalid responses in the current validation period is accounted for or reset, and any new context is updated to the database 42. Where the context has changed, any URLs referenced in the new context are explored by the modified web crawler beginning at 84.
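The text does not specify how the averaged rate of change is computed. One possible reading, offered only as an assumption, is to revisit a URL at the mean interval between the context changes observed so far. The function name `next_revisit_period`, its parameters, and the one-week fallback for URLs with fewer than two observed changes are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def next_revisit_period(
    change_times: Sequence[datetime],
    default: timedelta = timedelta(weeks=1),
) -> timedelta:
    """Mean interval between observed context changes (times in ascending order)."""
    if len(change_times) < 2:
        return default  # too little history: fall back to the default period
    span = change_times[-1] - change_times[0]
    return span / (len(change_times) - 1)


if __name__ == "__main__":
    changes = [datetime(1996, 1, 1), datetime(1996, 1, 3), datetime(1996, 1, 9)]
    print(next_revisit_period(changes))
```

Under this reading a page whose context changes often is revisited often, and a stable page drifts toward longer intervals as its recorded changes spread out.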
Once the purge threshold criteria have been met following an invalid URL server response, the URL is marked as “dead” and the associated context is purged from the database 42 at 96. The process 80 then resumes with the selection of a next URL from the database 42 to potentially revisit at 84.
Thus, a comprehensive system for maintaining a resource locator map describing information resources accessible through the Internet and identified through the combined examination of both static and dynamic information sources has been described.
While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
I claim:
1. A system of autonomously maintaining a searchable database of information accessible over the Internet, said system comprising:
a) a discrimination system coupleable to the Internet to receive messages including electronic mail messages and network news messages, said discrimination processing said electronic mail and network news messages to identify embedded URLs; and
b) a validation system coupleable to the Internet, said validation system coupled to said discrimination system to receive a predetermined embedded URL, said validation system enabling an access of the Internet to retriev