United States Patent [19]
Kirsch

US005855020A
[11] Patent Number: 5,855,020
[45] Date of Patent: Dec. 29, 1998

[54] WEB SCAN PROCESS

[75] Inventor: Steven T. Kirsch, Los Altos, Calif.

[73] Assignee: Infoseek Corporation, Sunnyvale, Calif.

[21] Appl. No.: 604,584

[22] Filed: Feb. 21, 1996

[51] Int. Cl.6 ........... G06F 17/30
[52] U.S. Cl. ............ 707/10; 707/104; 707/2; 395/200.33
[58] Field of Search ..... 395/326, 602, 395/793, 610, 800, 187.01,
     200.36, 200.48, 200.33; 345/335; 707/5, 9; 702/2, 104, 10
`
[56] References Cited

U.S. PATENT DOCUMENTS

5,572,643   11/1996   Judson ............. 395/793
5,710,918    1/1998   Lagarde et al. ..... 707/10
5,751,956    5/1998   Kirsch ............. 395/200.33
5,752,246    5/1998   Rogers et al. ...... 707/10
5,761,499    6/1998   Sonderegger ........ 707/10

OTHER PUBLICATIONS

Cole et al. "Oracle Spins Web Strategy", Network World, v12, n3,
pp. 1 & 49, Jan. 16, 1995.
Davis, Jessica "EMail World/Internet Expo to Feature Web
Solutions", v18, n8, p. 6, Feb. 19, 1996.
Berners-Lee "The World-Wide Web", Communications of the ACM,
v37, n8, pp. 76-82, Aug. 1994.
Nadile, Lisa "Adobe Targets Mac Web Development", PC Week, v12,
n40, p62(1), Oct. 9, 1995.
Snell, Jason "Webtop Publishing Here at Last", MacUser, v11, n12,
p44(2), Dec. 1995.

Primary Examiner: Wayne Amsbury
Assistant Examiner: Charles L. Rones
Attorney, Agent, or Firm: Fliesler, Dubb, Meyer & Lovejoy
`
[57] ABSTRACT

An information locator system providing for the expedient
acquisition, validation and updating of information locators in a
heterogeneous network protocol environment. The locator system
includes an information location discrimination engine coupleable
to a network operating in the heterogeneous network protocol
environment, a validation engine coupled to the information
location discrimination engine to receive information locators and
a database providing for the storage of information locators as
discrete searchable resource locators. The validation engine is
also connected to the database for retrieving and storing resource
locators. The validation engine provides for the autonomous
interrogation of the heterogeneous network protocol environment to
validate a predetermined information locator as a corresponding
resource locator that is unique to the discrete searchable resource
locators then stored by the database. Where a valid and inferred
unique information locator is found, the validation engine provides
a corresponding resource locator to the database for subsequently
searchable storage.

10 Claims, 3 Drawing Sheets
`
U.S. Patent    Dec. 29, 1998    Sheet 1 of 3    5,855,020

[Drawing sheet 1 of 3 (FIG. 1 and FIG. 2), not reproduced. Labels
recoverable from the sheet: CONSOLE; INTERNET; USER 0 ... USER N;
FTP; GOPHER; WORLD WIDE WEB; OTHER STATIC INFORMATION SOURCES;
NET NEWS; LISTSERV; OTHER DYNAMIC INFORMATION SOURCES;
VALIDATION & SEARCH ENGINE; DATABASE.]
U.S. Patent    Dec. 29, 1998    Sheet 2 of 3    5,855,020

[Drawing sheet 2 of 3 (FIG. 3, flow diagram 50), not reproduced.
Box labels recoverable from the sheet: RECEIVE DYNAMIC DATA FEED
(52); FILTER FOR INFORMATION RESOURCE LOCATORS; IRL FOUND?;
DISCARD INFORMATION; CONVERT TO UNIVERSAL RESOURCE LOCATOR FORM
(60); VALIDATE URL, CAPTURE CONTEXT (66); DISCARD URL (70);
SAVE URL (72); end of process (76). Decision branches are
labeled YES and NO.]
U.S. Patent    Dec. 29, 1998    Sheet 3 of 3    5,855,020

[Drawing sheet 3 of 3 (FIG. 4, flow diagram 80), not reproduced.
Box labels recoverable from the sheet: REVALIDATE URL; SELECT URL
TO REVISIT (84); VALIDATE URL, RE-CAPTURE CONTEXT (88); UPDATE URL
AND CONTEXT IN (94); reference numeral 102. Decision branches are
labeled YES and NO.]
`1
`WEB SCAN PROCESS
`
`5,855,020
`
`20
`
`2
`Typical protocol identifiers include FTP, Gopher, HTTP,
`and News. The protocol server address typically is of the
`form "prefix.domain," where the prefix is typically "www"
`for web servers and "ftp" for FTP servers. The "domain" is
`the standard Internet sub-domain.top_level-domain of the
`server. Optional qualifiers may be provided to specify, for
`example, a particular hypertext page maintained by a web
`server or a sub-directory accessible through an FTP server.
`Internet protocols such as FTP, Gopher and HTTP provide
`10 access typically to generally static information sources. The
`information is not entirely static, but rather typified by a
`static basic URL that provides referential access to infor(cid:173)
`mation that is substantially persistent and typically updated
`or expanded on a periodic basis. Other Internet transport
`protocols exist to support dynamic information sources.
`These dynamic information sources are typified as highly
`fluid streams of information, often defined as articles or
`messages, exchanged via the Internet. In general, the content
`of these information streams is not persistent at least in the
`sense that the information is not immediately organized and
`accessible, if ever, through generally static URLs.
`A principle dynamic information source is the network
`news as transported over the Internet using the network
`news transfer protocol (NNTP). The network news system,
`25 historically referred to as Usenet, provides for the succes(cid:173)
`sively up stream and down stream propagation of news
`articles between interconnected computer systems.
`Specifically, news articles are posted to logically defined
`news groups and are propagated generally via the Internet to
`30 other computer systems that temporarily store the articles
`subject to expiration rules. Each participating computer
`system also serves to propagate the articles to other com(cid:173)
`puter systems that have not previously received the propa-
`gating news articles.
`Another and again historically older dynamic information
`source is provided by independently operating list servers
`(ListServ) residing on computer systems that are, in general,
`connected to the Internet. A list server is a typically auto(cid:173)
`mated service that functions autonomously to repeat elec-
`40 tronic mail messages received by a publicly-known list
`server E-Mail account to an established list of subscribers
`known to the list server by explicit or fully qualified E-Mail
`addresses. The list server is thus an automated electronic
`remailer that allows a one to many distribution of E-Mail
`45 messages through the indirection operation of the list server.
`The remailing of E-Mail messages is typically dynamic and,
`therefore, persistent messages are maintained, if at all,
`selectively by the subscribers of a particular mailing list.
`Furthermore, the list servers are themselves subject to
`50 extreme variability in location and operation since only a
`publicly available dedicated E-Mail address is required in
`substance to operate a list server.
`The ability to simply track if not expediently search for
`information available via the Internet has not kept pace with
`55 the rapid expansion of information available via the Internet.
`One predominant source of new information appears as
`essentially static web pages. Various automatons, often
`generally referred to as "web crawlers," have been devel(cid:173)
`oped to incrementally trace through URLs embedded in the
`60 various web pages and thereby develop an information map
`of available information resources within the logical web
`space. Since the Web is not entirely static, but rather greatly
`increasing in its extent and complexity on a continuing basis,
`web crawlers face a daunting task in repeatedly tracing out
`65 and maintaining a web space map of URLs.
`Simply tracing through all URLs available via the web is
`not practical if only in terms of the time and cost required to
`
`CROSS-REFERENCE TO RELATED
`APPLICATIONS
`The present application is related to the following 5
`Application, assigned to the Assignee of the present Appli-
`cation:
`1) METHOD AND APPARATUS FOR REDIRECTION
`OF SERVER EXTERNAL HYPER-LINK REFERENCES,
`invented by Kirsch, U.S. Pat. No. 5,751,956, filed concur(cid:173)
`rently herewith, and
`2) SECURE, CONVENIENT AND EFFICIENT SYS(cid:173)
`TEM AND METHOD OF PERFORMING TRANS-
`INTERNET PURCHASE TRANSACTIONS, invented by
`Kirsch, application Ser. No. 08/604,506, filed concurrently 15
`herewith.
`
`BACKGROUND OF THE INVENTION
`1. Field of the Invention
`The present invention is generally related to systems for
`discriminating and organizing informational locators or key
`references obtained from source information and, in
`particular, to a system and process for expediently develop(cid:173)
`ing locators of independently distributed information acces(cid:173)
`sible through a heterogeneous protocol network, such as the
`Internet.
`2. Description of the Related Art
`The national and international packet switched public
`network generically referred to as the Internet has existed for
`some time. Although often referred to as a single techno(cid:173)
`logical entity, the Internet is represented by a substantial
`complex of communication systems ranging from conven(cid:173)
`tional analog and digital telephone lines through fiber optic,
`microwave and satellite communications links. The physical 35
`structure of the Internet is logically unified through the
`establishment of common information transport protocols
`and addressing and resource referencing schemes that allow
`quite disparate computer systems to communicate both
`locally and internationally with one another.
`Common information transport protocols include the
`basic file transfer protocol (FTP) and simple mail transfer
`protocol (SMTP). Other information transport protocols that
`are progressively more interactive, particularly in a visual
`manner, include the comparatively simple telnet protocol
`and the typically telnet based gopher information request
`and retrieval service.
`Recently, a new information transport protocol, known as
`the hypertext transfer protocol (HTTP), has been widely
`accepted on the Internet. This transport protocol is utilized
`to support a graphically interactive distributed information
`system variously known as the World Wide Web (WWW or
`W3) or simply as "the Web." The HTTP protocol provides
`for the transfer of both textual and graphical information via
`the Internet in a coordinated manner based on a system of
`client web page browser requests and remote web page
`server information responses. An HTTP session is estab(cid:173)
`lished between a client browser and page server based on an
`HTTP transaction initiated in response to a browser refer(cid:173)
`ence to a uniform resource locator (URL). The URL system
`was comparatively recently established to provide a conve(cid:173)
`nient and de-facto standardized format by which different
`Internet based or accessed information sources can be iden(cid:173)
`tified by type, and therefore inferentially by access transport
`protocol. In general, URLs have the following form:
`<protocol identifier>://<protocol server address>/
`<qualifier>
`
`

`

actually complete a trace before substantial portions of the map
are antiquated by the addition and gradual revision of web URLs.
Some estimates of the size of the Web place the number of presently
active URLs at greater than about 50 million and growing rapidly.
Furthermore, any such incremental tracing must be, by any practical
definition, incomplete. A URL trace must contend with problems of
infinite depth due to URL mutual references and reference looping,
made further complex by the existence of URL aliases. A trace must
also deal with discrete discontinuities that inherently exist at
any given time in the basic structure of the URL defined web space.
Normally a self-contained or only outwardly directed island
(connected group) of URL references may exist either by choice or
as a consequence of the delay in the ponderous operation of web
crawlers before discovering a URL trace that leads to a URL island.
This tracing delay is conventionally reduced by trimming the depth
at which URLs are traced from a base URL. However, this strategy
actually results in an increased likelihood of more islands
existing with a greater distribution of and even larger islands of
URLs being excluded from the URL map created by a web crawler.

A class of Internet business services (IBS) has developed to deal
with the problems of locating information available through the
Internet. These business services characteristically utilize web
crawlers to establish searchable web space maps. These maps, in
turn, are made available on the Internet typically through an
advertising supported or user-fee based search engine interface
accessible via a defined web page. One well-known and one of the
oldest Web searching systems is provided by Lycos, Inc.®
(www.lycos.com). Completeness and timeliness of the listing of
information resources available through the Internet is of
paramount concern to such Internet business services. These
problems are of particular importance since the newest sources of
information are often the most important to subscribers of such
Internet business services. A related problem is in identifying for
the subscriber the most active of current interest information
sources. The ability to ensure the completeness, timeliness and
currentness of the searchable information available through an
Internet business service is therefore highly desirable. However,
because of the fundamental nature of web crawlers and the fully
distributed nature of the web space, no direct method or system of
achieving these goals is conventionally known. For example, Lycos
has developed a search strategy based on conducting an essentially
random search of URLs tempered by preferences. These preferences
allow for the explicit or manual specification of starting URLs to
include in the search and generally automated efforts by the search
engine to identify and traverse Web server home pages, Web pages
with substantial external links, user home pages and URLs that are
short, suggestive of a logical if not actual server hierarchy of
Web pages. However, the Lycos search system is otherwise limited to
the identification of URLs from the pages selected for traversal.
The application of these preferences, the practical limitation of
the depth of URL search and the randomness of the URL tracing
operation may all act to inadvertently limit or at least
substantially delay the inclusion of new Web URLs and even entire
Web islands into the Web map space traced by the Lycos Web crawler.

SUMMARY OF THE INVENTION

Thus, a general purpose of the present invention is to provide a
system and method of identifying and verifying new resource
locators of static or comparatively static information culled from
dynamic sources of information.

This is achieved by the present invention through an information
locator system providing for the expedient acquisition and
validation of information locators in a heterogeneous network
protocol environment. The locator system includes an information
location discrimination engine coupleable to a network operating in
the heterogeneous network protocol environment, a validation engine
coupled to the information location discrimination engine to
receive information locators and a database providing for the
storage of information locators as discrete searchable resource
locators. The validation engine is also connected to the database
for retrieving and storing resource locators. The validation engine
provides for the autonomous interrogation of the heterogeneous
network protocol environment to validate a predetermined
information locator as a corresponding resource locator that is
unique among discrete searchable resource locators then stored by
the database. Where a valid and inferred unique information locator
is found, the validation engine provides a corresponding resource
locator to the database for subsequently searchable storage.

To support the currentness of the database, an update and purge
algorithm is also associated with the validation engine for
periodically updating or removing obsolete or invalid resource
locators from the database.

Thus, an advantage of the present invention is that a dynamic
source of information is used to identify new, rapidly changing and
frequently referenced resource locators.

Another advantage of the present invention is that one or more
dynamic sources of information can be mutually referenced to
identify potential resource locators and that existing sources of
information and database stores of resource locators can be
utilized to screen for and verify unique resource locators that are
then added to the resource locator database.

A further advantage of the present invention is that the resource
locator database is searchable both for supporting the validation
of unique resource locators and for supporting contextually based
database searches for resource locator references.

Still another advantage of the present invention is that multiple
sources of information, each transported via a corresponding
network protocol, can be dynamically filtered for potential
resource locators.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other advantages and features of the present invention
will become better understood upon consideration of the following
detailed description of the invention when considered in connection
with the accompanying drawings, in which like reference numerals
designate like parts throughout the figures thereof, and wherein:

FIG. 1 illustrates a client/server system architecture utilizing
heterogeneous protocols in a networking environment;

FIG. 2 illustrates multiple static and dynamic data sources as
available through the Internet network;

FIG. 3 is a flow diagram of the discrimination and validation of
new resource locators from dynamic data sources in accordance with
a preferred embodiment of the present invention; and

FIG. 4 is a flow diagram of a purge process through selective
revalidation of resource locators previously stored in the resource
locator database in accordance with a preferred embodiment of the
present invention.
DETAILED DESCRIPTION OF THE INVENTION

A typical environment 10 utilizing the Internet for network
services is shown in FIG. 1. A client computer system 12 is coupled
directly or through an Internet service provider (ISP) to the
Internet 14. By logical reference, a uniform resource locator
corresponding to an Internet server system 16, 18 may be accessed.
Provided that a common protocol is supported and mutual access
permissions are met, a transaction between the client 12 and server
16 can be initiated.

As graphically illustrated in FIG. 2, client users 20_0-20_n have
logically transparent access via the Internet 14 to a wide array of
information sources disparately served by servers coupled to the
Internet 14. Different information sources including FTP 22, Gopher
24, World Wide Web 26 and other static information sources 28 exist
as persistent information sources available to the users 20_0-n.
Net News 30, ListServ 32, and other dynamic information sources 34
provide typically subscriber based information on an on-going basis
to the users 20_0-n according to respective subscription profiles.

An Internet business service 36, in accordance with a preferred
embodiment of the present invention, is coupled to the Internet 14
to obtain access to both the static and dynamic sources of
information. By the same connection, the Internet business service
36 is also itself accessible as a static information source to the
users 20_0-n via the Internet 14.

In accordance with a preferred embodiment of the present invention,
a discrimination engine 38 is provided to process the dynamic
information sources 30, 32, 34 to identify information resource
locators typically in the form of URLs. Preferably, a full feed of
Network news 30 is routed to the discrimination engine 38 by
subscription established by the Internet business service 36 with
an upstream Internet service provider. Net News articles are
thereby directed to the discrimination engine 38 on an as
propagated basis via the Internet 14. At present, full Net News
feed 30 transports up to 1 gigabyte or more of information per day.

The discrimination engine 38 preferably implements a conventional
regular expression parser that filters the Net News article stream
for occurrences of information resource locators. In the preferred
embodiment of the present invention, properly formed uniform
resource locators are identified by the parser and extracted from
the Net News article stream by the discrimination engine 38.
Additionally, the discrimination engine 38 may implement the parser
to recognize incompletely formed URLs. For example, a text sequence
constructed as www.sub-domain.top-level-domain (where sub-domain is
an identifier and top-level-domain can be edu, gov, org, com or a
two-letter country code) may be recognized as an implied HTTP URL.
Similarly, a text stream of the form
ftp.sub-domain.top-level-domain may be recognized as an implied FTP
resource locator. Accordingly, the parser within the discrimination
engine 38 can be made to utilize assumptions about the proper form
of an information resource locator generally consistent with the
assumptions that a conventional end user 20_0-n might reasonably
make.
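As a rough illustration of the filtering just described, the following Python sketch pulls both fully formed and implied locators out of article text. The two patterns, the function name `extract_locators`, and the choice to strip trailing punctuation are assumptions made for this example; the description itself specifies only a conventional regular expression parser.

```python
import re

# Assumed pattern for properly formed locators (scheme://host/qualifier).
FULL_URL = re.compile(r"\b(?:http|ftp|gopher)://[^\s<>()]+", re.IGNORECASE)

# Assumed pattern for implied locators: www.* or ftp.* followed by a
# sub-domain and one of edu, gov, org, com or a two-letter country code.
# The lookbehind keeps it from matching inside an already complete URL.
IMPLIED = re.compile(
    r"(?<![\w/.@-])(www|ftp)\.(?:[a-z0-9-]+\.)+(?:edu|gov|org|com|[a-z]{2})\b",
    re.IGNORECASE,
)

def extract_locators(text):
    """Return the locators found in text, all expressed in URL form."""
    found = []
    for m in FULL_URL.finditer(text):
        # Sentence punctuation glued to the end of a URL is dropped.
        found.append(m.group(0).rstrip(".,;:!?"))
    for m in IMPLIED.finditer(text):
        # "www" implies an HTTP URL, "ftp" implies an FTP locator.
        scheme = "http" if m.group(1).lower() == "www" else "ftp"
        found.append(f"{scheme}://{m.group(0)}/")
    return found
```

Fully formed locators are listed first, followed by the implied ones; a production parser would more likely preserve the order of appearance in the article.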
All information resource locators identified by the discrimination
engine 38 are provided to the validation and search engine 40. A
corresponding URL reference is constructed, if need be, and a
search is performed against a local database 42 containing a list
of URLs as constructed by the Internet business service 36
utilizing web crawler techniques and the on-going operation of the
present invention. Where the corresponding URL is unlisted in the
database 42, the validation and search engine 40 issues a
corresponding URL client request via the Internet to determine
whether an information server provides a valid response. Responses
indicating that the request is barred due to insufficient access
privileges or that the requested information no longer exists are
treated as indicating that the URL reference is invalid. Equally,
the failure of any server to respond is treated as an invalidating
response. If a reference is determined to be invalid for some
number of consecutive attempts by the validation engine 40 to
validate the reference over some time period, the information
resource locator is marked as a "dead" URL and any contextual
information stored by the database 42 in association with the URL
is effectively purged from the database 42. Preferably, the purge
threshold is set at failure of five consecutive validation attempts
made within a ten day period.
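The purge threshold can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `UrlRecord` class, the `record_attempt` function and the in-memory list of failure timestamps are hypothetical names and structures, not taken from the description; only the figures of five consecutive failures and ten days come from the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

PURGE_FAILURES = 5                 # consecutive failed validations (from the text)
PURGE_WINDOW = timedelta(days=10)  # period in which they must fall (from the text)

@dataclass
class UrlRecord:                   # hypothetical database row
    url: str
    context: str = ""
    dead: bool = False
    failures: list = field(default_factory=list)  # times of consecutive failures

def record_attempt(rec, ok, when, context=""):
    """Update rec after one validation attempt made at time `when`."""
    if ok:
        rec.failures.clear()       # a valid response breaks the failure run
        rec.context = context
        return rec
    rec.failures.append(when)
    # Only failures inside the ten day window ending now count.
    rec.failures = [t for t in rec.failures if when - t <= PURGE_WINDOW]
    if len(rec.failures) >= PURGE_FAILURES:
        rec.dead = True            # mark as a "dead" URL
        rec.context = ""           # purge the stored contextual information
    return rec
```

Treating an access-denied response, a not-found response and a timeout alike, as the text does, means the caller only needs to pass a single boolean for each attempt.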
Where a valid information resource locator is found, the
corresponding URL and selected contextual information received as
part of the validity verification are then stored in the database
42.

In a similar manner, the Internet business service 36 preferably
subscribes to independently identified mailing lists managed and
propagated by list servers 32 and other dynamic information sources
34. Once subscribed, the list servers 32 and other dynamic
information sources 34 provide logically parallel dynamic
information streams to the discrimination engine 38 of the Internet
business service 36. This information is again parsed by the
discrimination engine 38 to identify potential information resource
locators.

The database 42 is initially built, in accordance with a preferred
embodiment of the present invention, through the operation of a
conventional web crawler modified in a conventional manner to limit
recursive crawl to a URL reference depth of five. Although other
crawl depths could be used, a depth of five has been empirically
established as adequate when used in conjunction with the present
invention. New URLs identified from the dynamic information sources
are provided in an effective manner to the web crawler of the
present invention for further exploration. Consequently, the direct
operation of the depth-five web crawler is sufficient and
appropriate for identifying new information resources that exist in
active areas. The present invention, by operation on dynamic
information sources, serves to rapidly identify new, changed and
currently active information resources as they are announced
dynamically. Furthermore, multiple references and changed or
corrected resource locators are also expediently collected from the
dynamic information sources. The database 42 developed through the
operation of the present invention is thereby maintained in a
complete, timely and current manner.
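A depth-limited crawl of the kind described above might look like the following Python sketch. It is written breadth-first with a set of already seen URLs, which is an assumption made for the example (the text speaks only of limiting a recursive crawl to a reference depth of five); the `crawl` function and the `get_links` callback that returns the URLs referenced by a page are hypothetical names.

```python
from collections import deque

def crawl(seed, get_links, max_depth=5):
    """Return every URL reachable from seed within max_depth references."""
    seen = {seed}                       # guards against reference loops and aliases
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:          # do not follow links past the depth limit
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

With a toy link graph such as `{"a": ["b"], "b": ["c"], ..., "h": ["a"], "x": ["y"]}`, a crawl seeded at `"a"` stops five references out and never reaches the island `"x"`, `"y"`. Seeding a second crawl with a URL announced in a dynamic source is what lets such an island be discovered.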
The preferred method 50 of processing data received via dynamic
information sources is shown in FIG. 3. Information received from a
dynamic data feed 52 is processed through a general regular
expression parser to filter and identify information resource
locators 54 within the data feed. Where an information resource
locator (IRL) is not found within a packet of data received from
the feed 56, the data packet is discarded 58 and the next packet is
examined 54.

Where an information resource locator is identified, the form of
the resource locator is converted as necessary and if possible to a
uniform resource locator form 60. The database 42 is then searched
62 to determine whether the URL already exists in the database 42.
If the URL exists in the database 42, the IRL is discarded 64. The
database 42 may, nonetheless, be updated to reflect a repeated
reference of the URL, thereby indicating degree of current activity
and the interest in and relative importance of the URL.
Accordingly, a repeated reference count field associated with the
URL in the database 42 can be incremented with each repeated
dynamic URL reference.

Where the URL is not found in the database 42, a client request is
made to the Internet 14 to retrieve information from the URL at 66.
If no valid response to the URL client request is received 68, the
IRL is again discarded 70.

Where the URL is determined to be valid, the URL and a contextually
appropriate sampling of the information returned by the URL client
request are saved to the database 42 at 72. If any information
packets from the dynamic data feed remain 74, the next data packet
is examined 54. Otherwise, the process terminates 76 generally
until the dynamic data feed 52 resumes.
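The steps of method 50 can be strung together as in the Python sketch below. It is a simplified illustration, not the patented implementation: the dictionary used as the database, the `fetch` callback (which returns page text, or `None` for any invalid response), the 200-character context sample and the single URL pattern are all assumptions introduced for the example. The numbers in the comments refer to the reference numerals in the text.

```python
import re

# Assumed, simplified locator pattern; see the text for the implied forms.
URL_RE = re.compile(r"(?:http|ftp|gopher)://[^\s<>()]+", re.IGNORECASE)

def process_feed(packets, database, fetch):
    """database maps url -> {"context": str, "refs": int}."""
    for packet in packets:                    # 52, 54: receive and filter the feed
        for raw in URL_RE.findall(packet):    # 56: a packet with no IRL is skipped (58)
            url = raw.rstrip(".,;")           # 60: convert to URL form (simplified)
            if url in database:               # 62: already known?
                database[url]["refs"] += 1    # 64: discard, but count the repeat
                continue
            page = fetch(url)                 # 66: issue the client request
            if page is None:                  # 68: no valid response
                continue                      # 70: discard the IRL
            # 72: save the URL with a sample of the returned information
            database[url] = {"context": page[:200], "refs": 0}
    return database
```

The `refs` field here plays the role of the repeated reference count: it starts at zero and rises each time the same URL is seen again in the feed.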
In accordance with a preferred embodiment of the present invention,
the URLs identified from the dynamic information sources via the
process 50 are further explored by the depth-five web crawler in
combination with the execution of a revalidation process 80, as
shown in FIG. 4. The modified web crawler is initiated to
revalidate the URL database 82 on a periodic basis, if not
continuously. As part of the programmed operation of the web
crawler, a URL is selected from the database 42 for consideration
as to whether to purge the selected URL from the database 42 at 84.
The determination is made based on an initial evaluation of the
purge characteristics established with the URL. These
characteristics are stored as data fields associated with the URL
in the database 42. These characteristic fields may store
information relating to the URL including an indication of the age
of the URL since the URL was first identified by the service 36,
the frequency that the content associated with the URL changes as
discovered through the process of validation, the frequency that
the URL has moved, and the number of failed responses within the
current threshold purge period. These and other similar
characteristics may be utilized in combination to determine how
frequently the modified web crawler should operate to revalidate a
particular URL. Where the characteristics necessary for the web
crawler to revisit the URL as part of the validation/purge process
are not met 86, a next URL is selected 84.

Where a URL has been newly added to the database 42, a default
period of approximately one week is established as the frequency of
revalidating the URL. However, the first time that the modified web
crawler considers a newly added URL, the revisit and database
update characteristics are by definition met, in order to force
revalidation and to ensure that any deeper URLs associated with
this new URL are immediately explored by the modified web crawler
and, as appropriate, are each in turn added to the database 42.
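The revisit test at step 86 can be reduced, for illustration, to a single comparison. In this Python sketch the function name `due_for_revisit` and the convention that `last_validated` is `None` for a URL that has never been revalidated are assumptions; the one week default comes from the text.

```python
from datetime import datetime, timedelta

DEFAULT_REVISIT = timedelta(weeks=1)   # default period for a newly added URL

def due_for_revisit(last_validated, now, period=DEFAULT_REVISIT):
    """Return True when the crawler should revalidate the URL now."""
    if last_validated is None:
        # A newly added URL meets the revisit criteria by definition,
        # which forces an immediate first revalidation.
        return True
    return now - last_validated >= period
```

A fuller version would fold in the other stored characteristics, such as how often the URL has moved and the number of recent failed responses.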
Thus, the process 80 operates to revalidate a new or appropriately
aged URL at 88. A URL client request is issued to the Internet 14
and any appropriate server response is captured and filtered for
context for comparison against any prior version of the URL context
as stored in the database 42. Where the selected URL is valid and
the received context has not been changed, the age and other
characteristics relating to the revisit/purge criteria
determination are adjusted or updated at 98 in the database 42. If
any unexplored URLs remain in the database 42 at 100, another URL
is selected 84. Otherwise, the current iteration of revalidation of
the database 42 is complete 98.

Where no valid response is received back from the URL server, or
the context derived from the response received differs from the
context stored by the database 42, the process 80 then determines
whether, for an invalid response at 92, the purge threshold
criteria for the URL have been reached. Where the purge criteria
have not been met or only the context associated with the URL has
changed, the URL revisit related data and update frequency data
associated with the URL are modified in the database 42 at 94.
Specifically, a new period for revisiting the URL is calculated
based on an average of the rate of change of the URL context, the
number of invalid responses in the current validation period is
accounted for or reset, and any new context is updated to the
database 42. Where the context has changed, any URLs referenced in
the new context are explored by the modified web crawler beginning
at 84.

Once the purge threshold criteria have been met following an
invalid URL server response, the URL is marked as "dead" and the
associated context is purged from the database 42 at 96. The
process 80 then resumes with the selection of a next URL from the
database 94 to potentially revisit at 84.
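The text says only that the new revisit period is based on an average of the rate of change of the URL context. One possible reading, sketched here in Python, averages the observed intervals between context changes and clamps the result. The function name `next_revisit_period`, the one day and thirty day bounds and the fallback to the one week default are all assumptions made for the example.

```python
from datetime import timedelta

DEFAULT_PERIOD = timedelta(weeks=1)    # default for URLs with no change history

def next_revisit_period(change_intervals,
                        minimum=timedelta(days=1),
                        maximum=timedelta(days=30)):
    """Average the observed intervals between context changes.

    change_intervals is a list of timedelta values, one per observed
    change of the stored context. The bounds are illustrative only.
    """
    if not change_intervals:
        return DEFAULT_PERIOD
    mean = sum(change_intervals, timedelta()) / len(change_intervals)
    return max(minimum, min(maximum, mean))
```

For example, a page whose context changed after two days and then after four days would be revisited every three days under this reading.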
Thus, a comprehensive system for maintaining a resource locator map
describing information resources accessible through the Internet
and identified through the combined examination of both static and
dynamic information sources has been described.

While the invention has been particularly shown and described with
reference to preferred embodiments thereof it will be understood by
those skilled in the art that various changes in form and details
may be made therein without departing from the spirit and scope of
the invention as defined by the appended claims.

I claim:

1. A system of autonomously maintaining a searchable database of
information accessible over the Internet, said system comprising:

a) a discrimination system coupleable to the Internet to receive
messages including electronic mail messages and network news
messages, said discrimination processing said electronic mail and
network news messages to identify embedded URLs; and

b) a validation system coupleable to the Internet, said validation
system coupled to said discrimination system to receive a
predetermined embedded URL, said validation system enabling an
access of the Internet to retrieve Web page information associated
with said predetermined embedded URL; and

c) a database for searchably storing said predetermined embedded
URL in association with the Web page informa
