(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2005/0125412 A1
     Glover                            (43) Pub. Date: Jun. 9, 2005

(54) WEB CRAWLING

(75) Inventor: Eric J. Glover, North Brunswick, NJ (US)

Correspondence Address:
NEC Laboratories America, Inc.
4 Independence Way
Princeton, NJ 08540 (US)

(73) Assignee: NEC Laboratories America, Inc., Princeton, NJ

(21) Appl. No.: 10/807,698

(22) Filed: Mar. 24, 2004

Related U.S. Application Data

(60) Provisional application No. 60/528,071, filed on Dec. 9, 2003.

Publication Classification

(51) Int. Cl. .......... G06F 7/00
(52) U.S. Cl. .......... 707/10

(57) ABSTRACT

The present invention is directed to mechanisms for improving the “crawling” of resources on a network, which takes into account the notion of browser state. An improved indexing scheme for the crawled results and improved search mechanisms are also disclosed.
`
[Representative drawing: REQUEST(URL, s1); REQUEST(URL, s2)]

Data Co Exhibit 1030
Data Co v. Bright Data
`
`

`

Sheet 1 of 4

[FIG. 1 (PRIOR ART): drawing showing network 100 and pages 121, 122, 123]
`
`

`

Sheet 2 of 4

[FIG. 2: drawing showing REQUEST(URL, s1) and REQUEST(URL, s2)]
`
`

`

Sheet 3 of 4

[FIG. 3: crawler flowchart. START; 301: FOR EACH URL; 302: FOR EACH STATE s IN (s1, s2, ..., sn); 303: REQUEST PAGE AT URL USING BROWSER STATE s; 304: STORE PAGE INDEXED ON URL AND BROWSER STATE s; 306: NEXT URL]
`
`

`

Sheet 4 of 4

[FIG. 4: search engine flowchart. START; 401: RECEIVE REQUEST FROM CLIENT BROWSER; 402: DETECT BROWSER STATE OF CLIENT BROWSER; 403: CONDUCT SEARCH ON PAGES FILTERED BY BROWSER STATE; 404: COMPOSE RESULTS PAGE; 405: SEND RESULTS TO CLIENT; END]
`
`

`

WEB CRAWLING
`
[0001] This Utility Patent Application is a Non-Provisional of and claims the benefit of U.S. Provisional Patent Application Ser. No. 60/528,071 entitled “IMPROVED WEB CRAWLING” filed on Dec. 9, 2003, the contents of which are incorporated by reference herein.
`
`BACKGROUND OF THE INVENTION
`
[0002] The present invention relates to information retrieval and, more particularly, to automated “crawling” techniques for retrieving information on a network.
`
[0003] A vast array of content can be retrieved from servers across a large network such as the Internet. Typically, such content is embodied in documents referred to colloquially as “web pages” created using a markup language such as the Hypertext Markup Language (HTML) and retrieved by a client “browser” using a protocol such as the Hypertext Transfer Protocol (HTTP). See, e.g., R. Fielding et al., “Hypertext Transfer Protocol -- HTTP/1.1,” Internet Engineering Task Force (IETF), Request for Comments (RFC) 2616 (June 1999); T. Berners-Lee, D. Connolly, “Hypertext Markup Language,” IETF, RFC 1866 (November 1995). Such documents on the World Wide Web are typically identified using a Uniform Resource Locator (URL), e.g., in the form “http://www.example.com/dir/page.html”. See T. Berners-Lee, “Uniform Resource Identifiers in WWW,” IETF, Network Working Group, RFC 1630 (June 1994); T. Berners-Lee, L. Masinter, M. McCahill, eds., “Uniform Resource Locators (URL),” IETF, Network Working Group, RFC 1738 (December 1994). Given the large amount of content available on the Internet, it has become advantageous to provide searchable databases of content and/or content metadata. A typical search engine on the Internet today operates by a process referred to as “crawling” web pages, whereby a large number of documents are automatically retrieved and stored for analysis and indexing.
`
[0004] Recently, it has become common for many popular web servers to return multiple versions of content for the same URL. This is typically accomplished through the use of “browser state” and can be used, for example, to customize the web page to particular languages or to reflect some personal preferences of the user of the client browser. Unfortunately, typical search engines only offer a single “browser state” and are unable to “see” the different content associated with the same URL. The problem is made worse in that most search engines index the “crawled” web pages by URL alone, which typically permits storing only one copy of a given web page. Even if a search engine crawler by coincidence retrieves the different content, the search engine typically must select only one of the multiple versions of content to associate with the particular URL. The problem is manifest by the fact that a searching user, who has a “browser state” different from that of the crawler used to find a given page, might click on a result and not find the correct contents identified by the search engine, or in fact might never be able to find the correct results because the crawler was unable to find the documents associated with a state different from their own.
`
SUMMARY OF THE INVENTION
`
[0005] The present invention is directed to an improved technique for “crawling” for resources, such as web pages, in a network. An improved crawler is disclosed which is modified to fetch at least one page (and possibly all pages) with a different browser state. As discussed in further detail herein, the browser state can represent a variety of different parameters/information about a client browser to a server, such as a language or locale preference, a reported browser string, a geographic location (e.g. based on the IP address or locale settings of the browser) or other factors.
`
[0006] The present invention is also directed to an improved scheme for storing and/or indexing the crawled results and for searching through the results. A database can be readily constructed in which a combination of the uniform resource locator and the browser state is utilized as an identifier. Hence, the same uniform resource locator could be saved more than once in the database, once for each different browser state. When a user performs a search, the user’s browser state can be used to select the matching pages.
`
[0007] These and other advantages of the invention will be apparent to those of ordinary skill in the art by reference to the following detailed description and the accompanying drawings.
`
`BRIEF DESCRIPTION OF DRAWINGS
`
[0008] FIG. 1 shows a client host in communication with a server host in accordance with the prior art.

[0009] FIG. 2 shows a client host in communication with a server host in accordance with an embodiment of an aspect of the invention.

[0010] FIG. 3 is a flowchart of processing performed by a crawler, in accordance with an embodiment of this aspect of the invention.

[0011] FIG. 4 is a flowchart of processing performed by a search engine, in accordance with an embodiment of this aspect of the invention.
`
`DETAILED DESCRIPTION OF THE
`INVENTION
`
[0012] In FIGS. 1 and 2, a client host 110 is shown in communication through a network 100 with a server host 120. It is assumed without limitation that the client host 110 is executing some crawler application and that the server host 120 is executing some server application that provides the crawler application access to various resources stored at the server or accessible to the server. For example, and without limitation, the server application can be an HTTP server, such as APACHE, and the crawler application can be a script that issues a series of HTTP client requests. It is also assumed that the communication network 100 provides connectivity using some advantageous protocol, such as TCP/IP. It should be noted that the present invention is not limited to any such particular communication protocol or to any such particular crawler application or client-server architecture.
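
As a minimal sketch of “a script that issues a series of HTTP client requests,” the Python fragment below builds, without sending, one request per browser state for the same URL. The function name `build_requests`, the dictionary keys `language` and `user_agent`, and the second User-Agent string are illustrative assumptions and not part of the disclosure. Only voluntary settings that can be expressed as standard HTTP headers (`Accept-Language`, `User-Agent`) are modeled.

```python
from urllib.request import Request


def build_requests(url, states):
    """Return one unsent Request per browser state, all for the same URL."""
    requests = []
    for state in states:
        headers = {"Accept": "*/*"}
        # Each optional setting maps onto a standard HTTP request header.
        if "language" in state:
            headers["Accept-Language"] = state["language"]
        if "user_agent" in state:
            headers["User-Agent"] = state["user_agent"]
        requests.append(Request(url, headers=headers))
    return requests


# Hypothetical states: two language/browser combinations.
STATES = [
    {"language": "en-us",
     "user_agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"},
    {"language": "fr-fr", "user_agent": "ExampleMobileBrowser/1.0"},
]

REQUESTS = build_requests("http://www.example.com/dir/page.html", STATES)
for req in REQUESTS:
    # urllib stores header names in capitalized form, e.g. "Accept-language".
    print(req.full_url, req.get_header("Accept-language"))
```

Passing each request to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without network access.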
`
[0013] The crawler application automatically requests a variety of resources stored on one or more server hosts connected to the network. The present invention is not limited to any particular type of resource, although the present invention is of particular interest in “crawling” pages composed in some markup language such as HTML or XML. For purposes of illustration and discussion only, the different resources shall be referred to also as “pages” herein. The resources are typically identified by what the inventors refer to generically as uniform resource locators. A uniform resource locator, for purposes of the present invention, can be any advantageous representation or identifier of the “location” of the resource in the network for use by the crawler and other client applications. The present invention is not limited to any particular form of uniform resource locator. For example, in the context of the World Wide Web, the uniform resource locator can be a conventional URL such as “http://www.example.com/dir/page.html” where “http:” represents the particular retrieval methodology, “www.example.com” represents an identification of the server host (or alternatively by network address depending on whether address translation facilities are available), and “/dir/page.html” represents a directory tree path and document identifier for the resource on the server host. See T. Berners-Lee, “Uniform Resource Identifiers in WWW,” IETF, Network Working Group, RFC 1630 (June 1994); T. Berners-Lee, L. Masinter, M. McCahill, eds., “Uniform Resource Locators (URL),” IETF, Network Working Group, RFC 1738 (December 1994), which are incorporated by reference herein.

[0014] It is assumed that the network provides access to a collection of pages, p1, p2, p3, etc., with each corresponding to a uniform resource locator U1, U2, U3, ... Un. In the prior art, it is generally assumed that at a given specific time a particular uniform resource locator will correspond to a unique page, i.e. that U1→p1, U2→p2, etc. The pages may change over time, or even be dropped resulting in a “dead” link, but the correspondence between a uniform resource locator and a resource is typically assumed. A conventional prior art crawler, accordingly, will operate as follows:

[0015] for each URL

[0016] page-contents=request(URL),

[0017] with state s as a constant for all URLs and pages.

[0018] Unfortunately, the client state may affect the mapping, so that (U1, s1)→p1, (U1, s2)→p1_2, (U1, s3)→p1_3, ... and so on, where p1 may be different from p1_2 and p1_3.

[0019] For example, as depicted in FIG. 1, the crawler on the client host 110 sends a request 150 to the server host 120 for a particular URL. The server 120 receives the request and responds at 160 to the request by selecting one out of a plurality of pages 121, 122, 123, depending on the particular request and state s of the client. The “state” of the client can refer to any of a collection of parameters or information available to the server host 120 about the client application/host. For example, and without limitation, a conventional browser has a variety of “voluntary” settings that can be identified by a server application, such as type of client browser, preferred language or locale, etc. There are also “external” factors that can be identified by a server, such as the client’s network address (IP address) which is a property not directly settable by a client application. All of these different forms of information available to the server host 120 are defined as state “s” and the state is assumed to contain any one or more of these parameters.

[0020] A crawler operating in accordance with an embodiment of an aspect of the invention would operate as follows:

[0021] for each URL,

[0022] for each state s in (s1, s2, ..., sn)

[0023] page-contents_n=request(URL, s_n).

[0024] As a result, there can be several copies of page contents for each given URL. This is depicted in FIG. 2. The crawler on the client host 110 in FIG. 2 sends multiple requests 250 to the server host 120 for the same URL. Where the different states s1, s2, s3 can be represented using “voluntary” settings, the client host 110 can readily vary the requests to reflect different browser state. Where the different states s1, s2, s3 reflect “external” factors, it may be necessary to execute different crawlers on different hosts reflecting the different external factors. The server 120 receives the different requests and responds at 260 to the requests by selecting each of the different pages 121, 122, 123, depending on the particular state s in the specific crawler request.

[0025] FIG. 3 is a flowchart of the processing performed by a crawler, in accordance with an embodiment of this aspect of the invention. At step 301, the crawler processes the next URL in a list of URLs. As is known in the art of crawlers, the list can be generated by specifying some popular websites and extracting further URL links from each page retrieved. At step 302, the crawler selects a state s from a collection of advantageously-defined states. The crawler can be implemented to select every state variation for every URL or, more preferably, can be implemented to be selective as to which states are varied and for which URLs. At step 303, the crawler issues a request for the resource at the URL modified to reflect the selected state s. For example, an illustrative HTTP request for the URL “http://www.example.com/dir/page.html” would look similar to the following:

[0026] GET /dir/page.html HTTP/1.1

[0027] Host: www.example.com

[0028] Accept: */*

[0029] Accept-Language: en-us

[0030] User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

[0031] See, e.g., R. Fielding et al., “Hypertext Transfer Protocol -- HTTP/1.1,” Internet Engineering Task Force (IETF), Request for Comments (RFC) 2616 (June 1999). The “Accept-Language” option specifies “en-us” (English speakers in United States) and could be readily varied to other languages or locales. See, e.g., H. Alvestrand, “IETF Policy on Character Sets and Languages,” IETF Network Working Group, RFC 2277 (January 1998); H. Alvestrand, “Tags for the Identification of Languages,” IETF Network Working Group, RFC 3066 (January 2001), the contents of which are incorporated by reference herein. The “browser string” shown in the “User-Agent” option specifies the type of browser (here Microsoft Internet Explorer) and could be readily varied to other types of browsers, such as Netscape or a cell-phone enabled browser.

[0032] At step 304, the crawler receives the requested resource and proceeds to store and process the resource. In accordance with an embodiment of another aspect of the invention, it is advantageous to index the resource by browser state as well as by URL. In other words, instead of indexing the resource as follows:
`
[0033] Add-to-database(URL, page-contents)

[0034] it is preferable to index the resource as follows:

[0035] for each state s in (s1, s2, ..., sn)

[0036] Add-to-database(URL, s_n, page-contents_n)

[0037] Thus, the contents of each resource are saved and associated with the URL and with the particular browser state selected for the request.

[0038] With reference again to FIG. 3, the next state is selected at step 305 and another request is issued, etc., until the specified states for the particular URL are exhausted. Then, at step 306, the next URL is utilized until the crawler has exhausted all URLs or some crawling threshold has been reached.

[0039] After the different URLs U1, U2, U3 are crawled, a database is constructed that would look like the following:

[0040] p1_1 ← U1, s1

[0041] p1_2 ← U1, s2

[0042] p1_3 ← U1, s3

[0043] p2_1 ← U2, s1

[0044] p2_2 ← U2, s2

[0045] p2_3 ← U2, s3

[0046] where s1, s2, and s3 represent the different browser states. This is in contrast to a prior art database which would look like:

[0047] p1 ← U1

[0048] p2 ← U2

[0049] p3 ← U3

[0050] There are a variety of improvements within the spirit of the present invention that could be made to the structure of the database created by the crawler. For example, the database could advantageously only save one copy of resources whose contents are the same for every state. Rather than store duplicates of the same content, it is preferable to store a pointer to the contents. If “page-contents1” is the same as “page-contents2”, then the crawler would store only one copy of the page contents and have a pointer stored associating it with the URL(s) and the state(s) that found the content. Even where the two resources are different from each other, the first resource could be stored as normally and the second resource could be stored in a form that preserves only the differences between the first resource and the second resource, for example and without limitation, using some form of “diff” procedure or delta-encoding.

[0051] Thus, it is not a requirement in the context of the present invention that all URLs be saved or even crawled for all states. Rather, a logical association should be made between the state and the URL with the page contents for at least some URLs and some states.

[0052] Where it is desired to crawl for variations on browser state that rely on what are referred to as “external” factors above, it is advantageous to provide for different crawler architectures. For example, where the server host uses an external factor (such as a network address) as an approximation of geographic location of the client, it is advantageous to implement the crawler as follows:
`
[0053] (a) The crawler can be implemented as a plurality of physically distributed crawlers that feed into a single pool of information. Each distributed crawler can have its own reported state and could index the crawled information separately.

[0054] (b) The crawler can be implemented as a centralized crawler with a plurality of physically distributed remote “agents”, acting for example as “proxies” or “points-of-presence”, which issue requests on behalf of the centralized crawler. The server host would interact with the crawler’s agents and identify the crawler’s requests as having the external factors of the particular agent issuing the request.

[0055] (c) The crawler can be implemented as a centralized crawler that simply pretends to have a different external factor, e.g., by pretending to be from a different location than it actually is. For example, there are a variety of mechanisms for “faking” a host’s network address, such as modifying the network addressing scheme, the domain name system, or the contents of IP packets to reflect different external factors. The requests from the crawler would appear to the host server as if they were coming from a host with the different external factors.
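
Option (b) can be sketched with Python's standard `urllib.request` module, under the assumption that the remote agents are ordinary HTTP proxies. The agent labels and proxy addresses below are hypothetical placeholders, and no request is actually sent. One opener is built per agent, so that a request issued through a given opener would reach the server host from that agent's network address.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical agents: location label -> HTTP proxy address (assumed).
AGENTS = {
    "us": "http://proxy-us.example.com:8080",
    "es": "http://proxy-es.example.com:8080",
}


def build_agent_openers(agents):
    """Return one opener per agent; each routes HTTP requests via its proxy."""
    openers = {}
    for location, proxy_url in agents.items():
        openers[location] = build_opener(ProxyHandler({"http": proxy_url}))
    return openers


OPENERS = build_agent_openers(AGENTS)
# A centralized crawler could then call, for example,
# OPENERS["es"].open(url) to fetch the page as seen from the "es" agent.
print(sorted(OPENERS))
```

In this arrangement the agent label doubles as the external part of the crawler state, so it could be stored alongside the URL when the fetched page is indexed.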
`
[0056] Likewise, there are variations on the above categories, such as distributed implementations of the functions of the centralized crawler described above. Such variations would be encompassed within the scope of the present invention. Different instances of the crawler in different locations may cause some overlap, e.g., pages requested by a crawler in Spain using a browser setting of “es-mx” might be the same as pages requested by crawlers in the United States using a setting of “es-es”. To address such overlapping resources, it may be desirable to unify the different states for more efficient storage. Thus, for example, even if a crawler has been modified to support a wide range of browser states, s1, s2, s3, ..., s100, the system may be implemented so as to return a response for some set of states, e.g., s1-s50, and another response for the rest, s51-s100. Thus, not all 100 copies would need to be stored in the database. It may be preferable to merely store the differences between the copies.
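
The per-state crawl loop of paragraphs [0020]-[0023], the state-keyed indexing of [0035]-[0036], and the single-copy storage of overlapping pages can be combined in a short Python sketch. The `fake_server` function stands in for a real HTTP fetch and is purely hypothetical, as are the state labels. A SHA-256 digest serves here as the “pointer” to shared contents, which is one possible choice and not necessarily the disclosed mechanism.

```python
import hashlib


def crawl(urls, states, fetch):
    """Fetch every URL under every state; store identical contents once."""
    contents = {}  # digest -> page contents (one copy per distinct page)
    index = {}     # (url, state) -> digest, i.e. a pointer into `contents`
    for url in urls:
        for state in states:
            page = fetch(url, state)
            digest = hashlib.sha256(page.encode("utf-8")).hexdigest()
            contents.setdefault(digest, page)
            index[(url, state)] = digest
    return index, contents


def fake_server(url, state):
    """Hypothetical server that varies the page by language only."""
    language = state.split("-")[0]
    return f"{url} rendered for language {language}"


INDEX, CONTENTS = crawl(
    ["http://www.example.com/dir/page.html"],
    ["en-us", "es-es", "es-mx"],
    fake_server,
)
print(len(INDEX), "index entries,", len(CONTENTS), "stored copies")
```

Because the hypothetical server returns the same page for “es-es” and “es-mx”, the three (URL, state) keys end up pointing at two stored copies.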
`
[0057] When a user performs a search on the database created by the crawler, conventionally all users would be treated equally with regard to the set of pages that might be returned for a given query. The query results would proceed as follows:

[0058] Results=find-relevant-pages(q)

[0059] Even where prior art search engines such as GOOGLE attempt to take into account user language preferences by redirecting, for example, French users to a French GOOGLE domain, all query requests submitted to the French GOOGLE domain would still be treated the same, regardless of browser state. In contrast, and in accordance with an embodiment of another aspect of the invention, the resources matching a particular query can be selected based on state as well. The query can proceed as follows:

[0060] Results=find-relevant-pages(q, browser-state)

[0061] where browser-state specifies the state of the browser of the user submitting the query or represents the state specified by the user in the query itself.

[0062] FIG. 4 is a flowchart of processing performed by a search engine, in accordance with an illustrative embodiment of this aspect of the present invention. At step 401, the search engine receives a query request from a client browser. At step 402, the search engine detects the browser state of the client browser. This is accomplished by, for example and without limitation, analyzing the HTTP options in the request, by analyzing the IP address of the client, etc. Then, at step 403, the search engine conducts the search for pages matching the specified query where the results are adjusted based on the detected browser state of the client browser and how it relates to the state of the crawler. Thus, a user browser configured for “English” could receive different search engine results than a user browser configured for “French”. This can be accomplished, for example and without limitation, by filtering pages in the result set to only match those which satisfy the state. Then, at step 404, the search engine composes a page of the results and, at step 405, proceeds to send the results page to the client browser.
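
The filtering at step 403 can be sketched in Python as follows. The in-memory `DATABASE`, its page texts, and the substring match that stands in for a real relevance function are all assumptions made for illustration. Only the idea of selecting results by both the query and the browser state comes from the text.

```python
def find_relevant_pages(query, browser_state, database):
    """Return URLs whose stored copy for `browser_state` contains `query`."""
    results = []
    for (url, state), text in database.items():
        # Keep a page only if it was crawled under the user's browser state.
        if state == browser_state and query.lower() in text.lower():
            results.append(url)
    return results


# Hypothetical index: (URL, crawler state) -> page contents.
DATABASE = {
    ("http://www.example.com/", "en-us"): "Welcome to the example shop",
    ("http://www.example.com/", "fr-fr"): "Bienvenue dans la boutique example",
    ("http://www.example.com/help", "en-us"): "Help for the example shop",
}

print(find_relevant_pages("shop", "en-us", DATABASE))
print(find_relevant_pages("boutique", "fr-fr", DATABASE))
```

A browser state detected from the request (for example from its `Accept-Language` header) would be passed as `browser_state`; how that detection maps onto crawler states is left open here.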
`
[0063] For example, consider a search engine which receives a query q1 from a user and which proceeds to determine that the matching results include pages p1_1, p1_2, and p2_3. Recall that a page may be entered multiple times (once for each state) under the above-described new indexing scheme. Assume that the user’s browser state is the same as s2 (the fields that are considered by the crawler match that of the crawler state s2). In this case, a simple filter is applied and p1_1 and p2_3 are removed since their associated state was not s2. p1_1 was associated with s1 (the crawler state that found the page) and p2_3 was associated with s3. In the above case, if the results included p2_1 and p2_1=p2_2, then either state s1 or state s2 would allow it to remain since the same page contents were found with more than one state.

[0064] It should be noted that it is not required that the filtering occur after the initial results are obtained. The filtering effect can be incorporated into the relevance function or built into the database or indexer. Such variations would be still within the scope of the present invention. For example, and without limitation, consider a query for “XYZ COMPANY” where the user’s browser state has been set to “fr-fr” (French/France). A conventional search engine might return results that include “www.xyz.com” as result r1 and “www.xyz.co.fr” as result r2. In accordance with another embodiment of another aspect of the invention, the relevance function can be modified to consider the browser state in the scoring/ranking of results, even where the crawler state was fixed. The ranking of “www.xyz.co.fr” can be altered to come first, because the user’s browser has been set to “fr-fr”. Note that the relevance function can be so modified, even if both pages were crawled/found with a fixed (and possibly different from “fr-fr”) browser state.

[0065] It should also be noted that a specific implementation might have a default policy when the browser’s state does not correspond to a crawler’s state. For example, where the search engine receives a request from a browser set for the language of “Swahili” and no crawler was run for that particular state, the policy of the implementation might be to use a default state s1, which might be for example “Language=English, Location=US”. The specific mechanism for selecting default state or for determining which browser state most closely matches (or is considered a match) for a given crawler state (and vice versa) is not relevant to the spirit of the present invention.

[0066] It will be appreciated that those skilled in the art will be able to devise numerous arrangements and variations which, although not explicitly shown or described herein, embody the principles of the invention and are within their spirit and scope. For example, and without limitation, the definition of “state” can vary, and the method for dealing with partial state could readily vary, in accordance with the specifications of one of ordinary skill in the art. Also, the present invention has been described with particular reference to HTTP and Web pages. The present invention, nevertheless and as mentioned above, is readily extendable to other protocols and resource types.

What is claimed is:

1. A method for crawling for resources in a network, the method comprising:

receiving a list of resources on the network and for at least one of the resources on the list of resources,

sending a first request to a server in the network for the resource using a first browser state, and

sending a second request for the same resource using a second browser state.

2. The method of claim 1 wherein the resources are identified by uniform resource locators and wherein the first and second request specify a same uniform resource locator.

3. The method of claim 1 wherein the browser state comprises a language preference.

4. The method of claim 1 wherein the browser state comprises a locale preference.

5. The method of claim 1 wherein the browser state comprises location information.

6. The method of claim 1 wherein the browser state comprises a browser identification.

7. The method of claim 1 wherein the browser state comprises a network address.

8. The method of claim 1 wherein the first request and the second request are issued by a first and second crawler applications that respectively have a first and second browser state.

9. The method of claim 1 wherein the first and second requests are issued by a crawler application that varies its browser state between the first and second requests.
10. A method for processing crawled resources in a network, the method comprising:

receiving a resource in response to a request for the resource using one of a plurality of browser states;

storing the resource; and

indexing the resource, the indexing step further comprising the step of associating the resource with a first browser state where the first browser state is the one of the plurality of browser states used to request the resource.

11. The method of claim 10 wherein resources are identified by uniform resource locators and wherein at least a first resource and a second resource identified by a same uniform resource locator are associated with different browser states.

12. The method of claim 11 wherein the first and second resources are both stored only if the second resource is different from the first resource.

13. The method of claim 12 wherein if the second resource is a duplicate of the first resource, a reference is stored that associates the stored first resource with the second browser state.

14. The method of claim 10 wherein the browser state comprises any one of a group consisting of language preference, locale preference, location information, browser identification, and network address.

15. A method for searching a database of crawled resources, the method comprising the steps of:

receiving a search query from a browser client;

detecting a browser state for the browser client; and

searching for results from the database of resource using both the search query and the browser state of the browser client.

16. The method of claim 15 wherein the database includes at least one record which associates a first resource and a second resource in the database with a same uniform resource locator but with different browser states.

17. The method of claim 15 wherein results that match the search query are filtered using the browser state of the browser client.

18. The method of claim 15 wherein a relevance function is utilized to rank results from search of the database and wherein the relevance function considers the browser state of the browser client in ranking the results.

19. The method of claim 15 wherein if the browser state of the browser client does not match any of the browser states in the database, then a default browser state is used in the search.

20. The method of claim 15 wherein the browser state comprises any one of a group consisting of language preference, locale preference, location information, browser identification, and network address.
21. A computer-readable medium comprising one or more instructions which when executed perform the following:

receiving a list of resources on the network and for at least one of the resources on the list of resources,

sending a first request to a server in the network for a resource using a first browser state, and

sending a second request for the same resource using a second browser state.

22. The computer-readable medium of claim 21 wherein the resources are identified by a uniform resource locator and wherein the first and second request specify a same uniform resource locator.

23. The computer-readable medium of claim 21 wherein the browser state comprises any one of a group consisting of language preference, locale preference, location information, browser identification, and network address.

24. A computer-readable medium comprising one or more instructions which when executed perform the following:

receiving a resource in response to a request for the resource using one of a plurality of browser states;

storing the resource; and

indexing the resource, the indexing step further comprising the step of associating the resource with a first browser state where the first browser state is the one of the plurality of browser states used to request the resource.

25. The computer-readable medium of claim 24 wherein resources are identified by uniform resource locators and wherein at least a first resource and a second resource identified by a same uniform resource locator are associated with different browser states.

26. The computer-readable medium of claim 25 wherein the first and second resources are both stored only if the second resource is different from the first resource.

27. The computer-readable medium of claim 26 wherein if the second resource is a duplicate of the first resource, a reference is stored that associates the stored first resource with the second browser state.

28. The computer-readable medium of claim 24 wherein the browser state comprises any one of a group consisting of language preference, locale preference, location information, browser identification, and network address.

29. A computer-readable medium comprising one or more instructions which when executed perform the following:

receiving a search query from a browser client;

detecting a browser state for the browser client; and

searching for results from the database of resource using both the search query and the browser state of the browser client.

30. The computer-readable medium of claim 29 wherein the database includes at least one record which associates a first resource and a second resource in the database with a same uniform resource locator but with different browser states.

31. The computer-readable medium of claim 29 wherein results that match the search query are filtered using the browser state of the browser client.

32. The computer-readable medium of claim 29 wherein a relevance function is utilized to rank results from search of the database and wherein the relevance function considers the browser state of the browser client in ranking the results.

33. The computer-readable medium of claim 29 wherein if the browser state of the browser client does not match any of the browser states in the database, then a default browser state is used in the search.

34. The computer-readable medium of claim 29 wherein the browser state comprises any one of a group consisting of language preference, locale preference, location information, browser identification, and network address.
`
`
