`
US 20070100817A1
`
`(19) United States
(12) Patent Application Publication
`Acharya et al.
`
(10) Pub. No.: US 2007/0100817 A1
(43) Pub. Date: May 3, 2007
`
`(54) DOCUMENT SCORING BASED ON
`DOCUMENT CONTENT UPDATE
`
(75) Inventors: Anurag Acharya, Campbell, CA (US);
Jeffrey Dean, Palo Alto, CA (US); Paul
Haahr, San Francisco, CA (US);
Monika Henzinger, Corseaux (CH);
`Steve Lawrence, Mountain View, CA
`(US); Karl Pfleger, Mountain View, CA
`(US); Simon Tong, Mountain View, CA
`(US)
`
`Correspondence Address:
`HARRITY SNYDER, LLP
`11350 Random Hills Road
`SUITE 600
`FAIRFAX, VA 22030 (US)
`
`(73) Assignee: GOOGLE INC., Mountain View, CA
`
(21) Appl. No.: 11/562,285

(22) Filed: Nov. 21, 2006
`
`
`Related U.S. Application Data
`
(62) Division of application No. 10/748,664, filed on Dec.
31, 2003.
`
`(60) Provisional application No. 60/507,617, filed on Sep.
`30, 2003.
`
`Publication Classification
`
(51) Int. Cl.
G06F 17/30 (2006.01)
(52) U.S. Cl. 707/5
`
(57) ABSTRACT
`
`A system may determine a measure of how a content of a
`document changes over time, generate a score for the
`document based, at least in part, on the measure of how the
`content of the document changes over time, and rank the
`document with regard to at least one other document based,
`at least in part, on the score.
`
[Front-page figure, same as FIG. 3: SEARCH ENGINE 125 with DOCUMENT LOCATOR 310, HISTORY COMPONENT 320, RANKING COMPONENT 330, and DOCUMENT CORPUS 340]
`
`EXHIBIT 2109
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00481
`
`
`
`Patent Application Publication May 3, 2007 Sheet 1 of 4
`
`US 2007/0100817 A1
`
[FIG. 1: network 100 with clients 110 and servers 120, 130, and 140]
`
`
`
`Patent Application Publication May 3, 2007 Sheet 2 of 4
`
`US 2007/0100817 A1
`
[FIG. 2: client/server entity 110-140 with BUS 210, PROCESSOR, MAIN MEMORY, ROM, STORAGE DEVICE, INPUT DEVICES, OUTPUT DEVICES, and COMMUNICATION INTERFACE]
`
`
`
`Patent Application Publication May 3, 2007 Sheet 3 of 4
`
`US 2007/0100817 A1
`
[FIG. 3: SEARCH ENGINE 125 with DOCUMENT LOCATOR 310, HISTORY COMPONENT 320, RANKING COMPONENT 330, and DOCUMENT CORPUS 340]
`
`
`
`Patent Application Publication May 3, 2007 Sheet 4 of 4
`
`US 2007/0100817 A1
`
[FIG. 4: flowchart. 410: IDENTIFY DOCUMENTS; 420: OBTAIN HISTORY DATA ASSOCIATED WITH DOCUMENTS; 430: SCORE DOCUMENTS BASED, AT LEAST IN PART, ON HISTORY DATA]
`
`
`
US 2007/0100817 A1
`
`May 3, 2007
`
`1
`
`DOCUMENT SCORING BASED ON DOCUMENT
`CONTENT UPDATE
`
`RELATED APPLICATION
`
`[0001] This application is a divisional of U.S. patent
`application Ser. No. 10/748,664, filed Dec. 31, 2003, which
claims priority under 35 U.S.C. § 119 based on U.S. Provisional
Application No. 60/507,617, filed Sep. 30, 2003, the
`disclosures of which are incorporated herein by reference.
`
`BACKGROUND OF THE INVENTION
`
`[0002] 1. Field of the Invention
`
[0003] The present invention relates generally to information
retrieval systems and, more particularly, to systems and
`methods for generating search results based, at least in part,
`on historical data associated with relevant documents.
`
`[0004] 2. Description of Related Art
`
`[0005] The World Wide Web ("web") contains a vast
amount of information. Search engines assist users in locating
desired portions of this information by cataloging web
`documents. Typically, in response to a user's request, a
`search engine returns links to documents relevant to the
`request.
`
`[0006] Search engines may base their determination of the
`user's interest on search terms (called a search query)
`provided by the user. The goal of a search engine is to
`identify links to high quality relevant results based on the
`search query. Typically, the search engine accomplishes this
`by matching the terms in the search query to a corpus of
`pre-stored web documents. Web documents that contain the
`user's search terms are considered "hits" and are returned to
`the user.
`
[0007] Ideally, a search engine, in response to a given
`user's search query, will provide the user with the most
`relevant results. One category of search engines identifies
`relevant documents based on a comparison of the search
`query terms to the words contained in the documents.
Another category of search engines identifies relevant documents
using factors other than, or in addition to, the presence
`of the search query terms in the documents. One such search
`engine uses information associated with links to or from the
`documents to determine the relative importance of the
`documents.
`
`[0008] Both categories of search engines strive to provide
`high quality results for a search query. There are several
`factors that may affect the quality of the results generated by
`a search engine. For example, some web site producers use
`spamming techniques to artificially inflate their rank. Also,
`"stale" documents (i.e., those documents that have not been
`updated for a period of time and, thus, contain stale data)
`may be ranked higher than "fresher" documents (i.e., those
`documents that have been more recently updated and, thus,
`contain more recent data). In some particular contexts, the
`higher ranking stale documents degrade the search results.
`
`[0009] Thus, there remains a need to improve the quality
`of results generated by search engines.
`
`SUMMARY OF THE INVENTION
[0010] Systems and methods consistent with the principles
of the invention may score documents based, at least in part,
on history data associated with the documents. This scoring
may be used to improve search results generated in connection
with a search query.
`
`[0011] According to one aspect, a method may include
`determining a measure of how a content of a document
`changes over time; generating a score for the document
`based, at least in part, on the measure of how the content of
`the document changes over time; and ranking the document
`with regard to at least one other document based, at least in
`part, on the score.
`
`[0012] According to another aspect, a method may include
`determining a first rate of change in a content of a document
`in a first time period; determining a second rate of change in
`the content of the document in a second time period;
`comparing the first rate of change and the second rate of
`change to determine whether there is an increase or a
decrease in the rate of change in the content of the document;
generating a score for the document based, at least in
`part, on whether there is an increase or a decrease in the rate
`of change in the content of the document; and ranking the
`document with regard to at least one other document based,
`at least in part, on the score.
`
`[0013] According to yet another aspect, a method may
`include receiving a search query; performing a search based,
at least in part, on the search query to identify a group of
search result documents; determining a date on which a
content changed for each of the search result documents in
a set of the search result documents in the group; determining
an average date-of-change of the search result documents
in the set of search result documents based, at least in
`part, on the determined dates; generating a score for a search
`result document in the set of search result documents based,
`at least in part, on a difference between the determined date
`associated with the search result document and the average
`date-of-change of the search result documents in the set of
`search result documents; and ranking the search result
`document with regard to at least one other one of the search
`result documents based, at least in part, on the score.
`
`[0014] According to a further aspect, a method may
include determining a measure of how anchor text associated
with a link pointing to a document changes over time;
`generating a score for the document based, at least in part,
`on the measure of how the anchor text associated with the
`link pointing to the document changes over time; and
`ranking the document with regard to at least one other
`document based, at least in part, on the score.
`[0015] According to another aspect, a system may include
`means for determining whether a topic associated with a
`document changes over time; means for generating a score
for the document based, at least in part, on whether the
`topic associated with the document changes; and means for
`ranking the document with regard to at least one other
`document based, at least in part, on the score.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
[0016] The accompanying drawings, which are incorporated
in and constitute a part of this specification, illustrate
`an embodiment of the invention and, together with the
`description, explain the invention. In the drawings,
`[0017] FIG. 1 is a diagram of an exemplary network in
`which systems and methods consistent with the principles of
`the invention may be implemented;
`
`
`
`
`[0018] FIG. 2 is an exemplary diagram of a client and/or
`server of FIG. 1 according to an implementation consistent
`with the principles of the invention;
`
`[0019] FIG. 3 is an exemplary functional block diagram of
`the search engine of FIG. 1 according to an implementation
`consistent with the principles of the invention; and
`
[0020] FIG. 4 is a flowchart of exemplary processing for
scoring documents according to an implementation consistent
with the principles of the invention.
`
`DETAILED DESCRIPTION
`
`[0021] The following detailed description of the invention
`refers to the accompanying drawings. The same reference
`numbers in different drawings may identify the same or
`similar elements. Also, the following detailed description
`does not limit the invention.
`
`[0022] Systems and methods consistent with the principles
`of the invention may score documents using, for example,
`history data associated with the documents. The systems and
`methods may use these scores to provide high quality search
`results.
`
`[0023] A "document," as the term is used herein, is to be
`broadly interpreted to include any machine-readable and
`machine-storable work product. A document may include an
`e-mail, a web site, a file, a combination of files, one or more
`files with embedded links to other files, a news group
`posting, a blog, a web advertisement, etc. In the context of
`the Internet, a common document is a web page. Web pages
often include textual information and may include embedded
information (such as meta information, images, hyperlinks,
etc.) and/or embedded instructions (such as Javascript,
`etc.). A page may correspond to a document or a portion of
`a document. Therefore, the words "page" and "document"
`may be used interchangeably in some cases. In other cases,
`a page may refer to a portion of a document, such as a
sub-document. It may also be possible for a page to correspond
to more than a single document.
`
[0024] In the description to follow, documents may be
`described as having links to other documents and/or links
`from other documents. For example, when a document
`includes a link to another document, the link may be referred
`to as a "forward link." When a document includes a link
`from another document, the link may be referred to as a
`"back link." When the term "link" is used, it may refer to
`either a back link or a forward link.
`
`Exemplary Network Configuration
`
`[0025] FIG. 1 is an exemplary diagram of a network 100
`in which systems and methods consistent with the principles
`of the invention may be implemented. Network 100 may
`include multiple clients 110 connected to multiple servers
`120-140 via a network 150. Network 150 may include a
`local area network (LAN), a wide area network (WAN), a
`telephone network, such as the Public Switched Telephone
`Network (PSTN), an intranet, the Internet, a memory device,
`another type of network, or a combination of networks. Two
`clients 110 and three servers 120-140 have been illustrated
`as connected to network 150 for simplicity. In practice, there
`may be more or fewer clients and servers. Also, in some
`instances, a client may perform the functions of a server and
`a server may perform the functions of a client.
`
`[0026] Clients 110 may include client entities. An entity
`may be defined as a device, such as a wireless telephone, a
personal computer, a personal digital assistant (PDA), a laptop,
or another type of computation or communication
device, a thread or process running on one of these devices,
and/or an object executable by one of these devices. Servers
`120-140 may include server entities that gather, process,
`search, and/or maintain documents in a manner consistent
`with the principles of the invention. Clients 110 and servers
`120-140 may connect to network 150 via wired, wireless,
`and/or optical connections.
`
[0027] In an implementation consistent with the principles
`of the invention, server 120 may include a search engine 125
`usable by clients 110. Server 120 may crawl a corpus of
`documents (e.g., web pages), index the documents, and store
`information associated with the documents in a repository of
`crawled documents. Servers 130 and 140 may store or
`maintain documents that may be crawled by server 120.
`While servers 120-140 are shown as separate entities, it may
`be possible for one or more of servers 120-140 to perform
`one or more of the functions of another one or more of
`servers 120-140. For example, it may be possible that two or
`more of servers 120-140 are implemented as a single server.
`It may also be possible for a single one of servers 120-140
`to be implemented as two or more separate (and possibly
`distributed) devices.
`
`Exemplary Client/Server Architecture
`[0028] FIG. 2 is an exemplary diagram of a client or server
`entity (hereinafter called "client/server entity"), which may
`correspond to one or more of clients 110 and servers
`120-140, according to an implementation consistent with the
`principles of the invention. The client/server entity may
`include a bus 210, a processor 220, a main memory 230, a
`read only memory (ROM) 240, a storage device 250, one or
`more input devices 260, one or more output devices 270, and
`a communication interface 280. Bus 210 may include one or
`more conductors that permit communication among the
`components of the client/server entity.
`
[0029] Processor 220 may include one or more conventional
processors or microprocessors that interpret and
`execute instructions. Main memory 230 may include a
`random access memory (RAM) or another type of dynamic
`storage device that stores information and instructions for
`execution by processor 220. ROM 240 may include a
`conventional ROM device or another type of static storage
`device that stores static information and instructions for use
by processor 220. Storage device 250 may include a magnetic
and/or optical recording medium and its corresponding
`drive.
`
[0030] Input device(s) 260 may include one or more
conventional mechanisms that permit an operator to input
information to the client/server entity, such as a keyboard, a
mouse, a pen, voice recognition and/or biometric mechanisms,
etc. Output device(s) 270 may include one or more
conventional mechanisms that output information to the
operator, including a display, a printer, a speaker, etc.
Communication interface 280 may include any transceiver-like
mechanism that enables the client/server entity to communicate
with other devices and/or systems. For example,
communication interface 280 may include mechanisms for
communicating with another device or system via a network,
such as network 150.
`
`
`
`
[0031] As will be described in detail below, the client/server
entity, consistent with the principles of the invention,
may perform certain searching-related operations. The
client/server entity may perform these operations in response to
processor 220 executing software instructions contained in a
computer-readable medium, such as memory 230. A computer-readable
medium may be defined as one or more
`physical or logical memory devices and/or carrier waves.
`
`[0032] The software instructions may be read into memory
`230 from another computer-readable medium, such as data
storage device 250, or from another device via communication
interface 280. The software instructions contained in
memory 230 may cause processor 220 to perform processes
that will be described later. Alternatively, hardwired circuitry
may be used in place of or in combination with
software instructions to implement processes consistent with
the principles of the invention. Thus, implementations consistent
with the principles of the invention are not limited to
`any specific combination of hardware circuitry and software.
`
`Exemplary Search Engine
`
`[0033] FIG. 3 is an exemplary functional block diagram of
search engine 125 according to an implementation consistent
with the principles of the invention. Search engine 125
`may include document locator 310, history component 320,
`and ranking component 330. As shown in FIG. 3, one or
`more of document locator 310 and history component 320
`may connect to a document corpus 340. Document corpus
`340 may include information associated with documents that
`were previously crawled, indexed, and stored, for example,
`in a database accessible by search engine 125. History data,
`as will be described in more detail below, may be associated
`with each of the documents in document corpus 340. The
`history data may be stored in document corpus 340 or
`elsewhere.
`
[0034] Document locator 310 may identify a set of documents
whose contents match a user search query. Document
`locator 310 may initially locate documents from document
`corpus 340 by comparing the terms in the user's search
`query to the documents in the corpus. In general, processes
`for indexing documents and searching the indexed collection
`to return a set of documents containing the searched terms
`are well known in the art. Accordingly, this functionality of
`document locator 310 will not be described further herein.
`
`[0035] History component 320 may gather history data
`associated with the documents in document corpus 340. In
implementations consistent with the principles of the invention,
the history data may include data relating to: document
`inception dates; document content updates/changes; query
`analysis; link-based criteria; anchor text (e.g., the text in
`which a hyperlink is embedded, typically underlined or
`otherwise highlighted in a document); traffic; user behavior;
domain-related information; ranking history; user maintained/generated
data (e.g., bookmarks); unique words, bigrams,
and phrases in anchor text; linkage of independent
`peers; and/or document topics. These different types of
`history data are described in additional detail below. In other
`implementations, the history data may include additional or
`different kinds of data.
`
`[0036] Ranking component 330 may assign a ranking
`score (also called simply a "score" herein) to one or more
documents in document corpus 340. Ranking component
330 may assign the ranking scores prior to, independent of,
or in connection with a search query. When the documents
are associated with a search query (e.g., identified as relevant
to the search query), search engine 125 may sort the
documents based on the ranking score and return the sorted
set of documents to the client that submitted the search
query. Consistent with aspects of the invention, the ranking
score is a value that attempts to quantify the quality of the
documents. In implementations consistent with the principles
of the invention, the score is based, at least in part, on
the history data from history component 320.
`
`Exemplary History Data
`
`Document Inception Date
`
`[0037] According to an implementation consistent with
`the principles of the invention, a document's inception date
`may be used to generate (or alter) a score associated with
that document. The term "date" is used broadly here and
may, thus, include time and date measurements. As
`described below, there are several techniques that can be
`used to determine a document's inception date. Some of
`these techniques are "biased" in the sense that they can be
`influenced by third parties desiring to improve the score
`associated with a document. Other techniques are not biased.
`Any of these techniques, combinations of these techniques,
or yet other techniques may be used to determine a document's
inception date.
`
`[0038] According to one implementation, the inception
`date of a document may be determined from the date that
`search engine 125 first learns of or indexes the document.
`Search engine 125 may discover the document through
crawling, submission of the document (or a representation/summary
thereof) to search engine 125 from an "outside"
source, a combination of crawl or submission-based indexing
techniques, or in other ways. Alternatively, the inception
`date of a document may be determined from the date that
`search engine 125 first discovers a link to the document.
`
[0039] According to another implementation, the date on which
the domain associated with a document was registered may be used
as an indication of the inception date of the document.
`According to yet another implementation, the first time that
`a document is referenced in another document, such as a
`news article, newsgroup, mailing list, or a combination of
`one or more such documents, may be used to infer an
`inception date of the document. According to a further
`implementation, the date that a document includes at least a
`threshold number of pages may be used as an indication of
`the inception date of the document. According to another
`implementation, the inception date of a document may be
`equal to a time stamp associated with the document by the
server hosting the document. Other techniques, not specifically
mentioned herein, or combinations of techniques could
`be used to determine or infer a document's inception date.
`
`[0040] Search engine 125 may use the inception date of a
`document for scoring of the document. For example, it may
`be assumed that a document with a fairly recent inception
`date will not have a significant number of links from other
`documents (i.e., back links). For existing link-based scoring
`techniques that score based on the number of links to/from
`a document, this recent document may be scored lower than
`an older document that has a larger number of links (e.g.,
back links). When the inception dates of the documents are
considered, however, the scores of the documents may be
modified (either positively or negatively) based on the
documents' inception dates.
`
`[0041] Consider the example of a document with an
`inception date of yesterday that is referenced by 10 back
`links. This document may be scored higher by search engine
`125 than a document with an inception date of 10 years ago
`that is referenced by 100 back links because the rate of link
`growth for the former is relatively higher than the latter.
`While a spiky rate of growth in the number of back links
may be a factor used by search engine 125 to score documents,
it may also signal an attempt to spam search engine
125. Accordingly, in this situation, search engine 125 may
actually lower the score of a document(s) to reduce the effect
of spamming.
`
`[0042] Thus, according to an implementation consistent
`with the principles of the invention, search engine 125 may
`use the inception date of a document to determine a rate at
`which links to the document are created (e.g., as an average
`per unit time based on the number of links created since the
`inception date or some window in that period). This rate can
`then be used to score the document, for example, giving
`more weight to documents to which links are generated
`more often.
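
By way of illustration only, the link creation rate described above might be sketched as follows. The function name, the use of days as the unit of time, and the sample dates are assumptions made for this sketch; the specification does not prescribe them.

```python
from datetime import date


def link_creation_rate(inception: date, link_dates: list[date], today: date) -> float:
    """Average number of links created per day since the inception date.

    The day granularity is an assumption of this sketch.
    """
    # Treat a document younger than one day as one day old to avoid dividing by zero.
    elapsed_days = max((today - inception).days, 1)
    links_since_inception = sum(1 for d in link_dates if inception <= d <= today)
    return links_since_inception / elapsed_days


# Hypothetical data mirroring the 10-link versus 100-link example.
today = date(2007, 5, 3)
new_doc_rate = link_creation_rate(date(2007, 5, 2), [date(2007, 5, 2)] * 10, today)
old_doc_rate = link_creation_rate(date(1997, 5, 3), [date(2000, 1, 1)] * 100, today)
```

Under these assumptions the newer document has the higher rate even though it has fewer links in total.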
`
[0043] In one implementation, search engine 125 may
modify the link-based score of a document as follows:

H = L / log(F + 2),
`
`where H may refer to the history-adjusted link score, L may
`refer to the link score given to the document, which can be
`derived using any known link scoring technique (e.g., the
`scoring technique described in U.S. Pat. No. 6,285,999) that
`assigns a score to a document based on links to/from the
`document, and F may refer to elapsed time measured from
`the inception date associated with the document (or a
`window within this period).
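
As a minimal sketch, the adjustment H = L / log(F + 2) might be computed as shown below. The natural logarithm and the choice of unit for F are assumptions of this sketch; the specification states neither the base of the logarithm nor the unit of elapsed time.

```python
import math


def history_adjusted_link_score(link_score: float, elapsed: float) -> float:
    """Return H = L / log(F + 2).

    link_score is L, elapsed is F. The natural logarithm is an assumption.
    """
    if elapsed < 0:
        raise ValueError("elapsed time must be non-negative")
    # Adding 2 keeps the logarithm positive even when F is 0.
    return link_score / math.log(elapsed + 2)
```

With the same link score L, a larger elapsed time F gives a smaller H, so an older document needs a higher link score to reach the same adjusted value.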
`
`[0044] For some queries, older documents may be more
`favorable than newer ones. As a result, it may be beneficial
`to adjust the score of a document based on the difference (in
`age) from the average age of the result set. In other words,
`search engine 125 may determine the age of each of the
`documents in a result set (e.g., using their inception dates),
determine the average age of the documents, and modify the
`scores of the documents (either positively or negatively)
`based on a difference between the documents' age and the
`average age.
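
One possible reading of this adjustment is sketched below. The linear form and the weight parameter are assumptions introduced for illustration; the specification says only that scores may be modified positively or negatively based on the difference from the average age.

```python
from statistics import mean


def adjust_scores_by_age(scores: dict[str, float],
                         ages: dict[str, float],
                         weight: float) -> dict[str, float]:
    """Shift each score in proportion to the document's distance from the average age.

    A positive weight favors documents older than average, and a negative
    weight favors newer ones. The linear form is hypothetical.
    """
    average_age = mean(ages.values())
    return {doc: scores[doc] + weight * (ages[doc] - average_age) for doc in scores}


# Hypothetical result set with ages in years.
adjusted = adjust_scores_by_age({"a": 1.0, "b": 1.0}, {"a": 10.0, "b": 2.0}, weight=0.1)
```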
`
[0045] In summary, search engine 125 may generate (or
`alter) a score associated with a document based, at least in
`part, on information relating to the inception date of the
`document.
`
`Content Updates/Changes
`
`[0046] According to an implementation consistent with
`the principles of the invention, information relating to a
`manner in which a document's content changes over time
`may be used to generate (or alter) a score associated with
`that document. For example, a document whose content is
`edited often may be scored differently than a document
`whose content remains static over time. Also, a document
`having a relatively large amount of its content updated over
`time might be scored differently than a document having a
`relatively small amount of its content updated over time.
`
[0047] In one implementation, search engine 125 may
generate a content update score (U) as follows:

U = f(UF, UA),
`
`where f may refer to a function, such as a sum or weighted
`sum, UF may refer to an update frequency score that
`represents how often a document (or page) is updated, and
UA may refer to an update amount score that represents how
`much the document (or page) has changed over time. UF
`may be determined in a number of ways, including as an
`average time between updates, the number of updates in a
`given time period, etc.
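
For illustration, the sketch below takes f to be a weighted sum and UF to be the number of updates in a given time period, both of which the text names as options. The function names, the default weights, and the use of days are assumptions of this sketch.

```python
def update_frequency_score(update_days: list[int], period_days: int) -> float:
    """UF computed as the number of updates per day within the observation period."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return len(update_days) / period_days


def content_update_score(uf: float, ua: float,
                         w_freq: float = 0.5, w_amount: float = 0.5) -> float:
    """U = f(UF, UA), with f chosen here as a weighted sum (weights are hypothetical)."""
    return w_freq * uf + w_amount * ua
```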
`
`[0048] UA may also be determined as a function of one or
`more factors, such as the number of "new" or unique pages
`associated with a document over a period of time. Another
`factor might include the ratio of the number of new or
`unique pages associated with a document over a period of
`time versus the total number of pages associated with that
`document. Yet another factor may include the amount that
`the document is updated over one or more periods of time
`(e.g., n% of a document's visible content may change over
`a period t (e.g., last m months)), which might be an average
`value. A further factor might include the amount that the
`document (or page) has changed in one or more periods of
`time (e.g., within the last x days).
`
`[0049] According to one exemplary implementation, UA
`may be determined as a function of differently weighted
`portions of document content. For instance, content deemed
`to be unimportant if updated/changed, such as Javascript,
comments, advertisements, navigational elements, boilerplate
material, or date/time tags, may be given relatively
`little weight or even ignored altogether when determining
`UA. On the other hand, content deemed to be important if
`updated/changed (e.g., more often, more recently, more
`extensively, etc.), such as the title or anchor text associated
`with the forward links, could be given more weight than
`changes to other content when determining UA.
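
The weighting of content portions might be sketched as follows. The section names, the numeric weights, and the representation of change as a per-section fraction are all assumptions made for this sketch and are not taken from the specification.

```python
# Hypothetical weights: important portions count more, unimportant ones little or nothing.
SECTION_WEIGHTS = {
    "title": 3.0,
    "anchor_text": 2.0,
    "body": 1.0,
    "navigation": 0.1,
    "javascript": 0.0,
    "advertisement": 0.0,
}


def update_amount_score(changed_fraction: dict[str, float],
                        weights: dict[str, float] = SECTION_WEIGHTS,
                        default_weight: float = 1.0) -> float:
    """Weighted average of the fraction of each section that changed (a sketch of UA)."""
    weighted_change = 0.0
    total_weight = 0.0
    for section, fraction in changed_fraction.items():
        weight = weights.get(section, default_weight)
        weighted_change += weight * fraction
        total_weight += weight
    # If every observed section is ignored (weight 0), report no meaningful change.
    return weighted_change / total_weight if total_weight else 0.0
```

With these weights, a page whose advertisements are entirely replaced scores 0, while a page whose title changes scores well above a page with a comparable change in navigation elements.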
`
`[0050] UF and UA may be used in other ways to influence
`the score assigned to a document. For example, the rate of
`change in a current time period can be compared to the rate
of change in another (e.g., previous) time period to determine
whether there is an acceleration or deceleration trend.
`Documents for which there is an increase in the rate of
`change might be scored higher than those documents for
`which there is a steady rate of change, even if that rate of
`change is relatively high. The amount of change may also be
`a factor in this scoring. For example, documents for which
`there is an increase in the rate of change when that amount
`of change is greater than some threshold might be scored
`higher than those documents for which there is a steady rate
`of change or an amount of change is less than the threshold.
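
A minimal sketch of this comparison follows. The multiplicative boost, its value, and the threshold value are assumptions chosen for illustration; the specification states only that such documents might be scored higher.

```python
def rate_trend(previous_rate: float, current_rate: float) -> str:
    """Classify the change in update rate between two time periods."""
    if current_rate > previous_rate:
        return "acceleration"
    if current_rate < previous_rate:
        return "deceleration"
    return "steady"


def trend_adjusted_score(base_score: float, previous_rate: float, current_rate: float,
                         change_amount: float, amount_threshold: float = 0.1,
                         boost: float = 1.2) -> float:
    """Boost the score only when the rate of change is increasing and the
    amount of change exceeds the threshold (boost and threshold are hypothetical)."""
    accelerating = rate_trend(previous_rate, current_rate) == "acceleration"
    if accelerating and change_amount > amount_threshold:
        return base_score * boost
    return base_score
```

Note that a document with a steady rate receives no boost here, even if that rate is high, which matches the example in the text.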
`
[0051] In some situations, data storage resources may be
`insufficient to store the documents when monitoring the
`documents for content changes. In this case, search engine
`125 may store representations of the documents and monitor
`these representations for changes. For example, search
`engine 125 may store "signatures" of documents instead of
`the (entire) documents themselves to detect changes to
`document content. In this case, search engine 125 may store
`a term vector for a document (or page) and monitor it for
relatively large changes. According to another implementation,
search engine 125 may store and monitor a relatively
small portion (e.g., a few terms) of the documents that are
determined to be important or the most frequently occurring
(excluding "stop words").
`[0052] According to yet another implementation, search
`engine 125 may store a summary or other representation of
`a document and monitor this information for changes.
`According to a further implementation, search engine 125
`may generate a similarity hash (which may be used to detect
`near-duplication of a document) for the document and
`monitor it for changes. A change in a similarity hash may be
considered to indicate a relatively large change in its associated
document. In other implementations, yet other techniques
may be used to monitor documents for changes. In
`situations where adequate data storage resources exist, the
`full documents may be stored and used to determine changes
`rather than some representation of the documents.
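
As one illustration of monitoring a stored representation instead of the full document, the sketch below keeps a term vector of the most frequent non-stop-word terms and flags a relatively large change when the cosine similarity between the old and new vectors falls below a threshold. The stop-word list, the tokenizer, the cosine measure, and the threshold are assumptions of this sketch; the specification does not say how term vectors are compared.

```python
import math
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"})


def term_vector(text: str, top_n: int = 50) -> Counter:
    """Keep counts for the top_n most frequent terms, excluding stop words."""
    terms = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]
    return Counter(dict(Counter(terms).most_common(top_n)))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term vectors; two empty vectors count as identical."""
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def changed_significantly(old_text: str, new_text: str, threshold: float = 0.8) -> bool:
    """Flag a relatively large change when similarity drops below the threshold."""
    return cosine_similarity(term_vector(old_text), term_vector(new_text)) < threshold
```

Only the term vector needs to be stored between crawls, which is the storage saving the text describes.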
`[0053] For some queries, documents with content that has
`not recently changed may be more favorable than documents
`with content that has recently changed. As a result, it may be
`beneficial to adjust the score of a document based on the
`difference from the average date-of-change of the result set.
`In other words, search engine 125 may determine a date
`when the content of each of the documents in a result set last
`changed, determine the average date of change for the
documents, and modify the scores of the documents (either
`positively or negatively) based on a difference between the
`documents' date-of-change and the average date-of-change.
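
The average date-of-change and each document's difference from it might be computed as sketched below. Averaging calendar dates through their ordinal day numbers, and the function names, are assumptions of this sketch. How the resulting offset feeds into the score is left open, as in the text.

```python
from datetime import date


def average_date(dates: list[date]) -> date:
    """Average a list of dates by averaging their ordinal day numbers."""
    if not dates:
        raise ValueError("at least one date is required")
    return date.fromordinal(round(sum(d.toordinal() for d in dates) / len(dates)))


def date_of_change_offsets(last_changed: dict[str, date]) -> dict[str, int]:
    """Days between each document's last change and the average date-of-change.

    Positive values mean the document changed more recently than average.
    """
    average = average_date(list(last_changed.values()))
    return {doc: (changed - average).days for doc, changed in last_changed.items()}


# Hypothetical result set of two documents.
offsets = date_of_change_offsets({"a": date(2007, 1, 1), "b": date(2007, 1, 11)})
```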
[0054] In summary, search engine 125 may generate (or
alter) a score associated with a document based, at least in
part, on information relating to a manner in which the
document's content changes over time. For very large documents
that include content belonging to multiple individuals
`or organizations, the score may correspond to each of the
`sub-documents (i.e., that content belonging to or updated by
`a single individual or organization).
`Query Analysis
`[0055] According to an implementation consistent with
`the principles of the invention, one or more query-based
`factors may be used to generate (or alter) a score associated
`with a document. For example, one query-based factor may
`relate to the extent to which a document is selected over time
`when the document is included in a set of search results. In
`this case, search engine 125 might score documents selected
`relatively more often/increasingly by users higher than other
`documents.
`[0056] Another query-based factor may relate to the
`occurrence of certain search terms appearing in queries over
`time. A particular set of search terms may increasingly
`appear in queries over a period of time. For example, terms
`relating to a "hot" topic that is gaining/has gained popularity
`or a breaking news event would conceivably appe