`Page
`
`USOO6285999B1
`(10) Patent No.:
`US 6,285,999 B1
`(45) Date of Patent:
`Sep. 4, 2001
`
`(54) METHOD FOR NODE RANKING INA
`LINKED DATABASE
`
`(75) Inventor: Lawrence Page, Stanford, CA (US)
`(73) Assignee: The Board of Trustees of the Leland
`Stanford Junior University, Stanford,
`CA (US)
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(*) Notice:
`
`(21) Appl. No.: 09/004,827
`(22) Filed:
`Jan. 9, 1998
`Related U.S. Application Data
`(60) Provisional application No. 60/035,205, filed on Jan. 10,
`1997.
`(51) Int. Cl." .................................................. G06F 17/30
`(52) U.S. Cl. .................................... 707/5; 707/7, 707/501
`(58) Field of Search .................................... 707/100, 5, 7,
`707/513, 1–3, 10, 104, 501; 345/440, 382/226,
`229, 230, 231
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`4,953,106 *
`8/1990 Gansner et al. ..................... 345/440
`5,450,535 *
`9/1995 North ................
`... 395/140
`... 395/610
`5/1998 Mauldin ....
`5,748,954
`... 707/3
`5,752,241 * 5/1998 Cohen ...........
`5,832,494 * 11/1998 Egger et al. ......
`707/102
`5,848,407
`12/1998 Ishikawa et al. ........................ 707/2
`6,014,678
`1/2000 Inoue et al. .......................... 707/501
`OTHER PUBLICATIONS
`S. Jeromy Carriere et al., “Web Query: Searching and Visu
`alizing the Web through Connectivity”, Computer Networks
`and ISDN Systems 29 (1997). pp. 1257–1267.*
`Wang et al"Prefetching in Worl WideWeb", IEEE 1996, pp.
`28-32.
`Ramer et al. “Similarity, Probability and Database Organi
`sation: Extended Abstract”, 1996, pp. 272.276.*
`
`
`
`Craig Boyle “To link or not to link: An empirical comparison
`of Hypertext linking strategies”. ACM 1992, pp. 221-231.*
`L. Katz, "A new Status indeX derived from Sociometric
`analysis,” 1953, Psychometricka, vol. 18, pp.39-43.
`C.H. Hubbell, “An input-output approach to clique identi
`fication sociometry,” 1965, pp. 377–399.
`Mizruchi et al., “Techniques for disaggregating centrality
`scores in Social networks,” 1996, Sociological Methodology,
`pp. 26-48.
`E. Garfield, “Citation analysis as a tool in journal evalua
`tion,” 1972, Science, vol. 178, pp. 471-479.
`Pinski et al., “Citation influence for journal aggregates of
`Scientific publications: Theory, with application to the lit
`erature of physics,” 1976, Inf. Proc. And Management, vol.
`12, pp. 297-312.
`N. Geller, “On the citation influence methodology of Pinski
`and Narin,” 1978, Inf. Proc. And Management, vol. 14, pp.
`93-95.
`P. Doreian, “Measuring the relative Standing of disciplinary
`journals,” 1988, Inf. Proc. And Management, vol. 24, pp.
`45-56.
`
`(List continued on next page.)
`
`Primary Examiner Thomas Black
`Assistant Examiner Uyen Le
`(74) Attorney, Agent, or Firm-Harrity & Snyder L.L.P.
`(57)
`ABSTRACT
`A method assigns importance ranks to nodes in a linked
`database, Such as any database of documents containing
`citations, the World wide web or any other hypermedia
`database. The rank assigned to a document is calculated
`from the ranks of documents citing it. In addition, the rank
`of a document is calculated from a constant representing the
`probability that a browser through the database will ran
`domly jump to the document. The method is particularly
`useful in enhancing the performance of Search engine results
`for hypermedia databases, Such as the World wide web,
`whose documents have a large variation in quality.
`
`29 Claims, 3 Drawing Sheets
`
`Petitioner Google Ex-1034, 0001
`
`
`
`US 6,285,999 B1
`Page 2
`
`OTHER PUBLICATIONS
`P. Doreian, “A measure of Standing for citation networks
`within a wider environment,” 1994, Inf. Proc. And Manage
`ment, vol. 30, pp. 21-31.
`Botafogo et al., "Structural analysis of hypertext: Identifying
`hierarchies and useful metrics.” 1992, ACM Trans. Inc.
`Systems, vol. 10, pp. 142-180.
`Mark E. Frisse, “Searching for information in a hypertext
`medical handbook,” 1988, Communications of the ACM,
`vol. 31, No. 7, pp. 880-886.
`Massimo Marchiori, “The quest for correct information on
`the Web: Hyper search engines,” 1997, Computer Networks
`and ISDN Systems, vol. 29, No. 8-13, pp. 1225–1235.
`Oliver A. McBryan, “GENVL and WWWW: Tools for
`taming the web,” 1994, Proceedings of the first International
`Wold Wide Web Conference, pp. 1-13.
`
`Carriere et al., “WebOuery: Searching and visualizing the
`web through connectivity.” 1997, Proc. 6' International
`World Wide Web Conference, pp. 1-14.
`Arocena et al., “Applications of a web query language,”
`1997, Computer Networks and ISDN Systems, vol. 29, No.
`8-13, pp. 1305–1316.
`Jon M. Kleinberg, “Authoritative sources in a hyperlinked
`environment,” 1998, Proc. Of the 9" Annual ACM-SIAM
`Symposium on Discrete Algorithms, pp. 668-677.
`Henzinger et al., “Measuring indeX quality using random
`walks on the web”, 1999, Proc. of the 8' International World
`Wide Web Conference, pp. 213–225.
`
`* cited by examiner
`
`Petitioner Google Ex-1034, 0002
`
`
`
`U.S. Patent
`U.S. Patent
`
`Sep. 4, 2001
`Sep. 4, 2001
`
`Sheet 1 of 3
`Sheet 1 of 3
`
`US 6,285,999 B1
`US 6,285,999 B1
`
`
`
`A N
`
`
`
`FIG. 1
`FIG. 1
`
`Petitioner Google Ex-1034, 0003
`
`Petitioner Google Ex-1034, 0003
`
`
`
`U.S. Patent
`U.S. Patent
`
`Sep. 4, 2001
`Sep. 4, 2001
`
`Sheet 2 of 3
`Sheet 2 of 3
`
`US 6,285,999 B1
`US 6,285,999 B1
`
`
`
`
`
`FIG. 2
`FIG. 2
`
`Petitioner Google Ex-1034, 0004
`
`Petitioner Google Ex-1034, 0004
`
`
`
`U.S. Patent
`
`Sep. 4, 2001
`
`Sheet 3 of 3
`
`US 6,285,999 B1
`
`START
`
`101
`
`103
`
`105
`
`SELECT AN INITIALN-DIMENSIONAL VECTOR p,
`
`COMPUTE AN APPROXIMATION p, TO A
`STEADY-STATE PROBABILITY p. IN
`ACCORDANCE WITH THE EQUATION p = Ap,
`
`
`
`
`
`DETERMINEARANKrk FOR NODEk FROMA
`kth COMPONENT OF p.
`
`DONE
`
`FIG. 3
`
`Petitioner Google Ex-1034, 0005
`
`
`
`1
`METHOD FOR NODE RANKING IN A
`LINKED DATABASE
`
`CROSS-REFERENCES TO RELATED
`APPLICATIONS
`This application claims priority from U.S. provisional
`patent application Ser. No. 60/035,205 filed Jan. 10, 1997,
`which is incorporated herein by reference.
`
`STATEMENT REGARDING GOVERNMENT
`SUPPORT
`This invention was supported in part by the National
`Science Foundation grant number IRI-9411306-4. The Gov
`ernment has certain rights in the invention.
`15
`
`1O
`
`FIELD OF THE INVENTION
`This invention relates generally to techniques for analyZ
`ing linked databases. More particularly, it relates to methods
`for assigning ranks to nodes in a linked database, Such as any
`database of documents containing citations, the World wide
`web or any other hypermedia database.
`
`BACKGROUND OF THE INVENTION
`Due to the developments in computer technology and its
`increase in popularity, large numbers of people have recently
`Started to frequently Search huge databases. For example,
`internet Search engines are frequently used to Search the
`entire World wide Web. Currently, a popular Search engine
`might execute over 30 million Searches per day of the
`indexable part of the web, which has a size in excess of 500
`Gigabytes. Information retrieval Systems are traditionally
`judged by their precision and recall. What is often neglected,
`however, is the quality of the results produced by these
`Search engines. Large databases of documents Such as the
`web contain many low quality documents. As a result,
`Searches typically return hundreds of irrelevant or unwanted
`documents which camouflage the few relevant ones. In order
`to improve the Selectivity of the results, common techniques
`allow the user to constrain the Scope of the Search to a
`Specified Subset of the database, or to provide additional
`Search terms. These techniques are most effective in cases
`where the database is homogeneous and already classified
`into Subsets, or in cases where the user is Searching for well
`known and Specific information. In other cases, however,
`these techniques are often not effective because each con
`Straint introduced by the user increases the chances that the
`desired information will be inadvertently eliminated from
`the Search results.
`Search engines presently use various techniques that
`attempt to present more relevant documents. Typically,
`documents are ranked according to variations of a Standard
`vector Space model. These variations could include (a) how
`recently the document was updated, and/or (b) how close the
`Search terms are to the beginning of the document. Although
`this Strategy provides Search results that are better than with
`no ranking at all, the results Still have relatively low quality.
`Moreover, when Searching the highly competitive web, this
`measure of relevancy is Vulnerable to "spamming” tech
`niques that authors can use to artificially inflate their docu
`ment's relevance in order to draw attention to it or its
`advertisements. For this reason Search results often contain
`commercial appeals that should not be considered a match to
`the query. Although Search engines are designed to avoid
`Such ruses, poorly conceived mechanisms can result in
`disappointing failures to retrieve desired information.
`
`25
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`US 6,285,999 B1
`
`2
`Hyperlink Search Engine, developed by IDD Information
`Services, (http://rankdex.gari.com/) uses backlink informa
`tion (i.e., information from pages that contain links to the
`current page) to assist in identifying relevant web docu
`ments. Rather than using the content of a document to
`determine relevance, the technique uses the anchor text of
`links to the document to characterize the relevance of a
`document. The idea of associating anchor text with the page
`the text points to was first implemented in the World Wide
`Web Worm (Oliver A. McBryan, GENVL and WWWW:
`Tools for Taming the Web, First International Conference on
`the World Wide Web, CERN, Geneva, May 25–27, 1994).
`The Hyperlink Search Engine has applied this idea to assist
`in determining document relevance in a Search. In particular,
`Search query terms are compared to a collection of anchor
`text descriptions that point to the page, rather than to a
`keyword index of the page content. A rank is then assigned
`to a document based on the degree to which the Search terms
`match the anchor descriptions in its backlink documents.
`The well known idea of citation counting is a simple
`method for determining the importance of a document by
`counting its number of citations, or backlinkS. The citation
`rank r(A) of a document which has n backlink pages is
`Simply
`
`In the case of databases whose content is of relatively
`uniform quality and importance it is valid to assume that a
`highly cited document should be of greater interest than a
`document with only one or two citations. Many databases,
`however, have eXtreme variations in the quality and impor
`tance of documents. In these cases, citation ranking is overly
`Simplistic. For example, citation ranking will give the same
`rank to a document that is cited once on an obscure page as
`to a similar document that is cited once on a well-known and
`highly respected page.
`
`SUMMARY
`Various aspects of the present invention provide Systems
`and methods for ranking documents in a linked database.
`One aspect provides an objective ranking based on the
`relationship between documents. Another aspect of the
`invention is directed to a technique for ranking documents
`within a database whose content has a large variation in
`quality and importance. Another aspect of the present inven
`tion is to provide a document ranking method that is Scalable
`and can be applied to extremely large databaseS Such as the
`world wide web. Additional aspects of the invention will
`become apparent in View of the following description and
`asSociated figures.
`One aspect of the present invention is directed to taking
`advantage of the linked Structure of a database to assign a
`rank to each document in the database, where the document
`rank is a measure of the importance of a document. Rather
`than determining relevance only from the intrinsic content of
`a document, or from the anchor text of backlinks to the
`document, a method consistent with the invention deter
`mines importance from the extrinsic relationships between
`documents. Intuitively, a document should be important
`(regardless of its content) if it is highly cited by other
`documents. Not all citations, however, are necessarily of
`equal Significance. A citation from an important document is
`more important than a citation from a relatively unimportant
`document. Thus, the importance of a page, and hence the
`rank assigned to it, should depend not just on the number of
`citations it has, but on the importance of the citing docu
`ments as well. This implies a recursive definition of rank: the
`
`Petitioner Google Ex-1034, 0006
`
`
`
`US 6,285,999 B1
`
`3
`rank of a document is a function of the ranks of the
`documents which cite it. The ranks of documents may be
`calculated by an iterative procedure on a linked database.
`Because citations, or links, are ways of directing
`attention, the important documents correspond to those
`documents to which the most attention is directed. Thus, a
`high rank indicates that a document is considered valuable
`by many people or by important people. Most likely, these
`are the pages to which Someone performing a Search would
`like to direct his or her attention. Looked at another way, the
`importance of a page is directly related to the Steady-state
`probability that a random Web Surfer ends up at the page
`after following a large number of linkS. Because there is a
`larger probability that a Surfer will end up at an important
`page than at an unimportant page, this method of ranking
`pages assigns higher ranks to the more important pages.
`In one aspect of the invention, a computer implemented
`method is provided for Scoring linked database documents.
`The method comprises the Steps of
`obtaining a plurality of documents, at least Some of the
`documents being linked documents, at least Some of the
`documents being linking documents, and at least Some
`of the documents being both linked documents and
`linking documents, each of the linked documents being
`pointed to by a link in one or more of the linking
`documents, assigning a Score to each of the linked
`documents based on Scores of the one or more linking
`documents, and processing the linked documents
`according to their Scores.
`Additional aspects, applications and advantages will
`become apparent in View of the following description and
`asSociated figures.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is a diagram of the relationship between three
`linked hypertext documents according to the invention.
`FIG. 2 is a diagram of a three-document web illustrating
`the rank associated with each document in accordance with
`the present invention.
`FIG. 3 is a flowchart of one embodiment of the invention.
`
`DETAILED DESCRIPTION
`Although the following detailed description contains
`many specifics for the purposes of illustration, anyone of
`ordinary skill in the art will appreciate that many variations
`and alterations to the following details are within the Scope
`of the invention. Accordingly, the following embodiments of
`the invention are Set forth without any loSS of generality to,
`and without imposing limitations upon, the claimed inven
`tion. For Support in reducing the present invention to
`practice, the inventor acknowledges Sergey Brin, Scott
`Hassan, Rajeev Motwani, Alan Steremberg, and Terry Wino
`grad.
`A linked database (i.e. any database of documents con
`taining mutual citations, Such as the Worldwide web or other
`hypermedia archive, a dictionary or thesaurus, and a data
`base of academic articles, patents, or court cases) can be
`represented as a directed graph of N nodes, where each node
`corresponds to a web page document and where the directed
`connections between nodes correspond to links from one
`document to another. A given node has a set of forward links
`that connect it to children nodes, and a set of backward links
`that connect it to parent nodes. FIG. 1 shows a typical
`relationship between three hypertext documents A, B, and C.
`AS shown in this particular figure, the first linkS in docu
`
`4
`ments B and C are pointers to document A. In this case we
`say that B and C are backlinks of A, and that A is a forward
`link of B and of C. Documents B and C also have other
`forward links to documents that are not shown.
`Although the ranking method of the present invention is
`Superficially similar to the well known idea of citation
`counting, the present method is more Subtle and complex
`than citation counting and gives far Superior results. In a
`Simple citation ranking, the rank of a document A which has
`in backlink pages is simply
`
`According to one embodiment of the present method of
`ranking, the backlinks from different pages are weighted
`differently and the number of links on each page is normal
`ized. More precisely, the rank of a page A is defined
`according to the present invention as
`
`15
`
`C
`
`r(A) = + (1-0)
`
`r(B1) --
`B1
`
`r(B) )
`* IBT):
`
`where B, ..., B, are the backlink pages of A, r(B), ...,
`r(B) are their ranks, B, . . . , B, are their numbers of
`forward links, and C. is a constant in the interval 0,1, and
`N is the total number of pages in the web. This definition is
`clearly more complicated and Subtle than the Simple citation
`rank. Like the citation rank, this definition yields a page rank
`that increases as the number of backlinks increases. But the
`present method considers a citation from a highly ranked
`backlink as more important than a citation from a lowly
`ranked backlink (provided both citations come from back
`link documents that have an equal number of forward links).
`In the present invention, it is possible, therefore, for a
`document with only one backlink (from a very highly ranked
`page) to have a higher rank than another document with
`many backlinks (from very low ranked pages). This is not
`the case with Simple citation ranking.
`The ranks form a probability distribution over web pages,
`So that the Sum of ranks over all web pages is unity. The rank
`of a page can be interpreted as the probability that a Surfer
`will be at the page after following a large number of forward
`links. The constant C. in the formula is interpreted as the
`probability that the web surfer will jump randomly to any
`web page instead of following a forward link. The page
`ranks for all the pages can be calculated using a simple
`iterative algorithm, and corresponds to the principal eigen
`vector of the normalized link matrix of the web, as will be
`discussed in more detail below.
`In order to illustrate the present method of ranking,
`consider the simple web of three documents shown in FIG.
`2. For Simplicity of illustration, we assume in this example
`that r=0. Document A has a Single backlink to document C,
`and this is the only forward link of document C, so
`r(A)=r(C).
`Document B has a Single backlink to document A, but this
`is one of two forward links of document A, SO
`r(B)=r(A)/2.
`Document C has two backlinks. One backlink is to
`document B, and this is the only forward link of document
`B. The other backlink is to document A via the other of the
`two forward links from A. Thus
`r(C)=r(B)+r(A)/2.
`In this simple illustrative case we can see by inspection
`that r(A)=0.4, r(B)=0.2, and r(C)=0.4. Although a typical
`value for C. is -0.1, if for simplicity we set C =0.5 (which
`corresponds to a 50% chance that a Surfer will randomly
`jump to one of the three pages rather than following a
`
`25
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`Petitioner Google Ex-1034, 0007
`
`
`
`S
`forward link), then the mathematical relationships between
`the ranks become more complicated. In particular, we then
`have
`
`6
`present invention. Because the rank of various different
`documents can vary by orders of magnitude, it is convenient
`to define a logarithmic rank
`
`US 6,285,999 B1
`
`The solution in this case is r(A)=1%9, r(B)=1%9, and
`r(C)=1549.
`In practice, there are millions of documents and it is not
`possible to find the Solution to a million equations by
`inspection. Accordingly, in the preferred embodiment a
`Simple iterative procedure is used. AS the initial State we
`may simply Set all the ranks equal to 1/N. The formulas are
`then used to calculate a new set of rankS based on the
`existing ranks. In the case of millions of documents, Suffi
`cient convergence typically takes on the order of 100 itera
`tions. It is not always necessary or even desirable, however,
`to calculate the rank of every page with high precision. Even
`approximate rank values, using two or more iterations, can
`provide very valuable, or even Superior, information.
`The iteration proceSS can be understood as a steady-state
`probability distribution calculated from a model of a random
`Surfer. This model is mathematically equivalent to the expla
`nation described above, but provides a more direct and
`concise characterization of the procedure. The model
`includes (a) an initial N-dimensional probability distribution
`vector po where each component poil gives the initial
`probability that a random Surfer will start at a node i, and (b)
`an NxN transition probability matrix A where each compo
`nent Aij gives the probability that the surfer will move
`from node i to node j. The probability distribution of the
`graph after the Surfer follows one link is p=Apo, and after
`two links the probability distribution is p=Ap=Ap'.
`ASSuming this iteration converges, it will converge to a
`Steady-state probability
`
`pea = lim. Apo,
`
`which is a dominant eigenvector of A. The iteration circu
`lates the probability through the linked nodes like energy
`flows through a circuit and accumulates in important places.
`Because pages with no links occur in Significant numbers
`and bleed off energy, they cause Some complication with
`computing the ranking. This complication is caused by the
`fact they can add huge amounts to the "random jump' factor.
`This, in turn, causes loops in the graph to be highly empha
`sized which is not generally a desirable property of the
`model. In order to address this problem, these childless
`pages can Simply be removed from the model during the
`iterative Stages, and added back in after the iteration is
`complete. After the childless pages are added back in,
`however, the same number of iterations that was required to
`remove them should be done to make Sure they all receive
`a value. (Note that in order to ensure convergence, the norm
`of p must be made equal to 1 after each iteration.) An
`alternate method to control the contribution of the childless
`nodes is to only estimate the Steady State by iterating a Small
`number of times.
`The rank ri) of a node i can then be defined as a function
`of this steady-state probability distribution. For example, the
`rank can be defined simply by ri=pooi. This method of
`calculating rank is mathematically equivalent to the iterative
`method described first. Those skilled in the art will appre
`ciate that this same method can be characterized in various
`different ways that are mathematically equivalent. Such
`characterizations are obviously within the Scope of the
`
`15
`
`25
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`pei
`{pok:
`ri) = log-
`Peo
`ke 1, N
`
`which assigns a rank of 0 to the lowest ranked node and
`increases by 1 for each order of magnitude in importance
`higher than the lowest ranked node.
`“FIG. 3 shows one embodiment of a computer imple
`mented method for calculating an importance rank for N
`linked nodes of a linked database. At a step 101, an initial
`N-dimensional vector po is Selected. An approximation p, to
`a steady-state probability p in accordance with the equation
`p=A"po is computed at a step 103. Matrix A can be an NXN
`transition probability matrix having elements Ai repre
`Senting a probability of moving from node i to node j. At a
`step 105, a rank rk) for node k from a k" component of p,
`is determined.”.
`In one particular embodiment, a finite number of itera
`tions are performed to approximate poo. The initial distri
`bution can be Selected to be uniform or non-uniform. A
`uniform distribution would set each component of p equal
`to 1/N. A non-uniform distribution, for example, can divide
`the initial probability among a few nodes which are known
`a priori to have relatively large importance. This non
`uniform distribution decreases the number of iterations
`required to obtain a close approximation to poo and also is
`one way to reduce the effect of artificially inflating relevance
`by adding unrelated terms.
`In another particular embodiment, the transition matrix A
`is given by
`
`A = 1 + (1 - a B
`= 1 + (1-0)B.
`
`where 1 is an NxN matrix consisting of all 1s., C. is the
`probability that a Surfer will jump randomly to any one
`of the N nodes, and B is a matrix whose elements
`Bij are given by
`
`1
`- if node i points to node i
`Bij = n;
`0 otherwise,
`
`where n is the total number of forward links from node
`i. The (1-C) factor acts as a damping factor that
`limits the extent to which a document's rank can be
`inherited by children documents. This models the
`fact that users typically jump to a different place in
`the web after following a few links. The value of C.
`is typically around 15%. Including this damping is
`important when many iterations are used to calculate
`the rank So that there is no artificial concentration of
`rank importance within loops of the web.
`Alternatively, one may set C=0 and only iterate a few
`times in the calculation.
`Consistent with the present invention, there are Several
`ways that this method can be adapted or altered for various
`purposes. AS already mentioned above, rather than including
`the random linking probability C. equally among all nodes,
`it can be divided in various ways among all the Sites by
`changing the 1 matrix to another matrix. For example, it
`could be distributed So that a random jump takes the Surfer
`
`Petitioner Google Ex-1034, 0008
`
`
`
`US 6,285,999 B1
`
`15
`
`25
`
`35
`
`40
`
`7
`to one of a few nodes that have a high importance, and will
`not take the surfer to any of the other nodes. This can be very
`effective in preventing deceptively tagged documents from
`receiving artificially inflated relevance. Alternatively, the
`random linking probability could be distributed so that
`random jumps do not happen from high importance nodes,
`and only happen from other nodes. This distribution would
`model a Surfer who is more likely to make random jumps
`from unimportant sites and follow forward links from
`important sites. A modification to avoid drawing unwar
`ranted attention to pages with artificially inflated relevance
`is to ignore local links between documents and only consider
`links between Separate domains. Because the links from
`other Sites to the document are not directly under the control
`of a typical web site designer, it is then difficult for the
`designer to artificially inflate the ranking. A simpler
`approach is to weight links from pages contained on the
`Same web server less than links from other Servers. Also, in
`addition to Servers, internet domains and any general mea
`Sure of the distance between links could be used to deter
`mine Such a weighting.
`Additional modifications can further improve the perfor
`mance of this method. Rank can be increased for documents
`whose backlinks are maintained by different institutions and
`authors in various geographic locations. Or it can be
`increased if linkS come from unusually important web
`locations Such as the root page of a domain.
`LinkS can also be weighted by their relative importance
`within a document. For example, highly visible links that are
`near the top of a document can be given more weight. Also,
`links that are in large fonts or emphasized in other ways can
`be given more weight. In this way, the model better approxi
`mates human usage and authors intentions. In many cases
`it is appropriate to assign higher value to links coming from
`pages that have been modified recently Since Such informa
`tion is less likely to be obsolete.
`Various implementations of the invention have the advan
`tage that the convergence is very fast (a few hours using
`current processors) and it is much less expensive than
`building a full-text index. This Speed allows the ranking to
`be customized or personalized for Specific users. For
`example, a user's home page and/or bookmarks can be given
`a large initial importance, and/or a high probability of a
`random jump returning to it. This high rating essentially
`indicates to the System that the person's homepage and/or
`bookmarks does indeed contain Subjects of importance that
`should be highly ranked. This procedure essentially trains
`the System to recognize pages related to the person's inter
`ests. The present method of determining the rank of a
`document can also be used to enhance the display of
`documents. In particular, each link in a document can be
`annotated with an icon, text, or other indicator of the rank of
`the document that each link points to. Anyone viewing the
`document can then easily See the relative importance of
`various links in the document.
`The present method of ranking documents in a database
`can also be useful for estimating the amount of attention any
`document receives on the Web Since it models human
`behavior when Surfing the Web. Estimating the importance
`of each backlink to a page can be useful for many purposes
`including Site design, busineSS arrangements with the
`backlinkers, and marketing. The effect of potential changes
`to the hypertext Structure can be evaluated by adding them
`to the link Structure and recomputing the ranking.
`Real usage data, when available, can be used as a starting
`point for the model and as the distribution for the alpha
`factor.
`
`45
`
`50
`
`55
`
`60
`
`65
`
`8
`This can allow this ranking model to fill holes in the usage
`data, and provide a more accurate or comprehensive picture.
`Thus, although this method of ranking does not necessar
`ily match the actual traffic, it nevertheless measures the
`degree of exposure a document has throughout the web.
`Another important application and embodiment of the
`present invention is directed to enhancing the quality of
`results from Web Search engines. In this application of the
`present invention, a ranking method according to the inven
`tion is integrated into a web search engine to produce results
`far Superior to existing methods in quality and performance.
`A Search engine employing a ranking method of the present
`invention provides automation while producing results com
`parable to a human maintained categorized System. In this
`approach, a web crawler explores the web and creates an
`index of the web content, as well as a directed graph of
`nodes corresponding to the Structure of hyperlinks. The
`nodes of the graph (i.e. pages of the web) are then ranked
`according to importance as described above in connection
`with various exemplary embodiments of the present inven
`tion.
`The Search engine is used to locate documents that match
`the Specified Search criteria, either by Searching full text, or
`by Searching titles only. In addition, the Search can include
`the anchor text associated with backlinks to the page. This
`approach has Several advantages in this context. First,
`anchors often provide more accurate descriptions of web
`pages than the pages themselves. Second, anchors may exist
`for images, programs, and other objects that cannot be
`indexed by a text-based Search engine. This also makes it
`possible to return web pages which have not actually been
`crawled. In addition, the engine can compare the Search
`terms with a list of its backlink document titles. Thus, even
`though the text of the document itself may not match the
`Search terms, if the document is cited by documents whose
`titles or backlink anchor text match the Search terms, the
`document will be considered a match. In addition to or
`instead of the anchor text, the text in the immediate vicinity
`of the backlink anchor text can also be compared to the
`Search terms in order to improve the Search.
`Once a set of documents is identified that match the Search
`terms, the list of documents is then Sorted with high ranking
`documents first and low ranking documents last. The rank
`ing in this case is a function which combines all of the above
`factorS Such as the objective ranking and textual matching.
`If desired, the results can be grouped by category or Site as
`well.
`It will be clear to one skilled in the art that the above
`embodiments may be altered in many ways without depart
`ing from the Sco