`Page
`
`111111111111111111111111111111111111111111111111111111111111111111111111111
`US006285999B1
`US 6,285,999 Bl
`Sep.4,2001
`
`(10) Patent No.:
`(45) Date of Patent:
`
`(54) MEmOD FOR NODE RANKING IN A
`IJNJ(ED DATAJJASE
`
`(75)
`
`Inventor: Lawrence Page, Stanford, CA (US)
`
`( *) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`Craig Boyle "To link or not to link: An empirical comparison
`of Hypertext linking strategies". ACM 1992, pp. 221-231. •
`L. Katz, "A new status index derived from sociometric
`analysis," 1953, Psychometricka, vol. 18, pp. 39-43.
`(73) Assignee: The Board of Trustees of the Leland
`C.H. Hubbell, "An input-output approach to clique identi-
`fication sociometry," 1965, pp. 377-399.
`Stanford Junior University, Stanford,
`-· · · - ··- --- --- ----MiZfuchi et al., -"Teeliiliquesfor dis:aggregalirig-celi!rality ·-
`------- · ·--cA"(US)-
`-
`-- -- - -
`scores in social networks,"1996, Sociological Methodology,
`pp. 26-48.
`E. Garfield, "Citation analysis as a tool in journal evalua(cid:173)
`tion,'' 1972, Science, vol. 178, pp. 471-479.
`Pinski et al., "Citation influence for journal aggregates of
`scientific publications: Theory, with application to the lit(cid:173)
`erature of physics," 1976, Inf. Proc. And Management, vol.
`12, pp. 297-312.
`N. Geller, "On the citation influence methodology of Pinski
`and Narin,'' 1978, Inf. Proc. And Management, val. 14, pp.
`93--95.
`P. Doreian, "Measuring the relative standing of disciplinary
`journals," 1988, Inf. Proc. And Management, vol. 24, pp.
`45-56.
`
`(21) Appl. No.: 09/004,827
`
`(22) Filed:
`
`Jan. 9,1998
`
`Related U.S. Application Data
`(60) Provisional application No. 60/035,205, filed on Jan. 10,
`1997.
`Int. Cl.7
`••••••••.•••••••••••••••••••••••••••••••••••••••••.••• GOOF 17/30
`(51)
`(52) U.S. Cl ..................................... 707/5; 707n; 707/501
`(58) Field of Search .................................... 707/100, 5, 7,
`707/513, 1-3, 10, 104, 501; 345/440; 382/226,
`229, 230, 231
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`4,953,106 • 8/1990 Gansner et al. ..................... 345/440
`5,450,535 • 9/1995 North ................................... 395/140
`5/1998 Mauldin ............................... 395/610
`5,748,954
`5,752,241 • 5/1998 Cohen ...................................... 707/3
`5,832,494 • 11/1998 Egger et al .......................... 707/102
`5,848,407 • 12/1998 Ishikawa et al. ........................ 707/2
`6,014,678 • 1/2000 Inoue et al. .......................... 707/501
`
`OTIIER PUBLICATIONS
`
`S. Jeromy Carriere et al, "Web Query: Searching and Visu(cid:173)
`alizing the Web through Connectivity", Computer Networks
`and ISDN Systems 29 (1997). pp. 1257-1267.*
`Wang et al "Prefetching in Worl Wide Web", IEEE 1996, pp.
`28-32.*
`Ramer et al "Similarity, Probability and Database Organi(cid:173)
`sation: Extended Abstract", 1996, pp. 272.276.*
`
`(List continued on next page.)
`
`Primary Examiner-Thomas Black
`Assistant Examiner-Uyen Le
`(74) Attorney, Agent, or Firm-Harrity & Snyder L.L.P.
`AJJSTRACT
`
`(57)
`
`A method assigns importance ranks to nodes in a linked
`database, such as any database of documents containing
`citations, the world wide web or any other hypermedia
`database. The rank assigned to a document is calculated
`from the ranks of documents citing it. In addition, the rank
`of a document is calculated from a constant representing the
`probability that a browser through the database will ran(cid:173)
`domly jump to the document. The method is particularly
`useful in enhancing the performance of search engine results
`for hypermedia databases, such as the world wide web,
`whose documents have a large variation in quality.
`
`29 Claims, 3 Drawing Sheets
`
`B
`0.2
`
`EXHIBIT 2086
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00479
`
`
`
`US 6,285,999 Bl
`Page 2
`
`01HER PUBLICATIONS
`
`P. Doreian, "A measure of standing for citation networks
`within a wider environment," 1994, Inf. Proc. And Manage(cid:173)
`ment, vol. 30, pp. 21-31.
`Botafogo et al., "Structural analysis of hypertext: Identifying
`hierarchies and useful metrics," 1992, ACM Trans. Inc.
`Systems, vol. 10, pp. 142-180.
`Mark E. Frisse, "Searching for information in a hypertext
`medical handbook," 1988, Communications of the ACM,
`vol. 31, No. 7, pp. 880--886.
`Massimo Marchiori, "The quest for correct information on
`the Web: Hyper search engines," 1997, Computer Networks
`and ISDN Systems, vol. 29, No. 8-13, pp. 1225-1235.
`Oliver A McBryan, "GENVL and WWWW: Tools for
`taming the web," 1994, Proceedings of the first International
`Wold Wide Web Conference, pp. 1-13.
`
`Carriere et al., "WebQuery: Searching and visualizing the
`web through connectivity," 1997, Proc. 6'h International
`World Wide Web Conference, pp. 1-14.
`
`Arocena et al., "Applications of a web query language,"
`1997, Computer Networks and ISDN Systems, vol. 29, No.
`8-13, pp. 1305-1316.
`
`Jon M. Kleinberg, "Authoritative sources in a hyperlinked
`environment," 1998, Proc. Of the 9'h Annual ACM-SIAM
`Symposium on Discrete Algorithms, pp. 668-677.
`
`Henzinger et al., "Measuring index quality using random
`walks on the web", 1999, Pro c. of the 8'h International World
`Wide Web Conference, pp. 213-225.
`
`* cited by examiner
`
`
`
`U.S. Patent
`US. Patent
`
`Sep.4,2001
`Sep. 4, 2001
`
`Sheet 1 013
`Sheet 1 of 3
`
`US 6,285,999 Bl
`US 6,285,999 B1
`
`8
`
`c
`
`
`
`A
`
`
`
`FIG. 1
`FIG. 1
`
`
`
`U.S. Patent
`US. Patent
`
`Sep.4,2001
`Sep. 4, 2001
`
`Sheet 2 0f3
`Sheet 2 of 3
`
`US 6,285,999 Bl
`US 6,285,999 B1
`
`A
`0.4
`
`0.2
`
`B
`0.2
`
`
`
`c
`0.4
`
`FIG. 2
`FIG. 2
`
`
`
`U.S. Patent
`
`Sep.4,2001
`
`Sheet 3 of 3
`
`US 6,285,999 Bl
`
`101
`~--------------~--------------~~
`
`START
`
`+
`
`SELECT AN INITIAL N-DIMENSIONAL VECTOR p0
`
`103
`,,
`~--------------~--------------~~
`
`COMPUTE AN APPROXIMATION Pn TO A
`STEADY-STATE PROBABILITY Pco IN
`ACCORDANCE WITH THE EQUATION Pn = Anp0
`
`105
`,.
`~--------------~--------------~~
`
`DETERMINE A RANK r[k] FOR NODE k FROM A
`k1h COMPONENT OF Pn
`
`+
`
`DONE
`
`FIG. 3
`
`
`
`US 6,285,999 Bl
`
`1
`METHOD FOR NODE RANKING IN A
`LINKED DATABASE
`
`CROSS-REFERENCES TO RELATED
`APPLICATIONS
`
`This application claims priority from U.S. provisional
`patent application Ser. No. 60/035,205 filed Jan. 10, 1997,
`which is incorporated herein by reference.
`
`STATEMENT REGARDING GOVERNMENT
`SUPPORT
`
`This invention was supported in part by the National
`Science Foundation grant number IRI-9411306-4. The Gov(cid:173)
`ernment has certain rights in the invention.
`
`FIELD OF THE INVENTION
`
`This invention relates generally to techniques for analyz(cid:173)
`ing linked databases. More particularly, it relates to methods
`for assigning ranks to nodes in a linked database, such as any 20
`database of documents containing citations, the world wide
`web or any other hypermedia database.
`
`BACKGROUND OF THE INVENTION
`
`2
`Hyperlink Search Engine, developed by IDD Information
`Services, (http://rankdex.gari.com/) uses backlink informa(cid:173)
`tion (i.e., information from pages that contain links to the
`current page) to assist in identifying relevant web docu-
`5 ments. Rather than using the content of a document to
`determine relevance, the technique uses the anchor text of
`links to the document to characterize the relevance of a
`document. The idea of associating anchor text with the page
`the text points to was first implemented in the World Wide
`10 Web Worm (Oliver A McBryan, GENVL and WWWW:
`Tools for Taming the Web, First International Conference on
`the World Wide Web, CERN, Geneva, May 25-27, 1994).
`The Hyperlink Search Engine has applied this idea to assist
`in determining document relevance in a search. In particular,
`15 search query terms are compared to a collection of anchor
`text descriptions that point to the page, rather than to a
`keyword index of the page content. A rank is then assigned
`to a document based on the degree to which the search terms
`match the anchor descriptions in its backlink documents.
`The well known idea of citation counting is a simple
`method for determining the importance of a document by
`counting its number of citations, or backlinks. The citation
`rank r(A) of a document which has n backlink pages 1s
`simply
`r(A)=n.
`In the case of databases whose content is of relatively
`uniform quality and importance it is valid to assume that a
`highly cited document should be of greater interest than a
`document with only one or two citations. Many databases,
`30 however, have extreme variations in the quality and impor(cid:173)
`tance of documents. In these cases, citation ranking is overly
`simplistic. For example, citation ranking will give the same
`rank to a document that is cited once on an obscure page as
`to a similar document that is cited once on a well-known and
`35 highly respected page.
`
`25
`
`SUMMARY
`
`Various aspects of the present invention provide systems
`and methods for ranking documents in a linked database.
`One aspect provides an objective ranking based on the
`relationship between documents. Another aspect of the
`invention is directed to a technique for ranking documents
`within a database whose content has a large variation in
`quality and importance. Another aspect of the present inven(cid:173)
`tion is to provide a document ranking method that is scalable
`and can be applied to extremely large databases such as the
`world wide web. Additional aspects of the invention will
`become apparent in view of the following description and
`50 associated figures.
`One aspect of the present invention is directed to taking
`advantage of the linked structure of a database to assign a
`rank to each document in the database, where the document
`rank is a measure of the importance of a document. Rather
`55 than determining relevance only from the intrinsic content of
`a document, or from the anchor text of backlinks to the
`document, a method consistent with the invention deter(cid:173)
`mines importance from the extrinsic relationships between
`documents. Intuitively, a document should be important
`60 (regardless of its content) if it is highly cited by other
`documents. Not all citations, however, are necessarily of
`equal significance. A citation from an important document is
`more important than a citation from a relatively unimportant
`document. Thus, the importance of a page, and hence the
`65 rank assigned to it, should depend not just on the number of
`citations it has, but on the importance of the citing docu(cid:173)
`ments as well. This implies a recursive definition of rank: the
`
`Due to the developments in computer technology and its
`increase in popularity, large numbers of people have recently
`started to frequently search huge databases. For example,
`internet search engines are frequently used to search the
`entire world wide web. Currently, a popular search engine
`might execute over 30 million searches per day of the
`indexable part of the web, which has a size in excess of 500
`Gigabytes. Information retrieval systems are traditionally
`judged by their precision and recall. What is often neglected,
`however, is the quality of the results produced by these
`search engines. Large databases of documents such as the
`web contain many low quality documents. As a result,
`searches typically return hundreds of irrelevant or unwanted
`documents which camouflage the few relevant ones. In order
`to improve the selectivity of the results, common techniques 40
`allow the user to constrain the scope of the search to a
`specified subset of the database, or to provide additional
`search terms. These techniques are most effective in cases
`where the database is homogeneous and already classified
`into subsets, or in cases where the user is searching for well 45
`known and specific information. In other cases, however,
`these techniques are often not effective because each con(cid:173)
`straint introduced by the user increases the chances that the
`desired information will be inadvertently eliminated from
`the search results.
`Search engines presently use various techniques that
`attempt to present more relevant documents. Typically,
`documents are ranked according to variations of a standard
`vector space model. These variations could include (a) how
`recently the document was updated, and/or (b) how close the
`search terms are to the beginning of the document. Although
`this strategy provides search results that are better than with
`no ranking at all, the results still have relative! y low quality.
`Moreover, when searching the highly competitive web, this
`measure of relevancy is vulnerable to "spamming" tech(cid:173)
`niques that authors can use to artificially inflate their docu(cid:173)
`ment's relevance in order to draw attention to it or its
`advertisements. For this reason search results often contain
`commercial appeals that should not be considered a match to
`the query. Although search engines are designed to avoid
`such ruses, poorly conceived mechanisms can result m
`disappointing failures to retrieve desired information.
`
`
`
`US 6,285,999 Bl
`
`4
`ments B and C are pointers to document A. In this case we
`say that B and C are backlinks of A, and that A is a forward
`link of B and of C. Documents B and C also have other
`forward links to documents that are not shown.
`Although the ranking method of the present invention is
`superficially similar to the well known idea of citation
`counting, the present method is more subtle and complex
`than citation counting and gives far superior results. In a
`simple citation ranking, the rank of a document A which has
`n backlink pages is simply
`r(A)=n.
`According to one embodiment of the present method of
`ranking, the backlinks from different pages are weighted
`differently and the number of links on each page is normal(cid:173)
`ized. More precisely, the rank of a page A is defined
`according to the present invention as
`
`a:
`r(Bn))
`(r(BJ)
`r(A) = - + (1- a:) - - + · · · + - - ,
`IBII
`IBnl
`N
`
`20
`
`3
`rank of a document is a function of the ranks of the
`documents which cite it. The ranks of documents may be
`calculated by an iterative procedure on a linked database.
`Because citations, or links, are ways of directing
`attention, the important documents correspond to those 5
`documents to which the most attention is directed. Thus, a
`high rank indicates that a document is considered valuable
`by many people or by important people. Most likely, these
`are the pages to which someone performing a search would
`like to direct his or her attention. Looked at another way, the 10
`importance of a page is directly related to the steady-state
`probability that a random web surfer ends up at the page
`after following a large number of links. Because there is a
`larger probability that a surfer will end up at an important
`page than at an unimportant page, this method of ranking 15
`pages assigns higher ranks to the more important pages.
`In one aspect of the invention, a computer implemented
`method is provided for scoring linked database documents.
`The method comprises the steps of:
`obtaining a plurality of documents, at least some of the
`documents being linked documents, at least some of the
`documents being linking documents, and at least some
`of the documents being both linked documents and
`linking documents, each of the linked documents being 25
`pointed to by a link in one or more of the linking
`documents; assigning a score to each of the linked
`documents based on scores of the one or more linking
`documents; and processing the linked documents
`according to their scores.
`Additional aspects, applications and advantages will
`become apparent in view of the following description and
`associated figures.
`
`30
`
`where B1 , . . . , Bn are the backlink pages of A, r(B1 ), . . . ,
`r(Bn) are their ranks, IB 1 I, ... , IBnl are their numbers of
`forward links, and a is a constant in the interval [0,1], and
`N is the total number of pages in the web. This definition is
`clearly more complicated and subtle than the simple citation
`rank. Like the citation rank, this definition yields a page rank
`that increases as the number of backlinks increases. But the
`present method considers a citation from a highly ranked
`backlink as more important than a citation from a lowly
`ranked backlink (provided both citations come from back-
`link documents that have an equal number of forward links).
`In the present invention, it is possible, therefore, for a
`document with only one backlink (from a very highly ranked
`35 page) to have a higher rank than another document with
`many backlinks (from very low ranked pages). This is not
`the case with simple citation ranking.
`The ranks form a probability distribution over web pages,
`so that the sum of ranks over all web pages is unity. The rank
`40 of a page can be interpreted as the probability that a surfer
`will be at the page after following a large number of forward
`links. The constant a in the formula is interpreted as the
`probability that the web surfer will jump randomly to any
`web page instead of following a forward link. The page
`45 ranks for all the pages can be calculated using a simple
`iterative algorithm, and corresponds to the principal eigen(cid:173)
`vector of the normalized link matrix of the web, as will be
`discussed in more detail below.
`In order to illustrate the present method of ranking,
`50 consider the simple web of three documents shown in FIG.
`2. For simplicity of illustration, we assume in this example
`that r=O. Document A has a single backlink to document C,
`and this is the only forward link of document C, so
`r(A)=r(C).
`Document B has a single backlink to document A, but this
`is one of two forward links of document A, so
`r(B)=r(A)/2.
`Document C has two backlinks. One backlink is to
`document B, and this is the only forward link of document
`60 B. The other backlink is to document A via the other of the
`two forward links from A Thus
`r(C)=r(B)+r(A)/2.
`In this simple illustrative case we can see by inspection
`that r(A)=0.4, r(B)=0.2, and r(C)=0.4. Although a typical
`value for a is -0.1, if for simplicity we set a=0.5 (which
`corresponds to a 50% chance that a surfer will randomly
`jump to one of the three pages rather than following a
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a diagram of the relationship between three
`linked hypertext documents according to the invention.
`FIG. 2 is a diagram of a three-document web illustrating
`the rank associated with each document in accordance with
`the present invention.
`FIG. 3 is a flowchart of one embodiment of the invention.
`
`DETAILED DESCRIPTION
`
`Although the following detailed description contains
`many specifics for the purposes of illustration, anyone of
`ordinary skill in the art will appreciate that many variations
`and alterations to the following details are within the scope
`of the invention. Accordingly, the following embodiments of
`the invention are set forth without any loss of generality to,
`and without imposing limitations upon, the claimed inven(cid:173)
`tion. For support in reducing the present invention to
`practice, the inventor acknowledges Sergey Erin, Scott
`Hassan, Rajeev Motwani,Alan Steremberg, and Terry Wino(cid:173)
`grad.
`A linked database (i.e. any database of documents con(cid:173)
`taining mutual citations, such as the world wide web or other
`hypermedia archive, a dictionary or thesaurus, and a data(cid:173)
`base of academic articles, patents, or court cases) can be
`represented as a directed graph of N nodes, where each node
`corresponds to a web page document and where the directed
`connections between nodes correspond to links from one
`document to another. A given node has a set of forward links
`that connect it to children nodes, and a set of backward links
`that connect it to parent nodes. FIG. 1 shows a typical 65
`relationship between three hypertext documents A, B, and C.
`As shown in this particular figure, the first links in docu-
`
`55
`
`
`
`US 6,285,999 Bl
`
`6
`present invention. Because the rank of various different
`documents can vary by orders of magnitude, it is convenient
`to define a logarithmic rank
`
`5
`forward link), then the mathematical relationships between
`the ranks become more complicated. In particular, we then
`have
`r(A)=%+r(C)/2,
`r(B)=%+r(A)/4, and
`r(C)=%+r(A)/4+r(B)/2.
`The solution in this case is r(A)=1%9, r(B)=10/39, and
`r(C)=15f39.
`In practice, there are millions of documents and it is not
`possible to find the solution to a million equations by 10
`inspection. Accordingly, in the preferred embodiment a
`simple iterative procedure is used. As the initial state we
`may simply set all the ranks equal to 1/N. The formulas are
`then used to calculate a new set of ranks based on the
`existing ranks. In the case of millions of documents, suffi(cid:173)
`cient convergence typically takes on the order of 100 itera(cid:173)
`tions. It is not always necessary or even desirable, however,
`to calculate the rank of every page with high precision. Even
`approximate rank values, using two or more iterations, can
`provide very valuable, or even superior, information.
`The iteration process can be understood as a steady-state
`probability distribution calculated from a model of a random
`surfer. This model is mathematically equivalent to the expla(cid:173)
`nation described above, but provides a more direct and
`concise characterization of the procedure. The model 25
`includes (a) an initial N-dimensional probability distribution
`vector p0 where each component p0[i] gives the initial
`probability that a random surfer will start at a node i, and (b)
`an NxN transition probability matrix A where each compo(cid:173)
`nent A[i][j] gives the probability that the surfer will move 30
`from node i to node j. The probability distribution of the
`graph after the surfer follows one link is p1=Ap0, and after
`two links the probability distribution is p2 =Ap1=A2p0.
`Assuming this iteration converges, it will converge to a
`steady-state probability
`
`5
`
`which assigns a rank of 0 to the lowest ranked node and
`increases by 1 for each order of magnitude in importance
`higher than the lowest ranked node.
`"FIG. 3 shows one embodiment of a computer imple(cid:173)
`mented method for calculating an importance rank for N
`linked nodes of a linked database. At a step 101, an initial
`15 N-dimensional vector p0 is selected. An approximation Pn to
`a steady-state probability p= in accordance with the equation
`Pn=Anp 0 is computed at a step 103. Matrix A can be an NxN
`transition probability matrix having elements A[i][j] repre(cid:173)
`senting a probability of moving from node ito node j. At a
`20 step 105, a rank r[k] for node k from a k'h component of Pn
`is determined.".
`In one particular embodiment, a finite number of itera-
`tions are performed to approximate poo. The initial distri(cid:173)
`bution can be selected to be uniform or non-uniform. A
`uniform distribution would set each component of p0 equal
`to 1/N. A non-uniform distribution, for example, can divide
`the initial probability among a few nodes which are known
`a priori to have relatively large importance. This non(cid:173)
`uniform distribution decreases the number of iterations
`required to obtain a close approximation to poo and also is
`one way to reduce the effect of artificially inflating relevance
`by adding unrelated terms.
`In another particular embodiment, the transition matrix A
`is given by
`
`35
`
`a:
`A = N 1 + (1- a:)B,
`
`which is a dominant eigenvector of A The iteration circu- 40
`lates the probability through the linked nodes like energy
`flows through a circuit and accumulates in important places.
`Because pages with no links occur in significant numbers
`and bleed off energy, they cause some complication with
`computing the ranking. This complication is caused by the 45
`fact they can add huge amounts to the "random jump" factor.
`This, in turn, causes loops in the graph to be highly empha(cid:173)
`sized which is not generally a desirable property of the
`model. In order to address this problem, these childless
`pages can simply be removed from the model during the 50
`iterative stages, and added back in after the iteration is
`complete. After the childless pages are added back in,
`however, the same number of iterations that was required to
`remove them should be done to make sure they all receive
`a value. (Note that in order to ensure convergence, the norm 55
`of P; must be made equal to 1 after each iteration.) An
`alternate method to control the contribution of the childless
`nodes is to only estimate the steady state by iterating a small
`number of times.
`The rank r[i] of a node i can then be defined as a function 60
`of this steady-state probability distribution. For example, the
`rank can be defined simply by r[i]=poo[i]. This method of
`calculating rank is mathematically equivalent to the iterative
`method described first. Those skilled in the art will appre(cid:173)
`ciate that this same method can be characterized in various 65
`different ways that are mathematically equivalent. Such
`characterizations are obviously within the scope of the
`
`where 1 is an NxN matrix consisting of all 1s, a is the
`probability that a surfer will jump randomly to any one
`of the N nodes, and B is a matrix whose elements
`B[ i ][j] are given by
`
`B[i][J] =
`
`_1_ if node i points to node j
`n;
`0 otherwise,
`
`{
`
`where n; is the total number of forward links from node
`i. The (1-a) factor acts as a damping factor that
`limits the extent to which a document's rank can be
`inherited by children documents. This models the
`fact that users typically jump to a different place in
`the web after following a few links. The value of a
`is typically around 15%. Including this damping is
`important when many iterations are used to calculate
`the rank so that there is no artificial concentration of
`rank importance within loops of the web.
`Alternatively, one may set a=O and only iterate a few
`times in the calculation.
`Consistent with the present invention, there are several
`ways that this method can be adapted or altered for various
`purposes. As already mentioned above, rather than including
`the random linking probability a equally among all nodes,
`it can be divided in various ways among all the sites by
`changing the 1 matrix to another matrix. For example, it
`could be distributed so that a random jump takes the surfer
`
`
`
`US 6,285,999 Bl
`
`7
`to one of a few nodes that have a high importance, and will
`not take the surfer to any of the other nodes. This can be very
`effective in preventing deceptively tagged documents from
`receiving artificially inflated relevance. Alternatively, the
`random linking probability could be distributed so that
`random jumps do not happen from high importance nodes,
`and only happen from other nodes. This distribution would
`model a surfer who is more likely to make random jumps
`from unimportant sites and follow forward links from
`important sites. A modification to avoid drawing unwar(cid:173)
`ranted attention to pages with artificially inflated relevance
`is to ignore local links between documents and only consider
`links between separate domains. Because the links from
`other sites to the document are not directly under the control
`of a typical web site designer, it is then difficult for the
`designer to artificially inflate the ranking. A simpler
`approach is to weight links from pages contained on the
`same web server less than links from other servers. Also, in
`addition to servers, internet domains and any general mea(cid:173)
`sure of the distance between links could be used to deter(cid:173)
`mine such a weighting.
`Additional modifications can further improve the perfor(cid:173)
`mance of this method. Rank can be increased for documents
`whose backlinks are maintained by different institutions and
`authors in various geographic locations. Or it can be
`increased if links come from unusually important web
`locations such as the root page of a domain.
`Links can also be weighted by their relative importance
`within a document. For example, highly visible links that are
`near the top of a document can be given more weight. Also, 30
`links that are in large fonts or emphasized in other ways can
`be given more weight. In this way, the model better approxi(cid:173)
`mates human usage and authors' intentions. In many cases
`it is appropriate to assign higher value to links coming from
`pages that have been modified recently since such informa- 35
`tion is less likely to be obsolete.
`Various implementations of the invention have the advan(cid:173)
`tage that the convergence is very fast (a few hours using
`current processors) and it is much less expensive than
`building a full-text index. This speed allows the ranking to
`be customized or personalized for specific users. For
`example, a user's home page and/or bookmarks can be given
`a large initial importance, and/or a high probability of a
`random jump returning to it. This high rating essentially
`indicates to the system that the person's homepage and/or 45
`bookmarks does indeed contain subjects of importance that
`should be highly ranked. This procedure essentially trains
`the system to recognize pages related to the person's inter(cid:173)
`ests. The present method of determining the rank of a
`document can also be used to enhance the display of 50
`documents. In particular, each link in a document can be
`annotated with an icon, text, or other indicator of the rank of
`the document that each link points to. Anyone viewing the
`document can then easily see the relative importance of
`various links in the document.
`The present method of ranking documents in a database
`can also be useful for estimating the amount of attention any
`document receives on the web since it models human
`behavior when surfing the web. Estimating the importance
`of each backlink to a page can be useful for many purposes 60
`including site design, business arrangements with the
`backlinkers, and marketing. The effect of potential changes
`to the hypertext structure can be evaluated by adding them
`to the link structure and recomputing the ranking.
`Real usage data, when available, can be used as a starting 65
`point for the model and as the distribution for the alpha
`factor.
`
`8
`This can allow this ranking model to fill holes in the usage
`data, and provide a more accurate or comprehensive picture.
`Thus, although this method of ranking does not necessar(cid:173)
`ily match the actual traffic, it nevertheless measures the
`5 degree of exposure a document has throughout the web.
`Another important application and embodiment of the
`present invention is directed to enhancing the quality of
`results from web search engines. In this application of the
`present invention, a ranking method according to the inven-
`10 tion is integrated into a web search engine to produce results
`far superior to existing methods in quality and performance.
`A search engine employing a ranking method of the present
`invention provides automation while producing results com(cid:173)
`parable to a human maintained categorized system. In this
`approach, a web crawler explores the web and creates an
`15 index of the web content, as well as a directed graph of
`nodes corresponding to the structure of hyperlinks. The
`nodes of the graph (i.e. pages of the web) are then ranked
`according to importance as described above in connection
`with various exemplary embodiments of the present inven-
`20 tion.
`The search engine is used to locate documents that match
`the specified search criteria, either by searching full text, or
`by searching titles only. In addition, the search can include
`the anchor text associated with backlinks to the page. This
`25 approach has several advantages in this context. First,
`anchors often provide more accurate descriptions of web
`pages than the pages themselves. Second, anchors may exist
`for images, programs, and other objects that cannot be
`indexed by a text-based search engine. This also makes it
`possible to return web pages which have not actually been
`crawled. In addition, the engine can compare the search
`terms with a list of its backlink document titles. Thus, even
`though the text of the document itself may not match the
`search terms, if the document is cited by documents whose
`titles or backlink anchor text match the search terms, the
`document will be considered a match. In addition to or
`instead of the anchor text, the text in the immediate vicinity
`of the backlink anchor text can also be compared to the
`se