`
`US007028029B2
`
`c12) United States Patent
`Kamvar et al.
`
`(10) Patent No.:
`(45) Date of Patent:
`
`US 7,028,029 B2
`Apr.ll, 2006
`
`(54) ADAPTIVE COMPUTATION OF RANKING
`
`(75)
`
`Inventors: Sepandar D. Kamvar, Palo Alto, CA
`(US); Taber H. Haveliwala, Mountain
`View, CA (US); Gene H. Golub,
`Stanford,_CA (US)
`
`(73) Assignee: Google Inc., Mountain View, CA (US)
`( * ) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 10/925,189
`
`(22) Filed:
`
`Aug. 23, 2004
`
`(65)
`
`Prior Publication Data
`
`US 2005/0027685 AI
`
`Feb. 3, 2005
`
`Related U.S. Application Data
`(63) Continuation-in-part of application No. 10/646,331,
`filed on Aug. 22, 2003.
`
`(60) Provisional application No. 60/458,921, filed on Mar.
`28,2003.
`
`(51)
`
`Int. Cl.
`(2006.01) .
`G06F 17130
`(52) U.S. Cl. ................................ 707/S; 3/7; 715/501.1
`(58) Field of Classification Search .................... 707/3,
`707/100, 5, 101, 10; 706/15; 715/500
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`6,560,600 Bl"' 5/2003 Broder .......................... 70717
`6,584,468 Bl"' 6/2003 Gabriel eta! ................. 707/IO
`6,799,176 Bt• 9/2004 Page ............................. 707/5
`6,87I,202 Bl * 3/2005 Broder .......................... 707/7
`2003/0204502 AI* I0/2003 Tomlin eta! .................. 707!5
`2003/0208478 AI "' 11/2003 Harvey . .. . ...... ... .. ... ........ 707/3
`2003/0208482- AI "' H/2003 --IGm-et at .. ~ ................ ;~ 707/3
`2004/0024752 AI* 2/2004 Manber eta! ................. 707/3
`2004/0111412 AI* 6/2004 Broder .......................... 707!7
`
`OTHER PUBLICATIONS
`Arasu, A, eta!., "PageRank Computation and the Structure
`of the Web: Experiments and Algorithms," Proceedings of
`the 11th lnt'l World Wide Web Conj, Poster Track, 2002.
`Bharat, K., et a!., "Improved Algorithms for Topic Distilla(cid:173)
`tion in a Hyperlinked Environment," Proceedings of the
`ACM-SIGIR, 1998.
`
`(Continued)
`
`Primary Examiner-Jean M. Corrielus
`(74) Attomey, Agent, or Firm-Morgan, Lewis & Beckius
`LLP
`
`(57)
`
`ABSTRACT
`
`A system and method is disclosed in which a ranking
`function for a set of document rank values is iteratively
`solved with respect to a set of linked documents until a first
`stability condition is satisfied. After such condition is satis(cid:173)
`fied, some of the ranks will have converged. The ranking
`function is modified to take into account these converged
`ranks so as to reduce the ranking function's computation
`cost. The modified ranking function is then solved until a
`second stability condition is satisfied. After such condition is
`satisfied more of the ranks will have converged. The ranking
`function is again modified and process continues until
`complete.
`
`6,112,202 A
`6,285,999 BI •
`
`8/2000 Kleinberg
`9/200I Page ............................. 70715
`
`31 Claims, S Drawing Sheets
`
`EXHIBIT 2104
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`
`
`US 7,028,029 B2
`Page 2
`
`OTHER PUBLICATIONS
`
`Haveliwala, T., "Efficient Computation of PageRank," Stan(cid:173)
`ford University Technical Report, 1999.
`Haveliwala, T., "Topic Sensitive PageRank," Proceedings of
`the 11th Int'l World Wide Web Conf, 2002.
`Haveliwala, T., eta!., "The Second Eigenvalue of the Google
`Matrix," Stanford University Technical Report, 2003.
`Jeh, G., eta!., "Scaling Personalized Web Search," Proceed(cid:173)
`ings ofthe 12th Int'l World Wide Web Conf, 2003.
`
`Kamvar, S., et a!., "Exploiting the Block Structure of the
`Web for Computing PageRank," Stanford University Techni(cid:173)
`cal Report, 1999.
`Kamvar, S., et a!., "Extrapolation Methods for Accelerating
`PageRank Computations," Proceedings of the 12th Int 'l
`World Wide Web Conf, 2003.
`Page, L., et a!., "The PageRank Citation Ranking: Bringing
`Order to the Web," Stanford Digital Libraries Working
`Paper, 1998.
`* cited by examiner
`
`
`
`U.S. Patent
`
`Apr. 11, 2006
`
`Sheet 1 of 5
`
`US 7,028,029 B2
`
`Search Engine 100
`
`-
`
`Indexers
`
`106 ' Document
`
`Index
`108
`
`------------------~-·
`
`Second Level
`Controller
`120
`
`t
`
`Doc.
`Index
`Server
`
`•••
`
`t
`
`Doc.
`Index
`Server
`122n
`
`I
`
`122a
`
`Link Records
`124
`
`'-
`
`RL Resolver
`126
`
`Link Map(s)
`128
`age Ranke~
`130
`/
`
`Page Ranks
`132
`
`Quantizers
`134
`
`1
`
`r-------r1 ------------------~
`I
`I
`I
`
`Search Engine
`Back End System
`102
`
`Search Engine
`Front End System
`104
`
`Fig. 1
`
`
`
`U.S. Patent
`
`Apr. 11, 2006
`
`Sheet 2 of 5
`
`US 7,028,029 B2
`
`210
`
`1
`
`0.9
`
`.0.6
`
`(/)
`Q)
`
`0.7
`
`C> m-
`a...~ 0.6
`
`-·-oro
`c"S 0.5
`.Q E
`t:::
`0(.) c..-e
`a...
`
`::::::1
`
`0.4
`
`0.3
`
`0.2
`
`0.1
`
`00
`
`5
`
`10
`
`15
`
`20
`
`25
`
`JO
`
`35
`
`40
`
`45
`
`Convergence Time
`
`Fig. 2
`
`
`
`U.S. Patent
`
`Apr. 11, 2006
`
`Sheet 3 of 5
`
`US 7,028,029 B2
`
`Create a directed graph of linked documents
`
`Associate all nodes with the set of non-converged nodes
`
`302
`
`304
`
`Calculate iteration ranking values of a
`ranking function for the nodes in the set of
`non-converged nodes
`
`312
`
`no
`
`ave a predefine
`number of iterations
`finished?
`
`Identify nodes in the non-converged set
`whose ranking values have converged to a
`predefined iteration tolerance
`
`Associate the identified nodes to a set of
`converged nodes and remove them from
`the set of non-converged nodes
`
`310
`
`314
`
`316
`
`Fig. 3
`
`
`
`U.S. Patent
`
`Apr. 11, 2006
`
`Sheet 4 of 5
`
`US 7,028,029 B2
`
`J
`X (k+1)
`1
`
`X (k+1)
`2
`
`402
`
`404
`J -
`
`A11
`
`A12
`
`A2.1
`
`A2.2
`
`412
`
`406
`
`A1,n
`
`~.n
`
`X (k)
`1
`
`X (k)
`2
`
`(k+1)
`xn-m
`/ ,..---- '
`(k+1 '
`'X
`n-m+1
`\
`
`/
`
`(k+1)
`xn-m+2
`
`1
`I
`I
`I
`I
`I
`I
`I
`I
`\
`
`\
`I
`I
`I
`I
`I
`I
`I
`J
`I
`
`I
`
`X (k+1)
`' n
`
`' --c_
`410
`
`-
`
`An-m,1
`
`An-m,2
`
`An-m,n
`
`An-m+1,1 An-m+1,2
`
`An-m+1,n
`
`An-m+2,1 An-m+2,2
`
`An-m+2,n
`
`An1
`
`An,2
`
`Ann
`
`Fig. 4
`
`(k+1
`xn-m
`........ - ...... ,
`
`\
`
`/
`
`I
`
`'(_k
`1 Xn-m+1 1
`I
`(k
`xn-m+2
`
`I
`I
`I
`
`\ X (k)
`\
`/
`n
`' '~
`408
`
`X (k+1)
`1
`
`\
`I
`I
`I
`I
`I
`I
`I
`I
`
`I
`I
`I
`
`•••
`X (k+1)
`' n
`
`\ ',,_--(_
`
`502
`
`506
`
`f
`
`508
`
`/
`
`514
`
`/
`
`A1.1
`
`A1 n-m
`
`A1,n-m+1
`
`A1 n
`
`··· An-m n-m
`
`An-m,n-m+1
`
`· · ·
`
`An-m n
`
`- -------------------~--------------------
`A n-m+1,n-m
`
`An-m+1,1
`
`An-m+1 n-m+1
`
`· · · An-m+1,n
`
`518 J -
`
`X (k)
`1
`
`X
`(k+1
`n-m
`
`\ 512
`
`An,n-m+1
`
`\
`\
`
`\ 510
`
`Fig. 5
`
`X (k)
`n
`
`I
`1
`J
`
`I
`
`',-<_
`504
`
`
`
`U.S. Patent
`
`Apr. 11, 2006
`
`Sheet 5 of 5
`
`US 7,028,029 B2
`
`Computer 600
`
`./
`Operating System
`
`Memory 606
`
`CPU
`602
`
`I 608~
`
`__.~61 0
`
`User interface
`I Display ~ -612
`I Keyboard I
`_/
`
`Communication
`interface
`_;
`
`604
`
`614
`
`Network Communication
`Module
`Page Ranker
`
`Computation Module
`
`Modification Module
`
`Removal Module
`
`Modifier Module
`
`Identification Module
`
`Convergence Module
`
`Fig. 6
`
`v
`v
`
`f
`f
`v
`v
`v
`v
`v
`
`616
`
`618
`
`130
`
`620
`
`622
`
`624
`
`626
`
`628
`
`630
`
`
`
`US 7,028,029 B2
`
`1
`ADAPTIVE COMPUTATION OF RANKING
`
`RELATED APPLICATIONS
`
`This application is a continuation-in-part of U.S. Utility
`patent application Ser. No. 10/646,331, filed Aug. 22, 2003,
`which claimed priority on U.S. Provisional Patent Applica(cid:173)
`tion No. 60/458,921 filed Mar. 28, 2003, both of which are
`incorporated by reference herein in their entirety.
`
`STATEMENT OF GOVERNMENT SPONSORED
`SUPPORT
`
`This invention was supported in part by the National
`Science Foundation under Grant No. IIS-0085896 and Grant
`No. CCR-9971010. The US Government has certain rights
`in this invention.
`
`2
`based ranking. For example, U.S. Pat. No. 6,285,999 to Page
`discloses a technique used by the Google search engine for
`assigning a rank to each document in a hypertext database.
`According to the link-based ranking method of Page, the
`rank of a node is recursively defined as a function of the
`ranks of its parent nodes. Looked at another way, the rank of
`a node is the steady-state probability that an arbitrarily long
`random walk through the network will end up at the given
`node. Thus, a node will tend to have a high rank if it has
`10 many parents, or if its parents have high rank.
`Although link-based ranking techniques are improve(cid:173)
`ments over prior techniques, in the case of an extremely
`large database, such as the world wide web which contains
`billions of pages, the computation of the ranks for all the
`15 pages can take considerable time. Accordingly, it would be
`valuable to provide techniques for calculating page ranks
`with greater computational efficiency.
`
`TECHNICAL FIELD
`
`SUMMARY
`
`The present invention relates generally to improved tech(cid:173)
`niques for analyzing large directed graphs for use in com(cid:173)
`puter systems, and in particular to reducing the computa(cid:173)
`tional complexity of assigning ranks to nodes.
`
`BACKGROUND
`
`20
`
`In one embodiment, the invention includes iteratively
`solving a ranking function for a set of document rank values
`with respect to a set of linked documents until a first stability
`condition is satisfied. The ranking function is modified so as
`25 to reduce the ranking function's computation cost and then
`the modified ranking function is solved until a second
`stability condition is satisfied.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The aforementioned features and advantages of the inven(cid:173)
`tion as well as additional features and advantages thereof
`will be more clearly understood hereinafter as a result of a
`detailed description of embodiments of the invention when
`taken in conjunction with the drawings. Like reference
`numerals refer to corresponding parts throughout the several
`views of the drawings.
`FIG. 1 illustrates a search engine environment in accor(cid:173)
`dance with an embodiment of the present invention.
`FIG. 2 illustrates a graph of the cumulative proportion of
`converged pages versus convergence time in accordance
`with an embodiment of the invention.
`FIG. 3 illustrates solving the ranking function in accor(cid:173)
`dance with an embodiment of the invention
`FIG. 4 illustrates a ranking function in accordance with an
`embodiment of the invention.
`FIG. 5 illustrates a ranking function in accordance with an
`embodiment of the invention.
`FIG. 6 illustrates a computer system in accordance with
`50 an embodiment of the invention.
`
`A search engine is a software program designed to help a
`user access files stored on a computer, for example on the
`World Wide Web (WWW), by allowing the user to ask for 30
`documents meeting certain criteria (e.g., those containing a
`given word, a set of words, or a phrase) and retrieving files
`that match those criteria. Web search engines work by
`storing information about a large number of web pages
`(hereinafter also referred to as "pages" or "documents"), 35
`which they retrieve from the WWW. These documents are
`retrieved by a web crawler or spider, which is an automated
`web browser which follows the links it encounters in a
`crawled document. The contents of each successfully
`crawled document are indexed, thereby adding data con- 40
`cerning the words or terms in the document to an index
`database for use in responding to queries. Some search
`engines, also store all or part of the document itself, in
`addition to the index entries. When a user makes a search
`query having one or more terms, the search engine searches 45
`the index for documents that satisfY the query, and provides
`a listing of matching documents, typically including for each
`listed document the URL, the title of the document, and in
`some search engines a portion of document's text deemed
`relevant to the query.
`It can be useful for various purposes to rank or assign
`importance values to nodes in a large linked database. For
`example, the relevance of database search results can be
`improved by sorting the retrieved nodes according to their
`ranks, and presenting the most important, highly ranked 55
`nodes first. Alternately, the search results can be sorted based
`on a query score for each document in the search results,
`where the query score is a function of the document ranks as
`well as other factors.
`One approach to ranking documents involves examining 60
`the intrinsic content of each document or the back-link
`anchor text in parents of each document. This approach can
`be computationally intensive and often fails to assign high(cid:173)
`est ranks to the most important documents. Another
`approach to ranking involves examining the extrinsic rela(cid:173)
`tionships between documents, i.e., from the link structure of
`the directed graph. This type of approach is called a link-
`
`DESCRIPTION OF EMBODIMENTS
`
`The techniques of the present invention may used in a
`search engine environment where the linked database is
`generated from crawling a number of documents, such as the
`Internet. FIG. 1 is a block diagram of one such typical search
`engine environment. As illustrated in FIG. 1, a search engine
`100 has a back end system 102 and a front end system 104.
`The layout of the search engine system 100 is merely
`exemplary and can take on any other suitable layout or
`configuration.
`The back end system 102 may include one or more
`crawlers 105 (also known as spiders), one or more document
`65 indexers 106 and a document index 108. To index the large
`number of Web pages that exist on the worldwide web, the
`web crawler 104 locates and downloads web pages and other
`
`
`
`US 7,028,029 B2
`
`3
`information (hereinafter also referred to as "documents"). In
`some embodiments, a set of content filters 110 identify and
`filter out duplicate documents, and determine which docu(cid:173)
`ments should be sent to the document indexers 106 for
`indexing. The document indexers 106 process the down(cid:173)
`loaded documents, creating a document index 108 of terms
`found in those documents. If a document changes, then the
`document index 108 is updated with new information. Until
`a document is indexed, it is generally not available to users
`of the search engine 100.
`The front end system 104 may include a web server 112,
`one or more controllers 114, a cache 118, a second level
`controller 120 and one or more document index servers
`122a, ... , 122n. The document index 108 is created by the
`search engine 100 and is used to identifY documents that
`contain one or more terms in a search query. To search for
`documents on a particular subject, a user enters or otherwise
`specifies a search query, which includes one or more terms
`and operators (e.g., Boolean operators, positional operators,
`parentheses, etc.), and submits the search query to the search 20
`engine 100 using the web server 112.
`The controller 114 is coupled to the web server 112 and
`the cache 118. The cache 118 is used to speed up searches
`by temporarily storing previously located search results. In
`some embodiments, the cache 118 is distributed over mul(cid:173)
`tiple cache servers. Furthermore, in some embodiments, the
`data (search results) in the cache 118 is replicated in a
`parallel set of cache servers.
`While the following discussion describes certain func(cid:173)
`tions as being performed by one or more second level
`controllers 120, it should be understood that the number of
`controllers (114, 120) and the distribution of functions
`among those controllers may vary from one implementation
`to another. The second level controller 120 communicates
`with one or more document index servers 122a, ... , 122n.
`The document index servers 122a, ... , 122n (or alternately,
`one of the controllers 114, 120) encode the search query into
`an expression that is used to search the document index 108
`to identify documents that contain the terms specified by the
`search query. In some embodiments, the document index
`servers 122 search respective partitions of the document
`index 108 generated by the back end system 102 and return
`their results to the second level controller 120. The second
`level controller 120 combines the search results received
`from the document index servers 122a, ... , 122n, removes
`duplicate results (if any), and forwards those results to the
`controller 114. In some embodiments, there are multiple
`second level controllers 120 that operate in parallel to search
`different partitions of the document index 108, each second
`level controller 120 having a respective set of document
`index servers 122 to search respective sub-partitions of
`document index 108. In such embodiments, the controller
`114 distributes the search query to the multiple second level
`controllers 120 and combines search results received from
`the second level controllers 120. The controller 114 also
`stores the search query and search results in the cache 118,
`and passes the search results to the web server 112. A list of
`documents that satisfY the search query is presented to the
`user via the web server 112.
`In some embodiments, the content filters 110, or an
`associated set of servers or processes, identifY all the links
`in every web page produced by the crawlers 105 and store
`information about those links in a set of link records 124.
`The link records 124 indicate both the source URL and the
`target URL of each link, and may optionally contain other 65
`information as well, such as the "anchor text" associated
`with the link. A URL Resolver 126 reads the link records 124
`
`4
`and generates a database 128 of links, also called link maps,
`which include pairs of URLs or other web page document
`identifiers. In some embodiments, the links database 128 is
`used by a set of one or more Page Rankers 130 to compute
`Page Ranks 132 for all the documents downloaded by the
`crawlers. These Page Ranks 132 are then used by the
`controller 114 to rank the documents returned in response to
`a query of the document index 108 by document index
`servers 122. Alternately, the document index servers 122
`10 may utilize the Page Ranks 132 when computing query
`scores for documents listed in the search results. In certain
`embodiments of the present invention, the back end system
`102 further comprises quantizers 134 that are used to
`quantize data in Page Ranks 132. Brin and Page, "The
`15 Anatomy of a Large-Scale Hypertextual Search Engine," 7th
`International World Wide Web Conference, Brisbane, Aus(cid:173)
`tralia, provides more details on how one type of Page Rank
`metric can be computed. Other types of link-based on
`non-link based ranking techniques could also be utilized.
`A link-based ranking system, such as PageRank, makes
`the assumption that a link from a page u to a page v can be
`viewed as evidence that page v is an "important" page. In
`particular, the amount of importance conferred on page v by
`page u is proportional to the importance of page u and
`25 inversely proportional to the number of pages to which page
`u points. Since the importance of page u is itself not known,
`determining the importance for every page i requires an
`iterative fixed-point computation.
`In one embodiment, the importance of a page i is defined
`30 as the probability that at some particular time step, a random
`web surfer is at page i. Provided that the surfer chooses one
`of the links on page i, that link is chosen with a probability
`of 1 divided by the number of outlinks from page i, when the
`probability of choosing any of the outlinks is uniform across
`35 the outlinks. A transition probability matrix, P, may be
`created where P(i,j) is provided as 1/deg(i), where deg(i)
`represents the number of outlinks from page i. In other
`embodiments, P(i,j) could take into consideration certain
`personalization information for an individual or for a group,
`40 or could take into account other information derived from
`page i itself and/or elsewhere, and need not be uniform over
`each outlink from a given page.
`Some web pages have no outlinks, but for P to be a more
`useful transition probability matrix, every node must have at
`45 least 1 outgoing transition, i.e., P should have no rows
`consisting of all zeros. A matrix P can be converted into a
`more useful transition matrix by adding a complete set of
`outgoing transitions to pages with outdegree(O), i.e., no
`outlinks, to account for the probability that the surfer visiting
`50 that page randomly jumps to another page. In one embodi(cid:173)
`ment, the row for a page having no outlinks is modified to
`account for a probability that the surfer will jump to a
`different page uniformly across all pages, i.e., each element
`in the row becomes 1/n, where n is the number of nodes, or
`55 pages. In another embodiment, the modification could be
`non-uniform across all nodes and take into account person(cid:173)
`alization information. This personalization information
`might cause certain pages to have a higher probability
`compared to others based on a surfer's preferences, surfing
`60 habits, or other information. For example, if a surfer fre(cid:173)
`quently visits http://www.google.com, the transition prob(cid:173)
`ability from page i to the Google homepage would be higher
`than a page that the user infrequently visits. Another modi-
`fication to P may take into account the probability that any
`random surfer will jump to a random Web page (rather than
`following an outlink). The destination of the random jump is
`chosen according to certain probability distributions. In
`
`
`
`US 7,028,029 B2
`
`5
`some embodiments, this is uniform across all pages and in
`some embodiments this distribution is non-uniform and
`based on certain personalization information. Taking the
`transpose of the twice modified matrix P provides a matrix
`A. In the matrix P, a row i provided the transition probability
`distribution for a surfer at node i, whereas in the matrix A
`this is provided by column i. Mathematically this can be
`represented as:
`
`A~(c(P+D)+(1-c)E)T,
`
`where Pis a probability transition where P(i,j) represents the
`probability that the surfer will choose one of the links on i
`to page j; D represents the probability that a surfer visiting
`a page with no outlinks will jump to any other page; E
`represents the probability that a surfer will not choose any of 15
`the links and will jump to another page; and (1-c) represents
`a de-coupling factor indicating how likely it is that a surfer
`will jump to a random Web page, while c represents a
`coupling factor indicating how likely it is that a surfer will
`select one of the links in a currently selected or viewed page. 20
`Assuming that the probability distribution over all the
`nodes of the surfer's location at time 0 is given by xC0l, then
`the probability distribution for the surfer's location at time
`k is given by xCk)=A (k)x(o). The unique stationary distribution
`of the Markov chain is defined as limk~=xCkl, which is
`equivalent to limk~=ACklxC0l, and is independent of the
`initial distribution xC0 l. This is simply the principal eigen(cid:173)
`vector of the matrix A and the values can be used as ranking
`values. One way to calculate the principal eigenvector
`begins with a uniform distribution xC0 l=v and computes
`successive iterations of the ranking function, xCk)=A xCk- 1
`),
`until convergence. Convergence can be defined when two
`successive iterations of the ranking function produce a
`difference within a tolerance value. Various method can be
`used to determine tolerance values based on desired con- 35
`vergence characteristics or how much variation exists as the
`tolerance decreases.
`FIG. 2 illustrates an exemplary cumulate plot of conver(cid:173)
`gence times using the above described iterative process. The
`x-axis represents convergence by iteration number and the
`y-axis represents the cumulative proportion of document
`rank values that have converged. At a point 210, it can be
`seen, for this exemplary data set, that a large number of
`ranks have converged by point 210 within 20 iterations, but 45
`the final ranks take a significantly longer time to converge.
`Embodiments of the invention take advantage of this
`skewed distribution of convergence times to reduce the
`computational cost required for the determination of the full
`set of document rank values. Computational cost can be
`reduced by reducing the number of operations that must be
`performed and/or simplifying the types that must be pre(cid:173)
`formed. Additionally, reducing the need to move items in
`and out of main memory can have an effect on computa(cid:173)
`tional cost. By not recalculating the ranks of those ranks
`which have converged during a particular cycle of iterations,
`embodiments of the invention reduce the computation cost
`of determining document rank values.
`Referring to FIG. 3, a directed graph oflinked documents
`is initially created (302) where each document is represented 60
`by a node in the graph, and all nodes are associated with the
`set of nodes whose document rank values have not con(cid:173)
`verged (304). If the set of nodes which have not converged
`is empty (306-yes), then all the nodes have converged and
`the process ends (308). If the set of nodes which have not 65
`converged is not empty (330-no ), then an iteration of the
`function is calculated for those nodes which have not
`
`25
`
`30
`
`6
`converged (310). A predetermined number of iterations are
`completed per given cycle before examining which nodes'
`document rank values have converged. Accordingly, if a
`predetermined number of iterations for the current cycle has
`not been completed (312-no), then an additional iteration is
`calculated (310). On the other hand, if the predetermined
`number of iterations for the cycle been completed (312-yes ),
`then those nodes whose ranks have converged are identified
`(314). The number of iterations per cycle can be chosen in
`10 different ways and in some embodiments may depend on the
`balancing the computation cost of identifYing the nodes
`which have converged and modifying the ranking function
`versus computing the iterations. For example, the number of
`iterations could be chosen from a number between 5 and 15.
`In other embodiments, the number of iterations prior to
`identifying converged ranks could vary depending on a
`given cycle, with successive cycles having different number
`of iterations. For example, when the number of iterations for
`a cycle has been met (312-yes), the number of iterations for
`the next loop could be modified, such that the next iterative
`cycle would end after a different set of iterations, and so on.
`In other embodiments, instead of basing the end of a cycle
`on whether a number of iterations have been completed, the
`cycle is based on a proportion of nodes whose rank has
`converged. For example, the first cycle of iterations could
`complete after 25% of the nodes have converged. The
`proportion for the next cycle could be set to be an additional
`25% or some other percentage. One of ordinary skill in the
`art will readily recognize other ways this concept can be
`expanded using various criteria to end the iterative cycle.
`After the iteration cycle is complete (312-yes), those
`nodes whose document ranking value has converged to
`within a predefined iteration tolerance are identified (314).
`In some embodiments, the same tolerance value is used for
`each cycle of iteration and in other embodiments, the
`tolerance value could vary depending on the iterative cycle.
`Tolerances values could be selected from 0.00001 to 0.01, or
`other values. Those nodes which have converged are disas(cid:173)
`sociated with the set of non-converged nodes (316). The
`40 process continues until all document rank values have
`converged or some other type of ending mechanism is
`triggered. Other triggering mechanisms might include, for
`example, identifying convergence for a specific subset of
`nodes.
`In other embodiments, a first phase of rank computation
`may be computed using an initial tolerance level for con(cid:173)
`vergence as described above and using the phase tolerance
`level for each cycle of iteration in the phase. However,
`another phase of rank computation could follow using a
`50 second tolerance level for the cycles in the phase and using
`the ranks previously computed in the first phase as respec(cid:173)
`tive, initial document rank values in the next phase of rank
`computation. In some embodiments, the second tolerance
`level is smaller by an order of magnitude than the previous
`55 phase. In some embodiments, more than two phases are used
`with successively narrower tolerances for convergence.
`When the nodes whose document rank values are asso-
`ciated with the converged set, their document rank values
`are no longer calculated. In some embodiments, computing
`only document rank values which have not converged takes
`advantage of the matrix structure of the ranking function. As
`mentioned above, in some embodiments, the ranking func(cid:173)
`tion can be described as xCkl=A xCk- 1
`). At some time k, some
`of the document rank values will have converged. FIG. 4
`illustrates a ranking function accordance with some embodi(cid:173)
`ments where some of the rank values have converged.
`Colunm 402 of FIG. 4 illustrates the document rank value at
`
`
`
`US 7,028,029 B2
`
`7
`the k+ 1st iteration of the ranking function for node, or
`document, i, x?+ 1l. The document ranking values for the
`k+l" iteration are given by the matrix multiplication of A
`(shown at 404) by the k'h iteration of the document rank
`values x, (k) (shown at 406). The ranks which have converged
`by iteration k can be represented by xn-m+l (k) to xn (k) (shown
`at 408), where n represents the total number of nodes, or
`documents, and m represents the number of document rank
`values which have converged. Accordingly, the values for
`xn-m>l(k+l) to xn (k+l) (shown at 410) at the k+l st iteration will
`be the same as xn-m+l (k) to xn (k) (shown at 408) and those
`document rank values need not be calculated again. In some
`embodiments, only the calculations for those nodes which
`have not converged (shown at 412) are calculated. The
`ranking function is modified to remove those rows from the
`calculation. In some embodiments, the rows of colunm 402
`and matrix 404 corresponding to the converged nodes 410
`are not read into memory. In some embodiments, the matrix
`multiplication needed for rows corresponding to the con(cid:173)
`verged ranks are simply ignored and not calculated. In other
`embodiments the rows of 402 and 404 corresponding to the
`converged ranks are replaced by all zeros (which signifi(cid:173)
`cantly reduces computation time). In these embodiments, the
`colunm 406 is not affected since the converged values
`therein are used in the ranking function iteration. In some
`embodiments, the rows are initially ordered by decreasing
`order of convergence based on a previous solving of the
`ranking function. This has the effect of keeping longer
`converging nodes in main memory and reducing the amount
`of memory accesses to read portions of the modified ranking
`function into memory during the course of the computation.
`As mentioned earlier, reducing the amount of memory
`accesses can significantly reduce computation cost.
`During each cycle of iteration, the contributions to the
`rank of a non-converged node from the converged nodes is 35
`a constant. Accordingly, in some embodiments these con(cid:173)
`tributions are only calculated once per cycle of iteration.
`These embodiments can be understood with reference to
`FIG. 5. After a period of iterations, the nodes 502 have
`converged as described above. Accordingly, the values at 40
`504 will remain constant throughout each iteration cycle
`until another examination of convergence is made (314 and
`316 of FIG. 3). The matrix 506 now may be thought of as
`consisting of 4 partitions 508, 510, 512, 514. The partition
`508 illustrates the contributions that the non-converged 45
`nodes make to other non-converged nodes 516 (also called
`sub-matrix 516). The partition 510 illustrates the contribu(cid:173)
`tions that converged nodes make to converged nodes. The
`partition 512 illustrates the contributions that the non(cid:173)
`converged nodes make to the converged nodes. Finally, the 50
`partition 514 illustrates the contributions that the converged
`nodes make to the non-converged nodes 516. When matrix
`518 (the previous document ranks values) is multiplied
`against a row i in matrix 506, the multiplication products
`corresponding to values in partition 514 are constants. 55
`Therefore, to modifY the ranking function even further, some
`embodiments only calculate the products produced by mul(cid:173)
`tiplying partition 514 (representing contributions of the
`converged nodes to the non-converged nodes) once per