`
Taher H. Haveliwala*
`Stanford University
`taherh@db.stanford.edu
`
`October 18, 1999
`
`Abstract
`
This paper discusses efficient techniques for computing PageRank, a ranking metric for hypertext documents. We show that PageRank can be computed for very large subgraphs of the web (up to hundreds of millions of nodes) on machines with limited main memory. Running-time measurements on various memory configurations are presented for PageRank computation over the 24-million-page Stanford WebBase archive. We discuss several methods for analyzing the convergence of PageRank based on the induced ordering of the pages. We present convergence results helpful for determining the number of iterations necessary to achieve a useful PageRank assignment, both in the absence and presence of search queries.
`
1 Introduction
The dramatic growth of the world-wide web, now exceeding 800 million pages [11], is forcing modern web search engines to look beyond simply the content of pages in providing relevant answers to queries. Recent work in utilizing the link structure of the web for improving the quality of search results is promising. In particular, the Google search engine uses PageRank, an iterative algorithm that determines the "importance" of a web page based on the "importance" of its parent pages [5]. A related algorithm used by the IBM HITS system maintains a hub and an authority score for each page, in which the authority score for a page is determined by the hub scores of its parents, and the hub score of a page is determined by the authority scores of its children [10].
`
Given that centralized search engines are having trouble simply indexing the entire web [7, 11], presenting users with search-query results relevant to them is becoming increasingly difficult. The growth of the web will force a greater reliance on client-side filtering and relevance analysis, making personalized searching techniques essential. Part of our research in personalizing link-based ranking algorithms involves applying biases during the iterations to increase the importance of certain categories of pages [4]. For instance, we can influence PageRank to weight pages related to computers more heavily than pages related to cars. In this scenario, each user would
`
*Supported by NSF IIS-98-11904 and an NSF Graduate Research Fellowship
`
`1
`
`IPR2018-01594
`
`EXHIBIT
`2051
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 1
`
`
`
be expected to run PageRank biased with their personalization profile on their own personal machines, with the necessary link structure provided to them on inexpensive media such as DVD-ROM. Reducing the running times and memory requirements of personalized ranking algorithms is essential.
`
After reviewing the PageRank algorithm [4, 5] in Section 2, we discuss the various aspects of its running time, and show that it can be computed efficiently:
`
• We describe an implementation, using the principle underlying the block nested-loops-join strategy, that efficiently controls the cost per iteration even in low-memory environments (Section 3)
• We empirically show that single-precision rank values are sufficient for the computation (Section 4)
• We investigate techniques for determining the number of iterations required to yield a useful PageRank assignment. We present convergence results using several metrics that are useful when the induced ordering on pages, rather than the PageRank value itself, is considered. We investigate the convergence of the globally induced ordering, as well as the ordering induced over specific search results (Section 5)
`
`By showing that a useful ranking can be computed for large web subgraphs in as
`little as an hour, with only modest main-memory requirements, the potential for
`personalization is made clear.
`
2 Review of PageRank

First let us review the motivation behind PageRank [5]. The essential idea is that if page u has a link to page v, then the author of u is implicitly conferring some importance to page v. How much importance does u confer? Let N_u be the outdegree of page u, and let Rank(p) represent the importance (i.e., PageRank) of page p. Then the link (u, v) confers Rank(u)/N_u units of rank to v. This simple idea leads to the following fixpoint computation that yields the rank vector Rank* over all of the pages on the web. If N is the number of pages, assign all pages the initial value 1/N. Let B_v represent the set of pages pointing to v. In each iteration, propagate the ranks as follows:¹
`
∀v: Rank_{i+1}(v) = Σ_{u ∈ B_v} Rank_i(u) / N_u
`
We continue the iterations until Rank stabilizes to within some threshold. The final vector Rank* contains the PageRank vector over the web. This vector is computed only once after each crawl of the web; the values can then be used to influence the ranking of search results [2].
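The fixpoint iteration above can be sketched in a few lines of code. The graph, function name, and convergence threshold below are illustrative assumptions, not details from the paper; the iteration itself follows the propagation rule just given.

```python
# Sketch of the basic fixpoint iteration: propagate Rank(u)/N_u along each
# link (u, v) until the rank vector stabilizes. Assumes, as the paper does,
# that every node has outdegree at least 1.

def pagerank_simple(edges, n, tol=1e-8, max_iters=100):
    # Outdegree N_u for every page u.
    outdeg = [0] * n
    for u, v in edges:
        outdeg[u] += 1
    rank = [1.0 / n] * n              # initial value 1/N for all pages
    for _ in range(max_iters):
        nxt = [0.0] * n
        for u, v in edges:            # each link (u, v) confers Rank(u)/N_u
            nxt[v] += rank[u] / outdeg[u]
        residual = sum(abs(a - b) for a, b in zip(nxt, rank))
        rank = nxt
        if residual < tol:            # stop once Rank stabilizes
            break
    return rank

# A three-page cycle: by symmetry every page converges to rank 1/3.
ranks = pagerank_simple([(0, 1), (1, 2), (2, 0)], 3)
```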
`
The process can also be expressed as the following eigenvector calculation, providing useful insight into PageRank. Let M be the square, stochastic matrix corresponding to the directed graph G of the web, assuming all nodes in G have at least
`
¹Note that for u ∈ B_v, the edge (u, v) guarantees N_u ≥ 1.
`
`2
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 2
`
`
`
one outgoing edge. If there is a link from page j to page i, then let the matrix entry m_{ij} have the value 1/N_j. Let all other entries have the value 0. One iteration of the previous fixpoint computation corresponds to the matrix-vector multiplication M × Rank. Repeatedly multiplying Rank by M yields the dominant eigenvector Rank* of the matrix M. Because M corresponds to the stochastic transition matrix over the graph G, PageRank can be viewed as the stationary probability distribution over pages induced by a random walk on the web.
`
We can measure the convergence of the iterations using the Residual vector. Because M is stochastic (i.e., the entries in each column sum to 1), the dominant eigenvalue of M is 1. Thus the PageRank vector Rank*, the dominant eigenvector of M, has eigenvalue 1, leading to the equality M × Rank* = Rank*. We can use the deviation from this equality for some other vector as an error estimate. For some intermediate vector Rank_i, let Residual_i = M × Rank_i − Rank_i. Equivalently, because multiplication by M is an iteration step, we have Residual_i = Rank_{i+1} − Rank_i. We can treat ||Residual_i|| as an indicator of how well Rank_i approximates Rank*. We expect ||Residual_i|| to tend to zero after an adequate number of iterations.
`
We now address several issues regarding the computation. We previously made the assumption that each node in G has at least one outgoing edge. To enforce this assumption, we can iteratively remove nodes in G that have outdegree 0. Alternatively, we can conceptually add a complete set of outgoing edges to any node with outdegree 0. Another caveat is that the convergence of PageRank is guaranteed only if M is irreducible (i.e., G is strongly connected) and aperiodic [12]. The latter is guaranteed in practice for the web, while the former is true if we add a dampening factor to the rank propagation. We can define a new matrix M' in which we add transition edges of probability (1−c)/N between every pair of nodes in G:

M' = cM + (1 − c) × [1/N]_{N×N}
`
This modification improves the quality of PageRank by introducing a decay factor which limits the effect of rank sinks [4], in addition to guaranteeing convergence to a unique rank vector. For the above modification to M, an iteration of PageRank can be expressed as follows:²

M' × Rank = cM × Rank + (1 − c) × [1/N]_{N×1}
`
We can bias PageRank to weight certain categories of pages by replacing the uniform vector [1/N]_{N×1} with the nonuniform N×1 personalization vector p as discussed in [4]. In terms of the random-walk model of PageRank, the personalization vector represents the addition of a complete set of transition edges where the probability of edge (u, v) is given by (1 − c) × p_v. Although the matrix M' that results from the modifications discussed above is not sparse, we never need to store it explicitly. We need only the ability to evaluate M' × Rank efficiently.
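One possible way to evaluate M' × Rank without ever materializing M' is to perform the sparse multiply for cM × Rank and then add the dense (1 − c) × p term elementwise. The sketch below is not the paper's implementation; the tiny graph, c = 0.85, and the uniform choice of p are assumptions for illustration.

```python
# Sketch: one damped/personalized PageRank step, M' x Rank evaluated as
# c * (M x Rank) + (1 - c) * p, keeping only the sparse edge list in memory.

def damped_step(rank, edges, outdeg, p, c=0.85):
    n = len(rank)
    nxt = [0.0] * n
    for u, v in edges:                # sparse part: c * M x Rank
        nxt[v] += c * rank[u] / outdeg[u]
    # Dense damping term: since ||Rank||_1 = 1, adding (1 - c) * p_v is
    # exactly the contribution of the complete set of (1 - c) edges.
    return [x + (1.0 - c) * pv for x, pv in zip(nxt, p)]

n = 3
edges = [(0, 1), (1, 2), (2, 0)]
outdeg = [1, 1, 1]
p = [1.0 / n] * n                     # uniform p recovers standard PageRank
rank = [1.0 / n] * n
for _ in range(50):
    rank = damped_step(rank, edges, outdeg, p)
```

A nonuniform p would bias the stationary distribution toward the favored pages, as described above.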
`
²This equality makes use of the fact that ||Rank||_1 = 1.
`
`3
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 3
`
`
`
3 Efficient Memory Usage

If we momentarily ignore the scale of the web, the implementation of PageRank is very simple. The sheer scale of the web, however, requires much greater care in the use of data structures. We will begin with a detailed discussion of a naive implementation, during which we will clarify the size of the datasets. The naive algorithm is useful for gaining a clear understanding of matrix-vector multiplication in the specific context of PageRank. We will then present an efficient version of the algorithm, reminiscent of the block nested-loops-join algorithm, that substantially lowers the main-memory requirements. Similar strategies for fast in-memory matrix-vector multiplies are commonly used in the scientific computing community for improving caching behavior [9, 14]. As we are dealing with massive web repositories in which the data structures are measured in gigabytes and are stored on disk, we will take a more I/O-centric approach in presenting the algorithm and its cost analysis. Empirical timing results for both the naive and block-based approaches are presented.
`
3.1 The Naive Technique

The Stanford WebBase, our local repository of the web, currently contains roughly 25 million pages. There are over 81 million urls in the link graph, including urls that were not themselves crawled, but exist in the bodies of crawled pages. For our experiments, we first used a preprocessing step that removed dangling pages, meaning pages with no children. Starting with the 81-million-node graph, all nodes with outdegree 0 were removed. The step was repeated once more on the resulting graph, yielding a subgraph with close to 19 million nodes. This process was needed since the original graph is a truncated snapshot of the web with many dangling nodes. Nodes with no outgoing links pointing back to the crawled subgraph can adversely affect the PageRank assignment, as mentioned in Section 2. The node ids were assigned consecutively from 0 to 18,922,290. The link structure for the final graph, referred to as Links, is stored on disk in a binary format, illustrated textually in Figure 1. The source-id and each of the destination-ids are stored as 32-bit integers.
`
`Figure 1: The Link Structure
`
`The outdegree is stored as a 16-bit integer. The size of the link structure, after the
`preprocessing steps mentioned above, is 1.01 GB, and is assumed to exceed the size
`of main memory.
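The record layout just described (a 32-bit source-id, a 16-bit outdegree, then the 32-bit destination-ids) can be read with a short routine. The byte order and the helper names below are assumptions, since the text does not specify them; an in-memory buffer stands in for the on-disk file.

```python
# Sketch of reading/writing the Links record format described above:
# <source-id: uint32> <outdegree n: uint16> <dest-id: uint32> x n.
# Little-endian is an assumption; the paper does not state a byte order.
import io
import struct

def write_record(buf, source, dests):
    buf.write(struct.pack("<IH", source, len(dests)))
    buf.write(struct.pack("<%dI" % len(dests), *dests))

def read_records(buf):
    while True:
        header = buf.read(6)          # 4-byte source-id + 2-byte outdegree
        if len(header) < 6:
            return
        source, n = struct.unpack("<IH", header)
        dests = struct.unpack("<%dI" % n, buf.read(4 * n))
        yield source, list(dests)

buf = io.BytesIO()
write_record(buf, 0, [1, 2])
write_record(buf, 1, [2])
buf.seek(0)
records = list(read_records(buf))
```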
`
The setup for the naive PageRank implementation is as follows. We create two arrays of floating-point values representing the rank vectors, called Source and Dest, as conceptually shown in the matrix-vector multiplication given in Figure 2. Each
`
`4
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 4
`
`
`
`Figure 2: The Rank Vectors
`
vector has N entries, where N is the number of nodes in our web graph. In our experiment, N was 18,922,290. The rank values for iteration i are held in Source, and the rank values for iteration i + 1 are constructed in Dest. Using single-precision values, these arrays for our particular graph have a combined size of over 150 MB. The iteration steps can be expressed as follows (where c is the dampening factor):
`
∀s: Source[s] = 1/N
while (residual > τ) {
    ∀d: Dest[d] = 0
    while (not Links.eof()) {
        Links.read(source, n, dest_1, dest_2, ..., dest_n)
        for j = 1 ... n
            Dest[dest_j] = Dest[dest_j] + Source[source]/n
    }
    ∀d: Dest[d] = c × Dest[d] + (1−c)/N    /* dampening or personalization */
    residual = ||Source − Dest||    /* recompute only every few iterations */
    Source = Dest
}
We make successive passes over Links, using the current rank values, held in Source, to compute the next rank values, held in Dest. We can stop when the norm of the difference between Source and Dest reaches some threshold, or alternatively after a prechosen number of iterations. Assuming main memory is large enough to hold Source and Dest, the I/O cost for each iteration of the above implementation is given by:
`
C = |Links|
`
If main memory is large enough to hold only the Dest array, and we assume that the link structure is sorted on the source field, the I/O cost is given by:
`
C = |Source| + |Dest| + |Links|
`
`Source needs to be sequentially read from disk during the rank propagation step, and
`Dest needs to be written to disk to serve as the Source vector for the subsequent
`iteration.
`
`Although many machines may have enough main memory to hold these arrays,
`a larger crawl with 50 to 100 million pages clearly will result in rank vectors that
`
`5
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 5
`
`
`
exceed the main memory of most computers. Considering that the publicly indexable web now contains roughly 800 million pages [11], the naive approach is infeasible for large subgraphs of the web. As mentioned above, if the link structure is sorted on the source field, the accesses on Source will be sequential, and will not pose a problem. However, the random access pattern on the Dest array leads the working set of this implementation to equal the size of the Dest array. If main memory cannot accommodate the Dest array, the running time will increase dramatically and the above cost analysis becomes invalid.
`
3.2 The Block-Based Strategy

There is a similarity between an iteration of PageRank and the relational join operator. Let the children of node source in the structure Links be represented by CHILDREN(source, Links). If we consider the two rank arrays Source and Dest as relations, an iteration of PageRank is performing a join in which each entry source of Source joins with each entry dest of Dest if dest ∈ CHILDREN(source, Links). Instead of adjoining fields of two tuples, however, we are adding in the (scaled) value of the Source entry to the Dest entry.
`
Although the analogy is not exact, the core technique used by the block-oriented join strategy can be used to control the working set of the PageRank algorithm as well. We will partition the Dest array, the cause of the large working set, into β blocks, each of size D pages, as illustrated in Figure 3. If P represents the size of main memory
`
`Figure 3: Blocked Multiplication
`
in physical memory pages, then we require D ≤ P − 2, since we must leave input buffers for reading in Source and Links. The links file Links must be rearranged to reflect this setup. We partition Links into β links files Links_0, ..., Links_{β−1}, such that the destinations field in Links_i contains only those nodes dest such that i × ⌈N/β⌉ ≤ dest < (i + 1) × ⌈N/β⌉. In other words, the outgoing links of a node are bucketed according to the range that the identifier of the destination page falls into. The partitioning of the links of Figure 1 for the case of three blocks is shown in Figure 4.
`
This partitioning scheme obeys the equality:

CHILDREN(source, Links) = ∪_i CHILDREN(source, Links_i)
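The partitioning step can be sketched as follows, under the assumption that destinations are bucketed into β contiguous ranges of size ⌈N/β⌉. The in-memory lists stand in for the on-disk Links_i files; note how the source id and true outdegree are stored redundantly in every block that receives one of the node's links.

```python
# Sketch: split an edge-list representation of Links into beta per-block
# files, bucketing each destination id by its range.
import math

def partition_links(records, n_nodes, beta):
    """records: iterable of (source, outdegree, dests).
    Returns beta lists of (source, outdegree, dests_in_block) records; the
    true outdegree travels with every block so each block can still compute
    Source[source] / outdegree."""
    block_size = math.ceil(n_nodes / beta)
    blocks = [[] for _ in range(beta)]
    for source, outdeg, dests in records:
        for i in range(beta):
            lo, hi = i * block_size, (i + 1) * block_size
            in_block = [d for d in dests if lo <= d < hi]
            if in_block:              # redundant (source, outdeg) per block
                blocks[i].append((source, outdeg, in_block))
    return blocks

records = [(0, 3, [1, 4, 5]), (1, 2, [0, 3])]
blocks = partition_links(records, n_nodes=6, beta=3)
```

The union of CHILDREN(source, Links_i) over all blocks recovers exactly CHILDREN(source, Links), matching the equality above.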
`
`6
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 6
`
`
`
`Figure 4: Partitioned Link File
`
Note that Σ_i |Links_i| > |Links| because of the extra overhead caused by the redundant storage of the source node and outdegree entries in each of the partitions. The block-oriented algorithm proceeds as follows:
∀s: Source[s] = 1/N
while (residual > τ) {
    for i = 0 ... β − 1 {
        ∀d: Dest_i[d] = 0
        while (not Links_i.eof()) {
            Links_i.read(source, n, k, dest_1, dest_2, ..., dest_k)
            for j = 1 ... k
                Dest_i[dest_j] = Dest_i[dest_j] + Source[source]/n
        }
        ∀d: Dest_i[d] = c × Dest_i[d] + (1−c)/N    /* dampening or personalization */
        Write Dest_i to disk
    }
    residual = ||Source − Dest||    /* recompute only every few iterations */
    Source = Dest
}
`
Because Links_i is sorted on the source field, each pass through Links_i requires only one sequential pass through Source. The working set of this algorithm is exactly P by design, so no swapping occurs. Define ε to be such that the equality Σ_i |Links_i| = |Links| × (1 + ε) is satisfied. The cost of this approach is then given by:

C = β × |Source| + |Dest| + |Links| × (1 + ε)
`
In practice, ε is reasonably small, as shown in Figure 7. The other cost introduced by the partitioning scheme is the need to make β passes over the Source array. However, since in practice we have |Links| > 5 × |Source|, this additional overhead is reasonable. Note that computing the residual during every iteration would require an additional
`
`7
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 7
`
`
`
pass over Source, which is not included in the above cost analysis. We can largely avoid this cost by computing the residual only at fixed intervals.
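One iteration of the block-oriented algorithm can be sketched as follows. The four-node graph, β = 2, and the in-memory lists standing in for the on-disk Source, Dest_i, and Links_i structures are all illustrative assumptions; the point is that only one Dest block is ever resident at a time.

```python
# Sketch of one block-based PageRank iteration: Dest is built one block at a
# time, so the working set is a single Dest block plus streaming buffers.
import math

n, c, beta = 4, 0.85, 2
block_size = math.ceil(n / beta)
# Links_i records: (source, true outdegree, destinations falling in block i)
links = [
    [(0, 2, [1]), (1, 1, [0]), (2, 2, [0])],    # block 0: dest in [0, 2)
    [(0, 2, [2]), (2, 2, [3]), (3, 1, [2])],    # block 1: dest in [2, 4)
]
source = [1.0 / n] * n
dest = [0.0] * n
for i in range(beta):
    block = [0.0] * block_size                  # Dest_i, fits in memory
    for src, outdeg, dests in links[i]:         # one pass over Links_i
        for d in dests:
            block[d - i * block_size] += source[src] / outdeg
    for j, val in enumerate(block):             # damp, then "write to disk"
        dest[i * block_size + j] = c * val + (1.0 - c) / n
```

Because every link is processed exactly once with the same arithmetic as the naive version, the resulting Dest vector is identical regardless of β.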
`
If we had stored the links in transpose format, in which each entry contained a node and a list of parent nodes, then the above algorithm would remain essentially the same, except that we would break the Source array into β blocks, and make multiple passes over the Dest array. We would successively load in blocks of Source, and fully distribute its rank according to Links^T_i to all destinations in Dest. However, note that each "pass" over the Dest array requires reading in the values from disk, adding in the current Source block's contributions, and then writing out the updated values to disk. Thus using Links^T rather than Links incurs an additional I/O cost of |Dest|, since Dest is both read and written on each pass.
`
In order to take advantage of sequential read-transfer benefits, we can load in more than one page of Source and Links_i at a time as we stream them in. This buffering reduces the effective time required per pass of Source and Links, at the expense of increasing β. The best strategy for allocating memory between the three data structures is dependent on the machine and disk architecture, although any reasonable allocation will allow the PageRank algorithm to be used efficiently in cases where the Rank vectors exceed main memory.
`
3.3 Timing Results

For our experiments, we used a 450 MHz Pentium III machine with a 7200-RPM Western Digital Caviar AC418000 hard disk. We measured the running times of PageRank over roughly 19 million pages using three different partitionings of the Dest array: 1 block (i.e., naive), 2 blocks, and 4 blocks. The expected memory usage is given in Figure 5. We tested the three partitioning strategies on three different memory configurations: 256 MB, 64 MB, and 32 MB.
`
The time required per iteration of PageRank is given for each of the three partitionings under each of the three memory configurations in Figure 6. As expected, the most efficient strategy is to partition the Dest array (and the corresponding Links structure) into enough blocks so that a single Dest block can fit in physical memory. Using too many blocks slightly degrades performance, as both the number of passes over Source and the size of Links increase. Figure 7 shows the total size of the link structure for the three partitionings, as well as the associated ε, as discussed in Section 3.2. Using too few blocks, however, degrades performance by several orders of magnitude. For the cases in Figure 6 where the block size exceeds physical memory, we had to estimate the full iteration time from a partially completed iteration, as the running times were unreasonably high.
`
The blocking strategy commonly used for other algorithms, including the relational join operator, is very effective in controlling the memory requirements of PageRank. The block-based PageRank is not an approximation of normal PageRank: the same matrix-vector multiplication M' × Source is performed whether or not Dest and Links are partitioned. The resultant PageRank vector is identical regardless of the number of blocks used. As the Stanford WebBase increases in size to several hundreds
`
`8
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 8
`
`
`
`Figure 5: Expected Memory Usage
`
`Figure 6: Log Plot of Running Times
`
`Figure 7: Link Structure Growth
`
`9
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 9
`
`
`
of millions of pages, the block-oriented technique will be essential in computing PageRank, even on machines with fairly large amounts of main memory. Furthermore, the block-oriented technique will be necessary for allowing individual users to compute their own personalized PageRank.
`
4 Accuracy

When computing PageRank, we can use either single-precision or double-precision values for the Source and Dest arrays. Using double precision for Source and Dest, however, would adversely affect the running time by doubling the sizes of the two vectors. Here we show that single-precision Source and Dest vectors are sufficient. Double-precision values should be used for any individual variables, such as the current residual or the current total rank. These individual variables of course have negligible impact on the memory footprint.
`
The use of single-precision Rank vectors does not lead to significant numerical error. We computed PageRank for 100 iterations using single-precision Source and Dest vectors. We converted the final computed PageRank vector to the equivalent double-precision vector, and performed a double-precision iteration step to get an accurate value for the residual: 2.575 × 10^-4. We then recomputed PageRank for 100 iterations using exclusively double-precision vectors, but found that the residual of the final vector had not noticeably improved: 2.571 × 10^-4.
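A small synthetic check of this claim (not a reproduction of the paper's 24-million-page experiment) can be run on a toy graph: perform the same damped iteration twice, storing ranks as 32-bit and as 64-bit floats, and measure each final residual with a double-precision step. The graph, damping factor, and iteration count are assumptions.

```python
# Sketch: compare the converged residual when rank vectors are stored in
# single precision (array typecode 'f') versus double precision ('d').
from array import array

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (2, 0)]
n, c = 4, 0.85
outdeg = [0] * n
for u, v in edges:
    outdeg[u] += 1

def run(typecode, iters=300):
    rank = array(typecode, [1.0 / n] * n)   # 'f' truncates to float32 on store
    for _ in range(iters):
        nxt = array(typecode, [(1.0 - c) / n] * n)
        for u, v in edges:
            nxt[v] += c * rank[u] / outdeg[u]
        rank = nxt
    # One extra step, accumulated in doubles, to measure the residual.
    nxt = [(1.0 - c) / n] * n
    for u, v in edges:
        nxt[v] += c * rank[u] / outdeg[u]
    return sum(abs(a - b) for a, b in zip(nxt, rank))

res32, res64 = run("f"), run("d")
```

On a toy graph the single-precision residual bottoms out near float32 rounding error, still far below any threshold that matters for ranking; on the 19-million-page graph the paper observes the same effect at the 10^-4 level.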
`
5 Convergence Analysis

Although PageRank is guaranteed to converge given the conditions mentioned in Section 2, there are several measures we can use to analyze the convergence rate. The residual, which is the norm of the difference of the PageRank vectors from two successive iterations as discussed in Section 2, is one possible measure of the convergence. A more useful approach to analyzing the convergence involves looking at the ordering of pages induced by the Rank vector. If the PageRank values will be used strictly for determining the relative importance of pages, the convergence should be measured based on how the ordering changes as the number of iterations increases. In this section, we will discuss various techniques for analyzing the convergence of induced orderings. Depending on the final application of the rankings, we can concentrate on the ordering induced over all pages or on the ordering induced over results to specific queries. We will first discuss global orderings, and then look at query-specific orderings.
`
5.1 Global Ordering

The global ordering induced by PageRank provides an insightful view into the convergence rate. We analyze the convergence of the induced ordering in two ways: a histogram of the difference in positions of pages within two orderings, and a similarity measure between two ordered listings.

When analyzing the change in page ordering while varying the number of iterations, we are more concerned with instability among the top-ranked pages; whether
`
`10
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 10
`
`
`
Figure 8: Histograms of Position Difference
`
`or not we deem that a particular page is the least important or the second least
`important is usually irrelevant.
`
Figure 8 shows histograms, with a bucket size of 100, of position differences for the orderings induced by various numbers of iterations, when compared with the ordering induced by 100 iterations. We considered a page if and only if at least one of the orderings placed it among the top 1 million, to account for our intuition that the ordering among highly ranked pages is more significant. For the ordering induced by 50 iterations, we see that the bulk of pages occur at similar positions as in the ordering induced by 100 iterations.
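The histogram computation can be sketched as follows. The bucket width, top-k cutoff, and toy orderings are illustrative stand-ins for the paper's bucket size of 100 and top-1-million cutoff.

```python
# Sketch: histogram of |position difference| between two orderings of the
# same pages, counting a page only if either ordering places it in the top k.
from collections import Counter

def position_histogram(order_a, order_b, top_k, bucket=100):
    pos_a = {page: i for i, page in enumerate(order_a)}
    pos_b = {page: i for i, page in enumerate(order_b)}
    hist = Counter()
    for page in pos_a:
        if pos_a[page] < top_k or pos_b[page] < top_k:
            hist[abs(pos_a[page] - pos_b[page]) // bucket] += 1
    return hist

# Two orderings of ten pages differing only in their top two entries.
hist = position_histogram(list(range(10)), [1, 0] + list(range(2, 10)),
                          top_k=5, bucket=1)
```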
`
We now turn to the similarity measure. In many scenarios, we are only concerned with identifying the top pages; we may not necessarily need an exact ordering among them. For instance, if we only have the resources to index (or otherwise process) a small subset of the web, we might choose to concentrate on the (unordered) set of the most highly ranked pages. We define the similarity of two sets of pages A and B to be |A ∩ B| / |A ∪ B|. To visualize how closely two ordered rankings agree on identifying the top pages, we successively compute the similarity among the top n pages according to each ordering, as n is increased in stepped increments. Figure 9 is a graph of the similarity of orderings as n is increased, in steps of 100, to 1 million pages. Here we see that the ordering induced by only 25 iterations agrees very closely with the ordering induced by 100 iterations on what the top pages are.
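The similarity curve follows directly from the definition; the step size and toy orderings below are assumptions for illustration.

```python
# Sketch: Jaccard similarity |A ∩ B| / |A ∪ B| between the top-n prefixes of
# two induced orderings, for n increasing in fixed steps.

def topn_similarity(order_a, order_b, step):
    sims = []
    for n in range(step, min(len(order_a), len(order_b)) + 1, step):
        a, b = set(order_a[:n]), set(order_b[:n])
        sims.append(len(a & b) / len(a | b))
    return sims

# The two orderings disagree on the top 2 but agree on the full top 4.
sims = topn_similarity([1, 2, 3, 4], [2, 3, 1, 4], step=2)
```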
`
5.2 Query-Specific Ordering

PageRank is currently used predominantly for the ranking of search results. Although analyzing the global ordering is useful in certain applications, we show here that much of the instability in the ordering across iterations tends to affect the relative rankings
`
`11
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 11
`
`
`
`Figure 9: Global Similarity
`
`only of unrelated pages. Just as we are often more concerned with instability among
`highly ranked pages, we are also often more concerned with instability among pages
`that are likely to co-occur in the results of a search query.
`
As before, PageRank is computed only once, yielding a global ordering over web pages. When investigating query-specific ordering, however, we take a subset of these pages, corresponding to the results of a conjunctive search query, and analyze the relative ordering of pages within this result set. Here we analyze the convergence for the induced orderings of the results of two typical search queries. For the games query, we retrieved the urls of all pages in the WebBase that contained the words 'action', 'adventure', 'game', 'role', and 'playing'. For the music query, we retrieved the urls of all pages that contained the words 'music' and 'literature'. Out of the roughly 20 million pages used for the query, 1,567 urls satisfied the games query, and 19,341 urls satisfied the music query. We can see in Figures 10 and 11 that the orderings induced by only 10 iterations agree fairly well with the orderings induced by 100 iterations. The PageRank process seems to converge very quickly when viewed in the context of query results.
`
The residual vector M' × Rank_i − Rank_i, while appealing in a mathematical sense, is only a myopic view of the convergence rate. Looking at the induced ordering provides a more practical view of the convergence behavior. The techniques we discussed are applicable in a broader context as well; we have used them to measure the difference in orderings induced by various personalizations of PageRank, in which the number of iterations remains constant.
`
6 Future Work

Personalization of web-based information retrieval will become an increasingly important area of research as the amount of available data on the web continues to grow.
`
`12
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 12
`
`
`
`Figure 10: Games Query Similarity
`
`Figure 11: Music Query Similarity
`
`13
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 13
`
`
`
We have shown that PageRank can be computed efficiently on modestly equipped machines, suggesting that individual users can compute their own personalized PageRank vectors. We will be investigating the most effective means of constructing the personalization vector, as discussed in Section 2, so that the resultant ranking vector best captures the user's notion of the importance of a page. We envision utilizing a user's browsing history and bookmark collection to build the personalization vector.
`
In Section 5.2 we empirically showed that an accurate PageRank vector can be computed in as few as 10 iterations, if accuracy is measured in the context of the induced ordering of results to conjunctive search queries. This convergence result suggests further experiments to determine to what extent the exact nature of the query affects how many iterations are needed before the induced ordering of the query result stabilizes. Using the methods discussed in this paper for analyzing the effect on induced ordering, we will further explore techniques for speeding up PageRank computation while minimizing the loss of accuracy.
`
7 Conclusion

Algorithms harnessing the link structure of the web are becoming increasingly useful as tools to present relevant results to search queries. Although the PageRank algorithm is based on a simple idea, scaling its implementation to operate on large subgraphs of the web requires careful arrangement of the data. We have demonstrated that PageRank can be run on modestly equipped machines. We have determined that single-precision Rank vectors are sufficient for accurate computation. We have presented several ways of analyzing the convergence of the algorithm, based on the ordering induced on web pages by the Rank vector. By looking at the ordering induced on specific query results, we have shown that even as few as 10 iterations can provide a useful ranking assignment. Given the timing results of Section 3.3, this convergence rate implies that we can compute a useful PageRank vector on a modestly equipped machine in roughly an hour, demonstrating the feasibility of client-side computation of personalized PageRank.
`
`14
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 14
`
`
`
Acknowledgements

I would like to thank Professor Jeff Ullman and Professor Rajeev Motwani, as well as the rest of the Stanford MIDAS group, for invaluable discussions regarding the ideas in this paper. I would also like to thank the Stanford WebBase team for access to the web crawl data.
`
References

[1] Search Engine Watch: Up-to-date information on leading search engines. Located at http://www.searchenginewatch.com/.

[2] The Google Search Engine: Commercial search engine founded by the originators of PageRank. Located at http://www.google.com/.

[3] K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the Seventh International World Wide Web Conference, 1998.

[4] S. Brin, R. Motwani, L. Page, and T. Winograd. What can you do with a web in your pocket. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 1998.

[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World Wide Web Conference, 1998.

[6] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International World Wide Web Conference, 1998.

[7] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth International World Wide Web Conference, 1999.

[8] J. Dean and M. Henzinger. Finding related pages in the world wide web. In Proceedings of the Eighth International World Wide Web Conference, 1999.

[9] E.-J. Im and K. Yelick. Optimizing sparse matrix vector multiplication on SMPs. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999.

[10] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.

[11] S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400:107-109, July 1999.

[12] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, United Kingdom, 1995.

[13] R. Ramakrishnan. Database Management Systems. McGraw-Hill, 1998.

[14] S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. In IBM Journal of Research and Development, volume 41. IBM, 1997.
`
`15
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2068, p. 1