Efficient Computation of PageRank

Taher H. Haveliwala*
Stanford University
taherh@db.stanford.edu

October 18, 1999
`
Abstract

This paper discusses efficient techniques for computing PageRank, a ranking metric for hypertext documents. We show that PageRank can be computed for very large subgraphs of the web (up to hundreds of millions of nodes) on machines with limited main memory. Running-time measurements on various memory configurations are presented for PageRank computation over the 24-million-page Stanford WebBase archive. We discuss several methods for analyzing the convergence of PageRank based on the induced ordering of the pages. We present convergence results helpful for determining the number of iterations necessary to achieve a useful PageRank assignment, both in the absence and presence of search queries.
`
1 Introduction
The dramatic growth of the world-wide web, now exceeding 800 million pages [11], is forcing modern web search engines to look beyond simply the content of pages in providing relevant answers to queries. Recent work in utilizing the link structure of the web for improving the quality of search results is promising. In particular, the Google search engine uses PageRank, an iterative algorithm that determines the "importance" of a web page based on the "importance" of its parent pages [5]. A related algorithm used by the IBM HITS system maintains a hub and an authority score for each page, in which the authority score for a page is determined by the hub scores of its parents, and the hub score of a page is determined by the authority scores of its children [10].

Given that centralized search engines are having trouble simply indexing the entire web [7, 11], presenting users with search-query results relevant to them is becoming increasingly difficult. The growth of the web will force a greater reliance on client-side filtering and relevance analysis, and personalized searching techniques will become essential. Part of our research in personalizing link-based ranking algorithms involves applying biases during the iterations to increase the importance of certain categories of pages [4]. For instance, we can influence PageRank to weight pages related to computers more heavily than pages related to cars. In this scenario, each user would be expected to run PageRank biased with their personalization profile on their own personal machines, with the necessary link structure provided to them on inexpensive media such as DVD-ROM. Reducing the running times and memory requirements of personalized ranking algorithms is essential.

After reviewing the PageRank algorithm [4, 5] in Section 2, we discuss the various aspects of its running time, and show that it can be computed efficiently:

• We describe an implementation, using the principle underlying the block nested-loops-join strategy, that efficiently controls the cost per iteration even in low-memory environments (Section 3).
• We empirically show that single-precision rank values are sufficient for the computation (Section 4).
• We investigate techniques for determining the number of iterations required to yield a useful PageRank assignment. We present convergence results using several metrics that are useful when the induced ordering on pages, rather than the PageRank value itself, is considered. We investigate the convergence of the globally induced ordering, as well as the ordering induced over specific search results (Section 5).

By showing that a useful ranking can be computed for large web subgraphs in as little as an hour, with only modest main-memory requirements, the potential for personalization is made clear.

* Supported by NSF IIS-98-11904 and an NSF Graduate Research Fellowship.

IPR2018-01594
EXHIBIT 2051
PATENT OWNER DIRECTSTREAM, LLC
EX. 2068, p. 1
`
2 Review of PageRank
First let us review the motivation behind PageRank [5]. The essential idea is that if page u has a link to page v, then the author of u is implicitly conferring some importance to page v. How much importance does u confer? Let N_u be the outdegree of page u, and let Rank(p) represent the importance (i.e., PageRank) of page p. Then the link (u, v) confers Rank(u)/N_u units of rank to v. This simple idea leads to the following fixpoint computation that yields the rank vector Rank* over all of the pages on the web. If N is the number of pages, assign all pages the initial value 1/N. Let B_v represent the set of pages pointing to v. In each iteration, propagate the ranks as follows:¹

∀v: Rank_{i+1}(v) = Σ_{u ∈ B_v} Rank_i(u) / N_u

We continue the iterations until Rank stabilizes to within some threshold. The final vector Rank* contains the PageRank vector over the web. This vector is computed only once after each crawl of the web; the values can then be used to influence the ranking of search results [2].
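The fixpoint iteration just described can be sketched in a few lines. The graph encoding, function name, and tolerance below are illustrative choices, not the paper's implementation:

```python
def pagerank(out_links, tol=1e-8, max_iters=100):
    """Basic PageRank fixpoint: propagate rank along out-links until stable.

    out_links maps each node to its list of children; every node is assumed
    to have at least one out-link, as the text requires.
    """
    n = len(out_links)
    rank = {u: 1.0 / n for u in out_links}           # initial value 1/N
    for _ in range(max_iters):
        next_rank = {u: 0.0 for u in out_links}
        for u, children in out_links.items():
            share = rank[u] / len(children)          # Rank(u)/N_u per edge
            for v in children:
                next_rank[v] += share
        residual = sum(abs(next_rank[u] - rank[u]) for u in out_links)
        rank = next_rank
        if residual < tol:
            break
    return rank

# Toy 3-node cycle: by symmetry, every page ends up with rank 1/3.
ranks = pagerank({0: [1], 1: [2], 2: [0]})
```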
`
The process can also be expressed as the following eigenvector calculation, providing useful insight into PageRank. Let M be the square, stochastic matrix corresponding to the directed graph G of the web, assuming all nodes in G have at least one outgoing edge. If there is a link from page j to page i, then let the matrix entry m_ij have the value 1/N_j. Let all other entries have the value 0. One iteration of the previous fixpoint computation corresponds to the matrix-vector multiplication M × Rank. Repeatedly multiplying Rank by M yields the dominant eigenvector Rank* of the matrix M. Because M corresponds to the stochastic transition matrix over the graph G, PageRank can be viewed as the stationary probability distribution over pages induced by a random walk on the web.

We can measure the convergence of the iterations using the Residual vector. Because M is stochastic (i.e., the entries in each column sum to 1), the dominant eigenvalue of M is 1. Thus the PageRank vector Rank*, the dominant eigenvector of M, has eigenvalue 1, leading to the equality M × Rank* = Rank*. We can use the deviation from this equality for some other vector as an error estimate. For some intermediate vector Rank_i, let Residual_i = M × Rank_i − Rank_i. Equivalently, because multiplication by M is an iteration step, we have Residual_i = Rank_{i+1} − Rank_i. We can treat ||Residual_i|| as an indicator of how well Rank_i approximates Rank*. We expect ||Residual_i|| to tend to zero after an adequate number of iterations.

¹ Note that for u ∈ B_v, the edge (u, v) guarantees N_u ≥ 1.
`
We now address several issues regarding the computation. We previously made the assumption that each node in G has at least one outgoing edge. To enforce this assumption, we can iteratively remove nodes in G that have outdegree 0. Alternatively, we can conceptually add a complete set of outgoing edges to any node with outdegree 0. Another caveat is that the convergence of PageRank is guaranteed only if M is irreducible (i.e., G is strongly connected) and aperiodic [12]. The latter is guaranteed in practice for the web, while the former is true if we add a dampening factor to the rank propagation. We can define a new matrix M' in which we add transition edges of probability (1 − c)/N between every pair of nodes in G:

M' = cM + (1 − c) × [1/N]_{N×N}

This modification improves the quality of PageRank by introducing a decay factor which limits the effect of rank sinks [4], in addition to guaranteeing convergence to a unique rank vector. For the above modification to M, an iteration of PageRank can be expressed as follows:²

M' × Rank = cM × Rank + (1 − c) × [1/N]_{N×1}

We can bias PageRank to weight certain categories of pages by replacing the uniform vector [1/N]_{N×1} with the nonuniform N × 1 personalization vector p as discussed in [4]. In terms of the random-walk model of PageRank, the personalization vector represents the addition of a complete set of transition edges, where the probability of edge (u, v) is given by (1 − c) × p_v. Although the matrix M' that results from the modifications discussed above is not sparse, we never need to store it explicitly. We need only the ability to evaluate M' × Rank efficiently.

² This equality makes use of the fact that ||Rank||_1 = 1.
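As the last observation suggests, one damped iteration can be evaluated with only the sparse link structure: since ||Rank||_1 = 1, the dense teleport term reduces to adding (1 − c) × p_v to each entry. A sketch under assumed names, with a uniform personalization vector by default:

```python
def damped_iteration(out_links, rank, c=0.85, p=None):
    """One step of M' x Rank = c*(M x Rank) + (1 - c)*p, computed without
    ever materializing the dense matrix M'."""
    n = len(rank)
    if p is None:
        p = [1.0 / n] * n                  # uniform vector [1/N]
    new_rank = [0.0] * n
    for u, children in out_links.items():
        share = rank[u] / len(children)    # sparse part: M x Rank
        for v in children:
            new_rank[v] += share
    # dense part: because ||rank||_1 = 1, the rank-one teleport term
    # contributes exactly (1 - c) * p[v] to each entry
    return [c * new_rank[v] + (1.0 - c) * p[v] for v in range(n)]

rank = [1 / 3, 1 / 3, 1 / 3]
rank = damped_iteration({0: [1, 2], 1: [2], 2: [0]}, rank)
```

Note that total rank stays 1 after each step, which is what makes the shortcut valid.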
`

3 Efficient Memory Usage
If we momentarily ignore the scale of the web, the implementation of PageRank is very simple. The sheer scale of the web, however, requires much greater care in the use of data structures. We will begin with a detailed discussion of a naive implementation, during which we will clarify the size of the datasets. The naive algorithm is useful for gaining a clear understanding of matrix-vector multiplication in the specific context of PageRank. We will then present an efficient version of the algorithm, reminiscent of the block nested-loops-join algorithm, that substantially lowers the main-memory requirements. Similar strategies for fast in-memory matrix-vector multiplies are commonly used in the scientific-computing community for improving caching behavior [9, 14]. As we are dealing with massive web repositories in which the data structures are measured in gigabytes and are stored on disk, we will take a more i/o-centric approach in presenting the algorithm and its cost analysis. Empirical timing results for both the naive and block-based approaches are presented.
`
3.1 The Naive Technique

The Stanford WebBase, our local repository of the web, currently contains roughly 25 million pages. There are over 81 million urls in the link graph, including urls that were not themselves crawled, but exist in the bodies of crawled pages. For our experiments, we first used a preprocessing step that removed dangling pages, meaning pages with no children. Starting with the 81-million-node graph, all nodes with outdegree 0 were removed. The step was repeated once more on the resulting graph, yielding a subgraph with close to 19 million nodes. This process was needed since the original graph is a truncated snapshot of the web with many dangling nodes. Nodes with no outgoing links pointing back to the crawled subgraph can adversely affect the PageRank assignment, as mentioned in Section 2. The node ids were assigned consecutively from 0 to 18,922,290. The link structure for the final graph, referred to as Links, is stored on disk in a binary format, illustrated textually in Figure 1. The source-id and each of the destination-ids are stored as 32-bit integers.

Figure 1: The Link Structure

The outdegree is stored as a 16-bit integer. The size of the link structure, after the preprocessing steps mentioned above, is 1.01 GB, and is assumed to exceed the size of main memory.
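The record layout just described (32-bit source id, 16-bit outdegree, then the 32-bit destination ids) can be read and written with a few lines of code. The byte order is not stated in the text, so little-endian is an assumption here:

```python
import io
import struct

def write_links(records, f):
    """Serialize (source_id, [dest_ids]) records: a 32-bit source id,
    a 16-bit outdegree, then one 32-bit id per destination.
    Byte order is assumed little-endian."""
    for source, dests in records:
        f.write(struct.pack("<IH", source, len(dests)))
        f.write(struct.pack("<%dI" % len(dests), *dests))

def read_links(f):
    """Stream (source_id, [dest_ids]) records back out of the file."""
    while True:
        header = f.read(6)                 # 4-byte id + 2-byte outdegree
        if not header:
            return
        source, n = struct.unpack("<IH", header)
        dests = list(struct.unpack("<%dI" % n, f.read(4 * n)))
        yield source, dests

buf = io.BytesIO()
write_links([(0, [5, 7]), (1, [2])], buf)
buf.seek(0)
records = list(read_links(buf))
```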
`
The setup for the naive PageRank implementation is as follows. We create two arrays of floating-point values representing the rank vectors, called Source and Dest, as conceptually shown in the matrix-vector multiplication given in Figure 2. Each vector has N entries, where N is the number of nodes in our web graph. In our experiment, N was 18,922,290. The rank values for iteration i are held in Source, and the rank values for iteration i + 1 are constructed in Dest. Using single-precision values, these arrays for our particular graph have a combined size of over 150 MB.

Figure 2: The Rank Vectors

The iteration steps can be expressed as follows (where c is the dampening factor):
`
∀s: Source[s] = 1/N
while (residual > τ) {
    ∀d: Dest[d] = 0
    while (not Links.eof()) {
        Links.read(source, n, dest_1, dest_2, ..., dest_n)
        for j = 1 ... n
            Dest[dest_j] = Dest[dest_j] + Source[source]/n
    }
    ∀d: Dest[d] = c × Dest[d] + (1 − c)/N    /* dampening or personalization */
    residual = ||Source − Dest||    /* recompute only every few iterations */
    Source = Dest
}
We make successive passes over Links, using the current rank values, held in Source, to compute the next rank values, held in Dest. We can stop when the norm of the difference between Source and Dest reaches some threshold, or alternatively after a prechosen number of iterations. Assuming main memory is large enough to hold Source and Dest, the i/o cost for each iteration of the above implementation is given by:

C = |Links|

If main memory is large enough to hold only the Dest array, and we assume that the link structure is sorted on the source field, the i/o cost is given by:

C = |Source| + |Dest| + |Links|

Source needs to be sequentially read from disk during the rank propagation step, and Dest needs to be written to disk to serve as the Source vector for the subsequent iteration.
`
Although many machines may have enough main memory to hold these arrays, a larger crawl with 50 to 100 million pages clearly will result in rank vectors that exceed the main memory of most computers. Considering that the publicly indexable web now contains roughly 800 million pages [11], the naive approach is infeasible for large subgraphs of the web. As mentioned above, if the link structure is sorted on the source field, the accesses on Source will be sequential and will not pose a problem. However, the random access pattern on the Dest array leads the working set of this implementation to equal the size of the Dest array. If main memory cannot accommodate the Dest array, the running time will increase dramatically and the above cost analysis becomes invalid.
`
3.2 The Block-Based Strategy

There is a similarity between an iteration of PageRank and the relational join operator. Let the children of node source in the structure Links be represented by CHILDREN(source, Links). If we consider the two rank arrays Source and Dest as relations, an iteration of PageRank performs a join in which each entry source of Source joins with each entry dest of Dest if dest ∈ CHILDREN(source, Links). Instead of adjoining fields of two tuples, however, we are adding in the (scaled) value of the Source entry to the Dest entry.

Although the analogy is not exact, the core technique used by the block-oriented join strategy can be used to control the working set of the PageRank algorithm as well. We will partition the Dest array, the cause of the large working set, into β blocks, each of size D pages, as illustrated in Figure 3. If P represents the size of main memory in physical memory pages, then we require D ≤ P − 2, since we must leave input buffers for reading in Source and Links. The links file Links must be rearranged to reflect this setup. We partition Links into β links files Links_0, ..., Links_{β−1}, such that the destinations field in Links_i contains only those nodes dest such that β × i ≤ dest < β × (i + 1). In other words, the outgoing links of a node are bucketed according to the range that the identifier of the destination page falls into. The partitioning of the links of Figure 1 for the case of three blocks is shown in Figure 4.

Figure 3: Blocked Multiplication
`
This partitioning scheme obeys the equality:

CHILDREN(source, Links) = ∪_i CHILDREN(source, Links_i)
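The bucketing can be sketched as follows. Treating the β blocks as equal-width ranges of the id space is an interpretation on our part, as is the record shape (source, outdegree, destinations-in-block):

```python
def partition_links(links, n_nodes, beta):
    """Split each node's out-links into beta per-block link files, so that
    Links_i holds only destinations falling in block i's id range.
    Equal-width id ranges are assumed here.

    links: list of (source, [dest_ids]); returns a list of beta such lists,
    each record carrying the node's FULL outdegree for rank division.
    """
    width = -(-n_nodes // beta)            # ceil(N / beta): ids per block
    parts = [[] for _ in range(beta)]
    for source, dests in links:
        buckets = {}
        for d in dests:
            buckets.setdefault(d // width, []).append(d)
        for i in sorted(buckets):
            # the source id and its outdegree are re-stored in every
            # partition the node touches -- this duplication is the
            # epsilon overhead discussed below
            parts[i].append((source, len(dests), buckets[i]))
    return parts

parts = partition_links([(0, [1, 4, 5]), (1, [0])], n_nodes=6, beta=3)
```

The union of a node's destinations across all partitions recovers its original child list, matching the equality above.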
`

Figure 4: Partitioned Link File

Note that Σ_i |Links_i| > |Links| because of the extra overhead caused by the redundant storage of the source node and outdegree entries in each of the partitions. The block-oriented algorithm proceeds as follows:
∀s: Source[s] = 1/N
while (residual > τ) {
    for i = 0 ... β − 1 {
        ∀d: Dest_i[d] = 0
        while (not Links_i.eof()) {
            Links_i.read(source, n, k, dest_1, dest_2, ..., dest_k)
            for j = 1 ... k
                Dest_i[dest_j] = Dest_i[dest_j] + Source[source]/n
        }
        ∀d: Dest_i[d] = c × Dest_i[d] + (1 − c)/N    /* dampening or personalization */
        Write Dest_i to disk
    }
    residual = ||Source − Dest||    /* recompute only every few iterations */
    Source = Dest
}
`
Because Links_i is sorted on the source field, each pass through Links_i requires only one sequential pass through Source. The working set of this algorithm is exactly P by design, so no swapping occurs. Define ε to be such that the equality Σ_i |Links_i| = |Links| × (1 + ε) is satisfied. The cost of this approach is then given by:

C = β × |Source| + |Dest| + |Links| × (1 + ε)

In practice, ε is reasonably small, as shown in Figure 7. The other cost introduced by the partitioning scheme is the need to make β passes over the Source array. However, since in practice we have |Links| > 5 × |Source|, this additional overhead is reasonable. Note that computing the residual during every iteration would require an additional pass over Source, which is not included in the above cost analysis. We can largely avoid this cost by computing the residual only at fixed intervals.
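An in-memory sketch of one block-oriented iteration, with damping folded in. The partitioned record shape (source, full outdegree, destinations in this block) mirrors the Links_i.read call above; all names and the equal-width block ranges are illustrative:

```python
def blocked_iteration(parts, source, c=0.85):
    """One block-oriented PageRank iteration: for each block i, scan the
    partitioned links Links_i and fill only the Dest_i block, so at most
    one Dest block needs to be resident at a time.

    parts[i] is a list of (src, n_out, [dests in block i]) records,
    where n_out is the node's FULL outdegree."""
    n = len(source)
    beta = len(parts)
    width = -(-n // beta)                 # ids per block (assumed layout)
    dest = []
    for i, links_i in enumerate(parts):
        lo = i * width
        block = [0.0] * min(width, n - lo)        # only Dest_i in "memory"
        for src, n_out, dests in links_i:         # one sweep of Source
            share = source[src] / n_out           # divide by FULL outdegree
            for d in dests:
                block[d - lo] += share
        # dampening (or personalization) applied per block before writing out
        dest.extend(c * x + (1.0 - c) / n for x in block)
    return dest

# Toy graph 0->{1,2}, 1->{2}, 2->{0}, pre-partitioned into 3 blocks:
parts = [
    [(2, 1, [0])],              # Links_0: destinations in block 0
    [(0, 2, [1])],              # Links_1
    [(0, 2, [2]), (1, 1, [2])], # Links_2
]
dest = blocked_iteration(parts, [1 / 3, 1 / 3, 1 / 3])
```

Since every link is still applied exactly once, the result matches the unblocked iteration, illustrating why blocking is not an approximation.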
`
If we had stored the links in transpose format, in which each entry contained a node and a list of parent nodes, then the above algorithm would remain essentially the same, except that we would break the Source array into β blocks and make multiple passes over the Dest array. We would successively load in blocks of Source, and fully distribute each block's rank according to Links_i^T to all destinations in Dest. However, note that each "pass" over the Dest array requires reading in the values from disk, adding in the current Source block's contributions, and then writing out the updated values to disk. Thus using Links^T rather than Links incurs an additional i/o cost of |Dest|, since Dest is both read and written on each pass.

In order to take advantage of sequential read-transfer benefits, we can load in more than one page of Source and Links_i at a time as we stream them in. This buffering reduces the effective time required per pass of Source and Links, at the expense of increasing β. The best strategy for allocating memory among the three data structures depends on the machine and disk architecture, although any reasonable allocation will allow the PageRank algorithm to be used efficiently in cases where the Rank vectors exceed main memory.
`
3.3 Timing Results

For our experiments, we used a 450 MHz Pentium III machine with a 7200-RPM Western Digital Caviar AC418000 hard disk. We measured the running times of PageRank over roughly 19 million pages using three different partitionings of the Dest array: 1 block (i.e., naive), 2 blocks, and 4 blocks. The expected memory usage is given in Figure 5. We tested the three partitioning strategies on three different memory configurations: 256 MB, 64 MB, and 32 MB.

The time required per iteration of PageRank is given for each of the three partitionings under each of the three memory configurations in Figure 6. As expected, the most efficient strategy is to partition the Dest array (and the corresponding Links structure) into enough blocks so that a single Dest block can fit in physical memory. Using too many blocks slightly degrades performance, as both the number of passes over Source and the size of Links increase. Figure 7 shows the total size of the link structure for the three partitionings, as well as the associated ε, as discussed in Section 3.2. Using too few blocks, however, degrades performance by several orders of magnitude. For the cases in Figure 6 where the block size exceeds physical memory, we had to estimate the full iteration time from a partially completed iteration, as the running times were unreasonably high.
`
The blocking strategy commonly used for other algorithms, including the relational join operator, is very effective in controlling the memory requirements of PageRank. The block-based PageRank is not an approximation of normal PageRank: the same matrix-vector multiplication M' × Source is performed whether or not Dest and Links are partitioned. The resultant PageRank vector is identical regardless of the number of blocks used. As the Stanford WebBase increases in size to several hundreds of millions of pages, the block-oriented technique will be essential in computing PageRank, even on machines with fairly large amounts of main memory. Furthermore, the block-oriented technique will be necessary for allowing individual users to compute their own personalized PageRank.

Figure 5: Expected Memory Usage

Figure 6: Log Plot of Running Times

Figure 7: Link Structure Growth
`
4 Accuracy
When computing PageRank, we can use either single-precision or double-precision values for the Source and Dest arrays. Using double precision for Source and Dest, however, would adversely affect the running time by doubling the sizes of the two vectors. Here we show that single-precision Source and Dest vectors are sufficient. Double-precision values should be used for any individual variables, such as the current residual or the current total rank; these individual variables of course have negligible impact on the memory footprint.

The use of single-precision Rank vectors does not lead to significant numerical error. We computed PageRank for 100 iterations using single-precision Source and Dest vectors. We converted the final computed PageRank vector to the equivalent double-precision vector, and performed a double-precision iteration step to get an accurate value for the residual: 2.575 × 10^-4. We then recomputed PageRank for 100 iterations using exclusively double-precision vectors, but found that the residual of the final vector had not noticeably improved: 2.571 × 10^-4.
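The effect can be reproduced in miniature with the standard-library array module, which stores C single-precision floats. This toy experiment is our illustration, not the paper's measurement:

```python
from array import array

def iterate(ranks, out_links, c=0.85, single=False):
    """One damped PageRank step; optionally round the result to single
    precision by round-tripping through a C float array."""
    n = len(ranks)
    new = [0.0] * n
    for u, children in out_links.items():
        share = ranks[u] / len(children)
        for v in children:
            new[v] += share
    new = [c * x + (1.0 - c) / n for x in new]
    return list(array("f", new)) if single else new

out_links = {0: [1, 2], 1: [2], 2: [0]}
double = [1.0 / 3] * 3
single = list(double)
for _ in range(100):
    double = iterate(double, out_links)
    single = iterate(single, out_links, single=True)

# rounding every iterate to single precision perturbs the result only slightly
gap = max(abs(a - b) for a, b in zip(double, single))
```

Because the damped iteration is a contraction, per-step rounding error stays bounded rather than accumulating.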
`
5 Convergence Analysis
Although PageRank is guaranteed to converge given the conditions mentioned in Section 2, there are several measures we can use to analyze the convergence rate. The residual, which is the norm of the difference of the PageRank vectors from two successive iterations as discussed in Section 2, is one possible measure of the convergence. A more useful approach to analyzing the convergence involves looking at the ordering of pages induced by the Rank vector. If the PageRank values will be used strictly for determining the relative importance of pages, the convergence should be measured based on how the ordering changes as the number of iterations increases. In this section, we will discuss various techniques for analyzing the convergence of induced orderings. Depending on the final application of the rankings, we can concentrate on the ordering induced over all pages or on the ordering induced over the results of specific queries. We will first discuss global orderings, and then look at query-specific orderings.
`
5.1 Global Ordering

The global ordering induced by PageRank provides an insightful view into the convergence rate. We analyze the convergence of the induced ordering in two ways: a histogram of the difference in positions of pages within two orderings, and a similarity measure between two ordered listings.

When analyzing the change in page ordering while varying the number of iterations, we are more concerned with instability among the top-ranked pages; whether or not we deem a particular page the least important or the second least important is usually irrelevant.

Figure 8: Histograms of Position Difference

Figure 8 shows histograms, with a bucket size of 100, of position differences for the orderings induced by various numbers of iterations, when compared with the ordering induced by 100 iterations. We considered a page if and only if at least one of the orderings placed it among the top 1 million, to account for our intuition that the ordering among highly ranked pages is more significant. For the ordering induced by 50 iterations, we see that the bulk of pages occur at similar positions as in the ordering induced by 100 iterations.
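The position-difference histogram can be computed as sketched below. Orderings are taken as lists of page ids from most to least important; the function name and parameters are illustrative:

```python
from collections import Counter

def position_histogram(order_a, order_b, top=None, bucket=100):
    """Bucketed histogram of |position difference| for pages that appear in
    both orderings, restricted to pages ranked in the top `top` of at least
    one ordering (or all shared pages when top is None)."""
    pos_a = {page: i for i, page in enumerate(order_a)}
    pos_b = {page: i for i, page in enumerate(order_b)}
    hist = Counter()
    for page in pos_a.keys() & pos_b.keys():
        if top is None or pos_a[page] < top or pos_b[page] < top:
            diff = abs(pos_a[page] - pos_b[page])
            hist[diff // bucket] += 1
    return hist

# Tiny example: pages 'a' and 'c' swap positions; the rest hold still.
hist = position_histogram(list("abcde"), list("cbade"), bucket=2)
```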
`
We now turn to the similarity measure. In many scenarios, we are only concerned with identifying the top pages; we may not necessarily need an exact ordering among them. For instance, if we only have the resources to index (or otherwise process) a small subset of the web, we might choose to concentrate on the (unordered) set of the most highly ranked pages. We define the similarity of two sets of pages A and B to be |A ∩ B| / |A ∪ B|. To visualize how closely two ordered rankings agree on identifying the top pages, we successively compute the similarity among the top n pages according to each ordering, as n is increased in stepped increments. Figure 9 is a graph of the similarity of orderings as n is increased, in steps of 100, to 1 million pages. Here we see that the ordering induced by only 25 iterations agrees very closely with the ordering induced by 100 iterations on what the top pages are.
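The similarity measure and the stepped top-n comparison are straightforward to compute; a sketch with illustrative names:

```python
def similarity(a, b):
    """Set similarity |A intersect B| / |A union B| of two page collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topn_similarity(order_a, order_b, max_n, step=100):
    """Similarity of the top-n page sets of two orderings, for n stepped up
    to max_n: how closely the orderings agree on what the top pages are."""
    return [(n, similarity(order_a[:n], order_b[:n]))
            for n in range(step, max_n + 1, step)]

# Two orderings that disagree on order but agree on membership at each cutoff.
curve = topn_similarity([1, 2, 3, 4], [2, 1, 4, 3], max_n=4, step=2)
```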
`
Figure 9: Global Similarity

5.2 Query Specific Ordering

PageRank is currently used predominantly for the ranking of search results. Although analyzing the global ordering is useful in certain applications, we show here that much of the instability in the ordering across iterations tends to affect the relative rankings only of unrelated pages. Just as we are often more concerned with instability among highly ranked pages, we are also often more concerned with instability among pages that are likely to co-occur in the results of a search query.
`
As before, PageRank is computed only once, yielding a global ordering over web pages. When investigating query-specific ordering, however, we take a subset of these pages, corresponding to the results of a conjunctive search query, and analyze the relative ordering of pages within this result set. Here we analyze the convergence for the induced orderings of the results of two typical search queries. For the games query, we retrieved the urls of all pages in the WebBase that contained the words 'action', 'adventure', 'game', 'role', and 'playing'. For the music query, we retrieved the urls of all pages that contained the words 'music' and 'literature'. Out of the roughly 20 million pages used for the query, 1,567 urls satisfied the games query, and 19,341 urls satisfied the music query. We can see in Figures 10 and 11 that the orderings induced by only 10 iterations agree fairly well with the orderings induced by 100 iterations. The PageRank process seems to converge very quickly when viewed in the context of query results.

The residual vector M' × Rank_i − Rank_i, while appealing in a mathematical sense, offers only a myopic view of the convergence rate. Looking at the induced ordering provides a more practical view of the convergence behavior. The techniques we discussed are applicable in a broader context as well: we have used them to measure the difference in orderings induced by various personalizations of PageRank, in which the number of iterations remains constant.
`
Figure 10: Games Query Similarity

Figure 11: Music Query Similarity

6 Future Work
Personalization of web-based information retrieval will become an increasingly important area of research as the amount of available data on the web continues to grow.
We have shown that PageRank can be computed efficiently on modestly equipped machines, suggesting that individual users can compute their own personalized PageRank vectors. We will be investigating the most effective means of constructing the personalization vector, as discussed in Section 2, so that the resultant ranking vector best captures the user's notion of the importance of a page. We envision utilizing a user's browsing history and bookmark collection to build the personalization vector.

In Section 5.2 we empirically showed that an accurate PageRank vector can be computed in as few as 10 iterations, if accuracy is measured in the context of the induced ordering of results to conjunctive search queries. This convergence result suggests further experiments to determine to what extent the exact nature of the query affects how many iterations are needed before the induced ordering of the query result stabilizes. Using the methods discussed in this paper for analyzing the effect on induced ordering, we will further explore techniques for speeding up the PageRank computation while minimizing the loss of accuracy.
`
7 Conclusion
Algorithms harnessing the link structure of the web are becoming increasingly useful as tools to present relevant results to search queries. Although the PageRank algorithm is based on a simple idea, scaling its implementation to operate on large subgraphs of the web requires careful arrangement of the data. We have demonstrated that PageRank can be run on modestly equipped machines. We have determined that single-precision Rank vectors are sufficient for accurate computation. We have presented several ways of analyzing the convergence of the algorithm, based on the ordering induced on web pages by the Rank vector. By looking at the ordering induced on specific query results, we have shown that even as few as 10 iterations can provide a useful ranking assignment. Given the timing results of Section 3.3, this convergence rate implies that we can compute a useful PageRank vector on a modestly equipped machine in roughly an hour, demonstrating the feasibility of client-side computation of personalized PageRank.
`
Acknowledgements
I would like to thank Professor Jeff Ullman and Professor Rajeev Motwani, as well as the rest of the Stanford MIDAS group, for invaluable discussions regarding the ideas in this paper. I would also like to thank the Stanford WebBase team for access to the web crawl data.
`
References

[1] Search Engine Watch: Up-to-date information on leading search engines. Located at http://www.searchenginewatch.com/.

[2] The Google Search Engine: Commercial search engine founded by the originators of PageRank. Located at http://www.google.com/.

[3] K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. In Proceedings of the Seventh International World Wide Web Conference, 1998.

[4] S. Brin, R. Motwani, L. Page, and T. Winograd. What can you do with a web in your pocket. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 1998.

[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International World Wide Web Conference, 1998.

[6] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International World Wide Web Conference, 1998.

[7] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the Eighth International World Wide Web Conference, 1999.

[8] J. Dean and M. Henzinger. Finding related pages in the world wide web. In Proceedings of the Eighth International World Wide Web Conference, 1999.

[9] E.-J. Im and K. Yelick. Optimizing sparse matrix vector multiplication on SMPs. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999.

[10] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.

[11] S. Lawrence and C. L. Giles. Accessibility of information on the web. Nature, 400:107-109, July 1999.

[12] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, United Kingdom, 1995.

[13] R. Ramakrishnan. Database Management Systems. McGraw-Hill, 1998.

[14] S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, volume 41. IBM, 1997.
`