`Bringing Order to the Web
`
`January 29, 1998
`
`Abstract
`
The importance of a Web page is an inherently subjective matter, which depends on the
reader's interests, knowledge and attitudes. But there is still much that can be said objectively
about the relative importance of Web pages. This paper describes PageRank, a method for
rating Web pages objectively and mechanically, effectively measuring the human interest and
attention devoted to them.
`We compare PageRank to an idealized random Web surfer. We show how to efficiently
`compute PageRank for large numbers of pages. And, we show how to apply PageRank to search
`and to user navigation.
`
`1
`
`Introduction and Motivation
`
`The World Wide Web creates many new challenges for information retrieval. It is very large and
`heterogeneous. Current estimates are that there are over 150 million web pages with a doubling
`life of less than one year. More importantly, the web pages are extremely diverse, ranging from
`"What is Joe having for lunch today?" to journals about information retrieval. In addition to these
`major challenges, search engines on the Web must also contend with inexperienced users and pages
`engineered to manipulate search engine ranking functions.
However, unlike "flat" document collections, the World Wide Web is hypertext and provides
considerable auxiliary information on top of the text of the web pages, such as link structure and
link text. In this paper, we take advantage of the link structure of the Web to produce a global
"importance" ranking of every web page. This ranking, called PageRank, helps search engines and
`users quickly make sense of the vast heterogeneity of the World Wide Web.
`
`1.1 Diversity of Web Pages
`
`Although there is already a large literature on academic citation analysis, there are a number
`of significant differences between web pages and academic publications. Unlike academic papers
`which are scrupulously reviewed, web pages proliferate free of quality control or publishing costs.
`With a simple program, huge numbers of pages can be created easily, artificially inflating citation
`counts. Because the Web environment contains competing profit seeking ventures, attention getting
`strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy
which counts replicable features of web pages is prone to manipulation. Further, academic papers
`are well defined units of work, roughly similar in quality and number of citations, as well as in
`their purpose - to extend the body of knowledge. Web pages vary on a much wider scale than
`academic papers in quality, usage, citations, and length. A random archived message posting
`
`1
`
`EXHIBIT 2054
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00479
`
`
`
asking an obscure question about an IBM computer is very different from the IBM home page. A
research article about the effects of cellular phone use on driver attention is very different from an
advertisement for a particular cellular provider. The average web page quality experienced by a
user is higher than the quality of the average web page. This is because the simplicity of creating
and publishing web pages results in a large fraction of low quality web pages that users are unlikely
to read.
There are many axes along which web pages may be differentiated. In this paper, we deal
primarily with one - an approximation of the overall relative importance of web pages.
`
`1.2 PageRank
`
`In order to measure the relative importance of web pages, we propose PageRank, a method for
`computing a ranking for every web page based on the graph of the web. PageRank has applications
`in search, browsing, and traffic estimation.
Section 2 gives a mathematical description of PageRank and provides some intuitive justification.
In Section 3, we show how we efficiently compute PageRank for as many as 518 million
hyperlinks. To test the utility of PageRank for search, we built a web search engine called Google
(Section 5). We also demonstrate how PageRank can be used as a browsing aid in Section 7.3.
`
`2 A Ranking for Every Page on the Web
`
`2.1 Related Work
`
There has been a great deal of work on academic citation analysis [Gar95]. Goffman [Gof71] has
published an interesting theory of how information flow in a scientific community is an epidemic
process.
There has been a fair amount of recent activity on how to exploit the link structure of large
hypertext systems such as the web. Pitkow recently completed his Ph.D. thesis on "Characterizing
World Wide Web Ecologies" [Pit97, PPR96] with a wide variety of link based analysis. Weiss
discusses clustering methods that take the link structure into account [WVS+96]. Spertus [Spe97]
discusses information that can be obtained from the link structure for a variety of applications.
Good visualization demands added structure on the hypertext and is discussed in [MFH95, MF95].
Recently, Kleinberg [Kle98] has developed an interesting model of the web as Hubs and Authorities,
based on an eigenvector calculation on the co-citation matrix of the web.
Finally, there has been some interest in what "quality" means on the net from a library community [Til].
It is obvious to try to apply standard citation analysis techniques to the web's hypertextual
citation structure. One can simply think of every link as being like an academic citation. So,
a major page like http://www.yahoo.com/ will have tens of thousands of backlinks (or citations)
pointing to it.
The fact that the Yahoo home page has so many backlinks generally implies that it is quite
important. Indeed, many of the web search engines have used backlink count as a way to try to bias
their databases in favor of higher quality or more important pages. However, simple backlink counts
have a number of problems on the web. Some of these problems have to do with characteristics of
the web which are not present in normal academic citation databases.
`
`2
`
`
`
`2.2 Link Structure of the Web
`
While estimates vary, the current graph of the crawlable Web has roughly 150 million nodes (pages)
and 1.7 billion edges (links). Every page has some number of forward links (outedges) and backlinks
(inedges) (see Figure 1). We can never know whether we have found all the backlinks of a particular
page, but if we have downloaded it, we know all of its forward links at that time.
`
[Figure 1: A and B are Backlinks of C. Two pages A and B each link forward to page C.]
`
Web pages vary greatly in terms of the number of backlinks they have. For example, the
Netscape home page has 62,804 backlinks in our current database compared to most pages which
have just a few backlinks. Generally, highly linked pages are more "important" than pages with
few links. Simple citation counting has been used to speculate on the future winners of the Nobel
Prize [San95]. PageRank provides a more sophisticated method for doing citation counting.
The reason that PageRank is interesting is that there are many cases where simple citation
counting does not correspond to our common sense notion of importance. For example, if a web
page has a link off the Yahoo home page, it may be just one link but it is a very important one.
This page should be ranked higher than many pages with more links but from obscure places.
PageRank is an attempt to see how good an approximation to "importance" can be obtained just
from the link structure.
`
`2.3 Propagation of Ranking Through Links
`
`Based on the discussion above, we give the following intuitive description of PageRank: a page has
`high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page
`has many backlinks and when a page has a few highly ranked backlinks.
`
`2.4 Definition of PageRank
`
Let u be a web page. Then let F_u be the set of pages u points to and B_u be the set of pages that
point to u. Let N_u = |F_u| be the number of links from u and let c be a factor used for normalization
(so that the total rank of all web pages is constant).
We begin by defining a simple ranking, R, which is a slightly simplified version of PageRank:

    R(u) = c Σ_{v∈B_u} R(v)/N_v

[Figure 2: Simplified PageRank Calculation. Rank flows along forward links, divided evenly: e.g. a page with rank 100 and two outlinks passes 50 along each, and a page with rank 9 and three outlinks passes 3 along each.]
`
This formalizes the intuition in the previous section. Note that the rank of a page is divided
among its forward links evenly to contribute to the ranks of the pages they point to. Note that
c < 1 because there are a number of pages with no forward links and their weight is lost from the
system (see Section 2.7). The equation is recursive but it may be computed by starting with any set
of ranks and iterating the computation until it converges. Figure 2 demonstrates the propagation
of rank from one pair of pages to another. Figure 3 shows a consistent steady state solution for a
set of pages.
Stated another way, let A be a square matrix with the rows and columns corresponding to web
pages. Let A_{u,v} = 1/N_u if there is an edge from u to v and A_{u,v} = 0 if not. If we treat R as a
vector over web pages, then we have R = cAR. So R is an eigenvector of A with eigenvalue c. In
fact, we want the dominant eigenvector of A. It may be computed by repeatedly applying A to any
nondegenerate start vector.
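This power method can be sketched in a few lines of Python (a minimal illustration; the 3-page graph and its indices are invented for this example, not taken from the paper's data):

```python
# Power iteration for the simplified ranking R = cAR on a toy 3-page graph.
links = {0: [1, 2], 1: [2], 2: [0]}   # forward links F_u for each page u

n = 3
A = [[0.0] * n for _ in range(n)]     # A[v][u] = 1/N_u if u links to v
for u, outs in links.items():
    for v in outs:
        A[v][u] = 1.0 / len(outs)

R = [1.0 / n] * n                     # any nondegenerate start vector works
for _ in range(100):
    R = [sum(A[v][u] * R[u] for u in range(n)) for v in range(n)]
    norm = sum(abs(x) for x in R)
    R = [x / norm for x in R]         # renormalizing absorbs the constant c

print([round(x, 2) for x in R])       # -> [0.4, 0.2, 0.4]
```

The vector R converges to the dominant eigenvector; note the steady state here has the same 0.4/0.2/0.4 shape as Figure 3.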
`There is a small problem with this simplified ranking function. Consider two web pages that
`point to each other but to no other page. And suppose there is some web page which points to
`one of them. Then, during iteration, this loop will accumulate rank but never distribute any rank
`(since there are no outedges). The loop forms a sort of trap which we call a rank sink.
`To overcome this problem of rank sinks, we introduce a rank source:
`
Definition 1 Let E(u) be some vector over the Web pages that corresponds to a source of rank.
Then, the PageRank of a set of Web pages is an assignment, R', to the Web pages which satisfies

    R'(u) = c Σ_{v∈B_u} R'(v)/N_v + cE(u)        (1)

such that c is maximized and ||R'||_1 = 1 (||R'||_1 denotes the L1 norm of R').
`
`
`
`
`
`
`
`
`
[Figure 3: Simplified PageRank Calculation. A consistent steady state for a set of pages: A has rank 0.4, B has rank 0.2, and C has rank 0.4.]
`
`
`
`
`
`
`
`
`
`∞
`
`
`
`
`
`∞
`
`∞
`
`Figure 4: Loop Which Acts as a Rank Sink
`
where E(u) is some vector over the web pages that corresponds to a source of rank (see Section 6). Note that if E is all positive, c must be reduced to balance the equation. Therefore, this
technique corresponds to a decay factor. In matrix notation we have R' = c(AR' + E). Since
||R'||_1 = 1, we can rewrite this as R' = c(A + E × 1)R' where 1 is the vector consisting of all ones.
So, R' is an eigenvector of (A + E × 1).
`
`2.5 Random Surfer Model
`
`The definition of PageRank above has another intuitive basis in random walks on graphs. The
`simplified version corresponds to the standing probability distribution of a random walk on the
`graph of the Web.
`Intuitively, this can be thought of as modeling the behavior of a “random
`surfer”. The “random surfer” simply keeps clicking on successive links at random. However, if a
real Web surfer ever gets into a small loop of web pages, it is unlikely that the surfer will continue
`in the loop forever. Instead, the surfer will jump to some other page. The additional factor E can
`be viewed as a way of modeling this behavior: the surfer periodically “gets bored” and jumps to a
`
`
`
`random page chosen based on the distribution in E.
So far we have left E as a user defined parameter. In most tests we let E be uniform over all
web pages with value α. However, in Section 6 we show how different values of E can generate
"customized" page ranks.
`
`2.6 Computing PageRank
`
The computation of PageRank is fairly straightforward if we ignore the issues of scale. Let S be
almost any vector over Web pages (for example E). Then PageRank may be computed as follows:

    R_0 ← S
    loop:
        R_{i+1} ← A R_i
        d ← ||R_i||_1 - ||R_{i+1}||_1
        R_{i+1} ← R_{i+1} + dE
        δ ← ||R_{i+1} - R_i||_1
    while δ > ε

Note that the d factor increases the rate of convergence and maintains ||R||_1. An alternative
normalization is to multiply R by the appropriate factor. The use of d may have a small impact
on the influence of E.
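The loop above can be written out in plain Python (a toy sketch, not the paper's implementation; the graph is invented, and for simplicity E is taken uniform with ||E||_1 = 1 here so that adding dE exactly restores the norm, whereas the paper's experiments scale E down):

```python
def pagerank(links, n, eps=1e-10):
    """links: dict page -> list of forward links; pages absent have no outlinks."""
    E = [1.0 / n] * n                    # uniform rank source
    R = E[:]                             # R_0 <- S (here S = E)
    while True:
        R_next = [0.0] * n               # R_{i+1} <- A R_i
        for u in range(n):
            for v in links.get(u, []):
                R_next[v] += R[u] / len(links[u])
        d = sum(R) - sum(R_next)         # d <- ||R_i||_1 - ||R_{i+1}||_1
        R_next = [r + d * e for r, e in zip(R_next, E)]
        delta = sum(abs(a - b) for a, b in zip(R_next, R))
        R = R_next
        if delta <= eps:                 # while delta > epsilon
            return R

# page 2 has no outlinks, so its weight is recycled through E each step
ranks = pagerank({0: [1], 1: [0, 2]}, 3)
print([round(r, 3) for r in ranks])      # -> [0.3, 0.4, 0.3]
```

The d correction keeps the total rank at 1 even though page 2 distributes nothing.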
`
`2. 7 Dangling Links
`
`One issue with this model is dangling links. Dangling links arc simply links that point to any page
`with no outgoing links. They affect the model because it is not clear where their weight should
`he distributed, and there arc a large number of them. Often these dangling linkR arc simply pages
`that we have not downloaded yet, since it is hard to sample the entire web (in our 24 million pages
`currently downloaded, we have 51 million URLs not downloaded yet, and hence dangling). Because
`dangling links do not alfecl. the ranking of any other page directly, we simply remove Lhem from
`the system until all the PageRanks are calculated. After all the PageRanks are calculated, they
`can be added back in, without affecting things significantly. Notice the normalization of the other
`links on the same page as a link which was removed will change slightly, but this should not have
`a large cii'ccL.
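The removal step can be sketched as a small fixed-point loop (a hypothetical helper, not the paper's code): dropping links into pages with no known outlinks can leave their parents without outlinks in turn, so the pruning repeats until nothing changes.

```python
def remove_dangling_links(links):
    """links: dict page -> list of forward links; pages absent from the dict
    (or with an empty list) have no known outlinks, so links into them dangle."""
    links = {u: list(vs) for u, vs in links.items() if vs}
    while True:
        alive = set(links)                        # pages that still have outlinks
        pruned = {u: [v for v in vs if v in alive] for u, vs in links.items()}
        pruned = {u: vs for u, vs in pruned.items() if vs}
        if pruned == links:                       # fixed point reached
            return pruned
        links = pruned

# a chain into a dangling page unravels over a few iterations,
# while a cycle (and links into it) survives
print(remove_dangling_links({0: [1], 1: [2]}))          # -> {}
print(remove_dangling_links({0: [1], 1: [0], 2: [0]}))  # -> {0: [1], 1: [0], 2: [0]}
```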
`
`3
`
`Implementation
`
As part of the Stanford WebBase project [PB98], we have built a complete crawling and indexing
system with a current repository of 24 million web pages. Any web crawler needs to keep a database
of URLs so it can discover all the URLs on the web. To implement PageRank, the web crawler
simply needs to build an index of links as it crawls. While a simple task, it is non-trivial because
of the huge volumes involved. For example, to index our current 24 million page database in about
five days, we need to process about 50 web pages per second. Since there are about 11 links on
an average page (depending on what you count as a link) we need to process 550 links per second.
Also, our database of 24 million pages references over 75 million unique URLs which each link must
be compared against.
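The throughput figures above follow from simple arithmetic (a quick sanity check, not code from the paper):

```python
pages = 24_000_000
seconds = 5 * 24 * 3600              # about five days
pages_per_sec = pages / seconds      # ~55.6, i.e. roughly 50 pages per second
links_per_sec = 50 * 11              # the paper's 550 links/sec uses the rounded
                                     # 50 pages/sec figure and ~11 links per page
print(round(pages_per_sec, 1), links_per_sec)   # -> 55.6 550
```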
`
`6
`
`
`
Much time has been spent making the system resilient in the face of many deeply and intricately
flawed web artifacts. There exist infinitely large sites, pages, and even URLs. A large fraction of
web pages have incorrect HTML, making parser design difficult. Messy heuristics are used to help
the crawling process. For example, we do not crawl URLs with /cgi-bin/ in them. Of course it
is impossible to get a correct sample of the "entire web" since it is always changing. Sites are
sometimes down, and some people decide to not allow their sites to be indexed. Despite all this, we
believe we have a reasonable representation of the actual link structure of the publicly accessible web.
`
`3.1 PageRank Implementation
`
We convert each URL into a unique integer, and store each hyperlink in a database using the integer
IDs to identify pages. Details of our implementation are in [PB98]. In general, we have implemented
PageRank in the following manner. First we sort the link structure by Parent ID. Then dangling
links are removed from the link database for reasons discussed above (a few iterations removes the
vast majority of the dangling links). We need to make an initial assignment of the ranks. This
assignment can be made by one of several strategies. If it is going to iterate until convergence, in
general the initial values will not affect the final values, just the rate of convergence. But we can speed
up convergence by choosing a good initial assignment. We believe that careful choice of the initial
assignment and a small finite number of iterations may result in excellent or improved performance.
Memory is allocated for the weights for every page. Since we use single precision floating point
values at 4 bytes each, this amounts to 300 megabytes for our 75 million URLs. If insufficient
RAM is available to hold all the weights, multiple passes can be made (our implementation uses
half as much memory and two passes). The weights from the current time step are kept in memory,
and the previous weights are accessed linearly on disk. Also, all the access to the link database,
A, is linear because it is sorted. Therefore, A can be kept on disk as well. Although these data
structures are very large, linear disk access allows each iteration to be completed in about 6 minutes
on a typical workstation. After the weights have converged, we add the dangling links back in and
recompute the rankings. Note after adding the dangling links back in, we need to iterate as many
times as was required to remove the dangling links. Otherwise, some of the dangling links will have
a zero weight. This whole process takes about five hours in the current implementation. With less
strict convergence criteria, and more optimization, the calculation could be much faster. Or, more
efficient techniques for estimating eigenvectors could be used to improve performance. However, it
should be noted that the cost required to compute the PageRank is insignificant compared to the
cost required to build a full text index.
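The first two steps of this pipeline (interning URLs to integer IDs, then sorting the edge list by Parent ID so each iteration can stream it linearly) can be sketched as follows; the helper name and sample URLs are hypothetical:

```python
def build_link_db(url_pairs):
    """url_pairs: iterable of (source_url, dest_url) hyperlinks."""
    ids = {}                                    # URL -> unique integer ID
    def intern(url):
        return ids.setdefault(url, len(ids))
    edges = [(intern(src), intern(dst)) for src, dst in url_pairs]
    edges.sort()                                # sort by Parent ID
    return ids, edges

ids, edges = build_link_db([
    ("b.example", "a.example"),
    ("a.example", "c.example"),
    ("a.example", "b.example"),
])
# edges are now grouped by parent, so one pass of R <- AR can read them
# sequentially from disk rather than seeking randomly
print(edges)   # -> [(0, 1), (1, 0), (1, 2)]
```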
`
`4 Convergence Properties
`
As can be seen from the graph in Figure 5, PageRank on a large 322 million link database converges
to a reasonable tolerance in roughly 52 iterations. The convergence on half the data takes roughly
45 iterations. This graph suggests that PageRank will scale very well even for extremely large
collections as the scaling factor is roughly linear in log n.
One of the interesting ramifications of the fact that the PageRank calculation converges rapidly
is that the web is an expander-like graph. To understand this better, we give a brief overview of
the theory of random walks on graphs; refer to Motwani-Raghavan [MR95] for further details. A
random walk on a graph is a stochastic process where at any given time step we are at a particular
node of the graph and choose an outedge uniformly at random to determine the node to visit at
the next time step. A graph is said to be an expander if it is the case that every (not too large)
subset of nodes S has a neighborhood (set of vertices accessible via outedges emanating from nodes
`
`7
`
`
`
[Figure 5: Convergence of PageRank Computation. Total difference from previous iteration (log scale) versus number of iterations, for 322 million and 161 million links.]
`
`
5 Searching with PageRank

The benefits of PageRank are the greatest for underspecified queries. For example, a query
for "Stanford University" may return any number of web pages which mention Stanford (such as
publication lists) on a conventional search engine, but using PageRank, the university home page
is listed first.
`
`5.1 Title Search
`
To test the usefulness of PageRank for search, we implemented a search engine that used only the
titles of 16 million web pages. To answer a query, the search engine finds all the web pages whose
titles contain all of the query words. Then it sorts the results by PageRank. This search engine
is very simple and cheap to implement. In informal tests, it worked remarkably well. As can be
seen in Figure 6, a search for "University" yields a list of top universities. This figure shows our
MultiQuery system which allows a user to query two search engines at the same time. The search
engine on the left is our PageRank based title search engine. The bar graphs and percentages shown
are a log of the actual PageRank with the top page normalized to 100%, not a percentile which is
used everywhere else in this paper. The search engine on the right is Altavista. You can see that
Altavista returns random looking web pages that match the query "University" and are the root
page of the server (Altavista seems to be using URL length as a quality heuristic).
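The mechanism is simple enough to sketch in full (the titles and rank values below are made up for illustration):

```python
def title_search(query, pages):
    """pages: list of (title, pagerank); return title matches sorted by rank."""
    words = query.lower().split()
    hits = [(title, pr) for title, pr in pages
            if all(w in title.lower().split() for w in words)]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

pages = [
    ("University of Oregon Optics", 2.3),
    ("Stanford University Homepage", 11.2),
    ("Stanford Daily", 4.1),
]
for title, pr in title_search("university", pages):
    print(title, pr)    # highest-PageRank matching title first
```

The query words filter for precision; the precomputed PageRank alone determines the order.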
`
[Figure 6: Comparison of query for "University". MultiQuery screenshot: the PageRank-based title search engine (left) returns the Stanford University Homepage and other major university home pages first, while AltaVista (right) returns assorted root pages that match "University".]
`
`
`
Web Page                                               PageRank (average is 1.0)
Download Netscape Software                             11589.00
http://www.w3.org/                                     10717.70
Welcome to Netscape                                     8673.51
Point: It's What You're Searching For                   7930.92
Web-Counter Home Page                                   7254.97
The Blue Ribbon Campaign for Online Free Speech         7010.39
CERN Welcome                                            6562.49
Yahoo!                                                  6561.80
Welcome to Netscape                                     6203.47
Wusage 4.1: A Usage Statistics System For Web Servers   5963.27
The World Wide Web Consortium (W3C)                     5672.21
Lycos, Inc. Home Page                                   4683.31
Starting Point                                          4501.98
Welcome to Magellan!                                    3866.82
Oracle Corporation                                      3587.63

Table 1: Top 15 Page Ranks: July 1996
`
`5.2 Rank Merging
`
The reason that the title based PageRank system works so well is that the title match ensures
high precision, and the PageRank ensures high quality. When matching a query like "University"
on the web, recall is not very important because there is far more than a user can look at. For
more specific searches where recall is more important, the traditional information retrieval scores
over full-text and the PageRank should be combined. Our Google system does this type of rank
merging. Rank merging is known to be a very difficult problem, and we need to spend considerable
additional effort before we will be able to do a reasonable evaluation of these types of queries.
However, we do believe that using PageRank as a factor in these queries is quite beneficial.
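One simple form such merging could take (the weighting and log scaling here are assumptions for illustration; the paper does not give Google's actual formula):

```python
import math

def merged_score(ir_score, pagerank, weight=0.5):
    """Blend a full-text relevance score with log-scaled PageRank
    (both the blend weight and the log scaling are illustrative choices)."""
    return weight * ir_score + (1 - weight) * math.log(1 + pagerank)

# with equal text relevance, the higher-PageRank page wins the tie
print(merged_score(1.0, 100.0) > merged_score(1.0, 1.0))   # -> True
```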
`
`5.3 Some Sample Results
`
We have experimented considerably with Google, a full-text search engine which uses PageRank.
While a full-scale user study is beyond the scope of this paper, we provide a sample query in
Appendix A. For more queries, we encourage the reader to test Google themselves [BP].
Table 1 shows the top 15 pages based on PageRank. This particular listing was generated in
July 1996. In a more recent calculation of PageRank, Microsoft has just edged out Netscape for
the highest PageRank.
`
`5.4 Common Case
`
One of the design goals of PageRank was to handle the common case for queries well. For example,
a user searched for "wolverine", remembering that the University of Michigan system used for all
administrative functions by students was called something with a wolverine in it. Our PageRank
based title search system returned the answer "Wolverine Access" as the first result. This is sensible
since all the students regularly use the Wolverine Access system, and a random user is quite likely
to be looking for it given the query "wolverine". The fact that the Wolverine Access site is a good
common case is not contained in the HTML of the page. Even if there were a way of defining good
`
`10
`
`
`
meta-information of this form within a page, it would be problematic since a page author could
not be trusted with this kind of evaluation. Many web page authors would simply claim that their
pages were all the best and most used on the web.
It is important to note that the goal of finding a site that contains a great deal of information
about wolverines is a very different task than finding the common case wolverine site. There is an
interesting system [Mar97] that attempts to find sites that discuss a topic in detail by propagating
the textual matching score through the link structure of the web. It then tries to return the page
on the most central path. This results in good results for queries like "flower"; the system will
return good navigation pages from sites that deal with the topic of flowers in detail. Contrast that
with the common case approach which might simply return a commonly used commercial site that
had little information except how to buy flowers. It is our opinion that both of these tasks are
important, and a general purpose web search engine should return results which fulfill the needs
of both of these tasks automatically. In this paper, we are concentrating only on the common case
approach.
`
`5.5 Subcomponents of Common Case
`
It is instructive to consider what kind of common case scenarios PageRank can help represent.
Besides a page which has a high usage, like the Wolverine Access site, PageRank can also represent
a collaborative notion of authority or trust. For example, a user might prefer a news story simply
because it is linked directly from the New York Times home page. Of course such a story
will receive quite a high PageRank simply because it is mentioned by a very important page. This
seems to capture a kind of collaborative trust, since if a page was mentioned by a trustworthy
or authoritative source, it is more likely to be trustworthy or authoritative. Similarly, quality or
importance seems to fit within this kind of circular definition.
`
`6 Personalized PageRank
`
An important component of the PageRank calculation is E, a vector over the Web pages which
is used as a source of rank to make up for the rank sinks such as cycles with no outedges (see
Section 2.4). However, aside from solving the problem of rank sinks, E turns out to be a powerful
parameter to adjust the page ranks. Intuitively the E vector corresponds to the distribution of web
pages that a random surfer periodically jumps to. As we see below, it can be used to give broad
general views of the Web or views which are focused and personalized to a particular individual.
We have performed most experiments with an E vector that is uniform over all web pages with
||E||_1 = 0.15. This corresponds to a random surfer periodically jumping to a random web page.
This is a very democratic choice for E since all web pages are valued simply because they exist.
Although this technique has been quite successful, there is an important problem with it. Some
web pages with many related links receive an overly high ranking. Examples of these include
copyright warnings, disclaimers, and highly interlinked mailing list archives.
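The choices of E contrasted in this section are easy to write down explicitly (a sketch with hypothetical page indices):

```python
def uniform_e(n, total=0.15):
    """The 'democratic' E: every page gets an equal share, ||E||_1 = total."""
    return [total / n] * n

def single_page_e(n, page, total=0.15):
    """An E concentrated entirely on one page, e.g. a user's home page."""
    e = [0.0] * n
    e[page] = total
    return e

E_uniform = uniform_e(4)
E_personal = single_page_e(4, page=2)
print(E_uniform, E_personal)
# -> [0.0375, 0.0375, 0.0375, 0.0375] [0.0, 0.0, 0.15, 0.0]
```

Feeding either vector into the PageRank iteration yields the broad or the personalized view, respectively.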
Another extreme is to have E consist entirely of a single web page. We tested two such E's -
the Netscape home page, and the home page of a famous computer scientist, John McCarthy. For
the Netscape home page, we attempt to generate page ranks from the perspective of a novice user
who has Netscape set as the default home page. In the case of John McCarthy's home page we
want to calculate page ranks from the perspective of an individual who has given us considerable
contextual information based on the links on his home page.
In both cases, the mailing list problem mentioned above did not occur. And, in both cases, the
respective home page got the highest PageRank and was followed by its immediate links. From
`
`11
`
`
`
Web Page Title                             John McCarthy's View   Netscape's View
                                           PageRank Percentile    PageRank Percentile
John McCarthy's Home Page                  100.00%                99.23%
John Mitchell (Stanford CS Theory Group)   100.00%                93.49%
Venture Law (Local Startup Law Firm)       99.94%                 99.82%
Stanford CS Home Page                      100.00%                99.83%
University of Michigan AI Lab              99.95%                 99.94%
University of Toronto CS Department        99.99%                 99.09%
Stanford CS Theory Group                   99.99%                 99.05%
Leadershape Institute                      95.96%                 97.10%

Table 2: Page Ranks for Two Different Views: Netscape vs. John McCarthy
`
that point, the disparity decreased. In Table 2, we show the resulting page rank percentiles for
an assortment of different pages. Pages related to computer science have a higher McCarthy-rank
than Netscape-rank and pages related to computer science at Stanford have a considerably higher
McCarthy-rank. For example, the Web page of another Stanford Computer Science Dept. faculty
member is more than six percentile points higher on the McCarthy-rank. Note that the page ranks
are displayed as percentiles. This has the effect of compressing large differences in PageRank at
the top of the range.
Such personalized page ranks may have a number of applications, including personal search
engines. These search engines could save users a great deal of trouble by efficiently guessing a
large part of their interests given simple input such as their bookmarks or home page. We show an
example of this in Appendix A with the "Mitchell" query. In this example, we demonstrate that
while there are many people on the web named Mitchell, the number one result is the home page
of a colleague of John McCarthy named John Mitchell.
`
`6.1 Manipulation by Commercial Interests
`
These types of personalized PageRanks are virtually immune to manipulation by commercial
interests. For a page to get a high PageRank, it must convince an important page, or a lot of
non-important pages, to link to it. At worst, you can have manipulation in the form of buying
advertisements (links) on important sites. But, this seems well under control since it costs money.
This immunity to manipulation is an extremely important property. This kind of commercial
manipulation is causing search engines a great deal of trouble, and making features that would be great
to have very difficult to implement. For example, fast updating of documents is a very desirable
feature, but it is abused by people who want to manipulate the results of the search engine.
A compromise between the two extremes of uniform E and single page E is to let E consist of
all the root level pages of all web servers. Notice this will allow some manipulation of PageRanks.
Someone who wished to manipulate this system could simply create a large number of root level
servers all pointing at a particular site.
`
`12
`
`
`
`7 Applications
`
`7.1 Estimating Web Traffic
`
Because PageRank roughly corresponds to a random web surfer (see Section 2.5), it is interesting
to see how PageRank corresponds to actual usage. We used the counts of web page accesses from
the NLANR [NLA] proxy cache and compared these t