The PageRank Citation Ranking: Bringing Order to the Web

January 1998
`
`Abstract
`
The importance of a Web page is an inherently subjective matter, which depends on the
reader's interests, knowledge and attitudes. But there is still much that can be said objectively
about the relative importance of Web pages. This paper describes PageRank, a method for
rating Web pages objectively and mechanically, effectively measuring the human interest and
attention devoted to them.
We compare PageRank to an idealized random Web surfer. We show how to efficiently
compute PageRank for large numbers of pages. And, we show how to apply PageRank to search
and to user navigation.
`
`
`
1 Introduction and Motivation
`
The World Wide Web creates many new challenges for information retrieval. It is very large and
heterogeneous. Current estimates are that there are over 150 million web pages with a doubling
life of less than one year. More importantly, the web pages are extremely diverse, ranging from
"What is Joe having for lunch today?" to journals about information retrieval. In addition to these
major challenges, search engines on the Web must also contend with inexperienced users and pages
engineered to manipulate search engine ranking functions.
However, unlike "flat" document collections, the World Wide Web is hypertext and provides
considerable auxiliary information on top of the text of the web pages, such as link structure and
link text. In this paper, we take advantage of the link structure of the Web to produce a global
"importance" ranking of every web page. This ranking, called PageRank, helps search engines and
users quickly make sense of the vast heterogeneity of the World Wide Web.
`
1.1 Diversity of Web Pages
`
Although there is already a large literature on academic citation analysis, there are a number
of significant differences between web pages and academic publications. Unlike academic papers,
which are scrupulously reviewed, web pages proliferate free of quality control or publishing costs.
With a simple program, huge numbers of pages can be created easily, artificially inflating citation
counts. Because the Web environment contains competing profit-seeking ventures, attention-getting
strategies evolve in response to search engine algorithms. For this reason, any evaluation strategy
which counts replicable features of web pages is prone to manipulation. Further, academic papers
are well-defined units of work, roughly similar in quality and number of citations, as well as in
their purpose to extend the body of knowledge. Web pages vary on a much wider scale than
academic papers in quality, usage, citations, and length. A random archived message posting
`
`
`
`IPR2018-01594
`
`EXHIBIT
`2050
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2067, p. 1
`
`
`
asking an obscure question about an IBM computer is very different from the IBM home page. A
research article about the effects of cellular phone use on driver attention is very different from an
advertisement for a particular cellular provider. The average web page quality experienced by a
user is higher than the quality of the average web page. This is because the simplicity of creating
and publishing web pages results in a large fraction of low quality web pages that users are unlikely
to read.
There are many axes along which web pages may be differentiated. In this paper, we deal
primarily with one: an approximation of the overall relative importance of web pages.
`
1.2 PageRank
`
In order to measure the relative importance of web pages, we propose PageRank, a method for
computing a ranking for every web page based on the graph of the web. PageRank has applications
in search, browsing, and traffic estimation.
Section 2 gives a mathematical description of PageRank and provides some intuitive justification.
In Section 3, we show how we efficiently compute PageRank for as many as 518 million
hyperlinks. To test the utility of PageRank for search, we built a web search engine called Google
(Section 5). We also demonstrate how PageRank can be used as a browsing aid in Section 7.3.
`
2 A Ranking for Every Page on the Web
`
2.1 Related Work
`
There has been a great deal of work on academic citation analysis [Gar]. Goffman [Gof] has
published an interesting theory of how information flow in a scientific community is an epidemic
process.
There has been a fair amount of recent activity on how to exploit the link structure of large
hypertext systems such as the web. Pitkow recently completed his Ph.D. thesis on "Characterizing
World Wide Web Ecologies" [Pit, PPR] with a wide variety of link-based analysis. Weiss et al.
discuss clustering methods that take the link structure into account [WVS+]. Spertus [Spe]
discusses information that can be obtained from the link structure for a variety of applications.
Good visualization demands added structure on the hypertext and is discussed in [MFH, MF].
Recently, Kleinberg [Kle] has developed an interesting model of the web as Hubs and Authorities,
based on an eigenvector calculation on the co-citation matrix of the web.
Finally, there has been some interest in what "quality" means on the net from a library
community [Til].
It is obvious to try to apply standard citation analysis techniques to the web's hypertextual
citation structure. One can simply think of every link as being like an academic citation. So,
a major page like http://www.yahoo.com will have tens of thousands of backlinks (or citations)
pointing to it.
The fact that the Yahoo home page has so many backlinks generally implies that it is quite
important. Indeed, many of the web search engines have used backlink count as a way to try to bias
their databases in favor of higher quality or more important pages. However, simple backlink counts
have a number of problems on the web. Some of these problems have to do with characteristics of
the web which are not present in normal academic citation databases.
`
`
`
`
`
`
2.2 Link Structure of the Web
`
While estimates vary, the current graph of the crawlable Web has roughly 150 million nodes
(pages) and 1.7 billion edges (links). Every page has some number of forward links (outedges) and
backlinks (inedges) (see Figure 1). We can never know whether we have found all the backlinks of
a particular page, but if we have downloaded it, we know all of its forward links at that time.
`
Figure 1: A and B are Backlinks of C
`
Web pages vary greatly in terms of the number of backlinks they have. For example, the
Netscape home page has 62,804 backlinks in our current database, compared to most pages, which
have just a few backlinks. Generally, highly linked pages are more "important" than pages with
few links. Simple citation counting has been used to speculate on the future winners of the Nobel
Prize [San]. PageRank provides a more sophisticated method for doing citation counting.
The reason that PageRank is interesting is that there are many cases where simple citation
counting does not correspond to our common sense notion of importance. For example, if a web
page has a link off the Yahoo home page, it may be just one link but it is a very important one.
This page should be ranked higher than many pages with more links but from obscure places.
PageRank is an attempt to see how good an approximation to "importance" can be obtained just
from the link structure.
`
2.3 Propagation of Ranking Through Links
`
`Based on the discussion above, we give the following intuitive description of PageRank: a page has
`high rank if the sum of the ranks of its backlinks is high. This covers both the case when a page
`has many backlinks and when a page has a few highly ranked backlinks.
`
2.4 Definition of PageRank
`
Let $u$ be a web page. Then let $F_u$ be the set of pages $u$ points to and $B_u$ be the set of pages that
point to $u$. Let $N_u = |F_u|$ be the number of links from $u$ and let $c$ be a factor used for normalization
(so that the total rank of all web pages is constant).
We begin by defining a simple ranking, $R$, which is a slightly simplified version of PageRank:

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
`
`
`
`
`
`100
`
`9
`
`53
`
`50
`
`50
`
`3
`
`50
`
`3
`
`3
`
`Figure : Simpli ed PageRank Calculation
`
This formalizes the intuition in the previous section. Note that the rank of a page is divided
among its forward links evenly to contribute to the ranks of the pages they point to. Note that
$c < 1$ because there are a number of pages with no forward links and their weight is lost from the
system (see Section 2.7). The equation is recursive but it may be computed by starting with any set
of ranks and iterating the computation until it converges. Figure 2 demonstrates the propagation
of rank from one pair of pages to another. Figure 3 shows a consistent steady state solution for a
set of pages.
Stated another way, let $A$ be a square matrix with the rows and columns corresponding to web
pages. Let $A_{u,v} = 1/N_u$ if there is an edge from $u$ to $v$ and $A_{u,v} = 0$ if not. If we treat $R$ as a
vector over web pages, then we have $R = cAR$. So $R$ is an eigenvector of $A$ with eigenvalue $c$. In
fact, we want the dominant eigenvector of $A$. It may be computed by repeatedly applying $A$ to any
nondegenerate start vector.
`There is a small problem with this simpli ed ranking function. Consider two web pages that
`point to each other but to no other page. And suppose there is some web page which points to
`one of them. Then, during iteration, this loop will accumulate rank but never distribute any rank
`since there are no outedges. The loop forms a sort of trap which we call a rank sink.
`To overcome this problem of rank sinks, we introduce a rank source:
`
Definition 1: Let $E(u)$ be some vector over the Web pages that corresponds to a source of rank.
Then, the PageRank of a set of Web pages is an assignment, $R'$, to the Web pages which satisfies

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + cE(u)$$

such that $c$ is maximized and $\|R'\|_1 = 1$ ($\|R'\|_1$ denotes the $L_1$ norm of $R'$).
`
`
`
`
`
`
Figure 3: Simplified PageRank Calculation (a steady state in which A = 0.4, B = 0.2, and C = 0.4)
`
`¥ ¥ ¥
`
`Figure : Loop Which Acts as a Rank Sink
`
where $E(u)$ is some vector over the web pages that corresponds to a source of rank (see Section 6).
Note that if $E$ is all positive, $c$ must be reduced to balance the equation. Therefore, this
technique corresponds to a decay factor. In matrix notation we have $R' = c(AR' + E)$. Since
$\|R'\|_1 = 1$, we can rewrite this as $R' = c(A + E \times \mathbf{1})R'$ where $\mathbf{1}$ is the vector
consisting of all ones. So, $R'$ is an eigenvector of $(A + E \times \mathbf{1})$.
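To see the rank source at work, consider a sketch in which two pages form exactly the kind of loop described above. The tiny graph and the per-page source weight of 0.05 are illustrative assumptions of ours; with the source set to zero, the iteration reduces to the simplified ranking:

```python
# A points into a two-page loop (B <-> C) that has no other outedges: a rank
# sink. Under the simplified rule the loop absorbs all rank; a positive rank
# source E(u) leaks rank back out. Graph and E values are illustrative.
graph = {"A": ["B"], "B": ["C"], "C": ["B"]}
pages = list(graph)

def rank_with_source(e_weight, iterations=100):
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: e_weight for p in pages}        # the cE(u) source term
        for v, out in graph.items():
            for u in out:
                new[u] += rank[v] / len(out)
        total = sum(new.values())                 # renormalize so ||R'||_1 = 1
        rank = {p: r / total for p, r in new.items()}
    return rank

no_source = rank_with_source(0.0)    # rank sink: A's rank drains to zero
with_source = rank_with_source(0.05) # A keeps the rank the source gives it
```

With no source, page A ends up with zero rank even though it is the only entry point into the loop; a small positive source restores it to a sensible share.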
`
2.5 Random Surfer Model
`
The definition of PageRank above has another intuitive basis in random walks on graphs. The
simplified version corresponds to the standing probability distribution of a random walk on the
graph of the Web. Intuitively, this can be thought of as modeling the behavior of a "random
surfer". The "random surfer" simply keeps clicking on successive links at random. However, if a
real Web surfer ever gets into a small loop of web pages, it is unlikely that the surfer will continue
in the loop forever. Instead, the surfer will jump to some other page. The additional factor $E$ can
be viewed as a way of modeling this behavior: the surfer periodically "gets bored" and jumps to a
`
`
`
`
`
`
`random page chosen based on the distribution in E.
So far we have left $E$ as a user-defined parameter. In most tests we let $E$ be uniform over all
web pages, with $\|E\|_1 = 0.15$. However, in Section 6 we show how different values of $E$ can generate
"customized" page ranks.
`
2.6 Computing PageRank
`
The computation of PageRank is fairly straightforward if we ignore the issues of scale. Let $S$ be
almost any vector over Web pages (for example $E$). Then PageRank may be computed as follows:

    R_0 <- S
    loop:
        R_{i+1} <- A R_i
        d <- ||R_i||_1 - ||R_{i+1}||_1
        R_{i+1} <- R_{i+1} + d E
        delta <- ||R_{i+1} - R_i||_1
    while delta > epsilon
`
Note that the $d$ factor increases the rate of convergence and maintains $\|R\|_1$. An alternative
normalization is to multiply $R$ by the appropriate factor. The use of $d$ may have a small impact
on the influence of $E$.
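The loop above can be sketched in a few lines of plain Python. The four-page link set and the uniform $E$ are made-up illustrative inputs; the multiplication is written against the transpose of $A$ so that rank flows along forward links, matching the summation in the definition:

```python
# Sketch of the iteration: R_{i+1} <- A R_i, then add back the lost weight
# d = ||R_i||_1 - ||R_{i+1}||_1 through E. Page 3 is deliberately dangling so
# that d is nonzero. The links and E are illustrative assumptions.
n = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}    # page 3 has no forward links
A = [[0.0] * n for _ in range(n)]
for u, out in links.items():
    for v in out:
        A[u][v] = 1.0 / len(out)              # A[u][v] = 1/N_u on each edge u -> v

E = [1.0 / n] * n                             # uniform source, ||E||_1 = 1
R = E[:]                                      # start vector S
epsilon = 1e-10

while True:
    # rank flows along forward links (a transpose-style multiply)
    R_next = [sum(A[u][v] * R[u] for u in range(n)) for v in range(n)]
    d = sum(R) - sum(R_next)                  # weight lost at dangling pages
    R_next = [r + d * e for r, e in zip(R_next, E)]
    delta = sum(abs(a - b) for a, b in zip(R_next, R))
    R = R_next
    if delta <= epsilon:
        break
```

The $d$ correction keeps the total rank at one on every step, so no separate normalization pass is needed for this small example.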
`
2.7 Dangling Links
`
One issue with this model is dangling links. Dangling links are simply links that point to any page
with no outgoing links. They affect the model because it is not clear where their weight should
be distributed, and there are a large number of them. Often these dangling links are simply pages
that we have not downloaded yet, since it is hard to sample the entire web (in our 24 million pages
currently downloaded, we have 51 million URLs not downloaded yet, and hence dangling). Because
dangling links do not affect the ranking of any other page directly, we simply remove them from
the system until all the PageRanks are calculated. After all the PageRanks are calculated, they
can be added back in, without affecting things significantly. Notice that the normalization of the
other links on the same page as a link which was removed will change slightly, but this should not
have a large effect.
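The removal step can be sketched as an iterative strip: removing a dangling page can strand other pages, which is why a few passes are needed. The five-page link set is our own made-up illustration:

```python
# Iteratively strip dangling pages (no outgoing links) before the PageRank
# computation; removing one can create new dangling pages, so repeat until
# none remain. The link data is an illustrative assumption.
def strip_dangling(links):
    core = dict(links)
    removed = []
    while True:
        dangling = [p for p, out in core.items() if not out]
        if not dangling:
            return core, removed
        for p in dangling:
            removed.append(p)
            del core[p]
        # drop edges into removed pages; this may empty other pages' link lists
        core = {p: [q for q in out if q in core] for p, out in core.items()}

links = {
    "A": ["B", "D"],
    "B": ["C"],
    "C": ["A"],
    "D": ["E"],   # D becomes dangling once E is stripped
    "E": [],      # dangling from the start
}
core, removed = strip_dangling(links)
# PageRank would now run on core; the removed pages are added back afterwards
```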
`
`
`
3 Implementation
`
As part of the Stanford WebBase project [PB], we have built a complete crawling and indexing
system with a current repository of 24 million web pages. Any web crawler needs to keep a database
of URLs so it can discover all the URLs on the web. To implement PageRank, the web crawler
simply needs to build an index of links as it crawls. While a simple task, it is non-trivial because
of the huge volumes involved. For example, to index our current 24 million page database in about
five days, we need to process about 50 web pages per second. Since there are about 11 links on an
average page (depending on what you count as a link), we need to process about 550 links per
second. Also, our database of 24 million pages references over 75 million unique URLs, against
which each link must be compared.
`
`
`
`
`
`
Much time has been spent making the system resilient in the face of many deeply and intricately
flawed web artifacts. There exist infinitely large sites, pages, and even URLs. A large fraction of
web pages have incorrect HTML, making parser design difficult. Messy heuristics are used to help
the crawling process. For example, we do not crawl URLs with cgi-bin in them. Of course it
is impossible to get a correct sample of the "entire web" since it is always changing. Sites are
sometimes down, and some people decide to not allow their sites to be indexed. Despite all this, we
believe we have a reasonable representation of the actual link structure of the publicly accessible web.
`
3.1 PageRank Implementation
`
We convert each URL into a unique integer, and store each hyperlink in a database using the integer
IDs to identify pages. Details of our implementation are in [PB]. In general, we have implemented
PageRank in the following manner. First we sort the link structure by Parent ID. Then dangling
links are removed from the link database, for reasons discussed above (a few iterations removes the
vast majority of the dangling links). We need to make an initial assignment of the ranks. This
assignment can be made by one of several strategies. If the computation is going to iterate until
convergence, in general the initial values will not affect the final values, just the rate of convergence.
But we can speed up convergence by choosing a good initial assignment. We believe that careful
choice of the initial assignment and a small finite number of iterations may result in excellent or
improved performance. Memory is allocated for the weights for every page. Since we use single
precision floating point values at 4 bytes each, this amounts to 300 megabytes for our 75 million
URLs.
If insufficient RAM is available to hold all the weights, multiple passes can be made (our
implementation uses half as much memory and two passes). The weights from the current time
step are kept in memory, and the previous weights are accessed linearly on disk. Also, all the
access to the link database, $A$, is linear because it is sorted. Therefore, $A$ can be kept on disk as
well. Although these data structures are very large, linear disk access allows each iteration to be
completed in about 6 minutes on a typical workstation. After the weights have converged, we add
the dangling links back in and recompute the rankings. Note that after adding the dangling links
back in, we need to iterate as many times as was required to remove the dangling links. Otherwise,
some of the dangling links will have a zero weight. This whole process takes about five hours in the
current implementation. With less strict convergence criteria, and more optimization, the
calculation could be much faster. Or, more efficient techniques for estimating eigenvectors could
be used to improve performance. However, it should be noted that the cost required to compute
the PageRank is insignificant compared to the cost required to build a full text index.
`
4 Convergence Properties
`
As can be seen from the graph in Figure 5, PageRank on a large 322 million link database converges
to a reasonable tolerance in roughly 52 iterations. The convergence on half the data takes roughly
45 iterations. This graph suggests that PageRank will scale very well even for extremely large
collections, as the scaling factor is roughly linear in log n.
One of the interesting ramifications of the fact that the PageRank calculation converges rapidly
is that the web is an expander-like graph. To understand this better, we give a brief overview of
the theory of random walks on graphs; refer to Motwani-Raghavan [MR] for further details. A
random walk on a graph is a stochastic process where at any given time step we are at a particular
node of the graph and choose an outedge uniformly at random to determine the node to visit at
the next time step. A graph is said to be an expander if it is the case that every (not too large)
subset of nodes S has a neighborhood (set of vertices accessible via outedges emanating from nodes
`
`
`
`
`
in S) that is larger than some factor α times |S|; here, α is called the expansion factor. It is the
case that a graph has a good expansion factor if and only if the largest eigenvalue is sufficiently
larger than the second-largest eigenvalue. A random walk on a graph is said to be rapidly-mixing
if it quickly (in time logarithmic in the size of the graph) converges to a limiting distribution on
the set of nodes in the graph. It is also the case that a random walk is rapidly-mixing on a graph
if and only if the graph is an expander or, equivalently, has an eigenvalue separation.
To relate all this to the PageRank computation, note that it is essentially the determination of
the limiting distribution of a random walk on the Web graph. The importance ranking of a node
is essentially the limiting probability that the random walk will be at that node after a sufficiently
large time. The fact that the PageRank computation terminates in logarithmic time is equivalent
to saying that the random walk is rapidly mixing, or that the underlying graph has a good
expansion factor. Expander graphs have many desirable properties that we may be able to exploit
in the future in computations involving the Web graph.
`
Figure 5: Rates of Convergence for Full Size and Half Size Link Databases (total difference from
the previous iteration, plotted on a logarithmic scale against the number of iterations, for the
322 million link and 161 million link databases)
`
5 Searching with PageRank
`
A major application of PageRank is searching. We have implemented two search engines which use
PageRank. The first one we will discuss is a simple title-based search engine. The second search
engine is a full text search engine called Google [BP]. Google utilizes a number of factors to rank
search results, including standard IR measures, proximity, anchor text (the text of links pointing
to web pages), and PageRank. While a comprehensive user study of the benefits of PageRank is
beyond the scope of this paper, we have performed some comparative experiments and provide some
sample results in this paper.
`
`
`
`
`
`
The benefits of PageRank are the greatest for underspecified queries. For example, a query
for "Stanford University" may return any number of web pages which mention Stanford (such as
publication lists) on a conventional search engine, but using PageRank, the university home page
is listed first.
`
5.1 Title Search
`
To test the usefulness of PageRank for search we implemented a search engine that used only the
titles of 16 million web pages. To answer a query, the search engine finds all the web pages whose
titles contain all of the query words. Then it sorts the results by PageRank. This search engine
is very simple and cheap to implement. In informal tests, it worked remarkably well. As can be
seen in Figure 6, a search for "University" yields a list of top universities. This figure shows our
MultiQuery system, which allows a user to query two search engines at the same time. The search
engine on the left is our PageRank-based title search engine. The bar graphs and percentages shown
are a log of the actual PageRank, with the top page normalized to 100, not a percentile, which is
used everywhere else in this paper. The search engine on the right is Altavista. You can see that
Altavista returns random looking web pages that match the query "University" and are the root
page of the server (Altavista seems to be using URL length as a quality heuristic).
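The scheme is simple enough to sketch in full: AND-match the query words against titles, then order the matches purely by PageRank. The titles and PageRank values below are made-up illustrative data, not figures from our database:

```python
# Title search sketch: a page matches if its title contains all query words;
# matches are then sorted by PageRank alone. All data here is illustrative.
pages = [
    {"title": "Stanford University Homepage", "pagerank": 0.91},
    {"title": "A University Student Blog",    "pagerank": 0.02},
    {"title": "University of Michigan",       "pagerank": 0.77},
    {"title": "Stanford Linear Accelerator",  "pagerank": 0.40},
]

def title_search(query, pages):
    words = set(query.lower().split())
    hits = [p for p in pages
            if words <= set(p["title"].lower().split())]  # all words in title
    return sorted(hits, key=lambda p: p["pagerank"], reverse=True)

results = title_search("university", pages)
# highest-PageRank matching titles come first
```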
`
Figure 6: Comparison of Query for "University"
`
`
`
`
`
`
Web Page
Download Netscape Software
http://www.w3.org
Welcome to Netscape
Point: It's What You're Searching For
Web-Counter Home Page
The Blue Ribbon Campaign for Online Free Speech
CERN Welcome
Yahoo!
Welcome to Netscape
Wusage: A Usage Statistics System For Web Servers
The World Wide Web Consortium (W3C)
Lycos, Inc. Home Page
Starting Point
Welcome to Magellan!
Oracle Corporation

Table 1: Top Page Ranks: July 1996
`
5.2 Rank Merging
`
The reason that the title-based PageRank system works so well is that the title match ensures
high precision, and the PageRank ensures high quality. When matching a query like "University"
on the web, recall is not very important because there are far more results than a user can look at.
For more specific searches where recall is more important, the traditional information retrieval
scores over full-text and the PageRank should be combined. Our Google system does this type of
rank merging. Rank merging is known to be a very difficult problem, and we need to spend
considerable additional effort before we will be able to do a reasonable evaluation of these types of
queries. However, we do believe that using PageRank as a factor in these queries is quite beneficial.
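One hedged sketch of such a merge (the log-space blend and its weight are our own illustrative choice; this paper does not specify Google's actual combination function):

```python
import math

# Blend a conventional IR relevance score with PageRank in log space, so a
# huge advantage on one axis cannot completely drown out the other. The
# mixing weight and sample scores are illustrative assumptions.
def merged_score(ir_score, pagerank, weight=0.5):
    return weight * math.log(ir_score) + (1 - weight) * math.log(pagerank)

docs = [
    ("exact match on an obscure page", 0.9, 1e-6),  # (label, IR score, PageRank)
    ("partial match on a major page",  0.5, 1e-2),
]
ranked = sorted(docs, key=lambda d: merged_score(d[1], d[2]), reverse=True)
```

Under this particular blend, a moderately relevant page of high importance outranks a perfectly matching but obscure one, which is the behavior rank merging is after.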
`
5.3 Some Sample Results
`
We have experimented considerably with Google, a full-text search engine which uses PageRank.
While a full-scale user study is beyond the scope of this paper, we provide a sample query in
Appendix A. For more queries, we encourage the reader to test Google themselves [BP].
Table 1 shows the top pages based on PageRank. This particular listing was generated in
July 1996. In a more recent calculation of PageRank, Microsoft has just edged out Netscape for
the highest PageRank.
`
5.4 Common Case
`
One of the design goals of PageRank was to handle the common case for queries well. For example,
a user searched for "wolverine", remembering that the University of Michigan system used for all
administrative functions by students was called something with a wolverine in it. Our PageRank-
based title search system returned the answer "Wolverine Access" as the first result. This is sensible
since all the students regularly use the Wolverine Access system, and a random user is quite likely
to be looking for it given the query "wolverine". The fact that the Wolverine Access site is a good
common case is not contained in the HTML of the page. Even if there were a way of defining good
`
`
`
`
`
`meta-information of this form within a page, it would be problematic since a page author could
`not be trusted with this kind of evaluation. Many web page authors would simply claim that their
`pages were all the best and most used on the web.
It is important to note that the goal of finding a site that contains a great deal of information
about wolverines is a very different task than finding the common case wolverine site. There is an
interesting system [Mar] that attempts to find sites that discuss a topic in detail by propagating
the textual matching score through the link structure of the web. It then tries to return the page
on the most central path. This results in good results for queries like "flower"; the system will
return good navigation pages from sites that deal with the topic of flowers in detail. Contrast that
with the common case approach which might simply return a commonly used commercial site that
had little information except how to buy flowers. It is our opinion that both of these tasks are
important, and a general purpose web search engine should return results which fulfill the needs
of both of these tasks automatically. In this paper, we are concentrating only on the common case
approach.
`
5.5 Subcomponents of Common Case
`
It is instructive to consider what kind of common case scenarios PageRank can help represent.
Besides a page which has high usage, like the Wolverine Access site, PageRank can also represent
a collaborative notion of authority or trust. For example, a user might prefer a news story simply
because it is linked directly from the New York Times home page. Of course such a story
will receive quite a high PageRank simply because it is mentioned by a very important page. This
seems to capture a kind of collaborative trust, since if a page was mentioned by a trustworthy
or authoritative source, it is more likely to be trustworthy or authoritative. Similarly, quality or
importance seems to fit within this kind of circular definition.
`
6 Personalized PageRank
`
An important component of the PageRank calculation is $E$, a vector over the Web pages which
is used as a source of rank to make up for the rank sinks such as cycles with no outedges (see
Section 2.4). However, aside from solving the problem of rank sinks, $E$ turns out to be a powerful
parameter to adjust the page ranks. Intuitively the $E$ vector corresponds to the distribution of web
pages that a random surfer periodically jumps to. As we see below, it can be used to give broad
general views of the Web or views which are focussed and personalized to a particular individual.
We have performed most experiments with an $E$ vector that is uniform over all web pages with
$\|E\|_1 = 0.15$. This corresponds to a random surfer periodically jumping to a random web page.
This is a very democratic choice for $E$ since all web pages are valued simply because they exist.
Although this technique has been quite successful, there is an important problem with it. Some
Web pages with many related links receive an overly high ranking. Examples of these include
copyright warnings, disclaimers, and highly interlinked mailing list archives.
Another extreme is to have $E$ consist entirely of a single web page. We tested two such $E$'s:
the Netscape home page and the home page of a famous computer scientist, John McCarthy. For
the Netscape home page, we attempt to generate page ranks from the perspective of a novice user
who has Netscape set as the default home page. In the case of John McCarthy's home page, we
want to calculate page ranks from the perspective of an individual who has given us considerable
contextual information based on the links on his home page.
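The difference between the two extremes is just the shape of $E$. As a sketch (the four-page graph, page names, and link structure are made up; only the total weight $\|E\|_1 = 0.15$ follows the uniform-E experiments described above):

```python
# Personalized PageRank sketch: uniform E spreads the periodic jump over all
# pages; a single-page E concentrates it on one home page. The tiny graph is
# an illustrative assumption.
pages = ["netscape", "java_docs", "mccarthy", "lisp_history"]
links = {"netscape": ["java_docs"], "java_docs": ["netscape"],
         "mccarthy": ["lisp_history"], "lisp_history": ["mccarthy"]}
total_e = 0.15

uniform_E = {p: total_e / len(pages) for p in pages}
personal_E = {p: (total_e if p == "mccarthy" else 0.0) for p in pages}

def pagerank(E, iterations=200):
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = dict(E)                                  # rank source term
        for v, out in links.items():
            for u in out:
                new[u] += rank[v] / len(out)
        total = sum(new.values())
        rank = {p: r / total for p, r in new.items()}  # keep ||R'||_1 = 1
    return rank

broad = pagerank(uniform_E)
personal = pagerank(personal_E)
# the personalized view concentrates rank on McCarthy's page and its links
```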
`In both cases, the mailing list problem mentioned above did not occur. And, in both cases, the
`respective home page got the highest PageRank and was followed by its immediate links. From
`
`
`
`
`
`
Web Page Title | Netscape's View (PageRank Percentile) | John McCarthy's View (PageRank Percentile)
John McCarthy's Home Page
John Mitchell (Stanford CS Theory Group)
Venture Law (Local Startup Law Firm)
Stanford CS Home Page
University of Michigan AI Lab
University of Toronto CS Department
Stanford CS Theory Group
Leadershape Institute

Table 2: Page Ranks for Two Different Views: Netscape vs. John McCarthy
`
that point, the disparity decreased. In Table 2, we show the resulting page rank percentiles for
an assortment of different pages. Pages related to computer science have a higher McCarthy-rank
than Netscape-rank, and pages related to computer science at Stanford have a considerably higher
McCarthy-rank. For example, the Web page of another Stanford Computer Science Dept. faculty
member is more than six percentile points higher on the McCarthy-rank. Note that the page ranks
are displayed as percentiles. This has the effect of compressing large differences in PageRank at
the top of the range.
Such personalized page ranks may have a number of applications, including personal search
engines. These search engines could save users a great deal of trouble by efficiently guessing a
large part of their interests given simple input such as their bookmarks or home page. We show an
example of this in Appendix A with the "Mitchell" query. In this example, we demonstrate that
while there are many people on the web named Mitchell, the number one result is the home page
of a colleague of John McCarthy named John Mitchell.
`
6.1 Manipulation by Commercial Interests
`
These types of personalized PageRanks are virtually immune to manipulation by commercial
interests. For a page to get a high PageRank, it must convince an important page, or a lot of
non-important pages, to link to it. At worst, you can have manipulation in the form of buying
advertisements (links) on important sites. But this seems well under control since it costs money.
This immunity to manipulation is an extremely important property. This kind of commercial
manipulation is causing search engines a great deal of trouble, and making features that would be
great to have very difficult to implement. For example, fast updating of documents is a very
desirable feature, but it is abused by people who want to manipulate the results of the search engine.
A compromise between the two extremes of uniform $E$ and single-page $E$ is to let $E$ consist of
all the root level pages of all web servers. Notice this will allow some manipulation of PageRanks.
Someone who wished to manipulate this system could simply create a large number of root level
servers, all pointing at a particular site.
`
`
`
`
`
`
7 Applications
`
7.1 Estimating Web Traffic
`
Because PageRank roughly corresponds to a random web surfer (see Section 2.5), it is interesting
to see how PageRank corresponds to actual usage. We used the counts of web page accesses from
the NLANR [NLA] proxy cache and compared these to PageRank. The NLANR data was from
several national proxy caches over a period of several months and consisted of 2,441,788 unique
URLs, with the highest hit count going to Altavista with 638,657 hits. There were a few million pages in the
intersection of the cache data and our 75 million URL database. It is extremely difficult to compare
these datasets analytically, for a number of different reasons. Many of the URLs in the cache access
data are people reading their personal mail on free email services. Duplicate server names and page
names are a serious problem. Incompleteness and bias are a problem in both the PageRank data
and the usage data. However, we did see some interesting trends in the data. There seems to be a
high usage of pornographic sites in the cache data, but these sites generally had low PageRanks. We
believe this is because people do not want to link to pornographic sites from their own web pages.
Using this technique of looking for differences between PageRank and usage, it may be possible to
find things that people like to look at, but do not want to mention on their web pages. There are
some sites that have a very high usage, but low PageRank, such as netscape.yahoo.com. We believe
there is probably an important backlink which simply is omitted from our database (we only have
a partial link structure of the web). It may be possible to use us