`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`
`
`CI\MBIUOC£ tJNIVE:RSITY PRESS
`Gmbridge, New York, Mdb()urne, Madrid. Cape: Town, Singapore, Sio Pau1o
`
`Cambridge University Press
`The Edinburgh Huildin~ Cambridge CB2 SRU, UK
`Published in the United States of America b)' Cambridge University Press, New Vorl::
`w·w·w.cambridgc.org
`Information on this tide: w-ww.ca.mbridge.org/97805218657 I 5
`
`@Cambridge Unive.rsiry Press 2008
`
`This publication is in copyright. Subj«t to statutory exception a.nd to the pro\'ision of
`relevant collective ljcensing agreements, no reproduction of an)' part may take place
`without the written JXrmissjon of Cambridge University Press.
`
`First published in print format 2008
`
`ISBN-IJ 978-G-511-41405-J
`
`dlool: (EBL)
`
`ISBN-IJ 978-0-521-86571·5
`
`hardb3ck
`
`Oa.mbridge University Press has no responsibility for the persistence ot accuracy of uris
`fOr external or third-party internet V.'tbsires referred to in this public uion, and does not
`guarantee tb:.at an)' content on such websites is, or will remain, aocur.ne or appropriate.
`
`
`
`79.2 Webchttracrm'slics
`
`391
`
`A.gute '19..5 Cloaking as used by spammers.
`
`property and their agents, therefore, have a strong incentive to croate web
`pages that rank high)y on th.is query. ln a search engine whac;e scoring was
`based on term frequendes, a web page with numerous repetitions of mauigolf
`Sf'AM real eslate wouJd rank highly. This led to the first generation of sp~tm, which.
`(in the context of web sean:h) is the manipulation of web page content for
`the purpose of appearing high up in search results fot selected keywords.
`To avoid irritating users with these repetitions, .sophi<;ticated spamrners re-(cid:173)
`sorted to s uch tricks as rendering these repeated tenns in the same color as
`the background. Despite these words being oon.sequently invisible to the hu+
`man user, a search engine indexer wouJd parse the i.nvi<;ible words out of
`the HTML representation of the web page and index these \~ords as being
`prosen t in the page+
`At its root, s pam s tems from the heterogeneity of moti\•es in conten.t ere-.
`ation on the Web. In particular, many web ron.tent creators have commercial
`motives and therofore stand to gain from manipulating search engine results+
`You might argue that this is no different from a company that uses large fonts
`to li'it its phone numbers in the yeUow pages; but this gene-rally casts the
`company more and is thus a fairer mechanism. A more apt analogy, perhaps,
`is the use of company names beginning with a long s tring of As to be lis ted
`earl)' in a yellow pages category. In fact, tlw yellow pages' mod~ of rom+
`panies paying for larse-r /darker fonts has been replicated in web search: ln
`man)' search engines, it is possible to pa)' to h.a\'e one's web page included
`PAID in the search engine's index - a model kno\\'ll as p41id i,Jdusjon. Different
`I NCLUSION search engines h.a\'e different policies on whether to aUow paid inclusion,
`and whether such a payment has any effect on ranking in search results.
`Sean:h engines soan became sophisticated enough in their spam detection
`to screen out a large number of repetitions of particular keywords. Spam +
`mers responded with a richer set of .spam techniques, the best known of
`which we now describe+ The first of these techniques is cla'lking, shown
`in Figure 19.5. Hero, the .spammer's web server returns different pages
`d epending on whether the ht tp request comes from a web search engine's
`crawler (the part of the seatch engine that gathers web pages, to be described
`in Chapter 20), or from a hwnan user's browser. The former causes the web
`page to be indexed by the search engine under misleading kep.,·ords. \\'hen
`the u:;e,r searches for these 1.:eywords and elects to view the page. he reoeives
`
`
`
`Web seardJ basics
`
`m
`a web page that has aJtogether different content than that indexed by the
`search engine. Such deception of search indexers is unknown in the tradi(cid:173)
`tional world of lR; it ste-ms from the fact that the relationship between page
`publishers and web search engines is not completely coUaborative.
`A doonrvy ptAgt contains text and metadata carefully chosen to rank ttigh.ly
`on sclected seat'"C'h ke)""'ords. \-\'hen a browser requests the doorway page. it
`is redirected to a page containing content of a more commercial nature. More
`complex spam.ming techniques in\'Oive manipulation of the metadata related
`to a page including (for reasons \~e will see in Chapter 21) the links in to a
`web page. Given thatspamrning is inherently an economically motivated ac-
`Sll.utCH thtity, there has sprung around it an industry of som:h engint oplimiurs, or
`nNCI.NII SEOs, to provide consultancy services for clients who seek to have their web
`om.wlZIIRS pages rank highly on selected keywords. Web search engines frown on this
`business of attempting to decipher and adapt to their proprietary ranking
`techniques and indeed announce policies on forms of SEO behavior they do
`not tolerate (and have been known to shut down seat"Ctl requests from cer·
`tain SEOs for violation of these). lne .. •itably. t~ parrying bet'Nt!efl s uch SEOs
`(who gradual!)' infer features of each web seardi engine's ranking methods)
`and the web search eng.ines (who adapt in response) is an unending strug·
`AOVIIRSMUAL gle; indeed, the research subarea of ~riRI infomwlion retrieml has sprung
`lN);()Jt.MATION up around this bat de. To combat spammers who manipulate the text o£ their
`ltlll1UII\'Al. web pages is the exploitation of the Unk structure o f the Web- a tedmique
`known as fi11k at~nfysis. The first web seat"Ctl engine known to apply link anal(cid:173)
`ysis on a large scale (to be detailed in Chapter 21) w-as Coogle. although aU
`web search engines currently make use of it (and correspondingly,spammers
`U.NK SPAM now invest coru.iderable effort in sub\'erting it - this i.:s known as link spmr).
`
`(
`
`Exercise 19.1 lf the number of pages with in-degree i isproportionalto 1/ i2.1,
`what is the probability that a randomly chosen web page has in-degree I?
`
`Exercise 19.2 lfthe numberof pages with in-degree i isproportionaJ to l/il.1,
`what is the average in-degree of a web page?
`
`Exercise 19.3 If the number of pages with in-degree i is proportional to l/i2.1,
`then as the largest in-degree goes to infinity, does the fraction of pages v.ith
`in-degree i grow, stay the same, or diminish? How would your answer
`change for values of the exponent other than 2.1?
`
`Exercise 19.4 The average in-degree o£ aU nodes in a snapshot of the web
`graph is 9. \Vhat can we say about the average out~egree of aU nodes in
`this snapshot?
`
`19.3 Advertising as th o oronomlc model
`
`Early in the history of the Web, companies used graphical banner advertise-(cid:173)
`ments on web pages at popular websites (news and entertainment sitessudt
`
`
`
`19.3 Adt'f!rtising as tJse tm'JOmic model
`
`39'3
`
`as MSN, America On.l.ine, Yahoo!, and CNN). The primary purpose of these
`advertisements was branding: to convey to the viewer a pnsitive feeling about
`the brand of the company placing the advertisement. Typically these adver·
`CPM tisements are priced on a cost per mil (CPM) basis the cnst to the company of
`having its banner advertisement displayed I.OOOtimes. Some websitesstruc.k
`contracts with their advertisers in whidl an advertisement was priced not by
`the number of times it is displayed (also known as inrpressions). but rather by
`the number of times it is dir.kru or1 by the user. lltis pricing model is known
`CPC as the cost per dick (CPC) model. In such cases, clicking on the advertisement
`leads the user to a web page set up by the advertiser, where the user is in·
`duced to make a purchase. Here, the goal of the advertisement is not so much
`brand promotion as to induce a transaction. This distinction betv.-een brand(cid:173)
`and transaction-oriented advertising was already w ideJy recog.nized in the
`context of cocwenrionaJ media such as broadcast and print. The interactiv·
`ity of the web allowed the CPC billing model- clicks could be metered and
`monitortod by the h'ebsite and biUed to the advertiser.
`The pioneer in this d irection was a company named Goto, whkh changed
`its name to Ch•erture be/ore eventual acquisition by Yahoo! Goto was not, ln
`the traditionaJ sense, a search engine; rather, for e\•e.ry query tmn q it ac(cid:173)
`cepted bids from companies w ho wanted their web page shown on the query
`q. In response to the query q. Goto wou1d return the pages of all advertis(cid:173)
`er.s who bid for q. ordered b)' their bids. Furthennoro, when the user dk:l:ed
`on one of the returned results, the c01ll'Sponding advertiser would make a
`payment to Goto (in the initiaJ implementation. this payment equaled the
`advertiser's bid for q ).
`Several aspects of Goto's model are worth highlighting. First, a user typ·
`ing the query q into Goro's search interface was actlvel)' expressing an in(cid:173)
`terest and intent related to the query q. For instance, a user typing gofl dubs
`is more likely to be imminently purchasing a set than one who is simply
`browsing news on golf. Second, C".oto only got compensated when a user
`actuaUy expressed interest in an advertisement -as evinced by the user click(cid:173)
`ing the ad,·ertisemmt. Taken tosether, these created a powerful mechanism
`b)' which to connect advertisers to consutnerS, quickly raising the annual
`revenues of Goto/Ch•ertute into hundreds of millions of dollars. This style
`SI'OSSOiti!O of search ensme came to be knOhll variously as spo11scred sat.rdJ or se.zrcf1
`SJt.AitCU adr:-ert isir1g.
`SUIK:U Given these t'\'lO kinds of search engines- the .. pure" search engines such
`AO'Vi:ln'tsi.NC as Coogle and Altavista versus the sponsored search engines - the logi(cid:173)
`cal next step was to combine them into a single user experience. Current
`searc:fl engines follow precisely this model: They provide pure search re-(cid:173)
`ALCOitiTHM IC suJts (generally known as algorithmic sazrr:/1 results) as the primary response
`SUitCU to a user's search, together with sponsored search results displayed sepa·
`rate!)' and d istincri\•dy to the right of the algorithmic results. This is shown
`in Figure 19..6. Retrieving sponsored search results and ranking them in re(cid:173)
`sponse to a query has now become considerabl)' more sophisticated than
`
`
`
`II ~ttr~~, a'M fttJmta.llt, apDtm'fN Mi'..l..,
`
`--...
`' ~~:=~~=il"~ ... ~~~-
`...... ____ r_..-r _ _ _..._ .. __
`•
`---.. -· ... ·--,...,.,~
`.. - -.. - -... M ..,UIII \ ' "' ~
`tAI.&J,j~ Oo!oi<J••<MV. - -.-,_..,..._ .. _ .... _
`---
`___ ,.,_._._,"
`.................
`
`f, ....... - -..... -~-- ... -"''""A)II,
`
`-~-·---·~·
`Figure 19.6 Search ad\'ertiS!ng t.~red by query keywords. Here the query A320 returns al·
`gorithmJc .sea!Ch results about the Airbus alraaft, toget:herwlt.h ad\·ettlsemeniS fo1 variOUs non·
`alraaft goods numbered A320 that advertisetS seek to martet to those quef)'lng on th1s query.
`The lad. of ad\'ettisements for the aircraft reOects the fact that fe-w marketers attempt to sell A320
`alrtraft on the web.
`
`Web seardJ basics
`
`ea.~~ -~VOuPIC•<rw
`
`. .,.
`
`~~~
`
`Sl!ARCH
`f.S'CI SI!
`MAJI.t:ll111o:<:
`
`O.I CK SPAM
`
`the simple GotG scheme; the process entails a blending of ideas &om lR and
`microeconomics, and is beyond the scope of this book. f"'f advert:isers, un(cid:173)
`derstanding how search engines do this ranking and how to allocate mar·
`keting campaign budg& to different keywords and to different sponsored
`search engines has become a profession known as smrr.h et~gi1tl' 1narkding
`(SEM).
`The inherently economic motives underlying sponsored search gi\·e rise
`to a ttempts by some participants to s ubvert the S)'Stem to their ad\•antage.
`1ltis can take man)' fonns, one of which is knov.'tl as c.lick sp.7m. There is
`rurrendy no ~.miversaUy aocepted definition of d ick s pam. It refers (as the
`name sussests) to cUcks on sponsored seardt results tha t are not from bona
`fid e search llsetS. For instance, a devious ado;ertiser may attempt to exhaust
`the advertising budget of a competitor by clicking repeated!)' (through the
`use of a robotic click generator) on that competitor's sponsored search ad(cid:173)
`vertisements. Sean:fl msme.s face the challenge of discerning which of the
`dicks they observe are part of a pattern of click spam, to avoid charging their
`advertiser clients for such d iclco.
`
`? Exercise 19.5 TheGoto method ranked advertisements matchlng a query by
`
`bid: the highest-bidding adve rtiser got the top pooition, the second-highest
`the next, and so on. What can go wrong with this when the highest(cid:173)
`bidding advertiser places an advertisement that is irrele\>an t to the que ry?
`Why might an advertiser with an irrelevant advertisement bid high in this
`mal\ller?
`
`Exercise 19.6 Suppose that, in addition to bids ... we had for each ad\•ettiser
`their dick..Jflrotlgh rnre. ~ ratio of the his torical number of times users
`click on their advertisement to the number of rimes the ad\>ertisement was
`shov.'tl. Suggest a modification of the Goto scheme that exploits this data
`to avoid the problem in Exerdse 19.5 above.
`
`
`
`19.4 Tht: scatd1 user experiena
`19.4 The search user experience
`
`395
`
`It is crudal that we understand the users of web search as weU. This is again
`a significant change from traditional JR. where users were typicaUy profes(cid:173)
`sionals with a t least some training in the art of phrasing querk.--s O\'er a welJ+
`authored collection wha;e style and structure they understood welt In con·
`trast~ web search users tend to not know (or care) about the heterogeneity of
`web content. the syntax of query languages, and the art of phrasing queries;
`indeed, a mainstream tool (as web search has come to become) should not
`place such onerous demands an billions of people. A range of studies has
`conduded thai the average number of ke)'\VOrds in a web search is some(cid:173)
`where betv.-een two and three. Syntax operators (BooJean connectives, w ild·
`cards, etc.) are seldom used, again a resuJt of the composition of the a ud i·
`ence- "'normaJ" people, not infonnatian scientists.
`It is dear that the more user traffic a web search engine can attract, the
`more tm•enue it stands to eam from sponsored searc:fl. How do search en+
`ginesdifferentiate themselves and grow their traffic? Here, Coogle identified
`two principles that helped it to grow a t the expense of its competitors: (i) A
`focus on rotevance, specifically precision rather than recall in the first few
`rosults; and (ii) a user experience that is lightweight, meaning that both the
`search query page and the search results page are uncluttered and almost
`entire1y textual, w ith vety few graphical elements. The effect of the first was
`simp!)' to save users time in locating the infonnation they sought. The effect
`of the second is to provide a user experience that is extremely responsive, or
`at any rate not botdenecked by the time to load the search query or results
`I"Se.
`
`19.4.1 User q11ery needs
`
`There appear to be three broad categories into which common web search.
`queries can be grouped: (i} informational, (ii) navigational, and (iii) ttansac·
`tiona!. \\\'!now explain these categories; it should be dear that some queries
`will fall in more than one of these categories, while others will faJJ outside
`them.
`Iufomsatitmal IJilOies seek general information on a broad topic, s uch as
`INfOilMAnoSAt.
`OUDits leukemia or Provence. There is typicaJly not a single web page that contains
`aU the information sought; indeed, user.; with informational queries typically
`b')' 10 assimilate information from multiple web pages.
`N.AVICATIONAL Navigational querit>sseek the website or home page of a single entity that the
`Qt.'ntll'S user has in mind, say Lufthansa airlines. In such cases, lhe user' s expectation
`is that the very first search result should be the home page of Lufthansa.
`The user i.s not interested in a p1elhora of documents containing the term
`lufthansa; for such a user. the best measure of user satisfaction is precision
`at I.
`
`
`
`Web st12rd1 basics
`
`Indexes
`Agure t 9.7 The vartous componentS of a "''t'b seatdt engtne.
`
`Ad indexes
`
`TRA.NSAcrtosAJ. A lrn1JS«tiO'It.If query is one that is a prelude to the user perfonning a ttans(cid:173)
`QU£10' action on the Web - such as purchasing a product, downloading a file, or
`making a reservation. ln s uch cases, the search engine should return results
`Usting services that provide form interfaces for such transactions.
`Disooming which of these categories a query falls into can be challeng·
`i:ng. The category not only governs the a lgorithmic sea.rch resul ts, but the
`suitability of the query for sponsored searcfl results (since the query may re-(cid:173)
`\'eal an intent to pu.rchaset For navigational queries, some have argued that
`the search engine should return only a single result or even the target web
`page directly. Neverthelt"$, web search engines have historirnJJy engaged in
`a battle of bragging rights over which one indexes more web pages. Does the
`user really care? Perhaps not, but the media does highlight estimates (often
`statistically indefensible) of the sizes of various searc:h engines. Users are i.n(cid:173)
`Ruena.>d by these reports and thus, search engines do have to pay attention
`to how their index sizes compare to competitors'. For informational (and to
`a lesser extent, transactional) queries, the user does c.t re about the compre-(cid:173)
`hensiveness of the search engine.
`Figure [9.7 shows a composite picture of a web search engine indudingthe
`crawler, as weU as both the web page and ad\•ertisement indexes. The portion
`ol the figure under the cur\'ed dashed line is internal to the search engine.
`
`19.5 Index size and estimation
`
`To a first approximation~ comprehensi veness grows w ith index size, al(cid:173)
`though it does matter which specific pag5 a search engine indexes - some
`
`
`
`21 Link analysis
`
`The analy.si<i of hyperlinks and the graph srructure of the Web has been in·
`strumental in the development of web search. In this chapter, we focus on
`the use of hyperlinks for ranking .,.,"t'b sean:h results. Such link analysis is
`o~ of many factors considered by web sean.il engines in oomputing a rom·
`posite score for a Wl'b page on any given query. \"le begin by nwietoJingsome
`basics of the Web as a graph in Section 21.1, then pl'lJOOE.'d to the technical
`development of the e lements of link analysis for ranking.
`Link analysis for web search has i:nteJiectuaJ anteceden ts in the field of cita·
`tion anatysis, aspects of which overlap with an aroa known as bibliometrics.
`These disciplines seek to quantify the tnfluena! o{ scholarly articles by ana4
`lyzing the pattern of citations among them. Much as citations represent the
`conferral of authority from a scholarly article to others, link analysis on the
`1.Veb trea.ts hyperlinks from a web page to another as a conferral of authority.
`dearly. not every citation or hyperlink implies such authority confNtal; for
`this reason, simply measuring the quality of a web page by the number of
`in-links (citations from other pages} is not robust enough. For instance, one
`may contrive to set up multiple web pages pointing to a target web page,
`\'.ith the intent of arrifidall)' boosting the latter's tally of in-links. l1Us phe-(cid:173)
`nomenon is referred to as link spam. Neverthcless, the phenomenon of cita·
`tion is prevalent and dependable enough. that it is feasible for web search
`engines to derive useful signals for ranking from more sophisticated link
`analysis. IJnk analysis also proves to be a useful indicator of what page(s}
`to craw) next while crawling the web; this is done by using link analysis to
`guide the priority assignment in the front queues of Chapter 20.
`Section 21.1 develops the basic ideas underlying the use of the web graph
`in l ink anal)'Sis. Sections 21.2and 2l.3 then d evelop two distinct methods for
`link analysts, PageRank and HITS.
`
`421
`
`
`
`422
`21.1 The Web as a graph
`
`link analysis
`
`Recall the notion of the web graph from Section 19.2.1 and particularly Fig·
`uro 19.2. Our study of 1ink analysis builds on two intuitions.
`
`I. The anchor text pointing to page 8 is a good description of page B.
`2. The h)1X!rlin.k from A to 8 represents an endorsement of page B, by the
`creator of page A. This is not always the ca~; for instance, many links
`among pages within a single website stem from the user of a common
`template. For instance, most corporate websites have a pointer from ev(cid:173)
`ery page to a page containing a c-opyright notice - this is dearly not an
`endorsement. r\ccordingly, implementations of link analysis a lgorithms
`typically discount such "internal"' Jinks.
`
`21.1.1 A11cltor text and tlte web graplt
`
`The follov.ing fragment of HTML oode from a web page shows a hyperlink
`pointing to the home page of the journal of the ACM:
`
`<a href .. •ttttp: //wv .a cra.org/jacn/")Journal of ::h.e ACI1.<ta>
`
`In this case, the link points to the page ... ~.,.., . a em. or{l/ j acr~/ and the anchor
`text is Joum.lll of tJo..e ACM. dearl)', in this example the anchor is descriptive of
`the ta.rset page. But then the target page (B = http : J J.,.·lhf. a em. or9/ j ac11/)
`itself contains the same description as well as considerable additional infor·
`mation on the journal. So wh..1t use is the anchor text?
`The Web is full of instances where the page B does not provide an ao:u·
`rate description of itself. In many cases, this is a matter of how the publish·
`ers of page B choose to present themselves; this is especially common with
`cotporate web pages~ where a web presence is a marketing statement. f<or
`example, a t the time of the writing of this book the home page of the IBM
`rotporation (ww . i bm. com) did not contain the tenn computer anywhere in
`its HfML code, despite the fact that IBM is \\'id~y ''iewed as the world's
`largest computer maker. Similarly. the HTML code for the home page of Ya·
`hoo! (vw.,.•.y ahoo . COlli) does not at this time oontain the word portal
`Thus, there is often a gap between the terms in a web page and how web
`users would describe that web page. Consequently, web sean:hers need not
`use the tenns in a page to query for il ln addition, many web pages are rich
`in graphics and images, and/ or embed their text in these images; in such
`cases, the HTML parsing perlonned when crawling \\ilJ not extract text that
`is useful for indexing these pages. The .. standard IR" approach to this would
`be to use the methods outlined in Chapter 9 and Section U.4. 1lte insight
`behind anchor text is that such methods can be supplanted by anchor text,
`thereby tapping the power of the community of web page authors.
`
`