throbber
Face book, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`

`

`CI\MBIUOC£ tJNIVE:RSITY PRESS
`Gmbridge, New York, Mdb()urne, Madrid. Cape: Town, Singapore, Sio Pau1o
`
`Cambridge University Press
`The Edinburgh Huildin~ Cambridge CB2 SRU, UK
`Published in the United States of America b)' Cambridge University Press, New Vorl::
`w·w·w.cambridgc.org
`Information on this tide: w-ww.ca.mbridge.org/97805218657 I 5
`
`@Cambridge Unive.rsiry Press 2008
`
`This publication is in copyright. Subj«t to statutory exception a.nd to the pro\'ision of
`relevant collective ljcensing agreements, no reproduction of an)' part may take place
`without the written JXrmissjon of Cambridge University Press.
`
`First published in print format 2008
`
`ISBN-IJ 978-G-511-41405-J
`
`dlool: (EBL)
`
`ISBN-IJ 978-0-521-86571·5
`
`hardb3ck
`
`Oa.mbridge University Press has no responsibility for the persistence ot accuracy of uris
`fOr external or third-party internet V.'tbsires referred to in this public uion, and does not
`guarantee tb:.at an)' content on such websites is, or will remain, aocur.ne or appropriate.
`
`

`

`79.2 Webchttracrm'slics
`
`391
`
`A.gute '19..5 Cloaking as used by spammers.
`
`property and their agents, therefore, have a strong incentive to croate web
`pages that rank high)y on th.is query. ln a search engine whac;e scoring was
`based on term frequendes, a web page with numerous repetitions of mauigolf
`Sf'AM real eslate wouJd rank highly. This led to the first generation of sp~tm, which.
`(in the context of web sean:h) is the manipulation of web page content for
`the purpose of appearing high up in search results fot selected keywords.
`To avoid irritating users with these repetitions, .sophi<;ticated spamrners re-(cid:173)
`sorted to s uch tricks as rendering these repeated tenns in the same color as
`the background. Despite these words being oon.sequently invisible to the hu+
`man user, a search engine indexer wouJd parse the i.nvi<;ible words out of
`the HTML representation of the web page and index these \~ords as being
`prosen t in the page+
`At its root, s pam s tems from the heterogeneity of moti\•es in conten.t ere-.
`ation on the Web. In particular, many web ron.tent creators have commercial
`motives and therofore stand to gain from manipulating search engine results+
`You might argue that this is no different from a company that uses large fonts
`to li'it its phone numbers in the yeUow pages; but this gene-rally casts the
`company more and is thus a fairer mechanism. A more apt analogy, perhaps,
`is the use of company names beginning with a long s tring of As to be lis ted
`earl)' in a yellow pages category. In fact, tlw yellow pages' mod~ of rom+
`panies paying for larse-r /darker fonts has been replicated in web search: ln
`man)' search engines, it is possible to pa)' to h.a\'e one's web page included
`PAID in the search engine's index - a model kno\\'ll as p41id i,Jdusjon. Different
`I NCLUSION search engines h.a\'e different policies on whether to aUow paid inclusion,
`and whether such a payment has any effect on ranking in search results.
`Sean:h engines soan became sophisticated enough in their spam detection
`to screen out a large number of repetitions of particular keywords. Spam +
`mers responded with a richer set of .spam techniques, the best known of
`which we now describe+ The first of these techniques is cla'lking, shown
`in Figure 19.5. Hero, the .spammer's web server returns different pages
`d epending on whether the ht tp request comes from a web search engine's
`crawler (the part of the seatch engine that gathers web pages, to be described
`in Chapter 20), or from a hwnan user's browser. The former causes the web
`page to be indexed by the search engine under misleading kep.,·ords. \\'hen
`the u:;e,r searches for these 1.:eywords and elects to view the page. he reoeives
`
`

`

`Web seardJ basics
`
`m
`a web page that has aJtogether different content than that indexed by the
`search engine. Such deception of search indexers is unknown in the tradi(cid:173)
`tional world of lR; it ste-ms from the fact that the relationship between page
`publishers and web search engines is not completely coUaborative.
`A doonrvy ptAgt contains text and metadata carefully chosen to rank ttigh.ly
`on sclected seat'"C'h ke)""'ords. \-\'hen a browser requests the doorway page. it
`is redirected to a page containing content of a more commercial nature. More
`complex spam.ming techniques in\'Oive manipulation of the metadata related
`to a page including (for reasons \~e will see in Chapter 21) the links in to a
`web page. Given thatspamrning is inherently an economically motivated ac-
`Sll.utCH thtity, there has sprung around it an industry of som:h engint oplimiurs, or
`nNCI.NII SEOs, to provide consultancy services for clients who seek to have their web
`om.wlZIIRS pages rank highly on selected keywords. Web search engines frown on this
`business of attempting to decipher and adapt to their proprietary ranking
`techniques and indeed announce policies on forms of SEO behavior they do
`not tolerate (and have been known to shut down seat"Ctl requests from cer·
`tain SEOs for violation of these). lne .. •itably. t~ parrying bet'Nt!efl s uch SEOs
`(who gradual!)' infer features of each web seardi engine's ranking methods)
`and the web search eng.ines (who adapt in response) is an unending strug·
`AOVIIRSMUAL gle; indeed, the research subarea of ~riRI infomwlion retrieml has sprung
`lN);()Jt.MATION up around this bat de. To combat spammers who manipulate the text o£ their
`ltlll1UII\'Al. web pages is the exploitation of the Unk structure o f the Web- a tedmique
`known as fi11k at~nfysis. The first web seat"Ctl engine known to apply link anal(cid:173)
`ysis on a large scale (to be detailed in Chapter 21) w-as Coogle. although aU
`web search engines currently make use of it (and correspondingly,spammers
`U.NK SPAM now invest coru.iderable effort in sub\'erting it - this i.:s known as link spmr).
`
`(
`
`Exercise 19.1 lf the number of pages with in-degree i isproportionalto 1/ i2.1,
`what is the probability that a randomly chosen web page has in-degree I?
`
`Exercise 19.2 lfthe numberof pages with in-degree i isproportionaJ to l/il.1,
`what is the average in-degree of a web page?
`
`Exercise 19.3 If the number of pages with in-degree i is proportional to l/i2.1,
`then as the largest in-degree goes to infinity, does the fraction of pages v.ith
`in-degree i grow, stay the same, or diminish? How would your answer
`change for values of the exponent other than 2.1?
`
`Exercise 19.4 The average in-degree o£ aU nodes in a snapshot of the web
`graph is 9. \Vhat can we say about the average out~egree of aU nodes in
`this snapshot?
`
`19.3 Advertising as th o oronomlc model
`
`Early in the history of the Web, companies used graphical banner advertise-(cid:173)
`ments on web pages at popular websites (news and entertainment sitessudt
`
`

`

`19.3 Adt'f!rtising as tJse tm'JOmic model
`
`39'3
`
`as MSN, America On.l.ine, Yahoo!, and CNN). The primary purpose of these
`advertisements was branding: to convey to the viewer a pnsitive feeling about
`the brand of the company placing the advertisement. Typically these adver·
`CPM tisements are priced on a cost per mil (CPM) basis the cnst to the company of
`having its banner advertisement displayed I.OOOtimes. Some websitesstruc.k
`contracts with their advertisers in whidl an advertisement was priced not by
`the number of times it is displayed (also known as inrpressions). but rather by
`the number of times it is dir.kru or1 by the user. lltis pricing model is known
`CPC as the cost per dick (CPC) model. In such cases, clicking on the advertisement
`leads the user to a web page set up by the advertiser, where the user is in·
`duced to make a purchase. Here, the goal of the advertisement is not so much
`brand promotion as to induce a transaction. This distinction betv.-een brand(cid:173)
`and transaction-oriented advertising was already w ideJy recog.nized in the
`context of cocwenrionaJ media such as broadcast and print. The interactiv·
`ity of the web allowed the CPC billing model- clicks could be metered and
`monitortod by the h'ebsite and biUed to the advertiser.
`The pioneer in this d irection was a company named Goto, whkh changed
`its name to Ch•erture be/ore eventual acquisition by Yahoo! Goto was not, ln
`the traditionaJ sense, a search engine; rather, for e\•e.ry query tmn q it ac(cid:173)
`cepted bids from companies w ho wanted their web page shown on the query
`q. In response to the query q. Goto wou1d return the pages of all advertis(cid:173)
`er.s who bid for q. ordered b)' their bids. Furthennoro, when the user dk:l:ed
`on one of the returned results, the c01ll'Sponding advertiser would make a
`payment to Goto (in the initiaJ implementation. this payment equaled the
`advertiser's bid for q ).
`Several aspects of Goto's model are worth highlighting. First, a user typ·
`ing the query q into Goro's search interface was actlvel)' expressing an in(cid:173)
`terest and intent related to the query q. For instance, a user typing gofl dubs
`is more likely to be imminently purchasing a set than one who is simply
`browsing news on golf. Second, C".oto only got compensated when a user
`actuaUy expressed interest in an advertisement -as evinced by the user click(cid:173)
`ing the ad,·ertisemmt. Taken tosether, these created a powerful mechanism
`b)' which to connect advertisers to consutnerS, quickly raising the annual
`revenues of Goto/Ch•ertute into hundreds of millions of dollars. This style
`SI'OSSOiti!O of search ensme came to be knOhll variously as spo11scred sat.rdJ or se.zrcf1
`SJt.AitCU adr:-ert isir1g.
`SUIK:U Given these t'\'lO kinds of search engines- the .. pure" search engines such
`AO'Vi:ln'tsi.NC as Coogle and Altavista versus the sponsored search engines - the logi(cid:173)
`cal next step was to combine them into a single user experience. Current
`searc:fl engines follow precisely this model: They provide pure search re-(cid:173)
`ALCOitiTHM IC suJts (generally known as algorithmic sazrr:/1 results) as the primary response
`SUitCU to a user's search, together with sponsored search results displayed sepa·
`rate!)' and d istincri\•dy to the right of the algorithmic results. This is shown
`in Figure 19..6. Retrieving sponsored search results and ranking them in re(cid:173)
`sponse to a query has now become considerabl)' more sophisticated than
`
`

`

`II ~ttr~~, a'M fttJmta.llt, apDtm'fN Mi'..l..,
`
`--...
`' ~~:=~~=il"~ ... ~~~-
`...... ____ r_..-r _ _ _..._ .. __
`•
`---.. -· ... ·--,...,.,~
`.. - -.. - -... M ..,UIII \ ' "' ~
`tAI.&J,j~ Oo!oi<J••<MV. - -.-,_..,..._ .. _ .... _
`---
`___ ,.,_._._,"
`.................
`
`f, ....... - -..... -~-- ... -"''""A)II,
`
`-~-·---·~·
`Figure 19.6 Search ad\'ertiS!ng t.~red by query keywords. Here the query A320 returns al·
`gorithmJc .sea!Ch results about the Airbus alraaft, toget:herwlt.h ad\·ettlsemeniS fo1 variOUs non·
`alraaft goods numbered A320 that advertisetS seek to martet to those quef)'lng on th1s query.
`The lad. of ad\'ettisements for the aircraft reOects the fact that fe-w marketers attempt to sell A320
`alrtraft on the web.
`
`Web seardJ basics
`
`ea.~~ -~VOuPIC•<rw
`
`. .,.
`
`~~~
`
`Sl!ARCH
`f.S'CI SI!
`MAJI.t:ll111o:<:
`
`O.I CK SPAM
`
`the simple GotG scheme; the process entails a blending of ideas &om lR and
`microeconomics, and is beyond the scope of this book. f"'f advert:isers, un(cid:173)
`derstanding how search engines do this ranking and how to allocate mar·
`keting campaign budg& to different keywords and to different sponsored
`search engines has become a profession known as smrr.h et~gi1tl' 1narkding
`(SEM).
`The inherently economic motives underlying sponsored search gi\·e rise
`to a ttempts by some participants to s ubvert the S)'Stem to their ad\•antage.
`1ltis can take man)' fonns, one of which is knov.'tl as c.lick sp.7m. There is
`rurrendy no ~.miversaUy aocepted definition of d ick s pam. It refers (as the
`name sussests) to cUcks on sponsored seardt results tha t are not from bona
`fid e search llsetS. For instance, a devious ado;ertiser may attempt to exhaust
`the advertising budget of a competitor by clicking repeated!)' (through the
`use of a robotic click generator) on that competitor's sponsored search ad(cid:173)
`vertisements. Sean:fl msme.s face the challenge of discerning which of the
`dicks they observe are part of a pattern of click spam, to avoid charging their
`advertiser clients for such d iclco.
`
`? Exercise 19.5 TheGoto method ranked advertisements matchlng a query by
`
`bid: the highest-bidding adve rtiser got the top pooition, the second-highest
`the next, and so on. What can go wrong with this when the highest(cid:173)
`bidding advertiser places an advertisement that is irrele\>an t to the que ry?
`Why might an advertiser with an irrelevant advertisement bid high in this
`mal\ller?
`
`Exercise 19.6 Suppose that, in addition to bids ... we had for each ad\•ettiser
`their dick..Jflrotlgh rnre. ~ ratio of the his torical number of times users
`click on their advertisement to the number of rimes the ad\>ertisement was
`shov.'tl. Suggest a modification of the Goto scheme that exploits this data
`to avoid the problem in Exerdse 19.5 above.
`
`

`

`19.4 Tht: scatd1 user experiena
`19.4 The search user experience
`
`395
`
`It is crudal that we understand the users of web search as weU. This is again
`a significant change from traditional JR. where users were typicaUy profes(cid:173)
`sionals with a t least some training in the art of phrasing querk.--s O\'er a welJ+
`authored collection wha;e style and structure they understood welt In con·
`trast~ web search users tend to not know (or care) about the heterogeneity of
`web content. the syntax of query languages, and the art of phrasing queries;
`indeed, a mainstream tool (as web search has come to become) should not
`place such onerous demands an billions of people. A range of studies has
`conduded thai the average number of ke)'\VOrds in a web search is some(cid:173)
`where betv.-een two and three. Syntax operators (BooJean connectives, w ild·
`cards, etc.) are seldom used, again a resuJt of the composition of the a ud i·
`ence- "'normaJ" people, not infonnatian scientists.
`It is dear that the more user traffic a web search engine can attract, the
`more tm•enue it stands to eam from sponsored searc:fl. How do search en+
`ginesdifferentiate themselves and grow their traffic? Here, Coogle identified
`two principles that helped it to grow a t the expense of its competitors: (i) A
`focus on rotevance, specifically precision rather than recall in the first few
`rosults; and (ii) a user experience that is lightweight, meaning that both the
`search query page and the search results page are uncluttered and almost
`entire1y textual, w ith vety few graphical elements. The effect of the first was
`simp!)' to save users time in locating the infonnation they sought. The effect
`of the second is to provide a user experience that is extremely responsive, or
`at any rate not botdenecked by the time to load the search query or results
`I"Se.
`
`19.4.1 User q11ery needs
`
`There appear to be three broad categories into which common web search.
`queries can be grouped: (i} informational, (ii) navigational, and (iii) ttansac·
`tiona!. \\\'!now explain these categories; it should be dear that some queries
`will fall in more than one of these categories, while others will faJJ outside
`them.
`Iufomsatitmal IJilOies seek general information on a broad topic, s uch as
`INfOilMAnoSAt.
`OUDits leukemia or Provence. There is typicaJly not a single web page that contains
`aU the information sought; indeed, user.; with informational queries typically
`b')' 10 assimilate information from multiple web pages.
`N.AVICATIONAL Navigational querit>sseek the website or home page of a single entity that the
`Qt.'ntll'S user has in mind, say Lufthansa airlines. In such cases, lhe user' s expectation
`is that the very first search result should be the home page of Lufthansa.
`The user i.s not interested in a p1elhora of documents containing the term
`lufthansa; for such a user. the best measure of user satisfaction is precision
`at I.
`
`

`

`Web st12rd1 basics
`
`Indexes
`Agure t 9.7 The vartous componentS of a "''t'b seatdt engtne.
`
`Ad indexes
`
`TRA.NSAcrtosAJ. A lrn1JS«tiO'It.If query is one that is a prelude to the user perfonning a ttans(cid:173)
`QU£10' action on the Web - such as purchasing a product, downloading a file, or
`making a reservation. ln s uch cases, the search engine should return results
`Usting services that provide form interfaces for such transactions.
`Disooming which of these categories a query falls into can be challeng·
`i:ng. The category not only governs the a lgorithmic sea.rch resul ts, but the
`suitability of the query for sponsored searcfl results (since the query may re-(cid:173)
`\'eal an intent to pu.rchaset For navigational queries, some have argued that
`the search engine should return only a single result or even the target web
`page directly. Neverthelt"$, web search engines have historirnJJy engaged in
`a battle of bragging rights over which one indexes more web pages. Does the
`user really care? Perhaps not, but the media does highlight estimates (often
`statistically indefensible) of the sizes of various searc:h engines. Users are i.n(cid:173)
`Ruena.>d by these reports and thus, search engines do have to pay attention
`to how their index sizes compare to competitors'. For informational (and to
`a lesser extent, transactional) queries, the user does c.t re about the compre-(cid:173)
`hensiveness of the search engine.
`Figure [9.7 shows a composite picture of a web search engine indudingthe
`crawler, as weU as both the web page and ad\•ertisement indexes. The portion
`ol the figure under the cur\'ed dashed line is internal to the search engine.
`
`19.5 Index size and estimation
`
`To a first approximation~ comprehensi veness grows w ith index size, al(cid:173)
`though it does matter which specific pag5 a search engine indexes - some
`
`

`

`21 Link analysis
`
`The analy.si<i of hyperlinks and the graph srructure of the Web has been in·
`strumental in the development of web search. In this chapter, we focus on
`the use of hyperlinks for ranking .,.,"t'b sean:h results. Such link analysis is
`o~ of many factors considered by web sean.il engines in oomputing a rom·
`posite score for a Wl'b page on any given query. \"le begin by nwietoJingsome
`basics of the Web as a graph in Section 21.1, then pl'lJOOE.'d to the technical
`development of the e lements of link analysis for ranking.
`Link analysis for web search has i:nteJiectuaJ anteceden ts in the field of cita·
`tion anatysis, aspects of which overlap with an aroa known as bibliometrics.
`These disciplines seek to quantify the tnfluena! o{ scholarly articles by ana4
`lyzing the pattern of citations among them. Much as citations represent the
`conferral of authority from a scholarly article to others, link analysis on the
`1.Veb trea.ts hyperlinks from a web page to another as a conferral of authority.
`dearly. not every citation or hyperlink implies such authority confNtal; for
`this reason, simply measuring the quality of a web page by the number of
`in-links (citations from other pages} is not robust enough. For instance, one
`may contrive to set up multiple web pages pointing to a target web page,
`\'.ith the intent of arrifidall)' boosting the latter's tally of in-links. l1Us phe-(cid:173)
`nomenon is referred to as link spam. Neverthcless, the phenomenon of cita·
`tion is prevalent and dependable enough. that it is feasible for web search
`engines to derive useful signals for ranking from more sophisticated link
`analysis. IJnk analysis also proves to be a useful indicator of what page(s}
`to craw) next while crawling the web; this is done by using link analysis to
`guide the priority assignment in the front queues of Chapter 20.
`Section 21.1 develops the basic ideas underlying the use of the web graph
`in l ink anal)'Sis. Sections 21.2and 2l.3 then d evelop two distinct methods for
`link analysts, PageRank and HITS.
`
`421
`
`

`

`422
`21.1 The Web as a graph
`
`link analysis
`
`Recall the notion of the web graph from Section 19.2.1 and particularly Fig·
`uro 19.2. Our study of 1ink analysis builds on two intuitions.
`
`I. The anchor text pointing to page 8 is a good description of page B.
`2. The h)1X!rlin.k from A to 8 represents an endorsement of page B, by the
`creator of page A. This is not always the ca~; for instance, many links
`among pages within a single website stem from the user of a common
`template. For instance, most corporate websites have a pointer from ev(cid:173)
`ery page to a page containing a c-opyright notice - this is dearly not an
`endorsement. r\ccordingly, implementations of link analysis a lgorithms
`typically discount such "internal"' Jinks.
`
`21.1.1 A11cltor text and tlte web graplt
`
`The follov.ing fragment of HTML oode from a web page shows a hyperlink
`pointing to the home page of the journal of the ACM:
`
`<a href .. •ttttp: //wv .a cra.org/jacn/")Journal of ::h.e ACI1.<ta>
`
`In this case, the link points to the page ... ~.,.., . a em. or{l/ j acr~/ and the anchor
`text is Joum.lll of tJo..e ACM. dearl)', in this example the anchor is descriptive of
`the ta.rset page. But then the target page (B = http : J J.,.·lhf. a em. or9/ j ac11/)
`itself contains the same description as well as considerable additional infor·
`mation on the journal. So wh..1t use is the anchor text?
`The Web is full of instances where the page B does not provide an ao:u·
`rate description of itself. In many cases, this is a matter of how the publish·
`ers of page B choose to present themselves; this is especially common with
`cotporate web pages~ where a web presence is a marketing statement. f<or
`example, a t the time of the writing of this book the home page of the IBM
`rotporation (ww . i bm. com) did not contain the tenn computer anywhere in
`its HfML code, despite the fact that IBM is \\'id~y ''iewed as the world's
`largest computer maker. Similarly. the HTML code for the home page of Ya·
`hoo! (vw.,.•.y ahoo . COlli) does not at this time oontain the word portal
`Thus, there is often a gap between the terms in a web page and how web
`users would describe that web page. Consequently, web sean:hers need not
`use the tenns in a page to query for il ln addition, many web pages are rich
`in graphics and images, and/ or embed their text in these images; in such
`cases, the HTML parsing perlonned when crawling \\ilJ not extract text that
`is useful for indexing these pages. The .. standard IR" approach to this would
`be to use the methods outlined in Chapter 9 and Section U.4. 1lte insight
`behind anchor text is that such methods can be supplanted by anchor text,
`thereby tapping the power of the community of web page authors.
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket