throbber
EXHIBIT 2066
`Face book, Inc. eta/.
`v.
`Sojhvare Rights Archive, LLC
`CASE IPR2013-00480
`
`

`

`Copyright © 1999 by the ACM press, A Division of the Association for ComputiQ;l
`Machinary, Inc. (AC.M).
`
`Pearson Education Limited
`Edinburgh Gate
`Harlow
`Essex CM20 2JE
`England
`
`and Associated Companies throughout the World.
`Vistt us on the World W2de Web at
`http:/ j ww-w.pearsoneduc.com
`
`The rights of the authors of this Work have been asserted by them in accordance
`the Copyright, Designs and Patents Act 1988.
`
`All rights reserved. No part of this publication m ay be reproduced, Etored in a
`retrieval system, or transmitted in any form or by any means, electronic, m•JdJinn,lft.l
`photocopying, recording or otherwise, without either the prior written perm,i.i!;;kJJil'Ci(
`the publisher or a licence permitting restricted copying in the l:nited Kingdom
`by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P
`
`\Vhile the publisher has made every attempt to trace all copyright owners and
`permission to reproduce material, in a few cases this has proved impossible.
`Copyright holders of material which has not been acknowledged are encouraged m
`contact the publisher.
`
`:Viany of the designations used by manufacturers and sellers to distinguish their
`products are claimed as trademarks. Addison Wesley Longman Limited has m~th~
`every attempt to supply trade mark informa'tion about manufacturers and their
`products mentioned in thi~ book. A list of the trademark designations and their
`owners appears on page viii.
`
`Typeset in Computer Modern by 56
`Printed and bound in the United States of America
`
`First printed 1999
`
`ISBN 0-201-39829-X
`
`British Library Cataloguing-in-Publication Data
`A catalogue record for this book is available from the British Library
`
`Library of Congress Cataloguing-in-Publication Data
`Baeza-Yates, R. (Ricardo)
`Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
`p. em.
`Includes bibliographical references and index.
`ISBN 0-201-39829-X
`1. Information storage and retieval systerns. I. Ribeiro, Berthier de Araujo
`Neto, 1960- . ll.Title.
`Z667.B34 1999
`025.04--dc21
`
`109 8 76543
`04 03 02 01 00
`
`

`

`SEARCH ENGINES
`
`379
`
`"'AllaVltla '"""" 3,1SI,Sit Well pageo for 908, Refine ygur search
`
`1. Wtkomt It PCfrlo!lll USA Sl!On!t!pq EnqlDo Wtlz S!!t
`URl.; 'f/'#W ()('(f~ ftt'~,W.·~<~~ ! h trtt
`!.<141 n1001f..O <J ~f•~ ~~3 - t•"'l" " ze o.<S bytu -I'> fr~jli:Jh [~I
`
`2. StmfllllqEMiftt
`HomoflVP Oalamr.kf lYP ~sf Net Trade Center} Flir New>! Le«!!ng
`i"rmo. Bu..,... & f>W>t< Oal~ l No.., Media Dalabllooi Wotkl Ir!lde
`Promotion ..
`URL: too t hrm-Ol.iln5.:om ("lffft!Jrt:h.iel1de.t.rttn
`L>M IM'~Il<d 22 ~e.p ~;o; - p&Qe "'• 4~ - ., fnQ*on (~J
`
`J .~m!l!t
`We<e<>mt}- [Cor.toct}- [M>!>J - (s..tchj Statci'wlQ !;'rl(,llm ~ Here or• """" P<lll<
`cOfJl)lete. ~·Info- the ...
`UAl: ·.,.~r;~et IC'Ie: p h Ht';.>/~ . .earcr. ( f!ffii
`, . .,~1 tr>a~'flP.(od 2~~J!On ~9"7 ~ P&:}e ,:~le l 2tl - .t'• t'nl)ll~h j l'una1t]
`
`, . .tlu..w..,u
`'WtiCtMt !G ty~ {it'tr\'WW 1 ~ tyblwltl l tilt,.., i 1eWch
`I M$ 1 ~ SeMt.h Tlft:~lil Ftaq~:ly -• '-'11
`C..~tVOnt if'AOJ Surcn Co1~61 .,..,, t'C"~pau C~t~tol Stt'n
`S.vttl Ef141n• ~ttVltw tye.._,e, h wu t!'AI!pCn•r.t cr ;y. .. ·,.
`S:tllt'f'l
`,.,. hl!'pJ/wW.w'-.yc::tw.t;ccfCI/
`S4t lttVM kM~
`
`z.:t:l -.,..to""""""""" •~~
`
`---AU , \IOU ll~• t.tlffid tn. Compl~M$ SA. CotLI A!C' .. 0UtQI'I,
`
`$11 tort~l!i!nij1, ~ctiH'lg. M~l'Jljtt.i:~•. TatltlltU'l, l rt~Cl , l.tga
`!tn;. ~J~C~~plii~OO:Ic~r.'ll
`Stt lt!t.LJitl tglll'ta.llY'~ .
`
`.. ~~
`
`!:rpt CI ~tb Md IJRt AM f .. ~ tlutmtJ OpM Sf.t Od y
`Htwt GtMlt YUSu rm Pro.~ .... MI!fam~ W .. ~•tng Cltcl:l
`WO S.•cn EI'IIIH~'f!X Ntlttllr'lg Vu Wtt E•~eaiS•Wt: h
`f:t:Jtr \tyWI::Intt iH 11tt-~ f- U Wn ~ Ar'o1
`'""' nt1f.JIW!INI~~·an:tt tt;mt
`SMittulttbM~_
`
`Pnu S..C.h~W.nt ~tor:
`
`*tf*'il
`
`'"'~n~ .. nu.d~ll rol.lectJ•omeoltheiDCI'tUJdulaeueheftiintt availableon th•
`WWW Ombdom .,.. lk ftultof the maintain«. ~lion• fill t&!W""' ' "'"'<lcomt! s.,,.. inttt.olhiJ
`info-rmtUon towr.N are tvaJiab~only thtrugh rpotialb:.ed toftwue.
`~}j~'Oftl&.~~
`! prsj: tM;w; am 1'ttt'.., 1';£ t :e
`
`rn~:~~IM~t:m"iaation lnJttftft/iUid.- - wdmxl-oom iJ a brt.kthroof,h
`n~~:-viptioo I«VIteda~ to hdp fntts•Ju!t Ulen tOnVmlentJy •earch the World Wlde Web. webta:d.oom
`enhal'\ra theexktinsc:e.p~.bUi"e. of CSJrtent.Yertiomof Nebctpe Navlptor (l.Otnd higher}. ThiJ tttJe,ervb
`wu developed. wofferc:tfidentpoint.tnd clkk•ttar to tctrch enpna:, newtPWP' and thoonn.dJ ol
`httd-to-rc.<"h dltabii•CII. ~..com provldea.~
`~ f/WwwwUtuirMrV
`~;:,;_ ... .mdg.:;M:ptM21r;tJ, !li&JV¥
`
`1:~ .l'l:«.~i.mlll..AQJ.A1111.1'l. - Thoindutuy'a I...Sina•••r<h •oflwolr•producu •renow fr«!nbop;
`PlS., pqwatuJ JOitch mgfne•ttd producu, I 0::6lrtptnied by eompkto doo.untntltion,ue-*"tilal.t:l~for
`dawnto.d ftotu th\1 Wdul~ f,.. olcht~Oucl itwLAnd <h«k bod< lroqumll)'lor opdata on procluctll!ld
`..... ~tootforinp.
`~-pil..ceft'i
`
`F igur e 13.7 Output for the q11ery searching and TJeb and engine for the four main
`search engines; from top to bottom: Alta Vista, Hot Bot, NorthernLight , and Excite.
`
`

`

`380
`
`SEARCHING THE \VEB
`
`The nser cun also refine the query by constructing more complex queries based
`on t he previous answer.
`The Web pages retrieved by the search engine in response to a nser query
`aro ranked, usw1.lly using stntistics related to t he terms in the query. In some
`eases this may not have any meaning, because relevance is not. fully correlated
`with statistics abont t.errn occurrence 1.-1.-ithin the collection. Some search engines
`also t aking into accouut terms included in mctatags or the tit.le, or t he popnla.rit.y
`of a Web page to i1uprove t he ranking. This topic is covered next.
`
`13.4.4 Ranking
`
`::-.'lost sem·ch engi nes use variations of the 13ooleau or vect.or model (see Chap ter 2)
`to do ranking. As with searching, rcmking ha..-; t.o be performed withont uccessing
`t he text, jnst t he index. T here is not much public informn.t.ion abont the specific
`ranking ulgorithrns nsed by cmrent search engines. Further, it is difficnlt t.o
`compare fairly different search engines giwn their differences, and continuous
`improvements. Z..lore important. it is almost impossible to measm e recall, a..c:; the
`umnber of relev1rnt pages can be quite large for simple queries. Some inconclusive
`studies include (327, 498].
`Ynwono n.nd Lee [844] propose three ranking algorithms in addition to the
`classical t.f-idf st:heme (see Chapter 2). T hey tu·e called Boolen.u spread , vector
`spread , and most-cited. T he fi rst two are the normal ranking algoritluw; of the
`Boole;UJ and vector model ext ended to include pages pointed to by a pa.ge in
`the answer or pages that point to a page in the answer. The third, most-cited,
`is based only on the terms included iu pages having a link to t he pages in the
`answer. A comparison of t.hcse techniq1ws considering 56 queries ow r ll. collection
`of 2400 \Veb pages indicates that the vedor model yields a betttlr recall-precision
`cu rw , wit h an average preCision of 75%.
`Some of the new rankiJ1g nlgorithms also use hyperlink information. T his
`is an important difference between the Web and normal IR databases. T he
`rmmber of byperlinks that point to a page provides a mea..'mre of its popularity
`mHl quality. Also, many links in common between pages or pages reft>renced
`by the same pagl) often indicates a relationship bdween those pages. \Ve now
`preseut three examples of rauking techniques that exploit thl'i>e fact.s, but. they
`differ in that two of them depend on the query and the last does uot.
`T he first is WebQuery [148], which also allows vis11a.l browsing of Web
`pages. \\h•bQuery takes a oet of Web pages (for example, the answer to a. query)
`aud ranks them based on how connected each Web page is. Additio11ally, it
`extends the set by finding Web pages that arc highly connected to the origiunl
`set. A rl"lated approach is pn~sentect by Li [512j.
`A better id('<t is due to Kleinberg [444] :o.nd used in HITS (Hypertext In(cid:173)
`dnecd Topic ScFI.rch ). T his raukiug scheme depends on t he query a11d considers
`the sd of pages 8 that point to or are pointed by pages in the anSW('r. Pa.ge::;
`that have many links pointing to them iu S are called authorities (that is, they
`sho11ld have relevant content ). Page;> that have m:o.ny 011tgoing links are called
`lmbs (they should point to simila r cont.ent). A positive two-way feedback exists:
`
`

`

`better authority pages come from incoming edges from good hubs and better
`lmb pages come from outgoing edges to good authorities. Let H (p) and A(p) be
`the hub and authority value of page p. These values are defined such that the
`following equations are satisfied for all pages p:
`
`SEARCH ENGINES
`
`381
`
`H (p) =
`
`A (u )
`
`A(p) =
`
`H (v)
`
`v ~S: v- p
`
`where H(p) and A(p) for all pages are normalized (in the original paper, the sum
`of the squares of each measure is set to one). These values can be determined
`throngh an iterative algorithm, and they converge to the principal eigenvector
`of t he link matrix of S. In the case of the \Veb, to avoid an explosion of the size
`of S, a maximal number of pages pointing to the am>wer can be defined. This
`technique doe::; not work with non·existent, repeated, or automatically generated
`links. One solution is to weight each link based on the snrrounding content. A
`second problem is that the topic of the result can become diffused. For example,
`a particular query is enlarged by a more general topic that contains the original
`answer. One solution to this problem is to analyze the content of each page
`and assign a score to it, as in traditional IR ranking. The link weight and the
`page score can be included on the previous formula multiplying each term of the
`summation [154, 93, 153]. Experiments show that the recall and precision on
`the first ten answerH increases significantly [93]. The order of the links can also
`be used by dividing the links into subgroups and using the HITS algorithm on
`those subgroups instead of the original Web pages [153].
`The la.">t example is PageRank, which is part of the ranking algorithm used
`by Coogle [117]. PageRunk simulates a user navigating randomly in the \Veb
`who jumpH to a random page with probability q or follows a random hyperlink
`(on the current page) with probability 1- q. It is further assumed that this
`user never goes back to a preYiously visited page following an already travcrHed
`hyperlink backwards. This process can be modeled with a Markov chain, from
`where the stationary probability of being in each page can he computed. This
`value is then used as part of the ranking mechanism of Coogle. Let C(a) be the
`number of outgoing links of page a and suppose that page a is pointed to by
`pages Pl to Pn· Then, the PageRn.nk, PR(a) of a is defined as
`
`PR(a) ""'q + (1- q) L PR(pi)/C(pi)
`
`i= l
`
`where q must be set by the sy::;tem (a typical value is 0.15). Notice that the
`ranking (weight) of other pages is normalized by the number of links in the
`page. PageRank can be computed using an iterative algorithm, and corresponds
`to the principal eigenvector of the normalized link matrix of the \Veb (which
`is the transition matrix of the Markov chain). Crawling the \Veb using this
`ordering has been shown to be bett<'r than other crawling scheme::; [168] (see
`next section).
`
`

`

`382
`
`SEARCHING THE WEB
`
`T herefore, to help ranking algorithms, page designers should include infor(cid:173)
`mative titles, headings, and meta fields, as well as good links. However, keywords
`should not be repeatt->d as some search engines penalize repeating words (spam(cid:173)
`ming). lJsi ng full terms instead of indirect ways to refer to subjects should also
`be considered.
`
`13.4.5 Crawling the Web
`In this section we discuss how to crawl the Web, a.s there are several techniques.
`The simplest is to start with a set of URLs and from there extract other URLs
`whieh are followed recursively in a breadth-firs t or depth-first fashion. For that
`reason, search engines allow users to submit top Web sites that will be added to
`t he URL set. A variation is to start with a set of populars URLs, because we
`can expect that they have information frequently requested. Both cases work
`well for one crawler, but it is difficult to coordinate several crawlers to avoid
`visit ing t he same page more than once. Another technique is to partition the
`\Veb using country codes or Internet names, a nd assign one or more robots to
`each pa rtition, and explore each part it ion exhaustively.
`Considering how the Web is traverst->d, the index of a search engine can be
`thought of as analogous to the stru·s in a n sky. What 'we see has never existed,
`as the light has t raveled different distances to reach our eye. Similarly, Web
`pages referenced in au index were a lso explored at different dates and they may
`not exist any more. Nevertheless, when we retrieve a page, we obtain its actual
`couteut. How fresh are the \Veb pages referenced in an index? The pages will
`be from one day to two months old. For that reason, most search engines show
`in the a nswer the date when the page was indexed. The percentage of invalid
`links stored in search engines vary from 2 to 9%. User submitted pages are
`usually crawled after a few days or weeks. Starting there, some engines traverse
`the whole Web site. while others select j ust a sample of pages or pages up to
`a certain depth. Non-submitted pages will wait from weeks up to a couple of
`months to be detected. There are some engines that learn the change frequency
`of a page and visit it accordingly [175]. T hey may also crawl more frequently
`popular pages (for example, pages having many links pointing to them). Overall,
`the current fastest crawlers are able to traverse up to 10 million Web pages per
`day.
`
`The order in which the URLs are traversed is important. As already men(cid:173)
`tioned, the links in a Web page can be traversed breadth first or depth first.
`Using a breadth first policy, we first look at all the pages linked by the current
`page , and so on. This matches well Web sites that are structured by related
`topics. On the other hand, the coverage will be wide but shallow and a Web
`server can be bombarded with many rapid requests. In the depth first case, we
`follow the first link of a page and we do the same on that page until we cauuot go
`deeper, returning recursively. This provides a narrow but deep traversal. Only
`recently, some research on this problem has appeared [168], showing that good
`ordering schemes can m~Lke a difference if crawling better pages first (using t he
`PageR<tnk scheme mentioned a bove).
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket