Facebook, Inc. et al.
v.
Software Rights Archive, LLC
CASE IPR2013-00479



Copyright © 1999 by the ACM Press, A Division of the Association for Computing
Machinery, Inc. (ACM).

Pearson Education Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England

and Associated Companies throughout the World.
Visit us on the World Wide Web at
http://www.pearsoneduc.com

The rights of the authors of this Work have been asserted by them in accordance
with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording or otherwise, without either the prior written permission of
the publisher or a licence permitting restricted copying in the United Kingdom issued
by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P

While the publisher has made every attempt to trace all copyright owners and obtain
permission to reproduce material, in a few cases this has proved impossible.
Copyright holders of material which has not been acknowledged are encouraged to
contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Addison Wesley Longman Limited has made
every attempt to supply trade mark information about manufacturers and their
products mentioned in this book. A list of the trademark designations and their
owners appears on page viii.

Typeset in Computer Modern by 56
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-X
1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araujo
Neto, 1960- . II. Title.
Z667.B34 1999
025.04--dc21

10 9 8 7 6 5 4 3
04 03 02 01 00
`
`
`
SEARCH ENGINES

379

`"'AllaVltla '"""" 3,1SI,Sit Well pageo for 908, Refine ygur search
`
`1. Wtkomt It PCfrlo!lll USA Sl!On!t!pq EnqlDo Wtlz S!!t
`URl.; 'f/'#W ()('(f~ ftt'~,W.·~<~~ ! h trtt
`!.<141 n1001f..O <J ~f•~ ~~3 - t•"'l" " ze o.<S bytu -I'> fr~jli:Jh [~I
`
`2. StmfllllqEMiftt
`HomoflVP Oalamr.kf lYP ~sf Net Trade Center} Flir New>! Le«!!ng
`i"rmo. Bu..,... & f>W>t< Oal~ l No.., Media Dalabllooi Wotkl Ir!lde
`Promotion ..
`URL: too t hrm-Ol.iln5.:om ("lffft!Jrt:h.iel1de.t.rttn
`L>M IM'~Il<d 22 ~e.p ~;o; - p&Qe "'• 4~ - ., fnQ*on (~J
`
`J .~m!l!t
`We<e<>mt}- [Cor.toct}- [M>!>J - (s..tchj Statci'wlQ !;'rl(,llm ~ Here or• """" P<lll<
`cOfJl)lete. ~·Info- the ...
`UAl: ·.,.~r;~et IC'Ie: p h Ht';.>/~ . .earcr. ( f!ffii
`, . .,~1 tr>a~'flP.(od 2~~J!On ~9"7 ~ P&:}e ,:~le l 2tl - .t'• t'nl)ll~h j l'una1t]
`
`, . .tlu..w..,u
`'WtiCtMt !G ty~ {it'tr\'WW 1 ~ tyblwltl l tilt,.., i 1eWch
`I M$ 1 ~ SeMt.h Tlft:~lil Ftaq~:ly -• '-'11
`C..~tVOnt if'AOJ Surcn Co1~61 .,..,, t'C"~pau C~t~tol Stt'n
`S.vttl Ef141n• ~ttVltw tye.._,e, h wu t!'AI!pCn•r.t cr ;y. .. ·,.
`S:tllt'f'l
`,.,. hl!'pJ/wW.w'-.yc::tw.t;ccfCI/
`S4t lttVM kM~
`
`z.:t:l -.,..to""""""""" •~~
`
`---AU , \IOU ll~• t.tlffid tn. Compl~M$ SA. CotLI A!C' .. 0UtQI'I,
`
`$11 tort~l!i!nij1, ~ctiH'lg. M~l'Jljtt.i:~•. TatltlltU'l, l rt~Cl , l.tga
`!tn;. ~J~C~~plii~OO:Ic~r.'ll
`Stt lt!t.LJitl tglll'ta.llY'~ .
`
`.. ~~
`
`!:rpt CI ~tb Md IJRt AM f .. ~ tlutmtJ OpM Sf.t Od y
`Htwt GtMlt YUSu rm Pro.~ .... MI!fam~ W .. ~•tng Cltcl:l
`WO S.•cn EI'IIIH~'f!X Ntlttllr'lg Vu Wtt E•~eaiS•Wt: h
`f:t:Jtr \tyWI::Intt iH 11tt-~ f- U Wn ~ Ar'o1
`'""' nt1f.JIW!INI~~·an:tt tt;mt
`SMittulttbM~_
`
`Pnu S..C.h~W.nt ~tor:
`
`*tf*'il
`
`'"'~n~ .. nu.d~ll rol.lectJ•omeoltheiDCI'tUJdulaeueheftiintt availableon th•
`WWW Ombdom .,.. lk ftultof the maintain«. ~lion• fill t&!W""' ' "'"'<lcomt! s.,,.. inttt.olhiJ
`info-rmtUon towr.N are tvaJiab~only thtrugh rpotialb:.ed toftwue.
`~}j~'Oftl&.~~
`! prsj: tM;w; am 1'ttt'.., 1';£ t :e
`
`rn~:~~IM~t:m"iaation lnJttftft/iUid.- - wdmxl-oom iJ a brt.kthroof,h
`n~~:-viptioo I«VIteda~ to hdp fntts•Ju!t Ulen tOnVmlentJy •earch the World Wlde Web. webta:d.oom
`enhal'\ra theexktinsc:e.p~.bUi"e. of CSJrtent.Yertiomof Nebctpe Navlptor (l.Otnd higher}. ThiJ tttJe,ervb
`wu developed. wofferc:tfidentpoint.tnd clkk•ttar to tctrch enpna:, newtPWP' and thoonn.dJ ol
`httd-to-rc.<"h dltabii•CII. ~..com provldea.~
`~ f/WwwwUtuirMrV
`~;:,;_ ... .mdg.:;M:ptM21r;tJ, !li&JV¥
`
`1:~ .l'l:«.~i.mlll..AQJ.A1111.1'l. - Thoindutuy'a I...Sina•••r<h •oflwolr•producu •renow fr«!nbop;
`PlS., pqwatuJ JOitch mgfne•ttd producu, I 0::6lrtptnied by eompkto doo.untntltion,ue-*"tilal.t:l~for
`dawnto.d ftotu th\1 Wdul~ f,.. olcht~Oucl itwLAnd <h«k bod< lroqumll)'lor opdata on procluctll!ld
`..... ~tootforinp.
`~-pil..ceft'i
`
`F igur e 13.7 Output for the q11ery searching and TJeb and engine for the four main
`search engines; from top to bottom: Alta Vista, Hot Bot, NorthernLight , and Excite.
`
`
`
`
The user can also refine the query by constructing more complex queries based
on the previous answer.
The Web pages retrieved by the search engine in response to a user query
are ranked, usually using statistics related to the terms in the query. In some
cases this may not have any meaning, because relevance is not fully correlated
with statistics about term occurrence within the collection. Some search engines
also take into account terms included in metatags or the title, or the popularity
of a Web page, to improve the ranking. This topic is covered next.

13.4.4 Ranking

Most search engines use variations of the Boolean or vector model (see Chapter 2)
to do ranking. As with searching, ranking has to be performed without accessing
the text, just the index. There is not much public information about the specific
ranking algorithms used by current search engines. Further, it is difficult to
compare different search engines fairly, given their differences and continuous
improvements. More important, it is almost impossible to measure recall, as the
number of relevant pages can be quite large for simple queries. Some inconclusive
studies include [327, 498].
Yuwono and Lee [844] propose three ranking algorithms in addition to the
classical tf-idf scheme (see Chapter 2). They are called Boolean spread, vector
spread, and most-cited. The first two are the normal ranking algorithms of the
Boolean and vector model extended to include pages pointed to by a page in
the answer or pages that point to a page in the answer. The third, most-cited,
is based only on the terms included in pages having a link to the pages in the
answer. A comparison of these techniques considering 56 queries over a collection
of 2400 Web pages indicates that the vector model yields a better recall-precision
curve, with an average precision of 75%.
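The classical tf-idf weighting mentioned above can be sketched in a few lines. This is a toy illustration of the general scheme, not the ranking of any particular engine; the function name, tokenized corpus, and natural-log base are assumptions:

```python
import math

def tf_idf_scores(query_terms, docs):
    """docs: list of token lists; returns one tf-idf score per document."""
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query_terms:
            tf = doc.count(term)                      # term frequency in this document
            df = sum(1 for d in docs if term in d)    # number of documents containing the term
            if tf and df:
                score += tf * math.log(n / df)        # weight = tf * log(N / df)
        scores.append(score)
    return scores
```

A document matching rarer query terms more often thus scores higher, while terms that appear in every document contribute nothing (log 1 = 0).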
Some of the new ranking algorithms also use hyperlink information. This
is an important difference between the Web and normal IR databases. The
number of hyperlinks that point to a page provides a measure of its popularity
and quality. Also, many links in common between pages, or pages referenced
by the same page, often indicate a relationship between those pages. We now
present three examples of ranking techniques that exploit these facts; they
differ in that two of them depend on the query while the last does not.
The first is WebQuery [148], which also allows visual browsing of Web
pages. WebQuery takes a set of Web pages (for example, the answer to a query)
and ranks them based on how connected each Web page is. Additionally, it
extends the set by finding Web pages that are highly connected to the original
set. A related approach is presented by Li [512].
A better idea is due to Kleinberg [444] and used in HITS (Hypertext Induced
Topic Search). This ranking scheme depends on the query and considers
the set of pages S that point to or are pointed to by pages in the answer. Pages
that have many links pointing to them in S are called authorities (that is, they
should have relevant content). Pages that have many outgoing links are called
hubs (they should point to similar content). A positive two-way feedback exists:
better authority pages come from incoming edges from good hubs, and better
hub pages come from outgoing edges to good authorities. Let H(p) and A(p) be
the hub and authority values of page p. These values are defined such that the
following equations are satisfied for all pages p:
`

H(p) = \sum_{u \in S \,:\, p \to u} A(u), \qquad A(p) = \sum_{v \in S \,:\, v \to p} H(v)
`
where H(p) and A(p) for all pages are normalized (in the original paper, the sum
of the squares of each measure is set to one). These values can be determined
through an iterative algorithm, and they converge to the principal eigenvector
of the link matrix of S. In the case of the Web, to avoid an explosion of the size
of S, a maximal number of pages pointing to the answer can be defined. This
technique does not work with non-existent, repeated, or automatically generated
links. One solution is to weight each link based on the surrounding content. A
second problem is that the topic of the result can become diffused. For example,
a particular query is enlarged by a more general topic that contains the original
answer. One solution to this problem is to analyze the content of each page
and assign a score to it, as in traditional IR ranking. The link weight and the
page score can be included in the previous formula by multiplying each term of the
summation [154, 93, 153]. Experiments show that the recall and precision on
the first ten answers increase significantly [93]. The order of the links can also
be used, by dividing the links into subgroups and using the HITS algorithm on
those subgroups instead of on the original Web pages [153].
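The iterative computation of H(p) and A(p) can be sketched as follows. The adjacency-list representation, function name, and iteration count are illustrative assumptions; normalization follows the sum-of-squares convention from the original paper:

```python
def hits(links, num_iters=50):
    """links: dict mapping each page to the list of pages it points to (the set S)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(num_iters):
        # A(p) = sum of H(v) over pages v that point to p
        auth = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for p in targets:
                auth[p] += hub[v]
        # H(p) = sum of A(u) over pages u that p points to
        hub = {p: sum(auth[u] for u in links.get(p, ())) for p in pages}
        # normalize so the sum of squares of each measure is one
        for vec in (auth, hub):
            norm = sum(x * x for x in vec.values()) ** 0.5 or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth
```

On a tiny graph where pages a and b both link to c, the iteration identifies c as the authority and a and b as equally good hubs.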
`The la.">t example is PageRank, which is part of the ranking algorithm used
`by Coogle [117]. PageRunk simulates a user navigating randomly in the \Veb
`who jumpH to a random page with probability q or follows a random hyperlink
`(on the current page) with probability 1- q. It is further assumed that this
`user never goes back to a preYiously visited page following an already travcrHed
`hyperlink backwards. This process can be modeled with a Markov chain, from
`where the stationary probability of being in each page can he computed. This
`value is then used as part of the ranking mechanism of Coogle. Let C(a) be the
`number of outgoing links of page a and suppose that page a is pointed to by
`pages Pl to Pn· Then, the PageRn.nk, PR(a) of a is defined as
`
`PR(a) ""'q + (1- q) L PR(pi)/C(pi)
`
`i= l
`
where q must be set by the system (a typical value is 0.15). Notice that the
ranking (weight) of other pages is normalized by the number of links in the
page. PageRank can be computed using an iterative algorithm, and corresponds
to the principal eigenvector of the normalized link matrix of the Web (which
is the transition matrix of the Markov chain). Crawling the Web using this
ordering has been shown to be better than other crawling schemes [168] (see
next section).
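A minimal sketch of the iterative computation of the formula above, assuming an adjacency-list graph; the function name and iteration count are illustrative, and q = 0.15 is the typical value given in the text:

```python
def pagerank(links, q=0.15, num_iters=100):
    """links: dict mapping each page to the list of pages it points to."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    pr = {p: 1.0 for p in pages}
    for _ in range(num_iters):
        new = {p: q for p in pages}  # the random-jump term q
        for page, targets in links.items():
            c = len(targets)  # C(page): number of outgoing links
            for t in targets:
                # each page divides its current rank among its out-links
                new[t] += (1 - q) * pr[page] / c
        pr = new
    return pr
```

For a symmetric two-page cycle (a links to b and b to a), the fixed point of the formula is PR = q + (1 - q)·PR, i.e. both pages converge to rank 1.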
`
`
`
`
Therefore, to help ranking algorithms, page designers should include informative
titles, headings, and meta fields, as well as good links. However, keywords
should not be repeated, as some search engines penalize repeated words (spamming).
Using full terms instead of indirect ways to refer to subjects should also
be considered.
`
13.4.5 Crawling the Web

In this section we discuss how to crawl the Web, as there are several techniques.
The simplest is to start with a set of URLs and from there extract other URLs,
which are followed recursively in a breadth-first or depth-first fashion. For that
reason, search engines allow users to submit top Web sites that will be added to
the URL set. A variation is to start with a set of popular URLs, because we
can expect that they have information frequently requested. Both cases work
well for one crawler, but it is difficult to coordinate several crawlers to avoid
visiting the same page more than once. Another technique is to partition the
Web using country codes or Internet names, assign one or more robots to
each partition, and explore each partition exhaustively.
Considering how the Web is traversed, the index of a search engine can be
thought of as analogous to the stars in the sky. What we see has never existed,
as the light has traveled different distances to reach our eyes. Similarly, Web
pages referenced in an index were also explored at different dates and they may
not exist any more. Nevertheless, when we retrieve a page, we obtain its actual
content. How fresh are the Web pages referenced in an index? The pages will
be from one day to two months old. For that reason, most search engines show
in the answer the date when the page was indexed. The percentage of invalid
links stored in search engines varies from 2 to 9%. User-submitted pages are
usually crawled after a few days or weeks. Starting there, some engines traverse
the whole Web site, while others select just a sample of pages or pages up to
a certain depth. Non-submitted pages will wait from weeks up to a couple of
months to be detected. There are some engines that learn the change frequency
of a page and visit it accordingly [175]. They may also crawl popular pages
(for example, pages having many links pointing to them) more frequently. Overall,
the current fastest crawlers are able to traverse up to 10 million Web pages per
day.
`
The order in which the URLs are traversed is important. As already mentioned,
the links in a Web page can be traversed breadth first or depth first.
Using a breadth-first policy, we first look at all the pages linked by the current
page, and so on. This matches well Web sites that are structured by related
topics. On the other hand, the coverage will be wide but shallow, and a Web
server can be bombarded with many rapid requests. In the depth-first case, we
follow the first link of a page and do the same on that page until we cannot go
deeper, returning recursively. This provides a narrow but deep traversal. Only
recently has research on this problem appeared [168], showing that good
ordering schemes can make a difference if crawling better pages first (using the
PageRank scheme mentioned above).
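The breadth-first traversal with a visited set and a depth limit, as described above, might be sketched as follows; fetch_links is a hypothetical function standing in for fetching a page and extracting its URLs:

```python
from collections import deque

def crawl_bfs(seed_urls, fetch_links, max_depth=3):
    """Visit pages breadth first, never fetching the same URL twice."""
    visited = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # only follow links up to a certain depth
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Replacing the FIFO queue with a stack would give the depth-first variant; a priority queue keyed on a score such as PageRank would give the better ordering schemes of [168].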
`
`