Modern Information Retrieval
Ricardo Baeza-Yates
Berthier Ribeiro-Neto

ACM Press
New York

Addison-Wesley

Harlow, England · Reading, Massachusetts
Menlo Park, California · New York
Don Mills, Ontario · Amsterdam · Bonn
Sydney · Singapore · Tokyo · Madrid
San Juan · Milan · Mexico City · Seoul · Taipei
Copyright © 1999 by the ACM Press, a Division of the Association for Computing Machinery, Inc. (ACM).

Addison Wesley Longman Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England

and Associated Companies throughout the World.

The rights of the authors of this Work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.

While the publisher has made every attempt to trace all copyright owners and obtain permission to reproduce material, in a few cases this has proved impossible. Copyright holders of material which has not been acknowledged are encouraged to contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Addison Wesley Longman Limited has made every attempt to supply trade mark information about manufacturers and their products mentioned in this book. A list of the trademark designations and their owners appears on page vii.

Typeset in Computer Modern by 56
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-X
1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araujo Neto, 1960- . II. Title.
Z667.B34 1999
025.04-dc21
99-10033
CIP
t-dimensional vectorial space and standard linear algebra operations on vectors. For the classic probabilistic model, the framework is composed of sets, standard probability operations, and the Bayes' theorem.

In the remainder of this chapter, we discuss the various IR models shown in Figure 2.1. Throughout the discussion, we do not explicitly instantiate the components D, Q, F, and $R(q_i, d_j)$ of each model. Such components should be quite clear from the discussion and can be easily inferred.

2.5 Classic Information Retrieval

In this section we briefly present the three classic models in information retrieval, namely, the Boolean, the vector, and the probabilistic models.
2.5.1 Basic Concepts

The classic models in information retrieval consider that each document is described by a set of representative keywords called index terms. An index term is simply a (document) word whose semantics helps in remembering the document's main themes. Thus, index terms are used to index and summarize the document contents. In general, index terms are mainly nouns because nouns have meaning by themselves and thus, their semantics is easier to identify and to grasp. Adjectives, adverbs, and connectives are less useful as index terms because they work mainly as complements. However, it might be interesting to consider all the distinct words in a document collection as index terms. For instance, this approach is adopted by some Web search engines as discussed in Chapter 13 (in which case, the document logical view is full text). We postpone a discussion on the problem of how to generate index terms until Chapter 7, where the issue is covered in detail.

Given a set of index terms for a document, we notice that not all terms are equally useful for describing the document contents. In fact, there are index terms which are simply vaguer than others. Deciding on the importance of a term for summarizing the contents of a document is not a trivial issue. Despite this difficulty, there are properties of an index term which are easily measured and which are useful for evaluating the potential of a term as such. For instance, consider a collection with a hundred thousand documents. A word which appears in each of the one hundred thousand documents is completely useless as an index term because it does not tell us anything about which documents the user might be interested in. On the other hand, a word which appears in just five documents is quite useful because it narrows down considerably the space of documents which might be of interest to the user. Thus, it should be clear that distinct index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document.
Let $k_i$ be an index term, $d_j$ be a document, and $w_{i,j} \geq 0$ be a weight associated with the pair $(k_i, d_j)$. This weight quantifies the importance of the index term for describing the document semantic contents.

Definition  Let $t$ be the number of index terms in the system and $k_i$ be a generic index term. $K = \{k_1, \ldots, k_t\}$ is the set of all index terms. A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$. For an index term which does not appear in the document text, $w_{i,j} = 0$. With the document $d_j$ is associated an index term vector $\vec{d}_j$ represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. Further, let $g_i$ be a function that returns the weight associated with the index term $k_i$ in any $t$-dimensional vector (i.e., $g_i(\vec{d}_j) = w_{i,j}$).
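To make the notation concrete, the following is a minimal sketch in Python; the toy vocabulary and weights are ours, purely illustrative:

```python
# A document as a t-dimensional vector of term weights, with the
# accessor function g_i from the definition above (toy data).
index_terms = ["computer", "network", "protocol"]   # K = {k_1, ..., k_t}, t = 3

# d_j is represented by (w_1j, w_2j, ..., w_tj); a weight of 0 means
# the term does not appear in the document text.
d_j = [0.8, 0.5, 0.0]

def g(i, vector):
    """Return w_ij, the weight of index term k_i in a t-dimensional
    vector (i is 1-based, matching the book's notation)."""
    return vector[i - 1]

assert g(1, d_j) == 0.8   # weight of "computer" in d_j
assert g(3, d_j) == 0.0   # "protocol" is absent from d_j
```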
As we later discuss, the index term weights are usually assumed to be mutually independent. This means that knowing the weight $w_{i,j}$ associated with the pair $(k_i, d_j)$ tells us nothing about the weight $w_{i+1,j}$ associated with the pair $(k_{i+1}, d_j)$. This is clearly a simplification because occurrences of index terms in a document are not uncorrelated. Consider, for instance, that the terms computer and network are used to index a given document which covers the area of computer networks. Frequently, in this document, the appearance of one of these two words attracts the appearance of the other. Thus, these two words are correlated and their weights could reflect this correlation. While mutual independence seems to be a strong simplification, it does simplify the task of computing index term weights and allows for fast ranking computation. Furthermore, taking advantage of index term correlations for improving the final document ranking is not a simple task. In fact, none of the many approaches proposed in the past has clearly demonstrated that index term correlations are advantageous (for ranking purposes) with general collections. Therefore, unless clearly stated otherwise, we assume mutual independence among index terms. In Chapter 5 we discuss modern retrieval techniques which are based on term correlations and which have been tested successfully with particular collections. These successes seem to be slowly shifting the current understanding towards a more favorable view of the usefulness of term correlations for information retrieval systems.

The above definitions provide support for discussing the three classic information retrieval models, namely, the Boolean, the vector, and the probabilistic models, as we now do.
2.5.2 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of an IR system. Furthermore, the queries are specified as Boolean expressions which have precise semantics. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic systems.
Figure 2.3  The three conjunctive components for the query $[q = k_a \wedge (k_b \vee \neg k_c)]$.
Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or non-relevant) without any notion of a grading scale, which prevents good retrieval performance. Thus, the Boolean model is in reality much more a data (instead of information) retrieval model. Second, while Boolean expressions have precise semantics, frequently it is not simple to translate an information need into a Boolean expression. In fact, most users find it difficult and awkward to express their query requests in terms of Boolean expressions. The Boolean expressions actually formulated by users often are quite simple (see Chapter 10 for a more thorough discussion on this issue). Despite these drawbacks, the Boolean model is still the dominant model with commercial document database systems and provides a good starting point for those new to the field.

The Boolean model considers that index terms are present or absent in a document. As a result, the index term weights are assumed to be all binary, i.e., $w_{i,j} \in \{0,1\}$. A query $q$ is composed of index terms linked by three connectives: not, and, or. Thus, a query is essentially a conventional Boolean expression which can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form, DNF). For instance, the query $[q = k_a \wedge (k_b \vee \neg k_c)]$ can be written in disjunctive normal form as $[\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)]$, where each of the components is a binary weighted vector associated with the tuple $(k_a, k_b, k_c)$. These binary weighted vectors are called the conjunctive components of $\vec{q}_{dnf}$. Figure 2.3 illustrates the three conjunctive components for the query $q$.
Definition  For the Boolean model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$. A query $q$ is a conventional Boolean expression. Let $\vec{q}_{dnf}$ be the disjunctive normal form for the query $q$. Further, let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$. The similarity of a document $d_j$ to the query $q$ is defined as

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$
If $sim(d_j, q) = 1$ then the Boolean model predicts that the document $d_j$ is relevant to the query $q$ (it might not be). Otherwise, the prediction is that the document is not relevant.

The Boolean model predicts that each document is either relevant or non-relevant. There is no notion of a partial match to the query conditions. For instance, let $d_j$ be a document for which $\vec{d}_j = (0,1,0)$. Document $d_j$ includes the index term $k_b$ but is considered non-relevant to the query $[q = k_a \wedge (k_b \vee \neg k_c)]$.

The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The main disadvantages are that exact matching may lead to retrieval of too few or too many documents (see Chapter 10). Today, it is well known that index term weighting can lead to a substantial improvement in retrieval performance. Index term weighting brings us to the vector model.
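Before moving on, here is a minimal sketch in Python of Boolean retrieval via conjunctive components, using the running query $q = k_a \wedge (k_b \vee \neg k_c)$; the toy collection and helper names are ours, purely illustrative:

```python
from itertools import product

# Binary document vectors over the tuple (k_a, k_b, k_c); toy data.
docs = {
    "d1": (1, 1, 1),
    "d2": (0, 1, 0),   # contains k_b only: non-relevant to q (no partial match)
    "d3": (1, 0, 0),
}

def query(ka, kb, kc):
    """The running example q = k_a AND (k_b OR NOT k_c)."""
    return ka and (kb or not kc)

# Enumerate the conjunctive components of q_dnf: every binary
# assignment of (k_a, k_b, k_c) that satisfies the query expression.
q_dnf = [v for v in product((0, 1), repeat=3) if query(*v)]
assert sorted(q_dnf) == [(1, 0, 0), (1, 1, 0), (1, 1, 1)]

def sim(doc_vector):
    """sim(d_j, q) = 1 iff d_j matches some conjunctive component exactly."""
    return 1 if doc_vector in q_dnf else 0

print({name: sim(v) for name, v in docs.items()})
# {'d1': 1, 'd2': 0, 'd3': 1}
```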
2.5.3 Vector Model

The vector model [697, 695] recognizes that the use of binary weights is too limiting and proposes a framework in which partial matching is possible. This is accomplished by assigning non-binary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between each document stored in the system and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is a lot more precise (in the sense that it better matches the user information need) than the document answer set retrieved by the Boolean model.
Definition  For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ where $t$ is the total number of index terms in the system. As before, the vector for a document $d_j$ is represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
Therefore, a document $d_j$ and a user query $q$ are represented as $t$-dimensional vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the query $q$ as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$. This correlation can be quantified, for instance, by the cosine of the angle between these two vectors. That is,

$$sim(d_j, q) = \frac{\vec{d}_j \bullet \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}$$
Figure 2.4  The cosine of $\theta$ is adopted as $sim(d_j, q)$.
where $|\vec{d}_j|$ and $|\vec{q}|$ are the norms of the document and query vectors. The factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor $|\vec{d}_j|$ provides a normalization in the space of the documents.

Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, $sim(q, d_j)$ varies from 0 to +1. Thus, instead of attempting to predict whether a document is relevant or not, the vector model ranks the documents according to their degree of similarity to the query. A document might be retrieved even if it matches the query only partially. For instance, one can establish a threshold on $sim(d_j, q)$ and retrieve the documents with a degree of similarity above that threshold. But to compute rankings we need first to specify how index term weights are obtained.
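As an illustration, a minimal sketch of cosine ranking in plain Python; the weights below are hypothetical toy data:

```python
import math

def cosine(d, q):
    """sim(d_j, q): cosine of the angle between weight vectors d and q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Toy t = 3 collection; the weights are hypothetical, not from the book.
docs = {"d1": [0.5, 0.8, 0.0], "d2": [0.0, 0.2, 0.9], "d3": [0.4, 0.4, 0.4]}
q = [0.7, 0.6, 0.0]

# Rank documents in decreasing order of similarity to q; documents
# matching q only partially are still retrieved.
ranking = sorted(docs, key=lambda name: cosine(docs[name], q), reverse=True)
print(ranking)
```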
Index term weights can be calculated in many different ways. The work by Salton and McGill [698] reviews various term-weighting techniques. Here, we do not discuss them in detail. Instead, we concentrate on elucidating the main idea behind the most effective term-weighting techniques. This idea is related to the basic principles which support clustering techniques, as follows.
Given a collection C of objects and a vague description of a set A, the goal of a simple clustering algorithm might be to separate the collection C of objects into two sets: a first one which is composed of objects related to the set A and a second one which is composed of objects not related to the set A. Vague description here means that we do not have complete information for deciding precisely which objects are and which are not in the set A. For instance, one might be looking for a set A of cars which have a price comparable to that of a Lexus 400. Since it is not clear what the term comparable means exactly, there is not a precise (and unique) description of the set A. More sophisticated clustering algorithms might attempt to separate the objects of a collection into various clusters (or classes) according to their properties. For instance, patients of a doctor specializing in cancer could be classified into five classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the possible class descriptions might be imprecise (and not unique) and the problem is one of deciding to which of these classes a new patient should be assigned. In what follows, however, we only discuss the simpler version of the clustering problem (i.e., the one which considers only two classes) because all that is required is a decision on which documents are predicted to be relevant and which ones are predicted to be not relevant (with regard to a given user query).
To view the IR problem as one of clustering, we refer to the early work of Salton. We think of the documents as a collection C of objects and think of the user query as a (vague) specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem). In a clustering problem, two main issues have to be resolved. First, one needs to determine what are the features which better describe the objects in the set A. Second, one needs to determine what are the features which better distinguish the objects in the set A from the remaining objects in the collection C. The first set of features provides for quantification of intra-cluster similarity, while the second set of features provides for quantification of inter-cluster dissimilarity. The most successful clustering algorithms try to balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring the raw frequency of a term $k_i$ inside a document $d_j$. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents (i.e., intra-document characterization). Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term $k_i$ among the documents in the collection. This factor is usually referred to as the inverse document frequency or the idf factor. The motivation for usage of an idf factor is that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. As with good clustering algorithms, the most effective term-weighting schemes for IR try to balance these two effects.
Definition  Let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term $k_i$ is mentioned in the text of the document $d_j$). Then, the normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \tag{2.1}$$

where the maximum is computed over all terms which are mentioned in the text of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$ then $f_{i,j} = 0$. Further, let $idf_i$, the inverse document frequency for $k_i$, be given by

$$idf_i = \log \frac{N}{n_i} \tag{2.2}$$

The best known term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \tag{2.3}$$
or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.

Several variations of the above expression for the weight $w_{i,j}$ are described in an interesting paper by Salton and Buckley which appeared in 1988 [696]. However, in general, the above expression should provide a good weighting scheme for many collections.
For the query term weights, Salton and Buckley suggest

$$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i} \tag{2.4}$$

where $freq_{i,q}$ is the raw frequency of the term $k_i$ in the text of the information request $q$.
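To make equations (2.1)-(2.4) concrete, here is a minimal sketch in Python; the toy collection, naive whitespace tokenization, and helper names are ours, not the book's:

```python
import math
from collections import Counter

# Toy collection; in practice documents would be tokenized properly.
docs = [
    "computer network computer",
    "network protocol",
    "computer graphics",
]
N = len(docs)
tokenized = [d.split() for d in docs]

# n_i: the number of documents containing term k_i.
n = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Equation (2.2): idf_i = log(N / n_i)."""
    return math.log(N / n[term])

def tf_idf_weights(tokens):
    """Equations (2.1) and (2.3): w_ij = f_ij * log(N / n_i)."""
    freq = Counter(tokens)
    max_freq = max(freq.values())
    return {t: (f / max_freq) * idf(t) for t, f in freq.items()}

def query_weights(tokens):
    """Equation (2.4), Salton and Buckley's suggestion for query terms."""
    freq = Counter(t for t in tokens if t in n)   # ignore unseen terms
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * idf(t) for t, f in freq.items()}

print(tf_idf_weights(tokenized[0]))
print(query_weights("computer network".split()))
```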
The main advantages of the vector model are: (1) its term-weighting scheme improves retrieval performance; (2) its partial matching strategy allows retrieval of documents that approximate the query conditions; and (3) its cosine ranking formula sorts the documents according to their degree of similarity to the query. Theoretically, the vector model has the disadvantage that index terms are assumed to be mutually independent (equation 2.3 does not account for index term dependencies). However, in practice, consideration of term dependencies might be a disadvantage. Due to the locality of many term dependencies, their indiscriminate application to all the documents in the collection might in fact hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with general collections. It yields ranked answer sets which are difficult to improve upon without query expansion or relevance feedback (see Chapter 5) within the framework of the vector model. A large variety of alternative ranking methods have been compared to the vector model but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives. Furthermore, it is simple and fast. For these reasons, the vector model is a popular retrieval model nowadays.
2.5.4 Probabilistic Model

In this section, we describe the classic probabilistic model introduced in 1976 by Robertson and Sparck Jones [677] which later became known as the binary independence retrieval (BIR) model. Our discussion is intentionally brief and focuses mainly on highlighting the key features of the model. With this purpose in mind, we do not detain ourselves in subtleties regarding the binary independence assumption for the model. The section on bibliographic discussion points to references which cover these details.
The probabilistic model attempts to capture the IR problem within a probabilistic framework. The fundamental idea is as follows. Given a user query, there is a set of documents which contains exactly the relevant documents and
no other. Let us refer to this set of documents as the ideal answer set. Given the description of this ideal answer set, we would have no problems in retrieving its documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal answer set (which is analogous to interpreting the IR problem as a problem of clustering). The problem is that we do not know exactly what these properties are. All we know is that there are index terms whose semantics should be used to characterize these properties. Since these properties are not known at query time, an effort has to be made at initially guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of the ideal answer set which is used to retrieve a first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set. Such interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones are relevant and which ones are not (in truth, only the first top documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, it is expected that such a description will evolve and become closer to the real description of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning the description of the ideal answer set. Furthermore, a conscious effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.
Assumption (Probabilistic Principle)  Given a user query $q$ and a document $d_j$ in the collection, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query $q$. Such an ideal answer set is labeled $R$ and should maximize the overall probability of relevance to the user. Documents in the set $R$ are predicted to be relevant to the query. Documents not in this set are predicted to be non-relevant.
This assumption is quite troublesome because it does not state explicitly how to compute the probabilities of relevance. In fact, not even the sample space which is to be used for defining such probabilities is given.

Given a query $q$, the probabilistic model assigns to each document $d_j$, as a measure of its similarity to the query, the ratio $P(d_j\ \text{relevant-to}\ q)/P(d_j\ \text{non-relevant-to}\ q)$ which computes the odds of the document $d_j$ being relevant to the query $q$. Taking the odds of relevance as the rank minimizes the probability of an erroneous judgement [282, 785].
Definition  For the probabilistic model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$ and $w_{i,q} \in \{0,1\}$. A query $q$ is a subset of index terms. Let $R$ be the set of documents known (or initially guessed) to be relevant. Let $\bar{R}$ be the complement of $R$ (i.e., the set of non-relevant documents). Let $P(R|\vec{d}_j)$
be the probability that the document $d_j$ is relevant to the query $q$ and $P(\bar{R}|\vec{d}_j)$ be the probability that $d_j$ is non-relevant to $q$. The similarity $sim(d_j, q)$ of the document $d_j$ to the query $q$ is defined as the ratio

$$sim(d_j, q) = \frac{P(R|\vec{d}_j)}{P(\bar{R}|\vec{d}_j)}$$
Using Bayes' rule,

$$sim(d_j, q) = \frac{P(\vec{d}_j|R) \times P(R)}{P(\vec{d}_j|\bar{R}) \times P(\bar{R})}$$

$P(\vec{d}_j|R)$ stands for the probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents. Further, $P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant. The meanings attached to $P(\vec{d}_j|\bar{R})$ and $P(\bar{R})$ are analogous and complementary.

Since $P(R)$ and $P(\bar{R})$ are the same for all the documents in the collection, we write

$$sim(d_j, q) \sim \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}$$

Assuming independence of index terms,

$$sim(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i|R)\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|R)\right)}{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i|\bar{R})\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|\bar{R})\right)}$$

$P(k_i|R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set $R$. $P(\bar{k}_i|R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set $R$. The probabilities associated with the set $\bar{R}$ have meanings which are analogous to the ones just described.
Taking logarithms, recalling that $P(k_i|R) + P(\bar{k}_i|R) = 1$, and ignoring factors which are constant for all documents in the context of the same query, we can finally write

$$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$

which is a key expression for ranking computation in the probabilistic model.

Since we do not know the set $R$ at the beginning, it is necessary to devise a method for initially computing the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$. There are many alternatives for such computation. We discuss a couple of them below.
In the very beginning (i.e., immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions such as: (a) assume that $P(k_i|R)$ is constant for all index terms $k_i$ (typically, equal to 0.5) and (b) assume that the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions yield

$$P(k_i|R) = 0.5 \qquad\qquad P(k_i|\bar{R}) = \frac{n_i}{N}$$

where, as already defined, $n_i$ is the number of documents which contain the index term $k_i$ and $N$ is the total number of documents in the collection. Given this initial guess, we can then retrieve documents which contain query terms and provide an initial probabilistic ranking for them; the sketch below illustrates this initial step in code. After that, this initial ranking is improved as discussed next.
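A minimal sketch in Python of the initial BIR ranking under assumptions (a) and (b); the toy collection and helper names are ours, purely illustrative:

```python
import math

# Toy collection as sets of index terms (hypothetical data, not the book's).
docs = {
    "d1": {"computer", "network"},
    "d2": {"network", "protocol"},
    "d3": {"computer", "graphics"},
    "d4": {"database", "systems"},
    "d5": {"operating", "systems"},
}
N = len(docs)

def initial_rank(query_terms):
    """Initial BIR ranking: P(k_i|R) = 0.5 and P(k_i|~R) = n_i / N,
    plugged into the log-odds ranking expression derived above."""
    n = {k: sum(k in terms for terms in docs.values()) for k in query_terms}
    scores = {}
    for name, terms in docs.items():
        score = 0.0
        for k in query_terms:
            if k in terms and 0 < n[k] < N:
                p_r, p_nr = 0.5, n[k] / N      # assumptions (a) and (b)
                score += math.log(p_r / (1 - p_r))
                score += math.log((1 - p_nr) / p_nr)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)

print(initial_rank({"computer", "network"}))   # d1 matches both terms
```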
Let $V$ be a subset of the documents initially retrieved and ranked by the probabilistic model. Such a subset can be defined, for instance, as the top $r$ ranked documents where $r$ is a previously defined threshold. Further, let $V_i$ be the subset of $V$ composed of the documents in $V$ which contain the index term $k_i$. For simplicity, we also use $V$ and $V_i$ to refer to the number of elements in these sets (it should always be clear when the used variable refers to the set or to the number of elements in it). For improving the probabilistic ranking, we need to improve our guesses for $P(k_i|R)$ and $P(k_i|\bar{R})$. This can be accomplished with the following assumptions: (a) we can approximate $P(k_i|R)$ by the distribution of the index term $k_i$ among the documents retrieved so far, and (b) we can approximate $P(k_i|\bar{R})$ by considering that all the non-retrieved documents are not relevant. With these assumptions, we can write

$$P(k_i|R) = \frac{V_i}{V} \qquad\qquad P(k_i|\bar{R}) = \frac{n_i - V_i}{N - V}$$

This process can then be repeated recursively. By doing so, we are able to improve our guesses for the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$ without any assistance from a human subject (contrary to the original idea). However, we can also use assistance from the user for definition of the subset $V$ as originally conceived.

The last formulas for $P(k_i|R)$ and $P(k_i|\bar{R})$ pose problems for small values of $V$ and $V_i$ which arise in practice (such as $V = 1$ and $V_i = 0$). To circumvent these problems, an adjustment factor is often added in, which yields

$$P(k_i|R) = \frac{V_i + 0.5}{V + 1} \qquad\qquad P(k_i|\bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$$
An adjustment factor which is constant and equal to 0.5 is not always satisfactory. An alternative is to take the fraction $n_i/N$ as the adjustment factor, which yields

$$P(k_i|R) = \frac{V_i + \frac{n_i}{N}}{V + 1} \qquad\qquad P(k_i|\bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}$$
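Continuing the toy setup of the previous sketch, a minimal illustration of one refinement iteration with the $n_i/N$ adjustment factor; again, the helper names are ours:

```python
def refine(query_terms, V_docs, n, N):
    """One refinement step: re-estimate P(k_i|R) and P(k_i|~R) from V,
    the set of top-ranked documents, using the n_i/N adjustment factor."""
    V = len(V_docs)
    estimates = {}
    for k in query_terms:
        V_i = sum(k in docs[name] for name in V_docs)   # |V_i|
        p_r = (V_i + n[k] / N) / (V + 1)
        p_nr = (n[k] - V_i + n[k] / N) / (N - V + 1)
        estimates[k] = (p_r, p_nr)
    return estimates

query = {"computer", "network"}
n = {k: sum(k in terms for terms in docs.values()) for k in query}
top = initial_rank(query)[:2]          # take the top-2 documents as V
print(refine(query, top, n, len(docs)))
```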
This completes our discussion of the probabilistic model.
The main advantage of the probabilistic model, in theory, is that documents are ranked in decreasing order of their probability of being relevant. The disadvantages include: (1) the need to guess the initial separation of documents into relevant and non-relevant sets.
