Modern Information Retrieval

Ricardo Baeza-Yates
Berthier Ribeiro-Neto

ACM Press
New York

Addison-Wesley

Harlow, England • Reading, Massachusetts
Menlo Park, California • New York
Don Mills, Ontario • Amsterdam • Bonn
Sydney • Singapore • Tokyo • Madrid
San Juan • Milan • Mexico City • Seoul • Taipei
Copyright © 1999 by the ACM Press, a Division of the Association for Computing
Machinery, Inc. (ACM).

Addison Wesley Longman Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England

and Associated Companies throughout the World.

The rights of the authors of this Work have been asserted by them in accordance with
the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording or otherwise, without either the prior written permission of
the publisher or a licence permitting restricted copying in the United Kingdom issued
by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.
While the publisher has made every attempt to trace all copyright owners and obtain
permission to reproduce material, in a few cases this has proved impossible.
Copyright holders of material which has not been acknowledged are encouraged to
contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Addison Wesley Longman Limited has made
every attempt to supply trade mark information about manufacturers and their
products mentioned in this book. A list of the trademark designations and their
owners appears on page viii.

Typeset in Computer Modern by 56
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-1

1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araujo
Neto, 1960- . II. Title.
Z667.B34 1999
025.04-dc21

99-10033
CIP
24

MODELING

t-dimensional vectorial space and standard linear algebra operations on vectors.
For the classic probabilistic model, the framework is composed of sets, standard
probability operations, and Bayes' theorem.

In the remainder of this chapter, we discuss the various IR models shown
in Figure 2.1. Throughout the discussion, we do not explicitly instantiate the
components D, Q, F, and R(q_i, d_j) of each model. Such components should be
quite clear from the discussion and can be easily inferred.

2.5 Classic Information Retrieval

In this section we briefly present the three classic models in information
retrieval, namely, the Boolean, the vector, and the probabilistic models.

2.5.1 Basic Concepts

The classic models in information retrieval consider that each document is
described by a set of representative keywords called index terms. An index term
is simply a (document) word whose semantics helps in remembering the
document's main themes. Thus, index terms are used to index and summarize
the document contents. In general, index terms are mainly nouns because nouns
have meaning by themselves and thus their semantics is easier to identify and
to grasp. Adjectives, adverbs, and connectives are less useful as index terms
because they work mainly as complements. However, it might be interesting
to consider all the distinct words in a document collection as index terms. For
instance, this approach is adopted by some Web search engines as discussed in
Chapter 13 (in which case, the document logical view is full text). We postpone
a discussion on the problem of how to generate index terms until Chapter 7,
where the issue is covered in detail.

Given a set of index terms for a document, we notice that not all terms
are equally useful for describing the document contents. In fact, there are index
terms which are simply vaguer than others. Deciding on the importance of a
term for summarizing the contents of a document is not a trivial issue. Despite
this difficulty, there are properties of an index term which are easily measured
and which are useful for evaluating the potential of a term as such. For instance,
consider a collection with a hundred thousand documents. A word which appears
in each of the one hundred thousand documents is completely useless as an index
term because it does not tell us anything about which documents the user might
be interested in. On the other hand, a word which appears in just five documents
is quite useful because it narrows down considerably the space of documents
which might be of interest to the user. Thus, it should be clear that distinct
index terms have varying relevance when used to describe document contents.
This effect is captured through the assignment of numerical weights to each
index term of a document.
CLASSIC INFORMATION RETRIEVAL

25

Let k_i be an index term, d_j be a document, and w_{i,j} ≥ 0 be a weight
associated with the pair (k_i, d_j). This weight quantifies the importance of the
index term for describing the document semantic contents.

Definition  Let t be the number of index terms in the system and k_i be a
generic index term. K = {k_1, ..., k_t} is the set of all index terms. A weight
w_{i,j} > 0 is associated with each index term k_i of a document d_j. For an
index term which does not appear in the document text, w_{i,j} = 0. With the
document d_j is associated an index term vector d_j represented by

    d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})

Further, let g_i be a function that returns the weight associated with the index
term k_i in any t-dimensional vector (i.e., g_i(d_j) = w_{i,j}).
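As a purely illustrative aside (not from the book), the definition above maps directly onto a small Python sketch; the three-term vocabulary and the weights below are hypothetical.

```python
# Hypothetical system with t = 3 index terms (k1, k2, k3).
# A document is represented by its index term vector of weights w_{i,j};
# a weight of 0 means the term does not appear in the document.
d_j = (0.0, 0.7, 0.2)

def g(i, vec):
    """Return the weight w_{i,j} of index term k_i in a vector (1-based i),
    mirroring the function g_i of the definition."""
    return vec[i - 1]

print(g(2, d_j))  # 0.7  (weight of k2 in d_j)
print(g(1, d_j))  # 0.0  (k1 does not appear in d_j)
```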

As we later discuss, the index term weights are usually assumed to be
mutually independent. This means that knowing the weight w_{i,j} associated
with the pair (k_i, d_j) tells us nothing about the weight w_{i+1,j} associated
with the pair (k_{i+1}, d_j). This is clearly a simplification because occurrences
of index terms in a document are not uncorrelated. Consider, for instance, that
the terms computer and network are used to index a given document which covers
the area of computer networks. Frequently, in this document, the appearance of
one of these two words attracts the appearance of the other. Thus, these two
words are correlated and their weights could reflect this correlation. While
mutual independence seems to be a strong simplification, it does simplify the
task of computing index term weights and allows for fast ranking computation.
Furthermore, taking advantage of index term correlations for improving the final
document ranking is not a simple task. In fact, none of the many approaches
proposed in the past has clearly demonstrated that index term correlations can
be advantageous (for ranking purposes) with general collections. Therefore,
unless clearly stated otherwise, we assume mutual independence among index
terms. In Chapter 5 we discuss modern retrieval techniques which are based on
term correlations and which have been tested successfully with particular
collections. These successes seem to be slowly shifting the current understanding
towards a more favorable view of the usefulness of term correlations for
information retrieval systems.

The above definitions provide support for discussing the three classic
information retrieval models, namely, the Boolean, the vector, and the
probabilistic models, as we now do.

2.5.2 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean
algebra. Since the concept of a set is quite intuitive, the Boolean model
provides a framework which is easy to grasp by a common user of an IR system.
Furthermore, the queries are specified as Boolean expressions which have precise
semantics. Given its inherent simplicity and neat formalism, the Boolean model
received great attention in past years and was adopted by many of the early
commercial bibliographic systems.
Figure 2.3  The three conjunctive components for the query [q = k_a ∧ (k_b ∨ ¬k_c)].

Unfortunately, the Boolean model suffers from major drawbacks. First,
its retrieval strategy is based on a binary decision criterion (i.e., a document
is predicted to be either relevant or non-relevant) without any notion of a
grading scale, which prevents good retrieval performance. Thus, the Boolean
model is in reality much more a data (instead of information) retrieval model.
Second, while Boolean expressions have precise semantics, frequently it is not
simple to translate an information need into a Boolean expression. In fact, most
users find it difficult and awkward to express their query requests in terms of
Boolean expressions. The Boolean expressions actually formulated by users often
are quite simple (see Chapter 10 for a more thorough discussion on this issue).
Despite these drawbacks, the Boolean model is still the dominant model with
commercial document database systems and provides a good starting point for
those new to the field.

The Boolean model considers that index terms are present or absent in a
document. As a result, the index term weights are assumed to be all binary, i.e.,
w_{i,j} ∈ {0, 1}. A query q is composed of index terms linked by three
connectives: not, and, or. Thus, a query is essentially a conventional Boolean
expression which can be represented as a disjunction of conjunctive vectors
(i.e., in disjunctive normal form, DNF). For instance, the query
[q = k_a ∧ (k_b ∨ ¬k_c)] can be written in disjunctive normal form as
[q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)], where each of the components is a binary
weighted vector associated with the tuple (k_a, k_b, k_c). These binary weighted
vectors are called the conjunctive components of q_dnf. Figure 2.3 illustrates
the three conjunctive components for the query q.

Definition  For the Boolean model, the index term weight variables are all
binary, i.e., w_{i,j} ∈ {0, 1}. A query q is a conventional Boolean expression.
Let q_dnf be the disjunctive normal form for the query q. Further, let q_cc be
any of the conjunctive components of q_dnf. The similarity of a document d_j to
the query q is defined as

    sim(d_j, q) = 1  if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc))
                  0  otherwise

If sim(d_j, q) = 1 then the Boolean model predicts that the document d_j is
relevant to the query q (it might not be). Otherwise, the prediction is that the
document is not relevant.
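To make the matching rule concrete, here is a minimal Python sketch (not from the book) for the running example q = k_a ∧ (k_b ∨ ¬k_c), with its DNF components written out by hand:

```python
# Conjunctive components of q = ka AND (kb OR NOT kc), over (ka, kb, kc).
DNF_COMPONENTS = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def sim(doc):
    """Boolean similarity: 1 if the document's binary weight vector
    equals some conjunctive component on every index term, else 0."""
    return 1 if any(doc == cc for cc in DNF_COMPONENTS) else 0

print(sim((0, 1, 0)))  # 0: contains kb only, predicted non-relevant
print(sim((1, 1, 0)))  # 1: matches the component (1, 1, 0)
```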

The Boolean model predicts that each document is either relevant or
non-relevant. There is no notion of a partial match to the query conditions.
For instance, let d_j be a document for which d_j = (0, 1, 0). Document d_j
includes the index term k_b but is considered non-relevant to the query
[q = k_a ∧ (k_b ∨ ¬k_c)].

The main advantages of the Boolean model are the clean formalism behind
the model and its simplicity. The main disadvantages are that exact matching
may lead to retrieval of too few or too many documents (see Chapter 10). Today,
it is well known that index term weighting can lead to a substantial improvement
in retrieval performance. Index term weighting brings us to the vector model.

2.5.3 Vector Model

The vector model [697, 695] recognizes that the use of binary weights is too
limiting and proposes a framework in which partial matching is possible. This
is accomplished by assigning non-binary weights to index terms in queries and
in documents. These term weights are ultimately used to compute the degree
of similarity between each document stored in the system and the user query.
By sorting the retrieved documents in decreasing order of this degree of
similarity, the vector model takes into consideration documents which match the
query terms only partially. The main resultant effect is that the ranked
document answer set is a lot more precise (in the sense that it better matches
the user information need) than the document answer set retrieved by the Boolean
model.

Definition  For the vector model, the weight w_{i,j} associated with a pair
(k_i, d_j) is positive and non-binary. Further, the index terms in the query are
also weighted. Let w_{i,q} be the weight associated with the pair [k_i, q], where
w_{i,q} ≥ 0. Then, the query vector q is defined as q = (w_{1,q}, w_{2,q}, ...,
w_{t,q}) where t is the total number of index terms in the system. As before,
the vector for a document d_j is represented by d_j = (w_{1,j}, w_{2,j}, ...,
w_{t,j}).

Therefore, a document d_j and a user query q are represented as t-dimensional
vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree
of similarity of the document d_j with regard to the query q as the correlation
between the vectors d_j and q. This correlation can be quantified, for instance,
by the cosine of the angle between these two vectors. That is,

    sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
                = ( Σ_{i=1}^{t} w_{i,j} × w_{i,q} )
                  / ( sqrt( Σ_{i=1}^{t} w_{i,j}^2 ) × sqrt( Σ_{i=1}^{t} w_{i,q}^2 ) )
Figure 2.4  The cosine of θ is adopted as sim(d_j, q).

where |d_j| and |q| are the norms of the document and query vectors. The factor
|q| does not affect the ranking (i.e., the ordering of the documents) because it
is the same for all documents. The factor |d_j| provides a normalization in the
space of the documents.

Since w_{i,j} ≥ 0 and w_{i,q} ≥ 0, sim(q, d_j) varies from 0 to +1. Thus,
instead of attempting to predict whether a document is relevant or not, the
vector model ranks the documents according to their degree of similarity to the
query. A document might be retrieved even if it matches the query only
partially. For instance, one can establish a threshold on sim(d_j, q) and
retrieve the documents with a degree of similarity above that threshold. But to
compute rankings we need first to specify how index term weights are obtained.
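The cosine ranking is simple to implement. The following Python sketch (with made-up weights, not an example from the book) computes sim(d_j, q) and sorts documents by decreasing similarity:

```python
import math

def cosine_sim(d, q):
    """sim(d_j, q) = (d_j . q) / (|d_j| * |q|); 0.0 for all-zero vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Hypothetical weights over a 3-term vocabulary.
docs = {"d1": [0.5, 0.8, 0.0], "d2": [0.9, 0.0, 0.4]}
query = [0.0, 1.0, 0.0]
ranking = sorted(docs, key=lambda name: cosine_sim(docs[name], query),
                 reverse=True)
print(ranking)  # ['d1', 'd2']: only d1 shares the weighted query term
```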
Index term weights can be calculated in many different ways. The work by
Salton and McGill [698] reviews various term-weighting techniques. Here, we do
not discuss them in detail. Instead, we concentrate on elucidating the main idea
behind the most effective term-weighting techniques. This idea is related to the
basic principles which support clustering techniques, as follows.

Given a collection C of objects and a vague description of a set A, the goal
of a simple clustering algorithm might be to separate the collection C of objects
into two sets: a first one which is composed of objects related to the set A and
a second one which is composed of objects not related to the set A. Vague
description here means that we do not have complete information for deciding
precisely which objects are and which are not in the set A. For instance, one
might be looking for a set A of cars which have a price comparable to that of a
Lexus 400. Since it is not clear what the term comparable means exactly, there
is not a precise (and unique) description of the set A. More sophisticated
clustering algorithms might attempt to separate the objects of a collection into
various clusters (or classes) according to their properties. For instance,
patients of a doctor specializing in cancer could be classified into five
classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the
possible class descriptions might be imprecise (and not unique) and the problem
is one of deciding to which of these classes a new patient should be assigned.
In what follows, however, we only discuss the simpler version of the clustering
problem (i.e., the one which considers only two classes) because all that is
required is a decision on which documents are predicted to be relevant and which
ones are predicted to be not relevant (with regard to a given user query).

To view the IR problem as one of clustering, we refer to the early work of
Salton. We think of the documents as a collection C of objects and think of the
user query as a (vague) specification of a set A of objects. In this scenario,
the IR problem can be reduced to the problem of determining which documents are
in the set A and which ones are not (i.e., the IR problem can be viewed as a
clustering problem). In a clustering problem, two main issues have to be
resolved. First, one needs to determine what are the features which better
describe the objects in the set A. Second, one needs to determine what are the
features which better distinguish the objects in the set A from the remaining
objects in the collection C. The first set of features provides for
quantification of intra-cluster similarity, while the second set of features
provides for quantification of inter-cluster dissimilarity. The most successful
clustering algorithms try to balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring
the raw frequency of a term k_i inside a document d_j. Such term frequency is
usually referred to as the tf factor and provides one measure of how well that
term describes the document contents (i.e., intra-document characterization).
Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse
of the frequency of a term k_i among the documents in the collection. This
factor is usually referred to as the inverse document frequency or the idf
factor. The motivation for usage of an idf factor is that terms which appear in
many documents are not very useful for distinguishing a relevant document from a
non-relevant one. As with good clustering algorithms, the most effective
term-weighting schemes for IR try to balance these two effects.

Definition  Let N be the total number of documents in the system and n_i be
the number of documents in which the index term k_i appears. Let freq_{i,j} be
the raw frequency of term k_i in the document d_j (i.e., the number of times the
term k_i is mentioned in the text of the document d_j). Then, the normalized
frequency f_{i,j} of term k_i in document d_j is given by

    f_{i,j} = freq_{i,j} / max_l freq_{l,j}                                (2.1)

where the maximum is computed over all terms which are mentioned in the text
of the document d_j. If the term k_i does not appear in the document d_j then
f_{i,j} = 0. Further, let idf_i, the inverse document frequency for k_i, be
given by

    idf_i = log (N / n_i)                                                  (2.2)

The best known term-weighting schemes use weights which are given by

    w_{i,j} = f_{i,j} × log (N / n_i)                                      (2.3)
or by a variation of this formula. Such term-weighting strategies are called
tf-idf schemes.

Several variations of the above expression for the weight w_{i,j} are
described in an interesting paper by Salton and Buckley which appeared in 1988
[696]. However, in general, the above expression should provide a good weighting
scheme for many collections. For the query term weights, Salton and Buckley
suggest

    w_{i,q} = ( 0.5 + (0.5 × freq_{i,q}) / max_l freq_{l,q} ) × log (N / n_i)    (2.4)

where freq_{i,q} is the raw frequency of the term k_i in the text of the
information request q.
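Equations 2.1 to 2.4 translate into a short Python sketch. The collection statistics below are hypothetical, and a base-2 logarithm is assumed since the text does not fix the base (any base gives the same ranking):

```python
import math
from collections import Counter

def doc_weights(doc_terms, n_i, N):
    """w_{i,j} = f_{i,j} * log(N / n_i), per equations 2.1-2.3.
    doc_terms: terms of the document; n_i: term -> number of documents
    containing it; N: total number of documents."""
    freq = Counter(doc_terms)
    max_freq = max(freq.values())
    return {t: (f / max_freq) * math.log2(N / n_i[t])
            for t, f in freq.items()}

def query_weights(query_terms, n_i, N):
    """Salton and Buckley's query weighting, equation 2.4."""
    freq = Counter(query_terms)
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * math.log2(N / n_i[t])
            for t, f in freq.items()}

# Hypothetical collection: 1000 documents; 'network' occurs in 10 of
# them, 'computer' in 100.
n_i = {"computer": 100, "network": 10}
w = doc_weights(["computer", "network", "network"], n_i, N=1000)
# 'network' outweighs 'computer': rarer in the collection (higher idf)
# and more frequent in the document (higher normalized tf).
```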
The main advantages of the vector model are: (1) its term-weighting scheme
improves retrieval performance; (2) its partial matching strategy allows
retrieval of documents that approximate the query conditions; and (3) its cosine
ranking formula sorts the documents according to their degree of similarity to
the query. Theoretically, the vector model has the disadvantage that index terms
are assumed to be mutually independent (equation 2.3 does not account for index
term dependencies). However, in practice, consideration of term dependencies
might be a disadvantage. Due to the locality of many term dependencies, their
indiscriminate application to all the documents in the collection might in fact
hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with
general collections. It yields ranked answer sets which are difficult to improve
upon without query expansion or relevance feedback (see Chapter 5) within the
framework of the vector model. A large variety of alternative ranking methods
have been compared to the vector model but the consensus seems to be that,
in general, the vector model is either superior or almost as good as the known
alternatives. Furthermore, it is simple and fast. For these reasons, the vector
model is a popular retrieval model nowadays.

2.5.4 Probabilistic Model

In this section we describe the classic probabilistic model introduced in 1976
by Robertson and Sparck Jones [677] which later became known as the binary
independence retrieval (BIR) model. Our discussion is intentionally brief and
focuses mainly on highlighting the key features of the model. With this purpose
in mind, we do not detain ourselves in subtleties regarding the binary
independence assumption for the model. The section on bibliographic discussion
points to references which cover these details.

The probabilistic model attempts to capture the IR problem within a
probabilistic framework. The fundamental idea is as follows. Given a user query,
there is a set of documents which contains exactly the relevant documents and no
others. Let us refer to this set of documents as the ideal answer set. Given the
description of this ideal answer set, we would have no problems in retrieving
its documents. Thus, we can think of the querying process as a process of
specifying the properties of an ideal answer set (which is analogous to
interpreting the IR problem as a problem of clustering). The problem is that we
do not know exactly what these properties are. All we know is that there are
index terms whose semantics should be used to characterize these properties.
Since these properties are not known at query time, an effort has to be made at
initially guessing what they could be. This initial guess allows us to generate
a preliminary probabilistic description of the ideal answer set which is used to
retrieve a first set of documents. An interaction with the user is then
initiated with the purpose of improving the probabilistic description of the
ideal answer set. Such interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones
are relevant and which ones are not (in truth, only the first top documents need
to be examined). The system then uses this information to refine the description
of the ideal answer set. By repeating this process many times, it is expected
that such a description will evolve and become closer to the real description of
the ideal answer set. Thus, one should always have in mind the need to guess at
the beginning the description of the ideal answer set. Furthermore, a conscious
effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle)  Given a user query q and a document d_j
in the collection, the probabilistic model tries to estimate the probability
that the user will find the document d_j interesting (i.e., relevant). The model
assumes that this probability of relevance depends on the query and the document
representations only. Further, the model assumes that there is a subset of all
documents which the user prefers as the answer set for the query q. Such an
ideal answer set is labeled R and should maximize the overall probability of
relevance to the user. Documents in the set R are predicted to be relevant to
the query. Documents not in this set are predicted to be non-relevant.

This assumption is quite troublesome because it does not state explicitly
how to compute the probabilities of relevance. In fact, not even the sample
space which is to be used for defining such probabilities is given.

Given a query q, the probabilistic model assigns to each document d_j, as a
measure of its similarity to the query, the ratio P(d_j relevant-to q) / P(d_j
non-relevant-to q) which computes the odds of the document d_j being relevant to
the query q. Taking the odds of relevance as the rank minimizes the probability
of an erroneous judgement [282, 785].

Definition  For the probabilistic model, the index term weight variables are
all binary, i.e., w_{i,j} ∈ {0, 1}, w_{i,q} ∈ {0, 1}. A query q is a subset of
index terms. Let R be the set of documents known (or initially guessed) to be
relevant. Let R̄ be the complement of R (i.e., the set of non-relevant
documents). Let P(R|d_j) be the probability that the document d_j is relevant to
the query q and P(R̄|d_j) be the probability that d_j is non-relevant to q. The
similarity sim(d_j, q) of the document d_j to the query q is defined as the
ratio

    sim(d_j, q) = P(R|d_j) / P(R̄|d_j)

Using Bayes' rule,

    sim(d_j, q) = ( P(d_j|R) × P(R) ) / ( P(d_j|R̄) × P(R̄) )

P(d_j|R) stands for the probability of randomly selecting the document d_j from
the set R of relevant documents. Further, P(R) stands for the probability that a
document randomly selected from the entire collection is relevant. The meanings
attached to P(d_j|R̄) and P(R̄) are analogous and complementary.

Since P(R) and P(R̄) are the same for all the documents in the collection,
we write

    sim(d_j, q) ~ P(d_j|R) / P(d_j|R̄)

Assuming independence of index terms,

    sim(d_j, q) ~ ( Π_{g_i(d_j)=1} P(k_i|R) × Π_{g_i(d_j)=0} P(k̄_i|R) )
                  / ( Π_{g_i(d_j)=1} P(k_i|R̄) × Π_{g_i(d_j)=0} P(k̄_i|R̄) )

P(k_i|R) stands for the probability that the index term k_i is present in a
document randomly selected from the set R. P(k̄_i|R) stands for the probability
that the index term k_i is not present in a document randomly selected from the
set R. The probabilities associated with the set R̄ have meanings which are
analogous to the ones just described.

Taking logarithms, recalling that P(k_i|R) + P(k̄_i|R) = 1, and ignoring
factors which are constant for all documents in the context of the same query,
we can finally write

    sim(d_j, q) ~ Σ_{i=1}^{t} w_{i,q} × w_{i,j} × ( log [ P(k_i|R) / (1 − P(k_i|R)) ]
                  + log [ (1 − P(k_i|R̄)) / P(k_i|R̄) ] )

which is a key expression for ranking computation in the probabilistic model.

Since we do not know the set R at the beginning, it is necessary to devise
a method for initially computing the probabilities P(k_i|R) and P(k_i|R̄). There
are many alternatives for such computation. We discuss a couple of them below.
In the very beginning (i.e., immediately after the query specification), there
are no retrieved documents. Thus, one has to make simplifying assumptions such
as: (a) assume that P(k_i|R) is constant for all index terms k_i (typically,
equal to 0.5) and (b) assume that the distribution of index terms among the
non-relevant documents can be approximated by the distribution of index terms
among all the documents in the collection. These two assumptions yield

    P(k_i|R) = 0.5
    P(k_i|R̄) = n_i / N

where, as already defined, n_i is the number of documents which contain the
index term k_i and N is the total number of documents in the collection. Given
this initial guess, we can then retrieve documents which contain query terms
and provide an initial probabilistic ranking for them. After that, this initial
ranking is improved as follows.

Let V be a subset of the documents initially retrieved and ranked by the
probabilistic model. Such a subset can be defined, for instance, as the top r
ranked documents where r is a previously defined threshold.
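Under the two simplifying assumptions above, the initial probabilistic ranking can be sketched in Python (illustrative statistics only): the term weights plug P(k_i|R) = 0.5 and P(k_i|R̄) = n_i/N into the log-odds ranking formula.

```python
import math

def bir_initial_weights(query_terms, n_i, N):
    """Initial BIR term weights, before any relevance feedback.
    Assumes P(ki|R) = 0.5 and approximates P(ki|Rbar) by ni / N."""
    weights = {}
    for t in query_terms:
        p_rel = 0.5            # P(ki|R), assumed constant
        p_nonrel = n_i[t] / N  # P(ki|Rbar) ~ distribution over all docs
        weights[t] = (math.log(p_rel / (1 - p_rel))
                      + math.log((1 - p_nonrel) / p_nonrel))
    return weights

def sim(doc_terms, weights):
    """Rank score: sum of the weights of query terms present in the doc."""
    return sum(w for t, w in weights.items() if t in doc_terms)

# Hypothetical collection: 1000 documents; 'network' in 10, 'computer' in 100.
w = bir_initial_weights(["computer", "network"],
                        {"computer": 100, "network": 10}, 1000)
# A document with both terms outranks one with 'computer' alone, and the
# rarer term 'network' contributes the larger weight.
```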
