Modern Information Retrieval

Ricardo Baeza-Yates
Berthier Ribeiro-Neto

ACM Press
New York

Addison-Wesley

Harlow, England • Reading, Massachusetts
Menlo Park, California • New York
Don Mills, Ontario • Amsterdam • Bonn
Sydney • Singapore • Tokyo • Madrid
San Juan • Milan • Mexico City • Seoul • Taipei
Copyright © 1999 by the ACM Press, A Division of the Association for Computing
Machinery, Inc. (ACM).

Addison Wesley Longman Limited
Edinburgh Gate
Harlow
Essex CM20 2JE

England

and Associated Companies throughout the World.

The rights of the authors of this Work have been asserted by them in accordance with
the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording or otherwise, without either the prior written permission of
the publisher or a licence permitting restricted copying in the United Kingdom issued
by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.
`
`ŽzZg‘$’j“•”;–g—?˜™šœ›$›$›g‘`—I–RžŸ Žz¡"g’bž£¢j¢£¤RŸx¥¦“•§$“•¢I“•;¨ Z©v—I–Rž Ÿ ¢j¢j ˜ “•ªU—I“•Z¨C©€Z’5Žz$«¬R—I“†¨S”¡ˆª ˜ –g“†¨Sª;’b‘$¤Y­®¨ ˜$¯_° Ÿ±ŽO¡C² ¯Ÿ ³g³>“•¢jZ¨t´ž£¢Iµ•žœ‘0¶8;¨S”U«AªZ¨C¶]“•«·“•—jž£³¸v³e“†¨>g¬’b”;– ¹ ªU—jžº ª;’jµ•;»¸v¢j¢jžh¼CŽz¡C½Z¾|½¿g¸¸¨S”;µ•ªZ¨R³ªZ¨R³ˆŸ ¢j¢j ˜ “•ª$—jž£³CŽz$«Sª;¨“•ž£¢ —I–g’bZ¬R”Z–RZ¬R— —I–Rž ´CZ’jµ•³ ¯Àz–Sž_’j“•”Z–g—j¢±Z©—I–Rž ªZ¬R—I–RZ’b¢_Z©Á—I–“•¢±´;’jÂ`–Rª;§$ž ]žœžh¨ˆªU¢j¢jžh’b—jž£³`g‘k—I–Rž£«Ã“†¨ˆª ˜£˜ ;’b³ª;¨ ˜ ž»z“•—I–—I–RžŽO;g‘$’j“•”Z–g—£¤]¥±ž£¢I“•”Z¨R¢ ªZ¨R³ˆÄª$—jžh¨g—j¢_Ÿ ˜ — š£›$ÅUÅ ¯Ÿ¦µ†µY’j“•”Z–g—j¢_’bž£¢jžh’b§$žœ³ ¯3Æ  Sª;’b— Z©Á—I–“•¢_¬ggµ†“ ˜ ª$—I“ǏZ¨ «:ªZ‘A]ž ’bžb’b>³>¬ ˜ ž£³n¤Y¢j—jZ’bžœ³`“†¨Cª’bž£—I’j“•ž£§;ª;µv¢j‘>¢j—jž£«0¤S;’±—I’bªZ¨R¢j«·“•—j—jž£³0“†¨CªZ¨g‘A©€;’b«Z’_g‘kªZ¨g‘&«Až£ªZ¨R¢£¤Sžbµ•ž ˜ —I’bZ¨g“ ˜ ¤Y«Až ˜ –Sª;¨“ ˜ ªZµ€¤g–SU—j ˜ Zg‘$“†¨R”¤8’bž ˜ Z’b³>“†¨S”k;’ U—I–Sžb’b»O“•¢jžU¤]»O“•—I–RZ¬R— žb“•—I–Sžh’5—I–Sž g’j“•Z’5»z’j“•—j—jžh¨0]žb’b«·“•¢j¢I“•Z¨ ;©—I–Rž ¬gµ†“•¢I–Ržh’:Z’±ª·µ†“ ˜ žb¨ ˜ ž]žh’b«·“•—j—I“†¨R”`’bž£¢j—I’j“ ˜ —jž£³ ˜ Zg‘$“†¨R”A“†¨C—I–Sž È_¨“•—jžœ³ˆÉ “†¨S”U³U«"“•¢j¢I¬Sžœ³g‘ —I–Rž_ŽzZg‘$’j“•”Z–g—v¶n“ ˜ žh¨S¢I“†¨R”Ÿ ”$žb¨ ˜ ‘¶8—j³]¤U›$¾ ÀU—j—jžh¨g–SªU«ÊŽzZ¬g’b—˦Uª$³]¤$¶YZ¨R³;¨k´ÌšœÄ›$º±¸ ¯´Í–g“†µ•žA—I–Rž ¬ggµ†“•¢I–Sžh’–Rª$¢_«Aª$³gžž£§$žh’b‘kªU—j—jž£«·R—¦—j:—I’bª ˜ ž ª;µ†µ ˜ Zg‘$’j“•”;–g— ;»z¨Sžh’b¢±ªZ¨R³0;S—jª;“†¨]žh’b«“•¢j¢I“•Z¨t—j·’bžhg’b>³e¬ ˜ žA«AªU—jžh’j“•ªZµ€¤S“†¨Cª©€ž£» ˜ ªU¢jž£¢_—I–“•¢_–SªU¢¦g’bZ§Už£³0“•«·]U¢j¢I“†µ•ž ¯ŽzZg‘$’j“•”;–g—¦–RZµ•³gžh’b¢ Z©v«AªU—jžh’j“•ªZµÁ»z–“ ˜ –ˆ–SªU¢¦¨R$—z]ž£žb¨ˆª ˜ Â>¨RZ»zµ•ž£³g”$žœ³ªZ’bž žh¨ ˜ ;¬’bªU”$ž£³ˆ—j˜ ;¨g—jª ˜ —_—I–RžMg¬gµ†“•¢I–Ržh’ ¯¡ˆªZ¨g‘&Z©Á—I–Sž³žœ¢I“•”Z¨Rª$—I“•Z¨R¢ ¬S¢jž£³kg‘0«Aª;¨>¬©®ª ˜ —I¬g’bžh’b¢ ª;¨S³&¢jžhµ†µ•žh’b¢ —jA³>“•¢j—I“†¨R”Z¬g“•¢I– —I–Sžb“†’g’b>³e¬ ˜ —j¢ ª;’bž ˜ µ•ªZ“•«:ž£³ªU¢_—I’bª$³gž£«AªZ’jÂg¢ ¯ Ÿ ³g³e“•¢j;¨´Cž£¢Iµ•ž£‘&¶8;¨S”U«AªZ¨C¶]“•«·“•—jž£³ˆ–SªU¢¦«:ª$³gžž£§Užh’b‘kª$—j—jž£«·R— 
—j:¢I¬gµ•‘&—I’bª$³gž·«:ªZ’jÂk“†¨©®Z’b«AªU—I“•Z¨tªZ];¬S—±«AªZ¨>¬g©€ª ˜ —I¬’bžh’b¢±ªZ¨R³ˆ—I–Sžb“†’g’b>³e¬ ˜ —j¢ «:žh¨g—I“•Z¨Rž£³ˆ“†¨C—I–“•¢_]>Z ¯ ŸÌµ†“•¢j—±Z©v—I–Rž —I’bª$³gž£«AªZ’jŠ³žœ¢I“•”Z¨Rª$—I“•Z¨R¢ªZ¨R³ˆ—I–Sžh“†’;»z¨Sžh’b¢_ª;]žœªZ’b¢ ;¨ˆSªU”$ž §$“†“†“ ¯Àz‘$]žœ¢jž£—z“†¨tŽOU«·g¬S—jžh’5¡ˆ>³žb’j¨ˆg‘0Î;ÏÄ’j“†¨g—jž£³ˆªZ¨R³0]Z¬g¨R³0“†¨C—I–Sž È_¨“•—jž£³CÐ>—jª$—jžœ¢ Z©ÁŸ «Ažb’j“ ˜ ªÑ8“†’b¢j—¦g’j“†¨g—jž£³šœ›$›$›­IÐ>Ò Æ ¾;ÓÔ½Z¾šbÓsÕ$›$ÅG½Z›;ÓIÖ× ØZÙÇÚ£ÙÜÛ£ÝßÞ)ÙÜàØ;ágØZâmã·ágÚ£áäÇåSæSçÙÜèæébÙÇè8éœê çàäÜÙÜë>ágÚ£ÙÜåSèíìkáÚ£áŸ ˜ ªU—jªZµ•U”Z¬RžM’bž ˜ Z’b³0©€;’ —I–g“•¢z]>ZÂ&“•¢ ª;§;ªZ“†µ•ª;µ•žA©Ü’bU«—I–Sž Ò)’j“•—I“•¢I–C¶]“†’bª;’b‘Þ)ÙÜàØ;ágØ;âÍågîã·åSèæRØZï>ۜÛ0ãáÚ£ágäÜåSæRçÙÜèæébÙÜè8éœê çàäÜÙÇë>áÚ£ÙÇåSèíì`ágÚ£áÒzªUž£ð£ªZӀñ3ª$—jž£¢£¤RË ¯ò° Ëz“ ˜ ª;’b³G²¡ˆ>³gžh’j¨ˆ“†¨g©€Z’b«:ª$—I“•Z¨C’bž£—I’j“•ž£§;ª;µ3ó·Ëz“ ˜ ªZ’b³gkÒzª$ž£ðœªZӀñzªU—jž£¢£¤SÒ3žh’b—I–“•žb’ Ëz“†]žh“†’b;Ó Æ ž£—j ¯ ¯-˜ « ¯­Ü¨ ˜ µ†¬S³gž£¢ g“†µ†“•U”Z’bªZg–S“ ˜ ª$µv’bžh©€žb’bžh¨ ˜ ž£¢_ªZ¨R³0“†¨S³gžh¼ ¯­sÐeÒ Æ ¾ZÓÔ½Z¾gšhÓsÕ$›UÅ>½;›ZÓs֚ ¯ ­Ü¨©€;’b«Aª$—I“•;¨C¢j—jZ’bªU”$žªZ¨R³`’bž£—I“•ž£§;ª;µv¢j‘>¢j—jž£«A¢ ¯ ­ ¯ ËO“†ôžh“†’b¤]Òzžb’b—I–“•žh’5³žŸ¦’bªvõ¬;öbÆ ž£—j¤Yš£›UÏ$¾ZÓ ¯ ­j­ ¯ Àz“•—Iµ•ž ¯÷SÏ$ÏGø ¯ ÒzÕ$ù š£›$›U›¾>½UÎ ¯ ¾$ù;ú>³ ˜ ½eš ›U›ZÓbš£¾U¾$Õ$Վ¦­€Ä
`
While the publisher has made every attempt to trace all copyright owners and obtain
permission to reproduce material, in a few cases this has proved impossible.
Copyright holders of material which has not been acknowledged are encouraged to
contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Addison Wesley Longman Limited has made
every attempt to supply trade mark information about manufacturers and their
products mentioned in this book. A list of the trade mark designations and their
owners appears on page viii.

Typeset in Computer Modern by 56
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-X
1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araujo
Neto, 1960- . II. Title.
Z667.B34 1999
025.04-dc21

99-10033
CIP
`‘21
`
`)\l()lJEIJINU
`
t-dimensional vectorial space and standard linear algebra operations on vectors.
For the classic probabilistic model, the framework is composed of sets, standard
probability operations, and Bayes' theorem.
In the remainder of this chapter, we discuss the various IR models shown
in Figure 2.1. Throughout the discussion, we do not explicitly instantiate the
components D, Q, F, and R(q_i, d_j) of each model. Such components should be
quite clear from the discussion and can be easily inferred.
`
2.5 Classic Information Retrieval

In this section we briefly present the three classic models in information retrieval,
namely, the Boolean, the vector, and the probabilistic models.
`
2.5.1 Basic Concepts

The classic models in information retrieval consider that each document is described
by a set of representative keywords called index terms. An index term
is simply a (document) word whose semantics helps in remembering the document's
main themes. Thus, index terms are used to index and summarize the
document contents. In general, index terms are mainly nouns because nouns
have meaning by themselves and thus, their semantics is easier to identify and
to grasp. Adjectives, adverbs, and connectives are less useful as index terms
because they work mainly as complements. However, it might be interesting
to consider all the distinct words in a document collection as index terms. For
instance, this approach is adopted by some Web search engines as discussed in
Chapter 13 (in which case, the document logical view is full text). We postpone
a discussion on the problem of how to generate index terms until Chapter 7,
where the issue is covered in detail.
`
Given a set of index terms for a document, we notice that not all terms
are equally useful for describing the document contents. In fact, there are index
terms which are simply vaguer than others. Deciding on the importance of a
term for summarizing the contents of a document is not a trivial issue. Despite
this difficulty, there are properties of an index term which are easily measured
and which are useful for evaluating the potential of a term as such. For instance,
consider a collection with a hundred thousand documents. A word which appears
in each of the one hundred thousand documents is completely useless as an index
term because it does not tell us anything about which documents the user might
be interested in. On the other hand, a word which appears in just five documents
is quite useful because it narrows down considerably the space of documents
which might be of interest to the user. Thus, it should be clear that distinct
index terms have varying relevance when used to describe document contents.
This effect is captured through the assignment of numerical weights to each index
term of a document.
`
CLASSIC INFORMATION RETRIEVAL

25
`
Let k_i be an index term, d_j be a document, and w_{i,j} >= 0 be a weight
associated with the pair (k_i, d_j). This weight quantifies the importance of the
index term for describing the document semantic contents.

Definition  Let t be the number of index terms in the system and k_i be a generic
index term. K = {k_1, ..., k_t} is the set of all index terms. A weight w_{i,j} > 0
is associated with each index term k_i of a document d_j. For an index term
which does not appear in the document text, w_{i,j} = 0. With the document d_j
is associated an index term vector d_j represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).
Further, let g_i be a function that returns the weight associated with the index
term k_i in any t-dimensional vector (i.e., g_i(d_j) = w_{i,j}).
`
As we later discuss, the index term weights are usually assumed to be mutually
independent. This means that knowing the weight w_{i,j} associated with the
pair (k_i, d_j) tells us nothing about the weight w_{i+1,j} associated with the pair
(k_{i+1}, d_j). This is clearly a simplification because occurrences of index terms in
a document are not uncorrelated. Consider, for instance, that the terms computer
and network are used to index a given document which covers the area of
computer networks. Frequently, in this document, the appearance of one of these
two words attracts the appearance of the other. Thus, these two words are correlated
and their weights could reflect this correlation. While mutual independence
seems to be a strong simplification, it does simplify the task of computing index
term weights and allows for fast ranking computation. Furthermore, taking advantage
of index term correlations for improving the final document ranking is
not a simple task. In fact, none of the many approaches proposed in the past
has clearly demonstrated that index term correlations are advantageous (for
ranking purposes) with general collections. Therefore, unless clearly stated otherwise,
we assume mutual independence among index terms. In Chapter 5 we
discuss modern retrieval techniques which are based on term correlations and
which have been tested successfully with particular collections. These successes
seem to be slowly shifting the current understanding towards a more favorable
view of the usefulness of term correlations for information retrieval systems.
The above definitions provide support for discussing the three classic information
retrieval models, namely, the Boolean, the vector, and the probabilistic
models, as we now do.
2.5.2 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean
algebra. Since the concept of a set is quite intuitive, the Boolean model provides
a framework which is easy to grasp by a common user of an IR system.
Furthermore, the queries are specified as Boolean expressions which have precise
semantics. Given its inherent simplicity and neat formalism, the Boolean model
received great attention in past years and was adopted by many of the early
commercial bibliographic systems.
`
[Figure 2.3  The three conjunctive components for the query q = k_a ∧ (k_b ∨ ¬k_c).]
`
Unfortunately, the Boolean model suffers from major drawbacks. First,
its retrieval strategy is based on a binary decision criterion (i.e., a document is
predicted to be either relevant or non-relevant) without any notion of a grading
scale, which prevents good retrieval performance. Thus, the Boolean model is
in reality much more a data (instead of information) retrieval model. Second,
while Boolean expressions have precise semantics, frequently it is not simple to
translate an information need into a Boolean expression. In fact, most users find
it difficult and awkward to express their query requests in terms of Boolean
expressions. The Boolean expressions actually formulated by users often are quite
simple (see Chapter 10 for a more thorough discussion on this issue). Despite
these drawbacks, the Boolean model is still the dominant model with commercial
document database systems and provides a good starting point for those new to
the field.
`
The Boolean model considers that index terms are present or absent in a
document. As a result, the index term weights are assumed to be all binary, i.e.,
w_{i,j} ∈ {0, 1}. A query q is composed of index terms linked by three connectives:
not, and, or. Thus, a query is essentially a conventional Boolean expression which
can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive
normal form, DNF). For instance, the query [q = k_a ∧ (k_b ∨ ¬k_c)] can be written
in disjunctive normal form as [q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)], where each of
the components is a binary weighted vector associated with the tuple (k_a, k_b, k_c).
These binary weighted vectors are called the conjunctive components of q_dnf.
Figure 2.3 illustrates the three conjunctive components for the query q.
`
Definition  For the Boolean model, the index term weight variables are all
binary, i.e., w_{i,j} ∈ {0, 1}. A query q is a conventional Boolean expression. Let
q_dnf be the disjunctive normal form for the query q. Further, let q_cc be any of the
conjunctive components of q_dnf. The similarity of a document d_j to the query q
is defined as

    sim(d_j, q) = 1   if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc))
                  0   otherwise
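This similarity can be sketched in a few lines of code. The sketch below is an illustration, not from the text: documents are binary tuples over the vocabulary (k_a, k_b, k_c), and the DNF of the running example query is hard-coded as a set of conjunctive components.

```python
# Sketch of the Boolean similarity: sim(d_j, q) = 1 iff the document's
# binary vector equals one of the conjunctive components of q_dnf.
# The DNF of q = ka AND (kb OR NOT kc) is hard-coded by hand here.

def boolean_sim(doc_vector, q_dnf):
    """Return 1 if doc_vector matches any conjunctive component, else 0."""
    return 1 if tuple(doc_vector) in q_dnf else 0

# Conjunctive components of q over the tuple (ka, kb, kc):
q_dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

print(boolean_sim((1, 1, 0), q_dnf))  # contains ka and kb: predicted relevant -> 1
print(boolean_sim((0, 1, 0), q_dnf))  # contains only kb: predicted non-relevant -> 0
```

Note how the answer is strictly 0 or 1: the model offers no grading scale, which is exactly the drawback discussed next.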
`
If sim(d_j, q) = 1 then the Boolean model predicts that the document d_j is relevant
to the query q (it might not be). Otherwise, the prediction is that the document
is not relevant.
`
The Boolean model predicts that each document is either relevant or non-relevant.
There is no notion of a partial match to the query conditions. For
instance, let d_j be a document for which d_j = (0,1,0). Document d_j includes
the index term k_b but is considered non-relevant to the query [q = k_a ∧ (k_b ∨ ¬k_c)].

The main advantages of the Boolean model are the clean formalism behind
the model and its simplicity. The main disadvantages are that exact matching
may lead to retrieval of too few or too many documents (see Chapter 10). Today,
it is well known that index term weighting can lead to a substantial improvement
in retrieval performance. Index term weighting brings us to the vector model.
`
2.5.3 Vector Model

The vector model [697, 695] recognizes that the use of binary weights is too
limiting and proposes a framework in which partial matching is possible. This
is accomplished by assigning non-binary weights to index terms in queries and
in documents. These term weights are ultimately used to compute the degree
of similarity between each document stored in the system and the user query.
By sorting the retrieved documents in decreasing order of this degree of similarity,
the vector model takes into consideration documents which match the query
terms only partially. The main resultant effect is that the ranked document answer
set is a lot more precise (in the sense that it better matches the user information
need) than the document answer set retrieved by the Boolean model.
`
Definition  For the vector model, the weight w_{i,j} associated with a pair (k_i, d_j)
is positive and non-binary. Further, the index terms in the query are also
weighted. Let w_{i,q} be the weight associated with the pair [k_i, q], where w_{i,q} >= 0.
Then, the query vector q is defined as q = (w_{1,q}, w_{2,q}, ..., w_{t,q}), where t is the
total number of index terms in the system. As before, the vector for a document
d_j is represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).
`
Therefore, a document d_j and a user query q are represented as t-dimensional
vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree
of similarity of the document d_j with regard to the query q as the correlation
between the vectors d_j and q. This correlation can be quantified, for instance,
by the cosine of the angle between these two vectors. That is,

    sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
                = ( Σ_{i=1}^{t} w_{i,j} × w_{i,q} )
                  / ( sqrt(Σ_{i=1}^{t} w_{i,j}^2) × sqrt(Σ_{i=1}^{t} w_{i,q}^2) )
`
[Figure 2.4  The cosine of θ is adopted as sim(d_j, q).]
`
where |d_j| and |q| are the norms of the document and query vectors. The factor
|q| does not affect the ranking (i.e., the ordering of the documents) because it
is the same for all documents. The factor |d_j| provides a normalization in the
space of the documents.
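The cosine formula translates directly into code. A minimal sketch follows; the example weight vectors are hypothetical values chosen for illustration, not taken from the text.

```python
import math

def cosine_sim(d, q):
    """sim(d_j, q): cosine of the angle between the two weight vectors."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0          # a vector with no indexed terms matches nothing
    return dot / (norm_d * norm_q)

d_j = [0.5, 0.8, 0.0]       # hypothetical index term weights for a document
q   = [1.0, 1.0, 0.0]       # hypothetical query term weights
print(cosine_sim(d_j, q))   # a value between 0 and +1
```

Because all weights are non-negative, the result always falls in the interval [0, +1], as the text observes next.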
Since w_{i,j} >= 0 and w_{i,q} >= 0, sim(q, d_j) varies from 0 to +1. Thus, instead of
attempting to predict whether a document is relevant or not, the vector model
ranks the documents according to their degree of similarity to the query. A
document might be retrieved even if it matches the query only partially. For
instance, one can establish a threshold on sim(d_j, q) and retrieve the documents
with a degree of similarity above that threshold. But to compute rankings we
need first to specify how index term weights are obtained.
Index term weights can be calculated in many different ways. The work by
Salton and McGill [698] reviews various term-weighting techniques. Here, we do
not discuss them in detail. Instead, we concentrate on elucidating the main idea
behind the most effective term-weighting techniques. This idea is related to the
basic principles which support clustering techniques, as follows.
Given a collection C of objects and a vague description of a set A, the goal of
a simple clustering algorithm might be to separate the collection C of objects into
two sets: a first one which is composed of objects related to the set A and a second
one which is composed of objects not related to the set A. Vague description here
means that we do not have complete information for deciding precisely which
objects are and which are not in the set A. For instance, one might be looking
for a set A of cars which have a price comparable to that of a Lexus 400. Since it
is not clear what the term comparable means exactly, there is not a precise (and
unique) description of the set A. More sophisticated clustering algorithms might
attempt to separate the objects of a collection into various clusters (or classes)
according to their properties. For instance, patients of a doctor specializing
in cancer could be classified into five classes: terminal, advanced, metastasis,
diagnosed, and healthy. Again, the possible class descriptions might be imprecise
(and not unique) and the problem is one of deciding to which of these classes
a new patient should be assigned. In what follows, however, we only discuss
the simpler version of the clustering problem (i.e., the one which considers only
two classes) because all that is required is a decision on which documents are
predicted to be relevant and which ones are predicted to be not relevant (with
regard to a given user query).
`
To view the IR problem as one of clustering, we refer to the early work of
Salton. We think of the documents as a collection C of objects and think of the
user query as a (vague) specification of a set A of objects. In this scenario, the
IR problem can be reduced to the problem of determining which documents are
in the set A and which ones are not (i.e., the IR problem can be viewed as a
clustering problem). In a clustering problem, two main issues have to be resolved.
First, one needs to determine what are the features which better describe the
objects in the set A. Second, one needs to determine what are the features
which better distinguish the objects in the set A from the remaining objects in
the collection C. The first set of features provides for quantification of intra-cluster
similarity, while the second set of features provides for quantification
of inter-cluster dissimilarity. The most successful clustering algorithms try to
balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring
the raw frequency of a term k_i inside a document d_j. Such term frequency is
usually referred to as the tf factor and provides one measure of how well that
term describes the document contents (i.e., intra-document characterization).
Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse
of the frequency of a term k_i among the documents in the collection. This
factor is usually referred to as the inverse document frequency or the idf factor.
The motivation for usage of an idf factor is that terms which appear in many
documents are not very useful for distinguishing a relevant document from a
non-relevant one. As with good clustering algorithms, the most effective term-weighting
schemes for IR try to balance these two effects.
`
Definition  Let N be the total number of documents in the system and n_i be
the number of documents in which the index term k_i appears. Let freq_{i,j} be the
raw frequency of term k_i in the document d_j (i.e., the number of times the term
k_i is mentioned in the text of the document d_j). Then, the normalized frequency
f_{i,j} of term k_i in document d_j is given by

    f_{i,j} = freq_{i,j} / max_l freq_{l,j}                                (2.1)

where the maximum is computed over all terms which are mentioned in the text
of the document d_j. If the term k_i does not appear in the document d_j then
f_{i,j} = 0. Further, let idf_i, the inverse document frequency for k_i, be given by

    idf_i = log (N / n_i)                                                  (2.2)

The best known term-weighting schemes use weights which are given by

    w_{i,j} = f_{i,j} × log (N / n_i)                                      (2.3)
`
or by a variation of this formula. Such term-weighting strategies are called tf-idf
schemes.

Several variations of the above expression for the weight w_{i,j} are described in an
interesting paper by Salton and Buckley which appeared in 1988 [696]. However,
in general, the above expression should provide a good weighting scheme for
many collections. For the query term weights, Salton and Buckley suggest

    w_{i,q} = (0.5 + (0.5 × freq_{i,q} / max_l freq_{l,q})) × log (N / n_i)    (2.4)

where freq_{i,q} is the raw frequency of the term k_i in the text of the information
request q.
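Equations 2.1 through 2.4 can be sketched as follows. The collection statistics here (N, the document frequencies n_i, and the raw frequencies) are hypothetical values chosen for illustration, not from the text.

```python
import math

def doc_weights(freqs, N, n):
    """Equation 2.3: w_ij = f_ij * log(N / n_i), with f_ij from equation 2.1."""
    max_freq = max(freqs)
    return [(f / max_freq) * math.log(N / n_i) if f > 0 else 0.0
            for f, n_i in zip(freqs, n)]

def query_weights(freqs, N, n):
    """Equation 2.4 (Salton-Buckley query term weights)."""
    max_freq = max(freqs)
    return [(0.5 + 0.5 * f / max_freq) * math.log(N / n_i)
            for f, n_i in zip(freqs, n)]

# Hypothetical collection of N = 1000 documents and three index terms
# with document frequencies n_i = 10, 100, and 1000 respectively.
N, n = 1000, [10, 100, 1000]
print(doc_weights([3, 1, 2], N, n))    # third weight is 0: log(1000/1000) = 0
print(query_weights([1, 0, 1], N, n))
```

Note how a term occurring in every document (n_i = N) receives weight zero, echoing the earlier observation that such a term is useless for distinguishing documents.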
The main advantages of the vector model are: (1) its term-weighting scheme
improves retrieval performance; (2) its partial matching strategy allows retrieval
of documents that approximate the query conditions; and (3) its cosine ranking
formula sorts the documents according to their degree of similarity to the
query. Theoretically, the vector model has the disadvantage that index terms are
assumed to be mutually independent (equation 2.3 does not account for index
term dependencies). However, in practice, consideration of term dependencies
might be a disadvantage. Due to the locality of many term dependencies, their
indiscriminate application to all the documents in the collection might in fact
hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with
general collections. It yields ranked answer sets which are difficult to improve
upon without query expansion or relevance feedback (see Chapter 5) within the
framework of the vector model. A large variety of alternative ranking methods
have been compared to the vector model but the consensus seems to be that,
in general, the vector model is either superior or almost as good as the known
alternatives. Furthermore, it is simple and fast. For these reasons, the vector
model is a popular retrieval model nowadays.
`
2.5.4 Probabilistic Model

In this section we describe the classic probabilistic model introduced in 1976
by Robertson and Sparck Jones [677] which later became known as the binary
independence retrieval (BIR) model. Our discussion is intentionally brief and
focuses mainly on highlighting the key features of the model. With this purpose
in mind, we do not detain ourselves in subtleties regarding the binary independence
assumption for the model. The section on bibliographic discussion points
to references which cover these details.

The probabilistic model attempts to capture the IR problem within a probabilistic
framework. The fundamental idea is as follows. Given a user query,
there is a set of documents which contains exactly the relevant documents and
no other. Let us refer to this set of documents as the ideal answer set. Given the
description of this ideal answer set, we would have no problems in retrieving its
documents. Thus, we can think of the querying process as a process of specifying
the properties of an ideal answer set (which is analogous to interpreting the
IR problem as a problem of clustering). The problem is that we do not know
exactly what these properties are. All we know is that there are index terms
whose semantics should be used to characterize these properties. Since these
properties are not known at query time, an effort has to be made at initially
guessing what they could be. This initial guess allows us to generate a preliminary
probabilistic description of the ideal answer set which is used to retrieve
a first set of documents. An interaction with the user is then initiated with the
purpose of improving the probabilistic description of the ideal answer set. Such
interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones
are relevant and which ones are not (in truth, only the first top documents need
to be examined). The system then uses this information to refine the description
of the ideal answer set. By repeating this process many times, it is expected
that such a description will evolve and become closer to the real description of
the ideal answer set. Thus, one should always have in mind the need to guess at
the beginning the description of the ideal answer set. Furthermore, a conscious
effort is made to model this description in probabilistic terms.
The probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle)  Given a user query q and a document d_j
in the collection, the probabilistic model tries to estimate the probability that
the user will find the document d_j interesting (i.e., relevant). The model assumes
that this probability of relevance depends on the query and the document
representations only. Further, the model assumes that there is a subset of all
documents which the user prefers as the answer set for the query q. Such an
ideal answer set is labeled R and should maximize the overall probability of
relevance to the user. Documents in the set R are predicted to be relevant to the
query. Documents not in this set are predicted to be non-relevant.

This assumption is quite troublesome because it does not state explicitly
how to compute the probabilities of relevance. In fact, not even the sample space
which is to be used for defining such probabilities is given.

Given a query q, the probabilistic model assigns to each document d_j, as
a measure of its similarity to the query, the ratio P(d_j relevant-to q)/P(d_j
non-relevant-to q) which computes the odds of the document d_j being relevant to the
query q. Taking the odds of relevance as the rank minimizes the probability of
an erroneous judgement [282, 785].
`
Definition  For the probabilistic model, the index term weight variables are
all binary, i.e., w_{i,j} ∈ {0,1}, w_{i,q} ∈ {0,1}. A query q is a subset of index terms.
Let R be the set of documents known (or initially guessed) to be relevant. Let R̄
be the complement of R (i.e., the set of non-relevant documents). Let P(R|d_j) be
the probability that the document d_j is relevant to the query q and P(R̄|d_j) be
the probability that d_j is non-relevant to q. The similarity sim(d_j, q) of the
document d_j to the query q is defined as the ratio

    sim(d_j, q) = P(R|d_j) / P(R̄|d_j)

Using Bayes' rule,

    sim(d_j, q) = (P(d_j|R) × P(R)) / (P(d_j|R̄) × P(R̄))
`
P(d_j|R) stands for the probability of randomly selecting the document d_j from
the set R of relevant documents. Further, P(R) stands for the probability that a
document randomly selected from the entire collection is relevant. The meanings
attached to P(d_j|R̄) and P(R̄) are analogous and complementary.
Since P(R) and P(R̄) are the same for all the documents in the collection,
we write

    sim(d_j, q) ~ P(d_j|R) / P(d_j|R̄)

Assuming independence of index terms,

    sim(d_j, q) ~ ( Π_{g_i(d_j)=1} P(k_i|R) × Π_{g_i(d_j)=0} P(k̄_i|R) )
                  / ( Π_{g_i(d_j)=1} P(k_i|R̄) × Π_{g_i(d_j)=0} P(k̄_i|R̄) )

P(k_i|R) stands for the probability that the index term k_i is present in a document
randomly selected from the set R. P(k̄_i|R) stands for the probability that the
index term k_i is not present in a document randomly selected from the set R.
The probabilities associated with the set R̄ have meanings which are analogous
to the ones just described.
Taking logarithms, recalling that P(k_i|R) + P(k̄_i|R) = 1, and ignoring
factors which are constant for all documents in the context of the same query,
we can finally write

    sim(d_j, q) ~ Σ_{i=1}^{t} w_{i,q} × w_{i,j} × ( log( P(k_i|R) / (1 − P(k_i|R)) )
                  + log( (1 − P(k_i|R̄)) / P(k_i|R̄) ) )

which is a key expression for ranking computation in the probabilistic model.
Since we do not know the set R in the beginning, it is necessary to devise
a method for initially computing the probabilities P(k_i|R) and P(k_i|R̄). There
are many alternatives for such computation. We discuss a couple of them below.
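The ranking expression above can be sketched directly in code. In the sketch below, documents and queries are sets of term identifiers (so w_{i,j}, w_{i,q} ∈ {0,1}), and the probability estimates are supplied as dictionaries; all the numbers are hypothetical, chosen only for illustration.

```python
import math

def bir_sim(doc_terms, query_terms, p_R, p_Rbar):
    """Key ranking expression of the probabilistic model: the sum, over
    terms present in both query and document, of
    log(P(ki|R)/(1 - P(ki|R))) + log((1 - P(ki|Rbar))/P(ki|Rbar))."""
    score = 0.0
    for k in query_terms & doc_terms:   # w_iq * w_ij = 1 only for shared terms
        score += math.log(p_R[k] / (1.0 - p_R[k]))
        score += math.log((1.0 - p_Rbar[k]) / p_Rbar[k])
    return score

# Hypothetical probability estimates for two query terms:
p_R    = {"computer": 0.8, "network": 0.6}
p_Rbar = {"computer": 0.3, "network": 0.4}
query  = {"computer", "network"}
print(bir_sim({"computer", "network"}, query, p_R, p_Rbar))  # both terms match
print(bir_sim({"network"}, query, p_R, p_Rbar))              # one term matches
```

A document matching more query terms (with these estimates) accumulates a larger log-odds score and is ranked higher.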
`
In the very beginning (i.e., immediately after the query specification), there
are no retrieved documents. Thus, one has to make simplifying assumptions such
as: (a) assume that P(k_i|R) is constant for all index terms k_i (typically, equal to
0.5) and (b) assume that the distribution of index terms among the non-relevant
documents can be approximated by the distribution of index terms among all
the documents in the collection. These two assumptions yield

    P(k_i|R) = 0.5
    P(k_i|R̄) = n_i / N

where, as already defined, n_i is the number of documents which contain the
index term k_i and N is the total number of documents in the collection. Given
this initial guess, we can then retrieve documents which contain query terms and
provide an initial probabilistic ranking for them. After that, this initial ranking
is improved as follows.
Let V be a subset of the documents initially retrieved and ranked by the
probabilistic model. Such a subset can be defined, for instance, as the top r
ranked documents where r is a previously defined threshold. Further, let V_i be
the subset of V composed of the documents in V which contain the index term
k_i. For simplicity, we also use V and V_i to refer to the number of elements in
these sets.
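The initial estimates just described can be sketched as follows. With P(k_i|R) = 0.5 the first log term of the ranking expression vanishes, so the initial term weight reduces to log((N − n_i)/n_i), an idf-like quantity; the numbers below are hypothetical.

```python
import math

def initial_term_weight(n_i, N):
    """BIR term weight under the initial guesses P(ki|R) = 0.5 and
    P(ki|Rbar) = n_i / N.  The first log term, log(0.5 / 0.5), is zero,
    so the weight reduces to log((N - n_i) / n_i)."""
    p_R = 0.5
    p_Rbar = n_i / N
    return (math.log(p_R / (1.0 - p_R))
            + math.log((1.0 - p_Rbar) / p_Rbar))

# A rare term (10 of 1000 documents) gets a high initial weight;
# a term occurring in half the collection gets log(1) = 0.
print(initial_term_weight(10, 1000))
print(initial_term_weight(500, 1000))
```

This connects the probabilistic model's starting point back to the idf intuition of the vector model: before any relevance information is available, rare terms dominate the ranking.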

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket