Ricardo Baeza-Yates
Berthier Ribeiro-Neto

ACM Press
New York

Addison-Wesley

Harlow, England • Reading, Massachusetts
Menlo Park, California • New York
Don Mills, Ontario • Amsterdam • Bonn
Sydney • Singapore • Tokyo • Madrid
San Juan • Milan • Mexico City • Seoul • Taipei

Copyright © 1999 by the ACM Press, A Division of the Association for Computing
Machinery, Inc. (ACM).

Addison Wesley Longman Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England

and Associated Companies throughout the World.

The rights of the authors of this Work have been asserted by them in accordance with
the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording or otherwise, without either the prior written permission of
the publisher or a licence permitting restricted copying in the United Kingdom issued
by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.

While the publisher has made every attempt to trace all copyright owners and obtain
permission to reproduce material, in a few cases this has proved impossible.
Copyright holders of material which has not been acknowledged are encouraged to
contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Addison Wesley Longman Limited has made
every attempt to supply trade mark information about manufacturers and their
products mentioned in this book. A list of the trademark designations and their
owners appears on page viii.

Typeset in Computer Modern
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-X

1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araujo
Neto, 1960- . II. Title.
Z667.B34 1999
025.04 dc21

99-10033
CIP

24  MODELING

t-dimensional vectorial space and standard linear algebra operations on vectors.
For the classic probabilistic model, the framework is composed of sets, standard
probability operations, and Bayes' theorem.

In the remainder of this chapter, we discuss the various IR models shown
in Figure 2.1. Throughout the discussion, we do not explicitly instantiate the
components D, Q, F, and R(q_i, d_j) of each model. Such components should be
quite clear from the discussion and can be easily inferred.

2.5 Classic Information Retrieval

In this section we briefly present the three classic models in information retrieval,
namely, the Boolean, the vector, and the probabilistic models.

2.5.1 Basic Concepts

The classic models in information retrieval consider that each document is
described by a set of representative keywords called index terms. An index term
is simply a (document) word whose semantics helps in remembering the document's
main themes. Thus, index terms are used to index and summarize the
document contents. In general, index terms are mainly nouns because nouns
have meaning by themselves and thus, their semantics is easier to identify and
to grasp. Adjectives, adverbs, and connectives are less useful as index terms
because they work mainly as complements. However, it might be interesting
to consider all the distinct words in a document collection as index terms. For
instance, this approach is adopted by some Web search engines as discussed in
Chapter 13 (in which case, the document logical view is full text). We postpone
a discussion on the problem of how to generate index terms until Chapter 7,
where the issue is covered in detail.

Given a set of index terms for a document, we notice that not all terms
are equally useful for describing the document contents. In fact, there are index
terms which are simply vaguer than others. Deciding on the importance of a
term for summarizing the contents of a document is not a trivial issue. Despite
this difficulty, there are properties of an index term which are easily measured
and which are useful for evaluating the potential of a term as such. For instance,
consider a collection with a hundred thousand documents. A word which appears
in each of the one hundred thousand documents is completely useless as an index
term because it does not tell us anything about which documents the user might
be interested in. On the other hand, a word which appears in just five documents
is quite useful because it narrows down considerably the space of documents
which might be of interest to the user. Thus, it should be clear that distinct
index terms have varying relevance when used to describe document contents.
This effect is captured through the assignment of numerical weights to each index
term of a document.

CLASSIC INFORMATION RETRIEVAL  25

Let k_i be an index term, d_j be a document, and w_{i,j} >= 0 be a weight
associated with the pair (k_i, d_j). This weight quantifies the importance of the
index term for describing the document semantic contents.

Definition  Let t be the number of index terms in the system and k_i be a
generic index term. K = {k_1, ..., k_t} is the set of all index terms. A weight
w_{i,j} > 0 is associated with each index term k_i of a document d_j. For an
index term which does not appear in the document text, w_{i,j} = 0. With the
document d_j is associated an index term vector d_j represented by
d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}). Further, let g_i be a function that returns
the weight associated with the index term k_i in any t-dimensional vector
(i.e., g_i(d_j) = w_{i,j}).
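The definition above can be made concrete with a small sketch. The terms and weights below are invented for illustration only (the book defines no such data):

```python
# Toy illustration of the preceding definition (invented data):
# documents are t-dimensional weight vectors over K = {k_1, ..., k_t},
# and g_i extracts the weight w_{i,j} from any such vector.

index_terms = ("computer", "network", "protocol")  # K, with t = 3

# d_j = (w_{1,j}, ..., w_{t,j}); a term absent from the document text
# has weight 0.
d_1 = (0.8, 0.5, 0.0)

def g(i, vec):
    """Return the weight associated with index term k_i (1-based) in vec."""
    return vec[i - 1]

assert g(2, d_1) == 0.5   # w_{2,1}: weight of "network" in d_1
assert g(3, d_1) == 0.0   # k_3 does not appear in d_1
```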

As we later discuss, the index term weights are usually assumed to be
mutually independent. This means that knowing the weight w_{i,j} associated with
the pair (k_i, d_j) tells us nothing about the weight w_{i+1,j} associated with the
pair (k_{i+1}, d_j). This is clearly a simplification because occurrences of index
terms in a document are not uncorrelated. Consider, for instance, that the terms
computer and network are used to index a given document which covers the area of
computer networks. Frequently, in this document, the appearance of one of these
two words attracts the appearance of the other. Thus, these two words are
correlated and their weights could reflect this correlation. While mutual independence
seems to be a strong simplification, it does simplify the task of computing index
term weights and allows for fast ranking computation. Furthermore, taking
advantage of index term correlations for improving the final document ranking is
not a simple task. In fact, none of the many approaches proposed in the past
has clearly demonstrated that index term correlations are advantageous (for
ranking purposes) with general collections. Therefore, unless clearly stated
otherwise, we assume mutual independence among index terms. In Chapter 5 we
discuss modern retrieval techniques which are based on term correlations and
which have been tested successfully with particular collections. These successes
seem to be slowly shifting the current understanding towards a more favorable
view of the usefulness of term correlations for information retrieval systems.

The above definitions provide support for discussing the three classic
information retrieval models, namely, the Boolean, the vector, and the probabilistic
models, as we now do.

2.5.2 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean
algebra. Since the concept of a set is quite intuitive, the Boolean model provides
a framework which is easy to grasp by a common user of an IR system.
Furthermore, the queries are specified as Boolean expressions which have precise
semantics. Given its inherent simplicity and neat formalism, the Boolean model
received great attention in past years and was adopted by many of the early
commercial bibliographic systems.

Figure 2.3  The three conjunctive components for the query [q = k_a ∧ (k_b ∨ ¬k_c)].

Unfortunately, the Boolean model suffers from major drawbacks. First,
its retrieval strategy is based on a binary decision criterion (i.e., a document is
predicted to be either relevant or non-relevant) without any notion of a grading
scale, which prevents good retrieval performance. Thus, the Boolean model is
in reality much more a data (instead of information) retrieval model. Second,
while Boolean expressions have precise semantics, frequently it is not simple to
translate an information need into a Boolean expression. In fact, most users find
it difficult and awkward to express their query requests in terms of Boolean
expressions. The Boolean expressions actually formulated by users often are quite
simple (see Chapter 10 for a more thorough discussion on this issue). Despite
these drawbacks, the Boolean model is still the dominant model with commercial
document database systems and provides a good starting point for those new to
the field.

The Boolean model considers that index terms are present or absent in a
document. As a result, the index term weights are assumed to be all binary, i.e.,
w_{i,j} ∈ {0, 1}. A query q is composed of index terms linked by three connectives:
not, and, or. Thus, a query is essentially a conventional Boolean expression which
can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive
normal form - DNF). For instance, the query [q = k_a ∧ (k_b ∨ ¬k_c)] can be written
in disjunctive normal form as [q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)], where each of
the components is a binary weighted vector associated with the tuple (k_a, k_b, k_c).
These binary weighted vectors are called the conjunctive components of q_dnf.
Figure 2.3 illustrates the three conjunctive components for the query q.

Definition  For the Boolean model, the index term weight variables are all
binary, i.e., w_{i,j} ∈ {0, 1}. A query q is a conventional Boolean expression. Let
q_dnf be the disjunctive normal form for the query q. Further, let q_cc be any of
the conjunctive components of q_dnf. The similarity of a document d_j to the
query q is defined as

    sim(d_j, q) = 1   if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc))
                  0   otherwise

If sim(d_j, q) = 1 then the Boolean model predicts that the document d_j is
relevant to the query q (it might not be). Otherwise, the prediction is that the
document is not relevant.

The Boolean model predicts that each document is either relevant or
non-relevant. There is no notion of a partial match to the query conditions. For
instance, let d_j be a document for which d_j = (0, 1, 0). Document d_j includes
the index term k_b but is considered non-relevant to the query [q = k_a ∧ (k_b ∨ ¬k_c)].

The main advantages of the Boolean model are the clean formalism behind
the model and its simplicity. The main disadvantages are that exact matching
may lead to retrieval of too few or too many documents (see Chapter 10). Today,
it is well known that index term weighting can lead to a substantial improvement
in retrieval performance. Index term weighting brings us to the vector model.
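The Boolean similarity just defined can be sketched in a few lines. The sketch assumes the query has already been converted to disjunctive normal form (the conversion itself is not shown, and the data are illustrative):

```python
# Boolean model similarity: sim(d_j, q) = 1 iff some conjunctive
# component of q_dnf agrees with the document on every index term.
# Vectors are binary weights over the term tuple (k_a, k_b, k_c).

def sim(doc, q_dnf):
    """Return 1 if the document matches any conjunctive component exactly."""
    return 1 if any(doc == cc for cc in q_dnf) else 0

# q = k_a AND (k_b OR NOT k_c), whose DNF is (1,1,1) v (1,1,0) v (1,0,0):
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

assert sim((1, 1, 0), q_dnf) == 1  # predicted relevant
assert sim((0, 1, 0), q_dnf) == 0  # contains k_b, still non-relevant
```

Note the all-or-nothing behavior: the document (0, 1, 0) from the example above scores 0 despite containing a query term, which is exactly the lack of partial matching the text criticizes.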

2.5.3 Vector Model

The vector model [697, 695] recognizes that the use of binary weights is too
limiting and proposes a framework in which partial matching is possible. This
is accomplished by assigning non-binary weights to index terms in queries and
in documents. These term weights are ultimately used to compute the degree
of similarity between each document stored in the system and the user query.
By sorting the retrieved documents in decreasing order of this degree of
similarity, the vector model takes into consideration documents which match the query
terms only partially. The main resultant effect is that the ranked document
answer set is a lot more precise (in the sense that it better matches the user
information need) than the document answer set retrieved by the Boolean model.

Definition  For the vector model, the weight w_{i,j} associated with a pair
(k_i, d_j) is positive and non-binary. Further, the index terms in the query are
also weighted. Let w_{i,q} be the weight associated with the pair [k_i, q], where
w_{i,q} >= 0. Then, the query vector q is defined as
q = (w_{1,q}, w_{2,q}, ..., w_{t,q}) where t is the total number of index terms in
the system. As before, the vector for a document d_j is represented by
d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).

Therefore, a document d_j and a user query q are represented as
t-dimensional vectors as shown in Figure 2.4. The vector model proposes to evaluate
the degree of similarity of the document d_j with regard to the query q as the
correlation between the vectors d_j and q. This correlation can be quantified, for
instance, by the cosine of the angle between these two vectors. That is,

    sim(d_j, q) = (d_j • q) / (|d_j| × |q|)
                = (Σ_{i=1}^t w_{i,j} × w_{i,q}) /
                  (sqrt(Σ_{i=1}^t w_{i,j}^2) × sqrt(Σ_{i=1}^t w_{i,q}^2))

Figure 2.4  The cosine of θ is adopted as sim(d_j, q).

where |d_j| and |q| are the norms of the document and query vectors. The factor
|q| does not affect the ranking (i.e., the ordering of the documents) because it
is the same for all documents. The factor |d_j| provides a normalization in the
space of the documents.

Since w_{i,j} >= 0 and w_{i,q} >= 0, sim(q, d_j) varies from 0 to +1. Thus,
instead of attempting to predict whether a document is relevant or not, the vector
model ranks the documents according to their degree of similarity to the query. A
document might be retrieved even if it matches the query only partially. For
instance, one can establish a threshold on sim(d_j, q) and retrieve the documents
with a degree of similarity above that threshold. But to compute rankings we
need first to specify how index term weights are obtained.
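A minimal sketch of the cosine ranking just described, including a retrieval threshold; the weights are invented rather than computed from a real collection:

```python
# Rank documents by sim(d_j, q) = (d_j . q) / (|d_j| |q|), then keep
# those above a similarity threshold. Data are illustrative only.
import math

def cosine_sim(d, q):
    """Cosine of the angle between vectors d and q (0.0 for a zero vector)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

docs = {"d1": [0.8, 0.0, 0.4], "d2": [0.2, 0.7, 0.0]}
query = [1.0, 0.0, 0.5]

threshold = 0.2
ranked = sorted(
    (j for j in docs if cosine_sim(docs[j], query) > threshold),
    key=lambda j: cosine_sim(docs[j], query),
    reverse=True,
)
assert ranked == ["d1", "d2"]  # d2 matches the query only partially
```

Unlike the Boolean sketch earlier, the partially matching document d2 is still retrieved and simply ranked lower.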
Index term weights can be calculated in many different ways. The work by
Salton and McGill [698] reviews various term-weighting techniques. Here, we do
not discuss them in detail. Instead, we concentrate on elucidating the main idea
behind the most effective term-weighting techniques. This idea is related to the
basic principles which support clustering techniques, as follows.
Given a collection C of objects and a vague description of a set A, the goal of
a simple clustering algorithm might be to separate the collection C of objects into
two sets: a first one which is composed of objects related to the set A and a second
one which is composed of objects not related to the set A. Vague description here
means that we do not have complete information for deciding precisely which
objects are and which are not in the set A. For instance, one might be looking
for a set A of cars which have a price comparable to that of a Lexus 400. Since it
is not clear what the term comparable means exactly, there is not a precise (and
unique) description of the set A. More sophisticated clustering algorithms might
attempt to separate the objects of a collection into various clusters (or classes)
according to their properties. For instance, patients of a doctor specializing
in cancer could be classified into five classes: terminal, advanced, metastasis,
diagnosed, and healthy. Again, the possible class descriptions might be imprecise
(and not unique) and the problem is one of deciding to which of these classes
a new patient should be assigned. In what follows, however, we only discuss
the simpler version of the clustering problem (i.e., the one which considers only
two classes) because all that is required is a decision on which documents are
predicted to be relevant and which ones are predicted to be not relevant (with
regard to a given user query).

To view the IR problem as one of clustering, we refer to the early work of
Salton. We think of the documents as a collection C of objects and think of the
user query as a (vague) specification of a set A of objects. In this scenario, the
IR problem can be reduced to the problem of determining which documents are
in the set A and which ones are not (i.e., the IR problem can be viewed as a
clustering problem). In a clustering problem, two main issues have to be resolved.
First, one needs to determine what are the features which better describe the
objects in the set A. Second, one needs to determine what are the features
which better distinguish the objects in the set A from the remaining objects in
the collection C. The first set of features provides for quantification of
intra-cluster similarity, while the second set of features provides for quantification
of inter-cluster dissimilarity. The most successful clustering algorithms try to
balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring
the raw frequency of a term k_i inside a document d_j. Such term frequency is
usually referred to as the tf factor and provides one measure of how well that
term describes the document contents (i.e., intra-document characterization).
Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse
of the frequency of a term k_i among the documents in the collection. This
factor is usually referred to as the inverse document frequency or the idf factor.
The motivation for usage of an idf factor is that terms which appear in many
documents are not very useful for distinguishing a relevant document from a
non-relevant one. As with good clustering algorithms, the most effective
term-weighting schemes for IR try to balance these two effects.

Definition  Let N be the total number of documents in the system and n_i be
the number of documents in which the index term k_i appears. Let freq_{i,j} be
the raw frequency of term k_i in the document d_j (i.e., the number of times the
term k_i is mentioned in the text of the document d_j). Then, the normalized
frequency f_{i,j} of term k_i in document d_j is given by

    f_{i,j} = freq_{i,j} / max_l freq_{l,j}                           (2.1)

where the maximum is computed over all terms which are mentioned in the text
of the document d_j. If the term k_i does not appear in the document d_j then
f_{i,j} = 0. Further, let idf_i, the inverse document frequency for k_i, be given by

    idf_i = log (N / n_i)                                             (2.2)

The best known term-weighting schemes use weights which are given by

    w_{i,j} = f_{i,j} × log (N / n_i)                                 (2.3)

or by a variation of this formula. Such term-weighting strategies are called tf-idf
schemes.

Several variations of the above expression for the weight w_{i,j} are described in
an interesting paper by Salton and Buckley which appeared in 1988 [696]. However,
in general, the above expression should provide a good weighting scheme for
many collections. For the query term weights, Salton and Buckley suggest

    w_{i,q} = (0.5 + (0.5 × freq_{i,q}) / (max_l freq_{l,q})) × log (N / n_i)    (2.4)

where freq_{i,q} is the raw frequency of the term k_i in the text of the information
request q.

The main advantages of the vector model are: (1) its term-weighting scheme
improves retrieval performance; (2) its partial matching strategy allows retrieval
of documents that approximate the query conditions; and (3) its cosine ranking
formula sorts the documents according to their degree of similarity to the
query. Theoretically, the vector model has the disadvantage that index terms are
assumed to be mutually independent (equation 2.3 does not account for index
term dependencies). However, in practice, consideration of term dependencies
might be a disadvantage. Due to the locality of many term dependencies, their
indiscriminate application to all the documents in the collection might in fact
hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with
general collections. It yields ranked answer sets which are difficult to improve
upon without query expansion or relevance feedback (see Chapter 5) within the
framework of the vector model. A large variety of alternative ranking methods
have been compared to the vector model but the consensus seems to be that,
in general, the vector model is either superior or almost as good as the known
alternatives. Furthermore, it is simple and fast. For these reasons, the vector
model is a popular retrieval model nowadays.

2.5.4 Probabilistic Model

In this section we describe the classic probabilistic model introduced in 1976
by Robertson and Sparck Jones [677] which later became known as the binary
independence retrieval (BIR) model. Our discussion is intentionally brief and
focuses mainly on highlighting the key features of the model. With this purpose
in mind, we do not detain ourselves in subtleties regarding the binary
independence assumption for the model. The section on bibliographic discussion points
to references which cover these details.

The probabilistic model attempts to capture the IR problem within a
probabilistic framework. The fundamental idea is as follows. Given a user query,
there is a set of documents which contains exactly the relevant documents and
no other. Let us refer to this set of documents as the ideal answer set. Given the
description of this ideal answer set, we would have no problems in retrieving its
documents. Thus, we can think of the querying process as a process of specifying
the properties of an ideal answer set (which is analogous to interpreting the
IR problem as a problem of clustering). The problem is that we do not know
exactly what these properties are. All we know is that there are index terms
whose semantics should be used to characterize these properties. Since these
properties are not known at query time, an effort has to be made at initially
guessing what they could be. This initial guess allows us to generate a
preliminary probabilistic description of the ideal answer set which is used to retrieve
a first set of documents. An interaction with the user is then initiated with the
purpose of improving the probabilistic description of the ideal answer set. Such
interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones
are relevant and which ones are not (in truth, only the first top documents need
to be examined). The system then uses this information to refine the description
of the ideal answer set. By repeating this process many times, it is expected
that such a description will evolve and become closer to the real description of
the ideal answer set. Thus, one should always have in mind the need to guess at
the beginning the description of the ideal answer set. Furthermore, a conscious
effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle)  Given a user query q and a document d_j
in the collection, the probabilistic model tries to estimate the probability that
the user will find the document d_j interesting (i.e., relevant). The model
assumes that this probability of relevance depends on the query and the document
representations only. Further, the model assumes that there is a subset of all
documents which the user prefers as the answer set for the query q. Such an
ideal answer set is labeled R and should maximize the overall probability of
relevance to the user. Documents in the set R are predicted to be relevant to the
query. Documents not in this set are predicted to be non-relevant.

This assumption is quite troublesome because it does not state explicitly
how to compute the probabilities of relevance. In fact, not even the sample space
which is to be used for defining such probabilities is given.

Given a query q, the probabilistic model assigns to each document d_j, as
a measure of its similarity to the query, the ratio P(d_j relevant-to q)/P(d_j
non-relevant-to q) which computes the odds of the document d_j being relevant to
the query q. Taking the odds of relevance as the rank minimizes the probability
of an erroneous judgement [282, 785].

Definition  For the probabilistic model, the index term weight variables are
all binary, i.e., w_{i,j} ∈ {0, 1}, w_{i,q} ∈ {0, 1}. A query q is a subset of index
terms. Let R be the set of documents known (or initially guessed) to be relevant.
Let R̄ be the complement of R (i.e., the set of non-relevant documents). Let P(R|d_j)
be the probability that the document d_j is relevant to the query q and P(R̄|d_j)
be the probability that d_j is non-relevant to q. The similarity sim(d_j, q) of the
document d_j to the query q is defined as the ratio

    sim(d_j, q) = P(R|d_j) / P(R̄|d_j)

Using Bayes' rule,

    sim(d_j, q) = [P(d_j|R) × P(R)] / [P(d_j|R̄) × P(R̄)]

P(d_j|R) stands for the probability of randomly selecting the document d_j from
the set R of relevant documents. Further, P(R) stands for the probability that a
document randomly selected from the entire collection is relevant. The meanings
attached to P(d_j|R̄) and P(R̄) are analogous and complementary.

Since P(R) and P(R̄) are the same for all the documents in the collection,
we write

    sim(d_j, q) ~ P(d_j|R) / P(d_j|R̄)

Assuming independence of index terms,

    sim(d_j, q) ~ [ Π_{g_i(d_j)=1} P(k_i|R) × Π_{g_i(d_j)=0} P(k̄_i|R) ] /
                  [ Π_{g_i(d_j)=1} P(k_i|R̄) × Π_{g_i(d_j)=0} P(k̄_i|R̄) ]

P(k_i|R) stands for the probability that the index term k_i is present in a
document randomly selected from the set R. P(k̄_i|R) stands for the probability that
the index term k_i is not present in a document randomly selected from the set R.
The probabilities associated with the set R̄ have meanings which are analogous
to the ones just described.

Taking logarithms, recalling that P(k_i|R) + P(k̄_i|R) = 1, and ignoring
factors which are constant for all documents in the context of the same query,
we can finally write

    sim(d_j, q) ~ Σ_{i=1}^t w_{i,q} × w_{i,j} × ( log [P(k_i|R) / (1 − P(k_i|R))]
                                                + log [(1 − P(k_i|R̄)) / P(k_i|R̄)] )

which is a key expression for ranking computation in the probabilistic model.
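The ranking expression above can be sketched in code. The sketch uses the standard initial estimates discussed next, P(k_i|R) = 0.5 and P(k_i|R̄) = n_i/N, on an invented collection; with binary query and document weights, the sum reduces to the query terms present in the document:

```python
# Probabilistic (BIR) ranking with the standard initial estimates:
# P(k_i|R) = 0.5 and P(k_i|Rbar) = n_i / N. Collection is illustrative.
import math

docs = {
    "d1": {"computer", "network"},
    "d2": {"protocol", "security"},
    "d3": {"computer", "science"},
    "d4": {"database", "index"},
}
N = len(docs)
query = {"computer", "network"}

def sim(doc_terms):
    """Sum, over query terms in the document, of
    log[P(k|R)/(1-P(k|R))] + log[(1-P(k|Rbar))/P(k|Rbar)]."""
    score = 0.0
    for k in query & doc_terms:
        n = sum(1 for terms in docs.values() if k in terms)
        p_rel = 0.5         # initial guess for P(k_i|R)
        p_nonrel = n / N    # initial guess for P(k_i|Rbar)
        score += math.log(p_rel / (1.0 - p_rel))
        score += math.log((1.0 - p_nonrel) / p_nonrel)
    return score

ranking = sorted(docs, key=lambda j: sim(docs[j]), reverse=True)
assert ranking[0] == "d1"  # contains the rarer query term "network"
```

With P(k_i|R) fixed at 0.5, each matching term contributes log[(N − n_i)/n_i], an idf-like quantity; a term occurring in half the collection, like "computer" here, contributes nothing under these initial estimates.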

Since we do not know the set R in the beginning, it is necessary to devise
a method for initially computing the probabilities P(k_i|R) and P(k_i|R̄). There
are many alternatives for such computation. We discuss a couple of them below.

In the very beginning (i.e., immediately after the query specification), there
are no retrieved documents. Thus, one has to make simplifying assumptions such
as: (a) assume that P(k_i|R) is constant for all index terms k_i (typically, equal to
0.5) and (b) assume that the distribution of index terms among the non-relevant
documents can be approximated by the distribution of index terms among all
the documents in the collection. These two assumptions yield

    P(k_i|R) = 0.5
    P(k_i|R̄) = n_i / N

where, as already defined, n_i is the number of documents which contain the
index term k_i and N is the total number of documents in the collection. Given
this initial guess, we can then retrieve documents which contain query terms and
provide an initial probabilistic ranking for them. After that, this initial ranking
is improved as follows.

Let V be a subset of the documents initially retrieved and ranked by the
probabilistic model. Such a subset can be defined, for instance, as the top r
ranked documents where r is a previously defined threshold. Further, let V_i be
the subset of V composed of the documents in V which contain the index term
k_i. For simplicity, we also use V and V_i to refer to the number of elem