Ricardo Baeza-Yates
Berthier Ribeiro-Neto

ACM Press
New York

Addison-Wesley
Harlow, England • Reading, Massachusetts • Menlo Park, California • New York
Don Mills, Ontario • Amsterdam • Bonn • Sydney • Singapore • Tokyo • Madrid
San Juan • Milan • Mexico City • Seoul • Taipei
`
Copyright © 1999 by the ACM Press, A Division of the Association for Computing Machinery, Inc. (ACM).

Addison Wesley Longman Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England

and Associated Companies throughout the World.

The rights of the authors of this Work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.
`
While the publisher has made every attempt to trace all copyright owners and obtain permission to reproduce material, in a few cases this has proved impossible. Copyright holders of material which has not been acknowledged are encouraged to contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Addison Wesley Longman Limited has made every attempt to supply trade mark information about manufacturers and their products mentioned in this book. A list of the trademark designations and their owners appears on page viii.
`
Typeset in Computer Modern by 56
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-X
1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araujo Neto, 1960- . II. Title.
Z667.B34 1999
025.04—dc21

99-10033
CIP
`
t-dimensional vectorial space and standard linear algebra operations on vectors. For the classic probabilistic model, the framework is composed of sets, standard probability operations, and the Bayes' theorem.

In the remainder of this chapter, we discuss the various IR models shown in Figure 2.1. Throughout the discussion, we do not explicitly instantiate the components D, Q, F, and R(q_i, d_j) of each model. Such components should be quite clear from the discussion and can be easily inferred.
`
2.5 Classic Information Retrieval

In this section we briefly present the three classic models in information retrieval, namely, the Boolean, the vector, and the probabilistic models.

2.5.1 Basic Concepts
`
The classic models in information retrieval consider that each document is described by a set of representative keywords called index terms. An index term is simply a (document) word whose semantics helps in remembering the document's main themes. Thus, index terms are used to index and summarize the document contents. In general, index terms are mainly nouns because nouns have meaning by themselves and thus, their semantics is easier to identify and to grasp. Adjectives, adverbs, and connectives are less useful as index terms because they work mainly as complements. However, it might be interesting to consider all the distinct words in a document collection as index terms. For instance, this approach is adopted by some Web search engines as discussed in Chapter 13 (in which case, the document logical view is full text). We postpone a discussion on the problem of how to generate index terms until Chapter 7, where the issue is covered in detail.

Given a set of index terms for a document, we notice that not all terms are equally useful for describing the document contents. In fact, there are index terms which are simply vaguer than others. Deciding on the importance of a term for summarizing the contents of a document is not a trivial issue. Despite this difficulty, there are properties of an index term which are easily measured and which are useful for evaluating the potential of a term as such. For instance, consider a collection with a hundred thousand documents. A word which appears in each of the one hundred thousand documents is completely useless as an index term because it does not tell us anything about which documents the user might be interested in. On the other hand, a word which appears in just five documents is quite useful because it narrows down considerably the space of documents which might be of interest to the user. Thus, it should be clear that distinct index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document.
`
`
Let k_i be an index term, d_j be a document, and w_{i,j} >= 0 be a weight associated with the pair (k_i, d_j). This weight quantifies the importance of the index term for describing the document semantic contents.

Definition  Let t be the number of index terms in the system and k_i be a generic index term. K = {k_1, ..., k_t} is the set of all index terms. A weight w_{i,j} > 0 is associated with each index term k_i of a document d_j. For an index term which does not appear in the document text, w_{i,j} = 0. With the document d_j is associated an index term vector \vec{d}_j represented by \vec{d}_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}). Further, let g_i be a function that returns the weight associated with the index term k_i in any t-dimensional vector (i.e., g_i(\vec{d}_j) = w_{i,j}).
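As a small illustration (a sketch, not code from the book), the Python fragment below represents a document by its index term vector and implements the function g_i as a simple accessor; the vocabulary, weights, and names are hypothetical.

    # Minimal sketch of the index term vector representation defined above.
    # The vocabulary K = {k_1, ..., k_t} is fixed; document d_j is the vector of
    # weights (w_{1,j}, ..., w_{t,j}), with weight 0 for index terms it lacks.

    K = ["computer", "network", "protocol"]   # t = 3 index terms (hypothetical)
    d_j = [0.8, 0.5, 0.0]                     # w_{i,j}; "protocol" does not occur in d_j

    def g(i, doc_vector):
        """Return w_{i,j}, the weight of index term k_i in the given vector (i is 1-based)."""
        return doc_vector[i - 1]

    print(g(1, d_j))   # 0.8, the weight of "computer" in d_j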
`
As we later discuss, the index term weights are usually assumed to be mutually independent. This means that knowing the weight w_{i,j} associated with the pair (k_i, d_j) tells us nothing about the weight w_{i+1,j} associated with the pair (k_{i+1}, d_j). This is clearly a simplification because occurrences of index terms in a document are not uncorrelated. Consider, for instance, that the terms computer and network are used to index a given document which covers the area of computer networks. Frequently, in this document, the appearance of one of these two words attracts the appearance of the other. Thus, these two words are correlated and their weights could reflect this correlation. While mutual independence seems to be a strong simplification, it does simplify the task of computing index term weights and allows for fast ranking computation. Furthermore, taking advantage of index term correlations for improving the final document ranking is not a simple task. In fact, none of the many approaches proposed in the past has clearly demonstrated that index term correlations are advantageous (for ranking purposes) with general collections. Therefore, unless clearly stated otherwise, we assume mutual independence among index terms. In Chapter 5 we discuss modern retrieval techniques which are based on term correlations and which have been tested successfully with particular collections. These successes seem to be slowly shifting the current understanding towards a more favorable view of the usefulness of term correlations for information retrieval systems.

The above definitions provide support for discussing the three classic information retrieval models, namely, the Boolean, the vector, and the probabilistic models, as we now do.
`
2.5.2 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of an IR system. Furthermore, the queries are specified as Boolean expressions which have precise semantics. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic systems.
`
`
`
Figure 2.3  The three conjunctive components for the query [q = k_a ∧ (k_b ∨ ¬k_c)].
`
Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or non-relevant) without any notion of a grading scale, which prevents good retrieval performance. Thus, the Boolean model is in reality much more a data (instead of information) retrieval model. Second, while Boolean expressions have precise semantics, frequently it is not simple to translate an information need into a Boolean expression. In fact, most users find it difficult and awkward to express their query requests in terms of Boolean expressions. The Boolean expressions actually formulated by users often are quite simple (see Chapter 10 for a more thorough discussion on this issue). Despite these drawbacks, the Boolean model is still the dominant model with commercial document database systems and provides a good starting point for those new to the field.

The Boolean model considers that index terms are present or absent in a document. As a result, the index term weights are assumed to be all binary, i.e., w_{i,j} ∈ {0, 1}. A query q is composed of index terms linked by three connectives: not, and, or. Thus, a query is essentially a conventional Boolean expression which can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form, DNF). For instance, the query [q = k_a ∧ (k_b ∨ ¬k_c)] can be written in disjunctive normal form as [\vec{q}_{dnf} = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)], where each of the components is a binary weighted vector associated with the tuple (k_a, k_b, k_c). These binary weighted vectors are called the conjunctive components of \vec{q}_{dnf}. Figure 2.3 illustrates the three conjunctive components for the query q.
`
Definition  For the Boolean model, the index term weight variables are all binary, i.e., w_{i,j} ∈ {0, 1}. A query q is a conventional Boolean expression. Let \vec{q}_{dnf} be the disjunctive normal form for the query q. Further, let \vec{q}_{cc} be any of the conjunctive components of \vec{q}_{dnf}. The similarity of a document d_j to the query q is defined as

    sim(d_j, q) = \begin{cases} 1 & \text{if } \exists \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i, \; g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}
`
`
If sim(d_j, q) = 1 then the Boolean model predicts that the document d_j is relevant to the query q (it might not be). Otherwise, the prediction is that the document is not relevant.

The Boolean model predicts that each document is either relevant or non-relevant. There is no notion of a partial match to the query conditions. For instance, let d_j be a document for which \vec{d}_j = (0, 1, 0). Document d_j includes the index term k_b but is considered non-relevant to the query [q = k_a ∧ (k_b ∨ ¬k_c)].

The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The main disadvantages are that exact matching may lead to retrieval of too few or too many documents (see Chapter 10). Today, it is well known that index term weighting can lead to a substantial improvement in retrieval performance. Index term weighting brings us to the vector model.
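As a concrete illustration (a sketch, not code from the book), the following Python fragment evaluates the Boolean similarity defined above for the example query q = k_a ∧ (k_b ∨ ¬k_c): the query is stored as its conjunctive components over the tuple (k_a, k_b, k_c), and a document's binary vector over these terms must equal one of them.

    # Sketch of Boolean retrieval via conjunctive components (binary weights only).
    # q = k_a AND (k_b OR NOT k_c), in disjunctive normal form:
    q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]   # components over (k_a, k_b, k_c)

    def sim_boolean(doc_vector, dnf_components):
        """Return 1 if the document's binary vector matches one conjunctive
        component of the query DNF, 0 otherwise (no partial matching)."""
        return 1 if tuple(doc_vector) in dnf_components else 0

    # Document containing only k_b, as in the example above: predicted non-relevant.
    print(sim_boolean((0, 1, 0), q_dnf))   # 0
    # Document containing k_a and k_b but not k_c: matches (1, 1, 0), so relevant.
    print(sim_boolean((1, 1, 0), q_dnf))   # 1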
`
2.5.3 Vector Model

The vector model [697, 695] recognizes that the use of binary weights is too limiting and proposes a framework in which partial matching is possible. This is accomplished by assigning non-binary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between each document stored in the system and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is a lot more precise (in the sense that it better matches the user information need) than the document answer set retrieved by the Boolean model.

Definition  For the vector model, the weight w_{i,j} associated with a pair (k_i, d_j) is positive and non-binary. Further, the index terms in the query are also weighted. Let w_{i,q} be the weight associated with the pair [k_i, q], where w_{i,q} >= 0. Then, the query vector \vec{q} is defined as \vec{q} = (w_{1,q}, w_{2,q}, ..., w_{t,q}) where t is the total number of index terms in the system. As before, the vector for a document d_j is represented by \vec{d}_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).
`
Therefore, a document d_j and a user query q are represented as t-dimensional vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree of similarity of the document d_j with regard to the query q as the correlation between the vectors \vec{d}_j and \vec{q}. This correlation can be quantified, for instance, by the cosine of the angle between these two vectors. That is,

    sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}
`
Figure 2.4  The cosine of θ is adopted as sim(d_j, q).
`
where |\vec{d}_j| and |\vec{q}| are the norms of the document and query vectors. The factor |\vec{q}| does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor |\vec{d}_j| provides a normalization in the space of the documents.

Since w_{i,j} >= 0 and w_{i,q} >= 0, sim(q, d_j) varies from 0 to +1. Thus, instead of attempting to predict whether a document is relevant or not, the vector model ranks the documents according to their degree of similarity to the query. A document might be retrieved even if it matches the query only partially. For instance, one can establish a threshold on sim(d_j, q) and retrieve the documents with a degree of similarity above that threshold. But to compute rankings we need first to specify how index term weights are obtained.
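Before turning to term weights, here is a minimal Python sketch of the cosine ranking formula above; the document and query weights are hypothetical toy values and would in practice come from a weighting scheme such as the tf-idf scheme discussed next.

    import math

    def cosine_sim(doc, query):
        """sim(d_j, q) = (d_j . q) / (|d_j| * |q|); returns 0 if either vector is all zeros."""
        dot = sum(w_d * w_q for w_d, w_q in zip(doc, query))
        norm_d = math.sqrt(sum(w * w for w in doc))
        norm_q = math.sqrt(sum(w * w for w in query))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    # Hypothetical weights over t = 4 index terms.
    docs = {"d1": [0.5, 0.8, 0.0, 0.3], "d2": [0.0, 0.2, 0.7, 0.0]}
    q = [0.4, 0.6, 0.0, 0.0]

    # Rank the documents in decreasing order of similarity to the query.
    ranking = sorted(docs, key=lambda name: cosine_sim(docs[name], q), reverse=True)
    for name in ranking:
        print(name, round(cosine_sim(docs[name], q), 3))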
Index term weights can be calculated in many different ways. The work by Salton and McGill [698] reviews various term-weighting techniques. Here, we do not discuss them in detail. Instead, we concentrate on elucidating the main idea behind the most effective term-weighting techniques. This idea is related to the basic principles which support clustering techniques, as follows.

Given a collection C of objects and a vague description of a set A, the goal of a simple clustering algorithm might be to separate the collection C of objects into two sets: a first one which is composed of objects related to the set A and a second one which is composed of objects not related to the set A. Vague description here means that we do not have complete information for deciding precisely which objects are and which are not in the set A. For instance, one might be looking for a set A of cars which have a price comparable to that of a Lexus 400. Since it is not clear what the term comparable means exactly, there is not a precise (and unique) description of the set A. More sophisticated clustering algorithms might attempt to separate the objects of a collection into various clusters (or classes) according to their properties. For instance, patients of a doctor specializing in cancer could be classified into five classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the possible class descriptions might be imprecise (and not unique) and the problem is one of deciding to which of these classes a new patient should be assigned. In what follows, however, we only discuss the simpler version of the clustering problem (i.e., the one which considers only two classes) because all that is required is a decision on which documents are predicted to be relevant and which ones are predicted to be not relevant (with regard to a given user query).
`
`
To view the IR problem as one of clustering, we refer to the early work of Salton. We think of the documents as a collection C of objects and think of the user query as a (vague) specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem). In a clustering problem, two main issues have to be resolved. First, one needs to determine what are the features which better describe the objects in the set A. Second, one needs to determine what are the features which better distinguish the objects in the set A from the remaining objects in the collection C. The first set of features provides for quantification of intra-cluster similarity, while the second set of features provides for quantification of inter-cluster dissimilarity. The most successful clustering algorithms try to balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring the raw frequency of a term k_i inside a document d_j. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents (i.e., intra-document characterization). Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term k_i among the documents in the collection. This factor is usually referred to as the inverse document frequency or the idf factor. The motivation for usage of an idf factor is that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. As with good clustering algorithms, the most effective term-weighting schemes for IR try to balance these two effects.
`
Definition  Let N be the total number of documents in the system and n_i be the number of documents in which the index term k_i appears. Let freq_{i,j} be the raw frequency of term k_i in the document d_j (i.e., the number of times the term k_i is mentioned in the text of the document d_j). Then, the normalized frequency f_{i,j} of term k_i in document d_j is given by

    f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}        (2.1)

where the maximum is computed over all terms which are mentioned in the text of the document d_j. If the term k_i does not appear in the document d_j then f_{i,j} = 0. Further, let idf_i, the inverse document frequency for k_i, be given by

    idf_i = \log \frac{N}{n_i}        (2.2)

The best known term-weighting schemes use weights which are given by

    w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}        (2.3)
`
or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.

Several variations of the above expression for the weight w_{i,j} are described in an interesting paper by Salton and Buckley which appeared in 1988 [696]. However, in general, the above expression should provide a good weighting scheme for many collections.

For the query term weights, Salton and Buckley suggest

    w_{i,q} = \left( 0.5 + \frac{0.5 \; freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i}        (2.4)

where freq_{i,q} is the raw frequency of the term k_i in the text of the information request q.
The main advantages of the vector model are: (1) its term-weighting scheme improves retrieval performance; (2) its partial matching strategy allows retrieval of documents that approximate the query conditions; and (3) its cosine ranking formula sorts the documents according to their degree of similarity to the query. Theoretically, the vector model has the disadvantage that index terms are assumed to be mutually independent (equation 2.3 does not account for index term dependencies). However, in practice, consideration of term dependencies might be a disadvantage. Due to the locality of many term dependencies, their indiscriminate application to all the documents in the collection might in fact hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with general collections. It yields ranked answer sets which are difficult to improve upon without query expansion or relevance feedback (see Chapter 5) within the framework of the vector model. A large variety of alternative ranking methods have been compared to the vector model but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives. Furthermore, it is simple and fast. For these reasons, the vector model is a popular retrieval model nowadays.
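Putting the pieces together, the short sketch below (toy data, not from the book) builds tf-idf document vectors for a tiny collection and ranks the documents against a query with the cosine formula; for simplicity it weights the query with the same formula used for documents rather than equation (2.4).

    import math
    from collections import Counter

    def tfidf_vector(tokens, vocab, doc_freq, N):
        """Build a tf-idf weight vector over vocab from a token list (eqs 2.1-2.3, simplified)."""
        counts = Counter(tokens)
        max_freq = max(counts.values()) if counts else 1
        return [(counts[t] / max_freq) * math.log(N / doc_freq[t]) if counts[t] else 0.0
                for t in vocab]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    docs = {"d1": ["computer", "network", "network"], "d2": ["computer", "protocol"]}
    vocab = ["computer", "network", "protocol"]
    N = len(docs)
    doc_freq = {t: sum(t in toks for toks in docs.values()) for t in vocab}
    vectors = {name: tfidf_vector(toks, vocab, doc_freq, N) for name, toks in docs.items()}

    query_vec = tfidf_vector(["network", "computer"], vocab, doc_freq, N)
    for name in sorted(vectors, key=lambda n: cosine(vectors[n], query_vec), reverse=True):
        print(name, round(cosine(vectors[name], query_vec), 3))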
`
2.5.4 Probabilistic Model

In this section we describe the classic probabilistic model introduced in 1976 by Robertson and Sparck Jones [677] which later became known as the binary independence retrieval (BIR) model. Our discussion is intentionally brief and focuses mainly on highlighting the key features of the model. With this purpose in mind, we do not detain ourselves in subtleties regarding the binary independence assumption for the model. The section on bibliographic discussion points to references which cover these details.
`
The probabilistic model attempts to capture the IR problem within a probabilistic framework. The fundamental idea is as follows. Given a user query, there is a set of documents which contains exactly the relevant documents and no others. Let us refer to this set of documents as the ideal answer set. Given the description of this ideal answer set, we would have no problems in retrieving its documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal answer set (which is analogous to interpreting the IR problem as a problem of clustering). The problem is that we do not know exactly what these properties are. All we know is that there are index terms whose semantics should be used to characterize these properties. Since these properties are not known at query time, an effort has to be made at initially guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of the ideal answer set which is used to retrieve a first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set. Such interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones are relevant and which ones are not (in truth, only the first top documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, it is expected that such a description will evolve and become closer to the real description of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning the description of the ideal answer set. Furthermore, a conscious effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle)  Given a user query q and a document d_j in the collection, the probabilistic model tries to estimate the probability that the user will find the document d_j interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query. Documents not in this set are predicted to be non-relevant.

This assumption is quite troublesome because it does not state explicitly how to compute the probabilities of relevance. In fact, not even the sample space which is to be used for defining such probabilities is given.

Given a query q, the probabilistic model assigns to each document d_j, as a measure of its similarity to the query, the ratio P(d_j relevant-to q)/P(d_j non-relevant-to q) which computes the odds of the document d_j being relevant to the query q. Taking the odds of relevance as the rank minimizes the probability of an erroneous judgement [282, 785].
`
Definition  For the probabilistic model, the index term weight variables are all binary, i.e., w_{i,j} ∈ {0, 1}, w_{i,q} ∈ {0, 1}. A query q is a subset of index terms. Let R be the set of documents known (or initially guessed) to be relevant. Let \bar{R} be the complement of R (i.e., the set of non-relevant documents). Let P(R|\vec{d}_j) be the probability that the document d_j is relevant to the query q and P(\bar{R}|\vec{d}_j) be the probability that d_j is non-relevant to q. The similarity sim(d_j, q) of the document d_j to the query q is defined as the ratio

    sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)}

Using Bayes' rule,

    sim(d_j, q) = \frac{P(\vec{d}_j \mid R) \times P(R)}{P(\vec{d}_j \mid \bar{R}) \times P(\bar{R})}

P(\vec{d}_j|R) stands for the probability of randomly selecting the document d_j from the set R of relevant documents. Further, P(R) stands for the probability that a document randomly selected from the entire collection is relevant. The meanings attached to P(\vec{d}_j|\bar{R}) and P(\bar{R}) are analogous and complementary.

Since P(R) and P(\bar{R}) are the same for all the documents in the collection, we write

    sim(d_j, q) \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}

Assuming independence of index terms,

    sim(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R)\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)\right)}{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R})\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})\right)}

P(k_i|R) stands for the probability that the index term k_i is present in a document randomly selected from the set R. P(\bar{k}_i|R) stands for the probability that the index term k_i is not present in a document randomly selected from the set R. The probabilities associated with the set \bar{R} have meanings which are analogous to the ones just described.

Taking logarithms, recalling that P(k_i|R) + P(\bar{k}_i|R) = 1, and ignoring factors which are constant for all documents in the context of the same query, we can finally write

    sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)

which is a key expression for ranking computation in the probabilistic model.
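A minimal Python sketch of this ranking expression follows (not code from the book); it assumes binary weights and that the probabilities P(k_i|R) and P(k_i|R̄) have already been estimated, for example by the initial guesses discussed below.

    import math

    def sim_bir(doc_terms, query_terms, p_rel, p_nonrel):
        """Rank value proportional to the expression above: sum of log-odds over the
        index terms present in both the query and the document (binary weights)."""
        score = 0.0
        for term in query_terms & doc_terms:          # terms with w_{i,q} = w_{i,j} = 1
            score += math.log(p_rel[term] / (1.0 - p_rel[term]))
            score += math.log((1.0 - p_nonrel[term]) / p_nonrel[term])
        return score

    # Hypothetical estimates for two query terms.
    p_rel = {"k1": 0.5, "k2": 0.5}            # P(k_i | R)
    p_nonrel = {"k1": 0.05, "k2": 0.2}        # P(k_i | R-bar)
    print(sim_bir({"k1", "k3"}, {"k1", "k2"}, p_rel, p_nonrel))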
Since we do not know the set R in the beginning, it is necessary to devise a method for initially computing the probabilities P(k_i|R) and P(k_i|R̄). There are many alternatives for such computation. We discuss a couple of them below.
`
`
`
In the very beginning (i.e., immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions such as: (a) assume that P(k_i|R) is constant for all index terms k_i (typically, equal to 0.5) and (b) assume that the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions yield

    P(k_i \mid R) = 0.5        P(k_i \mid \bar{R}) = \frac{n_i}{N}

where, as already defined, n_i is the number of documents which contain the index term k_i and N is the total number of documents in the collection. Given this initial guess, we can then retrieve documents which contain query terms and provide an initial probabilistic ranking for them. After that, this initial ranking is improved as follows.

Let V be a subset of the documents initially retrieved and ranked by the probabilistic model. Such a subset can be defined, for instance, as the top r ranked documents where r is a previously defined threshold.
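The initial guesses above translate directly into code; the sketch below (hypothetical values, assuming the sim_bir helper sketched earlier) builds the starting estimates that would later be refined from the set V of top-ranked documents.

    def initial_estimates(query_terms, doc_freq, N):
        """P(k_i|R) = 0.5 and P(k_i|R-bar) = n_i / N for every query term k_i."""
        p_rel = {term: 0.5 for term in query_terms}
        p_nonrel = {term: doc_freq[term] / N for term in query_terms}
        return p_rel, p_nonrel

    # Hypothetical document frequencies n_i over a collection of N = 1000 documents.
    p_rel, p_nonrel = initial_estimates({"k1", "k2"}, {"k1": 50, "k2": 200}, 1000)
    print(p_rel)      # P(k_i|R) is 0.5 for both query terms
    print(p_nonrel)   # P(k1|R-bar) = 0.05, P(k2|R-bar) = 0.2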