Ricardo Baeza-Yates

Berthier Ribeiro-Neto

ACM Press
New York

Addison-Wesley

Harlow, England • Reading, Massachusetts
Menlo Park, California • New York
Don Mills, Ontario • Amsterdam • Bonn
Sydney • Singapore • Tokyo • Madrid
San Juan • Milan • Mexico City • Seoul • Taipei
`
`
`
`
Copyright © 1999 by the ACM Press, A Division of the Association for Computing
Machinery, Inc. (ACM).
`
`Addison Wesley Longman Limited
`Edinburgh Gate
`Harlow
`Essex CM20 2JE
`
`England
`
`and Associated Companies throughout the World.
`
`The rights of the authors of this Work have been asserted by them in accordance with
`the Copyright, Designs and Patents Act 1988.
`
`All rights reserved. No part of this publication may be reproduced, stored in a
`retrieval system, or transmitted in any form or by any means, electronic, mechanical,
`photocopying, recording or otherwise, without either the prior written permission of
`the publisher or a licence permitting restricted copying in the United Kingdom issued
`by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.
`
`While the publisher has made every attempt to trace all copyright owners and obtain
`permission to reproduce material, in a few cases this has proved impossible.
`Copyright holders of material which has not been acknowledged are encouraged to
`contact the publisher.
`
`Many of the designations used by manufacturers and sellers to distinguish their
`products are claimed as trademarks. Addison Wesley Longman Limited has made
`every attempt to supply trade mark information about manufacturers and their
`products mentioned in this book. A list of the trademark designations and their
`owners appears on page vii.
`
Typeset in Computer Modern
`Printed and bound in the United States of America
`
`First printed 1999
`
`ISBN 0-201-39829-X
`
`British Library Cataloguing-in-Publication Data
`A catalogue record for this book is available from the British Library
`
`
Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-X

1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araujo
Neto, 1960- . II. Title.
Z667.B34 1999
025.04-dc21

99-10033
CIP
`
`
t-dimensional vector space and standard linear algebra operations on vectors.
For the classic probabilistic model, the framework is composed of sets, standard
probability operations, and Bayes' theorem.

In the remainder of this chapter, we discuss the various IR models shown
in Figure 2.1. Throughout the discussion, we do not explicitly instantiate the
components $D$, $Q$, $F$, and $R(q_i, d_j)$ of each model. Such components should be
quite clear from the discussion and can be easily inferred.
`
`2.5 Classic Information Retrieval
`
In this section we briefly present the three classic models in information retrieval,
namely, the Boolean, the vector, and the probabilistic models.
`
`2.5.1 Basic Concepts
`
The classic models in information retrieval consider that each document is described
by a set of representative keywords called index terms. An index term is simply a
(document) word whose semantics helps in remembering the document's main themes.
Thus, index terms are used to index and summarize the document contents. In general,
index terms are mainly nouns because nouns have meaning by themselves and thus,
their semantics is easier to identify and to grasp. Adjectives, adverbs, and
connectives are less useful as index terms because they work mainly as complements.
However, it might be interesting to consider all the distinct words in a document
collection as index terms. For instance, this approach is adopted by some Web search
engines as discussed in Chapter 13 (in which case, the document logical view is
full text). We postpone a discussion on the problem of how to generate index terms
until Chapter 7, where the issue is covered in detail.

Given a set of index terms for a document, we notice that not all terms are equally
useful for describing the document contents. In fact, there are index terms which
are simply vaguer than others. Deciding on the importance of a term for summarizing
the contents of a document is not a trivial issue. Despite this difficulty, there
are properties of an index term which are easily measured and which are useful for
evaluating the potential of a term as such. For instance, consider a collection
with a hundred thousand documents. A word which appears in each of the one hundred
thousand documents is completely useless as an index term because it does not tell
us anything about which documents the user might be interested in. On the other
hand, a word which appears in just five documents is quite useful because it narrows
down considerably the space of documents which might be of interest to the user.
Thus, it should be clear that distinct index terms have varying relevance when used
to describe document contents. This effect is captured through the assignment of
numerical weights to each index term of a document.
`
`
Let $k_i$ be an index term, $d_j$ be a document, and $w_{i,j} \geq 0$ be a weight
associated with the pair $(k_i, d_j)$. This weight quantifies the importance of the
index term for describing the document semantic contents.
`
Definition  Let $t$ be the number of index terms in the system and $k_i$ be a generic
index term. $K = \{k_1, \ldots, k_t\}$ is the set of all index terms. A weight
$w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$. For an
index term which does not appear in the document text, $w_{i,j} = 0$. With the
document $d_j$ is associated an index term vector $\vec{d}_j$ represented by
$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. Further, let $g_i$ be a function
that returns the weight associated with the index term $k_i$ in any $t$-dimensional
vector (i.e., $g_i(\vec{d}_j) = w_{i,j}$).
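To make these definitions concrete, here is a minimal sketch in Python (ours, not the book's); the toy vocabulary and weights are illustrative assumptions:

```python
# Minimal sketch of the index term vector notation; the vocabulary and
# weights below are illustrative assumptions, not values from the book.

K = ["computer", "network", "protocol"]  # K = {k_1, ..., k_t}, so t = 3

# Index term vector of a document d_j: (w_{1,j}, w_{2,j}, w_{3,j}).
# A weight of 0 means the term does not appear in the document text.
d_j = [0.8, 0.5, 0.0]

def g(i, vector):
    """g_i(d_j) = w_{i,j}: return the weight of term k_i in a
    t-dimensional vector. Terms are indexed from 1, following the text."""
    return vector[i - 1]

print(g(1, d_j))  # 0.8, the weight of 'computer' in d_j
print(g(3, d_j))  # 0.0, 'protocol' does not occur in d_j
```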
`
As we later discuss, the index term weights are usually assumed to be mutually
independent. This means that knowing the weight $w_{i,j}$ associated with the pair
$(k_i, d_j)$ tells us nothing about the weight $w_{i+1,j}$ associated with the pair
$(k_{i+1}, d_j)$. This is clearly a simplification because occurrences of index
terms in a document are not uncorrelated. Consider, for instance, that the terms
computer and network are used to index a given document which covers the area of
computer networks. Frequently, in this document, the appearance of one of these two
words attracts the appearance of the other. Thus, these two words are correlated
and their weights could reflect this correlation. While mutual independence seems
to be a strong simplification, it does simplify the task of computing index term
weights and allows for fast ranking computation. Furthermore, taking advantage of
index term correlations for improving the final document ranking is not a simple
task. In fact, none of the many approaches proposed in the past has clearly
demonstrated that index term correlations are advantageous (for ranking purposes)
with general collections. Therefore, unless clearly stated otherwise, we assume
mutual independence among index terms. In Chapter 4 we discuss modern retrieval
techniques which are based on term correlations and which have been tested
successfully with particular collections. These successes seem to be slowly
shifting the current understanding towards a more favorable view of the usefulness
of term correlations for information retrieval systems.

The above definitions provide support for discussing the three classic information
retrieval models, namely, the Boolean, the vector, and the probabilistic models,
as we now do.
`
`
`
`
`
`
`
`
`
`
`
`
`
2.5.2 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean
algebra. Since the concept of a set is quite intuitive, the Boolean model provides
a framework which is easy to grasp by a common user of an IR system. Furthermore,
the queries are specified as Boolean expressions which have precise semantics.
Given its inherent simplicity and neat formalism, the Boolean model received great
attention in past years and was adopted by many of the early commercial
bibliographic systems.
`
`
Figure 2.3  The three conjunctive components for the query $q = k_a \wedge (k_b \vee \neg k_c)$.
`
Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval
strategy is based on a binary decision criterion (i.e., a document is predicted to
be either relevant or non-relevant) without any notion of a grading scale, which
prevents good retrieval performance. Thus, the Boolean model is in reality much
more a data (instead of information) retrieval model. Second, while Boolean
expressions have precise semantics, frequently it is not simple to translate an
information need into a Boolean expression. In fact, most users find it difficult
and awkward to express their query requests in terms of Boolean expressions. The
Boolean expressions actually formulated by users often are quite simple (see
Chapter 10 for a more thorough discussion of this issue). Despite these drawbacks,
the Boolean model is still the dominant model with commercial document database
systems and provides a good starting point for those new to the field.
The Boolean model considers that index terms are present or absent in a document.
As a result, the index term weights are assumed to be all binary, i.e.,
$w_{i,j} \in \{0,1\}$. A query $q$ is composed of index terms linked by three
connectives: not, and, or. Thus, a query is essentially a conventional Boolean
expression which can be represented as a disjunction of conjunctive vectors (i.e.,
in disjunctive normal form, or DNF). For instance, the query
$[q = k_a \wedge (k_b \vee \neg k_c)]$ can be written in disjunctive normal form as
$[\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)]$, where each of the components
is a binary weighted vector associated with the tuple $(k_a, k_b, k_c)$. These
binary weighted vectors are called the conjunctive components of $\vec{q}_{dnf}$.
Figure 2.3 illustrates the three conjunctive components for the query $q$.
`
Definition  For the Boolean model, the index term weight variables are all binary,
i.e., $w_{i,j} \in \{0,1\}$. A query $q$ is a conventional Boolean expression. Let
$\vec{q}_{dnf}$ be the disjunctive normal form for the query $q$. Further, let
$\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$. The
similarity of a document $d_j$ to the query $q$ is defined as

$$\mathrm{sim}(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$
`
`
If $\mathrm{sim}(d_j, q) = 1$ then the Boolean model predicts that the document
$d_j$ is relevant to the query $q$ (it might not be). Otherwise, the prediction is
that the document is not relevant.
`
The Boolean model predicts that each document is either relevant or non-relevant.
There is no notion of a partial match to the query conditions. For instance, let
$d_j$ be a document for which $\vec{d}_j = (0,1,0)$. Document $d_j$ includes the
index term $k_b$ but is considered non-relevant to the query
$[q = k_a \wedge (k_b \vee \neg k_c)]$.
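As a concrete illustration, here is a small sketch (ours) of the Boolean model for this example query, scoring a binary document vector against the conjunctive components of $\vec{q}_{dnf}$:

```python
# Sketch of Boolean retrieval for the example query
# q = k_a AND (k_b OR NOT k_c). Its DNF over the tuple (k_a, k_b, k_c)
# has the three conjunctive components below.

Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def sim(d):
    """Return 1 if the binary vector d equals some conjunctive component
    of the query DNF (i.e., g_i(d) = g_i(q_cc) for all k_i), 0 otherwise."""
    return 1 if any(d == q_cc for q_cc in Q_DNF) else 0

print(sim((1, 1, 0)))  # 1: predicted relevant
print(sim((0, 1, 0)))  # 0: contains k_b, yet predicted non-relevant
```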
The main advantages of the Boolean model are the clean formalism behind the model
and its simplicity. The main disadvantages are that exact matching may lead to
retrieval of too few or too many documents (see Chapter 10). Today, it is well
known that index term weighting can lead to a substantial improvement in retrieval
performance. Index term weighting brings us to the vector model.
`
`2.5.3 Vector Model
`
The vector model [697, 695] recognizes that the use of binary weights is too
limiting and proposes a framework in which partial matching is possible. This is
accomplished by assigning non-binary weights to index terms in queries and in
documents. These term weights are ultimately used to compute the degree of
similarity between each document stored in the system and the user query. By
sorting the retrieved documents in decreasing order of this degree of similarity,
the vector model takes into consideration documents which match the query terms
only partially. The main resultant effect is that the ranked document answer set
is a lot more precise (in the sense that it better matches the user information
need) than the document answer set retrieved by the Boolean model.
`
Definition  For the vector model, the weight $w_{i,j}$ associated with a pair
$(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are
also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$,
where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as
$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ where $t$ is the total number of
index terms in the system. As before, the vector for a document $d_j$ is
represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
`
Therefore, a document $d_j$ and a user query $q$ are represented as $t$-dimensional
vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree
of similarity of the document $d_j$ with regard to the query $q$ as the correlation
between the vectors $\vec{d}_j$ and $\vec{q}$. This correlation can be quantified,
for instance, by the cosine of the angle between these two vectors. That is,

$$\mathrm{sim}(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
`
`
Figure 2.4  The cosine of $\theta$ is adopted as $\mathrm{sim}(d_j, q)$.
`
where $|\vec{d}_j|$ and $|\vec{q}|$ are the norms of the document and query
vectors. The factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of
the documents) because it is the same for all documents. The factor $|\vec{d}_j|$
provides a normalization in the space of the documents.

Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, $\mathrm{sim}(q, d_j)$ varies from 0
to +1. Thus, instead of attempting to predict whether a document is relevant or
not, the vector model ranks the documents according to their degree of similarity
to the query. A document might be retrieved even if it matches the query only
partially. For instance, one can establish a threshold on $\mathrm{sim}(d_j, q)$
and retrieve the documents with a degree of similarity above that threshold. But
to compute rankings we need first to specify how index term weights are obtained.
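For concreteness, here is a minimal sketch (ours) of the cosine ranking formula; the weight vectors are assumed to be given already:

```python
import math

def cosine_sim(d, q):
    """sim(d_j, q) = (d_j . q) / (|d_j| * |q|); returns 0 for a zero vector."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# Illustrative weight vectors over a three-term vocabulary.
query = [0.7, 0.3, 0.0]
docs = {"d1": [0.9, 0.1, 0.0], "d2": [0.0, 0.4, 0.8]}

# Rank the documents in decreasing order of similarity to the query.
for name, vec in sorted(docs.items(), key=lambda kv: -cosine_sim(kv[1], query)):
    print(name, round(cosine_sim(vec, query), 3))
```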
Index term weights can be calculated in many different ways. The work by Salton
and McGill [698] reviews various term-weighting techniques. Here, we do not discuss
them in detail. Instead, we concentrate on elucidating the main idea behind the
most effective term-weighting techniques. This idea is related to the basic
principles which support clustering techniques, as follows.

Given a collection C of objects and a vague description of a set A, the goal of a
simple clustering algorithm might be to separate the collection C of objects into
two sets: a first one which is composed of objects related to the set A and a
second one which is composed of objects not related to the set A. Vague description
here means that we do not have complete information for deciding precisely which
objects are and which are not in the set A. For instance, one might be looking for
a set A of cars which have a price comparable to that of a Lexus 400. Since it is
not clear what the term comparable means exactly, there is not a precise (and
unique) description of the set A. More sophisticated clustering algorithms might
attempt to separate the objects of a collection into various clusters (or classes)
according to their properties. For instance, patients of a doctor specializing in
cancer could be classified into five classes: terminal, advanced, metastasis,
diagnosed, and healthy. Again, the possible class descriptions might be imprecise
(and not unique) and the problem is one of deciding to which of these classes a
new patient should be assigned. In what follows, however, we only discuss the
simpler version of the clustering problem (i.e., the one which considers only two
classes) because all that is required is a decision on which documents are
predicted to be relevant and which ones are predicted to be not relevant (with
regard to a given user query).
`
`
To view the IR problem as one of clustering, we refer to the early work of Salton.
We think of the documents as a collection C of objects and think of the user query
as a (vague) specification of a set A of objects. In this scenario, the IR problem
can be reduced to the problem of determining which documents are in the set A and
which ones are not (i.e., the IR problem can be viewed as a clustering problem).
In a clustering problem, two main issues have to be resolved. First, one needs to
determine what are the features which better describe the objects in the set A.
Second, one needs to determine what are the features which better distinguish the
objects in the set A from the remaining objects in the collection C. The first set
of features provides for quantification of intra-cluster similarity, while the
second set of features provides for quantification of inter-cluster dissimilarity.
The most successful clustering algorithms try to balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring the
raw frequency of a term $k_i$ inside a document $d_j$. Such term frequency is
usually referred to as the tf factor and provides one measure of how well that
term describes the document contents (i.e., intra-document characterization).
Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of
the frequency of a term $k_i$ among the documents in the collection. This factor
is usually referred to as the inverse document frequency or the idf factor. The
motivation for usage of an idf factor is that terms which appear in many documents
are not very useful for distinguishing a relevant document from a non-relevant
one. As with good clustering algorithms, the most effective term-weighting schemes
for IR try to balance these two effects.
`
`
`
`
Definition  Let $N$ be the total number of documents in the system and $n_i$ be
the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be
the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times
the term $k_i$ is mentioned in the text of the document $d_j$). Then, the
normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \qquad (2.1)$$

where the maximum is computed over all terms which are mentioned in the text of
the document $d_j$. If the term $k_i$ does not appear in the document $d_j$ then
$f_{i,j} = 0$. Further, let $idf_i$, the inverse document frequency for $k_i$, be
given by

$$idf_i = \log \frac{N}{n_i} \qquad (2.2)$$

The best known term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \qquad (2.3)$$
`
or by a variation of this formula. Such term-weighting strategies are called tf-idf
schemes.
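As a sketch of equations 2.1 to 2.3 (ours; the toy documents and the naive whitespace tokenization are simplifying assumptions):

```python
import math
from collections import Counter

# A toy collection; real systems tokenize and normalize far more carefully.
docs = ["computer network protocol computer",
        "computer graphics",
        "network routing protocol"]

N = len(docs)                            # total number of documents
tokenized = [d.split() for d in docs]
# n_i: number of documents in which term k_i appears.
n = Counter(term for toks in tokenized for term in set(toks))

def tf_idf(term, toks):
    """w_{i,j} = f_{i,j} * log(N / n_i), combining equations 2.1-2.3."""
    freq = Counter(toks)
    if freq[term] == 0:
        return 0.0                       # absent term: f_{i,j} = 0
    f = freq[term] / max(freq.values())  # normalized frequency (2.1)
    return f * math.log(N / n[term])     # tf times idf (2.2 and 2.3)

print(tf_idf("computer", tokenized[0]))  # frequent term, moderate idf
print(tf_idf("routing", tokenized[2]))   # rare term, high idf
```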
`
Several variations of the above expression for the weight $w_{i,j}$ are described
in an interesting paper by Salton and Buckley which appeared in 1988 [696].
However, in general, the above expression should provide a good weighting scheme
for many collections.

For the query term weights, Salton and Buckley suggest

$$w_{i,q} = \left( 0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}} \right) \times \log \frac{N}{n_i} \qquad (2.4)$$

where $freq_{i,q}$ is the raw frequency of the term $k_i$ in the text of the
information request $q$.
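Continuing the previous sketch, equation 2.4 for query term weights might be coded as follows (our illustration; it reuses N, n, Counter, and math from the sketch above):

```python
def query_weight(term, query_toks):
    """w_{i,q} per equation 2.4:
    (0.5 + 0.5 * freq_{i,q} / max_l freq_{l,q}) * log(N / n_i).
    Terms absent from the query or from the collection get weight 0."""
    freq = Counter(query_toks)
    if freq[term] == 0 or n[term] == 0:
        return 0.0
    f = 0.5 + 0.5 * freq[term] / max(freq.values())
    return f * math.log(N / n[term])

print(query_weight("protocol", "network protocol".split()))
```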
The main advantages of the vector model are: (1) its term-weighting scheme improves
retrieval performance; (2) its partial matching strategy allows retrieval of
documents that approximate the query conditions; and (3) its cosine ranking formula
sorts the documents according to their degree of similarity to the query.
Theoretically, the vector model has the disadvantage that index terms are assumed
to be mutually independent (equation 2.3 does not account for index term
dependencies). However, in practice, consideration of term dependencies might be a
disadvantage. Due to the locality of many term dependencies, their indiscriminate
application to all the documents in the collection might in fact hurt the overall
performance.

Despite its simplicity, the vector model is a resilient ranking strategy with
general collections. It yields ranked answer sets which are difficult to improve
upon without query expansion or relevance feedback (see Chapter 5) within the
framework of the vector model. A large variety of alternative ranking methods have
been compared to the vector model but the consensus seems to be that, in general,
the vector model is either superior or almost as good as the known alternatives.
Furthermore, it is simple and fast. For these reasons, the vector model is a
popular retrieval model nowadays.
`
2.5.4 Probabilistic Model
`
In this section, we describe the classic probabilistic model introduced in 1976 by
Robertson and Sparck Jones [677] which later became known as the binary
independence retrieval (BIR) model. Our discussion is intentionally brief and
focuses mainly on highlighting the key features of the model. With this purpose in
mind, we do not detain ourselves in subtleties regarding the binary independence
assumption for the model. The section on bibliographic discussion points to
references which cover these details.

The probabilistic model attempts to capture the IR problem within a probabilistic
framework. The fundamental idea is as follows. Given a user query, there is a set
of documents which contains exactly the relevant documents and
`
`
no other. Let us refer to this set of documents as the ideal answer set. Given the
description of this ideal answer set, we would have no problems in retrieving its
documents. Thus, we can think of the querying process as a process of specifying
the properties of an ideal answer set (which is analogous to interpreting the IR
problem as a problem of clustering). The problem is that we do not know exactly
what these properties are. All we know is that there are index terms whose
semantics should be used to characterize these properties. Since these properties
are not known at query time, an effort has to be made at initially guessing what
they could be. This initial guess allows us to generate a preliminary probabilistic
description of the ideal answer set which is used to retrieve a first set of
documents. An interaction with the user is then initiated with the purpose of
improving the probabilistic description of the ideal answer set. Such interaction
could proceed as follows.

The user takes a look at the retrieved documents and decides which ones are
relevant and which ones are not (in truth, only the first top documents need to be
examined). The system then uses this information to refine the description of the
ideal answer set. By repeating this process many times, it is expected that such a
description will evolve and become closer to the real description of the ideal
answer set. Thus, one should always have in mind the need to guess at the beginning
the description of the ideal answer set. Furthermore, a conscious effort is made
to model this description in probabilistic terms.
The probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle)  Given a user query $q$ and a document $d_j$
in the collection, the probabilistic model tries to estimate the probability that
the user will find the document $d_j$ interesting (i.e., relevant). The model
assumes that this probability of relevance depends on the query and the document
representations only. Further, the model assumes that there is a subset of all
documents which the user prefers as the answer set for the query $q$. Such an
ideal answer set is labeled $R$ and should maximize the overall probability of
relevance to the user. Documents in the set $R$ are predicted to be relevant to
the query. Documents not in this set are predicted to be non-relevant.
`
This assumption is quite troublesome because it does not state explicitly how to
compute the probabilities of relevance. In fact, not even the sample space which
is to be used for defining such probabilities is given.

Given a query $q$, the probabilistic model assigns to each document $d_j$, as a
measure of its similarity to the query, the ratio
$P(d_j \text{ relevant-to } q)/P(d_j \text{ non-relevant-to } q)$ which computes
the odds of the document $d_j$ being relevant to the query $q$. Taking the odds of
relevance as the rank minimizes the probability of an erroneous judgement
[282, 785].
`
Definition  For the probabilistic model, the index term weight variables are all
binary, i.e., $w_{i,j} \in \{0,1\}$, $w_{i,q} \in \{0,1\}$. A query $q$ is a subset
of index terms. Let $R$ be the set of documents known (or initially guessed) to be
relevant. Let $\bar{R}$ be the complement of $R$ (i.e., the set of non-relevant
documents). Let $P(R|\vec{d}_j)$
`
`
be the probability that the document $d_j$ is relevant to the query $q$ and
$P(\bar{R}|\vec{d}_j)$ be the probability that $d_j$ is non-relevant to $q$. The
similarity $\mathrm{sim}(d_j, q)$ of the document $d_j$ to the query $q$ is
defined as the ratio

$$\mathrm{sim}(d_j, q) = \frac{P(R|\vec{d}_j)}{P(\bar{R}|\vec{d}_j)}$$
`
Using Bayes' rule,

$$\mathrm{sim}(d_j, q) = \frac{P(\vec{d}_j|R) \times P(R)}{P(\vec{d}_j|\bar{R}) \times P(\bar{R})}$$

$P(\vec{d}_j|R)$ stands for the probability of randomly selecting the document
$d_j$ from the set $R$ of relevant documents. Further, $P(R)$ stands for the
probability that a document randomly selected from the entire collection is
relevant. The meanings attached to $P(\vec{d}_j|\bar{R})$ and $P(\bar{R})$ are
analogous and complementary.

Since $P(R)$ and $P(\bar{R})$ are the same for all the documents in the
collection, we write

$$\mathrm{sim}(d_j, q) \sim \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}$$
`
`
Assuming independence of index terms,

$$\mathrm{sim}(d_j, q) \sim \frac{\left( \prod_{g_i(\vec{d}_j)=1} P(k_i|R) \right) \times \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|R) \right)}{\left( \prod_{g_i(\vec{d}_j)=1} P(k_i|\bar{R}) \right) \times \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|\bar{R}) \right)}$$

$P(k_i|R)$ stands for the probability that the index term $k_i$ is present in a
document randomly selected from the set $R$. $P(\bar{k}_i|R)$ stands for the
probability that the index term $k_i$ is not present in a document randomly
selected from the set $R$. The probabilities associated with the set $\bar{R}$
have meanings which are analogous to the ones just described.
Taking logarithms, recalling that $P(k_i|R) + P(\bar{k}_i|R) = 1$, and ignoring
factors which are constant for all documents in the context of the same query, we
can finally write

$$\mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$$

which is a key expression for ranking computation in the probabilistic model.
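To make the expression concrete, here is a sketch (ours) that scores a binary document vector against a binary query vector, with the probability estimates supplied as parameters:

```python
import math

def bir_sim(d, q, p_rel, p_nonrel):
    """Key BIR ranking expression: sum, over terms present in both the
    query and the document, of log(P(k_i|R) / (1 - P(k_i|R))) plus
    log((1 - P(k_i|nonR)) / P(k_i|nonR))."""
    score = 0.0
    for w_q, w_d, pr, pn in zip(q, d, p_rel, p_nonrel):
        if w_q and w_d:
            score += math.log(pr / (1.0 - pr)) + math.log((1.0 - pn) / pn)
    return score

# Illustrative call using the initial estimates discussed next:
# P(k_i|R) = 0.5 and P(k_i|nonR) = n_i / N, with assumed n_i and N values.
N, n = 4, [1, 2, 3]
p_rel = [0.5, 0.5, 0.5]
p_nonrel = [n_i / N for n_i in n]
print(bir_sim([1, 0, 1], [1, 1, 0], p_rel, p_nonrel))  # only k_1 contributes
```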
Since we do not know the set $R$ at the beginning, it is necessary to devise a
method for initially computing the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$.
There are many alternatives for such computation. We discuss a couple of them
below.
`
`
In the very beginning (i.e., immediately after the query specification), there are
no retrieved documents. Thus, one has to make simplifying assumptions such as:
(a) assume that $P(k_i|R)$ is constant for all index terms $k_i$ (typically, equal
to 0.5) and (b) assume that the distribution of index terms among the non-relevant
documents can be approximated by the distribution of index terms among all the
documents in the collection. These two assumptions yield

$$P(k_i|R) = 0.5 \qquad P(k_i|\bar{R}) = \frac{n_i}{N}$$

where, as already defined, $n_i$ is the number of documents which contain the
index term $k_i$ and $N$ is the total number of documents in the collection. Given
this initial guess, we can then retrieve documents which contain query terms and
provide an initial probabilistic ranking for them. After that, this initial
ranking is improved as follows.
Let $V$ be a subset of the documents initially retrieved and ranked by the
probabilistic model. Such a subset can be defined, for instance, as the top $r$
ranked documents where $r$ is a previously defined threshold. Further, let $V_i$
be the subset of $V$ composed of the documents in $V$ which contain the index term
$k_i$. For simplicity, we also use $V$ and $V_i$ to refer to the number of elements
in these sets (it should always be clear when the used variable refers to the set
or to the number of elements in it). For improving the probabilistic ranking, we
need to improve our guesses for $P(k_i|R)$ and $P(k_i|\bar{R})$. This can be
accomplished with the following assumptions: (a) we can approximate $P(k_i|R)$ by
the distribution of the index term $k_i$ among the documents retrieved so far, and
(b) we can approximate $P(k_i|\bar{R})$ by considering that all the non-retrieved
documents are not relevant. With these assumptions, we can write

$$P(k_i|R) = \frac{V_i}{V} \qquad P(k_i|\bar{R}) = \frac{n_i - V_i}{N - V}$$
`
This process can then be repeated recursively. By doing so, we are able to improve
on our guesses for the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$ without any
assistance from a human subject (contrary to the original idea). However, we can
also use assistance from the user for definition of the subset $V$ as originally
conceived.

The last formulas for $P(k_i|R)$ and $P(k_i|\bar{R})$ pose problems for small
values of $V$ and $V_i$ which arise in practice (such as $V = 1$ and $V_i = 0$).
To circumvent these problems, an adjustment factor is often added in, which yields

$$P(k_i|R) = \frac{V_i + 0.5}{V + 1} \qquad P(k_i|\bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$$
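In code, one refinement iteration with the 0.5 adjustment might look like this sketch (ours; V_docs holds the binary vectors of the top-ranked documents):

```python
def refine_estimates(V_docs, n, N):
    """One refinement iteration with the 0.5 adjustment factor:
    P(k_i|R)    = (V_i + 0.5) / (V + 1)
    P(k_i|nonR) = (n_i - V_i + 0.5) / (N - V + 1)
    where V_i counts the retrieved documents containing term k_i."""
    V = len(V_docs)
    t = len(n)
    V_i = [sum(d[i] for d in V_docs) for i in range(t)]
    p_rel = [(V_i[i] + 0.5) / (V + 1) for i in range(t)]
    p_nonrel = [(n[i] - V_i[i] + 0.5) / (N - V + 1) for i in range(t)]
    return p_rel, p_nonrel

# Example: refine from two top-ranked documents; the new estimates can
# then be fed back into the ranking expression to re-rank the collection.
p_rel, p_nonrel = refine_estimates([[1, 1, 0], [1, 0, 1]], n=[2, 2, 3], N=4)
print(p_rel, p_nonrel)
```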
`
`
An adjustment factor which is constant and equal to 0.5 is not always satisfactory.
An alternative is to take the fraction $n_i/N$ as the adjustment factor, which
yields

$$P(k_i|R) = \frac{V_i + \frac{n_i}{N}}{V + 1} \qquad P(k_i|\bar{R}) = \frac{n_i - V_i + \frac{n_i}{N}}{N - V + 1}$$
`
`This completes our discussion of the probabilistic model.
The main advantage of the probabilistic model, in theory, is that documents are
ranked in decreasing order of their probability of being relevant. The
disadvantages include: (1) the need to guess the initial se