`
`Ricardo Baeza- Yates
`
`Berthier Ribeiro-Neto
`
`IPR2017-01039
`Unified EX1014 Page 1
`
`@ A
`
`CM Prass
`New York
`
`wW Addison-Wesley
`
`Harlow, England # Reading, Massachusetts
`Menlo Park, California « New York
`Don Mills, Ontario s Amsterdam # Bonn
`Sydney # Singapore # Tokyo # Madrid
`San Juan « Milan # Mexico City e Seau! « Taipei
`
`IPR2017-01039
`Unified EX1014 Page 1
`
`
`
`Copyright (c) 1999 by the ACM press, A Division of the Association for Computing
`Machinary, Inc. (ACM).
`
`Addison Wesley Longman Limited
`Edinburgh Gate
`Harlow
`Essex CM20 2JE
`
`England
`
`and Associated Companies throughout the World.
`
`The rights of the authors of this Work have been asserted by them in accordance with
`the Copyright, Designs and Patents Act 1988.
`
`All rights reserved. No part of this publication may be reproduced, stored in a
`retrieval system, or transmitted in any form or by any means, electronic, mechanical,
`photocopying, recording or otherwise, without either the prior written permission of
`the publisher or a licence permitting restricted copying in the United Kingdom issued
`by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.
`
`While the publisher has made every attempt to trace all copyright owners and obtain
`permission to reproduce material, in a few cases this has proved impossible.
`Copyright holders of material which has not been acknowledged are encouraged to
`contact the publisher.
`
`Manyof the designations used by manufacturers and sellers to distinguish their
`products are claimed as trademarks. Addison Wesley Longman Limited has made
`every attempt to supply trade mark information about manufacturers and their
`products mentioned in this book. A list of the trademark designations and their
`owlers appears on page viii.
`
`Typeset in Computer Modern by 56
`Printed and bound in the United States of America
`
`First printed 1999
`
`ISBN 0-201-39829-X
`
`British Library Cataloguing-in-Publication Data
`A catalogue record for this book is available from the British Library
`
`zZg$j;g?$$g`IR z¡"gb£¢j¢£¤Rx¥¦§$¢I;¨ Z©vIR ¢j¢j ªUIZ¨C©Z5z$«¬RI¨S¡ª g¨Sª;b$¤Y®¨ $¯_° ±O¡C² ¯ ³g³>¢jZ¨t´£¢Iµ0¶8;¨SU«AªZ¨C¶]«·j£³¸v³e¨>g¬b; ¹ ªUjº ª;jµ;»¸v¢j¢jh¼Cz¡C½Z¾|½¿g¸¸¨S;µªZ¨R³ªZ¨R³ ¢j¢j ª$j£³Cz$«Sª;¨£¢IgbZ¬RZRZ¬R IR´CZjµ³ ¯ÀzS_jZgj¢±Z©IRªZ¬RIRZb¢_Z©ÁI¢±´;jÂ`Rª;§$ ]h¨ªU¢j¢jhbj£³`gkIR£«Ã¨ª £ ;b³ª;¨ »zIIRO;g$jZg£¤]¥±£¢IZ¨R¢ªZ¨R³Äª$jh¨gj¢_ £$ÅUÅ ¯¦µµYjZgj¢_b£¢jhb§$³ ¯3Æ Sª;b Z©ÁI¢_¬ggµ ª$IÇZ¨ «:ªZA] bbb>³>¬ £³n¤Y¢jjZb³`¨Cªb£Ij£§;ª;µv¢j>¢jj£«0¤S;±IbªZ¨R¢j«·jj£³0¨CªZ¨gA©;b«Z_gkªZ¨g&«A£ªZ¨R¢£¤Sbµ IbZ¨g ¤Y«A Sª;¨ ªZµ¤gSUj Zg$¨R¤8b Zb³>¨Sk; UISbb»O¢jU¤]»OIRZ¬R bISh5IS gjZ5»zjjjh¨0]bb«·¢j¢IZ¨ ;©IR ¬gµ¢IRh:Z±ª·µ b¨ ]hb«·jI¨R`b£¢jIj j£³ Zg$¨RA¨CISÈ_¨j³É ¨SU³U«"¢j¢I¬S³gIR_zZg$jZgv¶n h¨S¢I¨R $b¨ ¶8j³]¤U$¾ ÀUjjh¨gSªU«ÊzZ¬gb˦Uª$³]¤$¶YZ¨R³;¨k´ÌÄ$º±¸ ¯´ÍgµAIR ¬ggµ¢IShRª$¢_«Aª$³g£§$hbkªUjj£«·R¦j:Ibª ª;µµ Zg$j;g ;»z¨Shb¢±ªZ¨R³0;Sjª;¨]hb«¢j¢IZ¨tj·bhgb>³e¬ A«AªUjhjªZµ¤S¨Cª©£» ªU¢j£¢_I¢_SªU¢¦gbZ§U£³0«·]U¢j¢Iµ ¯zZg$j;g¦RZµ³ghb¢Z©v«AªUjhjªZµÁ»z SªU¢¦¨R$z]£b¨ª Â>¨RZ»zµ£³g$³ªZbh¨ ;¬bªU$£³j ;¨gjª _IRMg¬gµ¢IRh ¯¡ªZ¨g&Z©ÁIS³¢IZ¨Rª$IZ¨R¢¬S¢j£³kg0«Aª;¨>¬©®ª I¬gbhb¢ ª;¨S³&¢jhµµhb¢jA³>¢jI¨RZ¬g¢I ISbgb>³e¬ j¢ ª;b µªZ«:£³ªU¢_Ibª$³g£«AªZjÂg¢ ¯ ³g³e¢j;¨´C£¢Iµ£&¶8;¨SU«AªZ¨C¶]«·j£³SªU¢¦«:ª$³g£§Uhbkª$jj£«·R j:¢I¬gµ&Ibª$³g·«:ªZjÂk¨©®Zb«AªUIZ¨tªZ];¬S±«AªZ¨>¬g©ª I¬bhb¢±ªZ¨R³ISbgb>³e¬ j¢ «:h¨gIZ¨R£³¨CI¢_]>Z ¯ ̵¢j±Z©vIRIbª$³g£«AªZj³¢IZ¨Rª$IZ¨R¢ªZ¨R³ISh;»z¨Shb¢_ª;]ªZb¢ ;¨SªU$§$ ¯Àz$]¢j£z¨tOU«·g¬Sjh5¡>³bj¨g0Î;ÏÄj¨gj£³ªZ¨R³0]Z¬g¨R³0¨CISÈ_¨j£³CÐ>jª$j¢ Z©Á «Abj ªÑ8b¢j¦gj¨gj£³$$IÐ>Ò Æ ¾;ÓÔ½Z¾bÓsÕ$$ÅG½Z;ÓIÖ×ØZÙÇÚ£ÙÜÛ£ÝßÞ)ÙÜàØ;ágØZâmã·ágÚ£áäÇåSæSçÙÜèæébÙÇè8éê çàäÜÙÜë>ágÚ£ÙÜåSèíìkáÚ£á ªUjªZµUZ¬RMb Zb³0©; Ig¢z]>ZÂ&¢ ª;§;ªZµª;µA©ÜbU«ISÒ)jI¢IC¶]bª;bÞ)ÙÜàØ;ágØ;âÍågîã·åSèæRØZï>ÛÛ0ãáÚ£ágäÜåSæRçÙÜèæébÙÜè8éê çàäÜÙÇë>áÚ£ÙÇåSèíì`ágÚ£áÒzªU£ð£ªZÓñ3ª$j£¢£¤RË ¯ò° Ëz ª;b³G²¡>³ghj¨¨g©Zb«:ª$IZ¨Cb£Ij£§;ª;µ3ó·Ëz ªZb³gkÒzª$£ðªZÓñzªUj£¢£¤SÒ3hbIbËz]hb;Ó Æ £j ¯ ¯- « ¯Ü¨ µ¬S³g£¢ gµUZbªZgS ª$µvbh©bbh¨ £¢_ªZ¨R³0¨S³gh¼ ¯sÐeÒ Æ ¾ZÓÔ½Z¾ghÓsÕ$UÅ>½;ZÓsÖ ¯ ܨ©;b«Aª$I;¨C¢jjZbªU$ªZ¨R³`b£I£§;ª;µv¢j>¢jj£«A¢ ¯ ¯ ËOôhb¤]ÒzbbIh5³¦bªvõ¬;öbÆ £j¤Y£UÏ$¾ZÓ ¯ j ¯ÀzIµ ¯÷SÏ$ÏGø ¯ÒzÕ$ù £$U¾>½UÎ ¯¾$ù;ú>³ ½e UZÓb£¾U¾$Õ$Õ¦Ä
`
`Library of Congress Cataloguing-in-Publication Data
`Baeza-Yates, R.(Ricardo)
`Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
`p-
`cm.
`Includes bibliographical references and index.
`ISBN 0-201-39829-X
`
`1. Information storage and retieval systems. I. Ribeiro, Berthier de Aratjo
`Neto, 1960- . II.Title.
`2667.B34
`1999
`025.04-de21
`
`99-10033
`CIP
`
`IPR2017-01039
`Unified EX1014 Page 2
`
`IPR2017-01039
`Unified EX1014 Page 2
`
`
`
`2!
`
`MOVELING
`
`t-dimensional vectorial space aud standard linear algebra operations on vectors.
`For the classic probabilistic model, (he framework ts composed of sets. stardard
`probability operations, and the Bayes’ theoretn.
`In the remainder of this chapter, we discuss the various TR models shown
`in Bigure 2.1. Throughout the discussion, we do not explicithy instantiate {he
`components D, Q, F, and Afg;,dj) of cach model. Such components should be
`quite clear fromthe discussion and can be easily inferred,
`
`2.5 Classic Information Retrieval
`
`In this section we briefly present the three classic models in information retrieval
`namely. the Boolean, the vector, and the probabilistic models,
`
`2.5.1 Basic Cancepts
`
`is ce-
`The classic mudels in information retrieval consider thar each deectutmert
`serihed by a set of cepreseutative keywords called index tertus. Au fuder terre
`is sitaply a (document) word whose setuautics helps in renombering the docu-
`tment's main themes. Uhus, iudex verma are used to iidex and summarize the
`document coutents. Ta general,
`index terins are mainly nouns becatise nous
`have meaning by themselves aucl thus, their semantics is easier to ideutify and
`to grasp, Adjectives, adverbs, aud comuectives arc less usefas iudex terms
`because they work mainly as canplements. Towever.
`it might be interesting
`Lo consider all the distinet wards in a docuinent collection as index terms. For
`instauce, this approach is adopted hy some Web search engines as discussed in
`Chapter 13 (in which case, the document logical view is full tert}, We postpone
`a discussion on the problemof how to generale index terms until Chapter 7.
`where the issue is covered in detail.
`Given a set of index terms for a document, we notice that uot all terms
`are equally usefid fur deseribing the document contents. Tu fact, there are index
`terms which are siinply vaguer than others. Deciding on the importance of a
`term for summarizing the contents of a dactiment is uot a trivial issne. Despite
`this difficulty, there are properties of an index term which are easily measured
`and which are usetul for evaluating the potential of a term as such. For instance,
`consider a collection with a hundred thousand documeuts. A word which appears
`in each of the one humdred thousand documents is completely useless as an index
`tern because it dues not tell us anything about which documents the user might
`be interested in. On the other hand, a word which appears jn just five documents
`is quite useful because it narrows down considerably the space of documents
`which might be of interest to the user. Thus,
`it should be clear that distinct
`index terms have varying relovance when used to describe dociment coutents.
`This effect is captured through the assigninent of numerical weights 1o each index
`term of a document.
`
`IPR2017-01039
`Unified EX1014 Page 3
`
`IPR2017-01039
`Unified EX1014 Page 3
`
`
`
`CLASSIC INFORMATION RETRIEVAL
`
`2a
`
`Let Ay be an index term, d; be a document. aud wy) = U be a weight
`assovlaned with the pair (4;.d¢,}. This weight quantifies the importance of the
`index teri tor describing the document semantic contents.
`
`Definition Lett be the number of iudex terms in the system and k, be a generic
`inden term. Kw (ky... hy} ts the set of all index terms. A weight w,j > 0
`is assoriated wilh each wider ferm hy of a document dj. Por an index term
`whivh dues Wok mppear in the docwment
`tert, wij = 0. With the dacnurnent a,
`is axsocteded an tuder tern vector dy represented hy dy =+ (tty. Wg joe. te yd.
`Further,
`irl y; be a funetion that refurna the weight assariated with the index
`term &om any bdimensional cector fhe, qld) = wey dh
`
`As we laver discuss. the index term weights are usually assumed to be tmntn-
`ally independent. This means that knowing the weight w,.; associated with the
`pair (A,) tells us nothing ahoul
`the weight wy, )., associated with the pair
`{kya ij}. This is clearly a simplification because oceurreuces of index terms m1
`a document are not nucorrelated. Consider, for tustauce, that the terms com-
`meter ancl network are used tu tidex a given document which covers the areaof
`computer networks. Prequenth, in this document, the appearance of one of these
`two words attracts the appearance of the other. Thus. these nwo words are corre-
`lated and their weights conld reflect this correlation, While umtual mdecpendence
`seems to be a strong siniplification, it does simplify the task of computing index
`rerun weights aud allows for fast ranking compntation. Furthermore, taking acl-
`vantage of index term correlations for improving the final document rauking 1s
`not a simple task.
`bn fact. none of the mamy approaches proposed in the past
`has clearly demonstrated (hat index term correlations
`are advantageous (for
`ranking purposes) with general collections. Therefore, unless clearly stated oth-
`enwise, we assume mutual itvlependence among index terms.
`fn Chapler 4 we
`discuss uiodern retrieval techniques which are based on term correlations and
`which have been tested successfully with particular collections. hese suceesses
`seein to be slowly shifting the current understanding towards a mure favorable
`view of the usefulness of term correlations for informatiou retrieval systems.
`The above definitions provide support for discussing the three classic infor-
`nation retrieval models, namely, the Boolean, the vector. and the probabilistic
`models. us we now do.
`
`
`
`
`5.2 Boolean Mode!
`
`The Boolean model is a supple retrieval model bagecl ou set theory and Boolean
`algebra, Since the concept of a set is quite intuitive, the Boolean model pro-
`Vides a framework whichis easy to grasp by a common user of an {Rsystem.
`Furthermore, the queries are specified as Boolean expressions which have precise
`
`senianties. Given its inherent simplicity and neat formalist. the Boolean mode)
`ceived great attention in past vears aud was adopted by many of the carly
`
`oromercial bibliographic systems.
`
`
`
`
`IPR2017-01039
`Unified EX1014 Page 4
`
`IPR2017-01039
`Unified EX1014 Page 4
`
`
`
`26
`
`BMODELING
`
`
`
`Figure 2.3 The three conjunctive compoucuts for the query jg = ka os [ky Re}.
`
`the Boolean model suffers from wajor drawbacks. First,
`Uufortunately,
`its retrieval strategy is based on a binary decision criteriou {Le., a documentis
`predicted to be either relevant or uon-relevant) without any notion of a grading
`stale. which prevents good retrieval performance. Thus. the Boolean model is
`in reality much more a data {instead of informarion) retrieval model Second,
`while Boolean expressions have precise semantics, frequently it
`is pol simple to
`translate an information need into a Boolean expression. Tu fact, most users fine
`it difficult aud awkward ta express their query requests im terms of Boolean ex-
`pressions. The Boolean expressions actually formulated by users often are quite
`situple (see Chapter 10 for a more thorough discussion on this issue}. Despite
`these drawbacks. the Boolean inoclel is still the dominant model with commercial
`document database systems aud provides a good starting point for those new to
`the field.
`The Boolean model considers Ghat index terms are present or absent in a
`document. As a result, the index terin weights are assumed to be all binary. ice.,
`wi € {0.1}. A query g is composed of index terms linked by three connectives:
`not, and, or-Thus, a queryis essentially a conventional Boolean expression which
`can be represented as a disjunction of conjunctive vectors (.¢., In disqunctine nor-
`
`mal fernt — DNF). For instance, the query [q = ka A (A, ¥ 1k.) can be written
`in disjunctive normal formas [Gag = Ol.1,1) ¥ (4.1.0) ¥ 01,0,0)). where eachof
`the components is a binary weighted vector associated with the tuple (ha. An, Kel.
`These binary weighted vectors are called the conjunctive components of dap ¢.
`Vigure 2.3 illustrates the three conjunctive components for the queryq.
`
`the indec term. weight variables are all
`For the Boolean model.
`Definition
`binary te., wiz C{O.1}. A query g is a conventional Boolean expression. Let
`fang b6 the disjunctive nermal formfor the query q. Further, tet doe be any of the
`conjunctive components of Gang. The similarity of a document dj
`to the query q
`is defined ts
`
`sim(d;, q) _
`wee
`ss
`
`|
`0
`
`{ovo
`*
`a,
`ay ong Le
`+
`4=
`
`1 Gee & dang} AUR. (dj) = di(dee))
`af “Ger:
`otherwise
`
`IPR2017-01039
`Unified EX1014 Page 5
`
`IPR2017-01039
`Unified EX1014 Page 5
`
`
`
`
`CLASSIC INFORMATION RETRIEVAL
`
`27
`
`if sim(d,, gi = 1 then the Boolean model predicts that the document d, is relevant
`to the query q Gt might not bel. Otherwise,
`the prediction is that the document
`ig not redeware,
`
`is cither referaué or non-
`The Boolean medel predicts that each document
`reecant There is nu notion of a partial reich to the query conditions. Por
`instance,
`let d, be a document for which d; — (0.1.0). Document d,
`includes
`the index teri &y, but is cousidered nen-relevant to the query (gs kg Athy Reds
`The main advantages of the Boolean model are the clean formalisin behind
`the model and its simplicity. The inain disadvantages are that exact matching
`naylead ta retrieval of tow few or toe many documents {see Chapter E01. “Today,
`it is well knownthat index term weighting canlead to a substantial improvement
`in retrieval performance. Index termweighting brings us to ble vector model.
`
`2.5.3 Vector Model
`
`the use of binary weights is tua
`i697, 695] recognizes that
`The vector model
`Hutuliug aad proposes a framework in which partial matching is possible, This
`is accomplished by assigning nen-binery weights to index tcrmis iu qiluries and
`in doctunents. These term weights are ultimately used to counpute the degree
`of similarity between cach document stored in the svslem und the user query.
`By sorting the retrieved documonts in decreasing order of this degree af simitlar-
`ivy. the vector model takes into consideration documents which match the query
`terms only partially. The main resultant effect is that the ranked ductunent an-
`ewer set
`is alot more precise fin the sense that it better matelies the user infur-
`mation need) than the doctiment answer set relrieved by the Boolean model.
`
`the weight wy; assoviuted uth a pud(hy dj]
`Definition For the vector model,
`is positive and nen-binary. Further,
`the mdex terms in the query are also
`weighted. Let in, be the wcight associated with the parr [AiG], where yg 2 UO.
`Lhen.
`the query vector 7 is defined as F -
`(uy ge Wags... Ug) where t
`is tee
`tuted number of index terma in the system. As before, the vector fer a document
`d,
`is represented by dy = (wyy.wagj...-. wey.
`
`Therefore, a document @; and a user query gq are represented as 1-ditiensional
`‘vectors as shown in Figure 2.4. The vector morlel proposes to evaluate Cie degree
`of similarity of the documem. dj with regard to the query g as the correlation
`between the vectors dj and @ This correlation can be quantified. for instance.
`by the cosine of the angle between these two vectors. That tis,
`
`stm(dy.q)
`
`dj eq
`d3| x ig]
`yr wy| X Wig
`2
`“Feat
`9
`faoot
`faisLByxy ek wy
`
`IPR2017-01039
`Unified EX1014 Page 6
`
`IPR2017-01039
`Unified EX1014 Page 6
`
`
`
` Q
`
`3s
`
`MODELING
`
`Figure 2.4 The cosine of @ is actopted as simledy a.
`
`where \d;" and |g are the norms of the document and query vectors. The factor
`i) does not affeer the ranking {Le.. the ordering of the documents] because it
`is the same for all doctments. The factor [5 provides a normalization in the
`space of the docninents.
`Since iw,j = Qandiaryy > O, stmt, dj) varies from Ota +i. Phus, instead of
`attempting to predict whether a document is relevaut or uot, the vector model
`ranks the documents according to their degree of similarity to the query. A
`document might be retrieved even if it matches the query ouly partially. For
`instance, one can establish a threshold on sirn(d;,q) and retrieve the documents
`with a degree of similarity above that threshold. But te compute raukings we
`need first to specify how index term weights are obtained.
`Tudex term weights can be calculated in may different ways. The work by
`Salton and McGill (698) reviews various term-weighting techniques, [Tere, we do
`not discuss them iu detail Instead, we concentrate on elucidating the main idea
`behind the most effective term-weighting techniqnes. This idea is related to the
`basic principles which support clustering Lechuiques. as follows.
`Given a collection C of oljects and a vague description of a set 4, the goal of
`a siniple clustering algorithin might be to separate the collection C’ of objects into
`iwo sets: a first one which is composed of objects related to the set A and a second
`one which is composed of objects not related to the set -l. Vaguedescription here
`ineans that we do not have complete information for deciding precisely which
`objects are and which are not in the set 4. Por instance, one might be looking
`for a set 4 of cars which have a price comparable to that of a Lexus 400. Since it
`is not clear what Lhe term comparable ineans exactly, there is uot a precise (and
`unique) description of the set A. More sophisticated clustering algorithms might
`attempt to separate the objects of a collection into various clusters (or classes)
`according to their properties. For instance, patients of a doctor specializing
`in camcer could be classified into five classes:
`terminal. acdvauced, metastasis.
`diagnosed, and healthy. Again, the possible class descriptions might be imprecise
`(and not unique} and the problem is one of deciding to which of these classes
`a uew patient should be assigned.
`In what follows, however, we only discuss
`
`the simpler version of the clustermg problem {i.e., the one which considers ouly
`two classes) because alf Luat
`is required is a decision on which documents are
`predicted to be relevant amd which ones are predicted to be not relevant (with
`regard to a given user query].
`
`IPR2017-01039
`Unified EX1014 Page 7
`
`IPR2017-01039
`Unified EX1014 Page 7
`
`
`
`
`O@LABSIC INFORMATION RETRIEVAL
`
`20
`
`‘La view the TR problem as one of clustering, we refer tu tle early work of
`Salton. We think of the documents as a collection C of objects and think of the
`user query as a (vague) specification of a set A of objects.
`[In this scenario, the
`TR. problem can be reduced tu the problern of determining which documents are
`in the set Aland which ones are uot {ie.,
`the UR problem can be viewed as a
`clustering problem). Ina clustering problem, two niain issues have to he resolved.
`First, one needs to determine what are the features which better describe the
`objects in the set
`-~L Second, one needs to determine what are the features
`which better distinguish the objects in the ser A from the remaining objects in
`the cdilection C. The first set of features provides for quantification of intra-
`cluster stnuilarity. while Che second set of features provides for qnauitification
`of enter-cluster dissimilarity. Phe most successful clistering algorithms try to
`balance Lhese two effects.
`Tn the vectur inodel. intra-clustering similarity is quantified by measuring
`the raw frequency of a term Ay
`insicle a document @;. Such term frequency is
`ustidls referred to as the tf focter and provides one tueasure of how well that
`term describes Che document contents (c., intra-documeut claracterizalion).
`Further:ore, inter-cluster dissimilarity is quantified by measuring the imverse
`of the frequency of a term A, among the documents in the collection. This
`factur is usually referred to as the diwerse document frequency or the af factor.
`The motivation for usage of au idf factor is that
`lerms which appear in ui
`domuuerts are nol very useful for distinguishing a relevant document from a
`non-relevant one, As with voud chistering algorithms. the must cHeetive term-
`weighting schemes for IR. try to balance these two effects.
`
`the system: and vy be
`fet N be the total number of documents ot
`Detinition
`
`
`the monber of documents in which the index term ky appears. Del fregy;
`beThe
`poufrequency of terin ky in the document dj (ic. the wumber of tines thetern
`Sky is mentioned in the text of the document dj}. Then, the normalized frequency
`Sof term ky in document dy is giuen by
`
`fred;
`fej or oe
`meaty fred j
`
`e the macimune is cumputed over all terms which are mentioned in the treat
`the document d;.
`ff the term ky does not appear in the document d;
`then
`
`=U. Further, let idf;, inverse document frequency for ky. be given by
`
`
`
`
`
`
`
`
`
`Nv
`.
`idf, = low —
`"nh;
`
`be
`
`Don,
`. st
`’
`‘
`.
`best known term-iccighting schémes use qucightsaivhich are given by
`Lee
`
`,
`
`,
`
`N
`ig = Fig x log n
`i
`
`.
`‘
`12.1)
`
`(2.2)
`:
`
`(2.3)
`
`IPR2017-01039
`Unified EX1014 Page 8
`
`IPR2017-01039
`Unified EX1014 Page 8
`
`
`
`30
`
`MOORLING
`
`or by a variation of tis formula. Such ferm-weighting strategies arc called thidf
`schemes,
`
`Several variations of the above expression for the weight uy; are described in an
`Interesting paper by Salton and Buckley which appeared in 1988 [696]. However,
`in general.
`the above expression showld provide a good weighting scheme for
`liany collections.
`For the query term weights, Salton and Buckley suzvest
`
`ing = (a+ iti. | a
`
`
`O.5 fre.
`may Pred y
`
`N
`
`Ai
`
`“7
`
`(2.4)
`
`where freg,, is the raw frequeney of the term &y in the text of the information
`Tequeat q.
`The main adeautayes of the vector model are: (1) its terni-weizliling scheme
`improves retrieval performance, (2) its partial matching strategy allows retrieval
`of documents that approximate Uae query conditions: and (3) its cosime rank-
`ing fornia sorts the documents according to their degree of similarity to the
`query. Theoretically, the vectur model has the disadvantaye that index terms are
`ested to be umtnally iudepeudent {equatiuu 2.3 does mot accouut for index
`teTm dependencies}. However. in practice. consideration of term dependencies
`might be a disadvantage. Dne to the locality of many term dependencies, their
`indiscriminate application to all the doctuments in the colleetion might in fact
`fagthe overall performance.
`Despite its siipliciey, he vector model is a resthent ranking strategy with
`ecneral collections.
`Lt vields ranked answer sects which are difficult
`tu improve
`npon without query expansion or relevance feedback (see Chapter 5} within the
`framework of the vector model. A large variety of alternative ranking methods
`have been compared to the veetor madel but the consensus seems to be that,
`in veneral, the vector model is either superior or almost ag good as the known
`alternatives. Furthermore. it is stuuple and fast. For Wiese reasons. the vecter
`model is a popular retrieval model nowadays.
`
`2.5.4
`
`Probabilistic Model
`
`Tn this section, we describe the classic probabilistie model introduced in 1976
`by Roberston and Sparck Jones 677] which later became known as the binery
`independence retrieval (BIR) model. Our discussion is intentionally brief aud
`focuses mainly on highlighting the key features of the model. With this purpose
`in mud, we do not detain ourselves in subtleties regarding the binary indepen-
`denee assumption for the model. The section on bibliographic discussion points
`to references which cover these details.
`Vhe probabilistic model atternpts to capture the TR. problem within a prob-
`abilistic framework. he fundamental idea is as follows, Given a user query,
`there is a set of documents whieh contains exactly the relevant documents and
`
`IPR2017-01039
`Unified EX1014 Page 9
`
`IPR2017-01039
`Unified EX1014 Page 9
`
`
`
`CLASSIC INFORMATION RETRIEVAL
`
`3l
`
`no other. Let us refer to this set of documents as the ideal answer set. Given the
`description of this ideal answerset, we would lave no problems in retrieving its
`documents. Thus, we can think of the queryimg process as a process of specify-
`ing the properties of an ideal answer set (which is analogous to interpreting the
`1K problem as a problem of clustering). The problemis that we do uot know
`exactly what these properties are. All we knowis that there are index (erms
`whose semantics should be used to characterize these properlies. Since these
`properties are not known at query time, an cflort has to be made at. initially
`sucssing what they could be. This initial guess allows us to generate a prelim-
`inary probabilistic description of the ideal answer set which is used to retrieve
`a first set of documents. An interaction with the user is then initiated with the
`purpose of improving the prubabilistie description of (he ideal answer set. Such
`interaction could proceed as follows.
`‘The user takes a look at the retrieved documents and decides which ones
`are relevant and which ones are not (in truth, only the first top documents need
`to be examined). The system then uses this information to refine the description
`of the ideal answer set. By repeating this process unamy times, it is expected
`that such a description will evolve aud become closer lo the real description af
`ihe ideal answer set. Thus, one should always have in wind the need to guess at
`the hegiuning the deseription of the ideal answer set. Furthermore. a conscious
`effort is inade to model this deseription in probabilistic terms.
`The probabilistic model is based on the following fundaruental assumption.
`
`Assumption (Probabilistic Principle) Given a user query q and a document d;
` Ttollection, the probabilistic rnodel tries to estimate the
`istitetl webability that
`the user will find the document d; interesting (Le relevant).
`‘The model as-
`
`sumiesthatthis probability
`Tthetice depends on the query and the document
`representations ouly. Further.
`the model assumes that there is a subset of all
`documents which the user prefers as the answer ser for the query g.
`Such an
`ideal answer set is Jabeled Fo and should maximize the overall probability of re!
`evance to the user. Doenmients in the set 2 are predicted two be relevant Lo the
`query. Docuraents not in this set are predicted to be non-relevant.
`
`This assumption is quite troublesome because it does not state explicitly
`how to compute the probabilities of relevance. In fact, not even the sample space
`which is to be used for defining such probabilities ts given.
`Given a query g. the probabilistic model assigns to each document d;, as
`a mcasure of its similarity wo the query. the ratio Pid; relevant-to q)/P(d,; non-
`relevant-to g} which computes the odds of the document d; being relevant to the
`query g. Taking the odds of relevance as the rank minimizes the probability of
`an erroneous judgement [282, 785).
`
`For the probabilistic model, the index term weight variables are
`Definition
`all binary ie. uy © {OV}. wig € {0,1}. A query g is a subset of tinder terms.
`Let Robe the set of documents known(or initially guessed) to be relevant. Let R
`be the complement of R ft... the set of non-relevant documents). Let PUR|d;}
`
`IPR2017-01039
`Unified EX1014 Page 10
`
`IPR2017-01039
`Unified EX1014 Page 10
`
`
`
`32
`
`MGIBELING
`
`is relevant to the query q and P(Rid;)
`be the profability that the document d;
`be the probability that d;
`is non-relenant to q. The similarity simidy,q) of the
`document rl,
`to the query q is defined as the ratte
`
`Styl ied, . ef) =
`
`PURIA))
`PURE)
`
`Using Bayes” rule,
`
`P(LER} x PER)
`_
`simul. g) = s+
`Pid |Ri x PCR)
`
`Pid, R) stands for the probability of randomly selecting the document d; from
`the set. & of relevant documents. Further, PCA) stands for the probability that a
`ductment randomly selected from the eniire collection is relevant. The mieanings
`attached to P (f/'R) and PCR) are analogous and complementary.
`Since PCA) and PR) are the samefor all the documeuts in the collection,
`we write,
`
`sentibs. gio
`
`P(G|R)
`Pid; Wi
`
`
`Assuming independence of index termis,
`
`semid,,g@)
`
`o~
`
`
`
`Ted. PURIRD% Mycapao PEARY
`Tes, a PAA) x TL. cé 320 Pik, |R))
`
`Pik,[21 slanes for the probability that the index term &; is present ina document
`randomly selected from the set R. P(AjLA} stands for the probability that the
`iudex term: Ay
`is not present in a deemment randomly selected from the set &.
`‘The probabilities associated with the set # have meanings which are analogous
`ta the ones just described,
`Taking logarithms, recalling that P(A,|R) + P(k:|R) = 1, and ignoring
`factors which are constant for all documents in the context of the sume query,
`we cau finally write
`
`i
`— L
`#21
`
`sunld;.q)
`
`o~ So ug x Uy; x (108 Pa -+ log
`
`PCRAR
`
`I PikIR
`
`eal 5)
`
`
`which is a key expression for ranking computation in the probabilistic model.
`Since we do not know the set # al the beginning, it is necessary to devise
`a inethodfor initially computing the probabilities P(k;|R) and P(k;|R). There
`are any alternatives for such computation. We discuss a couple of them below.
`
`IPR2017-01039
`Unified EX1014 Page 11
`
`IPR2017-01039
`Unified EX1014 Page 11
`
`
`
`"CLASSIC INFORMATION RETRIEVAL
`
`uo
`
`Tn the very beginning (i.c., bnmedtately after the query specification), there
`are no retrieved documents. Thus. one has to make simplifying assumptions such
`as: [aj assume that P(R,|A) is constantfor all index terms &; (typically, equal to
`0.5) and (b) assume that the distribution of tidex terms amoug the non-releyant
`documents can he approxtmated by the distribution of index terns among all
`the documents tn the calection. These two assumptions yield
`
`=
`Pk|R)
`P(RIR) =
`
`0.5
`
`ia the number of documents which contain the
`where, as already defined, oj;
`
`index term Ay and V is the Total nuinber of doeaments in the collection. Given
`thismitintguess, we can then retrieve documents which contain query terms and
`provide ag initial probabilistic ranking for them. After iat. this initial ranking
`is Hnpreved as follows.
`Let Vo be a snbset of the doctuments initially retrieved and ranked by the
`probabilistic model Sneha subset can be defined, for lustance. as the top +
`rankeddocumentswhore:is a previously detined threshold. Further,letVibe—
`the subset of ¥ composed of the documents in Vo which contain the index term
`k,. For simplicity, we also use Fo and Vote refer toa Ue tunber of elaments in
`these sets (it should always be clear when the used variable refers to the sel or ta
`the muanber of elements mit}, For improving the probabilistic ranking, we need
`to improve our guesses for P(A, A} and P(ky A}. This cau be accomplished with
`the following assumptions:
`(a) we can approximate P(&, AR) by the distribution
`of the index term ky among the documents retricved so far, and (b) we can
`approximate P(k SR) by cousidering that all
`the nou-retrieved documents are
`not relevant. With these assutaptions, we can write,
`
`=
`
`\“
`=v
`Pik, RY
`nyo
`=
`Pik Ry) = wt
`me
`SN
`
`f
`
`This process can thon he repeated recursively, Ty doiug so, we are able to
`iuprove on our guesses for the probabilities P(A;|H) and Ptk;|R) withont any
`assistance from a hmman subject feontrary to the original ideal. However, we
`cart also use assistance from the user for definition of the subset V as originally
`conceived,
`‘Lhe last formmlas for Pik; A} and P(k, |} pose problems for small values
`of Woand ¥; which arise in practice fsuch as F = 1 and V, = 0). To circrmvent
`these prebleras, an adjustment factor is often added in which viel