Ricardo Baeza-Yates
Berthier Ribeiro-Neto

Modern Information Retrieval

ACM Press
New York

Addison-Wesley
Harlow, England • Reading, Massachusetts • Menlo Park, California • New York
Don Mills, Ontario • Amsterdam • Bonn • Sydney • Singapore • Tokyo • Madrid
San Juan • Milan • Mexico City • Seoul • Taipei
BloomReach, Inc. EX1014 Page 1
IPR2019-01304
`
`
`
Copyright © 1999 by the ACM Press, A Division of the Association for Computing
Machinery, Inc. (ACM).

Addison Wesley Longman Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England
`
and Associated Companies throughout the World.

The rights of the authors of this Work have been asserted by them in accordance with
the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording or otherwise, without either the prior written permission of
the publisher or a licence permitting restricted copying in the United Kingdom issued
by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.

While the publisher has made every attempt to trace all copyright owners and obtain
permission to reproduce material, in a few cases this has proved impossible.
Copyright holders of material which has not been acknowledged are encouraged to
contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Addison Wesley Longman Limited has made
every attempt to supply trademark information about manufacturers and their
products mentioned in this book. A list of the trademark designations and their
owners appears on page viii.
`
Typeset in Computer Modern by 56
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X
`
`
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-X
1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araújo
Neto, 1960- . II. Title.
Z667.B34 1999
025.04 dc21

99-10033
CIP
`
`
`
`
MODELING
`
t-dimensional vectorial space and standard linear algebra operations on vectors.
For the classic probabilistic model, the framework is composed of sets, standard
probability operations, and the Bayes' theorem.
In the remainder of this chapter, we discuss the various IR models shown
in Figure 2.1. Throughout the discussion, we do not explicitly instantiate the
components D, Q, F, and R(q_i, d_j) of each model. Such components should be
quite clear from the discussion and can be easily inferred.
`
2.5 Classic Information Retrieval
`
In this section we briefly present the three classic models in information retrieval,
namely, the Boolean, the vector, and the probabilistic models.
`
2.5.1 Basic Concepts
`
The classic models in information retrieval consider that each document is described
by a set of representative keywords called index terms. An index term
is simply a (document) word whose semantics helps in remembering the document's
main themes. Thus, index terms are used to index and summarize the
document contents. In general, index terms are mainly nouns because nouns
have meaning by themselves and thus their semantics is easier to identify and
to grasp. Adjectives, adverbs, and connectives are less useful as index terms
because they work mainly as complements. However, it might be interesting
to consider all the distinct words in a document collection as index terms. For
instance, this approach is adopted by some Web search engines as discussed in
Chapter 13 (in which case, the document logical view is full text). We postpone
a discussion on the problem of how to generate index terms until Chapter 7,
where the issue is covered in detail.

Given a set of index terms for a document, we notice that not all terms
are equally useful for describing the document contents. In fact, there are index
terms which are simply vaguer than others. Deciding on the importance of a
term for summarizing the contents of a document is not a trivial issue. Despite
this difficulty, there are properties of an index term which are easily measured
and which are useful for evaluating the potential of a term as such. For instance,
consider a collection with a hundred thousand documents. A word which appears
in each of the one hundred thousand documents is completely useless as an index
term because it does not tell us anything about which documents the user might
be interested in. On the other hand, a word which appears in just five documents
is quite useful because it narrows down considerably the space of documents
which might be of interest to the user. Thus, it should be clear that distinct
index terms have varying relevance when used to describe document contents.
This effect is captured through the assignment of numerical weights to each index
term of a document.
`
`
`
`
CLASSIC INFORMATION RETRIEVAL
`
Let k_i be an index term, d_j be a document, and w_{i,j} ≥ 0 be a weight
associated with the pair (k_i, d_j). This weight quantifies the importance of the
index term for describing the document semantic contents.

Definition  Let t be the number of index terms in the system and k_i be a generic
index term. K = {k_1, ..., k_t} is the set of all index terms. A weight w_{i,j} > 0
is associated with each index term k_i of a document d_j. For an index term
which does not appear in the document text, w_{i,j} = 0. With the document d_j
is associated an index term vector d_j represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).
Further, let g_i be a function that returns the weight associated with the
term k_i in any t-dimensional vector (i.e., g_i(d_j) = w_{i,j}).
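As a concrete illustration of this representation (a minimal sketch, not taken from the book; the sample vocabulary and the weight values are invented), the index term vector and the function g_i can be written as:

```python
# A toy system with t = 3 index terms (hypothetical vocabulary).
K = ["computer", "network", "protocol"]

# Index term vector for a document d_j: one weight per term in K,
# with w_ij = 0 for terms that do not appear in the document.
d_j = [0.8, 0.5, 0.0]

def g(i, vector):
    """Return the weight associated with term k_i in a t-dimensional vector."""
    return vector[i]

assert g(0, d_j) == 0.8   # weight of "computer" in d_j
assert g(2, d_j) == 0.0   # "protocol" does not appear in d_j
```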
`
As we later discuss, the index term weights are usually assumed to be mutually
independent. This means that knowing the weight w_{i,j} associated with the
pair (k_i, d_j) tells us nothing about the weight w_{i+1,j} associated with the pair
(k_{i+1}, d_j). This is clearly a simplification because occurrences of index terms in
a document are not uncorrelated. Consider, for instance, that the terms computer
and network are used to index a given document which covers the area of
computer networks. Frequently, in this document, the appearance of one of these
two words attracts the appearance of the other. Thus, these two words are correlated
and their weights could reflect this correlation. While mutual independence
seems to be a strong simplification, it does simplify the task of computing index
term weights and allows for fast ranking computation. Furthermore, taking advantage
of index term correlations for improving the final document ranking is
not a simple task. In fact, none of the many approaches proposed in the past
has clearly demonstrated that index term correlations are advantageous (for
ranking purposes) with general collections. Therefore, unless clearly stated otherwise,
we assume mutual independence among index terms. In Chapter 5 we
discuss modern retrieval techniques which are based on term correlations and
which have been tested successfully with particular collections. These successes
seem to be slowly shifting the current understanding towards a more favorable
view of the usefulness of term correlations for information retrieval systems.

The above definitions provide support for discussing the three classic information
retrieval models, namely, the Boolean, the vector, and the probabilistic
models, as we now do.
`
`
`
`
`
`
`
`
`
`
`
2.5.2 Boolean Model
`
The Boolean model is a simple retrieval model based on set theory and Boolean
algebra. Since the concept of a set is quite intuitive, the Boolean model provides
a framework which is easy to grasp by a common user of an IR system.
Furthermore, the queries are specified as Boolean expressions which have precise
semantics. Given its inherent simplicity and neat formalism, the Boolean model
received great attention in past years and was adopted by many of the early
commercial bibliographic systems.
`
`
`
`
`
`
`
Figure 2.3  The three conjunctive components for the query [q = k_a ∧ (k_b ∨ ¬k_c)].
`
Unfortunately, the Boolean model suffers from major drawbacks. First,
its retrieval strategy is based on a binary decision criterion (i.e., a document is
predicted to be either relevant or non-relevant) without any notion of a grading
scale, which prevents good retrieval performance. Thus, the Boolean model is
in reality much more a data (instead of information) retrieval model. Second,
while Boolean expressions have precise semantics, frequently it is not simple to
translate an information need into a Boolean expression. In fact, most users find
it difficult and awkward to express their query requests in terms of Boolean expressions.
The Boolean expressions actually formulated by users often are quite
simple (see Chapter 10 for a more thorough discussion on this issue). Despite
these drawbacks, the Boolean model is still the dominant model with commercial
document database systems and provides a good starting point for those new to
the field.
`
The Boolean model considers that index terms are present or absent in a
document. As a result, the index term weights are assumed to be all binary, i.e.,
w_{i,j} ∈ {0,1}. A query q is composed of index terms linked by three connectives:
not, and, or. Thus, a query is essentially a conventional Boolean expression which
can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal
form, DNF). For instance, the query [q = k_a ∧ (k_b ∨ ¬k_c)] can be written
in disjunctive normal form as [q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)], where each of
the components is a binary weighted vector associated with the tuple (k_a, k_b, k_c).
These binary weighted vectors are called the conjunctive components of q_dnf.
Figure 2.3 illustrates the three conjunctive components for the query q.
`
Definition  For the Boolean model, the index term weight variables are all
binary, i.e., w_{i,j} ∈ {0,1}. A query q is a conventional Boolean expression. Let
q_dnf be the disjunctive normal form for the query q. Further, let q_cc be any of the
conjunctive components of q_dnf. The similarity of a document d_j to the query q
is defined as

    sim(d_j, q) = 1  if ∃ q_cc | (q_cc ∈ q_dnf) ∧ (∀ k_i, g_i(d_j) = g_i(q_cc))
    sim(d_j, q) = 0  otherwise
`
`
`
`
`
`
If sim(d_j, q) = 1 then the Boolean model predicts that the document d_j is relevant
to the query q (it might not be). Otherwise, the prediction is that the document
is not relevant.

The Boolean model predicts that each document is either relevant or non-relevant.
There is no notion of a partial match to the query conditions. For
instance, let d_j be a document for which d_j = (0,1,0). Document d_j includes
the index term k_b but is considered non-relevant to the query [q = k_a ∧ (k_b ∨ ¬k_c)].

The main advantages of the Boolean model are the clean formalism behind
the model and its simplicity. The main disadvantages are that exact matching
may lead to retrieval of too few or too many documents (see Chapter 10). Today,
it is well known that index term weighting can lead to a substantial improvement
in retrieval performance. Index term weighting brings us to the vector model.
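The Boolean matching rule above can be sketched in a few lines (a minimal sketch, not from the book; it hard-codes the running example [q = k_a ∧ (k_b ∨ ¬k_c)] and its DNF from the text):

```python
# Conjunctive components of q_dnf for q = ka AND (kb OR NOT kc),
# over the tuple (ka, kb, kc): (1,1,1) OR (1,1,0) OR (1,0,0).
Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def sim_boolean(d, q_dnf):
    """Return 1 if some conjunctive component of q_dnf matches the
    binary document vector d exactly (g_i(d) = g_i(q_cc) for all i),
    and 0 otherwise."""
    return 1 if any(cc == d for cc in q_dnf) else 0

assert sim_boolean((1, 1, 0), Q_DNF) == 1   # matches the second component
assert sim_boolean((0, 1, 0), Q_DNF) == 0   # contains kb, still non-relevant
```

With binary weights, "g_i(d_j) = g_i(q_cc) for every k_i" reduces to exact tuple equality, which is why the all-or-nothing behavior discussed above falls out directly.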
`
2.5.3 Vector Model

The vector model [696, 695] recognizes that the use of binary weights is too
limiting and proposes a framework in which partial matching is possible. This
is accomplished by assigning non-binary weights to index terms in queries and
in documents. These term weights are ultimately used to compute the degree
of similarity between each document stored in the system and the user query.
By sorting the retrieved documents in decreasing order of this degree of similarity,
the vector model takes into consideration documents which match the query
terms only partially. The main resultant effect is that the ranked document answer
set is a lot more precise (in the sense that it better matches the user information
need) than the document answer set retrieved by the Boolean model.
`
Definition  For the vector model, the weight w_{i,j} associated with a pair (k_i, d_j)
is positive and non-binary. Further, the index terms in the query are also
weighted. Let w_{i,q} be the weight associated with the pair [k_i, q], where w_{i,q} ≥ 0.
Then, the query vector q is defined as q = (w_{1,q}, w_{2,q}, ..., w_{t,q}), where t is the
total number of index terms in the system. As before, the vector for a document
d_j is represented by d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).
`
Therefore, a document d_j and a user query q are represented as t-dimensional
vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree
of similarity of the document d_j with regard to the query q as the correlation
between the vectors d_j and q. This correlation can be quantified, for instance,
by the cosine of the angle between these two vectors. That is,

    sim(d_j, q) = (d_j · q) / (|d_j| × |q|)
                = (Σ_{i=1}^{t} w_{i,j} × w_{i,q}) / (sqrt(Σ_{i=1}^{t} w_{i,j}²) × sqrt(Σ_{i=1}^{t} w_{i,q}²))
`
`
`
`
Figure 2.4  The cosine of θ is adopted as sim(d_j, q).
`
where |d_j| and |q| are the norms of the document and query vectors. The factor
|q| does not affect the ranking (i.e., the ordering of the documents) because it
is the same for all documents. The factor |d_j| provides a normalization in the
space of the documents.

Since w_{i,j} ≥ 0 and w_{i,q} ≥ 0, sim(q, d_j) varies from 0 to +1. Thus, instead of
attempting to predict whether a document is relevant or not, the vector model
ranks the documents according to their degree of similarity to the query. A
document might be retrieved even if it matches the query only partially. For
instance, one can establish a threshold on sim(d_j, q) and retrieve the documents
with a degree of similarity above that threshold. But to compute rankings we
need first to specify how index term weights are obtained.
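The cosine ranking just described can be sketched as follows (a minimal sketch; the weight values in the example are placeholders, not a prescribed weighting scheme):

```python
from math import sqrt

def cosine_sim(d, q):
    """Cosine of the angle between a document weight vector d and a
    query weight vector q (both t-dimensional, all weights >= 0)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = sqrt(sum(w * w for w in d))
    norm_q = sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# A document matching the query only partially still gets a nonzero score,
# so ranking (and thresholding) becomes possible.
q = [1.0, 1.0, 0.0]
assert cosine_sim([1.0, 1.0, 0.0], q) > cosine_sim([1.0, 0.0, 0.0], q) > 0
```

Note that dividing by |q| is kept here for symmetry even though, as the text observes, it does not change the ordering of the documents.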
Index term weights can be calculated in many different ways. The work by
Salton and McGill [698] reviews various term-weighting techniques. Here, we do
not discuss them in detail. Instead, we concentrate on elucidating the main idea
behind the most effective term-weighting techniques. This idea is related to the
basic principles which support clustering techniques, as follows.
Given a collection C of objects and a vague description of a set A, the goal of
a simple clustering algorithm might be to separate the collection C of objects into
two sets: a first one which is composed of objects related to the set A and a second
one which is composed of objects not related to the set A. Vague description here
means that we do not have complete information for deciding precisely which
objects are and which are not in the set A. For instance, one might be looking
for a set A of cars which have a price comparable to that of a Lexus 400. Since it
is not clear what the term comparable means exactly, there is not a precise (and
unique) description of the set A. More sophisticated clustering algorithms might
attempt to separate the objects of a collection into various clusters (or classes)
according to their properties. For instance, patients of a doctor specializing
in cancer could be classified into five classes: terminal, advanced, metastasis,
diagnosed, and healthy. Again, the possible class descriptions might be imprecise
(and not unique) and the problem is one of deciding to which of these classes
a new patient should be assigned. In what follows, however, we only discuss
the simpler version of the clustering problem (i.e., the one which considers only
two classes) because all that is required is a decision on which documents are
predicted to be relevant and which ones are predicted to be not relevant (with
regard to a given user query).
`
`
`
`
`
`
To view the IR problem as one of clustering, we refer to the early work of
Salton. We think of the documents as a collection C of objects and think of the
user query as a (vague) specification of a set A of objects. In this scenario, the
IR problem can be reduced to the problem of determining which documents are
in the set A and which ones are not (i.e., the IR problem can be viewed as a
clustering problem). In a clustering problem, two main issues have to be resolved.
First, one needs to determine what are the features which better describe the
objects in the set A. Second, one needs to determine what are the features
which better distinguish the objects in the set A from the remaining objects in
the collection C. The first set of features provides for quantification of intra-cluster
similarity, while the second set of features provides for quantification
of inter-cluster dissimilarity. The most successful clustering algorithms try to
balance these two effects.
In the vector model, intra-clustering similarity is quantified by measuring
the raw frequency of a term k_i inside a document d_j. Such term frequency is
usually referred to as the tf factor and provides one measure of how well that
term describes the document contents (i.e., intra-document characterization).
Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse
of the frequency of a term k_i among the documents in the collection. This
factor is usually referred to as the inverse document frequency or the idf factor.
The motivation for usage of an idf factor is that terms which appear in many
documents are not very useful for distinguishing a relevant document from a
non-relevant one. As with good clustering algorithms, the most effective term-weighting
schemes for IR try to balance these two effects.
`
Definition  Let N be the total number of documents in the system and n_i be
the number of documents in which the index term k_i appears. Let freq_{i,j} be the
raw frequency of term k_i in the document d_j (i.e., the number of times the term
k_i is mentioned in the text of the document d_j). Then, the normalized frequency
f_{i,j} of term k_i in document d_j is given by

    f_{i,j} = freq_{i,j} / max_l freq_{l,j}    (2.1)

where the maximum is computed over all terms which are mentioned in the text
of the document d_j. If the term k_i does not appear in the document d_j then
f_{i,j} = 0. Further, let idf_i, the inverse document frequency for k_i, be given by

    idf_i = log (N / n_i)    (2.2)

The best known term-weighting schemes use weights which are given by

    w_{i,j} = f_{i,j} × log (N / n_i)    (2.3)
`
`
`
`
`
or by a variation of this formula. Such term-weighting strategies are called tf-idf
schemes.
`
Several variations of the above expression for the weight w_{i,j} are described in an
interesting paper by Salton and Buckley which appeared in 1988 [696]. However,
in general, the above expression should provide a good weighting scheme for
many collections.
For the query term weights, Salton and Buckley suggest

    w_{i,q} = (0.5 + (0.5 × freq_{i,q}) / max_l freq_{l,q}) × log (N / n_i)    (2.4)

where freq_{i,q} is the raw frequency of the term k_i in the text of the information
request q.
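Equations 2.1 to 2.4 translate directly into code (a minimal sketch; the toy frequencies and collection size below are invented for illustration):

```python
from math import log

def doc_weight(freq_ij, max_freq_j, N, n_i):
    """tf-idf document weight, equations 2.1-2.3:
    w_ij = (freq_ij / max_l freq_lj) * log(N / n_i)."""
    f_ij = freq_ij / max_freq_j if max_freq_j > 0 else 0.0
    return f_ij * log(N / n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    """Salton and Buckley's query weight, equation 2.4:
    w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    return (0.5 + 0.5 * freq_iq / max_freq_q) * log(N / n_i)

# A term occurring in few documents (small n_i) gets a larger idf boost
# than a term occurring in half the collection, as the text motivates.
N = 100_000
assert doc_weight(3, 3, N, 5) > doc_weight(3, 3, N, 50_000)
assert doc_weight(0, 3, N, 5) == 0.0   # term absent from the document
```

The 0.5 offset in the query weight keeps a query term's tf component in [0.5, 1], so even a term mentioned once in a short query contributes its idf.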
`
The main advantages of the vector model are: (1) its term-weighting scheme
improves retrieval performance; (2) its partial matching strategy allows retrieval
of documents that approximate the query conditions; and (3) its cosine ranking
formula sorts the documents according to their degree of similarity to the
query. Theoretically, the vector model has the disadvantage that index terms are
assumed to be mutually independent (equation 2.3 does not account for index
term dependencies). However, in practice, consideration of term dependencies
might be a disadvantage. Due to the locality of many term dependencies, their
indiscriminate application to all the documents in the collection might in fact
hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with
general collections. It yields ranked answer sets which are difficult to improve
upon without query expansion or relevance feedback (see Chapter 5) within the
framework of the vector model. A large variety of alternative ranking methods
have been compared to the vector model but the consensus seems to be that,
in general, the vector model is either superior or almost as good as the known
alternatives. Furthermore, it is simple and fast. For these reasons, the vector
model is a popular retrieval model nowadays.
`
2.5.4 Probabilistic Model
`
In this section, we describe the classic probabilistic model introduced in 1976
by Robertson and Sparck Jones [677] which later became known as the binary
independence retrieval (BIR) model. Our discussion is intentionally brief and
focuses mainly on highlighting the key features of the model. With this purpose
in mind, we do not detain ourselves in subtleties regarding the binary independence
assumption for the model. The section on bibliographic discussion points
to references which cover these details.
`
The probabilistic model attempts to capture the IR problem within a probabilistic
framework. The fundamental idea is as follows. Given a user query,
there is a set of documents which contains exactly the relevant documents and
`
`
`
`
no other. Let us refer to this set of documents as the ideal answer set. Given the
description of this ideal answer set, we would have no problems in retrieving its
documents. Thus, we can think of the querying process as a process of specifying
the properties of an ideal answer set (which is analogous to interpreting the
IR problem as a problem of clustering). The problem is that we do not know
exactly what these properties are. All we know is that there are index terms
whose semantics should be used to characterize these properties. Since these
properties are not known at query time, an effort has to be made at initially
guessing what they could be. This initial guess allows us to generate a preliminary
probabilistic description of the ideal answer set which is used to retrieve
a first set of documents. An interaction with the user is then initiated with the
purpose of improving the probabilistic description of the ideal answer set. Such
interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones
are relevant and which ones are not (in truth, only the first top documents need
to be examined). The system then uses this information to refine the description
of the ideal answer set. By repeating this process many times, it is expected
that such a description will evolve and become closer to the real description of
the ideal answer set. Thus, one should always have in mind the need to guess at
the beginning the description of the ideal answer set. Furthermore, a conscious
effort is made to model this description in probabilistic terms.
The probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle)  Given a user query q and a document d_j
in the collection, the probabilistic model tries to estimate the probability that
the user will find the document d_j interesting (i.e., relevant). The model assumes
that this probability of relevance depends on the query and the document
representations only. Further, the model assumes that there is a subset of all
documents which the user prefers as the answer set for the query q. Such an
ideal answer set is labeled R and should maximize the overall probability of relevance
to the user. Documents in the set R are predicted to be relevant to the
query. Documents not in this set are predicted to be non-relevant.
`
This assumption is quite troublesome because it does not state explicitly
how to compute the probabilities of relevance. In fact, not even the sample space
which is to be used for defining such probabilities is given.

Given a query q, the probabilistic model assigns to each document d_j, as
a measure of its similarity to the query, the ratio P(d_j relevant-to q)/P(d_j non-relevant-to q)
which computes the odds of the document d_j being relevant to the
query q. Taking the odds of relevance as the rank minimizes the probability of
an erroneous judgement [282, 785].
`
Definition  For the probabilistic model, the index term weight variables are
all binary, i.e., w_{i,j} ∈ {0,1}, w_{i,q} ∈ {0,1}. A query q is a subset of index terms.
Let R be the set of documents known (or initially guessed) to be relevant. Let R̄
be the complement of R (i.e., the set of non-relevant documents). Let P(R|d_j) be
`
`
`
`
`
the probability that the document d_j is relevant to the query q and P(R̄|d_j) be
the probability that d_j is non-relevant to q. The similarity sim(d_j, q) of the
document d_j to the query q is defined as the ratio

    sim(d_j, q) = P(R|d_j) / P(R̄|d_j)

Using Bayes' rule,

    sim(d_j, q) = (P(d_j|R) × P(R)) / (P(d_j|R̄) × P(R̄))

P(d_j|R) stands for the probability of randomly selecting the document d_j from
the set R of relevant documents. Further, P(R) stands for the probability that a
document randomly selected from the entire collection is relevant. The meanings
attached to P(d_j|R̄) and P(R̄) are analogous and complementary.

Since P(R) and P(R̄) are the same for all the documents in the collection,
we write

    sim(d_j, q) ~ P(d_j|R) / P(d_j|R̄)

Assuming independence of index terms,

    sim(d_j, q) ~ (∏_{g_i(d_j)=1} P(k_i|R) × ∏_{g_i(d_j)=0} P(k̄_i|R)) /
                  (∏_{g_i(d_j)=1} P(k_i|R̄) × ∏_{g_i(d_j)=0} P(k̄_i|R̄))
`
P(k_i|R) stands for the probability that the index term k_i is present in a document
randomly selected from the set R. P(k̄_i|R) stands for the probability that the
index term k_i is not present in a document randomly selected from the set R.
The probabilities associated with the set R̄ have meanings which are analogous
to the ones just described.

Taking logarithms, recalling that P(k_i|R) + P(k̄_i|R) = 1, and ignoring
factors which are constant for all documents in the context of the same query,
we can finally write

    sim(d_j, q) ~ Σ_{i=1}^{t} w_{i,q} × w_{i,j} × ( log [P(k_i|R) / (1 − P(k_i|R))] + log [(1 − P(k_i|R̄)) / P(k_i|R̄)] )

which is a key expression for ranking computation in the probabilistic model.

Since we do not know the set R at the beginning, it is necessary to devise
a method for initially computing the probabilities P(k_i|R) and P(k_i|R̄). There
are many alternatives for such computation. We discuss a couple of them below.
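The ranking expression above can be sketched as follows (a minimal sketch; the probability estimates in the example are placeholders, since in practice they come from the initial guesses or from relevance feedback, and the term names are invented):

```python
from math import log

def bir_sim(doc_terms, query_terms, p_rel):
    """Binary independence ranking: with binary weights, the sum runs
    over the query terms present in the document, each contributing
    log(p/(1-p)) + log((1-q)/q), where p = P(k_i|R) and q = P(k_i|R_bar)
    are stored in p_rel as (p, q) pairs."""
    score = 0.0
    for term in query_terms & doc_terms:
        p, q = p_rel[term]
        score += log(p / (1.0 - p)) + log((1.0 - q) / q)
    return score

# Hypothetical estimates: "network" is much rarer among non-relevant
# documents, so matching it contributes far more to the rank.
p_rel = {"computer": (0.5, 0.4), "network": (0.5, 0.01)}
doc = {"computer", "network"}
assert bir_sim(doc, {"network"}, p_rel) > bir_sim(doc, {"computer"}, p_rel)
```

Note the sketch keeps only the terms with w_{i,q} × w_{i,j} = 1, exactly as the binary weights in the summation dictate.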
`
`
`
`
`
In the very beginning (i.e., immediately after the query specification), there
are no retrieved documents. Thus, one has to make simplifying assumptions such
as: (a) assume that P(k_i|R) is constant for all index terms k_i (typically, equal to
0.5) and (b) assume that the distribution of index terms among the non-relevant
documents can be approximated by the distribution of index terms among all
the documents in the collection. These two assumptions yield

    P(k_i|R) = 0.5        P(k_i|R̄) = n_i / N

where, as already defined, n_i is the number of documents which contain the
index term k_i and N is the total number of documents in