Modern Information Retrieval
Ricardo Baeza-Yates
Berthier Ribeiro-Neto

ACM Press
New York

Addison-Wesley

Harlow, England • Reading, Massachusetts
Menlo Park, California • New York
Don Mills, Ontario • Amsterdam • Bonn
Sydney • Singapore • Tokyo • Madrid
San Juan • Milan • Mexico City • Seoul • Taipei
Copyright © 1999 by the ACM Press, a Division of the Association for Computing Machinery, Inc. (ACM).

Addison Wesley Longman Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England

and Associated Companies throughout the World.

The rights of the authors of this Work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.

While the publisher has made every attempt to trace all copyright owners and obtain permission to reproduce material, in a few cases this has proved impossible. Copyright holders of material which has not been acknowledged are encouraged to contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Addison Wesley Longman Limited has made every attempt to supply trademark information about manufacturers and their products mentioned in this book. A list of the trademark designations and their owners appears on page viii.

Typeset in Computer Modern by 56
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-39829-X

1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araújo Neto, 1960- . II. Title.
Z667.B34 1999
025.04-dc21

99-10033
CIP
t-dimensional vectorial space and standard linear algebra operations on vectors. For the classic probabilistic model, the framework is composed of sets, standard probability operations, and the Bayes' theorem.

In the remainder of this chapter, we discuss the various IR models shown in Figure 2.1. Throughout the discussion, we do not explicitly instantiate the components $D$, $Q$, $F$, and $R(q_i, d_j)$ of each model. Such components should be quite clear from the discussion and can be easily inferred.

2.5 Classic Information Retrieval
In this section we briefly present the three classic models in information retrieval, namely, the Boolean, the vector, and the probabilistic models.

2.5.1 Basic Concepts
The classic models in information retrieval consider that each document is described by a set of representative keywords called index terms. An index term is simply a (document) word whose semantics helps in remembering the document's main themes. Thus, index terms are used to index and summarize the document contents. In general, index terms are mainly nouns because nouns have meaning by themselves and thus, their semantics is easier to identify and to grasp. Adjectives, adverbs, and connectives are less useful as index terms because they work mainly as complements. However, it might be interesting to consider all the distinct words in a document collection as index terms. For instance, this approach is adopted by some Web search engines as discussed in Chapter 13 (in which case, the document logical view is full text). We postpone a discussion of the problem of how to generate index terms until Chapter 7, where the issue is covered in detail.

Given a set of index terms for a document, we notice that not all terms are equally useful for describing the document contents. In fact, there are index terms which are simply vaguer than others. Deciding on the importance of a term for summarizing the contents of a document is not a trivial issue. Despite this difficulty, there are properties of an index term which are easily measured and which are useful for evaluating the potential of a term as such. For instance, consider a collection with a hundred thousand documents. A word which appears in each of the one hundred thousand documents is completely useless as an index term because it does not tell us anything about which documents the user might be interested in. On the other hand, a word which appears in just five documents is quite useful because it narrows down considerably the space of documents which might be of interest to the user. Thus, it should be clear that distinct index terms have varying relevance when used to describe document contents. This effect is captured through the assignment of numerical weights to each index term of a document.
Let $k_i$ be an index term, $d_j$ be a document, and $w_{i,j} \geq 0$ be a weight associated with the pair $(k_i, d_j)$. This weight quantifies the importance of the index term for describing the document semantic contents.

Definition. Let $t$ be the number of index terms in the system and $k_i$ be a generic index term. $K = \{k_1, \ldots, k_t\}$ is the set of all index terms. A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$. For an index term which does not appear in the document text, $w_{i,j} = 0$. With the document $d_j$ is associated an index term vector $\vec{d}_j$ represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$. Further, let $g_i$ be a function that returns the weight associated with the index term $k_i$ in any $t$-dimensional vector (i.e., $g_i(\vec{d}_j) = w_{i,j}$).
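To make the notation concrete, a minimal Python sketch follows (this is our illustration, not code from the original text; the three-term vocabulary and the weights are hypothetical):

```python
# A minimal sketch of the index term vector notation above.
# Vocabulary and weights are illustrative assumptions only.
index_terms = ["computer", "network", "protocol"]  # K = {k_1, ..., k_t}, t = 3

# d_j is represented by the vector (w_{1,j}, ..., w_{t,j}); 0 means the term is absent.
d_j = [0.8, 0.5, 0.0]

def g(i, vec):
    """Return the weight of index term k_i in a t-dimensional vector,
    i.e., g_i(d_j) = w_{i,j} (terms are indexed from 1 as in the text)."""
    return vec[i - 1]

assert g(1, d_j) == 0.8   # weight of 'computer' in d_j
assert g(3, d_j) == 0.0   # 'protocol' does not appear in d_j
```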
As we later discuss, the index term weights are usually assumed to be mutually independent. This means that knowing the weight $w_{i,j}$ associated with the pair $(k_i, d_j)$ tells us nothing about the weight $w_{i+1,j}$ associated with the pair $(k_{i+1}, d_j)$. This is clearly a simplification because occurrences of index terms in a document are not uncorrelated. Consider, for instance, that the terms computer and network are used to index a given document which covers the area of computer networks. Frequently, in this document, the appearance of one of these two words attracts the appearance of the other. Thus, these two words are correlated and their weights could reflect this correlation. While mutual independence seems to be a strong simplification, it does simplify the task of computing index term weights and allows for fast ranking computation. Furthermore, taking advantage of index term correlations for improving the final document ranking is not a simple task. In fact, none of the many approaches proposed in the past has clearly demonstrated that index term correlations are advantageous (for ranking purposes) with general collections. Therefore, unless clearly stated otherwise, we assume mutual independence among index terms. In Chapter 5 we discuss modern retrieval techniques which are based on term correlations and which have been tested successfully with particular collections. These successes seem to be slowly shifting the current understanding towards a more favorable view of the usefulness of term correlations for information retrieval systems.

The above definitions provide support for discussing the three classic information retrieval models, namely, the Boolean, the vector, and the probabilistic models, as we now do.
2.5.2 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean algebra. Since the concept of a set is quite intuitive, the Boolean model provides a framework which is easy to grasp by a common user of an IR system. Furthermore, the queries are specified as Boolean expressions which have precise semantics. Given its inherent simplicity and neat formalism, the Boolean model received great attention in past years and was adopted by many of the early commercial bibliographic systems.
Figure 2.3 The three conjunctive components for the query $[q = k_a \wedge (k_b \vee \neg k_c)]$.
Unfortunately, the Boolean model suffers from major drawbacks. First, its retrieval strategy is based on a binary decision criterion (i.e., a document is predicted to be either relevant or non-relevant) without any notion of a grading scale, which prevents good retrieval performance. Thus, the Boolean model is in reality much more a data (instead of information) retrieval model. Second, while Boolean expressions have precise semantics, frequently it is not simple to translate an information need into a Boolean expression. In fact, most users find it difficult and awkward to express their query requests in terms of Boolean expressions. The Boolean expressions actually formulated by users often are quite simple (see Chapter 10 for a more thorough discussion on this issue). Despite these drawbacks, the Boolean model is still the dominant model with commercial document database systems and provides a good starting point for those new to the field.

The Boolean model considers that index terms are present or absent in a document. As a result, the index term weights are assumed to be all binary, i.e., $w_{i,j} \in \{0,1\}$. A query $q$ is composed of index terms linked by three connectives: not, and, or. Thus, a query is essentially a conventional Boolean expression which can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form, DNF). For instance, the query $[q = k_a \wedge (k_b \vee \neg k_c)]$ can be written in disjunctive normal form as $[\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)]$, where each of the components is a binary weighted vector associated with the tuple $(k_a, k_b, k_c)$. These binary weighted vectors are called the conjunctive components of $\vec{q}_{dnf}$. Figure 2.3 illustrates the three conjunctive components for the query $q$.
Definition. For the Boolean model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$. A query $q$ is a conventional Boolean expression. Let $\vec{q}_{dnf}$ be the disjunctive normal form for the query $q$. Further, let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$. The similarity of a document $d_j$ to the query $q$ is defined as

$$\mathrm{sim}(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\; g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$
If $\mathrm{sim}(d_j, q) = 1$ then the Boolean model predicts that the document $d_j$ is relevant to the query $q$ (it might not be). Otherwise, the prediction is that the document is not relevant.
The Boolean model predicts that each document is either relevant or non-relevant. There is no notion of a partial match to the query conditions. For instance, let $d_j$ be a document for which $\vec{d}_j = (0,1,0)$. Document $d_j$ includes the index term $k_b$ but is considered non-relevant to the query $[q = k_a \wedge (k_b \vee \neg k_c)]$, as the sketch below illustrates.
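As an illustration (our own sketch, not code from the original text), the following Python fragment evaluates the Boolean similarity for a query already expressed in DNF over the hypothetical tuple $(k_a, k_b, k_c)$:

```python
# Boolean model similarity for a query in disjunctive normal form (DNF).
# Each conjunctive component and each document is a binary weight vector
# over the tuple (k_a, k_b, k_c); the three-term vocabulary is assumed.

# q = k_a AND (k_b OR NOT k_c), expressed in DNF:
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def sim(d, q_dnf):
    """Return 1 if some conjunctive component of the query matches the
    document's binary term vector exactly, 0 otherwise."""
    return 1 if any(cc == d for cc in q_dnf) else 0

d_j = (0, 1, 0)                # contains k_b only
print(sim(d_j, q_dnf))         # 0: predicted non-relevant, as in the example
print(sim((1, 1, 0), q_dnf))   # 1: matches the conjunctive component (1,1,0)
```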
The main advantages of the Boolean model are the clean formalism behind the model and its simplicity. The main disadvantages are that exact matching may lead to retrieval of too few or too many documents (see Chapter 10). Today, it is well known that index term weighting can lead to a substantial improvement in retrieval performance. Index term weighting brings us to the vector model.
2.5.3 Vector Model

The vector model [697, 695] recognizes that the use of binary weights is too limiting and proposes a framework in which partial matching is possible. This is accomplished by assigning non-binary weights to index terms in queries and in documents. These term weights are ultimately used to compute the degree of similarity between each document stored in the system and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query terms only partially. The main resultant effect is that the ranked document answer set is a lot more precise (in the sense that it better matches the user information need) than the document answer set retrieved by the Boolean model.
Definition. For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$ where $t$ is the total number of index terms in the system. As before, the vector for a document $d_j$ is represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
Therefore, a document $d_j$ and a user query $q$ are represented as $t$-dimensional vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the query $q$ as the correlation between the vectors $\vec{d}_j$ and $\vec{q}$. This correlation can be quantified, for instance, by the cosine of the angle between these two vectors. That is,

$$\mathrm{sim}(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
Figure 2.4 The cosine of $\theta$ is adopted as $\mathrm{sim}(d_j, q)$.
where $|\vec{d}_j|$ and $|\vec{q}|$ are the norms of the document and query vectors. The factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor $|\vec{d}_j|$ provides a normalization in the space of the documents.

Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, $\mathrm{sim}(d_j, q)$ varies from 0 to +1. Thus, instead of attempting to predict whether a document is relevant or not, the vector model ranks the documents according to their degree of similarity to the query. A document might be retrieved even if it matches the query only partially. For instance, one can establish a threshold on $\mathrm{sim}(d_j, q)$ and retrieve the documents with a degree of similarity above that threshold. But to compute rankings we need first to specify how index term weights are obtained.
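Given the weights, however, the ranking computation itself is straightforward. A minimal Python sketch follows (our illustration; the three-term vocabulary and weight values are hypothetical):

```python
import math

def cosine_sim(d, q):
    """sim(d_j, q) = (d_j . q) / (|d_j| * |q|); returns 0 for a zero vector."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Hypothetical weight vectors over a three-term vocabulary.
docs = {"d1": [0.5, 0.8, 0.3], "d2": [0.9, 0.0, 0.0], "d3": [0.0, 0.2, 0.7]}
q = [0.8, 0.6, 0.0]

# Rank documents by decreasing similarity; a threshold could cut the list.
ranking = sorted(docs, key=lambda d: cosine_sim(docs[d], q), reverse=True)
print(ranking)   # ['d1', 'd2', 'd3']
```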
Index term weights can be calculated in many different ways. The work by Salton and McGill [698] reviews various term-weighting techniques. Here, we do not discuss them in detail. Instead, we concentrate on elucidating the main idea behind the most effective term-weighting techniques. This idea is related to the basic principles which support clustering techniques, as follows.

Given a collection $C$ of objects and a vague description of a set $A$, the goal of a simple clustering algorithm might be to separate the collection $C$ of objects into two sets: a first one which is composed of objects related to the set $A$ and a second one which is composed of objects not related to the set $A$. Vague description here means that we do not have complete information for deciding precisely which objects are and which are not in the set $A$. For instance, one might be looking for a set $A$ of cars which have a price comparable to that of a Lexus 400. Since it is not clear what the term comparable means exactly, there is not a precise (and unique) description of the set $A$. More sophisticated clustering algorithms might attempt to separate the objects of a collection into various clusters (or classes) according to their properties. For instance, patients of a doctor specializing in cancer could be classified into five classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the possible class descriptions might be imprecise (and not unique) and the problem is one of deciding to which of these classes a new patient should be assigned. In what follows, however, we only discuss the simpler version of the clustering problem (i.e., the one which considers only two classes) because all that is required is a decision on which documents are predicted to be relevant and which ones are predicted to be not relevant (with regard to a given user query).
To view the IR problem as one of clustering, we refer to the early work of Salton. We think of the documents as a collection $C$ of objects and think of the user query as a (vague) specification of a set $A$ of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set $A$ and which ones are not (i.e., the IR problem can be viewed as a clustering problem). In a clustering problem, two main issues have to be resolved. First, one needs to determine what are the features which better describe the objects in the set $A$. Second, one needs to determine what are the features which better distinguish the objects in the set $A$ from the remaining objects in the collection $C$. The first set of features provides for quantification of intra-cluster similarity, while the second set of features provides for quantification of inter-cluster dissimilarity. The most successful clustering algorithms try to balance these two effects.
In the vector model, intra-clustering similarity is quantified by measuring the raw frequency of a term $k_i$ inside a document $d_j$. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents (i.e., intra-document characterization). Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term $k_i$ among the documents in the collection. This factor is usually referred to as the inverse document frequency or the idf factor. The motivation for usage of an idf factor is that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. As with good clustering algorithms, the most effective term-weighting schemes for IR try to balance these two effects.
Definition. Let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term $k_i$ is mentioned in the text of the document $d_j$). Then, the normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \qquad (2.1)$$

where the maximum is computed over all terms which are mentioned in the text of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$ then $f_{i,j} = 0$. Further, let $idf_i$, the inverse document frequency for $k_i$, be given by

$$idf_i = \log \frac{N}{n_i} \qquad (2.2)$$

The best known term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \qquad (2.3)$$
or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.
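A minimal Python rendering of equations (2.1)-(2.3) might look as follows (a sketch under the assumption that documents are given as lists of index terms; the function name and sample data are ours, not from the book):

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute w_{i,j} = f_{i,j} * log(N / n_i) per equations (2.1)-(2.3).
    `docs` maps a document id to its list of index terms (assumed format)."""
    N = len(docs)
    # n_i: number of documents in which term k_i appears.
    n = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        freq = Counter(terms)
        max_freq = max(freq.values())      # max_l freq_{l,j}, equation (2.1)
        weights[doc_id] = {
            term: (f / max_freq) * math.log(N / n[term])   # f_{i,j} * idf_i
            for term, f in freq.items()
        }
    return weights

docs = {"d1": ["computer", "network", "computer"],
        "d2": ["network", "protocol"],
        "d3": ["computer", "protocol", "protocol"]}
print(tf_idf_weights(docs)["d1"])
```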
Several variations of the above expression for the weight $w_{i,j}$ are described in an interesting paper by Salton and Buckley which appeared in 1988 [696]. However, in general, the above expression should provide a good weighting scheme for many collections.
For the query term weights, Salton and Buckley suggest

$$w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i} \qquad (2.4)$$

where $freq_{i,q}$ is the raw frequency of the term $k_i$ in the text of the information request $q$.
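Under the same assumptions as the previous sketch, equation (2.4) could be rendered as (again, illustrative code of ours):

```python
import math
from collections import Counter

def query_weights(query_terms, N, n):
    """w_{i,q} = (0.5 + 0.5 * freq_{i,q} / max_l freq_{l,q}) * log(N / n_i).
    `n` maps each term to its document frequency n_i (assumed precomputed);
    terms absent from the collection are skipped."""
    freq = Counter(query_terms)
    max_freq = max(freq.values())
    return {term: (0.5 + 0.5 * f / max_freq) * math.log(N / n[term])
            for term, f in freq.items() if term in n}

print(query_weights(["computer", "network"], N=3,
                    n={"computer": 2, "network": 2, "protocol": 2}))
```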
The main advantages of the vector model are: (1) its term-weighting scheme improves retrieval performance; (2) its partial matching strategy allows retrieval of documents that approximate the query conditions; and (3) its cosine ranking formula sorts the documents according to their degree of similarity to the query. Theoretically, the vector model has the disadvantage that index terms are assumed to be mutually independent (equation (2.3) does not account for index term dependencies). However, in practice, consideration of term dependencies might be a disadvantage. Due to the locality of many term dependencies, their indiscriminate application to all the documents in the collection might in fact hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with general collections. It yields ranked answer sets which are difficult to improve upon without query expansion or relevance feedback (see Chapter 5) within the framework of the vector model. A large variety of alternative ranking methods have been compared to the vector model but the consensus seems to be that, in general, the vector model is either superior or almost as good as the known alternatives. Furthermore, it is simple and fast. For these reasons, the vector model is a popular retrieval model nowadays.
2.5.4 Probabilistic Model

In this section, we describe the classic probabilistic model introduced in 1976 by Robertson and Sparck Jones [677] which later became known as the binary independence retrieval (BIR) model. Our discussion is intentionally brief and focuses mainly on highlighting the key features of the model. With this purpose in mind, we do not detain ourselves in subtleties regarding the binary independence assumption for the model. The section on bibliographic discussion points to references which cover these details.
The probabilistic model attempts to capture the IR problem within a probabilistic framework. The fundamental idea is as follows. Given a user query, there is a set of documents which contains exactly the relevant documents and
no other. Let us refer to this set of documents as the ideal answer set. Given the description of this ideal answer set, we would have no problems in retrieving its documents. Thus, we can think of the querying process as a process of specifying the properties of an ideal answer set (which is analogous to interpreting the IR problem as a problem of clustering). The problem is that we do not know exactly what these properties are. All we know is that there are index terms whose semantics should be used to characterize these properties. Since these properties are not known at query time, an effort has to be made at initially guessing what they could be. This initial guess allows us to generate a preliminary probabilistic description of the ideal answer set which is used to retrieve a first set of documents. An interaction with the user is then initiated with the purpose of improving the probabilistic description of the ideal answer set. Such interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones are relevant and which ones are not (in truth, only the first top documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, it is expected that such a description will evolve and become closer to the real description of the ideal answer set. Thus, one should always have in mind the need to guess at the beginning the description of the ideal answer set. Furthermore, a conscious effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.
Assumption (Probabilistic Principle). Given a user query $q$ and a document $d_j$ in the collection, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query $q$. Such an ideal answer set is labeled $R$ and should maximize the overall probability of relevance to the user. Documents in the set $R$ are predicted to be relevant to the query. Documents not in this set are predicted to be non-relevant.
`
`This assumption is quite troublesome because it does not state explicitly
`how to compute the probabilities of relevance. In fact, not even the sample space
`which is to be used for defining such probabilities ts given.
`Given a query g. the probabilistic model assigns to each document d;, as
`a mcasure of its similarity wo the query. the ratio Pid; relevant-to q)/P(d,; non-
`relevant-to g} which computes the odds of the document d; being relevant to the
`query g. Taking the odds of relevance as the rank minimizes the probability of
`an erroneous judgement [282, 785).
Definition. For the probabilistic model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$, $w_{i,q} \in \{0,1\}$. A query $q$ is a subset of index terms. Let $R$ be the set of documents known (or initially guessed) to be relevant. Let $\bar{R}$ be the complement of $R$ (i.e., the set of non-relevant documents). Let $P(R|\vec{d}_j)$
be the probability that the document $d_j$ is relevant to the query $q$ and $P(\bar{R}|\vec{d}_j)$ be the probability that $d_j$ is non-relevant to $q$. The similarity $\mathrm{sim}(d_j, q)$ of the document $d_j$ to the query $q$ is defined as the ratio

$$\mathrm{sim}(d_j, q) = \frac{P(R|\vec{d}_j)}{P(\bar{R}|\vec{d}_j)}$$
Using Bayes' rule,

$$\mathrm{sim}(d_j, q) = \frac{P(\vec{d}_j|R) \times P(R)}{P(\vec{d}_j|\bar{R}) \times P(\bar{R})}$$

$P(\vec{d}_j|R)$ stands for the probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents. Further, $P(R)$ stands for the probability that a document randomly selected from the entire collection is relevant. The meanings attached to $P(\vec{d}_j|\bar{R})$ and $P(\bar{R})$ are analogous and complementary.

Since $P(R)$ and $P(\bar{R})$ are the same for all the documents in the collection, we write
$$\mathrm{sim}(d_j, q) \sim \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}$$
Assuming independence of index terms,

$$\mathrm{sim}(d_j, q) \sim \frac{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i|R)\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|R)\right)}{\left(\prod_{g_i(\vec{d}_j)=1} P(k_i|\bar{R})\right) \times \left(\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|\bar{R})\right)}$$

$P(k_i|R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set $R$. $P(\bar{k}_i|R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set $R$. The probabilities associated with the set $\bar{R}$ have meanings which are analogous to the ones just described.
Taking logarithms, recalling that $P(k_i|R) + P(\bar{k}_i|R) = 1$, and ignoring factors which are constant for all documents in the context of the same query, we can finally write

$$\mathrm{sim}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left(\log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})}\right) \qquad (2.5)$$

which is a key expression for ranking computation in the probabilistic model.
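As an illustration only (not code from the original text), a minimal Python sketch of equation (2.5) follows; it assumes the probability estimates are supplied externally:

```python
import math

def bir_sim(doc_terms, query_terms, p_rel, p_nonrel):
    """Rank a document by equation (2.5). `p_rel[k]` estimates P(k_i|R) and
    `p_nonrel[k]` estimates P(k_i|R-bar); both are assumed precomputed."""
    score = 0.0
    for k in query_terms:
        if k in doc_terms:   # w_{i,q} * w_{i,j} = 1 only for shared terms
            pr, pn = p_rel[k], p_nonrel[k]
            score += math.log(pr / (1 - pr)) + math.log((1 - pn) / pn)
    return score

# Hypothetical estimates for a two-term query.
print(bir_sim({"computer", "network"}, {"computer", "protocol"},
              p_rel={"computer": 0.5, "protocol": 0.5},
              p_nonrel={"computer": 0.4, "protocol": 0.3}))
```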
Since we do not know the set $R$ at the beginning, it is necessary to devise a method for initially computing the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$. There are many alternatives for such computation. We discuss a couple of them below.
In the very beginning (i.e., immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions such as: (a) assume that $P(k_i|R)$ is constant for all index terms $k_i$ (typically, equal to 0.5) and (b) assume that the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection. These two assumptions yield

$$P(k_i|R) = 0.5 \qquad\qquad P(k_i|\bar{R}) = \frac{n_i}{N}$$

where, as already defined, $n_i$ is the number of documents which contain the index term $k_i$ and $N$ is the total number of documents in the collection. Given this initial guess, we can then retrieve documents which contain query terms and provide an initial probabilistic ranking for them. After that, this initial ranking is improved as follows.
Let $V$ be a subset of the documents initially retrieved and ranked by the probabilistic model. Such a subset can be defined, for instance, as the top $r$ ranked documents where $r$ is a previously defined threshold. Further, let $V_i$ be the subset of $V$ composed of the documents in $V$ which contain the index term $k_i$. For simplicity, we also use $V$ and $V_i$ to refer to the number of elements in these sets (it should always be clear when the used variable refers to the set or to the number of elements in it). For improving the probabilistic ranking, we need to improve our guesses for $P(k_i|R)$ and $P(k_i|\bar{R})$. This can be accomplished with the following assumptions: (a) we can approximate $P(k_i|R)$ by the distribution of the index term $k_i$ among the documents retrieved so far, and (b) we can approximate $P(k_i|\bar{R})$ by considering that all the non-retrieved documents are not relevant. With these assumptions, we can write

$$P(k_i|R) = \frac{V_i}{V} \qquad\qquad P(k_i|\bar{R}) = \frac{n_i - V_i}{N - V}$$

This process can then be repeated recursively. By doing so, we are able to improve on our guesses for the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$ without any assistance from a human subject (contrary to the original idea). However, we can also use assistance from the user for definition of the subset $V$ as originally conceived.
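A minimal sketch of this bootstrapping procedure (our own illustration; it assumes documents are given as sets of index terms and ignores the boundary cases where $V_i = 0$ or $V_i = V$):

```python
def initial_estimates(query_terms, n, N):
    """Initial guesses: P(k_i|R) = 0.5 and P(k_i|R-bar) = n_i / N."""
    p_rel = {k: 0.5 for k in query_terms}
    p_nonrel = {k: n[k] / N for k in query_terms}
    return p_rel, p_nonrel

def refine(ranked_docs, query_terms, n, N, r):
    """Refine the estimates from the top-r retrieved documents V:
    P(k_i|R) ~ V_i / V and P(k_i|R-bar) ~ (n_i - V_i) / (N - V)."""
    V = ranked_docs[:r]
    p_rel, p_nonrel = {}, {}
    for k in query_terms:
        V_i = sum(1 for terms in V if k in terms)
        p_rel[k] = V_i / len(V)
        p_nonrel[k] = (n[k] - V_i) / (N - len(V))
    return p_rel, p_nonrel

# Hypothetical setup: three documents represented as sets of index terms.
docs = [{"computer", "network"}, {"network"}, {"protocol"}]
n = {"computer": 1, "network": 2, "protocol": 1}
q = {"computer", "network"}
p_rel, p_nonrel = initial_estimates(q, n, len(docs))
p_rel, p_nonrel = refine(docs, q, n, len(docs), r=2)  # one refinement step
```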
The last formulas for $P$