Modern Information Retrieval

Ricardo Baeza-Yates
Berthier Ribeiro-Neto

ACM Press
New York

Addison-Wesley

Harlow, England • Reading, Massachusetts
Menlo Park, California • New York
Don Mills, Ontario • Amsterdam • Bonn
Sydney • Singapore • Tokyo • Madrid
San Juan • Milan • Mexico City • Seoul • Taipei

Copyright © 1999 by the ACM Press, A Division of the Association for Computing
Machinery, Inc. (ACM).

Addison Wesley Longman Limited
Edinburgh Gate
Harlow
Essex CM20 2JE
England

and Associated Companies throughout the World.

The rights of the authors of this Work have been asserted by them in accordance with
the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system, or transmitted in any form or by any means, electronic, mechanical,
photocopying, recording or otherwise, without either the prior written permission of
the publisher or a licence permitting restricted copying in the United Kingdom issued
by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1P 9HE.

While the publisher has made every attempt to trace all copyright owners and obtain
permission to reproduce material, in a few cases this has proved impossible.
Copyright holders of material which has not been acknowledged are encouraged to
contact the publisher.

Many of the designations used by manufacturers and sellers to distinguish their
products are claimed as trademarks. Addison Wesley Longman Limited has made
every attempt to supply trade mark information about manufacturers and their
products mentioned in this book. A list of the trademark designations and their
owners appears on page viii.

Typeset in Computer Modern by 56
Printed and bound in the United States of America

First printed 1999

ISBN 0-201-39829-X

`ŽzZg‘$’j“•”;–g—?˜™šœ›$›$›g‘`—I–RžŸ Žz¡"g’bž£¢j¢£¤RŸx¥¦“•§$“•¢I“•;¨ Z©v—I–Rž Ÿ ¢j¢j ˜ “•ªU—I“•Z¨C©€Z’5Žz$«¬R—I“†¨S”¡ˆª ˜ –g“†¨Sª;’b‘$¤Y­®¨ ˜$¯_° Ÿ±ŽO¡C² ¯Ÿ ³g³>“•¢jZ¨t´ž£¢Iµ•žœ‘0¶8;¨S”U«AªZ¨C¶]“•«·“•—jž£³¸v³e“†¨>g¬’b”;– ¹ ªU—jžº ª;’jµ•;»¸v¢j¢jžh¼CŽz¡C½Z¾|½¿g¸¸¨S”;µ•ªZ¨R³ªZ¨R³ˆŸ ¢j¢j ˜ “•ª$—jž£³CŽz$«Sª;¨“•ž£¢ —I–g’bZ¬R”Z–RZ¬R— —I–Rž ´CZ’jµ•³ ¯Àz–Sž_’j“•”Z–g—j¢±Z©—I–Rž ªZ¬R—I–RZ’b¢_Z©Á—I–“•¢±´;’jÂ`–Rª;§$ž ]žœžh¨ˆªU¢j¢jžh’b—jž£³`g‘k—I–Rž£«Ã“†¨ˆª ˜£˜ ;’b³ª;¨ ˜ ž»z“•—I–—I–RžŽO;g‘$’j“•”Z–g—£¤]¥±ž£¢I“•”Z¨R¢ ªZ¨R³ˆÄª$—jžh¨g—j¢_Ÿ ˜ — š£›$ÅUÅ ¯Ÿ¦µ†µY’j“•”Z–g—j¢_’bž£¢jžh’b§$žœ³ ¯3Æ  Sª;’b— Z©Á—I–“•¢_¬ggµ†“ ˜ ª$—I“ǏZ¨ «:ªZ‘A]ž ’bžb’b>³>¬ ˜ ž£³n¤Y¢j—jZ’bžœ³`“†¨Cª’bž£—I’j“•ž£§;ª;µv¢j‘>¢j—jž£«0¤S;’±—I’bªZ¨R¢j«·“•—j—jž£³0“†¨CªZ¨g‘A©€;’b«Z’_g‘kªZ¨g‘&«Až£ªZ¨R¢£¤Sžbµ•ž ˜ —I’bZ¨g“ ˜ ¤Y«Až ˜ –Sª;¨“ ˜ ªZµ€¤g–SU—j ˜ Zg‘$“†¨R”¤8’bž ˜ Z’b³>“†¨S”k;’ U—I–Sžb’b»O“•¢jžU¤]»O“•—I–RZ¬R— žb“•—I–Sžh’5—I–Sž g’j“•Z’5»z’j“•—j—jžh¨0]žb’b«·“•¢j¢I“•Z¨ ;©—I–Rž ¬gµ†“•¢I–Ržh’:Z’±ª·µ†“ ˜ žb¨ ˜ ž]žh’b«·“•—j—I“†¨R”`’bž£¢j—I’j“ ˜ —jž£³ ˜ Zg‘$“†¨R”A“†¨C—I–Sž È_¨“•—jžœ³ˆÉ “†¨S”U³U«"“•¢j¢I¬Sžœ³g‘ —I–Rž_ŽzZg‘$’j“•”Z–g—v¶n“ ˜ žh¨S¢I“†¨R”Ÿ ”$žb¨ ˜ ‘¶8—j³]¤U›$¾ ÀU—j—jžh¨g–SªU«ÊŽzZ¬g’b—˦Uª$³]¤$¶YZ¨R³;¨k´ÌšœÄ›$º±¸ ¯´Í–g“†µ•žA—I–Rž ¬ggµ†“•¢I–Sžh’–Rª$¢_«Aª$³gžž£§$žh’b‘kªU—j—jž£«·R—¦—j:—I’bª ˜ ž ª;µ†µ ˜ Zg‘$’j“•”;–g— ;»z¨Sžh’b¢±ªZ¨R³0;S—jª;“†¨]žh’b«“•¢j¢I“•Z¨t—j·’bžhg’b>³e¬ ˜ žA«AªU—jžh’j“•ªZµ€¤S“†¨Cª©€ž£» ˜ ªU¢jž£¢_—I–“•¢_–SªU¢¦g’bZ§Už£³0“•«·]U¢j¢I“†µ•ž ¯ŽzZg‘$’j“•”;–g—¦–RZµ•³gžh’b¢ Z©v«AªU—jžh’j“•ªZµÁ»z–“ ˜ –ˆ–SªU¢¦¨R$—z]ž£žb¨ˆª ˜ Â>¨RZ»zµ•ž£³g”$žœ³ªZ’bž žh¨ ˜ ;¬’bªU”$ž£³ˆ—j˜ ;¨g—jª ˜ —_—I–RžMg¬gµ†“•¢I–Ržh’ ¯¡ˆªZ¨g‘&Z©Á—I–Sž³žœ¢I“•”Z¨Rª$—I“•Z¨R¢ ¬S¢jž£³kg‘0«Aª;¨>¬©®ª ˜ —I¬g’bžh’b¢ ª;¨S³&¢jžhµ†µ•žh’b¢ —jA³>“•¢j—I“†¨R”Z¬g“•¢I– —I–Sžb“†’g’b>³e¬ ˜ —j¢ ª;’bž ˜ µ•ªZ“•«:ž£³ªU¢_—I’bª$³gž£«AªZ’jÂg¢ ¯ Ÿ ³g³e“•¢j;¨´Cž£¢Iµ•ž£‘&¶8;¨S”U«AªZ¨C¶]“•«·“•—jž£³ˆ–SªU¢¦«:ª$³gžž£§Užh’b‘kª$—j—jž£«·R— —j:¢I¬gµ•‘&—I’bª$³gž·«:ªZ’jÂk“†¨©®Z’b«AªU—I“•Z¨tªZ];¬S—±«AªZ¨>¬g©€ª ˜ —I¬’bžh’b¢±ªZ¨R³ˆ—I–Sžb“†’g’b>³e¬ ˜ —j¢ «:žh¨g—I“•Z¨Rž£³ˆ“†¨C—I–“•¢_]>Z ¯ ŸÌµ†“•¢j—±Z©v—I–Rž —I’bª$³gž£«AªZ’jŠ³žœ¢I“•”Z¨Rª$—I“•Z¨R¢ªZ¨R³ˆ—I–Sžh“†’;»z¨Sžh’b¢_ª;]žœªZ’b¢ ;¨ˆSªU”$ž §$“†“†“ ¯Àz‘$]žœ¢jž£—z“†¨tŽOU«·g¬S—jžh’5¡ˆ>³žb’j¨ˆg‘0Î;ÏÄ’j“†¨g—jž£³ˆªZ¨R³0]Z¬g¨R³0“†¨C—I–Sž È_¨“•—jž£³CÐ>—jª$—jžœ¢ Z©ÁŸ «Ažb’j“ ˜ ªÑ8“†’b¢j—¦g’j“†¨g—jž£³šœ›$›$›­IÐ>Ò Æ ¾;ÓÔ½Z¾šbÓsÕ$›$ÅG½Z›;ÓIÖ× ØZÙÇÚ£ÙÜÛ£ÝßÞ)ÙÜàØ;ágØZâmã·ágÚ£áäÇåSæSçÙÜèæébÙÇè8éœê çàäÜÙÜë>ágÚ£ÙÜåSèíìkáÚ£áŸ ˜ ªU—jªZµ•U”Z¬RžM’bž ˜ Z’b³0©€;’ —I–g“•¢z]>ZÂ&“•¢ ª;§;ªZ“†µ•ª;µ•žA©Ü’bU«—I–Sž Ò)’j“•—I“•¢I–C¶]“†’bª;’b‘Þ)ÙÜàØ;ágØ;âÍågîã·åSèæRØZï>ۜÛ0ãáÚ£ágäÜåSæRçÙÜèæébÙÜè8éœê çàäÜÙÇë>áÚ£ÙÇåSèíì`ágÚ£áÒzªUž£ð£ªZӀñ3ª$—jž£¢£¤RË ¯ò° Ëz“ ˜ ª;’b³G²¡ˆ>³gžh’j¨ˆ“†¨g©€Z’b«:ª$—I“•Z¨C’bž£—I’j“•ž£§;ª;µ3ó·Ëz“ ˜ ªZ’b³gkÒzª$ž£ðœªZӀñzªU—jž£¢£¤SÒ3žh’b—I–“•žb’ Ëz“†]žh“†’b;Ó Æ ž£—j ¯ ¯-˜ « ¯­Ü¨ ˜ µ†¬S³gž£¢ g“†µ†“•U”Z’bªZg–S“ ˜ ª$µv’bžh©€žb’bžh¨ ˜ ž£¢_ªZ¨R³0“†¨S³gžh¼ ¯­sÐeÒ Æ ¾ZÓÔ½Z¾gšhÓsÕ$›UÅ>½;›ZÓs֚ ¯ ­Ü¨©€;’b«Aª$—I“•;¨C¢j—jZ’bªU”$žªZ¨R³`’bž£—I“•ž£§;ª;µv¢j‘>¢j—jž£«A¢ ¯ ­ ¯ ËO“†ôžh“†’b¤]Òzžb’b—I–“•žh’5³žŸ¦’bªvõ¬;öbÆ ž£—j¤Yš£›UÏ$¾ZÓ ¯ ­j­ ¯Àz“•—Iµ•ž ¯÷SÏ$ÏGø ¯ÒzÕ$ù š£›$›U›¾>½UÎ ¯¾$ù;ú>³ ˜ ½eš ›U›ZÓbš£¾U¾$Õ$Վ¦­€Ä
`
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloguing-in-Publication Data
Baeza-Yates, R. (Ricardo)
Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
p. cm.

Includes bibliographical references and index.
ISBN 0-201-39829-X

1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araújo
Neto, 1960- . II. Title.
Z667.B34 1999
025.04-dc21

99-10033
CIP

t-dimensional vectorial space and standard linear algebra operations on vectors.
For the classic probabilistic model, the framework is composed of sets, standard
probability operations, and the Bayes' theorem.

In the remainder of this chapter, we discuss the various IR models shown in
Figure 2.1. Throughout the discussion, we do not explicitly instantiate the
components D, Q, F, and R(q_i, d_j) of each model. Such components should be
quite clear from the discussion and can be easily inferred.

2.5 Classic Information Retrieval

In this section we briefly present the three classic models in information
retrieval, namely, the Boolean, the vector, and the probabilistic models.

2.5.1 Basic Concepts

The classic models in information retrieval consider that each document is
described by a set of representative keywords called index terms. An index term
is simply a (document) word whose semantics helps in remembering the document's
main themes. Thus, index terms are used to index and summarize the document
contents. In general, index terms are mainly nouns because nouns have meaning by
themselves and thus their semantics is easier to identify and to grasp.
Adjectives, adverbs, and connectives are less useful as index terms because they
work mainly as complements. However, it might be interesting to consider all the
distinct words in a document collection as index terms. For instance, this
approach is adopted by some Web search engines as discussed in Chapter 13 (in
which case, the document logical view is full text). We postpone a discussion on
the problem of how to generate index terms until Chapter 7, where the issue is
covered in detail.

Given a set of index terms for a document, we notice that not all terms are
equally useful for describing the document contents. In fact, there are index
terms which are simply vaguer than others. Deciding on the importance of a term
for summarizing the contents of a document is not a trivial issue. Despite this
difficulty, there are properties of an index term which are easily measured and
which are useful for evaluating the potential of a term as such. For instance,
consider a collection with a hundred thousand documents. A word which appears in
each of the one hundred thousand documents is completely useless as an index
term because it does not tell us anything about which documents the user might
be interested in. On the other hand, a word which appears in just five documents
is quite useful because it narrows down considerably the space of documents
which might be of interest to the user. Thus, it should be clear that distinct
index terms have varying relevance when used to describe document contents.
This effect is captured through the assignment of numerical weights to each
index term of a document.

Let k_i be an index term, d_j be a document, and w_{i,j} ≥ 0 be a weight
associated with the pair (k_i, d_j). This weight quantifies the importance of
the index term for describing the document semantic contents.

Definition  Let t be the number of index terms in the system and k_i be a
generic index term. K = {k_1, ..., k_t} is the set of all index terms. A weight
w_{i,j} > 0 is associated with each index term k_i of a document d_j. For an
index term which does not appear in the document text, w_{i,j} = 0. With the
document d_j is associated an index term vector \vec{d}_j represented by
\vec{d}_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}). Further, let g_i be a function
that returns the weight associated with the index term k_i in any t-dimensional
vector (i.e., g_i(\vec{d}_j) = w_{i,j}).

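To make the notation concrete, a document can be held as a plain list of t weights
and g_i read off by position. The short Python fragment below is purely illustrative
(it is not from the book; the term names and weights are invented):

    # Purely illustrative: a document as a t-dimensional weight vector (t = 3)
    # and g_i as a positional accessor.  Term names and weights are invented.
    index_terms = ['k1', 'k2', 'k3']     # K = {k_1, ..., k_t}

    d_j = [0.0, 0.7, 0.2]                # d_j = (w_1j, w_2j, w_3j)

    def g(i, vector):
        """Return the weight of index term k_i in a t-dimensional vector,
        i.e. g_i(d_j) = w_ij (i is 1-based, as in the text)."""
        return vector[i - 1]

    print(g(1, d_j))                     # 0.0: k_1 does not appear in d_j
    print(g(2, d_j))                     # 0.7
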
As we later discuss, the index term weights are usually assumed to be mutually
independent. This means that knowing the weight w_{i,j} associated with the pair
(k_i, d_j) tells us nothing about the weight w_{i+1,j} associated with the pair
(k_{i+1}, d_j). This is clearly a simplification because occurrences of index
terms in a document are not uncorrelated. Consider, for instance, that the terms
computer and network are used to index a given document which covers the area of
computer networks. Frequently, in this document, the appearance of one of these
two words attracts the appearance of the other. Thus, these two words are
correlated and their weights could reflect this correlation. While mutual
independence seems to be a strong simplification, it does simplify the task of
computing index term weights and allows for fast ranking computation.
Furthermore, taking advantage of index term correlations for improving the final
document ranking is not a simple task. In fact, none of the many approaches
proposed in the past has clearly demonstrated that index term correlations are
advantageous (for ranking purposes) with general collections. Therefore, unless
clearly stated otherwise, we assume mutual independence among index terms. In
Chapter 5 we discuss modern retrieval techniques which are based on term
correlations and which have been tested successfully with particular
collections. These successes seem to be slowly shifting the current
understanding towards a more favorable view of the usefulness of term
correlations for information retrieval systems.

The above definitions provide support for discussing the three classic
information retrieval models, namely, the Boolean, the vector, and the
probabilistic models, as we now do.

2.5.2 Boolean Model

The Boolean model is a simple retrieval model based on set theory and Boolean
algebra. Since the concept of a set is quite intuitive, the Boolean model
provides a framework which is easy to grasp by a common user of an IR system.
Furthermore, the queries are specified as Boolean expressions which have precise
semantics. Given its inherent simplicity and neat formalism, the Boolean model
received great attention in past years and was adopted by many of the early
commercial bibliographic systems.

Figure 2.3  The three conjunctive components for the query [q = k_a ∧ (k_b ∨ ¬k_c)].

Unfortunately, the Boolean model suffers from major drawbacks. First, its
retrieval strategy is based on a binary decision criterion (i.e., a document is
predicted to be either relevant or non-relevant) without any notion of a grading
scale, which prevents good retrieval performance. Thus, the Boolean model is in
reality much more a data (instead of information) retrieval model. Second, while
Boolean expressions have precise semantics, frequently it is not simple to
translate an information need into a Boolean expression. In fact, most users
find it difficult and awkward to express their query requests in terms of
Boolean expressions. The Boolean expressions actually formulated by users often
are quite simple (see Chapter 10 for a more thorough discussion on this issue).
Despite these drawbacks, the Boolean model is still the dominant model with
commercial document database systems and provides a good starting point for
those new to the field.

The Boolean model considers that index terms are present or absent in a
document. As a result, the index term weights are assumed to be all binary,
i.e., w_{i,j} ∈ {0, 1}. A query q is composed of index terms linked by three
connectives: not, and, or. Thus, a query is essentially a conventional Boolean
expression which can be represented as a disjunction of conjunctive vectors
(i.e., in disjunctive normal form, DNF). For instance, the query
[q = k_a ∧ (k_b ∨ ¬k_c)] can be written in disjunctive normal form as
[\vec{q}_{dnf} = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)], where each of the components is a
binary weighted vector associated with the tuple (k_a, k_b, k_c). These binary
weighted vectors are called the conjunctive components of \vec{q}_{dnf}.
Figure 2.3 illustrates the three conjunctive components for the query q.
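
The conjunctive components can also be obtained mechanically, by enumerating every
binary assignment to the tuple (k_a, k_b, k_c) and keeping those that satisfy the
query. A minimal Python sketch of this enumeration (illustrative only, not from the
book):

    # Illustrative sketch: enumerate the conjunctive components of
    # q = k_a AND (k_b OR NOT k_c) by testing every binary assignment
    # to the tuple (k_a, k_b, k_c).
    from itertools import product

    def satisfies_q(ka, kb, kc):
        return ka and (kb or not kc)

    components = [bits for bits in product((1, 0), repeat=3) if satisfies_q(*bits)]
    print(components)                    # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
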
Definition  For the Boolean model, the index term weight variables are all
binary, i.e., w_{i,j} ∈ {0, 1}. A query q is a conventional Boolean expression.
Let \vec{q}_{dnf} be the disjunctive normal form for the query q. Further, let
\vec{q}_{cc} be any of the conjunctive components of \vec{q}_{dnf}. The
similarity of a document d_j to the query q is defined as

    sim(d_j, q) = \begin{cases}
        1 & \text{if } \exists \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf})
            \wedge (\forall k_i, g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\
        0 & \text{otherwise}
    \end{cases}

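Read operationally, the definition says that sim(d_j, q) equals 1 exactly when the
binary term vector of d_j coincides with one of the conjunctive components of
\vec{q}_{dnf}. A minimal sketch, reusing the three-term query above (illustrative
only; the helper name sim_boolean is invented):

    # Illustrative sketch of the Boolean similarity: sim(d_j, q) = 1 exactly
    # when the binary term vector of d_j equals one of the conjunctive
    # components of q_dnf (terms ordered as (k_a, k_b, k_c)).
    q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]    # DNF of q = k_a AND (k_b OR NOT k_c)

    def sim_boolean(d_j, dnf_components):
        return 1 if tuple(d_j) in set(dnf_components) else 0

    print(sim_boolean((1, 1, 0), q_dnf))         # 1: d_j matches the component (1, 1, 0)
    print(sim_boolean((0, 1, 0), q_dnf))         # 0: k_a is absent, so d_j is non-relevant
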
If sim(d_j, q) = 1 then the Boolean model predicts that the document d_j is
relevant to the query q (it might not be). Otherwise, the prediction is that the
document is not relevant.

The Boolean model predicts that each document is either relevant or
non-relevant. There is no notion of a partial match to the query conditions. For
instance, let d_j be a document for which \vec{d}_j = (0, 1, 0). Document d_j
includes the index term k_b but is considered non-relevant to the query
[q = k_a ∧ (k_b ∨ ¬k_c)].

The main advantages of the Boolean model are the clean formalism behind the
model and its simplicity. The main disadvantages are that exact matching may
lead to retrieval of too few or too many documents (see Chapter 10). Today, it
is well known that index term weighting can lead to a substantial improvement in
retrieval performance. Index term weighting brings us to the vector model.

2.5.3 Vector Model

The vector model [695] recognizes that the use of binary weights is too limiting
and proposes a framework in which partial matching is possible. This is
accomplished by assigning non-binary weights to index terms in queries and in
documents. These term weights are ultimately used to compute the degree of
similarity between each document stored in the system and the user query. By
sorting the retrieved documents in decreasing order of this degree of
similarity, the vector model takes into consideration documents which match the
query terms only partially. The main resultant effect is that the ranked
document answer set is a lot more precise (in the sense that it better matches
the user information need) than the document answer set retrieved by the Boolean
model.

Definition  For the vector model, the weight w_{i,j} associated with a pair
(k_i, d_j) is positive and non-binary. Further, the index terms in the query are
also weighted. Let w_{i,q} be the weight associated with the pair [k_i, q],
where w_{i,q} ≥ 0. Then, the query vector \vec{q} is defined as
\vec{q} = (w_{1,q}, w_{2,q}, ..., w_{t,q}), where t is the total number of index
terms in the system. As before, the vector for a document d_j is represented by
\vec{d}_j = (w_{1,j}, w_{2,j}, ..., w_{t,j}).

Therefore, a document d_j and a user query q are represented as t-dimensional
vectors as shown in Figure 2.4. The vector model proposes to evaluate the degree
of similarity of the document d_j with regard to the query q as the correlation
between the vectors \vec{d}_j and \vec{q}. This correlation can be quantified,
for instance, by the cosine of the angle between these two vectors. That is,

    sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|}
                = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}
                       {\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}

Figure 2.4  The cosine of θ is adopted as sim(d_j, q).

where |\vec{d}_j| and |\vec{q}| are the norms of the document and query vectors.
The factor |\vec{q}| does not affect the ranking (i.e., the ordering of the
documents) because it is the same for all documents. The factor |\vec{d}_j|
provides a normalization in the space of the documents.
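
A minimal sketch of this cosine computation and of the resulting ranking
(illustrative only; the document and query weights below are invented):

    # Illustrative sketch: cosine of the angle between a document weight vector
    # and a query weight vector, then a ranking of three invented documents.
    import math

    def cosine_sim(d, q):
        dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
        norm_d = math.sqrt(sum(w * w for w in d))
        norm_q = math.sqrt(sum(w * w for w in q))
        if norm_d == 0.0 or norm_q == 0.0:
            return 0.0                           # an empty vector matches nothing
        return dot / (norm_d * norm_q)

    docs = {'d1': [0.8, 0.0, 0.4], 'd2': [0.0, 0.9, 0.1], 'd3': [0.3, 0.3, 0.3]}
    query = [1.0, 0.5, 0.0]

    ranking = sorted(docs, key=lambda name: cosine_sim(docs[name], query), reverse=True)
    print(ranking)                               # document names in decreasing similarity
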
Since w_{i,j} ≥ 0 and w_{i,q} ≥ 0, sim(q, d_j) varies from 0 to +1. Thus,
instead of attempting to predict whether a document is relevant or not, the
vector model ranks the documents according to their degree of similarity to the
query. A document might be retrieved even if it matches the query only
partially. For instance, one can establish a threshold on sim(d_j, q) and
retrieve the documents with a degree of similarity above that threshold. But to
compute rankings we need first to specify how index term weights are obtained.

Index term weights can be calculated in many different ways. The work by Salton
and McGill [698] reviews various term-weighting techniques. Here, we do not
discuss them in detail. Instead, we concentrate on elucidating the main idea
behind the most effective term-weighting techniques. This idea is related to the
basic principles which support clustering techniques, as follows.

Given a collection C of objects and a vague description of a set A, the goal of
a simple clustering algorithm might be to separate the collection C of objects
into two sets: a first one which is composed of objects related to the set A and
a second one which is composed of objects not related to the set A. Vague
description here means that we do not have complete information for deciding
precisely which objects are and which are not in the set A. For instance, one
might be looking for a set A of cars which have a price comparable to that of a
Lexus 400. Since it is not clear what the term comparable means exactly, there
is not a precise (and unique) description of the set A. More sophisticated
clustering algorithms might attempt to separate the objects of a collection into
various clusters (or classes) according to their properties. For instance,
patients of a doctor specializing in cancer could be classified into five
classes: terminal, advanced, metastasis, diagnosed, and healthy. Again, the
possible class descriptions might be imprecise (and not unique) and the problem
is one of deciding to which of these classes a new patient should be assigned.
In what follows, however, we only discuss the simpler version of the clustering
problem (i.e., the one which considers only two classes) because all that is
required is a decision on which documents are predicted to be relevant and which
ones are predicted to be not relevant (with regard to a given user query).

To view the IR problem as one of clustering, we refer to the early work of
Salton. We think of the documents as a collection C of objects and think of the
user query as a (vague) specification of a set A of objects. In this scenario,
the IR problem can be reduced to the problem of determining which documents are
in the set A and which ones are not (i.e., the IR problem can be viewed as a
clustering problem). In a clustering problem, two main issues have to be
resolved. First, one needs to determine what are the features which better
describe the objects in the set A. Second, one needs to determine what are the
features which better distinguish the objects in the set A from the remaining
objects in the collection C. The first set of features provides for
quantification of intra-cluster similarity, while the second set of features
provides for quantification of inter-cluster dissimilarity. The most successful
clustering algorithms try to balance these two effects.

In the vector model, intra-clustering similarity is quantified by measuring the
raw frequency of a term k_i inside a document d_j. Such term frequency is
usually referred to as the tf factor and provides one measure of how well that
term describes the document contents (i.e., intra-document characterization).
Furthermore, inter-cluster dissimilarity is quantified by measuring the inverse
of the frequency of a term k_i among the documents in the collection. This
factor is usually referred to as the inverse document frequency or the idf
factor. The motivation for usage of an idf factor is that terms which appear in
many documents are not very useful for distinguishing a relevant document from a
non-relevant one. As with good clustering algorithms, the most effective
term-weighting schemes for IR try to balance these two effects.

Definition  Let N be the total number of documents in the system and n_i be the
number of documents in which the index term k_i appears. Let freq_{i,j} be the
raw frequency of term k_i in the document d_j (i.e., the number of times the
term k_i is mentioned in the text of the document d_j). Then, the normalized
frequency f_{i,j} of term k_i in document d_j is given by

    f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}    (2.1)

where the maximum is computed over all terms which are mentioned in the text of
the document d_j. If the term k_i does not appear in the document d_j then
f_{i,j} = 0. Further, let idf_i, the inverse document frequency for k_i, be
given by

    idf_i = \log \frac{N}{n_i}    (2.2)

The best known term-weighting schemes use weights which are given by

    w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}    (2.3)
or by a variation of this formula. Such term-weighting strategies are called
tf-idf schemes.

Several variations of the above expression for the weight w_{i,j} are described
in an interesting paper by Salton and Buckley which appeared in 1988 [696].
However, in general, the above expression should provide a good weighting scheme
for many collections.
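
A small sketch of equations 2.1, 2.2, and 2.3 (illustrative only; the frequencies
and document counts are invented, and the natural logarithm is used since no base
is fixed here):

    # Illustrative sketch of equations 2.1-2.3: normalized term frequency,
    # inverse document frequency, and the tf-idf weight w_ij = f_ij * log(N / n_i).
    import math

    def tf_idf_weights(freq_in_doc, doc_freq, N):
        """freq_in_doc: raw frequencies freq_ij of the terms of document d_j.
        doc_freq: n_i for each term k_i.  N: number of documents in the collection."""
        max_freq = max(freq_in_doc.values())          # max_l freq_lj
        weights = {}
        for term, freq in freq_in_doc.items():
            f_ij = freq / max_freq                    # equation 2.1
            idf_i = math.log(N / doc_freq[term])      # equation 2.2
            weights[term] = f_ij * idf_i              # equation 2.3
        return weights

    # Invented counts: 1000 documents; 'network' is rarer than 'computer'.
    print(tf_idf_weights({'computer': 4, 'network': 2},
                         {'computer': 300, 'network': 20}, N=1000))
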
For the query term weights, Salton and Buckley suggest

    w_{i,q} = \left( 0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}} \right)
              \times \log \frac{N}{n_i}    (2.4)

where freq_{i,q} is the raw frequency of the term k_i in the text of the
information request q.

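A matching sketch of equation 2.4 for a single query term (again illustrative,
with invented counts):

    # Illustrative sketch of equation 2.4: the suggested query term weight
    # w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i).
    import math

    def query_weight(freq_iq, max_freq_q, n_i, N):
        return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log(N / n_i)

    # A query that mentions 'network' once, with n_i = 20 and N = 1000 as above:
    print(query_weight(freq_iq=1, max_freq_q=1, n_i=20, N=1000))   # 1.0 * log(50)
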
The main advantages of the vector model are: (1) its term-weighting scheme
improves retrieval performance; (2) its partial matching strategy allows
retrieval of documents that approximate the query conditions; and (3) its cosine
ranking formula sorts the documents according to their degree of similarity to
the query. Theoretically, the vector model has the disadvantage that index terms
are assumed to be mutually independent (equation 2.3 does not account for index
term dependencies). However, in practice, consideration of term dependencies
might be a disadvantage. Due to the locality of many term dependencies, their
indiscriminate application to all the documents in the collection might in fact
hurt the overall performance.

Despite its simplicity, the vector model is a resilient ranking strategy with
general collections. It yields ranked answer sets which are difficult to improve
upon without query expansion or relevance feedback (see Chapter 5) within the
framework of the vector model. A large variety of alternative ranking methods
have been compared to the vector model but the consensus seems to be that, in
general, the vector model is either superior or almost as good as the known
alternatives. Furthermore, it is simple and fast. For these reasons, the vector
model is a popular retrieval model nowadays.

2.5.4 Probabilistic Model

In this section, we describe the classic probabilistic model introduced in 1976
by Robertson and Sparck Jones [677] which later became known as the binary
independence retrieval (BIR) model. Our discussion is intentionally brief and
focuses mainly on highlighting the key features of the model. With this purpose
in mind, we do not detain ourselves in subtleties regarding the binary
independence assumption for the model. The section on bibliographic discussion
points to references which cover these details.

The probabilistic model attempts to capture the IR problem within a
probabilistic framework. The fundamental idea is as follows. Given a user query,
there is a set of documents which contains exactly the relevant documents and
no other. Let us refer to this set of documents as the ideal answer set. Given
the description of this ideal answer set, we would have no problems in
retrieving its documents. Thus, we can think of the querying process as a
process of specifying the properties of an ideal answer set (which is analogous
to interpreting the IR problem as a problem of clustering). The problem is that
we do not know exactly what these properties are. All we know is that there are
index terms whose semantics should be used to characterize these properties.
Since these properties are not known at query time, an effort has to be made at
initially guessing what they could be. This initial guess allows us to generate
a preliminary probabilistic description of the ideal answer set which is used to
retrieve a first set of documents. An interaction with the user is then
initiated with the purpose of improving the probabilistic description of the
ideal answer set. Such interaction could proceed as follows.

The user takes a look at the retrieved documents and decides which ones are
relevant and which ones are not (in truth, only the first top documents need to
be examined). The system then uses this information to refine the description of
the ideal answer set. By repeating this process many times, it is expected that
such a description will evolve and become closer to the real description of the
ideal answer set. Thus, one should always have in mind the need to guess at the
beginning the description of the ideal answer set. Furthermore, a conscious
effort is made to model this description in probabilistic terms.

The probabilistic model is based on the following fundamental assumption.

Assumption (Probabilistic Principle)  Given a user query q and a document d_j in
the collection, the probabilistic model tries to estimate the probability that
the user will find the document d_j interesting (i.e., relevant). The model
assumes that this probability of relevance depends on the query and the document
representations only. Further, the model assumes that there is a subset of all
documents which the user prefers as the answer set for the query q. Such an
ideal answer set is labeled R and should maximize the overall probability of
relevance to the user. Documents in the set R are predicted to be relevant to
the query. Documents not in this set are predicted to be non-relevant.

This assumption is quite troublesome because it does not state explicitly how to
compute the probabilities of relevance. In fact, not even the sample space which
is to be used for defining such probabilities is given.

Given a query q, the probabilistic model assigns to each document d_j, as a
measure of its similarity to the query, the ratio
P(d_j relevant-to q)/P(d_j non-relevant-to q) which computes the odds of the
document d_j being relevant to the query q. Taking the odds of relevance as the
rank minimizes the probability of an erroneous judgement [282, 785].

Definition  For the probabilistic model, the index term weight variables are all
binary, i.e., w_{i,j} ∈ {0, 1}, w_{i,q} ∈ {0, 1}. A query q is a subset of index
terms. Let R be the set of documents known (or initially guessed) to be
relevant. Let \bar{R} be the complement of R (i.e., the set of non-relevant
documents). Let P(R|\vec{d}_j) be
the probability that the document d_j is relevant to the query q and
P(\bar{R}|\vec{d}_j) be the probability that d_j is non-relevant to q. The
similarity sim(d_j, q) of the document d_j to the query q is defined as the
ratio

    sim(d_j, q) = \frac{P(R|\vec{d}_j)}{P(\bar{R}|\vec{d}_j)}

Using Bayes' rule,

    sim(d_j, q) = \frac{P(\vec{d}_j|R) \times P(R)}{P(\vec{d}_j|\bar{R}) \times P(\bar{R})}

P(\vec{d}_j|R) stands for the probability of randomly selecting the document d_j
from the set R of relevant documents. Further, P(R) stands for the probability
that a document randomly selected from the entire collection is relevant. The
meanings attached to P(\vec{d}_j|\bar{R}) and P(\bar{R}) are analogous and
complementary.

Since P(R) and P(\bar{R}) are the same for all the documents in the collection,
we write

    sim(d_j, q) \sim \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}

Assuming independence of index terms,

    sim(d_j, q) \sim \frac{\left( \prod_{g_i(\vec{d}_j)=1} P(k_i|R) \right) \times
                           \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|R) \right)}
                          {\left( \prod_{g_i(\vec{d}_j)=1} P(k_i|\bar{R}) \right) \times
                           \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|\bar{R}) \right)}

P(k_i|R) stands for the probability that the index term k_i is present in a
document randomly selected from the set R. P(\bar{k}_i|R) stands for the
probability that the index term k_i is not present in a document randomly
selected from the set R. The probabilities associated with the set \bar{R} have
meanings which are analogous to the ones just described.

Taking logarithms, recalling that P(k_i|R) + P(\bar{k}_i|R) = 1, and ignoring
factors which are constant for all documents in the context of the same query,
we can finally write

    sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times
                     \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)}
                          + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)

which is a key expression for ranking computation in the probabilistic model.
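
A minimal sketch of this ranking expression, assuming binary document and query
weights and taking the probabilities P(k_i|R) and P(k_i|\bar{R}) as given
(illustrative only; the term names and probabilities are invented):

    # Illustrative sketch of the probabilistic ranking formula, with binary
    # document and query weights: only terms present in both the query and the
    # document contribute, each adding
    #   log(P(k_i|R) / (1 - P(k_i|R))) + log((1 - P(k_i|notR)) / P(k_i|notR)).
    import math

    def sim_probabilistic(doc_terms, query_terms, p_rel, p_nonrel):
        """doc_terms, query_terms: sets of index terms.
        p_rel[k] = P(k|R), p_nonrel[k] = P(k|R-bar), both strictly between 0 and 1."""
        score = 0.0
        for k in query_terms & doc_terms:
            score += (math.log(p_rel[k] / (1.0 - p_rel[k]))
                      + math.log((1.0 - p_nonrel[k]) / p_nonrel[k]))
        return score

    # Invented probabilities for two terms:
    p_rel = {'computer': 0.5, 'network': 0.5}
    p_nonrel = {'computer': 0.3, 'network': 0.02}
    print(sim_probabilistic({'computer', 'network'}, {'network'}, p_rel, p_nonrel))
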
Since we do not know the set R at the beginning, it is necessary to devise a
method for initially computing the probabilities P(k_i|R) and P(k_i|\bar{R}).
There are many alternatives for such computation. We discuss a couple of them
below.

In the very beginning (i.e., immediately after the query specification), there
are no retrieved documents. Thus, one has to make simplifying assumptions such
as: (a) assume that P(k_i|R) is constant for all index terms k_i (typically,
equal to 0.5) and (b) assume that the distribution of index terms among the
non-relevant documents can be approximated by the distribution of index terms
among all the documents in the collection. These two assumptions yield

    P(k_i|R) = 0.5

    P(k_i|\bar{R}) = \frac{n_i}{N}

where, as already defined, n_i is the number of documents which contain the
index term k_i and N is the total number of documents in the collection.
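
A small sketch of these initial estimates (illustrative only, with invented
document frequencies). Note that, once they are plugged into the ranking
expression above, each matching term contributes log((N - n_i)/n_i), since the
first logarithm vanishes:

    # Illustrative sketch of the initial estimates, computed before any
    # relevance information is available: P(k_i|R) = 0.5 and P(k_i|notR) = n_i / N.
    # With these values, each query term present in the document contributes
    # log((N - n_i) / n_i) to the score, because log(0.5 / 0.5) = 0.
    def initial_estimates(doc_freq, N):
        p_rel = {k: 0.5 for k in doc_freq}
        p_nonrel = {k: n_i / N for k, n_i in doc_freq.items()}
        return p_rel, p_nonrel

    p_rel, p_nonrel = initial_estimates({'computer': 300, 'network': 20}, N=1000)
    print(p_rel['network'], p_nonrel['network'])       # 0.5 0.02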
