THE KLUWER INTERNATIONAL SERIES IN ENGINEERING
AND COMPUTER SCIENCE
INFORMATION RETRIEVAL:
ALGORITHMS AND HEURISTICS

DAVID A. GROSSMAN
Office of Research and Development

OPHIR FRIEDER
Illinois Institute of Technology

Springer Science+Business Media, LLC
Library of Congress Cataloging-in-Publication Data

Grossman, David A., 1965-
Information retrieval : algorithms and heuristics / David A.
Grossman, Ophir Frieder.
p. cm. -- (Kluwer international series in engineering and
computer science ; SECS 461)
Includes bibliographical references and index.
ISBN 978-1-4613-7532-6
ISBN 978-1-4615-5539-1 (eBook)
DOI 10.1007/978-1-4615-5539-1
1. Information storage and retrieval systems. I. Frieder, Ophir.
II. Title. III. Series.
Z667.G76 1998
005.74--dc21
98-30435
CIP

Copyright © 1998 by Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 1998
Softcover reprint of the hardcover 1st edition 1998
Second Printing 2000

All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical,
photocopying, recording, or otherwise, without the prior written permission of the
publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.
Contents

List of Figures
Preface
Acknowledgments

1. INTRODUCTION

2. RETRIEVAL STRATEGIES
2.1 Vector Space Model
2.2 Probabilistic Retrieval Strategies
2.3 Inference Networks
2.4 Extended Boolean Retrieval
2.5 Latent Semantic Indexing
2.6 Neural Networks
2.7 Genetic Algorithms
2.8 Fuzzy Set Retrieval
2.9 Summary
2.10 Exercises

3. RETRIEVAL UTILITIES
3.1 Relevance Feedback
3.2 Clustering
3.3 Passage-based Retrieval
3.4 N-grams
3.5 Regression Analysis
3.6 Thesauri
3.7 Semantic Networks
3.8 Parsing
3.9 Summary
3.10 Exercises

4. EFFICIENCY ISSUES PERTAINING TO SEQUENTIAL IR SYSTEMS
4.1 Inverted Index
4.2 Query Processing
4.3 Signature Files
4.4 Summary
4.5 Exercises

5. INTEGRATING STRUCTURED DATA AND TEXT
5.1 Review of the Relational Model
5.2 A Historical Progression
5.3 Information Retrieval Functionality Using the Relational Model
5.4 Boolean Retrieval
5.5 Proximity Searches
5.6 Computing Relevance Using Unchanged SQL
5.7 Relevance Feedback in the Relational Model
5.8 Summary
5.9 Exercises

6. PARALLEL INFORMATION RETRIEVAL SYSTEMS
6.1 Parallel Text Scanning
6.2 Parallel Indexing
6.3 Parallel Implementation of Clustering and Classification
6.4 Summary
6.5 Exercises

7. DISTRIBUTED INFORMATION RETRIEVAL
7.1 A Theoretical Model of Distributed IR
7.2 Replication in Distributed IR Systems
7.3 Implementation Issues of a Distributed IR System
7.4 Improving Performance of Web-based IR Systems
7.5 Web Search Engines
7.6 Summary
7.7 Exercises

8. THE TEXT RETRIEVAL CONFERENCE (TREC)

9. FUTURE DIRECTIONS

References

Index
List of Figures

1.1 Document Retrieval
1.2 Document Routing
1.3 Result Set: Relevant Retrieved, Relevant, and Retrieved
1.4 Example of Precision and Two Points of Recall
1.5 Typical and Optimal Precision/Recall Graph
2.1 Vector Space Model
2.2 Example: Vector Space Model with a 2 Term Vocabulary
2.3 Example: Inverted Index
2.4 Training Data for Probabilistic Retrieval
2.5 Simple Inference Network
2.6 Document-Term-Query Inference Network
2.7 Inference Network Example
2.8 Neural Network with Feedback
3.1 Relevance Feedback Process
3.2 Document Clustering
3.3 Overlapping vs Non-Overlapping Passages
3.4 Using a Thesaurus to Expand a Query
4.1 Inverted Index
4.2 Signature File
5.1 IR as an Application of a RDBMS
6.1 Partitioning an Inverted Index
7.1 Distributed Document Retrieval
8.1 Sample TREC document
8.2 Sample TREC query
Preface

This book focuses on the technology of information retrieval: a user enters a
query that describes a request for information, and an information retrieval
system responds by identifying documents that are relevant to the query. Of
course, the computer cannot understand documents and queries the way a
human can, so methods other than full natural language understanding of the
text must be employed. We cover the various techniques used to quickly find
relevant documents, with the necessary introduction, technical details, and
examples to illustrate the approach.
The advent of the World Wide Web has increased the importance of
information retrieval. Instead of going to the local library to find something, people
search the Web. Thus, the relative number of manual versus computer-assisted
searches for information has shifted dramatically in the past few years. This
has increased the need for automated information retrieval for extremely large
document collections.
It is estimated that the Web now contains more than twenty million different
content areas, presented on more than 320 million web pages, and one million
web servers-and it is doubling every nine months [Kahle, 1998, Lawrence and
Giles, 1998]. This book describes the techniques that can be used to try to
find the needle that a user is seeking in the enormous haystack that is the Web,
as well as other large information collections. These techniques are found in
numerous research papers, described by many different authors, and presented
with many different examples. This book consolidates the most commonly used
information retrieval algorithms and heuristics.
The problem of finding relevant information is not new. Early systems tried
to classify knowledge into a set of known fixed categories. The first of these was
completed in 1668 by the English philosopher John Wilkins [Subbiondo, 1992].
The problem with this approach is that categorizers commonly do not place
documents into the categories where searchers expect to find them. No matter
what categories a user thinks of-they will not match what someone who is
searching will find. For example, users of e-mail systems place mail in folders
or categories-only to spend countless hours trying to find the same documents
because they cannot remember what category they used, or the category they
are sure they used does not contain the relevant document. Effective and
efficient search techniques are needed to help users quickly find the information
they are looking for.
Another approach is to try to understand the content of the documents.
Ideally, documents would be loaded into the computer, the computer would
read and understand them, and then users would simply ask questions and be
given direct answers. The field of Natural Language Processing works on this
approach-but suffice it to say that this approach is extremely difficult, and we
currently lack systems that can build good knowledge structures for even the
simplest of texts.
If we rule out hand-categorization and natural language processing, that
leaves us with Information Retrieval (IR). With this approach, no real attempt
is made to have the computer understand the document-instead, techniques
that use pattern matching, statistical sampling, machine learning, and
probability theory are used to guess which documents are relevant. These systems are
not perfect, but they offer users the opportunity to sift through large volumes
of text.
The key problem is that simply matching on query words is not sufficient to
satisfy user requests. A query about "Abraham Lincoln" can bring back
documents about "a man named Abraham went to the dealership to purchase a
Lincoln." The most obvious techniques have not been shown to work well. Many
researchers suspect that an automated thesaurus will dramatically improve the
accuracy of document retrieval-to date, little success has been demonstrated
with such a technique. For every good word you add, you end up adding
something that degrades the results. For our same query, a thesaurus might
add the word "president" to the query, but then the user might be treated to
documents describing presidents of large companies that have nothing to do
with the query.
Our objective is to describe the techniques or algorithms and heuristics used
to find documents that are relevant to the user request and to find them quickly.
Academic research since the late 1950s has focused on this problem. To really
attack the problem requires expertise in the fields of library science,
computational linguistics, natural language processing, probability theory, and of course,
computer science. We have tried to give the necessary introduction to each
topic, and to give the details and examples needed to understand work that
has been done on each topic. The field is moving quickly-the advent of the
Web has brought new focus to the problem and many new algorithms have
been developed. When the first Text REtrieval Conference (TREC) met in
1992 to evaluate text retrieval, there were essentially four retrieval strategies
to find relevant documents: vector space retrieval, extended boolean retrieval,
probabilistic retrieval, and inference networks. Only six years later, we describe
eight strategies in Chapter 2. Many of these are recent, and we try to give solid
coverage to each one.
We are still at the tip of the iceberg in terms of really succeeding at solving
the information retrieval problem. Measurements from TREC that include a
standard set of queries run against a two gigabyte document collection with
around half a million documents show that just about any technique or
combination of techniques results in an answer set that contains about twenty to
thirty percent of the relevant documents. If, as we suspect, most users look at the top
ten documents, systems are fortunate to get about three to four of these ten
documents correct. Users then waste time looking at more non-relevant
documents than relevant documents. Other studies have shown that when
information retrieval systems are evaluated, they are found to miss numerous relevant
documents [Blair and Maron, 1985]. Moreover, users have become complacent
in what they expect of information retrieval systems [Gordon, 1997]. It is this
relatively poor performance that drives more research in the area of
effectiveness. It is our goal to describe the algorithms and heuristics that were tried so
that researchers who wish to improve on past performance do not unnecessarily
repeat work that has already been completed.
Finding relevant documents is not enough. This book describes the current
techniques used to find relevant documents quickly. Some excellent information
retrieval texts are currently available on the market. Many of them, however,
do not treat both aspects of information retrieval, namely, retrieval effectiveness
in terms of accuracy and retrieval efficiency in terms of resource utilization. We
provide an algorithm-directed presentation for both orientations of interest. We
survey recent algorithms on effectiveness and efficiency. Certainly, new papers
are constantly being published, but we have focused on key algorithms that
must be understood before new papers can be digested.
Detailed examples are provided wherever possible since many books and
papers do not provide them. We assume the reader has a reasonable
understanding of data structures and algorithms, but we do not assume much
mathematical background. Whenever appropriate, a brief review of the needed
background concepts is provided prior to describing the information retrieval
material.
A recently published collection, Readings in Information Retrieval, covers
several key papers. Unfortunately, this collection excludes papers that
focus predominantly on efficiency. Furthermore, this collection is just that, a
collection of seminal papers. It is not organized as a text, and is thus difficult
to use as a class text. It is, however, an excellent supplement to
this book.
This book is primarily intended as a textbook for an undergraduate or
graduate level course in Information Retrieval. It has been used in a graduate course,
and we have incorporated student feedback in developing a set of overheads that
assist instruction using our text. The set of PowerPoint slides, including speaker
notes, is available at http://www.csam.iit.edu/~ophir/slides.
Additionally, practitioners who build IR systems or applications that use IR
systems will find the information in this book useful when deciding on which
retrieval strategies and utilities to deploy in production applications. Note that
the focus of the book is on algorithms, not on commercial products, but, to our
knowledge, the basic strategies used by the majority of commercial products
are described in the book. Our intent is that a practitioner may find that a
commercial product is using a given strategy and can then use this book as a
reference for more information on what is known about the techniques used by
the product. Other information that is of use to practitioners is found in our
chapter that focuses on the use of unchanged SQL to integrate structured data
and text.
Finally, we note that the information retrieval field changes daily. For the
most up-to-date coverage of the field, the best sources are publications such as
the ACM Transactions on Information Systems, the Journal of the American
Society for Information Science, Information Processing and Management, and
Information Retrieval. Other relevant papers are found in the various
information retrieval conferences such as ACM SIGIR and NIST TREC, and to a
lesser degree ACM CIKM and the annual NLP conference.
Any comments, suggestions, or corrections regarding either the text or the
overheads are greatly welcomed and would be sincerely appreciated.
Hope to hear from you soon.
Ophir Frieder
IITRI Professor of Computer Science
Illinois Institute of Technology
ophir@csam.iit.edu

David Grossman
Staff Scientist
Office of Research and Development
dagr@jnpcs.com
From David:
For my parents:
Roberta "Mickey" and Joshua Grossman.

From Ophir:
In honor of my grandmother:
Hadasa Bogler
and in memory of my grandparents:
Yehoshua Bogler and
Ruzena and Armin Frieder
Acknowledgments

We are indebted to many, many people, and we shudder at the thought of
leaving out someone whose contributions have improved the quality of the book.
Just in case we did, however, our sincere apologies for the oversight and a
heartfelt thanks for the contributions.
We are deeply indebted to Paul Kantor and Don Kraft for their insightful
comments during some of the early stages of the development of the book. We
also greatly thank Steve Robertson, K.L. Kwok, and Jamie Callan for critically
reviewing the sections describing their efforts, and Warren Greiff, whose extreme
patience taught and retaught us the details of inference networks.
General critical feedback came from multiple individuals. Readability for
classroom use was enhanced by feedback from the students participating
in Ophir's CSE6000 class, who read each draft copy multiple times. Although
comments from all students are greatly appreciated, Salim Mounir Alaoui truly
went beyond the call of duty, and his assistance is acknowledged by name. Also,
Matt Mahoney is recognized and thanked for his initial draft of the course
presentation overheads and Nazli Goharian for her technical review and
enhancements of them. David Roberts provided several specific suggestions that
dramatically improved the introduction and focus of the book. John Juilfs
spent countless hours working out all of the examples and checking our math
(we take full responsibility for any remaining errors, but he saved us from lots
of them). David Holmes has been an instrumental part of our TREC group,
and his willingness to spend plenty of time and resources running experiments
for us was critical to the development of the integration of structured data and
text chapter. Special thanks go to Bill Bercik and Prachya Chalermwat for
their work on all of the figures used in the text. Abdur Chowdhury provided
valuable assistance with the content of the book and was a great help during
the construction of the index.
We are indebted to Steve Kirsch at Infoseek and Graham Spencer, Doug
Cutting, and Jack Xu at Excite for their assistance with the material found in
the section describing implementation details of web search engines.
We thank many people who reviewed repeated iterations of the book,
including Mary Catherine McCabe, Jean-Michel Pomarede, Muralidhar Medidi,
and Amit Jain, whom we thank for their technical comments, and Deborah
Benigni, Wendy Grossman, and Karen Sweeney for the great deal of copy editing
that they provided. We also thank all others who read the book and provided
excellent feedback, and our colleagues at work who provided moral support.
Finally, we would not have ever finished the book without constant moral
support from our families and close friends. Special thanks go to Sandy
Halin-Adams and Nazli Goharian, who made countless sacrifices of their time and gave
us constant encouragement and support. Parents, siblings, in-laws,
grandparents, aunts, uncles, and close friends all helped with comments like "So, how
is the book going?" or "Is the book out yet?". In spite of knowing that they
were actually trying to cheer us on to finish, nothing was a better motivator to
push ahead than such remarks.
From both of us to all of you, THANKS.
1 INTRODUCTION

Since nearly the beginning of human civilization, human beings have focused on
written communication. From cave drawings to scroll writings, from printing
presses to electronic libraries-communicating has been of primary concern to
our existence. Today, with the emergence of digital libraries and electronic
information exchange, there is a clear need for improved techniques to organize
these large quantities of information. Applied and theoretical research and
development in the areas of information authorship, processing, storage, and
retrieval is of interest to all sectors of the community. In this book, we overview
recent research efforts that focus on the electronic searching and retrieving of
documents.
Our focus is strictly on the retrieval of information in response to user
queries. That is, we discuss algorithms and approaches for ad hoc information
retrieval, or simply, information retrieval. Figure 1.1 illustrates the basic
process of ad hoc information retrieval. A static or relatively static document
collection is indexed prior to any user query. A query is issued, and a set of
documents that are deemed relevant to the query are ranked based on their
computed similarity to the query and presented to the user. Numerous techniques
exist to identify how these documents are ranked, and that is a key focus of this
book (effectiveness). Other techniques also exist to rank documents quickly,
and these are also discussed (efficiency).
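To make this process concrete, a minimal sketch in Python follows; the term-overlap score and the two sample documents are illustrative stand-ins of our own, not an algorithm from this book (the retrieval strategies of Chapter 2 and the index structures of Chapter 4 replace both naive choices in practice):

    # A toy ad hoc retrieval system: index once, then rank per query.
    def index_collection(docs):
        # Indexing happens before any query arrives.
        return {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

    def search(index, query):
        # Score every document by the number of query terms it shares,
        # then return the ranked answer set.
        q_terms = set(query.lower().split())
        scored = [(len(q_terms & terms), doc_id) for doc_id, terms in index.items()]
        return sorted(scored, reverse=True)

    docs = {"d1": "Abraham went to the dealership to purchase a Lincoln",
            "d2": "Abraham Lincoln was the sixteenth president"}
    print(search(index_collection(docs), "Abraham Lincoln"))
    # [(2, 'd2'), (2, 'd1')] -- a tie: simple word matching cannot rank
    # the document about the president above the one about the car.

The tie echoes the "Abraham Lincoln" example from the Preface: matching words alone does not capture relevance.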
Figure 1.1. Document Retrieval
[Figure: an ad hoc query is run against a static document collection to produce a ranked answer set.]

Information Retrieval (IR) is devoted to finding "relevant" documents, not
finding simple matches to patterns. Yet, often when information retrieval
systems are evaluated, they are found to miss numerous relevant documents [Blair
and Maron, 1985]. Moreover, users have become complacent in their expectation
of accuracy of information retrieval systems [Gordon, 1997].
A related problem is that of document routing or filtering. Here, the queries
are static and the document collection constantly changes. An environment
where corporate e-mail is routed based on predefined queries to different parts
of the organization (i.e., e-mail about sales is routed to the sales department,
marketing e-mail goes to marketing, etc.) is an example of an application of
document routing. Figure 1.2 illustrates document routing. Document routing
algorithms and approaches also widely appear in the literature, but are not
addressed in this book.
In Figure 1.3, we illustrate the critical document categories that correspond
to any issued query. Namely, in the collection there are documents which
are retrieved, and there are those documents that are relevant. In a perfect
system, these two sets would be equivalent-we would only retrieve relevant
documents. In reality, systems retrieve many non-relevant documents.

Figure 1.2. Document Routing
[Figure: incoming documents flow into a document routing system, which matches them against a set of predetermined queries or user profiles and routes each document to the appropriate user (User 1, User 2, User 3, User 4).]

To measure effectiveness, two ratios are used: precision and recall. Precision is
the ratio of the number of relevant documents retrieved to the total number
retrieved. Precision provides an indication of the quality of the answer set.
However, this does not consider the total number of relevant documents. A
system might have good precision by retrieving ten documents and finding that
nine are relevant (a 0.9 precision), but the total number of relevant documents
also matters. If there were only nine relevant documents, the system would be
a huge success-however, if millions of documents were relevant and desired,
this would not be a good result set.
Recall considers the total number of relevant documents-it is the ratio of the
number of relevant documents retrieved to the total number of documents in the
collection that are believed to be relevant. When the total number of relevant
documents in the collection is unknown, an approximation of the number is
obtained. A good survey of effectiveness measures as well as a brief overview
of information retrieval is found in [Kantor, 1994].
Figure 1.3. Result Set: Relevant Retrieved, Relevant, and Retrieved
[Figure: within the set of all documents, the set of relevant documents and the set of retrieved documents overlap; their intersection is the set of relevant retrieved documents.]

Precision = Relevant Retrieved / Retrieved
Recall = Relevant Retrieved / Relevant
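Both ratios translate directly into a few lines of code. In the sketch below (ours; the document identifiers are chosen to match the example of Figure 1.4, in which ten documents are retrieved and only d2 and d5 are relevant), each measure is a simple set computation:

    def precision(retrieved, relevant):
        # Fraction of the retrieved documents that are relevant.
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        # Fraction of the relevant documents that were retrieved.
        return len(retrieved & relevant) / len(relevant)

    retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
    relevant = {"d2", "d5"}
    print(precision(retrieved, relevant))   # 2/10 = 0.2
    print(recall(retrieved, relevant))      # 2/2  = 1.0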
Precision may be computed at various points of recall. Consider the example
in Figure 1.4. Ten documents are retrieved, but only two documents (documents
two and five) are relevant to the query. Consider the document retrieval
performance represented by the sloped line. Fifty percent recall (finding one
of the two relevant documents) results when two documents are retrieved. At
this point, precision is fifty percent, as we have retrieved two documents and
one of them is relevant. To reach one hundred percent recall, we must continue
to retrieve documents until both relevant documents are retrieved. For our
example, it is necessary to retrieve five documents to find both relevant
documents. At this point, precision is forty percent because two out of five retrieved
documents are relevant. Hence, for any desired level of recall, it is possible to
compute precision. Graphing precision at various points of recall is referred to
as a precision/recall curve.
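These recall points can be generated mechanically by walking down the ranked answer set and recording precision each time a relevant document is found. The following small sketch of ours reproduces the numbers from Figure 1.4:

    def precision_at_recall_points(ranked, relevant):
        # Emit one (recall, precision) pair per relevant document retrieved.
        points, found = [], 0
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                found += 1
                points.append((found / len(relevant), found / rank))
        return points

    ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
    relevant = {"d2", "d5"}
    print(precision_at_recall_points(ranked, relevant))
    # [(0.5, 0.5), (1.0, 0.4)]: 50% precision at 50% recall, 40% at 100%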
Figure 1.4. Example of Precision and Two Points of Recall
[Figure: a precision/recall plot for an answer set of ten documents, d1 through d10, listed in order of similarity coefficient; only two documents (d2, d5) are relevant. The plotted points are (0.5, 0.5) and (1.0, 0.4): precision at 50% recall = 1/2 = 50%, and precision at 100% recall = 2/5 = 40%.]

A typical precision/recall curve is shown in Figure 1.5. Typically, as higher
recall is desired, more documents must be retrieved to obtain the desired level of
recall. In a perfect system, only relevant documents are retrieved. This means
that at any level of recall, precision would be 1.0. The optimal precision/recall
line is shown in Figure 1.5.
Average precision refers to an average of precision at various points of recall.
Many systems today, when run on a standard document collection, report an
average precision of between 0.2 and 0.3. Certainly, there is some element of
fuzziness here because relevance is not a clearly defined concept, but it is clear
that there is significant room for improvement in the area of effectiveness.
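Under one common convention (averaging the precision observed at the rank of each relevant document retrieved; this choice of convention is our assumption, since variants exist), average precision for the Figure 1.4 example follows directly from the previous sketch:

    def average_precision(ranked, relevant):
        # Average the precision values observed where relevant documents occur.
        precisions, found = [], 0
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                found += 1
                precisions.append(found / rank)
        return sum(precisions) / len(relevant)

    ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
    relevant = {"d2", "d5"}
    print(average_precision(ranked, relevant))   # (0.5 + 0.4) / 2 = 0.45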
Figure 1.5. Typical and Optimal Precision/Recall Graph
[Figure: precision plotted against recall, both from 0 to 1.0; the optimal line holds precision at 1.0 for every level of recall, while the typical curve declines as recall increases.]

Finding relevant documents is not enough. Finding relevant documents
within an acceptable response time is the goal. This book describes the current
strategies used to find relevant documents quickly. The quest for efficient
and effective information retrieval continues.
We explain each algorithm in detail, and for each topic, include examples
for the most crucial algorithms. We then switch gears into survey mode and
provide references to related and follow-on work. We explain the key aspects
of the algorithms and then provide references for those interested in further
details. A collection of key information retrieval research papers is found in
[Sparck Jones and Willett, 1997].
Recent algorithms designed to search large bodies of information are
discussed throughout this book. Many research publications describe these
algorithms in detail, but they are spread across numerous journals and written
in a variety of different styles. Also, they have differing expectations of their
reader's background. We provide a relatively brief, but sufficiently detailed
overview of the field.
As stated earlier, a sophisticated mathematical background is not required.
Whenever detailed mathematical constructs are used, we provide a quick
refresher of the key points needed to understand the algorithms and detailed
examples.
We believe this book is valuable to a variety of readers. Readers familiar with
the core of computer science and interested in learning more about information
retrieval algorithms should benefit from this text. We provide explanations of
the fundamental problems that exist, and how people have addressed them in
the past.
This book also has value for anyone who currently uses and supports a
Relational Database Management System (RDBMS). Chapter 5 gives detailed
algorithms that treat text retrieval as an application of a RDBMS. This makes
it possible to integrate both structured data and text.
To guide the reader through the key issues in ad hoc information retrieval,
we partitioned our book into separate but inter-linked processing avenues. In
the first section, covered in Chapters 2 and 3, we overview retrieval processing
strategies and utilities. All of these strategies and utilities focus on one and
only one critical issue, namely, the improvement of retrieval accuracy. In
Chapter 2, we describe eight models that were either developed for or adapted to
information retrieval specifically for the purpose of enhancing the evaluation or
ranking of documents retrieved in response to user queries. Chapter 3 describes
utilities that could be applied to enhance any strategy described in Chapter 2.
In Chapter 3, under the "Retrieval Utilities" heading, we focus on
techniques that are applicable to either all or most of the models. Several of the
utilities described are language dependent, e.g., parsing and thesauri; others
are specifically designed to be language independent, namely, N-gram processing.
We note in Chapter 3 that some of the described utilities were proposed as
individual processing strategies. In reality, however, it is the combination of
these techniques that yields the best improvements. An approach to precisely
determine the optimal mix of techniques, the order in which to execute them, and the
underlying models to operate them so as to yield the optimal processing
strategy is still unknown.
After describing models and utilities that address accuracy demands, we
turn our attention towards processing efficiency. In Chapter 4, we describe
various document access schemes. That is, we describe both the constructs
and usage of inverted indices as well as other representation schemes such as
signature files. Each of these access schemes has advantages and disadvantages.
The tradeoffs lie in terms of storage overhead and maintainability versus search
and retrieval processing times. After describing the various access methods, we
overview several compression schemes.
Chapters 2 through 4 cover the basics in terms of traditional information
retrieval models, utilities, and processing strategies. In Chapters 5, 6, and 7,
we focus our attention on special topics in information retrieval. The three
topics addressed, namely data integration, parallel, and distributed
information retrieval systems, were selected based on where the commercial sector is
focusing.
Traditionally, there was a clear separation between structured data,
typically stored and accessed via relational database management systems, and
semi-structured data such as text, typically stored and accessed via
information retrieval systems. Each processing system supported its own data storage
files and access methods. Today, the distinction between structured and
semi-structured data is quickly vanishing. In fact, we no longer are concerned with
just structured and semi-structured data, but often we include unstructured
data, such as images, in the same storage repository. Our focus does not
include image processing, although in the future one will need to do so.
To address the integration of structured and unstructured data, commercial
enterprises such as Oracle, IBM, and Informix have integrated information
retrieval functionality with their traditional relational database engines.
Furthermore, text retrieval companies, such as Verity, have added relational processing
components. In all of these cases, however, additional functionality came at the
expense of requiring additional, separate, processing units. In Chapter 5, we
discuss the issues related to adding processing units and suggest an alternative
method that involves implementing information retrieval processing capability
as an application of relational databases. Using such an approach, the
traditional benefits of relational database processing (i.e., portability, concurrency,
recovery, etc.) are made available without requiring additional software
development. Since all traditional relational database vendors provide parallel
implementations of their database software, implementing an information
retrieval system as a relational database application further provides for a parallel
instantiation of an information retrieval system.
Having recognized the need for a parallel information retrieval capability,
we also overview recent developments in this area. In Chapter 6, we
initially describe the earlier parallel processing efforts in information retrieval.
These approaches predominantly focus on the use of Single Inst