INFORMATION RETRIEVAL
Algorithms and Heuristics
`
`David A. Grossman
`Ophir Frieder
`
`SPRINGER SCIENCE+BUSINESS MEDIA, LLC
`
`Page 1 of 262
`
`GOOGLE EXHIBIT 1010
`
`
`
`
`
`
`THE KLUWER INTERNATIONAL SERIES IN ENGINEERING
`AND COMPUTER SCIENCE
`
`
`
`
`INFORMATION RETRIEVAL:
`ALGORITHMS AND HEURISTICS
`
`DAVID A. GROSSMAN
`Office of Research and Development
`
`OPHIR FRIEDER
`Illinois Institute of Technology
`
`Springer Science+Business Media, LLC
`
`
`
`
Library of Congress Cataloging-in-Publication Data

Grossman, David A., 1965-
  Information retrieval : algorithms and heuristics / David A.
  Grossman, Ophir Frieder.
    p. cm. -- (Kluwer international series in engineering and
    computer science ; SECS 461)
  Includes bibliographical references and index.
  ISBN 978-1-4613-7532-6
  ISBN 978-1-4615-5539-1 (eBook)
  DOI 10.1007/978-1-4615-5539-1
  1. Information storage and retrieval systems. I. Frieder, Ophir.
  II. Title. III. Series.
  Z667.G76 1998
  005.74--dc21                                        98-30435
                                                           CIP
`
`Copyright © 1998 by Springer Science+Business Media New York
`Originally published by Kluwer Academic Publishers in 1998
`Softcover reprint of the hardcover 1st edition 1998
`Second Printing 2000
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the
publisher, Springer Science+Business Media, LLC.
`
`Printed on acid-free paper.
`
`
`
`
`Contents
`
List of Figures    vii
Preface    ix
Acknowledgments    xv

1. INTRODUCTION    1

2. RETRIEVAL STRATEGIES    11
   2.1 Vector Space Model    13
   2.2 Probabilistic Retrieval Strategies    22
   2.3 Inference Networks    48
   2.4 Extended Boolean Retrieval    58
   2.5 Latent Semantic Indexing    60
   2.6 Neural Networks    64
   2.7 Genetic Algorithms    70
   2.8 Fuzzy Set Retrieval    74
   2.9 Summary    80
   2.10 Exercises    81

3. RETRIEVAL UTILITIES    83
   3.1 Relevance Feedback    84
   3.2 Clustering    94
   3.3 Passage-based Retrieval    100
   3.4 N-grams    102
   3.5 Regression Analysis    106
   3.6 Thesauri    108
   3.7 Semantic Networks    118
   3.8 Parsing    125
   3.9 Summary    131
   3.10 Exercises    131

4. EFFICIENCY ISSUES PERTAINING TO SEQUENTIAL IR SYSTEMS    133
   4.1 Inverted Index    134
`
`
`
   4.2 Query Processing    142
   4.3 Signature Files    146
   4.4 Summary    149
   4.5 Exercises    150

5. INTEGRATING STRUCTURED DATA AND TEXT    153
   5.1 Review of the Relational Model    157
   5.2 A Historical Progression    163
   5.3 Information Retrieval Functionality Using the Relational Model    168
   5.4 Boolean Retrieval    176
   5.5 Proximity Searches    179
   5.6 Computing Relevance Using Unchanged SQL    181
   5.7 Relevance Feedback in the Relational Model    183
   5.8 Summary    184
   5.9 Exercises    184

6. PARALLEL INFORMATION RETRIEVAL SYSTEMS    185
   6.1 Parallel Text Scanning    186
   6.2 Parallel Indexing    191
   6.3 Parallel Implementation of Clustering and Classification    198
   6.4 Summary    198
   6.5 Exercises    199

7. DISTRIBUTED INFORMATION RETRIEVAL    201
   7.1 A Theoretical Model of Distributed IR    202
   7.2 Replication in Distributed IR Systems    206
   7.3 Implementation Issues of a Distributed IR System    209
   7.4 Improving Performance of Web-based IR Systems    212
   7.5 Web Search Engines    214
   7.6 Summary    217
   7.7 Exercises    219

8. THE TEXT RETRIEVAL CONFERENCE (TREC)    221

9. FUTURE DIRECTIONS    227

References    231

Index    253
`
`
`
`List of Figures
`
`1.1
`1.2
`1.3
`1.4
`1.5
`2.1
`2.2
`2.3
`2.4
`2.5
`2.6
`2.7
`2.8
`3.1
`3.2
`3.3
`3.4
`4.1
`4.2
`5.1
`6.1
`7.1
`8.1
`8.2
`
`Document Retrieval
`Document Routing
`Result Set: Relevant Retrieved, Relevant, and Retrieved
`Example of Precision and Two Points of Recall
`Typical and Optimal Precision/Recall Graph
`Vector Space Model
`Example: Vector Space Model with a 2 Term Vocabulary
`Example: Inverted Index
`Training Data for Probabilistic Retrieval
`Simple Inference Network
`Document-Term-Query Inference Network
`Inference Network Example
`Neural Network with Feedback
`Relevance Feedback Process
`Document Clustering
`Overlapping vs Non-Overlapping Passages
`Using a Thesaurus to Expand a Query
`Inverted Index
`Signature File
`IR as an Application of a RDBMS
`Partitioning an Inverted Index
`Distributed Document Retrieval
`Sample TREC document
`Sample TREC query
`
`2
`3
`4
`5
`6
`14
`15
`19
`26
`50
`54
`55
`67
`85
`95
`102
`109
`135
`146
`155
`192
`202
`223
`224
`
`Page 8 of 262
`
`
`
`Preface
`
This book focuses on the technology of information retrieval: a user enters a
`query that describes a request for information, and an information retrieval
`system responds by identifying documents that are relevant to the query. Of
`course, the computer cannot understand documents and queries the way a
`human can, so methods other than full natural language understanding of the
`text must be employed. We cover the various techniques used to quickly find
`relevant documents, with the necessary introduction, technical details, and
`examples to illustrate the approach.
The advent of the World Wide Web has increased the importance of information
retrieval. Instead of going to the local library to find something, people
search the Web. Thus, the relative number of manual versus computer-assisted
searches for information has shifted dramatically in the past few years. This
has increased the need for automated information retrieval for extremely large
document collections.
It is estimated that the Web now contains more than twenty million different
content areas, presented on more than 320 million web pages, and one million
web servers-and it is doubling every nine months [Kahle, 1998, Lawrence and
Giles, 1998]. This book describes the techniques that can be used to try to
find the needle that a user is seeking in the enormous haystack that is the Web,
as well as other large information collections. These techniques are found in
numerous research papers, described by many different authors, and presented
with many different examples. This book consolidates the most commonly used
information retrieval algorithms and heuristics.
`The problem of finding relevant information is not new. Early systems tried
`to classify knowledge into a set of known fixed categories. The first of these was
`completed in 1668 by the English philosopher John Wilkins [Subbiondo, 1992].
The problem with this approach is that categorizers commonly do not place
documents into the categories where searchers expect to find them. No matter
what categories a categorizer chooses, they will not match those that a
searcher has in mind. For example, users of e-mail systems place mail in folders
or categories, only to spend countless hours trying to find the same documents
`
`because they cannot remember what category they used, or the category they
`are sure they used does not contain the relevant document. Effective and
`efficient search techniques are needed to help users quickly find the information
`they are looking for.
`Another approach is to try to understand the content of the documents.
`Ideally, documents would be loaded into the computer, the computer would
`read and understand them, and then, users would simply ask questions and be
`given direct answers. The field of Natural Language Processing works on this
approach, but suffice it to say that this approach is extremely difficult, and we
`currently lack systems that can build good knowledge structures for even the
`simplest of texts.
`If we rule out hand-categorization and natural language processing, that
`leaves us with Information Retrieval (IR). With this approach, no real attempt
`is made to have the computer understand the document-instead techniques
`that use pattern matching, statistical sampling, machine learning, and proba(cid:173)
`bility theory are used to guess which documents are relevant. These systems are
`not perfect, but they offer users the opportunity to sift through large volumes
`of text.
`The key problem is that simply matching on query words is not sufficient to
satisfy user requests. A query about "Abraham Lincoln" can bring back documents
about "a man named Abraham went to the dealership to purchase a Lincoln."
The most obvious techniques have not been shown to work well. Many
`researchers suspect that an automated thesaurus will dramatically improve the
`accuracy of document retrieval-to date, little success has been demonstrated
`with such a technique. For every good word you add-you end up adding
`something that degrades the results. For our same query, a thesaurus might
`add the word "president" to the query but then the user might be treated to
`documents describing presidents of large companies that have nothing to do
`with the query.
`Our objective is to describe the techniques or algorithms and heuristics used
`to find documents that are relevant to the user request and to find them quickly.
`Academic research since the late 1950s has focused on this problem. To really
attack the problem requires expertise in the fields of library science,
computational linguistics, natural language processing, probability theory, and of course,
`computer science. We have tried to give the necessary introduction to each
`topic, and to give the details and examples needed to understand work that
`has been done on each topic. The field is moving quickly-the advent of the
`Web has brought new focus to the problem and many new algorithms have
`been developed. When the first Text REtrieval Conference (TREC) met in
`1992 to evaluate text retrieval-there were essentially four retrieval strategies
`to find relevant documents-vector space retrieval, extended boolean retrieval,
`probabilistic retrieval, and inference networks. Only six years later, we describe
`eight strategies in Chapter 2. Many of these are recent, and we try to give solid
`coverage to each one.
`
`
We are still at the tip of the iceberg in terms of really succeeding at solving
the information retrieval problem. Measurements from TREC that include a
standard set of queries run against a two gigabyte document collection with
around half a million documents show that just about any technique or
combination of techniques results in an answer set that contains about twenty to
thirty percent of relevant documents. If we suspect most users look at the top
ten documents, systems are fortunate to get about three to four of these ten
documents correct. Users then waste time looking at more non-relevant documents
than relevant documents. Other studies have shown that when information
retrieval systems are evaluated, they are found to miss numerous relevant
documents [Blair and Maron, 1985]. Moreover, users have become complacent
in what they expect of information retrieval systems [Gordon, 1997]. It is this
relatively poor performance that drives more research in the area of effectiveness.
It is our goal to describe the algorithms and heuristics that were tried so
that researchers who wish to improve on past performance do not unnecessarily
repeat work that has already been completed.
`Finding relevant documents is not enough. This book describes the current
`techniques used to find relevant documents quickly. Some excellent information
`retrieval texts are currently available on the market. Many of them, however,
`do not treat both aspects of information retrieval, namely, retrieval effectiveness
`in terms of accuracy and retrieval efficiency in terms of resource utilization. We
`provide an algorithm-directed presentation for both orientations of interest. We
`survey recent algorithms on effectiveness and efficiency. Certainly, new papers
`are constantly being published, but we have focused on key algorithms that
`must be understood before new papers can be digested.
`Detailed examples are provided wherever possible since many books and
`papers do not provide these examples. We assume the reader has a reasonable
`understanding of data structures and algorithms, but we do not assume much
mathematical background. Whenever appropriate, a brief review of the needed
background concepts is provided prior to describing the information retrieval
`material.
A recent collection, Readings in Information Retrieval, was published that
`covers several key papers. Unfortunately, this collection excludes papers that
`focus predominantly on efficiency. Furthermore, this collection is just that, a
`collection of seminal papers. It is not organized as a text. Thus, it is difficult
`to use this collection as a class text. It is, however, an excellent supplement to
`this book.
This book is primarily intended as a textbook for an undergraduate or graduate
level course in Information Retrieval. It has been used in a graduate course,
and we have incorporated student feedback in developing a set of overheads that
assist instruction using our text. The set of PowerPoint slides, including speaker
notes, is available at http://www.csam.iit.edu/~ophir/slides.
Additionally, practitioners who build IR systems or applications that use IR
`systems will find the information in this book useful when deciding on which
`retrieval strategies and utilities to deploy in production applications. Note that
`
`the focus of the book is on algorithms, not on commercial products, but, to our
`knowledge, the basic strategies used by the majority of commercial products
`are described in the book. Our intent is that a practitioner may find that a
`commercial product is using a given strategy and can then use this book as a
`reference for more information on what is known about the techniques used by
`the product. Other information that is of use to practitioners is found in our
`chapter that focuses on use of unchanged SQL to integrate structured data and
`text.
`Finally, we note that the information retrieval field changes daily. For the
`most up to date coverage of the field, the best sources are publications such as
the ACM Transactions on Information Systems, the Journal of the American
Society for Information Science, Information Processing and Management, and
Information Retrieval. Other relevant papers are found in the various
information retrieval conferences such as ACM SIGIR and NIST TREC, and to a
lesser degree ACM CIKM and the annual NLP conference.
`Any comments, suggestions, or corrections regarding either the text or the
`overheads are greatly welcomed and would be sincerely appreciated.
`Hope to hear from you soon.
`
`Ophir Frieder
IITRI Professor of Computer Science
`Illinois Institute of Technology
`ophir@csam.iit.edu
`
`David Grossman
`Staff Scientist
`Office of Research and Development
`dagr@jnpcs.com
`
`
`
`
`From David:
`For my parents:
Roberta "Mickey" and Joshua Grossman.
`
`From Ophir:
`In honor of my grandmother:
`Hadasa Bogler
`and in memory of my grandparents:
`Yehoshua Bogler and
`Ruzena and Armin Frieder
`
`
`
`
`Acknowledgments
`
`We are indebted to many, many people, and we shudder at the thought of
`leaving out someone whose contributions have improved the quality of the book.
Just in case we did, however, our sincere apologies for the oversight and a
heartfelt thanks for the contributions.
`We are deeply indebted to Paul Kantor and Don Kraft for their insightful
`comments during some of the early stages of the development of the book. We
`also greatly thank Steve Robertson, K.L. Kwok, and Jamie Callan for critically
`reviewing the sections describing their efforts, and Warren Greiff whose extreme
`patience taught and retaught us the details of inference networks.
`General critical feedback came from multiple individuals. Readability for
`classroom use was enhanced by feedback given from the students participating
`in Ophir's CSE6000 class that read each draft copy multiple times. Although
`comments from all students are greatly appreciated, Salim Mounir Alaoui truly
`went beyond the call of duty, and his assistance is acknowledged by name. Also
`Matt Mahoney is recognized and thanked for his initial draft of the course
presentation overheads and Nazli Goharian for her technical review and
enhancements of them. David Roberts provided several specific suggestions that
`dramatically improved the introduction and focus of the book. John Juilfs
`spent countless hours working out all of the examples and checking our math
`(we take full responsibility for any remaining errors, but he saved us from lots
`of them). David Holmes has been an instrumental part of our TREC group,
`and his willingness to spend plenty of time and resources running experiments
`for us was critical to the development of the integration of structured data and
`text chapter. Special thanks go to Bill Bercik and Prachya Chalermwat for
`their work on all of the figures used in the text. Abdur Chowdhury provided
`valuable assistance with the content of the book and was a great help during
`the construction of the index.
`We are indebted to Steve Kirsch at Infoseek and Graham Spencer, Doug
`Cutting, and Jack Xu at Excite for their assistance with the material found on
`the section describing implementation details of web search engines.
`
`
We thank many people who reviewed repeated iterations of the book including
Mary Catherine McCabe, Jean-Michel Pomarede, Muralidhar Medidi,
and Amit Jain, whom we thank for their technical comments, and Deborah
Benigni, Wendy Grossman and Karen Sweeney for the great deal of copy editing
`that they provided. We also thank all others who read the book and provided
`excellent feedback, and our colleagues at work who provided moral support.
`Finally, we would not have ever finished the book without constant moral
support from our families and close friends. Special thanks go to Sandy
Halin-Adams and Nazli Goharian who made countless sacrifices of their time and gave
us constant encouragement and support. Parents, siblings, in-laws,
grandparents, aunts, uncles, and close friends, all helped with comments like "So, how
`is the book going?" or "Is the book out yet?". In spite of knowing that they
`were actually trying to cheer us on to finish, nothing was a better motivator to
`push ahead than such remarks.
`From the both of us to all of you, THANKS.
`
`
`
`
`1 INTRODUCTION
`
`Since the near beginnings of human civilization, human beings have focused on
`written communication. From cave drawings to scroll writings, from printing
`presses to electronic libraries--communicating has been of primary concern to
`our existence. Today, with the emergence of digital libraries and electronic
information exchange, there is a clear need for improved techniques to organize
`these large quantities of information. Applied and theoretical research and
`development in the areas of information authorship, processing, storage, and
`retrieval is of interest to all sectors of the community. In this book, we overview
`recent research efforts that focus on the electronic searching and retrieving of
`documents.
`Our focus is strictly on the retrieval of information in response to user
queries. That is, we discuss algorithms and approaches for ad hoc information
retrieval, or simply, information retrieval. Figure 1.1 illustrates the basic
`process of ad hoc information retrieval. A static or relatively static document
collection is indexed prior to any user query. A query is issued, and a set of
documents that are deemed relevant to the query are ranked based on their
computed similarity to the query and presented to the user. Numerous techniques
`exist to identify how these documents are ranked, and that is a key focus of this
`book (effectiveness). Other techniques also exist to rank documents quickly,
`and these are also discussed (efficiency).
`Information Retrieval (IR) is devoted to finding "relevant" documents, not
finding simple matches to patterns.

Figure 1.1. Document Retrieval (an ad hoc query is run against a static document collection, producing a ranked answer set)

Yet, often when information retrieval systems are evaluated, they are found
to miss numerous relevant documents [Blair and Maron, 1985]. Moreover, users
have become complacent in their expectation of accuracy of information
retrieval systems [Gordon, 1997].
`A related problem is that of document routing or filtering. Here, the queries
`are static and the document collection constantly changes. An environment
`where corporate e-mail is routed based on predefined queries to different parts
`of the organization (i.e., e-mail about sales is routed to the sales department,
`marketing e-mail goes to marketing, etc.) is an example of an application of
`document routing. Figure 1.2 illustrates document routing. Document routing
`algorithms and approaches also widely appear in the literature, but are not
`addressed in this book.
In Figure 1.3, we illustrate the critical document categories that correspond
to any issued query. Namely, in the collection there are documents which
are retrieved, and there are those documents that are relevant. In a perfect
system, these two sets would be equivalent-we would only retrieve relevant
documents. In reality, systems retrieve many non-relevant documents.

Figure 1.2. Document Routing (incoming documents are matched against a set of predetermined queries or user profiles, and each document is routed to the users whose query or profile it matches)

To measure effectiveness, two ratios are used: precision and recall. Precision is
`the ratio of the number of relevant documents retrieved to the total number
`retrieved. Precision provides an indication of the quality of the answer set.
`However, this does not consider the total number of relevant documents. A
`system might have good precision by retrieving ten documents and finding that
`nine are relevant (a 0.9 precision), but the total number of relevant documents
`also matters. If there were only nine relevant documents, the system would be
`a huge success-however if millions of documents were relevant and desired,
`this would not be a good result set.
`Recall considers the total number of relevant documents-it is the ratio of
`number of relevant documents retrieved to the total number of documents in the
`collection that are believed to be relevant. When the total number of relevant
`documents in the collection is unknown, an approximation of the number is
`obtained. A good survey of effectiveness measures as well as a brief overview
of information retrieval is found in [Kantor, 1994].
`
Figure 1.3. Result Set: Relevant Retrieved, Relevant, and Retrieved (within the set of all documents, the retrieved set and the relevant set overlap in the relevant retrieved documents)

    Precision = Relevant Retrieved / Retrieved

    Recall = Relevant Retrieved / Relevant
`
Precision may be computed at various points of recall. Consider the example
in Figure 1.4. Ten documents are retrieved, but only two documents (documents
two and five) are relevant to the query. Consider the document retrieval
performance represented by the sloped line. Fifty percent recall (finding one
of the two relevant documents) results when two documents are retrieved. At
this point, precision is fifty percent as we have retrieved two documents and
one of them is relevant. To reach one hundred percent recall, we must continue
to retrieve documents until both relevant documents are retrieved. For our
example, it is necessary to retrieve five documents to find both relevant
documents. At this point, precision is forty percent because two out of five retrieved
documents are relevant. Hence, for any desired level of recall it is possible to
compute precision. Graphing precision at various points of recall is referred to
as a precision/recall curve.
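The arithmetic above can be sketched in a few lines of Python. This is our own illustration, not code from the book; the function name and document labels are invented for the sketch:

```python
def precision_at_recall_points(ranked, relevant):
    """Walk the ranked answer set; each time a relevant document is
    found, record (recall, precision) at that point in the list."""
    points = []
    found = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    return points

# Figure 1.4: ten documents retrieved, only d2 and d5 relevant.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d2", "d5"}
print(precision_at_recall_points(ranked, relevant))
# [(0.5, 0.5), (1.0, 0.4)] -- 50% precision at 50% recall, 40% at 100%
```

Each relevant document reached while scanning down the ranked list defines one point of recall, and the precision recorded there is exactly the ratio of relevant retrieved to total retrieved at that rank.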
Figure 1.4. Example of Precision and Two Points of Recall (ten documents, d1 through d10, are retrieved in order of similarity coefficient; only d2 and d5 are relevant; precision at 50% recall = 1/2 = 50%, precision at 100% recall = 2/5 = 40%)

A typical precision/recall curve is shown in Figure 1.5. Typically, as higher
recall is desired, more documents must be retrieved to obtain the desired level of
recall. In a perfect system, only relevant documents are retrieved. This means
that at any level of recall, precision would be 1.0. The optimal precision/recall
line is shown in Figure 1.5.
`Average precision refers to an average of precision at various points of recall.
`Many systems today, when run on a standard document collection, report an
`average precision of between 0.2 and 0.3. Certainly, there is some element of
`fuzziness here because relevance is not a clearly defined concept, but it is clear
`that there is significant room for improvement in the area of effectiveness.
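One common way to compute such an average (the uninterpolated form, which averages precision at each rank where a relevant document appears) can be sketched as follows; this is our illustration of the idea, not the book's definition:

```python
def average_precision(ranked, relevant):
    """Average the precision observed at each point of recall, i.e. at
    each rank in the answer set where a relevant document is found."""
    precisions = []
    found = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / len(relevant)

# For the Figure 1.4 example (d2 and d5 relevant out of ten retrieved):
# (1/2 + 2/5) / 2 = 0.45
print(average_precision(["d1", "d2", "d3", "d4", "d5",
                         "d6", "d7", "d8", "d9", "d10"], {"d2", "d5"}))
```

Dividing by the total number of relevant documents (rather than by the number found) penalizes a system that never retrieves some of the relevant documents.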
`Finding relevant documents is not enough. Finding relevant documents
`within an acceptable response time is the goal. This book describes the current
strategies used to find relevant documents quickly. The quest for efficient
and effective information retrieval continues.
`We explain each algorithm in detail, and for each topic, include examples
`for the most crucial algorithms. We then switch gears into survey mode and
`provide references to related and follow-on work. We explain the key aspects
`of the algorithms and then provide references for those interested in further
`
details. A collection of key information retrieval research papers is found in
[Sparck Jones and Willett, 1997].

Figure 1.5. Typical and Optimal Precision/Recall Graph (the optimal curve holds precision at 1.0 at every level of recall; a typical curve falls as recall increases)
Recent algorithms designed to search large bodies of information are discussed
throughout this book. Many research publications describe these algorithms
in detail, but they are spread across numerous journals and written
`in a variety of different styles. Also, they have differing expectations of their
`reader's background. We provide a relatively brief, but sufficiently detailed
`overview of the field.
`As earlier stated, a sophisticated mathematical background is not required.
Whenever detailed mathematical constructs are used, we provide a quick
refresher of the key points needed to understand the algorithms and detailed
`examples.
`We believe this book is valuable to a variety of readers. Readers familiar with
`the core of computer science and interested in learning more about information
`retrieval algorithms should benefit from this text. We provide explanations of
`
`the fundamental problems that exist, and how people have addressed them in
`the past.
`This book also has value for anyone who currently uses and supports a
`Relational Database Management System (RDBMS). Chapter 5 gives detailed
`algorithms that treat text retrieval as an application of a RDBMS. This makes
`it possible to integrate both structured data and text.
`To guide the reader through the key issues in ad hoc information retrieval,
`we partitioned our book into separate but inter-linked processing avenues. In
`the first section, covered in Chapters 2 and 3, we overview retrieval processing
`strategies and utilities. All of these strategies and utilities focus on one and
only one critical issue, namely, the improvement of retrieval accuracy. In
Chapter 2, we describe eight models that were either developed for or adapted to
`information retrieval specifically for the purpose of enhancing the evaluation or
`ranking of documents retrieved in response to user queries. Chapter 3 describes
`utilities that could be applied to enhance any strategy described in Chapter 2.
In Chapter 3, under the "Retrieval Utilities" heading, we focus on techniques
that are applicable to either all or most of the models. Several of the
utilities described are language dependent, e.g., parsing and thesauri; others
aim to be language independent, namely, N-gram processing.
`We note in Chapter 3, that some of the described utilities were proposed as
`individual processing strategies. In reality, however, it is the combination of
`these techniques that yields the best improvements. An approach to precisely
`determine the optimal mix of techniques, the order to execute them, and the
underlying models to operate them so as to yield the optimal processing
strategy is still unknown.
`After describing models and utilities that address accuracy demands, we
`turn our attention towards processing efficiency.
`In Chapter 4, we describe
`various document access schemes. That is, we describe both the constructs
`and usage of inverted indices as well as other representation schemes such as
`signature files. Each of these access schemes has advantages and disadvantages.
The tradeoffs lie in terms of storage overhead and maintainability versus search
`and retrieval processing times. After describing the various access methods, we
`overview several compression schemes.
`Chapters 2 through 4 cover the basics in terms of traditional information
`retrieval models, utilities, and processing strategies. In Chapters 5, 6, and 7,
`we focus our attention on special topics in information retrieval. The three
topics addressed, namely data integration, parallel, and distributed
information retrieval systems, were selected based on where the commercial sector is
`focusing.
Traditionally, there was a clear separation between structured data, typically
stored and accessed via relational database management systems, and
semi-structured data such as text, typically stored and accessed via information
retrieval systems. Each processing system supported its own data storage
files and access methods. Today, the distinction between structured and
semi-structured data is quickly vanishing. In fact, we no longer are concerned with
`
just structured and semi-structured data, but often we include unstructured
data, such as images, in the same storage repository. Our focus does not
include image processing, although future systems will need to address it.
To address the integration of structured and unstructured data, commercial
enterprises such as Oracle, IBM, and Informix have integrated information
retrieval functionality with their traditional relational database engines.
Furthermore, text retrieval companies, such as Verity, have added relational processing
components. In all of these cases, however, additional functionality came at the
expense of requiring additional, separate, processing units. In Chapter 5, we
discuss the issues related to adding processing units and suggest an alternative
method that involves implementing information retrieval processing capability
as an application of relational databases. Using such an approach, the
traditional benefits of relational database processing (i.e., portability, concurrency,
recovery, etc.) are made available without requiring additional software
development. Since all traditional relational database vendors provide parallel
implementations of their database software, implementing an information
retrieval system as a relational database application further provides for a parallel
instantiation of an information retrieval system.
Having recognized the need for a parallel information retrieval capability,
we also overview recent developments in this area. In Chapter 6, we initially
describe the earlier parallel processing efforts in information retrieval.
These approaches predominantly focus on the use of Single Inst