`
Google's PageRank and Beyond:
`The Science of Search Engine Rankings
`
`EXHIBIT 2099
Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`
`
`PuPstyleBook March 3, 2006
`
`Google’s PageRank and Beyond:
`The Science of Search Engine Rankings
`
`Amy N. Langville and Carl D. Meyer
`
`PRINCETON UNIVERSITY PRESS
`
`PRINCETON AND OXFORD
`
`
`
`
Contents

Preface

Chapter 1. Introduction to Web Search Engines
1.1 A Short History of Information Retrieval
1.2 An Overview of Traditional Information Retrieval
1.3 Web Information Retrieval

Chapter 2. Crawling, Indexing, and Query Processing
2.1 Crawling
2.2 The Content Index
2.3 Query Processing

Chapter 3. Ranking Webpages by Popularity
3.1 The Scene in 1998
3.2 Two Theses
3.3 Query-Independence

Chapter 4. The Mathematics of Google’s PageRank
4.1 The Original Summation Formula for PageRank
4.2 Matrix Representation of the Summation Equations
4.3 Problems with the Iterative Process
4.4 A Little Markov Chain Theory
4.5 Early Adjustments to the Basic Model
4.6 Computation of the PageRank Vector
4.7 Theorem and Proof for Spectrum of the Google Matrix

Chapter 5. Parameters in the PageRank Model
5.1 The α Factor
5.2 The Hyperlink Matrix H
5.3 The Teleportation Matrix E

Chapter 6. The Sensitivity of PageRank
6.1 Sensitivity with respect to α
6.2 Sensitivity with respect to H
6.3 Sensitivity with respect to vT
6.4 Other Analyses of Sensitivity
6.5 Sensitivity Theorems and Proofs

Chapter 7. The PageRank Problem as a Linear System
7.1 Properties of (I − αS)
7.2 Properties of (I − αH)
7.3 Proof of the PageRank Sparse Linear System

Chapter 8. Issues in Large-Scale Implementation of PageRank
8.1 Storage Issues
8.2 Convergence Criterion
8.3 Accuracy
8.4 Dangling Nodes
8.5 Back Button Modeling

Chapter 9. Accelerating the Computation of PageRank
9.1 An Adaptive Power Method
9.2 Extrapolation
9.3 Aggregation
9.4 Other Numerical Methods

Chapter 10. Updating the PageRank Vector
10.1 The Two Updating Problems and their History
10.2 Restarting the Power Method
10.3 Approximate Updating Using Approximate Aggregation
10.4 Exact Aggregation
10.5 Exact vs. Approximate Aggregation
10.6 Updating with Iterative Aggregation
10.7 Determining the Partition
10.8 Conclusions

Chapter 11. The HITS Method for Ranking Webpages
11.1 The HITS Algorithm
11.2 HITS Implementation
11.3 HITS Convergence
11.4 HITS Example
11.5 Strengths and Weaknesses of HITS
11.6 HITS’s Relationship to Bibliometrics
11.7 Query-Independent HITS
11.8 Accelerating HITS
11.9 HITS Sensitivity

Chapter 12. Other Link Methods for Ranking Webpages
12.1 SALSA
12.2 Hybrid Ranking Methods
12.3 Rankings based on Traffic Flow

Chapter 13. The Future of Web Information Retrieval
13.1 Spam
13.2 Personalization
13.3 Clustering
13.4 Intelligent Agents
13.5 Trends and Time-Sensitive Search
13.6 Privacy and Censorship
13.7 Library Classification Schemes
13.8 Data Fusion

Chapter 14. Resources for Web Information Retrieval
14.1 Resources for Getting Started
14.2 Resources for Serious Study

Chapter 15. The Mathematics Guide
15.1 Linear Algebra
15.2 Perron–Frobenius Theory
15.3 Markov Chains
15.4 Perron Complementation
15.5 Stochastic Complementation
15.6 Censoring
15.7 Aggregation
15.8 Disaggregation

Chapter 16. Glossary

Bibliography

Index
`
`
`
`
`Preface
`
`Purpose
`As teachers of linear algebra, we wanted to write a book to help students and the general
`public appreciate and understand one of the most exciting applications of linear algebra
`today—the use of link analysis by web search engines. This topic is inherently interesting,
`timely, and familiar. For instance, the book answers such curious questions as: How do
`search engines work? Why is Google so good? What’s a Google bomb? How can I
`improve the ranking of my homepage in Teoma?
`
`We also wanted this book to be a single source for material on web search engine rank-
`ings. A great deal has been written on this topic, but it’s currently spread across numerous
`technical reports, preprints, conference proceedings, articles, and talks. Here we have
`summarized, clarified, condensed, and categorized the state of the art in web ranking.
`
Our Audience
We wrote this book with two diverse audiences in mind: the general science reader
and the technical science reader. The title echoes the technical content of the book, but
in addition to being informative on a technical level, we have also tried to provide some
entertaining features and lighter material concerning search engines and how they work.
`The Mathematics
`Our goal in writing this book was to reach a challenging audience consisting of the
`general scientific public as well as the technical scientific public. Of course, a complete
`understanding of link analysis requires an acquaintance with many mathematical ideas.
`Nevertheless, we have tried to make the majority of the book accessible to the general sci-
`entific public. For instance, each chapter builds progressively in mathematical knowledge,
`technicality, and prerequisites. As a result, Chapters 1-4, which introduce web search and
`link analysis, are aimed at the general science reader. Chapters 6, 9, and 10 are particularly
`mathematical. The last chapter, Chapter 15, “The Mathematics Guide,” is a condensed but
`complete reference for every mathematical concept used in the earlier chapters. Through-
`out the book, key mathematical concepts are highlighted in shaded boxes. By postponing
`the mathematical definitions and formulas until Chapter 15 (rather than interspersing them
`throughout the text), we were able to create a book that our mathematically sophisticated
`readers will also enjoy. We feel this approach is a compromise that allows us to serve both
`audiences: the general and technical scientific public.
`
`
`
`
`Asides
`An enjoyable feature of this book is the use of Asides. Asides contain entertaining news
`stories, practical search tips, amusing quotes, and racy lawsuits. Every chapter, even the
particularly technical ones, contains several asides. Oftentimes a light aside provides the
`perfect break after a stretch of serious mathematical thinking. Brief asides appear in shaded
`boxes while longer asides that stretch across multiple pages are offset by horizontal bars
`and italicized font. We hope you enjoy these breaks—we found ourselves looking forward
`to writing them.
`
`Computing and Code
`Truly mastering a subject requires experimenting with the ideas. Consequently, we have
`incorporated Matlab code to encourage and jump-start the experimentation process. While
`any programming language is appropriate, we chose Matlab for three reasons: (1) its matrix
`storage architecture and built-in commands are particularly suited to the large sparse link
`analysis matrices of this text, (2) among colleges and universities, Matlab is a market leader
`in mathematical software, and (3) it’s very user-friendly. The Matlab programs in this book
are intended to be instructional, not production, code. We hope that, by playing with these
`programs, readers will be inspired to create new models and algorithms.
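The book's own examples are in Matlab; as a taste of the kind of experimentation this section encourages, here is a comparable sketch in Python of the power method for PageRank (the algorithm developed in Chapter 4). The tiny three-page web and the damping value 0.85 are illustrative choices, not examples from the book.

```python
def pagerank(H, alpha=0.85, tol=1e-8):
    """Power method for PageRank. H is a row-(sub)stochastic hyperlink
    matrix given as a list of rows; all-zero (dangling) rows are patched
    with the uniform distribution."""
    n = len(H)
    S = [row[:] if sum(row) > 0 else [1.0 / n] * n for row in H]
    pi = [1.0 / n] * n                      # start from the uniform vector
    while True:
        # One step: teleportation term plus alpha times pi * S.
        new = [(1 - alpha) / n + alpha * sum(pi[i] * S[i][j] for i in range(n))
               for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new

# Tiny 3-page web: page 1 links to pages 2 and 3, page 2 links to page 3,
# page 3 has no outlinks (a dangling node).
H = [[0, 0.5, 0.5],
     [0, 0,   1.0],
     [0, 0,   0]]
print(pagerank(H))
```

Page 3, which receives links from both other pages, ends up with the highest PageRank, and the scores sum to 1.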
`
`Acknowledgments
`We thank Princeton University Press for supporting this book. We especially enjoyed
`working with Vickie Kearn, the Senior Editor at PUP. Vickie, thank you for displaying just
`the right combination of patience and gentle pressure. For a book with such timely mate-
`rial, you showed amazing faith in us. We thank all those who reviewed our manuscripts
`and made this a better book. Of course, we also thank our families and friends for their
`encouragement. Your pride in us is a powerful driving force.
`
`Dedication
We dedicate this book to mentors and mentees worldwide. The energy, inspiration, and
support that are sparked through such relationships can inspire great products. For us, it
`produced this book, but more importantly, a wonderful synergistic friendship.
`
`
`
`
`Chapter One
`Introduction to Web Search Engines
`
`1.1 A SHORT HISTORY OF INFORMATION RETRIEVAL
`
`Today we have museums for everything—the museum of baseball, of baseball players, of
`crazed fans of baseball players, museums for world wars, national battles, legal fights, and
`family feuds. While there’s no shortage of museums, we have yet to find a museum ded-
`icated to this book’s field, a museum of information retrieval and its history. Of course,
there are related museums, such as the Library Museum in Borås, Sweden, but none
concentrating on information retrieval. Information retrieval1 is the process of searching
`within a document collection for a particular information need (called a query). Although
`dominated by recent events following the invention of the computer, information retrieval
`actually has a long and glorious tradition. To honor that tradition, we propose the cre-
`ation of a museum dedicated to its history. Like all museums, our museum of information
`retrieval contains some very interesting artifacts. Join us for a brief tour.
`
`The earliest document collections were recorded on the painted walls of caves. A
`cave dweller interested in searching a collection of cave paintings to answer a particular
`information query had to travel by foot, and stand, staring in front of each painting. Un-
`fortunately, it’s hard to collect an artifact without being gruesome, so let’s fast forward a
`bit.
`
`Before the invention of paper, ancient Romans and Greeks recorded information on
`papyrus rolls. Some papyrus artifacts from ancient Rome had tags attached to the rolls.
`These tags were an ancient form of today’s Post-it Note, and make an excellent addition to
`our museum. A tag contained a short summary of the rolled document, and was attached
`in order to save readers from unnecessarily unraveling a long irrelevant document. These
`abstracts also appeared in oral form. At the start of Greek plays in the fifth century B.C.,
`the chorus recited an abstract of the ensuing action. While no actual classification scheme
`has survived from the artifacts of Greek and Roman libraries, we do know that another
`elementary information retrieval tool, the table of contents, first appeared in Greek scrolls
`from the second century B.C. Books were not invented until centuries later, when necessity
`required an alternative writing material. As the story goes, the Library of Pergamum (in
`what is now Turkey) threatened to overtake the celebrated Library of Alexandria as the
`best library in the world, claiming the largest collection of papyrus rolls. As a result, the
`Egyptians ceased the supply of papyrus to Pergamum, so the Pergamenians invented an
alternative writing material, parchment, which is made from thin layers of animal skin. (In
fact, the root of the word parchment comes from the word Pergamum.) Unlike papyrus,
parchment did not roll easily, so scribes folded several sheets of parchment and sewed them
into books. These books outlasted scrolls and were easier to use. Parchment books soon
replaced the papyrus rolls.

1The boldface terms that appear throughout the book are also listed and defined in the Glossary, which begins
on page 201.
`
`The heights of writing, knowledge, and documentation of the Greek and Roman
`periods were contrasted with their lack during the Dark and Middle Ages. Precious few
`documents were produced during this time. Instead, most information was recorded orally.
`Document collections were recorded in the memory of a village’s best storyteller. Oral
`traditions carried in poems, songs, and prayers were passed from one generation to the
`next. One of the most legendary and lengthy tales is Beowulf, an epic about the adventures
`of a sixth-century Scandinavian warrior. The tale is believed to have originated in the
`seventh century and been passed from generation to generation through song. Minstrels
`often took poetic license, altering and adding verses as the centuries passed. An inquisitive
`child wishing to hear stories about the monster Grendel waited patiently while the master
`storyteller searched his memory to find just the right part of the story. Thus, the result of the
`child’s search for information was biased by the wisdom and judgement of the intermediary
`storyteller. Fortunately, the invention of paper, the best writing medium yet, superior to
`even parchment, brought renewed acceleration to the written record of information and
`collections of documents. In fact, Beowulf passed from oral to written form around A.D.
`1000, a date over which scholars still debate. Later, monks, the possessors of treasured
`reading and writing skills, sat in scriptoriums working as scribes from sunrise to sunset.
`The scribes’ works were placed in medieval libraries, which initially were so small that
`they had no need for classification systems. Eventually the collections grew, and it became
common practice to divide the holdings into three groups: theological works, classical
authors of antiquity, and contemporary authors on the seven arts. Lists of holdings and
`tables of contents from classical books make nice museum artifacts from the medieval
`period.
`
`Other document collections sprung up in a variety of fields. This dramatically ac-
`celerated with the re-invention of the printing press by Johann Gutenberg in 1450. The
`wealthy proudly boasted of their private libraries, and public libraries were instituted in
`America in the 1700s at the prompting of Benjamin Franklin. As library collections grew
`and became publicly accessible, the desire for focused search became more acute. Hierar-
`chical classification systems were used to group documents on like subjects together. The
`first use of a hierarchical organization system is attributed to the Roman author Valerius
`Maximus, who used it in A.D. 30 to organize the topics in his book, Factorum ac dicto-
`rum memorabilium libri IX (Nine Books of Memorable Deeds and Sayings). Despite these
`rudimentary organization systems, word of mouth and the advice of a librarian were the
`best means of obtaining accurate quality information for a search. Of course, document
`collections and their organization expanded beyond the limits of even the best librarian’s
`memory. More orderly ways of maintaining records of a collection’s holdings were de-
`vised. Notable artifacts that belong in our information retrieval museum are a few lists
`of individual library holdings, sorted by title and also author, as well as examples of the
`Dewey decimal system (1872), the card catalog (early 1900s), microfilm (1930s), and the
`MARC (MAchine Readable Cataloging) system (1960s).
`
`These inventions were progress, yet still search was not completely in the hands of
`the information seeker. It took the invention of the digital computer (1940s and 1950s) and
`the subsequent inventions of computerized search systems to move toward that goal. The
`
`
`
`PuPstyleBook March 3, 2006
`
`INTRODUCTION TO WEB SEARCH ENGINES
`
`3
`
`first computerized search systems used special syntax to automatically retrieve book and
`article information related to a user’s query. Unfortunately, the cumbersome syntax kept
`search largely in the domain of librarians trained on the systems. An early representative
`of computerized search such as the Cornell SMART system (1960s) [146] deserves a place
`in our museum of information retrieval.
`
`In 1989 the storage, access, and searching of document collections was revolution-
`ized by an invention named the World Wide Web by its founder Tim Berners-Lee [79]. Of
`course, our museum must include artifacts from this revolution such as a webpage, some
`HTML, and a hyperlink or two. The invention of linked document collections was truly
`original at this time, despite the fact that Vannevar Bush, once Director of the Office of
`Scientific Research and Development, foreshadowed its coming in his famous 1945 essay,
`“As We May Think” [43]. In that essay, he describes the memex, a futuristic machine
`(with shocking similarity to today’s PC and Web) that mirrors the cognitive processes of
`humans by leaving “trails of association” throughout document collections. Four decades
`of progress later, remnants of Bush’s memex formed the skeleton of Berners-Lee’s Web. A
`drawing of the memex (Figure 1.1) by a graphic artist and approved by Bush was included
`in LIFE magazine’s 1945 publishing of Bush’s prophetic article.
`
`Figure 1.1 Drawing of Vannevar Bush’s memex appearing in LIFE. Original caption read: “Memex
in the form of a desk would instantly bring files and material on any subject to the operator’s
fingertips. Slanting translucent screens magnify supermicrofilm filed by code numbers.
`At left is a mechanism which automatically photographs longhand notes, pictures, and
`letters, then files them in the desk for future reference.”
`
`The World Wide Web became the ultimate signal of the dominance of the Informa-
`tion Age and the death of the Industrial Age. Yet despite the revolution in information
`storage and access ushered in by the Web, users initiating web searches found themselves
`floundering. They were looking for the proverbial needle in an enormous, ever-growing
`information haystack. In fact, users felt much like the men in Jorge Luis Borges’ 1941
`short story [35], “The Library of Babel”, which describes an imaginary, infinite library.
`
`When it was proclaimed that the Library contained all books, the first im-
`pression was one of extravagant happiness. All men felt themselves to be
`the masters of an intact and secret treasure. There was no personal or world
`problem whose eloquent solution did not exist in some hexagon.
`
`
`
`
`. . . As was natural, this inordinate hope was followed by an excessive depres-
`sion. The certitude that some shelf in some hexagon held precious books and
`that these precious books were inaccessible seemed almost intolerable.
`
`Much of the information in the Library of the Web, like that in the fictitious Library
`of Babel, remained inaccessible. In fact, early web search engines did little to ease user
`frustration; search could be conducted by sorting through hierarchies of topics on Yahoo, or
`by sifting through the many (often thousands of) webpages returned by the search engine,
`clicking on pages to personally determine which were most relevant to the query. Some
`users resorted to the earliest search techniques used by ancient queriers—word of mouth
`and expert advice. They learned about valuable websites from friends and linked to sites
`recommended by colleagues who had already put in hours of search effort.
`
`All this changed in 1998 when link analysis hit the information retrieval scene
`[40, 106]. The most successful search engines began using link analysis, a technique that
`exploited the additional information inherent in the hyperlink structure of the Web, to im-
`prove the quality of search results. Web search improved dramatically, and web searchers
`religiously used and promoted their favorite engines like Google and AltaVista. In fact, in
2004 many web surfers freely admitted their obsession with, dependence on, and addiction
`to today’s search engines. Below we include the comments [117] of a few Google fans
`to convey the joy caused by the increased accessibility of the Library of the Web made
`possible by the link analysis engines. Incidentally, in May 2004 Google held the largest
`share of the search market with 37% of searchers using Google, followed by 27% using
`the Yahoo conglomerate, which includes AltaVista, AlltheWeb, and Overture.2
`• “It’s not my homepage, but it might as well be. I use it to ego-surf. I use
`it to read the news. Anytime I want to find out anything, I use it.”—Matt
`Groening, creator and executive producer, The Simpsons
`• “I can’t imagine life without Google News. Thousands of sources from
`around the world ensure anyone with an Internet connection can stay in-
`formed. The diversity of viewpoints available is staggering.”—Michael
`Powell, chair, Federal Communications Commission
`• “Google is my rapid-response research assistant. On the run-up to a
`deadline, I may use it to check the spelling of a foreign name, to acquire
`an image of a particular piece of military hardware, to find the exact
`quote of a public figure, check a stat, translate a phrase, or research the
`background of a particular corporation.
`It’s the Swiss Army knife of
`information retrieval.”—Garry Trudeau, cartoonist and creator, Doones-
`bury
`
`Nearly all major search engines now combine link analysis scores, similar to those
`used by Google, with more traditional information retrieval scores. In this book, we record
`the history of one aspect of web information retrieval. That aspect is the link analysis
`or ranking algorithms underlying several of today’s most popular and successful search
`
`2These market share statistics were compiled by comScore, a company that counted the number of searches
`done by U.S. surfers in May 2004 using the major search engines. See the article at
`http://searchenginewatch.com/reports/article.php/2156431.
`
`
`
`PuPstyleBook March 3, 2006
`
`INTRODUCTION TO WEB SEARCH ENGINES
`
`5
`
`engines, including Google and Teoma. Incidentally, we’ll add the PageRank link analysis
`algorithm [40] used by Google (see Chapters 4-10) and the HITS algorithm [106] used by
`Teoma (see Chapter 11) to our museum of information retrieval.
`
`1.2 AN OVERVIEW OF TRADITIONAL INFORMATION RETRIEVAL
`
`To set the stage for the exciting developments in link analysis to come in later chapters, we
`begin our story by distinguishing web information retrieval from traditional informa-
`tion retrieval. Web information retrieval is search within the world’s largest and linked
`document collection, whereas traditional information retrieval is search within smaller,
`more controlled, nonlinked collections. The traditional nonlinked collections existed be-
`fore the birth of the Web and still exist today. Searching within a university library’s col-
`lection of books or within a professor’s reserve of slides for an art history course—these
`are examples of traditional information retrieval.
`
`These document collections are nonlinked, mostly static, and are organized and cate-
`gorized by specialists such as librarians and journal editors. These documents are stored in
`physical form as books, journals, and artwork as well as electronically on microfiche, CDs,
`and webpages. However, the mechanisms for searching for items in the collections are
`now almost all computerized. These computerized mechanisms are referred to as search
`engines, virtual machines created by software that enables them to sort through virtual
`file folders to find relevant documents. There are three basic computer-aided techniques
`for searching traditional information retrieval collections: Boolean models, vector space
`models, and probabilistic models [14]. These search models, which were developed in
`the 1960s, have had decades to grow, mesh, and morph into new search models. In fact,
`as of June 2000, there were at least 3,500 different search engines (including the newer
`web engines) [37], which means that there are possibly 3,500 different search techniques.
`Nevertheless, since most search engines rely on one or more of the three basic models, we
`describe these in turn.
`
`1.2.1 Boolean Search Engines
`
`The Boolean model of information retrieval, one of the earliest and simplest retrieval meth-
`ods, uses the notion of exact matching to match documents to a user query. Its more refined
descendants are still used by most libraries. The adjective Boolean refers to the use of
`Boolean algebra, whereby words are logically combined with the Boolean operators AND,
`OR, and NOT. For example, the Boolean AND of two logical statements x and y means that
`both x AND y must be satisfied, while the Boolean OR of these two statements means that
`at least one of these statements must be satisfied. Any number of logical statements can be
`combined using the three Boolean operators. The Boolean model of information retrieval
`operates by considering which keywords are present or absent in a document. Thus, a doc-
`ument is judged as relevant or irrelevant; there is no concept of a partial match between
`documents and queries. This can lead to poor performance [14]. More advanced fuzzy set
`theoretic techniques try to remedy this black-white Boolean logic by introducing shades of
`gray. For example, a title search for car AND maintenance on a Boolean engine causes
`the virtual machine to return all documents that use both words in the title. A relevant doc-
`ument entitled “Automobile Maintenance” will not be returned. Fuzzy Boolean engines
`use fuzzy logic to categorize this document as somewhat relevant and return it to the user.
`
`
`
`
`The car maintenance query example introduces the main drawbacks of Boolean
`search engines; they fall prey to two of the most common information retrieval problems,
`synonymy and polysemy. Synonymy refers to multiple words having the same meaning,
`such as car and automobile. A standard Boolean engine cannot return semantically related
`documents whose keywords were not included in the original query. Polysemy refers to
`words with multiple meanings. For example, when a user types bank as their query, does
`he or she mean a financial center, a slope on a hill, a shot in pool, or a collection of objects
`[24]? The problem of polysemy can cause many documents that are irrelevant to the user’s
`actual intended query meaning to be retrieved. Many Boolean search engines also require
`that the user be familiar with Boolean operators and the engine’s specialized syntax. For
`example, to find information about the phrase iron curtain, many engines require quo-
`tation marks around the phrase, which tell the search engine that the entire phrase should
`be searched as if it were just one keyword. A user who forgets this syntax requirement
`would be surprised to find retrieved documents about interior decorating and mining for
`iron ore.
`
`Nevertheless, variants of the Boolean model do form the basis for many search en-
`gines. There are several reasons for their prevalence. First, creating and programming a
`Boolean engine is straightforward. Second, queries can be processed quickly; a quick scan
`through the keyword files for the documents can be executed in parallel. Third, Boolean
`models scale well to very large document collections. Accommodating a growing collec-
`tion is easy. The programming remains simple; merely the storage and parallel processing
`capabilities need to grow. References [14, 75, 107] all contain chapters with excellent
`introductions to the Boolean model and its extensions.
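The exact-match AND behavior described above can be sketched in a few lines (Python here, though the book's code examples are in Matlab). The toy document texts are invented for illustration; note that the relevant "automobile" document is never returned, exactly as in the car AND maintenance example.

```python
# Toy Boolean engine: a document matches only if every query keyword is
# present (AND semantics). There is no partial credit and no ranking.
docs = {
    1: "car maintenance tips",
    2: "automobile maintenance schedules",   # relevant, but never returned
    3: "car racing history",
}

def boolean_and(query, docs):
    terms = query.lower().split()
    return [d for d, text in docs.items()
            if all(t in text.split() for t in terms)]

print(boolean_and("car maintenance", docs))   # → [1]; doc 2 is missed
```

A fuzzy Boolean engine would soften the `all(...)` test into a degree of membership, so that document 2 could come back as "somewhat relevant."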
`
`1.2.2 Vector Space Model Search Engines
`
`Another information retrieval technique uses the vector space model [147], developed by
`Gerard Salton in the early 1960s, to sidestep some of the information retrieval problems
`mentioned above. Vector space models transform textual data into numeric vectors and ma-
`trices, then employ matrix analysis3 techniques to discover key features and connections
`in the document collection. Some advanced vector space models address the common text
`analysis problems of synonymy and polysemy. Advanced vector space models, such as LSI
`[64] (Latent Semantic Indexing), can access the hidden semantic structure in a document
`collection. For example, an LSI engine processing the query car will return documents
`whose keywords are related semantically (in meaning), e.g., automobile. This ability to
`reveal hidden semantic meanings makes vector space models, such as LSI, very powerful
`information retrieval tools.
`
`Two additional advantages of the vector space model are relevance scoring and rel-
`evance feedback. The vector space model allows documents to partially match a query by
`assigning each document a number between 0 and 1, which can be interpreted as the like-
`lihood of relevance to the query. The group of retrieved documents can then be sorted by
`degree of relevancy, a luxury not possible with the simple Boolean model. Thus, vec-
`tor space models return documents in an ordered list, sorted according to a relevance
`score. The first document returned is judged to be most relevant to the user’s query.
`
`3Mathematical terms are defined in Chapter 15, the Mathematics Chapter, and are italicized throughout.
`
`
`
`
`Some vector space search engines report the relevance score as a relevancy percentage.
`For example, a 97% next to a document means that the document is judged as 97% rele-
`vant to the user’s query. (See the Federal Communications Commission’s search engine,
`http://www.fcc.gov/searchtools.html, which is powered by Inktomi, once known
`to use the vector space model. Enter a query such as taxes and notice the relevancy score
`reported on the right side.) Relevance feedback, the other advantage of the vector space
`model, is an information retrieval tuning technique that is a natural addition to the vec-
`tor space model. Relevance feedback allows the user to select a subset of the retrieved
`documents that are useful. The query is then resubmitted with this additional relevance
`feedback information, and a revised set of generally more useful documents is retrieved.
`
`A drawback of the vector space model is its computational expense. At query time,
`distance measures (also known as similarity measures) must be computed between each
`document and the query. And advanced models, such as LSI, require an expensive singu-
`lar value decomposition [82, 127] of a large matrix that numerically represents the entire
`document collection. As the collection grows, the expense of this matrix decomposition
`becomes prohibitive. This computational expense also exposes another drawback—vector
`space models do not scale well. Their success is limited to small document collections.
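A minimal sketch of the vector space idea (in Python; the book's code is Matlab): documents and the query become term-count vectors, and cosine similarity supplies the scores between 0 and 1 that allow results to be returned as a ranked list. The toy collection is invented for illustration, and real systems would add term weighting (and, for LSI, the singular value decomposition mentioned above).

```python
import math

docs = ["car maintenance guide",
        "car racing car news",
        "cooking recipes"]
query = "car maintenance"

# Vocabulary over the whole collection plus the query.
vocab = sorted({w for text in docs + [query] for w in text.split()})

def vec(text):
    """Raw term-count vector of a text over the shared vocabulary."""
    return [text.split().count(w) for w in vocab]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = vec(query)
ranked = sorted(range(len(docs)),
                key=lambda i: cosine(vec(docs[i]), q), reverse=True)
print(ranked)   # document indices, most relevant first
```

Document 0 shares both query terms and ranks first; document 2 shares none and scores 0.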
`
`Understanding Search Engines
`
`The informative little book by Michael Berry and Murray Browne, Understanding
`Search Engines: Mathematical Modeling and Text Retrieval [23], provides an
`excellent explanation of vector space models, especially LSI, and contains several
`examples and sample code. Our mathematical readers will enjoy this book and its
`application of linear algebra algorithms in the context of traditional information
`retrieval.
`
`1.2.3 Probabilistic Model Search Engines
`
`Probabilistic models attempt to estimate the probability that the user will find a particular
`document relevant. Retrieved documents are ranked by their odds of relevance (the ratio
`of the probability that the document is relevant to the query divided by the probability that
`the document is not relevant to the query). The probabilistic model operates recursively
and requires that the underlying algorithm guess at initial parameters and then iteratively try
to improve this initial guess to obtain a final ranking of relevancy probabilities.
`
`Unfortunately, probabilistic models can be very hard to build and program. Their
`complexity grows quickly, deterring many researchers and limiting their scalability. Prob-
`abilistic models also require several unrealistic simplifying assumptions, such as indepen-
`dence between terms as well as documents. Of course, the independence assumption is
`restrictive in most cases. For instance, in this document the most likely word to follow in-
`formation is the word retrieval, but the independence assumption judges each word
`as equally likely to follow the word information. On the other hand, the probabilistic
framework can naturally accommodate a priori preferences, and thus, these models do offer
promise of tailoring search results to the preferences of individual users. For example, a
user’s query history can be incorporated into the probabilistic model’s initial guess, which
generates better query results than a democratic guess.
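The odds-of-relevance ranking can be sketched as follows (Python; the book's code is Matlab). Under the term-independence assumption, a document's odds are just the product of per-term likelihood ratios. The probability tables here are invented stand-ins for the trained estimates a real system would iterate toward, not values from any actual engine.

```python
# Invented per-term probability estimates: P(term | relevant) and
# P(term | not relevant) for a hypothetical "information retrieval" query.
p_rel    = {"information": 0.8, "retrieval": 0.7, "bank": 0.1}
p_nonrel = {"information": 0.3, "retrieval": 0.1, "bank": 0.4}

def odds(doc_terms):
    """Odds of relevance = product of per-term likelihood ratios,
    relying on the (unrealistic) independence assumption.
    Unseen terms get a neutral ratio of 1."""
    o = 1.0
    for t in doc_terms:
        o *= p_rel.get(t, 0.5) / p_nonrel.get(t, 0.5)
    return o

docs = {"d1": ["information", "retrieval"],
        "d2": ["information", "bank"]}
ranking = sorted(docs, key=lambda d: odds(docs[d]), reverse=True)
print(ranking)   # → ['d1', 'd2']
```

Note how independence ignores the fact, mentioned above, that retrieval is especially likely to follow information; each term contributes its ratio regardless of context.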
`
`1.2.4 Meta-search Engines
`
`There’s actually a fourth model for traditional search engines, meta-search engines, which
`combines the three classic models. Meta-search engines are based on the principle that
`while one search engine is good, two (or more) are better. One search engine may be great
`at a certain task, w