Google’s PageRank and Beyond:
The Science of Search Engine Rankings

Amy N. Langville and Carl D. Meyer

PRINCETON UNIVERSITY PRESS
PRINCETON AND OXFORD

Contents

Preface

Chapter 1. Introduction to Web Search Engines
   1.1  A Short History of Information Retrieval
   1.2  An Overview of Traditional Information Retrieval
   1.3  Web Information Retrieval

Chapter 2. Crawling, Indexing, and Query Processing
   2.1  Crawling
   2.2  The Content Index
   2.3  Query Processing

Chapter 3. Ranking Webpages by Popularity
   3.1  The Scene in 1998
   3.2  Two Theses
   3.3  Query-Independence

Chapter 4. The Mathematics of Google’s PageRank
   4.1  The Original Summation Formula for PageRank
   4.2  Matrix Representation of the Summation Equations
   4.3  Problems with the Iterative Process
   4.4  A Little Markov Chain Theory
   4.5  Early Adjustments to the Basic Model
   4.6  Computation of the PageRank Vector
   4.7  Theorem and Proof for Spectrum of the Google Matrix

Chapter 5. Parameters in the PageRank Model
   5.1  The α Factor
   5.2  The Hyperlink Matrix H
   5.3  The Teleportation Matrix E

Chapter 6. The Sensitivity of PageRank
   6.1  Sensitivity with respect to α
   6.2  Sensitivity with respect to H
   6.3  Sensitivity with respect to v^T
   6.4  Other Analyses of Sensitivity
   6.5  Sensitivity Theorems and Proofs

Chapter 7. The PageRank Problem as a Linear System
   7.1  Properties of (I − αS)
   7.2  Properties of (I − αH)
   7.3  Proof of the PageRank Sparse Linear System

Chapter 8. Issues in Large-Scale Implementation of PageRank
   8.1  Storage Issues
   8.2  Convergence Criterion
   8.3  Accuracy
   8.4  Dangling Nodes
   8.5  Back Button Modeling

Chapter 9. Accelerating the Computation of PageRank
   9.1  An Adaptive Power Method
   9.2  Extrapolation
   9.3  Aggregation
   9.4  Other Numerical Methods

Chapter 10. Updating the PageRank Vector
   10.1  The Two Updating Problems and their History
   10.2  Restarting the Power Method
   10.3  Approximate Updating Using Approximate Aggregation
   10.4  Exact Aggregation
   10.5  Exact vs. Approximate Aggregation
   10.6  Updating with Iterative Aggregation
   10.7  Determining the Partition
   10.8  Conclusions

Chapter 11. The HITS Method for Ranking Webpages
   11.1  The HITS Algorithm
   11.2  HITS Implementation
   11.3  HITS Convergence
   11.4  HITS Example
   11.5  Strengths and Weaknesses of HITS
   11.6  HITS’s Relationship to Bibliometrics
   11.7  Query-Independent HITS
   11.8  Accelerating HITS
   11.9  HITS Sensitivity

Chapter 12. Other Link Methods for Ranking Webpages
   12.1  SALSA
   12.2  Hybrid Ranking Methods
   12.3  Rankings based on Traffic Flow

Chapter 13. The Future of Web Information Retrieval
   13.1  Spam
   13.2  Personalization
   13.3  Clustering
   13.4  Intelligent Agents
   13.5  Trends and Time-Sensitive Search
   13.6  Privacy and Censorship
   13.7  Library Classification Schemes
   13.8  Data Fusion

Chapter 14. Resources for Web Information Retrieval
   14.1  Resources for Getting Started
   14.2  Resources for Serious Study

Chapter 15. The Mathematics Guide
   15.1  Linear Algebra
   15.2  Perron–Frobenius Theory
   15.3  Markov Chains
   15.4  Perron Complementation
   15.5  Stochastic Complementation
   15.6  Censoring
   15.7  Aggregation
   15.8  Disaggregation

Chapter 16. Glossary

Bibliography

Index

Preface

Purpose
As teachers of linear algebra, we wanted to write a book to help students and the general public appreciate and understand one of the most exciting applications of linear algebra today—the use of link analysis by web search engines. This topic is inherently interesting, timely, and familiar. For instance, the book answers such curious questions as: How do search engines work? Why is Google so good? What’s a Google bomb? How can I improve the ranking of my homepage in Teoma?

We also wanted this book to be a single source for material on web search engine rankings. A great deal has been written on this topic, but it’s currently spread across numerous technical reports, preprints, conference proceedings, articles, and talks. Here we have summarized, clarified, condensed, and categorized the state of the art in web ranking.

Our Audience
We wrote this book with two diverse audiences in mind: the general science reader and the technical science reader. The title echoes the technical content of the book, but in addition to being informative on a technical level, we have also tried to provide some entertaining features and lighter material concerning search engines and how they work.

The Mathematics
Our goal in writing this book was to reach a challenging audience consisting of the general scientific public as well as the technical scientific public. Of course, a complete understanding of link analysis requires an acquaintance with many mathematical ideas. Nevertheless, we have tried to make the majority of the book accessible to the general scientific public. For instance, each chapter builds progressively in mathematical knowledge, technicality, and prerequisites. As a result, Chapters 1-4, which introduce web search and link analysis, are aimed at the general science reader. Chapters 6, 9, and 10 are particularly mathematical. The last chapter, Chapter 15, “The Mathematics Guide,” is a condensed but complete reference for every mathematical concept used in the earlier chapters. Throughout the book, key mathematical concepts are highlighted in shaded boxes. By postponing the mathematical definitions and formulas until Chapter 15 (rather than interspersing them throughout the text), we were able to create a book that our mathematically sophisticated readers will also enjoy. We feel this approach is a compromise that allows us to serve both audiences: the general and technical scientific public.

Asides
An enjoyable feature of this book is the use of Asides. Asides contain entertaining news stories, practical search tips, amusing quotes, and racy lawsuits. Every chapter, even the particularly technical ones, contains several asides. Oftentimes a light aside provides the perfect break after a stretch of serious mathematical thinking. Brief asides appear in shaded boxes while longer asides that stretch across multiple pages are offset by horizontal bars and italicized font. We hope you enjoy these breaks—we found ourselves looking forward to writing them.

Computing and Code
Truly mastering a subject requires experimenting with the ideas. Consequently, we have incorporated Matlab code to encourage and jump-start the experimentation process. While any programming language is appropriate, we chose Matlab for three reasons: (1) its matrix storage architecture and built-in commands are particularly suited to the large sparse link analysis matrices of this text, (2) among colleges and universities, Matlab is a market leader in mathematical software, and (3) it’s very user-friendly. The Matlab programs in this book are intended to be instruction, not production, code. We hope that, by playing with these programs, readers will be inspired to create new models and algorithms.

Acknowledgments
We thank Princeton University Press for supporting this book. We especially enjoyed working with Vickie Kearn, the Senior Editor at PUP. Vickie, thank you for displaying just the right combination of patience and gentle pressure. For a book with such timely material, you showed amazing faith in us. We thank all those who reviewed our manuscripts and made this a better book. Of course, we also thank our families and friends for their encouragement. Your pride in us is a powerful driving force.

Dedication
We dedicate this book to mentors and mentees worldwide. The energy, inspiration, and support that are sparked through such relationships can inspire great products. For us, it produced this book, but more importantly, a wonderful synergistic friendship.

Chapter One

Introduction to Web Search Engines

1.1 A SHORT HISTORY OF INFORMATION RETRIEVAL

Today we have museums for everything—the museum of baseball, of baseball players, of crazed fans of baseball players, museums for world wars, national battles, legal fights, and family feuds. While there’s no shortage of museums, we have yet to find a museum dedicated to this book’s field, a museum of information retrieval and its history. Of course, there are related museums, such as the Library Museum in Borås, Sweden, but none concentrating on information retrieval. Information retrieval¹ is the process of searching within a document collection for a particular information need (called a query). Although dominated by recent events following the invention of the computer, information retrieval actually has a long and glorious tradition. To honor that tradition, we propose the creation of a museum dedicated to its history. Like all museums, our museum of information retrieval contains some very interesting artifacts. Join us for a brief tour.

¹The boldface terms that appear throughout the book are also listed and defined in the Glossary, which begins on page 201.

The earliest document collections were recorded on the painted walls of caves. A cave dweller interested in searching a collection of cave paintings to answer a particular information query had to travel by foot, and stand, staring in front of each painting. Unfortunately, it’s hard to collect an artifact without being gruesome, so let’s fast forward a bit.

Before the invention of paper, ancient Romans and Greeks recorded information on papyrus rolls. Some papyrus artifacts from ancient Rome had tags attached to the rolls. These tags were an ancient form of today’s Post-it Note, and make an excellent addition to our museum. A tag contained a short summary of the rolled document, and was attached in order to save readers from unnecessarily unraveling a long irrelevant document. These abstracts also appeared in oral form. At the start of Greek plays in the fifth century B.C., the chorus recited an abstract of the ensuing action. While no actual classification scheme has survived from the artifacts of Greek and Roman libraries, we do know that another elementary information retrieval tool, the table of contents, first appeared in Greek scrolls from the second century B.C. Books were not invented until centuries later, when necessity required an alternative writing material. As the story goes, the Library of Pergamum (in what is now Turkey) threatened to overtake the celebrated Library of Alexandria as the best library in the world, claiming the largest collection of papyrus rolls. As a result, the Egyptians ceased the supply of papyrus to Pergamum, so the Pergamenians invented an alternative writing material, parchment, which is made from thin layers of animal skin. (In fact, the root of the word parchment comes from the word Pergamum.) Unlike papyrus, parchment did not roll easily, so scribes folded several sheets of parchment and sewed them into books. These books outlasted scrolls and were easier to use. Parchment books soon replaced the papyrus rolls.

The heights of writing, knowledge, and documentation of the Greek and Roman periods were contrasted with their lack during the Dark and Middle Ages. Precious few documents were produced during this time. Instead, most information was recorded orally. Document collections were recorded in the memory of a village’s best storyteller. Oral traditions carried in poems, songs, and prayers were passed from one generation to the next. One of the most legendary and lengthy tales is Beowulf, an epic about the adventures of a sixth-century Scandinavian warrior. The tale is believed to have originated in the seventh century and been passed from generation to generation through song. Minstrels often took poetic license, altering and adding verses as the centuries passed. An inquisitive child wishing to hear stories about the monster Grendel waited patiently while the master storyteller searched his memory to find just the right part of the story. Thus, the result of the child’s search for information was biased by the wisdom and judgement of the intermediary storyteller. Fortunately, the invention of paper, the best writing medium yet, superior to even parchment, brought renewed acceleration to the written record of information and collections of documents. In fact, Beowulf passed from oral to written form around A.D. 1000, a date over which scholars still debate. Later, monks, the possessors of treasured reading and writing skills, sat in scriptoriums working as scribes from sunrise to sunset. The scribes’ works were placed in medieval libraries, which initially were so small that they had no need for classification systems. Eventually the collections grew, and it became common practice to divide the holdings into three groups: theological works, classical authors of antiquity, and contemporary authors on the seven arts. Lists of holdings and tables of contents from classical books make nice museum artifacts from the medieval period.

Other document collections sprung up in a variety of fields. This dramatically accelerated with the re-invention of the printing press by Johann Gutenberg in 1450. The wealthy proudly boasted of their private libraries, and public libraries were instituted in America in the 1700s at the prompting of Benjamin Franklin. As library collections grew and became publicly accessible, the desire for focused search became more acute. Hierarchical classification systems were used to group documents on like subjects together. The first use of a hierarchical organization system is attributed to the Roman author Valerius Maximus, who used it in A.D. 30 to organize the topics in his book, Factorum ac dictorum memorabilium libri IX (Nine Books of Memorable Deeds and Sayings). Despite these rudimentary organization systems, word of mouth and the advice of a librarian were the best means of obtaining accurate quality information for a search. Of course, document collections and their organization expanded beyond the limits of even the best librarian’s memory. More orderly ways of maintaining records of a collection’s holdings were devised. Notable artifacts that belong in our information retrieval museum are a few lists of individual library holdings, sorted by title and also author, as well as examples of the Dewey decimal system (1872), the card catalog (early 1900s), microfilm (1930s), and the MARC (MAchine Readable Cataloging) system (1960s).

These inventions were progress, yet still search was not completely in the hands of the information seeker. It took the invention of the digital computer (1940s and 1950s) and the subsequent inventions of computerized search systems to move toward that goal. The first computerized search systems used special syntax to automatically retrieve book and article information related to a user’s query. Unfortunately, the cumbersome syntax kept search largely in the domain of librarians trained on the systems. An early representative of computerized search such as the Cornell SMART system (1960s) [146] deserves a place in our museum of information retrieval.

In 1989 the storage, access, and searching of document collections was revolutionized by an invention named the World Wide Web by its founder Tim Berners-Lee [79]. Of course, our museum must include artifacts from this revolution such as a webpage, some HTML, and a hyperlink or two. The invention of linked document collections was truly original at this time, despite the fact that Vannevar Bush, once Director of the Office of Scientific Research and Development, foreshadowed its coming in his famous 1945 essay, “As We May Think” [43]. In that essay, he describes the memex, a futuristic machine (with shocking similarity to today’s PC and Web) that mirrors the cognitive processes of humans by leaving “trails of association” throughout document collections. Four decades of progress later, remnants of Bush’s memex formed the skeleton of Berners-Lee’s Web. A drawing of the memex (Figure 1.1) by a graphic artist and approved by Bush was included in LIFE magazine’s 1945 publishing of Bush’s prophetic article.

Figure 1.1 Drawing of Vannevar Bush’s memex appearing in LIFE. Original caption read: “Memex in the form of a desk would instantly bring files and material on any subject to the operator’s fingertips. Slanting translucent screens magnify supermicrofilm filed by code numbers. At left is a mechanism which automatically photographs longhand notes, pictures, and letters, then files them in the desk for future reference.”

The World Wide Web became the ultimate signal of the dominance of the Information Age and the death of the Industrial Age. Yet despite the revolution in information storage and access ushered in by the Web, users initiating web searches found themselves floundering. They were looking for the proverbial needle in an enormous, ever-growing information haystack. In fact, users felt much like the men in Jorge Luis Borges’ 1941 short story [35], “The Library of Babel”, which describes an imaginary, infinite library.

    When it was proclaimed that the Library contained all books, the first impression was one of extravagant happiness. All men felt themselves to be the masters of an intact and secret treasure. There was no personal or world problem whose eloquent solution did not exist in some hexagon.

    . . . As was natural, this inordinate hope was followed by an excessive depression. The certitude that some shelf in some hexagon held precious books and that these precious books were inaccessible seemed almost intolerable.

Much of the information in the Library of the Web, like that in the fictitious Library of Babel, remained inaccessible. In fact, early web search engines did little to ease user frustration; search could be conducted by sorting through hierarchies of topics on Yahoo, or by sifting through the many (often thousands of) webpages returned by the search engine, clicking on pages to personally determine which were most relevant to the query. Some users resorted to the earliest search techniques used by ancient queriers—word of mouth and expert advice. They learned about valuable websites from friends and linked to sites recommended by colleagues who had already put in hours of search effort.

All this changed in 1998 when link analysis hit the information retrieval scene [40, 106]. The most successful search engines began using link analysis, a technique that exploited the additional information inherent in the hyperlink structure of the Web, to improve the quality of search results. Web search improved dramatically, and web searchers religiously used and promoted their favorite engines like Google and AltaVista. In fact, in 2004 many web surfers freely admit their obsession with, dependence on, and addiction to today’s search engines. Below we include the comments [117] of a few Google fans to convey the joy caused by the increased accessibility of the Library of the Web made possible by the link analysis engines. Incidentally, in May 2004 Google held the largest share of the search market with 37% of searchers using Google, followed by 27% using the Yahoo conglomerate, which includes AltaVista, AlltheWeb, and Overture.²

• “It’s not my homepage, but it might as well be. I use it to ego-surf. I use it to read the news. Anytime I want to find out anything, I use it.”—Matt Groening, creator and executive producer, The Simpsons

• “I can’t imagine life without Google News. Thousands of sources from around the world ensure anyone with an Internet connection can stay informed. The diversity of viewpoints available is staggering.”—Michael Powell, chair, Federal Communications Commission

• “Google is my rapid-response research assistant. On the run-up to a deadline, I may use it to check the spelling of a foreign name, to acquire an image of a particular piece of military hardware, to find the exact quote of a public figure, check a stat, translate a phrase, or research the background of a particular corporation. It’s the Swiss Army knife of information retrieval.”—Garry Trudeau, cartoonist and creator, Doonesbury

²These market share statistics were compiled by comScore, a company that counted the number of searches done by U.S. surfers in May 2004 using the major search engines. See the article at http://searchenginewatch.com/reports/article.php/2156431.

Nearly all major search engines now combine link analysis scores, similar to those used by Google, with more traditional information retrieval scores. In this book, we record the history of one aspect of web information retrieval. That aspect is the link analysis or ranking algorithms underlying several of today’s most popular and successful search engines, including Google and Teoma. Incidentally, we’ll add the PageRank link analysis algorithm [40] used by Google (see Chapters 4-10) and the HITS algorithm [106] used by Teoma (see Chapter 11) to our museum of information retrieval.

1.2 AN OVERVIEW OF TRADITIONAL INFORMATION RETRIEVAL

To set the stage for the exciting developments in link analysis to come in later chapters, we begin our story by distinguishing web information retrieval from traditional information retrieval. Web information retrieval is search within the world’s largest and linked document collection, whereas traditional information retrieval is search within smaller, more controlled, nonlinked collections. The traditional nonlinked collections existed before the birth of the Web and still exist today. Searching within a university library’s collection of books or within a professor’s reserve of slides for an art history course—these are examples of traditional information retrieval.

These document collections are nonlinked, mostly static, and are organized and categorized by specialists such as librarians and journal editors. These documents are stored in physical form as books, journals, and artwork as well as electronically on microfiche, CDs, and webpages. However, the mechanisms for searching for items in the collections are now almost all computerized. These computerized mechanisms are referred to as search engines, virtual machines created by software that enables them to sort through virtual file folders to find relevant documents. There are three basic computer-aided techniques for searching traditional information retrieval collections: Boolean models, vector space models, and probabilistic models [14]. These search models, which were developed in the 1960s, have had decades to grow, mesh, and morph into new search models. In fact, as of June 2000, there were at least 3,500 different search engines (including the newer web engines) [37], which means that there are possibly 3,500 different search techniques. Nevertheless, since most search engines rely on one or more of the three basic models, we describe these in turn.

1.2.1 Boolean Search Engines

The Boolean model of information retrieval, one of the earliest and simplest retrieval methods, uses the notion of exact matching to match documents to a user query. Its more refined descendants are still used by most libraries. The adjective Boolean refers to the use of Boolean algebra, whereby words are logically combined with the Boolean operators AND, OR, and NOT. For example, the Boolean AND of two logical statements x and y means that both x AND y must be satisfied, while the Boolean OR of these two statements means that at least one of these statements must be satisfied. Any number of logical statements can be combined using the three Boolean operators. The Boolean model of information retrieval operates by considering which keywords are present or absent in a document. Thus, a document is judged as relevant or irrelevant; there is no concept of a partial match between documents and queries. This can lead to poor performance [14]. More advanced fuzzy set theoretic techniques try to remedy this black-white Boolean logic by introducing shades of gray. For example, a title search for car AND maintenance on a Boolean engine causes the virtual machine to return all documents that use both words in the title. A relevant document entitled “Automobile Maintenance” will not be returned. Fuzzy Boolean engines use fuzzy logic to categorize this document as somewhat relevant and return it to the user.

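To make the exact-matching mechanics concrete, here is a small illustrative sketch in Matlab, the language used for the code in this book. The keyword list, the four document titles, and the incidence matrix are invented for illustration only.

    % Hypothetical toy data: rows of A are keywords, columns are documents.
    % A(i,j) = 1 means keyword i appears in the title of document j.
    keywords = {'car', 'maintenance', 'automobile', 'repair'};
    titles   = {'Car Maintenance', 'Automobile Maintenance', ...
                'Car Repair', 'Gardening'};
    A = [ 1 0 1 0 ;    % car
          1 1 0 0 ;    % maintenance
          0 1 0 0 ;    % automobile
          0 0 1 0 ];   % repair

    % Query: car AND maintenance -- both keywords must be present.
    hits = A(1,:) & A(2,:);    % logical AND of the two keyword rows
    titles(hits)               % returns only 'Car Maintenance'

    % Query: car OR automobile -- at least one keyword must be present.
    hits = A(1,:) | A(3,:);
    titles(hits)               % 'Car Maintenance', 'Automobile Maintenance', 'Car Repair'

Notice that the AND query misses the relevant title “Automobile Maintenance”; this is precisely the synonymy problem discussed next.
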
The car maintenance query example introduces the main drawbacks of Boolean search engines; they fall prey to two of the most common information retrieval problems, synonymy and polysemy. Synonymy refers to multiple words having the same meaning, such as car and automobile. A standard Boolean engine cannot return semantically related documents whose keywords were not included in the original query. Polysemy refers to words with multiple meanings. For example, when a user types bank as his or her query, does he or she mean a financial center, a slope on a hill, a shot in pool, or a collection of objects [24]? The problem of polysemy can cause many documents that are irrelevant to the user’s actual intended query meaning to be retrieved. Many Boolean search engines also require that the user be familiar with Boolean operators and the engine’s specialized syntax. For example, to find information about the phrase iron curtain, many engines require quotation marks around the phrase, which tell the search engine that the entire phrase should be searched as if it were just one keyword. A user who forgets this syntax requirement would be surprised to find retrieved documents about interior decorating and mining for iron ore.

Nevertheless, variants of the Boolean model do form the basis for many search engines. There are several reasons for their prevalence. First, creating and programming a Boolean engine is straightforward. Second, queries can be processed quickly; a quick scan through the keyword files for the documents can be executed in parallel. Third, Boolean models scale well to very large document collections. Accommodating a growing collection is easy. The programming remains simple; merely the storage and parallel processing capabilities need to grow. References [14, 75, 107] all contain chapters with excellent introductions to the Boolean model and its extensions.

1.2.2 Vector Space Model Search Engines

Another information retrieval technique uses the vector space model [147], developed by Gerard Salton in the early 1960s, to sidestep some of the information retrieval problems mentioned above. Vector space models transform textual data into numeric vectors and matrices, then employ matrix analysis³ techniques to discover key features and connections in the document collection. Some advanced vector space models address the common text analysis problems of synonymy and polysemy. Advanced vector space models, such as LSI [64] (Latent Semantic Indexing), can access the hidden semantic structure in a document collection. For example, an LSI engine processing the query car will return documents whose keywords are related semantically (in meaning), e.g., automobile. This ability to reveal hidden semantic meanings makes vector space models, such as LSI, very powerful information retrieval tools.

³Mathematical terms are defined in Chapter 15, the Mathematics Chapter, and are italicized throughout.

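As a rough sketch of this vectors-and-matrices view, the following Matlab fragment scores a query against a tiny invented term-document matrix using cosine similarity; the terms, the counts, and the weighting (raw term frequencies) are illustrative assumptions, not the recipe of any particular engine.

    % Toy term-document matrix: rows are terms, columns are documents.
    % Entries are raw term counts; real engines use more careful weightings.
    terms = {'car', 'automobile', 'maintenance', 'engine'};
    A = [ 2 0 1 ;    % car
          0 3 0 ;    % automobile
          1 1 0 ;    % maintenance
          0 1 2 ];   % engine

    q = [1 0 1 0]';  % query vector for "car maintenance"

    % Cosine similarity between the query and each document column.
    scores = (A'*q) ./ (sqrt(sum(A.^2))' * norm(q));

    [ranked, order] = sort(scores, 'descend');  % ordered list of documents

Document 1, which contains both query words, scores highest, while document 2, which says automobile instead of car, scores poorly; a plain term-matching vector space model still suffers from synonymy, which is the gap that LSI is designed to close.
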
Two additional advantages of the vector space model are relevance scoring and relevance feedback. The vector space model allows documents to partially match a query by assigning each document a number between 0 and 1, which can be interpreted as the likelihood of relevance to the query. The group of retrieved documents can then be sorted by degree of relevancy, a luxury not possible with the simple Boolean model. Thus, vector space models return documents in an ordered list, sorted according to a relevance score. The first document returned is judged to be most relevant to the user’s query. Some vector space search engines report the relevance score as a relevancy percentage. For example, a 97% next to a document means that the document is judged as 97% relevant to the user’s query. (See the Federal Communications Commission’s search engine, http://www.fcc.gov/searchtools.html, which is powered by Inktomi, once known to use the vector space model. Enter a query such as taxes and notice the relevancy score reported on the right side.) Relevance feedback, the other advantage of the vector space model, is an information retrieval tuning technique that is a natural addition to the vector space model. Relevance feedback allows the user to select a subset of the retrieved documents that are useful. The query is then resubmitted with this additional relevance feedback information, and a revised set of generally more useful documents is retrieved.

A drawback of the vector space model is its computational expense. At query time, distance measures (also known as similarity measures) must be computed between each document and the query. And advanced models, such as LSI, require an expensive singular value decomposition [82, 127] of a large matrix that numerically represents the entire document collection. As the collection grows, the expense of this matrix decomposition becomes prohibitive. This computational expense also exposes another drawback—vector space models do not scale well. Their success is limited to small document collections.

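The singular value decomposition mentioned above is both the engine behind LSI and the source of this expense. The sketch below is a minimal illustration using the same toy matrix and query as before, with an arbitrarily chosen rank of 2; with only three invented documents it cannot really exhibit the car/automobile effect, which emerges from co-occurrence patterns in large collections.

    % Rank-k (truncated) SVD of the toy term-document matrix A.
    A = [ 2 0 1 ; 0 3 0 ; 1 1 0 ; 0 1 2 ];   % same toy data as above
    q = [1 0 1 0]';
    k = 2;                                   % number of latent dimensions kept

    [U, S, V] = svd(A);                      % full SVD; the expensive step
    Uk = U(:,1:k);  Sk = S(1:k,1:k);  Vk = V(:,1:k);

    % Represent documents and the query in the k-dimensional latent space,
    % then rank by cosine similarity there instead of in raw term space.
    docs_k  = Sk * Vk';                      % k x (number of documents)
    query_k = Uk' * q;                       % k x 1
    scores_lsi = (docs_k' * query_k) ./ (sqrt(sum(docs_k.^2))' * norm(query_k));

For a realistic collection, A is enormous, and computing even a truncated decomposition of it is precisely the cost referred to in the preceding paragraph.
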
Understanding Search Engines

The informative little book by Michael Berry and Murray Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval [23], provides an excellent explanation of vector space models, especially LSI, and contains several examples and sample code. Our mathematical readers will enjoy this book and its application of linear algebra algorithms in the context of traditional information retrieval.

1.2.3 Probabilistic Model Search Engines

Probabilistic models attempt to estimate the probability that the user will find a particular document relevant. Retrieved documents are ranked by their odds of relevance (the probability that the document is relevant to the query divided by the probability that the document is not relevant to the query). The probabilistic model operates recursively and requires that the underlying algorithm guess at initial parameters and then iteratively try to improve this initial guess to obtain a final ranking of relevancy probabilities.

Unfortunately, probabilistic models can be very hard to build and program. Their complexity grows quickly, deterring many researchers and limiting their scalability. Probabilistic models also require several unrealistic simplifying assumptions, such as independence between terms as well as documents. Of course, the independence assumption is restrictive in most cases. For instance, in this document the most likely word to follow information is the word retrieval, but the independence assumption judges each word as equally likely to follow the word information. On the other hand, the probabilistic framework can naturally accommodate a priori preferences, and thus, these models do offer promise of tailoring search results to the preferences of individual users. For example, a user’s query history can be incorporated into the probabilistic model’s initial guess, which generates better query results than a democratic guess.

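As a very rough sketch of the odds-of-relevance idea, the fragment below ranks three toy documents under the term-independence assumption criticized above, in the spirit of the classical binary independence model; all of the probability estimates are invented, whereas a real engine would estimate and refine them iteratively.

    % Toy odds-of-relevance ranking under term independence (invented numbers).
    % p(i): estimated probability that query term i appears in a relevant document.
    % s(i): estimated probability that it appears in a nonrelevant document.
    p = [0.8 0.6];            % query terms: "information", "retrieval"
    s = [0.3 0.1];

    x = [ 1 1 ;               % document 1 contains both query terms
          1 0 ;               % document 2 contains only "information"
          0 1 ];              % document 3 contains only "retrieval"

    % Under independence, the odds of relevance factor into per-term ratios.
    odds = prod( (p./s).^x .* ((1-p)./(1-s)).^(1-x), 2 );

    [ranked, order] = sort(odds, 'descend');   % document 1 comes out on top

Tailoring the initial estimates p and s to an individual user’s query history is one way the a priori preferences mentioned above could enter the model.
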
1.2.4 Meta-search Engines

There’s actually a fourth model for traditional search engines, meta-search engines, which combines the three classic models. Meta-search engines are based on the principle that while one search engine is good, two (or more) are better. One search engine may be great at a certain task, w
