`.884
`
`j
`···_I
`
`
`
`
Editor-in-Chief: Michael Hirsch
Acquisitions Editor: Matt Goldstein
Editorial Assistant: Sarah Milmore
Managing Editor: Jeff Holcomb
Online Product Manager: Bethany Tidd
Director of Marketing: Margaret Waples
Marketing Manager: Erin Davis
Marketing Coordinator: Kathryn Ferranti
Senior Manufacturing Buyer: Carol Melville
Text Design, Composition, and Illustrations: W. Bruce Croft, Donald Metzler, and Trevor Strohman
Art Direction: Linda Knowles
Cover Design: Elena Sidorova
Cover Image: © Peter Gudella / Shutterstock
`
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.
`
`The programs and applications presented in this book have been included for their
`instructional value. They have been tested with care, but are not guaranteed for any
`particular purpose. The publisher does not offer any warranties or representations, nor
`does it accept any liabilities with respect to the programs or applications.
`
`Library of Congress Cataloging-in-Publication Data available upon request
`
Copyright © 2010 Pearson Education, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America. For information on obtaining permission for use of material in this work, please submit a written request to Pearson Education, Inc., Rights and Contracts Department, 501 Boylston Street, Suite 900, Boston, MA 02116, fax (617) 671-3447, or online at http://www.pearsoned.com/legal/permissions.htm.
`
`ISBN-13: 978-0-13-607224-9
`ISBN-10: 0-13-607224-0
`
`1 2 3 4 5 6 7 8 9 1 0- HP -13 12 11 10 09
`
`
`
`
`2 Architecture of a Search Engine
`
We may have more specific goals, too, but usually these fall into the categories of effectiveness or efficiency (or both). For instance, the collection of documents we want to search may be changing; making sure that the search engine immediately reacts to changes in documents is both an effectiveness issue and an efficiency issue.
The architecture of a search engine is determined by these two requirements. Because we want an efficient system, search engines employ specialized data structures that are optimized for fast retrieval. Because we want high-quality results, search engines carefully process text and store text statistics that help improve the relevance of results.
`Many of the components we discuss in the following sections have been used
`for decades, and this general design has been shown to be a useful compromise
`between the competing goals of effective and efficient retrieval. In later chapters,
`we will discuss these components in more detail.
`
`2.2 Basic Building Blocks
`
Search engine components support two major functions, which we call the indexing process and the query process. The indexing process builds the structures that enable searching, and the query process uses those structures and a person's query to produce a ranked list of documents. Figure 2.1 shows the high-level "building blocks" of the indexing process. These major components are text acquisition, text transformation, and index creation.
The task of the text acquisition component is to identify and make available the documents that will be searched. Although in some cases this will involve simply using an existing collection, text acquisition will more often require building a collection by crawling or scanning the Web, a corporate intranet, a desktop, or other sources of information. In addition to passing documents to the next component in the indexing process, the text acquisition component creates a document data store, which contains the text and metadata for all the documents. Metadata is information about a document that is not part of the text content, such as the document type (e.g., email or web page), document structure, and other features, such as document length.
The text transformation component transforms documents into index terms or features. Index terms, as the name implies, are the parts of a document that are stored in the index and used in searching. The simplest index term is a word, but not every word may be used for searching. A "feature" is more often used in
`
`
`
`
[Figure: input documents (email, web pages, news articles, memos, letters) flow through text acquisition, text transformation, and index creation into the index; a document data store holds the acquired text.]
Fig. 2.1. The indexing process
`
`the field of machine learning to refer to a part of a text document that is used to
`represent its content, which also describes an index term. Examples of other types
`of index terms or features are phrases, names of people, dates, and links in a web
`page. Index terms are sometimes simply referred to as "terms." The set of all the
`terms that are indexed for a document collection is called the index vocabulary.
`The index creation component takes the output of the text transformation
`component and creates the indexes or data structures that enable fast searching.
`Given the large number of documents in many search applications, index creation
`must be efficient, both in terms of time and space. Indexes must also be able to be
efficiently updated when new documents are acquired. Inverted indexes, or sometimes inverted files, are by far the most common form of index used by search engines. An inverted index, very simply, contains a list for every index term of the documents that contain that index term. It is inverted in the sense of being the opposite of a document file that lists, for every document, the index terms it contains. There are many variations of inverted indexes, and the particular form of index used is one of the most important aspects of a search engine.
Figure 2.2 shows the building blocks of the query process. The major components are user interaction, ranking, and evaluation.

[Figure: the user interaction, ranking, and evaluation components of the query process, working with the index and log data.]
Fig. 2.2. The query process

The user interaction component provides the interface between the person doing the searching and the search engine. One task for this component is accepting the user's query and transforming it into index terms. Another task is to take the ranked list of documents from the search engine and organize it into the results shown to the user. This includes, for example, generating the snippets used to
summarize documents. The document data store is one of the sources of information used in generating the results. Finally, this component also provides a range
`of techniques for refining the query so that it better represents the information
`need.
The ranking component is the core of the search engine. It takes the transformed query from the user interaction component and generates a ranked list of documents using scores based on a retrieval model. Ranking must be both efficient, since many queries may need to be processed in a short time, and effective, since the quality of the ranking determines whether the search engine accomplishes the goal of finding relevant information. The efficiency of ranking depends
`on the indexes, and the effectiveness depends on the retrieval model.
`The task of the evaluation component is to measure and monitor effectiveness
`and efficiency. An important part of that is to record and analyze user behavior
`using log data. The results of evaluation are used to tune and improve the ranking
component. Most of the evaluation component is not part of the online search engine, apart from logging user and system data. Evaluation is primarily an offline activity, but it is a critical part of any search application.
`
`
`
`
`2.3 Breaking It Down
`
We now look in more detail at the components of each of the basic building blocks. Not all of these components will be part of every search engine, but together they cover what we consider to be the most important functions for a broad range of search applications.
`
`2.3.1 Text Acquisition
`
`Crawler
`
In many applications, the crawler component has the primary responsibility for
`identifying and acquiring documents for the search engine. There are a number of
`different types of crawlers, but the most common is the general web crawler. A web
`crawler is designed to follow the links on web pages to discover and download new
`pages. Although this sounds deceptively simple, there are significant challenges in
`designing a web crawler that can efficiently handle the huge volume of new pages
`on the Web, while at the same time ensuring that pages that may have changed
`since the last time a crawler visited a site are kept "fresh" for the search engine. A
`web crawler can be restricted to a single site, such as a university, as the basis for
`site search. Focused, or topical, web crawlers use classification techniques to restrict
`the pages that are visited to those that are likely to be about a specific topic. This
`type of crawler may be used by a vertical or topical search application, such as a
`search engine that provides access to medical information on web pages.
For enterprise search, the crawler is adapted to discover and update all documents and web pages related to a company's operation. An enterprise document crawler follows links to discover both external and internal (i.e., restricted to the corporate intranet) pages, but also must scan both corporate and personal directories to identify email, word processing documents, presentations, database
`records, and other company information. Document crawlers are also used for
`desktop search, although in this case only the user's personal directories need to
`be scanned.
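The link-following behavior described above amounts to a breadth-first traversal of a link graph. The sketch below is illustrative only, not the book's implementation: the in-memory `pages` dictionary and the `get_links` callback are hypothetical stand-ins for fetching real pages and extracting their links.

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first traversal of a link graph, the way a web crawler
    follows links to discover new pages. get_links(url) stands in for
    fetching a page and extracting its outgoing links."""
    frontier = deque([seed])   # URLs waiting to be visited
    visited = []               # pages in discovery order
    seen = {seed}              # avoid re-queueing duplicates
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# A tiny in-memory "web" for illustration:
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
}
order = crawl("a.html", lambda u: pages.get(u, []))
```

A real crawler adds politeness delays, freshness checks, and persistent frontier storage on top of this basic loop.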
`
`Feeds
`
Document feeds are a mechanism for accessing a real-time stream of documents. For example, a news feed is a constant stream of news stories and updates. In contrast to a crawler, which must discover new documents, a search engine acquires
`
`
[Sidebar] Text Acquisition: Crawler, Feeds, Conversion, Document data store
`
`
`
`18
`
`2 Architecture of a Search Engine
`
new documents from a feed simply by monitoring it. RSS2 is a common standard used for web feeds for content such as news, blogs, or video. An RSS "reader" is used to subscribe to RSS feeds, which are formatted using XML.3 XML is a language for describing data formats, similar to HTML.4 The reader monitors those feeds and provides new content when it arrives. Radio and television feeds are also used in some search applications, where the "documents" contain automatically segmented audio and video streams, together with associated text from closed captions or speech recognition.
`
`Conversion
`
The documents found by a crawler or provided by a feed are rarely in plain text. Instead, they come in a variety of formats, such as HTML, XML, Adobe PDF, Microsoft Word®, Microsoft PowerPoint®, and so on. Most search engines require that these documents be converted into a consistent text plus metadata format. In this conversion, the control sequences and non-content data associated with a particular format are either removed or recorded as metadata. In the case of HTML and XML, much of this process can be described as part of the text transformation component. For other formats, the conversion process is a basic step that prepares the document for further processing. PDF documents, for example, must be converted to text. Various utilities are available that perform this conversion, with varying degrees of accuracy. Similarly, utilities are available to convert the various Microsoft Office® formats into text.
Another common conversion problem comes from the way text is encoded in a document. ASCII5 is a common standard single-byte character encoding scheme used for text. ASCII uses either 7 or 8 bits (extended ASCII) to represent either 128 or 256 possible characters. Some languages, however, such as Chinese, have many more characters than English and use a number of other encoding schemes. Unicode is a standard encoding scheme that uses 16 bits (typically) to represent most of the world's languages. Any application that deals with documents in different languages has to ensure that they are converted into a consistent encoding scheme before further processing.
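A minimal sketch of this normalization step, assuming a list of candidate encodings is known in advance; real systems detect the encoding from document metadata or byte statistics. This is an illustration, not the book's code.

```python
def to_unicode(raw_bytes, encodings=("utf-8", "gb18030", "latin-1")):
    """Try a list of likely encodings and return the text decoded into
    a single consistent Unicode representation."""
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts any byte sequence, so with the defaults above we
    # never reach this fallback; it guards against a custom encoding list.
    return raw_bytes.decode("utf-8", errors="replace")

# The same characters can arrive in different encodings...
utf8_bytes = "中文".encode("utf-8")
gbk_bytes = "中文".encode("gb18030")
# ...but both decode to the same Unicode string.
```

Once every document is in one internal representation, later components (tokenizing, indexing) never have to reason about byte-level encodings.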
`
2 RSS actually refers to a family of standards with similar names (and the same initials), such as Really Simple Syndication or Rich Site Summary.
`3 eXtensible Markup Language
`4 HyperText Markup Language
`5 American Standard Code for Information Interchange
`
`
`
`
`
`Document data store
`
The document data store is a database used to manage large numbers of documents and the structured data that is associated with them. The document contents are typically stored in compressed form for efficiency. The structured data consists of document metadata and other information extracted from the documents, such as links and anchor text (the text associated with a link). A relational database system can be used to store the documents and metadata. Some applications, however, use a simpler, more efficient storage system to provide very fast retrieval times for very large document stores.
Although the original documents are available on the Web, in the enterprise database, the document data store is necessary to provide fast access to the document contents for a range of search engine components. Generating summaries of retrieved documents, for example, would take far too long if the search engine had to access the original documents and reprocess them.
`
`2.3.2 Text Transformation
`
`Parser
`
The parsing component is responsible for processing the sequence of text tokens in the document to recognize structural elements such as titles, figures, links, and headings. Tokenizing the text is an important first step in this process. In many cases, tokens are the same as words. Both document and query text must be transformed into tokens in the same manner so that they can be easily compared. There are a number of decisions that potentially affect retrieval that make tokenizing non-trivial. For example, a simple definition for tokens could be strings of alphanumeric characters that are separated by spaces. This does not tell us, however, how to deal with special characters such as capital letters, hyphens, and apostrophes. Should we treat "apple" the same as "Apple"? Is "on-line" two words or one word? Should the apostrophe in "O'Connor" be treated the same as the one in "owner's"? In some languages, tokenizing gets even more interesting. Chinese, for example, has no obvious word separator like a space in English.
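These decisions can be made concrete with a deliberately simple sketch (not the book's tokenizer). Every policy baked into it, lowercasing, splitting on hyphens, dropping apostrophes, is exactly one of the choices discussed above, and each would change what can be retrieved.

```python
import re

def tokenize(text):
    """A deliberately simple tokenizer: lowercase the text and split on
    runs of characters that are not letters or digits. This folds case
    ("Apple" -> "apple"), splits hyphenated forms ("on-line" -> "on",
    "line"), and breaks words at apostrophes ("O'Connor" -> "o",
    "connor")."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

tokens = tokenize("Is 'on-line' two words? Ask O'Connor.")
```

Because document and query text must be tokenized the same way, this one function would be shared by the indexing process and the query process.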
Document structure is often specified by a markup language such as HTML or XML. HTML is the default language used for specifying the structure of web pages. XML has much more flexibility and is used as a data interchange format for many applications. The document parser uses knowledge of the syntax of the markup language to identify the structure.
`
[Sidebar] Text Transformation: Parser, Stopping, Stemming, Link Analysis, Information Extraction, Classifier
`
`
`
`-
`
`20
`
`2 Architecture of a Search Engine
`
`Both HTML and XML use tags to define document elements. For example,
`<h2> Search </h2> defines "Search" as a second-level heading in HTML. Tags and
`other control sequences must be treated appropriately when tokenizing. Other
`types of documents, such as email and presentations, have a specific syntax and
methods for specifying structure, but much of this may be removed or simplified by the conversion component.
`
`Stopping
`
The stopping component has the simple task of removing common words from the stream of tokens that become index terms. The most common words are typically function words that help form sentence structure but contribute little on their own to the description of the topics covered by the text. Examples are "the", "of", "to", and "for". Because they are so common, removing them can reduce the size of the indexes considerably. Depending on the retrieval model that is used as the basis of the ranking, removing these words usually has no impact on the search engine's effectiveness, and may even improve it somewhat. Despite these potential advantages, it can be difficult to decide how many words to include on the stopword list. Some stopword lists used in research contain hundreds of words. The problem with using such lists is that it becomes impossible to search with queries like "to be or not to be" or "down under". To avoid this, search applications may use very small stopword lists (perhaps just containing "the") when processing document text, but then use longer lists for the default processing of query text.
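The stopping step itself is a one-line filter over the token stream; the stopword list below is a tiny illustrative assumption, not a recommended list.

```python
# A toy stopword list for illustration; research lists may contain
# hundreds of words, and document-side lists may be much shorter.
STOPWORDS = {"the", "of", "to", "for", "a", "an", "and", "or", "in"}

def stop(tokens, stopwords=STOPWORDS):
    """Remove common function words from a token stream before the
    remaining tokens become index terms."""
    return [t for t in tokens if t not in stopwords]

terms = stop(["the", "architecture", "of", "a", "search", "engine"])
```

The asymmetry described in the text falls out naturally: pass a near-empty set when indexing document text and a longer set when processing queries.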
`
`Stemming
`
Stemming is another word-level transformation. The task of the stemming component (or stemmer) is to group words that are derived from a common stem. Grouping "fish", "fishes", and "fishing" is one example. By replacing each member of a group with one designated word (for example, the shortest, which in this case is "fish"), we increase the likelihood that words used in queries and documents will match. Stemming, in fact, generally produces small improvements in ranking effectiveness. Similar to stopping, stemming can be done aggressively, conservatively, or not at all. Aggressive stemming can cause search problems. It may not be appropriate, for example, to retrieve documents about different varieties of fish in response to the query "fishing". Some search applications use more conservative stemming, such as simply identifying plural forms using the letter "s", or they may
`
`
`
`do no stemming when processing document text and focus on adding appropriate
`word variants to the query.
Some languages, such as Arabic, have more complicated morphology than English, and stemming is consequently more important. An effective stemming component in Arabic has a huge impact on search effectiveness. In contrast, there is little word variation in other languages, such as Chinese, and for these languages stemming is not effective.
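The conservative plural-"s" stemming mentioned above might be sketched as follows. The exception list is a toy assumption for illustration; production stemmers such as the Porter stemmer handle many more suffixes and exceptions.

```python
def conservative_stem(word):
    """A very conservative stemmer that only strips a trailing plural
    's', leaving short words and words ending in 'ss', 'us', or 'is'
    (like "glass" or "basis") untouched."""
    if (word.endswith("s")
            and not word.endswith(("ss", "us", "is"))
            and len(word) > 3):
        return word[:-1]
    return word
```

Because it touches so few words, a stemmer like this trades recall for safety, matching the strategy of doing little or no stemming at index time and expanding word variants on the query side instead.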
`
`Link extraction and analysis
`
`Links and the corresponding anchor text in web pages can readily be identified
`and extracted during document parsing. Extraction means that this information
`is recorded in the document data store, and can be indexed separately from the
`general text content. Web search engines make extensive use of this information
`through link analysis algorithms such as PageRank (Brin & Page, 1998). Link
`analysis provides the search engine with a rating of the popularity, and to some
`extent, the authority of a page (in other words, how important it is). Anchor text,
`which is the clickable text of a web link, can be used to enhance the text content
`of a page that the link points to. These two factors can significantly improve the
`effectiveness of web search for some types of queries.
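As a rough illustration of link analysis, here is a minimal power-iteration sketch in the spirit of PageRank. It follows the usual textbook formulation (damping factor, even spreading of dangling-page mass), not a specific implementation from this chapter, and the link graph is a made-up example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration sketch of PageRank: on each pass, a page's score
    is spread evenly over the pages it links to, damped toward a
    uniform distribution."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:                 # dangling page: spread everywhere
                for q in pages:
                    new_rank[q] += damping * rank[p] / len(pages)
            else:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
        rank = new_rank
    return rank

# "a" is linked to by both "b" and "c", so it ends up most popular.
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

The resulting scores are the "popularity rating" the text describes: a query-independent signal that the ranking component can combine with text-based evidence.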
`
`Information extraction
`
Information extraction is used to identify index terms that are more complex than single words. This may be as simple as words in bold or words in headings, but in general may require significant additional computation. Extracting syntactic features such as noun phrases, for example, requires some form of syntactic analysis or part-of-speech tagging. Research in this area has focused on techniques for extracting features with specific semantic content, such as named entity recognizers, which can reliably identify information such as person names, company names, dates, and locations.
`
`Classifier
`
`The classifier component identifies class-related metadata for documents or parts
`of documents. This covers a range of functions that are often described separately.
`Classification techniques assign predefined class labels to documents. These labels
`typically represent topical categories such as "sports", "politics", or "business". Two
`
`
`
[Sidebar] Index Creation: Document Statistics, Weighting, Inversion, Distribution
`
`.. #&
`
`- - - -
`
`. . . . .
`
`. bQQ __ ;
`
`22
`
`2 Architecture of a Search Engine
`
important examples of other types of classification are identifying documents as spam, and identifying the non-content parts of documents, such as advertising.
`Clustering techniques are used to group related documents without predefined
`categories. These document groups can be used in a variety of ways during ranking
`or user interaction.
`
`2.3.3 Index Creation
`
`Document statistics
`
The task of the document statistics component is simply to gather and record statistical information about words, features, and documents. This information is used by the ranking component to compute scores for documents. The types of data generally required are the counts of index term occurrences (both words and more complex features) in individual documents, the positions in the documents where the index terms occurred, the counts of occurrences over groups of documents (such as all documents labeled "sports" or the entire collection of documents), and the lengths of documents in terms of the number of tokens. The actual data required is determined by the retrieval model and associated ranking algorithm. The document statistics are stored in lookup tables, which are data structures designed for fast retrieval.
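The kinds of data listed above (per-document counts, positions, collection-wide counts, document lengths) can be gathered in a single pass over the token streams. This sketch is illustrative; a real component would write these tables to disk rather than hold them in dictionaries.

```python
from collections import defaultdict

def gather_statistics(docs):
    """One pass over tokenized documents, recording per-document term
    counts and positions, collection-wide term counts, and document
    lengths, the raw data ranking algorithms later consume."""
    term_counts = defaultdict(lambda: defaultdict(int))  # term -> doc -> count
    positions = defaultdict(lambda: defaultdict(list))   # term -> doc -> [pos]
    collection_counts = defaultdict(int)                 # term -> total count
    doc_lengths = {}                                     # doc -> token count
    for doc_id, tokens in docs.items():
        doc_lengths[doc_id] = len(tokens)
        for position, term in enumerate(tokens):
            term_counts[term][doc_id] += 1
            positions[term][doc_id].append(position)
            collection_counts[term] += 1
    return term_counts, positions, collection_counts, doc_lengths

docs = {"d1": ["search", "engine", "search"], "d2": ["engine"]}
tc, pos, cc, dl = gather_statistics(docs)
```

Which of these tables are actually kept depends on the retrieval model: a model that ignores term proximity, for instance, has no need for the positions table.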
`
`Weighting
`
Index term weights reflect the relative importance of words in documents, and are used in computing scores for ranking. The specific form of a weight is determined by the retrieval model. The weighting component calculates weights using the document statistics and stores them in lookup tables. Weights could be calculated as part of the query process, and some types of weights require information about the query, but by doing as much calculation as possible during the indexing process, the efficiency of the query process will be improved.
One of the most common types used in older retrieval models is known as tf.idf weighting. There are many variations of these weights, but they are all based on a combination of the frequency or count of index term occurrences in a document (the term frequency, or tf) and the frequency of index term occurrence over the entire collection of documents (inverse document frequency, or idf). The idf weight is called inverse document frequency because it gives high weights to terms that occur in very few documents. A typical formula for idf is log N/n, where N is the
`
`-
`
`t
`d
`
`II
`
`1
`t1
`cc
`T
`w
`d:
`d(
`[8
`
`Cl
`
`In
`
`11
`an
`eB
`su
`
`CC!
`dij
`of
`sit,
`cat
`wb
`me
`
`2.3
`
`Qu
`
`Th(
`guc.
`teri
`que
`In~
`
`!; -
`
`
`
`
`total number of documents indexed by the search engine and n is the number of
`documents that contain a particular term.
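The formula above can be written down directly; the function names here are illustrative, and this is the basic form rather than one of the many tf.idf variants.

```python
import math

def idf(N, n):
    """Inverse document frequency, log(N / n): high weight for terms
    that occur in few (n) of the N indexed documents, and zero weight
    for a term that occurs in every document."""
    return math.log(N / n)

def tf_idf(term_count, N, n):
    """A basic tf.idf weight: term frequency in the document times the
    term's inverse document frequency over the collection."""
    return term_count * idf(N, n)
```

With N = 1000 documents, a term appearing in only 10 of them gets idf = log(100), while a term appearing in all 1000 gets idf = 0, which is why stopword-like terms contribute almost nothing to tf.idf scores.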
`
`Inversion
`
The inversion component is the core of the indexing process. Its task is to change the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes. The challenge is to do this efficiently, not only for large numbers of documents when the inverted indexes are initially created, but also when the indexes are updated with new documents from feeds or crawls. The format of the inverted indexes is designed for fast query processing and depends to some extent on the ranking algorithm used. The indexes are also compressed to further enhance efficiency.
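At its simplest, inversion turns a document-to-terms mapping into a term-to-documents mapping. The in-memory sketch below shows only that core transposition; the efficiency challenges the text describes (external sorting, merging, compression, incremental update) are exactly what this toy version leaves out.

```python
def invert(doc_terms):
    """Transpose document -> terms information into term -> documents
    postings lists, the core structure of an inverted index."""
    index = {}
    for doc_id in sorted(doc_terms):          # postings stay in doc order
        for term in doc_terms[doc_id]:
            index.setdefault(term, [])
            if index[term] and index[term][-1] == doc_id:
                continue                      # one posting per document
            index[term].append(doc_id)
    return index

postings = invert({"d1": ["search", "engine"], "d2": ["search"]})
```

A real inversion component would also carry the term counts and positions into the postings, since the ranking algorithm needs them at query time.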
`
`Index distribution
`
The index distribution component distributes indexes across multiple computers and potentially across multiple sites on a network. Distribution is essential for efficient performance with web search engines. By distributing the indexes for a subset of the documents (document distribution), both indexing and query processing can be done in parallel. Distributing the indexes for a subset of terms (term distribution) can also support parallel processing of queries. Replication is a form of distribution where copies of indexes or parts of indexes are stored in multiple sites so that query processing can be made more efficient by reducing communication delays. Peer-to-peer search involves a less organized form of distribution where each node in a network maintains its own indexes and collection of documents.
`
`2.3.4 User Interaction
`
`Query input
`
`/
`
[Sidebar] User Interaction: Query input, Query transformation, Results output
`
The query input component provides an interface and a parser for a query language. The simplest query languages, such as those used in most web search interfaces, have only a small number of operators. An operator is a command in the query language that is used to indicate text that should be treated in a special way. In general, operators help to clarify the meaning of the query by constraining how
`
`