throbber
.Tl< 5105
`.884
`
`j
`···_I
`
`Facebook, Inc. et al.
`v.
`Software Rights Archive, LLC
`CASE IPR2013-00480
`
`

`

`Editor-in-Chief
`Acquisitions Editor
`Editorial Assistant
`Managing Editor
`Online Product Manager
`Director of Marketing
`Marketing Manager
`Marketing Coordinator
`Senior Manufacturing Buyer
`Text Design, Composition,
`and Illustrations
`Art Direction
`Cover Design
`Cover Image
`
`Michael Hirsch
`Matt Goldstein
`Sarah Milmore
`Jeff Holcomb
`Bethany Tidd
`Margaret Waples
`Erin Davis
`Kathryn Ferranti
`Carol Melville
`W. Bruce Croft, Donald Metzler,
`and Trevor Strohman
`Linda Knowles
`Elena Sidorova
`© Peter Gudella I Shutterstock
`
`Many of the designations used by manufacturers and sellers to distinguish their products
`are claimed as trademarks. Where those designations appear in this book, and Addison(cid:173)
`Wesley was aware of a trademark claim, the designations have been printed in initial caps
`or all caps.
`
`The programs and applications presented in this book have been included for their
`instructional value. They have been tested with care, but are not guaranteed for any
`particular purpose. The publisher does not offer any warranties or representations, nor
`does it accept any liabilities with respect to the programs or applications.
`
`Library of Congress Cataloging-in-Publication Data available upon request
`
`Copyright© 2010 Pearson Education, Inc. All rights reserved. No part of this publication
`may be reproduced, stored in a retrieval system, or transmitted, in any form or by any
`means, electronic, mechanical, photocopying, recording, or otherwise, without the prior
`written permission of the publisher. Printed in the United States of America. For
`information on obtaining permission for use of material in this work, please submit a
`written request to Pearson Education, Inc., Rights and Contracts Department, 501
`Boylston Street, Suite 900, Boston, MA 02116, fax (617) 671-3447, or online at
`http://www .pearsoned.com/legal!permissions.htm.
`
`ISBN-13: 978-0-13-607224-9
`ISBN-10: 0-13-607224-0
`
`1 2 3 4 5 6 7 8 9 1 0- HP -13 12 11 10 09
`
`

`

`"W
`
`14
`
`2 Architecture of a Search Engine
`
`We may have more specific goals, too, but usually these fall into the categories
`of effectiveness or efficiency (or both). For instance, the collection of documents
`we want to search may be changing; making sure that the search engine immedi(cid:173)
`ately reacts to .changes in documents is both an effectiveness issue and an efficiency
`issue.
`The architecture of a search engine is determined by these two requirements.
`Because we want an efficient system, search engines employ specialized data struc(cid:173)
`tures that are optimized for fast retrieval. Because we want high-quality results,
`search engines carefully process text and store text statistics that help improve the
`relevance of results.
`Many of the components we discuss in the following sections have been used
`for decades, and this general design has been shown to be a useful compromise
`between the competing goals of effective and efficient retrieval. In later chapters,
`we will discuss these components in more detail.
`
`2.2 Basic Building Blocks
`
`Search engine components support two major functions, which we call the index(cid:173)
`ing process and the query process. The indexing process builds the structures that
`enable searching, and the query process uses those structures and a person's query
`to produce a ranked list of documents. Figure 2.1 shows the high-level "building
`blocks" of the indexing process. These major components are text acquisition, text
`traniformation, and index creation.
`The task of the text acquisition component is to identify and make available
`the documents that will be searched. Although in some cases this will involve sim(cid:173)
`ply using an existing collection, text acquisition will more often require building
`a collection by crawling or scanning the Web, a corporate intranet, a desktop, or
`other sources of information. In addition to passing documents to the next com(cid:173)
`ponent in the indexing process, the text acquisition component creates a docu(cid:173)
`ment data store, which contains the text and metadata for all the documents.
`Metadata is information about a document that is not part of the text content,
`such the document type (e.g., email or web page), document structure, and other
`features, such as document length.
`The text transformation component transforms documents into index terms
`or features. Index terms, as the name implies, are the parts of a document that
`are stored in the index and used in searching. 'The simplest index term is a word,
`but not every word may be used for searching. A "feature" is more often used in
`
`

`

`2.2 Basic Building Blocks
`
`15
`
`Document data store
`
`Index Creation ~ H
`~"""'~'LJ
`
`Index
`
`Text Transformation
`
`Fig. 2.1. The indexing process
`
`Email, weh pages,
`ne\vs articles. memos. letters
`
`the field of machine learning to refer to a part of a text document that is used to
`represent its content, which also describes an index term. Examples of other types
`of index terms or features are phrases, names of people, dates, and links in a web
`page. Index terms are sometimes simply referred to as "terms." The set of all the
`terms that are indexed for a document collection is called the index vocabulary.
`The index creation component takes the output of the text transformation
`component and creates the indexes or data structures that enable fast searching.
`Given the large number of documents in many search applications, index creation
`must be efficient, both in terms of time and space. Indexes must also be able to be
`efficiently updated when new documents are acquired. Inverted indexes, or some(cid:173)
`times inverted Jiles, are by far the most common form of index used by search
`engines. An inverted index, very simply, contains a list for every index term of the
`documents that contain that ~ndex term. It is iny.~r~d in the sense of being the
`opposite of a doc~ment file that lists, for every~d.~¢ument, the index terms they
`' 41 . ..
`contain. There are many variations ofirn~~~~~d'indexes, and the particular form of
`index used is one of the most import~nt ¥'r~cts of a search engine.
`Figure 2.2 shows the building block~>'of the query process. The major compo(cid:173)
`nents are user interaction, ranking, and evaluation.
`The user interaction comporteri·i: provides the interface between the person
`doing the s~arching and the search engine. One task for this component is accept(cid:173)
`ing the user's query and transforming it into index terms. Another task is to take
`the ranked list of documents from the search engine and organize it into the re-
`
`·-
`.t
`y
`g
`Gt
`
`le
`1-
`
`n.-
`u(cid:173)
`ts.
`lt,
`,er
`
`rns
`iat
`rd,
`in
`
`

`

`...... ,., ... ----....
`
`16
`
`2 Architecture of a Search Engine
`
`User Interaction
`
`Ranking
`
`Evaluation
`
`Index
`
`log data
`
`Fig. 2.2. The query process
`
`suits shown to the user. This includes, for example, generating the snippets used to
`summarize documents. The document data store is one of the sources ofinforma(cid:173)
`tion used in generating the results. Finally, this component also provides a range
`of techniques for refining the query so that it better represents the information
`need.
`The ranking component is the core of the search engine. It takes the trans(cid:173)
`formed query from the user interaction component and generates a ranked list of
`documents using scores based on a retrieval model. Ranking must be both effi(cid:173)
`cient, since many queries may need to be processed in a short time, and effective,
`since the quality of the ranking determines whether the search engine accom(cid:173)
`plishes the goal of finding relevant information. The efficiency of ranking depends
`on the indexes, and the effectiveness depends on the retrieval model.
`The task of the evaluation component is to measure and monitor effectiveness
`and efficiency. An important part of that is to record and analyze user behavior
`using log data. The results of evaluation are used to tune and improve the ranking
`component. Most of the evaluation component is not part of the online search
`engine, apart from logging user and system data. Evaluation is primarily an offiine
`activity, but it is a critical part of any search application.
`
`

`

`2.3 Breaking It Down
`
`17
`
`2.3 Breaking It Down
`
`We now look in more detail at the components of each of the basic 'building
`blocks. Not all of these components will be part of every search engine, but to(cid:173)
`gether they cover what we consider to be the most important functions for a broad
`range of search applications.
`
`2.3.1 Text Acquisition
`
`Crawler
`
`In many applications, the crawle~ ~omponent has the primary responsibility for
`identifying and acquiring documents for the search engine. There are a number of
`different types of crawlers, but the most common is the general web crawler. A web
`crawler is designed to follow the links on web pages to discover and download new
`pages. Although this sounds deceptively simple, there are significant challenges in
`designing a web crawler that can efficiently handle the huge volume of new pages
`on the Web, while at the same time ensuring that pages that may have changed
`since the last time a crawler visited a site are kept "fresh" for the search engine. A
`web crawler can be restricted to a single site, such as a university, as the basis for
`site search. Focused, or topical, web crawlers use classification techniques to restrict
`the pages that are visited to those that are likely to be about a specific topic. This
`type of crawler may be used by a vertical or topical search application, such as a
`search engine that provides access to medical information on web pages.
`For enterprise search, the crawler is adapted to discover and update all docu(cid:173)
`ments and web pages related to a company's operation. An enterprise document
`crawler follows links to discover both external and internal (i.e., restricted to the
`corporate intranet) pages, but also must scan both corporate and personal di(cid:173)
`rectories to identify email, word processing documents, presentations, database
`records, and other company information. Document crawlers are also used for
`desktop search, although in this case only the user's personal directories need to
`be scanned.
`
`Feeds
`
`Document foeds are a mechanism for accessing a real-time stream of documents.
`For example, a news feed is a constant stream of news stories and updates. In con(cid:173)
`trast to a crawler, which must discover new documents, a search engine acquires
`
`to
`1a-
`tge
`on
`
`1S(cid:173)
`of
`n-
`ve,
`n(cid:173)
`ds
`
`~ss
`or
`'1g
`:h
`1.e
`
`Text Acquisition
`
`Crawler
`Feeds
`Conversion
`Document data store
`
`

`

`18
`
`2 Architecture of a Search Engine
`
`new documents from a feed simply by monitoring it. RSS2 is a common standard
`used for web feeds for content such as news, blogs, or video. An RSS "reader"
`is used to subscribe to RSS feeds, which are formatted using XL\fL. 3 XML is a
`language for describing data formats, similar to HTML.4 The reader monitors
`those feeds and provides new content when it arrives. Radio and television feeds
`are also used in some search applications, where the "documents" contain auto(cid:173)
`matically segmented audio and video streams, together with associated text from
`closed captions or speech recognition.
`
`Conversion
`
`The documents found by a crawler or provided by a feed are rarely in plain text.
`Instead, they come in a variety of formats, such as HTML, XML, Adobe PDF,
`, and so on. Most search engines require
`Microsoft Word~, Microsoft PowerPoint0
`that these documents be converted into a consistent text plus metadata format.
`In this conversion, the control sequences and non-content data associated with
`a particular format are either removed or recorded as metadata. In the case of
`HTML and XML, much of this process can be described as part of the text trans(cid:173)
`formation component. For other formats, the conversion process is a basic step
`that prepares the document for further processing. PDF documents, for example,
`must be converted to text. Various utilities are available that perform this conver(cid:173)
`sion, with varying degrees of accuracy. Similarly, utilities are available to convert
`the various Microsoft Office® formats into text.
`Another common conversion problem comes from the way text is encoded in a
`document. ASCII5 is a common standard single-byte character encoding scheme
`used for text. ASCII uses either 7 or 8 bits (extended ASCII) to represent either
`128 or 256 possible characters. Some languages, however, such as Chinese, have
`many more characters than English and use a number of other encoding schemes.
`Unicode is a standard encoding scheme that uses 16 bits (typically) to represent
`most of the world's languages. Any application that deals with documents in dif(cid:173)
`ferent languages has to ensure that they are converted into a consistent encoding
`scheme before further processing.
`
`2 RSS a~tually refers to a family of standards with similar names (and the same initials),
`such as Really Simple Syndication or Rich Site Summary.
`3 eXtensible Markup Language
`4 HyperText Markup Language
`5 American Standard Code for Information Interchange
`
`l
`
`(
`
`f
`a
`r
`F
`I~
`F
`v
`
`e
`
`0
`p
`f,
`n
`
`

`

`tdard
`ader"
`Lis a
`1itors
`feeds
`auto(cid:173)
`from
`
`L text.
`PDF,
`:quire
`•rmat .
`. with
`:~.se of
`trans-
`c step
`mple,
`mver(cid:173)
`mvert
`
`~dina
`:heme
`either
`:,have
`temes.
`resent
`in dif(cid:173)
`:oding
`
`titials ),
`
`2.3 Breaking It Down
`
`19
`
`Document data store
`
`The document data store is a database used to manage large numbers of docu(cid:173)
`ments and the structured data that is associated with them. The document con(cid:173)
`tents are typically stored in compressed form for efficiency. The structured data
`consists of document metadata and other information extracted from the docu(cid:173)
`ments, such as links and anchor text (the text associated with a link). A relational
`database system can be used to store the documents and metadata. Some applica(cid:173)
`tions, however, use a simpler, more efficient storage system to provide very fast
`retrieval times for very large document stores.
`Although the original documents are available on the Web, in the enterprise
`database, the document data store is necessary to provide fast access to the doc(cid:173)
`ument contents for a range of search engine components. Generating summaries
`of retrieved documents, for example, would take far too long if the search engine
`had to access the original documents and reprocess them.
`
`2.3.2 Text Transformation
`
`Parser
`
`The parsing component is responsible for processing the sequence of text tokens
`in the document to recognize structural elements such as tides, figures, links, and
`headings. Tokenizing the text is an important first step in this process. In many
`cases, tokens are the same as words. Both document and query text must be trans(cid:173)
`formed into tokens in the same manner so that they can be easily compared. There
`are a number of decisions that potentially affect retrieval that make tokenizing
`non-trivial. For example, a simple definition for tokens could be strings of al(cid:173)
`phanumeric characters that are separated by spaces. This does not tell us, however,
`how to deal with special characters such as capital letters, hyphens, and apostro(cid:173)
`phes. Should we treat "apple" the same as "Apple"? Is "on-line" two words or one
`word? Should the apostrophe in "O'Connor" be treated the same as the one in
`"owner's"? In some languages, tokenizinggets even more interesting. Chinese, for
`example, has no obvious wordseparator like a space in English.
`Document structure is often specified by a markup language such as HTML
`or XML. HTML is the default language used for specifying the structure of web
`pages. XML has much more flexibility and is used as a data interchange format
`for many applications. TI1e document parser uses knowledge of the syntax of the
`markup language to identify the structure.
`
`Text Transformation
`
`Parser
`Stopping
`Stemming
`link Analysis
`Information Extraction
`Classifier
`
`

`

`-
`
`20
`
`2 Architecture of a Search Engine
`
`Both HTML and XML use tags to define document elements. For example,
`<h2> Search </h2> defines "Search" as a second-level heading in HTML. Tags and
`other control sequences must be treated appropriately when tokenizing. Other
`types of documents, such as email and presentations, have a specific syntax and
`methods for specifying structure, but much of this may be be removed or simpli(cid:173)
`fied by the conversion component.
`
`Stopping
`
`The stopping component has the simple task of removing common words from
`the stream of tokens that become index terms. The most common words are typ(cid:173)
`ically Junction words that help form sentence structure but contribute little on
`their own to the description of the topics covered by the text. Examples are "the",
`"of", "to", and "for". Because they are so common, removing them can reduce the
`size of the indexes considerably. Depending on the retrieval model that is used as
`the basis of the ranking, removing these words usually has no impact on the search
`engine's effectiveness, and may even improve it somewhat. Despite these potential
`advantages, it can be difficult to decide how many words to include on the stop(cid:173)
`word list. Some stopword lists used in research contain hundreds of words. The
`problem with using such lists is that it becomes impossible to search with queries
`like "to be or not to be" or "down under". To avoid this, search applications may
`use very small stopword lists (perhaps just containing "the") when processing doc(cid:173)
`ument text, but then use longer lists for the default processing of query text.
`
`Stemming
`
`Stemming is another word-level transformation. The task of the stemming com(cid:173)
`ponent (or stemmer) is to group words that are derived from a common stem.
`Grouping "fish", "fishes", and "fishing" is one example. By replacing each member
`of a group with one designated word (for example, the shortest, which in this case
`is "fish"), we increase the likelihood that words used in queries and documents
`will match. Stemming, in fact, generally produces small improvements in ranking
`effectiveness. Similar to stopping, stemming can be done aggressively, conserva(cid:173)
`tively, or not at all. Aggressive stemming can cause search problems. It may not be
`appropriate, for example, to retrieve documents about different varieties of fish in
`response to the query "fishing". Some search applications use more conservative
`stemming, such as simply identifying plural forms using the letter "s': or they may
`
`

`

`2.3 Breaking It Down
`
`21
`
`do no stemming when processing document text and focus on adding appropriate
`word variants to the query.
`Some languages, such as Arabic, have more complicated morphology than En(cid:173)
`glish, and stemming is consequently more important. An effective stemming com(cid:173)
`ponent in Arabic has a huge impact on search effectiveness. In contrast, there is
`little word variation in other languages, such as Chinese, and for these languages
`stemming is not effective.
`
`Link extraction and analysis
`
`Links and the corresponding anchor text in web pages can readily be identified
`and extracted during document parsing. Extraction means that this information
`is recorded in the document data store, and can be indexed separately from the
`general text content. Web search engines make extensive use of this information
`through link analysis algorithms such as PageRank (Brin & Page, 1998). Link
`analysis provides the search engine with a rating of the popularity, and to some
`extent, the authority of a page (in other words, how important it is). Anchor text,
`which is the clickable text of a web link, can be used to enhance the text content
`of a page that the link points to. These two factors can significantly improve the
`effectiveness of web search for some types of queries.
`
`Information extraction
`
`Information extraction is used to identify index terms that are more complex than
`single words. This may be as simple as words in bold or words in headings, but in
`generalmay require significant additional computation. Extracting syntactic fea(cid:173)
`tures such as noun phrases, for example, requires some form of syntactic analysis
`or part-ofspeech tagging. Research in this area has focused on techniques for ex(cid:173)
`tracting features with specific semantic content, such as named entity recognizers,
`which can reliably identify information such as person names, company names,
`dates,.and locations.
`,
`
`Classifier
`
`The classifier component identifies class-related metadata for documents or parts
`of documents. This covers a range of functions that are often described separately.
`Classification techniques assign predefined class labels to documents. These labels
`typically represent topical categories such as "sports", "politics", or "business". Two
`
`.mple,
`~sand
`::::>ther
`x and
`mpli-
`
`from
`e typ(cid:173)
`:le on
`
`:e the
`;ed as
`earch
`::ntial
`:stop(cid:173)
`;, The
`1eries
`>may
`•doc(cid:173)
`'
`
`com(cid:173)
`stem.
`mber
`s case
`o.ents
`1king
`erva(cid:173)
`otbe
`.shin
`·ativc
`·may
`
`

`

`Index Creation
`
`Document Statistics
`Vv' eighting
`Inversion
`Distribution
`
`.. #&
`
`- - - -
`
`. . . . .
`
`. bQQ __ ;
`
`22
`
`2 Architecture of a Search Engine
`
`important examples of other types of classification are identifying documents as
`~pam, and identifying the non-content parts of documents, such as advertising.
`Clustering techniques are used to group related documents without predefined
`categories. These document groups can be used in a variety of ways during ranking
`or user interaction.
`
`2.3.3 Index Creation
`
`Document statistics
`
`The task of the document statistics component is simply to gather and record
`statistical information about words, features, and documents. This information
`is used by the ranking component to compute scores for documents. The types
`of data generally required are the counts of index term occurrences (both words
`and more complex features) in individual documents, the positions in the doc(cid:173)
`uments where the index terms occurred, the counts of occurrences over groups
`of documents (such as all documents labeled "sports" or the entire collection of
`documents), and the lengths of documents in terms of the number of tokens. The
`actual data required is determined by the retrieval model and associated rank(cid:173)
`ing algorithm. The document statistics are stored in lookup tables, which are data
`structures designed for fast retrieval.
`
`Weighting
`
`Index term weights reflect the relative importance of words in documents, and
`are used in computing scores for ranking. The specific form of a weight is deter(cid:173)
`mined by the retrieval model. The weighting component calculates weights using
`the document statistics and stores them in lookup tables. Weights could be calcu(cid:173)
`lated as part of the query process, and some types of weights require information
`about the query, but by doing as much calculation as possible during the indexing
`process, the efficiency of the query process will be improved.
`One of the most common types used in older retrieval models is known as tfidf
`weighting. There are many variations of these weights, but they are all based on a
`combination of the frequency or count of index term occurrences in a document
`(the term .frequency, or tj) and the frequency of index term occurrence over the
`entire collec'tion of documents (inverse document .frequency, or idf). The idfweight
`is called inverse document frequency because it gives high weights to terms that
`occur in very few documents.cAtypical formula for idfis log N / n, where N is the
`
`-
`
`t
`d
`
`II
`
`1
`t1
`cc
`T
`w
`d:
`d(
`[8
`
`Cl
`
`In
`
`11
`an
`eB
`su
`
`CC!
`dij
`of
`sit,
`cat
`wb
`me
`
`2.3
`
`Qu
`
`Th(
`guc.
`teri
`que
`In~
`
`!; -
`
`

`

`2.3 Breaking It Down
`
`23
`
`total number of documents indexed by the search engine and n is the number of
`documents that contain a particular term.
`
`Inversion
`
`The inversion component is the core of the indexing process. Its task is to change
`the stream of document-term information coming from the text transformation
`component into term-document information for the creation ofinverted indexes.
`The challenge is to do this efficiently, not only for large numbers of documents
`when the inverted indexes are initially created, but also when the indexes are up(cid:173)
`dated with new documents from feeds or crawls. The format of the inverted in(cid:173)
`dexes is designed for fast query processing and depends to some extent on the
`ranking algorithm used. The indexes are also compressed to further enhance effi(cid:173)
`ciency.
`
`Index distribution
`
`The index distribution component distributes indexes across multiple computers
`and potentially across multiple sites on a network. Distribution is essential for
`efficient performance with web search engines. By distributing the indexes for a
`subset of the documents (document distribution), both indexing and query pro(cid:173)
`cessing can be done in parallel. Distributing the indexes for a subset of terms (term
`distribution) can also support parallel processing of queries. Replication is a form
`of distribution where copies of indexes or parts of indexes are stored in multiple
`sites so that queryprocessing can be made more efficient by reducing communi(cid:173)
`cation delays. Peer-to-peer search involves a less organized form of distribution
`where each node in a network maintains its own indexes and collection of docu(cid:173)
`ments.
`
`2.3.4 User Interaction
`
`Query input
`
`/
`
`User Interaction
`
`Query input
`Query transformation
`Results Output
`
`The query input component provides an interface and a parser for a query lan(cid:173)
`guage. The simplest query languages, such as those used in most web search in(cid:173)
`terfaces, have only a small number of operators. An operator is a command in the
`query language that is used to indicate text that should be treated in a special way.
`In general, operators help to clarify the meaning of the query by constraining how
`
`ts as
`;ing.
`[ned
`king
`
`cord
`Ltion
`ypes
`·ords
`doc(cid:173)
`oups
`mof
`.The
`:ank-
`data
`
`, and
`leter(cid:173)
`using
`:alcu(cid:173)
`ation
`exing
`
`;tfitlf
`ion a
`.ment
`~r the
`reight
`s that
`is the
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket