throbber
STARTS(cid:0) Stanford Proposal for Internet Meta(cid:2)Searching (cid:0)
`
`Luis Gravano
`
`Chen(cid:2)Chuan K(cid:3) Chang
`
`H(cid:4)ector Garc(cid:4)(cid:5)a(cid:2)Molina
`
`Computer Science Department
`
`Computer Science Department
`
`Computer Science Department
`
`Stanford University
`
`Stanford University
`
`Stanford University
`
`gravano(cid:2)cs(cid:3)stanford(cid:3)edu
`
`kevin(cid:2)db(cid:3)stanford(cid:3)edu
`
`hector(cid:2)cs(cid:3)stanford(cid:3)edu
`
`Andreas Paepcke
`
`Computer Science Department
`
`Stanford University
`
`paepcke(cid:2)db(cid:3)stanford(cid:3)edu
`
`Abstract
`
`Document sources are available everywhere(cid:2) both within the
`internal networks of organizations and on the Internet(cid:3) Even
`individual organizations use search engines from di(cid:4)erent
`vendors to index their internal document collections(cid:3) These
`search engines are typically incompatible in that they sup(cid:5)
`port di(cid:4)erent query models and interfaces(cid:2) they do not re(cid:5)
`turn enough information with the query results for adequate
`merging of the results(cid:2) and (cid:6)nally(cid:2) in that they do not ex(cid:5)
`port metadata about the collections that they index (cid:7)e(cid:3)g(cid:3)(cid:2) to
`assist in resource discovery(cid:8)(cid:3) This paper describes STARTS(cid:2)
`an emerging protocol for Internet retrieval and search that
`facilitates the task of querying multiple document sources(cid:3)
`STARTS has been developed in a unique way(cid:3) It is not a
`standard(cid:2) but a group e(cid:4)ort coordinated by Stanford(cid:9)s Dig(cid:5)
`ital Library project(cid:2) and involving over companies and
`organizations(cid:3) The objective of this paper is not only to
`give an overview of the STARTS protocol proposal(cid:2) but also
`to discuss the process that led to its de(cid:6)nition(cid:3)
`
` Introduction
`
`Document sources are available everywhere(cid:2) both within the
`internal networks of organizations and on the Internet(cid:3) The
`source contents are often hidden behind search interfaces
`and models that vary from source to source(cid:3) Even individual
`organizations use search engines from di(cid:4)erent vendors to
`index their internal document collections(cid:3) These organiza(cid:5)
`tions can bene(cid:6)t from metasearchers(cid:2) which are services that
`provide uni(cid:6)ed query interfaces to multiple search engines(cid:3)
`Thus(cid:2) users have the illusion of a single combined document
`
`(cid:0) This material is based upon work supported by the National Sci(cid:2)
`ence Foundation under Cooperative Agreement IRI(cid:2)  (cid:9) Funding
`for this cooperative agreement is also provided by DARPA(cid:10) NASA(cid:10)
`and the industrial partners of the Stanford Digital Libraries Project(cid:9)
`Any opinions(cid:10) (cid:11)nding(cid:10) and conclusions or recommendations expressed
`in this material are those of the author(cid:12)s(cid:13) and do not necessarily
`re(cid:14)ect the views of the National Science Foundation or the other
`sponsors(cid:9)
`
`source(cid:3) This paper describes STARTS (cid:2) an emerging pro(cid:5)
`tocol for Internet retrieval and search(cid:3) The goal of STARTS
`is to facilitate the main three tasks that a metasearcher per(cid:5)
`forms(cid:11)
`
`(cid:0) Choosing the best sources to evaluate a query
`
`(cid:0) Evaluating the query at these sources
`
`(cid:0) Merging the query results from these sources
`
`STARTS has been developed in a unique way(cid:3) It is not
`a standard(cid:2) but a group e(cid:4)ort involving over companies
`and organizations(cid:3) The objective of this paper is not only to
`give an overview of the STARTS protocol proposal(cid:2) but also
`to discuss the process that led to its de(cid:6)nition(cid:3) In particular(cid:2)
`
`(cid:0) We will describe the history of the project(cid:2) including
`the current status of a reference implementation(cid:2) and
`will highlight some of the existing (cid:12)tensions(cid:13) between
`information providers and search engine builders (cid:7)Sec(cid:5)
`tions  and (cid:8)(cid:3)
`
`(cid:0) We will explain the protocol(cid:2) together with some of
`the tradeo(cid:4)s and compromises that we had to make in
`its design (cid:7)Section (cid:8)(cid:3)
`
`(cid:0) We will comment on other work that is closely related
`to STARTS (cid:7)Section (cid:8)(cid:3)
`
` History of our Proposal
`
`The Digital Library project at Stanford coordinated search
`engine vendors and other key players to informally design
`a protocol that would allow searching and retrieval of in(cid:5)
`formation from distributed and heterogeneous sources(cid:3) We
`were initially contacted by Steve Kirsch(cid:2) president of In(cid:5)
`foseek (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)infoseek(cid:4)com(cid:8)(cid:2) in June (cid:3) His idea
`was that Stanford should collect the views of the search en(cid:5)
`gine vendors on how to address the problem at hand(cid:3) Then
`Stanford(cid:2) acting as an unbiased party(cid:2) would design a pro(cid:5)
`tocol proposal that would reconcile the vendors(cid:9) ideas(cid:3) The
`key motivation behind this informal procedure was to avoid
`the long delays usually involved in the de(cid:6)nition of formal
`standards(cid:3)
`In July(cid:2) (cid:2) we started our e(cid:4)ort with (cid:6)ve compa(cid:5)
`nies(cid:11) Fulcrum (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)fulcrum(cid:4)com(cid:8)(cid:2) Infoseek(cid:2) PLS
`
` STARTS stands for (cid:15)Stanford Protocol Proposal for Internet
`Retrieval and Search(cid:9)(cid:16)
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2009-1
`
`

`

`(cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)pls(cid:4)com(cid:8)(cid:2) Verity (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)verity(cid:4)com(cid:8)(cid:2)
`and WAIS(cid:3) Microsoft Network (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)msn(cid:4)com(cid:8) joined
`the initial group in November(cid:3) We circulated a preliminary
`draft describing the main three problems that we wanted to
`address (cid:7)i(cid:3)e(cid:3)(cid:2) choosing the best sources for a query(cid:2) evaluat(cid:5)
`ing the query at these sources(cid:2) and merging the query results
`from the sources(cid:8)(cid:3) We scheduled meetings with people from
`the companies to discuss these problems and get feedback(cid:3)
`We met individually with each company between December(cid:2)
` (cid:2) and February(cid:2) (cid:3) During each meeting(cid:2) we would
`show a couple of slides for each problem to agree on its def(cid:5)
`inition(cid:2) terminology(cid:2) etc(cid:3) After this(cid:2) we would discuss the
`possible solutions for each problem in detail(cid:3)
`Based on the comments and suggestions that we received(cid:2)
`we produced a (cid:6)rst draft of our proposal by March(cid:2) (cid:3)
`We then produced two revisions of this draft using feed(cid:5)
`back from the original companies(cid:2) plus other organizations
`that started participating(cid:3) Among these companies and
`organizations are Excite (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)excite(cid:4)com(cid:8)(cid:2) GILS
`(cid:7)http(cid:2)(cid:3)(cid:3)info(cid:4)er(cid:4)usgs(cid:4)gov(cid:2)(cid:3)gils(cid:3)(cid:8)(cid:2) Harvest (cid:7)http(cid:2)(cid:3)(cid:3)(cid:7)
`harvest(cid:4)transarc(cid:4)com(cid:8)(cid:2) Hewlett(cid:5)Packard Laboratories
`(cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)hpl(cid:4)hp(cid:4)com(cid:8)(cid:2) and Netscape (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)(cid:7)
`netscape(cid:4)com(cid:8)(cid:3) Finally(cid:2) we held a workshop at Stanford
`with the major participants on August st(cid:2)  (cid:7)http(cid:2)(cid:3)(cid:3)(cid:7)
`www(cid:7)db(cid:4)stanford(cid:4)edu(cid:3)(cid:8)gravano(cid:3)workshop participants(cid:4)(cid:7)
`html(cid:8)(cid:3) The goal of this one(cid:5)day workshop was to iron out
`the controversial aspects of the proposal(cid:2) and to get feedback
`for its (cid:6)nal draft (cid:20) (cid:21)(cid:3)
`De(cid:6)ning STARTS has been a very interesting experience(cid:11)
`we wanted to design a protocol that would be simple(cid:2) yet
`powerful enough to allow us to address the three problems at
`hand(cid:3) We could have adopted a (cid:12)least common denomina(cid:5)
`tor(cid:13) approach for our solution(cid:3) However(cid:2) many interesting
`interactions would have been impossible under such a solu(cid:5)
`tion(cid:3) Alternatively(cid:2) we could have incorporated the sophis(cid:5)
`ticated features that the search engines provide(cid:2) but that
`also would have challenged interoperability(cid:2) and would have
`driven us away from simplicity(cid:3) Consequently(cid:2) we had to
`walk a very (cid:6)ne line(cid:2) trying to (cid:6)nd a solution that would be
`expressible enough(cid:2) but not too complicated or impossible
`to quickly implement by the search engine vendors(cid:3)
`Another aspect that made the experience challenging was
`dealing with companies that have secret(cid:2) proprietary algo(cid:5)
`rithms(cid:2) as those for ranking documents(cid:3) (cid:7)See Section (cid:3)(cid:3)(cid:8)
`Obviously(cid:2) we could not ask the companies to reveal these
`algorithms(cid:3) However(cid:2) we still needed to have them export
`enough information so that a metasearcher could do some(cid:5)
`thing useful with the query results(cid:3)
`As mentioned above(cid:2) the STARTS(cid:5) (cid:3) speci(cid:6)cation is al(cid:5)
`ready completed(cid:3) A reference implementation of the proto(cid:5)
`col has been built at Cornell University by Carl Lagoze(cid:3) (cid:7)See
`http(cid:2)(cid:3)(cid:3)www(cid:7)diglib(cid:4)stanford(cid:4)edu for information(cid:3)(cid:8) Also(cid:2)
`the Z (cid:3) community is designing a pro(cid:6)le of their Z (cid:3)(cid:5)
`  standard based on STARTS(cid:3) (cid:7)This pro(cid:6)le was origi(cid:5)
`nally called ZSTARTS(cid:2) but has since changed its name to
`ZDSR(cid:2) for Z (cid:3) Pro(cid:6)le for Simple Distributed Search and
`Ranked Retrieval(cid:3)(cid:8) Finally(cid:2) we will try to (cid:6)nd a sponsor to
`present STARTS under the World(cid:5)Wide Web Consortium
`(cid:7)W C(cid:8)(cid:2) so that a formal standard can emerge from it(cid:3)
`Our goal in presenting this paper at the SIGMOD con(cid:5)
`ference is to also get the SIGMOD community involved in
`this e(cid:4)ort(cid:3) We believe that the Internet has become cen(cid:5)
`tral to (cid:12)management of data(cid:13) (cid:7)the MOD in SIGMOD(cid:8)(cid:2) and
`searching and resource discovery across the Internet is one
`of the most important problems(cid:3)
`
`Source 1
`
`Query
`
`Client
`
`Results
`
`Source 2
`
`Resource
`
`Figure (cid:11) A metasearcher queries a source(cid:2) and may specify
`that the query be evaluated at several sources at the same
`resource(cid:3)
`
` Our Metasearch Model and its Associated Problems
`
`In this section we describe the basic metasearch model un(cid:5)
`derlying our proposal(cid:2) and the three main problems that a
`metasearcher faces today(cid:3) These problems motivated our
`e(cid:4)ort(cid:3)
`For the purpose of the STARTS protocol(cid:2) we view the
`Internet as a potentially large number of resources (cid:7)e(cid:3)g(cid:3)(cid:2)
`Knight(cid:5)Ridder(cid:9)s Dialog information service(cid:2) or the CS(cid:5)TR
`sources (cid:8)(cid:3) Each resource consists of one or more sources
`(cid:7)Figure (cid:8)(cid:3) A source is a collection of text documents (cid:7)e(cid:3)g(cid:3)(cid:2)
`Inspec and the Computer Database in the Dialog resource(cid:8)(cid:2)
`with an associated search engine that accepts queries from
`clients and produces results(cid:3) We assume that documents are
`(cid:12)(cid:23)at(cid:2)(cid:13) in the sense that we do not(cid:2) for example(cid:2) allow any
`nesting of documents(cid:3) We do not consider non(cid:5)textual doc(cid:5)
`uments or data either (cid:7)e(cid:3)g(cid:3)(cid:2) geographical data(cid:8) to keep the
`protocol simple(cid:3) Sources may be (cid:12)small(cid:13) (cid:7)e(cid:3)g(cid:3)(cid:2) the collection
`of papers written by some university professor(cid:8) or (cid:12)large(cid:13)
`(cid:7)e(cid:3)g(cid:3)(cid:2) the collection of World(cid:5)Wide Web pages indexed by a
`crawler(cid:8)(cid:3)
`As described in the Introduction(cid:2) a metasearcher (cid:7)or any
`end client(cid:2) in general(cid:8) would typically issue queries to multi(cid:5)
`ple sources(cid:2) for which it needs to perform three main tasks(cid:3)
`First(cid:2) the metasearcher chooses the best sources to evalu(cid:5)
`ate a query(cid:3) Then(cid:2) it submits the query to these sources(cid:3)
`Finally(cid:2) it merges the results from the sources and presents
`them to the user that issued the query(cid:3) To query multiple
`sources within the same resource(cid:2) the metasearcher issues
`the query to one of the sources at the resource (cid:7)Source
`in Figure (cid:8)(cid:2) specifying the other (cid:12)local(cid:13) sources where to
`also evaluate the query (cid:7)Source  in Figure (cid:8)(cid:3) This way(cid:2)
`the resource can eliminate duplicate documents from the
`query result(cid:2) for example(cid:2) which would be di(cid:24)cult for the
`metasearcher to do if it queried all of the sources indepen(cid:5)
`dently(cid:3)
`Building metasearchers is nowadays a hard task because
`di(cid:4)erent search engines are largely incompatible and do not
`allow for interoperability(cid:3) In general(cid:2) text search engines(cid:11)
`
`(cid:0) Use di(cid:4)erent query languages (cid:7)the query(cid:2)language prob(cid:2)
`lem(cid:25) Section (cid:3) (cid:8)
`
`(cid:0) Rank documents in the query results using secret al(cid:5)
`gorithms (cid:7)the rank(cid:2)merging problem(cid:25) Section (cid:3)(cid:8)
`
`(cid:0) Do not export information about the sources in a stan(cid:5)
`dard form (cid:7)the source(cid:2)metadata problem(cid:25) Section (cid:3) (cid:8)
`
`The CS(cid:2)TR sources constitute an emerging library of Computer
`Science Technical Reports (cid:12)http(cid:2)(cid:3)(cid:3)www(cid:4)ncstrl(cid:4)org(cid:13)(cid:9)
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2009-2
`
`

`

`Below we visit each of these metasearch problems(cid:3) The
`discussion will illustrate the need for an agreement between
`search engine vendors so that metasearchers can work ef(cid:5)
`fectively(cid:3) Finally(cid:2) Section (cid:3) summarizes the metasearch
`requirements that should be facilitated by the agreement(cid:3)
`
` (cid:5) The Query(cid:6)Language Problem
`
`A metasearcher submits queries over multiple sources(cid:3) But
`the interfaces and capabilities of these sources may vary dra(cid:5)
`matically(cid:3) Even the basic query model that the sources sup(cid:5)
`port may vary(cid:3)
`Some search engines (cid:7)e(cid:3)g(cid:3)(cid:2) Glimpse(cid:8) only support the
`Boolean retrieval model (cid:20)(cid:21)(cid:3)
`In this model(cid:2) a query is a
`condition that documents either do or do not satisfy(cid:3) The
`query result is then a set of documents(cid:3) For example(cid:2) a
`query (cid:12)distributed and systems(cid:13) returns all documents
`that contain both the words (cid:12)distributed(cid:13) and (cid:12)systems(cid:13) in
`them(cid:3)
`Alternatively(cid:2) most commercial search engines also sup(cid:5)
`port some variation of the vector(cid:2)space retrieval model (cid:20)(cid:21)(cid:3)
`In this model(cid:2) a query is a list of terms(cid:2) and documents are
`assigned a score according to how similar they are to the
`query(cid:3) The query result is then a rank of documents(cid:3) For
`example(cid:2) a query (cid:12)distributed systems(cid:13) returns a rank of
`documents that is typically based on the number of occur(cid:5)
`rences of the words (cid:12)distributed(cid:13) and (cid:12)systems(cid:13) in them(cid:3)
`A document in the query result might contain the word (cid:12)dis(cid:5)
`tributed(cid:13) but not the word (cid:12)systems(cid:2)(cid:13) for example(cid:2) or vice
`versa(cid:2) unlike in the Boolean(cid:5)model case above(cid:3)
`Even if two sources support a Boolean retrieval model(cid:2)
`their query syntax often di(cid:4)er(cid:3) A query asking for docu(cid:5)
`ments with the words (cid:12)distributed(cid:13) and (cid:12)systems(cid:13) might
`be expressed as (cid:12)distributed and systems(cid:13) in one source(cid:2)
`and as (cid:12)(cid:9)distributed (cid:9)systems(cid:13) in another(cid:2) for example(cid:3)
`More serious problems appear if di(cid:4)erent (cid:6)elds (cid:7)e(cid:3)g(cid:3)(cid:2)
`abstract(cid:8) are available for searching at di(cid:4)erent sources(cid:3)
`For example(cid:2) a source might support queries like (cid:10)abstract
`(cid:11)(cid:11)databases(cid:12)(cid:12)(cid:13) that ask for documents that have the word
`(cid:12)databases(cid:13) in their abstract(cid:2) whereas some other sources
`might not support the abstract (cid:6)eld for querying(cid:3)
`Another complication results from di(cid:4)erent stemming al(cid:5)
`gorithms or stop(cid:5)word lists being implicit in the query model
`of each source(cid:3) (cid:7)Stemming is used to make a query on (cid:12)sys(cid:5)
`tems(cid:13) also retrieve documents on (cid:12)system(cid:2)(cid:13) for example(cid:3)
`Stop words are used to not process words like (cid:12)the(cid:13) in the
`queries(cid:2) for example(cid:3)(cid:8) If a user wants documents about the
`rock group (cid:12)The Who(cid:2)(cid:13) knowing about the stop(cid:5)word be(cid:5)
`havior of the sources would allow a metasearcher(cid:2) for exam(cid:5)
`ple(cid:2) to know whether it is possible to disallow the elimination
`of stop words from queries at each source(cid:3)
`As a result of all this heterogeneity(cid:2) a metasearcher would
`have to translate the original query to adjust it to each
`source(cid:9)s syntax(cid:3) To do this translation(cid:2) the metasearcher
`needs to know the characteristics of each source(cid:3) (cid:7)The work
`in (cid:20) (cid:2) (cid:21) illustrates the complexities involved in query trans(cid:5)
`lation(cid:3)(cid:8) As we will see in Section (cid:3) (cid:2) querying multiple
`sources is much easier if the sources support some common
`query language(cid:3) Even if support for most of this language is
`optional(cid:2) query translation is much simpler if sources reveal
`what portions of the language they support(cid:3)
`
` These ranks also typically depend on other factors(cid:10) like the num(cid:2)
`ber of documents in the source that contain the query words(cid:10) for
`example(cid:9)
`
` (cid:5) The Rank(cid:6)Merging Problem
`
`A source that supports the vector(cid:5)space retrieval model ranks
`its documents according to how (cid:12)similar(cid:13) the documents
`and a given query are(cid:3) Unfortunately(cid:2) there are many ways
`to compute these similarities(cid:3) To make matters more com(cid:5)
`plicated(cid:2) the ranking algorithms are usually proprietary to
`the search engine vendors(cid:2) and their details are not publicly
`available(cid:3)
`Merging query results from sources that use di(cid:4)erent and
`unknown ranking algorithms is hard(cid:3) (cid:7)See (cid:20)(cid:2) (cid:21) for algo(cid:5)
`rithms for merging multiple document ranks(cid:3)(cid:8) For example(cid:2)
`source S might report that document d has a score of (cid:3)
`for some query(cid:2) while source S might report that docu(cid:5)
`ment d has a score of (cid:2) for the same query(cid:3) If we want
`to merge the results from S and S into a single document
`rank(cid:2) should we rank d higher than d(cid:2) or vice versa(cid:26) (cid:7)Some
`search engines are designed so that the top document for a
`query always has a score of(cid:2) say(cid:2) (cid:2)(cid:3)(cid:8)
`It is even hard to merge query results from sources that
`use the same ranking algorithm(cid:2) even if we know this algo(cid:5)
`rithm(cid:3) The reason is that the algorithm might rank doc(cid:5)
`uments di(cid:4)erently based on the collection where the doc(cid:5)
`ument appears(cid:3) For example(cid:2) if a source S specializes in
`computer science(cid:2) the word databases might appear in many
`of its documents(cid:3) Then(cid:2) this word will tend to have a low
`associated weight in S (cid:7)e(cid:3)g(cid:3)(cid:2) if S uses the tf(cid:2)idf formula for
`computing weights (cid:20)(cid:21)(cid:8)(cid:3) The word databases(cid:2) on the other
`hand(cid:2) might have a high associated weight in a sourceS 
`that is totally unrelated to computer science and contains
`very few documents with that word(cid:3) Consequently(cid:2) S might
`assign its documents a low score for a query containing the
`word databases(cid:2) while S assigns a few documents a high
`score for that query(cid:3) Therefore(cid:2) it is possible for two very
`similar documents d and d to receive very di(cid:4)erent scores
`for a given query(cid:2) ifd appears in S and d appears in S(cid:3)
`Thus(cid:2) even if the sources use the same ranking algorithm(cid:2)
`a metasearcher still needs additional information to merge
`query results in a meaningful way(cid:3)
`
` (cid:5) The Source(cid:6)Metadata Problem
`
`A metasearcher might have thousands of sources available
`for querying(cid:3) Some of these sources might charge for their
`use(cid:3) Some of the sources might have large response times(cid:3)
`Therefore(cid:2) it becomes crucial that the metasearcher just con(cid:5)
`tact sources that might contain useful documents for a given
`query(cid:3) The metasearcher then needs information about each
`source(cid:9)s contents(cid:3)
`Some sources freely deliver their entire document col(cid:5)
`lection(cid:2) whereas others do not(cid:3) Often(cid:2) those sources that
`have for(cid:5)pay information are of the second type(cid:3) If a source
`exports all of its contents (cid:7)e(cid:3)g(cid:3)(cid:2) many World(cid:5)Wide Web
`sites(cid:8)(cid:2) then it is not as critical to have it describe its col(cid:5)
`lection to the metasearchers(cid:3) After all(cid:2) the metasearchers
`can just grab all of the sources(cid:9) contents and summarize
`them any way they want(cid:3) This is what (cid:12)crawlers(cid:13) like
`AltaVista (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)altavista(cid:4)digital(cid:4)com(cid:8) do(cid:3) How(cid:5)
`ever(cid:2) for performance reasons(cid:2) it may still be useful to re(cid:5)
`quire that such sources export a more succinct description of
`themselves(cid:3) In contrast(cid:2) if a source (cid:12)hides(cid:13) its information
`(cid:7)e(cid:3)g(cid:3)(cid:2) through a search interface(cid:8)(cid:2) then it is even more impor(cid:5)
`tant that the source can describe its contents(cid:3) Otherwise(cid:2)
`if a source does not export any kind of content summary(cid:2)
`it becomes hard for a metasearcher to assess what kind of
`information the source covers(cid:3)
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2009-3
`
`

`

` (cid:5) Metasearch Requirements
`
`(cid:5) Query Language
`
`In summary(cid:2) a sophisticated metasearcher will need to per(cid:5)
`form the following tasks in order to e(cid:24)ciently query multiple
`resources(cid:11)
`
`(cid:0) Extract the list of sources from the resources periodi(cid:5)
`cally (cid:7)to (cid:6)nd out what sources are available for query(cid:5)
`ing(cid:8) (cid:7)Section (cid:3) (cid:3) (cid:8)
`
`(cid:0) Extract metadata and content summaries from the
`sources periodically (cid:7)to be able to decide what sources
`are potentially useful for a given query(cid:8) (cid:7)Section (cid:3) (cid:3)(cid:8)
`
`Also(cid:2) given a user query(cid:11)
`
`(cid:0) Issue the query to one or more sources at one or more
`resources (cid:7)Sections (cid:3) and (cid:3) (cid:3) (cid:8)
`
`(cid:0) Get the results from the multiple resources(cid:2) merge
`them(cid:2) and present them to the user (cid:7)Section (cid:3)(cid:8)
`
` Our Protocol Proposal
`
`In this section we de(cid:6)ne a protocol proposal that addresses
`the metasearch requirements of Section (cid:3)(cid:3) This proto(cid:5)
`col is meant for machine(cid:5)to(cid:5)machine communication(cid:11) users
`should not have to write queries using the proposed query
`language(cid:2) for instance(cid:3) Also(cid:2) all communication with the
`sources is sessionless in our protocol(cid:2) and the sources are
`stateless(cid:3) Finally(cid:2) we do not deal with any security issues(cid:2)
`or with error reporting in our proposal(cid:3) The main motiva(cid:5)
`tion behind these (cid:7)and many of the other(cid:8) decisions is to
`keep the protocol simple and easy to implement(cid:3)
`Our protocol does not describe an architecture for (cid:12)meta(cid:5)
`searching(cid:3)(cid:13) However(cid:2) it does describe the facilities that a
`source needs to provide in order to help a metasearcher(cid:3)
`The facilities provided by a source can range from simple to
`sophisticated(cid:2) and one of the key challenges in developing
`our protocol was in deciding the right level of sophistica(cid:5)
`tion(cid:3)
`In e(cid:4)ect(cid:2) metasearchers often have to search across
`simple sources as well as across sophisticated ones(cid:3) On the
`one hand(cid:2) it is important to have some agreed(cid:5)upon minimal
`functionality that is simple enough for all sources to com(cid:5)
`ply with(cid:3) On the other hand(cid:2) it is important to allow the
`more sophisticated sources to export their richer features(cid:3)
`Therefore(cid:2) our protocol keeps the requirements to a mini(cid:5)
`mum(cid:2) while it provides optional features that sophisticated
`sources can use if they wish(cid:3)
`Our protocol mainly deals with what information needs
`to be exchanged between sources and metasearchers (cid:7)e(cid:3)g(cid:3)(cid:2) a
`query(cid:2) a result set(cid:8)(cid:2) and not so much with how that infor(cid:5)
`mation is formatted (cid:7)e(cid:3)g(cid:3)(cid:2) using Harvest SOIFs (cid:8) or trans(cid:5)
`ported (cid:7)e(cid:3)g(cid:3)(cid:2) using HTTP(cid:8)(cid:3) Actually(cid:2) what transport to use
`generated some heated debate during the STARTS work(cid:5)
`shop(cid:3) Consequently(cid:2) we expect the STARTS information to
`be delivered in multiple ways in practice(cid:3) For concreteness(cid:2)
`the STARTS speci(cid:6)cation and examples that we give below
`use SOIFs just to illustrate how our content can be deliv(cid:5)
`ered(cid:3) However(cid:2) STARTS includes mechanisms to specify
`other formats for its contents(cid:3)
`
`SOIF objects are typed(cid:10) ASCII(cid:2)based encodings for structured
`objects(cid:17) see http(cid:2)(cid:3)(cid:3)harvest(cid:4)transarc(cid:4)com(cid:3)afs(cid:3)transarc(cid:4)com(cid:3)public(cid:3)(cid:5)
`trg(cid:3)Harvest(cid:3)user(cid:5)manual(cid:3)(cid:9)
`
`In this section we describe the basic features of the query
`language that a source should support(cid:3) To cover the func(cid:5)
`tionality o(cid:4)ered by most commercial search engines(cid:2) queries
`have both a Boolean component(cid:11) the (cid:3)lter expression(cid:2) and a
`vector(cid:5)space component(cid:11) the ranking expression(cid:3) (cid:7)See Sec(cid:5)
`tion (cid:3) (cid:3) (cid:3)(cid:8) Also(cid:2) queries have other associated properties
`that further specify the query results(cid:3) For example(cid:2) a query
`speci(cid:6)es the maximum number of documents that should be
`returned(cid:2) among other things(cid:3) (cid:7)See Section (cid:3) (cid:3)(cid:3)(cid:8)
`
`(cid:5) (cid:5) Filter and Ranking Expressions
`
`Queries have a (cid:6)lter expression (cid:7)the Boolean component(cid:8)
`and a ranking expression (cid:7)the vector(cid:5)space component(cid:8)(cid:3) The
`(cid:3)lter expression speci(cid:6)es some condition that must be sat(cid:5)
`is(cid:6)ed by every document in the query result (cid:7)e(cid:3)g(cid:3)(cid:2) all docu(cid:5)
`ments in the answer must have (cid:12)Ullman(cid:13) as one of the au(cid:5)
`thors(cid:8)(cid:3) The ranking expression speci(cid:6)es words that are de(cid:5)
`sired(cid:2) and imposes an order over the documents in the query
`result (cid:7)e(cid:3)g(cid:3)(cid:2) the documents in the answer will be ranked ac(cid:5)
`cording to how many times they contain the words (cid:12)dis(cid:5)
`tributed(cid:13) and (cid:12)databases(cid:13) in their body(cid:8)(cid:3)
`
`Example Consider the following query with (cid:3)lter expres(cid:2)
`sion(cid:4)
`
`(cid:10)(cid:10)author (cid:11)(cid:11)Ullman(cid:12)(cid:12)(cid:13) and (cid:10)title (cid:11)(cid:11)databases(cid:12)(cid:12)(cid:13)(cid:13)
`
`and ranking expression(cid:4)
`
`list(cid:10)(cid:10)body(cid:7)of(cid:7)text (cid:11)(cid:11)distributed(cid:12)(cid:12)(cid:13) (cid:10)body(cid:7)of(cid:7)text
`(cid:11)(cid:11)databases(cid:12)(cid:12)(cid:13)(cid:13)
`
`This query returns documents having (cid:5)Ullman(cid:6) as one of
`the authors and the word (cid:5)databases(cid:6) in their title(cid:7) The
`documents that match the (cid:3)lter expression are then ranked
`according to how well their text matches the words (cid:5)dis(cid:2)
`tributed(cid:6) and (cid:5)databases(cid:7)(cid:6)
`
`In principle(cid:2) a query need not contain a (cid:6)lter expression(cid:3)
`If this is the case(cid:2) we assume that all documents qualify for
`the answer(cid:2) and are ranked according to the ranking expres(cid:5)
`sion(cid:3) Similarly(cid:2) a query need not contain a ranking expres(cid:5)
`sion(cid:3) If this is the case(cid:2) the result of the query is the set
`of objects that match the (cid:7)Boolean(cid:8) (cid:6)lter expression(cid:3) Some
`search engines only support (cid:6)lter or ranking expressions(cid:2)
`but not both (cid:7)e(cid:3)g(cid:3)(cid:2) Glimpse only supports (cid:6)lter expressions(cid:8)(cid:3)
`Therefore(cid:2) we allow sources to support just one type of ex(cid:5)
`pression(cid:3) In this case(cid:2) the sources indicate (cid:7)Section (cid:3) (cid:3) (cid:8)
`what type they support as part of their metadata(cid:3)
`Both the (cid:6)lter and the ranking expressions may contain
`multiple terms(cid:3) The (cid:6)lter and ranking expressions com(cid:5)
`bine these terms with operators like (cid:12)and(cid:13) and (cid:12)or(cid:13)(cid:7)e(cid:3)g(cid:3)(cid:2)
`(cid:10)(cid:10)author (cid:11)(cid:11)Ullman(cid:12)(cid:12)(cid:13) and (cid:10)title (cid:11)(cid:11)databases(cid:12)(cid:12)(cid:13)(cid:13)(cid:8)(cid:3)
`The ranking expressions also combine terms using the (cid:12)list(cid:13)
`operator(cid:2) which simply groups together a set of terms(cid:2) as in
`Example (cid:3) Also(cid:2) the terms of a ranking expression may
`have a weight associated with them(cid:2) indicating their rela(cid:5)
`tive importance in the ranking expression(cid:3)
`In de(cid:6)ning the expressive power of the (cid:6)lter and ranking
`expressions we had to balance the needs of search engine
`builders and metasearchers(cid:3) On the one hand(cid:2) builders in
`general want powerful expressions(cid:2) so that all the features
`of their engine can be called upon(cid:3) On the other hand(cid:2)
`metasearchers want simpler (cid:6)lter and ranking expressions(cid:2)
`because they know that not all search engines support the
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2009-4
`
`

`

`same advanced features(cid:3) The simpler the (cid:6)lter and ranking
`expressions are(cid:2) the more likely it is that engines will have
`common features(cid:2) and the easier it will be to interoperate(cid:3)
`Also(cid:2) those metasearchers whose main market is Internet
`searching prefer simple expressions because most of their
`customers use simple queries(cid:3)
`In contrast(cid:2) search engine
`builders cater to a broader mix of customers(cid:3) Some of these
`customers require sophisticated query capabilities(cid:3)
`Next we de(cid:6)ne the (cid:6)lter and ranking expressions more
`precisely(cid:3) We start by de(cid:6)ning the l(cid:2)strings(cid:2) which are the
`basic building blocks for queries(cid:3) Then we show how these
`strings are adorned with (cid:6)elds and modi(cid:6)ers to build atomic
`terms(cid:3) Finally(cid:2) we describe how to construct complex (cid:6)lter
`and ranking expressions(cid:3)
`
`Atomic Terms
`
`One of the most heavily discussed issues in our workshop
`was how to support multiple languages and character sets(cid:3)
`Our initial design had not supported queries using multi(cid:5)
`ple character sets or languages(cid:3) However(cid:2) the search engine
`vendors felt strongly against this limitation(cid:3) So(cid:2) we decided
`early on in our workshop to include multi(cid:5)lingual(cid:27)character
`support(cid:2) but the question was how far to go(cid:3) For exam(cid:5)
`ple(cid:2) did we want to support a query asking for documents
`with the Spanish word (cid:12)taco(cid:13)(cid:26) Did we also want to handle
`queries asking for documents whose abstract was in French(cid:2)
`but that also included the English word (cid:12)weekend(cid:13)(cid:26) An(cid:5)
`other issue was how to handle dialects(cid:2) e(cid:3)g(cid:3)(cid:2) how to specify
`that a document is written(cid:2) say(cid:2) in British English vs(cid:3) in
`American English(cid:3)
`During the workshop we also discussed whether we could
`make the multi(cid:5)language support invisible to those who just
`wanted to submit English queries(

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket