`
`Luis Gravano
`
`Chen(cid:2)Chuan K(cid:3) Chang
`
`H(cid:4)ector Garc(cid:4)(cid:5)a(cid:2)Molina
`
`Computer Science Department
`
`Computer Science Department
`
`Computer Science Department
`
`Stanford University
`
`Stanford University
`
`Stanford University
`
`gravano(cid:2)cs(cid:3)stanford(cid:3)edu
`
`kevin(cid:2)db(cid:3)stanford(cid:3)edu
`
`hector(cid:2)cs(cid:3)stanford(cid:3)edu
`
`Andreas Paepcke
`
`Computer Science Department
`
`Stanford University
`
`paepcke(cid:2)db(cid:3)stanford(cid:3)edu
`
`Abstract
`
`Document sources are available everywhere(cid:2) both within the
`internal networks of organizations and on the Internet(cid:3) Even
`individual organizations use search engines from di(cid:4)erent
`vendors to index their internal document collections(cid:3) These
`search engines are typically incompatible in that they sup(cid:5)
`port di(cid:4)erent query models and interfaces(cid:2) they do not re(cid:5)
`turn enough information with the query results for adequate
`merging of the results(cid:2) and (cid:6)nally(cid:2) in that they do not ex(cid:5)
`port metadata about the collections that they index (cid:7)e(cid:3)g(cid:3)(cid:2) to
`assist in resource discovery(cid:8)(cid:3) This paper describes STARTS(cid:2)
`an emerging protocol for Internet retrieval and search that
`facilitates the task of querying multiple document sources(cid:3)
`STARTS has been developed in a unique way(cid:3) It is not a
`standard(cid:2) but a group e(cid:4)ort coordinated by Stanford(cid:9)s Dig(cid:5)
`ital Library project(cid:2) and involving over companies and
`organizations(cid:3) The objective of this paper is not only to
`give an overview of the STARTS protocol proposal(cid:2) but also
`to discuss the process that led to its de(cid:6)nition(cid:3)
`
` Introduction
`
`Document sources are available everywhere(cid:2) both within the
`internal networks of organizations and on the Internet(cid:3) The
`source contents are often hidden behind search interfaces
`and models that vary from source to source(cid:3) Even individual
`organizations use search engines from di(cid:4)erent vendors to
`index their internal document collections(cid:3) These organiza(cid:5)
`tions can bene(cid:6)t from metasearchers(cid:2) which are services that
`provide uni(cid:6)ed query interfaces to multiple search engines(cid:3)
`Thus(cid:2) users have the illusion of a single combined document
`
`(cid:0) This material is based upon work supported by the National Sci(cid:2)
`ence Foundation under Cooperative Agreement IRI(cid:2) (cid:9) Funding
`for this cooperative agreement is also provided by DARPA(cid:10) NASA(cid:10)
`and the industrial partners of the Stanford Digital Libraries Project(cid:9)
`Any opinions(cid:10) (cid:11)nding(cid:10) and conclusions or recommendations expressed
`in this material are those of the author(cid:12)s(cid:13) and do not necessarily
`re(cid:14)ect the views of the National Science Foundation or the other
`sponsors(cid:9)
`
`source(cid:3) This paper describes STARTS (cid:2) an emerging pro(cid:5)
`tocol for Internet retrieval and search(cid:3) The goal of STARTS
`is to facilitate the main three tasks that a metasearcher per(cid:5)
`forms(cid:11)
`
`(cid:0) Choosing the best sources to evaluate a query
`
`(cid:0) Evaluating the query at these sources
`
`(cid:0) Merging the query results from these sources
`
`STARTS has been developed in a unique way(cid:3) It is not
`a standard(cid:2) but a group e(cid:4)ort involving over companies
`and organizations(cid:3) The objective of this paper is not only to
`give an overview of the STARTS protocol proposal(cid:2) but also
`to discuss the process that led to its de(cid:6)nition(cid:3) In particular(cid:2)
`
`(cid:0) We will describe the history of the project(cid:2) including
`the current status of a reference implementation(cid:2) and
`will highlight some of the existing (cid:12)tensions(cid:13) between
`information providers and search engine builders (cid:7)Sec(cid:5)
`tions and (cid:8)(cid:3)
`
`(cid:0) We will explain the protocol(cid:2) together with some of
`the tradeo(cid:4)s and compromises that we had to make in
`its design (cid:7)Section (cid:8)(cid:3)
`
`(cid:0) We will comment on other work that is closely related
`to STARTS (cid:7)Section (cid:8)(cid:3)
`
` History of our Proposal
`
`The Digital Library project at Stanford coordinated search
`engine vendors and other key players to informally design
`a protocol that would allow searching and retrieval of in(cid:5)
`formation from distributed and heterogeneous sources(cid:3) We
`were initially contacted by Steve Kirsch(cid:2) president of In(cid:5)
`foseek (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)infoseek(cid:4)com(cid:8)(cid:2) in June (cid:3) His idea
`was that Stanford should collect the views of the search en(cid:5)
`gine vendors on how to address the problem at hand(cid:3) Then
`Stanford(cid:2) acting as an unbiased party(cid:2) would design a pro(cid:5)
`tocol proposal that would reconcile the vendors(cid:9) ideas(cid:3) The
`key motivation behind this informal procedure was to avoid
`the long delays usually involved in the de(cid:6)nition of formal
`standards(cid:3)
`In July(cid:2) (cid:2) we started our e(cid:4)ort with (cid:6)ve compa(cid:5)
`nies(cid:11) Fulcrum (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)fulcrum(cid:4)com(cid:8)(cid:2) Infoseek(cid:2) PLS
`
` STARTS stands for (cid:15)Stanford Protocol Proposal for Internet
`Retrieval and Search(cid:9)(cid:16)
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2009-1
`
`
`
`(cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)pls(cid:4)com(cid:8)(cid:2) Verity (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)verity(cid:4)com(cid:8)(cid:2)
`and WAIS(cid:3) Microsoft Network (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)msn(cid:4)com(cid:8) joined
`the initial group in November(cid:3) We circulated a preliminary
`draft describing the main three problems that we wanted to
`address (cid:7)i(cid:3)e(cid:3)(cid:2) choosing the best sources for a query(cid:2) evaluat(cid:5)
`ing the query at these sources(cid:2) and merging the query results
`from the sources(cid:8)(cid:3) We scheduled meetings with people from
`the companies to discuss these problems and get feedback(cid:3)
`We met individually with each company between December(cid:2)
` (cid:2) and February(cid:2) (cid:3) During each meeting(cid:2) we would
`show a couple of slides for each problem to agree on its def(cid:5)
`inition(cid:2) terminology(cid:2) etc(cid:3) After this(cid:2) we would discuss the
`possible solutions for each problem in detail(cid:3)
`Based on the comments and suggestions that we received(cid:2)
`we produced a (cid:6)rst draft of our proposal by March(cid:2) (cid:3)
`We then produced two revisions of this draft using feed(cid:5)
`back from the original companies(cid:2) plus other organizations
`that started participating(cid:3) Among these companies and
`organizations are Excite (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)excite(cid:4)com(cid:8)(cid:2) GILS
`(cid:7)http(cid:2)(cid:3)(cid:3)info(cid:4)er(cid:4)usgs(cid:4)gov(cid:2) (cid:3)gils(cid:3)(cid:8)(cid:2) Harvest (cid:7)http(cid:2)(cid:3)(cid:3)(cid:7)
`harvest(cid:4)transarc(cid:4)com(cid:8)(cid:2) Hewlett(cid:5)Packard Laboratories
`(cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)hpl(cid:4)hp(cid:4)com(cid:8)(cid:2) and Netscape (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)(cid:7)
`netscape(cid:4)com(cid:8)(cid:3) Finally(cid:2) we held a workshop at Stanford
`with the major participants on August st(cid:2) (cid:7)http(cid:2)(cid:3)(cid:3)(cid:7)
`www(cid:7)db(cid:4)stanford(cid:4)edu(cid:3)(cid:8)gravano(cid:3)workshop participants(cid:4)(cid:7)
`html(cid:8)(cid:3) The goal of this one(cid:5)day workshop was to iron out
`the controversial aspects of the proposal(cid:2) and to get feedback
`for its (cid:6)nal draft (cid:20) (cid:21)(cid:3)
`De(cid:6)ning STARTS has been a very interesting experience(cid:11)
`we wanted to design a protocol that would be simple(cid:2) yet
`powerful enough to allow us to address the three problems at
`hand(cid:3) We could have adopted a (cid:12)least common denomina(cid:5)
`tor(cid:13) approach for our solution(cid:3) However(cid:2) many interesting
`interactions would have been impossible under such a solu(cid:5)
`tion(cid:3) Alternatively(cid:2) we could have incorporated the sophis(cid:5)
`ticated features that the search engines provide(cid:2) but that
`also would have challenged interoperability(cid:2) and would have
`driven us away from simplicity(cid:3) Consequently(cid:2) we had to
`walk a very (cid:6)ne line(cid:2) trying to (cid:6)nd a solution that would be
`expressible enough(cid:2) but not too complicated or impossible
`to quickly implement by the search engine vendors(cid:3)
`Another aspect that made the experience challenging was
`dealing with companies that have secret(cid:2) proprietary algo(cid:5)
`rithms(cid:2) as those for ranking documents(cid:3) (cid:7)See Section (cid:3)(cid:3)(cid:8)
`Obviously(cid:2) we could not ask the companies to reveal these
`algorithms(cid:3) However(cid:2) we still needed to have them export
`enough information so that a metasearcher could do some(cid:5)
`thing useful with the query results(cid:3)
`As mentioned above(cid:2) the STARTS(cid:5) (cid:3) speci(cid:6)cation is al(cid:5)
`ready completed(cid:3) A reference implementation of the proto(cid:5)
`col has been built at Cornell University by Carl Lagoze(cid:3) (cid:7)See
`http(cid:2)(cid:3)(cid:3)www(cid:7)diglib(cid:4)stanford(cid:4)edu for information(cid:3)(cid:8) Also(cid:2)
`the Z (cid:3) community is designing a pro(cid:6)le of their Z (cid:3) (cid:5)
` standard based on STARTS(cid:3) (cid:7)This pro(cid:6)le was origi(cid:5)
`nally called ZSTARTS(cid:2) but has since changed its name to
`ZDSR(cid:2) for Z (cid:3) Pro(cid:6)le for Simple Distributed Search and
`Ranked Retrieval(cid:3)(cid:8) Finally(cid:2) we will try to (cid:6)nd a sponsor to
`present STARTS under the World(cid:5)Wide Web Consortium
`(cid:7)W C(cid:8)(cid:2) so that a formal standard can emerge from it(cid:3)
`Our goal in presenting this paper at the SIGMOD con(cid:5)
`ference is to also get the SIGMOD community involved in
`this e(cid:4)ort(cid:3) We believe that the Internet has become cen(cid:5)
`tral to (cid:12)management of data(cid:13) (cid:7)the MOD in SIGMOD(cid:8)(cid:2) and
`searching and resource discovery across the Internet is one
`of the most important problems(cid:3)
`
`Source 1
`
`Query
`
`Client
`
`Results
`
`Source 2
`
`Resource
`
`Figure (cid:11) A metasearcher queries a source(cid:2) and may specify
`that the query be evaluated at several sources at the same
`resource(cid:3)
`
` Our Metasearch Model and its Associated Problems
`
`In this section we describe the basic metasearch model un(cid:5)
`derlying our proposal(cid:2) and the three main problems that a
`metasearcher faces today(cid:3) These problems motivated our
`e(cid:4)ort(cid:3)
`For the purpose of the STARTS protocol(cid:2) we view the
`Internet as a potentially large number of resources (cid:7)e(cid:3)g(cid:3)(cid:2)
`Knight(cid:5)Ridder(cid:9)s Dialog information service(cid:2) or the CS(cid:5)TR
`sources (cid:8)(cid:3) Each resource consists of one or more sources
`(cid:7)Figure (cid:8)(cid:3) A source is a collection of text documents (cid:7)e(cid:3)g(cid:3)(cid:2)
`Inspec and the Computer Database in the Dialog resource(cid:8)(cid:2)
`with an associated search engine that accepts queries from
`clients and produces results(cid:3) We assume that documents are
`(cid:12)(cid:23)at(cid:2)(cid:13) in the sense that we do not(cid:2) for example(cid:2) allow any
`nesting of documents(cid:3) We do not consider non(cid:5)textual doc(cid:5)
`uments or data either (cid:7)e(cid:3)g(cid:3)(cid:2) geographical data(cid:8) to keep the
`protocol simple(cid:3) Sources may be (cid:12)small(cid:13) (cid:7)e(cid:3)g(cid:3)(cid:2) the collection
`of papers written by some university professor(cid:8) or (cid:12)large(cid:13)
`(cid:7)e(cid:3)g(cid:3)(cid:2) the collection of World(cid:5)Wide Web pages indexed by a
`crawler(cid:8)(cid:3)
`As described in the Introduction(cid:2) a metasearcher (cid:7)or any
`end client(cid:2) in general(cid:8) would typically issue queries to multi(cid:5)
`ple sources(cid:2) for which it needs to perform three main tasks(cid:3)
`First(cid:2) the metasearcher chooses the best sources to evalu(cid:5)
`ate a query(cid:3) Then(cid:2) it submits the query to these sources(cid:3)
`Finally(cid:2) it merges the results from the sources and presents
`them to the user that issued the query(cid:3) To query multiple
`sources within the same resource(cid:2) the metasearcher issues
`the query to one of the sources at the resource (cid:7)Source
`in Figure (cid:8)(cid:2) specifying the other (cid:12)local(cid:13) sources where to
`also evaluate the query (cid:7)Source in Figure (cid:8)(cid:3) This way(cid:2)
`the resource can eliminate duplicate documents from the
`query result(cid:2) for example(cid:2) which would be di(cid:24)cult for the
`metasearcher to do if it queried all of the sources indepen(cid:5)
`dently(cid:3)
`Building metasearchers is nowadays a hard task because
`di(cid:4)erent search engines are largely incompatible and do not
`allow for interoperability(cid:3) In general(cid:2) text search engines(cid:11)
`
`(cid:0) Use di(cid:4)erent query languages (cid:7)the query(cid:2)language prob(cid:2)
`lem(cid:25) Section (cid:3) (cid:8)
`
`(cid:0) Rank documents in the query results using secret al(cid:5)
`gorithms (cid:7)the rank(cid:2)merging problem(cid:25) Section (cid:3)(cid:8)
`
`(cid:0) Do not export information about the sources in a stan(cid:5)
`dard form (cid:7)the source(cid:2)metadata problem(cid:25) Section (cid:3) (cid:8)
`
`The CS(cid:2)TR sources constitute an emerging library of Computer
`Science Technical Reports (cid:12)http(cid:2)(cid:3)(cid:3)www(cid:4)ncstrl(cid:4)org(cid:13)(cid:9)
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2009-2
`
`
`
`Below we visit each of these metasearch problems(cid:3) The
`discussion will illustrate the need for an agreement between
`search engine vendors so that metasearchers can work ef(cid:5)
`fectively(cid:3) Finally(cid:2) Section (cid:3) summarizes the metasearch
`requirements that should be facilitated by the agreement(cid:3)
`
` (cid:5) The Query(cid:6)Language Problem
`
`A metasearcher submits queries over multiple sources(cid:3) But
`the interfaces and capabilities of these sources may vary dra(cid:5)
`matically(cid:3) Even the basic query model that the sources sup(cid:5)
`port may vary(cid:3)
`Some search engines (cid:7)e(cid:3)g(cid:3)(cid:2) Glimpse(cid:8) only support the
`Boolean retrieval model (cid:20)(cid:21)(cid:3)
`In this model(cid:2) a query is a
`condition that documents either do or do not satisfy(cid:3) The
`query result is then a set of documents(cid:3) For example(cid:2) a
`query (cid:12)distributed and systems(cid:13) returns all documents
`that contain both the words (cid:12)distributed(cid:13) and (cid:12)systems(cid:13) in
`them(cid:3)
`Alternatively(cid:2) most commercial search engines also sup(cid:5)
`port some variation of the vector(cid:2)space retrieval model (cid:20)(cid:21)(cid:3)
`In this model(cid:2) a query is a list of terms(cid:2) and documents are
`assigned a score according to how similar they are to the
`query(cid:3) The query result is then a rank of documents(cid:3) For
`example(cid:2) a query (cid:12)distributed systems(cid:13) returns a rank of
`documents that is typically based on the number of occur(cid:5)
`rences of the words (cid:12)distributed(cid:13) and (cid:12)systems(cid:13) in them(cid:3)
`A document in the query result might contain the word (cid:12)dis(cid:5)
`tributed(cid:13) but not the word (cid:12)systems(cid:2)(cid:13) for example(cid:2) or vice
`versa(cid:2) unlike in the Boolean(cid:5)model case above(cid:3)
`Even if two sources support a Boolean retrieval model(cid:2)
`their query syntax often di(cid:4)er(cid:3) A query asking for docu(cid:5)
`ments with the words (cid:12)distributed(cid:13) and (cid:12)systems(cid:13) might
`be expressed as (cid:12)distributed and systems(cid:13) in one source(cid:2)
`and as (cid:12)(cid:9)distributed (cid:9)systems(cid:13) in another(cid:2) for example(cid:3)
`More serious problems appear if di(cid:4)erent (cid:6)elds (cid:7)e(cid:3)g(cid:3)(cid:2)
`abstract(cid:8) are available for searching at di(cid:4)erent sources(cid:3)
`For example(cid:2) a source might support queries like (cid:10)abstract
`(cid:11)(cid:11)databases(cid:12)(cid:12)(cid:13) that ask for documents that have the word
`(cid:12)databases(cid:13) in their abstract(cid:2) whereas some other sources
`might not support the abstract (cid:6)eld for querying(cid:3)
`Another complication results from di(cid:4)erent stemming al(cid:5)
`gorithms or stop(cid:5)word lists being implicit in the query model
`of each source(cid:3) (cid:7)Stemming is used to make a query on (cid:12)sys(cid:5)
`tems(cid:13) also retrieve documents on (cid:12)system(cid:2)(cid:13) for example(cid:3)
`Stop words are used to not process words like (cid:12)the(cid:13) in the
`queries(cid:2) for example(cid:3)(cid:8) If a user wants documents about the
`rock group (cid:12)The Who(cid:2)(cid:13) knowing about the stop(cid:5)word be(cid:5)
`havior of the sources would allow a metasearcher(cid:2) for exam(cid:5)
`ple(cid:2) to know whether it is possible to disallow the elimination
`of stop words from queries at each source(cid:3)
`As a result of all this heterogeneity(cid:2) a metasearcher would
`have to translate the original query to adjust it to each
`source(cid:9)s syntax(cid:3) To do this translation(cid:2) the metasearcher
`needs to know the characteristics of each source(cid:3) (cid:7)The work
`in (cid:20) (cid:2) (cid:21) illustrates the complexities involved in query trans(cid:5)
`lation(cid:3)(cid:8) As we will see in Section (cid:3) (cid:2) querying multiple
`sources is much easier if the sources support some common
`query language(cid:3) Even if support for most of this language is
`optional(cid:2) query translation is much simpler if sources reveal
`what portions of the language they support(cid:3)
`
` These ranks also typically depend on other factors(cid:10) like the num(cid:2)
`ber of documents in the source that contain the query words(cid:10) for
`example(cid:9)
`
` (cid:5) The Rank(cid:6)Merging Problem
`
`A source that supports the vector(cid:5)space retrieval model ranks
`its documents according to how (cid:12)similar(cid:13) the documents
`and a given query are(cid:3) Unfortunately(cid:2) there are many ways
`to compute these similarities(cid:3) To make matters more com(cid:5)
`plicated(cid:2) the ranking algorithms are usually proprietary to
`the search engine vendors(cid:2) and their details are not publicly
`available(cid:3)
`Merging query results from sources that use di(cid:4)erent and
`unknown ranking algorithms is hard(cid:3) (cid:7)See (cid:20)(cid:2) (cid:21) for algo(cid:5)
`rithms for merging multiple document ranks(cid:3)(cid:8) For example(cid:2)
`source S might report that document d has a score of (cid:3)
`for some query(cid:2) while source S might report that docu(cid:5)
`ment d has a score of (cid:2) for the same query(cid:3) If we want
`to merge the results from S and S into a single document
`rank(cid:2) should we rank d higher than d(cid:2) or vice versa(cid:26) (cid:7)Some
`search engines are designed so that the top document for a
`query always has a score of(cid:2) say(cid:2) (cid:2) (cid:3)(cid:8)
`It is even hard to merge query results from sources that
`use the same ranking algorithm(cid:2) even if we know this algo(cid:5)
`rithm(cid:3) The reason is that the algorithm might rank doc(cid:5)
`uments di(cid:4)erently based on the collection where the doc(cid:5)
`ument appears(cid:3) For example(cid:2) if a source S specializes in
`computer science(cid:2) the word databases might appear in many
`of its documents(cid:3) Then(cid:2) this word will tend to have a low
`associated weight in S (cid:7)e(cid:3)g(cid:3)(cid:2) if S uses the tf(cid:2)idf formula for
`computing weights (cid:20)(cid:21)(cid:8)(cid:3) The word databases(cid:2) on the other
`hand(cid:2) might have a high associated weight in a sourceS
`that is totally unrelated to computer science and contains
`very few documents with that word(cid:3) Consequently(cid:2) S might
`assign its documents a low score for a query containing the
`word databases(cid:2) while S assigns a few documents a high
`score for that query(cid:3) Therefore(cid:2) it is possible for two very
`similar documents d and d to receive very di(cid:4)erent scores
`for a given query(cid:2) ifd appears in S and d appears in S(cid:3)
`Thus(cid:2) even if the sources use the same ranking algorithm(cid:2)
`a metasearcher still needs additional information to merge
`query results in a meaningful way(cid:3)
`
` (cid:5) The Source(cid:6)Metadata Problem
`
`A metasearcher might have thousands of sources available
`for querying(cid:3) Some of these sources might charge for their
`use(cid:3) Some of the sources might have large response times(cid:3)
`Therefore(cid:2) it becomes crucial that the metasearcher just con(cid:5)
`tact sources that might contain useful documents for a given
`query(cid:3) The metasearcher then needs information about each
`source(cid:9)s contents(cid:3)
`Some sources freely deliver their entire document col(cid:5)
`lection(cid:2) whereas others do not(cid:3) Often(cid:2) those sources that
`have for(cid:5)pay information are of the second type(cid:3) If a source
`exports all of its contents (cid:7)e(cid:3)g(cid:3)(cid:2) many World(cid:5)Wide Web
`sites(cid:8)(cid:2) then it is not as critical to have it describe its col(cid:5)
`lection to the metasearchers(cid:3) After all(cid:2) the metasearchers
`can just grab all of the sources(cid:9) contents and summarize
`them any way they want(cid:3) This is what (cid:12)crawlers(cid:13) like
`AltaVista (cid:7)http(cid:2)(cid:3)(cid:3)www(cid:4)altavista(cid:4)digital(cid:4)com(cid:8) do(cid:3) How(cid:5)
`ever(cid:2) for performance reasons(cid:2) it may still be useful to re(cid:5)
`quire that such sources export a more succinct description of
`themselves(cid:3) In contrast(cid:2) if a source (cid:12)hides(cid:13) its information
`(cid:7)e(cid:3)g(cid:3)(cid:2) through a search interface(cid:8)(cid:2) then it is even more impor(cid:5)
`tant that the source can describe its contents(cid:3) Otherwise(cid:2)
`if a source does not export any kind of content summary(cid:2)
`it becomes hard for a metasearcher to assess what kind of
`information the source covers(cid:3)
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2009-3
`
`
`
` (cid:5) Metasearch Requirements
`
`(cid:5) Query Language
`
`In summary(cid:2) a sophisticated metasearcher will need to per(cid:5)
`form the following tasks in order to e(cid:24)ciently query multiple
`resources(cid:11)
`
`(cid:0) Extract the list of sources from the resources periodi(cid:5)
`cally (cid:7)to (cid:6)nd out what sources are available for query(cid:5)
`ing(cid:8) (cid:7)Section (cid:3) (cid:3) (cid:8)
`
`(cid:0) Extract metadata and content summaries from the
`sources periodically (cid:7)to be able to decide what sources
`are potentially useful for a given query(cid:8) (cid:7)Section (cid:3) (cid:3)(cid:8)
`
`Also(cid:2) given a user query(cid:11)
`
`(cid:0) Issue the query to one or more sources at one or more
`resources (cid:7)Sections (cid:3) and (cid:3) (cid:3) (cid:8)
`
`(cid:0) Get the results from the multiple resources(cid:2) merge
`them(cid:2) and present them to the user (cid:7)Section (cid:3)(cid:8)
`
` Our Protocol Proposal
`
`In this section we de(cid:6)ne a protocol proposal that addresses
`the metasearch requirements of Section (cid:3)(cid:3) This proto(cid:5)
`col is meant for machine(cid:5)to(cid:5)machine communication(cid:11) users
`should not have to write queries using the proposed query
`language(cid:2) for instance(cid:3) Also(cid:2) all communication with the
`sources is sessionless in our protocol(cid:2) and the sources are
`stateless(cid:3) Finally(cid:2) we do not deal with any security issues(cid:2)
`or with error reporting in our proposal(cid:3) The main motiva(cid:5)
`tion behind these (cid:7)and many of the other(cid:8) decisions is to
`keep the protocol simple and easy to implement(cid:3)
`Our protocol does not describe an architecture for (cid:12)meta(cid:5)
`searching(cid:3)(cid:13) However(cid:2) it does describe the facilities that a
`source needs to provide in order to help a metasearcher(cid:3)
`The facilities provided by a source can range from simple to
`sophisticated(cid:2) and one of the key challenges in developing
`our protocol was in deciding the right level of sophistica(cid:5)
`tion(cid:3)
`In e(cid:4)ect(cid:2) metasearchers often have to search across
`simple sources as well as across sophisticated ones(cid:3) On the
`one hand(cid:2) it is important to have some agreed(cid:5)upon minimal
`functionality that is simple enough for all sources to com(cid:5)
`ply with(cid:3) On the other hand(cid:2) it is important to allow the
`more sophisticated sources to export their richer features(cid:3)
`Therefore(cid:2) our protocol keeps the requirements to a mini(cid:5)
`mum(cid:2) while it provides optional features that sophisticated
`sources can use if they wish(cid:3)
`Our protocol mainly deals with what information needs
`to be exchanged between sources and metasearchers (cid:7)e(cid:3)g(cid:3)(cid:2) a
`query(cid:2) a result set(cid:8)(cid:2) and not so much with how that infor(cid:5)
`mation is formatted (cid:7)e(cid:3)g(cid:3)(cid:2) using Harvest SOIFs (cid:8) or trans(cid:5)
`ported (cid:7)e(cid:3)g(cid:3)(cid:2) using HTTP(cid:8)(cid:3) Actually(cid:2) what transport to use
`generated some heated debate during the STARTS work(cid:5)
`shop(cid:3) Consequently(cid:2) we expect the STARTS information to
`be delivered in multiple ways in practice(cid:3) For concreteness(cid:2)
`the STARTS speci(cid:6)cation and examples that we give below
`use SOIFs just to illustrate how our content can be deliv(cid:5)
`ered(cid:3) However(cid:2) STARTS includes mechanisms to specify
`other formats for its contents(cid:3)
`
`SOIF objects are typed(cid:10) ASCII(cid:2)based encodings for structured
`objects(cid:17) see http(cid:2)(cid:3)(cid:3)harvest(cid:4)transarc(cid:4)com(cid:3)afs(cid:3)transarc(cid:4)com(cid:3)public(cid:3)(cid:5)
`trg(cid:3)Harvest(cid:3)user(cid:5)manual(cid:3)(cid:9)
`
`In this section we describe the basic features of the query
`language that a source should support(cid:3) To cover the func(cid:5)
`tionality o(cid:4)ered by most commercial search engines(cid:2) queries
`have both a Boolean component(cid:11) the (cid:3)lter expression(cid:2) and a
`vector(cid:5)space component(cid:11) the ranking expression(cid:3) (cid:7)See Sec(cid:5)
`tion (cid:3) (cid:3) (cid:3)(cid:8) Also(cid:2) queries have other associated properties
`that further specify the query results(cid:3) For example(cid:2) a query
`speci(cid:6)es the maximum number of documents that should be
`returned(cid:2) among other things(cid:3) (cid:7)See Section (cid:3) (cid:3)(cid:3)(cid:8)
`
`(cid:5) (cid:5) Filter and Ranking Expressions
`
`Queries have a (cid:6)lter expression (cid:7)the Boolean component(cid:8)
`and a ranking expression (cid:7)the vector(cid:5)space component(cid:8)(cid:3) The
`(cid:3)lter expression speci(cid:6)es some condition that must be sat(cid:5)
`is(cid:6)ed by every document in the query result (cid:7)e(cid:3)g(cid:3)(cid:2) all docu(cid:5)
`ments in the answer must have (cid:12)Ullman(cid:13) as one of the au(cid:5)
`thors(cid:8)(cid:3) The ranking expression speci(cid:6)es words that are de(cid:5)
`sired(cid:2) and imposes an order over the documents in the query
`result (cid:7)e(cid:3)g(cid:3)(cid:2) the documents in the answer will be ranked ac(cid:5)
`cording to how many times they contain the words (cid:12)dis(cid:5)
`tributed(cid:13) and (cid:12)databases(cid:13) in their body(cid:8)(cid:3)
`
`Example Consider the following query with (cid:3)lter expres(cid:2)
`sion(cid:4)
`
`(cid:10)(cid:10)author (cid:11)(cid:11)Ullman(cid:12)(cid:12)(cid:13) and (cid:10)title (cid:11)(cid:11)databases(cid:12)(cid:12)(cid:13)(cid:13)
`
`and ranking expression(cid:4)
`
`list(cid:10)(cid:10)body(cid:7)of(cid:7)text (cid:11)(cid:11)distributed(cid:12)(cid:12)(cid:13) (cid:10)body(cid:7)of(cid:7)text
`(cid:11)(cid:11)databases(cid:12)(cid:12)(cid:13)(cid:13)
`
`This query returns documents having (cid:5)Ullman(cid:6) as one of
`the authors and the word (cid:5)databases(cid:6) in their title(cid:7) The
`documents that match the (cid:3)lter expression are then ranked
`according to how well their text matches the words (cid:5)dis(cid:2)
`tributed(cid:6) and (cid:5)databases(cid:7)(cid:6)
`
`In principle(cid:2) a query need not contain a (cid:6)lter expression(cid:3)
`If this is the case(cid:2) we assume that all documents qualify for
`the answer(cid:2) and are ranked according to the ranking expres(cid:5)
`sion(cid:3) Similarly(cid:2) a query need not contain a ranking expres(cid:5)
`sion(cid:3) If this is the case(cid:2) the result of the query is the set
`of objects that match the (cid:7)Boolean(cid:8) (cid:6)lter expression(cid:3) Some
`search engines only support (cid:6)lter or ranking expressions(cid:2)
`but not both (cid:7)e(cid:3)g(cid:3)(cid:2) Glimpse only supports (cid:6)lter expressions(cid:8)(cid:3)
`Therefore(cid:2) we allow sources to support just one type of ex(cid:5)
`pression(cid:3) In this case(cid:2) the sources indicate (cid:7)Section (cid:3) (cid:3) (cid:8)
`what type they support as part of their metadata(cid:3)
`Both the (cid:6)lter and the ranking expressions may contain
`multiple terms(cid:3) The (cid:6)lter and ranking expressions com(cid:5)
`bine these terms with operators like (cid:12)and(cid:13) and (cid:12)or(cid:13)(cid:7)e(cid:3)g(cid:3)(cid:2)
`(cid:10)(cid:10)author (cid:11)(cid:11)Ullman(cid:12)(cid:12)(cid:13) and (cid:10)title (cid:11)(cid:11)databases(cid:12)(cid:12)(cid:13)(cid:13)(cid:8)(cid:3)
`The ranking expressions also combine terms using the (cid:12)list(cid:13)
`operator(cid:2) which simply groups together a set of terms(cid:2) as in
`Example (cid:3) Also(cid:2) the terms of a ranking expression may
`have a weight associated with them(cid:2) indicating their rela(cid:5)
`tive importance in the ranking expression(cid:3)
`In de(cid:6)ning the expressive power of the (cid:6)lter and ranking
`expressions we had to balance the needs of search engine
`builders and metasearchers(cid:3) On the one hand(cid:2) builders in
`general want powerful expressions(cid:2) so that all the features
`of their engine can be called upon(cid:3) On the other hand(cid:2)
`metasearchers want simpler (cid:6)lter and ranking expressions(cid:2)
`because they know that not all search engines support the
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2009-4
`
`
`
`same advanced features(cid:3) The simpler the (cid:6)lter and ranking
`expressions are(cid:2) the more likely it is that engines will have
`common features(cid:2) and the easier it will be to interoperate(cid:3)
`Also(cid:2) those metasearchers whose main market is Internet
`searching prefer simple expressions because most of their
`customers use simple queries(cid:3)
`In contrast(cid:2) search engine
`builders cater to a broader mix of customers(cid:3) Some of these
`customers require sophisticated query capabilities(cid:3)
`Next we de(cid:6)ne the (cid:6)lter and ranking expressions more
`precisely(cid:3) We start by de(cid:6)ning the l(cid:2)strings(cid:2) which are the
`basic building blocks for queries(cid:3) Then we show how these
`strings are adorned with (cid:6)elds and modi(cid:6)ers to build atomic
`terms(cid:3) Finally(cid:2) we describe how to construct complex (cid:6)lter
`and ranking expressions(cid:3)
`
`Atomic Terms
`
`One of the most heavily discussed issues in our workshop
`was how to support multiple languages and character sets(cid:3)
`Our initial design had not supported queries using multi(cid:5)
`ple character sets or languages(cid:3) However(cid:2) the search engine
`vendors felt strongly against this limitation(cid:3) So(cid:2) we decided
`early on in our workshop to include multi(cid:5)lingual(cid:27)character
`support(cid:2) but the question was how far to go(cid:3) For exam(cid:5)
`ple(cid:2) did we want to support a query asking for documents
`with the Spanish word (cid:12)taco(cid:13)(cid:26) Did we also want to handle
`queries asking for documents whose abstract was in French(cid:2)
`but that also included the English word (cid:12)weekend(cid:13)(cid:26) An(cid:5)
`other issue was how to handle dialects(cid:2) e(cid:3)g(cid:3)(cid:2) how to specify
`that a document is written(cid:2) say(cid:2) in British English vs(cid:3) in
`American English(cid:3)
`During the workshop we also discussed whether we could
`make the multi(cid:5)language support invisible to those who just
`wanted to submit English queries(