United States Patent [19]
Voorhees et al.

US005864846A
[11] Patent Number: 5,864,846
[45] Date of Patent: Jan. 26, 1999
`
[54] METHOD FOR FACILITATING WORLD WIDE WEB SEARCHES UTILIZING A DOCUMENT DISTRIBUTION FUSION STRATEGY

[75] Inventors: Ellen M. Voorhees, North Potomac, Md.; Narendra K. Gupta, Dayton, N.J.

[73] Assignee: Siemens Corporate Research, Inc., Princeton, N.J.

[21] Appl. No.: 674,646

[22] Filed: Jun. 28, 1996

[51] Int. Cl.6 .............. G06F 17/30

[52] U.S. Cl. .............. 707/5; 707/2; 707/3; 707/4

[58] Field of Search .............. 707/3, 5, 2, 4
`
[56] References Cited

U.S. PATENT DOCUMENTS

4,823,306   4/1989  Barbic et al. .......... 707/6
5,404,514   4/1995  Kageneck et al. ........ 707/5
5,442,778   8/1995  Pedersen et al. ........ 707/5
5,576,954  11/1996  Driscoll ............... 707/3
5,642,502   6/1997  Driscoll ............... 707/5
5,659,732   8/1997  Kirsch ................. 707/5
5,675,819  10/1997  Schuetze ............... 1/1
5,706,497   1/1998  Takahashi et al. ....... 707/5
`
OTHER PUBLICATIONS

Bartell et al., "Automatic Combination of Multiple Ranked Retrieval Systems", Proceedings of SIGIR '94, Jul. 1994, pp. 173-181.
Belkin et al., "The Effect of Multiple Query Representations on Information System Performance", Proceedings of SIGIR '93, Jun. 1993, pp. 339-346.
Fox et al., "Combination of Multiple Searches", Proceedings of TREC-3, Apr. 1995, pp. 105-108.
Steinberg, "Seek and Ye Shall Find (Maybe)", Wired, vol. 4, no. 5, May 1996, pp. 1-18.
QuarterDeck, URL: http://arachnid.qdeck.com/qdeck/products/webcompass.
Towell et al., "Learning Collection Fusion Strategies for Information Retrieval", Proceedings of the 12th Annual Machine Learning Conference, Jul. 1995, pp. 540-548.
Voorhees et al., "The Collection Fusion Problem", Proceedings of TREC-3, NIST Special Publication 500-225, Apr. 1995, pp. 95-104.
Voorhees et al., "Learning Collection Fusion Strategies", Proceedings of SIGIR '95, Jul. 1995, pp. 172-179.

Primary Examiner-Paul R. Lintz
Attorney, Agent, or Firm-Adel A. Ahmed
[57] ABSTRACT

A computer-implemented method for facilitating World Wide Web searches and like database searches by combining search result documents, as provided by separate search engines in response to a query, into one single integrated list so as to produce a single document with a ranked list of pages, by forming a set of selected queries, the queries including respective terms, for which selected queries relevance data from past data is known, herein referred to as training queries, in a vector space comprising all training queries, the relevance data comprising judgments by a user as to whether a page is appropriate for a query which retrieved it. Further steps in the method are identifying a set of k most similar training queries to current query q, computing an average relevant document distribution of the k queries within the training queries' search results for each of the search engines, using the computed relevant document distributions, finding an optimal number of pages to select from the result set of each search engine when N total pages are to be retrieved, and creating a final retrieved set by forming the union of the top λs pages from each search engine.

15 Claims, 2 Drawing Sheets
`
U.S. Patent    Jan. 26, 1999    Sheet 1 of 2    5,864,846

FIG. 1
FORMING A SET OF SELECTED QUERIES

IDENTIFYING A SET OF k MOST SIMILAR TRAINING QUERIES TO CURRENT QUERY q

COMPUTING AN AVERAGE RELEVANT DOCUMENT DISTRIBUTION OF SAID k QUERIES WITHIN SAID TRAINING QUERIES' SEARCH RESULTS FOR EACH OF SAID SEARCH ENGINES

USING SAID COMPUTED RELEVANT DOCUMENT DISTRIBUTIONS, FINDING AN OPTIMAL NUMBER OF PAGES TO SELECT FROM THE RESULT SET OF EACH SEARCH ENGINE WHEN N TOTAL PAGES ARE TO BE RETRIEVED

CREATING A FINAL RETRIEVED SET BY FORMING THE UNION OF THE TOP λs PAGES FROM EACH SEARCH ENGINE
U.S. Patent    Jan. 26, 1999    Sheet 2 of 2    5,864,846

FIG. 2
FORMING A RELEVANT DOCUMENT DISTRIBUTION, IN ACCORDANCE WITH A FUNCTION F_s^q(N)

COMPUTING AN AVERAGE RELEVANT DOCUMENT DISTRIBUTION OF THE k NEAREST NEIGHBORS OF SAID CURRENT QUERY, q

COMPUTING QUERY-QUERY SIMILARITIES

UTILIZING THE COSINE OF THE ANGLE BETWEEN TWO QUERY VECTORS AS THE QUERIES' SIMILARITY

COMPUTING THE AVERAGE RELEVANT DOCUMENT DISTRIBUTION OVER k QUERIES

PASSING TO A MAXIMIZATION PROCEDURE DISTRIBUTIONS AND THE TOTAL NUMBER OF DOCUMENTS TO BE RETRIEVED
METHOD FOR FACILITATING WORLD WIDE WEB SEARCHES UTILIZING A DOCUMENT DISTRIBUTION FUSION STRATEGY
`
The present invention relates to an automatic method for facilitating World Wide Web searches and, more specifically, to an automatic method for facilitating World Wide Web searches by exploiting the differences in the search results of multiple search engines to produce a single list that is more accurate than any of the individual lists from which it is built.

Text retrieval systems accept a statement of information need in the form of a query, assign retrieval status values to documents in the collection based on how well the documents match the query, and return a ranked list of the documents ordered by retrieval status value. Data fusion methods that combine the search results of different queries representing a single information need to produce a final ranking that is more effective than the component rankings are well known. See Bartell, B. T., Cottrell, G. W., and Belew, R. K.: Automatic combination of multiple ranked retrieval systems; Proceedings of SIGIR-94; July, 1994. Belkin, N. J. et al.: The effect of multiple query representations on information system performance; Proceedings of SIGIR-93; June, 1993. Fox, E. A. and Shaw, J. A.: Combination of multiple searches; Proceedings of TREC-2; March, 1994.
However, these fusion methods determine the rank of a document in the final list by computing a function of the retrieval status values of that document in each of the component searches. The methods are therefore not applicable when the component searches return only the ordered list of documents and not the individual status values.
The World Wide Web is a collection of information-bearing units called "pages" interconnected by a set of links. To help users find pages on topics that are of interest to them, several groups provide search engines that accept a statement of user need (in either English or a more formal query language) and return a list of pages that match the query. A list is usually ordered by a similarity measure computed between the query and the pages. While each of the search engines in principle searches over the same set of pages (the entire Web), the size of the Web and the imprecise nature of the search algorithms frequently cause different search engines to return different lists of pages for the same query.

Search engines such as Excite and Alta Vista provide a query interface to the information in these pages and, like traditional text retrieval systems, return a ranked list of pages ordered by the similarity of the page to the query. See Steinberg, Steve G.: Seek and Ye Shall Find (Maybe); Wired; May, 1996. Because the search engines process queries in different ways, and because their coverage of the Web differs, the same query statement given to different engines often produces different results. Submitting the same query to multiple search engines, as Quarterdeck's WebCompass product does, can improve overall search effectiveness. See QuarterDeck. URL: http://arachnid.qdeck.com/qdeck/products/webcompass.
In accordance with an aspect of the invention, a method provides for combining the results of the separate search engines into a single integrated ranked list of pages in response to a query. Unlike WebCompass, the method does not keep the search results separated by the search engine that produced the result, but forms a single ranked list. Unlike the traditional fusion methods, the method in accordance with the invention can produce a single ranking despite the fact that most search engines do not return the similarities that are computed for individual pages.

The method in accordance with the invention utilizes a particular application of algorithms developed to combine the results of searches on potentially disjoint databases. See Towell, G., et al.: Learning Collection Fusion Strategies for Information Retrieval; Proceedings of the 12th Annual Machine Learning Conference; July, 1995. Voorhees, E. M., Gupta, N. K., and Johnson-Laird, B.: The Collection Fusion Problem; Proceedings of TREC-3, NIST Special Publication 500-225; April, 1995; pp. 95-104. Voorhees, E. M., Gupta, N. K., and Johnson-Laird, B.: Learning Collection Fusion Strategies; Proceedings of SIGIR-95; July, 1995; pp. 172-179.
In accordance with another aspect of the invention, a computer-implemented method for facilitating World Wide Web searches by combining search result documents, as provided by separate search engines in response to a query, into one single integrated list so as to produce a single document with a ranked list of pages, comprises the steps of (a) forming a set of selected queries, the queries including respective terms, for which selected queries relevance data from past data is known, herein referred to as training queries, in a vector space comprising all training queries, the relevance data comprising judgments by a user as to whether a page is appropriate for a query which retrieved it; (b) identifying a set of k most similar training queries to current query q; (c) computing an average relevant document distribution of the k queries within the training queries' search results for each of the search engines; (d) using the computed relevant document distributions, finding an optimal number of pages to select from the result set of each search engine when N total pages are to be retrieved; and (e) creating a final retrieved set by forming the union of the top λs pages from each search engine.
An object of the present invention is to approximate the effectiveness of a single text retrieval system despite the collection being physically separated. Another object of the present invention is to combine the results of multiple searches of essentially the same database so as to improve the performance over any single search.

In accordance with another aspect of the invention, the present method for facilitating World Wide Web searches utilizing a document distribution fusion strategy uses relevance data (judgments by the user as to whether a page is appropriate for the query which retrieved it) from past queries to compute the number of pages to select from each search engine for the current query. In the present description, the set of queries for which relevance data is known is called the training queries. The terms "page" and "document" are used interchangeably.

FIGS. 1 and 2 show flow charts helpful to a fuller understanding of the invention.
The function F_s^q(N), called a relevant document distribution, returns the number of relevant pages retrieved by search engine s for query q in the ranked list of size N. The fusion method in accordance with the present invention builds an explicit model of the relevant document distribution of the joint search. The model is created by computing the average relevant document distribution of the k nearest neighbors of the current query, q. The nearest neighbors of q are the training queries that have the highest similarity with q.

To compute query-query similarities, the present invention utilizes a vector representation of the queries. The vector queries are created by removing a set of high-frequency function words such as prepositions from the query text, stemming the remaining words (i.e., removing suffixes to conflate related words to a common root), and assigning a weight to each term equal to the number of times the term occurs in the text (term frequency weights). The cosine of the angle between two query vectors is used as the queries' similarity.
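The query vectorization and cosine similarity just described can be sketched in Python as follows; the stopword list and suffix-stripping stemmer here are simplified stand-ins, since the patent does not specify which function-word list or stemming algorithm the implementation used:

```python
import math
from collections import Counter

# Simplified stand-in for the high-frequency function-word list.
STOPWORDS = {"the", "of", "a", "in", "for", "to", "and"}

def stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def query_vector(text):
    # Term-frequency weights over stemmed, stopword-filtered tokens.
    tokens = [stem(w) for w in text.lower().split() if w not in STOPWORDS]
    return Counter(tokens)

def cosine(u, v):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

The k nearest training queries are then simply the k with the highest cosine against the current query's vector.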
The average relevant document distribution over k queries is computed by taking the average of the number of relevant documents retrieved by the set of queries after each document retrieved. Once the average relevant document distribution is computed for the current query for each search engine, the distributions and the total number of documents to be retrieved are passed to a maximization procedure. This procedure finds the cut-off level for each search engine that maximizes the number of relevant documents retrieved (the current maximization procedure simply does an exhaustive search). The computed cut-off levels are the number of documents selected from the result set of each search engine. The steps of the fusion process in accordance with the invention are essentially as follows.
A. Find the k most similar training queries to current query q.
   1. Using standard techniques, create query vectors in a vector space consisting of all training queries. Weight terms in queries using a function that is proportional to the number of times the term occurs in the query.
   2. Create a query vector for the current query in the same vector space. Compute a vector similarity measure between the current query and all training queries.
   3. Select the k training queries with the highest similarities.
B. Within the training queries' search results for each search engine, compute the average relevant document distribution of the k queries.
   1. A relevant document distribution for a query q gives for each rank r the number of relevant documents retrieved at or below rank r by query q. The average distribution over a set of queries gives the mean number of relevant documents retrieved at or below rank r over the query set.
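The distribution defined in step B and its average over a query set can be sketched as follows; representing relevance judgments as a set of relevant page identifiers is an assumption of this illustration:

```python
def relevant_doc_distribution(ranked_docs, relevant):
    """For each rank r (1-based), the count of relevant documents at or below r."""
    dist, count = [], 0
    for doc in ranked_docs:
        count += doc in relevant
        dist.append(count)
    return dist

def average_distribution(distributions):
    """Mean over a set of per-query distributions, rank by rank."""
    n = len(distributions)
    length = max(len(d) for d in distributions)
    # Pad shorter distributions with their final value (no further relevant docs).
    padded = [d + [d[-1]] * (length - len(d)) if d else [0] * length
              for d in distributions]
    return [sum(d[r] for d in padded) / n for r in range(length)]
```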
C. Using the computed relevant document distributions, find the optimal number of pages to select from the result set of each search engine when N total pages are to be retrieved.
   1. Using any optimization technique (we use brute force), compute the number of pages that should be retrieved from each search engine (λs) such that the total number of pages retrieved is N and the maximum possible number of relevant pages is retrieved, subject to the constraint that, for example, to retrieve the page at rank 5 from a collection, the pages at ranks 1-4 must also be retrieved.
   2. There may be different combinations of pages retrieved from the search engine results that retrieve the maximum possible number of relevant pages. Choose any one of the combinations. Distribute "spill", the number of pages that can be retrieved from any search engine without affecting the number of relevant retrieved, in proportion to the number of pages that would otherwise be retrieved from that collection.
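A brute-force version of the maximization in step C might look like the sketch below. It enumerates every way of splitting the N retrieval slots across the engines and keeps the split that maximizes the predicted relevant count; the rank-prefix constraint is implicit, since a cut-off of r always takes ranks 1 through r. The spill-distribution refinement of step C.2 is omitted here:

```python
from itertools import product

def best_cutoffs(distributions, N):
    """Exhaustively search per-engine cut-off levels summing to N that maximize
    the predicted number of relevant pages. distributions[s][r-1] is the
    (average) number of relevant documents engine s returns in its top r."""
    def predicted(dist, cut):
        return dist[cut - 1] if cut > 0 else 0.0

    best, best_score = None, -1.0
    ranges = [range(0, min(len(d), N) + 1) for d in distributions]
    for combo in product(*ranges):
        if sum(combo) != N:
            continue  # only splits that retrieve exactly N pages in total
        score = sum(predicted(d, c) for d, c in zip(distributions, combo))
        if score > best_score:
            best_score, best = score, combo
    return best
```

Enumerating all combinations is exponential in the number of engines, which matches the patent's remark that the current procedure "simply does an exhaustive search".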
D. Create the final retrieved set by forming the union of the top λs pages from each search engine.
   1. Rank pages in the final retrieved set probabilistically using a biased c-faced die.
      (a) To select the page to be in the next rank r of the final ranking, roll a c-faced die that is biased by the number of pages remaining to be placed in the final ranking from each of the search engines. Select the search engine whose number corresponds to the die roll and place the next page from that engine's ranking into the final ranking.
      (b) Repeat until all N pages have been placed in the final ranking.
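The biased-die interleaving of step D can be sketched as follows, with `random.choices` playing the role of the c-faced die: each face is weighted by the number of pages the corresponding engine still has to place:

```python
import random

def fuse_rankings(per_engine_pages, seed=None):
    """Interleave per-engine ranked page lists into one final ranking using a
    die biased by how many pages each engine still has to place."""
    rng = random.Random(seed)
    queues = [list(pages) for pages in per_engine_pages]
    final = []
    while any(queues):
        remaining = [len(q) for q in queues]
        # Biased c-faced die: engine s wins with probability remaining[s]/total.
        engine = rng.choices(range(len(queues)), weights=remaining)[0]
        final.append(queues[engine].pop(0))
    return final
```

Whatever the die rolls, each engine's own pages keep their relative order in the final list, since pages are always taken from the front of that engine's queue.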
The parameter k is used to control the amount of generalization made from the training queries. Too few queries cause the predicted relevant document distribution to be too specific to the training queries, while too many queries cause different topic areas to be mixed, resulting in too generic a distribution.

As used herein, a roll of an unbiased c-faced die selects a number in the range from 1 to c with a uniform probability of 1/c; however, in order to produce the final ranking, it is desired to bias the probability of selecting a search engine, numbered from 1 to c, by the number of pages it has to place in the ranking. This means that the page placed in the first rank will, with higher probability, be selected from the search engine that contributed the most pages to the retrieved set. As pages are placed in the final ranking, the search engine with the most pages remaining to be placed will change, and thus the specific probabilities of selecting a search engine also change.
A fusion method for facilitating World Wide Web searches utilizing a query clustering fusion strategy is disclosed in a copending patent application by the present inventors, entitled Method for Facilitating World Wide Web Searches Utilizing a Query Clustering Fusion Strategy, filed on even date herewith, the disclosure whereof is herein incorporated by reference to the extent it is not incompatible with the present invention. As therein disclosed, that fusion method does not attempt to form an explicit model of a search engine's relevant document distribution.

Instead, that system learns a measure of the quality of a search for a particular topic area by that engine. The number of pages selected from an engine for a new query is proportional to the value of the quality measure computed for that query. As in the approach disclosed in the above-referenced application, the fusion strategy in accordance with the invention uses query vectors. Topic areas are represented as centroids of query clusters. For each search engine, the set of training queries is clustered using the number of (relevant and irrelevant) documents retrieved in common between two queries as a similarity measure. The assumption is that if two queries retrieve many documents in common, they are about the same topic. The centroid of a query cluster is created by averaging the vectors of the queries contained within the cluster. This centroid is the system's representation of the topic covered by that query cluster.
The training phase also assigns to a cluster a weight that reflects how effective queries in the cluster are for that search engine: the larger the weight, the more effective the queries are believed to be. The average number of relevant pages retrieved by queries in the cluster is used as a cluster's weight.
After training, queries are processed as follows. The cluster whose centroid vector is most similar to the query vector is selected for the query and the associated weight is returned. The set of weights returned over all the search engines is used to apportion the final retrieved set such that when N pages are to be returned and ws is the weight returned by engine s, (ws/Σwk)*N (rounded appropriately) documents are selected from engine s. For example, assume the total number of pages to be retrieved is 100, and there are five search engines.

If the weights returned by the engines are 4, 3, 3, 0, 2, then the first 33 pages returned by engine 1 would be selected, the first 25 pages from each of engines 2 and 3 would be selected, no pages would be selected from engine 4, and the first 17 pages from engine 5 would be selected. However, if the weights returned were 4, 8, 4, 0, 0, then 25 pages would be selected from each of engines 1 and 3, and 50 pages would be selected from engine 2. The weight of a cluster for a single engine in isolation is not meaningful; it is the relative difference in weights returned by the set of search engines over which the fusion is to be performed that is important. Of course, many variations of this scheme, such as forcing small weights to zero when the difference between weights is very large, are also possible.
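The apportionment and worked example above can be reproduced with largest-remainder rounding. The patent says only "rounded appropriately", so the specific rounding rule here is an assumption, chosen because it matches the 33/25/25/0/17 split in the example:

```python
def apportion(weights, N):
    """Split N retrieval slots across engines in proportion to their weights,
    using largest-remainder rounding so the slots sum exactly to N."""
    total = sum(weights)
    if total == 0:
        return [0] * len(weights)
    exact = [w * N / total for w in weights]
    slots = [int(x) for x in exact]          # round down first
    leftover = N - sum(slots)
    # Hand the leftover slots to the engines whose truncation lost the most.
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: exact[i] - slots[i], reverse=True)
    for i in by_remainder[:leftover]:
        slots[i] += 1
    return slots
```

With weights 4, 3, 3, 0, 2 and N = 100 this yields 33, 25, 25, 0, 17, matching the example; the copending application's summary instead breaks rounding shortfalls in favor of the most highly weighted engines.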
The current implementation uses the Ward clustering method and the reciprocal of the number of documents retrieved in common in the top 100 pages as the distance metric to cluster the training queries. A single set of clusters is produced from the resultant dendrogram by cutting the dendrogram at a pre-determined distance. The weight assigned to each cluster is the average number of relevant documents in the top L ranks. The similarity between a cluster centroid and a query is computed as the cosine of the two vectors, where each vector uses term frequency weights.

In the query clustering fusion strategy, the parameter L controls part of the generalization made from the training queries. The number of documents used to compute query-query similarities for the clustering routine will also have an effect. Steps of the method disclosed in the above-referenced copending patent application can be summarized as follows.
A. Train for each search engine:
   1. Cluster training queries and build cluster centroids.
      (a) Apply Ward's clustering algorithm, using the number of pages retrieved in common at a rank less than or equal to a parameter L as the similarity between two queries.
      (b) Form clusters from the hierarchy by considering all queries that cluster above a certain threshold to belong to the same cluster.
      (c) Form the centroid for a particular cluster by creating the mean vector over all query vectors in the cluster.
         i. Create query vectors from query text using standard vector processing techniques; weight the terms using a function that is proportional to the number of times the term occurs in the query.
         ii. The weight of a term in the centroid vector is the sum of its weights in the vectors of the queries in the cluster divided by the number of queries in the cluster.
   2. Assign weights to each cluster reflecting the number of relevant pages expected to be obtained by this search engine for queries similar to those in the cluster.
      (a) Compute a cluster's weight as the mean number of relevant pages retrieved at a rank less than or equal to a parameter L over all the queries in the cluster.
B. To process an incoming query, for each search engine:
   1. Find the cluster centroid that is most similar to the query.
      (a) Create a query vector for the current query in the vector space of the training queries.
      (b) Compute a vector similarity measure (e.g., the cosine) between the current query vector and each of the centroids.
      (c) Choose the centroid that has the greatest similarity.
   2. Return the weight associated with the selected cluster as the weight of the current search engine.
C. Apportion the N slots in the retrieved set according to the weights returned by each search engine.
   1. Sum the weights returned by the set of engines.
   2. Select the top weight-of-this-engine/sum (rounded down) pages from the retrieved set of each engine.
   3. When fewer than N pages are retrieved due to rounding, select 1 more page from the most highly weighted engines until N pages are retrieved. (Break ties arbitrarily.)
   4. Rank pages in the retrieved set probabilistically using a biased c-faced die.
      (a) To select the document to be in the next rank r of the final ranking, roll a c-faced die that is biased by the number of pages remaining to be placed in the final ranking from each of the engines. Select the engine whose number corresponds to the die roll and place the next page from that engine's ranking into the final ranking.
      (b) Repeat until all N pages have been placed in the final ranking.
The invention has been described by way of an exemplary embodiment. Various changes and modifications will be apparent to one skilled in the art to which it pertains. While reference has been made to the World Wide Web in conjunction with searches, it is intended and should be understood that what is herein intended is a database of the type represented by the World Wide Web, whether or not so named. Such changes and modifications are intended to be within the spirit and scope of the invention, which is defined by the claims following.
We claim:
1. A computer-implemented method for facilitating World Wide Web Searches and like searches by combining search result documents, as provided by separate search engines in response to a query, into one single integrated list so as to produce a ranked list of pages, said method comprising the steps of:
(a) forming a set of selected queries, said queries including respective terms, for which selected queries relevance data from past data is known, herein referred to as training queries, in a vector space comprising all training queries, said relevance data comprising judgments by a user as to whether a page is appropriate for a query which retrieved it;
(b) identifying a set of k most similar training queries to current query q;
(c) computing an average relevant document distribution of said k queries within said training queries' search results for each of said search engines;
(d) using said computed relevant document distributions, finding an optimal number of pages to select from the result set of each search engine when N total pages are to be retrieved; and
(e) creating a final retrieved set by forming the union of the top λs pages from each search engine.
2. A computer-implemented method in accordance with claim 1, wherein step (a) comprises a step of
(A) weighting each given term in said selected queries using a function that is proportional to the number of times said given term occurs in the query.
3. A computer-implemented method in accordance with claim 2, wherein step (a) comprises:
(B) creating a query vector for a current query in said vector space; and
(C) computing a vector similarity measure between said current query and all training queries.
4. A computer-implemented method in accordance with claim 1, wherein step (a) comprises
(C) selecting those k training queries with the highest similarities.
5. A computer-implemented method in accordance with claim 1, wherein in step (c), said average relevant document distribution is such that a relevant document distribution for a query q gives for each rank r the number of relevant documents retrieved at or below rank r by query q and the average distribution over a set of queries gives the mean number of relevant documents retrieved at or below rank r over the query set.
6. A computer-implemented method in accordance with claim 1, wherein step (d) comprises:
(D) computing, by using a selected optimization technique, a number of pages that should be retrieved from each respective search engine such that a total number of pages N is retrieved and a maximum possible number of relevant pages is retrieved subject to a predetermined constraint.
7. A computer-implemented method in accordance with claim 6, wherein said predetermined constraint is typified by a constraint such that to retrieve a page at rank R from a collection, pages at rank 1 through rank (R-1) must also be retrieved.
8. A computer-implemented method in accordance with claim 6, wherein, if in step (D) there be different combinations of pages retrieved from those search engine results that retrieve a maximum possible number of relevant pages, then select any one of said different combinations.
9. A computer-implemented method in accordance with claim 8, wherein step (A) comprises
(E) distributing a number of pages that can be retrieved from any of said search engines without affecting the number of relevant documents retrieved, in proportion to that number of pages that would otherwise be retrieved from that collection.
10. A computer-implemented method in accordance with claim 9, wherein step (E) comprises ranking pages in said final retrieved set probabilistically in accordance with a rule of a die roll using a biased c-faced die.
11. A computer-implemented method in accordance with claim 10, wherein, to select the page to be in the next rank r of the final ranking, step (E) comprises rolling a c-faced die that is biased by the number of pages remaining to be placed in said final ranking from each of said search engines.
12. A computer-implemented method in accordance with claim 11, wherein step (A) comprises selecting that search engine whose number corresponds to said die roll; and placing the next page from that engine's ranking into said final ranking.
13. A computer-implemented method in accordance with claim 12, wherein the steps of claim 12 are repeated until all N pages have been placed in said final ranking.
14. A computer-implemented method for facilitating World Wide Web Searches and like searches by combining search result documents from a joint search, as provided by separate search engines in response to a query, into one single integrated list so as to produce a ranked list of pages, said method comprising the steps of:
(a) forming a relevant document distribution, in accordance with a function F_s^q(N) which returns that number of relevant pages retrieved by search engine s for a current query q in a ranked list of size N;
(b) forming an explicit model of said relevant document distribution of said joint search by computing an average relevant document distribution of the k nearest neighbors of said current query, q, said nearest neighbors of q being those training queries that have highest similarity with q, said training queries being a set of selected queries, including respective terms, for which selected queries relevance data from past data is known, said relevance data comprising judgments by a user as to whether a page is appropriate for a query which retrieved it;
(c) computing query-query similarities, by generating a vector query representation of said queries wherein vector queries are created by removing a set of high-frequency function words, such as prepositions, from query text, stemming remaining words, that is, removing suffixes to conflate related words to a common root, and assigning a term frequency weight to each term equal to the number of times the term occurs in said text;
(d) utilizing the cosine of the angle between two query vectors as the queries' similarity;
(e) computing the average relevant document distribution over k queries by taking an average of that number of relevant documents retrieved by the set of queries after each document retrieved; and
(f) passing to a maximization procedure distributions and the total number of documents to be retrieved once the average relevant document distribution is computed for the current query for each search engine, for finding the cut-off level for each search engine that maximizes the number of relevant documents retrieved, whereby said computed cut-off levels correspond to the number of documents selected from the result set of each search engine.
15. A computer-implemented method for facilitating World Wide Web Searches and like searches by combining search result documents, as provided by separate search engines in response to a query, into one single integrated list so as to produce a ranked list of pages, said method comprising the steps of:
forming a set of selected queries, said queries including respective terms, for which se
