`United States Patent
`6,006,225
`[11] Patent Number:
`[45] Date of Patent: Dec. 21, 1999
`Bowman et al.
`
`
`
`USOO6006225A
`
`[54] REFINING SEARCH QUERIES BY THE
`SUGGESTION OF CORRELATED TERMS
`FROM PRIOR SEARCHES
`
`[75]
`
`Inventors: Dwayne E. Bowman, Woodinville;
`Ruben E. Ortega; Michael L.
`Halnrick, both of Seattle; Joel R.
`Spiegel, Woodinville; Timothy R.
`Kohn, Seattle, all of Wash.
`
`[73] Assignee: Amazon.Com, Seattle, Wash.
`
`[21] Appl. No.: 09/145,360
`
`[22]
`
`Filed:
`
`Sep. 1, 1998
`
`Related US. Application Data
`Provisional application No. 60/089,244, Jun. 15, 1998.
`
`[60]
`
`[51]
`Int. Cl.6 ...................................................... G06F 17/30
`
`[52]
`707/5; 707/2; 707/4; 707/10
`[58] Field of Search ................................... 707/5, 2, 10, 4
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`5,675,819
`5,721,897
`5,787,422
`5,794,233
`5,864,845
`5,911,140
`5,913,215
`
`..
`
`10/1997 Schuetze ................................... 704/10
`2/1998 Rubinstein
`707/2
`/1998 Tukey et al.
`707/5
`/1998 Rubinstein
`707/4
`1/1999 Voorhees et al.
`707/5
`6/1999 Tukey et a1.
`..
`707/5
`/1999 Rubinstein ................................ 707/10
`OTHER PUBLICATIONS
`
`
`
`.
`
`Bartell et a1., “Automatic Combination of Multiple Ranked
`Retrieval Systems”, Proceedings of SIGIR ’94, Jul. 1994,
`pp. 173—181.
`Belkin et al., “The Effect of Multiple Query Representations
`on Information System Performance” Proceedings of SIGIR
`’93, Jun. 1993, pp. 339—346.
`Shaw et a1., “Combination of Multiple Searches”, Proceed-
`ings of TREC—3, Apr. 1995, pp. 105—108.
`
`QuarterDeck Web Page, Downloaded Sep. 9, 1996, http://
`ara cnid .qd eck.com/qdeck/produ cts/webcompass.
`Towell et al. “Learning Collection Fusion Strategies for
`Information Retrieval”, Proceedings of the thh Annual
`Machine Learning Conference, Jul. 1995, pp. 540—548.
`Voorhees et a1., “Learning Collection Fusion Strategies”,
`Proceedings of SIGIR ’95, Jul. 1995, pp. 172—179.
`Voorhees et al., “The Collection Fusion Problem” Proceed-
`ings of TREC73, NIST Special Publication 5007225, Apr.
`1995, pp. 95—104.
`Abstract of Generating Advanced Query Interfaces, Lee,
`Srivastava and Vista, Computer Networks and ISDN Sys-
`tems Conference Title: Cornput. Netw. ISDN Syst. (Neth-
`erlands) vol. 30, No. 1—7, pp. 656—657 (1998).
`Abstract of Using Combination ofEvidence for Term Expan-
`sion, Wilkinson, Information Retrieval Research, Proceed-
`ings of the 19th Annual BCS—IRSG Colloquium on IR
`Research (1997).
`
`(List continued on next page.)
`Primary Examiner—Paul R. Iintz
`Attorney, Agent, or Firm—Knobbe, Martens, Olson & Bear,
`LLP
`
`[57]
`
`ABSTRACT
`
`A search engine is disclosed which suggests related terms to
`the user to allow the user to refine a search. The related terms
`are generated using query term correlation data which
`reflects the frequencies with which specific terms have
`previously appeared within the same query. The correlation
`data is generated and stored in a look-up table using an
`off-line process which parses a query log file. The table is
`regenerated periodically from the most recent query sub-
`missions (e.g., the last two weeks of query submissions), and
`thus strongly reflects the current preferences of users. Each
`related term is presented to the user via a respective hyper-
`link which can be selected by the user to submit a modified
`query. In one embodiment, the related terms are added to and
`selected from the table so as to guarantee that the modified
`queries will not produce a NULL query result.
`
`28 Claims, 10 Drawing Sheets
`
`PARSE DAILY LOG FILE
`T0 EXTRACT QUERY
`
`
`SUBMISSIONS FOR WHICH
`ITEMS FOUND > 0
`
`
`
`CORRELATE TERMS BASED
`ON FREQUENCY OF OCCURRENCE
`
`WITHIN SAME QUERY
`
`U0
`CREATE DAILY
`
`RESULTS FILE
`
`
`
`
`
`
`
`MERGE DAILY
`RESULTS FILES FOR
`LAST M DAYS
`
`
`
`459
`
`OVERWRITE EXISTING CORRELATION
`TABLE WITH NEW CORRELATION TABLE
`
`1
`1
`
`EX1026
`EX1026
`
`
`
`6,006,225
`Page 2
`
`OTHER PUBLICATIONS
`
`the NECI Meta Search Engine,
`Abstract of Inquirus,
`Lawrence and Giles, Computer Networks and ISDN Sys-
`tems Conference Title: Comput. Netw ISDN Syst. (Nether-
`lands) Vol. 30, No. 1—7, pp. 95—105 (1998).
`Abstract of Facilitating Complex Web Queries Through
`Visual User Interfaces and Query Relaxtion, Li and Shim,
`Computer Networks and ISDN Systems Conference Title:
`Comput. Netw. ISDN Syst. (Netherlands) vol. 30, No. 177,
`pp. 149—159 (1998).
`A User—centred Evaluation ofRankingAlgorithms for Inter-
`active Query Expansion, Efthimiadis, Proceedings of the
`16th Annual International ACM SIGIR Conference, Pitts-
`burgh, pp. 146—159 (1993).
`Concept Based Query Expansion, Qiu and Frei, Proceedings
`of the 16th Annual International ACM SIGIR Conference,
`Pittsburgh, pp. 160—169 (1993).
`Improving Retrieval Performance by Relevance Feedback,
`Salton and Buckley, J. of Am. Society for Info. Science
`41(4):288—297 (1990).
`
`Query Expansion Using Domain—Adapted. Weighted The-
`saurus in an Extended Boolean Model, Kwon, Kim and
`Choi, Proceedings of the 3rd International Conference on
`Information and Knowledge Management (CIKM’94), pp.
`140—146 (1994).
`Browsing Through Querying: Designing for Electronic
`Books, Charoenkitkarn, Tam, Chignell and Golovichinsky,
`at
`the 5th ACM Conference on Hypertext, Seattle, WA
`206—216 (1993).
`A Survey of Information Retrieval and Filtering blethods,
`Faloutsos and Oard, Univ. of Maryland, 22 pages (undated).
`A Corpus Analysis Approach for Automatic Query Expan-
`sion, Gauch and Wang, Proceedings of the 6th International
`Conference on Information and Knowledge Mangement, pp.
`278—284 (1997).
`Discovering Web Acess Patterns and Trends by Applying
`OLAP and Data Mining Technology on Web Logs, Zaiane,
`Xin and Han, Proceedings of the IEEE Forum on Research
`and Technology Advances in Digital Libraries
`(IEEE
`ADL’98), pp. 19—29 (1998).
`
`2
`
`
`
`US. Patent
`
`Dec. 21, 1999
`
`Sheet 1 0f 10
`
`6,006,225
`
`Ba8;NE|__llllllllllllllllllllllllllllllllll
`mummwmm_63mm_->mm30_
`”.22203933mm_BEESEE
`hQ>2_
`
`
`
`mmmoomazozbmdm
`
`IFQMK
`
`QNx
`
`bQm3
`
`o
`
`0..
`
`
`
`
`
`MAm<Hzoc<4mmmoo>mmao
`
`ownozzz<moommumon»zozomhm<lm
`
`<><?méKi
`
`
`
`«mmmummooumommz<o<mu<
`
` z...
`mfim_m<lmmommmmmzzanm
`
`\“QK
`
`WEE.
`
`mwmoomm
`
`zo_,_.<mmzmo
`58I—:20
`
`"
`
`3
`
`
`
`
`
`
`
`
`US. Patent
`
`Dcc.21, 1999
`
`Sheet 2 0f 10
`
`6,006,225
`
`Fiie
`
`Edit
`
`View
`
`Favorite
`
`Help
`
`QOIQQSEEAQ
`St®op Refresh Home Search
`FavorIte
`Font
`
`Bock Forw”
`
`Amazon.com Home I Music Search I Your Account
`
`Address
`
`http: //www. amazon. com/book_search
`
`a]
`
`amazon com Book Search
`
`Enter Author and/or Title
`
`Author:
`
`2/0
`
`©Exact Name
`
`OLast, First Name
`
`OStart of Last Name
`
`Title:
`
`220
`
`OStart(s) of Title Words
`©Title Word(s)
`OExact Title
`
`
` Search Now
`Clear the Form
`
`Author Search Tips / Title Search Tips
`
`Search by Subject
`
`240
`
`Subject:
`
`®Subject lord(s)
`OStart of Subject
`OExact Subject
`
`
`
`
`Search Now
`Clear the Form
`
`OStart(s) of Subject Word(s)
`
`Subject Search Tips
`
`Other Search Methods:
`
`ISBN, Publisher/Date. Quick Search
`
`
`
`F/GIZ
`
`4
`
`
`
`US. Patent
`
`Dcc.21, 1999
`
`Sheet 3 0f 10
`
`6,006,225
`
`3/0
`
`Friday, 13-Feb-98 02:23:52
`User Identifier = 29384719287
`HTTP_REFERRER= http://www.omuzon.com/book__seorch_page
`PATH_INFO=/book__seo rch
`title = Snow Crash
`items_found = 2
`
`J40
`
`320
`
`530
`
`Friday, 13—Feb—98 02:24:11
`User Identifier = 29384719287
`HTTP_REFERRER= http://www.omozon.com/book_search
`PATH_|NFO=/ISBN = 0555562614
`
`Friday. 13-Feb-98 06:15:03
`User Identifier = 54730543261
`HTTP_REFERRER= http://www.omazon.com/music_seorch_page
`PATH_|NFO=/music__search
`artist = This and That
`items_found = 0
`
`Friday, 13-Feb—98 10:07:34
`User Identifier = 027385918272
`HTTP_REFERRER= http://www.amozon.com/book_search_page
`PATH_|NFO=/book_search
`subject = outdoor troil
`items_found = 22
`
`H635
`
`5
`
`
`
`US. Patent
`
`Dcc.21, 1999
`
`Sheet 4 0f 10
`
`6,006,225
`
`BEGIN
`
`4/0
`
`PARSE DAILY LOG FILE
`
`ITEMS FOUND > 0
`
`TO EXTRACT QUERY
`SUBMISSIONS FOR WHICH
`
`420
`
`CORRELATE TERMS BASED
`
`ON FREQUENCY OF OCCURRENCE WITHIN SAME QUERY
`
`CREATE DAILY
`
`RESULTS FILE
`
`4.275
`
`440
`
`MERGE DAILY
`
`LAST M DAYS
`
`RESULTS FILES FOR
`
`OVERWRITE EXISTING CORRELATION
`TABLE WITH NEW CORRELATION TABLE
`
`450
`
`F/Gi 4
`
`END
`
`6
`
`
`
`US. Patent
`
`Dec. 21, 1999
`
`Sheet 5 0f 10
`
`6,006,225
`
`$3ozmfimlm
`
`VNMK
`
`.=<E.Im
`
`
`
`a:“3&0me
`
`3Vz_.§z<El<
`
`$3EDI;
`
`38xsum:3mEEmummmEmmmvmmnm$329288;Eozflfim;so;0225;
`St5.5;
`ASzomééé\
`
`$3«828%
`
`2E3z
`
`295»;85mmzossé
`$545:;
`
`mug;
`
`Vb.GE
`
`SSmoooSo'm$3457m
`ahwzwflmwmlmAMNWzoEo<>lm
`
`In45:;
`$3”.35;
`c:AzEumTut
`
`
`1/
`
`7
`
`
`
`
`
`
`
`US. Patent
`
`Dec. 21, 1999
`
`Sheet 6 0f 10
`
`6,006,225
`
`mELoHz
`
`33
`
`:a:3A3
`
`Ea5:3a3
`
`%\M.\
`
`.__<m._.lm
`
`
`
`mooosoum$3.__<Enm
`
`mEonmumE4.5:;
`
`5520;$322288;
`ozmfimlmAg5%m
`ozmfimmwso50225;
`
`a:mmafioxuumEzomééé
`
`3vz:v_z<El<
`
`xiumEVmEommum
`$3EDI;
`38mOSSOIm
`
`
`
`%m.“QR
`
`Q%%
`
`
`
`zoEo<>|mAnszoEo<>|m
`
`zoxilma5fizossé
`85%a,c4.5:;
`
`3855mm;
`
`3343:;
`
`NV\
`
`8
`
`
`
`
`
`
`
`
`
`US. Patent
`
`Dcc.21, 1999
`
`Sheet 7 0f 10
`
`6,006,225
`
`Dale
`
`Daily Log
`File
`
`Daily Results
`File
`
`Dates of Query
`Correlation Table
`
`Query Correlation
`Table
`
`137/9
`
`
`~74
`
`E—EE
`
`E-HE
`
`1-Feb-98
`
`2-Feb-98
`
`3—Feb—98
`
`11-Feb-98
`
`5-Feb—98
`
`6-Feb—98
`
`
`
`
`
`9
`
`
`
`US. Patent
`
`Dec. 21, 1999
`
`Sheet 8 0f 10
`
`6,006,225
`
`
`
`FOR EACH TERM
`IN THE QUERY
`
`720
`
`LOOK UP TERM IN THE
`CORRELATION TABLE
`
`750
`
`RETRIEVE THE TERMS
`RELATED TERMS LIST
`
`740
`
`
`
`
`_I
`
`
`
`
`7.50
`
`MULTl-TERM
`QUERY
`?
`
`760
`
`COMBINE ALL RELATED
`TERMS LISTS
`
`
`
`
`SELECT X TERMS WITH
`HIGHEST VALUES
`
`F/GI 7
`
`10
`10
`
`
`
`US. Patent
`
`Dcc.21, 1999
`
`Sheet 9 0f 10
`
`6,006,225
`
`Top 3 ReIaIed Terms:
`
`TRAIL — MIX
`
`TRAIL — YUKON J
`
`TRAIL — BIKE
`
`8/0
`
`HQ 634
`
`Infersec’ring Terms:
`
`6’20
`
`S — BIKE
`
`S — SPORTS
`S — VACATION
`
`830
`K.)
`
`Top 3 ReIa’red Terms:
`OUTDOOR TRAIL - BIKE
`OUTDOOR TRAIL - SPORTS
`OUTDOOR TRAIL — VACATION
`
`F/Gi 6’49
`
`11
`11
`
`
`
`US. Patent
`
`Dcc.21,1999
`
`Sheet 10 0f 10
`
`6,006,225
`
`
`
`.900 :
`
`
`
`file
`
`Edit
`
`Xiew
`
`Go
`
`[overlie
`
`flelp
`
`<==>®IQCQEJEEAQ
`Bock Forw... Stop Refresh Home
`Search
`Favorite
`Font
`Moil
`
`Address
`
`http: //www.omozor\.com/book_seorch
`
`a]
`
`amazon.com Book Search
`
`A Y
`
`our Search Results
`
`for: the subject words include "OUTDOOR TRAIL"
`
`
`
`
`
`
`
`
`9/0
`
`Related Query Terms: Click on a link below to narrow your
`search by adding a related term.
`OUTDOOR TRAIL — BIKE
`
`OUTDOOR TRAIL — SPORTS
`
`OUTDOOR TRAIL — VACATION
`
`Top Matches for this search:
`
`* Outdoor 8: Trail Guide to Yosemite
`
`* Outdoor Training on Mountian Bike Trails
`
`
`to See
`Full Results: The First 100 are shown below.
`more results scroll down and click the "More" button.
`
`
`
`
`
`
`The blessed trail: outdoor diary
`T he Great Outdoor Cookbook: Trail Reci ies
`
`920
`
`Outdoor 8c Trail Guide to Yosemite
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`H619
`
`12
`12
`
`
`
`6,006,225
`
`1
`REFINING SEARCH QUERIES BY THE
`SUGGESTION OF CORRELATED TERMS
`FROM PRIOR SEARCHES
`
`RELATED APPLICATION
`
`This application claims the benefit of U. S. Provisional
`Application No. 60/089,244, filed Jun. 15, 1998, the disclo—
`sure of which is hereby incorporated by reference.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of Invention
`
`This present invention relates to query processing, and
`more specifically relates to techniques for facilitating the
`process of refining search queries.
`2. Description of Related Art
`With the increasing popularity of the Internet and the
`World Wide Web, it is common for on-line users to utilize
`search engines to search the Internet for desired information.
`Many web sites permit users to perform searches to identify
`a small number of relevant items among a much larger
`domain of items. As an example, several web index sites
`permit users to search for particular web sites among known
`web sites. Similarly, many on-line merchants, such as
`booksellers, permit users to search for particular products
`among all of the products that can be purchased from the
`merchant. Other on-line services, such as LexisTM and
`WestlawTM, allow users to search for various articles and
`court opinions.
`In order to perform a search, a user submits a query
`containing one or more query terms. The query may also
`explicitly or implicitly identify a record field or segment to
`be searched, such as title, author, or subject classification of
`the item. For example, a user of an on-line bookstore may
`submit a query containing terms that
`the user believes
`appear within the title of a book. A query server program of
`the search engine processes the query to identify any items
`that match the terms of the query. The set of items identified
`by the query server program is referred to as a “query
`result.” In the on-line bookstore example, the query result is
`a set of books whose titles contain some or all of the query
`terms. In the web index site example, the query result is a set
`of web sites or documents. In web-based implementations,
`the query result is typically presented to the user as a
`hypertextual listing of the located items.
`If the scope of the search is large, the query result may
`contain hundreds, thousands or even millions of items. If the
`user is performing the search in order to find a single item
`or a small set of items, conventional approaches to ordering
`the items within the query result often fail to place the
`sought item or items near the top of the query result list. This
`requires the user to read through many other items in the
`query result before reaching the sought item. Certain search
`engines, such as ExciteTM and AltaVistaTM, suggest related
`query terms to the user as a part of the “search refinement”
`process. This allows the user to further refine the query and
`narrow the query result by selecting one or more related
`query terms that more accurately reflect the user’s intended
`request. The related query terms are typically generated by
`the search engine using the contents of the query result, such
`as by identifying the most frequently used terms within the
`located documents. For example, if a user were to submit a
`query on the term “FOOD,” the user may receive several
`thousand items in the query result. The search engine might
`then trace through the contents of some or all of these items
`and present
`the user with related query terms such as
`
`10
`
`15
`
`.
`
`40
`
`45
`
`60
`
`65
`
`2
`“RESTAURANTS,” “RECIPIES,” and “FDA” to allow the
`user to refine the query.
`The related query terms are commonly presented to the
`user together with corresponding check boxes that can be
`selectively marked or checked by the user to add terms to the
`query. In some implementations, the related query terms are
`alternatively presented to and selected by the user through
`drop down menus that are provided on the query result page.
`In either case, the user can add additional terms to the query
`and then resubmit the modified query. Using this technique,
`the user can narrow the query result down to a more
`manageable set consisting primarily of relevant items.
`One problem with existing techniques for generating
`related query terms is that the related terms are frequently of
`little or no value to the search refinement process. Another
`problem is that the addition of one or more related terms to
`the query sometimes leads to a NULL query result. Another
`problem is that the process of parsing the query result items
`to identify frequently used terms consumes significant pro-
`cessor resources, and can appreciably increase the amount of
`time the user must wait before viewing the query result.
`These and other deficiencies in existing techniques hinder
`the user’s goal of quickly and efficiently locating the most
`relevant items, and can lead to user frustration.
`
`SUMMARY OF THE INVENTION
`
`The present invention addresses these and other problems
`by providing a search refinement system and method for
`generating and displaying related query terms (“related
`terms”). In accordance with the invention, the related terms
`are generating using query term correlation data that is based
`on historical query submissions to the search engine. The
`query term correlation data (“correlation data”) is preferably
`based at least upon the frequencies with which specific terms
`have historically been submitted together within the same
`query. The incorporation of such historical query informa-
`tion into the process tends to produce related terms that are
`frequently used by other users in combination with the
`submitted query terms, and significantly increases the like-
`lihood that these related terms will be helpful to the search
`refinement process. To further increase the likelihood that
`the related terms will be helpful,
`the correlation data is
`preferably generated only from those historical query sub-
`missions that produced a successful query result (at least one
`match).
`the
`In accordance with one aspect of the invention,
`correlation data is stored in a correlation data structure
`
`(table, database, etc.) which is used to look up related terms
`in response to query submissions. The data structure is
`preferably generated using an off-line process which parses
`a query log file, but could alternatively be generated and
`updated in real-time as queries are received from users. In
`one embodiment, the data structure is regenerated periodi-
`cally (e.g., once per day)
`from the most recent qucry
`submissions (e.g., the last M days of entries in the query
`log), and thus strongly reflects the current tastes of the
`community of users; as a result, the related terms suggested
`by the search engine strongly reflect the current tastes of the
`community. Thus, for example, in the context of a search
`engine of an online merchant, the search engine tends to
`suggest related terms that correspond to the current best-
`selling products.
`In a preferred embodiment, each entry in the data struc-
`ture is in the form of a key term and a corresponding related
`terms list. Each related terms list contains the terms which
`
`have historically appeared together (in the same query) with
`
`13
`13
`
`
`
`6,006,225
`
`3
`the respective key term with the highest degree of frequency,
`ignoring unsuccessful query submissions (query submis-
`sions which produced a NULL query result). The data
`structure thus provides an efficient mechanism for looking
`up the related terms for a given query term.
`To generate a set of related terms for refining a submitted
`query (the “present query”), the related terms list for each
`term in the present query is initially obtained from the
`correlation data structure. If this step produces multiple
`related terms lists (as in the case of a multiple-term query),
`the related terms lists are preferably combined by taking the
`intersection between these lists (i.e., deleting the terms that
`are not common to all lists). The related terms which remain
`are terms which have previously appeared, in at least one
`successful query submission,
`in combination with every
`term of the present query. Thus, assuming items have not
`been deleted from the database being searched, any of these
`related terms can be individually added to the present query
`while guaranteeing that the modified query will not produce
`a NULL query result. To take advantage of this feature, the '
`related terms are preferably presented to the user via a user
`interface that requires the user to add no more than one
`related term per query submission. In other embodiment, the
`related terms are selected and displayed without guarantee-
`ing a successful query result.
`Because the related terms are identified from previously-
`generated correlation data, without the need to parse docu—
`ments or correlate terms, the related terms can be identified
`and presented to the user with little or no added delay.
`
`10
`
`15
`
`4
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
`
`The present invention provides a search refinement sys-
`tem and method for generating related query terms (“related
`terms”) using a history of queries submitted to a search
`engine by a community of users. Briefly, the system gener-
`ates query term correlation data which reflects the frequency
`with which specific terms have previously occurred together
`within the same query. The system uses the query term
`correlation data in combination with the query term(s)
`entered by the user to recommend additional query terms for
`refining the query. The incorporation of such historical query
`information into the process tends to produce related terms
`that are frequently used by other users in combination with
`the submitted query terms, and significantly increases the
`likelihood that these related terms will be helpful to the
`search refinement process. To further increase the likelihood
`that the related terms will be helpful, the correlation data is
`preferably generated only from those historical query sub-
`missions that produced a successful query result (at least one
`match).
`In the preferred embodiment, the query term correlation
`date is regenerated periodically from recent query
`submissions, such as by using the last M days of entries in
`a query log, and thus heavily reflects the current tastes of
`users. As a result, the related terms suggested by the search
`engine tend to be terms that correspond to the most fre-
`quently searched items during the relevant
`time period.
`Thus, for example, in the context of a search engine of an
`online merchant, the search engine tends to suggest related
`terms that correspond to the current best-selling products. In
`one embodiment, the technique used to generate the related
`terms and present these terms to the user guarantees that the
`modified query will not produce a NULL query result.
`The search refinement methods of the invention may be
`implemented, for example, as part of a web site, an Internet
`site, an on-line services network, a document retrieval
`system, or any other type of computer system that provides
`searching capabilities to a community of users. In addition,
`the method may be combined with other methods for
`suggesting related terms, such as methods which process the
`contents of located documents.
`
`A preferred web-based implementation of the search
`refinement system will now be described with reference to
`FIGS.
`1—9. For purposes of illustration,
`the system is
`described herein in the context of a search engine that is used
`to assist customers of Amazon.com Inc. in locating items
`(e.g., books, CDs, etc.) from an on-line catalog of products.
`Throughout the description, reference will be made to vari-
`ous implementation-specific details of the Amazon.com
`implementation. These details are provided in order to fully
`illustrate a preferred embodiment of the invention, and not
`to limit
`the scope of the invention. The scope of the
`invention is set forth in the appended claims.
`
`40
`
`45
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`These and other features will now be described with
`
`reference to the drawings summarized below. These draw-
`ings and the associated description are provided to illustrate
`a preferred embodiment of the invention, and not to limit the
`scope of the invention.
`Throughout the drawings, reference numbers are re-used
`to indicate correspondence between referenced elements. In
`addition, the first digit of each reference number indicates
`the figure in which the element first appears.
`FIG. 1 illustrates a system in which users access web site
`information via the Internet, and illustrates the basic web site
`components used to implement a search engine which
`operates in accordance with the invention.
`FIG. 2 illustrates a sample book search page of the web
`site.
`
`FIG. 3 illustrates sample log entries of a daily query log
`file.
`
`FIG. 4 illustrates the process used to generate the corre-
`lation table of FIG. 1.
`
`FIG. 5A illustrates a sample mapping before a query is
`added.
`
`FIG. 5B illustrates a sample mapping after a query is
`added.
`
`FIG. 6 illustrates a process for generating the correlation
`table from the most recent daily query log files.
`FIG. 7 illustrates a process for selecting the related query
`terms from the correlation table.
`
`FIG. 8A illustrates a set of related query terms from a
`single-term query.
`FIG. 8B illustrates a set of intersecting terms and a set of
`related query terms from a multiple-term query.
`FIG. 9 illustrates a sample search result page of the web
`site.
`
`14
`14
`
`I. Overview of Web Site and Search Engine
`
`60
`
`65
`
`FIG. 1 illustrates the Amazon.com web site 130, including
`components used to implement a search engine in accor-
`dance with the invention.
`As it is well known in the art of Internet commerce, the
`Amazon.com web site includes functionality for allowing
`users to search, browse, and make purchases from an online
`catalog of book titles, music titles, and other types of items
`via the Internet 120. Because the catalog contains millions
`of items, it is important that the site provide an eflicient
`mechanism for assisting users in locating items.
`
`
`
`6,006,225
`
`5
`As illustrated by FIG. 1, the web site 130 includes a web
`server application 131 (“web server”) which processes user
`requests received from user computers 110 via the Internet
`120. These requests include queries submitted by users to
`search the on-line catalog for products. The web server 131
`records the user transactions, including query submissions,
`within a query log 135. In the embodiment depicted in FIG.
`1, the query log 135 consists of a sequence of daily query log
`files 135(1)—135(M), each of which represents one day of
`transactions.
`
`10
`
`The web site 130 also includes a query server 132 which
`processes the queries by searching a bibliographic database
`133. The bibliographic database 133 includes information
`about the various products that users may purchase through
`the web site 130. This information includes, for example, the
`titles, authors, publishers, subject descriptions, and ISBNs
`(International Standard Book Numbers) of book titles, and
`the titles, artists, labels, and music classifications of music
`titles. The information for each item is arranged within fields
`(such as an “author” field and a “title” field), enabling the '
`bibliographic database 133 to be searched on a field-
`restricted basis. The site also includes a database 134 of
`
`15
`
`HTML (Hypertext Markup Language) content which
`includes, among other things, product information pages
`which show and describe the various products.
`The query server 132 includes a related term selection
`process 139 which identifies related query terms based on
`query term correlation data stored in a correlation table 137.
`As depicted in FIG. 1 and described below, the correlation
`table 137 is generated periodically from the M most recent
`daily query log files 135(1)—135(M) using an off-line table
`generation process 136.
`The web server 131, query server 132, table generation
`process 136, and database software run on one or more
`UnixTM-based servers and workstations (not shown) of the
`web site 130 although other types of platforms could be
`used. The correlation table 137 is preferably cached in RAM
`(random access memory) on the same physical machine as
`that used to implement the query server 132. To accommo-
`date large numbers of users, they query server 132 and the
`correlation table 137 can be replicated across multiple
`machines. The web site components that are invoked during
`the searching process are collectively referred to herein as a
`“search engine.”
`FIG. 2 illustrates the general format of a book search page
`200 of the Amazon.com web site 130 that can be used to
`
`search the bibliographic database 133 for book titles. Users
`have access to other search pages that can be used to locate
`music titles and other types of products sold by the on-line
`merchant. The book search page 200 includes author, title,
`and subject fields 210, 220, 240 and associated controls that
`allow the user to initiate fieldrestricted searches for book
`
`titles. Users can perform searches by first typing in the
`desired information into a search field 210, 220, 240 and
`then clicking on the appropriate search button 230, 250. The
`term or string of terms submitted to the search engine is
`referred to herein as the “query.” Other areas of the web site
`ask the user to submit queries without limiting the terms to
`specific fields.
`When the user submits a query from the book search page
`200 to the web site 130, a the query sever 132 applies the
`query to the bibliographic database 133, taking into account
`any field restrictions within the query. If the query result is
`a single item, the item’s product information page is pre-
`sented to the user. If the query result includes multiple items,
`the list of items is presented to the user through a query
`
`40
`
`45
`
`60
`
`65
`
`6
`result page which contains hypertextual links to the ite ns
`respective product information pages.
`For multiple-term queries, the query server 132 cf ec-
`tively logically ANDs the query terms together to perform
`the search. For example, if the user enters the terms “JAV ”
`and “PROGRAMMING” into the title field 220, the query
`server 132 will search for and return a list of all items hat
`have both of these terms within the title. Thus, if any query
`term does not produce a match (referred to herein as a
`“non-matching term”),
`the query will produce a NL LL
`query result. Presenting a NUIL query result to the user can
`cause significant user frustration. To reduce this problem, in
`this event, the user may be presented with a list of items hat
`are deemed to be “close matches.” Although the search
`engine described herein logically ANDs the query terms
`together, it will be recognized that the invention can be
`applied to search engines that use other methods for pro-
`cessing queries.
`In accordance with the invention, the search engine uses
`the query term correlation data stored in the correlation table
`137 to select the related terms that best match the user’s
`
`
`
`query. The search engine then presents the related terms to
`the user, allowing the user to refine the search and enhance
`discovery of relevant information. The query term correla-
`tion data indicates relationships between query terms, and is
`used to effectively predict query terms that are likely to be
`helpful to the search refinement process. In accordance with
`another aspect of the invention, the correlation table 137
`preferably contains or reflects historical information about
`the frequencies with which specific query terms have
`appeared together within the same query.
`The general format of the correlation table 137 is illus-
`trated in FIG. 1. In the embodiment depicted in FIG. 1 and
`described in detail herein, the correlations between query
`terms are based solely on frequency of occurrence within the
`same query. As described below, other types of query tern
`correlations can additionally be used. In addition, although
`the disclosed implementation uses a table to store the query
`term correlation data, other types of databases can be used.
`As illustrated by FIG. 1, each entry within the correlation
`table 137 (two entries shown) has two primary components:
`a “key” term 140, and a “related terms” list 142 for that key
`term. The related terms list 142 is a list of the N (e.g. 50)
`query terms that have appeared within the same query as the
`keyword with the highest degree of frequency, and is
`ordered according to frequency. For example, the entry for
`the key term COSMOS (ignoring the single-term prefixes,
`which are discussed below) is:
`COSMOS: ASTRONOMY, SAGAN, UNIVERSE,
`indicating that ASTRONOMY has appeared together with
`COSMOS with the highest degree of frequency; SAGAN
`has appeared with COSMOS with the second highest degree
`of frequency, and so on. Each term that appears within the
`related terms list 142 is deemed to be related to the corre-
`sponding key term 140 by virtue of the relatively high
`frequency with which the terms have occurred within the
`same query.
`As further depicted by FIG. 1, each related term and each
`key term 140 preferably includes a single-character field
`prefix which indicates the search field 210, 220, 240 to
`which the term corresponds. These prefixes may, for
`example, be as follows: A=author, T=title, S=subject,
`R=artist, L=label, G=generic. In addition, each related term
`is stored together with a correlation score 146 which, in the
`preferred embodiment, indicates the number of times the
`related term has appeared in combination with the key term
`
`15
`15
`
`
`
`6,006,225
`
`7
`(within the search fields indicated by their respective field
`prefixes), not counting queries that produced a NULL query
`result.
`
`the related term (including prefix)
`Thus, for example,
`S-ASTRONOMY has a correlation score of 410 under the
`key term T-COSMOS, indicating that four hundred and ten
`“successful” queries were received (during the time period
`to which the table 137 corresponds) which included the
`combination of COSMOS in the title field and
`
`5
`
`field. Although the field
`ASTRONOMY in the subject
`prefixes and correlation scores 146 carry information which
`is useful to the related terms selection process (as described
`below), such information need not be preserved.
`In operation, when a user submits a query to the web site
`130, the web server 131 passes the query to the q