`United States Patent
`6,006,225
`[45] Date of Patent:
`Dec. 21, 1999
`Bowmanet al.
`
`[11] Patent Number:
`
`US006006225A
`
`[54] REFINING SEARCH QUERIES BY THE
`SUGGESTION OF CORRELATED TERMS
`FROM PRIOR SEARCHES
`
`[75]
`
`Inventors: Dwayne E. Bowman, Woodinville;
`Ruben E. Ortega; Michael L.
`Hamrick, both of Seattle; Joel R.
`Spiegel, Woodinville; ‘Timothy R.
`Kohn,Seattle, all of Wash.
`
`[73] Assignee: Amazon.Com, Seattle, Wash.
`
`[21] Appl. No.: 09/145,360
`
`[22]
`
`Filed:
`
`Sep. 1, 1998
`
`Related U.S. Application Data
`[60]
`Provisional application No. 60/089,244, Jun. 15, 1998.
`[ST]
`Inte Coecceeeccccccsececscssssseeessessunecsecsssssees GO6F 17/30
`
`[52]
`7107/5; 707/2; 707/4; 707/10
`[58] Field of Search oe 707/5, 2, 10, 4
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`5,675,819
`5,721,897
`5,787,422
`5,794,233
`5,864,845
`5,911,140
`5,913,215
`
`10/1997 Schuetze wees ceeees 704/10
`2/1998 Rubinstein ....
`707/2
`/1998 Tukeyet al.
`..
`707/5
`/1998 Rubinstein ....
`707/4
`
`1/1999 Voorheeset al.
`.
`707/5
`
`we 707/5
`6/1999 Tukeyetal. ..
`6/1999 Rubinstein... 707/10
`OTHER PUBLICATIONS
`
`Bartell et al., “Automatic Combination of Multiple Ranked
`Retrieval Systems”, Proceedings of SIGIR ’94, Jul. 1994,
`pp. 173-181.
`Belkin et al., “The Effect of Multiple Query Representations
`on Information System Performance” Proceedings of SIGIR
`°93, Jun. 1993, pp. 339-346.
`Shaw et al., “Combination of Multiple Searches”, Proceed-
`ings of TREC-3, Apr. 1995, pp. 105-108.
`
`QuarterDeck Web Page, Downloaded Sep. 9, 1996, http://
`aracnid .qdeck.com/qdeck/products/webcompass.
`Towell et al. “Learning Collection ['usion Strategies for
`Information Retrieval’, Proceedings of the 12th Annual
`Machine [earning Conference, Jul. 1995, pp. 540-548.
`Voorhees et al., “Learning Collection Fusion Strategies”,
`Proceedings of SIGIR °95, Jul. 1995, pp. 172-179.
`Voorheesetal., “The Collection Fusion Problem” Proceed-
`ings of TREC—3, NIST Special Publication 500-225, Apr.
`1995, pp. 95-104.
`Abstract of Generating Advanced Query Interfaces, Lee,
`Srivastava and Vista, Computer Networks and ISDN Sys-
`tems Conference Title: Comput. Netw. ISDN Syst. (Neth-
`erlands) vol. 30, No. 1-7, pp. 656-657 (1998).
`Abstract of Using Combination ofEvidence for Term Expan-
`sion, Wilkinson, Information Retrieval Research, Proceed-
`ings of the 19th Annual BCS-IRSG Colloquium on IR
`Research (1997).
`
`(List continued on next page.)
`Primary Examiner—Paul R. Lintz
`Attorney, Agent, or Firm—Knobbe, Martens, Olson & Bear,
`LLP
`
`[57]
`
`ABSTRACT
`
`Asearch engineis disclosed which suggests related terms to
`the userto allow the user to refine a search. The related terms
`are generated using query term correlation data which
`reflects the frequencies with which specific terms have
`previously appeared within the same query. The correlation
`data is generated and stored in a look-up table using an
`off-line process which parses a query log file. The table is
`regenerated periodically from the most recent query sub-
`missions (e.g., the last two weeks of query submissions), and
`thus strongly reflects the current preferences of users. Each
`related term is presented to the user via a respective hyper-
`link which can be selected by the user to submit a modified
`query. In onc embodiment,the related terms are added to and
`selected from the table so as to guarantee that the modified
`queries will not produce a NULL queryresult.
`
`28 Claims, 10 Drawing Sheets
`
`CORRELATE TERMS SASED
`ON FREQUENCY OF OCCURRENCE
`
`WITHIN SAME QUERY
`
`450
`
`CREATE DAILY
`RESULTS FILE
`
`
`
`MERGE DAILY
`RESULTS FILES FOR
`LAST M DAYS
`
`
`
`PARSE DAILY LOG FILE
`
`
`TO EXTRACT QUERY
`SUBMISSIONS FOR WHICH
`
`
`ITEMS FOUND > 6
`
`
`
`
`
`
`
`OVERWRITE EXISTING CORRELATION
`TABLE WITH NEW CORRELATION TABLE
`
`450
`
`1
`1
`
`EX1026
`EX1026
`
`
`
`6,006,225
`
`Page 2
`
`OTHER PUBLICATIONS
`
`the NECI Meta Search Engine,
`Abstract of Inquirus,
`Lawrence and Giles, Computer Networks and ISDN Sys-
`tems Conference Title: Comput. Netw ISDN Syst. (Nether-
`lands) vol. 30, No. 1-7, pp. 95-105 (1998).
`Abstract of Facilitating Complex Web Queries Through
`Visual User Interfaces and Query Relaxtion, Li and Shim,
`Computer Networks and ISDN Systems Conference Title:
`Comput. Netw. ISDN Syst. (Netherlands) vol. 30, No. 1-7,
`pp. 149-159 (1998).
`A User—centred Evaluation ofRankingAlgorithms for Inter-
`active Query Expansion, Efthimiadis, Proceedings of the
`16th Annual International ACM SIGIR Conference, Pitts-
`burgh, pp. 146-159 (1993).
`Concept Based Query Expansion, Qiu and Frei, Proceedings
`of the 16th Annual International ACM SIGIR Conference,
`Pittsburgh, pp. 160-169 (1993).
`Improving Retrieval Performance by Relevance Feedback,
`Salton and Buckley, J. of Am. Society for Info. Science
`41(4):288-297 (1990).
`
`Query Expansion Using Domain—Adapted, Weighted The-
`saurus in an Extended Boolean Model, Kwon, Kim and
`Choi, Proceedings of the 3rd International Conference on
`Information and Knowledge Management (CIKM’94), pp.
`140-146 (1994).
`Browsing Through Querying: Designing for Electronic
`Books, Charoenkitkarn, Tam, Chignell and Golovichinsky,
`at
`the 5th ACM Conference on Hypertext, Scattle, WA
`206-216 (1993).
`A Survey of Information Retrieval and Filtering Methods,
`Faloutsos and Oard, Univ. of Maryland, 22 pages (undated).
`A Corpus Analysis Approach for Automatic Query FE-xpan-
`sion, Gauch and Wang, Proceedings of the 6th International
`Conference on Information and Knowledge Mangement, pp.
`278-284 (1997).
`Discovering Web Acess Patterns and Trends by Applying
`OLAP and Data Mining ‘lechnology on Web Logs, Zaiane,
`Xin and Han, Proceedings of the IEEE Forum on Research
`and Technology Advances in Digital Libraries
`(IEEE
`ADL’98), pp. 19-29 (1998).
`
`2
`
`
`
`Dec. 21, 1999
`
`Sheet 1 of 10
`
`6,006,225
`
`
` ,°*°!
`
`(grzIdv-S©0¢) aSY3AINN-S
`dIHdWYDOITIgaUSANIS|CcAY¥3nod|
`LESGEL
`
`asvavvdWYaLOFLVI3Y
`0Z£)DNINNVDONd—SOly)AWONOYLSY-S
`
`
`Z6Z334409-SO6£NVOVS-V
`celLEL|
`
`
`FIGVLNOWLVTSYYODAYINO
`vAWr-S|__—=—=—=—sSOWSOD-L|$07/
`
`U.S. Patent
`
`aus83Moy5patsGsmLeMe
`
`
`
`S$300UdNOILOITIS
`
`<n.
`
`wav
`
`S$3908d
`
`NOILVYANIO
`-;Ava
`
`9Auanod
`
`O1
`
`A
`
`3
`
`LU
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Dee. 21, 1999
`
`Sheet 2 of 10
`
`6,006,225
`
`200
`
`LJ fal
`
`Help
`File
`Edit
`View
`Go
`Favorite
`Q@ &
`a=-2- © 2 @&
`Back Forw...
`Stop Refresh Home Search
`Favorite
`
`Print A ee
`
`
`
`Address|http: //www.amazon.com/book_search |
`
`Amazon.com Home | Music Search | Your Account
`
`Sniezon.comcom Book Search
`Enter Author and/or Title
`
`Author:
`
`270
`
`@Exact Name
`
`OlLast, First Name
`
`OStart of Last Name
`
`Title:
`
`LO
`
`OStart(s) of Title Words
`@Title Word(s)
`OFxact Title
`
`
`
`Search Now
`Clear the Form
`
`Author Search Tips / Title Search Tips
`
`Search by Subject
`subjees [Sd
`
`240
`
`@Subject Word(s)
`OStart of Subject
`OkExact Subject
`
`
`
`Search Now
`Clear the Form
`
`Subject Search Tips
`
`OStart(s) of Subject Word(s)
`
`Other Search Methods:
`ISBN, Publisher/Date, Quick Search
`
`FIG. 2
`
`4
`
`
`
`U.S. Patent
`
`Dee. 21, 1999
`
`Sheet 3 of 10
`
`6,006,225
`
`310
`
`User Identifier = 29384719287
`HTTP_REFERRER= http://www.amazon.com/book_search_page
`PATH_INFO=/book_search
`title = Snow Crash
`ifems_found = 2
`
`JIAO Friday, 13-Feb-98 02:23:52
`
`I20
`
`SIO
`
`Friday, 13—-Feb—98 02:24:11
`User Identifier = 29384719287
`HTTP_REFERRER= hitp://www.amazon.com/book_search
`PATH_INFO=/ISBN = 0553562614
`
`Friday, 13-Feb-98 06:15:03
`User Identifier = 54730543261
`HTTP_REFERRER= http://www.amazon.com/music_search_page
`PATH_INFO=/music_search
`artist = This and That
`items_found = 0
`
`Friday, 13-Feb-98 10:07:34
`User Identifier = 027385918272
`HTTP_REFERRER= hitp://www.amazon.com/book_search_page
`PATH_INFO=/book_search
`subject = outdoor trail
`items_found = 22
`
`FIG. F
`
`5
`
`
`
`U.S. Patent
`
`Dee. 21, 1999
`
`Sheet 4 of 10
`
`6,006,225
`
`BEGIN
`
`470
`
`PARSE DAILY LOG FILE
`TO EXTRACT QUERY
`SUBMISSIONS FOR WHICH
`
`ITEMS FOUND > 0
`ON FREQUENCY OF OCCURRENCE WITHIN SAME QUERY
`
`CORRELATE TERMS BASED
`
`420
`
`4350
`
`CREATE DAILY
`RESULTS FILE
`
`440
`
`MERGE DAILY
`RESULTS FILES FOR
`
`LAST M DAYS
`
`OVERWRITE EXISTING CORRELATION
`TABLE WITH NEW CORRELATION TABLE
`
`450
`
`FIG. 4
`
`END
`
`6
`
`
`
`U.S. Patent
`
`Dee. 21, 1999
`
`Shect 5 of 10
`
`6,006,225
`
`Sua}N
`
`LES
`
`WVYL-S
`
`
`
`(sc)ONIZVIE-S
`
`
`
`(Ss)ONIZWIG-L
`
`(z)NOSTYvYO-V
`
`(26)
`
`
`
`(¢z)YOOGLNO-S
`
`
`
`(Zt) S$1¥OdS—S
`
`NOMNA-L
`
`VGDL
`
`(iz)Lugaavo-V
`
`(£9)ayla—-S
`
`(6)NOILVOVA-S
`XIN-S(17)
`
`
`(9¢)NolLvonda—-L
`
`(¢Z)NOILYOVA-S
`YINOVM—-V
`SLYOdS-S
`ONINIG-L
`
`(71)asl0Y3OXI-S
`(¢9)yOOaLNO-S(yo)=AdANH—-L
`
`
` (71)TVaL-L(8)SALVA-V(¥S)=YlWd3y-L
`
`
`
`
`(1+)TIWul-S
`(S)NITNVYI-¥
`
`(001)
`
`(s9)
`
`(2)
`
`(01)
`
`WVdl-$
`
`TWVaL—-L
`
`(£2)
`
`JMla-L
`
`
`
`ZL
`
`7
`
`
`
`
`
`
`
`
`U.S. Patent
`
`Dee. 21, 1999
`
`Shect 6 of 10
`
`6,006,225
`
`Sua]N
`
`GLEL
`
`
`
`
`
`
`
`(¥Z)¥YOOaLNO-Ss(99)
`
`
`
`
`
`(Zl) SLYOdS—S(Z)
`
`NOMNA-S
`
`(01)
`
`GSU7
`
`O65
`
`(Z6)XIN-S(7)2Liadyvo-V
`
`(Sc)SONIZVIE-S
`
` (00;(S)ONIZVI8—-L(v2)
`79IIg-S
`(6)NOILYOVA-S
`
`(9¢)NOILVONGI-L
`
`(¢%)NOILVOVA-S
`YINOVM-V
`SLYOdS—$
`ONINIG~L
`
`(Z1)3SI0Y3OXI-S
`(99)(yo)=ASAMH-LyYOOdLNO-s
`
`
`(7S)=alvd3u-L
`
`(Zr)TVYL-S
`
`(v1)TVal-L
`(z)NOSTYYO-V
`(S)NITINVYI-V
`(8)SILVA-V
`
`WYL-S
`
`
`
`AMIa-S
`
`TWUL-S
`
`WWal-L
`
`CPL
`
`8
`
`
`
`
`
`
`
`
`
`
`1—Feb-98
`
`2—-Feb-98
`
`3-Feb-98
`
`4—-Feb—-98
`
`5-Feb-98
`
`7—Feb—-98
`
`8—-Feb—98ee
`
`U.S. Patent
`
`Dee. 21, 1999
`
`Sheet 7 of 10
`
`6,006,225
`
`Date
`
`Daily Log
`ile
`
`Daily Results
`File
`
`Dates of Query
`Correlation Table
`
`Query Correlation
`Table
`
`
`6-Feb-98
`
`
`
`
`
`7—-Feb-98
`
`2-Feb-98
`
`|—|[-|[—[—|]
`
`670
`
`9
`
`
`
`U.S. Patent
`
`Dee. 21, 1999
`
`Sheet 8 of 10
`
`6,006,225
`
`SIO
`
`MULTI-TERM
`QUERY
`?
`
`
`
`760
`
`erm||
`_[_Next
`
`
`
`
`COMBINE ALL RELATED
`TERMS LISTS
`
`SELECT X TERMS WITH
`HIGHEST VALUES
`
`FIG. /
`
`10
`
`
`
`U.S. Patent
`
`Dee. 21, 1999
`
`Sheet 9 of 10
`
`6,006,225
`
`Top 3 Related Terms:
`
`TRAIL — MIX
`
`TRAIL -— YUKON _/
`
`TRAIL — BIKE
`
`E10
`
`FIG. GA
`
`Intersecting Terms:
`
`E20
`
`S — BIKE
`S — SPORTS
`S — VACATION
`
`E30
`QS
`
`Top 3 Related Terms:
`OUTDOOR TRAIL — BIKE
`OUTDOOR TRAIL — SPORTS
`OUTDOOR TRAIL — VACATION
`
`FIG. EB
`
`11
`
`
`
`U.S. Patent
`
`Dec. 21, 1999
`
`Sheet 10 of 10
`
`6,006,225
`
`900ts
`
`
`
`File
`Edit
`View
`Go
` Faverite
`Help
`dS
`== @®f}
`ff Qa Ga
`Back Forw... Stop Refresh Home
`Search
`Favorite
`Font
`
`
`Address|http: //www.amazon.com/bock_search
`
`amazon.com Book Search
`—oag
`
`
`
`|
`
` FIO
`
`
`Your Search Results
`
`for: the subject words include “OUTDOOR TRAIL”
`
`Related Query Terms:Click on a link below to narrow your
`search by adding a related term.
`OUTDOOR TRAIL — BIKE
`
`OUTDOOR TRAIL — SPORTS
`
`OUTDOOR TRAIL — VACATION
`
`Top Matches for this search:
`
`* Outdoor & Trail Guide to Yosemite
`
`* Qutdoor Training on Mountian Bike Trails
`
`
`
`
`
`
`to See
`Full Results: tne First 100 are shown below.
`more results scroll down and click the "More” button.
`
`
`
`The blessed trail: outdoor diary
`
`G20 T
`
`he Great Outdoor Cookbook: Trail Recipies
`Outdoor & Trail Guide to Yosemite
`
`
`
`12
`12
`
`
`
`6,006,225
`
`1
`REFINING SEARCH QUERIES BY THE
`SUGGESTION OF CORRELATED TERMS
`FROM PRIOR SEARCHES
`
`RELATED APPLICATION
`
`This application claims the benefit of U. S. Provisional
`Application No. 60/089,244, filed Jun. 15, 1998, the disclo-
`sure of which is hereby incorporated by reference.
`
`BACKGROUND OF THE INVENTION
`
`10
`
`1. Field of Invention
`
`2
`“RESTAURANTS,” “RECIPIES,” and “FDA”to allow the
`user to refine the query.
`The related query terms are commonly presented to the
`user together with corresponding check boxesthat can be
`selectively marked or checked by the user to add termsto the
`query. In some implementations, the related query terms are
`alternatively presented to and selected by the user through
`drop down menus that are provided on the query result page.
`In cither casc, the uscr can add additional terms to the query
`and then resubmit the modified query. Using this technique,
`the user can narrow the query result down to a more
`manageable set consisting primarily of relevant items.
`One problem with existing techniques for generating
`related query termsis that the related terms are frequently of
`little or no value to the search refinement process. Another
`problem is that the addition of one or morerelated terms to
`the query sometimesleads to a NULL queryresult. Another
`problemis that the process of parsing the query result items
`to identity frequently used terms consumessignificant pro-
`cessor resources, and can appreciably increase the amount of
`time the user must wait before viewing the query result.
`These and other deficiencies in existing techniques hinder
`the user’s goal of quickly and efficiently locating the most
`relevant items, and can lcad to user frustration.
`
`SUMMARYOF THE INVENTION
`
`The present invention addresses these and other problems
`by providing a search refinement system and method for
`generating and displaying related query terms (“related
`terms”). In accordance with the invention, the related terms
`are generating using query term correlation data that is based
`on historical query submissions to the search engine. The
`query term correlation data (“correlation data”) is preferably
`based at least upon the frequencies with whichspecific terms
`have historically been submitted together within the same
`query. The incorporation of such historical query informa-
`tion into the process tends to produce related terms that are
`frequently used by other users in combination with the
`submitted query terms, and significantly increases the like-
`lihood that these related terms will be helpful to the search
`refinement process. To further increase the likelihood that
`the related terms will be helpful,
`the correlation data is
`preferably generated only from those historical query sub-
`missions that produced a successful query result (at least one
`match).
`the
`In accordance with one aspect of the invention,
`correlation data is stored in a correlation data structure
`(table, database, etc.) which is used to look up related terms
`in response to query submissions.
`‘The data structure is
`preferably generated using an off-line process which parses
`a query log file, but could alternatively be generated and
`updated in real-time as queries are received from users. In
`one embodiment, the data structure is regenerated periodi-
`cally (c.g., once per day)
`from the most recent query
`submissions (e.g., the last M days of entries in the query
`log), and thus strongly reflects the current tastes of the
`community of users; as a result, the related terms suggested
`by the search engine stronglyreflect the current tastes of the
`community. Thus, for example, in the context of a search
`engine of an online merchant, the search engine tends to
`suggest related terms that correspond to the current best-
`selling products.
`In a preferred embodiment, cach entry in the data struc-
`ture is in the form of a key term and a corresponding related
`termslist. Each related terms list contains the terms which
`have historically appeared together (in the same query) with
`
`15
`
`-
`
`,
`
`This present invention relates to query processing, and
`more specifically relates to techniques for facilitating the
`process of refining scarch qucrics.
`2. Description of Related Art
`With the increasing popularity of the Internet and the
`World Wide Web, it is common for on-line users to utilize
`search engines to search the Internet for desired information.
`Manywebsites permit users to perform searches to identify
`a small number of relevant items among a much larger
`domain of items. As an example, several web index sites
`permit users to search for particular web sites among known
`web sites. Similarly, many on-line merchants, such as
`booksellers, permit users to search for particular products
`among all of the products that can be purchased from the
`merchant. Other on-line services, such as Lexis™ and
`Westlaw™, allow users to search for various articles and
`court opinions.
`In order to perform a search, a user submits a query
`containing one or more query terms. The query may also
`explicitly or implicitly identify a record field or segment to
`be searched, suchastitle, author, or subject classification of
`the item. For example, a user of an on-line bookstore may 3
`submit a query containing terms that
`the user believes
`appear within the title of a book. A query server program of
`the search engine processes the query to identify any items
`that match the terms of the query. The set of itemsidentified
`by the query server program is referred to as a “query
`result.” In the on-line bookstore example, the query result is
`a set of books whosetitles contain someorall of the query
`terms. In the web index site example, the query result is a set
`of web sites or documents. In web-based implementations,
`the query result is typically presented to the user as a
`hypertextual listing of the located items.
`If the scope of the search is large, the query result may
`contain hundreds, thousands or cven millions of items.If the
`user is performing the search in order to find a single item
`or a small set of items, conventional approachesto ordering
`the items within the query result often fail to place the
`sought item or items near the top of the query result list. This
`requires the user to read through many other items in the
`query result before reaching the sought item. Certain search
`engines, such as Excite™ and AltaVista™, suggest related
`query termsto the user as a part of the “search refinement”
`process. This allows the user to further refine the query and
`narrow the query result by selecting one or more related
`query terms that more accurately reflect the user’s intended
`request. The related query terms are typically generated by
`the search engine using the contents of the query result, such
`as byidentifying the most frequently used terms within the
`located documents. For example, if a user were to submit a
`query on the term “FOOD,” the uscr may reccive several
`thousand items in the query result. The search engine might
`then trace through the contents of someorall of these items
`and present
`the user with related query terms such as
`
`40
`
`45
`
`60
`
`65
`
`13
`13
`
`
`
`6,006,225
`
`3
`the respective key term with the highest degree of frequency,
`ignoring unsuccessful query submissions (query submis-
`sions which produced a NULL query result). The data
`structure thus provides an efficient mechanism for looking
`up the related terms for a given query term.
`To generate a set of related terms for refining a submitted
`query (the “present query”), the related terms list for each
`term in the present query is initially obtained from the
`correlation data structure. If this step produces multiple
`related termslists (as in the case of a multiple-term query),
`the related termslists are preferably combined by taking the
`intersection betweenthese lists (i.e., deleting the terms that
`are not commontoalllists). The related terms which remain
`are terms which have previously appeared, in at least one
`successful query submission,
`in combination with every
`term of the present query. Thus, assuming items have not
`been deleted from the database being searched, any of these
`related terms can be individually added to the present query
`while guaranteeing that the modified query will not produce
`a NULL query result. To take advantageof this feature, the -
`related terms are preferably presented to the user via a user
`interface that requires the user to add no more than one
`related term per query submission. In other embodiment,the
`related terms are selected and displayed without guarantee-
`ing a successful query result.
`Because the related terms are identified from previously-
`generated correlation data, without the need to parse docu-
`ments or correlate terms, the related terms can be identified
`and presented to the user with little or no added delay.
`
`10
`
`15
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`These and other features will now be described with
`
`reference to the drawings summarized below. These draw-
`ings and the associated description are providedto illustrate;
`a preferred embodimentof the invention, and notto limit the
`scope of the invention.
`Throughout the drawings, reference numbers are re-used
`to indicate correspondence between referenced elements. In
`addition, the first digit of each reference number indicates
`the figure in which the elementfirst appears.
`FIG. 1 illustrates a system in which users access web site
`information via the Internet, and illustrates the basic web site
`components uscd to implement a scarch cngine which
`operates in accordance with the invention.
`FIG. 2 illustrates a sample book search page of the web
`site.
`
`40
`
`45
`
`FIG. 3 illustrates sample log entries of a daily query log
`file.
`
`FIG. 4 illustrates the process used to generate the corre-
`lation table of FIG. 1.
`
`FIG. 5Aillustrates a sample mapping before a query is
`added.
`
`FIG. 5B illustrates a sample mapping after a query is
`added.
`
`FIG. 6 illustrates a process for generating the correlation
`table from the most recent daily query log files.
`FIG. 7 illustrates a process for selecting the related query
`terms from the correlation table.
`
`FIG. 8Aillustrates a set of related query terms from a
`single-term query.
`FIG. 8B illustrates a set of intersecting terms and a set of
`related query terms from a multiple-terrn query.
`FIG. 9 illustrates a sample search result page of the web
`site.
`
`4
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
`
`The present invention provides a search refinement sys-
`tem and methodfor generating related query terms(“related
`terms”) using a history of queries submitted to a search
`engine by a community of users. Briefly, the system gener-
`ates query term correlation data whichreflects the frequency
`with which specific terms have previously occurred together
`within the same query. The system uses the query term
`correlation data in combination with the query term(s)
`entered by the user to recommendadditional query termsfor
`refining the query. The incorporation of such historical query
`information into the process tends to produce related terms
`that are frequently used by other users in combination with
`the submitted query terms, and significantly increases the
`likelihood that these related terms will be helpful to the
`search refinementprocess. To further increase the likelihood
`that the related terms will be helpful, the correlation data is
`preferably generated only from those historical query sub-
`missions that produced a successful queryresult (at least one
`match).
`In the preferred embodiment, the query term correlation
`date is regenerated periodically from recent query
`submissions, such as by using the last M days ofentries in
`a query log, and thus heavily reflects the current tastes of
`users. As a result, the related terms suggested by the search
`engine tend to be terms that correspond to the most fre-
`quently searched items during the relevant
`time period.
`Thus, for example, in the context of a search engine of an
`online merchant, the search engine tends to suggest related
`terms that correspondto the current best-selling products. In
`one embodiment, the technique used to generate the related
`terms and present these terms to the user guarantees that the
`modified query will not produce a NULL queryresult.
`The search refinement methods of the invention may be
`implemented, for example, as part of a website, an Internet
`site, an on-line services network, a document retrieval
`system, or any other type of computer system that provides
`searching capabilities to a community of users. In addition,
`the method may be combined with other methods for
`suggesting related terms, such as methods which process the
`contents of located documents.
`
`A preferred web-based implementation of the search
`refinement system will now be described with reference to
`FIGS. 1-9. For purposes of illustration,
`the system is
`described herein in the context of a search enginethat is used
`to assist customers of Amazon.com Inc. in locating items
`(e.g., books, CDs,etc.) from an on-line catalog of products.
`Throughout the description, reference will be made to vari-
`ous implementation-specific details of the Amazon.com
`implementation. These details are provided inorderto fully
`illustrate a preferred embodiment of the invention, and not
`to limit
`the scope of the invention. The scope of the
`invention is set forth in the appended claims.
`
`I. Overview of Web Site and Search Engine
`
`60
`
`65
`
`FIG. 1 illustrates the Amazon.com website 130, including
`components used to implement a search engine in accor-
`dance with the invention.
`As it is well known in the art of Internet commerce, the
`Amazon.com web site includes functionality for allowing
`users to search, browse, and make purchases fram an online
`catalog of booktitles, musictitles, and other types of items
`via the Internet 120. Because the catalog contains millions
`of items, it is important that the site provide an efficient
`mechanism for assisting users in locating items.
`
`14
`14
`
`
`
`6,006,225
`
`5
`As illustrated by FIG. 1, the web site 130 includes a web
`server application 131 (“web server”) which processes user
`requests received from user computers 110 via the Internet
`120. These requests include queries submitted by users to
`search the on-line catalog for products. The web server 131
`records the user transactions, including query submissions,
`within a query log 135. In the embodiment depicted in FIG.
`1, the querylog 135 consists of a sequence of daily query log
`files 135(1)-135(M), each of which represents one day of
`transactions.
`
`The website 130 also includes a query server 132 which
`processes the queries by searching a bibliographic database
`133. The bibliographic database 133 includes information
`about the various products that users may purchase through
`the website 130. This information includes, for example, the
`titles, authors, publishers, subject descriptions, and ISBNs
`(International Standard Book Numbers) of booktitles, and
`the titles, artists, labels, and music classifications of music
`titles. The information for each itemis arranged withinfields
`(such as an “author” field and a “title” field), enabling the
`bibliographic database 133 to be searched on a field-
`restricted basis. The site also includes a database 134 of
`
`HIML (Hypertext Markup Language) content which
`includes, among other things, product information pages
`which show and describe the various products.
`The query server 132 includes a related term selection
`process 139 which identifies related query terms based on
`query term correlation data stored in a correlation table 137.
`As depicted in FIG. 1 and described below, the correlation
`table 137 is generated periodically from the M most recent
`daily query log files 135(1)-135(M)using an off-line table
`generation process 136.
`The web server 131, query server 132, table generation
`process 136, and database software run on one or more
`Unix™-based servers and workstations (not shown) of the
`web site 130 although other types of platforms could be
`used. The correlation table 137 is preferably cached in RAM
`(random access memory) on the same physical machine as
`that used to implement the query server 132. To accommo-
`date large numbers of users, they query server 132 and the
`correlation table 137 can be replicated across multiple
`machines. The web site components that are invoked during
`the searching process are collectively referred to herein as a
`“search engine.”
`FIG. 2 illustrates the general format of a book search page
`200 of the Amazon.com web site 130 that can be used to
`
`search the bibliographic database 133 for booktitles. Users
`have access to other search pages that can be used to locate
`music titles and other types of products sold by the on-line
`merchant. The book search page 200 includes author,title,
`and subject fields 210, 220, 240 and associated controls that
`allow the user to initiate fieldrestricted searches for book
`
`10
`
`15
`
`20
`
`40
`
`45
`
`titles. Users can perform searches by first typing in the
`desired information into a search field 210, 220, 240 and ;
`then clicking on the appropriate search button 230, 250. The
`term or string of terms submitted to the search engine is
`referred to herein as the “query.” Other areas of the website
`ask the user to submit queries without limiting the terms to
`specific fields.
`Whenthe user submits a query from the book search page
`200 to the web site 130, a the query sever 132 applies the
`query to the bibliographic database 133, taking into account
`any ficld restrictions within the query. If the query result is
`a single item, the item’s product information page is pre-
`sented to the user. If the query result includes multiple items,
`the list of items is presented to the user through a query
`
`60
`
`65
`
`
`
`6
`result page which contains hypertextual links to the items
`respective product information pages.
`For multiple-term queries, the query server 132 effec-
`tively logically ANDs the query terms together to perform
`the search. For example, if the user enters the terms “JAVA”
`and “PROGRAMMING?”into the title field 220, the query
`server 132 will search for and return a list of all items that
`have bath of these terms within the title. Thus, if any query
`term docs not produce a match (referred to herein as a
`“non-matching term’),
`the query will produce a NULL
`query result. Presenting a NUL. query result to the user can
`cause significant user frustration. To reduce this problem, in
`this event, the user may be presented with a list of items that
`are deemed to be “close matches.” Although the search
`engine described herein logically ANDs the query terms
`together, it will be recognized that the invention can be
`applied to search engines that use other methods for pro-
`cessing queries.
`In accordance with the invention, the search engine uses
`the query term correlation data stored in the correlation table
`137 to select the related terms that best match the user’s
`
`query. The search engine then presents the related terms to
`the user, allowing the user to refine the search and enhance
`discovery of relevant information. The query term corrcla-
`tion data indicates relationships between query terms,and is
`used to effectively predict query terms that are likely to be
`helpful to the search refinement process. In accordance with
`another aspect of the invention, the correlation table 137
`preferably contains or reflects historical information about
`the frequencies with which specific query terms have
`appeared together within the same query.
`The general format of the correlation table 137 is illus-
`trated in FIG. 1. In the embodiment depicted in FIG. 1 and
`described in detail herein, the correlations between query
`terms are based solely on frequency of occurrence within the
`same query. As described below, other types of query tern
`correlations can additionally be used. In addition, although
`the disclosed implementation uses a table to store the query
`term correlation data, other types of databases can be used.
`As illustrated by FIG. 1, each entry within the correlation
`table 137 (two entries shown) has two primary components:
`a “key” term 140, and a “related terms”list 142 for that key
`term. The related terms list 142 is a list of the N (e.g. 50)
`query terms that have appeared within the same query as the
`keyword with the highest degree of frequency, and is
`ordered according to frequency. For example, the entry for
`the key term COSMOS(ignoring the single-term prefixes,
`which are discussed below)is:
`COSMOS: ASTRONOMY, SAGAN, UNIVERSE,
`indicating that ASTRONOMYhasappeared together with
`COSMOSwith the highest degree of frequency, SAGAN
`has appeared with COSMOSwith the second highest degree
`of frequency, and so on. Each term that appears within the
`related terms list 142 is deemedto be related to the corre-
`sponding key term 140 by virtue of the relatively high
`frequency with which the terms have occurred within the
`same query.
`As further depicted by FIG. 1, each related term and each
`key term 140 preferably includes a single-character field
`prefix which indicates the search field 210, 220, 240 to
`which the term corresponds. These prefixes may, for
`example, be as follows: A=author, T=title, S=subject,
`Re=artist, L=labcl, G=gencric. In addition, cach related term
`is stored together with a correlation score 146 which,in the
`preferred embodiment, indicates the number of times the
`related term has appeared in combination with the key term
`
`15
`15
`
`
`
`6,006,225
`
`7
`(within the search fields indicated by their respective field
`prefixes), not counting queries that produced a NULL query
`result.
`the related term (including prefix)
`Thus, for example,
`S-ASTRONOMYhasa correlation score of 410 under the
`key term T-COSMOS,indicating that four hundred and ten
`“successful” queries were received (during the time period
`to which the table 137 corresponds) which included the
`combination of COSMOS in the title field and
`
`10
`
`15
`
`field. Although the field
`ASTRONOMYin the subject
`prefixes and correlation scores 146 carry information which
`is useful to the related terms selection process (as described
`below), such information necd not be preserved.
`In operation, when a user submits a query to the website
`130, the web server 131 passes the query to the query