`Braden-Harder et al.
`
`I 1111111111111111 11111 lllll lllll lllll 111111111111111 lllll 111111111111111111
`US005933822A
`[11] Patent Number:
`[45] Date of Patent:
`
`5,933,822
`Aug. 3, 1999
`
`[54] APPARATUS AND METHODS FOR AN
`INFORMATION RETRIEVAL SYSTEM THAT
`EMPLOYS NATURAL LANGUAGE
`PROCESSING OF SEARCH RESULTS TO
`IMPROVE OVERALL PRECISION
`
`R. Pohlmann et al, "The Effect of Syntactic Phrase Indexing
`on Retrieval Performance for Dutch Tests", Conference
`Proceedings of RIAO 97, Computer-Assisted Information
`Searching in Internet, McGill University, Quebec, Canada,
`Jun. 25-27, 1997, vol. 1, pp. 176-187.
`
`[75]
`
`Inventors: Lisa Braden-Harder, Reston, Va.;
`Simon H. Corston, Seattle, Wash.;
`William B. Dolan, Redmond, Wash.;
`Lucy H. Vanderwende, Bellevue,
`Wash.
`
`[73] Assignee: Microsoft Corporation, Redmond,
`Wash.
`
`[21] Appl. No.: 08/898,652
`Jul. 22, 1997
`
`[22] Filed:
`
`Int. Cl.6
`...................................................... G06F 17/00
`[51]
`[52] U.S. Cl. ......................................... 707/5; 707/3; 707/4
`[58] Field of Search .................. 707/5, 4, 3; 364/419.19;
`395/708
`
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,278,980
`5,544,049
`5,642,502
`5,671,404
`5,694,592
`5,706,497
`5,724,571
`5,794,050
`5,826,261
`
`1/1994 Pederson et al. ........................... 707/4
`8/1996 Henderson et al.
`............... 364/419.19
`6/1997 Driscoll ....................................... 707/5
`9/1997 Lizee et al. ................................. 707/5
`12/1997 Driscoll . ... ... ... .... ... ... ... ... ... .... ... ... 707 /3
`1/1998 Takahashi et al. .......................... 707/5
`3/1998 Woods
`........................................ 707/5
`8/1998 Dahlgren et al. . ... ... ... ... .... ... ... 395 /708
`10/1998 Spencer ....................................... 707/5
`
`OIBER PUBLICATIONS
`
`B. Katz, "Annotating the World Wide Web using Natural
`Language", Conference Proceedings of RIAO 97, Comput(cid:173)
`er-Assisted Information Searching in Internet, McGill Uni(cid:173)
`versity, Quebec, Canada, Jun. 25-27 1997, vol. 1, pp.
`136-155.
`A T. Arampatzis et al, "IRENA: Information Retrieval
`Engine based on Natural language Analysis", Conference
`Proceedings of RIAO 97, Computer-Assisted Information
`Searching in Internet, McGill University, Quebec, Canada,
`Jun. 25-27, 1997, vol. 1, pp. 159-175.
`
`(List continued on next page.)
`
`Primary Examiner-Wayne Amsbury
`Assistant Examiner-Thuy Pardo
`Attorney, Agent, or Firm-Michaelson & Wallace; Peter L.
`Michaelson
`
`[57]
`
`ABSTRACT
`
`Apparatus and accompanying methods for an information
`retrieval system that utilizes natural language processing to
`process results retrieved by, for example, an information
`retrieval engine such as a conventional statistical-based
`search engine, in order to improve overall precision.
`Specifically, such a search ultimately yields a set of retrieved
`documents. Each such document is then subjected to natural
`language processing to produce a set of logical forms. Each
`such logical form encodes, in a word-relation-word manner,
`semantic relationships, particularly argument and adjunct
`structure, between words in a phrase. A user-supplied query
`is analyzed in the same manner to yield a set of correspond(cid:173)
`ing logical forms therefor. Documents are ranked as a
`predefined function of the logical forms from the documents
`and the query. Specifically, the set of logical forms for the
`query is then compared against a set of logical forms for
`each of the retrieved documents in order to ascertain a match
`between any such logical forms in both sets. Each document
`that has at least one matching logical forms is heuristically
`scored, with each different relation for a matching logical
`forms being assigned a different corresponding predefined
`weight. The score of each such document is, e.g., a pre(cid:173)
`defined function of the weights of its uniquely matching
`logical forms. Finally, the retained documents are ranked in
`order of descending score and then presented to a user in that
`order.
`
`123 Claims, 14 Drawing Sheets
`
`Page 1 of 34
`
`GOOGLE EXHIBIT 1020
`
`
`
`5,933,822
`Page 2
`
`OIBER PUBLICATIONS
`
`M. Mitra et al, "An Analysis of Statistical and Syntactic
`Phrases", Conference Proceedings of RIAO
`97,
`Computer-Assisted Information Searching
`in Internet,
`McGill University, Quebec, Canada, Jun. 25-27, 1997, vol.
`1, pp. 200-214.
`P. Bruza et al, "Query ReFormulation on the Internet:
`Empirical Data and the Hyperindex Search Engine", Con(cid:173)
`ference Proceedings of RIAO 97, Computer-Assisted Infor(cid:173)
`mation Searching in Internet, McGill University, Quebec,
`Canada, Jun. 25-27, 1997, vol. 1, pp. 488-499.
`G. Grefenstette, "SQLET: Short Query Linguistic Expan(cid:173)
`sion Techniques, Palliating One-Word Queries by Providing
`Intermediate Structure to Text", Conference Proceedings of
`RIAO 97, Computer-Assisted Information Searching in
`Internet, McGill University, Quebec, Canada, Jun. 25-27,
`1997, vol. 1, pp. 500--509.
`R. Chandrasekar et al, "Using Syntactic Information in
`of
`Document
`Filtering: A Comparative
`Study
`Part-of-Speech Tagging and Supertagging", Conference
`Proceedings of RIAO 97, Computer-Assisted Information
`Searching in Internet, McGill University, Quebec, Canada,
`Jun. 25-27, 1997, vol. 1, pp. 531-545.
`M.A. Hearst, "TextTiling: Segmenting Text into Multi-para(cid:173)
`graph Subtopic Passages", Computational Linguistics, vol.
`23, No. 1, 1997, pp. 33-64.
`
`0. Etzoni, "The World-Wide Web: Quagmire or Gold
`Mine", Communications of the ACM, Nov. 1996, vol. 39,
`No. 11, pp. 65-68.
`
`T. Strzalkowski, "Natural Language Information Retrieval:
`TIPSTER-2 Final Report", Proceedings of Advances in Text
`Processing: Tipster Program Phase 2, DARPA, May 6-8,
`1996, Tysons Corner, Virginia, pp. 143-148.
`
`D. D. Lewis et al, "Natural language Processing for Infor(cid:173)
`mation Retrieval", Communications of the A CM, Jan. 1996,
`vol. 39, No. 1, pp. 92-101.
`
`T. Strzalkowski, "Natural Language Information Retrieval",
`Information Processing and Management, vol. 31, No. 3,
`1995, pp. 397-417.
`
`K. Jensen et al (editors), Natural Language Processing: The
`PLNLP Approach (© 1993, Kluwer Academic Publishers),
`Chapter 3 "PEG: The PLNLP English Grammar", pp. 29-45
`and Chapter 16 "PEGASUS: Deriving Argument Structures
`after Syntax", pp. 203-214.
`
`J. L. Fagan, "Experiments in Automatic Phrase Indexing for
`Document Retrieval: A Comparison of Syntactic and
`Non-Syntactic Methods", Ph.D. Thesis, Cornell University,
`1988, pp. i-261.
`
`Page 2 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 1 of 14
`
`5,933,822
`
`QUERY
`(FULL - TEXT --~----.l----f-_J__ ___ _
`RETRIEVED
`FORM)
`
`DATA-
`
`SET
`STORED
`DOCUMENTS
`
`r---RE_TR--1....IE-VA_L_oo;MENTS NATURAL
`ENGINE
`LANGUAGE
`t - - - -~
`1 O
`PROCESSOR
`
`25
`
`20
`
`30
`
`RANKED
`._-----. RETRIEVED
`DOCUMENTS
`35
`FIG. 1
`
`220
`
`J
`
`222
`
`__,225
`SEARCH
`ENGINE
`COMPUTER
`
`DATA(cid:173)
`SET
`
`SERVER
`
`227
`
`FIG. 2
`
`201
`
`300
`
`200
`
`QUERY
`
`RANKED
`DOCUMENTS
`
`420
`
`WEB
`BROWSER
`COMPUTER
`SYSTEM
`(CLIENT PC)
`
`205
`
`NETWORK 21 O
`( e.g. INTERNET)
`
`215
`
`203
`
`QUERY
`(FROM USER)
`
`-
`STATISTICALLY
`RETRIEVED
`DOCUMENT
`RECORDS
`(FROM SEARCH
`ENGINE)
`
`RETRIEVED
`DOCUMENTS
`
`·---
`400
`
`-,.
`
`-,..
`
`-
`
`\
`422
`
`4~2
`
`\
`442
`
`/420
`
`RETRIEVAL
`PROCESS
`
`600
`
`WEB BROWSER
`
`APPLICATION PROGRAMS
`
`\
`426
`
`)
`436
`
`)
`446
`
`-
`,.
`
`.
`
`-
`
`-
`
`QUERY
`(TO SEARCH
`ENGINE)
`-
`COMMAND TO
`DOWNLOAD
`DOCUMENT FOR
`EACH RETRIEVED
`RECORD
`RANKED
`RETRIEVED
`DOCUMENTS
`
`'---
`
`FIG. 4
`
`Page 3 of 34
`
`
`
`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`0 ....,
`N
`~ ....
`rF.J. =(cid:173)~
`
`\0
`\0
`'"""' \0
`~~
`~
`~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`SELECTIONS
`COMMANDS &
`
`USER
`
`MOUSE
`
`KEYBOARD,
`
`DEVICE
`-INPUT
`USER
`390
`
`I
`
`COMPUTER SYSTEM
`
`(PC CLIENT)
`
`FIG. 3
`
`r
`395
`
`400
`\
`
`375
`...,
`
`PROGRAMS
`APPLICATION
`
`300/
`
`' 330
`
`3~5
`
`. \ -f PRINTER
`367
`
`I
`
`380
`)
`
`\ ; l DISPLAY
`363
`
`1/F
`OUTPUT
`
`~
`360
`
`fl
`378
`
`1/F
`COMM
`
`o/s
`
`I
`
`MEMORY
`
`Jl
`
`lf
`
`PROCESSOR
`
`3?0,
`
`3f0
`
`31 o.
`
`I
`
`370
`\
`
`1/F
`INPUT
`
`205
`
`-
`
`DEDICATED
`
`SOURCES
`
`INPUT
`
`ET)
`
`( e.g. INTRA-N
`
`ACCESS
`NETWORK
`OTHER
`AND/OR
`INTERNET
`
`Page 4 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 3 of 14
`
`5,933,822
`
`510
`~ INPUT STRING:
`
`THE OCTOPUS HAS THREE HEARTS.
`
`LOGICAL FORM
`GRAPH:
`
`LOGICAL FORM
`TRIPLES:
`
`H~E
`Dsub-OCTOPUS
`Dobj -
`HEART
`L__ Ops -
`
`_,,,---515
`
`THREE
`
`HAVE- Dsub-OCTOPUS}
`HAVE - Dobj - HEART
`HEART- Ops - THREE
`FIG. 5A
`
`525
`
`530
`"-... INPUT STRING:
`
`LOGICAL FORM
`GRAPH:
`
`THE OCTOPUS HAS THREE HEARTS AND TWO LUNGS.
`HAVE t= Dsub- OCTOPUS
`Dobj -AND
`I
`Crds - - HEART
`
`/535
`
`LOGICAL FORM
`TRIPLES:
`
`LO~s-THREE
`
`LUNG
`I
`Ops-TWO
`
`HAVE - Dsub - OCTOPUS
`HAVE - Dobj - HEART
`HAVE - Dobj -
`LUNG
`HEART- Ops -
`THREE
`LUNG - Ops -
`TWO
`FIG. 5B
`
`540
`
`Page 5 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 4 of 14
`
`5,933,822
`
`55o"'- INPUT STRING:
`
`THE OCTOPUS HAS THREE HEARTS AND
`IT CAN SWIM.
`
`LOGICAL FORM
`GRAPH:
`
`AND
`Lcrds
`
`HAVE
`t=osub- OCTOPUS
`Dobj - HEART
`
`/555
`
`LOGICAL FORM
`TRIPLES:
`
`Ops-THREE
`
`SWIM
`L_Dsub-lT
`~-Refs- OCTOPUS
`HAVE - Dsub -
`OCTOPUS
`HAVE - Do~
`-
`HEART
`HEART - Ops
`-
`THREE
`SWIM - Dsub -
`IT
`SWIM - Dsub -
`OCTOPUS
`FIG. 5C
`
`560
`
`570
`
`"'- INPUT STRING:
`
`I LIKE SHARK FIN SOUP BOWLS.
`
`LOGICAL FORM
`GRAPH:
`
`LOGICAL FORM
`TRIPLES:
`
`/
`
`575
`
`LIKE
`t= □sub - I
`Dobj - BOWL
`'-----Mods-SHARK
`~FIN
`L_SOUP
`
`I
`BOWL
`SHARK
`FIN
`SOUP
`SHARK
`SHARK
`FIN
`
`LIKE
`- Dsub -
`LIKE
`- Dobj -
`BOWL - Mods -
`BOWL - Mods -
`BOWL - Mods -
`FIN
`- Mods -
`SOUP - Mods -
`SOUP - Mods -
`FIG. 5D
`
`580
`
`Page 6 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 5 of 14
`
`5,933,822
`
`~ ENTER
`
`OBTAIN FULL TEXT
`QUERY FROM
`USER
`
`RETRIEVAL
`PROCESS
`600
`
`._ 605
`
`'- 607
`
`225
`jS-EARCH _L_ -
`QUERY
`ENGINE
`'"- -1- - -t- ► RETRIEVE
`615
`SET OF
`I
`DOCUMENT
`RECORDS
`IN
`I
`ACCESS AND DOWNLOAD L L
`630
`RESPONSE TO
`1-c _\_ _ 7 _ _ QUERY
`- -
`- -
`- -
`RETRIEVED DOCUMENT
`RECORDS
`
`- -7
`,
`I
`
`625
`
`1
`
`I
`
`I
`
`I
`,
`~
`
`-
`
`-
`
`1---- 643
`
`SEND FULL-TEXT
`610--- QUERY TO SEARCH
`ENGINE
`
`RECEIVE SET OF
`DOCUMENT RECORDS
`___ FROM SEARCH ENGINE;
`635
`
`EACH CORRESPONDING
`DOCUMENT
`
`I
`
`1
`
`I
`
`•
`
`, 645
`
`INVOKE NLP
`ROUTINE 700
`TO YIELD LOGICAL
`FORM TRIPLES
`FOR FULL - TEXT
`QUERY AND STORE
`IN LOCAL DATABASE
`
`640
`,-
`CONVERT EACH DOCUMENT TO TEXT;
`PERFORM SENTENCE BREAKING
`ON EACH DOCUMENT; AND
`INVOKE NLP ROUTINE 700
`FOR EACH DOCUMENT TO
`YIELD LOGICAL FORM TRIPLES
`THEREFOR AND STORE IN
`DATASET
`
`I
`
`L---650
`
`COMPARE LOGICAL
`FORM TRIPLES FOR QUERY
`AGAINST LOGICAL
`FORM TRIPLES FOR
`EACH DOCUMENT IN SET
`TO LOCATE MATCHING
`LOGICAL FORM TRIPLES
`
`DISCARD ALL DOCUMENTS
`WHICH DO NOT CONTAIN
`ANY MATCHING LOGICAL
`FORM TRIPLES
`
`_________ J --------
`
`FIG. 6A
`
`Page 7 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 6 of 14
`
`5,933,822
`
`'
`FOR EACH REMAINING
`DOCUMENT, ASSIGN WEIGHT
`TO EACH MATCHING LOGICAL
`FORM TRIPLE THEREIN; AND
`SUM WEIGHTS ACROSS EACH
`DOCUMENT, AS APPROPRIATE,
`TO OBTAIN WEIGHT FOR
`THAT DOCUMENT
`
`-- 6 60
`
`RANK DOCUMENTS IN TERMS
`OF DESCENDING ORDER OF
`THEIR WEIGHTS
`
`c- 665
`
`DISPLAY RETRIEVED DOCUMENTS
`IN RANKED ORDER
`
`~ EXIT
`FIG. 68
`
`FIG.
`6A
`
`FIG.
`68
`FIG. 6
`
`Page 8 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 7 of 14
`
`5,933,822
`
`NLP ROUTINE
`700
`
`ENTER
`
`PROCESS INPUT TEXT TO YIELD LOGICAL
`FORM GRAPH
`
`i.-710
`
`EXTRACT LOGICAL FORM TRIPLES FROM GRAPH
`
`OUTPUT EACH LOGICAL FORM TRIPLE
`AS A DISTINCT FORMATTED STRING
`
`-730
`
`I,
`
`FOR INPUT TEXT, STORE EACH LOGICAL
`FORM TRIPLE THEREFOR ALONG WITH ITS
`ASSOCIATED DOCUMENT AND SENTENCE
`PROVENANCE (i.e. SOURCE LOCATION)
`IN A DATASET
`
`'
`
`EXIT
`
`FIG. 7
`
`Page 9 of 34
`
`
`
`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`0 ....,
`00
`~ ....
`'JJ. =(cid:173)~
`
`\0
`\0
`'"""' \0
`~~
`~
`~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`>~,,,..--DOCUMEN~'4,.845
`
`QUERY
`
`~-------LOGICAL FORM TRIPLES -----------------...
`
`DOCUMENT 3: ARTICLE ABOUT DEER
`DOCUMENT 2: ARTICLE ABOUT OCTOPI
`OCUMENT 1: RECIPE CONTAINING ARTICHOKE HEARTS AND OCTOPUS
`
`iNLP
`
`!D
`
`DOCUMENTS
`RETRIEVED
`STATISTICALLY
`
`820
`
`QUERY: HOW MANY HEARTS DOES AN OCTOPUS HAVE?
`
`810~
`
`✓ HAVE-Dobj-HEART
`HAVE-Dsub-DEER
`DOCUMENT .f, ....
`~-------------------------------~ 865
`
`FIG. 88
`
`TOTAL DOCUMENT
`
`WEIGHT=100
`
`~ /
`
`RANKING
`
`WEIGHT=100+ 75= 175
`
`TOTAL DOCUMENT
`
`ORDER OF DOCUMENT
`
`DOCUMENT 3
`DOCUMENT 2;
`PRESENTATION:
`
`t
`WEIGHT
`TOTAL
`BY
`
`WITH QUERY
`
`1 MATCH
`
`TRIPLES
`
`860
`)
`
`WITH QUERY
`2 MATCHES
`
`TRIPLES
`
`850
`)
`
`HAVE-Dsub-OCTOPUS
`✓HAVE-Dobj-HEART
`✓HAVE-Dsub-OCTOPUS
`DOCUMENT 2""~55
`
`DOCUMENT 1
`
`DISCARD
`
`TRIPLES
`
`WITH QUERY
`
`)I NO MATCH
`
`840
`
`HEART-Nadj-ARTICHOKE
`
`FRY-Dobj-OCTOPUS
`
`TABLE
`
`TRIPLE WEIGHTING
`
`LOGICAL FORM
`
`MATCHING
`
`800
`/
`
`FIG. 8A
`
`10
`10
`75
`100
`
`WEIGHT
`
`830
`
`Nodj
`Ops
`Dsub
`Dobj
`
`MATCHING TRIPLE
`
`RELATIONSHIP
`
`TYPE FOR
`
`HEART-Ops-MANY
`HAVE-Dobj-HEART
`HAVE-Dsub-OCTOPUS
`
`Page 10 of 34
`
`
`
`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`0 ...,
`~ ....
`'JJ. =(cid:173)~
`
`\0
`
`\0
`\0
`'"""'
`\0
`~~
`~
`~
`
`~ = .....
`~ .....
`~
`•
`r:JJ.
`d •
`
`PROCESSING
`NATURAL LANGUAGE
`STATISTICAL RETRIEVAL &
`DOCUMENT INDEXING,
`
`STATISTICAL RETRIEVAL
`DOCUMENT INDEXING &
`
`(SERVER}
`COMPUTER
`REMOTE
`\
`930
`
`(SE~~ER)
`COMPUTER
`REMOTE
`
`930
`
`\
`925
`
`DOCUMENTS
`
`{(
`
`DOCUMENTS
`RETRIEVED
`
`---STATISTICALLY 925
`
`QUERY ( ----
`FULL -TEXT (LITERAL)
`
`1........---._RANKED RETRIEVED
`QUERY 1 i----
`FULL-TEXT (LITERAL)
`
`I
`
`PC
`
`CLIENT
`J
`920
`
`PC
`
`CLIENT
`J
`920
`
`(e.g. PC)
`COMPUTER
`
`LOCAL
`
`J
`910
`
`PROCESSING
`& NATURAL LANGUAGE
`QUERY GENERATION
`
`PROCESSING
`& NATURAL LANGUAGE
`STATISTICAL RETRIEVAL,
`DOCUMENT INDEXING,
`
`FIG. 98
`
`FIG. 9A
`
`FIG. 9D
`
`QUERY GENERATION
`
`FIG. 9C
`
`960 PROCES~SING
`1
`I
`I
`I
`I
`I
`r----96-0~~--~--oocuMENTI
`
`I
`
`9602
`
`LANGUAGE
`NATURAL
`
`SERVER n
`
`:
`
`•
`
`SERVER 2 I RETRIEVAL
`STATISTICAL
`
`9601
`
`1
`
`J SERVER 1 I INDEXIN'-G'
`
`PROCESSOR .,_ •
`
`END
`FRONT(cid:173)
`940
`
`i COMPUTER
`1 REMOTE
`I
`
`----------
`R)
`
`n
`
`930 -
`
`Page 11 of 34
`
`
`
`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`0 ....,
`'"""' 0
`~ ....
`'JJ. =(cid:173)~
`
`\0
`\0
`'"""'
`\0
`
`t ~~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`I
`I
`I
`
`I
`
`I
`
`::.
`
`52
`
`1040
`
`\
`
`DUPLICATION
`
`SYSTEM
`
`MEDIA
`
`-1043
`
`I
`
`10052
`
`DUPLICATION
`
`I
`
`FIG. 10A
`
`INDEXING
`DOCUMENT
`
`I FIG. 10
`
`FIG. 108
`
`FIG. 10A
`
`-
`
`TRIPLES
`ASSOCIATED
`
`&
`
`~
`
`I 1035
`
`\
`
`DOCUMENTS~
`
`SET
`--------
`::-DATA
`-✓1 0 3 0
`
`I PROGRAM)
`1 USER INSTALLATION
`& RETRIEVAL PROCESS,
`( e.g. SEARCH ENGINE
`APPLICATION FILES
`
`1 DOCUMENT RETRIEVAL
`
`1
`
`DATASET)
`
`(MASTE: 7
`
`1015
`
`\
`
`COMPUTER
`
`ENGINE
`DOCUMENT INDEXING
`
`PROCESS 1100
`
`GENERATION
`
`TRIPLE
`
`MEMORY
`
`I
`
`10051
`
`INDEXED
`TO BE
`DOCUMENTS
`
`Page 12 of 34
`
`
`
`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`'"""' 0 ....,
`'"""'
`~ ....
`'JJ. =(cid:173)~
`
`\0
`\0
`'"""' \0
`~
`~
`~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`-
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`1
`
`DOCUMENT RETRIEVAL APPLICATION
`,, 1085
`1080
`r 1075
`
`I
`
`,,1060
`
`CD-ROMj
`
`(DISPLAYED)
`DOCUMENTS
`RANKED
`
`1070
`
`COMPUTER SYSTEM (PC)
`
`1050 3
`
`t
`
`USERj
`
`FIG. 108
`
`QUERY
`USER_
`
`MEDIA DUPLICATION
`
`10501
`
`\
`
`L _______________ _J
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I I
`I
`I
`I
`I
`1 APPLICATION PROGRAMS
`~-----------~---.
`MEMORY
`
`DOCUMENT SET
`.._ RETRIEVED
`
`1200
`PROCESS
`RETRIEVAL
`
`---
`
`'
`
`1190
`ENGINE
`SEARCH ~
`
`TRIPLES
`
`I '
`SET D
`
`-
`
`I
`
`I
`
`I
`
`I
`
`I
`
`DOCUMENTS
`
`1065
`
`DATA-
`~ \
`
`(
`
`1062
`
`& Lr10502
`
`[
`
`TRIPLES
`Gu] ASSOC.
`g
`DOCS
`
`INDEXED
`
`1-,~5
`
`.n.
`1050
`
`-
`
`I
`
`FILES
`APPL'N
`RETRIEVAL
`DOCUMENT
`
`• • •
`(e.g. CD-ROMS)
`REPLICAS 1050
`MEDIA
`
`I 1047 r
`
`I
`
`Page 13 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 12 of 14
`
`5,933,822
`
`FIG. 11
`TRIPLE
`GENERATION
`PROCESS
`1100
`
`ENTER
`(DOCUMENT TO BE INDEXED)
`CONVERT DOCUMENT TO TEXT;
`PERFORM SENTENCE BREAKING
`ON DOCUMENT; AND
`INVOKE NLP ROUTINE 1300
`TO YIELD LOGICAL FORM
`TRIPLES ASSOCIATED WITH
`THAT DOCUMENT AND TO
`STORE TRIPLES IN
`DATASET 1030
`
`1110
`
`EXIT
`
`FIG. 13A
`NLP ROUTINE 1300
`ENTER
`1310
`
`FIG. 138
`NLP ROUTINE 1350
`ENTER
`1360
`
`PROCESS INPUT TEXT, e.g. QUERY,
`TO YIELD LOGICAL FORM GRAPH
`
`PROCESS INPUT TEXT, e.g. QUERY,
`TO YIELD LOGICAL FORM GRAPH
`
`1320
`
`1370
`
`EXTRACT LOGICAL FORM
`TRIPLES FROM GRAPH
`
`EXTRACT LOGICAL FORM
`TRIPLES FROM GRAPH
`
`1330
`
`1380
`
`OUTPUT EACH LOGICAL FORM TRIPLE
`AS A DISTINCT FORMATTED STRING
`
`OUTPUT EACH LOGICAL FORM TRIPLE
`AS A DISTINCT FORMATTED STRING
`
`1340
`FOR CURRENT INPUT TEXT, STORE
`EACH LOGICAL FORM TRIPLE
`THEREFOR
`IN DATASET 1030
`
`1390
`
`FOR QUERY, STORE EACH LOGICAL
`FORM TRIPLE THEREFOR IN
`MEMORY 1075
`
`EXIT
`
`EXIT
`
`Page 14 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 13 of 14
`
`5,933,822
`
`~ENTER
`
`-
`
`OBTAIN FULL TEXT
`QUERY FROM
`USER
`
`'-1205
`.__
`
`FIG. 12A
`
`RETRIEVAL
`PROCESS
`1200
`
`'---1207
`1210
`
`.j,
`
`\
`
`1090
`I-SEARCH E~GINE ---7
`I . - - - - - - - -~
`~~~~y~►
`IN RESPONSE
`1215 I
`TO QUERY;
`RETRIEVE SET
`l
`OF DOCUMENT
`RECEIVE SET OF
`1230 1
`RECORDS AND
`DOCUMENT RECORDS ~_\_ 7-
`ASSOCIATED TRIPLES
`AND ASSOCIATED
`l
`L___ 1220 ___ _J
`TRIPLES FROM
`SEARCH ENGINE
`1090 AND
`STORE IN
`MEMORY 1075
`
`I
`
`I
`
`RETRIEVED DOCUMENT
`RECORDS AND ASSOCIATED
`DOCUMENT TRIPLES
`
`SEND FULL -TEXT
`QUERY TO SEARCH
`ENGINE
`1.-----,----...11
`
`~1243
`
`1245
`
`\
`
`j,
`
`INVOKE NLP
`ROUTINE 1350
`TO YIELD LOGICAL
`FORM TRIPLES
`FOR FULL-TEXT
`QUERY AND STORE
`IN MEMORY 1075
`
`COMPARE LOGICAL
`FORM TRIPLES FOR QUERY
`AGAINST LOGICAL
`FORM TRIPLES FOR
`EACH DOCUMENT IN SET
`TO LOCATE MATCHING
`LOGICAL FORM TRIPLE
`
`<----1250
`
`DISCARD ALL DOCUMENTS
`WHICH DO NOT CONTAIN
`ANY MATCHING LOGICAL
`FORM TRIPLES
`
`1.---1255
`
`••
`---------------
`
`Page 15 of 34
`
`
`
`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 14 of 14
`
`5,933,822
`
`'
`FOR EACH REMAINING
`DOCUMENT, ASSIGN WEIGHT
`TO EACH MATCHING LOGICAL
`FORM TRIPLE THEREIN; AND
`SUM WEIGHTS ACROSS EACH
`DOCUMENT, AS APPROPRIATE,
`TO OBTAIN WEIGHT FOR
`THAT DOCUMENT
`
`-1260
`
`'
`RANK DOCUMENTS IN TERMS
`OF DESCENDING ORDER OF
`THEIR WEIGHTS
`
`c---1265
`
`DISPLAY RETRIEVED DOCUMENTS
`IN RANKED ORDER
`
`-1270
`
`~ EXIT
`
`FIG. 128
`
`FIG.
`12A
`
`FIG.
`128
`FIG. 12
`
`Page 16 of 34
`
`
`
`5,933,822
`
`1
`APPARATUS AND METHODS FOR AN
`INFORMATION RETRIEVAL SYSTEM THAT
`EMPLOYS NATURAL LANGUAGE
`PROCESSING OF SEARCH RESULTS TO
`IMPROVE OVERALL PRECISION
`
`BACKGROUND OF THE DISCLOSURE
`
`1. Field of the Invention
`The invention relates to apparatus and accompanying
`methods for an information retrieval system that utilizes
`natural language processing to process results retrieved by,
`for example, an information retrieval engine such as a
`conventional statistical-based search engine, in order to
`improve overall precision.
`2. Description of the Prior Art
`Starting several decades ago and continuing to the
`present, automated information retrieval techniques have
`increasingly been used to retrieve stored information from a
`mass data store, such as a conventional database containing
`published materials and/or bibliographic information there(cid:173)
`for. Such a conventional database tends to be specialized in
`that it generally contains information directed to a particular,
`though broad-based topic, such as electrical engineering and
`computer related technology, as, e.g., in an INSPEC data(cid:173)
`base maintained by the Institute of Electrical and Electronic
`engineers (IEEE) and currently accessible through, e.g.,
`Dialog Information Services of Knight-Ridder Information.
`Inc. (DIALOG is a registered servicemark of Knight-Ridder
`Information, Inc.). While databases of this type certainly
`exhibit continuing growth as an increasing number of per(cid:173)
`tinent articles and other materials are published, the growth
`tends to be relatively moderate and reasonably well(cid:173)
`controlled. Also, such specialized databases tend to be rather
`well organized.
`However, with the advent and proliferation of the
`so-called "world-wide web" (hereinafter simply referred to
`as the "web") accessible through the Internet and the relative
`ease and low-cost associated with posting information to the
`web and accessing information therefrom as contrasted with
`traditional publishing, the amount of information available
`on the web manifests highly exponential, if not explosive,
`growth, with apparently no realistic limit in sight. While the
`web offers an increasingly rich array of information across
`all disciplines of human endeavor, information content on 45
`the web is highly chaotic and extremely disorganized, which
`severely complicates and often frustrates information access
`and retrieval therefrom.
`In an attempt to significantly ease the task of retrieving
`information from the web, a number of computerized search
`engines have been developed over the past few years for
`widespread public use. Generally speaking, these conven(cid:173)
`tional engines, through software-implemented "web
`crawlers", automatically visit web sites, and trace hypertext
`links therein, in seriatim and extract, abstract and index each
`document encountered therein, through so-called "key
`words", into a large database for subsequent access.
`Specifically, through such abstraction, each such document
`encountered by the crawler is reduced to what is commonly
`called a "bag of words" which contains content-bearing
`words that exist in the document, though stripped of all
`semantic and syntactic information. The content words may
`occur in the document itself and/or in just a description field
`of a hypertext-markup language (HTML) version of that
`document. In any event, the engine establishes an entry, i.e.,
`a document record, for each such document. For each
`document, each of its content words is indexed into a
`
`2
`searchable data structure with a link back to the document
`record. The document record typically contains: (a) a web
`address, i.e., a URL-uniform resource locator, through
`which the corresponding document can be accessed by a
`5 web browser; (b) various content words in that document,
`along with, in certain engines, a relative address of each such
`content word relative to other content words in that docu(cid:173)
`ment; (c) a short summary, often just a few lines, of the
`document or a first few lines of that document; and possibly
`10 ( d) the description of the document as provided in its HTML
`description field. To search the database, a user supplies the
`engine with a keyword based query. The query typically
`contains one or more user-supplied keywords, often just a
`small number, with, depending on the capabilities of the
`15 engine, possibly a Boolean (such as "AND" or "OR") or
`similar (such as a numeric proximity) operator situated
`between successive key words. In response to the query, the
`engine attempts to locate documents that contain as many of
`the keywords as possible, and, if a logical or proximity
`20 operator was provided, those key words in the specific
`combination requested or within a certain "range" (specified
`number of content words) of each other. In doing so, the
`engine searches through its database to locate documents
`that contain at least one word that matches one of the key
`25 words in the query and, where requested, according to the
`operator and/or range specified therewith. For each such
`document it finds, the engine retrieves the document record
`therefor and presents that record to the user ranked accord(cid:173)
`ing to a number of keyword matches in that document
`30 relative to those for the other such documents.
`Often, a great majority of documents retrieved solely in
`response to a user-supplied keyword query would be simply
`irrelevant to the query, thus frustrating the user.
`Consequently, to reduce the number of irrelevant docu-
`35 ments that are retrieved, conventional keyword based search
`engines (hereinafter referred to as simply "statistical search
`engines") incorporate statistical processing into their search
`methodologies. For example, based on a total number of
`matching key words between those in the query and the
`40 content words in each retrieved document record and how
`well these words match, i.e., in the combination and/or
`within a proximity range requested, a statistical search
`engine calculates numeric measures, collectively frequently
`referred to as "statistics", for each such document record
`retrieved. These statistics may include an inverse document
`frequency for each matching word. The engine then ranks
`the document records in terms of their statistics and returns
`to the user the document records for a small predefined
`number of retrieved records, typically 5-20 or less, that have
`50 the highest rankings. Once the user has reviewed a first
`group of document records ( or, for some engines, the
`documents themselves if they are returned by the engine) for
`a first group of retrieved documents, the user can then
`request a next group of document records having the next
`55 highest rankings, and so forth until all the retrieved docu(cid:173)
`ment records have been so reviewed.
`Traditionally, the performance of search engines has been
`assessed in terms of recall and precision. Recall measures, as
`a percentage of all relevant documents in a dataset, the
`60 number of such documents actually retrieved in response to
`a given query. Precision, on the other hand, measures, as a
`percentage of all documents retrieved, the number of those
`documents that are actually relevant to the query. We believe
`that in the context of a web search engine, recall is not an
`65 important metric of performance, inasmuch as the sheer
`number of documents ultimately retrieved is unimportant. In
`fact, for some queries, this number could be inordinately
`
`Page 17 of 34
`
`
`
`5,933,822
`
`10
`
`3
`large. Hence, we believe that not all relevant documents
`indexed by the engine need to be retrieved in order to
`produce a useful result; however, we believe that precision
`is extremely important, i.e., the documents that have the
`highest ranking and are presented first to a user should be
`those that are the most relevant to the query.
`The rather poor precision of conventional statistical
`search engines stems from their assumption that words are
`independent variables, i.e., words in any textual passage
`occur independently of each other. Independence in this
`context means that a conditional probability of any one word
`appearing in a document given the presence of another word
`therein is always zero, i.e., a document simply contains an
`unstructured collection of words or simply put a "bag of
`words". As one can readily appreciate, this assumption, with
`respect to any language, is grossly erroneous. English, like 15
`other languages, has a rich and complex syntactic and
`lexico-semantic structure with words whose meanings vary,
`often widely, based on the specific linguistic context in
`which they are used, with the context determining in any one
`instance a given meaning of a word and what word(s) can 20
`subsequently appear. Hence, words that appear in a textual
`passage are simply not independent of each other, rather they
`are highly inter-dependent. Keyword based search engines
`totally ignore this fine-grained linguistic structure. For
`example, consider an illustrative query expressed in natural 25
`language: "How many hearts does an octopus have?" A
`statistical search engine, operating on content words
`"hearts" and "octopus", or morphological stems thereof,
`might likely return or direct a user to a stored document that
`contains a recipe that has at its ingredients and hence its
`content words: "artichoke hearts, squid, onions and octo(cid:173)
`pus". This engine, given matches in the two content words
`"octopus" and "hearts", may determine, based on statistical
`measures, e.g. including proximity and logical operators,
`that this document is an excellent match, when, in reality, the 35
`document is quite irrelevant to the query.
`The art teaches various approaches for extracting ele(cid:173)
`ments of syntactic phrases as head-modifier pairs in unla(cid:173)
`beled relations. These elements are then indexed as terms
`(typically without internal structure) in a conventional sta(cid:173)
`tistical vector-space model.
`One example of such an approach is taught in J. L. Fagan,
`"Experiments in Automatic Phrase Indexing for Document
`Retrieval: A Comparison of Syntactic and Non-Syntactic
`Methods", Ph.D. Thesis, Cornell University, 1988, pages 45
`i-261. Specifically, this approach uses natural language
`processing to analyze English sentences and extract syntac-
`tic phrasal constituents elements wherein these phrasal con(cid:173)
`stituents are then treated as terms and indexed in an index
`using a statistical vector-space model. During retrieval, the
`user enters a query in natural language which, under this
`approach, is subjected to natural language processing for
`analysis and to extract elements of syntactic phrasal con(cid:173)
`stituents analogous to the elements stored in the index.
`Thereafter, attempts are made to match the elements of the
`syntactic phrasal constituents from the query to those stored
`in the index. The author contrasts this purely syntactic
`approach to a statistical approach, in which a stochastic
`method is used to identify elements within syntactic phrases.
`The author concludes that natural language processing does
`not yield substantial improvements over stochastic
`approaches, and that the small improvements in precision
`that natural language processing does sometimes produce do
`not justify the substantial processing cost associated with
`natural language processing.
`Another such syntactic based-approach is described, in
`the context of using natural language processing for select-
`
`4
`ing appropriate terms for inclusion within search queries, in
`T. Strzalkowski, "Natural Language Information Retrieval:
`TIPSTER-2 Final Report", Proceedings of Advances in Text
`Processing: Tipster Program Phase 2, DARPA, May 6-8,
`5 1996, Tysons Corner, Va., pages 143-148 (hereinafter the
`"DARPA paper"); and T. Strzalkowski, "Natural Language
`Information Retrieval", Information Processing and
`Management, Vol. 31, No. 3, 1995, pages 397-417. While
`this approach offers theoretical promise, the author on pages
`147-8 of the DARPA paper, concludes that, owing to the
`sophisticated processing required to implement the under(cid:173)
`lying natural language techniques, this approach is currently
`impractical:
`" ... [I]t is important to keep in mind that NLP [ natural
`language processing] techniques that meet our perfor(cid:173)
`mance requirements ( or at least are believed to be
`approaching these requirements) are still fairly unso-
`phisticated in their ability to handle natural language
`text. In particular, advanced processing involving con(cid:173)
`ceptual structuring, logical forms, etc. is still beyond
`reach, computationally. It may be assumed that these
`advanced techniques will prove even more effective,
`since they address the problem of representation-level
`limits; however, the experimental evidence is sparse
`and necessarily limited to rather small scale tests".
`A further syntactic-based approach of this sort is
`described in B. Katz, "Annotating the World Wide Web
`using Natural Language", Conference Proceedings of RIAO
`97, Computer-Assisted Information Searching in Internet,
`McGill University, Quebec, Canada, Jun. 25-27, 1997, Vol.
`30 1, pages 136-155 [hereinafter the "Katz publication"]. As
`described in the Katz publication, subject-verb-object
`expressions are created while preserving the internal struc(cid:173)
`ture so that during retrieval minor syntactic alternations can
`be accommodated.
`Because these syntactic approaches have yielded lacklus-
`ter improvements or have not been feasible to implement in
`natural language processing systems available at the time,
`the field has moved away from attempting to directly
`improve the precision and recall of the initial results of query
`40 to improvements in the user interface, i.e. specifically
`through methods for refining the query based on interaction
`with the user, such as through "find-similar" user responses
`to a retrieved result, and methods for visualizing the results
`of a query including displaying results in appropriate clus(cid:173)
`ters.
`While these improvements are useful in their own right,
`the added precision attainable through these improvements
`is still disappointingly low, and certainly insufficient to
`drastically reduce user frust