throbber
United States Patent [19J
`Braden-Harder et al.
`
`I 1111111111111111 11111 lllll lllll lllll 111111111111111 lllll 111111111111111111
`US005933822A
`[11] Patent Number:
`[45] Date of Patent:
`
`5,933,822
`Aug. 3, 1999
`
`[54] APPARATUS AND METHODS FOR AN
`INFORMATION RETRIEVAL SYSTEM THAT
`EMPLOYS NATURAL LANGUAGE
`PROCESSING OF SEARCH RESULTS TO
`IMPROVE OVERALL PRECISION
`
`R. Pohlmann et al, "The Effect of Syntactic Phrase Indexing
`on Retrieval Performance for Dutch Tests", Conference
`Proceedings of RIAO 97, Computer-Assisted Information
`Searching in Internet, McGill University, Quebec, Canada,
`Jun. 25-27, 1997, vol. 1, pp. 176-187.
`
`[75]
`
`Inventors: Lisa Braden-Harder, Reston, Va.;
`Simon H. Corston, Seattle, Wash.;
`William B. Dolan, Redmond, Wash.;
`Lucy H. Vanderwende, Bellevue,
`Wash.
`
`[73] Assignee: Microsoft Corporation, Redmond,
`Wash.
`
`[21] Appl. No.: 08/898,652
`Jul. 22, 1997
`
`[22] Filed:
`
`Int. Cl.6
`...................................................... G06F 17/00
`[51]
`[52] U.S. Cl. ......................................... 707/5; 707/3; 707/4
`[58] Field of Search .................. 707/5, 4, 3; 364/419.19;
`395/708
`
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`5,278,980
`5,544,049
`5,642,502
`5,671,404
`5,694,592
`5,706,497
`5,724,571
`5,794,050
`5,826,261
`
`1/1994 Pederson et al. ........................... 707/4
`8/1996 Henderson et al.
`............... 364/419.19
`6/1997 Driscoll ....................................... 707/5
`9/1997 Lizee et al. ................................. 707/5
`12/1997 Driscoll . ... ... ... .... ... ... ... ... ... .... ... ... 707 /3
`1/1998 Takahashi et al. .......................... 707/5
`3/1998 Woods
`........................................ 707/5
`8/1998 Dahlgren et al. . ... ... ... ... .... ... ... 395 /708
`10/1998 Spencer ....................................... 707/5
`
`OIBER PUBLICATIONS
`
`B. Katz, "Annotating the World Wide Web using Natural
`Language", Conference Proceedings of RIAO 97, Comput(cid:173)
`er-Assisted Information Searching in Internet, McGill Uni(cid:173)
`versity, Quebec, Canada, Jun. 25-27 1997, vol. 1, pp.
`136-155.
`A T. Arampatzis et al, "IRENA: Information Retrieval
`Engine based on Natural language Analysis", Conference
`Proceedings of RIAO 97, Computer-Assisted Information
`Searching in Internet, McGill University, Quebec, Canada,
`Jun. 25-27, 1997, vol. 1, pp. 159-175.
`
`(List continued on next page.)
`
`Primary Examiner-Wayne Amsbury
`Assistant Examiner-Thuy Pardo
`Attorney, Agent, or Firm-Michaelson & Wallace; Peter L.
`Michaelson
`
`[57]
`
`ABSTRACT
`
`Apparatus and accompanying methods for an information
`retrieval system that utilizes natural language processing to
`process results retrieved by, for example, an information
`retrieval engine such as a conventional statistical-based
`search engine, in order to improve overall precision.
`Specifically, such a search ultimately yields a set of retrieved
`documents. Each such document is then subjected to natural
`language processing to produce a set of logical forms. Each
`such logical form encodes, in a word-relation-word manner,
`semantic relationships, particularly argument and adjunct
`structure, between words in a phrase. A user-supplied query
`is analyzed in the same manner to yield a set of correspond(cid:173)
`ing logical forms therefor. Documents are ranked as a
`predefined function of the logical forms from the documents
`and the query. Specifically, the set of logical forms for the
`query is then compared against a set of logical forms for
`each of the retrieved documents in order to ascertain a match
`between any such logical forms in both sets. Each document
`that has at least one matching logical forms is heuristically
`scored, with each different relation for a matching logical
`forms being assigned a different corresponding predefined
`weight. The score of each such document is, e.g., a pre(cid:173)
`defined function of the weights of its uniquely matching
`logical forms. Finally, the retained documents are ranked in
`order of descending score and then presented to a user in that
`order.
`
`123 Claims, 14 Drawing Sheets
`
`Page 1 of 34
`
`GOOGLE EXHIBIT 1020
`
`

`

`5,933,822
`Page 2
`
`OIBER PUBLICATIONS
`
`M. Mitra et al, "An Analysis of Statistical and Syntactic
`Phrases", Conference Proceedings of RIAO
`97,
`Computer-Assisted Information Searching
`in Internet,
`McGill University, Quebec, Canada, Jun. 25-27, 1997, vol.
`1, pp. 200-214.
`P. Bruza et al, "Query ReFormulation on the Internet:
`Empirical Data and the Hyperindex Search Engine", Con(cid:173)
`ference Proceedings of RIAO 97, Computer-Assisted Infor(cid:173)
`mation Searching in Internet, McGill University, Quebec,
`Canada, Jun. 25-27, 1997, vol. 1, pp. 488-499.
`G. Grefenstette, "SQLET: Short Query Linguistic Expan(cid:173)
`sion Techniques, Palliating One-Word Queries by Providing
`Intermediate Structure to Text", Conference Proceedings of
`RIAO 97, Computer-Assisted Information Searching in
`Internet, McGill University, Quebec, Canada, Jun. 25-27,
`1997, vol. 1, pp. 500--509.
`R. Chandrasekar et al, "Using Syntactic Information in
`of
`Document
`Filtering: A Comparative
`Study
`Part-of-Speech Tagging and Supertagging", Conference
`Proceedings of RIAO 97, Computer-Assisted Information
`Searching in Internet, McGill University, Quebec, Canada,
`Jun. 25-27, 1997, vol. 1, pp. 531-545.
`M.A. Hearst, "TextTiling: Segmenting Text into Multi-para(cid:173)
`graph Subtopic Passages", Computational Linguistics, vol.
`23, No. 1, 1997, pp. 33-64.
`
`0. Etzoni, "The World-Wide Web: Quagmire or Gold
`Mine", Communications of the ACM, Nov. 1996, vol. 39,
`No. 11, pp. 65-68.
`
`T. Strzalkowski, "Natural Language Information Retrieval:
`TIPSTER-2 Final Report", Proceedings of Advances in Text
`Processing: Tipster Program Phase 2, DARPA, May 6-8,
`1996, Tysons Corner, Virginia, pp. 143-148.
`
`D. D. Lewis et al, "Natural language Processing for Infor(cid:173)
`mation Retrieval", Communications of the A CM, Jan. 1996,
`vol. 39, No. 1, pp. 92-101.
`
`T. Strzalkowski, "Natural Language Information Retrieval",
`Information Processing and Management, vol. 31, No. 3,
`1995, pp. 397-417.
`
`K. Jensen et al (editors), Natural Language Processing: The
`PLNLP Approach (© 1993, Kluwer Academic Publishers),
`Chapter 3 "PEG: The PLNLP English Grammar", pp. 29-45
`and Chapter 16 "PEGASUS: Deriving Argument Structures
`after Syntax", pp. 203-214.
`
`J. L. Fagan, "Experiments in Automatic Phrase Indexing for
`Document Retrieval: A Comparison of Syntactic and
`Non-Syntactic Methods", Ph.D. Thesis, Cornell University,
`1988, pp. i-261.
`
`Page 2 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 1 of 14
`
`5,933,822
`
`QUERY
`(FULL - TEXT --~----.l----f-_J__ ___ _
`RETRIEVED
`FORM)
`
`DATA-
`
`SET
`STORED
`DOCUMENTS
`
`r---RE_TR--1....IE-VA_L_oo;MENTS NATURAL
`ENGINE
`LANGUAGE
`t - - - -~
`1 O
`PROCESSOR
`
`25
`
`20
`
`30
`
`RANKED
`._-----. RETRIEVED
`DOCUMENTS
`35
`FIG. 1
`
`220
`
`J
`
`222
`
`__,225
`SEARCH
`ENGINE
`COMPUTER
`
`DATA(cid:173)
`SET
`
`SERVER
`
`227
`
`FIG. 2
`
`201
`
`300
`
`200
`
`QUERY
`
`RANKED
`DOCUMENTS
`
`420
`
`WEB
`BROWSER
`COMPUTER
`SYSTEM
`(CLIENT PC)
`
`205
`
`NETWORK 21 O
`( e.g. INTERNET)
`
`215
`
`203
`
`QUERY
`(FROM USER)
`
`-
`STATISTICALLY
`RETRIEVED
`DOCUMENT
`RECORDS
`(FROM SEARCH
`ENGINE)
`
`RETRIEVED
`DOCUMENTS
`
`·---
`400
`
`-,.
`
`-,..
`
`-
`
`\
`422
`
`4~2
`
`\
`442
`
`/420
`
`RETRIEVAL
`PROCESS
`
`600
`
`WEB BROWSER
`
`APPLICATION PROGRAMS
`
`\
`426
`
`)
`436
`
`)
`446
`
`-
`,.
`
`.
`
`-
`
`-
`
`QUERY
`(TO SEARCH
`ENGINE)
`-
`COMMAND TO
`DOWNLOAD
`DOCUMENT FOR
`EACH RETRIEVED
`RECORD
`RANKED
`RETRIEVED
`DOCUMENTS
`
`'---
`
`FIG. 4
`
`Page 3 of 34
`
`

`

`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`0 ....,
`N
`~ ....
`rF.J. =(cid:173)~
`
`\0
`\0
`'"""' \0
`~~
`~
`~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`SELECTIONS
`COMMANDS &
`
`USER
`
`MOUSE
`
`KEYBOARD,
`
`DEVICE
`-INPUT
`USER
`390
`
`I
`
`COMPUTER SYSTEM
`
`(PC CLIENT)
`
`FIG. 3
`
`r
`395
`
`400
`\
`
`375
`...,
`
`PROGRAMS
`APPLICATION
`
`300/
`
`' 330
`
`3~5
`
`. \ -f PRINTER
`367
`
`I
`
`380
`)
`
`\ ; l DISPLAY
`363
`
`1/F
`OUTPUT
`
`~
`360
`
`fl
`378
`
`1/F
`COMM
`
`o/s
`
`I
`
`MEMORY
`
`Jl
`
`lf
`
`PROCESSOR
`
`3?0,
`
`3f0
`
`31 o.
`
`I
`
`370
`\
`
`1/F
`INPUT
`
`205
`
`-
`
`DEDICATED
`
`SOURCES
`
`INPUT
`
`ET)
`
`( e.g. INTRA-N
`
`ACCESS
`NETWORK
`OTHER
`AND/OR
`INTERNET
`
`Page 4 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 3 of 14
`
`5,933,822
`
`510
`~ INPUT STRING:
`
`THE OCTOPUS HAS THREE HEARTS.
`
`LOGICAL FORM
`GRAPH:
`
`LOGICAL FORM
`TRIPLES:
`
`H~E
`Dsub-OCTOPUS
`Dobj -
`HEART
`L__ Ops -
`
`_,,,---515
`
`THREE
`
`HAVE- Dsub-OCTOPUS}
`HAVE - Dobj - HEART
`HEART- Ops - THREE
`FIG. 5A
`
`525
`
`530
`"-... INPUT STRING:
`
`LOGICAL FORM
`GRAPH:
`
`THE OCTOPUS HAS THREE HEARTS AND TWO LUNGS.
`HAVE t= Dsub- OCTOPUS
`Dobj -AND
`I
`Crds - - HEART
`
`/535
`
`LOGICAL FORM
`TRIPLES:
`
`LO~s-THREE
`
`LUNG
`I
`Ops-TWO
`
`HAVE - Dsub - OCTOPUS
`HAVE - Dobj - HEART
`HAVE - Dobj -
`LUNG
`HEART- Ops -
`THREE
`LUNG - Ops -
`TWO
`FIG. 5B
`
`540
`
`Page 5 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 4 of 14
`
`5,933,822
`
`55o"'- INPUT STRING:
`
`THE OCTOPUS HAS THREE HEARTS AND
`IT CAN SWIM.
`
`LOGICAL FORM
`GRAPH:
`
`AND
`Lcrds
`
`HAVE
`t=osub- OCTOPUS
`Dobj - HEART
`
`/555
`
`LOGICAL FORM
`TRIPLES:
`
`Ops-THREE
`
`SWIM
`L_Dsub-lT
`~-Refs- OCTOPUS
`HAVE - Dsub -
`OCTOPUS
`HAVE - Do~
`-
`HEART
`HEART - Ops
`-
`THREE
`SWIM - Dsub -
`IT
`SWIM - Dsub -
`OCTOPUS
`FIG. 5C
`
`560
`
`570
`
`"'- INPUT STRING:
`
`I LIKE SHARK FIN SOUP BOWLS.
`
`LOGICAL FORM
`GRAPH:
`
`LOGICAL FORM
`TRIPLES:
`
`/
`
`575
`
`LIKE
`t= □sub - I
`Dobj - BOWL
`'-----Mods-SHARK
`~FIN
`L_SOUP
`
`I
`BOWL
`SHARK
`FIN
`SOUP
`SHARK
`SHARK
`FIN
`
`LIKE
`- Dsub -
`LIKE
`- Dobj -
`BOWL - Mods -
`BOWL - Mods -
`BOWL - Mods -
`FIN
`- Mods -
`SOUP - Mods -
`SOUP - Mods -
`FIG. 5D
`
`580
`
`Page 6 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 5 of 14
`
`5,933,822
`
`~ ENTER
`
`OBTAIN FULL TEXT
`QUERY FROM
`USER
`
`RETRIEVAL
`PROCESS
`600
`
`._ 605
`
`'- 607
`
`225
`jS-EARCH _L_ -
`QUERY
`ENGINE
`'"- -1- - -t- ► RETRIEVE
`615
`SET OF
`I
`DOCUMENT
`RECORDS
`IN
`I
`ACCESS AND DOWNLOAD L L
`630
`RESPONSE TO
`1-c _\_ _ 7 _ _ QUERY
`- -
`- -
`- -
`RETRIEVED DOCUMENT
`RECORDS
`
`- -7
`,
`I
`
`625
`
`1
`
`I
`
`I
`
`I
`,
`~
`
`-
`
`-
`
`1---- 643
`
`SEND FULL-TEXT
`610--- QUERY TO SEARCH
`ENGINE
`
`RECEIVE SET OF
`DOCUMENT RECORDS
`___ FROM SEARCH ENGINE;
`635
`
`EACH CORRESPONDING
`DOCUMENT
`
`I
`
`1
`
`I
`
`•
`
`, 645
`
`INVOKE NLP
`ROUTINE 700
`TO YIELD LOGICAL
`FORM TRIPLES
`FOR FULL - TEXT
`QUERY AND STORE
`IN LOCAL DATABASE
`
`640
`,-
`CONVERT EACH DOCUMENT TO TEXT;
`PERFORM SENTENCE BREAKING
`ON EACH DOCUMENT; AND
`INVOKE NLP ROUTINE 700
`FOR EACH DOCUMENT TO
`YIELD LOGICAL FORM TRIPLES
`THEREFOR AND STORE IN
`DATASET
`
`I
`
`L---650
`
`COMPARE LOGICAL
`FORM TRIPLES FOR QUERY
`AGAINST LOGICAL
`FORM TRIPLES FOR
`EACH DOCUMENT IN SET
`TO LOCATE MATCHING
`LOGICAL FORM TRIPLES
`
`DISCARD ALL DOCUMENTS
`WHICH DO NOT CONTAIN
`ANY MATCHING LOGICAL
`FORM TRIPLES
`
`_________ J --------
`
`FIG. 6A
`
`Page 7 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 6 of 14
`
`5,933,822
`
`'
`FOR EACH REMAINING
`DOCUMENT, ASSIGN WEIGHT
`TO EACH MATCHING LOGICAL
`FORM TRIPLE THEREIN; AND
`SUM WEIGHTS ACROSS EACH
`DOCUMENT, AS APPROPRIATE,
`TO OBTAIN WEIGHT FOR
`THAT DOCUMENT
`
`-- 6 60
`
`RANK DOCUMENTS IN TERMS
`OF DESCENDING ORDER OF
`THEIR WEIGHTS
`
`c- 665
`
`DISPLAY RETRIEVED DOCUMENTS
`IN RANKED ORDER
`
`~ EXIT
`FIG. 68
`
`FIG.
`6A
`
`FIG.
`68
`FIG. 6
`
`Page 8 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 7 of 14
`
`5,933,822
`
`NLP ROUTINE
`700
`
`ENTER
`
`PROCESS INPUT TEXT TO YIELD LOGICAL
`FORM GRAPH
`
`i.-710
`
`EXTRACT LOGICAL FORM TRIPLES FROM GRAPH
`
`OUTPUT EACH LOGICAL FORM TRIPLE
`AS A DISTINCT FORMATTED STRING
`
`-730
`
`I,
`
`FOR INPUT TEXT, STORE EACH LOGICAL
`FORM TRIPLE THEREFOR ALONG WITH ITS
`ASSOCIATED DOCUMENT AND SENTENCE
`PROVENANCE (i.e. SOURCE LOCATION)
`IN A DATASET
`
`'
`
`EXIT
`
`FIG. 7
`
`Page 9 of 34
`
`

`

`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`0 ....,
`00
`~ ....
`'JJ. =(cid:173)~
`
`\0
`\0
`'"""' \0
`~~
`~
`~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`>~,,,..--DOCUMEN~'4,.845
`
`QUERY
`
`~-------LOGICAL FORM TRIPLES -----------------...
`
`DOCUMENT 3: ARTICLE ABOUT DEER
`DOCUMENT 2: ARTICLE ABOUT OCTOPI
`OCUMENT 1: RECIPE CONTAINING ARTICHOKE HEARTS AND OCTOPUS
`
`iNLP
`
`!D
`
`DOCUMENTS
`RETRIEVED
`STATISTICALLY
`
`820
`
`QUERY: HOW MANY HEARTS DOES AN OCTOPUS HAVE?
`
`810~
`
`✓ HAVE-Dobj-HEART
`HAVE-Dsub-DEER
`DOCUMENT .f, ....
`~-------------------------------~ 865
`
`FIG. 88
`
`TOTAL DOCUMENT
`
`WEIGHT=100
`
`~ /
`
`RANKING
`
`WEIGHT=100+ 75= 175
`
`TOTAL DOCUMENT
`
`ORDER OF DOCUMENT
`
`DOCUMENT 3
`DOCUMENT 2;
`PRESENTATION:
`
`t
`WEIGHT
`TOTAL
`BY
`
`WITH QUERY
`
`1 MATCH
`
`TRIPLES
`
`860
`)
`
`WITH QUERY
`2 MATCHES
`
`TRIPLES
`
`850
`)
`
`HAVE-Dsub-OCTOPUS
`✓HAVE-Dobj-HEART
`✓HAVE-Dsub-OCTOPUS
`DOCUMENT 2""~55
`
`DOCUMENT 1
`
`DISCARD
`
`TRIPLES
`
`WITH QUERY
`
`)I NO MATCH
`
`840
`
`HEART-Nadj-ARTICHOKE
`
`FRY-Dobj-OCTOPUS
`
`TABLE
`
`TRIPLE WEIGHTING
`
`LOGICAL FORM
`
`MATCHING
`
`800
`/
`
`FIG. 8A
`
`10
`10
`75
`100
`
`WEIGHT
`
`830
`
`Nodj
`Ops
`Dsub
`Dobj
`
`MATCHING TRIPLE
`
`RELATIONSHIP
`
`TYPE FOR
`
`HEART-Ops-MANY
`HAVE-Dobj-HEART
`HAVE-Dsub-OCTOPUS
`
`Page 10 of 34
`
`

`

`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`0 ...,
`~ ....
`'JJ. =(cid:173)~
`
`\0
`
`\0
`\0
`'"""'
`\0
`~~
`~
`~
`
`~ = .....
`~ .....
`~
`•
`r:JJ.
`d •
`
`PROCESSING
`NATURAL LANGUAGE
`STATISTICAL RETRIEVAL &
`DOCUMENT INDEXING,
`
`STATISTICAL RETRIEVAL
`DOCUMENT INDEXING &
`
`(SERVER}
`COMPUTER
`REMOTE
`\
`930
`
`(SE~~ER)
`COMPUTER
`REMOTE
`
`930
`
`\
`925
`
`DOCUMENTS
`
`{(
`
`DOCUMENTS
`RETRIEVED
`
`---STATISTICALLY 925
`
`QUERY ( ----
`FULL -TEXT (LITERAL)
`
`1........---._RANKED RETRIEVED
`QUERY 1 i----
`FULL-TEXT (LITERAL)
`
`I
`
`PC
`
`CLIENT
`J
`920
`
`PC
`
`CLIENT
`J
`920
`
`(e.g. PC)
`COMPUTER
`
`LOCAL
`
`J
`910
`
`PROCESSING
`& NATURAL LANGUAGE
`QUERY GENERATION
`
`PROCESSING
`& NATURAL LANGUAGE
`STATISTICAL RETRIEVAL,
`DOCUMENT INDEXING,
`
`FIG. 98
`
`FIG. 9A
`
`FIG. 9D
`
`QUERY GENERATION
`
`FIG. 9C
`
`960 PROCES~SING
`1
`I
`I
`I
`I
`I
`r----96-0~~--~--oocuMENTI
`
`I
`
`9602
`
`LANGUAGE
`NATURAL
`
`SERVER n
`
`:
`
`•
`
`SERVER 2 I RETRIEVAL
`STATISTICAL
`
`9601
`
`1
`
`J SERVER 1 I INDEXIN'-G'
`
`PROCESSOR .,_ •
`
`END
`FRONT(cid:173)
`940
`
`i COMPUTER
`1 REMOTE
`I
`
`----------
`R)
`
`n
`
`930 -
`
`Page 11 of 34
`
`

`

`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`0 ....,
`'"""' 0
`~ ....
`'JJ. =(cid:173)~
`
`\0
`\0
`'"""'
`\0
`
`t ~~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`I
`I
`I
`
`I
`
`I
`
`::.
`
`52
`
`1040
`
`\
`
`DUPLICATION
`
`SYSTEM
`
`MEDIA
`
`-1043
`
`I
`
`10052
`
`DUPLICATION
`
`I
`
`FIG. 10A
`
`INDEXING
`DOCUMENT
`
`I FIG. 10
`
`FIG. 108
`
`FIG. 10A
`
`-
`
`TRIPLES
`ASSOCIATED
`
`&
`
`~
`
`I 1035
`
`\
`
`DOCUMENTS~
`
`SET
`--------
`::-DATA
`-✓1 0 3 0
`
`I PROGRAM)
`1 USER INSTALLATION
`& RETRIEVAL PROCESS,
`( e.g. SEARCH ENGINE
`APPLICATION FILES
`
`1 DOCUMENT RETRIEVAL
`
`1
`
`DATASET)
`
`(MASTE: 7
`
`1015
`
`\
`
`COMPUTER
`
`ENGINE
`DOCUMENT INDEXING
`
`PROCESS 1100
`
`GENERATION
`
`TRIPLE
`
`MEMORY
`
`I
`
`10051
`
`INDEXED
`TO BE
`DOCUMENTS
`
`Page 12 of 34
`
`

`

`N
`N
`00
`~ ....
`~
`....
`Ul
`
`'"""'
`,i;;..
`'"""' 0 ....,
`'"""'
`~ ....
`'JJ. =(cid:173)~
`
`\0
`\0
`'"""' \0
`~
`~
`~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`-
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`I
`
`1
`
`DOCUMENT RETRIEVAL APPLICATION
`,, 1085
`1080
`r 1075
`
`I
`
`,,1060
`
`CD-ROMj
`
`(DISPLAYED)
`DOCUMENTS
`RANKED
`
`1070
`
`COMPUTER SYSTEM (PC)
`
`1050 3
`
`t
`
`USERj
`
`FIG. 108
`
`QUERY
`USER_
`
`MEDIA DUPLICATION
`
`10501
`
`\
`
`L _______________ _J
`I
`I
`I
`I
`I
`I
`I
`I
`I
`I I
`I
`I
`I
`I
`1 APPLICATION PROGRAMS
`~-----------~---.
`MEMORY
`
`DOCUMENT SET
`.._ RETRIEVED
`
`1200
`PROCESS
`RETRIEVAL
`
`---
`
`'
`
`1190
`ENGINE
`SEARCH ~
`
`TRIPLES
`
`I '
`SET D
`
`-
`
`I
`
`I
`
`I
`
`I
`
`I
`
`DOCUMENTS
`
`1065
`
`DATA-
`~ \
`
`(
`
`1062
`
`& Lr10502
`
`[
`
`TRIPLES
`Gu] ASSOC.
`g
`DOCS
`
`INDEXED
`
`1-,~5
`
`.n.
`1050
`
`-
`
`I
`
`FILES
`APPL'N
`RETRIEVAL
`DOCUMENT
`
`• • •
`(e.g. CD-ROMS)
`REPLICAS 1050
`MEDIA
`
`I 1047 r
`
`I
`
`Page 13 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 12 of 14
`
`5,933,822
`
`FIG. 11
`TRIPLE
`GENERATION
`PROCESS
`1100
`
`ENTER
`(DOCUMENT TO BE INDEXED)
`CONVERT DOCUMENT TO TEXT;
`PERFORM SENTENCE BREAKING
`ON DOCUMENT; AND
`INVOKE NLP ROUTINE 1300
`TO YIELD LOGICAL FORM
`TRIPLES ASSOCIATED WITH
`THAT DOCUMENT AND TO
`STORE TRIPLES IN
`DATASET 1030
`
`1110
`
`EXIT
`
`FIG. 13A
`NLP ROUTINE 1300
`ENTER
`1310
`
`FIG. 138
`NLP ROUTINE 1350
`ENTER
`1360
`
`PROCESS INPUT TEXT, e.g. QUERY,
`TO YIELD LOGICAL FORM GRAPH
`
`PROCESS INPUT TEXT, e.g. QUERY,
`TO YIELD LOGICAL FORM GRAPH
`
`1320
`
`1370
`
`EXTRACT LOGICAL FORM
`TRIPLES FROM GRAPH
`
`EXTRACT LOGICAL FORM
`TRIPLES FROM GRAPH
`
`1330
`
`1380
`
`OUTPUT EACH LOGICAL FORM TRIPLE
`AS A DISTINCT FORMATTED STRING
`
`OUTPUT EACH LOGICAL FORM TRIPLE
`AS A DISTINCT FORMATTED STRING
`
`1340
`FOR CURRENT INPUT TEXT, STORE
`EACH LOGICAL FORM TRIPLE
`THEREFOR
`IN DATASET 1030
`
`1390
`
`FOR QUERY, STORE EACH LOGICAL
`FORM TRIPLE THEREFOR IN
`MEMORY 1075
`
`EXIT
`
`EXIT
`
`Page 14 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 13 of 14
`
`5,933,822
`
`~ENTER
`
`-
`
`OBTAIN FULL TEXT
`QUERY FROM
`USER
`
`'-1205
`.__
`
`FIG. 12A
`
`RETRIEVAL
`PROCESS
`1200
`
`'---1207
`1210
`
`.j,
`
`\
`
`1090
`I-SEARCH E~GINE ---7
`I . - - - - - - - -~
`~~~~y~►
`IN RESPONSE
`1215 I
`TO QUERY;
`RETRIEVE SET
`l
`OF DOCUMENT
`RECEIVE SET OF
`1230 1
`RECORDS AND
`DOCUMENT RECORDS ~_\_ 7-
`ASSOCIATED TRIPLES
`AND ASSOCIATED
`l
`L___ 1220 ___ _J
`TRIPLES FROM
`SEARCH ENGINE
`1090 AND
`STORE IN
`MEMORY 1075
`
`I
`
`I
`
`RETRIEVED DOCUMENT
`RECORDS AND ASSOCIATED
`DOCUMENT TRIPLES
`
`SEND FULL -TEXT
`QUERY TO SEARCH
`ENGINE
`1.-----,----...11
`
`~1243
`
`1245
`
`\
`
`j,
`
`INVOKE NLP
`ROUTINE 1350
`TO YIELD LOGICAL
`FORM TRIPLES
`FOR FULL-TEXT
`QUERY AND STORE
`IN MEMORY 1075
`
`COMPARE LOGICAL
`FORM TRIPLES FOR QUERY
`AGAINST LOGICAL
`FORM TRIPLES FOR
`EACH DOCUMENT IN SET
`TO LOCATE MATCHING
`LOGICAL FORM TRIPLE
`
`<----1250
`
`DISCARD ALL DOCUMENTS
`WHICH DO NOT CONTAIN
`ANY MATCHING LOGICAL
`FORM TRIPLES
`
`1.---1255
`
`••
`---------------
`
`Page 15 of 34
`
`

`

`U.S. Patent
`
`Aug. 3, 1999
`
`Sheet 14 of 14
`
`5,933,822
`
`'
`FOR EACH REMAINING
`DOCUMENT, ASSIGN WEIGHT
`TO EACH MATCHING LOGICAL
`FORM TRIPLE THEREIN; AND
`SUM WEIGHTS ACROSS EACH
`DOCUMENT, AS APPROPRIATE,
`TO OBTAIN WEIGHT FOR
`THAT DOCUMENT
`
`-1260
`
`'
`RANK DOCUMENTS IN TERMS
`OF DESCENDING ORDER OF
`THEIR WEIGHTS
`
`c---1265
`
`DISPLAY RETRIEVED DOCUMENTS
`IN RANKED ORDER
`
`-1270
`
`~ EXIT
`
`FIG. 128
`
`FIG.
`12A
`
`FIG.
`128
`FIG. 12
`
`Page 16 of 34
`
`

`

`5,933,822
`
`1
`APPARATUS AND METHODS FOR AN
`INFORMATION RETRIEVAL SYSTEM THAT
`EMPLOYS NATURAL LANGUAGE
`PROCESSING OF SEARCH RESULTS TO
`IMPROVE OVERALL PRECISION
`
`BACKGROUND OF THE DISCLOSURE
`
`1. Field of the Invention
`The invention relates to apparatus and accompanying
`methods for an information retrieval system that utilizes
`natural language processing to process results retrieved by,
`for example, an information retrieval engine such as a
`conventional statistical-based search engine, in order to
`improve overall precision.
`2. Description of the Prior Art
`Starting several decades ago and continuing to the
`present, automated information retrieval techniques have
`increasingly been used to retrieve stored information from a
`mass data store, such as a conventional database containing
`published materials and/or bibliographic information there(cid:173)
`for. Such a conventional database tends to be specialized in
`that it generally contains information directed to a particular,
`though broad-based topic, such as electrical engineering and
`computer related technology, as, e.g., in an INSPEC data(cid:173)
`base maintained by the Institute of Electrical and Electronic
`engineers (IEEE) and currently accessible through, e.g.,
`Dialog Information Services of Knight-Ridder Information.
`Inc. (DIALOG is a registered servicemark of Knight-Ridder
`Information, Inc.). While databases of this type certainly
`exhibit continuing growth as an increasing number of per(cid:173)
`tinent articles and other materials are published, the growth
`tends to be relatively moderate and reasonably well(cid:173)
`controlled. Also, such specialized databases tend to be rather
`well organized.
`However, with the advent and proliferation of the
`so-called "world-wide web" (hereinafter simply referred to
`as the "web") accessible through the Internet and the relative
`ease and low-cost associated with posting information to the
`web and accessing information therefrom as contrasted with
`traditional publishing, the amount of information available
`on the web manifests highly exponential, if not explosive,
`growth, with apparently no realistic limit in sight. While the
`web offers an increasingly rich array of information across
`all disciplines of human endeavor, information content on 45
`the web is highly chaotic and extremely disorganized, which
`severely complicates and often frustrates information access
`and retrieval therefrom.
`In an attempt to significantly ease the task of retrieving
`information from the web, a number of computerized search
`engines have been developed over the past few years for
`widespread public use. Generally speaking, these conven(cid:173)
`tional engines, through software-implemented "web
`crawlers", automatically visit web sites, and trace hypertext
`links therein, in seriatim and extract, abstract and index each
`document encountered therein, through so-called "key
`words", into a large database for subsequent access.
`Specifically, through such abstraction, each such document
`encountered by the crawler is reduced to what is commonly
`called a "bag of words" which contains content-bearing
`words that exist in the document, though stripped of all
`semantic and syntactic information. The content words may
`occur in the document itself and/or in just a description field
`of a hypertext-markup language (HTML) version of that
`document. In any event, the engine establishes an entry, i.e.,
`a document record, for each such document. For each
`document, each of its content words is indexed into a
`
`2
`searchable data structure with a link back to the document
`record. The document record typically contains: (a) a web
`address, i.e., a URL-uniform resource locator, through
`which the corresponding document can be accessed by a
`5 web browser; (b) various content words in that document,
`along with, in certain engines, a relative address of each such
`content word relative to other content words in that docu(cid:173)
`ment; (c) a short summary, often just a few lines, of the
`document or a first few lines of that document; and possibly
`10 ( d) the description of the document as provided in its HTML
`description field. To search the database, a user supplies the
`engine with a keyword based query. The query typically
`contains one or more user-supplied keywords, often just a
`small number, with, depending on the capabilities of the
`15 engine, possibly a Boolean (such as "AND" or "OR") or
`similar (such as a numeric proximity) operator situated
`between successive key words. In response to the query, the
`engine attempts to locate documents that contain as many of
`the keywords as possible, and, if a logical or proximity
`20 operator was provided, those key words in the specific
`combination requested or within a certain "range" (specified
`number of content words) of each other. In doing so, the
`engine searches through its database to locate documents
`that contain at least one word that matches one of the key
`25 words in the query and, where requested, according to the
`operator and/or range specified therewith. For each such
`document it finds, the engine retrieves the document record
`therefor and presents that record to the user ranked accord(cid:173)
`ing to a number of keyword matches in that document
`30 relative to those for the other such documents.
`Often, a great majority of documents retrieved solely in
`response to a user-supplied keyword query would be simply
`irrelevant to the query, thus frustrating the user.
`Consequently, to reduce the number of irrelevant docu-
`35 ments that are retrieved, conventional keyword based search
`engines (hereinafter referred to as simply "statistical search
`engines") incorporate statistical processing into their search
`methodologies. For example, based on a total number of
`matching key words between those in the query and the
`40 content words in each retrieved document record and how
`well these words match, i.e., in the combination and/or
`within a proximity range requested, a statistical search
`engine calculates numeric measures, collectively frequently
`referred to as "statistics", for each such document record
`retrieved. These statistics may include an inverse document
`frequency for each matching word. The engine then ranks
`the document records in terms of their statistics and returns
`to the user the document records for a small predefined
`number of retrieved records, typically 5-20 or less, that have
`50 the highest rankings. Once the user has reviewed a first
`group of document records ( or, for some engines, the
`documents themselves if they are returned by the engine) for
`a first group of retrieved documents, the user can then
`request a next group of document records having the next
`55 highest rankings, and so forth until all the retrieved docu(cid:173)
`ment records have been so reviewed.
`Traditionally, the performance of search engines has been
`assessed in terms of recall and precision. Recall measures, as
`a percentage of all relevant documents in a dataset, the
`60 number of such documents actually retrieved in response to
`a given query. Precision, on the other hand, measures, as a
`percentage of all documents retrieved, the number of those
`documents that are actually relevant to the query. We believe
`that in the context of a web search engine, recall is not an
`65 important metric of performance, inasmuch as the sheer
`number of documents ultimately retrieved is unimportant. In
`fact, for some queries, this number could be inordinately
`
`Page 17 of 34
`
`

`

`5,933,822
`
`10
`
`3
`large. Hence, we believe that not all relevant documents
`indexed by the engine need to be retrieved in order to
`produce a useful result; however, we believe that precision
`is extremely important, i.e., the documents that have the
`highest ranking and are presented first to a user should be
`those that are the most relevant to the query.
`The rather poor precision of conventional statistical
`search engines stems from their assumption that words are
`independent variables, i.e., words in any textual passage
`occur independently of each other. Independence in this
`context means that a conditional probability of any one word
`appearing in a document given the presence of another word
`therein is always zero, i.e., a document simply contains an
`unstructured collection of words or simply put a "bag of
`words". As one can readily appreciate, this assumption, with
`respect to any language, is grossly erroneous. English, like 15
`other languages, has a rich and complex syntactic and
`lexico-semantic structure with words whose meanings vary,
`often widely, based on the specific linguistic context in
`which they are used, with the context determining in any one
`instance a given meaning of a word and what word(s) can 20
`subsequently appear. Hence, words that appear in a textual
`passage are simply not independent of each other, rather they
`are highly inter-dependent. Keyword based search engines
`totally ignore this fine-grained linguistic structure. For
`example, consider an illustrative query expressed in natural 25
`language: "How many hearts does an octopus have?" A
`statistical search engine, operating on content words
`"hearts" and "octopus", or morphological stems thereof,
`might likely return or direct a user to a stored document that
`contains a recipe that has at its ingredients and hence its
`content words: "artichoke hearts, squid, onions and octo(cid:173)
`pus". This engine, given matches in the two content words
`"octopus" and "hearts", may determine, based on statistical
`measures, e.g. including proximity and logical operators,
`that this document is an excellent match, when, in reality, the 35
`document is quite irrelevant to the query.
`The art teaches various approaches for extracting ele(cid:173)
`ments of syntactic phrases as head-modifier pairs in unla(cid:173)
`beled relations. These elements are then indexed as terms
`(typically without internal structure) in a conventional sta(cid:173)
`tistical vector-space model.
`One example of such an approach is taught in J. L. Fagan,
`"Experiments in Automatic Phrase Indexing for Document
`Retrieval: A Comparison of Syntactic and Non-Syntactic
`Methods", Ph.D. Thesis, Cornell University, 1988, pages 45
`i-261. Specifically, this approach uses natural language
`processing to analyze English sentences and extract syntac-
`tic phrasal constituents elements wherein these phrasal con(cid:173)
`stituents are then treated as terms and indexed in an index
`using a statistical vector-space model. During retrieval, the
`user enters a query in natural language which, under this
`approach, is subjected to natural language processing for
`analysis and to extract elements of syntactic phrasal con(cid:173)
`stituents analogous to the elements stored in the index.
`Thereafter, attempts are made to match the elements of the
`syntactic phrasal constituents from the query to those stored
`in the index. The author contrasts this purely syntactic
`approach to a statistical approach, in which a stochastic
`method is used to identify elements within syntactic phrases.
`The author concludes that natural language processing does
`not yield substantial improvements over stochastic
`approaches, and that the small improvements in precision
`that natural language processing does sometimes produce do
`not justify the substantial processing cost associated with
`natural language processing.
`Another such syntactic based-approach is described, in
`the context of using natural language processing for select-
`
`4
`ing appropriate terms for inclusion within search queries, in
`T. Strzalkowski, "Natural Language Information Retrieval:
`TIPSTER-2 Final Report", Proceedings of Advances in Text
`Processing: Tipster Program Phase 2, DARPA, May 6-8,
`5 1996, Tysons Corner, Va., pages 143-148 (hereinafter the
`"DARPA paper"); and T. Strzalkowski, "Natural Language
`Information Retrieval", Information Processing and
`Management, Vol. 31, No. 3, 1995, pages 397-417. While
`this approach offers theoretical promise, the author on pages
`147-8 of the DARPA paper, concludes that, owing to the
`sophisticated processing required to implement the under(cid:173)
`lying natural language techniques, this approach is currently
`impractical:
`" ... [I]t is important to keep in mind that NLP [ natural
`language processing] techniques that meet our perfor(cid:173)
`mance requirements ( or at least are believed to be
`approaching these requirements) are still fairly unso-
`phisticated in their ability to handle natural language
`text. In particular, advanced processing involving con(cid:173)
`ceptual structuring, logical forms, etc. is still beyond
`reach, computationally. It may be assumed that these
`advanced techniques will prove even more effective,
`since they address the problem of representation-level
`limits; however, the experimental evidence is sparse
`and necessarily limited to rather small scale tests".
`A further syntactic-based approach of this sort is
`described in B. Katz, "Annotating the World Wide Web
`using Natural Language", Conference Proceedings of RIAO
`97, Computer-Assisted Information Searching in Internet,
`McGill University, Quebec, Canada, Jun. 25-27, 1997, Vol.
`30 1, pages 136-155 [hereinafter the "Katz publication"]. As
`described in the Katz publication, subject-verb-object
`expressions are created while preserving the internal struc(cid:173)
`ture so that during retrieval minor syntactic alternations can
`be accommodated.
`Because these syntactic approaches have yielded lacklus-
`ter improvements or have not been feasible to implement in
`natural language processing systems available at the time,
`the field has moved away from attempting to directly
`improve the precision and recall of the initial results of query
`40 to improvements in the user interface, i.e. specifically
`through methods for refining the query based on interaction
`with the user, such as through "find-similar" user responses
`to a retrieved result, and methods for visualizing the results
`of a query including displaying results in appropriate clus(cid:173)
`ters.
`While these improvements are useful in their own right,
`the added precision attainable through these improvements
`is still disappointingly low, and certainly insufficient to
`drastically reduce user frust

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket