`Messerly et al.
`
`I 1111111111111111 11111 1111111111 111111111111111 1111111111111111 Ill lllll llll
`US006076051A
`[11] Patent Number:
`[45] Date of Patent:
`
`6,076,051
`Jun.13,2000
`
`[54]
`
`[75]
`
`INFORMATION RETRIEVAL UTILIZING
`SEMANTIC REPRESENTATION OF TEXT
`
`Inventors: John J. Messerly, Bainbridge Island;
`George E. Heidorn, Bellevue; Stephen
`D. Richardson; William B. Dolan,
`both of Redmond; Karen Jensen,
`Bellevue, all of Wash.
`
`[73] Assignee: Microsoft Corporation, Redmond,
`Wash.
`
`[21] Appl. No.: 08/886,814
`
`[22] Filed:
`
`Mar. 7, 1997
`
`Int. Cl.7 ............................. G06F 17/27; G06F 17/30
`[51]
`[52] U.S. Cl. ..................................... 704/9; 704/10; 707/2;
`707/5
`[58] Field of Search .......................... 704/1, 9, 10; 707/3,
`707/4, 5, 6, 1, 2, 104, 530, 531, 532
`
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`4,839,853
`5,278,980
`5,325,298
`5,592,661
`5,619,709
`5,675,745
`5,675,819
`5,794,050
`5,794,178
`5,799,308
`5,873,056
`5,933,833
`
`6/1989 Deerwester et al. ........................ 707/5
`1/1994 Pedersen et al.
`........................... 707/4
`6/1994 Gallant ........................................ 704/9
`1/1997 Eisenberg et al. ... ... ... ... .... ... ... 707 /104
`4/1997 Caid et al. .............................. 707/532
`10/1997 Oku et al. ................................... 705/7
`10/1997 Schuetze .. ... ... .... ... ... ... ... ... .... ... . 704/10
`8/1998 Dahlgren et al. . ... ... ... ... .... ... ... 395 /708
`8/1998 Caid et al. .. ... ... ... ... ... .... ... ... ... ... . 704/9
`8/1998 Dixon ...................................... 707/100
`2/1999 Liddy et al.
`................................ 704/9
`8/1999 Braden-Harder et al. .................. 707/5
`
`FOREIGN PATENT DOCUMENTS
`
`0 304 191 A2
`0 304 191 A3
`0 386 825 Al
`0 687 987 Al
`
`2/1989
`2/1989
`9/1990
`12/1995
`
`European Pat. Off ..
`European Pat. Off ..
`European Pat. Off ..
`European Pat. Off ..
`
`OTHER PUBLICATIONS
`
`Van Zuijlen, Job M., "Probabilistic Methods in Dependency
`Grammar Parsing,"
`International Parsing Workshop,
`10375:142-151, 1989.
`Gerard Salton, "Automatic Information Organization and
`Retrieval," McGraw Hill Book Company, pp. 168-178
`(1968).
`Fagan, Joel L., Ph.D., "Experiments in automatic phrase
`indexing for document retrieval: A comparison of syntactic
`and non-syntactic methods," Cornell University, UMI Dis(cid:173)
`sertation Information Service , pp. 1-261 (1987).
`James Allen, "Natural Language Understanding," The Ben(cid:173)
`jamin/Cummings Publishing Company, Inc., Chapter 8, pp.
`227-238, 1995.
`
`Primary Examiner-Joseph Thomas
`
`[57]
`
`ABSTRACT
`
`The present invention is directed to performing information
`retrieval utilizing semantic representation of text. In a pre(cid:173)
`ferred embodiment, a tokenizer generates from an input
`string information retrieval tokens that characterize the
`semantic relationship expressed in the input string. The
`tokenizer first creates from the input string a primary logical
`form characterizing a semantic relationship between
`selected words in the input string. The tokenizer then
`identifies hypernyms that each have an "is a" relationship
`with one of the selected words in the input string. The
`tokenizer then constructs from the primary logical form one
`or more alternative logical forms. The tokenizer constructs
`each alternative logical form by, for each of one or more of
`the selected words in the input string, replacing the selected
`word in the primary logical form with an identified hyper(cid:173)
`nym of the selected word. Finally, the tokenizer generates
`tokens representing both the primary logical form and the
`alternative logical forms. The tokenizer is preferably used to
`generate tokens for both constructing an index representing
`target documents and processing a query against that index.
`
`23 Claims, 18 Drawing Sheets
`
`touch (sense 2)
`
`kiss ( sense 1)
`
`verb
`
`~1122
`
`1121
`
`1120
`
`deep object
`pig (sense 2)
`
`1110
`
`1111
`
`deep subject
`man (sense 2)
`
`I
`I
`1
`
`person ( sense 1)
`
`1112 ~
`L-------------.J
`
`1130
`
`1131
`
`I
`:
`swine ( sense 1)
`animal ( sense 3)
`h.
`I ' - 1132
`I
`I _ _ _ _ _ _ _ _ _ _ _ _ J
`
`Page 1 of 29
`
`GOOGLE EXHIBIT 1025
`
`
`
`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`Prior Art
`
`Fig.1
`
`'"""' 00
`'"""' 0 ....,
`~ ....
`'JJ. =-~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`I
`
`locations
`
`tokens
`
`132
`
`document
`in target
`
`their locations
`
`tokens to
`
`index mapping
`
`I
`
`,,_ 131
`
`I
`
`retrieval
`index
`
`information
`
`engine
`retrieval
`
`J
`
`I
`
`I
`
`I
`
`tokens
`
`I
`
`I
`
`I
`
`I
`
`tokenizer
`
`I
`
`I
`
`I
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`I
`
`/" 140
`
`I
`
`location
`tokens,
`
`I
`
`construction
`
`index
`
`I
`
`location
`tokens,
`
`location
`
`text,
`
`/" 130
`
`r 120
`
`111
`
`text
`
`I
`
`112
`
`result
`query
`
`query
`
`I
`
`document
`
`target
`
`Page 2 of 29
`
`
`
`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`N
`~ ....
`'JJ. =(cid:173)~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 2
`
`260
`
`IR engine (
`
`I
`
`index /250
`
`I
`
`I
`
`I
`
`internet connection _/ 223
`
`222
`
`media drive (
`
`computer-readable
`
`221
`
`storage device (
`
`242
`
`linquistic knowledge base (
`
`parser (241
`
`240
`
`facility
`
`/230
`
`memory
`
`220
`
`input/output devices
`
`CPU (210
`
`200
`
`computer system
`
`Page 3 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 3 of 18
`
`6,076,051
`
`begin
`
`for each sentence in
`target documents
`
`tokenize sentence
`
`store tokens in index
`with occurrence
`locations
`
`next sentence
`
`receive query text
`
`tokenize query
`text
`
`identify matching
`tokens in index
`
`301
`
`302
`
`303
`
`304
`
`305
`
`306
`
`307
`
`308
`
`rank target
`documents in which
`matching tokens
`occur
`
`Fig. 3
`
`Page 4 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 4 of 18
`
`6,076,051
`
`tokenize
`
`construct primary
`logical form from
`input text
`
`expand primary
`logical form using
`hypemyms
`
`401
`
`402
`
`end
`
`Fig. 4
`
`Page 5 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 5 of 18
`
`6,076,051
`
`kiss ( sense 1)
`verb
`
`521
`
`520
`
`510
`
`511
`
`deep subject
`man (sense 2)
`
`deep object
`pig (sense 2)
`
`530
`
`531
`
`../ 550
`(man, kiss, pig)
`
`Fig. 5
`
`601
`
`602
`
`603
`
`604
`
`Im~ I lki~ingl r{i IP' I
`~ \_600 ~
`
`document 5,
`word 150
`
`document 5,
`word 153
`
`Fig. 6
`
`Page 6 of 29
`
`
`
`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`~ ....
`'JJ. =(cid:173)~
`
`O'I
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 7A
`
`(sense 1)
`
`child
`
`(sense 1)
`woman
`
`(sense 2)
`
`man
`
`713
`
`712
`
`711
`
`714
`
`715
`
`(sense 1)
`person
`
`(sense 3)
`animal
`
`Page 7 of 29
`
`
`
`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`-..J
`~ ....
`'JJ. =(cid:173)~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 7B
`
`.0002
`
`.0003
`
`(sense 7)
`
`lead
`
`(sense 1)
`villain
`
`(sense 1)
`
`child
`
`(sense 1)
`woman
`
`(sense 2)
`
`man
`
`723
`
`722
`
`713
`
`712
`
`711
`
`(sense 5)
`person
`
`721
`
`714
`
`715
`
`(sense 1)
`person
`
`(sense 3)
`animal
`
`Page 8 of 29
`
`
`
`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`00
`~ ....
`'JJ. =(cid:173)~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 8
`
`(sense 3)
`
`soften
`
`842
`
`834
`
`.0002
`
`(sense 1)
`
`kick
`
`(sense 1)
`
`kiss
`
`.0020
`
`(sense 1)
`massage
`
`833
`
`832
`
`(sense 1)
`
`hug
`
`(sense 6)
`
`touch
`
`841
`
`(sense 2)
`
`touch
`
`(sense 9)
`interact
`
`836
`
`Page 9 of 29
`
`
`
`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`\0
`~ ....
`'JJ. =(cid:173)~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 9
`
`.0811
`
`(sense 3)
`
`boar
`
`952
`
`(sense 2)
`
`pig
`
`951
`
`953
`
`954
`
`(sense 1)
`
`swine
`
`(sense 3)
`animal
`
`Page 10 of 29
`
`
`
`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`'"""' 0
`~ ....
`'JJ. =(cid:173)~
`
`0
`8
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 10
`
`(sense 1)
`fetishist
`
`1073
`
`1072
`
`.0003
`
`(sense 1)
`orgiastic
`
`(sense 1)
`
`swme
`
`.0012
`
`(sense 1)
`
`horse
`
`1063
`
`1062
`
`(sense 1)
`person
`
`(sense 7)
`animal
`
`1071
`
`1064
`
`1065
`
`(sense 3)
`animal
`
`(sense 1)
`organism
`
`Page 11 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 11 of 18
`
`6,076,051
`
`r
`
`touch (sense 2)
`
`~ 1122
`
`kiss (sense 1)
`verb
`
`1121
`
`1120
`
`deep object
`pig (sense 2)
`
`1130
`
`1131
`
`I
`I
`swine ( sense 1)
`1
`animal ( sense 3)
`1--....
`'-.... 1132
`I
`I
`I _ _ _ _ _ _ _ _ _ _ _ _ J
`
`1110
`
`1111
`
`deep subject
`man (sense 2)
`
`person ( sense 1)
`
`I
`I
`I
`
`1112 Ji
`
`' - - - - - - - - - - - - - - . J
`
`Fig. 11
`
`Page 12 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 12 of 18
`
`6,076,051
`
`verb
`
`deep
`object
`
`deep
`subject
`
`man
`
`person
`
`pig
`
`(man, kiss, pig)
`
`(person, kiss, pig)
`
`1231
`
`kiss
`
`swme
`
`(man, kiss, swine)
`
`(person, kiss, swine)
`
`1232
`
`1230
`
`animal
`
`(man, kiss, animal)
`
`(person, kiss, animal)
`
`pig
`
`(man, touch, pig)
`
`(person, touch, pig)
`
`1233
`
`1241
`
`touch
`
`swine
`
`(man, touch, swine)
`
`(person, touch, swine)
`
`1242
`
`1240
`
`animal
`
`( man, touch, animal)
`
`(person, touch, animal)
`
`1243
`
`1210
`
`1220
`
`/
`
`1200
`
`((man OR person), (kiss OR touch), (pig OR swine OR animal))
`
`Fig. 12
`
`Page 13 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 13 of 18
`
`6,076,051
`
`token
`
`animal#
`
`kiss"
`
`man
`
`person_
`
`pig#
`
`swine#
`
`touch"
`
`document
`number
`
`word
`number
`
`1300
`
`• • •
`
`5
`
`• • •
`
`5
`
`• • •
`
`5
`
`5
`
`5
`
`5
`
`5
`
`152
`
`151
`
`150
`
`150
`
`152
`
`152
`
`151
`
`\.
`
`)_
`
`J.
`
`~
`
`1310
`
`1320
`
`1330
`
`Fig. 13
`
`Page 14 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 14 of 18
`
`6,076,051
`
`kiss ( sense 1)
`verb
`
`1421
`
`1420
`
`1410
`
`1411
`
`deep subject
`man (sense 2)
`
`deep object
`horse ( sense 1)
`
`1430
`
`1431
`
`. / 1450
`(man, kiss, horse)
`
`Fig. 14
`
`Page 15 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 15 of 18
`
`6,076,051
`
`touch (sense 2)
`
`kiss (sense 1)
`verb
`
`~1522
`
`1521
`
`1520
`
`deep object
`horse ( sense 1)
`
`1530
`
`1531
`
`I
`I
`1
`
`animal (sense 3)
`
`i'\._ 1532
`I
`I _ _ _ _ _ _ _ _ _ _ _ _ J
`
`1510
`
`1511
`
`deep subject
`man (sense 2)
`
`person ( sense 1)
`
`I
`I
`I
`
`1512 /i
`
`' - - - - - - - - - - - - - - . J
`
`((man OR person),
`
`I _ / 1550
`.
`O
`(k.
`1ss OR touch), (horse R amma ))
`
`Fig. 15
`
`Page 16 of 29
`
`
`
`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`'"""' O'I
`~ ....
`'JJ. =(cid:173)~
`
`0
`8
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 16
`
`.0001
`
`(sense 1)
`fetishist
`
`(sense 1)
`orgiastic
`
`(sense 1)
`
`swine
`
`(sense 1)
`
`horse
`
`1673
`
`1672
`
`1661
`
`1663
`
`.0011
`
`1662
`
`(sense 1)
`person
`
`(sense 7)
`animal
`
`1671
`
`1664
`
`1665
`
`(sense 3)
`animal
`
`(sense 1)
`organism
`
`Page 17 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 17 of 18
`
`6,076,051
`
`touch (sense 2)
`
`kiss ( sense 1)
`verb
`
`~1722
`
`1721
`
`1720
`
`deep subject
`man ( sense 2)
`
`1710
`
`1711
`
`I
`I
`I
`
`person ( sense 1)
`
`1712 Ji
`L-------------~
`
`1750
`((man OR person), (kiss OR touch), _ _ ) _ /
`
`Fig. 17
`
`Page 18 of 29
`
`
`
`U.S. Patent
`
`Jun.13,2000
`
`Sheet 18 of 18
`
`6,076,051
`
`,--------------
`~1822
`touch (sense 2)
`
`kiss ( sense 1)
`verb
`
`1821
`
`1820
`
`deep object
`horse ( sense 1)
`
`animal ( sense 3)
`
`1830
`
`1831
`
`I
`I
`1
`
`~ 1832
`
`I
`I _ _ _ _ _ _ _ _ _ _ _ _ J
`
`(_, (kiss OR touch), (horse OR animal))
`
`_/ 1850
`
`Fig. 18
`
`Page 19 of 29
`
`
`
`6,076,051
`
`2
`
`1
`INFORMATION RETRIEVAL UTILIZING
`SEMANTIC REPRESENTATION OF TEXT
`
`TECHNICAL FIELD
`
`The present invention relates to the field of information
`retrieval, and, more specifically, to the field of information
`retrieval tokenization.
`
`BACKGROUND OF THE INVENTION
`
`5
`
`10
`
`The father is holding the baby.
`into the following tokens:
`
`the
`father
`is
`hold
`the
`baby
`
`Information retrieval refers to the process of identifying
`occurrences in a target document of words in a query or
`query document. Information retrieval can be gainfully
`applied in several situations, including processing explicit
`user search queries, identifying documents relating to a 15
`particular document, judging the similarities of two
`documents, extracting the features of a document, and
`summarizing a document.
`Information retrieval typically involves a two-stage pro(cid:173)
`cess: (1) In an indexing stage, a document is initially 20
`indexed by (a) converting each word in the document into a
`series of characters intelligible to and differentiable by an
`information retrieval engine, called a "token" (known as
`"tokenizing" the document) and (b) creating an index map(cid:173)
`ping from each token to the location in the document where 25
`the token occurs. (2) In a query phase, a query ( or query
`document) is similarly tokenized and compared to the index
`to identify locations in the document at which tokens in the
`tokenized query occur.
`FIG. 1 is an overview data flow diagram depicting the
`information retrieval process. In the indexing stage, a target
`document 111 is submitted to a tokenizer 112. The target
`document is comprised of a number of strings, such as
`sentences, each occurring at a particular location in the
`target document. The strings in the target document and their 35
`word locations are passed to a tokenizer 120, which converts
`the words in each string into a series of tokens that are
`intelligible to and distinguishable by an information retrieval
`engine 130. An index construction portion 131 of the infor(cid:173)
`mation retrieval engine 130 adds the tokens and their
`locations to an index 140. The index maps each unique token
`to the locations at which it occurs in the target document.
`This process may be repeated to add a number of different
`target documents to the index, if desired. If the index 140
`thus represents the text in a number of target documents, the
`location information preferably includes an indication of, for 45
`each location, the document to which the location corre(cid:173)
`sponds.
`In the query phase, a textual query 112 is submitted to the
`tokenizer 120. The query may be a single string, or sentence,
`or may be an entire document comprised of a number of
`strings. The tokenizer 120 converts the words in the text of
`the query 112 into tokens in the same manner that it
`converted the words in the target document into tokens. The
`tokenizer 120 passes these tokens to an index retrieval
`portion 132 of the information retrieval engine 130. The 55
`index retrieval portion of the information retrieval engine
`searches the index 140 for occurrences of the tokens in the
`target document. For each of the tokens, the index retrieval
`portion of the information retrieval engine identifies the
`locations at which the token occurs in the target document. 60
`This list of locations is returned as the query result 113.
`Conventional tokenizers typically involve superficial
`transformations of the input text, such as changing each
`upper-case character to lower-case, identifying the indi(cid:173)
`vidual words in the input text, and removing suffixes from
`the words. For example, a conventional tokenizer might
`convert the input text string
`
`This approach to tokenization tends to make searches based
`on it overinclusive of occurrences in which senses of words
`are different than the intended sense in the query text. For
`example, the sample input text string uses the verb "hold" in
`the sense that means "to support or grasp." However, the
`token "hold" could match uses of the word "hold" that mean
`"the cargo area of a ship." This approach to tokenization also
`tends to be overinclusive of occurrences in which the words
`relate to each other differently than the words in the query
`text. For example, the sample input text string above, in
`which "father" is the subject of the word "held" and "baby"
`is the object, might match the sentence "The father and the
`baby held the toy," in which "baby" is a subject, not an
`object. This approach is further underinclusive of occur(cid:173)
`rences that use a different, but semantically related word in
`place of a word of the query text. For example, the input text
`string above would not match the text string "The parent is
`holding the baby." Given these disadvantages of conven-
`30 tional tokenization, a tokenizer that encodes semantic rela(cid:173)
`tionships implicit in the tokenized text would have signifi(cid:173)
`cant utility.
`
`SUMMARY OF THE INVENTION
`The invention is directed to performing information
`retrieval using an improved tokenizer that parses input text
`to identify logical forms, then expands the logical forms
`using hypernyms. The invention, when used in conjunction
`with conventional information retrieval index construction
`and querying, reduces the number of identified occurrences
`40 for which different senses were intended and in which words
`bear different relationships to each other, and increases the
`number of identified occurrences in which different but
`semantically related terms are used.
`The invention overcomes the problems associated with
`conventional tokenization by parsing both indexed and
`query text to perform lexical, syntactic, and semantic analy(cid:173)
`sis of this input text. This parsing process produces one or
`more logical forms, which identify words that perform
`primary roles in the query text and their intended senses, and
`that further identify the relationship between those words.
`The parser preferably produces logical forms that relate the
`deep subject, verb, and deep object of the input text. For
`example, for the input text "The father is holding the baby,"
`the parser might produce the following logical form:
`
`50
`
`deep subject
`
`father
`
`verb
`
`hold
`
`deep object
`
`baby
`
`The parser further ascribes to these words the particular
`senses in which they are used in the input text.
`Using a digital dictionary or thesaurus (also known as a
`65 "linguistic knowledge base") that identifies, for a particular
`sense of a word, senses of other words that are generic terms
`for the sense of the word ("hypernyms"), the invention
`
`Page 20 of 29
`
`
`
`6,076,051
`
`3
`changes the words within the logical forms produced by the
`parser to their hypernyms to create additional logical forms
`having an overall meaning that is hypernymous to the
`meaning of these original logical forms. For example, based
`on indications from the dictionary that a sense of "parent" is 5
`a hypernym of the ascribed sense of "father," a sense of
`"touch" is a hypernym of the ascribed sense of "hold," and
`a sense of "child" and sense of "person" are hypernyms of
`the ascribed sense of "baby," the invention might create
`additional logical forms as follows:
`
`4
`FIG. 14 is a logical form diagram showing the logical
`form preferably constructed by the facility for the query
`"man kissing horse."
`FIG. 15 shows the expansion of the primary logical form
`using hypernyms.
`FIG. 16 is a linguistic knowledge base diagram showing
`the selection of hypernyms of the deep object of the query
`logical form, horse (sense 1).
`FIG. 17 is a partial logical form diagram showing a partial
`10 logical form corresponding to a partial query containing
`only a deep subject and a verb.
`FIG. 18 is a partial logical form diagram showing a partial
`logical form corresponding to a partial query containing
`15 only a verb and a deep object.
`DETAILED DESCRIPTION OF THE
`INVENTION
`The present invention is directed to performing informa-
`20 tion retrieval utilizing semantic representation of text. When
`used in conjunction with conventional information retrieval
`index construction and querying, the invention reduces the
`number of identified occurrences for which different senses
`were intended and in which words bear different relation-
`25 ships to each other, and increases the number of identified
`occurrences in which different but semantically related
`terms are used.
`In a preferred embodiment, the conventional tokenizer
`shown in FIG. 1 is replaced with an improved information
`30 retrieval tokenization facility ("the facility") that parses
`input text to identify logical forms, then expands the logical
`forms using hypernyms. The invention overcomes the prob(cid:173)
`lems associated with conventional tokenization by parsing
`both indexed and query text to perform lexical, syntactic,
`35 and semantic analysis of this input text. This parsing process
`produces one or more logical forms, which identify words
`that perform primary roles in the query text and their
`intended senses, and that further identify the relationship
`between those words. The parser preferably produces logical
`40 forms that relate the deep subject, verb, and deep object of
`the input text. For example, for the input text "The father is
`holding the baby," the parser might produce logical form
`indicating the deep subject is "father," the verb is "hold,"
`and the deep object is "baby." Because transforming input
`text into a logical form distills the input text to its funda(cid:173)
`mental meaning by eliminating modifiers and ignoring dif(cid:173)
`ferences in tense and voice, transforming input text seg(cid:173)
`ments into the logical forms tends to unify the many
`different ways that may be used in a natural language to
`50 express the same idea. The parser further identifies the
`particular senses of these words in which they are used in the
`input text.
`Using a digital dictionary or thesaurus (also known as a
`"linguistic knowledge base") that identifies, for a particular
`55 sense of a word, senses of other words that are generic terms
`for the sense of the word ("hypernyms"), the invention
`changes the words within the logical forms produced by the
`parser to their hypernyms to create additional logical forms
`having an overall meaning that is hypernymous to the
`60 meaning of these original logical forms. The invention then
`transforms all of the generated logical forms into tokens
`intelligible by the information retrieval system that com(cid:173)
`pares the tokenized query to the index, and submits them to
`the information retrieval system.
`FIG. 2 is a high-level block diagram of the general(cid:173)
`purpose computer system upon which the facility preferably
`operates. The computer system 200 contains a central pro-
`
`deep subject
`
`parent
`father
`parent
`father
`parent
`father
`parent
`father
`parent
`father
`parent
`
`verb
`
`hold
`touch
`touch
`hold
`hold
`touch
`touch
`hold
`hold
`touch
`touch
`
`deep object
`
`baby
`baby
`baby
`child
`child
`child
`child
`person
`person
`person
`person
`
`The invention then transforms all of the generated logical
`forms into tokens intelligible by the information retrieval
`system that compares the tokenized query to the index, and
`submits them to the information retrieval system.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is an overview data flow diagram depicting the
`information retrieval process.
`FIG. 2 is a high-level block diagram of the general(cid:173)
`purpose computer system upon which the facility preferably
`operates.
`FIG. 3 is an overview flow diagram showing the steps
`preferably performed by the facility in order to construct and
`access an index semantically representing the target docu(cid:173)
`ments.
`FIG. 4 is a flow diagram showing the tokenize routine
`used by the facility to generate tokens for an input sentence.
`FIG. 5 is a logical form diagram showing a sample logical
`form.
`FIG. 6 is an input text diagram showing an input text 45
`fragment for which the facility would construct the logical
`form shown in FIG. 5.
`FIG. 7Ais a linguistic knowledge base diagram showing
`sample hypernym relationships identified by a linguistic
`knowledge base.
`FIG. 7B is a linguistic knowledge base diagram showing
`the selection of hypernyms of the deep subject of the
`primary logical form, man (sense 2).
`FIG. 8 is a linguistic knowledge base diagram showing
`the selection of hypernyms of the verb of the primary logical
`form, kiss (sense 1).
`FIGS. 9 and 10 are linguistic knowledge base diagrams
`showing the selection of hypernyms of the deep object of the
`primary logical form, pig (sense 2).
`FIG. 11 is a logical form diagram showing the expanded
`logical form.
`FIG. 12 is a chart diagram showing the derivative logical
`forms created by permuting the expanded primary logical
`form.
`FIG. 13 is an index diagram showing sample contents of
`the index.
`
`65
`
`Page 21 of 29
`
`
`
`6,076,051
`
`6
`semantic parsing process. For a detailed discussion of the
`construction of logical forms representing an input text
`string, refer to U.S. patent application Ser. No. 08/674,610,
`which is hereby incorporated by reference.
`The logical form used by the facility preferably isolates
`the principal verb of the sentence, the noun that is the real
`subject of the verb ("deep subject") and the noun that is the
`real object of the verb ("deep object"). FIG. 5 is a logical
`form diagram showing a sample primary logical form. The
`logical form has three elements: a deep subject element 510,
`a verb element 520, and a deep object element 530. It can be
`seen that the deep subject of the logical form is sense 2 of
`the word "man." The sense number indicates, for words
`having more than one sense, the particular sense ascribed to
`the word by the parser as defined by the linguistic knowl(cid:173)
`edge base used by the parser. For example, the word "man"
`could have a first sense meaning to supply with people and
`a second sense meaning adult male person. The verb of the
`logical form is a first sense of the word "kiss." Finally, the
`deep object is a second sense of the word "pig." An
`abbreviated version of this logical form is an ordered triple
`550 having as its first element the deep subject, as its second
`element the verb, and as its third element the deep object:
`
`(man, kiss, pig)
`
`The logical form shown in FIG. 5 characterizes a number
`of different sentences and sentence fragments. For example,
`FIG. 6 is an input text diagram showing an input text
`30 segment for which the facility would construct the logical
`form shown in FIG. 5. FIG. 6 shows the input text sentence
`fragment "man kissing a pig." It can be seen that this phrase
`occurs at word number 150 of document 5, occupying word
`positions 150, 151, 152, and 153. When the facility is
`35 tokenizing this input text fragment, it generates the logical
`form shown in FIG. 5. The facility would also generate the
`logical form shown in FIG. 5 for the following input text
`segments:
`
`5
`cessing unit (CPU) 210, input/output devices 220, and a
`computer memory (memory) 230. Among the input/output
`devices is a storage device 221, such as a hard disk drive.
`The input/output devices also include a computer-readable
`media drive 222, which can be used to install software 5
`products, including the facility which are provided on a
`computer-readable medium, such as a CD-ROM. The input/
`output devices further include an Internet connection 223
`enabling the computer system 200 to communicate with
`other computer systems via the Internet. The computer 10
`programs that preferably comprise the facility 240 reside in
`the memory 230 and execute on the CPU 210. The facility
`240 includes a rule-based parser 241 for parsing input text
`segments to be tokenized in order to produce logical forms.
`The facility 240 further includes a linguistic knowledge base 15
`242 used by the parser to ascribe sense numbers to words in
`the logical form. The facility further uses the linguistic
`knowledge base to identify hypernyms of the words in the
`generated logical forms. The memory 230 preferably also
`contains an index 250 for mapping from tokens generated 20
`from the target documents to locations in the target docu(cid:173)
`ments. The memory 230 also contains an information
`retrieval engine ("IR engine") 260 for storing tokens gen(cid:173)
`erated from the target documents in the index 250, and for
`identifying in the index tokens that match tokens generated 25
`from queries. While the facility is preferably implemented
`on a computer system configured as described above, those
`skilled in the art will recognize that it may also be imple(cid:173)
`mented on computer systems having different configura(cid:173)
`tions.
`FIG. 3 is an overview flow diagram showing the steps
`preferably performed by the facility in order to construct and
`access an index semantically representing the target docu(cid:173)
`ments. Briefly, the facility first semantically indexes the
`target documents by converting each sentence or sentence
`fragment of the target document into a number of tokens
`representing an expanded logical form portraying the rela(cid:173)
`tionship between the important words in the sentence,
`including hypernyms having similar meanings. The facility
`stores these "semantic tokens" in the index, along with the 40
`location in the target documents where the sentence occurs.
`After all of the target documents have been indexed, the
`facility is able to process information retrieval queries
`against the index. For each such query received, the facility
`tokenizes the text of the query in the same way it tokenized 45
`sentences from the target documents-by converting the
`sentence into semantic tokens together representing an
`expanded logical form for the query text. The facility then
`compares these semantic tokens to the semantic tokens
`stored in the index to identify locations in the target docu- 50
`ments for which these semantic tokens have been stored, and
`ranks the target documents containing these semantic tokens
`in the order of their relevance to the query. The facility may
`preferably update the index to include semantic tokens for
`new target documents at any time.
`Referring to FIG. 3, in steps 301-304, the facility loops
`through each sentence in the target documents. In step 302,
`the facility invokes a routine to tokenize the sentence as
`shown in FIG. 4.
`FIG. 4 is a flow diagram showing the tokenize routine
`used by the facility to generate tokens for an input sentence
`or other input text segment. In step 401, the facility con(cid:173)
`structs a primary logical form from the input text segment.
`As discussed above, a logical form represents the funda(cid:173)
`mental meaning of a sentence or sentence fragment. The
`logical forms are produced by applying the parser 241 (FIG.
`2) to subject the input text segment to a syntactic and
`
`The pig was kissed by an unusual man.
`
`The man will kiss the largest pig.
`
`Many pigs have been kissed by that man.
`
`55
`
`As discussed above, because transforming input text into a
`logical form distills the input text to its fundamental mean(cid:173)
`ing by eliminating modifiers and ignoring differences in
`tense and voice, transforming input text segments into the
`logical fonns tends to unify the many different ways that
`may be used in a natural language to express the same idea.
`Returning to FIG. 4, after the facility has constructed the
`primary logical form from the input text, such as the logical
`form shown in FIG. 5, the facility continues in step 402 to
`expand this primary logical form using hypernyms. After
`step 402, the tokenized routine returns.
`As mentioned above, a hypernym is a genus term that has
`an "is a" relationship with a particular word. For instance,
`the word "vehicle" is a hypernym of the word "automobile."
`The facility preferably uses a linguistic knowledge base to
`identify hypernyms of the words in the primary logical form.
`60 Such a linguistic knowledge base typically contains seman(cid:173)
`tic links identifying hypernyms of a word.
`FIG. 7Ais a linguistic knowledge base diagram showing
`sample hypernym relationships identified by a linguistic
`knowledge base. It should be noted that FIG. 7A, like the
`65 linguistic knowledge base diagrams that follow, has been
`simplified to facilitate this discussion, and omits information
`commonly found in linguistic knowledge bases that is not
`
`Page 22 of 29
`
`
`
`6,076,051
`
`7
`directl~ relevant to the present discussion. Each ascending
`arrow m FIG. 7A connects a word to its hypernym. For
`example, there is an arrow connecting the word man (sense
`2) 711 to the word person (sense 1) 714, indicating that
`person (sense 1) is a hypernym of man (sense 2). 5
`Conversely, man (sense 2) is said to be a "hyponym" of
`person (sense 1).
`In identifying hypernyms with which to expand the pri(cid:173)
`mary logical form, the facility selects one or more hyper(cid:173)
`nyms for each word of the primary logical form based upon 10
`the "coherency" of the hypernyms' hyponyms. By selecting
`hypernyms in this manner, the facility generalizes