throbber
United States Patent [19J
`Messerly et al.
`
`I 1111111111111111 11111 1111111111 111111111111111 1111111111111111 Ill lllll llll
`US006076051A
`[11] Patent Number:
`[45] Date of Patent:
`
`6,076,051
`Jun.13,2000
`
`[54]
`
`[75]
`
`INFORMATION RETRIEVAL UTILIZING
`SEMANTIC REPRESENTATION OF TEXT
`
`Inventors: John J. Messerly, Bainbridge Island;
`George E. Heidorn, Bellevue; Stephen
`D. Richardson; William B. Dolan,
`both of Redmond; Karen Jensen,
`Bellevue, all of Wash.
`
`[73] Assignee: Microsoft Corporation, Redmond,
`Wash.
`
`[21] Appl. No.: 08/886,814
`
`[22] Filed:
`
`Mar. 7, 1997
`
`Int. Cl.7 ............................. G06F 17/27; G06F 17/30
`[51]
`[52] U.S. Cl. ..................................... 704/9; 704/10; 707/2;
`707/5
`[58] Field of Search .......................... 704/1, 9, 10; 707/3,
`707/4, 5, 6, 1, 2, 104, 530, 531, 532
`
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`4,839,853
`5,278,980
`5,325,298
`5,592,661
`5,619,709
`5,675,745
`5,675,819
`5,794,050
`5,794,178
`5,799,308
`5,873,056
`5,933,833
`
`6/1989 Deerwester et al. ........................ 707/5
`1/1994 Pedersen et al.
`........................... 707/4
`6/1994 Gallant ........................................ 704/9
`1/1997 Eisenberg et al. ... ... ... ... .... ... ... 707 /104
`4/1997 Caid et al. .............................. 707/532
`10/1997 Oku et al. ................................... 705/7
`10/1997 Schuetze .. ... ... .... ... ... ... ... ... .... ... . 704/10
`8/1998 Dahlgren et al. . ... ... ... ... .... ... ... 395 /708
`8/1998 Caid et al. .. ... ... ... ... ... .... ... ... ... ... . 704/9
`8/1998 Dixon ...................................... 707/100
`2/1999 Liddy et al.
`................................ 704/9
`8/1999 Braden-Harder et al. .................. 707/5
`
`FOREIGN PATENT DOCUMENTS
`
`0 304 191 A2
`0 304 191 A3
`0 386 825 Al
`0 687 987 Al
`
`2/1989
`2/1989
`9/1990
`12/1995
`
`European Pat. Off ..
`European Pat. Off ..
`European Pat. Off ..
`European Pat. Off ..
`
`OTHER PUBLICATIONS
`
`Van Zuijlen, Job M., "Probabilistic Methods in Dependency
`Grammar Parsing,"
`International Parsing Workshop,
`10375:142-151, 1989.
`Gerard Salton, "Automatic Information Organization and
`Retrieval," McGraw Hill Book Company, pp. 168-178
`(1968).
`Fagan, Joel L., Ph.D., "Experiments in automatic phrase
`indexing for document retrieval: A comparison of syntactic
`and non-syntactic methods," Cornell University, UMI Dis(cid:173)
`sertation Information Service , pp. 1-261 (1987).
`James Allen, "Natural Language Understanding," The Ben(cid:173)
`jamin/Cummings Publishing Company, Inc., Chapter 8, pp.
`227-238, 1995.
`
`Primary Examiner-Joseph Thomas
`
`[57]
`
`ABSTRACT
`
`The present invention is directed to performing information
`retrieval utilizing semantic representation of text. In a pre(cid:173)
`ferred embodiment, a tokenizer generates from an input
`string information retrieval tokens that characterize the
`semantic relationship expressed in the input string. The
`tokenizer first creates from the input string a primary logical
`form characterizing a semantic relationship between
`selected words in the input string. The tokenizer then
`identifies hypernyms that each have an "is a" relationship
`with one of the selected words in the input string. The
`tokenizer then constructs from the primary logical form one
`or more alternative logical forms. The tokenizer constructs
`each alternative logical form by, for each of one or more of
`the selected words in the input string, replacing the selected
`word in the primary logical form with an identified hyper(cid:173)
`nym of the selected word. Finally, the tokenizer generates
`tokens representing both the primary logical form and the
`alternative logical forms. The tokenizer is preferably used to
`generate tokens for both constructing an index representing
`target documents and processing a query against that index.
`
`23 Claims, 18 Drawing Sheets
`
`touch (sense 2)
`
`kiss ( sense 1)
`
`verb
`
`~1122
`
`1121
`
`1120
`
`deep object
`pig (sense 2)
`
`1110
`
`1111
`
`deep subject
`man (sense 2)
`
`I
`I
`1
`
`person ( sense 1)
`
`1112 ~
`L-------------.J
`
`1130
`
`1131
`
`I
`:
`swine ( sense 1)
`animal ( sense 3)
`h.
`I ' - 1132
`I
`I _ _ _ _ _ _ _ _ _ _ _ _ J
`
`Page 1 of 29
`
`GOOGLE EXHIBIT 1025
`
`

`

`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`Prior Art
`
`Fig.1
`
`'"""' 00
`'"""' 0 ....,
`~ ....
`'JJ. =-~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`I
`
`locations
`
`tokens
`
`132
`
`document
`in target
`
`their locations
`
`tokens to
`
`index mapping
`
`I
`
`,,_ 131
`
`I
`
`retrieval
`index
`
`information
`
`engine
`retrieval
`
`J
`
`I
`
`I
`
`I
`
`tokens
`
`I
`
`I
`
`I
`
`I
`
`tokenizer
`
`I
`
`I
`
`I
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`I
`
`/" 140
`
`I
`
`location
`tokens,
`
`I
`
`construction
`
`index
`
`I
`
`location
`tokens,
`
`location
`
`text,
`
`/" 130
`
`r 120
`
`111
`
`text
`
`I
`
`112
`
`result
`query
`
`query
`
`I
`
`document
`
`target
`
`Page 2 of 29
`
`

`

`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`N
`~ ....
`'JJ. =(cid:173)~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 2
`
`260
`
`IR engine (
`
`I
`
`index /250
`
`I
`
`I
`
`I
`
`internet connection _/ 223
`
`222
`
`media drive (
`
`computer-readable
`
`221
`
`storage device (
`
`242
`
`linquistic knowledge base (
`
`parser (241
`
`240
`
`facility
`
`/230
`
`memory
`
`220
`
`input/output devices
`
`CPU (210
`
`200
`
`computer system
`
`Page 3 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 3 of 18
`
`6,076,051
`
`begin
`
`for each sentence in
`target documents
`
`tokenize sentence
`
`store tokens in index
`with occurrence
`locations
`
`next sentence
`
`receive query text
`
`tokenize query
`text
`
`identify matching
`tokens in index
`
`301
`
`302
`
`303
`
`304
`
`305
`
`306
`
`307
`
`308
`
`rank target
`documents in which
`matching tokens
`occur
`
`Fig. 3
`
`Page 4 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 4 of 18
`
`6,076,051
`
`tokenize
`
`construct primary
`logical form from
`input text
`
`expand primary
`logical form using
`hypemyms
`
`401
`
`402
`
`end
`
`Fig. 4
`
`Page 5 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 5 of 18
`
`6,076,051
`
`kiss ( sense 1)
`verb
`
`521
`
`520
`
`510
`
`511
`
`deep subject
`man (sense 2)
`
`deep object
`pig (sense 2)
`
`530
`
`531
`
`../ 550
`(man, kiss, pig)
`
`Fig. 5
`
`601
`
`602
`
`603
`
`604
`
`Im~ I lki~ingl r{i IP' I
`~ \_600 ~
`
`document 5,
`word 150
`
`document 5,
`word 153
`
`Fig. 6
`
`Page 6 of 29
`
`

`

`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`~ ....
`'JJ. =(cid:173)~
`
`O'I
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 7A
`
`(sense 1)
`
`child
`
`(sense 1)
`woman
`
`(sense 2)
`
`man
`
`713
`
`712
`
`711
`
`714
`
`715
`
`(sense 1)
`person
`
`(sense 3)
`animal
`
`Page 7 of 29
`
`

`

`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`-..J
`~ ....
`'JJ. =(cid:173)~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 7B
`
`.0002
`
`.0003
`
`(sense 7)
`
`lead
`
`(sense 1)
`villain
`
`(sense 1)
`
`child
`
`(sense 1)
`woman
`
`(sense 2)
`
`man
`
`723
`
`722
`
`713
`
`712
`
`711
`
`(sense 5)
`person
`
`721
`
`714
`
`715
`
`(sense 1)
`person
`
`(sense 3)
`animal
`
`Page 8 of 29
`
`

`

`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`00
`~ ....
`'JJ. =(cid:173)~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 8
`
`(sense 3)
`
`soften
`
`842
`
`834
`
`.0002
`
`(sense 1)
`
`kick
`
`(sense 1)
`
`kiss
`
`.0020
`
`(sense 1)
`massage
`
`833
`
`832
`
`(sense 1)
`
`hug
`
`(sense 6)
`
`touch
`
`841
`
`(sense 2)
`
`touch
`
`(sense 9)
`interact
`
`836
`
`Page 9 of 29
`
`

`

`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`\0
`~ ....
`'JJ. =(cid:173)~
`
`0
`0
`0
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 9
`
`.0811
`
`(sense 3)
`
`boar
`
`952
`
`(sense 2)
`
`pig
`
`951
`
`953
`
`954
`
`(sense 1)
`
`swine
`
`(sense 3)
`animal
`
`Page 10 of 29
`
`

`

`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`'"""' 0
`~ ....
`'JJ. =(cid:173)~
`
`0
`8
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 10
`
`(sense 1)
`fetishist
`
`1073
`
`1072
`
`.0003
`
`(sense 1)
`orgiastic
`
`(sense 1)
`
`swme
`
`.0012
`
`(sense 1)
`
`horse
`
`1063
`
`1062
`
`(sense 1)
`person
`
`(sense 7)
`animal
`
`1071
`
`1064
`
`1065
`
`(sense 3)
`animal
`
`(sense 1)
`organism
`
`Page 11 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 11 of 18
`
`6,076,051
`
`r
`
`touch (sense 2)
`
`~ 1122
`
`kiss (sense 1)
`verb
`
`1121
`
`1120
`
`deep object
`pig (sense 2)
`
`1130
`
`1131
`
`I
`I
`swine ( sense 1)
`1
`animal ( sense 3)
`1--....
`'-.... 1132
`I
`I
`I _ _ _ _ _ _ _ _ _ _ _ _ J
`
`1110
`
`1111
`
`deep subject
`man (sense 2)
`
`person ( sense 1)
`
`I
`I
`I
`
`1112 Ji
`
`' - - - - - - - - - - - - - - . J
`
`Fig. 11
`
`Page 12 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 12 of 18
`
`6,076,051
`
`verb
`
`deep
`object
`
`deep
`subject
`
`man
`
`person
`
`pig
`
`(man, kiss, pig)
`
`(person, kiss, pig)
`
`1231
`
`kiss
`
`swme
`
`(man, kiss, swine)
`
`(person, kiss, swine)
`
`1232
`
`1230
`
`animal
`
`(man, kiss, animal)
`
`(person, kiss, animal)
`
`pig
`
`(man, touch, pig)
`
`(person, touch, pig)
`
`1233
`
`1241
`
`touch
`
`swine
`
`(man, touch, swine)
`
`(person, touch, swine)
`
`1242
`
`1240
`
`animal
`
`( man, touch, animal)
`
`(person, touch, animal)
`
`1243
`
`1210
`
`1220
`
`/
`
`1200
`
`((man OR person), (kiss OR touch), (pig OR swine OR animal))
`
`Fig. 12
`
`Page 13 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 13 of 18
`
`6,076,051
`
`token
`
`animal#
`
`kiss"
`
`man
`
`person_
`
`pig#
`
`swine#
`
`touch"
`
`document
`number
`
`word
`number
`
`1300
`
`• • •
`
`5
`
`• • •
`
`5
`
`• • •
`
`5
`
`5
`
`5
`
`5
`
`5
`
`152
`
`151
`
`150
`
`150
`
`152
`
`152
`
`151
`
`\.
`
`)_
`
`J.
`
`~
`
`1310
`
`1320
`
`1330
`
`Fig. 13
`
`Page 14 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 14 of 18
`
`6,076,051
`
`kiss ( sense 1)
`verb
`
`1421
`
`1420
`
`1410
`
`1411
`
`deep subject
`man (sense 2)
`
`deep object
`horse ( sense 1)
`
`1430
`
`1431
`
`. / 1450
`(man, kiss, horse)
`
`Fig. 14
`
`Page 15 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 15 of 18
`
`6,076,051
`
`touch (sense 2)
`
`kiss (sense 1)
`verb
`
`~1522
`
`1521
`
`1520
`
`deep object
`horse ( sense 1)
`
`1530
`
`1531
`
`I
`I
`1
`
`animal (sense 3)
`
`i'\._ 1532
`I
`I _ _ _ _ _ _ _ _ _ _ _ _ J
`
`1510
`
`1511
`
`deep subject
`man (sense 2)
`
`person ( sense 1)
`
`I
`I
`I
`
`1512 /i
`
`' - - - - - - - - - - - - - - . J
`
`((man OR person),
`
`I _ / 1550
`.
`O
`(k.
`1ss OR touch), (horse R amma ))
`
`Fig. 15
`
`Page 16 of 29
`
`

`

`~
`
`.... = Ul
`.... = .....:a
`
`0--,
`
`0--,
`
`'"""' 00
`0 ....,
`'"""' O'I
`~ ....
`'JJ. =(cid:173)~
`
`0
`8
`N
`'"""' ~~
`
`~ = ?
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`Fig. 16
`
`.0001
`
`(sense 1)
`fetishist
`
`(sense 1)
`orgiastic
`
`(sense 1)
`
`swine
`
`(sense 1)
`
`horse
`
`1673
`
`1672
`
`1661
`
`1663
`
`.0011
`
`1662
`
`(sense 1)
`person
`
`(sense 7)
`animal
`
`1671
`
`1664
`
`1665
`
`(sense 3)
`animal
`
`(sense 1)
`organism
`
`Page 17 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 17 of 18
`
`6,076,051
`
`touch (sense 2)
`
`kiss ( sense 1)
`verb
`
`~1722
`
`1721
`
`1720
`
`deep subject
`man ( sense 2)
`
`1710
`
`1711
`
`I
`I
`I
`
`person ( sense 1)
`
`1712 Ji
`L-------------~
`
`1750
`((man OR person), (kiss OR touch), _ _ ) _ /
`
`Fig. 17
`
`Page 18 of 29
`
`

`

`U.S. Patent
`
`Jun.13,2000
`
`Sheet 18 of 18
`
`6,076,051
`
`,--------------
`~1822
`touch (sense 2)
`
`kiss ( sense 1)
`verb
`
`1821
`
`1820
`
`deep object
`horse ( sense 1)
`
`animal ( sense 3)
`
`1830
`
`1831
`
`I
`I
`1
`
`~ 1832
`
`I
`I _ _ _ _ _ _ _ _ _ _ _ _ J
`
`(_, (kiss OR touch), (horse OR animal))
`
`_/ 1850
`
`Fig. 18
`
`Page 19 of 29
`
`

`

`6,076,051
`
`2
`
`1
`INFORMATION RETRIEVAL UTILIZING
`SEMANTIC REPRESENTATION OF TEXT
`
`TECHNICAL FIELD
`
`The present invention relates to the field of information
`retrieval, and, more specifically, to the field of information
`retrieval tokenization.
`
`BACKGROUND OF THE INVENTION
`
`5
`
`10
`
`The father is holding the baby.
`into the following tokens:
`
`the
`father
`is
`hold
`the
`baby
`
`Information retrieval refers to the process of identifying
`occurrences in a target document of words in a query or
`query document. Information retrieval can be gainfully
`applied in several situations, including processing explicit
`user search queries, identifying documents relating to a 15
`particular document, judging the similarities of two
`documents, extracting the features of a document, and
`summarizing a document.
`Information retrieval typically involves a two-stage pro(cid:173)
`cess: (1) In an indexing stage, a document is initially 20
`indexed by (a) converting each word in the document into a
`series of characters intelligible to and differentiable by an
`information retrieval engine, called a "token" (known as
`"tokenizing" the document) and (b) creating an index map(cid:173)
`ping from each token to the location in the document where 25
`the token occurs. (2) In a query phase, a query ( or query
`document) is similarly tokenized and compared to the index
`to identify locations in the document at which tokens in the
`tokenized query occur.
`FIG. 1 is an overview data flow diagram depicting the
`information retrieval process. In the indexing stage, a target
`document 111 is submitted to a tokenizer 112. The target
`document is comprised of a number of strings, such as
`sentences, each occurring at a particular location in the
`target document. The strings in the target document and their 35
`word locations are passed to a tokenizer 120, which converts
`the words in each string into a series of tokens that are
`intelligible to and distinguishable by an information retrieval
`engine 130. An index construction portion 131 of the infor(cid:173)
`mation retrieval engine 130 adds the tokens and their
`locations to an index 140. The index maps each unique token
`to the locations at which it occurs in the target document.
`This process may be repeated to add a number of different
`target documents to the index, if desired. If the index 140
`thus represents the text in a number of target documents, the
`location information preferably includes an indication of, for 45
`each location, the document to which the location corre(cid:173)
`sponds.
`In the query phase, a textual query 112 is submitted to the
`tokenizer 120. The query may be a single string, or sentence,
`or may be an entire document comprised of a number of
`strings. The tokenizer 120 converts the words in the text of
`the query 112 into tokens in the same manner that it
`converted the words in the target document into tokens. The
`tokenizer 120 passes these tokens to an index retrieval
`portion 132 of the information retrieval engine 130. The 55
`index retrieval portion of the information retrieval engine
`searches the index 140 for occurrences of the tokens in the
`target document. For each of the tokens, the index retrieval
`portion of the information retrieval engine identifies the
`locations at which the token occurs in the target document. 60
`This list of locations is returned as the query result 113.
`Conventional tokenizers typically involve superficial
`transformations of the input text, such as changing each
`upper-case character to lower-case, identifying the indi(cid:173)
`vidual words in the input text, and removing suffixes from
`the words. For example, a conventional tokenizer might
`convert the input text string
`
`This approach to tokenization tends to make searches based
`on it overinclusive of occurrences in which senses of words
`are different than the intended sense in the query text. For
`example, the sample input text string uses the verb "hold" in
`the sense that means "to support or grasp." However, the
`token "hold" could match uses of the word "hold" that mean
`"the cargo area of a ship." This approach to tokenization also
`tends to be overinclusive of occurrences in which the words
`relate to each other differently than the words in the query
`text. For example, the sample input text string above, in
`which "father" is the subject of the word "held" and "baby"
`is the object, might match the sentence "The father and the
`baby held the toy," in which "baby" is a subject, not an
`object. This approach is further underinclusive of occur(cid:173)
`rences that use a different, but semantically related word in
`place of a word of the query text. For example, the input text
`string above would not match the text string "The parent is
`holding the baby." Given these disadvantages of conven-
`30 tional tokenization, a tokenizer that encodes semantic rela(cid:173)
`tionships implicit in the tokenized text would have signifi(cid:173)
`cant utility.
`
`SUMMARY OF THE INVENTION
`The invention is directed to performing information
`retrieval using an improved tokenizer that parses input text
`to identify logical forms, then expands the logical forms
`using hypernyms. The invention, when used in conjunction
`with conventional information retrieval index construction
`and querying, reduces the number of identified occurrences
`40 for which different senses were intended and in which words
`bear different relationships to each other, and increases the
`number of identified occurrences in which different but
`semantically related terms are used.
`The invention overcomes the problems associated with
`conventional tokenization by parsing both indexed and
`query text to perform lexical, syntactic, and semantic analy(cid:173)
`sis of this input text. This parsing process produces one or
`more logical forms, which identify words that perform
`primary roles in the query text and their intended senses, and
`that further identify the relationship between those words.
`The parser preferably produces logical forms that relate the
`deep subject, verb, and deep object of the input text. For
`example, for the input text "The father is holding the baby,"
`the parser might produce the following logical form:
`
`50
`
`deep subject
`
`father
`
`verb
`
`hold
`
`deep object
`
`baby
`
`The parser further ascribes to these words the particular
`senses in which they are used in the input text.
`Using a digital dictionary or thesaurus (also known as a
`65 "linguistic knowledge base") that identifies, for a particular
`sense of a word, senses of other words that are generic terms
`for the sense of the word ("hypernyms"), the invention
`
`Page 20 of 29
`
`

`

`6,076,051
`
`3
`changes the words within the logical forms produced by the
`parser to their hypernyms to create additional logical forms
`having an overall meaning that is hypernymous to the
`meaning of these original logical forms. For example, based
`on indications from the dictionary that a sense of "parent" is 5
`a hypernym of the ascribed sense of "father," a sense of
`"touch" is a hypernym of the ascribed sense of "hold," and
`a sense of "child" and sense of "person" are hypernyms of
`the ascribed sense of "baby," the invention might create
`additional logical forms as follows:
`
`4
`FIG. 14 is a logical form diagram showing the logical
`form preferably constructed by the facility for the query
`"man kissing horse."
`FIG. 15 shows the expansion of the primary logical form
`using hypernyms.
`FIG. 16 is a linguistic knowledge base diagram showing
`the selection of hypernyms of the deep object of the query
`logical form, horse (sense 1).
`FIG. 17 is a partial logical form diagram showing a partial
`10 logical form corresponding to a partial query containing
`only a deep subject and a verb.
`FIG. 18 is a partial logical form diagram showing a partial
`logical form corresponding to a partial query containing
`15 only a verb and a deep object.
`DETAILED DESCRIPTION OF THE
`INVENTION
`The present invention is directed to performing informa-
`20 tion retrieval utilizing semantic representation of text. When
`used in conjunction with conventional information retrieval
`index construction and querying, the invention reduces the
`number of identified occurrences for which different senses
`were intended and in which words bear different relation-
`25 ships to each other, and increases the number of identified
`occurrences in which different but semantically related
`terms are used.
`In a preferred embodiment, the conventional tokenizer
`shown in FIG. 1 is replaced with an improved information
`30 retrieval tokenization facility ("the facility") that parses
`input text to identify logical forms, then expands the logical
`forms using hypernyms. The invention overcomes the prob(cid:173)
`lems associated with conventional tokenization by parsing
`both indexed and query text to perform lexical, syntactic,
`35 and semantic analysis of this input text. This parsing process
`produces one or more logical forms, which identify words
`that perform primary roles in the query text and their
`intended senses, and that further identify the relationship
`between those words. The parser preferably produces logical
`40 forms that relate the deep subject, verb, and deep object of
`the input text. For example, for the input text "The father is
`holding the baby," the parser might produce logical form
`indicating the deep subject is "father," the verb is "hold,"
`and the deep object is "baby." Because transforming input
`text into a logical form distills the input text to its funda(cid:173)
`mental meaning by eliminating modifiers and ignoring dif(cid:173)
`ferences in tense and voice, transforming input text seg(cid:173)
`ments into the logical forms tends to unify the many
`different ways that may be used in a natural language to
`50 express the same idea. The parser further identifies the
`particular senses of these words in which they are used in the
`input text.
`Using a digital dictionary or thesaurus (also known as a
`"linguistic knowledge base") that identifies, for a particular
`55 sense of a word, senses of other words that are generic terms
`for the sense of the word ("hypernyms"), the invention
`changes the words within the logical forms produced by the
`parser to their hypernyms to create additional logical forms
`having an overall meaning that is hypernymous to the
`60 meaning of these original logical forms. The invention then
`transforms all of the generated logical forms into tokens
`intelligible by the information retrieval system that com(cid:173)
`pares the tokenized query to the index, and submits them to
`the information retrieval system.
`FIG. 2 is a high-level block diagram of the general(cid:173)
`purpose computer system upon which the facility preferably
`operates. The computer system 200 contains a central pro-
`
`deep subject
`
`parent
`father
`parent
`father
`parent
`father
`parent
`father
`parent
`father
`parent
`
`verb
`
`hold
`touch
`touch
`hold
`hold
`touch
`touch
`hold
`hold
`touch
`touch
`
`deep object
`
`baby
`baby
`baby
`child
`child
`child
`child
`person
`person
`person
`person
`
`The invention then transforms all of the generated logical
`forms into tokens intelligible by the information retrieval
`system that compares the tokenized query to the index, and
`submits them to the information retrieval system.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is an overview data flow diagram depicting the
`information retrieval process.
`FIG. 2 is a high-level block diagram of the general(cid:173)
`purpose computer system upon which the facility preferably
`operates.
`FIG. 3 is an overview flow diagram showing the steps
`preferably performed by the facility in order to construct and
`access an index semantically representing the target docu(cid:173)
`ments.
`FIG. 4 is a flow diagram showing the tokenize routine
`used by the facility to generate tokens for an input sentence.
`FIG. 5 is a logical form diagram showing a sample logical
`form.
`FIG. 6 is an input text diagram showing an input text 45
`fragment for which the facility would construct the logical
`form shown in FIG. 5.
`FIG. 7Ais a linguistic knowledge base diagram showing
`sample hypernym relationships identified by a linguistic
`knowledge base.
`FIG. 7B is a linguistic knowledge base diagram showing
`the selection of hypernyms of the deep subject of the
`primary logical form, man (sense 2).
`FIG. 8 is a linguistic knowledge base diagram showing
`the selection of hypernyms of the verb of the primary logical
`form, kiss (sense 1).
`FIGS. 9 and 10 are linguistic knowledge base diagrams
`showing the selection of hypernyms of the deep object of the
`primary logical form, pig (sense 2).
`FIG. 11 is a logical form diagram showing the expanded
`logical form.
`FIG. 12 is a chart diagram showing the derivative logical
`forms created by permuting the expanded primary logical
`form.
`FIG. 13 is an index diagram showing sample contents of
`the index.
`
`65
`
`Page 21 of 29
`
`

`

`6,076,051
`
`6
`semantic parsing process. For a detailed discussion of the
`construction of logical forms representing an input text
`string, refer to U.S. patent application Ser. No. 08/674,610,
`which is hereby incorporated by reference.
`The logical form used by the facility preferably isolates
`the principal verb of the sentence, the noun that is the real
`subject of the verb ("deep subject") and the noun that is the
`real object of the verb ("deep object"). FIG. 5 is a logical
`form diagram showing a sample primary logical form. The
`logical form has three elements: a deep subject element 510,
`a verb element 520, and a deep object element 530. It can be
`seen that the deep subject of the logical form is sense 2 of
`the word "man." The sense number indicates, for words
`having more than one sense, the particular sense ascribed to
`the word by the parser as defined by the linguistic knowl(cid:173)
`edge base used by the parser. For example, the word "man"
`could have a first sense meaning to supply with people and
`a second sense meaning adult male person. The verb of the
`logical form is a first sense of the word "kiss." Finally, the
`deep object is a second sense of the word "pig." An
`abbreviated version of this logical form is an ordered triple
`550 having as its first element the deep subject, as its second
`element the verb, and as its third element the deep object:
`
`(man, kiss, pig)
`
`The logical form shown in FIG. 5 characterizes a number
`of different sentences and sentence fragments. For example,
`FIG. 6 is an input text diagram showing an input text
`30 segment for which the facility would construct the logical
`form shown in FIG. 5. FIG. 6 shows the input text sentence
`fragment "man kissing a pig." It can be seen that this phrase
`occurs at word number 150 of document 5, occupying word
`positions 150, 151, 152, and 153. When the facility is
`35 tokenizing this input text fragment, it generates the logical
`form shown in FIG. 5. The facility would also generate the
`logical form shown in FIG. 5 for the following input text
`segments:
`
`5
`cessing unit (CPU) 210, input/output devices 220, and a
`computer memory (memory) 230. Among the input/output
`devices is a storage device 221, such as a hard disk drive.
`The input/output devices also include a computer-readable
`media drive 222, which can be used to install software 5
`products, including the facility which are provided on a
`computer-readable medium, such as a CD-ROM. The input/
`output devices further include an Internet connection 223
`enabling the computer system 200 to communicate with
`other computer systems via the Internet. The computer 10
`programs that preferably comprise the facility 240 reside in
`the memory 230 and execute on the CPU 210. The facility
`240 includes a rule-based parser 241 for parsing input text
`segments to be tokenized in order to produce logical forms.
`The facility 240 further includes a linguistic knowledge base 15
`242 used by the parser to ascribe sense numbers to words in
`the logical form. The facility further uses the linguistic
`knowledge base to identify hypernyms of the words in the
`generated logical forms. The memory 230 preferably also
`contains an index 250 for mapping from tokens generated 20
`from the target documents to locations in the target docu(cid:173)
`ments. The memory 230 also contains an information
`retrieval engine ("IR engine") 260 for storing tokens gen(cid:173)
`erated from the target documents in the index 250, and for
`identifying in the index tokens that match tokens generated 25
`from queries. While the facility is preferably implemented
`on a computer system configured as described above, those
`skilled in the art will recognize that it may also be imple(cid:173)
`mented on computer systems having different configura(cid:173)
`tions.
`FIG. 3 is an overview flow diagram showing the steps
`preferably performed by the facility in order to construct and
`access an index semantically representing the target docu(cid:173)
`ments. Briefly, the facility first semantically indexes the
`target documents by converting each sentence or sentence
`fragment of the target document into a number of tokens
`representing an expanded logical form portraying the rela(cid:173)
`tionship between the important words in the sentence,
`including hypernyms having similar meanings. The facility
`stores these "semantic tokens" in the index, along with the 40
`location in the target documents where the sentence occurs.
`After all of the target documents have been indexed, the
`facility is able to process information retrieval queries
`against the index. For each such query received, the facility
`tokenizes the text of the query in the same way it tokenized 45
`sentences from the target documents-by converting the
`sentence into semantic tokens together representing an
`expanded logical form for the query text. The facility then
`compares these semantic tokens to the semantic tokens
`stored in the index to identify locations in the target docu- 50
`ments for which these semantic tokens have been stored, and
`ranks the target documents containing these semantic tokens
`in the order of their relevance to the query. The facility may
`preferably update the index to include semantic tokens for
`new target documents at any time.
`Referring to FIG. 3, in steps 301-304, the facility loops
`through each sentence in the target documents. In step 302,
`the facility invokes a routine to tokenize the sentence as
`shown in FIG. 4.
`FIG. 4 is a flow diagram showing the tokenize routine
`used by the facility to generate tokens for an input sentence
`or other input text segment. In step 401, the facility con(cid:173)
`structs a primary logical form from the input text segment.
`As discussed above, a logical form represents the funda(cid:173)
`mental meaning of a sentence or sentence fragment. The
`logical forms are produced by applying the parser 241 (FIG.
`2) to subject the input text segment to a syntactic and
`
`The pig was kissed by an unusual man.
`
`The man will kiss the largest pig.
`
`Many pigs have been kissed by that man.
`
`55
`
`As discussed above, because transforming input text into a
`logical form distills the input text to its fundamental mean(cid:173)
`ing by eliminating modifiers and ignoring differences in
`tense and voice, transforming input text segments into the
`logical fonns tends to unify the many different ways that
`may be used in a natural language to express the same idea.
`Returning to FIG. 4, after the facility has constructed the
`primary logical form from the input text, such as the logical
`form shown in FIG. 5, the facility continues in step 402 to
`expand this primary logical form using hypernyms. After
`step 402, the tokenized routine returns.
`As mentioned above, a hypernym is a genus term that has
`an "is a" relationship with a particular word. For instance,
`the word "vehicle" is a hypernym of the word "automobile."
`The facility preferably uses a linguistic knowledge base to
`identify hypernyms of the words in the primary logical form.
`60 Such a linguistic knowledge base typically contains seman(cid:173)
`tic links identifying hypernyms of a word.
`FIG. 7Ais a linguistic knowledge base diagram showing
`sample hypernym relationships identified by a linguistic
`knowledge base. It should be noted that FIG. 7A, like the
`65 linguistic knowledge base diagrams that follow, has been
`simplified to facilitate this discussion, and omits information
`commonly found in linguistic knowledge bases that is not
`
`Page 22 of 29
`
`

`

`6,076,051
`
`7
`directl~ relevant to the present discussion. Each ascending
`arrow m FIG. 7A connects a word to its hypernym. For
`example, there is an arrow connecting the word man (sense
`2) 711 to the word person (sense 1) 714, indicating that
`person (sense 1) is a hypernym of man (sense 2). 5
`Conversely, man (sense 2) is said to be a "hyponym" of
`person (sense 1).
`In identifying hypernyms with which to expand the pri(cid:173)
`mary logical form, the facility selects one or more hyper(cid:173)
`nyms for each word of the primary logical form based upon 10
`the "coherency" of the hypernyms' hyponyms. By selecting
`hypernyms in this manner, the facility generalizes

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket