`
`TII-ITY
`SERIAL
`NUMBER
`
`PATEI
`
`I
`
`;td*s -[
`
`FILING DATE
`i:i ::ji Ij r::t';;! /'*'.;l
`
`cl^ss
`
`I NUMBER
`t
`
`Olf i-ii."il"i ..i,,
`Fr, 1 ,,r:i''''i'
`$..1 il,: i',; i',; i
`&
`
`""
`i'.i
`
`'::
`
`'.''
`
`irli;:i.i:ii:;i:ii::ii....:rj,, i{rt.l. i..ilr,.ti::i 1 i:,i:li[:: .l irji.., !:.:ii.,'ii.
`i.:,,, !::t l: t-::l-lF:tfil.iLil-ii'.i ., i::il:i jlri(ii.Lll]
`" i*i;l:l
`i:i[::.1.ri..-i::::\iL.ii:::. " ir..'|i..'l "
`
`,, i,:,1 i:.:i I !.:ii::.1..:l i::il.:i [:. l:::. ,, i- i l:]l
`i.d.i.1....;,,.. .i. r';ii.i i].j " i:ri.tL, i.:i1.,!.,
`
`6078051
`
`607605t
`
`GROUP ART UNIT
`-;l;ji-i-
`2"7:tV
`i..liI: ., i.,.i ri i:
`.i i..,t.-t i::i f.l,, i:'.i l,::: 1... il. i;:::
`i::ii:::.i.:'lrlt_.tirii.i,, i4!{i I l.::.i:.t[il:.iil
`
`EGMINER
`
`".J
`
`ri::+:i.:::i-Ji!.|"i .i. i\ii-t l. idi:,i
`vfil i:i :i l:::''i' !::: j.:r
`
`?t$)i"f;:i:t': :l: *: :ji: :i: :i: :i: i+: :l
`
`ii: :+: :l: :+; :ai :{: :i: :j; ji: ii: :i.: :+;
`
`I
`,ir ;i: F.f'lflf: 'i ijiilr.j {i-\i::,$:ii... :{ t::::r+[ ], t...ri.il;,+' :+, :ii: :i: :t: :+: :1r: :.1: :.t: :i: ii: :i:
`!fi::i::l j: Fr I F:i:r
`/
`
`ntstrnfrn
`APf{ e r 200t
`rflnffffi0t
`
`N7>N
`
`t_
`
`hI
`
`ilaro €
`
`i:::'i iii:ii:l:"i::;ilii i:r'i i.- l'iil{:,i t-.i t:-:i:::i.1iiriF: r::,ii:;j,_}i,.i i"i:iil
`
`-
`
`pdodry chimed
`35 USC ll9condiuons met O
`Verified and
`
`yes
`
`ni,
`no
`
`tr
`
`FILING FEE
`RECEIVED
`
`ii 1
`ir .i. .: .
`
`ttti
`._-. '.,.i ,. ,.,, :,.,
`
`ATTORNEY'S
`oocrer no.
`{, fr, :i r..t r..t iri i:; :i. .,
`
`l::.
`rl
`
`!.'i
`
`'!
`
`iili::: i"ii::iirli.:i-i .!.;."ii.j :::li:'";"i-t j. i::.tjr.:1i.... l..j"l- j" i
`
`irjlr::.i'.ii:il'rl | :1. t:..: i:ii::i
`
`t1i--i1J
`!"1 !_ l-t !
`
`til-ii'l
`l !r
`
`I r'"1 i
`
`U.S. OEPT. OF COMM./ PAT & TM-PTO-436L (Rev.12-9
`
`at
`ah
`ut
`
`Goo
`
`5
`
`.
`
`NOTICE OF ALLOWANCE MAILED
`
`:t?'f.?lfi^1,'l?iHl "' [s -"" ]tri,,,
`f -J a '77r,"
`
`|SSUE FEE €jrl
`Datg Pald "
`Amount Due
`/*/o ,()o
`l? f t.lTT
`
`Label
`Area
`
`rfd
`
`r 1.a^ 1s-qn 71t ;:#.''l"\
`23
`
`Aoplications Examiner
`CLAIMS ALLOWED
`Total Claims
`Print Claim
`
`I
`
`DRAWING
`
`A0{friNEH
`27.i'1--
`
`PREPARED FOR ISSUE
`
`Examiner
`
`Sheets Dnruo
`
`Figs. Drwg.n
`t8
`ISSUEBATCH yl V \
`NUMBER 11 3
`
`/
`
`Print Fig.
`
`ll
`
`WARNING: The information disclosed herein may be restlcte-d. Unauthorized disclosure may be prohi:ited
`by the United States Code Tltle 35, Se$idis 122, 181 and 368. Possession outside the U.S.
`Patent & Trademark Otfice is restrictgdte authorized employees and contractors only.
`
`Form PTO-436A
`(Rev. 8/92)
`
`4r
`4:i;;
`'litt
`
`,'ti;J::"*+ i',: f,fJ.f ",
`'rhlsq
`, tit'
`
`**f,]::"
`
`f o'rrnat nmilri$NR$ L*$hb) $st*
`
`*:" t
`
`(FACE)
`
`Page 1 of 364
`
`GOOGLE EXHIBIT 1026
`
`
`
`I
`
`.j.
`
`.ll
`
`L-]-
`
`i\:
`\,
`ITENT APFLICATION
`PATENT APFLICNTIbN
`' lllllllllillllllllllllffiiffiffillillillllt'!i'
`08885814
`
`CONTENTS
`
`i
`
`:
`
`APPROVED FOI]I LICEN$E
`
`*i,,1$$l
`
`Da.te
`Received ,
`.or
`Mailed
`
`r/v
`
`,rt');!s:,,n;
`
`,Q {)*-"
`papers.ld ffb lTp:
`"t't,t:tt-t/{ l;
`
`tt - t;'-.qr-l
`AuJtl
`t-,
`,lU
`,Y
`
`i-s*ttryt; *a x-qq
`
`/-r\D-'9-,-tL)
`ii--t;
`\\
`ll,
`il
`- r\ l\
`lt)
`ilj
`til!
`,'
`
`ffi
`
`,.
`
`5. r.d0
`
`4.
`
`5.
`
`6.
`
`7.
`
`8.
`
`9.
`
`10.
`
`11.
`
`12.
`
`13.
`
`14.
`
`15.
`
`16.
`
`17.
`
`18.
`
`19.
`
`20.
`
`22.
`
`23.
`
`24.
`
`25.
`
`26.
`
`27.
`
`28.
`
`29.
`
`30.
`
`2.191.
`
`\
`
`(FRONT)
`
`Page 2 of 364
`
`
`
`pArENr ApprrcArroN sERrAr
`
`llffllu|lqlulfll|!ryilllfllfil
`
`U.S. DEPARTI{ENT OF COMMERCE
`PATEM A}TD TRADEMANT OFFICE
`FEE RECORD SEEET
`
`qM
`
`08/1t/1997 TI}ADET 00000041 00096814
`01 FC:101
`770.00 0p
`0? FCrl0P
`480.00 0p
`03 FC:103
`740.00 0F
`
`Pt0-1556
`$187)
`
`qrq
`
`Page 3 of 364
`
`
`
`Snnn and Brnny lr,p
`6300 Columbia Center
`le, Washingt on 98104-7 A92
`Phone (206) 622-4900
`Fax (206) 682-6031
`
`Express Mail Certifioate No.:
`DocketNo.:
`Date:
`
`nM{fi229266US
`66100s.s12
`March 7,1997
`
`MESD
`MAR - 7 tggT
`
`BOXPATENT
`ASSISTA]YT CO
`PATENTS
`20 1I JEFFERSON DAVIS HIGI{WAY
`WASHINGTONDC 20231
`
`Sir:
`
`William B. Dolan,
`
`Transmitted herewith for filing is the patent application of:
`John J. Messerly, George E. Heidorn, Stephen D. Richardson,
`and Karen Jensen
`INFORMATION RETRIEVA L VTil,IzJNG SEMANTIC
`REPRESENTATION OF TEXT
`Enclosed are:
`!! sheets of informal drawings (Figs. 1-18).
`An assignment of the inventionto: Microsoft Corporation, a corporation of the State of Washington.
`A Declaration and Power of Attorney.
`A verified statement to establish small entity status under 37 C.F.R. 1.9 and 37 C.F.R. I.27 .
`A certified copy of Application No. , filed , from which priority is claimed, .
`The filing fee has been caloulated as shown below.
`Filed without fee or formal papers.
`
`Inventors:
`
`For:
`
`txl
`txl
`txl
`fl
`tl
`Xt
`tl
`
`X'or:
`
`Utilltv Fee
`
`Total Claims
`
`Indenendent Claims
`( ) MuftipleDependent
`Claim Presented
`
`ASSIGI\MEIIT
`
`No. Fileil
`
`No. Extra
`
`Small Entitv
`Rate
`
`Fee 'to
`
`$ 385
`
`54
`
`9
`
`34
`
`6
`
`x1l
`
`x40
`
`+ 130
`
`+40
`
`$
`
`$
`
`$
`
`$
`
`TOTAL $
`
`OtherThanA
`SmallEntitv
`Rate
`
`Fee
`
`$ 770
`
`$ 748
`
`$ 480
`
`$ -0-
`
`$40
`
`x22
`
`x80
`
`+ 260
`
`+40
`
`TOTAL $ 2.038
`
`or
`or
`
`or
`
`or
`
`or
`
`or
`
`or
`
`.or
`
`tl
`rxl
`txl
`
`Please charge my Deposit Account No. 1 9- 1090 in the amount of $-. A duplioate copy of this sheet is enclosed.
`A check in the amount of $ 2.038 is enclosed.
`The Assistant Commissioner is hereby authorized to charge
`communication or credit any overpayment to Deposit Acoount No.
`Xl,qrry additional filing fees required under 3? C.F.R. 1. 1 6.
`[X] Any patent applioation processing fees under 37 C.F.R. 1.
`
`payment of the following fees associated with this
`1P-1090. A duplicate copy ofthis sheet is enclosed.
`
`Stsved D.
`Registration No.
`
`l{rl
`
`:'rE
`
`l's
`
`:. 1
`
`a
`
`Page 4 of 364
`
`
`
`UNITED STATES PATENT AND TRADEMAITK OFFICE
`
`PATENT
`
`John J. Messerly, George E. Heidorn, Stephen D. Richardson,
`
`William B. Dolan, and Karen Jensen
`
`March 7, 1997
`
`INFORMATION RETRIEVAL UTILIZING SEMANTIC
`REPRESENTATION OF TEXT
`
`Docket No.
`
`661005.512
`
`Date
`
`March 7, 1997
`
`Box Patent Application
`Assistant Commissioner for Patents
`2011 Jefferson Davis Highway
`Washington, DC 20231
`
`", :'
`
`dfi6
`
`ll't,!
`
`LI
`
`CERTIFICATE OF MAILING BY "E)GRESS MAIL''
`
`Sir:
`
`I hereby certi$ that the enclosures listed below are being deposited with the
`United States Postal Service "EXPRESS MAIL Post Office to Addressee" service under 37
`C.F.R. $ 1.10, Mailing Label Certificate No. EM41722}2661J5, on March 7, lgg7, addressed
`to Box Patent Application, Assistant Commissioner for Patents, 2OIl Jefferson Davis
`Highway, Washington, DC 2023L
`
`Resp ectfu lly submitted,
`
`SEED and BERRY t t P
`
`SDL:ljj
`
`Enclosures:
`Postcard
`Check No. 45395 for $2,038
`Form PTO-I082 (+ copy)
`Specification, Claims, Abstract (37)
`i8 Sheets of Informal Drawings (Figs. 1-18)
`Declaration and Power of Attorney
`Form PTO-1595
`Assignment
`Information Disclosure Statement,
`Form PTO- 1449, References Cited (3)
`
`c:\sdl\1952
`
`Page 5 of 364
`
`
`
`t, 11 K,00 *lol-A
`
`Express Mail No. 8M417229266U5
`
`INFORMATION RETRIEVAL UTILIZING SEMANTIC
`
`REPRESENTATION OF TEXT
`
`TECHNICAL FIELD
`The present invention relates to the field of information retrieval, and,
`
`5
`
`more specifically, to the field of information retrieval tokenization.
`
`BACKGROLIND OF THE INVENTION
`Information retrieval refers to the process of identiSing occlilrences in a
`target document of words in a query or query document. Information retrieval can be
`
`gainfirlly applied in several situations, including processing explicit user search queries,
`identifying documents relating to a particular document, judging the similarities of two
`documents, extacting the features of a document, and summarizing a document.
`Information refiieval typically involves a two-stage process: (1) In an
`indexing stage, a document is initially indexed by (a) converting each word in the
`document into a series of characters intelligible to and differentiable by an information
`retrieval engine, called a "token" (known as "tokenizing" the document) and
`(b) creating an index mapping from each tokw to the location in the document where
`the token occurs. (2) In a query phase, a query (or query document) is similarly
`tokenized and compared to the index to identify locations in the document at which
`
`tokens in the tokenized query occur.
`Figure I is an overview data flow diagram depicting the information
`retrieval process. In the indexing stage, a tnget document 111 is submitted to a
`tokenizer 112. The target document is comprised of a number of strings, such as
`sentences, each occu:ring at a particular location in the target document. The strings in
`the target document and their word locations are passed to a tokenizet 120, which
`converts the words in each string into a series of tokens that are intelligible to and
`distinguishable by an information retrieval engine 130. An index construction portion
`131 of the information retrieval engine 130 adds the tokens and their locations to an
`index 140. The index maps each unique token to the locations at which it occurs in the
`target document. This process may be repeated to add a number of different target
`
`10
`
`15
`
`20.
`
`25
`
`30
`
`F-n
`
`tl
`
`in'
`
`-4#,d
`
`Page 6 of 364
`
`
`
`2
`
`documents to the index, if desired. If the index 140 thus represents the text in a number
`of target documents, the location information preferably includes an indication of, for
`
`each location, the document to which the location coresponds.
`
`In the query phase, a textual query 112 is submiued to the tokenizer 120.
`5 The query may be a single string, or sentence, or may be an entire document comprised
`of a number of strings. The tokenizer 120 converts the words in the text of the query
`112 into tokens in the same manner that it converted the words in the target document
`into tokens. The tokenizer 120 passes these tokens to an index retrieval portion 132 of
`the information retrieval engine 130. The index retrieval portion of the information
`10 retrieval engine searches the index 140 for occurrences of the tokens in the target
`document. For each of the tokens, the index retrieval portion of the information
`retrieval engine identifies the locations at which the token occurs in the target
`document. This list of locations is returned as the query result 113.
`Conventional tokenizers typically involve superfi cial transformations of
`15 the input text, such as changing each upper-case character to lower-case, identiffing the
`individual words in the input text, and removing suffixes from the words. For example,
`
`'<:5#r''''
`
`20
`
`.^"-i:,.-: ,. i r..
`
`: ..,' .'
`
`'
`
`25
`
`a conventional tokenizer might convert the input text string
`
`The father is holding the baby.
`
`into the following tokens:
`
`,h"
`
`father
`
`is
`
`hold
`
`the
`
`baby
`
`f5!
`1?!
`\r{
`ili
`
`{r,
`i*i
`
`Ii
`
`il"'?
`
`i*J
`t-I
`
`'tS n
`
`Page 7 of 364
`
`
`
`3
`
`This approach to tokenization tends to make searches based on it overinclusive of
`occunences in which senses of words are different than the intended sense in the query
`text. For example, the sample input text string uses the verb "hold" in the sense that
`means "to support or grasp." However, the token "hold" could match uses of the word
`"hold" that mean "the cargo area of a ship." This approach to tokenization also tends to
`be overinclusive of occurrences in which the words relate to each other differently than "
`the words in the query text. For example, the sample input text string above, in which
`o'father" is the subject of the word "held" and o'baby" is the object, might match the
`sentence "The father and the baby held the toy," in which "bfuy" is a subject, not an
`object. This approach is further underinclusive of occurrences that use a different, but
`semantically related word in place of a word of the query text. For example, the input
`text string above would not match the text string "The parent is holding the baby."
`Given these disadvantages of conventional toke rization, a tokenizer that .E*"M*
`IL
`semantic relationships implicit in the tokenized text would have significant utility.
`
`SUMMARY OF THE INVENTION
`The invention is directed to performing information retrieval using an
`improved tokenizer that parses input text to identifi' logical forms, then expands the
`logical forms using hypemyms. The invention, when used in conjunction with
`conventional information retrieval index construction and querying, reduces the number
`of identified occurrences for which different senses were intended and in which words
`bear different relationships to each other, and increases the number of identified
`occuffences in which different but semantically related terms are used.
`The invention overcomes the problems associated with conventional
`tokenization by parsing both indexed and query text to perform lexical, syntactic, and
`
`semantic analysis of this input text. This parsing process produces one or more logical
`forms, which identiS words that perform primary roles in the query text and their
`intended senses, and that further identifu the relationship between those words. The
`
`parser preferably produces logical forms that relate the deep subject, verb, and deep
`object of the input text. For example, for the input text "The father is holding the
`baby," the parser might produce the following logical form:
`
`10
`
`15
`
`20
`
`25
`
`30
`
`#a
`
`ii.
`Alat-
`
`i-,a
`
`'c
`
`i.-Jt
`
`ilw
`irrri
`e--ti
`ari
`'4
`
`--1"
`
`I
`
`; r
`
`
`
`Page 8 of 364
`
`
`
`_\,.
`
`*-*.d.d
`
`deep subject
`
`father
`
`verb
`hold
`
`deep object
`
`baby
`
`.j*q
`
`i"-r,
`
`iii|
`
`'1 i
`Xrn
`
`f""i
`
`,fii
`
`10
`
`15
`
`The parser fi.rther ascribes to these words the particular senses in which they are used in
`
`the input text.
`
`Using a digital dictionary or thesaurus (also known as a "linguistic
`knowledge base") that identifies, for a particular sense of a word, senses of other words
`
`that are generic terms for the sense of the word ("hypernyms"), the invention changes
`
`the words within the logical forms produced by the parser to their hypernyms to create
`
`additional logical forms having an overall meaning that is hypernymous to the meaning
`
`of these original logical forms. For example, based on indications from the dictionary
`that a sense of o'parent" is a hypernym of the ascribed sense of "father," a sense of
`"touch" is a hypernym of the ascribed sense of "hold," and a sense of o'child" and sense
`of "person" are hypernyms of the ascribed sense of "baby," the invention might create
`
`additional logical forms as foliows:
`
`deep subiect
`
`parent
`
`father
`
`parent
`
`father
`
`parent
`
`father
`
`parent
`
`father
`
`parent
`
`father
`
`paxent
`
`verb
`
`hold
`
`touch
`
`touch
`
`hold
`
`hold
`
`touch
`
`touch
`
`hold
`
`hold
`
`touch
`
`touch
`
`deep object
`
`baby
`
`baby
`
`baby
`
`I child
`child
`
`child
`
`child
`
`person
`
`person
`
`person
`
`person
`
`Page 9 of 364
`
`
`
`.
`
`5
`
`The invention then transforms all of the generated logical forms into
`tokens intelligible by the information retrieval system that compares the tokenized
`
`query to the index, and submits them to the information retrieval system.
`
`5 BRIEF DESCRIPTION OF THE DRAWINGS
`Figure I is an overview data flow diagram depicting the information.
`retrieval process.
`Figure 2 is a highJevel block diagram of the general-purpose computer
`
`10
`
`system upon which the facility preferably operates.
`Figure 3 is an overview flow diagram showing the steps preferably
`performed by the facility in order to construct and access an index semantically
`
`representing the target documents.
`Figure 4 is a flow diagram showing the tokenize routine used by the
`facility to generate tokens for an input sentence.
`Figure 5 is a iogical form diagram showing a sample logical form.
`Figure 6 is an input text diagram showing an input text fragment for
`which the facility would construct the logical form shown in Figure 5.
`Figure 7A is a linguistic knowledge base diagram showing sample
`hypernym relationships identified by a linguistic knowledge base.
`Figure 78 is a linguistic knowledge base diagram showing the seiection
`
`15
`
`20
`
`of hypernyms of the deep subject of the primary logical form, man (sense 2).
`Figure 8 is a linguistic knowledge base diagram showing the selection of
`
`hypemyms of the verb of the primary logical form, kiss (sense 1).
`Figures 9 and 10 are linguistic knowledge base diagrams showing the
`25 selection of hypernyms of the deep object of the primary logical form, pig (sense 2).
`Figure 11 is a logical form diagram showing the expanded logical form.
`Figure L2 is achart diagram showing the derivative logical forms created
`
`by permuting the expanded primary logical form.
`Figure 13 is an index diagram showing sample contents of the index.
`Figure 14 is a logical form diagram showing the logical form preferably
`
`30
`
`constructed by the facility for the query "man kissing horse."
`
`fi
`fli]
`,i"Sr/&
`,hirdffi
`i*i,
`;sq
`uJ
`il#
`.|;-$l
`u,
`
`#f
`
`EN
`
`.r'
`
`,;
`
`1 !
`
`
`
`Page 10 of 364
`
`
`
`6
`
`Figure 15 shbws the expansion of the primary logical form using
`
`hypernyms.
`
`Figure 16 is a linguistic knowledge base diagram showing the selection
`
`of hypernyms of the deep object of the query logical form, horse (sense 1).
`
`Figure 17 is apartial logical form diagram showing apafiral logical form
`
`corresponding to a partial query containing only a deep subject and a verb.
`Figure 18 is a partial logical form diagram showin g apartral logical form
`
`corresponding to a partial query containing only a verb and a deep object.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`DETAILED DESCRIPTION OF THE INVENTION
`The present invention is directed to performing information retrieval
`utilizing semantic representation of text. When used in conjunction with conventional
`
`information retrieval index construction and querying, the invention reduces the number
`of identified occurrences for which different senses were intended and in which words
`bear different relationships to each other, and increases the number of identified
`occurrences in which different but semantically related terms are used.
`
`In a preferred embodiment, the conventional tokenizer shown in Figure 1
`is replaced with an improved information retrieval tokenization facility ("the facility")
`that parses input text to identify logical forms, then expands the logical forms using
`hypernyms. The invention overcomos the problems associated with conventional
`tokenization by parsing both indexed and query text to perform lexical, syntactic, and
`semantic analysis of this input text. This parsing process produces one or more logical
`forms, which identiff words that perform primary roles in the query text and their
`intended senses, and that further identifu the relationship between those words. The
`parser preferably produces logical forms that relate the deep subject, verb, and deep
`object of the input text. For example, for the input text o'The father is holding the
`baby," the parser might produce logical form indicating the deep subject is "father," the
`verb is oohold," and the deep object is "baby." Because transforming input text into a
`logical form distills the input text to its fundamental meaning by eliminating modifiers
`and ignoring differences in tense and voice, transforming input text segments into the
`logical forms tends to unifr the many different ways that may be used in a natural
`
`rii
`
`ri-1t*-\
`
`Page 11 of 364
`
`
`
`7
`
`language to express the same idea. The parser further identifies the particular senses of
`
`these words in which they are used in the input text.
`Using a digital dictionary or thesaurus (also known as a "linguistic
`knowledge base") that identifies, for a particular sense of a word, senses of other words
`
`that are generic terms for the sense of the word ("hypernyms"), the invention changes
`the words within the logical forms produced by the parser to their hypernyms to create
`
`additional logical forms having an overall meaning that is hypemymous to the meaning
`of these original logical forms. The invention then transforms all of the generated
`logical forms into tokens intelligible by the information retrieval system that compares
`
`10
`
`the tokenized query to the index, and submits them to the information retrieval system.
`
`Figure 2 is a high-level block diagram of the general-purpose computer
`
`system upon which the facility preferably operates. The computer system 200 contains
`
`a central processing unit (CPU) 210, input/output devices 220, anda computer memory
`
`(memory) 230. Among the input/output devices is a storage device 221, such as a hard
`disk drive. The input/output devices also include a computer-readable media drle 222,
`r,vhich can be used to install software products, including the facility which are provided
`
`on a computer-readable medium, such as a CD-ROM. The input/output devices further
`include an Internet connection 223 enabling the computer system 200 to communicate
`with other computer systems via the Internet. The computer programs that preferably
`comprise the facility 240 reside in the memory 230 and execute on the CPU 210. The
`facility 240 includes a rule-based parser 241 for parsing input text segments to be
`tokenized in order to produce logical forms. The facility 240 further includes a
`linguistic knowledge base 242 used by the parser to ascribe sense numbers to words in
`the logical form. The facility further uses the linguistic knowledge base to identiff
`hypemyms of the words in the generated logical forms. The memory 230 preferably
`also contains an index 250 for mapping from tokens generated from the target
`documents to locations in the target documents. The memory 230 also contains an
`information retrieval engine ("IR engine") 260 for storing tokens generated from the
`target documents in the index 250, and for identiffing in the index tokens that match
`tokens generated from queries. While the facility is preferably implemented on a
`
`15
`
`20
`
`25
`
`30
`
`rTt
`
`?nd
`;"'!j
`
`Page 12 of 364
`
`
`
`8
`
`computer system configured as described above, those skilled in the art wili recognize
`that it may also be implemented on computer systems having different configurations.
`Figwe 3 is an overview flow diagram showing the steps preferably
`performed by the facility in order to construct and access an index semantically
`representing the target documents. Briefly, the facility first semantically indexes the
`target documents by converting each sentence or sentence fragment of the target
`document into a number of tokens representing an expanded logical form portraying the
`relationship between the important words in the sentence, including hypernyms having
`similar meanings. The facility stores these "semantic tokens" in the index, along with
`the location in the target documents where the sentence occurs. After all of the target
`documents have been indexed, the facility is able to process information retrieval
`queries against the index. For each such query received, the facility tokenizes the text
`of the query in the silme way it tokenized. sentences from the target documents -- by
`converting the sentence into semantic tokens together representing an expanded logical
`form for the query text. The facility then compares these semantic tokens to the
`semantic tokens stored in the index to identi$r locations in the target documents for
`which these semantic tokens have been stored, and ranks the target documents
`containing these semantic tokens in the order of their relevance to the query. The
`facility may preferably update the index to include semantic tokens for new tarset
`documents at any time.
`
`Refening to Figure 3, in steps 301-304, the facility loops through each
`sentence in the target documents. In step 302, the facility invokes a routine to tokenize
`the sentence as shown in Figure 4.
`Figure 4 is a flow diagram showing the tokenize routine used by the
`facility to generate tokens for an input sentence or other input text segment. In step
`401, the facility constructs a primary logical form from the input text segment. As
`discussed above, a logical form represents the fundamental meaning of a sentenoe or
`sentence fragment. The logical forms are produced by applying the parser 241
`(Figure 2) to subject the input text segment to a syntactic and semantic parsing process.
`For a detailed discussion of the construction of logical forms representing an input text
`
`10
`
`15
`
`2A
`
`25
`
`30
`
`d.--
`ls
`'sh
`
`ll
`
`;*
`
`;,{ v
`
`,-\
`
`Page 13 of 364
`
`
`
`9
`
`string, refer to U.S. Patent Application No. 08/674,610, which is hereby incorporated by
`
`reference.
`
`The logical form used by the facility preferably isolates the principal
`verb of the sentence, the noun that is the real subject of the verb ("deep subject") and
`
`the noun that is the real object ofthe verb ("deep object"). Figure 5 is a logical form
`diagram showing a sample primary logical form. The logical form has three elements:
`a deep subject element 510, a verb elemett 520, and a deep object element 530. It can
`be seen that the deep subject of the logical form is sense 2 of the word "man." The
`sense number indicates, for words having more than one sense, the particular sense
`
`10
`
`15
`
`ascribed to the word by the parser as defined by the linguistic knowledge base used by
`the parser. For example, the word "man" could have a first sense meaning to supply
`with people and a second sense meaning adult male person. The verb of the logical
`form is a first sense of the word "kiss." Finally, the deep object is a second sense of the
`word "pig." An abbreviated version of this logical form is an ordered triple 550 having
`as its first element the deep subject, as its second element the verb, and as its third
`
`element the deep object:
`
`(man, kiss, pig)
`
`20
`
`The logical form shown in Figure 5 characterizes a number of different
`. sentences and sentence fragments. For example, Figure 6 is an input text diagram
`
`i1"*
`'*w
`tli t
`
`gt
`
`\rri'
`
`.W
`
`showing an input text segment for which the facility would consffuct the logical form
`shown in Figure 5. Figure 6 shows the input text sentence fragment "man kissing a
`pig." It can be seen that this phrase occurs at word number 150 of document 5,
`occupying word positions 150, 151,152, and 153. When the facility is tokenizing this
`input text fragment, it generates the logical form shown in Figure 5. The facility would
`
`also generate the logical form shown in Figure 5 for the following input text segments:
`
`25
`
`tI
`
`I
`
`Page 14 of 364
`
`
`
`10
`
`The pig was kissed by an unusual man.
`
`The man will kiss the largest pig.
`
`Many pigs have been kissed by that man.
`
`i'---
`fn
`
`rai
`ixtl
`
`\l
`. frt
`ttl?!i6
`. aii
`
`As discussed above, because transforming input text into a logical form distills the input
`text to its fundamental meaning by eliminating modifiers and ignoring differences in "'
`tense and voice, transforming input text segments into the logical forms tends to uniry
`the many different ways that may be used in a natural language to express the same
`
`idea.
`
`10
`
`l5
`
`20
`
`25
`
`Returning to Figure 4, after the facility has constructed the primary
`logical form from the input text, such as the logical form shown in Figure 5, the facility
`continues in step 402 to expand this primary logical form using hypernyms. After step
`
`402, the tokenized routine returns.
`As mentioned above, a hypernym is a genus term that has an "is a"
`relationship with a particular word. For instance, the word "vehicle" is a hypernym of
`the word o'automobile." The facility preferably uses a linguistic knowledge base to
`identify hypernyms of the words in the primary logical form. Such a linguistic
`knowledge base typically contains semantic links identiffing hypernyms of a word.
`Figure 7A is a linguistic knowledge base diagram showing sample
`hypernym relationships identified by a linguistic knowledge base. It should be noted
`that Figure 7A, like the linguistic knowledge base diagrams that follow, has been
`simplified to facilitate this discussion, and omits information commonly found in
`linguistic knowledge bases that is not directly relevant to the present discussion. Each
`
`ascending affow in Figure 7A connects a word to its hypernym. For example, there is
`an uilrow connecting the word man (sense 2) 7tl to the word person (sense l) 7I4,
`indieating that person (sense 1) is a hypemym of man (sense 2). Conversely, man (sense
`
`2) is said to be a "hyponym" of person (sense 1).
`
`In identiffing hypernyms with which to expand the primary logical form,
`the facility selects one or more hypemyms for each word of the primary logical form
`based upon the oocoherency" of the hypemyms' hyponyms. By selecting hypernyms in
`
`30
`
`Page 15 of 364
`
`
`
`11
`
`this manner, the facility generalizes the meaning of the logical form beyond the
`meaning of the input text segment, but by a controlled amount. For a particular word of
`a primary logical form, the facility first selects the immediate hypernym of the word of
`the primary logical form. For example, with reference to Figure 7A, starting with man
`(sense 2) 7I1 which occurs in the primary logical form, the facility selects its
`hypemym, person (sense l) 714. The facility next bases its determination of whether to
`also select the hypemym of person (sense l) 714, animal (sense 3) 715, on whether
`person (sense 1) 714 has a coherent hyponym set with respect to the starting word man
`(sense 2) 7ll. Person (sense I) 7L4 has a coherent hyponym set with respect to man
`(sense 2) 7ll if a large number of hyponyms of all senses of the word person other than
`the starting word (sens e 2) 7l1 bear at least a threshold level of similarity to the starting
`word man (sense 2) 7ll.
`In order to determine the level of similarity between the hyponyms of the
`different senses of the hypernym, the facility preferably consults the linguistic
`knowledge base to obtain similarity weights indicating the degree of similarity between
`these word sentences. Figure 78 is a linguistic knowledge base diagram showing
`similarity weights between man (sense 2) and other hyponyms of person (sense 1) and
`person (sense 5). The diagram shows that the similarity weight between man (sense 2)
`and woman (sense 1) is ".0075"; between man (sense 2) and child (sense 1) is ".0029";
`between man (sense 2) andviilain (sense 1) is ".0003"; and between man (sense 2) and
`lead (sense 7) is ".0002". These similarity weights are preferably calculated by the
`linguistic knowledge base based on a network of semantic relations maintained by the
`linguistic knowledge base between the word sense pairs. For a detailed discussion of
`calculating similarity weights between word sense pairs using a linguistic knowledge
`base, refer to U.S. Patent Application No.08ho4,?
`eAmO*SZ*i, entitled "DETERMINING SIMILARITY BETWEEN WORDS," which
`
`flsi
`ied
`
`rff.
`
`{.iJ
`$E*
`ji i8
`
`Lri"--
`f6{ss
`E
`
`tgig
`
`r$EB
`
`10
`
`15
`
`20
`
`4;s,'
`
`irb\
`
`is hereby incorporated by reference.
`In order to determine whether the set of hyponyms is coherent based on
`these similarity weights, the facility determines whether a threshold number of the
`30 similarity weights exceed a threshold similarity weight. While the preferred threshold
`
`\
`{'**-
`
`II
`
`1
`
`Page 16 of 364
`
`
`
`iJ-
`
`[*"i:
`
`I2
`
`percentage is 90Yo, the threshold percentage may preferably be adjusted in order to
`optimize the performance of the facility. The similarity weight threshold may also be
`configured to optimize the performance of the facility. The threshold similarity weight
`is preferably coordinated with the overall distribution of similarity weights provided by
`the linguistic knowledge base. Here, the use of a tlueshold of ".0015" is shown. The
`
`facility therefore determines whether at least 90o/o of the similarity weights between the
`starting word and the other hyponyms of ali of the senses of the hypemym are at or
`above the ".0015" threshold similarity weight. It can be seen from Figure 78 that this
`condition is not satisfied by the hyponyms of person with respect to man (sense 1):
`
`while the similarity weights between man (sense 1) and woman (sense 1) and between
`man (sense 1) and child (sense 1) are greater than ".0015", the similarity weights
`between man (sense 1) and villain (sense 1) and between man (sense 1) and lead (sense
`7) are less than ".0015". The facility therefore does not select the further hypernym
`animal (sense 3) 715, or any hypemyms of animal (sense 3). As a result, only the
`hypernym person (sense l) 714 is selected to expand the primary logical form.
`To expand a primary logical form, the facility also selects hypernyms of
`the verb and deep object of the primary logical form. Figure 8 is a linguistic knowiedge
`base diagram showing the selection of hypernyms of the verb of the primary logicai
`form, kiss (sense 1). It can be seen from the diagram that touch (sense 2) is the
`hypernym of kiss (sense 1). The diagram also shows the similarity weights between
`kiss (sense 1) andthe otherhyponyms of all of the senses of touch. The facility first
`selects the immediate hyperrrym of the verb of the primary logical form kiss (sense l),
`touch (sense 2). To deterlnine whether to select the hypernym of touch (sense 2),
`interact (sense 9), the facility determines how many similarity weights between kiss
`(sense 1) and the other hyponyms of all of the senses of touch are at least as large as the
`threshold sirnilarity weight. Because only two of these four similarity weights ate at
`least as largd as the ".0015" threshold similarity weight, the facility do