Express Mail No. EM417229266US
`INFORMATION RETRIEVAL UTILIZING SEMANTIC
`REPRESENTATION OF TEXT
`
`TECHNICAL FIELD
`The present invention relates to the field of information retrieval, and,
`more specifically, to the field of information retrieval tokenization.
`
`BACKGROUND OF THE INVENTION
Information retrieval refers to the process of identifying occurrences in a
target document of words in a query or query document. Information retrieval can be
gainfully applied in several situations, including processing explicit user search queries,
identifying documents relating to a particular document, judging the similarities of two
documents, extracting the features of a document, and summarizing a document.
Information retrieval typically involves a two-stage process: (1) In an
indexing stage, a document is initially indexed by (a) converting each word in the
document into a series of characters intelligible to and differentiable by an information
retrieval engine, called a "token" (known as "tokenizing" the document), and
(b) creating an index mapping from each token to the location in the document where
the token occurs. (2) In a query phase, a query (or query document) is similarly
tokenized and compared to the index to identify locations in the document at which
tokens in the tokenized query occur.
Figure 1 is an overview data flow diagram depicting the information
retrieval process. In the indexing stage, a target document 111 is submitted to a
tokenizer 120. The target document is comprised of a number of strings, such as
sentences, each occurring at a particular location in the target document. The strings in
the target document and their word locations are passed to the tokenizer 120, which
converts the words in each string into a series of tokens that are intelligible to and
distinguishable by an information retrieval engine 130. An index construction portion
131 of the information retrieval engine 130 adds the tokens and their locations to an
index 140. The index maps each unique token to the locations at which it occurs in the
target document. This process may be repeated to add a number of different target
`Page 1 of 59
`
`GOOGLE EXHIBIT 1024
`
`
`
`
`documents to the index, if desired. If the index 140 thus represents the text in a number
`of target documents, the location information preferably includes an indication of, for
`each location, the document to which the location corresponds.
`In the query phase, a textual query 112 is submitted to the tokenizer 120.
The query may be a single string, or sentence, or may be an entire document comprised
`of a number of strings. The tokenizer 120 converts the words in the text of the query
`112 into tokens in the same manner that it converted the words in the target document
`into tokens. The tokenizer 120 passes these tokens to an index retrieval portion 132 of
`the information retrieval engine 130. The index retrieval portion of the information
`retrieval engine searches the index 140 for occurrences of the tokens in the target
`document. For each of the tokens, the index retrieval portion of the information
`retrieval engine identifies the locations at which the token occurs in the target
document. This list of locations is returned as the query result 113.
`Conventional tokenizers typically involve superficial transformations of
`the input text, such as changing each upper-case character to lower-case, identifying the
`individual words in the input text, and removing suffixes from the words. For example,
`a conventional tokenizer might convert the input text string
`
`The father is holding the baby.
`
`into the following tokens:
`
the

father

is

hold

the

baby
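A conventional tokenizer of the kind described above can be sketched in a few lines. The suffix-stripping rule below is purely illustrative; real stemmers are considerably more elaborate:

```python
import re

def conventional_tokenize(text):
    """Superficial tokenization: lower-case the text, split it into
    words, and strip a common suffix (an illustrative stemming rule,
    not the behavior of any particular tokenizer)."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # Crude suffix removal, e.g. "holding" -> "hold".
        if word.endswith("ing") and len(word) > 5:
            word = word[:-3]
        tokens.append(word)
    return tokens
```

Applied to "The father is holding the baby.", this sketch yields the token list shown above.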
`
`
This approach to tokenization tends to make searches based on it overinclusive of
occurrences in which senses of words are different than the intended sense in the query
text. For example, the sample input text string uses the verb "hold" in the sense that
means "to support or grasp." However, the token "hold" could match uses of the word
"hold" that mean "the cargo area of a ship." This approach to tokenization also tends to
be overinclusive of occurrences in which the words relate to each other differently than
the words in the query text. For example, the sample input text string above, in which
"father" is the subject of the word "held" and "baby" is the object, might match the
sentence "The father and the baby held the toy," in which "baby" is a subject, not an
object. This approach is further underinclusive of occurrences that use a different, but
semantically related word in place of a word of the query text. For example, the input
text string above would not match the text string "The parent is holding the baby."
Given these disadvantages of conventional tokenization, a tokenizer that reflects the
semantic relationships implicit in the tokenized text would have significant utility.
`
SUMMARY OF THE INVENTION
The invention is directed to performing information retrieval using an
improved tokenizer that parses input text to identify logical forms, then expands the
logical forms using hypernyms. The invention, when used in conjunction with
conventional information retrieval index construction and querying, reduces the number
of identified occurrences for which different senses were intended and in which words
bear different relationships to each other, and increases the number of identified
occurrences in which different but semantically related terms are used.
The invention overcomes the problems associated with conventional
tokenization by parsing both indexed and query text to perform lexical, syntactic, and
semantic analysis of this input text. This parsing process produces one or more logical
forms, which identify words that perform primary roles in the query text and their
intended senses, and that further identify the relationship between those words. The
parser preferably produces logical forms that relate the deep subject, verb, and deep
object of the input text. For example, for the input text "The father is holding the
baby," the parser might produce the following logical form:
`
deep subject        verb        deep object
father              hold        baby
`
`The parser further ascribes to these words the particular senses in which they are used in
`
`the input text.
`
Using a digital dictionary or thesaurus (also known as a "linguistic
knowledge base") that identifies, for a particular sense of a word, senses of other words
that are generic terms for the sense of the word ("hypernyms"), the invention changes
the words within the logical forms produced by the parser to their hypernyms to create
additional logical forms having an overall meaning that is hypernymous to the meaning
of these original logical forms. For example, based on indications from the dictionary
that a sense of "parent" is a hypernym of the ascribed sense of "father," a sense of
"touch" is a hypernym of the ascribed sense of "hold," and a sense of "child" and a sense
of "person" are hypernyms of the ascribed sense of "baby," the invention might create
additional logical forms as follows:
`
deep subject        verb        deep object
parent              hold        baby
father              touch       baby
parent              touch       baby
father              hold        child
parent              hold        child
father              touch       child
parent              touch       child
father              hold        person
parent              hold        person
father              touch       person
parent              touch       person
`
`
The invention then transforms all of the generated logical forms into
tokens intelligible to the information retrieval system that compares the tokenized
query to the index, and submits them to the information retrieval system.

BRIEF DESCRIPTION OF THE DRAWINGS
`Figure 1 is an overview data flow diagram depicting the information
`retrieval process.
`Figure 2 is a high-level block diagram of the general-purpose computer
`system upon which the facility preferably operates.
`Figure 3 is an overview flow diagram showing the steps preferably
`performed by the facility in order to construct and access an index semantically
`representing the target documents.
`Figure 4 is a flow diagram showing the tokenize routine used by the
`facility to generate tokens for an input sentence.
`Figure 5 is a logical form diagram showing a sample logical form.
`Figure 6 is an input text diagram showing an input text fragment for
`which the facility would construct the logical form shown in Figure 5.
Figure 7A is a linguistic knowledge base diagram showing sample
hypernym relationships identified by a linguistic knowledge base.
Figure 7B is a linguistic knowledge base diagram showing the selection
of hypernyms of the deep subject of the primary logical form, man (sense 2).
Figure 8 is a linguistic knowledge base diagram showing the selection of
hypernyms of the verb of the primary logical form, kiss (sense 1).
Figures 9 and 10 are linguistic knowledge base diagrams showing the
selection of hypernyms of the deep object of the primary logical form, pig (sense 2).
Figure 11 is a logical form diagram showing the expanded logical form.
`Figure 12 is a chart diagram showing the derivative logical forms created
`by permuting the expanded primary logical form.
`Figure 13 is an index diagram showing sample contents of the index.
`Figure 14 is a logical form diagram showing the logical form preferably
`constructed by the facility for the query "man kissing horse."
`
`
Figure 15 shows the expansion of the primary logical form using
hypernyms.
Figure 16 is a linguistic knowledge base diagram showing the selection
of hypernyms of the deep object of the query logical form, horse (sense 1).
Figure 17 is a partial logical form diagram showing a partial logical form
corresponding to a partial query containing only a deep subject and a verb.
Figure 18 is a partial logical form diagram showing a partial logical form
corresponding to a partial query containing only a verb and a deep object.
`
DETAILED DESCRIPTION OF THE INVENTION
The present invention is directed to performing information retrieval
utilizing semantic representation of text. When used in conjunction with conventional
information retrieval index construction and querying, the invention reduces the number
of identified occurrences for which different senses were intended and in which words
bear different relationships to each other, and increases the number of identified
occurrences in which different but semantically related terms are used.
In a preferred embodiment, the conventional tokenizer shown in Figure 1
is replaced with an improved information retrieval tokenization facility ("the facility")
that parses input text to identify logical forms, then expands the logical forms using
hypernyms. The invention overcomes the problems associated with conventional
tokenization by parsing both indexed and query text to perform lexical, syntactic, and
semantic analysis of this input text. This parsing process produces one or more logical
forms, which identify words that perform primary roles in the query text and their
intended senses, and that further identify the relationship between those words. The
parser preferably produces logical forms that relate the deep subject, verb, and deep
object of the input text. For example, for the input text "The father is holding the
baby," the parser might produce a logical form indicating the deep subject is "father," the
verb is "hold," and the deep object is "baby." Because transforming input text into a
logical form distills the input text to its fundamental meaning by eliminating modifiers
and ignoring differences in tense and voice, transforming input text segments into the
logical forms tends to unify the many different ways that may be used in a natural
`
`
`language to express the same idea. The parser further identifies the particular senses of
`these words in which they are used in the input text.
`Using a digital dictionary or thesaurus (also known as a "linguistic
`knowledge base") that identifies, for a particular sense of a word, senses of other words
`that are generic terms for the sense of the word ("hypernyms"), the invention changes
`the words within the logical forms produced by the parser to their hypernyms to create
`additional logical forms having an overall meaning that is hypernymous to the meaning
of these original logical forms. The invention then transforms all of the generated
logical forms into tokens intelligible to the information retrieval system that compares
the tokenized query to the index, and submits them to the information retrieval system.
`Figure 2 is a high-level block diagram of the general-purpose computer
`system upon which the facility preferably operates. The computer system 200 contains
`a central processing unit (CPU) 210, input/output devices 220, and a computer memory
`(memory) 230. Among the input/output devices is a storage device 221, such as a hard
`disk drive. The input/output devices also include a computer-readable media drive 222,
which can be used to install software products, including the facility, which are provided
`on a computer-readable medium, such as a CD-ROM. The input/output devices further
`include an Internet connection 223 enabling the computer system 200 to communicate
`with other computer systems via the Internet. The computer programs that preferably
`comprise the facility 240 reside in the memory 230 and execute on the CPU 210. The
`facility 240 includes a rule-based parser 241 for parsing input text segments to be
`tokenized in order to produce logical forms. The facility 240 further includes a
`linguistic knowledge base 242 used by the parser to ascribe sense numbers to words in
`the logical form. The facility further uses the linguistic knowledge base to identify
`hypernyms of the words in the generated logical forms. The memory 230 preferably
also contains an index 250 for mapping from tokens generated from the target
`documents to locations in the target documents. The memory 230 also contains an
`information retrieval engine ("IR engine") 260 for storing tokens generated from the
`target documents in the index 250, and for identifying in the index tokens that match
`tokens generated from queries. While the facility is preferably implemented on a
`
`
`computer system configured as described above, those skilled in the art will recognize
`that it may also be implemented on computer systems having different configurations.
`Figure 3 is an overview flow diagram showing the steps preferably
`performed by the facility in order to construct and access an index semantically
`representing the target documents. Briefly, the facility first semantically indexes the
`target documents by converting each sentence or sentence fragment of the target
document into a number of tokens representing an expanded logical form portraying the
`relationship between the important words in the sentence, including hypernyms having
`similar meanings. The facility stores these "semantic tokens" in the index, along with
`the location in the target documents where the sentence occurs. After all of the target
`documents have been indexed, the facility is able to process information retrieval
`queries against the index. For each such query received, the facility tokenizes the text
`of the query in the same way it tokenized sentences from the target documents -- by
`converting the sentence into semantic tokens together representing an expanded logical
form for the query text. The facility then compares these semantic tokens to the
`semantic tokens stored in the index to identify locations in the target documents for
`which these semantic tokens have been stored, and ranks the target documents
`containing these semantic tokens in the order of their relevance to the query. The
facility may preferably update the index to include semantic tokens for new target
documents at any time.
`Referring to Figure 3, in steps 301-304, the facility loops through each
`sentence in the target documents. In step 302, the facility invokes a routine to tokenize
`the sentence as shown in Figure 4.
Figure 4 is a flow diagram showing the tokenize routine used by the
facility to generate tokens for an input sentence or other input text segment. In step
401, the facility constructs a primary logical form from the input text segment. As
discussed above, a logical form represents the fundamental meaning of a sentence or
sentence fragment. The logical forms are produced by applying the parser 241
(Figure 2) to subject the input text segment to a syntactic and semantic parsing process.
For a detailed discussion of the construction of logical forms representing an input text
`
`
`string, refer to U.S. Patent Application No. 08/674,610, which is hereby incorporated by
`reference.
`
`The logical form used by the facility preferably isolates the principal
`verb of the sentence, the noun that is the real subject of the verb ("deep subject") and
`the noun that is the real object of the verb ("deep object"). Figure 5 is a logical form
`diagram showing a sample primary logical form. The logical form has three elements:
`a deep subject element 510, a verb element 520, and a deep object element 530. It can
`be seen that the deep subject of the logical form is sense 2 of the word "man." The
`sense number indicates, for words having more than one sense, the particular sense
`ascribed to the word by the parser as defined by the linguistic knowledge base used by
`the parser. For example, the word "man" could have a first sense meaning to supply
`with people and a second sense meaning adult male person. The verb of the logical
`form is a first sense of the word "kiss." Finally, the deep object is a second sense of the
`word "pig." An abbreviated version of this logical form is an ordered triple 550 having
`as its first element the deep subject, as its second element the verb, and as its third
`element the deep object:
`
`
`(man, kiss, pig)
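The ordered-triple representation lends itself to a simple data structure. The names below are illustrative, not the facility's own interfaces:

```python
from collections import namedtuple

# Illustrative representation of a logical form: each element carries
# the word and the sense number ascribed by the parser.
WordSense = namedtuple("WordSense", ["word", "sense"])
LogicalForm = namedtuple("LogicalForm", ["deep_subject", "verb", "deep_object"])

primary = LogicalForm(
    deep_subject=WordSense("man", 2),
    verb=WordSense("kiss", 1),
    deep_object=WordSense("pig", 2),
)

# The abbreviated ordered triple drops the sense numbers.
triple = tuple(element.word for element in primary)
```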
`
`
The logical form shown in Figure 5 characterizes a number of different
sentences and sentence fragments. For example, Figure 6 is an input text diagram
showing an input text segment for which the facility would construct the logical form
shown in Figure 5. Figure 6 shows the input text sentence fragment "man kissing a
pig." It can be seen that this phrase occurs at word number 150 of document 5,
occupying word positions 150, 151, 152, and 153. When the facility is tokenizing this
input text fragment, it generates the logical form shown in Figure 5. The facility would
also generate the logical form shown in Figure 5 for the following input text segments:
`
`
`The pig was kissed by an unusual man.
`The man will kiss the largest pig.
`Many pigs have been kissed by that man.
`
As discussed above, because transforming input text into a logical form distills the input
`text to its fundamental meaning by eliminating modifiers and ignoring differences in
`tense and voice, transforming input text segments into the logical forms tends to unify
`the many different ways that may be used in a natural language to express the same
`idea.
`
Returning to Figure 4, after the facility has constructed the primary
logical form from the input text, such as the logical form shown in Figure 5, the facility
continues in step 402 to expand this primary logical form using hypernyms. After step
402, the tokenize routine returns.
As mentioned above, a hypernym is a genus term that has an "is a"
relationship with a particular word. For instance, the word "vehicle" is a hypernym of
the word "automobile." The facility preferably uses a linguistic knowledge base to
identify hypernyms of the words in the primary logical form. Such a linguistic
knowledge base typically contains semantic links identifying hypernyms of a word.
Figure 7A is a linguistic knowledge base diagram showing sample
hypernym relationships identified by a linguistic knowledge base. It should be noted
that Figure 7A, like the linguistic knowledge base diagrams that follow, has been
simplified to facilitate this discussion, and omits information commonly found in
linguistic knowledge bases that is not directly relevant to the present discussion. Each
ascending arrow in Figure 7A connects a word to its hypernym. For example, there is
an arrow connecting the word man (sense 2) 711 to the word person (sense 1) 714,
indicating that person (sense 1) is a hypernym of man (sense 2). Conversely, man (sense
2) is said to be a "hyponym" of person (sense 1).
In identifying hypernyms with which to expand the primary logical form,
the facility selects one or more hypernyms for each word of the primary logical form
based upon the "coherency" of the hypernyms' hyponyms. By selecting hypernyms in
`
`
this manner, the facility generalizes the meaning of the logical form beyond the
meaning of the input text segment, but by a controlled amount. For a particular word of
a primary logical form, the facility first selects the immediate hypernym of the word of
the primary logical form. For example, with reference to Figure 7A, starting with man
(sense 2) 711, which occurs in the primary logical form, the facility selects its
hypernym, person (sense 1) 714. The facility next bases its determination of whether to
also select the hypernym of person (sense 1) 714, animal (sense 3) 715, on whether
person (sense 1) 714 has a coherent hyponym set with respect to the starting word man
(sense 2) 711. Person (sense 1) 714 has a coherent hyponym set with respect to man
(sense 2) 711 if a large number of hyponyms of all senses of the word person other than
the starting word man (sense 2) 711 bear at least a threshold level of similarity to the
starting word man (sense 2) 711.
In order to determine the level of similarity between the hyponyms of the
different senses of the hypernym, the facility preferably consults the linguistic
knowledge base to obtain similarity weights indicating the degree of similarity between
these word senses. Figure 7B is a linguistic knowledge base diagram showing
similarity weights between man (sense 2) and other hyponyms of person (sense 1) and
person (sense 5). The diagram shows that the similarity weight between man (sense 2)
and woman (sense 1) is ".0075"; between man (sense 2) and child (sense 1) is ".0029";
between man (sense 2) and villain (sense 1) is ".0003"; and between man (sense 2) and
lead (sense 7) is ".0002". These similarity weights are preferably calculated by the
linguistic knowledge base based on a network of semantic relations maintained by the
linguistic knowledge base between the word sense pairs. For a detailed discussion of
calculating similarity weights between word sense pairs using a linguistic knowledge
base, refer to U.S. Patent Application No. _____ (patent attorney's docket no.
661005.524), entitled "DETERMINING SIMILARITY BETWEEN WORDS," which
is hereby incorporated by reference.
`In order to determine whether the set of hyponyms is coherent based on
`these similarity weights, the facility determines whether a threshold number of the
`similarity weights exceed a threshold similarity weight. While the preferred threshold
`
`
percentage is 90%, the threshold percentage may preferably be adjusted in order to
optimize the performance of the facility. The similarity weight threshold may also be
configured to optimize the performance of the facility. The threshold similarity weight
is preferably coordinated with the overall distribution of similarity weights provided by
the linguistic knowledge base. Here, the use of a threshold of ".0015" is shown. The
facility therefore determines whether at least 90% of the similarity weights between the
starting word and the other hyponyms of all of the senses of the hypernym are at or
above the ".0015" threshold similarity weight. It can be seen from Figure 7B that this
condition is not satisfied by the hyponyms of person with respect to man (sense 2):
while the similarity weights between man (sense 2) and woman (sense 1) and between
man (sense 2) and child (sense 1) are greater than ".0015", the similarity weights
between man (sense 2) and villain (sense 1) and between man (sense 2) and lead (sense
7) are less than ".0015". The facility therefore does not select the further hypernym
animal (sense 3) 715, or any hypernyms of animal (sense 3). As a result, only the
hypernym person (sense 1) 714 is selected to expand the primary logical form.
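The selection rule described above (always take the immediate hypernym; climb further only while the hyponym set remains coherent) can be sketched as follows. The 90% and ".0015" thresholds mirror the example in the text, but the dictionary-based knowledge-base shape is an assumption for illustration, not the patent's actual interface:

```python
def select_hypernyms(start, kb, weight_floor=0.0015, coherence_pct=0.90):
    """Walk up the hypernym chain from `start`. The immediate hypernym
    is always selected; the walk continues past a hypernym only if its
    hyponym set is coherent, i.e. at least `coherence_pct` of the
    similarity weights between the starting word and the hypernym's
    other hyponyms reach `weight_floor`."""
    selected = []
    current = kb["hypernym_of"].get(start)
    while current is not None:
        selected.append(current)
        weights = [kb["similarity"][(start, hyponym)]
                   for hyponym in kb["other_hyponyms"][current]]
        coherent = (sum(1 for w in weights if w >= weight_floor)
                    >= coherence_pct * len(weights))
        if not coherent:
            break  # do not select any hypernym above this one
        current = kb["hypernym_of"].get(current)
    return selected

# The man (sense 2) example from Figures 7A and 7B.
kb = {
    "hypernym_of": {"man(2)": "person(1)", "person(1)": "animal(3)"},
    "other_hyponyms": {"person(1)": ["woman(1)", "child(1)",
                                     "villain(1)", "lead(7)"]},
    "similarity": {("man(2)", "woman(1)"): 0.0075,
                   ("man(2)", "child(1)"): 0.0029,
                   ("man(2)", "villain(1)"): 0.0003,
                   ("man(2)", "lead(7)"): 0.0002},
}
```

With this data, only two of the four similarity weights reach ".0015", so the walk stops at person (sense 1) and animal (sense 3) is not selected, matching the discussion above.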
To expand a primary logical form, the facility also selects hypernyms of
the verb and deep object of the primary logical form. Figure 8 is a linguistic knowledge
base diagram showing the selection of hypernyms of the verb of the primary logical
form, kiss (sense 1). It can be seen from the diagram that touch (sense 2) is the
hypernym of kiss (sense 1). The diagram also shows the similarity weights between
kiss (sense 1) and the other hyponyms of all of the senses of touch. The facility first
selects the immediate hypernym of the verb of the primary logical form kiss (sense 1),
touch (sense 2). To determine whether to select the hypernym of touch (sense 2),
interact (sense 9), the facility determines how many similarity weights between kiss
(sense 1) and the other hyponyms of all of the senses of touch are at least as large as the
threshold similarity weight. Because only two of these four similarity weights are at
least as large as the ".0015" threshold similarity weight, the facility does not select the
hypernym of touch (sense 2), interact (sense 9).
`
Figures 9 and 10 are linguistic knowledge base diagrams showing the
selection of hypernyms of the deep object of the primary logical form, pig (sense 2).
`
`
`
`
`
It can be seen from Figure 9 that the facility selects the hypernym swine (sense 1) of pig
(sense 2) to expand the primary logical form, as well as the hypernym animal (sense 3)
of swine (sense 1), as more than 90% (in fact, 100%) of the hyponyms of the only
sense of swine have similarity weights at or above the ".0015" threshold similarity
weight. It can be seen from Figure 10 that the facility does not continue to select the
hypernym organism (sense 1) of animal (sense 3), as fewer than 90% (actually 25%) of
the hyponyms of senses of animal have similarity weights at or above the ".0015"
threshold similarity weight.
Figure 11 is a logical form diagram showing the expanded logical form.
It can be seen from Figure 11 that the deep subject element 1110 of the expanded
logical form contains the hypernym person (sense 1) 1112 in addition to the word man
(sense 2) 1111. It can be seen that the verb element 1120 contains the hypernym touch
(sense 2) 1122 as well as the word kiss (sense 1) 1121. Further, it can be seen that the
deep object element 1130 of the expanded logical form contains the hypernyms swine
(sense 1) and animal (sense 3) 1132 in addition to the word pig (sense 2) 1131.
By permuting, in each element of the expanded logical form, the
hypernyms with the original words, the facility can create a reasonably large number of
derivative logical forms that are reasonably close in meaning to the primary logical
form. Figure 12 is a chart diagram showing the derivative logical forms created by
permuting the expanded primary logical form. It can be seen from Figure 12 that this
permutation creates eleven derivative logical forms that each characterize the meaning
of the input text in a reasonably accurate way. For example, the derivative logical form

(person, touch, pig)

shown in Figure 12 is very close in meaning to the sentence fragment

man kissing a pig
`
`
The expanded logical form shown in Figure 11 represents the primary logical form plus
these eleven derivative logical forms, which are expressed more compactly as expanded
logical form 1200:

((man OR person), (kiss OR touch), (pig OR swine OR animal))
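The eleven derivative forms are simply the cross product of the expanded elements minus the primary form, which can be verified directly:

```python
from itertools import product

# Expanded elements of the logical form from Figure 11
# (sense numbers omitted for brevity).
deep_subjects = ["man", "person"]
verbs = ["kiss", "touch"]
deep_objects = ["pig", "swine", "animal"]

all_forms = list(product(deep_subjects, verbs, deep_objects))
# Removing the primary form leaves the derivative logical forms:
# 2 * 2 * 3 = 12 forms in total, the primary plus eleven derivatives.
derivatives = [form for form in all_forms if form != ("man", "kiss", "pig")]
```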
`
The facility generates logical tokens from this expanded logical form in a
manner that allows them to be processed by a conventional information retrieval engine.
First, the facility appends a reserved character to each word in the expanded logical
form that identifies whether the word occurred in the input text segment as a deep
subject, verb, or deep object. This ensures that, when the word "man" occurs in the
expanded logical form for a query input text segment as a deep subject, it will not match
the word "man" stored in the index as part of an expanded logical form in which it was
the verb. A sample mapping of reserved characters to logical form elements is as
follows:

logical form element        identifying character
deep subject                _
verb                        ^
deep object                 #

Using this sample mapping of reserved characters, tokens generated for the logical form
"(man, kiss, pig)" would include "man_", "kiss^", and "pig#".
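Token generation with this sample reserved-character mapping might be sketched as follows; the function and dictionary names are illustrative:

```python
# Sample mapping of logical form elements to reserved characters.
ROLE_MARKS = {"deep_subject": "_", "verb": "^", "deep_object": "#"}

def tokens_for(deep_subject, verb, deep_object):
    """Append each role's reserved character so that, e.g., "man" as a
    deep subject cannot match "man" indexed as a verb."""
    return [deep_subject + ROLE_MARKS["deep_subject"],
            verb + ROLE_MARKS["verb"],
            deep_object + ROLE_MARKS["deep_object"]]
```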
Indices generated by conventional information retrieval engines
commonly map each token to the particular locations in the target documents at which
the token occurs. Conventional information retrieval engines may, for example,
represent such target document locations using a document number, identifying the
target document containing the occurrence of the token, and a word number, identifying
the position of the occurrence of the token in that target document. Such target
document locations allow a conventional information retrieval engine to identify words
`
`
that occur together in a target document in response to a query using a "PHRASE"
operator, which requires the words that it joins to be adjacent in the target document.
For example, the query "red PHRASE bicycle" would match occurrences of "red" at
document 5, word 611 and "bicycle" at document 5, word 612, but would not match
occurrences of "red" at document 7, word 762 and "bicycle" at document 7, word 202.
Storing target document locations in an index further allows conventional information
retrieval engines to identify, in response to a query, the points at which queried tokens
occur in the target documents.
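Adjacency under the PHRASE operator reduces to checking consecutive word numbers within the same document; a minimal sketch using (document, word) pairs, with the occurrences from the example above:

```python
def phrase_match(left_locations, right_locations):
    """PHRASE operator: the right word must occur at the word position
    immediately following the left word in the same document.
    Locations are (document number, word number) pairs."""
    return sorted((doc, word) for (doc, word) in left_locations
                  if (doc, word + 1) in right_locations)

# "red PHRASE bicycle": only the document-5 pair is adjacent.
red = {(5, 611), (7, 762)}
bicycle = {(5, 612), (7, 202)}
```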
`
`For expanded logical forms from a target document input text segment,
`the facility preferably similarly assigns artificial target document locations to each
`token, even though the tokens of the expanded logical form do not actually occur in the
`target document at these locations. Assigning these target document locations both
`
`
`(A) enables conventional search engines to identify combinations of semantic tokens
`corresponding to a single primary or derivative logical form using the PHRASE
`operator, and (B) enables the facility to relate the assigned locations to the actual
`location of the input text fragment in the target document. The facility therefore assigns
`locations to semantic tokens as follows:
`
logical form element        location
deep subject                (location of 1st word of input text segment)
verb                        (location of 1st word of input text segment) + 1
deep object                 (location of 1st word of input text segment) + 2
`
The facility therefore would assign target document locations as follows for the tokens
of the expanded logical form for "(man, kiss, pig)", derived from a sentence beginning
at document 5, word 150: "man_" and "person_" -- document 5, word 150; "kiss^" and
"touch^" -- document 5, word 151; and "pig#", "swine#", and "animal#" -- document 5,
word 152.
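This location-assignment scheme can be sketched as a small helper; the names are illustrative:

```python
def assign_locations(tokens_by_role, document, first_word):
    """Assign artificial target-document locations: deep-subject tokens
    at the segment's first word, verb tokens at +1, deep-object tokens
    at +2, so that one PHRASE query can span a whole logical form."""
    offsets = {"deep_subject": 0, "verb": 1, "deep_object": 2}
    return {token: (document, first_word + offsets[role])
            for role, tokens in tokens_by_role.items()
            for token in tokens}

# The "(man, kiss, pig)" example, beginning at document 5, word 150.
locations = assign_locations(
    {"deep_subject": ["man_", "person_"],
     "verb": ["kiss^", "touch^"],
     "deep_object": ["pig#", "swine#", "animal#"]},
    document=5, first_word=150)
```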
`
`
`Returning to Figure 3, in step 303, the facility stores the tokens created
`by the tokenize routine in the index with locations at which they occur. Figure 13 is an
`index diagram showing sample contents of the index. The index maps from each token
`to the identity of the document and location in the docw