`Turtle
`
`[111
`[45]
`
`llllll|||||l|ll||lllllllllllllllllllllllllllllllllllllllllllllllllllllllll
`
`US005265065A
`Patent Number:
`Date of Patent:
`
`5,265,065
`Nov. 23, 1993
`
`5,123,103 6/1992 Ohtaki et al. ..................... ., 395/600
`
`OTHER PUBLICATIONS
`
`'
`
`Turtle et al., “Evaluation of an lnterence Networ
`k-Based Retrieval Model", Transactions on Information
`Systems, Association for Computer Machinery, vol. 9,
`No. 3, pp. 187-223 (Jul. 1991).
`Croft et a1, "Interactive Retrieval of Complex Docu
`ments“, Information Processing and Management, vol.
`26, No. 5, pp. 593—613 (1990).
`
`(List continued on next page.)
`
`Primary Examiner-Thomas C. Lee
`Assistant Examiner—-Wayne Amsbury
`Attorney, Agent, or Firm-—-Kinney & Lange
`[57]
`ABSTRACT
`A computer implemented process for creating a search
`query for an information retrieval system in which a
`database is provided containing a plurality of‘ stopwords
`and phrases. A natural language input query de?nes the
`composition of the test of documents to be identi?ed.
`Each word of the natural language input query is com
`pared to the database in order to remove stopwords
`from the query. The remaining words of the input query
`are stemmed to their basic roots, and the sequence of
`stemmed words in the list is compared to phrases in the
`database to identify phrases in the search query. The
`phrases are substituted for the sequence of stemmed
`words from the list so that the remaining elements,
`namely the substituted phrases and unsubstituted
`stemmed words, form the search query. The completed
`search query elements are query nodes of a query net
`work used to match representation nodes of a document
`network of an inference network. The database includes
`as options a topic and key database for ?nding numeri
`cal keys, and a synonym database for ?nding synonyms,
`both of which are employed in the query as query
`nodes.
`
`46 Claims, 8 Drawing Sheets
`EXAMPLE
`
`1541
`
`175]
`[73]
`
`[21]
`[22]
`[51]
`[52]
`
`[53]
`
`[56]
`
`METHOD AND APPARATUS FOR
`INFORMATION RETRIEVAL FROM A
`DATABASE BY REPLACING DOMAIN
`SPECIFIC STEMMED PHASES IN A
`NATURAL LANGUAGE TO CREATE A
`SEARCH QUERY
`Inventor: Howard R. Turtle, Woodbury, Minn,
`Assignee: West Publishing Company, Eagan,
`Minn.
`Appl. No.: 773,101
`Filed:
`Oct. 8, 1991
`
`Int. Cl.5 ....................... .. G06F 15/40; G06F 7/10
`US. Cl. ........................... .. 395/600; 364/D1G. 1;
`364/2S2.1; 364/283.3; 1164/2228]; 364/225.3;
`364/226.4; 364/225.4; 364/226.6; 364/260.4
`Field of Search ............. .. 364/200, 300, 419, 513;
`395/600
`
`References Cited
`U.S, PATENT DOCUMENTS
`4,241,402 12/1980 Mayper, Jr. et a1.
`4,270,182 5/1981 Asija .................. ..
`4,358,824 11/1982 Glickman et a1.
`4,384,329 5/1983 Rosenbaum et a1.
`
`364/200
`364/900
`.. 364/200
`364/419
`
`r . . ., 364/900
`4,471,459 9/1984 Dickinson et a1, . . . r . .
`364/900
`4,499,553 2/1985 Dickinson et a], ,.
`4,554,631 11/1985 Reddington ...................... ., 364/300
`4,580,218 4/1986 Raye .................................. .. 364/300
`4,650,848 6/1987 Schramm ....... .,
`364/513
`4,688,195 8/1987 Thompson et al
`364/300
`4,706,212 11/1987 Toma ............. ..
`.. 364/900
`4,787,035 11/1988 Bourne .............................. ., 364/300
`4,823,306 4/1989 Barbic et al ...................... .. 364/900
`4,839,853 6/1989 Deerwester et a1.
`364/900
`
`4,862,408 8/1989 Zamora . . . . . . . . r . , . . r .
`
`. . . 1. 364/900
`
`4,868,750 9/1989 Kucera et a1.
`4,914,590 4/1990 Loatman et a1. .
`4,918,588 4/1990 Barrett et a1,
`4,931,935 6/1990 Ohira et a1. ,,,, ,.
`
`364/419
`.. 364/419
`364/200
`364/419
`
`4,972,349 11/1990 Kleittberger . . , . . . . . .
`
`. . . .. 364/900
`
`4,974,191 11/1990 Amirghodsi et a1.
`
`364/900
`
`. , . ,. 364/900
`4,991,087 2/1991 Burkowski et a1, . , . r r
`364/419
`5,099,425 3/1992 Kanno: Yuji et a1. ..
`395/600
`5,109,509 4/1992 Katayama et al. N
`5,117,349 5/1992 Tir?ing et a1. .................... .. 395/600
`
`50
`
`STEPS
`
`Enter Natural Language Query l
`
`=
`
`Purse words unto L151
`
`Stern Words
`
`-
`
`What is the liobrtit at the United
`States under the Fe rat ‘(on
`CtotmsAct 101 10101025 untamed
`to
`an ind
`t
`es of
`I
`en
`du contract
`actor worhin
`U V
`d 5
`ct:
`2 rule
`tote
`an
`nc n
`iv
`\f i
`Mummers ’
`I
`5°
`Qntroctar
`Contract
`Go:
`em
`
`i
`.
`,
`Ctutm
`Lrohrt
`w 4
`State
`Injur
`éwtu
`Siam“
`o e
`lnoepeynd
`
`or
`
`42
`
`“
`
`56
`
`\I
`
`Sutmuute Phrases
`
`l
`
`-
`
`LiQDtI
`Phrase itgnlt State}
`Phrase eon Tun Ctoirn Act)
`
`m “'1
`1
`1
`It
`run I
`We“ Met» den Con rector!
`Contract
`Phrase (Aqenc Umt Stave!
`vll‘l'l
`
`Facebook, Inc. - EXHIBIT 1237
`
`001
`
`
`
`5,265,065
`Page 2
`
`OTHER PUBLICATIONS
`
`Haynes, "Designing a System for the Specialized User:
`A Case Study", Proceedings-I985 National OnIine
`MeetingLearned Information lnc., pp. 205-213, (Apr.
`30, 1985).
`Croft et al, "A Retrieval Model Incorporating Hyper
`text Links", Hyperlex ‘89 Proceedings, Association for
`Computer Machinery, pp. 213-224 (Nov. 1989).
`Turtle et a1, “Inference Networks for Document Re
`trieval", Coins Technical Report 90-07, University of
`Massachusetts (Mar. 1990).
`
`Turtle et al, “Inference Network for Document Re
`trieval“, Sigir 90, Association for Computing Machin
`ery, pp. l~24 (Sep. 1990).
`Turtle, “Inference Network for Document Retrieval",
`Ph.D. Dissertation, Coins Technical Report 90-92, Uni
`versity of Massachusetts (Oct. 1990)v
`Turtle et a], “Efficient Probabilistic Inference for Text
`Retrieval”, Riao ’91 Conference Proceedings, Recherche
`d‘lnformation Assistée par Ordinateur, Universitat
`Automa de Barcelona, Spain, pp. 644-661 (Apr. 1991).
`Porter, “An Algorithm for Suf?x Skipping“, Program,
`vol, 14, pp. 130-137 (1980).
`
`Facebook, Inc. - EXHIBIT 1237
`
`002
`
`
`
`US. Patent
`
`Nov. 23, 1993
`
`Sheet 1 of 8
`
`5,265,065
`
`d1
`
`d|-1
`
`d|
`
`GD
`
`U2
`
`r?
`
`r 3
`
`,
`
`DOCUMENT '
`
`NETWORK
`10/
`
`QUERY
`
`NETWORK
`
`12/
`
`01
`
`d1
`
`r1
`
`M
`
`in
`
`‘
`
`, d
`
`___ ____—_—_
`
`r
`c3
`
`cm
`
`q2
`
`d|1
`
`q2
`
`dl
`
`rn
`
`'3
`
`c2
`
`q1
`
`d2
`
`r2
`
`Q1
`
`Facebook, Inc. - EXHIBIT 1237
`
`003
`
`
`
`US. Patent
`
`Nov. 23, 1993
`
`Sheet 2 of 8
`
`5,265,065
`
`52 ~65: m .mvm
`
`No}
`
`39:5: >9. 9.. 9.00..
`
`
`3323 >3. .m 052 5
`
`boxcobnc a2 30:5 >
`SEE? >9.
`
`.312...
`
`mono: >33. 2 23:5:
`
`>9. v9.5.7: 92 23
`
`\
`
`3 \
`9”
`
`
`
`£052 550.28
`
`a,
`
`$5005.
`.zwogz
`mwsmiOu
`
`20m
`
`O:
`
`Facebook, Inc. - EXHIBIT 1237
`
`004
`
`
`
`US. Patent
`
`Nov. 23, 1993
`
`Sheet 3 of 8
`
`5,265,065
`
`O¢
`
`NV
`
`Eu
`
`
`
`
`
`1299.50E3532:$951
`
`«noEEM
`
`{25
`
`
`
`3.2m:59.50539.5.;
`
`Em>oo
`
`80:50
`
`539.28E220
`
`€038<
`
`9.3qu
`
`w.._&5_<xw
`
`3353:5...2:o.35oco5:3
`3:52:B2:52.2:r.2;
`E35535sowe$0335»33:853.313.:.8.o<mE_o_u
`
`
`68:80Etc::28;38228
`
`
`
`30:505?.22m
`
`:8..083...9:82:.$35
`Co<E20:8..63...32:1
`
`
`5260£295to._.
`
`
`ocwo<£23mBum“.
`385:5335.”.
`.8202SEN mam—Fm
`
`>52032.9.3
`7.32.:383organ.
`
`
`
`m.coEEm>ow
`
`.32..
`
`5035
`5.8.
`
`
`
`33335u>oEum
`
`
`
`3.8.5.2::.85
`
`005
`
`Facebook, Inc. - EXHIBIT 1237
`
`Facebook, Inc. - EXHIBIT 1237
`
`005
`
`
`
`
`
`
`
`
`
`US. Patent
`
`Nov. 23, 1993
`
`Sheet 4 of 8
`
`5,265,065
`
`dm
`
`gig
`
`2
`
`‘1
`
`dm
`
`Q
`
`gig.
`
`dm
`
`0
`
`dm
`
`r‘
`
`Q
`
`r2
`
`'3 '
`
`Facebook, Inc. - EXHIBIT 1237
`
`006
`
`
`
`US. Patent
`
`Nov. 23, 1993
`
`Sheet 5 of 8
`
`5,265,065
`
`R. .mrw
`
`32.3 :2 2022M
`
`5.62.3 62000
`
`
`.3 8:: 22.5.85 mo ESP
`
`3:3 :33“. of 05:29.6
`
`3: on 3.2 $5238 :0 2
`32% :3 2: 52:3
`
`29.6 >5 E 022 00 3:3 E=E§0E 2t 3:3 9: mo .53
`
`
`8... 3% BE .5833 500 .5...
`
`.36 $05“. :3 2: 52:8
`
`
`umoEa :ztoa of .8
`
`mo»
`
`OF
`
`Facebook, Inc. - EXHIBIT 1237
`
`007
`
`
`
`US. Patent
`
`Nov. 23, 1993
`
`Sheet 6 of 8
`
`5,265,065
`
`INPUT 'NQUERY
`NATURAL LANGUAGE
`
`{82
`STOPWORD
`DATABASE
`
`PARSE
`'
`COMPARE
`STOPWORDS
`
`so
`‘1
`SYNONYM
`DATABASE
`
`YES
`
`[00
`
`COMPARE
`PHRASES
`
`|02
`
`I08
`
`TOPIC
`a
`KEY
`SUBROUTINE
`
`B4
`
`86
`
`as
`
`92
`
`94
`
`96 /
`PHRASE
`DATABASE
`
`Em To
`DOCUMENT
`NETWORK
`
`DETERMINE
`2 FIRS'I
`
`“LONG
`
`QUERY NETWORK
`
`Facebook, Inc. - EXHIBIT 1237
`
`008
`
`
`
`US. Patent
`
`Nov.
`23, 1993
`
`Sheet 7 of 8
`
`5,265,065
`
`FROM
`PARSER
`
`MS
`COUNT TER
`SET 2
`
`SET '
`
`ADD
`
`us /
`COMPARE TERM TO
`DATABASE, FIND
`
`I20 /
`
`CALCULATE
`O.4+O.6 idfi ' tfij
`/ I22
`ACCUMULATE
`
`I24
`
`NO
`
`/l26
`/I28
`/ I30
`
`YES
`DIVIDE BY
`2
`6
`STORE
`i
`REPEAT FOR
`PICS
`OTHER TO
`/ I32
`
`U
`
`RAN K
`Pn.
`SELECT TO
`
`I34
`
`THRESHOLD
`L
`
`H6
`5
`TOPIC AND KEY
`DATABASE
`
`I
`
`TO FIG 8
`
`I08
`
`TOPIC a KEY
`SUBROUTINE
`gig. 9
`
`Facebook, Inc. - EXHIBIT 1237
`
`009
`
`
`
`US. Patent
`
`Nov. 23, 1993
`
`Sheet 8 of 8
`
`5,265,065
`
`I46
`7
`DOCUMENT
`DATABASE
`
`TEXT
`NODES j
`
`DOCUMENT
`NETWORK
`gig. 10
`
`FROM QUERY
`NETWORK
`
`COUNT TERMS
`SET 2
`
`SET i=0
`
`ADD
`
`‘
`
`COM PAR
`DATABAS
`idfi ,
`if;
`
`TO
`
`150 /
`
`CALCULATE
`o.4+o.e - mi . tfn
`'1
`T
`ACCUMU LATE
`
`NO
`
`YES
`
`mvmE BY
`x+ z
`l
`STORE
`1
`REPEAT FOR
`OTHER DOCUMENTS
`
`I60
`
`RANK
`
`PRINgOUT
`
`DISPLAY
`
`Facebook, Inc. - EXHIBIT 1237
`
`010
`
`
`
`1
`
`METHOD AND APPARATUS FOR INFORMATION
`RETRIEVAL FROM A DATABASE BY REPLACING
`DOMAIN SPECIFIC STEMMED PHASES IN A
`NATURAL LANGUAGE TO CREATE A SEARCH
`QUERY
`
`20
`
`BACKGROUND OF THE INVENTION
`This invention relates to information retrieval, and
`particularly to document retrieval from a computer
`database. More particularly, the invention concerns a
`method and apparatus for creating a search query in
`natural language for use in an inference network for
`document identi?cation and retrieval purposes.
`Presently, document retrieval is most commonly per‘
`formed through use of Boolean search queries to search
`the texts of documents in the database. These retrieval
`systems specify strategies for evaluating documents
`with respect to a given query by logically comparing
`search queries to document texts. One of the problems
`associated with text searching is that for a single natural
`language description of an information need, different
`Boolean researchers will formulate different Boolean
`queries to represent that need. Because the queries are
`different, document ranking will be different for each
`search, thereby resulting in different documents being
`retrieved.
`More recently, hypertext databases have been devel
`oped which emphasize ?exible organizations of multi
`media "nodes" through connections made with user
`speci?ed links and interfaces which facilitate browsing
`in the network. Early networks employed query-based
`retrieval strategies to form a ranked list of candidate
`"starting points“ for hypertext browsing. Some systems
`employed feedback during browsing to modify the ini
`tial query and to locate additional starting points. Net
`work structures employing hypertext databases have
`used automatically and manually generated links be
`tween documents and the concepts or terms that are
`used to represent their content. For example, “docu
`ment clustering" employs links between documents that
`are automatically generated by comparing similarities
`of content. Another technique is “citations" wherein
`documents are linked by comparing similar citations in
`them. “Term clustering" and “manually-generated the
`sauri" provide links between terms, but these have not
`been altogether suitable for document searching on a
`reliable basis.
`Deductive databases have been developed employing
`facts about the nodes, and current links between the
`nodes. A simple query in a deductive database, where N
`is the only free variable in formula W, is of the form
`{N1W(N)}, which is read as “Retrieve all nodes N such
`that W(N) can be shown to be true in the current data
`base." However, deductive databases have not been
`successful in information retrieval. Particularly, uncer
`tainty associated with natural language affects the de
`ductive database, including the facts, the rules, and the
`query. For example, a speci?c concept may not be an
`accurate description of a particular node; some rules
`may be more certain than others; and some parts of a
`query may be more important than others. For a more
`complete description of deductive databases, see Croft
`et al. “A Retrieval Model for Incorporating Hypertext
`Links", Hypertext ‘89 Proceedings. pp 213-224, No
`vember l989 (Association for Computing Machinery),
`incorporated herein by reference.
`
`5,265,065
`2
`A Bayesian network is one which employs nodes to
`represent the document and the query. If a proposition
`represented by a parent node directly implies the propo
`sition represented by a child node, a implication line is
`drawn between the two nodes. If-then rules of Bayesian
`networks are interpreted as conditional probabilities.
`Thus, a rule A-+B is interpreted as a probability
`P(B|A), and the line connecting A with B is logically
`labeled with a matrix that specifies P(B l A) for all possi
`ble combinations of values of the two nodes. The set of
`matrices pointing to a node characterizes the depen
`dence relationship between that node and the nodes
`representing propositions naming it as a consequence.
`For a given set of prior probabilities for roots of the
`network, the compiled network is used to compute the
`probability or degree of belief associated with the re
`maining nodes.
`An inference network is one which is based on a
`plausible or non-deductive inference. One such network
`employs a Bayesian network, described by Turtle et al.
`in "Inference Networks for Document Retrieval",
`SIGIR 90, pp 1-24 September 1990 (Association for
`Computing Machinery), incorporated herein by refer
`ence. The Bayesian inference network described in the
`Turtle et al. article comprises a document network and
`a query network. The document network represents the
`document collection and employs document nodes, text
`representation nodes and content representation nodes.
`A document node corresponds to abstract documents
`rather than their speci?c representations, whereas a text
`representation node corresponds to a specific text repre
`sentation of the document. A set of content representa’
`tion nodes corresponds to a single representation tech
`nique which has been applied to the documents of the
`database.
`The query network of the Bayesian inference net
`work described in the Turtle et al. article employs an
`information node identifying the information need, and
`a plurality of concept nodes corresponding to the con‘
`cepts that express that information need. A plurality of
`intermediate query nodes may also be employed where
`multiple queries are used to express the information
`requirement.
`The Bayesian inference network described in the
`Turtle et al. article has been quite successful for general
`purpose databases. However, it has been difficult to
`formulate the query network to develop nodes which
`conform to the document network nodes. More particu
`larly, the inference network described in the Turtle et
`al. article did not use domain~speci?c knowledge bases
`to recognize phrases, such as specialized, professional
`terms, like jargon traditionally associated with specific
`professions, such as law or medicine.
`For a more general discussion concerning inference
`networks, reference may be made to Probabilistic Rea
`soning in Intelligent Systems: Network: of Plausible 1n fer
`ence by J. Pearl, published by Morgan Kaufmann Pub
`lishers, Inc., San Mateo, Calif, 1988, and to Probabilistic
`Reasoning in Expert Systems by R. E. Neapolitan, John
`Wiley 8L Sons, New York, N.Y., 1990.
`Prior techniques for recognizing phrases in an input
`query employed syntactic and statistical analysis and
`manual selection techniques. The present invention
`employs an automated domain-specific knowledge
`based system to recognize phrases.
`
`50
`
`35
`
`45
`
`65
`
`Facebook, Inc. - EXHIBIT 1237
`
`011
`
`
`
`3
`SUMMARY OF THE INVENTION
`The present invention provides a computer imple
`mented process performing a search query in which a
`database is provided containing domain—knowledge
`speci?c phrases. A natural language input query is in
`putted to the computer system, the input query de?ning
`the composition of the text of documents sought to be
`identi?ed. The natural language query is parsed and the
`stopwords are removed. The remaining words of the
`input query are stemmed to their basic roots, and the
`sequence of stemmed words in the list is compared to
`domain-speci?c phrases in the database to identify
`phrases in the search query. The phrases from the data
`base are substituted for the sequence of stemmed words
`from the list so that the remaining elements, namely, the
`substituted phrases and unsubstituted stemmed words,
`form the search query. More particularly, the individual
`terms of the completed search query form the query
`nodes of the query network.
`An optional and desirable feature of the present in
`vention resides in the provision of a technique for han
`dling citations as a syntactic phrase, the citations being
`employed for a “weighting’ of the statistical probability
`algorithms of the inference network.
`Another optional and desirable feature of the present
`invention resides in the provision of determining a key
`number based on a topical database, the key number
`being added to the search query as a query node and
`affecting the statistical probability algorithms of the
`inference network.
`
`0
`
`5
`
`20
`
`25
`
`5,265,065
`4
`in the propositions which potentially caused it. This is
`distinct from a diagnostic probability scheme in which
`the children provide support for their parents, that is
`belief in the potential causes of a proposition increases
`with belief in the proposition. In either case, the propa
`gation of probabilities through the network is done
`using information passed between adjacent nodes.
`FIG. 1 illustrates a Bayesian inference network as
`described in the aforementioned Turtle et al. article.
`The Bayesian network shown in FIG. 1 is a directed,
`acyclic dependency graph in which nodes represent
`propositional variables or constraints and the arcs rep
`resent dependence relations between propositions. An
`arc between nodes represents that the parent node
`"causes” or implies the proposition represented by the
`child node. The child node contains a link matrix or
`tensor which speci?es the probability that the child
`node is caused by any combination of the parent nodes.
`Where a node has multiple parents, the link matrix spec
`i?es the dependence of that child node on the set of
`parents and characterizes the dependence relationship
`between the node and all nodes representing its poten‘
`tial causes. Thus, for all nodes there exists an estimate of
`the probability that the node takes on a value given any
`set of values for its parent nodes. Ifa node a has a set of
`parents 1r,,={p1, .
`.
`. p,,}, the estimated probabilities
`P(a|p1, .
`.
`. p,,) are determined.
`The inference network is graphically illustrated in
`FIG. 1 and consists of two component networks: a
`document network 10 and a query network 12. The
`document network consists of document nodes d1, d1. .
`. . d,'.;, d,-, interior text representation nodes 11, t2. .
`.
`. tj.1,
`ti, and leaf nodes r1, r1, r3, .
`.
`. rk. The document nodes
`d correspond to abstract documents rather than their
`physical representations. The interior nodes t are text
`representation nodes which correspond to speci?c text
`representations within a document. The present inven
`tion will be described in connection with the text con
`tent of documents, but it is understood that the network
`can support document nodes with multiple children
`representing additional component types, such as audio,
`video, etc. Similarly, while a single text may be shared
`by more than one document, such as journal articles
`that appear in both serial issue and reprint collections,
`and parent/divisional patent speci?cations, the present
`invention shall be described in connection with a single
`text for each document. Therefore, for simplicity, the
`present invention shall assume a one-to-one correspon
`dence between documents and texts.
`The leaf nodes r are content representation nodes.
`There are several subsets of content representation
`nodes r1, r2, r3, .
`.
`. rk, each corresponding to a single
`representation technique which has been applied to the
`document texts. If a document collection has been in
`dexed employing automatic phrase extraction and man
`ually assigned index terms, then the set of representa
`tion nodes will consist of distinct subsets or content
`representation types with disjoint domains. For exam
`ple, if the phrase “independent contractor“ has been
`extracted and “independent contractor" has been manu
`ally assigned as an index term, then two content repre
`sentation nodes with distinct meanings will be created,
`one corresponding to the event that “independent con
`tractor" has been automatically extracted from the sub
`set of the collection, and the other corresponding to the
`event that “independent contractor" has been manually
`assigned to a subset of the collection. As will become
`
`55
`
`BRIEF DESCRIPTION OF THE DRA\VINGS
`FIG. 1 is a block diagram representation of a Bayes
`ian inference network with which the present invention
`is used.
`FIG. 2 is a block diagram representation of a simpli
`?ed Bayesian inference network as in FIG. 1.
`FIG. 3 is a block diagram of a computer system for
`carrying out the invention.
`FIG. 4 is a flowchart and example illustrating the
`steps of creating a search query in accordance with the
`preferred embodiment of the present invention.
`FIG. 5 is a ?owchart and example of the steps for
`determining a key number for inclusion in the search
`query described in connection with FIG. 4.
`FIGS. 6A-6D are block diagram representations ofa
`simpli?ed Bayesian inference network illustrating dif
`ferent techniques for handling phrases.
`FIG. 7 is a ?owchart illustrating the manner by
`which partial phrases are handled in a document re~
`trieval system.
`FIG. 8 is a detailed ?owchart of the query network in
`accordance with the presently preferred embodiment of
`the present invention.
`FIG. 9 is a detailed flowchart of a topic and key
`subroutine used in the query network illustrated in FIG.
`8.
`
`FIG. 10 is a detailed ?owchart of a document net
`work used with the query network shown in FIG. 8.
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENT
`The Inference Network
`Inference networks employ a predictive probability
`scheme in which parent nodes provide support for their
`children. Thus. the degree to which belief exists in a
`proposition depends on the degree to which belief exists
`
`65
`
`Facebook, Inc. - EXHIBIT 1237
`
`012
`
`
`
`5,265,065
`
`6
`
`20
`
`25
`
`30
`
`35
`
`5
`clear hereinafter, some concept representation nodes
`may be created based on the content of the query net
`work.
`Each document node has a prior probability associ
`ated with it that describes the probability of Observing
`that document. The document node probability will be
`equal to l/(collection size) and will be small for most
`document collections. Each text node contains a speci?
`cation of its dependence upon its parent. By assumption,
`this dependence is complete (t; is true) when its parent
`document is observed (d,- is true). Each representation
`node contains a speci?cation of the conditional proba
`bility associated with the node given its set of parent
`text nodes. The representation node incorporates the
`effect of any indexing weights (for example, term fre
`quency in each parent text) or term weights (inverse
`document frequency) associated with the concept.
`The query network 12 is an “inverted“ directed acy
`clic graph with a single node I which corresponds to an
`information need. The root nodes c1, c2, c3, .
`.
`. cm are
`the primitive concepts nodes used to express the infor
`mation requirement. A query concept node, c, contains
`the speci?cation of the probabilistic dependence of the
`query concept on its set of parent representation con
`tent nodes, r. The query concept nodes cl. .
`. cm define
`the mapping between the concepts used to represent the
`document collection and the concepts that make up the
`queries. A single concept node may have more than one
`parent representation node. For example, concept node
`c1 may represent the query concept "independent con
`tractor" and have as its parents representation nodes r;
`and U which correspond to "independent contractor“
`as a phrase and as a manually assigned term.
`Nodes q]. q; are query nodes representing distinct
`query representations corresponding to the event that
`the individual query representation is satis?ed. Each
`query node contains a speci?cation of the query on the
`query concept it contains. The intermediate query
`nodes are used in those cases where multiple query
`representations express the information need I.
`As shown in FIG. 1, there is a one-to-one correspon
`dence between document nodes, d, and text nodes, t.
`Consequently, the network representation of FIG. 1
`may be diagrammatically reduced so that the document
`nodes d1, d2, .
`.
`. d,~.1, d,-are parents to the representation
`nodes r1, r2, r3, . . . r;,. In practice, it is possible to further
`reduce the network of FIG. 1 due to an assumed one-to
`one correspondence between the representation nodes
`r1, r2, r3, .
`. . rk, and the concept nodes c1, c2, c3, . .
`. cm.
`The simpli?ed inference network is illustrated in FIG. 2
`and is more particularly described in the article by Tur
`tle et al., “Efficient Probabilistic Inference for Text
`Retrieval,“ RIAO 91 Conference Proceedings, pp.
`644-661, April, 1991 (Recherche d‘lnformaion Assistée
`par Ordinateur. Universitat Autonoma de Barcelona,
`Spain), which article is herein incorporated by refer
`ence.
`As described above, each child node carries a proba
`bility that the child node is caused by the parent node.
`The estimates of the dependence of a child node Q on its
`set of parents. P|, P2, .
`.
`. P,,, are encoded using the
`following expressions:
`
`where P(P1=true)=pg, P(P2=true)=p2, .
`.
`.
`P(Pn=true)=P,,, W1, W1, . . . w” are the term weights for
`each term P1, P1, .
`.
`. Pm and wg is the maximum proba
`bility that the child node can achieve, 0_-<_-w8-_<- I.
`As described above, all child nodes carry a probabil
`ity that the child was caused by the identi?ed parent
`nodes. Document network 10 is created once and re
`mains constant for each document in the network. The
`document network structure is not changed, except to
`add documents to the database. The document nodes d
`and text nodes t do not change for any given document
`once the document representation has been entered into
`document network 10. Most representation nodes are
`created with the database and are dependant on the
`document content. Some representation nodes (repre
`senting phrases and the like) are created for the particu
`lar search being conducted and are dependant of the
`search query.
`Query network 12, on the other hand. changes for
`each input query de?ning a document request. There
`fore, the concept nodes c of the search network are
`created with each search query and provide support to
`the query nodes q and the information need, node 1
`(FIG. 1).
`Document searching can be accomplished by a docu
`ment-based scan or a concept-based scan. A document
`based scan is one wherein the text of each document is
`scanned to determine the likelihood that the document
`meets the information need, I. More particularly, the
`representation nodes r1, r1, r3, .
`.
`. r], of a single docu
`ment are evaluated with respect to the several query
`nodes q], q; to determine a probability that the docu
`ment meets the information need. The top n-ranked
`documents are then selected as potential information
`need documents. The scan process reaches a point, for
`example after assigning a probability for more than n
`documents of a large document collection. that docu
`ments can be eliminated from the evaluation process
`after evaluating subsets of the representation nodes.
`More particularly, if a given document scores so low of
`a probability after only evaluating one or two represen
`tation nodes, determination can be made that even if the
`evaluation continued the document still would not
`score in the top n-ranked documents. Hence, most doc
`uments of a large collection are discarded from consid
`eration without having all their representation nodes
`evaluated.
`A concept-based scan is one wherein all documents
`containing a given representation node are evaluated.
`As the process continues through several representation
`nodes, a scorecard is maintained of the probabilities that
`each document meets the information need, I. More
`particularly, a single representation node r1 is evaluated
`for each document in the collection to assign an initial
`probability that the document meets the concept. The
`process continues through the several representation
`nodes with the probabilities being updated with each
`iteration. The top n-ranked documents are then selected
`as potential information need documents. If at some
`point in the process it can be determined that evaluation
`of additional representation nodes will not alter the
`ranking of the top n-ranked documents, the scan process
`can be terminated.
`
`40
`
`45
`
`50
`
`55
`
`Facebook, Inc. - EXHIBIT 1237
`
`013
`
`
`
`20
`
`25
`
`35
`
`7
`it can be appreciated that the representation nodes r1,
`r2, r3, .
`.
`. n, are nodes dependent on the content of the
`texts of the documents in the collection. Most represen
`tation nodes are created in the document database.
`Other representation nodes, namely those associated
`with phrases, synonyms and citations, are not manifest
`in any static physical embodiment and are created based
`on each search query. For example, a query manifesting
`the concept "employee” may be represented by one or
`more of “actor", "agent“, “attendant", “craftsman",
`"doer", "laborer", “maid”, “servant", "smith", “techni
`cian” and “worker”, to name a few. These various rep
`resentation nodes are created from the query node at
`the time of the search, such as through the use of the
`sauri and other tools to be described. A query node q1,
`qz, etc. can be manifest in one or more representations.
`The Search Query
`The present invention concerns development of the
`concept nodes c for use in the inference network illus
`trated in FIG. 1. The invention will be described in
`connection with a speci?c search query as follows:
`“What is the liability of the United States under the
`Federal Tort Claims Act for injuries sustained by
`employees of an independent contractor working
`under contract with an agency of the United States
`government?"
`Thus the present invention will be described in con
`nection with a database for searching legal documents,
`but it is to be understood the concepts of the invention
`may be applied to other professional databases, such as
`medical, theological, ?nancial and other types where
`specialized vocabularies, citations and digests are em
`ployed.
`The present invention is carried out through use of a
`computer system, such as illustrated in FIG. 3 compris
`ing a computer 20 connected to an input/output termi
`nal 22 and a read only memory (ROM) 24. ROM 24
`may be any form of read only memory, such as a CD
`ROM, write protected magnetic disc or tape, or a
`ROM, PROM or EPROM chip encoded for the pur
`poses described. Computer 20 may be a personal com
`puter (PC) and may be optionally connected through
`modem 26, telephone communication network 28 and
`modem 30 to a central computer 32 having a memory
`34. In one form of the invention, the document network
`10 and the document database containing the texts of
`documents represented by the document network are
`contained in the central computer 32 and its associated
`memory 34. Alternatively, the entire network and data
`base may be resident in the memory of personal com
`puter 20 and ROM 24. In a legal database and document
`information retrieval network the documents may com
`prise, for example, decisions and orders of courts and
`government agencies, rules, statutes and other docu
`ments re?ecting legal precedent. By maintaining the
`document database and document network at a central
`location, legal researchers may input documents into
`the document database in a uniform manner. Thus,
`there may be a plurality of computers 20, each having
`individual ROMs 24 and input/output devices 22, the
`computers 20 being linked to central computer 32 in a
`time-sharing mode. The search query is developed by
`each individual user or researcher by input via the re
`spective input/output terminal 22. For example, input
`/output terminal 22 may comprise the input keyboard
`and display unit of PC computer 20 and may include a
`printer for printing the display and/or document texts.
`
`5,265,065
`8
`ROM 24 contains a database containing phrases
`unique