`
`European Patent Office
`
`Office europeen des brevets
`
@ Publication number: 0 597 630 A1
`EUROPEAN PATENT APPLICATION
`
@ Int. Cl.5: G06F 15/403, G06F 15/20
`
@ Representative: Goodman, Christopher
Eric Potter & Clarkson, St. Mary's Court, St. Mary's Gate
Nottingham NG1 1LE (GB)
`
`@ Application number : 93308829.6
`@ Date of filing: 04.11.93
`
@ Priority: 04.11.92 US 970718
@ Date of publication of application: 18.05.94 Bulletin 94/20
@ Designated Contracting States:
AT BE CH DE DK ES FR GB GR IE IT LI LU MC NL PT SE
`@ Applicant: CONQUEST SOFTWARE INC.
`9700 Patuxent Woods Drive, Suite 140
`Columbia, Maryland MD-21046 (US)
@ Inventor: Addison, Edwin R., Conquest Software Inc.
9700 Patuxent Woods Drive, Suite 140
Columbia, Maryland MD-21046 (US)
Inventor: Blair, Arden S., Conquest Software Inc.
9700 Patuxent Woods Drive, Suite 140
Columbia, Maryland MD-21046 (US)
Inventor: Nelson, Paul E., Conquest Software Inc.
9700 Patuxent Woods Drive, Suite 140
Columbia, Maryland MD-21046 (US)
Inventor: Schwartz, Thomas, Conquest Software Inc.
9700 Patuxent Woods Drive, Suite 140
Columbia, Maryland MD-21046 (US)
`
`@) Method for resolution of natural-language queries against full-text databases.
@ The method of the present invention combines concept searching, document ranking, high speed and efficiency, browsing capabilities, "intelligent" hypertext, document routing, and summarization (machine abstracting) in an easy-to-use implementation. The method of the present invention also offers Boolean and statistical query options. The method of the present invention is based upon "concept indexing" (an index of "word senses" rather than just words). It builds its concept index from a "semantic network" of word relationships, with word definitions drawn from one or more standard human-language dictionaries. During query construction, users may select the meaning of a word from the dictionary, or may allow the method to disambiguate words based on semantic and statistical evidence of meaning. This results in a measurable improvement in precision and recall. Results of searching are retrieved and displayed in ranked order. The ranking process is more sophisticated than prior art systems providing ranking because it takes linguistics and concepts, as well as statistics, into account.
`
`Figure 1
`
`Jouve, 18, rue Saint-Denis, 75001 PARIS
`
`Page 1 of 29
`
`GOOGLE EXHIBIT 1014
`
`
`
EP 0 597 630 A1
`
`Field of the Invention
`
The present invention is a method for computer-based information retrieval. Specifically, the method of the present invention comprises a computer-implemented text retrieval and management system. The present invention offers four advances in the art of computer-based text retrieval. First, querying is simple. Queries may be expressed in plain English (or in another suitable human language). Second, searching for "concepts" has been found to be more accurate than Boolean, keyword or statistical searching as practiced in the prior art. Third, the method of the present invention is more efficient than sophisticated text retrieval methods of the prior art. It is faster (in equivalent applications), and features recall in excess of 80%, as compared to recall of less than 30% for Boolean systems, and approximately 50% for statistical methods of the prior art. Finally, the method of the present invention manages the entire research process for a user.
`
`Background of the Invention
`
While there are dozens of information retrieval software systems commercially available, most of them are based on older Boolean search technology. A few are based on statistical search techniques, which have proven to be somewhat better. But to break the barrier to access to relevant information, and to put this information in the hands of end users at the desktop, requires search software that is intuitive, easy to use, accurate, concept oriented, and demands a minimum investment of time from the user. The following distinctive features and benefits delineate these significant aspects of the method of the present invention.
`To date, there have been three major classes of text retrieval systems:
`• Keyword or Boolean systems that are based on exact word matching
`• Statistical systems that search for documents similar to a collection of words
`• Concept based systems that use knowledge to enhance statistical systems
Keyword or Boolean systems dominate the market. These systems are difficult to use and perform poorly (typically 20% recall for isolated queries). They have succeeded only because of the assistance of human experts trained to paraphrase queries many different ways and to take the time to manually eliminate the bad hits. While statistical search systems have increased performance to near 50% recall, trained search expertise is still needed to formulate queries in several ways to conduct an adequate search.
`A concept based search system further closes the performance gap by adding knowledge to the system.
`To date, there is no standard way to add this knowledge. There are very few concept based search systems
`available and those that exist require intensive manual building of the underlying knowledge base.
The next logical direction for improvement in text retrieval is the use of Natural Language Processing (NLP). While there are some experimental systems in government development programs, most of those prototypes have been useful only in narrow subject areas; they run slowly, and they are incomplete and unsuitable for commercialization. The failure of many early research prototypes of NLP-based text retrieval systems has led to much skepticism in the industry, leading many to favor statistical approaches.
There has been growing interest in the research community in the combination of NLP and conventional text retrieval. This is evidenced by the growing number of workshops on the subject. The American Association for Artificial Intelligence sponsored two of them. The first was held at the 1990 Spring AI Symposium at Stanford University on the subject of "Text Based Intelligent Systems". The second one (chaired by the applicant herein) was held at AAAI-91 in Anaheim in July 1991.
`
`Natural Language Techniques
`
The literature is rich in theoretical discussions of systems intended to provide functions similar to those outlined above. A common approach in many textbooks on natural language processing (e.g., Natural Language Understanding, James Allen, Benjamin Cummings, 1987) is to use "semantic interpretation rules" to identify the meanings of words in text. Such systems are "hand-crafted", meaning that new rules must be written for each new use. These rules cannot be found in any published dictionary or reference source. This approach is rarely employed in text retrieval and usually fails in some critical way to provide adequate results.
Krovetz has reported in various workshops (AAAI-90 Spring AI Symposium at Stanford University) and in Lexical Acquisition, Uri Zernick, ed., Lawrence Erlbaum, 1991, ISBN 0-8056-0829-9, that "disambiguating word senses from a dictionary" would improve the performance of text retrieval systems, claiming experiments have proven that this method will improve precision. This author's philosophy suggests that a word sense be identified by "confirmation in context from multiple sources of evidence". None of Krovetz's published works propose a specific technique for doing so, and his recent publications indicate that he is "experimenting" to find suitable methods.
`
Eugene Charniak, of Brown University, has reported in AI Magazine (AAAI, Winter 1992), and has spoken at the Naval Research Laboratory AI Laboratory (November 1991), about the technique of employing "spreading activation" to identify the meaning of a word in a small text. Charniak employs a "semantic network" and begins with all instances of a given word. The method then "fans out" in the network to find neighboring terms that are located near the candidate term in the text. This technique suffers from two admitted drawbacks: it requires a high-quality, partially hand-crafted, small semantic network, and this semantic network is not derived from published sources. Consequently, the Charniak method has never been applied to any text longer than a few sentences in a highly restricted domain of language.
Stephanie Haas, of the University of North Carolina, has attempted to use multiple dictionaries in information retrieval, including a main English dictionary coupled with a vertical application dictionary (such as a dictionary of computer terms used in a computer database). Haas' approach does not take advantage of word sense disambiguation, and she reported at ASIS, October 1991, that merging two dictionaries gave no measurable increase in precision and recall over a single generic English dictionary.
Uri Zernick, editor of Lexical Acquisition, Lawrence Erlbaum, 1991, suggests in the same book that a "cluster signature" method from pattern recognition be used to identify word senses in text. The method lists words commonly co-occurring with a word in question and determines the percentage of the time that each of the commonly occurring words appears in context in the database or corpus for each word meaning. This is called the "signature" of each word meaning. The signatures of each meaning are compared with the use of a word in context to identify the meaning. This pattern recognition approach, based upon a cluster technique discussed in Duda and Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, 1973, has the obvious drawback that it has to be "trained" for each database. The signature information is not readily obtainable from a published dictionary.
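The signature idea can be illustrated with a small sketch (the data, weights and function names here are hypothetical; the published method differs in detail): each sense of an ambiguous word carries a list of co-occurring words with the fraction of contexts in which each appears, and a use in running text is assigned the sense whose signature best overlaps the observed context.

```python
# Hypothetical sketch of a "cluster signature" word-sense test.
# Each sense of "bank" carries a signature: words that co-occur with
# that sense, with the (trained) fraction of contexts each appears in.
SIGNATURES = {
    "bank/river": {"water": 0.6, "shore": 0.4, "fish": 0.3},
    "bank/finance": {"money": 0.7, "check": 0.5, "loan": 0.4},
}

def pick_sense(context_words, signatures=SIGNATURES):
    """Return the sense whose signature best matches the context."""
    def score(sig):
        return sum(weight for word, weight in sig.items()
                   if word in context_words)
    return max(signatures, key=lambda s: score(signatures[s]))

print(pick_sense({"the", "water", "by", "the", "shore"}))  # bank/river
print(pick_sense({"cash", "a", "check"}))                  # bank/finance
```

The drawback noted above is visible in the sketch: the numbers in SIGNATURES must be estimated ("trained") from each database, rather than read out of a published dictionary.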
Brian Slator (in the same book edited by Zernick, above) discusses use of a "subject hierarchy" to compute a "context score" to disambiguate word senses. Generally, a "subject" or topic is identified by the context. A meaning is then selected by its relevance to the topic. This approach is only as strong as the depth of the subject hierarchy, and it does not handle exceptions. A drawback of this approach is that available subject hierarchies do not cover a significant portion of the lexicon of any dictionary, let alone the vocabulary of a native speaker of a language.
One well known example of prior art in text retrieval that uses natural language input is the statistical techniques developed by Gerard Salton of Cornell University. His research system, called SMART, is now used in commercial applications; for example, Individual Inc. of Cambridge, MA uses it in a news clipping service. Dr. Salton is well known for his claims that natural language processing based text retrieval systems do not work as well as SMART. He bases such claims on limited experiments that he ran in the 1960s. At the 1991 ASIS meeting he stated that the reason natural language processing based systems don't work is that syntax is required and syntax is not useful without semantics. He further claims that "semantics is not available" due to the need to handcraft the rules. However, the system of the present invention has made semantics available through the use of statistical processing on machine readable dictionaries and automatic acquisition of semantic networks.
`
`Lexical Acquisition
`
In the field of lexical acquisition, most of the prior art is succinctly summarized in the First Lexical Acquisition Workshop Proceedings, August 1989, Detroit, at IJCAI-89. There is a predominance of papers covering the automatic building of natural language processing lexicons for rule-based processing. Over 30 papers were presented on various ideas, isolated concepts or prototypes for acquiring information from electronic dictionaries for use in natural language processing. None of these proposed the automatic building of a semantic network from published dictionaries.
`
`Indexing
`
Typical text search systems contain an index of words with references to the database. For a large document database, the number of references for any single term varies widely. Many terms may have only one reference, while other terms may have from 100,000 to 1 million references. The prior art substitutes thesaurus entries for search terms, or simply requires the user to rephrase his queries in order to "tease information out of the database". The prior art has many limitations. In the prior art, processing is at the level of words, not concepts. Therefore, the query explosion produces too many irrelevant variations to be useful in most circumstances. In most prior art systems, the user is required to restate queries to maximize recall. This limits such systems to use by "expert" users. In prior art systems, many relationships not found in a classical thesaurus cannot be exploited (for example, a "keyboard" is related to a "computer" but it is not a synonym).
`
`Contextual Systems
`
The prior art of systems which attempt to extract contextual understanding from natural language statements is primarily that of Gerard Salton (described in Automatic Text Processing, Addison-Wesley Publishing Company, 1989). As described therein, such systems simply count terms (words) and co-occurrences of terms, but do not "understand" word meanings.
Routing means managing the flow of text or message streams and selecting only text that meets the desired profile of a given user to send to that user. Routing is useful for electronic mail, news wire text, and intelligent message handling. It is usually the case that a text retrieval system designed for retrieval from archived data is not good for routing, and vice versa. For news wire distribution applications (which seek to automate distribution of the elements of a "live" news feed to members of a subscriber audience based on "interest profiles"), it is time-intensive and very difficult to write the compound Boolean profiles upon which such systems depend. Furthermore, these systems engage in unnecessary and repetitive processing as each interest profile and article are processed.
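Profile-based routing, as described above, can be sketched minimally (the profiles, threshold and function name are hypothetical illustrations, not a disclosed system): each incoming article is scored once against every interest profile and delivered to the users whose profiles it matches.

```python
# Hypothetical sketch of profile-based routing: send each incoming
# article to every subscriber whose interest profile it matches.
PROFILES = {
    "alice": {"merger", "acquisition", "stock"},
    "bob": {"baseball", "playoffs"},
}

def route(article_text, profiles=PROFILES, threshold=2):
    """Return the users whose profiles share at least `threshold`
    distinct terms with the article."""
    words = set(article_text.lower().split())
    return [user for user, terms in profiles.items()
            if len(terms & words) >= threshold]

print(route("Stock prices rose after the merger announcement"))  # ['alice']
```

A word-level matcher like this exhibits exactly the weakness the text notes: the profile sets play the role of compound Boolean profiles and must be written and maintained by hand.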
`
`Document Ranking
`
Systems which seek to rank retrieved documents according to some criterion or group of criteria are discussed by Salton, in Automatic Text Processing (ranking on probabilistic terms), and by Donna Harman, in a recent ASIS Journal article (ranking on a combination of frequency related methods). Several commercial systems use ranking, but their proprietors have never disclosed the algorithms used. Fulcrum uses (among other factors) document position and frequency. Personal Library Software uses inverse document frequency, term frequency and collocation statistics. Verity uses "accrued evidence based on the presence of terms defined in search topics".
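The frequency-related factors mentioned above (term frequency, inverse document frequency) can be illustrated with a textbook tf-idf scorer; this is a generic sketch, not any vendor's undisclosed algorithm:

```python
import math

# Textbook tf-idf ranking sketch: score each document by summing, for
# each query term, its frequency in the document weighted by the
# inverse of how many documents contain it.
def rank(query_terms, documents):
    docs = [doc.lower().split() for doc in documents]
    n = len(docs)
    def idf(term):
        df = sum(1 for words in docs if term in words)
        return math.log((1 + n) / (1 + df))  # smoothed inverse doc freq
    scores = []
    for i, words in enumerate(docs):
        score = sum(words.count(t) * idf(t) for t in query_terms)
        scores.append((score, i))
    # Highest-scoring documents first.
    return [i for score, i in sorted(scores, reverse=True)]

docs = ["the plant manager toured the chemical plant",
        "water the house plant daily",
        "the quarterly report was filed"]
print(rank(["chemical", "plant"], docs))  # document 0 ranks first
```

Note that the score never consults word meanings: document 1 still receives credit for "plant" even though it concerns a house plant, which is precisely the limitation the present invention addresses.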
`
`Concept Definition and Search
`
The prior art comprises two distinct methods for searching for "concepts". The first and most common of these is to use a private thesaurus, where a user simply defines terms in a set that are believed to be related. Searching for any one of these terms will physically also search for and find the others. The literature is replete with research papers on uses of a thesaurus. Verity, in its Topic software, uses a second approach. In this approach users create a "topic" by linking terms together and declaring a numerical strength for each link, similar to the construction of a "neural network". Searching in this system retrieves any document that contains sufficient (as defined by the system) "evidence" (the presence of terms that are linked to the topic under search). Neither of these approaches is based upon the meanings of the words as defined by a publisher's dictionary.
Other prior art consists of two research programs:
• TIPSTER: A government research program called TIPSTER is exploring new text retrieval methods. This work will not be completed until 1996 and there are no definitive results to date.
• CLARIT: Carnegie Mellon University (CMU) has an incomplete prototype called CLARIT that uses dictionaries for syntactic parsing information. The main claim of CLARIT is that it indexes phrases that it finds by syntactic parsing. Because CLARIT has no significant semantic processing, it can only be viewed as a search extension of keywords into phrases. Its processing is subsumed by the conceptual processing and semantic networks of the present invention.
`
`Hypertext
`
`Prior art electronically-retrieved documents use "hypertext", a form of manually pre-established cross-
`reference. The cross-reference links are normally established by the document author or editor, and are static
`for a given document. When the linked terms are highlighted or selected by a user, the cross-reference links
`are used to find and display related text.
`
`
`Machine Abstracting
`
Electronic Data Systems (EDS) reported machine abstracting using keyword search to extract the key sentences based on commonly occurring terms which are infrequent in the database. This was presented at an American Society for Information Science (ASIS) 1991 workshop on natural language processing. They further use natural language parsing to eliminate subordinate clauses.
The present invention is similar, except that the retrieval of information for the abstract is based upon concepts, not just keywords. In addition, the present invention uses semantic networks to further abstract these concepts to gain some general idea of the intent of the document.
`
`Summary
`
The prior art may be summarized by the shortcomings of prior art systems for textual document search and retrieval. Most commercial systems of the prior art rely on "brute force indexing" and word or wild card search, which provides fast response only for lists of documents ranked according to a precomputed index (such as document date), and not for relevance-ranked lists. For systems which attempt to relevance-rank, the user must wait for the entire search to complete before any information is produced. Alternatively, some systems display documents quickly, but without any guarantee that the documents displayed are the most relevant.
The systems of the prior art rank retrieved documents on the presence of words, not word meanings. The prior art systems fail to use linguistic evidence such as syntax or semantic distance. No known prior art system can combine more than two or three ranking criteria. No known system in the prior art is capable of acquiring semantic network information directly from published dictionaries, and thus, to the extent that such networks are used at all, they must be "hand built" at great expense, and with the brittleness which results from the author's purpose and bias.
`In thesaurus-based information retrieval systems, as well as topic based information retrieval systems,
`concepts are created by linking words, not word meanings. In these systems (thesaurus and topic based), the
`user has the burden of creating concepts before searching. In addition, for topic based systems, the user has
`the added burden of making arbitrary numeric assignments to topic definitions. Prior art thesaurus and topic
`based systems do not link new concepts to an entire network of concepts in the natural language of search.
`Instead, isolated term groups are created that do not connect to the remainder of any concept knowledge base.
`Topic based systems require that topics be predefined to make use of concept-based processing.
Finally, with the method of the present invention, hypertext authors need not spend time coding hypertext links to present a hypertextual document to users, because a natural language search (perhaps taken directly from the document itself) will find all relevant concepts, not just those found by the author.
`
`Brief Description of the Invention
`
`
The method of the present invention combines concept searching, document ranking, high speed and efficiency, browsing capabilities, "intelligent" hypertext, document routing, and summarization (machine abstracting) in an easy-to-use implementation.
The method offers three query options:
• Natural Language: finding documents with concepts expressed in plain English;
• Query by Example: present a document, retrieve similar documents;
• Private Concept: define a new term, enter it in the "semantic network", search.
`The method of the present invention continues to provide Boolean and statistical query options so that
`users will have easy access to a familiar interface and functionality while learning new and more powerful
`features of the present invention.
The method of the present invention is based upon "concept indexing" (an index of "word senses" rather than just words). A word sense is a specific use or meaning of a word or idiom. The method of the present invention builds its concept index from a "semantic network" of word relationships, with word definitions drawn from one or more standard English dictionaries. During query construction, users may select the meaning of a word from the dictionary. This results in a measurable improvement in precision.
Results of text searching are retrieved and displayed in ranked order. The ranking process is more sophisticated than prior art systems providing ranking because it takes linguistics and concepts, as well as statistics, into account.
The method of the present invention uses an artificial intelligence "hill climbing" search to retrieve and display the best documents while the remainder of the search is still being processed. The method of the present invention achieves major speed advantages for interactive users.
Other significant functions of the method of the present invention include browsing documents (viewing documents directly and moving around within and between documents by related concepts), implementing "dynamically compiled" hypertext, routing, and machine abstracting or automatic summarization of long texts.
`
`
`Brief Description of the Drawings
`
Figure 1: depicts the computer program modules which implement the method of the present invention.
Figures 2a-d: depict a detailed flow diagram of the concept indexing process according to the present invention.
Figure 3: depicts the process whereby the method of the present invention disambiguates word senses based on "concept collocation".
Figure 4: depicts the sources of information in an automatically-acquired machine-readable dictionary according to the present invention.
Figure 5: illustrates the structure of the machine-readable dictionary of the present invention.
Figure 6: depicts a flow diagram of the query process according to the present invention.
`
`Detailed Description of the Invention
`
The method of the present invention is a "Natural Language Processing" based text retrieval method. There are very few concept based search systems available, and those that exist require intensive manual building of the underlying knowledge bases. The method of the present invention uses published dictionaries to build the underlying knowledge base automatically. The dictionary provides the knowledge needed to process plain English or "natural language" input accurately, making the user interface considerably simpler.
`In the method of the present invention:
`• There are no hand-crafted rules for each word meaning
`• Idioms and repetitive phrases are processed as a single meaning
`• Unknown words, proper names and abbreviations are automatically processed
• Ill-formed input with poor grammar and spelling errors can be processed
The method of the present invention has combined the document ranking procedure with the search procedure. This allows fast "hill-climbing" search techniques to quickly find only the best documents regardless of database size. All available search systems first retrieve all possible documents and then rank the results, a much slower process. The method of the present invention uses these search techniques to support the advanced demands of natural language text retrieval.
`In the method of the present invention:
`• Only the best documents are retrieved
`• Searching is guided by document ranking
`• The document database is automatically divided into multiple sets
`• Searching over document sets significantly improves method performance
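The ranking-guided search idea above can be sketched as a best-first search over pre-partitioned document sets; this is a simplified hypothetical rendering of the idea, not the patented algorithm itself. Each set carries an upper bound on the relevance scores inside it, so the best documents can be emitted before lower-bounded sets are ever opened.

```python
import heapq

# Sketch of ranking-guided ("hill-climbing") retrieval. The database is
# pre-divided into document sets; each set knows an upper bound on the
# relevance scores inside it. We open the most promising set first and
# emit a document as soon as its score beats every unopened bound.
def best_first_retrieve(doc_sets, k):
    """doc_sets: list of (upper_bound, [(score, doc_id), ...])."""
    # Max-heap of sets, keyed by upper bound (negated for heapq).
    frontier = [(-ub, docs) for ub, docs in doc_sets]
    heapq.heapify(frontier)
    results = []
    candidates = []  # max-heap of scored documents seen so far
    while frontier and len(results) < k:
        neg_ub, docs = heapq.heappop(frontier)
        for score, doc_id in docs:
            heapq.heappush(candidates, (-score, doc_id))
        next_bound = -frontier[0][0] if frontier else float("-inf")
        # Emit documents guaranteed to outrank anything still unopened.
        while candidates and -candidates[0][0] >= next_bound and len(results) < k:
            neg_score, doc_id = heapq.heappop(candidates)
            results.append(doc_id)
    return results

sets = [(0.9, [(0.85, "d1"), (0.4, "d2")]),
        (0.5, [(0.5, "d3")]),
        (0.2, [(0.1, "d4")])]
print(best_first_retrieve(sets, 2))  # ['d1', 'd3']
```

The payoff matches the bullets above: "d1" is displayed after opening a single set, while the lowest-bounded set is never scored at all.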
`
`Architecture
`
`
The method of the present invention has been implemented as five computer program modules: the Query Program, the Index Program, the Library Manager, the Dictionary Manager, and the Integrator's Toolkit. Each of these is defined below and their relationships are shown in Figure 1.
• Query Program: Program to accept queries and execute searches
• Index Program: Program to index new or updated documents
• Library Manager: Program to manage the organization of text files
• Dictionary Manager: Program to maintain the dictionary and private searches
• Integrator's Toolkit: Program for developers to integrate the present invention with other computer systems and program products
The method of the present invention offers Graphical User Interfaces, command line interfaces, and tools to customize the user interface. The display shows the title hits in ranked order and the full text of the documents. Documents can be viewed, browsed and printed from the interface. The Integrator's Toolkit allows the product to be installed in any interface format. The system is an open system. It makes heavy use of "Application Program Interfaces" (APIs), or interfaces that allow it to be integrated, linked or compiled with other systems.
`
`
`Natural Language Processing
`
The method of the present invention is the first text search system that uses published dictionaries to build the underlying knowledge base automatically, eliminating the up-front cost that an organization must absorb
`
`
to use other concept based search systems. In addition, the dictionary provides the knowledge needed to process natural language input accurately, making the user interface considerably simpler. The algorithms used identify the meaning of each word based upon a process called "spreading activation". NLP as used in the present invention improves text retrieval in many ways, including the following:
• Morphological analysis allows better matching of terms like "computing" and "computational". Traditional suffix stripping hides these related meanings and may introduce errors when suffixes are improperly removed.
• Syntactic analysis gives insight into the relationship between words.
• Semantics resolve ambiguity of meaning (i.e., chemical plant vs. house plant).
• Natural Language may be used to interact with the user, including allowing the user to select meanings of words using dictionary definitions.
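The morphology point above can be illustrated with a toy comparison (the lemma table and suffix list are hypothetical; a real system would draw its morphology from a published dictionary): blind suffix stripping maps "computing" and "computational" to different strings, so the related meanings never match, while dictionary-based analysis maps both to the same root.

```python
# Toy illustration of dictionary-based morphology vs. suffix stripping.
# A real system would derive the lemma table from a published dictionary.
LEMMAS = {
    "computing": "compute",
    "computational": "compute",
    "computation": "compute",
}

def strip_suffix(word):
    """Naive stemmer: chop a common suffix with no dictionary check."""
    for suffix in ("ing", "al", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def lemma(word):
    """Dictionary-based analysis: look the word up, fall back to itself."""
    return LEMMAS.get(word, word)

# Suffix stripping drives the two related forms apart...
print(strip_suffix("computing") == strip_suffix("computational"))  # False
# ...while lemma lookup unifies them.
print(lemma("computing") == lemma("computational"))  # True
# Improper removal: "sing" is not an inflection of anything.
print(strip_suffix("sing"))  # s
```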
`
`Statistical Word Sense Disambiguation Using a Publisher's Dictionary
`
The purpose of this method is to identify the specific meaning of each word in the text as identified in a publisher's dictionary. The reason to do this is to increase the precision of the return during document retrieval and browsing. This is primarily a semantic "word sense disambiguation" and takes place via a "spreading activation" concept through a "semantic network". The method disambiguates word senses (identifies word meanings) based on "concept collocation". If a new word sense appears in the text, the likelihood is that it is similar in meaning or domain to recent words in the text. Hence, recent syntactically compatible terms are compared through the semantic network (discussed below) by "semantic distance". A classic example is that the word "bank" when used in close proximity to "river" has a different meaning from the same word when used in close proximity to "check".
To make this concept work correctly, an underlying semantic network defined over the word senses is needed. An example of such a network is illustrated in the discussion which follows. Note that only one link type is used. This is an "association link", which is assigned a link strength from 0 to 1. Past industrial experience with commercial systems has shown difficulty in maintaining rich semantic networks with many link types. Further, this concept indexing scheme does not require a deep understanding of the relationship between word senses. It simply must account for the fact that there is a relationship of some level of belief.
The present invention uses a new form of statistical natural language processing that uses only information directly acquirable from a published dictionary and statistical context tests. Words are observed in a local region about the word in question and compared against terms in a "semantic network" that is derived directly from published dictionaries (see the discussion below on automatic acquisition). The resulting statistical test determines the meaning, or reports that it cannot determine the meaning based upon the available context. (In this latter case, the method simply indexes over the word itself as in conventional text retrieval, defaulting to keyword or thesaurus processing.)
This method overcomes all the limitations discussed above. Hand-crafted rules are not required. The method applies to any text on any subject (obviously, in vertical subject domains, the percentage of words that can be disambiguated increases with a dictionary focused on that subject). No training is required, and exceptions outside of a subject domain can easily be identified. The significance of this method is that now any text may be indexed to the meanings of words defined in any published dictionary - generic or specialized. This allows much more accurate retrieval of information. Many fewer false hits will occur during text retrieval.
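The disambiguation step described in this section can be sketched as follows (the toy network, link strengths and evidence threshold are hypothetical illustrations, not the patent's actual data or algorithm): candidate senses of an ambiguous word are scored by the association strengths linking their network neighborhoods to nearby context words, and when the evidence is too weak the word falls back to ordinary keyword indexing.

```python
# Sketch of word-sense disambiguation by "concept collocation": pick the
# sense of an ambiguous word whose semantic-network neighborhood best
# matches the surrounding words. Toy network; link strengths in [0, 1].
NETWORK = {
    "bank/river": {"river": 0.9, "shore": 0.7, "water": 0.6},
    "bank/finance": {"check": 0.8, "money": 0.9, "loan": 0.7},
}

def disambiguate(word, context, network=NETWORK, min_evidence=0.5):
    """Return the best sense of `word` given nearby context words, or
    None when evidence is too weak (fall back to keyword indexing)."""
    senses = [s for s in network if s.startswith(word + "/")]
    best, best_score = None, 0.0
    for sense in senses:
        score = sum(network[sense].get(w, 0.0) for w in context)
        if score > best_score:
            best, best_score = sense, score
    return best if best_score >= min_evidence else None

print(disambiguate("bank", ["river", "flooded"]))     # bank/river
print(disambiguate("bank", ["cash", "the", "check"])) # bank/finance
print(disambiguate("bank", ["hello"]))                # None -> keyword index
```

The None branch mirrors the fallback described above: with no usable context, the method simply indexes the word itself, as in conventional text retrieval.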
`
`Concept Indexing
`
`Figures 2a-d show a detailed breakout of the concept indexing process. The process extracts sentences
`from the text, tags the words within those sentences, looks up words and analyzes morphology, executes a
`robust syntactic parse, disambiguates word senses and produces the index.
The first step in the indexing process is to extract sentences or other appropriate lexical units from the text. A tokenizer module that matches character strings is used for this task. While most sentences end in periods or other terminal punctuation, sentence extraction is considerably more difficult than looking for the next period. Often, sentences run on, contain periods within abbreviations (creating ambiguities), and sometimes have punctuation within quotes or parentheses. In addition, there exist non-sentential strings in text, such as lists, figure titles, footnotes, section titles and exhibit labels. Just as not all periods indicate sentence boundaries, so too, not all paragraphs are separated by a blank line. The tokenizer algorithm attempts to identify these lexical boundaries by accumulating evidence from a variety of sources, including a) blank lines, b) periods, c) multiple spaces, d) list bullets, e) uppercase letters, f) section numbers, g) abbreviations, h) other punctuation.
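The evidence-accumulation idea can be sketched as a scorer over candidate period positions (the weights, threshold and abbreviation list are illustrative assumptions, not values from the patent): each candidate boundary gathers positive and negative evidence and is accepted only when the total clears a threshold.

```python
# Sketch of evidence-based sentence boundary detection: each candidate
# period accumulates positive and negative evidence and is accepted as
# a boundary only when the total clears a threshold. All weights are
# illustrative, not the patent's actual values.
ABBREVIATIONS = {"mr", "mrs", "dr", "e.g", "i.e", "u.s"}

def is_boundary(prev_word, next_word, threshold=1.0):
    """Score the period ending `prev_word`, followed by `next_word`."""
    score = 0.5  # a period alone is only weak evidence
    token = prev_word.rstrip(".").lower()
    if token in ABBREVIATIONS:
        score -= 1.0   # known abbreviation: probably not a boundary
    if len(token) == 1:
        score -= 0.5   # single initial, e.g. the "R." in "Edwin R."
    if next_word and next_word[0].isupper():
        score += 0.5   # capitalized continuation supports a boundary
    if next_word and next_word[0].islower():
        score -= 0.5   # lowercase continuation argues against one
    return score >= threshold

print(is_boundary("documents.", "It"))  # True
print(is_boundary("Mr.", "Edwin"))      # False
print(is_boundary("R.", "Addison"))     # False
```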
`
`7
`
`Page 7 of 29
`
`
`
`EP O 597 630 A1
`
`For example:
`
`
ConQuest™, by Mr. Edwin R. Addison and Mr. Paul E. Nelson, is 90.9 percent accurate in retrieving relevant documents. It has the following characteristics:
• English only Queries
• Fast Integrated Ranking and Retrieval
In the above example, the sentence contains 6 periods, but only the last one demarcates the end of the sentence. The others are ignored for the following reason