Europäisches Patentamt
European Patent Office
Office européen des brevets
Publication number: 0 597 630 A1
EUROPEAN PATENT APPLICATION

Application number: 93308829.6
Int. Cl.5: G06F 15/403, G06F 15/20
Date of filing: 04.11.93
Priority: 04.11.92 US 970718
Date of publication of application: 18.05.94 Bulletin 94/20
Designated Contracting States: AT BE CH DE DK ES FR GB GR IE IT LI LU MC NL PT SE
Applicant: CONQUEST SOFTWARE INC., 9700 Patuxent Woods Drive, Suite 140, Columbia, Maryland MD-21046 (US)
Inventor: Addison, Edwin R., Conquest Software Inc., 9700 Patuxent Woods Drive, Suite 140, Columbia, Maryland MD-21046 (US)
Inventor: Blair, Arden S., Conquest Software Inc., 9700 Patuxent Woods Drive, Suite 140, Columbia, Maryland MD-21046 (US)
Inventor: Nelson, Paul E., Conquest Software Inc., 9700 Patuxent Woods Drive, Suite 140, Columbia, Maryland MD-21046 (US)
Inventor: Schwartz, Thomas, Conquest Software Inc., 9700 Patuxent Woods Drive, Suite 140, Columbia, Maryland MD-21046 (US)
Representative: Goodman, Christopher, Eric Potter & Clarkson, St. Mary's Court, St. Mary's Gate, Nottingham NG1 1LE (GB)

Method for resolution of natural-language queries against full-text databases.
The method of the present invention combines concept searching, document ranking, high speed and
efficiency, browsing capabilities, "intelligent" hypertext, document routing, and summarization
(machine abstracting) in an easy-to-use implementation. The method of the present invention also
offers Boolean and statistical query options. The method of the present invention is based upon
"concept indexing" (an index of "word senses" rather than just words). It builds its concept index
from a "semantic network" of word relationships with word definitions drawn from one or more
standard human-language dictionaries. During query, users may select the meaning of a word from
the dictionary during query construction, or may allow the method to disambiguate words based on
semantic and statistical evidence of meaning. This results in a measurable improvement in precision
and recall. Results of searching are retrieved and displayed in ranked order. The ranking process is
more sophisticated than prior art systems providing ranking because it takes linguistics and
concepts, as well as statistics, into account.
`
`Figure 1
`
`
`Jouve, 18, rue Saint-Denis, 75001 PARIS
`
`Page 1 of 29
`
`GOOGLE EXHIBIT 1014
`
`

`

EP 0 597 630 A1
`
`Field of the Invention
`
`
The present invention is a method for computer-based information retrieval. Specifically, the method of
the present invention comprises a computer-implemented text retrieval and management system. The present
invention offers four advances in the art of computer-based text retrieval. First, querying is simple. Queries
may be expressed in plain English (or in another suitable human language). Second, searching for "concepts"
has been found to be more accurate than Boolean, keyword or statistical searching as practiced in the
prior art. Third, the method of the present invention is more efficient than sophisticated text retrieval methods
of the prior art. It is faster (in equivalent applications), and features recall in excess of 80%, as compared to
recall of less than 30% for Boolean systems, and approximately 50% for statistical methods of the prior art.
Finally, the method of the present invention manages the entire research process for a user.
`
`Background of the Invention
`
While there are dozens of information retrieval software systems commercially available, most of them
are based on older Boolean search technology. A few are based on statistical search techniques which have
proven to be somewhat better. But, to break the barrier to access to relevant information and to put this
information in the hands of end users at the desktop requires search software that is intuitive, easy to use,
accurate, concept oriented, and needs a minimum investment of time by the user. The following distinctive
features and benefits delineate these significant aspects of the method of the present invention.
`To date, there have been three major classes of text retrieval systems:
`• Keyword or Boolean systems that are based on exact word matching
`• Statistical systems that search for documents similar to a collection of words
`• Concept based systems that use knowledge to enhance statistical systems
Keyword or Boolean systems dominate the market. These systems are difficult to use and perform poorly
(typically 20% recall for isolated queries). They have succeeded only because of the assistance of human
experts trained to paraphrase queries many different ways and to take the time to humanly eliminate the bad
hits. While statistical search systems have increased performance to near 50% recall, trained search expertise
is still needed to formulate queries in several ways to conduct an adequate search.
`A concept based search system further closes the performance gap by adding knowledge to the system.
`To date, there is no standard way to add this knowledge. There are very few concept based search systems
`available and those that exist require intensive manual building of the underlying knowledge base.
`The next logical direction for improvement in text retrieval is its use of Natural Language Processing (NLP).
`While there are some experimental systems in government development programs, most of those prototypes
`have been only useful in narrow subject areas, they run slowly, and they are incomplete and unsuitable for
`commercialization. The failure of many early research prototypes of NLP based text retrieval systems has led
`to much skepticism in the industry, leading many to favor statistical approaches.
There has been a growing interest in the research community in the combination of NLP and conventional
text retrieval. This is evidenced by the growing number of workshops on the subject. The American Association
of Artificial Intelligence sponsored two of them. The first was held at the 1990 Spring AI Symposium at Stanford
University on the subject of "Text Based Intelligent Systems". The second one (chaired by the applicant herein)
was held at AAAI-91 in Anaheim in July 1991.
`
`Natural Language Techniques
`
The literature is rich in theoretical discussions of systems intended to provide functions similar to those
outlined above. A common approach in many textbooks on natural language processing (e.g., Natural Language
Understanding, James Allen, Benjamin Cummings, 1987) is to use "semantic interpretation rules" to
identify the meanings of words in text. Such systems are "hand-crafted", meaning that new rules must be
written for each new use. These rules cannot be found in any published dictionary or reference source. This
approach is rarely employed in text retrieval and usually fails in some critical way to provide adequate results.
Krovetz has reported in various workshops (AAAI-90 Spring AI Symposium at Stanford University) and in
Lexical Acquisition by Uri Zernick, Lawrence Erlbaum, 1991, ISBN 0-8056-0829-9, that "disambiguating word
senses from a dictionary" would improve the performance of text retrieval systems, claiming experiments
have proven that this method will improve precision. This author's philosophy suggests that a word sense be
identified by "confirmation in context from multiple sources of evidence". None of Krovetz's published works
propose a specific technique for doing so, and his recent publications indicate that he is "experimenting" to
find suitable methods.
`
`
Eugene Charniak, of Brown University, has reported in "AI Magazine" (AAAI, Winter 1992), and has spoken
at the Naval Research Laboratory AI Laboratory (November 1991) about the technique of employing "spreading
activation" to identify the meaning of a word in a small text. Charniak employs a "semantic network" and
begins with all instances of a given word. It then "fans out" in the network to find neighboring terms that are
located near the candidate term in the text. This technique suffers from 2 admitted drawbacks: it requires a
high-quality, partially hand-crafted, small semantic network, and this semantic network is not derived from
published sources. Consequently, the Charniak method has never been applied to any text longer than a few
sentences in a highly restricted domain of language.
Stephanie Haas, of the University of North Carolina, has attempted to use multiple dictionaries in
information retrieval including a main English dictionary coupled with a vertical application dictionary (such as a
dictionary of computer terms used in a computer database). Haas' approach does not take advantage of word
sense disambiguation, and she reported at ASIS, October 1991 that merging two dictionaries gave no
measurable increase in precision and recall over a single generic English dictionary.
Uri Zernick, editor of Lexical Acquisition, Lawrence Erlbaum, 1991, suggests in the same book a "cluster
signature" method from pattern recognition be used to identify word senses in text. The method lists words
commonly co-occurring with a word in question and determines the percentage of the time that each of the
commonly occurring words appears in context in the database or corpus for each word meaning. This is called
the "signature" of each word meaning. The signatures of each meaning are compared with the use of a word
in context to identify the meaning. This pattern recognition approach based upon a cluster technique discussed
in Duda and Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York 1973 has the
obvious drawback that it has to be "trained" for each database. The signature information is not readily obtainable
from a published dictionary.
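The signature idea described above can be illustrated with a minimal sketch. It assumes a tiny hand-labelled training set; the function names `build_signatures` and `identify_sense`, the sense labels, and the simple summed-weight comparison are illustrative assumptions, not the formulation used in the cited book.

```python
from collections import Counter, defaultdict

def build_signatures(labelled_contexts):
    """Map each sense to {co-occurring word: fraction of that sense's contexts containing it}."""
    counts = defaultdict(Counter)
    totals = Counter()
    for sense, context_words in labelled_contexts:
        totals[sense] += 1
        counts[sense].update(set(context_words))
    return {
        sense: {word: n / totals[sense] for word, n in counter.items()}
        for sense, counter in counts.items()
    }

def identify_sense(signatures, context_words):
    """Pick the sense whose signature gives the observed context the highest total weight."""
    context = set(context_words)
    scores = {
        sense: sum(weight for word, weight in signature.items() if word in context)
        for sense, signature in signatures.items()
    }
    return max(scores, key=scores.get)

# Hypothetical training data: each entry is (sense label, words seen near the word "bank").
training = [
    ("bank/river", ["river", "water", "mud"]),
    ("bank/river", ["river", "fishing"]),
    ("bank/money", ["check", "deposit", "loan"]),
    ("bank/money", ["check", "account"]),
]
signatures = build_signatures(training)
best = identify_sense(signatures, ["deposit", "a", "check"])
```

The need for labelled contexts per database is visible in `training`: without it no signature can be built, which is the drawback noted above.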
Brian Slator (in the same book edited by Zernick above) discusses use of a "subject hierarchy" to compute
a "context score" to disambiguate word senses. Generally, a "subject" or topic is identified by the context. A
meaning is then selected by its relevance to the topic. This approach is only as strong as the depth of the
subject hierarchy and it does not handle exceptions. A drawback of this approach is that available subject
hierarchies do not cover a significant portion of the lexicon of any dictionary, let alone the vocabulary of a native
speaker of a language.
One well known example of prior art in text retrieval that uses natural language input is the statistical
techniques developed by Gerard Salton of Cornell University. His research system called SMART is now used in
commercial applications, for example, Individual Inc. of Cambridge, MA uses it in a news clipping service. Dr.
Salton is well known for his claims that natural language processing based text retrieval systems do not work
as well as SMART. He bases such claims on limited experiments that he ran in the 1960's. At the 1991 ASIS
meeting he stated that the reason natural language processing based systems don't work is that syntax is
required and syntax is not useful without semantics. He further claims that "semantics is not available" due
to the need to handcraft the rules. However, the system of the present invention has made semantics available
through the use of statistical processing on machine readable dictionaries and automatic acquisition of
semantic networks.
`
`
`Lexical Acquisition
`
In the field of lexical acquisition, most of the prior art is succinctly summarized in the First Lexical
Acquisition Workshop Proceedings, August 1989, Detroit at IJCAI-89. There is a predominance of papers covering
the automatic building of natural language processing lexicons for rule-based processing. Over 30 papers
were presented on various ideas, isolated concepts or prototypes for acquiring information from electronic
dictionaries for use in natural language processing. None of these proposed the automatic building of a semantic
network from published dictionaries.
`
`Indexing
`
`
Typical text search systems contain an index of words with references to the database. For large document
databases, the number of references for any single term varies widely. Many terms may have only one
reference, while other terms may have from 100,000 to 1 million references. The prior art substitutes
thesaurus entries for search terms, or simply requires that the user rephrase his queries in order to "tease
information out of the database". The prior art has many limitations. In the prior art, processing is at the level
of words, not concepts. Therefore, the query explosion produces too many irrelevant variations to be useful in
most circumstances. In most prior art systems, the user is required to restate queries to maximize recall. This
limits such systems to use by "expert" users. In prior art systems, many relationships not found in a classical
thesaurus cannot be exploited (for example, a "keyboard" is related to a "computer" but it is not a synonym).
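
The word-level index and thesaurus substitution described above can be sketched as follows. This is a generic inverted index under simple assumptions (whitespace tokenization, a hypothetical one-entry thesaurus, made-up documents); it does not reproduce any particular prior art product.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_with_thesaurus(index, term, thesaurus):
    """Return ids of documents containing the term or any of its thesaurus entries."""
    terms = {term, *thesaurus.get(term, ())}
    hits = set()
    for t in terms:
        hits |= index.get(t, set())
    return hits

documents = {
    1: "the keyboard is broken",
    2: "a new computer arrived",
    3: "the machine is a computer",
    4: "replace the machine oil",
}
thesaurus = {"computer": ["machine"]}  # hypothetical synonym list
index = build_index(documents)
hits = search_with_thesaurus(index, "computer", thesaurus)
```

In this toy data the expansion pulls in document 4, where "machine" has nothing to do with computers, while document 1 is missed because "keyboard" is related to "computer" without being a synonym. Both effects follow from matching words rather than word meanings.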
`
`Contextual Systems
`
The prior art of systems which attempt to extract contextual understanding from natural language
statements is primarily that of Gerard Salton (described in Automatic Text Processing, Addison-Wesley Publishing
Company, 1989). As described therein, such systems simply count terms (words) and co-occurrences of terms,
but do not "understand" word meanings.
Routing means managing the flow of text or message streams and selecting only text that meets the
desired profile of a given user to send to that user. Routing is useful for electronic mail, news wire text, and
intelligent message handling. It is usually the case that a text retrieval system designed for retrieval from archived
data is not good for routing and vice versa. For news wire distribution applications (which seek to automate
distribution of the elements of a "live" news feed to members of a subscriber audience based on "interest
profiles"), it is time-intensive and very difficult to write the compound Boolean profiles upon which such systems
depend. Furthermore, these systems engage in unnecessary and repetitive processing as each interest
profile and article are processed.
`
`Document Ranking
`
Systems which seek to rank retrieved documents according to some criterion or group of criteria are
discussed by Salton, in Automatic Text Processing (ranking on probabilistic terms), and by Donna Harmon, in a
recent ASIS Journal article (ranking on a combination of frequency related methods). Several commercial
systems use ranking but their proprietors have never disclosed the algorithms used. Fulcrum uses (among other
factors) document position and frequency. Personal Library Software uses inverse document frequency, term
frequency and collocation statistics. Verity uses "accrued evidence based on the presence of terms defined
in search topics".
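
For readers unfamiliar with the frequency-based criteria named above, the following is a minimal textbook sketch of ranking by term frequency weighted by inverse document frequency. It is not the undisclosed algorithm of any vendor mentioned; the function name `rank_by_tf_idf`, the logarithmic weighting, and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter

def rank_by_tf_idf(documents, query_terms):
    """Order document ids by the summed tf * idf weight of the query terms."""
    n_docs = len(documents)
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        scores[doc_id] = sum(
            tf[t] * math.log(n_docs / doc_freq[t])
            for t in query_terms
            if doc_freq[t]  # skip terms that occur nowhere
        )
    return sorted(scores, key=lambda d: scores[d], reverse=True)

documents = {
    "a": "patent search patent index",
    "b": "search engine",
    "c": "cooking recipes",
}
ranking = rank_by_tf_idf(documents, ["patent", "search"])
```

Document "a" ranks first because it repeats the rarer term "patent"; document "c" contains neither term and scores zero. The score depends only on word counts, not on what the words mean.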
`
`Concept Definition and Search
`
The prior art comprises two distinct methods for searching for "concepts". The first and most common
of these is to use a private thesaurus where a user simply defines terms in a set that are believed to be related.
Searching for any one of these terms will physically also search for and find the others. The literature is replete
with research papers on uses of a thesaurus. Verity, in its Topic software, uses a second approach. In this
approach users create a "topic" by linking terms together and declaring a numerical strength for each link,
similar to the construction of a "neural network". Searching in this system retrieves any document that
contains sufficient (as defined by the system) "evidence" (the presence of terms that are linked to the topic under
search). Neither of these approaches is based upon the meanings of the words as defined by a publisher's
dictionary.
`Other prior art consists of two research programs:
• TIPSTER: A government research program called TIPSTER is exploring new text retrieval methods. This
work will not be completed until 1996 and there are no definitive results to date.
• CLARIT: Carnegie Mellon University (CMU) has an incomplete prototype called CLARIT that uses
dictionaries for syntactic parsing information. The main claim of CLARIT is that it indexes phrases that it
finds by syntactic parsing. Because CLARIT has no significant semantic processing, it can only be
viewed as a search extension of keywords into phrases. Their processing is subsumed by the present
invention, with the conceptual processing and semantic networks.
`
`Hypertext
`
Prior art electronically-retrieved documents use "hypertext", a form of manually pre-established
cross-reference. The cross-reference links are normally established by the document author or editor, and are static
for a given document. When the linked terms are highlighted or selected by a user, the cross-reference links
are used to find and display related text.
`
`
`Machine Abstracting
`
Electronic Data Systems (EDS) reported machine abstracting using keyword search to extract the key
sentences based on commonly occurring terms which are infrequent in the database. This was presented at an
American Society for Information Systems (ASIS) 1991 workshop on natural language processing. They
further use natural language parsing to eliminate subordinate clauses.
The present invention is similar, except that the retrieval of information for the abstract is based upon
concepts, not just keywords. In addition, the present invention uses semantic networks to further abstract these
concepts to gain some general idea of the intent of the document.
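
The keyword-based extraction attributed to EDS above can be sketched as follows. The sketch selects sentences containing terms that are common in the document but rare in the database; the scoring formula, the naive sentence splitter, the function name `key_sentences`, and all frequency figures are assumptions for illustration, since the text gives no details. It does not implement the concept-based variant of the present invention.

```python
import math
import re
from collections import Counter

def key_sentences(document, database_doc_freq, n_database_docs, top_n=1):
    """Return the top_n sentences ranked by document-frequent, database-rare terms."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    tf = Counter(re.findall(r"[a-z]+", document.lower()))

    def weight(word):
        # Frequent in this document, infrequent in the database -> high weight.
        df = database_doc_freq.get(word, 0)
        return tf[word] * math.log((n_database_docs + 1) / (df + 1))

    def score(sentence):
        return sum(weight(w) for w in set(re.findall(r"[a-z]+", sentence.lower())))

    return sorted(sentences, key=score, reverse=True)[:top_n]

document = (
    "The reactor uses thorium fuel. The weather was fine. "
    "Thorium fuel lowers reactor waste."
)
# Hypothetical counts of database documents containing each word.
database_doc_freq = {
    "the": 1000, "was": 900, "uses": 400, "fine": 300, "weather": 200,
    "lowers": 50, "waste": 30, "fuel": 20, "reactor": 5, "thorium": 2,
}
abstract = key_sentences(document, database_doc_freq, n_database_docs=1000)
```

With these made-up figures the sentence about the weather scores lowest because its words are common across the database, and the last sentence is selected as the one-sentence abstract.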
`
`Summary
`
The prior art may be summarized by the shortcomings of prior art systems for textual document search
and retrieval. Most commercial systems of the prior art rely on "brute force indexing" and word or wild card
search which provides fast response only for lists of documents which are ranked according to a precomputed
index (such as document date) and not for relevance-ranked lists. For systems which attempt to relevance rank,
the user must wait for the entire search to complete before any information is produced. Alternatively, some
systems display documents quickly, but without any guarantee that documents displayed are the most
relevant.
The systems of the prior art rank documents retrieved on the presence of words, not word meanings. The
prior art systems fail to use linguistic evidence such as syntax or semantic distance. No known prior art system
can combine more than two or three ranking criteria. No known system in the prior art is capable of acquiring
semantic network information directly from published dictionaries, and thus, to the extent that such networks
are used at all, they must be "hand built" at great expense, and with the brittleness which results from the
author's purpose and bias.
`In thesaurus-based information retrieval systems, as well as topic based information retrieval systems,
`concepts are created by linking words, not word meanings. In these systems (thesaurus and topic based), the
`user has the burden of creating concepts before searching. In addition, for topic based systems, the user has
`the added burden of making arbitrary numeric assignments to topic definitions. Prior art thesaurus and topic
`based systems do not link new concepts to an entire network of concepts in the natural language of search.
`Instead, isolated term groups are created that do not connect to the remainder of any concept knowledge base.
`Topic based systems require that topics be predefined to make use of concept-based processing.
Finally, for hypertext systems, authors need not spend time coding hypertext links to present a
hypertextual document to users because a natural language search (perhaps taken directly from the document itself)
will find all relevant concepts, not just those found by the author.
`
`Brief Description of the Invention
`
`
The method of the present invention combines concept searching, document ranking, high speed and
efficiency, browsing capabilities, "intelligent" hypertext, document routing, and summarization (machine
abstracting) in an easy-to-use implementation.
The method offers three query options:
• Natural Language: finding documents with concepts expressed in plain English;
• Query by Example: present a document, retrieve similar documents;
• Private Concept: define a new term, enter it in the "semantic network", search.
`The method of the present invention continues to provide Boolean and statistical query options so that
`users will have easy access to a familiar interface and functionality while learning new and more powerful
`features of the present invention.
The method of the present invention is based upon "concept indexing" (an index of "word senses" rather
than just words). A word sense is a specific use or meaning of a word or idiom. The method of the present
invention builds its concept index from a "semantic network" of word relationships with word definitions drawn
from one or more standard English dictionaries. During query, users may select the meaning of a word from
the dictionary during query construction. This results in a measurable improvement in precision.
Results of text searching are retrieved and displayed in ranked order. The ranking process is more
sophisticated than prior art systems providing ranking because it takes linguistics and concepts, as well as statistics,
into account.
The method of the present invention uses an artificial intelligence "hill climbing" search to retrieve and
display the best documents while the remainder of the search is still being processed. The method of the
present invention achieves major speed advantages for interactive users.
Other significant functions of the method of the present invention include browsing documents (viewing
documents directly and moving around within and between documents by related concepts), implementing
"dynamically compiled" hypertext, routing, and machine abstracting or automatic summarization of long texts.
`
`
`Brief Description of the Drawings
`
Figure 1 depicts the computer program modules which implement the method of the present invention.
Figures 2a-d depict a detailed flow diagram of the concept indexing process according to the present invention.
Figure 3 depicts the process whereby the method of the present invention disambiguates word
senses based on "concept collocation".
Figure 4 depicts the sources of information in an automatically-acquired machine-readable dictionary
according to the present invention.
Figure 5 illustrates the structure of the machine-readable dictionary of the present invention.
Figure 6 depicts a flow diagram of the query process according to the present invention.
`
`
`Detailed Description of the Invention
`
The method of the present invention is a "Natural Language Processing" based text retrieval method.
There are very few concept based search systems available and those that exist require intensive manual
building of the underlying knowledge bases. The method of the present invention uses published dictionaries
to build (automatically) the underlying knowledge base. The dictionary provides the knowledge needed to
process accurately plain English or "natural language" input, making the user interface considerably simpler.
`In the method of the present invention:
`• There are no hand-crafted rules for each word meaning
`• Idioms and repetitive phrases are processed as a single meaning
`• Unknown words, proper names and abbreviations are automatically processed
`• Ill formed input with poor grammar and spelling errors can be processed
The method of the present invention has combined the document ranking procedure with the search
procedure. This allows for fast "hill-climbing" search techniques to quickly find only the best documents
regardless of database size. All available search systems first retrieve all possible documents and then rank
the results, a much slower process. The method of the present invention uses these search techniques to
support the advanced demands of natural language text retrieval.
`In the method of the present invention:
`• Only the best documents are retrieved
`• Searching is guided by document ranking
`• The document database is automatically divided into multiple sets
`• Searching over document sets significantly improves method performance
`
`Architecture
`
`
The method of the present invention has been implemented as 5 computer program modules: the Query
Program, the Index Program, the Library Manager, Dictionary Manager, and the Integrator's Toolkit. Each of
these is defined below and their relationships are shown in Figure 1.
• Query Program: Program to accept queries and execute searches
• Index Program: Program to index new or updated documents
• Library Manager: Program to manage the organization of text files
• Dictionary Editor: Program to maintain dictionary/private searches
• Integrator's Toolkit: Program for developers to integrate the present invention with other computer
systems and program products
The method of the present invention offers Graphical User Interfaces, command line interfaces, and tools
to customize the user interface. The display shows the title hits in ranked order and the full text of the
documents. Documents can be viewed, browsed and printed from the interface. The Integrator's Toolkit allows the
product to be installed in any interface format. The system is an open system. It makes heavy use of
"Application Program Interfaces" (APIs), or interfaces that allow it to be integrated, linked or compiled with other
systems.
`
`
`Natural Language Processing
`
The method of the present invention is the first text search system that uses published dictionaries to build
automatically the underlying knowledge base, eliminating the up front cost that an organization must absorb
to use other concept based search systems. In addition, the dictionary gives knowledge needed to process
accurately natural language input, making the user interface considerably simpler. The algorithms used
identify the meaning of each word based upon a process called "spreading activation". NLP as used in the present
invention improves text retrieval in many ways, including the following:
• Morphological analysis allows better matching of terms like "computing" and "computational".
Traditional suffix stripping hides these related meanings and may introduce errors when suffixes are
improperly removed.
• Syntactic analysis gives insight into the relationship between words.
• Semantics resolve ambiguity of meaning (e.g., chemical plant vs. house plant).
• Natural Language may be used to interact with the user, including allowing the user to select meanings
of words using dictionary definitions.
`
`Statistical Word Sense Disambiguation Using a Publisher's Dictionary
`
The purpose of this method is to identify the specific meaning of each word in the text as identified in a
publisher's dictionary. The reason to do this is to increase the precision of the return during document retrieval
and browsing. This is primarily a semantic "word sense disambiguation" and takes place via a "spreading
activation" concept through a "semantic network". The method used disambiguates word senses (identifies word
meanings) based on "concept collocation". If a new word sense appears in the text, the likelihood is that it is
similar in meaning or domain to recent words in the text. Hence, recent syntactically compatible terms are
compared through the semantic network (discussed below) by "semantic distance". A classic example is that the
word "bank" when used in close proximity to "river" has a different meaning from the same word when used
in close proximity to "check".
To make this concept work correctly, an underlying semantic network defined over the word senses is
needed. An example of such a network is illustrated in the discussion which follows. Note that only one link
type is used. This is an "association link" which will be assigned a link strength from 0 to 1. Past industrial
experience with commercial systems has shown difficulty in maintaining rich semantic networks with many link
types. Further, this concept indexing scheme does not require a deep understanding of the relationship
between word senses. It simply must account for the fact that there is a relationship of some level of belief.
The present invention uses a new form of statistical natural language processing that uses only
information directly acquirable from a published dictionary and statistical context tests. Words are observed in a local
region about the word in question and compared against terms in a "semantic network" that is derived directly
from published dictionaries (see discussion below on automatic acquisition). The resulting statistical test
determines the meaning, or reports that it cannot determine the meaning based upon the available context. (In
this latter case, the method simply indexes over the word itself as in conventional text retrieval, defaulting to
keyword or thesaurus processing).
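A minimal sketch of this kind of disambiguation follows. It assumes a tiny hand-written network with a single association link type and strengths between 0 and 1; the node names, link strengths, two-hop limit, multiplicative decay, and 0.1 threshold are illustrative assumptions, and the network in the method itself is acquired from dictionaries rather than typed in. The functions `build_network`, `spread_activation` and `disambiguate` are hypothetical names.

```python
def build_network(links):
    """Build an undirected network: node -> {neighbor: association strength in [0, 1]}."""
    network = {}
    for a, b, strength in links:
        network.setdefault(a, {})[b] = strength
        network.setdefault(b, {})[a] = strength
    return network

def spread_activation(network, sources, max_hops=2):
    """Propagate activation outward from the context nodes, decaying by link strength."""
    activation = {node: 1.0 for node in sources}
    frontier = dict(activation)
    for _ in range(max_hops):
        next_frontier = {}
        for node, level in frontier.items():
            for neighbor, strength in network.get(node, {}).items():
                value = level * strength
                if value > activation.get(neighbor, 0.0):
                    activation[neighbor] = value
                    next_frontier[neighbor] = value
        frontier = next_frontier
    return activation

def disambiguate(network, candidate_senses, context_nodes, threshold=0.1):
    """Return the most activated candidate sense, or None if the context gives no evidence."""
    activation = spread_activation(network, context_nodes)
    best = max(candidate_senses, key=lambda sense: activation.get(sense, 0.0))
    return best if activation.get(best, 0.0) >= threshold else None

network = build_network([
    ("river", "water", 0.8),
    ("water", "bank (edge of a river)", 0.6),
    ("river", "bank (edge of a river)", 0.9),
    ("check", "money", 0.8),
    ("money", "bank (financial institution)", 0.9),
])
senses = ["bank (edge of a river)", "bank (financial institution)"]
near_river = disambiguate(network, senses, ["river"])
near_check = disambiguate(network, senses, ["check"])
no_context = disambiguate(network, senses, ["keyboard"])
```

A `None` result corresponds to the case where the meaning cannot be determined from the available context, in which the word itself would be indexed.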
This method overcomes all the limitations discussed above. Hand-crafted rules are not required. The
method applies to any text in any subject (obviously, in vertical subject domains, the percentage of words that
can be disambiguated increases with a dictionary focused on that subject). No training is required and
exceptions outside of a subject domain can easily be identified. The significance of this method is that now, any
text may be indexed to the meanings of words defined in any published dictionary - generic or specialized.
This allows much more accurate retrieval of information. Many fewer false hits will occur during text retrieval.
`
`Concept Indexing
`
`Figures 2a-d show a detailed breakout of the concept indexing process. The process extracts sentences
`from the text, tags the words within those sentences, looks up words and analyzes morphology, executes a
`robust syntactic parse, disambiguates word senses and produces the index.
The first step in the indexing process is to extract sentences or other appropriate lexical units from the
text. A tokenizer module that matches character strings is used for this task. While most sentences end in
periods or other terminal punctuation, sentence extraction is considerably more difficult than looking for the next
period. Often, sentences are run on, contain periods with abbreviations creating ambiguities, and sometimes
have punctuation within quotes or parentheses. In addition, there exist non-sentence strings in text such as lists,
figure titles, footnotes, section titles and exhibit labels. Just as not all periods indicate sentence boundaries,
so too, not all paragraphs are separated by a blank line. The tokenizer algorithm attempts to identify these
lexical boundaries by accumulating evidence from a variety of sources, including a) Blank lines, b) Periods,
c) Multiple spaces, d) List bullets, e) Uppercase Letters, f) Section numbers, g) Abbreviations, h) Other
Punctuation.
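
The period-handling part of this boundary decision can be sketched as a small rule cascade. This is a much-simplified illustration covering only abbreviations, single-letter initials and decimal points, not the evidence-accumulating tokenizer described above; the abbreviation list and the function names `is_sentence_end` and `split_sentences` are assumptions.

```python
ABBREVIATIONS = {"mr", "mrs", "dr", "inc"}  # illustrative, far from complete

def is_sentence_end(text, i):
    """Decide whether the period at text[i] ends a sentence."""
    before = text[:i].split()
    prev_token = before[-1] if before else ""
    next_char = text[i + 1] if i + 1 < len(text) else ""
    if next_char.isdigit() and prev_token[-1:].isdigit():
        return False  # decimal point, as in "90.9"
    if prev_token.lower() in ABBREVIATIONS:
        return False  # known abbreviation, as in "Mr."
    if len(prev_token) == 1 and prev_token.isupper():
        return False  # single-letter initial, as in "R."
    # Otherwise require end of text or following whitespace.
    return next_char == "" or next_char.isspace()

def split_sentences(text):
    """Split text at the periods judged to be sentence boundaries."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch == "." and is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = (
    "ConQuest, by Mr. Edwin R. Addison and Mr. Paul E. Nelson is 90.9 percent "
    "accurate in retrieving relevant documents. It has the following characteristics."
)
sentences = split_sentences(text)
```

A cascade like this fails on abbreviations that genuinely end a sentence, which is one reason to combine several sources of evidence rather than rely on a single rule.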
`
`
`For example:
`
`
ConQuest™, by Mr. Edwin R. Addison and Mr. Paul E. Nelson is 90.9 percent accurate in retrieving
relevant documents. It has the following characteristics:
• English only Queries
• Fast Integrated Ranking and Retrieval
In the above example, the sentence contains 6 periods, but only the last one demarks the end of the
sentence. The others are ignored for the following reason
