`Chong et al.
`
`USOO6366908B1
`(10) Patent No.:
`US 6,366,908 B1
`(45) Date of Patent:
`Apr. 2, 2002
`
`(*) Notice:
`
`(73) ASSignee:
`
`(54) KEYFACT-BASED TEXT RETRIEVAL
`SYSTEM, KEYFACT-BASED TEXT INDEX
`METHOD, AND RETRIEVAL METHOD
`(75) Inventors: Kyung Taek Chong; Myung-Gil Jang;
`MiSeon Jun; Se Young Park, all of
`Taejon (KR)
`Electronics and Telecommunications
`Research Institute, Taejon (KR)
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`(21) Appl. No.:
`09/475,743
`(22) Filed:
`Dec. 30, 1999
`Foreign Application Priority Data
`(30)
`Jun. 28, 1999
`(KR) ............................................ 99-25035
`(51) Int. Cl. ................................................ G06F 17/30
`(52) U.S. Cl. ................... 707/3; 707/5; 707/6; 707/101;
`707/201; 704/7; 704/9; 704/10, 382/177;
`382/306
`(58) Field of Search ................................ 707/1, 2, 3, 4,
`707/5, 6, 102, 103, 104.1, 706/12, 45, 47;
`704/7, 9, 10, 270.1
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`4,839,853. A 6/1989 Deerwester et al. ........ 364/900
`5,289,375 A * 2/1994 Fukumochi et al. ... 364/419.02
`5.535,382 A
`7/1996 Ogawa ....................... 395/600
`5,541,836 A * 7/1996 Church et al. ......... 364/419.07
`5,598,557 A * 1/1997 Doner et al................. 395/605
`5,708,829 A
`1/1998 Kadashevich et al. ...... 395/793
`5,721,902 A
`2/1998 Schultz ....................... 395/604
`5,771,378 A * 6/1998 Holt et al. .................. 395/605
`OTHER PUBLICATIONS
`Jang, Ho Wook and Se Young Park, “Key fact Concept For
`An Information Retrieval System.” Proceedings of Natural
`Language Proc. Pacific Rim Symposium, pp. 510–513,
`1995.
`
`Jun, MiSeon and Se Young Park, “Key fact-Based Informa
`tion Retrieval System,” International Symposium on Digital
`Library, pp. 521-524, 1997.
`Arampatzis et al., “Phrase-Based Information Retrieval.”
`Journal of Information Proc. & Management 34(6):1-19,
`1998.
`Robert Krovetz and W. Bruce Croft:“Lexical ambiguity and
`information retrieval”, Apr. 1992, ACM, vol. 10, pp.
`115-141.*
`Michael Sussna: “Word sense disambiguation for free-text
`indexing using a massive semantic network”, 1993, ACM,
`pp. 67-74.*
`
`* cited by examiner
`
`Primary Examiner Thomas Black
`Assistant Examiner Jacques Veillard
`(74) Attorney, Agent, or Firm-E. Russell Tarleton; SEED
`IP Law Group PLLC
`(57)
`ABSTRACT
`A key fact-based text retrieval method and a key fact-based
`text index method that describes the formalized concept of
`a document by a pair comprising an object that is the head
`and a property that is the modifier and uses the information
`described by the pairs as index information for efficient
`document retrieval. A key fact-based text retrieval System
`includes key fact extracting, key fact indexing, and key fact
`retrieving. The key fact extracting analyzes a document
`collection and a query and extracts keywords and key facts.
`The keywords do not have part-of-Speech ambiguity and the
`key facts are extracted from the keywords. The key fact
`indexing calculates the frequency of the key facts and gen
`erates a key fact list of the document collection for a key fact
`indeX Structure. The key fact retrieving receive a key fact of
`the query and key facts of the document collection and
`defines a key fact-based retrieval model in consideration of a
`weight factor of the key fact pattern and generates a retrieval
`result. The retrieval result is a document Similar to the query.
`
`12 Claims, 5 Drawing Sheets
`
`21-
`
`MAN MEMORY DEVICE
`
`11
`
`KEYFACT EXTRACTION DEVICE
`
`12--KEYFACT INDEX DEW CE
`
`INPUT AND OUTPUT DEVICE - -22
`QUERY
`RETREWAL
`RESULT
`PROCESSING
`- E.
`
`-23
`
`to--GD
`
`25
`
`C C C C
`
`DOCUMENT
`COLLECTION
`
`26
`
`(ge, l-H2
`
`Page 1 of 11
`
`GOOGLE EXHIBIT 1001
`
`
`
`U.S. Patent
`
`Apr. 2, 2002
`
`Sheet 1 of 5
`
`US 6,366,908 B1
`
`FIG. 1
`
`DOCUMENT
`COLLECTO
`
`KEYFACT INDEX DEVICE
`
`12
`
`
`
`KEYFACT
`EXTRACTION
`DEVICE
`
`
`
`|NOEX
`STRUCTURE
`
`16
`
`
`
`
`
`
`
`
`
`
`
`KEYFACT RETREVAL DEVICE
`
`13
`
`RETREVAL
`RESULT
`
`17
`
`FIG 2
`
`
`
`
`
`
`
`
`
`
`
`
`
`MAN MEMORY DEVICE
`
`
`
`KEYFACT EXTRACTION DEVICE
`
`KFYFACT INDEX DEVICE
`
`
`
`KEYFACT RETREVAL DEVICE
`
`NDEX
`STRUCTURE
`
`
`
`
`
`
`
`INPUT AND OUTPUT DEVICE
`RETREVAL
`RESULT
`QUERY
`CENTRAL PROCESSING
`DEVICE
`
`HARD DISK
`
`
`
`DOCUMENT
`COLLECTION
`
`
`
`DCTIONARES
`
`22
`
`23
`
`24
`
`26
`
`27
`
`Page 2 of 11
`
`
`
`U.S. Patent
`
`Apr. 2, 2002
`
`Sheet 2 of 5
`
`US 6,366,908 B1
`
`FIG 3
`
`
`
`31
`
`INPUT DOCUMENT
`
`MORPHOLOGICAL ANALYSS S-C
`bictionales-36
`
`
`
`
`
`PART-OF-SPEECH TAGG|NG
`
`32
`
`33
`
`34-KEYFACT PAERN EXTRACTIONHKEYFACT
`PATTERN RULE
`KEYFACT GENERATION O d
`KEYFACT
`GENERATION
`RULE
`
`35
`
`
`
`
`
`39
`
`
`
`
`
`37
`
`38
`
`Page 3 of 11
`
`
`
`U.S. Patent
`
`Apr. 2, 2002
`
`Sheet 3 of 5
`
`US 6,366,908 B1
`
`FIG. 4
`
`DOCUMENT
`COLLECTION
`
`KEYFACT EXTRACTION DEVICE
`
`START
`
`41
`
`*KEYFACD- 42
`
`KEYFACT FREQUENCTY,
`OOCUMENT FREQUENCY
`CALCULATION
`
`43
`
`44
`
`
`
`TABLES
`GENERATING
`DOCUMENT INDEX TABLE
`DOCUMENT TABLE
`
`KEYFACT INDEX
`TABLE
`
`FORMING INDEX
`STRUCTURE
`
`45
`
`CINDEX FCD-46
`
`END
`
`Page 4 of 11
`
`
`
`U.S. Patent
`
`Apr. 2, 2002
`
`Sheet 4 of 5
`
`US 6,366,908 B1
`
`FIG. 5
`
`51
`
`KEFACT INDEX DEVICF
`
`STAR
`
`se-Gel)
`
`KEYFACT EXTRACTION DEVICE
`
`55
`
`FORMING DOCUMENT,
`QUERY VECTOR
`
`53
`
`54
`
`DETERMINING
`KEYFACT WEIGHT
`CONSTANT Cire
`
`CALCULATING DOCUMENT
`QUERY WEIGHT
`
`56
`
`57 - RANKED RETRIEVAL RESULT
`RETREVAL RESULT
`
`END
`
`KEYFACT
`RETREVAL
`MODEL
`
`
`
`58
`
`Page 5 of 11
`
`
`
`U.S. Patent
`
`Apr. 2, 2002
`
`Sheet 5 of 5
`
`US 6,366,908 B1
`
`FIG. 6
`
`
`
`QUERY IN
`NATURAL LANGUAGE
`
`DOCUMENT TEXT:
`
`RESULT
`
`Page 6 of 11
`
`
`
`1
`KEYEACT-BASED TEXT RETRIEVAL
`SYSTEM, KEYFACT-BASED TEXT INDEX
`METHOD, AND RETRIEVAL METHOD
`TECHNICAL FIELD
`The present invention relates to a key fact-based text
`retrieval method and a key fact-based text index method. In
`particular, the methods describe the formalized concept of a
`document as a pair comprising an object that is the head and
`a property that is the modifier, and uses the information
`described by the pair as index information for efficient
`document retrieval.
`
`BACKGROUND OF THE INVENTION
`A key fact means an important fact contained in Sentences
`which constitute a document. The key fact is represented by
`an object and property information through Syntactic analy
`sis of the Sentence.
`The keyword-based text retrieval method was the main
`stream in conventional text retrieval methods. However, the
`precision of the keyword-based text retrieval method was
`not good due to the following reasons. First, the meaning of
`the document is not precisely represented and the represen
`tativeness of document expression is low because the docu
`ment is represented by keywords, which are nouns. This is
`a fundamental reason for poor retrieval precision. Second,
`when a query includes a natural language phrase or a natural
`language Sentence or keywords, the intention of the user's
`query is not reflected precisely in a keyword-based text
`retrieval method because the query is expressed by key
`words. Therefore, the keyword-based text retrieval method
`has a fundamental limitation in retrieval precision because it
`performs document retrieval by keywords. As a result,
`because the keyword-based text retrieval System provides
`Such low level of retrieval precision, it causes a number of
`unnecessary retrievals and therefore precious resources,
`Such as time and effort, are wasted.
`Recently, a number of studies have been performed in the
`area of phrase-based text retrieval methods in order to
`compromise Such defects of the keyword-based retrieval
`method. The phrase-based text retrieval methods extract a
`precise phrase pattern through a morphological-Syntactic
`normalization proceSS and perform indexing and retrieval by
`extracted phrase. Therefore, the phrase-based retrieval
`method performs more precise text retrieval than the
`keyword-based text retrieval method but performs less pre
`cise text retrieval than a concept-based text retrieval method,
`which expresses text by concept units.
`A new approach to key fact-based text retrieval methods
`has been proposed in order to overcome the shortcomings of
`the keyword-based text retrieval method and generalize
`phrase-based text retrieval method. In the key fact-base text
`retrieval method, a part of text that represent the same
`meaning is described as a key fact. Since the key fact-based
`retrieval method is a Sort of concept-based retrieval method,
`and therefore indexing and retrieval of the key fact-based
`retrieval method are performed with the unit of the key fact,
`precision of the retrieval is greatly improved.
`In the key fact-based retrieval method, it is desirable that
`phrases or words having the same meaning are indexed as
`the same indexing terms. For example, noun phrases includ
`ing “the retrieval of information” as a subset of “the efficient
`retrieval of information”, “the retrieval of the distributed
`information', and “the fast retrieval of the distributed infor
`mation' must have common indices which can be possibly
`generated from “the retrieval of information” as Subsets and
`
`15
`
`25
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`US 6,366,908 B1
`
`2
`recognize also them as different meaning with Subtle con
`ceptual different indexes at the same time.
`Since the keyword-based retrieval method doesn't recog
`nize the conceptual difference between “the retrieval of the
`information” and “the efficient retrieval of the information',
`users are not able to retrieve the exact document that is
`desired.
`
`SUMMARY OF THE INVENTION
`A key fact-based retrieval method, which extracts the
`precise keyfact pattern using the natural language processing
`techniques and indexes documents with the unit of the
`key fact, is provided.
`In addition, a key fact-based retrieval method, which
`extracts precise key fact patterns included in a natural query
`of a user using the natural language processing techniques
`and retrieves documents similar to the query in the key fact
`based indeX file, is provided.
`In addition, a key fact-based retrieval method, which
`retrieves and indexes documents with the unit of key fact, is
`provided.
`A key fact-based text retrieval System of the present inven
`tion includes key fact extracting means, key fact indexing
`means, and key fact retrieving means. The key fact extracting
`means analyze a document collection and a user query, and
`extracting keywords not having part-of-Speech ambiguity
`from the document collection and the user query, and
`respectively extracting key facts of the document collection
`and the user query from the keywords. The key fact indexing
`means for calculating the frequency of the key facts of the
`document collection and generating a key fact list of the
`document collection for a key fact indeX Structure. The
`key fact retrieving means for receiving the key fact of the user
`query and the key facts of the document collection and
`defining a key fact retrieval model in consideration of weight
`factors according to a key fact pattern and generating a
`retrieval result.
`The key fact extracting means includes morphology ana
`lyzing means, part-of-Speech tagging means, keyfact pattern
`extracting means, and key fact generating means. The mor
`phology analyzing means analyze morphology of an input
`Sentence and obtaining tag Sequences of part-of-Speech by
`attaching part-of-Speech tags. The part-of-Speech tagging
`means Selects a tag Sequence of part-of-Speech out of the tag
`Sequences of part-of-speech. The tag Sequence of part-of
`Speech is precise. The key fact pattern extracting means
`extracts a key fact pattern by applying the tag Sequences of
`part-of-Speech to a key fact pattern rule. The key fact gener
`ating means applies the key fact pattern to a key fact pattern
`generation rule and generating a key fact list, which is a Set
`of key fact terms.
`The key fact indexing means includes frequency calculat
`ing means, table generating means, and key fact indexing
`means. The frequency calculating means calculates a fre
`quency of various key facts and a document frequency of the
`key facts. The various key facts are included in the document
`collection, and the document frequency is the number of
`documents contained the various key facts. The table gener
`ating means generates a document index table, a document
`table, and a key fact indeX table of the document collection.
`The key fact indexing means forms a keyfact indeX Structure.
`The key fact indeX Structure has information regarding docu
`ment frequency, document identifier, and key fact frequency
`in each corresponded documents.
`The key fact retrieving means includes following means. A
`means forms a document and a user query vector with an
`
`Page 7 of 11
`
`
`
`US 6,366,908 B1
`
`15
`
`25
`
`35
`
`40
`
`3
`index file and the key fact of the user query. The index file
`generated by the key fact indexing means. The key fact of the
`user query generated by the key fact extracting means. A
`means determines key fact weight constants in accordance
`with the key fact pattern. A means calculates key fact weights
`for the document and the user query by applying the key fact
`weight constants to the document and the user query vector.
`The retrieval results displaying means displays the retrieval
`result by applying the key fact weights to key fact retrieval
`model. The retrieval result indicates documents with a
`key fact Similar to the key fact of the user query.
`A key fact-based text retrieving method of the present
`invention includes key fact extracting Step, key fact indexing
`Step, and key fact retrieving Step. The key fact extracting Step
`is to analyze a document collection and a user query, and
`extracts keywords without part-of-Speech ambiguity from
`the document collection and the user query, and respectively
`extracts key facts of the document collection and the user
`query from the keywords. The key fact indexing Step is to
`calculates the frequency of the key facts of the document
`collection and generates a key fact list of the document
`collection for a key fact indeX Structure. The key fact retriev
`ing Step is to receives the key fact of the user query and the
`key facts of the document collection and defines a key fact
`retrieval model in consideration of weigh factors according
`to the key fact pattern and generates the retrieval result.
`The Step of key fact extracting includes the following
`Steps. The first Step is to analyze morphology of an input
`Sentence and obtaining tag Sequences of part-of-Speech by
`attaching part-of-Speech tags. The Second Step is to Select a
`tag Sequence of part-of-speech out of the tag Sequences of
`part-of-Speech. The third Step is to extract a key fact pattern
`by applying the tag Sequence of part-of-Speech to a key fact
`pattern rule. The fourth Step is to apply the key fact pattern
`to a key fact pattern generation rule and generating a key fact
`list.
`The Step of analyzing morphology includes the following
`Steps. The first Step is to divide the input Sentence into
`words. The Second Step is to perform morphological analysis
`on the words using part-of-Speech dictionaries. The third
`Step is to perform morphological variation and recover
`prototypes. The fourth Step is to obtain the tag Sequence of
`part-of-Speech by tagging part-of-Speech tags in accordance
`with the result of the morphological analysis.
`The part-of-Speech dictionaries include a noun dictionary,
`a verb dictionary, an adjective dictionary, an adverb
`dictionary, a preposition dictionary, a conjunction dictionary
`and a stop-word lexicon.
`The Step of key fact indexing includes the following StepS.
`The first Step is to calculate a frequency of various key facts
`and a document frequency of the key fact. The Second Step is
`to generate a document indeX table, a document table and a
`key fact index table of the document collection. The third
`Step is to form a key fact indeX Structure including document
`frequency, document identifier and key fact frequency.
`The Step of key fact retrieving includes the following
`Steps. The first Step is to form a document and a user query
`vector with an indeX file and a key fact of the user query. The
`Second Step is to determine key fact weight constants in
`accordance with the key fact pattern. The third Step is to
`calculate key fact weights for the document and the user
`query by applying the key fact weight constants to the
`document and the user query vector. The fourth Step is to
`display the retrieval result by applying the key fact weights
`to the key fact retrieval model. The retrieval result indicates
`documents with a key fact Similar to the key fact of the user
`query.
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is a block diagram illustrating a key fact-based text
`retrieval System of the present invention;
`FIG. 2 is a block diagram illustrating a hardware structure
`of a key fact-based text retrieval System in accordance with
`an embodiment of the present invention;
`FIG. 3 is a block diagram illustrating a key fact extraction
`device of a key fact-based text retrieval System in accordance
`with an embodiment of the present invention;
`FIG. 4 is a block diagram illustrating a key fact indeX
`device of a key fact-based text retrieval System in accordance
`with an embodiment of the present invention;
`FIG. 5 is a block diagram illustrating a key fact retrieval
`device of a key fact-based text retrieval System in accordance
`with an embodiment of the present invention; and
`FIG. 6 is a Screen image illustrating a document retrieval
`result in response to a query.
`DETAILED DESCRIPTION OF THE
`INVENTION
`FIG. 1 is a block diagram illustrating a key fact-based text
`retrieval System of the present invention. The key fact-based
`text retrieval System comprises a key fact extraction device
`11, a key fact indeX device 12, and a key fact retrieval device
`13. FIG. 2 is a block diagram illustrating a hardware
`Structure of a key fact-based text retrieval System in accor
`dance with an embodiment of the present invention.
`As shown in FIG. 2, the main memory device 21 includes
`a key fact extraction device, a key fact indeX device 12, a
`key fact retrieval device 13, and an index structure 16. The
`central processing device 23 Supervises the key fact-based
`text retrieval. A hard disk 24 stores document collection 25,
`dictionaries for key fact retrieval 26, and an indeX file that is
`the result of the key fact index. The index file 27 is loaded
`onto the main memory as an indeX Structure 16 and the
`key fact retrieval device 13 uses the index file. The input and
`output device 22 receives a query from a user and generates
`retrieval results to the user.
`Now, the key fact-based text retrieval System in accor
`dance with the present invention is explained with reference
`to FIG. 1. Once a document collection 14 or a query 15 is
`given, the key fact extraction device 11 extracts words with
`out ambiguity by performing morphological analysis and
`tagging. The key fact generation rule is applied to the words
`and then the key facts are extracted.
`The key fact index device 12 indexes the document col
`lection 14 or the query with the unit of key fact and calculates
`the frequencies of the key facts. The frequencies of the
`key facts are stored into the index structure 16 with the
`document ID information. The keyfact retrieval device 13
`orders documents using the Similarity calculation method
`and ShowS retrieval results. The Similarity calculation
`method considers document collection and key fact weights
`with the help of a key fact-based text retrieval model. In a
`key fact-based text retrieval, when a document collection 14
`or a query is given, the key fact extraction device 11
`expresses it in the unit of key facts. All key facts express
`Semantic relation between words in the form of object,
`property. Keyfacts can be categorized by configurations of
`an object and a property. Parts of text that express the same
`conceptual meaning in the document collection or the query
`are categorized into the same key fact type. The key fact
`extraction device will be reviewed in detail below with FIG.
`3.
`The keyfact index device 12 indexes the extracted key
`facts with frequency information. In other words, the key fact
`
`Page 8 of 11
`
`
`
`S
`indeX device 12 calculates frequencies of the various forms
`of key facts included in the documents and generates a
`key fact list of the document collection. Therefore, an index
`structure 16 that reflects key facts is created and the index file
`is stored. The key fact index device 12 will be reviewed in
`detail below with FIG. 4.
`When the key fact retrieval device 13 receives a query, it
`retrieves appropriate documents on the basis of the key fact
`based retrieval method. The keyfact retrieval model is
`defined by considering weights of key fact patterns. The
`Similarity between the query and the documents is calculated
`and appropriate documents for the query are shown as a
`result in the order of the similarity. The key fact retrieval
`device 13 will be reviewed in detail below with FIG. 5.
`As shown in FIG. 3, the key fact extraction device 11
`analyzes a document and generates keyfacts through the
`processes of morphological analysis, part-of-Speech tagging,
`key fact pattern extraction, and key fact generation.
`A document is Supplied at Stage 31 and morphological
`analysis is performed at Stage 32. A Sentence in the docu
`ment is divided into words and the morphological analysis
`is performed with dictionaries 36 at stage 32. The morpho
`logical variation is considered in order to recover proto
`types. The dictionaries 36 include a noun dictionary, a verb
`dictionary, an adjective dictionary, an adverb dictionary, a
`preposition dictionary, a conjunction dictionary, and a stop
`word lexicon. In Some cases, a part-of-Speech of a word is
`determined by rules without dictionaries.
`The part-of-Speech tag in dictionaries 36 includes noun
`(N), verb (V), adjective (A), preposition (P), and stop-word
`(S). The noun is further divided into proper noun (NO),
`name noun (NN), vocative noun (NV), unit nouns (NJ),
`predicate noun (NU), non-predicate noun (NX), etc. The
`reason for Such division is that the class of noun determines
`the object or the property of the key facts.
`For example, in a Sequence of words having two or three
`nouns in a row, it is likely that name noun (NN), proper noun
`(NO), and non-predicate noun (NX) are objects and vocative
`noun (NV), unit noun Id), and predicate noun (NU) are
`properties. Additionally, in a phrase having proper noun
`(NO), name noun (NN), and non-predicate noun (NX), the
`order of priority of nouns in the object is name noun
`(NN)>proper noun (NQ)>non-predicate noun (NX).
`The preposition is divided into the possessive preposition
`(PO) which is used as “of” and the positional preposition
`(PP) and etc. The adjective or the variated verb which makes
`up the noun is tagged as a pronoun (MP), which is a separate
`key fact tag. For example, in analyzing “the fast retrieval of
`the distributed information' with morphological analysis, a
`50
`result of the sequence of the tag would be “S (stop-word) A
`(adjective) NV (vocative noun) PO (possessive preposition)
`S (stop-word) V-ed (verb) NV (vocative noun). The V-ed
`(verb) is a modified form of verb and makes up the noun.
`Like the A (adjective), the V-ed (verb) is converted into a
`key fact tag MP and the Sequence of nouns is converted into
`a keyfact tag KEY. The final result would become “NMP
`KEY PO MP KEY.
`Once the Stage 32 of morphological analysis is performed,
`various results are obtained.
`At Stage of 33 in which part-of-Speech tagging is
`performed, a precise Sequence of tags is chosen among the
`various results of the morphological analysis. In other
`words, the part-of-Speech tags obtained from the morpho
`logical analysis are used at the Stage of part-of-speech
`tagging. The modified form of Verb that makes up a noun or
`an adjective is converted into a modifier (MP) and the
`
`6
`Sequence of nouns is converted into KEY tag. The exem
`plary sentence “the fast retrieval of the distributed informa
`tion” shows the final sequence of tags “MP KEY PO MP
`KEY.
`Once the final Sequence of tags in response to the input
`Sentence is obtained, the Stage of key fact pattern extraction
`34 searches the key fact pattern rule 37 and extracts mean
`ingful key fact patterns necessary for key fact generation. The
`key fact pattern rule 37 which is used for key fact pattern
`extraction describes key fact patterns as to the Sequence of
`the input tags. A part of the key fact pattern rule is illustrated
`at following table 1.
`
`Keyfact pattern
`
`KEY1. POKEY2
`(the retrieval of information)
`
`KEY1PO MPKEY2
`(the retrieval of the
`distributed information)
`MPKEY1. POKEY2
`(the fast retrieval of
`information)
`MP1 KEY1. POMP2 KEY2
`(the fast retrieval of the
`distributed information)
`
`TABLE 1.
`
`Key fact term list
`KEY2, KEY1), KEY1, NIL),
`KEY2, NIL),
`KEY2 KEY1, NIL
`information, retrieval
`information, NIL,
`retrieval, NIL,
`information retrieval, NIL
`KEY2, KEY1), KEY1, NIL),
`KEY2, NIL),
`KEY2 KEY1, NIL, KEY2, MP
`KEY2, KEY1), KEY1, NIL),
`KEY2, NIL),
`KEY2 KEY1, NIL, KEY1, MP
`KEY2, KEY1), KEY1, NIL),
`KEY2, NIL),
`KEY2 KEY1, NIL, KEY1, MP1,
`KEY2, MP2
`
`(Note:
`The italic is the examples.)
`The final sequence of tags “MP KEY PO MP KEY”
`obtained from “the fast retrieval of the distributed informa
`tion' is applied to the key fact pattern rule and the key fact
`pattern “MP1 KEY1PO MP2 KEY2" is the result.
`Key fact terms that have forms of object, property are
`generated as to the input key fact pattern at the Stage of the
`key fact generation 35 by Searching the key fact generation
`rule 38. The object is a noun or a compound noun repre
`Sented by a keyword and the property is a verbal word or a
`noun that makes up another noun, or a prototype of a Verbal
`word.
`The key fact generation rule includes possible key fact
`lists, each of which can be generated in each key fact pattern.
`In the example stated above, if the key fact pattern “MP1
`KEY1 JY MP2 KEY2" is applied to the key fact generation
`stage, “KEY2, KEY1), KEY1, NIL), KEY2, NIL),
`KEY2 KEY1, NIL, KEY1, MP1), KEY2, MP2 is
`going to be the outcome. That is, a key fact list 39
`“information, retrieval, retrieval, NIL), information,
`NIL), information retrieval, NIL), retrieval, fast,
`information, distributed” is obtained from the key fact
`pattern “the fast retrieval of the distributed information”.
`The key fact index device is now reviewed in detail with
`FIG. 4.
`The key fact indeX device calculates Statistical frequencies
`of key facts in a document obtained from the key fact extrac
`tion device 11 and forms the index structure. Therefore,
`indeX information is efficiently maintained and processed by
`the key fact index device. Each index term of the key fact
`indeX device is an extracted key fact term representing each
`document.
`For each document, the key fact frequency (tf) and docu
`ment frequency of the key fact (df) are calculated in order to
`obtain the frequency information of the key facts.
`
`US 6,366,908 B1
`
`15
`
`25
`
`35
`
`40
`
`45
`
`55
`
`60
`
`65
`
`Page 9 of 11
`
`
`
`US 6,366,908 B1
`
`8
`
`7
`Next, Supplementary tables Such as a document indeX
`table, a document table, and a key fact indeX table are
`generated to form an efficient index structure 44. The
`document index table contains key facts of the document, the
`frequency information. The document table includes a real
`document text. The key fact index table is the main table that
`includes the document frequency (df) of each key fact, and
`pair list of the document identifier of each key fact and the
`frequency information within a document (tf).
`Next, an index structure is formed in the unit of the
`key fact and an indeX file is Stored. Efficient Storage Struc
`tures like the B+ tree can be used for the index structure. The
`inverted file structure of the key fact index table is used as
`posting information file Structure.
`A part of the result of the key fact index is shown in the
`following table 2.
`
`TABLE 2
`
`Df
`
`Document id:
`frequency
`
`2
`3
`
`2
`1.
`4
`1.
`
`(162:1)(197:1)
`(102:2)(188:3)(193:1)
`
`(6:2)(29:1)
`(6:1)
`(21:1)(33:2)(88:1)(90:3)
`(102:1)
`
`Key fact index
`
`thorn, sharp
`thorn, dull
`
`reed, NIL
`reed field, NIL
`branch, NIL
`Dahurian buckhorn
`family, NIL
`
`At table 2, in case of branch, NIL), “branch” appears at
`4 documents and therefore the document frequency (df) for
`key fact index branch, NIL) is four. In addition, “branch”
`appears once in document 21, twice in document 33, once in
`document 88, and three times in document 90.
`The key fact retrieval device 13 is now reviewed in detail
`with FIG. 5. The key fact retrieval device forms the docu
`ment vector and query vector with the keyfact, which is
`supplied from the keyfact extraction device 53, and the
`index file 52 generated by the key fact index device 51.
`The keyfact weight constants (Cl), which are fit for
`the attribute of a document collection, are determined 55
`before calculating the key fact weights from document and
`query vector. Table 3 shows that key fact weight constants
`are assigned to various patterns of key facts.
`
`TABLE 3
`
`Types
`Type 1
`Type 2
`
`Type 3
`Type 4
`Type 5
`
`Keyfact pattern
`KEY, NIL
`KEY, MP or
`KEY, VHIVB
`KEY1, KEY2
`KEY1 KEY2, NIL) or
`KEY2 KEY 1, NIL
`KEY1 KEY2 KEY3
`
`Weight constants
`CKrType
`CKrType II
`
`CKFTypeIII
`CKrTyperv
`CKrTypev
`
`The key fact weight constants are assigned with the
`Sequence
`like
`CKftyperCKftyperCKrypetIrCKftspervsCKftypews ... and do
`important role for the precision of key fact-based text
`retrieval. Therefore, weight constants are determined experi
`mentally on the basis of distribution of key fact pattern of
`document collection.
`The key fact weight constant is applied to the following
`equation 1 and the result of equation 1, a key fact weight
`(W), is used in the key fact-based text retrieval model.
`
`Wyk = if
`
`lost
`
`N + 1
`df.
`
`) Corra
`
`Equation 1
`
`5
`
`15
`
`25
`
`35
`
`45
`
`50
`
`55
`
`60
`
`65
`
`W: a key fact weight
`tf: frequency of a key fact
`N: size of a document
`df document frequency of a key fact
`Crisper a key fact weight constant
`Conventionally, only the frequency of keywords
`(tfe), the document frequency of keywords (dfe),
`and the number of the documents in a document collection
`are considered in calculating the keyword weight in the
`keyword-based text retrieval system. However, the key fact
`Weight constant (C) of the keyfact pattern is also
`reflected in calculating key fact weights in the key fact-based
`retrieval System, So as to make it possible to indeX and
`retrieve in the unit of a key fact.
`Next, the Similarity of the document appropriate for the
`query is calculated by employing the key fact retrieval model
`based upon the vector Space model. The result of the
`Similarity calculation determines the order of appropriate
`documents 57.
`FIG. 6 shows a Screen image for illustrating a document
`retrieval result in response to a query. A user makes a query
`in query Section 61 with natural language. The key fact is
`extracted by the key fact-based text retrieval System and the
`documents close to the query are found. The result of the
`retrieval of the query is displayed at the document retrieval
`result screen 62 in the order of similarity. Document title and
`weight are also displayed with the order of Similarity. In
`addition, if the document displayed is selected, document
`text screen 63 shows the contents of text of the document.
`According to the present invention, texts of document
`collection and user queries are expressed, indexed and
`retrieved by concept-based key facts. Therefore, more pre
`cise retrieval results are achievable. Additionally, Since
`indexing and retrieval with high precision are possible, time
`and efforts can be minimized, the keyfact-based retrieval
`method in accordance with the present invention can be used
`in various applications. Especially, digital library, text and
`annotation based multimedia information retrieval of broad
`casting Station, internet application, information retrieval of
`electronicS commercial trading, and education/medical/
`military application areas can take advantage of the present
`invention.
`Although representative embodiments of the present
`invention have been disclosed for illustrative purposes,
`those skilled in the art will appreciate that various
`modifications, additions and Subs