`to Information Systems
`
`Proceedings of the Second International Workshop
`June 26-28, 1996, Amsterdam, The Netherlands
`
`Edited by
`
`R.P. van de Riet
`Vrije Universiteit, Amsterdam, The Netherlands
`
`J.F.M. Burg
`Vrije Universiteit, Amsterdam, The Netherlands
`
`and
`
`A.J. van der Vos
`Vrije Universiteit, Amsterdam, The Netherlands
`
`1996
`
IOS Press

Ohmsha
`Amsterdam, Oxford, Tokyo, Washington, DC
`
`Page 1 of 14
`
`GOOGLE EXHIBIT 1017
`
`
`
`© The authors mentioned in the Table of Contents.
`
`All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any
`form or by any means, without the prior written permission from the publisher.
`
ISBN 90 5199 273 4 (IOS Press)
`ISBN 4 274 90102 5 C3000 (Ohmsha)
`Library of Congress Catalogue Card Number: 96-76771
`
`Publisher
IOS Press
Van Diemenstraat 94
1013 CN Amsterdam
`Netherlands
`
`Distributor in the UK and Ireland
`IOS Press/Lavis Marketing
`73 Lime Walk
`Headington
Oxford OX3 7AD
`England
`
`Distributor in the USA and Canada
IOS Press, Inc.
`P.O. Box 10558
`Burke, VA 22009-0558
`USA
`
`Distributor in Japan
`Ohmsha, Ltd.
3-1 Kanda Nishiki-cho
Chiyoda-ku
Tokyo 101
`Japan
`
`LEGAL NOTICE
`The publisher is not responsible for the use which might be made of the following information.
`
`PRINTED IN THE NETHERLANDS
`
`
`
`
`Applications of Natural Language to Information Systems
R.P. van de Riet et al. (Eds.)
IOS Press, 1996
`
`139
`
`The Fact Extraction Using the Keyfact
`
`Mi-Seon JUN, Se-Young PARK, Man-Soo KIM
`Natural Language Processing Section
`System Software Department
`Electronics and Telecommunications Research Institute
`TaeJon, Korea
`{msjun,sypark,mskim}@com.etri.re.kr
`
Abstract. Many information retrieval systems retrieve relevant documents by exact
matching of keywords between a query and documents. We show how to extract a fact
from a document using an extended concept of the keyword, called the keyfact, which
can contain syntactic patterns and semantic information. In the second document
ranking, the predefined keyfact cluster set of the query terms is compared to each
relevant document. Because relevant documents are such a small fraction of a
collection, this method differs from the query expansion retrieval scheme and
substantially reduces the computational cost of the experiment.
Keywords: Document Ranking, Fact, Semantic Information, Syntactic Pattern
`
`1 Introduction
`
Many commercial information retrieval (IR) systems retrieve relevant documents based on
keyword matching between a query and documents. This method has two problems. The first
is that keywords are ambiguous, and this ambiguity causes semantically irrelevant documents
to be retrieved. Therefore lexical ambiguity has to be resolved. The second is that a
relevant document is treated as irrelevant when it does not include the same keywords as
the query terms. So the original query has to be expanded with semantically related words.
The main function of an IR system is to rank relevant documents that satisfy the user's
information need. In most retrieval models, the system ranks documents according to their
inner product similarity, which depends on the keywords in a query. However, users are
generally interested not in retrieving documents with matching keywords, but in the concepts
that the relevant words or information represent.
Facts are truths in some relevant world. These are the things we want to represent and to
search for. One representation of facts is so common that it deserves special mention:
natural language sentences. Generally nouns and compound nouns are taken as keywords.
Nouns and compound nouns are the most important elements for representing the fact in a
natural language sentence. However, besides keywords such as nouns and compound nouns,
verbs and adjectives also play an important role in a sentence. Using keywords in an IR
system is relatively simple to implement, but to extract facts well from a sentence it is
insufficient to use only nouns as keywords. Even in morphological analysis of
`
`
`Information Retrieval
`
a sentence, there are many lexical ambiguities. Furthermore, on the syntactic side, there
are many inflected forms of adjectives and adverbs in Korean. In Korean especially, one
keyword has 20-30 senses in a dictionary in the worst case. Polysemous words in a query
and in documents can reduce the precision of a search significantly. Therefore lexical
ambiguity has to be resolved. To resolve the lexical ambiguity of keywords, we need several
kinds of information, such as keywords, verbs and adjectives. We introduce an extended
concept of the keyword, called the keyfact, which can be represented as a verb/adjective
and can contain syntactic patterns and semantic information. We consider the lexical
ambiguity of both nouns and verbs.
The verb "쓰다" is a typical Korean polysemous word, and has twenty-one English
translations such as "write", "spend", "wear", and "adopt" in the Sisa Korean-English
dictionary[14]. As another example, the noun "모자" is also polysemous; its English
translations include "mother and child" and "hat". So in our keyfact concept the noun and
the verb/adjective are not independent of each other. If the keyfact can contain syntactic
patterns and semantic information such as "모자를 쓰다/wear a hat" and "펜으로 쓰다/write
with a pen", then a noun that occurs with the verb "쓰다" can serve as a clue for
disambiguating its senses. The ambiguity of a noun can be resolved in the same manner.
The literature generally divides lexical ambiguity into two types: syntactic and semantic[2].
Syntactic ambiguity refers to differences in syntactic category. Semantic ambiguity refers to
differences in meaning. A number of approaches have been taken to word sense
disambiguation. Lesk uses the Oxford Advanced Learner's Dictionary[3] and Weiss uses
word co-occurrences[4]. Many researchers have used statistical information, semantic
information, or both as knowledge for query expansion. Stiles and Lesk used statistical
information on term association from documents. Salton experimented only with synonyms
in the SMART system. Fox experimented with five semantic categories in the SMART system,
and needed human intervention for selecting related words[8]. These studies either did not
address the problem of lexical ambiguity or did not consider automatic lexical
disambiguation for query expansion. The ambiguity in a query must be resolved when the
query is analyzed. Ambiguous words cannot be expanded effectively until the ambiguity is
resolved. Query expansion enhances recall by adding relevant documents that exact matching
excludes, but it degrades precision. To improve precision, first the ambiguous words in a
query are resolved using a knowledge base when the query is analyzed. Second, the keyword
concept, which is defined as a noun or compound noun, needs to be extended, and verbs and
adjectives must be considered as indexing words. We resolve ambiguous query terms, and
then expand the unambiguous query terms. There is a wide choice of words to add to a query
vector. One can add only the synonyms, or synonyms plus all descendants, or synonyms plus
parents and all descendants, or synonyms plus directly related words, etc., and any number
of child links may be traversed. Expansion by synonyms plus directly related words is
beneficial[1]. So we choose this parameter, and expanded queries are composed using a
special relationship FT (Fact Term) as well as the semantic relationships used in a general
thesaurus. Because query expansion is a recall-enhancing technique, we use the expanded
terms of an original query only when computing query-document similarity. In this way we
obtained a high precision rate.
`
`
`
`
M.-S. Jun et al./The Fact Extraction Using the Keyfact
`
`
In the following sections, we (1) describe the construction of a keyfact network for
ranking the documents; (2) present a visualization of the keyfact network for keyfact
retrieval; (3) show how the keyfact network can be used to rank documents and extract a
fact; (4) evaluate the ranking method for improving retrieval performance; and (5) make
suggestions for future research.
`
2 Keyfact and Keyfact Cluster
`
The noun is the most important element for explaining a fact. Next we consider the
compound noun, which is composed of several nouns. The next more complicated syntactic
category of a fact is the noun phrase. A noun phrase consists of a noun and its modifiers,
which can be represented as inflected forms of verbs and adjectives. Korean verbs and
adjectives have many more inflected forms than those of English and French. The simplest
fact (sentence) can be represented by a noun (subject) and a verb/adjective (predicate). So
the keyfact is not independent of case slots, and contains syntactic patterns and semantic
information. In this paper, we collected keyfacts and keyfact clusters from the Gemong
Korean encyclopedia to improve retrieval performance. The encyclopedia has two
characteristics. First, it has a syntactic characteristic: each entry is composed of a
title word and its explanation part. Second, it has a semantic characteristic: most words
in the explanation part are semantically related to the title word. The encyclopedia
therefore makes it easy to collect a word together with its semantically related words, and
we considered it a proper collection for constructing semantic information. The keyfacts
extracted from a text can be represented in several forms. The forms have to be designed
for easy matching. The keyfact weight is calculated with the same formula used for keyword
weights, based on frequency. When a user gives a query that contains some keyfacts as well
as keywords, our system extracts the keyfacts from the query and tries to match them
against the keyfacts that were extracted from documents. We use an exact matching method
first and, when it fails, a partial matching method. Keyfacts that co-occur with a keyword
can be used for disambiguating the keyword and ranking documents.
A cluster is defined as a set of co-occurring keyfact terms. Co-occurring terms are usually
relevant to each other and are sometimes synonyms. For keyfact clustering, we considered
the sense definition of the noun in the Gemong Korean encyclopedia. In the simple
automatic indexing method, raw terms are analyzed using a stemming algorithm and stop
words are removed using a stop list. The stop words are usually prepositions, postpositions,
determiners and words that appear too frequently to discriminate between documents. The
method then finds a matching inflected verb and gets the basic form of the inflected verb
by referencing the verb dictionary. Our verb dictionary consists of two parts: the inflected
verb forms and the basic verb forms. One basic form can have many inflected forms.
Therefore the inflected form is compared with the input text and the basic form is used for
disambiguating and ranking. Figure 1 shows that one basic form can have many inflected
forms. We assigned the semantic relationships used in a general thesaurus, such as
BT (Broader Term), NT (Narrower Term), RT (Related Term), HP (Has Part), UF (Used For),
and a special relationship FT (keyFact Term). Most thesauri use all of these relationships
except FT. FT is defined as the relationship
`
`
between a noun and a verb/adjective. FT is used for resolving lexical ambiguity in a query
and for ranking the documents.
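
The labeled relationships described above can be pictured as a small typed graph. The following Python sketch is only an illustration under stated assumptions: the class name `KeyfactThesaurus`, its methods, and the sample entries are hypothetical and are not taken from the system described in this paper. It shows depth-one expansion of a term over selected relationship types, including FT.

```python
from collections import defaultdict

# Relationship labels come from the text; everything else here is hypothetical.
RELATIONS = ("BT", "NT", "RT", "HP", "UF", "FT")


class KeyfactThesaurus:
    """Toy thesaurus: term -> list of (relationship, related term)."""

    def __init__(self):
        self._links = defaultdict(list)

    def add(self, term, relation, related):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self._links[term].append((relation, related))

    def expand(self, term, relations=RELATIONS):
        """Return the term followed by its directly linked terms (depth one)."""
        expanded = [term]
        for relation, related in self._links.get(term, []):
            if relation in relations and related not in expanded:
                expanded.append(related)
        return expanded


# Made-up sample entries for the sense "tea".
thesaurus = KeyfactThesaurus()
thesaurus.add("tea", "NT", "green tea")
thesaurus.add("tea", "RT", "cup")
thesaurus.add("tea", "FT", "drink tea")

all_terms = thesaurus.expand("tea")
fact_terms_only = thesaurus.expand("tea", relations=("FT",))
```

Restricting `relations` to `("FT",)` returns only the noun and its keyfact terms, which is the part a conventional thesaurus would not provide.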
`
Figure 1: Sample of keyfact (inflected forms, noun+postposition+verb, mapped to the basic form, verb/postposition,noun)
In figure 1, "차를 마시다 (take tea)" and "차를 타다 (take a car)" have the same keyword
"차". The noun cha "차" is a typical example of Korean polysemy. This noun has nine
meanings, such as "vehicle", "tea", "difference", and "for the purpose of", in the
Minchungseorim Korean dictionary, and two meanings, "vehicle" and "tea", in the
Gemong Korean encyclopedia. Therefore it is not easy to determine the exact meaning of a
sentence that includes a polysemous word. We first define a cluster as a group of
keyfacts that co-occur with a specific sense of a polysemous word. Some keyfacts appearing
in a text are useful for keyfact clustering while others are not. Thus, in order to cluster
keyfacts effectively, we normalize the clusters by the following process. We use weighting
by keyfact frequency alone. An appropriate threshold is chosen, and a keyfact whose
frequency exceeds the threshold of 3 is assumed to be connected with a keyword. The
following are the normalization results for the keyword cha "차" in the Gemong Korean
encyclopedia. For example, the sense 'vehicle' has 'get off a car' as a keyfact and 'bus'
as a keyword.
`
vehicle = {get off a car, take a car, go by taxi, be crowded with traffic, travel by vehicle,
carry in a truck, bring a car to a halt, stop a cab, drive one's own car, bus, train, taxi, ...}
tea = {make tea, pour out tea, drink tea, sip tea, tea brews, serve a guest tea, talk over
a cup of tea, coffee, green tea, black tea, tea plant, a tea strainer, cup, ...}
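
The frequency-based normalization described before these clusters can be sketched in a few lines of Python. This is a minimal illustration, assuming that "exceeds the threshold" means a strictly greater count; the function name `build_cluster` and the sample observations are hypothetical.

```python
from collections import Counter


def build_cluster(observations, threshold=3):
    """Keep the terms whose co-occurrence count exceeds the threshold.

    observations: iterable of terms (keywords or keyfacts) seen together
    with one sense of a polysemous word, one item per occurrence.
    """
    counts = Counter(observations)
    return {term for term, count in counts.items() if count > threshold}


# Made-up occurrence data for the sense "tea".
observations = ["drink tea"] * 5 + ["green tea"] * 4 + ["wind"] * 3 + ["summer"]
cluster = build_cluster(observations)
```

With these counts, "wind" (seen exactly 3 times) and "summer" are dropped, and only "drink tea" and "green tea" remain in the cluster.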
`
There are too many inflected forms of adjectives and verbs in a text. The keyfacts
extracted from a text must therefore be represented in basic forms. The forms have to be
designed for easy matching. If keyfacts have the same keyword and different inflected
forms of the same verb/adjective, they can be represented in the same basic form and are
treated as the same keyfact.
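
As an illustration of this normalization, the sketch below maps inflected verb forms to a basic form through a lookup table, so that keyfacts differing only in inflection compare equal. It is a simplified assumption of how such a verb dictionary could be used: the table contents are English stand-ins for the Korean entries, and the name `to_keyfact` is hypothetical.

```python
# Hypothetical verb dictionary: inflected form -> basic form.
# English stand-ins are used here in place of Korean inflections.
VERB_DICTIONARY = {
    "drinks": "drink",
    "drank": "drink",
    "drinking": "drink",
    "took": "take",
    "taking": "take",
}


def to_keyfact(keyword, inflected_verb, verb_dictionary=VERB_DICTIONARY):
    """Return (keyword, basic verb form); unknown verbs are kept unchanged."""
    basic_form = verb_dictionary.get(inflected_verb, inflected_verb)
    return (keyword, basic_form)


first = to_keyfact("tea", "drank")
second = to_keyfact("tea", "drinking")
```

Both calls produce the same pair, so an exact match between a query keyfact and a document keyfact succeeds regardless of the inflection that appeared in the text.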
`
`
`3 Keyfact and Information Retrieval
`
3.1 System Overview
`
Our IR system consists of a server system and a client system. The server platform is
Windows NT 3.5 Server on a Pentium, and the client platform is Windows NT 3.5
Workstation on a Pentium. The system uses HTTPS for Windows NT as the web server, and
Netscape Navigator for Windows NT as the web browser. The system adopted NCAPI
implemented by DDE for the communication between Netscape Navigator and a three
dimensional keyfact visualizer[10, 11]. Figure 2 shows how the server system provides the
information requested by the end-user or the information builder. In figure 2, a square
shows a functional procedure and a circle represents the result of a functional
procedure. The server has processing engines for the information building function and for
IR. The client system has the browser, and is used by users who want to search for
information[13]. The types of IR in the server system are natural language retrieval and
keyfact visualization retrieval, both shown in figure 2. In the dispatcher function, the
user's request is analyzed and categorized into the keyfact visualization procedure or the
natural language query procedure according to its type.
`
`3.2 Keyfact Visualization Retrieval
`
Recently, the WWW has grown to be the most popular service available on the Internet. In
spite of its short history, the WWW achieved such rapid development because it supports
multimedia presentation and hypertext retrieval under an integrated communication
environment. The most common browsing object supported by web browsers is so simple that
it is hard to understand the relationships among pieces of information. The latest browsing
method is to browse the relations among information in three dimensions using a three
dimensional web viewer. However, such methods cannot immediately reflect values that change
dynamically, nor support the various functions required for effective browsing. An advanced
browser is needed that displays exactly the relations among information, not the
information itself, and immediately reflects updated relations. In this paper, we provide
browsing objects that can describe various relations among information, and implement a
visualization system that automatically transforms keyfact information into these objects.
The information is described in SGML, which is the standard for document exchange. Unlike
other three dimensional viewers, the system focuses on the visualization of the relations
among information, and supports various browsing functions: expanding, shrinking,
centralizing, etc. The browsers include the keyfact visualizer and the natural language
interface, which serve as helper applications for typical Internet browsers such as Mosaic
and Netscape. In Figure 2, the CGI scripts that support the keyfact visualization return
index documents or HTML documents after querying with the FORM input data from a web
browser. The keyfacts form a structure suitable for visualization, with weighted, labeled
and many-to-many connected relations among information. We can distinguish one relation
from another by the link-line type, and adjacency by the link-line length. For example, a
shorter length may represent the
`
`
more closely related relationship. The query invokes the three dimensional keyfact browser.
Users can explore the keyfact network by clicking buttons such as rotation, zooming,
translation, and fly-to.
`
`Figure 2: The information retrieval procedure
`
The graphics operations are performed without communication between the server and a
client. The visualization part on the client updates the visual objects again with the
information document returned by CCI. Scripts related to the visualization on the server
analyze the posted data and return a document of MIME type through CGI[9, 10, 11]. If the
clicked node is not a nonterminal node, the web browser simply shows the list of HTML
documents related to the node name, obtained by natural language processing or keyword
processing. In the case of keyword processing, the node name is taken to be the title in
`
`
the Gemong Korean encyclopedia. In the case of natural language processing, the node name
becomes the input of the natural language query procedure, and the result is displayed in
the Netscape browser.
`
3.3 Natural Language Retrieval
`
The end-user wants to find the desired information through user-friendly interfaces. One
of these is the natural language interface. When the server receives a natural language
query, the natural language processing engine is activated. After a sequence of operations,
it creates a candidate list, which is returned to the client. The procedure of natural
language retrieval is as follows. First, it extracts the keywords and keyfacts of the query.
Second, it gets the top fifty candidate documents; weighting and ranking are executed to
produce the candidate list, and in this step only keywords and the tf*idf weighting formula
are considered. Third, it expands the query using the keyfact cluster for more accurate
retrieval. Then the overlap counting method is applied to the candidate list for weighting
and ranking. The document with the greatest number of overlaps with the expanded query is
placed at the top of the candidate list. Finally, the system displays the top ten documents
to the end-user.
`
Disambiguation using Keyfact Cluster
In this paper, words that exist in a document can be selected as keywords, and facts that
carry more precise information can be selected as keyfacts. We introduced the keyfact for
more precise natural language processing. Two examples show how keyfacts play an
important role in determining the correct meaning.
`
(1e) What's the difference between a donkey and a horse?
(1k) 당나귀와 말의 차이점은 무엇인가?
tangnagwi-wa mal-ui chaichom-un muotinka
donkey and horse the difference what is ?
(2e) What kind of tea can man drink?
(2k) 사람이 마실 수 있는 차의 종류는?
saram-i masil su itnun cha-ui chongyu-nun
man drink can tea of kind what
`
Even in the morphological analysis of a sentence, there are many ambiguities in Korean. In
the Korean version (1k), mal "말" is polysemous and has three meanings, "horse",
"language" and "unit of measure", in the Gemong Korean encyclopedia. The raw terms are
analyzed using a stemming algorithm, and the most frequent function words are
removed using a stop list. Our method decides the specific sense of a given polysemous word
by calculating the similarity between the polysemous word in the input query and the words
of the predefined keyfact cluster. From sentence (1k) we can extract keywords such as
tangnagwi "당나귀", mal "말" and chaichom "차이점". We can choose "horse" as the correct
meaning of mal according to the cluster of tangnagwi. Let us now consider the keyfact
cluster for 'cha'. If the
`
`
child links are limited to depth one, then "make tea", "pour out tea", "drink tea", "sip tea",
"tea brews", "serve a guest tea", "talk over a cup of tea", "coffee", "green tea", "black tea",
"tea plant", "a tea strainer", and "cup" would be added. Unlike (1k), (2k) involves a
keyfact, masida "마시다/drink", besides keywords such as saram "사람", chongyu "종류" and
cha "차". If sentence (2k) did not contain masida "마시다", we could not select "tea" as
the correct meaning of cha "차".
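
The sense selection illustrated by (2k) can be sketched as an overlap test between the query terms and each sense's cluster. The Python below is a minimal sketch under assumptions: the paper does not give the similarity formula, so a plain overlap count is used here, and the names `CLUSTERS` and `disambiguate` as well as the abbreviated cluster contents are hypothetical.

```python
# Abbreviated, illustrative clusters for the two senses of "cha".
CLUSTERS = {
    "vehicle": {"get off a car", "take a car", "go by taxi", "bus", "train", "taxi"},
    "tea": {"make tea", "drink tea", "sip tea", "coffee", "green tea", "cup"},
}


def disambiguate(query_terms, clusters):
    """Return the sense whose cluster overlaps most with the query terms.

    Returns None when no cluster overlaps or when the best score is tied,
    i.e. when the query gives no usable clue.
    """
    terms = set(query_terms)
    scores = {sense: len(terms & members) for sense, members in clusters.items()}
    best = max(scores.values(), default=0)
    winners = [sense for sense, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


with_keyfact = disambiguate({"man", "kind", "drink tea"}, CLUSTERS)
without_keyfact = disambiguate({"man", "kind"}, CLUSTERS)
```

With the keyfact "drink tea" present the sense "tea" is selected; without it no sense can be chosen, which mirrors the observation made for sentence (2k).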
`
Information Filtering using Keyfact Cluster
Given a large amount of data, IR systems must be designed to return the data that the user
wants to see. Therefore information filtering is a very important part. Our main technical
concern is to extract exact meanings from a document. The procedure of natural language
retrieval is as follows. First, it extracts the keywords of a query. Second, it gets 50
relevant documents, that is, documents expected to satisfy the user's information need.
Documents are ranked according to the inner product similarity between the original query
and each document[5]. A query and the documents contain both index terms and non-index
terms. Index terms consist only of keywords. In order to improve retrieval speed, the IR
system employs inverted files. Each document vector uses the popular tf*idf weighting
scheme based on term frequency, and the query vector uses a Boolean scheme (1 if the term
appears, 0 if not)[5,8]. The following is an example of an inverted file entry. It shows
that the index word '7}~-cl ¾' occurred in 19 documents, and that the word has weight 40845
in document number 5730.
`
71-~-cl ¾ 19 5730 40845 5481 40000 4169 37005 4531 36150 3294 48861 2690 50622
24 40845 19733 81689 19101 40845 18122 47612 17188 47612 16740 34042 16401
46642 15298 68084 14192 46642 12268 81689 12265 40845 12264 41870 12246 40845
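
To make the entry layout concrete, the following Python sketch parses entries of the form "term, document count, then (document number, weight) pairs" and ranks documents by inner product with a Boolean query vector. It is only an illustration: the function names, the single-token terms "ritual" and "family", and the numbers in the second entry are hypothetical, and a real index would hold tf*idf weights computed from the collection.

```python
from collections import defaultdict


def parse_posting_line(line):
    """Parse 'term df doc1 w1 doc2 w2 ...' into (term, {doc: weight})."""
    fields = line.split()
    term, doc_count = fields[0], int(fields[1])
    numbers = [int(value) for value in fields[2:]]
    if len(numbers) != 2 * doc_count:
        raise ValueError("posting list length does not match document count")
    postings = dict(zip(numbers[0::2], numbers[1::2]))
    return term, postings


def rank(query_terms, index, top_k=50):
    """Inner product with a Boolean query: sum the weights of matching terms."""
    scores = defaultdict(int)
    for term in set(query_terms):  # Boolean query vector: 1 per distinct term
        for doc, weight in index.get(term, {}).items():
            scores[doc] += weight
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


# Hypothetical entries; terms are single tokens so that split() is sufficient.
lines = [
    "ritual 3 5730 40845 5481 40000 24 40845",
    "family 2 5730 100 7 300",
]
index = dict(parse_posting_line(line) for line in lines)
ranking = rank(["ritual", "family"], index)
```

Because the query vector is Boolean, the inner product reduces to summing the stored document weights of the query terms, so the inverted file alone is enough to produce the first ranking.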
`
Documents can then be ranked in order of descending similarity to a query by keyword
indexing[5,8]. Third, it expands the query terms with the keyfact terms of the predefined
keyfact cluster for more accurate retrieval. Fourth, each relevant document is compared to
the cluster set of the query (Second Document Ranking, SDR hereafter). Because relevant
documents are such a small fraction of the collection, this process substantially reduces
the computational cost of the experiment. SDR provides documents sorted in order of
relevance by counting overlaps between the retrieved documents and the keyfact clusters of
the query. The SDR scheme differs from the query expansion retrieval scheme[1]. In query
expansion, a new query is created by adding terms from other documents, as in relevance
feedback, or by adding synonyms of terms in the query (as found in a thesaurus). The system
then returns more documents using the revised query, whose terms consist of keywords, and
the user indicates which of the returned documents are most relevant to the query. To
calculate similarity in SDR, all pairwise combinations between keyfact terms in a query and
a document are generated. For each keyfact in the expanded queries, the system enters the
document in a hash table; the table is keyed on the document number, and the value is
initially 1. If the document was previously entered in the table, the value is simply
incremented. When the keyfact weighting is calculated, however, we add a value of 5 for
each keyfact, as opposed to a keyword.
`
`
The end result is that each entry in the table contains the total number of keyfacts common
to the expanded queries and the document. The table is then sorted to produce a ranked list
of documents.
`
(3e) What is the highest mountain in the world?
(3k) 세계에서 가장 높은 산은 무엇인가?
segye-eseo kachang nopun san-un muotinka
the world in the highest mountain what is
`
In the Korean version (3k), san "산" is polysemous and has two meanings, "mountain" and
"acid", in the Gemong Korean encyclopedia. Because the sentence (3e) has the keyfact
"be the highest", we can select "mountain" as the correct meaning of san "산".
kachang "가장" is polysemous and has several meanings such as "most", "the head of a
family", etc. in the Minchungseorim Korean dictionary. In the Gemong Korean encyclopedia,
however, kachang "가장" is not polysemous and has the single meaning "the head of a
family". Therefore "가장" in all the documents of our test collection, the Gemong Korean
encyclopedia, is indexed with the meaning "the head of a family". Given the query "What is
the highest mountain in the world?", when only keywords are applied to ranking the
documents against the query, D2 is closer to the query: Sim(Query, D1) = 23 versus
Sim(Query, D2) = 58. After the ambiguity of san and kachang is resolved, the relevant
documents are represented as keyfacts and the query is represented as a keyfact cluster.
Below, italic type marks keyfacts, and the other terms are keywords. The following results
are obtained in SDR.
`
Query = {be the highest, mountain, the world}
D1 (the Himalayas) = {the Himalayas, the world, Nepal, China, glacier, winter, wind,
summer, climbing, England, the highest mountain, be precipitous, top the mountain, climb}
D2 (the head of a family) = {the head of a family, fortune, command, control, family,
system, the Orient, the eldest son, inherit a fortune, represent a family, have authority,
system make progress}

Sim(Query, D1) = Mt. Everest + the world + climbing + mountain range + be very high +
be precipitous
= 1 + 1 + 1 + 1 + 5 + 5 = 14
Sim(Query, D2) = the world
= 1
`
When keyfacts are applied in the SDR, D1 is closer to the query: Sim(Query, D1) = 14
versus Sim(Query, D2) = 1. The similarity is based on overlap counts between the predefined
cluster of the input keywords and the keyfacts of the document. In this example the keyfact
concept performs better than the keyword concept.
`
`
4 Empirical Results
`
Evaluation of full-text document retrieval models is based on relevance ranking and the
measurement of recall and precision on test collections. Relevance ranking returns an
ordered list of relevant documents. Recall is defined as the number of relevant documents
retrieved divided by the total number of relevant documents in the collection, and precision
is defined as the number of relevant documents retrieved divided by the total number of
documents retrieved. Before any evaluation can be performed, the relevance information for
each query must be established. Deciding which documents were relevant to each query took
us several months. As a result only 35 queries were used for evaluation. We now address
the issue of whether the keyfact improves performance when it is applied to document
ranking. Our experiments use the Gemong Korean encyclopedia as a test collection. The test
collection comes with a set of natural language queries and a labeling (decided by human
experts) that determines which documents are relevant to each query. The collection
contains 35 queries and 23,000 documents. Two experiments to evaluate the keyfact concept
were run on an IR system called OKSEO at the Electronics and Telecommunications Research
Institute. OKSEO is not an acronym, but the old name of a place in which our ancestors
carried out research, compilation, and cultural enterprise.
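
The two measures defined above can be computed directly from the set of retrieved documents and the set of documents labeled relevant. The following Python sketch is a generic illustration with made-up document numbers; the function name `recall_precision` is hypothetical and is not part of the OKSEO system.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: document identifiers returned by the system
    relevant: document identifiers labeled relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Made-up example: 4 documents retrieved, 3 labeled relevant, 2 in common.
recall, precision = recall_precision([1, 2, 3, 4], [2, 4, 6])
```

Here two of the three relevant documents are retrieved, and two of the four retrieved documents are relevant.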
`
(4e) want to know about speech and language
(4k) 말과 언어에 대하여 알고 싶다
mal-kwa ono-e taehayo alko shipta
speech and language about know want to
`
Recall and precision values were calculated for each query. The overall performance of an
experiment is determined by processing a number of queries and computing the average
precision over all the queries at each selected recall value. Two experiments are
compared: experiment 1 and experiment 2. Experiment 1 uses only the keywords, without
keyfacts, in ranking documents. Experiment 2 uses both keywords and keyfacts in SDR. Our
system displayed the top 10 terms selected by the tf*idf weighting (first ranking) and by
overlap counting (SDR) as follows.
`
First ranking without keyfacts:
1 donkey
2 mule
3 zebra
4 editorial
5 indirect quotation

Second ranking with keyfacts:
1 address
2 composition
3 subject of writing
4 donkey
5 mule
`
In sentence (4k), the Korean word "말" has several meanings such as "the unit of measure",
"horse", "speech", etc. When we considered only keywords in the first ranking step, the
irrelevant document "donkey" was at the top of the ranking and the relevant document
"editorial" was ranked fourth. However, when we considered both keywords and keyfacts in
the second ranking step, the irrelevant document "donkey" dropped to fourth. For each
query, two retrieval
`
`
results are compared. In figure 3, the resulting graph shows the performance improvement
when the keyfact is applied in SDR. Experiment 1 shows about 66% precision on the
Gemong Korean encyclopedia. In experiment 2, latent indexing on keyfacts between the
retrieved documents and a query provides a 22% improvement in precision.
`
Figure 3: Keyword versus keyfact for 35 queries (precision plotted against recall)
`
`5 Conclusions and Future Work
`
Traditional IR systems adopt vectors of weighted term frequencies as the representation of
documents. The vector space model has some significant problems. It assumes that terms
are independent and thus ignores term associations and lexical ambiguity[2,3]. In this paper,
we constructed the keyfact cluster, which is a new type of thesaurus, to resolve lexical
ambiguity for natural language retrieval. We implemented a three dimensional keyfact
network retrieval system for keyfact visualization retrieval. Users can easily traverse
multi-depth child nodes as well as one-depth child nodes, and retrieve the keyfact network
on the WWW by three arguments: keyword, relationship, and weight.
We apply the keyfact to the second document ranking, which operates under the assumption
that the first ranking step returns some relevant documents for the given query. The
keyfacts of each relevant document are compared to the predefined keyfact cluster set of a
query. Because relevant documents are such a small fraction of the collection, this
substantially reduces the computational cost of the experiment.
As the collection grows with newly added documents, future research will concentrate on
representing and manipulating knowledge (facts). We will look in more detail at ways of
representing more specific facts and at more powerful inference mechanisms that operate
on them. In this paper, the query vector uses a Boolean scheme. If the vectors are instead
weighted to give emphasis to keyfacts that exemplify meaning, the system should satisfy
the user's information need better.
`
`
`References
`
[1] Ellen M. Voorhees, "Query Expansion using Lexical-Semantic Relations". Proceedings of
the Association for Computing Machinery-Special Interest Group on Information
Retrieval, Dublin, 61-69, 1994.
[2] Robert Krovetz and W. Bruce Croft, "Lexical Ambiguity and Information Retrieval".
Association for Computing Machinery Transactions on Information Systems, Vol. 10, No.
2,