Applications of Natural Language
to Information Systems

Proceedings of the Second International Workshop
June 26-28, 1996, Amsterdam, The Netherlands

Edited by

R.P. van de Riet
Vrije Universiteit, Amsterdam, The Netherlands

J.F.M. Burg
Vrije Universiteit, Amsterdam, The Netherlands

and

A.J. van der Vos
Vrije Universiteit, Amsterdam, The Netherlands

1996

IOS Press
Ohmsha

Amsterdam, Oxford, Tokyo, Washington, DC

`Page 1 of 14
`
`GOOGLE EXHIBIT 1017
`
`

`

© The authors mentioned in the Table of Contents.

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, without the prior written permission from the publisher.

ISBN 90 5199 273 4 (IOS Press)
ISBN 4 274 90102 5 C3000 (Ohmsha)
Library of Congress Catalogue Card Number: 96-76771

Publisher
IOS Press
Van Diemenstraat 94
1013 CN Amsterdam
Netherlands

Distributor in the UK and Ireland
IOS Press/Lavis Marketing
73 Lime Walk
Headington
Oxford OX3 7AD
England

Distributor in the USA and Canada
IOS Press, Inc.
P.O. Box 10558
Burke, VA 22009-0558
USA

Distributor in Japan
Ohmsha, Ltd.
3-1 Kanda Nishiki-Cho
Chiyoda-Ku
Tokyo 101
Japan

LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS

Applications of Natural Language to Information Systems
R.P. van de Riet et al. (Eds.)
IOS Press, 1996

139
`
`The Fact Extraction Using the Keyfact
`
`Mi-Seon JUN, Se-Young PARK, Man-Soo KIM
`Natural Language Processing Section
`System Software Department
`Electronics and Telecommunications Research Institute
`TaeJon, Korea
`{msjun,sypark,mskim}@com.etri.re.kr
`
Abstract. Many information retrieval systems retrieve relevant documents by exact
matching of keywords between a query and documents. We show how to extract a fact
from a document using an extended concept of the keyword, called the keyfact, which
can contain syntactic patterns and semantic information. In the second document
ranking, a predefined keyfact cluster set of the query terms is compared with each
relevant document. Because relevant documents are only a small fraction of a
collection, this method differs from the query expansion retrieval scheme and
substantially reduces the computational cost of the experiment.
Keywords: Document Ranking, Fact, Semantic Information, Syntactic Pattern
`
`1 Introduction
`
Many commercial information retrieval (IR) systems retrieve relevant documents by
keyword matching between a query and documents. This method has two problems. First,
keywords are ambiguous, and this ambiguity causes semantically irrelevant documents to
be retrieved; lexical ambiguity therefore has to be resolved. Second, a relevant document
is treated as irrelevant when it does not contain the same keywords as the query terms;
the original query therefore has to be expanded with semantically related words. The main
function of an IR system is to rank the relevant documents that satisfy the user's
information need. In most retrieval models, the system ranks documents by their inner
product similarity, which depends on the keywords in a query. However, users are generally
not interested in retrieving documents with matching keywords, but in the concepts that
the relevant words or information represent.
Facts are truths in some relevant world. They are the things we want to represent and to
search for. One representation of facts is so common that it deserves special mention:
natural language sentences. Generally, nouns and compound nouns are taken as keywords,
since they are the most important elements for representing the fact in a natural language
sentence. Besides nouns and compound nouns, however, verbs and adjectives also play an
important role in a sentence. Using keywords in an IR system is relatively simple to
implement, but to extract facts well from a sentence it is insufficient to use only nouns
as keywords. Even in morphological analysis of
a sentence, there are many lexical ambiguities. Furthermore, on the syntactic side, Korean
has very many inflected forms of adjectives and adverbs. In Korean in particular, one
keyword has 20-30 senses in a dictionary in the worst case. Polysemous words in a query
and in documents can reduce the precision of a search significantly, so lexical ambiguity
has to be resolved. To resolve the lexical ambiguity of keywords, we need several kinds of
information, such as keywords, verbs and adjectives. We introduce an extended concept of
the keyword, called the keyfact, which can be represented as a verb/adjective and can
contain syntactic patterns and semantic information. We can thus consider the lexical
ambiguity of both nouns and verbs.
The verb "쓰다" is a typical Korean polysemous word and has twenty-one English
translations, such as "write", "spend", "wear", and "adopt", in a Sisa Korean-English
dictionary[14]. As another example, the noun "모자" is a typical Korean polysemous word,
with English equivalents such as "mother and child" and "hat". In our keyfact concept,
therefore, the noun and the verb/adjective are not independent of each other. If the
keyfact can contain syntactic patterns and semantic information such as "모자를 쓰다/wear
a hat" and "펜으로 쓰다/write with a pen", then a noun that occurs with the verb "쓰다"
can serve as a clue for disambiguating its senses. The ambiguity of nouns can be handled
in much the same manner.
The literature generally divides lexical ambiguity into two types: syntactic and semantic[2].
Syntactic ambiguity refers to differences in syntactic category. Semantic ambiguity refers to
differences in meaning. A number of approaches have been taken to word sense
disambiguation. Lesk uses the Oxford Advanced Learners Dictionary[3] and Weiss uses
word co-occurrences[4]. Many researchers have used statistical information, semantic
information, or both as knowledge for query expansion. Stiles and Lesk used statistical
information on term association from documents. Salton experimented with synonyms only
in the SMART system. Fox experimented with five semantic categories in the SMART system,
which needed human intervention for selecting related words[8]. The above studies did not
raise the problem of lexical ambiguity or did not consider automatic lexical
disambiguation for query expansion. The ambiguity in a query must be resolved when the
query is analyzed; ambiguous words cannot be expanded effectively until the ambiguity is
resolved. Query expansion enhances recall by adding relevant documents that exact matching
excludes, but it degrades precision. To improve precision, first, the ambiguous words in a
query are resolved using a knowledge base when the query is analyzed. Second, the keyword
concept, which is defined as a noun or compound noun, needs to be extended, and verbs and
adjectives must be considered as indexing words. We resolve ambiguous query terms and
then expand the unambiguous query terms. There is a wide choice of words to add to a query
vector. One can add only the synonyms, or synonyms plus all descendants, or synonyms plus
parents and all descendants, or synonyms plus directly related words, etc., and any number
of child links may be traversed. Expansion by synonyms plus directly related words is
beneficial[1]. We therefore chose this parameter, and expanded queries consist of a special
relationship, FT (Fact Term), as well as the semantic relationships used in a general
thesaurus. Because query expansion is a recall-enhancing technique, we used the expanded
terms of an original query only when computing query-document similarity, and thus obtained
a high precision rate.
In the following sections, we (1) describe the construction of a keyfact network for
ranking the documents; (2) present a visualization of the keyfact network for keyfact
retrieval; (3) show how the keyfact network can be used to rank documents and extract a
fact; (4) evaluate the ranking method for improving retrieval performance; and (5) make
suggestions for future research.
`
2 Keyfact and Keyfact Cluster
`
The noun is the most important element for explaining a fact. Next comes the compound
noun, which is composed of several nouns. The syntactic category of the next more
complicated fact is the noun phrase. A noun phrase consists of a noun and its modifiers,
which can be represented as inflected forms of verbs and adjectives. Korean verbs and
adjectives have many more inflected forms than those of English and French. The simplest
fact (sentence) can be represented by a noun (subject) and a verb/adjective (predicate).
The keyfact is therefore not independent of case slots, and contains syntactic patterns
and semantic information. In this paper, we collected keyfacts and keyfact clusters from
the Gemong Korean encyclopedia to improve retrieval performance. The encyclopedia has two
characteristics. First, syntactically, each entry is composed of a title word and its
explanation part. Second, semantically, most words in the explanation part are related to
the title word. The encyclopedia therefore makes it easy to collect a word together with
its semantically related words, and we considered it a suitable collection for constructing
semantic information. The keyfacts extracted from a text can be represented in several
forms, and the forms have to be designed for easy matching. The keyfact weight is
calculated with the same formula used for calculating keyword weights, based on the keyword
frequency. When a user gives a query that contains keyfacts as well as keywords, our system
extracts the keyfacts from the query and tries to match them against the keyfacts extracted
from documents. We use an exact matching method first and, when it fails, a partial
matching method. Keyfacts that co-occur with a keyword can be used for disambiguating the
keyword and ranking documents.
A cluster is defined as a set of co-occurring keyfact terms. Co-occurring terms are usually
relevant to each other and are sometimes synonyms. For keyfact clustering, we considered
the sense definitions of the noun in the Gemong Korean encyclopedia. In the simple
automatic indexing method, raw terms are analyzed using a stemming algorithm and stop
words are removed using a stop list. The stop words are usually prepositions, postpositions,
determiners and words that appear too frequently to discriminate between documents. The
method then finds an identical inflected verb and obtains the basic form of the inflected
verb by consulting a verb dictionary. Our verb dictionary consists of two parts: inflected
verb forms and basic verb forms. One basic form can have many inflected forms. The inflected
form is therefore compared with the input text, and the basic form is used for
disambiguating and ranking. Figure 1 shows that one basic form can have many inflected
forms. We assigned the semantic relationships used in a general thesaurus, such as
BT (Broader Term), NT (Narrower Term), RT (Related Term), HP (Has Part) and UF (Used For),
and a special relationship, FT (keyFact Term). Most thesauri use these relationships except
FT. FT is defined as the relationship
`between noun and verb/adjective. FT is used for resolving lexical ambiguity in a query and
`ranking the documents.
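
The two-part verb dictionary described above can be pictured as a lookup from inflected forms to a basic form. The following is only a minimal sketch under stated assumptions: the romanized entries are illustrative stand-ins (only "masil" and "masida" appear in this paper), and the names `VERB_DICTIONARY`, `to_basic_form` and `make_keyfact` are hypothetical, not part of the described system.

```python
# Hypothetical, romanized stand-ins for the inflected-form / basic-form
# pairs of the verb dictionary; the real dictionary holds Korean forms.
VERB_DICTIONARY = {
    "masil": "masida",      # inflected -> basic ("drink")
    "masinda": "masida",
    "masyeotda": "masida",
    "tanda": "tada",        # inflected -> basic ("take / ride")
    "tatda": "tada",
}


def to_basic_form(inflected_verb):
    """Return the basic form of an inflected verb, or None if it is unknown."""
    return VERB_DICTIONARY.get(inflected_verb)


def make_keyfact(noun, inflected_verb):
    """Pair a noun with the basic form of its verb (an FT-style relation).

    Keyfacts built from different inflections of the same verb compare equal,
    which is what makes exact matching possible.
    """
    basic = to_basic_form(inflected_verb)
    if basic is None:
        return None
    return (noun, basic)


if __name__ == "__main__":
    print(make_keyfact("cha", "masil"))
    print(make_keyfact("cha", "masinda") == make_keyfact("cha", "masyeotda"))
```

Representing a keyfact as a (noun, basic verb) tuple is a design choice of this sketch; it keeps matching a plain equality test.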
`
[Diagram: inflected forms (noun+postposition+verb) mapped to a basic form (verb/postposition, noun)]
Figure 1: Sample of keyfact
In figure 1, "차를 마시다 (take tea)" and "차를 타다 (take a car)" share the keyword
"차". The noun cha "차" is a typical example of Korean polysemy. It has nine meanings,
such as "vehicle", "tea", "difference", and "for the purpose of", in the
Minchungseorim Korean dictionary, and two meanings, "vehicle" and "tea", in the
Gemong Korean encyclopedia. It is therefore not easy to determine the exact meaning of a
sentence that includes a polysemous word. First, we define a cluster as a group of
keyfacts that co-occur with a specific sense of a polysemous word. Some keyfacts appearing
in a text are useful for keyfact clustering while others are not. Thus, in order to cluster
keyfacts effectively, we normalize the clusters by the following process. We weight by
keyfact frequency alone. An appropriate threshold is chosen, and a keyfact whose frequency
exceeds the threshold of 3 is assumed to be connected with a keyword. The following are the
normalization results for the keyword cha "차" in the Gemong Korean encyclopedia. For
example, the sense 'vehicle' has 'get off a car' as a keyfact and 'bus' as a keyword.
`
vehicle = {get off a car, take a car, go by taxi, be crowded with traffic, travel by vehicle,
carry in a truck, bring a car to a halt, stop a cab, drive one's own car, bus, train, taxi, ...}
tea = {make tea, pour out tea, drink tea, sip tea, tea brews, serve a guest tea, talk over
a cup of tea, coffee, green tea, black tea, tea plant, a tea strainer, cup, ...}
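
The normalization step (keep a term in a sense's cluster only when its frequency exceeds the threshold of 3) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the input format, a list of (sense, term) observations, and the function name `build_clusters` are hypothetical.

```python
from collections import Counter, defaultdict


def build_clusters(observations, threshold=3):
    """Group terms by sense and keep those whose frequency exceeds threshold.

    observations: iterable of (sense, term) pairs, one per co-occurrence of a
    keyfact or keyword with a sense of the polysemous word (assumed format).
    """
    counts = defaultdict(Counter)
    for sense, term in observations:
        counts[sense][term] += 1
    return {
        sense: {term for term, n in counter.items() if n > threshold}
        for sense, counter in counts.items()
    }


if __name__ == "__main__":
    observations = (
        [("tea", "drink tea")] * 4      # frequency 4: kept
        + [("tea", "cup")] * 3          # frequency 3: does not exceed 3
        + [("vehicle", "take a car")] * 5
    )
    print(build_clusters(observations))
```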
`
There are too many inflected forms of adjectives and verbs in a text. The keyfacts
extracted from a text must therefore be represented in basic forms, and the forms have to
be designed for easy matching. If keyfacts have the same keyword and differently inflected
forms of the same verb/adjective, they can be represented in the same basic form and are
treated as the same keyfact.
`
`3 Keyfact and Information Retrieval
`
3.1 System Overview

Our IR system consists of a server system and a client system. The server platform is
Windows NT 3.5 Server on a Pentium, and the client platform is Windows NT 3.5
Workstation on a Pentium. The system uses HTTPS for Windows NT as the web server and
Netscape Navigator for Windows NT as the web browser. The system adopted NCAPI,
implemented by DDE, for communication between Netscape Navigator and a three-dimensional
keyfact visualizer[10, 11]. Figure 2 shows how the server system provides the
information requested by the end-user or the information builder. In figure 2,
squares show functional procedures and circles represent the results of functional
procedures. The server has processing engines for the information building function and
for IR. The client system has the browser and is used by users who want to search for
information[13]. The types of IR in the server system are natural language retrieval and
keyfact visualization retrieval, both shown in figure 2. In the dispatcher function, the
user's request is analyzed and routed to the keyfact visualization procedure or the natural
language query procedure according to its type.
`
`3.2 Keyfact Visualization Retrieval
`
Recently, the WWW has grown to be the most popular service available on the Internet.
Despite its short history, the WWW has developed so rapidly because it supports multimedia
presentation and hypertext retrieval under an integrated communication environment. The
most common browsing object supported by web browsers is so simple that the relationships
among pieces of information are hard to understand. The latest browsing method is to
browse the relations among information in three dimensions using a three-dimensional web
viewer. However, these methods cannot immediately reflect values that change dynamically,
nor support the various functions required for effective browsing. An advanced browser is
needed that displays exactly the relations among information, not the information itself,
and immediately reflects updated relations. In this paper, we provide browsing objects that
can describe various relations among information, and implement a visualization system that
automatically transforms keyfact information into these objects. The information is
described in SGML, which is the standard for document exchange. Unlike other
three-dimensional viewers, the system focuses on visualizing the relations among
information and supports various browsing functions such as expanding, shrinking and
centralizing. The browsers include the keyfact visualizer and the natural language
interface, which serve as helper applications for typical Internet browsers such as Mosaic
and Netscape. In Figure 2, the CGI scripts that support keyfact visualization return index
documents or HTML documents after querying with the FORM input data from a web browser.
The keyfacts form a structure suitable for visualization, with weighted, labeled and
many-to-many connected relations among information. We can distinguish one relation from
the others by the link-line type, and adjacency by the link-line length. For example, a
shorter length may represent the
closer relationship. The query invokes the three-dimensional keyfact browser. Users can
explore the keyfact network by clicking buttons such as rotation, zooming, translation, and
fly-to.
`
`Figure 2: The information retrieval procedure
`
The graphics operations are performed without communication between the server and the
client. The visualization part on the client updates the visual objects with the
information document returned by CCI. Scripts related to the visualization on the server
analyze the posted data and return a document of MIME type through CGI[9, 10, 11]. If the
clicked node is not a nonterminal node, the web browser simply shows the list of HTML
documents related to the node name, obtained by natural language processing or keyword
processing. In the case of keyword processing, the node name is the title in
the Gemong Korean encyclopedia. In the case of natural language processing, the node name
is the input to the natural language query procedure, and the result is displayed in the
Netscape browser.
`
3.3 Natural Language Retrieval

The end-user wants to search for information through user-friendly interfaces. One of
these is the natural language interface. When the server receives a natural language query,
the natural language processing engine is activated. After several sequential operations,
it creates a candidate list, which is returned to the client. The procedure of natural
language retrieval is as follows. First, it extracts the keywords and keyfacts of the
query. Second, it obtains the top fifty candidate documents; the weighting and ranking
operations that produce this candidate list use the tf·idf weighting formula and consider
only keywords. Third, it expands the query using the keyfact cluster for more accurate
retrieval, and then applies the overlap counting method to the candidate list for weighting
and ranking. The document with the greatest number of overlaps with the expanded query is
placed at the top of the candidate list. Finally, the system displays the top ten documents
to the end-user.
`
`Disambiguation using Keyfact Cluster
In this paper, words that occur in a document can be selected as keywords, and facts that
carry more precise information can be selected as keyfacts. We introduced the keyfact for
more precise natural language processing. Two examples show how keyfacts play an important
role in determining the correct meaning.
`
(1e) What is the difference between a donkey and a horse?
(1k) 당나귀와 말의 차이점은 무엇인가?
     tangnagwi-wa mal-ui chaichom-un muotinka
     donkey and horse the difference what is ?
(2e) What kind of tea can man drink?
(2k) 사람이 마실 수 있는 차의 종류는 무엇인가?
     saram-i masil su itnun cha-ui chongyu-nun
     man drink can tea of kind what
`
Even in the morphological analysis of a sentence, there are many ambiguities in Korean. In
the Korean version (1k), mal "말" is polysemous and has three meanings, "horse",
"language" and "unit of measure", in the Gemong Korean encyclopedia. The raw terms are
analyzed using a stemming algorithm, and the most frequent function words are
removed using a stop list. Our method decides the specific sense of a given polysemous word
by calculating the similarity between the polysemous word in the input query and the words
of the predefined keyfact clusters. We can extract the keywords tangnagwi "당나귀", mal
"말" and chaichom "차이점" from sentence (1k). We can choose "horse" as the correct meaning
of mal according to the cluster of tangnagwi. Let us consider the keyfact cluster for
'cha'. If the
child links are limited to depth one, then "make tea", "pour out tea", "drink tea", "sip tea",
"tea brews", "serve a guest tea", "talk over a cup of tea", "coffee", "green tea", "black tea",
"tea plant", "a tea strainer", and "cup" would be added. Unlike (1k), (2k) involves a
keyfact, masida "마시다/drink", besides the keywords saram "사람", chongyu "종류" and
cha "차". If sentence (2k) did not contain masida "마시다", we could not select "tea" as
the correct meaning of cha "차".
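
The sense selection described above can be sketched as picking the sense whose cluster shares the most terms with the query. The paper does not spell out its similarity formula, so the plain overlap count used here is an assumption, as are the abbreviated English cluster contents and the name `choose_sense`.

```python
# Abbreviated, illustrative clusters in English gloss; the real clusters are
# collected from the encyclopedia and hold Korean keyfacts and keywords.
CLUSTERS = {
    "tea": {"drink", "make tea", "coffee", "cup", "green tea"},
    "vehicle": {"take a car", "go by taxi", "bus", "train", "taxi"},
}


def choose_sense(query_terms, clusters):
    """Return the sense whose cluster overlaps most with the query terms.

    Returns None when no cluster shares any term with the query, i.e. when
    the query offers no clue for disambiguation. Ties go to the sense listed
    first, which is an arbitrary choice of this sketch.
    """
    terms = set(query_terms)
    scores = {sense: len(terms & members) for sense, members in clusters.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    # Query (2k) with its keyfact "drink": the tea sense can be selected.
    print(choose_sense({"man", "kind", "drink"}, CLUSTERS))
    # The same query without the keyfact gives no basis for a choice.
    print(choose_sense({"man", "kind"}, CLUSTERS))
```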
`
`Information Filtering using Keyfact Cluster
Given a large amount of data, IR systems must be designed to return the data that the user
wants to see. Information filtering is therefore a very important part. Our main technical
concern is to extract exact meanings from a document. The procedure of natural language
retrieval is as follows. First, it extracts the keywords of a query. Second, it obtains 50
relevant documents, that is, documents that satisfy the user's information need. Documents
are ranked by the inner product similarity between the original query and each
document[5]. A query and the documents contain both index terms and non-index terms; index
terms consist only of keywords. To improve retrieval speed, the IR system employs inverted
files. Each document vector uses the popular tf·idf weighting scheme based on term
frequency, and the query vector uses a Boolean scheme (1 if the term appears, 0 if
not)[5,8]. The following is an example of an inverted file entry. It shows that the index
word '7}~-cl ¾' occurs in 19 documents and has a weight of 40845 in document
number 5730.
`
71-~-cl ¾ 19 5730 40845 5481 40000 4169 37005 4531 36150 3294 48861 2690 50622
24 40845 19733 81689 19101 40845 18122 47612 17188 47612 16740 34042 16401
46642 15298 68084 14192 46642 12268 81689 12265 40845 12264 41870 12246 40845
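
The entry above appears to list the index term, its document count, and then (document number, weight) pairs. Assuming that layout, and assuming a term contains no whitespace, the sketch below parses such an entry and ranks documents by the inner product with a Boolean query vector, which reduces to summing the weights of the query terms present in each document. The function names and the sample data are hypothetical.

```python
def parse_posting_line(line):
    """Parse 'term df doc1 w1 doc2 w2 ...' into (term, {doc: weight}).

    The field layout is inferred from the example entry, not documented.
    """
    fields = line.split()
    term, doc_count = fields[0], int(fields[1])
    numbers = [int(x) for x in fields[2:]]
    postings = dict(zip(numbers[0::2], numbers[1::2]))
    if len(postings) != doc_count:
        raise ValueError("document count does not match the posting list")
    return term, postings


def rank(query_terms, index):
    """Rank documents by inner product with a Boolean (0/1) query vector."""
    scores = {}
    for term in set(query_terms):
        for doc, weight in index.get(term, {}).items():
            scores[doc] = scores.get(doc, 0) + weight
    # Highest score first; ties broken by document number.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    term, postings = parse_posting_line("mountain 2 5730 40845 24 40845")
    index = {term: postings, "world": {24: 100}}
    print(rank(["mountain", "world"], index))
```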
`
Documents can then be ranked in order of descending similarity to a query by keyword
indexing[5,8]. Third, it expands the query terms with the keyfact terms of the predefined
keyfact cluster for more accurate retrieval. Fourth, each relevant document is compared
with the cluster set of the query (Second Document Ranking, hereafter SDR). Because
relevant documents are only a small fraction of the collection, this process substantially
reduces the computational cost of the experiment. SDR provides documents sorted in order of
relevance by counting overlaps between the retrieved documents and the keyfact clusters of
the query. The SDR scheme differs from the query expansion retrieval scheme[1]. In query
expansion, a new query is created by adding terms from other documents, as in relevance
feedback, or by adding synonyms of the query terms (as found in a thesaurus). The system
then returns more documents using the revised query, whose terms consist of keywords, and
the user indicates which of the returned documents are most relevant to the query. To
calculate similarity in SDR, all pairwise combinations between keyfact terms in a query and
a document are generated. For each keyfact in the expanded query, the system enters the
document in a hash table; the table is keyed on the document number, and the value is
initially 1. If the document was previously entered in the table, the value is simply
incremented. When the keyfact weight is calculated, however, we give an added value of 5 to
each keyfact, as opposed to a keyword.
The end result is that each entry in the table contains the total number of keyfacts common
to the expanded query and the document. The table is then sorted to produce a ranked list of
documents.
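
The hash-table overlap counting can be sketched as below. It assumes, following the worked similarity example in this paper, that a matching keyword contributes 1 and a matching keyfact contributes 5; the data structures, the sample documents and the name `second_ranking` are hypothetical.

```python
KEYWORD_VALUE = 1   # contribution of a matching keyword
KEYFACT_VALUE = 5   # contribution of a matching keyfact (assumed reading)


def second_ranking(expanded_query, documents):
    """Rank candidate documents by weighted overlap with the expanded query.

    expanded_query: dict mapping each term to True if it is a keyfact,
                    False if it is a keyword.
    documents: dict mapping a document id to the set of its terms.
    Documents with no overlap never enter the table.
    """
    table = {}
    for term, is_keyfact in expanded_query.items():
        value = KEYFACT_VALUE if is_keyfact else KEYWORD_VALUE
        for doc_id, terms in documents.items():
            if term in terms:
                table[doc_id] = table.get(doc_id, 0) + value
    # Highest overlap first; ties broken by document id.
    return sorted(table.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    query = {
        "the world": False,
        "climbing": False,
        "be precipitous": True,
        "be very high": True,
    }
    docs = {
        "D1": {"the world", "climbing", "be precipitous", "be very high"},
        "D2": {"the world", "family"},
    }
    print(second_ranking(query, docs))
```

Because only the candidate documents from the first ranking are passed in, the loop stays small regardless of the size of the full collection.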
`
(3e) What is the highest mountain in the world?
(3k) 세계에서 가장 높은 산은 무엇인가?
     segye-eseo kachang nopun san-un muotinka
     the world in the highest mountain what is
`
In the Korean version (3k), san "산" is polysemous and has two meanings, "mountain" and
"acid", in the Gemong Korean encyclopedia. Because sentence (3e) contains the keyfact
"be the highest", we can select "mountain" as the correct meaning of san "산".
kachang "가장" is polysemous and has several meanings, such as "most" and "the head of a
family", in the Minchungseorim Korean dictionary. In the Gemong Korean encyclopedia,
however, kachang "가장" is not polysemous and has the single meaning "the head of a
family". Therefore "가장" in all the documents of our test collection, the Gemong Korean
encyclopedia, is indexed with the meaning "the head of a family". Given the query "What is
the highest mountain in the world?", when only keywords are used to rank documents against
the query, D2 is closer to the query: Sim(Query, D1) = 23 versus Sim(Query, D2) = 58.
After the ambiguity of san and kachang is resolved, the relevant documents are represented
as keyfacts and the query is represented as a keyfact cluster. In the following, italic
type marks keyfacts and the other terms are keywords. The following results are obtained in
SDR.
`
Query = {be the highest, mountain, the world}
D1 (the Himalayas) = {the Himalayas, the world, Nepal, China, glacier, winter, wind,
summer, climbing, England, the highest mountain, be precipitous, top the mountain, climb}
D2 (the head of a family) = {the head of a family, fortune, command, control, family,
system, the Orient, the eldest son, inherit a fortune, represent a family, have authority,
system make progress}

Sim(Query, D1) = Mt. Everest + the world + climbing + mountain range + be very high +
be precipitous
= 1 + 1 + 1 + 1 + 5 + 5 = 14
Sim(Query, D2) = the world
= 1
`
When keyfacts are applied in SDR, D1 is closer to the query: Sim(Query, D1) = 14
versus Sim(Query, D2) = 1. The similarity is based on overlap counts between the predefined
cluster of the input keywords and the keyfacts of the document. The keyfact concept thus
performs better than the keyword concept.
`
4 Empirical Results

Evaluation of full-text document retrieval models is based on relevance ranking and on the
measurement of recall and precision for test collections. Relevance ranking returns an
ordered list of relevant documents. Recall is defined as the number of relevant documents
retrieved divided by the total number of relevant documents in the collection, and precision
is defined as the number of relevant documents retrieved divided by the total number of
documents retrieved. Before any evaluation can be performed, the relevance information for
each query must be established. Deciding which documents were relevant to each query took
us several months; as a result, only 35 queries were used for evaluation. We now address
the question of whether the keyfact improves performance when it is applied to document
ranking. Our experiments use the Gemong Korean encyclopedia as the test collection. The test
collection comes with a set of natural language queries and a labeling (decided by human
experts) that determines which documents are relevant to each query. The collection
contains 35 queries and 23,000 documents. Two experiments to evaluate the keyfact concept
were run on an IR system called OKSEO at the Electronics and Telecommunications Research
Institute. OKSEO is not an acronym but the old name of a place where our ancestors carried
out research, compilation and cultural enterprises.
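
The definitions of recall and precision given above translate directly into code. The sketch below is illustrative only; the function name and the document identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here to avoid division by zero.

```python
def recall_precision(retrieved, relevant):
    """Compute (recall, precision) for one query.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all documents retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


if __name__ == "__main__":
    # Four documents retrieved, three relevant overall, two in common.
    print(recall_precision([1, 2, 3, 4], [2, 4, 6]))
```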
`
(4e) want to know about speech and language
(4k) 말과 언어에 대하여 알고 싶다
     mal-kwa ono-e taehayo alko shipta
     speech and language about know want to
`
Recall and precision values were calculated for each query. The overall performance of an
experiment is determined by processing a number of queries and computing the average
precision over all the queries at each selected recall value. Two experiments are
compared: experiment 1 and experiment 2. Experiment 1 uses only keywords, without
keyfacts, in ranking documents. Experiment 2 uses both keywords and keyfacts in SDR. Our
system displayed the top 10 terms selected by tf·idf weighting (first ranking) and by
overlap counting (SDR) as follows.
`
First ranking without keyfacts:
1 donkey
2 mule
3 zebra
4 editorial
5 indirect quotation

Second ranking with keyfacts:
1 address
2 composition
3 subject of writing
4 donkey
5 mule
`
In sentence (4k), the Korean word "말" has several meanings, such as "unit of measure",
"horse" and "speech". When we considered only keywords in the first ranking step, the
irrelevant document "donkey" was at the top of the ranking and the relevant document
"editorial" was ranked fourth. However, when we considered both keywords and keyfacts in
the second ranking step, the irrelevant document "donkey" dropped to fourth. For each
query, two retrieval
results are compared. In figure 3, the resulting graph shows the performance improvement
when the keyfact is applied in SDR. Experiment 1 shows about 66% precision on the
Gemong Korean encyclopedia. In experiment 2, latent indexing on keyfacts between
the retrieved documents and a query provides a 22% improvement in precision.
`
[Plot: precision (vertical axis, 0 to 0.9) versus recall (horizontal axis, 0 to 0.9)]
Figure 3: Keyword versus keyfact for 35 queries
`
`5 Conclusions and Future Work
`
Traditional IR systems adopt vectors of weighted term frequencies as the representation of
documents. The vector space model has some significant problems: it assumes that terms
are independent and thus ignores term associations and lexical ambiguity[2,3]. In this
paper, we constructed the keyfact cluster, a new type of thesaurus, to resolve lexical
ambiguity for natural language retrieval. We implemented a three-dimensional keyfact
network retrieval system for keyfact visualization retrieval. Users can easily traverse
multi-depth child nodes as well as one-depth children, and retrieve the keyfact network on
the WWW by three arguments: keyword, relationship and weight.
We apply the keyfact to the second document ranking, which operates under the assumption
that the first ranking step returns some relevant documents for a given query. The keyfacts
of each relevant document are compared with the predefined keyfact cluster set of the
query. Because relevant documents are only a small fraction of the collection, this
substantially reduces the computational cost of the experiment.
As the collection grows with the addition of new documents, future research will
concentrate on representing and manipulating knowledge (facts). We will look in more
detail at ways of representing more specific knowledge and at more powerful inference
mechanisms that operate on it. In this paper, the query vector uses a Boolean scheme. If
the vectors are instead weighted to give emphasis to keyfacts that exemplify meaning, the
system will satisfy the user's information need better.
`
`References
`
[1] Ellen M. Voorhees, "Query Expansion using Lexical-Semantic Relations". Proceedings of
the Association for Computing Machinery-Special Interest Group on Information
Retrieval, Dublin, 61-69, 1994.
[2] Robert Krovetz and W. Bruce Croft, "Lexical Ambiguity and Information Retrieval".
Association for Computing Machinery Transactions on Information Systems, Vol. 10, No.
2, 11
