Lexical Ambiguity and Information Retrieval

ROBERT KROVETZ and W. BRUCE CROFT
University of Massachusetts

Lexical ambiguity is a pervasive problem in natural language processing. However, little quantitative information is available about the extent of the problem or about the impact that it has on information retrieval systems. We report on an analysis of lexical ambiguity in information retrieval test collections and on experiments to determine the utility of word meanings for separating relevant from nonrelevant documents. The experiments show that there is considerable ambiguity even in a specialized database. Word senses provide a significant separation between relevant and nonrelevant documents, but several factors contribute to determining whether disambiguation will make an improvement in performance. For example, resolving lexical ambiguity was found to have little impact on retrieval effectiveness for documents that have many words in common with the query. Other uses of word sense disambiguation in an information retrieval context are discussed.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—dictionaries, indexing methods, linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—search process, selection process; I.2.7 [Artificial Intelligence]: Natural Language Processing—text analysis

General Terms: Experimentation, Measurement, Performance

Additional Key Words and Phrases: Disambiguation, document retrieval, semantically based search, word senses

1. INTRODUCTION
The goal of an information retrieval system is to locate relevant documents in response to a user's query. Documents are typically retrieved as a ranked list, where the ranking is based on estimations of relevance [5]. The retrieval model for an information retrieval system specifies how documents and queries are represented and how these representations are compared to produce relevance estimates. The performance of the system is evaluated

This work has been supported by the Office of Naval Research under University Research Initiative Grant N00014-86-K-0746, by the Air Force Office of Scientific Research under contract 91-0324, and by NSF Grant IRI-8814790.
Authors' address: Computer Science Department, University of Massachusetts, Amherst, MA 01003; email: krovetz@cs.umass.edu and croft@cs.umass.edu.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 1046-8188/92/0400-0115 $01.50

ACM Transactions on Information Systems, Vol. 10, No. 2, April 1992, Pages 115-141.

with respect to standard test collections that provide a set of queries, a set of documents, and a set of relevance judgments that indicate which documents are relevant to each query. These judgments are provided by the users who supply the queries and serve as a standard for evaluating performance. Information retrieval research is concerned with finding representations and methods of comparison that will accurately discriminate between relevant and nonrelevant documents.
Many retrieval systems represent documents and queries by the words they contain, and base the comparison on the number of words they have in common. The more words the query and document have in common, the higher the document is ranked; this is referred to as a "coordination match." Performance is improved by weighting query and document words using frequency information from the collection and individual document texts [27].
There are two problems with using words to represent the content of documents. The first problem is that words are ambiguous, and this ambiguity can cause documents to be retrieved that are not relevant. Consider the following description of a search that was performed using the keyword "AIDS":

Unfortunately, not all 34 [references] were about AIDS, the disease. The references included "two helpful aids during the first three months after total hip replacement" and "aids in diagnosing abnormal voiding patterns" [17].

One response to this problem is to use phrases to reduce ambiguity (e.g., specifying "hearing aids" if that is the desired sense) [27]. It is not always possible, however, to provide phrases in which the word occurs only with the desired sense. In addition, the requirement for phrases imposes a significant burden on the user.
The second problem is that a document can be relevant even though it does not use the same words as those that are provided in the query. The user is generally not interested in retrieving documents with exactly the same words, but with the concepts that those words represent. Retrieval systems address this problem by expanding the query words using related words from a thesaurus [27]. The relationships described in a thesaurus, however, are really between word senses rather than words. For example, the word "term" could be synonymous with "word" (as in a vocabulary term), "sentence" (as in a prison term), or "condition" (as in "terms of agreement"). If we expand the query with words from a thesaurus, we must be careful to use the right senses of those words. We not only have to know the sense of the word in the query (in this example, the sense of the word "term"), but the sense of the word that is being used to augment it (e.g., the appropriate sense of the word "sentence") [7].¹

¹ Salton recommends that a thesaurus should be coded for ambiguous words, but only for those senses likely to appear in the collections to be treated [26, pp. 28-29]. However, it is not always easy to make such judgments, and it makes the retrieval system specific to particular subject areas. The thesauri that are currently used in retrieval systems do not take word senses into account.

It is possible that representing documents by word senses, rather than words, will improve retrieval performance. Word senses represent more of the semantics of the text, and they provide a basis for exploring lexical semantic relationships such as synonymy and antonymy, which are important in the construction of thesauri. Very little is known, however, about the quantitative aspects of lexical ambiguity. In this paper we describe experiments designed to discover the degree of lexical ambiguity in information retrieval test collections, and the utility of word senses for discriminating between relevant and nonrelevant documents. The data from these experiments will also provide guidance in the design of algorithms for automatic disambiguation.
In these experiments, word senses are taken from a machine-readable dictionary. Dictionaries vary widely in the information they contain and the number of senses they describe. At one extreme we have pocket dictionaries with about 35,000-45,000 senses, and at the other the Oxford English Dictionary, with over 500,000 senses and in which a single entry can go on for several pages. Even large dictionaries will not contain an exhaustive listing of all of a word's senses; a word can be used in a technical sense specific to a particular field, and new words are constantly entering the language. It is important, however, that the dictionary contain a variety of information that can be used to distinguish the word senses. The dictionary we are using in our research, the Longman Dictionary of Contemporary English (LDOCE) [25], has the following information associated with its senses: part of speech, subcategorization,² morphology, semantic restrictions, and subject classification.³ The latter two are only present in the machine-readable version.
In the following section we discuss previous research that has been done on lexical ambiguity and its relevance to information retrieval. This includes work on the types of ambiguity and algorithms for word sense disambiguation. In Section 3 we present and analyze the results of a series of experiments on lexical ambiguity in information retrieval test collections.

2. PREVIOUS RESEARCH ON LEXICAL AMBIGUITY

2.1 Types of Lexical Ambiguity
The literature generally divides lexical ambiguity into two types: syntactic and semantic [31]. Syntactic ambiguity refers to differences in syntactic category (e.g., play can occur as either a noun or verb). Semantic ambiguity refers to differences in meaning, and is further broken down into homonymy or polysemy, depending on whether or not the meanings are related. The bark of a dog versus the bark of a tree is an example of homonymy; opening a door versus opening a book is an example of polysemy. Syntactic and

² This refers to subclasses of grammatical categories such as transitive versus intransitive verbs.
³ Not all senses have all of this information associated with them. Also, some information, such as part of speech and morphology, is associated with the overall headword rather than just the sense.

semantic ambiguity are orthogonal, since a word can have related meanings in different categories ("He will review the review when he gets back from vacation"), or unrelated meanings in different categories ("Can you see the can?").
Although there is a theoretical distinction between homonymy and polysemy, it is not always easy to tell them apart in practice. What determines whether the senses are related? Dictionaries group senses on the basis of part of speech and etymology, but as mentioned above, senses can be related even though they differ in syntactic category. Senses may also be related etymologically, but be perceived as distinct at the present time (e.g., the "cardinal" of a church and "cardinal" numbers are etymologically related).
It also is not clear how the relationship of senses affects their role in information retrieval. Although senses which are unrelated might be more useful for separating relevant from nonrelevant documents, we found a number of instances in which related senses also acted as good discriminators (e.g., "West Germany" versus "The West").

2.2 Automatic Disambiguation
A number of approaches have been taken to word sense disambiguation. Small used a procedural approach in the Word Experts system [30]: words are considered experts of their own meaning and resolve their senses by passing messages between themselves. Cottrell resolved senses using connectionism [9], and Hirst and Hayes made use of spreading activation and semantic networks [18, 16].
Perhaps the greatest difficulty encountered by previous work was the effort required to construct a representation of the senses. Because of the effort required, most systems have only dealt with a small number of words and a subset of their senses. Small's Word Expert Parser only contained Word Experts for a few dozen words, and Hayes' work only focused on disambiguating nouns. Another shortcoming is that very little work has been done on disambiguating large collections of real-world text. Researchers have instead argued for the advantages of their systems based on theoretical grounds and shown how they work over a selected set of examples. Although information retrieval test collections are small compared to real-world databases, they are still orders of magnitude larger than single-sentence examples. Machine-readable dictionaries give us a way to temporarily avoid the problem of representation of senses.⁴ Instead the work can focus on how well information about the occurrence of a word in context matches with the information associated with its senses.
It is currently not clear what kinds of information will prove most useful for disambiguation. In particular, it is not clear what kinds of knowledge will be required that is not contained in a dictionary. In the sentence "John left

⁴ We will eventually have to deal with word sense representation because of problems associated with dictionaries being incomplete, and because they may make too many distinctions; these are important research issues in lexical semantics. For more discussion on this see Krovetz [21].

a tip," the word "tip" might mean a gratuity or a piece of advice. Cullingford and Pazzani cite this as an example in which scripts are needed for disambiguation [11]. There is little data, however, about how often such a case occurs, how many scripts would be involved, or how much effort is required to construct them. We might be able to do just as well via the use of word cooccurrences (the gratuity sense of tip is likely to occur in the same context as "restaurant," "waiter," or "menu"). That is, we might be able to use the words that could trigger a script without actually making use of one.
Word cooccurrences are a very effective source of information for resolving ambiguity, as shown by experiments described in Section 3. They also form the basis for one of the earliest disambiguation systems, which was developed by Weiss in the context of information retrieval [34]. Words are disambiguated via two kinds of rules: template rules and contextual rules. There is one set of rules for each word to be disambiguated. Template rules look at the words that cooccur within two words of the word to be disambiguated; contextual rules allow a range of five words and ignore a subset of the closed-class words (words such as determiners, prepositions, and conjunctions). In addition, template rules are ordered before contextual rules. Within each class, rules are manually ordered by their frequency of success at determining the correct sense of the ambiguous word. A word is disambiguated by trying each rule in the rule set for the word, starting with the first rule in the set and continuing with each rule in turn until the cooccurrence specified by the rule is satisfied. For example, the word "type" has a rule that indicates if it is followed by the word "of" then it has the meaning "kind" (a template rule); if "type" cooccurs within five words of the word "pica" or "print," it is given a printing interpretation (a contextual rule). Weiss conducted two sets of experiments: one on five words that occurred in the queries of a test collection on documentation, and one on three words, but with a version of the system that learned the rules. Weiss felt that disambiguation would be more useful for question answering than strict information retrieval, but would become more necessary as databases became larger and more general.
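The rule ordering Weiss describes can be sketched as follows. The two rules for "type" come from the example above; the rule encoding, the helper names, and the stop-word set are our own illustration, not Weiss's implementation.

```python
# Sketch of Weiss-style disambiguation: ordered template rules
# (exact offset, within two words) tried before contextual rules
# (five-word window, closed-class words ignored).

def template_rule(offset, cue, sense):
    """Fire if the cue word appears at an exact offset from the target."""
    def rule(words, i):
        j = i + offset
        return sense if 0 <= j < len(words) and words[j] == cue else None
    return rule

def contextual_rule(cues, sense, window=5,
                    stop_words=frozenset({"the", "a", "of", "in", "and"})):
    """Fire if a cue word appears within the window, ignoring stop words."""
    def rule(words, i):
        context = [w for w in words[max(0, i - window): i + window + 1]
                   if w not in stop_words]
        return sense if any(c in context for c in cues) else None
    return rule

# Template rules precede contextual rules; within each class, rules are
# ordered by their frequency of success.
RULES = {
    "type": [
        template_rule(+1, "of", "kind"),
        contextual_rule({"pica", "print"}, "printing"),
    ],
}

def disambiguate(word, words, i):
    """Try each rule in turn; return the first sense whose cooccurrence holds."""
    for rule in RULES.get(word, []):
        sense = rule(words, i)
        if sense is not None:
            return sense
    return None

sentence = "what type of printer uses pica".split()
print(disambiguate("type", sentence, 1))  # -> kind (the template rule fires first)
```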
Word collocation was also used in several other disambiguation efforts. Black compared collocation with an approach based on subject-area codes and found collocation to be more effective [6]. Dahlgren used collocation as one component of a multiphase disambiguation system (she also used syntax and "common sense knowledge" based on the results of psycholinguistic studies) [12]. Atkins examined the reliability of collocation and syntax for identifying the senses of the word "danger" in a large corpus [3]; she found that they were reliable indicators of a particular sense for approximately 70% of the word instances she examined. Finally, Choueka and Lusignan showed that people can often disambiguate words with only a few words of context (frequently only one word is needed) [8].
Syntax is also an important source of information for disambiguation. Along with the work of Dahlgren and Atkins, it has also been used by Kelly and Stone for content analysis in the social sciences [20], and by Earl for machine translation [13]. The latter work was primarily concerned with subcategorization (distinctions within a syntactic category), but also included
semantic categories as part of the patterns associated with various words. Earl and her colleagues noticed that the patterns could be used for disambiguation and speculated that they might be used in information retrieval to help determine better phrases for indexing.
Finally, the redundancy in a text can be a useful source of information. The words "bat," "ball," "pitcher," and "base" are all ambiguous and can be used in a variety of contexts, but collectively they indicate a single context and particular meanings. These ideas have been discussed in the literature for a long time ([2, 24]), but have only recently been exploited in computerized systems. All of the efforts rely on the use of a thesaurus, either explicitly, as in the work of Bradley and Liaw (cf. [28]), or implicitly, as in the work of Slator [29]. The basic idea is to compute a histogram over the classes of a thesaurus; for each word in a document, a counter is incremented for each thesaurus class in which the word is a member. The top-rated thesaurus classes are then used to provide a bias for which senses of the words are correct. Bradley and Liaw use Roget's Third International Thesaurus, and Slator uses the subject codes associated with senses in the Longman Dictionary of Contemporary English (LDOCE).⁵
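The histogram idea can be sketched as below. The toy thesaurus and its class names are an invented illustration, not taken from Roget's or the LDOCE subject codes.

```python
from collections import Counter

# Invented mini-thesaurus: word -> set of thesaurus classes it belongs to.
THESAURUS_CLASSES = {
    "bat":     {"baseball", "animals"},
    "pitcher": {"baseball", "containers"},
    "ball":    {"baseball", "dance"},
    "base":    {"baseball", "chemistry"},
}

def class_histogram(words):
    """For each word, increment a counter for each class it is a member of."""
    hist = Counter()
    for w in words:
        for c in THESAURUS_CLASSES.get(w, ()):
            hist[c] += 1
    return hist

def biased_senses(words, top_n=1):
    """Use the top-rated classes as a bias: keep only each word's
    classes that fall within the document's dominant classes."""
    top = {c for c, _ in class_histogram(words).most_common(top_n)}
    return {w: THESAURUS_CLASSES[w] & top
            for w in words if w in THESAURUS_CLASSES}

doc = ["bat", "ball", "pitcher", "base"]
print(class_histogram(doc).most_common(1))  # the "baseball" class dominates
print(biased_senses(doc))                   # each word biased to its baseball sense
```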
Machine-readable dictionaries have also been used in two other disambiguation systems. Lesk, using the Oxford Advanced Learner's Dictionary,⁶ takes a simple approach to disambiguation: words are disambiguated by counting the overlap between words used in the definitions of the senses [23]. For example, the word "pine" can have two senses: a tree, or sadness (as in "pine away"), and the word "cone" may be a geometric structure, or a fruit of a tree. Lesk's program computes the overlap between the senses of "pine" and "cone," and finds that the senses meaning "tree" and "fruit of a tree" have the most words in common. Lesk gives a success rate of fifty to seventy percent in disambiguating the words over a small collection of text.
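Lesk's overlap counting can be sketched as follows; the miniature definitions below are invented paraphrases for illustration, not the actual dictionary text.

```python
from itertools import product

# Sketch of Lesk's algorithm: choose the pair of senses whose
# definitions share the most words.
SENSES = {
    "pine": {
        "tree":    "an evergreen tree with long needle shaped leaves",
        "sadness": "to waste away through sorrow or longing",
    },
    "cone": {
        "geometry": "a solid figure with a circular base and a point",
        "fruit":    "the fruit of an evergreen tree such as the pine",
    },
}

def overlap(def_a, def_b):
    """Count the words the two definitions have in common."""
    return len(set(def_a.split()) & set(def_b.split()))

def best_senses(word_a, word_b):
    """Return the sense pair with the largest definition overlap."""
    pairs = product(SENSES[word_a].items(), SENSES[word_b].items())
    return max(pairs, key=lambda p: overlap(p[0][1], p[1][1]))

(best_a, _), (best_b, _) = best_senses("pine", "cone")
print(best_a, best_b)  # -> tree fruit
```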
Wilks performed a similar experiment using the Longman dictionary [35]. Rather than just counting the overlap of words, all the words in the definition of a particular sense of some word are grouped into a vector. To determine the sense of a word in a sentence, a vector of words from the sentence is compared to the vectors constructed from the sense definitions. The word is assigned the sense corresponding to the most similar vector. Wilks manually disambiguated all occurrences of the word "bank" within LDOCE according to the senses of its definition and compared this to the results of the vector matching. Of the 197 occurrences of "bank," the similarity match correctly assigned 45% of them to the correct sense; the correct sense was in the top three senses 85% of the time.

⁵ These codes are only present in the machine-readable version.
⁶ Lesk also tried the same experiments with the Merriam-Webster Collegiate Dictionary and the Collins English Dictionary; while he did not find any significant differences, he speculated that the longer definitions used in the Oxford English Dictionary (OED) might yield better results. Later work by Becker on the New OED indicated that Lesk's algorithm did not perform as well as expected [4].
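A bag-of-words version of Wilks's vector comparison might look like the following sketch. The two "bank" definitions and the choice of cosine similarity over word counts are our assumptions for illustration; the paper describes the approach at a higher level.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented mini-definitions for two senses of "bank".
SENSE_VECTORS = {
    "river": Counter("land along the side of a river".split()),
    "money": Counter("a place where money is kept and paid out".split()),
}

def choose_sense(sentence: str) -> str:
    """Assign the sense whose definition vector is most similar
    to the vector of words from the sentence."""
    vec = Counter(sentence.split())
    return max(SENSE_VECTORS, key=lambda s: cosine(vec, SENSE_VECTORS[s]))

print(choose_sense("she kept her money in the bank"))  # -> money
```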
Because information retrieval systems handle large text databases (megabytes for a test collection and gigabytes/terabytes for an operational system), the correct sense will never be known for most of the words encountered. This is due to the simple fact that no human being will ever provide such confirmation. In addition, it is not always clear just what the correct sense is. In disambiguating the occurrences of "bank" within the Longman dictionary, Wilks found a number of cases where none of the senses was clearly the right one [35]. In the information retrieval context, however, it may not be necessary to identify the single correct sense of a word; retrieval effectiveness may be improved by ruling out as many of the incorrect word senses as possible, and giving a high weight to the senses most likely to be correct.
Another factor to consider is that the dictionary may sometimes make distinctions that are not necessarily useful for a particular application. For example, consider the senses for the word "term" in the Longman dictionary. Seven of the senses are for a noun and one is for a verb. Of the seven noun senses, five refer to periods of time; one has the meaning "a vocabulary item"; and one has the meaning "a component of a mathematical expression." It may only be important to distinguish the four classes (three nouns and one verb), with the five "period of time" senses being collapsed into one. The experiments in this paper provide some insight into the important sense distinctions for information retrieval.
As we mentioned at the start of this section, a major problem with previous approaches has been the effort required to develop a lexicon. Dahlgren is currently conducting tests on a 6,000-word corpus based on six articles from the Wall Street Journal. Development of the lexicon (which includes entries for 5,000 words)⁷ took eight man-years of effort (Dahlgren, personal communication). This effort did not include a representation for all of the senses for those words, only the senses that actually occurred in the corpora she has been studying. While a significant part of this time was devoted to a one-time design effort, a substantial amount of time is still required for adding new words.
The research described above has not provided many experimental results. Several researchers did not provide any experimental evidence, and the rest only conducted experiments on a small collection of text, a small number of words, and/or a restricted range of senses. Although some work has been done with information retrieval collections (e.g., [34]), disambiguation was only done for the queries. None of the previous work has provided evidence that disambiguation would be useful in separating relevant from nonrelevant documents. The following sections describe the degree of ambiguity found in two information retrieval test collections, and experiments involving word sense weighting, word sense matching, and the distribution of senses in queries and in the corpora.

⁷ These entries are based not only on the Wall Street Journal corpus, but also on a corpus of 4,100 words taken from a geography text.

Table I. Statistics on Information Retrieval Test Collections

                                      CACM    TIME
Number of queries                       64      83
Number of documents                   3204     423
Mean words per query                     -       -
Mean words per document                  -       -
Mean relevant documents per query        -       -

3. EXPERIMENTAL RESULTS ON LEXICAL AMBIGUITY

Although lexical ambiguity is often mentioned in the information retrieval literature as a problem (cf. [19, 26]), relatively little information is provided about the degree of ambiguity encountered, or how much improvement would result from its resolution.⁸ We conducted experiments to determine the effectiveness of weighting words by the number of senses they have, and to determine the utility of word meanings in separating relevant from nonrelevant documents. We first provide statistics about the retrieval collections we used and then describe the results of our experiments.

3.1 Collection Statistics
Information retrieval systems are evaluated with respect to standard test collections. Our experiments were done on two of these collections: a set of titles and abstracts from Communications of the ACM (CACM) [14] and a set of short articles from TIME magazine. We chose these collections because of the contrast they provide; we wanted to see whether the subject area of the text has any effect on our experiments. Each collection also includes a set of natural language queries and relevance judgments that indicate which documents are relevant to each query. The CACM collection contains 3204 titles and abstracts⁹ and 64 queries. The TIME collection contains only 423 documents¹⁰ and 83 queries, but the documents are more than six times longer than the CACM abstracts, so the collection overall contains more text. Table I lists the basic statistics for the two collections. We note that there are far fewer relevant documents per query for the TIME collection than for the CACM collection. The average for CACM does not include the 12 queries that do not have relevant documents.
Table II provides statistics about the word senses found in the two collections. The mean number of senses for the documents and queries was

⁸ Weiss mentions that resolving ambiguity in the SMART system was found to improve performance by only 1 percent, but did not provide any details on the experiments that were involved [34].
⁹ Half of those are title only.
¹⁰ The original collection contained 425 documents, but two of the documents were duplicates.
¹¹ This analyzer is not the same as a "stemmer," which conflates word variants by truncating their endings; a stemmer does not indicate a word's root, and would not provide us with a way to determine which words were found in the dictionary. Stemming is commonly used in information retrieval systems, however, and was therefore used in the experiments that follow.

Table II. Statistics for Word Senses in IR Test Collections

CACM                                      Unique Words   Word Occurrences
Number of words in the corpus                    10203             169769
Number of those words in LDOCE              3922 (38%)       131804 (78%)
Including morphological variants            5799 (57%)       149358 (88%)
Mean number of senses in the collection    4.7 (4.4 without stop words)
Mean number of senses in the queries       6.8 (5.3 without stop words)

TIME                                      Unique Words   Word Occurrences
Number of words in the corpus                    22106             247031
Number of those words in LDOCE              9355 (42%)       196083 (79%)
Including morphological variants           14326 (65%)       215967 (87%)
Mean number of senses in the collection    3.7 (3.6 without stop words)
Mean number of senses in the queries       8.2 (4.8 without stop words)

determined by a dictionary lookup process. Each word was initially retrieved from the dictionary directly; if it was not found, the lookup was retried, this time making use of a simple morphological analyzer.¹¹ For each dataset, the mean number of senses is calculated by averaging the number of senses for all unique words (word types) found in the dictionary.
The statistics indicate that a similar percentage of the words in the TIME and CACM collections appear in the dictionary (about 40% before any morphology and 57 to 65% once simple morphology is done),¹² but that the TIME collection contains about twice as many unique words as CACM. Our morphological analyzer primarily does inflectional morphology (tense, aspect, plural, negation, comparative, and superlative). We estimate that adding more complex morphology would capture another 10% of the unique words.
The statistics indicate that both collections have the potential to benefit from disambiguation. The mean number of senses for the CACM collection is 4.7 (4.4 once stop words are removed)¹³ and 3.7 senses for the TIME collection (3.6 senses without the stop words). The ambiguity of the words in the

¹² These percentages refer to the unique words (word types) in the corpora. The words that were not in the dictionary consist of hyphenated forms, proper nouns, morphological variants not captured by the simple analyzer, and words that are domain specific.
¹³ Stop words are words that are not considered useful for indexing, such as determiners, prepositions, conjunctions, and other closed-class words. They are among the most ambiguous words in the language. See [33] for a list of typical stop words.
queries is also important. If those words were unambiguous, then disambiguation would not be needed because the documents would be retrieved based on the senses of the words in the queries. Our results indicate that the words in the queries are even more ambiguous than those in the documents.

3.2 Experiment 1—Word Sense Weighting

Experiments with statistical information retrieval have shown that better performance is achieved by weighting words based on their frequency of use. The most effective weight is usually referred to as TF.IDF, which includes a component based on the frequency of the term in a document (TF) and a component based on the inverse of the frequency within the document collection (IDF) [27]. The intuitive basis for this weighting is that high-frequency words are not able to effectively discriminate relevant from nonrelevant documents. The IDF component gives a low weight to these words and increases the weight as the words become more selective. The TF component indicates that once a word appears in a document, its frequency within the document is a reflection of the document's relevance.
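A common form of this weight can be sketched as below. TF.IDF has many variants; this particular formula is an illustrative choice, not necessarily the one used in the experiments reported here.

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """TF.IDF weight for a term: its frequency in the document (TF) times
    the log of the inverse of its document frequency in the collection (IDF)."""
    return tf * math.log(n_docs / df)

# A word appearing in every document gets weight 0 (no discriminating power);
# a rarer word with the same within-document frequency is weighted highly.
print(tf_idf(tf=3, df=1000, n_docs=1000))  # common word -> 0.0
print(tf_idf(tf=3, df=10, n_docs=1000))    # rare word -> much higher weight
```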
Words of high frequency also tend to be words with a high number of senses. In fact, the number of senses for a word is approximately the square root of its relative frequency [36].¹⁴ While this correlation may hold in general, it might be violated for particular words in a specific document collection. For example, in the CACM collection the word "computer" occurs very often, but it cannot be considered very ambiguous.
The intuition about the IDF component can be recast in terms of ambiguity: words which are very ambiguous are not able to effectively discriminate relevant from nonrelevant documents. This led to the following hypothesis: weighting words in inverse proportion to their number of senses will give similar retrieval effectiveness to weighting based on inverse collection frequency (IDF). This hypothesis is tested in the first experiment. Using word ambiguity to replace IDF weighting is a relatively crude technique, however, and there are more appropriate ways to include information about word senses in the retrieval model. In particular, the probabilistic retrieval model [10, 15, 33] can be modified to include information about the probabilities of occurrence of word senses. This leads to the second hypothesis tested in this experiment: incorporating information about word senses in a modified probabilistic retrieval model will improve retrieval effectiveness. The methodology and results of these experiments are discussed in the following sections.
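The first hypothesis amounts to swapping the IDF component for a sense-based one. Weighting by the reciprocal of the sense count is our reading of "inverse proportion"; the exact normalization used in the experiments is not specified at this point in the paper.

```python
import math

def idf_weight(df: int, n_docs: int) -> float:
    """Standard inverse document frequency component."""
    return math.log(n_docs / df)

def sense_weight(n_senses: int) -> float:
    """Weight a word in inverse proportion to its number of dictionary
    senses (our reading of the hypothesis; the paper's exact
    normalization may differ)."""
    return 1.0 / n_senses

# "computer" is frequent in CACM yet nearly unambiguous, so the two
# weightings can disagree for particular words in a specific collection.
print(sense_weight(1))   # an unambiguous word keeps full weight
print(sense_weight(13))  # a highly ambiguous word is heavily discounted
```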
3.2.1 Methodology of the Weighting Experiment. In order to understand the methodology of our experiment, we first provide a brief description of how retrieval systems are implemented.

¹⁴ It should be noted that this is not the same as "Zipf's law," which states that the log of a word's frequency is proportional to its rank. That is, a small number of words account for most of the occurrences of words in a text, and almost all of the other words in the language occur infrequently.
Information retrieval systems typically use an inverted file to identify those documents that contain the words mentioned in a query. The inverted file specifies a document identification number for each document in which the word occurs. For each word in the query, the system looks up the document list from the inverted file and enters the document in a hash table; the table is keyed on the document number, and the value is initially 1. If the document was previously entered in the table, the value is simply incremented. The end result is that each entry in the table contains the number of query words that occurred in that document. The table is then sorted to produce a ranked list of documents. Such a ranking is referred to as a "coordination match" and constitutes a baseline strategy. As we mentioned earlier, performance can be improved by making use of the frequencies of the word within the collection and in the specific documents in which it occurs. This invo
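The coordination-match procedure described above can be sketched as follows; the toy inverted file is our own illustration.

```python
from collections import Counter

# Toy inverted file: word -> list of document identification numbers.
INVERTED_FILE = {
    "lexical":   [1, 3],
    "ambiguity": [1, 2, 3],
    "retrieval": [1, 4],
}

def coordination_match(query_words):
    """Rank documents by the number of query words they contain.
    The hash table (here a Counter) is keyed on the document number;
    each posting increments that document's count, and the table is
    then sorted to produce the ranked list."""
    table = Counter()
    for word in query_words:
        for doc_id in INVERTED_FILE.get(word, []):
            table[doc_id] += 1
    return table.most_common()  # ranked list of (doc_id, matching word count)

print(coordination_match(["lexical", "ambiguity", "retrieval"]))
# document 1 ranks first, since it contains all three query words
```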
