Lexical Ambiguity and Information Retrieval

ROBERT KROVETZ
and
W. BRUCE CROFT
University of Massachusetts
`
Lexical ambiguity is a pervasive problem in natural language processing. However, little quantitative information is available about the extent of the problem or about the impact that it has on information retrieval systems. We report on an analysis of lexical ambiguity in information retrieval test collections and on experiments to determine the utility of word meanings for separating relevant from nonrelevant documents. The experiments show that there is considerable ambiguity even in a specialized database. Word senses provide a significant separation between relevant and nonrelevant documents, but several factors contribute to determining whether disambiguation will make an improvement in performance. For example, resolving lexical ambiguity was found to have little impact on retrieval effectiveness for documents that have many words in common with the query. Other uses of word sense disambiguation in an information retrieval context are discussed.
`
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—dictionaries, indexing methods, linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—search process, selection process; I.2.7 [Artificial Intelligence]: Natural Language Processing—text analysis

General Terms: Experimentation, Measurement, Performance

Additional Key Words and Phrases: Disambiguation, document retrieval, semantically based search, word senses
`
1. INTRODUCTION
The goal of an information retrieval system is to locate relevant documents in response to a user's query. Documents are typically retrieved as a ranked list, where the ranking is based on estimations of relevance [5]. The retrieval model for an information retrieval system specifies how documents and queries are represented and how these representations are compared to produce relevance estimates. The performance of the system is evaluated
`
This work has been supported by the Office of Naval Research under University Research Initiative Grant N00014-86-K-0746, by the Air Force Office of Scientific Research under contract 91-0324, and by NSF Grant IRI-8814790.
Authors' address: Computer Science Department, University of Massachusetts, Amherst, MA 01003; email: krovetz@cs.umass.edu and croft@cs.umass.edu.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1992 ACM 1046-8188/92/0400-0115 $01.50
`
ACM Transactions on Information Systems, Vol. 10, No. 2, April 1992, Pages 115-141.
`
`
with respect to standard test collections that provide a set of queries, a set of documents, and a set of relevance judgments that indicate which documents are relevant to each query. These judgments are provided by the users who supply the queries and serve as a standard for evaluating performance. Information retrieval research is concerned with finding representations and methods of comparison that will accurately discriminate between relevant and nonrelevant documents.
Many retrieval systems represent documents and queries by the words they contain, and base the comparison on the number of words they have in common. The more words the query and document have in common, the higher the document is ranked; this is referred to as a "coordination match." Performance is improved by weighting query and document words using frequency information from the collection and individual document texts [27].
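The coordination match just described can be sketched in a few lines. The toy documents, tokenization, and scores below are our own illustration, not part of the original experiments.

```python
# Hypothetical sketch of a coordination match: rank documents by the
# number of query words they contain. Tokenization is deliberately naive.
def coordination_match(query_words, documents):
    """Return (doc_id, score) pairs sorted by shared-word count."""
    scores = []
    for doc_id, text in documents.items():
        overlap = len(set(query_words) & set(text.lower().split()))
        if overlap > 0:
            scores.append((doc_id, overlap))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": "word sense disambiguation for document retrieval",
    "d2": "retrieval of relevant documents",
    "d3": "syntactic parsing of queries",
}
print(coordination_match(["word", "retrieval"], docs))  # → [('d1', 2), ('d2', 1)]
```

The frequency-based weighting mentioned above would replace the raw overlap count with a weighted sum, but the ranking skeleton stays the same.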
There are two problems with using words to represent the content of documents. The first problem is that words are ambiguous, and this ambiguity can cause documents to be retrieved that are not relevant. Consider the following description of a search that was performed using the keyword "AIDS":

Unfortunately, not all 34 [references] were about AIDS, the disease. The references included "two helpful aids during the first three months after total hip replacement" and "aids in diagnosing abnormal voiding patterns" [17].

One response to this problem is to use phrases to reduce ambiguity (e.g., specifying "hearing aids" if that is the desired sense) [27]. It is not always possible, however, to provide phrases in which the word occurs only with the desired sense. In addition, the requirement for phrases imposes a significant burden on the user.
The second problem is that a document can be relevant even though it does not use the same words as those that are provided in the query. The user is generally not interested in retrieving documents with exactly the same words, but with the concepts that those words represent. Retrieval systems address this problem by expanding the query words using related words from a thesaurus [27]. The relationships described in a thesaurus, however, are really between word senses rather than words. For example, the word "term" could be synonymous with "word" (as in a vocabulary term), "sentence" (as in a prison term), or "condition" (as in "terms of agreement"). If we expand the query with words from a thesaurus, we must be careful to use the right senses of those words. We not only have to know the sense of the word in the query (in this example, the sense of the word "term"), but the sense of the word that is being used to augment it (e.g., the appropriate sense of the word "sentence") [7].¹
`
¹ Salton recommends that a thesaurus should be coded for ambiguous words, but only for those senses likely to appear in the collections to be treated [26, pp. 28-29]. However, it is not always easy to make such judgments, and it makes the retrieval system specific to particular subject areas. The thesauri that are currently used in retrieval systems do not take word senses into account.
`
`
It is possible that representing documents by word senses, rather than words, will improve retrieval performance. Word senses represent more of the semantics of the text, and they provide a basis for exploring lexical semantic relationships such as synonymy and antonymy, which are important in the construction of thesauri. Very little is known, however, about the quantitative aspects of lexical ambiguity. In this paper we describe experiments designed to discover the degree of lexical ambiguity in information retrieval test collections, and the utility of word senses for discriminating between relevant and nonrelevant documents. The data from these experiments will also provide guidance in the design of algorithms for automatic disambiguation.
In these experiments, word senses are taken from a machine-readable dictionary. Dictionaries vary widely in the information they contain and the number of senses they describe. At one extreme we have pocket dictionaries with about 35,000-45,000 senses, and at the other the Oxford English Dictionary, with over 500,000 senses and in which a single entry can go on for several pages. Even large dictionaries will not contain an exhaustive listing of all of a word's senses; a word can be used in a technical sense specific to a particular field, and new words are constantly entering the language. It is important, however, that the dictionary contain a variety of information that can be used to distinguish the word senses. The dictionary we are using in our research, the Longman Dictionary of Contemporary English (LDOCE) [25], has the following information associated with its senses: part of speech, subcategorization,² morphology, semantic restrictions, and subject classification.³ The latter two are only present in the machine-readable version.
In the following section we discuss previous research that has been done on lexical ambiguity and its relevance to information retrieval. This includes work on the types of ambiguity and algorithms for word sense disambiguation. In Section 3 we present and analyze the results of a series of experiments on lexical ambiguity in information retrieval test collections.
`
2. PREVIOUS RESEARCH ON LEXICAL AMBIGUITY

2.1 Types of Lexical Ambiguity
The literature generally divides lexical ambiguity into two types: syntactic and semantic [31]. Syntactic ambiguity refers to differences in syntactic category (e.g., play can occur as either a noun or verb). Semantic ambiguity refers to differences in meaning, and is further broken down into homonymy or polysemy, depending on whether or not the meanings are related. The bark of a dog versus the bark of a tree is an example of homonymy; opening a door versus opening a book is an example of polysemy. Syntactic and
² This refers to subclasses of grammatical categories such as transitive versus intransitive verbs.
³ Not all senses have all of this information associated with them. Also, some information, such as part of speech and morphology, is associated with the overall headword rather than just the sense.
`
`
semantic ambiguity are orthogonal, since a word can have related meanings in different categories ("He will review the review when he gets back from vacation"), or unrelated meanings in different categories ("Can you see the can?").
Although there is a theoretical distinction between homonymy and polysemy, it is not always easy to tell them apart in practice. What determines whether the senses are related? Dictionaries group senses on the basis of part of speech and etymology, but as mentioned above, senses can be related even though they differ in syntactic category. Senses may also be related etymologically, but be perceived as distinct at the present time (e.g., the "cardinal" of a church and "cardinal" numbers are etymologically related).
It also is not clear how the relationship of senses affects their role in information retrieval. Although senses which are unrelated might be more useful for separating relevant from nonrelevant documents, we found a number of instances in which related senses also acted as good discriminators (e.g., "West Germany" versus "The West").
`
2.2 Automatic Disambiguation
A number of approaches have been taken to word sense disambiguation. Small used a procedural approach in the Word Experts system [30]: words are considered experts of their own meaning and resolve their senses by passing messages between themselves. Cottrell resolved senses using connectionism [9], and Hirst and Hayes made use of spreading activation and semantic networks [18, 16].
Perhaps the greatest difficulty encountered by previous work was the effort required to construct a representation of the senses. Because of the effort required, most systems have only dealt with a small number of words and a subset of their senses. Small's Word Expert Parser only contained Word Experts for a few dozen words, and Hayes' work only focused on disambiguating nouns. Another shortcoming is that very little work has been done on disambiguating large collections of real-world text. Researchers have instead argued for the advantages of their systems based on theoretical grounds and shown how they work over a selected set of examples. Although information retrieval test collections are small compared to real-world databases, they are still orders of magnitude larger than single-sentence examples. Machine-readable dictionaries give us a way to temporarily avoid the problem of representation of senses.⁴ Instead the work can focus on how well information about the occurrence of a word in context matches with the information associated with its senses.
It is currently not clear what kinds of information will prove most useful for disambiguation. In particular, it is not clear what kinds of knowledge will be required that is not contained in a dictionary. In the sentence "John left
`
⁴ We will eventually have to deal with word sense representation because of problems associated with dictionaries being incomplete, and because they may make too many distinctions; these are important research issues in lexical semantics. For more discussion on this see Krovetz [21].
`
`
a tip," the word "tip" might mean a gratuity or a piece of advice. Cullingford and Pazzani cite this as an example in which scripts are needed for disambiguation [11]. There is little data, however, about how often such a case occurs, how many scripts would be involved, or how much effort is required to construct them. We might be able to do just as well via the use of word cooccurrences (the gratuity sense of tip is likely to occur in the same context as "restaurant," "waiter," or "menu"). That is, we might be able to use the words that could trigger a script without actually making use of one.
Word cooccurrences are a very effective source of information for resolving ambiguity, as shown by experiments described in Section 3. They also form the basis for one of the earliest disambiguation systems, which was developed by Weiss in the context of information retrieval [34]. Words are disambiguated via two kinds of rules: template rules and contextual rules. There is one set of rules for each word to be disambiguated. Template rules look at the words that cooccur within two words of the word to be disambiguated; contextual rules allow a range of five words and ignore a subset of the closed-class words (words such as determiners, prepositions, and conjunctions). In addition, template rules are ordered before contextual rules. Within each class, rules are manually ordered by their frequency of success at determining the correct sense of the ambiguous word. A word is disambiguated by trying each rule in the rule set for the word, starting with the first rule in the set and continuing with each rule in turn until the cooccurrence specified by the rule is satisfied. For example, the word "type" has a rule that indicates if it is followed by the word "of" then it has the meaning "kind" (a template rule); if "type" cooccurs within five words of the word "pica" or "print," it is given a printing interpretation (a contextual rule). Weiss conducted two sets of experiments: one on five words that occurred in the queries of a test collection on documentation and one on three words, but with a version of the system that learned the rules. Weiss felt that disambiguation would be more useful for question answering than strict information retrieval, but would become more necessary as databases became larger and more general.
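The ordered template/contextual rules can be sketched as follows. The rule set for "type" paraphrases the example above, and the window sizes follow the two-word and five-word ranges described, but the sense labels and the test sentence are our own assumptions, not Weiss's actual rules.

```python
# Illustrative Weiss-style rule set for the word "type". Template rules
# (a fixed position near the target) are tried before contextual rules
# (any position within five words on either side).
def disambiguate_type(tokens, i):
    """Return a sense label for tokens[i], assumed to be 'type'."""
    # Template rule: "type" immediately followed by "of" means "kind".
    if i + 1 < len(tokens) and tokens[i + 1] == "of":
        return "kind"
    # Contextual rule: printing vocabulary within five words.
    context = tokens[max(0, i - 5): i + 6]
    if {"pica", "print"} & set(context):
        return "printing"
    return "unknown"

sentence = "set the type in ten point pica".split()
print(disambiguate_type(sentence, sentence.index("type")))  # → printing
```

A full system would hold one such ordered rule list per ambiguous word, trying each rule in turn until one fires.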
Word collocation was also used in several other disambiguation efforts. Black compared collocation with an approach based on subject-area codes and found collocation to be more effective [6]. Dahlgren used collocation as one component of a multiphase disambiguation system (she also used syntax and "common sense knowledge" based on the results of psycholinguistic studies) [12]. Atkins examined the reliability of collocation and syntax for identifying the senses of the word "danger" in a large corpus [3]; she found that they were reliable indicators of a particular sense for approximately 70% of the word instances she examined. Finally, Choueka and Lusignan showed that people can often disambiguate words with only a few words of context (frequently only one word is needed) [8].
Syntax is also an important source of information for disambiguation. Along with the work of Dahlgren and Atkins, it has also been used by Kelly and Stone for content analysis in the social sciences [20], and by Earl for machine translation [13]. The latter work was primarily concerned with subcategorization (distinctions within a syntactic category), but also included
`
semantic categories as part of the patterns associated with various words. Earl and her colleagues noticed that the patterns could be used for disambiguation and speculated that they might be used in information retrieval to help determine better phrases for indexing.
Finally, the redundancy in a text can be a useful source of information. The words "bat," "ball," "pitcher," and "base" are all ambiguous and can be used in a variety of contexts, but collectively they indicate a single context and particular meanings. These ideas have been discussed in the literature for a long time ([2, 24]), but have only recently been exploited in computerized systems. All of the efforts rely on the use of a thesaurus, either explicitly, as in the work of Bradley and Liaw (cf. [28]), or implicitly, as in the work of Slator [29]. The basic idea is to compute a histogram over the classes of a thesaurus; for each word in a document, a counter is incremented for each thesaurus class in which the word is a member. The top-rated thesaurus classes are then used to provide a bias for which senses of the words are correct. Bradley and Liaw use Roget's Third International Thesaurus, and Slator uses the subject codes associated with senses in the Longman Dictionary of Contemporary English (LDOCE).⁵
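The histogram computation can be sketched minimally as follows; the four-word thesaurus is invented for illustration and stands in for Roget's classes or the LDOCE subject codes.

```python
from collections import Counter

# Toy thesaurus mapping words to the classes they belong to (invented).
THESAURUS = {
    "bank":    ["FINANCE", "GEOGRAPHY"],  # money bank vs. river bank
    "deposit": ["FINANCE", "GEOLOGY"],
    "loan":    ["FINANCE"],
    "river":   ["GEOGRAPHY"],
}

def top_classes(words, n=1):
    """Histogram the thesaurus classes over a document's words."""
    counts = Counter()
    for word in words:
        counts.update(THESAURUS.get(word, []))
    return [cls for cls, _ in counts.most_common(n)]

# "bank" alone is ambiguous, but the surrounding words tip the balance:
print(top_classes(["bank", "deposit", "loan"]))  # → ['FINANCE']
```

The winning classes would then bias sense selection toward senses belonging to those classes.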
Machine-readable dictionaries have also been used in two other disambiguation systems. Lesk, using the Oxford Advanced Learner's Dictionary,⁶ takes a simple approach to disambiguation: words are disambiguated by counting the overlap between words used in the definitions of the senses [23]. For example, the word "pine" can have two senses: a tree, or sadness (as in "pine away"), and the word "cone" may be a geometric structure, or a fruit of a tree. Lesk's program computes the overlap between the senses of "pine" and "cone," and finds that the senses meaning "tree" and "fruit of a tree" have the most words in common. Lesk gives a success rate of fifty to seventy percent in disambiguating the words over a small collection of text.
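Lesk's overlap count can be sketched directly from the "pine cone" example; the one-line glosses below are paraphrases we wrote for illustration, not the dictionary's actual definitions.

```python
# Sketch of Lesk's definition-overlap idea with invented glosses.
DEFS = {
    ("pine", "tree"):    "an evergreen tree with needle shaped leaves",
    ("pine", "sadness"): "to waste away through sorrow or longing",
    ("cone", "shape"):   "a solid geometric figure with a circular base",
    ("cone", "fruit"):   "the fruit of an evergreen tree such as the pine",
}

def lesk_pair(word1, word2):
    """Pick the sense pair whose definitions share the most words."""
    best, best_overlap = None, -1
    for (w1, s1), d1 in DEFS.items():
        if w1 != word1:
            continue
        for (w2, s2), d2 in DEFS.items():
            if w2 != word2:
                continue
            overlap = len(set(d1.split()) & set(d2.split()))
            if overlap > best_overlap:
                best, best_overlap = (s1, s2), overlap
    return best

print(lesk_pair("pine", "cone"))  # → ('tree', 'fruit')
```

With these glosses, the "tree" and "fruit" definitions share "an", "evergreen", and "tree", beating every other pairing.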
Wilks performed a similar experiment using the Longman dictionary [35]. Rather than just counting the overlap of words, all the words in the definition of a particular sense of some word are grouped into a vector. To determine the sense of a word in a sentence, a vector of words from the sentence is compared to the vectors constructed from the sense definitions. The word is assigned the sense corresponding to the most similar vector. Wilks manually disambiguated all occurrences of the word "bank" within LDOCE according to the senses of its definition and compared this to the results of the vector matching. Of the 197 occurrences of "bank," the similarity match correctly assigned 45% of them to the correct sense; the correct sense was in the top three senses 85% of the time.
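The vector-matching idea can be sketched as follows. The two glosses are invented, the stop-word list is ours, and cosine is one plausible choice of similarity measure (the text above does not fix a particular one).

```python
import math
from collections import Counter

# Sketch of definition-vector matching for two invented senses of "bank".
SENSES = {
    "finance": "an organization that keeps and lends money",
    "river":   "the raised land along the side of a river",
}
STOP = {"a", "an", "and", "at", "of", "she", "that", "the"}

def vec(text):
    return Counter(w for w in text.lower().split() if w not in STOP)

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def best_sense(sentence):
    sv = vec(sentence)
    return max(SENSES, key=lambda s: cosine(sv, vec(SENSES[s])))

print(best_sense("she deposited money at the bank"))  # → finance
```

The sentence vector shares "money" with the finance gloss and nothing with the river gloss, so the finance sense wins.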
`
⁵ These codes are only present in the machine-readable version.
⁶ Lesk also tried the same experiments with the Merriam-Webster Collegiate Dictionary and the Collins English Dictionary; while he did not find any significant differences, he speculated that the longer definitions used in the Oxford English Dictionary (OED) might yield better results. Later work by Becker on the New OED indicated that Lesk's algorithm did not perform as well as expected [4].
`
`
Because information retrieval systems handle large text databases (megabytes for a test collection and gigabytes/terabytes for an operational system), the correct sense will never be known for most of the words encountered. This is due to the simple fact that no human being will ever provide such confirmation. In addition, it is not always clear just what the correct sense is. In disambiguating the occurrences of "bank" within the Longman dictionary, Wilks found a number of cases where none of the senses was clearly the right one [35]. In the information retrieval context, however, it may not be necessary to identify the single correct sense of a word; retrieval effectiveness may be improved by ruling out as many of the incorrect word senses as possible, and giving a high weight to the senses most likely to be correct.
Another factor to consider is that the dictionary may sometimes make distinctions that are not necessarily useful for a particular application. For example, consider the senses for the word "term" in the Longman dictionary. Seven of the senses are for a noun and one is for a verb. Of the seven noun senses, five refer to periods of time; one has the meaning "a vocabulary item"; and one has a meaning "a component of a mathematical expression." It may only be important to distinguish the four classes (three nouns and one verb), with the five "period of time" senses being collapsed into one. The experiments in this paper provide some insight into the important sense distinctions for information retrieval.
As we mentioned at the start of this section, a major problem with previous approaches has been the effort required to develop a lexicon. Dahlgren is currently conducting tests on a 6,000 word corpus based on six articles from the Wall Street Journal. Development of the lexicon (which includes entries for 5,000 words)⁷ took eight man-years of effort (Dahlgren, personal communication). This effort did not include a representation for all of the senses for those words, only the senses that actually occurred in the corpora she has been studying. While a significant part of this time was devoted to a one-time design effort, a substantial amount of time is still required for adding new words.
The research described above has not provided many experimental results. Several researchers did not provide any experimental evidence, and the rest only conducted experiments on a small collection of text, a small number of words, and/or a restricted range of senses. Although some work has been done with information retrieval collections (e.g., [34]), disambiguation was only done for the queries. None of the previous work has provided evidence that disambiguation would be useful in separating relevant from nonrelevant documents. The following sections describe the degree of ambiguity found in two information retrieval test collections, and experiments involving word sense weighting, word sense matching, and the distribution of senses in queries and in the corpora.
`
⁷ These entries are based not only on the Wall Street Journal corpus, but also on a corpus of 4100 words taken from a geography text.
`
`
Table I. Statistics on Information Retrieval Test Collections

                                        CACM    TIME
Number of queries                       64      83
Number of documents                     3204    423
Mean words per query
Mean words per document
Mean relevant documents per query
`
3. EXPERIMENTAL RESULTS ON LEXICAL AMBIGUITY

Although lexical ambiguity is often mentioned in the information retrieval literature as a problem (cf. [19, 26]), relatively little information is provided about the degree of ambiguity encountered, or how much improvement would result from its resolution.⁸ We conducted experiments to determine the effectiveness of weighting words by the number of senses they have, and to determine the utility of word meanings in separating relevant from nonrelevant documents. We first provide statistics about the retrieval collections we used and then describe the results of our experiments.
`
3.1 Collection Statistics
Information retrieval systems are evaluated with respect to standard test collections. Our experiments were done on two of these collections: a set of titles and abstracts from Communications of the ACM (CACM) [14] and a set of short articles from TIME magazine. We chose these collections because of the contrast they provide; we wanted to see whether the subject area of the text has any effect on our experiments. Each collection also includes a set of natural language queries and relevance judgments that indicate which documents are relevant to each query. The CACM collection contains 3204 titles and abstracts⁹ and 64 queries. The TIME collection contains only 423 documents¹⁰ and 83 queries, but the documents are more than six times longer than the CACM abstracts, so the collection overall contains more text. Table I lists the basic statistics for the two collections. We note that there are far fewer relevant documents per query for the TIME collection than for the CACM collection. The average for CACM does not include the 12 queries that do not have relevant documents.
Table II provides statistics about the word senses found in the two collections. The mean number of senses for the documents and queries was
`
⁸ Weiss mentions that resolving ambiguity in the SMART system was found to improve performance by only 1 percent, but did not provide any details on the experiments that were involved [34].
⁹ Half of those are title only.
¹⁰ The original collection contained 425 documents, but two of the documents were duplicates.
¹¹ This analyzer is not the same as a "stemmer," which conflates word variants by truncating their endings; a stemmer does not indicate a word's root, and would not provide us with a way to determine which words were found in the dictionary. Stemming is commonly used in information retrieval systems, however, and was therefore used in the experiments that follow.
`
`
Table II. Statistics for Word Senses in IR Test Collections

CACM
                                          Unique Words    Word Occurrences
Number of words in the corpus             10203           169769
Number of those words in LDOCE            3922 (38%)      131804 (78%)
Including morphological variants          5799 (57%)      149358 (88%)
Mean number of senses in the collection   4.7 (4.4 without stop words)
Mean number of senses in the queries      6.8 (5.3 without stop words)

TIME
                                          Unique Words    Word Occurrences
Number of words in the corpus             22106           247031
Number of those words in LDOCE            9355 (42%)      196083 (79%)
Including morphological variants          14326 (65%)     215967 (87%)
Mean number of senses in the collection   3.7 (3.6 without stop words)
Mean number of senses in the queries      8.2 (4.8 without stop words)
`
determined by a dictionary lookup process. Each word was initially retrieved from the dictionary directly; if it was not found, the lookup was retried, this time making use of a simple morphological analyzer.¹¹ For each dataset, the mean number of senses is calculated by averaging the number of senses for all unique words (word types) found in the dictionary.
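The two-stage lookup can be sketched as follows; the toy dictionary and the suffix list are assumptions standing in for LDOCE and the inflectional analyzer described above.

```python
# Sketch of dictionary lookup with a simple morphological fallback:
# try the surface form first, then strip common inflectional suffixes.
DICTIONARY = {"retrieve": 3, "document": 2, "index": 4}  # word -> sense count

SUFFIXES = ["ing", "ed", "es", "s"]

def sense_count(word):
    word = word.lower()
    if word in DICTIONARY:
        return DICTIONARY[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            for candidate in (root, root + "e"):  # e.g. retriev -> retrieve
                if candidate in DICTIONARY:
                    return DICTIONARY[candidate]
    return None  # not found even after morphology

print(sense_count("retrieves"))  # → 3
print(sense_count("indexing"))   # → 4
```

Words that fail both stages would be counted among the forms not captured by the simple analyzer.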
The statistics indicate that a similar percentage of the words in the TIME and CACM collections appear in the dictionary (about 40% before any morphology and 57 to 65% once simple morphology is done),¹² but that the TIME collection contains about twice as many unique words as CACM. Our morphological analyzer primarily does inflectional morphology (tense, aspect, plural, negation, comparative, and superlative). We estimate that adding more complex morphology would capture another 10% of the unique words.
The statistics indicate that both collections have the potential to benefit from disambiguation. The mean number of senses for the CACM collection is 4.7 (4.4 once stop words are removed)¹³ and 3.7 senses for the TIME collection (3.6 senses without the stop words). The ambiguity of the words in the
`
¹² These percentages refer to the unique words (word types) in the corpora. The words that were not in the dictionary consist of hyphenated forms, proper nouns, morphological variants not captured by the simple analyzer, and words that are domain specific.
¹³ Stop words are words that are not considered useful for indexing, such as determiners, prepositions, conjunctions, and other closed-class words. They are among the most ambiguous words in the language. See [33] for a list of typical stop words.
`
queries is also important. If those words were unambiguous, then disambiguation would not be needed because the documents would be retrieved based on the senses of the words in the queries. Our results indicate that the words in the queries are even more ambiguous than those in the documents.
`
3.2 Experiment 1 - Word Sense Weighting

Experiments with statistical information retrieval have shown that better performance is achieved by weighting words based on their frequency of use. The most effective weight is usually referred to as TF.IDF, which includes a component based on the frequency of the term in a document (TF) and a component based on the inverse of the frequency within the document collection (IDF) [27]. The intuitive basis for this weighting is that high frequency words are not able to effectively discriminate relevant from nonrelevant documents. The IDF component gives a low weight to these words and increases the weight as the words become more selective. The TF component indicates that once a word appears in a document, its frequency within the document is a reflection of the document's relevance.
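The weighting can be sketched as follows. The exact TF.IDF formula varies across systems, so this is one common variant; the example frequencies are invented.

```python
import math

# One common TF.IDF variant: term frequency times log inverse
# document frequency.
def tf_idf(term_freq, doc_freq, num_docs):
    """Weight a term by TF x IDF."""
    return term_freq * math.log(num_docs / doc_freq)

# A word appearing 5 times but present in 900 of 1000 documents:
common = tf_idf(5, 900, 1000)
# A word appearing once but present in only 3 of 1000 documents:
rare = tf_idf(1, 3, 1000)
print(rare > common)  # → True
```

Even with a lower term frequency, the selective word outweighs the ubiquitous one, which is the behavior the IDF component is designed to produce.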
Words of high frequency also tend to be words with a high number of senses. In fact, the number of senses for a word is approximately the square root of its relative frequency [36].¹⁴ While this correlation may hold in general, it might be violated for particular words in a specific document collection. For example, in the CACM collection the word "computer" occurs very often, but it cannot be considered very ambiguous.
The intuition about the IDF component can be recast in terms of ambiguity: words which are very ambiguous are not able to effectively discriminate relevant from nonrelevant documents. This led to the following hypothesis: weighting words in inverse proportion to their number of senses will give similar retrieval effectiveness to weighting based on inverse collection frequency (IDF). This hypothesis is tested in the first experiment. Using word ambiguity to replace IDF weighting is a relatively crude technique, however, and there are more appropriate ways to include information about word senses in the retrieval model. In particular, the probabilistic retrieval model [10, 15, 33] can be modified to include information about the probabilities of occurrence of word senses. This leads to the second hypothesis tested in this experiment: incorporating information about word senses in a modified probabilistic retrieval model will improve retrieval effectiveness. The methodology and results of these experiments are discussed in the following sections.
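The first hypothesis can be sketched by swapping the IDF factor for the reciprocal of a word's sense count; the sense counts below are invented for illustration, not taken from LDOCE.

```python
# Sketch of sense-based weighting: a term's weight falls as its number
# of dictionary senses rises, mirroring the role of IDF.
SENSE_COUNTS = {"computer": 2, "term": 8, "bank": 10}  # invented counts

def inverse_sense_weight(word, term_freq):
    """Weight = TF x 1 / (number of senses)."""
    senses = SENSE_COUNTS.get(word, 1)
    return term_freq * (1.0 / senses)

# A highly ambiguous word gets a low weight even when frequent:
print(inverse_sense_weight("bank", 4))      # → 0.4
print(inverse_sense_weight("computer", 4))  # → 2.0
```

Note how "computer" keeps a high weight despite its high collection frequency, which is exactly where this scheme and IDF diverge, as the paragraph above anticipates.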
`
3.2.1 Methodology of the Weighting Experiment. In order to understand the methodology of our experiment, we first provide a brief description of how retrieval systems are implemented.
`
¹⁴ It should be noted that this is not the same as "Zipf's law," which states that the log of a word's frequency is proportional to its rank. That is, a small number of words account for most of the occurrences of words in a text, and almost all of the other words in the language occur infrequently.
`
`
Information retrieval systems typically use an inverted file to identify those documents that contain the words mentioned in a query. The inverted file specifies a document identification number for each document in which the word occurs. For each word in the query, the system looks up the document list from the inverted file and enters the document in a hash table; the table is keyed on the document number, and the value is initially 1. If the document was previously entered in the table, the value is simply incremented. The end result is that each entry in the table contains the number of query words that occurred in that document. The table is then sorted to produce a ranked list of documents. Such a ranking is referred to as a "coordination match" and constitutes a baseline strategy. As we mentioned earlier, performance can be improved by making use of the frequencies of the word within the collection and in the specific documents in which it occurs.
`This invo