Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora

David Yarowsky
AT&T Bell Laboratories
600 Mountain Avenue
Murray Hill, NJ 07974
yarowsky@research.att.com

Abstract

This paper describes a program that disambiguates English word senses in unrestricted text using statistical models of the major Roget's Thesaurus categories. Roget's categories serve as approximations of conceptual classes. The categories listed for a word in Roget's index tend to correspond to sense distinctions; thus selecting the most likely category provides a useful level of sense disambiguation. The selection of categories is accomplished by identifying and weighting words that are indicative of each category when seen in context, using a Bayesian theoretical framework. Other statistical approaches have required special corpora or hand-labeled training examples for much of the lexicon. Our use of class models overcomes this knowledge acquisition bottleneck, enabling training on unrestricted monolingual text without human intervention. Applied to the 10 million word Grolier's Encyclopedia, the system correctly disambiguated 92% of the instances of 12 polysemous words that have been previously studied in the literature.

1. Problem Formulation

This paper presents an approach to word sense disambiguation that uses classes of words to derive models useful for disambiguating individual words in context. "Sense" is not a well defined concept; it has been based on subjective and often subtle distinctions in topic, register, dialect, collocation, part of speech and valency. For the purposes of this study, we will define the senses of a word as the categories listed for that word in Roget's International Thesaurus (Fourth Edition - Chapman, 1977).[1] Sense disambiguation will constitute selecting the listed category which is most probable given the surrounding context. This may appear to be a particularly crude approximation, but as shown in the example below and in the table of results, it is surprisingly successful.

    Input                                                         Output
    Treadmills attached to cranes were used to lift heavy         TOOLS
    for supplying power for cranes, hoists, and lifts .SB         TOOLS
    above this height, a tower crane is often used .SB The        TOOLS
    elaborate courtship rituals cranes build a nest of vegetati   ANIMAL
    are more closely related to cranes and rails .SB They ran     ANIMAL
    low trees .PP At least five crane species are in danger of    ANIMAL

Not only do the Roget categories succeed in partitioning the major senses, but the sense tags they provide as output are far more mnemonic than a dictionary numbering such as "crane 1.2". Should such a dictionary sense number be desired as output, Section 5 will outline how a linkage between Roget categories and dictionary definitions can be made.

We will also focus on sense distinctions within a given part of speech. Distinctions between parts of speech should be based on local syntactic evidence. We use a stochastic part-of-speech tagger (Church, 1989) for this purpose, run as a preprocessor.

[1] Note that this edition of Roget's Thesaurus is much more extensive than the 1911 version, though somewhat more difficult to obtain in electronic form. One could use other concept hierarchies, such as WordNet (Miller, 1990) or the LDOCE subject codes (Slator, 1991). All that is necessary is a set of semantic categories and a list of the words in each category.
2. Proposed Method

The strategy proposed here is based on the following three observations:

1) Different conceptual classes of words, such as ANIMALS or MACHINES, tend to appear in recognizably different contexts.
2) Different word senses tend to belong to different conceptual classes (crane can be an ANIMAL or a MACHINE).
3) If one can build a context discriminator for the conceptual classes, one has effectively built a context discriminator for the word senses that are members of those classes. Furthermore, the context indicators for a Roget category (e.g. gear, piston and engine for the category TOOLS/MACHINERY) will also tend to be context indicators for the members of that category (such as the machinery sense of crane).
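To make the data requirements concrete, the sketch below shows the only resources the method assumes: a mapping from each Roget category to its member words, and the inverted index from a word to the categories under which it is listed. The toy thesaurus fragment, function and variable names are hypothetical illustrations, not material from the paper.

```python
# Minimal sketch (not from the paper) of the thesaurus data the method assumes:
# category -> member words, plus the inverted word -> categories index.
from collections import defaultdict

# Hypothetical toy fragment of Roget's International Thesaurus.
CATEGORY_MEMBERS = {
    "TOOLS/MACHINERY (348)": {"crane", "drill", "shovel", "adz", "sickle", "lathe"},
    "ANIMAL,INSECT (414)":   {"crane", "drill", "mole", "slug", "bass", "heron"},
}

def build_word_index(category_members):
    """Invert category -> members into word -> listed categories."""
    index = defaultdict(set)
    for category, members in category_members.items():
        for word in members:
            index[word].add(category)
    return index

WORD_TO_CATEGORIES = build_word_index(CATEGORY_MEMBERS)
print(sorted(WORD_TO_CATEGORIES["crane"]))  # both the machinery and the bird category
```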
We attempt to identify, weight and utilize these indicative words as follows. For each of the 1042 Roget categories:

1. Collect contexts which are representative of the Roget category,
2. Identify salient words in the collective context and determine weights for each word, and
3. Use the resulting weights to predict the appropriate category for a polysemous word occurring in novel text.

2.1 Step 1: Collect Contexts which are Representative of the Roget Category

The goal of this step is to collect a set of words that are typically found in the context of a Roget category. To do this, we extract concordances of 100 surrounding words for each occurrence of each member of the category in the corpus. Below is a sample set of partial concordances for words in the category TOOLS/MACHINERY (348). The complete set contains 30,924 lines, selected from the particular training corpus used in this study, the 10 million word, June 1991 electronic version of Grolier's Encyclopedia.

    CARVING .SB The gutter          adz has a concave blade for form
    uipment such as a hydraulic     shovel capable of lifting 26 cubic
    on .SB Resembling a power       shovel mounted on a floating hul
    uipment, valves for nuclear     generators, oil-refinery turbines
    00 BC, flint-edged wooden       sickles were used to gather wild
    l-penetrating carbide-tipped    drills forced manufacturers to fi
    lt heightens the colors .SB     Drills live in the forests of equa
    traditional ABC method and      drill were unchanged, and dissa
    nter of rotation .PP A tower    crane is an assembly of fabricat
    rshy areas .SB The crowned      crane, however, occasionally

For optimal training, the concordance set should only include references to the given category. But in practice it will unavoidably include spurious examples, since many of the words are polysemous (such as drill and crane in lines 7, 8, and 10 above). While the level of noise introduced through polysemy is substantial, it can usually be tolerated because the spurious senses are distributed through the 1041 other categories, whereas the signal is concentrated in just one. Only if several words had secondary senses in the same category would context typical for the other category appear significant in this context. However, if one of these spurious senses was frequent and dominated the set of examples, the situation could be disastrous. An attempt is made to weight the concordance data to minimize this effect and to make the sample representative of all tools and machinery, not just the more common ones. If a word such as drill occurs k times in the corpus, all words in the context of drill contribute weight 1/k to frequency sums. Despite its flaws, this weighted matrix will serve as a representative, albeit noisy, sample of the typical context of TOOLS/MACHINERY in Grolier's encyclopedia.
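The following sketch illustrates Step 1 under stated assumptions: the corpus is a flat list of (lemmatized) tokens and `members` is the set of words listed under one Roget category. The function name and the token-list representation are illustrative choices rather than the author's code; only the 100-word (±50) concordance window and the 1/k weighting are taken from the text.

```python
# Sketch of Step 1: weighted collection of the words found near category members.
from collections import Counter, defaultdict

def collect_weighted_context(corpus, members, window=50):
    """Weighted frequency of words seen within +/- `window` tokens of any
    category member.  If a member occurs k times in the corpus, each of its
    context words contributes 1/k, so frequent (and possibly polysemous)
    members do not dominate the sample."""
    member_freq = Counter(tok for tok in corpus if tok in members)
    context_weight = defaultdict(float)
    for i, tok in enumerate(corpus):
        if tok not in members:
            continue
        lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                context_weight[corpus[j]] += 1.0 / member_freq[tok]
    return context_weight
```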
2.2 Step 2: Identify salient words in the collective context, and weight appropriately

Intuitively, a salient word[2] is one which appears significantly more often in the context of a category than at other points in the corpus, and hence is a better than average indicator for the category. We formalize this with a mutual-information-like estimate: Pr(w|RCat) / Pr(w), the probability of a word (w) appearing in the context of a Roget category divided by its overall probability in the corpus.

It is important to exercise some care in estimating Pr(w|RCat). In principle, one could simply count the number of times that w appears in the collective context. However, this estimate, which is known as the maximum likelihood estimate (MLE), can be unreliable, especially when w does not appear very often in the collective context. We have smoothed the local estimates of Pr(w|RCat) with global estimates of Pr(w) to obtain a more reliable estimate. Estimates obtained from the local context are subject to measurement errors whereas estimates obtained from the global context are subject to being irrelevant. By interpolating between the two, we attempt to find a compromise between the two sources of error. This procedure is based on recent work pioneered by William Gale, and is explained in detail in another paper (Gale, Church and Yarowsky, 1992). Space does not permit a complete description here.

[2] For illustrative simplicity, we will refer to words in context. In practice, all operations are actually performed on the lemmas of the words (eat/V = eat, eats, eating, ate, eaten), and inflectional distinctions are ignored. While this achieves more concentrated and better estimated statistics, it throws away useful information which may be exploited in future work.
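A minimal sketch of Step 2 follows, building on the weighted counts from the previous sketch. It computes log(Pr(w|RCat)/Pr(w)), but the smoothing shown is only a fixed-weight interpolation between the local and global estimates; the interpolation procedure actually used is the one described in Gale, Church and Yarowsky (1992), which this placeholder does not reproduce.

```python
# Sketch of Step 2: salience weights log( Pr_smoothed(w|RCat) / Pr(w) ).
import math
from collections import Counter

def salience_weights(context_weight, corpus, lam=0.9):
    """Return a weight for every corpus word.  `lam` is a placeholder
    interpolation constant, not the smoothing procedure of the paper."""
    corpus_freq = Counter(corpus)
    n_corpus = sum(corpus_freq.values())
    n_context = sum(context_weight.values())
    weights = {}
    for w, freq in corpus_freq.items():
        p_global = freq / n_corpus
        p_local = context_weight.get(w, 0.0) / n_context
        p_smoothed = lam * p_local + (1 - lam) * p_global  # simple interpolation
        weights[w] = math.log(p_smoothed / p_global)
    return weights
```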
Below are salient words for Roget categories 348 and 414. Those selected are the most important to the models, where importance is defined as the product of salience and local frequency. That is to say, important words are distinctive and frequent. The numbers in parentheses are the log of the salience (log Pr(w|RCat)/Pr(w)), which we will henceforth refer to as the word's weight in the statistical model of the category.

ANIMAL,INSECT (Category 414): species (2.3), family (1.7), bird (2.6), fish (2.4), breed (2.2), cm (2.2), animal (1.7), tail (2.7), egg (2.2), wild (2.6), common (1.3), coat (2.5), female (2.0), inhabit (2.2), eat (2.2), nest (2.5), ...

TOOLS/MACHINERY (Category 348): tool (3.1), machine (2.7), engine (2.6), blade (3.8), cut (2.6), saw (5.1), lever (4.1), pump (3.5), device (2.2), gear (3.5), knife (3.8), wheel (2.8), shaft (3.3), wood (2.0), tooth (2.5), piston (3.6), ...

Notice that these are not a list of members of the category; they are the words which are likely to co-occur with the members of the category. The complete list for TOOLS/MACHINERY includes a broad set of relations, such as meronymy (blade, engine, gear, wheel, shaft, tooth, piston and cylinder), typical functions of machines (cut, rotate, move, turn, pull), typical objects of those actions (wood, metal), as well as typical modifiers for machines (electric, mechanical, pneumatic). The list for a category typically contains over 3000 words, and is far richer than can be derived from a dictionary definition.

2.3 Step 3: Use the resulting weights to predict the appropriate category for a word in novel text

When any of the salient words derived in Step 2 appear in the context of an ambiguous word, there is evidence that the word belongs to the indicated category. If several such words appear, the evidence is compounded. Using Bayes' rule, we sum their weights, over all words in context, and determine the category for which the sum is greatest:[3]

    ARGMAX_{RCat}  SUM_{w in context}  log( Pr(w|RCat) × Pr(RCat) / Pr(w) )

The context is defined to extend 50 words to the left and 50 words to the right of the polysemous word. This range was shown by Gale, Church and Yarowsky (1992) to be useful for this type of broad topic classification, in contrast to the relatively narrow (±3-6 word) window used in previous studies (e.g. Black, 1988). The maximization over RCats is constrained to consider only those categories under which the polysemous word is listed, generally on the order of a half dozen or so.[4]

For example, the word crane appears 74 times in Grolier's; 36 occurrences refer to the animal sense and 38 refer to the heavy machinery sense. The system correctly classified all but one of the machinery senses, yielding 99% overall accuracy. The one misclassified case had a low score for all models, indicating a lack of confidence in any classification.

It is useful to look at one example in some more detail. Consider the following instance of crane and its context of ±10 words:[5]

    lift water and to grind grain .PP Treadmills attached to cranes were used to lift heavy objects from Roman times,

The table below shows the strongest indicators identified for the two categories in the sentence above. The model weights, as noted above, are equivalent to log Pr(w|RCat) / Pr(w).
Several indicators were found for the TOOLS/MACHINERY class. There is very little evidence for the ANIMAL sense of crane, with the possible exception of water. The preponderance of evidence favors the former classification, which happens to be correct. The difference between the two total scores indicates strong confidence in the answer.

    TOOLS/MACH.    Weight     ANIMAL,INSECT    Weight
    lift           2.44       water            0.76
    grain          1.68
    used           1.32
    heavy          1.28
    Treadmills     1.16
    attached       0.58
    grind          0.29
    water          0.11
    TOTAL          11.30      TOTAL            0.76

[3] The reader may have noticed that the Pr(w) factor can be omitted since it will not change the results of the maximization. It is included here for expository convenience so that it is possible to compare results across words with very different probabilities. The factor also becomes important when an incomplete set of indicators is stored because of computational space constraints. Currently we assume a uniform prior probability for each Roget category (Pr(RCat)), i.e. sense classification is based exclusively on contextual information, independent of the underlying probability of a given Roget category appearing at any point in the corpus.

[4] Although it is often useful to restrict the search in this way, the restriction does sometimes lead to trouble, especially when there are gaps in the thesaurus. For example, the category AMUSEMENT (#876) lists a number of card playing terms, but for some reason the word suit is not included in this list. As it happens, the Grolier's Encyclopedia contains 54 instances of the card-playing sense of suit, all of which are mislabeled if the search is limited to just those categories of suit that are listed in Roget's. However, if we open up the search to consider all 1042 categories, then we find that all 54 instances of suit are correctly labeled as AMUSEMENT, and moreover, the score is large in all 54 instances, indicating great confidence in the assignment. It is possible that the unrestricted search mode might be a good way to attempt to fill in omissions in the thesaurus. In any case, when suit is added to the AMUSEMENT category, overall accuracy improves from 68% to 92%.

[5] This narrower window is used for illustrative simplicity.
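A sketch of the Step 3 decision rule, under the assumptions of the earlier sketches (`models[cat]` holds the Step 2 weights and `word_to_categories` is the Roget index): with the uniform prior the paper assumes, the log Pr(RCat) term is constant and can be dropped, so the classifier simply sums the stored weights over the ±50-word context and takes the best-scoring listed category. The function and variable names are illustrative.

```python
# Sketch of Step 3: pick the listed Roget category whose indicators score
# highest in a +/- 50-word window around the target occurrence.
def disambiguate(corpus, position, models, word_to_categories, window=50):
    target = corpus[position]
    candidates = word_to_categories.get(target, set())
    lo, hi = max(0, position - window), min(len(corpus), position + window + 1)
    context = [corpus[j] for j in range(lo, hi) if j != position]
    scores = {
        cat: sum(models[cat].get(w, 0.0) for w in context)
        for cat in candidates
    }
    return max(scores, key=scores.get) if scores else None
```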
TABLE 1

    Sense                 Roget Category        N     Corr.

    STAR (Hirst, 1987: N/A)
      Space Object        UNIVERSE           1422     96%
      Celebrity           ENTERTAINER         222     95%
      Star-Shaped Object  INSIGNIA             56     82%
                                             1700     96%

    MOLE (Hirst, 1987: N/A *)
      Quantity            CHEMICALS            95     98%
      Mammal              ANIMAL,INSECT        46     100%
      Skin Blemish        DISEASE              13     100%
      Digging Machine     SUPPORT               4     100%
                                              160     99%

    GALLEY (Lesk, 1986: 50-70% overall)
      Ancient Ship        SHIP,BOAT            35     97%
      Printer's Tray      PRINTING              5     100%
      Ship's Kitchen      COOKING               2     50%
                                               42     95%

    CONE (Lesk, 1986: 50-70% overall *)
      Part of Tree        PLANT                71     99%
      Shape of Object     ANGULARITY           89     61%
      Part of Eye         VISION               13     69%
                                              173     77%

    BASS (Hearst, 1991: 100%; Speech Synthesis)
      Musical Senses      MUSIC               158     99%
      Fish                ANIMAL,INSECT        69     100%
                                              227     99%

    BOW (Clear, 1989: < 67%; Speech Synthesis)
      Weapon              ARMS                 59     92%
      Front of Ship       SHIP,BOAT            34     94%
      Violin Part         MUSICAL INSTR.       30     100%
      Ribbon              ORNAMENTATION         4     25%
      Bend in Object      CONVEXITY             2     50%
      Lowering Head       RESPECT               0     -
                                              129     91%

    TASTE (Clear, 1989: < 65%)
      Preference          PARTICULARITY       228     93%
      Flavor              SENSATION            80     93%
                                              308     93%

    INTEREST (Black, 1988: 72%; Zernik, 1990: > 70%)
      Curiosity           REASONING           359     88%
      Advantage           INFLUENCE           163     34%
      Financial           DEBT                 59     90%
      Share               PROPERTY             21     38%
                                              602     72%

    ISSUE (Zernik, 1990: < 70%)
      Topic               POLITICS            831     94%
      Periodical          BOOKS,PERIODICALS    28     89%
      Stock               SECURITIES            9     100%
                                              868     94%

    DUTY (Gale et al., 1992: 96%)
      Obligation          DUTY                347     96%
      Tax                 PRICE,FEE            52     96%
                                              399     96%

    SENTENCE (Gale et al., 1992: 90% *)
      Punishment          LEGAL ACTION        128     99%
      Set of Words        GRAMMAR             213     98%
                                              341     98%

    SLUG (Hirst, 1987: N/A *)
      Animal              ANIMAL,INSECT        24     100%
      Type Strip          PRINTING              8     100%
      Mass Unit           WEIGHT                3     100%
      Fake Coin           MONEY                 2     50%
      Metallurgy          IMPULSE,IMPACT        1     100%
      Bullet              ARMS                  1     100%
                                               39     97%

Notes:
1) N refers to the total number of each sense observed in the test corpus. Corr. indicates the percentage of those tagged correctly.
2) Because there is no independent ground truth to indicate which is the "correct" Roget category for a given word, the decision is a subjective judgement made by a single human judge, in this case the author.
3) As previously noted, the Roget index is incomplete. In four cases, identified by *, one missing category has been added to the list of possibilities for a word. These omissions in the lexicon have been identified as outlined in Footnote 4. Without these additions, overall system performance would decrease by 5%.
4) Uses which an English speaker may consider a single sense are often realized by several Roget categories. For the purposes of succinct representation, such categories have been merged, and the name of the dominant category used in the table. As of this writing, the process has not been fully automated. For many applications such as speech synthesis and assignment to an established dictionary sense number or possible French translations, this merging of Roget classes is not necessary. The primary criterion for success is that words are partitioned into pure sense clusters. Words having a different sense from the majority sense of a partition are graded as errors.
5) Examples with the annotation "Speech Synthesis" have multiple pronunciations corresponding to sense distinctions. Their disambiguation is important in speech processing.
6) All results are based on 100% recall.
3. Evaluation

The algorithm described above was applied to 12 polysemous words previously discussed in the sense disambiguation literature. Table 1 (previous page) shows the system's performance. Authors who have discussed these words are listed in parentheses, along with the reported accuracy of their systems. Direct comparison of performance between researchers is difficult, compounded by variances in corpora and grading criteria; using the same words is an attempt to minimize these differences. Regrettably, most authors have reported their results in qualitative terms. The exceptions include Zernik (1990), who cited "recall and precision of over 70%" for one word (interest) and observed that results for other words, including issue, were "less positive." Clear (1989) reported results for two words (65% and 67%), apparently at 85% recall. Lesk (1986) claimed overall "50-70%" accuracies, although it is unclear under which parameters and constraints. In a 5 word test set, Black (1988) observed 75% mean accuracy using his optimal method on high entropy, 4-way sense distinctions. Hearst (1991) achieved 84% on simpler 2-way distinctions, editing out additional senses from the test set. Gale, Church and Yarowsky (1992) reported 92% accuracy, also on 2-way distinctions. Our current work compares favorably with these results, with 92% accuracy on a mean 3-way sense distinction.[6] The performance is especially promising given that no hand tagging or special corpora were required in training, unlike all other systems considered.

[6] This result is a fair measure of performance on words used in previous studies, and may be useful for comparison across systems. However, as words previously discussed in the literature may not be representative of typical English polysemy, mean performance on a completely random set of words should differ.

4. Limitations of the Method

The procedure described here is based on broad context models. It performs best on words with senses which can be distinguished by their broad context. These are most typically concrete nouns. Performance is weaker on the following:

Topic Independent Distinctions: One of the reasons that interest is disambiguated poorly is that it can appear in almost any context. While its "curiosity" sense is often indicated by the presence of an academic subject or hobby, the "advantage" sense (to be in one's interests) has few topic constraints. Distinguishing between two such abstractions is difficult.[7] However, the financial sense of interest is readily identifiable, and can be distinguished from the non-financial uses with 92% accuracy. Other distinctions between topic independent and topic constrained senses appear successful as well (e.g. taste, issue, duty and sentence).

[7] Black (1988) has noted that this distinction for interest is strongly correlated with the plurality of the word, a feature we currently don't utilize.

Minor Sense Distinctions within a Category: Distinctions between the medicinal and narcotic senses of drug are not captured by the system because they both belong to the same Roget category (REMEDY). Similar problems occur with the musical senses of bass. Roget's Thesaurus offers a rich sub-hierarchy within each category, however. Future implementations will likely use this information, which is currently ignored.

Verbs: Verbs have not been considered in this particular study, and it appears that they may benefit from more local models of their typical arguments.
The unmodified system does seem to perform well on verbs which show clear topic distinctions, such as fire. Its weapon, engine, furnace, employee, imagination and pottery senses have been disambiguated with 85% accuracy.

Pre-Nominal Modifiers: The disambiguation of pre-nominal modifiers (adjectives and compound nominals) is heavily dependent on the noun modified, and much less so on distant context. While class-based Bayesian discrimination may be useful here as well, the optimal window size is much narrower.

Idioms: These broad context, topic-based discriminators are also less successful in dealing with a word like hand, which is usually found in fixed expressions such as on the other hand and close at hand. These fixed expressions have more function than content, and therefore they do not lend themselves to a method that depends on differences in content. The situation is far from hopeless, as many idioms are listed directly in Roget's Thesaurus and can be associated with a category through simple table lookup. Other research, such as Smadja and McKeown (1990), has shown more general ways of identifying and handling these fixed expressions and collocations.

Given the broad set of issues involved in sense disambiguation, it is reasonable to use several specialized tools in cooperation. We already handle part of speech distinctions through other methods; an efficient idiom recognizer would be an appropriate addition as well.

5. Linking Roget Categories with other Sense Representations

The Roget category names tend to be highly mnemonic and may well suffice as sense tags. However, one may want to link the Roget tags with an established reference such as the sense numbers one finds in a dictionary. We accomplish this by applying the models described above to the text of the definitions in a dictionary, creating a table of correspondences between Roget categories and
sense numbers. Results for the word crane are illustrated below for two dictionaries: (1) COBUILD (Sinclair, 1987), and (2) Collins English Dictionary, First Edition (CED1) (Hanks, 1979).

    RCat     Sense #      Definition
    TOOLS    crane 1.1    a machine with a long movable
    ANIMAL   crane 1.2    large bird with a long neck and
    ANIMAL   crane 1      any large long-necked long-leg
    ANIMAL   crane 2      any similar bird, such as a her
    TOOLS    crane 3      a device for lifting and moving
    TOOLS    crane 4      a large trolley carrying a boom

It may also be possible to link Roget category tags with "natural" sense tags, such as translations in a foreign language. We use a word-aligned parallel bilingual corpus such as the French-English Canadian Hansards for this purpose. For example, consider the polysemous word duty, which can be translated into French as devoir or droit, depending on the sense (obligation or tax, respectively). When the Grolier-trained models are applied to the English side of the Hansards, the words tagged PRICE,FEE most commonly aligned with the French words droits (256), droit (96) and douane (67). Words labeled DUTY (the Roget category for Obligation) most frequently aligned with devoir (205). These correlations may have useful implications for machine translation and bilingual lexicography.
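The dictionary linkage described above can be sketched as follows, reusing the earlier hypothetical structures. The dictionary is represented here as a plain list of (sense number, definition text) pairs, an illustrative simplification rather than the actual COBUILD or CED1 format, and the scoring simply sums the Step 2 weights over the definition's tokens.

```python
# Sketch of the Section 5 linkage: map each numbered dictionary sense of a
# headword to the Roget category whose indicators best match the definition.
def link_senses(headword, senses, models, word_to_categories):
    table = {}
    for sense_number, definition in senses:
        tokens = definition.lower().split()
        candidates = word_to_categories.get(headword, set())
        best = max(
            candidates,
            key=lambda cat: sum(models[cat].get(w, 0.0) for w in tokens),
            default=None,
        )
        table[sense_number] = best
    return table

# e.g. link_senses("crane",
#                  [("1.1", "a machine with a long movable arm"),
#                   ("1.2", "a large bird with a long neck")],
#                  models, WORD_TO_CATEGORIES)
```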
6. Other Sense Disambiguation Methods: The Knowledge Acquisition Bottleneck

Word sense disambiguation is a long-standing problem in computational linguistics (Kaplan, 1950; Yngve, 1955; Bar-Hillel, 1960), with important implications for a variety of practical applications including speech synthesis, information retrieval, and machine translation. Most approaches may be characterized by the following generalizations: 1) They tend to focus on the search for sets of word-specific features or indicators (typically words in context) which can disambiguate the senses of a word. 2) Efforts to acquire these indicators have faced a knowledge acquisition bottleneck, characterized by either substantial human involvement for each word and/or incomplete vocabulary coverage.

The AI community has enjoyed some success hand-coding detailed "word experts" (Small and Rieger, 1982; Hirst, 1987), but this labor intensive process has severely limited coverage beyond small vocabularies. Others, such as Lesk (1986), Walker (1987), Veronis and Ide (1990), and Guthrie et al. (1991), have turned to machine readable dictionaries (MRDs) in an effort to achieve broad vocabulary coverage. MRDs have the useful property that some indicative words for each sense are directly available in numbered definitions and examples. However, definitions are often too short to provide an adequate set of indicators, and those words which are found lack significance weights to identify which are crucial and which are merely chaff. Dictionaries provide well structured but incomplete information.

Recently, many have turned to text corpora to broaden the range and volume of available examples. Unlike dictionaries, however, raw corpora do not indicate which sense of a word occurs at a given instance. Several researchers (Kelly and Stone, 1975; Black, 1988) have overcome this through hand tagging of training examples, and were able to discover useful discriminatory patterns from the partitioned contexts. This also has proved labor intensive. Others (Weiss, 1973; Zernik, 1990; Hearst, 1991) have attempted to partially automate the hand-tagging process through bootstrapping. Yet this has still required significant human intervention for each word in the vocabulary.

Brown et al. (1991), Dagan (1991), and Gale et al. (1992) have looked to parallel bilingual corpora to further automate training set acquisition. By identifying word correspondences in a bilingual text such as the Canadian Parliamentary Proceedings (Hansards), the translations found for each English word may serve as sense tags. For example, the senses of sentence may be identified through their correspondence in the French to phrase (grammatical sentence) or peine (legal sentence). While this method has been used successfully on a portion of the vocabulary, its coverage is also limited. Currently available bilingual corpora lack size or diversity: over half of the words considered in this study either never appear in the Hansards or lack examples of secondary senses. More fundamentally, many words are mutually ambiguous across languages. French would be of little use in disambiguating the word interest, as all major senses translate as intérêt. More promising is a non-Indo-European language such as Japanese, which should avoid such mutual ambiguity for etymological reasons. Until more diverse, large bilingual corpora become available, the coverage of these methods will remain limited.

Each of these approaches has faced a fundamental obstacle: word sense is an abstract concept that is not identified in natural text. Hence any system which hopes to acquire discriminators for specific senses of a word will need to isolate samples of those senses. While this process has been partially automated, it appears to require substantial human intervention to handle an unrestricted vocabulary.

7. Conclusion

This paper has described an approach to word sense disambiguation using statistical models of word classes. This method overcomes the knowledge acquisition bottleneck faced by word-specific sense discriminators. By entirely circumventing the issue of polysemy
resolution in training material acquisition, the system has acquired an extensive set of sense discriminators from unrestricted monolingual text without human intervention. Class models also offer the additional advantages of smaller model storage requirements and increased implementation efficiency due to reduced dimensionality. Also, they can correctly identify a word sense which occurs rarely or only once in the corpus -- performance unattainable by statistically trained word-specific models. These advances are not without cost, as class-based models have diluted discriminating power and may not capture highly indicative collocations specific to only one word. Despite the inherent handicaps, the system performs better than several previous approaches, based on a direct comparison of results for the same words.

8. Acknowledgements

Special thanks are due to Ken Church and Barbara Grosz for their invaluable help in restructuring this paper, and to Bill Gale for the theoretical foundations on which this work rests. The author is also grateful to Marti Hearst, Fernando Pereira, Donald Hindle, Richard Sproat, and Michael Riley for their comments and suggestions.

References

Bar-Hillel (1960), "Automatic Translation of Languages," in Advances in Computers, Donald Booth and R. E. Meagher, eds., Academic, New York.

Black, Ezra (1988), "An Experiment in Computational Discrimination of English Word Senses," IBM Journal of Research and Development, v 32, pp 185-194.

Brown, Peter, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer (1991), "Word Sense Disambiguation Using Statistical Methods," Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp 264-270.

Brown, Peter, Vincent Della Pietra, Peter deSouza, and Robert Mercer (1990), "Class-based n-gram Models of Natural Language," Proceedings of the IBM Natural Language ITL, Paris, France, pp 283-298.

Chapman, Robert (1977), Roget's International Thesaurus (Fourth Edition), Harper and Row, New York.

Choueka, Yaacov, and Serge Lusignan (1985), "Disambiguation by Short Contexts," Computers and the Humanities, v 19, pp 147-158.

Church, Kenneth (1989), "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text," Proceedings, IEEE International Conference on Acoustics, Speech and Signal Processing, Glasgow.

Clear, Jeremy (1989), "An Experiment in Automatic Word Sense Identification," Internal Document, Oxford University Press, Oxford.

Cottrell, Garrison (1989), A Connectionist Approach to Word Sense Disambiguation, Pitman, London.

Dagan, Ido, Alon Itai, and Ulrike Schwall (1991), "Two Languages are More Informative than One," Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp 130-137.

Gale, William, Kenneth Church, and David Yarowsky (1992), "Discrimination Decisions for 100,000-Dimensional Spaces," AT&T Statistical Research Report No. 103.

Gale, William, Kenneth Church, and David Yarowsky (1992), "A Method for Disambiguating Word Senses in a Large Corpus," to appear in Computers and the Humanities.

Granger, Richard (1977), "FOUL-UP: A Program that Figures Out Meanings of Words from Context," IJCAI-77, pp 172-178.

Guthrie, J., L. Guthrie, Y. Wilks, and H. Aidinejad (1991), "Subject-Dependent Co-occurrence and Word Sense Disambiguation," Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp 146-152.

Hanks, Patrick (ed.) (1979), Collins English Dictionary, Collins, London and Glasgow.

Hearst, Marti (1991), "Noun Homograph Disambiguation Using Local Context in Large Text Corpora," Using Corpora, University of Waterloo, Waterloo, Ontario.