Church et al.

US005541836A

[11] Patent Number: 5,541,836
[45] Date of Patent: Jul. 30, 1996
`
[54] WORD DISAMBIGUATION APPARATUS AND METHODS

[75] Inventors: Kenneth W. Church, Chatham; William A. Gale, Maplewood; David E. Yarowsky, Summit, all of N.J.

[73] Assignee: AT&T Corp., Murray Hill, N.J.

[21] Appl. No.: 814,850

[22] Filed: Dec. 30, 1991

[51] Int. Cl.6 ..................................... G06F 19/00
[52] U.S. Cl. ...................................... 364/419.07
[58] Field of Search ............... 364/419, 419.07, 364/419.08, 419.12, 419.13
`[56]
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`4,661,924
`4,868,750
`4,914,590
`4,930,077
`4,942,526
`5,056,021
`5,088,038
`5,109,509
`5,128,865
`5,146,405
`5,170,349
`5,237,503
`5,243,520
`5,317,510
`
`4/1987 Okamoto et al ................... 364/DIG. 2
`9/1989 Kucera et al ........................... 364/419
`4/1990 Loatman et al ......................... 364/419
`5/1990 Fan .......................................... 364/419
`7/1990 Okojirna et al ......................... 364/419
`10/1991 Ausborn .................................. 364/419
`2/1992 Tanaka et al. ..................... 364/419 .08
`4/1992 Katayama et al ....................... 395/600
`7/1992 Sadler ..................................... 364/419
`9/1992 Church .................................... 364/419
`12/1992 Yagisawa et al ....................... 364/419
`8/1993 Bedecarrax et al ................ 364/419.08
`9/1993 Jacobs et al ....................... 364/419.08
`3/1994 Yoshimura et al ................ 364/419.08
`
`OTHER PUBLICATIONS
`
IBM Technical Disclosure Bulletin, vol. 33, No. 1B, Jun. 1990, Armonk, US, pp. 54-55, "Method for Inferring Lexical Associations from Textual Co-Occurrences".

E. Black, "An Experiment in Computational Discrimination of English Word Senses", IBM Journal of Research and Development, vol. 32, No. 2, Mar. 1988, Armonk, NY, USA.

"Word Sense Disambiguation Using an Untagged Corpus", IBM Technical Disclosure Bulletin, vol. 35, No. 6, Nov. 1992.

J. D. Benson and B. Brainerd, "Chesterton's Parodies of Swinburne and Yeats: A Lexical Approach", Literary and Linguistic Computing, vol. 3, No. 4, 1988.

R. Krovetz and W. Bruce Croft, "Word Sense Disambiguation Using Machine-Readable Dictionaries", Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Jun. 1989, Cambridge.
`
`Primary Examiner-Donald E. McElheny, Jr.
`Attorney, Agent, or Firm-Gordon E. Nelson; Jeffrey M.
`Weinick
`
`[57]
`
`ABSTRACT
`
Apparatus and methods for determining whether a word/sense pair is proper for a context. Wide contexts (100 words) are employed for both training and testing, and testing is done by adding the weights of vocabulary words from the context. The weights are determined by Bayesian techniques which interpolate between the probability of occurrence of a vocabulary word in a conditional sample of the training text and the probability of its occurrence in the entire training text. A further improvement in testing takes advantage of the fact that a word is generally used in only a single sense in a single discourse. Also disclosed are automated training techniques including training on bilingual bodies of text and training using categories from Roget's Thesaurus.
`
`37 Claims, 2 Drawing Sheets
`
`101
`
`111
`
`TEXT
`
`C115
`
`P116
`
`I
`I
`
`I
`I
`
`c:::J,
`
`WSPPT
`107
`
`108{a)
`121
`
`108(n)
`121
`
`1 - - - -~ - - ;
`
`TW
`WSPPTE
`
`119
`
`113
`
`W119
`
`CW113
`
`103
`
`SLC
`
`WSPL
`117
`
`
`
`
[Drawing Sheet 1 (U.S. Patent, Jul. 30, 1996, Sheet 1 of 2, 5,541,836): FIG. 1, block diagram of apparatus 101: text 111 with context C 115, position P 116, and current word CW 113; word/sense pair WSP 105; word/sense pair probability table WSPPT 107 with subtables 108(a)-108(n), text words TW 121, entries WSPPTE, and weights W 119; sense likelihood calculator SLC 103; output word/sense pair likelihood WSPL 117. FIG. 2 begins: table of model data, elements 201 and 203.]
`
`201
`
`Table 6: Selected Portions of Two Models
`11Xsemeofdllly
`wcight*freq weight
`freq wonl
`so
`285
`5.7
`counlerVailing
`111.8
`4.3
`duties
`26
`99.9
`2.7 201137
`ILS '-209
`1.7
`43
`73.1
`tnde
`1.8
`39
`70.21_
`11.MeS
`69.3 204 3.31
`duty
`21
`90ftwood
`3.6 205 19
`68.4
`68.4
`36
`united
`1.9
`58.8
`8.4
`l'CICinds
`7
`S4
`lumber
`3
`18
`S0.4
`shingles
`4.2
`12
`50.4
`4.2
`shakes
`12
`46.8
`3.6
`13
`35
`22
`46.2
`2.1
`against
`canadian
`41.8
`1.1
`38
`
`2025
`
`obligalion sense of d,uy
`weigbt•freq weight
`freq word
`64
`petitions
`3.2
`20
`59.28
`228
`0.26
`to
`I
`0.42
`134
`56.28
`17
`petition
`51
`3
`17
`47.6
`2.8
`pursuant
`89 mr
`0.52
`46.28
`honour
`2.7
`14
`37.8
`1.4
`27
`order
`37.8
`present
`36
`2
`18
`12
`proceedings
`2.8
`33.6
`9
`prescription
`31.S
`3.S
`36
`house
`0.87
`31.32
`9
`reject
`3.3
`29.7
`boundaries
`4.2
`7
`29.4
`electoral
`4.1
`28.7
`7
`
`)
`203
`
`
`
`
`0\
`~
`~ 00
`i,,,,,,-
`.a;;:;...
`~ Ul
`Ul
`
`N
`s,
`N
`~
`Cl:l =(cid:173)~
`
`~ ~=
`i
`
`0'I
`1,0
`1-0
`1-1,
`
`~ = ~ = ~
`
`•
`• r.,;.
`Cj
`
`(2.0), animal (1.7), family (!.7), common (1.3), ...
`fish (2.4), species (2.3), egg (2.2), inhabit (2.2), breed (2.2), cm (2.2), eat (2.2). female
`ANIMAL.INSECT (Calegory 414): l8il (2.7), bird (2.6), wild (2.6), coat (2.5). nest (2.5).
`
`[
`
`405
`
`engine (2.6), cut (2.6), loodl (2.5), device (2.2), wood (2.0),. ..
`pump (3.5), gear (3.5), piston(3.6), shaft(3.3), tool (3.1), wheel (28), machine (2.7).
`TOOLS/MACHINERY (Category 348): saw (5.1), lever (4.1), blade (3.8). knife (3.8),
`
`[
`
`403
`
`(Wll9
`
`.
`
`401
`
`FIG. 4
`
`In marshy areas .SB The crowned craM • however. occasionally nests i ~ 305
`
`s cmter of rotation .PP A tower craM is an assembly of fabricated ste
`the traditional ABC mclhod and drill were unchanged , and dissatisfac
`I itcment heightens the colors .SB Drills live in die forests of equatorial ~ 303
`steel-penellating carbide-lipped drills forced manufactmers to find sir
`to 8000 BC • flint-edged wooden sickles were used to gather wild grain
`ic equipment , valves for nuclear generators • oil-ldinery turbines • and
`ommon .SB Resembling a power shovel mounted on a floating hull • th
`le equipment such as a hydraulic shovel capable of lifting 26 cubic me
`000 CARVING .SB The gutter adz has a concave blade for forming
`
`301
`
`FIG. 3
`
`---
`
`.
`
`.
`
`. ... --
`
`.
`
`-----------·-·····
`
`-
`
`.
`
`. ---
`
`------
`
`---·
`
`
`
`
`5,541,836
`
`2
`
`1
`WORD DISAMBIGUATION APPARATUS AND
`METHODS
`
`BACKGROUND OF THE INVENTION
`
`5
`
`1. Qualitative Methods, e.g., Hirst (1987)
`2. Dictionary-based Methods, e.g., Lesk (1986)
`3. Corpus-based Methods, e.g., Kelly and Stone (1975)
`In each case, the work has been limited by a knowledge
`acquisition bottleneck. For example, there has been a tradi(cid:173)
`tion in parts of the AI community of building large experts
`by hand, e.g., Granger (1977), Rieger (1977), Small and
`Rieger (1982), Hirst (1987). Unfortunately, this approach is
`10 not very easy to scale up, as many researchers have
`observed:
`"The expert for THROW is currently six pages long, ...
`but it should be 10 times that size." (Small and Reiger,
`198X)
`15 Since this approach is so difficult to scale up, much of the
`work has had to focus on "toy" domains (e.g., Winograd's
`Blocks World) or sublanguages (e.g., Isabelle (1984), Hir(cid:173)
`schman (1986)). Currently, it is not possible to find a
`semantic network with the kind of broad coverage that
`20 would be required for unrestricted text.
`Others such as Lesk (1986), Walker (1987), Ide (1990,
`Waterloo Meeting) have turned to machine-readable dictio(cid:173)
`narys (MRD) such as Oxford's Advanced Learner's Dictio(cid:173)
`nary of Current English (OALDCE) in the hope that MRDs
`25 might provide a way out of the knowledge acquisition
`bottleneck. These researchers seek to develop a program that
`could read an arbitrary text and tag each word in the text
`with a pointer to a particular sense number in a particular
`dictionary. Thus, for example, if Lesk' s program was given
`the phrase pine cone, it ought to tag pine with a pointer to
`the first sense under pine in OALDCE (a kind of evergreen
`tree), and it ought to tag cone with a pointer to the third sense
`under cone in OALDCE (fruit of certain evergreen trees).
`Lesk's program accomplishes this task by looking for over-
`laps between the words in the definition and words in the
`text "near" the ambiguous word.
`Unfortunately, the approach doesn't seem to work as well
`as one might hope. Lesk (1986) reports accuracies of
`50-70% on short samples of Pride and Prejudice. Part of the
`40 problem may be that dictionary definitions are too short to
`mention all of the collocations (words that are often found
`in the context of a particular sense of a polysemous word).
`In addition, dictionaries have much less coverage than one
`might have expected. Walker (1987) reports that perhaps
`45 half of the words occurring in a new text cannot be related
`to a dictionary entry.
`Thus, like the AI approach, the dictionary-based approach
`is also limited by the knowledge acquisition bottleneck;
`dictionaries simply don't record enough of the relevant
`information, and much of the information that is stored in
`the dictionary is not in a format that computers can easily
`digest, at least at present.
`A third line of research makes use of hand-annotated
`corpora. Most of these studies are limited by the availability
`of hand-annotated text. Since it is unlikely that such text will
`be available in large quantities for most of the polysemous
`words in the vocabulary, there are serious questions about
`how such an approach could be scaled up to handle unre(cid:173)
`stricted text. Kelly and Stone (1975) built 1815 disambigu(cid:173)
`ation models by hand, selecting words with a frequency of
`at least 20 in a half million word corpus. They started from
`key word in context (KWIC) concordances for each word,
`and used these to establish the senses they perceived as
`useful for content analysis. The models consisted of an
`ordered set of rules, each giving a sufficient condition for
`deciding on one classification, or for jumping to another rule
`in the same model, or for jumping to a rule in the model for
`
`30
`
`35
`
`1. Field of the Invention
`The invention relates to computerized text analysis gen(cid:173)
`erally and more specifically to the problem of determining
`whether a given word-sense pair is proper for a given
`context.
`2. Description of the Prior Art
`Machine translation of natural language texts has long
`been a goal of researchers in computer science and linguis(cid:173)
`tics. A major barrier to high-quality machine translation has
`been the difficulty of disambiguating words. Word disam(cid:173)
`biguation is necessary because many words in any natural
`language have more than one sense. For example, the
`English noun sentence has two senses in common usage: one
`relating to grammar, where a sentence is a part of a text or
`speech, and one relating to punishment, where a sentence is
`a punishment imposed for a crime. Human beings use the
`context in which the word appears and their general knowl(cid:173)
`edge of the world to determine which sense is meant, and
`consequently do not even have trouble with texts such as:
`The teacher gave the student the sentence of writing the
`sentence "I will not throw spit wads" 100 times.
`Computers, however, have no general knowledge of the
`world, and consequently, have had a great deal of trouble
`translating sentences such as the above into languages such
`as French, where the word used to translate sentence when
`it is employed in the grammatical sense is phrase and the
`word used to translate sentence when it is employed in the
`sense of punishment is peine.
`The ability to determine a probable sense of a word from
`the context in which the word is used important in other
`areas of text analysis as well. For example, optical character
`recognition systems and speech recognition systems often
`can only resolve a printed or spoken word into a small set of
`possibilities; one way of making a choice among the words
`in the small set is to determine which word has a sense which
`best fits the context. Other examples in this area are deter(cid:173)
`mining whether characters such as accents or umlauts should
`be present on a word or whether the word should be
`capitalized. Additionally, there are text editing tools such as
`spelling checkers or interactive thesauri which present the
`user with a set of suggested alternatives for a word. These
`tools, too, are improved if the set of alternatives is limited to
`words whose senses fit the context.
`Another area of text analysis that will benefit from good
`techniques for determining the probable sense of a word
`from its context is data base searching. Word searches in
`data bases work by simply matching a search term with an
`occurrence of the term in the data base, without regard to the
`sense in which the term is used in the data base. The only 55
`way to restrict a search to a given sense of a term is to
`provide other search terms which the searcher expects to
`find in conjunction with the first search term. Such a search
`strategy will, however, miss occurrences of the first term
`where the first term has the proper sense but is not found in 60
`conjunction with the other search terms. Given a useful way
`of determining what sense of a word best fits a context, it
`will be possible to search by specifying not only the search
`term, but also the sense in which it is being used.
`Past researchers have used
`three different general 65
`approaches to the word disambiguation problem sketched
`above:
`
`50
`
`Page 4 of 13
`
`
`
`5,541,836
`
`3
`another word. The conditions of a given rule could refer to
`the context within four words of the target word. They could
`test the morphology of the target word, an exact context
`word, or the part of speech or semantic class of any of the
`context words. The sixteen semantic classes were assigned 5
`by hand.
`Most subsequent work has sought automatic methods
`because it is quite labor intensive to construct these rules by
`hand. Weiss (1973) first built rule sets by hand for five
`words, then developed automatic procedures for building 10
`similar rule sets, which he applied to an additional three
`words. Unfortunately, the system was tested on the training
`set, so it is difficult to know how well it actually worked.
`Black (1987, 1988) studied five 4-way polysemous words
`using about 2000 hand tagged concordance lines for each
`word. Using 1500 training examples for each word, his
`program constructed decision trees based on the presence or
`absence of 81 "contextual categories" within the context of
`the ambiguous word. He used three different types of
`contextual categories: (1) subject categories from LDOCE, 20
`the Longman Dictionary of Contemporary English (Long(cid:173)
`man, 1978), (2) the 41 vocabulary items occurring most
`frequently within two words of the ambiguous word, and (3)
`the 40 vocabulary items excluding function words occurring
`most frequently in the concordance line. Black found that 25
`the dictionary categories produced the weakest performance
`(47 percent correct), while the other two were quite close at
`72 and 75 percent correct, respectively.
`There has recently been a flurry of interest in approaches
`based on hand-annotated corpora. Hearst (1991) is a very 30
`recent example of an approach somewhat like Black (1987,
`1988), Weiss (1973) and Kelly and Stone (1975), in this
`respect, though she makes use of considerably more syn(cid:173)
`tactic information than the others did. Her performance also
`seems to be somewhat better than the others', though it is 35
`difficult to compare performance across systems.
`As may be seen from the foregoing, the lack of suitable
`techniques for determining which word-sense pair best fits a
`given context has been a serious hindrance in many areas of
`text analysis. It is an object of the apparatus and methods 40
`disclosed herein to provide such techniques.
`
`4
`making a first determination of the sense of the given
`occurrence of the word; and
`making a final determination of the sense of the given
`occurrence of the word by comparing the first deter(cid:173)
`mination with a determination of the sense of a neigh(cid:173)
`boring occurrence of the word.
`The foregoing and other objects, aspects, and advantages
`of the invention will be apparent to one of ordinary skill in
`the art who peruses the following Drawing and Detailed
`Description, wherein:
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is a block diagram of apparatus for determining the
`15 probability that a word/sense pair is proper for a context;
`FIG. 2 is a table of data from which table 107 of FIG. 1
`may be constructed;
`FIG. 3 is an example of part of a conditional sample; and
`FIG. 4 is an example of weights computed using Roget's
`categories.
`The reference numbers employed in the Drawing and the
`Detailed Description have three or more digits. The two least
`significant digits are a number within a figure; the remaining
`digits are the figure number. Thus, the element with the
`reference number "305" is first shown in FIG. 3.
`
`DETAILED DESCRIPTION
`
`The following Detailed Description will first provide an
`overview of the theoretical approach to the disambiguation
`·problem in a preferred embodiment, will then describe
`apparatus for solving the disambiguation problem, and will
`finally discuss how the apparatus for solving the disambigu(cid:173)
`ation problem is trained.
`
`BAYESIAN DISAMBIGUATION TECHNIQUES
`
`SUMMARY OF THE INVENTION
`
`50
`
`In one aspect, the invention is a method of automatically 45
`determining that a word/sense pair has a sense which suits
`a given position in a text. The method includes the steps of:
`determining a sequence of words in the text which
`includes the given position and is substantially longer
`than a single line of the text; and
`determining whether the word/sense pair has the suitable
`sense by automatically analyzing the sequence.
`In another aspect, the invention is a method of automatically
`determining a probability that a word/sense pair has a sense
`which suits a given position in a text. The method includes
`the steps of:
`determining a sequence of words in the text which
`includes the given position; and
`automatically employing a Bayesian discrimination tech(cid:173)
`nique involving the words in the sequence and the
`sense of the word-sense pair to determine the probabil(cid:173)
`ity that the word/sense pair has a sense which suits the
`given position.
`In still another aspect, the invention is a method of auto(cid:173)
`matically determining whether a given occurrence of a word
`in a text has a given sense. The method includes the steps of:
`
`65
`
`The word-sense disambiguation problem is a discrimina-
`tion problem, not very different from problems such as
`author identification and information retrieval. In author
`identification and information retrieval, it is customary to
`split the problem up into a testing phase and a training phase.
`During the training phase, we are given two (or more) sets
`of documents and are asked to construct a discriminator
`which can distinguish between the two (or more) classes of
`documents. These discriminators are then applied to new
`documents during the testing phase. In the author identifi-
`cation task, for example, the training set consists of several
`documents written by each of the two ( or more) authors. The
`resulting discriminator is then tested on documents whose
`authorship is disputed. In the information retrieval applica(cid:173)
`tion, the training set consists of a set of one or more relevant
`55 documents and a set of zero or more irrelevant documents.
`The resulting discriminator is then applied to all documents
`in the library in order to separate the more relevant ones
`from the less relevant ones. In the sense disambiguation
`case, the 100-word context surrounding instances of a poly-
`60 semous word (e.g., duty) are treated very much like a
`document.
`It is natural to take a Bayesian approach to these discrimi(cid:173)
`nation problems. Mosteller and Wallace (1964, section 3.1)
`used the following formula to combine new evidence (e.g.,
`the term by document matrix) with prior evidence (e.g., the
`historical record) in their classic authors hip study of the
`Federalist Papers.
`
`
`
`
`5,541,836
`
`5
`
`P(class 1) _ p(class 1) x L(class 1)
`P(class 2)
`- p(class 2)
`L(class 2)
`
`where P represents a final probability, p represents an initial
`probability, andLrepresents a likelihood. Similar equations
`can be found in textbooks on information retrieval (e.g.,
`Salton (1989), equation 10.17).
`The initial odds depend on the problem. In the author
`identification problem, for example, the initial odds are used
`to model what we know about the documents from the
`various conflicting historical ·records. In the information
`retrieval application, the user may have a guess about the
`fraction of the library that he or she would expect to be
`relevant; such a guess could be used as the prior. It is often
`the case that the prior probability will not have very much
`influence on the outcome, which is fortunate, since the prior
`can sometimes be difficult to estimate.
`It is common practice to decompose the likelihoods into
`a product of likelihoods over tokens (occurrences of words)
`in the document (under appropriate independence assump(cid:173)
`tions):
`
`L(class 1)
`L(class 2)
`
`n
`Pr(tok class 1)
`tok in doc Pr(tok class 2)
`
`The crucial ingredients for this calculation are the prob(cid:173)
`abilities of each term in the vocabulary conditional on the
`document being from a given class. These conditional
`probabilities have been computed in a number of different
`ways depending on the application and the study.
`For two senses, the Bayesian equation mentioned above
`becomes:
`
`P(sense 1)
`P(sense 2)
`
`p(sense 1)
`p(sense 2)
`
`x L(sense 1)
`L(sense 2)
`
`where p, P and L are the initial probability, the final
`probability and likelihood, as before. The initial probabili(cid:173)
`ties are determined from the overall probabilities of the two
`senses in the corpus. As in other large dimension discrimi(cid:173)
`nation problems, the likelihoods are decomposed into a
`product over tokens:
`
`L(sense 1)
`L{sense 2)
`
`n
`tok in context
`
`Pr(tok class 1)
`Pr(tok class 2)
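In practice a product of ratios like this is computed as a sum of logarithms (as noted below, the preferred embodiment outputs the log of the likelihood). The following is a minimal sketch of the two-sense calculation; the function name, probability tables, and numbers are illustrative assumptions, not trained values from the patent.

```python
import math

def log_odds(context_tokens, prior, cond_prob_s1, cond_prob_s2, default=1e-6):
    """Log of P(sense 1)/P(sense 2): the log prior odds plus a sum of
    per-token log likelihood ratios, as in the equations above.
    Unseen tokens get a small default probability (an assumption here)."""
    score = math.log(prior[0] / prior[1])
    for tok in context_tokens:
        p1 = cond_prob_s1.get(tok, default)  # Pr(tok | sense 1)
        p2 = cond_prob_s2.get(tok, default)  # Pr(tok | sense 2)
        score += math.log(p1 / p2)
    return score

# Toy example: the punishment vs. grammar senses of "sentence".
p_punish = {"judge": 0.01, "trial": 0.008, "verb": 0.0001}
p_grammar = {"judge": 0.0001, "trial": 0.0001, "verb": 0.01}
odds = log_odds(["judge", "trial"], (0.5, 0.5), p_punish, p_grammar)
# odds > 0: the context favors sense 1 (punishment)
```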
`
Apparatus for Use in Word-sense Disambiguation: FIG. 1

FIG. 1 shows apparatus 101 which determines the likelihood that a word/sense pair has the sense required by a given context. In a preferred embodiment, the output of apparatus 101 is the logarithm of the likelihood. Log word/sense pair likelihood (WSPL) 117 is computed by sense likelihood calculator 103. Inputs to sense likelihood calculator 103 come from text 111, word/sense pair probability table 107, and word/sense pair 105. Word/sense pair probability table 107 is a table which contains a subtable 108 for each word/sense pair which is of interest. Each subtable 108 contains entries (WSPPTE) 123 for at least all of the words in text 111 which give useful indications whether the word/sense pair 105 to which subtable 108 corresponds has a sense suitable for position 116. Each entry for a text word 121 contains an indication of the weight 119 of that word for determining whether word/sense pair 105 has a suitable sense.

When sense likelihood calculator 103 is calculating the likelihood that a given word/sense pair 105 is suitable for a position P 116 in text 111, sense likelihood calculator 103
begins reading words of text 111 from 50 words ahead of position 116 and continues reading 50 words following position 116. The 100 words which contain position 116 are position 116's context 115. An important aspect of the preferred embodiment is that there are substantially more words in context 115 than there are in a line of text 111, which is presumed to contain about 10 words. For each current word (CW) 113 read from context 115, calculator 103 determines whether the word has an entry 123 in table 107, and if it does, it adds the weight specified in the entry to the weights accumulated from the words read so far. The value of the accumulated weights 119 is then the likelihood 117 that word/sense pair 105 is suitable for position 116.

Of course, in most applications of apparatus 101, the issue will be which of two or more word/sense pairs best suits context 115. To find that out, one simply uses apparatus 101 as described above for each word/sense pair in turn. The one with the highest accumulated weights is the word/sense pair best suited to position 116. For example, if apparatus 101 were being used in the word disambiguation application, there would be a separate word/sense pair for each sense of the word to be disambiguated, and the sense used to translate the word would be the sense of the word/sense pair with the highest accumulated weights.
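The procedure just described, namely reading the 50 words on either side of the position, summing table weights, and preferring the word/sense pair with the highest total, can be sketched as follows. The function names and the toy weight tables are illustrative assumptions; the real table 107 holds trained weights.

```python
def context(text_words, position, width=50):
    """The 100-word context 115: up to 50 words before and after position 116."""
    lo = max(0, position - width)
    return text_words[lo:position] + text_words[position + 1:position + 1 + width]

def likelihood(subtable, ctx, default=0.0):
    """Accumulate the weight 119 of each context word found in a subtable 108.
    Words absent from the subtable contribute a default weight."""
    return sum(subtable.get(w, default) for w in ctx)

def best_sense(subtables, text_words, position):
    """Score every word/sense pair for the position and keep the highest."""
    ctx = context(text_words, position)
    return max(subtables, key=lambda sense: likelihood(subtables[sense], ctx))

# Toy example: disambiguating "sentence" at index 4.
tables = {"punishment": {"judge": 2.0, "trial": 1.5},
          "grammar": {"verb": 2.0, "noun": 1.5}}
words = "the judge read the sentence at the trial".split()
# best_sense(tables, words, 4) → "punishment"
```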
When words are being disambiguated, it will occasionally occur that the difference between the highest and the next highest accumulated weights is not enough to permit clear disambiguation. In that case, SLC 103 may take other approaches. One such approach is analyzing the discourse to which text 111 belongs. For purposes of the present discussion, a discourse is one or more texts which concern a single subject or a set of related subjects. Within a given discourse, ambiguous words tend to be used in a single sense. For instance, if the discourse concerns grammar, sentence used in the punishment sense will rarely occur; if the discourse concerns criminal trial practice, sentence used in the grammatical sense will rarely occur.

A simple way of analyzing a discourse using apparatus 101 is as follows: The text belonging to the discourse is marked as such and calculator 103 stores the most suitable sense and weight for every position where there was clear disambiguation for the word in question. Generally, there will be a heavy preponderance of one of the possible senses, and the preponderant sense can be used in those situations where the analysis of context 115 alone did not clearly disambiguate. An even simpler, albeit less precise, way takes advantage of the fact that adjacent uses of a word tend to belong to the same discourse. In this technique, if the analysis of context 115 alone does not clearly disambiguate, apparatus 101 will apply the result of the examination of a neighboring context 115 which contains the word in question and determine the sense of the word in question from its sense in the neighboring context.
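The first, preponderant-sense method can be sketched in a few lines. The score-margin threshold standing in for "clear disambiguation", the data layout, and the function name are all assumptions for illustration.

```python
def disambiguate_discourse(scored, margin=1.0):
    """scored: list of (best_sense, best_score, runner_up_score), one tuple per
    occurrence of the ambiguous word in the discourse.  Where the gap between
    the top two scores is below the margin, fall back on the preponderant
    sense among the clearly disambiguated occurrences."""
    clear = [s for s, best, second in scored if best - second >= margin]
    if not clear:
        return [s for s, _, _ in scored]      # no clear occurrence to lean on
    preponderant = max(set(clear), key=clear.count)
    return [s if best - second >= margin else preponderant
            for s, best, second in scored]

decisions = disambiguate_discourse([
    ("punishment", 5.0, 1.0),   # clear
    ("grammar", 2.0, 1.9),      # unclear: falls back to preponderant sense
    ("punishment", 4.0, 0.5),   # clear
])
# decisions == ["punishment", "punishment", "punishment"]
```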
In a preferred embodiment, apparatus 101 is implemented in a digital computer system. Text 111, table 107, and word/sense pair 105 are stored in the computer system's data storage system, and sense likelihood calculator 103 is implemented by means of a program executed by the digital computer system's processor. In some embodiments, table 107 may be in read-only memory and sense likelihood calculator 103 may be a special-purpose processor which has been adapted for rapid referencing of table 107. Such an embodiment might, for example, be useful in a pocket translating device or in an electronic typewriter.

Computation of Word/Sense Pair Probability Table 107

Conceptually, word/sense pair probability table 107 is a whole set of subtables 108. There is a subtable 108 for at
`
least every word/sense pair which is of interest for the task we are interested in, and the subtable 108 for that word/sense pair contains an entry for at least every word in text 111. Moreover, what is really wanted is a word/sense pair probability table 107 which will work with virtually any text written in a given language. In the disambiguation context, such a table would conceptually include a subtable 108 for each word/sense pair for each ambiguous word in the given language, and each subtable 108 would include entries for every word in the given language. Of course, optimizations are possible. For example, most words will contribute little or nothing to the disambiguation, and such words may simply be left out of the table and given a default weight.

It is apparent that large word/sense pair probability tables 107 can only be computed by machine. In the terms employed in the art, the training of apparatus 101 must be automated. Training is in many respects the reverse of the testing just described. In the disambiguation context, a subtable 108 is created in table 107 for a given ambiguous word by examining contexts 115 which contain known senses of the ambiguous word to find other words in the context which can be used in testing to estimate the probable sense of the ambiguous word. For example, if the subtable 108 is for sentence's punishment sense, the contexts will certainly show more occurrences of the word "judge" or the word "trial" than is normal in English language texts, and these words can be weighted accordingly in table 107.
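The intuition of this paragraph, weighting a word by how much more often it occurs near the known sense than in the training text overall, can be sketched as a smoothed log-frequency ratio. The smoothing constant below is only an illustrative stand-in; the patent's actual Bayesian interpolation between the conditional sample and the whole training text is described later.

```python
import math
from collections import Counter

def train_weights(conditional_sample, corpus, alpha=0.5):
    """conditional_sample: words from contexts 115 of the known sense.
    corpus: all words of the training text.  The weight of a word is the
    log ratio of its (smoothed) frequency near the sense to its overall
    frequency, so a word like "judge" near the punishment sense of
    "sentence" gets a large positive weight."""
    cond, glob = Counter(conditional_sample), Counter(corpus)
    n_cond, n_glob = len(conditional_sample), len(corpus)
    weights = {}
    for w in glob:
        p_cond = (cond[w] + alpha) / (n_cond + alpha * len(glob))  # smoothed
        p_glob = glob[w] / n_glob
        weights[w] = math.log(p_cond / p_glob)
    return weights

w = train_weights(["judge", "trial", "judge"],
                  ["judge", "trial", "judge", "verb", "noun", "the"] * 10)
# w["judge"] is positive, w["verb"] negative: "judge" is concentrated
# in the conditional sample, "verb" is not.
```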
The largest problem in automating training is of course determining which sense the word being trained for has in a given context 115. For large tables 107, it is clearly impractical for human readers to flag the word being trained for as having one sense or another in the contexts 115 being used for training. During development of apparatus 101, two techniques have been discovered for automatically determining the sense of the word being trained for in a given context 115. One of the techniques employs bilingual bodies of text; the other employs subject matter categories such as those provided by a thesaurus.

Training on Bilingual Bodies of Text

Training on bilingual bodies of text takes advantage of two facts: first, a translation of a word which is ambiguous in a first language into a word in a second language often indicates the sense of the ambiguous word. Thus, if the English word sentence is translated peine, we know that the English word is being used in the punishment sense. If it is translated phrase, we know that the grammatical sense is meant. Second, there are presently available large amounts of machine-readable text for which versions in two languages are available. One example of such a bilingual body of text is the Canadian Hansards, which is the journal in both English and French of the debates in the Canadian Houses of Parliament. In the following, the English version is termed the English Hansards and the French version the French Hansards.

The preferred embodiment is trained on the Canadian Hansards for one sense of a given ambiguous English word as follows: first, statistics are gathered for the entire body of the English Hansards. The statistics include the number of tokens (words and phrases treated as words) in the English Canadian Hansards and then the number of occurrences of each word in the English Hansards. From these statistics the probability that a given word will occur in a 100-word context of the English Hansards is computed.

Then a conditional sample of the English Hansards is made for the desired sense of the given ambiguous word. This is done by finding each occurrence of the given ambiguous word in the English Hansards. Then the French
`
`8
word (or phrase) which corresponds to the occurrence is located in the French Canadian Hansards. The French word which corresponds to the occurrence of the ambiguous English word is located by aligning sentences of the English text with corresponding ones of the French text, as described in Gale, W., and K. Church, "A Program for Aligning Sentences in Bilingual Corpora," Proceedings: 29th Annual Meeting of the Association for Computational Linguistics, pp. 177-184, 1991. When located, the French word or phrase determines whether that occurrence of the given word has the desired sense. If it does, the 50 words on either side of the occurrence are output to the conditional sample.

It should be pointed out here that the use of a 100-word context in training is as important as it is in the operation of apparatus 101.
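Assuming the sentence alignment of Gale and Church cited above has already been applied, the construction of the conditional sample can be sketched as follows. The function name, the translation test (simple membership of a translation word in the aligned French sentence), and the toy data are illustrative assumptions.

```python
def conditional_sample(aligned_pairs, word, sense_translations, width=50):
    """aligned_pairs: list of (english_sentence_words, french_sentence_words).
    Collect the words on either side of each occurrence of `word` whose
    aligned French sentence contains a translation indicating the desired
    sense (e.g., "peine" for the punishment sense of "sentence")."""
    english = [w for eng, _ in aligned_pairs for w in eng]   # running text
    sample, offset = [], 0
    for eng, fre in aligned_pairs:
        for i, w in enumerate(eng):
            if w == word and any(t in fre for t in sense_translations):
                pos = offset + i
                # 50 words on either side of the occurrence (context 115)
                sample.append(english[max(0, pos - width):pos] +
                              english[pos + 1:pos + 1 + width])
        offset += len(eng)
    return sample

pairs = [("the judge read the sentence".split(),
          "le juge a lu la peine".split()),
         ("a sentence has a verb".split(),
          "une phrase a un verbe".split())]
contexts = conditional_sample(pairs, "sentence", {"peine"})
# Only the first occurrence is aligned with "peine", so one context is kept.
```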
Once the conditional sample is made, it is analyzed using Bayesian techniques which will be described in detail below to determine the weight of each word in the conditional sample with regard to the probability that the given ambiguous word has the sense used to make th