`Kucera et al.
`
`Patent Number:
`11
`(45) Date of Patent:
`
`4,868,750
`Sep. 19, 1989
`
`54 COLLOCATIONAL GRAMMAR SYSTEM
`(75) Inventors: Henry Kucera, Providence, R.I.;
`Alwin B. Carus, Newton, Mass.;
`Jeffrey G. Hopkins, Pawtucket, R.I.
`73) Assignee: Houghton Mifflin Company, Boston,
`Mass.
`21 Appl. No.: 106,127
`22 Filed:
`Oct. 7, 1987
`(51) Int. Cl. .............................................. G06F 15/02
`(52) U.S.C. ..................................... 364/419; 364/900
`58) Field of Search ................ 364/419,900 MS File,
`364/200 MS File
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`4,661,924 4/1987 Okamoto et al. .................., 364/900
`4,703,425 10/1987 Muraki ................................ 364/419
`4,724,523 2/1988 Kucera ................................ 364/419
`4,742,481 5/1988 Yoshinura ........................., 364/49
`4,747,053 5/1988 Yoshimura ..
`364/49
`4,750,122 6/1988 Kaji .........
`364/419
`4,760,528 7/1988 Levin ......
`... 364/419
`4,773,009 9a 988 Kucera ................................ 364A419
`OTHER PUBLICATIONS
`Choice of Grammatical Word-Class Without Global Syn
`tactic Analysis. Tagging Words in the Lob Corpus, Mar
`
`shall, Ian in Computers and the Humanities 17 (1983)
`139-150.
`Primary Examiner-Michael R. Fleming
`Attorney, Agent, or Firn-Lahive & Cockfield
`(57)
`ABSTRACT
`A system for the grammatical annotation of natural
`language receives natural language text and annotates
`each word with a set of tags indicative of its possible
`grammatical or syntactic uses. An empirical probability
`of collocation function defined on pairs of tags is itera
`tively extended to a selected set of tag sequences of
`increasing length so as to select a most probable tag for
`each word of a sequence of ambiguously-tagged words.
`For listed pairs of commonly confused words a substi
`tute calculation reveals erroneous use of the wrong
`word. For words with tags having abnormally low
`frequency of occurrence, a stored table of reduced
`probability factors corrects the calculation. Once the
`text words have been annotated with their most proba
`ble tags, the tagged text is parsed by a parser which
`successively applies phrasal, predicate and clausal anal
`ysis to build higher structures from the disambiguated
`tag strings. A voice/text translator including such a tag
`annotator resolves sound or spelling ambiguity of
`words by their differing tags. A database retrieval sys
`tem, such as a spelling checker, includes a tag annotator
`to identify desired data by syntactic features.
`
`17 Claims, 13 Drawing Sheets
`
`
`
`
`
`
`
`
`
`
`
`
`
`SAA// JAA/AS
`f A/ASA's
`
`A/
`A/V/AAW
`
`SOAAA
`Wy
`
`
`
`
`
`
`
`AA//
`AWWFAA
`2
`
`
`
`
`
`
`
`
`
`
`
`JAYI
`ASSAA
`A/SA/AM///
`
`Z--
`-Stiff
`self a?
`
`
`
`
`
`AAAW/AAA
`AAOMASSO?
`ASA/A/A
`ASA?
`
`Page 1 of 35
`
`GOOGLE EXHIBIT 1011
`
`
`
`US. Patent
`
`Sep. 19, 1989
`
`Sheet 1 of 13
`
`4,868,750
`
`
`SPEC/Al 543155
`
`I MMfl/UIS
`
`{Pfl/
`
`
`rm
`
`MIN/P01([2?
`”[554?!
`MM
`
`
`
`
`may/0w
`war/MA”
`
`
`
`{/0
`
` mm
`
`mn'
`
`.
`___1__._
`5— {— SHEA/61.1.
`ism/rm? J:
`
`Page 2 0f 35
`
`Page 2 of 35
`
`
`
`US. Patent
`
`Sep. 19, 1989
`
`Sheet 2 of 13
`
`4,868,750
`
`000/0000 0/0500 (0000/04/700 (00/
`0000 0/4000/7/00/5 (00007017700 (00/
`'001' ' (00000007! 040)
`01050 00000f000/5 (0000/7/0/700 M0)
`500/ - 00!00 (0000700/700 700)
`02450 (0000(001700 (00/
`00000 (0000/7/00/00 (00/
`00/00 (00007001700 (00/
`000- 00l!/0/0‘.’ (000- 000/00! 710/
`000‘ 0000/70/00 (000- 000/00! M0)
`000- 00/10f/0/00/0000l0 000J000f/0I (000-000(04! M0)
`0057- 00/'000/000 (/700- 000/00! (00}
`4077020 (000 -000/IA! 7,40}
`'0000 ' (V0000! U0) (0)
`"005' (V0000! 0,40} (0)
`'00/00'(V0000! 1'00) (0)
`'00'-0050 0000 (V0001! (540/ (I)
`'40 ' (V0000! (00/ (0)
`“0000' (V0000! (00) (I)
`”400' (V0000! (40/(0)
`'/0'(V000A! 700/ (0)
`00000/0/If/00 000/000/700 0500/00/70! (00/
`0000/00! 000000 (0190-0007014! M0)
`500000/0/1/700 000/000/700 (000(000/0! 2740)
`'0/0' (V0004! 040/ (0)
`‘00/00' (V0000! 700/ (I)
`'00'- 0000 0000 (V0000! (/10) (I)
`'0000 ' (V0004! (A0} (I)
`'00' (V0000! (A0} (I)
`'0000 ' (V0000! (00/ (0 )
`5/06/0140 00(000/000 (000-000(00! (10/
`5/000! A0/P!!/0A! 00(000/000 (000 - 000/00! 710/
`H0014! 00(000/000 (000 -000/00! (00/
`007000/000/ 0000! 0 000J0007/00 (000-000(00! (00/
`0070570070]!
`'1'0000' (000/00! 1'00) (0)
`'000 '; 04.57 (0050 (V0000! 710/ (0)
`'0AV/00 ' ( V0000! (/10) (I)
`'IAVO '- 0,400 0000 (V0000! 700/ ( I )
`”000 ‘-',- 0/151' 0140fI670!0 (V0001! 1340/ ( I )
`"0,4V0' (V0000! M0} ( I )
`'000' (V0000! I740) (0)
`000000/f/00 (000700f/A! I710}
`10/00f/V0 (000000100! 040/
`000000lf/V0 .40/00/(V0 (000- 000/00! 540/
`51/000M/7V0 AOJ00f/V0 (000-000/01! H0}
`50000f/0 50000M/7V0 (000‘ 000/00! [10/
`0004! 001/!00)’ ( V0000! (10/ < 0)
`
`\*\'
`
`In.
`
`00!
`.400
`A01
`00
`If
`000
`000!
`000
`00/
`000
`000
`000
`00]
`00
`00
`05
`000
`000
`00/
`000
`000
`00!
`01'
`0/7
`0f.S'
`0fI
`0!
`0V0
`0V0
`IV/
`0V0
`0V0
`0V!
`(0
`JJ
`JJO
`././5
`.Uf
`00
`
`FIG.2
`(0/ (000 47/
`
`0/
`02
`03
`04
`05
`00
`0/
`00
`0.9
`/0
`/ /
`I!
`/J
`/4
`/.5
`/0
`ll
`/0
`/.9
`00
`2/
`Z?
`03
`04
`05
`00
`("7
`20
`2.9
`50
`3/
`30
`33
`30
`3.5
`30
`3/
`00
`3.9
`40
`4/
`40
`43
`44
`45
`40
`4/
`
`Page 3 0f 35
`
`Page 3 of 35
`
`
`
`US. Patent
`
`Sep.19, 1939
`
`Sheet 3 of 13
`
`4,868,750
`
`46'
`4.9
`II
`5/
`.52
`53
`54
`55
`56‘
`67
`II
`5.9
`II
`67
`6‘2
`6‘1
`6‘4
`6'5
`6'6‘
`6'7
`II
`629
`7:9
`7/
`I?
`71
`74
`7.5
`76‘
`7/
`II
`7.9
`II
`I/
`I2
`I!
`I4
`I5
`II
`If
`II
`II
`II
`.9/
`.92
`.93
`.94
`
`II
`m‘
`
`m m
`
`:
`II]
`If
`If!
`
`m m
`
`:
`
`II
`we:
`ms
`m:
`II
`II
`
`mm M
`
`:
`
`m P
`
`( I )
`S/IIIMI III/III IIII - I154“ fIIl/ {III/I41 MI)
`PISSISS/Vf 5/I6‘IMI IIWIII IIII (III-III/IAZ MI}
`Ill/III III/III IIII (III/IA! MI} (I)
`IIISISS/Vf IZIIAZ III/III IIII/IIf-III/IIZ MI)
`III-PISSISS/Vf IIIIII IIII-IIIf fIII {III/Ill MI} (I)
`S/IIIMI III/”[1? IIII ‘ IASf [III (III/IA! 7.467 ( I )
`IISSESS/Vf S/IIIMI IIIIII IIII {PIP III/Ill M9}
`I1IIIZ PIIIEI IIII (III/Ill MI} (I )
`IIISfSS/Vf IIIIAZ IIIIEI III/I lIIf-III/IAZ 1346‘}
`S/IIIMI AIVIII/AZ IIII - I155 fIII {III/IA! MI} (I )
`PISSESS/Vf S/IIIUI lIVfII/IZ IIII fIIf-III/IAZ HI}
`III/Ml IIVfII/Il IIII (III/Ill MI) (I)
`PISSfSS/Vf PlII/IZ AIVIII/ll IIII (III— III/Ill MI}
`III/Ill IIIIfI (IIF- III/Ill MI)
`III/Ill IIIIIII {III/IA! HI} (I)
`FISSfé'S/Vf III/Ill IIIIIII fIIf-III/IAI MI}
`IISIISS/Vf PIIIIII {III'III/IIZ MI)
`SIIIII FI556'55/Vf IIIIIII (III/Ill MI) (I)
`S/IIIMI Ifth'llVf IIIIIII {III/IA! MI) (I)
`Ill/IA! IfflfI/Vf IIIIIII {III/Ill 516‘) (I)
`IIJfIf/Vt’ IEISIIIZ PIIIIII (III/I211 546') (I)
`fI/II IEISII III/IAf/Vf FfISIIIl IIIIIIII’III/IIZ MI (I )
`III - l'I/II IIIIII III/II/VVEIEISIIAZ IIIIIII (III/Ill 7346') (I)
`III -II55[55/V[ IfISII/l IIIIIII/III/IIZ MI) (I)
`III! /F/fI {III - III/Ill 1546')
`IISf- IIIl/f/[I (FIE-III/IAZ MI}
`AIVfII MIVFII/Il MI)
`IIIPAIAf/Vf AIVfII MIVfII/AZ 176')
`.S'IIIIMf/Vf lIVfII (AIVEII/Il MI}
`III/Ill AIVEII MIVfII/AZ 546'}
`IIVfII/IAI/VIZ f MIVIII/Il 7546')
`/If/I//'/VAZ 'l'I' (VIII/H 1716‘}
`(III)
`[IIZIIAf/II fffIffIf/IZ 7246'}
`(f)
`VIII I4.” ffISE fIII {VIII/11 546')
`VIII IIISIII' MINI/III (VIII/$1 546‘} (I)
`VIII /If/I/f/Vf-I15[ FIII (VIII/fl MI) (I)
`VIII ”If IAIf/I/Plf foIII[ III)
`(I)
`VfIIIII-fI/II PfISII 5/I6‘IMI, PIESIIT ffISf {VIII/fl MI) (I)
`VIII fI/II-IIISII S/IIIMI, PIESEII' ffIII (VIII/II 1146‘} (I)
`[VI - IfffII/IEI (PIf-III/IAZ 1346'}
`IIISIIAZ WI -IIIIIII {III-III/IAZ M67
`!
`W0
`IIJfIf/Vf [VI - IIIIIII (III/Ill 136‘} (I >
`W5
`III/IAf/Vtr [VI- IIIIIII (III/III MI} ( I )
`1m
`II - III! lf/[I {III-III/IAZ 5467
`um
`II— AIVEII MIVEII/At 716'}
`I!
`~IAI.S'[I /IffII/Il ' II 176‘ ASS/IIII-
`[I]! [II-If-f/Zf IIIIZ'I
`
`PZS
`PM
`IFS
`PISS
`
`M o
`
`z
`01/”
`II
`Mk
`III
`II
`17/?
`
`m w
`
`/ m m m m m V
`
`IZ
`
`W W
`
`FIG. 2
`MI ”MI .94)
`
`Page 4 0f 35
`
`Page 4 of 35
`
`
`
`US. Patent
`
`Sep. 19,1989
`
`Sheet 4 of 13
`
`4,868,750
`
`1/
`
`1
`(7
`Mark
`abandon/near
`Mose
`Mose/7190f
`Mashed
`is
`MM
`
`flack/n;
`
`[Ml/P159 0F ”Mi/Ml ”AM/fl ”MT/MARY ifMMS
`
`I26
`
`150
`
`[14
`
`.
`4f
`M
`
`1W
`
`I/
`
`[7/
`
`I/
`
`V/
`
`V/
`
`JJ
`6'5 M’ 01 M
`JJ [73
`
`FIG. 3A
`
`|/
`Mia/.1
`(rd/Ms
`MM;
`am
`are
`my
`arisen
`arose
`ofe
`wake
`woke
`aim/rel;
`baa/Ill
`MM/as
`bale
`base
`bases
`basis
`to
`tear
`
`[Ml/Pl [S W '[fl’f/Df/M ' Ill/I MUM/VARY ATM/905
`
`I50
`
`125
`IV/S
`Mp
`My
`
`|34
`(M: mus/mm)
`mam/bu)
`(MS: M/‘M/
`it'll (Ire/“x be}
`if/P {Ire/I be]
`{Vi/ -/: arose/en's”)
`{Vi/4’: an'sa/
`WM,- m’se/
`/V1903 MN
`./J M (1/49/- /: awake /mmken)
`(V50: awake}
`/ V5175 wake/
`[If/5I Mail/us/
`(W: Mai/fl)
`(1491’.- fi/W
`MW,- Msas J
`(MS: base/Mm /
`MI.- Mses /
`if/ {Ire f: is)
`{Vfl/ - / .~ tare/Dome}
`
`FIG. 33
`
`law
`
`[w
`.5
`5
`5
`5‘
`5
`5
`.5
`J
`5
`I
`J
`5
`J
`5
`5
`5
`5
`I
`5
`.3“
`
`Page 5 0f 35
`
`Page 5 of 35
`
`
`
`US. Patent
`
`Sep.19,1989
`
`Sheet 5 of 13
`
`4,868,750
`
`[Ml/HES 0f ”Ml/I'Mé‘f/flfl " Ill/4’ 0/6‘f/0M4/P)’ flfd‘fl/WS
`
`|35
`
`|45
`
`L55
`
`I55
`
`[if/97‘ ”'
`
`If! 7‘ *
`
`WP7‘ 3'
`
`//VZ 7‘ '
`
`l/
`'//'s
`m '/
`are” '/
`M” '/
`can/ml
`call/d 'w
`coy/d0 ‘7‘
`d/‘do ‘/
`does/7 '/
`Jan '/
`
`125
`PPS 7‘ :91"!
`iii/7‘ "
`if}? 7‘ *
`74/0 7‘ *
`74/0 7‘ *
`[/0 7‘ //V/
`AM 7‘ "‘
`000 7‘ *
`M] 7‘ *
`MP 7‘ '*
`
`W
`f
`7‘
`7‘
`7‘
`7‘
`7‘
`7‘
`7‘
`7‘
`7‘
`
`FIG. 3C
`
`”JANA/WW A/V/VflfA/Ml 07" 555/ 55175765
`
`1 ac. .-
`[-
`2‘
`3:
`4:
`i'
`6':
`
`K
`J:
`.9:
`/0.'
`//.-
`/Z-
`Us
`/4'
`/.5:
`/6’.‘
`/Z~
`
`Ila/d:
`film
`MN:
`lo
`55W
`like
`new
`
`me/ropa/i/a/I
`[an
`all
`/)/‘3
`clever/j
`framed
`and
`br/Y/Mnf/y
`p/I/myed
`para/fears
`
`fay S/r/fly .'
`I,”
`MS {if . Ira/W V5! (-7”)
`M’ M
`149/ 1497"
`AI
`/J
`
`J./
`M"
`A” 01 M
`PP! PP”
`M
`VM (5)“ = I‘M/0) V50 ( =30
`('6’
`W I?!
`JJ
`IVA/5 {if wanna/J
`
`FIG.5
`
`Page 6 0f 35
`
`Page 6 of 35
`
`
`
`U.S. Patent Sep.19, 1989
`
`Sheet 6 of 13
`
`4,868,750
`
`(ass
`AA
`AA
`Af
`A.
`(A
`A)
`(S
`f
`Af
`A.
`AW
`AW
`W
`Af
`f
`a
`AWAP
`t
`Af
`Ap
`A.
`A.
`AP
`AA
`J)
`th
`WA
`
`A
`W
`WP
`Y
`
`(Asy (AAA/07 AASS AAAAAS A27 A65
`
`Aescription
`are-qualifiers and pre-quantifiers
`pos? deferminer
`arti?e
`forms of the verb "?o be
`coordinating confunction
`carolina/ number
`subordinating confunction
`forms of fle verb 'to do
`defarminers
`aris?enfia/ "fiere
`forms offe verb "to have
`areposition
`advectives
`meda/aurifiafy
`inf fifts
`Aroper nouns
`afe?t/af nouns
`ordina/ number
`Mania/Aronours
`aersona/Aronouns
`qualifiers
`adverbs
`nomina/adverb
`adverbaartic/e
`an/int/val/ "f
`ere/anaffon
`verbs
`wh-deferminer
`Aersona/ wh-pronouns
`wh-qualifier
`wh-dive/a
`none of fle above (60s-inferna/ only/
`
`AAG. 4
`
`Joys
`
`f
`f
`
`?
`A
`f
`A.
`4
`f
`A.
`/
`4
`f
`5
`4
`4.
`f
`2
`A.
`2
`
`f
`f
`f
`f
`A.
`f
`y
`f
`A
`f
`
`Page 7 of 35
`
`
`
`US. Patent
`
`Sep.l9, 1989
`
`Sheet 7 of 13
`
`4,868,750
`
`”If-[IN
`SHAW
`
`
`
`
`
`
`
`M’l’flf
`
`fiPA’fSS/M’
`
`MMl7; £451?
`
`[[55 f/I’M’J ”ll/[5
`
`
`5f/9/P A 50/771
`
` Sllff/I
`
`
`HM”
`
`
`SM/PPM/é'
`fli‘Wf55/017
`
`
`ffl/[fl f/i/r’t’f
`
`f/l/fS .7
`
`SHIV/X-
`
`SIR/Him
`[IPRfSS/fll M’
`
`[IF/91".5'5/017
`
`Ml’ I’M/W
`”If/1i”?
`
`
`
`
`
`
`
`
`6Y7 MAM/AHMZ
`J Wflfé‘flm’ll
`07/? MW!
`04/745451."
`
`
`
`
`flW'Pflf MAW/Af/MZ
`a’ Mflfé‘f/W’AZ
`M’fMfl/AT/flfl
`
`
`
`(fl/[[67
`WEN/AW
`
`
`
`
`
`5- ”0/905
`SPH'MZ 04,451”
`I‘d/$7659 flé'flflflf/IVIS
`
`
`
`FIG. 6'
`
`Page 8 0f 35
`
`Page 8 of 35
`
`
`
`US. Patent
`
`Sep.19,1989
`
`Sheet 8 of 13
`
`4,868,750
`
`5157' SE/W PM m
`
`554’ A’flflf
`
`
`l/W/Ml/[f
`
`
`flPfl/flff/r’s
`l VAR/1451A?
`
`6‘/
`
`
`
`lliflfflfflf
`MI? 1 V1 191’ / #00
`61/1? 546'
`M 5/7/15
`
`65
`
`/5
`
`
`MZI’ M6
`
`
`”PASS
`
`: AflVi/r’i 0/?
`
`
`ATM?”
`
`6‘6‘
`
`FM’ST
`
`
`iffff 046'
`
`546‘ 01‘ lM/M'S
`
`WWI/7295 5,46“!
`SUP/4’6 - I] M
`
`
`
`= SIM/W 546‘
`
`
`
`
`
`or MN m /,
`517 ”PF 0/ Wflfiflf
`
`W030
`
`
`[MAW 'fll/MZ/f 546'
`
`HAWK/f 546'
`
`
`567’ -— ma 4
`
`7
`
`
`or PM or > /
`
`mm mm my
`.90
`moms m
`
`EMMY/[fl
`
`
`619
`Pflflé‘ff5/176
`
`
`flflflf/A/[S
`
`FIG. 7
`
`Page 9 0f 35
`
`Page 9 of 35
`
`
`
`US. Patent
`
`Sheet 9 of 13
`Sep. 19,1989
`7/
`5ff I'H/P PMS
`
`
`NZ [0
`
`
`SEW/'0!r
`
`
`000:! PREV
`0/5/7003
`
`
`4,868,750
`
`/0f/W//‘}’
`005% 7,40
`
`71
`
`
`
`74
`
`/7”
`
`
`#7 00/?0
`
`MAM-
`
`
`M0 3% P/PV
`
`0 f/Md‘f
`
`
`
`
`MAMA/Y
`
`
`0 f 7M: ’P/n/ /
`1/017. 57
`
`
`MM MIA/Mi
`
`A’Ef0 f0
`
`
`All 004%
`
`
`”Mi
`0 00055 .7
`
`7.5
`
`
`
`76‘
`
` 0004/1"
`
`fill? / 07/?
`
`
`fill/P? Pf)?
`
`
`f0 VAN/f 0/"
`01/0! P0/A’ff0
`
`
`
`Page 10 0f 35
`
`Page 10 of 35
`
`
`
`U.S. Patent
`Alif A
`
`Sep. 19, 1989
`
`Sheet 10 of 13
`
`4,868,750
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`SAJ
`A/AS/A2/A
`- WA
`
`SAJ
`Sps as (7
`
`A/S/
`(WAA D
`JAAA/S/
`2
`
`
`
`
`
`SAJ
`SAs a
`Apassa
`AfA. Ap
`
`45
`
`SWAA
`S/A
`
`(A
`
`Affif
`JMJA - Af
`SAJ Aff a
`AJA/S-AA
`
`
`
`
`
`Page 11 of 35
`
`
`
`US. Patent
`
`Sep. 19, 1989
`
`Sheet 11 of 13
`
`4,868,750
`
`1 II |II I | | I | I I g III l II I Il| |I I I| II I
`
` .9/
`0fl'0/7f 04/)!
`
`5:7 fH/P / PM
`
`
`10 5/7407 0"
`00005770/5 000/?Z /5f
`
`
`I III I I
`
`I
`I
`I
`I
`
`IIIII Il| |II
`
`
`
`—6'f/'0000[07 7/10
`0001/?
`0!" W000 I710 5/0/00
`0 [0/100
`
`
`.527 754/0 2 0/7?
`A00 5f00f
`
`f0 SDI/W 0/"
`/0 000/?[0f
`001‘"W005 0/5/7005
`0/5 0001’
`
`[/51'
`
`
`
`MAI/WIT
`0/1157? 13405 We
`
`05/00
`MID/l
` All
`M05 M’ ”0005'
`
`
`D40 5/0/00
`
`
` 0/0/05' 0/
`
`000057551?
`WV 0/V/50r?
`
`
`
`
`9’6
`
`001/701 )’ 0”
`m
`gar/mm 2pm
`
`mm M’
`0/5 mm
`
`
`Mil/M05 0/5 W!
`Sffl/r’f ”F 0/5/74!”
`
`/5f 0000555 -
`0V 00001-7”
`.
`0/5 0001"
`
`
`
`#1100”?
`95
`mm was
`Sill-'57
`
`
`f0/, 3
`
`-0/’0/1/'[
`00055
`
`ff/lP/Pf/F
` 754/02 0/7?
`
`Page 12 0f 35
`
`Page 12 of 35
`
`
`
`U.S. Patent Sep. 19, 1989
`
`Sheet 12 of 13
`
`4,868,750
`
`
`
`
`
`
`
`
`
`AA) A JAOAS
`AWA/AWA JAA/
`Affif AYA/SJAWA
`
`AAA
`JAAS AP
`W go:
`
`A WAS
`SAPJ
`WWA A/S, JAAAY
`f/SWAWAA
`
`)
`
`A/
`
`(gous
`
`A/WAAAYA
`A/S/
`
`
`
`A/5
`
`
`
`
`
`SAOS/A
`JA 3 (AAS //
`SA
`(AAS
`
`Page 13 of 35
`
`
`
`U.S. Patent Sep.19, 1989
`
`Sheet 13 of 13
`
`4,868,750
`
`A2)
`fAry
`Aft/A/104 -) i? - - - - - - - - - - - - - - - - -
`AWA/SAA/6/6/AW/(W
`WA
`AAAA)
`SAA/AWA
`AM/AAAY
`JYAAAA/AOS
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`AAAA
`Af0071A
`
`AM
`AZA (AAVAff
`JAA, SAWA WA
`AMAS
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`F/G, W2
`
`A2
`
`AMSA WMS
`JA6 (PASMA
`SAWWAAA
`
`/44
`(AAAAA/
`
`- - - - - - - - - - - - - - - - - - -
`/A
`A7
`(SAA
`A///04AAPY
`
`A97
`
`S-100S
`
`2007
`A(AAAA. JAA
`/A/(A/ )
`
`Page 14 of 35
`
`
`
`1.
`
`COLLOCATIONAL GRAMMAR SYSTEM
`
`The present invention relates to automated language
`analysis systems, and relates to such systems embodied
`in a computer for receiving digitally encoded textcom
`posed in a natural language, and using a stored dictio
`nary of words and an analysis program to analyze the
`encoded text and to identify errors. In particular, it
`relates to systems for the grammatical analysis of en
`10
`coded text.
`In recent years a number of systems have been devel
`oped for the automated recognition of syntactic infor
`mation. A survey of some systems appears in the text
`book of Winograd, Language as a Cognitive Proces
`s-Syntax (ISBN 0-201-08571-2 v. 1) at pages 357-361
`and pages 390-403. As a rule, although a number of
`theoretical linguistic formalisms have been developed
`to identify correct grammatical constructions, the prac
`tical construction of grammatical analyzing devices has
`proven difficult. Because the number of combinations of
`possible parts of speech for a string of words escalates
`exponentially with string length, syntax-recognizing
`systems have in general been limited to operating on
`text having a small, well-defined vocabulary, or to oper
`25
`ating on more general text but dealing with a limited
`range of syntactic features. Extensions of either vocabu
`lary or syntactic range require increasingly complex
`structures and an increasing number of special recogni
`tion rules, which would make a system large or too
`30
`unwieldy for commercial implementation on commonly
`available computing systems. Moreover, the automated
`grammatical systems which have been designed are
`special processors, in that they are not adapted to con
`ventional word processing or computer-aided publish
`35
`ing functions. For example, such systems may require
`that their input text be at least sufficiently pre-edited so
`that it is both correctly spelled and grammatically well
`formed. A misspelling, a wrong word such as a hom
`onym, a compound word, or even a simple syntax error
`40
`may render an input sentence unanalyzable.
`OBJECTS OF THE INVENTION
`It is an object of the present invention to provide an
`improved device for the grammatical analysis of digi
`45
`tally encoded natural language text.
`It is another object of the invention to provide a
`digital text analyzer for assigning tags to each word of
`a digitally encoded text indicative of syntactic or inflec
`tional features of the word.
`50
`It is a further object of the invention to provide a
`grammatical analyzer for encoded text which identifies
`the most probable tags of words of a sentence based
`upon collocation probabilities of their occurrence with
`adjacent tags.
`55
`It is a further object of the invention to provide a
`grammatical analyser which accepts as an input uned
`ited text material having misspellings and vocabulary
`eOS
`These and other features of the invention are ob
`60
`tained in an apparatus for the grammatical annotation of
`digitally encoded text material, preferably including a
`stored dictionary wherein each entry represents a word
`together with tags indicative of possible syntactic and
`inflectional features of the word. A sentence of digitally
`65
`encoded text is passed to the grammatical annotator,
`which first operates on the words of the sentence to
`annotate each word with a sequence of possible tags for
`
`15
`
`4,868,750
`2
`the word, and next operates on strings of tags of adja
`cent words to determine the probable tags, in order of
`likelihood, for each word.
`This produces a "disambiguated' tag set which iden
`tifies a most probable tag assignment, for each word of
`a string of words, and one or more next most likely tag
`assignments. The disambiguated tag set serves as an
`input to a grammar processor which in a preferred
`embodiment uses the tags to identify basic grammatical
`units such as noun phrases and simple predicates, and
`processes these units to determine the parse of the sen
`tence.
`Preferably, the stored dictionary of words includes
`data codes representative of features such as gender and
`number, requiring agreement among words, and this
`information is used to select proper constructions dur
`ing processing. The system preferably also includes a
`morphological analyzer, which uses prefixes, suffixes
`and other structural attributes of words to recognize
`certain classes of words which are not in the stored
`dictionary. For such a word, the analyser then creates a
`dictionary entry with appropriate tags so that grammat
`ical processing proceeds as though the word were in the
`database.
`More specifically, the grammatical analyzer anno
`tates the words of a sentence of text with grammatical
`tags and inflectional features of the word using one or
`more of the above techniques. Each string of multiply
`tagged words between two unambiguously-tagged
`words is then analyzed by a disambiguation sub-system
`which applies a collocational probability matrix to adja
`cent pairs of tags to iteratively construct a probability
`like measure and to determine a most probable tag
`string corresponding to the string of words. Candidate
`tag strings of lesser probability are stacked for use if a
`later processing step eliminates the "most probable' tag
`string. This results in a "disambiguated' sentence struc
`ture in which one or more likely tags are identified for
`each word of the sentence.
`In a preferred implementation, the probability-like
`measure is iteratively defined on generations of succes
`sively longer tag strings corresponding to sequences of
`words. Nodes which generate strings of lesser probabil
`ity are pruned from the calculation as it proceeds, so
`that only a handful of potentially thousands of tag
`strings need be processed.
`In a further embodiment of the invention, the values
`assigned by the collocation matrix are further modified,
`for tags of particular words appearing in a reduced tag
`probability database, in accordance with a table of re
`duced probabilities. In a further preferred embodiment,
`when a word of the string appears in another database,
`called the "commonly confused word' database, an
`augmented set of tag strings is created by substituting
`tags corresponding to a correlated word, and the substi
`tuted tag strings are collocationally evaluated as candi
`dates for the most probable tag string. In a further em
`bodiment, the tag strings selected in one of the forego
`ing operations are also checked against templates repre
`sentative of erroneous or rare parses to detect common
`errors. When a sentence has been annotated with tags
`and a most probable parse identified, the annotated
`sentence is then parsed by a parsing component which
`determines a parse of the whole sentence.
`The parsing component of a prototype system oper
`ates on the "most probable parse" (henceforth "MPP")
`tags assigned by the disambiguation sub-system to the
`words of a given sentence, in order to assign the higher
`
`Page 15 of 35
`
`
`
`5
`
`10
`
`15
`
`4,868,750
`3
`syntactic structure of the sentence and also to detect
`and suggest corrections for certain types of errors in the
`sentence. The parsing process preferably proceeds in
`three general phases: (a) the identification of the simplex
`noun phrases (NPs) in the sentence and, if there is more
`than one simplex NP, their combination into complex
`NPs; (b) the identification of the simplex verb groups
`(VGs) in the sentence and, if there is more than one
`simplex VG, their combination into complex VGs; and
`(c) the assigning of structure to complete sentences.
`In addition to its applications in a grammatical text
`analyzer, a disambiguator according to the invention
`includes improvements to existing types of non-gram
`mar language processors. For example, an improved
`spelling checker according to the invention includes a
`spelling checker of the type wherein each erroneously
`spelled word is identified and a list of possibly-intended
`words is displayed. Conventionally, such systems dis
`20
`play a list of words which are selected as having ap
`proximately the same spelling as the erroneously
`spelled word. An improved system according to the
`present invention includes a partial or complete gram
`matical processor which determines the local context of
`a word (i.e., its likley tags or a definite tag), and which
`selects from among the candidate replacement words so
`as to display only the possibly intended words having a
`tag compatible with the syntactic context of the mis
`spelled word.
`30
`In an improved speech recognition (or speech synthe
`sis) system embodying the invention, a disambiguation
`module or a grammatical processor differentiates pairs
`of homonyms (respectively, homographs) by probable
`syntactic context, thereby eliminating a common source
`of errors in the conversion of text-to-sound (respec
`tively, sound-to-text). Other examples are described,
`following a detailed description of a prototype embodi
`ment of a grammatical disambiguation system.
`40
`The novel features which are believed to be charac
`teristic of the invention are set forth with particularity
`in the appended claims. The invention itself, however,
`both as to its organization and method of operation,
`together with further objects and advantages thereof,
`may best be understood by reference to the following
`description taken in connection with the accompanying
`drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is a block diagram of a system according to
`the present invention;
`FIG. 2 is a listing of system tags in an illustrative
`embodiment;
`55
`FIGS. 3A, 3B, 3C are samples of dictionary records;
`FIG. 4 is a listing of major classes of tags with corre
`sponding structural processing group codes;
`FIG. 5 is a representative text sentence annotated
`with its dictionary tags;
`FIG. 6 is a flow chart of a word tag annotation pro
`cessor;
`FIGS. 7-10 are flow charts detailing operation of the
`collocational disambiguation processing;
`FIG. 11 shows the processing of a general grammati
`65
`cal analyser operative on disambiguated text; and
`FIGS. 12-13 shows further details of preferred text
`word annotation processing.
`
`4.
`DETALED DESCRIPTION OF THE
`DRAWINGS
`FIG. 1 shows a block diagram of a grammatical ana
`lyzer according to the present invention having a
`CPU/controller 2 which may, for example, be a general
`purpose computer such as a micro- or mini-computer.
`The computer receives input text 4, e.g., from keyboard
`entry, a communications link, or a data storage device,
`and, if necessary, runs a sentence splitter 6 which parti
`tions the text into sentences for grammatical analysis.
`Alternatively, the system may receive as input discrete
`sentences of text or encoded text with sentence bound
`ary markers already inserted. Sentence splitting perse is
`known in the art, and is used, for example, in commer
`cially available systems for deriving word-per-sentence
`and similar statistical information in computerized read
`ability analysis systems. A suitable sentence splitter is
`disclosed in the copending patent application of Henry
`Kucera, Rachael Sokolowski and Jacqueline Russom
`filed June 6, 1986 as Ser. No. 872,094, entitled Method
`and Apparatus for Text Analysis, now issued as U.S.
`Pat. No. 4,773,009, which application is hereby incorpo
`rated by reference and made a part hereof.
`The controller 2 then passes each sentence to a gram
`matical analyzer 10 which annotates each word of the
`sentence, by reference to a stored word dictionary 8,
`and preferably several special databases or tables 7, as
`discussed further below, so as to produce an annotated
`sentence structure. The annotated sentence, or partial
`parses thereof and error messages or "prompts' are
`displayed on display 9.
`According to one aspect of the invention, the dictio
`nary includes a record for each word which contains a
`list of "tags', each tag encoding a syntactic or inflec
`tional property of the word, and which also includes a
`listing of special features used in the grammatical pro
`cessing.
`The processor annotates the sentence with this infor
`mation. It then utilizes the stored information to per
`form two, roughly sequential, operations on the anno
`tated sentence structure. First, a collocational tag
`disambiguation processor 10a applies an empirically
`compiled probability-like function defined on adjacent
`pairs of syntactic tags to determine a unique sequence of
`tags (one for each word) corresponding to the most
`probable parse of each ambiguously-annotated word in
`the sentence. The disambiguation processor also identi
`fies alternative tags of relatively high probability. Next,
`a grammatical processing module 10b operates on the
`identified tags to develop a parse of the sentence.
`A prototype text annotator embodiment was created
`having a main dictionary with 28,223 80-byte records,
`each record containing the complete grammatical infor
`mation for a given "word' which is either a base form
`or an irregularly inflected form. These records were of
`three types, marked by a record type-code in column 80
`to identify the types as "normal' (column 80 blank),
`"exception" ("S" in column 80) or "contraction" ("+"
`in column 80). Normal records correspond to the words
`with non-merged tags and (if they are nouns or verbs)
`regular inflections; exception records correspond to the
`words with non-merged tags that are members of an
`irregular (noun or verb) paradigm (these words may
`also be members of regular paradigms or uninflectable
`tag groups); and contraction records correspond to the
`words with merged tags (that is, tags that contain a
`
`35
`
`45
`
`50
`
`Page 16 of 35
`
`
`
`5
`
`15
`
`4,868,750
`6
`5
`Linguistic Information', now issued as U.S. Pat. No.
`"--", indicating that the corresponding word is a con
`4,724,523, of inventor Henry Kucera, which application
`traction of some type).
`is hereby incorporated by reference. Its operation is
`FIG. 2 is a listing of the tags used in the prototype
`further described below, by way of completeness, in
`embodiment, each of which is represented in the draw
`ing by a one to three character mnemonic and also by a
`connection with FIG. 6.
`In compiling the dictionary, if an inflectional variant
`one to two digit tag code. There are ninety-two such
`tags, although any given text word will generally have
`is a base form in its own right, it is listed separately in
`the database with the appropriate code for this usage.
`between one and six possible tags. Each tag indicates a
`For example, "backing' is stored as a noun of inflec
`possible syntactic use of the word, or an inflection. The
`tional class one, denoted N1, representing the paradigm
`dictionary records may also include, for nouns and
`10
`backing, backing's, backings, backings'). This dictio
`verbs, certain information encoding word features such
`nary entry is in addition to its inflectional usage as the
`as its number agreement behavior.
`present participle of the verb "to back" which would
`FIGS. 3A-3C show examples illustrating the format
`of the normal, exception and contraction records of the
`be recovered by inflection from the base form "back'
`prototype dictionary discussed above. The records each
`discussed above.
`FIG. 3B shows examples of exception records. These
`include the retrieval form of the main dictionary entry,
`left-justified with blank fill in columns 1-25 as field one,
`records contain elements (either base or inflected forms)
`that are members of irregular nouns or verb paradigms.
`and the record type code discussed above as the last
`In these records, the format of fields one to five are
`entry in the last field at column 80.
`FIG. 3A contains examples of "normal' main dictio
`similar to those of normal records shown in FIG. 3A,
`except that field four contains one or more substrings
`nary records. Normal records comprise approximately
`delimited by parentheses. The material between paren
`ninety-five percent of the database, and contain five
`theses identifies an irregular tag and the appropriate
`fixed-format fields, which include, in addition to fields
`base form for processing for such tag.
`one and five described above, the following.
`FIG. 3C illustrates contraction records, which lack
`Field two contains noun base form inflection code
`25
`the fields two through four of the foregoing two record
`information, if the base word has a noun form, for the
`word in field one, and occupies columns 26 through 29.
`types, and instead have a field two which contains from
`one to five merged tag representations (stored starting
`These code bits enable the construction of any desired
`in columns 26, 36, 46, 56, and 66, respectively), and
`inflection from the stored base form, by use of an inflec
`occupies columns 26 through 77. The last field, as with
`tional synthesis coding scheme discussed further below.
`the other two types of records, contains certain special
`Field three contains the verb base form inflection
`processing annotations, and occupies columns 78
`code information, if the base form has a verb form, for
`through 80; in the prototype, the only codes that occur
`the word in field one, and occupies columns 30 through
`in this field are the record type-indicating codes that
`33; these code bits compactly encode the verbal inflec
`tions corresponding to the base word.
`occur in column 80. The illustrated record for the word
`"ain't' indicates that it is a recognizable contraction
`Field four contains all other syntactic tags for the
`with a tag string consisting of the auxiliary tags corre
`word in field one, as well as any noun or verb feature
`sponding to the set of words ("an", "is", "are", "has",
`annotations, and occupies columns 34 through 77; fur
`"have"), plus the negation marker" corresponding to
`ther information concerning the feature annotations
`the morpheme “n't".
`that may appear in this field is given below in the discus
`As noted above, the main dictionary is a dictionary of
`sion of parsing and noun phrase determination.
`base form records each listing codes indicative of gram
`As noted above, noun and verb codes, if either occurs
`matical and inflectional tags and feature information.
`at all for a given word, are confined to the fields before
`Each text word is processed by an "unflection' proce
`column 34; all other tags must occur starting in or after
`dure which operates on the word to identify its base
`that column. For example, "back", the tenth word in
`form by stripping suffixes therefrom if possible to pro
`FIG. 3A, is encoded as being both a noun and a verb,
`duce a probable base form, and looking up the probable
`both of inflectional class one, yielding the paradigm
`base form in the dictionary. When the probable base
`back, back's, backs, backs' for the noun usage and
`form is found, the processor inspects inflectional codes
`back, backs, backed, backing for the verb, as well as an
`of the base form to confirm that any stripped suffixes
`adjective and an adverb (with tag codes as "JJ" and
`"RB', respectively). Although, including inflectional
`were indeed legal suffixes of the found entry. The ap
`propriate tags of the