throbber
United States Patent (19)
`Kucera et al.
`
`Patent Number:
`11
`(45) Date of Patent:
`
`4,868,750
`Sep. 19, 1989
`
`54 COLLOCATIONAL GRAMMAR SYSTEM
`(75) Inventors: Henry Kucera, Providence, R.I.;
`Alwin B. Carus, Newton, Mass.;
`Jeffrey G. Hopkins, Pawtucket, R.I.
`73) Assignee: Houghton Mifflin Company, Boston,
`Mass.
`21 Appl. No.: 106,127
`22 Filed:
`Oct. 7, 1987
`(51) Int. Cl. .............................................. G06F 15/02
`(52) U.S.C. ..................................... 364/419; 364/900
`58) Field of Search ................ 364/419,900 MS File,
`364/200 MS File
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`4,661,924 4/1987 Okamoto et al. .................., 364/900
`4,703,425 10/1987 Muraki ................................ 364/419
`4,724,523 2/1988 Kucera ................................ 364/419
`4,742,481 5/1988 Yoshinura ........................., 364/49
`4,747,053 5/1988 Yoshimura ..
`364/49
`4,750,122 6/1988 Kaji .........
`364/419
`4,760,528 7/1988 Levin ......
`... 364/419
`4,773,009 9a 988 Kucera ................................ 364A419
`OTHER PUBLICATIONS
`Choice of Grammatical Word-Class Without Global Syn
`tactic Analysis. Tagging Words in the Lob Corpus, Mar
`
`shall, Ian in Computers and the Humanities 17 (1983)
`139-150.
`Primary Examiner-Michael R. Fleming
`Attorney, Agent, or Firn-Lahive & Cockfield
`(57)
`ABSTRACT
`A system for the grammatical annotation of natural
`language receives natural language text and annotates
`each word with a set of tags indicative of its possible
`grammatical or syntactic uses. An empirical probability
`of collocation function defined on pairs of tags is itera
`tively extended to a selected set of tag sequences of
`increasing length so as to select a most probable tag for
`each word of a sequence of ambiguously-tagged words.
`For listed pairs of commonly confused words a substi
`tute calculation reveals erroneous use of the wrong
`word. For words with tags having abnormally low
`frequency of occurrence, a stored table of reduced
`probability factors corrects the calculation. Once the
`text words have been annotated with their most proba
`ble tags, the tagged text is parsed by a parser which
`successively applies phrasal, predicate and clausal anal
`ysis to build higher structures from the disambiguated
`tag strings. A voice/text translator including such a tag
`annotator resolves sound or spelling ambiguity of
`words by their differing tags. A database retrieval sys
`tem, such as a spelling checker, includes a tag annotator
`to identify desired data by syntactic features.
`
`17 Claims, 13 Drawing Sheets
`
`
`
`
`
`
`
`
`
`
`
`
`
`SAA// JAA/AS
`f A/ASA's
`
`A/
`A/V/AAW
`
`SOAAA
`Wy
`
`
`
`
`
`
`
`AA//
`AWWFAA
`2
`
`
`
`
`
`
`
`
`
`
`
`JAYI
`ASSAA
`A/SA/AM///
`
`Z--
`-Stiff
`self a?
`
`
`
`
`
`AAAW/AAA
`AAOMASSO?
`ASA/A/A
`ASA?
`
`Page 1 of 35
`
`GOOGLE EXHIBIT 1011
`
`

`

`US. Patent
`
`Sep. 19, 1989
`
`Sheet 1 of 13
`
`4,868,750
`
`
`SPEC/Al 543155
`
`I MMfl/UIS
`
`{Pfl/
`
`
`rm
`
`MIN/P01([2?
`”[554?!
`MM
`
`
`
`
`may/0w
`war/MA”
`
`
`
`{/0
`
` mm
`
`mn'
`
`.
`___1__._
`5— {— SHEA/61.1.
`ism/rm? J:
`
`Page 2 0f 35
`
`Page 2 of 35
`
`

`

`US. Patent
`
`Sep. 19, 1989
`
`Sheet 2 of 13
`
`4,868,750
`
`000/0000 0/0500 (0000/04/700 (00/
`0000 0/4000/7/00/5 (00007017700 (00/
`'001' ' (00000007! 040)
`01050 00000f000/5 (0000/7/0/700 M0)
`500/ - 00!00 (0000700/700 700)
`02450 (0000(001700 (00/
`00000 (0000/7/00/00 (00/
`00/00 (00007001700 (00/
`000- 00l!/0/0‘.’ (000- 000/00! 710/
`000‘ 0000/70/00 (000- 000/00! M0)
`000- 00/10f/0/00/0000l0 000J000f/0I (000-000(04! M0)
`0057- 00/'000/000 (/700- 000/00! (00}
`4077020 (000 -000/IA! 7,40}
`'0000 ' (V0000! U0) (0)
`"005' (V0000! 0,40} (0)
`'00/00'(V0000! 1'00) (0)
`'00'-0050 0000 (V0001! (540/ (I)
`'40 ' (V0000! (00/ (0)
`“0000' (V0000! (00) (I)
`”400' (V0000! (40/(0)
`'/0'(V000A! 700/ (0)
`00000/0/If/00 000/000/700 0500/00/70! (00/
`0000/00! 000000 (0190-0007014! M0)
`500000/0/1/700 000/000/700 (000(000/0! 2740)
`'0/0' (V0004! 040/ (0)
`‘00/00' (V0000! 700/ (I)
`'00'- 0000 0000 (V0000! (/10) (I)
`'0000 ' (V0004! (A0} (I)
`'00' (V0000! (A0} (I)
`'0000 ' (V0000! (00/ (0 )
`5/06/0140 00(000/000 (000-000(00! (10/
`5/000! A0/P!!/0A! 00(000/000 (000 - 000/00! 710/
`H0014! 00(000/000 (000 -000/00! (00/
`007000/000/ 0000! 0 000J0007/00 (000-000(00! (00/
`0070570070]!
`'1'0000' (000/00! 1'00) (0)
`'000 '; 04.57 (0050 (V0000! 710/ (0)
`'0AV/00 ' ( V0000! (/10) (I)
`'IAVO '- 0,400 0000 (V0000! 700/ ( I )
`”000 ‘-',- 0/151' 0140fI670!0 (V0001! 1340/ ( I )
`"0,4V0' (V0000! M0} ( I )
`'000' (V0000! I740) (0)
`000000/f/00 (000700f/A! I710}
`10/00f/V0 (000000100! 040/
`000000lf/V0 .40/00/(V0 (000- 000/00! 540/
`51/000M/7V0 AOJ00f/V0 (000-000/01! H0}
`50000f/0 50000M/7V0 (000‘ 000/00! [10/
`0004! 001/!00)’ ( V0000! (10/ < 0)
`
`\*\'
`
`In.
`
`00!
`.400
`A01
`00
`If
`000
`000!
`000
`00/
`000
`000
`000
`00]
`00
`00
`05
`000
`000
`00/
`000
`000
`00!
`01'
`0/7
`0f.S'
`0fI
`0!
`0V0
`0V0
`IV/
`0V0
`0V0
`0V!
`(0
`JJ
`JJO
`././5
`.Uf
`00
`
`FIG.2
`(0/ (000 47/
`
`0/
`02
`03
`04
`05
`00
`0/
`00
`0.9
`/0
`/ /
`I!
`/J
`/4
`/.5
`/0
`ll
`/0
`/.9
`00
`2/
`Z?
`03
`04
`05
`00
`("7
`20
`2.9
`50
`3/
`30
`33
`30
`3.5
`30
`3/
`00
`3.9
`40
`4/
`40
`43
`44
`45
`40
`4/
`
`Page 3 0f 35
`
`Page 3 of 35
`
`

`

`US. Patent
`
`Sep.19, 1939
`
`Sheet 3 of 13
`
`4,868,750
`
`46'
`4.9
`II
`5/
`.52
`53
`54
`55
`56‘
`67
`II
`5.9
`II
`67
`6‘2
`6‘1
`6‘4
`6'5
`6'6‘
`6'7
`II
`629
`7:9
`7/
`I?
`71
`74
`7.5
`76‘
`7/
`II
`7.9
`II
`I/
`I2
`I!
`I4
`I5
`II
`If
`II
`II
`II
`.9/
`.92
`.93
`.94
`
`II
`m‘
`
`m m
`
`:
`II]
`If
`If!
`
`m m
`
`:
`
`II
`we:
`ms
`m:
`II
`II
`
`mm M
`
`:
`
`m P
`
`( I )
`S/IIIMI III/III IIII - I154“ fIIl/ {III/I41 MI)
`PISSISS/Vf 5/I6‘IMI IIWIII IIII (III-III/IAZ MI}
`Ill/III III/III IIII (III/IA! MI} (I)
`IIISISS/Vf IZIIAZ III/III IIII/IIf-III/IIZ MI)
`III-PISSISS/Vf IIIIII IIII-IIIf fIII {III/Ill MI} (I)
`S/IIIMI III/”[1? IIII ‘ IASf [III (III/IA! 7.467 ( I )
`IISSESS/Vf S/IIIMI IIIIII IIII {PIP III/Ill M9}
`I1IIIZ PIIIEI IIII (III/Ill MI} (I )
`IIISfSS/Vf IIIIAZ IIIIEI III/I lIIf-III/IAZ 1346‘}
`S/IIIMI AIVIII/AZ IIII - I155 fIII {III/IA! MI} (I )
`PISSESS/Vf S/IIIUI lIVfII/IZ IIII fIIf-III/IAZ HI}
`III/Ml IIVfII/Il IIII (III/Ill MI) (I)
`PISSfSS/Vf PlII/IZ AIVIII/ll IIII (III— III/Ill MI}
`III/Ill IIIIfI (IIF- III/Ill MI)
`III/Ill IIIIIII {III/IA! HI} (I)
`FISSfé'S/Vf III/Ill IIIIIII fIIf-III/IAI MI}
`IISIISS/Vf PIIIIII {III'III/IIZ MI)
`SIIIII FI556'55/Vf IIIIIII (III/Ill MI) (I)
`S/IIIMI Ifth'llVf IIIIIII {III/IA! MI) (I)
`Ill/IA! IfflfI/Vf IIIIIII {III/Ill 516‘) (I)
`IIJfIf/Vt’ IEISIIIZ PIIIIII (III/I211 546') (I)
`fI/II IEISII III/IAf/Vf FfISIIIl IIIIIIII’III/IIZ MI (I )
`III - l'I/II IIIIII III/II/VVEIEISIIAZ IIIIIII (III/Ill 7346') (I)
`III -II55[55/V[ IfISII/l IIIIIII/III/IIZ MI) (I)
`III! /F/fI {III - III/Ill 1546')
`IISf- IIIl/f/[I (FIE-III/IAZ MI}
`AIVfII MIVFII/Il MI)
`IIIPAIAf/Vf AIVfII MIVfII/AZ 176')
`.S'IIIIMf/Vf lIVfII (AIVEII/Il MI}
`III/Ill AIVEII MIVfII/AZ 546'}
`IIVfII/IAI/VIZ f MIVIII/Il 7546')
`/If/I//'/VAZ 'l'I' (VIII/H 1716‘}
`(III)
`[IIZIIAf/II fffIffIf/IZ 7246'}
`(f)
`VIII I4.” ffISE fIII {VIII/11 546')
`VIII IIISIII' MINI/III (VIII/$1 546‘} (I)
`VIII /If/I/f/Vf-I15[ FIII (VIII/fl MI) (I)
`VIII ”If IAIf/I/Plf foIII[ III)
`(I)
`VfIIIII-fI/II PfISII 5/I6‘IMI, PIESIIT ffISf {VIII/fl MI) (I)
`VIII fI/II-IIISII S/IIIMI, PIESEII' ffIII (VIII/II 1146‘} (I)
`[VI - IfffII/IEI (PIf-III/IAZ 1346'}
`IIISIIAZ WI -IIIIIII {III-III/IAZ M67
`!
`W0
`IIJfIf/Vf [VI - IIIIIII (III/Ill 136‘} (I >
`W5
`III/IAf/Vtr [VI- IIIIIII (III/III MI} ( I )
`1m
`II - III! lf/[I {III-III/IAZ 5467
`um
`II— AIVEII MIVEII/At 716'}
`I!
`~IAI.S'[I /IffII/Il ' II 176‘ ASS/IIII-
`[I]! [II-If-f/Zf IIIIZ'I
`
`PZS
`PM
`IFS
`PISS
`
`M o
`
`z
`01/”
`II
`Mk
`III
`II
`17/?
`
`m w
`
`/ m m m m m V
`
`IZ
`
`W W
`
`FIG. 2
`MI ”MI .94)
`
`Page 4 0f 35
`
`Page 4 of 35
`
`

`

`US. Patent
`
`Sep. 19,1989
`
`Sheet 4 of 13
`
`4,868,750
`
`1/
`
`1
`(7
`Mark
`abandon/near
`Mose
`Mose/7190f
`Mashed
`is
`MM
`
`flack/n;
`
`[Ml/P159 0F ”Mi/Ml ”AM/fl ”MT/MARY ifMMS
`
`I26
`
`150
`
`[14
`
`.
`4f
`M
`
`1W
`
`I/
`
`[7/
`
`I/
`
`V/
`
`V/
`
`JJ
`6'5 M’ 01 M
`JJ [73
`
`FIG. 3A
`
`|/
`Mia/.1
`(rd/Ms
`MM;
`am
`are
`my
`arisen
`arose
`ofe
`wake
`woke
`aim/rel;
`baa/Ill
`MM/as
`bale
`base
`bases
`basis
`to
`tear
`
`[Ml/Pl [S W '[fl’f/Df/M ' Ill/I MUM/VARY ATM/905
`
`I50
`
`125
`IV/S
`Mp
`My
`
`|34
`(M: mus/mm)
`mam/bu)
`(MS: M/‘M/
`it'll (Ire/“x be}
`if/P {Ire/I be]
`{Vi/ -/: arose/en's”)
`{Vi/4’: an'sa/
`WM,- m’se/
`/V1903 MN
`./J M (1/49/- /: awake /mmken)
`(V50: awake}
`/ V5175 wake/
`[If/5I Mail/us/
`(W: Mai/fl)
`(1491’.- fi/W
`MW,- Msas J
`(MS: base/Mm /
`MI.- Mses /
`if/ {Ire f: is)
`{Vfl/ - / .~ tare/Dome}
`
`FIG. 33
`
`law
`
`[w
`.5
`5
`5
`5‘
`5
`5
`.5
`J
`5
`I
`J
`5
`J
`5
`5
`5
`5
`I
`5
`.3“
`
`Page 5 0f 35
`
`Page 5 of 35
`
`

`

`US. Patent
`
`Sep.19,1989
`
`Sheet 5 of 13
`
`4,868,750
`
`[Ml/HES 0f ”Ml/I'Mé‘f/flfl " Ill/4’ 0/6‘f/0M4/P)’ flfd‘fl/WS
`
`|35
`
`|45
`
`L55
`
`I55
`
`[if/97‘ ”'
`
`If! 7‘ *
`
`WP7‘ 3'
`
`//VZ 7‘ '
`
`l/
`'//'s
`m '/
`are” '/
`M” '/
`can/ml
`call/d 'w
`coy/d0 ‘7‘
`d/‘do ‘/
`does/7 '/
`Jan '/
`
`125
`PPS 7‘ :91"!
`iii/7‘ "
`if}? 7‘ *
`74/0 7‘ *
`74/0 7‘ *
`[/0 7‘ //V/
`AM 7‘ "‘
`000 7‘ *
`M] 7‘ *
`MP 7‘ '*
`
`W
`f
`7‘
`7‘
`7‘
`7‘
`7‘
`7‘
`7‘
`7‘
`7‘
`
`FIG. 3C
`
`”JANA/WW A/V/VflfA/Ml 07" 555/ 55175765
`
`1 ac. .-
`[-
`2‘
`3:
`4:
`i'
`6':
`
`K
`J:
`.9:
`/0.'
`//.-
`/Z-
`Us
`/4'
`/.5:
`/6’.‘
`/Z~
`
`Ila/d:
`film
`MN:
`lo
`55W
`like
`new
`
`me/ropa/i/a/I
`[an
`all
`/)/‘3
`clever/j
`framed
`and
`br/Y/Mnf/y
`p/I/myed
`para/fears
`
`fay S/r/fly .'
`I,”
`MS {if . Ira/W V5! (-7”)
`M’ M
`149/ 1497"
`AI
`/J
`
`J./
`M"
`A” 01 M
`PP! PP”
`M
`VM (5)“ = I‘M/0) V50 ( =30
`('6’
`W I?!
`JJ
`IVA/5 {if wanna/J
`
`FIG.5
`
`Page 6 0f 35
`
`Page 6 of 35
`
`

`

`U.S. Patent Sep.19, 1989
`
`Sheet 6 of 13
`
`4,868,750
`
`(ass
`AA
`AA
`Af
`A.
`(A
`A)
`(S
`f
`Af
`A.
`AW
`AW
`W
`Af
`f
`a
`AWAP
`t
`Af
`Ap
`A.
`A.
`AP
`AA
`J)
`th
`WA
`
`A
`W
`WP
`Y
`
`(Asy (AAA/07 AASS AAAAAS A27 A65
`
`Aescription
`are-qualifiers and pre-quantifiers
`pos? deferminer
`arti?e
`forms of the verb "?o be
`coordinating confunction
`carolina/ number
`subordinating confunction
`forms of fle verb 'to do
`defarminers
`aris?enfia/ "fiere
`forms offe verb "to have
`areposition
`advectives
`meda/aurifiafy
`inf fifts
`Aroper nouns
`afe?t/af nouns
`ordina/ number
`Mania/Aronours
`aersona/Aronouns
`qualifiers
`adverbs
`nomina/adverb
`adverbaartic/e
`an/int/val/ "f
`ere/anaffon
`verbs
`wh-deferminer
`Aersona/ wh-pronouns
`wh-qualifier
`wh-dive/a
`none of fle above (60s-inferna/ only/
`
`AAG. 4
`
`Joys
`
`f
`f
`
`?
`A
`f
`A.
`4
`f
`A.
`/
`4
`f
`5
`4
`4.
`f
`2
`A.
`2
`
`f
`f
`f
`f
`A.
`f
`y
`f
`A
`f
`
`Page 7 of 35
`
`

`

`US. Patent
`
`Sep.l9, 1989
`
`Sheet 7 of 13
`
`4,868,750
`
`”If-[IN
`SHAW
`
`
`
`
`
`
`
`M’l’flf
`
`fiPA’fSS/M’
`
`MMl7; £451?
`
`[[55 f/I’M’J ”ll/[5
`
`
`5f/9/P A 50/771
`
` Sllff/I
`
`
`HM”
`
`
`SM/PPM/é'
`fli‘Wf55/017
`
`
`ffl/[fl f/i/r’t’f
`
`f/l/fS .7
`
`SHIV/X-
`
`SIR/Him
`[IPRfSS/fll M’
`
`[IF/91".5'5/017
`
`Ml’ I’M/W
`”If/1i”?
`
`
`
`
`
`
`
`
`6Y7 MAM/AHMZ
`J Wflfé‘flm’ll
`07/? MW!
`04/745451."
`
`
`
`
`flW'Pflf MAW/Af/MZ
`a’ Mflfé‘f/W’AZ
`M’fMfl/AT/flfl
`
`
`
`(fl/[[67
`WEN/AW
`
`
`
`
`
`5- ”0/905
`SPH'MZ 04,451”
`I‘d/$7659 flé'flflflf/IVIS
`
`
`
`FIG. 6'
`
`Page 8 0f 35
`
`Page 8 of 35
`
`

`

`US. Patent
`
`Sep.19,1989
`
`Sheet 8 of 13
`
`4,868,750
`
`5157' SE/W PM m
`
`554’ A’flflf
`
`
`l/W/Ml/[f
`
`
`flPfl/flff/r’s
`l VAR/1451A?
`
`6‘/
`
`
`
`lliflfflfflf
`MI? 1 V1 191’ / #00
`61/1? 546'
`M 5/7/15
`
`65
`
`/5
`
`
`MZI’ M6
`
`
`”PASS
`
`: AflVi/r’i 0/?
`
`
`ATM?”
`
`6‘6‘
`
`FM’ST
`
`
`iffff 046'
`
`546‘ 01‘ lM/M'S
`
`WWI/7295 5,46“!
`SUP/4’6 - I] M
`
`
`
`= SIM/W 546‘
`
`
`
`
`
`or MN m /,
`517 ”PF 0/ Wflfiflf
`
`W030
`
`
`[MAW 'fll/MZ/f 546'
`
`HAWK/f 546'
`
`
`567’ -— ma 4
`
`7
`
`
`or PM or > /
`
`mm mm my
`.90
`moms m
`
`EMMY/[fl
`
`
`619
`Pflflé‘ff5/176
`
`
`flflflf/A/[S
`
`FIG. 7
`
`Page 9 0f 35
`
`Page 9 of 35
`
`

`

`US. Patent
`
`Sheet 9 of 13
`Sep. 19,1989
`7/
`5ff I'H/P PMS
`
`
`NZ [0
`
`
`SEW/'0!r
`
`
`000:! PREV
`0/5/7003
`
`
`4,868,750
`
`/0f/W//‘}’
`005% 7,40
`
`71
`
`
`
`74
`
`/7”
`
`
`#7 00/?0
`
`MAM-
`
`
`M0 3% P/PV
`
`0 f/Md‘f
`
`
`
`
`MAMA/Y
`
`
`0 f 7M: ’P/n/ /
`1/017. 57
`
`
`MM MIA/Mi
`
`A’Ef0 f0
`
`
`All 004%
`
`
`”Mi
`0 00055 .7
`
`7.5
`
`
`
`76‘
`
` 0004/1"
`
`fill? / 07/?
`
`
`fill/P? Pf)?
`
`
`f0 VAN/f 0/"
`01/0! P0/A’ff0
`
`
`
`Page 10 0f 35
`
`Page 10 of 35
`
`

`

`U.S. Patent
`Alif A
`
`Sep. 19, 1989
`
`Sheet 10 of 13
`
`4,868,750
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`SAJ
`A/AS/A2/A
`- WA
`
`SAJ
`Sps as (7
`
`A/S/
`(WAA D
`JAAA/S/
`2
`
`
`
`
`
`SAJ
`SAs a
`Apassa
`AfA. Ap
`
`45
`
`SWAA
`S/A
`
`(A
`
`Affif
`JMJA - Af
`SAJ Aff a
`AJA/S-AA
`
`
`
`
`
`Page 11 of 35
`
`

`

`US. Patent
`
`Sep. 19, 1989
`
`Sheet 11 of 13
`
`4,868,750
`
`1 II |II I | | I | I I g III l II I Il| |I I I| II I
`
` .9/
`0fl'0/7f 04/)!
`
`5:7 fH/P / PM
`
`
`10 5/7407 0"
`00005770/5 000/?Z /5f
`
`
`I III I I
`
`I
`I
`I
`I
`
`IIIII Il| |II
`
`
`
`—6'f/'0000[07 7/10
`0001/?
`0!" W000 I710 5/0/00
`0 [0/100
`
`
`.527 754/0 2 0/7?
`A00 5f00f
`
`f0 SDI/W 0/"
`/0 000/?[0f
`001‘"W005 0/5/7005
`0/5 0001’
`
`[/51'
`
`
`
`MAI/WIT
`0/1157? 13405 We
`
`05/00
`MID/l
` All
`M05 M’ ”0005'
`
`
`D40 5/0/00
`
`
` 0/0/05' 0/
`
`000057551?
`WV 0/V/50r?
`
`
`
`
`9’6
`
`001/701 )’ 0”
`m
`gar/mm 2pm
`
`mm M’
`0/5 mm
`
`
`Mil/M05 0/5 W!
`Sffl/r’f ”F 0/5/74!”
`
`/5f 0000555 -
`0V 00001-7”
`.
`0/5 0001"
`
`
`
`#1100”?
`95
`mm was
`Sill-'57
`
`
`f0/, 3
`
`-0/’0/1/'[
`00055
`
`ff/lP/Pf/F
` 754/02 0/7?
`
`Page 12 0f 35
`
`Page 12 of 35
`
`

`

`U.S. Patent Sep. 19, 1989
`
`Sheet 12 of 13
`
`4,868,750
`
`
`
`
`
`
`
`
`
`AA) A JAOAS
`AWA/AWA JAA/
`Affif AYA/SJAWA
`
`AAA
`JAAS AP
`W go:
`
`A WAS
`SAPJ
`WWA A/S, JAAAY
`f/SWAWAA
`
`)
`
`A/
`
`(gous
`
`A/WAAAYA
`A/S/
`
`
`
`A/5
`
`
`
`
`
`SAOS/A
`JA 3 (AAS //
`SA
`(AAS
`
`Page 13 of 35
`
`

`

`U.S. Patent Sep.19, 1989
`
`Sheet 13 of 13
`
`4,868,750
`
`A2)
`fAry
`Aft/A/104 -) i? - - - - - - - - - - - - - - - - -
`AWA/SAA/6/6/AW/(W
`WA
`AAAA)
`SAA/AWA
`AM/AAAY
`JYAAAA/AOS
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`AAAA
`Af0071A
`
`AM
`AZA (AAVAff
`JAA, SAWA WA
`AMAS
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`F/G, W2
`
`A2
`
`AMSA WMS
`JA6 (PASMA
`SAWWAAA
`
`/44
`(AAAAA/
`
`- - - - - - - - - - - - - - - - - - -
`/A
`A7
`(SAA
`A///04AAPY
`
`A97
`
`S-100S
`
`2007
`A(AAAA. JAA
`/A/(A/ )
`
`Page 14 of 35
`
`

`

`1.
`
`COLLOCATIONAL GRAMMAR SYSTEM
`
`The present invention relates to automated language
`analysis systems, and relates to such systems embodied
`in a computer for receiving digitally encoded textcom
`posed in a natural language, and using a stored dictio
`nary of words and an analysis program to analyze the
`encoded text and to identify errors. In particular, it
`relates to systems for the grammatical analysis of en
`10
`coded text.
`In recent years a number of systems have been devel
`oped for the automated recognition of syntactic infor
`mation. A survey of some systems appears in the text
`book of Winograd, Language as a Cognitive Proces
`s-Syntax (ISBN 0-201-08571-2 v. 1) at pages 357-361
`and pages 390-403. As a rule, although a number of
`theoretical linguistic formalisms have been developed
`to identify correct grammatical constructions, the prac
`tical construction of grammatical analyzing devices has
`proven difficult. Because the number of combinations of
`possible parts of speech for a string of words escalates
`exponentially with string length, syntax-recognizing
`systems have in general been limited to operating on
`text having a small, well-defined vocabulary, or to oper
`25
`ating on more general text but dealing with a limited
`range of syntactic features. Extensions of either vocabu
`lary or syntactic range require increasingly complex
`structures and an increasing number of special recogni
`tion rules, which would make a system large or too
`30
`unwieldy for commercial implementation on commonly
`available computing systems. Moreover, the automated
`grammatical systems which have been designed are
`special processors, in that they are not adapted to con
`ventional word processing or computer-aided publish
`35
`ing functions. For example, such systems may require
`that their input text be at least sufficiently pre-edited so
`that it is both correctly spelled and grammatically well
`formed. A misspelling, a wrong word such as a hom
`onym, a compound word, or even a simple syntax error
`40
`may render an input sentence unanalyzable.
`OBJECTS OF THE INVENTION
`It is an object of the present invention to provide an
`improved device for the grammatical analysis of digi
`45
`tally encoded natural language text.
`It is another object of the invention to provide a
`digital text analyzer for assigning tags to each word of
`a digitally encoded text indicative of syntactic or inflec
`tional features of the word.
`50
`It is a further object of the invention to provide a
`grammatical analyzer for encoded text which identifies
`the most probable tags of words of a sentence based
`upon collocation probabilities of their occurrence with
`adjacent tags.
`55
`It is a further object of the invention to provide a
`grammatical analyser which accepts as an input uned
`ited text material having misspellings and vocabulary
`eOS
`These and other features of the invention are ob
`60
`tained in an apparatus for the grammatical annotation of
`digitally encoded text material, preferably including a
`stored dictionary wherein each entry represents a word
`together with tags indicative of possible syntactic and
`inflectional features of the word. A sentence of digitally
`65
`encoded text is passed to the grammatical annotator,
`which first operates on the words of the sentence to
`annotate each word with a sequence of possible tags for
`
`15
`
`4,868,750
`2
`the word, and next operates on strings of tags of adja
`cent words to determine the probable tags, in order of
`likelihood, for each word.
`This produces a "disambiguated' tag set which iden
`tifies a most probable tag assignment, for each word of
`a string of words, and one or more next most likely tag
`assignments. The disambiguated tag set serves as an
`input to a grammar processor which in a preferred
`embodiment uses the tags to identify basic grammatical
`units such as noun phrases and simple predicates, and
`processes these units to determine the parse of the sen
`tence.
`Preferably, the stored dictionary of words includes
`data codes representative of features such as gender and
`number, requiring agreement among words, and this
`information is used to select proper constructions dur
`ing processing. The system preferably also includes a
`morphological analyzer, which uses prefixes, suffixes
`and other structural attributes of words to recognize
`certain classes of words which are not in the stored
`dictionary. For such a word, the analyser then creates a
`dictionary entry with appropriate tags so that grammat
`ical processing proceeds as though the word were in the
`database.
`More specifically, the grammatical analyzer anno
`tates the words of a sentence of text with grammatical
`tags and inflectional features of the word using one or
`more of the above techniques. Each string of multiply
`tagged words between two unambiguously-tagged
`words is then analyzed by a disambiguation sub-system
`which applies a collocational probability matrix to adja
`cent pairs of tags to iteratively construct a probability
`like measure and to determine a most probable tag
`string corresponding to the string of words. Candidate
`tag strings of lesser probability are stacked for use if a
`later processing step eliminates the "most probable' tag
`string. This results in a "disambiguated' sentence struc
`ture in which one or more likely tags are identified for
`each word of the sentence.
`In a preferred implementation, the probability-like
`measure is iteratively defined on generations of succes
`sively longer tag strings corresponding to sequences of
`words. Nodes which generate strings of lesser probabil
`ity are pruned from the calculation as it proceeds, so
`that only a handful of potentially thousands of tag
`strings need be processed.
`In a further embodiment of the invention, the values
`assigned by the collocation matrix are further modified,
`for tags of particular words appearing in a reduced tag
`probability database, in accordance with a table of re
`duced probabilities. In a further preferred embodiment,
`when a word of the string appears in another database,
`called the "commonly confused word' database, an
`augmented set of tag strings is created by substituting
`tags corresponding to a correlated word, and the substi
`tuted tag strings are collocationally evaluated as candi
`dates for the most probable tag string. In a further em
`bodiment, the tag strings selected in one of the forego
`ing operations are also checked against templates repre
`sentative of erroneous or rare parses to detect common
`errors. When a sentence has been annotated with tags
`and a most probable parse identified, the annotated
`sentence is then parsed by a parsing component which
`determines a parse of the whole sentence.
`The parsing component of a prototype system oper
`ates on the "most probable parse" (henceforth "MPP")
`tags assigned by the disambiguation sub-system to the
`words of a given sentence, in order to assign the higher
`
`Page 15 of 35
`
`

`

`5
`
`10
`
`15
`
`4,868,750
`3
`syntactic structure of the sentence and also to detect
`and suggest corrections for certain types of errors in the
`sentence. The parsing process preferably proceeds in
`three general phases: (a) the identification of the simplex
`noun phrases (NPs) in the sentence and, if there is more
`than one simplex NP, their combination into complex
`NPs; (b) the identification of the simplex verb groups
`(VGs) in the sentence and, if there is more than one
`simplex VG, their combination into complex VGs; and
`(c) the assigning of structure to complete sentences.
`In addition to its applications in a grammatical text
`analyzer, a disambiguator according to the invention
`includes improvements to existing types of non-gram
`mar language processors. For example, an improved
`spelling checker according to the invention includes a
`spelling checker of the type wherein each erroneously
`spelled word is identified and a list of possibly-intended
`words is displayed. Conventionally, such systems dis
`20
`play a list of words which are selected as having ap
`proximately the same spelling as the erroneously
`spelled word. An improved system according to the
`present invention includes a partial or complete gram
`matical processor which determines the local context of
`a word (i.e., its likley tags or a definite tag), and which
`selects from among the candidate replacement words so
`as to display only the possibly intended words having a
`tag compatible with the syntactic context of the mis
`spelled word.
`30
`In an improved speech recognition (or speech synthe
`sis) system embodying the invention, a disambiguation
`module or a grammatical processor differentiates pairs
`of homonyms (respectively, homographs) by probable
`syntactic context, thereby eliminating a common source
`of errors in the conversion of text-to-sound (respec
`tively, sound-to-text). Other examples are described,
`following a detailed description of a prototype embodi
`ment of a grammatical disambiguation system.
`40
`The novel features which are believed to be charac
`teristic of the invention are set forth with particularity
`in the appended claims. The invention itself, however,
`both as to its organization and method of operation,
`together with further objects and advantages thereof,
`may best be understood by reference to the following
`description taken in connection with the accompanying
`drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is a block diagram of a system according to
`the present invention;
`FIG. 2 is a listing of system tags in an illustrative
`embodiment;
`55
`FIGS. 3A, 3B, 3C are samples of dictionary records;
`FIG. 4 is a listing of major classes of tags with corre
`sponding structural processing group codes;
`FIG. 5 is a representative text sentence annotated
`with its dictionary tags;
`FIG. 6 is a flow chart of a word tag annotation pro
`cessor;
`FIGS. 7-10 are flow charts detailing operation of the
`collocational disambiguation processing;
`FIG. 11 shows the processing of a general grammati
`65
`cal analyser operative on disambiguated text; and
`FIGS. 12-13 shows further details of preferred text
`word annotation processing.
`
`4.
`DETALED DESCRIPTION OF THE
`DRAWINGS
`FIG. 1 shows a block diagram of a grammatical ana
`lyzer according to the present invention having a
`CPU/controller 2 which may, for example, be a general
`purpose computer such as a micro- or mini-computer.
`The computer receives input text 4, e.g., from keyboard
`entry, a communications link, or a data storage device,
`and, if necessary, runs a sentence splitter 6 which parti
`tions the text into sentences for grammatical analysis.
`Alternatively, the system may receive as input discrete
`sentences of text or encoded text with sentence bound
`ary markers already inserted. Sentence splitting perse is
`known in the art, and is used, for example, in commer
`cially available systems for deriving word-per-sentence
`and similar statistical information in computerized read
`ability analysis systems. A suitable sentence splitter is
`disclosed in the copending patent application of Henry
`Kucera, Rachael Sokolowski and Jacqueline Russom
`filed June 6, 1986 as Ser. No. 872,094, entitled Method
`and Apparatus for Text Analysis, now issued as U.S.
`Pat. No. 4,773,009, which application is hereby incorpo
`rated by reference and made a part hereof.
`The controller 2 then passes each sentence to a gram
`matical analyzer 10 which annotates each word of the
`sentence, by reference to a stored word dictionary 8,
`and preferably several special databases or tables 7, as
`discussed further below, so as to produce an annotated
`sentence structure. The annotated sentence, or partial
`parses thereof and error messages or "prompts' are
`displayed on display 9.
`According to one aspect of the invention, the dictio
`nary includes a record for each word which contains a
`list of "tags', each tag encoding a syntactic or inflec
`tional property of the word, and which also includes a
`listing of special features used in the grammatical pro
`cessing.
`The processor annotates the sentence with this infor
`mation. It then utilizes the stored information to per
`form two, roughly sequential, operations on the anno
`tated sentence structure. First, a collocational tag
`disambiguation processor 10a applies an empirically
`compiled probability-like function defined on adjacent
`pairs of syntactic tags to determine a unique sequence of
`tags (one for each word) corresponding to the most
`probable parse of each ambiguously-annotated word in
`the sentence. The disambiguation processor also identi
`fies alternative tags of relatively high probability. Next,
`a grammatical processing module 10b operates on the
`identified tags to develop a parse of the sentence.
`A prototype text annotator embodiment was created
`having a main dictionary with 28,223 80-byte records,
`each record containing the complete grammatical infor
`mation for a given "word' which is either a base form
`or an irregularly inflected form. These records were of
`three types, marked by a record type-code in column 80
`to identify the types as "normal' (column 80 blank),
`"exception" ("S" in column 80) or "contraction" ("+"
`in column 80). Normal records correspond to the words
`with non-merged tags and (if they are nouns or verbs)
`regular inflections; exception records correspond to the
`words with non-merged tags that are members of an
`irregular (noun or verb) paradigm (these words may
`also be members of regular paradigms or uninflectable
`tag groups); and contraction records correspond to the
`words with merged tags (that is, tags that contain a
`
`35
`
`45
`
`50
`
`Page 16 of 35
`
`

`

`5
`
`15
`
`4,868,750
`6
`5
`Linguistic Information', now issued as U.S. Pat. No.
`"--", indicating that the corresponding word is a con
`4,724,523, of inventor Henry Kucera, which application
`traction of some type).
`is hereby incorporated by reference. Its operation is
`FIG. 2 is a listing of the tags used in the prototype
`further described below, by way of completeness, in
`embodiment, each of which is represented in the draw
`ing by a one to three character mnemonic and also by a
`connection with FIG. 6.
`In compiling the dictionary, if an inflectional variant
`one to two digit tag code. There are ninety-two such
`tags, although any given text word will generally have
`is a base form in its own right, it is listed separately in
`the database with the appropriate code for this usage.
`between one and six possible tags. Each tag indicates a
`For example, "backing' is stored as a noun of inflec
`possible syntactic use of the word, or an inflection. The
`tional class one, denoted N1, representing the paradigm
`dictionary records may also include, for nouns and
`10
`backing, backing's, backings, backings'). This dictio
`verbs, certain information encoding word features such
`nary entry is in addition to its inflectional usage as the
`as its number agreement behavior.
`present participle of the verb "to back" which would
`FIGS. 3A-3C show examples illustrating the format
`of the normal, exception and contraction records of the
`be recovered by inflection from the base form "back'
`prototype dictionary discussed above. The records each
`discussed above.
`FIG. 3B shows examples of exception records. These
`include the retrieval form of the main dictionary entry,
`left-justified with blank fill in columns 1-25 as field one,
`records contain elements (either base or inflected forms)
`that are members of irregular nouns or verb paradigms.
`and the record type code discussed above as the last
`In these records, the format of fields one to five are
`entry in the last field at column 80.
`FIG. 3A contains examples of "normal' main dictio
`similar to those of normal records shown in FIG. 3A,
`except that field four contains one or more substrings
`nary records. Normal records comprise approximately
`delimited by parentheses. The material between paren
`ninety-five percent of the database, and contain five
`theses identifies an irregular tag and the appropriate
`fixed-format fields, which include, in addition to fields
`base form for processing for such tag.
`one and five described above, the following.
`FIG. 3C illustrates contraction records, which lack
`Field two contains noun base form inflection code
`25
`the fields two through four of the foregoing two record
`information, if the base word has a noun form, for the
`word in field one, and occupies columns 26 through 29.
`types, and instead have a field two which contains from
`one to five merged tag representations (stored starting
`These code bits enable the construction of any desired
`in columns 26, 36, 46, 56, and 66, respectively), and
`inflection from the stored base form, by use of an inflec
`occupies columns 26 through 77. The last field, as with
`tional synthesis coding scheme discussed further below.
`the other two types of records, contains certain special
`Field three contains the verb base form inflection
`processing annotations, and occupies columns 78
`code information, if the base form has a verb form, for
`through 80; in the prototype, the only codes that occur
`the word in field one, and occupies columns 30 through
`in this field are the record type-indicating codes that
`33; these code bits compactly encode the verbal inflec
`tions corresponding to the base word.
`occur in column 80. The illustrated record for the word
`"ain't' indicates that it is a recognizable contraction
`Field four contains all other syntactic tags for the
`with a tag string consisting of the auxiliary tags corre
`word in field one, as well as any noun or verb feature
`sponding to the set of words ("an", "is", "are", "has",
`annotations, and occupies columns 34 through 77; fur
`"have"), plus the negation marker" corresponding to
`ther information concerning the feature annotations
`the morpheme “n't".
`that may appear in this field is given below in the discus
`As noted above, the main dictionary is a dictionary of
`sion of parsing and noun phrase determination.
`base form records each listing codes indicative of gram
`As noted above, noun and verb codes, if either occurs
`matical and inflectional tags and feature information.
`at all for a given word, are confined to the fields before
`Each text word is processed by an "unflection' proce
`column 34; all other tags must occur starting in or after
`dure which operates on the word to identify its base
`that column. For example, "back", the tenth word in
`form by stripping suffixes therefrom if possible to pro
`FIG. 3A, is encoded as being both a noun and a verb,
`duce a probable base form, and looking up the probable
`both of inflectional class one, yielding the paradigm
`base form in the dictionary. When the probable base
`back, back's, backs, backs' for the noun usage and
`form is found, the processor inspects inflectional codes
`back, backs, backed, backing for the verb, as well as an
`of the base form to confirm that any stripped suffixes
`adjective and an adverb (with tag codes as "JJ" and
`"RB', respectively). Although, including inflectional
`were indeed legal suffixes of the found entry. The ap
`propriate tags of the

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket