`Kucera et al.
`
`Patent Number:
`11
`(45) Date of Patent:
`
`4,868,750
`Sep. 19, 1989
`
`54 COLLOCATIONAL GRAMMAR SYSTEM
`(75) Inventors: Henry Kucera, Providence, R.I.;
`Alwin B. Carus, Newton, Mass.;
`Jeffrey G. Hopkins, Pawtucket, R.I.
`73) Assignee: Houghton Mifflin Company, Boston,
`Mass.
`21 Appl. No.: 106,127
`22 Filed:
`Oct. 7, 1987
`(51) Int. Cl. .............................................. G06F 15/02
`(52) U.S.C. ..................................... 364/419; 364/900
`58) Field of Search ................ 364/419,900 MS File,
`364/200 MS File
`
`(56)
`
`References Cited
`U.S. PATENT DOCUMENTS
`4,661,924 4/1987 Okamoto et al. .................., 364/900
`4,703,425 10/1987 Muraki ................................ 364/419
`4,724,523 2/1988 Kucera ................................ 364/419
`4,742,481 5/1988 Yoshinura ........................., 364/49
`4,747,053 5/1988 Yoshimura ..
`364/49
`4,750,122 6/1988 Kaji .........
`364/419
`4,760,528 7/1988 Levin ......
`... 364/419
`4,773,009 9a 988 Kucera ................................ 364A419
`OTHER PUBLICATIONS
`Choice of Grammatical Word-Class Without Global Syn
`tactic Analysis. Tagging Words in the Lob Corpus, Mar
`
`shall, Ian in Computers and the Humanities 17 (1983)
`139-150.
`Primary Examiner-Michael R. Fleming
`Attorney, Agent, or Firn-Lahive & Cockfield
`(57)
`ABSTRACT
`A system for the grammatical annotation of natural
`language receives natural language text and annotates
`each word with a set of tags indicative of its possible
`grammatical or syntactic uses. An empirical probability
`of collocation function defined on pairs of tags is itera
`tively extended to a selected set of tag sequences of
`increasing length so as to select a most probable tag for
`each word of a sequence of ambiguously-tagged words.
`For listed pairs of commonly confused words a substi
`tute calculation reveals erroneous use of the wrong
`word. For words with tags having abnormally low
`frequency of occurrence, a stored table of reduced
`probability factors corrects the calculation. Once the
`text words have been annotated with their most proba
`ble tags, the tagged text is parsed by a parser which
`successively applies phrasal, predicate and clausal anal
`ysis to build higher structures from the disambiguated
`tag strings. A voice/text translator including such a tag
`annotator resolves sound or spelling ambiguity of
`words by their differing tags. A database retrieval sys
`tem, such as a spelling checker, includes a tag annotator
`to identify desired data by syntactic features.
`
`17 Claims, 13 Drawing Sheets
`
`
`
`
`
`
`
`
`
`
`
`
`
`SAA// JAA/AS
`f A/ASA's
`
`A/
`A/V/AAW
`
`SOAAA
`Wy
`
`
`
`
`
`
`
`AA//
`AWWFAA
`2
`
`
`
`
`
`
`
`
`
`
`
`JAYI
`ASSAA
`A/SA/AM///
`
`Z--
`-Stiff
`self a?
`
`
`
`
`
`AAAW/AAA
`AAOMASSO?
`ASA/A/A
`ASA?
`
`Page 1 of 35
`
`GOOGLE EXHIBIT 1011
`
`
`
`
`
`US. Patent—Sep. 19, 1989 Sheet 1of 13. 4,868,750
`
`
`
`
`CPUS
`
`CONTROLLER TEXT &
`
`
`MESSAGE
`DISPLAYSOUT
`
`
`
`
`
`
`CRAMMATICAL
`PROCESSOR
`
`BISAHBIG'R
`
`PARSER
`
`
`
`
`SOURCE
`TENT
`
`___¥__
`| SENTENCE |
`1 SPLITTER
`
`6-~
`
`
`
`
`Page 2 of 35
`
`Page 2 of 35
`
`
`
`US. Patent
`
`Sep. 19, 1989
`
`Sheet 2 of 13
`
`4,868,750
`
`Of
`o2
`OF
`o4
`oF
`06
`ow
`os
`og
`
`4 f
`
`f
`12
`iy
`'f
`ey
`
`6 t
`
`e
`1g
`49
`20
`
`2 2
`
`2
`23
`24
`25
`ob
`
`a 2
`
`b
`2g
`IG
`
`i J
`
`?
`JS
`IG
`JS
`36
`I7
`58
`J9
`40
`
`4 4
`
`2
`43
`44
`95
`
`9 4
`
`7
`
`SENTENCE CLOSER (PUNCTUATION TAG)
`OPEN PARENTHESIS (PUNCTUATION TAG)
`"NOT" (ADVERBIAL TAB)
`GLOSE PARENTHESIS (PUNCTUATION TAG)
`SEM! - COLON (PUNCTUATION TAB)
`DASH (PUNCTUATION TAB)
`COMMA (PUNCTUATION TAB)
`COLON (PUNCTUATION TAB)
`PRE- QUALIFIE? (PRE- NOMINAL TAG)
`PRE QUANTIFIER (PRE-NOMINAL TAG)
`PRE- QUANTIFIER/DOUBLE CONJUNCTION (PRE-NOMINAL TAG)
`POST~ DETERMINER (PRE- NOMINAL TAG)
`ARTICLE (PRE - NOMINAL TAG)
`“WERE” (VERBAL TAG) <F >
`“WAS” (VERBAL TAG) <F >
`"BEING" (VERBAL TA) <#>
`"BE" ~ BASE FORM (VERBAL TAG) SND
`"AM" (VERBAL TAB) <F>
`"BEEN" (VERBAL TAG) <N>
`"ARE" (VERBAL TAG)<F >
`1S" (VERBAL TAB) <F >
`COORDINATING CONJUNCTION (SEATENTIAL TAG)
`CARDINAL HUMBER (PRE- NOMINAL TAG)
`SUBORDINATING CONJUNCTION (SENTENTIAL TAG)
`“OID” (VERBAL TAG) <F >
`DOING” (VERBAL TAC) CHD
`“O0"— BASE FORM (VERBAL TAG) (HD
`“DONE” (VERBAL TAG) <#>
`"00" (VERBAL TAG) <#>
`"DOES" (VERBAL ThE) <F >
`SINGULAR DETERMINER ( PRE-NOHINAL TAG)
`SINGULAR/PLURAL DETERMINER ( PRE - NOMINAL TAB)
`PLURAL DETERMINER (PRE - NOMINAL TAG)
`DETERMINER / DOUBLE CONJUNCTION (PRE- NOMINAL TAB)
`EXTISTENTIAL “THERE” (NOMINAL TAG) <P>
`“WAD”;, PAST TENSE (VERBAL TAG) <F >
`“HAVING ° (VERBAL TAG) <#>
`“HAVE” — BASE FORM (VERGAL TAG) SN >
`"HAD"; PAST PARTICIPLE (VERBAL TAB) <N>
`“HAVE” (VERBAL TAG) © HD
`“HAS” (VERBAL TAG) <f >
`PREPOSITION (SENTENTIAL TAG)
`ADJECTIVE (PRE-MOMIWAL TAG)
`COMPARATIVE ADJECTIVE (PRE- NOMINAL TAB}
`SUPERLATIVE ADJECTIVE (PRE- NOMINAL TAG)
`SEMANTIC SUPERLATIVE (PRE- NOMIWAL TAB)
`MODAL AUXHLARY (VERBAL TAG) <F >
`
`~YN
`
`is
`
`AbL
`AGH
`ABT
`AP
`AT
`8D
`BEOL
`5EG
`BE
`BEM
`SEN
`BER
`BEL
`oe
`oo
`os
`08
`008
`BO!
`DON
`DOP
`BOL
`ar
`OT
`aS
`OIx
`&f
`4VO
`HVE
`AV
`AVA
`VP
`HVE
`
`i 4 A
`
`/#
`ASS
`JST
`#0
`
`2e
`
`(Ol THRU 47)
`
`Page 3 of 35
`
`Page 3 of 35
`
`
`
`4,868,750
`
`
`US. Patent—Sep. 19, 1989 Sheet 3 of 13
`
`id
`4aS
`WAS
`WASS
`WAL
`WP
`MPS
`WPS
`NPSS
`WR
`WRS
`WAS
`WASS
`7)
`PY
`PUS
`PPS
`PPSS
`PPL
`PRLS
`PPO
`PPS
`PPSS
`PPL
`
`a a
`
`P
`&B
`ABR
`RET
`AW
`
`< HD
`SINGULAR COMMON NOUN - BASE FORM (NOMINAL TAB)
`POSSESSIVE SINGULAR COMMON NOUN (PRE-NOMINAL TAG)
`PLURAL COMMON NOUN (NOMINAL TAB) <>
`POSSESSIVE PLURAL COMMON NOUN(PRE- NOMINAL TAG)
`WON POSSESSIVE COMMON NOUN - BASE FORM (NOMINAL TAG) SND
`SINGULAR PROPER HOUN ~ BASE FORM (NOMINAL TAG) <4 >
`POSSESSIVE SINGULAR PROPER NOUN (PRE- NOMINAL TAB)
`PLURAL PROPER NOUN (NOMINAL TAG) <H>
`POSSESSIVE PLURAL PROPER NOUM (PRE-WOMINAL TAG)
`SINGULAR AOVERBIAL NOUN - BASE FORM (NOMINAL TAG) <P>
`POSSESSIVE SINGULAR AOVERBIAL NOUN (PRE-NOMINAL TAG)
`PLURAL AOVERBIAL NOUN (HOMINAL TAG) SP >
`POSSESSIVE PLURAL AOVERBIAL NOUN [PRE- NOMINAL TAG)
`ORDINAL NUMBER (PRE-WOMINAL TAG)
`WONINAL PRONOUN (WOMINAL TAG) <P>
`POSSESSIVE NOHIWAL PRONOUN [PRE-HOMIWAL TAG)
`POSSESSIVE PRONOUN (PRE-NOMINAL TAB)
`SECOND POSSESSIVE PRONOUN (NOMINAL TAG) <P>
`SINGULAR REFLEXIVE PRONOUN (NOHIWAL TAS) <P>
`PLURAL REFLEXIVE PRONOUN (NOMINAL TAG) <P>
`OBJECTIVE PERSONAL PRONOUN (WOMIWAL TAC) <P>
`THRO PERSON NOMINATIVE PERSONAL PRONOUN (WOMIWAL TAG <P >
`WON -THIRO PERSON NOMINATIVEPERSONAL PRONOUN (NOMINAL TAG) <P >
`NON - POSSESSIVE PERSONAL PRONOUN (NOMINAL TAG) <P>
`QUALIFIER (PRE -WOMINAL TAG)
`POST- QUALIFIER (PRE -NOMIWAL TAG)
`ADVERE (AOVERBIAL TAG)
`COMPARATIVE ADVERB (AOVERBIAL TAC }
`SUPERLATIVE ADVERB (AOVERBIAL TAG)
`NOHINAL ADVERB (ADVERBIAL TAG)
`AOVERB/PARTICLE (AOVERBIAL TAG)
`0
`INFIMITIVAL "TO" (VERBAL TAG}=<¢1H)>
`UH
`EXCLAMATION ( SENTENTIAL TAG }
`v0
`VERB PAST TENSE FORM (VERBAL TAG) SF>
`VBC
`VERB PRESENT PARTICIPLE (VERBAL TAG) <#>
`VEt
`VERB LAFINITIVE ~ BASE FORM (VERBAL TAG) SND
`VEN
`VERG PAST PARTICIPLE (VERBAL TAB) <>
`YEP
`VERB NON-THIRD PERSON SINGULAR, PRESENT TENSE (VERBAL TAG) <#>
`V7
`VERB THIRD-PERSON SINGULAR, PRESENT TENSE (VERBAL TAG) <F >
`wor
`WH - DETERMINER (PRE - NOMINAL TAG)
`WPS
`PERSONAL WA - PRONOUN (PRE - NOMINAL TAG)
`W0
`OBJECTIVE WH- PRONOUN (NOMINAL TAG) <P>
`WPS
`NOMINATIVE WA- PRONOUN (NOMINAL TAG) <P>
`Wol
`WH- QUALIFIER (PRE-NOHINAL TAC)
`WRB
`WA-AOVERD (ADVERBIAL TAb)
`in
`~ PARSER INTERNAL WO TAS ASSIGHED -
`“Zed
`EWD-OF- FILE MARKER
`
`P 1
`
`FIG. 2
`
`(498 THRU 94)
`
`Page 4 of 35
`
`Page 4 of 35
`
`
`
`U.S. Patent
`
`Sep. 19, 1989
`
`Sheet 4 of 13
`
`4,868,750
`
`EXAMPLES OF “NORMAL "MAIN DICTIOWARY RECORDS
`| 26
`| 30
`|34
`
`| 80
`
`AT
`RB
`
`Md
`6S IW OL RB
`Jd #8
`
`W/
`
`W/
`
`MW
`W/
`
`VI
`
`Y/
`
`FIG. 3A
`
`EXAMPLES OF “EXCEPTION” HAIN DICTIONARY RECOROS
`
`|30
`
`\26
`H/s
`NGp
`W4p
`
`|34
`(HN: edveus/adieur)
`(NAS. adfeu)
`(KAS. odteu)
`BEM (xref: be)
`BER Carel: be)
`(V8l-/. arose /arisen)
`(VEN: arise)
`(VBO: arise)
`(V80. eat)
`J/ AB VEl- 1: awoke / awoken)
`(VEO. awoke)
`(VEN. owake)
`(WHS: bocitlys)
`(NA: bacith)
`(VEX: bid)
`(WW: bases }
`(WHSbose/bosis }
`(WH: bases}
`BET frref: is)
`(V8l-/- bore/borne}
`
`FIG. 3B
`
`
`
`REAREDREEEREESHEEDEEEE
`
`|/
`
`/a a
`
`back
`abandonment
`abaose
`obosement
`aboshed
`as
`back
`backing
`
`|/
`oder
`ogreus
`adeur
`an
`ore
`arise
`orisen
`arose
`ate
`awoke
`awoke
`gwoken
`becitly
`becitlys
`bade
`base
`bases
`basis
`be
`beor
`
`Page 5 of 35
`
`Page 5 of 35
`
`
`
`U.S. Patent
`
`Sep. 19, 1989
`
`Sheet 50f13
`
`4,868,750
`
`EXAMPLES OF "CONTRACTION" MAIW OLCTIONARY RECORDS
`
`|36
`
`\46
`
`|56
`
`\66
`
`BERE™
`
`bEZt*
`
`HVP4*
`
`HZ £ *
`
`\/
`‘tS
`ain't
`aren't
`con't
`connor
`could ‘ve
`couldnt
`didn't
`doesn't
`don't
`
`\26
`PPS 4 BEL
`SEMA *
`BERS *
`WO+*
`MO4*
`MD + HY/
`MO+t *
`LODE*
`LOZ +*
`DOP +*
`
`BRERRRREAS
`
`FIG. 3C
`
`GRAMMATICAL ANNOTATION OF TEST SENTENCE
`
`Word:
`John
`wants
`to
`sel/
`Me
`new
`metropolitan
`200
`oH
`his
`cleverly
`trained
`ond
`brittrantly
`plumaged
`parakeets
`
`Tog String:
`AP
`ANS (BF = want) VBL (+ BF)
`MW TO
`V8i YEP
`AT
`Jf
`Jf
`WW
`AEN OL RB
`PPS. PPSS
`a
`VEN (BF = train) V80 (+ BF)
`ce
`GL RB
`4/
`ANS (BF = porakeet)
`
`FIG. 5
`
`BSESSBWWAGAASS
`
`SSAxs
`
`Page 6 of 35
`
`Page 6 of 35
`
`
`
`U.S. Patent Sep.19, 1989
`
`Sheet 6 of 13
`
`4,868,750
`
`(ass
`AA
`AA
`Af
`A.
`(A
`A)
`(S
`f
`Af
`A.
`AW
`AW
`W
`Af
`f
`a
`AWAP
`t
`Af
`Ap
`A.
`A.
`AP
`AA
`J)
`th
`WA
`
`A
`W
`WP
`Y
`
`(Asy (AAA/07 AASS AAAAAS A27 A65
`
`Aescription
`are-qualifiers and pre-quantifiers
`pos? deferminer
`arti?e
`forms of the verb "?o be
`coordinating confunction
`carolina/ number
`subordinating confunction
`forms of fle verb 'to do
`defarminers
`aris?enfia/ "fiere
`forms offe verb "to have
`areposition
`advectives
`meda/aurifiafy
`inf fifts
`Aroper nouns
`afe?t/af nouns
`ordina/ number
`Mania/Aronours
`aersona/Aronouns
`qualifiers
`adverbs
`nomina/adverb
`adverbaartic/e
`an/int/val/ "f
`ere/anaffon
`verbs
`wh-deferminer
`Aersona/ wh-pronouns
`wh-qualifier
`wh-dive/a
`none of fle above (60s-inferna/ only/
`
`AAG. 4
`
`Joys
`
`f
`f
`
`?
`A
`f
`A.
`4
`f
`A.
`/
`4
`f
`5
`4
`4.
`f
`2
`A.
`2
`
`f
`f
`f
`f
`A.
`f
`y
`f
`A
`f
`
`Page 7 of 35
`
`
`
`
`
`US. Patent—Sep. 19, 1989 Sheet 7 of 13 4,868,750
`
`
`
`
`
`INPUT
`EXPRESSION
`WDAA BASE
`
`LESS THAN F TIMES
`
`EXPRESSIOM
`MOT FOUND
`
`
`
`
`STRIP A SUFFIX
` SUFFIL
`
`
`FROM
`
`
`STRIPPING
`EXPRESSION
`
`
`
`TRIED THREE
`TIMES 7
`
`
`SUFFIX-
`
`STRIPPED
`
`EXPRESSION IW
`DAT,ABASE
`
`
`
`
`GET GRAMMATICAL
`INFLECTIONAL
`INFO FROM
`DATABASE
`
`QUTPUT GRAMMATICAL
`d INFLECTIONAL
`INFORMATION
`
`ONFLECT
`RETURN
`
`S- WORDS
`SPECIAL 0-BASE
`FORCED TAG ROUTINES
`
`FIG. 6
`
`Page 8 of 35
`
`Page 8 of 35
`
`€
`
`
`U.S. Patent
`
`Sep. 19, 1989
`
`Sheet 80f 13
`
`4,868,750
`
`
`SET SENT PTR 70
`SEM NODE
`
`
`INITIALIZE
`
`B POINTERS
`4 VARIABLES
`
`IHCREMENT
`CUR LVL BY 1 ADD
`CUR TAG
`10 STRING
`
`
`
`Ss
`
`
`
`OWLY TAG
`= AOVERE OR
`
`
`
`NEGATION
`
`
`
`
`
`FIRST
`
`
`RESET TAC
`TAG OF WORD'S
`
`COUNTERS BACK
`STRING = XX OR
`
`/
`= SECON TAB
`
`
`BYPASS
`
`66
`
`
`
`
`IF PREV CTI,
`
`SET MPP OF CURRENT
`
`
`WORD
`
`WORD « UNIQUE TAC
`
`UNIQUE TAG
`
`SCP = TOP =0
`
`?
`
`If PREV CT > /
`START EVALUATION
`90
`PROCESS 70
`
`
`
`BRANCHED
`
`PROCESSING
`
`ROUTINES
`
`FIG. 7
`
`Page 9 of 35
`
`Page 9 of 35
`
`
`
`Sheet 9 of 13
`
`4,868,750
`
`
`EVALUATE
`
`A(Teyp, topy
`MULT. BY
`PREV DISTANCE
`
`_
`
`Sep. 19, 1989
`ff
`
`
`SET TEMP PIRS
`142 70
`
`
`
`
`START OF
`CUR & PREV
`DIS NODES
`
`
`
`SET CUR D
`TRACE=
`
`
`TAG PRY
`2 TRACE
`
`
`
`
`IDENTIFY
`
`PREV TAG
`
`
`
`
`
`
`NEED 10
`ALLOCATE
`
`
`MORE
`
`B WODES ?
`
` UPOATE
`
`TEMP f PTR
`
`
`
`
`TEMP 2 PIR
`10 VALUE OF
`BLINK POINTER
`
`
`
`USS. Patent
`
`Page 10 of 35
`
`Page 10 of 35
`
`
`
`U.S. Patent
`Alif A
`
`Sep. 19, 1989
`
`Sheet 10 of 13
`
`4,868,750
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`SAJ
`A/AS/A2/A
`- WA
`
`SAJ
`Sps as (7
`
`A/S/
`(WAA D
`JAAA/S/
`2
`
`
`
`
`
`SAJ
`SAs a
`Apassa
`AfA. Ap
`
`45
`
`SWAA
`S/A
`
`(A
`
`Affif
`JMJA - Af
`SAJ Aff a
`AJA/S-AA
`
`
`
`
`
`Page 11 of 35
`
`
`
`
`
`US. Patent—Sep. 19, 1989 Sheet 11 0f 13. 4,868,750
`
`
`
`I
`
`
`
`
`SET TEMP / PIR
`10 START OF
`
`CURRENTIS NODELIST
`
`
`
`—GET CURRENT TAG
`CREATE
`OF WORD TAS STRING
`OTRACE
`
`
`
`SET TEMP 2 PTR
`AWD STORE
`10 START OF
`IN CURRENT
`PREVIOUSDISNODE
`YS NODE
`
`ALL
`TAGS 1 WORDS
`TAG STRING
`LROGESSED
`
`| | i|
`
`|
`1
`
`|
`
`|||iI ||| |||
`
`~ALLOCATE
`NEW NODES
`
`—UPOATE
`TEMP {PTR
`TEMP 2 PTR
`
`SELECT
`TOP 3
`NODES
`
`Page 12 of 35
`
`|i |i| |||| || |
`
`| ||| | | | | ||| || ||| i||
`
` LUST
`
`mn
`
`
`MULTIPLY BY
`ALL DIST(TEMP 2PIR|97
`
`
`NODES IV
`DIS NODE }
`
`PREVIOUS DIS NODE
`STORE UP DISTANCE
`
`IST PROCESS +
`/W CURRENT
`
`DIS NODE
`
`
`
`Page 12 of 35
`
`
`
`U.S. Patent Sep. 19, 1989
`
`Sheet 12 of 13
`
`4,868,750
`
`
`
`
`
`
`
`
`
`AA) A JAOAS
`AWA/AWA JAA/
`Affif AYA/SJAWA
`
`AAA
`JAAS AP
`W go:
`
`A WAS
`SAPJ
`WWA A/S, JAAAY
`f/SWAWAA
`
`)
`
`A/
`
`(gous
`
`A/WAAAYA
`A/S/
`
`
`
`A/5
`
`
`
`
`
`SAOS/A
`JA 3 (AAS //
`SA
`(AAS
`
`Page 13 of 35
`
`
`
`U.S. Patent Sep.19, 1989
`
`Sheet 13 of 13
`
`4,868,750
`
`A2)
`fAry
`Aft/A/104 -) i? - - - - - - - - - - - - - - - - -
`AWA/SAA/6/6/AW/(W
`WA
`AAAA)
`SAA/AWA
`AM/AAAY
`JYAAAA/AOS
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`AAAA
`Af0071A
`
`AM
`AZA (AAVAff
`JAA, SAWA WA
`AMAS
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`F/G, W2
`
`A2
`
`AMSA WMS
`JA6 (PASMA
`SAWWAAA
`
`/44
`(AAAAA/
`
`- - - - - - - - - - - - - - - - - - -
`/A
`A7
`(SAA
`A///04AAPY
`
`A97
`
`S-100S
`
`2007
`A(AAAA. JAA
`/A/(A/ )
`
`Page 14 of 35
`
`
`
`1.
`
`COLLOCATIONAL GRAMMAR SYSTEM
`
`The present invention relates to automated language
`analysis systems, and relates to such systems embodied
`in a computer for receiving digitally encoded textcom
`posed in a natural language, and using a stored dictio
`nary of words and an analysis program to analyze the
`encoded text and to identify errors. In particular, it
`relates to systems for the grammatical analysis of en
`10
`coded text.
`In recent years a number of systems have been devel
`oped for the automated recognition of syntactic infor
`mation. A survey of some systems appears in the text
`book of Winograd, Language as a Cognitive Proces
`s-Syntax (ISBN 0-201-08571-2 v. 1) at pages 357-361
`and pages 390-403. As a rule, although a number of
`theoretical linguistic formalisms have been developed
`to identify correct grammatical constructions, the prac
`tical construction of grammatical analyzing devices has
`proven difficult. Because the number of combinations of
`possible parts of speech for a string of words escalates
`exponentially with string length, syntax-recognizing
`systems have in general been limited to operating on
`text having a small, well-defined vocabulary, or to oper
`25
`ating on more general text but dealing with a limited
`range of syntactic features. Extensions of either vocabu
`lary or syntactic range require increasingly complex
`structures and an increasing number of special recogni
`tion rules, which would make a system large or too
`30
`unwieldy for commercial implementation on commonly
`available computing systems. Moreover, the automated
`grammatical systems which have been designed are
`special processors, in that they are not adapted to con
`ventional word processing or computer-aided publish
`35
`ing functions. For example, such systems may require
`that their input text be at least sufficiently pre-edited so
`that it is both correctly spelled and grammatically well
`formed. A misspelling, a wrong word such as a hom
`onym, a compound word, or even a simple syntax error
`40
`may render an input sentence unanalyzable.
`OBJECTS OF THE INVENTION
`It is an object of the present invention to provide an
`improved device for the grammatical analysis of digi
`45
`tally encoded natural language text.
`It is another object of the invention to provide a
`digital text analyzer for assigning tags to each word of
`a digitally encoded text indicative of syntactic or inflec
`tional features of the word.
`50
`It is a further object of the invention to provide a
`grammatical analyzer for encoded text which identifies
`the most probable tags of words of a sentence based
`upon collocation probabilities of their occurrence with
`adjacent tags.
`55
`It is a further object of the invention to provide a
`grammatical analyser which accepts as an input uned
`ited text material having misspellings and vocabulary
`eOS
`These and other features of the invention are ob
`60
`tained in an apparatus for the grammatical annotation of
`digitally encoded text material, preferably including a
`stored dictionary wherein each entry represents a word
`together with tags indicative of possible syntactic and
`inflectional features of the word. A sentence of digitally
`65
`encoded text is passed to the grammatical annotator,
`which first operates on the words of the sentence to
`annotate each word with a sequence of possible tags for
`
`15
`
`4,868,750
`2
`the word, and next operates on strings of tags of adja
`cent words to determine the probable tags, in order of
`likelihood, for each word.
`This produces a "disambiguated' tag set which iden
`tifies a most probable tag assignment, for each word of
`a string of words, and one or more next most likely tag
`assignments. The disambiguated tag set serves as an
`input to a grammar processor which in a preferred
`embodiment uses the tags to identify basic grammatical
`units such as noun phrases and simple predicates, and
`processes these units to determine the parse of the sen
`tence.
`Preferably, the stored dictionary of words includes
`data codes representative of features such as gender and
`number, requiring agreement among words, and this
`information is used to select proper constructions dur
`ing processing. The system preferably also includes a
`morphological analyzer, which uses prefixes, suffixes
`and other structural attributes of words to recognize
`certain classes of words which are not in the stored
`dictionary. For such a word, the analyser then creates a
`dictionary entry with appropriate tags so that grammat
`ical processing proceeds as though the word were in the
`database.
`More specifically, the grammatical analyzer anno
`tates the words of a sentence of text with grammatical
`tags and inflectional features of the word using one or
`more of the above techniques. Each string of multiply
`tagged words between two unambiguously-tagged
`words is then analyzed by a disambiguation sub-system
`which applies a collocational probability matrix to adja
`cent pairs of tags to iteratively construct a probability
`like measure and to determine a most probable tag
`string corresponding to the string of words. Candidate
`tag strings of lesser probability are stacked for use if a
`later processing step eliminates the "most probable' tag
`string. This results in a "disambiguated' sentence struc
`ture in which one or more likely tags are identified for
`each word of the sentence.
`In a preferred implementation, the probability-like
`measure is iteratively defined on generations of succes
`sively longer tag strings corresponding to sequences of
`words. Nodes which generate strings of lesser probabil
`ity are pruned from the calculation as it proceeds, so
`that only a handful of potentially thousands of tag
`strings need be processed.
`In a further embodiment of the invention, the values
`assigned by the collocation matrix are further modified,
`for tags of particular words appearing in a reduced tag
`probability database, in accordance with a table of re
`duced probabilities. In a further preferred embodiment,
`when a word of the string appears in another database,
`called the "commonly confused word' database, an
`augmented set of tag strings is created by substituting
`tags corresponding to a correlated word, and the substi
`tuted tag strings are collocationally evaluated as candi
`dates for the most probable tag string. In a further em
`bodiment, the tag strings selected in one of the forego
`ing operations are also checked against templates repre
`sentative of erroneous or rare parses to detect common
`errors. When a sentence has been annotated with tags
`and a most probable parse identified, the annotated
`sentence is then parsed by a parsing component which
`determines a parse of the whole sentence.
`The parsing component of a prototype system oper
`ates on the "most probable parse" (henceforth "MPP")
`tags assigned by the disambiguation sub-system to the
`words of a given sentence, in order to assign the higher
`
`Page 15 of 35
`
`
`
`5
`
`10
`
`15
`
`4,868,750
`3
`syntactic structure of the sentence and also to detect
`and suggest corrections for certain types of errors in the
`sentence. The parsing process preferably proceeds in
`three general phases: (a) the identification of the simplex
`noun phrases (NPs) in the sentence and, if there is more
`than one simplex NP, their combination into complex
`NPs; (b) the identification of the simplex verb groups
`(VGs) in the sentence and, if there is more than one
`simplex VG, their combination into complex VGs; and
`(c) the assigning of structure to complete sentences.
`In addition to its applications in a grammatical text
`analyzer, a disambiguator according to the invention
`includes improvements to existing types of non-gram
`mar language processors. For example, an improved
`spelling checker according to the invention includes a
`spelling checker of the type wherein each erroneously
`spelled word is identified and a list of possibly-intended
`words is displayed. Conventionally, such systems dis
`20
`play a list of words which are selected as having ap
`proximately the same spelling as the erroneously
`spelled word. An improved system according to the
`present invention includes a partial or complete gram
`matical processor which determines the local context of
`a word (i.e., its likley tags or a definite tag), and which
`selects from among the candidate replacement words so
`as to display only the possibly intended words having a
`tag compatible with the syntactic context of the mis
`spelled word.
`30
`In an improved speech recognition (or speech synthe
`sis) system embodying the invention, a disambiguation
`module or a grammatical processor differentiates pairs
`of homonyms (respectively, homographs) by probable
`syntactic context, thereby eliminating a common source
`of errors in the conversion of text-to-sound (respec
`tively, sound-to-text). Other examples are described,
`following a detailed description of a prototype embodi
`ment of a grammatical disambiguation system.
`40
`The novel features which are believed to be charac
`teristic of the invention are set forth with particularity
`in the appended claims. The invention itself, however,
`both as to its organization and method of operation,
`together with further objects and advantages thereof,
`may best be understood by reference to the following
`description taken in connection with the accompanying
`drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`FIG. 1 is a block diagram of a system according to
`the present invention;
`FIG. 2 is a listing of system tags in an illustrative
`embodiment;
`55
`FIGS. 3A, 3B, 3C are samples of dictionary records;
`FIG. 4 is a listing of major classes of tags with corre
`sponding structural processing group codes;
`FIG. 5 is a representative text sentence annotated
`with its dictionary tags;
`FIG. 6 is a flow chart of a word tag annotation pro
`cessor;
`FIGS. 7-10 are flow charts detailing operation of the
`collocational disambiguation processing;
`FIG. 11 shows the processing of a general grammati
`65
`cal analyser operative on disambiguated text; and
`FIGS. 12-13 shows further details of preferred text
`word annotation processing.
`
`4.
`DETALED DESCRIPTION OF THE
`DRAWINGS
`FIG. 1 shows a block diagram of a grammatical ana
`lyzer according to the present invention having a
`CPU/controller 2 which may, for example, be a general
`purpose computer such as a micro- or mini-computer.
`The computer receives input text 4, e.g., from keyboard
`entry, a communications link, or a data storage device,
`and, if necessary, runs a sentence splitter 6 which parti
`tions the text into sentences for grammatical analysis.
`Alternatively, the system may receive as input discrete
`sentences of text or encoded text with sentence bound
`ary markers already inserted. Sentence splitting perse is
`known in the art, and is used, for example, in commer
`cially available systems for deriving word-per-sentence
`and similar statistical information in computerized read
`ability analysis systems. A suitable sentence splitter is
`disclosed in the copending patent application of Henry
`Kucera, Rachael Sokolowski and Jacqueline Russom
`filed June 6, 1986 as Ser. No. 872,094, entitled Method
`and Apparatus for Text Analysis, now issued as U.S.
`Pat. No. 4,773,009, which application is hereby incorpo
`rated by reference and made a part hereof.
`The controller 2 then passes each sentence to a gram
`matical analyzer 10 which annotates each word of the
`sentence, by reference to a stored word dictionary 8,
`and preferably several special databases or tables 7, as
`discussed further below, so as to produce an annotated
`sentence structure. The annotated sentence, or partial
`parses thereof and error messages or "prompts' are
`displayed on display 9.
`According to one aspect of the invention, the dictio
`nary includes a record for each word which contains a
`list of "tags', each tag encoding a syntactic or inflec
`tional property of the word, and which also includes a
`listing of special features used in the grammatical pro
`cessing.
`The processor annotates the sentence with this infor
`mation. It then utilizes the stored information to per
`form two, roughly sequential, operations on the anno
`tated sentence structure. First, a collocational tag
`disambiguation processor 10a applies an empirically
`compiled probability-like function defined on adjacent
`pairs of syntactic tags to determine a unique sequence of
`tags (one for each word) corresponding to the most
`probable parse of each ambiguously-annotated word in
`the sentence. The disambiguation processor also identi
`fies alternative tags of relatively high probability. Next,
`a grammatical processing module 10b operates on the
`identified tags to develop a parse of the sentence.
`A prototype text annotator embodiment was created
`having a main dictionary with 28,223 80-byte records,
`each record containing the complete grammatical infor
`mation for a given "word' which is either a base form
`or an irregularly inflected form. These records were of
`three types, marked by a record type-code in column 80
`to identify the types as "normal' (column 80 blank),
`"exception" ("S" in column 80) or "contraction" ("+"
`in column 80). Normal records correspond to the words
`with non-merged tags and (if they are nouns or verbs)
`regular inflections; exception records correspond to the
`words with non-merged tags that are members of an
`irregular (noun or verb) paradigm (these words may
`also be members of regular paradigms or uninflectable
`tag groups); and contraction records correspond to the
`words with merged tags (that is, tags that contain a
`
`35
`
`45
`
`50
`
`Page 16 of 35
`
`
`
`5
`
`15
`
`4,868,750
`6
`5
`Linguistic Information', now issued as U.S. Pat. No.
`"--", indicating that the corresponding word is a con
`4,724,523, of inventor Henry Kucera, which application
`traction of some type).
`is hereby incorporated by reference. Its operation is
`FIG. 2 is a listing of the tags used in the prototype
`further described below, by way of completeness, in
`embodiment, each of which is represented in the draw
`ing by a one to three character mnemonic and also by a
`connection with FIG. 6.
`In compiling the dictionary, if an inflectional variant
`one to two digit tag code. There are ninety-two such
`tags, although any given text word will generally have
`is a base form in its own right, it is listed separately in
`the database with the appropriate code for this usage.
`between one and six possible tags. Each tag indicates a
`For example, "backing' is stored as a noun of inflec
`possible syntactic use of the word, or an inflection. The
`tional class one, denoted N1, representing the paradigm
`dictionary records may also include, for nouns and
`10
`backing, backing's, backings, backings'). This dictio
`verbs, certain information encoding word features such
`nary entry is in addition to its inflectional usage as the
`as its number agreement behavior.
`present participle of the verb "to back" which would
`FIGS. 3A-3C show examples illustrating the format
`of the normal, exception and contraction records of the
`be recovered by inflection from the base form "back'
`prototype dictionary discussed above. The records each
`discussed above.
`FIG. 3B shows examples of exception records. These
`include the retrieval form of the main dictionary entry,
`left-justified with blank fill in columns 1-25 as field one,
`records contain elements (either base or inflected forms)
`that are members of irregular nouns or verb paradigms.
`and the record type code discussed above as the last
`In these records, the format of fields one to five are
`entry in the last field at column 80.
`FIG. 3A contains examples of "normal' main dictio
`similar to those of normal records shown in FIG. 3A,
`except that field four contains one or more substrings
`nary records. Normal records comprise approximately
`delimited by parentheses. The material between paren
`ninety-five percent of the database, and contain five
`theses identifies an irregular tag and the appropriate
`fixed-format fields, which include, in addition to fields
`base form for processing for such tag.
`one and five described above, the following.
`FIG. 3C illustrates contraction records, which lack
`Field two contains noun base form inflection code
`25
`the fields two through four of the foregoing two record
`information, if the base word has a noun form, for the
`word in field one, and occupies columns 26 through 29.
`types, and instead have a field two which contains from
`one to five merged tag representations (stored starting
`These code bits enable the construction of any desired
`in columns 26, 36, 46, 56, and 66, respectively), and
`inflection from the stored base form, by use of an inflec
`occupies columns 26 through 77. The last field, as with
`tional synthesis coding scheme discussed further below.
`the other two types of records, contains certain special
`Field three contains the verb base form inflection
`processing annotations, and occupies columns 78
`code information, if the base form has a verb form, for
`through 80; in the prototype, the only codes that occur
`the word in field one, and occupies columns 30 through
`in this field are the record type-indicating codes that
`33; these code bits compactly encode the verbal inflec
`tions corresponding to the base word.
`occur in column 80. The illustrated record for the word
`"ain't' indicates that it is a recognizable contraction
`Field four contains all other syntactic tags for the
`with a tag string consisting of the auxiliary tags corre
`word in field one, as well as any noun or verb feature
`sponding to the set of words ("an", "is", "are", "has",
`annotations, and occupies columns 34 through 77; fur
`"have"), plus the negation marker" corresponding to
`ther information concerning the feature annotations
`the morpheme “n't".
`that may appear in this field is given below in the discus
`As noted above, the main dictionary is a dictionary of
`sion of parsing and noun phrase determination.
`base form records each listing codes indicative of gram
`As noted above, noun and verb codes, if either occurs
`matical and inflectional tags and feature information.
`at all for a given word, are confined to the fields before
`Each text word is processed by an "unflection' proce
`column 34; all other tags must occur starting in or after
`dure which operates on the word to identify its base
`that column. For example, "back", the tenth word in
`form by stripping suffixes therefrom if possible to pro
`FIG. 3A, is encoded as being both a noun and a verb,
`duce a probable base form, and looking up the probable
`both of inflectional class one, yielding the paradigm
`base form in the dictionary. When the probable base
`back, back's, backs, backs' for the noun usage and
`form is found, the processor inspects inflectional codes
`back, backs, backed, backing for the verb, as well as an
`of the base form to confirm that any stripped suffixes
`adjective and an adverb (with tag codes as "JJ" and
`"RB', respectively). Although, including inflectional
`were indeed legal suffixes of the found entry. The ap
`propriate tags of the found word are then loaded into a
`variants, this accounts for six different words (ten differ
`ent word-plus-tag pairs), only one record (that corre
`data structure, denoted a sentence node or SEN
`NODE, which represents that word for subsequent
`sponding to the base form; i.e., "back") is stored in the
`processing. In a prototype embodiment, each noun base
`database; all of its inflectional variants are recovered by
`55
`form in the dictionary is encoded according to one of
`an analysis/synthesis procedure, called "unflection/in
`four regular inflectional paradigms,