`Sec. 2.4
`
`Speech Sounds and Features
`
`27
`
Figure 2.15 Spectrograms of the vowel sounds. [Each panel plots frequency (Hz) against time (sec).]
`
Figure 2.16 Measured frequencies of first and second formants for a wide range of talkers for several vowels (after Peterson & Barney [7]). [Horizontal axis: frequency of F1 (Hz); vertical axis: frequency of F2 (Hz).]
`
overlap between the formant frequencies for different vowel sounds by different talkers.
The ellipses drawn in this figure represent gross characterizations of the regions in which
most of the tokens of the different vowels lie. The message of Figure 2.16, for speech
recognition by machine, is fairly clear: accurately measuring formant frequencies or
spectral peaks is not, by itself, enough to classify vowel sounds accurately; one
`
`IPR2023-00035
`Apple EX1015 Page 58
`
`
`
28

Chap. 2

The Speech Signal

Figure 2.17 The vowel triangle with centroid positions of the common vowels. [Vowel centroids are plotted in the F1 (Hz) versus F2 (Hz) plane.]
`
`must do some type of talker (accent) normalization to account for the variability in formants
`and overlap between vowels.
A common way of exploiting the information embodied in Figures 2.15 and 2.16 is to
represent each vowel by a centroid in the formant space with the realization that the centroid,
at best, represents average behavior and does not represent variability across talkers. Such
a representation leads to the classic vowel triangle shown in Figure 2.17 and represented
in terms of formant positions by the data given in Table 2.2. The vowel triangle represents
the extremes of formant locations in the F1-F2 plane, as represented by /i/ (low F1, high
F2), /u/ (low F1, low F2), and /a/ (high F1, low F2), with other vowels appropriately placed
`with respect to the triangle vertices. The utility of the formant frequencies of Table 2.2 has
`been demonstrated in text-to-speech synthesis in which high-quality vowel sounds have
`been synthesized using these positions for the resonances [8].
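The centroid representation can be illustrated with a minimal nearest-centroid sketch in Python. The (F1, F2) values are those of Table 2.2; the function name `classify_vowel` and the use of plain Euclidean distance in Hz are assumptions made for illustration, not a method given in the text. As noted above, a centroid captures only average behavior, so a rule like this would be expected to confuse vowels whose regions overlap across talkers.

```python
import math

# (F1, F2) centroids in Hz, taken from Table 2.2 (ARPABET symbol -> formants).
VOWEL_CENTROIDS = {
    "IY": (270, 2290),
    "IH": (390, 1990),
    "EH": (530, 1840),
    "AE": (660, 1720),
    "AH": (520, 1190),
    "AA": (730, 1090),
    "AO": (570, 840),
    "UH": (440, 1020),
    "UW": (300, 870),
    "ER": (490, 1350),
}


def classify_vowel(f1, f2):
    """Return the ARPABET label whose (F1, F2) centroid is nearest.

    Distance is unweighted Euclidean distance in Hz (an illustrative choice).
    """
    def distance(label):
        c1, c2 = VOWEL_CENTROIDS[label]
        return math.hypot(f1 - c1, f2 - c2)

    return min(VOWEL_CENTROIDS, key=distance)


# Measurements near the triangle vertices map to the vertex vowels.
print(classify_vowel(280, 2250))  # near the /i/ vertex
print(classify_vowel(310, 900))   # near the /u/ vertex
print(classify_vowel(700, 1100))  # near the /a/ vertex
```

A talker-normalization step, as the text recommends, would have to be applied to the measurements before a comparison of this kind.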
`
`2.4.2 Diphthongs
`
Although there is some ambiguity and disagreement as to what is and what is not a
diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech
`
`
TABLE 2.2. Formant frequencies for typical vowels.

ARPABET   IPA      Typical   F1     F2     F3
Symbol    Symbol   Word      (Hz)   (Hz)   (Hz)
IY        /i/      beet      270    2290   3010
IH        /ɪ/      bit       390    1990   2550
EH        /ɛ/      bet       530    1840   2480
AE        /æ/      bat       660    1720   2410
AH        /ʌ/      but       520    1190   2390
AA        /a/      hot       730    1090   2440
AO        /ɔ/      bought    570     840   2410
UH        /U/      foot      440    1020   2240
UW        /u/      boot      300     870   2240
ER        /ɝ/      bird      490    1350   1690
`
sound that starts at or near the articulatory position for one vowel and moves to or toward
the position for another. According to this definition, there are six diphthongs in American
English, namely /aY/ (as in buy), /aw/ (as in down), /eY/ (as in bait), /ɔY/ (as in boy),
/o/ (as in boat), and /ju/ (as in you).
`The diphthongs are produced by varying the vocal tract smoothly between vowel
`configurations appropriate to the diphthong. Figure 2.18 shows spectrogram plots of four
`of the diphthongs spoken by a male talker. The gliding motions of the formants are
especially prominent for the sounds /aY/, /aw/ and /ɔY/ and are somewhat weaker for
/eY/ because of the closeness (in vowel space) of the two vowel sounds comprising this
diphthong.
An alternative way of displaying the time-varying spectral characteristics of diphthongs
is via a plot of the values of the second formant versus the first formant (implicitly
as a function of time) as shown in Figure 2.19 [9]. The arrows in this figure indicate the
direction of motion of the formants (in the F1-F2 plane) as time increases. The dashed
`circles in this figure indicate average positions of the vowels. Based on these data, and
`other measurements, the diphthongs can be characterized by a time-varying vocal tract area
`function that varies between two vowel configurations.
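The idea of a diphthong as a path between two vowel configurations can be sketched as a trajectory in the F1-F2 plane. The Python fragment below is only an illustration under stated assumptions: the function `glide` is hypothetical, the motion is taken to be linear in time, and /aY/ is assumed to run from the /a/ centroid all the way to the /i/ centroid of Table 2.2. Measured glides are generally not linear, and by the definition above a diphthong may only move toward its second target without reaching it.

```python
def glide(start, end, steps):
    """Linearly interpolate an (F1, F2) pair from start to end.

    Returns `steps` points, including both endpoints. Linear motion is an
    illustrative assumption, not a measured formant track.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    (f1_a, f2_a), (f1_b, f2_b) = start, end
    return [
        (f1_a + (f1_b - f1_a) * k / (steps - 1),
         f2_a + (f2_b - f2_a) * k / (steps - 1))
        for k in range(steps)
    ]


# Hypothetical /aY/ glide: /a/ centroid (730, 1090) toward /i/ centroid (270, 2290).
track = glide((730, 1090), (270, 2290), 5)
for f1, f2 in track:
    print(f"F1 = {f1:6.1f} Hz, F2 = {f2:6.1f} Hz")
```

F1 falls while F2 rises along the track, which is the direction of motion one would expect for a glide from a low back vowel toward a high front vowel.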
`
`2.4.3 Semivowels
`
The group of sounds consisting of /w/, /l/, /r/, and /y/ is quite difficult to characterize.
`These sounds are called semivowels because of their vowel-like nature. They are generally
`characterized by a gliding transition in vocal tract area function between adjacent phonemes.
`Thus the acoustic characteristics of these sounds are strongly influenced by the context in
`which they occur. For our purposes, they are best described as transitional, vowel-like
`sounds, and hence are similar in nature to the vowels and diphthongs.
`
Figure 2.18 Spectrogram plots of four diphthongs. [Legible panel labels include aY, eY, and ɔY; axes: time (sec) versus frequency (Hz).]
`
`2.4.4 Nasal Consonants
`
The nasal consonants /m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal
tract totally constricted at some point along the oral passageway. The velum is lowered so
that air flows through the nasal tract, with sound being radiated at the nostrils. The oral
cavity, although constricted toward the front, is still acoustically coupled to the pharynx.
Thus, the mouth serves as a resonant cavity that traps acoustic energy at certain natural
frequencies. As far as the radiated sound is concerned, these resonant frequencies of the
oral cavity appear as antiresonances, or zeros of the transfer function of sound transmission.
Furthermore, nasal consonants and nasalized vowels (i.e., some vowels preceding or
following nasal consonants) are characterized by resonances that are spectrally broader, or
more highly damped, than those for vowels.
The three nasal consonants are distinguished by the place along the oral tract at
which a total constriction is made. For /m/ the constriction is at the lips; for /n/ the
constriction is just behind the teeth; and for /ŋ/ the constriction is just forward of the
velum itself. Figure 2.20 shows typical speech waveforms and Figure 2.21 spectrograms
for two nasal consonants in the context vowel-nasal-vowel. The waveforms of /m/ and /n/
look very similar. The spectrograms show a concentration of low-frequency energy with a
midrange of frequencies that contain no prominent peaks. This is because of the particular
Figure 2.19 Time variation of the first two formants for the diphthongs (after Holbrook and Fairbanks [9]). [Both axes are frequency scales in Hz.]
`
`combination of resonances and antiresonances that result from the coupling of the nasal
`and oral tracts.
`
`2.4.5 Unvoiced Fricatives
`
The unvoiced fricatives /f/, /θ/, /s/, and /sh/ are produced by exciting the vocal tract by a
steady air flow, which becomes turbulent in the region of a constriction in the vocal tract.
The location of the constriction serves to determine which fricative sound is produced.
For the fricative /f/ the constriction is near the lips; for /θ/ it is near the teeth; for /s/ it
is near the middle of the oral tract; and for /sh/ it is near the back of the oral tract. Thus
the system for producing unvoiced fricatives consists of a source of noise at a constriction,
which separates the vocal tract into two cavities. Sound is radiated from the lips, that is,
from the front cavity. The back cavity serves, as in the case of nasals, to trap energy and
thereby introduce antiresonances into the vocal output. Figure 2.22 shows the waveforms
and Figure 2.23 the spectrograms of the fricatives /f/, /s/ and /sh/. The nonperiodic nature
`
Figure 2.20 Waveforms for the sequences /a-m-a/ and /a-n-a/. [Panels: (a) ama, (b) ana; time scale bar: 100 msec.]
`
`of fricative excitation is obvious in the waveform plots. The spectral differences among
`the fricatives are readily seen by comparing the three spectrograms.
`
`2.4.6 Voiced Fricatives
`
The voiced fricatives /v/, /th/, /z/ and /zh/ are the counterparts of the unvoiced fricatives /f/,
/θ/, /s/, and /sh/, respectively, in that the place of constriction for each of the corresponding
`phonemes is essentially identical. However, the voiced fricatives differ markedly from
`their unvoiced counterparts in that two excitation sources are involved in their production.
`For voiced fricatives the vocal cords are vibrating, and thus one excitation source is at
`the glottis. However, since the vocal tract is constricted at some point forward of the
`glottis, the air flow becomes turbulent in the neighborhood of the constriction. Thus the
`
Figure 2.21 Spectrograms of the sequences /a-m-a/ and /a-n-a/. [Each panel shows frequency (Hz) and waveform amplitude against time (sec).]
`
`spectra of voiced fricatives can be expected to display two distinct components. These
`excitation features are readily observable in Figure 2.24, which shows typical waveforms,
and in Figure 2.25, which shows spectra for two voiced fricatives. The similarity of the
unvoiced fricative /f/ to the voiced fricative /v/ is easily shown in a comparison between
`corresponding spectrograms in Figures 2.23 and 2.25. Likewise, it is instructive to compare
`the spectrograms of /sh/ and /zh/.
`
`2.4.7 Voiced and Unvoiced Stops
`
`The voiced stop consonants /b/, /d/, and /g/, are transient, noncontinuant sounds produced
`by building up pressure behind a total constriction somewhere in the oral tract and then
`suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ the constriction
`is at the back of the teeth; and for /g/ it is near the velum. During the period when there is
`total constriction in the tract, no sound is radiated from the lips. However, there is often a
`small amount of low-frequency energy radiated through the walls of the throat (sometimes
`called a voice bar). This occurs when the vocal cords are able to vibrate even though the
`vocal tract is closed at some point.
`Since the stop sounds are dynamical in nature, their properties are highly influenced
`by the vowel that follows the stop consonant. As such, the waveforms for stop consonants
`give little information about the particular stop consonant. Figure 2.26 shows the waveform
`of the syllable /a-b-a/. The waveform of /b/ shows few distinguishing features except for
`the voiced excitation and lack of high-frequency energy.
`
Figure 2.22 Waveforms for the sounds /f/, /s/ and /sh/ in the context /a-x-a/ where /x/ is the unvoiced fricative. [Time scale bar: 100 msec.]

Figure 2.23 Spectrogram comparisons of the sounds /a-f-a/, /a-s-a/ and /a-sh-a/. [Each panel shows frequency (Hz) and waveform amplitude against time (sec).]
`
Figure 2.24 Waveforms for the sequences /a-v-a/ and /a-zh-a/. [Panels: (a) ava, (b) azha; time scale bar: 100 msec.]
`
The unvoiced stop consonants /p/, /t/, and /k/ are similar to their voiced counterparts
/b/, /d/, and /g/, with one major exception. During the period of total closure of the tract,
`as the pressure builds up, the vocal cords do not vibrate. Then, following the period of
`closure, as the air pressure is released, there is a brief interval of friction (due to sudden
`turbulence of the escaping air) followed by a period of aspiration (steady air flow from the
`glottis exciting the resonances of the vocal tract) before voiced excitation begins.
`Figure 2.27 shows waveforms and Figure 2.28 shows spectrograms of the voiced stop
`/b/ and the voiceless stop consonants /p/ and /t/. The "stop gap," or time interval, during
`which the pressure is built up is clearly in evidence. Also, it can be readily seen that the
`duration and frequency content of the frication noise and aspiration vary greatly with the
`stop consonant.
`
Figure 2.25 Spectrograms for the sequences /a-v-a/ and /a-zh-a/. [Each panel shows frequency (Hz) and waveform amplitude against time (sec).]

Figure 2.26 Waveform for the sequence /a-b-a/. [Time scale bar: 100 msec.]

Figure 2.27 Waveforms for the sequences /a-p-a/ and /a-t-a/. [Panels: (a) apa, (b) ata; time scale bar: 100 msec.]
`
`2.4.8 Review Exercises
`
`As a self-check on the reader's understanding of the material on speech sounds and their
`acoustic manifestations, we now digress and present some simple exercises along with the
`solutions. For maximum effectiveness, the reader is encouraged to think through each
`exercise before looking at the solution.
`Exercise 2.1
`1. Write out the phonetic transcription for the following words:
`he, eats, several, light, tacos
`
`2. What effect occurs when these five words are spoken in sequence as a sentence? What
`does this imply about automatic speech recognition?
`
Figure 2.28 Spectrogram comparisons of the sequences of voiced (/a-b-a/) and voiceless (/a-p-a/ and /a-t-a/) stop consonants. [Each panel shows frequency (Hz) and waveform amplitude against time (sec).]
`
`Solution 2.1
`1. The phonetic transcriptions of the words are
`
Word       Phoneme Sequence   ARPABET
he         /hi/               HH-IY
eats       /its/              IY-T-S
several    /sɛvrʌl/           S-EH-V-R-AH-L
light      /laYt/             L-AY-T
tacos      /takoz/            T-AA-K-OW-Z
`
`2. When the words are spoken together, the last sound of each word merges with the
`first sound of the succeeding word (since they are the same sound), resulting in strong
`coarticulation of boundary sounds. The ARPABET transcription for the sentence is:
`HH-IY-T-S-EH-V-R-AH-L-AY-T-AA-K-OW-Z
All information about word boundaries is totally lost; furthermore, the durations of
`the common sounds at the boundaries of words are much shorter than what would be
`predicted from the individual words.
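The merging of boundary sounds in Solution 2.1 can be reproduced with a short Python sketch. The function name `join_words` is an assumption made for illustration, and the rule it applies (drop the first phoneme of a word when it equals the last phoneme already emitted) is a simplification that covers only the identical-sound case of this exercise, not coarticulation in general.

```python
def join_words(words):
    """Concatenate per-word ARPABET sequences, merging identical boundary phonemes."""
    sentence = []
    for phones in words:
        for i, phone in enumerate(phones):
            # Skip a word-initial phoneme that repeats the previous word's final one.
            if i == 0 and sentence and sentence[-1] == phone:
                continue
            sentence.append(phone)
    return sentence


# ARPABET transcriptions of "he eats several light tacos" from Solution 2.1.
WORDS = [
    ["HH", "IY"],
    ["IY", "T", "S"],
    ["S", "EH", "V", "R", "AH", "L"],
    ["L", "AY", "T"],
    ["T", "AA", "K", "OW", "Z"],
]

print("-".join(join_words(WORDS)))
# prints HH-IY-T-S-EH-V-R-AH-L-AY-T-AA-K-OW-Z
```

The 19 phonemes of the isolated words shrink to 15 in the sentence, and nothing in the output marks where one word ends and the next begins.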
`
`Exercise 2.2
`
`Some of the difficulties in large vocabulary speech recognition are related to the irregularities
`in the way basic speech sounds are combined to produce words. Exercise 2.2 highlights a
`couple of these difficulties.
`
`1. In word initial position of American English, which phoneme or phonemes can never
`occur? Which hardly ever occur?
`2. There are many word initial consonant clusters of length two, such as speak, drank,
`plead, and press. How many word initial consonant clusters of length three are there
`in American English? What general rule can you give about the sounds in each of the
`three positions?
`
`
`3. A nasal consonant can be combined with a stop consonant (e.g., camp, tend) in a limited
`number of ways. What general rule do such combinations obey? There are several
`notable exceptions to this general rule. Can you give a couple of exceptions? What
`kind of speaking irregularity often results from these exceptions?
`
`Solution 2.2
1. The only phoneme that never occurs in initial word position in English is the /ng/ sound
(e.g., sing). The only other sound that almost never occurs naturally in English, in
initial word position, is /zh/, except in some foreign words imported into English, such as
gendarme, which does have an initial /zh/.
2. The word initial consonant clusters of length three in English include

/spl/    split
/spr/    spring
/skw/    squirt
/skr/    script
/str/    string

The general rule for such clusters is

/s/ + unvoiced stop + semivowel

3. The general rule for a nasal-stop combination is that the nasal and stop have the
same place of articulation, e.g., front/lips (/mp/), mid/dental (/nt/), back/velar (/ng k/).
Exceptions occur in words like summed (/md/) or hanged (/ng d/) or dreamt (/mt/).
There is often a tendency to insert an extra stop in such situations (e.g., dreamt →
/drempt/).
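Both rules of Solution 2.2 are simple enough to state as predicates. The Python sketch below is illustrative only: the function names, the lowercase phoneme spellings, and the three coarse place labels (taken from the front/mid/back wording above) are assumptions. Note that the cluster rule acts as a filter rather than a generator, since it also accepts combinations such as /s/-/t/-/l/ that are not among the five clusters listed.

```python
UNVOICED_STOPS = {"p", "t", "k"}
SEMIVOWELS = {"w", "l", "r", "y"}

# Coarse place of articulation, following the front/mid/back grouping in the solution.
PLACE = {
    "m": "lips", "p": "lips", "b": "lips",
    "n": "dental", "t": "dental", "d": "dental",
    "ng": "velar", "k": "velar", "g": "velar",
}


def fits_cluster_rule(cluster):
    """True if a three-phoneme cluster is /s/ + unvoiced stop + semivowel."""
    return (
        len(cluster) == 3
        and cluster[0] == "s"
        and cluster[1] in UNVOICED_STOPS
        and cluster[2] in SEMIVOWELS
    )


def same_place(nasal, stop):
    """True if a nasal and a stop share a place of articulation."""
    return PLACE[nasal] == PLACE[stop]


LISTED_CLUSTERS = [
    ("s", "p", "l"), ("s", "p", "r"), ("s", "k", "w"),
    ("s", "k", "r"), ("s", "t", "r"),
]

print(all(fits_cluster_rule(c) for c in LISTED_CLUSTERS))  # every listed cluster obeys the rule
print(same_place("m", "p"))   # camp: nasal and stop agree in place
print(same_place("m", "t"))   # dreamt: an exception to the rule
```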
`
`Exercise 2.3
An important speech task is accurate digit recognition. This exercise seeks to exploit knowledge
of acoustic phonetics to recognize first isolated digits, and next some simple connected
`digit strings. We first need a sound lexicon (a dictionary) for the digits. The sound lexicon
`describes the pronunciations of digits in terms of the basic sounds of English. Such a sound
`lexicon is given in Table 2.3. A single male adult talker (LRR) spoke each of the 11 digits in
`random sequence and in isolation, and spectrograms of these spoken utterances are shown in
`Figure 2.29. Figure 2.30 shows spectrograms of two connected digit sequences spoken by the
`same talker.
`1. Identify each of the 11 digits based on the acoustic properties of the sounds within the
`digit (as expressed in the sound lexicon). Remember that each digit was spoken exactly
`once.
`2. Try to identify the spoken digits in each of the connected digit strings.
`
`Solution 2.3
1. The digits of the top row are 3 and 7:
a. The digit 3 is cued by the distinctive brief initial fricative (/θ/), followed by the
semivowel /r/ where the second and third formants both get very low in frequency,
followed by the /i/ where F2 and F3 both become very high in frequency.
b. The digit 7 is cued by the strong /s/ frication at the beginning, the distinctive /ɛ/,
followed by the voiced fricative /v/, a short vowel /ə/ and ending in the strong
`
Figure 2.29 Spectrograms of the 11 isolated digits, 0 through 9 plus oh, in random sequence. [Axes: time (sec) versus frequency (Hz).]

TABLE 2.3. Sound Lexicon of Digits

Word     Sounds          ARPABET
Zero     /z I r o/       Z-IH-R-OW
One      /w ʌ n/         W-AH-N
Two      /t u/           T-UW
Three    /θ r i/         TH-R-IY
Four     /f o r/         F-OW-R
Five     /f aY v/        F-AY-V
Six      /s I k s/       S-IH-K-S
Seven    /s ɛ v ə n/     S-EH-V-AX-N
Eight    /eY t/          EY-T
Nine     /n aY n/        N-AY-N
Oh       /o/             OW

Figure 2.30 Spectrograms of two connected digit sequences. [Axes: time (sec) versus frequency (Hz).]
`
nasal /n/.
The digits in the second row are 0 and 9:
a. The initial /z/ is cued by the strong frication with the presence of voicing at low
frequencies; the following /I/ is seen by the high F2 and F3, the /r/ is signaled by
the low F2 and F3, and the diphthong /o/ is signaled by the gliding motion of F2
and F3 toward an /u/-like sound.
b. The digit 9 is cued by the distinct initial and final nasals /n/ and by the /aY/ glide
between the nasals.
The digits in the third row are 1 and 5:
a. The digit 1 is cued by the strong initial semivowel /w/ with very low F2 and by the
strong final nasal /n/.
b. The digit 5 is cued by the weak initial frication of /f/, followed by the strong
diphthong /aY/ and ending in the very weak fricative /v/.
`
`
The digits in the fourth row are 2 and 8:
a. The digit 2 is cued by the strong /t/ burst and release followed by the glide to the
/u/ sound.
b. The digit 8 is cued by the initial weak diphthong /eY/ followed by a clear stop gap
of the /t/ and then the /t/ release.
The digits in the fifth row are "oh" and 4:
a. The digit "oh" is virtually a steady sound with a slight gliding tendency toward /u/
at the end.
b. The digit 4 is cued by the weak initial fricative /f/, followed by the strong /o/ vowel
and ending with a classic /r/ where F2 and F3 merge together.
The digit in the last row is 6:
a. The digit 6 is cued by the strong /s/ frication at the beginning and end, and by the
steady vowel /I/ followed by the stop gap and release of the /k/.
2. By examining the isolated digit sequences, one can eventually (with a lot of work and
some good luck) conclude that the two sequences are

Row 1:   2-oh-1            (telephone area code)
Row 2:   5-8-2-3-3-1-6     (7-digit telephone number)

We will defer any explanation of how any reasonable person, or machine, could perform
this task until later in this book when we discuss connected word-recognition techniques.
The purpose of this exercise is to convince the reader how difficult a relatively simple
recognition task can be.
`
`2.5 APPROACHES TO AUTOMATIC SPEECH RECOGNITION BY MACHINE
`
The material presented in the previous sections leads to a straightforward way of performing
speech recognition by machine whereby the machine attempts to decode the speech signal
in a sequential manner based on the observed acoustic features of the signal and the known
relations between acoustic features and phonetic symbols. This method, appropriately
called the acoustic-phonetic approach, is indeed viable and has been studied in great depth
for more than 40 years. However, for a variety of reasons, the acoustic-phonetic approach
has not achieved the same success in practical systems as have alternative methods. Hence,
in this section, we provide an overview of several proposed approaches to automatic speech
recognition by machine with the goal of providing some understanding as to the essentials
of each proposed method, and the basic strengths and weaknesses of each approach.
`Broadly speaking, there are three approaches to speech recognition, namely:
`
`1. the acoustic-phonetic approach
`2. the pattern recognition approach
`3. the artificial intelligence approach
`
The acoustic-phonetic approach is based on the theory of acoustic phonetics that postulates
that there exist finite, distinctive phonetic units in spoken language and that the phonetic
`
Figure 2.31 Phoneme lattice for word string. [Horizontal axis: time; the lattice entries are not recoverable from the extracted text.]
`
units are broadly characterized by a set of properties that are manifest in the speech signal,
or its spectrum, over time. Even though the acoustic properties of phonetic units
are highly variable, both with speakers and with neighboring phonetic units (the so-called
coarticulation of sounds), it is assumed that the rules governing the variability are
straightforward and can readily be learned and applied in practical situations. Hence the first
step in the acoustic-phonetic approach to speech recognition is called a segmentation and
labeling phase because it involves segmenting the speech signal into discrete (in time)
regions where the acoustic properties of the signal are representative of one (or possibly
several) phonetic units (or classes), and then attaching one or more phonetic labels to each
segmented region according to the acoustic properties. To actually do speech recognition,
a second step is required. This second step attempts to determine a valid word (or string of
words) from the sequence of phonetic labels produced in the first step, which is consistent
with the constraints of the speech-recognition task (i.e., the words are drawn from a given
vocabulary, the word sequence makes syntactic sense and has semantic meaning, etc.).
To illustrate the steps involved in the acoustic-phonetic approach to speech recognition,
consider the phoneme lattice shown in Figure 2.31. (A phoneme lattice is the result of
the segmentation and labeling step of the recognition process and represents a sequential set
of phonemes that are likely matches to the spoken input speech.) The problem is to decode
the phoneme lattice into a word string (one or more words) such that every instant of time
is included in one of the phonemes in the lattice, and such that the word (or word sequence)
is valid according to rules of English syntax. (The symbol SIL stands for silence or a pause
between sounds or words; the vertical position in the lattice, at any time, is a measure of
the goodness of the acoustic match to the phonetic unit, with the highest unit having the
best match.) With a modest amount of searching, one can derive the appropriate phonetic
string SIL-AO-L-AX-B-AW-T corresponding to the word string "all about," with the
`phonemes L, AX, and B having been second or third choices in the lattice and all other
`phonemes having been first choices. This simple example illustrates well the difficulty in
`decoding phonetic units into word strings. This is the so-called lexical access problem.
`Interestingly, as we will see in the next section, the real problem with the acoustic-phonetic
`approach to speech recognition is the difficulty in getting a reliable phoneme lattice for the
`lexical access stage.
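The lexical access search described above can be sketched in Python. Everything concrete here is an assumption made for illustration: the lattice is a made-up stand-in for Figure 2.31 (its competing first choices R, IY, P, and D are invented), the segments are taken to be already aligned in time, the four-word lexicon is hypothetical, and the cost is simply the sum of candidate ranks. The sketch also omits the syntax check that the text requires of a valid word string.

```python
# Hypothetical lattice: one list per time segment, best acoustic match first.
LATTICE = [
    ["SIL"],
    ["AO"],
    ["R", "L"],
    ["IY", "AX"],
    ["P", "D", "B"],
    ["AW"],
    ["T"],
]

# Hypothetical pronunciation lexicon (word -> ARPABET phonemes).
LEXICON = {
    "all": ["AO", "L"],
    "about": ["AX", "B", "AW", "T"],
    "out": ["AW", "T"],
    "law": ["L", "AO"],
}


def decode(lattice, lexicon):
    """Return (cost, words) for the cheapest word string covering every segment.

    Cost is the sum of the ranks (0 = first choice) of the phonemes used.
    Returns None if no word string covers the whole lattice.
    """
    n = len(lattice)
    best = [None] * (n + 1)  # best[i]: cheapest cover of segments i..n-1
    best[n] = (0, [])
    for i in range(n - 1, -1, -1):
        options = []
        # A silence segment may be skipped without emitting a word.
        if "SIL" in lattice[i] and best[i + 1] is not None:
            cost, words = best[i + 1]
            options.append((cost + lattice[i].index("SIL"), words))
        for word, phones in lexicon.items():
            j = i + len(phones)
            if j > n or best[j] is None:
                continue
            if all(p in lattice[i + k] for k, p in enumerate(phones)):
                rank = sum(lattice[i + k].index(p) for k, p in enumerate(phones))
                cost, words = best[j]
                options.append((cost + rank, [word] + words))
        if options:
            best[i] = min(options, key=lambda option: option[0])
    return best[0]


print(decode(LATTICE, LEXICON))  # (4, ['all', 'about'])
```

As in the text, the recovered string is "all about" even though L, AX, and B are second or third choices in their segments; the search succeeds only because those phonemes appear somewhere in the lattice, which is why lattice reliability matters so much.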
The pattern-recognition approach to speech recognition is basically one in which the
speech patterns are used directly without explicit feature determination (in the
acoustic-phonetic sense) and segmentation. As in most pattern-recognition approaches, the method
has two steps, namely, training of speech patterns, and recognition of patterns via pattern
comparison. Speech "knowledge" is brought into the system via the training procedure.
The concept is that if enough versions of a pattern to be recognized (be it a sound, a word, a
phrase, etc.) are included in a training set provided to the algorithm, the training procedure
`
`
`should be able to adequately characterize the acoustic properties of the pattern (with no
`regard for or knowledge of any other pattern presented to the training procedure). This
`type of characterization of speech via training is called pattern classification because the
`machine learns which acoustic properties of the speech class are reliable and repeatable
`across all training tokens of the pattern. The utility of the method is the pattern-comparison
stage, which does a direct comparison of the unknown speech (the speech to be recognized),
`with each possible pattern learned in the training phase and classifies the unknown speech
`according to the goodness of match of the patterns.
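The two steps, training and pattern comparison, can be sketched in Python as follows. This is a deliberately reduced illustration under stated assumptions: each token is taken to be a fixed-length feature vector, training is plain averaging, and goodness of match is Euclidean distance. The labels and numbers are invented toy data, and real speech patterns are time-varying sequences that need the more elaborate methods developed later in the book.

```python
import math


def train(tokens_by_class):
    """Average the training tokens of each class into one reference pattern.

    Each class is characterized on its own, without reference to the others.
    """
    templates = {}
    for label, tokens in tokens_by_class.items():
        dims = len(tokens[0])
        templates[label] = [
            sum(token[d] for token in tokens) / len(tokens) for d in range(dims)
        ]
    return templates


def recognize(templates, unknown):
    """Return the label of the reference pattern that best matches the unknown."""
    def distance(label):
        return math.sqrt(
            sum((a - b) ** 2 for a, b in zip(templates[label], unknown))
        )

    return min(templates, key=distance)


# Invented two-dimensional feature vectors for two hypothetical word classes.
TRAINING = {
    "yes": [[1.0, 0.2], [0.8, 0.4]],
    "no": [[0.1, 0.9], [0.3, 1.1]],
}

templates = train(TRAINING)
print(recognize(templates, [0.7, 0.5]))
print(recognize(templates, [0.2, 0.8]))
```

No phonetic knowledge appears anywhere in the code; whatever the system knows about the two classes comes from the training tokens, which is the sense in which speech "knowledge" enters through training.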
`The pattern-recognition approach to speech recognition is the basis for the remainder
`of this book. Hence there will be a great deal of discussion and explanation of virtually every
`aspect of the procedure. However, at this point, suffice it to say that the pattern-recognition
`approach is the method of choice for speech recognition for three reasons:
`
1. Simplicity of use. The method is easy to understand, it is rich in mathematical and
communication theory justification for individual procedures used in training and
decoding, and it is widely used and understood.
2. Robustness and invariance to different speech vocabularies, users, feature sets, pattern
comparison algorithms and decision rules. This property makes the algorithm
appropriate for a wide range of speech units (ranging from phonemelike units all the
way through words, phrases, and sentences), word vocabularies, talker populations,
background environments, transmission conditions, etc.
`3. Proven high performance. It will be shown that the pattern-recognition approach
`to speech recognition consistently provides high performance on any task that is
`reasonable for the technology and provides a clear path for extending the technology
`in a wide range of directions such that the performance degrades gracefully as the
`problem becomes more and more difficult.
`
`The so-called artificial intelligence approach to speech recognition is a hybrid of the
`acoustic-phonetic approach and the pattern-recognition approach in that it exploits ideas and
`concepts of both methods. The artificial intelligence approach attempts to mechanize the
`recognition procedure according to the way a person applies its intelligence in visualizing,
`analyzing, and finally making a decision on the measured acoustic features. In particular,
`among the techniques used within this class of me