`Sec. 2.4
`
`Speech Sounds and Features
`
Figure 2.15 Spectrograms of the vowel sounds. (Axes: frequency in Hz versus time in seconds.)
`
Figure 2.16 Measured frequencies of first and second formants for a wide range of talkers for several vowels (after Peterson & Barney [7]). (Axes: frequency of F2 in Hz versus frequency of F1 in Hz.)
`
overlap between the formant frequencies for different vowel sounds by different talkers.
The ellipses drawn in this figure represent gross characterizations of the regions in which
most of the tokens of the different vowels lie. The message of Figure 2.16, for speech
recognition by machine, is fairly clear: accurate measurement of formant frequencies or
spectral peaks is not, by itself, enough to classify vowel sounds accurately; one
`
`IPR2023-00035
`Apple EX1015 Page 58
`
`

`

Chap. 2 The Speech Signal

Figure 2.17 The vowel triangle with centroid positions of the common vowels. (Axes: F2 in Hz versus F1 in Hz.)
`
`must do some type of talker (accent) normalization to account for the variability in formants
`and overlap between vowels.
A common way of exploiting the information embodied in Figures 2.15 and 2.16 is to
represent each vowel by a centroid in the formant space with the realization that the centroid,
at best, represents average behavior and does not represent variability across talkers. Such
a representation leads to the classic vowel triangle shown in Figure 2.17 and represented
in terms of formant positions by the data given in Table 2.2. The vowel triangle represents
the extremes of formant locations in the F1-F2 plane, as represented by /i/ (low F1, high
F2), /u/ (low F1, low F2), and /a/ (high F1, low F2), with other vowels appropriately placed
`with respect to the triangle vertices. The utility of the formant frequencies of Table 2.2 has
`been demonstrated in text-to-speech synthesis in which high-quality vowel sounds have
`been synthesized using these positions for the resonances [8].
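As a rough illustration of the centroid representation, the sketch below assigns a measured (F1, F2) pair to the nearest vowel centroid of Table 2.2. The function name and the use of Euclidean distance in Hz are assumptions made for illustration only; because centroids ignore talker variability, a classifier of this kind would be expected to make errors wherever the vowel regions of Figure 2.16 overlap.

```python
import math

# (F1, F2) centroids in Hz for the vowels of Table 2.2, keyed by ARPABET symbol.
VOWEL_CENTROIDS = {
    "IY": (270, 2290), "IH": (390, 1990), "EH": (530, 1840), "AE": (660, 1720),
    "AH": (520, 1190), "AA": (730, 1090), "AO": (570, 840), "UH": (440, 1020),
    "UW": (300, 870), "ER": (490, 1350),
}

def nearest_vowel(f1, f2):
    """Return the ARPABET symbol whose centroid is closest (hypothetical helper)."""
    # Assumption: plain Euclidean distance in Hz, with no talker normalization.
    return min(VOWEL_CENTROIDS, key=lambda v: math.dist((f1, f2), VOWEL_CENTROIDS[v]))

print(nearest_vowel(280, 2250))   # a low-F1, high-F2 measurement
print(nearest_vowel(700, 1100))   # a high-F1, low-F2 measurement
```

The two measurements above fall closest to the IY and AA centroids, that is, near two vertices of the vowel triangle.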
`
`2.4.2 Diphthongs
`
Although there is some ambiguity and disagreement as to what is and what is not a
diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech
`
TABLE 2.2. Formant frequencies for typical vowels.

ARPABET   IPA      Typical   F1     F2     F3
Symbol    Symbol   Word      (Hz)   (Hz)   (Hz)
IY        /i/      beet      270    2290   3010
IH        /I/      bit       390    1990   2550
EH        /ɛ/      bet       530    1840   2480
AE        /æ/      bat       660    1720   2410
AH        /ʌ/      but       520    1190   2390
AA        /a/      hot       730    1090   2440
AO        /ɔ/      bought    570     840   2410
UH        /U/      foot      440    1020   2240
UW        /u/      boot      300     870   2240
ER        /ɝ/      bird      490    1350   1690
`
sound that starts at or near the articulatory position for one vowel and moves to or toward
the position for another. According to this definition, there are six diphthongs in American
English, namely /aY/ (as in buy), /aw/ (as in down), /eY/ (as in bait), /ɔY/ (as in boy),
/o/ (as in boat), and /ju/ (as in you).
The diphthongs are produced by varying the vocal tract smoothly between vowel
configurations appropriate to the diphthong. Figure 2.18 shows spectrogram plots of four
of the diphthongs spoken by a male talker. The gliding motions of the formants are
especially prominent for the sounds /aY/, /aw/ and /ɔY/ and are somewhat weaker for
/eY/ because of the closeness (in vowel space) of the two vowel sounds comprising this
diphthong.
An alternative way of displaying the time-varying spectral characteristics of diphthongs
is via a plot of the values of the second formant versus the first formant (implicitly
as a function of time) as shown in Figure 2.19 [9]. The arrows in this figure indicate the
direction of motion of the formants (in the F1-F2 plane) as time increases. The dashed
circles in this figure indicate average positions of the vowels. Based on these data, and
other measurements, the diphthongs can be characterized by a time-varying vocal tract area
function that varies between two vowel configurations.
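The notion of a glide between two vowel configurations can be sketched in the F1-F2 plane. The example below linearly interpolates between two vowel centroids from Table 2.2. The straight-line path, the choice of /a/ and /i/ as endpoints for /aY/, and the function name are illustrative assumptions; they are not measurements from Figure 2.19, and real diphthongs only move "to or toward" the second vowel.

```python
def formant_glide(start, end, steps):
    """Linearly interpolate (F1, F2) in Hz from start to end, both included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    (f1a, f2a), (f1b, f2b) = start, end
    return [
        (f1a + (f1b - f1a) * k / (steps - 1), f2a + (f2b - f2a) * k / (steps - 1))
        for k in range(steps)
    ]

AA = (730, 1090)   # /a/ centroid from Table 2.2
IY = (270, 2290)   # /i/ centroid from Table 2.2

# Assumed idealization of /aY/: F1 falls while F2 rises over the glide.
for f1, f2 in formant_glide(AA, IY, 5):
    print(f"F1 = {f1:6.1f} Hz   F2 = {f2:6.1f} Hz")
```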
`
`2.4.3 Semivowels
`
The group of sounds consisting of /w/, /l/, /r/, and /y/ is quite difficult to characterize.
`These sounds are called semivowels because of their vowel-like nature. They are generally
`characterized by a gliding transition in vocal tract area function between adjacent phonemes.
`Thus the acoustic characteristics of these sounds are strongly influenced by the context in
`which they occur. For our purposes, they are best described as transitional, vowel-like
`sounds, and hence are similar in nature to the vowels and diphthongs.
`
Figure 2.18 Spectrogram plots of four diphthongs. (Panels: /aY/, /aw/, /eY/, /ɔY/; axes: frequency in Hz versus time in seconds.)
`
`2.4.4 Nasal Consonants
`
The nasal consonants /m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal
tract totally constricted at some point along the oral passageway. The velum is lowered so
that air flows through the nasal tract, with sound being radiated at the nostrils. The oral
cavity, although constricted toward the front, is still acoustically coupled to the pharynx.
Thus, the mouth serves as a resonant cavity that traps acoustic energy at certain natural
frequencies. As far as the radiated sound is concerned, these resonant frequencies of the
oral cavity appear as antiresonances, or zeros of the transfer function of sound transmission.
Furthermore, nasal consonants and nasalized vowels (i.e., some vowels preceding or
following nasal consonants) are characterized by resonances that are spectrally broader, or
more highly damped, than those for vowels.
The three nasal consonants are distinguished by the place along the oral tract at
which a total constriction is made. For /m/ the constriction is at the lips; for /n/ the
constriction is just behind the teeth; and for /ŋ/ the constriction is just forward of the
velum itself. Figure 2.20 shows typical speech waveforms and Figure 2.21 spectrograms
for two nasal consonants in the context vowel-nasal-vowel. The waveforms of /m/ and /n/
look very similar. The spectrograms show a concentration of low-frequency energy with a
midrange of frequencies that contain no prominent peaks. This is because of the particular
`
Figure 2.19 Time variation of the first two formants for the diphthongs (after Holbrook and Fairbanks [9]). (Axes: F2 in Hz versus F1 in Hz.)
`
`combination of resonances and antiresonances that result from the coupling of the nasal
`and oral tracts.
`
`2.4.5 Unvoiced Fricatives
`
The unvoiced fricatives /f/, /θ/, /s/, and /sh/ are produced by exciting the vocal tract by a
steady air flow, which becomes turbulent in the region of a constriction in the vocal tract.
The location of the constriction serves to determine which fricative sound is produced.
For the fricative /f/ the constriction is near the lips; for /θ/ it is near the teeth; for /s/ it
is near the middle of the oral tract; and for /sh/ it is near the back of the oral tract. Thus
the system for producing unvoiced fricatives consists of a source of noise at a constriction,
which separates the vocal tract into two cavities. Sound is radiated from the lips, that is,
from the front cavity. The back cavity serves, as in the case of nasals, to trap energy and
thereby introduce antiresonances into the vocal output. Figure 2.22 shows the waveforms
and Figure 2.23 the spectrograms of the fricatives /f/, /s/ and /sh/. The nonperiodic nature
`
Figure 2.20 Waveforms for the sequences /a-m-a/ and /a-n-a/.
`
`of fricative excitation is obvious in the waveform plots. The spectral differences among
`the fricatives are readily seen by comparing the three spectrograms.
`
`2.4.6 Voiced Fricatives
`
`The voiced fricatives /v/, /th/, /z/ and /zh/ are the counterparts of the unvoiced fricatives /f/,
/θ/, /s/, and /sh/, respectively, in that the place of constriction for each of the corresponding
`phonemes is essentially identical. However, the voiced fricatives differ markedly from
`their unvoiced counterparts in that two excitation sources are involved in their production.
`For voiced fricatives the vocal cords are vibrating, and thus one excitation source is at
`the glottis. However, since the vocal tract is constricted at some point forward of the
`glottis, the air flow becomes turbulent in the neighborhood of the constriction. Thus the
`
Figure 2.21 Spectrograms of the sequences /a-m-a/ and /a-n-a/. (Axes: frequency in Hz and waveform amplitude versus time in seconds.)
`
`spectra of voiced fricatives can be expected to display two distinct components. These
`excitation features are readily observable in Figure 2.24, which shows typical waveforms,
`and in Figure 2.25, which shows spectra for two voiced fricatives. The similarity of the
unvoiced fricative /f/ to the voiced fricative /v/ is easily shown in a comparison between
`corresponding spectrograms in Figures 2.23 and 2.25. Likewise, it is instructive to compare
`the spectrograms of /sh/ and /zh/.
`
`2.4.7 Voiced and Unvoiced Stops
`
`The voiced stop consonants /b/, /d/, and /g/, are transient, noncontinuant sounds produced
`by building up pressure behind a total constriction somewhere in the oral tract and then
`suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ the constriction
`is at the back of the teeth; and for /g/ it is near the velum. During the period when there is
`total constriction in the tract, no sound is radiated from the lips. However, there is often a
`small amount of low-frequency energy radiated through the walls of the throat (sometimes
`called a voice bar). This occurs when the vocal cords are able to vibrate even though the
`vocal tract is closed at some point.
`Since the stop sounds are dynamical in nature, their properties are highly influenced
`by the vowel that follows the stop consonant. As such, the waveforms for stop consonants
`give little information about the particular stop consonant. Figure 2.26 shows the waveform
`of the syllable /a-b-a/. The waveform of /b/ shows few distinguishing features except for
`the voiced excitation and lack of high-frequency energy.
`
Figure 2.22 Waveforms for the sounds /f/, /s/ and /sh/ in the context /a-x-a/ where /x/ is the unvoiced fricative.

Figure 2.23 Spectrogram comparisons of the sounds /a-f-a/, /a-s-a/ and /a-sh-a/.
`
Figure 2.24 Waveforms for the sequences /a-v-a/ and /a-zh-a/.
`
The unvoiced stop consonants /p/, /t/, and /k/ are similar to their voiced counterparts
/b/, /d/, and /g/, with one major exception. During the period of total closure of the tract,
`as the pressure builds up, the vocal cords do not vibrate. Then, following the period of
`closure, as the air pressure is released, there is a brief interval of friction (due to sudden
`turbulence of the escaping air) followed by a period of aspiration (steady air flow from the
`glottis exciting the resonances of the vocal tract) before voiced excitation begins.
`Figure 2.27 shows waveforms and Figure 2.28 shows spectrograms of the voiced stop
`/b/ and the voiceless stop consonants /p/ and /t/. The "stop gap," or time interval, during
`which the pressure is built up is clearly in evidence. Also, it can be readily seen that the
`duration and frequency content of the frication noise and aspiration vary greatly with the
`stop consonant.
`
Figure 2.25 Spectrograms for the sequences /a-v-a/ and /a-zh-a/.

Figure 2.26 Waveform for the sequence /a-b-a/.

Figure 2.27 Waveforms for the sequences /a-p-a/ and /a-t-a/.
`
`2.4.8 Review Exercises
`
`As a self-check on the reader's understanding of the material on speech sounds and their
`acoustic manifestations, we now digress and present some simple exercises along with the
`solutions. For maximum effectiveness, the reader is encouraged to think through each
`exercise before looking at the solution.
`Exercise 2.1
`1. Write out the phonetic transcription for the following words:
`he, eats, several, light, tacos
`
`2. What effect occurs when these five words are spoken in sequence as a sentence? What
`does this imply about automatic speech recognition?
`
Figure 2.28 Spectrogram comparisons of the sequences of voiced (/a-b-a/) and voiceless (/a-p-a/ and /a-t-a/) stop consonants.
`
`Solution 2.1
1. The phonetic transcriptions of the words are

Word      Phoneme Sequence   ARPABET
he        /hi/               HH-IY
eats      /its/              IY-T-S
several   /sɛvrʌl/           S-EH-V-R-AH-L
light     /laYt/             L-AY-T
tacos     /takoz/            T-AA-K-OW-Z

2. When the words are spoken together, the last sound of each word merges with the
first sound of the succeeding word (since they are the same sound), resulting in strong
coarticulation of boundary sounds. The ARPABET transcription for the sentence is:
HH-IY-T-S-EH-V-R-AH-L-AY-T-AA-K-OW-Z
All information about word boundaries is totally lost; furthermore, the durations of
the common sounds at the boundaries of words are much shorter than what would be
predicted from the individual words.
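The boundary merging described in Solution 2.1 can be sketched in a few lines. The function name is hypothetical, and the rule used here (drop a word-initial sound when it repeats the preceding sound) is only a simplification of the coarticulation effect, which in real speech also shortens and alters the merged sounds.

```python
def join_words(transcriptions):
    """Concatenate per-word ARPABET lists, merging identical sounds at word boundaries."""
    sentence = []
    for phones in transcriptions:
        for i, p in enumerate(phones):
            # Skip the first sound of a word if it repeats the last sound so far.
            if i == 0 and sentence and sentence[-1] == p:
                continue
            sentence.append(p)
    return sentence

words = {
    "he": ["HH", "IY"],
    "eats": ["IY", "T", "S"],
    "several": ["S", "EH", "V", "R", "AH", "L"],
    "light": ["L", "AY", "T"],
    "tacos": ["T", "AA", "K", "OW", "Z"],
}
print("-".join(join_words(words.values())))
```

The printed string is the sentence transcription given in the solution; nothing in it marks where one word ends and the next begins.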
`
`Exercise 2.2
`
`Some of the difficulties in large vocabulary speech recognition are related to the irregularities
`in the way basic speech sounds are combined to produce words. Exercise 2.2 highlights a
`couple of these difficulties.
`
`1. In word initial position of American English, which phoneme or phonemes can never
`occur? Which hardly ever occur?
`2. There are many word initial consonant clusters of length two, such as speak, drank,
`plead, and press. How many word initial consonant clusters of length three are there
`in American English? What general rule can you give about the sounds in each of the
`three positions?
`
`
`3. A nasal consonant can be combined with a stop consonant (e.g., camp, tend) in a limited
`number of ways. What general rule do such combinations obey? There are several
`notable exceptions to this general rule. Can you give a couple of exceptions? What
`kind of speaking irregularity often results from these exceptions?
`
`Solution 2.2
1. The only phoneme that never occurs in initial word position in English is the /ng/ sound
(e.g., sing). The only other sound that almost never occurs naturally in English, in
initial word position, is /zh/; the exceptions are some foreign words imported into English,
such as gendarme, which does have an initial /zh/.
2. The word initial consonant clusters of length three in English include

/spl/   split
/spr/   spring
/skw/   squirt
/skr/   script
/str/   string

The general rule for such clusters is

/s/ + unvoiced stop + semivowel

3. The general rule for a nasal-stop combination is that the nasal and stop have the
same place of articulation, e.g., front/lips (/mp/), mid/dental (/nt/), back/velar (/ng k/).
Exceptions occur in words like summed (/md/) or hanged (/ng d/) or dreamt (/mt/).
There is often a tendency to insert an extra stop in such situations (e.g., dreamt →
/drempt/).
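The two general rules of Solution 2.2 can be written as simple checks. The sound classes below follow the groupings used in this section; the function names and the plain-letter phoneme spellings are assumptions made for illustration.

```python
UNVOICED_STOPS = {"p", "t", "k"}
SEMIVOWELS = {"w", "l", "r", "y"}

# Place of articulation for nasals and stops, following the grouping in Solution 2.2.
PLACE = {
    "m": "lips", "p": "lips", "b": "lips",
    "n": "dental", "t": "dental", "d": "dental",
    "ng": "velar", "k": "velar", "g": "velar",
}

def fits_three_cluster_rule(cluster):
    """True if a three-sound cluster is /s/ + unvoiced stop + semivowel."""
    return (len(cluster) == 3 and cluster[0] == "s"
            and cluster[1] in UNVOICED_STOPS and cluster[2] in SEMIVOWELS)

def same_place(nasal, stop):
    """True if the nasal and the stop share a place of articulation."""
    return PLACE[nasal] == PLACE[stop]

for cluster in [("s", "p", "l"), ("s", "k", "w"), ("s", "t", "r")]:
    print(cluster, fits_three_cluster_rule(cluster))
for nasal, stop in [("m", "p"), ("n", "t"), ("ng", "k"), ("m", "d"), ("ng", "d"), ("m", "t")]:
    print(nasal, stop, same_place(nasal, stop))
```

The cluster rule is permissive: it also accepts combinations, such as ("s", "t", "l"), that do not appear in the list above, so it describes the form of the listed clusters rather than which clusters actually occur.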
`
`Exercise 2.3
An important speech task is accurate digit recognition. This exercise seeks to exploit knowledge
of acoustic phonetics to recognize first isolated digits, and next some simple connected
`digit strings. We first need a sound lexicon (a dictionary) for the digits. The sound lexicon
`describes the pronunciations of digits in terms of the basic sounds of English. Such a sound
`lexicon is given in Table 2.3. A single male adult talker (LRR) spoke each of the 11 digits in
`random sequence and in isolation, and spectrograms of these spoken utterances are shown in
`Figure 2.29. Figure 2.30 shows spectrograms of two connected digit sequences spoken by the
`same talker.
`1. Identify each of the 11 digits based on the acoustic properties of the sounds within the
`digit (as expressed in the sound lexicon). Remember that each digit was spoken exactly
`once.
`2. Try to identify the spoken digits in each of the connected digit strings.
`
`Solution 2.3
1. The digits of the top row are 3 and 7:
a. The digit 3 is cued by the distinctive brief initial fricative (/θ/), followed by the
semivowel /r/ where the second and third formants both get very low in frequency,
followed by the /i/ where F2 and F3 both become very high in frequency.
b. The digit 7 is cued by the strong /s/ frication at the beginning, the distinctive /ɛ/,
followed by the voiced fricative /v/, a short vowel /ə/ and ending in the strong
`
Figure 2.29 Spectrograms of the 11 isolated digits, 0 through 9 plus oh, in random sequence. (Axes: frequency in Hz versus time in seconds.)

TABLE 2.3. Sound Lexicon of Digits

Word    Sounds          ARPABET
Zero    /z I r o/       Z-IH-R-OW
One     /w ʌ n/         W-AH-N
Two     /t u/           T-UW
Three   /θ r i/         TH-R-IY
Four    /f o r/         F-OW-R
Five    /f aY v/        F-AY-V
Six     /s I k s/       S-IH-K-S
Seven   /s ɛ v ə n/     S-EH-V-AX-N
Eight   /eY t/          EY-T
Nine    /n aY n/        N-AY-N
Oh      /o/             OW

Figure 2.30 Spectrograms of two connected digit sequences. (Axes: frequency in Hz versus time in seconds.)
`
`nasal /n/.
The digits in the second row are 0 and 9:
`a. The initial /z/ is cued by the strong frication with the presence of voicing at low
`frequencies; the following /I/ is seen by the high F2 and F3, the /r/ is signaled by
`the low F2 and F3, and the diphthong /o/ is signaled by the gliding motion of F2
`and F3 toward an /u/-like sound.
`b. The digit 9 is cued by the distinct initial and final nasals /n/ and by the /aY / glide
`between the nasals.
`The digits in the third row are 1 and 5:
a. The digit 1 is cued by the strong initial semivowel /w/ with very low F2 and by the
`strong final nasal /n/.
`b. The digit 5 is cued by the weak initial frication of /f/, followed by the strong
`diphthong /aY / and ending in the very weak fricative /v/.
`
`
`The digits in the fourth row are 2 and 8:
`a. The digit 2 is cued by the strong /t/ burst and release followed by the glide to the
`/u/ sound.
`b. The digit 8 is cued by the initial weak diphthong /eY / followed by a clear stop gap
`of the /t/ and then the /t/ release.
`The digits in the fifth row are "oh" and 4:
a. The digit "oh" is virtually a steady sound with a slight gliding tendency toward /u/
at the end.
b. The digit 4 is cued by the weak initial fricative /f/, followed by the strong /o/ vowel
and ending with a classic /r/ where F2 and F3 merge together.
The digit in the last row is 6:
a. The digit 6 is cued by the strong /s/ frication at the beginning and end, and by the
steady vowel /I/ followed by the stop gap and release of the /k/.
`2. By examining the isolated digit sequences, one can eventually (with a lot of work and
`some good luck) conclude that the two sequences are
`
Row 1:   2-oh-1            (telephone area code)
Row 2:   5-8-2-3-3-1-6     (7-digit telephone number)

We will defer any explanation of how any reasonable person, or machine, could perform
this task until later in this book when we discuss connected word-recognition techniques.
The purpose of this exercise is to convince the reader how difficult a relatively simple
recognition task can be.
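To see what the sound lexicon contributes once the sounds are known, the sketch below splits an ARPABET phoneme sequence into digits using the pronunciations of Table 2.3. It assumes an error-free phoneme sequence with no merging of sounds at word boundaries, which is precisely what the spectrogram exercise shows to be unrealistic; the function name is hypothetical, and this is not the connected word-recognition technique deferred to later in the book.

```python
# ARPABET pronunciations from Table 2.3.
DIGIT_LEXICON = {
    "zero": ("Z", "IH", "R", "OW"), "one": ("W", "AH", "N"), "two": ("T", "UW"),
    "three": ("TH", "R", "IY"), "four": ("F", "OW", "R"), "five": ("F", "AY", "V"),
    "six": ("S", "IH", "K", "S"), "seven": ("S", "EH", "V", "AX", "N"),
    "eight": ("EY", "T"), "nine": ("N", "AY", "N"), "oh": ("OW",),
}

def parse_digits(phones):
    """Return every way to split a phoneme sequence into lexicon digits."""
    phones = tuple(phones)
    if not phones:
        return [[]]          # one parse of the empty sequence: no digits
    parses = []
    for word, pron in DIGIT_LEXICON.items():
        if phones[:len(pron)] == pron:
            # This digit matches the front; parse whatever follows it.
            for rest in parse_digits(phones[len(pron):]):
                parses.append([word] + rest)
    return parses

print(parse_digits("T-UW-OW-W-AH-N".split("-")))
```

With clean input the search is trivial and returns the single parse two, oh, one; the difficulty of the real task lies in obtaining such a phoneme sequence from the acoustic signal.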
`
`2.5 APPROACHES TO AUTOMATIC SPEECH RECOGNITION BY MACHINE
`
The material presented in the previous sections leads to a straightforward way of performing
speech recognition by machine whereby the machine attempts to decode the speech signal
in a sequential manner based on the observed acoustic features of the signal and the known
relations between acoustic features and phonetic symbols. This method, appropriately
called the acoustic-phonetic approach, is indeed viable and has been studied in great depth
for more than 40 years. However, for a variety of reasons, the acoustic-phonetic approach
has not achieved the same success in practical systems as have alternative methods. Hence,
in this section, we provide an overview of several proposed approaches to automatic speech
recognition by machine with the goal of providing some understanding as to the essentials
of each proposed method, and the basic strengths and weaknesses of each approach.
`Broadly speaking, there are three approaches to speech recognition, namely:
`
`1. the acoustic-phonetic approach
`2. the pattern recognition approach
`3. the artificial intelligence approach
`
The acoustic-phonetic approach is based on the theory of acoustic phonetics that postulates
that there exist finite, distinctive phonetic units in spoken language and that the phonetic
`
Figure 2.31 Phoneme lattice for word string.
`
units are broadly characterized by a set of properties that are manifest in the speech signal,
or its spectrum, over time. Even though the acoustic properties of phonetic units
are highly variable, both with speakers and with neighboring phonetic units (the so-called
coarticulation of sounds), it is assumed that the rules governing the variability are straightforward
and can readily be learned and applied in practical situations. Hence the first
step in the acoustic-phonetic approach to speech recognition is called a segmentation and
labeling phase because it involves segmenting the speech signal into discrete (in time)
regions where the acoustic properties of the signal are representative of one (or possibly
several) phonetic units (or classes), and then attaching one or more phonetic labels to each
segmented region according to the acoustic properties. To actually do speech recognition,
a second step is required. This second step attempts to determine a valid word (or string of
words) from the sequence of phonetic labels produced in the first step, which is consistent
with the constraints of the speech-recognition task (i.e., the words are drawn from a given
vocabulary, the word sequence makes syntactic sense and has semantic meaning, etc.).
To illustrate the steps involved in the acoustic-phonetic approach to speech recognition,
consider the phoneme lattice shown in Figure 2.31. (A phoneme lattice is the result of
the segmentation and labeling step of the recognition process and represents a sequential set
of phonemes that are likely matches to the spoken input speech.) The problem is to decode
the phoneme lattice into a word string (one or more words) such that every instant of time
is included in one of the phonemes in the lattice, and such that the word (or word sequence)
is valid according to rules of English syntax. (The symbol SIL stands for silence or a pause
between sounds or words; the vertical position in the lattice, at any time, is a measure of
the goodness of the acoustic match to the phonetic unit, with the highest unit having the
best match.) With a modest amount of searching, one can derive the appropriate phonetic
string SIL-AO-L-AX-B-AW-T corresponding to the word string "all about," with the
phonemes L, AX, and B having been second or third choices in the lattice and all other
phonemes having been first choices. This simple example illustrates well the difficulty in
decoding phonetic units into word strings. This is the so-called lexical access problem.
Interestingly, as we will see in the next section, the real problem with the acoustic-phonetic
approach to speech recognition is the difficulty in getting a reliable phoneme lattice for the
lexical access stage.
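The lexical access search can be sketched on a toy lattice. Everything in the example below is hypothetical: the lattice is simplified to aligned time slots with candidates listed best match first, the competing candidates (R, IH, EH, P) are invented, and the four-word lexicon stands in for a real vocabulary. Only the target string SIL-AO-L-AX-B-AW-T and the fact that L, AX, and B are second or third choices come from the text. The check against English syntax is omitted.

```python
from itertools import product

# Hypothetical lattice: one list per time slot, candidates ordered best match first.
LATTICE = [["SIL"], ["AO"], ["R", "L"], ["IH", "EH", "AX"], ["P", "B"], ["AW"], ["T"]]

# Hypothetical mini-lexicon (ARPABET pronunciations).
LEXICON = {"all": ("AO", "L"), "about": ("AX", "B", "AW", "T"),
           "out": ("AW", "T"), "it": ("IH", "T")}

def parse_words(phones):
    """Split phones into lexicon words; return the first parse found, or None."""
    if not phones:
        return []
    for word, pron in LEXICON.items():
        if phones[:len(pron)] == pron:
            rest = parse_words(phones[len(pron):])
            if rest is not None:
                return [word] + rest
    return None

def decode(lattice):
    """Try phoneme paths in order of increasing total rank; return the first that parses."""
    choices = product(*(range(len(slot)) for slot in lattice))
    for ranks in sorted(choices, key=sum):
        # Build the phoneme string for this path, ignoring silence.
        phones = tuple(slot[r] for slot, r in zip(lattice, ranks) if slot[r] != "SIL")
        words = parse_words(phones)
        if words:
            return words, ranks
    return None

print(decode(LATTICE))
```

On this toy lattice the all-first-choice path does not form words, and the search has to fall back to lower-ranked candidates before it finds "all about". Exhaustive enumeration is used only because the lattice is tiny; it would not scale to realistic lattices.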
The pattern-recognition approach to speech recognition is basically one in which the
speech patterns are used directly without explicit feature determination (in the acoustic-phonetic
sense) and segmentation. As in most pattern-recognition approaches, the method
has two steps, namely, training of speech patterns, and recognition of patterns via pattern
comparison. Speech "knowledge" is brought into the system via the training procedure.
The concept is that if enough versions of a pattern to be recognized (be it a sound, a word, a
phrase, etc.) are included in a training set provided to the algorithm, the training procedure
`
`
`should be able to adequately characterize the acoustic properties of the pattern (with no
`regard for or knowledge of any other pattern presented to the training procedure). This
`type of characterization of speech via training is called pattern classification because the
`machine learns which acoustic properties of the speech class are reliable and repeatable
`across all training tokens of the pattern. The utility of the method is the pattern-comparison
stage, which does a direct comparison of the unknown speech (the speech to be recognized),
`with each possible pattern learned in the training phase and classifies the unknown speech
`according to the goodness of match of the patterns.
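The two steps, training and pattern comparison, can be illustrated with a deliberately minimal sketch. It assumes each pattern is a fixed-length feature vector, that training simply averages the tokens of a class, and that goodness of match is Euclidean distance. None of these choices is prescribed by the text; real speech patterns vary over time and call for the methods developed later in the book. All names and training values are hypothetical.

```python
import math

def train(tokens_by_class):
    """Average the training tokens of each class into one reference pattern."""
    references = {}
    for label, tokens in tokens_by_class.items():
        dims = len(tokens[0])
        # Each class is characterized on its own, with no regard for the others.
        references[label] = tuple(sum(t[d] for t in tokens) / len(tokens)
                                  for d in range(dims))
    return references

def recognize(references, unknown):
    """Compare the unknown pattern with every reference and return the best match."""
    return min(references, key=lambda label: math.dist(unknown, references[label]))

# Hypothetical two-dimensional feature vectors (for example, F1 and F2 in Hz).
training = {
    "IY": [(260, 2300), (280, 2250), (275, 2320)],
    "AA": [(720, 1100), (740, 1080), (735, 1095)],
}
refs = train(training)
print(recognize(refs, (300, 2200)))
```

Note that no phonetic knowledge is coded by hand here: whatever the system knows about each class comes from the training tokens.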
`The pattern-recognition approach to speech recognition is the basis for the remainder
`of this book. Hence there will be a great deal of discussion and explanation of virtually every
`aspect of the procedure. However, at this point, suffice it to say that the pattern-recognition
`approach is the method of choice for speech recognition for three reasons:
`
1. Simplicity of use. The method is easy to understand,
`it is rich in mathematical and
`communication theory justification for individual procedures used in training and
`decoding, and it is widely used and understood.
2. Robustness and invariance to different speech vocabularies, users, feature sets, pattern
comparison algorithms and decision rules. This property makes the algorithm
`appropriate for a wide range of speech units (ranging from phonemelike units all the
`way through words, phrases, and sentences), word vocabularies,
`talker populations,
`background environments, transmission conditions, etc.
`3. Proven high performance. It will be shown that the pattern-recognition approach
`to speech recognition consistently provides high performance on any task that is
`reasonable for the technology and provides a clear path for extending the technology
`in a wide range of directions such that the performance degrades gracefully as the
`problem becomes more and more difficult.
`
`The so-called artificial intelligence approach to speech recognition is a hybrid of the
`acoustic-phonetic approach and the pattern-recognition approach in that it exploits ideas and
`concepts of both methods. The artificial intelligence approach attempts to mechanize the
`recognition procedure according to the way a person applies its intelligence in visualizing,
`analyzing, and finally making a decision on the measured acoustic features. In particular,
`among the techniques used within this class of me
