`
`Chap. 2
`
`The Speech Signal
`
`and are seen as vertical striations in the spectrogram.
`The narrowband spectrogram (shown in the second panel of Figure 2.8) corresponds
`to performing a spectral analysis on 50-msec sections of waveform using a narrow analysis
`filter (40 Hz bandwidth), with the analysis again advancing in intervals of 1 msec. Because
`of the relatively narrow bandwidth of the analysis filters, individual spectral harmonics
`corresponding to the pitch of the speech waveform, during voiced regions, are resolved and
`are seen as almost-horizontal lines in the spectrogram. During periods of unvoiced speech,
`we see primarily high-frequency energy in the spectrograms; during periods of silence we
`essentially see no spectral activity (because of the reduced signal level).
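The resolving power of the 50-msec analysis can be checked numerically. The following is a minimal sketch, not the analysis used to produce Figure 2.8: the sampling rate, the 100-Hz pitch, the Hamming window, and the synthetic 20-harmonic signal are all assumptions chosen for illustration. It evaluates the windowed spectrum of one 50-msec section at a pitch harmonic and at a frequency midway between two harmonics.

```python
import cmath
import math

def frame_spectrum_db(frame, fs, freqs_hz):
    """Log magnitude (dB) of one Hamming-windowed frame at the given frequencies."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    levels = []
    for f in freqs_hz:
        # Direct evaluation of the Fourier sum at frequency f (no FFT needed here).
        acc = sum(frame[i] * window[i] * cmath.exp(-2j * math.pi * f * i / fs)
                  for i in range(n))
        levels.append(20 * math.log10(abs(acc) + 1e-12))
    return levels

FS = 10_000      # assumed sampling rate (Hz)
PITCH = 100.0    # assumed pitch of the synthetic voiced sound (Hz)
N = int(0.050 * FS)   # one 50-msec section, as in the narrowband analysis

# Synthetic "voiced" section: 20 equal-amplitude harmonics of the pitch.
section = [sum(math.cos(2 * math.pi * PITCH * k * i / FS) for k in range(1, 21))
           for i in range(N)]

on_harmonic_db, between_db = frame_spectrum_db(section, FS, [200.0, 250.0])
print(round(on_harmonic_db - between_db, 1), "dB between harmonic and valley")
```

Under these assumptions the level at the harmonic stands well above the level between harmonics, which is the numerical counterpart of the almost-horizontal harmonic lines described above.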
`A third way of representing the time-varying signal characteristics of speech is via a
`parameterization of the spectral activity based on the model of speech production. Because
the human vocal tract is essentially a tube, or concatenation of tubes, of varying cross-sectional
area that is excited either at one end (by the vocal cord puffs of air) or at a point
`along the tube (corresponding to turbulent air at a constriction), acoustic theory tells us that
`the transfer function of energy from the excitation source to the output can be described
`in terms of the natural frequencies or resonances of the tube. Such resonances are called
`formants for speech, and they represent the frequencies that pass the most acoustic energy
`from the source to the output. Typically there are about three resonances of significance, for
`a human vocal tract, below about 3500 Hz. Figure 2.9 [5] shows a wideband spectrogram,
`along with the computed formant frequency estimates, for the utterance "Why do I owe
`you a letter," spoken by a male speaker. There is a good correspondence between the
`estimated formant frequencies and the points of high spectral energy in the spectrogram.
The formant frequency representation is a highly efficient, compact representation of the
`time-varying characteristics of speech. The major problem, however, is the difficulty of
`reliably estimating the formant frequencies for low-level voiced sounds, and the difficulty
`of defining the formants for unvoiced or silence regions. As such, this representation is
`more of theoretical than of practical interest.
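The statement that about three resonances lie below 3500 Hz can be illustrated with the simplest tube model. The sketch below assumes a lossless uniform tube closed at the source end and open at the output, whose resonances are F_n = (2n - 1)c/(4L). The tube length (17.5 cm) and sound speed (350 m/s) are illustrative assumptions, and a real vocal tract is not uniform.

```python
def uniform_tube_resonances(length_m, speed_of_sound=350.0, max_hz=3500.0):
    """Resonances F_n = (2n - 1) c / (4 L) of a lossless uniform tube, below max_hz."""
    resonances = []
    n = 1
    while True:
        f = (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        if f >= max_hz:
            return resonances
        resonances.append(f)
        n += 1

# Assumed tube length of 17.5 cm (illustrative value, not taken from the text).
formants = uniform_tube_resonances(0.175)
print([round(f) for f in formants])
```

With these assumed values the model places resonances near 500, 1500, and 2500 Hz, that is, three resonances below 3500 Hz.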
Figures 2.10 and 2.11 show spectral and temporal representations of the phrase
"Should we chase," spoken by a male speaker, along with a detailed segmentation of the
waveform into individual sounds. The ultimate goal of speech recognition is to uniquely
and automatically provide such a segmentation and labeling of speech into constituent
sounds or sound groups such as words, then sentences. To understand the limitations on
this approach, we will next discuss, in detail, the general sounds of English and the relevant
acoustic and phonetic features of the sounds.
`
`2.4 SPEECH SOUNDS AND FEATURES
`
The number of linguistically distinct speech sounds (phonemes) in a language is often a
matter of judgment and varies from linguist to linguist. Table 2.1 shows a condensed
`list of phonetic symbols of American English, their ARPABET representation [6], and an
`example word in which the sound occurs. Shown in this table are 48 sounds, including 18
`vowels or vowel combinations (called diphthongs), 4 vowel-like consonants, 21 standard
`consonants, 4 syllabic sounds, and a phoneme referred to as a glottal stop (literally a symbol
`
`IPR2023-00037
`Apple EX1013 Page 51
`
`
`
`Sec. 2.4
`
`Speech Sounds and Features
`
[Figure 2.9: two panels, (a) wideband spectrogram and (b) formant frequency estimates, for "WHY DO I OWE YOU A LETTER"; vertical axis frequency (kHz), horizontal axis time (sec).]

Figure 2.9 Wideband spectrogram and formant frequency representation of the utterance "Why do I owe you a letter" (after Atal and Hanauer [5]).
`
`for a sound corresponding to a break in voicing within a sound).
Many of the sounds or phonemes shown in Table 2.1 are not considered standard;
they represent specialized cases such as the so-called barred I (/ɨ/) in the word roses. As
`such, a more standard representation of the basic sounds and sound classes of American
`English is shown in Figure 2.12. Here we see the conventional set of 11 vowels, classified
`as front, mid, or back, corresponding to the position of the tongue hump in producing the
`vowel; 4 vowel combinations or diphthongs; the 4 semivowels broken down into 2 liquids
`and 2 glides; the nasal consonants, the voiced and unvoiced stop consonants; the voiced
`and unvoiced fricatives; whisper; and the affricates. There are a total of 39 of the 48 sounds
`of Table 2.1 represented in Figure 2.12.
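The broad sound classes just described can be written down as a simple lookup. The sketch below is only an illustration of the classification, using the ARPABET codes of Table 2.1 as keys (an assumption, since the chart itself uses slightly different labels for some sounds); the names BROAD_CLASSES and broad_class are hypothetical.

```python
# Broad sound classes following the description of Figure 2.12.
# Keys are ARPABET codes as listed in Table 2.1 (an assumed labeling).
BROAD_CLASSES = {
    "front vowel": ["IY", "IH", "EH", "AE"],
    "mid vowel": ["AA", "ER", "AH", "AO"],   # AH stands for the AH/AX pair
    "back vowel": ["UW", "UH", "OW"],
    "diphthong": ["AY", "OY", "AW", "EY"],
    "semivowel": ["W", "L", "R", "Y"],
    "nasal": ["M", "N", "NX"],
    "voiced stop": ["B", "D", "G"],
    "unvoiced stop": ["P", "T", "K"],
    "voiced fricative": ["V", "DH", "Z", "ZH"],
    "unvoiced fricative": ["F", "TH", "S", "SH"],
    "whisper": ["HH"],
    "affricate": ["JH", "CH"],
}

# Invert the table so each phoneme maps to its broad class.
PHONE_TO_CLASS = {p: c for c, phones in BROAD_CLASSES.items() for p in phones}

def broad_class(phone):
    """Return the broad class of an ARPABET code, if it appears in the chart."""
    return PHONE_TO_CLASS.get(phone.upper(), "not in Figure 2.12")

print(len(PHONE_TO_CLASS), broad_class("sh"), broad_class("EL"))
```

The lookup contains 39 entries, in agreement with the count given above; the syllabic sounds and other specialized symbols of Table 2.1 fall outside it.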
`
`2.4.1 The Vowels
`
`The vowel sounds are perhaps the most interesting class of sounds in English. Their
`importance to the classification and representation of written text is very low; however,
`most practical speech-recognition systems rely heavily on vowel recognition to achieve
`
[Figure 2.10: wideband spectrogram (frequency, 0 to 5000 Hz) and intensity contour (amplitude) of "SHOULD WE CHASE" versus time (0.0 to 0.6 sec).]

Figure 2.10 Wideband spectrogram and intensity contour of the phrase "Should we chase."
`
`high performance. To partially illustrate this point, consider the following sections of text:
`
`Section I
`
Th_y n_t_d s_gn_f_c_nt _mpr_v_m_nts _n th_ c_mp_ny's _m_g_, s_p_rv_s__n,
th__r w_rk_ng c_nd_t__ns, b_n_f_ts _nd _pp_rt_n_t__s f_r gr_wth.

Section II

A__i_u_e_ _o_a__ _a_ __a_e_ e__e__ia___ __e _a_e, _i__ __e __o_e_ o_
o__u_a_io_a_ e___o_ee_ __i_____ _e__ea_i__.
`
`In Section I we have omitted the conventional vowel letters (a,e,i,o,u); however, with a
`little effort the average reader can "fill in" the missing vowels and decode the section so
`that it reads
`
They noted significant improvements in the company's image, supervision,
their working conditions, benefits and opportunities for growth.

In Section II we have omitted the conventional consonant letters; the resulting text is
essentially not decodable. The actual text is
`
Attitudes toward pay stayed essentially the same, with the scores of
occupational employees slightly decreasing.

[Figure 2.11: waveform of "SHOULD WE CHASE" displayed in successive 100-msec lines, with labels marking the constituent sounds.]

Figure 2.11 The speech waveform and a segmentation and labeling of the constituent sounds of the phrase "Should we chase."
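The two masked sections can be reproduced mechanically. The following is a small sketch of the masking, assuming (as the printed sections suggest) that only the letters a, e, i, o, u count as vowels and that every other letter, including y, counts as a consonant; the function names are hypothetical.

```python
VOWEL_LETTERS = set("aeiouAEIOU")

def omit_vowels(text):
    """Replace each conventional vowel letter with an underscore (as in Section I)."""
    return "".join("_" if ch in VOWEL_LETTERS else ch for ch in text)

def omit_consonants(text):
    """Replace each consonant letter with an underscore (as in Section II)."""
    return "".join("_" if ch.isalpha() and ch not in VOWEL_LETTERS else ch
                   for ch in text)

print(omit_vowels("They noted significant"))
print(omit_consonants("Attitudes toward pay"))
```

Applying both functions to the same sentence makes it easy to compare how much of the written text survives each kind of omission.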
`
`In speaking, vowels are produced by exciting an essentially fixed vocal tract shape
`with quasi-periodic pulses of air caused by the vibration of the vocal cords. The way
`in which the cross-sectional area varies along the vocal tract determines the resonance
`frequencies of the tract (the formants) and thereby the sound that is produced. The vowel
`sound produced is determined primarily by the position of the tongue, but the positions of
`the jaw, lips, and to a small extent, the velum, also influence the resulting sound.
`
TABLE 2.1. A condensed list of phonetic symbols for American English.

Phoneme     ARPABET  Example     Phoneme        ARPABET  Example
/i/         IY       beat        /ŋ/            NX       sing
/I/         IH       bit         /p/            P        pet
/e/ (eY)    EY       bait        /t/            T        ten
/ɛ/         EH       bet         /k/            K        kit
/æ/         AE       bat         /b/            B        bet
/a/         AA       Bob         /d/            D        debt
/ʌ/         AH       but         /g/            G        get
/ɔ/         AO       bought      /h/            HH       hat
/o/ (ow)    OW       boat        /f/            F        fat
/U/         UH       book        /θ/            TH       thing
/u/         UW       boot        /s/            S        sat
/ə/         AX       about       /š/ (sh)       SH       shut
/ɨ/         IX       roses       /v/            V        vat
/ɝ/         ER       bird        /ð/            DH       that
/ɚ/         AXR      butter      /z/            Z        zoo
/aw/        AW       down        /ž/ (zh)       ZH       azure
/aY/        AY       buy         /č/ (tsh)      CH       church
/ɔY/        OY       boy         /ǰ/ (dzh, j)   JH       judge
/y/         Y        you         /ʍ/            WH       which
/w/         W        wit         /l̩/            EL       battle
/r/         R        rent        /m̩/            EM       bottom
/l/         L        let         /n̩/            EN       button
/m/         M        met         /ɾ/            DX       batter
/n/         N        net         /ʔ/            Q        (glottal stop)
`The vowels are generally long in duration (as compared to consonant sounds) and
`are spectrally well defined. As such they are usually easily and reliably recognized and
`therefore contribute significantly to our ability to recognize speech, both by human beings
`and by machine.
There are several ways to characterize and classify vowels, including the typical
articulatory configuration required to produce the sounds, typical waveform plots, and
typical spectrogram plots. Figures 2.13-2.15 show typical articulatory configurations of the
vowels (2.13), examples of vowel waveforms (2.14), and examples of vowel spectrograms
(2.15). A convenient and simplified way of classifying vowel articulatory configurations is
in terms of the tongue hump position (i.e., front, mid, back), and tongue hump height (high,
mid, low), where the tongue hump is the mass of the tongue at its narrowest constriction
within the vocal tract. According to this classification the vowels /i/, /I/, /æ/, and /ɛ/ are
front vowels (with different tongue heights), /a/, /ʌ/, and /ɔ/ are mid vowels, and /U/, /u/,
and /o/ are back vowels (see also Figure 2.12).
As shown in the acoustic waveform plots of the vowels in Figure 2.14, the front
vowels show a pronounced, high-frequency resonance, the mid vowels show a balance
of energy over a broad frequency range, and the back vowels show a predominance of
low-frequency spectral information. This behavior is evidenced in the vowel spectrogram
`
[Figure 2.12 chart:]

PHONEMES
  VOWELS
    FRONT: i (IY), I (IH), ɛ (EH), æ (AE)
    MID: a (AA), ɝ (ER), ʌ, ə (AH, AX), ɔ (AO)
    BACK: u (UW), U (UH), o (OW)
  DIPHTHONGS: aY (AY), ɔY (OY), aw (AW), eY (EY)
  SEMIVOWELS (LIQUIDS and GLIDES): w (W), l (L), r (R), y (Y)
  CONSONANTS
    NASALS: m (M), n (N), ŋ (NG)
    STOPS
      VOICED: b (B), d (D), g (G)
      UNVOICED: p (P), t (T), k (K)
    FRICATIVES
      VOICED: v (V), ð (TH), z (Z), ž, zh (ZH)
      UNVOICED: f (F), θ (THE), s (S), š, sh (SH)
    WHISPER: h (H)
    AFFRICATES: ǰ (JH), č (CH)

Figure 2.12 Chart of the classification of the standard phonemes of American English into broad sound classes.

[Figure 2.13: vocal tract outlines for i (EVE), I (IT), e (HATE), æ (AT), a (FATHER), ɔ (ALL), o (OBEY), U (FOOT), u (BOOT), ʌ (UP), ɝ (BIRD).]

Figure 2.13 Articulatory configurations for typical vowel sounds (after Flanagan [3]).

Figure 2.14 Acoustic waveform plots of typical vowel sounds.
`
plots of Figure 2.15, in which the front vowels show a relatively high second and third
formant frequency (resonance), whereas the mid vowels show well-separated and balanced
locations of the formants, and the back vowels (especially /u/) show almost no energy
beyond the low-frequency region with low first and second formant frequencies.
The concept of a "typical" vowel sound is, of course, unreasonable in light of the
variability of vowel pronunciation among men, women, and children with different regional
accents and other variable characteristics. To illustrate this point, Figure 2.16 shows a
classic plot, made by Gordon Peterson and Harold Barney, of measured values of the first
and second formant for 10 vowels spoken by a wide range of male and female talkers
who attended the 1939 World's Fair in New York City [7]. A wide range of variability can
be seen in the measured formant frequencies for a given vowel sound, and also there is
`
[Figure 2.15: spectrogram panels for the vowels; vertical axis frequency (0 to 5000 Hz), horizontal axis time (0.0 to 0.4 sec).]

Figure 2.15 Spectrograms of the vowel sounds.

[Figure 2.16: scatter plot with ellipses; vertical axis frequency of F2 (500 to 4000 Hz), horizontal axis frequency of F1 (0 to 1400 Hz).]

Figure 2.16 Measured frequencies of first and second formants for a wide range of talkers for several vowels (after Peterson & Barney [7]).
`
overlap between the formant frequencies for different vowel sounds by different talkers.
The ellipses drawn in this figure represent gross characterizations of the regions in which
most of the tokens of the different vowels lie. The message of Figure 2.16, for speech
recognition by machine, is fairly clear: accurately measuring formant frequencies or
spectral peaks is not, by itself, sufficient to classify vowel sounds accurately; one
`
[Figure 2.17: F2 (Hz, 800 to 2400) versus F1 (Hz, 200 to 800), with labeled centroids including IY (i), EH (ɛ), AE (æ), ER (ɝ), AH (ʌ), and AO (ɔ).]

Figure 2.17 The vowel triangle with centroid positions of the common vowels.
`
`must do some type of talker (accent) normalization to account for the variability in formants
`and overlap between vowels.
A common way of exploiting the information embodied in Figures 2.15 and 2.16 is to
represent each vowel by a centroid in the formant space with the realization that the centroid,
at best, represents average behavior and does not represent variability across talkers. Such
a representation leads to the classic vowel triangle shown in Figure 2.17 and represented
in terms of formant positions by the data given in Table 2.2. The vowel triangle represents
the extremes of formant locations in the F1-F2 plane, as represented by /i/ (low F1, high
F2), /u/ (low F1, low F2), and /a/ (high F1, low F2), with other vowels appropriately placed
`with respect to the triangle vertices. The utility of the formant frequencies of Table 2.2 has
`been demonstrated in text-to-speech synthesis in which high-quality vowel sounds have
`been synthesized using these positions for the resonances [8].
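The centroid representation suggests a very simple classifier: assign a measured (F1, F2) pair to the nearest centroid. The sketch below uses the F1 and F2 values of Table 2.2; the choice of a log-frequency distance and the function name nearest_vowel are assumptions made for illustration, and, as noted above, such a rule ignores talker variability entirely.

```python
import math

# (F1, F2) centroids in Hz, copied from Table 2.2.
VOWEL_CENTROIDS = {
    "IY": (270, 2290), "IH": (390, 1990), "EH": (530, 1840), "AE": (660, 1720),
    "AH": (520, 1190), "AA": (730, 1090), "AO": (570, 840), "UH": (440, 1020),
    "UW": (300, 870), "ER": (490, 1350),
}

def nearest_vowel(f1_hz, f2_hz):
    """Return the ARPABET label whose centroid is closest in log-frequency distance."""
    def distance(label):
        c1, c2 = VOWEL_CENTROIDS[label]
        # Distance on a log scale (an assumed metric, not taken from the text).
        return math.hypot(math.log(f1_hz / c1), math.log(f2_hz / c2))
    return min(VOWEL_CENTROIDS, key=distance)

print(nearest_vowel(310, 900), nearest_vowel(700, 1100))
```

Because the regions of Figure 2.16 overlap, a token from one talker can easily fall nearest to the centroid of a different vowel, which is why some form of talker normalization is needed in practice.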
`
`2.4.2 Diphthongs
`
Although there is some ambiguity and disagreement as to what is and what is not a
diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech
TABLE 2.2. Formant frequencies for typical vowels.

ARPABET Symbol   IPA Symbol   Typical Word   F1 (Hz)   F2 (Hz)   F3 (Hz)
IY               /i/          beet           270       2290      3010
IH               /I/          bit            390       1990      2550
EH               /ɛ/          bet            530       1840      2480
AE               /æ/          bat            660       1720      2410
AH               /ʌ/          but            520       1190      2390
AA               /a/          hot            730       1090      2440
AO               /ɔ/          bought         570       840       2410
UH               /U/          foot           440       1020      2240
UW               /u/          boot           300       870       2240
ER               /ɝ/          bird           490       1350      1690
`
sound that starts at or near the articulatory position for one vowel and moves to or toward
the position for another. According to this definition, there are six diphthongs in American
English, namely /aY/ (as in buy), /aw/ (as in down), /eY/ (as in bait), /ɔY/ (as in boy),
/o/ (as in boat), and /ju/ (as in you).
The diphthongs are produced by varying the vocal tract smoothly between vowel
configurations appropriate to the diphthong. Figure 2.18 shows spectrogram plots of four
of the diphthongs spoken by a male talker. The gliding motions of the formants are
especially prominent for the sounds /aY/, /aw/, and /ɔY/ and are somewhat weaker for
/eY/ because of the closeness (in vowel space) of the two vowel sounds comprising this
diphthong.
An alternative way of displaying the time-varying spectral characteristics of diphthongs
is via a plot of the values of the second formant versus the first formant (implicitly
as a function of time) as shown in Figure 2.19 [9]. The arrows in this figure indicate the
direction of motion of the formants (in the F1-F2 plane) as time increases. The dashed
circles in this figure indicate average positions of the vowels. Based on these data, and
other measurements, the diphthongs can be characterized by a time-varying vocal tract area
function that varies between two vowel configurations.
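The idea of a diphthong as motion between two vowel configurations can be sketched as a path in the F1-F2 plane. The example below is an idealization: it assumes a straight-line glide from the /a/ centroid toward the /i/ centroid of Table 2.2 as a stand-in for /aY/, whereas measured trajectories such as those of Figure 2.19 are curved and need not reach the second vowel. The function name and frame count are hypothetical.

```python
def diphthong_track(start, end, n_frames=5):
    """Linearly interpolate (F1, F2) from a start vowel toward an end vowel."""
    (s1, s2), (e1, e2) = start, end
    return [(s1 + (e1 - s1) * k / (n_frames - 1),
             s2 + (e2 - s2) * k / (n_frames - 1))
            for k in range(n_frames)]

AA = (730, 1090)   # /a/ centroid (F1, F2) in Hz, from Table 2.2
IY = (270, 2290)   # /i/ centroid (F1, F2) in Hz, from Table 2.2

# Assumed endpoints for an idealized /aY/ glide.
track_ay = diphthong_track(AA, IY)
for f1, f2 in track_ay:
    print(round(f1), round(f2))
```

Along this assumed path F1 falls while F2 rises, which is the kind of formant motion the arrows in an F1-F2 plot are meant to convey.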
`
`2.4.3 Semivowels
`
`The group of sounds consisting of /w/, /1/, /r/, and /y/ is quite difficult to characterize.
`These sounds are called semivowels because of their vowel-like nature. They are generally
`characterized by a gliding transition in vocal tract area function between adjacent phonemes.
`Thus the acoustic characteristics of these sounds are strongly influenced by the context in
`which they occur. For our purposes, they are best described as transitional, vowel-like
`sounds, and hence are similar in nature to the vowels and diphthongs.
`
[Figure 2.18: four spectrogram panels, one per diphthong (including aY, eY, and ɔY); vertical axis frequency (0 to 5000 Hz), horizontal axis time (sec).]

Figure 2.18 Spectrogram plots of four diphthongs.
`
`2.4.4 Nasal Consonants
`
The nasal consonants /m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal
tract totally constricted at some point along the oral passageway. The velum is lowered so
that air flows through the nasal tract, with sound being radiated at the nostrils. The oral
cavity, although constricted toward the front, is still acoustically coupled to the pharynx.
Thus, the mouth serves as a resonant cavity that traps acoustic energy at certain natural
frequencies. As far as the radiated sound is concerned, these resonant frequencies of the
oral cavity appear as antiresonances, or zeros of the transfer function of sound transmission.
Furthermore, nasal consonants and nasalized vowels (i.e., some vowels preceding or
following nasal consonants) are characterized by resonances that are spectrally broader, or
more highly damped, than those for vowels.
The three nasal consonants are distinguished by the place along the oral tract at
which a total constriction is made. For /m/ the constriction is at the lips; for /n/ the
constriction is just behind the teeth; and for /ŋ/ the constriction is just forward of the
velum itself. Figure 2.20 shows typical speech waveforms and Figure 2.21 spectrograms
for two nasal consonants in the context vowel-nasal-vowel. The waveforms of /m/ and /n/
look very similar. The spectrograms show a concentration of low-frequency energy with a
midrange of frequencies that contain no prominent peaks. This is because of the particular
`
[Figure 2.19: trajectories of the diphthongs in the F1-F2 plane, with arrows showing the direction of motion and dashed circles marking average vowel positions; both axes are formant frequency (Hz).]

Figure 2.19 Time variation of the first two formants for the diphthongs (after Holbrook and Fairbanks [9]).
`
`combination of resonances and antiresonances that result from the coupling of the nasal
`and oral tracts.
`
`2.4.5 Unvoiced Fricatives
`
The unvoiced fricatives /f/, /θ/, /s/, and /sh/ are produced by exciting the vocal tract by a
steady air flow, which becomes turbulent in the region of a constriction in the vocal tract.
The location of the constriction serves to determine which fricative sound is produced.
For the fricative /f/ the constriction is near the lips; for /θ/ it is near the teeth; for /s/ it
is near the middle of the oral tract; and for /sh/ it is near the back of the oral tract. Thus
the system for producing unvoiced fricatives consists of a source of noise at a constriction,
which separates the vocal tract into two cavities. Sound is radiated from the lips, that is,
from the front cavity. The back cavity serves, as in the case of nasals, to trap energy and
thereby introduce antiresonances into the vocal output. Figure 2.22 shows the waveforms
and Figure 2.23 the spectrograms of the fricatives /f/, /s/ and /sh/. The nonperiodic nature
`
[Figure 2.20: waveform panels (a) /a-m-a/ and (b) /a-n-a/, each shown in 100-msec lines.]

Figure 2.20 Waveforms for the sequences /a-m-a/ and /a-n-a/.
`
`of fricative excitation is obvious in the waveform plots. The spectral differences among
`the fricatives are readily seen by comparing the three spectrograms.
`
`2.4.6 Voiced Fricatives
`
The voiced fricatives /v/, /th/ (that is, /ð/), /z/, and /zh/ are the counterparts of the unvoiced fricatives /f/,
/θ/, /s/, and /sh/, respectively, in that the place of constriction for each of the corresponding
`phonemes is essentially identical. However, the voiced fricatives differ markedly from
`their unvoiced counterparts in that two excitation sources are involved in their production.
`For voiced fricatives the vocal cords are vibrating, and thus one excitation source is at
`the glottis. However, since the vocal tract is constricted at some point forward of the
`glottis, the air flow becomes turbulent in the neighborhood of the constriction. Thus the
`
[Figure 2.21: spectrogram (frequency, 0 to 5000 Hz) and waveform amplitude panels for /a-m-a/ and /a-n-a/ versus time (0.0 to 0.6 sec).]

Figure 2.21 Spectrograms of the sequences /a-m-a/ and /a-n-a/.
`
`spectra of voiced fricatives can be expected to display two distinct components. These
`excitation features are readily observable in Figure 2.24, which shows typical waveforms,
and in Figure 2.25, which shows spectrograms for two voiced fricatives. The similarity of the
unvoiced fricative /f/ to the voiced fricative /v/ is easily shown in a comparison between
`corresponding spectrograms in Figures 2.23 and 2.25. Likewise, it is instructive to compare
`the spectrograms of /sh/ and /zh/.
`
`2.4.7 Voiced and Unvoiced Stops
`
`The voiced stop consonants /b/, /d/, and /g/, are transient, noncontinuant sounds produced
`by building up pressure behind a total constriction somewhere in the oral tract and then
`suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ the constriction
`is at the back of the teeth; and for /g/ it is near the velum. During the period when there is
`total constriction in the tract, no sound is radiated from the lips. However, there is often a
`small amount of low-frequency energy radiated through the walls of the throat (sometimes
`called a voice bar). This occurs when the vocal cords are able to vibrate even though the
`vocal tract is closed at some point.
`Since the stop sounds are dynamical in nature, their properties are highly influenced
`by the vowel that follows the stop consonant. As such, the waveforms for stop consonants
`give little information about the particular stop consonant. Figure 2.26 shows the waveform
`of the syllable /a-b-a/. The waveform of /b/ shows few distinguishing features except for
`the voiced excitation and lack of high-frequency energy.
`
[Figure 2.22: waveform panels for /a-f-a/, /a-s-a/, and /a-sh-a/, each shown in 100-msec lines.]

Figure 2.22 Waveforms for the sounds /f/, /s/ and /sh/ in the context /a-x-a/ where /x/ is the unvoiced fricative.

[Figure 2.23: spectrogram (frequency, 0 to 5000 Hz) and waveform amplitude panels for the three sequences versus time (sec).]

Figure 2.23 Spectrogram comparisons of the sounds /a-f-a/, /a-s-a/ and /a-sh-a/.

[Figure 2.24: waveform panels (a) /a-v-a/ and (b) /a-zh-a/, each shown in 100-msec lines.]

Figure 2.24 Waveforms for the sequences /a-v-a/ and /a-zh-a/.
`
`The unvoiced stop consonants /p/, /t/, and /k/ are similar to their voiced counterparts
/b/, /d/, and /g/, with one major exception. During the period of total closure of the tract,
`as the pressure builds up, the vocal cords do not vibrate. Then, following the period of
`closure, as the air pressure is released, there is a brief interval of friction (due to sudden
`turbulence of the escaping air) followed by a period of aspiration (steady air flow from the
`glottis exciting the resonances of the vocal tract) before voiced excitation begins.
`Figure 2.27 shows waveforms and Figure 2.28 shows spectrograms of the voiced stop
`/b/ and the voiceless stop consonants /p/ and /t/. The "stop gap," or time interval, during
`which the pressure is built up is clearly in evidence. Also, it can be readily seen that the
`duration and frequency content of the frication noise and aspiration vary greatly with the
`stop consonant.
`
[Figure 2.25: spectrogram (frequency, 0 to 5000 Hz) and waveform amplitude panels for /a-v-a/ and /a-zh-a/ versus time (0.0 to 0.6 sec).]

Figure 2.25 Spectrograms for the sequences /a-v-a/ and /a-zh-a/.

[Figure 2.26: waveform of /a-b-a/ shown in 100-msec lines.]

Figure 2.26 Waveform for the sequence /a-b-a/.

[Figure 2.27: waveform panels (a) /a-p-a/ and (b) /a-t-a/, each shown in 100-msec lines.]

Figure 2.27 Waveforms for the sequences /a-p-a/ and /a-t-a/.
`
`2.4.8 Review Exercises
`
`As a self-check on the reader's understanding of the material on speech sounds and their
`acoustic manifestations, we now digress and present some simple exercises along with the
`solutions. For maximum effectiveness, the reader is encouraged to think through each
`exercise before looking at the solution.
`Exercise 2.1
`1. Write out the phonetic transcription for the following words:
`he, eats, several, light, tacos
`
`2. What effect occurs when these five words are spoken in sequence as a sentence? What
`does this imply about automatic speech recognition?
`
[Figure 2.28: spectrogram (frequency, 0 to 5000 Hz) and waveform amplitude panels for /a-b-a/, /a-p-a/, and /a-t-a/ versus time (0.0 to 0.6 sec).]

Figure 2.28 Spectrogram comparisons of the sequences of voiced (/a-b-a/) and voiceless (/a-p-a/ and /a-t-a/) stop consonants.
`
Solution 2.1
1. The phonetic transcriptions of the words are

Word       Phoneme Sequence   ARPABET
he         /hi/               HH-IY
eats       /its/              IY-T-S
several    /sɛvrʌl/           S-EH-V-R-AH-L
light      /laYt/             L-AY-T
tacos      /takoz/            T-AA-K-OW-Z

2. When the words are spoken together, the last sound of each word merges with the
first sound of the succeeding word (since they are the same sound), resulting in strong
coarticulation of boundary sounds. The ARPABET transcription for the sentence is:
HH-IY-T-S-EH-V-R-AH-L-AY-T-AA-K-OW-Z
All information about word boundaries is totally lost; furthermore, the durations of
the common sounds at the boundaries of words are much shorter than what would be
predicted from the individual words.
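The merging of boundary sounds in Solution 2.1 can be mimicked with a few lines of code. This is only a symbolic sketch, assuming that identical phonemes at a word boundary collapse into a single one; it says nothing about the shortened durations, and the names WORDS and merge_words are hypothetical.

```python
def merge_words(transcriptions):
    """Concatenate per-word phoneme lists, collapsing identical boundary phonemes."""
    merged = []
    for phones in transcriptions:
        # If this word starts with the sound the previous word ended on, drop it.
        if merged and phones and merged[-1] == phones[0]:
            phones = phones[1:]
        merged.extend(phones)
    return merged

# Per-word ARPABET transcriptions from Solution 2.1.
WORDS = {
    "he": ["HH", "IY"],
    "eats": ["IY", "T", "S"],
    "several": ["S", "EH", "V", "R", "AH", "L"],
    "light": ["L", "AY", "T"],
    "tacos": ["T", "AA", "K", "OW", "Z"],
}

sentence = merge_words(WORDS[w] for w in ["he", "eats", "several", "light", "tacos"])
print("-".join(sentence))
```

The merged string contains 15 phonemes instead of the 19 obtained by simple concatenation, and nothing in it marks where one word ends and the next begins.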
`
`Exercise 2.2
`
`Some of the difficulties in large vocabulary speech recognition are related to the irregularities
`in the way basic speech sounds are combined to produce words. Exercise 2.2 highlights a
`couple of these difficulties.
`
`1. In word initial position of American English, which phoneme or phonemes can never
`occur? Which hardly ever occur?
`2. There are many word initial consonant clusters of length two, such as speak, drank,
`plead, and press. How many word initial consonant clusters of length three are there
`in American English? What general rule can you give about the sounds in each of the
`three positions?
`
`
`3. A nasal consonant can be combined with a stop consonant (e.g., camp, tend) in a limited
`number of ways. What general rule do such combinations obey? There are several
`notable exceptions to this general rule. Can you give a couple of exceptions? What
`kind of speaking irregularity often results from these exceptions?
`
`Solution 2.2
1. The only phoneme that never occurs in initial word position in English is the /ng/ sound
(e.g., sing). The only other sound that almost never occurs in initial word position in
English is /zh/; the exceptions are foreign words imported into English, such as
gendarme, which does have an initial /zh/.
2. The word initial consonant clusters of length three in English include

/spl/   split
/spr/   spring
/skw/   squirt
/skr/   script
/str/   string

The general rule for such clusters is

/s/ + unvoiced stop + semivowel

3. The general rule for a nasal-stop combination is that the nasal and stop have the
same place of articulation, e.g., front/lips (/mp/), mid/dental (/nt/), back/velar (/ng k/).
Exceptions occur in words like summed (/md/) or hanged (/ng d/) or dreamt (/mt/).
There is often a tendency to insert an extra stop in such situations (e.g., dreamt →
/drempt/).
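The two general rules of Solution 2.2 are easy to state as checks on ARPABET symbols. The sketch below is an illustration under stated assumptions: the place-of-articulation table covers only the nasals and stops named above, the function names are hypothetical, and the cluster check tests the general pattern only, without claiming that every combination that fits the pattern actually occurs in English.

```python
UNVOICED_STOPS = {"P", "T", "K"}
SEMIVOWELS = {"W", "L", "R", "Y"}

# Place of articulation for the nasals and stops discussed in Solution 2.2.
PLACE = {"M": "lips", "P": "lips", "B": "lips",
         "N": "dental", "T": "dental", "D": "dental",
         "NX": "velar", "K": "velar", "G": "velar"}

def fits_three_cluster_rule(c1, c2, c3):
    """True if the cluster has the form /s/ + unvoiced stop + semivowel."""
    return c1 == "S" and c2 in UNVOICED_STOPS and c3 in SEMIVOWELS

def is_homorganic(nasal, stop):
    """True if the nasal and the stop share a place of articulation."""
    return PLACE[nasal] == PLACE[stop]

print(fits_three_cluster_rule("S", "P", "L"))   # split
print(is_homorganic("M", "P"))                  # camp
print(is_homorganic("M", "T"))                  # dreamt
```

Pairs that fail the second check, such as the /mt/ of dreamt, are exactly the cases in which an extra stop tends to be inserted.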
`
`Exercise 2.3
An important speech task is accurate digit recognition. This exercise seeks to exploit knowledge
of acoustic phonetics to recognize first isolated digits, and next some simple connected
`digit strings. We first need a sound lexicon (a dictionary) for the digits. The sound lexicon
`describes the pronunciations of digits in terms of the basic sounds of English. Such a sound
`lexicon is given in Table 2.3. A single male adult talker (LRR) spoke each of the 11 digits in
`random sequence and in isolation, and spectrograms of these spoken utterances are shown in
`Figure 2.29. Figure 2.30 shows spectrograms of two connected digit sequences spoken b