`
`Chap. 2
`
`The Speech Signal
`
`and are seen as vertical striations in the spectrogram.
`The narrowband spectrogram (shown in the second panel of Figure 2.8) corresponds
`to performing a spectral analysis on 50-msec sections of waveform using a narrow analysis
`filter (40 Hz bandwidth), with the analysis again advancing in intervals of 1 msec. Because
`of the relatively narrow bandwidth of the analysis filters, individual spectral harmonics
`corresponding to the pitch of the speech waveform, during voiced regions, are resolved and
`are seen as almost-horizontal lines in the spectrogram. During periods of unvoiced speech,
`we see primarily high-frequency energy in the spectrograms; during periods of silence we
`essentially see no spectral activity (because of the reduced signal level).
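The link between the 50-msec analysis section and resolved pitch harmonics can be checked numerically. The sketch below is only an illustration under stated assumptions: an 8 kHz sampling rate, a 100 Hz pitch, ten equal-amplitude harmonics, and a Hamming window standing in for the narrow analysis filter (none of these values come from the text). It compares the windowed spectrum magnitude at a pitch harmonic (200 Hz) with the magnitude midway between two harmonics (250 Hz).

```python
import math

FS_HZ = 8000      # assumed sampling rate (not from the text)
PITCH_HZ = 100    # assumed pitch of a voiced sound (not from the text)


def voiced_frame(duration_msec, n_harmonics=10):
    """Synthetic voiced frame: equal-amplitude harmonics of PITCH_HZ."""
    n_samples = int(FS_HZ * duration_msec / 1000)
    return [
        sum(math.cos(2 * math.pi * k * PITCH_HZ * n / FS_HZ)
            for k in range(1, n_harmonics + 1))
        for n in range(n_samples)
    ]


def magnitude_at(frame, freq_hz):
    """Magnitude of the Hamming-windowed Fourier transform at one frequency."""
    n_samples = len(frame)
    re = 0.0
    im = 0.0
    for n, x in enumerate(frame):
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
        phase = 2 * math.pi * freq_hz * n / FS_HZ
        re += w * x * math.cos(phase)
        im -= w * x * math.sin(phase)
    return math.hypot(re, im)


frame = voiced_frame(50)                  # 50-msec section, as in the text
on_harmonic = magnitude_at(frame, 200)    # at the second pitch harmonic
between = magnitude_at(frame, 250)        # midway between two harmonics
print(on_harmonic > 10 * between)
```

With these assumed values the magnitude at the harmonic is far larger than the magnitude between harmonics, which is the numerical counterpart of the almost-horizontal harmonic lines described above.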
`A third way of representing the time-varying signal characteristics of speech is via a
`parameterization of the spectral activity based on the model of speech production. Because
`the human vocal tract is essentially a tube, or concatenation of tubes, of varying cross-sectional
`area that is excited either at one end (by the vocal cord puffs of air) or at a point
`along the tube (corresponding to turbulent air at a constriction), acoustic theory tells us that
`the transfer function of energy from the excitation source to the output can be described
`in terms of the natural frequencies or resonances of the tube. Such resonances are called
`formants for speech, and they represent the frequencies that pass the most acoustic energy
`from the source to the output. Typically there are about three resonances of significance, for
`a human vocal tract, below about 3500 Hz. Figure 2.9 [5] shows a wideband spectrogram,
`along with the computed formant frequency estimates, for the utterance "Why do I owe
`you a letter," spoken by a male speaker. There is a good correspondence between the
`estimated formant frequencies and the points of high spectral energy in the spectrogram.
`The formant frequency representation is a highly efficient, compact representation of the
`time-varying characteristics of speech. The major problem, however, is the difficulty of
`reliably estimating the formant frequencies for low-level voiced sounds, and the difficulty
`of defining the formants for unvoiced or silence regions. As such, this representation is
`more of theoretical than of practical interest.
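As a rough numerical check of the statement that about three resonances fall below about 3500 Hz, the sketch below computes the resonances of the simplest possible tube: a uniform, lossless tube closed at the source end and open at the output end, whose resonances lie at odd multiples of c/(4L). The uniform shape, the 17 cm length, and the 350 m/s speed of sound are assumptions made for illustration; they do not come from the text, and a real vocal tract has a varying cross-sectional area.

```python
def uniform_tube_resonances(length_m, max_hz, speed_of_sound=350.0):
    """Resonances (Hz) below max_hz of a uniform tube closed at one end.

    Quarter-wavelength resonator: F_k = (2k - 1) * c / (4 * L), k = 1, 2, ...
    """
    resonances = []
    k = 1
    while True:
        f_k = (2 * k - 1) * speed_of_sound / (4.0 * length_m)
        if f_k >= max_hz:
            return resonances
        resonances.append(f_k)
        k += 1


# Assumed vocal tract length of 17 cm (illustrative value, not from the text).
formants = uniform_tube_resonances(length_m=0.17, max_hz=3500.0)
print([round(f) for f in formants])
```

Under these assumptions the function returns three resonances, near 515, 1544, and 2574 Hz; changing the tube shape (as the tongue, jaw, and lips do) moves them away from this even spacing.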
`Figures 2.10 and 2.11 show spectral and temporal representations of the phrase
`"Should we chase," spoken by a male speaker, along with a detailed segmentation of the
`waveform into individual sounds. The ultimate goal of speech recognition is to uniquely
`and automatically provide such a segmentation and labeling of speech into constituent
`sounds or sound groups such as words, then sentences. To understand the limitations on
`this approach, we will next discuss, in detail, the general sounds of English and the relevant
`acoustic and phonetic features of the sounds.
`
`2.4 SPEECH SOUNDS AND FEATURES
`
`The number of linguistically distinct speech sounds (phonemes) in a language is often a
`matter of judgment and is not invariant to different linguists. Table 2.1 shows a condensed
`list of phonetic symbols of American English, their ARPABET representation [6], and an
`example word in which the sound occurs. Shown in this table are 48 sounds, including 18
`vowels or vowel combinations (called diphthongs), 4 vowel-like consonants, 21 standard
`consonants, 4 syllabic sounds, and a phoneme referred to as a glottal stop (literally a symbol
`
`IPR2023-00037
`Apple EX1013 Page 51
`
`

`

`Sec. 2.4
`
`Speech Sounds and Features
`
`21
`
`[Figure 2.9: panels (a) and (b) for the utterance "WHY DO I OWE YOU A LETTER"; vertical axis FREQUENCY (kHz), horizontal axis TIME (sec)]
`
`Figure 2.9 Wideband spectrogram and formant frequency representation
`of the utterance "Why do I owe you a letter" (after Atal and
`Hanauer [5]).
`
`
`for a sound corresponding to a break in voicing within a sound).
`Many of the sounds or phonemes shown in Table 2.1 are not considered standard;
`they represent specialized cases such as the so-called barred I (/ɨ/) in the word roses. As
`such, a more standard representation of the basic sounds and sound classes of American
`English is shown in Figure 2.12. Here we see the conventional set of 11 vowels, classified
`as front, mid, or back, corresponding to the position of the tongue hump in producing the
`vowel; 4 vowel combinations or diphthongs; the 4 semivowels broken down into 2 liquids
`and 2 glides; the nasal consonants, the voiced and unvoiced stop consonants; the voiced
`and unvoiced fricatives; whisper; and the affricates. There are a total of 39 of the 48 sounds
`of Table 2.1 represented in Figure 2.12.
`
`2.4.1 The Vowels
`
`The vowel sounds are perhaps the most interesting class of sounds in English. Their
`importance to the classification and representation of written text is very low; however,
`most practical speech-recognition systems rely heavily on vowel recognition to achieve
`
`
`[Figure 2.10: phrase "SHOULD WE CHASE"; upper panel FREQUENCY (Hz), lower panel AMPLITUDE, horizontal axis TIME (sec)]
`
`Figure 2.10 Wideband spectrogram and intensity contour of the
`phrase "Should we chase."
`
`high performance. To partially illustrate this point, consider the following sections of text:
`
`Section I
`
`Th_y n_t_d s_gn_f_c_nt _mpr_v_m_nts _n th_ c_mp_ny's _m_g_, s_p_rv_s__n,
`th__r w_rk_ng c_nd_t__ns, b_n_f_ts _nd _pp_rt_n_t__s f_r gr_wth.
`
`Section II
`
`A__i_u_e_ _o_a__ _a_ __a_e_ e__e__ia___ __e _a_e, _i__ __e __o_e_ o_
`o__u_a_io_a_ e___o_ee_ __i_____ _e__ea_i__.
`
`In Section I we have omitted the conventional vowel letters (a,e,i,o,u); however, with a
`little effort the average reader can "fill in" the missing vowels and decode the section so
`that it reads
`
`They noted significant improvements in the company's image, supervision,
`their working conditions, benefits and opportunities for growth.
`
`In Section II we have omitted the conventional consonant letters; the resulting text is
`essentially not decodable. The actual text is
`
`
`[Figure 2.11: phrase "SHOULD WE CHASE"; waveform rows of 100 msec each, labeled with the constituent sounds]
`
`Figure 2.11 The speech waveform and a segmentation and labeling of
`the constituent sounds of the phrase "Should we chase."
`
`Attitudes toward pay stayed essentially the same, with the scores of
`occupational employees slightly decreasing.
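The two masked sections above can be regenerated mechanically, which makes it easy to repeat the experiment on other sentences. The following is a minimal sketch; the function name `mask_letters` and the decision to treat only a, e, i, o, u as vowel letters (so that y counts as a consonant letter) are assumptions made for illustration.

```python
VOWEL_LETTERS = set("aeiou")


def mask_letters(text, mask_vowels):
    """Replace vowel letters (or all other letters) with underscores."""
    masked = []
    for ch in text:
        is_vowel = ch.lower() in VOWEL_LETTERS
        if ch.isalpha() and is_vowel == mask_vowels:
            masked.append("_")
        else:
            masked.append(ch)   # spaces and punctuation are kept as they are
    return "".join(masked)


section_1 = mask_letters("They noted significant improvements", mask_vowels=True)
section_2 = mask_letters("Attitudes toward pay", mask_vowels=False)
print(section_1)
print(section_2)
```

The first call reproduces the opening of Section I and the second the opening of Section II.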
`
`In speaking, vowels are produced by exciting an essentially fixed vocal tract shape
`with quasi-periodic pulses of air caused by the vibration of the vocal cords. The way
`in which the cross-sectional area varies along the vocal tract determines the resonance
`frequencies of the tract (the formants) and thereby the sound that is produced. The vowel
`sound produced is determined primarily by the position of the tongue, but the positions of
`the jaw, lips, and to a small extent, the velum, also influence the resulting sound.
`
`
`TABLE 2.1. A condensed list of phonetic symbols for American English.
`
`Phoneme     ARPABET   Example      Phoneme          ARPABET   Example
`/i/         IY        beat         /ŋ/              NX        sing
`/I/         IH        bit          /p/              P         pet
`/e/ (eY)    EY        bait         /t/              T         ten
`/ɛ/         EH        bet          /k/              K         kit
`/æ/         AE        bat          /b/              B         bet
`/a/         AA        Bob          /d/              D         debt
`/ʌ/         AH        but          /g/              G         get
`/ɔ/         AO        bought       /h/              HH        hat
`/o/ (ow)    OW        boat         /f/              F         fat
`/U/         UH        book         /θ/              TH        thing
`/u/         UW        boot         /s/              S         sat
`/ə/         AX        about        /š/ (sh)         SH        shut
`/ɨ/         IX        roses        /v/              V         vat
`/ɝ/         ER        bird         /ð/              DH        that
`/ɚ/         AXR       butter       /z/              Z         zoo
`/aw/        AW        down         /ž/ (zh)         ZH        azure
`/aY/        AY        buy          /č/ (tsh)        CH        church
`/ɔY/        OY        boy          /ǰ/ (dzh, j)     JH        judge
`/y/         Y         you          /ʍ/              WH        which
`/w/         W         wit          /l̩/              EL        battle
`/r/         R         rent         /m̩/              EM        bottom
`/l/         L         let          /n̩/              EN        button
`/m/         M         met          /ɾ/              DX        batter
`/n/         N         net          /ʔ/              Q         (glottal stop)
`
`The vowels are generally long in duration (as compared to consonant sounds) and
`are spectrally well defined. As such they are usually easily and reliably recognized and
`therefore contribute significantly to our ability to recognize speech, both by human beings
`and by machine.
`There are several ways to characterize and classify vowels, including the typical
`articulatory configuration required to produce the sounds, typical waveform plots, and typical
`spectrogram plots. Figures 2.13-2.15 show typical articulatory configurations of the
`vowels (2.13), examples of vowel waveforms (2.14), and examples of vowel spectrograms
`(2.15). A convenient and simplified way of classifying vowel articulatory configurations is
`in terms of the tongue hump position (i.e., front, mid, back), and tongue hump height (high,
`mid, low), where the tongue hump is the mass of the tongue at its narrowest constriction
`within the vocal tract. According to this classification the vowels /i/, /I/, /æ/, and /ɛ/ are
`front vowels (with different tongue heights), /a/, /ʌ/, and /ɔ/ are mid vowels, and /U/, /u/,
`and /o/ are back vowels (see also Figure 2.12).
`As shown in the acoustic waveform plots of the vowels, in Figure 2.14, the front
`vowels show a pronounced, high-frequency resonance, the mid vowels show a balance
`of energy over a broad frequency range, and the back vowels show a predominance of
`low-frequency spectral information. This behavior is evidenced in the vowel spectrogram
`
`
`[Figure 2.12, rendered as a list]
`PHONEMES
`  VOWELS
`    FRONT: i (IY), I (IH), ɛ (EH), æ (AE)
`    MID: a (AA), ɝ (ER), ʌ, ə (AH, AX), ɔ (AO)
`    BACK: u (UW), U (UH), o (OW)
`  DIPHTHONGS: aY (AY), ɔY (OY), aw (AW), eY (EY)
`  SEMIVOWELS (LIQUIDS, GLIDES): w (W), l (L), r (R), y (Y)
`  CONSONANTS
`    NASALS: m (M), n (N), ŋ (NG)
`    STOPS
`      VOICED: b (B), d (D), g (G)
`      UNVOICED: p (P), t (T), k (K)
`    FRICATIVES
`      VOICED: v (V), ð (TH), z (Z), ž, zh (ZH)
`      UNVOICED: f (F), θ (THE), s (S), š, sh (SH)
`    WHISPER: h (H)
`    AFFRICATES: ǰ (JH), č (CH)
`
`Figure 2.12 Chart of the classification of the standard phonemes of American English into broad sound classes.
`
`[Figure 2.13: vocal tract outlines labeled i (EVE), I (IT), e (HATE), æ (AT), a (FATHER), ɔ (ALL), o (OBEY), U (FOOT), u (BOOT), ʌ (UP), ɝ (BIRD)]
`
`Figure 2.13 Articulatory configurations for typical vowel sounds (after Flanagan [3]).
`
`
`Figure 2.14 Acoustic waveform plots
`of typical vowel sounds.
`
`plots of Figure 2.15, in which the front vowels show a relatively high second and third
`formant frequency (resonance), whereas the mid vowels show well-separated and balanced
`locations of the formants, and the back vowels (especially /u/) show almost no energy
`beyond the low-frequency region with low first and second formant frequencies.
`The concept of a "typical" vowel sound is, of course, unreasonable in light of the
`variability of vowel pronunciation among men, women, and children with different regional
`accents and other variable characteristics. To illustrate this point, Figure 2.16 shows a
`classic plot, made by Gordon Peterson and Harold Barney, of measured values of the first
`and second formant for 10 vowels spoken by a wide range of male and female talkers
`who attended the 1939 World's Fair in New York City [7]. A wide range of variability can
`be seen in the measured formant frequencies for a given vowel sound, and also there is
`
`
`[Figure 2.15: spectrogram panels for the vowel sounds; vertical axis FREQUENCY (Hz), horizontal axis TIME (sec)]
`
`Figure 2.15 Spectrograms of the vowel sounds.
`
`[Figure 2.16: vertical axis FREQUENCY OF F2 (Hz), horizontal axis FREQUENCY OF F1 (Hz)]
`
`Figure 2.16 Measured frequencies of first and second formants
`for a wide range of talkers for several vowels (after
`Peterson & Barney [7]).
`
`overlap between the formant frequencies for different vowel sounds by different talkers.
`The ellipses drawn in this figure represent gross characterizations of the regions in which
`most of the tokens of the different vowels lie. The message of Figure 2.16, for speech
`recognition by machine, is fairly clear; that is, it is not just a simple matter of measuring
`formant frequencies or spectral peaks accurately to accurately classify vowel sounds; one
`
`
`[Figure 2.17: vertical axis F2 (Hz), horizontal axis F1 (Hz); labeled points include IY (i), EH (ɛ), AE (æ), ER (ɝ), AH (ʌ), AO (ɔ)]
`
`Figure 2.17 The vowel triangle with centroid positions of the common
`vowels.
`
`must do some type of talker (accent) normalization to account for the variability in formants
`and overlap between vowels.
`A common way of exploiting the information embodied in Figures 2.15 and 2.16 is to
`represent each vowel by a centroid in the formant space with the realization that the centroid,
`at best, represents average behavior and does not represent variability across talkers. Such
`a representation leads to the classic vowel triangle shown in Figure 2.17 and represented
`in terms of formant positions by the data given in Table 2.2. The vowel triangle represents
`the extremes of formant locations in the F1-F2 plane, as represented by /i/ (low F1, high
`F2), /u/ (low F1, low F2), and /a/ (high F1, low F2), with other vowels appropriately placed
`with respect to the triangle vertices. The utility of the formant frequencies of Table 2.2 has
`been demonstrated in text-to-speech synthesis in which high-quality vowel sounds have
`been synthesized using these positions for the resonances [8].
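To make the centroid idea concrete, the sketch below assigns a measured (F1, F2) pair to the vowel with the nearest centroid, using the F1 and F2 values of Table 2.2. The nearest-centroid rule, the Euclidean distance in Hz, and the function name `nearest_vowel` are assumptions chosen for illustration; as the text stresses, centroids capture only average behavior, so without some form of talker normalization a rule this simple would be expected to misclassify many real tokens.

```python
import math

# (F1, F2) centroids in Hz, copied from Table 2.2.
VOWEL_CENTROIDS = {
    "IY": (270, 2290), "IH": (390, 1990), "EH": (530, 1840),
    "AE": (660, 1720), "AH": (520, 1190), "AA": (730, 1090),
    "AO": (570, 840), "UH": (440, 1020), "UW": (300, 870),
    "ER": (490, 1350),
}


def nearest_vowel(f1_hz, f2_hz):
    """Return the ARPABET vowel whose (F1, F2) centroid is closest."""
    def distance(vowel):
        c1, c2 = VOWEL_CENTROIDS[vowel]
        return math.hypot(f1_hz - c1, f2_hz - c2)
    return min(VOWEL_CENTROIDS, key=distance)


# Points close to the three vertices of the vowel triangle.
print(nearest_vowel(280, 2250), nearest_vowel(310, 900), nearest_vowel(700, 1100))
```

The three test points sit near the /i/, /u/, and /a/ vertices and are assigned to IY, UW, and AA respectively.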
`
`2.4.2 Diphthongs
`
`Although there is some ambiguity and disagreement as to what is and what is not a
`diphthong, a reasonable definition is that a diphthong is a gliding monosyllabic speech
`
`
`TABLE 2.2. Formant frequencies for typical vowels.
`
`ARPABET Symbol   IPA      Typical   F1     F2     F3
`for Vowel        Symbol   Word      (Hz)   (Hz)   (Hz)
`IY               /i/      beet      270    2290   3010
`IH               /I/      bit       390    1990   2550
`EH               /ɛ/      bet       530    1840   2480
`AE               /æ/      bat       660    1720   2410
`AH               /ʌ/      but       520    1190   2390
`AA               /a/      hot       730    1090   2440
`AO               /ɔ/      bought    570    840    2410
`UH               /U/      foot      440    1020   2240
`UW               /u/      boot      300    870    2240
`ER               /ɝ/      bird      490    1350   1690
`
`sound that starts at or near the articulatory position for one vowel and moves to or toward
`the position for another. According to this definition, there are six diphthongs in American
`English, namely /aY/ (as in buy), /aw/ (as in down), /eY/ (as in bait), /ɔY/ (as in boy),
`/o/ (as in boat), and /ju/ (as in you).
`The diphthongs are produced by varying the vocal tract smoothly between vowel
`configurations appropriate to the diphthong. Figure 2.18 shows spectrogram plots of four
`of the diphthongs spoken by a male talker. The gliding motions of the formants are
`especially prominent for the sounds /aY/, /aw/ and /ɔY/ and are somewhat weaker for
`/eY/ because of the closeness (in vowel space) of the two vowel sounds comprising this
`diphthong.
`An alternative way of displaying the time-varying spectral characteristics of diphthongs
`is via a plot of the values of the second formant versus the first formant (implicitly
`as a function of time) as shown in Figure 2.19 [9]. The arrows in this figure indicate the
`direction of motion of the formants (in the F1-F2 plane) as time increases. The dashed
`circles in this figure indicate average positions of the vowels. Based on these data, and
`other measurements, the diphthongs can be characterized by a time-varying vocal tract area
`function that varies between two vowel configurations.
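A crude way to picture such a trajectory is to interpolate the first two formants between two vowel positions. The sketch below does this for /aY/, taking the /a/ and /i/ rows of Table 2.2 as endpoints. The straight-line path, the choice of endpoints, and the function name `formant_glide` are assumptions for illustration only; by the definition given earlier, a real diphthong moves "to or toward" the second vowel and need not reach it.

```python
def formant_glide(start, end, n_frames):
    """Linearly interpolate (F1, F2) in Hz from start to end over n_frames."""
    if n_frames < 2:
        raise ValueError("need at least two frames")
    track = []
    for i in range(n_frames):
        t = i / (n_frames - 1)          # runs from 0.0 to 1.0
        f1 = start[0] + t * (end[0] - start[0])
        f2 = start[1] + t * (end[1] - start[1])
        track.append((f1, f2))
    return track


# Assumed endpoints for /aY/: the /a/ (AA) and /i/ (IY) rows of Table 2.2.
AA = (730.0, 1090.0)
IY = (270.0, 2290.0)
track = formant_glide(AA, IY, n_frames=5)
for f1, f2 in track:
    print(round(f1), round(f2))
```

In this sketch F1 falls while F2 rises from frame to frame, which is the kind of directed motion in the F1-F2 plane that the arrows of Figure 2.19 indicate.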
`
`2.4.3 Semivowels
`
`The group of sounds consisting of /w/, /l/, /r/, and /y/ is quite difficult to characterize.
`These sounds are called semivowels because of their vowel-like nature. They are generally
`characterized by a gliding transition in vocal tract area function between adjacent phonemes.
`Thus the acoustic characteristics of these sounds are strongly influenced by the context in
`which they occur. For our purposes, they are best described as transitional, vowel-like
`sounds, and hence are similar in nature to the vowels and diphthongs.
`
`
`[Figure 2.18: four spectrogram panels, including aY, eY, and ɔY; vertical axis FREQUENCY (Hz), horizontal axis TIME (sec)]
`
`Figure 2.18 Spectrogram plots of four diphthongs.
`
`2.4.4 Nasal Consonants
`
`The nasal consonants /m/, /n/, and /ŋ/ are produced with glottal excitation and the vocal
`tract totally constricted at some point along the oral passageway. The velum is lowered so
`that air flows through the nasal tract, with sound being radiated at the nostrils. The oral
`cavity, although constricted toward the front, is still acoustically coupled to the pharynx.
`Thus, the mouth serves as a resonant cavity that traps acoustic energy at certain natural
`frequencies. As far as the radiated sound is concerned, these resonant frequencies of the
`oral cavity appear as antiresonances, or zeros of the transfer function of sound transmission.
`Furthermore, nasal consonants and nasalized vowels (i.e., some vowels preceding or
`following nasal consonants) are characterized by resonances that are spectrally broader, or
`more highly damped, than those for vowels.
`The three nasal consonants are distinguished by the place along the oral tract at
`which a total constriction is made. For /m/ the constriction is at the lips; for /n/ the
`constriction is just behind the teeth; and for /ŋ/ the constriction is just forward of the
`velum itself. Figure 2.20 shows typical speech waveforms and Figure 2.21 spectrograms
`for two nasal consonants in the context vowel-nasal-vowel. The waveforms of /m/ and /n/
`look very similar. The spectrograms show a concentration of low-frequency energy with a
`midrange of frequencies that contain no prominent peaks. This is because of the particular
`
`
`[Figure 2.19: formant trajectories of the diphthongs; axes F1 and F2 (Hz), with dashed circles marking average vowel positions]
`
`Figure 2.19 Time variation of the first two formants for the diphthongs (after
`Holbrook and Fairbanks [9]).
`
`combination of resonances and antiresonances that result from the coupling of the nasal
`and oral tracts.
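The dependence of the antiresonances on the place of constriction can be illustrated with a very simple model: if the closed oral cavity is treated as a uniform side-branch tube of length l, its lowest zero falls at c/(4l), so a constriction farther back (a shorter cavity) gives a higher zero. The sketch below uses that model. The uniform-tube simplification, the 350 m/s speed of sound, and above all the three cavity lengths are hypothetical values chosen for illustration; none of them comes from the text.

```python
def lowest_antiresonance_hz(cavity_length_m, speed_of_sound=350.0):
    """Lowest zero of a closed side branch modeled as a uniform tube: c / (4 * l)."""
    return speed_of_sound / (4.0 * cavity_length_m)


# Hypothetical oral-cavity lengths in metres (illustrative only, not from the text):
# longest for the lip closure of /m/, shortest for the closure of the /ng/ sound.
ASSUMED_CAVITY_LENGTH_M = {"m": 0.08, "n": 0.055, "ng": 0.025}

zeros = {nasal: lowest_antiresonance_hz(length)
         for nasal, length in ASSUMED_CAVITY_LENGTH_M.items()}
for nasal, zero_hz in zeros.items():
    print(nasal, round(zero_hz))
```

Only the ordering of the three zeros should be read from this sketch; the absolute frequencies depend entirely on the assumed lengths.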
`
`2.4.5 Unvoiced Fricatives
`
`The unvoiced fricatives /f/, /θ/, /s/, and /sh/ are produced by exciting the vocal tract by a
`steady air flow, which becomes turbulent in the region of a constriction in the vocal tract.
`The location of the constriction serves to determine which fricative sound is produced.
`For the fricative /f/ the constriction is near the lips; for /θ/ it is near the teeth; for /s/ it
`is near the middle of the oral tract; and for /sh/ it is near the back of the oral tract. Thus
`the system for producing unvoiced fricatives consists of a source of noise at a constriction,
`which separates the vocal tract into two cavities. Sound is radiated from the lips, that is,
`from the front cavity. The back cavity serves, as in the case of nasals, to trap energy and
`thereby introduce antiresonances into the vocal output. Figure 2.22 shows the waveforms
`and Figure 2.23 the spectrograms of the fricatives /f/, /s/ and /sh/. The nonperiodic nature
`
`
`[Figure 2.20: waveform panels (a) ama and (b) ana; 100 msec per row]
`
`Figure 2.20 Waveforms for the sequences
`/a-m-a/ and /a-n-a/.
`
`of fricative excitation is obvious in the waveform plots. The spectral differences among
`the fricatives are readily seen by comparing the three spectrograms.
`
`2.4.6 Voiced Fricatives
`
`The voiced fricatives /v/, /th/, /z/ and /zh/ are the counterparts of the unvoiced fricatives /f/,
`/θ/, /s/, and /sh/, respectively, in that the place of constriction for each of the corresponding
`phonemes is essentially identical. However, the voiced fricatives differ markedly from
`their unvoiced counterparts in that two excitation sources are involved in their production.
`For voiced fricatives the vocal cords are vibrating, and thus one excitation source is at
`the glottis. However, since the vocal tract is constricted at some point forward of the
`glottis, the air flow becomes turbulent in the neighborhood of the constriction. Thus the
`
`
`[Figure 2.21: panels a-m-a and a-n-a; upper plots FREQUENCY (Hz), lower plots AMPLITUDE, horizontal axis TIME (sec)]
`
`Figure 2.21 Spectrograms of the sequences /a-m-a/ and /a-n-a/.
`
`spectra of voiced fricatives can be expected to display two distinct components. These
`excitation features are readily observable in Figure 2.24, which shows typical waveforms,
`and in Figure 2.25, which shows spectra for two voiced fricatives. The similarity of the
`unvoiced fricative /f/ to the voiced fricative /v/ is easily shown in a comparison between
`corresponding spectrograms in Figures 2.23 and 2.25. Likewise, it is instructive to compare
`the spectrograms of /sh/ and /zh/.
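The difference between one and two excitation sources can be sketched directly at the level of the excitation signal. The code below builds noise only for an unvoiced fricative, and noise plus a periodic pulse train for a voiced one, then compares their autocorrelation at the pitch period. The 8 kHz sampling rate, the 100 Hz pitch, the noise level, and the function names are all assumptions for illustration, and the sketch models only the sources, not the filtering by the vocal tract cavities.

```python
import random

FS_HZ = 8000     # assumed sampling rate (not from the text)
PITCH_HZ = 100   # assumed pitch (not from the text)
PERIOD = FS_HZ // PITCH_HZ


def fricative_excitation(n_samples, voiced, noise_std=0.05, seed=0):
    """Excitation sketch: turbulence noise, plus glottal pulses if voiced."""
    rng = random.Random(seed)
    samples = []
    for n in range(n_samples):
        noise = rng.gauss(0.0, noise_std)                    # source at the constriction
        pulse = 1.0 if voiced and n % PERIOD == 0 else 0.0   # source at the glottis
        samples.append(pulse + noise)
    return samples


def normalized_autocorr(x, lag):
    """Autocorrelation at `lag`, divided by the zero-lag value."""
    r0 = sum(v * v for v in x)
    r_lag = sum(x[n] * x[n + lag] for n in range(len(x) - lag))
    return r_lag / r0


voiced_r = normalized_autocorr(fricative_excitation(800, voiced=True), PERIOD)
unvoiced_r = normalized_autocorr(fricative_excitation(800, voiced=False), PERIOD)
print(voiced_r > unvoiced_r)
```

A large autocorrelation value at the pitch period signals the periodic glottal component; the noise-only excitation shows no such periodicity.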
`
`2.4.7 Voiced and Unvoiced Stops
`
`The voiced stop consonants /b/, /d/, and /g/, are transient, noncontinuant sounds produced
`by building up pressure behind a total constriction somewhere in the oral tract and then
`suddenly releasing the pressure. For /b/ the constriction is at the lips; for /d/ the constriction
`is at the back of the teeth; and for /g/ it is near the velum. During the period when there is
`total constriction in the tract, no sound is radiated from the lips. However, there is often a
`small amount of low-frequency energy radiated through the walls of the throat (sometimes
`called a voice bar). This occurs when the vocal cords are able to vibrate even though the
`vocal tract is closed at some point.
`Since the stop sounds are dynamical in nature, their properties are highly influenced
`by the vowel that follows the stop consonant. As such, the waveforms for stop consonants
`give little information about the particular stop consonant. Figure 2.26 shows the waveform
`of the syllable /a-b-a/. The waveform of /b/ shows few distinguishing features except for
`the voiced excitation and lack of high-frequency energy.
`
`[Figure 2.22: waveform panels for the three fricative contexts; 100 msec per row]
`
`Figure 2.22 Waveforms for the sounds /f/, /s/ and /sh/ in the context /a-x-a/ where /x/ is
`the unvoiced fricative.
`
`[Figure 2.23: three panels; upper plots FREQUENCY (Hz), lower plots AMPLITUDE, horizontal axis TIME (sec)]
`
`Figure 2.23 Spectrogram comparisons of the sounds /a-f-a/, /a-s-a/ and /a-sh-a/.
`
`
`[Figure 2.24: waveform panels (a) and (b); 100 msec per row]
`
`Figure 2.24 Waveforms for the sequences /a-v-a/
`and /a-zh-a/.
`
`The unvoiced stop consonants /p/, /t/, and /k/ are similar to their voiced counterparts
`/b/, /d/, and /g/, with one major exception. During the period of total closure of the tract,
`as the pressure builds up, the vocal cords do not vibrate. Then, following the period of
`closure, as the air pressure is released, there is a brief interval of friction (due to sudden
`turbulence of the escaping air) followed by a period of aspiration (steady air flow from the
`glottis exciting the resonances of the vocal tract) before voiced excitation begins.
`Figure 2.27 shows waveforms and Figure 2.28 shows spectrograms of the voiced stop
`/b/ and the voiceless stop consonants /p/ and /t/. The "stop gap," or time interval, during
`which the pressure is built up is clearly in evidence. Also, it can be readily seen that the
`duration and frequency content of the frication noise and aspiration vary greatly with the
`stop consonant.
`
`
`[Figure 2.25: two panels; upper plots FREQUENCY (Hz), lower plots AMPLITUDE, horizontal axis TIME (sec)]
`
`Figure 2.25 Spectrograms for the sequences /a-v-a/ and /a-zh-a/.
`
`[Figure 2.26: waveform plot; 100 msec per row]
`
`Figure 2.26 Waveform for the sequence /a-b-a/.
`
`
`
`[Figure 2.27: waveform panels (a) apa and (b) ata; 100 msec per row]
`
`Figure 2.27 Waveforms for the sequences /a-p-a/
`and /a-t-a/.
`
`2.4.8 Review Exercises
`
`As a self-check on the reader's understanding of the material on speech sounds and their
`acoustic manifestations, we now digress and present some simple exercises along with the
`solutions. For maximum effectiveness, the reader is encouraged to think through each
`exercise before looking at the solution.
`Exercise 2.1
`1. Write out the phonetic transcription for the following words:
`he, eats, several, light, tacos
`
`2. What effect occurs when these five words are spoken in sequence as a sentence? What
`does this imply about automatic speech recognition?
`
`
`[Figure 2.28: three panels; upper plots FREQUENCY (Hz), lower plots AMPLITUDE, horizontal axis TIME (sec)]
`
`Figure 2.28 Spectrogram comparisons of the sequences of voiced (/a-b-a/) and voiceless (/a-p-a/
`and /a-t-a/) stop consonants.
`
`Solution 2.1
`1. The phonetic transcriptions of the words are
`
`Word       Phoneme Sequence   ARPABET
`he         /hi/               HH-IY
`eats       /its/              IY-T-S
`several    /sɛvrʌl/           S-EH-V-R-AH-L
`light      /laYt/             L-AY-T
`tacos      /takoz/            T-AA-K-OW-Z
`
`2. When the words are spoken together, the last sound of each word merges with the
`first sound of the succeeding word (since they are the same sound), resulting in strong
`coarticulation of boundary sounds. The ARPABET transcription for the sentence is:
`HH-IY-T-S-EH-V-R-AH-L-AY-T-AA-K-OW-Z
`All information about word boundaries is totally lost; furthermore, the durations of
`the common sounds at the boundaries of words are much shorter than what would be
`predicted from the individual words.
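The merging of boundary sounds in Solution 2.1 can be reproduced with a few lines of code. This is a minimal sketch: the word transcriptions are the ARPABET sequences of the solution, while the names `LEXICON` and `connect_words` and the rule of keeping a single copy of two identical adjacent boundary sounds are illustrative assumptions about how to mechanize the description.

```python
# Word transcriptions from Solution 2.1 (ARPABET symbols).
LEXICON = {
    "he": ["HH", "IY"],
    "eats": ["IY", "T", "S"],
    "several": ["S", "EH", "V", "R", "AH", "L"],
    "light": ["L", "AY", "T"],
    "tacos": ["T", "AA", "K", "OW", "Z"],
}


def connect_words(words):
    """Concatenate transcriptions, merging identical sounds at word boundaries."""
    phones = []
    for word in words:
        for i, phone in enumerate(LEXICON[word]):
            at_boundary = i == 0 and len(phones) > 0
            if at_boundary and phones[-1] == phone:
                continue  # boundary sounds coincide: keep a single copy
            phones.append(phone)
    return "-".join(phones)


sentence = connect_words(["he", "eats", "several", "light", "tacos"])
print(sentence)
```

The output is the 15-symbol string given in the solution; nothing in it marks where one word ends and the next begins, which is the point of the exercise.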
`
`Exercise 2.2
`
`Some of the difficulties in large vocabulary speech recognition are related to the irregularities
`in the way basic speech sounds are combined to produce words. Exercise 2.2 highlights a
`couple of these difficulties.
`
`1. In word initial position of American English, which phoneme or phonemes can never
`occur? Which hardly ever occur?
`2. There are many word initial consonant clusters of length two, such as speak, drank,
`plead, and press. How many word initial consonant clusters of length three are there
`in American English? What general rule can you give about the sounds in each of the
`three positions?
`
`
`3. A nasal consonant can be combined with a stop consonant (e.g., camp, tend) in a limited
`number of ways. What general rule do such combinations obey? There are several
`notable exceptions to this general rule. Can you give a couple of exceptions? What
`kind of speaking irregularity often results from these exceptions?
`
`Solution 2.2
`1. The only phoneme that never occurs in initial word position in English is the /ng/ sound
`(e.g., sing). The only other sound that almost never occurs naturally in English, in
`initial word position, is /zh/, except in some foreign words imported into English, such as
`gendarme, which does have an initial /zh/.
`2. The word initial consonant clusters of length three in English include
`
`/spl/    split
`/spr/    spring
`/skw/    squirt
`/skr/    script
`/str/    string
`
`The general rule for such clusters is
`
`/sound s/unvoiced stop/semivowel/
`
`3. The general rule for a nasal-stop combination is that the nasal and stop have the
`same place of articulation, e.g., front/lips (/mp/), mid/dental (/nt/), back/velar (/ng k/).
`Exceptions occur in words like summed (/md/) or hanged (/ng d/) or dreamt (/mt/).
`There is often a tendency to insert an extra stop in such situations (e.g., dreamt →
`/drempt/).
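The two rules of Solution 2.2 are simple enough to write as predicates. The sketch below checks the three-sound cluster rule against the five listed clusters and checks the place-of-articulation rule for nasal-stop pairs. The function names and the ARPABET encoding (with NX for the /ng/ sound, as in Table 2.1) are assumptions made for illustration; the place labels follow the solution text.

```python
UNVOICED_STOPS = {"P", "T", "K"}
SEMIVOWELS = {"W", "L", "R", "Y"}

# Place of articulation for nasals and stops, following Solution 2.2.
# NX is the Table 2.1 ARPABET symbol for the /ng/ sound.
PLACE = {
    "M": "lips", "P": "lips", "B": "lips",
    "N": "dental", "T": "dental", "D": "dental",
    "NX": "velar", "K": "velar", "G": "velar",
}


def fits_initial_cluster_rule(cluster):
    """True if a 3-sound cluster is /s/ + unvoiced stop + semivowel."""
    first, second, third = cluster
    return first == "S" and second in UNVOICED_STOPS and third in SEMIVOWELS


def same_place(nasal, stop):
    """True if the nasal and the stop share a place of articulation."""
    return PLACE[nasal] == PLACE[stop]


listed = [("S", "P", "L"), ("S", "P", "R"), ("S", "K", "W"),
          ("S", "K", "R"), ("S", "T", "R")]
print(all(fits_initial_cluster_rule(c) for c in listed))
print(same_place("M", "P"), same_place("M", "D"))
```

The cluster rule admits more combinations than the five listed above, so it works as a filter rather than as a generator of valid clusters; the pairs /md/, /ng d/, and /mt/ from the exceptions all fail the `same_place` test.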
`
`Exercise 2.3
`An important speech task is accurate digit recognition. This exercise seeks to exploit knowledge
`of acoustic phonetics to recognize first isolated digits, and next some simple connected
`digit strings. We first need a sound lexicon (a dictionary) for the digits. The sound lexicon
`describes the pronunciations of digits in terms of the basic sounds of English. Such a sound
`lexicon is given in Table 2.3. A single male adult talker (LRR) spoke each of the 11 digits in
`random sequence and in isolation, and spectrograms of these spoken utterances are shown in
`Figure 2.29. Figure 2.30 shows spectrograms of two connected digit sequences spoken b
