`Band
`
`Aryn Alexandra Pyke
`
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
The University of Toronto

© Copyright Aryn Alexandra Pyke 1997
`
`Ex. 1043 / Page 1 of 101
`Apple v. Saint Lawrence
`
`
`
National Library of Canada
Bibliothèque nationale du Canada

Acquisitions and Bibliographic Services
395 Wellington Street
Ottawa ON K1A 0N4
Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to
reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic
formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor
substantial extracts from it may be printed or otherwise reproduced without the author's
permission.
`
`
`
`
Extrapolation of Wideband Speech From the Telephone Band

Aryn Alexandra Pyke, M.A.Sc.
Department of Electrical and Computer Engineering
The University of Toronto, 1997
`
Telephone speech is bandlimited to the frequency range between 300 and 3300 Hz,
which compromises its quality. Wideband speech, accommodating frequencies up to 7000
Hz, provides higher quality but at a cost of increased transmission bandwidth. The
proposed pseudo-wideband (PWB) speech algorithm regenerates approximations of the bands
missing from telephone speech. This is possible because of the strong inter-band
correlations which stem from the acoustics of the production apparatus.
For this receiver-based algorithm, the improvement in effective bandwidth requires
no extra transmission bandwidth, and involves no codec standardization issues. The
spectral envelope and spectral detail are deconvolved via linear predictive analysis, and
each is mapped independently to its PWB counterpart. The algorithm is based on
parametric analysis using a uniform tube tract model, and has good potential for speaker
independence. Performance was encouraging for a preliminary investigation, but a more
sophisticated acoustic model is desirable for additional quality improvement.
`
`
`
`
Acknowledgments

I would like to thank my supervisor, Professor Frank Kschischang, for his invaluable advice
and encouragement throughout the research and preparation of this thesis. I would also
like to acknowledge my family for their unerring support. Finally, I would like to thank my
friends, especially Joel -110 and Lucy Pegoraro, for their continual support and tolerance.
`
`
`
`
Contents

Abstract
Acknowledgments
List of Figures

Chapter 1 Background
1.1 Introduction
1.2 Problem Definition
1.3 Speech Production and Speech Signal Characteristics
1.4 Linear Predictive Analysis
1.5 Speech Quality: Factors and Measures
    1.5.1 Speech Perception
    1.5.2 Objective Quality Measures
1.6 Previous Work
    1.6.1 Spectral Envelope Mapping
    1.6.2 Excitation Extrapolation
    1.6.3 System Evaluation

Chapter 2 Speech Extrapolation Model
2.1 The Excitation Source
2.2 The Tract Filter
    2.2.1 Transfer Function of a Resonance
    2.2.2 The Uniform Tube Model
    2.2.3 Estimation of Tract Length from TB speech
    2.2.4 Limitations of the Uniform Tube Model
    2.2.5 Perceptual Considerations for the Tract Model
2.3 The Entire Speech Spectrum
2.4 Summary

Chapter 3 Proposed PWB Speech Extrapolation Algorithm
3.1 Design Assumptions
3.2 System Overview
3.3 Framing for Block Processing
3.4 Analysis
    3.4.1 Linear Predictive Analysis (TB-LPA)
    3.4.2 Frame Classification
3.5 Extrapolation
    3.5.1 Odd-Harmonic Tract Resonance Extrapolation
    3.5.2 Excitation Extrapolation
3.6 Synthesis
    3.6.1 Correction Filter
    3.6.2 Splicing the TB into the WB Synthetic Signal
3.7 Complexity
3.8 Summary

Chapter 4 Experimental Results
4.1 Methodology
    4.1.1 Equipment
    4.1.2 Speech Corpus
    4.1.3 Objective Measures
4.2 Telephone Band Speech Model
4.3 Performance Baselines
4.4 Simulations
    4.4.1 Splicing the Bands
    4.4.2 Excitation Simulations
    4.4.3 Envelope Extrapolation Simulations
4.5 Preliminary Investigations for Alternative Tract Models
    4.5.1 Uniform Open Ended Tube for Unvoiced Frames
    4.5.2 Multiple Independent Resonators
4.6 System Performance

Chapter 5 Discussion
5.1 Strengths of the Uniform Tube, Fixed Bandwidth Model
5.2 Tract Length Parameterization Errors
5.3 Limitations of the Uniform Tube, Fixed Bandwidth Model
5.4 Potential for Other Acoustic Approaches
5.5 Speaker Dependence

Chapter 6 Conclusions
6.1 Contributions
    6.1.1 Explicit Acoustic Approach to PWB Speech Generation
    6.1.2 Voiced Excitation Extrapolation from TB to WB
    6.1.3 Speech Processing Toolbox
6.2 Future Work
    6.2.1 Extension of Acoustic Model
    6.2.2 Acoustic-Phonetic/Articulatory-Phonetic Model
    6.2.3 Non-Acoustic Approaches to PWB Speech Generation

References
`
`
`
`
List of Figures

1.1 Pseudo-wideband speech generation
1.2 Source-filter model of speech production
1.3 20 ms frame of voiced speech
1.4 20 ms frame of unvoiced speech
1.5 Extended source-filter model of speech production
1.6 Speech spectrum and formant-scape from 16th order LPA
1.7 Example of a residual spectrum
1.8 Linear predictive analysis and synthesis
1.9 High-level block diagram of a PWB speech generation system
1.10 Spectral duplication: (a) NB spectral translation; (b) NB spectral folding;
     and (c) TB spectral folding

2.1 Idealized voiced excitation signal
2.2 Magnitude spectrum of the idealized voiced excitation signal
2.3 Spectrum of a single resonance
2.4 Odd-harmonic resonances produced in a tube closed at one end
2.5 Tolerance guideline of just noticeable differences in formant location and
    bandwidth as a function of frequency

3.1 High-level block diagram of the proposed PWB speech extrapolation system
3.2 Relationships between the analysis and synthesis frames used for LPA and LPS
3.3 Block diagram of TB-LPA
3.4 Block diagram of the excitation extrapolation technique
3.5 Examples of actual and extrapolated wideband residuals for (a) an unvoiced
    frame; (b) a voiced frame
3.6 Contour of the voiced spectral shaping filter

4.1 Experimental set-up for speech I/O and processing
4.2 Generation of the TB corpus from the WB corpus
4.3 Measures of the band-limiting distortion for the TB, UB, sub-TB, TB and UB,
    and TB and sub-TB bands
4.4 Control performance measures for TB speech quality
4.5 Simulation system for determining appropriate cutoff frequency for splicing
    the TB into PWB speech
4.6 Determination of appropriate UB cutoff frequency for the highpass filter in
    the sub-band splicing phase
4.7 Determination of appropriate sub-TB cutoff frequency for the lowpass filter in
    the sub-band splicing phase
4.8 Simulation system for evaluating excitation extrapolation techniques
4.9 Excitation extrapolation candidates
4.10 Simulation system for evaluating envelope extrapolation techniques
4.11 Spectral distortion measures for envelope extrapolation of voiced speech
4.12 Example of envelope extrapolation for a typical voiced frame
4.13 Spectral distortion measures for envelope extrapolation of unvoiced speech
4.14 Example of envelope extrapolation for a typical unvoiced frame
4.15 Objective results comparing PWB speech with the WB original in the
     excitation simulations

5.1 Effect of tract length on the F-pattern observable in the TB spectral window
`
`
`
`
`Chapter 1
`
`Background
`
`1.1 Introduction
`
Perceptually, telephone speech is less natural and sometimes less intelligible than
face-to-face speech. The quality of telephone speech is primarily compromised by the band
limiting done in the Public Switched Telephone Network (PSTN) to reduce the sampling
rate and save on transmission bandwidth. The PSTN uses a Pulse Code Modulation
(PCM) coding scheme. The speech signal is band limited to avoid aliasing and then
sampled at 8000 samples per second. Each sample is then quantized according to the
8-bit, non-linear μ-law quantizer¹ [16].
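As a rough illustration of the non-linear quantization step, the sketch below uses the
continuous μ-law companding curve with μ = 255. It rests on simplifying assumptions:
deployed PCM codecs use a piecewise-linear approximation of this curve rather than the
closed form, and the function names are hypothetical, chosen only for this example.

```python
import math

MU = 255.0  # companding constant conventionally paired with 8-bit mu-law

def mu_law_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the companded scale [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def quantize_8bit(x: float) -> int:
    """Clip, compand, then uniformly quantize to one of 256 levels (0..255)."""
    y = mu_law_compress(max(-1.0, min(1.0, x)))
    return min(255, int((y + 1.0) / 2.0 * 256))

def dequantize_8bit(code: int) -> float:
    """Reconstruct a sample from the centre of its quantization cell."""
    y = (code + 0.5) / 256 * 2.0 - 1.0
    return mu_law_expand(y)

# 8000 samples/s at 8 bits per sample is the bit rate of one PCM voice channel.
BIT_RATE = 8000 * 8
```

Because the curve is logarithmic, low-amplitude samples are reconstructed with a much
smaller absolute error than a uniform 8-bit quantizer would give, at the cost of coarser
steps near full scale.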
Although the frequency content of speech can extend up to 20 kHz [20], telephone
speech is band limited to the range of approximately 300 to 3300 Hz. The range of
perceptually significant frequencies for speech perception extends to about 10 kHz, which
is considerably beyond that of telephone-band (TB) speech. Specifically, the range from
50 to 200 Hz contributes to increased naturalness, presence and loudness. Its exclusion
from the telephone band causes the speech to sound 'tinny', but is presumably justified
by the fact that this range has little influence on intelligibility [20] [18]. The supra-TB
range from about 3400 Hz to 7000 Hz is thought to contribute to increased intelligibility,
sound differentiation and crispness [il.
In all fairness, TB speech is a well-justified tradeoff between speech quality and the
call capacity of the PSTN. However, advances in speech coding and speech processing can
now enable the network to support so-called wideband (WB) speech (with a bandwidth of
approximately 8 kHz) over its voice grade channels. This counter-intuitive capability can
actually be accomplished in two ways. Wideband speech can be sampled and efficiently
digitally encoded in real time using less than 16 kbits/s [10]. This can be easily
accommodated over voice grade channels since bit streams as fast as 33.6 kbits/s can be sent
using modems complying with the V.34Q standard. Naturally, to recover the speech at
the receiving end, a modem and corresponding wideband speech decoder are needed. This
thesis presents an alternative way to obtain wideband speech with no cost in transmission
bandwidth. An algorithm is presented which regenerates pseudo-wideband speech at the
receiver using only the received TB speech.

¹Outside of North America, an A-law quantizer is used.
A more formal definition of the problem and an outline of objectives is presented
in the following section. To provide the reader with the necessary speech background, an
overview of speech signal characteristics and relevant speech production and perception
issues is presented in Section 1.3. This is followed by a summary and discussion of relevant
previous work in Section 1.6. The details of the proposed algorithm are provided in
Chapter 3. Chapter 4 outlines the experimental methodology, simulation results, and
performance evaluation. A discussion of the results and their implications is presented
in Chapter 5. Finally, conclusions and recommendations for future work are presented in
Chapter 6.
`
`1.2 Problem Definition
`
The primary objective of this work is to develop a receiver-based digital speech
processing algorithm to produce a bandwidth and quality enhancement of TB speech.
Essentially, the algorithm will effect a mapping, T, from TB speech to pseudo-wideband (PWB)
speech, as shown in Figure 1.1.
In order that the algorithm have the potential to function in conjunction with any
existing narrowband speech coders, in particular, the pulse code modulation of the PSTN,
the only allowed input to the algorithm is TB speech. For the purpose of this study,
TB speech is assumed to incorporate only frequencies in the range 300 Hz to 3300 Hz,
and wideband speech is defined as speech incorporating frequencies in the range 0 Hz to
8000 Hz.
Figure 1.1: Pseudo-wideband speech generation.

The speech produced by the algorithm will be dubbed pseudo-wideband (PWB)
speech to distinguish it from true wideband (WB) speech. Research in speech perception
indicates a decrease in frequency resolution in the upper frequency band, which affords
a certain leeway in generating the wideband speech. The objective is to generate a
perceptually viable approximation to the true WB speech, such that the subjective quality is
notably improved as compared with TB speech. Admittedly, subjective quality is difficult
to quantify, therefore objective measures, such as segmental SNR and segmental spectral
SNR, were used in the design and evaluation of the algorithm. These objective measures
are defined and described in Section 1.5.2. Final performance is also evaluated based on
informal listening tests.
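The definitions actually used in this work are those of Section 1.5.2. Purely as a hedged
illustration, the sketch below computes segmental SNR in its conventional form, the mean
over frames of the per-frame SNR in dB. The frame length, the handling of silent frames
and the function name are assumptions made for this example, not details taken from the
thesis.

```python
import math

def segmental_snr(reference, test, frame_len=320):
    """Mean per-frame SNR in dB between a reference signal and a test signal.

    frame_len=320 corresponds to 20 ms at 16 kHz (an assumed setting).
    Frames with zero signal energy or zero error energy are skipped.
    """
    snrs = []
    n_frames = min(len(reference), len(test)) // frame_len
    for m in range(n_frames):
        lo, hi = m * frame_len, (m + 1) * frame_len
        sig = sum(x * x for x in reference[lo:hi])
        err = sum((x - y) ** 2 for x, y in zip(reference[lo:hi], test[lo:hi]))
        if sig > 0.0 and err > 0.0:
            snrs.append(10.0 * math.log10(sig / err))
    return sum(snrs) / len(snrs) if snrs else float("nan")
```

Averaging in the dB domain frame by frame keeps loud frames from dominating the score,
which is the usual motivation for preferring a segmental measure over a single
whole-utterance SNR.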
Unlike the previous attempts described in [1] [2] [6] [8], it was desired that the
algorithm be analytic and physically motivated in nature, rather than being based solely
on signal statistics and pattern matching. It was believed that such an analytic nature
would furnish it with a strong potential for speaker and language independence.
Among the set of secondary goals was the desire to regulate the algorithm's
complexity to make it amenable for real-time implementation on a low-cost DSP processor.
Furthermore, it was desired to arrive at a better general understanding of the nature and
extent of the correlation between narrowband/TB and WB speech.
`
`1.3 Speech Production and Speech Signal Characteristics
`
Speech has a complex structure embodying a great deal of redundancy. In particular, if
only the upper band is isolated (3300-8000 Hz), the speech is still fully intelligible, as it
is when only the TB part is present. The fact that the semantic message content is fully
preserved in different bands is a testament to the strength of the correlation between the
different frequency bands of the speech signal, and the potential feasibility of the current
endeavor.
To understand the structure and characteristics of speech signals it is useful to
think in terms of a speech production model. The two main physical structures involved
in speech production are the vocal cords and the vocal tract. As air egresses from the
lungs, the vocal cords can be relaxed or can vibrate at various frequencies, usually in the
range of 55 Hz to 333 Hz, thus applying a periodic pressure signal to the tract [il [28] [li].
Within the tract, differences in the cross-sectional area, influenced by the position of
the tongue, lips, and jaw, cause sound wave reflections which give rise to resonances
or formants, which appear as peaks in the speech spectrum. In the simplest and most
common speech production model, the linear source-filter or terminal-analogue model, the
contributions of the cords and the tract are partitioned [NI. As depicted in Figure 1.2,
the whole system can be modeled by a source, e(t), isolated from, and leading into a linear
filter, T(z), modeling the tract [19]. The excitation signal, e(t), models the stimulating
signal from the cords, and T(z) models the modulation of that signal by the tract.
The dynamics of the tract and cords produce speech signals which are
non-stationary, that is, the frequency composition of speech varies with time. In particular,
vocal tract structures, or articulators, rarely stay fixed for more than 40 ms, so the
required T(z) is actually a time varying filter [23, p. 206]. The rate of vibration of the vocal
cords can change as quickly as one octave per 100 ms [23, p. 233]. It is such time
variation of vocal cord frequency that is the physical basis of intonation, such as the raising
of pitch¹ at the end of a question. These movements are sufficiently slow and smooth that
the speech signal is generally accepted to be effectively stationary within time segments
on the order of 20 ms. Within such a segment, or frame, all the signals in the source-filter
model can be considered stationary, and T(z) represents a linear time-invariant filter.
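The frame-based view can be made concrete with a minimal sketch that cuts a sampled
signal into consecutive 20 ms frames. The 16 kHz sampling rate and the use of
non-overlapping frames are assumptions made for illustration; the framing scheme actually
used by the algorithm is the one described in Chapter 3.

```python
def split_into_frames(signal, sample_rate_hz=16000, frame_ms=20):
    """Split a sequence of samples into consecutive non-overlapping frames.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

At 16 kHz each 20 ms frame holds 320 samples, and every per-frame quantity discussed in
this chapter (voicing mode, F0, envelope, residual) would be estimated on one such block.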
In analyzing speech segmentally, speech can be partitioned into two main classes
of sounds based on whether the vocal cords are relaxed or vibrating during that frame.
In voiced speech, exemplified by vowel sounds, the vocal cords are vibrating. The rate
at which they vibrate is called the fundamental frequency, F0, or pitch, which can be
assumed to be approximately constant over the duration of the frame. As depicted in
the example in Figure 1.3, voiced speech is characterized by a quasi-periodic time-domain
signal, s(t). For unvoiced speech, the cords are relaxed, but a constriction is present
somewhere in the tract which results in turbulence as air rushes through. This turbulence
serves as an excitation for the remainder of the tract. As shown in Figure 1.4, unvoiced
speech is characterized by a lower amplitude, non-periodic, noise-like signal. Examples
of unvoiced sounds are fricatives such as /s/ and /f/. Typically, unvoiced sounds have
lower amplitudes and energies than voiced sounds, and have a greater proportion of their
energy concentrated in higher frequency bands. A few sounds, such as /z/ in 'zip', have a
mixture of unvoiced and voiced characteristics, and are referred to as mixed-mode sounds.
They occur if there is a constriction causing turbulence somewhere in the tract, but the
vocal cords are also vibrating.

¹Strictly speaking, pitch is the perceived tone of the sound, while frequency of vibration is a property
of the stimulus.

Figure 1.2: Source-filter model of speech production.
`
Figure 1.3: 20 ms frame of voiced speech.
`
The voiced/unvoiced classification leads to the extended source-filter model of
speech and speech production depicted in Figure 1.5. Idealized time and frequency
domain representative signals are included. For voiced speech, the glottal excitation signal
is periodic. The spectrum obtained by taking the Short-Term Fourier Transform (STFT)
exhibits the harmonic structure expected for a periodic signal, with the harmonics
separated by F0, and the spectrum has a typical roll-off of -12 dB/octave [24]. The noise-like
excitation for unvoiced speech is spectrally flat. For the model to handle mixed mode
sounds, the switch could be replaced with a summation block allowing both sources to be
active simultaneously.
Figure 1.4: 20 ms frame of unvoiced speech.

Figure 1.5: Extended source-filter model of speech production.

Although most individuals possess the same basic apparatus and most languages
employ the vocal tract in similar ways, speech signal characteristics are often highly
speaker and context dependent [23]. As shown in Figure 1.5, the main speech
parameters are: the voicing mode; the fundamental frequency, F0, for voiced speech; the gain,
G; the formant locations, or F-pattern {F1, F2, F3, ...}; and the respective bandwidths
of the formants. There is a one-to-many mapping between a semantic unit of sound, or
phoneme, and the corresponding acoustic signal as described by the parameters. The
average fundamental frequency, F0, for women is approximately 210 Hz, and that for men
is approximately 125 Hz [27] [24]¹. Within a given speaker the pitch usually ranges over
an octave of values during speech [24]. In terms of tract variation, the average length
for the pharyngeal-oral tract is 17 cm for men and 13 cm for women. Tract dimensions
affect the resonant frequencies and therefore the positions of the formants in the speech
spectrum. The general principle can be understood in terms of a simple resonating tube
of uniform cross-section and length, L, for which the resonances occur at odd multiples of
$F_1 = c/(4L)$, where c is the speed of sound (approximately 340 m/s for air at sea level) [24].
For men, the formant frequency spacing is approximately 1000 Hz, while for women, it is
about 1300 Hz. The telephone band typically contains about four formants worth of male
speech, and about three formants worth of female speech. The fundamental frequency,
F0, and up to three harmonics often fall below the telephone band.
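The uniform-tube principle can be checked numerically. The short sketch below evaluates
the odd multiples of c/(4L) for the two average tract lengths quoted above; the function
name and the 8000 Hz upper limit are assumptions chosen for this example.

```python
def tube_resonances(length_m, max_hz=8000.0, speed_of_sound=340.0):
    """Resonances of a uniform tube closed at one end: odd multiples of c/(4L)."""
    f1 = speed_of_sound / (4.0 * length_m)
    resonances = []
    k = 1
    while (2 * k - 1) * f1 <= max_hz:
        resonances.append((2 * k - 1) * f1)
        k += 1
    return resonances

male = tube_resonances(0.17)    # first resonance close to 500 Hz, spacing close to 1000 Hz
female = tube_resonances(0.13)  # first resonance close to 654 Hz, spacing close to 1308 Hz
```

Consecutive resonances are separated by 2·c/(4L) = c/(2L), which is where the roughly
1000 Hz and 1300 Hz formant spacings come from.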
`
1.4 Linear Predictive Analysis
`
The deconvolution of excitation and envelope is often accomplished by Linear Predictive
Analysis (LPA). LPA identifies the spectral envelope by finding the best all-pole fit of a
specified order for the spectrum of that frame. The deviations from this all-pole spectral
approximation constitute the excitation, or residual as it is called in the LPA literature.
The tenet of linear predictive speech analysis is that a linear predictor can be used
to estimate the value of the next sample of the digital speech signal, based upon a linear
combination of a set of preceding speech samples. Mathematically, the estimate of the
next sample, $\hat{s}_n$, is expressed as

$$\hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k}, \tag{1.1}$$

where $s_n$ is the speech sequence, p is the order of the prediction, and the $a_k$'s are the
prediction coefficients. The residual, $r_n$, can therefore be expressed as

$$r_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} a_k s_{n-k}. \tag{1.2}$$

¹Men have more massive cords than women, which is why they tend to vibrate at lower frequencies.
Within a given speaker, the cricothyroid muscles can increase the tension on the cords, and thus raise their
frequency of vibration [5].
`
LPA is the process of selecting the predictor coefficients, $a_k$, to minimize the
residual in the mean squared sense. Since speech is non-stationary, the analysis must be
conducted on a segment-by-segment basis. Within a frame of N samples, the mean squared
error, E, is given by

$$E = \frac{1}{N}\sum_{n=0}^{N-1} r_n^2 = \frac{1}{N}\sum_{n=0}^{N-1}\Bigl(s_n - \sum_{k=1}^{p} a_k s_{n-k}\Bigr)^2. \tag{1.3}$$

To minimize E, the partial derivatives of equation 1.3 are taken with respect to the
prediction coefficients, $a_k$, and set to zero

$$\frac{\partial E}{\partial a_i} = 0, \quad \text{for } 1 \le i \le p. \tag{1.4}$$

After some algebra, this yields p linear equations in the p unknown predictor coefficients
of the form

$$\sum_{k=1}^{p} a_k \sum_{n=0}^{N-1} s_{n-k}\, s_{n-i} = \sum_{n=0}^{N-1} s_n\, s_{n-i}, \tag{1.5}$$

with $1 \le i \le p$. The solution of this system can be computed with matrix inversion;
however, in practice $8 \le p \le 26$, and this method becomes computationally expensive [XI. Two
alternative practical techniques to approximate the solution are the covariance method and
the auto-correlation method, whose details are well presented in [21] and [XI. In the LPA
simulations conducted for this thesis, the auto-correlation method was employed, and the
details of this method will be presented, along with the other implementation details, in
Chapter 3. Aside from finite precision errors, analysis with this method is guaranteed to
produce a stable filter [23].
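To make the auto-correlation method concrete ahead of the implementation details in
Chapter 3, here is a minimal sketch that solves the normal equations with the
Levinson-Durbin recursion. It is a generic textbook formulation under stated assumptions,
not the thesis implementation: no analysis window is applied, the sign convention is that
of a predictor formed as the sum of $a_k s_{n-k}$, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation of a frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc_autocorrelation(frame, order):
    """Return (a, err), where a[0..order-1] holds a_1..a_p and err is the
    final prediction error energy, via the Levinson-Durbin recursion."""
    r = autocorrelation(frame, order)
    a = []
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            # Degenerate (e.g. all-zero) frame: pad with zeros and stop.
            a.extend([0.0] * (order - len(a)))
            break
        # Reflection coefficient for stage i.
        k = (r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))) / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        err *= (1.0 - k * k)
    return a, err
```

The recursion needs on the order of p² operations rather than the p³ of a general matrix
inversion, and each reflection coefficient it produces has magnitude at most one, which is
the property behind the stability guarantee mentioned above.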
`
Although LPA is most often described in the time domain, the frequency domain
interpretation yields more insight for the current application. In particular, pth order
linear prediction in the time domain corresponds to modeling the spectrum of that frame
with a pth order all-pole model. To see this, note that the z-transform of equation 1.2
reveals that

$$S(z) = \frac{R(z)}{1 - \sum_{k=1}^{p} a_k z^{-k}} = R(z)\,H(z), \tag{1.6}$$

where H(z) is a pth order IIR filter. LPA is, in fact, equivalent to maximum entropy
spectral estimation [-II. For an appropriate value of p, the macroscopic spectral shaping
is fully captured in the envelope, H(z), while the spectral detail is captured in the residual,
r(n), which has a generally flat spectral trend.
The formulation in (1.6) reveals the consistency between LPA and the source-filter
speech production model in Figure 1.2. Thus, H(z) is often referred to as the tract or
synthesis filter. The H(z) computed in LPA actually encompasses not only the tract
effects, but also glottal flow and radiation effects which are distinct from the tract filter,
T(z), in the source-filter production model [32]. Thus, in LPA, all major spectral shaping
effects are encompassed in H(z), and the residuals, both voiced and unvoiced, exhibit an
overall flat spectral trend.
Viewing the residual signal as an output of the LPA process, it can be seen that

$$R(z) = S(z)\,A(z), \qquad A(z) = \frac{1}{H(z)} = 1 - \sum_{k=1}^{p} a_k z^{-k}, \tag{1.7}$$

where A(z) is referred to as the analysis or prediction filter.
Figure 1.6: Speech spectrum and formant-scape from 16th order LPA.

Figure 1.7: Example of a residual spectrum.

Figure 1.6 displays the speech spectrum, |S(f)|, and corresponding envelope, |H(f)|,
found by performing 16th order LPA on the 20 ms WB speech frame depicted in Figure 1.3.
The peaks in |H(f)| can be interpreted as revealing the location of formants. Since speech
is a real signal, a pth order LPA spectral model effectively deduces the location of p/2 positive
frequency formant locations¹. Mathematically, the formant frequencies can be found by
finding the poles, $z_i$, from the denominator of H(z). Then the positive frequency formant,
$F_i$, associated with pole $z_i$ is given by

$$F_i = \frac{\arg(z_i)}{2\pi T_s}, \tag{1.8}$$

where $F_i$ is in Hertz, and $T_s$ is the speech sampling period.
Figure 1.8 illustrates the decomposition and synthesis method for a speech segment
according to the source-filter model via LPA.

¹In the example shown, there are formants at 0 Hz and 8000 Hz which don't stand out on the graph.
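The pole-to-formant conversion can be illustrated for the smallest non-trivial case, a
second-order predictor, whose pole pair follows from the quadratic formula. The
coefficient values, the 16 kHz sampling rate and the function names below are hypothetical
and chosen only so that the expected formant is known in advance; a pth order model
would need a general polynomial root finder instead.

```python
import cmath
import math

def formant_hz(pole: complex, sample_period_s: float) -> float:
    """Frequency in Hz associated with a pole of H(z): arg(z) / (2 pi Ts)."""
    return cmath.phase(pole) / (2.0 * math.pi * sample_period_s)

def poles_of_second_order(a1: float, a2: float):
    """Poles of H(z) = 1 / (1 - a1 z^-1 - a2 z^-2), i.e. roots of z^2 - a1 z - a2."""
    root = cmath.sqrt(a1 * a1 + 4.0 * a2)
    return (a1 + root) / 2.0, (a1 - root) / 2.0

# Hypothetical single resonance at 500 Hz, pole radius 0.97, 16 kHz sampling.
TS = 1.0 / 16000.0
theta = 2.0 * math.pi * 500.0 * TS
a1, a2 = 2.0 * 0.97 * math.cos(theta), -(0.97 ** 2)
upper_pole, lower_pole = poles_of_second_order(a1, a2)
formant = formant_hz(upper_pole, TS)  # positive-frequency member of the pair
```

The two poles form a complex-conjugate pair, so only the one with positive angle
contributes a positive-frequency formant; this is why a pth order model yields about p/2
formants.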
`
Figure 1.8: Linear predictive analysis and synthesis.
`
Aside from limitations in numeric precision, the deconvolution of a speech segment
into spectral envelope and excitation via LPA is a lossless process. The original speech
frame can be reconstructed by filtering the residual, r(n), through the synthesis filter,
H(z). Speech synthesis by this technique is known as Residual Excited Linear Prediction
(RELP). In applying LPA to speech compression, the prediction coefficients, $a_k$, can be
transformed and quantized by various means to obtain a very compact representation of
the spectral envelope. Even more coding gain is achieved when advantage is taken of
the fact that relatively less information is contained in the spectrally flat residual than
in the spectral envelope. Speech of acceptable quality can be synthesized according to
Figure 1.8 even when only a fairly rough approximation to the residual is used. Such
residual compression and approximation techniques are discussed in Section 1.6.2.
`
`1.5 Speech Quality: Factors and Measures
`
Although the primary goal of this project was to produce a significant enhancement of
speech quality, it is a very difficult property to quantify. Speech quality is inherently
subjective, and is dependent on incompletely understood aspects of speech perception.
Nonetheless, some formal subjective and objective measures of speech quality
were selected for the design and evaluation of the PWB speech algorithm. In this section,
some basic speech quality perception factors are presented along with the selected measures
of speech quality.
`
`1.5.1 Speech Perception
`
`A distortion is only perceptually sign