Extrapolation of Wideband Speech From the Telephone Band

Aryn Alexandra Pyke

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
The University of Toronto

© Copyright Aryn Alexandra Pyke 1997
`
`Ex. 1043 / Page 1 of 101
`Apple v. Saint Lawrence
`
`

`

National Library of Canada
Acquisitions and Bibliographic Services
395 Wellington Street
Ottawa ON K1A 0N4
Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
`
`
`

`

Extrapolation of Wideband Speech From the Telephone Band

Aryn Alexandra Pyke, M.A.Sc.
Department of Electrical and Computer Engineering
The University of Toronto, 1997

Telephone speech is bandlimited to the frequency range between 300 and 3300 Hz, which compromises its quality. Wideband speech, accommodating frequencies up to 7000 Hz, provides higher quality but at a cost of increased transmission bandwidth. The proposed pseudo-wideband (PWB) speech algorithm regenerates approximations of the bands missing from telephone speech. This is possible because of the strong inter-band correlations which stem from the acoustics of the production apparatus.

For this receiver-based algorithm, the improvement in effective bandwidth requires no extra transmission bandwidth and involves no codec standardization issues. The spectral envelope and spectral detail are deconvolved via linear predictive analysis, and each is mapped independently to its PWB counterpart. The algorithm is based on parametric analysis using a uniform tube tract model, and has good potential for speaker independence. Performance was encouraging for a preliminary investigation, but a more sophisticated acoustic model is desirable for additional quality improvement.
`
`
`

`

Acknowledgments

I would like to thank my supervisor, Professor Frank Kschischang, for his invaluable advice and encouragement throughout the research and preparation of this thesis. I would also like to acknowledge my family for their unerring support. Finally, I would like to thank my friends, especially Joel -110 and Lucy Pegoraro, for their continual support and tolerance.
`
`
`

`

Contents

Abstract
Acknowledgments
List of Figures

Chapter 1 Background
  1.1 Introduction
  1.2 Problem Definition
  1.3 Speech Production and Speech Signal Characteristics
  1.4 Linear Predictive Analysis
  1.5 Speech Quality: Factors and Measures
    1.5.1 Speech Perception
    1.5.2 Objective Quality Measures
  1.6 Previous Work
    1.6.1 Spectral Envelope Mapping
    1.6.2 Excitation Extrapolation
    1.6.3 System Evaluation

Chapter 2 Speech Extrapolation Model
  2.1 The Excitation Source
  2.2 The Tract Filter
    2.2.1 Transfer Function of a Resonance
    2.2.2 The Uniform Tube Model
    2.2.3 Estimation of Tract Length from TB Speech
    2.2.4 Limitations of the Uniform Tube Model
    2.2.5 Perceptual Considerations for the Tract Model
  2.3 The Entire Speech Spectrum
  2.4 Summary

Chapter 3 Proposed PWB Speech Extrapolation Algorithm
  3.1 Design Assumptions
  3.2 System Overview
  3.3 Framing for Block Processing
  3.4 Analysis
    3.4.1 Linear Predictive Analysis (TB-LPA)
    3.4.2 Frame Classification
  3.5 Extrapolation
    3.5.1 Odd-Harmonic Tract Resonance Extrapolation
    3.5.2 Excitation Extrapolation
  3.6 Synthesis
    3.6.1 Correction Filter
    3.6.2 Splicing the TB into the WB Synthetic Signal
  3.7 Complexity
  3.8 Summary

Chapter 4 Experimental Results
  4.1 Methodology
    4.1.1 Equipment
    4.1.2 Speech Corpus
    4.1.3 Objective Measures
  4.2 Telephone Band Speech Model
  4.3 Performance Baselines
  4.4 Simulations
    4.4.1 Splicing the Bands
    4.4.2 Excitation Simulations
    4.4.3 Envelope Extrapolation Simulations
  4.5 Preliminary Investigations for Alternative Tract Models
    4.5.1 Uniform Open Ended Tube for Unvoiced Frames
    4.5.2 Multiple Independent Resonators
  4.6 System Performance

Chapter 5 Discussion
  5.1 Strengths of the Uniform Tube, Fixed Bandwidth Model
  5.2 Tract Length Parameterization Errors
  5.3 Limitations of the Uniform Tube, Fixed Bandwidth Model
  5.4 Potential for Other Acoustic Approaches
  5.5 Speaker Dependence

Chapter 6 Conclusions
  6.1 Contributions
    6.1.1 Explicit Acoustic Approach to PWB Speech Generation
    6.1.2 Voiced Excitation Extrapolation from TB to WB
    6.1.3 Speech Processing Toolbox
  6.2 Future Work
    6.2.1 Extension of Acoustic Model
    6.2.2 Acoustic-Phonetic/Articulatory-Phonetic Model
    6.2.3 Non-Acoustic Approaches to PWB Speech Generation

References
`
`
`

`

List of Figures

1.1 Pseudo-wideband speech generation
1.2 Source-filter model of speech production
1.3 20 ms frame of voiced speech
1.4 20 ms frame of unvoiced speech
1.5 Extended source-filter model of speech production
1.6 Speech spectrum and formant-scape from 16th order LPA
1.7 Example of a residual spectrum
1.8 Linear predictive analysis and synthesis
1.9 High-level block diagram of a PWB speech generation system
1.10 Spectral duplication: (a) NB spectral translation; (b) NB spectral folding; and (c) TB spectral folding

2.1 Idealized voiced excitation signal
2.2 Magnitude spectrum of the idealized voiced excitation signal
2.3 Spectrum of a single resonance
2.4 Odd-harmonic resonances produced in a tube closed at one end
2.5 Tolerance guideline of just noticeable differences in formant location and bandwidth as a function of frequency

3.1 High-level block diagram of the proposed PWB speech extrapolation system
3.2 Relationships between the analysis and synthesis frames used for LPA and LPS
3.3 Block diagram of TB-LPA
3.4 Block diagram of the excitation extrapolation technique
3.5 Examples of actual and extrapolated wideband residuals for (a) an unvoiced frame; (b) a voiced frame
3.6 Contour of the voiced spectral shaping filter

4.1 Experimental set-up for speech I/O and processing
4.2 Generation of the TB corpus from the WB corpus
4.3 Measures of the band-limiting distortion for the TB, UB, sub-TB, TB and UB, and TB and sub-TB bands
4.4 Control performance measures for TB speech quality
4.5 Simulation system for determining appropriate cutoff frequency for splicing the TB into PWB speech
4.6 Determination of appropriate UB cutoff frequency for the highpass filter in the sub-band splicing phase
4.7 Determination of appropriate WB-TB cutoff frequency for the lowpass filter in the sub-band splicing phase
4.8 Simulation system for evaluating excitation extrapolation techniques
4.9 Excitation extrapolation candidates
4.10 Simulation system for evaluating envelope extrapolation techniques
4.11 Spectral distortion measures for envelope extrapolation of voiced speech
4.12 Example of envelope extrapolation for a typical voiced frame
4.13 Spectral distortion measures for envelope extrapolation of unvoiced speech
4.14 Example of envelope extrapolation for a typical unvoiced frame
4.15 Objective results comparing PWB speech with the WB original in the excitation simulations

5.1 Effect of tract length on the F-pattern observable in the TB spectral window
`
`
`

`

Chapter 1

Background

1.1 Introduction

Perceptually, telephone speech is less natural and sometimes less intelligible than face-to-face speech. The quality of telephone speech is primarily compromised by the band limiting done in the Public Switched Telephone Network (PSTN) to reduce the sampling rate and save on transmission bandwidth. The PSTN uses a Pulse Code Modulation (PCM) coding scheme. The speech signal is band limited to avoid aliasing and then sampled at 8000 samples per second. Each sample is then quantized according to the 8-bit, non-linear μ-law quantizer¹ [16].

Although the frequency content of speech can extend up to 20 kHz [20], telephone speech is band limited to the range of approximately 300 to 3300 Hz. The range of perceptually significant frequencies for speech perception extends to about 10 kHz, which is considerably beyond that of telephone-band (TB) speech. Specifically, the range from 50 to 200 Hz contributes to increased naturalness, presence and loudness. Its exclusion from the telephone band causes the speech to sound 'tinny', but is presumably justified by the fact that this range has little influence on intelligibility [20] [18]. The supra-TB range from about 3400 Hz to 7000 Hz is thought to contribute to increased intelligibility, sound differentiation and crispness [i l.

In all fairness, TB speech is a well-justified tradeoff between speech quality and the call capacity of the PSTN. However, advances in speech coding and speech processing can now enable the network to support so-called wideband (WB) speech (with a bandwidth of approximately 8 kHz) over its voice grade channels. This counter-intuitive capability can actually be accomplished in two ways. Wideband speech can be sampled and efficiently digitally encoded in real time using less than 16 kbits/s [10]. This can be easily accommodated over voice grade channels since bit streams as fast as 33.6 kbits/s can be sent using modems complying with the V.34 standard. Naturally, to recover the speech at the receiving end, a modem and corresponding wideband speech decoder are needed. This thesis presents an alternative way to obtain wideband speech with no cost in transmission bandwidth. An algorithm is presented which regenerates pseudo-wideband speech at the receiver using only the received TB speech.

A more formal definition of the problem and an outline of objectives is presented in the following section. To provide the reader with the necessary speech background, an overview of speech signal characteristics and relevant speech production and perception issues is presented in Section 1.3. This is followed by a summary and discussion of relevant previous work in Section 1.6. The details of the proposed algorithm are provided in Chapter 3. Chapter 4 outlines the experimental methodology, simulation results, and performance evaluation. A discussion of the results and their implications is presented in Chapter 5. Finally, conclusions and recommendations for future work are presented in Chapter 6.

¹ Outside of North America, an A-law quantizer is used.
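The PCM coding described in this section (8000 samples per second, 8 bits per sample, non-linear quantization) can be illustrated with a minimal sketch. It assumes the continuous μ-law companding curve with μ = 255, a value not stated in the text; deployed telephone codecs use a piecewise-linear approximation of this curve, so the function names and the exact code mapping below are illustrative only.

```python
import numpy as np

MU = 255.0  # assumed companding constant; not given in the text


def mu_law_encode(x):
    """Map samples in [-1, 1] to 8-bit codes via continuous mu-law companding."""
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    # Uniformly quantize the compressed value to 256 levels.
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)


def mu_law_decode(codes):
    """Invert the companding curve to recover approximate sample values."""
    y = codes.astype(float) / 255.0 * 2.0 - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU


samples = np.linspace(-1.0, 1.0, 1001)
codes = mu_law_encode(samples)
decoded = mu_law_decode(codes)

# 8000 samples/s at 8 bits per sample
bit_rate = 8000 * 8
```

Because the quantizer steps are finer near zero, small-amplitude samples are reproduced with a much smaller absolute error than samples near full scale.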
`
1.2 Problem Definition

The primary objective of this work is to develop a receiver-based digital speech processing algorithm to produce a bandwidth and quality enhancement of TB speech. Essentially, the algorithm will effect a mapping, T, from TB speech to pseudo-wideband (PWB) speech, as shown in Figure 1.1.

Figure 1.1: Pseudo-wideband speech generation.

In order that the algorithm have the potential to function in conjunction with any existing narrowband speech coders, in particular the pulse code modulation of the PSTN, the only allowed input to the algorithm is TB speech. For the purpose of this study, TB speech is assumed to incorporate only frequencies in the range 300 Hz to 3300 Hz, and wideband speech is defined as speech incorporating frequencies in the range 0 Hz to 8000 Hz.

The speech produced by the algorithm will be dubbed pseudo-wideband (PWB) speech to distinguish it from true wideband (WB) speech. Research in speech perception indicates a decrease in frequency resolution in the upper frequency band, which affords a certain leeway in generating the wideband speech. The objective is to generate a perceptually viable approximation to the true WB speech, such that the subjective quality is notably improved as compared with TB speech. Admittedly, subjective quality is difficult to quantify; therefore objective measures, such as segmental SNR and segmental spectral SNR, were used in the design and evaluation of the algorithm. These objective measures are defined and described in Section 1.5.2. Final performance is also evaluated based on informal listening tests.

Unlike the previous attempts described in [1] [2] [6] [8], it was desired that the algorithm be analytic and physically motivated in nature, rather than being based solely on signal statistics and pattern matching. It was believed that such an analytic nature would furnish it with a strong potential for speaker and language independence.

Among the set of secondary goals was the desire to regulate the algorithm's complexity to make it amenable to real-time implementation on a low-cost DSP processor. Furthermore, it was desired to arrive at a better general understanding of the nature and extent of the correlation between narrowband/TB and WB speech.
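The band definitions used in this section (TB: 300 Hz to 3300 Hz; WB: 0 Hz to 8000 Hz, implying a 16 kHz sampling rate) can be made concrete with a small sketch that derives a TB signal from a WB one. The ideal "brick-wall" FFT mask and the function name `band_limit` are assumptions made for illustration; they are not the filters used in the thesis.

```python
import numpy as np


def band_limit(signal, fs, f_lo=300.0, f_hi=3300.0):
    """Zero every FFT bin outside [f_lo, f_hi] (idealized band limiting)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    return np.fft.irfft(spectrum * mask, n=len(signal))


fs = 16000                      # WB sampling rate for content up to 8000 Hz
t = np.arange(fs) / fs          # one second of samples
tone_100 = np.sin(2 * np.pi * 100.0 * t)    # below the telephone band
tone_1000 = np.sin(2 * np.pi * 1000.0 * t)  # inside the telephone band
tone_5000 = np.sin(2 * np.pi * 5000.0 * t)  # above the telephone band

wb = tone_100 + tone_1000 + tone_5000
tb = band_limit(wb, fs)         # only the 1000 Hz component survives
```

The PWB algorithm has to work in the opposite direction: given only `tb`, it must estimate plausible content for the two bands that the mask removed.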
`
1.3 Speech Production and Speech Signal Characteristics

Speech has a complex structure embodying a great deal of redundancy. In particular, if only the upper band is isolated (3300-8000 Hz), the speech is still fully intelligible, as it is when only the TB part is present. The fact that the semantic message content is fully preserved in different bands is a testament to the strength of the correlation between the different frequency bands of the speech signal, and the potential feasibility of the current endeavor.

To understand the structure and characteristics of speech signals it is useful to think in terms of a speech production model. The two main physical structures involved in speech production are the vocal cords and the vocal tract. As air egresses from the lungs, the vocal cords can be relaxed or can vibrate at various frequencies, usually in the range of 55 Hz to 333 Hz, thus applying a periodic pressure signal to the tract [il [28] [li]. Within the tract, differences in the cross-sectional area, influenced by the position of the tongue, lips, and jaw, cause sound wave reflections which give rise to resonances or formants, which appear as peaks in the speech spectrum. In the simplest and most common speech production model, the linear source-filter or terminal-analogue model, the contributions of the cords and the tract are partitioned [NI. As depicted in Figure 1.2, the whole system can be modeled by a source, e(t), isolated from, and leading into, a linear filter, T(z), modeling the tract [19]. The excitation signal, e(t), models the stimulating signal from the cords, and T(z) models the modulation of that signal by the tract.

Figure 1.2: Source-filter model of speech production.

The dynamics of the tract and cords produce speech signals which are non-stationary, that is, the frequency composition of speech varies with time. In particular, vocal tract structures, or articulators, rarely stay fixed for more than 40 ms, so the required T(z) is actually a time-varying filter [23, p. 206]. The rate of vibration of the vocal cords can change as quickly as one octave per 100 ms [23, p. 233]. It is such time variation of vocal cord frequency that is the physical basis of intonation, such as the raising of pitch¹ at the end of a question. These movements are sufficiently slow and smooth that the speech signal is generally accepted to be effectively stationary within time segments on the order of 20 ms. Within such a segment, or frame, all the signals in the source-filter model can be considered stationary, and T(z) represents a linear time-invariant filter.

¹ Strictly speaking, pitch is the perceived tone of the sound, while frequency of vibration is a property of the stimulus.

In analyzing speech segmentally, speech can be partitioned into two main classes of sounds based on whether the vocal cords are relaxed or vibrating during that frame. In voiced speech, exemplified by vowel sounds, the vocal cords are vibrating. The rate at which they vibrate is called the fundamental frequency, F0, or pitch, which can be assumed to be approximately constant over the duration of the frame. As depicted in the example in Figure 1.3, voiced speech is characterized by a quasi-periodic time-domain signal, s(t). For unvoiced speech, the cords are relaxed, but a constriction is present somewhere in the tract which results in turbulence as air rushes through. This turbulence serves as an excitation for the remainder of the tract. As shown in Figure 1.4, unvoiced speech is characterized by a lower amplitude, non-periodic, noise-like signal. Examples of unvoiced sounds are fricatives such as /s/ and /f/. Typically, unvoiced sounds have lower amplitudes and energies than voiced sounds, and have a greater proportion of their energy concentrated in higher frequency bands. A few sounds, such as /z/ in 'zip', have a mixture of unvoiced and voiced characteristics, and are referred to as mixed-mode sounds. They occur if there is a constriction causing turbulence somewhere in the tract, but the vocal cords are also vibrating.
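The segmental view described above (frames of about 20 ms, with voiced frames showing higher energy and unvoiced frames more high-frequency content) can be sketched as follows. The two features used here, short-time energy and zero-crossing rate, are common textbook indicators chosen for illustration; they are an assumption and are not presented as the frame classifier used later in the thesis. The test signals are synthetic stand-ins, not speech.

```python
import numpy as np


def split_into_frames(signal, fs, frame_ms=20.0):
    """Cut a signal into consecutive non-overlapping frames of frame_ms."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    n_frames = len(signal) // frame_len
    return np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))


def frame_features(frames):
    """Return (mean energy, zero-crossing rate) for each frame."""
    energy = np.mean(frames ** 2, axis=1)
    signs = np.signbit(frames)
    # Fraction of adjacent sample pairs whose sign differs.
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return energy, zcr


fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
voiced_like = np.sin(2 * np.pi * 125.0 * t)      # periodic, strong
unvoiced_like = 0.1 * rng.standard_normal(fs)    # noise-like, weak

voiced_frames = split_into_frames(voiced_like, fs)
energy_v, zcr_v = frame_features(voiced_frames)
energy_u, zcr_u = frame_features(split_into_frames(unvoiced_like, fs))
```

At 16 kHz a 20 ms frame holds 320 samples. The periodic signal shows high energy with few zero crossings, and the noise-like signal shows the reverse.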
`
Figure 1.3: 20 ms frame of voiced speech.
`
The voiced/unvoiced classification leads to the extended source-filter model of speech and speech production depicted in Figure 1.5. Idealized time and frequency domain representative signals are included. For voiced speech, the glottal excitation signal is periodic. The spectrum obtained by taking the Short-Term Fourier Transform (STFT) exhibits the harmonic structure expected for a periodic signal, with the harmonics separated by F0, and the spectrum has a typical roll-off of -12 dB/octave [24]. The noise-like excitation for unvoiced speech is spectrally flat. For the model to handle mixed-mode sounds, the switch could be replaced with a summation block allowing both sources to be active simultaneously.
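The harmonic structure of a periodic excitation can be checked numerically. The sketch below uses an idealized unit impulse train as the voiced source, which is an assumption for illustration: it reproduces the harmonic spacing of F0 but has a flat spectrum, so it does not model the -12 dB/octave glottal roll-off. The period is chosen so that the harmonics fall exactly on FFT bins.

```python
import numpy as np

fs = 16000            # sampling rate in Hz
period = 128          # pitch period in samples
f0 = fs / period      # 125 Hz fundamental
n = 1024              # analysis length: exactly 8 pitch periods

# Idealized voiced excitation: one unit impulse per pitch period.
excitation = np.zeros(n)
excitation[::period] = 1.0

magnitude = np.abs(np.fft.rfft(excitation))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

# Frequencies at which the spectrum is non-zero (includes the DC term).
harmonic_freqs = freqs[magnitude > 1e-9]
spacing = np.diff(harmonic_freqs)
```

Every non-zero spectral line sits at a multiple of `f0`, so `spacing` is constant and equal to the fundamental frequency.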
Figure 1.4: 20 ms frame of unvoiced speech.

Figure 1.5: Extended source-filter model of speech production.

Although most individuals possess the same basic apparatus and most languages employ the vocal tract in similar ways, speech signal characteristics are often highly speaker and context dependent [23]. As shown in Figure 1.5, the main speech parameters are: the voicing mode; the fundamental frequency, F0, for voiced speech; the gain, G; the formant locations, or F-pattern {F1, F2, F3, ...}; and the respective bandwidths of the formants. There is a one-to-many mapping between a semantic unit of sound, or phoneme, and the corresponding acoustic signal as described by the parameters. The average fundamental frequency, F0, for women is approximately 210 Hz, and that for men is approximately 125 Hz [27] [24]¹. Within a given speaker the pitch usually ranges over an octave of values during speech [24]. In terms of tract variation, the average length for the pharyngeal-oral tract is 17 cm for men and 13 cm for women. Tract dimensions affect the resonant frequencies and therefore the positions of the formants in the speech spectrum. The general principle can be understood in terms of a simple resonating tube of uniform cross-section and length, L, for which the resonances occur at odd multiples of $F_1 = \frac{c}{4L}$, where c is the speed of sound (approximately 340 m/s for air at sea level) [24]. For men, the formant frequency spacing is approximately 1000 Hz, while for women, it is about 1300 Hz. The telephone band typically contains about four formants worth of male speech, and about three formants worth of female speech. The fundamental frequency, F0, and up to three harmonics often fall below the telephone band.
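The uniform tube relation can be evaluated directly for the two average tract lengths quoted above. This is a minimal sketch of the quarter-wavelength formula only; the function name is hypothetical, and real vocal tracts are not uniform tubes.

```python
def uniform_tube_resonances(length_m, count, speed_of_sound=340.0):
    """Resonances of a uniform tube closed at one end: odd multiples of c / (4 L)."""
    f1 = speed_of_sound / (4.0 * length_m)
    return [(2 * n - 1) * f1 for n in range(1, count + 1)]


# Average tract lengths quoted in the text: 17 cm (men) and 13 cm (women).
male = uniform_tube_resonances(0.17, 4)
female = uniform_tube_resonances(0.13, 4)

# Adjacent resonances are separated by 2 * F1 = c / (2 L).
male_spacing = male[1] - male[0]
female_spacing = female[1] - female[0]
```

For the 17 cm tube the first resonance is about 500 Hz with a spacing of about 1000 Hz. For the 13 cm tube the first resonance is about 654 Hz with a spacing of about 1308 Hz, in line with the approximate spacings stated in the text.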
`
¹ Men have more massive cords than women, which is why they tend to vibrate at lower frequencies. Within a given speaker, the cricothyroid muscles can increase the tension on the cords, and thus raise their frequency of vibration (53.

1.4 Linear Predictive Analysis

The deconvolution of excitation and envelope is often accomplished by Linear Predictive Analysis (LPA). LPA identifies the spectral envelope by finding the best all-pole fit of a specified order for the spectrum of that frame. The deviations from this all-pole spectral approximation constitute the excitation, or residual as it is called in the LPA literature.

The tenet of linear predictive speech analysis is that a linear predictor can be used to estimate the value of the next sample of the digital speech signal, based upon a linear combination of a set of preceding speech samples. Mathematically, the estimate of the next sample, $\hat{s}_n$, is expressed as

$$\hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k}, \tag{1.1}$$

where $s_n$ is the speech sequence, $p$ is the order of the prediction, and the $a_k$ are the prediction coefficients. The residual, $r_n$, can therefore be expressed as

$$r_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} a_k s_{n-k}. \tag{1.2}$$

LPA is the process of selecting the predictor coefficients, $a_k$, to minimize the residual in the mean squared sense. Since speech is non-stationary, the analysis must be conducted on a segment-by-segment basis. Within a frame of $N$ samples, the mean squared error, $E$, is given by

$$E = \frac{1}{N}\sum_{n=0}^{N-1} r_n^2 = \frac{1}{N}\sum_{n=0}^{N-1}\Bigl(s_n - \sum_{k=1}^{p} a_k s_{n-k}\Bigr)^2. \tag{1.3}$$

To minimize $E$, the partial derivatives of equation 1.3 are taken with respect to the prediction coefficients, $a_k$, and set to zero:

$$\frac{\partial E}{\partial a_i} = 0, \quad \text{for } 1 \le i \le p. \tag{1.4}$$

After some algebra, this yields $p$ linear equations in the $p$ unknown predictor coefficients of the form

$$\sum_{k=1}^{p} a_k \sum_{n=0}^{N-1} s_{n-k}\, s_{n-i} = \sum_{n=0}^{N-1} s_n\, s_{n-i}, \tag{1.5}$$

with $1 \le i \le p$. The solution of this system can be computed with matrix inversion; however, in practice $8 \le p \le 26$, and this method becomes computationally expensive [XI. Two alternative practical techniques to approximate the solution are the covariance method and the auto-correlation method, whose details are well presented in [21] and [XI. In the LPA simulations conducted for this thesis, the auto-correlation method was employed, and the details of this method will be presented, along with the other implementation details, in Chapter 3. Aside from finite precision errors, analysis with this method is guaranteed to produce a stable filter [23].
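To make the auto-correlation method concrete, here is a minimal sketch under stated assumptions: samples outside the frame are treated as zero, no window is applied, and the Toeplitz normal equations are solved with a general linear solver for clarity rather than with a fast recursion. The function name, the test signal, and these choices are illustrative and do not reproduce the implementation detailed in Chapter 3.

```python
import numpy as np


def lpc_autocorrelation(frame, order):
    """Estimate a_1..a_p for the predictor s_hat[n] = sum_k a_k * s[n-k]."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Short-time autocorrelation r[0..p]; samples outside the frame count as zero.
    r = np.array([np.dot(frame[:n - lag], frame[lag:]) for lag in range(order + 1)])
    # Normal equations in Toeplitz form: sum_k a_k r[|i-k|] = r[i], 1 <= i <= p.
    big_r = np.array([[r[abs(i - k)] for k in range(order)] for i in range(order)])
    return np.linalg.solve(big_r, r[1:])


# Hypothetical test signal: impulse response of a known 2nd-order all-pole filter.
true_a = np.array([1.2, -0.72])
signal = np.zeros(400)
signal[0] = 1.0
for i in range(1, len(signal)):
    signal[i] = true_a[0] * signal[i - 1]
    if i >= 2:
        signal[i] += true_a[1] * signal[i - 2]

estimated_a = lpc_autocorrelation(signal, order=2)
```

Because the test signal was generated by an all-pole filter of the same order and has decayed to a negligible level by the end of the frame, the estimated coefficients match the generating ones to numerical precision. Real speech frames only approximate this situation.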
`
`
`

`

Although LPA is most often described in the time domain, the frequency domain interpretation yields more insight for the current application. In particular, $p^{\text{th}}$ order linear prediction in the time domain corresponds to modeling the spectrum of that frame with a $p^{\text{th}}$ order all-pole model. To see this, note that the z-transform of equation 1.2 reveals that

$$S(z) = \frac{R(z)}{1 - \sum_{k=1}^{p} a_k z^{-k}} = R(z)\,H(z), \tag{1.6}$$

where $H(z)$ is a $p^{\text{th}}$ order IIR filter. LPA is, in fact, equivalent to maximum entropy spectral estimation [-II. For an appropriate value of $p$, the macroscopic spectral shaping is fully captured in the envelope, $H(z)$, while the spectral detail is captured in the residual, $r(n)$, which has a generally flat spectral trend.

The formulation in (1.6) reveals the consistency between LPA and the source-filter speech production model in Figure 1.2. Thus, $H(z)$ is often referred to as the tract or synthesis filter. The $H(z)$ computed in LPA actually encompasses not only the tract effects, but also glottal flow and radiation effects, which are distinct from the tract filter, $T(z)$, in the source-filter production model [32]. Thus, in LPA, all major spectral shaping effects are encompassed in $H(z)$, and the residuals, both voiced and unvoiced, exhibit an overall flat spectral trend.

Viewing the residual signal as an output of the LPA process, it can be seen that

$$R(z) = \frac{S(z)}{H(z)} = S(z)\,A(z), \tag{1.7}$$

where $A(z)$ is referred to as the analysis or prediction filter.

Figure 1.6 displays the speech spectrum, $|S(f)|$, and corresponding envelope, $|H(f)|$, found by performing 16th order LPA on the 20 ms WB speech frame depicted in Figure 1.3. The peaks in $|H(f)|$ can be interpreted as revealing the location of formants. Since speech is a real signal, a $p^{\text{th}}$ order LPA spectral model effectively deduces $p/2$ positive frequency formant locations¹. Mathematically, the formant frequencies can be found by finding the poles, $z_i$, from the denominator of $H(z)$. Then the positive frequency formant, $F_i$, associated with pole $z_i$ is given by

$$F_i = \frac{\arg(z_i)}{2\pi T_s}, \tag{1.8}$$

where $F_i$ is in Hertz, and $T_s$ is the speech sampling period.

¹ In the example shown, there are formants at 0 Hz and 8000 Hz which don't stand out on the graph.

Figure 1.6: Speech spectrum and formant-scape from 16th order LPA.

Figure 1.7: Example of a residual spectrum.

Figure 1.8 illustrates the decomposition and synthesis method for a speech segment according to the source-filter model via LPA.
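The pole-to-formant relationship can be sketched in a few lines. The code assumes the sign convention $A(z) = 1 - \sum_k a_k z^{-k}$, so the poles of $H(z)$ are the roots of $z^p - a_1 z^{p-1} - \dots - a_p$; the function name and the test pole positions (radius 0.95 at 500 Hz and 1500 Hz) are hypothetical values chosen for illustration.

```python
import numpy as np


def formant_frequencies(a, fs):
    """Positive-frequency pole angles of H(z) = 1 / (1 - sum_k a_k z^-k), in Hz."""
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    poles = np.roots(poly)
    upper = poles[np.imag(poles) > 0]          # one pole of each conjugate pair
    # F_i = arg(z_i) / (2 * pi * T_s), with T_s = 1 / fs
    return np.sort(np.angle(upper) * fs / (2.0 * np.pi))


# Build a 4th-order predictor from two known conjugate pole pairs.
fs = 16000.0
targets = np.array([500.0, 1500.0])
upper_poles = 0.95 * np.exp(2j * np.pi * targets / fs)
all_poles = np.concatenate((upper_poles, np.conj(upper_poles)))
poly = np.real(np.poly(all_poles))             # coefficients of z^4 ... z^0
a = -poly[1:]                                  # prediction coefficients a_1..a_4

estimated = formant_frequencies(a, fs)
```

A fourth-order model yields two positive-frequency formant estimates, one per conjugate pole pair, matching the $p/2$ count discussed above. Real poles, at 0 Hz or at half the sampling rate, are excluded by the positive-imaginary-part test.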
`
Figure 1.8: Linear predictive analysis and synthesis. (Block diagram: linear predictive speech analysis via the analysis filter, followed by linear predictive speech synthesis via the synthesis filter.)
`
Aside from limitations in numeric precision, the deconvolution of a speech segment into spectral envelope and excitation via LPA is a lossless process. The original speech frame can be reconstructed by filtering the residual, $r(n)$, through the synthesis filter, $H(z)$. Speech synthesis by this technique is known as Residual Excited Linear Prediction (RELP). In applying LPA to speech compression, the prediction coefficients, $a_k$, can be transformed and quantized by various means to obtain a very compact representation of the spectral envelope. Even more coding gain is achieved when advantage is taken of the fact that relatively less information is contained in the spectrally flat residual than in the spectral envelope. Speech of acceptable quality can be synthesized according to Figure 1.8 even when only a fairly rough approximation to the residual is used. Such residual compression and approximation techniques are discussed in Section 1.6.2.
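The lossless analysis/synthesis round trip can be demonstrated with a short sketch. It assumes zero signal history before the frame and uses arbitrary stable coefficients together with a random test frame; both are hypothetical stand-ins for coefficients estimated from real speech, and the function names are illustrative.

```python
import numpy as np


def analysis_filter(frame, a):
    """Residual r[n] = s[n] - sum_k a_k s[n-k], assuming zero history."""
    p = len(a)
    padded = np.concatenate((np.zeros(p), frame))
    res = np.empty(len(frame))
    for n in range(len(frame)):
        past = padded[n:n + p][::-1]           # s[n-1], ..., s[n-p]
        res[n] = frame[n] - np.dot(a, past)
    return res


def synthesis_filter(res, a):
    """Reconstruct s[n] = r[n] + sum_k a_k s[n-k], assuming zero history."""
    p = len(a)
    out = np.zeros(len(res) + p)               # first p entries hold the zero history
    for n in range(len(res)):
        past = out[n:n + p][::-1]              # previously synthesized samples
        out[n + p] = res[n] + np.dot(a, past)
    return out[p:]


rng = np.random.default_rng(1)
frame = rng.standard_normal(320)               # stand-in for a 20 ms frame at 16 kHz
a = np.array([1.2, -0.72])                     # arbitrary stable 2nd-order predictor

residual = analysis_filter(frame, a)
rebuilt = synthesis_filter(residual, a)
```

The reconstruction matches the input up to floating-point rounding for any choice of coefficients, since the synthesis recursion exactly inverts the analysis filter. Coding gain comes from replacing `residual` with a cheaper approximation, at which point the round trip is no longer exact.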
`
1.5 Speech Quality: Factors and Measures

Although the primary goal of this project was to produce a significant enhancement of speech quality, it is a very difficult property to quantify. Speech quality is inherently subjective, and is dependent on incompletely understood aspects of speech perception. Nonetheless, some formal subjective and objective measures of speech quality were selected for the design and evaluation of the PWB speech algorithm. In this section, some basic speech quality perception factors are presented along with the selected measures of speech quality.

1.5.1 Speech Perception
`
`A distortion is only perceptually sign
