`at 4 kb/s
`
`Eddie L. T. Choy
`
`Department of Electrical and Computer Engineering
`
`McGill University
`
Montréal, Canada
`
`August 1998
`
A thesis submitted to the Faculty of Graduate Studies and Research in partial
fulfillment of the requirements for the degree of Master of Engineering.

© 1998 Eddie L. T. Choy
`
`
`
`
`Abstract
`
Speech coding at bit rates near 4 kbps is expected to be widely deployed in applications such as visual telephony, mobile and personal communications. This research focuses on developing a speech coder based on the waveform interpolation (WI) scheme, with the aim of delivering near toll-quality speech at rates around 4 kbps. A WI coder has been simulated in floating-point using the C programming language. The high performance of the WI model has been confirmed by subjective listening tests in which the unquantized coder outperforms the 32 kbps G.726 standard (ADPCM) 98% of the time under clean input speech conditions; the reconstructed speech is perceived to be essentially indistinguishable from the original. When fully quantized, the speech quality of the WI coder at 4.25 kbps has been judged to be equivalent to or better than that of G.729 (the ITU-T toll-quality 8 kbps standard) for 45% of the test sentences. Further refinements of the quantization techniques are warranted to bring the coder closer to the toll-quality benchmark. Nevertheless, the existing implementation has produced good-quality coded speech with a high degree of intelligibility and naturalness compared to conventional coding schemes operating in the neighbourhood of 4 kbps.
`
`
`
`
Sommaire

In the near future, speech coding at rates around 4 kbps is expected to be widely used in applications such as visual telephony and personal and mobile communications. The goal of this research is to develop a speech coder based on waveform interpolation (WI), with the objective of a faithful reconstruction of speech at bit rates as low as 4 kbps. A coder based on the WI model has been simulated in floating-point arithmetic using the C language. The high performance of the model has been confirmed by listening tests in which the speech quality of the unquantized coder is better than that of the 32 kbps G.726 standard (ADPCM) in 98% of the cases when the input speech is free of noise. One can conclude that the synthesized speech is perceived as essentially indistinguishable from the original. When the coder parameters are fully quantized, the speech quality of the WI coder at 4.25 kbps has been judged equivalent to or better than that of G.729 (the 8 kbps ITU-T toll-quality standard) for 45% of the test sequences. Further improvements of the quantization techniques are necessary for the coder to achieve a reconstruction even closer to transparency. Nevertheless, the existing implementation has produced good-quality coded speech with a high degree of intelligibility and naturalness compared to other conventional coders operating around 4 kbps.
`
`
`
`
`
`Acknowledgments
`
`I would like to express my sincere thanks to my supervisor, Professor Peter Kabal, for his
`guidance and support throughout my graduate studies at McGill University. Also, I am
`thankful to Dr. Jacek Stachurski for co-implementing the waveform interpolation speech
coder. This research would not have been possible without their technical expertise, critical
`insight and enlightening suggestions.
Moreover, I am grateful to all my fellow graduate students in the Telecommunications
`and Signal Processing Laboratory for their encouragement and companionship. Special
`thanks go to Hossein, Nadim and Khaled who constantly gave me both technical and non-
`technical advice. I am also obliged to Florence who helped me with the French abstract.
`I am thankful to Jianming, Michael, Johnny and Mohammad who participated in the
`listening tests for this research. The postgraduate scholarship awarded by the Natural
`Sciences and Engineering Research Council of Canada is appreciated.
My deepest gratitude goes to my fiancée Jane for her love and understanding, and also
`to our respective families for their continuous support and encouragement in the past two
`years.
`
`
`
`
`Contents
`
1 Introduction
    1.1 Motivation for Speech Coding
    1.2 Propaedeutic of Speech Coding
        1.2.1 Components in a Speech Coder
        1.2.2 Concept of a Frame and a Subframe
        1.2.3 Performance Dimensions
        1.2.4 Quantization
    1.3 Speech Production and Properties
    1.4 Human Auditory Perception
    1.5 Speech Coding Standardizations
    1.6 Objectives and Scope of Our Research
    1.7 Organization of the Thesis

2 Linear Predictive Speech Coding
    2.1 Linear Prediction in Speech Coding
    2.2 Estimation of LP Coefficients
        2.2.1 Autocorrelation Method
        2.2.2 Covariance Method
    2.3 Interpolation of LP Coefficients
    2.4 Bandwidth Expansion
    2.5 Pre-Emphasis

3 Waveform Interpolation
    3.1 Background and Principles of WI Coding
    3.2 Overview of the WI Coder
    3.3 Representation of Characteristic Waveform
    3.4 The Analysis Stage
        3.4.1 LP Analysis
        3.4.2 Pitch Estimation
        3.4.3 Pitch Interpolation
        3.4.4 CW Extraction
        3.4.5 CW Alignment
        3.4.6 CW Power Computation and Normalization
        3.4.7 Output of the Analysis Layer
    3.5 The Synthesis Stage
        3.5.1 CW Power Denormalization and Realignment
        3.5.2 Instantaneous Pitch and CW Generation
        3.5.3 Phase Track Estimation
        3.5.4 2D-to-1D Transformation
        3.5.5 LP Synthesis
    3.6 Performance of the Analysis-Synthesis Layer
        3.6.1 Time Asynchrony
        3.6.2 Subjective Quality Evaluation
        3.6.3 Temporal Envelope Variations
    3.7 Variants of the WI Scheme
        3.7.1 Analysis in Speech + Synthesis in Speech
        3.7.2 Analysis in Residual + Synthesis in Speech
        3.7.3 Other WI Derivatives
    3.8 Importance of Bandwidth Expansion in WI
    3.9 Time-Scale Modification Using WI

4 Quantization of the Coder Parameters
    4.1 LSF Quantization
    4.2 Pitch Quantization (Coding)
    4.3 Power Quantization
        4.3.1 Design of the Lowpass Filter
    4.4 CW Quantization
        4.4.1 SEW-REW Decomposition
        4.4.2 REW Quantization
        4.4.3 SEW Quantization
        4.4.4 CW Reconstruction and Coding Noise Suppression
    4.5 Performance Evaluations
        4.5.1 Subjective Speech Quality
        4.5.2 Algorithmic Delay

5 Concluding Remarks
    5.1 Summary of Our Work
    5.2 Strength of the WI Scheme
    5.3 Future Research Directions

A The Constants in the WI Coder

Bibliography
`
`Ex. 1031 / Page 7 of 118
`
`
`
`List of Figures
`
1.1  A block diagram of a speech transmission/storage system
1.2  Time and frequency representations of a voiced and unvoiced speech segment
2.1  The LP synthesis filter
2.2  The LP analysis filter
3.1  A block diagram of the WI speech coding system
3.2  An example of a characteristic waveform surface
3.3  A block diagram of the WI analysis block (processor 100)
3.4  Interpolation of pitch in the case of pitch doubling
3.5  A pitch-doubling speech segment
3.6  An example of an unconstrained extraction point
3.7  Illustration of an extraction window and its boundary energy windows
3.8  An example of the CWs extracted from a frame of residual signal
3.9  A block diagram of the alignment processor 170
3.10 Aligned CWs for a frame of residual signal
3.11 Time-scaling of a CW
3.12 Illustration of the zero-insertion between spectral samples
3.13 Decomposition of a residual signal into a CW evolving surface
3.14 A block diagram of the WI decoder in the analysis-synthesis layer
3.15 A block diagram of the interpolator processor
3.16 An example of the CW interpolation over a subframe interval
3.17 Comparisons between the two phase track computation methods
3.18 Transformation from a CW surface to a residual signal
3.19 An example of the time envelope variation caused by the WI method
3.20 An alternate WI decoder (synthesis on speech-domain CWs)
3.21 The discrepancy between the linear and the circular convolutions
3.22 Illustration of the pitch pulse disappearance
3.23 Time scale modification of a speech segment using the WI analysis-synthesis layer
4.1  A block diagram of the WI quantizer
4.2  The schematic diagrams for the power's and the CW's quantizers and dequantizers
4.3  The characteristics of the anti-aliasing filter used before the power downsampling process
4.4  The convolution procedure for the lowpass filtering of the power contour
4.5  A SEW and a REW surface
4.6  The characteristics of the lowpass filter used in the SEW-REW decomposition
4.7  The lowpass filtering operation for the SEW-REW decomposition
4.8  Quantization of the SEWs
`
`
`
`
`List of Tables
`
3.1  Paired comparison test results between the WI analysis-synthesis layer and the 32 kbps ADPCM
3.2  The SNR measures between the linear and circular convolution for a 25-second speech segment
4.1  Bit allocation for the 4.25 kbps WI coder
4.2  Paired comparison test results between the 4.25 kbps WI and the 8 kbps G.729
A.1  The constants used in the WI simulation
`
`
`
`
List of Acronyms

ADPCM     Adaptive Differential Pulse-Code Modulation
CDMA      Code Division Multiple Access
CELP      Code-Excited Linear Prediction
CODEC     Encoder and Decoder
CW        Characteristic Waveform
DCVQ      Dimension Conversion Vector Quantization
DoD       Department of Defense (U.S.)
DSP       Digital Signal Processing
DTFS      Discrete-Time Fourier Series
EVRC      Enhanced Variable Rate Codec
FBR       Fixed Bit-Rate
FS        Federal Standard (U.S.)
GLA       Generalized Lloyd Algorithm
IMBE      Improved Multi-Band Excitation
ITU       International Telecommunication Union
ITU-T     ITU Telecommunication Standardization Sector
LD-CELP   Low-Delay Code-Excited Linear Prediction
LP        Linear Prediction
LPC       Linear Predictive Coding
LSF       Line Spectral Frequency
LSP       Line Spectral Pair
MBE       Multi-Band Excitation
MELP      Mixed Excitation Linear Prediction
MIPS      Million Instructions Per Second
MOS       Mean Opinion Score
MSE       Mean Square Error
PCM       Pulse Code Modulation
PWI       Prototype Waveform Interpolation
REW       Rapidly Evolving Waveform
SEW       Slowly Evolving Waveform
SNR       Signal-to-Noise Ratio
V/UV      Voiced/Unvoiced
VBR       Variable Bit-Rate
VDVQ      Variable Dimension Vector Quantization
VQ        Vector Quantization
WI        Waveform Interpolation
`
`
`
`
`1
`
`Chapter 1
`
`Introduction
`
`1.1 Motivation for Speech Coding
`
In modern digital systems, a speech signal is represented in a digital format: a sequence of binary bits. It is often desirable for the signal to be represented by as few bits as possible. For storage applications, lower bit usage means less memory is required. For transmission applications, a lower bit rate means less bandwidth, power and/or memory. It is therefore cost-effective to use an efficient speech compression algorithm in a digital speech storage or transmission system. Speech coding is the technology that offers such compression algorithms.

Although larger bandwidth has become available in wired communications as a result of the rapid development of optical transmission media, there is still a growing need for bandwidth conservation, particularly in wireless and satellite communications. At the same time, with the growing trend of multimedia communications and other speech-related applications such as digital answering machines, the demand for memory conservation in voice storage systems is increasing. These dual requirements will definitely keep speech coding a lively research and development area in the future.

In addition, the emergence of much faster DSP microprocessors gives speech coding researchers even more incentive to develop new and improved speech coding algorithms, algorithms which are allowed more computational effort than ever before. An explosion of research work on speech coding is expected in the coming millennium.
`
`
`
`
`
`1.2 Propaedeutic of Speech Coding
`
`1.2.1 Components in a Speech Coder
`
A speech coder (also known as a speech codec) always consists of an encoder and a decoder. The encoder is the compression function, while the decoder is the decompression function. They usually coexist in typical speech transmission/storage systems. Figure 1.1 illustrates an example of such a system. At the compression stage, the speech encoder takes the original digital speech signal and produces a low-rate bitstream. This bitstream is then transmitted to a receiver or to a storage device. At the decompression stage, the speech decoder tries to undo what the encoder has done and constructs an approximation of the original signal from the compressed bitstream. Thus, the decoder should be structurally an approximate inverse of the encoder.
`
Fig. 1.1 A block diagram of a speech transmission/storage system
`
`1.2.2 Concept of a Frame and a Subframe
`
Speech is a time-varying signal [1]. In order to analyze a speech signal efficiently, a speech coder generally partitions the signal into successive blocks such that the samples within each block can be considered reasonably stationary. These blocks are referred to as frames. Furthermore, some processing steps may require a higher time-resolution and need to be performed over smaller blocks. These smaller blocks are often called subframes.
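To make the partitioning concrete, the following C sketch walks a speech buffer frame by frame and then subframe by subframe. The 20 ms frame, the four 5 ms subframes and the function names are illustrative assumptions only, not the values used by the WI coder described later (its actual constants are listed in Appendix A).

    #include <stddef.h>

    #define FRAME_LEN    160   /* e.g., 20 ms at 8 kHz (illustrative value) */
    #define SUBFRAME_LEN  40   /* four 5 ms subframes per frame (assumption) */

    /* A stand-in for any analysis step needing higher time-resolution;
     * here it simply measures the subframe energy. */
    static float analyze_subframe(const float *x, size_t n)
    {
        float energy = 0.0f;
        for (size_t i = 0; i < n; i++)
            energy += x[i] * x[i];
        return energy;
    }

    /* Partition the signal into frames, and each frame into subframes. */
    static void analyze_signal(const float *speech, size_t num_samples)
    {
        for (size_t f = 0; f + FRAME_LEN <= num_samples; f += FRAME_LEN) {
            const float *frame = speech + f;
            /* frame-level processing (e.g., LP analysis) would go here */
            for (size_t s = 0; s < FRAME_LEN; s += SUBFRAME_LEN)
                analyze_subframe(frame + s, SUBFRAME_LEN);
        }
    }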
`
`
`
`
`
`1.2.3 Performance Dimensions
`
In selecting a speech coder, certain performance aspects must be considered and trade-offs need to be made. Different applications require the coder to be optimized along different dimensions, or for some balance between them. We have chosen eight important dimensions, each briefly described below:

(i) Average bit-rate: This parameter is usually measured in bits per second (bps). The word "average" is used here because some coders operate at a variable rate, as opposed to a fixed rate. Note that the bit-rates mentioned in this thesis do not include any additional bits used for error correction.

(ii) Speech quality: A popular method to evaluate speech quality is the MOS (Mean Opinion Score) scale, which is a subjective measurement. Listeners are asked to rate speech quality on a five-point scale: bad, poor, fair, good and excellent. Because of the wide variation among listeners, the MOS test requires a large amount of speech data and many speakers and listeners to obtain an accurate rating of a speech coder. In North America, a MOS of between 4 and 4.5 generally means toll quality, while synthetic quality falls below 3.5. Objective measurements are also available, such as the signal-to-noise ratio (SNR). Generally, objective measurements are not as lengthy and costly as subjective ones, but they do not fully account for the perceptual properties of the human hearing system.

(iii) Algorithmic delay: As mentioned earlier, most speech coders process samples in blocks, so a time delay often exists between the original and the coded speech. In the speech coding context, this delay is referred to as the algorithmic delay, generally defined as the sum of (i) the length of the currently processed block of speech and (ii) the length of the look-ahead needed to process the samples of the current block (see the worked example after this list). In some applications, such as telephony, there is often a strict limit on the delay; in others, such as voice storage systems, more delay can be tolerated.

(iv) Computational complexity: Speech coding algorithms are usually required to run on a single DSP chip. Memory usage and speed are therefore the two most important contributors to complexity. The former is specified by the size of the RAM used in executing an algorithm. The latter is measured in millions of instructions per second, commonly known as MIPS; MIPS can be measured on either a fixed-point or a floating-point processor. An algorithm of large complexity not only requires a faster chip for real-time implementation, it also results in high power consumption in hardware, which is extremely disadvantageous for portable systems.

(v) Channel-error sensitivity: This parameter measures the speech coder's robustness against channel errors, errors which are often caused by the presence of channel noise, signal fading and intersymbol interference. The channel-error issue has become increasingly important in speech coding as many newly developed speech coders are used in wireless communications. In such systems, the speech coder must be able to give reasonable speech quality with error rates as high as 10%.

(vi) Robustness against acoustic background noise: In real-world applications, we are faced with various types of background acoustic noise such as car, babble, street and office noise. Thus, it is essential that the performance of the speech coding algorithm does not suffer unduly in such adverse environments. The issue of background noise becomes particularly crucial in applications like military and mobile communications. In fact, the 1996 U.S. DoD (Department of Defense) 2.4 kbps vocoder competition required all candidate algorithms to perform well in both quiet and noisy environments [2].

(vii) Encoded speech bandwidth: This is the bandwidth of the speech signal that a coder is intended to encode. Narrowband speech coders are found in typical telephone transmission, which requires a bandwidth from 200 to 3400 Hz. On the other hand, applications of wideband speech coding, with bandwidth ranging from 7 to 20 kHz, include audio transmission, teleconferencing and teleteaching.

(viii) Additional acoustic features: Some speech coders can provide speech compression as well as other speech processing features. Examples of such features are pitch and formant modification, fast/slow playback speed control without affecting the pitch track, etc.
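As a worked example of the delay definition in item (iii), the algorithmic delay is simply

    T_{\text{delay}} = T_{\text{frame}} + T_{\text{look-ahead}}.

For instance, G.729 processes 10 ms frames with a 5 ms look-ahead, giving an algorithmic delay of 15 ms; a hypothetical coder with a 25 ms frame and a 10 ms look-ahead would incur 35 ms.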
`
`
`
`
`
`1.2.4 Quantization
`
`5
`
In theory, a precise digital representation of a single numerical value or a set of values requires an infinite number of bits, which is not achievable. Therefore, a difference between the original value and its digitized version is always present when a signal is digitally transmitted or stored. The goal of quantization is to minimize this difference, which is also known as the quantization noise or quantization error.

There are two basic types of quantization: scalar quantization and vector quantization (VQ). A scalar quantizer maps a single numerical value to the nearest approximating value from a predetermined finite set of allowed values [3]. Vector quantization, on the other hand, operates on a block of values. Rather than quantizing each of the values in the block independently, VQ treats the whole block as a single entity or vector and represents it by a single vector index while minimizing the distortion introduced. In this way, coding efficiency can be greatly enhanced if there is redundant information within the block of values (i.e., the values within the block are correlated).¹

¹Even for uncorrelated samples, VQ may offer some advantages over scalar quantization [3, p. 347].

In the context of VQ, the collection of possible vector representations is referred to as a codebook, and each vector representation in the codebook defines a codeword. Further, the number of codewords in a codebook is referred to as the size of the codebook, and the number of elements in each codeword is called the dimension of the codebook.

Depending on the specific application, there are many distortion measures that can be adopted to evaluate and/or design a quantizer. The most ubiquitous is the Euclidean distance measure. Distance measures that take perceptual relevance into account are also available. They are advantageous to speech coders, particularly when coding vectors of spectral parameters, since the human ear has variable sensitivity to different frequencies and intensities. Human perceptual sensitivity is further described in Section 1.4.

Due to its high coding efficiency, VQ has spurred tremendous research interest. Many different VQ-related algorithms have been developed to create and search codebooks efficiently, algorithms such as gain-shape VQ, split VQ and multistage VQ [4]. Recently, variable-dimension vector quantization (VDVQ) has drawn attention as well. Unlike conventional VQ, VDVQ is capable of handling variable-dimension input vectors, and each input vector can be quantized with a single universal codebook [5].
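To illustrate the two basic quantizer types discussed above, the following C sketch implements a uniform scalar quantizer and a full-search VQ under the squared Euclidean (MSE) distance. The codebook layout and the function and parameter names are our own assumptions for illustration; practical coders use the structured searches (gain-shape, split, multistage, etc.) cited above.

    #include <float.h>

    /* Uniform scalar quantizer: map v to the nearest of 2^bits levels
     * evenly spaced on [lo, hi]; the returned index is what gets
     * stored or transmitted. */
    static int sq_encode(float v, float lo, float hi, int bits)
    {
        int   levels = 1 << bits;
        float step   = (hi - lo) / (float)(levels - 1);
        int   idx    = (int)((v - lo) / step + 0.5f);
        if (idx < 0)          idx = 0;         /* clamp out-of-range inputs */
        if (idx > levels - 1) idx = levels - 1;
        return idx;
    }

    /* Full-search vector quantizer: return the index of the codeword
     * nearest to x under the squared Euclidean distance. 'codebook'
     * holds 'size' codewords of the given 'dim', stored contiguously. */
    static int vq_encode(const float *x, const float *codebook,
                         int size, int dim)
    {
        int   best   = 0;
        float best_d = FLT_MAX;
        for (int i = 0; i < size; i++) {
            const float *cw = codebook + i * dim;
            float d = 0.0f;
            for (int k = 0; k < dim; k++) {
                float e = x[k] - cw[k];
                d += e * e;
            }
            if (d < best_d) { best_d = d; best = i; }
        }
        return best;   /* decoder reconstructs the vector as codebook[best] */
    }

Note how the VQ index addresses a whole block of dim values at once: with a codebook of size 2^b, only b bits are spent on dim samples, which is where the coding-efficiency gain over per-sample scalar quantization comes from when the samples are correlated.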
`
`1.3 Speech Production and Properties
`
Many contemporary speech coders lower their bit-rate consumption by removing predictable, redundant or pre-determined information in human speech. In the search for better speech coding algorithms, it is therefore important to have a good understanding of the production of human speech and the properties of speech signals.

Physiologically, human speech is produced when air is exhaled from the lungs, through the vocal folds and the vocal tract, to the mouth opening. From the signal processing point of view, this speech production mechanism can be modeled as an excitation signal exciting a time-varying filter (the vocal tract), which amplifies or attenuates certain sound frequencies in the excitation. The vocal tract is modeled as a time-varying system because it consists of a combination of the throat, mouth, tongue, lips and nose, which change shape during the generation of speech. The properties of the excitation signal depend highly on the type of speech sound, either voiced or unvoiced. Examples of voiced speech are vowels (/a/, /i/, /o/, /u/), while stops such as /p/ and /k/ are examples of unvoiced sounds.

The excitation for voiced speech is a quasi-periodic signal generated by the periodic abduction and adduction of the vocal folds, where the airflow from the lungs is intercepted. Since the opening between the vocal folds is called the glottis, this excitation is sometimes referred to as a glottal excitation. Generally, the vocal tract filter is considered linear in nature and is therefore not able to alter the periodicity of the glottal excitation. Hence, voiced sounds are quasi-periodic in nature as well.

For unvoiced speech, the vocal folds are widely open. The excitation is formed as air is forced through a narrow constriction at some point in the vocal tract, creating turbulence. Unvoiced speech and its excitation signal both tend to be noise-like and lower in energy compared to the voiced case. Figure 1.2a illustrates an example of both an unvoiced and a voiced speech segment in the time domain.
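The source-filter view above maps directly onto code. The C sketch below drives an all-pole "vocal tract" filter 1/A(z) with either a pulse train (voiced) or white noise (unvoiced). The filter order, pitch period and coefficients are placeholders for illustration; real coders derive them from the input speech (Chapter 2 covers linear-prediction analysis).

    #include <stdlib.h>

    #define LP_ORDER 10   /* order of the all-pole vocal-tract model (typical) */

    /* Generate n samples of a crude source-filter synthesis.
     * a[0..LP_ORDER] are LP coefficients with a[0] = 1, so the filter
     * implements y[i] = exc[i] - sum_{k=1..p} a[k] * y[i-k].
     * 'mem' carries the filter state (previous outputs) across calls. */
    static void source_filter_synth(float *y, int n,
                                    const float a[LP_ORDER + 1],
                                    int voiced, int pitch_period,
                                    float mem[LP_ORDER])
    {
        for (int i = 0; i < n; i++) {
            /* excitation: quasi-periodic pulses for voiced speech,
             * zero-mean noise (turbulence) for unvoiced speech */
            float exc = voiced
                ? ((i % pitch_period == 0) ? 1.0f : 0.0f)
                : ((float)rand() / (float)RAND_MAX - 0.5f);

            float out = exc;
            for (int k = 1; k <= LP_ORDER; k++)
                out -= a[k] * mem[k - 1];

            for (int k = LP_ORDER - 1; k > 0; k--)   /* shift filter state */
                mem[k] = mem[k - 1];
            mem[0] = out;
            y[i] = out;
        }
    }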
In the spectral domain, due to the quasi-periodicity, voiced speech possesses a prominent harmonic line structure, as depicted in Figure 1.2c. The spacing between the harmonics is called the fundamental frequency. The envelope of the spectrum, also known as the formant structure, is characterized by a set of peaks, each of which is called a formant. The formant structure (poles and zeros of the envelope) is primarily attributed to the shape of the vocal tract. Thus, by moving the tongue, jaw or lips, the structure changes correspondingly. Also, the envelope falls off at about -6 dB/octave due to the radiation from the lips and the nature of the glottal excitation [6].
`
Figure 1.2b shows the power spectrum of the unvoiced segment. As opposed to the voiced spectrum, there is relatively less useful spectral information embedded in an unvoiced segment. It does not have any distinctive harmonics and is rather flat, broadband and noise-like.

Fig. 1.2 Time and frequency representations of a voiced and unvoiced speech segment. (a) A speech segment consisting of an unvoiced and a voiced segment in the time domain. (b) The power spectrum for a 32 ms unvoiced segment starting at 50 ms. (c) The power spectrum and the corresponding formant structure for a 32 ms voiced segment starting at 150 ms. Both (b) and (c) are calculated based on a 32 ms Hanning window.
`
`1.4 Human Auditory Perception
`
In order to reach maximal performance in a speech coder, it is also essential to take advantage of the human auditory system, even though it is not yet fully understood. Generally, exploiting the perceptual properties of the ear can lead to significant improvement in the performance of a speech coder. This is particularly true as we pursue lower and lower bit-rate speech coders while avoiding major audible degradation.

One of the well-known properties of the auditory system is auditory masking, which has a strong effect on the perceptibility of one signal in the presence of another [6]. Noise is less likely to be heard at frequencies of strong speech energy (e.g., formants) and more likely to be heard at frequencies of low speech energy (e.g., spectral valleys). Spectral masking is a popular technique that takes advantage of this perceptual limitation by concentrating most of the noise (resulting from compression) in high-energy spectral regions where it is least audible.
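One common embodiment of this idea, used in CELP-type coders generally rather than anything specific to this thesis, is to minimize the coding error through a perceptual weighting filter derived from the LP analysis filter A(z) (introduced in Chapter 2):

    W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \qquad 0 < \gamma_2 < \gamma_1 \le 1.

Since |W(e^{j\omega})| is small near the formant peaks and large in the spectral valleys, minimizing the weighted error lets more of the quantization noise sit under the formants, exactly where masking renders it least audible.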
It is reported that humans perceive voiced and unvoiced sounds differently. For voiced signals, the correct degree of periodicity and the temporal continuity in voiced segments [7, 8, 9] are of great importance to human perception (although excessive periodicity would lead to reverberation and buzziness). In the spectral domain, the amplitudes and locations of the first three formants (usually below 3 kHz) and the spacing between the harmonics are important [10].

For unvoiced signals, it has been shown in [11] that unvoiced speech segments can be replaced by a noise-like signal with a similar spectral envelope without a drop in the perceived quality of the speech signal.

In both the voiced and unvoiced cases, the time envelope of the speech signal contributes to intelligibility and naturalness [12, 13].
`
`
`
`
`
`1.5 Speech Coding Standardizations
`
The standardization of high-quality, low-bit-rate narrowband² speech coding has been intensifying since the beginning of this decade. In 1994, the International Telecommunication Union (ITU) adopted the LD-CELP (Low-Delay Code-Excited Linear Prediction) algorithm [14] for the toll-quality coding of speech at 16 kbps, known as ITU G.728. Shortly after this standard was adopted, another CELP-based speech coder running at 8 kbps was developed by the University of Sherbrooke [15]. It was toll-quality as well and had performance comparable to that of 16 kbps LD-CELP. In 1996, it finally became part of the ITU standards and is known as G.729. In the same year, the U.S. Department of Defense (DoD) was standardizing a new 2.4 kbps vocoder