Waveform Interpolation Speech Coder at 4 kb/s

Eddie L. T. Choy

Department of Electrical and Computer Engineering
McGill University
Montréal, Canada

August 1998

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Master of Engineering.

© 1998 Eddie L. T. Choy
Abstract

Speech coding at bit rates near 4 kbps is expected to be widely deployed in applications such as visual telephony, mobile and personal communications. This research focuses on developing a speech coder based on the waveform interpolation (WI) scheme, with an attempt to deliver near toll-quality speech at rates around 4 kbps. A WI coder has been simulated in floating-point using the C programming language. The high performance of the WI model has been confirmed by subjective listening tests in which the unquantized coder outperforms the 32 kbps G.726 standard (ADPCM) 98% of the time under clean input speech conditions; the reconstructed speech is perceived to be essentially indistinguishable from the original. When fully quantized, the speech quality of the WI coder at 4.25 kbps has been judged to be equivalent to or better than that of G.729 (the ITU-T toll-quality 8 kbps standard) for 45% of the test sentences. Further refinements of the quantization techniques are warranted to bring the coder closer to the toll-quality benchmark. Yet, the existing implementation has produced good quality coded speech with a high degree of intelligibility and naturalness when compared to conventional coding schemes operating in the neighbourhood of 4 kbps.
Sommaire

In the near future, speech coding at rates around 4 kbps is expected to be widely used in applications such as visual telephony and personal and mobile communications. The goal of this research is to develop a speech coder based on waveform interpolation (WI), with the objective of a faithful reconstruction of speech at rates as low as 4 kbps. A coder based on the WI model has been simulated in floating-point arithmetic using the C language. The high performance of the model has been confirmed by listening tests in which the speech quality of the unquantized coder is better than the 32 kbps G.726 standard (ADPCM) in 98% of the cases when the input speech is noise-free; the synthesized speech is perceived as essentially indistinguishable from the original. When the coder parameters are fully quantized, the speech quality of the WI coder at 4.25 kbps has been judged equivalent to or better than that of G.729 (the ITU-T toll-quality 8 kbps standard) for 45% of the test sequences. Further refinements of the quantization techniques are needed to bring the coder even closer to faithful reconstruction. Nevertheless, the existing implementation has produced good quality coded speech with a high degree of intelligibility and naturalness compared with other conventional coders operating around 4 kbps.
Acknowledgments

I would like to express my sincere thanks to my supervisor, Professor Peter Kabal, for his guidance and support throughout my graduate studies at McGill University. I am also thankful to Dr. Jacek Stachurski for co-implementing the waveform interpolation speech coder. This research could not have been possible without their technical expertise, critical insight and enlightening suggestions.

Moreover, I thank all my fellow graduate students in the Telecommunications and Signal Processing Laboratory for their encouragement and companionship. Special thanks go to Hossein, Nadim and Khaled, who constantly gave me both technical and non-technical advice. I am also obliged to Florence, who helped me with the French abstract. I am thankful to Jianming, Michael, Johnny and Mohammad, who participated in the listening tests for this research. The postgraduate scholarship awarded by the Natural Sciences and Engineering Research Council of Canada is appreciated.

My deepest gratitude goes to my fiancée Jane for her love and understanding, and also to our respective families for their continuous support and encouragement over the past two years.
Contents

1 Introduction
  1.1 Motivation for Speech Coding
  1.2 Propaedeutic of Speech Coding
    1.2.1 Components in a Speech Coder
    1.2.2 Concept of a Frame and a Subframe
    1.2.3 Performance Dimensions
    1.2.4 Quantization
  1.3 Speech Production and Properties
  1.4 Human Auditory Perception
  1.5 Speech Coding Standardizations
  1.6 Objectives and Scope of Our Research
  1.7 Organization of the Thesis

2 Linear Predictive Speech Coding
  2.1 Linear Prediction in Speech Coding
  2.2 Estimation of LP coefficients
    2.2.1 Autocorrelation Method
    2.2.2 Covariance Method
  2.3 Interpolation of LP coefficients
  2.4 Bandwidth Expansion
  2.5 Pre-Emphasis

3 Waveform Interpolation
  3.1 Background and Principles of WI Coding
  3.2 Overview of the WI Coder
  3.3 Representation of Characteristic Waveform
  3.4 The Analysis Stage
    3.4.1 LP Analysis
    3.4.2 Pitch Estimation
    3.4.3 Pitch Interpolation
    3.4.4 CW Extraction
    3.4.5 CW Alignment
    3.4.6 CW Power Computation and Normalization
    3.4.7 Output of the Analysis Layer
  3.5 The Synthesis Stage
    3.5.1 CW Power Denormalization and Realignment
    3.5.2 Instantaneous Pitch and CW Generation
    3.5.3 Phase Track Estimation
    3.5.4 2D-to-1D Transformation
    3.5.5 LP Synthesis
  3.6 Performance of the Analysis-Synthesis Layer
    3.6.1 Time Asynchrony
    3.6.2 Subjective Quality Evaluation
    3.6.3 Temporal Envelope Variations
  3.7 Variants of the WI Scheme
    3.7.1 Analysis in Speech + Synthesis in Speech
    3.7.2 Analysis in Residual + Synthesis in Speech
    3.7.3 Other WI Derivatives
  3.8 Importance of Bandwidth Expansion in WI
  3.9 Time-Scale Modification Using WI

4 Quantization of the Coder Parameters
  4.1 LSF Quantization
  4.2 Pitch Quantization (Coding)
  4.3 Power Quantization
    4.3.1 Design of the Lowpass Filter
  4.4 CW Quantization
    4.4.1 SEW-REW Decomposition
    4.4.2 REW Quantization
    4.4.3 SEW Quantization
    4.4.4 CW Reconstruction and Coding Noise Suppression
  4.5 Performance Evaluations
    4.5.1 Subjective Speech Quality
    4.5.2 Algorithmic Delay

5 Concluding Remarks
  5.1 Summary of Our Work
  5.2 Strength of the WI Scheme
  5.3 Future Research Directions

A The Constants in the WI Coder

Bibliography
List of Figures

1.1 A block diagram of a speech transmission/storage system
1.2 Time and frequency representations of a voiced and unvoiced speech segment
2.1 The LP synthesis filter
2.2 The LP analysis filter
3.1 A block diagram of the WI speech coding system
3.2 An example of a characteristic waveform surface
3.3 A block diagram of the WI analysis block (processor 100)
3.4 Interpolation of pitch in the case of pitch doubling
3.5 A pitch-doubling speech segment
3.6 An example of an unconstrained extraction point
3.7 Illustration of an extraction window and its boundary energy windows
3.8 An example of the CWs extracted from a frame of residual signal
3.9 A block diagram of the alignment processor 170
3.10 Aligned CWs for a frame of residual signal
3.11 Time-scaling of a CW
3.12 Illustration of the zero-insertion between spectral samples
3.13 Decomposition of a residual signal into a CW evolving surface
3.14 A block diagram of the WI decoder in the analysis-synthesis layer
3.15 A block diagram of the interpolator processor
3.16 An example of the CW interpolation over a subframe interval
3.17 Comparisons between the two phase track computation methods
3.18 Transformation from a CW surface to a residual signal
3.19 An example of the time envelope variation caused by the WI method
3.20 An alternate WI decoder (synthesis on speech-domain CWs)
3.21 The discrepancy between the linear and the circular convolutions
3.22 Illustration of the pitch pulse disappearance
3.23 Time scale modification of a speech segment using the WI analysis-synthesis layer
4.1 A block diagram of the WI quantizer
4.2 The schematic diagrams for the power's and the CW's quantizers and dequantizers
4.3 The characteristics of the anti-aliasing filter used before the power downsampling process
4.4 The convolution procedure for the lowpass filtering of the power contour
4.5 A SEW and a REW surface
4.6 The characteristics of the lowpass filter used in the SEW-REW decomposition
4.7 The lowpass filtering operation for the SEW-REW decomposition
4.8 Quantization of the SEWs
List of Tables

3.1 Paired comparison test results between the WI analysis-synthesis layer and the 32 kbps ADPCM
3.2 The SNR measures between the linear and circular convolution for a 25-second speech segment
4.1 Bit allocation for the 4.25 kbps WI coder
4.2 Paired comparison test results between the 4.25 kbps WI and the 8 kbps G.729
A.1 The constants used in the WI simulation
List of Acronyms

ADPCM    Adaptive Differential Pulse-Code Modulation
CDMA     Code Division Multiple Access
CELP     Code-Excited Linear Prediction
CODEC    Encoder and Decoder
CW       Characteristic Waveform
DCVQ     Dimension Conversion Vector Quantization
DoD      Department of Defense (U.S.)
DSP      Digital Signal Processing
DTFS     Discrete-Time Fourier Series
EVRC     Enhanced Variable Rate Codec
FBR      Fixed Bit-Rate
FS       Federal Standard (U.S.)
GLA      Generalized Lloyd Algorithm
IMBE     Improved Multi-Band Excitation
ITU      International Telecommunication Union
ITU-T    ITU Telecommunication Standardization Sector
LD-CELP  Low-Delay Code Excited Linear Prediction
LP       Linear Prediction
LPC      Linear Predictive Coding
LSF      Line Spectral Frequency
LSP      Line Spectral Pair
MBE      Multi-Band Excitation
MELP     Mixed Excitation Linear Prediction
MIPS     Million Instructions Per Second
MOS      Mean Opinion Score
MSE      Mean Square Error
PCM      Pulse Code Modulation
PWI      Prototype Waveform Interpolation
REW      Rapidly Evolving Waveform
SEW      Slowly Evolving Waveform
SNR      Signal-to-Noise Ratio
V/UV     Voiced/Unvoiced
VBR      Variable Bit-Rate
VDVQ     Variable Dimension Vector Quantization
VQ       Vector Quantization
WI       Waveform Interpolation
Chapter 1

Introduction

1.1 Motivation for Speech Coding

In modern digital systems, a speech signal is represented in a digital format: a sequence of binary bits. It is often desirable for the signal to be represented by as few bits as possible. For storage applications, lower bit usage means less memory is required. For transmission applications, a lower bit rate means less bandwidth, power and/or memory. It is therefore cost-effective to use an efficient speech compression algorithm in a digital speech storage or transmission system. Speech coding is the technology that offers such compression algorithms.

Although larger bandwidth has become available in wired communications as a result of the rapid development of optical transmission media, there is still a growing need for bandwidth conservation, particularly in wireless and satellite communications. At the same time, with the growing trend of multimedia communications and other speech-related applications such as digital answering machines, the demand for memory conservation in voice storage systems is increasing. These dual requirements will keep speech coding a lively research and development area for the foreseeable future.

In addition, the emergence of much faster DSP microprocessors gives speech coding researchers further incentive to develop new and improved coding algorithms, algorithms which can afford more computational effort than ever before. An explosion of research work on speech coding is expected in the coming millennium.
1.2 Propaedeutic of Speech Coding

1.2.1 Components in a Speech Coder

A speech coder (also known as a speech codec) always consists of an encoder and a decoder. The encoder is the compression function while the decoder is the decompression function. They usually coexist in typical speech transmission/storage systems. Figure 1.1 illustrates an example of such a system. At the compression stage, the speech encoder takes the original digital speech signal and produces a low-rate bitstream. This bitstream is then transmitted to a receiver or sent to a storage device. At the decompression stage, the speech decoder tries to undo what the encoder has done and constructs an approximation of the original signal from the compressed bitstream. Thus, the decoder should be structurally an approximate inverse of the encoder.
Fig. 1.1 A block diagram of a speech transmission/storage system. The original speech passes through an A/D converter into the speech encoder; the resulting bitstream is sent over transmission channels, or recorded/stored to disk and later played back/retrieved; the speech decoder and a D/A converter then produce the reconstructed speech.
1.2.2 Concept of a Frame and a Subframe

Speech is a time-varying signal [1]. In order to analyze a speech signal efficiently, a speech coder generally partitions the signal into successive blocks such that the samples within each block can be considered reasonably stationary. These blocks are referred to as frames. Furthermore, some processing steps may require a higher time resolution and need to be performed over smaller blocks. These smaller blocks are often called subframes.
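As an illustration only (the frame and subframe lengths below are placeholder values, not those of the WI coder described later, and the two analysis callbacks are hypothetical stand-ins for a coder's actual routines), a frame/subframe buffering loop might look as follows in C:

    #include <stddef.h>

    #define FRAME_LEN    200   /* e.g., 25 ms at 8 kHz sampling; illustrative only */
    #define SUBFRAME_LEN 50    /* four subframes per frame; illustrative only */

    /* Walk a speech signal frame by frame; within each frame, run the
     * higher time-resolution processing once per subframe. */
    void process_signal(const float *speech, size_t n_samples,
                        void (*analyze_frame)(const float *, size_t),
                        void (*analyze_subframe)(const float *, size_t))
    {
        for (size_t f = 0; f + FRAME_LEN <= n_samples; f += FRAME_LEN) {
            const float *frame = speech + f;
            analyze_frame(frame, FRAME_LEN);               /* per-frame analysis */
            for (size_t s = 0; s < FRAME_LEN; s += SUBFRAME_LEN)
                analyze_subframe(frame + s, SUBFRAME_LEN); /* per-subframe steps */
        }
    }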
1.2.3 Performance Dimensions

In selecting a speech coder, certain performance aspects must be considered and trade-offs need to be made. Different applications require the coder to be optimized for different dimensions or some balance between the dimensions. We have chosen eight important dimensions, each of which is briefly described as follows:

(i) Average bit-rate: This parameter is usually measured in bits per second (bps). The word 'average' is used here because some coders operate at a variable rate, as opposed to a fixed rate. Note that the bit-rates mentioned in this thesis do not include any additional bit-rate used for error correction.

(ii) Speech quality: A popular method to evaluate speech quality is the MOS scale (Mean Opinion Score), which is a subjective measurement. Listeners are asked to evaluate speech quality on a five-point scale: bad, poor, fair, good and excellent. Because of the wide variation among listeners, the MOS test requires a large amount of speech data and many speakers and listeners to obtain an accurate rating of a speech coder. In North America, a MOS between 4 and 4.5 generally means toll quality, while synthetic quality falls below 3.5. There are also objective measurements available, such as the signal-to-noise ratio (SNR); a minimal SNR computation is sketched after this list. Generally, the objective measurements are not as lengthy and costly as the subjective ones, but they do not fully account for the perceptual properties of the human hearing system.

(iii) Algorithmic delay: As mentioned earlier, most speech coders tend to process samples in blocks, so a time delay often exists between the original and the coded speech. In the speech coding context, this time delay is referred to as the algorithmic delay, which is generally defined as the sum of (i) the length of the currently processed block of speech and (ii) the length of the look-ahead needed to process the samples of the current block. For example, a coder that processes 20 ms frames with a 5 ms look-ahead has an algorithmic delay of 25 ms. In some applications like telephony, there is often a strict limitation on the time delay. In others, like voice storage systems, more delay can be tolerated.

(iv) Computational complexity: Speech coding algorithms are usually required to run on a single DSP chip. Memory usage and speed are therefore the two most important contributors to complexity. The former is specified by the size of the RAM used in executing an algorithm. The latter is measured in millions of instructions per second, commonly known as MIPS. MIPS can be measured on either a fixed-point or a floating-point processor. An algorithm of large complexity not only requires a faster chip for real-time implementation, it also results in high power consumption in hardware, which is extremely disadvantageous for portable systems.

(v) Channel-error sensitivity: This parameter measures the speech coder's robustness against channel errors, errors which are often caused by the presence of channel noise, signal fading and intersymbol interference. The channel-error issue has become increasingly important in speech coding as many newly developed speech coders are used in wireless communications. In such systems, the speech coder must be able to give reasonable speech quality with error rates as high as 10%.

(vi) Robustness against acoustic background noise: In real-world applications, we are faced with various types of background acoustic noise such as car, babble, street and office noise. Thus, it is essential that the performance of the speech coding algorithm does not suffer unduly in such adverse environments. The issue of background noise becomes particularly crucial in applications like military and mobile communications. In fact, the 1996 U.S. DoD (Department of Defense) 2.4 kbps vocoder competition required all speech coder algorithms to have good performance in both quiet and noisy environments [2].

(vii) Encoded speech bandwidth: This is the bandwidth of the speech signal that a coder is intended to encode. Narrowband speech coders are found in typical telephone transmission, which requires a bandwidth from 200 to 3400 Hz. On the other hand, applications of wideband speech coding, with bandwidths ranging from 7 to 20 kHz, include audio transmission, teleconferencing and teleteaching.

(viii) Additional acoustic features: Some speech coders have the ability to provide speech compression as well as other speech processing features. Examples of such features are pitch and formant modification, fast/slow voice playback speed control without affecting the pitch track, etc.
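As a concrete illustration of the objective measures mentioned in item (ii), the following sketch computes a conventional SNR in dB between an original signal and its coded version; it assumes the two signals are time-aligned and of equal length, which a real evaluation must arrange for.

    #include <math.h>
    #include <stddef.h>

    /* Signal-to-noise ratio in dB between an original signal and its
     * coded/reconstructed version; both buffers hold n time-aligned
     * samples. Returns 10*log10(signal power / noise power). */
    double snr_db(const float *orig, const float *coded, size_t n)
    {
        double sig = 0.0, noise = 0.0;
        for (size_t i = 0; i < n; i++) {
            double e = (double)orig[i] - (double)coded[i];
            sig   += (double)orig[i] * (double)orig[i];
            noise += e * e;
        }
        if (noise == 0.0)        /* identical signals: SNR is unbounded */
            return HUGE_VAL;
        return 10.0 * log10(sig / noise);
    }

As noted above, such a waveform-matching measure is cheap to compute but does not capture the perceptual masking effects discussed in Section 1.4.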
1.2.4 Quantization
In theory, a precise digital representation of a single numerical value or a set of numerical values requires an infinite number of bits, which is not an achievable goal. Therefore, a difference between the original value and its digitized version is always present when a signal is digitally transmitted or stored. The goal of quantization is to minimize this difference, which is also known as the quantization noise or quantization error.

There are two basic types of quantization: scalar quantization and vector quantization (VQ). A scalar quantizer maps a single numerical value to the nearest approximating value from a predetermined finite set of allowed values [3]. Vector quantization, on the other hand, operates on a block of values. Rather than quantizing each of the values in the block independently, VQ treats the whole block as a single entity or vector and represents it with a single vector index, while at the same time minimizing the distortion introduced. In this way, coding efficiency can be greatly enhanced if there is redundant information within the block of values (the values within the block are correlated).¹

In the context of VQ, the collection of possible vector representations is referred to as a codebook. Each of these vector representations in a codebook defines a codeword. Further, the number of codewords in a codebook is referred to as the size of the codebook, and the number of elements in each codeword is called the dimension of the codebook.

Depending on the specific application, there are many distortion measures that can be adopted to evaluate and/or design a quantizer. The most ubiquitous one is the Euclidean distance measure. Distance measures that take perceptual relevance into account are also available. They are advantageous to speech coders, particularly when coding vectors of spectral parameters, since the human ear has variable sensitivity to different frequencies and intensities. The details of human perceptual sensitivity are further described in Section 1.4.
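To make the codebook search concrete, the following is a minimal sketch of a full-search VQ encoder under the squared Euclidean distortion; the codebook layout (one codeword per row of a flat array) is a simplifying assumption, not the structure of any quantizer used later in this thesis.

    #include <float.h>
    #include <stddef.h>

    /* Full-search vector quantization: return the index of the codeword
     * closest to x in squared Euclidean distance. The codebook holds
     * `size` codewords of `dim` elements each, stored row by row; the
     * decoder recovers the approximation by looking up codebook[best]. */
    size_t vq_encode(const float *x, const float *codebook,
                     size_t size, size_t dim)
    {
        size_t best = 0;
        double best_d = DBL_MAX;
        for (size_t i = 0; i < size; i++) {
            const float *cw = codebook + i * dim;
            double d = 0.0;
            for (size_t k = 0; k < dim; k++) {
                double e = (double)x[k] - (double)cw[k];
                d += e * e;
            }
            if (d < best_d) {
                best_d = d;
                best = i;
            }
        }
        return best;   /* only this index needs to be transmitted or stored */
    }

Only about log2(size) bits per vector are needed to transmit the chosen index, which is the source of the coding efficiency discussed above; a perceptually weighted distortion measure would simply replace the squared-error computation in the inner loop.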
Due to its high coding efficiency, VQ has spurred tremendous research interest. Many different VQ-related algorithms have been developed to create and search codebooks efficiently, algorithms such as gain-shape VQ, split VQ and multistage VQ [4]. Recently, variable-dimension vector quantization (VDVQ) has drawn attention as well. Unlike conventional VQ, VDVQ is capable of handling variable-dimension input vectors, and each input vector can be quantized with a single universal codebook [5].

¹Even for uncorrelated samples, VQ may offer some advantages over scalar quantization [3, p. 347].
1.3 Speech Production and Properties

Many contemporary speech coders lower their bit-rate consumption by removing predictable, redundant or pre-determined information in human speech. In the search for better speech coding algorithms, it is therefore important to have a good understanding of the production of human speech and the properties of speech signals.

Physiologically, human speech is produced when air is exhaled from the lungs, through the vocal folds and the vocal tract to the mouth opening. From the signal processing point of view, this speech production mechanism can be modeled as an excitation signal exciting a time-varying filter (the vocal tract), which amplifies or attenuates certain sound frequencies in the excitation. The vocal tract is modeled as a time-varying system because it consists of a combination of the throat, the mouth, the tongue, the lips and the nose, which change shape during the generation of speech. The properties of the excitation signal depend strongly on the type of speech sound, either voiced or unvoiced. Examples of voiced speech are vowels (/a/, /i/, /o/, /u/), while plosives such as /p/ and /k/ are examples of unvoiced sounds.

The excitation for voiced speech is a quasi-periodic signal generated by the periodic abduction and adduction of the vocal folds, where the airflow from the lungs is intercepted. Since the opening between the vocal folds is called the glottis, this excitation is sometimes referred to as a glottal excitation. Generally, the vocal tract filter is considered linear in nature and is therefore unable to alter the periodicity of the glottal excitation. Hence, voiced sounds are quasi-periodic in nature as well.

For unvoiced speech, the vocal folds are widely open. The excitation is formed as the air is forced through a narrow constriction at some point in the vocal tract, creating a turbulence. Unvoiced speech and its excitation signal both tend to be noise-like and lower in energy compared to the voiced case. Figure 1.2a illustrates an example of both an unvoiced and a voiced speech segment in the time domain.
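As a rough sketch of this source-filter view (the pitch lag, filter order and excitation shapes below are illustrative simplifications, not the models used by the WI coder), a voiced or unvoiced excitation can be generated and passed through an all-pole vocal-tract filter as follows:

    #include <stdlib.h>
    #include <stddef.h>

    #define ORDER 10   /* order of the all-pole vocal-tract model; illustrative */

    /* One block of excitation: a periodic impulse train for voiced speech
     * (period = pitch lag in samples), white noise for unvoiced speech. */
    void make_excitation(float *exc, size_t n, int voiced, size_t pitch_lag)
    {
        for (size_t i = 0; i < n; i++) {
            if (voiced)
                exc[i] = (i % pitch_lag == 0) ? 1.0f : 0.0f;
            else
                exc[i] = (float)rand() / RAND_MAX - 0.5f;  /* crude noise source */
        }
    }

    /* All-pole synthesis filter: y[i] = exc[i] + sum_k a[k]*y[i-1-k].
     * `a` holds ORDER predictor coefficients; `mem` carries the previous
     * outputs across blocks and must start zeroed. */
    void synthesis_filter(const float *exc, float *y, size_t n,
                          const float a[ORDER], float mem[ORDER])
    {
        for (size_t i = 0; i < n; i++) {
            float acc = exc[i];
            for (int k = 0; k < ORDER; k++)
                acc += a[k] * mem[k];
            for (int k = ORDER - 1; k > 0; k--)   /* shift filter state */
                mem[k] = mem[k - 1];
            mem[0] = acc;
            y[i] = acc;
        }
    }

The linearity of such a filter is what preserves the quasi-periodicity of the glottal excitation in the synthesized voiced speech; Chapter 2 describes how the filter coefficients are obtained by linear prediction.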
In the spectral domain, due to the quasi-periodicity, voiced speech possesses a prominent harmonic line structure, as depicted in Figure 1.2c. The spacing between the harmonics is called the fundamental frequency. The envelope of the spectrum, also known as the formant structure, is characterized by a set of peaks, each of which is called a formant. The formant structure (the poles and zeros of the envelope) is primarily attributed to the shape of the vocal tract. Thus, by moving the tongue, jaw or lips, the structure changes correspondingly. Also, the envelope falls off at about -6 dB/octave due to the radiation from the lips and the nature of the glottal excitation [6].
Fig. 1.2 Time and frequency representations of a voiced and an unvoiced speech segment. (a) A speech segment consisting of an unvoiced and a voiced portion in the time domain. (b) The power spectrum of a 32 ms unvoiced segment starting at 50 ms. (c) The power spectrum and the corresponding formant structure of a 32 ms voiced segment starting at 150 ms. Both (b) and (c) are calculated using a 32 ms Hanning window.

Figure 1.2b shows the power spectrum of the unvoiced segment. As opposed to the voiced spectrum, there is relatively less useful spectral information embedded in an unvoiced segment. It does not have any distinctive harmonics, and it is rather flat, broadband and noise-like.
1.4 Human Auditory Perception

In order to reach maximal performance in a speech coder, it is also essential to take advantage of the human auditory system, even though it is not yet fully understood. Generally, exploiting the perceptual properties of the ear can lead to significant improvement in the performance of a speech coder. This is particularly true as we pursue lower and lower bit-rate speech coders while avoiding major audible degradation.

One of the well-known properties of the auditory system is auditory masking, which has a strong effect on the perceptibility of one signal in the presence of another [6]. Noise is less likely to be heard at frequencies of strong speech energy (e.g., formants) and more likely to be heard at frequencies of low speech energy (e.g., spectral valleys). Spectral masking is a popular technique that takes advantage of this perceptual limitation by concentrating most of the noise resulting from compression in high-energy spectral regions where it is least audible.

It is reported that humans perceive voiced and unvoiced sounds differently. For voiced signals, the correct degree of periodicity and the temporal continuity in voiced segments [7, 8, 9] are of great importance to human perception (although excessive periodicity would lead to reverberation and buzziness). In the spectral domain, the amplitudes and locations of the first three formants (usually below 3 kHz) and the spacing between the harmonics are important [10].

For unvoiced signals, it has been shown in [11] that unvoiced speech segments can be replaced by a noise-like signal with a similar spectral envelope without a drop in the perceived quality of the speech signal.

In both the voiced and unvoiced cases, the time envelope of the speech signal contributes to intelligibility and naturalness [12, 13].
1.5 Speech Coding Standardizations

The standardization of high-quality low-bit-rate narrowband speech coding has been intensifying since the beginning of this decade. In 1994, the International Telecommunication Union (ITU) adopted the LD-CELP (Low-Delay Code-Excited Linear Predictive) algorithm [14] for the toll-quality coding of speech at 16 kbps, known as ITU G.728. Shortly after this standard was adopted, another CELP-based speech coder running at 8 kbps was developed by the University of Sherbrooke [15]. It was toll-quality as well and had performance comparable to that of the 16 kbps LD-CELP. In 1996, it finally became part of the ITU standards and was known as G.729. In the same year, the U.S. Department of Defense (DoD) was standardizing a new 2.4 kbps vocoder
