Prolog to
Speech Coding: A Tutorial Review

A tutorial introduction to the paper by Spanias
If the subtle sounds of human speech are to travel the information highways of the future, digitized speech will have to be more efficiently transmitted and stored. Designers of cellular communications systems, wireless personal computer networks, and multimedia systems are all searching for improved techniques for handling speech.

Since its awkward beginnings in the 1930's, speech coding has developed to become an essential feature of everyday telephone system operations. Speech coding is now finding applications in cellular communications, computer systems, automation, military communications, biomedical systems, and almost everywhere that digital communication takes hold.
Speech coding involves sampling and amplitude quantization of the speech signal. The aim is to use a minimum number of bits, while preserving the quality of the reconstructed speech at the receiving end. Coding research is now taking aim at low-rate (8 to 2.4 kbits/s) and very-low-rate (below 2.4 kbits/s) techniques.
The entire gamut of speech coding research is covered in this paper. An extensive list of references gives the reader access to the speech coding literature. The paper has tutorial information to orient applications engineers, and it nicely summarizes coder developments for research experts.
The meaning of the words we speak often changes with the smallest inflection of our voices, so better speech quality is an essential goal for coding research. The paper lays out the quality levels of reconstructed speech, ranging from the highest quality broadcast, wide-band speech produced by coders at 64 kbits/s, to the lowest quality, synthetic speech, currently produced by coders that operate well below 4 kbits/s. A section on speech quality points out that subjective testing can be lengthy and costly. Speech quality has been gauged by objective measures, beginning with the signal-to-noise ratio, but these measures do not account for human perception.
The bulk of this paper is devoted to explaining and reviewing a wide variety of speech coders. First of all, the paper discusses waveform coders. Waveform coders, as opposed to vocoders, compress speech waveforms without making use of the underlying speech models.
Scalar quantization techniques include familiar classical methods such as pulse-code modulation (PCM), differential PCM, and delta modulation.
Vector quantization techniques make use of codebooks that reside in both the transmitter and receiver. The paper attributes much of the progress recently achieved in low-rate speech coding to the introduction of vector quantization techniques in linear predictive coding. Highly structured codebooks allow significant reduction in the complexity of high-dimensional vector quantization.
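To make the codebook idea concrete, the following sketch (Python with NumPy, not part of the original paper) quantizes a parameter vector by exhaustive nearest-neighbor search over a small codebook shared by transmitter and receiver. The codebook contents, dimensions, and size are invented for illustration, and no particular standardized coder is implied.

```python
import numpy as np

def vq_encode(x, codebook):
    """Return the index of the codeword closest to x in squared error."""
    distances = np.sum((codebook - x) ** 2, axis=1)
    return int(np.argmin(distances))

def vq_decode(index, codebook):
    """The receiver looks up the same codeword in its copy of the codebook."""
    return codebook[index]

# Toy codebook: 8 codewords of dimension 4 (values are arbitrary).
rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))

x = np.array([0.3, -0.1, 0.8, 0.2])   # parameter vector to be quantized
i = vq_encode(x, codebook)            # only this 3-bit index is transmitted
x_hat = vq_decode(i, codebook)        # reconstruction at the receiver
print(i, x_hat)
```

Only the codeword index is transmitted (here 3 bits for 8 entries), which is why structured codebooks matter: an unstructured exhaustive search grows quickly with vector dimension and codebook size.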
Sub-band and transform coders rely on transform-domain representations of the voice signal. In sub-band coders, these representations are obtained through filter banks. Sub-band encoding is used in medium-rate coding. Fourier transform coders obtain frequency-domain representations by using unitary transforms. Perhaps the most successful of the early transform coders is the adaptive transform coder developed at Bell Laboratories.
The paper describes analysis-synthesis methods that use the short-time Fourier transform, and also various methods that use sinusoidal representations of speech. Multiple sine waves have been successfully used in many different speech coding systems. For example, one sinusoidal analysis-synthesis system performed very well with a variety of signals, including those from multiple speakers, music, and biological sounds, and this system also performed well in the presence of background noise. Sinusoidal coders have been used for low-rate speech coding, and have produced high-quality speech in the presence of background noise. Another coder that belongs to this class is the multiband excitation coder, which recently became part of the Australian mobile satellite and International Mobile Satellite standards.
Since 1939, vocoder systems have tried to produce intelligible human speech without necessarily matching the speech waveform. Initially, simple models were used to produce low-rate coding. The result was synthetic, buzzy-sounding reconstructed speech. More recently, sophisticated vocoders have provided improved quality at the cost of increased complexity. The paper briefly describes channel and formant vocoders, and the homomorphic vocoder, but focuses mostly on linear predictive vocoders.
Linear predictive coders use algorithms to predict the present speech sample from past samples of speech. Usually 8 to 14 linear predictive parameters are required to model the human vocal tract. The analysis window is typically 20-30 ms long and parameters are generally updated every 10-30 ms. Real-time predictive coders were first demonstrated in the early 1970's. The paper describes a linear predictive coding algorithm that has become a U.S. federal standard for secure communications at 2.4 kbits/s. The U.S. Government is currently seeking an improved algorithm to replace that standard.
In analysis-by-synthesis methods, the reconstructed and original speech are compared, and the excitation parameters are adjusted to minimize the difference before the code is transmitted.
Hybrid coders determine speech spectral parameters by linear prediction and optimize excitation using analysis-by-synthesis techniques. These hybrid coders combine the features of modern vocoders with an ability to exploit the properties of the human auditory system. The paper describes several analysis-by-synthesis linear predictive coding algorithms. The coder used in the British Telecom International Skyphone satellite-based system is based on one of these algorithms (MPLP). Another of these algorithms (RPE-LTP) has been adopted for the full-rate GSM Pan-European digital mobile standard. The U.S. Department of Defense has adopted another algorithm (CELP), described in the paper, for possible use in a new secure telephone unit. The 8-kbits/s algorithm (VSELP) adopted for the North American Cellular Digital System is also described, as is the LD-CELP coder selected by the CCITT as its recommendation for low-delay speech coding.
-Howard Falk
Speech Coding: A Tutorial Review

ANDREAS S. SPANIAS, MEMBER, IEEE
The past decade has witnessed substantial progress towards the application of low-rate speech coders to civilian and military communications as well as computer-related voice applications. Central to this progress has been the development of new speech coders capable of producing high-quality speech at low data rates. Most of these coders incorporate mechanisms to: represent the spectral properties of speech, provide for speech waveform matching, and "optimize" the coder's performance for the human ear. A number of these coders have already been adopted in national and international cellular telephony standards.

The objective of this paper is to provide a tutorial overview of speech coding methodologies with emphasis on those algorithms that are part of the recent low-rate standards for cellular communications. Although the emphasis is on the new low-rate coders, we attempt to provide a comprehensive survey by covering some of the traditional methodologies as well. We feel that this approach will not only point out key references but will also provide valuable background to the beginner. The paper starts with a historical perspective and continues with a brief discussion on the speech properties and performance measures. We then proceed with descriptions of waveform coders, sinusoidal transform coders, linear predictive vocoders, and analysis-by-synthesis linear predictive coders. Finally, we present concluding remarks followed by a discussion of opportunities for future research.
`
`I.
`
`INTRODUCTION
Although with the emergence of optical fibers bandwidth in wired communications has become inexpensive, there is a growing need for bandwidth conservation and enhanced privacy in wireless cellular and satellite communications. In particular, cellular communications have been enjoying a tremendous worldwide growth and there is a great deal of R&D activity geared towards establishing global portable communications through wireless personal communication networks (PCN's). On the other hand, there is a trend toward integrating voice-related applications (e.g., voice-mail) on desktop and portable personal computers, often in the context of multimedia communications. Most of these applications require that the speech signal is in digital format so that it can be processed, stored, or transmitted under software control. Although digital speech brings flexibility and opportunities for encryption, it is also associated (when uncompressed) with a high data rate and hence high requirements of transmission bandwidth and storage. Speech Coding or Speech Compression is the field concerned with obtaining compact digital representations of voice signals for the purpose of efficient transmission or storage. Speech coding involves sampling and amplitude quantization. While the sampling is almost invariably done at a rate equal to or greater than twice the bandwidth of analog speech, there has been a great deal of variability among the proposed methods in the representation of the sampled waveform. The objective in speech coding is to represent speech with a minimum number of bits while maintaining its perceptual quality. The quantization or binary representation can be direct or parametric. Direct quantization implies binary representation of the speech samples themselves while parametric quantization involves binary representation of speech model and/or spectral parameters.

Manuscript received July 6, 1993; revised March 4, 1994. Portions of this work have been supported by Intel Corporation. The author is with the Department of Electrical Engineering, Telecommunications Research Center, Arizona State University, Tempe, AZ 85287-5706 USA. IEEE Log Number 9401511.
With very few exceptions, the coding methods discussed in this paper are those intended for digital speech communications. In this application, speech is generally bandlimited to 4 kHz (or 3.2 kHz) and sampled at 8 kHz. The simplest nonparametric coding technique is Pulse-Code Modulation (PCM), which is simply a quantizer of sampled amplitudes. Speech coded at 64 kbits/s using logarithmic PCM is considered "noncompressed" and is often used as a reference for comparisons. In this paper, we shall use the term medium rate for coding in the range of 8-16 kbits/s, low rate for systems working below 8 kbits/s and down to 2.4 kbits/s, and very low rate for coders operating below 2.4 kbits/s.
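As a hedged illustration of logarithmic PCM at 64 kbits/s (8 kHz sampling, 8 bits per sample), the sketch below applies the textbook mu-law companding curve (mu = 255) followed by a uniform 8-bit quantizer. It is a simplified stand-in, not the bit-exact G.711 codec, and the test signal is an arbitrary tone rather than speech.

```python
import numpy as np

MU = 255.0  # mu-law constant used with North American/Japanese log-PCM

def mu_law_compress(x):
    """Compress samples in [-1, 1] with the mu-law characteristic."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

def quantize(y, bits=8):
    """Uniform quantization of values in [-1, 1] to 2**bits levels."""
    levels = 2 ** bits
    idx = np.clip(np.floor((y + 1.0) / 2.0 * levels), 0, levels - 1)
    return (idx + 0.5) / levels * 2.0 - 1.0

# 8-kHz stand-in signal: a 200-Hz tone at moderate level (illustrative only).
fs = 8000
t = np.arange(fs) / fs
x = 0.3 * np.sin(2 * np.pi * 200 * t)

x_hat = mu_law_expand(quantize(mu_law_compress(x)))   # 8000 samples/s * 8 bits = 64 kbits/s
snr = 10 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))
print(f"log-PCM SNR: {snr:.1f} dB")
```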
Speech coding at medium rates and below is achieved using an analysis-synthesis process. In the analysis stage, speech is represented by a compact set of parameters which are encoded efficiently. In the synthesis stage, these parameters are decoded and used in conjunction with a reconstruction mechanism to form speech. Analysis can be open-loop or closed-loop. In closed-loop analysis, the parameters are extracted and encoded by minimizing explicitly a measure (usually the mean square) of the difference between the original and the reconstructed speech. Therefore, closed-loop analysis incorporates synthesis and hence this process is also called analysis by synthesis. Parametric representations can be speech- or non-speech-specific. Non-speech-specific coders or waveform coders are concerned with the
faithful reconstruction of the time-domain waveform and generally operate at medium rates. Speech-specific coders or voice coders (vocoders) rely on speech models and are focused upon producing perceptually intelligible speech without necessarily matching the waveform. Vocoders are capable of operating at very low rates but also tend to produce speech of synthetic quality. Although this is the generally accepted classification in speech coding, there are coders that combine features from both categories. For example, there are speech-specific waveform coders such as the Adaptive Transform Coder [303] and also hybrid coders which rely on analysis-by-synthesis linear prediction. Hybrid coders combine the coding efficiency of vocoders with the high-quality potential of waveform coders by modeling the spectral properties of speech (much like vocoders) and exploiting the perceptual properties of the ear, while at the same time providing for waveform matching (much like waveform coders). Modern hybrid coders can achieve communications quality speech at 8 kbits/s and below at the expense of increased complexity. At this time there are at least four such coders that have been adopted in telephony standards.
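The closed-loop (analysis-by-synthesis) selection described above can be sketched as follows: candidate excitations are passed through the synthesis filter and the one minimizing the mean squared error against the original is chosen. The all-pole coefficients, the random candidate set, and the plain (unweighted) error measure are illustrative assumptions; real hybrid coders use perceptually weighted errors and structured codebooks.

```python
import numpy as np

def synthesize(excitation, a):
    """All-pole synthesis: s(n) = e(n) - sum_k a[k] * s(n-k)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s[n] = acc
    return s

def closed_loop_select(target, candidates, a):
    """Pick the excitation whose synthesized output is closest (MSE) to the target."""
    errors = [np.mean((target - synthesize(c, a)) ** 2) for c in candidates]
    best = int(np.argmin(errors))
    return best, errors[best]

# Toy setup: fixed predictor coefficients and 16 random excitation candidates.
rng = np.random.default_rng(1)
a = np.array([-0.9, 0.2])                         # illustrative all-pole coefficients
target = synthesize(rng.standard_normal(40), a)   # stand-in for an original speech frame
candidates = rng.standard_normal((16, 40))
index, mse = closed_loop_select(target, candidates, a)
print(index, mse)
```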
A. Scope and Organization

In this paper, we provide a survey of the different methodologies for speech coding with emphasis on those methods and algorithms that are part of recent communications standards. The paper is intended both as a survey and a tutorial and has been motivated by advances in speech coding which have enabled the standardization of low-rate coding algorithms for civilian cellular communications. The standardizations are results of more than fifty years of speech coding research. Until recently, low-rate algorithms were of interest only to researchers in the field. Speech coding is now of interest to many engineers who are confronted with the difficult task of learning the essentials of voice compression in order to solve implementation problems, such as fitting an algorithm to an existing fixed-point signal processor or developing low-power single-chip solutions for portable cellular telephones. Modern speech-coding algorithms are associated with numerical methods that are computationally intensive and often sensitive to machine precision. In addition, these algorithms employ mathematical, statistical, and heuristic methodologies. While the mathematical and statistical techniques are associated with the theory of signal processing, communications, and information theory, many of the heuristic methods were established through years of experimental work. Therefore, the beginner not only has to get a grasp of the theory but also needs to review the algorithms that preceded the standards. In this paper we attempt to sort through the literature and highlight the key theoretical and heuristic techniques employed in classical and modern speech-coding algorithms. For each method we give the key references and, when possible, we refer first to the article that the novice will find more accessible.
The general notation adopted in this paper is as follows. The discrete-time speech signal is denoted as s(n), where n is an integer indexing the sample number. Discrete-time speech is related to analog speech, s_a(t), by s(n) = s_a(nT) = s_a(t)|_{t=nT}, where T is the sampling period. Unless otherwise stated, lower case symbols denote time-domain signals and upper case symbols denote transform-domain signals. Bold characters are used for matrices and vectors. The rest of the notation is introduced in subsequent sections as necessary.
The organization of the paper is as follows. The first section gives a brief description of the properties of speech signals and continues with a historical perspective and a review of performance measures. In Section II, we discuss waveform coding methods. In particular, we start with a general description of scalar [55], [82], [152] and vector quantization [81], [98], [115], [192] methods and we continue with a discussion of waveform coders [48], [52]. Section III presents sinusoidal analysis-synthesis methods [205] for voice compression and Section IV presents vocoder methods [11], [162], [163]. Finally, in Section V we discuss analysis-by-synthesis linear predictive coders [96], [100], [123], [272] and in Section VI we present concluding remarks. Low-rate coders, and particularly those adopted in the recent standards, are discussed in more detail. The scope of the paper is wide, and although our literature review is thorough it is by no means exhaustive. Papers with similar scope [12], [23], [82], [83], [96], [104], [109], [150], [154], [155], [157], [191], [270], [279]; special journal and magazine editions on voice coding [18], [19], [131], [132], [134]-[136], [138], [139]; and books on speech processing [62], [86], [90], [91], [99], [113], [152], [199], [232], [234], [236], [251], [275] can provide additional information. There are also six excellent collections of papers edited by Jayant [156], Davidson and Gray [61], Schafer and Markel [269], Abut [1], and Atal, Cuperman, and Gersho [9], [10]. For the reader who wants to keep up with the developments in this field, articles appear frequently in IEEE TRANSACTIONS and symposia associated with the areas of signal processing and communications (see references section) and also in specialized conferences, workshops, and journals, e.g., [133], [137], [140], [291].
`
`B. Speech Properties
`Before we begin our presentation of the speech coding
`methods, it would be useful if we briefly discussed some
`of the important speech properties. First, speech signals are
`nonstationary and at best they can be considered as quasi(cid:173)
`stationary over short segments, typically 5-20 ms. The
`statistical and spectral properties of speech are thus defined
`over short segments. Speech can generally be classified as
`voiced (e.g., Ia!, Iii, etc), unvoiced (e.g., Ish!), or mixed.
`Time- and frequency-domain plots for sample voiced and
`unvoiced segments are shown in Fig. l. Voiced speech
`is quasi-periodic in the time domain and harmonically
`structured in the frequency domain while unvoiced speech
`is random-like and broadband. In addition, the energy of
`voiced segments is generally higher than the energy of
`unvoiced segments.
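Because speech is only quasi-stationary, coders operate on short analysis frames. The sketch below (an illustration, not from the paper) splits an 8-kHz signal into 20-ms frames with a 10-ms advance and computes short-time energy, which is typically larger for voiced than for unvoiced frames. The frame sizes are common choices and the test signal is synthetic.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames of frame_len samples, advancing by hop."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

fs = 8000
frame_len = int(0.020 * fs)   # 20-ms analysis frames
hop = int(0.010 * fs)         # 10-ms frame advance

# Stand-in signal: a "voiced" tone followed by "unvoiced" noise (illustrative).
t = np.arange(fs) / fs
x = np.concatenate([0.5 * np.sin(2 * np.pi * 150 * t[: fs // 2]),
                    0.05 * np.random.default_rng(0).standard_normal(fs // 2)])

frames = frame_signal(x, frame_len, hop)
energy = np.sum(frames ** 2, axis=1)   # short-time energy per frame
print(energy[:3], energy[-3:])         # the voiced half carries much more energy
```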
[Figure: time-domain speech segments (amplitude vs. time in ms) and their short-time spectra (magnitude in dB vs. frequency in kHz) for a voiced and an unvoiced segment.]

Fig. 1. Voiced and unvoiced segments and their short-time spectra.
`
`The short-time spectrum1 of voiced speech is character(cid:173)
`ized by its fine and formant structure. The fine harmonic
`structure is a consequence of the quasi-periodicity of speech
`and may be attributed to the vibrating vocal chords. The
`formant structure (spectral envelope) is due to the inter(cid:173)
`action of the source and the vocal tract. The vocal tract
`consists of the pharynx and the mouth cavity. The shape
`of the spectral envelope that "fits" the short-time spectrum
`of voiced speech, Fig. 1, is associated with the transfer
`characteristics of the vocal tract and the spectral tilt (6
`dB/octave) due to the glottal pulse [261]. The spectral
`envelope is characterized by a set of peaks which are called
`formants. The formants are the resonant modes of the vocal
`tract. For the average vocal tract there are three to five
`formants below 5 kHz. The amplitudes and locations of
`the first three formants, usually occurring below 3 kHz, are
`quite important both in speech synthesis and perception.
`Higher formants are also important for wideband and
`unvoiced speech representations. The properties of speech
`are related to the physical speech production system as
`follows. Voiced speech is produced by exciting the vocal
`tract with quasi-periodic glottal air pulses generated by
`the vibrating vocal chords. The frequency of the periodic
`pulses is referred to as the fundamental frequency or pitch.
`Unvoiced speech is produced by forcing air through a
`constriction in the vocal tract. Nasal sounds (e.g., In!) are
`due to the acoustical coupling of the nasal tract to the vocal
`tract, and plosive sounds (e.g., /p/) are produced by abruptly
`releasing air pressure which was built up behind a closure
`in the tract.
¹ Unless otherwise stated, the term spectrum implies power spectrum.

More information on the acoustic theory of speech production is given by Fant [75] while information on the physical modeling of the speech production process is given in the classic book by Flanagan [86].
C. Historical Perspective

Speech coding research started over fifty years ago with the pioneering work of Homer Dudley [66], [67] of the Bell Telephone Laboratories. The motivation for speech coding research at that time was to develop systems for transmission of speech over low-bandwidth telegraph cables. Dudley practically demonstrated the redundancy in the speech signal and provided the first analysis-synthesis method for speech coding. The basic idea behind Dudley's voice coder or vocoder (Fig. 2) was to analyze speech in terms of its pitch and spectrum and synthesize it by exciting a bank of ten analog band-pass filters (representing the vocal tract) with periodic (buzz) or random (hiss) excitation (for voiced and unvoiced sounds, respectively). The channel vocoder received a great deal of attention during World War II because of its potential for efficient transmission of encrypted speech. Formant [223] and pattern matching [68] vocoders along with improved analog implementations of channel vocoders [221], [292] were reported through the 1950's and 1960's. In the formant vocoder, the resonant characteristics of the filter bank track the movements of the formants. In the pattern-matching vocoder the best match between the short-time spectrum of speech and a set of stored frequency response patterns is determined and speech is produced by exciting the channel filter associated with the selected pattern. The pattern-matching vocoder was
essentially the first analysis-synthesis system to implicitly employ vector quantization.

Fig. 2. Dudley's channel vocoder [67] (analysis and synthesis sections, with a pitch channel and a total of ten channels).
Although early vocoder implementations were based on analog speech representations, digital representations were rapidly gaining interest due to their promise for encryption and high-fidelity transmission and storage. In particular, there had been a great deal of activity in Pulse-Code Modulation (PCM) in the 1940's (see [156] and the references therein). PCM [228] is a straightforward method for discrete-time, discrete-amplitude approximation of analog waveforms and does not have any mechanism for redundancy removal. Quantization methods that exploit the signal correlation, such as Differential PCM (DPCM), Delta Modulation (DM) [153], and Adaptive DPCM were proposed later, and speech coding with PCM at 64 kbits/s and with ADPCM at 32 kbits/s eventually became CCITT² standards [32].
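A minimal first-order DPCM sketch illustrates how exploiting sample-to-sample correlation shrinks the quantity being quantized: the encoder quantizes the prediction residual and keeps a decoder replica so that encoder and decoder stay in step. The predictor coefficient, step size, and test tone are arbitrary illustrative values, not those of any standardized ADPCM coder.

```python
import numpy as np

def dpcm_encode(x, a=0.9, step=0.05):
    """First-order DPCM: quantize the residual e(n) = x(n) - a * x_hat(n-1)."""
    x_hat_prev = 0.0
    codes, recon = [], []
    for sample in x:
        prediction = a * x_hat_prev
        error = sample - prediction
        code = int(np.round(error / step))   # uniform quantization of the residual
        x_hat = prediction + code * step     # decoder replica inside the encoder
        codes.append(code)
        recon.append(x_hat)
        x_hat_prev = x_hat
    return np.array(codes), np.array(recon)

# The residual has a much smaller range than the signal itself, so it can be
# coded with fewer bits for the same distortion (toy tone as a stand-in signal).
fs = 8000
t = np.arange(fs) / fs
x = 0.3 * np.sin(2 * np.pi * 200 * t)
codes, x_hat = dpcm_encode(x)
print(np.max(np.abs(x)), np.max(np.abs(codes)))
```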
With the flexibility offered by digital computers, there was a natural tendency to experiment with more sophisticated digital representations of speech [266]. Initial efforts concentrated on the digital implementation of the vocoder [112]. A great deal of activity, however, concentrated on the linear speech source-system production model developed by Fant [75] in the late 1950's. This model later evolved into the familiar speech production system shown in Fig. 3. This model consists of a linear slowly time-varying system (for the vocal tract and the glottal model) excited by periodic impulse train excitation (for voiced speech) and random excitation (for unvoiced speech).
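A minimal rendering of this source-system model, with invented fixed coefficients standing in for the vocal-tract filter: a periodic impulse train (voiced) or white noise (unvoiced) excites an all-pole filter. Real coders make the filter slowly time-varying and update it frame by frame.

```python
import numpy as np

def all_pole_filter(excitation, a):
    """Synthesis filter 1/A(z): s(n) = e(n) - sum_k a[k] * s(n-k)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        s[n] = acc
    return s

fs = 8000
a = np.array([-1.3, 0.9])   # invented stable coefficients giving a single resonance

# Voiced excitation: impulse train at a 100-Hz pitch; unvoiced: white noise.
voiced_exc = np.zeros(fs // 4)
voiced_exc[:: fs // 100] = 1.0
unvoiced_exc = 0.1 * np.random.default_rng(0).standard_normal(fs // 4)

voiced = all_pole_filter(voiced_exc, a)
unvoiced = all_pole_filter(unvoiced_exc, a)
print(voiced[:5], unvoiced[:5])
```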
The source-system model became associated with Autoregressive (AR) time-series methods where the vocal tract filter is all-pole and its parameters are obtained by Linear Prediction analysis [189]; a process where the present speech sample is predicted by the linear combination of previous samples. Itakura and Saito [143], [264] and Atal and Schroeder [14] were the first to apply Linear Prediction (LP) techniques to speech. Atal and Hanauer [11] later reported an analysis-synthesis system based on LP. Theoretical and practical aspects of linear predictive coding (LPC) were examined by Markel and Gray [199] and the problem of spectral analysis of speech using linear prediction was addressed by Makhoul and Wolf [190].

² International Consultative Committee for Telephone and Telegraph, currently called the International Telecommunications Union-Telecommunication Standardization Sector (ITU-TSS).

Fig. 3. The engineering model for speech synthesis (periodic impulse-train or random excitation driving a vocal tract filter to produce synthetic speech).
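A sketch of the linear prediction analysis just described: the autocorrelation of one analysis frame is computed and the Toeplitz normal equations are solved with the Levinson-Durbin recursion to obtain the all-pole (LP) coefficients. The 30-ms frame and 10th-order predictor follow typical values mentioned in the text; the frame itself is a synthetic stand-in, not real speech.

```python
import numpy as np

def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of one analysis frame."""
    return np.array([np.dot(frame[: len(frame) - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for A(z) = 1 + a[1]z^-1 + ... + a[order]z^-order."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    error = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error   # reflection coefficient
        a[1 : i + 1] += k * a[i - 1 :: -1][: i]               # update using previous-order values
        error *= (1.0 - k * k)                                # residual (prediction error) energy
    return a, error

# One 30-ms frame of a stand-in "voiced" signal at 8 kHz, 10th-order analysis.
fs, order = 8000, 10
t = np.arange(int(0.030 * fs)) / fs
frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 450 * t)

r = autocorrelation(frame, order)
a, residual_energy = levinson_durbin(r, order)
print(a, residual_energy)   # predicted sample: -sum_k a[k] * s(n-k)
```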
LP is not the only method for source-system analysis. Homomorphic analysis, a method that can be used for separating signals that have been combined by convolution, has also been used for speech analysis. Oppenheim and Schafer were strong proponents of this method [229], [230]. One of the inherent advantages of homomorphic speech analysis is the availability of pitch information from the cepstrum [41], [227].
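A minimal sketch of cepstral pitch estimation, the homomorphic by-product mentioned above: taking the log magnitude spectrum turns the excitation-filter convolution into a sum, and for voiced speech the real cepstrum shows a peak at the pitch period. The window, the 50-400 Hz search range, and the synthetic test frame are assumptions made only for this illustration.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate pitch from the peak of the real cepstrum within a plausible lag range."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed)
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))
    lo, hi = int(fs / fmax), int(fs / fmin)        # lag range in samples
    lag = lo + int(np.argmax(cepstrum[lo:hi]))
    return fs / lag

# Synthetic voiced frame: 100-Hz impulse train smoothed by a short window.
fs = 8000
excitation = np.zeros(int(0.032 * fs))
excitation[:: fs // 100] = 1.0
frame = np.convolve(excitation, np.hanning(16))[: len(excitation)]
print(cepstral_pitch(frame, fs))   # expected near 100 Hz
```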
The emergence of VLSI technologies along with advances in the theory of digital signal processing during the 1960's and 1970's provided even more incentives for getting new and improved solutions to the speech coding problem. Analysis-synthesis of speech using the Short-Time Fourier Transform (STFT) was proposed by Flanagan and Golden in a paper entitled "Phase Vocoder" [87]. In addition, Schafer and Rabiner designed and simulated an analysis-synthesis system based on the STFT [267], [268], and Portnoff [240], [242], [243] provided a theoretical basis for the time-frequency analysis of speech using the STFT. In the mid- to late 1970's there was also a continued activity in linear prediction [304], [310], transform coding [303], and sub-band coding [52]. An excellent review of this work is given by Flanagan et al. [82], and a unified analysis of transform and sub-band coders is given by Tribolet and Crochiere [303]. During the 1970's, there were also parallel efforts for the application of linear prediction in military secure communications (see the NRL reports by Kang et al.
[162]-[166]). A federal standard (FS-1015), which is based on the LPC-10 algorithm, was developed in the early 1980's (see the paper by Tremain [301]).
Research efforts in the 1980's and 1990's have been focused upon developing robust low-rate speech coders capable of producing high-quality speech for communications applications. Much of this work was driven by the need for narrow-band and secure transmission in cellular and military communications. Competing methodologies promoted in the 1980's included: sinusoidal analysis-synthesis of speech proposed by McAulay and Quatieri [205], [206], multiband excitation vocoders proposed by Griffin and Lim [117], multipulse and vector excitation schemes for LPC proposed by Atal et al. [13], [272], and vector quantization (VQ) promoted by Gersho and Gray [98], [99], [115], and others [47], [192]. Vector quantization [1] proved to be very useful in encoding LPC parameters. In particular, Atal and Schroeder [17], [272] proposed a linear-prediction algorithm with stochastic vector excitation which they called "Code Excited Linear Prediction" (CELP). The stochastic excitation in CELP is determined using a perceptually weighted closed-loop (analysis-by-synthesis) optimization. CELP coders are also called hybrid coders because they combine the features of traditional vocoders with the waveform-matching features of waveform coders. Although the first paper [17] on CELP addressed the feasibility of vector excitation coding, follow-up work [37], [100], [170], [171], [176], [177], [276], [315] essentially demonstrated that CELP coders were capable of producing medium-rate and even low-rate speech adequate for communications applications. Real-time implementation of hybrid coders became feasible with the development of highly structured codebooks.
Progress in speech coding, particularly in the late 1980's, enabled recent adoptions of low-rate algorithms for mobile telephony. An 8-kbit/s hybrid coder has already been selected for the North American digital cellular standard [100], and a similar algorithm has been selected for the 6.7-kbit/s Japanese digital cellular standard [102], [103], [217], [314]. In Europe, a standard that uses a 13-kbit/s regular pulse excitation algorithm [307] has been completed and partially deployed by the "Group Speciale Mobile" (GSM). Parallel standardization efforts for secure military communications [169] have resulted in the adoption of a 4.8-kbit/s hybrid algorithm for the Federal Standard 1016 [9]. In addition, a 6.4-kbit/s improved multiband excitation coder [121] has been adopted for the International Maritime Satellite (INMARSAT-M) system [322] and the Australian Satellite (AUSSAT) system. Finally, we note that there are plans to increase the capacity of cellular networks by introducing half-rate algorithms in the GSM, the Japanese, and the North American standards.
and acoustic interference. In general, high-quality speech coding at low rates is achieved using high-complexity algorithms. For example, real-time implementation of a low-rate hybrid algorithm must typically be done on a digital signal processor capable of executing 12 or more million instructions per second (MIPS). The one-way delay (coding plus decoding delay only) introduced by such algorithms is usually between 50 to 60 ms. Robust speech coding systems incorporate error correction algorithms to protect the perceptually important information against channel errors. Moreover, in some applications coders must perform reasonably well with speech corrupted by background noise, nonspeech signals (such as DTMF tones, voiceband data, modem signals, etc.), and a variety of languages and accents.
In digital communications, speech quality is classified into four general categories, namely: broadcast, network or toll, communications, and synthetic. Broadcast wideband speech refers to high-quality "commentary" speech that can generally be achieved at rates above 64 kbits/s. Toll or network quality refers to quality comparable to the classical analog speech (200-3200 Hz) and can be achieved at rates above 16 kbits/s. Communications quality implies somewhat degraded speech quality which is nevertheless natural, highly intelligible, and adequate for telecommunications. Synthetic speech is usually intelligible but can be unnatural and associated with a loss of speaker recognizability. Communications speech can be achieved at rates above 4.8 kbits/s and the current goal in speech coding is to achieve communications quality at 4.0 kbits/s. Currently, speech coders operating well below 4.0 kbits/s tend to produce speech of synthetic quality.
Gauging the speech quality is an important but also very difficult task. The signal-to-noise ratio (SNR) is one of the most common objective measures for evaluating the performance of a compression algorithm. This is given by

\mathrm{SNR} = 10 \log_{10} \left\{ \frac{\sum_{n=0}^{M-1} s^{2}(n)}{\sum_{n=0}^{M-1} \left( s(n) - \hat{s}(n) \right)^{2}} \right\} \qquad (1)
where s(n) is the original speech data while ŝ(n) is the coded speech data. The SNR is a long-term measure of the accuracy of speech reconstruction and as such it tends to "hide" temporal reconstruction
