`PROCEEDINGS OF THE IEEE, VOL. 67, NO. 12, DECEMBER 1979
`
`Enhancement and Bandwidth Compression
`of Noisy Speech
`JAE S. LIM, MEMBER, IEEE, AND ALAN V. OPPENHEIM, FELLOW, IEEE
`
`Invited Paper
`
`Abstract-Over the past several years there has been considerable
`attention focused on the problem of enhancement and bandwidth
`compression of speech degraded by additive background noise. This
`interest is motivated by several factors including a broad set of
`important applications, the apparent lack of robustness in current
`speech-compression systems and the development of several potentially
`promising and practical solutions. One objective of this paper is to
`provide an overview of the variety of techniques that have been
`proposed for enhancement and bandwidth compression of speech degraded
`by additive background noise. A second objective is to suggest a
`unifying framework in terms of which the relationships between these
`systems are more visible and which hopefully provides a structure which
`will suggest fruitful directions for further research.
`
`I. INTRODUCTION
`
`THERE ARE a wide variety of contexts in which it is
`desired to enhance speech. The objective of enhancement
`may perhaps be to improve the overall quality, to
`increase intelligibility, to reduce listener fatigue, etc. Depending
`on the specific application, the enhancement system may
`be directed at only one of these objectives or several. For
`example, a speech communication system may introduce a
`low-amplitude long-time delay echo or a narrow-band additive
`disturbance. While these degradations may not by themselves
`reduce intelligibility for the purposes for which the channel
`is used, they are generally objectionable and an improvement
`in quality perhaps even at the expense of some intelligibility
`may be desirable. Another example is the communication
`between a pilot and an air traffic control tower. In this
`environment, the speech is typically degraded by background
`noise. Of central importance is the intelligibility of the speech
`and it would generally be acceptable to sacrifice quality if the
`intelligibility could be improved. Even with normal undegraded
`speech, it is sometimes useful or desirable to provide
`enhancement. As a simple example, high-pass filtering of normal
`speech is often used to introduce a "crispness" which is
`generally perceived as an improvement in quality.
`The speech-enhancement problem covers a broad spectrum
`of constraints, applications and issues. Environments in which
`an additive background signal has been introduced are common.
`The background may be noise-like such as in aircraft,
`street noise, etc. or may be speech-like such as an environment
`with competing speakers. Other examples in which the need
`
`Manuscript received June 22, 1979; revised August 28, 1979. This
`work was supported in part by the Defense Advanced Research Projects
`Agency monitored by the Office of Naval Research under Contract
`N00014-75-C-0951-NR049-328 at M.I.T. Research Laboratory of
`Electronics and in part by the Department of the Air Force under Contract
`F19628-78-C-0002 at M.I.T. Lincoln Laboratory.
`The authors are with M.I.T. Research Laboratory of Electronics and
`M.I.T. Lincoln Laboratory, Cambridge, MA 02139.
`
`for speech enhancement arises include correcting for reverberation,
`correcting for the distortion of the speech of underwater
`divers breathing a helium-oxygen mixture, and correcting
`the distortion of speech due to pathological difficulties of the
`speaker or introduced due to an attempt to speak too rapidly.
`Even for these examples, the problem and techniques vary,
`depending on the availability of other signals or information.
`For example, for enhancement of speech in an aircraft a
`separate microphone can be used to monitor the background
`noise so that the characteristics of the noise can be used to
`adjust or adapt the enhancement system. At the air-traffic
`control tower, however, the only signal available for enhancement
`is the degraded speech.
`Another very important application for speech enhancement
`is in conjunction with speech bandwidth compression systems.
`Because of the increasing role of digital communication
`channels coupled with the need for encrypting of speech and
`increased emphasis on integrated voice-data networks,
`speech-bandwidth-compression systems are destined to play an
`increasingly important role in speech-communication systems.
`The conceptual basis for narrow-band speech-compression
`systems stems from a model for the speech signal based on
`what is known about the physics and physiology of speech
`production. Because of this reliance on a model for the signal
`it is not unreasonable to expect that as the signal deviates from
`the model due to distortion such as additive noise, the
`performance of the speech compression system with regard to
`factors such as quality, intelligibility, etc., will degrade. It
`is generally agreed that the performance of current
`speech-compression systems degrades rapidly in the presence of
`additive noise and other distortions and there is currently
`considerable interest and attention being directed at the
`development of more robust speech compression systems.
`There are two basic approaches which are typically considered,
`either of which may be preferable in a given situation. One
`approach is to base the bandwidth compression on the
`assumption of undistorted speech and develop a preprocessor
`to enhance the degraded speech in preparation for further
`processing by the bandwidth compression system. It is
`important to recognize that in enhancing speech in preparation
`for bandwidth compression the effectiveness of the preprocessor
`is judged on the basis of the output of the
`bandwidth-compression system in comparison with the output if no
`preprocessor is used. Thus, for example, it is possible that
`the output of the preprocessor would be judged by a listener
`to be inferior (by some measure) to the input but that the
`output of the bandwidth-compression system with the
`preprocessor is preferred to the output without it. In this case,
`the preprocessor would clearly be considered to be effective
`
`0018-9219/79/1200-1586$00.75 © 1979 IEEE
`
`Petitioner Apple Inc.
`Ex. 1016, p. 1586
`
`in enhancing the speech in preparation for bandwidth compression.
`Another approach to bandwidth compression of
`degraded speech is to incorporate into the model for the signal
`information about the degradation. A number of systems
`based on such an approach have recently been proposed and
`will be discussed in detail in this paper.
`As is evident from the above discussion, the general problem
`of enhancing speech is broad and the constraints, information,
`and objectives are heavily dependent on the specific context
`and applications. In this paper, we consider only a small
`subset of possible topics, specifically the enhancement and
`bandwidth compression of speech degraded by additive noise.
`Furthermore, we assume that the only signal available is the
`degraded speech and that the noise does not depend on the
`original speech. Many practical problems, some of which have
`already been discussed, fall into this framework and some
`problems that do not can be transformed so that they do.
`For example, multiplicative noise or convolutional noise
`degradation can be converted to an additive noise degradation
`by a homomorphic transformation [1], [2]. As another
`example, signal-dependent quantization noise in pulse-code
`modulation (PCM) signal coding can be converted to a
`signal-independent additive noise by a pseudo-noise technique
`[3]-[5].
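The conversion to additive form can be made concrete with a short worked equation. The following is only a sketch of the idea behind a homomorphic transformation, not a reproduction of the specific systems in [1], [2]. The symbols $m(n)$ for a multiplicative disturbance and $h(n)$ for a convolutional one are notation assumed here for illustration, and in the first line all signals are taken to be positive so that the logarithm is defined.

$$
\begin{aligned}
y(n) = s(n)\, m(n) \quad &\Longrightarrow \quad \log y(n) = \log s(n) + \log m(n), \\
y(n) = s(n) * h(n) \quad &\Longrightarrow \quad Y(\omega) = S(\omega)\, H(\omega) \quad \Longrightarrow \quad \log Y(\omega) = \log S(\omega) + \log H(\omega).
\end{aligned}
$$

In both cases the degradation appears as an additive term after the transformation, so that techniques developed for additive noise can in principle be applied in the transformed domain.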
`Even within the limited framework outlined above, there is a
`diversity of approaches and systems. One objective of this
`paper is to provide an overview of the variety of techniques
`that have been proposed for enhancement of speech degraded
`by additive background noise both for direct listening and as
`a preprocessor for subsequent bandwidth compression. Many
`of these systems were developed independently of each other
`and on the surface often appear to be unrelated. Thus another
`objective of the paper is to provide a unifying framework in
`terms of which the relationship between these systems is more
`visible, and which hopefully will provide a structure which
`will suggest further fruitful directions for research.
`In Section II, we present an overview of the general topic.
`In this overview we classify the various enhancement systems
`based on the information assumed about the speech and the
`noise. Some systems based on time-invariant Wiener filtering,
`for example, rely only on an assumed noise power spectrum
`and on long-time average characteristics of speech, such as the
`fact that the average speech spectrum decays with frequency
`at approximately 6 dB/octave. Other systems rely on aspects
`of speech perception or speech production in general or on a
`detailed model of speech.
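As a rough illustration of how little such a system needs to know, the sketch below builds a time-invariant Wiener gain from an assumed noise power spectrum and a long-time speech spectrum that falls at about 6 dB/octave. The gain Ps/(Ps + Pd) is the standard noncausal Wiener result for uncorrelated additive noise. The function names, the 500 Hz corner frequency, and the flat noise level are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def long_time_speech_psd(freq_hz, corner_hz=500.0, level=1.0):
    """Hypothetical long-time speech power spectrum: flat below corner_hz,
    then falling as 1/f^2, i.e. about 6 dB per octave."""
    f = np.maximum(np.asarray(freq_hz, dtype=float), corner_hz)
    return level * (corner_hz / f) ** 2

def wiener_gain(speech_psd, noise_psd):
    """Noncausal Wiener gain for uncorrelated additive noise."""
    speech_psd = np.asarray(speech_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return speech_psd / (speech_psd + noise_psd)

freqs = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0])
ps = long_time_speech_psd(freqs)
pd = np.full_like(freqs, 0.25)             # assumed white (flat) noise level
gain = wiener_gain(ps, pd)

# Change in the assumed speech spectrum over one octave (1 kHz to 2 kHz), in dB.
drop_db = 10.0 * np.log10(ps[3] / ps[2])
```

The gain is close to one where the assumed speech spectrum dominates and falls toward zero at high frequencies, where the decaying speech spectrum drops below the noise. Because both spectra are fixed, the filter cannot follow the short-time variations of speech.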
`Sections III-V present a more detailed discussion of several
`of these categories of speech-enhancement systems. In particular,
`Section III is concerned with the general principle of
`speech enhancement based on estimation of the short-time
`spectral amplitude of the speech. This basic principle encompasses
`a variety of techniques and systems including the
`specific methods of spectral subtraction, parametric Wiener
`filtering, etc. In Section IV, speech enhancement techniques
`which rely principally on the concept of the short-time periodicity
`of voiced speech are reviewed, including comb-filtering
`and related systems. Section V discusses a variety of systems
`that rely on more specific modeling of the speech waveform.
`As we will discuss in detail, in some cases, parameters of the
`model are obtained from an analysis of the degraded speech and
`used to synthesize the enhanced speech. In other cases, the
`results of an analysis based on a model for speech are used
`to control an enhancement filter, perhaps with the procedure
`being iterative so that the output of an enhancement filter is
`then subjected to further analysis, etc. Many of these systems
`also incorporate a number of the techniques introduced in
`Section III, including Wiener filtering and spectral subtraction.
`In Sections III-V, the focus is entirely on systems for
`enhancement with the evaluation of the systems being based
`on listening without further processing. In Section VI, we
`consider the related but separate problem of bandwidth
`compression of speech degraded by additive noise.
`In Section VII, we discuss in some detail the evaluation of
`the performance of the various systems presented in the earlier
`sections. In general, the performance evaluation of a
`speech-enhancement system is extremely difficult, in large measure
`because the appropriate criteria for evaluation are heavily
`dependent on the specific application of the system. Relative
`importance of such factors as quality, intelligibility, listener
`fatigue, etc., may vary considerably with the application. In
`Section VII, we summarize the performance evaluations that
`have been reported for the various systems presented in this
`paper. Since the evaluation of different systems has generally
`been based on different procedures, environments, etc., no
`attempt is made in the section to compare individual systems.
`In general, however, we will see that while many of the
`enhancement systems reduce the apparent background noise
`and thus perhaps increase quality, many of them, to varying
`degrees, reduce intelligibility. In the context of bandwidth
`compression, however, various systems provide an increase
`in intelligibility over that obtained without the incorporation
`of speech enhancement.
`
`II. OVERVIEW OF SYSTEMS FOR ENHANCEMENT AND
`BANDWIDTH COMPRESSION OF NOISY SPEECH
`As indicated in the previous section, our focus in this paper
`is on degradation due to the presence of additive noise. Even
`within this limited context there are a wide variety of
`approaches which have been proposed and explored. Conceptually
`any approach should attempt to capitalize on available
`information about the signal, i.e., the speech, and the
`background noise. Speech is a special subclass of audio signals
`and there are reasonable models in terms of which the speech
`waveform can be described and categorized. The more specifically
`we attempt to model the speech signal, the more potential
`for separating it from the background noise. On the other
`hand, the more we assume about the speech the more sensitive
`the enhancement system will be to inaccuracies or deviations
`from these assumptions. Thus incorporating assumptions and
`information about the speech signal represents tradeoffs which
`are reflected in the various systems. In a similar manner
`systems can attempt to incorporate detailed information about
`the background noise. For example, the type of processing
`suggested if the background noise is a competing speaker is
`different than if it is wide-band random noise. Thus enhancement
`systems also tend to differ in terms of the assumptions
`made regarding the background noise. As with assumptions
`related to the signal, the more an enhancement system
`attempts to capitalize on assumed characteristics of the noise
`the more susceptible it is likely to be to deviations from these
`assumptions.
`Another important consideration in speech enhancement
`stems from the fact that the criteria for enhancement ultimately
`relate to an evaluation by a human listener. In different
`contexts the criteria for evaluation may differ depending on
`whether quality, intelligibility, or some other attribute is the
`
`
`Fig. 1. A speech production model. (Figure not reproduced; block labels: pitch period, {p(n)}, amplitude, digital filter coefficients, time varying digital filter V(z), speech samples {s(n)}.)
`
`Fig. 2. An example of resonant frequencies of an acoustic cavity.
`(a) Vocal-tract transfer function. (b) Magnitude spectrum of a speech
`sound with the resonant frequencies shown in (a). (Figure not reproduced.)
`
`most important. Thus speech enhancement must inevitably
`take into account aspects of human perception. As we will
`indicate shortly, some systems are heavily motivated by
`perceptual considerations, others rely more on mathematical
`criteria. In such cases, of course, the mathematical criteria
`must in some way be consistent with human perception, and,
`while an optimum mathematical criterion is not known, some
`mathematical error criteria are understood to be a better
`match than others to aspects of human perception.
`In the following discussion we briefly describe some aspects
`of speech production and speech perception that in varying
`degrees play a role in speech-enhancement systems. Following
`that we present a brief overview of a representative collection
`of speech-enhancement systems, with the intent of categorizing
`these systems in terms of the various aspects of
`speech production and perception on which they attempt to
`capitalize.
`Speech is generated by exciting an acoustic cavity, the vocal
`tract, by pulses of air released through the vocal cords for
`voiced sounds, or by turbulence for unvoiced sounds. Thus
`a simple but useful model for speech production consists of
`a linear system, representing the vocal tract, driven by an
`excitation function which is a periodic pulse train for voiced
`sounds and wide-band noise for unvoiced sounds, as illustrated
`in Fig. 1. Furthermore, since the linear system represents an
`acoustic cavity, its response is of a resonant nature, so that
`its transfer function is characterized by a set of resonant
`frequencies, referred to as formants, as illustrated in Fig. 2(a).
`Thus, if the excitation and vocal-tract parameters are fixed,
`then as indicated in Fig. 2(b), the speech spectrum has an
`envelope representing the vocal-tract transfer function of
`Fig. 2(a) and a fine structure representing the excitation.
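The source-filter model just described can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it drives a cascade of two-pole resonators (one per formant) with either a periodic pulse train or white noise. The function names, the single 500 Hz formant with 80 Hz bandwidth, the 100 Hz pitch, and the 8 kHz sampling rate are all assumptions chosen for illustration.

```python
import numpy as np

def resonator_denominator(formant_hz, bandwidth_hz, fs):
    """Denominator [1, a1, a2] of a two-pole resonance near formant_hz."""
    r = np.exp(-np.pi * bandwidth_hz / fs)        # pole radius
    theta = 2.0 * np.pi * formant_hz / fs         # pole angle
    return np.array([1.0, -2.0 * r * np.cos(theta), r * r])

def all_pole_filter(a, excitation):
    """Direct-form recursion y[n] = x[n] - sum_k a[k] * y[n-k], with a[0] = 1."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y[n] = acc
    return y

def synthesize(n_samples, fs, formants, voiced, pitch_hz=100.0, seed=0):
    """Excite a fixed all-pole 'vocal tract' with pulses (voiced) or noise."""
    if voiced:
        period = int(round(fs / pitch_hz))
        excitation = np.zeros(n_samples)
        excitation[::period] = 1.0                # periodic pulse train
    else:
        rng = np.random.default_rng(seed)
        excitation = rng.standard_normal(n_samples)   # wide-band noise
    a = np.array([1.0])
    for formant_hz, bandwidth_hz in formants:
        a = np.convolve(a, resonator_denominator(formant_hz, bandwidth_hz, fs))
    return all_pole_filter(a, excitation)

fs = 8000
voiced = synthesize(800, fs, [(500.0, 80.0)], voiced=True, pitch_hz=100.0)
unvoiced = synthesize(800, fs, [(500.0, 80.0)], voiced=False)

# Locate the strongest spectral line of the voiced segment.
spectrum = np.abs(np.fft.rfft(voiced))
peak_hz = np.argmax(spectrum) * fs / len(voiced)
```

With a fixed filter and a fixed pitch, the voiced spectrum consists of lines at multiples of the pitch frequency under an envelope set by the resonance, which is the envelope-plus-fine-structure picture of Fig. 2(b).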
`Many of the techniques for speech enhancement, particularly
`those in Sections III and V, are conceptually based on
`the representation of the speech signal as a stochastic process.
`This characterization of speech is clearly more appropriate in
`the case of unvoiced sounds for which the vocal tract is driven
`by wide-band noise. The vocal tract of course changes shape
`as different sounds are generated and this is reflected in a
`time varying transfer function for the linear system in Fig. 1.
`However, because of the mechanical and physiological constraints
`on the motion of the vocal tract and articulators
`such as the tongue and lips, it is reasonable to represent the
`linear system in Fig. 1 as a slowly varying linear system so that
`on a short-time basis it is approximated as stationary. Thus
`some specific attributes of the speech signal, which can be
`capitalized on in an enhancement system are that it is the
`response of a slowly varying linear system, that on a short-time
`basis its spectral envelope is characterized by a set of
`resonances, and that for voiced sounds, on a short-time basis
`it has a harmonic structure. This simplified model for speech
`production has generally been very successful in a variety of
`engineering contexts including speech enhancement, synthesis,
`and bandwidth compression. A more detailed discussion
`of models for speech production can be found in [6]-[8].
`The perceptual aspects of speech are considerably more
`complicated and less well understood. However, there are a
`number of commonly accepted aspects of speech perception
`which play an important role in speech-enhancement systems.
`For example, consonants are known to be important in the
`intelligibility of speech even though they represent a relatively
`small fraction of the signal energy. Furthermore, it is generally
`understood that the short-time spectrum is of central importance
`in the perception of speech and that, specifically, the
`formants in the short-time spectrum are more important than
`other details of the spectral envelope. It appears also that the
`first formant, typically in the range of 250 to 800 Hz, is less
`important perceptually than the second formant [9], [10].
`Thus it is possible to apply a certain degree of high-pass filtering
`[11], [12] to speech which may perhaps affect the first
`formant without introducing serious degradation in intelligibility.
`Similarly low-pass filtering with a cutoff frequency
`above 4 kHz, while perhaps affecting crispness and quality,
`will in general not seriously affect intelligibility. A good representation
`of the magnitude of the short-time spectrum is also
`generally considered to be important whereas the phase is
`relatively unimportant. Another perceptual aspect of the
`auditory system that plays a role in speech enhancement is the
`ability to mask one signal with another. Thus, for example,
`narrow-band noise and many forms of artificial noise or degradation
`such as might be produced by a vocoder are more
`unpleasant to listen to than broad-band noise and a
`speech-enhancement system might include the introduction of
`broad-band noise to mask the narrow-band or artificial noise.
`All speech-enhancement systems rely to varying degrees on
`the aspects of speech production and perception outlined
`above. One of the simplest approaches to enhancement is the
`use of low-pass or bandpass filtering to attenuate the noise
`outside the band of perceptual importance for speech. More
`generally, when the power spectrum of the noise is known,
`one can consider the use of Wiener filtering, based on the long-time
`power spectrum of speech. While in some cases such as
`the presence of narrow-band background noise, this is reasonably
`successful, Wiener filtering based on the long-time power
`spectrum of the speech and noise is limited because speech is
`not stationary. Even if speech were truly stationary, mean-square
`error, which is the error criterion on which Wiener
`filtering is based, is not strongly correlated with perception and
`thus is not a particularly effective error criterion to apply to
`speech processing systems. This is evidenced, for example, in
`the use of masking for enhancement. By adding broad-band
`
`noise to mask other degradation, we are, in effect, increasing
`the mean-square error. Another example that suggests that
`mean-square error is not well matched to the perceptually
`important attributes in speech is the fact that distortion of the
`speech waveform by processing with an all-pass filter results
`in essentially no audible difference if the impulse response of
`the all-pass filter is reasonably short but can result in a
`substantial mean-square error between the original and filtered
`speech. In other words, mean-square error is sensitive to phase
`of the spectrum whereas perception tends not to be.
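The all-pass argument is easy to check numerically. The sketch below is an illustration under stated assumptions rather than an experiment from the paper: a white-noise sequence stands in for a speech segment, the all-pass system is a first-order section with coefficient a = 0.5 (which has a short impulse response), and the filtering is done circularly in the DFT domain for simplicity. The function name and all parameter values are hypothetical.

```python
import numpy as np

def circular_allpass(x, a=0.5):
    """Apply the first-order all-pass H(z) = (a + z^-1) / (1 + a z^-1)
    in the DFT domain (circular filtering, chosen here for simplicity)."""
    x = np.asarray(x, dtype=float)
    w = 2.0 * np.pi * np.arange(len(x)) / len(x)
    z_inv = np.exp(-1j * w)
    H = (a + z_inv) / (1.0 + a * z_inv)           # |H| = 1 at every frequency
    return np.fft.ifft(np.fft.fft(x) * H).real

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)                    # stand-in for a speech segment
y = circular_allpass(x)

# The spectral magnitude is untouched, yet the waveform error is large.
magnitude_change = np.max(np.abs(np.abs(np.fft.fft(y)) - np.abs(np.fft.fft(x))))
normalized_mse = np.mean((x - y) ** 2) / np.mean(x ** 2)
```

Here the magnitude spectrum changes only by rounding error, while the mean-square error between input and output is comparable to the signal power itself, which is the mismatch between the error criterion and perception described above.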
`Masking and bandpass filtering represent two simple ways
`in which perceptual aspects of the auditory system can be
`exploited in speech enhancement. Another system whose
`motivation depends heavily on aspects of speech perception
`was proposed by Thomas and Niederjohn [12] as a preprocessor
`prior to the introduction of noise in those applications
`where noise-free speech is available for processing. In essence,
`their system applies high-pass filtering to reduce or remove the
`first formant followed by infinite clipping. The motivation
`for the system lies in the observation that at a given
`signal-to-noise ratio infinite clipping will increase, relative to the
`vowels, the amplitude of the perceptually important
`low-amplitude events such as consonants thus making them less
`susceptible to masking by noise. In addition, for vowels
`the filtering will increase the amplitude of higher formants
`relative to the first formant, thus making the perceptually
`more important higher formants less susceptible to degradation.
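The effect of infinite clipping on relative amplitudes can be seen in a toy example. The sketch below is not the filter design of [12]: a simple first difference is assumed as the high-pass stage, and a loud and a quiet sinusoidal segment stand in for a vowel and a consonant. All names and values are hypothetical.

```python
import numpy as np

def highpass_then_clip(x):
    """First-difference high-pass (an assumed stand-in for the filter in [12])
    followed by infinite clipping, which keeps only the sign."""
    x = np.asarray(x, dtype=float)
    highpassed = np.diff(x, prepend=x[0])         # y[n] = x[n] - x[n-1]
    return np.sign(highpassed)

n = np.arange(400)
tone = np.sin(2.0 * np.pi * 0.07 * n)
loud = 1.0 * tone[:200]        # stand-in for a high-amplitude vowel
quiet = 0.01 * tone[200:]      # stand-in for a low-amplitude consonant
clipped = highpass_then_clip(np.concatenate([loud, quiet]))

# Peak-amplitude ratio between the two segments, before and after processing.
ratio_before = np.max(np.abs(loud)) / np.max(np.abs(quiet))
ratio_after = np.max(np.abs(clipped[:200])) / np.max(np.abs(clipped[200:]))
```

After clipping, both segments have the same peak amplitude, so the low-amplitude event is raised relative to the high-amplitude one, at the cost of discarding all amplitude information in the waveform.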
`In the speech enhancement problem considered in this
`paper, noise-free speech is not available for processing as
`required in the above system. Thomas and Ravindran [13],
`however, applied high-pass filtering followed by infinite
`clipping to noisy speech as an experiment. While quality may
`be degraded by the process of filtering and clipping, they claim
`a noticeable improvement in intelligibility when applied to
`enhance speech degraded by wide-band random noise. One
`possible explanation may be that the high-pass filtering operation
`reduces the masking of perceptually important higher
`formants by the relatively unimportant low-frequency
`components.
`Another system which relies heavily on human perception of
`speech was proposed by Drucker [14]. Based on some
`perceptual tests, Drucker concluded that one primary cause for
`the intelligibility loss in speech degraded by wide-band random
`noise is the confusion among the fricative and plosive sounds
`which is partly due to the loss of short pauses immediately
`before the plosive sounds. By high-pass filtering one of the
`fricative sounds, the /s/ sound, and inserting short pauses
`before the plosive sounds (assuming that their locations can
`be accurately determined), Drucker claims a significant
`improvement in intelligibility.
`In discussing perceptual attributes we indicated that the
`short-time spectral magnitude is generally considered to be
`important whereas the phase is relatively unimportant. This
`forms the basis for a class of speech enhancement systems
`which attempt in various ways to estimate the short-time
`spectral magnitude of the speech without particular regard to
`the phase and to use this to recover or reconstruct the speech.
`This class of systems includes spectral subtraction techniques
`originally due to Weiss et al. [15], [16], and which have
`recently received a great deal of attention [17]-[22] and
`optimum filtering techniques such as Wiener filtering and
`power spectrum filtering. These systems will be discussed in
`considerable detail in Section III. As we will see, many of
`these systems which appear on the surface to be different
`are in fact identical or very closely related.
`In addition to directly or indirectly utilizing perceptual
`attributes most enhancement systems rely to varying degrees
`on aspects of speech production. For example, in Section IV,
`we describe in detail a variety of systems that attempt, in
`some way, to capitalize on short-time periodicity of speech
`during voiced sounds. As a consequence of this periodicity,
`during voiced intervals the speech spectrum has a harmonic
`structure which suggests the possibility of applying comb
`filtering or, as proposed by Parsons [23], attempting to extract
`in other ways the components of the speech spectrum only
`at the harmonic frequencies. In essence, knowledge of the
`harmonic structure of voiced sounds allows us in principle to
`remove the noise in the spectral bands between the harmonics.
`As discussed in Section IV, speech enhancement by comb
`filtering can also be viewed in terms of averaging successive
`periods of the noisy speech to partially cancel the noise.
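The period-averaging view of comb filtering can be sketched as follows. This is a minimal illustration, assuming the pitch period is known exactly and constant over the segment, which real speech only approximates. The function name, the three-period span, and the synthetic periodic signal are assumptions made for the example.

```python
import numpy as np

def average_periods(y, period, half_span=1):
    """Replace each sample by the mean of the samples one or more whole
    periods before and after it (those that fall inside the segment)."""
    y = np.asarray(y, dtype=float)
    acc = np.zeros_like(y)
    count = np.zeros_like(y)
    for k in range(-half_span, half_span + 1):
        shift = k * period
        if shift >= 0:
            acc[:len(y) - shift] += y[shift:]     # adds y[n + shift] to out[n]
            count[:len(y) - shift] += 1
        else:
            acc[-shift:] += y[:len(y) + shift]    # adds y[n - |shift|] to out[n]
            count[-shift:] += 1
    return acc / count

rng = np.random.default_rng(1)
period = 40
clean = np.tile(rng.standard_normal(period), 10)  # exactly periodic stand-in
noisy = clean + 0.5 * rng.standard_normal(len(clean))
enhanced = average_periods(noisy, period)

mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((enhanced - clean) ** 2)
```

An exactly periodic signal passes through the averaging unchanged, while uncorrelated noise samples taken one period apart partially cancel, so the error relative to the clean signal drops. When the assumed period is wrong or the pitch varies, the speech itself is also smeared, which is the sensitivity to modeling assumptions noted earlier in this section.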
`Another system, which attempts to take advantage of the
`quasi-periodic nature of the speech was proposed by Sambur
`[24]. As developed in more detail in Section IV, his system
`is based on the principles of adaptive noise cancelling. Unlike
`the classical procedure, Sambur's method is designed to cancel
`out the clean speech signal, taking advantage of the quasi-periodic
`nature of the speech to form an estimate of the
`speech at each time instant from the value of the signal one
`period earlier.
`In the model of speech production, we represented the
`speech signal as generated by exciting a quasi-stationary linear
`system with a pulse train for voiced speech and noise for
`unvoiced speech. Based on this model, an approach to speech
`enhancement is to attempt to estimate parameters of the
`model rather than the speech itself and to then use this to
`synthesize the speech, i.e., to enhance speech through the
`use of an analysis-synthesis system. A particularly novel
`application of this concept was used by Miller [25] to remove
`the orchestral accompaniment from early recordings of Enrico
`Caruso. In this system homomorphic deconvolution was used
`to estimate the impulse response of the model in Fig. 1. A
`similar approach to noise reduction was proposed by Suzuki
`[26], [27] whereby the short-time correlation function of
`the degraded speech is used as an estimate of the impulse
`response of the linear system. This system is referred to as
`splicing of autocorrelation function (SPAC). A modification
`of SPAC is referred to as splicing of cross-correlation function
`(SPOC). A number of systems also attempt to model
`the vocal-tract impulse response in more detail. As we discussed
`previously the vocal-tract transfer function is characterized
`by a set of resonances or formants that are perceptually
`important. This suggests the possibility of representing the
`vocal-tract impulse response in terms of a pole-zero model
`with the analysis procedure directed at estimating the associated
`parameters. The poles in particular would provide a
`reasonable representation of the formants.
`All-pole modeling of speech has had notable success in
`analysis-synthesis systems for clean speech. A number of
`recent efforts have been directed toward estimating the parameters
`in an all-pole model from noisy observations of the
`speech such as the systems by Magill and Un [28], Lim and
`Oppenheim [29], Lim [18], and Done and Rushforth [30].
`Extensions to pole-zero modeling have also been proposed
`by Musicus and Lim [31] and Musicus [32]. These various
`approaches are described and compared in detail in Section V.
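For readers unfamiliar with all-pole analysis, the sketch below shows the conventional autocorrelation method of linear prediction applied to a clean synthetic signal. It is a generic textbook procedure and not any of the noise-aware estimators cited above. The function name, the second-order model, and its coefficients are assumptions made for illustration.

```python
import numpy as np

def lpc_autocorrelation(x, order):
    """Predictor coefficients a[0..order-1] such that
    x[n] is approximated by sum_k a[k] * x[n-1-k]."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation lags 0..order of the analysis segment.
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = [r1, ..., r_order].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

# Synthetic all-pole signal with known (hypothetical) coefficients.
true_a = np.array([1.5, -0.8])
rng = np.random.default_rng(2)
e = rng.standard_normal(20000)
x = np.zeros(len(e))
for n in range(len(e)):
    x[n] = e[n]
    if n >= 1:
        x[n] += true_a[0] * x[n - 1]
    if n >= 2:
        x[n] += true_a[1] * x[n - 2]

estimated_a = lpc_autocorrelation(x, order=2)
```

On clean data the estimate lies close to the true coefficients. Adding noise to x before the analysis changes the autocorrelation values and therefore generally biases the estimate, which is the difficulty that estimators designed for noisy observations are intended to address.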
`The above discussion was intended as a brief overview of
`the general approaches to speech enhancement. In the next
`three sections we explore in more detail many of the systems
`mentioned above. In particular, in Section III, we focus on
`speech-enhancement techniques based on short-time spectral
`amplitude estimation. In Section IV our focus is on speech
`enhancement based on periodicity of voiced speech and in
`Section V on speech-enhancement techniques using an
`analysis-synthesis procedure.
`
`III. SPEECH ENHANCEMENT TECHNIQUES BASED ON
`SHORT-TIME SPECTRAL AMPLITUDE ESTIMATION
`In general, in enhancement of a signal degraded by additive
`noise, it is significantly easier to estimate the spectral amplitude
`associated with the original signal than it is to estimate
`both amplitude and phase. As we discussed in Section II,
`it is principally the short-time spectral amplitude rather than
`phase that is important for speech intelligibility and quality.
`As we discuss in this section, there are a variety of
`speech-enhancement techniques that capitalize on this aspect of
`speech perception by focusing on enhancing only the short-time
`spectral amplitude. The techniques to be discussed can
`be broadly classified into two groups. In the first, presented
`in Section III-A, the short-time spectral amplitude is estimated
`in the frequency domain, using the spectrum of the degraded
`speech. Each short-time segment of the enhanced speech
`waveform in the time domain is then obtained by inverse
`transforming this spectral amplitude estimate combined with
`the phase of the degraded speech. In the second class,
`discussed in Section III-B, the degraded speech is first used to
`obtain a filter which is then applied to the degraded speech.
`Since these procedures lead to zero-phase filters, it is again
`only the spectral amplitude that is enhanced, with the phase
`of the filtered speech being identical to that of the degraded
`speech.
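The first class of techniques can be sketched for a single short-time segment as follows. This is a minimal illustration and not a specific published system: the amplitude estimator used here (subtracting an assumed noise power from the squared magnitude and setting negative results to zero) is one simple choice among many, and the function name, Hanning window, and frame length are assumptions. The noise power is taken to be known and expressed in the same units as the squared DFT magnitude of the windowed frame.

```python
import numpy as np

def enhance_frame(noisy_frame, noise_power, window=None):
    """Estimate the short-time spectral amplitude of one frame and
    resynthesize it using the phase of the degraded speech."""
    n = len(noisy_frame)
    w = np.hanning(n) if window is None else np.asarray(window, dtype=float)
    Y = np.fft.rfft(np.asarray(noisy_frame, dtype=float) * w)
    # Assumed estimator: subtract the noise power, clamp negatives to zero.
    amplitude = np.sqrt(np.maximum(np.abs(Y) ** 2 - noise_power, 0.0))
    phase = np.angle(Y)                           # phase of the degraded speech
    return np.fft.irfft(amplitude * np.exp(1j * phase), n)

rng = np.random.default_rng(3)
frame = rng.standard_normal(256)                  # stand-in for a noisy frame
unchanged = enhance_frame(frame, noise_power=0.0)
silenced = enhance_frame(frame, noise_power=1e12)
```

With zero assumed noise power the windowed frame is returned unchanged, and with a very large value every spectral component is removed. Practical settings lie in between. A complete system would process overlapping frames and combine the outputs, which is omitted here.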
`In both classes of systems discussed below no conceptual
`distinction is made between voiced and unvoiced speech and in
`particular in contrast to the techniques to be discussed in
`Section IV the periodicity of voiced speech is not exploited.
`Both classes of systems in this section are most easily interpreted
`in terms of a stochastic characterization of the speech
`signal. While this characterization is more justifiable for
`unvoiced speech it has been shown empirically to also lead
`to successful procedures for voiced speech.
`
`A. Speech Enhancement Based on Direct Estimation
`of Short-Time Spectral Amplitude
`When a stationary random signal s(n) has been degraded by
`uncorrelated additive noise d(n) with a known power density
`spectrum, the power density spectrum or spectral amplitude
`of the signal is easily estimated through a process of spectral
`subtraction. Specifically, if
`
`$$y(n) = s(n) + d(n) \tag{1}$$
`
`and $P_y(\omega)$, $P_s(\omega)$, and $P_d(\omega)$ represent the power density
`spectra of y(n), s(n), and d(n), respectively, then
`
`$$P_y(\omega) = P_s(\omega) + P_d(\omega). \tag{2}$$
`
`Consequently, a reasonable estimate for $P_s(\omega)$ is obtained by
`subtracting the known spectrum $P_d(\omega)$ from an estimate of
`$P_y(\omega)$ developed from the observations of y(n).
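The additivity of the power density spectra follows in one step from the autocorrelation functions. The derivation below is a sketch under additional assumptions not stated explicitly above: s(n) and d(n) are real and jointly stationary, and the noise has zero mean. The symbols $\phi_y(k)$, $\phi_s(k)$, and $\phi_d(k)$ for the autocorrelation functions are notation introduced here for illustration.

$$
\begin{aligned}
\phi_y(k) &= E[\,y(n)\,y(n+k)\,] \\
&= E[\,s(n)s(n+k)\,] + E[\,d(n)d(n+k)\,] + E[\,s(n)d(n+k)\,] + E[\,d(n)s(n+k)\,] \\
&= \phi_s(k) + \phi_d(k).
\end{aligned}
$$

The two cross terms vanish because s(n) and d(n) are uncorrelated and d(n) has zero mean. Taking the Fourier transform of both sides gives $P_y(\omega) = P_s(\omega) + P_d(\omega)$, so that $P_s(\omega)$ can be estimated by subtraction when $P_d(\omega)$ is known.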
`Speech, of course, is not a stationary signal. However, with
`s(n) in (1) now representing a speech signal and with the
`processing to be carried out on a short-time basis we consider s(n),
`d(n), and y(n) multiplied by a time-limited window w(n).
`With $y_w(n)$, $d_w(n)$, and $s_w(n)$ denoting the windowed signals
`y(n), d(n), and s(n) and $Y_w(\omega)$, $D_w(\omega)$, and $S_w(\omega)$ as their
`respective Fourier transforms we have
`
`$$y_w(n) = s_w(n) + d_w(n) \tag{3}$$
`
`and
`
`$$|Y_w(\omega)|^2 = |S_w(\omega)|^2 + |D_w(\omega)|^2 + S_w(\omega) \cdot D_w^*(\omega) + S_w^*(\omega) \cdot D_w(\omega) \tag{4}$$
`
`where $D_w^*(\omega)$ and $S_w^*(\omega)$ represent complex conjugates of
`$D_w(\omega)$ and $S_w(\omega)$. The function $|S_w(\omega)|^2$ will be referred
`to as the short-time energy spectrum of speech. For speech
`enhancement based on the short-time spectral amplitude, the
`objective is to obtain an estimat
