`
`
`Speech Enhancement Using a Soft-Decision Noise
`Suppression Filter
`
Abstract-One way of enhancing speech in an additive acoustic noise
environment is to perform a spectral decomposition of a frame of noisy
speech and to attenuate a particular spectral line depending on how
much the measured speech plus noise power exceeds an estimate of the
background noise. Using a two-state model for the speech event (speech
absent or speech present) and using the maximum likelihood estimator
of the magnitude of the speech spectrum results in a new class of
suppression curves which permits a tradeoff of noise suppression against
speech distortion. The algorithm has been implemented in real time in
the time domain, exploiting the structure of the channel vocoder.
Extensive testing has shown that the noise can be made imperceptible by
proper choice of the suppression factor.

I. INTRODUCTION
THE need for secure military voice communication has led
to the consideration of narrow-band digital voice terminals.
A preferred algorithm for this task is linear-predictive
coding (LPC) which has demonstrated the ability to produce
very intelligible speech with diagnostic rhyme test (DRT) scores
in excess of 90 percent at data rates as low as 2400 bits/s [1].
Unfortunately, these results have been achieved only for clean
speech, whereas many of the practical environments in which
these terminals would be deployed, such as the airborne command
post or the cockpits of jet fighter aircraft and helicopters,
are characterized by a high ambient noise level, which in many
cases causes the vocoded speech to suffer a significant degradation
in intelligibility [2]. This has stimulated research into the
problem of extracting the speech parameters (pitch, buzz-hiss,
and spectrum) from noisy speech in the hope that more robust
algorithms could be found [3]-[5].
Another approach to the noisy speech problem is to develop
a prefilter that would enhance the speech prior to encoding so
that the existing LPC vocoder could be applied in tandem
without modification. Two general classes of algorithms have
emerged: noise canceling and noise suppression prefilters. In
the first case, the coefficients of a tapped delay line are adapted
to produce a minimum mean-squared error estimate of the
noise signal which is then subtracted from the noisy speech
waveform to effect the noise cancellation [6]. In order to
train the coefficients of the noise-canceling filter, it is usually
necessary to use a second microphone to provide a speech-free
measurement of the background noise. Application of this
technique to the cancellation of E4A advanced airborne command
post noise has shown that although significant improvement
in signal-to-noise ratio (SNR) can be obtained, the improvement
in intelligibility, as measured by the diagnostic
rhyme test (DRT), is marginal [7]. Recent work by Sambur
[8] has attempted to exploit the periodicity of voiced speech
to eliminate the requirement for a second microphone. Thorough
evaluation of this algorithm has not yet been published.
Considerably more work has been expended on the development
of noise suppression prefilters. In this approach, a spectral
decomposition of a frame of noisy speech is performed,
and a particular spectral line is attenuated depending on how
much the measured speech plus noise power exceeds an estimate
of the background noise power [9]-[13]. Algorithms
using the FFT have been tested against wide-band noise and
improvements in intelligibility have been indicated, although
no quantitative results have been given [11]. To date, the
attenuation curves have been proposed on more or less an ad
hoc basis; hence, it is of interest to determine whether or not a
more fundamental theoretical analysis could lead to a new
suppression curve with substantially different properties.
In the next section, an analytical model is proposed and used to
determine the conditions under which the existing suppression
curves can be justified. Having established a common basis, a
new suppression curve is derived, recognizing the fact that the
degree of suppression should be weighted by the probability
that a given measurement corresponds to speech plus noise or
to noise alone. It is shown that a class of curves is obtained by
varying the value of a suppression factor. This is a parameter
that can be chosen to trade off noise suppression against speech
distortion. The algorithm has been implemented in real time
in the time domain, exploiting the structure of the channel
vocoder to perform the spectral decomposition. Extensive
testing has shown that the noise can be made imperceptible by
proper choice of the suppression factor.
`
II. ANALYSIS
The prefilter design problem arises because a speech signal
$s(t)$ has been corrupted by acoustically coupled background
noise $w(t)$ to form the measurement $y(t) = s(t) + w(t)$. In
speech, it is not easy to specify a criterion which would lead
to a “best” estimate of $s(t)$; hence, a variety of algorithms are
often proposed and evaluated by listening to the processed
results. In order to provide a common theoretical basis for
relating some of these algorithms, it has been found useful to
`
Manuscript received July 13, 1979; revised November 26, 1979. This
`work was supported by the Department of the Air Force. The views
`and conclusions contained in this document are those of the contractor
`and should not be interpreted as necessarily representing
`the official
`policies, either expressed or implied, of the United States Government.
`The authors are with the M.I.T. Lincoln Laboratory, Lexington, MA
`02173.
`
0096-3518/80/0400-0137$00.75 © 1980 IEEE
`
`WAVES345_1008-0001
`
`Petitioner Waves Audio Ltd. 345 - Ex. 1008
`
`
`
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-28, NO. 2, APRIL 1980

analyze the prefilter for a frame of data of length $T$ ($T \approx 20$ ms).
A further simplification occurs by expanding $y(t)$ in
terms of a set of basis functions $\{\phi_n(t)\}$ in such a way that the
expansion coefficients are uncorrelated random variables. If
the covariance function of $y(t)$ is $R_y(t, u)$, then a suitable set
of basis functions is obtained from the Karhunen-Loève
expansion

$$\lambda(n)\,\phi_n(t) = \int_0^T R_y(t, u)\,\phi_n(u)\,du, \qquad 0 < t < T. \tag{1}$$

Then on $(0, T)$

$$y(t) = \lim_{N \to \infty} \sum_{n=1}^{N} y_n\,\phi_n(t), \qquad y_n = \int_0^T y(t)\,\phi_n^*(t)\,dt. \tag{2}$$

Van Trees [14] shows that if the correlation time of $y(t)$ is
less than the frame interval $T$, then an appropriate set of
eigenfunctions and eigenvalues are

$$\phi_n(t) = \frac{1}{\sqrt{T}}\exp(jn\omega_0 t), \qquad \lambda(n) = P_y(n\omega_0) \tag{3}$$

where $\omega_0 = 2\pi/T$ and

$$P_y(\omega) = \int_{-\infty}^{\infty} R_y(\tau)\,e^{-j\omega\tau}\,d\tau \tag{4}$$

is the power spectrum of the observed process. Since a narrow-band
vocoder usually operates over a bandwidth less than 4
kHz, only a finite number of expansion coefficients are needed
to characterize $y(t)$. The prefilter design problem then reduces
to the problem of optimally extracting the speech random
variable $s_n$ from the noisy observation $y_n = s_n + w_n$. If the
speech and the noise are modeled as independent Gaussian random
processes, then the expansion coefficients are independent
Gaussian random variables with variances

$$E|s_n|^2 = \lambda_s(n), \qquad E|w_n|^2 = \lambda_w(n)$$

where

$$\lambda_s(n) = P_s(n\omega_0), \qquad \lambda_w(n) = P_w(n\omega_0) \tag{5}$$

represent the power in the nth harmonic line of the speech and
noise spectra.

A. Power Subtraction
Since it is well known that the perception of speech is phase
insensitive, a reasonable criterion for a prefilter design is to
produce the speech estimate

$$\hat{s}(t) = \sum_{n=1}^{N} \hat{s}_n\,\phi_n(t) \tag{6}$$

where $|\hat{s}_n| = \sqrt{\lambda_s(n)}$, since if $\lambda_s(n)$ were known, the spectrum of
$\hat{s}(t)$ would be identical to the spectrum of $s(t)$. Of course, it is
not known and provision must be made for estimating its value
from an observation of $y_n$ and knowledge of $\lambda_w(n)$. Since $y_n$
is a complex Gaussian variate with variance $\sigma_y^2(n)$, its real and
imaginary parts are Gaussian with variance $\sigma_y^2(n)/2$. Hence, the
probability density function for $y_n$ is

$$p(y_n) = \frac{1}{\pi\sigma_y^2(n)}\exp\left[-\frac{|y_n|^2}{\sigma_y^2(n)}\right]. \tag{7}$$

Since

$$\sigma_y^2(n) = \lambda_s(n) + \lambda_w(n), \tag{8}$$

then by maximizing $p(y_n)$ with respect to $\lambda_s(n)$, the maximum
likelihood estimate of $\lambda_s(n)$ can be found to be

$$\hat{\lambda}_s(n) = |y_n|^2 - \lambda_w(n). \tag{9}$$

In order to maintain an identity system in the absence of noise,
the input phase can be appended to the prefilter output by
taking

$$\hat{s}_n = \sqrt{\hat{\lambda}_s(n)}\,\frac{y_n}{|y_n|} = \left[\frac{|y_n|^2 - \lambda_w(n)}{|y_n|^2}\right]^{1/2} y_n \tag{10}$$

which is known as the method of power subtraction. Modifications
of this algorithm have been studied extensively by Boll
[10], Preuss [12], and Berouti et al. [13].

B. Wiener Filtering
Whereas the power subtraction algorithm arises from an
attempt to obtain the best estimate of the speech spectrum,
the Wiener filter corresponds to the criterion of minimizing
the mean-squared error of best time domain fit to the speech
waveform. Van Trees [14, pp. 198-206] has shown that this
can be done by choosing the channel coefficients to be

$$\hat{s}_n = \frac{\lambda_s(n)}{\lambda_s(n) + \lambda_w(n)}\,y_n. \tag{11}$$

Since the speech eigenvalues are unknown a priori, the maximum
likelihood estimate developed in (9) can be used in (11)
to result in the suppression rule

$$\hat{s}_n = \frac{|y_n|^2 - \lambda_w(n)}{|y_n|^2}\,y_n \tag{12}$$

which is simply the square of the suppression rule for the
method of power subtraction.

C. Maximum Likelihood Envelope Estimation
The previous results were obtained assuming that the speech
and the noise were independent Gaussian random processes.
In the interest of exploring the importance of this assumption,
an alternative model is proposed in which the noise is a Gaussian
random process, while the speech is characterized by a
deterministic waveform of unknown amplitude and phase. In
this case, the channel measurement is $y_n = s_n + w_n$ where now
$s_n = A\exp(j\theta)$, where $A$ determines the speech envelope and
$\theta$ its phase. For the perception of speech, an optimum estimate
of its envelope is desired since this would represent an
estimate of the speech spectrum in the nth channel. For
Gaussian noise, the probability density function of the channel
measurement $y_n$ is

$$p(y_n \mid A, \theta) = \frac{1}{\pi\lambda_w(n)}\exp\left[-\frac{|y_n|^2 - 2A\,\mathrm{Re}(e^{-j\theta}y_n) + A^2}{\lambda_w(n)}\right]. \tag{13}$$

To obtain the maximum likelihood estimate of $A$, a maximum
of $p(y_n \mid A, \theta)$ is sought. However, the speech phase $\theta$ shows
up as a nuisance parameter. Its effect can be eliminated by
maximizing the average likelihood function

$$p(y_n \mid A) = \int_0^{2\pi} p(y_n \mid A, \theta)\,p(\theta)\,d\theta. \tag{14}$$

If $\theta$ is assumed to be uniformly distributed on
$(0, 2\pi)$, then the likelihood function for the spectral envelope
becomes

$$p(y_n \mid A) = \frac{1}{\pi\lambda_w(n)}\exp\left[-\frac{|y_n|^2 + A^2}{\lambda_w(n)}\right]\frac{1}{2\pi}\int_0^{2\pi}\exp\left[\frac{2A\,\mathrm{Re}(e^{-j\theta}y_n)}{\lambda_w(n)}\right]d\theta. \tag{15}$$

The integral appearing in (15) is known as the modified Bessel
function of the first kind and is labeled

$$I_0(|x|) = \frac{1}{2\pi}\int_0^{2\pi}\exp\left[\mathrm{Re}(e^{-j\theta}x)\right]d\theta \tag{16}$$

where $x = 2Ay_n/\lambda_w(n)$ depends on the a priori signal-to-noise
ratio $A^2/\lambda_w(n)$ and the a posteriori signal-to-noise ratio
$|y_n|^2/\lambda_w(n)$. For large values of $|x|$ ($> 3$), which represents a
constraint on the signal-to-noise ratios,

$$I_0(|x|) \approx \frac{1}{\sqrt{2\pi|x|}}\exp(|x|). \tag{17}$$

For this condition, the likelihood function for the spectral
envelope becomes

$$p(y_n \mid A) \approx \frac{1}{\pi\lambda_w(n)}\,\frac{1}{\sqrt{2\pi \cdot 2A|y_n|/\lambda_w(n)}}\exp\left[-\frac{|y_n|^2 - 2A|y_n| + A^2}{\lambda_w(n)}\right]. \tag{18}$$

Maximizing this function with respect to $A$ leads to the estimator

$$\hat{A} = \frac{1}{2}\left[|y_n| + \sqrt{|y_n|^2 - \lambda_w(n)}\right]. \tag{19}$$

With the measured phase appended, the corresponding suppression rule is

$$\hat{s}_n = \hat{A}\,\frac{y_n}{|y_n|} = \left[\frac{1}{2} + \frac{1}{2}\sqrt{\frac{|y_n|^2 - \lambda_w(n)}{|y_n|^2}}\right] y_n. \tag{20}$$

D. Two-State Soft Decision Maximum Likelihood Envelope
Estimation
The suppression curves for the power subtraction, Wiener filtering,
and maximum likelihood algorithms are illustrated in
Fig. 1. Their suppression capabilities were evaluated for speech
in airborne command post noise using a real-time implementation
of the prefilter (to be described in detail in Section III).
While it was difficult to determine which algorithm did the
best job of extracting the speech when speech was present, it
was apparent that none of the algorithms adequately suppressed
the background noise when speech was absent. This is hardly
surprising in view of the fact that the suppression rules were
derived on the assumption that speech was always present in
the measured data. Had a detector been used to determine
that a given frame of data consisted of noise alone, then obviously
a better suppression rule would have been to apply
greater attenuation than indicated by the curves in Fig. 1.

Fig. 1. Power subtraction, Wiener filter, and maximum likelihood
suppression rules (gain in dB versus speech-to-noise ratio in dB).

From this point of view, it follows that a better suppression
curve might evolve if a two-state model for the speech event is
considered at the outset, that is, either speech is present or it is
not. Mathematically, this leads to the binary hypothesis
model

$$\begin{aligned} H_0 &: \text{speech absent:} & |y_n| &= |w_n| \\ H_1 &: \text{speech present:} & |y_n| &= |Ae^{j\theta} + w_n|. \end{aligned} \tag{21}$$

Only the measured envelope is used in this measurement model
since it has already been shown that the measured phase provides
no useful information in the suppression of the noise.
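The three single-state rules of Sections II-A to II-C can be compared numerically. The following is a minimal sketch, not the authors' implementation: it assumes the gains take the forms implied by the text (power subtraction as the square root of the normalized excess power, the Wiener rule as its square, and the maximum likelihood rule as the average of unity and the power subtraction gain). The function name `suppression_gains_db` and the clamp to zero when the measurement falls below the noise estimate are assumptions made for illustration.

```python
import math


def suppression_gains_db(snr_post_db):
    """Return (power_subtraction, wiener, max_likelihood) channel gains in dB.

    snr_post_db is the measured (a posteriori) ratio |y_n|^2 / lambda_w in dB.
    """
    r = 10.0 ** (snr_post_db / 10.0)      # |y_n|^2 / lambda_w, linear scale
    # Assumption: clamp at zero when the measured power is below the noise estimate.
    q = max(1.0 - 1.0 / r, 0.0)           # (|y_n|^2 - lambda_w) / |y_n|^2
    g_ps = math.sqrt(q)                   # power subtraction gain
    g_wf = q                              # Wiener gain (square of the above)
    g_ml = 0.5 * (1.0 + math.sqrt(q))     # maximum likelihood envelope gain

    def to_db(g):
        return 20.0 * math.log10(g) if g > 0.0 else float("-inf")

    return to_db(g_ps), to_db(g_wf), to_db(g_ml)
```

Under these assumptions the maximum likelihood gain never falls below 1/2 (about -6 dB), however weak the measurement, which is consistent with the observation that the single-state rules leave residual noise when speech is absent.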
`
`
A useful criterion for estimating the spectral envelope $A$ is to
choose $\hat{A}$ to minimize the mean-squared spectral error
$E(A - \hat{A})^2$. It is well known [14] that the resulting estimator is the
conditional mean $\hat{A} = E(A \mid V)$ where $V = |y_n|$ is used for
notational convenience to represent the measured envelope.
Reference to the nth channel will be implied. In this formulation,
the expectation operator is used to indicate averaging
over the ensemble of noise sample functions, speech envelopes
and phases, and the ensemble of speech events. The averaging
for the latter case is carried out explicitly and results in the
estimator

$$\hat{A} = E(A \mid V, H_1)P(H_1 \mid V) + E(A \mid V, H_0)P(H_0 \mid V) \tag{22}$$

where $P(H_k \mid V)$ is the probability that the speech is in state $H_k$
given that the measured envelope has the value $V$. Since
$E(A \mid V, H_0)$ represents the average value of $A$ given an observation
$V$ and the fact that speech is absent, then obviously this
value must be zero; hence, (22) reduces to

$$\hat{A} = E(A \mid V, H_1)P(H_1 \mid V). \tag{23}$$

Since $E(A \mid V, H_1)$ represents the minimum variance estimate
of $A$ when speech is present, and since the maximum likelihood
estimator is asymptotically efficient for large SNR, it
suffices to replace $E(A \mid V, H_1)$ by the estimator derived in
(19); hence,

$$\hat{A} = \frac{1}{2}\left[V + \sqrt{V^2 - \lambda_w}\right]P(H_1 \mid V). \tag{24}$$

Application of Bayes rule gives

$$P(H_1 \mid V) = \frac{p(V \mid H_1)P(H_1)}{p(V \mid H_1)P(H_1) + p(V \mid H_0)P(H_0)} \tag{25}$$

where $p(V \mid H_k)$ is the a priori probability density function for
the measured envelope given the speech state $H_k$. Assuming
that the speech and noise states are equally likely (a worst case
assumption),

$$P(H_1) = P(H_0) = \tfrac{1}{2}.$$

Under hypothesis $H_0$, $V = |w|$, and since the noise is complex
Gaussian with mean zero and variance $\lambda_w$, it follows that the
envelope has the Rayleigh pdf

$$p(V \mid H_0) = \frac{2V}{\lambda_w}\exp\left(-\frac{V^2}{\lambda_w}\right). \tag{26}$$

Under hypothesis $H_1$, $V = |Ae^{j\theta} + w|$ and the envelope has
the Rician pdf

$$p(V \mid H_1) = \frac{2V}{\lambda_w}\exp\left(-\frac{V^2 + A^2}{\lambda_w}\right)I_0\left(\frac{2AV}{\lambda_w}\right). \tag{27}$$

Defining the a priori signal-to-noise ratio to be

$$\xi = \frac{A^2}{\lambda_w} \tag{28}$$

and substituting (26), (27), and (28) into (25) results in the
following expression for the a posteriori probability for the
presence of speech:

$$P(H_1 \mid V) = \frac{\exp(-\xi)\,I_0\left(2\sqrt{\xi V^2/\lambda_w}\right)}{1 + \exp(-\xi)\,I_0\left(2\sqrt{\xi V^2/\lambda_w}\right)}. \tag{29}$$

Fig. 2. A posteriori probability for the speech state (plotted against
speech-to-noise ratio in dB).

It is this term which contributes the “soft-decision” aspect to
the maximum likelihood envelope estimator in contradistinction
with “hard decision” for which the speech plus noise is
either passed as is or is suppressed completely. Appending the
measured phase to the estimated envelope in order to preserve
the identity system in the absence of noise, the final suppression
rule is then

$$\hat{s}_n = \frac{1}{2}\left[1 + \sqrt{\frac{V^2 - \lambda_w}{V^2}}\right]P(H_1 \mid V)\,y_n.$$
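The soft-decision rule can be evaluated directly from the Rayleigh and Rician envelope models with equal priors. The sketch below is illustrative only: the function names, the power-series evaluation of $I_0$, and the clamp applied when the measured power is below the noise estimate are assumptions, not details taken from the paper. Both SNR arguments are linear ratios, not dB.

```python
import math


def bessel_i0(x):
    """Modified Bessel function of the first kind, order zero (power series)."""
    q = (x / 2.0) ** 2
    total = term = 1.0
    k = 0
    while term > 1e-17 * total:
        k += 1
        term *= q / (k * k)       # next term: q^k / (k!)^2
        total += term
    return total


def speech_presence_probability(snr_post, xi):
    """P(H1 | V) for equal priors.

    snr_post = V^2 / lambda_w (measured SNR), xi = A^2 / lambda_w (suppression factor).
    """
    # Likelihood ratio of the Rician (speech present) to Rayleigh (speech absent) pdfs.
    ratio = math.exp(-xi) * bessel_i0(2.0 * math.sqrt(xi * snr_post))
    return ratio / (1.0 + ratio)


def soft_decision_gain(snr_post, xi):
    """Maximum likelihood envelope gain weighted by the probability of speech."""
    # Assumption: treat measurements below the noise estimate as zero excess power.
    q = 1.0 - 1.0 / snr_post if snr_post > 1.0 else 0.0
    g_ml = 0.5 * (1.0 + math.sqrt(q))
    return g_ml * speech_presence_probability(snr_post, xi)
```

For a weak measurement the probability term approaches $1/(1 + e^{\xi})$, so a larger suppression factor drives the gain well below the 1/2 floor of the single-state maximum likelihood rule, which is the tradeoff between noise suppression and speech distortion discussed in the text.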
`
In Fig. 2 several curves for the a posteriori probability for the
speech state $P(H_1 \mid V)$ are plotted as a function of the a posteriori
speech-to-noise ratio $V^2/\lambda_w$ (i.e., the measured SNR) for
various values of the a priori signal-to-noise ratio $\xi$. The channel
gains obtained when these a posteriori probabilities are appended
to the maximum likelihood suppression rule are shown
in Fig. 3. The two-state soft-decision maximum likelihood
algorithm applies considerably more suppression when the
measurement corresponds to low speech SNR. Since this case
“most likely” corresponds to noise alone, it is seen that the
effect of the residual noise (false alarms) should be considerably
reduced. When the speech SNR is large, the measured SNR
(i.e., the a posteriori SNR $V^2/\lambda_w$) will be large and it “most
likely” means that speech is present, in which case the original
maximum likelihood algorithm is the correct rule for extracting
the speech envelope.
In order to interpret the role of the parameter $\xi$, it is noted
that in a radar or communications context (from which the
`
preceding theory was extracted), one would choose $\xi$ (the
a priori SNR $A^2/\lambda_w$) in order to guarantee a specified performance
in terms of false alarms and missed detections. In
speech, however, one must deal with whatever SNR exists as a
consequence of the particular acoustic environment in which
one is forced to operate; hence, the concept of an a priori SNR
which can be controlled by the system designer is inappropriate.
In terms of a noise suppression prefilter, however, Fig. 3
shows that the parameter $\xi = A^2/\lambda_w$ simply controls the
amount of suppression applied to a particular frequency channel;
hence, it is convenient to refer to it as the “suppression
factor.” From this point of view, the theory has simply provided
the catalyst for generating a new class of suppression
curves.

Fig. 3. Suppression rules for maximum likelihood with soft suppression
(plotted against speech-to-noise ratio in dB).

III. IMPLEMENTATION
All of the noise suppression prefilters that have been reported
on to date have been implemented in the frequency domain.
This corresponds nicely to the theoretical orthogonal channel
decomposition used in Section II and exploits the properties
of the FFT for filtering by circular convolution. Since the
present work evolved from an attempt to implement a time-domain
Kalman filter based on a parallel formant model for
speech [15], and since a contemporary implementation of a
channel vocoder is being developed using CCD technology to
produce a package which operates at rates from 1.2 to 4.8
kbits/s, requires about 50 integrated circuits, occupies 0.22
ft³, requires 5 W, and weighs 5 lb [16], it seemed appropriate
to attempt a time-domain implementation of the prefilter that
could exploit this emerging technology. As in the channel
vocoder, 19 filters are used to span the frequency range 180-3720
Hz (the sampling rate was 7575 Hz). Each filter in the
bank is a result of a bandpass transformation of a second-order
Butterworth filter. The center frequencies and the bandwidths
for each of the filters in the bank are listed in Table I and a
plot of their linear magnitudes is shown in Fig. 4.

TABLE I
CHANNEL FILTER SPECIFICATIONS

Channel    Center            3 dB
Number     Frequency (Hz)    Bandwidth (Hz)
   0            240               120
   1            360               120
   2            480               120
   3            600               120
   4            720               120
   5            840               120
   6            975               150
   7           1125               150
   8           1275               150
   9           1425               150
  10           1575               150
  11           1750               200
  12           1950               200
  13           2150               200
  14           2350               300
  15           2600               300
  16           2900               300
  17           3200               300
  18           3535               370

Sampling Rate = 132 µs

Fig. 4. The channel vocoder filter bank (linear magnitude versus
frequency in kHz).

Although theory requires that the channels be orthogonal, in
practice, overlapping filters provide for spectral smoothing
which is known to be an important factor in the design of
noise suppression systems [11]. The filters in the channel
vocoder were originally chosen to provide a good compromise
for smoothing the envelope of the speech spectrum; hence,
their lack of orthogonality turns out to be an asset in this
particular case. Since the 19 filters span the frequency range of
the speech signal, the front end of the channel vocoder, in the
absence of noise, represents an identity system provided the
outputs of each of the channels are added alternately out of
phase, as shown in the block diagram in Fig. 5.
In order to compute the channel gains, measurements must
be made to determine the instantaneous signal power and the
average noise power at the output of each of the channel filters.
Since the speech parameters change very little in 20 ms,
some temporal smoothing can be exploited by computing the
signal power in the nth channel from

$$V_n^2 = \frac{1}{N}\sum_{k=1}^{N} y_n^2(k) \tag{32}$$

where $y_n(k)$ represents the signal sample out of the nth
channel at time $k$ where there are $N$ such samples in the 20 ms
frame (the normalization by $N$ will be unnecessary).
Determination of the background noise power requires
knowledge of whether or not a particular frame contains speech.
One approach to making this determination has been developed
by Roberts [17] who noted that a 4 s histogram of the
frame energies of the input signal was bimodal. He found that
by setting a detection threshold between the modes, correct
speech and noise classification could be made most of the
time. A modification of this algorithm, which is described in
detail in the Appendix, was used to determine with high confidence
the frames that were absent of speech. For those
frames, the average noise power in each channel was estimated
by smoothing the measurements in (32) using a 1 s time constant
according to the recursion

$$\lambda_w(m) = \lambda_w(m-1) + \alpha\left[V_n^2(m) - \lambda_w(m-1)\right] \tag{33}$$

where $V_n^2(m)$ and $\lambda_w(m)$ represent the measured power and the
average noise power computed for the mth frame. The major
drawback of this algorithm is the relatively long adaptation
time needed to determine the detection threshold and then the
additional training period required to learn the channel noise
powers.
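The per-channel power measurement and the recursive noise estimate can be sketched as follows. This is a minimal illustration rather than the real-time implementation: the function names are hypothetical, the speech/noise flags are taken as given (the paper obtains them from the modified Roberts detector), and the smoothing constant `alpha` is left as a parameter because its value is not stated beyond the 1 s time constant.

```python
def frame_power(samples):
    """Mean-square value of one channel's output samples over a frame."""
    return sum(x * x for x in samples) / len(samples)


def update_noise_power(prev_noise, measured_power, alpha):
    """One step of the recursion: new = prev + alpha * (measured - prev)."""
    return prev_noise + alpha * (measured_power - prev_noise)


def track_noise(frames, speech_flags, alpha, initial_noise=0.0):
    """Noise power estimate after each frame for a single channel.

    frames       -- list of per-frame sample lists for the channel
    speech_flags -- True where the detector classified the frame as speech
    alpha        -- smoothing constant (assumed; small values give long time constants)
    """
    noise = initial_noise
    history = []
    for samples, is_speech in zip(frames, speech_flags):
        if not is_speech:
            # Update only on frames judged to contain noise alone.
            noise = update_noise_power(noise, frame_power(samples), alpha)
        history.append(noise)
    return history
```

For example, `track_noise([[2, 2], [10, 10], [2, 2]], [False, True, False], 0.5)` holds the estimate through the middle (speech) frame and resumes adapting on the third.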
Using the measurement of $V_n^2(m)$ and the estimated average
value of the noise energy $\lambda_w(m-1)$, the gain factor

(34)

is computed for each channel.¹ Since $V_n^2(m)/\lambda_w(m-1)$ can
be expressed in terms of $g(m)$, then the noise suppression rule,
(30) and (31), can be written as

(35)

The advantage in using $g(m)$ as the independent variable is the
fact that $0 < g(m) < 1$ which permits the use of a simple software
divide routine in forming the normalization. For a given
value of the suppression factor $\xi$, the measured gain $g(m)$ is
used as a pointer for a table lookup to determine the attenuation
prescribed by (35). Fifteen tables corresponding to values
of $\xi = 1, 2, 3, \ldots, 15$ have been included in the prefilter, with
each table consisting of 50 values of the suppression rule computed
for equal increments of $g(m)$ from 0 to 1. No attempt
was made to optimize the design of these tables. All of the
coding was done in machine language on the LDVT [19],
which has the ability to key in a new value of the suppression
factor in real time. This meant that the prefilter could easily
be adjusted to accommodate a wide class of operational environments.
This turned out to be a significant capability for effective
noise suppression. Since the algorithm was designed to
operate in real time, a 10 ms delay had to be incurred between
the time the energies were measured and the time the corresponding
gains could be computed and applied to the channel
waveforms. This was done by computing the energies (block
floating point) in 10 ms segments and adding consecutive segments
together to produce the desired 20 ms energy measurement.
This permitted computation of the raw gains $G(m)$
every 10 ms. In order to avoid the introduction of discontinuities
in the output waveform, the final output is a smoothed
gain $\tilde{G}(m)$ obtained according to

$$\tilde{G}(m) = \tilde{G}(m-1) + \beta(m)\left[G(m) - \tilde{G}(m-1)\right]. \tag{36}$$

Since the introduction of smoothing can cause the prefilter to
be slow to respond to a leading edge transition, which could
result in speech distortion, the weighting factor in (36) is
chosen adaptively according to the rule

(37)

In this way, the prefilter responds immediately to an increase
in the SNR which should minimize the potential for leading
edge distortion. During a trailing edge, in which the gain will
be decreasing, the smoothed gain will be used which will tend
to maintain the speech signal even though the noise becomes
dominant. It is the gain $\tilde{G}(m)$ in (37) that is applied to the
waveform at the output of each of the channel filters. These
waveforms were then added together alternately 180° out of
phase to produce the prefilter output waveform $\hat{s}(t)$.

¹This is where the normalization by N in (32) disappears.

Fig. 5. Block diagram of the noise suppression prefilter.
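The gain smoothing described above can be sketched as a first-order recursion with an adaptive weight. The exact weighting rule is not reproduced here, so the sketch assumes the simplest behavior consistent with the description: the weight is 1 when the raw gain rises (so the output follows a leading edge immediately) and a constant `beta_decay` less than 1 when it falls (so the gain decays gradually on a trailing edge). The function name and the value of `beta_decay` are hypothetical.

```python
def smooth_gain(raw_gains, beta_decay=0.5, initial=0.0):
    """Smooth a sequence of raw channel gains computed every 10 ms.

    Implements smoothed(m) = smoothed(m-1) + beta(m) * (raw(m) - smoothed(m-1)).
    """
    smoothed = []
    prev = initial
    for g in raw_gains:
        # Assumed adaptive rule: follow increases at once, smooth decreases.
        beta = 1.0 if g > prev else beta_decay
        prev = prev + beta * (g - prev)
        smoothed.append(prev)
    return smoothed
```

With `beta_decay=0.5`, the raw sequence `[0.25, 1.0, 0.0, 0.0]` follows the rise to 1.0 at once and then decays through 0.5 and 0.25 instead of dropping straight to zero.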
`
`IV. EXPERIMENTAL RESULTS AND CONCLUSIONS
`Since the prefiltering algorithm operated in real time, it was
`possible to perform extensive listening tests on a large speech
`and noise database. It was of particular interest to determine
`the operational performance of the
`prefilter in conjunction
`with a 2400 bits/s vocoder operating in a background of E4A
`advanced airborne command post noise (ACPN). Source tapes
`were available for this environment, consisting of lists spoken
`by six male speakers for which a DRT score and a diagnostic
`acceptability measure (DAM) could be computed. The record-
`ings were made using both a high-quality Altec microphone
`and a noise-canceling microphone.
The first experiment consisted of listening to the output of
the prefilter for various values of the suppression factor. It
was always possible to select a suppression factor which would
`render the background noise imperceptible, although, for cases
`in which the SNR was low enough, the cost in doing this was
`the introduction of various
`degrees of speech distortion. In
`these cases, if the suppression factor was subsequently reduced,
`the speech distortion was reduced at the expense of introducing
`a perceptible level of background noise.
`In the next experiment, the prefilter was connected in tandem
`with the 2400 bit/s LPC vocoder which used the Gold-Rabiner
pitch estimator [19], [20]. An unexpected result was obtained.
`If the suppression factor was set to remove the residual noise
`at the output of the prefilter, then the speech quality at the
`vocoder was poor due
`to both buzz-hiss errors and spectral
`distortion. If, however, the suppression factor was chosen so
`that the noise at the vocoder
`output was negligible,
`then a
`significantly lower value of the suppression factor was needed
`and the speech quality was quite good, although the Gold-
`Rabiner algorithm continued to make buzz-hiss errors, but at
`a lower rate. In other words, LPC itself has some suppression
`capabilities against weak noise which can usefully be exploited
`in the tandem connection.
It was the flexibility in selecting
the prefilter suppression factor which made this result possible.
Since the deployment of the LPC vocoder does allow for
flexibility in the specification of the pitch extractor, it was of
`interest to determine whether or not algorithms that were
`specially designed to operate in noise would operate more ef-
`fectively in the tandem connection. Such an algorithm, based
`on maximum likelihood estimation techniques, has been under
`development for some time [21] and was chosen to be tested
`against the Gold-Rabiner algorithm. In the subjective listening
`tests it was found that, indeed, smoother pitch
`tracks could
`be obtained with a lower rate of buzz-hiss errors.
`Although the results of using the prefilter always produced
`subjectively more pleasant sounding speech to the ear since the
`annoying and tiresome background noise was removed, it was
`important to determine whether or not there
`was a corres-
`ponding quantitative improvement in intelligibility. To do
`this, DRT scores are being obtained for the prefiltered speech
and the speech out of the LPC tandem for both the Gold-Rabiner
and the maximum likelihood pitch extraction algorithms.
Results are currently being obtained for both the
Altec dynamic microphone and the confidencer noise-canceling
microphone and will be reported once all of the data have
been collected and analyzed.
`So far the focus has been on the 19-channel prefilter based
`on the principles of channel vocoder design. This was strictly
`a pragmatic choice which was made to facilitate the develop-
`ment of a real-time testbed. Questions relating to the number
`of filters, the bandwidths, and the choice of center frequen-
`cies remain to be addressed. Although the time-domain struc-
`ture of the channel prefilter is well suited to an analog imple-
`mentation using CCD technology, it is of interest to determine
`the tradeoffs with respect
`to a frequency-domain approach
`using the FFT. Whatever candidate system is chosen for evalu-
`ation, using the class of suppression rules developed in this
`study allows the overall design to be optimized with respect to
`the noise suppression/speech distortion tradeoff by choosing
`an appropriate suppression factor. In this way, performance
`differences can be attributed to the system design parameters
`independent of a particular suppression rule which may have
`represented a poor choice
`for the particular signal and noise
`conditions used in the evaluation.
`
`APPENDIX
`MODIFIED ROBERTS NOISE DETECTION ALGORITHM
`In order to estimate the statistics of the background noise, it
`is desirable to inspect only those frames of data which have a
`high probability of containing no speech. To accomplish this,
`an adaptive energy threshold marking the probable boundary
`between noise and noise plus speech is established by monitor-
`ing the energy on a frame-by-frame
`basis and maintaining
`energy histograms which
`reflect the bimodal distribution of
`the energy. The flowchart for the algorithm, shown in Fig. 6,
`is described in the following paragraphs.
`For each frame, the sum of the squares of the input samples
is computed. If this energy does not exceed 16 bits (i.e., does
not strongly imply the presence of speech), the adaptive threshold
algorithm is exercised. First, a decay factor of 0.995 is
applied to a 128-bin histogram of uniform
`ran