IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-28, NO. 2, APRIL 1980
`
Speech Enhancement Using a Soft-Decision Noise Suppression Filter
`
Abstract-One way of enhancing speech in an additive acoustic noise environment is to perform a spectral decomposition of a frame of noisy speech and to attenuate a particular spectral line depending on how much the measured speech plus noise power exceeds an estimate of the background noise. Using a two-state model for the speech event (speech absent or speech present) and using the maximum likelihood estimator of the magnitude of the speech spectrum results in a new class of suppression curves which permits a tradeoff of noise suppression against speech distortion. The algorithm has been implemented in real time in the time domain, exploiting the structure of the channel vocoder. Extensive testing has shown that the noise can be made imperceptible by proper choice of the suppression factor.

I. INTRODUCTION
THE need for secure military voice communication has led to the consideration of narrow-band digital voice terminals. A preferred algorithm for this task is linear-predictive coding (LPC) which has demonstrated the ability to produce very intelligible speech with diagnostic rhyme test (DRT) scores in excess of 90 percent at data rates as low as 2400 bits/s [1]. Unfortunately, these results have been achieved only for clean speech, whereas many of the practical environments in which these terminals would be deployed, such as the airborne command post or the cockpits of jet fighter aircraft and helicopters, are characterized by a high ambient noise level, which in many cases causes the vocoded speech to suffer a significant degradation in intelligibility [2]. This has stimulated research into the problem of extracting the speech parameters (pitch, buzz-hiss, and spectrum) from noisy speech in the hope that more robust algorithms could be found [3]-[5].

Another approach to the noisy speech problem is to develop a prefilter that would enhance the speech prior to encoding so that the existing LPC vocoder could be applied in tandem without modification. Two general classes of algorithms have emerged: noise canceling and noise suppression prefilters. In the first case, the coefficients of a tapped delay line are adapted to produce a minimum mean-squared error estimate of the noise signal which is then subtracted from the noisy speech waveform to effect the noise cancellation [6]. In order to train the coefficients of the noise-canceling filter, it is usually necessary to use a second microphone to provide a speech-free measurement of the background noise. Application of this technique to the cancellation of E4A advanced airborne command post noise has shown that although significant improvement in signal-to-noise ratio (SNR) can be obtained, the improvement in intelligibility, as measured by the diagnostic rhyme test (DRT), is marginal [7]. Recent work by Sambur [8] has attempted to exploit the periodicity of voiced speech to eliminate the requirement for a second microphone. Thorough evaluation of this algorithm has not yet been published.

Considerably more work has been expended on the development of noise suppression prefilters. In this approach, a spectral decomposition of a frame of noisy speech is performed, and a particular spectral line is attenuated depending on how much the measured speech plus noise power exceeds an estimate of the background noise power [9]-[13]. Algorithms using the FFT have been tested against wide-band noise and improvements in intelligibility have been indicated, although no quantitative results have been given [11]. To date, the attenuation curves have been proposed on more or less an ad hoc basis; hence, it is of interest to determine whether or not a more fundamental theoretical analysis could lead to a new suppression curve with substantially different properties.

In the next section, an analytical model is proposed and used to determine the conditions under which the existing suppression curves can be justified. Having established a common basis, a new suppression curve is derived, recognizing the fact that the degree of suppression should be weighted by the probability that a given measurement corresponds to speech plus noise or to noise alone. It is shown that a class of curves is obtained by varying the value of a suppression factor. This is a parameter that can be chosen to trade off noise suppression against speech distortion. The algorithm has been implemented in real time in the time domain, exploiting the structure of the channel vocoder to perform the spectral decomposition. Extensive testing has shown that the noise can be made imperceptible by proper choice of the suppression factor.
`
II. ANALYSIS
The prefilter design problem arises because a speech signal $s(t)$ has been corrupted by acoustically coupled background noise $w(t)$ to form the measurement $y(t) = s(t) + w(t)$. In speech, it is not easy to specify a criterion which would lead to a "best" estimate of $s(t)$; hence, a variety of algorithms are often proposed and evaluated by listening to the processed results. In order to provide a common theoretical basis for relating some of these algorithms, it has been found useful to
Manuscript received July 13, 1979; revised November 26, 1979. This work was supported by the Department of the Air Force. The views and conclusions contained in this document are those of the contractor and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the United States Government.
The authors are with the M.I.T. Lincoln Laboratory, Lexington, MA 02173.

0096-3518/80/0400-0137$00.75 © 1980 IEEE
`
`WAVES345_1008-0001
`
`Petitioner Waves Audio Ltd. 345 - Ex. 1008
`
`

`
`
analyze the prefilter for a frame of data of length $T$ ($T \approx 20$ ms). A further simplification occurs by expanding $y(t)$ in terms of a set of basis functions $\{\phi_n(t)\}$ in such a way that the expansion coefficients are uncorrelated random variables. If the covariance function of $y(t)$ is $R_y(t, u)$, then a suitable set of basis functions is obtained from the Karhunen-Loève expansion

$$\lambda_y(n)\,\phi_n(t) = \int_0^T R_y(t, u)\,\phi_n(u)\,du, \qquad 0 \le t \le T. \tag{1}$$

Then on $(0, T)$

$$y(t) = \sum_n y_n\,\phi_n(t), \qquad y_n = \int_0^T y(t)\,\phi_n^*(t)\,dt. \tag{2}$$

Van Trees [14] shows that if the correlation time of $y(t)$ is less than the frame interval $T$, then an appropriate set of eigenfunctions and eigenvalues are

$$\phi_n(t) = \frac{1}{\sqrt{T}} \exp(j n \omega_0 t), \qquad \lambda_y(n) = S_y(n\omega_0), \qquad \omega_0 = \frac{2\pi}{T} \tag{3}$$

where

$$S_y(\omega) = \int_{-\infty}^{\infty} R_y(\tau) \exp(-j\omega\tau)\,d\tau \tag{4}$$

is the power spectrum of the observed process. Since a narrow-band vocoder usually operates over a bandwidth less than 4 kHz, only a finite number of expansion coefficients are needed to characterize $y(t)$. The prefilter design problem then reduces to the problem of optimally extracting the speech random variable $s_n$ from the noisy observation $y_n = s_n + w_n$. If the speech and the noise are modeled as independent Gaussian random processes, then the expansion coefficients are independent Gaussian random variables with variances

$$\lambda_s(n) = S_s(n\omega_0) \tag{5}$$

$$\lambda_w(n) = S_w(n\omega_0) \tag{6}$$

where $\lambda_s(n)$ and $\lambda_w(n)$ represent the power in the $n$th harmonic line of the speech and noise spectra.

A. Power Subtraction
Since it is well known that the perception of speech is phase insensitive, a reasonable criterion for a prefilter design is to produce the speech estimate

$$\hat{s}(t) = \sum_{n=1}^{N} \hat{s}_n\,\phi_n(t) \tag{7}$$

where $|\hat{s}_n| = \sqrt{\lambda_s(n)}$, since if $\lambda_s(n)$ were known, the spectrum of $\hat{s}(t)$ would be identical to the spectrum of $s(t)$. Of course, it is not known and provision must be made for estimating its value from an observation of $y_n$ and knowledge of $\lambda_w(n)$. Since $y_n$ is a complex Gaussian variate with variance $\sigma_y^2(n) = \lambda_s(n) + \lambda_w(n)$, its real and imaginary parts are Gaussian with variance $\sigma_y^2(n)/2$. Hence, the probability density function for $y_n$ is

$$p(y_n) = \frac{1}{\pi[\lambda_s(n) + \lambda_w(n)]} \exp\left[-\frac{|y_n|^2}{\lambda_s(n) + \lambda_w(n)}\right]; \tag{8}$$

then by maximizing $p(y_n)$ with respect to $\lambda_s(n)$, the maximum likelihood estimate of $\lambda_s(n)$ can be found to be

$$\hat{\lambda}_s(n) = |y_n|^2 - \lambda_w(n). \tag{9}$$

In order to maintain an identity system in the absence of noise, the input phase can be appended to the prefilter output by taking

$$\hat{s}_n = \left[\frac{|y_n|^2 - \lambda_w(n)}{|y_n|^2}\right]^{1/2} y_n \tag{10}$$

which is known as the method of power subtraction. Modifications of this algorithm have been studied extensively by Boll [10], Preuss [12], and Berouti et al. [13].

B. Wiener Filtering
Whereas the power subtraction algorithm arises from an attempt to obtain the best estimate of the speech spectrum, the Wiener filter corresponds to the criterion of minimizing the mean-squared error of best time domain fit to the speech waveform. Van Trees [14, pp. 198-206] has shown that this can be done by choosing the channel coefficients to be

$$\hat{s}_n = \frac{\lambda_s(n)}{\lambda_s(n) + \lambda_w(n)}\,y_n. \tag{11}$$

Since the speech eigenvalues are unknown a priori, the maximum likelihood estimate developed in (9) can be used in (11) to result in the suppression rule

$$\hat{s}_n = \frac{|y_n|^2 - \lambda_w(n)}{|y_n|^2}\,y_n \tag{12}$$

which is simply the square of the suppression rule for the method of power subtraction.

C. Maximum Likelihood Envelope Estimation
The previous results were obtained assuming that the speech and the noise were independent Gaussian random processes. In the interest of exploring the importance of this assumption, an alternative model is proposed in which the noise is a Gaussian random process, while the speech is characterized by a deterministic waveform of unknown amplitude and phase. In
`
MC AULAY AND MALPASS: SPEECH ENHANCEMENT
`
`
`
this case, the channel measurement is $y_n = s_n + w_n$ where now $s_n = A \exp(j\theta)$, where $A$ determines the speech envelope and $\theta$ its phase. For the perception of speech, an optimum estimate of its envelope is desired since this would represent an estimate of the speech spectrum in the $n$th channel. For Gaussian noise, the probability density function of the channel measurement $y_n$ is

$$p(y_n \mid A, \theta) = \frac{1}{\pi\lambda_w(n)} \exp\left[-\frac{|y_n|^2 - 2A\,\mathrm{Re}(e^{-j\theta} y_n) + A^2}{\lambda_w(n)}\right]. \tag{13}$$

To obtain the maximum likelihood estimate of $A$, a maximum of $p(y_n \mid A, \theta)$ is sought. However, the speech phase $\theta$ shows up as a nuisance parameter. Its effect can be eliminated by maximizing the average likelihood function

$$p(y_n \mid A) = \int_0^{2\pi} p(y_n \mid A, \theta)\,p(\theta)\,d\theta. \tag{14}$$

If the phase is uniformly distributed on $(0, 2\pi)$, then the likelihood function for the spectral envelope becomes

$$p(y_n \mid A) = \frac{1}{\pi\lambda_w(n)} \exp\left[-\frac{|y_n|^2 + A^2}{\lambda_w(n)}\right] \cdot \frac{1}{2\pi} \int_0^{2\pi} \exp\left[\frac{2A\,\mathrm{Re}(e^{-j\theta} y_n)}{\lambda_w(n)}\right] d\theta. \tag{15}$$

The integral appearing in (15) is known as the modified Bessel function of the first kind and is labeled

$$I_0(|x|) = \frac{1}{2\pi} \int_0^{2\pi} \exp\left[\mathrm{Re}(e^{-j\theta} x)\right] d\theta \tag{16}$$

where $x = 2A y_n / \lambda_w(n)$ depends on the a priori signal-to-noise ratio $A^2/\lambda_w(n)$ and the a posteriori signal-to-noise ratio $|y_n|^2/\lambda_w(n)$. For large values of $|x|$ ($> 3$), which represents a constraint on the signal-to-noise ratios,

$$I_0(|x|) \approx \frac{1}{\sqrt{2\pi |x|}} \exp(|x|). \tag{17}$$

For this condition, the likelihood function for the spectral envelope becomes

$$p(y_n \mid A) = \frac{1}{\pi\lambda_w(n)} \cdot \frac{1}{\sqrt{2\pi \cdot 2A|y_n|/\lambda_w(n)}} \exp\left[-\frac{|y_n|^2 - 2A|y_n| + A^2}{\lambda_w(n)}\right]. \tag{18}$$

Maximizing this function with respect to $A$ leads to the estimator

$$\hat{A} = \frac{1}{2}\left[|y_n| + \sqrt{|y_n|^2 - \lambda_w(n)}\right]. \tag{19}$$

The corresponding channel estimate, with the measured phase appended, is

$$\hat{s}_n = \hat{A}\,\frac{y_n}{|y_n|}. \tag{20}$$

D. Two-State Soft Decision Maximum Likelihood Envelope Estimation
The suppression curves for the power subtraction, Wiener filtering, and maximum likelihood algorithms are illustrated in Fig. 1. Their suppression capabilities were evaluated for speech in airborne command post noise using a real-time implementation of the prefilter (to be described in detail in Section III). While it was difficult to determine which algorithm did the best job of extracting the speech when speech was present, it was apparent that none of the algorithms adequately suppressed the background noise when speech was absent. This is hardly surprising in view of the fact that the suppression rules were derived on the assumption that speech was always present in the measured data. Had a detector been used to determine that a given frame of data consisted of noise alone, then obviously a better suppression rule would have been to apply greater attenuation than indicated by the curves in Fig. 1.

[Fig. 1. Power subtraction, Wiener filter, and maximum likelihood suppression rules. Horizontal axis: speech-to-noise ratio (dB). Curves: maximum likelihood, Wiener filter, power subtraction.]

From this point of view, it follows that a better suppression curve might evolve if a two-state model for the speech event is considered at the outset, that is, either speech is present or it is not. Mathematically, this leads to the binary hypothesis model

$$H_0:\ \text{speech absent:}\quad |y_n| = |w_n|$$
$$H_1:\ \text{speech present:}\quad |y_n| = |A e^{j\theta} + w_n|. \tag{21}$$

Only the measured envelope is used in this measurement model since it has already been shown that the measured phase provides no useful information in the suppression of the noise.
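The three single-state rules compared in Fig. 1 can be sketched numerically as gains applied to the channel measurement. The following is a minimal illustration, not the authors' implementation: it assumes the gains are expressed as functions of the measured (a posteriori) ratio $|y_n|^2/\lambda_w(n)$, and it clamps the gain to zero when the measured power falls below the noise estimate, a choice the text does not specify. All function names are hypothetical.

```python
import math


def power_subtraction_gain(snr_post):
    """Gain sqrt(1 - lambda_w/|y_n|^2); snr_post is |y_n|^2 / lambda_w."""
    # Clamp at zero (an assumption) when the measurement is below the noise estimate.
    return math.sqrt(max(0.0, 1.0 - 1.0 / snr_post))


def wiener_gain(snr_post):
    """Gain 1 - lambda_w/|y_n|^2, the square of the power subtraction gain."""
    return max(0.0, 1.0 - 1.0 / snr_post)


def ml_envelope_gain(snr_post):
    """Gain (1/2)[1 + sqrt(1 - lambda_w/|y_n|^2)] from the ML envelope estimate."""
    return 0.5 * (1.0 + power_subtraction_gain(snr_post))


def to_db(gain):
    """Convert an amplitude gain to decibels; a zero gain maps to -infinity."""
    return 20.0 * math.log10(gain) if gain > 0.0 else float("-inf")


def suppression_table(snr_db_values):
    """Rows of (SNR dB, power subtraction dB, Wiener dB, ML dB)."""
    rows = []
    for snr_db in snr_db_values:
        snr = 10.0 ** (snr_db / 10.0)
        rows.append((snr_db,
                     to_db(power_subtraction_gain(snr)),
                     to_db(wiener_gain(snr)),
                     to_db(ml_envelope_gain(snr))))
    return rows
```

Under these assumptions the maximum likelihood gain never drops below one half (about -6 dB), which is consistent with the observation that none of the single-state rules suppresses the noise adequately when speech is absent.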
`
A useful criterion for estimating the spectral envelope $A$ is to choose $\hat{A}$ to minimize the mean-squared spectral error $E(A - \hat{A})^2$. It is well known [14] that the resulting estimator is the conditional mean $\hat{A} = E(A \mid V)$ where $V = |y_n|$ is used for notational convenience to represent the measured envelope. Reference to the $n$th channel will be implied. In this formulation, the expectation operator is used to indicate averaging over the ensemble of noise sample functions, speech envelopes and phases, and the ensemble of speech events. The averaging for the latter case is carried out explicitly and results in the estimator

$$\hat{A} = E(A \mid V, H_1)\,P(H_1 \mid V) + E(A \mid V, H_0)\,P(H_0 \mid V) \tag{22}$$

where $P(H_k \mid V)$ is the probability that the speech is in state $H_k$ given that the measured envelope has the value $V$. Since $E(A \mid V, H_0)$ represents the average value of $A$ given an observation $V$ and the fact that speech is absent, then obviously this value must be zero; hence, (22) reduces to

$$\hat{A} = E(A \mid V, H_1)\,P(H_1 \mid V). \tag{23}$$

Since $E(A \mid V, H_1)$ represents the minimum variance estimate of $A$ when speech is present, and since the maximum likelihood estimator is asymptotically efficient for large SNR, it suffices to replace $E(A \mid V, H_1)$ by the estimator derived in (19); hence,

$$\hat{A} = \frac{1}{2}\left[V + \sqrt{V^2 - \lambda_w}\right] P(H_1 \mid V). \tag{24}$$

Application of Bayes rule gives

$$P(H_1 \mid V) = \frac{p(V \mid H_1)\,P(H_1)}{p(V \mid H_1)\,P(H_1) + p(V \mid H_0)\,P(H_0)} \tag{25}$$

where $p(V \mid H_k)$ is the a priori probability density function for the measured envelope given the speech state $H_k$. Assuming that the speech and noise states are equally likely (a worst case assumption),

$$P(H_1) = P(H_0) = \tfrac{1}{2}. \tag{26}$$

Under hypothesis $H_0$, $V = |w|$, and since the noise is complex Gaussian with mean zero and variance $\lambda_w$, it follows that the envelope has the Rayleigh pdf

$$p(V \mid H_0) = \frac{2V}{\lambda_w} \exp\left(-\frac{V^2}{\lambda_w}\right). \tag{27}$$

Under hypothesis $H_1$, $V = |A e^{j\theta} + w|$ and the envelope has the Rician pdf

$$p(V \mid H_1) = \frac{2V}{\lambda_w} \exp\left(-\frac{V^2 + A^2}{\lambda_w}\right) I_0\!\left(\frac{2AV}{\lambda_w}\right). \tag{28}$$

Defining the a priori signal-to-noise ratio to be

$$\xi = \frac{A^2}{\lambda_w} \tag{29}$$

and substituting (26), (27), and (28) into (25) results in the following expression for the a posteriori probability for the presence of speech:

$$P(H_1 \mid V) = \frac{\exp(-\xi)\,I_0\!\left(2\sqrt{\xi V^2/\lambda_w}\right)}{1 + \exp(-\xi)\,I_0\!\left(2\sqrt{\xi V^2/\lambda_w}\right)}. \tag{30}$$

It is this term which contributes the "soft-decision" aspect to the maximum likelihood envelope estimator in contradistinction with "hard decision" for which the speech plus noise is either passed as is or is suppressed completely. Appending the measured phase to the estimated envelope in order to preserve the identity system in the absence of noise, the final suppression rule is then

$$\hat{s}_n = \frac{1}{2}\left[1 + \sqrt{\frac{V^2 - \lambda_w}{V^2}}\right] P(H_1 \mid V)\,y_n. \tag{31}$$

[Fig. 2. A posteriori probability for the speech state. Horizontal axis: speech-to-noise ratio (dB).]

In Fig. 2 several curves for the a posteriori probability for the speech state $P(H_1 \mid V)$ are plotted as a function of the a posteriori speech-to-noise ratio $V^2/\lambda_w$ (i.e., the measured SNR) for various values of the a priori signal-to-noise ratio $\xi$. The channel gains obtained when these a posteriori probabilities are appended to the maximum likelihood suppression rule are shown in Fig. 3. The two-state soft-decision maximum likelihood algorithm applies considerably more suppression when the measurement corresponds to low speech SNR. Since this case "most likely" corresponds to noise alone, it is seen that the effect of the residual noise (false alarms) should be considerably reduced. When the speech SNR is large, the measured SNR (i.e., the a posteriori SNR $V^2/\lambda_w$) will be large and it "most likely" means that speech is present, in which case the original maximum likelihood algorithm is the correct rule for extracting the speech envelope.
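The soft-decision rule described above (a Rayleigh pdf under speech absent, a Rician pdf under speech present, equal priors, and the maximum likelihood envelope gain) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, the Bessel function is evaluated by its power series instead of a library routine, and the clamp at zero for measurements below the noise estimate is an assumption.

```python
import math


def bessel_i0(x):
    """Modified Bessel function of the first kind, order zero, by power series."""
    q = (0.5 * x) ** 2
    term, total, k = 1.0, 1.0, 0
    # Sum (x^2/4)^k / (k!)^2 until the terms stop contributing.
    while term > 1e-16 * total:
        k += 1
        term *= q / (k * k)
        total += term
    return total


def speech_presence_probability(snr_post, xi):
    """P(H1 | V) for equal priors.

    snr_post is the measured ratio V^2 / lambda_w; xi is A^2 / lambda_w.
    The Rician-to-Rayleigh likelihood ratio is exp(-xi) * I0(2*sqrt(xi*snr_post)).
    """
    likelihood_ratio = math.exp(-xi) * bessel_i0(2.0 * math.sqrt(xi * snr_post))
    return likelihood_ratio / (1.0 + likelihood_ratio)


def soft_decision_gain(snr_post, xi):
    """ML envelope gain weighted by the probability that speech is present."""
    # Clamp (an assumption) when the measured power is below the noise estimate.
    ml_gain = 0.5 * (1.0 + math.sqrt(max(0.0, 1.0 - 1.0 / snr_post)))
    return ml_gain * speech_presence_probability(snr_post, xi)
```

For example, at a measured SNR of 0 dB (`snr_post = 1.0`) a larger `xi` yields a smaller gain, while at high measured SNR the probability approaches one and the gain reverts to the maximum likelihood rule.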
In order to interpret the role of the parameter $\xi$, it is noted that in a radar or communications context (from which the preceding theory was extracted), one would choose $\xi$ (the a priori SNR $A^2/\lambda_w$) in order to guarantee a specified performance in terms of false alarms and missed detections. In speech, however, one must deal with whatever SNR exists as a consequence of the particular acoustic environment in which one is forced to operate; hence, the concept of an a priori SNR which can be controlled by the system designer is inappropriate. In terms of a noise suppression prefilter, however, Fig. 3 shows that the parameter $\xi = A^2/\lambda_w$ simply controls the amount of suppression applied to a particular frequency channel; hence, it is convenient to refer to it as the "suppression factor." From this point of view, the theory has simply provided the catalyst for generating a new class of suppression curves.

TABLE I
CHANNEL FILTER SPECIFICATIONS

Channel    Center Frequency    3 dB Bandwidth
Number     (Hz)                (Hz)
0          240                 120
1          360                 120
2          480                 120
3          600                 120
4          720                 120
5          840                 120
6          975                 150
7          1125                150
8          1275                150
9          1425                150
10         1575                150
11         1750                200
12         1950                200
13         2150                200
14         2350                300
15         2600                300
16         2900                300
17         3200                300
18         3535                370

Sampling period = 132 µs

[Fig. 4. The channel vocoder filter bank.]
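The entries of Table I can be cross-checked against the figures quoted in Section III (19 filters spanning 180-3720 Hz at a 7575 Hz sampling rate). The sketch below is only a consistency check: it assumes each band extends half its 3 dB bandwidth on either side of the center frequency, and it takes the sixth center frequency to be 840 Hz, as the uniform 120 Hz spacing of the neighboring channels suggests. The names are hypothetical.

```python
# Center frequencies and 3 dB bandwidths in Hz, as read from Table I.
CENTER_HZ = [240, 360, 480, 600, 720, 840, 975, 1125, 1275, 1425,
             1575, 1750, 1950, 2150, 2350, 2600, 2900, 3200, 3535]
BANDWIDTH_HZ = [120] * 6 + [150] * 5 + [200] * 3 + [300] * 4 + [370]


def band_edges(center_hz, bandwidth_hz):
    """Lower and upper 3 dB edges, assuming arithmetic symmetry about the center."""
    return center_hz - bandwidth_hz / 2.0, center_hz + bandwidth_hz / 2.0


EDGES = [band_edges(c, b) for c, b in zip(CENTER_HZ, BANDWIDTH_HZ)]

# Overall span covered by the bank.
LOWEST_EDGE_HZ = EDGES[0][0]
HIGHEST_EDGE_HZ = EDGES[-1][1]

# Gap (positive) or overlap (negative) between adjacent 3 dB bands.
ADJACENT_GAPS_HZ = [EDGES[i + 1][0] - EDGES[i][1] for i in range(len(EDGES) - 1)]

# Sampling period implied by the 7575 Hz sampling rate, in microseconds.
SAMPLE_PERIOD_US = 1.0e6 / 7575.0
```

Under these assumptions the band edges run from 180 Hz to 3720 Hz and the sampling period rounds to 132 µs, in agreement with the text.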
`
[Fig. 3. Suppression rules for maximum likelihood with soft suppression.]

III. IMPLEMENTATION
All of the noise suppression prefilters that have been reported on to date have been implemented in the frequency domain. This corresponds nicely to the theoretical orthogonal channel decomposition used in Section II and exploits the properties of the FFT for filtering by circular convolution. Since the present work evolved from an attempt to implement a time-domain Kalman filter based on a parallel formant model for speech [15], and since a contemporary implementation of a channel vocoder is being developed using CCD technology to produce a package which operates at rates from 1.2 to 4.8 kbits/s, requires about 50 integrated circuits, occupies 0.22 ft³, requires 5 W, and weighs 5 lb [16], it seemed appropriate to attempt a time-domain implementation of the prefilter that could exploit this emerging technology. As in the channel vocoder, 19 filters are used to span the frequency range 180-3720 Hz (the sampling rate was 7575 Hz). Each filter in the bank is a result of a bandpass transformation of a second-order Butterworth filter. The center frequencies and the bandwidths for each of the filters in the bank are listed in Table I and a plot of their linear magnitudes is shown in Fig. 4.

Although theory requires that the channels be orthogonal, in practice, overlapping filters provide for spectral smoothing which is known to be an important factor in the design of noise suppression systems [11]. The filters in the channel vocoder were originally chosen to provide a good compromise for smoothing the envelope of the speech spectrum; hence, their lack of orthogonality turns out to be an asset in this particular case. Since the 19 filters span the frequency range of the speech signal, the front end of the channel vocoder, in the absence of noise, represents an identity system provided the outputs of each of the channels are added alternately out of phase, as shown in the block diagram in Fig. 5.

In order to compute the channel gains, measurements must be made to determine the instantaneous signal power and the average noise power at the output of each of the channel filters. Since the speech parameters change very little in 20 ms, some temporal smoothing can be exploited by computing the signal power in the $n$th channel from

$$V_n^2 = \frac{1}{N} \sum_{k=1}^{N} y_n^2(k) \tag{32}$$

where $y_n(k)$ represents the signal sample out of the $n$th channel at time $k$ where there are $N$ such samples in the 20 ms frame (the normalization by $N$ will be unnecessary).

Determination of the background noise power requires knowledge of whether or not a particular frame contains speech. One approach to making this determination has been developed by Roberts [17] who noted that a 4 s histogram of the frame energies of the input signal was bimodal. He found that by setting a detection threshold between the modes, correct speech and noise classification could be made most of the time. A modification of this algorithm, which is described in detail in the Appendix, was used to determine with high confidence the frames that were absent of speech. For those frames, the average noise power in each channel was estimated by smoothing the measurements in (32) using a 1 s time constant according to the recursion

$$\lambda_w(m) = \lambda_w(m - 1) + \alpha\left[V^2(m) - \lambda_w(m - 1)\right] \tag{33}$$

where $V^2(m)$, $\lambda_w(m)$ represent the measured power and the average noise power computed for the $m$th frame. The major drawback of this algorithm is the relatively long adaptation time needed to determine the detection threshold and then the additional training period required to learn the channel noise powers.
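The frame power measurement and the recursive noise power update described above can be sketched as follows. This is a minimal illustration under stated assumptions: the value of the smoothing constant is not given in the text, so the mapping from a 1 s time constant to a per-frame constant via `1 - exp(-step / time_constant)` is an assumption, as are the function names and the `is_noise_frame` flag standing in for the output of the noise detector.

```python
import math


def frame_power(samples):
    """Mean of the squared channel samples over one frame."""
    return sum(x * x for x in samples) / len(samples)


def smoothing_constant(frame_step_s, time_constant_s):
    """Per-frame constant for a first-order recursion (assumed exponential mapping)."""
    return 1.0 - math.exp(-frame_step_s / time_constant_s)


def update_noise_power(prev_noise, measured_power, alpha, is_noise_frame=True):
    """One step of the recursion: move the noise estimate toward the measurement.

    The estimate is only updated on frames flagged as speech-free.
    """
    if not is_noise_frame:
        return prev_noise
    return prev_noise + alpha * (measured_power - prev_noise)


def track_noise(frame_powers, noise_flags, alpha, initial=0.0):
    """Run the recursion over a sequence of frames and return the final estimate."""
    noise = initial
    for power, flag in zip(frame_powers, noise_flags):
        noise = update_noise_power(noise, power, alpha, flag)
    return noise
```

With a constant noise-only input, `track_noise` converges to the measured power at a rate set by `alpha`, which illustrates the training period the text identifies as a drawback.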
Using the measurement of $V^2(m)$ and the estimated average value of the noise energy $\lambda_w(m - 1)$, the gain factor

$$g(m) = \frac{V^2(m) - \lambda_w(m - 1)}{V^2(m)} \tag{34}$$

is computed for each channel.¹ Since $V^2(m)/\lambda_w(m - 1)$ can be expressed in terms of $g(m)$, then the noise suppression rule, (30) and (31), can be written as

$$G(m) = \frac{1}{2}\left[1 + \sqrt{g(m)}\right] \frac{\exp(-\xi)\,I_0\!\left(2\sqrt{\xi/[1 - g(m)]}\right)}{1 + \exp(-\xi)\,I_0\!\left(2\sqrt{\xi/[1 - g(m)]}\right)}. \tag{35}$$

[Fig. 5. Block diagram of the noise suppression prefilter. Blocks shown: input $s(t) + w(t)$, bandpass filters No. 1 to No. N, compute energy in 20 ms every 10 ms, noise detector, smooth the energy, compute gain.]

The advantage in using $g(m)$ as the independent variable is the fact that $0 < g(m) < 1$ which permits the use of a simple software divide routine in forming the normalization. For a given value of the suppression factor $\xi$, the measured gain $g(m)$ is used as a pointer for a table lookup to determine the attenuation prescribed by (35). Fifteen tables corresponding to values of $\xi = 1, 2, 3, \ldots, 15$ have been included in the prefilter, with each table consisting of 50 values of the suppression rule computed for equal increments of $g(m)$ from 0 to 1. No attempt was made to optimize the design of these tables. All of the coding was done in machine language on the LDVT [19], which has the ability to key in a new value of the suppression factor in real time. This meant that the prefilter could easily be adjusted to accommodate a wide class of operational environments. This turned out to be a significant capability for effective noise suppression.

Since the algorithm was designed to operate in real time, a 10 ms delay had to be incurred between the time the energies were measured and the time the corresponding gains could be computed and applied to the channel waveforms. This was done by computing the energies (block floating point) in 10 ms segments and adding consecutive segments together to produce the desired 20 ms energy measurement. This permitted computation of the raw gains $G(m)$ every 10 ms. In order to avoid the introduction of discontinuities in the output waveform, the final output is a smoothed gain $\hat{G}(m)$ obtained according to

$$\hat{G}(m) = \hat{G}(m - 1) + \beta(m)\left[G(m) - \hat{G}(m - 1)\right]. \tag{36}$$

Since the introduction of smoothing can cause the prefilter to be slow to respond to a leading edge transition, which could result in speech distortion, the weighting factor in (36) is chosen adaptively according to the rule (37). In this way, the prefilter responds immediately to an increase in the SNR which should minimize the potential for leading edge distortion. During a trailing edge, in which the gain will be decreasing, the smoothed gain will be used which will tend to maintain the speech signal even though the noise becomes dominant. It is the gain G(m) in (37) that is applied to the waveform at the output of each of the channel filters. These waveforms were then added together alternately 180° out of phase to produce the prefilter output waveform $\hat{s}(t)$.

¹This is where the normalization by N in (32) disappears.
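The table lookup and the leading-edge/trailing-edge gain smoothing described in this section can be sketched as below. Several details are assumptions rather than facts from the text: the gain factor is taken to be the fraction of measured power above the noise estimate, clipped to stay within the table; the 50 table entries are placed at g = 0, 1/50, ..., 49/50; and the adaptive weight is taken to be 1 when the raw gain rises and a fixed constant `beta` otherwise, which matches the described behavior but not necessarily the authors' exact rule. The stand-in rule used in the example is only the maximum likelihood factor, not the full soft-decision rule.

```python
import math


def gain_factor(measured_power, noise_power):
    """Fraction of the measured power above the noise estimate, clipped at 0."""
    if measured_power <= noise_power:
        return 0.0
    return (measured_power - noise_power) / measured_power


def build_table(rule, size=50):
    """Tabulate a suppression rule at equal increments of g (assumed g = i/size)."""
    return [rule(i / size) for i in range(size)]


def lookup(table, g):
    """Use g in [0, 1] as a pointer into the table."""
    index = min(int(g * len(table)), len(table) - 1)
    return table[index]


def smooth_gain(prev_smoothed, raw_gain, beta):
    """First-order smoothing with an adaptive weight.

    A rising gain is followed immediately (weight 1); a falling gain is
    smoothed with the constant beta (assumed form of the adaptive rule).
    """
    weight = 1.0 if raw_gain > prev_smoothed else beta
    return prev_smoothed + weight * (raw_gain - prev_smoothed)


# Example stand-in rule: the maximum likelihood factor (1/2)[1 + sqrt(g)].
ML_TABLE = build_table(lambda g: 0.5 * (1.0 + math.sqrt(g)))
```

A per-frame loop would compute `gain_factor`, read the raw gain with `lookup`, and pass it through `smooth_gain` before scaling the channel waveform.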
`
`IV. EXPERIMENTAL RESULTS AND CONCLUSIONS
`Since the prefiltering algorithm operated in real time, it was
`possible to perform extensive listening tests on a large speech
`and noise database. It was of particular interest to determine
`the operational performance of the
`prefilter in conjunction
`with a 2400 bits/s vocoder operating in a background of E4A
`advanced airborne command post noise (ACPN). Source tapes
`were available for this environment, consisting of lists spoken
`by six male speakers for which a DRT score and a diagnostic
`acceptability measure (DAM) could be computed. The record-
`ings were made using both a high-quality Altec microphone
`and a noise-canceling microphone.
The first experiment consisted of listening to the output of the prefilter for various values of the suppression factor. It was always possible to select a suppression factor which would
`render the background noise imperceptible, although, for cases
`in which the SNR was low enough, the cost in doing this was
`the introduction of various
`degrees of speech distortion. In
`these cases, if the suppression factor was subsequently reduced,
`the speech distortion was reduced at the expense of introducing
`a perceptible level of background noise.
`In the next experiment, the prefilter was connected in tandem
`with the 2400 bit/s LPC vocoder which used the Gold-Rabiner
pitch estimator [19], [20]. An unexpected result was obtained.
`If the suppression factor was set to remove the residual noise
`at the output of the prefilter, then the speech quality at the
`vocoder was poor due
`to both buzz-hiss errors and spectral
`distortion. If, however, the suppression factor was chosen so
`that the noise at the vocoder
`output was negligible,
`then a
`significantly lower value of the suppression factor was needed
`and the speech quality was quite good, although the Gold-
`Rabiner algorithm continued to make buzz-hiss errors, but at
`a lower rate. In other words, LPC itself has some suppression
`capabilities against weak noise which can usefully be exploited
`in the tandem connection.
It was the flexibility in selecting the prefilter suppression factor which made this result possible.
Since the deployment of the LPC vocoder does allow for flexibility in the specification of the pitch extractor, it was of
`interest to determine whether or not algorithms that were
`specially designed to operate in noise would operate more ef-
`fectively in the tandem connection. Such an algorithm, based
`on maximum likelihood estimation techniques, has been under
`development for some time [21] and was chosen to be tested
`against the Gold-Rabiner algorithm. In the subjective listening
`tests it was found that, indeed, smoother pitch
`tracks could
`be obtained with a lower rate of buzz-hiss errors.
`Although the results of using the prefilter always produced
`subjectively more pleasant sounding speech to the ear since the
`annoying and tiresome background noise was removed, it was
`important to determine whether or not there
was a corresponding quantitative improvement in intelligibility. To do this, DRT scores are being obtained for the prefiltered speech and the speech out of the LPC tandem for both the Gold-Rabiner and the maximum likelihood pitch extraction algorithms. Results are currently being obtained for both the Altec dynamic microphone and the confidencer noise-canceling microphone and will be reported once all of the data have been collected and analyzed.
`So far the focus has been on the 19-channel prefilter based
`on the principles of channel vocoder design. This was strictly
`a pragmatic choice which was made to facilitate the develop-
`ment of a real-time testbed. Questions relating to the number
`of filters, the bandwidths, and the choice of center frequen-
`cies remain to be addressed. Although the time-domain struc-
`ture of the channel prefilter is well suited to an analog imple-
`mentation using CCD technology, it is of interest to determine
`the tradeoffs with respect
`to a frequency-domain approach
`using the FFT. Whatever candidate system is chosen for evalu-
`ation, using the class of suppression rules developed in this
`study allows the overall design to be optimized with respect to
`the noise suppression/speech distortion tradeoff by choosing
`an appropriate suppression factor. In this way, performance
`differences can be attributed to the system design parameters
`independent of a particular suppression rule which may have
`represented a poor choice
`for the particular signal and noise
`conditions used in the evaluation.
`
`APPENDIX
`MODIFIED ROBERTS NOISE DETECTION ALGORITHM
`In order to estimate the statistics of the background noise, it
`is desirable to inspect only those frames of data which have a
`high probability of containing no speech. To accomplish this,
`an adaptive energy threshold marking the probable boundary
`between noise and noise plus speech is established by monitor-
`ing the energy on a frame-by-frame
`basis and maintaining
`energy histograms which
`reflect the bimodal distribution of
`the energy. The flowchart for the algorithm, shown in Fig. 6,
`is described in the following paragraphs.
`For each frame, the sum of the squares of the input samples
is computed. If this energy does not exceed 16 bits (i.e., does not strongly imply the presence of speech), the adaptive threshold algorithm is exercised. First, a decay factor of 0.995 is applied to a 128-bin histogram of uniform ran
