`
`
`Suppression of Acoustic Noise in Speech Using
`Spectral Subtraction
`
`STEVEN F. BOLL, MEMBER, IEEE
`
Abstract-A stand-alone noise suppression algorithm is presented for
reducing the spectral effects of acoustically added noise in speech.
Effective performance of digital speech processors operating in practical
environments may require suppression of noise from the digital waveform.
Spectral subtraction offers a computationally efficient,
processor-independent approach to effective digital speech analysis. The
method, requiring about the same computation as high-speed convolution,
suppresses stationary noise from speech by subtracting the spectral noise
bias calculated during nonspeech activity. Secondary procedures are
then applied to attenuate the residual noise left after subtraction. Since
the algorithm resynthesizes a speech waveform, it can be used as a
preprocessor to narrow-band voice communications systems, speech
recognition systems, or speaker authentication systems.
`
I. INTRODUCTION

BACKGROUND noise acoustically added to speech can degrade the
performance of digital voice processors used for applications such as
speech compression, recognition, and authentication [1], [2]. Digital
voice systems will be used in a variety of environments, and their
performance must be maintained at a level near that measured using
noise-free input speech. To ensure continued reliability, the effects of
background noise can be reduced by using noise-cancelling microphones,
internal modification of the voice processor algorithms to explicitly
compensate for signal contamination, or preprocessor noise reduction.
Noise-cancelling microphones, although essential for extremely high
noise environments such as the helicopter cockpit, offer little or no
noise reduction above 1 kHz [3] (see Fig. 5). Techniques available for
voice processor modification to account for noise contamination are
being developed [4], [5]. But due to the time, effort, and money spent on
the design and implementation of these voice processors [6]-[8], there
is a reluctance to internally modify these systems.
Preprocessor noise reduction [12], [21] offers the advantage that noise
stripping is done on the waveform itself with the output being either
digital or analog speech. Thus, existing voice processors tuned to clean
speech can continue to be used unmodified. Also, since the output is
speech, the noise stripping becomes independent of any specific
subsequent speech processor implementation (it could be connected to a
CCD channel vocoder or a digital LPC vocoder).
The objectives of this effort were to develop a noise suppression
technique, implement a computationally efficient algorithm, and test its
performance in actual noise environments. The approach used was to
estimate the magnitude frequency spectrum of the underlying clean speech
by subtracting the noise magnitude spectrum from the noisy speech
spectrum. This estimator requires an estimate of the current noise
spectrum. Rather than obtain this noise estimate from a second
microphone source [9], [10], it is approximated using the average noise
magnitude measured during nonspeech activity. Using this approach, the
spectral approximation error is then defined, and secondary methods for
reducing it are described.
The noise suppressor is implemented using about the same amount of
computation as required in a high-speed convolution. It is tested on
speech recorded in a helicopter environment. Its performance is measured
using the Diagnostic Rhyme Test (DRT) [11] and is demonstrated using
isometric plots of short-time spectra.
The paper is divided into sections which develop the spectral estimator,
describe the algorithm implementation, and demonstrate the algorithm
performance.

Manuscript received June 1, 1978; revised September 12, 1978. This
research was supported by the Information Processing Branch of the
Defense Advanced Research Projects Agency, monitored by the Naval
Research Laboratory under Contract N00173-77-C-0041.
The author is with the Department of Computer Science, University
of Utah, Salt Lake City, UT 84112.
`
`II. SUBTRACTIVE NOISE SUPPRESSION ANALYSIS
`A. Introduction
This section describes the noise-suppressed spectral estimator.
The estimator is obtained by subtracting an estimate of the noise
spectrum from the noisy speech spectrum. Spectral information required
to describe the noise spectrum is obtained from the signal measured
during nonspeech activity. After developing the spectral estimator, the
spectral error is computed and four methods for reducing it are
presented.
The following assumptions were used in developing the analysis. The
background noise is acoustically or digitally added to the speech. The
background noise environment remains locally stationary to the degree
that its spectral magnitude expected value just prior to speech activity
equals its expected value during speech activity. If the environment
changes to a new stationary state, there exists enough time (about
300 ms) to estimate a new background noise spectral magnitude expected
value before speech activity commences. For the slowly varying
nonstationary noise environment, the algorithm requires a speech
activity detector to signal the program that speech has ceased and a new
noise bias can be estimated. Finally, it is assumed that significant
noise reduction is possible by removing the effect of noise from the
magnitude spectrum only.
Speech, suitably low-pass filtered and digitized, is analyzed by
windowing data from half-overlapped input data buffers. The magnitude
spectra of the windowed data are calculated and the spectral noise bias
calculated during nonspeech activity is subtracted off. Resulting
negative amplitudes are then zeroed out. Secondary residual noise
suppression is then applied. A time waveform is recalculated from the
modified magnitude. This waveform is then overlap added to the previous
data to generate the output speech.

0096-3518/79/0400-0113$00.75 © 1979 IEEE
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-27, NO. 2, APRIL 1979
Petitioner Apple Inc.
Ex. 1009, p. 113
`
B. Additive Noise Model
Assume that a windowed noise signal $n(k)$ has been added to a windowed
speech signal $s(k)$, with their sum denoted by $x(k)$. Then

$$x(k) = s(k) + n(k).$$

Taking the Fourier transform gives

$$X(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega})$$

where

$$x(k) \leftrightarrow X(e^{j\omega})$$

$$X(e^{j\omega}) = \sum_{k=0}^{L-1} x(k) e^{-j\omega k}$$

$$x(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega}) e^{j\omega k} \, d\omega.$$

C. Spectral Subtraction Estimator
The spectral subtraction filter $H(e^{j\omega})$ is calculated by
replacing the noise spectrum $N(e^{j\omega})$ with spectra which can be
readily measured. The magnitude $|N(e^{j\omega})|$ of $N(e^{j\omega})$ is
replaced by its average value $\mu(e^{j\omega})$ taken during nonspeech
activity, and the phase $\theta_N(e^{j\omega})$ of $N(e^{j\omega})$ is
replaced by the phase $\theta_x(e^{j\omega})$ of $X(e^{j\omega})$. These
substitutions result in the spectral subtraction estimator
$\hat{S}(e^{j\omega})$:

$$\hat{S}(e^{j\omega}) = [|X(e^{j\omega})| - \mu(e^{j\omega})] e^{j\theta_x(e^{j\omega})}$$

or

$$\hat{S}(e^{j\omega}) = H(e^{j\omega}) X(e^{j\omega})$$

with

$$H(e^{j\omega}) = 1 - \frac{\mu(e^{j\omega})}{|X(e^{j\omega})|}$$

$$\mu(e^{j\omega}) = E\{|N(e^{j\omega})|\}.$$
`
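As an illustration that is not part of the original paper, the estimator of Section II-C can be sketched for a single frame in Python. The naive `dft` helper, the function names, the eight-sample test frame, and the flat bias `mu` are all assumptions chosen for the example; a practical implementation would use an FFT and a measured noise average.

```python
import cmath
import math

def dft(x):
    """Naive O(L^2) DFT: X[m] = sum_k x[k] * exp(-j*2*pi*m*k/L)."""
    L = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / L) for k in range(L))
            for m in range(L)]

def spectral_subtract(X, mu):
    """Subtract the noise bias mu[m] from |X[m]| and keep the noisy phase."""
    out = []
    for Xm, mu_m in zip(X, mu):
        theta_x = cmath.phase(Xm)  # phase of the noisy spectrum
        out.append((abs(Xm) - mu_m) * cmath.exp(1j * theta_x))
    return out

def subtraction_gain(X, mu):
    """Equivalent real gain H[m] = 1 - mu[m] / |X[m]| (assumes |X[m]| > 0)."""
    return [1.0 - mu_m / abs(Xm) for Xm, mu_m in zip(X, mu)]

# Hypothetical frame and a flat noise bias, for illustration only.
frame = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, -0.25]
X = dft(frame)
mu = [0.1] * len(X)

S_hat = spectral_subtract(X, mu)                 # magnitude/phase form
H = subtraction_gain(X, mu)
S_hat_gain = [h * Xm for h, Xm in zip(H, X)]     # filter form H * X
```

The two forms agree bin by bin because multiplying $X$ by the real gain $H$ rescales its magnitude and leaves its phase untouched. No rectification is applied at this stage, so a bin with $|X| < \mu$ would receive a negative gain.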
D. Spectral Error
The spectral error $\epsilon(e^{j\omega})$ resulting from this estimator
is given by

$$\epsilon(e^{j\omega}) = \hat{S}(e^{j\omega}) - S(e^{j\omega}) = N(e^{j\omega}) - \mu(e^{j\omega}) e^{j\theta_x}.$$

A number of simple modifications are available to reduce the auditory
effects of this spectral error. These include: 1) magnitude averaging;
2) half-wave rectification; 3) residual noise reduction; and
4) additional signal attenuation during nonspeech activity.

E. Magnitude Averaging
Since the spectral error equals the difference between the noise
spectrum $N$ and its mean $\mu$, local averaging of spectral magnitudes
can be used to reduce the error. Replacing $|X(e^{j\omega})|$ with
$\overline{|X(e^{j\omega})|}$, where

$$\overline{|X(e^{j\omega})|} = \frac{1}{M} \sum_{i=0}^{M-1} |X_i(e^{j\omega})|$$

and $X_i(e^{j\omega})$ is the $i$th time-windowed transform of $x(k)$,
gives

$$S_A(e^{j\omega}) = [\overline{|X(e^{j\omega})|} - \mu(e^{j\omega})] e^{j\theta_x(e^{j\omega})}.$$

The rationale behind averaging is that the spectral error becomes
approximately

$$\epsilon(e^{j\omega}) = S_A(e^{j\omega}) - S(e^{j\omega}) \cong \overline{|N|} - \mu$$

where

$$\overline{|N(e^{j\omega})|} = \frac{1}{M} \sum_{i=0}^{M-1} |N_i(e^{j\omega})|.$$

Thus, the sample mean of $|N(e^{j\omega})|$ will converge to
$\mu(e^{j\omega})$ as a longer average is taken.
The obvious problem with this modification is that the speech is
nonstationary, and therefore only limited time averaging is allowed.
DRT results show that averaging over more than three half-overlapped
windows with a total time duration of 38.4 ms will decrease
intelligibility. Spectral examples and DRT scores with and without
averaging are given in the "Results" section. Based upon these results,
it appears that averaging coupled with half rectification offers some
improvement. The major disadvantage of averaging is the risk of some
temporal smearing of short transitory sounds.

F. Half-Wave Rectification
For each frequency $\omega$ where the noisy signal spectrum magnitude
$|X(e^{j\omega})|$ is less than the average noise spectrum magnitude
$\mu(e^{j\omega})$, the output is set to zero. This modification can be
simply implemented by half-wave rectifying $H(e^{j\omega})$. The
estimator then becomes

$$\hat{S}(e^{j\omega}) = H_R(e^{j\omega}) X(e^{j\omega})$$

where

$$H_R(e^{j\omega}) = \frac{H(e^{j\omega}) + |H(e^{j\omega})|}{2}.$$

The input-output relationship between $X(e^{j\omega})$ and
$\hat{S}(e^{j\omega})$ at each frequency $\omega$ is shown in Fig. 1.

Fig. 1. Input-output relation between $X(e^{j\omega})$ and $\hat{S}(e^{j\omega})$.

Thus, the effect of half-wave rectification is to bias down the
magnitude spectrum at each frequency $\omega$ by the noise bias
determined at that frequency. The bias value can, of course, change from
frequency to frequency as well as from analysis time window to time
window. The advantage of half rectification is that the noise floor is
reduced by $\mu(e^{j\omega})$. Also, any low variance coherent noise
tones are essentially eliminated. The disadvantage of half rectification
can exhibit itself in the situation where the sum of the noise plus
speech at a frequency $\omega$ is less than $\mu(e^{j\omega})$. Then the
speech information at that frequency is incorrectly removed, implying a
possible decrease in intelligibility. As discussed in the section on
"Results," for the helicopter speech data base this processing did not
reduce intelligibility as measured using the DRT.

G. Residual Noise Reduction
After half-wave rectification, speech plus noise lying above $\mu$
remain. In the absence of speech activity the difference
$N_R = N - \mu e^{j\theta_n}$, which shall be called the noise residual,
will for uncorrelated noise exhibit itself in the spectrum as randomly
spaced narrow bands of magnitude spikes (see Fig. 7). This noise
residual will have a magnitude between zero and a maximum value measured
during nonspeech activity. Transformed back to the time domain, the
noise residual will sound like the sum of tone generators with random
fundamental frequencies which are turned on and off at a rate of about
20 ms. During speech activity the noise residual will also be perceived
at those frequencies which are not masked by the speech.
The audible effects of the noise residual can be reduced by taking
advantage of its frame-to-frame randomness. Specifically, at a given
frequency bin, since the noise residual will randomly fluctuate in
amplitude at each analysis frame, it can be suppressed by replacing its
current value with its minimum value chosen from the adjacent analysis
frames. Taking the minimum value is used only when the magnitude of
$\hat{S}(e^{j\omega})$ is less than the maximum noise residual
calculated during nonspeech activity. The motivation behind this
replacement scheme is threefold: first, if the amplitude of
$\hat{S}(e^{j\omega})$ lies below the maximum noise residual, and it
varies radically from analysis frame to frame, then there is a high
probability that the spectrum at that frequency is due to noise;
therefore, suppress it by taking the minimum; second, if
$\hat{S}(e^{j\omega})$ lies below the maximum but has a nearly constant
value, there is a high probability that the spectrum at that frequency
is due to low energy speech; therefore, taking the minimum will retain
the information; and third, if $\hat{S}(e^{j\omega})$ is greater than
the maximum, there is speech present at that frequency; therefore,
removing the bias is sufficient. The amount of noise reduction using
this replacement scheme was judged equivalent to that obtained by
averaging over three frames. However, with this approach high energy
frequency bins are not averaged together. The disadvantage to the scheme
is that more storage is required to save the maximum noise residuals and
the magnitude values for three adjacent frames.
The residual noise reduction scheme is implemented as

$$|\hat{S}_i(e^{j\omega})| = |\hat{S}_i(e^{j\omega})|, \quad \text{for } |\hat{S}_i(e^{j\omega})| \ge \max |N_R(e^{j\omega})|$$

$$|\hat{S}_i(e^{j\omega})| = \min \{ |\hat{S}_j(e^{j\omega})|, \; j = i-1, i, i+1 \}, \quad \text{for } |\hat{S}_i(e^{j\omega})| < \max |N_R(e^{j\omega})|$$

where

$$\hat{S}_i(e^{j\omega}) = H_R(e^{j\omega}) X_i(e^{j\omega})$$

and

$$\max |N_R(e^{j\omega})| = \text{maximum value of noise residual measured during nonspeech activity}.$$

H. Additional Signal Attenuation During Nonspeech Activity
The energy content of $\hat{S}(e^{j\omega})$ relative to
$\mu(e^{j\omega})$ provides an accurate indicator of the presence of
speech activity within a given analysis frame. If speech activity is
absent, then $\hat{S}(e^{j\omega})$ will consist of the noise residual
which remains after half-wave rectification and minimum value selection.
Empirically, it was determined that the average (before versus after)
power ratio was down at least 12 dB. This implied a measure for
detecting the absence of speech given by

$$T = 20 \log_{10} \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \frac{\hat{S}(e^{j\omega})}{\mu(e^{j\omega})} \right| d\omega \right].$$

If $T$ was less than -12 dB, the frame was classified as having no
speech activity. During the absence of speech activity there are at
least three options prior to resynthesis: do nothing, attenuate the
output by a fixed factor, or set the output to zero. Having some signal
present during nonspeech activity was judged to give the higher quality
result. A possible reason for this is that noise present during speech
activity is partially masked by the speech. Its perceived magnitude
should be balanced by the presence of the same amount of noise during
nonspeech activity. Setting the buffer to zero had the effect of
amplifying the noise during speech activity. Likewise, doing nothing had
the effect of amplifying the noise during nonspeech activity. A
reasonable, though by no means optimum, amount of attenuation was found
to be -30 dB. Thus, the output spectral estimate including output
attenuation during nonspeech activity is given by

$$\hat{S}(e^{j\omega}) = \begin{cases} \hat{S}(e^{j\omega}), & T \ge -12 \text{ dB} \\ c X(e^{j\omega}), & T \le -12 \text{ dB} \end{cases}$$

where $20 \log_{10} c = -30$ dB.
`
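The magnitude averaging and half-wave rectification steps of Sections II-E and II-F can be sketched per frequency bin as follows. This is a minimal illustration rather than the paper's implementation: the function names and the three-frame, three-bin magnitudes are assumptions, and the rectified gain is written as (H + |H|)/2 with H = 1 - mu/|X|, which is zero wherever the averaged magnitude falls below the noise bias.

```python
def average_magnitudes(frames_mag):
    """Average |X_i| over M frames, bin by bin (Section II-E)."""
    M = len(frames_mag)
    n_bins = len(frames_mag[0])
    return [sum(frame[m] for frame in frames_mag) / M for m in range(n_bins)]

def half_wave_gain(x_mag, mu):
    """Rectified gain (H + |H|) / 2 with H = 1 - mu/|X| (Section II-F).

    A bin with zero magnitude is given zero gain (an assumption made here
    to avoid dividing by zero).
    """
    gains = []
    for xm, mu_m in zip(x_mag, mu):
        h = 1.0 - mu_m / xm if xm > 0.0 else 0.0
        gains.append((h + abs(h)) / 2.0)
    return gains

# Hypothetical magnitudes for three half-overlapped frames and three bins.
frames_mag = [[1.0, 0.2, 0.5],
              [0.8, 0.4, 0.1],
              [1.2, 0.3, 0.3]]
mu = [0.5, 0.5, 0.25]

x_bar = average_magnitudes(frames_mag)           # averaged |X| per bin
h_r = half_wave_gain(x_bar, mu)                  # rectified gain per bin
s_mag = [h * x for h, x in zip(h_r, x_bar)]      # |S_hat| = max(|X| - mu, 0)
```

In this example the second bin averages to 0.3, below its bias of 0.5, so its output magnitude is set to zero; the other two bins are simply lowered by their bias.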
`
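The minimum-selection rule of Section II-G can be sketched as below. The function name and the sample magnitudes are hypothetical, and the handling of the first and last frames (copied unchanged because they lack a neighbor) is an assumption; the paper does not discuss the edges.

```python
def reduce_residual(s_mag_frames, max_residual):
    """Replace small magnitudes by the minimum over frames i-1, i, i+1.

    s_mag_frames[i][m] is |S_hat_i| at bin m after half-wave rectification.
    max_residual[m] is the largest noise residual measured at bin m during
    nonspeech activity. Edge frames are copied unchanged (an assumption).
    """
    out = [list(frame) for frame in s_mag_frames]
    for i in range(1, len(s_mag_frames) - 1):
        for m, cur in enumerate(s_mag_frames[i]):
            if cur < max_residual[m]:
                out[i][m] = min(s_mag_frames[i - 1][m], cur,
                                s_mag_frames[i + 1][m])
    return out

# Bin 0 fluctuates below the residual ceiling (treated as noise);
# bin 1 stays well above it (treated as speech and left alone).
frames = [[0.05, 2.0],
          [0.30, 2.5],
          [0.10, 1.8]]
max_residual = [0.4, 0.4]
cleaned = reduce_residual(frames, max_residual)
```

The decision always uses the original neighboring magnitudes rather than already-replaced ones, which matches the rule as stated; it also shows why the scheme needs storage for three adjacent frames and introduces a one-frame delay.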
`
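A discrete form of the detector and attenuation rule of Section II-H might look like the following sketch. Replacing the integral over frequency by a mean over DFT bins is an assumption, as are the function names, the default parameter names, and the sample spectra; `mu` must be strictly positive in every bin.

```python
import math

def speech_activity_db(s_mag, mu):
    """T = 20*log10(mean over bins of |S_hat| / mu), in dB."""
    ratio = sum(s / m for s, m in zip(s_mag, mu)) / len(s_mag)
    return 20.0 * math.log10(ratio) if ratio > 0.0 else float("-inf")

def attenuate_if_nonspeech(S_hat, X, mu, threshold_db=-12.0, atten_db=-30.0):
    """Return S_hat for speech frames, else c*X with 20*log10(c) = atten_db."""
    T = speech_activity_db([abs(s) for s in S_hat], mu)
    if T >= threshold_db:
        return list(S_hat)
    c = 10.0 ** (atten_db / 20.0)  # -30 dB corresponds to c of about 0.0316
    return [c * x for x in X]

mu = [1.0, 1.0, 1.0, 1.0]
X = [1.0, 1.0, 1.0, 1.0]

speech_like = [2.0, 1.0, 0.5, 0.5]   # mean ratio 1.0, so T = 0 dB
noise_like = [0.1, 0.0, 0.2, 0.1]    # mean ratio 0.1, so T = -20 dB

kept = attenuate_if_nonspeech(speech_like, X, mu)
damped = attenuate_if_nonspeech(noise_like, X, mu)
```

Note that a nonspeech frame is replaced by an attenuated copy of the noisy input $X$, not of the estimate $\hat{S}$, so that some background remains audible between words.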
III. ALGORITHM IMPLEMENTATION
A. Introduction
Based on the development of the last section, a complete
analysis-synthesis algorithm can be constructed. This section presents
the specifications required to implement a spectral subtraction noise
suppression system.

B. Input-Output Data Buffering and Windowing
Speech from the A-D converter is segmented and windowed such that in the
absence of spectral modifications, if the synthesis speech segments are
added together, the resulting overall system reduces to an identity. The
data are segmented and windowed using the result [12] that if a sequence
is separated into half-overlapped data buffers, and each buffer is
multiplied by a Hanning window, then the sum of these windowed sequences
adds back up to the original sequence. The window length is chosen to be
approximately twice as large as the maximum expected pitch period for
adequate frequency resolution [13]. For the sampling rate of 8.00 kHz a
window length of 256 points shifted in steps of 128 points was used.
Fig. 2 shows the data segmentation and advance.

Fig. 2. Data segmentation and advance.

C. Frequency Analysis
The DFT of each data window is taken and the magnitude is computed.
Since real data are being transformed, two data windows can be
transformed using one FFT [14]. The FFT size is set equal to the window
size of 256. Augmentation with zeros was not incorporated. As correctly
noted by Allen [15], spectral modification followed by inverse
transforming can distort the time waveform due to temporal aliasing
caused by circular convolution with the time response of the
modification. Augmenting the input time waveform with zeros before
spectral modification will minimize this aliasing. Experiments with and
without augmentation using the helicopter speech resulted in negligible
differences, and therefore augmentation was not incorporated. Finally,
since real data are analyzed, transform symmetries were taken advantage
of to reduce storage requirements essentially in half [14].

D. Magnitude Averaging
As was described in the previous section, the variance of the noise
spectral estimate is reduced by averaging over as many spectral
magnitude sets as possible. However, the nonstationarity of the speech
limits the total time interval available for local averaging. The number
of averages is limited by the number of analysis windows which can be
fit into the stationary speech time interval. The choice of window
length and averaging interval must compromise between conflicting
requirements. For acceptable spectral resolution a window length greater
than twice the expected largest pitch period is required, with a
256-point window being used. For minimum noise variance a large number
of windows are required for averaging. Finally, for acceptable time
resolution a narrow analysis interval is required. A reasonable
compromise between variance reduction and time resolution appears to be
three averages. This results in an effective analysis time interval of
38 ms.

E. Bias Estimation
The spectral subtraction method requires an estimate at each frequency
bin of the expected value of the noise magnitude spectrum $\mu_N$:

$$\mu_N = E\{|N|\}.$$

This estimate is obtained by averaging the signal magnitude spectrum
$|X|$ during nonspeech activity. Estimating $\mu_N$ in this manner
places certain constraints when implementing the method. If the noise
remains stationary during the subsequent speech activity, then an
initial startup or calibration period of noise-only signal is required.
During this period (on the order of a third of a second) an estimate of
$\mu_N$ can be computed. If the noise environment is nonstationary, then
a new estimate of $\mu_N$ must be calculated prior to bias removal each
time the noise spectrum changes. Since the estimate is computed using
the noise-only signal during nonspeech activity, a voice switch is
required. When the voice switch is off, an average noise spectrum can be
recomputed. If the noise magnitude spectrum is changing faster than an
estimate of it can be computed, then time averaging to estimate $\mu_N$
cannot be used. Likewise, if the expected value of the noise spectrum
changes after an estimate of it has been computed, then noise reduction
through bias removal will be less effective or even harmful, i.e.,
removing speech where little noise is present.

F. Bias Removal and Half-Wave Rectification
The spectral subtraction spectral estimate $\hat{S}$ is obtained by
subtracting the expected noise magnitude spectrum $\mu$ from the
magnitude signal spectrum $|X|$. Thus

$$|\hat{S}(k)| = |X(k)| - \mu(k), \quad k = 0, 1, \ldots, L-1$$

or

$$\hat{S}(k) = H(k) \cdot X(k), \quad H(k) = 1 - \frac{\mu(k)}{|X(k)|}, \quad k = 0, 1, \ldots, L-1$$

where $L$ = DFT buffer length.
After subtracting, the differenced values having negative magnitudes are
set to zero (half-wave rectification). These negative differences
represent frequencies where the sum of speech plus local noise is less
than the expected noise.
`
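The calibration described in Sections III-E and III-F can be sketched as a single pass over noise-only frames. The function name `calibrate_noise` and the sample magnitudes are hypothetical, and computing the maximum noise residual in the same pass is an assumption about how an implementation might organize the work; the paper only states that both quantities are measured during nonspeech activity.

```python
def calibrate_noise(noise_mag_frames):
    """Return (mu, max_residual) from magnitude spectra of noise-only frames.

    mu[k] is the average of |X(k)| over the frames, used as the noise bias.
    max_residual[k] is the largest rectified leftover max(|X(k)| - mu[k], 0),
    which is later used as the ceiling for residual noise reduction.
    """
    n_frames = len(noise_mag_frames)
    n_bins = len(noise_mag_frames[0])
    mu = [sum(f[k] for f in noise_mag_frames) / n_frames
          for k in range(n_bins)]
    max_residual = [max(max(f[k] - mu[k], 0.0) for f in noise_mag_frames)
                    for k in range(n_bins)]
    return mu, max_residual

# Three hypothetical noise-only frames with two frequency bins each.
noise_frames = [[0.5, 1.0],
                [1.5, 1.0],
                [1.0, 1.0]]
mu, max_residual = calibrate_noise(noise_frames)
```

With a 128-point frame advance at 8 kHz, a calibration period of about a third of a second would correspond to roughly twenty frames; the three frames above are only for illustration. In a running system this function would be called again whenever the voice switch reports that speech has ceased.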
G. Residual Noise Reduction
As discussed in the previous section, the noise that remains after the
mean is removed can be suppressed or even removed by selecting the
minimum magnitude value from the three adjacent analysis frames in each
frequency bin where the current amplitude is less than the maximum noise
residual measured during nonspeech activity. This replacement procedure
follows bias removal and half-wave rectification. Since the minimum is
chosen from values on each side of the current time frame, the
modification induces a one frame delay. The improvement in performance
was judged superior to three frame averaging in that an equivalent
amount of noise suppression resulted without the adverse effect of
high-energy spectral smoothing. The following section presents examples
of spectra with and without residual noise reduction.
`
H. Additional Noise Suppression During Nonspeech Activity
The final improvement in noise reduction is signal suppression during
nonspeech activity. As was discussed, a balance must be maintained
between the magnitude and characteristics of the noise that is perceived
during speech activity and the noise that is perceived during speech
absence.
An effective speech activity detector was defined using spectra
generated by the spectral subtraction algorithm. This detector required
the determination of a threshold signaling absence of speech activity.
This threshold (T = -12 dB) was empirically determined to ensure that
only signals definitely consisting of background noise would be
attenuated.
`
I. Synthesis
After bias removal, rectification, residual noise removal, and nonspeech
signal suppression, a time waveform is reconstructed from the modified
magnitude corresponding to the center window. Again, since only real
data are generated, two time windows are computed simultaneously using
one inverse FFT. The data windows are then overlap added to form the
output speech sequence. The overall system block diagram is given in
Fig. 3.
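The identity property that underlies the buffering of Section III-B and the overlap-add synthesis above can be checked with a short sketch. The periodic form of the Hanning window, the function names, the `modify` hook standing in for the spectral modifications, and the short test signal are all assumptions made for illustration.

```python
import math

def hann(N):
    """Periodic Hanning window; half-overlapped copies sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / N) for n in range(N)]

def analysis_synthesis(x, N, modify=lambda frame: frame):
    """Window half-overlapped buffers, apply `modify`, then overlap-add."""
    hop = N // 2
    w = hann(N)
    y = [0.0] * len(x)
    for start in range(0, len(x) - N + 1, hop):
        frame = [w[n] * x[start + n] for n in range(N)]
        frame = modify(frame)  # spectral subtraction would happen here
        for n in range(N):
            y[start + n] += frame[n]
    return y

# With no modification the interior of the signal is reproduced exactly;
# the first and last half windows are covered by only one buffer.
x = [float(i % 5) for i in range(32)]
N = 8
y = analysis_synthesis(x, N)
w = hann(N)
```

The identity holds because $w(n) + w(n + N/2) = 1$ for this window, so every interior sample is weighted by two window values that sum to one. In the paper's configuration, N would be 256 with a 128-point advance.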
`
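The remark that two real data windows can share one transform rests on a standard symmetry identity, sketched below. Whether this is exactly the packing used in the paper is an assumption; the naive `dft` helper and the function name `two_real_dfts` are hypothetical and stand in for an FFT routine.

```python
import cmath
import math

def dft(x):
    """Naive O(L^2) DFT, standing in for an FFT."""
    L = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / L) for k in range(L))
            for m in range(L)]

def two_real_dfts(a, b):
    """Transform real sequences a and b with one complex DFT.

    Pack z = a + j*b. Because a and b are real, their spectra are conjugate
    symmetric, so A[m] = (Z[m] + conj(Z[-m])) / 2 and
    B[m] = (Z[m] - conj(Z[-m])) / (2j).
    """
    L = len(a)
    Z = dft([complex(ar, br) for ar, br in zip(a, b)])
    A, B = [], []
    for m in range(L):
        Zc = Z[-m % L].conjugate()  # conjugate of the mirrored bin
        A.append((Z[m] + Zc) / 2)
        B.append((Z[m] - Zc) / 2j)
    return A, B

a = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 1.5]
b = [0.0, 1.0, 1.0, 0.5, -0.5, -1.0, 2.0, 0.25]
A, B = two_real_dfts(a, b)
A_ref, B_ref = dft(a), dft(b)
```

The same idea applies in reverse at synthesis: two conjugate-symmetric spectra can be combined as A + jB, inverse transformed once, and the two real windows read off from the real and imaginary parts.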
IV. RESULTS

A. Introduction
Examples of the performance of spectral subtraction will be presented in
two forms: isometric plots of time versus frequency magnitude spectra,
with and without noise cancellation; and intelligibility and quality
measurement obtained from the Diagnostic Rhyme Test (DRT) [11]. The DRT
is a well-established method for evaluating speech processing devices.
Testing and scoring of the DRT data base was provided by Dynastat Inc.
[12]. A limited single speaker DRT test was used. The DRT data base
consisted of 192 words using speaker RH recorded in a helicopter
environment. A crew of 8 listeners was used.
The results are presented as follows: 1) short-time amplitude spectra of
helicopter speech; 2) DRT intelligibility and quality scores on LPC
vocoded speech using as input the data given in 2); and 3) short-time
spectra showing additional improvements in noise rejection through
residual noise suppression and nonspeech signal attenuation.

Fig. 3. System block diagram. [Stages recoverable from the figure, from
input x(n) to output s(n): half-wave rectify; reduce noise residual;
compute speech activity detector; attenuate signal during nonspeech
activity.]
`
B. Short-Time Spectra of Helicopter Speech
Isometric plots of time versus frequency magnitude spectra were
constructed from the data by computing and displaying magnitude spectra
from 64 overlapped Hanning windows. Each line represents a 128-point
frequency analysis. Time increases from bottom to top and frequency from
left to right. A 920 ms section of speech recorded with a
noise-cancelling microphone in a helicopter environment is presented.
The phrase "Save your" was filtered at 3.2 kHz and sampled at 6.67 kHz.
Since the noise was acoustically added, no underlying clean speech
signal is available. Fig. 4 shows the digitized time signal. Fig. 5
shows the average noise magnitude spectrum computed by averaging over
the first 300 ms of nonspeech activity. The short-time spectrum of the
noisy signal x is shown in Fig. 6. Note the high amplitude, narrow-band
ridges corresponding to the fundamental (1550 Hz) and first harmonic
(3100 Hz) of the helicopter engine, as well as the ramped noise floor
above 1800 Hz. Fig. 7 shows the result from bias removal and
rectification. Figs. 8 and 9 show the noisy spectrum and the spectral
subtraction estimate using three frame averaging.
These figures indicate that considerable noise rejection has been
achieved, although some noise residual remains. The next step was to
quantitatively measure the effect of spectral subtraction on
intelligibility and quality. For this task a limited single speaker DRT
was invoked to establish an anchor point for credibility.

Fig. 4. Time waveform of helicopter speech, "Save your".
Fig. 5. Average noise magnitude of helicopter noise.
Fig. 6. Short-time spectrum of helicopter speech.
Fig. 7. Short-time spectrum using bias removal and half-wave rectification.
Fig. 8. Short-time spectrum of helicopter speech using three frame averaging.
Fig. 9. Short-time spectrum using bias removal and half-wave rectification after three frame averaging.
`
TABLE I
DIAGNOSTIC RHYME TEST SCORES

               Original   $\hat{S}$ (No Average)   $\hat{S}$ (Three Averages)
Voicing           95               92                        91
Nasality          82               78                        77
Sustention        92               87                        86
Sibilation        75               83                        84
Graveness         68               70                        66
Compactness       88               87                        88
Total             84               83                        82

TABLE II
QUALITY RATINGS

                                   Original   $\hat{S}$ (No Average)   $\hat{S}$ (Three Averages)
Naturalness of Signal                 63               60                        61
Inconspicuousness of Background       36               38                        42
Intelligibility                       30               32                        33
Pleasantness                          20               31                        25
Overall Acceptability                 27               33                        29
Composite Acceptability               26               32                        29
`
C. Intelligibility and Quality Results Using the DRT
The DRT data base consisted of 192 words recorded in a helicopter
environment. The data base was filtered at 4 kHz and sampled at 8 kHz.
During the pause between each word, the noise bias was updated. Six
output speech files were generated: 1) digitized original; 2) speech
resulting from bias removal and rectification without averaging;
3) speech resulting from bias removal and rectification using three
averages; 4) an LPC vocoded version of original speech; 5) an LPC
vocoded version of 2); and 6) an LPC vocoded version of 3). The last
three experiments were conducted to measure intelligibility and quality
improvements resulting from the use of spectral subtraction as a
preprocessor to an LPC analysis-synthesis device. The LPC vocoder used
was a nonreal-time floating-point implementation [17]. A ten-pole
autocorrelation implementation was used with a SIFT pitch tracker [18].
The channel parameters used for synthesis were not quantized. Thus, any
degradation would not be attributed to parameter quantization, but
rather to the all-pole approximation to the spectrum and to the
buzz-hiss approximation to the error signal. In addition, a frame rate
of 40 frames/s was used, which is typical of 2400 bit/s implementations.
The vocoder on 3.2 kHz filtered clean speech achieved a DRT score of 88.
In addition to intelligibility, a coarse measure of quality [19] was
conducted using the same DRT data base. These quality scores are neither
quantitatively nor qualitatively equivalent to the more rigorous quality
tests such as PARM or DAM [20]. However, they do indicate on a relative
scale improvements between data sets. Modern 2.4 kbit/s systems are
expected to range from 45 to 50 on composite acceptability; unprocessed
speech, 88-92.
The results of the tests are summarized in Tables I-IV. Tables I and II
indicate that spectral subtraction alone does not decrease
intelligibility, but does increase quality, especially in the areas of
increased pleasantness and inconspicuousness of noise background.
Tables III and IV clearly indicate that spectral subtraction can be used
to improve the intelligibility and quality of speech processed through
an LPC bandwidth compression device.

D. Short-Time Spectra Using Residual Noise Reduction and
Nonspeech Signal Attenuation
Based on the promising results of these preliminary DRT experiments, the
algorithm was modified to incorporate residual noise reduction and
nonspeech signal attenuation. Fig. 10 shows the short-time spectra using
the helicopter speech data with both modifications added. Note that now
noise between words has been reduced below the resolution of the graph,
and noise within the words has been significantly attenuated (compare
with Fig. 7).

TABLE III
DIAGNOSTIC RHYME TEST SCORES

               LPC on Original   LPC on $\hat{S}$ without averaging   LPC on $\hat{S}$ with averaging
Voicing              84                        90                                  86
Nasality             56                        63                                  52
Sustention           49                        52                                  56
Sibilation           61                        70                                  88
Graveness            61                        62                                  59
Compactness          83                        83                                  93
Total                66                        70                                  72

TABLE IV
QUALITY RATINGS

                                   LPC on Original   LPC on $\hat{S}$ without averaging   LPC on $\hat{S}$ with averaging
Naturalness of Signal                    53                        49                                  58
Inconspicuousness of Background          34                        36                                  39
Intelligibility                          28                        30                                  28
Pleasantness                             15                        28                                  20
Overall Acceptability                    24                        28                                  26
Composite Acceptability                  23                        29                                  25
`
`
`
noise," IEEE Trans. Acoust., Speech, Signal Processing, vol.
ASSP-24, pp. 488-494, Dec. 1976.
[3] D. Coulter, private communication.
[4] S. F. Boll, "Improving linear prediction analysis of noisy speech
by predictive noise cancellation," in Proc. IEEE Int. Conf. on
Acoust., Speech, Signal Processing, Philadelphia, PA, Apr. 12-14,
1976, pp. 10-13.
[5] J. S. Lim and A. V. Oppenheim, "All pole modeling of degraded
speech," IEEE Trans. Acoust., Speech, Signal Processing, vol.
ASSP-26, pp. 197-210, June 1978.
[6] B. Gold, "Digital speech networks," Proc. IEEE, vol. 65, pp.
1636-1658, Dec. 1977.
[7] B. Beek, E. P. Neuberg, and D. C. Hodge, "An assessment of the
technology of automatic speech recognition for military
applications," IEEE Trans. Acoust., Speech, Signal Processing, vol.
ASSP-25, pp. 310-322, Aug. 1977.
[8] J. D. Markel, "Text independent speaker identification from a
large linguistically unconstrained time-spaced data base," in
Proc. IEEE Int. Conf. on Acoust., Speech, Signal Processing,
Tulsa, OK, Apr. 1978, pp. 287-291.
[9] B. Widrow et al., "Adaptive noise cancelling: Principles and
applications," Proc. IEEE, vol. 63, pp. 1692-1716, Dec. 1975.
[10] S. F. Boll and D. Pulsipher, "Noise suppression methods for
robust speech processing," Dep. Comput. Sci., Univ. Utah, Salt
Lake City, Semi-Annu. Tech. Rep., Utec-CSc-77-202, pp. 50-54,
Oct. 1977.
[11] W. D. Voiers, A. D. Sharpley, and C. H. Helmsath, "Research on
diagnostic evaluation of speech intelligibility," AFSC, Final Rep.,
Contract AF19628-70-C-0182, 1973.
[12] M. R. Weiss, E. Aschkenasy, and T. W. Parsons, "Study and
development of the INTEL technique for improving speech
intelligibility," Nicolet Scientific Corp., Final Rep. NSC-FR/4023,
Dec. 1974.
[13] J. Makhoul and J. Wolf, "Linear prediction and the spectral
analysis of speech," Bolt, Beranek, and Newman Inc., BBN
Rep. 2304, NTIS No. AD-749066, pp. 172-185, 1972.
[14] O. Br