`
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-27, NO. 2, APRIL 1979
`
`Suppression of Acoustic Noise in Speech Using
`Spectral Subtraction
`
Abstract-A stand-alone noise suppression algorithm is presented for reducing the spectral effects of acoustically added noise in speech. Effective performance of digital speech processors operating in practical environments may require suppression of noise from the digital waveform. Spectral subtraction offers a computationally efficient, processor-independent approach to effective digital speech analysis. The method, requiring about the same computation as high-speed convolution, suppresses stationary noise from speech by subtracting the spectral noise bias calculated during nonspeech activity. Secondary procedures are then applied to attenuate the residual noise left after subtraction. Since the algorithm resynthesizes a speech waveform, it can be used as a preprocessor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
`
I. INTRODUCTION
Background noise acoustically added to speech can degrade the performance of digital voice processors used for applications such as speech compression, recognition, and authentication [1], [2]. Digital voice systems will be used in a variety of environments, and their performance must be maintained at a level near that measured using noise-free input speech. To ensure continued reliability, the effects of background noise can be reduced by using noise-cancelling microphones, internal modification of the voice processor algorithms to explicitly compensate for signal contamination, or preprocessor noise reduction.
Noise-cancelling microphones, although essential for extremely high noise environments such as the helicopter cockpit, offer little or no noise reduction above 1 kHz [3] (see Fig. 5). Techniques available for voice processor modification to account for noise contamination are being developed [4], [5]. But due to the time, effort, and money spent on the design and implementation of these voice processors [6]-[8], there is a reluctance to internally modify these systems.
Preprocessor noise reduction [12], [21] offers the advantage that noise stripping is done on the waveform itself with the output being either digital or analog speech. Thus, existing voice processors tuned to clean speech can continue to be used unmodified. Also, since the output is speech, the noise stripping becomes independent of any specific subsequent speech processor implementation (it could be connected to a CCD channel vocoder or a digital LPC vocoder).

Manuscript received June 1, 1978; revised September 12, 1978. This research was supported by the Information Processing Branch of the Defense Advanced Research Projects Agency, monitored by the Naval Research Laboratory under Contract N00173-77-C-0041.
The author is with the Department of Computer Science, University of Utah, Salt Lake City, UT 84112.

The objectives of this effort were to develop a noise suppression technique, implement a computationally efficient algorithm, and test its performance in actual noise environments. The approach used was to estimate the magnitude frequency spectrum of the underlying clean speech by subtracting the noise magnitude spectrum from the noisy speech spectrum. This estimator requires an estimate of the current noise spectrum. Rather than obtain this noise estimate from a second microphone source [9], [10], it is approximated using the average noise magnitude measured during nonspeech activity. Using this approach, the spectral approximation error is then defined, and secondary methods for reducing it are described.
The noise suppressor is implemented using about the same amount of computation as required in a high-speed convolution. It is tested on speech recorded in a helicopter environment. Its performance is measured using the Diagnostic Rhyme Test (DRT) [11] and is demonstrated using isometric plots of short-time spectra.
The paper is divided into sections which develop the spectral estimator, describe the algorithm implementation, and demonstrate the algorithm performance.
`
II. SUBTRACTIVE NOISE SUPPRESSION ANALYSIS
A. Introduction
This section describes the noise-suppressed spectral estimator. The estimator is obtained by subtracting an estimate of the noise spectrum from the noisy speech spectrum. Spectral information required to describe the noise spectrum is obtained from the signal measured during nonspeech activity. After developing the spectral estimator, the spectral error is computed and four methods for reducing it are presented.
The following assumptions were used in developing the analysis. The background noise is acoustically or digitally added to the speech. The background noise environment remains locally stationary to the degree that its spectral magnitude expected value just prior to speech activity equals its expected value during speech activity. If the environment changes to a new stationary state, there exists enough time (about 300 ms) to estimate a new background noise spectral magnitude expected value before speech activity commences. For the slowly varying nonstationary noise environment, the algorithm requires a speech activity detector to signal the
0096-3518/79/0400-0113$00.75 © 1979 IEEE
`
`
`program that speech has ceased and a new noise bias can be
`estimated. Finally, it is assumed that significant noise reduc-
`tion is possible by removing the effect of noise from the mag-
`nitude spectrum only.
`Speech, suitably low-pass filtered and digitized, is analyzed
`by windowing data from half-overlapped input data buffers.
`The magnitude spectra of the windowed
`data are calculated
`and the spectral noise bias calculated during nonspeech activity
`is subtracted off. Resulting
`negative amplitudes are then
`zeroed out. Secondary residual noise suppression is then
`applied. A time waveform is recalculated from the modified
`magnitude. This waveform is then overlap added to the previ-
`ous data to generate the output speech.
`
B. Additive Noise Model
Assume that a windowed noise signal n(k) has been added to a windowed speech signal s(k), with their sum denoted by x(k). Then
$$x(k) = s(k) + n(k).$$
Taking the Fourier transform gives
$$X(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega})$$
where
$$x(k) \leftrightarrow X(e^{j\omega})$$
$$X(e^{j\omega}) = \sum_{k=0}^{L-1} x(k)\, e^{-j\omega k}$$
$$x(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega k}\, d\omega.$$

C. Spectral Subtraction Estimator
The spectral subtraction filter $H(e^{j\omega})$ is calculated by replacing the noise spectrum $N(e^{j\omega})$ with spectra which can be readily measured. The magnitude $|N(e^{j\omega})|$ of $N(e^{j\omega})$ is replaced by its average value $\mu(e^{j\omega})$ taken during nonspeech activity, and the phase $\theta_N(e^{j\omega})$ of $N(e^{j\omega})$ is replaced by the phase $\theta_x(e^{j\omega})$ of $X(e^{j\omega})$. These substitutions result in the spectral subtraction estimator $\hat{S}(e^{j\omega})$:
$$\hat{S}(e^{j\omega}) = \left[\, |X(e^{j\omega})| - \mu(e^{j\omega}) \,\right] e^{j\theta_x(e^{j\omega})}$$
or
$$\hat{S}(e^{j\omega}) = H(e^{j\omega})\, X(e^{j\omega})$$
with
$$H(e^{j\omega}) = 1 - \frac{\mu(e^{j\omega})}{|X(e^{j\omega})|}, \qquad \mu(e^{j\omega}) = E\{|N(e^{j\omega})|\}.$$

D. Spectral Error
The spectral error $\epsilon(e^{j\omega})$ resulting from this estimator is given by
$$\epsilon(e^{j\omega}) = \hat{S}(e^{j\omega}) - S(e^{j\omega}) = N(e^{j\omega}) - \mu(e^{j\omega})\, e^{j\theta_x}.$$
A number of simple modifications are available to reduce the auditory effects of this spectral error. These include: 1) magnitude averaging; 2) half-wave rectification; 3) residual noise reduction; and 4) additional signal attenuation during nonspeech activity.

E. Magnitude Averaging
Since the spectral error equals the difference between the noise spectrum $N$ and its mean $\mu$, local averaging of spectral magnitudes can be used to reduce the error. Replacing $|X(e^{j\omega})|$ with $\overline{|X(e^{j\omega})|}$ where
$$\overline{|X(e^{j\omega})|} = \frac{1}{M} \sum_{i=0}^{M-1} |X_i(e^{j\omega})|$$
$$X_i(e^{j\omega}) = i\text{th time-windowed transform of } x(k)$$
gives
$$\hat{S}_A(e^{j\omega}) = \left[\, \overline{|X(e^{j\omega})|} - \mu(e^{j\omega}) \,\right] e^{j\theta_x(e^{j\omega})}.$$
The rationale behind averaging is that the spectral error becomes approximately
$$\epsilon(e^{j\omega}) = \hat{S}_A(e^{j\omega}) - S(e^{j\omega}) \cong \overline{|N|} - \mu$$
where
$$\overline{|N(e^{j\omega})|} = \frac{1}{M} \sum_{i=0}^{M-1} |N_i(e^{j\omega})|.$$
Thus, the sample mean of $|N(e^{j\omega})|$ will converge to $\mu(e^{j\omega})$ as a longer average is taken.
The obvious problem with this modification is that the speech is nonstationary, and therefore only limited time averaging is allowed. DRT results show that averaging over more than three half-overlapped windows with a total time duration of 38.4 ms will decrease intelligibility. Spectral examples and DRT scores with and without averaging are given in the "Results" section. Based upon these results, it appears that averaging coupled with half rectification offers some improvement. The major disadvantage of averaging is the risk of some temporal smearing of short transitory sounds.

F. Half-Wave Rectification
For each frequency $\omega$ where the noisy signal spectrum magnitude $|X(e^{j\omega})|$ is less than the average noise spectrum magnitude $\mu(e^{j\omega})$, the output is set to zero. This modification can be simply implemented by half-wave rectifying $H(e^{j\omega})$. The estimator then becomes
$$\hat{S}(e^{j\omega}) = H_R(e^{j\omega})\, X(e^{j\omega})$$
where
$$H_R(e^{j\omega}) = \frac{H(e^{j\omega}) + |H(e^{j\omega})|}{2}.$$
The input-output relationship between $X(e^{j\omega})$ and $\hat{S}(e^{j\omega})$ at each frequency $\omega$ is shown in Fig. 1.

Fig. 1. Input-output relation between $X(e^{j\omega})$ and $\hat{S}(e^{j\omega})$.

Thus, the effect of half-wave rectification is to bias down the magnitude spectrum at each frequency $\omega$ by the noise bias determined at that frequency. The bias value can, of course, change from frequency to frequency as well as from analysis time window to time window. The advantage of half rectification is that the noise floor is reduced by $\mu(e^{j\omega})$. Also, any low variance coherent noise tones are essentially eliminated. The disadvantage of half rectification can exhibit itself in the situation where the sum of the noise plus speech at a frequency $\omega$ is less than $\mu(e^{j\omega})$. Then the speech information at that frequency is incorrectly removed, implying a possible decrease in intelligibility. As discussed in the section on "Results," for the helicopter speech data base this processing did not reduce intelligibility as measured using the DRT.
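
As an illustration of the half-wave-rectified estimator, the following Python sketch applies a gain of the form H_R = (H + |H|)/2, with H = 1 - mu/|X|, to a single frame of complex spectral values. It is a minimal sketch under stated assumptions, not the original implementation: the function name `spectral_subtract`, the handling of zero-magnitude bins, and the example values are illustrative choices.

```python
import cmath


def spectral_subtract(X, mu):
    """Apply a half-wave-rectified subtraction gain to one frame.

    X  : list of complex spectral values for the noisy frame
    mu : list of average noise magnitudes, one per frequency bin
    """
    S_hat = []
    for Xk, mk in zip(X, mu):
        mag = abs(Xk)
        if mag == 0.0:
            # Assumption: an empty bin stays empty (avoids division by zero).
            S_hat.append(0j)
            continue
        H = 1.0 - mk / mag          # subtraction gain, negative when |X| < mu
        H_R = (H + abs(H)) / 2.0    # half-wave rectification of the gain
        S_hat.append(H_R * Xk)      # real gain, so the phase of X is kept
    return S_hat


# Illustrative values only: three bins with magnitudes 4, 1, and 2.
X = [cmath.rect(4.0, 0.3), cmath.rect(1.0, -1.2), cmath.rect(2.0, 2.0)]
mu = [1.0, 1.5, 2.0]
S_hat = spectral_subtract(X, mu)
print([round(abs(s), 6) for s in S_hat])
```

In this example the first bin keeps its phase and loses the bias, while the second bin, where the noisy magnitude lies below the noise average, is set to zero.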
`
G. Residual Noise Reduction
After half-wave rectification, speech plus noise lying above $\mu$ remain. In the absence of speech activity the difference $N_R = N - \mu e^{j\theta_n}$, which shall be called the noise residual, will for uncorrelated noise exhibit itself in the spectrum as randomly spaced narrow bands of magnitude spikes (see Fig. 7). This noise residual will have a magnitude between zero and a maximum value measured during nonspeech activity. Transformed back to the time domain, the noise residual will sound like the sum of tone generators with random fundamental frequencies which are turned on and off at intervals of about 20 ms. During speech activity the noise residual will also be perceived at those frequencies which are not masked by the speech.
The audible effects of the noise residual can be reduced by taking advantage of its frame-to-frame randomness. Specifically, at a given frequency bin, since the noise residual will randomly fluctuate in amplitude at each analysis frame, it can be suppressed by replacing its current value with its minimum value chosen from the adjacent analysis frames. The minimum value is used only when the magnitude of $\hat{S}(e^{j\omega})$ is less than the maximum noise residual calculated during nonspeech activity. The motivation behind this replacement scheme is threefold: first, if the amplitude of $\hat{S}(e^{j\omega})$ lies below the maximum noise residual, and it varies radically from analysis frame to frame, then there is a high probability that the spectrum at that frequency is due to noise; therefore, suppress it by taking the minimum; second, if $\hat{S}(e^{j\omega})$ lies below the maximum but has a nearly constant value, there is a high probability that the spectrum at that frequency is due to low energy speech; therefore, taking the minimum will retain the information; and third, if $\hat{S}(e^{j\omega})$ is greater than the maximum, there is speech present at that frequency; therefore, removing the bias is sufficient. The amount of noise reduction using this replacement scheme was judged equivalent to that obtained by averaging over three frames. However, with this approach high energy frequency bins are not averaged together. The disadvantage to the scheme is that more storage is required to save the maximum noise residuals and the magnitude values for three adjacent frames.
The residual noise reduction scheme is implemented as
$$|\hat{S}_i(e^{j\omega})| = \begin{cases} |\hat{S}_i(e^{j\omega})|, & |\hat{S}_i(e^{j\omega})| \ge \max |N_R(e^{j\omega})| \\ \min\{\, |\hat{S}_j(e^{j\omega})|,\ j = i-1,\, i,\, i+1 \,\}, & |\hat{S}_i(e^{j\omega})| < \max |N_R(e^{j\omega})| \end{cases}$$
where
$$\hat{S}_i(e^{j\omega}) = H_R(e^{j\omega})\, X_i(e^{j\omega})$$
and
$\max |N_R(e^{j\omega})|$ = maximum value of noise residual measured during nonspeech activity.
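
The replacement rule described above can be sketched in Python as follows. This is an illustrative sketch rather than the original program: the function name `reduce_residual`, the list-of-lists data layout, and the treatment of the first and last frames (which use only the neighbors that exist) are assumptions introduced here.

```python
def reduce_residual(S_mag, max_residual):
    """Minimum-of-adjacent-frames residual noise reduction.

    S_mag        : list of frames; each frame is a list of rectified
                   magnitudes, one value per frequency bin
    max_residual : per-bin maximum noise residual measured during
                   nonspeech activity
    """
    n_frames = len(S_mag)
    out = []
    for i, frame in enumerate(S_mag):
        # Assumption: edge frames use whichever neighbors are available.
        lo, hi = max(i - 1, 0), min(i + 1, n_frames - 1)
        new_frame = []
        for k, value in enumerate(frame):
            if value < max_residual[k]:
                # Below the residual ceiling: take the minimum over frames
                # i-1, i, i+1 of the unmodified input magnitudes.
                value = min(S_mag[j][k] for j in range(lo, hi + 1))
            new_frame.append(value)
        out.append(new_frame)
    return out


# Illustrative values: bin 0 fluctuates below the ceiling, bin 1 carries speech.
frames = [[0.0, 5.0], [0.8, 5.2], [0.1, 0.3]]
cleaned = reduce_residual(frames, max_residual=[1.0, 1.0])
print(cleaned)
```

Because the minimum is taken over the unmodified input frames, the result for frame i does not depend on what was already written for frame i-1.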
`
H. Additional Signal Attenuation During Nonspeech Activity
The energy content of $\hat{S}(e^{j\omega})$ relative to $\mu(e^{j\omega})$ provides an accurate indicator of the presence of speech activity within a given analysis frame. If speech activity is absent, then $\hat{S}(e^{j\omega})$ will consist of the noise residual which remains after half-wave rectification and minimum value selection. Empirically, it was determined that the average (before versus after) power ratio was down at least 12 dB. This implied a measure for detecting the absence of speech given by
$$T = 20 \log_{10} \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \frac{\hat{S}(e^{j\omega})}{\mu(e^{j\omega})} \right| d\omega \right].$$
If T was less than -12 dB, the frame was classified as having no speech activity. During the absence of speech activity there are at least three options prior to resynthesis: do nothing, attenuate the output by a fixed factor, or set the output to zero. Having some signal present during nonspeech activity was judged to give the higher quality result. A possible reason for this is that noise present during speech activity is partially masked by the speech. Its perceived magnitude should be balanced by the presence of the same amount of noise during nonspeech activity. Setting the buffer to zero had the effect of amplifying the noise during speech activity. Likewise, doing nothing had the effect of amplifying the noise during nonspeech activity. A reasonable, though by no means optimum, amount of attenuation was found to be -30 dB. Thus, the output spectral estimate including output attenuation during nonspeech activity is given by
$$\hat{S}(e^{j\omega}) = \begin{cases} \hat{S}(e^{j\omega}), & T > -12\ \text{dB} \\ c\, X(e^{j\omega}), & T \le -12\ \text{dB} \end{cases}$$
where $20 \log_{10} c = -30$ dB.
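
A minimal Python sketch of this decision is given below. It assumes that T is evaluated as the average over frequency bins of the ratio of estimate magnitude to noise average, expressed in dB, as a discrete stand-in for the frequency integral. The function names, default arguments, and example values are illustrative assumptions.

```python
import math


def speech_absence_measure(S_mag, mu):
    """T in dB: bin-averaged ratio |S_hat| / mu (assumed discrete form)."""
    ratio = sum(s / m for s, m in zip(S_mag, mu)) / len(mu)
    return 20.0 * math.log10(ratio) if ratio > 0.0 else float("-inf")


def output_frame(S_hat, X, mu, threshold_db=-12.0, atten_db=-30.0):
    """Keep S_hat when speech is judged present, else return c * X."""
    T = speech_absence_measure([abs(s) for s in S_hat], mu)
    if T > threshold_db:
        return list(S_hat)
    c = 10.0 ** (atten_db / 20.0)   # 20 log10 c = atten_db
    return [c * x for x in X]


def rectified(X, mu):
    """Helper for the example: bias removal with half-wave rectification."""
    return [max(abs(x) - m, 0.0) for x, m in zip(X, mu)]


mu = [1.0, 1.0, 1.0, 1.0]
X_speech = [5.0, 4.0, 1.0, 2.0]   # illustrative frame with speech energy
X_noise = [1.1, 0.9, 1.0, 1.2]    # illustrative noise-only frame

T_speech = speech_absence_measure(rectified(X_speech, mu), mu)
T_noise = speech_absence_measure(rectified(X_noise, mu), mu)
out_noise = output_frame(rectified(X_noise, mu), X_noise, mu)
print(round(T_speech, 2), round(T_noise, 2))
```

With these values the speech frame falls well above the -12 dB threshold and is passed through, while the noise-only frame falls below it and is replaced by the attenuated input.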
`
`
`Fig. 2. Data segmentation and advance.
`
III. ALGORITHM IMPLEMENTATION
A. Introduction
Based on the development of the last section, a complete analysis-synthesis algorithm can be constructed. This section presents the specifications required to implement a spectral subtraction noise suppression system.
`
`B. Input-Output Data Buffering and Windowing
Speech from the A-D converter is segmented and windowed such that in the absence of spectral modifications, if the synthesis speech segments are added together, the resulting overall system reduces to an identity. The data are segmented and windowed using the result [12] that if a sequence is separated into half-overlapped data buffers, and each buffer is multiplied by a Hanning window, then the sum of these windowed sequences adds back up to the original sequence. The window length is chosen to be approximately twice as large as the maximum expected pitch period for adequate frequency resolution [13]. For the sampling rate of 8.00 kHz a window length of 256 points shifted in steps of 128 points was used. Fig. 2 shows the data segmentation and advance.
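
The identity property of half-overlapped Hanning windows can be checked numerically. The sketch below assumes the periodic form of the window, w(n) = 0.5 - 0.5 cos(2 pi n / N), for which w(n) + w(n + N/2) = 1 exactly. The function names and the sinusoidal test signal are illustrative assumptions.

```python
import math


def periodic_hann(N):
    """Periodic Hanning window of length N (assumed window definition)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / N) for n in range(N)]


def window_and_overlap_add(x, N):
    """Window half-overlapped buffers of x and add them back together."""
    hop = N // 2
    w = periodic_hann(N)
    y = [0.0] * len(x)
    for start in range(0, len(x) - N + 1, hop):
        for n in range(N):
            y[start + n] += w[n] * x[start + n]
    return y


# 256-point windows shifted in steps of 128 points, as in the text.
x = [math.sin(0.05 * n) for n in range(1024)]
y = window_and_overlap_add(x, 256)

# The first and last half-windows are covered by only one buffer,
# so the comparison is restricted to the fully overlapped interior.
max_err = max(abs(a - b) for a, b in zip(x[128:-128], y[128:-128]))
print(max_err)
```

Only the interior samples are reconstructed exactly; the first and last 128 samples are tapered because just one window covers them.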
`
C. Frequency Analysis
The DFT of each data window is taken and the magnitude is computed. Since real data are being transformed, two data windows can be transformed using one FFT [14]. The FFT size is set equal to the window size of 256. Augmentation with zeros was not incorporated. As correctly noted by Allen [15], spectral modification followed by inverse transforming can distort the time waveform due to temporal aliasing caused by circular convolution with the time response of the modification. Augmenting the input time waveform with zeros before spectral modification will minimize this aliasing. Experiments with and without augmentation using the helicopter speech resulted in negligible differences, and therefore augmentation was not incorporated. Finally, since real data are analyzed, transform symmetries were taken advantage of to reduce storage requirements essentially in half [14].
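
The standard way to transform two real sequences with a single complex transform is to pack one sequence into the real part and the other into the imaginary part, then separate the two spectra using conjugate symmetry. The Python sketch below illustrates the idea. A direct O(N^2) DFT stands in for the FFT so that the example needs only the standard library, and the function names and test data are illustrative assumptions.

```python
import cmath


def dft(z):
    """Direct DFT (stand-in for an FFT routine)."""
    N = len(z)
    return [sum(z[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def two_real_dfts(a, b):
    """Return the DFTs of real sequences a and b using one complex DFT."""
    N = len(a)
    Z = dft([complex(x, y) for x, y in zip(a, b)])   # z = a + j*b
    A, B = [], []
    for k in range(N):
        Zc = Z[(-k) % N].conjugate()   # conj(Z[N-k]) = A[k] - j*B[k]
        A.append((Z[k] + Zc) / 2)
        B.append((Z[k] - Zc) / 2j)
    return A, B


a = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
b = [0.5, -1.0, 2.5, 0.0, 1.0, -3.0, 0.25, 2.0]
A, B = two_real_dfts(a, b)
err = max(max(abs(u - v) for u, v in zip(A, dft(a))),
          max(abs(u - v) for u, v in zip(B, dft(b))))
print(err)
```

The separation works because the transform of a real sequence is conjugate symmetric, which is the same symmetry that allows storage to be cut roughly in half.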
`
`D. Magnitude Averaging
`As was described in the previous section, the variance of the
`noise spectral estimate is reduced by averaging over as many
`spectral magnitude sets as possible. However, the nonstation-
`arity of the speech limits the total time interval available for
`local averaging. The number of averages is
`limited by the
`number of analysis windows which can be fit into the stationary
`speech time interval. The choice of window length and averag-
`ing interval must compromise between
`conflicting require-
`ments. For acceptable spectral resolution
`a window length
`greater than twice the expected largest pitch period is required
`with a 256-point window being used. For minimum noise
`variance a large number of windows are required for averaging.
`Finally, for acceptable time resolution a narrow analysis inter-
`val is required. A reasonable compromise between variance
`reduction and
`time resolution appears to be three averages.
`This results in an effective analysis time interval of 38 ms.
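
For illustration, the local averaging of magnitude spectra over consecutive frames can be sketched in Python as below. The function name `average_magnitudes`, the choice to emit one averaged frame per group of M consecutive input frames, and the example values are assumptions introduced here.

```python
def average_magnitudes(mags, M=3):
    """Average magnitude spectra over each run of M consecutive frames.

    mags : list of frames; each frame is a list of |X_i| values per bin
    """
    averaged = []
    for i in range(len(mags) - M + 1):
        group = mags[i:i + M]
        # zip(*group) collects the M values belonging to each frequency bin.
        averaged.append([sum(col) / M for col in zip(*group)])
    return averaged


# Illustrative magnitudes: four frames of two bins each.
mags = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg = average_magnitudes(mags, M=3)
print(avg)
```

Only magnitudes are averaged; in this sketch the phase used for resynthesis would still be taken from the unaveraged frame.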
`
`E. Bias Estimation
The spectral subtraction method requires an estimate at each frequency bin of the expected value of noise magnitude spectrum $\mu_N$:
$$\mu_N = E\{|N|\}.$$
This estimate is obtained by averaging the signal magnitude spectrum $|X|$ during nonspeech activity. Estimating $\mu_N$ in this manner places certain constraints when implementing the method. If the noise remains stationary during the subsequent speech activity, then an initial startup or calibration period of noise-only signal is required. During this period (on the order of a third of a second) an estimate of $\mu_N$ can be computed. If the noise environment is nonstationary, then a new estimate of $\mu_N$ must be calculated prior to bias removal each time the noise spectrum changes. Since the estimate is computed using the noise-only signal during nonspeech activity, a voice switch is required. When the voice switch is off, an average noise spectrum can be recomputed. If the noise magnitude spectrum is changing faster than an estimate of it can be computed, then time averaging to estimate $\mu_N$ cannot be used. Likewise, if the expected value of the noise spectrum changes after an estimate of it has been computed, then noise reduction through bias removal will be less effective or even harmful, i.e., removing speech where little noise is present.
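
A minimal Python sketch of the calibration step is shown below. It averages magnitude spectra over noise-only frames to estimate the bias, records a per-bin maximum residual for later use, and counts how many half-overlapped frames fit in a calibration period. The function names, the definition of the residual as the rectified excess over the mean, and the example values are assumptions made for illustration.

```python
def estimate_noise_bias(noise_mags):
    """Estimate the per-bin noise bias from noise-only magnitude frames.

    Returns (mu, max_residual), where max_residual is taken here as the
    largest rectified excess of any calibration frame over mu (assumption).
    """
    n = len(noise_mags)
    columns = list(zip(*noise_mags))            # one tuple per frequency bin
    mu = [sum(col) / n for col in columns]
    max_residual = [max(max(v - m, 0.0) for v in col)
                    for col, m in zip(columns, mu)]
    return mu, max_residual


def frames_available(duration_ms, fs_hz=8000, N=256, hop=128):
    """Number of complete half-overlapped frames in a calibration period."""
    samples = duration_ms * fs_hz // 1000
    return 0 if samples < N else 1 + (samples - N) // hop


noise_mags = [[1.0, 2.0], [3.0, 2.0]]           # illustrative values
mu, max_residual = estimate_noise_bias(noise_mags)
n_frames = frames_available(300)                # 300 ms at 8 kHz
print(mu, max_residual, n_frames)
```

With the 8 kHz sampling rate and 256-point windows quoted in the text, a 300 ms calibration period contains 2400 samples, which in this sketch yields 17 complete half-overlapped frames for the average.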
`
`F. Bias Removal and Half- Wave Rectification
The spectral subtraction spectral estimate $\hat{S}$ is obtained by subtracting the expected noise magnitude spectrum $\mu$ from the magnitude signal spectrum $|X|$. Thus
$$|\hat{S}(k)| = |X(k)| - \mu(k), \qquad k = 0, 1, \ldots, L-1$$
where L = DFT buffer length.
After subtracting, the differenced values having negative magnitudes are set to zero (half-wave rectification). These negative differences represent frequencies where the sum of speech plus local noise is less than the expected noise.
`
`G. Residual Noise Reduction
`As discussed in the previous section, the noise that remains
`after the mean is removed can be suppressed or even removed
`by selecting the minimum magnitude value from the three
`adjacent analysis frames in each frequency
`bin where the
`current amplitude
`is less
`than the maximum noise residual
`measured during nonspeech activity. This replacement pro-
`cedure follows bias removal and half-wave rectification. Since
`the minimum is chosen from values on each side of the current
`time frame, the modification induces a one frame delay. The
`improvement in performance was judged superior to three
`frame averaging in that an equivalent amount of noise sup-
`pression resulted without the adverse effect of high-energy
`spectral smoothing. The following section presents examples
`of spectra with and without residual noise reduction.
`H. Additional Noise Suppression During Nonspeech Activity
`The final improvement in noise reduction is signal suppres-
`sion during nonspeech activity. As was discussed, a balance
`must be maintained between the magnitude and characteristics
`of the noise that is perceived during speech activity and the
`noise that is perceived during speech absence.
`An effective speech activity detector was defined using spec-
`tra generated by the spectral subtraction
`algorithm. This
`detector required the determination of a threshold signaling
`absence of speech activity. This threshold (T = - 12 dB) was
`empirically determined to ensure that only signals definitely
`consisting of background noise would be attenuated.
`I. Synthesis
`After bias removal, rectification, residual noise removal, and
`nonspeech signal suppression a time waveform is reconstructed
`from the modified magnitude corresponding to the center win-
`dow. Again, since only real data are generated, two time win-
`dows are computed simultaneously using one inverse FFT.
`The data windows are then overlap added to form the output
`speech sequence. The overall system block diagram is given in
`Fig. 3.
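
The analysis-synthesis chain can be sketched end to end in Python. This is a simplified illustration under stated assumptions, not the original program: it includes only windowing, bias removal with half-wave rectification, and overlap-add resynthesis, and it omits magnitude averaging, residual noise reduction, and nonspeech attenuation. A direct DFT with a short 16-point window replaces the 256-point FFT to keep the example small, and all names are illustrative.

```python
import cmath
import math


def dft(x, sign=-1):
    """Direct DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]


def suppress(x, mu, N=16):
    """Spectral subtraction with half-overlapped periodic Hanning windows."""
    hop = N // 2
    w = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / N) for n in range(N)]
    y = [0.0] * len(x)
    for start in range(0, len(x) - N + 1, hop):
        X = dft([w[n] * x[start + n] for n in range(N)])
        S = []
        for Xk, mk in zip(X, mu):
            mag = abs(Xk)
            # Bias removal and half-wave rectification; phase of X is kept.
            gain = max(mag - mk, 0.0) / mag if mag > 0.0 else 0.0
            S.append(gain * Xk)
        s = dft(S, sign=+1)
        for n in range(N):
            y[start + n] += s[n].real / N       # overlap-add
    return y


x = [math.sin(0.3 * n) for n in range(64)]
y_identity = suppress(x, [0.0] * 16)            # zero bias: identity system
y_silenced = suppress(x, [1.0e6] * 16)          # huge bias: everything removed
interior_err = max(abs(a - b) for a, b in zip(x[8:-8], y_identity[8:-8]))
print(interior_err, max(abs(v) for v in y_silenced))
```

With a zero noise bias the sketch reproduces the input over the fully overlapped interior, which matches the requirement that the system reduce to an identity in the absence of spectral modification.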
`
IV. RESULTS

A. Introduction
Examples of the performance of spectral subtraction will be presented in two forms: isometric plots of time versus frequency magnitude spectra, with and without noise cancellation; and intelligibility and quality measurement obtained from the Diagnostic Rhyme Test (DRT) [11]. The DRT is a well-established method for evaluating speech processing devices. Testing and scoring of the DRT data base was provided by Dynastat Inc. [12]. A limited single speaker DRT test was used. The DRT data base consisted of 192 words using speaker RH recorded in a helicopter environment. A crew of 8 listeners was used.
The results are presented as follows: 1) short-time amplitude spectra of helicopter speech; 2) DRT intelligibility and quality scores on LPC vocoded speech using as input the data given in 2); and 3) short-time spectra showing additional improvements in noise rejection through residual noise suppression and nonspeech signal attenuation.

Fig. 3. System block diagram. (Legible block labels: Hanning Window; Half-Wave Rectify; Reduce Noise Residual.)
`
`B. Short-Time Spectra of Helicopter Speech
Isometric plots of time versus frequency magnitude spectra were constructed from the data by computing and displaying magnitude spectra from 64 overlapped Hanning windows. Each line represents a 128-point frequency analysis. Time increases from bottom to top and frequency from left to right. A 920 ms section of speech recorded with a noise-cancelling microphone in a helicopter environment is presented. The phrase “Save your” was filtered at 3.2 kHz and sampled at 6.67 kHz. Since the noise was acoustically added, no underlying clean speech signal is available. Fig. 4 shows the digitized time signal. Fig. 5 shows the average noise magnitude spectrum computed by averaging over the first 300 ms of nonspeech activity. The short-time spectrum of the noisy signal x is shown in Fig. 6. Note the high amplitude, narrow-band ridges corresponding to the fundamental (1550 Hz) and first harmonic (3100 Hz) of the helicopter engine, as well as the ramped noise floor above 1800 Hz. Fig. 7 shows the result from bias removal and rectification. Figs. 8 and 9 show the noisy spectrum and the spectral subtraction estimate using three frame averaging.
`These figures indicate that considerable noise rejection has
`
`
Fig. 4. Time waveform of helicopter speech, “Save your” (record 1, 6144 samples).
Fig. 5. Average noise magnitude of helicopter noise.
Fig. 6. Short-time spectrum of helicopter speech.
Fig. 7. Short-time spectrum using bias removal and half-wave rectification.
Fig. 8. Short-time spectrum of helicopter speech using three frame averaging.
Fig. 9. Short-time spectrum using bias removal and half-wave rectification after three frame averaging.
`
BOLL: SUPPRESSION OF ACOUSTIC NOISE IN SPEECH
`
TABLE I
DIAGNOSTIC RHYME TEST SCORES

               Original    Ŝ (No Average)    Ŝ (Three Averages)
Voicing           95             92                  91
Nasality          82             78                  77
Sustention        92             87                  86
Sibilation        75             83                  84
Graveness         68             70                  66
Compactness       88             87                  88
Total             84             83                  82

TABLE II
QUALITY RATINGS

                                    Original    Ŝ (No Average)    Ŝ (Three Averages)
Naturalness of Signal                  63             60                  61
Inconspicuousness of Background        36             38                  42
Intelligibility                        30             32                  33
Pleasantness                           20             31                  25
Overall Acceptability                  27             33                  29
Composite Acceptability                26             32                  29
`
been achieved, although some noise residual remains. The next step was to quantitatively measure the effect of spectral subtraction on intelligibility and quality. For this task a limited single speaker DRT was invoked to establish an anchor point for credibility.
C. Intelligibility and Quality Results Using the DRT
The DRT data base consisted of 192 words recorded in a helicopter environment. The data base was filtered at 4 kHz and sampled at 8 kHz. During the pause between each word, the noise bias was updated. Six output speech files were generated: 1) digitized original; 2) speech resulting from bias removal and rectification without averaging; 3) speech resulting from bias removal and rectification using three averages; 4) an LPC vocoded version of original speech; 5) an LPC vocoded version of 2); and 6) an LPC vocoded version of 3). The last three experiments were conducted to measure intelligibility and quality improvements resulting from the use of spectral subtraction as a preprocessor to an LPC analysis-synthesis device. The LPC vocoder used was a nonreal-time floating-point implementation [17]. A ten-pole autocorrelation implementation was used with a SIFT pitch tracker [18]. The channel parameters used for synthesis were not quantized. Thus, any degradation would not be attributed to parameter quantization, but rather to the all-pole approximation to the spectrum and to the buzz-hiss approximation to the error signal. In addition, a frame rate of 40 frames/s was used which is typical of 2400 bit/s implementations. The vocoder on 3.2 kHz filtered clean speech achieved a DRT score of 88.
In addition to intelligibility, a coarse measure of quality [19] was conducted using the same DRT data base. These quality scores are neither quantitatively nor qualitatively equivalent to the more rigorous quality tests such as PARM or DAM [20]. However, they do indicate on a relative scale improvements between data sets. Modern 2.4 kbit/s systems are expected to range from 45 to 50 on composite acceptability; unprocessed speech, 88-92.
The results of the tests are summarized in Tables I-IV. Tables I and II indicate that spectral subtraction alone does not decrease intelligibility, but does increase quality, especially in the areas of increased pleasantness and inconspicuousness of noise background. Tables III and IV clearly indicate that spectral subtraction can be used to improve the intelligibility and quality of speech processed through an LPC bandwidth compression device.

TABLE III
DIAGNOSTIC RHYME TEST SCORES

               LPC on Original    LPC on Ŝ without averaging    LPC on Ŝ with averaging
Voicing              84                      90                           86
Nasality             56                      63                           52
Sustention           49                      52                           56
Sibilation           61                      70                           88
Graveness            61                      62                           59
Compactness          83                      83                           93
Total                66                      70                           72
`
`D. Short-Time Spectra Using Residual Noise Reduction and
`Nonspeech Signal Attenuation
`Based on the promising
`results of these preliminary DRT
`experiments, the algorithm was modified to incorporate resid-
`ual noise reduction and nonspeech signal attenuation. Fig. 10
`shows the short-time spectra using the helicopter speech data
`with both modifications added. Note that now noise between
`words has been reduced below the resolution of the graph, and
`noise within the words has been significantly attenuated (com-
`pare with Fig. 7).
`
TABLE IV
QUALITY RATINGS

                                    LPC on Original    LPC on Ŝ without averaging    LPC on Ŝ with averaging
Naturalness of Signal                     53                      49                           58
Inconspicuousness of Background           34                      36                           39
Intelligibility                           28                      30                           26
Pleasantness                              15                      28                           20
Overall Acceptability                     24                      26                           26
Composite Acceptability                   23                      29                           25
`
`
noise,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 488-494, Dec. 1976.
[3] D. Coulter, private communication.
[4] S. F. Boll, “Improving linear prediction analysis of noisy speech by predictive noise cancellation,” in Proc. IEEE Int. Conf. on Acoust., Speech, Signal Processing, Philadelphia, PA, Apr. 12-14, 1976, pp. 10-13.
[5] J. S. Lim and A. V. Oppenheim, “All pole modeling of degraded speech,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-26, pp. 197-210, June 1978.
[6] B. Gold, “Digital speech networks,” Proc. IEEE, vol. 65, pp. 1636-1658, Dec. 1977.
[7] B. Beek, E. P. Neuberg, and D. C. Hodge, “An assessment of the technology of automatic speech recognition for military applications,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, pp. 310-322, Aug. 1977.
[8] J. D. Markel, “Text independent speaker identification from a large linguistically unconstrained time-spaced data base,” in Proc. IEEE Int. Conf. on Acoust., Speech, Signal Processing, Tulsa, OK, Apr. 1978, pp. 287-291.
[9] B. Widrow et al., “Adaptive noise cancelling: Principles and applications,” Proc. IEEE, vol. 63, pp. 1692-1716, Dec. 1975.
[10] S. F. Boll and D. Pulsipher, “Noise suppression methods for robust speech processing,” Dep. Comput. Sci., Univ. Utah, Salt Lake City, Semi-Annu. Tech. Rep., Utec-CSc-77-202, pp. 50-54, Oct. 1977.
[11] W. D. Voiers, A. D. Sharpley, and C. H. Helmsath, “Research on diagnostic evaluation of speech intelligibility,” AFSC, Final Rep., Contract AF19628-70-C-0182, 1973.
[12] M. R. Weiss, E. Aschkenasy, and T. W. Parsons, “Study and development of the INTEL technique for improving speech intelligibility,” Nicolet Scientific Corp., Final Rep. NSC-FR/4023, Dec. 1974.
[13] J. Makhoul and J. Wolf, “Linear prediction and the spectral analysis of speech,” Bolt, Beranek, and