IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-27, NO. 2, APRIL 1979

Suppression of Acoustic Noise in Speech Using Spectral Subtraction

STEVEN F. BOLL, MEMBER, IEEE
Abstract: A stand-alone noise suppression algorithm is presented for reducing the spectral effects of acoustically added noise in speech. Effective performance of digital speech processors operating in practical environments may require suppression of noise from the digital waveform. Spectral subtraction offers a computationally efficient, processor-independent approach to effective digital speech analysis. The method, requiring about the same computation as high-speed convolution, suppresses stationary noise from speech by subtracting the spectral noise bias calculated during nonspeech activity. Secondary procedures are then applied to attenuate the residual noise left after subtraction. Since the algorithm resynthesizes a speech waveform, it can be used as a preprocessor to narrow-band voice communications systems, speech recognition systems, or speaker authentication systems.
`
I. INTRODUCTION

BACKGROUND noise acoustically added to speech can degrade the performance of digital voice processors used for applications such as speech compression, recognition, and authentication [1], [2]. Digital voice systems will be used in a variety of environments, and their performance must be maintained at a level near that measured using noise-free input speech. To ensure continued reliability, the effects of background noise can be reduced by using noise-cancelling microphones, internal modification of the voice processor algorithms to explicitly compensate for signal contamination, or preprocessor noise reduction.

Noise-cancelling microphones, although essential for extremely high noise environments such as the helicopter cockpit, offer little or no noise reduction above 1 kHz [3] (see Fig. 5). Techniques available for voice processor modification to account for noise contamination are being developed [4], [5]. But due to the time, effort, and money spent on the design and implementation of these voice processors [6]-[8], there is a reluctance to internally modify these systems.
Preprocessor noise reduction [12], [21] offers the advantage that noise stripping is done on the waveform itself, with the output being either digital or analog speech. Thus, existing voice processors tuned to clean speech can continue to be used unmodified. Also, since the output is speech, the noise stripping becomes independent of any specific subsequent speech processor implementation (it could be connected to a CCD channel vocoder or a digital LPC vocoder).

Manuscript received June 1, 1978; revised September 12, 1978. This research was supported by the Information Processing Branch of the Defense Advanced Research Projects Agency, monitored by the Naval Research Laboratory under Contract N00173-77-C-0041.
The author is with the Department of Computer Science, University of Utah, Salt Lake City, UT 84112.
The objectives of this effort were to develop a noise suppression technique, implement a computationally efficient algorithm, and test its performance in actual noise environments. The approach used was to estimate the magnitude frequency spectrum of the underlying clean speech by subtracting the noise magnitude spectrum from the noisy speech spectrum. This estimator requires an estimate of the current noise spectrum. Rather than obtain this noise estimate from a second microphone source [9], [10], it is approximated using the average noise magnitude measured during nonspeech activity. Using this approach, the spectral approximation error is then defined, and secondary methods for reducing it are described.

The noise suppressor is implemented using about the same amount of computation as required in a high-speed convolution. It is tested on speech recorded in a helicopter environment. Its performance is measured using the Diagnostic Rhyme Test (DRT) [11] and is demonstrated using isometric plots of short-time spectra.

The paper is divided into sections which develop the spectral estimator, describe the algorithm implementation, and demonstrate the algorithm performance.
`
II. SUBTRACTIVE NOISE SUPPRESSION ANALYSIS

A. Introduction

This section describes the noise-suppressed spectral estimator. The estimator is obtained by subtracting an estimate of the noise spectrum from the noisy speech spectrum. Spectral information required to describe the noise spectrum is obtained from the signal measured during nonspeech activity. After developing the spectral estimator, the spectral error is computed and four methods for reducing it are presented.
The following assumptions were used in developing the analysis. The background noise is acoustically or digitally added to the speech. The background noise environment remains locally stationary to the degree that its spectral magnitude expected value just prior to speech activity equals its expected value during speech activity. If the environment changes to a new stationary state, there exists enough time (about 300 ms) to estimate a new background noise spectral magnitude expected value before speech activity commences. For the slowly varying nonstationary noise environment, the algorithm requires a speech activity detector to signal the program that speech has ceased and a new noise bias can be estimated. Finally, it is assumed that significant noise reduction is possible by removing the effect of noise from the magnitude spectrum only.
Speech, suitably low-pass filtered and digitized, is analyzed by windowing data from half-overlapped input data buffers. The magnitude spectra of the windowed data are calculated, and the spectral noise bias calculated during nonspeech activity is subtracted off. Resulting negative amplitudes are then zeroed out. Secondary residual noise suppression is then applied. A time waveform is recalculated from the modified magnitude. This waveform is then overlap added to the previous data to generate the output speech.
`
B. Additive Noise Model

Assume that a windowed noise signal n(k) has been added to a windowed speech signal s(k), with their sum denoted by x(k). Then

x(k) = s(k) + n(k).

Taking the Fourier transform gives

X(e^{jω}) = S(e^{jω}) + N(e^{jω})

where

x(k) ↔ X(e^{jω})

X(e^{jω}) = Σ_{k=0}^{L-1} x(k) e^{-jωk}

x(k) = (1/2π) ∫_{-π}^{π} X(e^{jω}) e^{jωk} dω.
C. Spectral Subtraction Estimator

The spectral subtraction filter H(e^{jω}) is calculated by replacing the noise spectrum N(e^{jω}) with spectra which can be readily measured. The magnitude |N(e^{jω})| of N(e^{jω}) is replaced by its average value μ(e^{jω}) taken during nonspeech activity, and the phase θ_N(e^{jω}) of N(e^{jω}) is replaced by the phase θ_x(e^{jω}) of X(e^{jω}). These substitutions result in the spectral subtraction estimator Ŝ(e^{jω}):

Ŝ(e^{jω}) = [|X(e^{jω})| - μ(e^{jω})] e^{jθ_x(e^{jω})}

or

Ŝ(e^{jω}) = H(e^{jω}) X(e^{jω})

with

H(e^{jω}) = 1 - μ(e^{jω}) / |X(e^{jω})|

μ(e^{jω}) = E{|N(e^{jω})|}.

D. Spectral Error

The spectral error ε(e^{jω}) resulting from this estimator is given by

ε(e^{jω}) = Ŝ(e^{jω}) - S(e^{jω}) = N(e^{jω}) - μ(e^{jω}) e^{jθ_x(e^{jω})}.

A number of simple modifications are available to reduce the auditory effects of this spectral error. These include: 1) magnitude averaging; 2) half-wave rectification; 3) residual noise reduction; and 4) additional signal attenuation during nonspeech activity.
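To make the basic estimator of Section II-C concrete, the following Python sketch applies the subtraction to one windowed frame; the function name, the use of NumPy, and the assumption that μ shares the full DFT bin layout are illustrative choices for this sketch rather than details from the paper.

```python
import numpy as np

def subtract_bias(x_frame, mu):
    """Raw spectral subtraction estimator (no rectification yet):
    S_hat(e^jw) = [|X(e^jw)| - mu(e^jw)] * exp(j * theta_x(e^jw)).

    x_frame : one windowed frame of noisy speech, x(k) = s(k) + n(k)
    mu      : average noise magnitude spectrum estimated during
              nonspeech activity, same bin layout as the DFT
    """
    X = np.fft.fft(x_frame)                    # X(e^jw)
    mag = np.abs(X) - mu                       # |X| - mu (may go negative)
    S_hat = mag * np.exp(1j * np.angle(X))     # reuse the noisy phase theta_x
    return np.real(np.fft.ifft(S_hat))         # modified frame, later overlap-added
```

The negative differences left in `mag` are exactly what the half-wave rectification of Section II-F later zeroes out.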
`
E. Magnitude Averaging

Since the spectral error equals the difference between the noise spectrum N and its mean μ, local averaging of spectral magnitudes can be used to reduce the error. Replacing |X(e^{jω})| with |X̄(e^{jω})|, where

|X̄(e^{jω})| = (1/M) Σ_{i=0}^{M-1} |X_i(e^{jω})|

X_i(e^{jω}) = ith time-windowed transform of x(k),

gives

S_A(e^{jω}) = [|X̄(e^{jω})| - μ(e^{jω})] e^{jθ_x(e^{jω})}.

The rationale behind averaging is that the spectral error becomes approximately

ε(e^{jω}) = S_A(e^{jω}) - S(e^{jω}) ≈ |N̄(e^{jω})| - μ(e^{jω})

where

|N̄(e^{jω})| = (1/M) Σ_{i=0}^{M-1} |N_i(e^{jω})|.

Thus, the sample mean of |N(e^{jω})| will converge to μ(e^{jω}) as a longer average is taken.

The obvious problem with this modification is that the speech is nonstationary, and therefore only limited time averaging is allowed. DRT results show that averaging over more than three half-overlapped windows with a total time duration of 38.4 ms will decrease intelligibility. Spectral examples and DRT scores with and without averaging are given in the "Results" section. Based upon these results, it appears that averaging coupled with half rectification offers some improvement. The major disadvantage of averaging is the risk of some temporal smearing of short transitory sounds.
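A minimal sketch of the local averaging described above, assuming the windowed frames are already stacked in an array; averaging over M = 3 adjacent frames per frequency bin mirrors the 38.4 ms limit quoted in the text.

```python
import numpy as np

def averaged_magnitudes(frames, M=3):
    """Replace each |X_i(e^jw)| by the mean over M adjacent analysis frames.

    frames : (n_frames, N) array of half-overlapped, windowed data buffers
    """
    mags = np.abs(np.fft.fft(frames, axis=1))      # |X_i(e^jw)| for every frame
    kernel = np.ones(M) / M
    # centered moving average across frames, independently for each frequency bin
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode='same'), 0, mags)

# Use the averaged magnitude in place of |X| in the subtraction:
#   S_A = (averaged - mu) * exp(j * theta_x)
```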
`
F. Half-Wave Rectification

For each frequency ω where the noisy signal spectrum magnitude |X(e^{jω})| is less than the average noise spectrum magnitude μ(e^{jω}), the output is set to zero. This modification can be simply implemented by half-wave rectifying H(e^{jω}). The estimator then becomes

Ŝ(e^{jω}) = H_R(e^{jω}) X(e^{jω})

where

H_R(e^{jω}) = [H(e^{jω}) + |H(e^{jω})|] / 2.

The input-output relationship between X(e^{jω}) and Ŝ(e^{jω}) at each frequency ω is shown in Fig. 1.

Fig. 1. Input-output relation between X(e^{jω}) and Ŝ(e^{jω}).

Thus, the effect of half-wave rectification is to bias down the magnitude spectrum at each frequency ω by the noise bias determined at that frequency. The bias value can, of course, change from frequency to frequency as well as from analysis time window to time window. The advantage of half rectification is that the noise floor is reduced by μ(e^{jω}). Also, any low variance coherent noise tones are essentially eliminated. The disadvantage of half rectification can exhibit itself in the situation where the sum of the noise plus speech at a frequency ω is less than μ(e^{jω}). Then the speech information at that frequency is incorrectly removed, implying a possible decrease in intelligibility. As discussed in the section on "Results," for the helicopter speech data base this processing did not reduce intelligibility as measured using the DRT.
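A small sketch of the rectified filter: since H_R(e^{jω}) = [H(e^{jω}) + |H(e^{jω})|]/2 is just max(H, 0), every bin whose noisy magnitude falls below μ is zeroed. The helper name and the small divide-by-zero guard are assumptions of this sketch.

```python
import numpy as np

def rectified_gain(X, mu, eps=1e-12):
    """Half-wave rectified subtraction filter H_R(e^jw) = (H + |H|) / 2.

    X  : DFT of one windowed noisy frame
    mu : average noise magnitude spectrum from nonspeech activity
    """
    H = 1.0 - mu / np.maximum(np.abs(X), eps)   # H(e^jw) = 1 - mu/|X|
    return np.maximum(H, 0.0)                   # identical to (H + |H|) / 2

# Usage: S_hat = rectified_gain(X, mu) * X   (the noisy phase is kept automatically)
```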
`
G. Residual Noise Reduction

After half-wave rectification, speech plus noise lying above μ remain. In the absence of speech activity the difference N_R = N - μ e^{jθ_N}, which shall be called the noise residual, will for uncorrelated noise exhibit itself in the spectrum as randomly spaced narrow bands of magnitude spikes (see Fig. 7). This noise residual will have a magnitude between zero and a maximum value measured during nonspeech activity. Transformed back to the time domain, the noise residual will sound like the sum of tone generators with random fundamental frequencies which are turned on and off at a rate of about 20 ms. During speech activity the noise residual will also be perceived at those frequencies which are not masked by the speech.

The audible effects of the noise residual can be reduced by taking advantage of its frame-to-frame randomness. Specifically, at a given frequency bin, since the noise residual will randomly fluctuate in amplitude at each analysis frame, it can be suppressed by replacing its current value with its minimum value chosen from the adjacent analysis frames. Taking the minimum value is used only when the magnitude of Ŝ(e^{jω}) is less than the maximum noise residual calculated during nonspeech activity. The motivation behind this replacement scheme is threefold: first, if the amplitude of Ŝ(e^{jω}) lies below the maximum noise residual and it varies radically from analysis frame to frame, then there is a high probability that the spectrum at that frequency is due to noise; therefore, suppress it by taking the minimum; second, if Ŝ(e^{jω}) lies below the maximum but has a nearly constant value, there is a high probability that the spectrum at that frequency is due to low energy speech; therefore, taking the minimum will retain the information; and third, if Ŝ(e^{jω}) is greater than the maximum, there is speech present at that frequency; therefore, removing the bias is sufficient. The amount of noise reduction using this replacement scheme was judged equivalent to that obtained by averaging over three frames. However, with this approach high energy frequency bins are not averaged together. The disadvantage of the scheme is that more storage is required to save the maximum noise residuals and the magnitude values for three adjacent frames.

The residual noise reduction scheme is implemented as

|Ŝ_i(e^{jω})| = |Ŝ_i(e^{jω})|,  for |Ŝ_i(e^{jω})| ≥ max |N_R(e^{jω})|

|Ŝ_i(e^{jω})| = min{ |Ŝ_j(e^{jω})| : j = i-1, i, i+1 },  for |Ŝ_i(e^{jω})| < max |N_R(e^{jω})|

where

Ŝ_i(e^{jω}) = H_R(e^{jω}) X_i(e^{jω})

and

max |N_R(e^{jω})| = maximum value of the noise residual measured during nonspeech activity.
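The replacement rule above translates almost directly into code; in this sketch (array shapes and names are assumptions) the minimum is taken over the previous, current, and next frames, which is why the implementation described in Section III-G incurs a one-frame delay.

```python
import numpy as np

def reduce_residual_noise(S_mag, max_residual):
    """Residual noise reduction: where |S_hat_i| is below the maximum noise
    residual, substitute the minimum over frames i-1, i, i+1 in that bin.

    S_mag        : (n_frames, N) array of |S_hat_i(e^jw)| after rectification
    max_residual : length-N array, max |N_R(e^jw)| measured during nonspeech activity
    """
    out = S_mag.copy()
    for i in range(1, len(S_mag) - 1):
        three_frame_min = S_mag[i - 1:i + 2].min(axis=0)
        below = S_mag[i] < max_residual
        out[i][below] = three_frame_min[below]
    return out
```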
`
H. Additional Signal Attenuation During Nonspeech Activity

The energy content of Ŝ(e^{jω}) relative to μ(e^{jω}) provides an accurate indicator of the presence of speech activity within a given analysis frame. If speech activity is absent, then Ŝ(e^{jω}) will consist of the noise residual which remains after half-wave rectification and minimum value selection. Empirically, it was determined that the average (before versus after) power ratio was down at least 12 dB. This implied a measure for detecting the absence of speech given by

T = 20 log_{10} [ (1/2π) ∫_{-π}^{π} |Ŝ(e^{jω}) / μ(e^{jω})| dω ].

If T was less than -12 dB, the frame was classified as having no speech activity. During the absence of speech activity there are at least three options prior to resynthesis: do nothing, attenuate the output by a fixed factor, or set the output to zero. Having some signal present during nonspeech activity was judged to give the higher quality result. A possible reason for this is that noise present during speech activity is partially masked by the speech. Its perceived magnitude should be balanced by the presence of the same amount of noise during nonspeech activity. Setting the buffer to zero had the effect of amplifying the noise during speech activity. Likewise, doing nothing had the effect of amplifying the noise during nonspeech activity. A reasonable, though by no means optimum, amount of attenuation was found to be -30 dB. Thus, the output spectral estimate including output attenuation during nonspeech activity is given by

Ŝ(e^{jω}) = Ŝ(e^{jω}),     T ≥ -12 dB
Ŝ(e^{jω}) = c X(e^{jω}),   T < -12 dB

where 20 log_{10} c = -30 dB.
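A sketch of this frame classifier and attenuation rule, with the integral over ω replaced by a mean over DFT bins and a small epsilon added to keep the logarithm finite (both assumptions of the sketch):

```python
import numpy as np

def attenuate_nonspeech(S_hat, X, mu, thresh_db=-12.0, c_db=-30.0, eps=1e-12):
    """If T = 20*log10(mean |S_hat/mu|) falls below -12 dB, treat the frame as
    noise only and output c*X with 20*log10(c) = -30 dB instead of S_hat."""
    T = 20.0 * np.log10(np.mean(np.abs(S_hat) / np.maximum(mu, eps)) + eps)
    if T < thresh_db:
        return (10.0 ** (c_db / 20.0)) * X   # keep a little noise rather than silence
    return S_hat
```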
`
III. ALGORITHM IMPLEMENTATION

A. Introduction

Based on the development of the last section, a complete analysis-synthesis algorithm can be constructed. This section presents the specifications required to implement a spectral subtraction noise suppression system.

B. Input-Output Data Buffering and Windowing

Speech from the A-D converter is segmented and windowed such that in the absence of spectral modifications, if the synthesis speech segments are added together, the resulting overall system reduces to an identity. The data are segmented and windowed using the result [12] that if a sequence is separated into half-overlapped data buffers, and each buffer is multiplied by a Hanning window, then the sum of these windowed sequences adds back up to the original sequence. The window length is chosen to be approximately twice as large as the maximum expected pitch period for adequate frequency resolution [13]. For the sampling rate of 8.00 kHz, a window length of 256 points shifted in steps of 128 points was used. Fig. 2 shows the data segmentation and advance.

Fig. 2. Data segmentation and advance.

C. Frequency Analysis

The DFT of each data window is taken and the magnitude is computed. Since real data are being transformed, two data windows can be transformed using one FFT [14]. The FFT size is set equal to the window size of 256. Augmentation with zeros was not incorporated. As correctly noted by Allen [15], spectral modification followed by inverse transforming can distort the time waveform due to temporal aliasing caused by circular convolution with the time response of the modification. Augmenting the input time waveform with zeros before spectral modification will minimize this aliasing. Experiments with and without augmentation using the helicopter speech resulted in negligible differences, and therefore augmentation was not incorporated. Finally, since real data are analyzed, transform symmetries were taken advantage of to reduce storage requirements by essentially half [14].
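A short sketch of the segmentation and overlap-add identity described in Sections III-B and III-C; a periodic Hanning window is used here because it sums exactly to one at 50 percent overlap, and the two-real-signals-per-FFT trick of [14] is omitted for clarity (both are choices of this sketch, not of the paper).

```python
import numpy as np

N, HOP = 256, 128                                         # 256-point window, half overlap
WIN = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)    # periodic Hanning window

def analyze(x):
    """Split x into half-overlapped, Hanning-windowed buffers."""
    n_frames = 1 + (len(x) - N) // HOP
    return np.stack([WIN * x[i * HOP:i * HOP + N] for i in range(n_frames)])

def overlap_add(frames, length):
    """Sum the (possibly modified) buffers back into one waveform."""
    y = np.zeros(length)
    for i, f in enumerate(frames):
        y[i * HOP:i * HOP + N] += f
    return y

# With no spectral modification the system reduces to an identity
# (apart from the first and last half window, which are covered only once).
x = np.random.randn(4096)
y = overlap_add(analyze(x), len(x))
assert np.allclose(x[HOP:-HOP], y[HOP:-HOP])
```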
D. Magnitude Averaging

As was described in the previous section, the variance of the noise spectral estimate is reduced by averaging over as many spectral magnitude sets as possible. However, the nonstationarity of the speech limits the total time interval available for local averaging. The number of averages is limited by the number of analysis windows which can be fit into the stationary speech time interval. The choice of window length and averaging interval must compromise between conflicting requirements. For acceptable spectral resolution a window length greater than twice the expected largest pitch period is required, with a 256-point window being used. For minimum noise variance a large number of windows are required for averaging. Finally, for acceptable time resolution a narrow analysis interval is required. A reasonable compromise between variance reduction and time resolution appears to be three averages. This results in an effective analysis time interval of 38 ms.
`
E. Bias Estimation

The spectral subtraction method requires an estimate at each frequency bin of the expected value of the noise magnitude spectrum μ_N:

μ_N = E{|N|}.

This estimate is obtained by averaging the signal magnitude spectrum |X| during nonspeech activity. Estimating μ_N in this manner places certain constraints when implementing the method. If the noise remains stationary during the subsequent speech activity, then an initial startup or calibration period of noise-only signal is required. During this period (on the order of a third of a second) an estimate of μ_N can be computed. If the noise environment is nonstationary, then a new estimate of μ_N must be calculated prior to bias removal each time the noise spectrum changes. Since the estimate is computed using the noise-only signal during nonspeech activity, a voice switch is required. When the voice switch is off, an average noise spectrum can be recomputed. If the noise magnitude spectrum is changing faster than an estimate of it can be computed, then time averaging to estimate μ_N cannot be used. Likewise, if the expected value of the noise spectrum changes after an estimate of it has been computed, then noise reduction through bias removal will be less effective or even harmful, i.e., removing speech where little noise is present.
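The bias estimate is just a running average of |X| over frames that a voice switch has labeled as noise only; a minimal sketch (the class name and the external voice switch are assumptions) could look like this:

```python
import numpy as np

class NoiseBias:
    """Running estimate of mu_N = E{|N|}, accumulated over noise-only frames."""

    def __init__(self, n_fft=256):
        self.total = np.zeros(n_fft)
        self.count = 0

    def update(self, x_frame):
        """Call only for frames the voice switch classified as nonspeech."""
        self.total += np.abs(np.fft.fft(x_frame))
        self.count += 1

    def mu(self):
        """Current estimate; roughly 300 ms of noise-only data is needed."""
        return self.total / max(self.count, 1)
```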
`
F. Bias Removal and Half-Wave Rectification

The spectral subtraction spectral estimate Ŝ is obtained by subtracting the expected noise magnitude spectrum μ from the magnitude signal spectrum |X|. Thus

|Ŝ(k)| = |X(k)| - μ(k),   k = 0, 1, ..., L - 1

or

Ŝ(k) = H(k) X(k),   H(k) = 1 - μ(k)/|X(k)|,   k = 0, 1, ..., L - 1

where L = DFT buffer length. After subtracting, the differenced values having negative magnitudes are set to zero (half-wave rectification). These negative differences represent frequencies where the sum of speech plus local noise is less than the expected noise.
`
`
`

`

`
G. Residual Noise Reduction

As discussed in the previous section, the noise that remains after the mean is removed can be suppressed or even removed by selecting the minimum magnitude value from the three adjacent analysis frames in each frequency bin where the current amplitude is less than the maximum noise residual measured during nonspeech activity. This replacement procedure follows bias removal and half-wave rectification. Since the minimum is chosen from values on each side of the current time frame, the modification induces a one frame delay. The improvement in performance was judged superior to three frame averaging in that an equivalent amount of noise suppression resulted without the adverse effect of high-energy spectral smoothing. The following section presents examples of spectra with and without residual noise reduction.
`
H. Additional Noise Suppression During Nonspeech Activity

The final improvement in noise reduction is signal suppression during nonspeech activity. As was discussed, a balance must be maintained between the magnitude and characteristics of the noise that is perceived during speech activity and the noise that is perceived during speech absence.

An effective speech activity detector was defined using spectra generated by the spectral subtraction algorithm. This detector required the determination of a threshold signaling absence of speech activity. This threshold (T = -12 dB) was empirically determined to ensure that only signals definitely consisting of background noise would be attenuated.
`
I. Synthesis

After bias removal, rectification, residual noise removal, and nonspeech signal suppression, a time waveform is reconstructed from the modified magnitude corresponding to the center window. Again, since only real data are generated, two time windows are computed simultaneously using one inverse FFT. The data windows are then overlap added to form the output speech sequence. The overall system block diagram is given in Fig. 3.
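Putting the pieces of Section III together, a compact end-to-end sketch might look as follows; the single-pass frame loop, the names, and the omission of the two-windows-per-FFT optimization are assumptions of the sketch, and μ and the maximum noise residual are assumed to have been estimated beforehand from nonspeech activity.

```python
import numpy as np

def spectral_subtraction(x, mu, max_residual, N=256, hop=128,
                         thresh_db=-12.0, c_db=-30.0, eps=1e-12):
    """Windowing, bias removal, half-wave rectification, residual noise
    reduction, nonspeech attenuation, and overlap-add resynthesis."""
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)    # periodic Hanning
    n_frames = 1 + (len(x) - N) // hop
    spectra, mags = [], []
    for i in range(n_frames):                                 # analysis
        X = np.fft.fft(win * x[i * hop:i * hop + N])
        spectra.append(X)
        mags.append(np.maximum(np.abs(X) - mu, 0.0))          # bias removal + rectification
    mags = np.array(mags)
    raw = mags.copy()
    for i in range(1, n_frames - 1):                          # residual noise reduction
        low = raw[i] < max_residual
        mags[i][low] = raw[i - 1:i + 2].min(axis=0)[low]
    c = 10.0 ** (c_db / 20.0)
    y = np.zeros(len(x))
    for i, (X, mag) in enumerate(zip(spectra, mags)):         # synthesis
        T = 20.0 * np.log10(np.mean(mag / np.maximum(mu, eps)) + eps)
        S = c * X if T < thresh_db else mag * np.exp(1j * np.angle(X))
        y[i * hop:i * hop + N] += np.real(np.fft.ifft(S))     # overlap-add
    return y
```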
`
IV. RESULTS

A. Introduction

Examples of the performance of spectral subtraction will be presented in two forms: isometric plots of time versus frequency magnitude spectra, with and without noise cancellation; and intelligibility and quality measurements obtained from the Diagnostic Rhyme Test (DRT) [11]. The DRT is a well-established method for evaluating speech processing devices. Testing and scoring of the DRT data base was provided by Dynastat Inc. [12]. A limited single speaker DRT test was used. The DRT data base consisted of 192 words using speaker RH recorded in a helicopter environment. A crew of 8 listeners was used.
The results are presented as follows: 1) short-time amplitude spectra of helicopter speech; 2) DRT intelligibility and quality scores on spectral subtraction; 3) DRT intelligibility and quality scores on LPC vocoded speech using as input the data given in 2); and 4) short-time spectra showing additional improvements in noise rejection through residual noise suppression and nonspeech signal attenuation.

Fig. 3. System block diagram.
`
B. Short-Time Spectra of Helicopter Speech

Isometric plots of time versus frequency magnitude spectra were constructed from the data by computing and displaying magnitude spectra from 64 overlapped Hanning windows. Each line represents a 128-point frequency analysis. Time increases from bottom to top and frequency from left to right. A 920 ms section of speech recorded with a noise-cancelling microphone in a helicopter environment is presented. The phrase "Save your" was filtered at 3.2 kHz and sampled at 6.67 kHz. Since the noise was acoustically added, no underlying clean speech signal is available. Fig. 4 shows the digitized time signal. Fig. 5 shows the average noise magnitude spectrum computed by averaging over the first 300 ms of nonspeech activity. The short-time spectrum of the noisy signal x is shown in Fig. 6. Note the high amplitude, narrow-band ridges corresponding to the fundamental (1550 Hz) and first harmonic (3100 Hz) of the helicopter engine, as well as the ramped noise floor above 1800 Hz. Fig. 7 shows the result from bias removal and rectification. Figs. 8 and 9 show the noisy spectrum and the spectral subtraction estimate using three frame averaging.
These figures indicate that considerable noise rejection has been achieved, although some noise residual remains. The next step was to quantitatively measure the effect of spectral subtraction on intelligibility and quality. For this task a limited single speaker DRT was invoked to establish an anchor point for credibility.
`
`

`

Fig. 4. Time waveform of helicopter speech, "Save your."
Fig. 5. Average noise magnitude of helicopter noise.
Fig. 6. Short-time spectrum of helicopter speech.
Fig. 7. Short-time spectrum using bias removal and half-wave rectification.
Fig. 8. Short-time spectrum of helicopter speech using three frame averaging.
Fig. 9. Short-time spectrum using bias removal and half-wave rectification after three frame averaging.
`
`

`

`
TABLE I
DIAGNOSTIC RHYME TEST SCORES

                                  Original   Ŝ (No Average)   Ŝ (Three Average)
Voicing                               95           92                91
Nasality                              82           78                77
Sustention                            92           87                86
Sibilation                            75           83                84
Graveness                             68           70                66
Compactness                           88           87                88
Total                                 84           83                82

TABLE II
QUALITY RATINGS

                                  Original   Ŝ (No Average)   Ŝ (Three Averages)
Naturalness of Signal                 63           60                61
Inconspicuousness of Background       36           38                42
Intelligibility                       30           32                33
Pleasantness                          20           31                25
Overall Acceptability                 27           33                29
Composite Acceptability               26           32                29
`
C. Intelligibility and Quality Results Using the DRT

The DRT data base consisted of 192 words recorded in a helicopter environment. The data base was filtered at 4 kHz and sampled at 8 kHz. During the pause between each word, the noise bias was updated. Six output speech files were generated: 1) digitized original; 2) speech resulting from bias removal and rectification without averaging; 3) speech resulting from bias removal and rectification using three averages; 4) an LPC vocoded version of the original speech; 5) an LPC vocoded version of 2); and 6) an LPC vocoded version of 3). The last three experiments were conducted to measure intelligibility and quality improvements resulting from the use of spectral subtraction as a preprocessor to an LPC analysis-synthesis device. The LPC vocoder used was a nonreal-time floating-point implementation [17]. A ten-pole autocorrelation implementation was used with a SIFT pitch tracker [18]. The channel parameters used for synthesis were not quantized. Thus, any degradation would not be attributed to parameter quantization, but rather to the all-pole approximation to the spectrum and to the buzz-hiss approximation to the error signal. In addition, a frame rate of 40 frames/s was used, which is typical of 2400 bit/s implementations. The vocoder on 3.2 kHz filtered clean speech achieved a DRT score of 88.

In addition to intelligibility, a coarse measure of quality [19] was conducted using the same DRT data base. These quality scores are neither quantitatively nor qualitatively equivalent to the more rigorous quality tests such as PARM or DAM [20]. However, they do indicate on a relative scale improvements between data sets. Modern 2.4 kbit/s systems are expected to range from 45 to 50 on composite acceptability; unprocessed speech, 88-92.

The results of the tests are summarized in Tables I-IV. Tables I and II indicate that spectral subtraction alone does not decrease intelligibility, but does increase quality, especially in the areas of increased pleasantness and inconspicuousness of the noise background. Tables III and IV clearly indicate that spectral subtraction can be used to improve the intelligibility and quality of speech processed through an LPC bandwidth compression device.
`
D. Short-Time Spectra Using Residual Noise Reduction and Nonspeech Signal Attenuation

Based on the promising results of these preliminary DRT experiments, the algorithm was modified to incorporate residual noise reduction and nonspeech signal attenuation. Fig. 10 shows the short-time spectra using the helicopter speech data with both modifications added. Note that now noise between words has been reduced below the resolution of the graph, and noise within the words has been significantly attenuated (compare with Fig. 7).
`
TABLE III
DIAGNOSTIC RHYME TEST SCORES

                                  LPC on Original   LPC on Ŝ without averaging   LPC on Ŝ with averaging
Voicing                                 84                    90                          86
Nasality                                56                    63                          52
Sustention                              49                    52                          56
Sibilation                              61                    70                          88
Graveness                               61                    62                          59
Compactness                             83                    83                          93
Total                                   66                    70                          72

TABLE IV
QUALITY RATINGS

                                  LPC on Original   LPC on Ŝ without averaging   LPC on Ŝ with averaging
Naturalness of Signal                   53                    49                          58
Inconspicuousness of Background         34                    36                          39
Intelligibility                         28                    30                          28
Pleasantness                            15                    28                          20
Overall Acceptability                   24                    28                          26
Composite Acceptability                 23                    29                          25
`
`

`

REFERENCES

noise," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 488-494, Dec. 1976.
[3] D. Coulter, private communication.
[4] S. F. Boll, "Improving linear prediction analysis of noisy speech by predictive noise cancellation," in Proc. IEEE Int. Conf. on Acoust., Speech, Signal Processing, Philadelphia, PA, Apr. 12-14, 1976, pp. 10-13.
[5] J. S. Lim and A. V. Oppenheim, "All pole modeling of degraded speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-26, pp. 197-210, June 1978.
[6] B. Gold, "Digital speech networks," Proc. IEEE, vol. 65, pp. 1636-1658, Dec. 1977.
[7] B. Beek, E. P. Neuberg, and D. C. Hodge, "An assessment of the technology of automatic speech recognition for military applications," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, pp. 310-322, Aug. 1977.
[8] J. D. Markel, "Text independent speaker identification from a large linguistically unconstrained time-spaced data base," in Proc. IEEE Int. Conf. on Acoust., Speech, Signal Processing, Tulsa, OK, Apr. 1978, pp. 287-291.
[9] B. Widrow et al., "Adaptive noise cancelling: Principles and applications," Proc. IEEE, vol. 63, pp. 1692-1716, Dec. 1975.
[10] S. F. Boll and D. Pulsipher, "Noise suppression methods for robust speech processing," Dep. Comput. Sci., Univ. Utah, Salt Lake City, Semi-Annu. Tech. Rep., Utec-CSc-77-202, pp. 50-54, Oct. 1977.
[11] W. D. Voiers, A. D. Sharpley, and C. H. Helmsath, "Research on diagnostic evaluation of speech intelligibility," AFSC, Final Rep., Contract AF19628-70-C-0182, 1973.
[12] M. R. Weiss, E. Aschkenasy, and T. W. Parsons, "Study and development of the INTEL technique for improving speech intelligibility," Nicolet Scientific Corp., Final Rep. NSC-FR/4023, Dec. 1974.
[13] J. Makhoul and J. Wolf, "Linear prediction and the spectral analysis of speech," Bolt, Beranek, and Newman Inc., BBN Rep. 2304, NTIS No. AD-749066, pp. 172-185, 1972.
[14] O. Br
