Arslan et al.

US005706395A
[11] Patent Number: 5,706,395
[45] Date of Patent: Jan. 6, 1998

[54] ADAPTIVE WEINER FILTERING USING A DYNAMIC SUPPRESSION FACTOR

[75] Inventors: Levent M. Arslan, Durham, N.C.; Alan V. McCree, Dallas; Vishu R. Viswanathan, Plano, both of Tex.

[73] Assignee: Texas Instruments Incorporated, Dallas, Tex.

[21] Appl. No.: 425,125
[22] Filed: Apr. 19, 1995

[51] Int. Cl.6: G10L 3/02
[52] U.S. Cl.: 395/2.35; 395/2.36; 395/2.37
[58] Field of Search: 395/2.35, 2.36, 2.37, 2.38, 2.42, 2.28; 381/34, 36

[56] References Cited

PUBLICATIONS

Deller et al., "Discrete-Time Processing of Speech Signals," Prentice-Hall, Inc., pp. 506-528, 1987.
Arslan et al., "New Methods for Adaptive Noise Suppression," ICASSP '95: Acoustics, Speech & Signal Processing Conference, pp. 812-815, May 1995.

Primary Examiner: Allen R. MacDonald
Assistant Examiner: Vijay Chawan
Attorney, Agent, or Firm: Carlton H. Hoel; W. James Brady; Richard L. Donaldson

[57] ABSTRACT

An acoustic noise suppression filter including attenuation filtering with a noise suppression factor depending upon the ratio of estimated noise energy of a frame divided by estimated signal energy.

12 Claims, 7 Drawing Sheets
`
[Front-page drawing: flow diagram with blocks WINDOWED INPUT FRAME y(j), FFT, LPC, DFT, ENERGY, PRECEDING FRAME NOISE, UPDATE NOISE, PRECEDING FRAME α', UPDATE α, and FILTER PARAMETERS; CLAMP AND [SMOOTH]; signals Y(ω), P_Y(ω), P_N(ω), E_Y, E_N, H(ω), and the speech estimate Ŝ(ω).]
`Petitioner Apple Inc.
`Ex. 1011, p. 1
`
`
`
[Sheet 1 of 7.
FIG. 1a: system 100 with sampling A/D convertor 102, noise suppression 104, analysis 106, transmit/storage 108, synthesis 110, and DAC 112.
FIG. 1b: system 150 with sampling A/D convertor, noise suppression, recognition analysis, and output.
FIG. 2: system 200 with speech sample stream, frame buffer 202, FFT 204, multiplier 206, noise filter 208, IFFT 210, and noise suppressed frame buffer 212.]
`
`
`
[Sheet 2 of 7.
FIG. 3: flow diagram with blocks WINDOWED INPUT FRAME y(j), FFT, PRECEDING FRAME NOISE, UPDATE NOISE, INCREASE NOISE, SMOOTH, and CLAMP; signals Y(ω), N(ω), 2N(ω), W*|Y|²(ω), H(ω), and Ŝ(ω).
FIG. 4: flow diagram with blocks WINDOWED INPUT FRAME y(j), FFT, LPC, DFT, ENERGY, PRECEDING FRAME NOISE, UPDATE NOISE, PRECEDING FRAME α', UPDATE α, and FILTER PARAMETERS; CLAMP AND [SMOOTH]; signals Y(ω), P_Y(ω), P_N(ω), E_Y, E_N, H(ω), and Ŝ(ω).]
`
`
`
[Sheet 3 of 7.
FIG. 5: flow diagram with blocks WINDOWED INPUT FRAME y(j), FFT, LPC, LSF, DISTANCE, CODEBOOK, NOISE FREE LSFs, LPC, DFT, ITERATION, CLAMP, and IFFT; signals Y(ω), P_S(ω), P_N(ω), H(ω), and OUT.
FIG. 6: flow diagram with blocks WINDOWED INPUT FRAME y(j), ENERGY MEASURE, SCALE, FFT, H(ω), IFFT, INVERSE SCALE, PRECEDING FRAME NOISE, and NOISE UPDATE; signals c·y(j), Y(ω), and OUT.]
`
`
`
[Sheet 4 of 7.
FIG. 7: plot of suppression factor H(ω) in dB (0 to -30) versus unprocessed input signal-to-noise ratio in dB (0 to 10); curves labelled "spectral subtraction", "noise increased", and "clamped".
FIG. 8: "Distribution of spectral estimates for WGN, smoothing = 1, 5, 33, 128"; horizontal axis: power spectrum in dB (-20 to 10); curves labelled "no smoothing", "5 element smoothing", "33 element smoothing", and "128 element smoothing".]
`
`
`
[Sheet 5 of 7.
FIG. 9a: system 900 with IN, OUT, and a noise buffer; signals P_Y and P_N'; reference numerals 902, 906, 908, 910, 920, 922, 930.
FIG. 9b: system 950 with IN, OUT, and a noise buffer; signals P_Y and N_P; reference numerals 902, 904, 906, 908, 910, 920, 922, 930, 954.]
`
`
`
[Sheet 6 of 7.
FIGS. 10a-b: plots of log H(ω) (ticks at -5 dB, -10 dB, -15 dB) versus log P_Y(ω)/P_N(ω) (ticks at 5 dB, 10 dB); curves labelled "adaptive clamp", "constant clamp", and "standard spectral subtraction".
FIG. 13: system 1300 with IN and OUT; reference numerals 1302, 1304, 1306, 1308.]
`
`
`
[Sheet 7 of 7.
FIG. 11: system 1100 with IN, OUT, autocorrelator, and LPC coefficient analyzer; signals r(j) and P_Y(ω); reference numerals 1102, 1104, 1106, 1112, 1114, 1120, 1122, 1130.
FIG. 12: system 1200 with IN, OUT, LPC-to-LSF, and codebook of LSFs; signals r(j), a_i, LSF_j, P_N', and P_S; reference numerals 1202, 1204, 1206, 1208, 1210, 1212, 1220, 1224, 1230, 1232, 1234, 1240.]
`
`
`
ADAPTIVE WEINER FILTERING USING A DYNAMIC SUPPRESSION FACTOR

CROSS-REFERENCE TO RELATED APPLICATIONS

Cofiled patent applications with Ser. Nos. 08/424,928, 08/426,426, 08/426,746 and 08/426,427 disclose related subject matter. These applications all have a common assignee.
`
BACKGROUND OF THE INVENTION

The invention relates to electronic devices, and, more particularly, to speech analysis and synthesis devices and systems.

Human speech consists of a stream of acoustic signals with frequencies ranging up to roughly 20 KHz; but the band of 100 Hz to 5 KHz contains the bulk of the acoustic energy. Telephone transmission of human speech originally consisted of conversion of the analog acoustic signal stream into an analog electrical voltage signal stream (e.g., microphone) for transmission and reconversion to an acoustic signal stream (e.g., loudspeaker) for reception.

The advantages of digital electrical signal transmission led to a conversion from analog to digital telephone transmission beginning in the 1960s. Typically, digital telephone signals arise from sampling analog signals at 8 KHz and nonlinearly quantizing the samples with 8-bit codes according to the μ-law (pulse code modulation, or PCM). A clocked digital-to-analog converter and companding amplifier reconstruct an analog electrical signal stream from the stream of 8-bit samples. Such signals require transmission rates of 64 Kbps (kilobits per second). Many communications applications, such as digital cellular telephone, cannot handle such a high transmission rate, and this has inspired various speech compression methods.

The storage of speech information in analog format (e.g., on magnetic tape in a telephone answering machine) can likewise be replaced with digital storage. However, the memory demands can become overwhelming: 10 minutes of 8-bit PCM sampled at 8 KHz would require about 5 MB (megabytes) of storage. This demands speech compression analogous to digital transmission compression.

One approach to speech compression models the physiological generation of speech and thereby reduces the necessary information transmitted or stored. In particular, the linear speech production model presumes excitation of a variable filter (which roughly represents the vocal tract) by either a pulse train for voiced sounds or white noise for unvoiced sounds followed by amplification or gain to adjust the loudness. The model produces a stream of sounds simply by periodically making a voiced/unvoiced decision plus adjusting the filter coefficients and the gain. Generally, see Markel and Gray, Linear Prediction of Speech (Springer-Verlag 1976).

More particularly, the linear prediction method partitions a stream of speech samples s(n) into "frames" of, for example, 180 successive samples (22.5 msec intervals for an 8 KHz sampling rate); and the samples in a frame then provide the data for computing the filter coefficients for use in coding and synthesis of the sound associated with the frame. Each frame generates coded bits for the linear prediction filter coefficients (LPC), the pitch, the voiced/unvoiced decision, and the gain. This approach of encoding only the model parameters represents far fewer bits than encoding the entire frame of speech samples directly, so the transmission rate may be only 2.4 Kbps rather than the 64 Kbps of PCM. In practice, the LPC coefficients must be quantized for transmission, and the sensitivity of the filter behavior to the quantization error has led to quantization based on the Line Spectral Frequencies (LSF) representation.

To improve the sound quality, further information may be extracted from the speech, compressed and transmitted or stored along with the LPC coefficients, pitch, voicing, and gain. For example, the codebook excitation linear prediction (CELP) method first analyzes a speech frame to find the LPC filter coefficients, and then filters the frame with the LPC filter. Next, CELP determines a pitch period from the filtered frame and removes this periodicity with a comb filter to yield a noise-looking excitation signal. Lastly, CELP encodes the excitation signals using a codebook. Thus CELP transmits the LPC filter coefficients, pitch, gain, and the codebook index of the excitation signal.

The advent of digital cellular telephones has emphasized the role of noise suppression in speech processing, both coding and recognition. Customer expectation of high performance even in extreme car noise situations plus the demand to move to progressively lower data rate speech coding in order to accommodate the ever-increasing number of cellular telephone customers have contributed to the importance of noise suppression. While higher data rate speech coding methods tend to maintain robust performance even in high noise environments, that typically is not the case with lower data rate speech coding methods. The speech quality of low data rate methods tends to degrade drastically with high additive noise. Noise suppression to prevent such speech quality losses is important, but it must be achieved without introducing any undesirable artifacts or speech distortions or any significant loss of speech intelligibility. These performance goals for noise suppression have existed for many years, and they have recently come to the forefront due to digital cellular telephone application.

FIG. 1a schematically illustrates an overall system 100 of modules for speech acquisition, noise suppression, analysis, transmission/storage, synthesis, and playback. A microphone converts sound waves into electrical signals, and sampling analog-to-digital converter 102 typically samples at 8 KHz to cover the speech spectrum up to 4 KHz. System 100 may partition the stream of samples into frames with smooth windowing to avoid discontinuities. Noise suppression 104 filters a frame to suppress noise, and analyzer 106 extracts LPC coefficients, pitch, voicing, and gain from the noise-suppressed frame for transmission and/or storage 108. The transmission may be any type used for digital information transmission, and the storage may likewise be any type used to store digital information. Of course, types of encoding analysis other than LPC could be used. Synthesizer 110 combines the LPC coefficients, pitch, voicing, and gain information to synthesize frames of sampled speech which digital-to-analog convertor (DAC) 112 converts to analog signals to drive a loudspeaker or other playback device to regenerate sound waves.

FIG. 1b shows an analogous system 150 for voice recognition with noise suppression. The recognition analyzer may simply compare input frames with frames from a database or may analyze the input frames and compare parameters with known sets of parameters. Matches found between input frames and stored information provide recognition output.

One approach to noise suppression in speech employs spectral subtraction and appears in Boll, Suppression of
`
Acoustic Noise in Speech Using Spectral Subtraction, 27 IEEE Tr. ASSP 113 (1979), and Lim and Oppenheim, Enhancement and Bandwidth Compression of Noisy Speech, 67 Proc. IEEE 1586 (1979). Spectral subtraction proceeds roughly as follows. Presume a sampled speech signal $s(j)$ with uncorrelated additive noise $n(j)$ to yield an observed windowed noisy speech $y(j)=s(j)+n(j)$. These are random processes over time. Noise is assumed to be a stationary process in that the process's autocorrelation depends only on the difference of the variables; that is, there is a function $r_N(\cdot)$ such that:

$$E\{n(j)\,n(i)\} = r_N(i-j)$$

where $E$ is the expectation. The Fourier transform of the autocorrelation is called the power spectral density, $P_N(\omega)$. If speech were also a stationary process with autocorrelation $r_S(j)$ and power spectral density $P_S(\omega)$, then the power spectral densities would add due to the lack of correlation:

$$P_Y(\omega) = P_S(\omega) + P_N(\omega)$$

Hence, an estimate for $P_S(\omega)$, and thus $s(j)$, could be obtained from the observed noisy speech $y(j)$ and the noise observed during intervals of (presumed) silence in the observed noisy speech. In particular, take $P_Y(\omega)$ as the squared magnitude of the Fourier transform of $y(j)$ and $P_N(\omega)$ as the squared magnitude of the Fourier transform of the observed noise.

Of course, speech is not a stationary process, so Lim and Oppenheim modified the approach as follows. Take $s(j)$ not to represent a random process but rather to represent a windowed speech signal (that is, a speech signal which has been multiplied by a window function), $n(j)$ a windowed noise signal, and $y(j)$ the resultant windowed observed noisy speech signal. Then Fourier transforming and multiplying by complex conjugates yields:

$$|Y(\omega)|^2 = |S(\omega)|^2 + |N(\omega)|^2 + 2\,\mathrm{Re}\{S(\omega)N(\omega)^*\}$$

For ensemble averages the last term on the righthand side of the equation equals zero due to the lack of correlation of noise with the speech signal. This equation thus yields an estimate, $\hat{S}(\omega)$, for the speech signal Fourier transform as:

$$|\hat{S}(\omega)|^2 = |Y(\omega)|^2 - E\{|N(\omega)|^2\}$$

This resembles the preceding equation for the addition of power spectral densities.

An autocorrelation approach for the windowed speech and noise signals simplifies the mathematics. In particular, the autocorrelation for the speech signal is given by

$$r_S(j) = \sum_i s(i)\,s(i+j),$$

with similar expressions for the autocorrelation for the noisy speech and the noise. Thus the noisy speech autocorrelation is:

$$r_Y(j) = r_S(j) + r_N(j) + c_{SN}(j) + c_{SN}(-j)$$

where $c_{SN}(\cdot)$ is the cross correlation of $s(j)$ and $n(j)$. But the speech and noise signals should be uncorrelated, so the cross correlations can be approximated as 0. Hence, $r_Y(j)=r_S(j)+r_N(j)$. And the Fourier transforms of the autocorrelations are just the power spectral densities, so

$$P_Y(\omega) = P_S(\omega) + P_N(\omega)$$

Of course, $P_Y(\omega)$ equals $|Y(\omega)|^2$ with $Y(\omega)$ the Fourier transform of $y(j)$ due to the autocorrelation being just a convolution with a time-reversed variable.

The power spectral density $P_N(\omega)$ of the noise signal can be estimated by detection during noise-only periods, so the speech power spectral estimate becomes

$$|\hat{S}(\omega)|^2 = |Y(\omega)|^2 - |N(\omega)|^2 = P_Y(\omega) - P_N(\omega)$$

which is the spectral subtraction.

The spectral subtraction method can be interpreted as a time-varying linear filter $H(\omega)$ so that $\hat{S}(\omega)=H(\omega)Y(\omega)$ which the foregoing estimate then defines as:

$$H(\omega)^2 = [P_Y(\omega) - P_N(\omega)]/P_Y(\omega)$$

The ultimate estimate for the frame of windowed speech, $\hat{s}(j)$, then equals the inverse Fourier transform of $\hat{S}(\omega)$, and then combining the estimates from successive frames ("overlap add") yields the estimated speech stream.

This spectral subtraction can attenuate noise substantially, but it has problems including the introduction of fluctuating tonal noises commonly referred to as musical noises.

The Lim and Oppenheim article also describes an alternative noise suppression approach using noncausal Wiener filtering which minimizes the mean-square error. That is, again $\hat{S}(\omega)=H(\omega)Y(\omega)$ but with $H(\omega)$ now given by:

$$H(\omega) = P_S(\omega)/[P_S(\omega) + P_N(\omega)]$$

This Wiener filter generalizes to:

$$H(\omega) = \left[P_S(\omega)/[P_S(\omega) + \alpha P_N(\omega)]\right]^{\beta}$$

where constants $\alpha$ and $\beta$ are called the noise suppression factor and the filter power, respectively. Indeed, $\alpha=1$ and $\beta=1/2$ leads to the spectral subtraction method in the following.

A noncausal Wiener filter cannot be directly applied to provide an estimate for $s(j)$ because speech is not stationary and the power spectral density $P_S(\omega)$ is not known. Thus approximate the noncausal Wiener filter by an adaptive generalized Wiener filter which uses the squared magnitude of the estimate $\hat{S}(\omega)$ in place of $P_S(\omega)$:

$$H(\omega) = \left(|\hat{S}(\omega)|^2/[\,|\hat{S}(\omega)|^2 + \alpha E\{|N(\omega)|^2\}\,]\right)^{\beta}$$

Recalling $\hat{S}(\omega)=H(\omega)Y(\omega)$ and then solving for $|\hat{S}(\omega)|$ in the $\beta=1/2$ case yields:

$$|\hat{S}(\omega)| = \left[\,|Y(\omega)|^2 - \alpha E\{|N(\omega)|^2\}\,\right]^{1/2}$$

which just replicates the spectral subtraction method when $\alpha=1$.

However, this generalized Wiener filtering has problems including how to estimate $\hat{S}$, and estimators usually apply an iterative approach with perhaps a half dozen iterations which increases computational complexity.

Ephraim, A Minimum Mean Square Error Approach for Speech Enhancement, Conf. Proc. ICASSP 829 (1990), derived a Wiener filter by first analyzing noisy speech to find linear prediction coefficients (LPC) and then resynthesizing an estimate of the speech to use in the Wiener filter.

In contrast, O'Shaughnessy, Speech Enhancement Using Vector Quantization and a Formant Distance Measure, Conf. Proc. ICASSP 549 (1988), computed noisy speech formants and selected quantized speech codewords to represent the speech based on formant distance; the speech was resynthesized from the codewords. This has problems including degradation for high signal-to-noise signals because of the speech quality limitations of the LPC synthesis.
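The two gain rules reviewed above, spectral subtraction and the generalized Wiener filter, can be compared numerically. The following is a minimal Python/NumPy sketch and not code from the patent: the function names, the floor at zero on the subtracted power, and the sample power values are illustrative assumptions. It substitutes the speech power estimate P_Y - αP_N into the generalized Wiener gain with β = 1/2 and α = 1 and compares the result with the spectral subtraction gain.

```python
import numpy as np

def spectral_subtraction_gain(P_y, P_n):
    """H = sqrt((P_y - P_n) / P_y); the floor at 0 is an added assumption."""
    return np.sqrt(np.maximum(P_y - P_n, 0.0) / P_y)

def generalized_wiener_gain(P_s, P_n, alpha=1.0, beta=1.0):
    """H = (P_s / (P_s + alpha * P_n)) ** beta."""
    return (P_s / (P_s + alpha * P_n)) ** beta

# Hypothetical per-frequency power values for one frame.
P_y = np.array([4.0, 2.0, 10.0])   # noisy speech power
P_n = np.array([1.0, 1.0, 1.0])    # noise power

alpha = 1.0
P_s_hat = np.maximum(P_y - alpha * P_n, 0.0)   # speech power estimate
H_wiener = generalized_wiener_gain(P_s_hat, P_n, alpha, beta=0.5)
H_subtract = spectral_subtraction_gain(P_y, P_n)
```

With these inputs the two gain vectors agree, which is the numerical counterpart of the statement that α = 1 and β = 1/2 reduce the generalized Wiener filter to spectral subtraction.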
`
The Fourier transforms of the windowed sampled speech signals in systems 100 and 150 can be computed in either fixed point or floating point format. Fixed point is cheaper to implement in hardware but has less dynamic range for a comparable number of bits. Automatic gain control limits the dynamic range of the speech samples by adjusting magnitudes according to a moving average of the preceding sample magnitudes, but this also destroys the distinction between loud and quiet speech. Further, the acoustic energy may be concentrated in a narrow frequency band and the Fourier transform will have large dynamic range even for speech samples with relatively constant magnitude. To compensate for such overflow potential in fixed point format, a few bits may be reserved for large Fourier transform dynamic range; but this implies a loss of resolution for small magnitude samples and consequent degradation of quiet speech. This is especially true for systems which follow a Fourier transform with an inverse Fourier transform.
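The dynamic range point above can be illustrated with a short Python/NumPy sketch. It is an illustration under assumed test signals, not part of the patent: the helper name `fft_peak`, the 256-sample length, the bin-32 tone, and the seeded white noise are all hypothetical choices. Both signals have unit RMS, yet the tone concentrates its energy in one pair of bins and so produces a much larger FFT peak than the noise.

```python
import numpy as np

def fft_peak(x):
    """Largest magnitude among the FFT bins of x."""
    return np.abs(np.fft.fft(x)).max()

N = 256
n = np.arange(N)

# Unit-RMS tone centered exactly on bin 32: energy sits in one bin pair.
tone = np.sqrt(2.0) * np.cos(2.0 * np.pi * 32 * n / N)

# Unit-RMS white noise: energy is spread over all bins.
rng = np.random.default_rng(0)
noise = rng.standard_normal(N)
noise /= np.sqrt(np.mean(noise ** 2))

peak_tone = fft_peak(tone)     # equals N * sqrt(2) / 2, about 181
peak_noise = fft_peak(noise)   # smaller, since the energy is spread out
```

A fixed-point FFT sized for the tone's peak therefore leaves fewer significant bits for a broadband frame of the same time-domain level.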
`
`SUMMARY OF THE INVENTION
`
`The present invention provides speech noise suppression
`by spectral subtraction filtering improved with filter
`clamping, limiting, and/or smoothing, plus generalized
`Wiener filtering with a signal-to-noise ratio dependent noise
`suppression factor, and plus a generalized Wiener filter
`based on a speech estimate derived from codebook noisy
`speech analysis and resynthesis. And each frame of samples
`has a frame-energy-based scaling applied prior to and after
`Fourier analysis to preserve quiet speech resolution.
`The invention has advantages including simple speech
`noise suppression.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The drawings are schematic for clarity.
FIGS. 1a-b show speech systems with noise suppression.
FIG. 2 illustrates a preferred embodiment noise suppression subsystem.
FIGS. 3-5 are flow diagrams for preferred embodiment noise suppression.
FIG. 6 is a flow diagram for a framewise scaling preferred embodiment.
FIGS. 7-8 illustrate spectral subtraction preferred embodiment aspects.
FIGS. 9a-b show spectral subtraction preferred embodiment systems.
FIGS. 10a-b illustrate spectral subtraction preferred embodiments with adaptive minimum gain clamping.
FIG. 11 is a block diagram of a modified Wiener filter preferred embodiment system.
FIG. 12 shows a codebook based generalized Wiener filter preferred embodiment system.
FIG. 13 illustrates a preferred embodiment internal precision control system.

DESCRIPTION OF THE PREFERRED EMBODIMENTS
`
Overview

FIG. 2 shows a preferred embodiment noise suppression filter system 200. In particular, frame buffer 202 partitions an incoming stream of speech samples into overlapping frames of 256-sample size and windows the frames; FFT module 204 converts the frames to the frequency domain by fast Fourier transform; multiplier 206 pointwise multiplies the frame by the filter coefficients generated in noise filter block 208; and IFFT module 210 converts back to the time domain by inverse fast Fourier transform. Noise suppressed frame buffer 212 holds the filtered output for speech analysis, such as LPC coding, recognition, or direct transmission. The filter coefficients in block 208 derive from estimates for the noise spectrum and the noisy speech spectrum of the frame, and thus adapt to the changing input. All of the noise suppression computations may be performed with a standard digital signal processor such as a TMS320C25, which can also perform the subsequent speech analysis, if any. Also, general purpose microprocessors or specialized hardware could be used.

The preferred embodiment noise suppression filters may also be realized without Fourier transforms; however, the multiplication of Fourier transforms then corresponds to convolution of functions.

The preferred embodiment noise suppression filters may each be used as the noise suppression blocks in the generic systems of FIGS. 1a-b to yield preferred embodiment systems.

The smoothed spectral subtraction preferred embodiments have a spectral subtraction filter which (1) clamps attenuation to limit suppression for inputs with small signal-to-noise ratios, (2) increases noise estimate to avoid filter fluctuations, (3) smoothes noisy speech and noise spectra used for filter definition, and (4) updates a noise spectrum estimate from the preceding frame using the noisy speech spectrum. The attenuation clamp may depend upon speech and noise estimates in order to lessen the attenuation (and distortion) for speech; this strategy may depend upon estimates only in a relatively noise-free frequency band. FIG. 3 is a flow diagram showing all four aspects for the generation of the noise suppression filter of block 208.

The signal-to-noise ratio adaptive generalized Wiener filter preferred embodiments use $H(\omega)=[\hat{P}_S(\omega)/[\hat{P}_S(\omega)+\alpha P_N(\omega)]]^{\beta}$ where the noise suppression factor $\alpha$ depends on $E_Y/E_N$ with $E_N$ the noise energy and $E_Y$ the noisy speech energy for the frame. These preferred embodiments also use a scaled LPC spectral approximation of the noisy speech for a smoothed speech power spectrum estimate as illustrated in the flow diagram of FIG. 4. FIG. 4 also illustrates an optional filtered $\alpha$.

The codebook-based generalized Wiener filter noise suppression preferred embodiments use $H(\omega)=[\hat{P}_S(\omega)/[\hat{P}_S(\omega)+\alpha P_N(\omega)]]^{\beta}$ with $\hat{P}_S(\omega)$ estimated from LSFs as weighted sums of LSFs in a codebook of LSFs with the weights determined by the LSFs of the input noisy speech. Then iterate: use this $H(\omega)$ to form $H(\omega)Y(\omega)$, next redetermine the input LSFs from $H(\omega)Y(\omega)$, and then redetermine $H(\omega)$ with these LSFs as weights for the codebook LSFs. A half dozen iterations may be used. FIG. 5 illustrates the flow.

The power estimates used in the preferred embodiment filter definitions may also be used for adaptive scaling of low power signals to avoid loss of precision during FFT or other operations. The scaling factor adapts to each frame so that with fixed-point digital computations the scale expands or contracts the samples to provide a constant overflow headroom, and after the computations the inverse scale restores the frame power level. FIG. 6 illustrates the flow. This scaling applies without regard to automatic gain control and could even be used in conjunction with an automatic gain controlled input.
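The framewise scaling idea (scale, transform and filter, then inverse scale) can be sketched in Python/NumPy as follows. This is a hedged illustration, not the patent's implementation: the scaling rule (map the frame RMS to a hypothetical `target_rms`), the 16-bit Q15 rounding model, and the test frame are assumptions, and only the quantization of the input samples is modeled while the FFT itself runs in floating point.

```python
import numpy as np

Q15 = 2 ** 15  # 16-bit fixed point: one sign bit plus 15 fractional bits

def to_q15(x):
    """Round to 16-bit fixed point and return the represented real values."""
    return np.clip(np.round(x * Q15), -Q15, Q15 - 1) / Q15

def frame_scale(frame, target_rms=0.125):
    """Energy-based scale factor c (assumed rule: frame RMS -> target_rms)."""
    rms = np.sqrt(np.mean(frame ** 2))
    return target_rms / rms if rms > 0 else 1.0

def process(frame, H, scaled):
    """Scale, quantize, filter in the frequency domain, then inverse scale."""
    c = frame_scale(frame) if scaled else 1.0
    x = to_q15(c * frame)              # fixed-point storage of c*y(j)
    s = np.fft.ifft(H * np.fft.fft(x)).real
    return s / c                       # inverse scale restores the level

n = np.arange(256)
quiet = 1e-3 * np.sin(2 * np.pi * 5.3 * n / 256)   # low-level test frame
H = np.ones(256)                                    # pass-through filter

err_plain = np.sqrt(np.mean((process(quiet, H, scaled=False) - quiet) ** 2))
err_scaled = np.sqrt(np.mean((process(quiet, H, scaled=True) - quiet) ** 2))
```

For this quiet frame the RMS error with scaling is far below the error without it, because the scaled samples occupy more of the available bits before rounding.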
`
`Smoothed Spectral Subtraction Preferred
`Embodiments
`FIG. 3 illustrates as a flow diagram the various aspects of
`the spectral subtraction preferred embodiments as used to
`
`
generate the filter. A preliminary consideration of the standard spectral subtraction noise suppression simplifies explanation of the preferred embodiments. Thus first consider the standard spectral subtraction filter:

$$H(\omega)^2 = [\,|Y(\omega)|^2 - |N(\omega)|^2\,]/|Y(\omega)|^2 = 1 - |N(\omega)|^2/|Y(\omega)|^2$$

A graph of this function with logarithmic scales appears in FIG. 7 labelled "standard spectral subtraction". Indeed, spectral subtraction consists of applying a frequency-dependent attenuation to each frequency in the noisy speech power spectrum with the attenuation tracking the input signal-to-noise power ratio at each frequency. That is, $H(\omega)$ represents a linear time-varying filter. Consequently, as shown in FIG. 7, the amount of attenuation varies rapidly with input signal-to-noise power ratio, especially when the input signal and noise are nearly equal in power. When the input signal contains only noise, the filtering produces musical noise because the estimated input signal-to-noise power ratio at each frequency fluctuates due to measurement error, producing attenuation with random variation across frequencies and over time. FIG. 8 shows the probability distribution of the FFT power spectral estimate at a given frequency of white noise with unity power (labelled "no smoothing"), and illustrates the amount of variation which can be expected.

The preferred embodiments modify this standard spectral subtraction in four independent but synergistic approaches as detailed in the following.

Preliminarily, partition an input stream of noisy speech sampled at 8 KHz into 256-sample frames with a 50% overlap between successive frames; that is, each frame shares its first 128 samples with the preceding frame and shares its last 128 samples with the succeeding frame. This yields an input stream of frames with each frame having 32 msec of samples and a new frame beginning every 16 msec.

Next, multiply each frame with a Hann window of width 256. (A Hann window has the form $w(k)=(1+\cos(2\pi k/K))/2$ with $K+1$ the window width.) Thus each frame has 256 samples $y(j)$, and the frames add to reconstruct the input speech stream.

Fourier transform the windowed speech to find $Y(\omega)$ for the frame; the noise spectrum estimation differs from the traditional methods and appears in modification (4).

(1) Clamp the $H(\omega)$ attenuation curve so that the attenuation cannot go below a minimum value; FIG. 7 has this labelled as "clamped" and illustrates a 10 dB clamp. The clamping prevents the noise suppression filter $H(\omega)$ from fluctuating around very small gain values, and also reduces potential speech signal distortion. The corresponding filter would be:

$$H(\omega)^2 = \max[\,10^{-1},\ 1 - |N(\omega)|^2/|Y(\omega)|^2\,]$$

Of course, the 10 dB clamp could be replaced with any other desirable clamp level, such as 5 dB or 20 dB. Also, the clamping could include a sloped clamp or stepped clamping or other more general clamping curves, but a simple clamp lessens computational complexity. The following "Adaptive filter clamp" section describes a clamp which adapts to the input signal energy level.

(2) Increase the noise power spectrum estimate by a factor such as 2 so that small errors in the spectral estimates for input (noisy) signals do not result in fluctuating attenuation filters. The corresponding filter for this factor alone would be:

$$H(\omega)^2 = 1 - 4|N(\omega)|^2/|Y(\omega)|^2$$

For small input signal-to-noise power ratios this becomes negative, but a clamp as in (1) eliminates the problem. This noise increase factor appears as a shift in the logarithmic input signal-to-noise power ratio independent variable of FIG. 7. Of course, the 2 factor could be replaced by other factors such as 1.5 or 3; indeed, FIG. 7 shows a 5 dB noise increase factor with the resulting attenuation curve labelled "noise increased". Further, the factor could vary with frequency such as more noise increase (i.e., more attenuation) at low frequencies.

(3) Reduce the variance of spectral estimates used in the noise suppression filter $H(\omega)$ by smoothing over neighboring frequencies. That is, for an input windowed noisy speech signal $y(j)$ with Fourier transform $Y(\omega)$, apply a running average over frequency so that $|Y(\omega)|^2$ is replaced by $(W*|Y|^2)(\omega)$ in $H(\omega)$ where $W(\omega)$ is a window about 0 and $*$ is the convolution operator. FIG. 8 shows that the spectral estimates for white noise converge more closely to the correct answer with increasing smoothing window size. That is, the curves labelled "5 element smoothing", "33 element smoothing", and "128 element smoothing" show the decreasing probabilities for large variations with increasing smoothing window sizes. More spectral smoothing reduces noise fluctuations in the filtered speech signal because it reduces the variance of spectral estimation for noisy frames; however, spectral smoothing decreases the spectral resolution so that the noise suppression attenuation filter cannot track sharp spectral characteristics. The preferred embodiment operates with sampling at 8 KHz and windows the input into frames of size 256 samples (32 milliseconds); thus an FFT on the frame generates the Fourier transform as a function on a domain of 256 frequency values. Take the smoothing window $W(\omega)$ to have a width of 32 frequencies, so convolution with $W(\omega)$ averages over 32 adjacent frequencies. $W(\omega)$ may be a simple rectangular window or any other window. The filter transfer function with such smoothing is:

$$H(\omega)^2 = 1 - |N(\omega)|^2/(W*|Y|^2)(\omega)$$

Thus a filter with all three of the foregoing features has transfer function:

[equation not legible in the source text]

Extend the definition of $H(\omega)$ by symmetry to $\pi<\omega<2\pi$ or $-\pi<\omega<0$.
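Modifications (1) through (3) can be combined in a few lines of Python/NumPy. The sketch below is an illustration under stated assumptions rather than the patent's own filter: the function name `suppression_gain` and its parameters are hypothetical, the default `noise_factor` of 4 mirrors the multiplier in the equation given for (2), the rectangular 32-bin window is applied circularly over the 256 FFT bins, and the clamp is interpreted as a floor of 10^(-clamp_db/10) on H².

```python
import numpy as np

def suppression_gain(Y, N_pow, noise_factor=4.0, width=32, clamp_db=10.0):
    """Return H(w) for one frame from its spectrum Y and noise power N_pow."""
    P_y = np.abs(Y) ** 2
    kernel = np.ones(width) / width            # rectangular smoothing window W
    # Wrap the spectrum so the running average is circular in frequency.
    wrapped = np.concatenate([P_y[-width:], P_y, P_y[:width]])
    P_y_smooth = np.convolve(wrapped, kernel, mode="same")[width:-width]
    H2 = 1.0 - noise_factor * N_pow / P_y_smooth   # increased-noise subtraction
    floor = 10.0 ** (-clamp_db / 10.0)             # 10 dB clamp -> 0.1 on H^2
    return np.sqrt(np.maximum(H2, floor))

# Hypothetical flat spectra: unit noise power in all 256 bins.
N_pow = np.ones(256)
noise_only = suppression_gain(np.ones(256, dtype=complex), N_pow)
strong_speech = suppression_gain(10.0 * np.ones(256, dtype=complex), N_pow)
```

In the noise-only case the unclamped value of H² would be negative, so every bin sits at the clamp; with the input 20 dB above the noise the gain stays close to unity.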
(4) Any noise suppression by sp