H. G. Hirsch*), C. Ehrlicher

Institute of Communication Systems and Data Processing, Aachen University of Technology, 52056 Aachen, Germany

*) This author is with Ascom Business Systems (Solothurn, Switzerland) now. Email: hirsch@ens.ascom.ch

0-7803-2431-5/95 $4.00 © 1995 IEEE

Petitioner Waves Audio Ltd. 345 - Ex. 1007

ABSTRACT

Two new techniques are presented to estimate the noise spectra or the noise characteristics for noisy speech signals. No explicit speech pause detection is required. Past noisy segments of just about 400 ms duration are needed for the estimation. Thus the algorithm is able to quickly adapt to slowly varying noise levels or slowly changing noise spectra. These techniques can be combined with a nonlinear spectral subtraction scheme. The ability can be shown to enhance noisy speech and to improve the performance of speech recognition systems. Another application is the realization of a robust voice activity detection.

1. INTRODUCTION

Many proposals are known to improve speech recognition in situations with a noisy background, e.g. [1], [2], [3], [4]. Especially the modified statistics of spectral parameters should be considered in case of using HMMs [5]. A well working algorithm to detect speech pauses is presumed to determine these modified statistics.

This contribution presents two methods to estimate the spectral parameters of noise without an explicit speech pause detection. The first algorithm calculates the noise level in each subband as a weighted average of past spectral magnitude values which are below an adaptive threshold. The second approach evaluates the histograms of past spectral magnitude values in each subband. The maximum is taken as an estimate for the noise level.

2. ESTIMATION OF NOISE SPECTRUM

Most of the noise reduction techniques based on single channel recordings need an estimation of the noise spectrum. This is usually done by detection of speech pauses to evaluate segments of pure noise. In practical situations this is a difficult task, especially if the background noise is not stationary or the signal-to-noise ratio (SNR) is low. Some approaches are known to avoid the problem of speech pause detection and to estimate the noise characteristics just from a past segment of noisy speech [3], [6], [7]. The disadvantage of most approaches is the need of relatively long past segments of noisy speech.

The first method presented here calculates the weighted sum of past spectral magnitude values $X_i$ in each subband $i$. The weighting is done by a simple first order recursive system

$$\hat{N}_i(k) = (1-\alpha)\, X_i(k) + \alpha\, \hat{N}_i(k-1) \qquad (1)$$

where $X_i(k)$ denotes the spectral magnitude at time $k$ in subband $i$ and $\hat{N}_i(k)$ is an estimation of the noise magnitude.

Some algorithms immediately use an average of past spectral power values as an estimation for the noise power in the individual subband to realize a so called continuous spectral subtraction (CSS) [1]. In contrast to these approaches an adaptive threshold is introduced here. The magnitude values $X_i$ are distributed according to a Rayleigh distribution in segments of pure noise. Considerably higher values occur at the onset of speech. Thus a threshold $\beta \cdot \hat{N}_i(k-1)$ is introduced where $\beta$ takes a value in the range of about 1.5 to 2.5. When the actual spectral component $X_i(k)$ exceeds this threshold this is considered as a rough detection of speech and the recursive accumulation is stopped. The accumulated value is taken as an estimation for the noise level at this time. This simple processing is illustrated in figure 1 as part of a complete noise reduction scheme.

The noise estimate $\hat{N}_i$ is calculated with a first order recursive system. $\hat{N}_i$ is multiplied with an overestimation factor $\beta$ in the usual range of about 1.5 to 2.5. For positive values of $(X_i - \beta \hat{N}_i)$ the data input as well as the recursive accumulation are stopped. This indicates an onset of speech. Negative values of $(X_i - \beta \hat{N}_i)$ are set to zero to get an estimate $\hat{S}_i$ of the clean speech.
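The processing of figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `estimate_and_subtract`, the value alpha = 0.9 and the initialisation of the noise estimate with the first frame are hypothetical choices that do not appear in the text, and beta = 2.0 is merely one value from the stated range of about 1.5 to 2.5.

```python
def estimate_and_subtract(frames, alpha=0.9, beta=2.0):
    """Recursive noise estimate with adaptive threshold, per subband.

    frames: list of analysis frames, each a list of magnitudes X_i(k).
    Returns (noise_track, speech_est), both shaped like frames.
    alpha and the initialisation are assumptions of this sketch.
    """
    noise = list(frames[0])  # assumption: the first frame contains noise only
    noise_track, speech_est = [], []
    for x in frames:
        s_frame = []
        for i, x_i in enumerate(x):
            if x_i > beta * noise[i]:
                # Rough speech detection: stop the accumulation, keep the
                # last noise estimate and output the positive difference.
                s_frame.append(x_i - beta * noise[i])
            else:
                # Equation (1): first order recursive accumulation.
                noise[i] = (1.0 - alpha) * x_i + alpha * noise[i]
                # Non-positive differences are set to zero.
                s_frame.append(0.0)
        noise_track.append(list(noise))
        speech_est.append(s_frame)
    return noise_track, speech_est
```

In this sketch the noise estimate of a subband stays frozen for as long as its magnitude exceeds the threshold. An abrupt rise of the noise level above beta times the current estimate would therefore also be treated as speech, which matches the restriction to slowly varying noise levels.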
The computational complexity of this scheme for estimating the clean speech is low.

Fig. 1. Simple noise reduction scheme in one subband.

The second approach is based on histograms of past spectral values in each subband. The above mentioned threshold is used to evaluate histograms of past values which are below this threshold. This processing can be interpreted as a rough separation of the distributions of noise (Rayleigh distribution) and speech. Speech takes much higher values. Past values corresponding to noise segments of about 400 ms duration are evaluated to determine the distribution in about 40 bins. The noise level is estimated as maximum of the distribution in each subband. The estimated values for the noise magnitude are smoothed versus time to eliminate rarely occurring spikes. This leads to a very accurate estimation of the noise spectrum.

An objective evaluation of the accuracy is illustrated in figure 2. Different stationary noise signals [8] were artificially added to clean speech at different SNRs. The average noise spectrum $\bar{N}$ is calculated from the noise itself as well as an average estimated noise spectrum $\hat{N}$ obtained with the two mentioned techniques. The relative error

$$\frac{\sum_i (\bar{N}_i - \hat{N}_i)^2}{\sum_i \bar{N}_i^2} \qquad (2)$$

is calculated as an objective measurement for the accuracy. In figure 2 the average relative error is shown adding a car noise [8] to different utterances of 3 male and 3 female speakers. Average spectral components are calculated as sum over all frames of a FFT based spectral analysis.

Fig. 2. Relative error for the estimation of the noise power spectrum with both techniques at different SNRs (axes: relative error / % versus SNR / dB).

A good estimation can be achieved by both techniques. As expected the evaluation of histograms leads to better results. The increase at higher SNR is caused by an inaccurate noise estimation during segments of speech. Even at a high SNR these small errors affect the calculation of a relative error much more.

Both techniques are applied to a noise reduction scheme using nonlinear spectral subtraction [2]. A well working suppression was confirmed by informal listening tests. Also negative effects as e.g. musical tones can be reduced by optimizing parameters, e.g. the overestimation factor.

3. RECOGNITION OF NOISEX DATA

A first series of recognition experiments was carried out using the isolated words of the Noisex92 study [8]. This is a first attempt for a common data base to get comparative results on the recognition of noisy speech. Different noises are artificially added to utterances of the ten digits at different SNRs. The digits were spoken 100 times separately for training and testing. Recordings exist for a male and a female speaker at a sampling rate of 16 kHz. Both above mentioned estimation techniques are applied to the nonlinear spectral subtraction as a preprocessing step to recognition. A HMM recognizer [9] is used for the experiments configured as a connected word recognizer. A single mixture continuous HMM is trained for each word.
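Both the objective evaluation of Section 2 and the recognition experiments rely on adding noise to clean speech at a prescribed SNR. The following sketch shows one way to do this, together with the relative error of equation (2). It rests on assumptions that the text does not state: the SNR is taken as the ratio of the average powers over the whole signal, and the names `mix_at_snr` and `relative_error` are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale noise to a target SNR and add it to speech.

    Assumption: SNR = 10*log10(average speech power / average noise power),
    measured over the whole signal. Returns (noisy, scaled_noise).
    """
    noise = noise[:len(speech)]  # use only as much noise as there is speech
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled


def relative_error(true_avg, est_avg):
    """Equation (2): sum_i (Nbar_i - Nhat_i)^2 / sum_i Nbar_i^2."""
    num = sum((t - e) ** 2 for t, e in zip(true_avg, est_avg))
    den = sum(t * t for t in true_avg)
    return num / den
```

The function `relative_error` returns a fraction; figure 2 reports the error in percent, which corresponds to multiplying the result by 100.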
`
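The histogram technique of Section 2 can be sketched for a single subband as follows. Several details are assumptions of this sketch, not statements of the text: a window of 40 frames stands for about 400 ms at an assumed frame shift of 10 ms, the 40 bins span the range between the smallest and largest value in the window, the threshold is applied to the smoothed histogram estimate itself, and the smoothing is a first order recursion with a hypothetical constant of 0.8. The names `histogram_mode` and `histogram_noise_track` are likewise illustrative.

```python
from collections import deque


def histogram_mode(values, bins=40):
    """Centre of the most populated bin of a histogram over values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    peak = counts.index(max(counts))
    return lo + (peak + 0.5) * width


def histogram_noise_track(x, window=40, bins=40, beta=2.0, smooth=0.8):
    """Noise level over time for ONE subband with magnitudes x[k]."""
    history = deque(maxlen=window)  # past values judged to be noise
    noise = x[0]                    # assumption: the first frame is noise
    track = []
    for value in x:
        if value <= beta * noise:
            # Only values below the adaptive threshold enter the histogram.
            history.append(value)
        if history:
            mode = histogram_mode(history, bins)
            # Smoothing versus time to suppress rarely occurring spikes.
            noise = smooth * noise + (1.0 - smooth) * mode
        track.append(noise)
    return track
```

Applying `histogram_noise_track` to each subband in turn yields an estimate of the noise spectrum per frame.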
`
`
Each word model has 8 emitting states. Pauses are represented by a single state HMM. All training is done with the clean data only. A set of 15 MFCC (Mel frequency cepstral coefficients) are calculated as acoustic parameters for the recognition. Some results are shown in figure 3 as average of 5 different noises and as average of the two speakers. The male and the female utterances were separately recognized.

Fig. 3. Average recognition results for a speaker dependent recognition of the Noisex data (axes: recognition rate / % versus SNR / dB).

Considerable improvements can be achieved by applying the noise estimation techniques. In addition the detection of speech pauses is implemented to obtain these results. This is necessary because no individual HMM model is calculated for the pauses at each noise condition. The detection is based on the evaluation of the SNRs in all subbands. A relative measure $NX_{rel}$ of the ratio N/X (noise to noise & signal) is calculated for each subband:

$$NX_{i,rel}(k) = \frac{NX_i(k) - NX_{i,min}(k)}{NX_{i,max}(k) - NX_{i,min}(k)} \qquad (3)$$

where smoothed versions are used for N and X. $NX_{min}$ and $NX_{max}$ are determined from past segments of about 600 ms. The value $NX_{rel}$ is already calculated to realize the nonlinear spectral subtraction. A low value of $NX_{rel}$ indicates speech. Speech pauses are detected by counting the number of subbands where the ratio $NX_{rel}$ is less than a certain threshold, e.g. in our realization a value of 0.4. Using a FFT filter bank with 128 subbands, frames are classified as pauses if the number of "active" bands is less than 4. A robust voice activity detection can be achieved by this technique.

During segments of pauses the spectral subtraction is applied with an overestimation factor of 3. An interesting result is observed decreasing the overestimation factor from usual values in the range of 2.5 to a value of just 1 for segments of speech. Best results are obtained for an overestimation factor in the range of 1. The use of a factor of 1 for the overestimation degrades the noise reduction scheme to a simple subtraction in subbands. This effect can be explained by the training of the HMMs. The average and the variance of the acoustic parameters are estimated for each state from the clean data. The modified increased variance of spectral parameters is not considered in this contribution. Thus the average values are mainly evaluated for the recognition. Subtracting more than the noise level will lower the spectral parameters in the individual subband on the average. This will decrease the estimated averages in the corresponding states of the HMM.

4. SPEAKER INDEPENDENT RECOGNITION

A second data base is considered for another series of experiments. 13 words (digits including "zero", "oh" and "yes", "no") were recorded from 200 speakers via telephone lines. This time a HMM recognizer is configured as an isolated word recognizer but including a model for the pauses. A continuous HMM is trained with 8 emitting states and 4 mixtures per state. Pauses are represented by a single state with 4 mixtures. 5 PLP cepstral coefficients [10] are used as acoustic parameters. For each condition the recognition rate is calculated as an average of 4 recognition experiments using 50 different speakers out of the 200 for training and the remaining 150 for testing in each individual experiment. Car noise was artificially added at SNRs in the range from 5 to 20 dB.

Some recognition results are illustrated in figure 4. The experiments applying the simple noise reduction scheme shown in figure 1 were done in comparison to using PLP [10] or Rasta-PLP analysis [11]. Rasta-PLP is a well working technique to reduce the influence of different frequency responses during recording or transmission. It introduces a high-pass filtering of the logarithmic spectral envelopes in each subband. Thus it can be interpreted as a spectral subtraction in the logarithmic domain. The impulse response of the high-pass filter is similar to that of the filter scheme presented in figure 1.
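The high-pass character of the figure 1 scheme can be made explicit with a short derivation. It covers only the linear part of the scheme and rests on assumptions not stated in this form in the text: the accumulation of equation (1) is never stopped, the overestimation factor is 1, and the current estimate $\hat{N}_i(k)$ is subtracted, so that $\hat{S}_i(k) = X_i(k) - \hat{N}_i(k)$. The transfer function $H(z)$ and the impulse response $h(n)$ are notation introduced here for illustration.

```latex
\begin{align}
\hat{N}_i(z) &= \frac{1-\alpha}{1-\alpha z^{-1}}\, X_i(z)
  && \text{(z-transform of equation (1))} \\
H(z) = \frac{\hat{S}_i(z)}{X_i(z)}
  &= 1 - \frac{1-\alpha}{1-\alpha z^{-1}}
   = \frac{\alpha\,(1 - z^{-1})}{1-\alpha z^{-1}} \\
h(0) &= \alpha, \qquad h(n) = -(1-\alpha)\,\alpha^{n} \quad (n \ge 1)
\end{align}
```

Since $H(1) = 0$, a constant input is removed completely, which is the behaviour of a first order high-pass filter. Rasta-PLP applies its high-pass filtering to logarithmic spectral envelopes, whereas this derivation applies to the spectral magnitudes of equation (1).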
`
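The pause detection built on equation (3) in Section 3 can be sketched as follows. The sketch assumes that the smoothed values of N and X are already available per frame and subband, that a window of 60 frames stands for about 600 ms at an assumed frame shift of 10 ms, and that the current frame is included when the minimum and maximum are determined. None of these details is given in the text, and the function name `detect_pauses` is hypothetical. The defaults of 0.4 and 4 are the values stated for the filter bank with 128 subbands.

```python
from collections import deque


def detect_pauses(noise, signal, window=60, threshold=0.4, min_active=4):
    """Classify each frame as pause (True) or speech (False).

    noise[k][i], signal[k][i]: smoothed N and X in frame k, subband i.
    """
    n_bands = len(noise[0])
    history = [deque(maxlen=window) for _ in range(n_bands)]
    is_pause = []
    for n_frame, x_frame in zip(noise, signal):
        active = 0
        for i in range(n_bands):
            # Ratio of noise to noise & signal in this subband.
            nx = n_frame[i] / x_frame[i] if x_frame[i] > 0 else 1.0
            history[i].append(nx)
            nx_min, nx_max = min(history[i]), max(history[i])
            if nx_max > nx_min:
                nx_rel = (nx - nx_min) / (nx_max - nx_min)  # equation (3)
            else:
                nx_rel = 1.0  # assumption: no spread yet counts as pause
            if nx_rel < threshold:
                active += 1  # a low relative value indicates speech
        is_pause.append(active < min_active)
    return is_pause
```

The returned flags could then be used to select the overestimation factor per frame, for example 3 during pauses and 1 during speech as reported above.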
`
`
Fig. 4. Recognition results for a speaker independent recognition (car noise) (axes: recognition rate / % versus SNR / dB).

The simple noise reduction with the filter scheme presented in figure 1 is integrated into PLP analysis. PLP includes a spectral analysis with a FFT and the calculation of a number of subband energies where the definition of the subbands is derived from the frequency groups of the human auditive system. Better results are obtained applying the noise reduction to the subband energies of the 15 nonlinearly spaced filters than to all output values of the FFT. The variance of the subband energy seems to decrease by summing up the FFT energies in 15 subbands during segments of noise.

Also this time a speech pause detection is added in case of applying the processing scheme to speech recognition. Thus the 15 subband energies are filtered with the mentioned filter scheme using an overestimation factor of 2. A robust speech detection can be realized summing up the output values of all filters and looking for a positive value of the sum. Again a filtering is applied with an overestimation factor of 3 during segments of noise. Speech segments are filtered with a factor of 1.

5. CONCLUSION

Two methods are presented to estimate the noise spectra and more general the noise characteristics of noisy speech without an explicit speech pause detection. These are able to adapt to varying noise levels. Also one of the algorithms has a low computational complexity. The approaches can be combined with well known spectral subtraction techniques. Reducing the overestimation factor to a value in the range of 1 leads to simple reduction schemes with low computational complexity. These approaches are a good supplement to HMM recognition schemes which consider the modified statistics of spectral parameters caused by additive noise [5].

6. ACKNOWLEDGEMENT

This work was partly carried out at the International Computer Science Institute in Berkeley, USA. The authors would like to thank the whole speech group and especially Dr. Morgan for fruitful discussions and a stimulating atmosphere.

7. REFERENCES

[1] J.A. Nolazco Flores, S.J. Young, "Continuous Speech Recognition in Noise Using Spectral Subtraction and HMM Adaptation", ICASSP-94, Vol. 1, pp. 409-412, 1994
[2] P. Lockwood, J. Boudy, "Experiments with a Nonlinear Spectral Subtractor, Hidden Markov Models and the Projection for Robust Speech Recognition in Cars", Speech Communication, Vol. 11, No. 2-3, pp. 215-228, 1992
[3] D. Van Compernolle, "Noise Adaptation in a Hidden Markov Model Speech Recognition System", Computer Speech and Language, Vol. 3, pp. 151-167, 1989
[4] H.G. Hirsch, P. Meyer, H.W. Ruhl, "Improved Speech Recognition Using High-Pass Filtering of Subband Envelopes", Eurospeech-91, pp. 413-416, 1991
[5] M.J.F. Gales, S.J. Young, "Cepstral Parameter Compensation for HMM Recognition in Noise", Speech Communication, Vol. 12, No. 3, pp. 231-240, 1993
[6] R. Martin, "An Efficient Algorithm to Estimate the Instantaneous SNR of Speech Signals", Eurospeech-93, pp. 1093-1096, 1993
[7] H.G. Hirsch, "Estimation of Noise Spectrum and its Application to SNR Estimation and Speech Enhancement", Technical Report TR-93-012, International Computer Science Institute, Berkeley, USA, 1993
[8] A. Varga, H.J.M. Steeneken, "Assessment for Automatic Speech Recognition: II. Noisex92: A Database and an Experiment to Study the Effect of Additive Noise on Speech Recognition Systems", Speech Communication, Vol. 12, No. 3, pp. 247-252, 1993
[9] S.J. Young, "HTK Version 1.4: Reference Manual and User Manual", Cambridge University Engineering Department, Speech Group, 1992
[10] H. Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech", JASA, pp. 1738-1752, 1990
[11] N. Morgan, H. Hermansky et al., "Compensation for the Effect of the Communication Channel in Auditory-Like Analysis of Speech (Rasta-PLP)", Eurospeech-91, pp. 1367-1370, 1991