`Higgins et al.
`
(10) Patent No.: US 6,266,633 B1
(45) Date of Patent: Jul. 24, 2001
`
`(54) NOISE SUPPRESSION AND CHANNEL
`EQUALIZATION PREPROCESSOR FOR
`SPEECH AND SPEAKER RECOGNIZERS:
`METHOD AND APPARATUS
`
`(75) Inventors: Alan Lawrence Higgins; Steven F.
`Boll; Jack E. Porter, all of San Diego,
`CA (US)
`
`(73) Assignee: ITT Manufacturing Enterprises,
`Wilmington, DE (US)
`
( * ) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.
`
(21) Appl. No.: 09/218,565
(22) Filed: Dec. 22, 1998

(51) Int. Cl.7: G10L 21/02
(52) U.S. Cl.: 704/224; 704/228
(58) Field of Search: 704/224, 228
`
(56) References Cited

PUBLICATIONS

Stockham, Jr., Thomas G., Cannon, Thomas M., and Ingebretsen,
Robert B., “Blind Deconvolution through Digital Signal
Processing”, Proceedings of the IEEE, vol. 63, No. 4, Apr. 1975,
pp. 678-692.
Boll, Steven F., “Suppression of Acoustic Noise in Speech Using
Spectral Subtraction”, IEEE Transactions on Acoustics, Speech,
and Signal Processing, vol. ASSP-27, No. 2, Apr. 1979,
pp. 113-120.
Avendano, Carlos and Hermansky, Hynek, “On the Effects of
Short-Term Spectrum Smoothing in Channel Normalization”, IEEE
Transactions on Speech and Audio Processing, vol. 5, No. 4,
Jul. 1997, pp. 372-374.
Hynek Hermansky, et al. “RASTA Processing of Speech”, IEEE
Trans. Speech and Audio Processing, vol. 2, No. 4, pp. 578-589,
Oct. 1994.*
Johan de Veth, et al. “Comparison of Channel Normalisation
Techniques for Automatic Speech Recognition over the Phone,”
Proc. Intl. Conf. on Spoken Language, ICSLP 96, vol. 4,
pp. 2332-2335, Oct. 1996.*
Detlef Hardt, et al. “Spectral Subtraction and RASTA-Filtering
in Text-Dependent HMM-Based Speaker Verification,” Proc. IEEE
ICASSP 97, vol. 2, pp. 867-870, Apr. 1997.*
Carlos Avendano, et al. “On the Effects of Short-Term Spectrum
Smoothing in Channel Normalization,” IEEE Trans. Speech and
Audio Processing, vol. 5, No. 4, pp. 372-374, Jul. 1997.*
Zhang Zhijie, et al. “Stabilized Solutions and Multiparameter
Optimization Technique of Deconvolution,” Proc. Intl. Conf.
Signal Processing, ICSP 98, vol. 1, pp. 168-171, Oct. 1998.*
`
`* cited by examiner
`
Primary Examiner: Talivaldis I. Smits
(74) Attorney, Agent, or Firm: Arthur L. Plevy; Duane,
Morris & Hecksher
`
(57) ABSTRACT

A method for performing noise suppression and channel
equalization of a noisy voice signal comprising the steps of
sampling the noisy voice signal at a predetermined sampling
rate fs; segmenting the sampled voice signal into a plurality
of frames having a predetermined number of samples per
frame, over a predetermined temporal window; generating
an N-point spectral sample representation of each of the
sample signal frames; determining the magnitude of each of
the N-point spectral samples and generating a histogram of
the energy associated with each of the N-point spectral
samples at a particular frequency; detecting a peak amplitude
of the histogram which corresponds to a noise threshold
Nf associated with the particular frequency; determining a
channel frequency response Cf associated with the particular
frequency by determining a geometric mean over all the
spectral samples having magnitude exceeding the noise
threshold Nf; subtracting from each of the magnitudes of the
N-point spectral samples the noise threshold Nf to provide a
noise suppressed sample sequence; applying blind deconvolution
to the noise suppressed samples; transforming the
deconvolved noise suppressed sampled sequence to a temporal
representation; shifting the temporal sample sequence
in time by a predetermined amount; and adding the time
shifted temporal samples over a period corresponding to the
predetermined temporal window to provide a suppressed
noise voice signal.
`
`26 Claims, 5 Drawing Sheets
`
[Front-page drawing (block diagram). Recoverable block labels:
rect. to polar converter (70); arith. circuit / histogram
generator (75); estimate Nf module; Cf estimator; memory;
spectral subtractor (100); BD filter (110); polar to rect.
converter (130).]
`
`WAVES345_1006-0001
`
`Petitioner Waves Audio Ltd. 345 - Ex. 1006
`
`
`
U.S. Patent    Jul. 24, 2001    US 6,266,633 B1

[Sheets 1 to 3 of 5: FIGS. 1, 2A and 2B. Drawing content is not
recoverable from the extracted text.]
`
`
`
[Sheet 4 of 5, Fig. 3: flow diagram of the two-pass processing.
Recoverable box labels: sampled data; 1024 pt. window
processing (50); 1024 pt. FFT processing (60); rect. to polar
conversion (70); store mft in memory; generate/update
histogram (75); compute and store Nf (80); compute and store Hf;
end first pass; retrieve mft, Nf, Hf from memory; perform
spectral subtraction (100); perform blind deconvolution (110);
polar to rect. conversion (130); 1024 pt. IFFT processing (140);
end second pass. Loop controls: frame = t = 1, freq. = f = 0;
f = f + 1; t = t + 1; test "is f = 511?" (yes/no).]

Fig. 3
`
`
`
`
[Sheet 5 of 5, Fig. 4: histogram of probability density (vertical
axis) versus spectral magnitude at frequency f (horizontal
axis). Nf marks the mode of the distribution; Cf marks the
geometric mean of the outlined area 92.]

Fig. 4

"46-79"  "64-79"  "74-69"  "94-67"
"46-97"  "64-97"  "74-96"  "94-76"
"47-69"  "67-49"  "76-49"  "96-47"
"47-96"  "67-94"  "76-94"  "96-74"
"49-67"  "69-47"  "79-46"  "97-46"
"49-76"  "69-74"  "79-64"  "97-64"

Fig. 5
`
`NOISE SUPPRESSION AND CHANNEL
`EQUALIZATION PREPROCESSOR FOR
`SPEECH AND SPEAKER RECOGNIZERS:
`METHOD AND APPARATUS
`
`FIELD OF THE INVENTION
`
`This invention relates to speech recognition generally, and
`more particularly to a signal pre-processor for enhancing the
`quality of a speech signal before further processing by a
`speech or speaker recognition device.
`
`BACKGROUND OF THE INVENTION
Speech and speaker recognition devices must often operate on
speech signals corrupted by noise and channel distortions. This
is the case, for example, when using “far-field” microphones
placed on a desktop near computers or other office equipment.
Noise, such as noise originating from disk drives or cooling
fans, can be transmitted both mechanically, by direct contact of
the microphone to the computer equipment or through the
furniture it rests on, and by acoustic transmission through the
air. Noise can also be picked up through electrical or magnetic
coupling, as in the case of power line “hum”.
The “channel” through which speech is measured includes the
processes of acoustic propagation from the speaker’s mouth,
transduction by the microphone, analog signal processing, and
analog-to-digital conversion. The distortion introduced by this
composite channel may be modeled as a linear process and
characterized by its frequency response. Factors affecting the
channel frequency response include microphone type, distance
and off-axis angle of the speaker relative to the microphone,
room acoustics, and the characteristics of the analog electronic
circuits and anti-aliasing filter.
Speech and speaker recognition systems operate by comparing the
input speech with acoustic models derived from prior “training”
speech material. Loss of accuracy occurs when the input speech
is corrupted by noise or channel frequency response that differ
significantly from those affecting the training speech. The
present invention addresses this problem by suppressing noise
and equalizing channel distortions in an input speech signal.
Certain methods for noise suppression are well known. One
method used for noise suppression is known as spectral
subtraction (SS). SS requires an estimate of the noise magnitude
spectrum, which is assumed to be stationary over time. This
estimate is subtracted from the measured magnitude spectrum of a
noisy speech input at each time interval or “frame” to obtain an
estimate of the magnitude spectrum of the speech in the absence
of noise. Further details regarding noise suppression may be
obtained from the publication entitled “Suppression of acoustic
noise in speech using spectral subtraction,” IEEE Transactions
on Acoustics, Speech, and Signal Processing, vol. ASSP-27, no.
2, pp. 113-120, IEEE, New York, NY, 1979, and incorporated
herein by reference.
Certain methods which operate to perform channel equalization
are also known. One method used for channel equalization, known
as blind deconvolution (BD), estimates the spectrum of the input
signal over its whole duration and applies a linear filter
designed to make the spectrum of the signal equal to the
long-term spectrum of speech. This method effectively
compensates for the channel when the input speech material is of
sufficient length that its spectrum approximates the long-term
spectrum of speech. Further details regarding blind
deconvolution may be obtained from
the publication by T. G. Stockham, T. M. Cannon, and R. B.
Ingebretsen, entitled “Blind deconvolution through digital
signal processing,” Proceedings of the IEEE, vol. 63, No. 4,
pp. 678-692, 1975, incorporated herein by reference.
In addition, a publication by D. Hardt and K. Fellbaum,
entitled “Spectral Subtraction and RASTA Filtering in
Text-Dependent HMM-Based Speaker Verification”, IEEE Doc.
No. 0-8186-7919-0/97, ICASSP 97, Munich, Germany,
April, 1997 and incorporated by reference herein, describes
a comparison of speaker verification performance using
“internal” versus “external” spectral subtraction. Internal
SS, integrated with an existing verifier front end system, was
found to be inferior to external SS, which was implemented
as an independent processing step, prior to input to the
verifier. Using external SS, verification accuracy was found
to improve with increasing spectral analysis window size up
to 128 milliseconds. Such findings were confirmed in a set
of experiments involving the SpeakerKey voice verifier
system described in commonly assigned copending patent
application Ser. No. 08/960,509 entitled “VOICE
AUTHENTICATION SYSTEM” filed on Oct. 29, 1997 to
Blais et al, and incorporated herein by reference, and a
specially-collected database using far-field microphones. In
our experiments, the improvement with increasing window
size was found to be related to the nature of the noise. The
loudest noise components in the data are stationary,
narrow-bandwidth spectral lines, for which estimation accuracy
increases with window length. High spectral resolution is
therefore needed to reject this type of noise. Analysis
windows of 128 ms length are sufficient to provide the
needed resolution.
In another publication by C. Avendano and H. Hermansky
entitled “On the Effects of Short-Term Spectrum Smoothing
in Channel Normalization”, IEEE Transactions on
Speech and Audio Processing, vol. 5, No. 4, p. 372, July, 1997,
an improvement to the performance of blind deconvolution was
reported in the context of a speech recognition system. The
system used measurements of the power spectrum in critical
bands, where each such measurement was derived by integrating
the fast Fourier transform (FFT) power spectrum
over frequencies within the critical band. BD was reported
to perform better when applied prior to critical-band
integration (i.e., to the FFT power spectrum) than after (to the
critical band measurements). The disparity of performance
was greatest for channels whose magnitude response varies
within the frequency limits of the individual critical band
filters. In the present invention, it was found that increasing
the window size from 20 ms (typically used in speech and speaker
recognition systems) to 128 ms led to additional performance
improvements. The reason for this improvement is
similar to that offered above in connection with
narrow-bandwidth noise. It is known that reverberant environments
can introduce sharp spectral nulls (as narrow as 10 Hz in
width) in the frequency response of acoustic transmission
from the talker to the microphone, caused by interference
between direct and reflected signal paths. These effects
cannot be adequately compensated if BD is applied to
critical bands, whose bandwidths greatly exceed 10 Hz.
When applied before critical band integration, spectral nulls
present in the channel can be resolved if sufficiently long
analysis windows are used. Windows of at least 100 ms
length are required to provide the needed 10 Hz frequency
resolution.
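
The link between window length and frequency resolution used in the paragraph above can be written out as a short worked relation. This is a rule-of-thumb sketch, not a formula taken from the patent: it assumes the resolvable frequency spacing is roughly the reciprocal of the window duration T, and ignores the extra main-lobe broadening that depends on the window shape.

```latex
\Delta f \approx \frac{1}{T}
\qquad\Longrightarrow\qquad
\begin{aligned}
T = 20\ \text{ms}  &: \ \Delta f \approx 50\ \text{Hz} \\
T = 100\ \text{ms} &: \ \Delta f \approx 10\ \text{Hz} \\
T = 128\ \text{ms} &: \ \Delta f \approx 7.8\ \text{Hz}
\end{aligned}
```

Under this approximation a 20 ms window cannot separate features 10 Hz apart, whereas windows of 100 ms or longer can, which matches the figures quoted in the text.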
However, none of the prior art applications combines
noise suppression with channel equalization, including
channel frequency response normalization and signal level
normalization to a signal preprocessor apparatus which
accepts as input a noisy speech signal such as that introduced
from a microphone and which produces an enhanced output
speech signal for subsequent processing.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a signal
pre-processor which accepts as input a speech signal from a
microphone or other source and produces as output an
enhanced speech signal for subsequent processing by a
speech or speaker recognition device. It is intended to be
used both in processing training material and at recognition
time by attenuating stationary noise that may be present in
the input signal and applying linear filtering to make the
long-term spectrum associated with the output signal equal
to a pre-specified “target” spectrum. Through these
operations, differences in noise and frequency response
between training and test channels are effectively
suppressed, minimizing the loss of recognition or verification
accuracy.
It is a further object of the invention to provide a method
for performing noise suppression and channel equalization
of a noisy voice signal comprising the steps of sampling the
noisy voice signal at a predetermined sampling rate fs;
segmenting the sampled voice signal into a plurality of
frames having a predetermined number of samples per
frame, over a predetermined temporal window; generating
an N-point spectral sample representation of each of the
sample signal frames; determining the magnitude of each of
the N-point spectral samples and generating a histogram of
the energy associated with each of the N-point spectral
samples at a particular frequency; detecting a peak amplitude
of the histogram which corresponds to a noise threshold
Nf associated with the particular frequency; determining a
channel frequency response Cf associated with the particular
frequency by determining a geometric mean over all the
spectral samples having magnitude exceeding the noise
threshold Nf; subtracting from each of the magnitudes of the
N-point spectral samples the noise threshold Nf to provide a
noise suppressed sample sequence; applying blind deconvolution
to the noise suppressed samples; transforming the
deconvolved noise suppressed sampled sequence to a temporal
representation; shifting the temporal sample sequence
in time by a predetermined amount; and adding the time
shifted temporal samples over a period corresponding to the
predetermined temporal window to provide a suppressed
noise voice signal.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an exemplary illustration of a voice verification
system employing the preprocessor according to the present
invention.
FIG. 2A is a block diagram depicting the major functional
components of the preprocessor according to the present
invention.
FIG. 2B is a detailed block diagram depicting in greater
detail the noise suppression and channel equalization
frequency processing module illustrated in FIG. 2A according
to the present invention.
FIG. 3 is a flow diagram depicting the processing steps
associated with noise suppression and channel equalization
of a noisy input voice signal according to the present
invention.
FIG. 4 is an exemplary illustration of a histogram generated
for determining the noise floor and channel response in
order to perform noise suppression and channel equalization
according to the present invention.
FIG. 5 is a chart of speech utterances or phrases processed
by the preprocessor according to the present invention.

DETAILED DESCRIPTION OF THE
INVENTION
`
Before embarking on a detailed discussion, the following
should be understood. The pre-processor according to the
present invention combines spectral subtraction and blind
deconvolution within a common algorithmic framework. It
also normalizes the peak energy of the output speech signal
to a fixed value prior to verification. The latter operation
reduces saturation and quantization effects induced by input
signals with large dynamic range.
The preprocessor according to the present invention is
especially useful since a combination of noise and channel
variability is frequently encountered when using far-field
microphones. In many applications of practical interest, both
the noise spectrum and the channel frequency response
exhibit sharp peaks and nulls as a function of frequency.
These problems are not effectively treated in conventional
speech and speaker recognition systems, where the tradeoff
between time and frequency resolution is heavily influenced
by the need to measure speech events of short duration.
From the description that follows, one can see that the
preprocessor of the present invention addresses noise and
channel variability problems simultaneously, using an
efficient frequency-domain approach that provides sufficient
frequency resolution of spectral peaks and nulls.
The invention has been found to be particularly effective
when used in conjunction with the SpeakerKey voice
verification system as disclosed in U.S. Pat. No. 5,339,385 by A.
L. Higgins, entitled SPEAKER VERIFIER USING
NEAREST-NEIGHBOR DISTANCE MEASURE, issued
on Aug. 16, 1994, and commonly assigned copending
applications Ser. Nos. 08/960,509 and 08/632,723, now U.S. Pat.
No. 5,937,381. SpeakerKey uses prompted phrases that are
constructed in a manner that enables blind deconvolution to
provide accurate channel estimates, even for short phrases.
In experiments involving the SpeakerKey system with
far-field microphones, error rates were reduced by at least half
under a variety of conditions by using the novel
pre-processor apparatus.
Referring now to FIG. 1, there is shown a voice
verification system 10 in which the output of the preprocessor 26,
according to the present invention, is utilized. Note that
when referring to the drawings, like reference numerals are
used to indicate like parts. A voice verification system such
as that disclosed in copending, commonly assigned patent
application Ser. Nos. 08/960,509, 08/632,723, or issued U.S.
Pat. No. 5,271,088, and incorporated herein by reference,
may use and/or implement the preprocessor according to the
present invention, in order to provide noise suppression,
channel equalization, and normalization of a noisy voice
signal prior to the step of verifying the voice signal. As
shown in FIG. 1, the voice verification system 10 includes
a prompt generator 22, which produces a prompting message
and communicates it to the user 9 via prompting device
27. The prompting message may be communicated aurally
by means of a computer monitor. In response to the prompt,
a user 9 speaks into a microphone 18, thereby producing
enrollment speech utterances 22A. Speech utterances 22A
are input to analog to digital converter circuit 23 which
performs sampling at a rate of preferably fs=8000 Hz (i.e. 8
kHz) to provide a digitized voice signal 23A for input to
preprocessor 26, which will be described in detail below.
The output of preprocessor 26 is applied as input to either
enrollment processor 12 or verification processor 16 of voice
verification system 10. The enrollment processor 12 performs
an enrollment function by generating a voice model
30 of an authorized user’s speech. The voice model 30 is
then stored in the computer’s memory so that it can be
downloaded at a later time by the verification function. The
verification processor 16 performs the verification function
by first processing the speech of the user, and then comparing
the processed speech to the voice model 30. Based on
this comparison, the verification processor produces a
decision 16A to either grant or deny the user 9 access to system
application 20.
The speech utterances 22A comprise one or more phrases
which consist of the same words in different word orders.
Such phrases may be selected from the group of enrollment
phrases shown in FIG. 5. As one can ascertain, each of the
phrases consists of four digits “four”, “six”, “seven”, “nine”,
connected by “t’s” such that a single phrase or speech
utterance may be “forty six - seventy nine”, or “forty six -
ninety seven”, and so on. These selectable enrollment
phrases or speech utterances are thus limited to the twenty
four combinations of words “four”, “six”, “seven” and
“nine” arranged in double two-digit number combination.
The selection of these enrollment speech utterances allows
easy and consistent repetition and minimizes the number of
phrases required for enrollment and/or verification. In
addition, these phrases represent a small number of words,
while enabling accurate word recognition, and a
phonetic composition structure to allow channel equalization
using blind deconvolution. Note that phrases containing
the words “zero”, “one”, “two”, “three”, “five” and “eight”
are excluded because such numbers introduce pronunciations
that depend on the position of the word within the
phrase, for example, “20” vs. “2”. Note further that while the
preferred embodiment uses prompted speech utterances,
computerized prompting is not necessary to carry out the
present invention.
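
The count of twenty-four phrases follows from taking every ordering of the four allowed digits. The sketch below enumerates them; the function name `enrollment_phrases` and the spoken-form dictionaries are illustrative assumptions, not part of the patent.

```python
from itertools import permutations

# Spoken forms of the four allowed digits (illustrative lookup tables).
TENS = {"4": "forty", "6": "sixty", "7": "seventy", "9": "ninety"}
UNITS = {"4": "four", "6": "six", "7": "seven", "9": "nine"}

def enrollment_phrases(digits="4679"):
    """Return every ordering of the digits as a pair of two-digit numbers."""
    phrases = []
    for a, b, c, d in permutations(digits):
        code = f"{a}{b}-{c}{d}"  # written form, e.g. "46-79"
        spoken = f"{TENS[a]} {UNITS[b]} - {TENS[c]} {UNITS[d]}"
        phrases.append((code, spoken))
    return phrases

phrases = enrollment_phrases()
print(len(phrases))   # number of distinct phrases
print(phrases[0])     # first phrase in lexicographic order
```

Four distinct digits give 4! = 24 orderings, which agrees with the chart of FIG. 5.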
The preprocessor 26 operates to convert speech utterances
into a plurality of speech frames and to extract the spectral
characteristics and features of each of the speech frames.
The preprocessor 26 utilizes the spectral magnitudes of each
of the windowed speech samples 24A (FIGS. 2A, 2B) to
perform noise suppression and channel equalization of the
magnitude spectra. In general, processing is performed in
two passes over the speech data. In the first pass, magnitude
spectra are computed and saved for the entire utterance.
These magnitude spectra are used to estimate the noise floor
for spectral subtraction and the channel frequency response.
Once the noise floor, Nf, and channel frequency response are
obtained, the preprocessor 26, in a second pass, subtracts
from each of the magnitude spectra the noise floor and sets
any negative results to zero. Blind deconvolution is then
applied by multiplying the SS-processed magnitude by the
blind deconvolution filter having a frequency response of
GBf/Cf, where Bf represents a trapezoidal window applied to
the blind deconvolution filter to reject frequencies outside a
bandpass range and where G represents a gain constant
applied for the purpose of output level normalization. The
preprocessor then operates to convert the spectral data back
into a temporal representation via an inverse discrete Fourier
transform such as an IFFT while maintaining the phase and
provides a preprocessed output signal 26A for further
processing by a verifying system or construction of a user voice
model 30. Note that while in the preferred embodiment,
processing is performed over two passes of the data, the
present invention contemplates the use of one pass of speech
data in which to perform the preprocessing functions described
herein.
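
The second-pass arithmetic described above reduces, for a single frequency, to a clamp followed by a multiplication. The following is a minimal sketch under stated assumptions: the function name and argument names are hypothetical, and the noise floor, channel response, band weight and gain are taken as already estimated.

```python
def second_pass_bin(magnitudes, noise_floor, channel_response, band_weight, gain):
    """Process the saved magnitudes of one frequency f across all frames t."""
    # Blind deconvolution filter response for this frequency: G * Bf / Cf.
    h_f = gain * band_weight / channel_response
    out = []
    for m in magnitudes:
        cleaned = max(m - noise_floor, 0.0)  # spectral subtraction, negatives set to zero
        out.append(cleaned * h_f)            # channel equalization
    return out

# Three frames at one frequency, with Nf = 2, Cf = 4, Bf = 1, G = 2.
result = second_pass_bin([1.0, 3.0, 5.0], 2.0, 4.0, 1.0, 2.0)
print(result)
```

Only the magnitude is touched here; the phase of each spectral sample is carried through unchanged, as the text states.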
Referring now to FIG. 2A, there is shown a block diagram
of the preprocessor 26. Each incoming frame of sampled
data 23A indicative of a speech utterance received over an
input channel is multiplied by a Hanning window 50 and
processed using an FFT 60. The sampled data 23A is
indicative of a noisy voice input signal comprising the
speech utterance which has been sampled and digitized at a
predetermined sample rate (preferably 8 kHz) via an
analog-to-digital (A/D) converter for input to the
preprocessor. Preferably, the noisy input voice signal comprises
a pulse-code modulated (PCM) sampled signal, but may be
any of a number of different types of digital signals. The FFT
transforms the windowed frame data into a “frequency
domain” representation, where further processing represented
by module 63 occurs (shown in greater detail in FIG.
2B). In the preferred embodiment, a 1024-point Hanning
window 50 and a 1024-point FFT 60 are used. The 1024-point
Hanning window processes each speech utterance into
a plurality of time windows or speech frames of 1024-point
samples, with consecutive frames overlapping by one-half
(1/2) window (i.e. 512 samples). Each windowed frame of
data samples 52 is then input into the 1024-point FFT
processor 60 for converting the sampled speech signal into
a spectral representation sequence having both real and
imaginary portions. That is, operation of the FFT 60
produces, for each frame of data, 512 real/imaginary number
pairs representing the complex spectrum at the 512 FFT
sampling frequencies indicated f0, f1, ..., f511. The frequency
domain processing of module 63 is therefore duplicated 512
times, once for each sampling frequency. After frequency
domain processing 63, an IFFT 140 transforms the data back
to the time domain, where it is overlapped by one-half frame
with the previous output data and added to it. Note that if the
frequency-domain processing of module 63 did nothing (i.e.,
simply passed the signal through unaltered), the output
signal 152 of the preprocessor would be identical to the input
23A because the IFFT 140 and overlap and add synthesizer
(OLA) module 150 simply invert the processing performed
by the Hanning window 50 and FFT 60.
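
The claim that windowing followed by overlap-and-add reproduces the input can be checked numerically. This is a sketch under assumptions not stated in the patent: it uses the periodic form of the Hanning window, w[n] = 0.5 - 0.5 cos(2πn/N), for which two half-overlapped windows sum exactly to one, and it omits the FFT/IFFT pair because, with no spectral modification, the IFFT undoes the FFT. The function names are illustrative.

```python
import math

def hann(n_points):
    # Periodic Hanning window (assumed form): w[n] = 0.5 - 0.5*cos(2*pi*n/N).
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]

def window_and_overlap_add(signal, n_points=1024):
    """Window half-overlapping frames and add them back together."""
    hop = n_points // 2                      # 512-sample frame advance
    w = hann(n_points)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - n_points + 1, hop):
        for n in range(n_points):
            out[start + n] += w[n] * signal[start + n]
    return out

signal = [math.sin(0.01 * k) for k in range(4 * 1024)]
rebuilt = window_and_overlap_add(signal)

# Compare only interior samples, which are covered by two frames.
max_err = max(abs(a - b) for a, b in zip(signal[512:-512], rebuilt[512:-512]))
print(max_err)
```

The first and last half-frames are covered by a single window and therefore remain tapered; the identity holds for the interior of the utterance.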
Referring now to FIG. 2B, there is shown a block diagram
of the frequency-domain processing associated with module
63. Each real/imaginary number pair input 61 from FFT 60
is first converted to a magnitude and phase via polar
converter module 70 which operates to convert the Fourier
transform spectral sequence from rectangular to polar
coordinates using well-known formulas. Such means for
converting rectangular to polar coordinates is well known in the
art and will therefore not be described in detail. However,
software programs may easily implement such conversion
by taking the square root of the sum of the squares of the real
and imaginary portions of the spectral sequence 61 to obtain
the magnitude spectra, and where the phase associated with
each spectral sample is obtained by taking the arc tangent of
the imaginary part over the real part. Processing, to be
elaborated on below, is performed on the magnitude portion,
leaving the phase portion unaltered. Each magnitude/phase
number pair is then converted to a real/imaginary number
pair using well-known formulas. These numbers comprise
the output of module 63. One can ascertain that if no
processing were applied to the magnitude (so that both the
magnitude and phase were unaltered) then the output of
module 63 would be identical to the input of module 63. In
this case, as stated above, the output signal 65 of
preprocessor 26 would be identical to its input 61.
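
The two coordinate conversions described above can be sketched with the Python standard library. The function names are illustrative. One choice here goes slightly beyond the text: `math.atan2` is used in place of a plain arc tangent of the ratio so that the phase lands in the correct quadrant and a zero real part causes no division error.

```python
import cmath
import math

def to_polar(re, im):
    """Rectangular to polar: magnitude and phase of one spectral sample."""
    magnitude = math.hypot(re, im)   # sqrt(re**2 + im**2)
    phase = math.atan2(im, re)       # four-quadrant arc tangent of im over re
    return magnitude, phase

def to_rect(magnitude, phase):
    """Polar to rectangular, applied after the magnitude has been processed."""
    z = cmath.rect(magnitude, phase)
    return z.real, z.imag

mag, ph = to_polar(3.0, -4.0)
re, im = to_rect(mag, ph)            # unmodified magnitude: round trip
re2, im2 = to_rect(2.0 * mag, ph)    # scaled magnitude, same phase
print(mag, (re, im), (re2, im2))
```

Scaling the magnitude while keeping the phase scales the real and imaginary parts by the same factor, which is why magnitude-only processing preserves the original phase.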
`
Still referring to FIG. 2B, the operations performed on the
magnitude spectra can be divided into two estimation steps
represented by modules 80 and 90, and two processing steps
represented by modules 100 and 110. In the preferred
embodiment, the estimation steps are carried out using data
from the whole utterance. To accomplish this, the data is
processed in two passes over the sampled utterance data. In
the first pass, magnitude spectra mft are computed and
saved in memory 14 for the whole utterance. That is, the data
mft output from rectangular to polar converter 70 represents
the magnitude at a Fourier frequency f and time window (i.e.
frame) t, and is stored in memory 14 such as a database. Note
that in the processing that follows, the phase associated with
the spectral samples is unmodified, so that the processing is
associated with the FFT magnitude rather than the associated
phase. Accordingly, the subsequent processing by polar
to rectangular converter 130 and IFFT processor algorithm
140 operates to maintain the original phase of each input
sampled speech utterance. Conventional arithmetic circuit
75 operates to construct histograms of the magnitude spectra
mft, which are generated for each frequency using each of the
frames which comprise a particular utterance and are stored
in memory 14. The concept is to determine, from the
histogram for each frequency bin, the noise amplitude
over the whole utterance. In each histogram, the
background noise becomes evident as a peak or mode within
the histogram corresponding to the amplitude of the noise
floor at that particular frequency. FIG. 4 provides an
example of this. The histogram shown in FIG. 4 represents
the probability density as a function of the spectral magnitude
at a particular frequency f. The mode of distribution, at
Nf, is used to estimate the magnitude of the noise floor at
frequency f. Conventional detector 80 then operates to
examine each of the bins comprising the histogram at
frequency f to determine which magnitude bin has the
highest probability. Noise floor Nf is then set equal to this
magnitude. Once the noise floor, Nf, has been determined,
channel estimator 90 then operates in response to the
detection of the noise floor Nf by averaging the log magnitudes of
those spectral samples which exceed the noise floor to obtain the
channel frequency response Cf at frequency f. In the preferred
embodiment, the estimator 90 operates to determine
the channel frequency response according to the equation

$$C_f = \exp\left(\frac{1}{|m_{ft} > N_f|}\sum_{t:\, m_{ft} > N_f} \log m_{ft}\right)$$

Thus, the channel frequency response Cf at frequency f is
set equal to the geometric mean over the utterance of those
magnitudes at frequency f that exceed the noise floor. Note
further that |mft>Nf| equals the number of time windows for
which the magnitude at frequency f exceeds the noise floor
at frequency f. Each of the noise floor and channel frequency
response estimates are stored in memory 14. Spectral
subtraction (SS) module 100 then operates on the saved
magnitude spectra data and noise estimate by subtracting from
each mft the noise floor Nf determined in module 80 and
setting any negative results to zero to provide a noise
suppressed signal sequence 104. Blind deconvolution filter
110 is coupled to the output of SS module 100 and operates
by multiplying the SS processed magnitude sequence 104 by
the BD filter frequency response. As shown in FIG. 2B,
blind deconvolution filter 110 is coupled to the spectral
subtractor 100 and has a BD filter frequency response
Hf=GBf/Cf which is inversely proportional to the channel
frequency response. Preferably, the BD filter comprises a
trapezoidal window with height, Bf, applied to the filter to
reject frequencies outside a band pass range where

$$B_f = \begin{cases} 0, & f < L_0 \\ (f - L_0)/(L_1 - L_0), & L_0 \le f < L_1 \\ 1, & L_1 \le f \le H_0 \\ (H_1 - f)/(H_1 - H_0), & H_0 < f \le H_1 \\ 0, & f > H_1 \end{cases}$$

In the preferred embodiment, the parameters are L0=200
Hz, L1=300 Hz, H0=3200 Hz, and H1=3450 Hz. The gain
constant, G, is applied for the purpose of output level
normalization

where P is the desired peak RMS value of the output signal.
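
The estimation steps (histogram mode for Nf, geometric mean for Cf) and the band-limited filter response can be sketched for a single frequency as follows. Several details are assumptions made for illustration and are not specified in the text: the histogram bin width, the use of the bin center as the noise floor, the lowest-bin tie-break, the piecewise-linear trapezoid shape inferred from the four corner parameters, and all function names. The gain G is passed in as a given value because its formula is not reproduced here.

```python
import math
from collections import Counter

def noise_floor(magnitudes, bin_width=1.0):
    """Estimate Nf as the mode of a histogram of magnitudes at one frequency."""
    counts = Counter(int(m // bin_width) for m in magnitudes)
    # Most populated bin; ties go to the lowest bin (assumed tie-break).
    best_bin, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return (best_bin + 0.5) * bin_width      # bin center (assumed convention)

def channel_response(magnitudes, n_f):
    """Estimate Cf as the geometric mean of magnitudes exceeding the noise floor."""
    above = [m for m in magnitudes if m > n_f]
    if not above:
        return None                          # no frame exceeded the noise floor
    return math.exp(sum(math.log(m) for m in above) / len(above))

def trapezoid(freq_hz, l0=200.0, l1=300.0, h0=3200.0, h1=3450.0):
    """Band weight Bf: 0 outside [l0, h1], 1 on [l1, h0], linear ramps between."""
    if freq_hz <= l0 or freq_hz >= h1:
        return 0.0
    if freq_hz < l1:
        return (freq_hz - l0) / (l1 - l0)
    if freq_hz <= h0:
        return 1.0
    return (h1 - freq_hz) / (h1 - h0)

def bd_filter(freq_hz, c_f, gain=1.0):
    """Filter response Hf = G * Bf / Cf."""
    return gain * trapezoid(freq_hz) / c_f

# Magnitudes of one frequency across six frames: four noise-only, two with speech.
mags = [1.2, 1.4, 1.1, 1.3, 4.0, 16.0]
n_f = noise_floor(mags)
c_f = channel_response(mags, n_f)
h_f = bd_filter(1000.0, c_f, gain=2.0)
print(n_f, c_f, h_f)
```

In this toy data the four noise-only frames dominate the histogram, so the mode falls in their bin, and the channel estimate is formed only from the two frames that rise above it.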
`Note that operations 75, 80, 90, 100, and 110 are repeated for
`each of the 512 values of f correspon