(12) United States Patent
Higgins et al.

(10) Patent No.: US 6,266,633 B1
(45) Date of Patent: Jul. 24, 2001

(54) NOISE SUPPRESSION AND CHANNEL
EQUALIZATION PREPROCESSOR FOR
SPEECH AND SPEAKER RECOGNIZERS:
METHOD AND APPARATUS

(75) Inventors: Alan Lawrence Higgins; Steven F.
Boll; Jack E. Porter, all of San Diego,
CA (US)

(73) Assignee: ITT Manufacturing Enterprises,
Wilmington, DE (US)

( * ) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/218,565
(22) Filed: Dec. 22, 1998

(51) Int. Cl.7: G10L 21/02
(52) U.S. Cl.: 704/224; 704/228
(58) Field of Search: 704/224, 228
(56) References Cited

PUBLICATIONS

Stockham, Jr., Thomas G., Cannon, Thomas M., and Ingebretsen,
Robert B., “Blind Deconvolution through Digital
Signal Processing”, Proceedings of the IEEE, vol. 63, No. 4,
Apr. 1975, pp. 678-692.
Boll, Steven F., “Suppression of Acoustic Noise in Speech
Using Spectral Subtraction”, IEEE Transactions on Acoustics,
Speech, and Signal Processing, vol. ASSP-27, No. 2,
Apr. 1979, pp. 113-120.
Avendano, Carlos and Hermansky, Hynek, “On the Effects
of Short-Term Spectrum Smoothing in Channel Normalization”,
IEEE Transactions on Speech and Audio Processing,
vol. 5, No. 4, Jul. 1997, pp. 372-374.
Hynek Hermansky, et al. “RASTA Processing of Speech”,
IEEE Trans. Speech and Audio Processing, vol. 2, No. 4, pp.
578-589, Oct. 1994.*
Johan de Veth, et al. “Comparison of Channel Normalisation
Techniques for Automatic Speech Recognition over the
Phone,” Proc. Intl. Conf. on Spoken Language, ICSLP 96,
vol. 4, pp. 2332-2335, Oct. 1996.*
Detlef Hardt, et al. “Spectral Subtraction and RASTA-Filtering
in Text-Dependent HMM-Based Speaker Verification,”
Proc. IEEE ICASSP 97, vol. 2, pp. 867-870, Apr.
1997.*
Carlos Avendano, et al. “On the Effects of Short-Term
Spectrum Smoothing in Channel Normalization,” IEEE
Trans. Speech and Audio Processing, vol. 5, No. 4, pp.
372-374, Jul. 1997.*
Zhang Zhijie, et al. “Stabilized Solutions and Multiparameter
Optimization Technique of Deconvolution,” Proc. Intl.
Conf. Signal Processing, ICSP 98, vol. 1, pp. 168-171, Oct.
1998.*

* cited by examiner
`
Primary Examiner: Talivaldis I. Smits
(74) Attorney, Agent, or Firm: Arthur L. Plevy; Duane,
Morris & Hecksher

(57) ABSTRACT

A method for performing noise suppression and channel
equalization of a noisy voice signal comprising the steps of
sampling the noisy voice signal at a predetermined sampling
rate f_s; segmenting the sampled voice signal into a plurality
of frames having a predetermined number of samples per
frame, over a predetermined temporal window; generating
an N-point spectral sample representation of each of the
sample signal frames; determining the magnitude of each of
the N-point spectral samples and generating a histogram of
the energy associated with each of the N-point spectral
samples at a particular frequency; detecting a peak amplitude
of the histogram which corresponds to a noise threshold
N_f associated with the particular frequency; determining a
channel frequency response C_f associated with the particular
frequency by determining a geometric mean over all the
spectral samples having magnitude exceeding the noise
threshold N_f; subtracting from each of the magnitudes of the
N-point spectral samples the noise threshold N_f to provide a
noise suppressed sample sequence; applying blind deconvolution
to the noise suppressed samples; transforming the
deconvolved noise suppressed sampled sequence to a temporal
representation; shifting the temporal sample sequence
in time by a predetermined amount; and adding the time
shifted temporal samples over a period corresponding to the
predetermined temporal window to provide a suppressed
noise voice signal.

26 Claims, 5 Drawing Sheets
`
[Representative drawing: block diagram with rectangular-to-polar
converter 70 (magnitude output), arithmetic circuit / histogram
generator 75, N_f estimate module, C_f estimator, memory, spectral
subtractor 100, BD filter 110, and polar-to-rectangular converter 130.]
`WAVES345_1006-0001
`
`Petitioner Waves Audio Ltd. 345 - Ex. 1006
`
`

`
[Sheet 1 of 5: drawing labels not recoverable.]
`
[Sheet 4 of 5, FIG. 3: flow diagram. Recoverable box labels:
sampled data; 1024-pt. window processing (50); 1024-pt. FFT
processing (60); rect. to polar conversion (70); store m_ft in memory;
generate/update histogram (75); compute and store N_f (80); compute
and store H_f; end first pass; frame = t = 1, freq. = f = 0; retrieve
m_ft, N_f, H_f from memory (98); perform spectral subtraction (100);
perform blind deconvolution (110); polar to rect. conversion (130);
is f = 511? (yes / no); f = f + 1; set t = t + 1; 1024-pt. IFFT
processing (140); end second pass.]
`
[Sheet 5 of 5, FIG. 4: histogram of probability density versus
spectral magnitude at frequency f. N_f marks the mode of the
distribution; C_f marks the geometric mean of the outlined area 92.]

FIG. 5:

"46-79"  "64-79"  "74-69"  "94-67"
"46-97"  "64-97"  "74-96"  "94-76"
"47-69"  "67-49"  "76-49"  "96-47"
"47-96"  "67-94"  "76-94"  "96-74"
"49-67"  "69-47"  "79-46"  "97-46"
"49-76"  "69-74"  "79-64"  "97-64"
`
NOISE SUPPRESSION AND CHANNEL
EQUALIZATION PREPROCESSOR FOR
SPEECH AND SPEAKER RECOGNIZERS:
METHOD AND APPARATUS

FIELD OF THE INVENTION

This invention relates to speech recognition generally, and
more particularly to a signal pre-processor for enhancing the
quality of a speech signal before further processing by a
speech or speaker recognition device.
`
BACKGROUND OF THE INVENTION

Speech and speaker recognition devices must often operate
on speech signals corrupted by noise and channel distortions.
This is the case, for example, when using “far-field”
microphones placed on a desktop near computers or other
office equipment. Noise, such as noise originating from disk
drives or cooling fans, can be transmitted both mechanically,
by direct contact of the microphone to the computer equipment
or through the furniture it rests on, and by acoustic
transmission through the air. Noise can also be picked up
through electrical or magnetic coupling, as in the case of
power line “hum”.

The “channel” through which speech is measured
includes the processes of acoustic propagation from the
speaker’s mouth, transduction by the microphone, analog
signal processing, and analog-to-digital conversion. The
distortion introduced by this composite channel may be
modeled as a linear process and characterized by its frequency
response. Factors affecting the channel frequency
response include microphone type, distance and off-axis
angle of the speaker relative to the microphone, room
acoustics, and the characteristics of the analog electronic
circuits and anti-aliasing filter.

Speech and speaker recognition systems operate by comparing
the input speech with acoustic models derived from
prior “training” speech material. Loss of accuracy occurs
when the input speech is corrupted by noise or channel
frequency response that differ significantly from those
affecting the training speech. The present invention
addresses this problem by suppressing noise and equalizing
channel distortions in an input speech signal.
Certain methods for noise suppression are well known.
One method used for noise suppression is known as spectral
subtraction (SS). SS requires an estimate of the noise
magnitude spectrum, which is assumed to be stationary over
time. This estimate is subtracted from the measured magnitude
spectrum of a noisy speech input at each time interval
or “frame” to obtain an estimate of the magnitude spectrum
of the speech in the absence of noise. Further details
regarding noise suppression may be obtained from the
publication entitled “Suppression of acoustic noise in speech
using spectral subtraction,” IEEE Transactions on
Acoustics, Speech, and Signal Processing, vol. ASSP-27, no.
2, pp. 113-120, IEEE, New York, NY, 1979, incorporated
herein by reference.

Certain methods which operate to perform channel equalization
are also known. One method used for channel
equalization, known as blind deconvolution (BD), estimates
the spectrum of the input signal over its whole duration and
applies a linear filter designed to make the spectrum of the
signal equal to the long-term spectrum of speech. This
method effectively compensates for the channel when the
input speech material is of sufficient length that its spectrum
approximates the long-term spectrum of speech. Further
details regarding blind deconvolution may be obtained from
the publication by T. G. Stockham, T. M. Cannon, and R. B.
Ingebretsen, entitled “Blind deconvolution through digital
signal processing,” Proceedings of the IEEE, vol. 63, No. 4,
pp. 678-692, 1975, incorporated herein by reference.
In addition, a publication by D. Hardt and K. Fellbaum,
entitled “Spectral Subtraction and RASTA Filtering in Text-Dependent
HMM-Based Speaker Verification”, IEEE Doc.
No. 0-8186-7919-0/97, ICASSP 97, Munich, Germany,
April, 1997, incorporated by reference herein, describes
a comparison of speaker verification performance using
“internal” versus “external” spectral subtraction. Internal
SS, integrated with an existing verifier front end system, was
found to be inferior to external SS, which was implemented
as an independent processing step, prior to input to the
verifier. Using external SS, verification accuracy was found
to improve with increasing spectral analysis window size up
to 128 milliseconds. Such findings were confirmed in a set
of experiments involving the SpeakerKey voice verifier
system described in commonly assigned copending patent
application Ser. No. 08/960,509 entitled “VOICE
AUTHENTICATION SYSTEM” filed on Oct. 29, 1997 to
Blais et al, and incorporated herein by reference, and a
specially-collected database using far-field microphones. In
our experiments, the improvement with increasing window
size was found to be related to the nature of the noise. The
loudest noise components in the data are stationary, narrow
bandwidth spectral lines, for which estimation accuracy
increases with window length. High spectral resolution is
therefore needed to reject this type of noise. Analysis
windows of 128 ms length are sufficient to provide the
needed resolution.
In another publication by C. Avendano and H. Hermansky
entitled “On the Effects of Short-Term Spectrum Smoothing
in Channel Normalization”, IEEE Transactions on
Speech and Audio Processing, vol. 5, No. 4, p. 372, July, 1997, an
improvement to the performance of blind deconvolution was
reported in the context of a speech recognition system. The
system used measurements of the power spectrum in critical
bands, where each such measurement was derived by integrating
the fast Fourier transform (FFT) power spectrum
over frequencies within the critical band. BD was reported
to perform better when applied prior to critical-band integration
(i.e., to the FFT power spectrum) than after (to the
critical band measurements). The disparity of performance
was greatest for channels whose magnitude response varies within the
frequency limits of the individual critical band filters. In the
present invention, it was found that increasing the window
size from 20 ms (typically used in speech and speaker
recognition systems) to 128 ms led to additional performance
improvements. The reason for this improvement is
similar to that offered above in connection with narrow
bandwidth noise. It is known that reverberant environments
can introduce sharp spectral nulls (as narrow as 10 Hz in
width) in the frequency response of acoustic transmission
from the talker to the microphone, caused by interference
between direct and reflected signal paths. These effects
cannot be adequately compensated if BD is applied to
critical bands, whose bandwidths greatly exceed 10 Hz.
When applied before critical band integration, spectral nulls
present in the channel can be resolved if sufficiently long
analysis windows are used. Windows of at least 100 ms
length are required to provide the needed 10 Hz frequency
resolution.
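
As a rough check on the window lengths quoted above, the spacing between FFT bins is the sampling rate divided by the number of samples in the window, which equals the reciprocal of the window duration. The sketch below assumes an 8 kHz sampling rate (the rate given later for the preferred embodiment); the function names are illustrative and do not come from the patent. Bin spacing is only a proxy for resolution, since the analysis window itself widens each spectral line somewhat.

```python
def window_duration_ms(n_samples: int, sample_rate_hz: float) -> float:
    """Duration of an analysis window in milliseconds."""
    return 1000.0 * n_samples / sample_rate_hz


def bin_spacing_hz(n_samples: int, sample_rate_hz: float) -> float:
    """Spacing between adjacent FFT bins, equal to 1 / window duration."""
    return sample_rate_hz / n_samples


if __name__ == "__main__":
    fs = 8000.0  # assumed sampling rate in Hz
    for n in (160, 800, 1024):  # 20 ms, 100 ms and 128 ms windows at 8 kHz
        print(f"{n:5d} samples: {window_duration_ms(n, fs):6.1f} ms window, "
              f"{bin_spacing_hz(n, fs):7.4f} Hz bin spacing")
```

Under these assumptions a 20 ms window gives 50 Hz spacing, a 100 ms window gives 10 Hz, and a 128 ms (1024-sample) window gives about 7.8 Hz.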
However, none of the prior art applications combines
noise suppression with channel equalization, including
channel frequency response normalization and signal level
normalization, in a signal preprocessor apparatus which
accepts as input a noisy speech signal such as that introduced
from a microphone and which produces an enhanced output
speech signal for subsequent processing.
`
SUMMARY OF THE INVENTION

It is an object of the present invention to provide a signal
pre-processor which accepts as input a speech signal from a
microphone or other source and produces as output an
enhanced speech signal for subsequent processing by a
speech or speaker recognition device. It is intended to be
used both in processing training material and at recognition
time by attenuating stationary noise that may be present in
the input signal and applying linear filtering to make the
long-term spectrum associated with the output signal equal
to a pre-specified “target” spectrum. Through these
operations, differences in noise and frequency response
between training and test channels are effectively
suppressed, minimizing the loss of recognition or verification
accuracy.

It is a further object of the invention to provide a method
for performing noise suppression and channel equalization
of a noisy voice signal comprising the steps of sampling the
noisy voice signal at a predetermined sampling rate f_s;
segmenting the sampled voice signal into a plurality of
frames having a predetermined number of samples per
frame, over a predetermined temporal window; generating
an N-point spectral sample representation of each of the
sample signal frames; determining the magnitude of each of
the N-point spectral samples and generating a histogram of
the energy associated with each of the N-point spectral
samples at a particular frequency; detecting a peak amplitude
of the histogram which corresponds to a noise threshold
N_f associated with the particular frequency; determining a
channel frequency response C_f associated with the particular
frequency by determining a geometric mean over all the
spectral samples having magnitude exceeding the noise
threshold N_f; subtracting from each of the magnitudes of the
N-point spectral samples the noise threshold N_f to provide a
noise suppressed sample sequence; applying blind deconvolution
to the noise suppressed samples; transforming the
deconvolved noise suppressed sampled sequence to a temporal
representation; shifting the temporal sample sequence
in time by a predetermined amount; and adding the time
shifted temporal samples over a period corresponding to the
predetermined temporal window to provide a suppressed
noise voice signal.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an exemplary illustration of a voice verification
system employing the preprocessor according to the present
invention.
FIG. 2A is a block diagram depicting the major functional
components of the preprocessor according to the present
invention.
FIG. 2B is a detailed block diagram depicting in greater
detail the noise suppression and channel equalization frequency
processing module illustrated in FIG. 2A according
to the present invention.
FIG. 3 is a flow diagram depicting the processing steps
associated with noise suppression and channel equalization
of a noisy input voice signal according to the present
invention.
FIG. 4 is an exemplary illustration of a histogram generated
for determining the noise floor and channel response in
order to perform noise suppression and channel equalization
according to the present invention.
FIG. 5 is a chart of speech utterances or phrases processed
by the preprocessor according to the present invention.

DETAILED DESCRIPTION OF THE
INVENTION

Before embarking on a detailed discussion, the following
should be understood. The pre-processor according to the
present invention combines spectral subtraction and blind
deconvolution within a common algorithmic framework. It
also normalizes the peak energy of the output speech signal
to a fixed value prior to verification. The latter operation
reduces saturation and quantization effects induced by input
signals with large dynamic range.

The preprocessor according to the present invention is
especially useful since a combination of noise and channel
variability is frequently encountered when using far-field
microphones. In many applications of practical interest, both
the noise spectrum and the channel frequency response
exhibit sharp peaks and nulls as a function of frequency.
These problems are not effectively treated in conventional
speech and speaker recognition systems, where the tradeoff
between time and frequency resolution is heavily influenced
by the need to measure speech events of short duration.
From the description that follows, one can see that the
preprocessor of the present invention addresses noise and
channel variability problems simultaneously, using an efficient
frequency-domain approach that provides sufficient
frequency resolution of spectral peaks and nulls.

The invention has been found to be particularly effective
when used in conjunction with the SpeakerKey voice verification
system as disclosed in U.S. Pat. No. 5,339,385 by A.
L. Higgins, entitled SPEAKER VERIFIER USING
NEAREST-NEIGHBOR DISTANCE MEASURE, issued
on Aug. 16, 1994, and commonly assigned copending applications
Ser. Nos. 08/960,509 and 08/632,723, now U.S. Pat.
No. 5,937,381. SpeakerKey uses prompted phrases that are
constructed in a manner that enables blind deconvolution to
provide accurate channel estimates, even for short phrases.
In experiments involving the SpeakerKey system with far-field
microphones, error rates were reduced by at least half
under a variety of conditions by using the novel pre-processor
apparatus.
Referring now to FIG. 1, there is shown a voice verification
system 10 in which the output of the preprocessor 26,
according to the present invention, is utilized. Note that
when referring to the drawings, like reference numerals are
used to indicate like parts. A voice verification system such
as that disclosed in copending, commonly assigned patent
application Ser. Nos. 08/960,509, 08/632,723, or issued U.S.
Pat. No. 5,271,088, and incorporated herein by reference,
may use and/or implement the preprocessor according to the
present invention, in order to provide noise suppression,
channel equalization, and normalization of a noisy voice
signal prior to the step of verifying the voice signal. As
shown in FIG. 1, the voice verification system 10 includes
a prompt generator 22, which produces a prompting message
and communicates it to the user 9 via prompting device
27. The prompting message may be communicated aurally
by means of a computer monitor. In response to the prompt,
a user 9 speaks into a microphone 18, thereby producing
enrollment speech utterances 22A. Speech utterances 22A
are input to analog to digital converter circuit 23 which
performs sampling at a rate of preferably f_s = 8000 Hz (i.e. 8
kHz) to provide a digitized voice signal 23A for input to
preprocessor 26, which will be described in detail below.
The output of preprocessor 26 is applied as input to either
enrollment processor 12 or verification processor 16 of voice
verification system 10. The enrollment processor 12 performs
an enrollment function by generating a voice model
30 of an authorized user’s speech. The voice model 30 is
then stored in the computer’s memory so that it can be
downloaded at a later time by the verification function. The
verification processor 16 performs the verification function
by first processing the speech of the user, and then comparing
the processed speech to the voice model 30. Based on
this comparison, the verification processor produces a decision
16A to either grant or deny the user 9 access to system
application 20.
The speech utterances 22A comprise one or more phrases
which consist of the same words in different word orders.
Such phrases may be selected from the group of enrollment
phrases shown in FIG. 5. As one can ascertain, each of the
phrases consists of four digits “four”, “six”, “seven”, “nine”,
connected by “ty’s” such that a single phrase or speech
utterance may be “forty six - seventy nine”, or “forty six -
ninety seven”, and so on. These selectable enrollment
phrases or speech utterances are thus limited to the twenty-four
combinations of the words “four”, “six”, “seven” and
“nine” arranged in double two-digit number combinations.
The selection of these enrollment speech utterances allows
easy and consistent repetition and minimizes the number of
phrases required for enrollment and/or verification. In
addition, these phrases represent a small number of words,
while enabling accurate word recognition and providing a
phonetic composition that allows channel equalization
using blind deconvolution. Note that phrases containing
the words “zero”, “one”, “two”, “three”, “five” and “eight”
are excluded because such numbers introduce pronunciations
that depend on the position of the word within the
phrase, for example, “20” vs. “2”. Note further that while the
preferred embodiment uses prompted speech utterances,
computerized prompting is not necessary to carry out the
present invention.
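
The twenty-four phrases are simply the orderings of the four digits 4, 6, 7 and 9, written as two two-digit numbers. A minimal sketch that enumerates them in the notation of FIG. 5 follows; the function name and the string format are illustrative assumptions, not part of the patent.

```python
from itertools import permutations

# The four digits used by the enrollment phrases.
DIGITS = ("4", "6", "7", "9")


def enrollment_phrases(digits=DIGITS):
    """Return every ordering of the digits as a 'AB-CD' phrase label."""
    return [f"{a}{b}-{c}{d}" for a, b, c, d in permutations(digits)]


if __name__ == "__main__":
    phrases = enrollment_phrases()
    print(len(phrases))   # number of distinct phrases
    print(phrases[:4])    # first few labels
```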
The preprocessor 26 operates to convert speech utterances
into a plurality of speech frames and to extract the spectral
characteristics and features of each of the speech frames.
The preprocessor 26 utilizes the spectral magnitudes of each
of the windowed speech samples 24A (FIGS. 2A, 2B) to
perform noise suppression and channel equalization of the
magnitude spectra. In general, processing is performed in
two passes over the speech data. In the first pass, magnitude
spectra are computed and saved for the entire utterance.
These magnitude spectra are used to estimate the noise floor
for spectral subtraction and the channel frequency response.
Once the noise floor, N_f, and channel frequency response, C_f, are
obtained, the preprocessor 26, in a second pass, subtracts
the noise floor from each of the magnitude spectra and sets
any negative results to zero. Blind deconvolution is then
applied by multiplying the SS-processed magnitude by the
blind deconvolution filter having a frequency response of
G B_f / C_f, where B_f represents a trapezoidal window applied to
the blind deconvolution filter to reject frequencies outside a
bandpass range and where G represents a gain constant
applied for the purpose of output level normalization. The
preprocessor then operates to convert the spectral data back
into a temporal representation via an inverse discrete Fourier
transform such as an IFFT while maintaining the phase, and
provides a preprocessed output signal 26A for further processing
by a verifying system or construction of a user voice
model 30. Note that while in the preferred embodiment
processing is performed over two passes of the data, the
present invention contemplates the use of one pass of speech data in
which to perform the preprocessing functions described
herein.
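
The second pass described above reduces to two element-wise operations on the stored magnitudes. The following is a minimal NumPy sketch under the assumption that the magnitudes are held as a frames-by-frequencies array and that the noise floor and filter response have already been estimated; the function and argument names are hypothetical.

```python
import numpy as np


def second_pass(magnitudes, noise_floor, bd_filter):
    """Apply spectral subtraction, then the blind deconvolution filter.

    magnitudes  : array of shape (frames, bins), magnitude spectra m_ft
    noise_floor : array of shape (bins,), noise floor N_f per frequency
    bd_filter   : array of shape (bins,), filter response G * B_f / C_f
    """
    # Subtract the noise floor and set any negative results to zero.
    suppressed = np.maximum(magnitudes - noise_floor, 0.0)
    # Multiply by the filter response; the phase is not touched here.
    return suppressed * bd_filter
```

Broadcasting applies the per-frequency vectors to every frame, so no explicit loop over frames or frequencies is needed in this sketch.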
Referring now to FIG. 2A, there is shown a block diagram
of the preprocessor 26. Each incoming frame of sampled
data 23A indicative of a speech utterance received over an
input channel is multiplied by a Hanning window 50 and
processed using an FFT 60. The sampled data 23A is
indicative of a noisy voice input signal and comprises the
speech utterance which has been sampled and digitized at a
predetermined sample rate (preferably 8 kHz) via an
analog-to-digital (A/D) converter for input to the preprocessor.
Preferably, the noisy input voice signal comprises a
pulse-code modulated (PCM) sampled signal, but may be
any of a number of different types of digital signals. The FFT
transforms the windowed frame data into a “frequency
domain” representation, where further processing represented
by module 63 occurs (shown in greater detail in FIG.
2B). In the preferred embodiment, a 1024-point Hanning
window 50 and a 1024-point FFT 60 are used. The 1024-point
Hanning window processes each speech utterance into
a plurality of time windows or speech frames of 1024-point
samples, with consecutive frames overlapping by one-half
(1/2) window (i.e. 512 samples). Each windowed frame of
data samples 52 is then input into the 1024-point FFT
processor 60 for converting the sampled speech signal into
a spectral representation sequence having both real and
imaginary portions. That is, operation of the FFT 60
produces, for each frame of data, 512 real/imaginary number
pairs representing the complex spectrum at the 512 FFT
sampling frequencies indicated f_0, f_1, ..., f_511. The frequency
domain processing of module 63 is therefore duplicated 512
times, once for each sampling frequency. After frequency
domain processing 63, an IFFT 140 transforms the data back
to the time domain, where it is overlapped by one-half frame
with the previous output data and added to it. Note that if the
frequency-domain processing of module 63 did nothing (i.e.,
simply passed the signal through unaltered), the output
signal 152 of the preprocessor would be identical to the input
23A because the IFFT 140 and overlap and add synthesizer
(OLA) module 150 simply invert the processing performed
by the Hanning window 50 and FFT 60.
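
The pass-through property noted above can be checked numerically. The sketch below is an illustration under stated assumptions rather than the patented implementation: it uses a periodic Hann window (whose half-overlapped copies sum exactly to one), NumPy's real FFT (which returns 513 bins for a 1024-point frame, one more than the 512 the text counts), and hypothetical function names.

```python
import numpy as np

N = 1024        # frame length in samples
HOP = N // 2    # consecutive frames overlap by one-half window


def periodic_hann(n):
    """Hann window whose copies shifted by n/2 sum to exactly 1."""
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n) / n)


def analyze_resynthesize(x, process=lambda spectrum: spectrum):
    """Window, FFT, optional frequency-domain processing, IFFT, overlap-add."""
    window = periodic_hann(N)
    y = np.zeros(len(x))
    for start in range(0, len(x) - N + 1, HOP):
        frame = x[start:start + N] * window
        spectrum = np.fft.rfft(frame)
        # With the default identity `process`, this simply inverts the FFT.
        y[start:start + N] += np.fft.irfft(process(spectrum), n=N)
    return y
```

With the identity `process`, every sample covered by two frames is reproduced; only the first and last half-frames, which are covered by a single window, are attenuated.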
Referring now to FIG. 2B, there is shown a block diagram
of the frequency-domain processing associated with module
63. Each real/imaginary number pair input 61 from FFT 60
is first converted to a magnitude and phase via polar converter
module 70 which operates to convert the Fourier
transform spectral sequence from rectangular to polar coordinates
using well-known formulas. Such means for converting
rectangular to polar coordinates is well known in the
art and will therefore not be described in detail. However,
software programs may easily implement such conversion
by taking the square root of the sum of the squares of the real
and imaginary portions of the spectral sequence 61 to obtain
the magnitude spectra, and where the phase associated with
each spectral sample is obtained by taking the arc tangent of
the imaginary part over the real part. Processing, to be
elaborated on below, is performed on the magnitude portion,
leaving the phase portion unaltered. Each magnitude/phase
number pair is then converted to a real/imaginary number
pair using well-known formulas. These numbers comprise
the output of module 63. One can ascertain that if no
processing were applied to the magnitude (so that both the
magnitude and phase were unaltered) then the output of
module 63 would be identical to the input of module 63. In
this case, as stated above, the output signal 65 of preprocessor
26 would be identical to its input 61.
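
The rectangular-to-polar conversion and its inverse can be sketched in a few lines. The function names below are illustrative; the sketch uses the four-quadrant `arctan2` rather than a plain arc tangent of the ratio, an assumption made so that the phase is correct when the real part is negative or zero.

```python
import numpy as np


def rect_to_polar(spectrum):
    """Split complex spectral samples into magnitude and phase."""
    magnitude = np.sqrt(spectrum.real ** 2 + spectrum.imag ** 2)
    phase = np.arctan2(spectrum.imag, spectrum.real)
    return magnitude, phase


def polar_to_rect(magnitude, phase):
    """Rebuild complex samples from (possibly modified) magnitude and phase."""
    return magnitude * np.cos(phase) + 1j * magnitude * np.sin(phase)
```

If the magnitude is left unmodified, `polar_to_rect(*rect_to_polar(z))` returns `z` up to floating-point rounding, which matches the pass-through behaviour described for module 63.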
`
Still referring to FIG. 2B, the operations performed on the
magnitude spectra can be divided into two estimation steps
represented by modules 80 and 90, and two processing steps
represented by modules 100 and 110. In the preferred
embodiment, the estimation steps are carried out using data
from the whole utterance. To accomplish this, the data is
processed in two passes over the sampled utterance data. In
the first pass, magnitude spectra m_ft are computed and
saved in memory 14 for the whole utterance. That is, the data
m_ft output from rectangular to polar converter 70, representing
the magnitude at a Fourier frequency f and time window (i.e.
frame) t, is stored in memory 14 such as a database. Note that
in the processing that follows, the phase associated with the
spectral samples is unmodified, so that the processing is
associated with the FFT magnitude rather than the associated
phase. Accordingly, the subsequent processing by polar
to rectangular converter 130 and IFFT processor algorithm
140 operates to maintain the original phase of each input
sampled speech utterance. Conventional arithmetic circuit
75 operates to construct histograms of the magnitude spectra
m_ft, which are generated for each frequency using each of the
frames which comprise a particular utterance and are stored
in memory 14. The concept is to determine, from the
histogram for each frequency bin, the noise amplitude
over the whole utterance. In each histogram, the
background noise becomes evident as a peak or mode within
the histogram corresponding to the amplitude of the noise
floor at that particular frequency. FIG. 4 provides an
example of this. The histogram shown in FIG. 4 represents
the probability density as a function of the spectral magnitude
at a particular frequency f. The mode of the distribution, at
N_f, is used to estimate the magnitude of the noise floor at
frequency f. Conventional detector 80 then operates to
examine each of the bins comprising the histogram at
frequency f to determine which magnitude bin has the
highest probability. Noise floor N_f is then set equal to this
magnitude. Once the noise floor, N_f, has been determined,
channel estimator 90 then operates in response to the detection
of the noise floor N_f by averaging the log magnitudes of
those frames at frequency f which exceed the noise floor to obtain the
channel frequency response C_f at frequency f. In the preferred
embodiment, the estimator 90 operates to determine
the channel frequency response according to the equation

$$C_f = \exp\left( \frac{1}{|m_{ft} > N_f|} \sum_{t:\, m_{ft} > N_f} \log m_{ft} \right)$$
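
The two estimation steps described above (histogram mode for the noise floor, then a geometric mean of the magnitudes above it for the channel response) can be sketched as follows. The number of histogram bins, the use of the bin centre as the noise floor value, and the error raised when no frame exceeds the floor are all assumptions of this sketch; the text does not specify them, and the function names are hypothetical.

```python
import numpy as np


def estimate_noise_floor(mags_f, n_bins=50):
    """Noise floor at one frequency: centre of the most populated histogram bin."""
    counts, edges = np.histogram(mags_f, bins=n_bins)
    k = int(np.argmax(counts))
    return 0.5 * (edges[k] + edges[k + 1])


def estimate_channel(mags_f, noise_floor):
    """Channel response at one frequency: geometric mean of magnitudes above the floor."""
    above = mags_f[mags_f > noise_floor]
    if above.size == 0:
        # Assumed handling; the source does not cover this case.
        raise ValueError("no frame exceeds the noise floor at this frequency")
    return float(np.exp(np.mean(np.log(above))))


def estimate_all(magnitudes, n_bins=50):
    """Run both estimators for every frequency column of a (frames, bins) array."""
    noise = np.array([estimate_noise_floor(col, n_bins) for col in magnitudes.T])
    channel = np.array([estimate_channel(col, nf)
                        for col, nf in zip(magnitudes.T, noise)])
    return noise, channel
```

Averaging log magnitudes and exponentiating is the same as taking the geometric mean, and the length of `above` plays the role of the count of frames whose magnitude exceeds the noise floor.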
Thus, the channel frequency response C_f at frequency f is
set equal to the geometric mean over the utterance of those
magnitudes at frequency f that exceed the noise floor. Note
further that |m_ft > N_f| equals the number of time windows for
which the magnitude at frequency f exceeds the noise floor
at frequency f. Each of the noise floor and channel frequency
response estimates is stored in memory 14. Spectral subtraction
(SS) module 100 then operates on the saved magnitude
spectra data and noise estimate by subtracting from
each m_ft the noise floor N_f determined in module 80 and
setting any negative results to zero to provide a noise
suppressed signal sequence 104. Blind deconvolution filter
110 is coupled to the output of SS module 100 and operates
by multiplying the SS processed magnitude sequence 104 by
the BD filter frequency response. As shown in FIG. 2B,
blind deconvolution filter 110 is coupled to the spectral
subtractor 100 and has a BD filter frequency response
H_f = G B_f / C_f which is inversely proportional to the channel
frequency response. Preferably, the BD filter comprises a
trapezoidal window with height, B_f, applied to the filter to
reject frequencies outside a band pass range where

In the preferred embodiment, the parameters are L0 = 200
Hz, L1 = 300 Hz, H0 = 3200 Hz, and H1 = 3450 Hz. The gain
constant, G, is applied for the purpose of output level
normalization

where P is the desired peak RMS value of the output signal.
Note that operations 75, 80, 90, 100, and 110 are repeated for
each of the 512 values of f correspon
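
The defining expressions for B_f and G are not reproduced in the text above, so the following is only one plausible reading: a trapezoid that is zero below L0, rises linearly to one at L1, stays at one up to H0, and falls linearly to zero at H1. The linear ramps, the function names, and the treatment of G as a caller-supplied constant are assumptions of this sketch.

```python
import numpy as np

# Band edges in Hz, as given for the preferred embodiment.
L0, L1, H0, H1 = 200.0, 300.0, 3200.0, 3450.0


def trapezoid_window(freqs_hz):
    """Assumed trapezoidal window B_f with linear ramps between the band edges."""
    return np.interp(freqs_hz, [L0, L1, H0, H1], [0.0, 1.0, 1.0, 0.0],
                     left=0.0, right=0.0)


def bd_filter(freqs_hz, channel, gain=1.0):
    """Filter response H_f = G * B_f / C_f for a given channel estimate C_f."""
    return gain * trapezoid_window(freqs_hz) / channel
```

For a 1024-point FFT at 8 kHz, the bin frequencies could be formed as `np.arange(512) * 8000.0 / 1024` and passed as `freqs_hz` together with the per-frequency channel estimates.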
