Diethorn

US006035048A
[11] Patent Number: 6,035,048
[45] Date of Patent: Mar. 7, 2000

[54] METHOD AND APPARATUS FOR REDUCING NOISE IN SPEECH AND AUDIO SIGNALS

[75] Inventor: Eric John Diethorn, Morristown, N.J.

[73] Assignee: Lucent Technologies Inc., Murray Hill, N.J.

[21] Appl. No.: 08/877,909
[22] Filed: Jun. 18, 1997

[51] Int. Cl.7: H04B 15/00
[52] U.S. Cl.: 381/94.3; 704/226
[58] Field of Search: 381/94.1, 94.2, 94.3, 72, 94.5, 94.7, 98, 73.1, 71.1; 704/225, 226

[56] References Cited

U.S. PATENT DOCUMENTS

5,251,263  10/1993  Andrea et al.  381/71.6
5,550,924   8/1996  Helf et al.

OTHER PUBLICATIONS

R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Prentice-Hall, Englewood Cliffs, New Jersey, Jan. 1983, Chapter 7, “Multirate Techniques in Filter Banks and Spectrum Analyzers and Synthesizers,” pp. 289-400.
W. Etter and G. S. Moschytz, “Noise Reduction by Noise-Adaptive Spectral Magnitude Expansion,” J. Audio Eng. Soc. 42 (May 1994) 341-349.
J. B. Allen, “Short Term Spectral Analysis, Synthesis, and Modification by Discrete Fourier Transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-25, No. 3, Jun. 1977.

Primary Examiner: Vivian Chang
Attorney, Agent, or Firm: Martin I. Finston; Ozer M.N. Teitelbaum

[57] ABSTRACT
A method and apparatus are disclosed for enhancing, within a signal bandwidth, a corrupted audio-frequency signal. The signal which is to be enhanced is analyzed into plural sub-band signals, each occupying a frequency sub-band smaller than the signal bandwidth. A respective signal gain function is applied to each sub-band signal, and the respective sub-band signals are then synthesized into an enhanced signal of the signal bandwidth. The signal gain function is derived, in part, by measuring speech energy and noise energy, and from these determining a relative amount of speech energy, within the corresponding sub-band. In certain embodiments of the invention, the signal gain function is also derived, in part, by determining a relative amount of speech energy within a frequency range greater than, but centered on, the corresponding sub-band. In other embodiments of the invention, the sub-band noise energy is determined from a noise estimate that is updated at periodic intervals, but is not updated if the newest sample of the signal to be enhanced exceeds the current noise estimate by a multiplicative threshold (i.e., a threshold expressible in decibels). In still other embodiments of the invention, the value of the noise estimate is limited by an upper bound that is matched to the dynamic range of the signal to be enhanced.
`
`12 Claims, 4 Drawing Sheets
`
[Front-page drawing: noisy speech passes through sub-band analysis (1) into sub-bands k = 0, 1, ..., M-1 with samples c(k,m). All M sub-bands are independently processed by blocks (2) through (7): signal estimation (2), noise estimation (3), narrow-band deflection (4), broad-band deflection (5), lumped deflection (6), and gain computation (7). Sub-band synthesis (8) produces the noise-reduced speech y(n).]
`
`
`
[Drawing Sheet 1 of 4]
FIG. 1 (PRIOR ART): input x(i) is fed to analyzer 10; spectral modifier 20 applies gains g(0,m), g(1,m), ..., g(M-1,m) to the sub-band signals; synthesizer 30 produces output y(i).
FIG. 3: in each processing epoch m, the L newest input samples are shifted into shift register (N) 130 and the L oldest samples are discarded. The length-N vector passes through analysis window (N) 140 and FFT (N) 150, yielding one complex time-series sample c(k,m) for each of M = N/2 + 1 sub-bands: c(0,m), c(1,m), ..., c(M-1,m).
`
`
`
[Drawing Sheet 2 of 4]
FIG. 2: high-level schematic diagram of signal flow through the processing stages of the exemplary embodiment.
`
`
`
`
`
[Drawing Sheet 3 of 4]
FIG. 4: signal estimation. The recursion for s(k,m) uses A = ALPHA_ATTACK or A = ALPHA_DECAY, depending on a test of |c(k,m)| against the current estimate (blocks 4.1 through 4.4).
FIG. 5: noise estimation. n(k,m) = B*n(k,m-1) + (1-B)*|c(k,m)|, with B = BETA_ATTACK or B = BETA_DECAY (blocks 5.1 through 5.4). Update is inhibited on probable speech samples (block 5.5: don't update n(k,m)). The maximum attainable noise level is limited (block 5.6): n(k,m) = min[n(k,m), NOISE_PROFILE(k)].
FIG. 6: narrowband deflection d(k,m), computed from s(k,m) and n(k,m).
`
`
`
[Drawing Sheet 4 of 4]
FIG. 7: broadband deflection. D(k,m) = [s(k-K,m) + s(k-K+1,m) + ... + s(k+K,m)] / [(2K+1)*n(k,m)] for sub-band indices k = K, K+1, ..., M-K-1. For k in (0, ..., K-1) and k in (M-K, M-K+1, ..., M-1), D(k,m) is not computed.
FIG. 8A: lumped deflection for sub-band indices k = K, K+1, ..., M-K-1. PHI(k,m) = {max[d(k,m)/GAMMA_NB, D(k,m)/GAMMA_BB]}^p.
FIG. 8B: lumped deflection for k in (0, ..., K-1) and k in (M-K, M-K+1, ..., M-1). PHI(k,m) = [d(k,m)/GAMMA_NB]^p.
FIG. 9: gain computation. g(k,m) = min[1.0, PHI(k,m)].
FIG. 10: sub-band synthesis. One complex time-series sample g(k,m)*c(k,m) for each of M = N/2 + 1 sub-bands, namely g(0,m)*c(0,m), g(1,m)*c(1,m), ..., g(M-1,m)*c(M-1,m), is passed through IFFT (N) 160, synthesis window (N) 170, and the accumulate-and-shift register (N).
`
`
`
`METHOD AND APPARATUS FOR
`REDUCING NOISE IN SPEECH AND AUDIO
`SIGNALS
`
FIELD OF THE INVENTION

This invention relates to the use of digital filtering techniques to improve the audibility or intelligibility of speech or other audio-frequency signals that are corrupted with noise. More particularly, the invention relates to those techniques that seek to reduce stationary, or slowly varying, background noise.

ART BACKGROUND

It is a matter of daily experience for speech (or other audible information) received over a communication channel to be corrupted with background noise. Such noise may arise, e.g., from circuitry within the communication system, or from environmental conditions at the source of the audible signal. Environmental noise may come, for example, from fans, automobile engines, other vibrating machines, or nearby vehicular traffic. Although noise components that occupy narrow, discrete frequency bands are often advantageously removed by filtering, there are many cases in which this does not provide an adequate solution. Instead, the background noise often exhibits a frequency spectrum that overlaps substantially with the spectrum of the desired signal. In such a case, a narrow frequency-rejection filter may not reject enough of the noise, whereas a broad such filter may unacceptably distort the desired signal.
What is needed in such a case is a filter whose frequency characteristics strike an appropriate balance between rejecting frequency components characteristic of unwanted noise, and preserving the esthetic quality or intelligibility of the desired signal. Among the various audible signals of interest, it is fortuitous that speech, at least, is marked by frequent pauses of sufficient length to be captured and analyzed using digital sampling techniques. Consequently, it is possible to apply different filter characteristics depending whether, according to some criterion, the current signal is more probably speech or more probably noise. (Although the desired signal will often be referred to below as speech, it should be noted that this usage is purely for convenience. Those skilled in the art will readily appreciate that the techniques to be described here apply more generally to audible signals of various kinds.)
Recently, a number of investigators have described approaches to this problem using digital filter banks for sub-band filtering. The filter-bank methods used include, e.g., the DFT (Discrete Fourier Transform) filter-bank method and the polyphase filter-bank method. (As is well known in the art, these two methods are essentially the same, but differ in certain details of the computational implementation.) Sub-band filtering in general, and in particular the DFT and polyphase filter-bank methods, are described in detail in R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Prentice-Hall, Englewood Cliffs, N.J., 1983, hereinafter referred to as CROCHIERE, particularly at Chapter 7, “Multirate Techniques in Filter Banks and Spectrum Analyzers and Synthesizers,” pages 289-400. I hereby incorporate CROCHIERE by reference.
In a broad sense, these and similar approaches can be described in terms of the processing stages depicted in FIG. 1. A digitally sampled input signal is denoted in the figure by x(i). Here, x typically represents the amplitude of an audio-frequency signal, and i is the time variable, referred to in this digitized form as a time index.
`
The input data are fed into filter-bank analyzer 10. The output of this analyzer consists of a respective sub-band signal c(0,m), c(1,m), c(2,m), . . . , c(M-1,m) at each of M respective output ports of the analyzer, M a positive integer. (The time index is shown as changed from i to m because the effective sampling rate may differ between the respective processing stages.)
At short-time spectral modifier 20, each of the sub-band signals is subjected to gain modification according to a respective signal gain function g(k,m), k=0,1,2, . . . , M-1, which may differ between respective sub-bands. (In this context, “short-time” refers to a time scale typical of that over which speech utterances evolve. Such a time scale is generally on the order of 20 ms in applications for processing human speech.)
The sub-band signals are recombined at filter-bank synthesizer 30 into modified full-band signal y(i).
One application of methods of this kind to the problem of noise reduction is described in W. Etter and G. S. Moschytz, “Noise Reduction by Noise-Adaptive Spectral Magnitude Expansion,” J. Audio Eng. Soc. 42 (May 1994) 341-349. This article discusses a signal gain function (for each respective sub-band) that varies inversely according to a power of the fractional contribution made by an estimated noise level to the total signal (i.e., speech plus noise). At relatively high signal-to-noise ratios, this signal gain function assumes a maximum value of unity. The exponent in the power-function relationship is referred to as an expansion factor. An expansion factor controls the rate at which the gain decays as the signal-to-noise ratio decreases.
Although the article by Etter et al. provides useful insights of a general nature, it does not teach how to estimate the noise level or how to discriminate between incidents of speech and background noise that is free of speech. Thus it does not suggest any practical implementation of the ideas discussed there.
Another application of methods of this kind is described in U.S. Pat. No. 5,550,924, “Reduction of Background Noise for Speech Enhancement,” issued Aug. 27, 1996 to B. M. Helf and P. L. Chu. This patent describes two methods for estimating the noise level. Both methods involve detecting sequences of input data that satisfy some criterion that signifies the likely presence of background noise without speech. In one method, the processor observes the frequency spectrum of the input data and detects data sequences for which this spectrum is stationary for a relatively long time interval. In the other method, the input stream is divided into ten-second intervals, and within these intervals, the processor observes the energy content of multiple sub-intervals. Within each interval, the processor takes as representative of speech-free background noise that sub-interval having the least energy.
The method of Helf et al. further involves making a binary decision whether speech is present, based on the ratio of input signal to noise estimate. A confidence level is assigned to each of these decisions. These confidence levels determine, in part, the corresponding values of the signal gain function.
Although useful, the method of Helf et al. involves relatively complex procedures for estimating the noise level, establishing the presence of speech, and establishing values for the signal gain function. Complexity is disadvantageous because it increases demands on computational resources, and often leads to greater product costs.
Moreover, it is significant that human speech includes intervals of narrowband, multicomponent energy, referred to as “voiced speech,” and intervals of broadband energy, referred to as “unvoiced speech.” Methods of sub-band processing, such as those described here, tend to be most effective in detecting voiced speech, because speech detection can take place within the specific frequency sub-bands where speech energy is concentrated. However, such methods are generally less sensitive to incidents of unvoiced speech, because the speech energy is distributed over relatively many frequency bands.
Thus, what has been lacking until now is a sub-band method for enhancing speech (or other audible signals) that is computationally relatively simple, and is at least as effective for detecting unvoiced speech (or other incidents of broadband energy) as it is for detecting voiced speech (or other incidents of narrowband, multicomponent energy).

SUMMARY OF THE INVENTION

I have invented an improved sub-band method for enhancing speech or other audible signals in the presence of background noise. My method is computationally relatively simple, and thus can achieve economy in the use of, and demand for, computational resources. In contrast to methods of the prior art, my method includes separate speech detection stages, one directed primarily to voiced speech or the like, and the other directed primarily to unvoiced speech or the like.
In a broad aspect, my invention involves a method for enhancing, within a signal bandwidth, a corrupted audio-frequency signal having a signal component and a noise component. In accordance with this method, the corrupted signal is analyzed into plural sub-band signals, each occupying a frequency sub-band smaller than the signal bandwidth. A respective signal gain function is applied to the sub-band signal corresponding to each sub-band, thereby to yield respective gain-modified signals. The gain-modified signals are synthesized into an enhanced signal of the signal bandwidth.
Within each frequency sub-band, the step of applying the signal gain function to the sub-band signal includes: evaluating a function that is preferentially sensitive to energy in the signal component; and applying, to the sub-band signal, gain values that are related to the preferentially sensitive function.
In contrast to methods of the prior art, the preferentially sensitive function is evaluated by, inter alia, measuring a relative amount of speech energy within the corresponding sub-band, and also measuring a relative amount of speech energy within a frequency range greater than, but centered on, the corresponding sub-band.
I believe that through the use of my invention, noise in the speech channels of various kinds of telecommunication equipment can be efficiently reduced, and improved subjective audio quality can thereby be efficiently achieved. Such equipment includes telephones such as cellular and cordless telephones, and audio and video teleconferencing systems. Further, my invention can be used to improve the quality of digitally encoded speech by reducing background noise that would otherwise perturb the speech coder. Still further, I believe that my invention can be usefully employed within the switching system of a telephone network to condition speech signals that have been degraded by noisy line conditions, or by background noise that is input at the location of one or more of the parties to a telephone call.
`
`
BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a schematic drawing that represents, in generic fashion, sub-band methods of speech enhancement, including those of the prior art.
FIG. 2 is a high-level, schematic diagram showing signal flow through various processing stages of the invention in an exemplary embodiment.
FIG. 3 is a more detailed, schematic representation of the sub-band analysis stage of FIG. 2.
FIG. 4 is a more detailed, schematic representation of the signal-estimation stage of FIG. 2.
FIG. 5 is a more detailed, schematic representation of the noise-estimation stage of FIG. 2.
FIG. 6 is a more detailed, schematic representation of the narrowband deflection stage of FIG. 2.
FIG. 7 is a more detailed, schematic representation of the broadband deflection stage of FIG. 2.
FIGS. 8A and 8B provide a more detailed, schematic representation of the lumped deflection stage of FIG. 2.
FIG. 9 is a more detailed, schematic representation of the gain computation stage of FIG. 2.
FIG. 10 is a more detailed, schematic representation of the sub-band synthesis stage of FIG. 2.
`
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

In the following discussion, the signal that is to be enhanced is referred to for convenience as “noisy speech,” although not only speech, but also other audible signals are advantageously enhanced according to the present invention.
As shown in FIG. 2, the noisy speech is analyzed at block 40 into M sub-band time series c(k,m), k=0,1, . . . , M-1. At block 50, a signal estimate s(k,m) is calculated for each sub-band. As will be seen, this signal estimate is a short-term average of the sub-band time series. When speech is present, s(k,m) estimates the signal level corresponding to the speech.
At block 60, a noise estimate n(k,m) is calculated for each sub-band. As will be seen, this noise estimate is a long-term average of the sub-band time series. It estimates the stationary component of the corrupted input signal, which is assumed to correspond to background noise.
At block 70, a narrowband deflection d(k,m) is calculated for each sub-band. This is one of two deflections to be calculated. Each of these deflections is a time series derived from the signal and noise estimates. The narrowband deflection is derived from the sub-band signal and noise estimates, so as to be particularly sensitive to, e.g., the energy in voiced speech.
At block 80, a broadband deflection D(k,m) is calculated for each sub-band. This second deflection is derived from the sub-band noise estimate and from an average over plural sub-bands of the respective sub-band signal estimates, so as to be particularly sensitive to, e.g., the energy in unvoiced speech.
At block 90, a lumped deflection PHI(k,m) is calculated from the narrowband and broadband deflections. Roughly speaking, the lumped deflection indicates the presence of speech when speech is indicated by either the narrowband or broadband deflection. In addition, an expansion factor p is used to tailor the sensitivity of PHI to the respective deflections.
At block 100, a respective sub-band gain g(k,m) is applied to each of the sub-band time series c(k,m). Typically, this sub-band gain has an upper bound of unity. This upper bound is attained when speech is likely to be present. At other times, the gain assumes values less than one. The expansion factor p affects the rate at which this gain decays as the incidence of speech becomes less likely. Significantly, this gain is calculated as a time series, as shown in the notation used herein by the functional dependence on the time index m.
At block 110, each sub-band time series c(k,m) is modified by its corresponding sub-band gain g(k,m).
At block 120, the modified sub-band time series are synthesized to form modified, full-band output signal y(n), also referred to herein as “noise-reduced speech.”
Each of the processing stages discussed above is described in greater detail below, with reference to the pertinent figure. Each of these processing stages is conveniently carried out by a general-purpose digital computer, such as a desktop personal computer, under the control of an appropriate stored program or programs. Equivalently, some or all of these stages can be carried out using special-purpose electronic signal-processing circuits.
Our currently preferred sub-band analysis technique is based on a perfect reconstruction filter bank using the discrete Fourier transform (DFT) filter bank method. This method is well-known in the art, and described in detail in, e.g., CROCHIERE. Accordingly, this method need not be described in detail here. However, referring back to FIG. 1, it should be noted that perfect reconstruction filter banks have the property that when spectral modifier 20 applies the identity function (i.e., unity gain across all sub-bands), the output of synthesizer 30 is identical to the input to analyzer 10 (within the accuracy of the digital computation).
As shown in FIG. 3, the operations of the sub-band analysis stage can be described in terms of accumulator 130, analysis window 140, and Fast Fourier Transform (FFT) 150. Time-series samples are processed in blocks of L samples, where L is an integer. The term “epoch” is used to denote the action of processing one such block. Thus, at the beginning of each processing epoch, a data block consisting of L new time-series samples is shifted into accumulator 130, which is exemplarily a shift register. The total length of this accumulator is N samples, wherein N is the size of the Fourier transform, and N>L. Those skilled in the art of digital filtering will appreciate that the number M of unique complex sub-bands is related to the size of the Fourier transform according to the formula:

    M = N/2 + 1

By way of illustration, our current implementation, sampling at a rate of 8 kHz, has 33 unique sub-bands spanning the frequency range 0-4000 Hz.
When L new samples are shifted into the accumulator, the L oldest samples are shifted out. In our current implementation, the value of L is 16 and the value of N is 64. These values are illustrative, and not essential to the practice of the invention.
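The block bookkeeping just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the accumulator is modeled as a plain list, the helper names `num_subbands` and `shift_in` are hypothetical, and the relation M = N/2 + 1 is assumed because it matches the 33 sub-bands quoted for N = 64.

```python
N = 64  # FFT size (illustrative value from the text)
L = 16  # new samples per processing epoch (illustrative value from the text)


def num_subbands(n_fft):
    """Number of unique complex sub-bands for a real input: M = N/2 + 1."""
    return n_fft // 2 + 1


def shift_in(accumulator, block):
    """Return a new accumulator: the oldest len(block) samples are dropped
    and the new block is appended at the newest end."""
    if len(block) > len(accumulator):
        raise ValueError("block must not be longer than the accumulator")
    return accumulator[len(block):] + list(block)


# One epoch: start from an all-zero accumulator and shift in L new samples.
accumulator = [0.0] * N
accumulator = shift_in(accumulator, [float(i) for i in range(1, L + 1)])
```

With N = 64 and L = 16, each epoch replaces one quarter of the accumulator, so consecutive analysis frames overlap by 48 samples.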
The N-vector of accumulated samples is multiplied by analysis window 140, which is a window of length N. Analysis windows are well-known in the digital filtering arts, and discussed at length in, e.g., CROCHIERE. Thus, they need not be described here in detail. Briefly, an analysis window is a function that embodies the frequency-selective properties of a digital filter, and conditions the sampled data to avoid a by-product of digital processing known as frequency aliasing. Frequency aliasing is undesirable because it can lead to distracting audible artifacts in the reconstructed, processed signal.
The N-vector of windowed data is then subjected to N-point FFT 150. As noted, this transform is effectuated, in our current implementation, using the DFT algorithm. Each frequency bin output from the DFT represents one new complex time-series sample for the sub-band frequency range corresponding to that bin. The bandwidth of each bin, or sub-band time series, is given by the ratio of sampling frequency to transform length.
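A rough sketch of one analysis epoch follows. It rests on several assumptions that are not in the text: a periodic Hann window stands in for the unspecified analysis window, a direct O(N^2) DFT is used instead of an FFT for readability, and the function names are hypothetical.

```python
import cmath
import math


def hann_window(n):
    """Periodic Hann window; an assumed choice, the text does not name one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def analyze_epoch(samples, window):
    """Window N real samples and return the N/2 + 1 unique DFT bins.

    A direct DFT is used for clarity; an FFT yields the same values.
    """
    n = len(samples)
    windowed = [x * w for x, w in zip(samples, window)]
    bins = []
    for k in range(n // 2 + 1):
        acc = 0j
        for i, x in enumerate(windowed):
            acc += x * cmath.exp(-2j * math.pi * k * i / n)
        bins.append(acc)
    return bins


def bin_bandwidth_hz(sample_rate_hz, n_fft):
    """Bandwidth of each bin: sampling frequency over transform length."""
    return sample_rate_hz / n_fft
```

At 8 kHz with N = 64, `bin_bandwidth_hz(8000, 64)` evaluates to 125 Hz per bin, so the 33 bins are centered at 0, 125, ..., 4000 Hz.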
As shown graphically in FIG. 4, the signal estimate s(k,m) in each sub-band is computed (block 4.1) using the following non-linear single-pole recursion:

    s(k,m) = A*s(k,m-1) + (1-A)*|c(k,m)|

The value of the coefficient A is determined by a test (block 4.2) of whether the magnitude of the new data sample c(k,m) is greater, or not greater, than the current value of the signal estimate. Depending on the outcome of this test, A assumes (blocks 4.3, 4.4) one of two alternative values, namely an “attack” value A_ATTACK and a “decay” value A_DECAY, respectively. In our current implementation, a useful range for A_ATTACK is 1-10 ms, and a useful range for A_DECAY is 20-50 ms. These specific values are illustrative and not essential to the practice of the invention.
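The attack/decay behavior can be illustrated with a short sketch. Two things here are assumptions rather than statements from the text: the recursion is taken to have the form s(k,m) = A*s(k,m-1) + (1-A)*|c(k,m)|, paralleling the noise recursion of FIG. 5, and the quoted millisecond ranges are converted to coefficients with the common mapping A = exp(-T/tau), where T is the epoch period. The names and the particular time constants are hypothetical.

```python
import math

EPOCH_S = 16 / 8000.0  # one epoch = L / sample rate = 2 ms (values from the text)


def coeff_from_time_constant(tau_s, epoch_s=EPOCH_S):
    """Assumed mapping from a time constant to a single-pole coefficient."""
    return math.exp(-epoch_s / tau_s)


A_ATTACK = coeff_from_time_constant(0.005)  # 5 ms, inside the 1-10 ms range
A_DECAY = coeff_from_time_constant(0.030)   # 30 ms, inside the 20-50 ms range


def update_signal_estimate(s_prev, c):
    """One step of the assumed recursion for a single sub-band.

    The attack coefficient is used when |c| exceeds the current estimate,
    the decay coefficient otherwise.
    """
    mag = abs(c)
    a = A_ATTACK if mag > s_prev else A_DECAY
    return a * s_prev + (1.0 - a) * mag
```

Because A_ATTACK is smaller than A_DECAY, the estimate rises quickly toward a louder sample and falls back more slowly, which is the usual envelope-follower behavior.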
As shown graphically in FIG. 5, the noise estimate n(k,m) in each sub-band is computed (block 5.1) using the following non-linear single-pole recursion:

    n(k,m) = B*n(k,m-1) + (1-B)*|c(k,m)|

The value of the coefficient B is determined by a test (block 5.2) of whether the magnitude of the new data sample c(k,m) is greater, or not greater, than the current value of the noise estimate. Depending on the outcome of this test, B assumes (blocks 5.3, 5.4) one of two alternative values, namely an “attack” value B_ATTACK and a “decay” value B_DECAY, respectively. In our current implementation, a useful range for B_ATTACK is 1-10 seconds, and a useful range for B_DECAY is 1-50 ms. These values are illustrative and not essential to the practice of the invention.
As also shown in FIG. 5, the updating of the noise estimate is advantageously conditioned on a test (block 5.5) of whether the magnitude of the new data sample c(k,m) is less than the current value of the noise estimate, times a multiplier T. By way of illustration, our current implementation has T=20. This prevents an update of the noise estimate if the new data sample exceeds the current value of the noise estimate by 26 dB. This condition prevents the noise estimate from being unduly biased (upward) by samples whose magnitudes are high enough that they assuredly represent speech or other non-stationary signal energy. I have found that this condition significantly improves the stability of the noise estimate for extended speech utterances.
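The gated update can be sketched as below, assuming the noise recursion n(k,m) = B*n(k,m-1) + (1-B)*|c(k,m)| shown in FIG. 5. The function names are hypothetical, and the coefficient values passed in by a caller are placeholders, since the text quotes time ranges rather than coefficients.

```python
import math

T = 20.0  # inhibit multiplier (illustrative value from the text)


def threshold_db(multiplier):
    """Express a multiplicative amplitude threshold in decibels."""
    return 20.0 * math.log10(multiplier)


def update_noise_estimate(n_prev, c, b_attack, b_decay, t=T):
    """One gated step of the noise recursion for a single sub-band.

    The estimate is left unchanged when |c| is not below t * n_prev,
    i.e. when the sample probably carries speech.
    """
    mag = abs(c)
    if mag >= t * n_prev:
        return n_prev  # inhibit the update on probable speech
    b = b_attack if mag > n_prev else b_decay
    return b * n_prev + (1.0 - b) * mag
```

`threshold_db(20.0)` is about 26.02, which agrees with the 26 dB figure above. Note that in this sketch the estimate has to start from a positive value, otherwise the gate never opens; how the original implementation initializes the estimate is not stated.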
As also shown in FIG. 5, it is advantageous, in at least some cases, to impose (block 5.6) an upper bound, denoted NOISE_PROFILE(k), on the noise estimate in each sub-band. NOISE_PROFILE(k) is advantageously matched to the dynamic range of the corrupted signal to be enhanced. The practical effect of this upper bound is to automatically inhibit the enhancement process in abnormally noisy environments. Such inhibition is useful for preventing speech processing artifacts that often arise in such environments and that are perceived as unacceptable distortion.
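The upper bound amounts to a per-sub-band clamp, n(k,m) = min[n(k,m), NOISE_PROFILE(k)], as indicated in FIG. 5. A minimal sketch follows; the function name and all numeric values are hypothetical and chosen only to show one sub-band being limited.

```python
def limit_noise_estimate(noise, noise_profile):
    """Clamp each sub-band noise estimate to its NOISE_PROFILE(k) bound."""
    return [min(n, cap) for n, cap in zip(noise, noise_profile)]


# Hypothetical numbers for three sub-bands at one epoch.
noise = [0.02, 0.50, 0.08]
noise_profile = [0.10, 0.10, 0.10]
limited = limit_noise_estimate(noise, noise_profile)
```

Only the middle sub-band is affected in this example. Because the later stages divide by the noise estimate, capping it should keep the deflection, and hence the gain, from dropping further in very noisy conditions, which is one way to read the "inhibit" effect described above.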
It should be noted that whereas other forms can be used for the signal and noise estimates, the non-linear single-pole recursion relations discussed above for the signal and noise estimates are advantageous because they are computationally simple. Moreover, they have the desirable property of adapting to changes in the character and absolute level of the noise and signal processes. Indeed, practitioners have recognized this and have widely used these relations in various voice-processing applications.
As shown in FIG. 6, the narrowband deflection is obtained as the ratio of the sub-band signal estimate to the sub-band noise estimate. That is,

    d(k,m) = s(k,m) / n(k,m)

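As a minimal sketch, the narrowband deflection is an element-wise ratio over the sub-bands. The function name is hypothetical, and the small `floor` guarding against division by zero is an added safeguard that the text does not mention.

```python
def narrowband_deflection(s, n, floor=1e-12):
    """d(k,m) = s(k,m) / n(k,m) for every sub-band k at one epoch m.

    `floor` avoids division by zero; it is not part of the described method.
    """
    return [sk / max(nk, floor) for sk, nk in zip(s, n)]
```

A deflection near 1 means the short-term level matches the long-term noise level, while a large value suggests speech energy in that sub-band.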
I have found that for detection of broadband energy, it is advantageous to combine, in a certain sense, the results of two or more narrowband deflection ratios. That is, a lumped broadband deflection coefficient is advantageously computed by taking an arithmetic average of 2K+1 narrowband deflection coefficients (K a positive integer) in a range of sub-bands centered about a given sub-band, each of these coefficients taken relative to the noise estimate in the given sub-band. Thus, as shown in FIG. 7, the broadband deflection coefficient D(k,m) is given by:

    D(k,m) = [s(k-K,m) + s(k-K+1,m) + . . . + s(k+K,m)] / [(2K+1)*n(k,m)]

According to the first of these formulas, the narrowband and broadband deflection coefficients are each normalized to a respective threshold GAMMA_NB or GAMMA_BB. These thresholds represent the respective levels at which the deflection ratios are declared to indicate a certainty of speech energy. In a current implementation, both of these thresholds are set to 30.0.
The greater of the two normalized deflection coefficients determines the value of PHI(k,m). An expansion factor p controls the rate at which the lumped deflection ratio decays for deflection ratios less than unity. According to a current implementation, p is equal to unity, providing linear decay with the envelope of the sub-band signal energy. The first formula is expressed by:

    PHI(k,m) = {max[d(k,m)/GAMMA_NB, D(k,m)/GAMMA_BB]}^p

According to the second formula, the lumped deflection coefficient is determined by the narrowband deflection coefficient and the expansion factor. The second formula is expressed by:

    PHI(k,m) = [d(k,m)/GAMMA_NB]^p

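The two formulas can be combined in one small routine. This sketch assumes, following FIGS. 8A and 8B, that the second formula applies to the edge sub-bands where D(k,m) is not computed; that condition is represented here by `None`, which is a convention of this example only. The function name is hypothetical, and the defaults are the illustrative values from the text.

```python
def lumped_deflection(d, D, gamma_nb=30.0, gamma_bb=30.0, p=1.0):
    """PHI(k,m) for every sub-band at one epoch.

    d -- narrowband deflections d(k,m)
    D -- broadband deflections D(k,m), with None where D is not computed
    """
    phi = []
    for dk, Dk in zip(d, D):
        if Dk is None:
            # Second formula: narrowband deflection only.
            phi.append((dk / gamma_nb) ** p)
        else:
            # First formula: the greater of the two normalized deflections.
            phi.append(max(dk / gamma_nb, Dk / gamma_bb) ** p)
    return phi
```

With p = 1 the result is simply the larger normalized deflection; a larger p would make PHI fall off faster for normalized deflections below unity.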
As shown in FIG. 9, the signal gain function g(k,m) is determined by PHI(k,m), but has an upper bound of unity. That is,

    g(k,m) = min[1.0, PHI(k,m)]

Thus, each sub-band time series having a deflection of unity or less is passed to the synthesis filter bank with gain given by PHI(k,m), but each such series having a greater deflection is passed to the synthesis bank with unity gain.
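The gain rule and its application to the sub-band samples can be sketched as follows; the function names are hypothetical.

```python
def subband_gain(phi):
    """g(k,m) = min[1.0, PHI(k,m)] for every sub-band."""
    return [min(1.0, v) for v in phi]


def apply_gain(c, g):
    """Scale each complex sub-band sample c(k,m) by its real gain g(k,m)."""
    return [gk * ck for gk, ck in zip(g, c)]
```

Since the gain is real and never exceeds one, this step can only attenuate a sub-band sample and leaves its phase untouched.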
As shown in FIG. 10, the input to the sub-band synthesis stage (in each processing epoch of index m) includes one complex time-series sample g(k,m)*c(k,m) for each of the M sub-bands. These M samples are processed by inverse FFT 160 to produce an output vector of length N, as is well known in the art. This output vector is processed by synthesis window (of length N) 170, which is the counterpart, on the synthesis side, of analysis window 140. The output of synthesis window 170 is a further vector of length N. This vector is input to accumulator 180, which is the counterpart on the synthesis end of accumulator 130.
Input to accumulator 180 takes place in frames of length N. Output from accumulator 180 takes place in blocks of length L. Data are transferred to the accumulator in an overlap-and-add operation. In such an operation, the new (processed) samples are added to the previous values stored in corresponding cells of the accumulator. When L samples are shifted out of the output end of the accumulator, a sequence of L zeroes is inserted at the input end. The output of accumulator 180 corresponds to the noise-reduced speech, y(n).
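The synthesis steps above can be sketched in plain Python. The assumptions are: a direct inverse DFT replaces the inverse FFT for readability, the M = N/2 + 1 bins are extended by conjugate symmetry because the output is real, and the synthesis window is whatever the caller supplies, since the text does not specify one. All function names are hypothetical.

```python
import cmath
import math


def inverse_real_dft(bins, n):
    """Rebuild a real length-n vector from its n/2 + 1 unique DFT bins."""
    # Fill in bins n/2+1 .. n-1 by conjugate symmetry.
    full = list(bins) + [bins[n - k].conjugate() for k in range(n // 2 + 1, n)]
    out = []
    for i in range(n):
        acc = sum(full[k] * cmath.exp(2j * math.pi * k * i / n) for k in range(n))
        out.append(acc.real / n)
    return out


def overlap_add(accumulator, frame, hop):
    """Add a length-N frame into the accumulator, then shift out `hop`
    finished samples and shift in `hop` zeroes.

    Returns (output_block, new_accumulator).
    """
    summed = [a + f for a, f in zip(accumulator, frame)]
    return summed[:hop], summed[hop:] + [0.0] * hop


def synthesize_epoch(gained_bins, window, accumulator, hop):
    """Inverse transform, apply the synthesis window, and overlap-add."""
    n = len(window)
    frame = [v * w for v, w in zip(inverse_real_dft(gained_bins, n), window)]
    return overlap_add(accumulator, frame, hop)
```

With N = 64 and L = 16, every output sample is the sum of contributions from four overlapping frames, so the analysis and synthesis windows must be chosen together for the unity-gain case to reproduce the input.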
It will be appreciated that the inventive method involves a modest number of adjustable parameters. Although at least some of these will typically be set in the factory, others can optionally be set in the field, either manually by the user or automatically. Exemplary field-settable parameters may include, among others, the bandwidth 2K+1 for broadband speech detection, the expansion coefficient p, and the respective speech thresholds GAMMA_NB and GAMMA_BB.
In one illustrative scenario, a user of a telephone desires to improve the intelligibility of far-end speech; that is, of speech that is received from a remote location. Manual controls are readily provided so that such a user can select those values of the field-settable parameters that afford the greatest speech intelligibility as perceived by that user.
`
`
It should be noted in this regard that D(k,m) cannot be evaluated for values of k less than K. It should further be noted that M-1 is the maximum sub-band index. Thus, D(k,m) cannot be evaluated for values of k greater than M-K-1.
In a current implementation, the value of K is 2. Other values of K (including the unity value as well as values greater than 2) are readily chosen to provide optimal performance in specific applications.
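The broadband deflection and its index limits can be sketched as follows, assuming the form indicated in FIG. 7: the average of the 2K+1 neighboring signal estimates divided by the noise estimate of the central sub-band. The function name, the use of `None` for sub-bands where D(k,m) is not evaluated, and the `floor` guard are conventions of this example only.

```python
def broadband_deflection(s, n, K=2, floor=1e-12):
    """Return D(k,m) for each sub-band, or None where it is not evaluated.

    D(k,m) = [s(k-K,m) + ... + s(k+K,m)] / [(2K+1) * n(k,m)]
    for K <= k <= M-K-1, where M is the number of sub-bands.
    """
    M = len(s)
    out = [None] * M
    for k in range(K, M - K):
        band_sum = sum(s[k - K:k + K + 1])
        out[k] = band_sum / ((2 * K + 1) * max(n[k], floor))
    return out
```

For M = 33 and K = 2, this evaluates D(k,m) for k = 2 through 30 and leaves the two lowest and two highest sub-bands undefined.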
I have found that the expression given above for D(k,m), in which the central sub-band noise estimate appears directly in the denominator, is generally preferable to an arithmetic average of 2K+1 distinct narrowband deflection coefficients. This is because, for some classes of broadband voice utterances, the frequency band edges of the utterance that are poorly represented by the narrowband deflection coefficient are better represented by a broadband deflection coefficient that incorporates only the signal estimate from bands neighboring those edges.
Other techniques can also be used to obtain a broadband deflection coefficient. For example, an alternate embodiment is readily implemented that includes a second sub-band filter architecture having broader sub-bands than that described above. (Such sub-bands may be referred to, e.g., as “auxiliary” sub-bands.) Broadband deflection coefficients are obtained by, e.g., a procedure analogous to the computation of d(k,m), but using this second filter architecture. Th