Helf et al.

US005550924A
[11] Patent Number: 5,550,924
[45] Date of Patent: Aug. 27, 1996

[54] REDUCTION OF BACKGROUND NOISE FOR SPEECH ENHANCEMENT

[75] Inventors: Brant M. Helf, Melrose; Peter L. Chu, Lexington, both of Mass.

[73] Assignee: PictureTel Corporation, Danvers, Mass.

[21] Appl. No.: 402,550

[22] Filed: Mar. 13, 1995

Related U.S. Application Data

[63] Continuation of Ser. No. 86,707, Jul. 7, 1993, abandoned.

[51] Int. Cl.6: H04B 15/00
[52] U.S. Cl.: 381/94; 381/46; 381/47; 395/2.34; 395/2.35
[58] Field of Search: 381/94, 46, 47; 395/2.34, 2.35

[56] References Cited

U.S. PATENT DOCUMENTS

4,185,168   1/1980   Graupe et al.
4,628,529   12/1986  Borth et al.
4,630,304   12/1986  Borth et al.
4,630,305   12/1986  Borth et al.
4,653,102   3/1987   Hansen
4,658,426   4/1987   Chabries et al.
4,696,039   9/1987   Doddington         381/46
4,852,175   7/1989   Borth et al.
4,868,880   9/1989   Bennett, Jr.
4,912,767   3/1990   Chang
5,012,519   4/1991   Adlersberg et al.
5,133,013   7/1992   Munday

FOREIGN PATENT DOCUMENTS

3132221     5/1991   Japan              381/94

Primary Examiner-Forester W. Isen
Attorney, Agent, or Firm-Fish & Richardson P.C.
`[57]
`
`ABSTRACT
`
Properties of human audio perception are used to perform
spectral and time masking to reduce perceived loudness of
noise added to the speech signal. A signal is divided temporally
into blocks which are then passed through notch
filters to remove narrow frequency band components of the
noise. Each block is then appended to part of the previous
block in a manner which avoids block boundary discontinuities.
An FFT is then performed on the resulting larger
block, after which the spectral components of the signal are
fed to a background noise estimator. Each frequency component
of the signal is analyzed with respect to the background
noise to determine, within various confidence levels,
whether it is pure noise or a noise-and-signal combination.
The frequency band's gain function is determined, based on
the confidence levels. A spectral valley finder detects and
fills in spectral valleys in the frequency component gain
function, after which the function is used to modify the
magnitude components of the FFT. An inverse FFT then
maps the signal back from the frequency domain to the time
domain to give a frame of noise-reduced signal. This signal
is then multiplied by a temporal window and joined to the
previous frame's signal to derive the output.
`
`21 Claims, 3 Drawing Sheets
`
`
`Petitioner Apple Inc.
`Ex. 1010, p. 1
`
`
`
[FIG. 1 (Sheet 1 of 3): block diagram of the noise suppression system.
Labeled blocks: WINDOW 6; FFT 8; ONE FRAME DELAY 10; ATTENUATE 12;
IFFT 14; WINDOW 16; OVERLAP & ADD 18; BACKGROUND NOISE ESTIMATOR 20,
containing RUNNING MIN. ESTIMATOR, STATIONARITY ESTIMATOR and
COMPARE ESTIMATES 28; NOISE SUPPRESSION SPECTRAL MODIFIER 30,
containing GLOBAL SPEECH VS NOISE DETECTOR, LOCAL SPEECH VS NOISE
DETECTOR 34, TEMPORAL AND SPECTRAL SPREADING BY CRITICAL BANDS 36
and SPECTRAL VALLEY FILLER 38.]
`
`
`
[FIG. 3 (Sheet 2 of 3): flowchart of the stationary estimator.
240: compute spectral shape N_i(f_c).
242: find S_i(f_c).
244, 246: is N_i(f_c) > t_l S_i(f_c) or S_i(f_c) > t_l N_i(f_c), for i = 0, 1, ..., 7?
248, 250: is N_i(f_c) > t_h S_i(f_c) or S_i(f_c) > t_h N_i(f_c), for i = 0, 1, ..., 7?
252: 50 consecutive noise frames?
254: develop background noise estimate B_k.]
`
`
`
[FIG. 2 (Sheet 3 of 3): NOTCH FILTER 4, with transfer function
H(z) = (1 - 2 cos(theta) z^-1 + z^-2) / (1 - 2 r cos(theta) z^-1 + r^2 z^-2),
followed by WINDOW 6, with
w(i) = f(i) [0.5 - 0.5 cos(pi i / 191)] for i = 0, 1, ..., 191,
w(i) = f(i) for i = 192, 193, ..., 319,
w(i) = f(i) [0.5 - 0.5 cos(pi (511 - i) / 191)] for the remaining samples up to i = 511.]

[FIG. 4 (Sheet 3 of 3): notch placement. 264: FIND FREQUENCY OF
LARGEST MAGNITUDE EVERY 100 Hz.]
`
`
`
`REDUCTION OF BACKGROUND NOISE
`FOR SPEECH ENHANCEMENT
`
`This is a continuation of application Ser. No. 08/086,707,
`filed Jul. 7, 1993, now abandoned.
`
`BACKGROUND OF THE INVENTION
This invention relates to communicating voice information
over a channel such as a telephone communication
channel.
`Microphones used in voice transmission systems typically
`pick up ambient or background sounds, called noise, along
`with the voices they are intended to pick up. In voice
`transmission systems in which the microphone is at some
`distance from the speaker(s), for example, systems used in
video and audio telephone conference environments, background
noises are a cause of poor audio quality since the
`noise is added onto the speech picked up by a microphone.
`By their nature and intended use, these systems must pick up
`sounds from all locations surrounding their microphones,
`and these sounds will include background noise.
`Fan noise, originating from HVAC systems, computers,
`and other electronic equipment, is the predominant source of
`noise in most teleconferencing environments.
`A good noise suppression technique will reduce the
`perception of the background noise while simultaneously
`not affecting the quality or intelligibility of the speech. In
`general it is an object of this invention to suppress any
`constant noise, narrowband or wideband, that is added onto
`the speech picked up by a single microphone. It is a further
`object of this invention to reduce fan noise that is added onto
`the speech picked up by a single microphone.
`
SUMMARY OF THE INVENTION

In one aspect, generally, the invention relates to a device
for reducing the background noise of an input audio signal.
The device features a framer for dividing the input audio
signal into a plurality of frames of signals, and a notch filter
bank for removing components of noise from each of the
frames of signals to produce filtered frames of signals. A
multiplier multiplies a combined frame of signals to produce
a windowed frame of signals, wherein the combined frame
of signals includes all signals in one filtered frame of signals
combined with some signals in the filtered frame of signals
immediately preceding in time the one filtered frame of
signals. A transformer obtains frequency spectrum components
from the windowed frame of signals, and a background
noise estimator uses the frequency spectrum components
to produce a noise estimate of an amount of noise
in the frequency spectrum components. A noise suppression
spectral modifier produces gain multiplicative factors based
on the noise estimate and the frequency spectrum components.
A delayer delays the frequency spectrum components
to produce delayed frequency spectrum components. A
controlled attenuator attenuates the frequency spectrum
components based on the gain multiplicative factors to
produce noise-reduced frequency components, and an
inverse transformer converts the noise-reduced frequency
components to the time domain.
In preferred embodiments, the noise suppression spectral
modifier includes a global decision mechanism, a local
decision mechanism, a detector, a spreading mechanism, and
a spectral valley filler.
The global decision mechanism makes, for each frequency
component of the frequency spectrum components,
a determination as to whether that frequency component is
primarily noise. The local noise decision mechanism
derives, for each frequency component of the frequency
spectrum components, a confidence level that the frequency
component is primarily a noise component. The detector
determines, based on the confidence levels, a gain multiplicative
factor for each frequency component. The spreading
mechanism spectrally and temporally spreads the effect of
the determined gain multiplicative factors, and the spectral
valley filler detects and fills in spectral valleys in the
resulting frequency components.
In other aspects of the preferred embodiment, the background
noise estimator also produces a noise estimate for
each frequency spectrum component, and the local noise
decision mechanism derives confidence levels based on:
ratios between each frequency component and its corresponding
noise estimate, and the determinations made by the
global decision mechanism.
In another aspect, the invention further features a post-window
and an overlap-and-adder mechanism. The post-window
produces smoothed time-domain components for
minimizing discontinuities in the noise-reduced time-domain
components; and the overlap-and-adder outputs a first
portion of the smoothed time-domain components in combination
with a previously stored portion of smoothed time-domain
components, and stores a remaining portion of the
smoothed frequency components, where the remaining portion
comprises the smoothed frequency components not
included in the first portion.
In preferred embodiments of the device, the background
noise estimator includes at least two estimators, each producing
a background noise estimate, and a comparator for
comparing and selecting one of the background noise estimates.
One of the estimators is a running minimum estimator,
and the other estimator is a stationary estimator.
In preferred embodiments, the device also includes a
notch filter mechanism for determining the locations of the
notches for the notch filter bank.
`
`BRIEF DESCRIPTION OF THE DRAWING
`
`FIG. 1 is a block diagram of a noise suppression system
`according to the invention; and
`FIGS. 2-4 are detailed block diagrams implementing
`parts of the block diagram of FIG. 1.
`
`DESCRIPTION OF THE PREFERRED
`EMBODIMENTS
`
`The simplest noise suppression apparatus, in daily use by
`millions of people around the world, is the so-called
`"squelch" circuit. A squelch circuit is standard on most
Citizen Band two-way radios. It operates by simply disconnecting
the system's loudspeaker when the energy of the
`received signal falls below a certain threshold. The value of
`this threshold is usually fixed using a manual control knob
`to a level such that the background noise never passes to the
`speaker when the far end is silent. The problem with this
`kind of circuit is that when the circuit turns on and off as the
`far end speaker starts and then stops, the presence and then
`absence of noise can be clearly heard. The noise is wideband
`and covers frequencies in which there is little speech energy,
`and thus the noise can be heard simultaneously as the person
`is talking. The operation of the squelch unit produces a very
`
`disconcerting effect, although it is preferable to having no
`noise suppression whatsoever.
The noise suppression method of this invention improves
on the "squelch" concept considerably by reducing the
background noise in both speech and non-speech sections of
the audio.
The approach, according to the invention, is based on
human perception. Using principles of spectral and time
masking (both defined below), this invention reduces the
perceived loudness of noise that is added onto or mixed with
the speech signal.
This approach differs from other approaches, for example,
those in which the goal is to minimize the mean-squared-error
between the speech component by itself (speech-without-noise)
and the processed speech output of the suppression
system.
The method used in this invention exploits the "squelch"
notion of turning up the gain on a channel when the energy
of that channel exceeds a threshold and turning down the
gain when the channel energy falls below the threshold;
however, the method performs the operation separately on
different frequency regions. The gain on a channel can be
considered to be the ratio of the volume of the output
signal to the volume of the corresponding input signal.
The method further exploits various psychoacoustic principles
of spectral masking, in particular the principle that
if there is a loud tone at some frequency,
then there exists a given frequency band around that frequency,
called the critical band, within which other signals
cannot be heard. The method of the invention is far
more effective than a simple "squelch" circuit in reducing
the perception of noise while speech is being received from
the far end.
`The method of the invention also exploits a temporal
`masking property. If a loud tone burst occurs, then for a
`period of time up to 200 milliseconds after that burst the
`sensitivity of the ear in the spectral region of the burst is
`decreased. Another acoustic effect is that for a time of up to
`20 milliseconds before the burst, the sensitivity of the ear is
`decreased (thus, human hearing has a pipeline delay of about
`20 milliseconds). One key element of this invention is thus
`that the signal threshold below which the gain for a given
`band is decreased can be lowered for a period of time both
`before and after the occurrence of a sufficiently strong signal
`in that band since the ear's sensitivity to noise is decreased
`in that period of time.
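
As a rough illustration of the temporal masking idea, the sketch below marks which frames of one frequency band fall inside the masking period around a strong signal. It assumes 20 ms frames (the frame length of the illustrated embodiment) and uses the upper bounds of 200 ms and 20 ms quoted above; the function name and the boolean-list representation are hypothetical and are not taken from the patent.

```python
FRAME_MS = 20        # assumed frame length (illustrated embodiment)
POST_MASK_MS = 200   # masking lasts up to 200 ms after a loud burst
PRE_MASK_MS = 20     # and up to 20 ms before it


def lowered_threshold_frames(strong):
    """Mark frames in which the band threshold may be lowered.

    strong[i] is True when frame i holds a sufficiently strong signal
    in the band. The result marks that frame, the frames before it
    within PRE_MASK_MS, and the frames after it within POST_MASK_MS.
    """
    pre = PRE_MASK_MS // FRAME_MS
    post = POST_MASK_MS // FRAME_MS
    n = len(strong)
    lowered = [False] * n
    for i, flag in enumerate(strong):
        if flag:
            for j in range(max(0, i - pre), min(n, i + post + 1)):
                lowered[j] = True
    return lowered
```

With these assumptions, a single strong frame lowers the threshold for one frame before it and ten frames after it.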
`
System Overview

With reference now to the block diagram of FIG. 1, the
input signal 1 is first apportioned by a framer 2 into 20
millisecond frames of samples. (Because the input signal is
sampled at a rate of 16 kHz in the illustrated embodiment,
each 20 ms frame includes 320 samples.) The computational
complexity of the method is significantly reduced by operating
on groups of frames of samples at a time, rather than
on individual samples, one at a time. The framed signal is
then fed through a bank of notch filters 4, the purpose of
which is to remove narrow band components of the noise,
typically motor noise occurring at the rotational frequencies
of the motors. If the notches are narrow enough with a sparse
enough spectral density, the tonal quality of the speech will
not be adversely affected. Each frame of digital signals is
then combined with a portion from the end of the immediately
preceding frame of digital signals to produce a windowed
frame.
In preferred embodiments, each frame of digital signals
(20 ms) is combined with the last 12 ms of the preceding
frame to produce windowed frames having durations of 32
ms. In other words, each windowed frame includes three
hundred and twenty samples from a frame of digital signals
in combination with the last one hundred and ninety-two
filtered samples of the immediately preceding frame. The
512-sample segment of speech is then multiplied by a
window, at a multiplier 6, to alleviate problems arising from
discontinuities of the signal at the beginning and end of the
512 sample frame. A fast Fourier Transform (FFT) 8 is then
taken of the 512 sample windowed frame, producing a 257
component frequency spectrum.
The lowest (D.C.) and highest (sampling frequency
divided by two, or 8 kHz) frequency components of the
transformed signal have real parts only, while the other 255
components have both real and imaginary parts. The spectral
components are fed to a background noise estimator 20
whose purpose is to estimate the background noise spectral
energies and to find background noise spectral peaks at
which to place the notches of notch filter 4. A signal
magnitude spectrum estimator, a stationary estimator 24,
and background noise spectrum estimator, a running minimum
estimator 22, for each frequency component are compared
by a comparator 28 and various confidence levels are
derived by a decision mechanism 32 for each frequency
component as to whether the particular frequency
component is primarily from noise or from signal-plus-noise.
Based on these confidence levels, the gain for a
frequency band is determined by a gain setter 34. The gains
are then spread, by a spreading mechanism 36, in the
frequency domain in critical bands, spectrally and temporally,
exploiting psychoacoustic masking effects. A spectral
valley filler 38 is used to detect spectral valleys in the
frequency component gain function and fill in the valleys.
The final frequency component gain function from noise
suppression spectral modifier 30 is used to modify the
magnitude of the spectral components of the 512-point FFT
at an attenuator 12. Note that the frame at attenuator 12 is
one time unit behind the signals which are primarily used to
generate the gains. An inverse FFT (IFFT) 14 then maps the
signal back into the time domain from the frequency
domain. The resulting 512 point frame of noise-reduced
signal is multiplied by a window at a multiplier 16. The
result is then overlapped and added, at adder 18, to the
previous frame's signal to derive 20 milliseconds or 320
samples of output signal on line 40.
A more detailed description of each block in the signal
processing chain is now provided, from input to output in the
order of their occurrence.
As described above, the framed input signal is fed through
a bank of notch filters 4.
With reference to FIGS. 1 and 2, the notch filter bank 4
consists of a cascade of Infinite Impulse Response (IIR)
digital filters, where each filter has a response of the form:
`
$$H(z) = \frac{1 - 2\cos\theta\, z^{-1} + z^{-2}}{1 - 2r\cos\theta\, z^{-1} + r^{2} z^{-2}} \tag{1}$$

where $\theta = (\pi/8000) \times$ (frequency of notch), and $r$ is a value less
than one which reflects the width of the notch. If the -3 dB
width of the notch is $\omega$ Hz, then $r = 1 - (\omega/2)(\pi/8000)$. The
bandwidth, $\omega$, used in the illustrated and preferred embodiment
is 20 Hz. A notch is placed approximately every 100
Hz, at the largest peak of the background noise energy near
the nominal frequency.
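
The notch response of Equation (1) can be sketched in a few lines. The code below is only an illustration under stated assumptions: it takes the 16 kHz sampling rate and 20 Hz bandwidth of the illustrated embodiment, starts every filter from zero state, and uses hypothetical function names (`notch_coefficients`, `apply_biquad`, `notch_bank`) that do not come from the patent.

```python
import math

FS = 16000  # sampling rate of the illustrated embodiment (Hz)


def notch_coefficients(freq_hz, width_hz=20.0, fs=FS):
    """Numerator b and denominator a of Equation (1) for one notch."""
    theta = math.pi / (fs / 2) * freq_hz           # theta = pi/8000 * f at 16 kHz
    r = 1.0 - (width_hz / 2.0) * (math.pi / (fs / 2))
    b = (1.0, -2.0 * math.cos(theta), 1.0)
    a = (1.0, -2.0 * r * math.cos(theta), r * r)
    return b, a


def apply_biquad(samples, b, a):
    """Run one second-order IIR section (direct form I, zero initial state)."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def notch_bank(samples, notch_freqs_hz):
    """Cascade one notch per listed centre frequency."""
    for f in notch_freqs_hz:
        b, a = notch_coefficients(f)
        samples = apply_biquad(samples, b, a)
    return samples
```

A real implementation would carry the filter state across frames rather than resetting it; the zero-state start here is a simplification for the sketch.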
`
The notch filtering is applied to the 320 samples of the
new signal frame. The resulting 320 samples of notch
filtered output are appended to the last 192 samples of
notch-filtered output from the previous frame to produce a
total extended frame of 512 samples.
Referring to FIGS. 1 and 2, the notch-filtered 512 sample
frame derived from filter bank 4 is multiplied by a window
using the following formula:

$$w(i) = \begin{cases}
f(i)\left[0.5 - 0.5\cos\left(\pi \dfrac{i}{191}\right)\right], & i = 0, 1, \ldots, 191, \\
f(i), & i = 192, 193, \ldots, 319, \\
f(i)\left[0.5 - 0.5\cos\left(\pi \dfrac{511 - i}{191}\right)\right], & i = 320, 321, \ldots, 511
\end{cases} \tag{2}$$

where f(i) is the value of the ith notch-filtered sample of the 512
sample frame from filter bank 4 and w(i) is the resultant
value of the ith sample of the resultant 512 sample windowed
output which is next fed to the FFT 8. The purpose
of the window, effected by multiplier 6, is to minimize edge
effects and discontinuities at the beginning and end of the
extended frame.
The time-windowed 512 sample points are now fed to the
FFT 8. Because of the ubiquity of FFTs, many Digital
Signal Processing (DSP) chip manufacturers supply highly
optimized assembly language code to implement the FFT.
A one frame delay 10 is introduced so that signal frequency
components of the FFT can be amplified and processed
in attenuator 12 based upon later occurring signal
values. This does not introduce any perceptual noise
because, as noted above, a signal component will mask
frequencies in its spectral neighborhood 20 milliseconds
before it actually occurs. Also, since speech sounds gradually
increase in volume starting from zero amplitude, the one
frame delay prevents clipping the start of speech utterances.
Those components of the FFT due to noise are attenuated
by attenuator 12, while those components due to signal are
less attenuated or unattenuated or may be amplified. As
noted above, for each frequency, there is a real and an
imaginary component. Both components are multiplied by a
single factor found from the Noise Suppression Spectral
Modifier module 30, so that the phase is preserved for the
frequency component while the magnitude is altered.
The inverse FFT 14 (IFFT) is taken of the magnitude
modified FFT, producing a frequency processed extended
frame, 512 samples in length.
The windowing operation used in multiplier 16 is exactly
the same as the windowing operation defined above for
multiplier 6. Its purpose is to minimize discontinuities
introduced by the attenuation of frequency components. For
example, suppose that all frequency components have been
set to zero except for one. The result will be a sine wave
when the IFFT is taken. This sine wave may start at a large
value and end at a large value. Neighboring frames may not
have this sine wave component present. Thus, without
proper windowing, when this signal is overlap-added in the
output adder 18, a click may be heard at the start and end of
the frame. However, by properly windowing the sine wave,
using, for example, the parameters defined in Equation 2,
what will be heard is a sine wave smoothly increasing in
magnitude and then smoothly decreasing in magnitude.
Because of the pre- and post-windowing of the frame by
multipliers 6 and 16, overlap and addition of the frames is
necessary to prevent the magnitude of the output from
decreasing at the start and end of the frame. Thus, the first
192 samples of the present 512 sample extended and windowed
frame are added to the last 192 samples of the
previous extended and windowed frame. Then the next 128
samples (8 milliseconds) of the current extended frame are
output. The last 192 samples of the present extended and
windowed frame are then stored for use by the next frame's
overlap-add operation, and so on.
In a preferred embodiment, the window function, W, used
will have the property that:

$$W^{2} + (W^{2} \text{ shifted by amount of overlap}) = 1$$

to avoid producing modulation over time. For example, if
the amount of overlap is one half a frame, then the windowing
function, W, has the property that:

$$W^{2} + (W^{2} \text{ shifted by } 1/2) = 1$$
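
The frame assembly, the window of Equation 2, and the overlap-add step can be sketched as follows. This is a minimal illustration assuming the sample counts of the illustrated embodiment (320 new samples, 192 carried over, 512 in the extended frame); the function names are hypothetical, and the FFT, attenuation and IFFT stages that sit between the two windowing steps are omitted.

```python
import math

FRAME = 320                  # new samples per 20 ms frame at 16 kHz
OVERLAP = 192                # samples carried over from the previous frame
EXTENDED = FRAME + OVERLAP   # 512-sample extended frame


def extend_frame(previous_tail, new_frame):
    """Append the new 320 filtered samples to the previous frame's last 192."""
    return list(previous_tail) + list(new_frame)


def window_gain(i):
    """Multiplier applied to sample i of the extended frame (Equation 2)."""
    if i < OVERLAP:
        return 0.5 - 0.5 * math.cos(math.pi * i / 191)
    if i < FRAME:
        return 1.0
    return 0.5 - 0.5 * math.cos(math.pi * (511 - i) / 191)


def apply_window(frame):
    """Used both before the FFT (multiplier 6) and after the IFFT (multiplier 16)."""
    return [x * window_gain(i) for i, x in enumerate(frame)]


def overlap_add(windowed, stored_tail):
    """Return (320 output samples, 192-sample tail to store for the next frame)."""
    head = [a + b for a, b in zip(windowed[:OVERLAP], stored_tail)]
    output = head + windowed[OVERLAP:FRAME]
    return output, windowed[FRAME:]
```

Each call to `overlap_add` emits 192 summed samples followed by 128 unmodified ones, which matches the 320 samples (20 ms) of output per frame described above.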
`
`Background Noise Estimator 20
`
Referring to FIGS. 1 and 3, the background noise estimator
20 and the noise suppression spectral modifier module
30 operate as follows.
The purpose of the background noise estimator 20 is to
develop, for each frequency component of the FFT, an
estimate of the average energy magnitude due to the background
noise. The background noise estimator removes the need for
the user to manually adjust or train the system for each new
environment. The background noise estimator continually
monitors the signal/noise environment, updating estimates
of the background noise automatically in response to, for
example, air conditioning fans turning off and on, etc. Two
approaches are used, with the results of one or the other
approach used in a particular situation. The first approach is
more accurate, but requires one second intervals of solely
background noise. The second approach is less accurate, but
develops background noise estimates in 10 seconds under
any conditions.
`
`Stationary Estimator 24
`
`With reference to FIGS. 1 and 3, the first approach uses
`a stationary estimator 24 to look for long sequences of
`frames where the spectral shape in each frame is very similar
`to that of the other frames. Presumably, this condition can
`only arise if the human in the room is silent and the constant
`background noise due to fans and/or circuit noise is the
`primary source of the signal. When such a sequence is
`detected, the average magnitude of each frequency is taken
`from those frames in the central part of the FFT sequence
`(frames at the beginning and end of the sequence may
`contain low level speech components). This method yields a
`much more accurate measurement of the background noise
`spectrum as compared to the second approach (described
`below), but requires that the background noise is relatively
`constant and that the humans in the room are not talking for
`a certain period of time, conditions sometimes not found in
`practice.
`The operation of this estimator, in more detail, is as
`follows:
`1. Referring to FIG. 3, the method in the first approach
`determines if the current 20 ms frame is similar in
`spectral shape to the previous frames. First, the method
`computes, at 240, the spectral shape of the previous
`frames:
`
$$N_i(f_c) = 0.25 \sum_{f=f_c-3}^{f_c} \left( \sum_{k=k_i}^{k_i+31} \left( R^{2}(k,f) + I^{2}(k,f) \right) \right) \tag{3}$$

where $f_c$ is the frame number for the current 20 ms frame (it
advances by one for consecutive frames), i denotes a 1000
Hz frequency band, $k_i = i \times 32$, k indexes the 256 frequency
components of the 512 point FFT, and R(k, f) and I(k, f) are
the real and imaginary components of the kth frequency
component of the frame f.
2. Next, $S_i(f_c)$, the spectral shape of the current frame, is
determined at 242:

$$S_i(f_c) = \sum_{k=k_i}^{k_i+31} \left( R^{2}(k,f_c) + I^{2}(k,f_c) \right) \tag{4}$$

where the notation has the same meaning as in equation (3)
above; and $S_i$ is the magnitude of the ith frequency component
of the current frame, $f_c$.
3. The estimator 24 then checks, at 244 and 246, to
determine whether

$$N_i(f_c) > t_l\, S_i(f_c) \tag{5}$$

or

$$S_i(f_c) > t_l\, N_i(f_c), \quad \text{for } i = 0, 1, \ldots, 7 \tag{6}$$

where $t_l$ is a lower threshold. In a preferred embodiment,
$t_l = 3$. If the inequality in (5) or (6) is satisfied for more than
four values of i, then the current frame $f_c$ is classified as
signal; otherwise, the estimator checks (at 248 and 250) to
determine whether

$$N_i(f_c) > t_h\, S_i(f_c) \tag{7}$$

or

$$S_i(f_c) > t_h\, N_i(f_c), \quad \text{for } i = 0, 1, \ldots, 7 \tag{8}$$

where $t_h$ is a higher threshold, and $N_i$ designates the magnitude
of the ith frequency component of the background
noise estimate. In a preferred embodiment, $t_h = 4.5$. If either
inequality is satisfied for one or more values of i, then the
current frame $f_c$ is also classified as a signal frame. Otherwise
the current frame is classified as noise.
4. If fifty consecutive noise-classified frames occur in a
row, at 252 (corresponding to one second of noise),
then estimator 24 develops noise background estimates
by summing frequency energies from the 10th to the
41st frame. By ignoring the beginning and ending
frames of the sequence, confidence that the signal is
absent in the remaining frames is increased. The estimator
finds, at 254,

$$B_k = \frac{1}{32} \sum_{f=f_s}^{f_s+31} \left( R^{2}(k,f) + I^{2}(k,f) \right) \tag{9}$$

where k = 0, 1, 2, ..., 255, $f_s$ is the starting index of the 10th
noise-classified frame, and the other terms have the same
notation as in equation (3). The values, $B_k$, now represent the
average spectral magnitude of the noise component of the
signal for the kth frequency.
To determine where to place the notches of the notch filter
bank, with reference to FIGS. 1 and 4, the unwindowed 20
ms time-domain samples corresponding to the 32 noise-only
classified frames are appended together (at 260) to form a
contiguous sequence. A long FFT is taken of the sequence (at
262). The component having the largest magnitude, approximately
every 100 Hz, is found (at 264), and the frequency at
which this locally maximum magnitude occurs corresponds
to the location at which a notch center frequency will be
placed (at 266). Notches are useful in reducing fan noise
only up to 1500 Hz or so, because for higher frequencies, the
fan noise spectrum tends to be fairly even with the absence
of strong peaks.

Running Minimum Estimator 22

There will be some instances when either the speech
signal is never absent for more than a second or the
background noise itself is never constant in spectral shape,
so that the stationary estimator 24 (described above) will
never produce noise background estimates. For these cases,
the running minimum estimator 22 will produce noise
background estimates, albeit with much less accuracy.
The steps used by the running minimum estimator are:
1. Over a 10 second interval, and for each frequency
component k, find the eight consecutive frames with
the lowest energy for that frequency component;
that is, for every frequency
component k find the frame $f_k$ that minimizes $M_k(f_k)$
where
`
$$M_k(f_k) = \frac{1}{8} \sum_{f=f_k}^{f_k+7} \left( R^{2}(k,f) + I^{2}(k,f) \right) \tag{10}$$

where $f_k$ is any frame number occurring within the 10 second
interval. Note that, in general, the $f_k$ that minimizes equation
(10) will take on different values for different frequency
components, k.
2. Use the minimum values $M_k$ derived in the previous
step as the background noise spectral estimate if the
following two conditions are both met:
(a) It has been more than 10 seconds since the last
update of the background noise spectral estimate due
to the Stationary Estimator.
(b) The difference, D, between the past background
noise estimate, which may have resulted from the
Stationary Estimator or the Running Minimum Estimator,
and the current Running Minimum Estimator
is great. The metric used to define the difference D is
given in Equation 11:

$$D = \sum_{k=0}^{255} \left( \max\left( \frac{M_k}{N_k}, \frac{N_k}{M_k} \right) - 1 \right)^{2} \tag{11}$$

where the max function returns the maximum of its
two arguments, and $N_k$ are the previous background
noise estimates (from either Running Minimum or
Stationary Estimators), and $M_k$ are the current background
noise estimates from the Running Minimum
Estimator.
If D is greater than some threshold, for example, 3,000 in
a preferred embodiment, and the preceding condition (a) is
satisfied, then $M_k$ is used as the new background spectral
estimate. The use of $M_k$ as the noise estimate indicates that
the notch filters should be disabled, since a good estimate of
the notch center frequencies is not possible.
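
The decision logic of the two estimators can be sketched as below. This is a hedged illustration rather than the patented implementation: each spectrum is assumed to be a list of 256 per-bin powers (R squared plus I squared), the function names are hypothetical, and the averaging factor of one eighth in `running_minimum` is an assumption that does not change which frames give the minimum.

```python
def band_energies(power, bands=8, width=32):
    """Sum per-bin power into eight 1000 Hz bands, as in Equation (4)."""
    return [sum(power[i * width:(i + 1) * width]) for i in range(bands)]


def classify_frame(shape_powers, current_power, t_low=3.0, t_high=4.5):
    """Classify the current frame as 'noise' or 'signal' (Equations 3 to 8).

    shape_powers holds the four per-bin power spectra summed in Equation (3).
    """
    per_frame = [band_energies(p) for p in shape_powers]
    n = [0.25 * sum(band) for band in zip(*per_frame)]
    s = band_energies(current_power)
    low_hits = sum(1 for ni, si in zip(n, s)
                   if ni > t_low * si or si > t_low * ni)
    if low_hits > 4:
        return "signal"
    if any(ni > t_high * si or si > t_high * ni for ni, si in zip(n, s)):
        return "signal"
    return "noise"


def running_minimum(powers_k, span=8):
    """Smallest mean power of one component over `span` consecutive frames."""
    return min(sum(powers_k[f:f + span]) / span
               for f in range(len(powers_k) - span + 1))


def estimate_difference(m, n):
    """Difference metric D of Equation (11) between two noise estimates."""
    return sum((max(mk / nk, nk / mk) - 1.0) ** 2 for mk, nk in zip(m, n))
```

Under this metric, an estimate that differs from the previous one by a factor of two in every one of 256 components gives D = 256, well below the example threshold of 3,000.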
`
`Noise Suppression Spectral Modifier 30
`
`Referring to FIG. 1, once the background noise estimate
`has been found, the current frame's spectra must be com(cid:173)
`pared to the background noise estimate's spectra, and on the
`basis of this comparison, attenuation must be derived for
`each frequency component of the current frame's FFT in an
`
`attempt to reduce the perception of noise in the output
`signal.
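
Once a gain has been derived for each frequency component, applying it amounts to scaling the real and imaginary parts by the same factor, which changes the magnitude and leaves the phase untouched. The sketch below illustrates only that step; the function name and the use of Python complex numbers are assumptions made for the example.

```python
import cmath


def attenuate(spectrum, gains):
    """Scale each complex FFT component by its per-component gain.

    Real and imaginary parts share one factor, so the phase is preserved.
    """
    return [complex(g * z.real, g * z.imag) for z, g in zip(spectrum, gains)]


# Example: halve the first component, keep the second, remove the third.
example = attenuate([complex(3, 4), complex(-1, 1), complex(0, -2)],
                    [0.5, 1.0, 0.0])
```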
`
`Temporal and Spectral Spreading of Frequency Bin
`Gains by Critical Bands 36
`
`
`Global Speech versus Noise Detector 32
`
`
Any given frame will either contain speech or not. Global
Speech versus Noise Detector 32 makes a binary decision as
to whether or not the frame is noise.
In the presence of speech, thresholds can be lowered
because masking effects will tend to make incorrect signal
versus noise declarations less noticeable. However, if the
frame truly is noise only, slight errors in deciding whether
frequency components are due to noise or signal will
give rise to so-called "twinkling" sounds.
In accordance with the illustrated embodiment, to determine
whether speech is present in a frame, the system
compares the magnitude of the kth frequency component of
the current frame, designated $S_k$, and the magnitude of the
kth frequency component of the background noise estimate,
designated $C_k$. Then if $S_k > T \times C_k$ for more than 7 values of k
(for one frame), where T is a threshold constant (T=3, in a
preferred embodiment), the frame is declared a speech
frame. Otherwise, it is declared a noise frame.
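
The global decision reduces to counting the frequency components that stand clear of the noise estimate. A minimal sketch, assuming the magnitudes are passed as plain lists and using a hypothetical function name:

```python
def is_speech_frame(s, c, threshold=3.0, min_count=7):
    """Declare speech when S_k > T * C_k for more than `min_count` components.

    s holds the current frame's magnitudes S_k and c the background
    noise estimate C_k; T = 3 in the preferred embodiment.
    """
    hits = sum(1 for sk, ck in zip(s, c) if sk > threshold * ck)
    return hits > min_count
```

Exactly seven components above the threshold still yields a noise frame; the eighth tips the decision to speech.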
`
$$D_k = \begin{cases}
4, & \text{if } S_k / N_k > t_4, \\
3, & \text{else if } S_k / N_k > t_3, \\
2, & \text{else if } S_k / N_k > t_2, \\
1, & \text{else if } S_k / N_k > t_1, \\
0, & \text{else}
\end{cases} \tag{12}$$

where $S_k = R^{2}(k) + I^{2}(k)$ for the current frame and $N_k$ is the
background noise estimate for component k. The values
used for $t_1$, $t_2$, $t_3$, $t_4$ vary depending on whether the global
speech detector