NOISE ADAPTATION IN A HIDDEN MARKOV MODEL
SPEECH RECOGNITION SYSTEM

Dirk Van Compernolle*

Department of Electrical Engineering - ESAT†
Katholieke Universiteit Leuven
Kardinaal Mercierlaan
B- Heverlee
Belgium

* Research Associate of the National Fund for Scientific Research of Belgium (N.F.W.O.)
† The work was performed while the author was at IBM T.J. Watson Research Center, Yorktown Heights, NY.
Abstract

Several ways for making the signal processing in an isolated word speech recognition system more
robust against large variations in the background noise level are presented. Isolated word
recognition systems are sensitive to accurate silence detection, and are easily overtrained on the
specific noise circumstances of the training environment. Spectral subtraction provides good noise
immunity in the cases where the noise level is lower or slightly higher in the testing environment
than during training. Differences in residual noise energy after spectral subtraction between a
clean training and a noisy testing environment can still cause severe problems. The usability of
spectral subtraction is largely increased if it is complemented with some extra noise immunity
processing. This is achieved by the addition of artificial noise after spectral subtraction or by
adaptively re-estimating the noise statistics during a training session. Both techniques are almost
equally successful in dealing with the noise. Noise addition achieves the additional robustness that
the system will never be allowed to learn about low amplitude events that might not be observable in
all environments; this, however, comes at the cost that some information is consistently thrown away
in the most favorable noise situations.
I. Introduction

The signal processing [ ] used in IBM's real-time speech recognizer [ ] has performed well in
controlled environments. At the time of development the system was exclusively used in a quiet
office with a typical signal to noise ratio of about  dB. Day to day variations in a speaker's voice
level had to be accounted for, but the acoustic environment changed little. The development of the
portable Tangora system [ ] allowed the recognition system to be used in a place at the convenience
of the user, but simultaneously created a new set of problems. It soon became clear that the existing
signal processing was not able to deal with the large variations in acoustic backgrounds that the
system was now exposed to. Sensitivity to the absolute noise level was moderate, i.e. when the system
was tested in the same acoustic environment as the one in which it was trained. Under those
conditions a drop in signal to noise ratio from  to  dB corresponded to a doubling in error rate.
But a difference in signal to noise ratio of more than  dB between the training and testing
environments led to severe deterioration in performance. Using a lip-mike can in principle solve most
noise problems in office applications. The goal at IBM has been, however, to use the recognizer with
a less constraining table mounted microphone.

A "silence model", representing the short pauses between words, is one of the hidden Markov models
used in the isolated word recognition system. The inclusion of the silence model as a regular Hidden
Markov Model (HMM) is necessary to avoid the otherwise difficult task of endpoint detection. It is
trained in conjunction with the speech models and hence learns about the noise characteristics of the
training session. Large differences in the noise level of the training and testing environments will
cause the recognizer to mistake speech for noise or noise for speech. Two rules for robust signal
processing for speech recognition emerge: the processing of speech events must be largely invariant
to the noise level, and noise, independent of its actual value, should be mapped into a typical noise
image. This is a normalization and adaptation task and therefore quite distinct from actual noise
removal, which is the goal in speech enhancement [ , ].
The approach taken in this work was to maintain the general signal processing structure of the
existing system, which has proven successful across many speakers in controlled, constant acoustic
environments, and to complement it with noise immunity processing that is largely transparent for the
low noise environments but effective in dealing with changes in the background noise level. In
section II the signal processing of the IBM speech recognition system is reviewed. In section III we
introduce the basic noise immunity components of the signal processing, which are spectral
subtraction and a frequency dependent channel equalizer. Both operations rely heavily on a histogram
based speech/noise discrimination algorithm. Refinements to the noise immunity signal processing and
the statistical silence model are explored in sections IV and V. Finally, in section VI, results
obtained on the IBM speech recognition system for all of the presented schemes are given and
analyzed.
II. Signal Processing Overview

The signal processing (Fig. ) converts an input signal, sampled at  kHz, into a  dimensional output
vector at a frame rate of  frames/sec. It can be divided into five parts: Fourier transform,
simulated filterbank, log conversion, long term adaptation and short term adaptation.

Figure : Signal processing block diagram (SIGPSTD)

All blocks, except for the long term adaptation, are identical to the design described in [ ]. The
FFT uses a  point Hanning window and creates a new spectral vector every  msec. A mel-scaled
filterbank is simulated by adding up FFT power spectrum coefficients;  channels spanning the
frequency range from  Hz to  Hz are created. The long term adaptation, which is described in detail
in the next section, plays a normalization role. Its output is ideally independent of the current
acoustic environment. The short term adaptation is based on a Schroeder-Hall haircell model [ ]. It
is modeled after the rapid adaptation seen in neural firing rate according to changes in input level.
The time constants, dependent on the actual input, are between  and  msec. The importance of this
block for the stationary parts of speech is minimal, but quite significant for the transient parts.

The  dimensional output vector is labeled by a vector quantizer with a codebook of size . These
labels are the inputs to the HMM based recognition system [ ]. The VQ codebook is designed by K-means
clustering of training data [ ]. This particular way of labeling, especially the use of a Euclidean
distance metric in finding the closest prototype, has a direct impact on the signal processing.
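A minimal sketch of this front end, for illustration only: the window length, frame step, channel
count and codebook size below are placeholder constants rather than the actual system values, the
mel-scaled filterbank is reduced to a crude bin-summing stand-in, and the log conversion plus the
long and short term adaptation stages described in the following sections would operate between the
filterbank and the vector quantizer.

```python
import numpy as np

# Placeholder constants; the actual window length, frame step and channel
# count of the IBM system are not reproduced here.
WINDOW, STEP, N_CHANNELS = 256, 100, 20

def frame_power_spectra(signal):
    """Hanning-windowed FFT power spectrum for each analysis frame."""
    win = np.hanning(WINDOW)
    frames = [signal[i:i + WINDOW] * win
              for i in range(0, len(signal) - WINDOW, STEP)]
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def filterbank(power, n_channels=N_CHANNELS):
    """Crude stand-in for the simulated mel-scaled filterbank: sum adjacent
    FFT power coefficients into n_channels channels."""
    edges = np.linspace(0, power.shape[1], n_channels + 1, dtype=int)
    return np.stack([power[:, lo:hi].sum(axis=1)
                     for lo, hi in zip(edges[:-1], edges[1:])], axis=1)

def vq_labels(vectors, codebook):
    """Label each output vector with its closest prototype under the
    Euclidean distance metric, as used by the recognizer."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)
```

Because the labeling uses a plain Euclidean metric, any adaptation applied before the quantizer
directly changes which prototype wins; this is the coupling between the signal processing and the
recognizer referred to above.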
III. Noise Correction and Acoustic Adaptation

The function of the signal processing block "long term adaptation" is to compensate for the large day
to day variations in the environment, and to map the observed variable dynamic ranges into a fixed
dynamic range about which the system is allowed to learn. The variations include changes in the
background noise, room acoustics and recording hardware.

Figure : Basic long term adaptation for a single channel (NOISIM). Legend: X^i - raw power spectral
estimate; \theta - rms noise threshold; \mu^i - noise estimate; SSTHR^i - spectral subtraction
threshold; G^i - channel gain; y^i - adapted spectral estimate.
The system proposed here (Fig. ) splits the correction for the noise and the adaptation to the room
acoustics into two largely independent parts, but some linkage between both parts is crucial to the
overall success. Spectral subtraction [ ] is used for the noise correction. A channel equalization,
which is somewhat like a frequency dependent automatic gain control, compensates for the variable
microphone and room acoustic transfer functions. Both operations need a reasonably good noise/speech
discriminator, which in this work is based on signal rms energy only.

(Notation: capital letters are used for power spectrum variables, small letters for log spectrum
variables. Channel numbers are given as superscripts.)
No effort is made within the signal processing to remove impulsive noises, such as slamming doors,
coughing etc., from the signal. Hence all underlying parameters can truly be modeled as slowly
varying. Before any long term adaptation processing, histograms are collected of the total and single
channel log-rms energies, as shown in Fig. . Long term adaptation parameters are derived from these
histograms or from running averages. Parameter estimates are updated in block mode at regular
intervals, typically  secs, at which time the histograms are also restarted.
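As an illustration of this block-mode bookkeeping, the sketch below accumulates the total log-rms
histogram and the single channel histograms frame by frame, hands them to a fitting routine at the
end of each block, and then restarts them. The bin layout, the block length and the `fit_parameters`
callback are assumptions; the actual update rules are given in the following subsections.

```python
import numpy as np

BINS = np.linspace(-10.0, 10.0, 101)   # assumed histogram bin edges (log energy)
BLOCK_FRAMES = 1000                    # assumed block length between updates

class LongTermAdapter:
    """Block-mode collection of log-rms and per-channel log-energy histograms."""

    def __init__(self, n_channels, fit_parameters):
        self.n_channels = n_channels
        self.fit_parameters = fit_parameters   # callback: histograms -> parameters
        self.restart()

    def restart(self):
        self.rms_hist = np.zeros(len(BINS) - 1)
        self.chan_hist = np.zeros((self.n_channels, len(BINS) - 1))
        self.count = 0

    def observe(self, log_rms, log_channels):
        """Accumulate one frame; refit and restart at the end of a block."""
        self.rms_hist += np.histogram([log_rms], bins=BINS)[0]
        for i, e in enumerate(log_channels):
            self.chan_hist[i] += np.histogram([e], bins=BINS)[0]
        self.count += 1
        if self.count >= BLOCK_FRAMES:
            params = self.fit_parameters(self.rms_hist, self.chan_hist, BINS)
            self.restart()
            return params
        return None
```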
Use of Log-rms Histograms: Rms signal energy is computed over the frequency range used by the
recognizer ( Hz to  Hz). Histograms of the log of signal energy typically have a clean bimodal
distribution [ ], corresponding to "noise only" or "noise + speech"; these can be considered as the
two basic states of the system. The noise contribution to the histograms, and to a lesser extent the
speech contribution, tends to be modeled well by a normal distribution (Fig. ). This modeling has as
a basic assumption that speech, when present, largely dominates the noise. Hence it is only suitable
for applications with significant positive signal to noise ratios.
Using index "no" for the noise distribution and "sp" for the speech distribution, the joint
probabilities of state and current log-rms observation e_t can be written as:

    p_k(e_t) = \frac{\alpha_k}{\sqrt{2\pi}\,\sigma_k} \exp\left( -\frac{(e_t - \mu_k)^2}{2\sigma_k^2} \right), \qquad k = no, sp

which now allows for writing the conditional probability of being in silence, given an rms
observation, as:

    p(no \mid e_t) = \frac{p_{no}(e_t)}{p_{no}(e_t) + p_{sp}(e_t)}
and also:

    p(sp \mid e_t) = 1 - p(no \mid e_t)

Figure : Log-rms histogram fitted with two gaussians
The generating mixing ratio, means and variances (\alpha_k, \mu_k and \sigma_k) of the best fit to
the histogram with a mixture of two gaussians can be found using a standard EM procedure [ ]. A
speech/noise decision threshold \theta is derived from this fit as the point of equal probability on
the two gaussian distributions:

    p(no \mid \theta) = p(sp \mid \theta) = 0.5
    p_{no}(\theta) = p_{sp}(\theta)

The detection of a noise frame can conveniently be written as:

    \delta_t^{no} = 1 \quad \text{if } e_t < \theta
    \delta_t^{no} = 0 \quad \text{if } e_t \geq \theta
This decision criterion is definitely not perfect on a frame by frame basis and would not be suitable
for the purpose of word boundary detection. But few enough errors are made that, on the basis of it,
excellent statistics can be gathered about the noise environment. An additional advantage is that the
described derivation of \theta is suited for a mixed input consisting of noise and speech, and that
it does not require occasional long pauses for measuring noise statistics, nor is any previous
knowledge about the noise level required. In an isolated word speech recognition system, pauses in
between words and sentences typically take up to  % of all input, hence providing ample information
about the environment.
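A minimal sketch of this fit, assuming the EM iterations are run directly on the frame-level log-rms
values rather than on the binned histogram counts, and assuming the threshold is located by a simple
grid search for the point of equal weighted density; the initialization and iteration count are also
illustrative choices, not the procedure of the paper.

```python
import numpy as np

def fit_two_gaussians(e, iters=50):
    """EM fit of a 2-component Gaussian mixture (noise, speech) to log-rms values e."""
    mu = np.array([e.min(), e.max()], dtype=float)       # crude initialization
    sigma = np.array([e.std(), e.std()]) + 1e-6
    alpha = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of the noise (0) and speech (1) components
        dens = alpha * np.exp(-(e[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
               / (np.sqrt(2 * np.pi) * sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing ratios, means and variances
        n = resp.sum(axis=0)
        alpha = n / len(e)
        mu = (resp * e[:, None]).sum(axis=0) / n
        sigma = np.sqrt((resp * (e[:, None] - mu) ** 2).sum(axis=0) / n) + 1e-6
    return alpha, mu, sigma

def threshold(alpha, mu, sigma):
    """Point of equal weighted probability between the noise and speech Gaussians."""
    grid = np.linspace(mu.min(), mu.max(), 2000)
    dens = alpha * np.exp(-(grid[:, None] - mu) ** 2 / (2 * sigma ** 2)) \
           / (np.sqrt(2 * np.pi) * sigma)
    return grid[np.argmin(np.abs(dens[:, 0] - dens[:, 1]))]

# A frame with log-rms e_t is then counted as a noise frame whenever e_t < theta.
```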
Spectral Subtraction: The noise estimate for spectral subtraction is computed from the running
average of the power spectra of frames that are classified as noise by the rms decision threshold.
The updates are done in block mode, for synchrony with the other parameter updates derived in the
next section:

    \mu^i = \lambda \cdot \frac{\sum_t X_t^i \, \delta_t^{no}}{\sum_t \delta_t^{no}} + (1 - \lambda) \cdot \mu^i \qquad \text{for all channels } i

in which the choice of \lambda and the block width determine the effective time constant. As spectral
subtraction is a power domain operation, power spectrum averages are used in computing the bias. The
signal spectral estimate, with the noise removed, is obtained as:

    \hat{X}^i = \max(X^i - \mu^i, \; SSTHR^i) \qquad \text{for all channels } i

The spectral subtraction threshold SSTHR^i must be chosen larger than or equal to zero, because power
spectra can never be negative. The choice of its actual value is very important, and how it is
derived on a channel dependent basis is discussed in a following section.

Figure : Spectral subtraction: input/output relationship in the log domain
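The noise-estimate update and the clipped subtraction translate almost directly into code. In this
sketch `mu` plays the role of \mu^i, `sst_thr` the role of SSTHR^i, and `lam` the smoothing constant
whose value, together with the block width, sets the effective time constant; the default value is a
placeholder.

```python
import numpy as np

def update_noise_estimate(mu, block_power, is_noise, lam=0.5):
    """Recursive block-mode noise estimate: lam * mean of noise-labeled frames
    in the block plus (1 - lam) * previous estimate, per channel."""
    if is_noise.any():
        block_mean = block_power[is_noise].mean(axis=0)
        mu = lam * block_mean + (1.0 - lam) * mu
    return mu

def spectral_subtraction(power, mu, sst_thr):
    """Power-domain subtraction clipped at the channel-dependent threshold."""
    return np.maximum(power - mu, sst_thr)

# power:    (frames, channels) raw power spectra X^i
# mu:       (channels,) running noise estimate
# is_noise: per-frame boolean decision e_t < theta from the log-rms threshold
```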
Estimating the Frequency Dependent Gains: Simultaneously with the log-rms histogram, histograms are
gathered of the log-energy in each channel. The "Channel Peak Energy" (CPE^i), a measurement of the
highest energy that can be expected in a given frequency channel, is defined as the  % upper
percentile of the speech distribution of a channel histogram. The contribution of speech and noise to
a histogram is determined on the basis of the rms decision criterion. Individual channel gains are
computed as:

    G^i = TARGET - CPE^i

and implemented as simple additions to the log spectra:

    y^i = \log \hat{X}^i + G^i

If histograms and their peaks were to be recomputed after long term adaptation, then all channel peak
energies would be identical to the selected TARGET value; hence the name "equalization" in Fig. .

The idea behind this derivation of the channel gains is that long observations of speech will roughly
have equal frequency peaks across the whole speech frequency range. The peaks of the histograms can
therefore be considered as a measurement of the combined transfer function of the environment and the
recording equipment rather than as a measurement of the utterance.

Figure : Spectral subtraction: impact on channel histograms
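A sketch of the gain computation under the same assumptions: the channel peak energy is taken as an
upper percentile of the speech-labeled log energies of each channel (the exact percentile used by the
system is not reproduced here, so the value below is a placeholder), and the gain is applied as a
simple addition in the log domain.

```python
import numpy as np

TARGET = 0.0              # placeholder target peak level in the log domain
SPEECH_PERCENTILE = 97.5  # placeholder for the upper-percentile definition of CPE

def channel_peak_energy(log_power, is_speech):
    """CPE^i: upper percentile of the speech-labeled log energies per channel."""
    return np.percentile(log_power[is_speech], SPEECH_PERCENTILE, axis=0)

def equalize(subtracted_power, cpe):
    """y^i = log(Xhat^i) + G^i with G^i = TARGET - CPE^i."""
    gain = TARGET - cpe
    return np.log(subtracted_power) + gain

# log_power:        (frames, channels) log energies used for the channel histograms
# is_speech:        per-frame boolean decision e_t >= theta
# subtracted_power: (frames, channels) output of spectral subtraction
```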
Choosing the Spectral Subtraction Threshold: The importance of the spectral subtraction threshold
SSTHR^i has been discussed at some length in [ ]. Most noise values are mapped onto this value
because of the inherent clipping in spectral subtraction. The chosen threshold value, modified by the
channel gain of the equalization, is the "typical noise image" of the channel after long term
adaptation. As the peak energy at the output of the long term adaptation is already fixed at the
TARGET value, it is sufficient to impose a fixed output dynamic range in order to create the desired
environment independent noise image. Hence the channel equalization must be anticipated by setting
the spectral subtraction threshold at a fixed level (DYN) below the channel peak energy:

    \log(SSTHR^i) = CPE^i - DYN

It remains to choose an appropriate value for DYN. The distortion introduced by spectral subtraction
is small if the threshold is chosen close to the original noise value. But because DYN must be fixed
in all circumstances, this can only be done for one environment. The quiet environment, with a
typical dynamic range in our applications of roughly  dB, is the most obvious candidate. The I/O
relationship for spectral subtraction, including this particular choice of the spectral subtraction
threshold, is illustrated in Fig.  on a log-log scale. In the example the channel peak energy is  dB
and the noise estimate  dB, which corresponds to a  dB dynamic range in this channel. This is typical
for a noisy, though not bad, environment. Fig.  shows the effect of spectral subtraction on a single
channel histogram, assuming histograms are computed immediately before and after spectral
subtraction. The thresholding produces a very large peak at the spectral subtraction threshold.
Immediately above it there is a very sparse region, due to the dynamic range stretching. Inputs more
than  dB above the noise estimate are hardly affected. If the original dynamic range is  dB or more,
then the noise region will be compressed instead of expanded.
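Putting the pieces of this subsection together, the threshold can be derived per channel from the
current peak-energy estimate and a fixed DYN; the sketch below also evaluates the resulting
log-domain input/output map of a single channel, of the kind illustrated in the figure referred to
above. The DYN value and the use of natural logarithms throughout are assumptions for illustration.

```python
import numpy as np

DYN = 30.0   # placeholder fixed output dynamic range in the log domain

def subtraction_threshold(cpe):
    """log(SSTHR^i) = CPE^i - DYN, returned in the power domain."""
    return np.exp(cpe - DYN)

def log_domain_io(x_log, noise_est_log, cpe):
    """Single-channel input/output map of spectral subtraction in the log domain."""
    x = np.exp(x_log)
    out = np.maximum(x - np.exp(noise_est_log), subtraction_threshold(cpe))
    return np.log(out)

# Inputs well above the noise estimate pass through almost unchanged; inputs at or
# below it are clipped to the fixed threshold, i.e. CPE - DYN in the log domain.
```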
IV. Additional Noise Immunity Processing

Long term adaptation using only spectral subtraction and equalization proved to be a great step
forward in terms of noise immunity, and satisfied the goal of introducing no degradation for the
noise free environments, as the results will show. However, its impact on the recognition system is
insufficient for the most demanding situations. Fig.  shows significant stretching of the dynamic
range for noisy situations. Residual noise (the output for non-thresholded noise input) can hence be
quite different from the typical noise image. Roughly  out of  channels contain residual noise. The
Euclidean distance between the residual noise of a clean and a noisy environment can be large,
leading to frequent labeling errors. This problem is acute for a quiet training environment and a
very noisy testing environment.

Two additional components are introduced in the long term adaptation to deal with these extreme
environments (Fig. ). They will have a slight negative influence on the optimal performance of the
system, but will further enhance the usability over different environments. The noise estimate and
channel peak energy computations are not shown in Fig. , but are identical to the ones in Fig. .

Figure : Long term adaptation with additional noise immunity processing (NOISIM)
Noise Addition: The first addition to the long term adaptation is noise masking. From previous
experiments we know that the IBM speech recognition system performs well in situations with a less
than optimal dynamic range [ ], given that a better than  dB SNR is retained. Noise immunity could
conceivably be improved by adding artificial noise into the system in a controlled manner, which can
easily be done after the channel equalization (Fig. ):

    z^i = \log\left( e^{y^i} + n \right)

and:

    \log n \sim N(\mu_A, \sigma_A^2)

Hence noise is added directly in the frequency domain. The distribution chosen for the additive noise
is log-normal, which closely resembles natural background noise. This noise masking has a dual
effect. First, it masks out most of the noise residuals, which are still environment dependent.
Second, it masks out low level events about which the system might learn when trained in a clean
environment but which it is unable to observe and use in a noisier one.

One could consider a system based on noise masking and channel equalization solely, without spectral
subtraction. For efficient masking, however, the selected mean noise level \mu_A must be several dB
above the natural noise floor of all anticipated environments. The level required for efficiently
masking the noise residuals left by spectral subtraction is much lower [ ].
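A sketch of the masking step under the stated assumptions: log-normally distributed noise is drawn
independently per frame and channel, added in the linear power domain after equalization, and the
result is converted back to the log domain; the mean and spread of the added noise are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
MU_A, SIGMA_A = -2.0, 0.5   # placeholder mean and spread of log n

def add_masking_noise(y_log):
    """z^i = log(exp(y^i) + n) with log n ~ N(MU_A, SIGMA_A^2)."""
    n = rng.lognormal(mean=MU_A, sigma=SIGMA_A, size=y_log.shape)
    return np.log(np.exp(y_log) + n)

# y_log: (frames, channels) equalized log spectra. Keeping MU_A fixed, several dB
# above the natural noise floor of all anticipated environments, is what makes the
# masked noise floor look the same during training and testing.
```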
Noise Sharpening: A second improvement comes from using cross channel correlation information.
Spectral subtraction, as described before, is a single channel operation in nature. But the log-rms
histograms have a much cleaner bimodal distribution than the single channel histograms, and
significantly smaller noise variances, indicating a high correlation between the noisy observations.
Noise sharpening exploits this correlation by modifying the power spectra based on the global
probability of being in a noise or in a speech region:

    \tilde{X}^i = p(no \mid e_t) \cdot \mu^i + p(sp \mid e_t) \cdot X^i

The probability that a frame at time t with log-rms value e_t is noise is given by the conditional
probabilities derived from the log-rms histogram fit. The variance of the single channel noise
distributions, before spectral subtraction, is reduced by this noise sharpening. This in turn results
in a reduced variance of the noise residual distribution. Most of the speech frames are barely
affected by this operation, but a few low amplitude speech events can be influenced in an unfavorable
way.
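The sharpening rule is a per-frame convex combination of the raw channel spectra and the noise
estimate, weighted by the global speech/noise probabilities from the log-rms fit; a sketch, applied
before spectral subtraction:

```python
import numpy as np

def noise_sharpen(power, mu, p_noise):
    """Xtilde^i = p(no|e_t) * mu^i + p(sp|e_t) * X^i for every frame and channel."""
    p_noise = p_noise[:, None]                  # (frames, 1) for broadcasting
    return p_noise * mu + (1.0 - p_noise) * power

# power:   (frames, channels) raw power spectra
# mu:      (channels,) current noise estimate
# p_noise: (frames,) global probability p(no|e_t) from the two-Gaussian fit
```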
V. Re-estimation of Noise Statistics

The fact that in some cases the noise labels produced during testing are a poor match to the ones on
which the hidden Markov models were trained is not necessarily a problem. The only basic requirement
for a robust system is that a string of noise labels can be distinguished from any speech sound. The
fact that current hidden Markov model systems, such as the IBM one, use a single training session is
not an optimal but a pragmatic design. Fully adaptive systems, that continuously update their
statistics when in use, are the obvious goal. Several problems, both computational and conceptual
ones, remain to be solved before such a system will be available. Adaptively estimating speech
statistics is difficult because an alignment has to be made against utterances that are only known
with a limited degree of accuracy. However, the problem of adapting the noise statistics only is much
simpler, because rather robust information for noise vs. speech discrimination is readily available
from the rms energy and the rms decision threshold.
Anticipation of Unseen Events: The realization that the signal processing is not perfect in creating
its single noise image leads to a first enhancement. Noise residuals exist and do depend on the
actual noise level. This imperfection can be compensated for by assigning significant probabilities
to all labels for the HMM silence model, including the ones that were not observed during training.
Adding uncertainty into a system in order to enhance its robustness has to be used with care, but can
be very helpful.
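One simple way to realize this, sketched below, is to floor the silence model's output distribution
so that every VQ label keeps a significant minimum probability and then renormalize; the floor value
and the renormalization step are illustrative assumptions, since the exact mechanism is not spelled
out above.

```python
import numpy as np

def floor_silence_distribution(p_label_given_no, floor=1e-3):
    """Assign a minimum probability to every VQ label for the silence model,
    including labels never observed during training, then renormalize."""
    p = np.maximum(np.asarray(p_label_given_no, dtype=float), floor)
    return p / p.sum()
```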
Adaptively Re-estimating the Noise Probabilities: For the purpose of re-estimating the noise
statistics, the probability that a given frame is noise is again derived from the binormal fit to the
rms histogram, using the conditional probabilities given earlier. If we further use:

    \delta_t^j = 1 \quad \text{if label } L_j \text{ was observed at time } t
    \delta_t^j = 0 \quad \text{otherwise}

then a good estimate for the output probability of label L_j for the silence model is:

    P(L_j \mid no) = \frac{\sum_t \delta_t^j \cdot p(no \mid e_t)}{\sum_t p(no \mid e_t)}

The summation typically extends over a few sentences. These new statistics are smoothed with the
original ones derived from full HMM training.
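The re-estimation amounts to a soft count of each label weighted by the frame's noise probability,
normalized and then smoothed with the originally trained silence statistics; the smoothing weight in
the sketch below is an assumption, since only the fact of smoothing is stated above.

```python
import numpy as np

def reestimate_silence_output(labels, p_noise, n_labels, p_trained, w=0.5):
    """P(L_j|no) = sum_t delta_t^j * p(no|e_t) / sum_t p(no|e_t), smoothed with
    the originally trained distribution using an assumed weight w."""
    soft_counts = np.zeros(n_labels)
    np.add.at(soft_counts, labels, p_noise)       # accumulate p(no|e_t) per label
    p_adapted = soft_counts / p_noise.sum()
    return w * p_adapted + (1.0 - w) * p_trained

# labels:  (frames,) VQ label index L_j observed at each frame t
# p_noise: (frames,) p(no|e_t) from the binormal fit to the rms histogram
```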
VI. Experiments

Test data: The described algorithms were tested on the IBM isolated word speech recognition system
using a  word office correspondence vocabulary. A typical experiment has a training script of
sentences containing  words and a test script of  sentences containing  words. Two sets of
recordings, each set containing both training and testing material, were used from five speakers,
three male (MAP, PDM and ABK) and two female (NBD and MAM).

One set was made in a rather noise free environment and will be referred to as "CLEAN"; the other was
made in a noisy laboratory and will be referred to as "NOISY". Differences between the training and
testing environment for the same condition and speaker are small. The clean environment has a signal
to noise ratio varying from  to  dB; the noisy one typically has a  to  dB signal to noise ratio
(Table ). All recordings were made using a Crown PZM S table mounted microphone; no special effort
was made to control the acoustics.
Signal to noise ratio (SNR), as always, is a subjective measure. For a single channel it is defined
as the difference between the channel peak energy and the noise estimate. For the full signal
(Table ) it is defined as the average over the  channels. Differences of  to  dB in SNR between
channels are common. The observed variations during a single recording session tend to be small
compared to the gross changes between sessions. Hence the experimental results presented are
indicative of how well an algorithm can deal with gross day to day changes, but not of how quickly it
can adapt. Fig.  shows the range of values for the channel peak energies and noise estimates for the
noisy test corpus for PDM.

(The noisy background for MAP was artificially achieved by adding 1/f noise to the clean speech at
about  dB above the natural noise level.)

ENVIR    MAP    PDM    NBD    MAM    ABK
CLEAN
NOISY

Table : Typical signal to noise ratios in the different environments for the five speakers.

Figure : Range of channel peak energies and noise estimates for the NOISY testing corpus for speaker
PDM.
Results obtained with the older system without noise immunity processing (SIGPSTD) in a constant
environment are used as a reference. NOISIM and NOISIM are used to refer to the signal processing
schemes incorporating the long term adaptation of Figs.  and  respectively. In all experiments the
histograms are restarted after every  secs of (presumed) speech. The parameters derived from them are
smoothed on an equal weight basis with the previous estimates, yielding effective time constants of
to  secs depending on the mixing ratio of speech and silence.
Representation of Results: For each algorithm the number of errors for each speaker in the  word task
is reported, plus the average percentage error rate over the five speakers. The error rates given are
for substitution and deletion errors combined; substitution errors are generally the dominant source
of errors. Insertion errors also occur, but typically at only a fraction of the substitution error
rate. In some of the more demanding noise environments noise gets sufficiently mistaken for speech
that higher insertion rates do occur; this is indicated in the results by (+). Also the average
decoding time for an experiment over the five speakers is given, in CPU minutes on an IBM mainframe.
The error rate is the more important criterion on which an algorithm will be judged, but the decoding
time also gives an idea of its efficiency. Based on experimental evidence it is estimated that
relative changes in error rate of less than  % for a single speaker and less than  % for the five
speaker average are statistically insignificant.
Baseline Experiments: Baseline results were obtained with SIGPSTD, doing training and testing in the
same environment (Table ). Training in the clean environment and decoding in the noisy one, or vice
versa, produced average error rates of more than  %; no precise error counts are given because the
many insertion errors made counting according to the above principles barely possible.
Results with Noise Immunity derived from Signal Processing only: The spectral subtraction threshold
in NOISIM is chosen such that a  dB dynamic range (the value DYN introduced above) is available after
spectral subtraction in all environments. The results obtained with this implementation are given in
Table .

(Mainframe CPU minutes cannot readily be interpreted as a percentage of real-time performance on the
portable Tangora system, because of differences in implementation.)
ENVIRONMENT            CLEAN    NOISY
MAP
PDM
NBD
MAM
ABK
AVG (# errors)
AVG (% errors)
AVG (decoding time)

Table : Performance of SIGPSTD in a constant environment.
       TRAIN=CLEAN              TRAIN=NOISY
SPK    TEST=CL    TEST=NO       TEST

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket