SPEECH RECOGNITION SYSTEM

Dirk Van Compernolle*

Department of Electrical Engineering - ESAT†
Katholieke Universiteit Leuven
Kardinaal Mercierlaan
B- Heverlee, Belgium
Abstract

Several ways for making the signal processing in an isolated word speech recognition system more robust against large variations in the background noise level are presented. Isolated word recognition systems are sensitive to accurate silence detection, and are easily overtrained on the specific noise circumstances of the training environment. Spectral subtraction provides good noise immunity in the cases where the noise level is lower or slightly higher in the testing environment than during training. Differences in residual noise energy after spectral subtraction between a clean training and a noisy testing environment can still cause severe problems. The usability of spectral subtraction is largely increased if complemented with some extra noise immunity processing. This is achieved by the addition of artificial noise after spectral subtraction or by adaptively re-estimating the noise statistics during a training session. Both techniques are almost equally successful in dealing with the noise. Noise addition achieves the additional robustness that the system will never be allowed to learn about low amplitude events that might not be observable in all environments; this, however, at the cost that some information is consistently thrown away in the most favorable noise situations.

* Research Associate of the National Fund for Scientific Research of Belgium (N.F.W.O.)
† The work was performed while the author was at IBM T.J. Watson Research Center, Yorktown Heights, NY.

IPR No. 2017-00627
Apple Inc. v. Andrea Electronics Inc. - Ex. 1033, p. 1
I. Introduction

The signal processing [ ] used in IBM's real-time speech recognizer [ ] has performed well in controlled environments. At the time of development the system was exclusively used in a quiet office with a typical signal to noise ratio of about  dB. Day to day variations in a speaker's voice level had to be accounted for, but the acoustic environment changed little. The development of the portable Tangora system [ ] allowed for use of the recognition system in a place at the convenience of the user, but simultaneously created a new set of problems. It soon became clear that the existing signal processing was not able to deal with the large variations in acoustic backgrounds that the system was now exposed to. Sensitivity to the absolute noise level was moderate, i.e. when the system was tested in the same acoustic environment as the one in which it was trained. Under those conditions a drop in signal to noise ratio from  to  dB corresponded to a doubling in error rate. But a difference in the signal to noise ratio larger than  dB between the training and testing environment led to severe deterioration in performance. Using a lip-mike can in principle solve most noise problems in office applications. The goal at IBM has been, however, to use the recognizer with a less constraining table mounted microphone.

A "silence model", representing the short pauses between words, is one of the hidden Markov models used in the isolated word recognition system. The inclusion of the silence model as a regular hidden Markov model (HMM) is necessary to avoid the otherwise difficult task of endpoint detection. It is trained in conjunction with the speech models and hence learns about the noise characteristics of the training session. Large differences in the noise level of training and testing environments will cause the recognizer to mistake speech for noise or noise for speech. Two rules for robust signal processing for speech recognition emerge: the processing of speech events must be largely invariant to the noise level, and noise, independent of its actual value, should be mapped into a typical noise image. This is a normalization and adaptation task and therefore quite distinct from actual noise removal, which is the goal in speech enhancement [ ].
The approach taken in this work was to maintain the general signal processing structure of the existing system, which has proven successful across many speakers in controlled constant acoustic environments, and to complement it with noise immunity processing that is largely transparent for the low noise environments but effective in dealing with changes in the background noise level. In section II the signal processing of the IBM speech recognition system is reviewed. In section III we introduce the basic noise immunity components of the signal processing, which are spectral subtraction and a frequency dependent channel equalizer. Both operations rely heavily on a histogram based speech/noise discrimination algorithm. Refinements to the noise immunity signal processing and the statistical silence model are explored in sections IV and V. Finally, in section VI results obtained on the IBM speech recognition system for all of the presented schemes are given and analyzed.
II. Signal Processing Overview

The signal processing (Fig. 1) converts an input signal, sampled at  kHz, into a -dimensional output vector at a frame rate of  frames/sec. It can be divided into five parts: Fourier transform, simulated filterbank, log conversion, long term adaptation and short term adaptation.

Figure 1: Signal processing block diagram (SIGPSTD)
All blocks, except for the long term adaptation, are identical to the design described in [ ]. The FFT uses a -point Hanning window and creates a new spectral vector every  msec. A mel-scaled filterbank is simulated by adding up FFT power spectrum coefficients.  channels spanning the frequency range from  Hz to  Hz are created. The long term adaptation, which is described in detail in the next section, plays a normalization role. Its output is ideally independent of the current acoustic environment. The short term adaptation is based on a Schroeder-Hall haircell model [ ]. It is modeled after the rapid adaptation seen
in neural firing rate according to changes in input level. The time constants, dependent on the actual input, are between  and  msec. The importance of this block for the stationary parts of speech is minimal, but quite significant for the transient parts.

The -dimensional output vector is labeled by a vector quantizer with a codebook of size . These labels are the inputs to the HMM based recognition system [ ]. The VQ codebook is designed by K-means clustering of training data [ ]. This particular way of labeling, especially the use of a Euclidean distance metric in finding a closest prototype, has a direct impact on the signal processing.
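The labeling step described above is a plain nearest-prototype search under the squared Euclidean distance. A minimal sketch in Python (function and variable names are illustrative, not from the original system):

```python
def nearest_prototype(v, codebook):
    """Label a feature vector with the index of the closest codebook
    prototype under the squared Euclidean distance."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(v, c))
        if d < best_d:
            best, best_d = j, d
    return best
```

Because only the index of the minimum matters, the square root of the distance is never taken.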
III. Noise Correction and Acoustic Adaptation

The function of the signal processing block "long term adaptation" is to compensate for the large day to day variations in the environment, and map the observed variable dynamic ranges into a fixed dynamic range about which the system is allowed to learn. The variations include changes in the background noise, room acoustics and recording hardware.

Symbols used in Fig. 2:
    X^i       raw power spectral estimate
    theta     rms noise threshold
    N^i       noise estimate
    SSTHR^i   spectral subtraction threshold
    G^i       channel gain
    y^i       adapted spectral estimate

Figure 2: Basic long term adaptation for a single channel (NOISIM1)

The system proposed here (Fig. 2) splits the correction for the noise and the adaptation to the room acoustics into two largely independent parts, but some linkage between both

Notation: capital letters are used for power spectrum variables, small letters for log spectrum variables. Channel numbers are given as superscripts.
parts is crucial to the overall success. Spectral subtraction [ ] is used for the noise correction. A channel equalization, which is somewhat like a frequency dependent automatic gain control, compensates for the variable microphone and room acoustic transfer functions. Both operations need a reasonably good noise/speech discriminator, which in this work is based on signal rms energy only.

No effort is made within the signal processing to remove impulsive noises, such as slamming doors, coughing etc. from the signal. Hence all underlying parameters can truly be modeled as slowly varying. Before any long term adaptation processing, histograms are collected of the total and single channel log-rms energies as shown in Fig. 3. Long term adaptation parameters are derived from these histograms or from running averages. Parameter estimates are updated in block mode at regular intervals, typically  secs, at which time the histograms are also restarted.
Use of Log-rms Histograms. Rms signal energy is computed over the frequency range used by the recognizer ( Hz to  Hz). Histograms of the log of signal energy typically have a clean bimodal distribution [ ], corresponding to "noise only" or "noise + speech"; these can be considered as the two basic states of the system. The noise contribution to the histograms, and to a lesser extent the speech contribution, tends to be modeled well by a normal distribution (Fig. 3). This modeling has as basic assumption that speech, when present, largely dominates the noise. Hence it is only suitable for applications with significant positive signal to noise ratios.

Using index "no" for the noise distribution and "sp" for the speech distribution, the joint probabilities of state and current log-rms observation e_t can be written as:
    p_k(e_t) = alpha_k / (sqrt(2 pi) sigma_k) * exp( -(e_t - mu_k)^2 / (2 sigma_k^2) ),   k = no, sp    (1)
which now allows for writing the conditional probability of being in silence, given an rms observation, as:

    p(no|e_t) = p_no(e_t) / ( p_no(e_t) + p_sp(e_t) )    (2)
and also:

    p(sp|e_t) = 1 - p(no|e_t)    (3)
Figure 3: Log-rms histogram fitted with gaussians
The generating mixing ratio, means and variances (alpha_k, mu_k and sigma_k) of the best fit to the histogram with a mixture of gaussians can be found using a standard EM procedure [ ]. A speech/noise decision threshold theta is derived from this fit as the point of equal probability on the two gaussian distributions:

    p(no|theta) = p(sp|theta) = 1/2    (4a)
    p_no(theta) = p_sp(theta)          (4b)
The detection of a noise frame can conveniently be written as:

    delta_t^no = 1   if e_t <= theta    (5a)
               = 0   if e_t >  theta    (5b)
This decision criterion is definitely not perfect on a frame by frame basis and would not be suitable for the purpose of word boundary detection. But sufficiently few errors are made that excellent statistics about the noise environment can be gathered on the basis of it. An additional advantage is that the described derivation of theta is suited for a mixed input consisting of noise and speech, and that it does not require occasional long pauses for measuring noise statistics, nor is any previous knowledge about the noise level required. In an isolated word speech recognition system, pauses in between words and sentences typically take up to  % of all input, hence providing ample information about the environment.
Spectral Subtraction. The noise estimate for spectral subtraction is computed from the running average of power spectra of frames that are classified as noise by the rms decision
threshold. The updates are done in block mode for synchrony with the other parameter updates derived in the next section:

    N^i <- beta * ( sum_t delta_t^no X_t^i / sum_t delta_t^no ) + (1 - beta) * N^i,   for all channels i    (6)

in which the choice of beta and the block width determine the effective time constant. As spectral subtraction is a power domain operation, power spectrum averages are used in computing the bias. The signal spectral estimate, with the noise removed, is obtained as:

    X_hat^i = max( X^i - N^i, SSTHR^i ),   for all channels i    (7)

The spectral subtraction threshold SSTHR^i must be chosen larger than or equal to 0, because power spectra can never be negative. The choice of its actual value is very important, and how it is derived on a channel dependent basis is discussed in a following section.
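A minimal sketch of the block-mode noise estimate update and the floored power-domain subtraction described above (illustrative names; `beta` plays the role of the smoothing constant, and the per-frame noise decisions come from the rms threshold):

```python
def update_noise_estimate(prev, frames, is_noise, beta=0.5):
    """Block-mode update of the per-channel noise estimate.

    `frames` is a list of power-spectrum vectors for one block and
    `is_noise` the corresponding rms-threshold decisions.  The block
    average over noise frames is mixed with the previous estimate;
    `beta` and the block width set the effective time constant.
    """
    noise = [f for f, d in zip(frames, is_noise) if d]
    if not noise:                        # no noise frames in this block
        return list(prev)
    n_ch = len(prev)
    avg = [sum(f[i] for f in noise) / len(noise) for i in range(n_ch)]
    return [beta * avg[i] + (1.0 - beta) * prev[i] for i in range(n_ch)]

def spectral_subtraction(X, N, ssthr):
    """Power-domain subtraction, floored per channel at the threshold."""
    return [max(X[i] - N[i], ssthr[i]) for i in range(len(X))]
```

The flooring in the second function is what maps most noise inputs onto the threshold value, producing the "typical noise image" discussed later.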
Figure 4: Spectral subtraction: input/output relationship in the log domain
Estimating the Frequency Dependent Gains. Simultaneously with the log-rms histogram, histograms are gathered of the log-energy in each channel. The "Channel Peak Energy" (CPE^i), a measurement for the highest energy that can be expected in a given frequency channel, is defined as the  % upper percentile on the speech distribution of a channel histogram. The contribution of speech and noise to a histogram is separated on the basis of the rms decision criterion. Individual channel gains are computed as:

    G^i = TARGET - CPE^i    (8)

and implemented as simple additions to the log spectra:

    y^i = log X_hat^i + G^i    (9)

If histograms and their peaks were to be recomputed after long term adaptation, then all channel peak energies would be identical to the selected TARGET value, whence the name "equalization" in Fig. 2.
The idea behind this derivation of channel gains is that long observations of speech will roughly have equal frequency peaks across the whole speech frequency range. The peaks of the histograms can therefore be considered as a measurement of the combined transfer function of the environment and the recording equipment rather than as a measurement of the utterance.
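The gain derivation can be sketched as follows. The 97.5 % percentile is an assumed default, since the exact figure is not legible in this copy, and the per-channel energy lists stand in for the speech part of the channel histograms:

```python
def channel_peak_energy(speech_log_energies, pct=97.5):
    """Upper percentile of the speech part of one channel's log-energy
    histogram (`pct` is an assumed value)."""
    s = sorted(speech_log_energies)
    idx = min(len(s) - 1, int(len(s) * pct / 100.0))
    return s[idx]

def channel_gains(per_channel_speech_energies, target):
    """One additive log-domain gain per channel: G^i = TARGET - CPE^i."""
    return [target - channel_peak_energy(ch) for ch in per_channel_speech_energies]
```

Adding these gains to the log spectra pulls every channel's speech peak toward the common TARGET level, which is the equalization effect described above.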
Figure 5: Spectral subtraction: impact on channel histograms
Choosing the Spectral Subtraction Threshold. The importance of the spectral subtraction threshold SSTHR^i (7) has been discussed at some length in [ ]. Most noise values are mapped into this value because of the inherent clipping in spectral subtraction. The chosen threshold value, modified by the channel gain of the equalization, is the "typical noise image" of the channel after long term adaptation. As the peak energy at the output of the long term adaptation is already fixed at the TARGET value, it is sufficient to fix the output dynamic range in order to create the desired environment independent noise image. Hence the channel equalization (9) must be anticipated by setting the spectral subtraction threshold at a fixed level (DYN) below the channel peak energy:

    log( SSTHR^i ) = CPE^i - DYN    (10)

It remains to choose an appropriate value for DYN in (10). The distortion introduced by spectral subtraction (7) is small if the threshold is chosen close to the original noise value. But because DYN must be fixed in all circumstances, this can only be done for one environment. The quiet environment, with a typical dynamic range in our applications of roughly  dB, is the most obvious candidate. The I/O relationship for spectral subtraction, including this particular choice of the spectral subtraction threshold, is illustrated in Fig. 4 on a log-log scale. In the example the channel peak energy is  dB and the noise estimate  dB, which corresponds to a  dB dynamic range in this channel. This is typical for a noisy, though not bad, environment. Fig. 5 shows the effect of spectral subtraction on a single channel histogram,
assuming histograms are computed immediately before and after spectral subtraction. The thresholding produces a very large peak at the spectral subtraction threshold. Immediately above it there is a very sparse region, due to the dynamic range stretching. Inputs more than  dB above the noise estimate are hardly affected. If the original dynamic range is  dB or more, then the noise region will be compressed instead of expanded.
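The input/output behaviour described here can be reproduced with a small helper. Log10 power units are an assumption (any fixed log base gives the same curve shape); the example values are illustrative:

```python
import math

def ss_output_log(x_log, noise_log, cpe_log, dyn):
    """Log-domain input/output curve of spectral subtraction when the
    threshold is set a fixed level DYN below the channel peak energy.
    All quantities are log10 power values."""
    ssthr = 10.0 ** (cpe_log - dyn)      # threshold, fixed DYN below CPE
    x = 10.0 ** x_log
    n = 10.0 ** noise_log
    return math.log10(max(x - n, ssthr))
```

Inputs at or below the noise estimate clip to the threshold level `cpe_log - dyn`, inputs just above the noise fall in the stretched sparse region, and inputs well above the noise pass through nearly unchanged.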
IV. Additional Noise Immunity Processing

Long term adaptation using only spectral subtraction and equalization proved to be a great step forward in terms of noise immunity, and satisfied the goal of introducing no degradation for the noise free environments, as the results will show. However, the impact on the recognition system is insufficient for the most demanding situations. Fig. 4 shows significant stretching of the dynamic range for noisy situations. Residual noise (output for non-thresholded noise input) can hence be quite different from the typical noise image. Roughly  out of  channels contain residual noise. The Euclidean distance between the residual noise of a clean and a noisy environment can be large, leading to frequent labeling errors. This problem is acute for a quiet training environment and a very noisy testing environment.

Two additional components are introduced in the long term adaptation to deal with these extreme environments (Fig. 6). They have a slight negative influence on the optimal performance of the system, but further enhance the usability over different environments. Noise estimate and channel peak energy computations are not shown in Fig. 6, but are identical to the ones in Fig. 2.
Figure 6: Long term adaptation with additional noise immunity processing (NOISIM2)
Noise Addition. The first addition to long term adaptation is noise masking. From previous experiments we know that the IBM speech recognition system performs well in situations
with a less than optimal dynamic range [ ], given that a better than  dB SNR is retained. Noise immunity could conceivably be improved by adding artificial noise into the system in a controlled manner, which can easily be done after the channel equalization (Fig. 6):

    z^i = log( exp(y^i) + n )    (11)

and:

    log n ~ N( mu_A, sigma_A^2 )    (12)

Hence noise is added directly in the frequency domain. The distribution chosen for the additive noise is log-normal, which closely resembles natural background noise. This noise masking has a dual effect. First, it masks out most of the noise residuals, which are still environment dependent. Second, it masks out low level events about which the system might learn when trained in a clean environment but is unable to observe and use in a noisier one.

One could consider a system based on noise masking and channel equalization solely, without spectral subtraction. For efficient masking, however, the selected mean noise level mu_A must then be several dB above the natural noise floor of all anticipated environments. The level required for efficiently masking the noise residuals in spectral subtraction is much lower [ ].
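A per-channel sketch of the masking operation, assuming natural logarithms for the adapted log spectra (the base is not legible in this copy; any fixed base works the same way):

```python
import math
import random

def add_masking_noise(y, mu_a, sigma_a, rng=random):
    """Add artificial log-normally distributed noise to one adapted log
    spectral value y: z = log(exp(y) + n) with log n ~ N(mu_a, sigma_a^2).
    The addition is done in the power domain, per channel and per frame."""
    n = math.exp(rng.gauss(mu_a, sigma_a))
    return math.log(math.exp(y) + n)
```

Channels well above the masking level pass through nearly unchanged, while low-level values (noise residuals, faint events) are pulled up to a common noise floor around mu_a.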
Noise Sharpening. A second improvement comes from using cross channel correlation information. Spectral subtraction, as described before, is a single channel operation in nature. But the log-rms histograms have a much cleaner bimodal distribution than single channel histograms, and significantly smaller noise variances, indicating high correlation between noisy observations. Noise sharpening exploits this correlation by modifying the power spectra based on the global probability of being in a noise or in a speech region:

    X_tilde^i = p(no|e_t) * N^i + p(sp|e_t) * X^i    (13)

The probability that a frame at time t with log-rms value e_t is noise is given by (2), (3). The variance of the single channel noise distributions, before spectral subtraction, is reduced by this noise sharpening. This then results in a reduced variance of the noise residual distribution.
Most of the speech frames are barely affected by (13), but a few low amplitude speech events can be influenced in an unfavorable way.
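Noise sharpening reduces to a per-channel convex combination of the raw spectrum and the noise estimate; a minimal sketch:

```python
def noise_sharpen(X, N, p_no):
    """Mix each channel's raw power spectrum with the noise estimate
    according to the global (rms-based) probability p_no that the
    current frame is noise; p(sp|e_t) = 1 - p_no."""
    return [p_no * N[i] + (1.0 - p_no) * X[i] for i in range(len(X))]
```

Frames that are confidently noise are pulled onto the noise estimate (sharpening the noise distribution), while confident speech frames pass through untouched.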
V. Re-estimation of Noise Statistics

The fact that in some cases noise labels produced during testing are a poor match to the ones on which the hidden Markov models were trained is not necessarily a problem. The only basic requirement for a robust system is that a string of noise labels can be distinguished from any speech sound. The fact that current hidden Markov model systems, such as the IBM one, use a single training session is not an optimal but a pragmatic design. Fully adaptive systems, that continuously update their statistics when in use, are the obvious goal. Several problems, both computational and conceptual, remain to be solved before such a system will be available. Adaptively estimating speech statistics is difficult because an alignment has to be made against utterances that are only known with a limited degree of accuracy. However, the problem of adapting noise statistics only is much simpler, because rather robust information for noise vs. speech discrimination is readily available from the rms energy and the rms decision threshold.
Anticipation of Unseen Events. The realization that the signal processing is not perfect in creating its single noise image leads to a first enhancement. Noise residuals exist and do depend on the actual noise level. This imperfection can be compensated for by assigning significant probabilities to all labels for the HMM silence model, including the ones that were not observed during training. Adding uncertainty to a system in order to enhance its robustness has to be used with care, but can be very helpful.
Adaptively Re-estimating the Noise Probabilities. For the purpose of re-estimating the noise statistics, the probability that a given frame is noise is again derived from the binormal fit to the rms histogram, on the basis of Eqs. (2), (3). If we further use:

    delta_t^j = 1   if label L_j was observed at time t
              = 0   otherwise

then a good estimate for the output probability of label L_j for the silence model is:

    P(L_j | no) = sum_t delta_t^j p(no|e_t) / sum_t p(no|e_t)    (14)

The summation in (14) typically extends over a few sentences. These new statistics are smoothed with the original ones derived from full HMM training.
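The soft re-estimate can be sketched as below. The optional `eps` floor is an added illustration of the "anticipation of unseen events" idea from the previous subsection, not part of the formula above:

```python
def silence_label_probs(labels, p_no, n_labels, eps=0.0):
    """Soft re-estimate of the silence model's output distribution: each
    frame's observed VQ label is weighted by its posterior probability
    of being noise.  `eps` optionally smooths toward a uniform floor so
    labels unseen during noise keep nonzero probability."""
    num = [eps] * n_labels
    den = eps * n_labels
    for lab, p in zip(labels, p_no):
        num[lab] += p
        den += p
    return [x / den for x in num]
```

Frames that are confidently speech (p_no near 0) contribute almost nothing, so no hard frame-level segmentation is needed.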
VI. Experiments

Test Data. The described algorithms were tested on the IBM isolated word speech recognition system using a  word office correspondence vocabulary. A typical experiment has a training script of  sentences containing  words and a test script of  sentences containing  words. Two sets of recordings, each set containing both training and testing, were used from five speakers, three male (MAP, PDM and ABK) and two female (NBD and MAM).
One set was made in a rather noise free environment and will be referred to as "CLEAN"; the other was made in a noisy laboratory and will be referred to as "NOISY". Differences between the training and testing environment for the same condition and speaker are small. The clean environment has a signal to noise ratio varying from  to  dB; the noisy one typically has a  to  dB signal to noise ratio (Table 1). All recordings were made using a Crown PZM S table mounted microphone; no special effort was made to control the acoustics.

Signal to noise ratio (SNR), as always, is a subjective measure. For a single channel it is defined as the difference between the channel peak energy and the noise estimate. For the full signal (Table 1) it is defined as the average over the channels. Differences of  to  dB in SNR between channels are common. The observed variations during a single recording session tend to be small compared to the gross changes between sessions. Hence

Note: the noisy background for MAP was artificially achieved by adding 1/f noise to the clean speech about  dB above the natural noise level.
    ENVIR    MAP    PDM    NBD    MAM    ABK
    CLEAN
    NOISY

Table 1: Typical signal to noise ratios for the different environments for the five speakers.
the experimental results presented are indicative of how well an algorithm can deal with gross day to day changes, but not of how quickly it can adapt. Fig. 7 shows the range of values for channel peak energies and noise estimates for the noisy test corpus for PDM.

Figure 7: Range of channel peak energies and noise estimates for the NOISY testing corpus for speaker PDM.
Results obtained with the older system without noise immunity processing (SIGPSTD) in a constant environment are used as reference. NOISIM1 and NOISIM2 are used to refer to the signal processing schemes incorporating the long term adaptation of Figs. 2 and 6 respectively. In all experiments the histograms are restarted after every  secs of (presumed)
speech. The parameters derived from them are smoothed on an equal weight basis with the previous estimates, yielding effective time constants of  to  secs depending on the mixing ratio of speech and silence.
Representation of Results. For each different algorithm the number of errors for each speaker is reported on the  word task, plus the average percentage error rate over the speakers. The error rates given are for substitution and deletion errors combined; substitution errors are generally the dominating source of errors. Insertion errors also occur, but typically at a rate of  /  or less of the substitution error rate. In some of the more demanding noise environments noise gets sufficiently mistaken for speech and higher insertion rates do occur; this is indicated in the results by (+). Also the average decoding time for an experiment over the five speakers is given in CPU minutes on an IBM mainframe. The error rate is the more important criterion on which an algorithm will be judged, but the decoding time also gives an idea of its efficiency. Based on experimental evidence it is estimated that relative changes in error rate of less than  % for a single speaker and less than  % for the speaker average are statistically insignificant.
Baseline Experiments. Baseline results were obtained with SIGPSTD, doing training and testing in the same environment (Table 2). Training in the clean environment and decoding in the noisy one, or vice versa, produced average error rates of more than  %; no precise error counts are given because the many insertion errors made counting by the above principles barely possible.
Results with Noise Immunity Derived from Signal Processing Only. The spectral subtraction threshold in NOISIM1 is chosen such that a  dB dynamic range (DYN in (10)) is available after spectral subtraction in all environments. The results obtained with this implementation are given in Table 3.

Note: mainframe CPU minutes cannot readily be interpreted as a percentage of real-time performance on the portable Tangora system, because of differences in implementation.
    ENVIRONMENT              CLEAN    NOISY
    MAP
    PDM
    NBD
    MAM
    ABK
    AVG (# errors)
    AVG (% errors)
    AVG (decoding time)       min.     min.

Table 2: Performance of SIGPSTD in a constant environment.

    TRAIN=CLEAN        TRAIN=NOISY
    SPK    TEST=CL    TEST=NO    TEST