`
UNITED STATES PATENT APPLICATION
`
`for
`
Voice Activity Detector (VAD)-Based Multiple-Microphone Acoustic Noise Suppression
`
`Inventors:
`
`Gregory C. Burnett
`
Eric F. Breitfeller
`
`Prepared by
`
`Shemwell Gregory & Courtney LLP
`4880 Stevens Creek Blvd., Suite 201
`San Jose, CA 95129
`408-236-6647
`
Attorney Docket No. ALPH.P010X
`
`EXPRESS MAIL CERTIFICATE OF MAILING
`
`"Express Mail" mailing label number: EV 326 938 875 US
`Date of Deposit:
`September 18, 2003
`I hereby certify that this paper is being deposited with the United States Postal
`Service "Express Mail Post Office to Addressee" service under 37 CFR § 1.10 on the date
`indicated above and is addressed to Mail Stop Patent Application, Commissioner for
`Patents, PO Box 1450, Alexandria, VA 22313-1450.
`
`
Voice Activity Detector (VAD)-Based Multiple-Microphone Acoustic Noise
`
`Suppression
`
`RELATED APPLICATIONS
`
This patent application is a continuation-in-part of United States Patent

Application Number 09/905,361, filed July 12, 2001, which claims priority from United
`
`States Patent Application Number 60/219,297, filed July 19, 2000. This patent
`
`application also claims priority from United States Patent Application Number
`
`10/383,162, filed March 5, 2003.
`
`
`FIELD OF THE INVENTION
`
`The disclosed embodiments relate to systems and methods for detecting and
`
`processing a desired signal in the presence of acoustic noise.
`
`
`BACKGROUND
`
`Many noise suppression algorithms and techniques have been developed over the
`
`years. Most of the noise suppression systems in use today for speech communication
`
systems are based on a single-microphone spectral subtraction technique first developed in
`
`the 1970's and described, for example, by S. F. Boll in "Suppression of Acoustic Noise in
`
`
`Speech using Spectral Subtraction," IEEE Trans. on ASSP, pp. 113-120, 1979. These
`
`techniques have been refined over the years, but the basic principles of operation have
`
`remained the same. See, for example, United States Patent Number 5,687,243 of
`
McLaughlin, et al., and United States Patent Number 4,811,404 of Vilmur, et al.
`
`Generally, these techniques make use of a microphone-based Voice Activity Detector
`
(VAD) to determine the background noise characteristics, where "voice" is generally
`
`understood to include human voiced speech, unvoiced speech, or a combination of voiced
`
`and unvoiced speech.
`
The VAD has also been used in digital cellular systems. As an example of such a

use, see United States Patent Number 6,453,291 of Ashley, where a VAD configuration
`
`
`appropriate to the front-end of a digital cellular system is described. Further, some Code
`
Division Multiple Access (CDMA) systems utilize a VAD to minimize the effective radio
`
`spectrum used, thereby allowing for more system capacity. Also, Global System for
`
`
Mobile Communication (GSM) systems can include a VAD to reduce co-channel
`
`interference and to reduce battery consumption on the client or subscriber device.
`
These typical microphone-based VAD systems are significantly limited in
`
`capability as a result of the addition of environmental acoustic noise to the desired speech
`
`
`signal received by the single microphone, wherein the analysis is performed using typical
`
`signal processing techniques. In particular, limitations in performance of these
`
microphone-based VAD systems are noted when processing signals having a low signal-

to-noise ratio (SNR), and in settings where the background noise varies quickly. Thus,

similar limitations are found in noise suppression systems using these microphone-based

VADs.
`
`
`BRIEF DESCRIPTION OF THE FIGURES
`
`Figure 1 is a block diagram of a denoising system, under an embodiment.
`Figure 2 is a block diagram including components of a noise removal algorithm,
`under the denoising system of an embodiment assuming a single noise source and direct
`
`
`paths to the microphones.
`
Figure 3 is a block diagram including front-end components of a noise removal

algorithm of an embodiment generalized to n distinct noise sources (these noise sources
`
`may be reflections or echoes of one another).
`
`Figure 4 is a block diagram including front-end components of a noise removal
`
algorithm of an embodiment in a general case where there are n distinct noise sources and
`
`signal reflections.
`
`Figure 5 is a flow diagram of a denoising method, under an embodiment.
`
`Figure 6 shows results of a noise suppression algorithm of an embodiment for an
`
`American English female speaker in the presence of airport terminal noise that includes
`
many other human speakers and public announcements.
`
`Figure 7A is a block diagram of a Voice Activity Detector (VAD) system
`
including hardware for use in receiving and processing signals relating to VAD, under an
`
`embodiment.
`
Figure 7B is a block diagram of a VAD system using hardware of a coupled noise

suppression system for use in receiving VAD information, under an alternative
`
`embodiment.
`
`Figure 8 is a flow diagram of a method for determining voiced and unvoiced
`
speech using an accelerometer-based VAD, under an embodiment.
`
`Figure 9 shows plots including a noisy audio signal (live recording) along with a
`
corresponding accelerometer-based VAD signal, the corresponding accelerometer output
`
`signal, and the denoised audio signal following processing by the noise suppression
`
system using the VAD signal, under an embodiment.
`
`Figure 10 shows plots including a noisy audio signal (live recording) along with a
`
corresponding SSM-based VAD signal, the corresponding SSM output signal, and the

denoised audio signal following processing by the noise suppression system using the

VAD signal, under an embodiment.
`
`
`Figure 11 shows plots including a noisy audio signal (live recording) along with a
`
corresponding GEMS-based VAD signal, the corresponding GEMS output signal, and the
`
`denoised audio signal following processing by the noise suppression system using the
`
VAD signal, under an embodiment.
`
`
`DETAILED DESCRIPTION
`
`The following description provides specific details for a thorough understanding
`
`of, and enabling description for, embodiments of the noise suppression system.
`
`However, one skilled in the art will understand that the invention may be practiced
`
without these details. In other instances, well-known structures and functions have not
`
`been shown or described in detail to avoid unnecessarily obscuring the description of the
`
embodiments of the noise suppression system. In the following description, "signal"
`
`represents any acoustic signal (such as human speech) that is desired, and "noise" is any
`
`acoustic signal (which may include human speech) that is not desired. An example
`
would be a person talking on a cellular telephone with a radio in the background. The
`
`person's speech is desired and the acoustic energy from the radio is not desired. In
`
addition, "user" describes a person who is using the device and whose speech is desired
`
`to be captured by the system.
`
`Also, "acoustic" is generally defined as acoustic waves propagating in air.
`
`
`Propagation of acoustic waves in media other than air will be noted as such. References
`
`to "speech" or "voice" generally refer to human speech including voiced speech,
`
`unvoiced speech, and/or a combination of voiced and unvoiced speech. Unvoiced speech
`
`or voiced speech is distinguished where necessary. The term "noise suppression"
`
`generally describes any method by which noise is reduced or eliminated in an electronic
`
`
`signal.
`
Moreover, the term "VAD" is generally defined as a vector or array signal, data,
`
`or information that in some manner represents the occurrence of speech in the digital or
`
`analog domain. A common representation ofV AD information is a one-bit digital signal
`
`sampled at the same rate as the corresponding acoustic signals, with a zero value
`
`
`representing that no speech has occurred during the corresponding time sample, and a
`
`unity value indicating that speech has occurred during the corresponding time sample.
`
`While the embodiments described herein are generally described in the digital domain,
`
`the descriptions are also valid for the analog domain.
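
As an illustration only (not taken from the application text), the following sketch shows one way such a one-bit VAD array, aligned sample-for-sample with the acoustic data, might be produced from a generic voicing-sensor energy trace; the function name, threshold, and example values are hypothetical.

```python
import numpy as np

def make_vad(voicing_energy, threshold):
    """Return a 0/1 VAD array the same length as the sensor trace:
    unity where speech is judged present, zero elsewhere."""
    return (np.asarray(voicing_energy) > threshold).astype(np.uint8)

# Hypothetical voicing-sensor energy trace sampled at the acoustic rate.
energy = np.array([0.01, 0.02, 0.9, 1.1, 0.8, 0.02])
vad = make_vad(energy, threshold=0.5)   # -> [0, 0, 1, 1, 1, 0]
```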
`
`Figure 1 is a block diagram of a denoising system 1000 of an embodiment that
`
`
`uses knowledge of when speech is occurring derived from physiological information on
`
`voicing activity. The system 1000 includes microphones 10 and sensors 20 that provide
`
`
`signals to at least one processor 30. The processor includes a denoising subsystem or
`
`algorithm 40.
`
`Figure 2 is a block diagram including components of a noise removal algorithm
`
`200 of an embodiment. A single noise source and a direct path to the microphones are
`
`
`assumed. An operational description of the noise removal algorithm 200 of an
`
`embodiment is provided using a single signal source 100 and a single noise source 101,
`
`but is not so limited. This algorithm 200 uses two microphones: a "signal" microphone 1
`
("MIC 1") and a "noise" microphone 2 ("MIC 2"), but is not so limited. The signal
`
`microphone MIC 1 is assumed to capture mostly signal with some noise, while MIC 2
`
captures mostly noise with some signal. The data from the signal source 100 to MIC 1 is
`
`denoted by s(n), where s(n) is a discrete sample of the analog signal from the source 100.
`
The data from the signal source 100 to MIC 2 is denoted by s2(n). The data from the
`
`noise source 101 to MIC 2 is denoted by n(n). The data from the noise source 101 to
`
MIC 1 is denoted by n2(n). Similarly, the data from MIC 1 to noise removal element 205

is denoted by m1(n), and the data from MIC 2 to noise removal element 205 is denoted by

m2(n).
`
`The noise removal element 205 also receives a signal from a voice activity
`
detection (VAD) element 204. The VAD 204 uses physiological information to

determine when a speaker is speaking. In various embodiments, the VAD can include at
`
`
`least one of an accelerometer, a skin surface microphone in physical contact with skin of
`
`a user, a human tissue vibration detector, a radio frequency (RF) vibration and/or motion
`
`detector/device, an electroglottograph, an ultrasound device, an acoustic microphone that
`
`is being used to detect acoustic frequency signals that correspond to the user's speech
`
directly from the skin of the user (anywhere on the body), an airflow detector, and a laser
`
`vibration detector.
`
`The transfer functions from the signal source 100 to MIC 1 and from the noise
`
`source 101 to MIC 2 are assumed to be unity. The transfer function from the signal
`
source 100 to MIC 2 is denoted by H2(z), and the transfer function from the noise source

101 to MIC 1 is denoted by H1(z). The assumption of unity transfer functions does not
`
`
`inhibit the generality of this algorithm, as the actual relations between the signal, noise,
`
`and microphones are simply ratios and the ratios are redefined in this manner for
`
`simplicity.
`
`
`In conventional two-microphone noise removal systems, the information from
`
MIC 2 is used to attempt to remove noise from MIC 1. However, a (generally

unspoken) assumption is that the VAD element 204 is never perfect, and thus the
`
`denoising must be performed cautiously, so as not to remove too much of the signal along
`
with the noise. However, if the VAD 204 is assumed to be perfect such that it is equal to
`
`zero when there is no speech being produced by the user, and equal to one when speech is
`
`produced, a substantial improvement in the noise removal can be made.
`
`In analyzing the single noise source 101 and the direct path to the microphones,
`with reference to Figure 2, the total acoustic information coming into MIC 1 is denoted
`
by m1(n). The total acoustic information coming into MIC 2 is similarly labeled m2(n).
`
In the z (digital frequency) domain, these are represented as M1(z) and M2(z). Then

    M1(z) = S(z) + N2(z)
    M2(z) = N(z) + S2(z)

with

    N2(z) = N(z)H1(z)
    S2(z) = S(z)H2(z),

so that

    M1(z) = S(z) + N(z)H1(z)
    M2(z) = N(z) + S(z)H2(z).          Eq. 1

This is the general case for all two microphone systems. In a practical system
`
`there is always going to be some leakage of noise into MIC 1, and some leakage of signal
`
`into MIC 2. Equation 1 has four unknowns and only two known relationships and
`
`therefore cannot be solved explicitly.
`
`However, there is another way to solve for some of the unknowns in Equation 1.
`
`
`The analysis starts with an examination of the case where the signal is not being
`
generated, that is, where a signal from the VAD element 204 equals zero and speech is
`not being produced. In this case, s(n) = S(z) = 0, and Equation 1 reduces to
`
    M1n(z) = N(z)H1(z)
    M2n(z) = N(z),
`
where the n subscript on the M variables indicates that only noise is being received. This
`
`leads to
`
`
    M1n(z) = M2n(z)H1(z)
    H1(z) = M1n(z)/M2n(z).          Eq. 2
`
`The function H1(z) can be calculated using any of the available system
`identification algorithms and the microphone outputs when the system is certain that only
`
`
`noise is being received. The calculation can be done adaptively, so that the system can
`
`react to changes in the noise.
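
As an illustrative sketch only: the text leaves the choice of system identification algorithm open, so the following assumes a frame-averaged cross-spectral estimate of H1(z) over frames that the VAD has marked as noise-only; the function name and regularization term are not from the application.

```python
import numpy as np

def estimate_h1(m1_noise_frames, m2_noise_frames, eps=1e-12):
    """Estimate H1(z) on an FFT grid from frames of shape (num_frames, frame_len)
    known to contain only noise (VAD = 0)."""
    M1 = np.fft.rfft(m1_noise_frames, axis=1)
    M2 = np.fft.rfft(m2_noise_frames, axis=1)
    # Averaged cross-spectrum over auto-spectrum: a least-squares estimate of M1n/M2n.
    num = np.mean(M1 * np.conj(M2), axis=0)
    den = np.mean(np.abs(M2) ** 2, axis=0) + eps
    return num / den
```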
`
`A solution is now available for one of the unknowns in Equation 1. Another
`
unknown, H2(z), can be determined by using the instances where the VAD equals one and
`
`
`speech is being produced. When this is occurring, but the recent (perhaps less than 1
`
second) history of the microphones indicates low levels of noise, it can be assumed that
n(n) = N(z) ≈ 0. Then Equation 1 reduces to
`
    M1s(z) = S(z)
    M2s(z) = S(z)H2(z),

which in turn leads to

    M2s(z) = M1s(z)H2(z)
    H2(z) = M2s(z)/M1s(z),
`which is the inverse of the H1(z) calculation. However, it is noted that different inputs are
`being used (now only the signal is occurring whereas before only the noise was
`
occurring). While calculating H2(z), the values calculated for H1(z) are held constant and
vice versa. Thus, it is assumed that while one of H1(z) and H2(z) is being calculated, the
`one not being calculated does not change substantially.
`
After calculating H1(z) and H2(z), they are used to remove the noise from the
`
`signal. If Equation 1 is rewritten as
`
`
    S(z) = M1(z) - N(z)H1(z)
    N(z) = M2(z) - S(z)H2(z)
    S(z) = M1(z) - [M2(z) - S(z)H2(z)]H1(z)
    S(z)[1 - H2(z)H1(z)] = M1(z) - M2(z)H1(z),
`
`
`then N(z) may be substituted as shown to solve for S(z) as
`
`
    S(z) = [M1(z) - M2(z)H1(z)] / [1 - H2(z)H1(z)].          Eq. 3

If the transfer functions H1(z) and H2(z) can be described with sufficient accuracy,
then the noise can be completely removed and the original signal recovered. This
`
`remains true without respect to the amplitude or spectral characteristics of the noise. The
`
only assumptions made include use of a perfect VAD, sufficiently accurate H1(z) and
`
H2(z), and that when one of H1(z) and H2(z) is being calculated the other does not
`change substantially. In practice these assumptions have proven reasonable.
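
A minimal sketch of Equation 3 applied per frequency bin, assuming H1(z) and H2(z) have already been estimated on the same FFT grid (for example with a routine like estimate_h1 above); the small regularization constant is an added assumption to avoid division by zero, not part of the equation.

```python
import numpy as np

def denoise_frame(m1_frame, m2_frame, H1, H2, eps=1e-6):
    """Recover a speech estimate from one pair of microphone frames using
    S = (M1 - M2*H1) / (1 - H2*H1), i.e., Equation 3 per frequency bin."""
    M1 = np.fft.rfft(m1_frame)
    M2 = np.fft.rfft(m2_frame)
    S = (M1 - M2 * H1) / (1.0 - H2 * H1 + eps)
    return np.fft.irfft(S, n=len(m1_frame))
```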
`
`The noise removal algorithm described herein is easily generalized to include any
`
`number of noise sources. Figure 3 is a block diagram including front-end components
`
`
`300 of a noise removal algorithm of an embodiment, generalized to n distinct noise
`
`sources. These distinct noise sources may be reflections or echoes of one another, but are
`
`not so limited. There are several noise sources shown, each with a transfer function, or
`path, to each microphone. The previously named path H2 has been relabeled as H0, so
`that labeling noise source 2's path to MIC 1 is more convenient. The outputs of each
`
microphone, when transformed to the z domain, are:
`
    M1(z) = S(z) + N1(z)H1(z) + N2(z)H2(z) + ... + Nn(z)Hn(z)
    M2(z) = S(z)H0(z) + N1(z)G1(z) + N2(z)G2(z) + ... + Nn(z)Gn(z).          Eq. 4
`
When there is no signal (VAD = 0), then (suppressing z for clarity)
`
    M1n = N1H1 + N2H2 + ... + NnHn
    M2n = N1G1 + N2G2 + ... + NnGn.          Eq. 5
`
`A new transfer function can now be defined as
`
    H̃1 ≡ M1n/M2n = (N1H1 + N2H2 + ... + NnHn)/(N1G1 + N2G2 + ... + NnGn)          Eq. 6

where H̃1 is analogous to H1(z) above. Thus H̃1 depends only on the noise sources and

their respective transfer functions and can be calculated any time there is no signal being

transmitted. Once again, the "n" subscripts on the microphone inputs denote only that
`
`
`noise is being detected, while an "s" subscript denotes that only signal is being received
`
`by the microphones.
`
`Examining Equation 4 while assuming an absence of noise produces
`
    M1s = S
    M2s = SH0.
`
Thus, H0 can be solved for as before, using any available transfer function calculating
algorithm. Mathematically, then,

    H0 = M2s/M1s.

Rewriting Equation 4, using H̃1 defined in Equation 6, provides

    H̃1 = (M1 - S)/(M2 - SH0).          Eq. 7

Solving for S yields

    S = (M1 - M2H̃1)/(1 - H0H̃1),          Eq. 8

which is the same as Equation 3, with H0 taking the place of H2, and H̃1 taking the place
of H1. Thus the noise removal algorithm still is mathematically valid for any number of
noise sources, including multiple echoes of noise sources. Again, if H0 and H̃1 can be
estimated to a high enough accuracy, and the above assumption of only one path from the
signal to the microphones holds, the noise may be removed completely.
`
`
`The most general case involves multiple noise sources and multiple signal
`
`sources. Figure 4 is a block diagram including front-end components 400 of a noise
`
`removal algorithm of an embodiment in the most general case where there are n distinct
`
`noise sources and signal reflections. Here, signal reflections enter both microphones MIC
`
`1 and MIC 2. This is the most general case, as reflections of the noise source into the
`
microphones MIC 1 and MIC 2 can be modeled accurately as simple additional noise

sources. For clarity, the direct path from the signal to MIC 2 is changed from H0(z) to
`
`
`H00(z), and the reflected paths to MIC 1 and MIC 2 are denoted by H01(z) and H02(z),
`respectively.
`The input into the microphones now becomes
`
    M1(z) = S(z) + S(z)H01(z) + N1(z)H1(z) + N2(z)H2(z) + ... + Nn(z)Hn(z)
    M2(z) = S(z)H00(z) + S(z)H02(z) + N1(z)G1(z) + N2(z)G2(z) + ... + Nn(z)Gn(z).          Eq. 9
`
When the VAD = 0, the inputs become (suppressing z again)
`
    M1n = N1H1 + N2H2 + ... + NnHn
    M2n = N1G1 + N2G2 + ... + NnGn,

which is the same as Equation 5. Thus, the calculation of H̃1 in Equation 6 is unchanged,
`as expected. In examining the situation where there is no noise, Equation 9 reduces to
`
`
    M1s = S + SH01
    M2s = SH00 + SH02.

This leads to the definition of H̃2 as

    H̃2 ≡ M2s/M1s = (H00 + H02)/(1 + H01).          Eq. 10

Rewriting Equation 9 again using the definition for H̃1 (as in Equation 7) provides

    H̃1 = (M1 - S(1 + H01))/(M2 - S(H00 + H02)).          Eq. 11
`
`
`Some algebraic manipulation yields
`
    S(1 + H01 - H̃1(H00 + H02)) = M1 - M2H̃1
    S(1 + H01)[1 - H̃1(H00 + H02)/(1 + H01)] = M1 - M2H̃1
    S(1 + H01)[1 - H̃1H̃2] = M1 - M2H̃1,

and finally
`
`
    S(1 + H01) = (M1 - M2H̃1)/(1 - H̃1H̃2).          Eq. 12

Equation 12 is the same as Equation 8, with the replacement of H0 by H̃2, and the
addition of the (1 + H01) factor on the left side. This extra factor (1 + H01) means that S
cannot be solved for directly in this situation, but a solution can be generated for the
`
`
`signal plus the addition of all of its echoes. This is not such a bad situation, as there are
`
`many conventional methods for dealing with echo suppression, and even if the echoes are
`
`not suppressed, it is unlikely that they will affect the comprehensibility of the speech to
any meaningful extent. The more complex calculation of H̃2 is needed to account for the
`signal echoes in MIC 2, which act as noise sources.
`
`
`Figure 5 is a flow diagram 500 of a denoising algorithm, under an embodiment.
`
`In operation, the acoustic signals are received, at block 502. Further, physiological
`
`information associated with human voicing activity is received, at block 504. A first
`
`transfer function representative of the acoustic signal is calculated upon determining that
`
`voicing information is absent from the acoustic signal for at least one specified period of
`
`
`time, at block 506. A second transfer function representative of the acoustic signal is
`
`calculated upon determining that voicing information is present in the acoustic signal for
`
`at least one specified period of time, at block 508. Noise is removed from the acoustic
`
`signal using at least one combination of the first transfer function and the second transfer
`
`function, producing denoised acoustic data streams, at block 510.
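
The following sketch walks the blocks of Figure 5 under several simplifying assumptions that are not taken from the text: frequency-domain processing, a normalized per-bin adaptive update in place of whichever system identification method an implementation would use, and no explicit check of the recent noise history before updating H2.

```python
import numpy as np

def denoise_stream(frames_mic1, frames_mic2, vad_flags, mu=0.1, eps=1e-9):
    """Figure 5 flow: receive acoustic frames (block 502) and VAD flags (block 504),
    update H1 on noise-only frames (block 506), update H2 on speech frames
    (block 508), and remove noise with Equation 3 (block 510)."""
    n = len(frames_mic1[0])
    H1 = np.zeros(n // 2 + 1, dtype=complex)
    H2 = np.zeros(n // 2 + 1, dtype=complex)
    cleaned = []
    for m1, m2, voiced in zip(frames_mic1, frames_mic2, vad_flags):
        M1, M2 = np.fft.rfft(m1), np.fft.rfft(m2)
        if not voiced:
            # Block 506: only noise present, nudge H1 toward M1/M2 (normalized step).
            H1 += mu * (M1 - H1 * M2) * np.conj(M2) / (np.abs(M2) ** 2 + eps)
        else:
            # Block 508: speech present; a full implementation would also require
            # the recent noise level to be low before adapting H2.
            H2 += mu * (M2 - H2 * M1) * np.conj(M1) / (np.abs(M1) ** 2 + eps)
        S = (M1 - M2 * H1) / (1.0 - H2 * H1 + eps)   # Block 510: Equation 3
        cleaned.append(np.fft.irfft(S, n=n))
    return np.concatenate(cleaned)
```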
`
`
`An algorithm for noise removal, or denoising algorithm, is described herein, from
`
`the simplest case of a single noise source with a direct path to multiple noise sources with
`
`reflections and echoes. The algorithm has been shown herein to be viable under any
`
environmental conditions. The type and amount of noise are inconsequential if a good
estimate has been made of H̃1 and H̃2, and if one does not change substantially while the
`other is calculated. If the user environment is such that echoes are present, they can be
`compensated for if coming from a noise source. If signal echoes are also present, they
`will affect the cleaned signal, but the effect should be negligible in most environments.
`
`In operation, the algorithm of an embodiment has shown excellent results in
`
`dealing with a variety of noise types, amplitudes, and orientations. However, there are
`
`
`always approximations and adjustments that have to be made when moving from
`
`
mathematical concepts to engineering applications. One assumption is made in Equation

3, where H2(z) is assumed small and therefore H2(z)H1(z) ≈ 0, so that Equation 3 reduces
`
`to
`
    S(z) ≈ M1(z) - M2(z)H1(z).
`This means that only H1(z) has to be calculated, speeding up the process and reducing the
`number of computations required considerably. With the proper selection of
`
`microphones, this approximation is easily realized.
`
`Another approximation involves the filter used in an embodiment. The actual
`
`H1(z) will undoubtedly have both poles and zeros, but for stability and simplicity an all-
`zero Finite Impulse Response (FIR) filter is used. With enough taps the approximation to
`
the actual H1(z) can be very good.
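
A minimal sketch of this simplified form, with H1(z) realized as an all-zero FIR filter as described above; the tap values and signal lengths are illustrative.

```python
import numpy as np

def denoise_simplified(m1, m2, h1_taps):
    """Approximate S(z) = M1(z) - M2(z)H1(z) in the time domain: filter the
    noise-reference channel with the FIR taps for H1 and subtract."""
    noise_estimate = np.convolve(m2, h1_taps)[: len(m1)]
    return np.asarray(m1) - noise_estimate
```
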
`To further increase the performance of the noise suppression system, the spectrum
`
`of interest (generally about 125 to 3700 Hz) is divided into subbands. The wider the
`
`range of frequencies over which a transfer function must be calculated, the more difficult
`
`
`it is to calculate it accurately. Therefore the acoustic data was divided into 16 subbands,
`
`and the denoising algorithm was then applied to each subband in turn. Finally, the 16
`
`denoised data streams were recombined to yield the denoised acoustic data. This works
`
`very well, but any combinations of subbands (i.e., 4, 6, 8, 32, equally spaced,
`
`perceptually spaced, etc.) can be used and all have been found to work better than a single
`
`
`subband.
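
An illustrative sketch of the subband approach: split roughly 125 to 3700 Hz into 16 bands, denoise each band separately, and sum the results. The Butterworth filters, their order, the equal band spacing, and the caller-supplied per-band denoiser are assumptions, not details taken from the text.

```python
import numpy as np
from scipy.signal import butter, lfilter

def subband_denoise(m1, m2, denoise_band, fs=8000, n_bands=16, lo=125.0, hi=3700.0):
    """Apply denoise_band(band1, band2) in each subband and recombine."""
    edges = np.linspace(lo, hi, n_bands + 1)
    out = np.zeros(len(m1))
    for k in range(n_bands):
        b, a = butter(4, [edges[k] / (fs / 2), edges[k + 1] / (fs / 2)], btype="band")
        band1 = lfilter(b, a, m1)          # subband of the primary microphone
        band2 = lfilter(b, a, m2)          # matching subband of the noise microphone
        out += denoise_band(band1, band2)  # e.g., a per-band noise subtraction
    return out
```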
`
`The amplitude of the noise was constrained in an embodiment so that the
`
`microphones used did not saturate (that is, operate outside a linear response region). It is
`
`important that the microphones operate linearly to ensure the best performance. Even
`
`with this restriction, very low signal-to-noise ratio (SNR) signals can be denoised (down
`
`
`to -10 dB or less).
`
The calculation of H1(z) is accomplished every 10 milliseconds using the Least-
`Mean Squares (LMS) method, a common adaptive transfer function. An explanation may
`
be found in "Adaptive Signal Processing" (1985), by Widrow and Stearns, published by
`
`Prentice-Hall, ISBN 0-13-004029-0. The LMS was used for demonstration purposes, but
`
many other system identification techniques can be used to identify H1(z) and H2(z) in
`Figure 2.
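
A minimal normalized LMS sketch in the spirit of the Widrow and Stearns reference, adapting FIR taps for H1 over one 10 ms block of noise-only data; the step size, tap count, and normalization constant are illustrative choices, not values from the text.

```python
import numpy as np

def lms_update(h, m1_block, m2_block, mu=0.1):
    """Adapt FIR taps h so that (h * m2) tracks m1 over one block of
    noise-only data (e.g., 80 samples per 10 ms at 8 kHz)."""
    h = np.array(h, dtype=float)
    n_taps = len(h)
    for n in range(n_taps, len(m2_block)):
        x = np.asarray(m2_block[n - n_taps:n])[::-1]   # most recent MIC 2 samples
        e = m1_block[n] - np.dot(h, x)                 # prediction error against MIC 1
        h += mu * e * x / (np.dot(x, x) + 1e-9)        # normalized LMS step
    return h
```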
`
`
The VAD for an embodiment is derived from a radio frequency sensor and the
`
`two microphones, yielding very high accuracy (>99%) for both voiced and unvoiced
`
speech. The VAD of an embodiment uses a radio frequency (RF) vibration detector
`
`interferometer to detect tissue motion associated with human speech production, but is
`
`
`not so limited. The signal from the RF device is completely acoustic-noise free, and is
`
`able to function in any acoustic noise environment. A simple energy measurement of the
`
`RF signal can be used to determine if voiced speech is occurring. Unvoiced speech can
`
`be determined using conventional acoustic-based methods, by proximity to voiced
`
`sections determined using the RF sensor or similar voicing sensors, or through a
`
`
`combination of the above. Since there is much less energy in unvoiced speech, its
`
`detection accuracy is not as critical to good noise suppression performance as is voiced
`
`speech.
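
An illustrative sketch of such a voicing decision (not the application's implementation): frame energy of the RF sensor output against a threshold marks voiced frames, and a short window around each voiced region is retained so that adjacent unvoiced speech is also admitted. The frame length, threshold, and padding are assumptions.

```python
import numpy as np

def rf_vad(rf_signal, frame_len=80, threshold=1e-3, pad_frames=3):
    """Per-sample 0/1 VAD from RF-sensor frame energy, widened around voiced
    frames so that nearby unvoiced speech is kept as well."""
    rf = np.asarray(rf_signal, dtype=float)
    n_frames = len(rf) // frame_len
    frames = rf[: n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = (frames ** 2).mean(axis=1) > threshold       # simple energy measure
    vad = voiced.copy()
    for k in np.flatnonzero(voiced):                      # widen around voiced frames
        vad[max(0, k - pad_frames): k + pad_frames + 1] = True
    return np.repeat(vad, frame_len).astype(np.uint8)
```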
`
`With voiced and unvoiced speech detected reliably, the algorithm of an
`
`embodiment can be implemented. Once again, it is useful to repeat that the noise
`
removal algorithm does not depend on how the VAD is obtained, only that it is accurate,
`
`especially for voiced speech. If speech is not detected and training occurs on the speech,
`
`the subsequent denoised acoustic data can be distorted.
`
`Data was collected in four channels, one for MIC 1, one for MIC 2, and two for
`
`the radio frequency sensor that detected the tissue motions associated with voiced speech.
`
`
`The data were sampled simultaneously at 40 kHz, then digitally filtered and decimated
`
`down to 8 kHz. The high sampling rate was used to reduce any aliasing that might result
`
from the analog to digital process. A four-channel National Instruments A/D board was
`
`used along with Labview to capture and store the data. The data was then read into a C
`
`program and denoised 10 milliseconds at a time.
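
As an illustration of this capture pipeline (and not the original LabVIEW or C code), the following assumes scipy's default decimation filter to bring 40 kHz recordings down to 8 kHz and then frames the result into 10 ms blocks.

```python
import numpy as np
from scipy.signal import decimate

def prepare_channels(mic1_40k, mic2_40k, fs_in=40000, fs_out=8000, block_ms=10):
    """Anti-alias filter and decimate both channels, then split into 10 ms blocks."""
    factor = fs_in // fs_out                              # 40 kHz -> 8 kHz is a factor of 5
    m1 = decimate(np.asarray(mic1_40k, dtype=float), factor)
    m2 = decimate(np.asarray(mic2_40k, dtype=float), factor)
    block = fs_out * block_ms // 1000                     # 80 samples per 10 ms block
    n_blocks = len(m1) // block
    return (m1[: n_blocks * block].reshape(n_blocks, block),
            m2[: n_blocks * block].reshape(n_blocks, block))
```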
`
`
Figure 6 shows a denoised audio signal 602 output upon application of the noise
`
`suppression algorithm of an embodiment to a dirty acoustic signal 604, under an
`
`embodiment. The dirty acoustic signal 604 includes speech of an American English(cid:173)
`
`speaking female in the presence of airport terminal noise where the noise includes many
`
`other human speakers and public announcements. The speaker is uttering the numbers
`
`
`"406 5562" in the midst of moderate airport terminal noise. The dirty acoustic signal 604
`
`was denoised 10 milliseconds at a time, and before denoising the 10 milliseconds of data
`
were prefiltered from 50 to 3700 Hz. A reduction in the noise of approximately 17 dB is
`
`
`evident. No post filtering was done on this sample; thus, all of the noise reduction
`realized is due to the algorithm of an embodiment. It is clear that the algorithm adjusts to
`
`the noise instantly, and is capable of removing the very difficult noise of other human
`
`speakers. Many different types of noise have all been tested with similar results,
`
`
`including street noise, helicopters, music, and sine waves. Also, the orientation of the
`
`noise can be varied substantially without significantly changing the noise suppression
`
`performance. Finally, the distortion of the cleaned speech is very low, ensuring good
`
`performance for speech recognition engines and human receivers alike.
`
`The noise removal algorithm of an embodiment has been shown to be viable
`
under any environmental conditions. The type and amount of noise are inconsequential if
a good estimate has been made of H̃1 and H̃2. If the user environment is such that
echoes are present, they can be compensated for if coming from a noise source. If signal
`
`echoes are also present, they will affect the cleaned signal, but the effect should be
`
`negligible in most environments.
`
`
When using the VAD devices and methods described herein with a noise

suppression system, the VAD signal is processed independently of the noise suppression

system, so that the receipt and processing of VAD information is independent from the
`
`processing associated with the noise suppression, but the embodiments are not so limited.
`
`This independence is attained physically (i.e., different hardware for use in receiving and
`
`
processing signals relating to the VAD and the noise suppression), but is not so limited.
`
The VAD devices/methods described herein generally include vibration and

movement sensors, but are not so limited. In one embodiment, an accelerometer is placed

on the skin for use in detecting skin surface vibrations that correlate with human speech.

These recorded vibrations are then used to calculate a VAD signal for use with or by an
`
`
`adaptive noise suppression algorithm in suppressing environmental acoustic noise from a
`
`simultaneously (within a few milliseconds) recorded acoustic signal that includes both
`
`speech and noise.
`
Another embodiment of the VAD devices/methods described herein includes an

acoustic microphone modified with a membrane so that the microphone no longer
`
`
`efficiently detects acoustic vibrations in air. The membrane, though, allows the
`
`microphone to detect acoustic vibrations in objects with which it is in physical contact
`
`(allowing a good mechanical impedance match), such as human skin. That is, the
`
`
`acoustic microphone is modified in some way such that it no longer detects acoustic
`
vibrations in air (where it no longer has a good physical impedance match), but only in
`
`objects with which the microphone is in contact. This configures the microphone, like
`
`the accelerometer, to detect vibrations of human skin associated with the speech
`
`
`production of that human while not efficiently detecting acoustic environmental noise in
`
the air. The detected vibrations are processed to form a VAD signal for use in a noise
`
`suppression system, as detailed below.
`
Yet another embodiment of the VAD described herein uses an electromagnetic
`
vibration sensor, such as a radio frequency (RF) vibrometer or laser vibrometer, which
`
`
`detect skin vibrations. Further, the RF vibrometer detects the movement of tissue within
`
`the body, such as the inner surface of the cheek or the tracheal wall. Both the exterior
`
`skin and internal tissue vibrations associated with speech production can be used to form
`
`a VAD signal for use in a noise suppression system as detailed below.
Figure 7A is a block diagram of a VAD system 702A including hardware for use
in receiving and processing signals relating to VAD, under an embodiment. The VAD
`
system 702A includes a VAD device 730 coupled to provide data to a corresponding
`
`VAD algorithm 740. Note that noise suppression systems of alternative embodiments
`
can integrate some or all functions of the VAD algorithm with the noise suppression
`
`processing in any manner obvious to those skilled in the art. Referring to Figure 1, the
`
voicing sensors 20 include the VAD system 702A, for example, but are not so limited.

Referring to Figure 2, the VAD includes the VAD system 702A, for example, but is not
`
`so limited.
`
Figure 7B is a block diagram of a VAD system 702B using hardware of the

associated noise suppression system 701 for use in receiving VAD information 764,
`
under an embodiment. The VAD system 702B includes a VAD algorithm 750 that
`
`receives data 764 from MIC 1 and MIC 2, or other components, of the corresponding
`
`signal processing system 700. Alternative embodiments of the noise suppression system
`
can integrate some or all functions of the VAD algorithm with the noise suppression
`
`processing in any manner obvious to those skilled in the art.
`
`
The vibration/movement-based VAD devices described herein include the

physical hardware devices for use in receiving and processing signals relating to the VAD
`
`and the noise suppression. As a speaker or user produces speech, the resulting vibrations
`
`
`propagate through the tissue of the speak