`
`Transmittal of Utility Patent Application for Filing
`
Certification Under 37 C.F.R. § 1.10 (if applicable)
`
`EV 708 402 865 US
`"Express Mail" Label Number
`
`May 25, 2007
`Date of Deposit
`
I hereby certify that this application, and any other documents referred to as enclosed herein, are being deposited in an envelope with the United States Postal Service "Express Mail Post Office to Addressee" service under 37 CFR § 1.10 on the date indicated above and addressed to the Assistant Commissioner for Patents, Washington, D.C. 20231.
`
`Barbara B. Courtney
`
(Print Name of Person Mailing Application)
`
DETECTING VOICED AND UNVOICED SPEECH USING BOTH ACOUSTIC
`AND NONACOUSTIC SENSORS
`
`RELATED APPLICATIONS
`
This application claims the benefit of U.S. Application Numbers 60/294,383 filed May 30, 2002; 09/905,361 filed July 12, 2001; 60/335,100 filed October 30, 2001; 60/332,202 and 09/990,847, both filed November 21, 2001; 60/362,103, 60/362,161, 60/362,162, 60/362,170, and 60/361,981, all filed March 5, 2002; 60/368,208, 60/368,209, and 60/368,343, all filed March 27, 2002; all of which are incorporated herein by reference in their entirety.
`
`TECHNICAL FIELD
`
`The disclosed embodiments relate to the processing of speech signals.
`
`BACKGROUND
The ability to correctly identify voiced and unvoiced speech is critical to many speech applications including speech recognition, speaker verification, noise suppression, and many others. In a typical acoustic application, speech from a human speaker is captured and transmitted to a receiver in a different location. In the speaker's environment there may exist one or more noise sources that pollute the speech signal, or the signal of interest, with unwanted acoustic noise. This makes it difficult or impossible for the receiver, whether human or machine, to understand the user's speech.
`
`
Typical methods for classifying voiced and unvoiced speech have relied mainly on the acoustic content of microphone data, which is plagued by problems with noise and the corresponding uncertainties in signal content. This is especially problematic now with the proliferation of portable communication devices like cellular telephones and personal digital assistants because, in many cases, the quality of service provided by the device depends on the quality of the voice services offered by the device. There are methods known in the art for suppressing the noise present in the speech signals, but these methods demonstrate performance shortcomings that include unusually long computing time, requirements for cumbersome hardware to perform the signal processing, and distortion of the signals of interest.
`
`
`BRIEF DESCRIPTION OF THE FIGURES
`Figure 1 is a block diagram of a NAVSAD system, under an embodiment.
`
`Figure 2 is a block diagram of a PSAD system, under an embodiment.
`Figure 3 is a block diagram of a denoising system, referred to herein as the
`Pathfinder system, under an embodiment.
`Figure 4 is a flow diagram of a detection algorithm for use in detecting
`voiced and unvoiced speech, under an embodiment.
`Figure 5A plots the received GEMS signal for an utterance along with the
`mean correlation between the GEMS signal and the Mic 1 signal and the threshold
`for voiced speech detection.
`Figure 5B plots the received GEMS signal for an utterance along with the
standard deviation of the GEMS signal and the threshold for voiced speech
`detection.
`Figure 6 plots voiced speech detected from an utterance along with the
`GEMS signal and the acoustic noise.
`Figure 7 is a microphone array for use under an embodiment of the PSAD
`system.
Figure 8 is a plot of ΔM versus d1 for several Δd values, under an
`embodiment.
`Figure 9 shows a plot of the gain parameter as the sum of the absolute
values of H1(z) and the acoustic data or audio from microphone 1.
`Figure 10 is an alternative plot of acoustic data presented in Figure 9.
`
`
In the figures, the same reference numbers identify identical or substantially similar elements or acts. Any headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the claimed invention.
`
`DETAILED DESCRIPTION
Systems and methods for discriminating voiced and unvoiced speech from background noise are provided below, including a Non-Acoustic Sensor Voiced Speech Activity Detection (NAVSAD) system and a Pathfinder Speech Activity Detection (PSAD) system. The noise removal and reduction methods provided herein, while allowing for the separation and classification of unvoiced and voiced human speech from background noise, address the shortcomings of typical systems known in the art by cleaning acoustic signals of interest without distortion.
Figure 1 is a block diagram of a NAVSAD system 100, under an embodiment. The NAVSAD system couples microphones 10 and sensors 20 to at least one processor 30. The sensors 20 of an embodiment include voicing activity detectors or non-acoustic sensors. The processor 30 controls subsystems including a detection subsystem 50, referred to herein as a detection algorithm, and a denoising subsystem 40. Operation of the denoising subsystem 40 is described in detail in the Related Applications. The NAVSAD system works extremely well in any background acoustic noise environment.
Figure 2 is a block diagram of a PSAD system 200, under an embodiment. The PSAD system couples microphones 10 to at least one processor 30. The processor 30 includes a detection subsystem 50, referred to herein as a detection algorithm, and a denoising subsystem 40. The PSAD system is highly sensitive in low acoustic noise environments and relatively insensitive in high acoustic noise environments. The PSAD can operate independently or as a backup to the NAVSAD, detecting voiced speech if the NAVSAD fails.
Note that the detection subsystems 50 and denoising subsystems 40 of both the NAVSAD and PSAD systems of an embodiment are algorithms controlled by the processor 30, but are not so limited. Alternative embodiments of the NAVSAD and PSAD systems can include detection subsystems 50 and/or denoising subsystems 40 that comprise additional hardware, firmware, software, and/or combinations of hardware, firmware, and software. Furthermore, functions of the detection subsystems 50 and denoising subsystems 40 may be distributed across numerous components of the NAVSAD and PSAD systems.
`Figure 3 is a block diagram of a denoising subsystem 300, referred to
`herein as the Pathfinder system, under an embodiment. The Pathfinder system is
`briefly described below, and is described in detail in the Related Applications. Two
`microphones Mic 1 and Mic 2 are used in the Pathfinder system, and Mic 1 is
`considered the "signal" microphone. With reference to Figure 1, the Pathfinder
`system 300 is equivalent to the NAVSAD system 100 when the voicing activity
`detector (VAD) 320 is a non-acoustic voicing sensor 20 and the noise removal
`subsystem 340 includes the detection subsystem 50 and the denoising subsystem
`40. With reference to Figure 2, the Pathfinder system 300 is equivalent to the
`PSAD system 200 in the absence of the VAD 320, and when the noise removal
`subsystem 340 includes the detection subsystem 50 and the denoising subsystem
`40.
`
The NAVSAD and PSAD systems support a two-level commercial approach in which (i) a relatively less expensive PSAD system supports an acoustic approach that functions in most low- to medium-noise environments, and (ii) a NAVSAD system adds a non-acoustic sensor to enable detection of voiced speech in any environment. Unvoiced speech is normally not detected using the sensor, as it normally does not sufficiently vibrate human tissue. However, in high noise situations detecting the unvoiced speech is not as important, as it is normally very low in energy and easily washed out by the noise. Therefore, in high noise environments the unvoiced speech is unlikely to affect the voiced speech denoising. Unvoiced speech information is most important in the presence of little to no noise and, therefore, the unvoiced detection should be highly sensitive in low noise situations, and insensitive in high noise situations. This is not easily accomplished, and comparable acoustic unvoiced detectors known in the art are incapable of operating under these environmental constraints.
The NAVSAD and PSAD systems include an array algorithm for speech detection that uses the difference in frequency content between two microphones to calculate a relationship between the signals of the two microphones. This is in contrast to conventional arrays that attempt to use the time/phase difference of each microphone to remove the noise outside of an "area of sensitivity". The methods described herein provide a significant advantage, as they do not require a specific orientation of the array with respect to the signal.
Further, the systems described herein are sensitive to noise of every type and every orientation, unlike conventional arrays that depend on specific noise orientations. Consequently, the frequency-based arrays presented herein are unique as they depend only on the relative orientation of the two microphones themselves, with no dependence on the orientation of the noise and signal with respect to the microphones. This results in a robust signal processing system with respect to the type of noise, microphones, and orientation between the noise/signal source and the microphones.
`
`The systems described herein use the information derived from the
`Pathfinder noise suppression system and/or a non-acoustic sensor described in
`the Related Applications to determine the voicing state of an input signal, as
described in detail below. The voicing state includes silent, voiced, and
`states. The NAVSAD system, for example, includes a non-acoustic sensor to
`detect the vibration of human tissue associated with speech. The non-acoustic
`sensor of an embodiment is a General Electromagnetic Movement Sensor
`(GEMS) as described briefly below and in detail in the Related Applications, but is
`not so limited. Alternative embodiments, however, may use any sensor that is
`
able to detect human tissue motion associated with speech and is unaffected by
`environmental acoustic noise.
`The GEMS is a radio frequency device (2.4 GHz) that allows the detection
`of moving human tissue dielectric interfaces. The GEMS includes an RF
`interferometer that uses homodyne mixing to detect small phase shifts associated
`with target motion. In essence, the sensor sends out weak electromagnetic waves
(less than 1 milliwatt) that reflect off of whatever is around the sensor. The
`reflected waves are mixed with the original transmitted waves and the results
`analyzed for any change in position of the targets. Anything that moves near the
`sensor will cause a change in phase of the reflected wave that will be amplified
`
`and displayed as a change in voltage output from the sensor. A similar sensor is
`described by Gregory C. Burnett (1999) in "The physiological basis of glottal
`electromagnetic micropower sensors (GEMS) and their use in defining an
`excitation function for the human vocal tract"; Ph.D. Thesis, University of California
`at Davis.
`
`
Figure 4 is a flow diagram of a detection algorithm 50 for use in detecting voiced and unvoiced speech, under an embodiment. With reference to Figures 1 and 2, both the NAVSAD and PSAD systems of an embodiment include the detection algorithm 50 as the detection subsystem 50. This detection algorithm 50 operates in real-time and, in an embodiment, operates on 20 millisecond windows and steps 10 milliseconds at a time, but is not so limited. The voice activity determination is recorded for the first 10 milliseconds, and the second 10 milliseconds functions as a "look-ahead" buffer. While an embodiment uses the 20/10 windows, alternative embodiments may use numerous other combinations of window values.
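The 20/10 windowing described above can be illustrated with a short sketch. This is not code from the application; the function name, sample rate, and framing details are assumptions chosen only to show how a 20 millisecond window advanced 10 milliseconds at a time yields a decision over the first half of each window while the second half acts as the look-ahead buffer.

```python
import numpy as np

def frame_signal(x, fs=8000, window_ms=20, step_ms=10):
    # Split a signal into 20 ms analysis windows advanced 10 ms at a time;
    # at 8 kHz this gives 160-sample windows with an 80-sample step, so a
    # voicing decision covers the first 80 samples and the remaining 80
    # samples serve as the "look-ahead" buffer described above.
    win = int(fs * window_ms / 1000)
    step = int(fs * step_ms / 1000)
    starts = range(0, max(len(x) - win + 1, 0), step)
    return np.stack([x[s:s + win] for s in starts])
```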
`Consideration was given to a number of multi-dimensional factors in
developing the detection algorithm 50. The biggest consideration was
`maintaining the effectiveness of the Pathfinder denoising technique, described in
`detail in the Related Applications and reviewed herein. Pathfinder performance
`
`can be compromised if the adaptive filter training is conducted on speech rather
`than on noise. It is therefore important not to exclude any significant amount of
`speech from the VAD to keep such disturbances to a minimum.
Consideration was also given to the accuracy of the characterization
`between voiced and unvoiced speech signals, and distinguishing each of these
`speech signals from noise signals. This type of characterization can be useful in
`such applications as speech recognition and speaker verification.
Furthermore, the systems using the detection algorithm of an embodiment function in environments containing varying amounts of background acoustic noise. If the non-acoustic sensor is available, this external noise is not a problem for voiced speech. However, for unvoiced speech (and voiced if the non-acoustic sensor is not available or has malfunctioned) reliance is placed on acoustic data alone to separate noise from unvoiced speech. An advantage inheres in the use of two microphones in an embodiment of the Pathfinder noise suppression system, and the spatial relationship between the microphones is exploited to assist in the detection of unvoiced speech. However, there may occasionally be noise levels high enough that the speech will be nearly undetectable and the acoustic-only method will fail. In these situations, the non-acoustic sensor (or hereafter just the sensor) will be required to ensure good performance.
`
`
`
`
`
`In the two-microphone system, the speech source should be relatively
`
`louder in one designated microphone when compared to the other microphone.
`Tests have shown that this requirement is easily met with conventional
`
`microphones when the microphones are placed on the head, as any noise should
`result in an H1 with a gain near unity.
`Regarding the NAVSAD system, and with reference to Figure 1 and Figure
`3, the NAVSAD relies on two parameters to detect voiced speech. These two
`
`parameters include the energy of the sensor in the window of interest, determined
`in an embodiment by the standard deviation (SD), and optionally the cross-
`correlation (XCORR) between the acoustic signal from microphone 1 and the
`sensor data. The energy of the sensor can be determined in any one of a number
`of ways, and the SD is just one convenient way to determine the energy.
For the sensor, the SD is akin to the energy of the signal, which normally
`corresponds quite accurately to the voicing state, but may be susceptible to
movement noise (relative motion of the sensor with respect to the human user)
`and/or electromagnetic noise. To further differentiate sensor noise from tissue
`motion, the XCORR can be used. The XCORR is only calculated to 15 delays,
`
`which corresponds to just under 2 milliseconds at 8000 Hz.
`The XCORR can also be useful when the sensor signal is distorted or
`modulated in some fashion. For example, there are sensor locations (such as the
`jaw or back of the neck) where speech production can be detected but where the
`
`signal may have incorrect or distorted time-based information. That is, they may
`not have well defined features in time that will match with the acoustic waveform.
However, XCORR is more susceptible to errors from acoustic noise, and in high-noise (<0 dB SNR) environments it is almost useless. Therefore it should not be the sole
`source of voicing information.
`
`The sensor detects human tissue motion associated with the closure of the
`
`vocal folds, so the acoustic signal produced by the closure of the folds is highly
`correlated with the closures. Therefore, sensor data that correlates highly with the
`acoustic signal is declared as speech, and sensor data that does not correlate well
`is termed noise. The acoustic data is expected to lag behind the sensor data by
`about 0.1 to 0.8 milliseconds (or about 1-7 samples) as a result of the delay time
due to the relatively slower speed of sound (around 330 m/s). However, an
`embodiment uses a 15-sample correlation, as the acoustic wave shape varies
`
`
`significantly depending on the sound produced, and a larger correlation width is
`
`needed to ensure detection.
The SD and XCORR signals are related, but are sufficiently different that using both makes the voiced speech detection more reliable. For simplicity, though, either
`parameter may be used. The values for the SD and XCORR are compared to
`empirical thresholds, and if both are above their threshold, voiced speech is
`declared. Example data is presented and described below.
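A minimal sketch of this decision rule follows. It assumes 8 kHz, window-length arrays and is not the application's implementation: the threshold arguments stand in for the empirical thresholds T2 (energy) and T1 (correlation) discussed above, and the use of a mean normalized correlation over 15 delays is an illustrative choice.

```python
import numpy as np

def navsad_voiced(gems_win, mic1_win, t_sd, t_xcorr, max_lag=15):
    # Energy proxy for the sensor: standard deviation of the GEMS window.
    sd = np.std(gems_win)
    # Normalized cross-correlation between Mic 1 and the GEMS signal,
    # evaluated out to 15 delays (just under 2 ms at 8000 Hz).
    g = (gems_win - np.mean(gems_win)) / (np.std(gems_win) + 1e-12)
    m = (mic1_win - np.mean(mic1_win)) / (np.std(mic1_win) + 1e-12)
    corr = [np.mean(g[:len(g) - k] * m[k:]) for k in range(max_lag)]
    xcorr = np.mean(np.abs(corr))
    # Voiced speech is declared only when both measures exceed their
    # empirical thresholds.
    return (sd > t_sd) and (xcorr > t_xcorr)
```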
Figures 5A, 5B, and 6 show data plots for an example in which a subject twice speaks the phrase "pop pan", under an embodiment. Figure 5A plots the received GEMS signal 502 for this utterance along with the mean correlation 504 between the GEMS signal and the Mic 1 signal and the threshold T1 used for voiced speech detection. Figure 5B plots the received GEMS signal 502 for this
`utterance along with the standard deviation 506 of the GEMS signal and the
`threshold T2 used for voiced speech detection. Figure 6 plots voiced speech 602
`detected from the acoustic or audio signal 608, along with the GEMS signal 604
`and the acoustic noise 606; no unvoiced speech is detected in this example
`because of the heavy background babble noise 606. The thresholds have been
`set so that there are virtually no false negatives, and only occasional false
`positives. A voiced speech activity detection accuracy of greater than 99% has
`
`been attained under any acoustic background noise conditions.
The NAVSAD can determine when voiced speech is occurring with high degrees of accuracy due to the non-acoustic sensor data. However, the sensor offers little assistance in separating unvoiced speech from noise, as unvoiced speech normally causes no detectable signal in most non-acoustic sensors. If there is a detectable signal, the NAVSAD can be used, although use of the SD method is dictated, as unvoiced speech is normally poorly correlated. In the absence of a detectable signal, the system and methods of the Pathfinder noise removal algorithm are used to determine when unvoiced speech is occurring. A brief review of the Pathfinder algorithm is provided below, while a detailed description is provided in the Related Applications.
With reference to Figure 3, the acoustic information coming into Microphone 1 is denoted by m1(n), the information coming into Microphone 2 is similarly labeled m2(n), and the GEMS sensor is assumed available to determine voiced speech areas. In the z (digital frequency) domain, these signals are represented as M1(z) and M2(z). Then

M1(z) = S(z) + N2(z)
M2(z) = N(z) + S2(z)

with

N2(z) = N(z)H1(z)
S2(z) = S(z)H2(z)

so that

M1(z) = S(z) + N(z)H1(z)
M2(z) = N(z) + S(z)H2(z)          (1)
`This is the general case for all two microphone systems. There is always going to
`be some leakage of noise into Mic 1, and some leakage of signal into Mic 2.
`Equation 1 has four unknowns and only two relationships and cannot be solved
`explicitly.
However, there is another way to solve for some of the unknowns in Equation 1. Examine the case where the signal is not being generated, that is, where the GEMS signal indicates voicing is not occurring. In this case, s(n) = S(z) = 0, and Equation 1 reduces to

M1n(z) = N(z)H1(z)
M2n(z) = N(z)

where the n subscript on the M variables indicates that only noise is being received. This leads to

M1n(z) = M2n(z)H1(z)
H1(z) = M1n(z) / M2n(z)          (2)
`
H1(z) can be calculated using any of the available system identification algorithms and the microphone outputs when only noise is being received. The calculation can be done adaptively, so that if the noise changes significantly H1(z) can be recalculated quickly.
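The application leaves the choice of system identification algorithm open; the sketch below uses a normalized LMS update as one such choice, run only on frames the VAD marks as noise. Mic 2 is the filter input and Mic 1 the desired output, so the coefficients converge toward H1(z) of Equation 2. The function name, filter length, and step size are assumptions made for illustration.

```python
import numpy as np

def update_h1(h1, m1_frame, m2_frame, mu=0.1, eps=1e-8):
    # One NLMS pass over a noise-only frame: predict Mic 1 from Mic 2 and
    # nudge the coefficients toward H1(z) = M1n(z) / M2n(z) (Equation 2).
    L = len(h1)
    x = np.concatenate([np.zeros(L - 1), m2_frame])
    for n in range(len(m1_frame)):
        xn = x[n:n + L][::-1]                 # most recent L Mic 2 samples
        err = m1_frame[n] - np.dot(h1, xn)    # prediction error against Mic 1
        h1 = h1 + mu * err * xn / (np.dot(xn, xn) + eps)
    return h1
```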
With a solution for one of the unknowns in Equation 1, a solution can be found for another, H2(z), by using the amplitude of the GEMS or similar device along with the amplitudes of the two microphones. When the GEMS indicates voicing, but the recent (less than 1 second) history of the microphones indicates low levels of noise, assume that n(n) = N(z) ≈ 0. Then Equation 1 reduces to

M1s(z) = S(z)
M2s(z) = S(z)H2(z)
`
which in turn leads to

M2s(z) = M1s(z)H2(z)
H2(z) = M2s(z) / M1s(z)

which is the inverse of the H1(z) calculation, but note that different inputs are being used.
`
After calculating H1(z) and H2(z) above, they are used to remove the noise from the signal. Rewrite Equation 1 as

S(z) = M1(z) - N(z)H1(z)
N(z) = M2(z) - S(z)H2(z)
S(z) = M1(z) - [M2(z) - S(z)H2(z)]H1(z)
S(z)[1 - H2(z)H1(z)] = M1(z) - M2(z)H1(z)

and solve for S(z) as

S(z) = [M1(z) - M2(z)H1(z)] / [1 - H2(z)H1(z)]          (3)

In practice H2(z) is usually quite small, so that H2(z)H1(z) << 1, and

S(z) ≈ M1(z) - M2(z)H1(z),
`
`
`obviating the need for the H2(z) calculation.
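Once H1(z) has been estimated, the first-order result above can be applied directly. The short sketch below (names assumed, not from the application) filters Mic 2 through the estimated H1 coefficients and subtracts the result from Mic 1, implementing S(z) ≈ M1(z) - M2(z)H1(z) while ignoring the small H2(z) term as discussed.

```python
import numpy as np
from scipy.signal import lfilter

def denoise_frame(m1_frame, m2_frame, h1):
    # Approximate clean speech: S(z) ~ M1(z) - M2(z)H1(z), valid when
    # H1(z)H2(z) << 1 so the H2(z) correction can be dropped.
    noise_estimate = lfilter(h1, [1.0], m2_frame)
    return m1_frame - noise_estimate
```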
With reference to Figure 2 and Figure 3, the PSAD system is described. As sound waves propagate, they normally lose energy as they travel due to diffraction and dispersion. Assuming the sound waves originate from a point source and radiate isotropically, their amplitude will decrease as a function of 1/r, where r is the distance from the originating point. This 1/r dependence of the amplitude is the worst case; if the sound is confined to a smaller area the reduction will be less. However, it is an adequate model for the configurations of interest, specifically the propagation of noise and speech to microphones located somewhere on the user's head.
`
`
Figure 7 is a microphone array for use under an embodiment of the PSAD system. Placing the microphones Mic 1 and Mic 2 in a linear array with the mouth on the array midline, the difference in signal strength in Mic 1 and Mic 2 (assuming the microphones have identical frequency responses) will be proportional to both d1 and Δd. Assuming a 1/r (or in this case 1/d) relationship, it is seen that

ΔM = |Mic 1| / |Mic 2| ≈ ΔH1(z) ∝ (d1 + Δd) / d1,

where ΔM is the difference in gain between Mic 1 and Mic 2 and therefore H1(z), as above in Equation 2. The variable d1 is the distance from Mic 1 to the speech or noise source.
Figure 8 is a plot 800 of ΔM versus d1 for several Δd values, under an embodiment. It is clear that as Δd becomes larger and the noise source is closer, ΔM becomes larger. The variable Δd will change depending on the orientation to the speech/noise source, from the maximum value on the array midline to zero perpendicular to the array midline. From the plot 800 it is clear that for small Δd and for distances over approximately 30 centimeters (cm), ΔM is close to unity. Since most noise sources are farther away than 30 cm and are unlikely to be on the midline of the array, it is probable that when calculating H1(z) as above in Equation 2, ΔM (or equivalently the gain of H1(z)) will be close to unity. Conversely, for noise sources that are close (within a few centimeters), there could be a substantial difference in gain depending on which microphone is closer to the noise.
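A quick numerical check of the 1/d model makes the contrast concrete. The distances below are assumed for illustration only: a mouth roughly 1 cm from Mic 1 with a 5 cm microphone spacing, versus a noise source 1 m away.

```python
def delta_m(d1_cm, delta_d_cm):
    # Gain ratio between Mic 1 and Mic 2 under the 1/d model above.
    return (d1_cm + delta_d_cm) / d1_cm

print(delta_m(1.0, 5.0))    # nearby speech: 6.0, a large gain difference
print(delta_m(100.0, 5.0))  # distant noise: 1.05, close to unity
```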
`
`If the "noise" is the user speaking, and Mic 1 is closer to the mouth than Mic
`2, the gain increases. Since environmental noise normally originates much farther
`away from the user's head than speech, noise will be found during the time when
`the gain of H1(z) is near unity or some fixed value, and speech can be found after .
`a sharp rise in gain. The speech can be unvoiced or voiced, as long as it is of
`sufficient volume compared to the surrounding noise. The gain will stay somewhat
`high during the speech portions, then descend quickly after speech ceases. The
`rapid increase and decrease in the gain of H1(z) should be sufficient to allow the
`detection of speech under almost any circumstances. The gain in this example is
`calculated by the sum of the absolute. value of the filter coefficients. This sum is
`
`10
`
`15
`
`20
`
`25
`
`30
`
`11
`
`
`
`{
`
`5
`
`not equivalent to the gain, but the two are related in that a rise in the sum of the
`absolute value reflects a rise in the gain.
As an example of this behavior, Figure 9 shows a plot 900 of the gain parameter 902 as the sum of the absolute values of H1(z) and the acoustic data 904 or audio from microphone 1. The speech signal was an utterance of the phrase "pop pan", repeated twice. The evaluated bandwidth included the frequency range from 2500 Hz to 3500 Hz, although 1500 Hz to 2500 Hz was additionally used in practice. Note the rapid increase in the gain when the unvoiced speech is first encountered, then the rapid return to normal when the speech ends. The large changes in gain that result from transitions between noise and speech can be detected by any standard signal processing techniques. The standard deviation of the last few gain calculations is used, with thresholds being defined by a running average of the standard deviations and the standard deviation noise floor. The later changes in gain for the voiced speech are suppressed in this plot 900 for clarity.
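The gain-based trigger described above can be sketched as follows. The sum of absolute filter coefficients serves as the gain proxy, and a sharp rise is flagged when the standard deviation of the last few gain values exceeds a threshold built from earlier deviations and a noise floor. The window lengths, multiplier, floor value, and simplified baseline are assumptions; the application defines its thresholds empirically.

```python
import numpy as np

def psad_gain(h1):
    # Gain proxy: sum of the absolute values of the H1(z) filter coefficients.
    return float(np.sum(np.abs(h1)))

def psad_speech(gain_history, recent=5, k=3.0, sd_floor=1e-3):
    # Flag speech when the standard deviation of the last few gains rises
    # sharply above a baseline formed from the earlier gain history.
    gains = np.asarray(gain_history, dtype=float)
    recent_sd = np.std(gains[-recent:])
    older = gains[:-recent]
    baseline = np.std(older) if older.size >= recent else sd_floor
    return recent_sd > k * max(baseline, sd_floor)
```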
`Figure 10 is an alternative plot 1000 of acoustic data presented in Figure 9.
The data used to form plot 900 is presented again in this plot 1000, along with
`audio data 1004 and GEMS data 1006 without noise to make the unvoiced speech
`apparent. The voiced signal 1002 has three possible values: 0 for noise, 1 for
`unvoiced, and 2 for voiced. Denoising is only accomplished when V = 0. It is clear
`that the unvoiced speech is captured very well, aside from two single dropouts in
`the unvoiced detection near the end of each "pop". However, these single-window
`dropouts are not common and do not significantly affect the denoising algorithm.
`They can easily be removed using standard smoothing techniques.
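One example of the "standard smoothing techniques" mentioned above is a short median filter over the per-window voicing decisions, which removes isolated single-window dropouts without shifting longer voiced or unvoiced regions. The three-point kernel is an assumed choice, not one specified in the application.

```python
from scipy.signal import medfilt

def smooth_voicing(v):
    # Remove isolated single-window dropouts (e.g. ...,1,0,1,... -> ...,1,1,1,...)
    # in the 0/1/2 voicing track with a three-point median filter.
    return medfilt(v, kernel_size=3)
```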
What is not clear from this plot 1000 is that the PSAD system functions as an automatic backup to the NAVSAD. This is because the voiced speech (since it has the same spatial relationship to the mics as the unvoiced) will be detected as unvoiced if the sensor or NAVSAD system fails for any reason. The voiced speech will be misclassified as unvoiced, but the denoising will still not take place, preserving the quality of the speech signal.
However, this automatic backup of the NAVSAD system functions best in an environment with low noise (approximately 10+ dB SNR), as high amounts (10 dB of SNR or less) of acoustic noise can quickly overwhelm any acoustic-only unvoiced detector, including the PSAD. This is evident in the difference in the voiced signal data 602 and 1002 shown in plots 600 and 1000 of Figures 6 and 10, respectively, where the same utterance is spoken, but the data of plot 600 shows no unvoiced speech because the unvoiced speech is undetectable. This is the desired behavior when performing denoising, since if the unvoiced speech is not detectable then it will not significantly affect the denoising process. Using the Pathfinder system to detect unvoiced speech ensures detection of any unvoiced speech loud enough to distort the denoising.
`Regarding hardware considerations, and with reference to Figure 7, the
`configuration of the microphones can have an effect on the change in gain
`associated with speech and the thresholds needed to detect speech. In general,
`each configuration will require testing to determine the proper thresholds, but tests
`with two very different microphone configurations showed the same thresholds and
`other parameters to work well. The first microphone set had the signal
`microphone near the mouth and the noise microphone several centimeters away
`at the ear, while the second configuration placed the noise and signal microphones
`back-to-back within a few centimeters of the mouth. The results presented herein
`were derived using the first microphone configuration, but the results using the
`other set are virtually identical, so the detection algorithm is relatively robust with
`respect to microphone placement.
A number of configurations are possible using the NAVSAD and PSAD systems to detect voiced and unvoiced speech. One configuration uses the NAVSAD system (non-acoustic only) to detect voiced speech along with the PSAD system to detect unvoiced speech; the PSAD also functions as a backup to the NAVSAD system for detecting voiced speech. An alternative configuration uses the NAVSAD system (non-acoustic correlated with acoustic) to detect voiced speech along with the PSAD system to detect unvoiced speech; the PSAD also functions as a backup to the NAVSAD system for detecting voiced speech. Another alternative configuration uses the PSAD system to detect both voiced and unvoiced speech.
While the systems described above have been described with reference to separating voiced and unvoiced speech from background acoustic noise, there is no reason more complex classifications cannot be made. For more in-depth characterization of speech, the system can bandpass the information from Mic 1 and Mic 2 so that it is possible to see which bands in the Mic 1 data are more heavily composed of noise and which are more weighted with speech. Using this knowledge, it is possible to group the utterances by their spectral characteristics similar to conventional acoustic methods; this method would work better in noisy environments.
`As an example, the "k" in "kick" has significant frequency content form 500
`
`Hz to 4000 Hz, but a "sh" in "she" only contains significant energy from 1700-4000
`Hz. Voiced speech could be classified in a similar manner. For instance, an Iii
`
`("ee") has significant energy around 300 Hz and 2500 Hz, and an /a/ ("ah") has
`energy at around 900 Hz and 1200 Hz. This ability to discriminate unvoiced and
`voiced speech in the presence of noise is, thus, very useful.
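A sketch of this band-wise comparison follows. The band edges come from the frequency ranges quoted above, but the mapping from band energies to a label is an assumption made only to illustrate the idea, not a classifier disclosed in the application.

```python
import numpy as np
from scipy.signal import butter, lfilter

def band_energy(x, fs, lo, hi, order=4):
    # Bandpass the microphone data and return the in-band energy.
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    y = lfilter(b, a, x)
    return float(np.sum(y ** 2))

def unvoiced_label(frame, fs=8000):
    # "k"-like bursts carry energy from roughly 500 Hz upward, while "sh"
    # energy is concentrated above about 1700 Hz (band edges from the text).
    low = band_energy(frame, fs, 500, 1700)
    high = band_energy(frame, fs, 1700, 3900)
    return "sh-like" if high > low else "k-like"
```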
`Each of the steps depicted in the flow diagrams presented herein can itself
`include a sequence of operations that need not be described herein. Those skilled
`in the relevant art can create routines, algorithms, source code, microcode,
`program logic arrays or otherwise implement the invention based on the flow
`diagrams and the detailed description provided herein. The routines described
`herein can be provided with one or more of the following, or one or more
`combinations of the following: stored in non-volatile memory (not shown) that
`forms part of an associated processor or processors, or implemented using
`conventional programmed logic arrays or circuit elements, or stored in removable
media such as disks, or downloaded from a server and stored locally at a client, or hardwired or preprogrammed in chips such as EEPROM semiconductor chips, application specific integrated circuits (ASICs), or by digital signal processing
`(DSP) integrated circuits.
Unless described otherwise herein, the information described herein is well known or described in detail in the Related Applications. Indeed, much of the detailed description provided herein is explicitly disclosed in the Related Applications; most or all of the additional material of aspects of the invention will
`be recognized by those skilled in the relevant art as being inherent in the detailed
`description provided in such Related Applications, or well known to those skilled in
`
`the relevant art. Those skilled in the relevant art can implement aspects of the
`invention based on the material presented herein and the detailed description
`provided in the Related Applications.
`Unless the context clearly requires otherwise, throughout the description
`and the claims, the words "comprise," "comprising," and the like are to be
`
`construed in an inclusive sense as opposed to an exclusive or exhaustive sense;
`that is to say, in a sense of "including, but not limited to." Words using the singular
`or plural number also include the plural or singular number respectively.
`
`Additionally, the