10.21437/Interspeech.2011-418
INTERSPEECH 2011

Low-Frequency Bandwidth Extension of Telephone Speech Using Sinusoidal Synthesis and Gaussian Mixture Model

Hannu Pulakka1, Ulpu Remes2, Santeri Yrttiaho1,3, Kalle Palomäki2, Mikko Kurimo2, Paavo Alku1

1Department of Signal Processing and Acoustics, Aalto University, Finland
2Adaptive Informatics Research Centre, Aalto University, Finland
3Department of Biomedical Engineering and Computational Science, Aalto University, Finland
hannu.pulakka@aalto.fi

Abstract

The limited audio bandwidth of narrowband telephone speech degrades the speech quality. This paper proposes a method that extends the bandwidth of telephone speech to the frequency range 0–300 Hz. The lowest harmonics of voiced speech are generated using sinusoidal synthesis. The energy in the extension band is estimated from spectral features using a Gaussian mixture model. The amplitudes and phases of the synthesized signal are adjusted based on the amplitudes and phases of the narrowband input speech. The proposed method was evaluated with listening tests together with a bandwidth extension method for the range 4–8 kHz. The low-frequency bandwidth extension was found to reduce dissimilarity with wideband speech, but no perceived quality improvement was achieved.

Index Terms: bandwidth extension, Gaussian mixture model, listening test, sinusoidal synthesis, speech enhancement

1. Introduction

The audio bandwidth of most telephone systems today is limited to approximately the traditional telephone band of 300–3400 Hz. The bandwidth limitation degrades the speech quality and has an adverse effect on intelligibility. The GSM cellular telephone system using the adaptive multi-rate (AMR) codec [1] is an example of a present-day narrowband speech transmission system. Wideband speech transmission, which typically covers 50–7000 Hz, offers improved quality and intelligibility. Wideband speech services using the AMR-WB codec are becoming available in mobile telephone networks in an increasing number of countries, but the transition is expected to take a long time. Artificial bandwidth extension (ABE) techniques have been developed to enhance narrowband speech by generating spectral content in the missing frequency regions. The spectrum can be reconstructed below 300 Hz (denoted as the lowband in this paper), in the range 3.4–8 kHz (the highband), or both. Most studies on ABE have focused on the highband, whereas this paper concentrates on bandwidth extension in the lowband.

Although the audio bandwidth in digital mobile telephone systems is not strictly limited to the traditional telephone band, low frequencies are attenuated in order to reduce background noise [2]. Handheld and headset terminals for narrowband telephony in 3G are required to have an attenuation of at least 12 dB at the frequency of 100 Hz [3]. No lower limit is defined for the sensitivity at 100 Hz or 200 Hz, so the low-frequency band may be completely missing [3, 4]. Mobile terminals typically have a very limited capability to reproduce low frequencies, but telephone conversations increasingly occur in conditions where low-frequency reproduction is possible, such as when using hands-free or video conferencing equipment. Lowband extension can especially improve the naturalness of low-pitched male speech.

Various low-frequency bandwidth extension methods have been proposed. Techniques used for generating spectral content in the lowband include nonlinear processing of the narrowband signal [5] and sinusoidal synthesis, possibly combined with noise addition [6, 4, 7, 8]. Methods based on sinusoidal synthesis require accurate estimation of the fundamental frequency (f0). In [7], the f0 estimate is obtained from a CELP coder. The absolute phases of the lowband partials are usually considered perceptually unimportant [6, 4], and a waveform that is continuous from frame to frame is generated [6, 4, 7, 8]. The amplitudes of the lowband harmonics or the lowband spectral envelope can be estimated using, e.g., linear mapping [4], codebooks [4, 7], neural networks [6], or Gaussian mixture models (GMM) [9].

This paper introduces a speech bandwidth extension method for the lowband. The proposed method uses a GMM to estimate the lowband energy from spectral features and sinusoidal synthesis to generate the lowest harmonics of voiced speech. The amplitudes and phases of the synthesized harmonics are adjusted based on the characteristics of the input signal, which is a major addition to previously presented methods (e.g., [6, 4, 7, 8]). The proposed method is referred to as lowband ABE (LB-ABE). The method is evaluated together with a previously presented highband extension method [10].

2. Method

This section describes the LB-ABE method, whose block diagram is shown in Figure 1. The input to the method is a narrowband speech signal with unknown low-frequency characteristics. The method attempts to reconstruct the lowband by synthesizing the lowest harmonics of f0 in voiced speech segments. The f0 estimates are obtained from the AMR decoder [1]. The amplitudes of the lowband harmonics are derived from an estimate of the lowband energy. This estimate is calculated from narrowband spectral features using a GMM predictor. The phases of the harmonics are determined from the input signal if the observed phases of the harmonics are found to be consistent from frame to frame. Otherwise, continuous phase contours in successive frames are generated. Finally, sinusoidal synthesis is used to generate the harmonics from the estimated parameters.

The lowband primarily contains the lowest harmonics of voiced speech [2] and is not important for unvoiced sounds [4, 8]. In the proposed method, additional features are calculated for each frame to assess the degree of voicing. The lowband synthesis is muted in frames that are classified as unvoiced as well as in frames where the f0 estimate is considered unreliable.

Figure 1: Block diagram of the LB-ABE method.

2.1. Input signal framing and feature calculation

The narrowband input s_nb from the AMR decoder (A in Figure 1) has an 8-kHz sampling rate. The signal is divided into frames (B) with a 5-ms frame shift, which is equal to the subframe length of the AMR codec [1]. A lookahead of 5 ms is used. Time-domain and frequency-domain features used in the system are computed from the frames (C).

The estimation of the lowband energy is based on spectral features calculated as follows. The power spectrum, computed using a 128-point FFT and a 16-ms Hamming window, is divided into seven sub-bands by trapezoidal filter windows, and the sub-band energies are log-compressed. The sub-bands are located linearly on the mel scale.

The detection of voiced speech segments utilizes two time-domain features calculated in 20-ms windows: the gradient index [2, 10], which gives low values for voiced speech and high values for unvoiced speech [2], and the frame energy, which represents the input signal energy in a frame.

Estimates of f0 are based on the pitch period estimates obtained from the AMR decoder (A) for every 5-ms subframe. The f0 estimates are further processed (D) to reduce octave errors and other discontinuities.

2.2. Lowband energy estimation

A GMM-based predictor is used for the estimation of the lowband energy (E). The goal is to estimate the missing frequencies based on input information derived from the narrowband signal. The joint distribution of the input and output variables x and y, respectively, is modeled as a mixture of Gaussians where each mixture component ν is associated with a mean vector µ(ν) and a full covariance matrix Σ(ν). The component or cluster index ν is assumed to be a hidden variable, and in this work, the clusters and distribution parameters are jointly estimated using the
expectation-maximization (EM) algorithm implemented in the GMMBAYES Matlab toolbox (available at www.it.lut.fi/project/gmmbayes/).
Given the observed input x(m) and the joint distribution model, the minimum mean square error (MMSE) estimate for the missing frequencies y in the mth frame is calculated as

\[
E\{y \mid x(m), \Lambda\} = \sum_{\nu} P(\nu \mid x(m), \Lambda)\, E\{y \mid x(m), \Lambda, \nu\}, \tag{1}
\]

where x(m) denotes the input observed in the mth frame and Λ the model parameters. The posterior probabilities for the clusters ν, P(ν | x(m), Λ), are calculated from the prior probabilities P(ν) estimated from the training data and the likelihoods p(x(m) | Λ, ν), which are calculated assuming diagonal covariances. The expected values E{y | x(m), Λ, ν} are calculated as

\[
E\{y \mid x(m), \Lambda, \nu\} = \mu_y + \Sigma_{yx} \Sigma_{xx}^{-1} \big( x(m) - \mu_x \big), \tag{2}
\]

where µ_x and µ_y are the input and output means, Σ_yx the cross-covariance, and Σ_xx the input covariance, as specified in the νth Gaussian component of the model. The cluster-dependent linear transformations R(ν) = Σ_yx Σ_xx^{-1} can be precomputed into a look-up table to save computation during use.
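For illustration, a minimal NumPy sketch of such a predictor is given below. The class layout, parameter names, and array shapes are assumptions made for this sketch; only the diagonal-covariance likelihood evaluation, the per-component conditional means of Eq. (2), and the precomputed transforms R(ν) follow the description above.

```python
import numpy as np

class GMMPredictor:
    """Sketch of a GMM-based MMSE predictor in the spirit of Eqs. (1)-(2)."""

    def __init__(self, weights, mu_x, mu_y, var_x_diag, sigma_yx):
        self.w = weights            # prior probabilities P(nu), shape (K,)
        self.mu_x = mu_x            # input means, shape (K, Dx)
        self.mu_y = mu_y            # output means, shape (K, Dy)
        self.var_x = var_x_diag     # diagonal of Sigma_xx, shape (K, Dx)
        # Precomputed transforms R(nu) = Sigma_yx Sigma_xx^{-1}, shape (K, Dy, Dx)
        self.R = sigma_yx / var_x_diag[:, None, :]

    def predict(self, x):
        diff = x[None, :] - self.mu_x                       # (K, Dx)
        # log p(x | Lambda, nu) with diagonal covariances
        log_lik = -0.5 * np.sum(diff ** 2 / self.var_x
                                + np.log(2.0 * np.pi * self.var_x), axis=1)
        log_post = np.log(self.w) + log_lik                 # unnormalized log posteriors
        post = np.exp(log_post - log_post.max())            # log-sum-exp trick
        post /= post.sum()                                  # P(nu | x, Lambda)
        cond = self.mu_y + np.einsum('kij,kj->ki', self.R, diff)  # Eq. (2)
        return post @ cond                                  # Eq. (1): E{y | x}
```

In use, the input vector x would stack the log-compressed sub-band features of the current and two preceding frames, as described in the training setup below.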
The 10-component GMM used in this work was trained on 52 minutes of clean speech from the Finnish SPEECON database [11]. Log-compressed spectral features of the current and two preceding frames were used as the input x(m). The training data was formed as follows. The input features x(m) were computed from narrowband signals filtered with the MSIN filter, which simulates the input response of a mobile station [12], scaled to −26 dBov, and coded with the AMR codec (12.2 kbps). The output y(m) represents the log-compressed lowband energy in wideband speech. The target outputs were calculated from the same samples as the input features, using identical scaling but no filtering or coding. The lowband energy of each frame was extracted using a 128-point FFT and one trapezoidal filter window with −3-dB points at 40 Hz and 330 Hz.
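A sketch of this target extraction is shown below. The trapezoid corner frequencies and the sampling rate used in the analysis are assumptions; the text above specifies only the 128-point FFT and the −3-dB points at 40 Hz and 330 Hz.

```python
import numpy as np

def trapezoidal_window(freqs, f1, f2, f3, f4):
    """Piecewise-linear trapezoid: zero below f1 and above f4, unity
    between f2 and f3, with linear ramps in between."""
    w = np.zeros_like(freqs)
    up = (freqs >= f1) & (freqs < f2)
    flat = (freqs >= f2) & (freqs <= f3)
    down = (freqs > f3) & (freqs <= f4)
    w[up] = (freqs[up] - f1) / (f2 - f1)
    w[flat] = 1.0
    w[down] = (f4 - freqs[down]) / (f4 - f3)
    return w

def lowband_log_energy(frame, fs=8000, nfft=128):
    """Log-compressed lowband energy of one frame (training target y)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
    # Ramps chosen so the -3 dB points fall near 40 Hz and 330 Hz;
    # the exact corner frequencies are an assumption.
    w = trapezoidal_window(freqs, 10.0, 70.0, 300.0, 360.0)
    return np.log(np.sum(w * spec) + 1e-12)
```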
To reduce occasional peaks in the lowband energy, the energy estimates are processed with an adaptive compression scheme. A smoothed energy estimate is computed from frames of active speech, and energy estimates exceeding 1.5 times the smoothed value are limited using a logarithmic compression curve.
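A minimal sketch of this limiter is given below. The 1.5× threshold follows the text above, while the smoothing constant and the exact shape of the logarithmic curve are assumptions.

```python
import numpy as np

def limit_peak(e, e_smooth, active, alpha=0.95):
    """Adaptive limiting of a lowband energy estimate e (linear domain)."""
    if active:                                  # update tracker on active speech only
        e_smooth = alpha * e_smooth + (1.0 - alpha) * e
    ceiling = 1.5 * e_smooth
    if e > ceiling:
        e = ceiling * (1.0 + np.log(e / ceiling))   # logarithmic compression above ceiling
    return e, e_smooth
```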
2.3. Amplitude calculation for lowband harmonics

Harmonic sinusoidal components of equal amplitude are generated in the lowband so that the energy estimate provided by the GMM is approximately realized (G). The amplitude A_e(m) of the harmonic components in frame m is computed as

\[
A_e(m) = \sqrt{\frac{k \, \hat{E}_{\mathrm{lb}}(m)}{330\,\mathrm{Hz}/\hat{f}_0(m) - 0.5}}, \tag{3}
\]

where Ê_lb(m) is the lowband energy estimate based on the GMM output, f̂_0(m) is a smoothed fundamental frequency estimate, the denominator approximates the number of harmonics below 330 Hz, and the constant k compensates for the effects of windowing, etc.
In this work, the synthesis amplitudes are modified depending on the observed amplitudes at the frequencies of the lowband harmonics in the input signal. The input is analyzed (F) in Hann-windowed segments of 20 ms using the formula

\[
S(m, l) = \sum_{k=0}^{N-1} s_{m,w}(k)\, e^{-2\pi i k\, l f_0(m)/f_s}, \tag{4}
\]
where N is the window length, s_{m,w} is the windowed signal of frame m, and f_s is the sampling frequency. S(m, l) corresponds to the discrete-time Fourier transform of the input signal, evaluated at the frequencies of the harmonics up to 400 Hz. The amplitude of the lth harmonic in the input is A(m, l) = c_A |S(m, l)|, where c_A is a constant that compensates for the effects of windowing.
The synthesis amplitudes are then determined by subtracting the observed harmonic amplitudes A(m, l) from the estimated values A_e(m), so that the sum of the input signal and the synthesized signal has approximately the estimated amplitudes. To provide a smooth transition from the lowband to the telephone band, the amplification of the observed harmonics is limited at the upper end of the extension band. Finally, a smoothing filter is applied to the synthesis amplitudes.

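The following sketch illustrates the harmonic analysis of Eq. (4) and the amplitude subtraction described above. The value of c_A is an assumption, and the band-edge gain limiting and amplitude smoothing mentioned in the text are simplified to clipping at zero.

```python
import numpy as np

def harmonic_dtft(frame, f0, fs=8000, f_max=400.0, c_a=1.0):
    """Evaluate Eq. (4): the DTFT of a 20-ms Hann-windowed segment at the
    harmonic frequencies l*f0 up to f_max."""
    s_w = frame * np.hanning(len(frame))
    k = np.arange(len(s_w))
    n_harm = int(f_max // f0)
    S = np.array([np.sum(s_w * np.exp(-2j * np.pi * k * l * f0 / fs))
                  for l in range(1, n_harm + 1)])
    return c_a * np.abs(S), np.angle(S)        # A(m, l) and phi(m, l)

def adjusted_synthesis_amplitudes(a_obs, a_e):
    """Subtract observed harmonic amplitudes from the target Ae(m), so that
    input + synthesis approximately realizes the target amplitudes."""
    return np.maximum(a_e - a_obs, 0.0)        # no energy can be removed by addition
```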
2.4. Phase estimation for lowband harmonics

As a novel technique, the phases of the synthetic sinusoidal components are set such that they are coherent with the phases of the possibly existing but attenuated harmonics in the input signal. If the observed phases are found to be unreliable, they are ignored and a phase contour that is continuous from frame to frame is generated (G).

The observed phase φ(m, l) of the lth harmonic in frame m is calculated as the phase angle of S(m, l) defined in (4). A reference phase for each frame and harmonic is computed from the previous synthesis phase and the fundamental frequency estimates of the current and the previous frame, assuming phase continuity at the frame boundary. The synthesis phase φ̃(m, l) of the lth harmonic depends on the current and past observed phases of that harmonic and is determined by the first matching condition in the following list (a sketch of this selection logic is given after Section 2.5):

1. If the observed phase does not behave in a consistent way on average, a continuous phase is generated.

2. At signal onset, the observed phase is used.

3. If the mismatch between the observed phase and the reference phase is small, the observed phase is used.

4. If the observed phase is nearly continuous in a few past frames, the observed phase is approached gradually.

5. Otherwise, a continuous phase contour is generated.

2.5. Lowband signal synthesis

Each frame m of the lowband signal is synthesized as a sum of sinusoidal components up to a frequency limit of 400 Hz (H). The lth harmonic is generated as a sine wave with amplitude Ã(m, l), phase φ̃(m, l), and frequency l·f0(m).

To reduce artifacts, the output signal is muted (I) if (1) the gradient index exceeds a fixed threshold, indicating unvoiced speech, (2) the frame energy does not substantially exceed an adaptive noise floor estimate, (3) the f0 estimate varies greatly from frame to frame, or (4) the f0 estimate is substantially lower than a smoothed long-term estimate. The last condition reduces artifacts caused by downward octave errors, observed especially for creaky female voices. Transition regions from complete muting to no attenuation are defined around the thresholds.

Synthesized frames are windowed with a 10-ms Hann window and overlap-added to obtain a continuous lowband signal s_lb with smooth transitions between adjacent frames (J). Finally, the synthesized lowband signal s_lb is added to the narrowband signal s_nb to obtain the output s_output of the method (K).

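The sketch below illustrates how the first-matching-condition phase selection of Section 2.4 could be implemented for one harmonic. The boolean consistency flags, the mismatch threshold, and the rate of the gradual approach are assumptions; the text above specifies only the decision order.

```python
import numpy as np

def wrap_phase(p):
    """Wrap a phase value to the interval (-pi, pi]."""
    return (p + np.pi) % (2.0 * np.pi) - np.pi

def select_phase(phi_obs, phi_ref, onset, consistent, nearly_continuous,
                 mismatch_thresh=0.5, approach_rate=0.25):
    """Choose the synthesis phase for one harmonic in one frame.

    phi_ref is the reference phase continued from the previous frame;
    phi_obs is the observed phase from Eq. (4)."""
    if not consistent:                    # 1: inconsistent observations
        return phi_ref                    #    -> continuous phase contour
    if onset:                             # 2: signal onset
        return phi_obs
    err = wrap_phase(phi_obs - phi_ref)
    if abs(err) < mismatch_thresh:        # 3: small mismatch
        return phi_obs
    if nearly_continuous:                 # 4: approach observed phase gradually
        return wrap_phase(phi_ref + approach_rate * err)
    return phi_ref                        # 5: continuous phase contour
```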
3. Evaluation

3.1. Test samples

Listening tests were arranged to evaluate the proposed LB-ABE method in combination with the previously presented highband extension method FB-ABE [10]. Evaluation of lowband extension together with highband extension is necessary because lowband extension alone can be perceived as spectrally unbalanced [4]. The methods were evaluated using twenty sentences selected from the SPEECON database [11]. All sentences were spoken in Finnish by different talkers (10 females and 10 males), and all were recorded in an office environment.

To simulate narrowband telephone speech, the test samples were highpass filtered (−3 dB at 293 Hz) and the spectrum below 181 Hz was cleared with FFT-based processing (a sketch of this band limitation is given at the end of this subsection). The band-limited signals were scaled to −26 dBov and processed with the AMR codec (12.2 kbps).

Four versions of each test sentence were generated:

• NB: AMR-coded narrowband speech.
• NB+HB: NB with highband extension using FB-ABE.
• LB+NB+HB: NB+HB with lowband extension using the proposed LB-ABE method.
• WB: Wideband reference generated from the wideband signal by filtering with the P.341 filter [12] and processing through the AMR-WB codec (12.65 kbps).

For each test sample, the adaptation of the methods was arranged by first processing another sentence spoken by the same talker in the same environment.

The loudness of the processing types was normalized. The amplitude of each processed test sentence was adjusted manually by two listeners to match the loudness of the WB version of the sentence. An average loudness normalization factor was determined for each processing type from these data, and the test samples were scaled accordingly.

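The following sketch shows one way to realize the band limitation described above. The Butterworth design and its order are assumptions; the text specifies only the −3-dB point at 293 Hz and the FFT-based clearing below 181 Hz.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def simulate_lowcut_narrowband(x, fs=8000):
    """Band limitation applied to the test samples (hedged sketch)."""
    sos = butter(4, 293.0, btype='highpass', fs=fs, output='sos')
    y = sosfilt(sos, x)                      # -3 dB near 293 Hz
    spec = np.fft.rfft(y)
    freqs = np.fft.rfftfreq(len(y), 1.0 / fs)
    spec[freqs < 181.0] = 0.0                # clear the spectrum below 181 Hz
    return np.fft.irfft(spec, n=len(y))
```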
3.2. Test methods

The quality of the processed speech samples was evaluated using a listening test procedure similar to the comparison category rating (CCR) test also used in [10]. The test comprised pairwise comparisons between the processing types. In each case, the latter sample was compared to the first one on a scale from much worse (−3) to much better (+3). Each listener compared all processing pairs in combination with ten different sentences. Null pairs with identical samples were also included.

Another test was arranged to evaluate the perceived dissimilarity between AMR-WB-coded wideband speech and the processed narrowband signals. In each test case, the listener first heard the WB version of a sentence and then the same sentence processed with another processing type. The task of the listener was to answer the question "How much does the latter sample differ from the first sample?" on a scale from not at all (0) to very much (4) with half-unit steps. All twenty sentences were included in the test. For each sentence, all three processing types were compared with the wideband reference. Additionally, six comparisons with identical wideband samples were presented to each listener.

In both tests, the samples were played to both ears using Sennheiser HD 580 headphones in a silent room. The listeners were instructed to adjust the volume to a suitable listening level during practice sessions preceding both tests. Seventeen Finnish listeners (5 females and 12 males) between 20 and 31 years of age participated in the tests.

CCR data were analyzed in terms of a summary variable calculated as the mean of the CCR scores for each processing type separately. In these calculations, every comparison involving a specific processing type was included. The dissimilarity test data were analyzed in terms of the original comparisons. The means of the scores were compared with analyses of variance (ANOVA). For post-hoc comparisons between pairwise data means, Tukey's HSD statistic was used.

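For concreteness, the sketch below shows the summary-variable computation and a one-way ANOVA over the processing types. The data layout is hypothetical; the post-hoc Tukey HSD step is only noted in a comment, since its exact implementation is not specified above.

```python
import numpy as np
from scipy.stats import f_oneway

def ccr_summary(scores_by_type):
    """scores_by_type maps a processing type (e.g., 'WB', 'NB') to an array
    of signed CCR scores from every comparison involving that type."""
    means = {t: float(np.mean(s)) for t, s in scores_by_type.items()}
    # One-way ANOVA across processing types; pairwise post-hoc comparisons
    # would follow with Tukey's HSD (e.g., scipy.stats.tukey_hsd in recent
    # SciPy versions).
    f_stat, p_value = f_oneway(*scores_by_type.values())
    return means, f_stat, p_value
```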
3.3. Results

The long-term average spectra of the listening test samples are shown in Figure 2. In the lowband, the average LB+NB+HB spectrum is close to the average WB spectrum. The effect of loudness normalization (Section 3.1) can be seen in the figure.

Figure 2: Average spectra of the listening test samples.

The differences between the mean CCR values [F(3, 48) = 35.703, p < 0.001] are shown in Figure 3(a). WB had higher mean CCR values than all the other processing types (p-values < 0.01). The NB processing had, in contrast, smaller CCR values than the rest of the samples (p-values < 0.01). There was no significant difference between LB+NB+HB and NB+HB in terms of the mean CCR score. It is noteworthy that the listening test shows a significant quality improvement for FB-ABE compared with NB speech. This indicates that the quality improvement reported in [10] was not caused by increased loudness only.

The dissimilarity score differed between the processing types [F(3, 1114) = 208.23, p < 0.001], as shown in Figure 3(b). The processing types could be ranked in terms of increasing dissimilarity from WB in the following order: WB, LB+NB+HB, NB+HB, NB. The ratings, however, interacted with the speaker gender [F(3, 1114) = 7.6220, p < 0.001]. In particular, the difference between LB+NB+HB and NB+HB was observed only for the male speakers (p < 0.001).

Figure 3: Mean CCR scores (a) and dissimilarity scores for comparisons against WB samples (b). Error bars indicate the standard error of the mean.

4. Conclusions

A method for the low-frequency bandwidth extension of telephone speech was described. The method was evaluated together with the previously proposed highband extension method [10]. The results indicate that the lowband extension did not improve the perceived speech quality. This may be due to occasional artifacts in the lowband, although it is also known that listener preferences regarding spectral balance vary in general [4]. However, lowband extension was found to reduce the dissimilarity with wideband speech, which is an important goal for bandwidth extension during the transition from narrowband to wideband telephone systems. The reduced dissimilarity was not observed for female voices, which typically have less energy in the lowband due to their higher f0. The results resemble those reported for lowband and highband extension in [7].

5. Acknowledgements

This work is funded by the graduate schools GETA and Hecse, the Academy of Finland (LASTU research programme 135003, project 136209, and AIRC), Aalto University (Mide/UI-ART), and Nokia.

6. References

[1] Adaptive Multi-Rate (AMR) Speech Codec, Transcoding Functions, 3GPP TS 26.090, 3rd Generation Partnership Project (3GPP), 2004, version 6.0.0.

[2] P. Jax, "Enhancement of bandlimited speech signals: Algorithms and theoretical bounds," Ph.D. dissertation, Rheinisch-Westfälische Technische Hochschule Aachen, Germany, 2002.

[3] Terminal Acoustic Characteristics for Telephony, Requirements, 3GPP TS 26.131, 3rd Generation Partnership Project (3GPP), 2010, version 10.0.0.

[4] G. Miet, "Towards wideband speech by narrowband speech bandwidth extension: magic effect or wideband recovery?" Ph.D. dissertation, Université du Maine, France, 2001.

[5] U. Kornagel, "Techniques for artificial bandwidth extension of telephone speech," Signal Process., vol. 86, no. 6, pp. 1296–1306, 2006.

[6] J.-M. Valin and R. Lefebvre, "Bandwidth extension of narrowband speech for low bit-rate wideband coding," in Proc. IEEE Speech Coding Workshop (SCW), 2000, pp. 130–132.

[7] J. S. Park, M. Y. Choi, and H. S. Kim, "Low-band extension of CELP speech coder by harmonics recovery," in Proc. ISPACS, 2004, pp. 147–150.

[8] H. Gustafsson, U. A. Lindgren, and I. Claesson, "Low-complexity feature-mapped speech bandwidth extension," IEEE Trans. Speech Audio Process., vol. 14, no. 2, pp. 577–588, 2006.

[9] I. Uysal, H. Sathyendra, and J. G. Harris, "Bandwidth extension of telephone speech using frame-based excitation and robust features," in Proc. EUSIPCO, 1995.

[10] H. Pulakka and P. Alku, "Bandwidth extension of telephone speech using a neural network and a filter bank implementation for highband mel spectrum," IEEE Trans. Audio, Speech, Language Process., 2011, accepted for publication.

[11] D. Iskra, B. Grosskopf, K. Marasek, H. van den Heuvel, F. Diehl, and A. Kiessling, "SPEECON – speech databases for consumer devices: Database specification and validation," in Proc. LREC, 2002, pp. 329–333.

[12] ITU-T Recommendation G.191, Software Tools for Speech and Audio Coding Standardization, Int. Telecommun. Union, 2005.