`
`INTERSPEECH 2011
`
`Low-Frequency Bandwidth Extension of Telephone Speech Using
`Sinusoidal Synthesis and Gaussian Mixture Model
`
Hannu Pulakka1, Ulpu Remes2, Santeri Yrttiaho1,3, Kalle Palomäki2, Mikko Kurimo2, Paavo Alku1
`
`1Department of Signal Processing and Acoustics, Aalto University, Finland
`2Adaptive Informatics Research Centre, Aalto University, Finland
`3Department of Biomedical Engineering and Computational Science, Aalto University, Finland
`hannu.pulakka@aalto.fi
`
`Abstract
`
`The limited audio bandwidth of narrowband telephone speech
`degrades the speech quality. This paper proposes a method that
`extends the bandwidth of telephone speech to the frequency
`range 0–300 Hz. The lowest harmonics of voiced speech are
`generated using sinusoidal synthesis. The energy in the exten-
`sion band is estimated from spectral features using a Gaussian
`mixture model. The amplitudes and phases of the synthesized
`signal are adjusted based on the amplitudes and phases of the
`narrowband input speech. The proposed method was evaluated
`with listening tests together with a bandwidth extension method
`for the range 4–8 kHz. The low-frequency bandwidth extension
was found to reduce the dissimilarity with wideband speech, but no
improvement in perceived quality was achieved.
`Index Terms: bandwidth extension, Gaussian mixture model,
`listening test, sinusoidal synthesis, speech enhancement
`
`1. Introduction
`
`The audio bandwidth of most telephone systems today is lim-
`ited to approximately the traditional telephone band of 300–
`3400 Hz. The bandwidth limitation degrades the speech quality
`and has an adverse effect on intelligibility. The GSM cellular
`telephone system using the adaptive multi-rate (AMR) codec
`[1] is an example of a present-day narrowband speech transmis-
`sion system. Wideband speech transmission, which typically
`covers 50–7000 Hz, offers improved quality and intelligibility.
`Wideband speech services using the AMR-WB codec are be-
`coming available in mobile telephone networks in an increas-
`ing number of countries, but the transition is expected to take
`a long time. Artificial bandwidth extension (ABE) techniques
`have been developed to enhance narrowband speech by generat-
`ing spectral content in the missing frequency regions. The spec-
`trum can be reconstructed below 300 Hz (denoted as lowband
`in this paper) or in the range 3.4–8 kHz (highband) or both.
`Most studies on ABE have focused on the highband, whereas
`this paper concentrates on bandwidth extension in the lowband.
`Although the audio bandwidth in digital mobile telephone
`systems is not strictly limited to the traditional telephone band,
`low frequencies are attenuated in order to reduce background
`noise [2]. Handheld and headset terminals for narrowband tele-
`phony in 3G are required to have attenuation of at least 12 dB
`at the frequency of 100 Hz [3]. No lower limit is defined on
`the sensitivity at 100 Hz or 200 Hz, so the low-frequency band
`may be completely missing [3, 4]. Mobile terminals typically
`have a very limited capability to reproduce low frequencies, but
`telephone conversations occur increasingly in conditions where
low-frequency reproduction is possible, such as when using hands-free or video conferencing equipment. Lowband extension can
`improve especially the naturalness of low-pitched male speech.
`Various low-frequency bandwidth extension methods have
`been proposed. Techniques used for generating spectral con-
`tent in the lowband include nonlinear processing of the narrow-
`band signal [5] and sinusoidal synthesis possibly combined with
`noise addition [6, 4, 7, 8]. Methods based on sinusoidal syn-
`thesis require accurate estimation of the fundamental frequency
`(f0). In [7], the f0 estimate is obtained from a CELP coder. The
`absolute phases of the lowband partials are usually considered
`perceptually unimportant [6, 4] and a continuous waveform is
`generated from frame to frame [6, 4, 7, 8]. The amplitudes of
`lowband harmonics or the lowband spectral envelope can be es-
`timated using, e.g., linear mapping [4], codebooks [4, 7], neural
`networks [6], or Gaussian mixture models (GMM) [9].
This paper introduces a speech bandwidth extension
`method for the lowband. The proposed method uses a GMM
`to estimate the lowband energy from spectral features and si-
`nusoidal synthesis to generate the lowest harmonics of voiced
`speech. The amplitudes and phases of the synthesized harmon-
`ics are adjusted based on the characteristics of the input sig-
`nal, which is a major addition to previously presented methods
`(e.g., [6, 4, 7, 8]). The proposed method is referred to as low-
`band ABE (LB-ABE). The method is evaluated together with a
`previously presented highband extension method [10].
`
`2. Method
`
`This section describes the LB-ABE method, whose block dia-
`gram is shown in Figure 1. The input to the method is a nar-
`rowband speech signal with unknown low-frequency character-
`istics. The method attempts to reconstruct the lowband by syn-
`thesizing the lowest harmonics of f0 in voiced speech segments.
`The f0 estimates are obtained from the AMR decoder [1]. The
`amplitudes of the lowband harmonics are derived from an esti-
`mate of lowband energy. This estimate is calculated from nar-
`rowband spectral features using a GMM predictor. The phases
`of the harmonics are determined from the input signal if the ob-
`served phases of the harmonics are found to be consistent from
`frame to frame. Otherwise, continuous phase contours in suc-
`cessive frames are generated. Finally, sinusoidal synthesis is
`used to generate the harmonics from the estimated parameters.
`The lowband primarily contains the lowest harmonics of
`voiced speech [2] and is not important for unvoiced sounds [4,
`8]. In the proposed method, additional features are calculated
`for each frame to assess the degree of voicing. The lowband
`synthesis is muted in frames that are classified as unvoiced as
`well as in frames where the f0 estimate is considered unreliable.
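As an illustration of the GMM-based energy prediction used in this pipeline, the following sketch computes a posterior-weighted minimum mean square error estimate from precomputed cluster regressions. The function name, the parameter shapes, and the scalar output are illustrative assumptions for this sketch, not the implementation used in this work (which is Matlab-based).

```python
import numpy as np

def gmm_mmse_predict(x, weights, mu_x, mu_y, var_x, R):
    """MMSE prediction of a scalar output y from an input feature
    vector x under a Gaussian mixture joint model with diagonal
    input covariances.

    x      : (D,)   input feature vector
    weights: (K,)   prior cluster probabilities P(nu)
    mu_x   : (K, D) input means;  mu_y: (K,) output means
    var_x  : (K, D) diagonal input variances
    R      : (K, D) precomputed regression rows Sigma_yx Sigma_xx^-1
    """
    # log-likelihoods log p(x | nu) under diagonal Gaussians
    diff = x - mu_x
    log_lik = -0.5 * np.sum(diff**2 / var_x + np.log(2 * np.pi * var_x), axis=1)
    # posterior P(nu | x) via log-sum-exp for numerical stability
    log_post = np.log(weights) + log_lik
    log_post -= np.logaddexp.reduce(log_post)
    post = np.exp(log_post)
    # cluster-conditional expectations E{y | x, nu}, then their mixture
    cond = mu_y + np.sum(R * diff, axis=1)
    return float(np.dot(post, cond))
```

With a single cluster and a zero regression matrix the estimate reduces to the output mean, which is a quick sanity check of the mixture weighting.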
`
`Copyright © 2011 ISCA
`
`1181
`
`28-31 August 2011, Florence, Italy
`
`
`
`
Figure 1: Block diagram of the LB-ABE method.

2.1. Input signal framing and feature calculation

The narrowband input snb from the AMR decoder (A in Figure 1) has an 8-kHz sampling rate. The signal is divided into frames (B) with a 5-ms frame shift, which is equal to the subframe length of the AMR codec [1]. A lookahead of 5 ms is used. Time-domain and frequency-domain features used in the system are computed from the frames (C).

The estimation of the lowband energy is based on spectral features calculated as follows. The power spectrum, computed using a 128-point FFT and a 16-ms Hamming window, is divided into seven sub-bands by trapezoidal filter windows, and the sub-band energies are log-compressed. The sub-bands are spaced linearly on the mel scale.

The detection of voiced speech segments utilizes two time-domain features calculated in 20-ms windows: the gradient index [2, 10], which gives low values for voiced speech and high values for unvoiced speech [2], and the frame energy, which represents the input signal energy in a frame.

Estimates of f0 are based on the pitch period estimates obtained from the AMR decoder (A) for every 5-ms subframe. The f0 estimates are further processed (D) to reduce octave errors and other discontinuities.

2.2. Lowband energy estimation

A GMM-based predictor is used for the estimation of the lowband energy (E). The goal is to estimate the missing frequencies based on input information derived from the narrowband signal. The joint distribution of the input and output variables x and y, respectively, is modeled as a mixture of Gaussians where each mixture component ν is associated with a mean vector µ(ν) and a full covariance matrix Σ(ν). The component or cluster index ν is assumed to be a hidden variable, and in this work, the clusters and distribution parameters are jointly estimated using the expectation-maximization (EM) algorithm implemented in the GMMBAYES Matlab toolbox (available at www.it.lut.fi/project/gmmbayes/).

Given the observed input x(m) and the joint distribution model, the minimum mean square error (MMSE) estimate for the missing frequencies y in the mth frame is calculated as

E{y|x(m), Λ} = Σ_ν P(ν|x(m), Λ) E{y|x(m), Λ, ν},    (1)

where x(m) denotes the input feature vector of the mth frame and Λ the model parameters. The posterior probabilities for the clusters ν, P(ν|x(m), Λ), are calculated from the prior probabilities P(ν) estimated from the training data and the likelihoods p(x(m)|Λ, ν), which are calculated assuming diagonal covariances. The expected values E{y|x(m), Λ, ν} are calculated as

E{y|x(m), Λ, ν} = µ_y + Σ_yx Σ_xx^−1 (x(m) − µ_x),    (2)

where µ_x and µ_y are the input and output means, Σ_yx the cross-covariance, and Σ_xx the input covariance as specified in the νth Gaussian component of the model. The cluster-dependent linear transformations R(ν) = Σ_yx Σ_xx^−1 can be precomputed into a look-up table to save computation during use.

The 10-component GMM used in this work was trained on 52 minutes of clean speech from the Finnish SPEECON database [11]. Log-compressed spectral features of the current and two preceding frames were used as the input x(m). The training data was formed as follows. The input features x(m) were computed from narrowband signals filtered with the MSIN filter, which simulates the input response of a mobile station [12], scaled to −26 dBov, and coded with the AMR codec (12.2 kbps). The output y(m) represents the log-compressed lowband energy in wideband speech. The target outputs were calculated from the same samples as the input features using identical scaling but no filtering or coding. The lowband energy of each frame was extracted using a 128-point FFT and one trapezoidal filter window with −3-dB points at 40 Hz and 330 Hz.

To reduce occasional peaks in the lowband energy, the energy estimates are processed with an adaptive compression scheme. A smoothed energy estimate is computed from frames of active speech, and energy estimates exceeding 1.5 times the smoothed value are limited using a logarithmic compression curve.

2.3. Amplitude calculation for lowband harmonics

Harmonic sinusoidal components of equal amplitude are generated in the lowband so that the energy estimate provided by the GMM is approximately realized (G). The amplitude Ae(m) of the harmonic components in frame m is computed as

Ae(m) = sqrt( k · Êlb(m) / ( 330 Hz / f̂0(m) − 0.5 ) ),    (3)

where Êlb(m) is the lowband energy estimate based on the GMM output, f̂0(m) is a smoothed fundamental frequency estimate, and the constant k compensates for the effects of windowing etc.

In this work, the synthesis amplitudes are modified depending on the observed amplitudes at the frequencies of the lowband harmonics in the input signal. The input is analyzed (F) in Hann-windowed segments of 20 ms using the formula

S(m, l) = Σ_{k=0}^{N−1} s_{m,w}(k) e^{−2πi k l f0(m)/fs},    (4)
`
`
`
where N is the window length, s_{m,w} is the windowed signal of frame m, and fs is the sampling frequency. S(m, l) corresponds to the discrete-time Fourier transform of the input signal evaluated at the frequencies of the harmonics up to 400 Hz. The amplitude of the lth harmonic in the input is A(m, l) = cA |S(m, l)|, where cA is a constant that compensates for the effects of windowing.
`The synthesis amplitudes are then determined by subtract-
`ing the observed harmonic amplitudes A(m, l) from the esti-
`mated values Ae(m) so that the sum of the input signal and the
`synthesized signal has approximately the estimated amplitudes.
`To provide a smooth transition from the lowband to the tele-
`phone band, the amplification of the observed harmonics is lim-
`ited at the upper end of the extension band. Finally, a smoothing
`filter is applied to the synthesis amplitudes.
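The harmonic analysis of (4) and the amplitude subtraction described above can be sketched as follows. The window-compensation factor 2/sum(w) stands in for the constant cA, and flooring the difference at zero is an assumption of this sketch; the smoothing filter and the transition-band limiting are omitted.

```python
import numpy as np

def harmonic_amplitudes(frame, f0, fs, f_max=400.0):
    """Observed amplitudes of the harmonics l*f0 up to f_max in a
    Hann-windowed frame, via a DTFT evaluated directly at the
    harmonic frequencies (Eq. (4))."""
    n = np.arange(len(frame))
    w = np.hanning(len(frame))
    xw = frame * w
    c_a = 2.0 / w.sum()          # amplitude compensation for the window
    amps = []
    l = 1
    while l * f0 <= f_max:
        s = np.sum(xw * np.exp(-2j * np.pi * n * l * f0 / fs))
        amps.append(c_a * abs(s))
        l += 1
    return np.array(amps)

def synthesis_amplitudes(a_est, a_obs):
    """Amplitudes for the synthetic harmonics: the estimated target
    amplitude minus what is already present in the input, floored
    at zero so the sum approximates the estimate."""
    return np.maximum(a_est - a_obs, 0.0)
```

For a 20-ms frame at an 8-kHz rate containing a pure 100-Hz cosine of amplitude 0.5, the first returned amplitude is close to 0.5 and the higher harmonics are close to zero, which checks the window compensation.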
`
`2.4. Phase estimation for lowband harmonics
`
`As a novel technique, the phases of the synthetic sinusoidal
`components are set such that they are coherent with the phases
`of the possibly existing but attenuated harmonics in the input
`signal. If the observed phases are found to be unreliable, they
`are ignored and a continuous phase contour from frame to frame
`is generated (G).
`The observed phase φ(m, l) of the lth harmonic in frame
`m is calculated as the phase angle of S(m, l) defined in (4). A
`reference phase for each frame and harmonic is computed from
`the previous synthesis phase and the fundamental frequency es-
`timates of the current and the previous frame assuming phase
`continuity at the frame boundary. The phase ˜φ(m, l) for the
`synthesis of the lth harmonic depends on the current and past
`observed phases of the lth harmonic and is determined by the
`first matching condition of the following list:
`
1. If the observed phase does not behave in a consistent way on average, a continuous phase is generated.

2. At signal onset, the observed phase is used.

3. If the mismatch between the observed phase and the reference phase is small, the observed phase is used.

4. If the observed phase is nearly continuous in a few past frames, the observed phase is approached gradually.

5. Otherwise, a continuous phase contour is generated.

2.5. Lowband signal synthesis

Each frame m of the lowband signal is synthesized as a sum of sinusoidal components up to a frequency limit of 400 Hz (H). The lth harmonic is generated as a sine wave with amplitude Ã(m, l), phase φ̃(m, l), and frequency l·f0(m).

To reduce artifacts, the output signal is muted (I) if (1) the gradient index exceeds a fixed threshold, indicating unvoiced speech, (2) the frame energy does not substantially exceed an adaptive noise floor estimate, (3) the f0 estimate varies greatly from frame to frame, or (4) the f0 estimate is substantially lower than a smoothed long-term estimate. The last condition reduces artifacts caused by downward octave errors observed especially for creaky female voices. Transition regions from complete muting to no attenuation are defined around the thresholds.

Synthesized frames are windowed with a 10-ms Hann window and overlap-added to obtain a continuous lowband signal slb with smooth transitions between adjacent frames (J). Finally, the synthesized lowband signal slb is summed with the narrowband signal snb to obtain the output soutput of the method (K).

3. Evaluation

3.1. Test samples

Listening tests were arranged to evaluate the proposed LB-ABE method in combination with the previously presented highband extension method FB-ABE [10]. Evaluation of lowband extension together with highband extension is necessary because lowband extension alone can be perceived as spectrally unbalanced [4]. The methods were evaluated using twenty sentences selected from the SPEECON database [11]. All sentences were spoken in Finnish by different talkers (10 females and 10 males), and all were recorded in an office environment.

To simulate narrowband telephone speech, the test samples were highpass filtered (−3 dB at 293 Hz) and the spectrum below 181 Hz was cleared with FFT-based processing. The band-limited signals were scaled to −26 dBov and processed with the AMR codec (12.2 kbps).

Four versions of each test sentence were generated:

• NB: AMR-coded narrowband speech.
• NB+HB: NB with highband extension using FB-ABE.
• LB+NB+HB: NB+HB with lowband extension using the proposed LB-ABE method.
• WB: Wideband reference generated from the wideband signal by filtering with the P.341 filter [12] and processing through the AMR-WB codec (12.65 kbps).

For each test sample, the adaptation of the methods was arranged by first processing another sentence spoken by the same talker in the same environment.

The loudness of the processing types was normalized. The amplitude of each processed test sentence was adjusted manually by two listeners to match the loudness of the WB version of the sentence. An average loudness normalization factor was determined for each processing type from this data, and the test samples were scaled accordingly.

3.2. Test methods

The quality of the processed speech samples was evaluated using a listening test procedure similar to the comparison category rating (CCR) test also used in [10]. The test comprised pairwise comparisons between the processing types. In each case, the latter sample was compared with the first one on a scale from much worse (−3) to much better (+3). Each listener compared all processing pairs in combination with ten different sentences. Null pairs with identical samples were also included.

Another test was arranged to evaluate the perceived dissimilarity between AMR-WB-coded wideband speech and the processed narrowband signals. In each test case, the listener first heard a WB version of a sentence and then the same sentence processed with another processing type. The task of the listener was to answer the question "How much does the latter sample differ from the first sample?" on a scale from not at all (0) to very much (4) with half-unit steps. All twenty sentences were included in the test. For each sentence, all three processing types were compared with the wideband reference. Additionally, six comparisons with identical wideband samples were presented to each listener.

In both tests, the samples were played to both ears using Sennheiser HD 580 headphones in a silent room. The listeners were instructed to adjust the volume to a suitable listening level during practice sessions preceding both tests. Seventeen Finnish listeners (5 females and 12 males) between 20 and 31 years of age participated in the tests.
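The CCR summary variable (the mean over every comparison involving a processing type) can be sketched as follows. The example ratings are hypothetical, and the convention of flipping the sign of a rating when the type appeared as the first sample of a pair is an assumption of this sketch, not a detail stated in the test description.

```python
import numpy as np

# Hypothetical pairwise CCR ratings: (first, second, rating on -3..+3),
# where the rating says how the second sample compared with the first.
example_ratings = [
    ("NB", "WB", 2.0), ("WB", "NB", -1.5),
    ("NB+HB", "NB", -1.0), ("LB+NB+HB", "NB+HB", 0.5),
]

def ccr_summary(ratings):
    """Per-condition mean CCR score and standard error of the mean.
    Each comparison contributes the rating to the second condition
    and its negation to the first, so every comparison involving a
    condition is included (assumed convention)."""
    scores = {}
    for first, second, r in ratings:
        scores.setdefault(second, []).append(r)
        scores.setdefault(first, []).append(-r)
    return {cond: (np.mean(v),
                   np.std(v, ddof=1) / np.sqrt(len(v)) if len(v) > 1 else 0.0)
            for cond, v in scores.items()}
```

Running `ccr_summary(example_ratings)` yields one (mean, standard error) pair per processing type, mirroring the per-type summary used in the analysis.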
`
`1183
`
`Jawbone's Exhibit No. 2013, IPR2022-01124
`Page 003
`
`
`
`
`
`(a)
`
`(b)
`
`Female
`
`Male
`
`N B
`
`N B + H B
`
`W B
`
`L B + N B + H B
`
`2
`
`1
`
`0
`
`Dissimilarity score
`
`Female
`
`Male
`
`N B
`
`N B + H B
`
`W B
`
`L B + N B + H B
`
`1
`
`0
`
`CCR mean score
`
`−1
`
`Figure 3: Mean CCR scores (a) and dissimilarity scores for
`comparisons against WB samples (b). Error bars indicate the
`standard error of mean.
`
`those reported for lowband and highband extension in [7].
`
`5. Acknowledgements
`
`This work is funded by the graduate schools GETA and Hecse,
`the Academy of Finland (LASTU research programme 135003,
`project 136209, and AIRC), Aalto University (Mide/UI-ART),
`and Nokia.
`
`6. References
`
`[1] Adaptive Multi-Rate (AMR) Speech Codec, Transcoding Func-
`tions, 3GPP TS 26.090, 3rd Generation Partnership Project
`(3GPP), 2004, version 6.0.0.
`
`[2] P. Jax, “Enhancement of bandlimited speech signals: Algo-
`rithms and theoretical bounds,” Ph.D. dissertation, Rheinisch-
`Westf¨alische Technische Hochschule Aachen, Germany, 2002.
`
`[3] Terminal acoustic characteristics for telephony, Requirements,
`3GPP TS 26.131, 3rd Generation Partnership Project (3GPP),
`2010, version 10.0.0.
`
`[4] G. Miet, “Towards wideband speech by narrowband speech band-
`width extension: magic effect or wideband recovery?” Ph.D. dis-
`sertation, Universit´e du Maine, France, 2001.
`
`[5] U. Kornagel, “Techniques for artificial bandwidth extension of
`telephone speech,” Signal Process., vol. 86, no. 6, pp. 1296–1306,
`2006.
`
`[6] J.-M. Valin and R. Lefebvre, “Bandwidth extension of narrow-
`band speech for low bit-rate wideband coding,” in Proc. IEEE
`Speech Coding Workshop (SCW), 2000, pp. 130–132.
`
`[7] J. S. Park, M. Y. Choi, and H. S. Kim, “Low-band extension of
`CELP speech coder by harmonics recovery,” in Proc. ISPACS,
`2004, pp. 147–150.
`
`[8] H. Gustafsson, U. A. Lindgren, and I. Claesson, “Low-
`complexity feature-mapped speech bandwidth extension,” IEEE
`Trans. Speech Audio Process., vol. 14, no. 2, pp. 577–588, 2006.
`
`[9] I. Uysal, H. Sathyendra, and J. G. Harris, “Bandwidth extension
`of telephone speech using frame-based excitation and robust fea-
`tures,” in Proc. EUSIPCO, 1995.
`
`[10] H. Pulakka and P. Alku, “Bandwidth extension of telephone
`speech using a neural network and a filter bank implementation
`for highband mel spectrum,” IEEE Trans. Audio, Speech, Lan-
`guage Process., 2011, accepted for publication.
`
`[11] D. Iskra, B. Grosskopf, K. Marasek, H. van den Heuvel, F. Diehl,
`and A. Kiessling, “SPEECON – speech databases for consumer
`devices: Database specification and validation,” in Proc. LREC,
`2002, pp. 329–333.
`
`[12] ITU-T Recommendation G.191, Software tools for speech and au-
`dio coding standardization, Int. Telecommun. Union, 2005.
`
`1184
`
`WB
`NB
`NB+HB
`LB+NB+HB
`
`−20
`
`−40
`
`−60
`
`
`
`Magnitude (dB)
`
`100
`
`300
`
`1000
`Frequency (Hz)
`
`4000
`
`8000
`
`Figure 2: Average spectra of the listening test samples.
`
`CCR data was analyzed in terms of a summary variable that
`was calculated as the mean of the CCR scores for each process-
`ing type separately. In these calculations, every comparison in-
`volving a specific processing type was included. The dissim-
`ilarity test data was analyzed in terms of the original compar-
`isons. The means of the scores were compared with Analyses
`of Variance (ANOVA). For post-hoc comparisons between pair-
`wise data means, Tukeys HSD statistic was used.
`
`3.3. Results
`
`The long-term average spectra of the listening test samples are
`shown in Figure 2. In the lowband, the average LB+NB+HB
`spectrum is close to the average WB spectrum. The effect of
`loudness normalization (Section 3.1) can be seen in the figure.
`The
`differences
`between
`the mean CCR values
`[F (3, 48)=35.703, p<0.001] are shown in Figure 3.(a).
`WB had higher mean CCR values than all the other pro-
`cessing types (p-values<0.01). The NB processing had, in
`contrast, smaller CCR values than the rest of the samples
`(p-values<0.01). There was no significant difference between
`LB+NB+HB and NB+HB in terms of the mean CCR score.
`It is noteworthy that listening test show significant quality
`improvement for FB-ABE compared with NB speech. This
`indicates that the quality improvement reported in [10] was not
`caused by increased loudness only.
`The dissimilarity score differed between different process-
`ing types [F (3, 1114)=208.23, p<0.001] as shown in Figure
`3.(b). The processing types could be ranked in terms of in-
`creasing dissimilarity from WB in the following order: WB,
`LB+NB+HB, NB+HB, NB. The ratings, however, interacted
`with the speaker gender [F (3, 1114)=7.6220, p<0.001].
`In
`particular, the difference between LB+NB+HB and NB+HB
`was observer only for the male speakers (p<0.001).
`
`4. Conclusions
`
`A method for the low-frequency bandwidth extension of tele-
`phone speech was described. The method was evaluated to-
`gether with the previously proposed highband extension method
`[10]. The results indicate that the lowband extension did not
`improve the perceived speech quality. This may be due to oc-
`casional artificats in the lowband, although it is also known that
`the listener preference varies on spectral balance in general [4].
`However, lowband extension was found to reduce dissimilar-
`ity in comparison with wideband speech, which is an important
`goal for bandwidth extension during the transition from narrow-
`band to wideband telephone systems. The reduced dissimilarity
`was not observed for female voices, which typically have less
`energy in the lowband due to higher f0. The results resemble
`
`Jawbone's Exhibit No. 2013, IPR2022-01124
`Page 004
`
`