`
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 2, NO. 4, OCTOBER 1994
`
`Statistical Recovery of Wideband
`Speech from Narrowband Speech
`
`Yan Ming Cheng, Douglas O'Shaughnessy, and Paul Mermelstein
`
Abstract-We present an algorithm to generate wideband speech from
a narrowband version of the same. The main body of the algorithm
is a statistical recovery function (SRF) which predicts the highband
spectrum based solely on the narrowband spectrum. The performance
of the algorithm has been measured both in terms of spectral distortion
and spectral signal-to-noise ratio (SNR). We obtained a 3 dB gain in SNR
for the reconstructed wideband speech as compared to the narrowband
`
`......
`
`I. INTRODUCTION
Wideband speech (in our experiments, covering the range 0.3-8
kHz) has generally a more pleasant quality compared with narrowband
(0.3-3.75 kHz) speech. Most transmission lines carry only
narrowband speech for economic reasons, and some existing
communication networks do so for historical reasons. Because of the human
preference for wideband speech, a solution to generate wideband
speech from a narrowband transmission appears attractive. We
develop here a tool to recover the spectral highband difference between
wideband and narrowband speech, without the use of any additional
transmitted information. The feasibility of such a tool depends on
the validity of the assumption that the difference signal is closely
correlated with, and is a nonlinear function of, the narrowband speech.
Our experiments support the validity of this assumption.
In this paper, we present a preliminary study toward the
realization of such a speech-recovery tool. The approach we adopt
is to implement a recovery function at the receiver of a coded
speech transmission. The function maps narrowband speech to a
spectral difference signal, which is considered here only as highband
(3.75-8 kHz) speech. To reconstruct the wideband speech, we add
the highband component to the received narrowband speech. The
recovery function is based on a statistical dependence between the
narrowband and highband speech spectra and applies in a
speaker-independent fashion.
`
`II. THE STATISTICAL RECOVERY FUNCTION (SRF) AND ITS USE
`
`A. Background
`A speech signal can be segmented and assigned into classes, such
`as phonemes or broad phonetic classes. We interpret each class to
`have a distinct pattern in both low and high frequencies. If a signal
`in a narrowband spectrum is recognized as belonging to a certain
`class, the highband signal can be approximately determined by the
`corresponding pattern of the class. This idea can be generalized as
`that of a narrowband class being mapped to any highband class
`with a certain probability. In order to avoid hard decisions in the
`classification, we introduce the notion of a random source instead of
`class. Each random source has a probability density function (pdf),
characterized by a mean and a covariance matrix. A signal of a class
can be considered as, for instance, a random signal emitted from a
random source with the highest probability; a transitional signal is
emitted jointly by several random sources.

Manuscript received August 17, 1992; revised December 31, 1993. This
work was supported by grants from the Natural Sciences and Engineering
Research Council (NSERC) of Canada.
Y. M. Cheng and P. Mermelstein are with Bell-Northern Research, Nuns'
Island, Quebec, Canada H3E 1H6.
D. O'Shaughnessy is with INRS-Telecommunications, Nuns' Island,
Quebec, Canada H3E 1H6.
IEEE Log Number 9403984.

Fig. 1. Graphical illustration of a statistical recovery function. The dots
in the space $\mathcal{X}$ represent random sources $\lambda_i$ and those in the space $\mathcal{Y}$
represent $\theta_j$. The lines connecting the sources in the two spaces represent
the cross-correlation probabilities.
`
`B. An Iterative Training Algorithm via the EM Algorithm
Consider a sample vector or frame of K narrowband speech
samples, $x = [x_0, x_1, \ldots, x_K]^T$, in a multidimensional space $\mathcal{X}$,
and a sample vector of highband speech, $y = [y_0, y_1, \ldots, y_K]^T$, in
a space $\mathcal{Y}$. We assume that the ensemble of $x$ is generated by a
combination of N random sources, $\lambda_i$, $1 \le i \le N$, and the ensemble
of $y$ by M random sources, $\theta_j$, $1 \le j \le M$. The probability of source
$\theta_j$ contributing to the highband speech, while source $\lambda_i$ contributes
to the narrowband speech, is defined by $\alpha_{ij} = p(\theta_j \mid \lambda_i)$, a
cross-correlation probability. A graphical illustration is given in Fig. 1.
Given a set of parameters, $A = \{\alpha_{ij}\}$, $\Lambda = \{\lambda_i\}$, and $\Theta = \{\theta_j\}$,
and a vector of narrowband speech, $x$, a recovery function $f$ yields
highband speech $y = f(x, A, \Lambda, \Theta)$. In the remainder of this section,
we derive a training algorithm and a procedure for highband speech
estimation.
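Let us consider a joint pdf for the speech and individual sources
at time t in the speech signal

$$
\begin{aligned}
p(y_t, x_t, \lambda_i, \theta_j) &= p(y_t \mid \theta_j)\, p(\theta_j \mid \lambda_i)\, p(x_t \mid \lambda_i)\, p(\lambda_i) \\
&= p(y_t \mid \theta_j)\, \alpha_{ij}\, p(x_t \mid \lambda_i)\, p(\lambda_i)
\end{aligned}
\tag{1}
$$

since, by definition, $y_t$ depends only on $\theta_j$ and $\theta_j$ only on $\lambda_i$.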
Following the frequent assumption of an all-pole (autoregressive)
model for speech signals in linear predictive analysis [1], we will,
for computational convenience, use pth- and qth-order autoregressive
Gaussian sources to describe the random sources, $\lambda_i$ and $\theta_j$,
respectively. In cases where assuming an autoregressive model for
the signal is not suitable, a standard Gaussian pdf can be used without
any loss. The conditional autoregressive Gaussian pdf's of observing
$x_t$ and $y_t$, given their underlying sources, $p(x_t \mid \lambda_i)$ and $p(y_t \mid \theta_j)$,
can be found in [1], [2]. We use an energy-normalized version of the
above pdf's, since the absolute energy is independent of the sources.
In order to reconstruct energy information for a given signal, we will
explicitly build up the recovery function for the energy ratio.
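
The factorization in (1) can be illustrated with a minimal sketch. It assumes scalar observations and standard Gaussian source pdfs in place of the energy-normalized autoregressive Gaussian pdfs; the helper names `gaussian_pdf` and `joint_term` and all numerical values are hypothetical.

```python
import math

def gaussian_pdf(v, mean, var):
    """Univariate Gaussian density (assumed stand-in for the AR Gaussian pdf)."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def joint_term(y_t, x_t, i, j, prior, alpha, x_sources, y_sources):
    """p(y_t, x_t, lambda_i, theta_j) = p(y_t|theta_j) alpha_ij p(x_t|lambda_i) p(lambda_i)."""
    p_y = gaussian_pdf(y_t, *y_sources[j])
    p_x = gaussian_pdf(x_t, *x_sources[i])
    return p_y * alpha[i][j] * p_x * prior[i]

# Hypothetical toy model: N = 2 narrowband sources, M = 2 highband sources.
prior = [0.6, 0.4]                      # p(lambda_i)
alpha = [[0.9, 0.1], [0.2, 0.8]]        # alpha_ij = p(theta_j | lambda_i)
x_sources = [(-1.0, 1.0), (1.0, 1.0)]   # (mean, variance) of each lambda_i
y_sources = [(0.0, 0.5), (2.0, 0.5)]    # (mean, variance) of each theta_j

# Joint terms of one frame (y_t = 1.8, x_t = 0.9) for every source pair (i, j).
joint = [[joint_term(1.8, 0.9, i, j, prior, alpha, x_sources, y_sources)
          for j in range(2)] for i in range(2)]
```

In this toy frame the pair (lambda_2, theta_2) dominates, since both observations lie close to the means of those two sources and the pair has a high cross-correlation probability.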
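Given a pair of training sequences, $(X, Y) = \{(x_t, y_t)\}$ with
$1 \le t \le T$, and a set of parameters $\Xi = \{A, \Lambda, \Theta\}$, the joint
probability is

$$
\begin{aligned}
p(X, Y) = p(\Xi, X, Y) &= \prod_{t=1}^{T} \sum_{i=1}^{N} \sum_{j=1}^{M} p(x_t, y_t, \lambda_i, \theta_j) \\
&= \sum_{k=1}^{(NM)^T} \prod_{t=1}^{T} p(x_t, y_t, s_k(t)) \\
&= \sum_{k=1}^{(NM)^T} p(\Xi, X, Y, s_k)
\end{aligned}
\tag{2}
$$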
`
`1063-6676/94$04.00 © 1994 IEEE
`
`
where $s_k(t) = [\lambda_i, \theta_j]$ is a state-vector at time t on the kth
state-path of a trellis (the number of distinct state-vectors is $NM$ and
that of distinct state-paths is $(NM)^T$). Given the same conditions,
the conditional joint pdf of both $\lambda_i$ and $\theta_j$ at time t is

$$
p(t{:}\ \lambda_i, \theta_j \mid X, Y) = \frac{p(t{:}\ \lambda_i, \theta_j, X, Y)}{p(X, Y)}
= \frac{p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}
\tag{3}
$$

and the conditional pdf of $\lambda_i$ contributing to the speech at time t is

$$
p(t{:}\ \lambda_i \mid X, Y) = \frac{p(t{:}\ \lambda_i, X, Y)}{p(X, Y)}
= \frac{\sum_{j=1}^{M} p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}
\tag{4}
$$
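
Equations (3) and (4) amount to normalizing the joint terms of one frame over all source pairs and then summing over the highband sources. A minimal sketch, assuming the joint terms of the frame are already available as an N-by-M list (the function name and the numbers are hypothetical):

```python
def frame_posteriors(joint):
    """Normalize one frame's joint terms as in (3) and marginalize as in (4).

    joint[i][j] holds p(x_t, y_t, lambda_i, theta_j) for a single frame t.
    Returns (pair_post, lambda_post) where pair_post[i][j] is the posterior
    of the pair (lambda_i, theta_j) and lambda_post[i] that of lambda_i.
    """
    total = sum(sum(row) for row in joint)          # denominator of (3) and (4)
    pair_post = [[v / total for v in row] for row in joint]
    lambda_post = [sum(row) for row in pair_post]   # sum over j, as in (4)
    return pair_post, lambda_post

# Hypothetical joint terms for one frame with N = 2 and M = 2.
joint = [[0.004, 0.010], [0.003, 0.300]]
pair_post, lambda_post = frame_posteriors(joint)
```

Because only ratios of joint terms enter, any factor common to all source pairs in a frame cancels out.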
`
In order to estimate the parameters, we use the EM algorithm [3]
to maximize the likelihood, $p(\Xi, X, Y)$. In the E-step we compute
the expectation of the log likelihood (or an auxiliary function in [4]),
$Q(\Xi \mid \Xi_0) = E(\log p(\Xi, X, Y, s_k) \mid \Xi_0)$, over state-paths (see (2)),
based on an initial guess of parameters, $\Xi_0 = \{A_0, \Lambda_0, \Theta_0\}$. In the
M-step, we maximize the expectation function, $Q(\cdot)$, by adjusting $\Xi$.
Using Lagrange optimization and the constraint $\sum_{j=1}^{M} p(\theta_j \mid \lambda_i) = 1$,
it is not hard to derive (see [4])

$$
\alpha_{ij} = p(\theta_j \mid \lambda_i) = \frac{C_{ij}}{\sum_{j=1}^{M} C_{ij}}
$$

where $C_{ij} = \sum_{k=1}^{(NM)^T} p(\Xi_0, X, Y, s_k)\, c_{ij}(s_k)$ is the expectation of
$c_{ij}(s_k)$, which is the count of a state-vector, $[\lambda_i, \theta_j]$, on the
state-path $s_k$. Since the expected count can also be efficiently computed as
$C_{ij} = \sum_{t=1}^{T} p(t{:}\ \lambda_i, \theta_j, X, Y) = \sum_{t=1}^{T} p(t{:}\ \lambda_i, \theta_j \mid X, Y)\, p(X, Y)$,
and since $p(X, Y)$ is independent of i and j, we obtain (5). Similarly,
an updating formula of the a priori pdf of the source, $p(\lambda_i)$, can be
derived as the expected count of $\lambda_i$ on a state-path divided by the
total count on a state-path, giving (6).
We can also update the autocorrelation sequences of sources
`
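$$
r_{\lambda_i}(k) = \frac{\sum_{t=1}^{T} r_{x_t}(k)\, p(\lambda_i \mid x_t, y_t)}{\sum_{t=1}^{T} p(\lambda_i \mid x_t, y_t)}
\tag{7a}
$$

$$
r_{\theta_j}(k) = \frac{\sum_{t=1}^{T} r_{y_t}(k)\, p(\theta_j \mid x_t, y_t)}{\sum_{t=1}^{T} p(\theta_j \mid x_t, y_t)}
\tag{7b}
$$

and a ratio of highband signal energy versus narrowband energy

$$
\text{(energy ratio for } \theta_j\text{)} = \frac{\sum_{t=1}^{T} \left( \dfrac{\varepsilon(y_t)}{\varepsilon(x_t)} \right)^{1/2} p(\theta_j \mid x_t, y_t)}{\sum_{t=1}^{T} p(\theta_j \mid x_t, y_t)}
\tag{7c}
$$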
`
where $p(\lambda_i \mid x_t, y_t)$ and $p(\theta_j \mid x_t, y_t)$ can be easily derived from
$p(\lambda_i, \theta_j, x_t, y_t)$ and its marginal pdfs; $\varepsilon(x_t)$ and $\varepsilon(y_t)$ are the
energies of $x_t$ and of $y_t$, respectively. The reason for using the
energy ratio here instead of the absolute energy value is that the latter
value varies from signal to signal (e.g., on different telephone lines)
and has little direct utility; the ratio, however, contains sufficient
energy information for reconstruction of the highband signal. A set
of updated autoregressive coefficients of sources can be obtained
from $r_{\lambda_i}(k)$ and $r_{\theta_j}(k)$ through the usual Levinson-Durbin
recursive algorithm.
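
As a sketch of this step, the code below forms a posterior-weighted autocorrelation sequence in the manner of (7a) and (7b) and converts it to autoregressive coefficients with the standard Levinson-Durbin recursion. The function names, the predictor sign convention, and the example values are assumptions made for illustration.

```python
def weighted_autocorr(frame_autocorrs, weights):
    """Posterior-weighted average of per-frame autocorrelations, as in (7a)/(7b)."""
    total = sum(weights)
    lags = len(frame_autocorrs[0])
    return [sum(w * r[k] for w, r in zip(weights, frame_autocorrs)) / total
            for k in range(lags)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations by the Levinson-Durbin recursion.

    r holds autocorrelation values r[0..order]. Returns (a, err) such that
    x(n) is predicted by sum_k a[k-1] * x(n-k), with residual energy err.
    """
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        # Reflection coefficient for model order m + 1.
        acc = r[m + 1] - sum(a[k] * r[m - k] for k in range(m))
        k_m = acc / err
        new_a = a[:]
        new_a[m] = k_m
        for k in range(m):
            new_a[k] = a[k] - k_m * a[m - 1 - k]
        a = new_a
        err *= 1.0 - k_m * k_m
    return a, err

# Two hypothetical frames sharing a first-order autocorrelation r(k) = 0.5**k.
frames_r = [[1.0, 0.5, 0.25], [1.0, 0.5, 0.25]]
r_source = weighted_autocorr(frames_r, weights=[0.25, 0.75])
coeffs, residual = levinson_durbin(r_source, order=2)
```

For this input the recursion recovers a first-order model: the first coefficient is 0.5, the second is 0, and the residual energy is 0.75.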
The following list summarizes the training algorithm:
1) Initialize parameters $\Xi_0 = \{A_0, \Lambda_0, \Theta_0\}$.
2) Iteration loop
a) Time loop: t from 1 to T
i) Compute $p(y_t, x_t, \lambda_i, \theta_j)$ according to (1).
ii) Compute recursively the necessary cumulatives in (5),
(6), and (7).
b) End the time loop and update $\Xi_0 = \{A_0, \Lambda_0, \Theta_0\}$
according to (5), (6), and (7).
c) Test if the stop criterion is satisfied. If yes, stop the iteration.
The EM algorithm guarantees that the developed updating formulas
converge to a critical point.
`
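$$
\alpha_{ij} = \frac{\displaystyle \sum_{t=1}^{T} \frac{p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}}{\displaystyle \sum_{t=1}^{T} \frac{\sum_{j'=1}^{M} p(x_t, y_t, \lambda_i, \theta_{j'})}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}}
\tag{5}
$$

$$
p(\lambda_i) = \frac{1}{T} \sum_{t=1}^{T} \frac{\sum_{j=1}^{M} p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}
\tag{6}
$$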
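
The iteration can be sketched for the updates of (5) and (6) alone. This is a simplified illustration, not the full training procedure: it assumes scalar observations and Gaussian source pdfs that are held fixed (the autocorrelation and energy-ratio updates of (7) are omitted), and the function names and data are hypothetical.

```python
import math

def gaussian_pdf(v, mean, var):
    """Univariate Gaussian density (assumed stand-in for the AR Gaussian pdf)."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(xs, ys, prior, alpha, x_sources, y_sources):
    """One EM pass updating alpha via (5) and p(lambda_i) via (6).

    Returns (new_prior, new_alpha, log_lik), where log_lik is the log
    likelihood of the data under the parameters passed in.
    """
    n, m = len(prior), len(y_sources)
    pair_counts = [[0.0] * m for _ in range(n)]     # expected counts C_ij
    log_lik = 0.0
    for x_t, y_t in zip(xs, ys):
        joint = [[gaussian_pdf(y_t, *y_sources[j]) * alpha[i][j]
                  * gaussian_pdf(x_t, *x_sources[i]) * prior[i]
                  for j in range(m)] for i in range(n)]          # equation (1)
        total = sum(sum(row) for row in joint)
        log_lik += math.log(total)
        for i in range(n):
            for j in range(m):
                pair_counts[i][j] += joint[i][j] / total         # equation (3)
    lambda_counts = [sum(row) for row in pair_counts]
    new_alpha = [[c / lambda_counts[i] for c in pair_counts[i]] for i in range(n)]
    new_prior = [c / len(xs) for c in lambda_counts]
    return new_prior, new_alpha, log_lik

# Hypothetical training pairs and fixed source pdfs given as (mean, variance).
xs = [-1.1, -0.9, -1.2, 1.0, 0.8, 1.3, -1.0, 1.1]
ys = [0.1, -0.2, 0.0, 2.1, 1.9, 2.2, 2.0, 0.1]
x_sources = [(-1.0, 0.25), (1.0, 0.25)]
y_sources = [(0.0, 0.25), (2.0, 0.25)]
prior = [0.5, 0.5]
alpha = [[0.5, 0.5], [0.5, 0.5]]   # alpha_ij initialized to 1/M
log_liks = []
for _ in range(10):
    prior, alpha, log_lik = em_step(xs, ys, prior, alpha, x_sources, y_sources)
    log_liks.append(log_lik)
```

In this toy data three of the four frames near the first narrowband source pair with the first highband source, so the first row of `alpha` settles near [0.75, 0.25], and the recorded log likelihood does not decrease from one pass to the next.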
`
`
C. Minimum Mean Square Estimation (MMSE) of Highband Speech
An estimation of energy-normalized highband speech through
minimum mean square estimation (MMSE) is a conditional expectation

$$
\hat{Y} = E(Y \mid X) = \int_{\mathcal{Y}^{T'}} Y\, p(Y \mid X)\, dY
\tag{8}
$$

where, using (1)

$$
\begin{aligned}
p(y_t \mid x_t) &= \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{p(y_t, x_t, \lambda_i, \theta_j)}{p(x_t)} \\
&= \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} p(y_t \mid \theta_j)\, \alpha_{ij}\, p(x_t \mid \lambda_i)\, p(\lambda_i)}{\sum_{i=1}^{N} p(x_t \mid \lambda_i)\, p(\lambda_i)}
\end{aligned}
$$

and $T'$ is the length of the highband speech to estimate. Since we
assume that observations of both the highband and narrowband speech
are independent in different time frames t, the above equation can be
written in a form of vector concatenation, represented by the sign $\times$.
Then, we calculate the highband speech estimate as

$$
\hat{Y} = \sum_{j=1}^{M} \frac{\hat{y}_1^{(j)}}{p(x_1)} \times \cdots \times \sum_{j=1}^{M} \frac{\hat{y}_t^{(j)}}{p(x_t)} \times \cdots \times \sum_{j=1}^{M} \frac{\hat{y}_{T'}^{(j)}}{p(x_{T'})}
\tag{9}
$$

where $\hat{y}_t^{(j)} = \sum_{i=1}^{N} \bar{y}_t^{(j)} \alpha_{ij}\, p(x_t \mid \lambda_i)\, p(\lambda_i)$, and $\bar{y}_t^{(j)}$ is the mean-vector
of the random source $\theta_j$ at time t. Since $\bar{y}_t^{(j)}$ is also a qth-order
autoregressive process, then we have

$$
\left[ \bar{y}^{(j)}(n) \right]_t = \left[ \sum_{k=1}^{q} a_k^{(j)}\, \bar{y}^{(j)}(n-k) + G_t^{(j)} e_t(n) \right]_t
\tag{10}
$$
`
where $e_t(n)$ is white Gaussian noise excitation at time t with zero
mean and unity variance, and $a_k^{(j)}$ are the autoregressive coefficients.
The same excitation sequence has to be applied to all sources $\theta_j$ to
guarantee an identical initial phase and a smooth phase evolution in
time. $G_t^{(j)}$ is a gain factor applied to the $\theta_j$ sources and estimated
as in (11). Thus the estimation of $G_t^{(j)}$ is
independent of $p(x_t)$. We assumed implicitly in the above highband
speech generation that the highband speech exhibits no periodic
behavior (i.e., pitch-periodicity is absent).
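
The recursion in (10) with a shared excitation can be sketched as follows. The coefficients, gains, frame length, and function name are hypothetical, and the filter state is simply reset to zero at the start of the sequence, which is an assumption of this sketch.

```python
import random

def synthesize_ar(coeffs, gain, excitation):
    """Run y(n) = sum_k a_k * y(n-k) + G * e(n), as in (10), from a zero state."""
    out = []
    for n, e_n in enumerate(excitation):
        acc = gain * e_n
        for k, a_k in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a_k * out[n - k]
        out.append(acc)
    return out

# One zero-mean, unit-variance white Gaussian excitation shared by all sources.
rng = random.Random(0)
excitation = [rng.gauss(0.0, 1.0) for _ in range(160)]

# Hypothetical second-order sources: (autoregressive coefficients, gain).
sources = {"theta_1": ([0.5, -0.3], 1.0), "theta_2": ([-0.4, 0.1], 0.5)}
outputs = {name: synthesize_ar(a, g, excitation) for name, (a, g) in sources.items()}
```

Feeding every filter the same excitation keeps the source outputs phase-aligned, so that they can be summed without cancellation caused by unrelated noise sequences.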
In Fig. 2, we show a diagram of the current system to recover the
wideband speech. Each filter represents a random source's
autoregressive spectrum in the highband. The input of each filter is weighted by
a factor, $f(t, j) = G_t^{(j)} \zeta_{t,j}$, where $\zeta_{t,j} = \sum_{i=1}^{N} \alpha_{ij}\, p(x_t \mid \lambda_i)\, p(\lambda_i)$.
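
The weighting can be sketched for a single frame. The code computes the weights $\zeta_{t,j}$ and combines the highband source means as in (9); the gain factor of (11) is left out, and the function name and input values are hypothetical.

```python
def mmse_highband_frame(px_given_lambda, prior, alpha, source_means):
    """Combine highband source means for one frame, following (9).

    px_given_lambda[i] = p(x_t | lambda_i); source_means[j] is the mean
    vector of theta_j. Returns (estimate, zeta) with
    zeta[j] = sum_i alpha[i][j] * p(x_t | lambda_i) * p(lambda_i).
    """
    n, m = len(prior), len(source_means)
    zeta = [sum(alpha[i][j] * px_given_lambda[i] * prior[i] for i in range(n))
            for j in range(m)]
    p_x = sum(px_given_lambda[i] * prior[i] for i in range(n))   # p(x_t)
    dim = len(source_means[0])
    estimate = [sum(zeta[j] * source_means[j][d] for j in range(m)) / p_x
                for d in range(dim)]
    return estimate, zeta

# Hypothetical frame with N = 2 narrowband and M = 2 highband sources.
estimate, zeta = mmse_highband_frame(
    px_given_lambda=[0.4, 0.1],
    prior=[0.5, 0.5],
    alpha=[[1.0, 0.0], [0.0, 1.0]],
    source_means=[[1.0, 0.0], [0.0, 1.0]],
)
```

Since each row of the cross-correlation matrix sums to one, the weights $\zeta_{t,j}$ sum to $p(x_t)$, so the estimate is a convex combination of the source means.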
`
Fig. 2. Diagram of the wideband speech recovery system.

Fig. 3. Performance as a function of the number, M, of sources $\theta_j$. The left
panel shows the rms of log spectra; the right panel shows segmental SNR. The
vertical axes are in dB; the horizontal axis shows log2 M (i.e., M in bits).

Fig. 4. Performance as a function of number, N, of sources $\lambda_i$. The left
panel shows the rms of log spectra; the right panel shows the segmental SNR.
The vertical axes are in dB; the horizontal axis shows log2 N (i.e., bits).

III. EXPERIMENTAL RESULTS

A. Speech Material
The speech database used contained phonetically-balanced
wideband speech sampled at 16 kHz with an antialiasing filter cutting off
at 7.8 kHz. The database was split into two parts. Part one, used
to train the statistical recovery function, consisted of speech from
four male and four female speakers. The data to test our algorithm
consisted of speech from four separate speakers (two male and two
female). Thus, the algorithm can be viewed as operating
speaker-independently. The narrowband speech was generated by passing the
wideband speech through a 0.3-3.75 kHz Chebychev bandpass filter.
The frame length was 20 ms and the frame advance was 10 ms.
The orders of linear prediction (autoregressive) analysis were sixteen
(i.e., p = 16 and q = 16).
(11)

B. Experiments for the Training Procedure
For the training procedure, there are two factors that attracted
most of our concern: iteration convergence and initialization. For the
initialization, we had two options in these experiments: (1) vector
quantization (VQ) [5] initialization and (2) bootstrap initialization.
In the above two initializations, the cross-correlations are always
initialized as $\alpha_{ij} = 1/M$. We have observed that for both VQ
and bootstrap initialization the log likelihood increased with each
iteration, thus the training convergence is practically demonstrated.
Both VQ and bootstrap initializations, however, have log likelihood
values very close to each other after about ten iterations. We may
say that the initialization has little influence on the resulting mapping
function, at least in terms of likelihood.

Fig. 5. Spectrograms of the original wideband speech (top), reconstructed wideband speech (middle), and narrowband speech (bottom) for a sentence,
"Lift the square stone over the fence."
A more analytical way to study the training procedure is to use
fully controlled data. For this purpose, we simulated the assumed
data-generation process described in Section II. The parameters were
K = p = q = 2 and four random sources at $\mathcal{X}$, the originating space,
and three random sources at $\mathcal{Y}$, the destination space. The statistical
coefficients of the data generation and their estimation after fifteen
iterations were:
`
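1) The mean vectors of the four sources in $\mathcal{X}$ and their estimations

$$
\begin{bmatrix} -1.0 \\ -1.0 \end{bmatrix},
\begin{bmatrix} -2.0 \\ -1.0 \end{bmatrix},
\begin{bmatrix} -2.0 \\ -2.0 \end{bmatrix},
\begin{bmatrix} -1.0 \\ -2.0 \end{bmatrix};
\qquad
\begin{bmatrix} -1.00 \\ -0.99 \end{bmatrix},
\begin{bmatrix} -2.00 \\ -0.99 \end{bmatrix},
\begin{bmatrix} -2.01 \\ -2.01 \end{bmatrix},
\begin{bmatrix} -1.00 \\ -2.01 \end{bmatrix}.
$$

2) The mean vectors of the three sources in $\mathcal{Y}$ and their estimations

$$
\begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix},
\begin{bmatrix} 0.0 \\ 1.0 \end{bmatrix},
\begin{bmatrix} 2.0 \\ 1.0 \end{bmatrix};
\qquad
\begin{bmatrix} 1.00 \\ 0.00 \end{bmatrix},
\begin{bmatrix} 0.01 \\ 1.00 \end{bmatrix},
\begin{bmatrix} 2.00 \\ 1.00 \end{bmatrix}.
$$

3) The active probabilities of the four sources in $\mathcal{X}$ and their
estimations

$$
[0.3,\ 0.2,\ 0.2,\ 0.3]; \qquad [0.29,\ 0.19,\ 0.21,\ 0.31].
$$

4) The correlation matrix and its estimation

$$
\begin{bmatrix}
0.80 & 0.10 & 0.10 \\
0.30 & 0.50 & 0.20 \\
0.20 & 0.50 & 0.30 \\
0.10 & 0.10 & 0.80
\end{bmatrix};
\qquad
\begin{bmatrix}
0.79 & 0.11 & 0.10 \\
0.32 & 0.50 & 0.18 \\
0.20 & 0.52 & 0.28 \\
0.10 & 0.10 & 0.81
\end{bmatrix}.
$$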
`
`From this controlled experiment, we have shown that the proposed
`algorithm estimates a correct statistical structure.
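
The data-generation side of such a controlled experiment can be sketched from the coefficients listed above. The covariances are not stated in the text, so an isotropic spread `SIGMA` is assumed here, and the empirical matrix below is obtained by counting the known source labels rather than by EM training; all function and variable names are hypothetical.

```python
import random

PRIOR = [0.3, 0.2, 0.2, 0.3]                      # p(lambda_i)
ALPHA = [[0.80, 0.10, 0.10],                      # alpha_ij = p(theta_j | lambda_i)
         [0.30, 0.50, 0.20],
         [0.20, 0.50, 0.30],
         [0.10, 0.10, 0.80]]
X_MEANS = [[-1.0, -1.0], [-2.0, -1.0], [-2.0, -2.0], [-1.0, -2.0]]
Y_MEANS = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
SIGMA = 0.1  # assumed spread; the text does not state the covariances

def draw_pair(rng):
    """Draw lambda_i, then theta_j given lambda_i, then the two observation vectors."""
    i = rng.choices(range(len(PRIOR)), weights=PRIOR)[0]
    j = rng.choices(range(len(Y_MEANS)), weights=ALPHA[i])[0]
    x = [rng.gauss(mu, SIGMA) for mu in X_MEANS[i]]
    y = [rng.gauss(mu, SIGMA) for mu in Y_MEANS[j]]
    return i, j, x, y

rng = random.Random(0)
counts = [[0] * len(Y_MEANS) for _ in PRIOR]
for _ in range(20000):
    i, j, _x, _y = draw_pair(rng)
    counts[i][j] += 1

# Label-based estimate of the cross-correlation matrix (a sanity check, not EM).
empirical_alpha = [[c / sum(row) for c in row] for row in counts]
```

Data drawn this way gives a known ground truth against which the parameters estimated by the training algorithm can be compared.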
`
C. Experiment for the Recovery of Wideband Speech
For an assessment of the recovery algorithm we used, as criteria,
spectral log rms, $D_{\mathrm{rms}}$, and segmental spectral signal-to-noise ratio
(SNR), $L_{\mathrm{SNR}}$.
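
The formulas for these two criteria are not given in the text. The sketch below therefore assumes commonly used definitions: the rms difference of log power spectra in dB, and the frame-averaged ratio of reference spectral energy to error energy in dB. The function names and the exact definitions are assumptions and may differ from the measures actually used in the experiments.

```python
import math

def rms_log_spectral_distortion_db(ref_power, est_power):
    """RMS difference between two power spectra on a dB scale (assumed definition)."""
    diffs = [10.0 * math.log10(r / e) for r, e in zip(ref_power, est_power)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def segmental_spectral_snr_db(ref_frames, est_frames):
    """Mean over frames of 10 log10(sum ref^2 / sum (ref - est)^2) (assumed definition).

    Each frame is a list of spectral magnitudes; a frame reproduced exactly
    would give zero error energy and is not handled in this sketch.
    """
    snrs = []
    for ref, est in zip(ref_frames, est_frames):
        signal = sum(r * r for r in ref)
        noise = sum((r - e) ** 2 for r, e in zip(ref, est))
        snrs.append(10.0 * math.log10(signal / noise))
    return sum(snrs) / len(snrs)

# Hypothetical spectra: one bin is 10 dB off, the other matches exactly.
d_rms = rms_log_spectral_distortion_db([1.0, 10.0], [1.0, 1.0])
# Hypothetical single frame with half of the reference amplitude recovered.
l_snr = segmental_spectral_snr_db([[2.0, 0.0]], [[1.0, 0.0]])
```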
In the first experiment, we were very interested in the performance
as a function of the number of random sources for the highband
speech. The number of sources for the narrowband speech was preset
to a large number (N = 128 in practice), which may not be efficient
but was certainly sufficient. We see from Fig. 3 that the spectral log
rms decreases and segmental spectral SNR increases as M increases.
Above M = 16 (i.e., 4 bits), further changes were not significant.
Secondly, fixing M at 16 we increased gradually the number of
sources for narrowband speech. As N increased, a decrease in log
rms and an increase in segmental spectral SNR were also observed
(see Fig. 4). N = 64 (i.e., 6 bits) was reasonable. Compared with
narrowband speech (M = 0), the reconstructed wideband speech
with N = 64 and M = 16 showed a gain of about 3 dB in
segmental spectral SNR. We note, however, that SNR is a flawed
measure to evaluate performance in this context. We, thus, examined
spectrograms.
`In Fig. 5, we show spectrograms for an example of original
`wideband speech, reconstructed wideband speech, and narrowband
`speech. Most of the highband speech was successfully reconstructed;
`however, the reconstruction is not fully accurate, especially for the
`fricatives, /f/ and /s/. This weakness is mainly due to such fricatives'
`concentrating their information at highband; the narrowband versions
`of such fricatives do not allow easy discrimination.
`
`IV. CONCLUSION
We developed a statistical recovery function (SRF) to recover
wideband speech from the narrowband speech available at receivers
in most communications networks. We obtained encouraging results
in our preliminary study. Reconstructed wideband speech showed a
gain of 3 dB in segmental SNR compared with narrowband speech,
with no more than narrowband speech as input.
`
`REFERENCES
`
[1] F. Itakura, "Minimum prediction residual principle applied to speech
recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 23,
no. 1, pp. 67-72, Feb. 1975.
[2] L. R. Rabiner, "A tutorial on hidden Markov models and selected
applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp.
257-286, 1989.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood
from incomplete data via the EM algorithm," Ann. Royal Stat. Soc., pp.
1-38, 1977.
[4] L. E. Baum, "An inequality and associated maximization technique in
statistical estimation for probabilistic functions of Markov processes,"
Inequalities, vol. 3, pp. 1-8, 1972.
[5] A. Buzo, A. H. Gray, Jr., R. M. Gray, and J. D. Markel, "Speech coding
based upon vector quantization," IEEE Trans. Acoust., Speech, Signal
Processing, vol. 28, no. 5, pp. 562-574, 1980.
`