IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 2, NO. 4, OCTOBER 1994

Statistical Recovery of Wideband Speech from Narrowband Speech

Yan Ming Cheng, Douglas O'Shaughnessy, and Paul Mermelstein

Abstract-We present an algorithm to generate wideband speech from
a narrowband version of the same. The main body of the algorithm
is a statistical recovery function (SRF), which predicts the highband
spectrum based solely on the narrowband spectrum. The performance
of the algorithm has been measured both in terms of spectral distortion
and spectral signal-to-noise ratio (SNR). We obtained a 3 dB gain in SNR
for the reconstructed wideband speech as compared to the narrowband
`
`......
`
I. INTRODUCTION
Wideband speech (in our experiments, covering the range 0.3-8 kHz)
generally has a more pleasant quality than narrowband (0.3-3.75 kHz)
speech. Most transmission lines carry only narrowband speech for
economic reasons, and some existing communication networks do so for
historical reasons. Because of the human preference for wideband
speech, a solution to generate wideband speech from a narrowband
transmission appears attractive. We develop here a tool to recover the
spectral highband difference between wideband and narrowband speech,
without the use of any additional transmitted information. The
feasibility of such a tool depends on the validity of the assumption
that the difference signal is closely correlated with, and is a
nonlinear function of, the narrowband speech. Our experiments support
the validity of this assumption.
In this paper, we present a preliminary study toward the realization
of such a speech-recovery tool. The approach we adopt is to implement a
recovery function at the receiver of a coded speech transmission. The
function maps narrowband speech to a spectral difference signal, which
is considered here only as highband (3.75-8 kHz) speech. To reconstruct
the wideband speech, we add the highband component to the received
narrowband speech. The recovery function is based on a statistical
dependence between the narrowband and highband speech spectra and
applies in a speaker-independent fashion.
`
II. THE STATISTICAL RECOVERY FUNCTION (SRF) AND ITS USE

A. Background
A speech signal can be segmented and assigned into classes, such
as phonemes or broad phonetic classes. We interpret each class to
have a distinct pattern in both low and high frequencies. If a signal
in a narrowband spectrum is recognized as belonging to a certain
class, the highband signal can be approximately determined by the
corresponding pattern of the class. This idea can be generalized as
that of a narrowband class being mapped to any highband class
with a certain probability. In order to avoid hard decisions in the
classification, we introduce the notion of a random source instead of
class. Each random source has a probability density function (pdf),
characterized by a mean and a covariance matrix. A signal of a class
can be considered as, for instance, a random signal emitted from a
random source with the highest probability; a transitional signal is
emitted jointly by several random sources.

Fig. 1. Graphical illustration of a statistical recovery function. The dots
in the space $\mathcal{X}$ represent random sources $\lambda_i$ and those in the
space $\mathcal{Y}$ represent $\theta_j$. The lines connecting the sources in the
two spaces represent the cross-correlation probabilities.

Manuscript received August 17, 1992; revised December 31, 1993. This
work was supported by grants from the Natural Sciences and Engineering
Research Council (NSERC) of Canada.
Y. M. Cheng and P. Mermelstein are with Bell-Northern Research, Nuns'
Island, Quebec, Canada H3E 1H6.
D. O'Shaughnessy is with INRS-Telecommunications, Nuns' Island,
Quebec, Canada H3E 1H6.
IEEE Log Number 9403984.
`
B. An Iterative Training Algorithm via the EM Algorithm
Consider a sample vector or frame of K narrowband speech
samples, $x = [x_0, x_1, \ldots, x_K]^T$, in a multidimensional space
$\mathcal{X}$, and a sample vector of highband speech,
$y = [y_0, y_1, \ldots, y_K]^T$, in a space $\mathcal{Y}$. We assume that the
ensemble of x is generated by a combination of N random sources,
$\lambda_i$, $1 \le i \le N$, and the ensemble of y by M random sources,
$\theta_j$, $1 \le j \le M$. The probability of source $\theta_j$ contributing
to the highband speech, while source $\lambda_i$ contributes to the
narrowband speech, is defined by $\alpha_{ij} = p(\theta_j|\lambda_i)$, a
cross-correlation probability. A graphical illustration is given in Fig. 1.
Given a set of parameters, $A = \{\alpha_{ij}\}$, $\Lambda = \{\lambda_i\}$, and
$\Theta = \{\theta_j\}$, and a vector of narrowband speech, x, a recovery
function f yields highband speech $y = f(x, A, \Lambda, \Theta)$. In the
remainder of this section, we derive a training algorithm and a procedure
for highband speech estimation.
Let us consider a joint pdf for the speech and individual sources
at time t in the speech signal

$$
\begin{aligned}
p(y_t, x_t, \lambda_i, \theta_j) &= p(y_t|\theta_j)\, p(\theta_j|\lambda_i)\, p(x_t|\lambda_i)\, p(\lambda_i) \\
&= p(y_t|\theta_j)\, \alpha_{ij}\, p(x_t|\lambda_i)\, p(\lambda_i)
\end{aligned}
\tag{1}
$$

since, by definition, y depends only on $\theta_j$ and $\theta_j$ only on $\lambda_i$.
Following the frequent assumption of an all-pole (autoregressive)
model for speech signals in linear predictive analysis [1], we will,
for computational convenience, use pth- and qth-order autoregressive
Gaussian sources to describe the random sources $\lambda_i$ and $\theta_j$,
respectively. In cases where assuming an autoregressive model for
the signal is not suitable, a standard Gaussian pdf can be used without
any loss. The conditional autoregressive Gaussian pdf's of observing
$x_t$ and $y_t$, given their underlying sources, $p(x_t|\lambda_i)$ and
$p(y_t|\theta_j)$, can be found in [1], [2]. We use an energy-normalized
version of the above pdf's, since the absolute energy is independent of
the sources. In order to reconstruct energy information for a given
signal, we will explicitly build up the recovery function for the energy
ratio.
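
The factorization in (1) can be illustrated with a minimal sketch. It assumes isotropic unit-variance Gaussian sources in place of the autoregressive Gaussian pdf's of [1], [2] (the text notes that a standard Gaussian pdf may be substituted); the names `gaussian_pdf` and `joint_table` and all numeric values are hypothetical and not part of the original work.

```python
import math

def gaussian_pdf(v, mean, var=1.0):
    """Isotropic Gaussian density (assumed stand-in for the AR Gaussian pdf)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(v, mean))
    norm = (2.0 * math.pi * var) ** (len(v) / 2.0)
    return math.exp(-0.5 * sq_dist / var) / norm

def joint_table(x, y, lam_means, theta_means, prior, alpha):
    """J[i][j] = p(y|theta_j) * alpha[i][j] * p(x|lambda_i) * p(lambda_i), as in (1)."""
    px = [gaussian_pdf(x, m) for m in lam_means]
    py = [gaussian_pdf(y, m) for m in theta_means]
    return [[py[j] * alpha[i][j] * px[i] * prior[i]
             for j in range(len(theta_means))]
            for i in range(len(lam_means))]

# Toy values (hypothetical): N = 2 narrowband sources, M = 2 highband sources.
lam_means = [[-1.0, -1.0], [1.0, 1.0]]
theta_means = [[0.0, 1.0], [1.0, 0.0]]
prior = [0.5, 0.5]                       # p(lambda_i)
alpha = [[0.8, 0.2], [0.3, 0.7]]         # alpha[i][j] = p(theta_j | lambda_i)
J = joint_table([-0.9, -1.1], [0.1, 0.8], lam_means, theta_means, prior, alpha)
```

Each entry of `J` is one term of the double sum over sources; summing all entries gives the frame likelihood used later for normalization.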
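Given a pair of training sequences, $(X, Y) = \{(x_t, y_t)\}$ with
$1 \le t \le T$, and a set of parameters $\Xi = \{A, \Lambda, \Theta\}$, the
joint probability is

$$
\begin{aligned}
p(X, Y) = p(\Xi, X, Y) &= \prod_{t=1}^{T} \sum_{i=1}^{N} \sum_{j=1}^{M} p(x_t, y_t, \lambda_i, \theta_j) \\
&= \sum_{k=1}^{(NM)^T} \prod_{t=1}^{T} p(x_t, y_t, s_k(t)) \\
&= \sum_{k=1}^{(NM)^T} p(\Xi, X, Y, s_k)
\end{aligned}
\tag{2}
$$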
`
`1063-6676/94$04.00 © 1994 IEEE
`
`
where $s_k(t) = [\lambda_i, \theta_j]$ is a state-vector at time t and on the
kth state-path of a trellis (the number of distinct state-vectors is NM
and that of distinct state-paths is $(NM)^T$). Given the same conditions,
the conditional joint pdf of both $\lambda_i$ and $\theta_j$ at time t is

$$
p(t: \lambda_i, \theta_j | X, Y) = \frac{p(t: \lambda_i, \theta_j, X, Y)}{p(X, Y)}
= \frac{p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}
\tag{3}
$$

and the conditional pdf of $\lambda_i$ contributing to the speech at time t is

$$
p(t: \lambda_i | X, Y) = \frac{p(t: \lambda_i, X, Y)}{p(X, Y)}
= \frac{\sum_{j=1}^{M} p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}
\tag{4}
$$
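
As a small illustration of (3) and (4), the posteriors for one frame follow from normalizing a table of joint values. The sketch below assumes the joint values $p(x_t, y_t, \lambda_i, \theta_j)$ are already available as a nested list; the function name `frame_posteriors` and the numbers are hypothetical.

```python
def frame_posteriors(joint):
    """Normalize a joint table for one frame.

    joint[i][j] holds p(x_t, y_t, lambda_i, theta_j).
    Returns (post_ij, post_i, post_j): the joint posterior of (3), the
    marginal over theta_j as in (4), and the marginal over lambda_i.
    """
    total = sum(sum(row) for row in joint)          # denominator of (3) and (4)
    post_ij = [[v / total for v in row] for row in joint]
    post_i = [sum(row) for row in post_ij]
    post_j = [sum(row[j] for row in post_ij) for j in range(len(joint[0]))]
    return post_ij, post_i, post_j

# Hypothetical joint values for N = 2, M = 2.
post_ij, post_i, post_j = frame_posteriors([[0.02, 0.01], [0.03, 0.04]])
```

Because only ratios of joint values enter, any factor common to all entries of the table cancels.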
`
In order to estimate the parameters, we use the EM algorithm [3]
to maximize the likelihood, $p(\Xi, X, Y)$. In the E-step we compute
the expectation of the log likelihood (or an auxiliary function in [4]),
$Q(\Xi|\Xi_0) = E(\log p(\Xi, X, Y, s_k)|\Xi_0)$, over state-paths (see (2)),
based on an initial guess of parameters, $\Xi_0 = \{A_0, \Lambda_0, \Theta_0\}$.
In the M-step, we maximize the expectation function, Q( ), by adjusting
$\Xi$. Using Lagrange optimization and the constraint
$\sum_{j=1}^{M} p(\theta_j|\lambda_i) = 1$, it is not hard to derive (see [4])

$$
\alpha_{ij} = p(\theta_j|\lambda_i) = \frac{C_{ij}}{\sum_{j=1}^{M} C_{ij}}
$$

where $C_{ij} = \sum_{k=1}^{(NM)^T} p(\Xi_0, X, Y, s_k)\, c_{ij}(s_k)$ is the
expectation of $c_{ij}(s_k)$, which is the count of a state-vector,
$[\lambda_i, \theta_j]$, on the state-path $s_k$. The expected count can
also be computed efficiently as
$C_{ij} = \sum_{t=1}^{T} p(t: \lambda_i, \theta_j, X, Y) = \sum_{t=1}^{T} p(t: \lambda_i, \theta_j|X, Y)\, p(X, Y)$,
and since p(X, Y) is independent of i and j, we obtain (5), shown below.
Similarly, an updating formula of the a priori pdf of the source,
$p(\lambda_i)$, can be derived as the expected count of $\lambda_i$ on a
state-path divided by the total count on a state-path, giving (6), also
shown below. We can also update the autocorrelation sequences of sources

$$
r_{\lambda_i}(k) = \frac{\sum_{t=1}^{T} r_{x_t}(k)\, p(\lambda_i|x_t, y_t)}{\sum_{t=1}^{T} p(\lambda_i|x_t, y_t)}
\tag{7a}
$$

$$
r_{\theta_j}(k) = \frac{\sum_{t=1}^{T} r_{y_t}(k)\, p(\theta_j|x_t, y_t)}{\sum_{t=1}^{T} p(\theta_j|x_t, y_t)}
\tag{7b}
$$

and a ratio of highband signal energy versus narrowband energy

$$
\frac{\sum_{t=1}^{T} \left( \frac{\mathcal{E}(y_t)}{\mathcal{E}(x_t)} \right)^{1/2} p(\theta_j|x_t, y_t)}{\sum_{t=1}^{T} p(\theta_j|x_t, y_t)}
\tag{7c}
$$

where $p(\lambda_i|x_t, y_t)$ and $p(\theta_j|x_t, y_t)$ can be easily derived
from $p(\lambda_i, \theta_j, x_t, y_t)$ and its marginal pdfs;
$\mathcal{E}(x_t)$ and $\mathcal{E}(y_t)$ are the energies of $x_t$ and of
$y_t$, respectively. The reason for using the energy ratio here instead
of the absolute energy value is that the latter value varies from signal
to signal (e.g., on different telephone lines) and has little direct
utility; the ratio, however, contains sufficient energy information for
reconstruction of the highband signal. A set of updated autoregressive
coefficients of sources can be obtained from $r_{\lambda_i}(k)$ and
$r_{\theta_j}(k)$ through the usual Levinson-Durbin recursive algorithm.
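
For readers unfamiliar with the Levinson-Durbin recursion mentioned above, the following is a minimal sketch of the textbook form. It assumes the predictor convention $x(n) \approx \sum_k a_k x(n-k)$; the function name and the example autocorrelation values are illustrative and not taken from the paper.

```python
def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for an autocorrelation sequence r[0..order].

    Returns (a, err): predictor coefficients a[0..order-1], where a[k-1]
    multiplies x(n-k), and the final prediction-error energy.
    """
    a = [0.0] * (order + 1)                 # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for stage m.
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / err
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        err *= (1.0 - k_m * k_m)
    return a[1:], err

# Autocorrelation of an AR(2) process with coefficients (0.5, 0.2),
# normalized so that r[0] = 1.
coeffs, err = levinson_durbin([1.0, 0.625, 0.5125], order=2)
```

With the normalized autocorrelation of an AR(2) process as input, the recursion recovers the generating coefficients up to rounding.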
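The following list summarizes the training algorithm:
1) Initialize parameters $\Xi_0 = \{A_0, \Lambda_0, \Theta_0\}$.
2) Iteration loop
   a) Time loop: t from 1 to T
      i) Compute $p(y_t, x_t, \lambda_i, \theta_j)$ according to (1).
      ii) Compute recursively the necessary cumulatives in (5),
          (6), and (7).
   b) End the time loop and update $\Xi_0 = \{A_0, \Lambda_0, \Theta_0\}$
      according to (5), (6), and (7).
   c) Test if the stop criterion is satisfied. If yes, stop the iteration.
The EM algorithm guarantees that the developed updating formulas
converge to a critical point.

$$
\alpha_{ij} = \frac{\sum_{t=1}^{T} p(t: \lambda_i, \theta_j|X, Y)}{\sum_{t=1}^{T} p(t: \lambda_i|X, Y)}
= \frac{\displaystyle \sum_{t=1}^{T} \frac{p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}}
{\displaystyle \sum_{t=1}^{T} \frac{\sum_{j=1}^{M} p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}}
\tag{5}
$$

$$
p(\lambda_i) = \frac{1}{T} \sum_{t=1}^{T} p(t: \lambda_i|X, Y)
= \frac{1}{T} \sum_{t=1}^{T} \frac{\sum_{j=1}^{M} p(x_t, y_t, \lambda_i, \theta_j)}{\sum_{k=1}^{N} \sum_{l=1}^{M} p(x_t, y_t, \lambda_k, \theta_l)}
\tag{6}
$$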
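
The training loop summarized above can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the sources are unit-variance isotropic Gaussians, so the M-step re-estimates source means instead of the autocorrelation sequences of (7a) and (7b), and the energy ratio of (7c) is omitted. All names and the synthetic data are hypothetical.

```python
import math
import random

def gauss(v, mean):
    """Unit-variance isotropic Gaussian density (stand-in for the AR Gaussian pdf)."""
    sq = sum((a - b) ** 2 for a, b in zip(v, mean))
    return math.exp(-0.5 * sq) / (2.0 * math.pi) ** (len(v) / 2.0)

def em_step(X, Y, lam, theta, prior, alpha):
    """One EM iteration; returns updated parameters and the current log likelihood."""
    N, M, T = len(lam), len(theta), len(X)
    c_ij = [[0.0] * M for _ in range(N)]           # expected counts of [lambda_i, theta_j]
    sx = [[0.0] * len(X[0]) for _ in range(N)]     # posterior-weighted sums of x
    sy = [[0.0] * len(Y[0]) for _ in range(M)]     # posterior-weighted sums of y
    loglik = 0.0
    for x, y in zip(X, Y):                         # time loop
        px = [gauss(x, m) for m in lam]
        py = [gauss(y, m) for m in theta]
        joint = [[py[j] * alpha[i][j] * px[i] * prior[i] for j in range(M)]
                 for i in range(N)]                # joint pdf of the frame
        total = sum(map(sum, joint))
        loglik += math.log(total)
        for i in range(N):
            for j in range(M):
                g = joint[i][j] / total            # posterior of the state-vector
                c_ij[i][j] += g
                for d, xd in enumerate(x):
                    sx[i][d] += g * xd
                for d, yd in enumerate(y):
                    sy[j][d] += g * yd
    c_i = [sum(row) for row in c_ij]
    c_j = [sum(c_ij[i][j] for i in range(N)) for j in range(M)]
    new_alpha = [[c_ij[i][j] / c_i[i] for j in range(M)] for i in range(N)]
    new_prior = [c / T for c in c_i]
    new_lam = [[s / c_i[i] for s in sx[i]] for i in range(N)]
    new_theta = [[s / c_j[j] for s in sy[j]] for j in range(M)]
    return new_lam, new_theta, new_prior, new_alpha, loglik

# Synthetic training data from a known two-by-two model (hypothetical values).
random.seed(0)
true_lam, true_theta = [[-2.0, 0.0], [2.0, 0.0]], [[0.0, -2.0], [0.0, 2.0]]
true_alpha = [[0.9, 0.1], [0.2, 0.8]]
X, Y = [], []
for _ in range(400):
    i = random.randrange(2)
    j = 0 if random.random() < true_alpha[i][0] else 1
    X.append([random.gauss(m, 1.0) for m in true_lam[i]])
    Y.append([random.gauss(m, 1.0) for m in true_theta[j]])

lam, theta = [[-1.0, 0.0], [1.0, 0.0]], [[0.0, -1.0], [0.0, 1.0]]
prior, alpha = [0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]]   # alpha initialized to 1/M
logliks = []
for _ in range(10):
    lam, theta, prior, alpha, ll = em_step(X, Y, lam, theta, prior, alpha)
    logliks.append(ll)
```

Recording the log likelihood at each pass gives a direct check of convergence: the sequence in `logliks` should never decrease.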
`
`
C. Minimum Mean Square Estimation (MMSE) of Highband Speech
An estimation of energy-normalized highband speech through minimum
mean square estimation (MMSE) is a conditional expectation

$$
\hat{Y} = E(Y|X) = \int_{\mathcal{Y}^{T'}} Y\, p(Y|X)\, dY
\tag{8}
$$

where, using (1)

$$
p(y_t|x_t) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} p(y_t, x_t, \lambda_i, \theta_j)}{p(x_t)}
= \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} p(y_t|\theta_j)\, \alpha_{ij}\, p(x_t|\lambda_i)\, p(\lambda_i)}{\sum_{i=1}^{N} p(x_t|\lambda_i)\, p(\lambda_i)}
$$

and T' is the length of the highband speech to estimate. Since we
assume that observations of both the highband and narrowband speech
are independent in different time frames t, the above equation can be
written in a form of vector concatenation, represented by the sign $\times$.
Then, we calculate the highband speech estimate as

$$
\hat{Y} = \frac{\sum_{j=1}^{M} \hat{y}_1^{(j)}}{p(x_1)} \times \cdots \times
\frac{\sum_{j=1}^{M} \hat{y}_t^{(j)}}{p(x_t)} \times \cdots \times
\frac{\sum_{j=1}^{M} \hat{y}_{T'}^{(j)}}{p(x_{T'})}
\tag{9}
$$

where $\hat{y}_t^{(j)} = \sum_{i=1}^{N} \bar{y}_t^{(j)}\, \alpha_{ij}\, p(x_t|\lambda_i)\, p(\lambda_i)$,
and $\bar{y}_t^{(j)}$ is the mean-vector of the random source $\theta_j$ at
time t. Since $\bar{y}_t^{(j)}$ is also a qth-order autoregressive process,
we have

$$
\left[ \bar{y}_t^{(j)}(n) \right]^T = \left[ \sum_{k=1}^{q} a_k^{(j)}\, \bar{y}_t^{(j)}(n-k) + G_t^{(j)}\, \epsilon_t(n) \right]^T
\tag{10}
$$

where $\epsilon_t(n)$ is white Gaussian noise excitation at time t with zero
mean and unity variance, and $a_k^{(j)}$ are the autoregressive coefficients.
The same excitation sequence has to be applied to all sources $\theta$ to
guarantee an identical initial phase and a smooth phase evolution in
time. $G_t^{(j)}$ is a gain factor applied to the $\theta_j$ sources and
estimated as in (11). Thus the estimation of $G_t^{(j)}$ is
independent of $p(x_t)$. We assumed implicitly in the above highband
speech generation that the highband speech exhibits no periodic
behavior (i.e., pitch-periodicity is absent).
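
The per-frame weighting in (9) can be made concrete with a short sketch. It treats each highband source mean as a fixed vector and uses unit-variance Gaussians for $p(x_t|\lambda_i)$; both are simplifying assumptions made for illustration, and the names and values below are hypothetical.

```python
import math

def gauss(v, mean):
    """Unit-variance isotropic Gaussian density (assumed stand-in for p(x|lambda_i))."""
    sq = sum((a - b) ** 2 for a, b in zip(v, mean))
    return math.exp(-0.5 * sq) / (2.0 * math.pi) ** (len(v) / 2.0)

def mmse_highband(x, lam_means, theta_means, prior, alpha):
    """Posterior-weighted average of highband source means for one frame."""
    px = [gauss(x, m) * p for m, p in zip(lam_means, prior)]   # p(x|lambda_i) p(lambda_i)
    p_x = sum(px)                                              # p(x)
    # Weight of source theta_j: sum_i alpha_ij p(x|lambda_i) p(lambda_i), over p(x).
    weights = [sum(alpha[i][j] * px[i] for i in range(len(px))) / p_x
               for j in range(len(theta_means))]
    dim = len(theta_means[0])
    y_hat = [sum(w * m[d] for w, m in zip(weights, theta_means)) for d in range(dim)]
    return y_hat, weights

# Toy parameters (hypothetical).
lam_means = [[-1.0, -1.0], [-2.0, -2.0]]
theta_means = [[1.0, 0.0], [2.0, 1.0]]
prior = [0.5, 0.5]
alpha = [[0.9, 0.1], [0.2, 0.8]]
y_hat, weights = mmse_highband([-1.0, -1.0], lam_means, theta_means, prior, alpha)
```

When the rows of `alpha` sum to one, the weights sum to one as well, so the estimate is a convex combination of the highband source means.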
In Fig. 2, we show a diagram of the current system to recover the
wideband speech. Each filter represents a random source's autoregressive
spectrum in the highband. The input of each filter is weighted by
a factor, $f(t, j) = G_t^{(j)} \zeta_{t,j}$, where
$\zeta_{t,j} = \sum_{i=1}^{N} \alpha_{ij}\, p(x_t|\lambda_i)\, p(\lambda_i)$.
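
The filter-bank structure described for Fig. 2 can be sketched as a set of all-pole filters driven by one shared white Gaussian excitation, each input scaled by its weight, with the outputs summed. The sketch assumes the weights are given (the gain estimate is not reproduced here) and resets filter memory for every frame; these choices and all names are illustrative.

```python
import random

def ar_filter(excitation, coeffs, weight):
    """All-pole filter: y(n) = sum_k coeffs[k-1] * y(n-k) + weight * excitation(n)."""
    out = []
    for n, e in enumerate(excitation):
        acc = weight * e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out

def synthesize_frame(ar_coeffs, weights, n_samples, rng):
    """Sum the outputs of all source filters driven by the same excitation."""
    # One excitation sequence shared by every source, as the text requires.
    excitation = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    frame = [0.0] * n_samples
    for coeffs, w in zip(ar_coeffs, weights):
        branch = ar_filter(excitation, coeffs, w)
        frame = [f + b for f, b in zip(frame, branch)]
    return frame

# Two hypothetical highband sources with stable AR coefficients.
frame = synthesize_frame([[0.9], [0.5, -0.3]], [0.7, 0.3], 160, random.Random(0))
```

Sharing the excitation keeps the branches phase-coherent, so the weighted sum does not cancel unpredictably between sources.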
`
Fig. 2. Diagram of the wideband speech recovery system.

III. EXPERIMENTAL RESULTS

A. Speech Material
The speech database used contained phonetically-balanced wideband
speech sampled at 16 kHz with an antialiasing filter cutting off
at 7.8 kHz. The database was split into two parts. Part one, used
to train the statistical recovery function, consisted of speech from
four male and four female speakers. The data to test our algorithm
consisted of speech from four separate speakers (two male and two
female). Thus, the algorithm can be viewed as operating
speaker-independently. The narrowband speech was generated by passing the
wideband speech through a 0.3-3.75 kHz Chebychev bandpass filter.
The frame length was 20 ms and the frame advance was 10 ms.
The orders of linear prediction (autoregressive) analysis were sixteen
(i.e., p = 16 and q = 16).

Fig. 3. Performance as a function of the number, M, of sources $\theta_j$.
The left panel shows the rms of log spectra; the right panel shows
segmental SNR. The vertical axes are in dB; the horizontal axis shows
$\log_2 M$ (i.e., M in bits).

Fig. 4. Performance as a function of the number, N, of sources $\lambda_i$.
The left panel shows the rms of log spectra; the right panel shows the
segmental SNR. The vertical axes are in dB; the horizontal axis shows
$\log_2 N$ (i.e., bits).
`
B. Experiments for the Training Procedure
For the training procedure, there are two factors that attracted
most of our concern: iteration convergence and initialization. For the
initialization, we had two options in these experiments: (1) vector
quantization (VQ) [5] initialization and (2) bootstrap initialization.
In the above two initializations, the cross-correlations are always
initialized as $\alpha_{ij} = 1/M$. We have observed that for both VQ
and bootstrap initialization the log likelihood increased with each
iteration; thus the training convergence is practically demonstrated.
Both VQ and bootstrap initializations, however, have log likelihood
values very close to each other after about ten iterations. We may
say that the initialization has little influence on the resulting mapping
function, at least in terms of likelihood.

(11)

Fig. 5. Spectrograms of the original wideband speech (top), reconstructed
wideband speech (middle), and narrowband speech (bottom) for a sentence,
"Lift the square stone over the fence."

A more analytical way to study the training procedure is to use
fully controlled data. For this purpose, we simulated the assumed
data-generation process described in Section II. The parameters were
K = p = q = 2 and four random sources at $\mathcal{X}$, the originating
space, and three random sources at $\mathcal{Y}$, the destination space.
The statistical coefficients of the data generation and their estimation
after fifteen iterations were:

1) The mean vectors of the four sources in $\mathcal{X}$ and their estimations

$$
\begin{bmatrix} -1.0 \\ -1.0 \end{bmatrix},
\begin{bmatrix} -2.0 \\ -1.0 \end{bmatrix},
\begin{bmatrix} -2.0 \\ -2.0 \end{bmatrix},
\begin{bmatrix} -1.0 \\ -2.0 \end{bmatrix};
\qquad
\begin{bmatrix} -1.00 \\ -0.99 \end{bmatrix},
\begin{bmatrix} -2.00 \\ -0.99 \end{bmatrix},
\begin{bmatrix} -2.01 \\ -2.01 \end{bmatrix},
\begin{bmatrix} -1.00 \\ -2.01 \end{bmatrix}.
$$

2) The mean vectors of the three sources in $\mathcal{Y}$ and their estimations

$$
\begin{bmatrix} 1.0 \\ 0.0 \end{bmatrix},
\begin{bmatrix} 0.0 \\ 1.0 \end{bmatrix},
\begin{bmatrix} 2.0 \\ 1.0 \end{bmatrix};
\qquad
\begin{bmatrix} 1.00 \\ 0.00 \end{bmatrix},
\begin{bmatrix} 0.01 \\ 1.00 \end{bmatrix},
\begin{bmatrix} 2.00 \\ 1.00 \end{bmatrix}.
$$

3) The active probabilities of the four sources in $\mathcal{X}$ and their
estimations

$$
[0.3,\ 0.2,\ 0.2,\ 0.3]; \qquad [0.29,\ 0.19,\ 0.21,\ 0.31].
$$

4) The correlation matrix and its estimation

$$
\begin{bmatrix}
0.80 & 0.10 & 0.10 \\
0.30 & 0.50 & 0.20 \\
0.20 & 0.50 & 0.30 \\
0.10 & 0.10 & 0.80
\end{bmatrix};
\qquad
\begin{bmatrix}
0.79 & 0.11 & 0.10 \\
0.32 & 0.50 & 0.18 \\
0.20 & 0.52 & 0.28 \\
0.10 & 0.10 & 0.81
\end{bmatrix}.
$$

From this controlled experiment, we have shown that the proposed
algorithm estimates a correct statistical structure.
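
A data-generation process of the kind used in this controlled experiment can be sketched as below, using the true parameter values listed above. The paper does not state the spread of each source here, so the standard deviation `sigma`, the plain Gaussian emission (in place of second-order autoregressive sources), the frame count, and the function name are all assumptions made for illustration.

```python
import random

LAM_MEANS = [[-1.0, -1.0], [-2.0, -1.0], [-2.0, -2.0], [-1.0, -2.0]]
THETA_MEANS = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
PRIOR = [0.3, 0.2, 0.2, 0.3]
ALPHA = [[0.80, 0.10, 0.10],
         [0.30, 0.50, 0.20],
         [0.20, 0.50, 0.30],
         [0.10, 0.10, 0.80]]

def sample_frames(n_frames, sigma=0.1, seed=0):
    """Draw (x, y) pairs: pick lambda_i from PRIOR, then theta_j from ALPHA[i]."""
    rng = random.Random(seed)
    X, Y, labels = [], [], []
    for _ in range(n_frames):
        i = rng.choices(range(len(PRIOR)), weights=PRIOR)[0]
        j = rng.choices(range(len(THETA_MEANS)), weights=ALPHA[i])[0]
        X.append([rng.gauss(m, sigma) for m in LAM_MEANS[i]])   # assumed Gaussian emission
        Y.append([rng.gauss(m, sigma) for m in THETA_MEANS[j]])
        labels.append((i, j))
    return X, Y, labels

X, Y, labels = sample_frames(20000)

# Empirical source statistics, for comparison with PRIOR and ALPHA.
counts = [[0] * len(THETA_MEANS) for _ in PRIOR]
for i, j in labels:
    counts[i][j] += 1
emp_prior = [sum(row) / len(labels) for row in counts]
emp_alpha = [[c / sum(row) for c in row] for row in counts]
```

Comparing `emp_prior` and `emp_alpha` with the generating values gives a reference for how closely any estimator could be expected to match them at this sample size.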
`
C. Experiment for the Recovery of Wideband Speech
For an assessment of the recovery algorithm we used, as criteria,
spectral log rms, $D_{rms}$, and segmental spectral signal-to-noise ratio
(SNR), $L_{SNR}$.
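
The paper does not spell out the two criteria, so the sketch below uses common definitions as an assumption: the rms difference of log magnitude spectra in dB, and a per-frame spectral SNR, each averaged over frames. Function names and the spectral floor are hypothetical.

```python
import math

def log_spectral_rms(ref_mag, est_mag, floor=1e-12):
    """RMS difference, in dB, between two magnitude spectra of one frame."""
    diffs = [20.0 * math.log10(max(r, floor)) - 20.0 * math.log10(max(e, floor))
             for r, e in zip(ref_mag, est_mag)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def spectral_snr_db(ref_mag, est_mag, floor=1e-12):
    """Spectral SNR of one frame: reference energy over error energy, in dB."""
    sig = sum(r * r for r in ref_mag)
    err = sum((r - e) ** 2 for r, e in zip(ref_mag, est_mag))
    return 10.0 * math.log10(max(sig, floor) / max(err, floor))

def segmental(metric, ref_frames, est_frames):
    """Average a per-frame metric over all frames."""
    values = [metric(r, e) for r, e in zip(ref_frames, est_frames)]
    return sum(values) / len(values)

# Two hypothetical frames of magnitude spectra.
ref = [[1.0, 1.0], [2.0, 2.0]]
est = [[10.0, 10.0], [1.0, 1.0]]
d_rms = segmental(log_spectral_rms, ref, est)
l_snr = segmental(spectral_snr_db, ref, est)
```

Averaging per frame, instead of pooling all frames into one ratio, keeps loud frames from dominating the score.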
In the first experiment, we examined the performance as a function of
the number of random sources for the highband speech. The number of
sources for the narrowband speech was preset to a large number (N = 128
in practice), which may not be efficient but was certainly sufficient.
We see from Fig. 3 that the spectral log rms decreases and segmental
spectral SNR increases as M increases. Above M = 16 (i.e., 4 bits),
further changes were not significant. Secondly, fixing M at 16, we
gradually increased the number of sources for narrowband speech. As N
increased, a decrease in log rms and an increase in segmental spectral
SNR were also observed (see Fig. 4). N = 64 (i.e., 6 bits) was
reasonable. Compared with narrowband speech (M = 0), the reconstructed
wideband speech with N = 64 and M = 16 showed a gain of about 3 dB in
segmental spectral SNR. We note, however, that SNR is a flawed
measure to evaluate performance in this context. We, thus, examined
spectrograms.
In Fig. 5, we show spectrograms for an example of original
wideband speech, reconstructed wideband speech, and narrowband
speech. Most of the highband speech was successfully reconstructed;
however, the reconstruction is not fully accurate, especially for the
fricatives, /f/ and /s/. This weakness is mainly due to such fricatives
concentrating their information in the highband; the narrowband versions
of such fricatives do not allow easy discrimination.
`
IV. CONCLUSION
We developed a statistical recovery function (SRF) to recover
wideband speech from the narrowband speech available at receivers
in most communications networks. We obtained encouraging results
in our preliminary study. Reconstructed wideband speech showed a
gain of 3 dB in segmental SNR compared with narrowband speech,
with no more than narrowband speech as input.
`
REFERENCES

[1] F. Itakura, "Minimum prediction residual principle applied to speech
recognition," IEEE Trans. Acoust., Speech, Signal Processing, vol. 23,
no. 1, pp. 67-72, Feb. 1975.
[2] L. R. Rabiner, "A tutorial on hidden Markov models and selected
applications in speech recognition," Proc. IEEE, vol. 77, no. 22, pp.
257-289, 1989.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood
from incomplete data via the EM algorithm," Ann. Royal Stat. Soc., pp.
1-38, 1977.
[4] L. E. Baum, "An inequality and associated maximization technique in
statistical estimation for probabilistic functions of Markov processes,"
Inequalities, vol. 3, pp. 1-8, 1972.
[5] A. Buzo, A. H. Gray, Jr., R. M. Gray, and J. D. Markel, "Speech coding
based upon vector quantization," IEEE Trans. Acoust., Speech, Signal
Processing, vol. 28, no. 5, pp. 562-574, 1980.
`