`
`EURASIP Journal on Applied Signal Processing 2001:0, 1–9
`© 2001 Hindawi Publishing Corporation
`
`Techniques for the Regeneration of Wideband Speech
`from Narrowband Speech
`
`Jason A. Fuemmeler
`
`University of Dayton, Department of Electrical and Computer Engineering, 300 College Park, Dayton, OH 45469-0226, USA
`Email: fuemmeja@hotmail.com
`
`Russell C. Hardie
`
`University of Dayton, Department of Electrical and Computer Engineering, 300 College Park, Dayton, OH 45469-0226, USA
`Email: rhardie@udayton.edu
`
`William R. Gardner
`
`Qualcomm, Inc., 5775 Morehouse Drive, San Diego, CA 92121, USA
`Email: wgardner@qualcomm.com
`
`Received 31 July 2001 and in revised form 30 September 2001
`
`This paper addresses the problem of reconstructing wideband speech signals from observed narrowband speech signals. The goal
`of this work is to improve the perceived quality of speech signals which have been transmitted through narrowband channels
`or degraded during acquisition. We describe a system, based on linear predictive coding, for estimating wideband speech from
`narrowband. This system employs both previously identified and novel techniques. Experimental results are provided in order to
`illustrate the system’s ability to improve speech quality. Both objective and subjective criteria are used to evaluate the quality of the
`processed speech signals.
`
`Keywords and phrases: wideband speech regeneration, narrowband speech, linear predictive coding, speech processing, speech
`coding.
`
`1.
`
`INTRODUCTION
`
`In voice communications, the quality of received speech sig-
`nals is highly dependent on the received signal bandwidth. If
`the communications channel restricts the bandwidth of the
`received signal, the perceived quality of the output speech
`is degraded. In many voice transmission or storage applica-
`tions the high-frequency portions of the input speech signal
`are eliminated due to the physical properties of the channel
`or to reduce the bandwidth required. The resulting lowpass
`speech often sounds muffled or “far away” compared to the
`original, due to the lack of high frequency content.
`One way to compensate for these effects is to efficiently
`encode the speech at the transmitter so that less channel band-
`width is required to transmit the same amount of informa-
`tion. Of course, this requires that the receiver have an appro-
`priate decoder to recover the original signal. Because of this
burden on both the receiver and transmitter, wideband vocoding techniques are difficult to apply to systems that have already been standardized (e.g., analog telephone communications). It would be more convenient to devise a system
`
`at the receiver that could regenerate the lost high-frequency
`content.
`Some work has already been done in the area of wideband
`speech regeneration [1, 2, 3, 4, 5, 6, 7, 8]. These works have
`primarily used linear predictive (LP) techniques. By using
`these techniques, the reconstruction problem is divided into
`two separate tasks. The first task is forming a wideband resid-
`ual error signal, while the second is recreating a set of wide-
`band linear predictive coefficients (LPCs). Once these two
`components have been generated, the wideband residual is
`fed into the wideband LP synthesis filter resulting in a regen-
`erated wideband speech signal.
`This paper provides a brief review of LP techniques used
`in wideband speech regeneration and proposes a new LP tech-
`nique with several novel aspects. In particular, a new method
`for the creation of a wideband residual is proposed which
`overcomes difficulties encountered using many methods pre-
`viously employed. Furthermore, a relatively new distortion
`measure is investigated for use in codebook generation and
`in mapping narrowband LPCs to wideband LPCs. To the best
`
Figure 1: Context of the wideband speech regeneration system. (Block diagram: x_W(t) → channel H_c(f) → x_N(t) → A/D → x_N(n) → wideband speech regeneration → x̂_W(n) → D/A → x̂_W(t).)

Figure 2: The wideband speech regeneration system. (Block diagram: LP analysis of x_N(n) yields the residual e_N(n) and the LPCs a_N; e_N(n) → wideband residual regeneration → ê_W(n) → LP synthesis → HPF; a_N → VQ codebook mapping → â_W, which drives the LP synthesis filter; the highpass output is added to the K-times upsampled input to form x̂_W(n).)
`
`signal. For each speech frame, this produces a narrowband
`residual or error signal, eN (n), and a set of narrowband LPCs,
denoted aN = [a1, a2, . . . , aP]T. To regenerate the wideband
`speech, both of these components must be converted into
`their wideband counterparts.
`The process for creating a wideband residual, ˆeW (n), from
`a narrowband residual is referred to herein as high frequency
`regeneration (HFR). The process for generating wideband
`LPCs, ˆaW , from the narrowband LPCs is referred to as code-
`book mapping. Having regenerated these two critical speech
`components, it is then possible to construct wideband speech
`through LP synthesis (the inverse process of LP analysis).
`It is important to realize that, although no upsampling is
`explicitly shown in Figure 2, the HFR block outputs a signal
`at K times the sampling rate of its input. Finally, the regener-
`ated wideband speech is passed through a highpass filter and
`added to the input narrowband speech signal to create the
`final speech waveform. This is done in order to preserve the
`accurate low frequency information contained in the origi-
`nal input speech signal. Thus, the proposed processing will
`not alter the low frequency content of the input signal but
`will simply add high frequency components. The following
`three sections describe in detail the LP, HFR, and codebook
`mapping blocks, respectively.
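
To make the data flow concrete, the following Python sketch (NumPy/SciPy; Python is used here only for illustration, since the experiments in Section 6 were run in MATLAB) strings the blocks of Figure 2 together for a single frame. It is an illustrative outline, not the authors' implementation: the callables lpc, hfr, and map_codebook are placeholders for the LP analysis, high frequency regeneration, and codebook mapping operations developed in the following sections, and the gain constant of Section 5.2 is omitted.

import numpy as np
from scipy.signal import butter, filtfilt, lfilter, resample_poly

def regenerate_frame(x_n, lpc, hfr, map_codebook, order=8, K=2, fs_wide=16000, fc=3300):
    # LP analysis of the narrowband frame: a_n = [a_1, ..., a_P]
    a_n = np.asarray(lpc(x_n, order))
    e_n = lfilter(np.concatenate(([1.0], -a_n)), [1.0], x_n)    # narrowband residual

    e_w = hfr(e_n, K)          # wideband residual at K times the sample rate (Section 4)
    a_w = np.asarray(map_codebook(a_n))   # wideband LPCs from the dual codebook (Section 5)

    x_syn = lfilter([1.0], np.concatenate(([1.0], -a_w)), e_w)  # LP synthesis

    # keep only the regenerated high band and add it to the interpolated narrowband input
    b, a = butter(6, fc / (fs_wide / 2), btype="high")
    return resample_poly(x_n, K, 1) + filtfilt(b, a, x_syn)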
`
`3. LINEAR PREDICTIVE TECHNIQUES
`
`LP analysis determines the coefficients of a linear prediction
`filter designed to predict each speech sample as a weighted
`sum of previous samples. The prediction filter output can be
`written as
`
\hat{x}(n) = \sum_{k=1}^{P} a_k\,x(n-k), \qquad (2)
`
`where a1, a2, . . . , aP are the LPCs. The prediction error signal
`(also known as a residual or residual error signal) is defined
`as the difference between the actual and predicted signals as
`follows:
`
e(n) = x(n) - \sum_{k=1}^{P} a_k\,x(n-k). \qquad (3)
`
`This expression simply defines a finite impulse response
`
`of the authors’ knowledge, this new metric has not previously
`been applied to the speech regeneration problem. Finally,
`a new method for calculating optimal gain coefficients for
`application to the wideband LPCs is described.
`The organization of the rest of this paper is as follows. In
`Section 2, an overview of the wideband speech regeneration
`system is provided. Since the proposed technique is based
`on LP coding, Section 3 provides a brief review of LP analysis
`and synthesis. Section 4 describes the wideband residual error
`regeneration and Section 5 describes the codebook mapping
`of the LPCs. Experimental results are presented in Section 6,
`and conclusions and areas for future work are presented in
`Section 7.
`
`2. OVERVIEW OF THE WIDEBAND SPEECH
`REGENERATION SYSTEM
`
`The context in which the wideband speech regeneration sys-
`tem may be employed is shown in Figure 1. Here, a nar-
`rowband continuous-time speech signal, xN (t), is formed by
`passing its wideband counterpart, xW (t), through a bandlim-
`ited channel. The channel, with frequency response Hc(f ), is
`assumed to be a continuous-time lowpass filter with cutoff
`frequency fc. This lowpass filter need not be ideal. However,
`if the sampling frequency of the analog-to-digital (A/D) and
`digital-to-analog (D/A) converters is denoted as fs, the fol-
`lowing restriction is assumed:
`
|H_c(f)| \approx 0 \quad \text{for } |f| > \frac{f_s}{2K}, \qquad (1)
`
`where K is a positive integer greater than one. In other words,
`the channel transfer function is such that the digital signal
`entering the wideband speech regeneration system is over-
`sampled by a factor of at least K. Furthermore, this oversam-
`pling will in turn allow the wideband speech regeneration
`system to increase the bandwidth of the received signal by a
`factor of at least K. Examples of system parameters for ap-
`plication to analog telephone speech might be fc = 3.3 kHz,
`fs = 16 kHz, and K = 2.
`In many speech applications, the input speech waveform
`is already filtered and sampled at a lower frequency (e.g.,
`8 kHz typically in many wireline and wireless communication
`systems). In this case, the speech signal can be upsampled
`and filtered to provide a higher frequency sampling rate (e.g.,
`16 kHz as in this work) with no high frequency content above
`fc.
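
For example, bringing speech stored at 8 kHz up to the 16 kHz rate used in this work can be done with a standard polyphase resampler; a minimal sketch (SciPy), in which x8k is assumed to hold the 8 kHz samples:

import numpy as np
from scipy.signal import resample_poly

fs_narrow, fs_wide = 8000, 16000                      # K = 2
x8k = np.random.randn(fs_narrow)                      # stand-in for one second of 8 kHz speech
x16k = resample_poly(x8k, fs_wide // fs_narrow, 1)    # zero-insert, then lowpass-interpolate
# x16k has twice as many samples and (ideally) no energy above the original 3.3-4 kHz band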
`
`The structure of the wideband speech recovery system is
`shown in Figure 2. The system begins by performing a stan-
`dard LP analysis of the downsampled, narrowband speech
`
`
`
`3
`
`ˆeW (n)
`
`G
`
`Techniques for the regeneration of wideband speech from narrowband speech
`
`filter with impulse response
`
a(n) = \delta(n) - \sum_{k=1}^{P} a_k\,\delta(n-k), \qquad (4)

and discrete-time frequency response

A(\omega) = 1 - \sum_{k=1}^{P} a_k e^{-jk\omega}. \qquad (5)

Figure 3: (a) HFR using rectification and filtering (e_N(n) → ↑2 → LPF → abs(·) → spectral flattening filter → gain G → ê_W(n)); (b) HFR using spectral folding (e_N(n) → ↑2 → ê_W(n)).
`
`rectification of the upsampled narrowband residual to gener-
`ate high-frequency spectral content. The signal is then filtered
`with an LP analysis filter to generate a spectrally flat residual.
`An appropriate variable gain factor must also be applied to
`this new wideband residual so that its signal power will not be
`too large or small when compared with the original narrow-
`band version. This approach is illustrated in Figure 3a. The
`main drawback to this method is that the spectral compo-
`nents generated by the rectification (a nonlinear operation)
`are largely unpredictable. As a result, it often generates noisy
`or rough high frequency components, especially when the
`speech is voiced.
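
A sketch of this rectification-based approach (Python/SciPy) is given below; the lpc argument is a caller-supplied routine returning LP coefficients (e.g., via the Levinson-Durbin recursion), and the simple power-matching gain shown is just one possible choice for the variable gain factor.

import numpy as np
from scipy.signal import lfilter, resample_poly

def hfr_rectify(e_n, lpc, order=10, K=2):
    # upsample the narrowband residual, then rectify (a nonlinear operation
    # that creates high-frequency content)
    e_up = resample_poly(e_n, K, 1)
    r = np.abs(e_up)
    # spectral flattening: LP-analyse the rectified signal and filter it with A(z)
    a = np.asarray(lpc(r, order))
    flat = lfilter(np.concatenate(([1.0], -a)), [1.0], r)
    # variable gain: match the power of the upsampled narrowband residual
    g = np.sqrt(np.mean(e_up ** 2) / (np.mean(flat ** 2) + 1e-12))
    return g * flat
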
`The second class of HFR techniques, shown in Figure 3b,
`is termed spectral folding and involves expansion of the nar-
`rowband residual through insertion of zeros between adjacent
`samples. Although simple, this method has several potential
`problems when applied to voiced speech.
`(1) First, it is unlikely that the new high frequency har-
`monics will reside at integer multiples of the voiced speech’s
`fundamental frequency. Often, this does not result in a large
`perceptual effect as long as the low frequency content has the
`harmonics spaced correctly and as long as the energy in the
`low frequency components is significantly greater than that
`in the higher frequencies.
`(2) Second, as the pitch of the narrowband residual moves
`higher or lower in frequency, the high-frequency portions of
`the new wideband residual move in the opposite direction.
`This will be seen later in Section 6. Because of this effect, the
`resultant speech can sound somewhat garbled—especially if
`there are wide variations in fundamental frequency.
`(3) Finally, a greater problem occurs when the cutoff fre-
`quency of the bandlimiting process is lower than half the nar-
`rowband sampling frequency. Although LP analysis tends to
`produce spectrally flat residuals, this is generally not possible
`when a portion of the input spectrum has been eliminated. In
`these regions, the narrowband residual therefore exhibits lit-
`tle spectral energy. When spectral folding is applied to such a
`residual, the resultant wideband speech exhibits a band gap in
`the middle of the spectrum. This will also be seen in Section 6.
`This partial lack of spectral content can degrade perceptual
`speech quality.
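
For reference, spectral folding itself amounts to nothing more than zero insertion; a minimal sketch:

import numpy as np

def hfr_fold(e_n, K=2):
    # inserting K-1 zeros between samples mirrors the narrowband residual
    # spectrum into the upper band (no interpolation filter is applied)
    e_w = np.zeros(len(e_n) * K)
    e_w[::K] = e_n
    return e_w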
`
`This filter is known as the LP analysis filter and is used to
`generate the residual from the original discrete-time signal.
`To perform optimal linear prediction, the LPCs are chosen
`such that the power in the residual is minimized. The LPCs are
`typically also chosen to be gain normalized and such that the
`LP analysis filter is minimum phase. Such LPCs can be found
`efficiently through the application of the Levinson-Durbin
`algorithm. Because the LP analysis filter is minimum phase,
`it has a stable inverse known as the LP synthesis filter. This
`synthesis filter is an all-pole, infinite impulse response filter.
`While the LP analysis filter is used to generate the residual, the
`LP synthesis filter creates the original signal from the residual.
`The system difference equation for the LP synthesis filter can
`be found by rewriting (3) as
`
x(n) = e(n) + \sum_{k=1}^{P} a_k\,x(n-k). \qquad (6)
`
`LP techniques are especially useful in speech processing.
`Because of the error minimization used to find the LPCs, the
`residual error signal tends to be spectrally flat. This means
`that the shape of the speech signal’s spectral envelope is repre-
`sented in the LPCs. The residual then contains the amplitude,
`voicing, and pitch information. The speech signal’s spectral
`envelope can be approximately written in the frequency do-
`main as
`
S(\omega) = \frac{\sigma}{|A(\omega)|}, \qquad (7)
`
`where σ is the square root of the residual power. In speech
`processing systems, LP analysis is typically performed on
`frames of about 10–20 ms, since speech characteristics are
`relatively constant over this time interval.
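
As an illustration of the analysis step, the following sketch computes the LPCs of one windowed frame with a textbook Levinson-Durbin recursion and then forms the residual with the analysis filter of (4); it is a plain implementation for clarity, not the one used in the experiments reported later.

import numpy as np
from scipy.signal import lfilter

def lpc_levinson(x, order):
    # autocorrelation of the (windowed) frame
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order)              # predictor coefficients [a_1, ..., a_P]
    err = r[0] + 1e-12               # prediction-error power, E_0
    for i in range(order):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err    # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[:i] = a[:i] - k * a[:i][::-1]
        a = a_new
        err *= (1.0 - k * k)
    return a, err                    # err approximates the residual power

# usage: a, power = lpc_levinson(frame, 10)
#        e = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)   # residual via A(z)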
`
`4. HIGH FREQUENCY REGENERATION
`
`In this section, methods for regenerating wideband resid-
`ual errors from narrow band errors are described. Previously
`defined techniques are described, motivation for exploring
`alternative methods is presented, and a novel method, re-
`ferred to here as “spectral shifting,” is developed.
`
`4.1. Previous techniques
`
`The HFR methods employed in previous systems can ba-
`sically be divided into two classes. These two approaches
`are illustrated in Figure 3. For simplicity, the discussion is
`restricted to the case where K = 2. The first class of HFR uses
`
`
`
`4
`
`eN (n)
`
`EURASIP Journal on Applied Signal Processing
`
Figure 4: Proposed spectral shifting high frequency regeneration algorithm. (Block diagram: e_N(n) → ↑2 → LPF → multiplier driven by a cosine generator whose frequency is set by a pitch detector → ê_W(n).)
`
`4.2. The spectral shifting method
`
`A new method of HFR is proposed, which is illustrated in
`Figure 4. This method relies on spectral shifting rather than
`on spectral folding. If used to its full extent, it is capable of
`overcoming the three problems associated with the spectral
`folding method listed above.
`The first step in the spectral shifting method involves up-
`sampling the input narrowband residual. The lowpass filter
used has a cutoff frequency of ωc = 2π fc/fs radians/sample
`and a gain of K = 2. A pitch detector then assumes that the
`speech is voiced and finds the fundamental pitch period, Tf ,
`of the resultant signal. If the speech is in fact unvoiced, the
`pitch period computed is of little importance and the output
`of the pitch detector can still be used. The upsampled residual
`is then multiplied (mixed) with a cosine of amplitude 2 at a
`radian frequency ωg, where ωg is a multiple of the funda-
`mental frequency that is close to (but does not exceed) the
`cutoff frequency ωc. One expression that calculates such a
`radian frequency is
`
\omega_g = \frac{2\pi}{T_f}\,\operatorname{floor}\!\left(\frac{T_f\,\omega_c}{2\pi}\right), \qquad (8)
`
`where floor(·) computes the maximum integer less than or
`equal to its argument.
`The multiplication by a cosine results in a shift of the
`original spectrum. If the discrete-time Fourier transform of
`the upsampled and filtered narrowband residual is denoted
`as EN (ω), it is easily shown that the resultant wideband spec-
`trum is given by
`
E_W(\omega) = E_N(\omega - \omega_g) + E_N(\omega + \omega_g) \approx E_N(\omega - \omega_c) + E_N(\omega + \omega_c). \qquad (9)
`
`This expression clearly reveals the reason this method is
`referred to as a “spectral shifting” method. Unlike the spec-
`tral folding method, this method does not preserve the origi-
`nal narrowband residual information. However, this is not
`a problem because a highpass filter will subsequently be
`applied to the output of the LP synthesis filter, eliminating
`the narrowband portion of the regenerated signal anyway.
`Since ωg ≈ ωc, the bandwidth of the wideband residual is
`approximately twice that of the narrowband residual. How-
`ever, if more bandwidth is desired, the narrowband resid-
`ual can be multiplied with cosines at higher multiples of the
`
Figure 5: Simplest form of the spectral shifting method. (Block diagram: e_N(n) is multiplied by 2 cos(πn) = 2, −2, 2, −2, . . . and then upsampled by 2 to give ê_W(n).)
`
`fundamental frequency and the results added with one an-
`other (after appropriate filtering to prevent overlap). This
`will further increase the signal bandwidth.
`In practice, the pitch detection algorithm is applied to
`each frame (10–20 ms) of speech. However, instantaneously
`updating the frequency of the cosine at each frame boundary
`results in undesired frequency components and noisy speech.
`Applying linear interpolation to the values of ωg between
`adjacent frames has been found to sufficiently eliminate this
`effect.
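
A sketch of the method for a single frame is given below (Python/SciPy). The pitch period is assumed to be supplied, in samples at the wideband rate, by some external pitch detector; a complete implementation would additionally interpolate ωg linearly between adjacent frames and keep the cosine phase continuous across frame boundaries, as just described.

import numpy as np
from scipy.signal import firwin, lfilter

def hfr_shift(e_n, pitch_period, fc=3300.0, fs_wide=16000.0, K=2):
    wc = 2.0 * np.pi * fc / fs_wide              # cutoff in radians/sample at the wideband rate
    # upsample by K: zero insertion followed by a lowpass of cutoff wc and gain K
    zero_stuffed = np.kron(e_n, np.concatenate(([1.0], np.zeros(K - 1))))
    e_up = K * lfilter(firwin(101, fc, fs=fs_wide), [1.0], zero_stuffed)
    # modulation frequency: largest harmonic of 2*pi/T_f not exceeding wc, as in (8)
    wg = (2.0 * np.pi / pitch_period) * np.floor(pitch_period * wc / (2.0 * np.pi))
    n = np.arange(len(e_up))
    return e_up * 2.0 * np.cos(wg * n)           # shift the residual spectrum up to about wc
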
`If reduction in computational complexity is desired, cer-
`tain modifications can be made to the spectral shifting
`method. One such modification is to eliminate the pitch
`detector and always use a cosine of frequency ωc. If this is
`done, the spectral shifting method will eliminate only prob-
`lems (2) and (3) described in Section 4.1. However, as noted
`in Section 4.1, problem (1) is generally not significant, and
`thus, this simplification may have little impact on speech
`quality.
`An additional simplification can be made by using a
`cosine at frequency π /2 rather than at ωc (if ωc = π /2,
`system quality is not additionally compromised). In this case,
`no lowpass filter is necessary for interpolation because every
`other cosine value will be zero—eliminating all interpolated
`values. In this case, the spectral shifting method reduces to
`the system shown in Figure 5. Note that this system will solve
`only problem (2) (assuming ωc ≠ π /2). Problem (3) is only
`somewhat alleviated since the size of the band gap in the spec-
`trum will be cut in half. However, the performance is still an
`improvement over the spectral folding method with the only
`additional computational complexity being a sign change of
`every other sample.
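
In code, this simplified form reduces to a sign alternation followed by zero insertion; a minimal sketch for K = 2:

import numpy as np

def hfr_shift_simple(e_n):
    # multiply by 2, -2, 2, -2, ... (i.e., 2cos(pi*n)) and then upsample by 2;
    # no interpolation filter is needed because the interpolated cosine samples are zero
    signs = 2.0 * np.cos(np.pi * np.arange(len(e_n)))
    e_w = np.zeros(2 * len(e_n))
    e_w[::2] = signs * e_n
    return e_w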
`
`5. CODEBOOK MAPPING
`
`In this section, the mapping of the narrowband LPCs to wide-
`band LPCs is addressed. The use of a dual codebook in a
`vector quantization scheme is discussed, and solutions to the
`problem of applying an appropriate gain to the wideband
`LPCs are presented.
`
`5.1. Narrowband to wideband LPC conversion
`
`In a vector quantization system, codebook generation is
`most commonly performed through training using the Linde,
`
`
`
`
`Buzo, Gray (LBG) algorithm [9]. In the case of a narrowband
`to wideband mapping, it is necessary to generate a dual code-
`book. Part of the codebook contains narrowband codewords
`and the other part contains the corresponding wideband
`codewords. These codewords contain representative LPCs in
`some form.
`Generation of the dual codebook requires training data
`sampled at the desired higher rate of fs with the full low
`and high frequency content intact. These data are artificially
`degraded and downsampled to match the bandwidth of the
`actual signals to be processed. The LBG algorithm is then
`applied to the speech frames in the narrowband version of the
`training data. While the LBG algorithm operates on the nar-
`rowband data, each operation is mimicked on the wideband
`version of the training data to form the wideband portion
`of the dual codebook. In this way, the dual codebook will
`contain a set of representative narrowband codewords and
`the corresponding codewords based on the wideband data.
`This dual codebook now contains the a priori information
`needed to allow the wideband speech regeneration algorithm
`to extend the bandwidth of a speech signal.
`During wideband speech regeneration, narrowband
`codewords are computed from the input speech frames and
`the best match in the narrowband portion of the codebook
`is found. The corresponding wideband codeword from the
`wideband portion of the codebook is then used to generate
`the output speech. The assumption underlying the use of the
`dual codebook is that there is correlation between the low
`frequency spectral envelope of a speech frame and its high
`frequency envelope for a given speaker or class of speakers.
`That is, when the algorithm recognizes the narrowband spec-
`tral envelope of a speech frame (provided by the narrowband
`LPCs), the training data will allow us to predict what the
`spectral envelope should be for the full broadband version
`(contained in the wideband LPCs). Performance will clearly
`depend on the level of correlation between the low and high
`frequencies, and on how representative the training data is of
`the actual data. It is worth mentioning that improved perfor-
`mance was reported by Epps and Holmes [8] when separate
`codebooks were used for voiced and unvoiced speech frames.
`However, this method was not employed here.
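
The sketch below illustrates the dual-codebook idea in Python. For brevity it substitutes a plain Euclidean k-means over some LPC-derived feature (for example, line spectral frequencies) for the LBG training with the Itakura-Saito or Gardner-Rao distortion described here; the essential point, mirroring the narrowband clustering decisions on the wideband data, is the same.

import numpy as np

def train_dual_codebook(feat_nb, feat_wb, codebook_size=512, iters=20, seed=0):
    # feat_nb, feat_wb: (num_frames, dim) arrays of narrowband and wideband
    # spectral-envelope features computed from the same training frames
    rng = np.random.default_rng(seed)
    cb_nb = feat_nb[rng.choice(len(feat_nb), codebook_size, replace=False)]
    for _ in range(iters):
        d = ((feat_nb[:, None, :] - cb_nb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)                    # nearest narrowband codeword per frame
        for j in range(codebook_size):
            members = idx == j
            if members.any():
                cb_nb[j] = feat_nb[members].mean(axis=0)
    # mirror the final frame assignments onto the wideband training data
    cb_wb = np.stack([feat_wb[idx == j].mean(axis=0) if (idx == j).any()
                      else np.zeros(feat_wb.shape[1]) for j in range(codebook_size)])
    return cb_nb, cb_wb

def map_codeword(f_nb, cb_nb, cb_wb):
    # at run time: nearest narrowband codeword -> corresponding wideband codeword
    return cb_wb[((cb_nb - f_nb) ** 2).sum(-1).argmin()]
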
`The operation of the LBG algorithm and the codebook
`mapping requires some quantitative measure of closeness for
`sets of LPCs. The only requirement is a distance (or distor-
`tion) measure for which a centroid calculation exists. The
`centroid of a bin or group is defined as the codeword that
`minimizes the total distortion over that bin. The quality of
`the resultant codebook is greatly affected by the correlation
`between the quantitative distance metric and human percep-
`tion of difference in the reconstructed speech frames.
`One commonly used distortion metric is the Itakura-
`Saito measure defined as
`
d_{\mathrm{IS}}\!\left(\frac{\sigma}{A(\omega)}, \frac{\hat{\sigma}}{\hat{A}(\omega)}\right) = \int_{-\pi}^{\pi}\left[\frac{\sigma^2\,|\hat{A}(\omega)|^2}{\hat{\sigma}^2\,|A(\omega)|^2} - \ln\!\left(\frac{\sigma^2\,|\hat{A}(\omega)|^2}{\hat{\sigma}^2\,|A(\omega)|^2}\right) - 1\right]d\omega. \qquad (10)
`
`An efficient method for calculating this exists which uses the
`estimated autocorrelation for each speech frame. The use of
`this distortion measure for codebook generation was first
`described in [10]. Another common metric is log spectral
`distortion (LSD), given by
`
d_{\mathrm{LSD}}\!\left(\frac{\sigma}{A(\omega)}, \frac{\hat{\sigma}}{\hat{A}(\omega)}\right) = \int_{-\pi}^{\pi}\left[\ln\!\left(\frac{\sigma^2}{|A(\omega)|^2}\right) - \ln\!\left(\frac{\hat{\sigma}^2}{|\hat{A}(\omega)|^2}\right)\right]^{2} d\omega. \qquad (11)
`
`With respect to wideband speech regeneration, this distance
`measure has been previously applied by sampling the loga-
`rithm of the Fourier transform of the LPCs [1, 8].
`It has been shown by Gardner and Rao [11] that a dis-
`tortion measure, using a weighted mean squared error of the
`line spectral pair frequencies, is equivalent to LSD and to the
`Itakura-Saito measure for high rate vector quantizers. It is
`thought that this measure may offer the performance of the
`LSD metric in the current application, yet be more computa-
`tionally efficient. The computational savings comes primarily
`from the fact that Fourier transforms and logarithms need not
`be computed.
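
For reference, both distortions (10) and (11) can be evaluated numerically by sampling the two LP spectra on an FFT grid; a minimal sketch follows (the weighted line-spectral-pair distance of Gardner and Rao is not reproduced here).

import numpy as np

def _lp_power_spectrum(sigma, a, nfft):
    # sigma^2 / |A(w)|^2 sampled at nfft uniformly spaced frequencies
    A = np.fft.fft(np.concatenate(([1.0], -np.asarray(a))), nfft)
    return sigma ** 2 / (np.abs(A) ** 2 + 1e-20)

def lsd(sigma, a, sigma_hat, a_hat, nfft=512):
    # log spectral distortion, equation (11), as a Riemann sum over the FFT bins
    S = _lp_power_spectrum(sigma, a, nfft)
    S_hat = _lp_power_spectrum(sigma_hat, a_hat, nfft)
    return np.sum((np.log(S) - np.log(S_hat)) ** 2) * (2.0 * np.pi / nfft)

def itakura_saito(sigma, a, sigma_hat, a_hat, nfft=512):
    # Itakura-Saito distortion, equation (10), on the same grid
    ratio = _lp_power_spectrum(sigma, a, nfft) / _lp_power_spectrum(sigma_hat, a_hat, nfft)
    return np.sum(ratio - np.log(ratio) - 1.0) * (2.0 * np.pi / nfft)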
`
`5.2. Optimal gain constant calculation
`
`In addition to the generation of the wideband LPCs discussed
`above, the gain applied to each wideband LP synthesis filter
`must also be determined such that the new wideband infor-
`mation has the appropriate energy. The optimal gain constant
`is defined here as that which minimizes the distance between
`the reconstructed and original wideband spectral envelopes
`in the narrowband region.
`To derive the optimum gain, we first trace a wideband
`spectral envelope through the system. Represent the spec-
`tral envelope of the original wideband speech signal as
`a real, positive, symmetric function in the frequency do-
`main. After the bandlimiting filter and subsequent downsam-
`pling, the narrowband spectral envelope can be represented
`as
`
S_N(\omega) = S_W\!\left(\frac{\omega}{K}\right)\left|H\!\left(\frac{\omega}{K}\right)\right|, \qquad (12)
`
`where H(ω) is the impulse-invariant discrete-time system
`frequency response used to model the continuous channel,
`Hc(f ). After LP analysis, the spectral envelope of the nar-
`rowband residual is
`
S_E(\omega) = S_N(\omega)\,|A_N(\omega)| = S_W\!\left(\frac{\omega}{K}\right)\left|H\!\left(\frac{\omega}{K}\right)\right| |A_N(\omega)|. \qquad (13)
`
`The high-frequency regeneration technique does not alter the
`basic shape of the residual spectral envelope. However, it does
`create a signal sampled at a higher rate. Thus, the wideband
`residual spectral envelope can be approximated as SE(Kω).
`The wideband LP synthesis filter transforms the wideband
`residual spectral envelope into the reconstructed wideband
`
`
`
`6
`
`EURASIP Journal on Applied Signal Processing
`
`speech signal spectral envelope as
`
\hat{S}_W(\omega) = S_E(K\omega)\,\frac{\sigma}{|\hat{A}_W(\omega)|} = S_W(\omega)\,|H(\omega)|\,|A_N(K\omega)|\,\frac{\sigma}{|\hat{A}_W(\omega)|}. \qquad (14)

Table 1: Parameters used in system testing.

Bandlimiting filter type            High-order Butterworth
Bandlimiting filter cutoff          3.3 kHz
Narrowband sampling rate            8 kHz
Wideband sampling rate              16 kHz
Frame size                          30 ms
Frame rate                          100 Hz
Codebook size                       512
Narrowband LPCs (Itakura-Saito)     7
Wideband LPCs (Itakura-Saito)       15
Narrowband LPCs (Gardner-Rao)       8
Wideband LPCs (Gardner-Rao)         14
`
`Note the presence of the gain constant σ . Using (14), we
`are able to relate the original and reconstructed spectral
`envelopes in the narrowband region.
`One approach to finding the optimal gain constant is to
`store a calculated gain constant for each training frame in
`a given bin [1]. This constant can be computed using the
`relative powers of the narrowband and wideband residuals
`for each frame. At the end of training, a centroid of these
`gains is found for each bin using the gain values for each
`frame, the narrowband training codewords for each frame,
`and the newly computed representative wideband codeword
`for the bin.
`However, it is assumed that the representative narrow-
`band codeword is indeed representative of all the narrowband
`training codewords in that bin. Thus, an alternative approach
`is to use only the representative narrowband codeword, the
`representative wideband codeword, and an estimate of the
`bandlimiting transfer function to compute the optimal gain
`for each bin. This eliminates the need to store the gain con-
`stants during training and also the need to use multiple nar-
`rowband codewords in the optimal gain calculation. With this
`approach, the gain constant can be computed by minimizing
`
`This calculation does not slow system performance because it
`is performed only once—after training and before the system
`actually operates.
`
`6. EXPERIMENTAL RESULTS
`
`Testing of the wideband speech recovery system has been
`performed in MATLAB using speech samples obtained from
`the TIMIT speech corpus.1 For simplicity, codebook training
`was performed using 30 randomly selected female utterances
`from dialect region 3 (North Midland region). These utter-
`ances comprised approximately 90 seconds of speech data.
`Testing of the resultant system was performed using 2 utter-
`ances not in the training data, but from the same gender and
`dialect region.
`The parameters used in both the training and testing
`phases are shown in Table 1. As stated earlier, speech wave-
`form characteristics are relatively stationary over a 10 ms
`interval. Thus, computations are performed for frames taken
`at a 100 Hz rate. The frames, however, are 30 ms soft-
`windowed overlapping frames. This allows for smooth tran-
`sitions between adjacent frames. The numbers of LPCs used
`for the Itakura-Saito distortion measure have been selected
`to make FFT computations more convenient. However, the
`Gardner-Rao method requires that the numbers of LPCs be
`even, explaining why slightly different numbers were used for
`this distortion measure.
`Additionally, it should be noted that frames with energies
`below an empirically determined threshold were considered
`silence, and were thus excluded from training and testing.
`The spectral shifting method was implemented by multiply-
`ing the narrowband residual by two cosine functions: one
`at 3.3 kHz and one at 4.7 kHz. These results were added to-
`gether after appropriate filtering to prevent overlap. Note that
`pitch-detection was not employed in these tests.
`Sample results from these tests are shown in Figure 6.
`Figure 6a shows an original wideband speech signal spectro-
`
1Texas Instruments/Massachusetts Institute of Technology Acoustic-Phonetic Continuous Speech Corpus, October 1990 (www.ntis.gov/fcpc/cpn4129.htm).
`
d\bigl(S_W(\omega), \hat{S}_W(\omega)\bigr) = d\!\left(S_W(\omega),\; S_W(\omega)\,|H(\omega)|\,|\hat{A}_N(K\omega)|\,\frac{\sigma}{|\hat{A}_W(\omega)|}\right), \qquad (15)
`
`with respect to σ over a selected narrowband frequency range,
`where d(·, ·) is the relevant distance metric. If the LSD dis-
`tance measure is being used, the optimal gain constant is given
`by
`
\sigma = \exp\!\left(\frac{1}{4\omega_n}\int_{-\omega_n}^{\omega_n}\Bigl[\ln\bigl(|\hat{A}_W(\omega)|^2\bigr) - \ln\bigl(|H(\omega)|^2\bigr) - \ln\bigl(|\hat{A}_N(K\omega)|^2\bigr)\Bigr]d\omega\right). \qquad (16)
`
`If the Itakura-Saito distance measure is being used, the opti-
`mal gain constant is given by
`
\sigma = \sqrt{\frac{1}{2\omega_n}\int_{-\omega_n}^{\omega_n}\frac{|\hat{A}_W(\omega)|^2}{|H(\omega)|^2\,|\hat{A}_N(K\omega)|^2}\,d\omega}. \qquad (17)
`
In both expressions, ωn is a radian frequency in the range
`(0, π /K). This frequency should be selected to include only
`those portions of the narrowband spectrum in which H(ω) is
`invertible. In the simplest case, where H(ω) is assumed to be
`an ideal lowpass filter, it is most appropriate to use ωn = ωc.
`In practice, the various spectra are generated and numerically
`integrated by summing fast Fourier transform (FFT) results.
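
A sketch of this calculation for the Itakura-Saito case (17) is shown below. For clarity it evaluates A(ω) directly on a uniform frequency grid rather than by FFT, and H_mag is a caller-supplied function returning |H(ω)| for the assumed channel model; for an ideal lowpass channel one would take H_mag(ω) = 1 on (−ωn, ωn) and ωn = ωc.

import numpy as np

def lp_mag(a, w):
    # |A(w)| with A(w) = 1 - sum_k a_k exp(-j k w), evaluated at the frequencies in w
    k = np.arange(1, len(a) + 1)
    return np.abs(1.0 - np.exp(-1j * np.outer(w, k)) @ np.asarray(a))

def is_optimal_gain(a_w_hat, a_n_hat, H_mag, wn, K=2, npts=1024):
    # equation (17): Riemann-sum approximation of the integral over (-wn, wn)
    w = np.linspace(-wn, wn, npts, endpoint=False)
    integrand = lp_mag(a_w_hat, w) ** 2 / (H_mag(w) ** 2 * lp_mag(a_n_hat, K * w) ** 2 + 1e-20)
    return np.sqrt(integrand.mean())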
`
`
`
`
`8000
`
`7000
`
`6000
`
`5000
`
`4000
`
`3000
`
`2000
`
`1000
`
`0
`
`Frequency (Hz)
`
`0
`
`0.2 0.4 0.6 0.8
`
`1
`
`1.2 1.4 1.6 1.8
`
`0
`
`0.2 0.4 0.6 0.8
`
`1
`
`1.2
`
`1.4
`
`1.6 1.8
`
`Time (s)
`
`(a)
`
`Time (s)
`
`(b)
`
`8000
`
`7000
`
`6000
`
`5000
`
`4000
`
`3000
`
`2000
`
`1000
`
`0
`
`Frequency (Hz)
`
`8000
`
`7000
`
`6000
`
`5000
`
`4000
`
`3000
`
`2000
`
`1000
`
`0
`
`Frequency (Hz)
`
`8000
`
`7000
`
`6000
`
`5000
`
`4000
`
`3000
`
`2000
`
`1000
`
`0
`
`Frequency (Hz)
`
`0
`
`0.2 0.4 0.6
`
`0.8
`
`1
`
`1.2 1.4 1.6 1.8
`
`0
`
`0.2 0.4 0.6 0.8
`
`1
`
`1.2
`
`1.4 1.6 1.8
`
`Time (s)
`
`(c)
`
`Time (s)
`
`(d)
`
`Figure 6: Spectrograms for the (a) original signal, (b) bandlimited/narrowband signal, (c) reconstructed signal using spectral folding, and
`(d) reconstructed signal using spectral shifting.
`
`gram, while Figure 6b shows this same signal bandlimited
`to approximately 3.3 kHz. The spectrogram of the recon-
`structed signal using the Itakura-Saito distortion measure and
`spectral folding is shown in Figure 6c. The spectrogram of the
`reconstructed signal using the Itakura-Saito distortion mea-
`sure and spectral shifting is shown in Figure 6d. The result
`using the spectral folding method contains a gap in the mid-
`dle of the frequency spectrum. Also, the pitch contours in the
`high-frequency portion of the spectrum run counter to those
`in the true wideband signal and in the low-frequency por-
`tion. This is especially evident in the 0.5–0.7 sec time range.
`In contrast, the spectral shifting method eliminates both of
`these problems. A comparison of Figures 6a and 6d reveals
`that the high-frequency portion of the original spectrum is
`approximated reasonably well in the high-frequency portion
`of the reconstructed spectrum.
`
`Subjective evaluations have also been obtained using a
`survey, to which 18 participants responded. These partici-
`pants were asked to compare 5 pairs of speech files and select
`one from each pair that they thought had the best overall
`sound quality. The choices and the results are summarized in
Table 2. A p-value is also given to indicate a level of confidence in the results. For example, a p-value of 0.05 indicates that, if listeners actually had no preference between the two files, there would be only a 5% chance of observing a preference at least as strong as the one seen in the survey.
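
The reported p-values appear consistent with a one-sided test of this no-preference null hypothesis using the normal approximation to the binomial distribution; the following sketch, which rests on that assumption about how the values in Table 2 were obtained, reproduces numbers close to those reported.

import math

def preference_p_value(n_respondents, proportion):
    # one-sided p-value for observing at least the given proportion preferring one
    # option when respondents actually have no preference (p = 0.5), using the
    # normal approximation to the binomial distribution
    z = (proportion - 0.5) / math.sqrt(0.25 / n_respondents)
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # 1 - Phi(z)

# preference_p_value(18, 0.61) is roughly 0.17; preference_p_value(18, 0.78) is roughly 0.009
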
`It can be concluded from the survey results that it is
`indeed possible to enhance the perceived quality of speech
`signals. This is seen most clearly in Choice 3. It is believed that
`artifacts created by the reconstruction process—especially
`around unvoiced consonants—adversely affected the results
`for Choice 2. Techniques for eliminating these artifacts need
`
`
`
`8
`
`EURASIP Journal on Applied Signal Processing
`
Table 2: Survey results for the A/B testing.

Choice 1: (A) Bandlimited signal vs. (B) Original wideband signal.
    Preferred: B; Proportion: 100%; p-value: ∼ 0
Choice 2: (A) Bandlimited signal vs. (B) Rebuilt signal (Itakura-Saito and spectral shifting).
    Preferred: B; Proportion: 61%; p-value: 0.173
Choice 3: (A) Rebuilt signal (Gardner-Rao and spectral shifting) vs. (B) Bandlimited signal.
    Preferred: A; Proportion: 78%; p-value: 0.009
Choice 4: (A) Rebuilt signal (Itakura-Saito and spectral folding) vs. (B) Rebuilt signal (It