IPR2017-01075
`Saint Lawrence Communications
`Exhibit 2002
`
`EURASIP Journal on Applied Signal Processing 2001:0, 1–9
`© 2001 Hindawi Publishing Corporation
`
`Techniques for the Regeneration of Wideband Speech
`from Narrowband Speech
`
`Jason A. Fuemmeler
`
`University of Dayton, Department of Electrical and Computer Engineering, 300 College Park, Dayton, OH 45469-0226, USA
`Email: fuemmeja@hotmail.com
`
`Russell C. Hardie
`
`University of Dayton, Department of Electrical and Computer Engineering, 300 College Park, Dayton, OH 45469-0226, USA
`Email: rhardie@udayton.edu
`
`William R. Gardner
`
`Qualcomm, Inc., 5775 Morehouse Drive, San Diego, CA 92121, USA
`Email: wgardner@qualcomm.com
`
`Received 31 July 2001 and in revised form 30 September 2001
`
`This paper addresses the problem of reconstructing wideband speech signals from observed narrowband speech signals. The goal
`of this work is to improve the perceived quality of speech signals which have been transmitted through narrowband channels
`or degraded during acquisition. We describe a system, based on linear predictive coding, for estimating wideband speech from
`narrowband. This system employs both previously identified and novel techniques. Experimental results are provided in order to
`illustrate the system’s ability to improve speech quality. Both objective and subjective criteria are used to evaluate the quality of the
`processed speech signals.
`
`Keywords and phrases: wideband speech regeneration, narrowband speech, linear predictive coding, speech processing, speech
`coding.
`
1. INTRODUCTION
`
`In voice communications, the quality of received speech sig-
`nals is highly dependent on the received signal bandwidth. If
`the communications channel restricts the bandwidth of the
`received signal, the perceived quality of the output speech
`is degraded. In many voice transmission or storage applica-
`tions the high-frequency portions of the input speech signal
`are eliminated due to the physical properties of the channel
`or to reduce the bandwidth required. The resulting lowpass
`speech often sounds muffled or “far away” compared to the
`original, due to the lack of high frequency content.
One way to compensate for these effects is to efficiently encode the speech at the transmitter so that less channel bandwidth is required to transmit the same amount of information. Of course, this requires that the receiver have an appropriate decoder to recover the original signal. Because of this burden on both the receiver and transmitter, wideband vocoding techniques are difficult to apply to systems that have already been standardized (e.g., analog telephone communications). It would be more convenient to devise a system at the receiver that could regenerate the lost high-frequency content.
`Some work has already been done in the area of wideband
`speech regeneration [1, 2, 3, 4, 5, 6, 7, 8]. These works have
`primarily used linear predictive (LP) techniques. By using
`these techniques, the reconstruction problem is divided into
`two separate tasks. The first task is forming a wideband resid-
`ual error signal, while the second is recreating a set of wide-
`band linear predictive coefficients (LPCs). Once these two
`components have been generated, the wideband residual is
`fed into the wideband LP synthesis filter resulting in a regen-
`erated wideband speech signal.
`This paper provides a brief review of LP techniques used
`in wideband speech regeneration and proposes a new LP tech-
`nique with several novel aspects. In particular, a new method
`for the creation of a wideband residual is proposed which
`overcomes difficulties encountered using many methods pre-
`viously employed. Furthermore, a relatively new distortion
`measure is investigated for use in codebook generation and
in mapping narrowband LPCs to wideband LPCs. To the best of the authors' knowledge, this new metric has not previously been applied to the speech regeneration problem. Finally, a new method for calculating optimal gain coefficients for application to the wideband LPCs is described.

The organization of the rest of this paper is as follows. In Section 2, an overview of the wideband speech regeneration system is provided. Since the proposed technique is based on LP coding, Section 3 provides a brief review of LP analysis and synthesis. Section 4 describes the wideband residual error regeneration and Section 5 describes the codebook mapping of the LPCs. Experimental results are presented in Section 6, and conclusions and areas for future work are presented in Section 7.

IPR2016-00704
SAINT LAWRENCE COMMUNICATIONS LLC
Exhibit 2002

2. OVERVIEW OF THE WIDEBAND SPEECH REGENERATION SYSTEM

The context in which the wideband speech regeneration system may be employed is shown in Figure 1. Here, a narrowband continuous-time speech signal, $x_N(t)$, is formed by passing its wideband counterpart, $x_W(t)$, through a bandlimited channel. The channel, with frequency response $H_c(f)$, is assumed to be a continuous-time lowpass filter with cutoff frequency $f_c$. This lowpass filter need not be ideal. However, if the sampling frequency of the analog-to-digital (A/D) and digital-to-analog (D/A) converters is denoted as $f_s$, the following restriction is assumed:

$$|H_c(f)| \approx 0 \quad \text{for } |f| > \frac{f_s}{2K}, \tag{1}$$

where K is a positive integer greater than one. In other words, the channel transfer function is such that the digital signal entering the wideband speech regeneration system is oversampled by a factor of at least K. Furthermore, this oversampling will in turn allow the wideband speech regeneration system to increase the bandwidth of the received signal by a factor of at least K. Examples of system parameters for application to analog telephone speech might be $f_c$ = 3.3 kHz, $f_s$ = 16 kHz, and K = 2.

In many speech applications, the input speech waveform is already filtered and sampled at a lower frequency (e.g., 8 kHz, which is typical of many wireline and wireless communication systems). In this case, the speech signal can be upsampled and filtered to provide a higher sampling rate (e.g., 16 kHz, as in this work) with no high-frequency content above $f_c$.

The structure of the wideband speech recovery system is shown in Figure 2. The system begins by performing a standard LP analysis of the downsampled, narrowband speech signal. For each speech frame, this produces a narrowband residual or error signal, $e_N(n)$, and a set of narrowband LPCs, denoted $\mathbf{a}_N = [a_1, a_2, \ldots, a_P]^T$. To regenerate the wideband speech, both of these components must be converted into their wideband counterparts.

The process for creating a wideband residual, $\hat e_W(n)$, from a narrowband residual is referred to herein as high frequency regeneration (HFR). The process for generating wideband LPCs, $\hat{\mathbf{a}}_W$, from the narrowband LPCs is referred to as codebook mapping. Having regenerated these two critical speech components, it is then possible to construct wideband speech through LP synthesis (the inverse process of LP analysis).

It is important to realize that, although no upsampling is explicitly shown in Figure 2, the HFR block outputs a signal at K times the sampling rate of its input. Finally, the regenerated wideband speech is passed through a highpass filter and added to the input narrowband speech signal to create the final speech waveform. This is done to preserve the accurate low-frequency information contained in the original input speech signal. Thus, the proposed processing will not alter the low-frequency content of the input signal but will simply add high-frequency components. The following three sections describe in detail the LP, HFR, and codebook mapping blocks, respectively.

[Figure 1 block diagram: $x_W(t)$ → channel $H_c(f)$ → $x_N(t)$ → A/D → $x_N(n)$ → wideband speech regeneration → $\hat x_W(n)$ → D/A → $\hat x_W(t)$.]
Figure 1: Context of the wideband speech regeneration system.

[Figure 2 block diagram: $x_N(n)$ → ↓K → LP analysis, giving $e_N(n)$ and $\mathbf{a}_N$; $e_N(n)$ → wideband residual regeneration → $\hat e_W(n)$; $\mathbf{a}_N$ → VQ codebook mapping → $\hat{\mathbf{a}}_W$; LP synthesis → HPF → summed with $x_N(n)$ → $\hat x_W(n)$.]
Figure 2: The wideband speech regeneration system.

3. LINEAR PREDICTIVE TECHNIQUES

LP analysis determines the coefficients of a linear prediction filter designed to predict each speech sample as a weighted sum of previous samples. The prediction filter output can be written as

$$\hat x(n) = \sum_{k=1}^{P} a_k\,x(n-k), \tag{2}$$

where $a_1, a_2, \ldots, a_P$ are the LPCs. The prediction error signal (also known as a residual or residual error signal) is defined as the difference between the actual and predicted signals as follows:

$$e(n) = x(n) - \sum_{k=1}^{P} a_k\,x(n-k). \tag{3}$$

This expression simply defines a finite impulse response
filter with impulse response

$$a(n) = \delta(n) - \sum_{k=1}^{P} a_k\,\delta(n-k), \tag{4}$$

and discrete-time frequency response

$$A(\omega) = 1 - \sum_{k=1}^{P} a_k\,e^{-jk\omega}. \tag{5}$$
This filter is known as the LP analysis filter and is used to generate the residual from the original discrete-time signal. To perform optimal linear prediction, the LPCs are chosen such that the power in the residual is minimized. The LPCs are typically also chosen to be gain normalized and such that the LP analysis filter is minimum phase. Such LPCs can be found efficiently with the Levinson-Durbin algorithm. Because the LP analysis filter is minimum phase, it has a stable inverse known as the LP synthesis filter. This synthesis filter is an all-pole, infinite impulse response filter. While the LP analysis filter is used to generate the residual, the LP synthesis filter recreates the original signal from the residual. The system difference equation for the LP synthesis filter can be found by rewriting (3) as

$$x(n) = e(n) + \sum_{k=1}^{P} a_k\,x(n-k). \tag{6}$$

LP techniques are especially useful in speech processing. Because of the error minimization used to find the LPCs, the residual error signal tends to be spectrally flat. This means that the shape of the speech signal's spectral envelope is represented in the LPCs, while the residual contains the amplitude, voicing, and pitch information. The speech signal's spectral envelope can be approximately written in the frequency domain as

$$S(\omega) = \frac{\sigma}{|A(\omega)|}, \tag{7}$$

where σ is the square root of the residual power. In speech processing systems, LP analysis is typically performed on frames of about 10–20 ms, since speech characteristics are relatively constant over this time interval.

4. HIGH FREQUENCY REGENERATION

In this section, methods for regenerating wideband residual errors from narrowband errors are described. Previously defined techniques are reviewed, motivation for exploring alternative methods is presented, and a novel method, referred to here as “spectral shifting,” is developed.

4.1. Previous techniques

The HFR methods employed in previous systems can be divided into two classes, illustrated in Figure 3. For simplicity, the discussion is restricted to the case where K = 2. The first class of HFR uses rectification of the upsampled narrowband residual to generate high-frequency spectral content. The signal is then filtered with an LP analysis filter to generate a spectrally flat residual. An appropriate variable gain factor must also be applied to this new wideband residual so that its signal power is neither too large nor too small compared with the original narrowband version. This approach is illustrated in Figure 3a. The main drawback to this method is that the spectral components generated by the rectification (a nonlinear operation) are largely unpredictable. As a result, it often generates noisy or rough high-frequency components, especially when the speech is voiced.

The second class of HFR techniques, shown in Figure 3b, is termed spectral folding and involves expansion of the narrowband residual through insertion of zeros between adjacent samples. Although simple, this method has several potential problems when applied to voiced speech.

(1) First, it is unlikely that the new high-frequency harmonics will reside at integer multiples of the voiced speech's fundamental frequency. Often, this does not result in a large perceptual effect as long as the low-frequency harmonics are spaced correctly and the energy in the low-frequency components is significantly greater than that in the higher frequencies.

(2) Second, as the pitch of the narrowband residual moves higher or lower in frequency, the high-frequency portions of the new wideband residual move in the opposite direction. This will be seen later in Section 6. Because of this effect, the resultant speech can sound somewhat garbled, especially if there are wide variations in fundamental frequency.

(3) Finally, a greater problem occurs when the cutoff frequency of the bandlimiting process is lower than half the narrowband sampling frequency. Although LP analysis tends to produce spectrally flat residuals, this is generally not possible when a portion of the input spectrum has been eliminated. In these regions, the narrowband residual therefore exhibits little spectral energy. When spectral folding is applied to such a residual, the resultant wideband speech exhibits a band gap in the middle of the spectrum. This will also be seen in Section 6. This partial lack of spectral content can degrade perceptual speech quality.

[Figure 3 block diagrams: (a) $e_N(n)$ → ↑2 → LPF → abs(·) → spectral flattening filter → gain G → $\hat e_W(n)$; (b) $e_N(n)$ → ↑2 → $\hat e_W(n)$.]
Figure 3: (a) HFR using rectification and filtering, (b) HFR using spectral folding.
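Both HFR classes above start from the narrowband residual $e_N(n)$ produced by the LP analysis of Section 3. The sketch below is a minimal, unoptimized illustration in plain Python, assuming the autocorrelation method on an unwindowed frame with zero initial filter state; the helper names are hypothetical and gain normalization is omitted. It computes the LPCs with the Levinson-Durbin recursion, forms the residual with the analysis filter (3), inverts it with the synthesis filter (6), and performs the zero insertion of spectral folding (Figure 3b, K = 2).

```python
import math

# Sketch only: autocorrelation method, no window, zero initial filter state.

def autocorr(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of one frame."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for LPCs a_1..a_P of predictor (2); returns (a, final error power)."""
    a = [0.0] * order          # a[k-1] holds a_k
    err = r[0]
    for i in range(order):
        # reflection coefficient for the order-(i+1) predictor
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k
    return a, err

def lp_residual(x, a):
    """Analysis filter (3): e(n) = x(n) - sum_k a_k x(n-k)."""
    p = len(a)
    return [x[n] - sum(a[k - 1] * x[n - k] for k in range(1, p + 1) if n - k >= 0)
            for n in range(len(x))]

def lp_synthesis(e, a):
    """Synthesis filter (6): x(n) = e(n) + sum_k a_k x(n-k)."""
    p, x = len(a), []
    for n in range(len(e)):
        x.append(e[n] + sum(a[k - 1] * x[n - k] for k in range(1, p + 1) if n - k >= 0))
    return x

def spectral_fold(e_n):
    """Spectral folding (Figure 3b): insert one zero after every sample (K = 2)."""
    out = []
    for v in e_n:
        out.extend([v, 0.0])
    return out

# Example: a synthetic 64-sample frame (two sinusoids) and an order-4 LP model.
frame = [math.sin(0.3 * n) + 0.1 * math.cos(1.7 * n) for n in range(64)]
r = autocorr(frame, 4)
lpcs, err_power = levinson_durbin(r, 4)
residual = lp_residual(frame, lpcs)
rebuilt = lp_synthesis(residual, lpcs)
```

Because the analysis and synthesis filters are exact inverses, `rebuilt` matches `frame` to rounding error, and the residual energy is bounded by the final Levinson error power.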
[Figure 4 block diagram: $e_N(n)$ → ↑2 → LPF → multiplier (×) → $\hat e_W(n)$; a pitch detector operating on the filtered signal drives a cosine generator that supplies the multiplier.]
Figure 4: Proposed spectral shifting high frequency regeneration algorithm.

[Figure 5 block diagram: $e_N(n)$ → multiplier (×) with $2\cos(\pi n) = 2, -2, 2, -2, \ldots$ → ↑2 → $\hat e_W(n)$.]
`
`4.2. The spectral shifting method
`
`A new method of HFR is proposed, which is illustrated in
`Figure 4. This method relies on spectral shifting rather than
`on spectral folding. If used to its full extent, it is capable of
`overcoming the three problems associated with the spectral
`folding method listed above.
`The first step in the spectral shifting method involves up-
`sampling the input narrowband residual. The lowpass filter
`used has cutoff frequency of ωc = 2π fc/fs radians/sample
`and a gain of K = 2. A pitch detector then assumes that the
`speech is voiced and finds the fundamental pitch period, Tf ,
`of the resultant signal. If the speech is in fact unvoiced, the
`pitch period computed is of little importance and the output
`of the pitch detector can still be used. The upsampled residual
`is then multiplied (mixed) with a cosine of amplitude 2 at a
`radian frequency ωg, where ωg is a multiple of the funda-
`mental frequency that is close to (but does not exceed) the
`cutoff frequency ωc. One expression that calculates such a
`radian frequency is
`
$$\omega_g = \frac{2\pi}{T_f}\,\operatorname{floor}\!\left(\frac{T_f\,\omega_c}{2\pi}\right), \tag{8}$$
`
`where floor(·) computes the maximum integer less than or
`equal to its argument.
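Equation (8) is a one-line computation. A sketch, assuming the pitch period $T_f$ is measured in samples at the wideband rate (it may be fractional) and $\omega_c = 2\pi f_c/f_s$; the function name is hypothetical.

```python
import math

def mixing_frequency(t_f, f_c, f_s):
    """Eq. (8): largest multiple of the fundamental 2*pi/T_f not exceeding omega_c.

    t_f : pitch period in samples at the wideband rate f_s (assumed units)
    f_c : cutoff frequency of the bandlimiting channel in Hz
    Returns omega_g in radians/sample.
    """
    omega_c = 2.0 * math.pi * f_c / f_s
    return (2.0 * math.pi / t_f) * math.floor(t_f * omega_c / (2.0 * math.pi))

omega_g = mixing_frequency(80, 3300.0, 16000.0)
omega_g_hz = omega_g * 16000.0 / (2.0 * math.pi)
```

With $T_f$ = 80 samples at 16 kHz (a 200 Hz pitch) and $f_c$ = 3.3 kHz, $T_f\omega_c/2\pi = 16.5$, so the mixing frequency is the 16th harmonic, 3.2 kHz.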
`The multiplication by a cosine results in a shift of the
`original spectrum. If the discrete-time Fourier transform of
`the upsampled and filtered narrowband residual is denoted
`as EN (ω), it is easily shown that the resultant wideband spec-
`trum is given by
`
$$E_W(\omega) = E_N(\omega - \omega_g) + E_N(\omega + \omega_g) \approx E_N(\omega - \omega_c) + E_N(\omega + \omega_c). \tag{9}$$
`
`This expression clearly reveals the reason this method is
`referred to as a “spectral shifting” method. Unlike the spec-
`tral folding method, this method does not preserve the origi-
`nal narrowband residual information. However, this is not
`a problem because a highpass filter will subsequently be
`applied to the output of the LP synthesis filter, eliminating
`the narrowband portion of the regenerated signal anyway.
Since ωg ≈ ωc, the bandwidth of the wideband residual is approximately twice that of the narrowband residual. However, if more bandwidth is desired, the narrowband residual can be multiplied with cosines at higher multiples of the fundamental frequency and the results added together (after appropriate filtering to prevent overlap). This will further increase the signal bandwidth.

Figure 5: Simplest form of the spectral shifting method.
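The shift in (9) can be checked numerically. In this sketch the input is assumed to be already upsampled and lowpass filtered, and a single tone stands in for the residual; mixing with $2\cos(\omega_g n)$ should move its energy from bin 10 to bins 64 ± 10 of a 256-point DFT. The names and test signal are illustrative only, and the multi-cosine extension described above is not shown.

```python
import cmath
import math

def spectral_shift(e_up, omega_g):
    """Mix an (assumed already upsampled and lowpass-filtered) residual with 2*cos(omega_g*n).

    Per (9), a component at omega_0 reappears at omega_g - omega_0 and omega_g + omega_0.
    """
    return [2.0 * math.cos(omega_g * n) * v for n, v in enumerate(e_up)]

def dft_mag(x, k):
    """|X[k]| of a length-N DFT, evaluated directly for a single bin."""
    n_len = len(x)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * n / n_len) for n, v in enumerate(x)))

N = 256
tone = [math.cos(2 * math.pi * 10 * n / N) for n in range(N)]   # stand-in residual at bin 10
mixed = spectral_shift(tone, 2 * math.pi * 64 / N)              # omega_g at bin 64
```

Each shifted copy of a unit cosine contributes a DFT magnitude of N/2 = 128 at bins 54 and 74, while bin 10 is left empty, which is why the original narrowband content must be restored separately (here, by the highpass filter and addition in Figure 2).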
`In practice, the pitch detection algorithm is applied to
`each frame (10–20 ms) of speech. However, instantaneously
`updating the frequency of the cosine at each frame boundary
`results in undesired frequency components and noisy speech.
`Applying linear interpolation to the values of ωg between
`adjacent frames has been found to sufficiently eliminate this
`effect.
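The text does not say how the interpolated $\omega_g$ is turned into a cosine. One plausible approach, assumed here, is to interpolate $\omega_g$ linearly across each frame and accumulate phase sample by sample, so the cosine never jumps in phase at a frame boundary. The function below is a hypothetical sketch of that idea.

```python
import math

def interpolated_cosine(omega_frames, frame_len):
    """Hypothetical sketch: 2*cos(phase(n)) with omega_g interpolated linearly between frames.

    omega_frames : mixing frequency omega_g (rad/sample) estimated for each frame
    frame_len    : samples per frame at the wideband rate (e.g. 160 for 10 ms at 16 kHz)
    Phase is accumulated sample by sample (an assumption, not stated in the paper),
    so the cosine stays continuous when omega_g changes.
    """
    out, phase = [], 0.0
    for i, w0 in enumerate(omega_frames):
        w1 = omega_frames[i + 1] if i + 1 < len(omega_frames) else w0
        for m in range(frame_len):
            w = w0 + (w1 - w0) * m / frame_len   # linear interpolation of omega_g
            out.append(2.0 * math.cos(phase))
            phase += w
    return out

steady = interpolated_cosine([0.5, 0.5], 10)
gliding = interpolated_cosine([0.2, 0.4], 160)
```

With a constant $\omega_g$ the output reduces to $2\cos(\omega_g n)$; when $\omega_g$ changes, the instantaneous frequency moves gradually rather than switching at the boundary.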
`If reduction in computational complexity is desired, cer-
`tain modifications can be made to the spectral shifting
`method. One such modification is to eliminate the pitch
`detector and always use a cosine of frequency ωc. If this is
`done, the spectral shifting method will eliminate only prob-
`lems (2) and (3) described in Section 4.1. However, as noted
`in Section 4.1, problem (1) is generally not significant, and
`thus, this simplification may have little impact on speech
`quality.
`An additional simplification can be made by using a
`cosine at frequency π /2 rather than at ωc (if ωc = π /2,
`system quality is not additionally compromised). In this case,
`no lowpass filter is necessary for interpolation because every
`other cosine value will be zero—eliminating all interpolated
`values. In this case, the spectral shifting method reduces to
`the system shown in Figure 5. Note that this system will solve
`only problem (2) (assuming ωc ≠ π /2). Problem (3) is only
`somewhat alleviated since the size of the band gap in the spec-
`trum will be cut in half. However, the performance is still an
`improvement over the spectral folding method with the only
`additional computational complexity being a sign change of
`every other sample.
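The claim that the simplest form reduces to a sign change of every other sample can be verified directly. This sketch compares the Figure 5 path (multiply $e_N(n)$ by 2, −2, 2, −2, …, then insert zeros) with upsampling followed by mixing with $2\cos(\pi n/2)$ at the wideband rate; both omit the lowpass filter, as the text allows. Function names are hypothetical.

```python
import math

def simplest_shift(e_n):
    """Figure 5 path: scale by 2*cos(pi*m) = 2, -2, 2, ... at the narrowband rate, then insert zeros."""
    out = []
    for m, v in enumerate(e_n):
        out.extend([2.0 * v if m % 2 == 0 else -2.0 * v, 0.0])
    return out

def upsample_then_mix(e_n):
    """Reference path: zero insertion (K = 2), then multiply by 2*cos(pi*n/2) at the wideband rate."""
    up = []
    for v in e_n:
        up.extend([v, 0.0])
    return [2.0 * math.cos(math.pi * n / 2.0) * v for n, v in enumerate(up)]

e_example = [0.3, -1.2, 0.7, 2.0, -0.5]
fast = simplest_shift(e_example)
reference = upsample_then_mix(e_example)
```

The odd-indexed cosine samples are zero and coincide with the inserted zeros, so no interpolation filter is needed and the only work is the sign flip.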
`
`5. CODEBOOK MAPPING
`
`In this section, the mapping of the narrowband LPCs to wide-
`band LPCs is addressed. The use of a dual codebook in a
`vector quantization scheme is discussed, and solutions to the
`problem of applying an appropriate gain to the wideband
`LPCs are presented.
`
`5.1. Narrowband to wideband LPC conversion
`
`In a vector quantization system, codebook generation is
`most commonly performed through training using the Linde,
`Buzo, Gray (LBG) algorithm [9]. In the case of a narrowband
`to wideband mapping, it is necessary to generate a dual code-
`book. Part of the codebook contains narrowband codewords
`and the other part contains the corresponding wideband
`codewords. These codewords contain representative LPCs in
`some form.
`Generation of the dual codebook requires training data
`sampled at the desired higher rate of fs with the full low
`and high frequency content intact. These data are artificially
`degraded and downsampled to match the bandwidth of the
`actual signals to be processed. The LBG algorithm is then
`applied to the speech frames in the narrowband version of the
`training data. While the LBG algorithm operates on the nar-
`rowband data, each operation is mimicked on the wideband
`version of the training data to form the wideband portion
`of the dual codebook. In this way, the dual codebook will
`contain a set of representative narrowband codewords and
`the corresponding codewords based on the wideband data.
`This dual codebook now contains the a priori information
`needed to allow the wideband speech regeneration algorithm
`to extend the bandwidth of a speech signal.
`During wideband speech regeneration, narrowband
`codewords are computed from the input speech frames and
`the best match in the narrowband portion of the codebook
`is found. The corresponding wideband codeword from the
`wideband portion of the codebook is then used to generate
`the output speech. The assumption underlying the use of the
`dual codebook is that there is correlation between the low
`frequency spectral envelope of a speech frame and its high
`frequency envelope for a given speaker or class of speakers.
`That is, when the algorithm recognizes the narrowband spec-
`tral envelope of a speech frame (provided by the narrowband
`LPCs), the training data will allow us to predict what the
`spectral envelope should be for the full broadband version
`(contained in the wideband LPCs). Performance will clearly
`depend on the level of correlation between the low and high
`frequencies, and on how representative the training data is of
`the actual data. It is worth mentioning that improved perfor-
`mance was reported by Epps and Holmes [8] when separate
`codebooks were used for voiced and unvoiced speech frames.
`However, this method was not employed here.
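The dual-codebook training and lookup described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: squared Euclidean distance on generic feature vectors stands in for the Itakura-Saito, LSD, or Gardner-Rao measures (each of which has its own centroid rule), the codebook grows by LBG-style splitting with a fixed number of refinement passes, and all names are hypothetical. The key point is that bins are formed from the narrowband vectors only, while each wideband codeword is the centroid of the wideband vectors of the same training frames.

```python
def _dist(u, v):
    """Squared Euclidean distance: a stand-in for the IS, LSD, or Gardner-Rao measure."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def _mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def _nearest(v, book):
    return min(range(len(book)), key=lambda k: _dist(v, book[k]))

def train_dual_codebook(nb, wb, size, iters=20, eps=1e-3):
    """LBG-style training on narrowband vectors, mimicked on the paired wideband vectors.

    nb[i], wb[i] : narrowband / wideband features of the same training frame
    size         : target codebook size (a power of two; splitting doubles the size)
    """
    nb_book, wb_book = [_mean(nb)], [_mean(wb)]
    while len(nb_book) < size:
        # split each narrowband codeword into two perturbed copies; duplicate wideband partners
        nb_book = [[c + s for c in cw] for cw in nb_book for s in (eps, -eps)]
        wb_book = [cw for cw in wb_book for _ in (0, 1)]
        for _ in range(iters):
            bins = [[] for _ in nb_book]
            for i, v in enumerate(nb):
                bins[_nearest(v, nb_book)].append(i)   # bins use narrowband data only
            for j, members in enumerate(bins):
                if members:                            # empty bins keep their old codewords
                    nb_book[j] = _mean([nb[i] for i in members])
                    wb_book[j] = _mean([wb[i] for i in members])  # mimicked on wideband data
    return nb_book, wb_book

def map_to_wideband(frame_nb, nb_book, wb_book):
    """Codebook mapping: nearest narrowband codeword, then return its wideband partner."""
    return wb_book[_nearest(frame_nb, nb_book)]

nb_train = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
wb_train = [[1.0], [1.0], [9.0], [9.0]]
nb_cb, wb_cb = train_dual_codebook(nb_train, wb_train, size=2)
```

In this toy example the two narrowband clusters map to wideband codewords [1.0] and [9.0], mirroring the assumed correlation between low- and high-band envelopes.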
`The operation of the LBG algorithm and the codebook
`mapping requires some quantitative measure of closeness for
`sets of LPCs. The only requirement is a distance (or distor-
`tion) measure for which a centroid calculation exists. The
`centroid of a bin or group is defined as the codeword that
`minimizes the total distortion over that bin. The quality of
`the resultant codebook is greatly affected by the correlation
`between the quantitative distance metric and human percep-
`tion of difference in the reconstructed speech frames.
`One commonly used distortion metric is the Itakura-
`Saito measure defined as
`
$$d_{IS}\!\left(\frac{\sigma}{A(\omega)}, \frac{\hat\sigma}{\hat A(\omega)}\right) = \int_{-\pi}^{\pi}\left(\frac{\sigma^2\,|\hat A(\omega)|^2}{\hat\sigma^2\,|A(\omega)|^2} - \ln\frac{\sigma^2\,|\hat A(\omega)|^2}{\hat\sigma^2\,|A(\omega)|^2} - 1\right)d\omega. \tag{10}$$
`
`An efficient method for calculating this exists which uses the
`estimated autocorrelation for each speech frame. The use of
`this distortion measure for codebook generation was first
`described in [10]. Another common metric is log spectral
`distortion (LSD), given by
`
$$d_{LSD}\!\left(\frac{\sigma}{A(\omega)}, \frac{\hat\sigma}{\hat A(\omega)}\right) = \int_{-\pi}^{\pi}\left[\ln\frac{\sigma^2}{|A(\omega)|^2} - \ln\frac{\hat\sigma^2}{|\hat A(\omega)|^2}\right]^2 d\omega. \tag{11}$$
`
`With respect to wideband speech regeneration, this distance
`measure has been previously applied by sampling the loga-
`rithm of the Fourier transform of the LPCs [1, 8].
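Measures (10) and (11) can both be approximated by sampling $|A(\omega)|^2$ from (5) on a uniform frequency grid and replacing each integral with a Riemann sum. This brute-force sketch (grid size and function names are assumptions) does not use the autocorrelation shortcut for the Itakura-Saito measure mentioned above; it is meant only to make the definitions concrete.

```python
import cmath
import math

def inv_filter_power(a, omegas):
    """|A(w)|^2 from (5), A(w) = 1 - sum_k a_k e^{-jkw}, for each w in omegas."""
    out = []
    for w in omegas:
        A = 1.0 - sum(ak * cmath.exp(-1j * k * w) for k, ak in enumerate(a, start=1))
        out.append(abs(A) ** 2)
    return out

def _grid(n_points):
    """Uniform grid on [-pi, pi); each integral becomes a sum times the step size."""
    step = 2.0 * math.pi / n_points
    return [-math.pi + i * step for i in range(n_points)], step

def d_itakura_saito(sigma, a, sigma_hat, a_hat, n_points=512):
    """Riemann-sum approximation of (10)."""
    omegas, step = _grid(n_points)
    total = 0.0
    for p, ph in zip(inv_filter_power(a, omegas), inv_filter_power(a_hat, omegas)):
        ratio = (sigma ** 2 * ph) / (sigma_hat ** 2 * p)
        total += ratio - math.log(ratio) - 1.0
    return total * step

def d_lsd(sigma, a, sigma_hat, a_hat, n_points=512):
    """Riemann-sum approximation of (11)."""
    omegas, step = _grid(n_points)
    total = sum((math.log(sigma ** 2 / p) - math.log(sigma_hat ** 2 / ph)) ** 2
                for p, ph in zip(inv_filter_power(a, omegas), inv_filter_power(a_hat, omegas)))
    return total * step
```

For a pure gain mismatch ($A = \hat A$, $\sigma = 2\hat\sigma$) the two measures reduce to $2\pi(3 - \ln 4)$ and $2\pi(\ln 4)^2$, which makes a convenient check.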
`It has been shown by Gardner and Rao [11] that a dis-
`tortion measure, using a weighted mean squared error of the
`line spectral pair frequencies, is equivalent to LSD and to the
`Itakura-Saito measure for high rate vector quantizers. It is
`thought that this measure may offer the performance of the
LSD metric in the current application, yet be more computationally efficient. The computational savings come primarily from the fact that Fourier transforms and logarithms need not be computed.
`
`5.2. Optimal gain constant calculation
`
`In addition to the generation of the wideband LPCs discussed
`above, the gain applied to each wideband LP synthesis filter
`must also be determined such that the new wideband infor-
`mation has the appropriate energy. The optimal gain constant
`is defined here as that which minimizes the distance between
`the reconstructed and original wideband spectral envelopes
`in the narrowband region.
`To derive the optimum gain, we first trace a wideband
spectral envelope through the system. Represent the spectral envelope of the original wideband speech signal as $S_W(\omega)$, a real, positive, symmetric function in the frequency domain. After the bandlimiting filter and subsequent downsampling, the narrowband spectral envelope can be represented as
`
$$S_N(\omega) = S_W\!\left(\frac{\omega}{K}\right)\left|H\!\left(\frac{\omega}{K}\right)\right|, \tag{12}$$
`
`where H(ω) is the impulse-invariant discrete-time system
`frequency response used to model the continuous channel,
`Hc(f ). After LP analysis, the spectral envelope of the nar-
`rowband residual is
`
$$S_E(\omega) = S_N(\omega)\,|A_N(\omega)| = S_W\!\left(\frac{\omega}{K}\right)\left|H\!\left(\frac{\omega}{K}\right)\right|\,|A_N(\omega)|. \tag{13}$$
`
`The high-frequency regeneration technique does not alter the
`basic shape of the residual spectral envelope. However, it does
`create a signal sampled at a higher rate. Thus, the wideband
`residual spectral envelope can be approximated as SE(Kω).
The wideband LP synthesis filter transforms the wideband residual spectral envelope into the reconstructed wideband speech signal spectral envelope as

$$\hat S_W(\omega) = S_E(K\omega)\,\frac{\sigma}{|\hat A_W(\omega)|} = S_W(\omega)\,|H(\omega)|\,\frac{|A_N(K\omega)|\,\sigma}{|\hat A_W(\omega)|}. \tag{14}$$

Note the presence of the gain constant σ. Using (14), we are able to relate the original and reconstructed spectral envelopes in the narrowband region.

Table 1: Parameters used in system testing.

Parameter                            Value
Bandlimiting filter type             High-order Butterworth
Bandlimiting filter cutoff           3.3 kHz
Narrowband sampling rate             8 kHz
Wideband sampling rate               16 kHz
Frame size                           30 ms
Frame rate                           100 Hz
Codebook size                        512
Narrowband LPCs (Itakura-Saito)      7
Wideband LPCs (Itakura-Saito)        15
Narrowband LPCs (Gardner-Rao)        8
Wideband LPCs (Gardner-Rao)          14
`One approach to finding the optimal gain constant is to
`store a calculated gain constant for each training frame in
`a given bin [1]. This constant can be computed using the
`relative powers of the narrowband and wideband residuals
`for each frame. At the end of training, a centroid of these
`gains is found for each bin using the gain values for each
`frame, the narrowband training codewords for each frame,
`and the newly computed representative wideband codeword
`for the bin.
`However, it is assumed that the representative narrow-
`band codeword is indeed representative of all the narrowband
`training codewords in that bin. Thus, an alternative approach
`is to use only the representative narrowband codeword, the
`representative wideband codeword, and an estimate of the
`bandlimiting transfer function to compute the optimal gain
`for each bin. This eliminates the need to store the gain con-
`stants during training and also the need to use multiple nar-
`rowband codewords in the optimal gain calculation. With this
`approach, the gain constant can be computed by minimizing
`
`This calculation does not slow system performance because it
`is performed only once—after training and before the system
`actually operates.
`
`6. EXPERIMENTAL RESULTS
`
`Testing of the wideband speech recovery system has been
`performed in MATLAB using speech samples obtained from
`the TIMIT speech corpus.1 For simplicity, codebook training
`was performed using 30 randomly selected female utterances
`from dialect region 3 (North Midland region). These utter-
`ances comprised approximately 90 seconds of speech data.
`Testing of the resultant system was performed using 2 utter-
`ances not in the training data, but from the same gender and
`dialect region.
`The parameters used in both the training and testing
`phases are shown in Table 1. As stated earlier, speech wave-
`form characteristics are relatively stationary over a 10 ms
`interval. Thus, computations are performed for frames taken
`at a 100 Hz rate. The frames, however, are 30 ms soft-
`windowed overlapping frames. This allows for smooth tran-
`sitions between adjacent frames. The numbers of LPCs used
`for the Itakura-Saito distortion measure have been selected
`to make FFT computations more convenient. However, the
`Gardner-Rao method requires that the numbers of LPCs be
`even, explaining why slightly different numbers were used for
`this distortion measure.
`Additionally, it should be noted that frames with energies
`below an empirically determined threshold were considered
`silence, and were thus excluded from training and testing.
`The spectral shifting method was implemented by multiply-
`ing the narrowband residual by two cosine functions: one
`at 3.3 kHz and one at 4.7 kHz. These results were added to-
`gether after appropriate filtering to prevent overlap. Note that
`pitch-detection was not employed in these tests.
`Sample results from these tests are shown in Figure 6.
`Figure 6a shows an original wideband speech signal spectro-
`
`1Texas Instruments/Massachusetts Institute of Technology Acoustic-
`Phonetic Countinuous Speech Corpus October 1990 (www.ntis.gov/fcpc/
`cpn4129.htm).
`
`d(cid:6)SW (ω), ˆSW (ω)(cid:7)
`= (cid:9)SW (ω), SW (ω)(cid:1)(cid:1)H(ω)(cid:1)(cid:1)(cid:1)(cid:1)
`ˆAN (Kω)(cid:1)(cid:1)
`
`(cid:1)(cid:1)
`

`
`ˆAW (ω)(cid:1)(cid:1)
`
`(cid:10),
`
`(15)
`
`with respect to σ over a selected narrowband frequency range,
`where d(·, ·) is the relevant distance metric. If the LSD dis-
`tance measure is being used, the optimal gain constant is given
`by
`
`4ωn (cid:8) ωn
`σ = exp(cid:4) 1
`2(cid:12)
`ln(cid:11)(cid:1)(cid:1)
`ˆAW (ω)(cid:1)(cid:1)
`2(cid:12) − ln(cid:11)(cid:1)(cid:1)
`− ln(cid:11)(cid:1)(cid:1)H(ω)(cid:1)(cid:1)
`ˆAN (Kω)(cid:1)(cid:1)
`
`−ωn
`
`(16)
`
`2(cid:12)dω(cid:5).
`
`If the Itakura-Saito distance measure is being used, the opti-
`mal gain constant is given by
`
`σ = (cid:13)(cid:14)(cid:14)(cid:15)
`
`1
`
`2ωn (cid:8) ωn
`
`−ωn
`
`2
`
`ˆAW (ω)(cid:1)(cid:1)
`(cid:1)(cid:1)
`2(cid:1)(cid:1)
`ˆAN (Kω)(cid:1)(cid:1)
`(cid:1)(cid:1)H(ω)(cid:1)(cid:1)
`
`2 dω.
`
`(17)
`
`In both expressions, ωn, is a radian frequency in the range
`(0, π /K). This frequency should be selected to include only
`those portions of the narrowband spectrum in which H(ω) is
`invertible. In the simplest case, where H(ω) is assumed to be
`an ideal lowpass filter, it is most appropriate to use ωn = ωc.
`In practice, the various spectra are generated and numerically
`integrated by summing fast Fourier transform (FFT) results.
`
`

`

`Techniques for the regeneration of wideband speech from narrowband speech
`
`7
`
`8000
`
`7000
`
`6000
`
`5000
`
`4000
`
`3000
`
`2000
`
`1000
`
`0
`
`Frequency (Hz)
`
`0
`
`0.2 0.4 0.6 0.8
`
`1
`
`1.2 1.4 1.6 1.8
`
`0
`
`0.2 0.4 0.6 0.8
`
`1
`
`1.2
`
`1.4
`
`1.6 1.8
`
`Time (s)
`
`(a)
`
`Time (s)
`
`(b)
`
`8000
`
`7000
`
`6000
`
`5000
`
`4000
`
`3000
`
`2000
`
`1000
`
`0
`
`Frequency (Hz)
`
`8000
`
`7000
`
`6000
`
`5000
`
`4000
`
`3000
`
`2000
`
`1000
`
`0
`
`Frequency (Hz)
`
`8000
`
`7000
`
`6000
`
`5000
`
`4000
`
`3000
`
`2000
`
`1000
`
`0
`
`Frequency (Hz)
`
`0
`
`0.2 0.4 0.6
`
`0.8
`
`1
`
`1.2 1.4 1.6 1.8
`
`0
`
`0.2 0.4 0.6 0.8
`
`1
`
`1.2
`
`1.4 1.6 1.8
`
`Time (s)
`
`(c)
`
`Time (s)
`
`(d)
`
`Figure 6: Spectrograms for the (a) original signal, (b) bandlimited/narrowband signal, (c) reconstructed signal using spectral folding, and
`(d) reconstructed signal using spectral shifting.
`
gram, while Figure 6b shows the same signal bandlimited to approximately 3.3 kHz. The spectrogram of the signal reconstructed using the Itakura-Saito distortion measure and spectral folding is shown in Figure 6c, and that of the signal reconstructed using the Itakura-Saito distortion measure and spectral shifting is shown in Figure 6d. The spectral folding result contains a gap in the middle of the frequency spectrum: folding mirrors the 0–3.3 kHz band about 4 kHz onto roughly 4.7–8 kHz, leaving the region between about 3.3 and 4.7 kHz empty. Also, because the regenerated high band is a mirror image, the pitch contours in the high-frequency portion of the spectrum run counter to those in the true wideband signal and in the low-frequency portion. This is especially evident in the 0.5–0.7 s time range. In contrast, the spectral shifting method eliminates both of these problems. A comparison of Figures 6a and 6d reveals that the high-frequency portion of the original spectrum is approximated reasonably well in the high-frequency portion of the reconstructed spectrum.
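The contrast between Figures 6c and 6d can be reproduced with a single rising tone standing in for one pitch harmonic. This is a hedged sketch under our own assumptions, not the paper's procedure: "folding" is taken to mean zero-insertion upsampling from 8 kHz to 16 kHz, and "shifting" is taken to mean a single-sideband translation of the interpolated narrowband spectrum by a fixed 3300 Hz (an arbitrary illustrative amount). The helpers `fold`, `interpolate`, `shift_up`, and `peak_hz` are hypothetical names.

```python
import numpy as np

FS_NB, K = 8000, 2          # assumed narrowband rate (Hz) and upsampling factor
FS_WB = FS_NB * K           # 16 kHz wideband rate, matching the 0-8 kHz plots

def fold(x_nb):
    """Spectral folding: zero insertion mirrors the band about FS_NB/2."""
    y = np.zeros(len(x_nb) * K)
    y[::K] = x_nb
    return y

def interpolate(x_nb):
    """Ideal FFT-domain interpolation: upsample, then zero the mirrored images."""
    Y = np.fft.rfft(fold(x_nb))
    f = np.arange(len(Y)) * FS_WB / (len(x_nb) * K)
    Y[f >= FS_NB / 2] = 0
    return np.fft.irfft(Y, n=len(x_nb) * K) * K

def shift_up(x, shift_hz, fs):
    """Single-sideband shift: every component moves up by shift_hz, no mirroring."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0             # double the positive frequencies
    if n % 2 == 0:
        h[n // 2] = 1.0                 # keep the Nyquist bin once
    xa = np.fft.ifft(np.fft.fft(x) * h)  # analytic signal
    t = np.arange(n) / fs
    return np.real(xa * np.exp(2j * np.pi * shift_hz * t))

def peak_hz(y, fs, lo, hi):
    """Frequency of the strongest FFT bin in [lo, hi)."""
    Y = np.abs(np.fft.rfft(y))
    f = np.arange(len(Y)) * fs / len(y)
    band = (f >= lo) & (f < hi)
    return f[band][np.argmax(Y[band])]

n = np.arange(800)  # 0.1 s at 8 kHz: 1000 Hz and 1100 Hz fall on exact FFT bins
for f0 in (1000.0, 1100.0):  # the harmonic rises by 100 Hz
    x_nb = np.cos(2 * np.pi * f0 * n / FS_NB)
    folded = peak_hz(fold(x_nb), FS_WB, FS_NB / 2, FS_WB / 2)
    shifted = peak_hz(shift_up(interpolate(x_nb), 3300.0, FS_WB), FS_WB, FS_NB / 2, FS_WB / 2)
    print(f0, folded, shifted)
```

As the tone rises from 1000 Hz to 1100 Hz, its folded image in the upper band falls from 7000 Hz to 6900 Hz (the counter-running contour of Figure 6c), while the shifted copy rises from 4300 Hz to 4400 Hz, tracking the original.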
`
Subjective evaluations have also been obtained using a survey, to which 18 participants responded. These participants were asked to compare 5 pairs of speech files and to select, from each pair, the one they thought had the better overall sound quality. The choices and the results are summarized in Table 2. A p-value is also given to indicate a level of confidence in each result. It is the probability of observing a preference at least as strong as the measured one if listeners in fact had no preference between the two files; for example, a p-value of 0.05 indicates that a split this lopsided would be expected by chance alone only 5% of the time.
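The tabulated p-values for Choices 1 to 3 appear to be reproduced by a one-sided test of the null hypothesis that each listener picks either file with probability 0.5, using the normal approximation to the binomial without continuity correction. This is our reconstruction, not a procedure stated in the text, and the preference counts (18, 11, and 14 of 18) are inferred from the rounded proportions 100%, 61%, and 78%. The function name `one_sided_p_value` is hypothetical.

```python
import math

def one_sided_p_value(n_prefer, n_total):
    """Upper-tail p-value for a preference vs. 'no preference' (p = 0.5).

    Normal approximation to the binomial, no continuity correction.
    """
    mean = n_total / 2
    std = math.sqrt(n_total / 4)
    z = (n_prefer - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for standard normal Z

for choice, k in [(1, 18), (2, 11), (3, 14)]:
    print(choice, f"{k / 18:.0%}", round(one_sided_p_value(k, 18), 3))
```

This prints 0.0, 0.173, and 0.009 for the three choices, matching the "∼0", 0.173, and 0.009 reported in Table 2. An exact binomial test would give a noticeably larger value for Choice 2 (about 0.24), so the approximation matters most when the split is modest.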
It can be concluded from the survey results that it is indeed possible to enhance the perceived quality of speech signals. This is seen most clearly in Choice 3, where 78% of the participants preferred the signal rebuilt with the Gardner-Rao measure and spectral shifting over the bandlimited signal (p-value 0.009). It is believed that artifacts created by the reconstruction process, especially around unvoiced consonants, adversely affected the results for Choice 2. Techniques for eliminating these artifacts need
Table 2: Survey results for the A/B testing.

Choice 1: (A) Bandlimited signal vs. (B) Original wideband signal
    Preferred: B    Proportion: 100%    p-value: ∼0

Choice 2: (A) Bandlimited signal vs. (B) Rebuilt signal (Itakura-Saito and spectral shifting)
    Preferred: B    Proportion: 61%    p-value: 0.173

Choice 3: (A) Rebuilt signal (Gardner-Rao and spectral shifting) vs. (B) Bandlimited signal
    Preferred: A    Proportion: 78%    p-value: 0.009

Choice 4: (A) Rebuilt signal (Itakura-Saito and spectral folding) vs. (B) Rebuilt signal (It
