(12) United States Patent
     Lindemann

(10) Patent No.: US 6,298,322 B1
(45) Date of Patent: Oct. 2, 2001

(54) ENCODING AND SYNTHESIS OF TONAL AUDIO SIGNALS USING DOMINANT
     SINUSOIDS AND A VECTOR-QUANTIZED RESIDUAL TONAL SIGNAL

(75) Inventor: Eric Lindemann, 2975 18th St., Boulder, CO (US) 80304
(73) Assignee: Eric Lindemann, Boulder, CO (US)
(*)  Notice: Subject to any disclaimer, the term of this patent is
     extended or adjusted under 35 U.S.C. 154(b) by 0 days.
(21) Appl. No.: 09/306,256
(22) Filed: May 6, 1999
(51) Int. Cl.: G10L 19/02
(52) U.S. Cl.: 704/222; 704/200.1; 704/209; 704/220
(58) Field of Search: 704/200.1, 206, 207, 209, 220, 222

(56) References Cited

U.S. PATENT DOCUMENTS

3,816,664    6/1974   Koch.
4,348,929    9/1982   Gallitzendorfer.
4,461,199    7/1984   Hiyoshi.
4,611,522    9/1986   Hideo.
4,856,068    8/1989   Quatieri, Jr.
4,885,790   12/1989   McAulay.
4,937,873    6/1990   McAulay.
5,029,509    7/1991   Serra.

FOREIGN PATENT DOCUMENTS

0363233 A1    4/1990   (EP).
0363233 B1   11/1994   (EP).
0813184 A1   12/1997   (EP).

OTHER PUBLICATIONS

Scott Levine et al., A Switched Parametric & Transform Audio Coder,
Proceedings of the IEEE ICASSP, May 15-19, 1999, Phoenix, Arizona,
Section 2: System Overview.

Jean LaRoche, HNS: Speech Modification Based on a Harmonic + Noise
Model, Proceedings of IEEE ICASSP, Apr. 1993, Minneapolis, Minnesota,
vol. I, p. 550-553, Section 2: Description of the Model.

Primary Examiner: Tilivaldis Ivars Smits

(57) ABSTRACT

Tonal audio signals can be modeled as a sum of sinusoids
with time-varying frequencies, amplitudes, and phases. An
efficient encoder and synthesizer of tonal audio signals is
disclosed. The encoder determines time-varying
frequencies, amplitudes, and, optionally, phases for a
restricted number of dominant sinusoid components of the
tonal audio signal to form a dominant sinusoid parameter
sequence. These components are removed from the tonal
audio signal to form a residual tonal signal. The residual
tonal signal is encoded using a residual tonal signal encoder
(RTSE). In one embodiment, the RTSE generates a vector
quantization codebook (VQC) and residual codebook
sequence (RCS). The VQC may contain time-domain
residual waveforms selected from the residual tonal signal,
synthetic time-domain residual waveforms with magnitude
spectra related to the residual tonal signal, magnitude
spectrum encoding vectors, or a combination of time-domain
waveforms and magnitude spectrum encoding vectors. The
tonal audio signal synthesizer uses a sinusoidal oscillator
bank to synthesize a set of dominant sinusoid components
from the dominant sinusoid parameter sequence generated
during encoding. In one embodiment, a residual tonal signal
is synthesized using a VQC and RCS generated by the RTSE
during encoding. If the VQC includes time-domain
waveforms, an interpolating residual waveform oscillator
may be used to synthesize the residual tonal signal. The
synthesized dominant sinusoids and synthesized residual
tonal signal are summed to form the synthesized tonal audio
signal.

42 Claims, 26 Drawing Sheets
`
`
`
`
`
`
`
Sony Exhibit 1043
Sony v. MZ Audio
`
`
`
U.S. PATENT DOCUMENTS

5,195,166      3/1993   Hardwick.
5,226,108      7/1993   Hardwick.
5,327,518      7/1994   George.
5,369,730     11/1994   Yajima.
5,401,897      3/1995   Depalle.
5,479,564     12/1995   Vogten.
5,581,656     12/1996   Hardwick.
5,686,683     11/1997   Freed.
5,717,821 *    2/1998   Tsutsui et al. ........ 704/200.1
5,744,742      4/1998   Lindemann.
5,765,126 *    6/1998   Tsutsui et al. ........ 704/200.1
5,774,837      6/1998   Yeldener.
5,787,387      7/1998   Aguilar.
5,806,024      9/1998   Ozawa.
5,848,387 *   12/1998   Nishiguchi et al. ..... 704/214

* cited by examiner
`
`
`
Sheet 1 of 26

Figure 1 (block diagram). Blocks and signals, in signal-flow order
as far as the labels can be recovered: tonal audio signal; dominant
sinusoid encoder, producing a pitch sequence, a dominant sinusoid
parameter sequence, and a residual tonal signal; residual tonal
signal encoder, producing a residual codebook sequence and a
residual vector quantization codebook; mass storage or
communications channel; sinusoidal oscillator bank; residual tonal
signal synthesizer, producing the resynthesized residual tonal
signal; resynthesized tonal audio signal.
Reference numerals legible in the extraction: 101, 103, 104.
`
`
`
Sheet 2 of 26

Figure 2 (flowchart, reference numerals 201-212).
Input: tonal audio signal.

n = 0; offset = 0
test: offset + frame_length < signal_length   (YES: continue; NO: go to return)
    frame = window * low_frequency_signal(offset to (offset+frame_length-1))
    frame = zeropad(frame, frame_length * 50)
    frame_fft = real_fft(frame)
    pitch_sequence[n] = find_pitch(frame_fft .* conj(frame_fft))
    indices = maxima(abs(frame_fft), number_of_sinusoids)
    frequencies = indices*(fs/2)/length(frame_fft);
    dominant_sinusoid_parameter_sequence[n].frequencies = frequencies;
    dominant_sinusoid_parameter_sequence[n].amplitudes = abs(frame_fft[indices]);
    dominant_sinusoid_parameter_sequence[n].phases = angle(frame_fft[indices]);
    frame_fft = frame_fft .*
        real_fft(zero_pad(zeros_filter(frequencies, frame_length), length(frame_fft)))
    residual_tonal_signal = overlap_add(residual_tonal_signal, real_ifft(frame_fft))
    offset += frame_length/2; n += 1;   (back to test)
return dominant_sinusoid_parameter_sequence, residual_tonal_signal, pitch_sequence
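The core of the Figure 2 loop (pick the strongest spectral peaks of a frame, record their parameters, and remove them to leave a residual) can be sketched for a single frame as follows. This is an illustrative simplification, not the patented procedure: the function names `dft` and `split_dominant` are hypothetical, the window and the 50x zero padding are omitted, and the `zeros_filter` step is replaced by simply zeroing the selected bins.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for short demo frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def split_dominant(frame, num_sinusoids, sample_rate):
    """Return ([(frequency, amplitude, phase), ...], residual_frame)."""
    n = len(frame)
    spectrum = dft(frame)
    half = n // 2
    mags = [abs(spectrum[k]) for k in range(half + 1)]
    # Local maxima of the magnitude spectrum, strongest first.
    peaks = [k for k in range(1, half)
             if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    peaks.sort(key=lambda k: mags[k], reverse=True)
    chosen = sorted(peaks[:num_sinusoids])
    params = [(k * sample_rate / n, 2.0 * mags[k] / n, cmath.phase(spectrum[k]))
              for k in chosen]
    # Assumption: zero the chosen bins and their mirror images instead of
    # applying the zeros_filter named in the flowchart.
    residual_spectrum = list(spectrum)
    for k in chosen:
        residual_spectrum[k] = 0j
        residual_spectrum[n - k] = 0j
    residual = [sum(residual_spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                    for k in range(n)).real / n
                for t in range(n)]
    return params, residual


# Hypothetical test frame: three exactly periodic cosines, 32 samples at 32 Hz.
frame = [math.cos(2 * math.pi * 4 * t / 32)
         + 0.5 * math.cos(2 * math.pi * 9 * t / 32)
         + 0.1 * math.cos(2 * math.pi * 13 * t / 32) for t in range(32)]
params, residual = split_dominant(frame, num_sinusoids=2, sample_rate=32.0)
```

With two dominant sinusoids requested, the two strongest components are reported in `params` and the weakest one is all that remains in `residual`.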
`
`
`
Sheet 3 of 26

Figure 3 (flowchart, reference numerals 301-312).
Input: tonal audio signal.

n = 0; offset = 0
test: offset + frame_length < signal_length   (YES: continue; NO: go to return)
    frame = window * low_frequency_signal(offset to (offset+frame_length-1))
    frame = zeropad(frame, frame_length * 50)
    frame_fft = real_fft(frame)
    pitch_sequence[n] = find_pitch(frame_fft .* conj(frame_fft))
    low_frequency_fft = frame_fft .* low_pass_fft(pitch_sequence[n]);
    high_frequency_fft = frame_fft .* high_pass_fft(pitch_sequence[n]);
    indices = find_maxima(abs(low_frequency_fft), number_of_sinusoids)
    frequencies = indices*(fs/2)/length(frame_fft);
    dominant_sinusoid_parameter_sequence[n].frequencies = frequencies;
    dominant_sinusoid_parameter_sequence[n].amplitudes = abs(frame_fft[indices]);
    dominant_sinusoid_parameter_sequence[n].phases = angle(frame_fft[indices]);
    residual_tonal_signal = overlap_add(residual_tonal_signal,
        real_ifft(high_frequency_fft))
    offset += frame_length/2; n += 1;   (back to test)
return dominant_sinusoid_parameter_sequence, residual_tonal_signal, pitch_sequence
`
`
`
Sheet 4 of 26

Figure 4 (flowchart, reference numerals 401-413).
Input: tonal audio signal.

n = 0; offset = 0
test: offset + frame_length < signal_length   (YES: continue; NO: go to return)
    frame = window * low_frequency_signal(offset to (offset+frame_length-1))
    frame = zeropad(frame, frame_length * 40)
    frame_fft = real_fft(frame)
    pitch_sequence[n] = find_pitch(frame_fft .* conj(frame_fft));
    f0 = pitch_to_frequency(pitch_sequence[n]);
    harmonic_bins = round((f0 to fs/2 by f0) / (fs/2) * length(frame_fft))
    indices = find_largest(abs(frame_fft[harmonic_bins]), number_of_sinusoids)
    frequencies = indices*(fs/2)/length(frame_fft);
    dominant_sinusoid_parameter_sequence[n].frequencies = frequencies;
    dominant_sinusoid_parameter_sequence[n].amplitudes = abs(frame_fft[indices]);
    dominant_sinusoid_parameter_sequence[n].phases = angle(frame_fft[indices]);
    frame_fft = frame_fft .*
        real_fft(zero_pad(zeros_filter(frequencies, frame_length), length(frame_fft)))
    residual_tonal_signal = overlap_add(residual_signal, real_ifft(frame_fft))
    offset += frame_length/2; n += 1;   (back to test)
return dominant_sinusoid_parameter_sequence, residual_tonal_signal, pitch_sequence
`
`
`
Sheet 5 of 26

Figure 5 (flowchart, reference numerals 501-512).
Input: tonal audio signal.

n = 0; offset = 0
test: offset + frame_length < signal_length   (YES: continue; NO: go to return)
    frame = window * low_frequency_signal(offset to (offset+frame_length-1))
    frame = zeropad(frame, frame_length * 50)
    frame_fft = real_fft(frame)
    pitch_sequence[n] = find_pitch(frame_fft .* conj(frame_fft));
    f0 = pitch_to_frequency(pitch_sequence[n]);
    indices = round((f0 to f0*number_of_sinusoids by f0) / (fs/2) * length(frame_fft))
    low_frequency_fft = frame_fft .* low_pass_fft(pitch_sequence[n]);
    high_frequency_fft = frame_fft .* high_pass_fft(pitch_sequence[n]);
    frequencies = indices*(fs/2)/length(frame_fft);
    dominant_sinusoid_parameter_sequence[n].frequencies = frequencies;
    dominant_sinusoid_parameter_sequence[n].amplitudes = abs(low_frequency_fft[indices]);
    dominant_sinusoid_parameter_sequence[n].phases = angle(low_frequency_fft[indices]);
    residual_tonal_signal = overlap_add(residual_tonal_signal,
        real_ifft(high_frequency_fft))
    offset += frame_length/2; n += 1;   (back to test)
return dominant_sinusoid_parameter_sequence, residual_tonal_signal, pitch_sequence
`
`
`
Sheet 6 of 26

Figure 6 (block diagram). Recoverable labels: magnitude spectrum
sequence; magnitude spectrum codebook; residual codebook sequence;
residual amplitude sequence; residual codebook pitch; residual
codebook amplitude; residual waveform codebook.
`
`
Sheet 7 of 26

Figure 7 (flowchart, reference numerals 700-710).
Input: residual_tonal_signal.

n = 0
test: offset + frame_length < signal_length   (YES: continue; NO: go to return)
    frame = window * residual_tonal_signal(offset to (offset+frame_length-1))
    frame_fft = real_fft(frame)
    magnitude_spectrum = abs(frame_fft)
    residual_amplitude_sequence[n] = sqrt(sum(magnitude_spectrum.^2))
    magnitude_spectrum = smooth_spectrum(magnitude_spectrum)
    magnitude_spectrum_sequence[n] =
        magnitude_spectrum / residual_amplitude_sequence[n]
    offset += frame_length/2; n += 1;   (back to test)
return
`
`
`
Sheet 8 of 26

Figure 8 (flowchart, reference numerals 800-817).
Input: pitch_sequence.

last_total_distance = LARGE_NUMBER;
total_distance = LARGE_NUMBER/2;
indices = ceil(rand(number_of_codebook_vectors)*length(magnitude_spectrum_sequence));
magnitude_spectrum_codebook = magnitude_spectrum_sequence[indices];
residual_codebook_pitch = pitch_sequence[indices];
residual_codebook_amplitude = residual_amplitude_sequence[indices];
test: last_total_distance - total_distance > PROGRESS_THRESHOLD   (YES: continue; NO: return)
    for m = 1 to number_of_codebook_vectors
        for n = 1 to length(magnitude_spectrum_sequence)
            distance[m][n] = (residual_amplitude_sequence[n].^2
                + residual_codebook_amplitude[m].^2)
                - 2*(magnitude_spectrum_sequence[n]'*magnitude_spectrum_codebook[m])
                + pitch_weight*round(abs(pitch_sequence[n] - codebook_pitch[m])/pitch_sz)
    for n = 1 to length(magnitude_spectrum_sequence)
        residual_codebook_sequence[n] = closest_vector(distance[all][n])
    for m = 1 to number_of_codebook_vectors
        [min_distances, indexes] = find(residual_codebook_sequence == m)
        new_magnitude_spectrum = sum(magnitude_spectrum_sequence[indexes])/length(indexes);
        new_pitch = sum(pitch_sequence[indexes])/length(indexes);
        new_amplitude = sqrt(sum(new_magnitude_spectrum.^2));
        magnitude_spectrum_codebook[m] = new_magnitude_spectrum/new_amplitude;
        residual_codebook_pitch[m] = new_pitch;
        residual_codebook_amplitude[m] = new_amplitude;
    last_total_distance = total_distance;
    total_distance = sum(min_distances);   (back to test)
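The Figure 8 loop has the shape of an iterative vector quantizer design: assign every frame to its nearest codebook vector, replace each codebook vector by the mean of its members, and stop when the total distance no longer improves by more than a threshold. A minimal sketch of that shape follows. It is not the flowchart itself: the names `squared_distance` and `train_codebook` are hypothetical, the pitch-weighted term and the amplitude normalization are left out, and the codebook is seeded from the first vectors rather than at random so that the run is repeatable.

```python
def squared_distance(x, c):
    """|x|^2 + |c|^2 - 2 x.c, the expanded squared Euclidean distance."""
    return (sum(a * a for a in x) + sum(b * b for b in c)
            - 2.0 * sum(a * b for a, b in zip(x, c)))


def train_codebook(vectors, num_codes, progress_threshold=1e-9, max_iters=100):
    # Assumption: deterministic seeding from the first num_codes vectors.
    codebook = [list(v) for v in vectors[:num_codes]]
    last_total = float("inf")
    assignment = []
    for _ in range(max_iters):
        # Assignment step: nearest codebook vector for every input vector.
        assignment, total = [], 0.0
        for x in vectors:
            dists = [squared_distance(x, c) for c in codebook]
            best = min(range(num_codes), key=dists.__getitem__)
            assignment.append(best)
            total += dists[best]
        # Update step: each codebook vector becomes the mean of its members.
        for m in range(num_codes):
            members = [x for x, a in zip(vectors, assignment) if a == m]
            if members:
                codebook[m] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once the total distance has stopped improving.
        if last_total - total <= progress_threshold:
            break
        last_total = total
    return codebook, assignment


# Hypothetical data: two well separated clusters in two dimensions.
vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
           [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
codebook, assignment = train_codebook(vectors, num_codes=2)
```

The returned `assignment` plays the role of the residual codebook sequence: one codebook index per frame.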
`
`
`
Sheet 9 of 26

Figure 9 (flowchart, reference numerals 900-912).
Input: residual_tonal_signal and pitch_sequence.

for m = 1 to number_of_codebook_vectors
    for n = 1 to length(magnitude_spectrum_sequence)
        distance[m][n] = (residual_amplitude_sequence[n].^2
            + residual_codebook_amplitude[m].^2)
            - 2*(magnitude_spectrum_sequence[n]'*magnitude_spectrum_codebook[m])
            + pitch_weight*round(abs(pitch_sequence[n] - codebook_pitch[m])/pitch_sz)
for m = 1 to number_of_code_book_vectors
    closest_frame_index = closest_vector(distance[m][all])
    wave_start = (closest_frame_index-1)*frame_length/2
    residual_waveform_codebook[m] =
        residual_tonal_signal[wave_start to wave_start + frame_length]
    residual_codebook_pitch[m] = pitch_sequence[closest_frame_index]
    residual_codebook_amplitude[m] =
        residual_amplitude_sequence[closest_frame_index]
return
`
`
`
Sheet 10 of 26

Figure 10 (block diagram). Recoverable labels: harmonic spectrum
sequence; residual amplitude sequence; harmonic spectrum codebook;
residual codebook amplitude; residual codebook sequence; residual
waveform codebook.
`
`
`
Sheet 11 of 26

Figure 11 (flowchart, reference numerals 1101-1116).
Input: residual_tonal_signal and pitch sequence.

offset = 0; n = 0; harmonic_spectrum_sequence[all][all] = 0;
test: offset + frame_length < signal_length   (YES: continue; NO: exit)
    frame = window .* residual_tonal_signal(offset to (offset+frame_length-1))
    frame_fft = real_fft(frame)
    f0 = pitch_to_frequency(pitch_sequence[n])
    highest_harmonic = floor((fs/2) / f0);
    for k = 1 to highest_harmonic
        harmonic_freq = (k*f0)
        harmonic_bin = round(harmonic_freq/(fs/2) * length(frame_fft))
        harmonic_spectrum_sequence[n][k] = abs(frame_fft[harmonic_bin])
    residual_amplitude_sequence[n] =
        sqrt(sum(harmonic_spectrum_sequence[n].^2))
    harmonic_spectrum_sequence[n] /= residual_amplitude_sequence[n]
    offset += frame_length; n += 1;   (back to test)
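The per-frame part of Figure 11 samples a magnitude spectrum at the bins nearest each multiple of the pitch frequency, measures the overall amplitude, and normalizes by it. A small sketch under stated assumptions: `harmonic_spectrum` is a hypothetical name, the input is taken to be a magnitude spectrum whose bins span 0 Hz up to (but not including) fs/2, and a harmonic that rounds to a bin past the end of the array is treated as zero.

```python
import math


def harmonic_spectrum(magnitude, sample_rate, f0):
    """Return (normalized harmonic magnitudes, overall amplitude)."""
    nyquist = sample_rate / 2.0
    num_bins = len(magnitude)
    highest_harmonic = int(nyquist // f0)
    samples = []
    for k in range(1, highest_harmonic + 1):
        harmonic_bin = int(round(k * f0 / nyquist * num_bins))
        # Assumption: harmonics that land past the last bin contribute zero.
        samples.append(magnitude[harmonic_bin] if harmonic_bin < num_bins else 0.0)
    amplitude = math.sqrt(sum(s * s for s in samples))
    if amplitude > 0.0:
        samples = [s / amplitude for s in samples]
    return samples, amplitude


# Hypothetical spectrum: 100 bins of 10 Hz each, energy at 100 Hz and 200 Hz.
magnitude = [0.0] * 100
magnitude[10] = 3.0
magnitude[20] = 4.0
harmonics, amplitude = harmonic_spectrum(magnitude, sample_rate=2000.0, f0=100.0)
```

Separating the amplitude from the normalized shape is what lets the codebook store spectral shapes only, with loudness carried by the residual amplitude sequence.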
`
`
`
Sheet 12 of 26

Figure 12 (flowchart, reference numerals 1200-1217).
Input: pitch_sequence.

last_total_distance = LARGE_NUMBER;
total_distance = LARGE_NUMBER / 2;
indices = ceil(rand(number_of_codebook_vectors)*length(harmonic_spectrum_sequence));
harmonic_spectrum_codebook = harmonic_spectrum_sequence[indices];
residual_codebook_amplitude = sqrt(sum(harmonic_spectrum_codebook.^2));
test: last_total_distance - total_distance > PROGRESS_THRESHOLD   (YES: continue; NO: return)
    for m = 1 to number_of_codebook_vectors
        for n = 1 to length(harmonic_spectrum_sequence)
            distance[m][n] = (residual_amplitude_sequence[n].^2
                + residual_codebook_amplitude[m].^2)
                - 2*harmonic_spectrum_sequence[n]'*harmonic_spectrum_codebook[m];
    for n = 1 to length(harmonic_spectrum_sequence)
        residual_codebook_sequence[n] = closest_vector(distance[all][n])
    for m = 1 to number_of_codebook_vectors
        [min_distances, indexes] = find(residual_codebook_sequence == m)
        new_harmonic_spectrum = sum(harmonic_spectrum_sequence[indexes])/length(indexes);
        residual_codebook_amplitude[m] = sqrt(sum(new_harmonic_spectrum.^2));
        harmonic_spectrum_codebook[m] = new_harmonic_spectrum /
            residual_codebook_amplitude[m];
    last_total_distance = total_distance;
    total_distance = sum(min_distances);   (back to test)
`
`
`
Sheet 13 of 26

Figure 13 (flowchart; no reference numerals legible in the extraction).

phases = rand(frame_length/2) .* 2*PI
for m = 1 to number_of_codebook_vectors
    frame_fft = harmonic_spectrum_codebook[m] .* exp(j.*phases)
    residual_waveform_codebook[m] = real_ifft(frame_fft)
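Figure 13 turns each harmonic magnitude vector into a time-domain waveform by attaching phases and inverse transforming, and it draws the random phases once, before the loop, so every codebook waveform shares the same phase set. The sketch below illustrates that idea with a direct sum of cosines in place of `real_ifft`. The name `synthesize_waveforms`, the seeded generator, the table length, and the lack of any amplitude scaling are all assumptions made for the example.

```python
import math
import random


def synthesize_waveforms(harmonic_codebook, table_length, seed=0):
    """Build one single-period waveform table per harmonic magnitude vector."""
    rng = random.Random(seed)
    num_harmonics = len(harmonic_codebook[0])
    # One shared set of random phases, drawn before the loop as in the flowchart.
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_harmonics)]
    tables = []
    for magnitudes in harmonic_codebook:
        table = []
        for t in range(table_length):
            # Harmonic k+1 completes k+1 cycles over the table.
            table.append(sum(
                mag * math.cos(2.0 * math.pi * (k + 1) * t / table_length + phases[k])
                for k, mag in enumerate(magnitudes)))
        tables.append(table)
    return tables, phases


# Hypothetical codebook: one vector with only harmonic 1, one with only harmonic 2.
tables, phases = synthesize_waveforms([[1.0, 0.0], [0.0, 1.0]], table_length=16)
```

Because the phases are shared, cross-fading or mixing two tables combines their harmonics in phase rather than letting them partially cancel.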
`
`
`
Sheet 14 of 26

Figure 14 (block diagram). Recoverable labels: LPC sequence; LPC
codebook; LPC variance sequence; codebook variance; excitation
amplitude sequence; excitation signal.
Reference numerals legible in the extraction: 1400, 1403, 1404, 1405.
`
`
`
Sheet 15 of 26

Figure 15 (flowchart, reference numerals 1500-1508).
Input: residual_tonal_signal.

offset = 0
test: offset + frame_length < signal_length   (YES: continue; NO: exit)
    frame = window * residual_tonal_signal(offset to (offset+frame_length-1))
    (LPC_sequence[n], residual_amplitude_sequence[n]) =
        generate_LPC_coefficients_and_amplitude(frame);
    LPC_variance_sequence[n] = sum(LPC_sequence[n].^2)
    excitation_segment = inverse_filter(frame, LPC_sequence[n])
    excitation_signal = overlap_add(excitation_signal, excitation_segment)
    offset += frame_length; n += 1;   (back to test)
`
`
`
Sheet 16 of 26

Figure 16 (flowchart, reference numerals 1600-1617).
Input: pitch_sequence.

last_total_distance = LARGE_NUMBER;
total_distance = LARGE_NUMBER / 2;
indices = ceil(rand(number_of_codebook_vectors)*length(LPC_sequence));
LPC_codebook = LPC_sequence[indices];
LPC_codebook_variance = sum(LPC_codebook.^2);
test: last_total_distance - total_distance > PROGRESS_THRESHOLD   (YES: continue; NO: return)
    for m = 1 to number_of_codebook_vectors
        for n = 1 to length(LPC_sequence)
            distance[m][n] = (LPC_variance_sequence[n] + LPC_codebook_variance[m])
                - 2*(LPC_sequence[n]'*LPC_codebook[m]);
    for n = 1 to length(LPC_sequence)
        LPC_codebook_sequence[n] = closest_vector(distance[all][n])
    for m = 1 to number_of_codebook_vectors
        [min_distances, indexes] = find(LPC_codebook_sequence == m)
        new_LPC_vector = sum(LPC_sequence[indexes])/length(indexes);
        LPC_codebook_variance[m] = sum(new_LPC_vector.^2);
        LPC_codebook[m] = new_LPC_vector
    last_total_distance = total_distance;
    total_distance = sum(min_distances);   (back to test)
`
`
`
Sheet 17 of 26

Figure 17 (flowchart, reference numerals 1700-1705).
Input: magnitude_spectrum.

log_spectrum = log(magnitude_spectrum.^2)
cepstrum = ifft(log_spectrum)
windowed_cepstrum = cepstrum .* smoothing_window
smoothed_log_spectrum = fft(windowed_cepstrum)
smoothed_magnitude_spectrum = sqrt(exp(smoothed_log_spectrum))
return smoothed_magnitude_spectrum
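The five steps of Figure 17 amount to cepstral smoothing: move to the log power spectrum, transform to the cepstrum, keep only the low-quefrency part, and transform back. A compact sketch follows, assuming a full-length, symmetric, strictly positive magnitude spectrum and a rectangular smoothing window. The figure does not specify the window shape, so the rectangular lifter and the `keep` parameter are illustrative choices, as are the function names.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive DFT or inverse DFT (adequate for short demo spectra)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def smooth_spectrum(magnitude, keep):
    """Cepstrally smooth a full-length symmetric magnitude spectrum."""
    n = len(magnitude)
    log_spectrum = [math.log(m * m) for m in magnitude]
    cepstrum = dft(log_spectrum, inverse=True)
    # Assumption: rectangular window keeping quefrencies below `keep`
    # together with their mirror images at the end of the array.
    window = [1.0 if (q < keep or q > n - keep) else 0.0 for q in range(n)]
    windowed = [c * w for c, w in zip(cepstrum, window)]
    smoothed_log = dft(windowed)
    return [math.sqrt(math.exp(v.real)) for v in smoothed_log]


# Hypothetical inputs: a flat spectrum, and one with a fast ripple in log power.
flat = smooth_spectrum([2.0] * 16, keep=4)
rippled_input = [math.exp(0.5 * math.cos(2 * math.pi * 6 * k / 16)) for k in range(16)]
rippled = smooth_spectrum(rippled_input, keep=4)
```

A flat spectrum passes through unchanged, while a ripple whose cepstral peak lies above the cutoff is removed, leaving the smooth envelope.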
`
`
`
Sheet 18 of 26

Figure 18 (flowchart; only reference numeral 1802 is legible in the
extraction). Input: magnitude_squared_spectrum.

max_pitch_power = 0;
best_pitch = 0;
for pitch = pitch_min to pitch_max by 1/20
    f0 = pitch_to_frequency(pitch);
    harmonic_grid = round(((1 to floor(fs/(2*f0)))*f0)/(fs/2)
        * length(magnitude_squared_spectrum));
    pitch_power = sum(magnitude_squared_spectrum(harmonic_grid))
    if (pitch_power > max_pitch_power) {
        max_pitch_power = pitch_power;
        best_pitch = pitch;
    }
return best_pitch
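The search in Figure 18 tries candidate pitches on a fine grid, sums the power found at each candidate's harmonic bins, and keeps the candidate with the most power. A sketch of that search is given below. Two things are assumptions rather than facts from the figure: the pitch unit is taken to be a MIDI-style note number (69 = 440 Hz) inside the hypothetical `pitch_to_frequency`, and harmonics that round to a bin past the end of the spectrum are skipped.

```python
def pitch_to_frequency(pitch):
    # Assumption: pitch is a MIDI-style note number, 69 = 440 Hz.
    return 440.0 * 2.0 ** ((pitch - 69.0) / 12.0)


def find_pitch(power_spectrum, sample_rate, pitch_min, pitch_max, steps_per_unit=20):
    """Return the candidate pitch whose harmonic grid collects the most power."""
    nyquist = sample_rate / 2.0
    num_bins = len(power_spectrum)
    max_pitch_power = 0.0
    best_pitch = 0.0
    steps = int(round((pitch_max - pitch_min) * steps_per_unit))
    for s in range(steps + 1):
        pitch = pitch_min + s / steps_per_unit
        f0 = pitch_to_frequency(pitch)
        pitch_power = 0.0
        for k in range(1, int(nyquist // f0) + 1):
            harmonic_bin = int(round(k * f0 / nyquist * num_bins))
            if harmonic_bin < num_bins:
                pitch_power += power_spectrum[harmonic_bin]
        if pitch_power > max_pitch_power:
            max_pitch_power = pitch_power
            best_pitch = pitch
    return best_pitch


# Hypothetical spectrum: 4000 bins of 1 Hz, unit power at every harmonic of 220 Hz.
spectrum = [0.0] * 4000
for k in range(1, 19):
    spectrum[220 * k] = 1.0
best = find_pitch(spectrum, sample_rate=8000.0, pitch_min=52.0, pitch_max=62.0)
```

Note that a candidate one octave below the true pitch has a harmonic grid containing every true harmonic, so the search range matters; the example keeps the range narrower than an octave on the low side for that reason.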
`
`
`
Sheet 19 of 26

Figure 19 (block diagram). A bank of oscillators 1 through N.
Oscillator i takes frequency[n][i], phase[n][i], and amp[n][i] as
inputs and produces output[i].
`
`
`
Sheet 20 of 26

Figure 20 (block diagram). Recoverable labels: amp; phase;
frequency; amplitude register; phase to table offset conversion;
frequency to phase increment conversion; initial offset register;
phase increment register; sine wave table; frame sample counter;
window table.
Reference numerals legible in the extraction: 2000, 2001, 2005,
2006, 2008, 2009, 2010, 2011, 2012, 2014.
`
`
`
Sheet 21 of 26

Figure 21 (block diagram). Recoverable labels: amp; phase;
frequency; previous amp; phase to table offset conversion;
frequency to phase increment conversion; divide by (operand not
legible); phase increment register; phase accumulator register;
sine wave table; amp accumulator register; sinusoid output.
Reference numerals legible in the extraction: 2100, 2102, 2103,
2106, 2107, 2108, 2109, 2110, 2111, 2112, 2114.
`
`
`
Sheet 22 of 26

Figure 22 (block diagram). Recoverable labels: amp; phase;
frequency; previous amp; phase to table offset conversion; pitch to
phase increment conversion; phase increment register; (FIR
length-1) register; FIR index counter; memory (label partly
illegible); sine wave table; amp accumulator register; sinusoid
output.
Reference numerals legible in the extraction: 2200, 2202, 2203,
2206, 2212, 2213, 2214, 2216.
`
`
`
Sheet 23 of 26

Figure 23 (block diagram). Recoverable labels: residual amplitude
sequence; residual codebook sequence; waveform select;
initial_phase; pitch_sequence; phase to table offset conversion;
amplitude register; phase accumulator register; pitch to phase
increment conversion; phase increment; (FIR length-1) register; FIR
index counter; integer part; fractional part; residual waveform
codebook; FIR coefficient memory; waveform length; accumulator
register; frame sample counter; window table; last half frame
table; synthesized residual tonal signal.
Reference numerals legible in the extraction: 2300, 2301, 2304,
2305, 2307, 2308, 2310, 2311, 2313, 2314, 2315, 2318, 2321, 2322,
2324, 2325 (extracted as 3325), 2326.
`
`
`
Sheet 24 of 26

Figure 24 (block diagram). Recoverable labels: LPC codebook
sequence; excitation amplitude sequence; pitch_sequence;
coefficients select register; gain register; excitation
synthesizer; coefficient vector; LPC codebook; all-pole filter;
frame sample counter; last half frame table; synthesized residual
tonal signal.
Reference numerals legible in the extraction: 2400, 2401, 2402,
2404, 2405, 2406, 2407, 2408, 2409, 2410, 2411, 2412.
`
`
`
Sheet 25 of 26

Figure 25 (block diagram). Recoverable labels: residual tonal
signal; vector quantizer; residual vector quantization codebook;
residual codebook sequence.
`
`
`
Sheet 26 of 26

Figure 26 (block diagram). Recoverable labels: residual tonal
signal; quantizer; residual waveform codebook; residual codebook
sequence.
`
`
`
`ENCODING AND SYNTHESIS OF TONAL
`AUDIO SIGNALS USING DOMINANT
`SINUSOIDS AND A VECTOR-QUANTIZED
`RESIDUAL TONAL SIGNAL
`
`FIELD OF THE INVENTION
`
`This invention relates to encoding and synthesizing tonal
`audio signals, especially voiced speech and music signals.
`BACKGROUND OF THE INVENTION
`
`Tonal sounds can be effectively modeled as a sum of
`sinusoids with time-varying parameters consisting of
`frequency, amplitude, and phase. The key word here is
`“effectively” because, in fact, all sounds can be modeled as
`sums of sinusoids, but the number of sinusoids may be
`extremely large, and the time-varying sinusoidal parameters
`may not have intuitive significance. Colored noise signals
`like breath noise, ocean waves, and snare drums are
examples of sounds that are not effectively modeled by sums
`of sinusoids. Pitched musical instruments such as clarinet,
`trumpet, gongs, and certain cymbals, as well as ensembles of
`these instruments are examples of tonal sounds that are
`effectively modeled as sums of sinusoids.
`Many sounds are modeled as a combination of tonal and
`non-tonal, or colored noise, sounds. Flute and violin both
`have tonal and colored noise components. Human speech is
`often modeled as a mixture of tonal or “voiced” speech, and
`colored noise or “unvoiced” speech. The present invention is
`concerned with encoding and synthesizing tonal audio sig-
`nals. This invention can be used in conjunction with systems
`for encoding and synthesizing non-tonal or colored noise
`signals.
`Pitched signals are a special class of tonal audio signals in
`which the sinusoidal frequencies are harmonically related.
`The present invention can be used for encoding and synthe-
`sizing both pitched and unpitched tonal audio signals. Spe-
`cifically optimized embodiments are proposed for encoding
`and synthesizing pitched tonal audio signals.
In this specification we use the term “tonal audio signal”
to refer to all audio signals that can be effectively modeled
as a sum of sinusoids with time-varying parameters consisting
of frequency, amplitude, and phase. These are all signals
that are not noise-like in character. We use the term “pitched
tonal audio signal” or simply “pitched signal” to refer to
tonal audio signals whose sinusoidal frequencies are
harmonically related. The term “voiced signal” is a common
term of art that refers to the pitched tonal audio signal
component of a speech signal. The term “unvoiced signal”
is a term of art that refers to the noise-like component of a
speech signal. This is the non-tonal part of the signal that
cannot be effectively modeled as a sum of sinusoids with
time-varying parameters consisting of frequency, amplitude,
and phase.
One method of encoding and synthesizing tonal audio
signals is additive sinusoidal encoding and synthesis. This
method provides excellent results since the encoding and
synthesis model is the same model as the signal: a sum of
sinusoids with time-varying parameters. U.S. Pat. Nos.
4,885,790 and 4,937,873, both to McAulay et al., and U.S.
Pat. No. 4,856,068, to Quatieri, Jr. et al., teach systems for
encoding and synthesizing sound waveforms as sums of
sinusoids with time-varying amplitude, frequency, and
phase. While sinusoidal encoding and synthesis provides
excellent results for tonal audio signals, the synthesis
requires large computational resources because many tonal
audio signals may involve one hundred or more individual
sinusoids.
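To make the cost argument concrete, additive synthesis can be sketched as a bank of oscillators whose outputs are summed sample by sample. This is a minimal illustration rather than any of the cited systems: the function name `additive_synth` is hypothetical, and the parameters are held constant over the block instead of varying in time.

```python
import math


def additive_synth(partials, num_samples, sample_rate):
    """Sum a bank of sinusoidal oscillators.

    partials: list of (frequency_hz, amplitude, initial_phase_radians),
    held constant over the block for simplicity.
    """
    out = [0.0] * num_samples
    for freq, amp, phase in partials:
        step = 2.0 * math.pi * freq / sample_rate  # phase increment per sample
        for i in range(num_samples):
            out[i] += amp * math.cos(phase + step * i)
    return out


# Hypothetical example: three harmonics of 220 Hz rendered at 8 kHz.
partials = [(220.0, 1.0, 0.0), (440.0, 0.5, 0.0), (660.0, 0.25, 0.0)]
signal = additive_synth(partials, num_samples=64, sample_rate=8000.0)
```

The inner loop runs once per sinusoid per output sample, so the work grows in direct proportion to the number of sinusoids, which is why a signal with a hundred or more components is expensive to synthesize this way.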
`
To reduce the computational requirement of sinusoidal
synthesis, U.S. Pat. Nos. 5,401,897 to Depalle et al.,
5,686,683 to Freed, and 5,327,518 to George teach systems for
sinusoidal synthesis using Inverse Fast Fourier Transform
(IFFT) techniques. While this approach somewhat reduces the
computation requirements for synthesis of a large number of
parameters, the computation is still expensive and new
problems are introduced. Many synthesis environments, for
example musical synthesizers, require multi-channel output.
Using IFFT approaches, a separate IFFT system must be
used for every channel. In addition, IFFT systems limit
sinusoidal parameter update to once per frame, where a
frame_length must be at least as long as the lowest
frequency period. This parameter update rate may be
insufficient at higher frequencies.
U.S. Pat. Nos. 5,581,656, 5,195,166, and 5,226,108, all to
Hardwick et al., teach a system where a certain number of
sinusoids, the dominant or low-frequency sinusoids, are
synthesized using traditional time-domain sinusoidal additive
synthesis, while the remaining sinusoids are synthesized
using an IFFT approach. This permits a higher update rate for
the dominant sinusoid components while taking advantage
of the lower IFFT computation rate for the bulk of the
sinusoids. This approach has the disadvantages of IFFT
computation cost, especially with multi-channel synthesis. In
addition, the dominant sinusoid components are usually at
lower frequencies, and it is the higher frequencies that often
require an increased parameter update rate.
A number of less compute-intensive systems have been
proposed for encoding and synthesizing tonal audio signals.
Linear Predictive Coding (LPC) is well known in the art of
speech coding and synthesis. Methods for using LPC for
synthesizing tonal or voiced speech concentrate on methods
for generating the tonal excitation signal. The numerous
approaches include generating a pulse train at the desired
pitch, generating a multi-pulse excitation signal at the
desired pitch, vector quantizing (VQ) the excitation signal,
and simply transmitting the excitation signal with fewer bits.
U.S. Pat. No. 5,744,742, to Lindemann et al., teaches a
system for encoding excitation signals as single pitch period
loops. To synthesize excitation signals at different pitches or
amplitudes, weighted sums of pitch period excitation signal
loops are created. The excitation signal pitch periods are
stored in single pitch period waveform memory tables. The
phase response of all excitation signal waveforms is forced
to be the same so that weighted sums of the waveforms do
not cause phase cancellation. All of these techniques, with
the exception of simply transmitting the excitation signal,
give poorer results than full additive sinusoidal encoding
and synthesis. The pulse-based techniques in particular
sound “buzzy” and unnatural.
U.S. Pat. Nos. 5,369,730 to Yajima and 5,479,564 to Vogten
et al., European Patent 813,184 A1 to Dutoit et al., and European
Patents 0,363,233 A1 and 0,363,233 B1, both to Hamon,
teach methods of pitch synchronous concatenated waveform
encoding and synthesis. With this method a number of single
pitch period waveforms are stored in memory. To synthesize
a time-varying signal, a sequence of single pitch period
waveforms is selected from waveform memory and
concatenated over time. The waveforms are usually overlap-added
for continuity. To shift the pitch of the synthesized signal the
overlap rate is modulated. While relatively inexpensive in
terms of compute resources, this approach suffers from
distortions especially associated with the pitch shifting
mechanism. It is audibly inferior to full additive synthesis for
most tonal audio signals.
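Overlap-adding, which the concatenation methods above rely on for continuity, can be sketched in a few lines. The example is illustrative only: `hann` and `overlap_add` are hypothetical helper names, and the choice of a periodic Hann window with a hop of half the frame length is an assumption made so that the overlapping windows sum to exactly one.

```python
import math


def hann(n):
    """Periodic Hann window; copies offset by n/2 samples sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(frames, hop):
    """Place each frame `hop` samples after the previous one and sum."""
    frame_length = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_length)
    for index, frame in enumerate(frames):
        start = index * hop
        for i, value in enumerate(frame):
            out[start + i] += value
    return out


# Hypothetical check: four windows of length 8 at a hop of 4 samples.
summed = overlap_add([hann(8)] * 4, hop=4)
```

Away from the first and last half frames the summed windows are flat, so windowed segments cross-fade without a level change. Changing `hop` changes how often the stored waveform repeats, which is the overlap-rate modulation the text describes for shifting pitch.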
In the music synthesizer field, an approach similar to
concatenated waveform synthesis is referred to as waveform
sequencing. With waveform sequencing each single pitch
period waveform is pitch shifted using sample rate conversion
techniques and looped for a specified time to generate
a stable magnitude spectrum. To generate time-varying
magnitude spectra the waveforms are generally cross-faded
over time. U.S. Pat. Nos. 3,816,664, to Koch, 4,348,929, to
Gallitzendorfer, 4,461,199 and Reissue 34,913, to Hiyoshi et
al., and U.S. Pat. No. 4,611,522 to Hideo teach systems of
waveform sequencing relative to music synthesis. Waveform
sequencing can be economical in computation
resources but much of the complex time-varying character
of the magnitude spectra is lost due to reduction to a limited
number of waveforms.
`
A number of hybrid systems have been proposed that use
additive sinusoidal encoding and synthesis for one part of a
signal (usually the tonal part) and some other technique
for another part of the signal (usually the colored noise
part). U.S. Pat. No. 5,029,509 to Serra et al. teaches a system
for full sinusoidal encoding and synthesis of the tonal part of
a signal and LPC coding of the non-tonal part of the signal.
This approach has the computational expense of full
sinusoidal additive encoding and synthesis plus the expense of
LPC coding and synthesis. A similar approach is applied to
speech signals in U.S. Pat. No. 5,774,837, to Yeldener et al.,
and U.S. Pat. No. 5,787,387 to Aguilar.
`In “A Switched Parametric & Transform Audio Coder”,
`Scott Levine et al., Proceedings of the IEEE ICASSP, May
`15-19, 1999 Phoenix, Ariz., a system is taught wherein low
`frequencies are encoded and synthesized using full sinusoi-
`dal additive synthesis, and high frequencies are encoded
`using LPC with a white noise excitation signal. This is
`economical in terms of computation, but the high-frequency
`synthesized signal sounds excessively noise-like for tonal
audio signals. A similar approach is applied to voiced speech
`signals in “HNS: Speech Modification Based on a
`Harmonic+Noise Model,” J. Laroche et al., Proceedings of
`IEEE ICASSP, April 1993, Minneapolis, Minn. The use of
`colored noise to model the high frequencies of tonal audio
`signals is less objectionable when applied to speech signals,
`but still results in some “buzzyness” at high frequencies.
U.S. Pat. No. 5,806,024, to Ozawa, teaches a system
wherein the short time magnitude spectrum of the tonal
audio signal is determined in frames. The tonal audio signal
is assumed to have a harmonic component with time-varying
pitch. The pitch varies slowly enough that it can be
considered constant over each frame. For each frame, a pitch is
determined. A harmonic spectrum is determined for each
frame as the values of the magnitude spectrum at multiples
of the pitch frequency. A residual spectrum is determined for
each frame as the magnitude spectrum minus the harmonic
spectrum. The harmonic spectrum frames and residual
spectrum frames are vector quantized (VQ) to form a harmonic
spectrum codebook, residual spectrum codebook, and a gain
codebook. The signal is encoded as a sequence of unique
coding vector numbers identifying coding vectors in these
codebooks. Thus the harmonic spectrum codebook sequence
codes the pitched part of the signal, and the residual
codebook sequence codes the non-tonal and non-pitched-but-tonal
part of the signal. This approach can be economical but
with VQ, much of the richness in time-varying behavior is
lost. This is especially true for complex tonal audio signals
such as high-fidelity music signals.
`
`BRIEF SUMMARY OF THE INVENTION
`
`Accordingly, one object of the present invention is to
`synthesize tonal sounds, especially voiced speech or musical
`
`sound, of high quality equivalent to full sinusoidal additive
`synthesis or IFFT sinusoidal synthesis, but with fewer
`encoding parameters and greatly reduced computational
`requirements.
`Another object of the present invention is to synthesize
`tonal sounds without the artificial “buzzyness” associated
`with pulse-based LPC techniques.
`Another object of the present invention is to synthesize
`high quality tonal sounds without audible loss of complex
`time-varying behavior associated with harmonic VQ or
`waveform sequencing tec