D.W.E. Schobben and P.C.W. Sommen

Electrical Engineering Department
Eindhoven University of Technology
P.O. Box 513, 5600 MB Eindhoven, the Netherlands
tel: … fax: …
email: D.W.E.Schobben@ele.tue.nl and P.C.W.Sommen@ele.tue.nl

speakers, cancel the acoustical echoes, eliminate the reverberation and suppress the surrounding noise. The figure depicts a teleconferencing setup, in which L loudspeakers reproduce the far end speech and L microphones pick up degraded near end speech.

[Figure: Teleconferencing setup — far end speech over L loudspeakers, L near end microphone signals, and an adaptive processor whose output is sent to the far end.]

An approach is presented in this paper to combine the BSS and MC-AEC that are needed to achieve transparent communication. In this way, both performance and computational cost can be improved. The problem of dereverberation is not addressed in this paper.

BLIND SIGNAL SEPARATION

The goal of blind signal separation is to recover estimates of the source signals from an observed mixture of them. The figure depicts the mixing and unmixing system in this context. The mixing system H can be modeled by FIR filters that are present between every input and every output of this multichannel system. For acoustical applications these filters can have a length of several thousands of taps, depending on the sample rate and the properties of the room in which the microphones are placed. The goal of the unmixing system is to produce outputs that are linear functions of the sources, y_i = f_i(s_i), 1 ≤ i ≤ J, with J the number of inputs and outputs of the mixing and unmixing system.

Abstract

Transparent communication refers to the audio signal processing which is applied in communication applications. The goal is to make the audio as transparent as possible, in the sense that the reproduced audio should ideally be free from reverberation, noise, acoustical echoes and mixed speakers. Application areas are, for example, teleconferencing and hands-free telephony. This paper presents new ideas for the implementation of such a system. In particular, the use of blind signal separation is examined and new ideas are presented for the joint implementation of the Multi-Channel Acoustical Echo Canceler (MC-AEC) and the Blind Signal Separation (BSS). In this way, acoustical quality can be improved at a reduced computational cost.

INTRODUCTION

In a teleconferencing setup, plain recording of near end speech can result in unintelligible reproduced speech at the far end. This reproduced speech is observed as a noisy, unnatural sounding mixture of multiple speech signals which also contains acoustical echoes. Besides this, data compression is far less efficient for such a signal than for clean speech. The degradation of the near end speech recordings is caused by the following:

• Reproduced far end speech propagates towards the microphones and generates acoustical echoes.

• Microphones pick up an acoustical mixture of several speech signals.

• Microphone signals have a reduced signal to noise ratio due to the pickup of unwanted surrounding noise.

• Speech signals are affected by the acoustical reverberation.

Quality can be improved if multiple microphones are used. Digital signal processing is applied to these microphone signals in order to ideally separate the

Petitioner Apple Inc.
Ex. 1017, p. 171

Note that the permutation of the recovered signals and the linear functions f_i are ambiguous when no properties of the sources themselves or their locations are used. In practical situations these permutations are often not important. Also, the unmixing system can be restricted so that it has amplitude responses that are relatively flat. In this way, its outputs sound just as natural as its inputs.

[Figure: Blind signal separation — the sources s_1 ... s_J pass through the mixing system H, giving the observations x_1 ... x_J; the unmixing system w produces the outputs y_1 ... y_J.]

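As an illustration, the convolutive mixing model described above can be sketched as follows; the filter taps and sizes here are arbitrary illustrative values, not measured room responses:

```python
import numpy as np

# Convolutive mixing: an FIR filter between every source and every
# observation, x_i[n] = sum_a (h_{ia} * s_a)[n]. Values are illustrative.
rng = np.random.default_rng(0)
J, T, K = 2, 1000, 8                      # sources/observations, samples, taps
s = rng.standard_normal((J, T))           # source signals s_1 .. s_J
H = 0.3 * rng.standard_normal((J, J, K))  # mixing filters h_{ia}
x = np.zeros((J, T))                      # observed mixtures x_1 .. x_J
for i in range(J):
    for a in range(J):
        x[i] += np.convolve(H[i, a], s[a])[:T]
```

In a real room, K would be several thousand taps, as noted above.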
From the separated speech signals, the strongest one may be sent to the far end. Audio can be made even more transparent by sending more than one speech signal and playing them via several distinct loudspeakers. If only one speech signal is required, it is also possible to track only the strongest one. This can be done using array signal processing [1, 2, 3]. Array signal processing typically assumes knowledge of the geometry of the array and tracks the strongest source. Tracking only the strongest source has the disadvantage that it gives poor performance at the time that it switches from one source to the other. Blind signal separation is usually based on the fact that the speech signals are independent of each other and typically tries to recover all source signals. Even if only one speech signal is sent to the far end, recovering all sources has the advantage that the strongest one can be picked from the outputs at all times, without the system having to reconverge. The unmixing system w consists of a set of J² FIR filters, similar to the mixing system. For blind signal separation, these filters are controlled to minimize a cost function [4, 5, 6]. This cost function can be based, for example, on mutual information, maximum likelihood, or second or higher order statistics. A priori knowledge of the probability density functions (pdf's) of the speech signals can be used as a tool to adaptively maximize the mutual information among the outputs of the BSS scheme [7, 8, 9].
Another interesting approach, which will be used in this paper, is to minimize cross-correlations among the outputs of the BSS scheme [10]. This approach does not require any a priori knowledge other than the statistical independence of the speech signals. The objective of this approach can be given by

\min_{w} \sum_{l} \sum_{i} \sum_{j \neq i} \left| r_{y_i y_j}[l] \right| \qquad (1)

This cost function will be small when the filter coefficients w are chosen such that the outputs of the BSS become independent of each other in terms of their second order statistics. The correlation lags l which are used in this cost function must range from -N + 1 to N - 1, with N the length of the FIR filters in the unmixing system. This ensures that the problem is not ambiguous. Using more lags is also allowed and can further improve performance [10]. The cross-correlation r_{y_i y_j}[l] can be expressed in the cross-correlation of the input of the BSS, r_{x_i x_j}[l]:

r_{y_i y_j}[l] = E\{ y_i[n]\, y_j[n+l] \}
             = \sum_{a=1}^{J} \sum_{c=1}^{J} \sum_{b=0}^{N-1} \sum_{d=0}^{N-1} w_{ia}[b]\, w_{jc}[d]\, r_{x_a x_c}[l+b-d] \qquad (2)

In this notation, w_{ia}[b] is the b-th tap of the FIR filter which is present between the a-th input and the i-th output of the BSS. The advantage of (2) over (1) is that it no longer explicitly contains r_{y_i y_j}[l], which changes when the filter coefficients change. Instead, r_{x_i x_j}[l] can be estimated once from a data set, and the filter coefficients can be found by minimizing the cost function, which is then expressed in r_{x_i x_j}[l] and in the filter coefficients only.

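As a sketch of this procedure (our own illustrative code, not the authors' implementation), the input cross-correlations can be estimated once and the cost (1) can then be evaluated through the expansion (2):

```python
import numpy as np

def crosscorr_matrix(x, maxlag):
    """Biased estimates of r_{x_a x_c}[l] = E{ x_a[n] x_c[n+l] } for
    l = -maxlag .. maxlag, stored at r[a, c, l + maxlag]."""
    J, T = x.shape
    r = np.zeros((J, J, 2 * maxlag + 1))
    for a in range(J):
        for c in range(J):
            for l in range(-maxlag, maxlag + 1):
                if l >= 0:
                    r[a, c, l + maxlag] = np.dot(x[a, :T - l], x[c, l:]) / T
                else:
                    r[a, c, l + maxlag] = np.dot(x[a, -l:], x[c, :T + l]) / T
    return r

def bss_cost(w, rx, N):
    """Cost (1) evaluated via expansion (2): the output cross-correlations
    are expressed in the fixed input cross-correlations rx, so rx is
    estimated once and only the taps w[i, a, b] vary while optimizing.
    Requires rx to cover lags up to 2*(N-1)."""
    J = w.shape[0]
    maxlag = (rx.shape[2] - 1) // 2
    cost = 0.0
    for l in range(-(N - 1), N):          # lags required by (1)
        for i in range(J):
            for j in range(J):
                if j == i:
                    continue
                # Eq. (2): r_{y_i y_j}[l] =
                #   sum_{a,c,b,d} w_{ia}[b] w_{jc}[d] r_{x_a x_c}[l+b-d]
                r = 0.0
                for a in range(J):
                    for c in range(J):
                        for b in range(N):
                            for d in range(N):
                                r += (w[i, a, b] * w[j, c, d]
                                      * rx[a, c, l + b - d + maxlag])
                cost += abs(r)
    return cost
```

Because rx is fixed, each optimization step touches only the filter taps; minimizing this cost off-line (e.g. by gradient descent) yields the unmixing filters.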
COMBINING MC-AEC & BSS

First, two traditional approaches will be presented in this section. It will be argued that they have important drawbacks which cannot be solved by applying AEC and BSS independently.

Separate BSS & MC-AEC

The acoustical echoes (i.e. the far end speech signals) that are picked up by the microphone array can be cancelled using a MC-AEC [11]. Next, BSS is applied to separate the near end speech signals. This is depicted in the figure. This approach has the following drawbacks:

• The MC-AEC is not able to work well with double talk, i.e. when there is both near end speech and far end speech at the same time. This is a problem when tracking time varying acoustical transfer functions. The overall performance of the system will degrade, since the performance of the BSS depends on the performance of the MC-AEC.

• The separate implementation of MC-AEC and BSS can result in a considerable computational workload. This is especially true for cases with several loudspeakers and many microphones, but where only a few outputs need to be retrieved.

[Figure: Combined adaptive echo canceling and blind signal separation — the L far end speech signals and the L near end microphone signals all enter the BSS, whose outputs are the far end speech, noise, and the separated near end speech.]

and outputs are fixed to the unit impulse response. Furthermore, the filters from the microphone inputs to the far end speech outputs are kept identically equal to zero. Besides the trivial far end speech outputs, the BSS produces separated near end speech signals which are independent of the far end speech. In this way acoustical echoes are suppressed. The number of microphones used in this approach must be greater than or equal to the number of local speakers. If the number of microphones is larger than the number of local speakers, the BSS will also generate noise outputs, which correspond to noisy observations of the local speakers or to strong physical noise sources such as the fan of an overhead projector.

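A minimal sketch of the joint scheme described above (the naming and code are ours; the paper gives no implementation): the filters feeding the far end outputs are frozen as stated, and the extended objective (3) additionally decorrelates every output from every far end signal z_m:

```python
import numpy as np

def xcorr(a, b, l):
    """Biased estimate of r_{ab}[l] = E{ a[n] b[n+l] }."""
    T = len(a)
    if l >= 0:
        return np.dot(a[:T - l], b[l:]) / T
    return np.dot(a[-l:], b[:T + l]) / T

def constrained_filters(w_free, L_far, N):
    """Joint BSS/MC-AEC filter set. Inputs are the L_far far end signals
    followed by the microphone signals. The first L_far outputs are the
    trivial far end outputs: a unit impulse from the matching far end
    input, and identically zero filters from every other input."""
    J_mic = w_free.shape[0]
    J = L_far + J_mic
    w = np.zeros((J, J, N))
    for m in range(L_far):
        w[m, m, 0] = 1.0            # unit impulse response, fixed
    w[L_far:, :, :] = w_free        # freely adapted near end filters
    return w

def joint_cost(y, z, N):
    """Extended objective (3): pairwise output decorrelation plus
    decorrelation of every output y_i against every far end signal z_m."""
    lags = range(-(N - 1), N)
    cost = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if j != i:
                cost += sum(abs(xcorr(y[i], y[j], l)) for l in lags)
        for m in range(len(z)):
            cost += sum(abs(xcorr(z[m], y[i], l)) +
                        abs(xcorr(y[i], z[m], l)) for l in lags)
    return cost
```

Minimizing joint_cost over the free taps makes the near end outputs independent of each other and of the far end speech, so that acoustical echoes are suppressed without a separately adapted AEC.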
EXPERIMENTS

Experiments were carried out with audio signals recorded in a real acoustical environment. The room which was used for the recordings measured … x … x … m (height x width x depth). Two live speakers read … sentences aloud. Also, far end speech was introduced by playing prerecorded French news over a small loudspeaker. The resulting sound was recorded by two microphones. The setup is depicted in the figure. The microphone signals and the far end speech were used to minimize the extended objective function (3) off-line. The FIR filters that were used in the BSS all have … taps. All signals were sampled at … kHz with … bit accuracy. The output of the BSS shows clear separation of the speech and good suppression of the acoustical echoes. An AEC could however be used to further remove the residual echoes. For this application area, quality cannot be expressed in terms of SNR in a straightforward way, because the separated speech signals do not resemble

[Figure: Adaptive echo canceling followed by blind signal separation — the L far end speech signals drive the AEC, whose outputs are subtracted from the L near end microphone signals; the BSS then separates the residual signals, which are sent to the far end.]

BSS without MC-AEC

Theoretically, the BSS can classify acoustical echoes as sources that are independent of the near end speech signals. Therefore, a possible solution could be to use L + L microphones and let the BSS retrieve the far end speech from these as outputs which are independent of the retrieved near end speech. Simulations showed, however, that the BSS is not able to do this with an accuracy comparable to that of the MC-AEC. Furthermore, the separation becomes more difficult when more sources are involved. In the following section, an approach is presented which also makes use of the far end speech itself. The performance of the system is greatly improved by this.

JOINT BSS & MC-AEC

In order to obtain a successful combination of BSS and MC-AEC, the objective function (1) is extended to

\min_{w} \sum_{l} \sum_{i} \left( \sum_{j \neq i} \left| r_{y_i y_j}[l] \right| + \sum_{m} \left( \left| r_{z_m y_i}[l] \right| + \left| r_{y_i z_m}[l] \right| \right) \right) \qquad (3)

with z_m the m-th far end speech signal. In this way, the BSS will produce outputs which are not only independent of each other, but which are also independent of the far end speech signals. In fact, the BSS is extended by adding the far end speech as input signals. The BSS with its inputs and outputs is depicted in the figure. So, when the system is controlled by optimizing (3), it can be considered as a special case of optimizing (1), with the restriction that far end speech must also appear as outputs of the BSS. Therefore the filters between the far end speech inputs

the original speech signals, but are linear functions of them instead. Both the original speech signals and the linear functions are unknown. In order to give an impression of the improvements that can be achieved using this approach, the microphone signals, the far end speech, and the output of the BSS are available for listening. The tracks can be found in WAV-format at

http://www.ses.ele.tue.nl/persons/daniels/

by choosing "Transparent Communication" from the publication list. There is also a BSS page, on which the latest research results will be presented.

[Figure: Recording setup — the positions of the two microphones, the two live speakers and the loudspeaker, with distances given in m and cm.]

Conclusions and future work

Blind signal separation is an important tool in applications like teleconferencing. In this paper, the concept of blind signal separation is extended to incorporate acoustical echo cancellation. Experiments with real acoustical measurements show that this extended approach exhibits a good performance. Subject to further study is the online (adaptive) implementation of the extended algorithm. Important issues to be considered are the computational workload and the convergence speed.

References

[1] B.D. van Veen and K.M. Buckley. Beamforming: a versatile approach to spatial filtering. IEEE ASSP Mag., 1988.

[2] S. Affes and Y. Grenier. A speaker tracking array of microphones. IEEE Trans. on Speech and Audio Proc., Sept. 1997.

[3] W. Kellerman. Strategies for combining acoustic echo cancellation and adaptive beamforming microphone arrays. In Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), Apr. 1997.

[4] P. Smaragdis. Blind separation of convolved mixtures in the frequency domain. In Int. Workshop on Independence & Artificial Neural Networks, Febr. 1998.

[5] T.-W. Lee, A.J. Bell and R. Lambert. Blind separation of delayed and convolved sources. In Advances in Neural Inf. Proc. Systems, 1997.

[6] R.H. Lambert and A.J. Bell. Blind separation of multiple speakers in a multipath environment. In Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), Apr. 1997.

[7] A.J. Bell and T.J. Sejnowski. An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 1995. MIT Press, Cambridge MA.

[8] R.H. Lambert. Multichannel Blind Deconvolution: FIR Matrix Algebra and Separation of Multipath Mixtures. Ph.D. thesis, University of Southern California, May 1996.

[9] K. Torkkola. Blind separation of delayed sources based on information maximization. In IEEE Workshop on Neural Networks for Signal Proc., Sept. 1996.

[10] D.C.B. Chan. Blind Signal Separation. Ph.D. thesis, University of Cambridge, 1997.

[11] F. Alberge, P. Duhamel and Y. Grenier. A combined FDAF/WSAF algorithm for stereophonic acoustic echo cancellation. In Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), May 1998.
