Speech Communication 17 (1995) 91-108
Speaker identification and verification using Gaussian mixture
speaker models
`Douglas A. Reynolds *
`MIT Lincoln Laboratory, 244 Wood St, Lexington, MA 02173, USA
`Received 27 September 1994; revised 9 March 1995
`This paper presents high performance speaker identification and verification systems based on Gaussian mixture
`speaker models:
`robust, statistically based representations of speaker identity. The identification system is a
`maximum likelihood classifier and the verification system is a likelihood ratio hypothesis tester using background
speaker normalization. The systems are evaluated on four publicly available speech databases: TIMIT, NTIMIT,
Switchboard and YOHO. The different levels of degradations and variabilities found in these databases allow the
examination of system performance for different task domains. Constraints on the speech range from vocabulary-dependent
to extemporaneous and speech quality varies from near-ideal, clean speech to noisy, telephone speech.
`Closed set identification accuracies on the 630 speaker TIMIT and NTIMIT databases were 99.5% and 60.7%,
`respectively. On a 113 speaker population from the Switchboard database the identification accuracy was 82.8%.
`Global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in verification experiments on
`the TIMIT, NTIMIT, Switchboard and YOHO databases, respectively.
This paper is based on a communication presented at the ESCA Workshop on Automatic Speaker Recognition, Identification
and Verification, Martigny, 5-7 April 1994, and has been recommended by the Scientific Committee of this workshop and the
Editorial Board of the journal. This work was sponsored by the Department of the Air Force. Opinions,
conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Air Force.
* Corresponding author. Tel.: (617) 981-4494. Fax: (617) 981-0186. E-mail:
`Elsevier Science B.V.
SSDI 0167-6393(95)00009-7

Keywords: Automatic speaker identification and verification; Text-independent; Vocabulary-dependent; Gaussian
mixture speaker models; TIMIT; NTIMIT; Switchboard; YOHO
`1. Introduction
With the growing use of speech as a modality in
man-machine communications, and the need to
manage speech as a new data-type in multimedia
applications, the utility of recognizing a person
from his or her voice is increasing. While the area
`of speech recognition is concerned with extract-
`ing the linguistic message underlying a spoken
`utterance, speaker recognition is concerned with
`extracting the identity of the person speaking the
`utterance. Applications of speaker recognition are
`wide ranging, including facility or computer ac-
`cess control (Naik and Doddington, 1987; Higgins
`et al., 1991), telephone voice authentication for
`long-distance calling or banking access (Naik et
`al., 1989),
`intelligent answering machines with
`personalized caller greetings
`Arons, 1984), and automatic speaker labeling of
`recorded meetings for speaker-dependent audio
`indexing (speech-skimming) (Wilcox et al., 1994;
`Arons, 1994).
`Depending upon the application, the general
`area of speaker recognition is divided into two
`specific tasks: identification and verification. In
`speaker identification, the goal
`is to determine
`which one of a group of known voices best
matches the input voice sample. This is also
referred to as closed-set identification.
Applications of pure closed-set identification are
limited to cases where only enrolled speakers will
be encountered, but it is a useful means of
examining the separability of speakers’ voices or
finding similar sounding speakers, which has
applications in speaker-adaptive speech recognition. In
`verification, the goal is to determine from a voice
`sample if a person is who he or she claims to be.
`This is sometimes referred to as the open-set
`problem, because this task requires distinguishing
`a claimed speaker’s voice known to the system
`from a potentially large group of voices unknown
`to the system (i.e., imposter speakers). Verifica-
`tion is the basis for most speaker recognition
`applications and the most commercially viable
`task. The merger of the closed-set identification
and open-set verification tasks, called open-set
`identification, performs like closed-set identifica-
`tion for known speakers but must also be able to
`classify speakers unknown to the system into a
`“none of the above” category.
`These tasks are further distinguished by the
`constraints placed on the speech used to train
`and test the system and the environment in which
`the speech is collected (Doddington, 1985). In a
`text-dependent system, the speech used to train
`and test the system is constrained to be the same
`word or phrase. In a text-independent system, the
training and testing speech are completely
unconstrained. Between text-dependence and
text-independence, a vocabulary-dependent system
constrains the speech to come from a limited
vocabulary, such as the digits, from which test words or
`phrases (e.g. digit strings) are selected. Further-
`more, depending upon the amount of control
`allowed by the application, the speech may be
`collected from a noise-free environment using a
`wideband microphone or from a noisy, narrow-
`band telephone channel.
`In this paper a simple but effective statistical
speaker representation is presented which attains
high identification and verification performance
for both text-independent and
vocabulary-dependent tasks with clean, wideband
`and telephone speech. The Gaussian mixture
`speaker model was introduced in (Rose and
`Reynolds, 1990; Reynolds, 1992) and has demon-
`strated high text-independent identification accu-
`racy for short test utterances from unconstrained,
`telephone quality speech. This paper extends the
`application to speaker verification using back-
`ground speaker normalization (Higgins et al.,
1991) and a likelihood ratio test. A novel
technique for selecting background speakers is also
presented.
Gaussian mixture model (GMM) based
identification and verification systems are evaluated on
four publicly available speech databases: TIMIT
(Fisher et al., 1986), NTIMIT (Jankowski et al.,
1990), Switchboard (Godfrey et al., 1992) and
YOHO (Higgins et al., 1991; Campbell, 1992)¹.
Each database possesses different characteristics,
both in task domain (e.g., text-dependency, number
of speakers) and speech quality (e.g., clean
wideband, noisy telephone), allowing for
experimentation over a wide variety of tasks and
conditions. The TIMIT database is used to examine
how well text-independent speaker identification
can perform under near-ideal conditions with
large populations, thus providing an indication of
the inherent “crowding” of the feature space.
The NTIMIT database is then used to gauge the
identification performance loss incurred by
transmitting speech over the telephone network for
the same large population experiment. The more
realistic, unconstrained Switchboard database is
used to determine a better measure of large
population performance using telephone speech.
For speaker verification, the TIMIT, NTIMIT
and Switchboard databases are again used to
gauge verification performance over the range of
near-ideal speech to more realistic, extemporaneous
telephone speech. Finally, the YOHO
database is used to determine performance on a
vocabulary-dependent, office-environment
verification task. The effect of different background
speaker selections is also examined for all of
these databases.
¹ Results on the King database with the “great-divide” can
be found in (Reynolds, 1994b). TIMIT and NTIMIT are
available through the U.S. National Institute of Standards and
Technology. Switchboard, YOHO and King are available
through the Linguistic Data Consortium.
`Besides using the databases to address specific
`research questions, it is hoped that presentation
`of results on these publicly available databases
`will encourage competitive evaluations and com-
`parisons by other researchers in the speaker
`recognition area. There are many competing
`speaker recognition techniques found throughout
`the literature, but without evaluation on common
`databases with defined train/test paradigms it is
`extremely difficult
`to assess the merits of an
`approach. Moreover, few people have the time or
`resources to implement
`faithfully a competing
`scheme to see how it performs on a calibrated
`database. While not everyone is interested in the
`same task, the available databases allow evalua-
`tion over a wide range of identification and verifi-
`cation scenarios.
`The rest of the paper is organized as follows.
`The next section gives a brief description of the
`Gaussian mixture speaker model. This is followed
`in Section 3 by a description of the identification
and verification systems. Section 4 then presents
descriptions and comparisons of the
databases used in this paper. The identification
`experimental paradigms and results on these
`databases are given in Section 5 followed by
`verification experiments in Section 6. A summary
`and conclusions are given in Section 7.
`2. Gaussian mixture speaker model
`The basis for both the identification and verifi-
`cation systems is the GMM used to represent
`speakers. More specifically,
`the distribution of
`feature vectors extracted from a person’s speech
`is modeled by a Gaussian mixture density. For a
`D-dimensional feature vector denoted as x, the
`mixture density for speaker s is defined as
$$p(x \mid \lambda_s) = \sum_{i=1}^{M} p_i^s \, b_i^s(x). \tag{1}$$
The density is a weighted linear combination of
M component uni-modal Gaussian densities,
$b_i^s(x)$, each parameterized by a mean vector, $\mu_i^s$,
and covariance matrix, $\Sigma_i^s$:
$$b_i^s(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i^s|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i^s)' (\Sigma_i^s)^{-1} (x - \mu_i^s) \right\}.$$
The mixture weights, $p_i^s$, furthermore satisfy the
constraint $\sum_{i=1}^{M} p_i^s = 1$. Collectively, the
parameters of speaker s’s density model are denoted as
$\lambda_s = \{p_i^s, \mu_i^s, \Sigma_i^s\}$, $i = 1, \ldots, M$.
While the general model form supports full
covariance matrices, in this paper diagonal
covariance matrices are used. This choice is based
on empirical evidence that diagonal matrices
outperform full matrices and the fact that the
density modeling of an Mth order full covariance
mixture can equally well be achieved using a
larger order, diagonal covariance mixture.
Maximum likelihood speaker model parameters
are estimated using the
Expectation-Maximization (EM) algorithm
(Dempster et al., 1977). Generally 10 iterations
are sufficient for parameter convergence.
The GMM can be viewed as a hybrid between
two effective models for speaker recognition: a
uni-modal Gaussian classifier and a vector
quantizer codebook. The GMM combines the
robustness and smoothness of the parametric Gaussian
model with the arbitrary density modeling of the
non-parametric VQ model. It can also be viewed
as a single-state HMM with a Gaussian mixture
observation density or an ergodic Gaussian
observation HMM with fixed, equal transition
probabilities. Here, the Gaussian components can be
considered to be modeling the underlying broad
phonetic sounds which characterize a person’s
voice. A more detailed discussion of how GMMs
apply to speaker modeling can be found in
(Reynolds, 1992; Reynolds and Rose, 1995).
3. System descriptions
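The identification and verification systems both score speech with the mixture density of Eq. (1). As a rough illustration, the sketch below evaluates that density in the log domain for diagonal covariances and fits the parameters with a basic EM loop. The function names, the random initialization, the variance floor `var_floor` and the default of 10 iterations as a fixed count are assumptions made for this example; the paper does not specify the initialization or any variance limiting here.

```python
import numpy as np


def weighted_component_logs(X, weights, means, variances):
    """log(p_i * b_i(x_t)) for every frame t and component i, shape (T, M)."""
    diff = X[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_b = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
                    + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    return np.log(weights)[None, :] + log_b


def gmm_log_density(X, weights, means, variances):
    """log p(x_t | lambda) per frame: log-sum-exp over the M components."""
    return np.logaddexp.reduce(
        weighted_component_logs(X, weights, means, variances), axis=1)


def train_gmm_em(X, n_components, n_iter=10, var_floor=1e-3, seed=0):
    """Fit a diagonal-covariance GMM with EM (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    # Assumed initialization: random frames as means, global variance, equal weights.
    means = X[rng.choice(T, size=n_components, replace=False)].copy()
    variances = np.tile(X.var(axis=0) + var_floor, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each frame.
        log_wb = weighted_component_logs(X, weights, means, variances)
        post = np.exp(log_wb - np.logaddexp.reduce(log_wb, axis=1)[:, None])
        # M-step: re-estimate weights, means and diagonal variances.
        counts = post.sum(axis=0) + 1e-10
        weights = counts / counts.sum()
        means = (post.T @ X) / counts[:, None]
        variances = np.maximum(
            (post.T @ X ** 2) / counts[:, None] - means ** 2, var_floor)
    return weights, means, variances
```

A model for one speaker would be obtained with something like `train_gmm_em(features, n_components=32)`, where `features` is a `(T, D)` array of that speaker's training vectors; the model order of 32 is only a placeholder.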
3.1. Speech analysis
Several processing steps occur in the front-end
analysis (see Fig. 1). First, the speech is
segmented into frames by a 20 ms window
progressing at a 10 ms frame rate. A speech activity
detector (SAD) is then used to discard
silence/noise frames. The SAD is a self-normalizing,
energy based detector which tracks the
noise floor of the signal and can adapt to
changing noise conditions (Reynolds, 1992; Reynolds et
al., 1992). For text-independent speaker
recognition, it is important to remove silence/noise
frames from both the training and testing signal
to avoid modeling and detecting the environment
rather than the speaker.
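The framing and frame-dropping steps can be sketched as follows. The 20 ms window and 10 ms frame rate come from the text; the detector itself is a deliberately simplified stand-in, not the self-normalizing SAD of (Reynolds, 1992). It estimates the noise floor as a low percentile of the frame energies and keeps frames that exceed it by a margin. The function names and the `margin_db` and `floor_percentile` values are assumptions for illustration.

```python
import numpy as np


def frame_signal(signal, sample_rate, win_ms=20.0, hop_ms=10.0):
    """Slice a 1-D signal into overlapping frames, shape (n_frames, win)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    signal = np.asarray(signal, dtype=float)
    if len(signal) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(signal) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]


def simple_energy_sad(frames, margin_db=6.0, floor_percentile=10.0):
    """Boolean mask of frames whose energy exceeds an estimated noise floor.

    Simplified assumption: the noise floor is a fixed low percentile of the
    frame energies over the whole utterance, with no adaptation over time.
    """
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    noise_floor_db = np.percentile(energy_db, floor_percentile)
    return energy_db > noise_floor_db + margin_db
```

Applying the mask to both training and testing frames, for example `speech_frames = frames[simple_energy_sad(frames)]`, mirrors the point made above that silence and noise should be removed on both sides.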
`Next, mel-scale cepstral feature vectors are
`extracted from the speech frames (a detailed de-
`scription of the feature extraction steps can be
`found in (Reynolds, 1992; Reynolds and Rose,
`1995)). For bandlimited telephone speech, cep-
`stral analysis is performed only over the mel-filters
in the telephone passband (300-3400 Hz). All
cepstral coefficients except c[0] are retained in
the processing. This choice of features is based
on previous good performance and a recent study
(Reynolds, 1994b) comparing several standard
speech features for speaker identification.
Fig. 1. Front-end speech processing.
Last, the feature vectors are channel equalized
via blind deconvolution. The deconvolution is
implemented by subtracting the average cepstral
vector from each input utterance. If training and
testing speech are collected from different
microphones or channels (e.g., different handsets
and/or lines in telephone applications), this is a
crucial step for achieving good recognition
accuracy (as with the “great-divide” of the King
database (Reynolds, 1994b)). However, when
there is not much variability between recording
microphones or channels, as with the
TIMIT/NTIMIT databases, blind channel
equalization can reduce accuracy. The channel
equalization is used for all databases except the TIMIT
and NTIMIT databases.
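Subtracting the average cepstral vector works because a fixed linear channel, which is a convolution in time, becomes an additive offset in the cepstral domain. A minimal sketch of the operation is given below; the function name is hypothetical and the input is assumed to be a `(T, D)` array of cepstral vectors from a single utterance.

```python
import numpy as np


def cepstral_mean_subtraction(cepstra):
    """Remove the per-utterance mean from each cepstral coefficient.

    cepstra: array of shape (T, D), one cepstral vector per frame.
    A constant additive offset (a fixed channel) is cancelled by this step.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

The same offset removal also discards any speaker information carried by the long-term cepstral mean, which is one plausible reading of why equalization can reduce accuracy when channels are already matched.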
`3.2. Identification system
`The identification system is a straight-forward
`maximum-likelihood classifier. For a reference
group of S speakers $\mathcal{S} = \{1, 2, \ldots, S\}$ represented
by models $\lambda_1, \lambda_2, \ldots, \lambda_S$, the objective is to find
`the speaker model which has the maximum poste-
`rior probability for the input feature vector se-
`quence, X = {x1, . . . ,xT}. The minimum error
`Bayes’ decision rule for this problem is
$$\hat{s} = \arg\max_{1 \le s \le S} \Pr(\lambda_s \mid X) = \arg\max_{1 \le s \le S} \frac{p(X \mid \lambda_s)}{p(X)} \Pr(\lambda_s).$$
Assuming equal prior probabilities of speakers,
the terms $\Pr(\lambda_s)$ and $p(X)$ are constant for all
speakers and can be ignored in the maximum.
Using logarithms and the assumed independence
between observations, the decision rule becomes
$$\hat{s} = \arg\max_{1 \le s \le S} \sum_{t=1}^{T} \log p(x_t \mid \lambda_s).$$
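The maximum-likelihood decision rule amounts to summing per-frame log-likelihoods under each speaker model and picking the largest total. The sketch below assumes diagonal-covariance models stored as `(weights, means, variances)` tuples in a dictionary keyed by speaker label; that storage layout and the function names are illustrative choices, not part of the paper.

```python
import numpy as np


def frame_log_likelihoods(X, weights, means, variances):
    """log p(x_t | lambda) for each frame under a diagonal-covariance GMM."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    diff = X[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_b = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
                    + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    return np.logaddexp.reduce(np.log(weights)[None, :] + log_b, axis=1)


def identify(X, speaker_models):
    """Return the best-scoring speaker label and all total log-likelihoods.

    speaker_models: dict mapping label -> (weights, means, variances).
    Equal speaker priors are assumed, as in the text.
    """
    scores = {label: float(frame_log_likelihoods(X, *params).sum())
              for label, params in speaker_models.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Summing log-likelihoods over frames is where the assumed independence between observations enters; no temporal ordering of the frames is used.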
3.3. Verification system
`Although requiring only a binary decision, the
`verification task is more difficult than the identi-
`fication task in that the alternatives are less de-
`fined. The system must decide if the input voice
`came from the claimed speaker, with a well-de-
`fined model, or not the claimed speaker, which is
`ill-defined. Cast
`in a hypothesis testing frame-
`work, for a given input utterance X and a claimed
`identity the choice is between H0 and H1:
H0: X is from the claimed speaker.
H1: X is not from the claimed speaker.
`To perform the optimum likelihood ratio test to
`decide between H0 and H1 then requires some
`model of the universe of possible non-claimant
`speakers. The application of this hypothesis test-
`ing approach is first described, followed by a
discussion of a technique for selecting speakers
for modeling the non-claimant alternative hypothesis.
`3.3.1. General approach
`The general approach used in the speaker
verification system is to apply a likelihood ratio
test to an input utterance to determine if the
claimed speaker is accepted or rejected. For an
utterance $X = \{x_1, \ldots, x_T\}$ and a claimed speaker
identity with corresponding model $\lambda_C$, the
likelihood ratio is
$$\frac{\Pr(X \text{ is from the claimed speaker})}{\Pr(X \text{ is not from the claimed speaker})} = \frac{\Pr(\lambda_C \mid X)}{\Pr(\lambda_{\bar{C}} \mid X)}.$$
Applying Bayes’ rule and discarding the constant
prior probabilities for claimant and imposter
speakers (they are accounted for in the decision
threshold), the likelihood ratio in the log domain
becomes
$$\Lambda(X) = \log p(X \mid \lambda_C) - \log p(X \mid \lambda_{\bar{C}}).$$
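The step from the posterior ratio to the log-likelihood difference can be written out explicitly. In this sketch, $\Pr(\lambda_C)$ and $\Pr(\lambda_{\bar{C}})$ denote the claimant and imposter priors, and $\theta'$ is a threshold on the log posterior ratio; this notation is introduced here for illustration only.

$$\frac{\Pr(\lambda_C \mid X)}{\Pr(\lambda_{\bar{C}} \mid X)} = \frac{p(X \mid \lambda_C)\,\Pr(\lambda_C)/p(X)}{p(X \mid \lambda_{\bar{C}})\,\Pr(\lambda_{\bar{C}})/p(X)} = \frac{p(X \mid \lambda_C)}{p(X \mid \lambda_{\bar{C}})} \cdot \frac{\Pr(\lambda_C)}{\Pr(\lambda_{\bar{C}})}$$

$$\log \frac{\Pr(\lambda_C \mid X)}{\Pr(\lambda_{\bar{C}} \mid X)} = \Lambda(X) + \log \frac{\Pr(\lambda_C)}{\Pr(\lambda_{\bar{C}})}$$

Since the prior term does not depend on $X$, comparing the log posterior ratio with $\theta'$ is equivalent to comparing $\Lambda(X)$ with the shifted threshold $\theta = \theta' - \log\left[\Pr(\lambda_C)/\Pr(\lambda_{\bar{C}})\right]$, which is why the priors can be absorbed into the decision threshold.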
In the identification decision rule of Section 3.2,
$p(x_t \mid \lambda_s)$ is given in Eq. (1). A block
diagram of the speaker identification system is
shown in Fig. 2(a).
Fig. 2. Speaker recognition systems. (a) Identification system. (b) Verification system.
The term $p(X \mid \lambda_C)$ is the likelihood of the
utterance given it is from the claimed speaker and
$p(X \mid \lambda_{\bar{C}})$ is the likelihood of the utterance given it
is not from the claimed speaker. The likelihood
ratio is compared to a threshold $\theta$ and the claimed
speaker is accepted if $\Lambda(X) > \theta$ and rejected if
$\Lambda(X) < \theta$. The likelihood ratio essentially
measures how much better the claimant’s model
scores for the test utterance compared to some
non-claimant model. The decision threshold is
then set to adjust the trade-off between rejecting
true claimant utterances (false rejection errors)
and accepting non-claimant utterances (false
acceptance errors).
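The threshold trade-off can be made concrete with a small sketch that measures both error types at a given threshold and then searches for the point where they are closest, which approximates an equal error rate. The function names are hypothetical, scores are assumed to be likelihood ratio values already computed for true-claimant and imposter trials, and a score exactly equal to the threshold is treated as a rejection (the text leaves that case unspecified).

```python
import numpy as np


def error_rates(claimant_scores, imposter_scores, theta):
    """False rejection and false acceptance rates at threshold theta.

    A trial is accepted when its score is strictly greater than theta.
    """
    claimant_scores = np.asarray(claimant_scores, dtype=float)
    imposter_scores = np.asarray(imposter_scores, dtype=float)
    false_rejection = float(np.mean(claimant_scores <= theta))
    false_acceptance = float(np.mean(imposter_scores > theta))
    return false_rejection, false_acceptance


def approximate_equal_error_rate(claimant_scores, imposter_scores):
    """Sweep the observed scores as thresholds and return the error rate
    at the threshold where the two error types are closest."""
    candidates = np.sort(np.concatenate([
        np.asarray(claimant_scores, dtype=float),
        np.asarray(imposter_scores, dtype=float),
    ]))
    rates = [error_rates(claimant_scores, imposter_scores, th)
             for th in candidates]
    fr, fa = min(rates, key=lambda r: abs(r[0] - r[1]))
    return 0.5 * (fr + fa)
```

Raising the threshold lowers false acceptances at the cost of more false rejections, and lowering it does the opposite; a single threshold shared by all speakers corresponds to the global threshold setting mentioned in the abstract.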
`The terms of the likelihood ratio are computed
`as follows. The likelihood of the utterance given
the claimed speaker’s model is directly computed as
$$\log p(X \mid \lambda_C) = \frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid \lambda_C).$$
The $1/T$ scale is used to normalize the likelihood
for utterance duration.
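The effect of the duration normalization is easy to see numerically: an unnormalized sum of frame log-likelihoods grows with the number of frames, while the average does not. The sketch below assumes the per-frame values $\log p(x_t \mid \lambda_C)$ have already been computed; the function name is hypothetical.

```python
import numpy as np


def normalized_log_likelihood(frame_log_likelihoods):
    """(1/T) * sum over t of log p(x_t | lambda_C) for one utterance."""
    frame_log_likelihoods = np.asarray(frame_log_likelihoods, dtype=float)
    return float(frame_log_likelihoods.sum() / frame_log_likelihoods.size)


# Two utterances with the same per-frame fit but different lengths.
short = np.array([-2.0, -3.0, -4.0])
long = np.tile(short, 10)

print(short.sum(), long.sum())            # raw sums differ by a factor of 10
print(normalized_log_likelihood(short),   # normalized scores agree
      normalized_log_likelihood(long))
```

With this scaling a single decision threshold can be applied to test utterances of different lengths.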
`The likelihood of the utterance given it is not
from the claimed speaker is formed using a
collection of background speaker models. With a set
of B background speaker models, $\{\lambda_1, \ldots, \lambda_B\}$,
the background speakers’ log-likelihood is
computed as
`represent the population of expected imposters,
`which is in general application specific. In some
`scenarios, it may be assumed that imposters will
`attempt to gain access only from similar sounding
`or at
least same-sex speakers (dedicated
imposters). In a telephone based application
accessible by a larger cross-section of potential
imposters, on the other hand,
the imposters may
`sound very dissimilar to the users they attack
`(casual imposters); for example a male imposter
`claiming to be a female user. Previous systems
`have relied on selecting background speakers

