`MS 1130
`
`
`
SPEECH COMMUNICATION

ELSEVIER
`
Speech Communication 17 (1995) 345-346

Cumulative contents of Volume 17

Volume 17, Nos. 1-2, August 1995

Regular papers

J. Glass, G. Flammia, D. Goodine, M. Phillips, J. Polifroni, S. Sakai, S. Seneff and V. Zue
Multilingual spoken-language understanding in the MIT Voyager system    1
V. Steinbiss, H. Ney, U. Essen, B.-H. Tran, X. Aubert, C. Dugast, R. Kneser, H.-G. Meier, M. Oerder, R. Haeb-Umbach, D. Geller, W. Höllerbauer and H. Bartosik
Continuous speech dictation - From theory to practice    19
A. De and P. Kabal
Auditory distortion measure for speech coder evaluation - Hidden Markovian approach    39
H.S. Lee and A.C. Tsoi
Application of multi-layer perceptron in estimating speech/noise characteristics for speech recognition in noisy environment    59

Special Section on Automatic Speaker Recognition, Identification and Verification
Guest Editors: F. Bimbot, G. Chollet and A. Paoloni

Editorial    77
J. de Veth and H. Bourlard
Comparison of Hidden Markov Model techniques for automatic speaker verification in real-world conditions    81
D.A. Reynolds
Speaker identification and verification using Gaussian mixture speaker models    91
T. Matsui and S. Furui
Likelihood normalization for speaker verification using a phoneme- and speaker-independent model    109
M. Forsyth
Discriminating observation probability (DOP) HMM for speaker verification    117
H.C. Choi and R.W. King
On the use of spectral transformation for speaker adaptation in HMM based isolated-word speech recognition    131
P. Thévenaz and H. Hügli
Usefulness of the LPC-residue in text-independent speaker verification    145
Y. Bennani and P. Gallinari
Neural networks for discrimination and modelization of speakers    159
F. Bimbot, I. Magrin-Chagnolleau and L. Mathan
Second-order statistical measures for text-independent speaker identification    177
J. Oglesby
What's in a number? Moving beyond the equal error rate    193
Advance table of contents    209
News    211

Elsevier Science B.V.
`
Volume 17, No. 3-4, November 1995

Special Issue on Interactive Voice Technology for Telecommunication Applications
Guest Editors: D.B. Roe and S. Furui

Editorial    215
L.R. Rabiner
The impact of voice processing on modern telecommunications    217
M. Lennig, G. Bielby and J. Massicotte
Directory assistance automation in Bell Canada: Trial results    227
G.J. Vysotsky
VoiceDialing(SM) - The first speech recognition based service delivered to customer's home from the telephone network    235
H. Aust, M. Oerder, F. Seide and V. Steinbiss
The Philips automatic train timetable information system    249
R. Billi, F. Canavesio, A. Ciaramella and L. Nebbia
Interactive voice technology at work: The CSELT experience    263
C. Sorin, D. Jouvet, C. Gagnoulet, D. Dubois, D. Sadek and M. Toularhoat
Operational and experimental French telecommunication services using CNET speech recognition and text-to-speech synthesis    273
J. Takahashi, N. Sugamura, T. Hirokawa, S. Sagayama and S. Furui
Interactive voice technology development for telecommunications applications    287
C.A. Kamm, C.R. Shamieh and S. Singhal
Speech recognition issues for directory assistance applications    303
B. Mazor and B.L. Zeigler
The design of speech-interactive dialogs for transaction automation systems    313
T. Matsumura and S. Matsunaga
Non-uniform unit based HMMs for continuous speech recognition    321

Erratum    331
Acknowledgements    335
Advance table of contents    337
News    339
Author index of Volume 17    343
Cumulative contents of Volume 17    345
`
`
`SPEECH COMMUNICATION
`
Founding Editor
(1981-1993)
Max Wajskop †

Editor in Chief
Christel Sorin
France Telecom CNET
LAA/TSS/RCP Bat D
Technopole Anticipa - 2 Avenue Pierre Marzin
22307 Lannion Cedex
France
Tel: (33) 96 05 33 06
Fax: (33) 96 05 35 30
E-mail: sorin@lannion.cnet.fr

Assistant Editor
Laurent Miclet
(same address)
E-mail: miclet@lannion.cnet.fr
`
`M. Kunt (Lausanne, Switzerland; EURASIP)
`M. Liberman (Philadelphia, PA, USA)
`J. Ohala (Berkeley, CA, USA)
`L. Pols (Amsterdam, The Netherlands; ESCA)
`
`Editorial Board
`R. Billi (Torino, Italy)
`L.J. Boé (St Martin d’Héres, France)
`L. Boves (Nijmegen, The Netherlands)
R. Carlson (Stockholm, Sweden)
C.J. Darwin (Brighton, UK)
B. De Boysson-Bardies (Paris, France)
R. De Mori (Montréal, Canada)
B. Delgutte (Boston, MA, USA)
`G. Fant (Stockholm, Sweden)
`J. Flanagan (Piscataway, NJ, USA)
`J.E. Flege (Birmingham, AL, USA)
`J.P. Haton (Vandoeuvre, France)
`J. Hirschberg (Murray Hill, NJ, USA)
`M. Hunt (Cheltenham, UK)
`F. Jelinek (Baltimore, MD, USA)
`
`J.C. Junqua (Santa Barbara, CA, USA)
`J. Laver (Edinburgh, UK)
`M. Lennig (Montréal, Canada)
`B. Lindblom (Austin, TX, USA)
`J. Makhoul (Cambridge, MA, USA)
`J. Mariani (Orsay, France)
`J.B. Millar (Canberra, Australia)
E. Moulines (Paris, France)
`H. Ney (Aachen, Germany)
`L.A. Petitto (Montreal, Canada)
`P.J. Price (Menlo Park, CA, USA)
`Y. Sagisaka (Kyoto, Japan)
`J. Schoentgen (Brussels, Belgium)
`S. Seneff (Cambridge, MA, USA)
`J. Spitz (White Plains, NY, USA)
`Y. Tohkura (Kyoto, Japan)
`I. Trancoso (Lisbon, Portugal)
`J.P. Tubach (Paris, France)
D. Van Compernolle (Heverlee, Belgium)
`H. Wakita (Santa Barbara, CA, USA)
`C. Wellekens (Sophia Antipolis, France)
`A.C. Woodbury (Austin, TX, USA)
`
`Advisory Committee
`S. Furui (Tokyo, Japan)
`B. Juang (Murray Hill, NJ, USA)
`
`
Scope. Speech Communication covers all aspects of speech communication processes between humans as well as between human and machines. Speech Communication features original research work, tutorial and review articles dealing with the theoretical, empirical and practical aspects of this scientific field. Special emphasis is given to material containing an interdisciplinary point of view.
`
`Editorial policy. Speech Communication is an interdisciplinary
`Journal whose primary objective is to fill the need for the rapid dis-
`semination and thorough discussion of basic and applied research
results. In order to establish frameworks within which results from our
different fields may be interrelated, emphasis will be placed on
viewpoints and topics able to induce transdisciplinary approaches.
`The editorial policy and the technical content of the Journal are the
`responsibility of the Editors and the Advisory Committee. The Journal
`is self-supporting from subscription income. Advertisements are
`subject to the prior approval of the Editors.
`
`Subject coverage. Subject areas covered in this journal include:
`(1) Basics of oral communication and dialogue: modelling of produc-
`tion and perception processes, phonetics and phonology, syntax, se-
`mantics and pragmatics of speech communication, cognitive aspects.
`(2) Models and tools for language learning: functional organisation
`and developmental models of human language capabilities, acqui-
`sition and rehabilitation of spoken language, speech and hearing
`defects and aids.
`(3) Speech signal processing: analysis, coding, transmission, enhan-
`cement, robustness to noise.
`(4) Models for automatic speech communication: speech recognition,
`language identification, speaker recognition, speech synthesis, oral
`dialogue.
`(5) Development and evaluation tools: monolingual and multilingual
`data bases, assessment methodologies, specialised hardware and
`software packages, field experiments, market development.
`
`© 1995 Elsevier Science B.V. All rights reserved
`
Subscription information. Speech Communication (ISSN 0167-6393) is published in two volumes (eight issues) a year. For 1995 Volumes 16-17 are scheduled for publication. Subscription prices are available upon request from the publishers. Subscriptions are accepted on a prepaid basis only and are entered on a calendar year basis. Issues are sent by surface mail except to the following countries where air delivery (S.A.L. - Surface Air Lifted) is ensured: Argentina, Australia, Brazil, Canada, China, Hong Kong, India, Israel, Japan, Malaysia, Mexico, New Zealand, Pakistan, Singapore, South Africa, South Korea, Taiwan, Thailand, USA. For the rest of the world, airmail and S.A.L. charges are available upon request. Claims for missing issues will be honoured free of charge within six months after the publication date of the issues. Mail orders and inquiries to: Elsevier Science B.V., Journals Department, P.O. Box 211, 1000 AE Amsterdam, The Netherlands. For full membership information of the Associations, possibly combined with a subscription at a reduced rate, please contact: EURASIP, P.O. Box 134, CH-1000 Lausanne 13, Switzerland; ESCA, BP 7, B-1040, Brussels 40, Belgium.
`
US mailing notice - Speech Communication (ISSN 0167-6393) is published six times a year in e.g. January, February, April, June, August and November by Elsevier Science B.V., Molenwerf 1, Postbus 211, 1000 AE Amsterdam, The Netherlands. Annual subscription price in USA US $432 (subject to change), including air speed delivery. Application to mail at second class postage rate is pending at Jamaica, NY 11431.
USA POSTMASTERS: Send address changes to Speech Communication, Publications Expediting, Inc., 200 Meacham Avenue, Elmont, NY 11003. Airfreight and mailing in the USA by Publication Expediting.
`
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher, Elsevier Science B.V., Copyright and Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands.
Special regulations for authors: Upon acceptance of an article by the journal, the author(s) will be asked to transfer copyright of the article to the publisher. This transfer will ensure the widest possible dissemination of information.
Special regulations for readers in the USA: This journal has been registered with the Copyright Clearance Center, Inc. Consent is given for copying of articles for personal or internal use, or for the personal use of specific clients. This consent is given on the condition that the copier pays through the Center the per-copy fee stated in the code on the first page of each article for copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Law. The appropriate fee should be forwarded with a copy of the first page of the article to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. If no code appears in an article, the author has not given broad consent to copy and permission to copy must be obtained directly from the author. The fee indicated on the first page of an article in this issue will apply retroactively to all articles published in the journal, regardless of the year of publication. This consent does not extend to other kinds of copying, such as for general distribution, resale, advertising and promotion purposes, or for creating new collective works. Special written permission must be obtained from the publisher for such copying.
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Although all advertising material is expected to conform to ethical standards, inclusion in this publication does not constitute a guarantee or endorsement of the quality of such product or of the claims made of it by its manufacturer.
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
0167-6393/95/$09.50
`
Printed in The Netherlands
`
`
`
ELSEVIER

Speech Communication 17 (1995) 91-108

SPEECH COMMUNICATION

Speaker identification and verification using Gaussian mixture speaker models ☆
`
`Douglas A. Reynolds *
`
`MIT Lincoln Laboratory, 244 Wood St, Lexington, MA 02173, USA
`
`Received 27 September 1994; revised 9 March 1995
`
`Abstract
`
`This paper presents high performance speaker identification and verification systems based on Gaussian mixture
`speaker models:
`robust, statistically based representations of speaker identity. The identification system is a
`maximum likelihood classifier and the verification system is a likelihood ratio hypothesis tester using background
speaker normalization. The systems are evaluated on four publicly available speech databases: TIMIT, NTIMIT,
`Switchboard and YOHO. The different levels of degradations and variabilities found in these databases allow the
`examination of system performance for different task domains. Constraints on the speech range from vocabulary-de-
`pendent to extemporaneous and speech quality varies from near—ideal, clean speech to noisy, telephone speech.
`Closed set identification accuracies on the 630 speaker TIMIT and NTIMIT databases were 99.5% and 60.7%,
`respectively. On a 113 speaker population from the Switchboard database the identification accuracy was 82.8%.
`Global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in verification experiments on
`the TIMIT, NTIMIT, Switchboard and YOHO databases, respectively.
`
Zusammenfassung

This paper deals with high-performance systems for speaker identification and speaker verification based on Gaussian (normally distributed) mixture speaker models, i.e., robust, statistically based representations of speaker identity. The identification system is a maximum-likelihood classifier; the verification system is a likelihood-ratio hypothesis tester with background normalization for the speech samples. The systems are evaluated on four publicly available speech databases (TIMIT, NTIMIT, Switchboard and YOHO). The differing degradations and variabilities of the speech samples in these databases allow system performance to be examined in different task domains. The speech samples are subject to various constraints, ranging from vocabulary-dependent to extemporaneous speech, and their quality ranges from nearly ideal and clean to noisy telephone transmissions. Closed-set identification accuracy on the TIMIT and NTIMIT databases, which comprise 630 speakers, was 99.5% and 60.7%, respectively; on the Switchboard database with 113 speakers it was 82.8%. Verification experiments on the TIMIT, NTIMIT, Switchboard and YOHO databases yielded global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51%, respectively.

☆ This paper is based on a communication presented at the ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny, 5-7 April 1994, and has been recommended by the Scientific Committee of this workshop and the Editorial Board of the journal. This work was sponsored by the Department of the Air Force. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Air Force.
* Corresponding author. Tel.: (617) 981-4494. Fax: (617) 981-0186. E-mail: dar@sst.ll.mit.edu.

Elsevier Science B.V.
SSDI 0167-6393(95)00009-7
`
Résumé

This text presents two high-performance speaker identification and verification systems based on Gaussian mixture modeling, a robust statistical characterization of a speaker's identity. The identification method is a maximum-likelihood classifier; the verification method is a likelihood ratio test supported by speaker normalization. The systems were evaluated on four public speech databases: TIMIT, NTIMIT, Switchboard and YOHO. These exhibit variability and differences in quality that allow the systems to be evaluated from different points of view. The constraints range from isolated-word elocution to spontaneous speech; sound quality ranges from near perfection to telephone recording. Closed-set identification among 630 speakers, for TIMIT and NTIMIT, reached rates of 99.5% and 60.7%, respectively; for 113 Switchboard speakers, the identification rate was 82.8%. Global equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in the verification experiments on the TIMIT, NTIMIT, Switchboard and YOHO databases.
`
Keywords: Automatic speaker identification and verification; Text-independent; Vocabulary-dependent; Gaussian
mixture speaker models; TIMIT; NTIMIT; Switchboard; YOHO
`
1. Introduction

With the growing use of speech as a modality in man-machine communications, and the need to manage speech as a new data-type in multimedia applications, the utility of recognizing a person from his or her voice is increasing. While the area of speech recognition is concerned with extracting the linguistic message underlying a spoken utterance, speaker recognition is concerned with extracting the identity of the person speaking the utterance. Applications of speaker recognition are wide ranging, including facility or computer access control (Naik and Doddington, 1987; Higgins et al., 1991), telephone voice authentication for long-distance calling or banking access (Naik et al., 1989), intelligent answering machines with personalized caller greetings (Schmandt and Arons, 1984), and automatic speaker labeling of recorded meetings for speaker-dependent audio indexing (speech-skimming) (Wilcox et al., 1994; Arons, 1994).
Depending upon the application, the general area of speaker recognition is divided into two specific tasks: identification and verification. In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice sample. This is also referred to as closed-set speaker identification. Applications of pure closed-set identification are limited to cases where only enrolled speakers will be encountered, but it is a useful means of examining the separability of speakers' voices or finding similar sounding speakers, which has applications in speaker-adaptive speech recognition. In verification, the goal is to determine from a voice sample if a person is who he or she claims to be. This is sometimes referred to as the open-set problem, because this task requires distinguishing a claimed speaker's voice known to the system from a potentially large group of voices unknown to the system (i.e., imposter speakers). Verification is the basis for most speaker recognition applications and the most commercially viable task. The merger of the closed-set identification and open-set verification tasks, called open-set identification, performs like closed-set identification for known speakers but must also be able to classify speakers unknown to the system into a "none of the above" category.
These tasks are further distinguished by the constraints placed on the speech used to train and test the system and the environment in which the speech is collected (Doddington, 1985). In a text-dependent system, the speech used to train and test the system is constrained to be the same word or phrase. In a text-independent system, the training and testing speech are completely unconstrained. Between text-dependence and text-independence, a vocabulary-dependent system constrains the speech to come from a limited vocabulary, such as the digits, from which test words or phrases (e.g. digit strings) are selected. Furthermore, depending upon the amount of control allowed by the application, the speech may be collected from a noise-free environment using a wideband microphone or from a noisy, narrowband telephone channel.
In this paper a simple but effective statistical speaker representation is presented which attains excellent identification and verification performance for both text-independent and vocabulary-dependent tasks with clean, wideband and telephone speech. The Gaussian mixture speaker model was introduced in (Rose and Reynolds, 1990; Reynolds, 1992) and has demonstrated high text-independent identification accuracy for short test utterances from unconstrained, telephone quality speech. This paper extends the application to speaker verification using background speaker normalization (Higgins et al., 1991) and a likelihood ratio test. A novel technique for selecting background speakers is also presented.
Gaussian mixture model (GMM) based identification and verification systems are evaluated on four publicly available speech databases: TIMIT (Fisher et al., 1986), NTIMIT (Jankowski et al., 1990), Switchboard (Godfrey et al., 1992) and YOHO (Higgins et al., 1991; Campbell, 1992) ¹. Each database possesses different characteristics, both in task domain (e.g., text-dependency, number of speakers) and speech quality (e.g., clean wideband, noisy telephone), allowing for experimentation over a wide variety of tasks and conditions. The TIMIT database is used to examine how well text-independent speaker identification can perform under near-ideal conditions with large populations, thus providing an indication of the inherent "crowding" of the feature space. The NTIMIT database is then used to gauge the identification performance loss incurred by transmitting speech over the telephone network for the same large population experiment. The more realistic, unconstrained Switchboard database is used to determine a better measure of large population performance using telephone speech. For speaker verification, the TIMIT, NTIMIT and Switchboard databases are again used to gauge verification performance over the range of near-ideal speech to more realistic, extemporaneous telephone speech. Finally, the YOHO database is used to determine performance on a vocabulary-dependent, office-environment verification task. The effect of different background speaker selections is also examined for all of these databases.

¹ Results on the King database with the "great-divide" can be found in (Reynolds, 1994b). TIMIT and NTIMIT are available through the U.S. National Institute of Standards and Technology. Switchboard, YOHO and King are available through the Linguistic Data Consortium.

`Besides using the databases to address specific
`research questions, it is hoped that presentation
`of results on these publicly available databases
`will encourage competitive evaluations and com-
`
`parisons by other researchers in the speaker
`recognition area. There are many competing
`speaker recognition techniques found throughout
`the literature, but without evaluation on common
`databases with defined train/test paradigms it is
`extremely difficult
`to assess the merits of an
`
`approach. Moreover, few people have the time or
`resources to implement
`faithfully a competing
`scheme to see how it performs on a calibrated
`database. While not everyone is interested in the
`same task, the available databases allow evalua-
`
`tion over a wide range of identification and verifi-
`cation scenarios.
`
The rest of the paper is organized as follows. The next section gives a brief description of the Gaussian mixture speaker model. This is followed in Section 3 by a description of the identification and verification systems. Section 4 then presents descriptions and comparisons of the four databases used in this paper. The identification experimental paradigms and results on these databases are given in Section 5 followed by verification experiments in Section 6. A summary and conclusions are given in Section 7.
`
2. Gaussian mixture speaker model

The basis for both the identification and verification systems is the GMM used to represent speakers. More specifically, the distribution of feature vectors extracted from a person's speech is modeled by a Gaussian mixture density. For a D-dimensional feature vector denoted as $x$, the mixture density for speaker $s$ is defined as

$$p(x \mid \lambda_s) = \sum_{i=1}^{M} p_i^s \, b_i^s(x). \tag{1}$$

The density is a weighted linear combination of M component uni-modal Gaussian densities, $b_i^s(x)$, each parameterized by a mean vector, $\mu_i^s$, and covariance matrix, $\Sigma_i^s$:

$$b_i^s(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i^s|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x - \mu_i^s)' (\Sigma_i^s)^{-1} (x - \mu_i^s) \right\}. \tag{2}$$

The mixture weights, $p_i^s$, furthermore satisfy the constraint $\sum_{i=1}^{M} p_i^s = 1$. Collectively, the parameters of speaker $s$'s density model are denoted as $\lambda_s = \{p_i^s, \mu_i^s, \Sigma_i^s\}$, $i = 1, \ldots, M$.

While the general model form supports full covariance matrices, in this paper diagonal covariance matrices are used. This choice is based on empirical evidence that diagonal matrices outperform full matrices and the fact that the density modeling of an Mth order full covariance mixture can equally well be achieved using a larger order, diagonal covariance mixture.

Maximum likelihood speaker model parameters are estimated using the iterative Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Generally 10 iterations are sufficient for parameter convergence.

The GMM can be viewed as a hybrid between two effective models for speaker recognition: a uni-modal Gaussian classifier and a vector quantizer codebook. The GMM combines the robustness and smoothness of the parametric Gaussian model with the arbitrary density modeling of the non-parametric VQ model. It can also be viewed as a single-state HMM with a Gaussian mixture observation density or an ergodic Gaussian observation HMM with fixed, equal transition probabilities. Here, the Gaussian components can be considered to be modeling the underlying broad phonetic sounds which characterize a person's voice. A more detailed discussion of how GMMs apply to speaker modeling can be found in (Reynolds, 1992; Reynolds and Rose, 1995).

3. System descriptions

3.1. Speech analysis
`
Several processing steps occur in the front-end analysis (see Fig. 1). First, the speech is segmented into frames by a 20 ms window progressing at a 10 ms frame rate. A speech activity detector (SAD) is then used to discard silence/noise frames. The SAD is a self-normalizing, energy based detector which tracks the noise floor of the signal and can adapt to changing noise conditions (Reynolds, 1992; Reynolds et al., 1992). For text-independent speaker recognition, it is important to remove silence/noise frames from both the training and testing signal to avoid modeling and detecting the environment rather than the speaker.
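
The framing and frame-discarding steps can be sketched as follows. This is a minimal illustration, not the detector used in the paper: the function names are hypothetical, and the noise floor is simply taken as the quietest frame of the utterance plus a fixed margin (`margin_db`, an assumed parameter), whereas the SAD described above tracks the noise floor adaptively.

```python
import math


def frame_signal(samples, sample_rate, window_ms=20.0, step_ms=10.0):
    """Split samples into overlapping frames (20 ms window, 10 ms step)."""
    window = int(round(sample_rate * window_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += step
    return frames


def frame_energy_db(frame, floor=1e-12):
    """Mean-square energy of one frame in decibels."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(power, floor))


def drop_low_energy_frames(frames, margin_db=6.0):
    """Keep frames whose energy exceeds the quietest frame by margin_db.

    Simplifying assumption: the noise floor is the minimum frame energy
    of the utterance; it is not tracked over time as in the paper's SAD.
    """
    energies = [frame_energy_db(f) for f in frames]
    noise_floor = min(energies)
    return [f for f, e in zip(frames, energies) if e > noise_floor + margin_db]


# Usage: 100 ms of near-silence followed by 100 ms of a 440 Hz tone at 8 kHz.
rate = 8000
quiet = [0.001] * 800
tone = [math.sin(2.0 * math.pi * 440.0 * n / rate) for n in range(800)]
frames = frame_signal(quiet + tone, rate)
speech_frames = drop_low_energy_frames(frames)
```

A fixed margin over the minimum energy is only reasonable when every utterance contains some silence; an adaptive floor avoids that assumption.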
Next, mel-scale cepstral feature vectors are extracted from the speech frames (a detailed description of the feature extraction steps can be found in (Reynolds, 1992; Reynolds and Rose, 1995)). For bandlimited telephone speech, cepstral analysis is performed only over the mel-filters in the telephone passband (300-3400 Hz). All cepstral coefficients except c[0] are retained in the processing. This choice of features is based on previous good performance and a recent study (Reynolds, 1994b) comparing several standard speech features for speaker identification.

[Figure 1: block diagram; recoverable block labels: mel-scale cepstral analysis, channel equalization.]
Fig. 1. Front-end speech processing.

Last, the feature vectors are channel equalized via blind deconvolution. The deconvolution is implemented by subtracting the average cepstral vector from each input utterance. If training and testing speech are collected from different microphones or channels (e.g., different handsets and/or lines in telephone applications), this is a crucial step for achieving good recognition accuracy (as with the "great-divide" of the King database (Reynolds, 1994b)). However, when there is not much variability between recording microphones or channels, as with the TIMIT/NTIMIT databases, blind channel equalization can reduce accuracy. The channel equalization is used for all databases except the TIMIT and NTIMIT databases.
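
The mean subtraction itself is small enough to sketch directly. The function name and the list-of-lists data layout below are illustrative assumptions. The usage lines show why the step equalizes the channel: a fixed linear channel appears as a constant additive offset in the cepstral domain, so two copies of an utterance that differ only by such an offset give identical features after subtraction.

```python
def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance average cepstral vector from every frame.

    cepstra: list of frames, each a list of D cepstral coefficients.
    """
    if not cepstra:
        return []
    num_frames = len(cepstra)
    dim = len(cepstra[0])
    # Average cepstral vector over the whole utterance.
    mean = [sum(frame[d] for frame in cepstra) / num_frames for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in cepstra]


# Usage: the same utterance seen through a hypothetical second channel,
# modeled as a constant cepstral offset of 5.0 in every coefficient.
clean = [[1.0, 2.0], [3.0, 6.0]]
shifted = [[c + 5.0 for c in frame] for frame in clean]
equalized_clean = cepstral_mean_subtraction(clean)
equalized_shifted = cepstral_mean_subtraction(shifted)
```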
`
`3.2. Identification system
`
The identification system is a straightforward maximum-likelihood classifier. For a reference group of S speakers $\mathcal{S} = \{1, 2, \ldots, S\}$ represented by models $\lambda_1, \lambda_2, \ldots, \lambda_S$, the objective is to find the speaker model which has the maximum posterior probability for the input feature vector sequence, $X = \{x_1, \ldots, x_T\}$. The minimum error Bayes' decision rule for this problem is

$$\hat{s} = \arg\max_{1 \le s \le S} \Pr(\lambda_s \mid X) = \arg\max_{1 \le s \le S} \frac{p(X \mid \lambda_s)}{p(X)} \Pr(\lambda_s). \tag{3}$$

Assuming equal prior probabilities of speakers, the terms $\Pr(\lambda_s)$ and $p(X)$ are constant for all speakers and can be ignored in the maximum. Using logarithms and the assumed independence between observations, the decision rule becomes

$$\hat{s} = \arg\max_{1 \le s \le S} \sum_{t=1}^{T} \log p(x_t \mid \lambda_s), \tag{4}$$

in which $p(x_t \mid \lambda_s)$ is given in Eq. (1). A block diagram of the speaker identification system is shown in Fig. 2(a).

3.3. Verification system

Although requiring only a binary decision, the verification task is more difficult than the identification task in that the alternatives are less defined. The system must decide if the input voice came from the claimed speaker, with a well-defined model, or not the claimed speaker, which is ill-defined. Cast in a hypothesis testing framework, for a given input utterance X and a claimed identity the choice is between H0 and H1:

H0: X is from the claimed speaker.

H1: X is not from the claimed speaker.

To perform the optimum likelihood ratio test to decide between H0 and H1 then requires some model of the universe of possible non-claimant speakers. The application of this hypothesis testing approach is first described, followed by a discussion of techniques for selecting speakers for modeling the non-claimant alternative hypothesis.

3.3.1. General approach

The general approach used in the speaker verification system is to apply a likelihood ratio test to an input utterance to determine if the claimed speaker is accepted or rejected. For an utterance $X = \{x_1, \ldots, x_T\}$ and a claimed speaker identity with corresponding model $\lambda_C$, the likelihood ratio is

$$\frac{\Pr(X \text{ is from the claimed speaker})}{\Pr(X \text{ is not from the claimed speaker})} = \frac{\Pr(\lambda_C \mid X)}{\Pr(\lambda_{\bar{C}} \mid X)}. \tag{5}$$

Applying Bayes' rule and discarding the constant prior probabilities for claimant and imposter speakers (they are accounted for in the decision threshold), the likelihood ratio in the log domain becomes

$$\Lambda(X) = \log p(X \mid \lambda_C) - \log p(X \mid \lambda_{\bar{C}}). \tag{6}$$
`
The term $p(X \mid \lambda_C)$ is the likelihood of the utterance given it is from the claimed speaker and $p(X \mid \lambda_{\bar{C}})$ is the likelihood of the utterance given it is not from the claimed speaker. The likelihood ratio is compared to a threshold $\theta$ and the claimed speaker is accepted if $\Lambda(X) \ge \theta$ and rejected if $\Lambda(X) < \theta$. The likelihood ratio essentially measures how much better the claimant's model scores for the test utterance compared to some non-claimant model. The decision threshold is then set to adjust the trade-off between rejecting true claimant utterances (false rejection errors) and accepting non-claimant utterances (false acceptance errors).

[Figure 2: block diagrams; recoverable labels: reference speakers, identified speaker, background speakers, accept if $\Lambda(X) \ge \theta$, reject if $\Lambda(X) < \theta$.]
Fig. 2. Speaker recognition systems. (a) Identification system. (b) Verification system.
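
The threshold trade-off can be made concrete with a short sketch. The function names and the score values are made up for illustration; the only inputs are log-likelihood ratios already computed for true-claimant and imposter trials. Counting a score exactly equal to the threshold as an acceptance is an assumption of this sketch.

```python
def accept_claim(llr, theta):
    """Accept the claimed identity when the log-likelihood ratio reaches theta."""
    return llr >= theta


def error_rates(claimant_scores, imposter_scores, theta):
    """Return (false rejection rate, false acceptance rate) at threshold theta."""
    false_rejects = sum(1 for s in claimant_scores if not accept_claim(s, theta))
    false_accepts = sum(1 for s in imposter_scores if accept_claim(s, theta))
    return (false_rejects / len(claimant_scores),
            false_accepts / len(imposter_scores))


# Usage with illustrative (not measured) log-likelihood ratios.
claimant_scores = [1.2, 0.8, 0.3, -0.1]
imposter_scores = [-1.0, -0.4, 0.2, 0.5]
balanced = error_rates(claimant_scores, imposter_scores, theta=0.25)
strict = error_rates(claimant_scores, imposter_scores, theta=1.0)
```

Raising the threshold from 0.25 to 1.0 in this toy example removes every false acceptance at the cost of more false rejections; the operating point where the two rates coincide is the equal error rate reported in the abstract.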
The terms of the likelihood ratio are computed as follows. The likelihood of the utterance given the claimed speaker's model is directly computed as

$$\log p(X \mid \lambda_C) = \frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid \lambda_C). \tag{7}$$

The $\frac{1}{T}$ scale is used to normalize the likelihood for utterance duration.
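
A minimal sketch of how Eqs. (1), (2), (4) and (7) fit together for diagonal-covariance models is given below. The data layout (a model as a list of `(weight, mean, variance)` tuples with strictly positive weights) and all function names are assumptions made for illustration, and the log-sum-exp step is a standard numerical safeguard rather than something the paper specifies.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of Eq. (2) for a diagonal covariance (var holds the diagonal)."""
    log_det = sum(math.log(v) for v in var)
    mahalanobis = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + mahalanobis)


def log_mixture_density(x, model):
    """Log of Eq. (1); model is a list of (weight, mean, var) components."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in model]
    peak = max(terms)  # log-sum-exp to avoid underflow
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_log_likelihood(frames, model):
    """Eq. (7): frame log-likelihoods averaged over the T frames."""
    return sum(log_mixture_density(x, model) for x in frames) / len(frames)


def identify(frames, models):
    """Eq. (4): index of the best-scoring model.

    Dividing every score by the same T does not change the arg max,
    so the normalized score of Eq. (7) is reused here.
    """
    scores = [average_log_likelihood(frames, m) for m in models]
    return max(range(len(models)), key=lambda s: scores[s])


# Usage with two hypothetical one-dimensional, single-component speakers.
speaker_a = [(1.0, [0.0], [1.0])]
speaker_b = [(1.0, [3.0], [1.0])]
best = identify([[0.1], [-0.2], [0.3]], [speaker_a, speaker_b])
```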
`
The likelihood of the utterance given it is not from the claimed speaker is formed using a collection of background speaker models. With a set of B background speaker models, $\{\lambda_1, \ldots, \lambda_B\}$, the background speakers' log-likelihood is computed as
`
represent the population of expected imposters, which is in general application specific. In some scenarios, it may be assumed that imposters will attempt to gain access only from similar sounding or at least same-sex speakers (dedicated imposters). In a telephone based application accessible by a larger cross-section of potential imposters, on the other hand, the imposters may sound very dissimilar to the users they attack (casual imposters); for example a male imposter claiming to be a female user. Previous systems have relied on selecting background speakers whose