SPEECH COMMUNICATION

ELSEVIER

Speech Communication 17 (1995) 345-346
Cumulative contents of Volume 17

Volume 17, Nos. 1-2, August 1995

Regular papers

J. Glass, G. Flammia, D. Goodine, M. Phillips, J. Polifroni, S. Sakai, S. Seneff and V. Zue
Multilingual spoken-language understanding in the MIT Voyager system (p. 1)

V. Steinbiss, H. Ney, U. Essen, B.-H. Tran, X. Aubert, C. Dugast, R. Kneser, H.-G. Meier, M. Oerder, R. Haeb-Umbach, D. Geller, W. Höllerbauer and H. Bartosik
Continuous speech dictation - From theory to practice (p. 19)

A. De and P. Kabal
Auditory distortion measure for speech coder evaluation - Hidden Markovian approach (p. 39)

H.S. Lee and A.C. Tsoi
Application of multi-layer perceptron in estimating speech/noise characteristics for speech recognition in noisy environment (p. 59)

Special Section on Automatic Speaker Recognition, Identification and Verification
Guest Editors: F. Bimbot, G. Chollet and A. Paoloni

Editorial (p. 77)

J. de Veth and H. Bourlard
Comparison of Hidden Markov Model techniques for automatic speaker verification in real-world conditions (p. 81)

D.A. Reynolds
Speaker identification and verification using Gaussian mixture speaker models (p. 91)

T. Matsui and S. Furui
Likelihood normalization for speaker verification using a phoneme- and speaker-independent model (p. 109)

M. Forsyth
Discriminating observation probability (DOP) HMM for speaker verification (p. 117)

H.C. Choi and R.W. King
On the use of spectral transformation for speaker adaptation in HMM based isolated-word speech recognition (p. 131)

P. Thévenaz and H. Hügli
Usefulness of the LPC-residue in text-independent speaker verification (p. 145)

Y. Bennani and P. Gallinari
Neural networks for discrimination and modelization of speakers (p. 159)

F. Bimbot, I. Magrin-Chagnolleau and L. Mathan
Second-order statistical measures for text-independent speaker identification (p. 177)

J. Oglesby
What's in a number? Moving beyond the equal error rate (p. 193)

Advance table of contents (p. 209)
News (p. 211)

Elsevier Science B.V.
Volume 17, Nos. 3-4, November 1995

Special Issue on Interactive Voice Technology for Telecommunication Applications
Guest Editors: D.B. Roe and S. Furui

Editorial (p. 215)

L.R. Rabiner
The impact of voice processing on modern telecommunications (p. 217)

M. Lennig, G. Bielby and J. Massicotte
Directory assistance automation in Bell Canada: Trial results (p. 227)

G.J. Vysotsky
VoiceDialing(SM) - The first speech recognition based service delivered to customer's home from the telephone network (p. 235)

H. Aust, M. Oerder, F. Seide and V. Steinbiss
The Philips automatic train timetable information system (p. 249)

R. Billi, F. Canavesio, A. Ciaramella and L. Nebbia
Interactive voice technology at work: The CSELT experience (p. 263)

C. Sorin, D. Jouvet, C. Gagnoulet, D. Dubois, D. Sadek and M. Toularhoat
Operational and experimental French telecommunication services using CNET speech recognition and text-to-speech synthesis (p. 273)

J. Takahashi, N. Sugamura, T. Hirokawa, S. Sagayama and S. Furui
Interactive voice technology development for telecommunications applications (p. 287)

C.A. Kamm, C.R. Shamieh and S. Singhal
Speech recognition issues for directory assistance applications (p. 303)

B. Mazor and B.L. Zeigler
The design of speech-interactive dialogs for transaction-automation systems (p. 313)

T. Matsumura and S. Matsunaga
Non-uniform unit based HMMs for continuous speech recognition (p. 321)

Erratum (p. 331)
Acknowledgements (p. 335)
Advance table of contents (p. 337)
News (p. 339)
Author index of Volume 17 (p. 343)
Cumulative contents of Volume 17 (p. 345)
SPEECH COMMUNICATION

Founding Editor (1981-1993)
Max Wajskop †

Editor-in-Chief
Christel Sorin
France Telecom CNET
LAA/TSS/RCP Bat D
Technopole Anticipa - 2 Avenue Pierre Marzin
22307 Lannion Cedex
France
Tel: (33) 96 05 33 06
Fax: (33) 96 05 35 30
E-mail: sorin@lannion.cnet.fr

Assistant Editor
Laurent Miclet
(same address)
E-mail: miclet@lannion.cnet.fr

Editorial Board
R. Billi (Torino, Italy)
L.J. Boé (St Martin d'Hères, France)
L. Boves (Nijmegen, The Netherlands)
R. Carlson (Stockholm, Sweden)
C.J. Darwin (Brighton, UK)
B. De Boysson-Bardies (Paris, France)
R. De Mori (Montréal, Canada)
B. Delgutte (Boston, MA, USA)
G. Fant (Stockholm, Sweden)
J. Flanagan (Piscataway, NJ, USA)
J.E. Flege (Birmingham, AL, USA)
J.P. Haton (Vandoeuvre, France)
J. Hirschberg (Murray Hill, NJ, USA)
M. Hunt (Cheltenham, UK)
F. Jelinek (Baltimore, MD, USA)
J.C. Junqua (Santa Barbara, CA, USA)
J. Laver (Edinburgh, UK)
M. Lennig (Montréal, Canada)
B. Lindblom (Austin, TX, USA)
J. Makhoul (Cambridge, MA, USA)
J. Mariani (Orsay, France)
J.B. Millar (Canberra, Australia)
E. Moulines (Paris, France)
H. Ney (Aachen, Germany)
L.A. Petitto (Montréal, Canada)
P.J. Price (Menlo Park, CA, USA)
Y. Sagisaka (Kyoto, Japan)
J. Schoentgen (Brussels, Belgium)
S. Seneff (Cambridge, MA, USA)
J. Spitz (White Plains, NY, USA)
Y. Tohkura (Kyoto, Japan)
I. Trancoso (Lisbon, Portugal)
J.P. Tubach (Paris, France)
D. Van Compernolle (Heverlee, Belgium)
H. Wakita (Santa Barbara, CA, USA)
C. Wellekens (Sophia Antipolis, France)
A.C. Woodbury (Austin, TX, USA)

Advisory Committee
S. Furui (Tokyo, Japan)
B. Juang (Murray Hill, NJ, USA)
M. Kunt (Lausanne, Switzerland; EURASIP)
M. Liberman (Philadelphia, PA, USA)
J. Ohala (Berkeley, CA, USA)
L. Pols (Amsterdam, The Netherlands; ESCA)
Scope. Speech Communication covers all aspects of speech communication processes between humans as well as between human and machines. Speech Communication features original research work, tutorial and review articles dealing with the theoretical, empirical and practical aspects of this scientific field. Special emphasis is given to material containing an interdisciplinary point of view.

Editorial policy. Speech Communication is an interdisciplinary Journal whose primary objective is to fill the need for the rapid dissemination and thorough discussion of basic and applied research results. In order to establish frameworks within which results from our different fields may be interrelated, emphasis will be placed on viewpoints and topics able to induce transdisciplinary approaches. The editorial policy and the technical content of the Journal are the responsibility of the Editors and the Advisory Committee. The Journal is self-supporting from subscription income. Advertisements are subject to the prior approval of the Editors.
Subject coverage. Subject areas covered in this journal include:
(1) Basics of oral communication and dialogue: modelling of production and perception processes, phonetics and phonology, syntax, semantics and pragmatics of speech communication, cognitive aspects.
(2) Models and tools for language learning: functional organisation and developmental models of human language capabilities, acquisition and rehabilitation of spoken language, speech and hearing defects and aids.
(3) Speech signal processing: analysis, coding, transmission, enhancement, robustness to noise.
(4) Models for automatic speech communication: speech recognition, language identification, speaker recognition, speech synthesis, oral dialogue.
(5) Development and evaluation tools: monolingual and multilingual data bases, assessment methodologies, specialised hardware and software packages, field experiments, market development.
© 1995 Elsevier Science B.V. All rights reserved
Subscription information. Speech Communication (ISSN 0167-6393) is published in two volumes (eight issues) a year. For 1995, Volumes 16-17 are scheduled for publication. Subscription prices are available upon request from the publishers. Subscriptions are accepted on a prepaid basis only and are entered on a calendar year basis. Issues are sent by surface mail except to the following countries where air delivery (S.A.L. - Surface Air Lifted) is ensured: Argentina, Australia, Brazil, Canada, China, Hong Kong, India, Israel, Japan, Malaysia, Mexico, New Zealand, Pakistan, Singapore, South Africa, South Korea, Taiwan, Thailand, USA. For the rest of the world, airmail and S.A.L. charges are available upon request. Claims for missing issues will be honoured free of charge within six months after the publication date of the issues. Mail orders and inquiries to: Elsevier Science B.V., Journals Department, P.O. Box 211, 1000 AE Amsterdam, The Netherlands. For full membership information of the Associations, possibly combined with a subscription at a reduced rate, please contact: EURASIP, P.O. Box 134, CH-1000 Lausanne 13, Switzerland; ESCA, BP 7, B-1040 Brussels 40, Belgium.
US mailing notice - Speech Communication (ISSN 0167-6393) is published six times a year in e.g. January, February, April, June, August and November by Elsevier Science B.V., Molenwerf 1, Postbus 211, 1000 AE Amsterdam, The Netherlands. Annual subscription price in the USA US $432 (subject to change), including air speed delivery. Application to mail at second class postage rate is pending at Jamaica, NY 11431. USA POSTMASTERS: Send address changes to Speech Communication, Publications Expediting, Inc., 200 Meacham Avenue, Elmont, NY 11003. Airfreight and mailing in the USA by Publication Expediting.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher, Elsevier Science B.V., Copyright and Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands.
Special regulations for authors: Upon acceptance of an article by the journal, the author(s) will be asked to transfer copyright of the article to the publisher. This transfer will ensure the widest possible dissemination of information.
Special regulations for readers in the USA: This journal has been registered with the Copyright Clearance Center, Inc. Consent is given for copying of articles for personal or internal use, or for the personal use of specific clients. This consent is given on the condition that the copier pays through the Center the per-copy fee stated in the code on the first page of each article for copying beyond that permitted by Sections 107 or 108 of the US Copyright Law. The appropriate fee should be forwarded with a copy of the first page of the article to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. If no code appears in an article, the author has not given broad consent to copy and permission to copy must be obtained directly from the author. The fee indicated on the first page of an article in this issue will apply to articles published in the journal, regardless of the year of publication. This consent does not extend to other kinds of copying, such as for general distribution and promotion purposes, or for creating new collective works. Special written permission must be obtained from the publisher for such copying.
No responsibility is assumed by the Publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Although all advertising material is expected to conform to ethical standards, inclusion in this publication does not constitute a guarantee or endorsement of the quality of such product or of the claims made of it by its manufacturer.
The paper used in this publication meets the requirements of ANSI/NISO Z39.48-1992 (Permanence of Paper).
0167-6393/95/$09.50
Printed in The Netherlands

Speech Communication 17 (1995) 91-108
Speaker identification and verification using Gaussian mixture speaker models ☆

Douglas A. Reynolds *

MIT Lincoln Laboratory, 244 Wood St., Lexington, MA 02173, USA

Received 27 September 1994; revised 9 March 1995
Abstract

This paper presents high performance speaker identification and verification systems based on Gaussian mixture speaker models: robust, statistically based representations of speaker identity. The identification system is a maximum likelihood classifier and the verification system is a likelihood ratio hypothesis tester using background speaker normalization. The systems are evaluated on four publicly available speech databases: TIMIT, NTIMIT, Switchboard and YOHO. The different levels of degradations and variabilities found in these databases allow the examination of system performance for different task domains. Constraints on the speech range from vocabulary-dependent to extemporaneous, and speech quality varies from near-ideal, clean speech to noisy, telephone speech. Closed set identification accuracies on the 630 speaker TIMIT and NTIMIT databases were 99.5% and 60.7%, respectively. On a 113 speaker population from the Switchboard database the identification accuracy was 82.8%. Global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in verification experiments on the TIMIT, NTIMIT, Switchboard and YOHO databases, respectively.
Zusammenfassung

[Translated from the German.] This paper deals with high-performance systems for speaker identification and speaker verification based on Gaussian mixture speaker models, i.e. robust, statistically balanced representations of speaker identity. The identification system is a maximum likelihood classifier; the verification system is a likelihood ratio hypothesis tester with background speaker normalization. The systems are evaluated on four publicly accessible speech databases (TIMIT, NTIMIT, Switchboard and YOHO). The different degradations and variabilities present in the speech of these databases allow the system performance to be examined across different task domains. The speech is subject to constraints ranging from a fixed vocabulary to extemporaneous utterances, and its quality ranges from nearly ideal and clean to noisy telephone transmissions. Closed-set identification accuracy on the 630-speaker TIMIT and NTIMIT databases was 99.5% and 60.7%, respectively; on the 113-speaker Switchboard database it was 82.8%. Verification experiments with the TIMIT, NTIMIT, Switchboard and YOHO databases yielded global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51%, respectively.
☆ This paper is based on a communication presented at the ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny, 5-7 April 1994, and has been recommended by the Scientific Committee of this workshop and the Editorial Board of the journal. This work was sponsored by the Department of the Air Force. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Air Force.
* Corresponding author. Tel.: (617) 981-4494. Fax: (617) 981-0186. E-mail: dar@sst.ll.mit.edu.

Elsevier Science B.V.
SSDI 0167-6393(95)00009-7
Résumé

[Translated from the French.] This text presents two high-performance speaker identification and verification systems based on Gaussian mixture modelling, a robust statistical characterization of a speaker's identity. The identification method is a maximum likelihood classifier; the verification method is a likelihood ratio test supported by speaker normalization. The systems were evaluated on four public speech databases: TIMIT, NTIMIT, Switchboard and YOHO. These exhibit a variability and differences in quality that allow the systems to be evaluated from several points of view. The constraints vary from isolated-word utterances to spontaneous speech; the sound quality ranges from near-perfection to telephone recordings. Closed-set identification over 630 speakers, for TIMIT and NTIMIT, reached accuracies of 99.5% and 60.7% respectively; over 113 Switchboard speakers, identification accuracy was 82.8%. Global equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in the verification experiments on the TIMIT, NTIMIT, Switchboard and YOHO databases.

Keywords: Automatic speaker identification and verification; Text-independent; Vocabulary-dependent; Gaussian mixture speaker models; TIMIT; NTIMIT; Switchboard; YOHO
1. Introduction

With the growing use of speech as a modality in man-machine communications, and the need to manage speech as a new data-type in multimedia applications, the utility of recognizing a person from his or her voice is increasing. While the area of speech recognition is concerned with extracting the linguistic message underlying a spoken utterance, speaker recognition is concerned with extracting the identity of the person speaking the utterance. Applications of speaker recognition are wide ranging, including facility or computer access control (Naik and Doddington, 1987; Higgins et al., 1991), telephone voice authentication for long-distance calling or banking access (Naik et al., 1989), intelligent answering machines with personalized caller greetings (Schmandt and Arons, 1984), and automatic speaker labeling of recorded meetings for speaker-dependent audio indexing (speech-skimming) (Wilcox et al., 1994; Arons, 1994).

Depending upon the application, the general area of speaker recognition is divided into two specific tasks: identification and verification. In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice sample. This is also referred to as closed-set identification. Applications of pure closed-set identification are limited to cases where only enrolled speakers will be encountered, but it is a useful means of examining the separability of speakers' voices or finding similar-sounding speakers for model adaptations in speaker-adaptive speech recognition. In verification, the goal is to determine from a voice sample if a person is who he or she claims to be. This is sometimes referred to as the open-set problem, because this task requires distinguishing a claimed speaker's voice known to the system from a potentially large group of voices unknown to the system (i.e., imposter speakers). Verification is the basis for most speaker recognition applications and the most commercially viable task. The merger of the closed-set identification and open-set verification tasks, called open-set identification, performs like closed-set identification for known speakers but must also be able to classify speakers unknown to the system into a "none of the above" category.

These tasks are further distinguished by the constraints placed on the speech used to train and test the system and the environment in which
the speech is collected (Doddington, 1985). In a text-dependent system, the speech used to train and test the system is constrained to be the same word or phrase. In a text-independent system, the training and testing speech are completely unconstrained. Between text-dependence and text-independence, a vocabulary-dependent system constrains the speech to come from a limited vocabulary, such as the digits, from which test words or phrases (e.g. digit strings) are selected. Furthermore, depending upon the amount of control allowed by the application, the speech may be collected from a noise-free environment using a wideband microphone or from a noisy, narrowband telephone channel.

In this paper a simple but effective statistical speaker representation is presented which attains excellent identification and verification performance for both text-independent and vocabulary-dependent tasks with clean, wideband and telephone speech. The Gaussian mixture speaker model was introduced in (Rose and Reynolds, 1990; Reynolds, 1992) and has demonstrated high text-independent identification accuracy for short test utterances from unconstrained, telephone quality speech. This paper extends the application to speaker verification using background speaker normalization (Higgins et al., 1991) and a likelihood ratio test. A novel technique for selecting background speakers is also presented.

Gaussian mixture model (GMM) based identification and verification systems are evaluated on four publicly available speech databases: TIMIT (Fisher et al., 1986), NTIMIT (Jankowski et al., 1990), Switchboard (Godfrey et al., 1992) and YOHO (Higgins et al., 1991; Campbell, 1992).¹ Each database possesses different characteristics, both in task domain (e.g., text-dependency, number of speakers) and speech quality (e.g., clean wideband, noisy telephone), allowing for experimentation over a wide variety of tasks and conditions. The TIMIT database is used to examine how well text-independent speaker identification can perform under near-ideal conditions with large populations, thus providing an indication of the inherent "crowding" of the feature space. The NTIMIT database is then used to gauge the identification performance loss incurred by transmitting speech over the telephone network for the same large population experiment. The more realistic, unconstrained Switchboard database is used to determine a better measure of large population performance using telephone speech. For speaker verification, the TIMIT, NTIMIT and Switchboard databases are again used to gauge verification performance over the range of near-ideal speech to more realistic, extemporaneous telephone speech. Finally, the YOHO database is used to determine performance on a vocabulary-dependent, office-environment verification task. The effect of different background speaker selections is also examined for all of these databases.

¹ Results on the King database with the "great-divide" can be found in (Reynolds, 1994b). TIMIT and NTIMIT are available through the U.S. National Institute of Standards and Technology. Switchboard, YOHO and King are available through the Linguistic Data Consortium.
Besides using the databases to address specific research questions, it is hoped that presentation of results on these publicly available databases will encourage competitive evaluations and comparisons by other researchers in the speaker recognition area. There are many competing speaker recognition techniques found throughout the literature, but without evaluation on common databases with defined train/test paradigms it is extremely difficult to assess the merits of an approach. Moreover, few people have the time or resources to faithfully implement a competing scheme to see how it performs on a calibrated database. While not everyone is interested in the same task, the available databases allow evaluation over a wide range of identification and verification scenarios.

The rest of the paper is organized as follows. The next section gives a brief description of the Gaussian mixture speaker model. This is followed in Section 3 by a description of the identification and verification systems. Section 4 then presents descriptions and comparisons of the four databases used in this paper. The identification
experimental paradigms and results on these databases are given in Section 5, followed by verification experiments in Section 6. A summary and conclusions are given in Section 7.

2. Gaussian mixture speaker model

The basis for both the identification and verification systems is the GMM used to represent speakers. More specifically, the distribution of feature vectors extracted from a person's speech is modeled by a Gaussian mixture density. For a $D$-dimensional feature vector denoted as $\vec{x}$, the mixture density for speaker $s$ is defined as

$$p(\vec{x}\,|\,\lambda_s) = \sum_{i=1}^{M} p_i^s\, b_i^s(\vec{x}). \qquad (1)$$

The density is a weighted linear combination of $M$ component uni-modal Gaussian densities, $b_i^s(\vec{x})$, each parameterized by a mean vector, $\vec{\mu}_i^s$, and covariance matrix, $\Sigma_i^s$:

$$b_i^s(\vec{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i^s|^{1/2}} \exp\!\left\{-\tfrac{1}{2}\,(\vec{x}-\vec{\mu}_i^s)'\,(\Sigma_i^s)^{-1}\,(\vec{x}-\vec{\mu}_i^s)\right\}. \qquad (2)$$

The mixture weights, $p_i^s$, furthermore satisfy the constraint $\sum_{i=1}^{M} p_i^s = 1$. Collectively, the parameters of speaker $s$'s density model are denoted as $\lambda_s = \{p_i^s, \vec{\mu}_i^s, \Sigma_i^s\}$, $i = 1, \ldots, M$.

While the general model form supports full covariance matrices, in this paper diagonal covariance matrices are used. This choice is based on empirical evidence that diagonal matrices outperform full matrices and the fact that the density modeling of an $M$th order full covariance mixture can equally well be achieved using a larger order, diagonal covariance mixture.

Maximum likelihood speaker model parameters are estimated using the iterative Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Generally 10 iterations are sufficient for parameter convergence.

The GMM can be viewed as a hybrid between two effective models for speaker recognition: a uni-modal Gaussian classifier and a vector quantizer codebook. The GMM combines the robustness and smoothness of the parametric Gaussian model with the arbitrary density modeling of the non-parametric VQ model. It can also be viewed as a single-state HMM with a Gaussian mixture observation density, or an ergodic Gaussian observation HMM with fixed, equal transition probabilities. Here, the Gaussian components can be considered to be modeling the underlying broad phonetic sounds which characterize a person's voice. A more detailed discussion of how GMMs apply to speaker modeling can be found in (Reynolds, 1992; Reynolds and Rose, 1995).
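The mixture density of Eqs. (1) and (2) is simple to evaluate in software. The sketch below is illustrative only (the function names are ours, not from the paper); it computes $\log p(\vec{x}\,|\,\lambda_s)$ for the diagonal-covariance case used here, with a log-sum-exp over the $M$ components for numerical stability:

```python
import math

def log_gauss_diag(x, mean, var):
    # Log of the diagonal-covariance Gaussian density of Eq. (2).
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + quad)

def gmm_log_density(x, weights, means, variances):
    # log p(x | lambda_s) of Eq. (1); weights must sum to 1.
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)  # log-sum-exp avoids underflow for large D or M
    return mx + math.log(sum(math.exp(l - mx) for l in logs))
```

With a single component of weight 1 this reduces exactly to the Gaussian log-density, which is a convenient sanity check.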
3. System descriptions

3.1. Speech analysis

Several processing steps occur in the front-end analysis (see Fig. 1). First, the speech is segmented into frames by a 20 ms window progressing at a 10 ms frame rate. A speech activity detector (SAD) is then used to discard silence/noise frames. The SAD is a self-normalizing, energy based detector which tracks the noise floor of the signal and can adapt to changing noise conditions (Reynolds, 1992; Reynolds et al., 1992). For text-independent speaker recognition, it is important to remove silence/noise frames from both the training and testing signal to avoid modeling and detecting the environment rather than the speaker.
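The paper does not reproduce the detector's update equations; the following is a simplified stand-in for an energy-based, noise-floor-tracking SAD of the kind described (the threshold margin and smoothing constant are illustrative choices, not values from the paper):

```python
import math

def frame_energies_db(frames):
    # Per-frame log energy in dB; frames are lists of samples.
    return [10 * math.log10(sum(s * s for s in f) + 1e-12) for f in frames]

def speech_frames(frames, margin_db=3.0, alpha=0.995):
    # Keep frames whose energy exceeds a tracked noise floor by margin_db.
    # The floor drops immediately on low-energy frames but rises only
    # slowly (alpha near 1), so it adapts to changing noise conditions.
    energies = frame_energies_db(frames)
    head = energies[:10] if len(energies) >= 10 else energies
    floor = min(head)
    kept = []
    for f, e in zip(frames, energies):
        if e < floor:
            floor = e                                  # track a new, lower floor
        else:
            floor = alpha * floor + (1 - alpha) * e    # creep upward slowly
        if e > floor + margin_db:
            kept.append(f)
    return kept
```

On a signal with quiet and loud frames, only the loud frames survive, which is the behavior the recognizer needs before training or testing.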
Next, mel-scale cepstral feature vectors are extracted from the speech frames (a detailed description of the feature extraction steps can be found in (Reynolds, 1992; Reynolds and Rose, 1995)). For bandlimited telephone speech, cepstral analysis is performed only over the mel-filters in the telephone passband (300-3400 Hz). All
Fig. 1. Front-end speech processing.
cepstral coefficients except c[0] are retained in the processing. This choice of features is based on previous good performance and a recent study (Reynolds, 1994b) comparing several standard speech features for speaker identification.

Last, the feature vectors are channel equalized via blind deconvolution. The deconvolution is implemented by subtracting the average cepstral vector from each input utterance. If training and testing speech are collected from different microphones or channels (e.g., different handsets and/or lines in telephone applications), this is a crucial step for achieving good recognition accuracy (as with the "great-divide" of the King database (Reynolds, 1994b)). However, when there is not much variability between recording microphones or channels, as with the TIMIT/NTIMIT databases, blind channel equalization can reduce accuracy. The channel equalization is used for all databases except the TIMIT and NTIMIT databases.
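This blind deconvolution step (cepstral mean subtraction) is a one-liner in practice; a fixed convolutional channel is additive in the cepstral domain, so subtracting the utterance-average cepstral vector removes it. A minimal sketch (the function name is ours):

```python
def cepstral_mean_subtraction(cepstra):
    # Subtract the utterance-average cepstral vector from every frame.
    # cepstra: list of frames, each a list of cepstral coefficients.
    n = len(cepstra)
    dim = len(cepstra[0])
    mean = [sum(frame[d] for frame in cepstra) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in cepstra]
```

After equalization, every coefficient averages to zero over the utterance, so a constant per-channel offset cannot be learned as a speaker trait.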
3.2. Identification system

The identification system is a straight-forward maximum-likelihood classifier. For a reference group of $S$ speakers $\mathcal{S} = \{1, 2, \ldots, S\}$ represented by models $\lambda_1, \lambda_2, \ldots, \lambda_S$, the objective is to find the speaker model which has the maximum posterior probability for the input feature vector sequence, $X = \{\vec{x}_1, \ldots, \vec{x}_T\}$. The minimum error Bayes' decision rule for this problem is

$$\hat{s} = \arg\max_{1 \le s \le S} \Pr(\lambda_s\,|\,X) = \arg\max_{1 \le s \le S} \frac{p(X\,|\,\lambda_s)}{p(X)}\,\Pr(\lambda_s). \qquad (3)$$

Assuming equal prior probabilities of speakers, the terms $\Pr(\lambda_s)$ and $p(X)$ are constant for all speakers and can be ignored in the maximum. Using logarithms and the assumed independence between observations, the decision rule becomes

$$\hat{s} = \arg\max_{1 \le s \le S} \sum_{t=1}^{T} \log p(\vec{x}_t\,|\,\lambda_s), \qquad (4)$$

in which $p(\vec{x}_t\,|\,\lambda_s)$ is given in Eq. (1). A block diagram of the speaker identification system is shown in Fig. 2(a).

3.3. Verification system

Although requiring only a binary decision, the verification task is more difficult than the identification task in that the alternatives are less defined. The system must decide if the input voice came from the claimed speaker, with a well-defined model, or not the claimed speaker, which is ill-defined. Cast in a hypothesis testing framework, for a given input utterance $X$ and a claimed identity the choice is between $H_0$ and $H_1$:

$H_0$: $X$ is from the claimed speaker.

$H_1$: $X$ is not from the claimed speaker.

To perform the optimum likelihood ratio test to decide between $H_0$ and $H_1$ then requires some model of the universe of possible non-claimant speakers. The application of this hypothesis testing approach is first described, followed by a discussion of a technique for selecting speakers for modeling the non-claimant alternative hypothesis.

3.3.1. General approach

The general approach used in the speaker verification system is to apply a likelihood ratio test to an input utterance to determine if the claimed speaker is accepted or rejected. For an utterance $X = \{\vec{x}_1, \ldots, \vec{x}_T\}$ and a claimed speaker identity with corresponding model $\lambda_C$, the likelihood ratio is

$$\frac{\Pr(X \text{ is from the claimed speaker})}{\Pr(X \text{ is not from the claimed speaker})} = \frac{\Pr(\lambda_C\,|\,X)}{\Pr(\lambda_{\bar{C}}\,|\,X)}. \qquad (5)$$

Applying Bayes' rule and discarding the constant prior probabilities for claimant and imposter speakers (they are accounted for in the decision threshold), the likelihood ratio in the log domain becomes

$$\Lambda(X) = \log p(X\,|\,\lambda_C) - \log p(X\,|\,\lambda_{\bar{C}}). \qquad (6)$$

The term $p(X\,|\,\lambda_C)$ is the likelihood of the utterance given it is from the claimed speaker and $p(X\,|\,\lambda_{\bar{C}})$ is the likelihood of the utterance given it
Fig. 2. Speaker recognition systems. (a) Identification system. (b) Verification system.
`is not from the claimed speaker. The likelihood
`ratio is compared to a threshold 6 and the claimed
`speaker is accepted if A(X) > 0 and rejected if
`A(X) < 6. The likelihood ratio essentially mea-
`sures how much better
`the claimant’s model
`
`scores for the test utterance compared to some
`non-claimant model. The decision threshold is
`
`then set to adjust the trade off between rejecting
`true claimant utterances (false rejection errors)
`
`and accepting .non—claimant utterances (false ac-
`ceptance errors).
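As a minimal sketch of this decision rule (function and score names are illustrative, not from the paper), the accept/reject test on Eq. (6) can be written as:

```python
def verify(log_lik_claimant, log_lik_background, theta):
    """Likelihood-ratio decision: Lambda(X) = log p(X|lambda_C)
    minus log p(X|lambda_C-bar), compared with threshold theta."""
    Lambda = log_lik_claimant - log_lik_background
    return "accept" if Lambda > theta else "reject"

# The claimant model scores better than the background model here,
# so the claim is accepted at theta = 0.
print(verify(-48.2, -55.7, theta=0.0))  # -> accept
```

Raising theta trades false acceptance errors for false rejection errors, matching the threshold discussion above.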
The terms of the likelihood ratio are computed as follows. The likelihood of the utterance given the claimed speaker's model is directly computed as

    \log p(X \mid \lambda_C) = \frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid \lambda_C).    (7)
`
`
The 1/T scaling is used to normalize the likelihood for utterance duration.
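Eq. (7) is simply a per-frame average; a small sketch (names assumed, not from the paper) makes the duration normalization concrete:

```python
def avg_log_likelihood(frame_log_liks):
    """Duration-normalized log-likelihood, Eq. (7): average the
    per-frame values log p(x_t | lambda_C) over the T frames so a
    long utterance does not accumulate a lower score than a short one."""
    T = len(frame_log_liks)
    return sum(frame_log_liks) / T

# Two utterances of different lengths but the same per-frame fit
# receive the same normalized score.
print(avg_log_likelihood([-3.0, -3.0]))        # -> -3.0
print(avg_log_likelihood([-3.0, -3.0, -3.0]))  # -> -3.0
```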
`
The likelihood of the utterance given it is not from the claimed speaker is formed using a collection of background speaker models. With a set of B background speaker models, {\lambda_1, ..., \lambda_B}, the background speakers' log-likelihood is computed as

    \log p(X \mid \lambda_{\bar{C}}) = \log \left[ \frac{1}{B} \sum_{b=1}^{B} p(X \mid \lambda_b) \right].    (8)

The background speakers should represent the population of expected imposters, which is in general application specific. In some scenarios, it may be assumed that imposters will attempt to gain access only from similar-sounding or at least same-sex speakers (dedicated imposters). In a telephone-based application accessible by a larger cross-section of potential imposters, on the other hand, the imposters may sound very dissimilar to the users they attack (casual imposters); for example, a male imposter claiming to be a female user. Previous systems have relied on selecting background speakers whose
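The composite background log-likelihood described above — an equal-weight combination of the B background speakers — can be evaluated stably in the log domain; a sketch (function name assumed, and using the standard log-sum-exp trick, which the paper does not spell out):

```python
import math

def background_log_likelihood(per_speaker_log_liks):
    """Combine B background speakers' log-likelihoods log p(X|lambda_b)
    into log[(1/B) * sum_b p(X|lambda_b)], evaluated with the
    log-sum-exp trick so the raw likelihoods never underflow."""
    B = len(per_speaker_log_liks)
    m = max(per_speaker_log_liks)
    return m + math.log(sum(math.exp(v - m) for v in per_speaker_log_liks)) - math.log(B)
```

With a single background speaker this reduces to that speaker's own log-likelihood, which is a quick sanity check on the formula.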
