Maes

(10) Patent No.: US 6,182,037 B1
(45) Date of Patent: *Jan. 30, 2001

(54) SPEAKER RECOGNITION OVER LARGE POPULATION WITH FAST AND DETAILED MATCHES
`
(75) Inventor: Stephane Herman Maes, Danbury, CT (US)
`
(73) Assignee: International Business Machines Corporation

(*) Notice: This patent issued on a continued prosecution application filed under 37 CFR 1.53(d), and is subject to the twenty year patent term provisions of 35 U.S.C. 154(a)(2).

Under 35 U.S.C. 154(b), the term of this patent shall be extended for 0 days.
`
(21) Appl. No.: 08/851,982
(22) Filed: May 6, 1997

(51) Int. Cl.7: G10L 17/00
(52) U.S. Cl.: 704/247; 704/245
(58) Field of Search: 704/246, 247, 250, 249, 243, 245, 244
`
Primary Examiner: David R. Hudspeth
Assistant Examiner: Harold Zintel
(74) Attorney, Agent, or Firm: McGuireWoods, LLP; Paul J. Otterstedt
(57) ABSTRACT

Fast and detailed match techniques for speaker recognition are combined into a hybrid system in which speakers are associated in groups when potential confusion is detected between a speaker being enrolled and a previously enrolled speaker. Thus the detailed match techniques are invoked only at the potential onset of saturation of the fast match technique while the detailed match is facilitated by limitation of comparisons to the group and the development of speaker-dependent models which principally function to distinguish between members of a group rather than to more fully characterize each speaker. Thus storage and computational requirements are limited and fast and accurate speaker recognition can be extended over populations of speakers which would degrade or saturate fast match systems and degrade performance of detailed match systems.
`
`23 Claims, 2 Drawing Sheets
`
[FIG. 1: block diagram/flow chart. Fast match 100: acoustic front-end (VQ) 110, feature vectors 120, speaker-dependent (cluster) codebooks 130, frame-by-frame decoding 140, histogram counter 150, comparator 160, speaker/class selection, tentative ID/decision. Speaker/class database 300. Detailed match 200: speaker-dependent model 210, class/cohort model 220, cohort and speech decoding speaker recognition engine 230, decision.]
`Amazon / Zentian Limited
`Exhibit 1018
`Page 1
`
`
`
`
[FIGS. 2A and 2B (Sheet 2 of 2): histograms plotted against SPEAKER number.]

`
SPEAKER RECOGNITION OVER LARGE POPULATION WITH FAST AND DETAILED MATCHES
`
`BACKGROUND OF THE INVENTION
`1. Field of the Invention
The present invention generally relates to speaker identification and verification in speech recognition systems and, more particularly, to rapid and text-independent speaker identification and verification over a large population of enrolled speakers.
`2. Description of the Prior Art
Many electronic devices require input from a user in order to convey to the device particular information required to determine or perform a desired function or, in a trivially simple case, when a desired function is to be performed as would be indicated by, for example, activation of an on/off switch. When multiple different inputs are possible, a keyboard comprising an array of two or more switches has been the input device of choice in recent years.
However, keyboards of any type have inherent disadvantages. Most evidently, keyboards include a plurality of distributed actuable areas, each generally including moving parts subject to wear and damage and which must be sized to be actuated by a portion of the body unless a stylus or other separate mechanical expedient is employed. Accordingly, in many types of devices, such as input panels for security systems and electronic calculators, the size of the device is often determined by the dimensions of the keypad rather than the electronic contents of the housing. Additionally, numerous keystrokes may be required (e.g. to specify an operation, enter a security code, personal identification number (PIN), etc.) which slows operation and increases the possibility that erroneous actuation may occur. Therefore, use of a keyboard or other manually manipulated input structure requires action which is not optimally natural or expeditious for the user.
In an effort to provide a more naturally usable, convenient and rapid interface and to increase the capabilities thereof, numerous approaches to voice or sound detection and recognition systems have been proposed and implemented with some degree of success. Additionally, such systems could theoretically have the capability of matching utterances of a user against utterances of enrolled speakers for granting or denying access to resources of the device or system, identifying enrolled speakers or calling customized command libraries in accordance with speaker identity in a manner which may be relatively transparent and convenient to the user.
However, large systems including large resources are likely to have a large number of potential users and thus require massive amounts of storage and processing overhead to recognize speakers when the population of enrolled speakers becomes large. Saturation of the performance of speaker recognition systems will occur for simple and fast systems designed to quickly discriminate among different speakers when the size of the speaker population increases. Performance of most speaker-dependent systems (e.g. systems performing decoding of the utterance and aligning on the decoded script models such as hidden Markov models (HMM) adapted to the different speakers, the models presenting the highest likelihood of correct decoding identifying the speaker, and which may be text-dependent or text-independent) also degrades over large speaker populations, but the tendency toward saturation and performance degradation is encountered over smaller populations with fast, simple systems which discriminate between speakers based on smaller amounts of information and thus tend to return ambiguous results when data for larger populations results in smaller differences between instances of data.
As an illustration, text-independent systems such as frame-by-frame feature clustering and classification may be considered as a fast match technique for speaker or speaker class identification. However, the number of speaker classes and the number of speakers in each class that can be handled with practical amounts of processing overhead in acceptable response times is limited. (In other words, while frame-by-frame classifiers require relatively small amounts of data for each enrolled speaker and less processing time for limited numbers of speakers, their discrimination power is correspondingly limited and becomes severely compromised as the distinctiveness of the speaker models (each containing relatively less information than in speaker-dependent systems) is reduced by increasing numbers of models.) It can be readily understood that any approach which seeks to reduce information (stored and/or processed) concerning speaker utterances may compromise the ability of the system to discriminate individual enrolled users as the population of users becomes large. At some size of the speaker population, the speaker recognition system or engine is no longer able to discriminate between some speakers. This condition is known as saturation.
On the other hand, more complex systems which use speaker-dependent model-based decoders which are adapted to individual speakers to provide speaker recognition must run the models in parallel or sequentially to accomplish speaker recognition and therefore are extremely slow and require large amounts of memory and processor time. Additionally, such models are difficult to train and adapt since they typically require a large amount of data to form the model.
Some reduction in storage requirements has been achieved in template matching systems which are also text-dependent as well as speaker-dependent by reliance on particular utterances of each enrolled speaker which are specific to the speaker identification and/or verification function. However, such arrangements, by their nature, cannot be made transparent to the user, requiring a relatively lengthy enrollment and initial recognition (e.g. logon) procedure and more or less periodic interruption of use of the system for verification. Further and, perhaps, more importantly, such systems are more sensitive to variations of the utterances of each speaker ("intra-speaker variations") such as may occur through aging, fatigue, illness, stress, prosody, psychological state and other conditions of each speaker.
More specifically, speaker-dependent speech recognizers build a model for each speaker during an enrollment phase of operation. Thereafter, a speaker and the utterance are recognized by the model which produces the largest likelihood or lowest error rate. Enough data is required to adapt each model to a unique speaker for all utterances to be recognized. For this reason, most speaker-dependent systems are also text-dependent and template matching is often used to reduce the amount of data to be stored in each model. Alternatively, systems using, for example, hidden Markov models (HMM) or similar statistical models usually involve the introduction of cohort models based on a group of speakers to be able to reject speakers which are too improbable.
Cohort models allow the introduction of confidence measures based on competing likelihoods of speaker identity but are very difficult to build correctly, especially in increasing populations, due to the number of similarities which may exist between utterances of different speakers as the population of enrolled speakers increases. For that reason, cohort models can be significant sources of potential error. Enrollment of new speakers is also complicated since it requires extraction of new cohorts and the development or modification of corresponding cohort models.
Template matching, in particular, does not allow the straightforward introduction of cohorts. Templates are usually the original waveforms of user utterances used for enrollment and the number of templates for each utterance is limited, as a practical matter, by the time which can reasonably be made available for the matching process. On the other hand, coverage of intra-speaker variations is limited by the number of templates which may be acquired or used for each utterance to be recognized, and acceptable levels of coverage of intra-speaker variations become prohibitive as the user population becomes large. Development of cohorts, particularly to reduce data or simplify search strategies, tends to mask intra-speaker variation while being complicated thereby.
Further, template matching becomes less discriminating as the user population increases since the definition of distance measures between templates becomes more critical and complicates search strategies. Also, conceptually, template matching emphasizes the evolution of a dynamic (e.g. change in waveform over time) in the utterance and reproduction of that dynamic while that dynamic is particularly variable with condition of the speaker.
Accordingly, at the present state of the art, large speaker populations render text-independent, fast speaker recognition systems less suitable for use and, at some size of speaker population, render them ineffective, requiring slower, storage and processor intensive systems to be employed while degrading their performance as well. There has been no system available which allows maintaining of performance of speaker recognition comparable to fast, simple systems or increasing their discrimination power while limiting computational and memory requirements and avoiding saturation as the enrolled speaker population becomes large.
`SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a system for rapidly discriminating individual enrolled users among a large population of enrolled users which is text-independent and transparent to the user after enrollment.
It is another object of the invention to provide a system for speaker identification and verification among a large population of enrolled users and having a simple, rapid, transparent and text-independent enrollment procedure.
It is a further object of the invention to improve the processing of speaker and cohort models during speech decoding and speaker recognition.
It is yet another object of the invention to provide fast speaker recognition over a large population of speakers without reduction of accuracy.
In order to accomplish these and other objects of the invention, a method for identification of speakers is provided including the steps of forming groups of enrolled speakers, identifying a speaker or a group of speakers among the groups of enrolled speakers which is most likely to include the speaker of a particular utterance, and matching the utterance against speaker-dependent models within the group of speakers to determine identity of a speaker of the utterance.
`
In accordance with another aspect of the invention, a speaker recognition apparatus is provided comprising a vector quantizer for sampling frames of an utterance and determining a most likely speaker of an utterance, including an arrangement for detecting potential confusion between a speaker of the utterance and one or more previously enrolled speakers, and an arrangement for developing a speaker-dependent model for distinguishing between the speaker and the previously enrolled speaker in response to detection of potential confusion between them.
The invention utilizes a fast match process and a detailed match, if needed, in sequence so that the detailed match is implemented at or before the onset of saturation of the fast match process by an increasing population of users. The detailed match is accelerated by grouping of users in response to detection of potential confusion and limits storage by developing models directed to distinguishing between members of a group while facilitating and accelerating the detailed match process by limiting the number of candidate speakers or groups.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:
FIG. 1 is a block diagram/flow chart illustrating the architecture and operation of a preferred form of the invention, and
FIGS. 2A and 2B are graphical representations of histogram processing in accordance with the invention.
`
`DETAILED DESCRIPTION OF A PREFERRED
`EMBODIMENT OF THE INVENTION
Referring now to the drawings, and more particularly to FIG. 1, there is shown a high level block diagram of a preferred form of the invention. FIG. 1 can also be understood as a flow chart illustrating the operation of the invention as will be discussed below. It should also be understood that the architecture and operation of the system as illustrated in FIG. 1 may be implemented as a special purpose data processor or, preferably, by a suitably programmed general purpose data processor, in which latter case, the illustrated functional elements will be configured therein during initialization or as needed during operation of the program as is well-understood in the art.
Initially, it should be appreciated that the configuration of the preferred form of the invention is generally divided into two sections and thus is well-described as a hybrid system. The upper portion 100, as illustrated, is a feature vector based fast match speaker recognition/classification system which is text-independent. The lower portion 200, as illustrated, is a detailed match arrangement based on speaker models 210 or cohort models 220 and may be text-dependent or text-independent while the upper portion 100 is inherently text-independent. It should be understood that the overall system in accordance with the invention may be text-dependent or text-independent in accordance with the implementation chosen for lower, detailed match portion 200.
These portions of the system architecture represent sequential stages of processing with the detailed match being conducted only when a decision cannot be made by the first, fast match stage while, even if unsuccessful, the first stage enhances performance of the second stage by automatically selecting the speaker or cohort models for the detailed match. The selection of cohorts, while needed for the detailed match processing, also accelerates the fast match processes in some cases as will be discussed below.
More specifically, an acoustic front-end 110 which is, itself, well-understood in the art, is used to sample utterances in an overlapping fashion and to extract feature vectors 120 therefrom, preferably as MEL cepstra, delta and delta-delta coefficients along with normalized log-energies. (Log-energies and cepstra C0 should not be included.) In combination therewith, a vector quantizer clusters the feature vectors produced from the enrollment data as means and variances thereof for efficient storage as well as to quantize the feature vectors derived from utterances (test data) to be recognized.
Such feature vectors are preferably computed on overlapping 25-30 msec. frames with shifts of 10 msec. Physiologically related (e.g. characterizing vocal tract signatures such as resonances) MEL cepstra, delta and delta-delta feature vectors are preferred as feature vectors for efficiency and effectiveness of speaker identification or verification although other known types of feature vectors could be used. Such feature vectors and others, such as LPC cepstra, are usually thirty-nine dimension vectors, as is well-understood in the art. The resulting feature vectors are clustered into about sixty-five codewords (the number is not critical to practice of the invention) in accordance with a Mahalanobis distance. In practice, the variances of each coordinate of the feature vectors can be empirically determined over a representative set of speakers and the measure of association of a vector relative to a codeword is a weighted Euclidean distance with the weight being the inverse of the associated variances. The set of codewords thus derived constitutes a codebook 130 for each enrolled speaker.
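The framing and distance measure described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the 16 kHz sample rate, the function names and the choice of a 25 msec. frame (the low end of the stated 25-30 msec. range) are assumptions made only for the example.

```python
def frame_bounds(num_samples, sample_rate_hz=16000, frame_ms=25, shift_ms=10):
    """Return (start, end) sample indices of overlapping analysis frames.

    The sample rate is an assumption; the text only gives the frame
    length (25-30 msec.) and the shift (10 msec.).
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    shift = sample_rate_hz * shift_ms // 1000
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, shift)]


def weighted_distance(vector, codeword, variances):
    """Squared Euclidean distance with each coordinate weighted by the
    inverse of its empirically determined variance."""
    return sum((v - c) ** 2 / var
               for v, c, var in zip(vector, codeword, variances))


# One second of audio at the assumed 16 kHz rate.
bounds = frame_bounds(num_samples=16000)
```

With a 10 msec. shift, consecutive frames overlap by 15 msec., so each sample contributes to more than one feature vector.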
It should be noted that only one codebook is required for each enrolled speaker. Therefore, storage requirements (e.g. memory 130) for the fast match section of the invention are quite small and no complex model of a complete utterance is required. Any new speaker enrollment requires only the addition of an additional codebook while leaving other codebooks of previously enrolled speakers unaffected, reducing enrollment complexity. Also, since the memory 130 consists of similarly organized codebooks, efficient hierarchical (multi-resolution) approaches to searching can be implemented as the number of enrolled users (and associated codebooks) becomes large.
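A per-speaker codebook of this kind could be built roughly as sketched below. The text does not name a clustering algorithm, so the k-means style loop, the evenly spaced initialization and the name `build_codebook` are assumptions; the sketch also keeps only the codeword means, whereas the text stores means and variances.

```python
def weighted_distance(vector, codeword, variances):
    """Squared Euclidean distance weighted by inverse variances."""
    return sum((v - c) ** 2 / var
               for v, c, var in zip(vector, codeword, variances))


def build_codebook(vectors, variances, num_codewords=65, iterations=10):
    """Cluster one speaker's enrollment vectors into codewords.

    Hypothetical k-means style sketch: initial codewords are picked at
    evenly spaced positions in the enrollment data (an assumption).
    """
    k = min(num_codewords, len(vectors))
    codewords = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in codewords]
        for vector in vectors:
            nearest = min(
                range(k),
                key=lambda i: weighted_distance(vector, codewords[i], variances))
            clusters[nearest].append(vector)
        for i, members in enumerate(clusters):
            if members:  # keep the old codeword if its cluster is empty
                codewords[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return codewords


# Toy two-dimensional enrollment data with two obvious clusters.
enrollment = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
codebook = build_codebook(enrollment, variances=[1.0, 1.0], num_codewords=2)
```

Because each speaker's codebook is built only from that speaker's enrollment data, adding a speaker leaves the other codebooks untouched, as the paragraph above notes.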
Thereafter, test information is decoded frame-by-frame against the codebooks by decoder 140. Each frame of test data which provides an arbitrarily close match to a codeword provides an identification of the codebook which contains it. The frames which thus identify a particular codebook are counted in accordance with the codebook identified by each frame by counter 150 and a histogram is developed, as illustrated for speakers 1-5 of FIG. 2A. Generally, one codebook will emerge as being identified by a statistically significant or dominant number of frames after a few seconds of arbitrary speech and the speaker (e.g. speaker 1 of FIG. 2A) is thus identified, as preferably detected by a comparator arrangement 160. The divergence of histogram peak magnitudes also provides a direct measure of confidence level of the speaker identification. If two or more peaks are of similar magnitude (i.e. the difference between them is not statistically significant), further processing as will be described below can be performed for a detailed match 200 for speaker identification.
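The decoding, counting and comparison steps above can be sketched as follows. This is a hedged illustration: the `max_distance` threshold stands in for the "arbitrarily close match", and the fixed `dominance_ratio` is an assumed placeholder for the statistical significance test, which the text does not specify.

```python
from collections import Counter


def weighted_distance(vector, codeword, variances):
    """Squared Euclidean distance weighted by inverse variances."""
    return sum((v - c) ** 2 / var
               for v, c, var in zip(vector, codeword, variances))


def decode_frames(frames, codebooks, variances, max_distance):
    """Count, per speaker, the frames whose closest codeword (within
    max_distance) belongs to that speaker's codebook."""
    histogram = Counter()
    for frame in frames:
        best_speaker, best_dist = None, max_distance
        for speaker, codewords in codebooks.items():
            for codeword in codewords:
                dist = weighted_distance(frame, codeword, variances)
                if dist < best_dist:
                    best_speaker, best_dist = speaker, dist
        if best_speaker is not None:  # frames with no close match are skipped
            histogram[best_speaker] += 1
    return histogram


def compare_peaks(histogram, dominance_ratio=1.5):
    """Return the speaker with the largest peak and the rivals whose
    peaks are too close to it (assumed ratio test)."""
    ranked = histogram.most_common()
    if not ranked:
        return None, []
    top_speaker, top_count = ranked[0]
    rivals = [s for s, count in ranked[1:]
              if top_count < dominance_ratio * count]
    return top_speaker, rivals


codebooks = {"speaker1": [[0.0, 0.0]], "speaker2": [[5.0, 5.0]]}
frames = [[0.1, 0.0], [0.0, 0.2], [0.2, 0.1], [5.1, 5.0], [50.0, 50.0]]
histogram = decode_frames(frames, codebooks, [1.0, 1.0], max_distance=1.0)
decision = compare_peaks(histogram)
```

An empty rival list corresponds to a dominant peak; a non-empty list corresponds to the case in which the detailed match 200 would be invoked.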
However, in accordance with the invention, the feature vectors are also decoded against existing codebooks during enrollment by developing a histogram as described above. If the speaker being enrolled (e.g. speaker 6 of FIG. 2A) is confused with an existing enrolled speaker by developing a histogram peak magnitude which is similar to that of a previously enrolled speaker (e.g. potentially identifying a new speaker as a previously enrolled speaker), a class is formed in database 300 (in this case containing speakers 1 and 6) responsive to comparator 160 including the speakers whose utterances produce similar feature vectors. Data from the different speakers is then used to adapt speaker-dependent models capable of distinguishing between them and the models are stored in database 300.
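Class formation at enrollment might be sketched as below. The ratio test for "similar" peak magnitudes, the function name `update_classes` and the decision to merge overlapping classes so that they stay disjoint are all assumptions of this sketch; the text says only that a class is formed, or a member added, when potential confusion is detected.

```python
def update_classes(new_speaker, histogram, classes, similarity_ratio=1.5):
    """Form or extend a class when the enrolling speaker is confusable.

    histogram: frame counts obtained by decoding the enrollment data
    against all codebooks, including the new speaker's own.
    classes: list of sets of speaker ids, modified in place.
    Returns the class the new speaker joined, or None if no confusion.
    """
    own_count = histogram.get(new_speaker, 0)
    confusable = {speaker for speaker, count in histogram.items()
                  if speaker != new_speaker and count > 0
                  and own_count < similarity_ratio * count}
    if not confusable:
        return None
    merged = {new_speaker} | confusable
    # Assumption: classes sharing a member are merged into one class.
    for existing in [c for c in classes if c & merged]:
        merged |= existing
        classes.remove(existing)
    classes.append(merged)
    return merged


classes = []
joined = update_classes(
    "speaker6", {"speaker1": 40, "speaker2": 5, "speaker6": 45}, classes)
not_joined = update_classes(
    "speaker7", {"speaker7": 50, "speaker3": 2}, classes)
```

In the first call the peaks for speakers 1 and 6 are similar, so a class containing both is created, mirroring the FIG. 2A example; in the second call no class is created and no extra data would need to be collected.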
It should be appreciated that detecting potential confusion is tantamount to the onset of saturation of the fast match system so that the fast match system can be utilized to the full extent of its discrimination powers and detailed matching is performed only when beyond the capability of the fast match portion of the invention (unless intentionally limited by conservative design such as imposition of a low statistical threshold for potential confusion). However, such onset of saturation is detected during enrollment and, in this sense, the configuration of the system in accordance with the invention is adaptive to supplement the fast match portion, when necessary, by a detailed matching process. The detailed match is, itself, facilitated by limiting the scope of comparisons to members of a group and the adaptation of speaker-dependent models optimized or at least adapted to make the necessary distinctions between members of the group. The number of groups and number of speakers per group will always be minimized since groups are generated and members added to a group only when potential confusion is detected.
Of course, development and adaptation of such speaker-dependent models requires substantially more data to be collected for each such speaker. However, such data can be collected during somewhat extended enrollment for the speaker being enrolled (speaker 6) and later for the speakers (e.g. speaker 1) with which the newly enrolled speaker is confused during their next use of the system. It should also be noted that the development of classes automatically selects or defines cohorts from which cohort models can be developed and provides for collection and storage of additional data only when necessary as the enrolled user population increases.
It should also be noted that after at least one class is defined and created as described above, test data which results in confusion between speakers, as a histogram is developed by counter 150, can be compared against the class or classes, if any, to which each candidate speaker is assigned. This comparison can often provide useable results after only a few seconds of speech or even a few hundred frames. If, for example, during verification (e.g. the periodic testing of speech to be that of a previously identified speaker) a class is identified other than the class to which the previously identified speaker belongs, the verification can be considered to have failed. This possibility is particularly useful in denying access to a user of a secure system when verification fails after access is granted upon initial identification. For identification, as soon as two or a limited number of speakers dominates, only the speakers in the one or two classes corresponding to the dominating speakers need be considered further. Both of these decisions, taken after only a relatively few seconds or small number of frames, greatly accelerate the speaker recognition process.
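The two early decisions described above can be sketched as follows. The function names and the string return values are hypothetical, and classes are represented as sets of speaker ids purely for illustration.

```python
def early_verification(claimed_speaker, dominant_speaker, classes):
    """Early verification check based on class membership.

    Returns "consistent" when the dominant histogram speaker is the
    claimed one, "needs_detailed_match" when both share a class, and
    "failed" when the dominant speaker lies outside the claimed
    speaker's class.
    """
    if dominant_speaker == claimed_speaker:
        return "consistent"
    shares_class = any(claimed_speaker in c and dominant_speaker in c
                       for c in classes)
    return "needs_detailed_match" if shares_class else "failed"


def identification_candidates(dominating_speakers, classes):
    """Restrict further processing to the dominating speakers and the
    members of the classes those speakers belong to."""
    dominating = set(dominating_speakers)
    pool = set(dominating)
    for members in classes:
        if members & dominating:
            pool |= members
    return pool


classes = [{"s1", "s6"}, {"s2", "s4"}]
same_class = early_verification("s1", "s6", classes)
other_class = early_verification("s1", "s2", classes)
pool = identification_candidates(["s1", "s3"], classes)
```

Both checks only consult class membership, which is why they can be applied after a small number of frames.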
Other decisions can also be made in a manner which facilitates fast and/or detailed match processing and may allow a possible or at least tentative identification of the speaker to be made by the fast match processing alone. For example, as illustrated in FIG. 2B where comparable counts are developed for speakers 1 and 3, if the candidate speakers do not belong to the same class (e.g. speaker 3, when enrolled, did not cause creation of a class with speaker 1), the speaker associated with the greater histogram peak can usually be correctly selected or tentatively identified (or speakers not classed with other speakers between which there is confusion can be eliminated) by the fast match process on the basis of a relatively few frames since it can be assumed that a divergence of magnitude of the histogram peaks would later develop based on further speech. This feature of the invention provides acceleration of the speaker recognition process by the fast match section 100 of the invention and allows a speaker dependent model 210 to be called for use by the cohort and speech decoder 230 from database 300 for speaker identity verification and speech recognition.
If the speakers are in the same class as detected by comparator 160 accessing database 300, the speaker dependent models of all cohorts of the single class can be called at an early point in time in order to distinguish between them, which is also done by the speaker recognition engine included with cohort and speech decoder 230. It should be noted that this selection of a class limits the data processed to that which is necessary for discrimination between the speakers which are, in fact, confused by fast match section 100 and results in reduction of processing time and overhead as well as storage requirements for the speaker-dependent models. Cohorts are needed only when confusion actually occurs, reducing the total storage requirements. Further, the cohort model 220 can be used for speech decoding at an earlier time since an ambiguous decoding of an utterance is unlikely within cohorts.
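The routing between a tentative fast match identification and the detailed match, as described in the two preceding paragraphs, might look like the sketch below. The ratio test and the tuple return format are assumptions made for the example.

```python
def route(histogram, classes, similarity_ratio=1.5):
    """Decide how to proceed from a fast match histogram.

    Returns a tuple (decision, speaker, cohort):
      ("identified", speaker, set())     one peak clearly dominates
      ("tentative", speaker, set())      comparable peaks, different classes
      ("detailed_match", None, cohort)   comparable peaks within one class
      ("no_decision", None, set())       empty histogram
    """
    if not histogram:
        return ("no_decision", None, set())
    ranked = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
    top_speaker, top_count = ranked[0]
    rivals = [s for s, count in ranked[1:]
              if top_count < similarity_ratio * count]
    if not rivals:
        return ("identified", top_speaker, set())
    for members in classes:
        if top_speaker in members and any(r in members for r in rivals):
            # Only the models of this class need to be loaded.
            return ("detailed_match", None, set(members))
    return ("tentative", top_speaker, set())


classes = [{"s1", "s6"}]
fig_2b_case = route({"s1": 30, "s3": 28, "s2": 3}, classes)
same_class_case = route({"s1": 30, "s6": 29}, classes)
```

The first call mirrors FIG. 2B: speakers 1 and 3 have comparable counts but share no class, so the greater peak is selected tentatively. The second call returns the cohort whose speaker-dependent models would be called from database 300.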
The speech decoding engine 230 preferably utilizes hidden Markov models (HMMs) with continuous density Gaussian mixtures to model the output distribution (i.e. the probability density function to observe a given acoustic vector at a given arc of the HMM model). A set of maximum-a-posteriori (MAP) estimated models, or models adapted by other speaker-dependent algorithms like re-training, adaptation by correlation (ABC), maximum likelihood linear regression (MLLR) or clustered transformation (CT), are loaded for different pre-loaded speakers. During enrollment, the utterances are decoded with a gender-independent system. Then each pre-loaded system is used to compute likelihoods for a same alignment. The N-best speakers are extracted and linear transforms are computed to map each of the selected pre-loaded speaker models closer to the enrolled speaker. Using this data, new Gaussians are built for the new speakers. Unobserved Gaussians are preferably adapted using the ABC algorithm. During speaker recognition, the likelihoods produced by the speaker and its cohorts are compared for a same alignment produced by a speaker-independent model.
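The comparison of likelihoods over a same alignment can be illustrated with the simplified sketch below. It assumes a single diagonal-covariance Gaussian per HMM state instead of the continuous density Gaussian mixtures named above, and takes the alignment (one state label per frame) as given; the data layout and function names are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def score(frames, alignment, model):
    """Total log-likelihood of the frames under one speaker's model,
    using the shared alignment (one state label per frame).

    model maps a state label to a (mean, variance) pair.
    """
    return sum(log_gaussian(frame, *model[state])
               for frame, state in zip(frames, alignment))


def best_in_cohort(frames, alignment, cohort_models):
    """Pick the cohort member whose model gives the highest likelihood."""
    return max(cohort_models,
               key=lambda spk: score(frames, alignment, cohort_models[spk]))


frames = [[0.0], [1.0]]
alignment = ["a", "b"]  # assumed output of a speaker-independent decoder
cohort_models = {
    "s1": {"a": ([0.0], [1.0]), "b": ([1.0], [1.0])},
    "s6": {"a": ([2.0], [1.0]), "b": ([3.0], [1.0])},
}
winner = best_in_cohort(frames, alignment, cohort_models)
```

Because every cohort member is scored against the same alignment, the scores differ only through the speaker-dependent parameters, which is the property the comparison relies on.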
While this process may be computationally intensive, the enrollment data used to distinguish between cohorts may, in practice, be quite limited particularly if made text-dependent or text-prompted. In these latter cases, if the fast match identification or verification is not successful, identification or verification may be carried out in a text-dependent fashion. However, the computation and comparisons of the alignments described above allow text-independence for identification or verification, if desired. Thus, text-independence is achieved for the majority of identification and verification operations by the fast match processing while minimizing storage requirements and computational overhead to very low levels in the detailed match stage 200 which is thus accelerated, as described above.
In view of the foregoing, the hybrid system of the invention combining fast and detailed match sections provides very rapid speaker identification with little, if any, increase in storage requirements, since the processing of the detailed match stage generally allows reduction of storage requirement by more than enough to compensate for storage of codebooks, largely because the speaker-dependent models may be built principally for distinguishing between speakers of a group rather than more fully characterizing the speech of each speaker. Enrollment, identification and verification of speaker identity are conducted in a manner transparent to the user except to the extent text-dependence may be used to limit storage for a small number of discriminations between speakers. The fast match and detailed match sections of the hybrid arrangement accelerate operation of each other while providing for auto