(10) Patent No.: US 6,182,037 B1
Maes
(45) Date of Patent: *Jan. 30, 2001

(54) SPEAKER RECOGNITION OVER LARGE POPULATION WITH FAST AND DETAILED
MATCHES

(75) Inventor: Stephane Herman Maes, Danbury, CT

(73) Assignee: International Business Machines Corporation

(*) Notice: This patent issued on a continued prosecution application
filed under 37 CFR 1.53(d), and is subject to the twenty year patent
term provisions of 35 U.S.C. 154(a)(2).

Under 35 U.S.C. 154(b), the term of this patent shall be extended for
0 days.

(21) Appl. No.: 08/851,982
(22) Filed: May 6, 1997
(51) Int. Cl.7: G10L 17/00
(52) U.S. Cl.: 704/247; 704/245
(58) Field of Search: 704/246, 247, 250, 249, 243, 245, 244
`Primary Examiner—David R. Hudspeth
`Assistant Examiner—Harold Zintel
`(74) Attorney, Agent, or Firm—McGuireWoods, LLP; Paul
`J. Otterstedt
`
(57) ABSTRACT

Fast and detailed match techniques for speaker recognition are combined
into a hybrid system in which speakers are associated in groups when
potential confusion is detected between a speaker being enrolled and a
previously enrolled speaker. Thus the detailed match techniques are
invoked only at the potential onset of saturation of the fast match
technique while the detailed match is facilitated by limitation of
comparisons to the group and the development of speaker-dependent
models which principally function to distinguish between members of a
group rather than to more fully characterize each speaker. Thus storage
and computational requirements are limited and fast and accurate
speaker recognition can be extended over populations of speakers which
would degrade or saturate fast match systems and degrade performance of
detailed match systems.
`
`23 Claims, 2 Drawing Sheets
`
[Representative drawing, FIG. 1: block diagram with acoustic front-end,
feature vectors, speaker-dependent codebooks (cluster), decoding 140,
histogram counter 150, comparator 160, speaker/class database, and
detailed match 200 with speaker-dependent models, cohort models and
speech decoding/speaker recognition engine producing tentative and
final ID decisions.]
`
`IPR2023-00037
`Apple EX1018 Page1
`
`
`
`
`
`
`
[Sheet 1 of 2: FIG. 1]
`
`
`
`
[Sheet 2 of 2: FIGS. 2A and 2B, horizontal axis SPEAKER 1 through 6]
`
`
`
`SPEAKER RECOGNITION OVER LARGE
`POPULATION WITH FAST AND DETAILED
`MATCHES
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`
`The present invention generally relates to speaker iden-
`tification and verification in speech recognition systems and,
`more particularly,
`to rapid and text-independent speaker
`identification and verification over a large population of
`enrolled speakers.
`2. Description of the Prior Art
Many electronic devices require input from a user in order
`to convey to the device particular information required to
`determine or perform a desired function or, in a trivially
`simple case, when a desired function is to be performed as
`would be indicated by, for example, activation of an on/off
`switch. When multiple different inputs are possible, a key-
`board comprising an array of two or more switches has been
`the input device of choice in recent years.
`However, keyboards of any type have inherent disadvan-
`tages. Most evidently, keyboards include a plurality of
`distributed actuable areas, each generally including moving
`parts subject to wear and damage and which must be sized
`to be actuated by a portion of the body unless a stylus or
`other separate mechanical expedient
`is employed.
`Accordingly, in many types of devices, such as input panels
`for security systems and electronic calculators, the size of
`the device is often determined by the dimensions of the
`keypad rather than the electronic contents of the housing.
`Additionally, numerous keystrokes may be required (e.g. to
`specify an operation, enter a security code, personal identi-
`fication number (PIN), etc.) which slows operation and
`increases the possibility that erroneous actuation may occur.
Therefore, use of a keyboard or other manually manipulated input
structure requires action which is not optimally natural or expeditious
for the user.

In an effort to provide a more naturally usable, convenient and rapid
interface and to increase the capabilities thereof, numerous approaches
to voice or sound detection and recognition systems have been proposed
and implemented with some degree of success. Additionally, such systems
could theoretically have the capability of matching utterances of a
user against utterances of enrolled speakers for granting or denying
access to resources of the device or system, identifying enrolled
speakers or calling customized command libraries in accordance with
speaker identity in a manner which may be relatively transparent and
convenient to the user.

However, large systems including large resources are likely to have a
large number of potential users and thus require massive amounts of
storage and processing overhead to recognize speakers when the
population of enrolled speakers becomes large. Saturation of the
performance of speaker recognition systems will occur for simple and
fast systems designed to quickly discriminate among different speakers
when the size of the speaker population increases. Performance of most
speaker-dependent (e.g. performing decoding of the utterance and
aligning on the decoded script models such as hidden Markov models
(HMM) adapted to the different speakers, the models presenting the
highest likelihood of correct decoding identifying the speaker, and
which may be text-dependent or text-independent) systems also degrades
over large speaker populations but the tendency toward saturation and
performance degradation is encountered over smaller populations with
fast, simple systems which discriminate between speakers based on
smaller amounts of information and thus tend to return ambiguous
results when data for larger populations results in smaller differences
between instances of data.

As an illustration, text-independent systems such as frame-by-frame
feature clustering and classification may be considered as a fast match
technique for speaker or speaker class identification. However, the
numbers of speaker classes and the number of speakers in each class
that can be handled with practical amounts of processing overhead in
acceptable response times is limited. (In other words, while
frame-by-frame classifiers require relatively small amounts of data for
each enrolled speaker and less processing time for limited numbers of
speakers, their discrimination power is correspondingly limited and
becomes severely compromised as the distinctiveness of the speaker
models (each containing relatively less information than in
speaker-dependent systems) is reduced by increasing numbers of models.)
It can be readily understood that any approach which seeks to reduce
information (stored and/or processed) concerning speaker utterances may
compromise the ability of the system to discriminate individual
enrolled users as the population of users becomes large. At some size
of the speaker population, the speaker recognition system or engine is
no longer able to discriminate between some speakers. This condition is
known as saturation.

On the other hand, more complex systems which use speaker dependent
model-based decoders which are adapted to individual speakers to
provide speaker recognition must run the models in parallel or
sequentially to accomplish speaker recognition and therefore are
extremely slow and require large amounts of memory and processor time.
Additionally, such models are difficult to train and adapt since they
typically require a large amount of data to form the model.

Some reduction in storage requirements has been achieved in template
matching systems which are also text-dependent as well as
speaker-dependent by reliance on particular utterances of each enrolled
speaker which are specific to the speaker identification and/or
verification function. However, such arrangements, by their nature,
cannot be made transparent to the user; requiring a relatively lengthy
enrollment and initial recognition (e.g. logon) procedure and more or
less periodic interruption of use of the system for verification.
Further and, perhaps, more importantly, such systems are more sensitive
to variations of the utterances of each speaker ("intra-speaker"
variations) such as may occur through aging, fatigue, illness, stress,
prosody, psychological state and other conditions of each speaker.

More specifically, speaker-dependent speech recognizers build a model
for each speaker during an enrollment phase of operation. Thereafter, a
speaker and the utterance is recognized by the model which produces the
largest likelihood or lowest error rate. Enough data is required to
adapt each model to a unique speaker for all utterances to be
recognized. For this reason, most speaker-dependent systems are also
text-dependent and template matching is often used to reduce the amount
of data to be stored in each model.

Alternatively, systems using, for example, hidden Markov models (HMM)
or similar statistical models usually involve the introduction of
cohort models based on a group of speakers to be able to reject
speakers which are too improbable.

Cohort models allow the introduction of confidence measures based on
competing likelihoods of speaker identity and
are very difficult to build correctly, especially in increasing
populations due to the number of similarities which may exist between
utterances of different speakers as the population of enrolled speakers
increases. For that reason, cohort models can be significant sources of
potential error. Enrollment of new speakers is also complicated since
it requires extraction of new cohorts and the development or
modification of corresponding cohort models.

Template matching, in particular, does not allow the straightforward
introduction of cohorts. Templates are usually the original waveforms
of user utterances used for enrollment and the number of templates for
each utterance is limited, as a practical matter, by the time which can
reasonably be made available for the matching process. On the other
hand, coverage of intra-speaker variations is limited by the number of
templates which may be acquired or used for each utterance to be
recognized and acceptable levels of coverage of intra-speaker
variations becomes prohibitive as the user population becomes large.
Development of cohorts, particularly to reduce data or simplify search
strategies tends to mask intra-speaker variation while being
complicated thereby.

Further, template matching becomes less discriminating as the user
population increases since the definition of distance measures between
templates becomes more critical and complicates search strategies.
Also, conceptually, template matching emphasizes the evolution of a
dynamic (e.g. change in waveform over time) in the utterance and
reproduction of that dynamic while that dynamic is particularly
variable with condition of the speaker.

Accordingly, at the present state of the art, large speaker populations
render text-independent, fast speaker recognition systems less suitable
for use and, at some size of speaker population, render them
ineffective, requiring slower, storage and processor intensive systems
to be employed while degrading their performance as well. There has
been no system available which allows maintaining of performance of
speaker recognition comparable to fast, simple systems or increasing
their discrimination power while limiting computational and memory
requirements and avoiding saturation as the enrolled speaker population
becomes large.
SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a system
for rapidly discriminating individual enrolled users among a large
population of enrolled users which is text-independent and transparent
to the user after enrollment.

It is another object of the invention to provide a system for speaker
identification and verification among a large population of enrolled
users and having a simple, rapid, transparent and text-independent
enrollment procedure.

It is a further object of the invention to improve the processing of
speaker and cohort models during speech decoding and speaker
recognition.

It is yet another object of the invention to provide fast speaker
recognition over a large population of speakers without reduction of
accuracy.

In order to accomplish these and other objects of the invention, a
method for identification of speakers is provided including the steps
of forming groups of enrolled speakers, identifying a speaker or a
group of speakers among the groups of enrolled speakers which is most
likely to include the speaker of a particular utterance, and matching
the utterance against speaker-dependent models within the group of
speakers to determine identity of a speaker of the utterance.

In accordance with another aspect of the invention, a speaker
recognition apparatus is provided comprising a vector quantizer for
sampling frames of an utterance and determining a most likely speaker
of an utterance, including an arrangement for detecting potential
confusion between a speaker of the utterance with one or more
previously enrolled speakers, and an arrangement for developing a
speaker-dependent model for distinguishing between the speaker and the
previously enrolled speaker in response to detection of potential
confusion between them.

The invention utilizes a fast match process and a detailed match, if
needed, in sequence so that the detailed match is implemented at or
before the onset of saturation of the fast match process by an
increasing population of users. The detailed match is accelerated by
grouping of users in response to detection of potential confusion and
limits storage by developing models directed to distinguishing between
members of a group while facilitating and accelerating the detailed
match process by limiting the number of candidate speakers or groups.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages will be better
understood from the following detailed description of a preferred
embodiment of the invention with reference to the drawings, in which:

FIG. 1 is a block diagram/flow chart illustrating the architecture and
operation of a preferred form of the invention, and

FIGS. 2A and 2B are graphical representations of histogram processing
in accordance with the invention.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

Referring now to the drawings, and more particularly to FIG. 1, there
is shown a high level block diagram of a preferred form of the
invention. FIG. 1 can also be understood as a flow chart illustrating
the operation of the invention as will be discussed below. It should
also be understood that the architecture and operation of the system as
illustrated in FIG. 1 may be implemented as a special purpose data
processor or, preferably, by a suitably programmed general purpose data
processor, in which latter case, the illustrated functional elements
will be configured therein during initialization or as needed during
operation of the program as is well-understood in the art.

Initially, it should be appreciated that the configuration of the
preferred form of the invention is generally divided into two sections
and thus is well-described as a hybrid system. The upper portion 100,
as illustrated, is a feature vector based fast match speaker
recognition/classification system which is text-independent. The lower
portion 200, as illustrated, is a detailed match arrangement based on
speaker models 210 or cohort models 220 and may be text-dependent or
text-independent while the upper portion 100 is inherently
text-independent. It should be understood that the overall system in
accordance with the invention may be text-dependent or text-independent
in accordance with the implementation chosen for lower, detailed match
portion 200.

These portions of the system architecture represent sequential stages
of processing with the detailed match being conducted only when a
decision cannot be made by the first, fast match stage while, even if
unsuccessful, the first stage enhances performance of the second stage
by automatic selection of speaker or cohort models for the detailed
match as well as automatically selecting them. The selection of
cohorts, while needed for the detailed match processing also
accelerates the fast match processes in some cases as will be discussed
below.
`
More specifically, an acoustic front-end 110 which is, itself,
well-understood in the art, is used to sample utterances in an
overlapping fashion and to extract feature vectors 120 therefrom,
preferably as MEL cepstra, delta and delta-delta coefficients along
with normalized log-energies. (Log-energies and cepstra C0 should not
be included.) In combination therewith, a vector quantizer clusters the
feature vectors produced from the enrollment data as means and
variances thereof for efficient storage as well as to quantize the
feature vectors derived from utterances (test data) to be recognized.
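
The overlapping sampling performed by the front-end can be illustrated
with a short Python sketch. It is only a sketch under stated
assumptions: a 25 msec. frame and a 10 msec. shift are used (values
within the preferred range given in this description), the function
names `frame_signal` and `add_deltas` are hypothetical, the static
coefficients are assumed to be computed elsewhere, and simple numerical
differences stand in for whatever delta computation a real front-end
would use.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Row i holds samples [i * shift, i * shift + frame_len).
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

def add_deltas(static):
    """Append delta and delta-delta features (assumed: numerical gradients
    along the time axis) to a (frames x coefficients) matrix."""
    delta = np.gradient(static, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])
```

With 13 static coefficients per frame (an assumption), `add_deltas`
yields thirty-nine dimensional vectors.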
Such feature vectors are preferably computed on overlapping 25-30 msec.
frames with shifts of 10 msec. Physiologically related (e.g.
characterizing vocal tract signatures such as resonances) MEL cepstra,
delta and delta-delta feature vectors are preferred as feature vectors
for efficiency and effectiveness of speaker identification or
verification although other known types of feature vectors could be
used. Such feature vectors and others, such as LPC cepstra, are usually
thirty-nine dimension vectors, as is well-understood in the art. The
resulting feature vectors are clustered into about sixty-five codewords
(the number is not critical to practice of the invention) in accordance
with a Mahalanobis distance. In practice, the variances of each
coordinate of the feature vectors can be empirically determined over a
representative set of speakers and the measure of association of a
vector relative to a codeword is a weighted Euclidean distance with the
weight being the inverse of the associated variances. The set of
codewords thus derived constitute a codebook 130 for each enrolled
speaker.

It should be noted that only one codebook is required for each enrolled
speaker. Therefore, storage requirements (e.g. memory 130) for the fast
match section of the invention are quite small and no complex model of
a complete utterance is required. Any new speaker enrollment requires
only the addition of an additional codebook while leaving other
codebooks of previously enrolled speakers unaffected, reducing
enrollment complexity. Also, since the memory 130 consists of similarly
organized codebooks, efficient hierarchical (multi-resolution)
approaches to searching can be implemented as the number of enrolled
users (and associated codebooks) becomes large.

Thereafter, test information is decoded frame-by-frame against the
codebooks by decoder 140. Each frame of test data which provides an
arbitrarily close match to a codeword provides an identification of the
codebook which contains it. The frames which thus identify a particular
codebook are counted in accordance with the codebook identified by each
frame by counter 150 and a histogram is developed, as illustrated for
speakers 1-5 of FIG. 2A. Generally, one codebook will emerge as being
identified by a statistically significant or dominant number of frames
after a few seconds of arbitrary speech and the speaker (e.g. speaker 1
of FIG. 2A) is thus identified, as preferably detected by a comparator
arrangement 160. The divergence of histogram peak magnitudes also
provides a direct measure of confidence level of the speaker
identification. If two or more peaks are of similar (not statistically
significant) magnitude, further processing as will be described below
can be performed for a detailed match 200 for speaker identification.

However, in accordance with the invention, the feature vectors are also
decoded against existing codebooks during enrollment by developing a
histogram as described above. If the speaker being enrolled (e.g.
speaker 6 of FIG. 2A) is confused with an existing enrolled speaker by
developing a histogram peak magnitude which is similar to that of a
previously enrolled speaker (e.g. potentially identifying a new speaker
as a previously enrolled speaker), a class is formed in database 300
(in this case containing speakers 1 and 6) responsive to comparator 160
including the speakers whose utterances produce similar feature
vectors. Data from the different speakers is then used to adapt
speaker-dependent models capable of distinguishing between them and the
models stored in database 300.

It should be appreciated that detecting potential confusion is
tantamount to the onset of saturation of the fast match system so that
the fast match system can be utilized to the full extent of its
discrimination powers and detailed matching is performed only when
beyond the capability of the fast match portion of the invention
(unless intentionally limited by conservative design such as imposition
of a low statistical threshold for potential confusion). However, such
onset of saturation is detected during enrollment and, in this sense,
the configuration of the system in accordance with the invention is
adaptive to supplement the fast match portion, when necessary, by a
detailed matching process. The detailed match is, itself, facilitated
by limiting the scope of comparisons to members of a group and the
adaptation of speaker-dependent models optimized or at least adapted to
make the necessary distinctions between members of the group. The
number of groups and number of speakers per group will always be
minimized since groups are generated and members added to a group only
when potential confusion is detected.

Of course, development and adaptation of such speaker-dependent models
requires substantially more data to be collected for each such speaker.
However, such data can be collected during somewhat extended enrollment
for the speaker being enrolled (speaker 6) and later for the speakers
(e.g. speaker 1) with which the newly enrolled speaker is confused
during their next use of the system. It should also be noted that the
development of classes automatically selects or defines cohorts from
which cohort models can be developed and provides for collection and
storage of additional data only when necessary as the enrolled user
population increases.

It should also be noted that after at least one class is defined and
created as described above, test data which results in confusion
between speakers, as a histogram is developed of counter 150 can be
compared against the class or classes, if any, to which each candidate
speaker is assigned. This comparison can often provide useable results
after only a few seconds of speech or even a few hundred frames. If,
for example, during verification (e.g. the periodic testing of speech
to be that of a previously identified speaker) a class is identified
other than the class to which the previously identified speaker
belongs, the verification can be considered to have failed. This
possibility is particularly useful in denying access to a user of a
secure system when verification fails after access is granted upon
initial identification. For identification, as soon as two or a limited
number of speakers dominates, only the speakers in the one or two
classes corresponding to the dominating speakers need be considered
further. Both of these decisions, taken after only a relatively few
seconds or small number of frames greatly accelerate the speaker
recognition process.

Other decisions can also be made in a manner which facilitates fast
and/or detailed match processing and may allow a possible or at least
tentative identification of the
speaker to be made by the fast match processing alone. For example, as
illustrated in FIG. 2B where comparable counts are developed for
speakers 1 and 3, if the candidate speakers do not belong to the same
class (e.g. speaker 3, when enrolled, did not cause creation of a class
with speaker 1), the speaker associated with the greater histogram peak
can usually be correctly selected or tentatively identified (or
speakers not classed with other speakers between which there is
confusion can be eliminated) by the fast match process on the basis of
a relatively few frames since it can be assumed that a divergence of
magnitude of the histogram peaks would later develop based on further
speech. This feature of the invention provides acceleration of the
speaker recognition process by the fast match section 100 of the
invention and allows a speaker dependent model 210 to be called for use
by the cohort and speech decoder 230 from database 300 for speaker
identity verification and speech recognition.
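
As a rough illustration of the fast match flow (one codebook per
enrolled speaker, frame-by-frame decoding, a histogram of codebook
hits, and a comparator on the histogram peaks), the following Python
sketch may help. It is not the patented implementation: nearest
codeword assignment stands in for the "arbitrarily close match", the
`min_ratio` threshold is a hypothetical substitute for a test of
statistical significance, and all function names are illustrative.

```python
import numpy as np

def decode_frame(frame, codebooks, inv_var):
    """Return the speaker whose codebook contains the codeword nearest to
    `frame` under a Euclidean distance weighted by inverse variances."""
    best_speaker, best_dist = None, np.inf
    for speaker, codewords in codebooks.items():
        dists = np.sum(inv_var * (codewords - frame) ** 2, axis=1)
        d = float(dists.min())
        if d < best_dist:
            best_speaker, best_dist = speaker, d
    return best_speaker

def build_histogram(frames, codebooks, inv_var):
    """Count, per speaker, the frames that identify that speaker's codebook."""
    counts = {speaker: 0 for speaker in codebooks}
    for frame in frames:
        counts[decode_frame(frame, codebooks, inv_var)] += 1
    return counts

def comparator(counts, min_ratio=1.5):
    """Return the speakers whose peaks are not clearly below the largest
    peak. One result means an identification; several mean confusion."""
    peak = max(counts.values())
    return sorted(s for s, c in counts.items() if c > 0 and c * min_ratio >= peak)
```

A single speaker returned by `comparator` corresponds to a dominant
histogram peak; two or more returned speakers correspond to peaks of
similar magnitude, the case in which class membership and the detailed
match come into play.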
If the speakers are in the same class as detected by comparator 160
accessing database 300, the speaker dependent models of all cohorts of
the single class can be called at an early point in time in order to
distinguish between them which is also done by the speaker recognition
engine included with cohort and speech decoder 230. It should be noted
that this selection of a class limits the data processed to that which
is necessary for discrimination between the speakers which are, in
fact, confused by fast match section 100 and results in reduction of
processing time and overhead as well as storage requirements for the
speaker-dependent models. Cohorts are needed only when confusion
actually occurs, reducing the total storage requirements. Further, the
cohort model 220 can be used for speech decoding at an earlier time
since an ambiguous decoding of an utterance is unlikely within cohorts.
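
The role of the class database in these decisions can be sketched in
Python as follows. This is a simplified illustration under stated
assumptions, not the patented implementation: the `ClassDatabase` class
and the `resolve` and `verification_passes` helpers are hypothetical
names, classes are kept as plain sets, and merging of two existing
classes is deliberately left out.

```python
class ClassDatabase:
    """Toy stand-in for the speaker/class database (names are hypothetical)."""

    def __init__(self):
        self._class_of = {}   # speaker -> class id
        self._members = {}    # class id -> set of speakers
        self._next_id = 0

    def record_confusion(self, new_speaker, existing_speaker):
        """Put a newly enrolled speaker in the class of the enrolled speaker
        it was confused with, creating the class if needed.
        Simplification: merging two existing classes is not handled."""
        cid = self._class_of.get(existing_speaker)
        if cid is None:
            cid = self._next_id
            self._next_id += 1
            self._members[cid] = {existing_speaker}
            self._class_of[existing_speaker] = cid
        self._members[cid].add(new_speaker)
        self._class_of[new_speaker] = cid
        return cid

    def cohort(self, speaker):
        """All speakers in the class of `speaker` (just the speaker if unclassed)."""
        cid = self._class_of.get(speaker)
        return {speaker} if cid is None else set(self._members[cid])

    def same_class(self, a, b):
        ca, cb = self._class_of.get(a), self._class_of.get(b)
        return ca is not None and ca == cb


def resolve(candidates, counts, db):
    """Decide what to do with speakers whose histogram peaks are comparable."""
    first = candidates[0]
    if len(candidates) > 1 and all(db.same_class(first, s) for s in candidates[1:]):
        # Same class: load the models of all cohorts for the detailed match.
        return "detailed_match", db.cohort(first)
    # Different classes: tentatively pick the greater histogram peak.
    return "tentative_id", {max(candidates, key=lambda s: counts[s])}


def verification_passes(claimed_speaker, dominant_speaker, db):
    """Verification fails if the dominant speaker lies outside the claimed class."""
    return dominant_speaker in db.cohort(claimed_speaker)
```

For example, after `db.record_confusion(6, 1)`, comparable peaks for
speakers 1 and 6 lead to a detailed match over the set {1, 6}, while
comparable peaks for speakers 1 and 3 lead to a tentative
identification of whichever has the greater count.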
The speech decoding engine 230 preferably utilizes hidden Markov models
(HMMs) with continuous density Gaussian mixtures to model the output
distribution (i.e. the probability density function to observe a given
acoustic vector at a given arc of the HMM model). A set of
maximum-a-posteriori (MAP) estimated models, or adapted by other
speaker-dependent algorithms like re-training, adaptation by
correlation (ABC), maximum likelihood linear regression (MLLR) or
clustered transformation (CT), are loaded for different pre-loaded
speakers. During enrollment, the utterances are decoded with a
gender-independent system. Then each pre-loaded system is used to
compute likelihoods for a same alignment. The N-best speakers are
extracted and linear transforms are computed to map each of the
selected pre-loaded speaker models closer to the enrolled speaker.
Using this data, new Gaussians are built for the new speakers.
Unobserved Gaussians are preferably adapted using the ABC algorithm.
During speaker recognition, the likelihoods produced by the speaker and
its cohorts are compared for a same alignment produced by a speaker
independent model.
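
The final comparison step, in which the likelihoods of a speaker and
its cohorts are compared for a same alignment, can be sketched in
Python. This is a minimal sketch under simplifying assumptions: each
HMM arc is modelled by a single diagonal-covariance Gaussian rather
than a Gaussian mixture, the alignment is taken as a given list of arc
indices per frame, and the names `score_speaker` and `best_in_cohort`
and the model dictionary layout are hypothetical.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of diagonal-covariance Gaussians, evaluated row by row."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def score_speaker(frames, alignment, means, variances):
    """Total log-likelihood of the frames under one speaker's model, using a
    fixed frame-to-arc alignment shared by all speakers being compared."""
    alignment = np.asarray(alignment, dtype=int)
    per_frame = log_gaussian(frames, means[alignment], variances[alignment])
    return float(np.sum(per_frame))

def best_in_cohort(frames, alignment, models):
    """Score every cohort member on the same alignment; return the winner
    together with all scores."""
    scores = {
        speaker: score_speaker(frames, alignment, m["means"], m["vars"])
        for speaker, m in models.items()
    }
    return max(scores, key=scores.get), scores
```

Because every cohort member is scored on the same alignment, the
utterance is aligned once and only the per-speaker likelihood
evaluation is repeated.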
While this process may be computationally intensive, the enrollment
data used to distinguish between cohorts may, in practice, be quite
limited particularly if made text-dependent or text-prompted. In these
latter cases, if the fast match identification or verification is not
successful, identification or verification may be carried out in a
text-dependent fashion. However, the computation and comparisons of the
alignments described above allow text-independence for identification
or verification, if desired. Thus, text-independence is achieved for
the majority of identification and verification operations by the fast
match processing while minimizing storage requirements and
computational overhead to very low levels in the detailed match stage
200 which is thus accelerated, as described above.
In view of the foregoing, the hybrid system of the invention combining
fast and detailed match sections provides very rapid speaker
identification with little, if any, increase in storage requirements,
since the processing of the detailed match stage generally allows
reduction of storage requirement by more than enough to compensate for
storage of codebooks, largely because the speaker-dependent models may
be built principally for distinguishing between speakers of a group
rather than more fully characterizing the speech of each speaker.
Enrollment, identification and verification of speaker identity are
conducted in a manner transparent to the user except to the extent
text-dependence may be used to limit storage for a small number of
discriminations between speakers. The fast match and detailed match
sections of the hybrid arrangement accelerate operation of each other
while providing for automatic handling of cohorts and supporting
efficient search strategies while processing overhead is limited by
reducing the amount of data processed in each section supplemented by
efficient search strategies. These benefits are main