(12) United States Patent
Maes

(10) Patent No.: US 6,182,037 B1
(45) Date of Patent: *Jan. 30, 2001

(54) SPEAKER RECOGNITION OVER LARGE POPULATION WITH FAST AND DETAILED MATCHES

(75) Inventor: Stephane Herman Maes, Danbury, CT

(73) Assignee: International Business Machines Corporation

(*) Notice: This patent issued on a continued prosecution application filed under 37 CFR 1.53(d), and is subject to the twenty year patent term provisions of 35 U.S.C. 154(a)(2).

Under 35 U.S.C. 154(b), the term of this patent shall be extended for 0 days.

(21) Appl. No.: 08/851,982
(22) Filed: May 6, 1997

(51) Int. Cl.7: G10L 17/00
(52) U.S. Cl.: 704/247; 704/245
(58) Field of Search: 704/246, 247, 250, 249, 243, 245, 244
`Primary Examiner—David R. Hudspeth
`Assistant Examiner—Harold Zintel
`(74) Attorney, Agent, or Firm—McGuireWoods, LLP; Paul
`J. Otterstedt
`
(57) ABSTRACT

Fast and detailed match techniques for speaker recognition are combined into a hybrid system in which speakers are associated in groups when potential confusion is detected between a speaker being enrolled and a previously enrolled speaker. Thus the detailed match techniques are invoked only at the potential onset of saturation of the fast match technique while the detailed match is facilitated by limitation of comparisons to the group and the development of speaker-dependent models which principally function to distinguish between members of a group rather than to more fully characterize each speaker. Thus storage and computational requirements are limited and fast and accurate speaker recognition can be extended over populations of speakers which would degrade or saturate fast match systems and degrade performance of detailed match systems.

23 Claims, 2 Drawing Sheets
`
`140
`199po=-=-
`ACOUSTIC
`| SPEER j/
`130
`7
`FRONT-END
`COCCI pePeNDeNT
`(va)
`ay (CLUSTER)
`|
`Tv
`1 CODEBOOKS
`i140
`ao
`ee
`DECODING
`
`150
`
`\ Fst
`ay
`
`
`
`
`
`
`
`
`HISTROGRAM
`COUNTER
`
`
`f™
`a CLASS
`
`TENTATIVE 1D
`|speaxer/eLass of
`seuecTi
`J
`iD DECISION
`| _DATABESE |
`COMPARATOR
`5
`
`n0
`230
`0)
`COHORT Ino Le DETAILED
`
`
`MATCH200
`| SPEMER-
`|
`SPEECH DECODING
`| CASS/
`|
`
`y DEPENDENT |" cprever pecognmon{~| COHORT
`+
`ta———4J
`i
`MODEL
`|
`ENGINE
`| Mo
`iD
`DECISION
`
`160
`
`220
`
`.
`
`J
`
`IPR2023-00037
`Apple EX1018 Page1
`
`
`

`

`
`
`

`

U.S. Patent    Jan. 30, 2001    Sheet 1 of 2    US 6,182,037 B1

[FIG. 1: block diagram/flow chart of the system, showing fast match section 100 and detailed match section 200.]
`
`
`

`

U.S. Patent    Jan. 30, 2001    Sheet 2 of 2    US 6,182,037 B1

[FIGS. 2A and 2B: histograms of frame counts per speaker; the speaker axis is labeled 1 through 6.]
`
`

`

SPEAKER RECOGNITION OVER LARGE POPULATION WITH FAST AND DETAILED MATCHES

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention generally relates to speaker identification and verification in speech recognition systems and, more particularly, to rapid and text-independent speaker identification and verification over a large population of enrolled speakers.

2. Description of the Prior Art

Many electronic devices require input from a user in order to convey to the device particular information required to determine or perform a desired function or, in a trivially simple case, when a desired function is to be performed as would be indicated by, for example, activation of an on/off switch. When multiple different inputs are possible, a keyboard comprising an array of two or more switches has been the input device of choice in recent years.

However, keyboards of any type have inherent disadvantages. Most evidently, keyboards include a plurality of distributed actuable areas, each generally including moving parts subject to wear and damage and which must be sized to be actuated by a portion of the body unless a stylus or other separate mechanical expedient is employed. Accordingly, in many types of devices, such as input panels for security systems and electronic calculators, the size of the device is often determined by the dimensions of the keypad rather than the electronic contents of the housing. Additionally, numerous keystrokes may be required (e.g. to specify an operation, enter a security code, personal identification number (PIN), etc.) which slows operation and increases the possibility that erroneous actuation may occur. Therefore, use of a keyboard or other manually manipulated input structure requires action which is not optimally natural or expeditious for the user.

In an effort to provide a more naturally usable, convenient and rapid interface and to increase the capabilities thereof, numerous approaches to voice or sound detection and recognition systems have been proposed and implemented with some degree of success. Additionally, such systems could theoretically have the capability of matching utterances of a user against utterances of enrolled speakers for granting or denying access to resources of the device or system, identifying enrolled speakers or calling customized command libraries in accordance with speaker identity in a manner which may be relatively transparent and convenient to the user.

However, large systems including large resources are likely to have a large number of potential users and thus require massive amounts of storage and processing overhead to recognize speakers when the population of enrolled speakers becomes large. Saturation of the performance of speaker recognition systems will occur for simple and fast systems designed to quickly discriminate among different speakers when the size of the speaker population increases. Performance of most speaker-dependent (e.g. performing decoding of the utterance and aligning on the decoded script models such as hidden Markov models (HMM) adapted to the different speakers, the models presenting the highest likelihood of correct decoding identifying the speaker, and which may be text-dependent or text-independent) systems also degrades over large speaker populations but the tendency toward saturation and performance degradation is encountered over smaller populations with fast, simple systems which discriminate between speakers based on smaller amounts of information and thus tend to return ambiguous results when data for larger populations results in smaller differences between instances of data.

As an illustration, text-independent systems such as frame-by-frame feature clustering and classification may be considered as a fast match technique for speaker or speaker class identification. However, the numbers of speaker classes and the number of speakers in each class that can be handled with practical amounts of processing overhead in acceptable response times is limited. (In other words, while frame-by-frame classifiers require relatively small amounts of data for each enrolled speaker and less processing time for limited numbers of speakers, their discrimination power is correspondingly limited and becomes severely compromised as the distinctiveness of the speaker models (each containing relatively less information than in speaker-dependent systems) is reduced by increasing numbers of models.) It can be readily understood that any approach which seeks to reduce information (stored and/or processed) concerning speaker utterances may compromise the ability of the system to discriminate individual enrolled users as the population of users becomes large. At some size of the speaker population, the speaker recognition system or engine is no longer able to discriminate between some speakers. This condition is known as saturation.

On the other hand, more complex systems which use speaker dependent model-based decoders which are adapted to individual speakers to provide speaker recognition must run the models in parallel or sequentially to accomplish speaker recognition and therefore are extremely slow and require large amounts of memory and processor time. Additionally, such models are difficult to train and adapt since they typically require a large amount of data to form the model.

Some reduction in storage requirements has been achieved in template matching systems which are also text-dependent as well as speaker-dependent by reliance on particular utterances of each enrolled speaker which are specific to the speaker identification and/or verification function. However, such arrangements, by their nature, cannot be made transparent to the user; requiring a relatively lengthy enrollment and initial recognition (e.g. logon) procedure and more or less periodic interruption of use of the system for verification. Further and, perhaps, more importantly, such systems are more sensitive to variations of the utterances of each speaker ("intra-speaker" variations) such as may occur through aging, fatigue, illness, stress, prosody, psychological state and other conditions of each speaker.

More specifically, speaker-dependent speech recognizers build a model for each speaker during an enrollment phase of operation. Thereafter, a speaker and the utterance is recognized by the model which produces the largest likelihood or lowest error rate. Enough data is required to adapt each model to a unique speaker for all utterances to be recognized. For this reason, most speaker-dependent systems are also text-dependent and template matching is often used to reduce the amount of data to be stored in each model. Alternatively, systems using, for example, hidden Markov models (HMM) or similar statistical models usually involve the introduction of cohort models based on a group of speakers to be able to reject speakers which are too improbable.

Cohort models allow the introduction of confidence measures based on competing likelihoods of speaker identity and
`
`

`

are very difficult to build correctly, especially in increasing populations due to the number of similarities which may exist between utterances of different speakers as the population of enrolled speakers increases. For that reason, cohort models can be significant sources of potential error. Enrollment of new speakers is also complicated since it requires extraction of new cohorts and the development or modification of corresponding cohort models.

Template matching, in particular, does not allow the straightforward introduction of cohorts. Templates are usually the original waveforms of user utterances used for enrollment and the number of templates for each utterance is limited, as a practical matter, by the time which can reasonably be made available for the matching process. On the other hand, coverage of intra-speaker variations is limited by the number of templates which may be acquired or used for each utterance to be recognized and acceptable levels of coverage of intra-speaker variations become prohibitive as the user population becomes large. Development of cohorts, particularly to reduce data or simplify search strategies, tends to mask intra-speaker variation while being complicated thereby.

Further, template matching becomes less discriminating as the user population increases since the definition of distance measures between templates becomes more critical and complicates search strategies. Also, conceptually, template matching emphasizes the evolution of a dynamic (e.g. change in waveform over time) in the utterance and reproduction of that dynamic while that dynamic is particularly variable with condition of the speaker.

Accordingly, at the present state of the art, large speaker populations render text-independent, fast speaker recognition systems less suitable for use and, at some size of speaker population, render them ineffective, requiring slower, storage and processor intensive systems to be employed while degrading their performance as well. There has been no system available which allows maintaining of performance of speaker recognition comparable to fast, simple systems or increasing their discrimination power while limiting computational and memory requirements and avoiding saturation as the enrolled speaker population becomes large.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a system for rapidly discriminating individual enrolled users among a large population of enrolled users which is text-independent and transparent to the user after enrollment.

It is another object of the invention to provide a system for speaker identification and verification among a large population of enrolled users and having a simple, rapid, transparent and text-independent enrollment procedure.

It is a further object of the invention to improve the processing of speaker and cohort models during speech decoding and speaker recognition.

It is yet another object of the invention to provide fast speaker recognition over a large population of speakers without reduction of accuracy.

In order to accomplish these and other objects of the invention, a method for identification of speakers is provided including the steps of forming groups of enrolled speakers, identifying a speaker or a group of speakers among the groups of enrolled speakers which is most likely to include the speaker of a particular utterance, and matching the utterance against speaker-dependent models within the group of speakers to determine identity of a speaker of the utterance.

In accordance with another aspect of the invention, a speaker recognition apparatus is provided comprising a vector quantizer for sampling frames of an utterance and determining a most likely speaker of an utterance, including an arrangement for detecting potential confusion between a speaker of the utterance with one or more previously enrolled speakers, and an arrangement for developing a speaker-dependent model for distinguishing between the speaker and the previously enrolled speaker in response upon detection of potential confusion between them.

The invention utilizes a fast match process and a detailed match, if needed, in sequence so that the detailed match is implemented at or before the onset of saturation of the fast match process by an increasing population of users. The detailed match is accelerated by grouping of users in response to detection of potential confusion and limits storage by developing models directed to distinguishing between members of a group while facilitating and accelerating the detailed match process by limiting the number of candidate speakers or groups.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:

FIG. 1 is a block diagram/flow chart illustrating the architecture and operation of a preferred form of the invention, and

FIGS. 2A and 2B are graphical representations of histogram processing in accordance with the invention.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

Referring now to the drawings, and more particularly to FIG. 1, there is shown a high level block diagram of a preferred form of the invention. FIG. 1 can also be understood as a flow chart illustrating the operation of the invention as will be discussed below. It should also be understood that the architecture and operation of the system as illustrated in FIG. 1 may be implemented as a special purpose data processor or, preferably, by a suitably programmed general purpose data processor, in which latter case, the illustrated functional elements will be configured therein during initialization or as needed during operation of the program as is well-understood in the art.

Initially, it should be appreciated that the configuration of the preferred form of the invention is generally divided into two sections and thus is well-described as a hybrid system. The upper portion 100, as illustrated, is a feature vector based fast match speaker recognition/classification system which is text-independent. The lower portion 200, as illustrated, is a detailed match arrangement based on speaker models 210 or cohort models 220 and may be text-dependent or text-independent while the upper portion 100 is inherently text-independent. It should be understood that the overall system in accordance with the invention may be text-dependent or text-independent in accordance with the implementation chosen for lower, detailed match portion 200.

These portions of the system architecture represent sequential stages of processing with the detailed match being conducted only when a decision cannot be made by the first, fast match stage while, even if unsuccessful, the first stage enhances performance of the second stage by
`
`

`

automatic selection of speaker or cohort models for the detailed match as well as automatically selecting them. The selection of cohorts, while needed for the detailed match processing, also accelerates the fast match processes in some cases as will be discussed below.

More specifically, an acoustic front-end 110 which is, itself, well-understood in the art, is used to sample utterances in an overlapping fashion and to extract feature vectors 120 therefrom, preferably as MEL cepstra, delta and delta-delta coefficients along with normalized log-energies. (Log-energies and cepstra C0 should not be included.) In combination therewith, a vector quantizer clusters the feature vectors produced from the enrollment data as means and variances thereof for efficient storage as well as to quantize the feature vectors derived from utterances (test data) to be recognized.

Such feature vectors are preferably computed on overlapping 25-30 msec. frames with shifts of 10 msec. Physiologically related (e.g. characterizing vocal tract signatures such as resonances) MEL cepstra, delta and delta-delta feature vectors are preferred as feature vectors for efficiency and effectiveness of speaker identification or verification although other known types of feature vectors could be used. Such feature vectors and others, such as LPC cepstra, are usually thirty-nine dimension vectors, as is well-understood in the art. The resulting feature vectors are clustered into about sixty-five codewords (the number is not critical to practice of the invention) in accordance with a Mahalanobis distance. In practice, the variances of each coordinate of the feature vectors can be empirically determined over a representative set of speakers and the measure of association of a vector relative to a codeword is a weighted Euclidean distance with the weight being the inverse of the associated variances. The set of codewords thus derived constitute a codebook 130 for each enrolled speaker.

It should be noted that only one codebook is required for each enrolled speaker. Therefore, storage requirements (e.g. memory 130) for the fast match section of the invention are quite small and no complex model of a complete utterance is required. Any new speaker enrollment requires only the addition of an additional codebook while leaving other codebooks of previously enrolled speakers unaffected, reducing enrollment complexity. Also, since the memory 130 consists of similarly organized codebooks, efficient hierarchical (multi-resolution) approaches to searching can be implemented as the number of enrolled users (and associated codebooks) becomes large.

Thereafter, test information is decoded frame-by-frame against the codebooks by decoder 140. Each frame of test data which provides an arbitrarily close match to a codeword provides an identification of the codebook which contains it. The frames which thus identify a particular codebook are counted in accordance with the codebook identified by each frame by counter 150 and a histogram is developed, as illustrated for speakers 1-5 of FIG. 2A. Generally, one codebook will emerge as being identified by a statistically significant or dominant number of frames after a few seconds of arbitrary speech and the speaker (e.g. speaker 1 of FIG. 2A) is thus identified, as preferably detected by a comparator arrangement 160. The divergence of histogram peak magnitudes also provides a direct measure of confidence level of the speaker identification. If two or more peaks are of similar (not statistically significant) magnitude, further processing as will be described below can be performed for a detailed match 200 for speaker identification.

However, in accordance with the invention, the feature vectors are also decoded against existing codebooks during enrollment by developing a histogram as described above. If the speaker being enrolled (e.g. speaker 6 of FIG. 2A) is confused with an existing enrolled speaker by developing a histogram peak magnitude which is similar to that of a previously enrolled speaker (e.g. potentially identifying a new speaker as a previously enrolled speaker), a class is formed in database 300 (in this case containing speakers 1 and 6) responsive to comparator 160 including the speakers whose utterances produce similar feature vectors. Data from the different speakers is then used to adapt speaker-dependent models capable of distinguishing between them and the models are stored in database 300.

It should be appreciated that detecting potential confusion is tantamount to the onset of saturation of the fast match system so that the fast match system can be utilized to the full extent of its discrimination powers and detailed matching is performed only when beyond the capability of the fast match portion of the invention (unless intentionally limited by conservative design such as imposition of a low statistical threshold for potential confusion). However, such onset of saturation is detected during enrollment and, in this sense, the configuration of the system in accordance with the invention is adaptive to supplement the fast match portion, when necessary, by a detailed matching process. The detailed match is, itself, facilitated by limiting the scope of comparisons to members of a group and the adaptation of speaker-dependent models optimized or at least adapted to make the necessary distinctions between members of the group. The number of groups and number of speakers per group will always be minimized since groups are generated and members added to a group only when potential confusion is detected.

Of course, development and adaptation of such speaker-dependent models requires substantially more data to be collected for each such speaker. However, such data can be collected during somewhat extended enrollment for the speaker being enrolled (speaker 6) and later for the speakers (e.g. speaker 1) with which the newly enrolled speaker is confused during their next use of the system. It should also be noted that the development of classes automatically selects or defines cohorts from which cohort models can be developed and provides for collection and storage of additional data only when necessary as the enrolled user population increases.

It should also be noted that after at least one class is defined and created as described above, test data which results in confusion between speakers, as a histogram is developed by counter 150, can be compared against the class or classes, if any, to which each candidate speaker is assigned. This comparison can often provide useable results after only a few seconds of speech or even a few hundred frames. If, for example, during verification (e.g. the periodic testing of speech to be that of a previously identified speaker) a class is identified other than the class to which the previously identified speaker belongs, the verification can be considered to have failed. This possibility is particularly useful in denying access to a user of a secure system when verification fails after access is granted upon initial identification. For identification, as soon as two or a limited number of speakers dominate, only the speakers in the one or two classes corresponding to the dominating speakers need be considered further. Both of these decisions, taken after only a relatively few seconds or small number of frames, greatly accelerate the speaker recognition process.

Other decisions can also be made in a manner which facilitates fast and/or detailed match processing and may allow a possible or at least tentative identification of the speaker to be made by the fast match processing alone. For example, as illustrated in FIG. 2B where comparable counts are developed for speakers 1 and 3, if the candidate speakers do not belong to the same class (e.g. speaker 3, when enrolled, did not cause creation of a class with speaker 1), the speaker associated with the greater histogram peak can usually be correctly selected or tentatively identified (or speakers not classed with other speakers between which there is confusion can be eliminated) by the fast match process on the basis of a relatively few frames since it can be assumed that a divergence of magnitude of the histogram peaks would later develop based on further speech. This feature of the invention provides acceleration of the speaker recognition process by the fast match section 100 of the invention and allows a speaker dependent model 210 to be called for use by the cohort and speech decoder 230 from database 300 for speaker identity verification and speech recognition.
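The fast match decisions described above (frame-by-frame decoding against per-speaker codebooks, histogram counting, and the comparator's use of class membership) can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the nearest-codeword rule, and the `ratio` dominance threshold are all assumptions made for illustration, the last standing in for the statistical significance test that the text leaves unspecified.

```python
from collections import Counter


def weighted_sq_distance(vec, codeword, variances):
    """Squared Euclidean distance, each coordinate weighted by 1/variance."""
    return sum((v - c) ** 2 / s for v, c, s in zip(vec, codeword, variances))


def decode_frames(frames, codebooks, variances):
    """Histogram of frames per speaker (assumed rule: nearest codeword wins)."""
    counts = Counter()
    for frame in frames:
        best = min(
            codebooks,
            key=lambda spk: min(
                weighted_sq_distance(frame, cw, variances) for cw in codebooks[spk]
            ),
        )
        counts[best] += 1
    return counts


def fast_match_decision(counts, speaker_class, ratio=1.5):
    """Return (decision, candidates) from a non-empty histogram.

    `ratio` is a hypothetical dominance threshold, not a value from the text.
    `speaker_class` maps a speaker to a class label; unclassed speakers are absent.
    """
    ranked = counts.most_common()
    top, top_n = ranked[0]
    if len(ranked) == 1 or top_n >= ratio * ranked[1][1]:
        return "identified", [top]            # one peak clearly dominates
    runner_up = ranked[1][0]
    cls = speaker_class.get(top)
    if cls is not None and cls == speaker_class.get(runner_up):
        # Confusable speakers share a class: hand the class to the detailed match.
        members = sorted(s for s, c in speaker_class.items() if c == cls)
        return "detailed_match", members
    # Comparable peaks, different classes: tentatively pick the greater peak.
    return "tentative", [top]
```

With a real front end the frames would be the thirty-nine dimension feature vectors and each codebook would hold about sixty-five codewords; the sketch works for any dimension as long as `variances` has one entry per coordinate.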
If the speakers are in the same class as detected by comparator 160 accessing database 300, the speaker dependent models of all cohorts of the single class can be called at an early point in time in order to distinguish between them which is also done by the speaker recognition engine included with cohort and speech decoder 230. It should be noted that this selection of a class limits the data processed to that which is necessary for discrimination between the speakers which are, in fact, confused by fast match section 100 and results in reduction of processing time and overhead as well as storage requirements for the speaker-dependent models. Cohorts are needed only when confusion actually occurs, reducing the total storage requirements. Further, the cohort model 220 can be used for speech decoding at an earlier time since an ambiguous decoding of an utterance is unlikely within cohorts.
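Because classes exist only where confusion has actually been observed, the class database can be maintained with very little bookkeeping at enrollment time. The following sketch shows one way this might look. The function name `update_classes` and the `min_share` threshold are hypothetical, as is the rule itself: a previously enrolled speaker is treated as confusable with the new speaker when that speaker's codebook attracts at least `min_share` of the new speaker's enrollment frames. The text only requires that the histogram peaks be similar, without fixing a criterion.

```python
def update_classes(new_speaker, enrollment_counts, classes, min_share=0.5):
    """Return an updated list of classes (sets of speakers) after one enrollment.

    enrollment_counts: histogram of the new speaker's enrollment frames decoded
    against the codebooks of previously enrolled speakers.
    min_share: hypothetical confusion threshold (fraction of all frames).
    """
    total = sum(enrollment_counts.values())
    confusable = {
        spk for spk, n in enrollment_counts.items() if total and n / total >= min_share
    }
    if not confusable:
        return list(classes)          # no confusion detected: no class is created

    merged = {new_speaker} | confusable
    remaining = []
    for cls in classes:
        if cls & merged:
            merged |= cls             # absorb any class that already holds a member
        else:
            remaining.append(cls)
    remaining.append(merged)
    return remaining
```

In the example of FIG. 2A, enrolling speaker 6 against a database in which speaker 1 attracts most of the frames would produce a single class containing speakers 1 and 6, and speakers who are never confused remain outside any class.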
The speech decoding engine 230 preferably utilizes hidden Markov models (HMMs) with continuous density Gaussian mixtures to model the output distribution (i.e. the probability density function to observe a given acoustic vector at a given arc of the HMM model). A set of maximum-a-posteriori (MAP) estimated models, or adapted by other speaker-dependent algorithms like re-training, adaptation by correlation (ABC), maximum likelihood linear regression (MLLR) or clustered transformation (CT), are loaded for different pre-loaded speakers. During enrollment, the utterances are decoded with a gender-independent system. Then each pre-loaded system is used to compute likelihoods for a same alignment. The N-best speakers are extracted and linear transforms are computed to map each of the selected pre-loaded speaker models closer to the enrolled speaker. Using this data, new Gaussians are built for the new speakers. Unobserved Gaussians are preferably adapted using the ABC algorithm. During speaker recognition, the likelihoods produced by the speaker and its cohorts are compared for a same alignment produced by a speaker independent model.
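The comparison of likelihoods over a common alignment can be sketched as follows. This is a simplified stand-in rather than the engine described above: each speaker's model is reduced to one diagonal-covariance Gaussian per aligned state instead of an HMM with Gaussian mixtures, the alignment is assumed to be supplied by a separate speaker-independent decoder, and the names `log_gauss` and `detailed_match` are illustrative only.

```python
import math


def log_gauss(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * s) + (v - m) ** 2 / s)
        for v, m, s in zip(x, mean, var)
    )


def detailed_match(frames, alignment, models, candidates):
    """Score each candidate speaker on the same frame-to-state alignment.

    alignment[i] is the state assigned to frames[i] (assumed to come from a
    speaker-independent decoding pass); models[speaker][state] = (mean, var).
    Returns the best-scoring speaker and all total log-likelihoods.
    """
    scores = {
        spk: sum(log_gauss(f, *models[spk][st]) for f, st in zip(frames, alignment))
        for spk in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Restricting `candidates` to the members of one class is what keeps this stage inexpensive: only the models of speakers who were actually confused by the fast match are evaluated.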
`While this process may be computationally intensive, the
`enrollment data used to distinguish between cohorts may, in
`practice, be quite limited, particularly if made text-dependent
`or text-prompted. In these latter cases, if the fast match
`identification or verification is not successful, identification
`or verification may be carried out in a text-dependent
`fashion. However, the computation and comparisons of the
`alignments described above allow text-independence for
`identification or verification, if desired. Thus,
`text-independence is achieved for the majority of identification
`and verification operations by the fast match processing
`while minimizing storage requirements and computational
`overhead to very low levels in the detailed match stage 200
`which is thus accelerated, as described above.
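The division of labor between the two stages can be sketched as follows. This is only an illustration under stated assumptions: the fast match is assumed to yield one score per enrolled speaker (higher is better), "confusion" is modeled by a hypothetical `margin` threshold on the score gap, and `detailed_match` is a placeholder callable standing in for the detailed match stage. None of these names or the threshold rule come from the patent.

```python
def identify(fast_scores, detailed_match, margin=0.1):
    """Return (speaker, stage) for a non-empty dict of fast match scores.

    fast_scores    -- speaker name -> fast match score (higher is better)
    detailed_match -- callable taking the list of confused speakers and
                      returning the one it selects (hypothetical stand-in)
    margin         -- assumed score gap below which speakers count as confused
    """
    ranked = sorted(fast_scores, key=fast_scores.get, reverse=True)
    best = ranked[0]
    # Speakers whose score is within `margin` of the best one are treated
    # as confused with it; the best speaker is always in this list.
    confused = [s for s in ranked if fast_scores[best] - fast_scores[s] < margin]
    if len(confused) == 1:
        return best, "fast"  # unambiguous: the detailed match is skipped
    # Only the confused speakers are passed on, which limits the data the
    # detailed stage has to process.
    return detailed_match(confused), "detailed"
```

For example, `identify({"a": 0.9, "b": 0.5}, detailed_match)` is resolved by the fast stage alone, while scores of 0.9 and 0.85 for two speakers would hand just those two to the detailed stage.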
`In view of the foregoing, the hybrid system of the
`invention combining fast and detailed match sections pro-
`vides very rapid speaker identification with little, if any,
`increase in storage requirements, since the processing of the
`detailed match stage generally allows reduction of storage
`requirement by more than enough to compensate for storage
`of codebooks, largely because the speaker-dependent mod-
`els may be built principally for distinguishing between
`speakers of a group rather than more fully characterizing the
`speech of each speaker. Enrollment, identification and veri-
`fication of speaker identity are conducted in a manner
`transparent to the user except to the extent text-dependence
`may be used to limit storage for a small number of discriminations
`between speakers. The fast match and detailed match
`sections of the hybrid arrangement accelerate operation of
`each other while providing for automatic handling of cohorts
`and supporting efficient search strategies while processing
`overhead is limited by reducing the amount of data pro-
`cessed in each section supplemented by efficient search
`strategies. These benefits are main
