(12) United States Patent
Maes

US006182037B1
(10) Patent No.: US 6,182,037 B1
(45) Date of Patent: *Jan. 30, 2001

(54) SPEAKER RECOGNITION OVER LARGE
POPULATION WITH FAST AND DETAILED
MATCHES
`
(75) Inventor: Stephane Herman Maes, Danbury, CT (US)
`
(73) Assignee: International Business Machines Corporation

(*) Notice: This patent issued on a continued prosecution application filed under 37 CFR 1.53(d), and is subject to the twenty year patent term provisions of 35 U.S.C. 154(a)(2).
Under 35 U.S.C. 154(b), the term of this patent shall be extended for 0 days.

(21) Appl. No.: 08/851,982
(22) Filed: May 6, 1997

(51) Int. Cl.⁷: G10L 17/00
(52) U.S. Cl.: 704/247; 704/245
(58) Field of Search: 704/246, 247, 250, 249, 243, 245, 244
`
(56) References Cited

U.S. PATENT DOCUMENTS

3,673,331 * 6/1972 Hair et al. ... 704/246
4,363,102 12/1982 Holmgren et al.
4,716,593 12/1987 Hirai et al.
4,720,863 1/1988 Li et al.
4,827,518 5/1989 Feustel et al.
4,947,436 8/1990 Greaves et al. ... 704/206
5,073,939 12/1991 Vensko et al.
5,121,428 6/1992 Uchiyama et al.
5,167,004 11/1992 Netsch et al.
5,189,727 2/1993 Guerreri
5,216,720 6/1993 Naik et al.
5,241,649 8/1993 Niyada
5,271,088 12/1993 Bahler
5,274,695 12/1993 Green
5,339,385 8/1994 Higgins
5,347,595 9/1994 Bokser ... 382/225
5,384,833 1/1995 Cameron
5,412,738 5/1995 Brunelli et al.
5,414,755 5/1995 Bahler et al.

(List continued on next page.)

FOREIGN PATENT DOCUMENTS

1984 (JP).

OTHER PUBLICATIONS

T. Matsui et al.; "A Study of Model and a Priori Threshold Updating in Speaker Verification"; Technical Report of the Institute of Electronics, Information & Communications Engineers; SP95-120 (1996-01); pp. 21–26.

(List continued on next page.)

Primary Examiner: David R. Hudspeth
Assistant Examiner: Harold Zintel
(74) Attorney, Agent, or Firm: McGuireWoods, LLP; Paul J. Otterstedt

(57) ABSTRACT

Fast and detailed match techniques for speaker recognition are combined into a hybrid system in which speakers are associated in groups when potential confusion is detected between a speaker being enrolled and a previously enrolled speaker. Thus the detailed match techniques are invoked only at the potential onset of saturation of the fast match technique while the detailed match is facilitated by limitation of comparisons to the group and the development of speaker-dependent models which principally function to distinguish between members of a group rather than to more fully characterize each speaker. Thus storage and computational requirements are limited and fast and accurate speaker recognition can be extended over populations of speakers which would degrade or saturate fast match systems and degrade performance of detailed match systems.
`
`23 Claims, 2 Drawing Sheets
`
[FIG. 1 (front-page drawing): block diagram. Fast match 100: acoustic front-end 110, feature vectors 120, speaker-dependent codebooks (clusters) 130, frame-by-frame decoding 140, histogram counter 150, comparator 160, tentative ID/decision. Detailed match 200: speaker-dependent model 210, class/cohort model 220, cohort and speech decoding / speaker recognition engine 230, decision. Speaker/class selection database 300.]
`Amazon / Zentian Limited
`Exhibit 1018
`Page 1
`
`

`

`
U.S. PATENT DOCUMENTS

5,608,840 3/1997 Tsuboka ... 704/236
5,666,466 * 9/1997 Lin et al. ... 704/246
5,675,704 * 10/1997 Juang et al. ... 704/246
5,682,464 * 10/1997 Sejnoha ... 704/238
5,689,616 11/1997 Li ... 704/232
5,895,447 * 4/1999 Ittycheriah et al. ... 704/231

OTHER PUBLICATIONS

Parsons, "Voice and Speech Processing", 1987, McGraw-Hill, pp. 332–336.*
Bahl et al., "A fast approximate acoustic match for large vocabulary speech recognition", IEEE Transactions, Jan. 1993, pp. 59–67.*
Rudasi, "Text-independent talker identification using recurrent neural networks", J Acoust Soc Am Supp 1 v 87, pg. S104, 1990.*
"Merriam-Webster collegiate dictionary", pp. 211 and 550, 1993.*
Rabiner, "Digital processing of Speech Signals", p. 478, 1978.*
Parsons, "Voice and Speech Processing", 1987, McGraw-Hill, p. 175.*
Yu et al., "Speaker recognition using hidden Markov models, dynamic time warping and vector quantisation", Oct. 1995, IEEE, 313–318.*
Rosenberg, Lee, and Soong, Sub-Word Unit Talker Verification Using Hidden Markov Models, 1990, AT&T Bell Laboratories, pp 269–272.
Herbert Gish, Robust Discrimination in Automatic Speaker Identification, BBN Systems and Technologies Corporation, pp 289–292.
Naik, Netsch and Doddington, Speaker Verification Over Long Distance Telephone Lines, Texas Instruments Inc., pp 524–527.

* cited by examiner
`
`
`

`

[Sheet 2 of 2: FIG. 2A and FIG. 2B, histograms with a SPEAKER axis (speakers 1, 2, 3, ...).]
`
`
`

`

SPEAKER RECOGNITION OVER LARGE
POPULATION WITH FAST AND DETAILED
MATCHES
`
`BACKGROUND OF THE INVENTION
`1. Field of the Invention
The present invention generally relates to speaker identification and verification in speech recognition systems and, more particularly, to rapid and text-independent speaker identification and verification over a large population of enrolled speakers.
`2. Description of the Prior Art
Many electronic devices require input from a user in order to convey to the device particular information required to determine or perform a desired function or, in a trivially simple case, when a desired function is to be performed as would be indicated by, for example, activation of an on/off switch. When multiple different inputs are possible, a keyboard comprising an array of two or more switches has been the input device of choice in recent years.
However, keyboards of any type have inherent disadvantages. Most evidently, keyboards include a plurality of distributed actuable areas, each generally including moving parts subject to wear and damage and which must be sized to be actuated by a portion of the body unless a stylus or other separate mechanical expedient is employed. Accordingly, in many types of devices, such as input panels for security systems and electronic calculators, the size of the device is often determined by the dimensions of the keypad rather than the electronic contents of the housing. Additionally, numerous keystrokes may be required (e.g. to specify an operation, enter a security code, personal identification number (PIN), etc.) which slows operation and increases the possibility that erroneous actuation may occur. Therefore, use of a keyboard or other manually manipulated input structure requires action which is not optimally natural or expeditious for the user.
In an effort to provide a more naturally usable, convenient and rapid interface and to increase the capabilities thereof, numerous approaches to voice or sound detection and recognition systems have been proposed and implemented with some degree of success. Additionally, such systems could theoretically have the capability of matching utterances of a user against utterances of enrolled speakers for granting or denying access to resources of the device or system, identifying enrolled speakers or calling customized command libraries in accordance with speaker identity in a manner which may be relatively transparent and convenient to the user.
However, large systems including large resources are likely to have a large number of potential users and thus require massive amounts of storage and processing overhead to recognize speakers when the population of enrolled speakers becomes large. Saturation of the performance of speaker recognition systems will occur for simple and fast systems designed to quickly discriminate among different speakers when the size of the speaker population increases. Performance of most speaker-dependent systems also degrades over large speaker populations. (Such systems, for example, decode the utterance and align on the decoded script models, such as hidden Markov models (HMMs), adapted to the different speakers, the model presenting the highest likelihood of correct decoding identifying the speaker; they may be text-dependent or text-independent.) However, the tendency toward saturation and performance degradation is encountered over smaller populations with fast, simple systems, which discriminate between speakers based on smaller amounts of information and thus tend to return ambiguous results when data for larger populations results in smaller differences between instances of data.
As an illustration, text-independent systems such as frame-by-frame feature clustering and classification may be considered as a fast match technique for speaker or speaker class identification. However, the number of speaker classes and the number of speakers in each class that can be handled with practical amounts of processing overhead in acceptable response times is limited. (In other words, while frame-by-frame classifiers require relatively small amounts of data for each enrolled speaker and less processing time for limited numbers of speakers, their discrimination power is correspondingly limited and becomes severely compromised as the distinctiveness of the speaker models (each containing relatively less information than in speaker-dependent systems) is reduced by increasing numbers of models.) It can be readily understood that any approach which seeks to reduce information (stored and/or processed) concerning speaker utterances may compromise the ability of the system to discriminate individual enrolled users as the population of users becomes large. At some size of the speaker population, the speaker recognition system or engine is no longer able to discriminate between some speakers. This condition is known as saturation.
On the other hand, more complex systems which use speaker-dependent model-based decoders which are adapted to individual speakers to provide speaker recognition must run the models in parallel or sequentially to accomplish speaker recognition and therefore are extremely slow and require large amounts of memory and processor time. Additionally, such models are difficult to train and adapt since they typically require a large amount of data to form the model.
Some reduction in storage requirements has been achieved in template matching systems which are also text-dependent as well as speaker-dependent by reliance on particular utterances of each enrolled speaker which are specific to the speaker identification and/or verification function. However, such arrangements, by their nature, cannot be made transparent to the user, requiring a relatively lengthy enrollment and initial recognition (e.g. logon) procedure and more or less periodic interruption of use of the system for verification. Further and, perhaps, more importantly, such systems are more sensitive to variations of the utterances of each speaker ("intra-speaker" variations) such as may occur through aging, fatigue, illness, stress, prosody, psychological state and other conditions of each speaker.
More specifically, speaker-dependent speech recognizers build a model for each speaker during an enrollment phase of operation. Thereafter, a speaker and the utterance is recognized by the model which produces the largest likelihood or lowest error rate. Enough data is required to adapt each model to a unique speaker for all utterances to be recognized. For this reason, most speaker-dependent systems are also text-dependent and template matching is often used to reduce the amount of data to be stored in each model. Alternatively, systems using, for example, hidden Markov models (HMM) or similar statistical models usually involve the introduction of cohort models based on a group of speakers to be able to reject speakers which are too improbable.
Cohort models allow the introduction of confidence measures based on competing likelihoods of speaker identity and are very difficult to build correctly, especially in increasing populations due to the number of similarities which may exist between utterances of different speakers as the population of enrolled speakers increases. For that reason, cohort models can be significant sources of potential error. Enrollment of new speakers is also complicated since it requires extraction of new cohorts and the development or modification of corresponding cohort models.
Template matching, in particular, does not allow the straightforward introduction of cohorts. Templates are usually the original waveforms of user utterances used for enrollment and the number of templates for each utterance is limited, as a practical matter, by the time which can reasonably be made available for the matching process. On the other hand, coverage of intra-speaker variations is limited by the number of templates which may be acquired or used for each utterance to be recognized, and acceptable levels of coverage of intra-speaker variations become prohibitive as the user population becomes large. Development of cohorts, particularly to reduce data or simplify search strategies, tends to mask intra-speaker variation while being complicated thereby.
Further, template matching becomes less discriminating as the user population increases since the definition of distance measures between templates becomes more critical and complicates search strategies. Also, conceptually, template matching emphasizes the evolution of a dynamic (e.g. change in waveform over time) in the utterance and reproduction of that dynamic while that dynamic is particularly variable with condition of the speaker.
Accordingly, at the present state of the art, large speaker populations render text-independent, fast speaker recognition systems less suitable for use and, at some size of speaker population, render them ineffective, requiring slower, storage and processor intensive systems to be employed while degrading their performance as well. There has been no system available which allows maintaining of performance of speaker recognition comparable to fast, simple systems or increasing their discrimination power while limiting computational and memory requirements and avoiding saturation as the enrolled speaker population becomes large.
`SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a system for rapidly discriminating individual enrolled users among a large population of enrolled users which is text-independent and transparent to the user after enrollment.
It is another object of the invention to provide a system for speaker identification and verification among a large population of enrolled users and having a simple, rapid, transparent and text-independent enrollment procedure.
It is a further object of the invention to improve the processing of speaker and cohort models during speech decoding and speaker recognition.
It is yet another object of the invention to provide fast speaker recognition over a large population of speakers without reduction of accuracy.
In order to accomplish these and other objects of the invention, a method for identification of speakers is provided including the steps of forming groups of enrolled speakers, identifying a speaker or a group of speakers among the groups of enrolled speakers which is most likely to include the speaker of a particular utterance, and matching the utterance against speaker-dependent models within the group of speakers to determine identity of a speaker of the utterance.
`
In accordance with another aspect of the invention, a speaker recognition apparatus is provided comprising a vector quantizer for sampling frames of an utterance and determining a most likely speaker of an utterance, including an arrangement for detecting potential confusion between a speaker of the utterance and one or more previously enrolled speakers, and an arrangement for developing a speaker-dependent model for distinguishing between the speaker and the previously enrolled speaker in response to detection of potential confusion between them.
The invention utilizes a fast match process and a detailed match, if needed, in sequence so that the detailed match is implemented at or before the onset of saturation of the fast match process by an increasing population of users. The detailed match is accelerated by grouping of users in response to detection of potential confusion and limits storage by developing models directed to distinguishing between members of a group while facilitating and accelerating the detailed match process by limiting the number of candidate speakers or groups.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:
FIG. 1 is a block diagram/flow chart illustrating the architecture and operation of a preferred form of the invention, and
FIGS. 2A and 2B are graphical representations of histogram processing in accordance with the invention.
`
`DETAILED DESCRIPTION OF A PREFERRED
`EMBODIMENT OF THE INVENTION
Referring now to the drawings, and more particularly to FIG. 1, there is shown a high level block diagram of a preferred form of the invention. FIG. 1 can also be understood as a flow chart illustrating the operation of the invention as will be discussed below. It should also be understood that the architecture and operation of the system as illustrated in FIG. 1 may be implemented as a special purpose data processor or, preferably, by a suitably programmed general purpose data processor, in which latter case, the illustrated functional elements will be configured therein during initialization or as needed during operation of the program as is well-understood in the art.
Initially, it should be appreciated that the configuration of the preferred form of the invention is generally divided into two sections and thus is well-described as a hybrid system. The upper portion 100, as illustrated, is a feature vector based fast match speaker recognition/classification system which is text-independent. The lower portion 200, as illustrated, is a detailed match arrangement based on speaker models 210 or cohort models 220 and may be text-dependent or text-independent while the upper portion 100 is inherently text-independent. It should be understood that the overall system in accordance with the invention may be text-dependent or text-independent in accordance with the implementation chosen for lower, detailed match portion 200.
These portions of the system architecture represent sequential stages of processing with the detailed match being conducted only when a decision cannot be made by the first, fast match stage while, even if unsuccessful, the first stage enhances performance of the second stage by automatically selecting the speaker or cohort models for the detailed match. The selection of cohorts, while needed for the detailed match processing, also accelerates the fast match processes in some cases as will be discussed below.
More specifically, an acoustic front-end 110 which is, itself, well-understood in the art, is used to sample utterances in an overlapping fashion and to extract feature vectors 120 therefrom, preferably as MEL cepstra, delta and delta-delta coefficients along with normalized log-energies. (Log-energies and cepstra C0 should not be included.) In combination therewith, a vector quantizer clusters the feature vectors produced from the enrollment data as means and variances thereof for efficient storage as well as to quantize the feature vectors derived from utterances (test data) to be recognized.
Such feature vectors are preferably computed on overlapping 25-30 msec. frames with shifts of 10 msec. Physiologically related (e.g. characterizing vocal tract signatures such as resonances) MEL cepstra, delta and delta-delta feature vectors are preferred as feature vectors for efficiency and effectiveness of speaker identification or verification although other known types of feature vectors could be used. Such feature vectors and others, such as LPC cepstra, are usually thirty-nine dimension vectors, as is well-understood in the art. The resulting feature vectors are clustered into about sixty-five codewords (the number is not critical to practice of the invention) in accordance with a Mahalanobis distance. In practice, the variances of each coordinate of the feature vectors can be empirically determined over a representative set of speakers and the measure of association of a vector relative to a codeword is a weighted Euclidean distance with the weight being the inverse of the associated variances. The set of codewords thus derived constitutes a codebook 130 for each enrolled speaker.
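The clustering step described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the text does not name a clustering algorithm, so a plain k-means loop is assumed here, each codeword is reduced to a mean vector (the text also stores variances), and the names `weighted_sq_distance`, `nearest_codeword` and `train_codebook` are hypothetical.

```python
import random

def weighted_sq_distance(x, codeword, variances):
    """Squared Euclidean distance with each coordinate weighted by 1/variance."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, codeword, variances))

def nearest_codeword(x, codebook, variances):
    """Index of the codeword closest to x under the weighted distance."""
    return min(range(len(codebook)),
               key=lambda k: weighted_sq_distance(x, codebook[k], variances))

def train_codebook(vectors, n_codewords, variances, n_iter=10, seed=0):
    """Cluster one speaker's enrollment vectors into codewords.

    Assumption: k-means style updates; the source only says the vectors
    are clustered in accordance with a variance-weighted distance.
    """
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, n_codewords)]
    for _ in range(n_iter):
        buckets = [[] for _ in codebook]
        for v in vectors:
            buckets[nearest_codeword(v, codebook, variances)].append(v)
        for k, members in enumerate(buckets):
            if members:  # keep the old codeword if its cluster is empty
                codebook[k] = [sum(col) / len(members) for col in zip(*members)]
    return codebook
```

In a setting like the one described, `vectors` would hold thirty-nine dimension feature vectors from one speaker's enrollment speech, `n_codewords` would be about sixty-five, and `variances` would be the per-coordinate variances measured over a representative set of speakers.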
It should be noted that only one codebook is required for each enrolled speaker. Therefore, storage requirements (e.g. memory 130) for the fast match section of the invention are quite small and no complex model of a complete utterance is required. Any new speaker enrollment requires only the addition of an additional codebook while leaving other codebooks of previously enrolled speakers unaffected, reducing enrollment complexity. Also, since the memory 130 consists of similarly organized codebooks, efficient hierarchical (multi-resolution) approaches to searching can be implemented as the number of enrolled users (and associated codebooks) becomes large.
Thereafter, test information is decoded frame-by-frame against the codebooks by decoder 140. Each frame of test data which provides an arbitrarily close match to a codeword provides an identification of the codebook which contains it. The frames which thus identify a particular codebook are counted in accordance with the codebook identified by each frame by counter 150 and a histogram is developed, as illustrated for speakers 1-5 of FIG. 2A. Generally, one codebook will emerge as being identified by a statistically significant or dominant number of frames after a few seconds of arbitrary speech and the speaker (e.g. speaker 1 of FIG. 2A) is thus identified, as preferably detected by a comparator arrangement 160. The divergence of histogram peak magnitudes also provides a direct measure of confidence level of the speaker identification. If two or more peaks are of similar magnitude (i.e. the difference between them is not statistically significant), further processing as will be described below can be performed for a detailed match 200 for speaker identification.
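The frame counting and comparison described above might be sketched as follows. This is a simplified illustration under stated assumptions: every frame votes for the codebook holding its nearest codeword, a plain (unweighted) squared distance is used, and "dominant" is approximated by a hypothetical `ratio` between the two largest peaks instead of a statistical significance test. The names `sq_distance` and `fast_match` are not from the source.

```python
from collections import Counter

def sq_distance(x, y):
    """Plain squared Euclidean distance (variance weighting omitted for brevity)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def fast_match(frames, codebooks, ratio=1.5):
    """Count, per speaker, the frames whose nearest codeword lies in that
    speaker's codebook, then pick a winner only if one peak dominates.

    codebooks: dict mapping speaker id -> list of codewords.
    Returns (histogram, winner); winner is None when the two largest
    peaks are of similar magnitude (assumed threshold: `ratio`).
    """
    histogram = Counter()
    for frame in frames:
        best = min(codebooks,
                   key=lambda spk: min(sq_distance(frame, cw)
                                       for cw in codebooks[spk]))
        histogram[best] += 1
    ranked = histogram.most_common(2)
    if not ranked:
        return histogram, None
    if len(ranked) == 1 or ranked[0][1] >= ratio * ranked[1][1]:
        return histogram, ranked[0][0]
    return histogram, None
```

A `None` winner corresponds to the case where the fast match cannot decide and the detailed match would be invoked.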
However, in accordance with the invention, the feature vectors are also decoded against existing codebooks during enrollment by developing a histogram as described above. If the speaker being enrolled (e.g. speaker 6 of FIG. 2A) is confused with an existing enrolled speaker by developing a histogram peak magnitude which is similar to that of a previously enrolled speaker (e.g. potentially identifying a new speaker as a previously enrolled speaker), a class is formed in database 300 (in this case containing speakers 1 and 6) responsive to comparator 160, including the speakers whose utterances produce similar feature vectors. Data from the different speakers is then used to adapt speaker-dependent models capable of distinguishing between them and the models are stored in database 300.
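A minimal sketch of the enrollment-time grouping follows. Several details are assumptions made only for illustration: the enrollment histogram is taken to include the new speaker's own codebook, "similar peak magnitude" is approximated by a hypothetical `ratio`, and existing classes that share a member with the new group are merged. The function names are hypothetical.

```python
def confusable_speakers(histogram, new_speaker, ratio=1.5):
    """Existing speakers whose peak is comparable to the new speaker's own peak.

    histogram: dict speaker id -> frame count obtained by decoding the
    new speaker's enrollment speech against all codebooks.
    """
    own = histogram.get(new_speaker, 0)
    return {spk for spk, count in histogram.items()
            if spk != new_speaker and count * ratio > own}

def update_classes(classes, new_speaker, confused_with):
    """Form or extend a class only when potential confusion was detected.

    classes: list of disjoint sets of speaker ids. Returns a new list.
    """
    if not confused_with:
        return classes
    group = {new_speaker} | set(confused_with)
    kept = []
    for members in classes:
        if members & group:
            group |= members  # assumption: overlapping classes are merged
        else:
            kept.append(members)
    kept.append(group)
    return kept
```

With the FIG. 2A example in mind, enrolling speaker 6 against a histogram in which speaker 1 has a comparable peak would yield a class containing speakers 1 and 6, for which distinguishing speaker-dependent models would then be adapted.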
It should be appreciated that detecting potential confusion is tantamount to the onset of saturation of the fast match system so that the fast match system can be utilized to the full extent of its discrimination powers and detailed matching is performed only when beyond the capability of the fast match portion of the invention (unless intentionally limited by conservative design such as imposition of a low statistical threshold for potential confusion). However, such onset of saturation is detected during enrollment and, in this sense, the configuration of the system in accordance with the invention is adaptive to supplement the fast match portion, when necessary, by a detailed matching process. The detailed match is, itself, facilitated by limiting the scope of comparisons to members of a group and the adaptation of speaker-dependent models optimized or at least adapted to make the necessary distinctions between members of the group. The number of groups and number of speakers per group will always be minimized since groups are generated and members added to a group only when potential confusion is detected.
Of course, development and adaptation of such speaker-dependent models requires substantially more data to be collected for each such speaker. However, such data can be collected during somewhat extended enrollment for the speaker being enrolled (speaker 6) and later for the speakers (e.g. speaker 1) with which the newly enrolled speaker is confused during their next use of the system. It should also be noted that the development of classes automatically selects or defines cohorts from which cohort models can be developed and provides for collection and storage of additional data only when necessary as the enrolled user population increases.
It should also be noted that after at least one class is defined and created as described above, test data which results in confusion between speakers, as a histogram is developed by counter 150, can be compared against the class or classes, if any, to which each candidate speaker is assigned. This comparison can often provide useable results after only a few seconds of speech or even a few hundred frames. If, for example, during verification (e.g. the periodic testing of speech to be that of a previously identified speaker) a class is identified other than the class to which the previously identified speaker belongs, the verification can be considered to have failed. This possibility is particularly useful in denying access to a user of a secure system when verification fails after access is granted upon initial identification. For identification, as soon as two or a limited number of speakers dominates, only the speakers in the one or two classes corresponding to the dominating speakers need be considered further. Both of these decisions, taken after only a relatively few seconds or small number of frames, greatly accelerate the speaker recognition process.
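The class-level verification check described above can be sketched in a few lines. The sketch assumes that a speaker who belongs to no class is treated as a class of one and that the "identified class" is simply the class of the speaker with the largest histogram peak; both are illustrative choices, and the names `class_of` and `verification_holds` are hypothetical.

```python
def class_of(speaker, classes):
    """Class containing `speaker`, or a singleton class if the speaker is unclassed."""
    for members in classes:
        if speaker in members:
            return frozenset(members)
    return frozenset({speaker})

def verification_holds(claimed_speaker, histogram, classes):
    """False as soon as the dominant histogram peak falls in another class.

    histogram: dict speaker id -> frame count from the fast match.
    """
    if not histogram:
        return True  # no frames counted yet, nothing contradicts the claim
    top = max(histogram, key=histogram.get)
    return class_of(top, classes) == class_of(claimed_speaker, classes)
```

A `False` result is only a class-level rejection; a `True` result may still need the detailed match to separate members of the same class.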
Other decisions can also be made in a manner which facilitates fast and/or detailed match processing and may allow a possible or at least tentative identification of the speaker to be made by the fast match processing alone. For example, as illustrated in FIG. 2B where comparable counts are developed for speakers 1 and 3, if the candidate speakers do not belong to the same class (e.g. speaker 3, when enrolled, did not cause creation of a class with speaker 1), the speaker associated with the greater histogram peak can usually be correctly selected or tentatively identified (or speakers not classed with other speakers between which there is confusion can be eliminated) by the fast match process on the basis of a relatively few frames since it can be assumed that a divergence of magnitude of the histogram peaks would later develop based on further speech. This feature of the invention provides acceleration of the speaker recognition process by the fast match section 100 of the invention and allows a speaker-dependent model 210 to be called for use by the cohort and speech decoder 230 from database 300 for speaker identity verification and speech recognition.
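The routing logic described above, where comparable peaks lead either to a tentative identification or to a detailed match depending on class membership, could look roughly like this. The dominance threshold `ratio`, the return labels and the function name `route` are assumptions made for the sketch.

```python
def route(histogram, classes, ratio=1.5):
    """Decide what to do with a fast-match histogram.

    Returns one of:
      ("identified", speaker)       one peak clearly dominates
      ("tentative", speaker)        comparable peaks, speakers not in one class
      ("detailed_match", members)   comparable peaks within a single class
      ("undecided", None)           no frames counted yet
    """
    ranked = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return ("undecided", None)
    top, top_count = ranked[0]
    if len(ranked) == 1 or top_count >= ratio * ranked[1][1]:
        return ("identified", top)
    runner_up = ranked[1][0]
    for members in classes:
        if top in members and runner_up in members:
            # Only this class's speaker-dependent models need to be loaded.
            return ("detailed_match", sorted(members))
    return ("tentative", top)
```

In the FIG. 2B situation (comparable counts for speakers 1 and 3, which share no class), the sketch returns a tentative identification of the speaker with the greater peak.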
If the speakers are in the same class as detected by comparator 160 accessing database 300, the speaker-dependent models of all cohorts of the single class can be called at an early point in time in order to distinguish between them, which is also done by the speaker recognition engine included with cohort and speech decoder 230. It should be noted that this selection of a class limits the data processed to that which is necessary for discrimination between the speakers which are, in fact, confused by fast match section 100 and results in reduction of processing time and overhead as well as storage requirements for the speaker-dependent models. Cohorts are needed only when confusion actually occurs, reducing the total storage requirements. Further, the cohort model 220 can be used for speech decoding at an earlier time since an ambiguous decoding of an utterance is unlikely within cohorts.
The speech decoding engine 230 preferably utilizes hidden Markov models (HMMs) with continuous density Gaussian mixtures to model the output distribution (i.e. the probability density function to observe a given acoustic vector at a given arc of the HMM model). A set of maximum-a-posteriori (MAP) estimated models, or models adapted by other speaker-dependent algorithms like re-training, adaptation by correlation (ABC), maximum likelihood linear regression (MLLR) or clustered transformation (CT), are loaded for different pre-loaded speakers. During enrollment, the utterances are decoded with a gender-independent system. Then each pre-loaded system is used to compute likelihoods for the same alignment. The N-best speakers are extracted and linear transforms are computed to map each of the selected pre-loaded speaker models closer to the enrolled speaker. Using this data, new Gaussians are built for the new speakers. Unobserved Gaussians are preferably adapted using the ABC algorithm. During speaker recognition, the likelihoods produced by the speaker and its cohorts are compared for the same alignment produced by a speaker-independent model.
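The comparison of likelihoods over a shared alignment can be illustrated with a deliberately reduced sketch. It assumes a single diagonal-covariance Gaussian per HMM state instead of the Gaussian mixtures named in the text, and it takes the alignment (one state index per frame) as given, standing in for the output of a speaker-independent decoder. All function names are hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(x, mean, var))

def alignment_score(frames, alignment, model):
    """Total log-likelihood of the frames under one speaker's model.

    alignment: one state index per frame (assumed to come from a
    speaker-independent decoding pass).
    model: dict state index -> (mean, var) for that speaker.
    """
    return sum(log_gaussian(f, *model[s]) for f, s in zip(frames, alignment))

def best_in_cohort(frames, alignment, cohort_models):
    """Score every cohort member on the same alignment and pick the best."""
    scores = {spk: alignment_score(frames, alignment, m)
              for spk, m in cohort_models.items()}
    return max(scores, key=scores.get), scores
```

Because every candidate is scored on the same alignment, the scores differ only through the speaker-dependent Gaussians, which is what makes them directly comparable within a class.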
While this process may be computationally intensive, the enrollment data used to distinguish between cohorts may, in practice, be quite limited particularly if made text-dependent or text-prompted. In these latter cases, if the fast match identification or verification is not successful, identification or verification may be carried out in a text-dependent fashion. However, the computation and comparisons of the alignments described above allow text-independence for identification or verification, if desired. Thus, text-independence is achieved for the majority of identification and verification operations by the fast match processing while minimizing storage requirements and computational overhead to very low levels in the detailed match stage 200 which is thus accelerated, as described above.
In view of the foregoing, the hybrid system of the invention combining fast and detailed match sections provides very rapid speaker identification with little, if any, increase in storage requirements, since the processing of the detailed match stage generally allows reduction of storage requirement by more than enough to compensate for storage of codebooks, largely because the speaker-dependent models may be built principally for distinguishing between speakers of a group rather than more fully characterizing the speech of each speaker. Enrollment, identification and verification of speaker identity are conducted in a manner transparent to the user except to the extent text-dependence may be used to limit storage for a small number of discriminations between speakers. The fast match and detailed match sections of the hybrid arrangement accelerate operation of each other while providing for auto
