`IPR of U.S. Patent 9,007,420
`
`0001
`
`
`
`US 6,219,640 B1
`Page 2
`
`OTHER PUBLICATIONS
`
N.R. Garner et al., "Robust Noise Detection for Speech Detection and Enhancement," IEE, pp. 1-2, Nov. 5, 1996.
H. Ney, "On the Probabilistic Interpretation of Neural Network Classifiers and Discriminative Training Criteria," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 2, pp. 107-112, Feb. 1995.
L. Wiskott et al., "Recognizing Faces by Dynamic Link Matching," ICANN '95, Paris, France, pp. 347-352, 1995.
A.H. Gee et al., "Determining the Gaze of Faces in Images," University of Cambridge, Cambridge, England, pp. 1-20, Mar. 1994.
C. Bregler et al., "Eigenlips For Robust Speech Recognition," IEEE, pp. II-669-II-672.
C. Benoit et al., "Which Components of the Face Do Humans and Machines Best Speechread?," Institut de la Communication Parlee, Grenoble, France, pp. 315-328.
Q. Summerfield, "Use of Visual Information for Phonetic Perception," MRC Institute of Hearing Research, University Medical School, Nottingham, pp. 314-330.
N. Kruger et al., "Determination of Face Position and Pose With a Learned Representation Based on Label Graphs," Ruhr-Universitat Bochum, Bochum, Germany and University of Southern California, Los Angeles, CA, pp. 1-19.
G. Potamianos et al., "Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition," AT&T Labs Research, Florham Park and Red Bank, NJ, pp. 1-4.
`
`* cited by examiner
`
`0002
`
`0002
`
`
`
`U.S. Patent
`
`Apr. 17, 2001
`
`Sheet 1 of 6
`
`US 6,219,640 B1
`
`zo=5:E>2E2
`
`
`
`cs=<z§8:95
`
`82595><
`
`zo=5:E>
`
`Em;
`
`Iommmm
`
`2255.83$2.9
`
`§me_z8we2222
`
`EémO33222
`
`§mEz8Eém82>
`
`
`
`zo=§=mzo=5:E>5:zommméss
`
`2o=_z8§_séz2E§§m
`
`zomsg
`
`5:25v
`
`ex
`
`N65%
`
`0003
`
`2oEz8§_5:3
`
`zo=§=mzo=5:E>zommézoos
`
`
`
`
`
`0003
`
`
`
`
`
`
`
`
`
[Sheet 2 of 6: FIG. 2, flow diagram of the utterance verification methodology. Recoverable step labels (202A, 202B, 204, 206, 208, 210) include: produce decoded text and time alignments from acoustic feature data (unsupervised mode only); input expected script; align script with visemes; compute likelihood of alignment on video frames; output verification result to decision block.]
`
[Sheet 3 of 6: FIG. 3, block diagram of an audio-visual speaker recognition and utterance verification system according to an illustrative feature fusion embodiment.]
`
[Sheet 4 of 6: FIG. 4, block diagram of an audio-visual speaker recognition and utterance verification system according to an illustrative re-scoring embodiment.]
`
[Sheet 5 of 6: FIG. 5, block diagram of an audio-visual speaker recognition and utterance verification system according to another illustrative re-scoring embodiment.]
`
[Sheet 6 of 6: FIG. 6, block diagram of an illustrative hardware implementation: processor 602, memory 604, user interface 606.]

[FIG. 7, Table 1: Audio-visual speaker ID accuracy under three acoustic conditions (clean, noisy, telephone). Recoverable row labels: matched fusion (hard optimization) and matched fusion (soft optimization). Recoverable accuracy values: 90.9% and 92.2% (clean), 83.8% and 84.4% (noisy), 74.0% and 74.0% (telephone).]
`
`US 6,219,640 B1
`
`1
`METHODS AND APPARATUS FOR
`AUDIO-VISUAL SPEAKER RECOGNITION
`AND UTTERANCE VERIFICATION
`
`CROSS REFERENCE TO RELATED
`APPLICATIONS
`
`The present application is related to the U.S. patent
`application entitled: “Methods And Apparatus for Audio-
`Visual Speech Detection and Recognition,” filed concur-
`rently herewith and incorporated by reference herein.
`
`FIELD OF THE INVENTION
`
`The present invention relates generally to speaker recog-
`nition and, more particularly, to methods and apparatus for
`using video and audio information to provide improved
`speaker recognition and utterance verification in connection
`with arbitrary content video.
`
`BACKGROUND OF THE INVENTION
`
Humans identify speakers based on a variety of attributes
of the person, which include acoustic cues, visual appearance
cues and behavioral characteristics (e.g., characteristic
gestures or lip movements). In the past, machine implementations
of person identification have focused on single
techniques relating to audio cues alone (e.g., audio-based
speaker recognition), visual cues alone (e.g., face
identification, iris identification) or other biometrics. More
recently, researchers have been attempting to combine multiple
modalities for person identification, see, e.g., J. Bigun, B.
Duc, F. Smeraldi, S. Fischer and A. Makarov, "Multi-modal
person authentication," in H. Wechsler, J. Phillips, V. Bruce,
F. Fogelman Soulie, T. Huang (eds.), Face Recognition:
From Theory to Applications, Berlin: Springer-Verlag, 1999.
Speaker recognition is an important technology for a
variety of applications, including security and, more recently,
as an index for search and retrieval of digitized multimedia
content (for instance, in the MPEG-7 standard). Audio-based
speaker recognition accuracy under acoustically degraded
conditions (e.g., background noise) and channel
mismatch (e.g., telephone) still needs further improvement,
and making improvements under such degraded conditions is a
difficult problem. As a result, it would be highly advantageous
to provide methods and apparatus for providing
improved speaker recognition that successfully perform in
the presence of acoustic degradation, channel mismatch, and
other conditions which have hampered existing speaker
recognition techniques.
`
`SUMMARY OF THE INVENTION
`
The present invention provides various methods and
apparatus for using visual information and audio information
associated with arbitrary video content to provide
`improved speaker recognition accuracy. It is to be under-
`stood that speaker recognition may involve user enrollment,
`user identification (i.e., find who the person is among the
`enrolled users), and user verification (i.e., accept or reject an
`identity claim provided by the user). Further, the invention
`provides methods and apparatus for using such visual infor-
`mation and audio information to perform utterance verifi-
`cation.
`
`In a first aspect of the invention, a method of performing
`speaker recognition comprises processing a video signal
`associated with an arbitrary content video source and pro-
`cessing an audio signal associated with the video signal.
`Then, an identification and/or verification decision is made
`
`2
`based on the processed audio signal and the processed video
`signal. Various decision making embodiments may be
`employed including, but not limited to, a score combination
`approach, a feature combination approach, and a re-scoring
`approach.
`As will be explained in detail, the combination of audio-
`based processing with visual processing for speaker recog-
`nition significantly improves the accuracy in acoustically
`degraded conditions such as, for example only, the broadcast
`news domain. The use of two independent sources of
`information brings significantly increased robustness to
`speaker recognition since signal degradations in the two
`channels are uncorrelated. Furthermore, the use of visual
`information allows a much faster speaker identification than
`possible with acoustic information alone. In accordance with
`the invention, we present results of various methods to fuse
`person identification based on visual information with iden-
`tification based on audio information for TV broadcast news
`
`video data (e.g., CNN and CSPAN) provided by the Lin-
`guistic Data Consortium (LDC). That is, we provide various
`techniques to fuse video based speaker recognition with
`audio-based speaker recognition to improve the perfor-
`mance under mismatch conditions.
In a preferred
embodiment, we provide a technique to optimally determine
`the relative weights of the independent decisions based on
`audio and video to achieve the best combination. Experi-
`ments on video broadcast news data suggest that significant
`improvements are achieved by such a combination in acous-
`tically degraded conditions.
`In a second aspect of the invention, a method of verifying
`a speech utterance comprises processing a video signal
`associated with a video source and processing an audio
`signal associated with the video signal. Then, the processed
`audio signal is compared with the processed video signal to
`determine a level of correlation between the signals. This is
`referred to as unsupervised utterance verification. In a super-
`vised utterance verification embodiment,
`the processed
`video signal is compared with a script representing an audio
`signal associated with the video signal to determine a level
`of correlation between the signals.
`Of course, it is to be appreciated that any one of the above
`embodiments or processes may be combined with one or
`more other embodiments or processes to provide even
`further speech recognition and speech detection improve-
`ments.
`
Also, it is to be appreciated that the video and audio
`signals may be of a compressed format such as, for example,
`the MPEG-2 standard. The signals may also come from
`either a live camera/microphone feed or a stored (archival)
`feed. Further, the video signal may include images of visible
`and/or non-visible (e.g., infrared or radio frequency) wave-
`lengths. Accordingly,
`the methodologies of the invention
`may be performed with poor lighting, changing lighting, or
`no light conditions. Given the inventive teachings provided
`herein, one of ordinary skill in the art will contemplate
`various applications of the invention.
`These and other objects, features and advantages of the
`present invention will become apparent from the following
`detailed description of illustrative embodiments thereof,
`which is to be read in connection with the accompanying
`drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
FIG. 1 is a block diagram of an audio-visual speaker
`recognition and utterance verification system according to
`an illustrative score or decision fusion embodiment of the
`
`present invention;
`
`0009
`
`0009
`
`
`
`US 6,219,640 B1
`
`3
`FIG. 2 is a flow diagram of an utterance verification
`methodology according to an illustrative embodiment of the
`present invention;
`FIG. 3 is a block diagram of an audio-visual speaker
`recognition and utterance verification system according to
`an illustrative feature fusion embodiment of the present
`invention;
`FIG. 4 is a block diagram of an audio-visual speaker
`recognition and utterance verification system according to
`an illustrative re-scoring embodiment of the present inven-
`tion;
FIG. 5 is a block diagram of an audio-visual
speaker recognition and utterance verification system
`according to another illustrative re-scoring embodiment of
`the present invention;
FIG. 6 is a block diagram of an illustrative hardware
implementation of an audio-visual speaker
recognition and utterance verification system according to
`the invention; and
`FIG. 7 is a tabular representation of some experimental
`results.
`
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`
`The present invention will be explained below in the
`context of an illustrative speaker recognition implementa-
`tion. The illustrative embodiments include both identifica-
`
`tion and/or verification methodologies. However, it is to be
`understood that the present invention is not limited to a
`particular application or structural embodiment. Rather, the
`invention is more generally applicable to any situation in
`which it is desirable to improve speaker recognition accu-
`racy and provide utterance verification by employing visual
`information in conjunction with corresponding audio infor-
`mation during the recognition process.
`Referring initially to FIG. 1, a block diagram of an
`audio-visual speaker recognition and utterance verification
`system according to an illustrative embodiment of the
`present
`invention is shown. This particular illustrative
`embodiment, as will be explained, depicts audio-visual
`speaker recognition using a decision fusion approach.
`It is to be appreciated that the system of the invention may
`receive input signals from a variety of sources. That is, the
`input signals for processing in accordance with the invention
`may be provided from a real-time (e.g., live feed) source or
`an archival (e.g., stored) source. Arbitrary content video 2 is
`an input signal that may be received from either a live source
`or archival source. Preferably, the system may accept, as
`arbitrary content video 2, video compressed in accordance
with a video standard such as the Moving Picture Experts
Group-2 (MPEG-2) standard. To accommodate such a case,
`the system includes a video demultiplexer 8 which separates
`the compressed audio signal from the compressed video
`signal. The video signal
`is then decompressed in video
`decompressor 10, while the audio signal is decompressed in
`audio decompressor 12. The decompression algorithms are
`standard MPEG-2 techniques and thus will not be further
`described. In any case, other forms of compressed video
`may be processed in accordance with the invention.
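By way of illustration only, the demultiplexing and decoding stage described above might be approximated with off-the-shelf tools; the sketch below assumes the ffmpeg command-line tool is available, and the file names are hypothetical placeholders rather than anything specified in the patent.

```python
# Illustrative only: split an MPEG-2 program stream into decoded audio and
# video using the ffmpeg CLI (assumed installed); paths are hypothetical.
import subprocess

def demux_and_decode(mpeg_path: str, wav_out: str, frames_dir: str) -> None:
    # Decode the audio track to 16 kHz mono PCM (the rate used later for
    # acoustic feature extraction).
    subprocess.run(
        ["ffmpeg", "-y", "-i", mpeg_path, "-vn", "-ac", "1", "-ar", "16000", wav_out],
        check=True,
    )
    # Decode the video track to individual image frames for face analysis.
    subprocess.run(
        ["ffmpeg", "-y", "-i", mpeg_path, "-an", f"{frames_dir}/frame_%06d.png"],
        check=True,
    )

# demux_and_decode("news_clip.mpg", "news_clip.wav", "frames")
```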
`It is to be further appreciated that one of the advantages
`that the invention provides is the ability to process arbitrary
`content video. That is, previous systems that have attempted
`to utilize visual cues from a video source in the context of
`
`speech recognition have utilized video with controlled
`conditions,
`i.e., non-arbitrary content video. That is,
`the
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
`video content included only faces from which the visual
`cues were taken in order to try to recognize short commands
`or single words in a predominantly noiseless environment.
`However, as will be explained in detail below, the system of
`the present invention is able to process arbitrary content
`video which may not only contain faces but may also contain
`arbitrary background objects in a noisy environment. One
`example of arbitrary content video is in the context of
`broadcast news. Such video can possibly contain a news-
`person speaking at a location where there is arbitrary activity
`and noise in the background. In such a case, as will be
`explained, the invention is able to locate and track a face
`and, more particularly, a mouth and/or other facial features,
`to determine what is relevant visual information to be used
`
`in more accurately identifying and/or verifying the speaker.
`Alternatively,
`the system of the present
`invention is
`capable of receiving real-time arbitrary content directly from
`a video camera 4 and microphone 6. While the video signals
`received from the camera 4 and the audio signals received
`from the microphone 6 are shown in FIG. 1 as not being
`compressed, they may be compressed and therefore need to
`be decompressed in accordance with the applied compres-
`sion scheme.
`
`It is to be understood that the video signal captured by the
`camera 4 does not necessarily have to be of any particular
`type. That is, the face detection and recognition techniques
`of the invention may process images of any wavelength such
`as, e.g., visible and/or non-visible electromagnetic spectrum
`images. By way of example only, this may include infrared
`(IR) images (e.g., near, mid and far field IR video) and radio
`frequency (RF) images. Accordingly, the system may per-
`form audio-visual speaker recognition and utterance verifi-
`cation techniques in poor lighting conditions, changing
`lighting conditions, or in environments without light. For
`example, the system may be installed in an automobile or
`some other form of vehicle and capable of capturing IR
`images so that improved speaker recognition may be per-
`formed. Because video information (i.e., including visible
`and/or non-visible electromagnetic spectrum images) is used
`in the speaker recognition process in accordance with the
`invention, the system is less susceptible to recognition errors
`due to noisy conditions, which significantly hamper con-
`ventional speaker recognition systems that use only audio
`information. Additionally, as disclosed in Francine J. Pro-
`koski and Robert R. Riedel, “Infrared Identification of Faces
`and Body Parts,” BIOMETRICS, Personal Identification in
`Networked Society, Kluwer Academic Publishers, 1999, IR
`cameras introduce additional very robust biometric features
`which uniquely characterize individuals very well.
`A phantom line denoted by Roman numeral I represents
`the processing path the audio information signal takes within
`the system, while a phantom line denoted by Roman
`numeral II represents the processing path the video infor-
`mation signal takes within the system. First, the audio signal
`path I will be discussed,
`then the video signal path II,
`followed by an explanation of how the two types of infor-
`mation are combined to provide improved speaker recogni-
`tion accuracy.
`The system includes an auditory feature extractor 14. The
`feature extractor 14 receives an audio or speech signal and,
`as is known in the art, extracts spectral features from the
`signal at regular intervals. The spectral features are in the
`form of acoustic feature vectors (signals) which are then
`passed on to an audio speaker recognition module 16. As
`mentioned, the audio signal may be received from the audio
`decompressor 12 or directly from the microphone 6, depend-
`ing on the source of the video. Before acoustic vectors are
`
`0010
`
`0010
`
`
`
`US 6,219,640 B1
`
`5
`extracted, the speech signal may be sampled at a rate of 16
kilohertz (kHz). A frame may consist of a segment of speech
`having a 25 millisecond (msec) duration.
`In such an
`arrangement, the extraction process preferably produces 24
`dimensional acoustic cepstral vectors via the process
`described below. Frames are advanced every 10 msec to
`obtain succeeding acoustic vectors.
`First,
`in accordance with a preferred acoustic feature
`extraction process, magnitudes of discrete Fourier trans-
`forms of samples of speech data in a frame are considered
`in a logarithmically warped frequency scale. Next, these
`amplitude values themselves are transformed to a logarith-
`mic scale. The latter two steps are motivated by a logarith-
`mic sensitivity of human hearing to frequency and ampli-
`tude. Subsequently, a rotation in the form of discrete cosine
`transform is applied. One way to capture the dynamics is to
`use the delta (first-difference) and the delta-delta (second-
`order differences) information. An alternative way to capture
`dynamic information is to append a set of (e.g., four)
`preceding and succeeding vectors to the vector under con-
`sideration and then project the vector to a lower dimensional
`space, which is chosen to have the most discrimination. The
`latter procedure is known as Linear Discriminant Analysis
`(LDA) and is well known in the art. It is to be understood
`that other variations on features may be used, e.g., LPC
`cepstra, PLP, etc., and that the invention is not limited to any
`particular type.
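As a rough illustration of the front end just described (and not the patent's exact implementation), the following sketch computes 24-dimensional cepstra from 25 msec frames advanced every 10 msec: DFT magnitudes pooled on a logarithmically warped (mel-like) frequency scale, log amplitude compression, and a discrete cosine transform. The filterbank design and all constants beyond those stated above are assumptions, and the delta/delta-delta or LDA step is omitted.

```python
# Simplified cepstral front end in the spirit of the description above:
# 16 kHz speech, 25 ms frames every 10 ms, warped log spectrum, DCT rotation.
# Illustrative sketch only, not the patented implementation.
import numpy as np

def mel_filterbank(n_filters=24, n_fft=512, sr=16000):
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = np.linspace(0.0, 1.0, c - l, endpoint=False)
        if r > c:
            fb[i, c:r] = np.linspace(1.0, 0.0, r - c, endpoint=False)
    return fb

def cepstra(signal, sr=16000, frame_ms=25, step_ms=10, n_ceps=24, n_fft=512):
    frame, step = int(sr * frame_ms / 1000), int(sr * step_ms / 1000)
    fb = mel_filterbank(n_ceps, n_fft, sr)
    feats = []
    for start in range(0, len(signal) - frame + 1, step):
        windowed = signal[start:start + frame] * np.hamming(frame)
        mag = np.abs(np.fft.rfft(windowed, n_fft))     # DFT magnitudes
        logspec = np.log(fb @ mag + 1e-10)             # warped frequency scale, then log
        # Rotation via a type-II discrete cosine transform, computed directly.
        j = np.arange(n_ceps)                          # filterbank channel index
        k = np.arange(n_ceps)                          # cepstral coefficient index
        dct = np.cos(np.pi / n_ceps * (j[:, None] + 0.5) * k[None, :])
        feats.append(logspec @ dct)
    return np.array(feats)                             # one 24-dim vector every 10 ms
```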
After the acoustic feature vectors, denoted in FIG. 1 by
`the letter A, are extracted, they are provided to the audio
`speaker recognition module 16. It is to be understood that
`the module 16 may perform speaker identification and/or
`speaker verification using the extracted acoustic feature
`vectors. The processes of speaker identification and verifi-
`cation may be accomplished via any conventional acoustic
`information speaker recognition system. For example,
`speaker recognition module 16 may implement the recog-
`nition techniques described in the U.S. patent application
`identified by Ser. No. 08/788,471, filed on Jan. 28, 1997, and
`entitled: “Text Independent Speaker Recognition for Trans-
`parent Command Ambiguity Resolution and Continuous
`Access Control,” the disclosure of which is incorporated
`herein by reference.
`An illustrative speaker identification process for use in
`module 16 will now be described. The illustrative system is
`disclosed in H. Beigi, S. H. Maes, U. V. Chaudari and J. S.
`Sorenson, “IBM model-based and frame-by-frame speaker
`recognition,” Speaker Recognition and its Commercial and
`Forensic Applications, Avignon, France 1998. The illustra-
`tive speaker identification system may use two techniques:
`a model-based approach and a frame-based approach. In the
`experiments described herein, we use the frame-based
`approach for speaker identification based on audio. The
`frame-based approach can be described in the following
`manner.
`
Let M_i be the model corresponding to the i-th enrolled
speaker. M_i is represented by a mixture Gaussian model
defined by the parameter set {\mu_{i,j}, \Sigma_{i,j}, p_{i,j}}, j = 1, ..., n_i, consisting
of the mean vector, covariance matrix and mixture weights
for each of the n_i components of speaker i's model. These
models are created using training data consisting of a
sequence of K frames of speech with d-dimensional cepstral
feature vectors, {f_k}, k = 1, ..., K. The goal of speaker identification
is to find the model M_i that best explains the test
data represented by a sequence of N frames, {f_n}, n = 1, ..., N.
We use the following frame-based weighted likelihood
distance measure, d_{i,n}, in making the decision:

    d_{i,n} = -\log \sum_{j=1}^{n_i} p_{i,j} \, p(f_n \mid \mu_{i,j}, \Sigma_{i,j})

The total distance D_i of model M_i from the test data is then
taken to be the sum of the distances over all the test frames:

    D_i = \sum_{n=1}^{N} d_{i,n}
`
Thus, the above approach finds the closest matching model,
and the person whom that model represents is determined to
be the person whose utterance is being processed.
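A minimal numerical sketch of this frame-based scoring follows. It assumes diagonal covariance matrices for simplicity and uses hypothetical variable names; it is an illustration of the distance D_i above, not the enrolled system itself.

```python
# Frame-based speaker identification: d_{i,n} = -log sum_j p_{i,j} p(f_n | component j),
# accumulated over all test frames; the enrolled model with the smallest total
# distance D_i wins.  Diagonal covariances are an assumed simplification.
import numpy as np

def log_gauss_diag(frames, mean, var):
    # frames: (N, d); mean, var: (d,) -> per-frame log N(f_n; mean, diag(var))
    return -0.5 * (np.sum(np.log(2.0 * np.pi * var))
                   + np.sum((frames - mean) ** 2 / var, axis=1))

def total_distance(frames, weights, means, variances):
    # weights: (n_i,); means, variances: (n_i, d); returns D_i = sum_n d_{i,n}
    comp = np.stack([np.log(w) + log_gauss_diag(frames, m, v)
                     for w, m, v in zip(weights, means, variances)])   # (n_i, N)
    d_in = -np.logaddexp.reduce(comp, axis=0)                          # -log sum_j ...
    return float(d_in.sum())

def identify(frames, speaker_models):
    # speaker_models: dict name -> (weights, means, variances); smallest D_i wins
    scores = {name: total_distance(frames, *gmm) for name, gmm in speaker_models.items()}
    return min(scores, key=scores.get), scores
```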
`Speaker verification may be performed in a similar
`manner, however, the input acoustic data is compared to
`determine if the data matches closely enough with stored
`models. If the comparison yields a close enough match, the
`person uttering the speech is verified. The match is accepted
`or rejected by comparing the match with competing models.
`These models can be selected to be similar to the claimant
`
`speaker or be speaker independent (i.e., a single or a set of
`speaker independent models). If the claimant wins and wins
`with enough margin (computed at the level of the likelihood
`or the distance to the models), we accept the claimant.
Otherwise, the claimant is rejected. It should be understood
that, at enrollment, the input speech is collected for a speaker
to build the mixture Gaussian model M_i that characterizes
each speaker.
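The accept/reject rule just described can be illustrated by reusing the total_distance function from the previous sketch; the margin threshold and the treatment of competing models below are assumptions made only for illustration.

```python
# Illustrative verification decision: accept the identity claim only if the
# claimant's model beats the best competing (or speaker-independent) model by
# a margin.  The margin value is an assumed tuning parameter, not from the patent.
def verify(frames, claimant_gmm, competing_gmms, margin=0.0):
    d_claim = total_distance(frames, *claimant_gmm)
    d_best_competitor = min(total_distance(frames, *g) for g in competing_gmms)
    # Distances are negative log-likelihood sums, so smaller is better.
    return (d_best_competitor - d_claim) > margin
```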
`Referring now to the video signal path II of FIG. 1, the
`methodologies of processing visual information according
`to the invention will now be explained. The audio-visual
`speaker recognition and utterance verification system of the
`invention includes an active speaker face segmentation
`module 20 and a face recognition module 24. The active
`speaker face segmentation module 20 can receive video
`input from one or more sources, e.g., video decompressor
`10, camera 4, as explained above. It is to be appreciated that
`speaker face detection can also be performed directly in the
`compressed data domain and/or from audio and video infor-
`mation rather than just from video information. In any case,
`segmentation module 20 generally locates and tracks the
`speaker’s face and facial features within the arbitrary video
`background. This will be explained in detail below. From
`data provided from the segmentation module 22, an identi-
`fication and/or verification operation may be performed by
`recognition module 24 to identify and/or verify the face of
`the person assumed to be the speaker in the video. Verifi-
`cation can also be performed by adding score thresholding
`or competing models. Thus,
`the visual mode of speaker
`identification is implemented as a face recognition system
`where faces are found and tracked in the video sequences,
`and recognized by comparison with a database of candidate
`face templates. As will be explained later, utterance verifi-
`cation provides a technique to verify that the person actually
`uttered the speech used to recognize him.
`Face detection and recognition may be performed in a
`variety of ways. For example, in an embodiment employing
`an infrared camera 4, face detection and identification may
`be performed as disclosed in Francine J. Prokoski and
`Robert R. Riedel, “Infrared Identification of Faces and Body
`Parts,” BIOMETRICS, Personal Identification in Networked
`Society, Kluwer Academic Publishers, 1999. In a preferred
`embodiment, techniques described in Andrew Senior, “Face
and feature finding for face recognition system,” 2nd Int.
`
`0011
`
`0011
`
`
`
`US 6,219,640 B1
`
`7
`Conf. On Audio-Video based Biometric Person
`
`Authentication, Washington D.C., March 1999 are
`employed. The following is an illustrative description of
`face detection and recognition as respectively performed by
`segmentation module 22 and recognition module 24.
`Face Detection
`
Faces can occur at a variety of scales, locations and
orientations in the video frames. In this system, we make the
`assumption that faces are close to the vertical, and that there
`is no face smaller than 66 pixels high. However, to test for
`a face at all the remaining locations and scales, the system
`searches for a fixed size template in an image pyramid. The
`image pyramid is constructed by repeatedly down-sampling
`the original image to give progressively lower resolution
representations of the original frame. Within each of these
`sub-images, we consider all square regions of the same size
`as our face template (typically 11x11 pixels) as candidate
`face locations. A sequence of tests is used to test whether a
`region contains a face or not.
`First, the region must contain a high proportion of skin-
`tone pixels, and then the intensities of the candidate region
`are compared with a trained face model. Pixels falling into
`a pre-defined cuboid of hue-chromaticity-intensity space are
`deemed to be skin tone, and the proportion of skin tone
`pixels must exceed a threshold for the candidate region to be
`considered further.
`
`The face model is based on a training set of cropped,
`normalized, grey-scale face images. Statistics of these faces
`are gathered and a variety of classifiers are trained based on
`these statistics. A Fisher linear discriminant (FLD) trained
`with a linear program is found to distinguish between faces
`and background images, and “Distance from face space”
`(DFFS), as described in M. Turk and A. Pentland, “Eigen-
`faces for Recognition,” Journal of Cognitive Neuro Science,
`vol. 3, no. 1, pp. 71-86, 1991, is used to score the quality of
faces given high scores by the first method. A high combined
`score from both these face detectors indicates that
`the
`
`candidate region is indeed a face. Candidate face regions
`with small perturbations of scale,
`location and rotation
`relative to high-scoring face candidates are also tested and
`the maximum scoring candidate among the perturbations is
`chosen, giving refined estimates of these three parameters.
`In subsequent frames,
`the face is tracked by using a
`velocity estimate to predict
`the new face location, and
`models are used to search for the face in candidate regions
`near the predicted location with similar scales and rotations.
`A low score is interpreted as a failure of tracking, and the
`algorithm begins again with an exhaustive search.
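The candidate-scanning logic described in this section (image pyramid, skin-tone gate, then a trained face-model score) can be sketched as follows. The pyramid factor, thresholds, scan step, and the score_face callable standing in for the combined FLD/DFFS classifier are all illustrative assumptions.

```python
# Sketch of the detection search: scan an image pyramid with a fixed 11x11
# template; each window must pass a skin-tone gate before being scored by a
# trained face model.  All constants and score_face() are assumptions.
import numpy as np

def pyramid(grey, hci, factor=1.2, min_size=11):
    # Repeatedly downsample (nearest neighbour) both the grey image and the
    # hue/chromaticity/intensity image so their coordinates stay aligned.
    g, c = grey, hci
    while min(g.shape[:2]) >= min_size:
        yield g, c
        ys = (np.arange(int(g.shape[0] / factor)) * factor).astype(int)
        xs = (np.arange(int(g.shape[1] / factor)) * factor).astype(int)
        g, c = g[ys][:, xs], c[ys][:, xs]

def skin_fraction(window_hci, lo, hi):
    # Fraction of pixels whose (hue, chromaticity, intensity) triple falls
    # inside the pre-defined skin-tone cuboid [lo, hi].
    inside = np.all((window_hci >= lo) & (window_hci <= hi), axis=-1)
    return float(inside.mean())

def detect(grey, hci, score_face, lo, hi, size=11,
           skin_thresh=0.4, face_thresh=0.0, step=2):
    best = None
    for scale, (g, c) in enumerate(pyramid(grey, hci)):
        for y in range(0, g.shape[0] - size + 1, step):
            for x in range(0, g.shape[1] - size + 1, step):
                if skin_fraction(c[y:y + size, x:x + size], lo, hi) < skin_thresh:
                    continue                                  # fails the skin-tone gate
                s = score_face(g[y:y + size, x:x + size])     # FLD + DFFS stand-in
                if s > face_thresh and (best is None or s > best[0]):
                    best = (s, scale, y, x)
    return best   # highest-scoring candidate window, or None
```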
`Face Recognition
`Having found the face, K facial features are located using
`the same techniques (FLD and DFFS) used for face detec-
`tion. Features are found using a hierarchical approach where
`large-scale features, such as eyes, nose and mouth are first
`found, then sub-features are found relative to these features.
`As many as 29 sub-features are used, including the hairline,
chin, ears, and the corners of mouth, nose, eyes and eye-
`brows. Prior statistics are used to restrict the search area for
`each feature and sub-feature relative to the face and feature
`
`positions, respectively. At each of the estimated sub-feature
`locations, a Gabor Jet representation, as described in L.
`Wiskott and C. von der Malsburg, “Recognizing Faces by
`Dynamic Link Matching,” Proceedings of the International
`Conference on Artificial Neural Networks, pp. 347-352,
`1995, is generated. A Gabor jet is a set of two-dimensional
`Gabor filters—each a sine wave modulated by a Gaussian.
`Each filter has scale (the sine wavelength and Gaussian
`standard deviation with fixed ratio) and orientation (of the
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`0012
`
`8
`
sine wave). We use five scales and eight orientations, giving
40 complex coefficients (a(j), j = 1, ..., 40) at each feature
location.
`
A simple distance metric is used to compute the distance
between the feature vectors for trained faces and the test
candidates. The distance between the i-th trained candidate
and a test candidate for feature k is defined as:

    S_{ik} = \frac{\sum_j a(j)\, a_i(j)}{\sqrt{\sum_j a(j)^2 \, \sum_j a_i(j)^2}}

A simple average of these similarities, S_i = (1/K) \sum_{k=1}^{K} S_{ik}, gives
an overall measure for the similarity of the test face to the
face template in the database. Accordingly, based on the
similarity measure, an identification and/or verification of
the person in the video sequence under consideration is
made.
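The similarity just reconstructed is a normalized dot product over the 40 jet coefficients, averaged over the K features. A minimal sketch follows, assuming the coefficient magnitudes are the quantities compared (the text does not spell this out):

```python
# Normalized-correlation similarity between Gabor jets (40 complex coefficients
# per feature point), averaged over the K facial features, as in the formulas above.
import numpy as np

def jet_similarity(a, a_i):
    # a, a_i: complex arrays of shape (40,), test and trained jets for one feature.
    # Using coefficient magnitudes is an assumption of this sketch.
    a, a_i = np.abs(a), np.abs(a_i)
    return float(np.sum(a * a_i) / np.sqrt(np.sum(a ** 2) * np.sum(a_i ** 2)))

def face_similarity(test_jets, trained_jets):
    # test_jets, trained_jets: arrays of shape (K, 40); S_i is the mean over features.
    return float(np.mean([jet_similarity(t, r) for t, r in zip(test_jets, trained_jets)]))
```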
`
`Next, the results of the face recognition module 24 and the
`audio speaker recognition module 16 are provided to respec-
tive confidence estimation blocks 26 and 18 where confidence
estimation is performed. Confidence estimation refers
`to a likelihood or other confidence measure being deter-
`mined with regard to the recognized input.
`In one
`embodiment,
`the confidence estimation procedure may
`include measurement of noise levels respectively associated
`with the audio signal and the video signal. These levels may
`be measured internally or externally with respect to the
`system. A higher level of noise associated with a signal
`generally means that the confidence attributed to the recog-
`nition results associated with that signal is lower. Therefore,
`these confidence measures are taken into consideration
`
`during the weighting of the visual and acoustic results
`discussed below.
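One simple way to realize such noise-based confidence estimation is to map an estimated signal-to-noise ratio for each channel to a weight that is later used in the fusion step; the logistic mapping and its constants below are purely illustrative assumptions, not values from the patent.

```python
# Illustrative confidence estimate: map an estimated SNR (in dB) for a channel
# to a weight in (0, 1); noisier channels receive lower confidence.
import math

def confidence_from_snr(snr_db, midpoint=10.0, slope=0.3):
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint)))

audio_conf = confidence_from_snr(5.0)    # e.g. noisy audio -> lower confidence
video_conf = confidence_from_snr(20.0)   # e.g. clean video -> higher confidence
```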
`
`Given the audio-based speaker recognition and face rec-
`ognition scores provided by respective modules 16 and 24,
`audio-visual speaker identification/verification may be per-
`formed by a joint identification/verification module 30 as
follows. The top N scores are generated based on both the audio-
and video-based identification techniques. The two lists are
combined by a weighted sum and the best-scoring candidate
is chosen. Since the weights need only be defined up to
a scaling factor, we can define the combined score S_i^{av} as a
function of the single parameter \alpha:

    S_i^{av} = \cos\alpha \, D_i + \sin\alpha \, S_i

The mixture angle \alpha has to be selected according to the
relative reliability of audio identification and face identification.
One way to achieve this is to optimize \alpha in order to
maximize the audio-visual accuracy on some training data.
Let us denote by D_i(n) and S_i(n) the audio ID (identification)
and video ID scores for the i-th enrolled speaker (i = 1, ..., P)
computed on the n-th training clip. Let us define the variable
T_i(n) as zero when the n-th clip belongs to the i-th speaker and
one otherwise. The cost function to be minimized is the
`
`empirical error, as discussed in V. N. Vapnik, “The Nature of
`
`0012
`
`
`
`US 6,219,640 B1
`
`9
Statistical Learning Theory," Springer, 1995, that can be
written as:

    C(\alpha) = \frac{1}{N} \sum_{n=1}^{N} T_{\hat{i}}(n), \quad \text{where } \hat{i} = \arg\max_i S_i^{av}(n),

and where:

    S_i^{av}(n) = \cos\alpha \, D_i(n) + \sin\alpha \, S_i(n)
`
In order to prevent over-fitting, one can also resort to the
smoothed error rate, as discussed in H. Ney, "On the
Probabilistic Interpretation of Neural Network Classifiers
and Discriminative Training Criteria," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 17, no. 2,
pp. 107-119, 1995, defined as:

When \eta is large, all the terms of the inner sum approach
zero, except for i = \hat{i}, and C'(\alpha) approaches the raw error
count C(\alpha). Otherwise, all the incorrect hypotheses (those
for which T_i(n) = 1) have a contribution that is a decreasing
function of the distance between their score and the maximum
score. If the best hypothesis is incorrect, it has the
largest contribution. Hence, by minimizing the latter cost
function, one tends to maximize not only the recognition
accuracy on the training data, but also the margin by which
the best score wins. This function also presents the advantage
of being differentiable, which can facilitate the optimization
process when there is more than one parameter.
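The selection of the mixture angle α can be sketched as a grid search minimizing C(α) over the training clips. The smoothed criterion in the sketch assumes the usual softmax-style form suggested by the discussion above, since the exact expression is not legible in this copy; both that form and the constant η are assumptions, as is the convention that D_i(n) and S_i(n) are oriented so that larger values mean better matches.

```python
# Grid search for the mixture angle alpha that minimizes the empirical error
# C(alpha), with S_i^av(n) = cos(alpha) D_i(n) + sin(alpha) S_i(n).
# The smoothed criterion uses an assumed softmax form (eta-controlled).
import numpy as np

def combined_scores(alpha, D, S):
    # D, S: arrays of shape (N_clips, P) holding audio and video ID scores,
    # assumed oriented so that larger is better.
    return np.cos(alpha) * D + np.sin(alpha) * S

def empirical_error(alpha, D, S, labels):
    # labels[n] is the true speaker index for clip n, so T_i(n) = 0 iff i == labels[n].
    best = np.argmax(combined_scores(alpha, D, S), axis=1)
    return float(np.mean(best != labels))

def smoothed_error(alpha, D, S, labels, eta=10.0):
    # Assumed softmax smoothing: each incorrect hypothesis contributes according
    # to how close its score is to the maximum; tends to empirical_error as eta grows.
    sav = combined_scores(alpha, D, S)
    soft = np.exp(eta * (sav - sav.max(axis=1, keepdims=True)))
    soft /= soft.sum(axis=1, keepdims=True)
    wrong = np.ones_like(soft)
    wrong[np.arange(len(labels)), labels] = 0.0        # T_i(n)
    return float(np.mean(np.sum(wrong * soft, axis=1)))

def best_alpha(D, S, labels, criterion=empirical_error, grid=181):
    alphas = np.linspace(0.0, np.pi / 2, grid)
    errors = [criterion(a, D, S, labels) for a in alphas]
    return float(alphas[int(np.argmin(errors))])
```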
The present invention provides another decision or score
fusion technique, derived from the previous technique, but
which does not require any training. It consists in selecting
at testing time, for each clip, the value of \alpha in a given range
which maximizes the difference between the highest and the
second highest scores. The corresponding best hypothesis
\hat{I}(n) is then chosen. We have