GTL 1015
IPR of U.S. Patent 9,007,420

US 6,219,640 B1
Page 2

OTHER PUBLICATIONS

N.R. Garner et al., "Robust Noise Detection for Speech Detection and Enhancement," IEE, pp. 1-2, Nov. 5, 1996.
H. Ney, "On the Probabilistic Interpretation of Neural Network Classifiers and Discriminative Training Criteria," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 2, pp. 107-119, Feb. 1995.
L. Wiskott et al., "Recognizing Faces by Dynamic Link Matching," ICANN '95, Paris, France, pp. 347-352, 1995.
A.H. Gee et al., "Determining the Gaze of Faces in Images," University of Cambridge, Cambridge, England, pp. 1-20, Mar. 1994.
C. Bregler et al., "Eigenlips for Robust Speech Recognition," IEEE, pp. II-669-II-672.
C. Benoit et al., "Which Components of the Face Do Humans and Machines Best Speechread?," Institut de la Communication Parlee, Grenoble, France, pp. 315-328.
Q. Summerfield, "Use of Visual Information for Phonetic Perception," Visual Information for Phonetic Perception, MRC Institute of Hearing Research, University Medical School, Nottingham, pp. 314-330.
N. Kruger et al., "Determination of Face Position and Pose With a Learned Representation Based on Label Graphs," Ruhr-Universitat Bochum, Bochum, Germany and University of Southern California, Los Angeles, CA, pp. 1-19.
G. Potamianos et al., "Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition," AT&T Labs Research, Florham Park and Red Bank, NJ, pp. 1-4.

* cited by examiner

U.S. Patent    Apr. 17, 2001    Sheet 1 of 6    US 6,219,640 B1

[FIG. 1: block diagram of the audio-visual speaker recognition and utterance
verification system according to the illustrative score/decision fusion embodiment.]


U.S. Patent    Apr. 17, 2001    Sheet 2 of 6    US 6,219,640 B1

FIG. 2 (flow diagram of the utterance verification methodology): produce decoded
text and time alignments from acoustic feature data (unsupervised mode only)
(202A); input expected script (202B); align script with visemes; compute
likelihood of alignment on video frames; output verification result to decision
block (reference numerals 204, 206, 208, 210).

`

U.S. Patent    Apr. 17, 2001    Sheet 3 of 6    US 6,219,640 B1

[FIG. 3: block diagram of the audio-visual speaker recognition and utterance
verification system according to the illustrative feature fusion embodiment.]


U.S. Patent    Apr. 17, 2001    Sheet 4 of 6    US 6,219,640 B1

[FIG. 4: block diagram of the audio-visual speaker recognition and utterance
verification system according to an illustrative re-scoring embodiment.]


U.S. Patent    Apr. 17, 2001    Sheet 5 of 6    US 6,219,640 B1

[FIG. 5: block diagram of the audio-visual speaker recognition and utterance
verification system according to another illustrative re-scoring embodiment.]


U.S. Patent    Apr. 17, 2001    Sheet 6 of 6    US 6,219,640 B1

FIG. 6 (illustrative hardware implementation): processor 602, memory 604, user
interface 606.

FIG. 7, TABLE 1: AUDIO-VISUAL SPEAKER ID

                              ACOUSTIC CONDITION
                              CLEAN     NOISY     TELEPHONE
MATCHED FUSION, HARD OPT.     90.9%     83.8%     74.0%
MATCHED FUSION, SOFT OPT.     92.2%     84.4%     74.0%


US 6,219,640 B1

METHODS AND APPARATUS FOR
AUDIO-VISUAL SPEAKER RECOGNITION
AND UTTERANCE VERIFICATION
`
`CROSS REFERENCE TO RELATED
`APPLICATIONS
`
`The present application is related to the U.S. patent
`application entitled: “Methods And Apparatus for Audio-
`Visual Speech Detection and Recognition,” filed concur-
`rently herewith and incorporated by reference herein.
`
`FIELD OF THE INVENTION
`
`The present invention relates generally to speaker recog-
`nition and, more particularly, to methods and apparatus for
`using video and audio information to provide improved
`speaker recognition and utterance verification in connection
`with arbitrary content video.
`
`BACKGROUND OF THE INVENTION
`
`Humans identify speakers based on a variety of attributes
`of the person which include acoustic cues, visual appearance
`cues and behavioral characteristics (e.g., such as character-
`istic gestures, lip movements). In the past, machine imple-
`mentations of person identification have focused on single
techniques relating to audio cues alone (e.g., audio-based
speaker recognition), visual cues alone (e.g., face
identification, iris identification) or other biometrics. More
`recently, researchers are attempting to combine multiple
`modalities for person identification, see, e.g., J. Bigun, B.
`Duc, F. Smeraldi, S. Fischer and A. Makarov, “Multi-modal
`person authentication,” In H. Wechsler, J. Phillips, V. Bruce,
`F. Fogelman Soulie, T. Huang (eds.) Face Recognition:
From theory to applications, Berlin: Springer-Verlag, 1999.
`Speaker recognition is an important
`technology for a
`variety of applications including security and, more recently,
`as an index for search and retrieval of digitized multimedia
`content (for instance in the MPEG-7 standard). Audio-based
`speaker recognition accuracy under acoustically degraded
`conditions (e.g., such as background noise) and channel
`mismatch (e.g., telephone) still needs further improvements.
`To make improvements in such degraded conditions is a
`difficult problem. As a result, it would be highly advanta-
`geous to provide methods and apparatus for providing
`improved speaker recognition that successfully perform in
`the presence of acoustic degradation, channel mismatch, and
`other conditions which have hampered existing speaker
`recognition techniques.
`
`SUMMARY OF THE INVENTION
`
The present invention provides various methods and
`apparatus for using visual information and audio informa-
`tion associated with arbitrary video content
`to provide
`improved speaker recognition accuracy. It is to be under-
`stood that speaker recognition may involve user enrollment,
`user identification (i.e., find who the person is among the
`enrolled users), and user verification (i.e., accept or reject an
`identity claim provided by the user). Further, the invention
`provides methods and apparatus for using such visual infor-
`mation and audio information to perform utterance verifi-
`cation.
`
`In a first aspect of the invention, a method of performing
`speaker recognition comprises processing a video signal
`associated with an arbitrary content video source and pro-
`cessing an audio signal associated with the video signal.
`Then, an identification and/or verification decision is made
`based on the processed audio signal and the processed video
`signal. Various decision making embodiments may be
`employed including, but not limited to, a score combination
`approach, a feature combination approach, and a re-scoring
`approach.
`As will be explained in detail, the combination of audio-
`based processing with visual processing for speaker recog-
`nition significantly improves the accuracy in acoustically
`degraded conditions such as, for example only, the broadcast
`news domain. The use of two independent sources of
`information brings significantly increased robustness to
`speaker recognition since signal degradations in the two
`channels are uncorrelated. Furthermore, the use of visual
`information allows a much faster speaker identification than
`possible with acoustic information alone. In accordance with
`the invention, we present results of various methods to fuse
`person identification based on visual information with iden-
`tification based on audio information for TV broadcast news
`video data (e.g., CNN and CSPAN) provided by the Lin-
`guistic Data Consortium (LDC). That is, we provide various
`techniques to fuse video based speaker recognition with
`audio-based speaker recognition to improve the perfor-
`mance under mismatch conditions.
`In a preferred
embodiment, we provide a technique to optimally determine
`the relative weights of the independent decisions based on
`audio and video to achieve the best combination. Experi-
`ments on video broadcast news data suggest that significant
`improvements are achieved by such a combination in acous-
`tically degraded conditions.
`In a second aspect of the invention, a method of verifying
`a speech utterance comprises processing a video signal
`associated with a video source and processing an audio
`signal associated with the video signal. Then, the processed
`audio signal is compared with the processed video signal to
`determine a level of correlation between the signals. This is
`referred to as unsupervised utterance verification. In a super-
`vised utterance verification embodiment,
`the processed
`video signal is compared with a script representing an audio
`signal associated with the video signal to determine a level
`of correlation between the signals.
`Of course, it is to be appreciated that any one of the above
`embodiments or processes may be combined with one or
`more other embodiments or processes to provide even
`further speech recognition and speech detection improve-
`ments.
`
Also, it is to be appreciated that the video and audio
`signals may be of a compressed format such as, for example,
`the MPEG-2 standard. The signals may also come from
`either a live camera/microphone feed or a stored (archival)
`feed. Further, the video signal may include images of visible
`and/or non-visible (e.g., infrared or radio frequency) wave-
`lengths. Accordingly,
`the methodologies of the invention
`may be performed with poor lighting, changing lighting, or
`no light conditions. Given the inventive teachings provided
`herein, one of ordinary skill in the art will contemplate
`various applications of the invention.
`These and other objects, features and advantages of the
`present invention will become apparent from the following
`detailed description of illustrative embodiments thereof,
`which is to be read in connection with the accompanying
`drawings.
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`
FIG. 1 is a block diagram of an audio-visual speaker
recognition and utterance verification system according to
an illustrative score or decision fusion embodiment of the
present invention;
`FIG. 2 is a flow diagram of an utterance verification
`methodology according to an illustrative embodiment of the
`present invention;
`FIG. 3 is a block diagram of an audio-visual speaker
`recognition and utterance verification system according to
`an illustrative feature fusion embodiment of the present
`invention;
`FIG. 4 is a block diagram of an audio-visual speaker
`recognition and utterance verification system according to
`an illustrative re-scoring embodiment of the present inven-
`tion;
FIG. 5 is a block diagram of an audio-visual
speaker recognition and utterance verification system
according to another illustrative re-scoring embodiment of
the present invention;
FIG. 6 is a block diagram of an illustrative hardware
implementation of an audio-visual speaker
recognition and utterance verification system according to
the invention; and
`FIG. 7 is a tabular representation of some experimental
`results.
`
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`
`The present invention will be explained below in the
`context of an illustrative speaker recognition implementa-
`tion. The illustrative embodiments include both identifica-
`tion and/or verification methodologies. However, it is to be
`understood that the present invention is not limited to a
`particular application or structural embodiment. Rather, the
`invention is more generally applicable to any situation in
`which it is desirable to improve speaker recognition accu-
`racy and provide utterance verification by employing visual
`information in conjunction with corresponding audio infor-
`mation during the recognition process.
`Referring initially to FIG. 1, a block diagram of an
`audio-visual speaker recognition and utterance verification
`system according to an illustrative embodiment of the
`present
`invention is shown. This particular illustrative
`embodiment, as will be explained, depicts audio-visual
`speaker recognition using a decision fusion approach.
`It is to be appreciated that the system of the invention may
`receive input signals from a variety of sources. That is, the
`input signals for processing in accordance with the invention
`may be provided from a real-time (e.g., live feed) source or
`an archival (e.g., stored) source. Arbitrary content video 2 is
`an input signal that may be received from either a live source
`or archival source. Preferably, the system may accept, as
`arbitrary content video 2, video compressed in accordance
`with a video standard such as the Moving Picture Expert
`Group-2 (MPEG-2) standard. To accommodate such a case,
`the system includes a video demultiplexer 8 which separates
`the compressed audio signal from the compressed video
`signal. The video signal
`is then decompressed in video
`decompressor 10, while the audio signal is decompressed in
`audio decompressor 12. The decompression algorithms are
`standard MPEG-2 techniques and thus will not be further
`described. In any case, other forms of compressed video
`may be processed in accordance with the invention.
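
By way of illustration only, the demultiplexing and decompression stage (demultiplexer 8 and decompressors 10 and 12) can be approximated with an off-the-shelf MPEG decoder. The sketch below assumes the ffmpeg command-line tool as a stand-in; the file names are hypothetical and not part of the system described here.

```python
# Sketch only: approximates demultiplexer 8 and decompressors 10/12 with ffmpeg.
# Assumes ffmpeg is installed on the system; paths are illustrative.
import subprocess

def demux_and_decode(mpeg_path: str, wav_path: str, frame_pattern: str) -> None:
    # Decode the audio stream to 16 kHz mono PCM, matching the sampling rate
    # used later by the auditory feature extractor 14.
    subprocess.run(
        ["ffmpeg", "-y", "-i", mpeg_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    # Decode the video stream to individual image frames for face processing.
    subprocess.run(
        ["ffmpeg", "-y", "-i", mpeg_path, "-an", frame_pattern],
        check=True,
    )

# Example (hypothetical paths):
# demux_and_decode("news_clip.mpg", "news_clip.wav", "frames/%06d.png")
```
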
`It is to be further appreciated that one of the advantages
`that the invention provides is the ability to process arbitrary
`content video. That is, previous systems that have attempted
`to utilize visual cues from a video source in the context of
`speech recognition have utilized video with controlled
`conditions,
`i.e., non-arbitrary content video. That is,
`the
`
`video content included only faces from which the visual
`cues were taken in order to try to recognize short commands
`or single words in a predominantly noiseless environment.
`However, as will be explained in detail below, the system of
`the present invention is able to process arbitrary content
`video which may not only contain faces but may also contain
`arbitrary background objects in a noisy environment. One
`example of arbitrary content video is in the context of
`broadcast news. Such video can possibly contain a news-
`person speaking at a location where there is arbitrary activity
`and noise in the background. In such a case, as will be
`explained, the invention is able to locate and track a face
`and, more particularly, a mouth and/or other facial features,
`to determine what is relevant visual information to be used
`
`in more accurately identifying and/or verifying the speaker.
`Alternatively,
`the system of the present
`invention is
`capable of receiving real-time arbitrary content directly from
`a video camera 4 and microphone 6. While the video signals
`received from the camera 4 and the audio signals received
`from the microphone 6 are shown in FIG. 1 as not being
`compressed, they may be compressed and therefore need to
`be decompressed in accordance with the applied compres-
`sion scheme.
`
`It is to be understood that the video signal captured by the
`camera 4 does not necessarily have to be of any particular
`type. That is, the face detection and recognition techniques
`of the invention may process images of any wavelength such
`as, e.g., visible and/or non-visible electromagnetic spectrum
`images. By way of example only, this may include infrared
`(IR) images (e.g., near, mid and far field IR video) and radio
`frequency (RF) images. Accordingly, the system may per-
`form audio-visual speaker recognition and utterance verifi-
`cation techniques in poor lighting conditions, changing
`lighting conditions, or in environments without light. For
`example, the system may be installed in an automobile or
`some other form of vehicle and capable of capturing IR
`images so that improved speaker recognition may be per-
`formed. Because video information (i.e., including visible
`and/or non-visible electromagnetic spectrum images) is used
`in the speaker recognition process in accordance with the
`invention, the system is less susceptible to recognition errors
`due to noisy conditions, which significantly hamper con-
`ventional speaker recognition systems that use only audio
`information. Additionally, as disclosed in Francine J. Pro-
`koski and Robert R. Riedel, “Infrared Identification of Faces
`and Body Parts,” BIOMETRICS, Personal Identification in
`Networked Society, Kluwer Academic Publishers, 1999, IR
`cameras introduce additional very robust biometric features
`which uniquely characterize individuals very well.
`A phantom line denoted by Roman numeral I represents
`the processing path the audio information signal takes within
`the system, while a phantom line denoted by Roman
`numeral II represents the processing path the video infor-
`mation signal takes within the system. First, the audio signal
`path I will be discussed,
`then the video signal path II,
`followed by an explanation of how the two types of infor-
`mation are combined to provide improved speaker recogni-
`tion accuracy.
`The system includes an auditory feature extractor 14. The
`feature extractor 14 receives an audio or speech signal and,
`as is known in the art, extracts spectral features from the
`signal at regular intervals. The spectral features are in the
`form of acoustic feature vectors (signals) which are then
`passed on to an audio speaker recognition module 16. As
`mentioned, the audio signal may be received from the audio
`decompressor 12 or directly from the microphone 6, depend-
`ing on the source of the video. Before acoustic vectors are
`extracted, the speech signal may be sampled at a rate of 16
kilohertz (kHz). A frame may consist of a segment of speech
having a 25 millisecond (msec) duration. In such an
`arrangement, the extraction process preferably produces 24
`dimensional acoustic cepstral vectors via the process
`described below. Frames are advanced every 10 msec to
`obtain succeeding acoustic vectors.
`First,
`in accordance with a preferred acoustic feature
`extraction process, magnitudes of discrete Fourier trans-
`forms of samples of speech data in a frame are considered
`in a logarithmically warped frequency scale. Next, these
`amplitude values themselves are transformed to a logarith-
`mic scale. The latter two steps are motivated by a logarith-
`mic sensitivity of human hearing to frequency and ampli-
`tude. Subsequently, a rotation in the form of discrete cosine
`transform is applied. One way to capture the dynamics is to
`use the delta (first-difference) and the delta-delta (second-
`order differences) information. An alternative way to capture
`dynamic information is to append a set of (e.g., four)
`preceding and succeeding vectors to the vector under con-
`sideration and then project the vector to a lower dimensional
`space, which is chosen to have the most discrimination. The
`latter procedure is known as Linear Discriminant Analysis
`(LDA) and is well known in the art. It is to be understood
`that other variations on features may be used, e.g., LPC
`cepstra, PLP, etc., and that the invention is not limited to any
`particular type.
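
By way of example only, a minimal sketch of this front end is given below: 25 ms frames advanced every 10 ms at 16 kHz, logarithmically warped magnitude spectrum, log amplitudes, a discrete cosine transform producing 24 cepstral coefficients, and first- and second-order differences. The triangular mel filterbank and the specific constants are assumptions of the sketch, not details taken from the text above.

```python
# Illustrative acoustic front end: not the exact implementation described above;
# the filterbank shape, FFT size and windowing are assumptions of this sketch.
import numpy as np

def mel_filterbank(n_filters=24, n_fft=512, sr=16000):
    # Triangular filters spaced on a logarithmically warped (mel) frequency scale.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def cepstra(signal, sr=16000, frame_ms=25, shift_ms=10, n_ceps=24):
    frame = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    n_fft = 512
    fb = mel_filterbank(n_filters=n_ceps, n_fft=n_fft, sr=sr)
    window = np.hamming(frame)
    # DCT-II basis: the "rotation" applied after taking logs of the warped spectrum.
    basis = np.cos(np.pi / n_ceps * (np.arange(n_ceps) + 0.5)[None, :]
                   * np.arange(n_ceps)[:, None])
    feats = []
    for start in range(0, len(signal) - frame + 1, shift):
        mag = np.abs(np.fft.rfft(signal[start:start + frame] * window, n_fft))
        logmel = np.log(fb @ mag + 1e-10)   # log-warped frequencies, log amplitudes
        feats.append(basis @ logmel)        # 24-dimensional cepstral vector
    return np.array(feats)                  # one vector every 10 ms

def add_dynamics(feats):
    # Delta (first-difference) and delta-delta (second-order difference) features.
    delta = np.gradient(feats, axis=0)
    return np.hstack([feats, delta, np.gradient(delta, axis=0)])
```
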
After the acoustic feature vectors, denoted in FIG. 1 by
`the letter A, are extracted, they are provided to the audio
`speaker recognition module 16. It is to be understood that
`the module 16 may perform speaker identification and/or
`speaker verification using the extracted acoustic feature
`vectors. The processes of speaker identification and verifi-
`cation may be accomplished via any conventional acoustic
`information speaker recognition system. For example,
`speaker recognition module 16 may implement the recog-
`nition techniques described in the U.S. patent application
`identified by Ser. No. 08/788,471, filed on Jan. 28, 1997, and
`entitled: “Text Independent Speaker Recognition for Trans-
`parent Command Ambiguity Resolution and Continuous
`Access Control,” the disclosure of which is incorporated
`herein by reference.
`An illustrative speaker identification process for use in
`module 16 will now be described. The illustrative system is
`disclosed in H. Beigi, S. H. Maes, U. V. Chaudari and J. S.
`Sorenson, “IBM model-based and frame-by-frame speaker
`recognition,” Speaker Recognition and its Commercial and
`Forensic Applications, Avignon, France 1998. The illustra-
`tive speaker identification system may use two techniques:
`a model-based approach and a frame-based approach. In the
`experiments described herein, we use the frame-based
`approach for speaker identification based on audio. The
`frame-based approach can be described in the following
`manner.
`
Let $M_i$ be the model corresponding to the $i$-th enrolled
speaker. $M_i$ is represented by a mixture Gaussian model
defined by the parameter set $\{\mu_{i,j}, \Sigma_{i,j}, p_{i,j}\}_{j=1,\ldots,n_i}$,
consisting of the mean vector, covariance matrix and mixture
weights for each of the $n_i$ components of speaker $i$'s model.
These models are created using training data consisting of a
sequence of $K$ frames of speech with $d$-dimensional cepstral
feature vectors, $\{f_k\}_{k=1,\ldots,K}$. The goal of speaker
identification is to find the model, $M_i$, that best explains the
test data represented by a sequence of $N$ frames,
$\{f_n\}_{n=1,\ldots,N}$. We use the following frame-based weighted
likelihood distance measure, $d_{i,n}$, in making the decision:
$$d_{i,n} = -\log\Bigl[\sum_{j=1}^{n_i} p_{i,j}\, p(f_n \mid \mu_{i,j}, \Sigma_{i,j})\Bigr].$$

The total distance $D_i$ of model $M_i$ from the test data is then
taken to be the sum of the distances over all the test frames:

$$D_i = \sum_{n=1}^{N} d_{i,n}.$$
`
`Thus, the above approach finds the closest matching model
and the person that model represents is determined to be the
person whose utterance is being processed.
`Speaker verification may be performed in a similar
`manner, however, the input acoustic data is compared to
`determine if the data matches closely enough with stored
`models. If the comparison yields a close enough match, the
`person uttering the speech is verified. The match is accepted
`or rejected by comparing the match with competing models.
`These models can be selected to be similar to the claimant
`speaker or be speaker independent (i.e., a single or a set of
`speaker independent models). If the claimant wins and wins
`with enough margin (computed at the level of the likelihood
`or the distance to the models), we accept the claimant.
`Otherwise, the claimant is rejected. It should be understood
`that, at enrollment, the input speech is collected for a speaker
to build the mixture Gaussian model $M_i$ that characterizes
each speaker.
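
A compact sketch of the frame-based scoring and of the identification and verification decisions described above follows. The diagonal covariance matrices, the margin parameter and the function names are assumptions of the sketch, not elements of the described system.

```python
# Sketch of frame-based GMM scoring: d_{i,n} = -log sum_j p_{i,j} p(f_n | component j),
# D_i = sum_n d_{i,n}.  Diagonal covariances are assumed here for simplicity.
import numpy as np

def frame_distances(frames, means, variances, weights):
    # frames: (N, d); means/variances: (n_i, d); weights: (n_i,)
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff * diff / variances[None, :, :], axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :])
    log_mix = np.log(weights)[None, :] + log_gauss
    # Stable -log sum_j exp(log_mix) per frame.
    m = log_mix.max(axis=1, keepdims=True)
    return -(m.squeeze(1) + np.log(np.exp(log_mix - m).sum(axis=1)))

def identify(frames, models):
    # models: dict speaker_id -> (means, variances, weights).
    # D_i is the sum of frame distances; the closest model wins.
    totals = {sid: frame_distances(frames, *m).sum() for sid, m in models.items()}
    return min(totals, key=totals.get), totals

def verify(frames, claimant_model, competing_models, margin=0.0):
    # Accept the identity claim only if the claimant beats the best competing
    # model by at least `margin` (a threshold that would be tuned in practice).
    d_claim = frame_distances(frames, *claimant_model).sum()
    d_best_rival = min(frame_distances(frames, *m).sum() for m in competing_models)
    return d_claim + margin < d_best_rival
```
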
`Referring now to the video signal path II of FIG. 1, the
`methodologies of processing visual information according
`to the invention will now be explained. The audio-visual
`speaker recognition and utterance verification system of the
`invention includes an active speaker face segmentation
`module 20 and a face recognition module 24. The active
`speaker face segmentation module 20 can receive video
`input from one or more sources, e.g., video decompressor
`10, camera 4, as explained above. It is to be appreciated that
`speaker face detection can also be performed directly in the
`compressed data domain and/or from audio and video infor-
`mation rather than just from video information. In any case,
`segmentation module 20 generally locates and tracks the
`speaker’s face and facial features within the arbitrary video
`background. This will be explained in detail below. From
`data provided from the segmentation module 22, an identi-
`fication and/or verification operation may be performed by
`recognition module 24 to identify and/or verify the face of
`the person assumed to be the speaker in the video. Verifi-
`cation can also be performed by adding score thresholding
`or competing models. Thus,
`the visual mode of speaker
`identification is implemented as a face recognition system
`where faces are found and tracked in the video sequences,
`and recognized by comparison with a database of candidate
`face templates. As will be explained later, utterance verifi-
`cation provides a technique to verify that the person actually
`uttered the speech used to recognize him.
`Face detection and recognition may be performed in a
`variety of ways. For example, in an embodiment employing
`an infrared camera 4, face detection and identification may
`be performed as disclosed in Francine J. Prokoski and
`Robert R. Riedel, “Infrared Identification of Faces and Body
`Parts,” BIOMETRICS, Personal Identification in Networked
`Society, Kluwer Academic Publishers, 1999. In a preferred
`embodiment, techniques described in Andrew Senior, “Face
and feature finding for face recognition system," 2nd Int.
Conf. on Audio- and Video-based Biometric Person
Authentication, Washington D.C., March 1999, are
employed. The following is an illustrative description of
`face detection and recognition as respectively performed by
`segmentation module 22 and recognition module 24.
`Face Detection
`
Faces can occur at a variety of scales, locations and
`orientations in the video frames. In this system, we make the
`assumption that faces are close to the vertical, and that there
`is no face smaller than 66 pixels high. However, to test for
`a face at all the remaining locations and scales, the system
`searches for a fixed size template in an image pyramid. The
`image pyramid is constructed by repeatedly down-sampling
`the original image to give progressively lower resolution
representations of the original frame. Within each of these
`sub-images, we consider all square regions of the same size
`as our face template (typically 11x11 pixels) as candidate
`face locations. A sequence of tests is used to test whether a
`region contains a face or not.
`First, the region must contain a high proportion of skin-
`tone pixels, and then the intensities of the candidate region
`are compared with a trained face model. Pixels falling into
`a pre-defined cuboid of hue-chromaticity-intensity space are
`deemed to be skin tone, and the proportion of skin tone
`pixels must exceed a threshold for the candidate region to be
`considered further.
`
`The face model is based on a training set of cropped,
`normalized, grey-scale face images. Statistics of these faces
`are gathered and a variety of classifiers are trained based on
`these statistics. A Fisher linear discriminant (FLD) trained
`with a linear program is found to distinguish between faces
`and background images, and “Distance from face space”
`(DFFS), as described in M. Turk and A. Pentland, “Eigen-
faces for Recognition," Journal of Cognitive Neuroscience,
`vol. 3, no. 1, pp. 71-86, 1991, is used to score the quality of
faces given high scores by the first method. A high combined
score from both these face detectors indicates that the
candidate region is indeed a face. Candidate face regions
`with small perturbations of scale,
`location and rotation
`relative to high-scoring face candidates are also tested and
`the maximum scoring candidate among the perturbations is
`chosen, giving refined estimates of these three parameters.
`In subsequent frames,
`the face is tracked by using a
`velocity estimate to predict
`the new face location, and
`models are used to search for the face in candidate regions
`near the predicted location with similar scales and rotations.
`A low score is interpreted as a failure of tracking, and the
`algorithm begins again with an exhaustive search.
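
The candidate-region tests just described can be sketched as follows. The 11x11 template size, the skin-tone cuboid values, the step size, and the use of a single Fisher linear discriminant score (without the DFFS refinement, the perturbation search or the tracking step) are simplifications assumed for illustration only.

```python
# Sketch of the face detection search: image pyramid + skin-tone test + template score.
# Thresholds, the FLD weights and the HSI representation are assumed inputs.
import numpy as np

def image_pyramid(gray, min_size=11, factor=0.8):
    # Repeatedly down-sample so a fixed-size template covers faces at all scales.
    levels, img, scale = [], gray, 1.0
    while min(img.shape) >= min_size:
        levels.append((img, scale))
        new_h, new_w = int(img.shape[0] * factor), int(img.shape[1] * factor)
        rows = (np.arange(new_h) / factor).astype(int)
        cols = (np.arange(new_w) / factor).astype(int)
        img, scale = img[rows][:, cols], scale * factor
    return levels

def skin_tone_fraction(region_hsi, cuboid):
    # cuboid: ((h_lo, h_hi), (c_lo, c_hi), (i_lo, i_hi)) in hue-chromaticity-intensity space.
    inside = np.ones(region_hsi.shape[:2], dtype=bool)
    for ch, (lo, hi) in enumerate(cuboid):
        inside &= (region_hsi[..., ch] >= lo) & (region_hsi[..., ch] <= hi)
    return float(inside.mean()) if inside.size else 0.0

def detect_faces(gray, hsi, fld_weights, cuboid, template=11,
                 skin_thresh=0.4, score_thresh=0.0):
    # fld_weights: trained Fisher linear discriminant over the flattened template.
    hits = []
    for level, scale in image_pyramid(gray, min_size=template):
        for r in range(0, level.shape[0] - template, 2):
            for c in range(0, level.shape[1] - template, 2):
                patch_hsi = hsi[int(r / scale):int((r + template) / scale),
                                int(c / scale):int((c + template) / scale)]
                if skin_tone_fraction(patch_hsi, cuboid) < skin_thresh:
                    continue
                patch = level[r:r + template, c:c + template].astype(float).ravel()
                patch = (patch - patch.mean()) / (patch.std() + 1e-6)
                score = float(fld_weights @ patch)   # DFFS could further refine this
                if score > score_thresh:
                    hits.append((r / scale, c / scale, 1.0 / scale, score))
    return hits
```
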
`Face Recognition
`Having found the face, K facial features are located using
`the same techniques (FLD and DFFS) used for face detec-
`tion. Features are found using a hierarchical approach where
`large-scale features, such as eyes, nose and mouth are first
`found, then sub-features are found relative to these features.
`As many as 29 sub-features are used, including the hairline,
chin, ears, and the corners of mouth, nose, eyes and eye-
`brows. Prior statistics are used to restrict the search area for
`each feature and sub-feature relative to the face and feature
`positions, respectively. At each of the estimated sub-feature
`locations, a Gabor Jet representation, as described in L.
`Wiskott and C. von der Malsburg, “Recognizing Faces by
`Dynamic Link Matching,” Proceedings of the International
`Conference on Artificial Neural Networks, pp. 347-352,
`1995, is generated. A Gabor jet is a set of two-dimensional
`Gabor filters—each a sine wave modulated by a Gaussian.
`Each filter has scale (the sine wavelength and Gaussian
`standard deviation with fixed ratio) and orientation (of the
`sine wave). We use five scales and eight orientations, giving
40 complex coefficients ($a(j)$, $j = 1, \ldots, 40$) at each feature
location.
`
`A simple distance metric is used to compute the distance
`between the feature vectors for trained faces and the test
candidates. The distance between the $i$-th trained candidate
and a test candidate for feature $k$ is defined as:

$$S_{i,k} = \frac{\sum_j a(j)\, a_i(j)}{\sqrt{\sum_j a(j)^2\, \sum_j a_i(j)^2}}.$$
A simple average of these similarities, $S_i = \frac{1}{K}\sum_{k=1}^{K} S_{i,k}$, gives
`an overall measure for the similarity of the test face to the
`face template in the database. Accordingly, based on the
`similarity measure, an identification and/or verification of
`the person in the video sequence under consideration is
`made.
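
A sketch of this feature-template comparison is given below. Comparing magnitudes of the 40 Gabor-jet coefficients by normalized correlation and the gallery data structure are assumptions of the sketch.

```python
# Sketch of the Gabor-jet similarity: normalized correlation per feature, averaged
# over the K facial features, highest average similarity wins.
import numpy as np

def jet_similarity(test_jet, trained_jet):
    # Normalized correlation of coefficient magnitudes (complex phases are
    # ignored here, which is an assumption of this sketch).
    a, b = np.abs(test_jet), np.abs(trained_jet)
    return float(a @ b / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12))

def face_similarity(test_jets, trained_jets):
    # test_jets, trained_jets: arrays of shape (K, 40) of Gabor-jet coefficients.
    scores = [jet_similarity(t, r) for t, r in zip(test_jets, trained_jets)]
    return sum(scores) / len(scores)

def recognize(test_jets, gallery):
    # gallery: dict identity -> (K, 40) template jets.
    return max(gallery, key=lambda who: face_similarity(test_jets, gallery[who]))
```
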
`
`Next, the results of the face recognition module 24 and the
`audio speaker recognition module 16 are provided to respec-
`tive confidence estimation blocks 26 and 18 where confi-
`dence estimation is performed. Confidence estimation refers
`to a likelihood or other confidence measure being deter-
`mined with regard to the recognized input.
`In one
`embodiment,
`the confidence estimation procedure may
`include measurement of noise levels respectively associated
`with the audio signal and the video signal. These levels may
`be measured internally or externally with respect to the
`system. A higher level of noise associated with a signal
`generally means that the confidence attributed to the recog-
`nition results associated with that signal is lower. Therefore,
`these confidence measures are taken into consideration
`during the weighting of the visual and acoustic results
`discussed below.
`
`Given the audio-based speaker recognition and face rec-
`ognition scores provided by respective modules 16 and 24,
`audio-visual speaker identification/verification may be per-
`formed by a joint identification/verification module 30 as
follows. The top N scores are generated based on both audio-
`and video-based identification techniques. The two lists are
`combined by a weighted sum and the best-scoring candidate
`is chosen. Since the weights need only to be defined up to
a scaling factor, we can define the combined score $S_i^{av}$ as a
function of the single parameter $\alpha$:

$$S_i^{av} = \cos\alpha\, D_i + \sin\alpha\, S_i.$$
`
The mixture angle $\alpha$ has to be selected according to the
relative reliability of audio identification and face identifi-
cation. One way to achieve this is to optimize $\alpha$ in order to
maximize the audio-visual accuracy on some training data.
Let us denote by $D_i(n)$ and $S_i(n)$ the audio ID (identification)
and video ID score for the $i$-th enrolled speaker ($i = 1, \ldots, P$)
computed on the $n$-th training clip. Let us define the variable
$T_i(n)$ as zero when the $n$-th clip belongs to the $i$-th speaker and
one otherwise. The cost function to be minimized is the
empirical error, as discussed in V. N. Vapnik, "The Nature of
Statistical Learning Theory," Springer, 1995, that can be
written as:
$$C(\alpha) = \frac{1}{N}\sum_{n=1}^{N} T_{\hat{\imath}}(n), \quad \text{where } \hat{\imath} = \arg\max_i\, S_i^{av}(n),$$

and where:

$$S_i^{av}(n) = \cos\alpha\, D_i(n) + \sin\alpha\, S_i(n).$$
`
`In order to prevent over-fitting, one can also resort to the
`smoothed error rate, as discussed in H. Ney, “On the
Probabilistic Interpretation of Neural Network Classifiers
and Discriminative Training Criteria," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 17, no. 2,
`pp. 107-119, 1995, defined as:
`
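One form consistent with the behavior described in the next paragraph, in which every hypothesis contributes to an inner sum whose terms are controlled by a smoothing parameter $\eta$, would be a softmax weighting of the combined scores; this particular form is an assumption used here for illustration:

$$C'(\alpha) = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{P} T_i(n)\,
\frac{e^{\eta\, S_i^{av}(n)}}{\sum_{j=1}^{P} e^{\eta\, S_j^{av}(n)}}.$$
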
When $\eta$ is large, all the terms of the inner sum approach
zero, except for $i = \hat{\imath}$, and $C'(\alpha)$ approaches the raw error
count $C(\alpha)$. Otherwise, all the incorrect hypotheses (those
for which $T_i(n) = 1$) have a contribution that is a decreasing
function of the distance between their score and the maxi-
`mum score. If the best hypothesis is incorrect, it has the
`largest contribution. Hence, by minimizing the latter cost
`function, one tends to maximize not only the recognition
`accuracy on the training data, but also the margin by which
`the best score wins. This function also presents the advan-
`tage of being differentiable, which can facilitate the optimi-
`zation process when there is more than one parameter.
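
By way of illustration only, the weighted combination and the selection of the mixture angle $\alpha$ can be sketched as below. The grid resolution, the sign convention for the audio scores (treated here as larger-is-better, e.g. negated distances), and the inclusion of the training-free, margin-based selection described in the following paragraph are assumptions of the sketch.

```python
# Sketch of audio-visual score fusion: S_i^av = cos(a)*D_i + sin(a)*S_i, with the
# mixture angle chosen on training clips by minimizing the empirical error C(a).
import numpy as np

def combined_scores(audio_scores, video_scores, alpha):
    # audio_scores, video_scores: arrays of shape (n_clips, n_speakers).
    return np.cos(alpha) * audio_scores + np.sin(alpha) * video_scores

def empirical_error(audio_scores, video_scores, labels, alpha):
    # C(alpha): fraction of clips whose top combined score is the wrong speaker.
    s = combined_scores(audio_scores, video_scores, alpha)
    return float(np.mean(np.argmax(s, axis=1) != labels))

def best_alpha(audio_scores, video_scores, labels, grid=None):
    grid = np.linspace(0.0, np.pi / 2, 91) if grid is None else grid
    errors = [empirical_error(audio_scores, video_scores, labels, a) for a in grid]
    return float(grid[int(np.argmin(errors))])

def train_free_decision(audio_scores_clip, video_scores_clip, grid=None):
    # Training-free variant described in the following paragraph: for one clip,
    # pick the alpha maximizing the gap between best and second-best scores.
    grid = np.linspace(0.0, np.pi / 2, 91) if grid is None else grid
    best_gap, best_hyp = -np.inf, None
    for a in grid:
        s = np.cos(a) * audio_scores_clip + np.sin(a) * video_scores_clip
        top2 = np.sort(s)[-2:]
        if top2[1] - top2[0] > best_gap:
            best_gap, best_hyp = top2[1] - top2[0], int(np.argmax(s))
    return best_hyp
```
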
The present invention provides another decision or score
fusion technique derived from the previous technique, but
which does not require any training. It consists in selecting
at testing time, for each clip, the value of $\alpha$ in a given range
which maximizes the difference between the highest and the
second highest scores. The corresponding best hypothesis
$\hat{\imath}(n)$ is then chosen. We have
