UNSUPERVISED CLUSTERING OF AMBULATORY AUDIO AND VIDEO

Brian Clarkson and Alex Pentland

Perceptual Computing
MIT Media Lab
Cambridge, MA
{clarkson,sandy}@media.mit.edu

ABSTRACT

A truly personal and reactive computer system should have access to the same information as its user, including the ambient sights and sounds. To this end, we have developed a system for extracting events and scenes from natural audiovisual input. We find that our system can, without any prior labeling of data, cluster the audiovisual data into events, such as passing through doors and crossing the street. Also, we hierarchically cluster these events into scenes and obtain clusters that correlate with visiting the supermarket or walking down a busy street.

1. INTRODUCTION

Computers have evolved into miniature and wearable systems. As a result there is a desire for these computers to be tightly coupled with their user's day-to-day activities. A popular analogy for this integration equates the wearable computer to an intelligent observer and assistant for its user. To fill this role effectively, the wearable computer needs to live in the same sensory world as its human user.

Thus, a system is required that can take this natural and personal audio/video and find the coherent segments, the points of major activity, and the recurring events. The field of multimedia indexing has wrestled with many of the problems that such a system creates. However, the audio/video data that these researchers typically tackle are very heterogeneous and thus have little structure, and what structure they do have is usually artificial, like scene cuts and patterns in camera angle. The "eyes and ears" audio/video data that we are tackling is much more homogeneous and thus richer in structure, and filled with repeating elements and slowly varying trends.

The use of our system's resulting indexing differs greatly from the typical querying for "key-frames". Suppose our system has clustered its audio/video history into models. Upon further use, the system notices that whenever the user requests his grocery list, a particular model is active. We would say that this model is "the supermarket". However, the system does not need such a human-readable label for the model. What would the system do with it? The user presumably knows already that he is in the supermarket. However, a software agent built on our system would know to automatically display the user's grocery list when that model activates.

2. THE PERSONAL AUDIO-VISUAL TASK

In contrast to subject-oriented video or audio, such as TV, movies, and video recordings of meetings, our goal is to use video to monitor an individual's environment. Literally, the camera and microphone become an extra set of senses for the user.

2.1. Data Collection

In order to adequately sample the visual and aural environment of a mobile person, the sensors should be small and have a wide field of reception. The environmental audio was collected with a lavalier microphone (the size of a pencil eraser) mounted on the shoulder and directed away from the user. The environmental video was collected with a miniature CCD camera attached to a backpack and pointing backwards. The camera was fitted with a wide-angle lens, giving an excellent view of the sky, ground, and horizon at all times.

The system was worn around the city for a few hours, while the wearer performed typical actions, such as shopping for groceries, renting a video, going home, and meeting and talking with acquaintances. The resulting recording covered early to late afternoon (no night-time data). The camera's automatic gain control was used to prevent saturation in daylight.

2.2. Feature Extraction

Unlike the typical features used for face and speech recognition, we require features that are much less sensitive. We want our features to respond only to the most blindingly obvious events: walking into a building, crossing the street, riding an elevator. Since our system is restricted to unsupervised learning, it is necessary to use robust features that do not behave wildly or respond to every change in the environment, only enough to convey the ambiance.

Video: First the (r, g, b) pixel values were separated into pseudo luminance and chrominance channels:

I  = r + g + b
Ir = r / I
Ig = g / I

The visual field of the camera was divided into regions that correspond strongly to direction. The following features were extracted from the (I, Ir, Ig) values of each region: the mean vector (E[I], E[Ir], E[Ig]) and the covariance matrix Cov(I, Ir, Ig), i.e. the variances of I, Ir, and Ig together with their cross-covariances.

Figure: The features on the left were extracted from each of the regions shown on the right.

Hence, we are collapsing each region to a Gaussian in color space. This rough approximation lends robustness to small changes in the visual field, such as distant moving objects and small-amplitude camera movement (the human body is not a stable camera platform).

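As a concrete illustration of this step, the sketch below computes the per-region color statistics for a single frame. It is a minimal sketch that assumes frames arrive as numpy arrays; the 3x3 grid of regions is a placeholder, not the actual region layout used.

import numpy as np

def frame_features(frame, grid=(3, 3), eps=1e-6):
    """Per-region mean and covariance of (I, Ir, Ig) for one RGB frame.

    frame: float array of shape (H, W, 3) holding r, g, b values.
    Returns a 1-D feature vector: 9 numbers per region
    (3 means + 6 unique covariance entries).
    """
    r, g, b = frame[..., 0], frame[..., 1], frame[..., 2]
    I = r + g + b                # pseudo luminance
    Ir = r / (I + eps)           # pseudo chrominance
    Ig = g / (I + eps)

    H, W = I.shape
    rows, cols = grid
    feats = []
    for i in range(rows):
        for j in range(cols):
            ys = slice(i * H // rows, (i + 1) * H // rows)
            xs = slice(j * W // cols, (j + 1) * W // cols)
            # Treat the pixels of this region as samples in (I, Ir, Ig) space.
            X = np.stack([I[ys, xs].ravel(),
                          Ir[ys, xs].ravel(),
                          Ig[ys, xs].ravel()], axis=1)
            mu = X.mean(axis=0)                 # (mean I, mean Ir, mean Ig)
            cov = np.cov(X, rowvar=False)       # 3x3 covariance matrix
            iu = np.triu_indices(3)             # keep the 6 unique entries
            feats.append(np.concatenate([mu, cov[iu]]))
    return np.concatenate(feats)
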
Audio: Auditory features were extracted with Mel-scaled filter banks. The triangle filters give the same robustness to small variations in frequency (especially at high frequencies), not to mention warping the frequencies to a more perceptually meaningful scale.

Both the video and the audio features were calculated at a common, fixed rate.

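For illustration, here is a minimal sketch of such a Mel filter-bank front end built directly on numpy; the number of filters, frame length, and hop size are placeholders rather than the values used in the experiments.

import numpy as np

def mel_filterbank_features(audio, sr, n_filters=20, frame_len=1024, hop=512):
    """Log energies from triangular Mel-scaled filters for each audio frame."""
    # Mel <-> Hz conversions.
    to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters spaced evenly on the Mel scale.
    mel_pts = np.linspace(to_mel(0), to_mel(sr / 2), n_filters + 2)
    bin_pts = np.floor((frame_len + 1) * to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bin_pts[i - 1], bin_pts[i], bin_pts[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    feats = []
    for start in range(0, len(audio) - frame_len, hop):
        frame = audio[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        feats.append(np.log(fbank @ power + 1e-10))   # log Mel-band energies
    return np.array(feats)
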
3. TIME SERIES CLUSTERING

The algorithm we used to cluster time-series data is a variation on the Segmental K-Means algorithm [6]. The procedure is as follows (a simplified sketch in code is given after the list):

1. Given: N, the number of models; T, the number of samples allocated to a state; S, the number of states per model; and f, the expected rate of class changes.

2. Initialization: Select N segments of the time series, each of length T*S, spaced approximately 1/f apart. Initialize each of the N models with a segment, using linear state segmentation.

3. Segmentation: Compile the N current models into a fully-connected grammar. A nonzero transition connects the final state of every model to the initial state of every model. Using this network, resegment the cluster membership for each model.

4. Training: Estimate the new model parameters using the Forward-Backward algorithm on the segments from step 3. Iterate on the current segmentation until the models converge and then go back to step 3 to resegment. Repeat steps 3 and 4 until the segmentation converges.

We constrained ourselves to left-right HMMs with no jumps and single Gaussian states.

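The following is a much-simplified sketch of this loop. It uses the hmmlearn library as a stand-in HMM implementation and, as a simplification, resegments by assigning fixed-length windows to whichever model currently scores them highest, rather than by a Viterbi pass through the fully-connected grammar. All names and parameter values are illustrative only.

import numpy as np
from hmmlearn.hmm import GaussianHMM

def left_right_hmm(S):
    """A left-right HMM with no jumps and single Gaussian states."""
    m = GaussianHMM(n_components=S, covariance_type="diag",
                    init_params="mc", params="stmc", n_iter=10)
    m.startprob_ = np.r_[1.0, np.zeros(S - 1)]
    trans = 0.5 * (np.eye(S) + np.eye(S, k=1))   # stay or step forward
    trans[-1, -1] = 1.0                           # absorbing final state
    m.transmat_ = trans / trans.sum(axis=1, keepdims=True)
    return m

def cluster_time_series(X, N, T, S, n_rounds=5):
    """Simplified Segmental K-Means over a feature time series X (frames x dims)."""
    win = T * S                                   # initialization window length
    starts = np.linspace(0, len(X) - win, N).astype(int)
    models = []
    for s in starts:                              # 2. Initialization
        m = left_right_hmm(S)
        m.fit(X[s:s + win])
        models.append(m)

    for _ in range(n_rounds):
        # 3. Segmentation (approximate): assign each window to its best model.
        wins = [X[i:i + win] for i in range(0, len(X) - win + 1, win)]
        owner = [int(np.argmax([m.score(w) for m in models])) for w in wins]
        # 4. Training: re-estimate each model on the windows it owns.
        for k, m in enumerate(models):
            segs = [w for w, o in zip(wins, owner) if o == k]
            if segs:
                m.fit(np.vstack(segs), lengths=[len(w) for w in segs])
    return models, owner

The fixed windows above are the main simplification: in the actual procedure the connected-grammar Viterbi pass lets segment boundaries fall anywhere in the time series.
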
3.1. Time Hierarchy

Varying the frame-state allocation number directs the clustering algorithm to model the time series at varying time scales. In the Initialization step, this time scale is made explicit by T, the frame-state allocation number, so that each model begins by literally modeling T*S samples. Of course, the reestimation steps adaptively change the window size of the samples modeled by each HMM. However, since EM is a local optimization, the time scale will typically not change drastically from the initialization. Hence, by increasing the frame-state allocation we can build a hierarchy of HMMs where each level of the hierarchy has a coarser time scale than the one below it.

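A tiny worked example of this relationship; the feature rate and parameter values below are placeholders, not those used in the experiments.

# Each model is initialized on T * S consecutive frames, so at a feature
# rate of `fps` frames per second the initial window covers T * S / fps seconds.
fps = 10            # assumed feature rate
S = 8               # states per model
for T in (1, 4, 16):
    print(T, "->", T * S / fps, "seconds per model at initialization")
# T = 1  -> 0.8 s   (fine-grained events)
# T = 4  -> 3.2 s
# T = 16 -> 12.8 s  (coarser time scale, a higher level of the hierarchy)
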
3.2. Representation Hierarchy

There are still important structures that clustering at different time scales alone will not capture. For example, suppose we wanted a model for a supermarket visit, or a walk down a busy street. As it stands, clustering will only separate specific events: supermarket music, cash register beeps, and walking through aisles for the supermarket; cars passing, crosswalks, and sidewalks for the busy street. It will not capture the fact that these events occur together to create scenes, such as the supermarket scene or the busy street scene. Notice that simply increasing the time scale and model complexity to cover the typical supermarket visit is not feasible, for the same reasons that speech is recognized at the phoneme and word level instead of at the sentence and paragraph level.

We address this shortcoming by adopting a hierarchy of HMMs, much like a grammar. Beginning with a set of low-level HMMs, which we will call object HMMs (like phonemes), we can encode their relationships in scene HMMs (like words). The process is as follows (a brief sketch in code follows the list):

1. Detect: Using the Forward algorithm with a sliding window of length dt, obtain the likelihood

   Li(t) = P(O(t), ..., O(t + dt) | object HMM i)

   for each object HMM i at time t.

2. Abstract: Construct a new feature space from these likelihoods,

   F(t) = [ L1(t), ..., LN(t) ]

3. Cluster: Now cluster the new feature space into scene HMMs using the algorithm from Section 3.

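A minimal sketch of the Detect and Abstract steps, assuming object HMMs that expose an hmmlearn-style score() method returning the windowed log-likelihood; the window length and the names below are illustrative.

import numpy as np

def scene_feature_space(X, object_hmms, win):
    """Build F(t) = [L1(t), ..., LN(t)] from sliding-window Forward likelihoods.

    X           : feature time series, shape (frames, dims)
    object_hmms : trained object HMMs, each with a score(window) method
    win         : sliding-window length (dt, in frames)
    """
    F = []
    for t in range(len(X) - win):
        window = X[t:t + win]
        # Detect: log-likelihood of this window under each object HMM.
        F.append([m.score(window) for m in object_hmms])
    # Abstract: one row per time step, one column per object HMM.
    return np.array(F)

# Cluster: the rows of F can now be clustered into scene HMMs with the
# same Segmental K-Means procedure used for the object layer.
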
4. RESULTS

We evaluated our performance by noting the correlation between our emergent models and a human-generated transcription. Each cluster plays the role of a hypothesis. A hypothesis is verified when its indexing correlates highly with a ground-truth labeling. Hypotheses that fail to correlate are ignored, but kept as "garbage classes". Hence, it is necessary to have more clusters than "classes" in order to prevent the useful models from having to model everything.

In the following experiments we restricted the system to two levels of representation, i.e. a single object HMM layer and a single scene HMM layer. The time scales were varied for the object HMMs, but kept fixed for the scene layer.

Short Time-scale Object HMMs

In this case, we used a short time scale for each object HMM and set the expected rate of class changes, f, to a correspondingly short interval. As a result, the HMMs modeled events such as doors, stairs, crosswalks, and so on. To show exactly how this worked, we give the specific example of the user arriving at his apartment building. This example is representative of the performance during other sequences of events. The "Coming Home" figure shows the features, segmentation, and key frames for the sequence of events in question. The image in the middle represents the raw feature vectors (the top rows are video, the bottom are audio).

Notice that you can even see the user's steps in the audio spectrogram.

Long Time-scale Object HMMs

Here we increase the time scale of the object HMMs. The result is that the HMMs model larger-scale changes, such as long walks down hallways and streets.

We give some preliminary results for the performance of classification as compared to hand-labeled ground truth. Since we did no training with labeled data, our models did not get the benefit of embedded training or garbage-modeling. Hence the models are frequently overpowered by a few that are not modeling anything useful. Typically this is where the system would make an application-driven decision to eliminate these models.

As an alternative we present the correlation coefficients between the independently hand-labeled ground truth and the output likelihood of the highest-correlating model. The table below shows the classes that the system was able to reliably model from only 2 hrs. of data:

    Label       Correlation Coeff.
    office
    lobby
    bedroom
    cashier

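The correlation score used here is an ordinary Pearson correlation between the binary ground-truth track and a model's likelihood trace over time; a minimal sketch of that comparison, with assumed array names:

import numpy as np

def best_correlating_model(ground_truth, likelihoods):
    """Pearson correlation between a 0/1 ground-truth track and each model.

    ground_truth : array of shape (frames,), 1 where the label is active
    likelihoods  : array of shape (frames, n_models), per-model likelihood traces
    Returns (index of the best model, its correlation coefficient).
    """
    corrs = [np.corrcoef(ground_truth, likelihoods[:, k])[0, 1]
             for k in range(likelihoods.shape[1])]
    best = int(np.argmax(corrs))
    return best, corrs[best]
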
Long Time-scale Scene HMMs

We also constructed a layer of scene HMMs that are based on the outputs of the short time-scale object HMMs from above. Where before we were unable to obtain clean classes for more complex events, like the supermarket visit and the walk down a busy street, now this level of HMMs is able to capture them. The following table gives the correlations for the best models:

    Label          Correlation Coeff.
    dorms
    charles river
    necco area
    sidewalk
    video store

The Sidewalk Scene and Video Store Scene figures show the model likelihoods for the models that correlated with "walking down a sidewalk" and "at the video store". While the video store scene has elements that overlap with other scenes, the video store model is able to cleanly select only the visit to the video store.

Figure: Coming Home: this example shows the user entering his apartment building, going up staircases, and arriving in his bedroom. The system's segmentation is depicted by the vertical lines along with key frames.

5. CONCLUSION

It is pretty clear that the unsupervised clustering of audio/video data is feasible and useful. In addition, the clustering algorithm in this paper can be easily adapted to an incremental and pseudo-realtime framework. Instead of iterating over all past data, the system can have a "memory" by only training on a recent window of data. This implies that the system can then adapt as new memories habituate.

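One possible reading of this memory-window idea in code, reusing the cluster_time_series sketch from Section 3; this is a hypothetical sketch rather than the implementation described above.

from collections import deque
import numpy as np

class IncrementalClusterer:
    """Pseudo-realtime variant: recluster over a bounded window of recent frames."""
    def __init__(self, memory_frames, refit_every, **cluster_kwargs):
        self.buffer = deque(maxlen=memory_frames)   # the system's "memory"
        self.refit_every = refit_every
        self.kwargs = cluster_kwargs                # e.g. N, T, S
        self.models = []
        self._since_fit = 0

    def observe(self, frame_features):
        self.buffer.append(frame_features)
        self._since_fit += 1
        if self._since_fit >= self.refit_every and len(self.buffer) == self.buffer.maxlen:
            X = np.vstack(self.buffer)
            # Recluster only the recent window (the Section 3 algorithm).
            self.models, _ = cluster_time_series(X, **self.kwargs)
            self._since_fit = 0
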
Our immediate goal is to integrate our system with a software agent so that the performance of our models can be grounded in some meaningful context.

6. REFERENCES

[1] Albert S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. The MIT Press, 1990.

[2] G. J. Brown. Computational Auditory Scene Analysis: A Representational Approach. PhD thesis, University of Sheffield.

[3] Bernhard Feiten and Stefan Gunzel. Automatic indexing of a sound database using self-organizing neural nets. Computer Music Journal.

[4] Liu, Wang, and Chen. Audio feature extraction and analysis for multimedia content classification. Journal of VLSI Signal Processing Systems.

[5] Silvia Pfeiffer, Stephan Fischer, and Wolfgang Effelsberg. Automatic audio content analysis. Technical report, University of Mannheim.

[6] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989.

[7] Dan Siewiorek, editor. The First International Symposium on Wearable Computers, 1997.

[8] T. Starner, B. Schiele, and A. Pentland. Visual contextual awareness in wearable computing. In The Second International Symposium on Wearable Computers, Oct. 1998.

Figure: The Sidewalk Scene (ground truth vs. most correlated model): above is the independently hand-labeled ground truth for "Walking on a Sidewalk" (true/false); below is the likelihood of Model 11, the most correlated model. The time axis spans the full 2 hr. recording.

Figure: The Video Store Scene (ground truth vs. most correlated model): above is the independently hand-labeled ground truth for "Video Store" (true/false); below is the likelihood of Model 36, the most correlated model. The time axis spans the full 2 hr. recording.
