TO REQUEST FOR EX PARTE REEXAMINATION OF
U.S. PATENT NO. 7,868,912

AVIGILON EX. 2005
IPR2019-00235
Page 1 of 18
`
`
`
`
Moving Object Detection and Event Recognition Algorithms for Smart Cameras

Thomas J. Olson
Frank Z. Brill

Texas Instruments
Research & Development
P.O. Box 655303, MS 8374, Dallas, TX 75265
E-mail: olson@esc.ti.com, brill@ti.com
http://www.ti.com/research/docsAuba/index.html
`
Abstract

Smart video cameras analyze the video stream and translate it into a description of the scene in terms of objects, object motions, and events. This paper describes a set of algorithms for the core computations needed to build smart cameras. Together these algorithms make up the Autonomous Video Surveillance (AVS) system, a general-purpose framework for moving object detection and event recognition. Moving objects are detected using change detection, and are tracked using first-order prediction and nearest neighbor matching. Events are recognized by applying predicates to the graph formed by linking corresponding objects in successive frames. The AVS algorithms have been used to create several novel video surveillance applications. These include a video surveillance shell that allows a human to monitor the outputs of multiple cameras, a system that takes a single high-quality snapshot of every person who enters its field of view, and a system that learns the structure of the monitored environment by watching humans move around in the scene.
`
1 Introduction

Video cameras today produce images, which must be examined by humans in order to be useful. Future 'smart' video cameras will produce information, including descriptions of the environment they are monitoring and the events taking place in it. The information they produce may include images and video clips, but these will be carefully selected to maximize their useful information content. The symbolic information and images from smart cameras will be filtered by programs that extract data relevant to particular tasks. This filtering process will enable a single human to monitor hundreds or thousands of video streams.

In pursuit of our research objectives [Flinchbaugh, 1997], we are developing the technology needed to make smart cameras a reality. Two fundamental capabilities are needed. The first is the ability to describe scenes in terms of object motions and interactions. The second is the ability to recognize important events that occur in the scene, and to pick out those that are relevant to the current task. These capabilities make it possible to develop a variety of novel and useful video surveillance applications.

The research described in this report was sponsored in part by the DARPA Image Understanding Program.

1.1 Video Surveillance and Monitoring Scenarios

Our work is motivated by several types of video surveillance and monitoring scenarios.
`
Indoor Surveillance: Indoor surveillance provides information about areas such as building lobbies, hallways, and offices. Monitoring tasks in lobbies and hallways include detection of people depositing things (e.g., unattended luggage in an airport lounge), removing things (e.g., theft), or loitering. Office monitoring tasks typically require information about people's identities: in an office, for example, the office owner may do anything at any
`
`
`
time, but other people should not open desk drawers or operate the computer unless the owner is present. Cleaning staff may come in at night to vacuum and empty trash cans, but should not handle objects on the desk.
`
Outdoor Surveillance: Outdoor surveillance includes tasks such as monitoring a site perimeter for intrusion or threats from vehicles (e.g., car bombs). In military applications, video surveillance can function as a sentry or forward observer, e.g. by notifying commanders when enemy soldiers emerge from a wooded area or cross a road.
`
In order for smart cameras to be practical for real-world tasks, the algorithms they use must be robust. Current commercial video surveillance systems have a high false alarm rate [Ringler and Hoover, 1995], which renders them useless for most applications. For this reason, our research stresses robustness and quantification of detection and false alarm rates. Smart camera algorithms must also run effectively on low-cost platforms, so that they can be implemented in small, low-power packages and can be used in large numbers. Studying algorithms that can run in near real time makes it practical to conduct extensive evaluation and testing of systems, and may enable worthwhile near-term applications as well as contributing to long-term research goals.
`
1.2 Approach

The first step in processing a video stream for surveillance purposes is to identify the important objects in the scene. In this paper it is assumed that the important objects are those that move independently. Camera parameters are assumed to be fixed. This allows the use of simple change detection to identify moving objects. Where use of moving cameras is necessary, stabilization hardware and stabilized moving object detection algorithms can be used (e.g. [Burt et al., 1989, Nelson, 1991]). The use of criteria other than motion (e.g., salience based on shape or color, or more general object recognition) is compatible with our approach, but these criteria are not used in our current applications.
`
Our event recognition algorithms are based on graph matching. Moving objects in the image are tracked over time. Observations of an object in successive video frames are linked to form a directed graph (the motion graph). Events are defined in terms of predicates on the motion graph. For instance, the beginning of a chain of successive observations of an object is defined to be an ENTER event. Event detection is described in more detail below.
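The ENTER predicate just mentioned can be made concrete. The sketch below is not the AVS implementation; it assumes a hypothetical encoding of the motion graph as observation nodes and directed links between successive frames.

```python
# Illustrative only: ENTER/EXIT as predicates on a motion graph whose
# links are (source_node, target_node) pairs across successive frames.

def enter_events(nodes, links):
    """A node with no incoming link begins a chain of observations,
    so it satisfies the ENTER predicate."""
    targets = {dst for _, dst in links}
    return [n for n in nodes if n not in targets]

def exit_events(nodes, links):
    """A node with no outgoing link ends a chain of observations,
    so it satisfies the EXIT predicate."""
    sources = {src for src, _ in links}
    return [n for n in nodes if n not in sources]
```

Richer events (DEPOSIT, REMOVE) are defined over the same structure by examining how chains split and merge, as described in section 3.3.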
`
Our approach to video surveillance stresses 2D, image-based algorithms and simple, low-level object representations that can be extracted reliably from the video sequence. This emphasis yields a high level of robustness and low computational cost. Object recognition and other detailed analyses are used only after the system has determined that the objects in question are interesting and merit further investigation.
`
1.3 Research Strategy

The primary technical goal of this research is to develop general-purpose algorithms for moving object detection and event recognition. These algorithms comprise the Autonomous Video Surveillance (AVS) system, a modular framework for building video surveillance applications. AVS is designed to be updated to incorporate better core algorithms or to tune the processing to specific domains as our research progresses.
`
In order to evaluate the AVS core algorithms and event recognition and tracking framework, we use them to develop applications motivated by the surveillance scenarios described above. The applications are small-scale implementations of future smart camera systems. They are designed for long-term operation, and are evaluated by allowing them to run for long periods (hours or days) and analyzing their output.

The remainder of this paper is organized as follows. The next section discusses related work. Section 3 presents the core moving object detection and event recognition algorithms, and the mechanism used to establish the 3D positions of objects. Section 4 presents applications that have been built using the AVS framework. The final section discusses the current state of the system and our future plans.
`
`
`
`
2 Related Work

Our overall approach to video surveillance has been influenced by interest in selective attention and task-oriented processing [Swain and Stricker, 1991, Rimey and Brown, 1993, Camus et al., 1993]. The fundamental problem with current video surveillance technology is that the useful information density of the images delivered to a human is very low; the vast majority of surveillance video frames contain no useful information at all. The fundamental role of the smart camera described above is to reduce the volume of data produced by the camera, and increase the value of that data. It does this by discarding irrelevant frames, and by expressing the information in the relevant frames primarily in symbolic form.
`
2.1 Moving Object Detection

Most algorithms for moving object detection using fixed cameras work by comparing incoming video frames to a reference image, and attributing significant differences either to motion or to noise. The algorithms differ in the form of the comparison operator they use, and in the way in which the reference image is maintained. Simple intensity differencing followed by thresholding is widely used [Jain et al., 1979, Yalamanchili et al., 1982, Kelly et al., 1995, Bobick and Davis, 1996, Courtney, 1997] because it is computationally inexpensive and works quite well in many indoor environments. Some algorithms provide a means of adapting the reference image over time, in order to track slow changes in lighting conditions and/or changes in the environment [Karmann and von Brandt, 1990, Makarov, 1996a]. Some also filter the image to reduce or remove low spatial frequency content, which again makes the detector less sensitive to lighting changes [Makarov et al., 1996b, Keller et al., 1994].
`
Recent work [Pentland, 1996, Kahn et al., 1996] has extended the basic change detection paradigm by replacing the reference image with a statistical model of the background. The comparison operator becomes a statistical test that estimates the probability that the observed pixel value belongs to the background.
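The flavor of such a statistical test can be sketched for a single pixel. This is an illustrative model only, not Pfinder or the AVS detector; the learning rate alpha and the threshold k are assumed values.

```python
# Hedged sketch: each pixel keeps a running mean and variance of its
# grey level, and is called foreground when its current value is
# improbable under that background model.

def update_background(mean, var, value, alpha=0.05):
    """Exponentially weighted running mean and variance for one pixel."""
    d = value - mean
    mean = mean + alpha * d
    var = (1.0 - alpha) * (var + alpha * d * d)
    return mean, var

def is_foreground(mean, var, value, k=3.0):
    """Flag the pixel if it lies more than k standard deviations from
    the background mean (a simple probabilistic test)."""
    return (value - mean) ** 2 > k * k * max(var, 1e-6)
```

Under this scheme the "reference image" is a per-pixel distribution rather than a single stored frame, so slow lighting drift is absorbed into the model.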
`
Our baseline change detection algorithm uses thresholded absolute differencing, since this works well for our indoor surveillance scenarios. For applications where lighting change is a problem, we use the adaptive reference frame algorithm of Karmann and von Brandt [1990]. We are also experimenting with a probabilistic change detector similar to Pfinder [Pentland, 1996].
`
Our work assumes fixed cameras. When the camera is not fixed, simple change detection cannot be used because of background motion. One approach to this problem is to treat the scene as a collection of independently moving objects, and to detect and ignore the visual motion due to camera motion [e.g., Burt et al., 1989]. Other researchers have proposed ways of detecting features of the optical flow that are inconsistent with a hypothesis of self motion [Nelson, 1991].

In many of our applications moving object detection is a prelude to person detection. There has been significant recent progress in the development of algorithms to locate and track humans. Pfinder (cited above) uses a coarse statistical model of human body geometry and motion to estimate the likelihood that a given pixel is part of a human. Several researchers have described methods of tracking human body and limb movements [Gavrila and Davis, 1996, Kakadiaris and Metaxas, 1996] and locating faces in images [Sung and Poggio, 1994, Rowley et al., 1996]. Intille and Bobick [1995] describe methods of tracking humans through episodes of mutual occlusion in a highly structured environment. We do not currently make use of these techniques in live experiments because of their computational cost. However, we expect that this type of analysis will eventually be an important part of smart camera processing.
`
2.2 Event Recognition

Most work on event recognition has focussed on events that consist of a well-defined sequence of primitive motions. This class of events can be converted into spatiotemporal patterns and recognized using statistical pattern matching techniques. A number of researchers have demonstrated algorithms for recognizing gestures and sign language [e.g., Starner and Pentland, 1995]. Bobick and Davis [1996] describe a method of recognizing ste-
`
`
`
`
`
`
`
`
Figure 1: Image processing steps for moving object detection (video frame, reference image, difference image, thresholded image).
`
reotypical motion patterns corresponding to actions such as sitting down, walking, or waving.

Our approach to event recognition is based on the video database indexing work of Courtney [1997], which introduced the use of predicates on the motion graph to represent events. Motion graphs are well suited to representing abstract, generic events such as "depositing an object" or "coming to rest", which are difficult to capture using the pattern-based approaches referred to above. On the other hand, pattern-based approaches can represent complex motions such as "throwing an object" or "waving", which would be difficult to express using motion graphs. It is likely that both pattern-based and abstract event recognition techniques will be needed to handle the full range of events that are of interest in surveillance applications.
`
3 AVS Tracking and Event Recognition Algorithms

This section describes the core technologies that provide the video surveillance and monitoring capabilities of the AVS system. There are three key technologies: moving object detection, visual tracking, and event recognition. The moving object detection routines determine when one or more objects enter a monitored scene, decide which pixels in a given video frame correspond to the moving objects versus which pixels correspond to the background, and form a simple representation of the object's image in the video frame. This representation is referred to as a motion region, and it exists in a single video frame, as distinguished from the world objects which exist in the world and give rise to the motion regions.

Visual tracking consists of determining correspondences between the motion regions over a sequence of video frames, and maintaining a single representation, or track, for the world object which gave rise to the sequence of motion regions in the sequence of frames. Finally, event recognition is a means of analyzing the collection of tracks in order to identify events of interest involving the world objects represented by the tracks.
`
3.1 Moving Object Detection

The moving object detection technology we employ is a 2D change detection technique similar to that described in Jain et al. [1979] and Yalamanchili et al. [1982]. Prior to activation of the monitoring system, an image of the background, i.e. an image of the scene which contains no moving or otherwise interesting objects, is captured to serve as the reference image. When the system is in operation, the absolute difference of the current video frame from the reference image is computed to produce a difference image. The difference image is then thresholded at an appropriate value to obtain a binary image in which the "off" pixels represent background pixels, and the "on" pixels represent moving object pixels. The four-connected components of moving object pixels in the thresholded image are the motion regions (see Figure 1).
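The steps above can be sketched in a few lines of pure Python. This is a minimal illustration on toy lists-of-lists images, not the AVS code; a real implementation runs on full-rate video with array hardware.

```python
# Sketch of the baseline detector: absolute differencing against a
# reference image, thresholding, then four-connected component
# labelling to extract the motion regions.

def motion_regions(frame, reference, threshold):
    h, w = len(frame), len(frame[0])
    on = [[abs(frame[y][x] - reference[y][x]) > threshold
           for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if on[y][x] and not seen[y][x]:
                # Grow one four-connected component.
                stack, pixels = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and on[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                regions.append(pixels)
    return regions
```

Each returned pixel list is one motion region; in practice each would be summarized by its centroid and bounding box for the later stages.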
`
Simple application of the object detection procedure outlined above results in a number of errors, largely due to the limitations of thresholding. If the threshold used is too low, camera noise and shadows will produce spurious objects; whereas if the threshold is too high, some portions of the objects in the scene will fail to be separated from the back-
`
`
`
ground, resulting in breakup, in which a single world object gives rise to several motion regions within a single frame. Our general approach is to allow breakup, but use grouping heuristics to merge multiple connected components into a single motion region and maintain a one-to-one correspondence between motion regions and world objects within each frame.
`
One grouping technique we employ is 2D morphological dilation of the motion regions. This enables the system to merge connected components separated by a few pixels, but using this technique to span large gaps results in a severe performance degradation. Moreover, dilation in the image space may result in incorrectly merging distinct objects which are nearby in the image (a few pixels), but are in fact separated by a large distance in the world (a few feet).

If size information is available, the connected component grouping algorithm makes use of an estimate of the size (in world coordinates) of the objects in the image. The bounding boxes of the connected components are expanded vertically and horizontally by a distance measured in feet (rather than pixels), and connected components with overlapping bounding boxes are merged into a single motion region. The technique for estimating the size of the objects in the image is described in section 3.4 below.
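The expand-and-merge heuristic might look like the following sketch. The (xmin, ymin, xmax, ymax) box representation and the fixed margin parameter are illustrative; in AVS the margin is derived from the world-coordinate size estimate of section 3.4.

```python
# Sketch of the grouping heuristic: components whose bounding boxes,
# expanded by a world-scale margin, overlap are merged into one
# motion region. Merging repeats until no further pair overlaps.

def overlaps(a, b, margin):
    """True when boxes a and b are within `margin` of each other."""
    return (a[0] - margin <= b[2] and b[0] - margin <= a[2] and
            a[1] - margin <= b[3] and b[1] - margin <= a[3])

def merge_boxes(boxes, margin):
    merged = [list(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlaps(merged[i], merged[j], margin):
                    a, b = merged[i], merged.pop(j)
                    merged[i] = [min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3])]
                    changed = True
                    break
            if changed:
                break
    return [tuple(b) for b in merged]
```

Because the margin is expressed in world units, two fragments of one person merge while two people standing apart do not, even if their image separations are similar.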
`
3.2 Tracking

The function of the AVS tracking routine is to establish correspondences between the motion regions in the current frame and those in the previous frame. We use the technique of Courtney [1997], which proceeds as follows. First assume that we have computed 2D velocity estimates for the motion regions in the previous frame. These velocity estimates, together with the locations of the centroids in the previous frame, are used to project the locations of the centroids of the motion regions into the current frame. Then, a mutual nearest-neighbor criterion is used to establish correspondences.

Let P be the set of motion region centroid locations in the previous frame, with p_i one such location. Let p'_i be the projected location of p_i in the current frame, and let P' be the set of all such projected locations in the current frame. Let C be the set of motion region centroid locations in the current frame. If the distance between p'_i and c_j in C is the smallest for all elements of C, and this distance is also the smallest of the distances between c_j and all elements of P' (i.e., p'_i and c_j are mutual nearest neighbors), then establish a correspondence between p_i and c_j by creating a bidirectional strong link between them. Use the difference in time and space between p_i and c_j to determine a velocity estimate for c_j, expressed in pixels per second. If there is an existing track containing p_i, add c_j to it. Otherwise, establish a new track, and add both p_i and c_j to it.
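The mutual nearest-neighbor step can be sketched as follows. This shows only the matching criterion; for simplicity velocities are measured in pixels per frame, and track bookkeeping and weak links are omitted.

```python
# Sketch of mutual-nearest-neighbour matching: previous centroids are
# projected forward by their velocity estimates, and a strong link is
# made only when the projected point and a current centroid pick each
# other as nearest neighbours.

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def strong_links(prev, current):
    """prev: list of (centroid, velocity) pairs from the last frame.
    current: list of centroids in this frame.
    Returns (i, j) index pairs that are mutual nearest neighbours."""
    if not prev or not current:
        return []
    proj = [(c[0] + v[0], c[1] + v[1]) for c, v in prev]
    links = []
    for i, p in enumerate(proj):
        j = min(range(len(current)), key=lambda j: dist2(p, current[j]))
        # Mutual check: is proj[i] also nearest to current[j]?
        i_back = min(range(len(proj)),
                     key=lambda k: dist2(proj[k], current[j]))
        if i_back == i:
            links.append((i, j))
    return links
```

Regions left unmatched by this criterion are the candidates for the weak links and event predicates described next.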
`
The strong links form the basis of the tracks with a high confidence of their correctness. Video objects which do not have mutual nearest neighbors in the adjacent frame may fail to form correspondences because the underlying world object is involved in an event (e.g., enter, exit, deposit, remove). In order to assist in the identification of these events, objects without strong links are given unidirectional weak links to their (non-mutual) nearest neighbors. The weak links represent potential ambiguity in the tracking process. The motion regions in all of the frames, together with their strong and weak links, form a motion graph.

Figure 2 depicts a sample motion graph. In the figure, each frame is one-dimensional, and is represented by a vertical line (F1 - F19). Circles represent objects in the scene, the dark arrows represent strong links, and the gray arrows represent weak links. An object enters the scene in frame F1, and then moves through the scene until frame F4, where it deposits a second object. The first object continues to move through the scene, and exits at frame F8. The deposited object remains stationary. At frame F5 another object enters the scene, temporarily occludes the stationary object at frame F10 (or is occluded by it), and then proceeds to move past the stationary object. This second moving object reverses direction around frames F13 and F14, continues on to remove the stationary object in frame F16, and finally exits in frame F19. An additional object enters in frame F3 and exits in frame F6 without interacting with any other object.
`
As indicated by the striped patterns in Figure 2, the correct correspondences for the tracks are am-
`
`
`
Figure 2: Event detection in the motion graph.
`
biguous after object interactions such as the occlusion in frame F10. The AVS system resolves this ambiguity where possible by preferring to match moving objects with moving objects, and stationary objects with stationary objects. The distinction between moving and stationary tracks is computed using thresholds on the velocity estimates, and hysteresis for stabilizing transitions between moving and stationary.

Following an occlusion (which may last for several frames), the frames immediately before and after the occlusion are compared (e.g., frames F9 and F11 in Figure 2). The AVS system examines each stationary object in the pre-occlusion frame, and searches for its correspondent in the post-occlusion frame (which should be exactly where it was before, since the object is stationary). This procedure resolves a large portion of the tracking ambiguities. General resolution of ambiguities resulting from multiple moving objects in the scene is a topic for further research. The AVS system may benefit from inclusion of a "closed-world tracking" facility such as that described by Intille and Bobick [1995a, 1995b].
`
3.3 Event Recognition

Certain features of tracks and pairs of tracks correspond to events. For example, the beginning of a track corresponds to an ENTER event, and the end corresponds to an EXIT event. In an on-line event detection system it is preferable to detect the event as near in time as possible to the actual occurrence of the event. The previous system which used motion graphs for event detection [Courtney, 1997] operated in a batch mode, and required multiple passes over the motion graph, precluding on-line operation. The AVS system detects events in a single pass over the motion graph, as the graph is created. However, in order to reduce errors due to noise, the AVS system introduces a slight delay of n frame times (n = 4 in the current implementation) before reporting certain events. For example, in Figure 2, an enter event occurs at frame F1. The AVS system requires the track to be maintained for n frames before reporting the enter event. If the track is not maintained for the required number of frames, it is ignored, and the enter event is not reported; e.g., if n > 4, the object in Figure 2 which enters in frame F3 and exits in frame F6 will not generate any events.

A track that splits into two tracks, one of which is moving, and the other of which is stationary, corresponds to a DEPOSIT event. If a moving track intersects a stationary track, and then continues to move, but the stationary track ends at the intersection, this corresponds to a REMOVE event. The remove event can be generated as soon as the remover disoccludes the location of the stationary object which was removed, and the system can determine that the stationary object is no longer at that location.
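Reduced to their simplest form, these two predicates might be sketched as below. The track records and field names are hypothetical, and the real system must additionally apply the n-frame delay and the moving/stationary hysteresis described earlier.

```python
# Illustrative predicates over simplified track records: a split into
# one moving and one stationary child maps to DEPOSIT; a moving track
# crossing a stationary track that ends at the crossing maps to REMOVE.

def classify_split(parent, child_a, child_b):
    """Return 'DEPOSIT' when a moving track splits into one moving
    track and one stationary track, else None."""
    flags = {child_a["moving"], child_b["moving"]}
    if parent["moving"] and flags == {True, False}:
        return "DEPOSIT"
    return None

def classify_intersection(mover, stationary, stationary_ends_here):
    """Return 'REMOVE' when a moving track passes a stationary track
    whose chain of observations ends at the intersection."""
    if mover["moving"] and not stationary["moving"] and stationary_ends_here:
        return "REMOVE"
    return None
```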
`
`
`
`
`
`
Figure 3: Establishing the image to map coordinate transformation.
`
In a manner similar to the occlusion situation described above in section 3.2, the deposit event also gives rise to ambiguity as to which object is the depositor, and which is the depositee. For example, it may have been that the object which entered at frame F1 of Figure 2 stopped at frame F4 and deposited a moving object, and it is the deposited object which then proceeded to exit the scene at F8. Again, the AVS system relies on the moving vs. stationary distinction to resolve the ambiguity, and insists that the depositee remain stationary after a deposit event. The AVS system requires both the depositor and the depositee tracks to extend for n frames past the point at which the tracks separate (e.g., past frame F8 in Figure 2), and that the deposited object remain stationary; otherwise no deposit event is generated.

Also detected (but not illustrated in Figure 2) are REST events (when a moving object comes to a stop), and MOVE events (when a RESTing object begins to move again). Finally, one further event that is detected is the LIGHTSOUT event, which occurs whenever a large change occurs over the entire image. The motion graph need not be consulted to detect this event.
`
3.4 Image to World Mapping

In order to locate objects seen in the image with respect to a map, it is necessary to establish a mapping between image and map coordinates. This mapping is established in the AVS system by having a user draw quadrilaterals on the horizontal surfaces visible in an image, and the corresponding quadrilaterals on a map, as shown in Figure 3. A warp transformation from image to map coordinates is constructed using the quadrilateral coordinates.

Once the transformations are established, the system can estimate the location of an object (as in Flinchbaugh and Bannon [1994]) by assuming that all objects rest on a horizontal surface. When an object is detected in the scene, the midpoint of the lowest side of the bounding box is used as the image point to project into the map window using the quadrilateral warp transformation [Wolberg, 1990].
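One plausible sketch of such a warp composes two square-to-quadrilateral projective transforms (a standard construction; cf. Wolberg [1990]): the image-side transform is inverted and chained with the map-side transform. Function names and corner ordering here are illustrative, not the AVS code.

```python
# Sketch: build the projective transform taking the unit square to each
# quadrilateral, invert the image-side one, and compose. Both quads
# must list corresponding corners in the same order.

def square_to_quad(q):
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = q
    sx, sy = x0 - x1 + x2 - x3, y0 - y1 + y2 - y3
    dx1, dy1 = x1 - x2, y1 - y2
    dx2, dy2 = x3 - x2, y3 - y2
    if sx == 0 and sy == 0:                      # affine special case
        g = h = 0.0
    else:
        den = dx1 * dy2 - dx2 * dy1
        g = (sx * dy2 - dx2 * sy) / den
        h = (dx1 * sy - sx * dy1) / den
    a, b, c = x1 - x0 + g * x1, x3 - x0 + h * x3, x0
    d, e, f = y1 - y0 + g * y1, y3 - y0 + h * y3, y0
    return [[a, b, c], [d, e, f], [g, h, 1.0]]

def invert3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def matmul3(m, n):
    return [[sum(m[i][k] * n[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def image_to_map(image_quad, map_quad):
    """Return a function mapping an image point to map coordinates."""
    hom = matmul3(square_to_quad(map_quad),
                  invert3(square_to_quad(image_quad)))
    def warp(pt):
        x, y = pt
        w = hom[2][0] * x + hom[2][1] * y + hom[2][2]
        return ((hom[0][0] * x + hom[0][1] * y + hom[0][2]) / w,
                (hom[1][0] * x + hom[1][1] * y + hom[1][2]) / w)
    return warp
```

Applying the returned function to the midpoint of the lowest side of an object's bounding box yields the map icon position described above.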
`
4 Applications

The AVS core algorithms described in section 3 have been used as the basis for several video surveillance applications. Section 4 describes three applications that we have implemented: situational awareness, best-view selection for activity logging, and environment learning.

4.1 Situational Awareness

The goal of the situational awareness application is to produce a real-time map-based display of the locations of people, objects and events in a monitored region, and to allow a user to specify alarm conditions interactively. Alarm conditions may be based on the locations of people and objects in the scene, the types of objects in the scene, the events in which the people and objects are in-
`
`
`
`
`
volved, and the times at which the events occur. Furthermore, the user can specify the action to take when an alarm is triggered, e.g. to generate an audio alarm or write a log file. For example, the user should be able to specify that an audio alarm should be triggered if a person deposits a briefcase on a given table between 5:00 pm and 7:00 am on a weeknight.
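Such a monitor amounts to a declarative filter over the event stream. The sketch below is an assumption about how it could be encoded, not the AVS interface; the field names and the briefcase example simply mirror the description above.

```python
# Illustrative monitor: a dict of conditions matched against event
# reports. Absent conditions match anything; the hours window may
# wrap past midnight, as in the 5:00 pm to 7:00 am example.

def matches(monitor, report):
    """True when an event report satisfies every condition the
    monitor specifies."""
    for key in ("event", "object", "region", "day"):
        allowed = monitor.get(key)
        if allowed is not None and report[key] not in allowed:
            return False
    start, end = monitor.get("hours", (0, 24))
    hour = report["hour"]
    if start <= end:
        return start <= hour < end
    return hour >= start or hour < end   # window wrapping past midnight

briefcase_alarm = {
    "event": {"DEPOSIT"},
    "object": {"briefcase"},
    "region": {"TableA"},
    "day": {"Mon", "Tue", "Wed", "Thu", "Fri"},
    "hours": (17, 7),   # 5:00 pm to 7:00 am
}
```

A shell holding a list of such monitors can evaluate each incoming report against all of them and fire the associated action on a match.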
`
Figure 5: User interface for specifying a monitor in AVS.
In order to determine the identities of objects (e.g., briefcase, notebook), the situational awareness system communicates with one or more object analysis modules (OAMs). The core engines capture snapshots of interesting objects in the scenes, and forward the snapshots to the OAM, along with the IDs of the tracks containing the objects. The OAM then processes the snapshot in order to determine the type of object. The OAM processing and the AVS core engine computations are asynchronous, so the core engine may have processed several more frames by the time the OAM completes its analysis. Once the analysis is complete, the OAM sends the results (an object type label) and the track ID back to the core engine. The core engine uses the track ID to associate the label with the correct object in the current frame (assuming the object has remained in the scene and been successfully tracked).
`
The architecture of the AVS situational awareness system is depicted in Figure 4. The system consists of one or more smart cameras communicating with a Video Surveillance Shell (VSS). Each camera has associated with it an independent AVS core engine that performs the processing described in section 3. That is, the engine finds and tracks moving objects in the scene, maps their image locations to world coordinates, and recognizes events involving the objects. Each core engine emits a stream of location and event reports to the VSS, which filters the incoming event streams for user-specified alarm conditions and takes the appropriate actions.

The VSS provides a map display of the monitored area, with the locations of the objects in the scene reported as icons on the map. The VSS also allows the user to specify alarm regions and conditions. Alarm regions are specified by drawing them on the map using a mouse, and naming them as desired. The user can then specify the conditions and actions for alarms by creating one or more monitors. Figure 5 depicts the monitor creation dialog box. The user names the monitor and uses the mouse to select check boxes associated with the conditions that will trigger the monitor. The user selects the type of event, the type of object involved in the event, the day of week and time of day of the event, where the event occurs, and what to do when the alarm condition occurs. The monitor specified in Figure 5 specifies that a voice alarm

Figure 4: The situational awareness system.
`
`
`
`
`
`
Figure 6: Tracking an object in the scene on the map.
`
will be sounded when a briefcase is deposited on TableA between 5:00 pm and 7:00 am on a weeknight. The voice alarms are customized to the event and object type, so that when this alarm is triggered, the system will announce "deposit box" via its audio output. Figure 6 shows a person about to trigger this alarm.
`
4.2 Best-View Selection for Activity Logging

In many video surveillance applications the goal of surveillance is not to detect events in real time and generate alarms, but rather to construct a log or audit trail of all of the activity that takes place in the camera's field of view. This log is examined by investigators after a security incident (e.g., a theft or terrorist attack), and is used to identify possible suspects or witnesses.

In order to gain experience with this type of application, we have used the tracking and event detection capabilities described in section 3 to construct a program that monitors and records the movements of humans in its field of view. For every person it sees, it creates a log file that summarizes important information about the person, including a snapshot taken when the person was close to the camera and (if possible) facing it. The log files are made available to authorized users via the World-Wide Web.
`
4.2.1 Architecture

The application makes use of the AVS core algorithms to detect and track people. Upon detection of a track corresponding to a person in the input, the tracker associates a data record with the track. The data record contains a summary of information about the person, including a snapshot extracted from the current video image. As the person is tracked through the scene, the tracker examines each image of that person that it receives. If the new image is a better view of the person than the previously saved snapshot, the snapshot is replaced with the new view. When the person leaves the scene, the data record is saved to a file.

Each log entry file records the time when the person entered the scene and a list of coordinate pairs showing their position in each video frame. Each log entry file also contains the snapshot that was stored in the track record for the person when they exited the scene. Because of the way snapshots are maintained, the final snapshot is the best view of the person that the system had during tracking. Finally, the log entry file contains a pointer to the reference image that was in effect when the snapshot was taken. This information forms an extremely concise description of the person's movements and appearance while they were in the scene.
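A log entry of this kind might be represented as below; the field names and the JSON encoding are illustrative assumptions, not the actual AVS file format.

```python
# Illustrative per-person log entry with the fields described above.
import json

def make_log_entry(enter_time, positions, snapshot_file, reference_file):
    return {
        "entered": enter_time,              # when the person appeared
        "trajectory": positions,            # (x, y) per video frame
        "best_snapshot": snapshot_file,     # best view saved during tracking
        "reference_image": reference_file,  # background in effect at snapshot
    }

def save_log_entry(entry, path):
    """Write one log entry to disk for later browsing."""
    with open(path, "w") as f:
        json.dump(entry, f)
```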
`
Selecting the best view: The system uses simple heuristics to decide when the current view of a per-
`
`
`
`
`
`
`
`
`
Figure 7: Floor plan of area used for hallway monitoring experiments. Camera is located at right and monitors the hallway and printer alcove.
`
son is better than the previously saved view. First, the new view is considered better if the subject is moving toward the camera in the current frame, and was moving away in the previously saved view. This causes the system to favor views in which the subject's face is visible. If this rule does
`n