`
`Robert T. Collins, Alan J. Lipton, Takeo Kanade,
`Hironobu Fujiyoshi, David Duggins, Yanghai Tsin,
`David Tolliver, Nobuyoshi Enomoto, Osamu Hasegawa,
`Peter Burt1 and Lambert Wixson1
`
`CMU-RI-TR-00-12
`
`The Robotics Institute, Carnegie Mellon University, Pittsburgh PA
`1 The Sarnoff Corporation, Princeton, NJ
`
`Abstract
`
`Under the three-year Video Surveillance and Monitoring (VSAM) project (1997–1999), the
`Robotics Institute at Carnegie Mellon University (CMU) and the Sarnoff Corporation devel-
`oped a system for autonomous Video Surveillance and Monitoring. The technical approach
`uses multiple, cooperative video sensors to provide continuous coverage of people and vehi-
`cles in a cluttered environment. This final report presents an overview of the system, and of
`the technical accomplishments that have been achieved.
`
© 2000 Carnegie Mellon University
`
This work was funded by the DARPA Image Understanding program under contract DAAB07-97-C-J031, and by the
`Office of Naval Research under grant N00014-99-1-0646.
`
`1/69
`
`DOJ EX. 1025
`
`
`
`1 Introduction
`
`The thrust of CMU research under the DARPA Video Surveillance and Monitoring (VSAM)
`project is cooperative multi-sensor surveillance to support battlefield awareness [17]. Under our
`VSAM Integrated Feasibility Demonstration (IFD) contract, we have developed automated video
`understanding technology that enables a single human operator to monitor activities over a com-
`plex area using a distributed network of active video sensors. The goal is to automatically collect
`and disseminate real-time information from the battlefield to improve the situational awareness of
`commanders and staff. Other military and federal law enforcement applications include providing
`perimeter security for troops, monitoring peace treaties or refugee movements from unmanned air
`vehicles, providing security for embassies or airports, and staking out suspected drug or terrorist
`hide-outs by collecting time-stamped pictures of everyone entering and exiting the building.
`
`Automated video surveillance is an important research area in the commercial sector as well.
`Technology has reached a stage where mounting cameras to capture video imagery is cheap, but
`finding available human resources to sit and watch that imagery is expensive. Surveillance cameras
`are already prevalent in commercial establishments, with camera output being recorded to tapes
`that are either rewritten periodically or stored in video archives. After a crime occurs – a store
`is robbed or a car is stolen – investigators can go back after the fact to see what happened, but of
`course by then it is too late. What is needed is continuous 24-hour monitoring and analysis of video
`surveillance data to alert security officers to a burglary in progress, or to a suspicious individual
`loitering in the parking lot, while options are still open for avoiding the crime.
`
`Keeping track of people, vehicles, and their interactions in an urban or battlefield environment
`is a difficult task. The role of VSAM video understanding technology in achieving this goal is to
`automatically “parse” people and vehicles from raw video, determine their geolocations, and insert
`them into a dynamic scene visualization. We have developed robust routines for detecting and
`tracking moving objects. Detected objects are classified into semantic categories such as human,
`human group, car, and truck using shape and color analysis, and these labels are used to improve
`tracking using temporal consistency constraints. Further classification of human activity, such as
`walking and running, has also been achieved. Geolocations of labeled entities are determined from
`their image coordinates using either wide-baseline stereo from two or more overlapping camera
`views, or intersection of viewing rays with a terrain model from monocular views. These computed
`locations feed into a higher level tracking module that tasks multiple sensors with variable pan, tilt
`and zoom to cooperatively and continuously track an object through the scene. All resulting object
`hypotheses from all sensors are transmitted as symbolic data packets back to a central operator
`control unit, where they are displayed on a graphical user interface to give a broad overview of
`scene activities. These technologies have been demonstrated through a series of yearly demos,
`using a testbed system developed on the urban campus of CMU.
`
`This is the final report on the three-year VSAM IFD research program. The emphasis is on
`recent results that have not yet been published. Older work that has already appeared in print is
briefly summarized, with references to the relevant technical papers. This report is organized as
`follows. Section 2 contains a description of the VSAM IFD testbed system, developed as a testing
`ground for new video surveillance research. Section 3 describes the basic video understanding
`algorithms that have been demonstrated, including moving object detection, tracking, classifica-
`tion, and simple activity recognition. Section 4 discusses the use of geospatial site models to aid
`video surveillance processing, including calibrating a network of sensors with respect to the model
`coordinate system, computation of 3D geolocation estimates, and graphical display of object hy-
`potheses within a distributed simulation. Section 5 discusses coordination of multiple cameras to
`achieve cooperative object tracking. Section 6 briefly lists the milestones achieved through three
`VSAM demos that were performed in Pittsburgh, the first at the rural Bushy Run site, and the
`second and third held on the urban CMU campus, and concludes with plans for future research.
`The appendix contains published technical papers from the CMU VSAM research group.
`
`2 VSAM Testbed System
`
`We have built a VSAM testbed system to demonstrate how automated video understanding tech-
`nology described in the following sections can be combined into a coherent surveillance system
`that enables a single human operator to monitor a wide area. The testbed system consists of multi-
`ple sensors distributed across the campus of CMU, tied to a control room (Figure 1a) located in the
`Planetary Robotics Building (PRB). The testbed consists of a central operator control unit (OCU)
`
Figure 1: a) Control room of the VSAM testbed system on the campus of Carnegie Mellon University. b) Close-up of the main rack.
`
`which receives video and Ethernet data from multiple remote sensor processing units (SPUs) (see
`Figure 2). The OCU is responsible for integrating symbolic object trajectory information accu-
`mulated by each of the SPUs together with a 3D geometric site model, and presenting the results
`to the user on a map-based graphical user interface (GUI). Each logical component of the testbed
`system architecture is described briefly below.
`
`
Figure 2: Schematic overview of the VSAM testbed system.
`
`2.1 Sensor Processing Units (SPUs)
`
`The SPU acts as an intelligent filter between a camera and the VSAM network. Its function is to
`analyze video imagery for the presence of significant entities or events, and to transmit that infor-
`mation symbolically to the OCU. This arrangement allows for many different sensor modalities
`to be seamlessly integrated into the system. Furthermore, performing as much video processing
as possible on the SPU reduces the bandwidth requirements of the VSAM network: full video
signals do not need to be transmitted, only the symbolic data extracted from them.
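
As a concrete illustration of what such symbolic data might look like, the following is a minimal Python sketch of the kind of per-object report an SPU might transmit in place of full video. The field names and example values are illustrative assumptions only; they do not correspond to the actual CMUPA packet format described in Section 2.4.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class ObjectReport:
    """Illustrative per-object record an SPU could send to the OCU."""
    sensor_id: int                           # which SPU produced the observation
    object_id: int                           # track identifier local to that SPU
    timestamp: float                         # time of observation, seconds
    class_label: str                         # e.g. "human", "human group", "car", "truck"
    image_bbox: Tuple[int, int, int, int]    # x, y, width, height in image coordinates
    geolocation: Tuple[float, float, float]  # estimated 3D position in site-model coordinates

def encode(report: ObjectReport) -> bytes:
    """A few dozen bytes per object, versus megabits per second of raw video."""
    text = "%d,%d,%.3f,%s,%s,%s" % (
        report.sensor_id, report.object_id, report.timestamp,
        report.class_label, report.image_bbox, report.geolocation)
    return text.encode("ascii")

# Illustrative values only.
report = ObjectReport(3, 17, 1042.5, "human", (120, 48, 24, 60), (589412.0, 4477530.0, 268.0))
print(len(encode(report)), "bytes instead of a full video frame")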
`
`The VSAM testbed can handle a wide variety of sensor and SPU types (Figure 3). The list of
`IFD sensor types includes: color CCD cameras with active pan, tilt and zoom control; fixed field
`of view monochromatic low-light cameras; and thermal sensors. Logically, each SPU combines a
`camera with a local computer that processes the incoming video. However, for convenience, most
`video signals in the testbed system are sent via fiber optic cable to computers located in a rack
`in the control room (Figure 1b). The exceptions are SPU platforms that move: a van-mounted
`relocatable SPU; an SUO portable SPU; and an airborne SPU. Computing power for these SPUs is
`on-board, with results being sent to the OCU over relatively low-bandwidth wireless Ethernet links.
`In addition to the IFD in-house SPUs, two Focussed Research Effort (FRE) sensor packages have
`been integrated into the system: a Columbia-Lehigh CycloVision ParaCamera with a hemispher-
`ical field of view; and a Texas Instruments indoor surveillance system. By using a pre-specified
`communication protocol (see Section 2.4), these FRE systems were able to directly interface with
`the VSAM network. Indeed, within the logical system architecture, all SPUs are treated identi-
`cally. The only difference is at the hardware level where different physical connections (e.g. cable
`or wireless Ethernet) may be required to connect to the OCU.
`
`The relocatable van and airborne SPU warrant further discussion. The relocatable van SPU
`consists of a sensor and pan-tilt head mounted on a small tripod that can be placed on the vehicle
`roof when stationary. All video processing is performed on-board the vehicle, and results from
`object detection and tracking are assembled into symbolic data packets and transmitted back to
`the operator control workstation using a radio Ethernet connection. The major research issue
in demonstrating the redeployable van unit is how to rapidly calibrate sensor pose
`after redeployment, so that object detection and tracking results can be integrated into the VSAM
`network (via computation of geolocation) for display at the operator control console.
`
`
`Figure 3: Many types of sensors and SPUs have been incorporated into the VSAM IFD testbed
`system: a) color PTZ; b) thermal; c) relocatable van; d) airborne. In addition, two FRE sensors
`have been successfully integrated: e) Columbia-Lehigh omnicamera; f) Texas Instruments indoor
`activity monitoring system.
`
`
`The airborne sensor and computation packages are mounted on a Britten-Norman Islander
`twin-engine aircraft operated by the U.S. Army Night Vision and Electronic Sensors Directorate.
`The Islander is equipped with a FLIR Systems Ultra-3000 turret that has two degrees of freedom
`(pan/tilt), a Global Positioning System (GPS) for measuring position, and an Attitude Heading
`Reference System (AHRS) for measuring orientation. The continual self-motion of the aircraft
`introduces challenging video understanding issues. For this reason, video processing is performed
`using the Sarnoff PVT-200, a specially designed video processing engine.
`
`2.2 Operator Control Unit (OCU)
`
`Figure 4 shows the functional architecture of the VSAM OCU. It accepts video processing results
`from each of the SPUs and integrates the information with a site model and a database of known
`objects to infer activities that are of interest to the user. This data is sent to the GUI and other
`visualization tools as output from the system.
`
Figure 4: Functional architecture of the VSAM OCU.
`
`One key piece of system functionality provided by the OCU is sensor arbitration. Care must
`be taken to ensure that an outdoor surveillance system does not underutilize its limited sensor
`assets. Sensors must be allocated to surveillance tasks in such a way that all user-specified tasks
`get performed, and, if enough sensors are present, multiple sensors are assigned to track important
`objects. At any given time, the OCU maintains a list of known objects and sensor parameters, as
`well as a set of “tasks” that may need attention. These tasks are explicitly indicated by the user
`through the GUI, and may include specific objects to be tracked, specific regions to be watched,
or specific events to be detected (such as a person loitering near a particular doorway). Sensor
`arbitration is performed by an arbitration cost function. The arbitration function determines the
`cost of assigning each of the SPUs to each of the tasks. These costs are based on the priority of
`the tasks, the load on the SPU, and visibility of the objects from a particular sensor. The system
performs a greedy optimization of the cost to determine the combination of SPU tasking that best
satisfies overall system performance requirements.
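
The report does not spell out the exact form of the arbitration cost function, so the following Python sketch is illustrative only: it combines the three cost terms named above (task priority, SPU load, and object visibility) with assumed weights, and assigns tasks greedily to the currently cheapest SPU.

def assignment_cost(task, spu, weights=(1.0, 1.0, 1.0)):
    """Lower cost = better match. Terms mirror those named in the text:
    task priority, current load on the SPU, and visibility from that sensor."""
    w_pri, w_load, w_vis = weights
    priority_term = w_pri / max(task["priority"], 1)      # high-priority tasks cost less to serve
    load_term = w_load * spu["load"]                       # busy SPUs cost more
    visibility_term = w_vis * (0.0 if spu["id"] in task["visible_from"] else 1e6)
    return priority_term + load_term + visibility_term

def greedy_arbitration(tasks, spus):
    """Assign each task to the cheapest SPU, updating that SPU's load as we go."""
    assignments = {}
    for task in sorted(tasks, key=lambda t: -t["priority"]):   # serve important tasks first
        best = min(spus, key=lambda s: assignment_cost(task, s))
        assignments[task["name"]] = best["id"]
        best["load"] += 1                                       # account for the new tasking
    return assignments

# Example: two SPUs, two user-specified tasks (all names and numbers illustrative).
spus = [{"id": "SPU-1", "load": 0}, {"id": "SPU-2", "load": 1}]
tasks = [
    {"name": "track person near doorway", "priority": 3, "visible_from": {"SPU-1", "SPU-2"}},
    {"name": "watch parking lot region",  "priority": 1, "visible_from": {"SPU-2"}},
]
print(greedy_arbitration(tasks, spus))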
`
`The OCU also contains a site model representing VSAM-relevant information about the area
`being monitored. The site model representation is optimized to efficiently support the following
`VSAM capabilities:
`
 object geolocation via intersection of viewing rays with the terrain (a minimal sketch of this computation follows the list).
`
` visibility analysis (predicting what portions of the scene are visible from what sensors) so
`that sensors can be efficiently tasked.
`
` specification of the geometric location and extent of relevant scene features. For example,
`we might directly task a sensor to monitor the door of a building, or to look for vehicles
`passing through a particular intersection.
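
As a rough illustration of the first capability above, the sketch below geolocates an object by marching a viewing ray from the camera until it crosses the terrain surface. The function names, the elevation query terrain_height(x, y), and the step-marching scheme are assumptions for illustration only, not the testbed's actual implementation; geolocation is discussed further in Section 4.

import numpy as np

def geolocate_by_ray_terrain(camera_pos, ray_dir, terrain_height, step=0.5, max_range=2000.0):
    """March along the viewing ray until it drops below the terrain, then return
    the approximate crossing point, or None if no intersection is found in range."""
    camera_pos = np.asarray(camera_pos, dtype=float)
    ray_dir = np.asarray(ray_dir, dtype=float)
    ray_dir = ray_dir / np.linalg.norm(ray_dir)
    prev = camera_pos.copy()
    t = step
    while t < max_range:
        point = camera_pos + t * ray_dir
        if point[2] <= terrain_height(point[0], point[1]):
            # Interpolate between the last point above ground and the first below it.
            return 0.5 * (prev + point)
        prev = point
        t += step
    return None

# Example with a flat terrain at elevation 0: camera 10 m up, looking down at 45 degrees.
flat = lambda x, y: 0.0
print(geolocate_by_ray_terrain([0.0, 0.0, 10.0], [1.0, 0.0, -1.0], flat))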
`
`2.3 Graphical User Interface (GUI)
`
Figure 5: a) Operator console located in the control room. Also shown is a laptop-based portable operator console. b) Close-up view of the visualization node display screen.
`
`One of the technical goals of the VSAM project is to demonstrate that a single human operator
`can effectively monitor a significant area of interest. Keeping track of multiple people, vehicles,
`and their interactions, within a complex urban environment is a difficult task. The user obviously
`shouldn’t be looking at two dozen screens showing raw video output. That amount of sensory
`overload virtually guarantees that information will be ignored, and requires a prohibitive amount
`of transmission bandwidth. Our approach is to provide an interactive, graphical user interface
(GUI) that uses VSAM technology to automatically place dynamic agents representing people and
`vehicles into a synthetic view of the environment (Figure 5). This approach has the benefit that
`visualization of scene events is no longer tied to the original resolution and viewpoint of a single
`video sensor. The GUI currently consists of a map of the area, overlaid with all object locations,
`sensor platform locations, and sensor fields of view (Figure 5b). In addition, a low-bandwidth,
`compressed video stream from one of the sensors can be selected for real-time display.
`
`The GUI is also used for sensor suite tasking. Through this interface, the operator can task
`individual sensor units, as well as the entire testbed sensor suite, to perform surveillance operations
`such as generating a quick summary of all object activities in the area. The lower left corner of
the control window contains a set of controls organized as tabbed selections. This allows the
user to move fluidly between the controls corresponding to the entity types Objects, Sensors,
`and Regions of Interest.
`
` Object Controls. Track directs the system to begin actively tracking the current object.
`Stop Tracking terminates all active tracking tasks in the system. Trajectory displays the
`trajectory of selected objects. Error displays geolocation error bounds on the locations and
`trajectories of selected objects.
`
` Sensor Controls. Show FOV displays sensor fields of view on the map, otherwise only a
`position marker is drawn. Move triggers an interaction allowing the user to control the pan
`and tilt angle of the sensor. Request Imagery requests either a continuous stream or single
`image from the currently selected sensor, and Stop Imagery terminates the current imagery
`stream.
`
 ROI Controls. This panel contains all the controls associated with Regions of Interest (ROIs)
`in the system. ROIs are tasks that focus sensor resources at specific areas in the session
`space. Create triggers the creation of a ROI, specified interactively by the user as a polygon
`of boundary points. The user also selects from a set of object types (e.g. human, vehicle)
`that will trigger events in this ROI, and from a set of event types (e.g. enter, pass through,
`stop in) that are considered to be trigger events in the ROI.
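
A minimal sketch of how such an ROI might be represented and checked against incoming object reports is given below; the data structure and the ray-casting point-in-polygon test are illustrative assumptions rather than the testbed's actual implementation.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RegionOfInterest:
    polygon: List[Tuple[float, float]]   # boundary points in map coordinates
    object_types: set                    # e.g. {"human", "vehicle"}
    event_types: set                     # e.g. {"enter", "pass through", "stop in"}

def point_in_polygon(pt, polygon):
    """Standard ray-casting test: count boundary crossings of a horizontal ray."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def roi_trigger(roi, object_type, event_type, location):
    """True if an object of a watched type generates a watched event inside the ROI."""
    return (object_type in roi.object_types
            and event_type in roi.event_types
            and point_in_polygon(location, roi.polygon))

# Illustrative ROI around a doorway, watching for humans entering.
doorway = RegionOfInterest([(0, 0), (10, 0), (10, 5), (0, 5)], {"human"}, {"enter"})
print(roi_trigger(doorway, "human", "enter", (4.0, 2.5)))   # True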
`
`2.4 Communication
`
`The nominal architecture for the VSAM network allows multiple OCUs to be linked together, each
controlling multiple SPUs (Figure 6). Each OCU supports exactly one GUI through which all
user-related command and control information is passed. Data dissemination is not limited to a single
`user interface, however, but is also accessible through a series of visualization nodes (VIS).
`
`There are two independent communication protocols and packet structures supported in this
`architecture: the Carnegie Mellon University Packet Architecture (CMUPA) and the Distributed
Interactive Simulation (DIS) protocols. The CMUPA is designed to be a low-bandwidth, highly
`flexible architecture in which relevant VSAM information can be compactly packaged without
`
`
Figure 6: A nominal architecture for expandable VSAM networks.
`
Figure 7: CMUPA packet structure. A bitmask in the header describes which sections are present. Within each section, multiple data blocks can be present. Within each data block, bitmasks describe what information is present.
`
`redundant overhead. The concept of the CMUPA packet architecture is a hierarchical decompo-
`sition. There are six data sections that can be encoded into a packet: command; sensor; image;
`object; event; and region of interest. A short packet header section describes which of these six
`sections are present in the packet. Within each section it is possible to represent multiple instances
`of that type of data, with each instance potentially containing a different layout of information.
`At each level, short bitmasks are used to describe the contents of the various blocks within the
`packets, keeping wasted space to a minimum. All communication between SPUs, OCUs and
`GUIs is CMUPA compatible. The CMUPA protocol specification document is accessible from
`http://www.cs.cmu.edu/vsam.
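
To illustrate the bitmask idea only (this is not the actual CMUPA byte layout, which is defined in the protocol specification), the Python sketch below packs a one-byte header whose six low bits flag which sections are present, followed by length-prefixed blocks for just the sections actually used.

# Illustrative only: one header byte, bits 0-5 flag the six CMUPA section types.
SECTIONS = ["command", "sensor", "image", "object", "event", "roi"]

def pack_packet(sections):
    """sections: dict mapping a section name to its already-encoded bytes."""
    header = 0
    body = b""
    for bit, name in enumerate(SECTIONS):
        if name in sections:
            header |= (1 << bit)                                # mark the section as present
            payload = sections[name]
            body += len(payload).to_bytes(2, "big") + payload   # length-prefixed block
    return bytes([header]) + body

def unpack_packet(packet):
    header, offset, out = packet[0], 1, {}
    for bit, name in enumerate(SECTIONS):
        if header & (1 << bit):                                 # section flagged in the header bitmask
            length = int.from_bytes(packet[offset:offset + 2], "big")
            out[name] = packet[offset + 2:offset + 2 + length]
            offset += 2 + length
    return out

# A packet carrying only an object block and an event block wastes no space on the others.
pkt = pack_packet({"object": b"id=7;class=human", "event": b"enter ROI 3"})
print(unpack_packet(pkt))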
`
`VIS nodes are designed to distribute the output of the VSAM network to where it is needed.
`They provide symbolic representations of detected activities overlaid on maps or imagery. Infor-
`mation flow to VIS nodes is unidirectional, originating from an OCU. All of this communication
uses the DIS protocol, which is described in detail in [15]. An important benefit of keeping VIS
`nodes DIS compatible is that it allows us to easily interface with synthetic environment visualiza-
`tion tools such as ModSAF and ModStealth (Section 4.4).
`
`
`2.5 Current Testbed Infrastructure
`
`This section describes the VSAM testbed on the campus of Carnegie Mellon University, as of Fall
`1999 (see Figure 8). The VSAM infrastructure consists of 14 cameras distributed throughout cam-
`pus. All cameras are connected to the VSAM Operator Control Room in the Planetary Robotics
`Building (PRB): ten are connected via fiber optic lines, three on PRB are wired directly to the
`SPU computers, and one is a portable Small Unit Operations (SUO) unit connected via wireless
`Ethernet to the VSAM OCU. The work done for VSAM 99 concentrated on increasing the density
`of sensors in the Wean/PRB area. The overlapping fields of view (FOVs) in this area of campus
`enable us to conduct experiments in wide baseline stereo, object fusion, sensor cuing and sensor
`handoff.
`
Figure 8: Placement of color and monochrome cameras in current VSAM testbed system. Not shown are two additional cameras, a FLIR and the SUO portable system, which are moved to different places as needed.
`
`The backbone of the CMU campus VSAM system consists of six Sony EVI-370 color zoom
`cameras installed on PRB, Smith Hall, Newell-Simon Hall, Wean Hall, Roberts Hall, and Porter
`Hall. Five of these units are mounted on Directed Perception pan/tilt heads. The most recent
`camera, on Newell-Simon, is mounted on a Sagebrush Technologies pan/tilt head. This is a more
rugged outdoor mount, which is being evaluated for better performance and longer-term usage.
Two stationary fixed-FOV color cameras are mounted on the peak of PRB, on either side of the
`pan/tilt/zoom color camera located there. These PRB “left” and “right” sensors were added to
`facilitate work on activity analysis, classification, and sensor cuing. Three stationary fixed-FOV
`monochrome cameras are mounted on the roof of Wean Hall in close proximity to one of the
`pan/tilt/zoom color cameras. These are connected to the Operator Control Room over a single
`multimode fiber using a video multiplexor. The monochrome cameras have a vertical resolution
`of 570 TV lines and perform fairly well at night with the available street lighting. A mounting
`bracket has also been installed next to these cameras for the temporary installation of a Raytheon
`NightSight thermal (FLIR) sensor. A fourth stationary fixed FOV monochrome camera is mounted
`on PRB pointing at the back stairwell. A SUO portable unit was built to allow further software
`development and research at CMU in support of the SUO program. This unit consists of the same
`hardware as the SPUs that were delivered to Fort Benning, Georgia in November, 1999.
`
`The Operator Control Room in PRB houses the SPU, OCU, GUI and development work-
`stations – nineteen computers in total. The four most recent SPUs are Pentium III 550 MHz
`computers. Dagwood, a single “compound SPU”, is a quad Xeon 550 MHz processor computer,
`purchased to conduct research on classification, activity analysis, and digitization of three simulta-
`neous video streams. Also included in this list of machines is a Silicon Graphics Origin 200, used
to develop video database storage and retrieval algorithms, as well as to design user interfaces for
handling VSAM video data.
`
`Two auto tracking Leica theodolites (TPS1100) are installed on the corner of PRB, and are
`hardwired to a data processing computer linked to the VSAM OCU. This system allows us to do
`real-time automatic tracking of objects to obtain ground truth for evaluating the VSAM geolocation
`and sensor fusion algorithms. This data can be displayed in real-time on the VSAM GUI.
`
`An Office of Naval Research DURIP grant provided funds for two Raytheon NightSight ther-
`mal sensors, the Quad Xeon processor computer, the Origin 200, an SGI Infinite Reality Engine
`and the Leica theodolite surveying systems.
`
`3 Video Understanding Technologies
`
`Keeping track of people, vehicles, and their interactions in a complex environment is a difficult
`task. The role of VSAM video understanding technology in achieving this goal is to automatically
`“parse” people and vehicles from raw video, determine their geolocations, and automatically insert
`them into a dynamic scene visualization. We have developed robust routines for detecting moving
`objects and tracking them through a video sequence using a combination of temporal differencing
`and template tracking. Detected objects are classified into semantic categories such as human,
`human group, car, and truck using shape and color analysis, and these labels are used to improve
`tracking using temporal consistency constraints. Further classification of human activity, such as
`walking and running, has also been achieved. Geolocations of labeled entities are determined from
`their image coordinates using either wide-baseline stereo from two or more overlapping camera
views, or intersection of viewing rays with a terrain model from monocular views. The computed
`geolocations are used to provide higher-level tracking capabilities, such as tasking multiple sensors
`with variable pan, tilt and zoom to cooperatively track an object through the scene. Results are
displayed to the user in real-time on the GUI, and are also archived in a web-based object/event
database.
`
`3.1 Moving Object Detection
`
`Detection of moving objects in video streams is known to be a significant, and difficult, research
`problem [26]. Aside from the intrinsic usefulness of being able to segment video streams into
`moving and background components, detecting moving blobs provides a focus of attention for
`recognition, classification, and activity analysis, making these later processes more efficient since
`only “moving” pixels need be considered.
`
`There are three conventional approaches to moving object detection: temporal differencing
`[1]; background subtraction [13, 29]; and optical flow (see [3] for an excellent discussion). Tem-
`poral differencing is very adaptive to dynamic environments, but generally does a poor job of
`extracting all relevant feature pixels. Background subtraction provides the most complete feature
`data, but is extremely sensitive to dynamic scene changes due to lighting and extraneous events.
`Optical flow can be used to detect independently moving objects in the presence of camera mo-
`tion; however, most optical flow computation methods are computationally complex, and cannot
`be applied to full-frame video streams in real-time without specialized hardware.
`
`Under the VSAM program, CMU has developed and implemented three methods for mov-
`ing object detection on the VSAM testbed. The first is a combination of adaptive background
`subtraction and three-frame differencing (Section 3.1.1). This hybrid algorithm is very fast, and
`surprisingly effective – indeed, it is the primary algorithm used by the majority of the SPUs in
`the VSAM system. In addition, two new prototype algorithms have been developed to address
`shortcomings of this standard approach. First, a mechanism for maintaining temporal object layers
is developed to allow greater disambiguation of moving objects that stop for a while, are occluded
by other objects, and then resume motion (Section 3.1.2). One limitation that affects both
this method and the standard algorithm is that they only work for static cameras, or in a “step-
and-stare” mode for pan-tilt cameras. To overcome this limitation, a second extension has been
`developed to allow background subtraction from a continuously panning and tilting camera (Sec-
`tion 3.1.3). Through clever accumulation of image evidence, this algorithm can be implemented
`in real-time on a conventional PC platform. A fourth approach to moving object detection from
`a moving airborne platform has also been developed, under a subcontract to the Sarnoff Corpora-
`tion. This approach is based on image stabilization using special video processing hardware. It is
`described later, in Section 3.6.
`
`
Figure 9: Problems with standard MTD algorithms. (a) Background subtraction leaves “holes” when stationary objects move. (b) Frame differencing does not detect the entire object.
`
`3.1.1 A Hybrid Algorithm for Moving Object Detection
`
`We have developed a hybrid algorithm for detecting moving objects, by combining an adaptive
`background subtraction technique[18] with a three-frame differencing algorithm. As discussed in
`[26], the major drawback of adaptive background subtraction is that it makes no allowances for
`stationary objects in the scene that start to move. Although these are usually detected, they leave
`behind “holes” where the newly exposed background imagery differs from the known background
`model (see Figure 9a). While the background model eventually adapts to these “holes”, they gen-
erate false alarms for a short period of time. Frame differencing is not subject to this phenomenon;
`however, it is generally not an effective method for extracting the entire shape of a moving object
`(Figure 9b). To overcome these problems, we have combined the two methods. A three-frame dif-
`ferencing operation is performed to determine regions of legitimate motion, followed by adaptive
`background subtraction to extract the entire moving region.
`
Consider a video stream from a stationary (or stabilized) camera. Let $I_n(x)$ represent the
intensity value at pixel position $x$, at time $t = n$. The three-frame differencing rule suggests that
a pixel is legitimately moving if its intensity has changed significantly between both the current
image and the last frame, and the current image and the next-to-last frame. That is, a pixel $x$ is
moving if

$$ |I_n(x) - I_{n-1}(x)| > T_n(x) \quad \mathrm{and} \quad |I_n(x) - I_{n-2}(x)| > T_n(x) $$

where $T_n(x)$ is a threshold describing a statistically significant intensity change at pixel position $x$
(described below). The main problem with frame differencing is that pixels interior to an object
with uniform intensity are not included in the set of “moving” pixels. However, after clustering
moving pixels into a connected region, interior pixels can be filled in by applying adaptive back-
ground subtraction to extract all of the “moving” pixels within the region’s bounding box $R$. Let
$B_n(x)$ represent the current background intensity value at pixel $x$, learned by observation over time.
Then the blob $b_n$ can be filled out by taking all the pixels in $R$ that are significantly different from
the background model $B_n$. That is,

$$ b_n = \{\, x : |I_n(x) - B_n(x)| > T_n(x),\ x \in R \,\} $$
`
`
Both the background model $B_n(x)$ and the difference threshold $T_n(x)$ are statistical properties
of the pixel intensities observed from the sequence of images $\{I_k(x)\}$ for $k < n$. $B_0(x)$ is initially
set to the first image, $B_0(x) = I_0(x)$, and $T_0(x)$ is initially set to some pre-determined, non-zero
value. $B(x)$ and $T(x)$ are then updated over time as:

$$ B_{n+1}(x) = \begin{cases} \alpha B_n(x) + (1-\alpha)\, I_n(x), & x \text{ is non-moving} \\ B_n(x), & x \text{ is moving} \end{cases} $$

$$ T_{n+1}(x) = \begin{cases} \alpha T_n(x) + (1-\alpha)\,\bigl(5 \times |I_n(x) - B_n(x)|\bigr), & x \text{ is non-moving} \\ T_n(x), & x \text{ is moving} \end{cases} $$

where $\alpha$ is a time constant that specifies how fast new information supplants old observations.
Note that each value is only changed for pixels that are determined to be non-moving, i.e. part of
the stationary background. If each non-moving pixel position is considered as a time series, $B_n(x)$
is analogous to a local temporal average of intensity values, and $T_n(x)$ is analogous to 5 times the
local temporal standard deviation of intensity, both computed using an infinite impulse response
(IIR) filter. Figure 10 shows a result of this detection algorithm for one frame.
`
`
`
Figure 10: Result of the detection algorithm. (a) Original image. (b) Detected motion regions.
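
A minimal sketch of one step of this hybrid update, written with NumPy for grayscale frames, is given below. The time constant and initial threshold are illustrative choices, and connected-component clustering is simplified to a single bounding box around all moving pixels; in the algorithm as described above, the fill-in step would be applied per connected region.

import numpy as np

ALPHA = 0.95        # illustrative IIR time constant
T_INIT = 20.0       # illustrative initial per-pixel threshold

def hybrid_detect(I_n, I_n1, I_n2, B, T):
    """One step of three-frame differencing plus adaptive background subtraction.
    I_n, I_n1, I_n2: current, previous, and next-to-last grayscale frames (float arrays).
    B, T: background model and per-pixel threshold; re-estimated and returned."""
    # Three-frame differencing: a pixel moves if it changed w.r.t. both earlier frames.
    moving = (np.abs(I_n - I_n1) > T) & (np.abs(I_n - I_n2) > T)

    # Fill out the blob: within the bounding box of the moving pixels, take every pixel
    # that differs significantly from the background model.
    blob = np.zeros_like(moving)
    ys, xs = np.nonzero(moving)
    if len(ys) > 0:
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        blob[y0:y1, x0:x1] = np.abs(I_n[y0:y1, x0:x1] - B[y0:y1, x0:x1]) > T[y0:y1, x0:x1]

    # IIR updates of B and T, applied only at non-moving (background) pixels.
    stationary = ~moving
    diff = np.abs(I_n - B)
    B = np.where(stationary, ALPHA * B + (1 - ALPHA) * I_n, B)
    T = np.where(stationary, ALPHA * T + (1 - ALPHA) * 5.0 * diff, T)
    return blob, B, T

# Usage: initialize B to the first frame and T to T_INIT, then call once per new frame.
# B = first_frame.astype(float); T = np.full(first_frame.shape, T_INIT)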
`
`3.1.2 Temporal Layers for Adaptive Background Subtraction
`
`A robust detection system should be able to recognize when objects have stopped and even dis-
`ambiguate overlapping objects — functions usually not possible with traditional motion detection
`algorithms. An important aspect of this work derives from the observation that legitimately moving
`objects in a scene tend to cause much faster transitions than changes due to lighting, meteorolog-
`ical, and diurnal effects. This section describes a novel approach to object detection based on
`layered adaptive background subtraction.
`
`
`The Detection Algorithm
`
`Layered detection is based on two processes: pixel analysis and region analysis. The purpose of
`pixel analysis is to determine whether a pixel is stationary or transient by observing its intensity
`value over time. Region analysis deals with the agglomeration of gr