A System for Video Surveillance and Monitoring

Robert T. Collins, Alan J. Lipton, Takeo Kanade,
Hironobu Fujiyoshi, David Duggins, Yanghai Tsin,
David Tolliver, Nobuyoshi Enomoto, Osamu Hasegawa,
Peter Burt(1) and Lambert Wixson(1)

CMU-RI-TR-00-12

The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA
(1) The Sarnoff Corporation, Princeton, NJ
Abstract

Under the three-year Video Surveillance and Monitoring (VSAM) project (1997–1999), the Robotics Institute at Carnegie Mellon University (CMU) and the Sarnoff Corporation developed a system for autonomous Video Surveillance and Monitoring. The technical approach uses multiple, cooperative video sensors to provide continuous coverage of people and vehicles in a cluttered environment. This final report presents an overview of the system and of the technical accomplishments that have been achieved.

© 2000 Carnegie Mellon University

This work was funded by the DARPA Image Understanding program under contract DAAB07-97-C-J031, and by the Office of Naval Research under grant N00014-99-1-0646.
1 Introduction

The thrust of CMU research under the DARPA Video Surveillance and Monitoring (VSAM) project is cooperative multi-sensor surveillance to support battlefield awareness [17]. Under our VSAM Integrated Feasibility Demonstration (IFD) contract, we have developed automated video understanding technology that enables a single human operator to monitor activities over a complex area using a distributed network of active video sensors. The goal is to automatically collect and disseminate real-time information from the battlefield to improve the situational awareness of commanders and staff. Other military and federal law enforcement applications include providing perimeter security for troops, monitoring peace treaties or refugee movements from unmanned air vehicles, providing security for embassies or airports, and staking out suspected drug or terrorist hide-outs by collecting time-stamped pictures of everyone entering and exiting the building.

Automated video surveillance is an important research area in the commercial sector as well. Technology has reached a stage where mounting cameras to capture video imagery is cheap, but finding available human resources to sit and watch that imagery is expensive. Surveillance cameras are already prevalent in commercial establishments, with camera output being recorded to tapes that are either rewritten periodically or stored in video archives. After a crime occurs (a store is robbed or a car is stolen), investigators can go back after the fact to see what happened, but of course by then it is too late. What is needed is continuous 24-hour monitoring and analysis of video surveillance data to alert security officers to a burglary in progress, or to a suspicious individual loitering in the parking lot, while options are still open for avoiding the crime.

Keeping track of people, vehicles, and their interactions in an urban or battlefield environment is a difficult task. The role of VSAM video understanding technology in achieving this goal is to automatically “parse” people and vehicles from raw video, determine their geolocations, and insert them into a dynamic scene visualization. We have developed robust routines for detecting and tracking moving objects. Detected objects are classified into semantic categories such as human, human group, car, and truck using shape and color analysis, and these labels are used to improve tracking using temporal consistency constraints. Further classification of human activity, such as walking and running, has also been achieved. Geolocations of labeled entities are determined from their image coordinates using either wide-baseline stereo from two or more overlapping camera views, or intersection of viewing rays with a terrain model from monocular views. These computed locations feed into a higher-level tracking module that tasks multiple sensors with variable pan, tilt and zoom to cooperatively and continuously track an object through the scene. All resulting object hypotheses from all sensors are transmitted as symbolic data packets back to a central operator control unit, where they are displayed on a graphical user interface to give a broad overview of scene activities. These technologies have been demonstrated through a series of yearly demos, using a testbed system developed on the urban campus of CMU.

This is the final report on the three-year VSAM IFD research program. The emphasis is on recent results that have not yet been published. Older work that has already appeared in print is briefly summarized, with references to the relevant technical papers. This report is organized as follows. Section 2 contains a description of the VSAM IFD testbed system, developed as a testing ground for new video surveillance research. Section 3 describes the basic video understanding algorithms that have been demonstrated, including moving object detection, tracking, classification, and simple activity recognition. Section 4 discusses the use of geospatial site models to aid video surveillance processing, including calibrating a network of sensors with respect to the model coordinate system, computing 3D geolocation estimates, and graphically displaying object hypotheses within a distributed simulation. Section 5 discusses coordination of multiple cameras to achieve cooperative object tracking. Section 6 briefly lists the milestones achieved through three VSAM demos performed in Pittsburgh, the first at the rural Bushy Run site and the second and third held on the urban CMU campus, and concludes with plans for future research. The appendix contains published technical papers from the CMU VSAM research group.
2 VSAM Testbed System

We have built a VSAM testbed system to demonstrate how the automated video understanding technology described in the following sections can be combined into a coherent surveillance system that enables a single human operator to monitor a wide area. The testbed system consists of multiple sensors distributed across the campus of CMU, tied to a control room (Figure 1a) located in the Planetary Robotics Building (PRB). The testbed consists of a central operator control unit (OCU) which receives video and Ethernet data from multiple remote sensor processing units (SPUs) (see Figure 2). The OCU is responsible for integrating symbolic object trajectory information accumulated by each of the SPUs together with a 3D geometric site model, and for presenting the results to the user on a map-based graphical user interface (GUI). Each logical component of the testbed system architecture is described briefly below.

Figure 1: a) Control room of the VSAM testbed system on the campus of Carnegie Mellon University. b) Close-up of the main rack.
Figure 2: Schematic overview of the VSAM testbed system.
2.1 Sensor Processing Units (SPUs)

The SPU acts as an intelligent filter between a camera and the VSAM network. Its function is to analyze video imagery for the presence of significant entities or events, and to transmit that information symbolically to the OCU. This arrangement allows many different sensor modalities to be seamlessly integrated into the system. Furthermore, performing as much video processing as possible on the SPU reduces the bandwidth requirements of the VSAM network: full video signals do not need to be transmitted, only the symbolic data extracted from them.

The VSAM testbed can handle a wide variety of sensor and SPU types (Figure 3). The list of IFD sensor types includes: color CCD cameras with active pan, tilt and zoom control; fixed-field-of-view monochromatic low-light cameras; and thermal sensors. Logically, each SPU combines a camera with a local computer that processes the incoming video. However, for convenience, most video signals in the testbed system are sent via fiber optic cable to computers located in a rack in the control room (Figure 1b). The exceptions are SPU platforms that move: a van-mounted relocatable SPU, an SUO portable SPU, and an airborne SPU. Computing power for these SPUs is on-board, with results being sent to the OCU over relatively low-bandwidth wireless Ethernet links. In addition to the IFD in-house SPUs, two Focussed Research Effort (FRE) sensor packages have been integrated into the system: a Columbia-Lehigh CycloVision ParaCamera with a hemispherical field of view, and a Texas Instruments indoor surveillance system. By using a pre-specified communication protocol (see Section 2.4), these FRE systems were able to directly interface with the VSAM network. Indeed, within the logical system architecture, all SPUs are treated identically. The only difference is at the hardware level, where different physical connections (e.g. cable or wireless Ethernet) may be required to connect to the OCU.

The relocatable van and airborne SPUs warrant further discussion. The relocatable van SPU consists of a sensor and pan-tilt head mounted on a small tripod that can be placed on the vehicle roof when the van is stationary. All video processing is performed on-board the vehicle, and results from object detection and tracking are assembled into symbolic data packets and transmitted back to the operator control workstation over a radio Ethernet connection. The major research issue in demonstrating the redeployable van unit is how to rapidly calibrate sensor pose after redeployment, so that object detection and tracking results can be integrated into the VSAM network (via computation of geolocation) for display at the operator control console.
Figure 3: Many types of sensors and SPUs have been incorporated into the VSAM IFD testbed system: a) color PTZ; b) thermal; c) relocatable van; d) airborne. In addition, two FRE sensors have been successfully integrated: e) Columbia-Lehigh omnicamera; f) Texas Instruments indoor activity monitoring system.

The airborne sensor and computation packages are mounted on a Britten-Norman Islander twin-engine aircraft operated by the U.S. Army Night Vision and Electronic Sensors Directorate. The Islander is equipped with a FLIR Systems Ultra-3000 turret that has two degrees of freedom (pan/tilt), a Global Positioning System (GPS) for measuring position, and an Attitude Heading Reference System (AHRS) for measuring orientation. The continual self-motion of the aircraft introduces challenging video understanding issues. For this reason, video processing is performed using the Sarnoff PVT-200, a specially designed video processing engine.
2.2 Operator Control Unit (OCU)

Figure 4 shows the functional architecture of the VSAM OCU. It accepts video processing results from each of the SPUs and integrates the information with a site model and a database of known objects to infer activities that are of interest to the user. This data is sent to the GUI and other visualization tools as output from the system.

Figure 4: Functional architecture of the VSAM OCU.
One key piece of system functionality provided by the OCU is sensor arbitration. Care must be taken to ensure that an outdoor surveillance system does not underutilize its limited sensor assets. Sensors must be allocated to surveillance tasks in such a way that all user-specified tasks get performed and, if enough sensors are present, multiple sensors are assigned to track important objects. At any given time, the OCU maintains a list of known objects and sensor parameters, as well as a set of “tasks” that may need attention. These tasks are explicitly indicated by the user through the GUI, and may include specific objects to be tracked, specific regions to be watched, or specific events to be detected (such as a person loitering near a particular doorway). Sensor arbitration is performed by an arbitration cost function: the arbitration function determines the cost of assigning each of the SPUs to each of the tasks, based on the priority of the task, the load on the SPU, and the visibility of the objects from a particular sensor. The system performs a greedy optimization of the cost to determine the combination of SPU tasking that best satisfies overall system performance requirements.
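The sketch below illustrates this style of greedy, cost-based arbitration. It is not the testbed implementation: the cost terms and weights are hypothetical stand-ins for the factors named above (task priority, SPU load, and object visibility), and the visibility test is a placeholder for a site-model query.

```python
# Illustrative sketch of greedy cost-based sensor arbitration (not the actual
# OCU code). The cost terms and weights are hypothetical stand-ins for the
# factors described above: task priority, current SPU load, and visibility.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: float      # higher value = more important task

@dataclass
class SPU:
    name: str
    load: float = 0.0    # number of tasks already assigned to this sensor

def visible(spu: SPU, task: Task) -> bool:
    """Placeholder; a real system would consult the site model (Section 2.2)."""
    return True

def assignment_cost(spu: SPU, task: Task) -> float:
    if not visible(spu, task):
        return float("inf")                 # this sensor cannot service the task
    return spu.load + 1.0 / task.priority   # prefer idle SPUs and important tasks

def arbitrate(spus, tasks):
    """Greedily assign each task, highest priority first, to the cheapest SPU."""
    assignments = []
    for task in sorted(tasks, key=lambda t: -t.priority):
        best = min(spus, key=lambda s: assignment_cost(s, task))
        best.load += 1.0
        assignments.append((task.name, best.name))
    return assignments

if __name__ == "__main__":
    spus = [SPU("wean-roof"), SPU("prb-ptz")]
    tasks = [Task("track object 12", 2.0), Task("watch doorway ROI", 1.0)]
    print(arbitrate(spus, tasks))
```

A greedy pass such as this does not guarantee a globally optimal assignment, but it is cheap enough to re-run whenever the task list or object visibility changes.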
The OCU also contains a site model representing VSAM-relevant information about the area being monitored. The site model representation is optimized to efficiently support the following VSAM capabilities:

- object geolocation via intersection of viewing rays with the terrain (a sketch of this computation follows the list);

- visibility analysis (predicting what portions of the scene are visible from what sensors) so that sensors can be efficiently tasked;

- specification of the geometric location and extent of relevant scene features. For example, we might directly task a sensor to monitor the door of a building, or to look for vehicles passing through a particular intersection.
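As an illustration of the first capability, the following minimal sketch marches a viewing ray through a gridded elevation model until it first drops below the terrain surface. The grid representation, step size, and coordinate conventions are assumptions made for the example; the geolocation method actually used by the system is described in Section 4.

```python
# Minimal sketch of geolocation by intersecting a viewing ray with terrain.
# Assumes a regular elevation grid and fixed-length ray steps; these are
# illustrative choices, not the site-model representation used by the testbed.
import numpy as np

def geolocate(origin, direction, elev, cell_size, step=0.5, max_range=2000.0):
    """Return the first 3D point along the ray that falls below the terrain.

    origin    -- sensor position (x, y, z) in site-model coordinates
    direction -- viewing ray derived from the pixel location and camera pose
    elev      -- 2D array of terrain heights, indexed as elev[row, col]
    cell_size -- ground spacing of the elevation grid (same units as origin)
    """
    origin = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    for t in np.arange(0.0, max_range, step):
        p = origin + t * d
        row, col = int(p[1] / cell_size), int(p[0] / cell_size)
        if not (0 <= row < elev.shape[0] and 0 <= col < elev.shape[1]):
            return None          # the ray has left the modeled area
        if p[2] <= elev[row, col]:
            return p             # first sample at or below the terrain surface
    return None
```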
2.3 Graphical User Interface (GUI)

Figure 5: a) Operator console located in the control room. Also shown is a laptop-based portable operator console. b) Close-up view of the visualization node display screen.
One of the technical goals of the VSAM project is to demonstrate that a single human operator can effectively monitor a significant area of interest. Keeping track of multiple people, vehicles, and their interactions within a complex urban environment is a difficult task. The user obviously should not be looking at two dozen screens showing raw video output: that amount of sensory overload virtually guarantees that information will be ignored, and requires a prohibitive amount of transmission bandwidth. Our approach is to provide an interactive, graphical user interface (GUI) that uses VSAM technology to automatically place dynamic agents representing people and vehicles into a synthetic view of the environment (Figure 5). This approach has the benefit that visualization of scene events is no longer tied to the original resolution and viewpoint of a single video sensor. The GUI currently consists of a map of the area, overlaid with all object locations, sensor platform locations, and sensor fields of view (Figure 5b). In addition, a low-bandwidth, compressed video stream from one of the sensors can be selected for real-time display.

The GUI is also used for sensor suite tasking. Through this interface, the operator can task individual sensor units, as well as the entire testbed sensor suite, to perform surveillance operations such as generating a quick summary of all object activities in the area. The lower left corner of the control window contains a set of controls organized as tabbed selections, which allows the user to move fluidly between controls corresponding to the entity types Objects, Sensors, and Regions of Interest.
- Object Controls. Track directs the system to begin actively tracking the current object. Stop Tracking terminates all active tracking tasks in the system. Trajectory displays the trajectory of selected objects. Error displays geolocation error bounds on the locations and trajectories of selected objects.

- Sensor Controls. Show FOV displays sensor fields of view on the map; otherwise only a position marker is drawn. Move triggers an interaction allowing the user to control the pan and tilt angles of the sensor. Request Imagery requests either a continuous stream or a single image from the currently selected sensor, and Stop Imagery terminates the current imagery stream.

- ROI Controls. This panel contains all the controls associated with Regions of Interest (ROIs) in the system. ROIs are tasks that focus sensor resources on specific areas in the session space. Create triggers the creation of an ROI, specified interactively by the user as a polygon of boundary points. The user also selects from a set of object types (e.g. human, vehicle) that will trigger events in this ROI, and from a set of event types (e.g. enter, pass through, stop in) that are considered to be trigger events in the ROI. A sketch of one possible ROI task representation follows this list.
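The sketch below shows one possible representation of such an ROI task, with a standard point-in-polygon test as the spatial trigger. The field names and trigger logic are illustrative assumptions, not the testbed's actual data structures.

```python
# One possible representation of a Region-of-Interest task (illustrative field
# names, not the testbed's actual data structures). An event fires when an
# object of a selected type produces a selected event type inside the polygon.
from dataclasses import dataclass

@dataclass
class ROITask:
    boundary: list          # polygon vertices [(x, y), ...] in map coordinates
    object_types: set       # e.g. {"human", "vehicle"}
    event_types: set        # e.g. {"enter", "pass through", "stop in"}

    def contains(self, x, y):
        """Standard even-odd (ray casting) point-in-polygon test."""
        inside = False
        n = len(self.boundary)
        for i in range(n):
            x1, y1 = self.boundary[i]
            x2, y2 = self.boundary[(i + 1) % n]
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside

    def triggers(self, obj_type, event, x, y):
        return (obj_type in self.object_types
                and event in self.event_types
                and self.contains(x, y))
```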
2.4 Communication

The nominal architecture for the VSAM network allows multiple OCUs to be linked together, each controlling multiple SPUs (Figure 6). Each OCU supports exactly one GUI, through which all user-related command and control information is passed. Data dissemination is not limited to a single user interface, however, but is also accessible through a series of visualization nodes (VIS).
Figure 6: A nominal architecture for expandable VSAM networks.

Figure 7: CMUPA packet structure. A bitmask in the header describes which sections are present. Within each section, multiple data blocks can be present. Within each data block, bitmasks describe what information is present.

There are two independent communication protocols and packet structures supported in this architecture: the Carnegie Mellon University Packet Architecture (CMUPA) and the Distributed Interactive Simulation (DIS) protocols. The CMUPA is designed to be a low-bandwidth, highly flexible architecture in which relevant VSAM information can be compactly packaged without redundant overhead. The concept of the CMUPA packet architecture is a hierarchical decomposition (Figure 7). There are six data sections that can be encoded into a packet: command; sensor; image; object; event; and region of interest. A short packet header section describes which of these six sections are present in the packet. Within each section it is possible to represent multiple instances of that type of data, with each instance potentially containing a different layout of information. At each level, short bitmasks are used to describe the contents of the various blocks within the packets, keeping wasted space to a minimum. All communication between SPUs, OCUs and GUIs is CMUPA compatible. The CMUPA protocol specification document is accessible from http://www.cs.cmu.edu/vsam.
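To make the bitmask idea concrete, the sketch below encodes and decodes a packet whose one-byte header flags which of the six sections follow. The bit assignments, the length-prefixed block format, and the section order here are assumptions made for illustration only; the authoritative layout is defined in the CMUPA specification cited above.

```python
# Illustration of the bitmask-header idea behind CMUPA-style packets. The bit
# assignments and the length-prefixed block format are assumptions made for
# this example; the authoritative layout is the CMUPA protocol specification.
import struct

SECTIONS = ["command", "sensor", "image", "object", "event", "roi"]  # assumed order

def encode_packet(sections):
    """Pack a header bitmask, then a length-prefixed block for each present section."""
    header, body = 0, b""
    for bit, name in enumerate(SECTIONS):
        if name in sections:
            header |= 1 << bit
            body += struct.pack("!H", len(sections[name])) + sections[name]
    return struct.pack("!B", header) + body

def decode_packet(packet):
    """Inverse of encode_packet: read the bitmask, then each flagged block in order."""
    header, offset, sections = packet[0], 1, {}
    for bit, name in enumerate(SECTIONS):
        if header & (1 << bit):
            (length,) = struct.unpack_from("!H", packet, offset)
            offset += 2
            sections[name] = packet[offset:offset + length]
            offset += length
    return sections

if __name__ == "__main__":
    pkt = encode_packet({"object": b"id=7;class=human;x=412.3;y=88.1"})
    assert decode_packet(pkt)["object"].startswith(b"id=7")
```

Only sections that are actually present consume space in the packet, which is the property that keeps wasted space, and hence bandwidth, to a minimum.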
VIS nodes are designed to distribute the output of the VSAM network to wherever it is needed. They provide symbolic representations of detected activities overlaid on maps or imagery. Information flow to VIS nodes is unidirectional, originating from an OCU. All of this communication uses the DIS protocol, which is described in detail in [15]. An important benefit of keeping VIS nodes DIS compatible is that it allows us to easily interface with synthetic environment visualization tools such as ModSAF and ModStealth (Section 4.4).

2.5 Current Testbed Infrastructure

This section describes the VSAM testbed on the campus of Carnegie Mellon University as of Fall 1999 (see Figure 8). The VSAM infrastructure consists of 14 cameras distributed throughout campus. All cameras are connected to the VSAM Operator Control Room in the Planetary Robotics Building (PRB): ten are connected via fiber optic lines, three on PRB are wired directly to the SPU computers, and one is a portable Small Unit Operations (SUO) unit connected via wireless Ethernet to the VSAM OCU. The work done for VSAM 99 concentrated on increasing the density of sensors in the Wean/PRB area. The overlapping fields of view (FOVs) in this area of campus enable us to conduct experiments in wide-baseline stereo, object fusion, sensor cuing and sensor handoff.
Figure 8: Placement of color and monochrome cameras in the current VSAM testbed system. Not shown are two additional cameras, a FLIR and the SUO portable system, which are moved to different places as needed.
The backbone of the CMU campus VSAM system consists of six Sony EVI-370 color zoom cameras installed on PRB, Smith Hall, Newell-Simon Hall, Wean Hall, Roberts Hall, and Porter Hall. Five of these units are mounted on Directed Perception pan/tilt heads. The most recent camera, on Newell-Simon, is mounted on a Sagebrush Technologies pan/tilt head, a more rugged outdoor mount that is being evaluated for better performance specifications and longer-term usage. Two stationary fixed-FOV color cameras are mounted on the peak of PRB, on either side of the pan/tilt/zoom color camera located there. These PRB “left” and “right” sensors were added to facilitate work on activity analysis, classification, and sensor cuing. Three stationary fixed-FOV monochrome cameras are mounted on the roof of Wean Hall in close proximity to one of the pan/tilt/zoom color cameras; these are connected to the Operator Control Room over a single multimode fiber using a video multiplexor. The monochrome cameras have a vertical resolution of 570 TV lines and perform fairly well at night with the available street lighting. A mounting bracket has also been installed next to these cameras for the temporary installation of a Raytheon NightSight thermal (FLIR) sensor. A fourth stationary fixed-FOV monochrome camera is mounted on PRB, pointing at the back stairwell. A SUO portable unit was built to allow further software development and research at CMU in support of the SUO program. This unit consists of the same hardware as the SPUs that were delivered to Fort Benning, Georgia in November 1999.
The Operator Control Room in PRB houses the SPU, OCU, GUI and development workstations, nineteen computers in total. The four most recent SPUs are Pentium III 550 MHz computers. Dagwood, a single “compound SPU”, is a quad Xeon 550 MHz computer purchased to conduct research on classification, activity analysis, and digitization of three simultaneous video streams. Also included in this list of machines is a Silicon Graphics Origin 200, used to develop video database storage and retrieval algorithms as well as to design user interfaces for handling VSAM video data.

Two auto-tracking Leica theodolites (TPS1100) are installed on the corner of PRB and are hardwired to a data processing computer linked to the VSAM OCU. This system allows us to perform real-time automatic tracking of objects to obtain ground truth for evaluating the VSAM geolocation and sensor fusion algorithms. The data can be displayed in real time on the VSAM GUI.

An Office of Naval Research DURIP grant provided funds for the two Raytheon NightSight thermal sensors, the quad Xeon computer, the Origin 200, an SGI Infinite Reality Engine, and the Leica theodolite surveying systems.
3 Video Understanding Technologies

Keeping track of people, vehicles, and their interactions in a complex environment is a difficult task. The role of VSAM video understanding technology in achieving this goal is to automatically “parse” people and vehicles from raw video, determine their geolocations, and automatically insert them into a dynamic scene visualization. We have developed robust routines for detecting moving objects and tracking them through a video sequence using a combination of temporal differencing and template tracking. Detected objects are classified into semantic categories such as human, human group, car, and truck using shape and color analysis, and these labels are used to improve tracking using temporal consistency constraints. Further classification of human activity, such as walking and running, has also been achieved. Geolocations of labeled entities are determined from their image coordinates using either wide-baseline stereo from two or more overlapping camera views, or intersection of viewing rays with a terrain model from monocular views. The computed geolocations are used to provide higher-level tracking capabilities, such as tasking multiple sensors with variable pan, tilt and zoom to cooperatively track an object through the scene. Results are displayed to the user in real time on the GUI, and are also archived in a web-based object/event database.
3.1 Moving Object Detection

Detection of moving objects in video streams is known to be a significant, and difficult, research problem [26]. Aside from the intrinsic usefulness of being able to segment video streams into moving and background components, detecting moving blobs provides a focus of attention for recognition, classification, and activity analysis, making these later processes more efficient since only “moving” pixels need be considered.

There are three conventional approaches to moving object detection: temporal differencing [1]; background subtraction [13, 29]; and optical flow (see [3] for an excellent discussion). Temporal differencing is very adaptive to dynamic environments, but generally does a poor job of extracting all relevant feature pixels. Background subtraction provides the most complete feature data, but is extremely sensitive to dynamic scene changes due to lighting and extraneous events. Optical flow can be used to detect independently moving objects in the presence of camera motion; however, most optical flow computation methods are computationally complex and cannot be applied to full-frame video streams in real time without specialized hardware.

Under the VSAM program, CMU has developed and implemented three methods for moving object detection on the VSAM testbed. The first is a combination of adaptive background subtraction and three-frame differencing (Section 3.1.1). This hybrid algorithm is very fast and surprisingly effective; indeed, it is the primary algorithm used by the majority of the SPUs in the VSAM system. In addition, two new prototype algorithms have been developed to address shortcomings of this standard approach. First, a mechanism for maintaining temporal object layers was developed to allow greater disambiguation of moving objects that stop for a while, are occluded by other objects, and then resume motion (Section 3.1.2). One limitation that affects both this method and the standard algorithm is that they only work for static cameras, or in a “step-and-stare” mode for pan-tilt cameras. To overcome this limitation, a second extension has been developed to allow background subtraction from a continuously panning and tilting camera (Section 3.1.3). Through clever accumulation of image evidence, this algorithm can be implemented in real time on a conventional PC platform. A fourth approach, to moving object detection from a moving airborne platform, has also been developed under a subcontract to the Sarnoff Corporation. This approach is based on image stabilization using special video processing hardware, and is described later, in Section 3.6.
Figure 9: Problems with standard MTD algorithms. (a) Background subtraction leaves “holes” when stationary objects move. (b) Frame differencing does not detect the entire object.
3.1.1 A Hybrid Algorithm for Moving Object Detection

We have developed a hybrid algorithm for detecting moving objects that combines an adaptive background subtraction technique [18] with a three-frame differencing algorithm. As discussed in [26], the major drawback of adaptive background subtraction is that it makes no allowances for stationary objects in the scene that start to move. Although these are usually detected, they leave behind “holes” where the newly exposed background imagery differs from the known background model (see Figure 9a). While the background model eventually adapts to these “holes”, they generate false alarms for a short period of time. Frame differencing is not subject to this phenomenon; however, it is generally not an effective method for extracting the entire shape of a moving object (Figure 9b). To overcome these problems, we have combined the two methods: a three-frame differencing operation is performed to determine regions of legitimate motion, followed by adaptive background subtraction to extract the entire moving region.
Consider a video stream from a stationary (or stabilized) camera. Let I_n(x) represent the intensity value at pixel position x at time t = n. The three-frame differencing rule suggests that a pixel is legitimately moving if its intensity has changed significantly between both the current image and the last frame, and the current image and the next-to-last frame. That is, a pixel x is moving if

    |I_n(x) - I_{n-1}(x)| > T_n(x)   and   |I_n(x) - I_{n-2}(x)| > T_n(x)

where T_n(x) is a threshold describing a statistically significant intensity change at pixel position x (described below). The main problem with frame differencing is that pixels interior to an object with uniform intensity are not included in the set of “moving” pixels. However, after clustering moving pixels into a connected region, interior pixels can be filled in by applying adaptive background subtraction to extract all of the “moving” pixels within the region’s bounding box R. Let B_n(x) represent the current background intensity value at pixel x, learned by observation over time. Then the blob b_n can be filled out by taking all the pixels in R that are significantly different from the background model B_n. That is,

    b_n = { x : |I_n(x) - B_n(x)| > T_n(x),  x ∈ R }
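These two steps translate almost directly into code. The sketch below assumes grayscale frames stored as floating-point NumPy arrays, with B_n and T_n supplied as arrays of the same size, and uses a connected-components pass to stand in for the clustering of moving pixels; it is an illustration of the rule above, not the testbed implementation.

```python
# Sketch of the hybrid detection rule above: three-frame differencing proposes
# moving pixels, then background subtraction fills out each region's bounding
# box. Grayscale float arrays are assumed; connected components stand in for
# the clustering step.
import numpy as np
from scipy import ndimage

def detect_moving_regions(I_n, I_n1, I_n2, B_n, T_n):
    """I_n, I_n1, I_n2: current and two previous frames; B_n, T_n: background and threshold images."""
    # Three-frame differencing: significant change w.r.t. both previous frames.
    moving = (np.abs(I_n - I_n1) > T_n) & (np.abs(I_n - I_n2) > T_n)

    # Cluster the moving pixels into connected regions.
    labels, _ = ndimage.label(moving)

    # Fill out each blob via background subtraction inside its bounding box R.
    blobs = np.zeros_like(moving)
    for R in ndimage.find_objects(labels):
        blobs[R] |= np.abs(I_n[R] - B_n[R]) > T_n[R]
    return moving, blobs
```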
Both the background model B_n(x) and the difference threshold T_n(x) are statistical properties of the pixel intensities observed from the sequence of images {I_k(x)} for k < n. B_0(x) is initially set to the first image, B_0(x) = I_0(x), and T_0(x) is initially set to some pre-determined, non-zero value. B(x) and T(x) are then updated over time as:

    B_{n+1}(x) = α B_n(x) + (1 - α) I_n(x),                   if x is non-moving
                 B_n(x),                                       if x is moving

    T_{n+1}(x) = α T_n(x) + (1 - α) (5 × |I_n(x) - B_n(x)|),   if x is non-moving
                 T_n(x),                                        if x is moving

where α is a time constant that specifies how fast new information supplants old observations. Note that each value is only changed for pixels that are determined to be non-moving, i.e. part of the stationary background. If each non-moving pixel position is considered as a time series, B_n(x) is analogous to a local temporal average of intensity values, and T_n(x) is analogous to 5 times the local temporal standard deviation of intensity, both computed using an infinite impulse response (IIR) filter. Figure 10 shows a result of this detection algorithm for one frame.
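A direct transcription of these update rules, under the same array assumptions as the previous sketch, is shown below; the value of α used here is illustrative.

```python
# Direct transcription of the B and T update rules above (same array
# assumptions as the previous sketch). Only non-moving pixels are updated.
import numpy as np

def update_background_model(I_n, B_n, T_n, moving, alpha=0.95):
    """IIR update of the background image B and threshold image T.

    moving -- boolean mask of pixels labeled as moving in the current frame
    alpha  -- time constant controlling how fast new data supplants old (illustrative value)
    """
    s = ~moving                       # stationary (non-moving) pixels
    B_next, T_next = B_n.copy(), T_n.copy()
    B_next[s] = alpha * B_n[s] + (1 - alpha) * I_n[s]
    T_next[s] = alpha * T_n[s] + (1 - alpha) * (5.0 * np.abs(I_n[s] - B_n[s]))
    return B_next, T_next
```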
Figure 10: Result of the detection algorithm. (a) Original image. (b) Detected motion regions.
3.1.2 Temporal Layers for Adaptive Background Subtraction

A robust detection system should be able to recognize when objects have stopped and even disambiguate overlapping objects, functions usually not possible with traditional motion detection algorithms. An important aspect of this work derives from the observation that legitimately moving objects in a scene tend to cause much faster transitions than changes due to lighting, meteorological, and diurnal effects. This section describes a novel approach to object detection based on layered adaptive background subtraction.
The Detection Algorithm

Layered detection is based on two processes: pixel analysis and region analysis. The purpose of pixel analysis is to determine whether a pixel is stationary or transient by observing its intensity value over time. Region analysis deals with the agglomeration of gr
