Video Indexing Based on Mosaic Representations

MICHAL IRANI, MEMBER, IEEE, AND P. ANANDAN, MEMBER, IEEE
Video is a rich source of information. It provides visual information about scenes. This information is implicitly buried inside the raw video data, however, and is provided at the cost of very high temporal redundancy. While the standard sequential form of video storage is adequate for viewing in a “movie mode,” it fails to support the rapid access to information of interest that is required in many of the emerging applications of video. This paper presents an approach for efficient access, use, and manipulation of video data. The video data are first transformed from their sequential and redundant frame-based representation, in which the information about the scene is distributed over many frames, to an explicit and compact scene-based representation, to which each frame can be directly related. This compact reorganization of the video data supports nonlinear browsing and efficient indexing to provide rapid access directly to information of interest. This paper describes a new set of methods for indexing into the video sequence based on the scene-based representation. These indexing methods are based on geometric and dynamic information contained in the video. They complement the more traditional “content-based indexing” methods, which utilize image-appearance information (namely, color and texture properties), but are considerably simpler to achieve and are highly computationally efficient.

Keywords—Compact video representations, mosaics, video annotation, video browsing, video compression, video data bases, video indexing, video manipulation.
I. INTRODUCTION
The emergence of video as data and a source of information on the computer opens the potential for new ways of accessing, viewing, and manipulating the contents of video. These include direct nonlinear access to video frames and sequences of interest, new modes of viewing that give the viewer control over how the video is viewed, annotation and manipulation of objects and scenes in the video, and merging of text and graphics with the video data.

While the standard manner of representing video as a sequence of frames is adequate for viewing it in a movie mode, it does not support the type of interaction with video information described above. Currently, the only way to access the information of interest is by sequentially scanning the video. The only way to manipulate, annotate, or edit the video is by processing the video frame by frame. This process is both slow and tedious.

Manuscript received July 20, 1997; revised November 30, 1997. The Guest Editor coordinating the review of this paper and approving it for publication was A. M. Tekalp.
M. Irani was with the Sarnoff Corporation, Princeton, NJ 08540 USA. She is now with the Department of Applied Math and Computer Science, The Weizmann Institute of Science, Rehovot 76100 Israel.
P. Anandan was with the Sarnoff Corporation, Princeton, NJ 08540 USA. He is now with Microsoft Corporation, Redmond, WA 98052 USA.
Publisher Item Identifier S 0018-9219(98)03562-2.
This paper presents a new approach for efficient access, storage, and manipulation of video data. Our approach is based on the fact that a video sequence contains many views of the same scene taken over time, from either a moving or a stationary camera. Hence, the information that is common to all the frames is the scene itself. This information is distributed over many frames, however, at the cost of very high temporal redundancy, and is found only implicitly in the video data. We transform the video data from a sequential frame-based representation, in which this common scene information is distributed over many frames, into a single common scene-based representation to which each frame can be directly related. This representation then allows direct and immediate access to the scene information, such as static locations and dynamically moving objects. It also eliminates the redundancy between the different views of the scene contained in the frames and results in a highly efficient and compact representation of the video information. Hence, the scene-based representation forms the basis for direct and efficient access to and manipulation of the video information and supports efficient storage and transmission of the video data.

The scene representation is composed of three components (a schematic data-structure sketch follows the list).
1) Extended spatial information: this captures the appearance of the entire scene imaged in the video clip and is represented in the form of a few (often just one) panoramic mosaic images constructed by composing the information from the different views of the scene in the individual frames into a single image.

2) Extended temporal information: this captures the motion of independently moving objects in the scene (e.g., in the form of their trajectories).

3) Geometric information: this captures the three-dimensional (3-D) scene structure, as well as the geometric transformations that are induced by the motion of the camera, and maps the frames to the common mosaic image.
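For concreteness, the three components can be pictured as a single data structure. The following Python sketch is purely illustrative; the field names, types, and layout are our own assumptions rather than the paper's implementation:

    from dataclasses import dataclass, field

    @dataclass
    class SceneRepresentation:
        """Illustrative container for the three scene components
        (field names and layout are assumptions, not the paper's)."""
        # 1) Extended spatial information: one or a few panoramic
        #    mosaic images (e.g., HxWx3 uint8 arrays).
        mosaics: list
        # 2) Extended temporal information: per-object trajectories in
        #    mosaic coordinates, e.g., {object_id: {frame_index: (x, y)}}.
        trajectories: dict = field(default_factory=dict)
        # 3) Geometric information: per-frame transformations mapping
        #    frame coordinates into the common mosaic coordinate system
        #    (3x3 matrices in the 2-D parametric case; richer levels of
        #    the hierarchy also carry 3-D structure).
        frame_to_mosaic: list = field(default_factory=list)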
Taken together, these three components provide a compact description of the video data.

We construct the common scene-based representation by measuring and interpreting the image motion within the video clip. Regions of the video frames corresponding to the static and dynamic portions of the scene are determined. The geometric transformations and the 3-D scene structure are recovered as part of this process, which is done automatically, without any information about the camera calibration or the scene.
Once the common scene-based representation is constructed, it forms the basis for direct and efficient browsing, indexing, and manipulation of the video data. Browsing is done by skimming a collection of images that “summarize” the video data. We refer to these images as visual summaries. These summaries visually describe the video information in a compact and succinct fashion and can serve as a visual table of contents for the video.

Since the mosaics capture the information that is common to all the frames, they provide the means to index directly into and manipulate the individual frames. Both the static and dynamic portions of the video sequence can be accessed this way. These indexing methods are based on geometric and dynamic information contained in the video. They complement the more traditional approach to “content-based indexing,” which utilizes image-appearance information (namely, color and texture properties) [7], [9], [10], [26], but are considerably simpler to achieve and are computationally highly efficient. The existing appearance-based methods themselves can also be used more efficiently within the scene-based representation when applied directly to the mosaic image (i.e., to the appearance component of our representation) rather than to the individual video frames one by one.
The rest of this paper is organized as follows. Section II presents the common and compact scene-based representation, to which each frame is directly related. Section III explains how to use the scene-based representation to browse, index, and manipulate video data efficiently and rapidly. Section IV reviews the techniques used for constructing the scene-based representation from raw video sequences. Section V concludes this paper.
II. FROM FRAMES TO SCENES

Video is a rich data source. It provides information about scenes. This information is buried inside the raw video data, however, and is provided at the cost of very high temporal redundancy (e.g., every scene point is displayed repeatedly in numerous consecutive frames). In this section, we first review the fundamental components of information in a video stream (Section II-A). Then we make use of these information components to transform the video from an implicit and redundant frame-based representation to an explicit and nonredundant scene-based representation that is common to all frames (Section II-B).
A. The Three Fundamental Information Components of Video

Video extends the imaging capabilities of a still camera in three ways. First, although the field of view of each single image frame may be small, the camera can be panned or otherwise moved around in order to cover an extended spatial area. However, the extended spatial information acquired by the video is not available in a coherent form. It is distributed among a sequence of frames and is hard to use.

The second, and perhaps the most common, use of video is to record the evolution of events over time. Again, however, this extended temporal information is not explicitly represented but distributed over a sequence of video frames. While it is natural for a human to view it as a movie, this representation is not particularly suitable for analytic purposes.

Third, a video camera can be moved in order to acquire views from a continuously varying set of vantage points. This induces image motion, which depends on the 3-D geometric layout of the scene and the motion of the camera. However, this geometric information is also only implicitly present and is not directly accessible from the standard sequential video representation.

Thus, the total information contained in the video data consists of the three scene components mentioned above. However, this information is distributed among the frames and is implicitly encoded in terms of image motion. Therefore, a natural way to reorganize the video data is in terms of these three scene components. Moreover, such a reorganization removes the tremendous redundancy that is present in the source video data. This scene-based organization is highly efficient since it directly and uniquely maps onto the information in the scene. Therefore, it facilitates efficient interaction and manipulation and supports very efficient storage and transmission.
B. The Scene-Based Representation

To bring out the common scene information contained in the video, and make it more directly accessible, we first transform the video from its implicit and redundant frame-based representation to an explicit and compact scene-based representation. In this section, we introduce the scene-based representation. In Section IV, we elaborate on the details of the representation and explain how it is constructed from the video data.

The video stream is first temporally segmented into scene segments, which are subsequences of the input video sequence. A beginning or an end of a scene segment is automatically detected wherever a scene cut or scene change occurs in the video. Scene cuts are typically characterized by drastic changes in frame content, which are directly reflected in the distribution of color and gray levels in the image, or in the image motion (e.g., see [9] and [37]). These changes are relatively simple to detect (see the sketch below).
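As an illustration of how simple such a detector can be, the following sketch (ours; not the specific methods of [9] or [37]) declares a cut wherever the gray-level histograms of consecutive frames differ by more than a threshold. The bin count and threshold are arbitrary assumptions:

    import numpy as np

    def detect_scene_cuts(frames, threshold=0.4):
        """Return indices of frames at which a scene cut likely occurs.

        frames: iterable of grayscale images (2-D uint8 arrays).
        threshold: fraction of histogram mass that must change between
                   consecutive frames (assumed value).
        """
        cuts, prev_hist = [], None
        for i, frame in enumerate(frames):
            hist, _ = np.histogram(frame, bins=64, range=(0, 256))
            hist = hist / hist.sum()  # normalize to a distribution
            if prev_hist is not None:
                # Total-variation distance between gray-level histograms.
                if 0.5 * np.abs(hist - prev_hist).sum() > threshold:
                    cuts.append(i)  # a segment boundary precedes frame i
            prev_hist = hist
        return cuts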
Each scene segment is subsequently parsed into the three fundamental components of video (see Section II-A), namely, the static background scene, the dynamic moving objects, and the geometric information. These components are organized as described below.
Corresponding to the three fundamental components, the scene-based representation is divided into three parts.

1) Panoramic Mosaic Image: This captures an extended spatial view of the entire scene visible in the video clip in a single (or sometimes a few) “snapshot” image(s) (e.g., see Fig. 1). This image captures the appearance of the static portions of the scene.

The mosaic image is constructed by first aligning all the frames with respect to a common coordinate system (which also becomes the mosaic coordinate system) and then integrating all these frames to form a single image. Different methods of integration can be employed (e.g., temporal average, temporal median, superresolution, etc.). These are described in more detail in [12].
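As a rough illustration of the integration step, the sketch below assumes the frame-to-mosaic alignments are already known, warps each frame with OpenCV, and integrates by a temporal median (one of the choices listed above). The API calls and masking scheme are our choice, not the paper's:

    import cv2
    import numpy as np

    def build_static_mosaic(frames, frame_to_mosaic, mosaic_size):
        """Integrate pre-aligned frames into a static background mosaic.

        frames: list of HxWx3 uint8 images.
        frame_to_mosaic: list of 3x3 homographies mapping each frame
                         into mosaic coordinates (assumed precomputed).
        mosaic_size: (width, height) of the output mosaic.
        """
        layers = []
        for frame, H in zip(frames, frame_to_mosaic):
            warped = cv2.warpPerspective(frame, H, mosaic_size).astype(np.float32)
            # Warp a validity mask too, so pixels outside this frame's
            # field of view are excluded from the temporal integration.
            mask = cv2.warpPerspective(np.ones(frame.shape[:2], np.uint8),
                                       H, mosaic_size)
            warped[mask == 0] = np.nan
            layers.append(warped)
        # A temporal median down the stack rejects transient foreground
        # (moving objects), leaving only the static background.
        mosaic = np.nanmedian(np.stack(layers), axis=0)
        return np.nan_to_num(mosaic).astype(np.uint8)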
The mosaic representation removes the redundancy contained in the overlap between successive frames and represents each spatial point only once. Mosaics have previously been used as an effective way of creating panoramic views of a scene from video sequences [3], [16], [20], [23], [31], [32]. Until now, however, they have not been used as an information component within a scene-based representation that provides direct and efficient access to video data.

Section IV describes a hierarchy of mosaic representations. The hierarchy corresponds to increasing complexity levels in the camera motion and in the 3-D scene structure.
2) Geometric Transformations: These relate the different video frames to the mosaic coordinate system. The geometric transformations contain the information necessary to map the location of each scene point back and forth between the panoramic mosaic image(s) and the individual frames. Corresponding to the hierarchy of panoramic mosaic representations, there exists a hierarchy of representations of the geometric transformations. These range from global parametric two-dimensional (2-D) transformations to more complex 3-D transformations and are described in Section IV.
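At the simplest (2-D parametric) level of this hierarchy, mapping a point back and forth reduces to applying a 3x3 matrix and its inverse in homogeneous coordinates. A minimal sketch (the names are ours):

    import numpy as np

    def map_point(H, x, y):
        """Map image point (x, y) through a 3x3 projective transform H."""
        p = H @ np.array([x, y, 1.0])
        return p[0] / p[2], p[1] / p[2]  # divide out the homogeneous scale

    def mosaic_to_frame(frame_to_mosaic_H, x_mosaic, y_mosaic):
        # Mosaic -> frame uses the inverse of the frame -> mosaic transform.
        return map_point(np.linalg.inv(frame_to_mosaic_H), x_mosaic, y_mosaic)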
3) Dynamic Information: This is the information about moving objects, which are not captured by the static panoramic mosaic image. Moving-object information is completely captured by representing the extended time trajectories of those objects as well as their appearance. Such a complete representation is needed, e.g., for video compression (since the video frames need to be reconstructed from the scene representation). To access, browse, index, and annotate the video (as presented in Section III), however, the trajectory information alone is sufficient. The trajectory of the center of mass of each detected moving object (i.e., a single image point per moving object per frame) is maintained. These trajectories are represented in the coordinate system of the mosaic image, which is common to all the frames. In the common coordinate system, time continuity, continuous tracking, and the temporal behavior of the moving object can be analyzed more effectively (see Figs. 3 and 5).
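For instance, once a trajectory is stored in mosaic coordinates, camera-independent quantities such as an object's speed become a short computation. The sketch below assumes the {frame_index: (x, y)} trajectory layout suggested earlier (our convention, not the paper's):

    import numpy as np

    def object_speeds(trajectory, fps=30.0):
        """Per-frame speed of an object from its mosaic-coordinate trajectory.

        trajectory: dict mapping frame index -> (x, y) in mosaic coordinates.
        Returns {frame index: speed in mosaic pixels per second}. Because
        the camera-induced motion has already been cancelled, these speeds
        reflect the object's own motion relative to the scene.
        """
        frames = sorted(trajectory)
        speeds = {}
        for f0, f1 in zip(frames, frames[1:]):
            p0, p1 = np.array(trajectory[f0]), np.array(trajectory[f1])
            dt = (f1 - f0) / fps  # frames need not be consecutive
            speeds[f1] = float(np.linalg.norm(p1 - p0) / dt)
        return speeds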
Thus, the three components of our scene-based representation form a compact representation of the video clip. The compactness results from the fact that every scene point is represented only once in the mosaic image, while in the original video clip it is observed in multiple frames. This compactness of the scene-based representation facilitates very high compression (and we have developed such algorithms for very-low-bit-rate compression [13]). In this paper, we focus on the power of this representation for video indexing and manipulation. Section III describes how this representation can be used for efficiently accessing and manipulating the video data. Section IV describes the methods for constructing the scene-based representation.
III. FROM SCENES TO VISUAL SUMMARIES AND INDEXING

Once a video sequence is transformed from the frame-based representation to the scene-based representation, it forms the basis for the user's interaction with the video. The user can initially preview the video by browsing through visual summaries of the various video clips. These visual summaries can serve as a visual table of contents of the video data. When a scene of interest is detected by the user, he can either request to view only that portion of the video or further index into individual video frames. The detected frames of interest can then be either viewed or manipulated by the user.
A. Visual Summaries—A Visual Table of Contents

There are two types of visual summaries of video clips through which a user can browse. These are captured by two types of mosaic images, which are constructed from the video clip of a scene.

1) The Static Background Mosaic: The video frames of a single video segment (clip) are aligned and integrated into a single mosaic image. This image provides an extended (panoramic) spatial view of the entire static background scene viewed in the clip in a single “snapshot” image and represents the scene better than any single frame. This image does not include any moving objects. The user can visually browse through the collection of such mosaic images to select a scene (clip) of interest. Figs. 1 and 2(b) display some examples of static background mosaic images.
2) The Synopsis Mosaic: While the static mosaic image effectively captures the background scene, it contains no representation of the dynamic events in the scene. To provide a summary of the events, we create a new type of mosaic called the synopsis mosaic. This is constructed by overlaying the trajectories of the moving objects on top of the background mosaic. This single “snapshot” image provides a visual summary of the entire dynamic foreground event that occurred in the video clip. Fig. 3 graphically illustrates the trajectory associated with a moving object in a synopsis mosaic, and Fig. 2(c) provides a summary of the entire event in the baseball video clip.
Fig. 1. Static background mosaic of an airport video clip. (a) A few representative frames from the minute-long video clip. The video shows an airport being imaged from the air with a moving camera. The scene itself is static (i.e., no moving objects). (b) The static background mosaic image, which provides an extended view of the entire scene imaged by the camera in the one-minute video clip.
To allow for comprehensive display of multiple trajectories (corresponding to multiple moving objects), the trajectory of each moving object is uniquely color coded. Figs. 4 and 5 provide visual summaries of airborne video clips, each with multiple moving objects. Fig. 4 shows a flying airplane and a moving car on the road. Fig. 5 shows a flying airplane, three parachuters that were dropped from the plane, and a moving car.
The natural mode of operation for the user is first to browse through the visual summary mosaics to identify a few scenes of interest. Once the user has identified a scene (i.e., mosaic) of interest, he proceeds to directly access and/or manipulate individual video frames associated with only a portion of the scene that is of interest to him. The scene-based representation supports this type of indexing. Two new types of indexing methods are presented: 1) indexing based on location (geometric) information and 2) indexing based on dynamic information. These are made possible directly via the geometric coordinate transformations that relate the different frames to the mosaic image and through the moving-objects information that was estimated in the formation of the mosaic-based scene representation (Section II-B). The access and manipulation of selected video frames is done directly from the mosaic-based visual summaries. These location and dynamic indexing methods complement the more traditional approach to “content-based indexing,” which utilizes image-appearance information (e.g., color and texture) [7], [9], [10], [26]; however, our methods are considerably simpler to achieve and are highly computationally efficient.

The remainder of this section describes these modes of video indexing and manipulation.
Fig. 2. Visual summaries of a baseball video clip. (a) A few representative frames from the video clip. The video shows two outfielders running, while the camera is panning to the left and zooming on the two baseball players. (b) The static background mosaic image, which provides an extended view of the entire scene captured by the camera in the video clip. The “missing” regions at the top left and bottom left were never imaged by the camera because at that point it was zoomed in on the two players (e.g., frame 80). (c) The synopsis mosaic, which provides a visual summary of the entire event. It shows the trajectories of the two outfielders in the context of the mosaic image.
Fig. 3. Synopsis of a moving object. The trajectory of the moving object is depicted in the synopsis mosaic. This shows the motion of the moving object after cancellation of the background (camera-induced) motion. Associated with each point on the trajectory is a frame number (i.e., the “time” when the moving object was at that location).
Fig. 4. The visual summary of a flying plane video clip. (a) A few representative frames from the minute-long video clip. The video shows an airplane flying from right to left (during takeoff). A car driving on a road is visible for a few frames. (b) The synopsis mosaic, which provides a visual summary of the entire video clip, showing the trajectories of all moving objects in the context of the mosaic image. Each detected and tracked moving object is color coded uniquely (plane: green; car: yellow).
B. Location (Geometric)-Based Indexing

Once a few scenes of interest (in the form of visual summaries) have been selected, the user proceeds to access the video frames themselves. The user selects a scene point (or several points) in the mosaic image. The geometric coordinate transformations map the selected scene point(s) from the mosaic image to its location in the coordinate system of each of the video frames. All frames containing the selected scene point inside their field of view are therefore instantaneously determined. The user can view the subsequence of the video that contains only the frames with the selected scene point (or points). When these frames are not consecutive in time (e.g., if the selected portion of the scene was revisited by the camera multiple times), then multiple subsequences (corresponding to consecutive frame groups) are displayed to the user.
Fig. 5. The visual summary of a parachuters video clip. (a) A few representative frames from the 30-second-long video clip. The video shows an airplane flying from left to right, dropping three parachuters. A car driving on a road is visible for a few frames. The parachuters are very small (tiny white dots) and difficult to see in a static image, but they are easily detectable in video, as they have different motion than the background. They are depicted in the synopsis mosaic by the green, red, and blue trajectories. In the video sequence, they become visible gradually, as their parachutes open—first the left parachuter, then the right one, and last the middle one. This becomes clearer in the annotated video displayed in Fig. 10. (b) The synopsis mosaic, which provides a visual summary of the entire video clip, showing the trajectories of all moving objects in the context of the mosaic image. Each detected and tracked moving object is color coded uniquely (parachuters: green, red, blue; car: purple; plane: yellow).
Fig. 6. Location-based indexing. Selection of a scene point in the mosaic image generates a display of all frames whose field of view contains the selected scene point. These are frames i, j, and k. In the figure, these frames are displayed as a collection of frames, but in reality, they are displayed as a video sequence.
Fig. 6 demonstrates the indexing process. Selection of a scene point in the mosaic image generates a display of all frames whose field of view contains the selected scene point; these are frames i, j, and k. In the figure, these frames are displayed as a collection of frames, but in reality, they are displayed as a video sequence.
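At its core, this mechanism is a point-in-field-of-view test applied to every frame of the clip. A minimal sketch for the 2-D parametric case (function and parameter names are ours):

    import numpy as np

    def frames_containing(point, frame_to_mosaic, frame_shape):
        """Indices of all frames whose field of view contains a mosaic point.

        point: (x, y) selected in mosaic coordinates.
        frame_to_mosaic: list of 3x3 frame-to-mosaic transforms.
        frame_shape: (height, width) of the video frames.
        """
        h, w = frame_shape
        x_m, y_m = point
        hits = []
        for i, H in enumerate(frame_to_mosaic):
            # Map the mosaic point into this frame via the inverse transform.
            p = np.linalg.inv(H) @ np.array([x_m, y_m, 1.0])
            x, y = p[0] / p[2], p[1] / p[2]
            if 0 <= x < w and 0 <= y < h:
                hits.append(i)
        return hits  # consecutive runs form the displayed subsequences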
In addition to manual scene-point selection, this representation also provides a basis for efficiently indexing into the video using existing automatic detection methods. For example, if a region is searched for using an appearance-based detection method (e.g., template correlation or search based on color or texture attributes [7], [9], [10], [26]), then instead of applying these search methods individually to each frame, they can be applied just once to the common mosaic image. Once the region is detected in the mosaic image, the location-based indexing mechanism can be used to retrieve the corresponding frames.
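As an illustration of this search-once idea, the sketch below runs normalized template matching over the mosaic (OpenCV's matchTemplate, our choice of appearance method) and hands the best match to the location-based test above:

    import cv2

    def find_region_in_video(template, mosaic, frame_to_mosaic, frame_shape):
        """Locate a template once in the mosaic, then retrieve the viewing
        frames via location-based indexing (frames_containing, above)."""
        scores = cv2.matchTemplate(mosaic, template, cv2.TM_CCOEFF_NORMED)
        _, best, _, top_left = cv2.minMaxLoc(scores)  # max of the score map
        th, tw = template.shape[:2]
        center = (top_left[0] + tw / 2.0, top_left[1] + th / 2.0)
        # One search over the mosaic replaces a search over every frame.
        return center, best, frames_containing(center, frame_to_mosaic,
                                               frame_shape)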
1) Editing and Annotation: The compact mosaic representation can be used not only to access video frames but also to edit, annotate, and manipulate these frames. For example, the same mechanism used for indexing is also used to efficiently inherit annotations from the mosaic image onto scene locations in the video frames. The annotation is specified by the user just once on the mosaic image, rather than tediously specifying it for each and every frame. This can be further extended to efficiently edit video clips by inserting or deleting an object in the mosaic image, hence inserting or deleting that object in all corresponding video frames.
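A minimal sketch of the inheritance step for a static scene point, under the 2-D parametric model (the label-drawing call and all names are our own):

    import cv2
    import numpy as np

    def inherit_annotation(label, mosaic_xy, frames, frame_to_mosaic):
        """Annotate a scene point once (in mosaic coordinates) and stamp
        the label into every frame whose field of view contains it."""
        h, w = frames[0].shape[:2]
        annotated = []
        for frame, H in zip(frames, frame_to_mosaic):
            p = np.linalg.inv(H) @ np.array([*mosaic_xy, 1.0])
            x, y = p[0] / p[2], p[1] / p[2]
            out = frame.copy()
            if 0 <= x < w and 0 <= y < h:  # point visible in this frame
                cv2.putText(out, label, (int(x), int(y)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
            annotated.append(out)
        return annotated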
Fig. 7 graphically illustrates a video annotation process. Fig. 8 shows an example of annotating airborne video of an airport scene.
C. Dynamic (Moving-Objects)-Based Indexing

Since the synopsis mosaic provides a snapshot view of an entire dynamic event, it can be used for indexing based on temporal events. In the synopsis mosaic, the motion of an object is represented as a trajectory in the common coordinate system; hence, the temporal event has been transformed into a spatial representation. Marking a segment on the trajectory is thus equivalent to marking a time interval, which enables access and display of all frames in this time interval.

More specifically, all frames containing a selected moving object can be immediately determined and accessed, as can the location of the moving object in each of these frames. The user can select an object of interest whose track is marked on the synopsis mosaic. Since the trajectories of the moving objects in the mosaic coordinate system are precomputed (as well as which point on the trajectory corresponds to which frame), all frames containing that object are immediately accessed and viewed. The location of that object in each frame is estimated through the basic geometric coordinate transformations (those that correspond to the camera-induced motion). In a similar manner, the moving objects in the video frames are efficiently annotated or manipulated by annotating the synopsis mosaic, without the need for the user to perform the operation repeatedly on a frame-by-frame basis (see the sketch following Fig. 7's caption).

`Fig. 7. Location-based annotation. Annotation of a selected scene point in the mosaic image leads
`to automatic annotation of all relevant frames (i; j; and k) with the selected annotation and at the
`appropriate image coordinate, i.e., that which corresponds to the selected scene point in each of
`the frames.
`
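Because each trajectory point already carries its frame number, indexing by a marked trajectory segment reduces to a lookup. A minimal sketch under our assumed {frame_index: (x, y)} trajectory layout:

    def frames_for_trajectory_segment(trajectory, start_frame, end_frame):
        """Frames (with the object's mosaic-coordinate location in each)
        covered by a marked segment of a trajectory.

        Marking a spatial segment on the synopsis mosaic is equivalent to
        marking the time interval [start_frame, end_frame].
        """
        return {f: xy for f, xy in sorted(trajectory.items())
                if start_frame <= f <= end_frame}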
Fig. 9 shows an example of annotating moving objects using the plane video, whose synopsis mosaic was shown in Fig. 4. The figure displays the selected annotations on the synopsis mosaic. Representative output frames are shown, in which the annotations are automatically inherited from the mosaic. Note that the annotations “move” together with the moving objects.

Fig. 10 shows an example of video annotation using the airborne parachuters video. The figure displays the selected annotations on top of the synopsis mosaic image. Both moving objects and stationary scene points are annotated. Representative frames from the automatically annotated video clip are also displayed. Note that annotations of moving objects “move” together with the moving objects, while annotations of static scene points (e.g., “building”) remain stationary with respect to the background scene (i.e., they preserve the background motion induced by the moving camera).
Note also that estimating the trajectories of moving objects in the common mosaic coordinate system allows more reliable detection and tracking of moving objects, even when they are very small (such as the three parachuters in Fig. 5). This is because a “temporal coherence” constraint can be used during moving-object detection and tracking after removal of the background motion. Assuming that object velocities do not change too rapidly, the detection of moving objects within each frame can be guided by the trajectory of the objects in a few previous frames. This leads to better separation between small moving objects and noise, and it enables recovery from losing an object for a few frames (e.g., due to occlusion or bad detection). The missing portion of the trajectory is smoothly interpolated or extrapolated from the neighboring frames.
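The gap-filling step can be as simple as linear interpolation between the trajectory points that bracket the missing frames. The paper does not specify the scheme beyond “smoothly,” so the sketch below is our own simplification:

    import numpy as np

    def fill_trajectory_gaps(trajectory):
        """Linearly interpolate missing frames in a mosaic-coordinate
        trajectory (e.g., where an object was briefly lost to occlusion).

        trajectory: dict mapping frame index -> (x, y); a gap is simply a
        run of absent keys between two observed frames.
        """
        frames = sorted(trajectory)
        filled = dict(trajectory)
        for f0, f1 in zip(frames, frames[1:]):
            if f1 - f0 > 1:  # the object was unobserved in between
                p0, p1 = np.array(trajectory[f0]), np.array(trajectory[f1])
                for f in range(f0 + 1, f1):
                    t = (f - f0) / (f1 - f0)
                    filled[f] = tuple((1 - t) * p0 + t * p1)
        return filled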
IV. BUILDING THE SCENE-BASED REPRESENTATION

In Section II-B, we introduced the basic components of the scene-based representation. In this section, we provide the details of the scene-based representation (Section IV-A), followed by a review of the methods used for its construction (Sections IV-B and IV-C). This section serves mainly as a review of methods that have been previously published; these methods are briefly outlined here in order to make the paper self-contained.

A. The Detailed Scene-Based Representation

1) Panoramic View: The panoramic view of the scene is captured by one or several mosaic images. We present a hierarchy of such mosaic representations. The hierarchy corresponds to increasing complexity levels in the camera motion and in the 3-D scene structure.
Fig. 8. Annotation of the airport video clip. (a) A stationary car is annotated once on the mosaic image (“car”). (b) A few representative frames from the video clip with the annotations inherited from the mosaic image. The annotations are incorporated into the video frames automatically and instantly through the geometric coordinate transformations that map each frame onto the mosaic image. Some video frames from the raw video clip are displayed in Fig. 1.
a) 2-D parametric mosaic image: The simplest representation is a mosaic image constructed by aligning all the frames to a single coordinate system using 2-D parametric coordinate transformations. We refer to such a mosaic as a 2-D parametric mosaic image. The cases in which the camera-induced motion can be modeled as a 2-D parametric transformation can be divided broadly into three categories (see Section IV-B1): 1) when the translational motion of the camera is negligible, i.e., the camera motion can be approximated by only 3-D rotations and zooms; 2) when the scene is planar; or 3) when the 3-D scene is sufficiently distant from the camera that it can be approximated by a nearly flat 2-D surface. We refer to these scenarios as 2-D scenes.

The examples given in Section III belong to this class of scenarios. For example, the baseball sequence (Fig. 2) was captured by a panning camera (i.e., pure rotation), while the other sequences in that section (Figs. 1, 4, and 5) were taken by an airborne camera; hence, the scene was sufficiently distant from the camera and could be well approximated by a flat 2-D surface.
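The paper's own alignment techniques are reviewed in Section IV-B. Purely as an illustrative stand-in, the sketch below estimates a 2-D parametric (projective) transform between two frames with feature matching and RANSAC in OpenCV; this is a feature-based alternative, not the direct estimation the authors use:

    import cv2
    import numpy as np

    def estimate_frame_to_frame_transform(img0, img1):
        """Estimate a 3x3 projective transform mapping img0 into img1."""
        orb = cv2.ORB_create(2000)
        k0, d0 = orb.detectAndCompute(img0, None)
        k1, d1 = orb.detectAndCompute(img1, None)
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(d0, d1), key=lambda m: m.distance)
        src = np.float32([k0[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([k1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        # Composing such frame-to-frame transforms over the sequence yields
        # each frame's transform into the common mosaic coordinate system.
        return H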
b) Plane + parallax representation: The next level of complexity arises when the 3-D deviations from the 2-D planar surface approximation (when combined with the camera translation) result in measurable parallax image motion relative to the surface. In this case, the visual appearance of the scene is still captured by a mosaic image, as in the previous case, while the geometric component of the representation also encodes the 3-D parallax relative to the planar surface (see Section IV-B2). The parallax information is captured in the geometric component of the representation and is taken into account while combining the different frames into a single mosaic [12]. We refer to this representation as the plane + parallax representation.
Fig. 9. Annotation of the flying plane video clip. (a) The annotations are defined once on the synopsis