Pattern Recognition, Vol. 30, No. 4, pp. 607-625, 1997
© 1997 Pattern Recognition Society. Published by Elsevier Science Ltd
Printed in Great Britain. All rights reserved
0031-3203/97 $17.00+.00

PII: S0031-3203(96)00107-0
AUTOMATIC VIDEO INDEXING VIA OBJECT MOTION ANALYSIS

JONATHAN D. COURTNEY*
Texas Instruments, Incorporated, 8330 LBJ Freeway, M/S 8374, Dallas, Texas 75243, U.S.A.
* E-mail: courtney@csc.ti.com.

(Received 12 June 1996; received for publication 30 July 1996)
Abstract-To assist human analysis of video data, a technique has been developed to perform automatic, content-based video indexing from object motion. Moving objects are detected in the video sequence using motion segmentation methods. By tracking individual objects through the segmented data, a symbolic representation of the video is generated in the form of a directed graph describing the objects and their movement. This graph is then annotated using a rule-based classification scheme to identify events of interest, e.g., appearance/disappearance, deposit/removal, entrance/exit, and motion/rest of objects. One may then use an index into the motion graph instead of the raw data to analyse the semantic content of the video. Application of this technique to surveillance video analysis is discussed. © 1997 Pattern Recognition Society. Published by Elsevier Science Ltd.
Video indexing    Object tracking    Motion analysis    Content-based retrieval
1. INTRODUCTION

Advances in multimedia technology, including commercial prospects for video-on-demand and digital library systems, have generated recent interest in content-based video analysis. Video data offers users of multimedia systems a wealth of information; however, it is not as readily manipulated as other data such as text. Raw video data has no immediate "handles" by which the multimedia system user may analyse its contents. By annotating video data with symbolic information describing the semantic content, one may facilitate analysis beyond simple serial playback.
To assist human analysis of video data, a technique has been developed to perform automatic, content-based video indexing from object motion. Moving objects are detected in the video sequence using motion segmentation methods. By tracking individual objects through the segmented data, a symbolic representation of the video is generated in the form of a directed graph describing the objects and their movement. This graph is then annotated using a rule-based classification scheme to identify events of interest, e.g., appearance/disappearance, deposit/removal, entrance/exit, and motion/rest of objects. One may then use an index into the motion graph instead of the raw data to analyse the semantic content of the video.

We have developed a system that demonstrates this indexing technique in assisted analysis of surveillance video data. The Automatic Video Indexing (AVI) system allows the user to select a video sequence of interest, play it forward or backward and stop at individual frames. Furthermore, the user may specify queries on video sequences and "jump" to events of interest to avoid tedious serial playback. For example, the user may select a person in a video sequence and specify the query "show me all objects that this person removed from the scene". In response, the AVI system assembles a set of video "clips" highlighting the query results. The user may select a clip of interest and proceed with further video analysis using queries or playback as before.
The remainder of this paper is organized as follows: Section 2 discusses content-based video analysis. Section 3 presents a video indexing technique based on object motion analysis. Section 4 describes a system which implements this video indexing technique for scene monitoring applications. Section 5 presents experimental results using the system. Section 6 concludes the paper.
2. CONTENT-BASED VIDEO ANALYSIS

Video data poses unique problems for multimedia information systems that text does not. Textual data is a symbolic abstraction of the spoken word that is usually generated and structured by humans. Video, on the other hand, is a direct recording of visual information. In its raw and most common form, video data is subject to little human-imposed structure, and thus has no immediate "handles" by which the multimedia system user may analyse its contents.

For example, consider an online movie screenplay (textual data) and a digitized movie (video and audio data). If one were analysing the screenplay and interested in searching for instances of the word "horse" in the text, various text searching algorithms could be employed to locate every instance of this symbol as desired. Such analysis is common in online text databases. If, however, one were interested in searching for every scene in the digitized movie where a horse appeared, the task is much more difficult. Unless a human performs some sort of
pre-processing of the video data, there are no symbolic keys on which to search. For a computer to assist in the search, it must analyse the semantic content of the video data itself. Without such capabilities, the information available to the multimedia system user is greatly reduced.

Thus, much research in video analysis focuses on semantic content-based search and retrieval techniques. Video indexing refers to the process of identifying important frames or objects in the video data for efficient playback. An indexed video sequence allows a user not only to play the sequence in the usual serial fashion, but also to "jump" to points of interest while it plays. A common indexing scheme is to employ scene cut detection(1) to determine breakpoints in the video data. Indexing has also been performed based on camera (i.e. viewpoint) motion(2) and object motion.(3,4)

Using breakpoints found via scene cut detection, other researchers have pursued hierarchical segmentation(5-7) to analyse the logical organization of video sequences. In the same way that text is organized into sentences, paragraphs, and chapters, the goal of these techniques is to determine a hierarchical grouping of video sub-sequences. Combining this structural information with content abstractions of segmented sub-sequences(8) provides multimedia system users a top-down view of video data.

The indexing technique described in this paper (the "AVI technique") performs video indexing based on object motion analysis. Unlike previous work, it forms semantically high-level interpretations of object actions and interactions from the object motion information. This allows multimedia system users to search for object-motion "events" in the video sequence (such as object entrance or exit) rather than features related to object velocity alone (such as "northeast movement").
3. VIDEO INDEXING VIA OBJECT MOTION ANALYSIS

Given a video sequence, the AVI technique analyses the motion of foreground objects in the data and indexes the objects to indicate the occurrence of several events of interest. It outputs a symbolic abstraction of the video content in the form of an annotated directed graph containing the indexed objects. This symbolic data may then be read by a user interface to perform content-based queries on the video data.

The AVI technique processes the video data in three stages: motion segmentation, object tracking, and motion analysis. First, motion segmentation methods(9,10) are used to segment moving foreground objects from the scene background in each frame. Next, each object is tracked through successive video frames, resulting in a graph describing object motion and path intersections. Then the motion graph is scanned for the occurrence of several events of interest. This is performed using a rule-based classifier which employs knowledge concerning object motion and the output of the previous stages to characterize the activity of the objects recorded in the graph. For example, a moving object that occludes another object results in a "disappear" event; a moving object that intersects and then removes a stationary object results in a "removal" event. An index is then created which identifies the location of each event in the video sequence.
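For concreteness, the dataflow of the three stages can be sketched as a short Python driver. This is an editorial illustration only: the stage callables and the dictionary-based event index are assumptions for exposition, not details of the AVI implementation.

def index_video(frames, timestamps, reference_image, event_types,
                segment_motion, track_objects, annotate_events):
    """Run the three AVI stages and build an event index (sketch).

    The three stage functions are passed in; see Sections 3.2-3.4.
    """
    # Stage 1: motion segmentation -- label moving foreground regions.
    segmented = [segment_motion(image, reference_image) for image in frames]

    # Stage 2: object tracking -- link regions across frames into a
    # directed graph of V-objects (nodes) and primary/secondary links (arcs).
    graph = track_objects(segmented, timestamps)

    # Stage 3: motion analysis -- group V-objects, tag events of interest,
    # then index tagged V-objects by event type.
    annotate_events(graph)
    return graph, {event: [v for v in graph.nodes if event in v.tags]
                   for event in event_types}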
Figure 1 depicts the relation between the video data, motion segmentation information, and the motion graph. Note that for each frame of the video, the AVI technique creates a corresponding symbolic "frame" to describe it.
3.1. Terminology and notation

The following is a description of some of the terms and notation used in the subsequent sections:

• A sequence S is an ordered set of N frames, denoted S = {F_0, F_1, ..., F_{N-1}}, where F_n is frame number n in the sequence.
• A clip is a 4-tuple C = (S, f, s, l), where S is a sequence with N frames, and f, s, and l are frame numbers such that 0 ≤ f ≤ s ≤ l ≤ N − 1. Here, F_f and F_l are the first and last valid frames in the clip, and F_s is the "start" frame. Thus, a clip specifies a sub-sequence and contains a state variable to indicate a "frame of interest".
Fig. 1. Relation between video data, motion segmentation information, and the symbolic motion graph.
• A frame F is an image I annotated with a timestamp t. Thus, frame number n is denoted by the pair F_n = (I_n, t_n).
• An image I is an r × c array of pixels. The notation I(i, j) indicates the pixel at coordinates (row i, column j). For purposes of this discussion, a pixel is assumed to be an intensity value between 0 and 255.
• A timestamp records the date and time that an image was digitized.
3.2. Motion segmentation

For each frame F_n in the sequence, the motion segmentation stage computes segmented image C_n as

    C_n = ccomps(T_h · k),

where T_h is the binary image resulting from thresholding the absolute difference of images I_n and I_0 at h, T_h · k is the morphological close operation(12) on T_h with structuring element k, and the function ccomps(·) performs connected components analysis,(11) resulting in a unique label for each connected region in image T_h · k. The image T_h is defined as

    T_h(i, j) = { 1 if |I_n(i, j) − I_0(i, j)| ≥ h,
                { 0 otherwise,

for all pixels (i, j) in I_n.
Figure 2 shows an example of this process. Absolute differencing and thresholding [Fig. 2(c) and (d)] detect motion regions in the image. The morphological close operation shown in Fig. 2(e) joins together small regions into smoothly-shaped objects. Connected components analysis assigns each detected object a unique label, as shown in Fig. 2(f). Components smaller than a given size threshold are discarded. The result is C_n, the output of the motion segmentation stage.
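This stage maps directly onto standard image processing primitives. The following is a minimal sketch using OpenCV; the threshold h, structuring element, and minimum component area shown are illustrative choices, not values from the paper.

import cv2
import numpy as np

def segment_motion(image_n, image_0, h=30, kernel_size=5, min_area=50):
    """Compute C_n from grayscale uint8 images I_n and I_0 (sketch)."""
    # T_h: threshold the absolute difference |I_n - I_0| at h.
    diff = cv2.absdiff(image_n, image_0)
    _, t_h = cv2.threshold(diff, h, 255, cv2.THRESH_BINARY)

    # T_h . k: morphological close with structuring element k joins
    # small detected regions into smoothly-shaped objects.
    k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    closed = cv2.morphologyEx(t_h, cv2.MORPH_CLOSE, k)

    # ccomps(.): connected components analysis labels each region uniquely.
    n_labels, c_n, stats, _ = cv2.connectedComponentsWithStats(closed)

    # Discard components smaller than the size threshold.
    for label in range(1, n_labels):
        if stats[label, cv2.CC_STAT_AREA] < min_area:
            c_n[c_n == label] = 0
    return c_n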
The motion segmentation technique described here is best suited for video sequences containing object motion within an otherwise static scene, such as in surveillance and scene monitoring applications. Note that the technique uses a "reference image" for processing. This is nominally the first image from the sequence, I_0. For many applications, the assumption of an available reference image is not unreasonable; video capture is simply initiated from a fixed-viewpoint camera when there is limited motion in the scene. Following are some reasons why this assumption may fail in other applications:

1. Sudden lighting changes may render the reference frame invalid. However, techniques such as scene cut detection(1) may be used to detect such occurrences and indicate when a new reference image must be acquired.
2. Gradual lighting changes may cause the reference image to slowly grow "out of date" over long video sequences, particularly in outdoor scenes. Here, more sophisticated techniques involving cumulative differences of successive video frames(13) must be employed.
3. The viewpoint may change due to camera motion. In this case, camera motion compensation(14) must be used to offset the effect of an apparent moving background.
4. An object may be present in the reference frame and move during the sequence. This causes the motion segmentation process to incorrectly detect the background region exposed by the object as if it were a newly-appearing stationary object in the scene.
A straightforward solution to problem 4 is to apply a test to non-moving regions detected by the motion segmentation process based on the following observation: if
Fig. 2. Motion segmentation example. (a) Reference image I_0. (b) Image I_n. (c) Absolute difference |I_n − I_0|. (d) Thresholded image T_h. (e) Result of morphological close operation. (f) Result of connected components analysis.
the region detected by the segmentation of image I_n is due to the motion of an object present in the reference image (i.e. due to "exposed background"), a high probability exists that the boundary of the segmented region will coincide with intensity edges detected in I_0. If the region is due to the presence of a foreground object in the current image, a high probability exists that the region boundary will coincide with intensity edges in I_n. The test is implemented by applying an edge detection operator to the current and reference images and checking for coincident boundary pixels in the segmented region of C_n.(9) Figure 3 shows this process. If the test supports the hypothesis that the region in question is due to exposed background, the reference image is modified by replacing the object with its exposed background region (see Fig. 4).
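A sketch of this test follows, assuming 8-bit grayscale images and a binary region mask; the Sobel edge threshold is an illustrative parameter, not a value from the paper.

import cv2
import numpy as np

def is_exposed_background(region_mask, image_0, image_n, edge_thresh=100):
    """True if the region boundary matches edges in I_0 better than in I_n."""
    # Boundary pixels of the segmented region: mask minus its erosion.
    kernel = np.ones((3, 3), np.uint8)
    boundary = region_mask & ~cv2.erode(region_mask, kernel)

    def edge_map(img):
        # Thresholded Sobel gradient magnitude as a binary edge map.
        gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
        return cv2.magnitude(gx, gy) > edge_thresh

    # Count boundary pixels coincident with edges in each image.
    on_reference = np.count_nonzero(edge_map(image_0) & (boundary > 0))
    on_current = np.count_nonzero(edge_map(image_n) & (boundary > 0))
    return on_reference > on_current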
No motion segmentation technique is perfect. The following are errors typical of many motion segmentation techniques:

1. True objects will disappear temporarily from the motion segmentation record. This occurs when there is insufficient contrast between an object and an occluded background region, or if an object is partially occluded by a "background" structure (for instance, a tree or pillar present in the scene).
2. False objects will appear temporarily in the motion segmentation record. This is caused by light fluctuations or shadows cast by moving objects.
3. Separate objects will temporarily join together. This typically occurs when two or more objects are in close proximity or when one object occludes another object.
4. Single objects will split into multiple regions. This occurs when a portion of an object has insufficient contrast with the background it occludes.
Instead of applying incremental improvements to relieve the shortcomings of motion segmentation, the AVI technique addresses these problems at a higher level where information about the semantic content of the video data is more readily available. The object tracking and motion analysis stages described in Sections 3.3 and 3.4 employ object trajectory estimates and knowledge concerning object motion and typical motion segmentation errors to construct a more accurate representation of the video content.
3.3. Object tracking

The motion segmentation output is processed by the object tracking stage. Given a segmented image C_n with P uniquely-labeled regions corresponding to foreground objects in the video, the system generates a set of features to represent each region. This set of features is named a "V-object" (video-object), denoted V_n^p, p = 1, ..., P. A V-object contains the label, centroid, bounding box, and shape mask of its corresponding region, as well as object velocity and trajectory information generated by the tracking process.
V-objects are then tracked through the segmented video sequence. Given segmented images C_n and C_{n+1} with V-objects V_n = {V_n^p; p = 1, ..., P} and V_{n+1} = {V_{n+1}^q; q = 1, ..., Q}, respectively, the motion tracking process "links" V-objects V_n^p and V_{n+1}^q if their position and estimated velocity indicate that they correspond to the same real-world object appearing in frames F_n and F_{n+1}. This is determined using linear prediction of V-object positions and a "mutual nearest neighbor" criterion via the following procedure:

1. For each V-object V_n^p ∈ V_n, predict its position in the next frame using

       μ̂_n^p = μ_n^p + v_n^p · (t_{n+1} − t_n),

   where μ̂_n^p is the predicted centroid of V_n^p in C_{n+1}, μ_n^p the centroid of V_n^p measured in C_n, v_n^p the estimated (forward) velocity of V_n^p, and t_{n+1} and t_n are the timestamps of frames F_{n+1} and F_n, respectively. Initially, the velocity estimate is set to v_n^p = (0, 0).

2. For each V_n^p ∈ V_n, determine the V-object in the next frame with centroid nearest μ̂_n^p. This "nearest neighbor" is denoted 𝒩V_n^p. Thus,

       𝒩V_n^p = V_{n+1}^r  such that  ||μ̂_n^p − μ_{n+1}^r|| ≤ ||μ̂_n^p − μ_{n+1}^q||  ∀ q ≠ r.

3. For every pair (V_n^p, 𝒩V_n^p = V_{n+1}^r) for which no other V-object in V_n has V_{n+1}^r as a nearest neighbor, estimate v_{n+1}^r, the (forward) velocity of V_{n+1}^r, as

       v_{n+1}^r = (μ_{n+1}^r − μ_n^p) / (t_{n+1} − t_n);    (1)

   otherwise, set v_{n+1}^r = (0, 0).
These steps are performed for each C_n, n = 0, 1, ..., N − 2. Steps 1 and 2 find nearest neighbors in the subsequent frame for each V-object. Step 3 generates velocity estimates for V-objects that can be unambiguously tracked; this information is used in step 1 to predict V-object positions for the next frame.
Next, steps 1-3 are repeated for the reverse sequence, i.e. C_n, n = N − 1, N − 2, ..., 1. This results in a new set of predicted centroids, velocity estimates, and nearest neighbors for each V-object in the reverse direction. Thus, the V-objects are tracked both forward and backward through the sequence. The remaining steps are then performed:

4. V-objects V_n^p and V_{n+1}^q are mutual nearest neighbors if 𝒩V_n^p = V_{n+1}^q and 𝒩V_{n+1}^q = V_n^p. (Here, 𝒩V_n^p is the nearest neighbor of V_n^p in the forward direction, and 𝒩V_{n+1}^q is the nearest neighbor of V_{n+1}^q in the reverse direction.) For each pair of mutual nearest neighbors (V_n^p, V_{n+1}^q), create a primary link from V_n^p to V_{n+1}^q.
5. For each V_n^p ∈ V_n without a mutual nearest neighbor, create a secondary link from V_n^p to 𝒩V_n^p if the predicted centroid μ̂_n^p is within ε of 𝒩V_n^p, where ε is some small distance.
6. For each V_{n+1}^q ∈ V_{n+1} without a mutual nearest neighbor, create a secondary link from 𝒩V_{n+1}^q to V_{n+1}^q if the predicted centroid μ̂_{n+1}^q is within ε of 𝒩V_{n+1}^q.
The object tracking procedure uses the mutual nearest neighbor criterion (step 4) to estimate frame-to-frame V-object trajectories with a high degree of confidence.
Fig. 3. Exposed background detection. (a) Reference image I_0. (b) Image I_n. (c) Region to be tested. (d) Edge image of (a), found using the Sobel(11) operator. (e) Edge image of (b). (f) Edge image of (c), showing boundary pixels. (g) Pixels coincident in (d) and (f). (h) Pixels coincident in (e) and (f). The greater number of coincident pixels in (g) versus (h) supports the hypothesis that the region in question is due to exposed background.
Fig. 4. Reference image modified to account for the exposed background region detected in Fig. 3.

Pairs of mutual nearest neighbors are connected using a "primary" link to indicate that they are highly likely to represent the same real-world object in successive video frames.

Steps 5 and 6 associate V-objects that are tracked with less confidence but display evidence that they might result from the same real-world object. Thus, these objects are joined by "secondary" links. These steps are necessary to account for the "split" and "join" type motion segmentation errors described in Section 3.2.
The object tracking process results in a list of V-objects and connecting links that form a directed graph (digraph) representing the position and trajectory of foreground objects in the video sequence. Thus, the V-objects are the nodes of the graph and the connecting links are the arcs. This motion graph is the output of the object tracking stage.
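The forward pass of steps 1-3 reduces to a few lines of code. The sketch below assumes each V-object carries a centroid mu, a velocity estimate v, and a nearest-neighbor pointer nn (field names are ours); the backward pass is identical with the frame order reversed, after which mutual nearest neighbors receive primary links and near-misses within ε receive secondary links.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class VObject:
    """Minimal V-object fields needed for tracking (illustrative)."""
    mu: np.ndarray                                    # centroid
    v: np.ndarray = field(default_factory=lambda: np.zeros(2))
    nn: "VObject" = None                              # nearest neighbor

def forward_pass(frames_vobjects, timestamps):
    """Steps 1-3; frames_vobjects[n] holds the V-objects of frame n."""
    for n in range(len(frames_vobjects) - 1):
        current, following = frames_vobjects[n], frames_vobjects[n + 1]
        dt = timestamps[n + 1] - timestamps[n]
        if not following:
            continue
        # Steps 1-2: predict each centroid, then take the nearest neighbor.
        for vo in current:
            predicted = vo.mu + vo.v * dt
            vo.nn = min(following,
                        key=lambda w: np.linalg.norm(predicted - w.mu))
        # Step 3: velocity estimates only for unambiguous one-to-one matches.
        for w in following:
            claimants = [vo for vo in current if vo.nn is w]
            w.v = ((w.mu - claimants[0].mu) / dt if len(claimants) == 1
                   else np.zeros(2))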
Figure 5 shows a motion graph for a hypothetical sequence of one-dimensional frames. Here, the system detects the appearance of an object at A and tracks it to the V-object at B. Due to an error in motion segmentation, the object splits at D and E, and joins at F. At G, the object joins with the object tracked from C due to occlusion. These objects split at H and I. Note that primary links connect the V-objects that were most reliably tracked.
3.4. Motion analysis

The motion analysis stage analyses the results of object tracking and annotates the motion graph with tags describing several events of interest. This process proceeds in two parts: V-object grouping and V-object indexing. Figure 6 shows an example motion graph for a hypothetical sequence of 1-D frames discussed in the following sections.
Fig. 5. The output of the object tracking stage for a hypothetical sequence of 1-D frames. The vertical lines labeled "Fn" represent frame number n. Primary links are shown as solid arcs; secondary links are shown as dashed arcs.
Fig. 6. An example motion graph for a sequence of 1-D frames.
3.4.1. V-object grouping. First, the motion analysis stage hierarchically groups V-objects into structures representing the paths of objects through the video data. Using graph theory terminology,(15) five groupings are defined for this purpose:

A stem M = {V_i : i = 1, 2, ..., N_M} is a maximal-size, directed path (dipath) of two or more V-objects containing no secondary links, meeting all of the following conditions:

• outdegree(V_i) = 1 for 1 ≤ i < N_M,
• indegree(V_i) = 1 for 1 < i ≤ N_M, and
• either the centroids μ_i of the V-objects in M all remain within a small distance δ of one another (equation (2)), or they do not (equation (3)),

where μ_i is the centroid of V-object V_i ∈ M.

Thus, a stem represents a simple trajectory of an object through two or more frames. Figure 7 labels V-objects from Fig. 6 belonging to separate stems with the letters "A" through "K".

Stems are used to determine the motion "state" of real-world objects, i.e. whether they are moving or stationary. If equation (2) is true, then the stem is classified as stationary; if equation (3) is true, then the stem is classified as moving. Figure 7 highlights stationary stems B, C, F, and H; the remainder are moving.
A branch B = {V_i : i = 1, 2, ..., N_B} is a maximal-size dipath of two or more V-objects containing no secondary links, for which outdegree(V_i) = 1 for 1 ≤ i < N_B and indegree(V_i) = 1 for 1 < i ≤ N_B. Figure 8 labels V-objects belonging to branches with the letters "L" through "T". A branch represents a highly reliable trajectory estimate of an object through a series of frames.

If a branch consists entirely of a single stationary stem, then it is classified as stationary; otherwise, it is classified as moving. Branches "N" and "Q" in Fig. 8 (highlighted) are stationary; the remainder are moving.
A trail L is a maximal-size dipath of two or more V-objects that contains no secondary links. This grouping represents the object tracking stage's best estimate of an object trajectory using the mutual nearest neighbor criterion. Figure 9 labels V-objects belonging to trails with the letters "U" through "Z".

A trail and the V-objects it contains are classified as stationary if all the branches it contains are stationary, and moving if all the branches it contains are moving. Otherwise, the trail is classified as unknown. Trail W in Fig. 9 is stationary; the remainder are moving.
Fig. 7. Stems. Stationary stems are highlighted.
Fig. 8. Branches. Stationary branches are highlighted.
Fig. 9. Trails.
A track K = {L_1, G_1, ..., L_{N_K−1}, G_{N_K−1}, L_{N_K}} is a dipath of maximal size containing trails {L_i : 1 ≤ i ≤ N_K} and connecting dipaths {G_i : 1 ≤ i < N_K}. For each G_i ∈ K there must exist a dipath

    H = {V_i^l, G_i, V_{i+1}^f}

(where V_i^l is the last V-object in L_i, and V_{i+1}^f is the first V-object in L_{i+1}), such that every V_j ∈ H meets the requirement

    ||μ_i^l + v_i^l · (t_j − t_i) − μ_j|| ≤ ε,    (4)

where μ_i^l is the centroid of V_i^l, v_i^l the forward velocity of V_i^l, (t_j − t_i) the time difference between the frames containing V_j and V_i^l, and μ_j is the centroid of V_j. Thus, equation (4) specifies that the object must maintain a constant velocity through path H.

A track represents the trajectory estimate of an object that may cause or undergo occlusion one or more times in a sequence. The motion analysis stage uses equation (4) to attempt to follow an object through frames where an occlusion occurs. Figure 10 labels V-objects belonging to tracks with the letters "α", "β", "χ", "δ", and "ε". Note that track δ joins trails X and Y.
A track and the V-objects it contains are classified as stationary if all the trails it contains are stationary, and moving if all the trails it contains are moving. Otherwise, the track is classified as unknown. Track χ in Fig. 10 is stationary; the remaining tracks are moving.
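As a sketch, the constant-velocity requirement of equation (4), as reconstructed above, can be checked by extrapolating from the end of trail L_i and gating each V-object on the connecting dipath; the field names (mu, v, t) are assumptions for illustration.

import numpy as np

def maintains_constant_velocity(last_vo, dipath_h, epsilon):
    """Test the dipath H = {V_i^l, G_i, V_{i+1}^f} against equation (4)."""
    for vj in dipath_h:
        # Position extrapolated at constant velocity from the end of L_i.
        predicted = last_vo.mu + last_vo.v * (vj.t - last_vo.t)
        if np.linalg.norm(predicted - vj.mu) > epsilon:
            return False   # the object deviates: trails are not joined
    return True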
A trace is a maximal-size, connected digraph of V-objects. A trace represents the complete trajectory of an object and all the objects with which it intersects. Thus, the motion graph in Fig. 6 contains two traces: one trace extends from F_2 to F_7; the remaining V-objects form a second trace. Figure 11 labels V-objects on these traces with the numbers "1" and "2".

Note that the preceding groupings are hierarchical, i.e. for every trace E, there exists at least one track K, trail L, branch B, and stem M such that E ⊇ K ⊇ L ⊇ B ⊇ M. Furthermore, every V-object is a member of exactly one trace.

The motion analysis stage scans the motion graph generated by the object tracking stage and groups V-objects into stems, branches, trails, tracks, and traces.
Fig. 10. Tracks. The dipath connecting trails X and Y from Fig. 9 is highlighted.
Fig. 11. Traces.
Thus, these five definitions are used to characterize object trajectories in various portions of the motion graph. This information is then used to index the video according to its object motion content.
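As one example of the grouping pass, the sketch below extracts branch-like chains from the motion graph using networkx; evaluating the degree conditions over the primary-link subgraph is our simplifying assumption, not a detail from the paper.

import networkx as nx

def branches(graph):
    """graph: nx.DiGraph whose edges carry a boolean 'primary' attribute."""
    # Work on the subgraph of primary links only (simplifying assumption).
    primary = nx.DiGraph((u, v) for u, v, d in graph.edges(data=True)
                         if d.get("primary"))
    found = []
    for node in primary.nodes:
        preds = list(primary.predecessors(node))
        # A branch head is a node whose chain cannot extend backwards.
        if len(preds) == 1 and primary.out_degree(preds[0]) == 1:
            continue
        path = [node]
        # Extend forwards while the degree conditions of a branch hold.
        while primary.out_degree(path[-1]) == 1:
            succ = next(iter(primary.successors(path[-1])))
            if primary.in_degree(succ) != 1:
                break
            path.append(succ)
        if len(path) >= 2:
            found.append(path)
    return found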
3.4.2. V-object indexing. Eight events of interest are defined to designate various object-motion events in a video sequence:

Appearance: An object emerges in the scene.
Disappearance: An object disappears from the scene.
Entrance: A moving object enters the scene.
Exit: A moving object exits from the scene.
Deposit: An inanimate object is added to the scene.
Removal: An inanimate object is removed from the scene.
Motion: An object at rest begins to move.
Rest: A moving object comes to a stop.
These eight events are sufficiently broad for a video indexing system to assist the analysis of many sequences. For example, valuable objects such as inventory boxes, tools, computers, etc., can be monitored for theft (i.e. removal) in a security monitoring application. Likewise, the traffic patterns of automobiles can be analysed (e.g., entrance/exit and motion/rest), or the shopping patterns of retail customers recorded (e.g., motion/rest and removal).
After the V-object grouping process is complete, the motion analysis stage has all the semantic information necessary to identify these eight events in a video sequence. For each V-object V in the graph, the following rules are applied to annotate the nodes of the motion graph with event tags:

1. If V is moving, the first V-object in a track (i.e. the "head"), and indegree(V) > 0, place a tag designating an appearance event at V.
2. If V is stationary, the head of a track, and indegree(V) = 0, place a tag designating an appearance event at V.
3. If V is moving, the last V-object in a track (i.e. the "tail"), and outdegree(V) > 0, place a disappearance event tag at V.
4. If V is stationary, the tail of a track, and outdegree(V) = 0, place a disappearance event tag at V.
5. If V is non-stationary (i.e. moving or unknown), the head of a track, and indegree(V) = 0, place an entrance event tag at V.
6. If V is non-stationary, the tail of a track, and outdegree(V) = 0, place an exit event tag at V.
7. If V is stationary, the head of a track, and indegree(V) = 1, place a deposit event tag at V.
8. If V is stationary, the tail of a track, and outdegree(V) = 1, place a removal event tag at V.
Rules 1-8 use track groupings to annotate the video at the beginning and end of individual object trajectories. Note, however, that rules 7 and 8 only account for the object deposited or removed from the scene; they do not tag the V-object that caused the deposit or removal event to occur. For this purpose, we define two additional events:

Depositor: A moving object adds an inanimate object to the scene.
Remover: A moving object removes an inanimate object from the scene.

We then apply two more rules:

9. If V is moving and adjacent to a V-object with a deposit event tag, place a depositor event tag at V.
10. If V is moving and adjacent from a V-object with a removal event tag, place a remover event tag at V.

The additional events depositor and remover are used to provide a distinction between the subject and object of deposit/removal events. These events are only used when the actions of a specific moving object must be analysed. Otherwise, their deposit/removal counterparts are sufficient indication of the occurrence of the event.
Finally, two additional rules are applied to account for the motion and rest events:

11. If V is the tail of a stationary stem M_i and the head of a moving stem M_j for which |M_i| ≥ h_M and |M_j| ≥ h_M, then place a motion event tag at V. Here, h_M is a lower size limit of stems to consider.
12. If V is the tail of a moving stem M_i and the head of a stationary stem M_j for which |M_i| ≥ h_M and |M_j| ≥ h_M, then place a rest event tag at V.
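To illustrate, several of the rules translate directly into predicates over the annotated graph. The sketch below covers rules 1, 5, and 7 only; the state, is_track_head, and tags fields are assumed to be products of the grouping stage, and the names are ours.

def apply_head_rules(v, graph):
    """Sketch of rules 1, 5, and 7 for one V-object at a track head."""
    if not v.is_track_head:
        return
    indeg = graph.in_degree(v)
    if v.state == "moving" and indeg > 0:
        v.tags.add("appearance")                  # rule 1
    if v.state in ("moving", "unknown") and indeg == 0:
        v.tags.add("entrance")                    # rule 5
    if v.state == "stationary" and indeg == 1:
        v.tags.add("deposit")                     # rule 7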
Table 1 summarizes the conditions under which rules 1-12 apply event tags to V-objects with moving, stationary, and unknown motion states. Figure 12 shows all the event annotation rules applied to the example motion graph of Fig. 6.

As the annotation rules are applied to the motion graph, each identified event is recorded in an index table for later lookup. This event index takes the form of an array of lists of V-objects (one list for each event type) and indexes V-objects in the motion graph according to their event tags.
The output of the motion analysis stage is an annotated directed graph describing the motion of foreground objects and an event index indicating events of interest in the video stream. Thus, the motion analysis stage generates from the object tracking output a symbolic abstraction of the actions and interactions of foreground objects in the video. This approach enables content-based analysis of video sequences that wou