Signals and Communication Technology
`Riad I. Hammoud (Ed.)
`
`Interactive Video
`
`Algorithms and Technologies
`
`With 109 Figures and 8 Tables
`
Springer
`
`
`Dr. Riad I. Hammoud
`
`Delphi Electronics and Safety
`World Headquarters
`MIC Euo, PO. Box 9005
`Kokomo, Indiana 46904-9005
`USA
e-mail: riad.hammoud@delphi.com
`
`Library of Congress Control Number: 2006923234
`
`ISSN: 1860-4862
ISBN-10: 3-540-33214-6 Springer Berlin Heidelberg New York
ISBN-13: 978-3-540-33214-5 Springer Berlin Heidelberg New York
`
`This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
`reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or
`parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its
`current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable
`to prosecution under the German Copyright Law.
`
`Springer is a part of Springer Science+Business Media.
`springer.com
`
© Springer-Verlag Berlin Heidelberg 2006
`Printed in The Netherlands
`
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
`even in the absence of a specific statement, that such names are exempt from the relevant protective laws and
`regulations and therefore free for general use.
`
Typesetting by SPi Publisher Services using a Springer LaTeX macro package
`Cover design: design &production, Heidelberg
`Printed on acid-free paper
`SPIN: 11399551
`
62/3100/SPi    5 4 3 2 1 0
`
`
`Introduction to Interactive Video
`
Riad I. Hammoud

Delphi Electronics and Safety, IN, USA
riad.hammoud@delphi.com
`
`1 Introduction
`
In recent years, digital video has been widely employed as an effective media format not only for personal communications, but also for business-to-employee, business-to-business and business-to-consumer applications. It appears more attractive than other static data types like text and graphics, as its underlying rich content conveys easily and effectively the goals of the provider by means of image, motion, sound and text, all presented to the consumer in a time-synchronized manner. Multimedia documents are more accessible than ever, due to a rapid expansion of Internet connectivity and increasing interest in online multimedia-rich applications.
Video content has skyrocketed in light of the aforementioned advances, and as a result of increasing network bandwidth capacities, decreasing cost of video acquisition and storage devices, and improving compression techniques. Today, many content providers, like Google Video [118] and Yahoo [353], have created “open video” marketplaces that enable consumers to buy and rent a wide range of video content (video-on-demand) [292] from major television networks, professional sports leagues, cable programmers, independent producers and film makers. The evolving digital video archives include various types of video documents like feature-length movies, music videos from SONY BMG, news, “Charlie Rose” interviews, medical and e-learning videos, as well as prime-time and classic hits from CBS, sports and NBA games, historical content from ITN, and new titles being added every day. Such video archives represent a valuable business asset and an important on-line video source to providers and consumers, respectively. The market is taking another important boost as video distribution is becoming available for people on “the move”, where they can select, download and view various video types on their consumer electronics devices like mobileTV [227] and video-playing iPods [158]. In 2005, the number of hits for downloaded music data from iTunes on Apple’s iPods reached 20.7 million [163]. This number is expected to increase with the appearance of video-playing iPods and mobileTVs.
`
`
Traditionally, video data is either annotated manually, or consumed by end-users in its original form. Manual processing is expensive, time consuming and often subjective. On the other hand, a video file provided in a standard format like MPEG-1 or MPEG-2 is played back using media players with limited conventional control interaction features like play, fast forward/backward, and pause. In order to remedy these challenges and address the issues of the growing number and size of video archives and detail-on-demand videos, innovative video processing and video content management solutions, ranging from decomposing, indexing, browsing and filtering to automatic searching techniques as well as new forms of interaction with video content, are needed more than ever [324, 276, 130, 143, 303, 26, 278, 152]. Automatic processing will substantially drop the cost and reduce the errors of human operators, and it will introduce a new set of interactivity options that gives users advanced interaction and navigational possibilities with the video content. One can navigate through the interactive video content in a non-linear fashion, by downloading, browsing and viewing, for instance, only the “actions” of a film character that caught his/her attention at the beginning of a feature-length movie [194, 155, 130, 199, 299]. Retaining only the essential information of a video sequence, such as representative frames, events and highlights, improves the storage, bandwidth and viewing time.
Within the framework of the emerging “interactive video” technology and the MPEG-4 and MPEG-7 standards [121, 124, 130, 293, 233, 275], this book will address the following two major issues that concern both content-providers and consumers: (1) automatic restructuring, indexing and cataloging of video content, and (2) advanced interaction features for audio-video editing, playing, searching and navigation. In this chapter, we will briefly introduce the concept, rhetoric, algorithms and technologies of interactive video, automatic video restructuring and content-based video retrieval systems.
`
`2 What is an Interactive Video?
`
`In order to simplify the understanding of interactive video environments, it is
`worth reviewing how people are used to viewing and interacting with video
`content. Current VCRs and video-players provide basic control options such
`as play/stop, fast forward/backward and slow motion picture streaming. The
`video is mostly viewed in a passive way as a non-stop medium where the
`user’s interaction with the content is somewhat limited. For example, users
`cannot stop the video playback to jump to another place inside or outside the
`video document that provides related information about a specific item in
`the video like a commercial product, a film character, or a concealed object.
`Hence, the viewing of the video is performed in a linear fashion where the
`only way to discover what is next is to follow the narration and move through
`the video guided by seconds and minutes.
`Such conventional techniques for video viewing and browsing seem to be
inefficient for most users to get the crux of the video. Users’ ease and efficiency
`could be improved through:
`
`1. providing a representative visual summary of the video document prior to
`downloading, storing or watching it. Alternatively, users could select the
`video based on just a few snapshots;
`2. presenting a list of visual entries, like key-frames, hot spots, events, scenes
and highlights, that serve as meaningful access points to desired video
`content as opposed to accessing the video from the beginning to the end;
`and,
`3. showing a list of navigational options that allows users to follow internal
`and external links between related items in the same video or in other
`media documents like web pages.
`
Interactive video refers to forms of video documents, still uncommon today, that accept and respond to the input of a viewer beyond conventional VCR interactive features like play and pause. For instance, in its basic form, interactive video allows users to pause the video, click on an object of interest in a video frame, and choose to jump from one arbitrary frame to another where the selected object has appeared. Instead of being guided by seconds and minutes, the user of the interactive video form navigates through the video in a very efficient non-linear fashion with options such as “next appearance”, “previous scene” and “last event”. Tentatively, the following definition of interactive video could be drawn:

Definition 1. Interactive video is a digitally enriched form of the original raw video sequence, allowing viewers attractive and powerful forms of interactivity and navigational possibilities.
`
In order to ensure that humans do not perceive any discontinuity in the video stream, a frame rate of at least 25 fps is required, that is, 90,000 images for one hour of video content. This video content can be complex and rich in terms of objects, shots, scenes, events, key-frames, sounds, narration and motion. An original video, in MPEG-1 format, is transformed to interactive video form through a series of re-structuring phases of its content. Both the original video and the interactive video contain the same information, with one major difference in the structure of the document. The original video has an implicit structure, while its interactive form has an explicit one. In an explicit structure, the hotspots and key-elements are emphasized and links between these elements are created. If such links are not established with items from outside the video content, then the produced document is called raw interactive video. For simplicity we will refer to it just as interactive video. Here we introduce two other extensions of interactive video documents: interactive video presentation and interactive video database.
`
Definition 2. Interactive video presentation is a form of interactive video document that is centered on enriched video but is not exclusively video.
`
In this context, the interactive video document contains not only the raw enriched video but also includes several kinds of data in a time-synchronized fashion. This additional data is added for two reasons: (1) enhancing the
video content, and (2) making the video presentation self-contained. The type of data is determined by the author of the interactive video. This author would intelligently integrate the technique of the video producer or filmmaker with the technique and wisdom of the skillful teacher. The objective is to maximize the chances of conveying well the purpose and key information in the video. Additional data can be documents of all types, i.e., HTML, tables, music and/or still frames and sequences of images, that are available locally or remotely through the Web. Such types of interactive videos are also known as hyperfilm, hypervideo or hypermedia [151, 47].
`
`Definition 3. Interactive video database is a collection of interactive video
`documents and interactive video presentations.
`
`The main interactive video document is seen here as a master document
`which contains in itself a large number of independent interactive videos.
`Users can access the video database in two forms: searching and browsing.
`For instance, a sketch may be used to define an image in a frame. The system
`can look for that image and retrieve the frames that contain it. The user
`can also move the sketch within a certain pattern indicating to the system to
`find in a large interactive video database all sets of frames that represent a
`similar moving pattern for that object of interest (hotspot). As technologies
`such as object detection, tracking and image recognition evolve, it will be
`possible to provide better ways to specify hotspots in interactive video, which
`will be covered in more detail in the following sections. When browsing is not
`convenient for a user to locate specific information, he or she would utilize
`searching routines instead. The searching process would result in a small set of
`interactive video sub-documents that could be browsed and easily navigated.
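To make the frame-retrieval idea above concrete, the following sketch (not from the book) shows one simple way such a query could be answered. It assumes Python and OpenCV, uses plain template matching as a stand-in for the more elaborate object detection and recognition techniques described in the following sections, and the file names "movie.mpg" and "query.png" are purely illustrative.

# Illustrative sketch (not from the book): retrieve frames that contain a
# given query image using normalized template matching with OpenCV.
import cv2

def find_frames_with_object(video_path, query_path, threshold=0.8):
    query = cv2.imread(query_path, cv2.IMREAD_GRAYSCALE)
    cap = cv2.VideoCapture(video_path)
    hits, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Normalized cross-correlation score of the query against the frame.
        score = cv2.matchTemplate(gray, query, cv2.TM_CCOEFF_NORMED).max()
        if score >= threshold:
            hits.append(frame_idx)
        frame_idx += 1
    cap.release()
    return hits

# Example: frame indices where the query object (hotspot) likely appears.
# print(find_frames_with_object("movie.mpg", "query.png"))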
`
3 Video Transformation and Re-structuring Phases
`
`The transformation of an original video format to its interactive form aims at
`extracting the key-elements or components of the video structure first, and
`then creating links between elements of various levels of this structure.
`
`3.1 Video Structure Components
`
`The presentation of a document often follows a domain-specific model. Read-
`ers of a textbook expect to see the book content organized into parts, chapters,
`sections, paragraphs and indexes. Such structure is presented to the reader up-
front in a table-of-contents. Unfortunately, the structure of video documents
`is not explicitly apparent to the viewer. The re-structuring process aims at
`automatically constructing such a table-of-contents [276].
`A video collection can be divided into multiple categories by grouping doc-
`uments with similar structures together. Thus, feature-length movies, news,
sports, TV shows, and surveillance videos have different structures, but with many basic elements in common. Figures 1 and 2 illustrate the general structure of movies and news videos.

Fig. 1. Illustration of the structure of a feature-length movie
`
The following list of components covers the video structure of most types of video documents:
`
1. Camera shots: A camera shot is an unbroken sequence of frames recorded from a single camera during a short period of time, and thus it contains few changes in background and scene content. A video sequence is therefore a concatenation of camera shots. A cut is where the last frame in one shot is followed by the first frame in the next shot.
2. Gradual transitions: Three other types of shot boundaries may be found in video documents: fades, dissolves, and wipes. A fade is where the frames of the shot gradually change from or to black. A dissolve is where the frames of the first shot are gradually morphed into the frames of the second. And a wipe is where the frames of the first shot are moved gradually in a horizontal or vertical direction into the frames of the second.
3. Key-frames: A key-frame is a still image of a video shot that best represents the content of the shot.
4. Visual objects, zones: Objects of interest in video documents are similar to key-words in text and HTML documents. They could be detected either manually or automatically in key-frames or within individual frames of a video shot.
5. Audio objects: An audio object could take several shapes, ranging from a clip of music to a single word.
6. Text objects: A text object is defined as the text attached to the video sequence, such as footnotes and superimposed text on images.
7. Events: An event is the basic segment of time during which an important action occurs in the video.
8. Scenes: A scene is the minimum set of sequential shots that conveys a certain meaning in terms of narration.
9. Cluster of objects: A cluster is a collection of objects with similar characteristics like appearance, shape, color, texture, sounds, etc.
10. Narrative sequence: A narrative sequence projects a large concept by combining multiple scenes together.
11. Summary: A video summary of an input video is seen as a new brief document which may consist of an arrangement of video shots and scenes, or an arrangement of still key-frames.

Fig. 2. Illustration of the general structure of news videos
`
`
`
Fig. 3. A cut between two camera shots is observed at frames 90-91
`
The above components are labeled either as low-level components or high-level components. The first category includes video shots, audio-visual objects, key-frames and words, while the second contains events, scenes, summaries, and narrative sequences.
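As a rough illustration only, the following Python sketch (not part of the original text; all class and field names are assumptions) shows how the low-level and high-level components listed above could be represented in an explicit, restructured video document.

# Illustrative sketch of a restructured ("explicit") video document.
# All class and field names are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    start_frame: int                                     # first frame of the unbroken camera take
    end_frame: int                                       # last frame before the next cut or transition
    key_frames: List[int] = field(default_factory=list)  # representative frame indices
    objects: List[str] = field(default_factory=list)     # detected hotspots/zones

@dataclass
class Scene:
    shots: List[Shot] = field(default_factory=list)      # sequential shots conveying a common meaning
    label: str = ""                                      # e.g. "dialog in the tent"

@dataclass
class InteractiveVideo:
    scenes: List[Scene] = field(default_factory=list)    # high-level narrative units
    summary: List[int] = field(default_factory=list)     # key-frame indices forming a visual summary

An interactive video authoring tool would populate such a structure automatically using the detection, tracking and clustering techniques described next.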
`
3.2 Toward Automatic Video Re-structuring
`
Manual labeling and re-structuring of video content is an extremely time-consuming, cost-intensive and error-prone endeavor that often results in incomplete and inconsistent annotations. Annotating one hour of video may require more than ten hours of human effort for basic decomposition of the video into shots, key-frames and zones.
`In order to overcome these challenges, intensive research work has been
`done in computer vision and video processing to automate the process of video
`re-structuring. Some interactive video authoring systems [127, 157] offer tools
`that automate shot detection, zone localization, object tracking and scene
`recognition.
`
`3.2.1 Shot Boundary Detection
`
Several methods have been proposed for shot boundary detection in both the compressed and non-compressed domains [38, 36, 13, 37, 134, 347, 356, 52]. Pairwise comparison [368] checks each pixel in one frame with the corresponding pixel in the next frame. In this approach, the gray-scale values of the pixels at the corresponding locations in two successive frames are subtracted and the absolute value is used as a measure of dissimilarity between the pixel values. If this value exceeds a certain threshold, then the pixel gray scale is said to have changed. The percentage of the pixels that have changed is the measure of dissimilarity between the frames. This approach is computationally simple but sensitive to digitization noise, illumination changes and object motion. As a means to compensate for this, the likelihood ratio, histogram comparison, model-based comparison, and edge-based approaches have been proposed. The likelihood ratio approach [368] compares blocks of pixel regions. The color histogram method [37] compares the intensity or color histograms between adjacent frames. Model-based comparison [134] uses the video production system as a template. Edge detection segmentation [364]
looks for entering and exiting edge pixels. Color blocks in a compressed MPEG stream [34] are processed to find shots. DCT-based shot boundary detection [16] uses differences in motion encoded in an MPEG stream to find shots. Smeaton et al. [297, 246, 38] combine a color histogram-based technique with edge-based and MPEG macroblock methods to improve the performance of each individual method for shot detection. Recently, Chen et al. [54] proposed a multi-filtering technique that combines histogram comparison, pixel comparison and object tracking techniques for shot detection. They perform object tracking to help determine the actual shot boundaries when both the pixel comparison and histogram comparison techniques fail. As reported in [54], experiments on a large number of video sequences (over 1000 testing shots) show a very promising shot detection precision of greater than ninety-two percent and a recall beyond ninety-eight percent. With such a solid performance, very little manual effort is needed to correct the false positives and to recover the missed positives during shot detection. For most applications, the color histogram appears to be the simplest and most computationally inexpensive technique for obtaining satisfactory results.
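The following minimal Python/OpenCV sketch (an illustration added here, not the algorithm of any cited work) combines the pairwise pixel comparison and histogram comparison ideas described above; the threshold values are arbitrary and would need tuning.

# Sketch of simple shot-boundary detection: a cut is declared when both the
# fraction of changed pixels and the histogram difference between successive
# frames exceed (arbitrary) thresholds.
import cv2
import numpy as np

def detect_cuts(video_path, pixel_thresh=30, changed_frac=0.5, hist_thresh=0.4):
    cap = cv2.VideoCapture(video_path)
    cuts, prev_gray, prev_hist, idx = [], None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_gray is not None:
            # Pairwise pixel comparison: fraction of pixels whose gray value changed.
            frac = np.mean(cv2.absdiff(gray, prev_gray) > pixel_thresh)
            # Histogram comparison: Bhattacharyya distance between intensity histograms.
            hdist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if frac > changed_frac and hdist > hist_thresh:
                cuts.append(idx)          # cut between frame idx-1 and idx
        prev_gray, prev_hist, idx = gray, hist, idx + 1
    cap.release()
    return cuts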
`
`3.2.2 Key-frame Detection
`
Automatic extraction of key-frames ought to be content-based so that key-frames maintain the important content of the video while removing all redundancy. Ideally, high-level primitives of video, such as objects and events, should be used. However, because such components are not always easy to identify, current methods tend to rely on low-level image features and other readily available information instead. In that regard, the work done in this area [369, 377, 18, 341] could be grouped into three categories: static, dynamic and content-based approaches. The first category selects frames at a fixed time-code sampling rate such that each nth frame is retained as a key-frame [237]. The second category relies on motion analysis to eliminate redundant frames. The optical flow vector is first estimated on each frame, then analyzed using a metric as a function of time to select key-frames at the local minima of motion [341]. Besides the complexity of computing a dense optical flow vector, the underlying assumption of local minima may not work if constant variations are observed. The third category analyzes the variation of the content in terms of color, texture and motion features [369]. Avrithis et al. [18] select key-frames at local minima and local maxima of the magnitude of the second derivative of the composed feature curve of all frames of a given shot. A composed feature is made up of a linear combination of both color and motion vectors. More sophisticated pattern classification techniques, like Gaussian Mixture Models (GMM), have been used to group similar frames [377], and close appearances of segmented objects [132], into clusters. Depending on the compactness of each obtained cluster, one or more images could be extracted as key-frames.
Figure 5 shows extracted key-appearances from the video shot of Fig. 4 using the method of [132]. In this example, the vehicle is being tracked and modeled with a GMM in the RGB color histogram space.

Fig. 5. Three extracted representative key-frames (top) and corresponding key-appearances (bottom) of the video sequence presented in Fig. 4, using the method of [132]
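As an illustrative sketch of the content-based category (and not the GMM-based method of [132]), the following Python/OpenCV fragment selects a new key-frame within a shot whenever the color histogram drifts sufficiently far from the last selected key-frame; the distance threshold is an arbitrary assumption.

# Sketch of content-based key-frame selection within one shot: keep a frame
# as a new key-frame whenever its color histogram drifts far enough from the
# last selected key-frame.
import cv2

def select_key_frames(frames, dist_thresh=0.3):
    """frames: list of BGR images belonging to a single shot."""
    def hsv_hist(img):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        return cv2.normalize(h, h).flatten()

    key_frames = [0]                      # the first frame always starts a key-frame
    last_hist = hsv_hist(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        hist = hsv_hist(frame)
        if cv2.compareHist(last_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > dist_thresh:
            key_frames.append(i)
            last_hist = hist
    return key_frames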
`
`3.2.3 Object Detection
`
Extracting the objects from the video is the first step in object-based analysis. Video objects or hot spots are classified into static zones and moving objects. The purpose of object detection is to initialize the tracking by identifying the object boundaries in the first frame of the shot. The accuracy of automatic object detection depends on the prior knowledge about the object to be detected as well as the complexity of the background and the quality of the video signal. As an example, detecting the position of an eye in an infrared driver's video [133, 136] is easier than locating pedestrians in a night-time sequence of images. The eye has a very distinct and unique shape and appearance among people, while the appearance, shape, scale and size characteristics may vary widely among pedestrians. On a related topic, Prati et al. [256] recently proposed an object detection method that differentiates between a moving car and its moving shadows on a highway.

Fig. 6. Semantic scenes of a clip from the movie “Dances with Wolves”: “first meeting” (scene 1); “horses” (scene 2); “an90 preparation” (scene 3); “working” (scene 4); and “dialog in the tent” (scene 5)

Several efforts were undertaken to find objects belonging to specific classes like faces, figures and vehicles [202, 240, 254, 120, 333]. Recent systems like the Informedia Project [155] include in their generation of video summaries the detection of common objects like text and faces. Systems such as NeTra-V [79] and VideoQ [50] used spatio-temporal segmentation to extract regions which are supposed to correspond to video objects. Others [184, 269, 191] have resorted to user-assisted segmentation. While it is easier to define an object by its bounding box, accurately delineating an object is generally advantageous in reducing clutter for the subsequent tracking and matching phases. In [269], the color distributions of both the object and its background are modeled by a Gaussian mixture. Interestingly, this process is interactive, i.e. the user may iteratively assist the scheme in determining the boundary, starting from sub-optimal solutions, if needed. Recent work goes further in this direction
`
`
`[191], by learning the shape of an object category, in order to introduce it
`as an elaborate input into the energy function involved in interactive object
`detection.
`
In the interactive video framework, it is crucial that the tracking algorithm can be initialized easily through a semi-automatic or automatic process. Automatic detection methods may be categorized into three broad
`classes: model-based, motion-based (optical flow, frame differencing), and
`background subtraction approaches. For static cameras, background subtrac-
`tion is probably the most popular. Model-based detection approaches usually
`employ a supervised learning process of the object model using representative
`training patches of the object in terms of appearances, scales, and illumi-
`nation conditions, as in a multi-view face detector [202] or a hand posture
`recognition algorithm [344]. In contrast, a motion-based detection consists of
`segmenting the scene into zones of independent motions using optical flow
`and frame differencing techniques [240]. In many applications, the knowledge-
`based, motion-based and model-based techniques are combined together to
ensure robust object detection. Despite all of these efforts, accurate automatic object detection is successful only in some domain-specific applications.
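For the static-camera case, where background subtraction is the most popular choice, the following Python/OpenCV sketch (an added illustration, not one of the cited detectors) proposes foreground blobs above a minimum area as candidate objects that could initialize the tracker in the first frames of a shot.

# Sketch of background-subtraction based object detection for a static camera:
# foreground blobs larger than a minimum area are proposed as candidate objects.
import cv2

def detect_moving_objects(video_path, min_area=500):
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    detections = []                               # (frame_idx, x, y, w, h)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]   # drop shadow pixels
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            if cv2.contourArea(c) >= min_area:
                detections.append((idx,) + cv2.boundingRect(c))
        idx += 1
    cap.release()
    return detections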
`
3.2.4 Intra-Shot Object Tracking
`
Tracking is easier and faster than object detection since the object state (position, scale, boundaries) is known in the previous frame, and the search procedure in the target frame is local (search window) rather than global (entire frame). The tracking task is easier when tracked objects have low variability in the feature space, and thus exhibit smooth and rigid motion. In reality, video objects are often very difficult to track due to changes in pose, scale, appearance, shape and lighting conditions. Therefore, a robust and accurate tracker requires a self-adaptation strategy to these changes in order to maintain tracking in difficult situations, an exit strategy that terminates the tracking process if a drift or mis-tracking occurs, and a recovery process that allows object reacquisition when tracking is lost.
In the literature, three categories of methods could be identified: contour-based, color-based and motion-based techniques. For tracking of non-rigid and deformable objects, geodesic active contours and active contour models have proved to be powerful tools [247, 181, 320, 250]. These approaches use an energy minimization procedure to obtain the best contour, where the total energy consists of an internal energy term for contour smoothness and an external energy term for edge likelihood. Many color-based tracking methods [66, 27, 166, 249, 378] require a typical object color histogram. The color histogram is a global feature used frequently in tracking and recognition due to its quasi-invariance to geometric and photometric variations (using the hue-saturation subspace), as well as its low computation requirements. In [27], a side view
`
`of the human head is used to train a typical color model of both skin color
`and hair color to track the human head with out-of-plane rotation. The third
`category of tracking methods tends to formulate a tracking problem as the
`estimation of a 2D inter-frame motion field over the region of interest [110].
Recently, the concern in tracking has been shifting toward real-time performance. This is essential in an interactive video authoring system, where the user initializes the tracker and expects quick tracking results in the remaining frames of a shot. Real-time object tracking is addressed in more detail in Chapter 4.
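In the spirit of the color-based category above, the following Python/OpenCV sketch (an added illustration, not one of the cited trackers) models the object selected in the first frame by a hue histogram and follows it with mean-shift on the histogram back-projection; a real system would add the adaptation, exit and recovery strategies discussed earlier.

# Sketch of color-based intra-shot tracking: the object selected in the first
# frame is modeled by a hue histogram and followed with mean-shift on the
# back-projection of that histogram in subsequent frames.
import cv2

def track_object(frames, init_box):
    """frames: list of BGR images of one shot; init_box: (x, y, w, h) in frame 0."""
    x, y, w, h = init_box
    hsv0 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2HSV)
    roi = hsv0[y:y + h, x:x + w]
    hist = cv2.calcHist([roi], [0], None, [32], [0, 180])   # hue histogram of the object
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    track_window, boxes = init_box, [init_box]
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], scale=1)
        _, track_window = cv2.meanShift(backproj, track_window, term)
        boxes.append(track_window)
    return boxes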
`
`3.2.5 Recognition and Classification of Video Shots and Objects
`
Once the shots, key-frames, and objects are identified in a video sequence, the next step is to identify spatial-temporal similarities between these entities in order to construct high-level components of the video structure like groups of objects [131] and clusters of shots [275]. In order to retrieve objects similar to an object-query, a recognition algorithm that entails feature extraction, distance computation and a decision rule is required. Recognition and classification techniques often employ low-level features like color, texture and edge histograms [24] as well as high-level features like object blobs [45].
An effective method for recognizing similar objects in video shots is to specify generic object models and find objects that conform to these models [89]. Different models were used in methods for recognizing objects that are common in video. These include methods for finding text [222, 375], faces [271, 286] and vehicles [286]. Hammoud et al. [131, 128] proposed a successful framework for supervised and non-supervised classification of inter-shot objects using the Maximum A Posteriori (MAP) rule and hierarchical classification techniques, respectively. Figure 8 illustrates some obtained results. Each tracked object is modeled by a Gaussian Mixture Model (GMM) that captures the intra-shot changes in appearance in the RGB color feature space. In the supervised approach, the user chooses the object classes interactively, and the classification algorithm assigns the remaining objects to these classes according to the MAP rule. In contrast, the unsupervised approach consists of computing the Kullback distance [190, 106] between the Gaussian Mixture Models of all identified objects in the video, and then feeding the constructed proximity matrix [169, 264] to the hierarchical clustering algorithm [173]. These approaches seem to substantially improve the recognition rate of video objects between shots. Recently, Fan et al. [96] proposed a successful hierarchical semantics-sensitive video classifier to shorten the semantic gap between the low-level visual features and the high-level semantic concepts. The hierarchical structure of the semantics-sensitive video classifier is derived from the domain-dependent concept hierarchy of video contents in the database. Relevance analysis is used to shorten the semantic gap by selecting the discriminating visual features. Part two of this book will address in detail these issues of object detection, editing, tracking, recognition, and classification.
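To illustrate the unsupervised grouping step in rough outline (this is an added sketch and only an empirical approximation, not the exact procedure of [131, 128]), the following Python fragment fits one GMM per tracked object over its RGB pixel samples, approximates a symmetric Kullback-Leibler-style distance from the objects' own samples, and feeds the resulting proximity matrix to hierarchical clustering.

# Sketch of unsupervised inter-shot object grouping: GMM per object, empirical
# symmetric KL-style distances, hierarchical clustering of the proximity matrix.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_objects(object_pixels, n_components=3, n_groups=4):
    """object_pixels: list of (N_i, 3) arrays of RGB samples, one per tracked object."""
    gmms = [GaussianMixture(n_components=n_components, covariance_type="full").fit(p)
            for p in object_pixels]
    n = len(gmms)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Empirical symmetric KL approximation between GMM_i and GMM_j.
            d_ij = np.mean(gmms[i].score_samples(object_pixels[i])
                           - gmms[j].score_samples(object_pixels[i]))
            d_ji = np.mean(gmms[j].score_samples(object_pixels[j])
                           - gmms[i].score_samples(object_pixels[j]))
            dist[i, j] = dist[j, i] = max(0.0, 0.5 * (d_ij + d_ji))
    # Hierarchical clustering on the condensed proximity matrix.
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_groups, criterion="maxclust")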
`
`
`3.2.6 Video Events, Highlights and Scenes Detection
`
In recent years, research towards the automatic detection and recognition of highlights and events in specific applications such as sports video, where events and highlights are relatively well-defined based on the domain knowledge of the underlying news and sports video data, has gained a lot of attention [182, 54, 199, 201, 198, 272, 281, 361]. A method for detecting news reporting was presented in [213]. Seitz and Dyer [289] proposed an affine view-invariant trajectory matching method to analyze cyclic motion. In [310], Stauffer and Grimson classify activities based on the aspect ratio of the tracked objects. Haering et al. [122] detect hunting activities in wildlife video. In [54], a new multimedia data mining framework has been proposed for the detection of soccer goal shots by using combined multimodal (audio/visual) features and classification rules. The output results can be used for annotation and indexing of the high-level structures of soccer videos. This framework exploits the rich semantic information contained in the visual and audio features of soccer video data, and incorporates the data mining process for effective detection of soccer goal events.
Regarding automatic detection of semantic units like scenes in feature-length movies, several graph-based clustering techniques have been proposed in the literature [357, 275, 125]. In [126, 127], a general framework of two phases, clustering of shots and merging of overlapped clusters, has been proposed to extract the video scenes of a feature-length movie. Two shots are matched on the basis of colorimetric histograms, colorimetric auto-correlograms [147] and the number of similar objects localized in their generated key-frames. Using the hierarchical classification technique, a partition of shots is identified for each feature separately. From all these partitions a single partition is deduced based on a union distance between various clusters. The obtained clusters are then linked together using the temporal relations of Allen