`Communication
`Technology
`
`US Patent No. 9,124,950
`
`Interactive Video
`Algorithms and Technologies
`
`AMAZON EX.1026
`Amazon v. CustomPlay
`
`Riad I. Hammoud (Ed.)
`
`Interactive Video
`
`Algorithms and Technologies
`
`With 109 Figures and 8 Tables
`
`Springer
`
`Dr. Riad I. Hammoud
`
`Delphi Electronics and Safety
`World Headquarters
`MIC E110, P.O. Box 9005
`Kokomo, Indiana 46904-9005
`USA
`e-mail: riad.hammoud@delphi.com
`
`Library of Congress Control Number: 2006923234
`
`ISSN: 1860-4862
`ISBN-10 3-540-33214-6 Springer Berlin Heidelberg New York
`ISBN-13 978-3-540-33214-5 Springer Berlin Heidelberg New York
`
`This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
`concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
`reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or
`parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its
`current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable
`to prosecution under the German Copyright Law.
`
`Springer is a part of Springer Science+Business Media.
`springer.com
`
`© Springer-Verlag Berlin Heidelberg 2006
`Printed in The Netherlands
`
`The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
`even in the absence of a specific statement, that such names are exempt from the relevant protective laws and
`regulations and therefore free for general use.
`
`Typesetting by SPI Publisher Services using a Springer LaTeX macro package
`Cover design: design & production, Heidelberg
`Printed on acid-free paper
`SPIN: 11399551
`
`62/3100/SPI - 5 4 3 2 1 0
`
`Introduction to Interactive Video
`
`Riad I. Hammoud
`
`Delphi Electronics and Safety, IN, USA.
`riad.hammoud@delphi.com
`
`1 Introduction
`
`In recent years, digital video has been widely employed as an effective media format
`not only for personal communications, but also for business-to-employee,
`business-to-business and business-to-consumer applications. It appears more attractive
`than static data types like text and graphics, as its rich underlying content conveys
`the goals of the provider easily and effectively by means of image, motion, sound and
`text, all presented to the consumer in a timely, synchronized manner. Multimedia
`documents are more accessible than ever, due to the rapid expansion of Internet
`connectivity and increasing interest in online multimedia-rich applications.
`Video content has skyrocketed in light of the aforementioned advances, and as a
`result of increasing network bandwidth capacities, decreasing cost of video
`acquisition and storage devices, and improving compression techniques. Today, many
`content providers, like Google Video [118] and Yahoo [353], have created “open video”
`marketplaces that enable consumers to buy and rent a wide range of video content
`(video-on-demand) [292] from major television networks, professional sports leagues,
`cable programmers, independent producers and film makers. The evolving digital video
`archives include various types of video documents like feature-length movies, music
`videos from SONY BMG, news, “Charlie Rose” interviews, medical and e-learning videos,
`as well as prime-time and classic hits from CBS, sports and NBA games, historical
`content from ITN, and new titles being added every day. Such video archives represent
`a valuable business asset to providers and an important online video source to
`consumers. The market is getting another important boost as video distribution becomes
`available to people on “the move”, who can select, download and view various video
`types on consumer electronics devices like mobile TV [227] and video-playing iPods
`[158]. In 2005, the number of hits for downloaded music data from iTunes on Apple’s
`iPods reached 20.7 million [163]. This number is expected to increase with the
`appearance of video-playing iPods and mobile TVs.
`
`Traditionally, video data is either annotated manually or consumed by end-users in
`its original form. Manual processing is expensive, time consuming and often
`subjective. On the other hand, a video file provided in a standard format like MPEG-1
`or MPEG-2 is played back using media players with limited conventional control
`features like play, fast forward/backward, and pause. In order to remedy these
`challenges and address the issues of the growing number and size of video archives and
`detail-on-demand videos, innovative video processing and video content management
`solutions, ranging from decomposing, indexing, browsing and filtering to automatic
`searching techniques as well as new forms of interaction with video content, are
`needed more than ever [324, 276, 130, 143, 303, 26, 278, 152]. Automatic processing
`will substantially drop the cost and reduce the errors of human operators, and it will
`introduce a new set of interactivity options that give users advanced interaction and
`navigational possibilities with the video content. One can navigate through the
`interactive video content in a non-linear fashion, by downloading, browsing and
`viewing, for instance, only the “actions” of a film character that caught his or her
`attention at the beginning of a feature-length movie [194, 155, 130, 199, 299].
`Retaining only the essential information of a video sequence, such as representative
`frames, events and highlights, reduces the storage, bandwidth and viewing time
`required.
`Within the framework of the emerging “interactive video” technology and the MPEG-4
`and MPEG-7 standards [121, 124, 130, 293, 233, 275], this book will address the
`following two major issues that concern both content providers and consumers:
`(1) automatic re-structuring, indexing and cataloging of video content, and
`(2) advanced interaction features for audio-video editing, playing, searching and
`navigation. In this chapter, we will briefly introduce the concept, rhetoric,
`algorithms and technologies of interactive videos, automatic video restructuring and
`content-based video retrieval systems.
`
`2 What is an Interactive Video?
`
`In order to simplify the understanding of interactive video environments, it is
`worth reviewing how people are used to viewing and interacting with video
`content. Current VCRs and video-players provide basic control options such
`as play/stop, fast forward/backward and slow motion picture streaming. The
`video is mostly viewed in a passive way as a non-stop medium where the
`user’s interaction with the content is somewhat limited. For example, users
`cannot stop the video playback to jump to another place inside or outside the
`video document that provides related information about a specific item in
`the video like a commercial product, a film character, or a concealed object.
`Hence, the viewing of the video is performed in a linear fashion where the
`only way to discover what is next is to follow the narration and move through
`the video guided by seconds and minutes.
`Such conventional techniques for video viewing and browsing seem to be
`inefficient for most users to get the crux of the video. Users’ ease and efficiency
`could be improved through:
`
`1. providing a representative visual summary of the video document prior to
`downloading, storing or watching it. Alternatively, users could select the
`video based on just a few snapshots;
`2. presenting a list of visual entries, like key-frames, hot spots, events, scenes
`and highlights, that serves as meaningful access points to desired video
`content as opposed to accessing the video from the beginning to the end;
`and,
`3. showing a list of navigational options that allows users to follow internal
`and external links between related items in the same video or in other
`media documents like web pages.
`
`Interactive video refers to still-uncommon forms of video documents that accept and
`respond to the input of a viewer beyond conventional VCR interactive features like
`play and pause. For instance, in its basic form, interactive video allows users to
`pause the video, click on an object of interest in a video frame, and choose to jump
`from one arbitrary frame to another where the selected object appears. Instead of
`being guided by seconds and minutes, the user of an interactive video navigates
`through the video in an efficient non-linear fashion with options such as “next
`appearance”, “previous scene” and “last event”. Tentatively, the following
`definition of interactive video could be drawn:
`
`Definition 1. Interactive video is a digitally enriched form of the original raw
`video sequence, allowing viewers attractive and powerful interactivity forms
`and navigational possibilities.
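
As a rough illustration of the “next appearance” and “previous appearance” style of navigation described above, the following Python sketch assumes that each object of interest has a sorted list of the frame numbers at which its appearances begin. The index layout, the labels and the function names are hypothetical and are not taken from any particular interactive video system.

```python
from bisect import bisect_left, bisect_right


def next_appearance(appearances, current_frame):
    """Return the first appearance strictly after current_frame, or None."""
    i = bisect_right(appearances, current_frame)
    return appearances[i] if i < len(appearances) else None


def previous_appearance(appearances, current_frame):
    """Return the last appearance strictly before current_frame, or None."""
    i = bisect_left(appearances, current_frame)
    return appearances[i - 1] if i > 0 else None


# Hypothetical hotspot index: object label -> sorted start frames of appearances.
hotspot_index = {"vehicle": [120, 480, 2310], "actor": [15, 900]}

# The viewer pauses at frame 480, clicks the vehicle and asks for the next one.
target = next_appearance(hotspot_index["vehicle"], 480)
```

Keeping the appearance lists sorted lets each jump be resolved with a binary search instead of a scan over all frames.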
`
`In order to ensure that humans do not perceive any discontinuity in the video
`stream, a frame rate of at least 25 fps is required, that is, 90,000 images for one
`hour of video content (25 frames per second × 3,600 seconds). This video content can
`be complex and rich in terms of objects, shots, scenes, events, key-frames, sounds,
`narration and motion. An original video, in MPEG-1 format, is transformed into
`interactive video form through a series of re-structuring phases of its content. Both
`the original video and the interactive video contain the same information, with one
`major difference in the structure of the document: the original video has an implicit
`structure, while its interactive form has an explicit one. In an explicit structure,
`the hotspots and key-elements are emphasized and links between these elements are
`created. If such links are not established with items from outside the video content,
`then the produced document is called raw interactive video. For simplicity we will
`refer to it just as interactive video. Here we introduce two other extensions of
`interactive video documents: interactive video presentation and interactive video
`database.
`
`Definition 2. Interactive video presentation is a form of interactive video
`document that is centered on enriched video but is not exclusively video.
`
`In this context, the interactive video document contains not only the raw enriched
`video but also several kinds of data in a time-synchronized fashion. This additional
`data is added for two reasons: (1) enhancing the video content, and (2) making the
`video presentation self-contained. The type of data is determined by the author of the
`interactive video. This author would intelligently integrate the technique of the
`video producer or filmmaker with the technique and wisdom of the skillful teacher. The
`objective is to maximize the chances of conveying the purpose and key information in
`the video well. Additional data can be documents of all types, e.g. html, tables,
`music and/or still frames and sequences of images, that are available locally or
`remotely through the Web. Such types of interactive videos are also known as
`hyperfilm, hypervideo or hypermedia [151, 47].
`
`Definition 3. Interactive video database is a collection of interactive video
`documents and interactive video presentations.
`
`The main interactive video document is seen here as a master document
`which contains in itself a large number of independent interactive videos.
`Users can access the video database in two forms: searching and browsing.
`For instance, a sketch may be used to define an image in a frame. The system
`can look for that image and retrieve the frames that contain it. The user
`can also move the sketch within a certain pattern indicating to the system to
`find in a large interactive video database all sets of frames that represent a
`similar moving pattern for that object of interest (hotspot). As technologies
`such as object detection, tracking and image recognition evolve, it will be
`possible to provide better ways to specify hotspots in interactive video, which
`will be covered in more detail in the following sections. When browsing is not
`convenient for a user to locate specific information, he or she would utilize
`searching routines instead. The searching process would result in a small set of
`interactive video sub-documents that could be browsed and easily navigated.
`
`3 Video Transformation and Re-structuring Phases
`
`The transformation of an original video format to its interactive form aims at
`extracting the key-elements or components of the video structure first, and
`then creating links between elements of various levels of this structure.
`
`3.1 Video Structure Components
`
`The presentation of a document often follows a domain-specific model. Readers of a
`textbook expect to see the book content organized into parts, chapters, sections,
`paragraphs and indexes. Such structure is presented to the reader upfront in a
`table-of-contents. Unfortunately the structure of video documents is not explicitly
`apparent to the viewer. The re-structuring process aims at automatically constructing
`such a table-of-contents [276].
`A video collection can be divided into multiple categories by grouping documents
`with similar structures together. Thus, feature-length movies, news, sports, TV shows,
`and surveillance videos have different structures, but with many basic elements in
`common. Figures 1 and 2 illustrate the general structure of movies and news videos.
`
`[Figure 1 diagram; recoverable labels: Objects, Shots, Matching, Grouping, Descriptors, low-level features]
`
`Fig. 1. Illustration of the structure of a feature-length movie
`
`The following list of components covers the video structures of most types of
`video documents:
`
`1. Camera shots: A camera shot is an unbroken sequence of frames
`recorded from a single camera, during a short period of time, and thus it
`contains few changes in background and scene content. A video sequence
`is therefore a concatenation of camera shots. A cut is where the last frame
`in one shot is followed by the first frame in the next shot.
`2. Gradual transitions: Three other types of shot boundaries may be
`found in video documents: fades, dissolves, and wipes. A fade is where
`the frames of the shot gradually change from or to black. A dissolve is
`where the frames of the first shot are gradually morphed into the frames
`of the second. A wipe is where the frames of the first shot are moved
`gradually in a horizontal or vertical direction into the frames of the second.
`3. Key-frames: A key-frame is a still image of a video shot that best
`represents the content of the shot.
`4. Visual objects, zones: Objects of interest in video documents are
`similar to key-words in text and html documents. They could be detected
`either manually or automatically in key-frames, or within individual frames
`of a video shot.
`5. Audio objects: An audio object could take several forms, ranging from
`a clip of music to a single word.
`6. Text objects: A text object is defined as text attached to the video
`sequence, such as footnotes and text superimposed on images.
`7. Events: An event is the basic segment of time during which an important
`action occurs in the video.
`8. Scenes: A scene is the minimum set of sequential shots that conveys
`a certain meaning in terms of narration.
`9. Cluster of objects: A cluster is a collection of objects with similar
`characteristics like appearance, shape, color, texture, sounds, etc.
`10. Narrative sequence: A narrative sequence projects a large concept by
`combining multiple scenes together.
`11. Summary: A video summary of an input video is seen as a new brief
`document which may consist of an arrangement of video shots and scenes,
`or an arrangement of still key-frames.
`
`[Figure 2 diagram; recoverable labels: Concepts, Sequences, Scenes, Shot, Soccer, constructions, meetings, Financial, violence]
`
`Fig. 2. Illustration of the general structure of news videos
`
`Fig. 3. A cut between two camera shots is observed at frames 90-91
`
`The above components are labeled either as low-level components or high-level
`components. The first category includes video shots, audio-visual objects,
`key-frames and words, while the second contains events, scenes, summaries,
`and narrative sequences.
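
One way to picture how these components nest is as a small set of record types. The Python sketch below is only an illustration under simplifying assumptions: the class names, fields and helper methods are hypothetical, and only three of the eleven components (shots, scenes and narrative sequences) are modeled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Shot:
    """Low-level component: an unbroken run of frames from one camera."""
    start_frame: int
    end_frame: int  # inclusive
    key_frames: List[int] = field(default_factory=list)

    def length(self) -> int:
        return self.end_frame - self.start_frame + 1


@dataclass
class Scene:
    """High-level component: sequential shots that convey one meaning."""
    label: str
    shots: List[Shot] = field(default_factory=list)

    def span(self) -> Tuple[int, int]:
        return self.shots[0].start_frame, self.shots[-1].end_frame


@dataclass
class NarrativeSequence:
    """High-level component: several scenes combined into a larger concept."""
    label: str
    scenes: List[Scene] = field(default_factory=list)

    def shot_count(self) -> int:
        return sum(len(scene.shots) for scene in self.scenes)


# Illustrative data: two shots separated by a cut between frames 90 and 91.
scene = Scene("first meeting", [Shot(0, 90, [45]), Shot(91, 150, [120])])
sequence = NarrativeSequence("arrival", [scene])
```

Making the hierarchy explicit in this way is what lets a player expose scenes and key-frames as access points instead of a bare timeline.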
`
`3.2 Toward Automatic Video Re-structuring
`
`Manual labeling and re-structuring of video content is an extremely time consuming,
`cost intensive and error prone endeavor that often results in incomplete and
`inconsistent annotations. Annotating one hour of video may require more than ten
`hours of human effort for basic decomposition of the video into shots, key-frames and
`zones.
`In order to overcome these challenges, intensive research work has been
`done in computer vision and video processing to automate the process of video
`re-structuring. Some interactive video authoring systems [127, 157] offer tools
`that automate shot detection, zone localization, object tracking and scene
`recognition.
`
`3.2.1 Shot Boundary Detection
`
`Several methods have been proposed for shot boundary detection in both the
`compressed and non-compressed domains [38, 36, 13, 37, 134, 347, 356, 52]. Pair-wise
`comparison [368] checks each pixel in one frame against the corresponding pixel in the
`next frame. In this approach, the gray-scale values of the pixels at corresponding
`locations in two successive frames are subtracted and the absolute value is used as a
`measure of dissimilarity between the pixel values. If this value exceeds a certain
`threshold, then the pixel gray scale is said to have changed. The percentage of
`pixels that have changed is the measure of dissimilarity between the frames. This
`approach is computationally simple but sensitive to digitization noise, illumination
`changes and object motion. As a means to compensate for this, the likelihood ratio,
`histogram comparison, model-based comparison, and edge-based approaches have been
`proposed. The likelihood ratio approach [368] compares blocks of pixel regions. The
`color histogram method [37] compares the intensity or color histograms of adjacent
`frames. Model-based comparison [134] uses the video production system as a template.
`Edge detection segmentation [364] looks for entering and exiting edge pixels. Color
`blocks in a compressed MPEG stream [34] are processed to find shots. DCT-based shot
`boundary detection [16] uses differences in motion encoded in an MPEG stream to find
`shots. Smeaton et al. [297, 246, 38] combine a color histogram-based technique with
`edge-based and MPEG macroblock methods to improve on the performance of each
`individual method for shot detection. Recently, Chen et al. [54] proposed a
`multi-filtering technique that combines histogram comparison, pixel comparison and
`object tracking techniques for shot detection. They perform object tracking to help
`determine the actual shot boundaries when both the pixel comparison and histogram
`comparison techniques fail. As reported in [54], experiments on a large number of
`video sequences (over 1000 testing shots) show a very promising shot detection
`precision of greater than ninety-two percent and recall beyond ninety-eight percent.
`With such solid performance, very little manual effort is needed to correct the
`false positives and to recover the missed shots during shot detection. For most
`applications, the color histogram appears to be the simplest and most computationally
`inexpensive technique that obtains satisfactory results.
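
The pair-wise comparison and histogram comparison measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any cited method: frames are assumed to be non-empty, equally sized lists of rows of gray levels in the range 0 to 255, and the function names, bin count and thresholds are arbitrary choices made for the example.

```python
def pairwise_dissimilarity(frame_a, frame_b, pixel_threshold):
    """Fraction of pixels whose gray-level difference exceeds pixel_threshold."""
    changed = 0
    total = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > pixel_threshold:
                changed += 1
    return changed / total


def histogram(frame, bins=8, levels=256):
    """Gray-level histogram with equally wide bins."""
    counts = [0] * bins
    for row in frame:
        for value in row:
            counts[value * bins // levels] += 1
    return counts


def histogram_dissimilarity(frame_a, frame_b, bins=8):
    """Normalized bin-wise histogram difference, between 0 and 1."""
    hist_a, hist_b = histogram(frame_a, bins), histogram(frame_b, bins)
    return sum(abs(x - y) for x, y in zip(hist_a, hist_b)) / (2 * sum(hist_a))


def detect_cuts(frames, measure, cut_threshold):
    """Indices i such that a cut lies between frames i - 1 and i."""
    return [i + 1 for i in range(len(frames) - 1)
            if measure(frames[i], frames[i + 1]) > cut_threshold]


# Toy sequence: two dark frames followed by two bright frames.
dark = [[10, 12], [11, 10]]
bright = [[200, 210], [205, 199]]
frames = [dark, dark, bright, bright]
cuts_pixel = detect_cuts(frames, lambda a, b: pairwise_dissimilarity(a, b, 20), 0.5)
cuts_hist = detect_cuts(frames, histogram_dissimilarity, 0.5)
```

Both measures report the cut at frame index 2 on this toy sequence. The histogram measure ignores where pixels are located, which is why it tolerates object motion better than the pixel-wise measure.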
`
`3.2.2 Key-frame Detection
`
`Automatic extraction of key-frames ought to be content based so that key-frames
`maintain the important content of the video while removing all redundancy. Ideally,
`high-level primitives of video, such as objects and events, should be used. However,
`because such components are not always easy to identify, current methods tend to rely
`on low-level image features and other readily available information instead. In that
`regard, the work done in this area [369, 377, 18, 341] could be grouped into three
`categories: static, dynamic and content-based approaches. The first category selects
`frames at a fixed time-code sampling rate such that every n-th frame is retained as a
`key-frame [237]. The second category relies on motion analysis to eliminate redundant
`frames. The optical flow vector is first estimated on each frame, then analyzed using
`a metric as a function of time to select key-frames at the local minima of motion
`[341]. Besides the complexity of computing a dense optical flow vector, the
`underlying assumption of local minima may not work if constant variations are
`observed. The third category analyzes the variation of the content in terms of color,
`texture and motion features [369]. Avrithis et al. [18] select key-frames at local
`minima and local maxima of the magnitude of the second derivative of the composed
`feature curve of all frames of a given shot. A composed feature is made up of a
`linear combination of both color and motion vectors. More sophisticated pattern
`classification techniques, like Gaussian Mixture Models (GMM), have been used to
`group similar frames [377], and close appearances of segmented objects [132], into
`clusters. Depending on the compactness of each obtained cluster, one or more images
`could be extracted as key-frames.
`
`Fig. 4. Sample images of a tracked vehicle in a video shot of 66 frames
`
`Fig. 5. Three representative key-frames (top) and the corresponding key-appearances
`(bottom) extracted from the video sequence of Fig. 4, using the method of [132]
`
`Figure 5 shows key-appearances extracted from the video shot of Fig. 4 using the
`method of [132]. In this example, the vehicle is being tracked, and modeled
`with GMM in the RGB color histogram space.
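
The static and dynamic key-frame categories described in this section can be illustrated with a short Python sketch. It assumes that a per-frame motion metric (for example, a summary of optical flow magnitude) has already been computed by some other means; the function names and the sample values are hypothetical.

```python
def static_key_frames(num_frames, n):
    """Static approach: retain every n-th frame of the shot."""
    return list(range(0, num_frames, n))


def motion_minima_key_frames(motion):
    """Dynamic approach: retain frames at strict local minima of motion."""
    return [i for i in range(1, len(motion) - 1)
            if motion[i] < motion[i - 1] and motion[i] < motion[i + 1]]


# Hypothetical motion metric per frame; frames 4 and 5 form a flat stretch.
motion = [5, 3, 4, 6, 2, 2, 7]
sampled = static_key_frames(10, 4)
minima = motion_minima_key_frames(motion)
```

On this sample only frame 1 is selected by the dynamic rule. The flat stretch at frames 4 and 5 yields no strict local minimum, which is one plausible reading of why the local-minima assumption can break down when the motion stays constant.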
`
`3.2.3 Object Detection
`
`Extracting the objects from the video is the first step in object-based analysis.
`Video objects or hot spots are classified into static zones and moving objects.
`The purpose of object detection is to initialize the tracking by identifying
`the object boundaries in the first frame of the shot. The accuracy of automatic
`object detection depends on the prior knowledge about the object to be detected as
`well as the complexity of the background and the quality of the video signal. As an
`example, detecting the position of an eye in an infrared video of a driver [133, 136]
`is easier than locating pedestrians in a night-time sequence of images. The eye has a
`very distinct and unique shape and appearance among people, while the appearance,
`shape, scale and size characteristics may vary widely among pedestrians. On a related
`topic, Prati et al. [256] recently proposed an object detection method that
`differentiates between a moving car and its moving shadows on a highway.
`
`Fig. 6. Semantic scenes of a clip from the movie “Dances with Wolves”: “first
`meeting” (scene 1); “horses” (scene 2); “coffee preparation” (scene 3); “working”
`(scene 4); and “dialog in the tent” (scene 5).
`
`Several efforts were undertaken to find objects belonging to specific classes
`like faces, figs and vehicles [202, 240, 254, 120, 333]. Recent systems like the
`Informedia Project [155] include the detection of common objects like text and faces
`in their generation of video summaries. Systems such as Netra-V [79] and VideoQ [50]
`used spatio-temporal segmentation to extract regions which are supposed to correspond
`to video objects. Others [184, 269, 191] have resorted to user-assisted segmentation.
`Although it is easier to define an object by its bounding box, accurately delineating
`an object is generally advantageous in reducing clutter for the subsequent tracking
`and matching phases. In [269], the color distributions of both the object and its
`background are modeled by a Gaussian mixture. Interestingly, this process is
`interactive, i.e. the user may iteratively assist the scheme in determining the
`boundary, starting from sub-optimal solutions, if needed. Recent work goes further in
`this direction [191], by learning the shape of an object category, in order to
`introduce it as an elaborate input into the energy function involved in interactive
`object detection.
`In the interactive video framework, it is crucial that the tracking algorithm can
`be initialized easily through a semi-automatic or automatic process. Automatic
`detection methods may be categorized into three broad classes: model-based,
`motion-based (optical flow, frame differencing), and background subtraction
`approaches. For static cameras, background subtraction is probably the most popular.
`Model-based detection approaches usually employ a supervised learning process of the
`object model using representative training patches of the object in terms of
`appearances, scales, and illumination conditions, as in a multi-view face detector
`[202] or a hand posture recognition algorithm [344]. In contrast, motion-based
`detection consists of segmenting the scene into zones of independent motion using
`optical flow and frame differencing techniques [240]. In many applications, the
`knowledge-based, motion-based and model-based techniques are combined to ensure
`robust object detection. Despite all of these efforts, accurate automatic
`object detection is successful only in some domain-specific applications.
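
For the static-camera case, background subtraction can be sketched as follows. This is a deliberately simple illustration that assumes gray-level frames stored as lists of rows and a running-average background model; the update rate, the threshold and the function names are assumptions made for the example and do not come from the cited systems.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background estimate (running average)."""
    return [[(1 - alpha) * b + alpha * f for b, f in zip(row_b, row_f)]
            for row_b, row_f in zip(background, frame)]


def foreground_mask(background, frame, threshold=30):
    """Mark pixels that differ from the background by more than threshold."""
    return [[abs(f - b) > threshold for b, f in zip(row_b, row_f)]
            for row_b, row_f in zip(background, frame)]


def bounding_box(mask):
    """Return (top, left, bottom, right) of the marked pixels, or None."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, marked in enumerate(row) if marked]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)


# Toy scene: a uniform background and a bright object covering three pixels.
background = [[50] * 4 for _ in range(4)]
frame = [row[:] for row in background]
frame[1][1] = frame[1][2] = frame[2][2] = 200
box = bounding_box(foreground_mask(background, frame))
```

The resulting box could serve to initialize a tracker in the first frame of a shot. A practical detector would also have to deal with noise, shadows and several simultaneous objects, which this sketch ignores.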
`
`3.2.4 Intra-Shot Object Tracking
`
`Tracking is easier and faster than object detection since the object state
`(position, scale, boundaries) is known in the previous frame, and the search
`procedure in the target frame is local (search window) rather than global (entire
`frame). The tracking task is easier when tracked objects have low variability in the
`feature space and exhibit smooth and rigid motion. In reality, video objects are
`often very difficult to track due to changes in pose, scale, appearance, shape
`and lighting conditions. Therefore, a robust and accurate tracker requires a
`self-adaptation strategy to these changes in order to maintain tracking in difficult
`situations, an exit strategy that terminates the tracking process if drift or
`mistracking occurs, and a recovery process that allows object reacquisition when
`tracking is lost.
`In the literature, three categories of methods can be identified: contour-based,
`color-based and motion-based techniques. For tracking of non-rigid and deformable
`objects, geodesic active contours and active contour models have proved to be
`powerful tools [247, 181, 320, 250]. These approaches use an energy minimization
`procedure to obtain the best contour, where the total energy consists of an internal
`energy term for contour smoothness and an external energy term for edge likelihood.
`Many color-based tracking methods [66, 27, 166, 249, 378] require a typical object
`color histogram. The color histogram is a global feature used frequently in tracking
`and recognition due to its quasi-invariance to geometric and photometric variations
`(using the hue-saturation subspace), as well as its low computation requirements. In
`[27], a side view of the human head is used to train a typical color model of both
`skin color and hair color to track the human head with out-of-plane rotation. The
`third category of tracking methods tends to formulate a tracking problem as the
`estimation of a 2D inter-frame motion field over the region of interest [110].
`Recently, the concern in tracking is shifting toward real-time performance.
`This is essential in an interactive video authoring system where the user could
`initialize the tracker and desire quick tracking results in the remaining frames
`of a shot. Real-time object tracking is addressed in more detail in Chapter 4.
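
The idea of a local search around the previous object position, combined with a histogram model of the object, can be sketched as below. To stay short, the example uses gray-level histograms instead of color histograms and an exhaustive search over a small window, scored with the Bhattacharyya coefficient. These choices, the function names and the toy frames are assumptions made for illustration only.

```python
import math


def patch_histogram(frame, top, left, height, width, bins=4, levels=256):
    """Normalized gray-level histogram of a rectangular patch."""
    counts = [0] * bins
    for r in range(top, top + height):
        for c in range(left, left + width):
            counts[frame[r][c] * bins // levels] += 1
    total = height * width
    return [count / total for count in counts]


def bhattacharyya(p, q):
    """Similarity of two normalized histograms (1.0 means identical)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def track(frame, model_hist, prev_top, prev_left, height, width, radius=2):
    """Search a window around the previous position for the best match."""
    rows, cols = len(frame), len(frame[0])
    best_score, best_pos = -1.0, (prev_top, prev_left)
    for top in range(max(0, prev_top - radius),
                     min(rows - height, prev_top + radius) + 1):
        for left in range(max(0, prev_left - radius),
                          min(cols - width, prev_left + radius) + 1):
            candidate = patch_histogram(frame, top, left, height, width)
            score = bhattacharyya(model_hist, candidate)
            if score > best_score:
                best_score, best_pos = score, (top, left)
    return best_pos, best_score


def make_frame(top, left, size=6):
    """Toy frame: black background with a bright 2x2 object."""
    frame = [[0] * size for _ in range(size)]
    for r in range(top, top + 2):
        for c in range(left, left + 2):
            frame[r][c] = 255
    return frame


first, second = make_frame(1, 1), make_frame(2, 3)
model = patch_histogram(first, 1, 1, 2, 2)      # object model from frame one
position, score = track(second, model, 1, 1, 2, 2)
```

A low best score could be used as the exit condition mentioned earlier in this section, signaling that the tracker has drifted or lost the object.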
`
`3.2.5 Recognition and Classification of Video Shots and Objects
`
`Once the shots, key-frames, and objects are identified in a video sequence, the
`next step is to identify spatial-temporal similarities between these entities in
`order to construct high-level components of the video structure like groups of
`objects [131] and clusters of shots [275]. In order to retrieve objects similar to an
`object-query, a recognition algorithm that entails feature extraction, distance
`computation and a decision rule is required. Recognition and classification
`techniques often employ low-level features like color, texture and edge
`histograms [24] as well as high-level features like object blobs [45].
`An effective method for recognizing similar objects in video shots is to
`specify generic object models and find objects that conform to these models
`[89]. Different models were used in methods for recognizing objects that are
`common in video. These include methods for finding text [222, 375], faces
`[271, 286] and vehicles [286]. Hammoud et al. [131, 128] proposed a successful
`framework for supervised and non-supervised classification of inter-shot objects
`using the Maximum A Posteriori (MAP) rule and hierarchical classification techniques,
`respectively. Figure 8 illustrates some obtained results. Each tracked object is
`modeled by a Gaussian Mixture Model (GMM) that captures the intra-shot changes in
`appearance in the RGB color feature space. In the supervised approach, the user
`chooses the object classes interactively, and the classification algorithm assigns
`each remaining object to an object class according to the MAP rule. In contrast, the
`unsupervised approach consists of computing the Kullback distance [190, 106] between
`the Gaussian Mixture Models of all identified objects in the video, and then feeding
`the hierarchical clustering algorithm [173] the constructed proximity matrix
`[169, 264]. These approaches seem to substantially improve the recognition rate of
`video objects between shots. Recently, Fan et al. [96] proposed a successful
`hierarchical semantics-sensitive video classifier to shorten the semantic gap between
`the low-level visual features and the high-level semantic concepts. The hierarchical
`structure of the semantics-sensitive video classifier is derived from the domain
`dependent concept hierarchy of video contents in the database. Relevance analysis is
`used to shorten the semantic gap by selecting the discriminating visual features.
`Part two of this book will address in detail these
`issues of object detection, editing, tracking, recognition, and classification.
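
The unsupervised pipeline outlined above (a distance between object models, a proximity matrix, then hierarchical clustering) can be sketched in simplified form. The example below makes strong assumptions that should be stated plainly: each object is summarized by a single one-dimensional Gaussian (mean, variance) instead of a Gaussian mixture in RGB space, the symmetrized Kullback-Leibler divergence stands in for the Kullback distance, and single-linkage agglomeration with a stopping threshold stands in for the hierarchical clustering algorithm of [173]. All names and values are hypothetical.

```python
import math


def kl_gaussian(mu0, var0, mu1, var1):
    """KL divergence KL(N0 || N1) between two one-dimensional Gaussians."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)


def symmetric_kl(a, b):
    """Symmetrized divergence between two (mean, variance) models."""
    return kl_gaussian(*a, *b) + kl_gaussian(*b, *a)


def proximity_matrix(models):
    """Pairwise distances between all object models."""
    n = len(models)
    return [[symmetric_kl(models[i], models[j]) for j in range(n)]
            for i in range(n)]


def single_linkage(dist, max_distance):
    """Merge the closest clusters until the gap exceeds max_distance."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > max_distance:
            break
        _, x, y = best
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


# Hypothetical models of three tracked objects: (mean intensity, variance).
models = [(100.0, 25.0), (102.0, 25.0), (200.0, 25.0)]
groups = single_linkage(proximity_matrix(models), max_distance=1.0)
```

With these values the first two objects are grouped and the third remains on its own. For real Gaussian mixtures the divergence has no closed form and would have to be approximated, for instance by sampling.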
`
`3.2.6 Video Events, Highlights and Scenes Detection
`
`In recent years, research towards the automatic detection and recognition of
`highlights and events in specific applications such as news and sports video, where
`events and highlights are relatively well-defined based on the domain knowledge of
`the underlying video data, has gained a lot of attention [182, 54, 199, 201, 198,
`272, 281, 361]. A method for detecting news reporting was presented in [213]. Seitz
`and Dyer [289] proposed an affine view-invariant trajectory matching method to
`analyze cyclic motion. In [310], Stauffer and Grimson classify activities based on
`the aspect ratio of the tracked objects. Haering et al. [122] detect hunting
`activities in wildlife video. In [54], a new multimedia data mining framework has
`been proposed for the detection of soccer goal shots by using combined multimodal
`(audio/visual) features and classification rules. The output results can be used for
`annotation and indexing of the high-level structures of soccer videos. This framework
`exploits the rich semantic information contained in visual and audio features for
`soccer video data, and incorporates the data mining process for effective detection
`of soccer goal events.
`Regarding automatic detection of semantic units like scenes in feature-length
`movies, several graph-based clustering techniques have been proposed in the
`literature [357, 275, 125]. In [126, 127], a general framework of two phases,
`clustering of shots, and margin of overlapped clusters, has been proposed to extract
`the video scenes of a feature-length movie. Two shots are matched on the basis of
`color-metric histograms, color-metric auto-correlograms [147] and the number of
`similar objects localized in their generated key-frames. Using the hierarchical
`classification technique, a partition of shots is identified for each feature
`separately. From all these partitions a single partition is deduced based on a union
`distance between various clusters. The
`obtained clusters are then linked together using the temporal relations of Allen
`[7] in order to construct a temporal graph of cluste