Signals and Communication Technology
Riad I. Hammoud (Ed.)

Interactive Video
Algorithms and Technologies

With 109 Figures and 8 Tables

Springer
Dr. Riad I. Hammoud
Delphi Electronics and Safety
World Headquarters
MIC Euo, P.O. Box 9005
Kokomo, Indiana 46904-9005
USA
e-mail: riad.hammoud@delphi.com

Library of Congress Control Number: 2006923234

ISSN: 1860-4862
ISBN-10: 3-540-33214-6 Springer Berlin Heidelberg New York
ISBN-13: 978-3-540-33214-5 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media.
springer.com

© Springer-Verlag Berlin Heidelberg 2006
Printed in The Netherlands

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting by SPi Publisher Services using a Springer LaTeX macro package
Cover design: design & production, Heidelberg
Printed on acid-free paper
SPIN: 11399551

62/3100/SPi - 5 4 3 2 1 0
Introduction to Interactive Video

Riad I. Hammoud

Delphi Electronics and Safety, IN, USA
riad.hammoud@delphi.com
1 Introduction

In recent years, digital video has been widely employed as an effective media format not only for personal communications, but also for business-to-employee, business-to-business and business-to-consumer applications. It appears more attractive than other static data types like text and graphics, as its underlying rich content conveys easily and effectively the goals of the provider by means of image, motion, sound and text, all presented to the consumer in a timely synchronized manner. Multimedia documents are more accessible than ever, due to a rapid expansion of Internet connectivity and increasing interest in online multimedia-rich applications.

Video content has skyrocketed in light of the aforementioned advances, and as a result of increasing network bandwidth capacities, decreasing cost of video acquisition and storage devices, and improving compression techniques. Today, many content providers, like Google Video [118] and Yahoo [353], have created "open video" marketplaces that enable consumers to buy and rent a wide range of video content (video-on-demand) [292] from major television networks, professional sports leagues, cable programmers, independent producers and film makers. The evolving digital video archives include various types of video documents like feature-length movies, music videos from SONY BMG, news, "Charlie Rose" interviews, medical and e-learning videos, as well as prime-time and classic hits from CBS, sports and NBA games, historical content from ITN, and new titles being added every day. Such video archives represent a valuable business asset and an important on-line video source to providers and consumers, respectively. The market is taking another important boost as video distribution is becoming available for people on "the move", where they can select, download and view various video types on their consumer electronics devices like mobile TV [227] and video-playing iPods [158]. In 2005, the number of hits for downloaded music data from iTunes on Apple's iPods reached 20.7 million [163]. This number is expected to increase with the appearance of video-playing iPods and mobile TVs.
Traditionally, video data is either annotated manually, or consumed by end-users in its original form. Manual processing is expensive, time consuming and often subjective. On the other hand, a video file provided in a standard format like MPEG-1 or MPEG-2 is played back using media players with limited conventional control interaction features like play, fast forward/backward, and pause. In order to remedy these challenges and address the issues of the growing number and size of video archives and detail-on-demand videos, innovative video processing and video content management solutions, ranging from decomposing, indexing, browsing and filtering to automatic searching techniques as well as new forms of interaction with video content, are needed more than ever [324, 276, 130, 143, 303, 26, 278, 152]. Automatic processing will substantially drop the cost and reduce the errors of human operators, and it will introduce a new set of interactivity options that gives users advanced interaction and navigational possibilities with the video content. One can navigate through the interactive video content in a non-linear fashion by downloading, browsing and viewing, for instance, only the "actions" of a film character that caught his or her attention in the beginning of a feature-length movie [194, 155, 130, 199, 299]. Retaining only the essential information of a video sequence, such as representative frames, events and highlights, improves the storage, bandwidth and viewing time.

Within the framework of the emerging "interactive video" technology and the MPEG-4 and MPEG-7 standards [121, 124, 130, 293, 233, 275], this book addresses the following two major issues that concern both content providers and consumers: (1) automatic restructuring, indexing and cataloging of video content, and (2) advanced interaction features for audio-video editing, playing, searching and navigation. In this chapter, we briefly introduce the concept, rhetoric, algorithms and technologies of interactive video, automatic video restructuring and content-based video retrieval systems.
2 What is an Interactive Video?

In order to simplify the understanding of interactive video environments, it is worth reviewing how people are used to viewing and interacting with video content. Current VCRs and video players provide basic control options such as play/stop, fast forward/backward and slow-motion picture streaming. The video is mostly viewed in a passive way as a non-stop medium where the user's interaction with the content is somewhat limited. For example, users cannot stop the video playback to jump to another place inside or outside the video document that provides related information about a specific item in the video, like a commercial product, a film character, or a concealed object. Hence, the viewing of the video is performed in a linear fashion where the only way to discover what is next is to follow the narration and move through the video guided by seconds and minutes.
Such conventional techniques for video viewing and browsing seem to be inefficient for most users to get the crux of the video. Users' ease and efficiency could be improved through:
1. providing a representative visual summary of the video document prior to downloading, storing or watching it. Alternatively, users could select the video based on just a few snapshots;
2. presenting a list of visual entries, like key-frames, hot spots, events, scenes and highlights, that serve as meaningful access points to desired video content, as opposed to accessing the video from the beginning to the end; and,
3. showing a list of navigational options that allows users to follow internal and external links between related items in the same video or in other media documents like web pages.
Interactive video refers to nowadays uncommon forms of video documents that accept and respond to the input of a viewer beyond just conventional VCR interactive features like play and pause. For instance, in its basic form, interactive video allows users to pause the video, click on an object of interest in a video frame, and choose to jump from one arbitrary temporal frame to another where the selected object has appeared. Instead of being guided by seconds and minutes, the user of the interactive video form navigates through the video in a very efficient non-linear fashion with options such as "next appearance", "previous scene" and "last event". Tentatively, the following definition of interactive video could be drawn:

Definition 1. Interactive video is a digitally enriched form of the original raw video sequence, allowing viewers attractive and powerful interactivity forms and navigational possibilities.
In order to ensure that humans do not perceive any discontinuity in the video stream, a frame rate of at least 25 fps is required, that is, 90,000 images for one hour of video content. This video content can be complex and rich in terms of objects, shots, scenes, events, key-frames, sounds, narration and motion. An original video, in MPEG-1 format, is transformed to interactive video form through a series of re-structuring phases of its content. Both the original video and the interactive video contain the same information, with one major difference in the structure of the document. The original video has an implicit structure, while its interactive form has an explicit one. In an explicit structure, the hotspots and key-elements are emphasized and links between these elements are created. If such links are not established with items from outside the video content, then the produced document is called raw interactive video. For simplicity we will refer to it just as interactive video. Here we introduce two other extensions of interactive video documents: interactive video presentation and interactive video database.
Definition 2. Interactive video presentation is a form of interactive video document that is centered on enriched video but is not exclusively video.

In this context, the interactive video document contains not only the raw enriched video but also includes several kinds of data in a time-synchronized fashion. This additional data is added for two reasons: (1) enhancing the video content, and (2) making the video presentation self-contained. The type of data is determined by the author of the interactive video. This author would intelligently integrate the technique of the video producer or filmmaker with the technique and wisdom of the skillful teacher. The objective is to maximize the chances of conveying the purpose and key information in the video well. Additional data can be documents of all types, i.e. html, tables, music and/or still frames and sequences of images, that are available locally or remotely through the Web. Such types of interactive videos are also known as hyperfilm, hypervideo or hypermedia [151, 47].
Definition 3. Interactive video database is a collection of interactive video documents and interactive video presentations.

The main interactive video document is seen here as a master document which contains in itself a large number of independent interactive videos. Users can access the video database in two forms: searching and browsing. For instance, a sketch may be used to define an image in a frame. The system can look for that image and retrieve the frames that contain it. The user can also move the sketch within a certain pattern, indicating to the system to find in a large interactive video database all sets of frames that represent a similar moving pattern for that object of interest (hotspot). As technologies such as object detection, tracking and image recognition evolve, it will be possible to provide better ways to specify hotspots in interactive video, which will be covered in more detail in the following sections. When browsing is not convenient for a user to locate specific information, he or she would utilize searching routines instead. The searching process would result in a small set of interactive video sub-documents that could be browsed and easily navigated.
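To make the preceding definitions more concrete, the following is a minimal sketch, in Python, of how the explicit structure of an interactive video document could be represented: hotspots with per-shot appearances, optional external links, and a simple "next appearance" navigation option. All class and function names here are illustrative assumptions, not part of the chapter or of any particular authoring system.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Appearance:
    shot_id: int
    first_frame: int
    last_frame: int

@dataclass
class Hotspot:
    name: str                                              # e.g. a film character or product
    appearances: List[Appearance] = field(default_factory=list)
    external_links: List[str] = field(default_factory=list)  # web pages, related documents

@dataclass
class InteractiveVideo:
    title: str
    frame_rate: float = 25.0
    hotspots: List[Hotspot] = field(default_factory=list)

def next_appearance(hotspot: Hotspot, current_frame: int) -> Optional[Appearance]:
    """Support a "next appearance" navigation option: jump to the first
    appearance of the selected hotspot that starts after the current frame."""
    later = [a for a in hotspot.appearances if a.first_frame > current_frame]
    return min(later, key=lambda a: a.first_frame) if later else None

In this reading, a document whose hotspots carry no external links corresponds to the raw interactive video of Definition 1, while populating them with web pages or other documents moves it toward the interactive video presentation of Definition 2.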
3 Video Transformation and Re-structuring Phases

The transformation of an original video format to its interactive form aims at extracting the key-elements or components of the video structure first, and then creating links between elements of various levels of this structure.

3.1 Video Structure Components

The presentation of a document often follows a domain-specific model. Readers of a textbook expect to see the book content organized into parts, chapters, sections, paragraphs and indexes. Such structure is presented to the reader up-front in a table-of-contents. Unfortunately, the structure of video documents is not explicitly apparent to the viewer. The re-structuring process aims at automatically constructing such a table-of-contents [276].
A video collection can be divided into multiple categories by grouping documents with similar structures together. Thus, feature-length movies, news, sports, TV shows, and surveillance videos have different structures, but with many basic elements in common. Figures 1 and 2 illustrate the general structure of movies and news videos.

Fig. 1. Illustration of the structure of a feature-length movie
Fig. 2. Illustration of the general structure of news videos

The following list of components covers the video structures of most types of video documents:

1. Camera shots: A camera shot is an unbroken sequence of frames recorded from a single camera during a short period of time, and thus it contains little change in background and scene content. A video sequence is therefore a concatenation of camera shots. A cut is where the last frame in one shot is followed by the first frame in the next shot.
2. Gradual transitions: Three other types of shot boundaries may be found in video documents: fades, dissolves, and wipes. A fade is where the frames of the shot gradually change from or to black. A dissolve is where the frames of the first shot are gradually morphed into the frames of the second. And a wipe is where the frames of the first shot are moved gradually in a horizontal or vertical direction into the frames of the second.
3. Key-frames: A key-frame is a still image of a video shot that best represents the content of the shot.
4. Visual objects, zones: Objects of interest in video documents are similar to key-words in text and html documents. They could be detected either manually or automatically in key-frames or within individual frames of a video shot.
5. Audio objects: An audio object could take several shapes, ranging from a clip of music to a single word.
6. Text objects: A text object is defined as the text joined to the video sequence, such as footnotes and superimposed text on images.
7. Events: An event is the basic segment of time during which an important action occurs in the video.
8. Scenes: A scene is the minimum set of sequential shots that conveys a certain meaning in terms of narration.
9. Cluster of objects: A cluster is a collection of objects with similar characteristics like appearance, shape, color, texture, sounds, etc.
10. Narrative sequence: A narrative sequence projects a large concept by combining multiple scenes together.
11. Summary: A video summary of an input video is seen as a new brief document which may consist of an arrangement of video shots and scenes, or an arrangement of still key-frames.
Fig. 3. A cut between two camera shots is observed at frames 90-91
The above components are labeled either as low-level components or high-level components. The first category includes video shots, audio-visual objects, key-frames and words, while the second contains events, scenes, summaries, and narrative sequences.
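As a rough illustration of how these components can be assembled into the table-of-contents mentioned in Sect. 3.1, the sketch below nests shots inside scenes and scenes inside narrative sequences. The structure and names are assumptions made for illustration only, not a prescribed format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    shot_id: int
    key_frames: List[int] = field(default_factory=list)   # frame indices of key-frames

@dataclass
class Scene:
    label: str
    shots: List[Shot] = field(default_factory=list)

@dataclass
class NarrativeSequence:
    label: str
    scenes: List[Scene] = field(default_factory=list)

def print_table_of_contents(sequences: List[NarrativeSequence]) -> None:
    """Print a textual table-of-contents of the re-structured video."""
    for seq in sequences:
        print(seq.label)
        for scene in seq.scenes:
            shot_ids = [s.shot_id for s in scene.shots]
            print(f"  {scene.label}: shots {shot_ids}")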
3.2 Toward Automatic Video Re-structuring

Manual labeling and re-structuring of video content is an extremely time consuming, cost intensive and error prone endeavor that often results in incomplete and inconsistent annotations. Annotating one hour of video may require more than ten hours of human effort for basic decomposition of the video into shots, key-frames and zones.
In order to overcome these challenges, intensive research work has been done in computer vision and video processing to automate the process of video re-structuring. Some interactive video authoring systems [127, 157] offer tools that automate shot detection, zone localization, object tracking and scene recognition.
3.2.1 Shot Boundary Detection

Several methods have been proposed for shot boundary detection in both the compressed and non-compressed domain [38, 36, 13, 37, 134, 347, 356, 52]. Pairwise comparison [368] checks each pixel in one frame against the corresponding pixel in the next frame. In this approach, the gray-scale values of the pixels at the corresponding locations in two successive frames are subtracted and the absolute value is used as a measure of dissimilarity between the pixel values. If this value exceeds a certain threshold, then the pixel gray scale is said to have changed. The percentage of the pixels that have changed is the measure of dissimilarity between the frames. This approach is computationally simple but sensitive to digitization noise, illumination changes and object motion. As a means to compensate for this, the likelihood ratio, histogram comparison, model-based comparison, and edge-based approaches have been proposed. The likelihood ratio approach [368] compares blocks of pixel regions. The color histogram method [37] compares the intensity or color histograms between adjacent frames. Model-based comparison [134] uses the video production system as a template. Edge detection segmentation [364] looks for entering and exiting edge pixels. Color blocks in a compressed MPEG stream [34] are processed to find shots. DCT-based shot boundary detection [16] uses differences in motion encoded in an MPEG stream to find shots. Smeaton et al. [297, 246, 38] combine the color histogram-based technique with edge-based and MPEG macroblock methods to improve the performance of each individual method for shot detection. Recently, Chen et al. [54] proposed a multi-filtering technique that combines histogram comparison, pixel comparison and object tracking techniques for shot detection. They perform object tracking to help determine the actual shot boundaries when both pixel comparison and histogram comparison techniques fail. As reported in [54], experiments on a large number of video sequences (over 1000 testing shots) show a very promising shot detection precision of greater than ninety-two percent and recall beyond ninety-eight percent. With such a solid performance, very little manual effort is needed to correct the false positives and to recover the missing positives during shot detection. For most applications, the color histogram appears to be the simplest and most computationally inexpensive technique to obtain satisfactory results.
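As an illustration of the pairwise pixel comparison and histogram comparison measures just described, the following sketch declares a cut when either measure exceeds a threshold. It assumes the frames have already been decoded into 8-bit gray-scale NumPy arrays; the function names and threshold values are illustrative only and are not taken from the cited methods.

import numpy as np

def pixel_change_ratio(prev, curr, pixel_thresh=30):
    """Fraction of pixels whose gray-scale value changed by more than pixel_thresh."""
    diff = np.abs(prev.astype(np.int16) - curr.astype(np.int16))
    return np.mean(diff > pixel_thresh)

def histogram_difference(prev, curr, bins=64):
    """L1 distance between normalized gray-scale histograms of two frames."""
    h1, _ = np.histogram(prev, bins=bins, range=(0, 256))
    h2, _ = np.histogram(curr, bins=bins, range=(0, 256))
    return np.abs(h1 / h1.sum() - h2 / h2.sum()).sum()

def detect_cuts(frames, pixel_ratio_thresh=0.6, hist_thresh=0.5):
    """Return indices i where a cut is declared between frames[i] and frames[i+1]."""
    cuts = []
    for i in range(len(frames) - 1):
        if (pixel_change_ratio(frames[i], frames[i + 1]) > pixel_ratio_thresh
                or histogram_difference(frames[i], frames[i + 1]) > hist_thresh):
            cuts.append(i)
    return cuts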
3.2.2 Key-frame Detection

Automatic extraction of key-frames ought to be content-based, so that key-frames maintain the important content of the video while removing all redundancy. Ideally, high-level primitives of video, such as objects and events, should be used. However, because such components are not always easy to identify, current methods tend to rely on low-level image features and other readily available information instead. In that regard, the work done in this area [369, 377, 18, 341] could be grouped into three categories: static, dynamic and content-based approaches. The first category selects frames at a fixed time-code sampling rate such that each nth frame is retained as a key-frame [237]. The second category relies on motion analysis to eliminate redundant frames. The optical flow vector is first estimated on each frame, then analyzed using a metric as a function of time to select key-frames at the local minima of motion [341]. Besides the complexity of computing a dense optical flow vector, the underlying assumption of local minima may not hold if constant variations are observed. The third category analyzes the variation of the content in terms of color, texture and motion features [369]. Avrithis et al. [18] select key-frames at local minima and local maxima of the magnitude of the second derivative of the composed feature curve of all frames of a given shot. A composed feature is made up of a linear combination of both color and motion vectors. More sophisticated pattern classification techniques, like Gaussian Mixture Models (GMM), have been used to group similar frames [377], and close appearances of segmented objects [132], into clusters. Depending on the compactness of each obtained cluster, one or more images could be extracted as key-frames.
Figure 5 shows extracted key-appearances from the video shot of Fig. 4 using the method of [132]. In this example, the vehicle is being tracked and modeled with a GMM in the RGB color histogram space.

Fig. 5. Three extracted representative key-frames (top) and corresponding key-appearances (bottom) of the video sequence presented in Fig. 4, using the method of [132]
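The following is a compact sketch of the static and content-based strategies mentioned above: the first keeps every n-th frame, the second starts a new key-frame whenever the color histogram drifts away from the last retained key-frame. It is a simplification under assumed names and thresholds, not the GMM-based method of [132].

import numpy as np

def static_keyframes(num_frames, step=25):
    """Static strategy: keep every step-th frame index as a key-frame."""
    return list(range(0, num_frames, step))

def content_based_keyframes(frames, bins=64, change_thresh=0.4):
    """Content-based strategy: retain a new key-frame whenever the histogram
    of the current frame moves far enough from the last retained key-frame."""
    def hist(f):
        h, _ = np.histogram(f, bins=bins, range=(0, 256))
        return h / h.sum()
    keys = [0]
    last = hist(frames[0])
    for i in range(1, len(frames)):
        h = hist(frames[i])
        if np.abs(h - last).sum() > change_thresh:
            keys.append(i)
            last = h
    return keys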
3.2.3 Object Detection

Extracting the objects from the video is the first step in object-based analysis. Video objects or hot spots are classified into static zones and moving objects. The purpose of object detection is to initialize the tracking by identifying the object boundaries in the first frame of the shot. The accuracy of automatic object detection depends on the prior knowledge about the object to be detected as well as the complexity of the background and the quality of the video signal. As an example, detecting the position of an eye in an infrared driver's video [133, 136] is easier than locating pedestrians in a night-time sequence of images. The eye has a very distinct and unique shape and appearance among people, while the appearance, shape, scale and size characteristics may vary widely among pedestrians. On a related topic, Prati et al. [256] recently proposed an object detection method that differentiates between a moving car and its moving shadows on a highway.

Fig. 6. Semantic scenes of a clip from the movie "Dances with Wolves": "first meeting" (scene 1); "horses" (scene 2); "an90 preparation" (scene 3); "working" (scene 4); and "dialog in the tent" (scene 5)
Several efforts were undertaken to find objects belonging to specific classes like faces, figures and vehicles [202, 240, 254, 120, 333]. Recent systems like the Informedia Project [155] include in their generation of video summaries the detection of common objects like text and faces. Systems such as NeTra-V [79] and VideoQ [50] used spatio-temporal segmentation to extract regions which are supposed to correspond to video objects. Others [184, 269, 191] have resorted to user-assisted segmentation. While it is easier to define an object by its bounding box, accurately delineating an object is generally advantageous in reducing clutter for the subsequent tracking and matching phases. In [269], the color distributions of both the object and its background are modeled by a Gaussian mixture. Interestingly, this process is interactive, i.e. the user may iteratively assist the scheme in determining the boundary, starting from sub-optimal solutions, if needed. Recent work goes further in this direction [191], by learning the shape of an object category, in order to introduce it as an elaborate input into the energy function involved in interactive object detection.
In the interactive video framework, it is crucial that the tracking algorithm can be initialized easily through a semi-automatic or automatic process. Automatic detection methods may be categorized into three broad classes: model-based, motion-based (optical flow, frame differencing), and background subtraction approaches. For static cameras, background subtraction is probably the most popular. Model-based detection approaches usually employ a supervised learning process of the object model using representative training patches of the object in terms of appearances, scales, and illumination conditions, as in a multi-view face detector [202] or a hand posture recognition algorithm [344]. In contrast, motion-based detection consists of segmenting the scene into zones of independent motion using optical flow and frame differencing techniques [240]. In many applications, the knowledge-based, motion-based and model-based techniques are combined to ensure robust object detection. Despite all of these efforts, accurate automatic object detection is successful only in some domain-specific applications.
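For a static camera, the background subtraction class of detectors mentioned above can be sketched as a running-average background model; the update rate and threshold below are illustrative assumptions, and the frames are assumed to be gray-scale NumPy arrays.

import numpy as np

def detect_moving_objects(frames, alpha=0.05, diff_thresh=25):
    """Running-average background subtraction for a static camera.
    Returns, for each frame after the first, a boolean foreground mask of pixels
    that differ from the slowly updated background model by more than diff_thresh."""
    background = frames[0].astype(np.float32)
    masks = []
    for frame in frames[1:]:
        f = frame.astype(np.float32)
        mask = np.abs(f - background) > diff_thresh          # foreground pixels
        background = (1 - alpha) * background + alpha * f    # slow background update
        masks.append(mask)
    return masks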
3.2.4 Intra-Shot Object Tracking

Tracking is easier and faster than object detection since the object state (position, scale, boundaries) is known in the previous frame, and the search procedure in the target frame is local (search window) rather than global (entire frame). The tracking task is easier when tracked objects have low variability in the feature space, and thus exhibit smooth and rigid motion. In reality, video objects are often very difficult to track due to changes in pose, scale, appearance, shape and lighting conditions. Therefore, a robust and accurate tracker requires a self-adaptation strategy to these changes in order to maintain tracking in difficult situations, an exit strategy that terminates the tracking process if a drift or mis-tracking occurs, and a recovery process that allows object reacquisition when tracking is lost.
In the literature, three categories of methods could be identified: contour-based, color-based and motion-based techniques. For tracking of non-rigid and deformable objects, geodesic active contours and active contour models have proved to be powerful tools [247, 181, 320, 250]. These approaches use an energy minimization procedure to obtain the best contour, where the total energy consists of an internal energy term for contour smoothness and an external energy term for edge likelihood. Many color-based tracking methods [66, 27, 166, 249, 378] require a typical object color histogram. The color histogram is a global feature used frequently in tracking and recognition due to its quasi-invariance to geometric and photometric variations (using the hue-saturation subspace), as well as its low computation requirements. In [27], a side view of the human head is used to train a typical color model of both skin color and hair color to track the human head with out-of-plane rotation. The third category of tracking methods tends to formulate the tracking problem as the estimation of a 2D inter-frame motion field over the region of interest [110].
Recently, the concern in tracking is shifting toward real-time performance. This is essential in an interactive video authoring system where the user could initialize the tracker and desire quick tracking results in the remaining frames of a shot. Real-time object tracking is addressed in more detail in Chapter 4.
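A minimal sketch of intra-shot color-based tracking in the spirit described above follows: the target is represented by a normalized histogram, and each new frame is searched only in a local window around the previous position. Gray-level histograms stand in here for the hue-saturation histograms used in the cited work, and all names and parameters are assumptions.

import numpy as np

def color_histogram(patch, bins=16):
    """Normalized gray-level histogram of an image patch (a stand-in for a
    hue-saturation histogram)."""
    h, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def track_in_next_frame(frame, prev_box, target_hist, search_radius=16, step=4):
    """Slide the previous bounding box over a local search window in the new
    frame and keep the position whose histogram is closest to the target."""
    x, y, w, h = prev_box
    best_box, best_dist = prev_box, np.inf
    H, W = frame.shape[:2]
    for dy in range(-search_radius, search_radius + 1, step):
        for dx in range(-search_radius, search_radius + 1, step):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0 or nx + w > W or ny + h > H:
                continue
            candidate = frame[ny:ny + h, nx:nx + w]
            dist = np.abs(color_histogram(candidate) - target_hist).sum()
            if dist < best_dist:
                best_box, best_dist = (nx, ny, w, h), dist
    return best_box, best_dist

A large best distance returned by such a search can also serve as the exit strategy discussed above, signalling drift or loss of the target.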
3.2.5 Recognition and Classification of Video Shots and Objects

Once the shots, key-frames, and objects are identified in a video sequence, the next step is to identify spatial-temporal similarities between these entities in order to construct high-level components of the video structure like groups of objects [131] and clusters of shots [275]. In order to retrieve objects similar to an object-query, a recognition algorithm that entails feature extraction, distance computation and a decision rule is required. Recognition and classification techniques often employ low-level features like color, texture and edge histograms [24] as well as high-level features like object blobs [45].
An effective method for recognizing similar objects in video shots is to specify generic object models and find objects that conform to these models [89]. Different models were used in methods for recognizing objects that are common in video. These include methods for finding text [222, 375], faces [271, 286] and vehicles [286]. Hammoud et al. [131, 128] proposed a successful framework for supervised and non-supervised classification of inter-shot objects using the Maximum A Posteriori (MAP) rule and hierarchical classification techniques, respectively. Figure 8 illustrates some obtained results. Each tracked object is modeled by a Gaussian Mixture Model (GMM) that captures the intra-shot changes in appearance in the RGB color feature space. In the supervised approach, the user chooses the object classes interactively, and the classification algorithm assigns the remaining objects to object classes according to the MAP rule. In contrast, the unsupervised approach consists of computing the Kullback distance [190, 106] between the Gaussian Mixture Models of all identified objects in the video, and then feeding the hierarchical clustering algorithm [173] the constructed proximity matrix [169, 264]. These approaches seem to substantially improve the recognition rate of video objects between shots. Recently, Fan et al. [96] proposed a successful hierarchical semantics-sensitive video classifier to shorten the semantic gap between the low-level visual features and the high-level semantic concepts. The hierarchical structure of the semantics-sensitive video classifier is derived from the domain-dependent concept hierarchy of video contents in the database. Relevance analysis is used to shorten the semantic gap by selecting the discriminating visual features. Part two of this book will address in detail these issues of object detection, editing, tracking, recognition, and classification.
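The unsupervised grouping idea can be sketched very roughly as follows: each tracked object is summarized by a descriptor, a proximity matrix is built from pairwise distances, and agglomerative clustering groups similar objects. A mean color histogram and an L1 distance stand in here for the Gaussian Mixture Models and Kullback distance of [131, 128]; the SciPy-based implementation is an assumption made purely for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def object_descriptor(patches, bins=32):
    """Summarize one tracked object by the mean normalized histogram of its
    appearances (image patches) across the frames of a shot."""
    hists = []
    for p in patches:
        h, _ = np.histogram(p, bins=bins, range=(0, 256))
        hists.append(h / max(h.sum(), 1))
    return np.mean(hists, axis=0)

def group_objects(descriptors, num_groups=5):
    """Agglomerative clustering of object descriptors into groups of similar objects."""
    proximity = pdist(np.vstack(descriptors), metric="cityblock")  # proximity matrix
    tree = linkage(proximity, method="average")
    return fcluster(tree, t=num_groups, criterion="maxclust")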
3.2.6 Video Events, Highlights and Scenes Detection

In recent years, research towards the automatic detection and recognition of highlights and events in specific applications such as sports video, where events and highlights are relatively well-defined based on the domain knowledge of the underlying news and sports video data, has gained a lot of attention [182, 54, 199, 201, 198, 272, 281, 361]. A method for detecting news reporting was presented in [213]. Seitz and Dyer [289] proposed an affine view-invariant trajectory matching method to analyze cyclic motion. In [310], Stauffer and Grimson classify activities based on the aspect ratio of the tracked objects. Haering et al. [122] detect hunting activities in wildlife video. In [54], a new multimedia data mining framework has been proposed for the detection of soccer goal shots by using combined multimodal (audio/visual) features and classification rules. The output results can be used for annotation and indexing of the high-level structures of soccer videos. This framework exploits the rich semantic information contained in the visual and audio features of soccer video data, and incorporates the data mining process for effective detection of soccer goal events.
Regarding automatic detection of semantic units like scenes in feature-length movies, several graph-based clustering techniques have been proposed in the literature [357, 275, 125]. In [126, 127], a general framework of two phases, clustering of shots and merging of overlapped clusters, has been proposed to extract the video scenes of a feature-length movie. Two shots are matched on the basis of color-metric histograms, color-metric auto-correlograms [147] and the number of similar objects localized in their generated key-frames. Using the hierarchical classification technique, a partition of shots is identified for each feature separately. From all these partitions a single partition is deduced based on a union distance between various clusters. The obtained clusters are then linked together using the temporal relations of Allen
