Signals and Communication Technology
Riad I. Hammoud (Ed.)

Interactive Video

Algorithms and Technologies

With 109 Figures and 8 Tables

Springer
Dr. Riad I. Hammoud
Delphi Electronics and Safety
World Headquarters
MIC E110, P.O. Box 9005
Kokomo, Indiana 46904-9005
USA
e-mail: riad.hammoud@delphi.com

Library of Congress Control Number: 2006923234

ISSN: 1860-4862
ISBN-10: 3-540-33214-6 Springer Berlin Heidelberg New York
ISBN-13: 978-3-540-33214-5 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media.
springer.com

© Springer-Verlag Berlin Heidelberg 2006
Printed in The Netherlands

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting by SPI Publisher Services using a Springer LaTeX macro package
Cover design: design & production, Heidelberg
Printed on acid-free paper
SPIN: 11399551
62/3100/SPI - 5 4 3 2 1 0
Introduction to Interactive Video

Riad I. Hammoud

Delphi Electronics and Safety, IN, USA
riad.hammoud@delphi.com
1 Introduction

In recent years, digital video has been widely employed as an effective media format not only for personal communications, but also for business-to-employee, business-to-business and business-to-consumer applications. It appears more attractive than other static data types like text and graphics, as its underlying rich content conveys easily and effectively the goals of the provider by means of image, motion, sound and text, all presented to the consumer in a timely, synchronized manner. Multimedia documents are more accessible than ever, due to a rapid expansion of Internet connectivity and increasing interest in online multimedia-rich applications.

Video content has skyrocketed in light of the aforementioned advances, and as a result of increasing network bandwidth capacities, decreasing cost of video acquisition and storage devices, and improving compression techniques. Today, many content providers, like Google Video [118] and Yahoo [353], have created “open video” marketplaces that enable consumers to buy and rent a wide range of video content (video-on-demand) [292] from major television networks, professional sports leagues, cable programmers, independent producers and film makers. The evolving digital video archives include various types of video documents like feature-length movies, music videos from SONY BMG, news, “Charlie Rose” interviews, medical and e-learning videos, as well as prime-time and classic hits from CBS, sports and NBA games, historical content from ITN, and new titles being added every day. Such video archives represent a valuable business asset and an important on-line video source to providers and consumers, respectively. The market is getting another important boost as video distribution becomes available for people on “the move”, who can select, download and view various video types on their consumer electronics devices like mobile TV [227] and video-playing iPods [158]. In 2005, the number of hits for downloaded music data from iTunes on Apple’s iPods reached 20.7 million [163]. This number is expected to increase with the appearance of video-playing iPods and mobile TVs.
Traditionally, video data is either annotated manually or consumed by end-users in its original form. Manual processing is expensive, time consuming and often subjective. On the other hand, a video file provided in a standard format like MPEG-1 or MPEG-2 is played back using media players with limited conventional control interaction features like play, fast forward/backward, and pause. In order to remedy these challenges and address the issues of the growing number and size of video archives and detail-on-demand videos, innovative video processing and video content management solutions, ranging from decomposing, indexing, browsing and filtering to automatic searching techniques as well as new forms of interaction with video content, are needed more than ever [324, 276, 130, 143, 303, 26, 278, 152]. Automatic processing will substantially drop the cost and reduce the errors of human operators, and it will introduce a new set of interactivity options that gives users advanced interaction and navigational possibilities with the video content. One can navigate through the interactive video content in a non-linear fashion, by downloading, browsing and viewing, for instance, only the “actions” of a film character that caught his or her attention in the beginning of a feature-length movie [194, 155, 130, 199, 299]. Retaining only the essential information of a video sequence, such as representative frames, events and highlights, reduces the storage, bandwidth and viewing-time requirements.

Within the framework of the emerging “interactive video” technology and the MPEG-4 and MPEG-7 standards [121, 124, 130, 293, 233, 275], this book addresses the following two major issues that concern both content providers and consumers: (1) automatic re-structuring, indexing and cataloging of video content, and (2) advanced interaction features for audio-video editing, playing, searching and navigation. In this chapter, we briefly introduce the concept, rhetoric, algorithms and technologies of interactive videos, automatic video restructuring and content-based video retrieval systems.
2 What is an Interactive Video?

In order to simplify the understanding of interactive video environments, it is worth reviewing how people are used to viewing and interacting with video content. Current VCRs and video players provide basic control options such as play/stop, fast forward/backward and slow motion picture streaming. The video is mostly viewed in a passive way as a non-stop medium where the user’s interaction with the content is somewhat limited. For example, users cannot stop the video playback to jump to another place inside or outside the video document that provides related information about a specific item in the video, like a commercial product, a film character, or a concealed object. Hence, the viewing of the video is performed in a linear fashion where the only way to discover what is next is to follow the narration and move through the video guided by seconds and minutes.

Such conventional techniques for video viewing and browsing seem to be inefficient for most users to get the crux of the video. Users’ ease and efficiency could be improved through:
1. providing a representative visual summary of the video document prior to downloading, storing or watching it. Alternatively, users could select the video based on just a few snapshots;
2. presenting a list of visual entries, like key-frames, hot spots, events, scenes and highlights, that serve as meaningful access points to desired video content, as opposed to accessing the video from the beginning to the end; and,
3. showing a list of navigational options that allows users to follow internal and external links between related items in the same video or in other media documents like web pages.
Interactive video refers to forms of video documents, still uncommon today, that accept and respond to the input of a viewer beyond conventional VCR interactive features like play and pause. For instance, in its basic form, interactive video allows users to pause the video, click on an object of interest in a video frame, and choose to jump from one arbitrary temporal frame to another where the selected object has appeared. Instead of being guided by seconds and minutes, the user of an interactive video navigates through the video in a very efficient non-linear fashion with options such as “next appearance”, “previous scene” and “last event”. Tentatively, the following definition of interactive video could be drawn:

Definition 1. Interactive video is a digitally enriched form of the original raw video sequence, allowing viewers attractive and powerful forms of interactivity and navigational possibilities.
In order to ensure that humans do not perceive any discontinuity in the video stream, a frame rate of at least 25 fps is required, that is, 90,000 images for one hour of video content (25 frames/s × 3,600 s). This video content can be complex and rich in terms of objects, shots, scenes, events, key-frames, sounds, narration and motion. An original video, in MPEG-1 format, is transformed to interactive video form through a series of re-structuring phases of its content. Both the original video and the interactive video contain the same information, with one major difference in the structure of the document. The original video has an implicit structure, while its interactive form has an explicit one. In an explicit structure, the hotspots and key elements are emphasized and links between these elements are created. If such links are not established with items from outside the video content, then the produced document is called raw interactive video. For simplicity we will refer to it just as interactive video. Here we introduce two other extensions of interactive video documents: interactive video presentation and interactive video database.
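Before turning to these two extensions, the idea of an explicit structure, hotspots and key elements connected by links, can be made concrete. The following is a minimal Python sketch; it is not drawn from any standard or from the authoring systems cited later, and all class and field names are illustrative assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

# Minimal, hypothetical representation of an explicitly structured video.
# All names here are illustrative assumptions, not a standard schema.

@dataclass
class Hotspot:
    object_id: str          # e.g. "character_01"
    frame_start: int        # first frame of this appearance
    frame_end: int          # last frame of this appearance
    bbox: tuple             # (x, y, width, height) in pixels

@dataclass
class Link:
    source_id: str          # hotspot or shot that triggers the jump
    target_id: str          # another appearance, a scene, or an external document
    kind: str = "internal"  # only internal links keep the document a "raw interactive video"

@dataclass
class InteractiveVideo:
    shots: List[tuple] = field(default_factory=list)      # (start_frame, end_frame) pairs
    hotspots: List[Hotspot] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)

    def next_appearance(self, object_id: str, current_frame: int) -> Optional[Hotspot]:
        """Support a 'next appearance' navigation option for a selected object."""
        later = [h for h in self.hotspots
                 if h.object_id == object_id and h.frame_start > current_frame]
        return min(later, key=lambda h: h.frame_start) if later else None

In such a representation, the non-linear options mentioned above ("next appearance", "previous scene") reduce to simple queries over the explicit links and hotspot lists.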
Definition 2. Interactive video presentation is a form of interactive video document that is centered on enriched video but is not exclusively video.

In this context, the interactive video document contains not only the raw enriched video but also includes several kinds of data in a time-synchronized fashion. This additional data is added for two reasons: (1) enhancing the
video content, and (2) making the video presentation self-contained. The type of data is determined by the author of the interactive video. This author would intelligently integrate the technique of the video producer or filmmaker with the technique and wisdom of the skillful teacher. The objective is to maximize the chances of conveying the purpose and key information of the video well. Additional data can be documents of all types, i.e., HTML, tables, music and/or still frames and sequences of images, that are available locally or remotely through the Web. Such types of interactive videos are also known as hyperfilm, hypervideo or hypermedia [151, 47].
Definition 3. Interactive video database is a collection of interactive video documents and interactive video presentations.

The main interactive video document is seen here as a master document which contains in itself a large number of independent interactive videos. Users can access the video database in two forms: searching and browsing. For instance, a sketch may be used to define an image in a frame. The system can look for that image and retrieve the frames that contain it. The user can also move the sketch within a certain pattern, indicating to the system to find in a large interactive video database all sets of frames that represent a similar moving pattern for that object of interest (hotspot). As technologies such as object detection, tracking and image recognition evolve, it will be possible to provide better ways to specify hotspots in interactive video, which will be covered in more detail in the following sections. When browsing is not convenient for a user to locate specific information, he or she would utilize searching routines instead. The searching process would result in a small set of interactive video sub-documents that could be browsed and easily navigated.
3 Video Transformation and Re-structuring Phases

The transformation of an original video format to its interactive form aims at extracting the key elements or components of the video structure first, and then creating links between elements of various levels of this structure.
3.1 Video Structure Components

The presentation of a document often follows a domain-specific model. Readers of a textbook expect to see the book content organized into parts, chapters, sections, paragraphs and indexes. Such structure is presented to the reader up front in a table of contents. Unfortunately, the structure of video documents is not explicitly apparent to the viewer. The re-structuring process aims at automatically constructing such a table of contents [276].

A video collection can be divided into multiple categories by grouping documents with similar structures together. Thus, feature-length movies, news, sports, TV shows, and surveillance videos have different structures, but with
Fig. 1. Illustration of the structure of a feature-length movie
many basic elements in common. Figures 1 and 2 illustrate the general structure of movies and news videos.

The following list of components covers the video structures of most types of video documents:
1. Camera shots: A camera shot is an unbroken sequence of frames recorded from a single camera, during a short period of time, and thus it contains little change in background and scene content. A video sequence is therefore a concatenation of camera shots. A cut is where the last frame in one shot is followed by the first frame in the next shot.
2. Gradual transitions: Three other types of shot boundaries may be found in video documents: fades, dissolves, and wipes. A fade is where the frames of the shot gradually change from or to black. A dissolve is where the frames of the first shot are gradually morphed into the frames of the second. And a wipe is where the frames of the first shot are moved gradually in a horizontal or vertical direction into the frames of the second.

Fig. 2. Illustration of the general structure of news videos
3. Key-frames: A key-frame is a still image of a video shot that best represents the content of the shot.
4. Visual objects, zones: Objects of interest in video documents are similar to key-words in text and HTML documents. They could be detected either manually or automatically in key-frames, or within individual frames of a video shot.
5. Audio objects: An audio object could take several shapes ranging from a clip of music to a single word.
6. Text objects: A text object is defined as the text joined to the video sequence, such as footnotes and superimposed text on images.
7. Events: An event is the basic segment of time during which an important action occurs in the video.
8. Scenes: A scene is the minimum set of sequential shots that conveys certain meaning in terms of narration.
9. Cluster of objects: A cluster is a collection of objects with similar characteristics like appearance, shape, color, texture, sounds, etc.
10. Narrative sequence: A narrative sequence projects a large concept by combining multiple scenes together.
11. Summary: A video summary of an input video is seen as a new brief document which may consist of an arrangement of video shots and scenes, or an arrangement of still key-frames.
Fig. 3. A cut between two camera shots is observed at frames 90-91
The above components are labeled either as low-level components or high-level components. The first category includes video shots, audio-visual objects, key-frames and words, while the second contains events, scenes, summaries, and narrative sequences.
3.2 Toward Automatic Video Re-structuring

Manual labeling and re-structuring of video content is an extremely time consuming, cost intensive and error prone endeavor that often results in incomplete and inconsistent annotations. Annotating one hour of video may require more than ten hours of human effort for basic decomposition of the video into shots, key-frames and zones.

In order to overcome these challenges, intensive research work has been done in computer vision and video processing to automate the process of video re-structuring. Some interactive video authoring systems [127, 157] offer tools that automate shot detection, zone localization, object tracking and scene recognition.
3.2.1 Shot Boundary Detection

Several methods have been proposed for shot boundary detection in both the compressed and non-compressed domains [38, 36, 13, 37, 134, 347, 356, 52]. Pairwise comparison [368] checks each pixel in one frame against the corresponding pixel in the next frame. In this approach, the gray-scale values of the pixels at corresponding locations in two successive frames are subtracted and the absolute value is used as a measure of dissimilarity between the pixel values. If this value exceeds a certain threshold, the pixel gray scale is said to have changed. The percentage of pixels that have changed is the measure of dissimilarity between the frames. This approach is computationally simple but sensitive to digitization noise, illumination changes and object motion. As a means to compensate for this, the likelihood ratio, histogram comparison, model-based comparison, and edge-based approaches have been proposed. The likelihood ratio approach [368] compares blocks of pixel regions. The color histogram method [37] compares the intensity or color histograms between adjacent frames. Model-based comparison [134] uses the video production system as a template. Edge detection segmentation [364]
looks for entering and exiting edge pixels. Color blocks in a compressed MPEG stream [34] are processed to find shots. DCT-based shot boundary detection [16] uses differences in motion encoded in an MPEG stream to find shots. Smeaton et al. [297, 246, 38] combine a color histogram-based technique with edge-based and MPEG macroblock methods to improve the performance of each individual method for shot detection. Recently, Chen et al. [54] proposed a multi-filtering technique that combines histogram comparison, pixel comparison and object tracking techniques for shot detection. They perform object tracking to help determine the actual shot boundaries when both pixel comparison and histogram comparison techniques fail. As reported in [54], experiments on a large number of video sequences (over 1000 testing shots) show a very promising shot detection precision of greater than ninety-two percent and recall beyond ninety-eight percent. With such solid performance, very little manual effort is needed to correct the false positives and to recover the missed detections during shot detection. For most applications, color histogram comparison appears to be the simplest and most computationally inexpensive technique for obtaining satisfactory results.
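To make the histogram-comparison family concrete, the following is a minimal sketch in Python with OpenCV. It is an illustrative simplification, not the method of [37] or [54]; in particular, the similarity threshold and the use of correlation as the histogram distance are assumed values chosen for the example.

import cv2

def detect_cuts(video_path, threshold=0.6, bins=32):
    """Flag a cut when the color-histogram similarity between adjacent
    frames drops below `threshold` (illustrative value, not from [37])."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Hue-saturation histogram, normalized so that frames of different
        # brightness remain comparable.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins],
                            [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:          # low similarity -> likely cut
                cuts.append(frame_idx)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return cuts

# Example usage (the file name is hypothetical):
# print(detect_cuts("movie_clip.mpg"))

Gradual transitions such as fades and dissolves would need additional logic (for example, comparing histograms over a sliding window), which is omitted here for brevity.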
3.2.2 Key-frame Detection

Automatic extraction of key-frames ought to be content based so that key-frames maintain the important content of the video while removing all redundancy. Ideally, high-level primitives of video, such as objects and events, should be used. However, because such components are not always easy to identify, current methods tend to rely on low-level image features and other readily available information instead. In that regard, the work done in this area [369, 377, 18, 341] could be grouped into three categories: static, dynamic and content-based approaches. The first category selects frames at a fixed time-code sampling rate such that every n-th frame is retained as a key-frame [237]. The second category relies on motion analysis to eliminate redundant frames. The optical flow vector is first estimated on each frame, then analyzed using a metric as a function of time to select key-frames at the local minima of motion [341]. Besides the complexity of computing a dense optical flow vector, the underlying assumption of local minima may not hold if constant variations are observed. The third category analyzes the variation of the content in terms of color, texture and motion features [369]. Avrithis et al. [18] select key-frames at local minima and local maxima of the magnitude of the second derivative of the composed feature curve of all frames of a given shot. A composed feature is made up of a linear combination of both color and motion vectors. More sophisticated pattern classification techniques, like Gaussian Mixture Models (GMM), have been used to group similar frames [377], and close appearances of segmented objects [132], into clusters. Depending on the compactness of each obtained cluster, one or more images could be extracted as key-frames.
Figure 5 shows extracted key-appearances from the video shot of Fig. 4 using the method of [132]. In this example, the vehicle is being tracked and modeled with a GMM in the RGB color histogram space.

Fig. 4. Sample images of a tracked vehicle in a video shot of 66 frames

Fig. 5. Three representative key-frames (top) and corresponding key-appearances (bottom) extracted from the video sequence of Fig. 4, using the method of [132]
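As a simple illustration of content-based key-frame selection, the sketch below picks, for each detected shot, the frame whose color histogram lies closest to the shot's mean histogram. This is a deliberately simplified heuristic for illustration only, not the methods of [18], [341] or [132].

import cv2
import numpy as np

def shot_key_frame(frames, bins=32):
    """Pick one key-frame per shot: the frame whose hue-saturation histogram
    is closest to the mean histogram of the shot. Illustrative heuristic only."""
    hists = []
    for frame in frames:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
        cv2.normalize(h, h)
        hists.append(h.flatten())
    hists = np.array(hists)
    mean_hist = hists.mean(axis=0)
    # The most representative frame minimizes the distance to the mean.
    distances = np.linalg.norm(hists - mean_hist, axis=1)
    return int(np.argmin(distances))

# Usage: `frames` is the list of decoded BGR images of one shot, e.g. the
# frames lying between two cuts returned by a shot-boundary detector.
# idx = shot_key_frame(frames); key_frame = frames[idx]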
Fig. 6. Semantic scenes of a clip from the movie “Dances with Wolves”: “first meeting” (scene 1); “horses” (scene 2); “coffee preparation” (scene 3); “working” (scene 4); and “dialog in the tent” (scene 5)

3.2.3 Object Detection

Extracting the objects from the video is the first step in object-based analysis. Video objects or hot spots are classified into static zones and moving objects. The purpose of object detection is to initialize the tracking by identifying the object boundaries in the first frame of the shot. The accuracy of automatic object detection depends on the prior knowledge about the object to be detected as well as the complexity of the background and quality of the video
signal. As an example, detecting the position of an eye in an infrared driver’s video [133, 136] is easier than locating pedestrians in a night-time sequence of images. The eye has a very distinct and unique shape and appearance among people, while the appearance, shape, scale and size characteristics may vary widely among pedestrians. On a related topic, Prati et al. [256] recently proposed an object detection method that differentiates between a moving car and its moving shadows on a highway.

Several efforts were undertaken to find objects belonging to specific classes like faces, figures and vehicles [202, 240, 254, 120, 333]. Recent systems like the Informedia Project [155] include in their generation of video summaries the detection of common objects like text and faces. Systems such as Netra-V [79] and VideoQ [50] used spatio-temporal segmentation to extract regions which are supposed to correspond to video objects. Others [184, 269, 191] have resorted to user-assisted segmentation. Although it is easier to define an object by its bounding box, accurately delineating an object is generally advantageous in reducing clutter for the subsequent tracking and matching phases. In [269], the color distributions of both the object and its background are modeled by a Gaussian mixture. Interestingly, this process is interactive, i.e., the user may iteratively assist the scheme in determining the boundary, starting from sub-optimal solutions, if needed. Recent work goes further in this direction
[191], by learning the shape of an object category, in order to introduce it as an elaborate input into the energy function involved in interactive object detection.

In the interactive video framework, it is crucial that the tracking algorithm can be initialized easily through a semi-automatic or automatic process. Automatic detection methods may be categorized into three broad classes: model-based, motion-based (optical flow, frame differencing), and background subtraction approaches. For static cameras, background subtraction is probably the most popular. Model-based detection approaches usually employ a supervised learning process of the object model using representative training patches of the object in terms of appearances, scales, and illumination conditions, as in a multi-view face detector [202] or a hand posture recognition algorithm [344]. In contrast, motion-based detection consists of segmenting the scene into zones of independent motion using optical flow and frame differencing techniques [240]. In many applications, the knowledge-based, motion-based and model-based techniques are combined to ensure robust object detection. Despite all of these efforts, accurate automatic object detection is successful only in some domain-specific applications.
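For the static-camera case, where background subtraction is described above as the most popular choice, a minimal OpenCV sketch might look as follows. The MOG2 subtractor and the area threshold are illustrative assumptions rather than choices taken from the chapter's references.

import cv2

def detect_moving_objects(video_path, min_area=500):
    """Background subtraction for a static camera: returns, per frame, the
    bounding boxes of connected foreground regions. The MOG2 model and the
    `min_area` threshold are illustrative choices."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    detections = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        # Drop shadow pixels (value 127 in MOG2 masks) and low-level noise.
        _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
        # OpenCV 4.x signature: findContours returns (contours, hierarchy).
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours
                 if cv2.contourArea(c) >= min_area]
        detections.append(boxes)   # list of (x, y, w, h) per frame
    cap.release()
    return detections

The boxes returned by such a detector can then serve to initialize an intra-shot tracker, as discussed in the next section.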
3.2.4 Intra-Shot Object Tracking

Tracking is easier and faster than object detection since the object state (position, scale, boundaries) is known in the previous frame, and the search procedure in the target frame is local (a search window) rather than global (the entire frame). The tracking task is easier when tracked objects have low variability in the feature space, and thus exhibit smooth and rigid motion. In reality, video objects are often very difficult to track due to changes in pose, scale, appearance, shape and lighting conditions. Therefore, a robust and accurate tracker requires a self-adaptation strategy to these changes in order to maintain tracking in difficult situations, an exit strategy that terminates the tracking process if a drift or miss-tracking occurs, and a recovery process that allows object reacquisition when tracking is lost.

In the literature, three categories of methods could be identified: contour-based, color-based and motion-based techniques. For tracking of non-rigid and deformable objects, geodesic active contours and active contour models have proved to be powerful tools [247, 181, 320, 250]. These approaches use an energy minimization procedure to obtain the best contour, where the total energy consists of an internal energy term for contour smoothness and an external energy term for edge likelihood. Many color-based tracking methods [66, 27, 166, 249, 378] require a typical object color histogram. The color histogram is a global feature used frequently in tracking and recognition due to its quasi-invariance to geometric and photometric variations (using the hue-saturation subspace), as well as its low computation requirements. In [27], a side view
of the human head is used to train a typical color model of both skin color and hair color to track the human head with out-of-plane rotation. The third category of tracking methods tends to formulate the tracking problem as the estimation of a 2D inter-frame motion field over the region of interest [110].

Recently, the concern in tracking is shifting toward real-time performance. This is essential in an interactive video authoring system where the user could initialize the tracker and expect quick tracking results in the remaining frames of a shot. Real-time object tracking is addressed in more detail in Chapter 4.
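As an illustration of the color-based category, the following sketch tracks an initialized object with a hue histogram and mean shift using OpenCV. The hue-only model and the termination criteria are simplifications chosen for the example, not the specific trackers of [27] or [66].

import cv2

def track_by_color(frames, init_box):
    """Track the object inside `init_box` = (x, y, w, h) across `frames`
    using a hue histogram and mean shift. Illustrative sketch only."""
    x, y, w, h = init_box
    roi = frames[0][y:y + h, x:x + w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    # Hue histogram of the object, normalized for back-projection.
    hist = cv2.calcHist([hsv_roi], [0], None, [32], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    window, boxes = (x, y, w, h), [init_box]
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, window = cv2.meanShift(back_proj, window, term_crit)
        boxes.append(window)
    return boxes   # one (x, y, w, h) box per frame

A production tracker would add the self-adaptation, exit and recovery strategies mentioned above; this sketch keeps a fixed color model for the whole shot.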
3.2.5 Recognition and Classification of Video Shots and Objects

Once the shots, key-frames, and objects are identified in a video sequence, the next step is to identify spatial-temporal similarities between these entities in order to construct high-level components of the video structure like groups of objects [131] and clusters of shots [275]. In order to retrieve objects similar to an object query, a recognition algorithm that entails feature extraction, distance computation and a decision rule is required. Recognition and classification techniques often employ low-level features like color, texture and edge histograms [24] as well as high-level features like object blobs [45].

An effective method for recognizing similar objects in video shots is to specify generic object models and find objects that conform to these models [89]. Different models were used in methods for recognizing objects that are common in video. These include methods for finding text [222, 375], faces [271, 286] and vehicles [286]. Hammoud et al. [131, 128] proposed a successful framework for supervised and unsupervised classification of inter-shot objects using the Maximum A Posteriori (MAP) rule and hierarchical classification techniques, respectively. Figure 8 illustrates some obtained results. Each tracked object is modeled by a Gaussian Mixture Model (GMM) that captures the intra-shot changes in appearance in the RGB color feature space. In the supervised approach, the user chooses the object classes interactively, and the classification algorithm assigns the remaining objects to object classes according to the MAP rule. In contrast, the unsupervised approach consists of computing the Kullback distance [190, 106] between the Gaussian Mixture Models of all identified objects in the video, and then feeding the constructed proximity matrix [169, 264] to the hierarchical clustering algorithm [173]. These approaches seem to substantially improve the recognition rate of video objects between shots. Recently, Fan et al. [96] proposed a successful hierarchical semantics-sensitive video classifier to shorten the semantic gap between low-level visual features and high-level semantic concepts. The hierarchical structure of the semantics-sensitive video classifier is derived from the domain-dependent concept hierarchy of video contents in the database. Relevance analysis is used to shorten the semantic gap by selecting the discriminating visual features. Part two of this book addresses in detail these issues of object detection, editing, tracking, recognition, and classification.
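As a rough sketch of the unsupervised variant described above, the code below computes a symmetrized Kullback-Leibler distance between per-object Gaussian models and feeds the resulting proximity matrix to an off-the-shelf hierarchical clustering routine. Single Gaussians are used here as a simplification of the GMMs in [131], and the number of clusters is an illustrative parameter.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sym_kl_gaussian(mu1, cov1, mu2, cov2):
    """Symmetrized KL divergence between two multivariate Gaussians.
    A single Gaussian per object is a simplification of the GMM case."""
    def kl(m1, c1, m2, c2):
        d = len(m1)
        c2_inv = np.linalg.inv(c2)
        diff = m2 - m1
        return 0.5 * (np.trace(c2_inv @ c1)
                      + diff @ c2_inv @ diff
                      - d
                      + np.log(np.linalg.det(c2) / np.linalg.det(c1)))
    return kl(mu1, cov1, mu2, cov2) + kl(mu2, cov2, mu1, cov1)

def cluster_objects(models, n_clusters=5):
    """`models` is a list of (mean, covariance) pairs, one per tracked object,
    e.g. fitted to its RGB pixel samples. Returns a cluster label per object."""
    n = len(models)
    proximity = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = sym_kl_gaussian(*models[i], *models[j])
            proximity[i, j] = proximity[j, i] = d
    # Agglomerative (hierarchical) clustering on the proximity matrix.
    z = linkage(squareform(proximity), method="average")
    return fcluster(z, t=n_clusters, criterion="maxclust")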
3.2.6 Video Events, Highlights and Scenes Detection

In recent years, research toward the automatic detection and recognition of highlights and events in specific applications such as sports video, where events and highlights are relatively well defined based on the domain knowledge of the underlying news and sports video data, has gained a lot of attention [182, 54, 199, 201, 198, 272, 281, 361]. A method for detecting news reporting was presented in [213]. Seitz and Dyer [289] proposed an affine view-invariant trajectory matching method to analyze cyclic motion. In [310], Stauffer and Grimson classify activities based on the aspect ratio of the tracked objects. Haering et al. [122] detect hunting activities in wildlife video. In [54], a new multimedia data mining framework has been proposed for the detection of soccer goal shots by using combined multimodal (audio/visual) features and classification rules. The output results can be used for annotation and indexing of the high-level structures of soccer videos. This framework exploits the rich semantic information contained in visual and audio features for soccer video data, and incorporates the data mining process for effective detection of soccer goal events.

Regarding automatic detection of semantic units like scenes in feature-length movies, several graph-based clustering techniques have been proposed in the literature [357, 275, 125]. In [126, 127], a general framework of two phases, clustering of shots and merging of overlapped clusters, has been proposed to extract the video scenes of a feature-length movie. Two shots are matched on the basis of color-metric histograms, color-metric auto-correlograms [147] and the number of similar objects localized in their generated key-frames. Using the hierarchical classification technique, a partition of shots is identified for each feature separately. From all these partitions a single partition is deduced based on a union distance between various clusters. The obtained clusters are then linked together using the temporal relations of Allen [7] in order to construct a temporal graph of clusters
