Content-Based Video Indexing and Retrieval

Stephen W. Smoliar and Hongliang Zhang
National University of Singapore
Current video management tools and techniques are based on pixels rather than perceived content. Thus, state-of-the-art video editing systems can easily manipulate such things as time codes and image frames, but they cannot "know," for example, what a basketball is. Our research addresses four areas of content-based video management.
Video has become an important element of multimedia computing and communication environments, with applications as varied as broadcasting, education, publishing, and military intelligence. However, video will only become an effective part of everyday computing environments when we can use it with the same facility that we currently use text. Computer literacy today entails the ability to set our ideas down spontaneously with a word processor, perhaps while examining other text documents to develop those ideas and even using editing operations to transfer some of that text into our own compositions. Similar composition using video remains far in the future, even though workstations now come equipped with built-in video cameras and microphones, not to mention ports for connecting our increasingly popular hand-held video cameras.

Why is this move to communication incorporating video still beyond our grasp? The problem is that video technology has developed thus far as a technology of images. Little has been done to help us use those images effectively. Thus, we can buy a camera that "knows" all about how to focus itself properly and even how to compensate for the fact that we can rarely hold it steady without a tripod. But no camera knows "where the action is" during a basketball game or a family reunion. A camera can give us a clear shot of the ball going through the basket, but only if we find the ball for it.

The point is that we do not use images just because they are steady or clearly focused. We use them for their content. If we wish to compose with images in the same way that we compose with words, we must focus our attention on content. Video composition should not entail thinking about image "bits" (pixels), any more than text composition requires thinking about ASCII character codes. Video content objects include basketballs, athletes, and hoops. Unfortunately, state-of-the-art software for manipulating video does not "know" about such objects. At best, it "knows" about time codes, individual frames, and clips of video and sound. To compose a video document, or even just incorporate video as part of a text document, we find ourselves thinking one way (with ideas) when we are working with text and another (with pixels) when we are working with video. The pieces do not fit together effectively, and video suffers for it.
Similarly, if we wish to incorporate other text material in a document, word processing offers a powerful repertoire of techniques for finding what we want. In video, about the only technique we have is our own memory, coupled with some intuition about how to use the fast forward and fast reverse buttons while viewing.

The moral of all this is that the effective use of video is still beyond our grasp because the effective use of its content is still beyond our grasp.

How can we remedy this situation? At the Institute of Systems Science of the National University of Singapore, the Video Classification project addresses this question. We are currently tackling problems in four areas:

- Defining an architecture that characterizes the tasks of managing video content.

- Developing software tools and techniques that identify and represent video content.

- Applying knowledge representation techniques to the development of index construction and retrieval tools.

- Developing an environment for interacting with video objects.

In this article, we discuss each of these problem areas in detail, then briefly review a recent case study concerned with content analysis of news videos. We conclude with a discussion of our plans to extend our work into the audio domain.
Architecture for video management

Our architecture is based on the assumption that video information will be maintained in a database.1 This assumption requires us to define tools for the construction of such databases and the insertion of new material into existing databases. We can characterize these tools in terms of a sequence of specific task requirements:
- Parsing, which segments the video stream into generic clips. These clips are the elemental index units in the database. Ideally, the system decomposes individual images into semantic primitives. On the basis of these primitives, a video clip can be indexed with a semantic description using existing knowledge-representation techniques.

- Indexing, which tags video clips when the system inserts them into the database. The tag includes information based on a knowledge model that guides the classification according to the semantic primitives of the images. Indexing is thus driven by the image itself and any semantic descriptors provided by the model.

- Retrieval and browsing, where users can access the database through queries based on text and/or visual examples or browse it through interaction with displays of meaningful icons. Users can also browse the results of a retrieval query. It is important that both retrieval and browsing appeal to the user's visual intuition.
Figure 1 summarizes this task analysis as an architectural diagram. The heart of the system is a database management system containing the video and audio data from video source material that has been compressed wherever possible. The DBMS defines attributes and relations among these entities in terms of a frame-based approach to knowledge representation (described further under the subhead "A frame-based knowledge base"). This representation approach, in turn, drives the indexing of entities as they are added to the database. Those entities are initially extracted by the tools that support the parsing task. In the opposite direction, the database contents are made available by tools that support the processing of both specific queries and the more general needs of casual browsing.

Figure 1. Diagram of video management architecture.

The next three sections discuss elements of this architecture in greater detail.
Video content parsing

Three tool sets address the parsing task. The first set segments the video source material into individual camera shots, which then serve as the basic units for indexing. The second set identifies different manifestations of camera technique in these clips. The third set applies content models to the identification of context-dependent semantic primitives.
Locating camera shot boundaries

We decided that the most viable segmentation criteria for motion video are those that detect boundaries between camera shots. Thus, the camera shot, consisting of one or more frames generated and recorded contiguously and representing a continuous action in time and space, becomes the smallest unit for indexing video. The simplest shot transition is a camera cut, where the boundary lies between two successive frames. More sophisticated transition techniques include dissolves, wipes, and fade-outs, all of which take place over a sequence of frames.

In any case, camera shots can always be distinguished by significant qualitative differences. If we can express those differences by a suitable quantitative measure, then we can declare a segment boundary whenever that measure exceeds a given threshold. The key issues in locating shot boundaries, therefore, are selecting suitable difference measures and thresholds, and applying them to the comparison of video frames. We now briefly review the segmentation techniques we currently employ. (For details, see Zhang et al.2)
The most suitable measures rely on comparisons between the pixel-intensity histograms of two frames. The principle behind this metric is that two frames with little change in the background and object content will also differ little in their overall intensity distributions. Further strengthening this approach, it is easy to define a histogram that effectively accounts for color information. We also developed an automatic approach that detects the segmentation threshold on the basis of statistics of frame difference values, together with a multipass technique that improves processing speed.2
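To make the histogram comparison concrete, the following Python sketch declares a camera cut whenever the grey-level histogram difference between successive frames exceeds a single fixed threshold. The bin count, the bin-wise absolute difference, and the threshold value are illustrative assumptions, not the exact measure or the automatically selected threshold described in Zhang et al.2

import numpy as np

def intensity_histogram(frame, bins=64):
    """Normalized grey-level histogram of one frame (a 2-D uint8 array)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def detect_camera_cuts(frames, threshold=0.4):
    """Return indices i such that a cut is declared between frames[i] and frames[i+1]."""
    cuts = []
    for i in range(len(frames) - 1):
        d = np.abs(intensity_histogram(frames[i + 1]) -
                   intensity_histogram(frames[i])).sum()
        if d > threshold:          # large histogram change suggests a shot boundary
            cuts.append(i)
    return cuts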
Figure 2. A sequence of frame-to-frame histogram differences obtained from a documentary video, where differences corresponding both to camera breaks and to transitions implemented by special effects can be observed.

Figure 2 illustrates a typical sequence of difference values. The graph exhibits two high pulses corresponding to two camera breaks. It also illustrates a gradual transition occurring over a sequence of frames. In this case, the task is to identify the sequence start and end points. As the inset in Figure 2 shows, the difference values during such a transition are far less than those across a camera break. Thus, a single threshold lacks the power to detect gradual transitions.

A so-called twin-comparison approach solves this problem. The name refers to the use of two thresholds. First, a reduced threshold detects the potential starting frame of a transition sequence. Once that frame has been identified, it is compared against successive frames, thus measuring an accumulated difference instead of frame-to-frame differences. This accumulated difference must be monotonic. When it ceases to be monotonic, it is compared against a second, higher threshold. If this threshold is exceeded, we conclude that the monotonically increasing sequence of accumulated differences corresponds to a gradual transition. Experiments have shown this approach to be very effective.2
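The sketch below is a simplified reading of the twin-comparison idea, not the published algorithm. A lower threshold marks a candidate start frame, the difference between that frame and each successive frame is tracked while it keeps growing, and the accumulated value is finally tested against a higher threshold. The parameter names and threshold values are illustrative.

import numpy as np

def hist_diff(a, b, bins=64):
    ha, _ = np.histogram(a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(b, bins=bins, range=(0, 256))
    return np.abs(ha / max(ha.sum(), 1) - hb / max(hb.sum(), 1)).sum()

def twin_comparison(frames, t_low=0.1, t_high=0.4):
    """Return (start, end) index pairs of detected gradual transitions."""
    transitions = []
    n = len(frames)
    i = 0
    while i < n - 1:
        if hist_diff(frames[i], frames[i + 1]) >= t_low:   # candidate transition start
            start, acc_prev, j = i, 0.0, i + 1
            # Compare the start frame against successive frames while the
            # accumulated difference keeps increasing (monotonicity test).
            while j < n:
                acc = hist_diff(frames[start], frames[j])
                if acc < acc_prev:
                    break
                acc_prev = acc
                j += 1
            if acc_prev >= t_high:                          # total change large enough
                transitions.append((start, j - 1))
            i = j
        else:
            i += 1
    return transitions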
Shot classification

Before a system can parse content, it must first recognize and account for artifacts caused by camera movement. These movements include panning and tilting (horizontal or vertical rotation of the camera) and zooming (focal length change), in which the camera position does not change, as well as tracking and booming (horizontal and vertical transverse movement of the camera) and dollying (horizontal lateral movement of the camera), in which the camera position does change. These operations may also occur in combinations. They are most readily detected through motion field analysis, since each operation has its own characteristic pattern of motion vectors. For example, a zoom causes most of the motion vectors to point either toward or away from a focus center, while movement of the camera itself shows up as a modal value across the entire motion field.

The motion vectors can be computed by the block-matching algorithms used in motion compensation for video compression. Thus, a system can often retrieve the vectors from files of video compressed according to standards such as MPEG and H.261. The system could also compute them in real time by using chips that perform such compression in hardware.
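A rough sketch of how such motion-field patterns might be distinguished is given below. It assumes the motion vectors have already been recovered (for example, from MPEG motion compensation data) as a grid of (dx, dy) values. The divergence-versus-mode test, the function name, and the numeric thresholds are our own illustrative choices, not the classifier used in the system described here.

import numpy as np

def classify_motion_field(vectors, zoom_thresh=0.5, pan_thresh=0.5):
    """vectors: array of shape (rows, cols, 2) holding (dx, dy) per block.
    Returns 'zoom', 'pan/track', 'static', or 'object motion or mixed'."""
    rows, cols, _ = vectors.shape
    # Unit vectors pointing away from the image center for every block.
    ys, xs = np.mgrid[0:rows, 0:cols]
    radial = np.dstack([xs - (cols - 1) / 2.0, ys - (rows - 1) / 2.0])
    norms = np.linalg.norm(radial, axis=2, keepdims=True)
    radial = radial / np.where(norms == 0, 1, norms)

    mags = np.linalg.norm(vectors, axis=2)
    moving = mags > 1e-3
    if not moving.any():
        return "static"

    # A zoom makes most vectors point toward or away from a focus center,
    # so their projection onto the radial direction dominates.
    radial_score = np.abs((vectors * radial).sum(axis=2))[moving] / mags[moving]
    if radial_score.mean() > zoom_thresh:
        return "zoom"

    # Camera movement (pan, tilt, track, boom) shows up as one modal vector
    # shared by most blocks: strong alignment with the mean direction.
    mean_dir = vectors[moving].mean(axis=0)
    mean_dir = mean_dir / max(np.linalg.norm(mean_dir), 1e-6)
    alignment = (vectors[moving] @ mean_dir) / mags[moving]
    if alignment.mean() > pan_thresh:
        return "pan/track"
    return "object motion or mixed"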
Content models

Content parsing is most effective with an a priori model of a video's structure. Such a model can represent a strong spatial order within the individual frames of shots and/or a strong temporal order across a sequence of shots. News broadcasts usually provide simple examples of such models. For example, all shots of the anchorperson conform to a common spatial layout, and the temporal structure simply alternates between the anchorperson and more detailed footage (possibly including breaks for commercials).

Our approach to content parsing begins with identifying key features of the image data, which are then compared to domain models to identify objects inferred to be part of the domain. We then identify domain events as segments that include specific domain objects. Our initial experiments involve models for cut boundaries, typed shots, and episodes. The cut boundary model drives the segmentation process that locates camera shot boundaries. Once a shot has been isolated through segmentation, it can be compared against type models based both on features to be detected and on measures that determine acceptable similarity. Sequences of typed shots can then be similarly compared against episode models. We discuss this in more detail later, under "Case study of video content analysis."
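As a toy illustration of how a temporal episode model might be applied once shots have been typed, the sketch below matches a sequence of shot labels against a simple news-broadcast pattern (anchorperson shots alternating with footage, with optional commercial breaks). The label alphabet and the regular-expression encoding are our own illustrative assumptions; the models used in the project are richer than this.

import re

# Each shot has already been classified by a type model:
# 'A' = anchorperson, 'F' = detailed footage, 'C' = commercial break.
shot_labels = "AFFFAFAFFCCAFF"

story = r"AF+"                       # a story: anchorperson shot, then footage
episode_model = rf"(?:{story}C*)+"   # a broadcast: stories with optional breaks

def parse_episodes(labels):
    """Return (start, end) index ranges of the stories found in the label string."""
    return [(m.start(), m.end() - 1) for m in re.finditer(story, labels)]

if re.fullmatch(episode_model, shot_labels):
    print("Sequence conforms to the news-broadcast episode model")
print(parse_episodes(shot_labels))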
Index construction and retrieval tools

The fundamental task of any database system is to support retrieval, so we must consider how to build indexes that facilitate such retrieval services for video. We want to base the index on semantic properties rather than lower level features.
A knowledge model can support such semantic properties. The model for our system is a frame-based knowledge base. In the following discussion, the word "frame" refers to such a knowledge-base object rather than a video image frame.
A frame-based knowledge base

An index based on semantic properties requires an organization that explicitly represents the various subject matter categories of the material being indexed. Such a representation is often realized as a semantic network, but text indexes tend to be structured as trees (as revealed by the indented representations of most book indexes). We decided that the more restricted tree form also suited our purposes.

Figure 3 gives an example of such a tree. It represents a selection of topical categories taken from a documentary video about the Faculty of Engineering at the National University of Singapore. The tree structure represents relations of specialization and generalization among these categories. Note, in particular, that categories correspond both to content material about student activities (Activity) and to classifications of different approaches to producing the video (Video-Types).

Figure 3. A tree structure of topical categories for a documentary video about engineering at the National University of Singapore.

Users tend to classify material on the basis of the information they hope to extract. This particular set of categories reflects interest both in the faculty and in documentary production. Thus, the purpose of this topical organization is not to classify every object in the video definitively. Rather, it helps users who approach this material with only a general set of questions, orienting them in how to formulate more specific questions and what sorts of answers to expect.

The frame-based knowledge base is the most appropriate technology for building such a structure. The frame is a data object that plays a role similar to that of a record in a traditional database. However, frames are grouped into classes, each of which represents some topical category. As Figure 3 illustrates, these classes tend to be organized in a specialization hierarchy. Such a hierarchy allows the representation of content in terms of one or more systems of categories that can then be used to focus attention for a variety of tasks.

The simplest of these tasks is the casual browsing of collections of items. However, hierarchical organization also facilitates the retrieval of specific items that satisfy the sorts of constraints normally associated with a database query.
Like the records of a database, frames are structured as a collection of fields (usually called slots in frame-based systems). These slots provide different elements of descriptive information, and the elements distinguish the topical characteristics for each object represented by a frame.

It is important to recognize that we use frames to represent both classes (the categories) and instances (the elements categorized). As an example of a class frame, consider the Laboratory category in Figure 3. We might define the frame for it as shown in Figure 4a. Alternatively, we can define an instance of one of its subclasses in a similar manner, as shown in Figure 4b.
Figure 4. Examples of the class frame Laboratory (top) and the subclass instance Wave-Simulator (bottom).

Name:        Laboratory
Superclass:  Academic
Subclasses:  #table[Computer-Lab Electronic-Lab Mechanical-Lab Civil-Lab Chemical-Lab]
Instances:   void
Description: void
Video:       void
Course:      void
Equipment:   void

Name:        Wave-Simulator
Class:       Civil-Lab
Description: "Monitoring pressure variation in breaking waves."
Video:       WaveBreaker-CoverFrame
Course:      Civil-Eng
Equipment:   #table[Computer Wave-Generator]
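To make the class/instance distinction concrete, here is a small Python sketch of how frames like those in Figure 4 might be represented. The FrameClass/FrameInstance split, the slot dictionary, and the use of None for "void" are our own modeling choices for illustration, not the data structures of the system described here.

from dataclasses import dataclass, field

VOID = None   # stands in for the "void" (unfilled) slot value in Figure 4

@dataclass
class FrameClass:
    """A class frame: a topical category in the specialization hierarchy."""
    name: str
    superclass: "FrameClass | None" = None
    subclasses: list = field(default_factory=list)
    instances: list = field(default_factory=list)
    slots: dict = field(default_factory=dict)   # content slots, e.g. Description, Video

@dataclass
class FrameInstance:
    """An instance frame: one categorized element, e.g. a particular lab."""
    name: str
    frame_class: FrameClass
    slots: dict = field(default_factory=dict)

# Rebuild the Figure 4 examples.
academic = FrameClass("Academic")
laboratory = FrameClass("Laboratory", superclass=academic,
                        slots={"Description": VOID, "Video": VOID,
                               "Course": VOID, "Equipment": VOID})
civil_lab = FrameClass("Civil-Lab", superclass=laboratory)
laboratory.subclasses.append(civil_lab)

wave_simulator = FrameInstance(
    "Wave-Simulator", civil_lab,
    slots={"Description": "Monitoring pressure variation in breaking waves.",
           "Video": "WaveBreaker-CoverFrame",
           "Course": "Civil-Eng",
           "Equipment": ["Computer", "Wave-Generator"]})
civil_lab.instances.append(wave_simulator)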
Note that not all slots need to be filled in a class definition ("void" indicates an unfilled slot), while they do all tend to be filled in instances. Also note that a slot can be filled by either a single value or a collection of values (indicated by the "#table[...]" construct).

For purposes of search, it is also important to note that some slots, such as Name, Superclass, Subclasses, Instances, and Class, exist strictly for purposes of maintaining a system of frames. The remaining slots, such as Description, Video, Course, and Equipment, are responsible for the actual representation of content. These latter slots are thus the objective of all search tasks.
Most frame-based knowledge bases impose no restrictions on the contents of slots: Any slot can assume any value or set of values. However, the search objective can be facilitated by strongly typing all slots. The system could enforce such a constraint through an "if-added" demon that does not allow a value to be added to a slot unless it satisfies some data typing requirement. For example, if Shot is a class whose instances represent individual camera shots from a video source, then only values that are instances of the Shot class can be added to the Video slot in frames such as those in Figure 4.

Data typing can determine whether or not any potential slot value is a frame, and it might even be able to distinguish class frames from instance frames. However, we can make typing even more powerful if we extend it to deal with classes as if they were data types. In this case, type checking would verify not only that every potential Video slot value is an instance frame but, more specifically, that it is an instance of the Shot class. Furthermore, we could subject slot values for instances of more specific classes to even more restrictive constraints. Thus, we might constrain the Video slot of the Headings frame to check whether or not the content of a representative frame of the Shot instance being assigned consists only of characters. (We could further refine this test if we knew the fonts used to compose such headings.)

What is important for retrieval purposes is that we can translate knowledge of a slot's type into knowledge of how to search it. We can apply different techniques to inspecting the contents of different slots, and we can combine those techniques by means far more sophisticated than the sorts of combinations normally associated with database query operations.
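Building on the FrameClass and FrameInstance sketch above, an "if-added" style check on slot values might look like the following. The SLOT_TYPES table and the Shot class here are illustrative, and real frame systems attach such demons to slot definitions rather than to a free function; this only sketches the idea of treating classes as data types.

class SlotTypeError(TypeError):
    pass

# Slot name -> required frame class, treating classes as data types.
SLOT_TYPES = {"Video": "Shot"}

def class_and_ancestors(frame_class):
    while frame_class is not None:
        yield frame_class.name
        frame_class = frame_class.superclass

def add_slot_value(instance, slot, value):
    """Reject values whose class does not satisfy the slot's declared type."""
    required = SLOT_TYPES.get(slot)
    if required is not None:
        if not isinstance(value, FrameInstance) or \
           required not in class_and_ancestors(value.frame_class):
            raise SlotTypeError(f"{slot} slot requires an instance of {required}")
    instance.slots[slot] = value

# Usage: only Shot instances may be placed in a Video slot.
shot_class = FrameClass("Shot")
cover_shot = FrameInstance("WaveBreaker-CoverFrame", shot_class)
add_slot_value(wave_simulator, "Video", cover_shot)        # accepted
# add_slot_value(wave_simulator, "Video", "some string")   # would raise SlotTypeError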
Retrieval tools

Let us now consider more specifically how we can search frames given a priori knowledge of the typing of their slots. Because a database is only as good as the retrieval facilities it supports, it must have a variety of tools, based on both text and visual interfaces. Our system's current suite of tools includes a free-text query engine and interface, the tree display of the class hierarchy, image feature-based retrieval tools, and the Clipmap.

Every frame in the knowledge base includes a Description slot with a text string as its contents. Thus, the user can provide text descriptions for all video shots in the database. The free-text retrieval tool retrieves video shots on the basis of the Description slot contents. A concept-based retrieval engine analyzes the user's query.6 Given a free-text query specified by the user, the system first extracts the relevant terms by removing the nonfunctional words and converting those remaining into stemmed forms. The system then checks the query against a domain-specific thesaurus, after which it uses similarity measures to compare the text descriptions with the query terms. Frames whose similarity measure exceeds a given threshold are identified and retrieved, linearly ordered by the strength of the similarity.
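A stripped-down version of this pipeline is sketched below: stop-word removal, crude suffix stemming, a toy thesaurus expansion, and a similarity score over the Description strings, with results ranked and cut off at a threshold. The stop-word list, stemmer, thesaurus, and cosine-style measure are illustrative stand-ins for the concept-based engine cited above.

import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are"}
THESAURUS = {"laboratory": {"lab"}, "wave": {"waves"}}   # toy domain thesaurus

def terms(text):
    """Lowercase, drop stop words, and crudely stem by stripping plural endings."""
    words = [w.strip(".,\"'()") for w in text.lower().split()]
    return [w.rstrip("s") for w in words if w and w not in STOP_WORDS]

def expand(query_terms):
    expanded = set(query_terms)
    for t in query_terms:
        expanded |= {syn.rstrip("s") for syn in THESAURUS.get(t, set())}
    return expanded

def similarity(query_terms, description):
    doc = Counter(terms(description))
    q = Counter(expand(query_terms))
    dot = sum(q[t] * doc[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * \
           math.sqrt(sum(v * v for v in doc.values()))
    return dot / norm if norm else 0.0

def retrieve(query, descriptions, threshold=0.1):
    """Return (name, score) pairs for Description slots similar enough to the query."""
    q = terms(query)
    scored = [(name, similarity(q, text)) for name, text in descriptions.items()]
    return sorted([s for s in scored if s[1] > threshold],
                  key=lambda s: s[1], reverse=True)

# Example: search the Description slots of two shots.
shots = {"Wave-Simulator": "Monitoring pressure variation in breaking waves.",
         "Convocation": "Students receiving degrees at convocation."}
print(retrieve("wave laboratory pressure", shots))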
In addition to using free text, we can formulate queries directly on the basis of the category tree itself. This tree is particularly useful in identifying all shots that are instances of a common category at any level of generalization. We can then use the tree to browse instances of related categories. The class hierarchy also allows for slot-based retrieval. Free-text retrieval provides access to Description slots, but we can search on the basis of other slots as well. For example, we can retrieve slots whose contents are other frames through queries based on their situation in the class hierarchy. We can compare slots having numeric values against numeric intervals. Furthermore, if we want to restrict a search to the instances of a particular class, then the class hierarchy can tell us which slots can be searched and what types of data they contain.

Retrieval based on the contents of Video slots will require computation of characteristic visual features. As an example, a user examining a video of a dance performance should be able to retrieve all shots of a particular dancer on the basis of costume color. Retrieval would then require constructing a model for matching regions of that color with suitable spatial properties. The primitives from which such models are constructed then serve as the basis for the index structure. In such a database, each video clip would be represented by one or more frames, and all indexing and retrieval would be based on the image features of those frames.
Some image database systems, such as the Query By Image Content (QBIC) project, have developed techniques that support this approach. These techniques include selection and computation of image features that provide useful query functionality, similarity-based retrieval methods, and interfaces that let users pose and refine queries visually and navigate their way through the database visually.

We chose color, texture, and shape as basic image features and developed a prototype system with fast image-indexing abilities. This system automatically computes numerical index keys based on color distribution, prominent color region segmentation, and color histograms (as texture models) for each image. Each image is indexed by the size, color, location, and shape of segmented regions and by the color histograms of the entire image and nine subregions. To achieve fast retrieval, the system codes these image features into numerical index keys according to the significance of each feature in the query-matching process. This retrieval approach has proved fast and accurate.
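The sketch below shows one way the nine-subregion color histograms could be computed and packed into a compact index key. The 3x3 grid, the coarse per-channel quantization, and the dominant-color key layout are our own simplifications for illustration rather than the exact index keys of the prototype.

import numpy as np

def region_histograms(image, grid=3, bins_per_channel=4):
    """image: H x W x 3 uint8 RGB array. Returns one coarse color histogram for the
    whole image plus one per subregion of a grid x grid partition."""
    h, w, _ = image.shape
    step = 256 // bins_per_channel
    quantized = image // step                       # coarse color quantization
    codes = (quantized[..., 0] * bins_per_channel + quantized[..., 1]) \
            * bins_per_channel + quantized[..., 2]  # single color code per pixel
    n_colors = bins_per_channel ** 3

    def hist(block):
        counts = np.bincount(block.ravel(), minlength=n_colors)
        return counts / block.size

    regions = [hist(codes)]                         # whole image first
    for r in range(grid):
        for c in range(grid):
            block = codes[r * h // grid:(r + 1) * h // grid,
                          c * w // grid:(c + 1) * w // grid]
            regions.append(hist(block))
    return np.stack(regions)                        # shape (10, 64) for the defaults

def index_key(image):
    """Crude numerical key: the dominant color code of each region, concatenated."""
    return tuple(int(h.argmax()) for h in region_histograms(image))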
Indexing representative images essentially ignores the temporal nature of a video. Retrieval should be based on events as well as on features of static images. This will require a better understanding of which temporal visual features are both important for retrieval and feasible to compute. For instance, we can retrieve zooming sequences through a relatively straightforward examination of the motion vector field. However, because such vector fields are often difficult to compute (and because the "motion vectors" provided by compressed video are not always a reliable representation of optical flow), a more viable alternative might be to perform feature analysis on spatio-temporal images. We discuss this alternative below under the subsection "Micons: Icons for video content."
A Clipmap is simply a window containing a collection of icons, each of which represents a camera shot. We can use Clipmaps to provide an unstructured index for a collection of shots. They can also be used to display the results of retrieval operations. For example, rather than simply listing the frames retrieved by a free-text query, the system can construct a Clipmap based on the contents of the Video slot of each frame. Such a display is especially useful when the query results in a long list of frames. For example, Figure 5 is a Clipmap constructed for a query requesting all instances of the Activity class. Even if the system orders retrieval results by degree of similarity (as they are in free-text search), it can still be difficult to identify the desired shots from text representations of those frames. The Clipmap provides visual recognition as an alternative to examining such text descriptions.

Figure 5. A typical Clipmap.
Interactive video objects

We turn now to the problem of interfaces. Video is "media rich," providing moving pictures, text, music, and sound. Thus, interfaces based on keywords or other types of text representation cannot provide users a suitable "window" on video content. Only visual representation can provide an intuitive cue to such content. Furthermore, we should not regard such cues as passive objects. A user should be able to interact with them, just as text indexes are more for interaction than for examination. In this section, we discuss three approaches to interactivity.
Micons: Icons for video content

A commonly used visual representation of video shots or sequences is the video or movie icon, sometimes called a micon. Figure 6 illustrates the environment we have designed for examining and manipulating micons. Every video clip has a representative frame that provides a visual cue to its content. This static image is displayed in the upper "index strip" of Figure 6. Selecting an image from the index strip causes the system to display the entire micon in the left-hand window. It also brings up a display of the contents of the Description slot and "loads" the clip into the "soft video player" shown on the right. The "depth" of the micon itself corresponds to the duration of the represented video sequence, and we can use this depth dimension as a "scroll bar." Thus, we can use any point along a "depth face" of the icon to cue the frame corresponding to that point in time for display on the front face. The top and side planes of the icon are the spatio-temporal pictures composed by the pixels along the horizontal and vertical edges of each frame in the video clip.4

Figure 6. An environment for examining and manipulating micons.

This presentation reveals that, at the bit level, a video clip is best thought of as a volume of pixels, different views of which can provide valuable content information. (For example, the ascent of Apollo 11 represented in Figure 6 is captured by the upper face of the icon.) We can also incorporate camera operation information into the video icon construction and build a VideoSpaceIcon.
We can examine this volume further by taking horizontal and vertical slices, as indicated by the operator icons on the side of the display in Figure 6. For example, Figure 7 illustrates a horizontal slice through a micon corresponding to an excerpt from "Changing Steps," produced by Elliot Caplan and Merce Cunningham, a video interpretation of Cunningham's dance of the same name. (This micon actually does not correspond to a single camera shot. The clip was obtained from the Hierarchical Video Magnifier, discussed in the next subsection.) Note that the slice was taken just above the ankles, so it is possible to trace the movements of the dancers in time through the colored lines created as traces of their leotards.

Figure 7. A micon with a horizontal slice. Each colored line matches the color of a leotard on the lower leg. Horizontal movement of the line in the image exposed by the slice corresponds to horizontal movement of the leg (and usually the dancer). Dancers' movements can be traced by using this exposed spatio-temporal picture as a source of cues.
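Treating a clip as a volume of pixels makes slice extraction almost trivial. The sketch below cuts a horizontal spatio-temporal slice (one image row tracked over time) from a stack of frames, which is essentially the kind of picture exposed in Figure 7. Reading the clip with OpenCV and the file name used here are assumptions about tooling, not part of the system described in this article.

import numpy as np
import cv2  # assumed available for decoding the clip

def load_frames(path):
    """Decode a video file into a list of H x W x 3 frames."""
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
    capture.release()
    return frames

def horizontal_slice(frames, row):
    """Spatio-temporal picture: the chosen image row from every frame,
    stacked over time into a (num_frames x width x 3) image."""
    return np.stack([f[row] for f in frames])

frames = load_frames("changing_steps_excerpt.mov")   # hypothetical file name
slice_image = horizontal_slice(frames, row=200)      # e.g., a row just above the ankles
cv2.imwrite("slice.png", slice_image)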
Selecting a representative frame for each camera shot in the index strip is an important issue. Currently, we avoid tools that are too computationally intensive. Two approaches involve simple pixel-based properties. In the first, an "average frame" is defined in which each pixel has the average of the values at the same grid point in all frames of the shot. The system then selects the frame that is most similar to this average frame as the representative frame.

Another approach involves averaging the histograms of all the frames in a clip and selecting the frame whose histogram is closest to the average histogram as the representative frame. However, neither of these approaches involves semantic properties (although the user can always override decisions made by these methods).
We also plan to incorporate camera and object motion information either for selecting a representative frame or for constructing a "salient still"10 instead of a representative frame.
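The average-histogram rule described above might be coded as follows; the grey-level histogram and the L1 distance to the average are illustrative choices, not necessarily the measures used in our prototype.

import numpy as np

def grey_histogram(frame, bins=64):
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def representative_frame_index(frames):
    """Pick the frame whose histogram is closest to the clip's average histogram."""
    histograms = np.stack([grey_histogram(f) for f in frames])
    average = histograms.mean(axis=0)
    distances = np.abs(histograms - average).sum(axis=1)   # L1 distance to the average
    return int(distances.argmin())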
Hierarchical video magnifier

Sometimes the ability to browse a video in its entirety is more important than examining individual camera shots in detail. We base our approach on the Hierarchical Video Magnifier. It is illustrated in Figure 8, which presents an overview of the entire "Changing Steps" video. The original tape of this composition was converted to a QuickTime movie 1,282,602 units long. (There are 600 QuickTime units per second, so this corresponds to a little under 36 minutes.) As the figure shows, the display dimensions allow for five frames side by side. Therefore, the whole movie is divided into five segments of equal length, each segment represented by the frame at its midpoint.

Figure 8. A hierarchical browser of a full-length video.

As an example from this particular video, the first segment occupies the first 256,520 units of the movie, and its representative frame is at index 128,260. Each segment can then be similarly expanded by dividing it into five portions of equal length, each represented by its midpoint frame. By the time we get to the third level, we are viewing five equally spaced frames from a segment of size 51,304 (approximately 85.5 seconds). Users can continue browsing to greater depth, after which the screen scrolls accordingly.
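The arithmetic behind this display is simple recursive division. The sketch below reproduces it for the numbers quoted above (1,282,602 QuickTime units, five segments per level); the function name and return layout are our own.

QT_UNITS_PER_SECOND = 600

def level_segments(start, length, fanout=5):
    """Split [start, start + length) into `fanout` equal segments and return
    (segment_start, segment_length, midpoint_frame_index) triples."""
    seg = length // fanout
    return [(start + i * seg, seg, start + i * seg + seg // 2) for i in range(fanout)]

movie_length = 1_282_602                      # "Changing Steps" in QuickTime units
level1 = level_segments(0, movie_length)
print(level1[0])                              # (0, 256520, 128260), as quoted above
level2 = level_segments(level1[0][0], level1[0][1])
print(level2[0][1] / QT_UNITS_PER_SECOND)     # 51,304 units, about 85.5 seconds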
The user can also select any frame on the display for storage. The system will store the entire segment represented by the frame as a separate file, which the user can then examine with the micon viewer. (This is how we created the image in Figure 7.) This approach to browsing is particularly valuable for a source like "Changing Steps," which does not have a well-defined narrative structure. It can serve equally well for material where the narrative structure is not yet understood.

The Hierarchical Video Magnifier is an excellent example of "content-free content analysis." The technique requires no information regarding the content of the video other than its duration. We developed it to exploit the results of automatic segmentation. The segment boundaries determined by simple arithmetic division in the Hierarchical Video Magnifier are then "justified" by being shifted to the nearest camera shot boundary. Thus, at the top levels of the hierarchy, the segments actually correspond to sequences of camera shots, rather than an arbitrary interval of fixed duration. These camera shot boundaries are honored in the subdivision of
