Video Indexing and Retrieval

Stephen W. Smoliar and HongJiang Zhang
National University of Singapore
`
Current video management tools and techniques are based on pixels rather than perceived content. Thus, state-of-the-art video editing systems can easily manipulate such things as time codes and image frames, but they cannot "know," for example, what a basketball is. Our research addresses four areas of content-based video management.
`
Video has become an important element of multimedia computing and communication environments, with applications as varied as broadcasting, education, publishing, and military intelligence. However, video will only become an effective part of everyday computing environments when we can use it with the same facility that we currently use text. Computer literacy today entails the ability to set our ideas down spontaneously with a word processor, perhaps while examining other text documents to develop those ideas and even using editing operations to transfer some of that text into our own compositions. Similar composition using video remains far in the future, even though workstations now come equipped with built-in video cameras and microphones, not to mention ports for connecting our increasingly popular handheld video cameras.
Why is this move to communication incorporating video still beyond our grasp? The problem is that video technology has developed thus far as a technology of images. Little has been done to help us use those images effectively. Thus, we can buy a camera that "knows" all about how to focus itself properly and even how to compensate for the fact that we can rarely hold it steady without a tripod. But no camera knows "where the action is" during a basketball game or a family reunion. A camera can give us a clear shot of the ball going through the basket, but only if we find the ball for it.

The point is that we do not use images just because they are steady or clearly focused. We use them for their content. If we wish to compose with images in the same way that we compose with words, we must focus our attention on content. Video composition should not entail thinking about image "bits" (pixels), any more than text composition requires thinking about ASCII character codes. Video content objects include basketballs, athletes, and hoops. Unfortunately, state-of-the-art software for manipulating video does not "know" about such objects. At best, it "knows" about time codes, individual frames, and clips of video and sound. To compose a video document, or even just incorporate video as part of a text document, we find ourselves thinking one way (with ideas) when we are working with text and another (with pixels) when we are working with video. The pieces do not fit together effectively, and video suffers for it.
Similarly, if we wish to incorporate other text material in a document, word processing offers a powerful repertoire of techniques for finding what we want. In video, about the only technique we have is our own memory coupled with some intuition about how to use fast forward and fast reverse buttons while viewing.

The moral of all this is that the effective use of video is still beyond our grasp because the effective use of its content is still beyond our grasp. How can we remedy this situation? At the Institute of Systems Science of the National University of Singapore, the Video Classification project addresses this question. We are currently tackling problems in four areas:
`
- Defining an architecture that characterizes the tasks of managing video content.

- Developing software tools and techniques that identify and represent video content.

- Applying knowledge representation techniques to the development of index construction and retrieval tools.

- Developing an environment for interacting with video objects.
`
In this article, we discuss each of these problem areas in detail, then briefly review a recent case study concerned with content analysis of news videos. We conclude with a discussion of our plans to extend our work into the audio domain.

Architecture for video management
Our architecture is based on the assumption that video information will be maintained in a database.1 This assumption requires us to define tools for the construction of such databases and the insertion of new material into existing databases. We can characterize these tools in terms of a sequence of specific task requirements:
`
`
- Parsing, which segments the video stream into generic clips. These clips are the elemental index units in the database. Ideally, the system decomposes individual images into semantic primitives. On the basis of these primitives, a video clip can be indexed with a semantic description using existing knowledge-representation techniques.

- Indexing, which tags video clips when the system inserts them into the database. The tag includes information based on a knowledge model that guides the classification according to the semantic primitives of the images. Indexing is thus driven by the image itself and any semantic descriptors provided by the model.

- Retrieval and browsing, where users can access the database through queries based on text and/or visual examples or browse it through interaction with displays of meaningful icons. Users can also browse the results of a retrieval query. It is important that both retrieval and browsing appeal to the user's visual intuition.
`
Figure 1 summarizes this task analysis as an architectural diagram. The heart of the system is a database management system containing the video and audio data from video source material that has been compressed wherever possible. The DBMS defines attributes and relations among these entities in terms of a frame-based approach to knowledge representation (described further under the subhead "A frame-based knowledge base"). This representation approach, in turn, drives the indexing of entities as they are added to the database. Those entities are initially extracted by the tools that support the parsing task. In the opposite direction, the database contents are made available by tools that support the processing of both specific queries and the more general needs of casual browsing.

Figure 1. Diagram of video management architecture.

The next three sections discuss elements of this architecture in greater detail.
`
Video content parsing
Three tool sets address the parsing task. The first set segments the video source material into individual camera shots, which then serve as the basic units for indexing. The second set identifies different manifestations of camera technique in these clips. The third set applies content models to the identification of context-dependent semantic primitives.
`
`
Locating camera shot boundaries
We decided that the most viable segmentation criteria for motion video are those that detect boundaries between camera shots. Thus, the camera shot, consisting of one or more frames generated and recorded contiguously and representing a continuous action in time and space, becomes the smallest unit for indexing video. The simplest shot transition is a camera cut, where the boundary lies between two successive frames. More sophisticated transition techniques include dissolves, wipes, and fade-outs, all of which take place over a sequence of frames.

In any case, camera shots can always be distinguished by significant qualitative differences. If we can express those differences by a suitable quantitative measure, then we can declare a segment boundary whenever that measure exceeds a given threshold. The key issues in locating shot boundaries, therefore, are selecting suitable difference measures and thresholds, and applying them to the comparison of video frames. We now briefly review the segmentation techniques we currently employ. (For details, see Zhang et al.2)

The most suitable measures rely on comparisons between the pixel-intensity histograms of two frames. The principle behind this metric is that two frames with little change in the background and object content will also differ little in their overall intensity distributions. Further strengthening this approach, it is easy to define a histogram that effectively accounts for color information. We also developed an automatic approach to detect the segmentation threshold on the basis of statistics of frame difference values and a multipass technique that improves processing speed.2
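For illustration only, the following minimal sketch computes such a difference measure (our Python approximation, assuming NumPy and OpenCV as stand-ins for whatever decoding and histogram tools are available; the article does not specify any):

import numpy as np
import cv2  # assumption: OpenCV provides frame decoding and histograms

def color_histogram(frame, bins=8):
    # Histogram over the full color space, normalized so frame size does not matter.
    hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3,
                        [0, 256, 0, 256, 0, 256])
    return hist.ravel() / hist.sum()

def frame_difference(frame_a, frame_b, bins=8):
    # Sum of absolute bin-wise differences: near zero for similar frames,
    # large across a camera break.
    return float(np.abs(color_histogram(frame_a, bins) -
                        color_histogram(frame_b, bins)).sum())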
`
`
Figure 2. A sequence of frame-to-frame histogram differences obtained from a documentary video, where differences corresponding both to camera breaks and to transitions implemented by special effects can be observed.

Figure 2 illustrates a typical sequence of difference values. The graph exhibits two high pulses corresponding to two camera breaks. It also illustrates a gradual transition occurring over a sequence of frames. In this case, the task is to identify the sequence start and end points. As the inset in Figure 2 shows, the difference values during such a transition are far less than across a camera break. Thus, a single threshold lacks the power to detect gradual transitions.

A so-called twin-comparison approach solves this problem. The name refers to the use of two thresholds. First, a reduced threshold detects the potential starting frame of a transition sequence. Once that frame has been identified, it is compared against successive frames, thus measuring an accumulated difference instead of frame-to-frame differences. This accumulated difference must be monotonic. When it ceases to be monotonic, it is compared against a second, higher threshold. If this threshold is exceeded, we conclude that the monotonically increasing sequence of accumulated differences corresponds to a gradual transition. Experiments have shown this approach to be very effective.2
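The logic can be sketched in a few lines (a minimal illustration reusing frame_difference from above; the two thresholds are assumptions that would in practice be set by the automatic threshold detection described earlier):

def detect_transitions(frames, t_low, t_high):
    # Twin-comparison sketch: t_high alone catches camera cuts; t_low flags a
    # potential gradual transition, which is verified by comparing the
    # accumulated difference (candidate start frame vs. current frame)
    # against t_high once it stops increasing.
    transitions = []  # entries: (kind, start_index, end_index)
    i = 1
    while i < len(frames):
        d = frame_difference(frames[i - 1], frames[i])
        if d >= t_high:
            transitions.append(("cut", i - 1, i))
        elif d >= t_low:
            start = i - 1
            accumulated = frame_difference(frames[start], frames[i])
            while i + 1 < len(frames):
                nxt = frame_difference(frames[start], frames[i + 1])
                if nxt <= accumulated:
                    break  # no longer monotonic: the candidate sequence ends
                accumulated = nxt
                i += 1
            if accumulated >= t_high:
                transitions.append(("gradual", start, i))
        i += 1
    return transitions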
`
Shot classification
Before a system can parse content, it must first recognize and account for artifacts caused by camera movement. These movements include panning and tilting (horizontal or vertical rotation of the camera) and zooming (focal length change), in which the camera position does not change, and tracking and booming (horizontal and vertical transverse movement of the camera) and dollying (horizontal lateral movement of the camera), in which the camera position does change. These operations may also occur in combinations. They are most readily detected through motion field analysis, since each operation has its own characteristic pattern of motion vectors. For example, a zoom causes most of the motion vectors to point either toward or away from a focus center, while movement of the camera itself shows up as a modal value across the entire motion field.

The motion vectors can be computed by the block-matching algorithms used in motion compensation for video compression. Thus, a system can often retrieve the vectors from files of video compressed according to standards such as MPEG and H.261. The system could also compute them in real time by using chips that perform such compression in hardware.
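The idea can be illustrated with a toy classifier over a single motion field (our sketch, not the authors' algorithm; the 0.9 and 0.5 cutoffs are arbitrary assumptions):

import numpy as np

def classify_motion_field(vectors, positions, center):
    # vectors, positions: (N, 2) arrays of block motion vectors and the block
    # centers they were measured at; center: assumed focus center of the frame.
    radial = positions - np.asarray(center, dtype=float)
    radial /= np.linalg.norm(radial, axis=1, keepdims=True) + 1e-9
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-9)
    alignment = np.abs((radial * unit).sum(axis=1))  # |cos| of angle to the radius
    if np.median(alignment) > 0.9:
        return "zoom"  # vectors point toward or away from the focus center
    angles = np.arctan2(vectors[:, 1], vectors[:, 0])
    counts, _ = np.histogram(angles, bins=16, range=(-np.pi, np.pi))
    if counts.max() > 0.5 * len(vectors):
        return "camera movement"  # one modal direction dominates the field
    return "unclassified"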
`
Content models
Content parsing is most effective with an a priori model of a video's structure. Such a model can represent a strong spatial order within the individual frames of shots and/or a strong temporal order across a sequence of shots. News broadcasts usually provide simple examples of such models. For example, all shots of the anchorperson conform to a common spatial layout, and the temporal structure simply alternates between the anchorperson and more detailed footage (possibly including breaks for commercials).

Our approach to content parsing begins with identifying key features of the image data, which are then compared to domain models to identify objects inferred to be part of the domain. We then identify domain events as segments that include specific domain objects. Our initial experiments involve models for cut boundaries, typed shots, and episodes. The cut boundary model drives the segmentation process that locates camera shot boundaries. Once a shot has been isolated through segmentation, it can be compared against type models based both on features to be detected and on measures that determine acceptable similarity. Sequences of typed shots can then be similarly compared against episode models. We discuss this in more detail later, under "Case study of video content analysis."
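A minimal way to express such an episode model, purely as an illustration of the idea (the type labels are hypothetical outputs of the shot type models, not our system's actual vocabulary):

def matches_news_episode(shot_types):
    # Toy episode model: an episode is one or more stories, each an
    # anchorperson shot followed by any number of footage shots.
    if not shot_types or shot_types[0] != "anchorperson":
        return False  # every story, and hence the episode, opens on the anchor
    return all(t in ("anchorperson", "footage") for t in shot_types)

# matches_news_episode(["anchorperson", "footage", "footage", "anchorperson"])  -> True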
`
Index construction and retrieval tools
The fundamental task of any database system is to support retrieval, so we must consider how to build indexes that facilitate such retrieval services for video. We want to base the index on semantic properties, rather than lower level features. A knowledge model can support such semantic properties. The model for our system is a frame-based knowledge base. In the following discussion, the word "frame" refers to such a knowledge base object rather than a video image frame.
`
A frame-based knowledge base
An index based on semantic properties requires an organization that explicitly represents the various subject matter categories of the material being indexed. Such a representation is often realized as a semantic network, but text indexes tend to be structured as trees (as revealed by the indented representations of most book indexes). We decided that the more restricted tree form also suited our purposes.

Figure 3 gives an example of such a tree. It represents a selection of topical categories taken from a documentary video about the Faculty of Engineering at the National University of Singapore. The tree structure represents relations of specialization and generalization among these categories. Note, in particular, that categories correspond both to content material about student activities (Activity) and to classifications of different approaches to producing the video (Video-Types).

Figure 3. A tree structure of topical categories for a documentary video about engineering at the National University of Singapore.

Users tend to classify material on the basis of the information they hope to extract. This particular set of categories reflects interest both in the faculty and in documentary production. Thus, the purpose of this topical organization is not to classify every object in the video definitively. Rather, it helps users who approach this material with only a general set of questions, orienting them in how to formulate more specific questions and what sorts of answers to expect.

The frame-based knowledge base is the most appropriate technology for building such a structure. The frame is a data object that plays a role similar to that of a record in a traditional database. However, frames are grouped into classes, each of which represents some topical category. As Figure 3 illustrates, these classes tend to be organized in a specialization hierarchy. Such a hierarchy allows the representation of content in terms of one or more systems of categories that can then be used to focus attention for a variety of tasks.

The simplest of these tasks is the casual browsing of collections of items. However, hierarchical organization also facilitates the retrieval of specific items that satisfy the sorts of constraints normally associated with a database query. Like the records of a database, frames are structured as a collection of fields (usually called slots in frame-based systems). These slots provide different elements of descriptive information, and the elements distinguish the topical characteristics for each object represented by a frame.

It is important to recognize that we use frames to represent both classes (the categories) and instances (the elements categorized). As an example of a class frame, consider the Laboratory category in Figure 3. We might define the frame for it as shown in Figure 4a. Alternatively, we can define an instance of one of its subclasses in a similar manner, as shown in Figure 4b.

Figure 4. Examples of class frame Laboratory (top) and subclass instance Wave-Simulator (bottom).

Name: Laboratory
Superclass: Academic
Subclasses: #table[Computer-Lab Electronic-Lab Mechanical-Lab Civil-Lab Chemical-Lab]
Instances: void
Description: void
Video: void
Course: void
Equipment: void

Name: Wave-Simulator
Class: Civil-Lab
Description: "Monitoring pressure variation in breaking waves."
Video: WaveBreaker-CoverFrame
Course: Civil-Eng
Equipment: #table[Computer Wave-Generator]

Note that not all slots need to be filled in a class definition ("void" indicates an unfilled slot), while they do all tend to be filled in instances. Also note that a slot can be filled by either a single value or a collection of values (indicated by the "#table[...]" construct).

For purposes of search, it is also important to note that some slots, such as Name, Superclass, Subclasses, Instances, and Class, exist strictly for purposes of maintaining a system of frames. The remaining slots, such as Description, Video, Course, and Equipment, are responsible for the actual representation of content. These latter slots are thus the objective of all search tasks.

Most frame-based knowledge bases impose no restrictions on the contents of slots: Any slot can assume any value or set of values. However, the search objective can be facilitated by strongly typing all slots. The system could enforce such a constraint through an "if-added" demon that does not allow a value to be added to a slot unless it satisfies some data typing requirement. For example, if Shot is a class whose instances represent individual camera shots from a video source, then only values that are instances of the Shot class can be added to the Video slot in frames such as those in Figure 4.

Data typing can determine whether or not any potential slot value is a frame, and it might even be able to distinguish class frames from instance frames. However, we can make typing even more powerful if we extend it to deal with classes as if they were data types. In this case, type checking would verify not only that every potential Video slot value is an instance frame but, more specifically, that it is an instance of the Shot class. Furthermore, we could subject slot values for instances of more specific classes to even more restrictive constraints. Thus, we might constrain the Video slot of the Headings frame to check whether or not the content of a representative frame of the Shot instance being assigned consists only of characters. (We could further refine this test if we knew the fonts used to compose such headings.)
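A minimal sketch of such a typed slot with an "if-added" check might look as follows (our illustration only; the slot and class names follow Figure 4, and the Python classes are hypothetical stand-ins for the knowledge base's own machinery):

class Frame:
    # Minimal frame: slot values pass through an "if-added" check that
    # rejects anything not matching the declared type for that slot.
    def __init__(self, name, slot_types):
        self.name = name
        self.slot_types = slot_types  # slot name -> required class
        self.slots = {}

    def add_value(self, slot, value):
        required = self.slot_types.get(slot, object)
        if not isinstance(value, required):  # the "if-added" demon
            raise TypeError(f"slot {slot} requires a {required.__name__}")
        self.slots.setdefault(slot, []).append(value)

class Shot(Frame):
    # Instances represent individual camera shots from a video source.
    def __init__(self, name):
        super().__init__(name, {})

simulator = Frame("Wave-Simulator", {"Video": Shot, "Description": str})
simulator.add_value("Video", Shot("WaveBreaker-CoverFrame"))  # accepted
# simulator.add_value("Video", "not a shot")  # would raise TypeError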
What is important for retrieval purposes is that we can translate knowledge of a slot's type into knowledge of how to search it. We can apply different techniques to inspecting the contents of different slots, and we can combine those techniques by means far more sophisticated than the sorts of combinations normally associated with database query operations.
`
Retrieval tools
Let us now consider more specifically how we can search frames given a priori knowledge of the typing of their slots. Because a database is only as good as the retrieval facilities it supports, it must have a variety of tools, based on both text and visual interfaces. Our system's current suite of tools includes a free-text query engine and interface, the tree display of the class hierarchy, image feature-based retrieval tools, and the Clipmap.

Every frame in the knowledge base includes a Description slot with a text string as its contents. Thus, the user can provide text descriptions for all video shots in the database. The free-text retrieval tool retrieves video shots on the basis of the Description slot contents. A concept-based retrieval engine analyzes the user's query.6 Given a free-text query specified by the user, the system first extracts the relevant terms by removing the nonfunctional words and converting those remaining into stemmed forms. The system then checks the query against a domain-specific thesaurus, after which it uses similarity measures to compare the text descriptions with the query terms. Frames whose similarity measure exceeds a given threshold are identified and retrieved, linearly ordered by the strength of the similarity.
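The pipeline can be sketched as follows (a toy illustration: the stopword list and thesaurus here are placeholder stand-ins, and a real system would use a proper stemmer and the concept-based similarity measure rather than plain cosine similarity):

import re
from collections import Counter
from math import sqrt

STOPWORDS = {"the", "a", "of", "in", "and", "is", "to"}   # toy stopword list
THESAURUS = {"lab": "laboratory", "waves": "wave"}        # toy thesaurus stand-in

def terms(text):
    # Drop nonfunctional words; "stem" crudely by lowercasing and mapping
    # through the thesaurus.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(THESAURUS.get(w, w) for w in words if w not in STOPWORDS)

def similarity(query, description):
    # Cosine similarity between the two term vectors.
    q, d = terms(query), terms(description)
    dot = sum(q[t] * d[t] for t in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def retrieve(query, descriptions, threshold=0.2):
    # descriptions: shot id -> Description slot text. Returns hits above the
    # threshold, linearly ordered by strength of similarity.
    hits = [(similarity(query, text), sid) for sid, text in descriptions.items()]
    return sorted([h for h in hits if h[0] >= threshold],
                  key=lambda h: h[0], reverse=True)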
In addition to using free text, we can formulate queries directly on the basis of the category tree itself. This tree is particularly useful in identifying all shots that are instances of a common category at any level of generalization. We can then use the tree to browse instances of related categories. The class hierarchy also allows for slot-based retrieval. Free-text retrieval provides access to Description slots, but we can search on the basis of other slots as well. For example, we can retrieve slots whose contents are other frames through queries based on their situation in the class hierarchy. We can compare slots having numeric values against numeric intervals. Furthermore, if we want to restrict a search to the instances of a particular class, then the class hierarchy can tell us which slots can be searched and what types of data they contain.
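For example, slot-based queries might be sketched like this (our illustration; the data shapes are assumptions, not our system's actual interfaces):

def instances_of(cls, class_tree):
    # class_tree: class name -> (list of subclass names, list of instance
    # frames); a toy stand-in for the knowledge base. Collects instances of
    # cls and of every class below it in the specialization hierarchy.
    subclasses, instances = class_tree[cls]
    found = list(instances)
    for sub in subclasses:
        found.extend(instances_of(sub, class_tree))
    return found

def numeric_slot_query(frames, slot, low, high):
    # Keep frames (toy representation: dicts) whose numeric slot value
    # falls inside the interval [low, high].
    return [f for f in frames if slot in f and low <= f[slot] <= high]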
Retrieval based on the contents of Video slots will require computation of characteristic visual features. As an example, a user examining a video of a dance performance should be able to retrieve all shots of a particular dancer on the basis of costume color. Retrieval would then require constructing a model for matching regions of the color with suitable spatial properties. The primitives from which such models are constructed then serve as the basis for the index structure. In such a database, each video clip would be represented by one or more frames, and all indexing and retrieval would be based on the image features of those frames.
Some image database systems, such as the Query By Image Content (QBIC) project, have developed techniques that support this approach. These techniques include selection and computation of image features that provide useful query functionality, similarity-based retrieval methods, and interfaces that let users pose and refine queries visually and navigate their way through the database visually.
We chose color, texture, and shape as basic image features and developed a prototype system with fast image-indexing abilities. This system automatically computes numerical index keys based on color distribution, prominent color region segmentation, and color histograms (as texture models) for each image. Each image is indexed by the size, color, location, and shape of segmented regions and the color histograms of the entire image and nine subregions. To achieve fast retrieval, the system codes these image features into numerical index keys according to the significance of each feature in the query-matching process. This retrieval approach has proved fast and accurate.
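As an illustration of the subregion histograms behind such index keys (our sketch; the article does not give the actual key encoding):

import numpy as np

def region_histogram(region, bins=8):
    # Normalized color histogram of one region (region: h x w x 3, uint8).
    hist, _ = np.histogramdd(region.reshape(-1, 3).astype(float),
                             bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

def index_keys(image, bins=8):
    # One histogram for the whole image plus one for each of nine subregions
    # in a 3 x 3 grid, matching the layout described in the text.
    h, w, _ = image.shape
    keys = [region_histogram(image, bins)]
    for i in range(3):
        for j in range(3):
            keys.append(region_histogram(
                image[i * h // 3:(i + 1) * h // 3,
                      j * w // 3:(j + 1) * w // 3], bins))
    return keys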
Indexing representative images essentially ignores the temporal nature of a video. Retrieval should be based on events as well as features of static images. This will require a better understanding of which temporal visual features are both important for retrieval and feasible to compute. For instance, we can retrieve zooming sequences through a relatively straightforward examination of the motion vector field. However, because such vector fields are often difficult to compute (and because the "motion vectors" provided by compressed video are not always a reliable representation of optical flow), a more viable alternative might be to perform feature analysis on the spatio-temporal images. We discuss this alternative below under the subsection "Micons: Icons for video content."
A Clipmap is simply a window containing a collection of icons, each of which represents a camera shot. We can use Clipmaps to provide an unstructured index for a collection of shots. They can also be used to display the results of retrieval operations. For example, rather than simply listing the frames retrieved by a free-text query, the system can construct a Clipmap based on the contents of the Video slot of each frame. Such a display is especially useful when the query results in a long list of frames. For example, Figure 5 is a Clipmap constructed for a query requesting all instances of the Activity class. Even if the system orders retrieval results by degree of similarity (as they are in free-text search), it can still be difficult to identify the desired shots from text representations of those frames. The Clipmap provides visual recognition as an alternative to examining such text descriptions.

Figure 5. A typical Clipmap.
`
Interactive video objects
We turn now to the problem of interfaces. Video is "media rich," providing moving pictures, text, music, and sound. Thus, interfaces based on keywords or other types of text representation cannot provide users a suitable "window" on video content. Only visual representation can provide an intuitive cue to such content. Furthermore, we should not regard such cues as passive objects. A user should be able to interact with them, just as text indexes are more for interaction than for examination. In this section, we discuss three approaches to interactivity.
`
`
Micons: Icons for video content
A commonly used visual representation of video shots or sequences is the video or movie icon, sometimes called a micon. Figure 6 illustrates the environment we have designed for examining and manipulating micons. Every video clip has a representative frame that provides a visual cue to its content. This static image is displayed in the upper "index strip" of Figure 6. Selecting an image from the index strip causes the system to display the entire micon in the left-hand window. It also brings up a display of the contents of the Description slot and "loads" the clip into the "soft video player" shown on the right. The "depth" of the micon itself corresponds to the duration of the represented video sequence, and we can use this depth dimension as a "scroll bar." Thus, we can use any point along a "depth face" of the icon to cue the frame corresponding to that point in time for display on the front face. The top and side planes of the icon are the spatio-temporal pictures composed by the pixels along the horizontal and vertical edge of each frame in the video clip.4

Figure 6. An environment for examining and manipulating micons.
This presentation reveals that, at the bit level, a video clip is best thought of as a volume of pixels, different views of which can provide valuable content information. (For example, the ascent of Apollo 11 represented in Figure 6 is captured by the upper face of the icon.) We can also incorporate the camera operation information into the video icon construction and build a VideoSpaceIcon.
We can examine this volume further by taking horizontal and vertical slices, as indicated by the operator icons on the side of the display in Figure 6. For example, Figure 7 illustrates a horizontal slice through a micon corresponding to an excerpt from "Changing Steps," produced by Elliot Caplan and Merce Cunningham, a video interpretation of Cunningham's dance of the same name. (This micon actually does not correspond to a single camera shot. The clip was obtained from the Hierarchical Video Magnifier, discussed in the next subsection.) Note that the slice was taken just above the ankles, so it is possible to trace the movements of the dancers in time through the colored lines created as traces of their leotards.

Figure 7. A micon with a horizontal slice. Each colored line matches the color of a leotard on the lower leg. Horizontal movement of the line in the image exposed by the slice corresponds to horizontal movement of the leg (and usually the dancer). Dancers' movements can be traced by using this exposed spatio-temporal picture as a source of cues.
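In terms of the pixel volume, extracting such views is straightforward (a sketch assuming the clip has been decoded into a NumPy-style array; the function names are ours):

def horizontal_slice(volume, row):
    # volume: the clip as a pixel volume, shaped (frames, height, width, 3).
    # The returned (frames, width, 3) image is the spatio-temporal picture at
    # one scan line: horizontal motion appears as slanted traces over time.
    return volume[:, row, :, :]

def top_face(volume):
    # The top plane of a micon: the first pixel row of every frame.
    return volume[:, 0, :, :]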
Selecting a representative frame for each camera shot in the index strip is an important issue. Currently, we avoid tools that are too computationally intensive. Two approaches involve simple pixel-based properties. An "average frame" is defined in which each pixel has the average of the values at the same grid point in all frames of the shot. Then, the system selects the frame most similar to this average frame as the representative frame.

Another approach involves averaging the histograms of all the frames in a clip and selecting the frame whose histogram is closest to the average histogram as the representative frame. However, neither of these approaches involves semantic properties (although the user can always override decisions made by these methods). We also plan to incorporate camera and object motion information either for selecting a representative frame or for constructing a "salient still"10 instead of a representative frame.
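The second method reduces to a few lines (a sketch only, reusing per-frame histograms such as those computed for segmentation):

import numpy as np

def representative_frame(histograms):
    # histograms: (num_frames, num_bins) array, one histogram per frame of
    # the shot. Returns the index of the frame whose histogram is closest
    # (L1 distance) to the average histogram.
    mean = histograms.mean(axis=0)
    return int(np.abs(histograms - mean).sum(axis=1).argmin())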
`
Hierarchical video magnifier
Sometimes the ability to browse a video in its entirety is more important than examining individual camera shots in detail. We base our approach on the Hierarchical Video Magnifier.11 It is illustrated in Figure 8, which presents an overview of the entire "Changing Steps" video. The original tape of this composition was converted to a QuickTime movie 1,282,602 units long. (There are 600 QuickTime units per second, so this corresponds to a little under 36 minutes.) As the figure shows, dimensions allow for the display of five frames side by side. Therefore, the whole movie is divided into five segments of equal length, each segment represented by the frame at its midpoint.

Figure 8. A hierarchical browser of a full-length video.
As an example from this particular video, the first segment occupies the first 256,520 units of the movie, and its representative frame is at index 128,260. Each segment can then be similarly expanded by dividing it into five portions of equal length, each represented by the midpoint frame. By the time we get to the third level, we are viewing five equally spaced frames from a segment of size 51,304 (approximately 85.5 seconds). Users can continue browsing to greater depth, after which the screen scrolls accordingly.
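The arithmetic is simple enough to sketch directly (illustration only; the function name is ours):

def magnifier_level(start, length, fanout=5):
    # One level of the display: split [start, start + length) into fanout
    # segments of equal length and return (segment_start, segment_length,
    # midpoint) triples; the midpoint frame represents its segment on screen.
    seg = length // fanout
    return [(start + i * seg, seg, start + i * seg + seg // 2)
            for i in range(fanout)]

# For the whole 1,282,602-unit movie, the first segment starts at 0 with
# length 256,520 and midpoint 128,260, matching the numbers in the text.
top_level = magnifier_level(0, 1282602)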
The user can also select any frame on the display for storage. The system will store the entire segment represented by the frame as a separate file, which the user can then examine with the micon viewer. (This is how we created the image in Figure 7.) This approach to browsing is particularly valuable for a source like "Changing Steps," which does not have a well-defined narrative structure. It can serve equally well for material where the narrative structure is not yet understood.

The Hierarchical Video Magnifier is an excellent example of "content-free content analysis." The technique requires no information regarding the content of the video other than its duration. We developed it to exploit the results of automatic segmentation. The segment boundaries determined by simple arithmetic division in the Hierarchical Video Magnifier are then "justified" by being shifted to the nearest camera shot boundary. Thus, at the top levels of the hierarchy, the segments actually correspond to sequences of camera shots, rather than an arbitrary interval of a fixed duration. These camera shot boundaries are honored in the subdivision of