`Multimedia Content Descriptions
`
Seungyup Paek, Ana B. Benitez, and Shih-Fu Chang¹
`
`Image & Advanced TV Lab, Department of Electrical Engineering
`Columbia University, 1312 S.W. Mudd, Mail code 4712 Box F-4
`New York, NY 10027, USA
`
`ABSTRACT
`In this paper, we present the self-describing schemes for interoperable image/video content descriptions, which are being
`developed as part of our proposal to the MPEG-7 standard. MPEG-7 aims to standardize content descriptions for multimedia
`data. The objective of this standard is to facilitate content-focused applications like multimedia searching, filtering, browsing,
`and summarization. To ensure maximum interoperability and flexibility, our descriptions are defined using the eXtensible
`Markup Language (XML), developed by the World Wide Web Consortium. We demonstrate the feasibility and efficiency of
`our self-describing schemes in our MPEG-7 testbed. First, we show how our scheme can accommodate image and video
`descriptions that are generated by a wide variety of systems. Then, we present two systems being developed that are enabled
`and enhanced by the proposed approach for multimedia content descriptions. The first system is an intelligent search engine
`with an associated expressive query interface. The second system is a new version of MetaSEEk, a metasearch system for
`mediation among multiple search engines for audio-visual information.
`Keywords: MPEG-7, self-describing scheme, interoperability, audio-visual content description, visual information system,
`metasearch, XML.
`
`1. INTRODUCTION
Access to digital multimedia information is becoming increasingly easy. Correspondingly, it has become increasingly important to
`develop systems that process, filter, search and organize this information, so that useful knowledge can be derived from the
`exploding mass of information that is becoming accessible. To enable exciting new systems for processing, searching, filtering
`and organizing multimedia information, it has become clear that an interoperable method of describing multimedia content is
`necessary. This is the objective of the emerging MPEG-7 standardization effort.
`In this paper, we first give a brief overview of the objectives of the MPEG-7 standard. MPEG-7 aims at the
`standardization of content descriptions of multimedia data. The objectives of this standard are to facilitate content-focused
`applications like multimedia searching, filtering, browsing, and summarization.
`Then, we present self-describing schemes for interoperable image/video content descriptions, which are being
`developed as part of our proposal to MPEG-7. To ensure maximum interoperability and flexibility, our descriptions use the
`eXtensible Markup Language (XML), developed by the World Wide Web Consortium. Under the proposed self-describing
schemes, an image is represented as a set of relevant objects that are organized in one or more object hierarchies. Similarly, a
video is viewed as a set of relevant events that can be combined hierarchically in one or more event hierarchies. Both objects
and events are described by feature descriptors that can link to external extraction and similarity code.
`Finally, we demonstrate the feasibility and efficiency of our self-describing schemes in our MPEG-7 testbed. In our
`testbed, we will show how our scheme can accommodate image and video descriptions that are generated by a wide variety of
systems. In addition, we introduce two systems being developed, which are enabled and enhanced by our approach for multimedia
content descriptions. The first system is an intelligent search engine with an associated expressive query interface. The
`second system is a metasearch system for mediation among multiple search engines for audio-visual information.
`
1. Email: {syp, ana, sfchang}@ee.columbia.edu; WWW: http://www.ee.columbia.edu/~{syp, ana, sfchang}/
`
`AMERICAN EXPRESS v. METASEARCH
`CBM2014-00001 EXHIBIT 2013-1
`
`
`
`2. MPEG-7 STANDARD AND SCENARIOS
`
`2.1. MPEG-7 standard
`The MPEG-7 standard [14] has the objective of specifying a standard set of descriptors to describe various types of multimedia
information. MPEG-7 will also standardize ways to define other descriptors as well as Description Schemes (DSs) for the
`structure of descriptors and their relationships. This description (i.e. the combination of descriptors and description schemes)
`will be associated with the content itself to allow fast and efficient searching for material of a user's interest. MPEG-7 will also
`standardize a language to specify description schemes, i.e. a Description Definition Language (DDL), and the schemes for
`encoding the descriptions of multimedia content.
2.2. MPEG-7 scenarios
MPEG-7 will improve existing applications and enable completely new ones. We review three of the application scenarios
most strongly affected [3]: distributed processing, exchange, and personalized viewing of multimedia content.
`• Distributed processing
`MPEG-7 will provide the ability to interchange descriptions of audio-visual material independently of any platform, any
`vendor, and any application, which will enable the distributed processing of multimedia content. This standard for
`interoperable content descriptions will mean that data from a variety of sources can be plugged into a variety of distributed
`applications such as multimedia processors, editors, retrieval systems, filtering agents, etc. Some of these applications
`may be provided by third parties, generating a sub-industry of providers of multimedia tools that can work with the
`standard descriptions of the multimedia data.
`The vision of the near future is one in which a user can access various content providers' web sites to download
`content and associated indexing data, obtained by some low-level or high-level processing. The user can then proceed to
`access several tool providers' web sites to download tools (e.g. Java applets) to manipulate the heterogeneous data
`descriptions in particular ways, according to the user's personal interests. An example of such a multimedia tool will be a
video editor. An MPEG-7 compliant video editor will be able to manipulate and process video content from a variety of
`sources if the description associated with each video is MPEG-7 compliant. Each video may come with varying degrees of
`description detail such as camera motion, scene cuts, annotations, and object segmentations.
`• Content exchange
`A second scenario that will greatly benefit from an interoperable content-description standard is the exchange of
`multimedia content among heterogeneous audio-visual databases. MPEG-7 will provide the means to express, exchange,
`translate, and reuse existing descriptions of audio-visual material.
`Currently, TV broadcasters, radio broadcasters, and other content providers manage and store an enormous amount of
`audio-visual material. This material is currently described manually using textual information and proprietary databases.
`Describing audio-visual material is an expensive and time-consuming task, so it is desirable to minimize the re-indexing
`of data that has been processed before.
`Consider a media company that purchases videos from a TV broadcaster. The TV broadcaster has already described
`and indexed the content in their proprietary description scheme. Without an interoperable content description, the
`purchasing company will have to invest manpower to translate manually the description of the broadcaster into their
`proprietary scheme. Interchange of multimedia content descriptions would be possible if all the content providers
`embraced the same scheme and system. As this is unlikely to happen, MPEG-7 proposes to adopt a single industry-wide
`interoperable interchange format that is system and vendor independent.
`• Customized views
`Finally, multimedia players and viewers compliant with the multimedia description standard will provide the users with
innovative capabilities such as multiple views of the data configured by the user. The user could change the display's
`configuration without requiring the data to be downloaded again in a different format from the content broadcaster.
`The ability to capture and transmit semantic and structural annotations of the audio-visual data, made possible by
`MPEG-7, greatly expands the range of possibilities for client-side manipulation of the data for displaying purposes. For
`example, a browsing system can allow users to quickly browse through videos if they receive information about their
`corresponding semantic structure. For example, when modeling a tennis match video, the viewer can choose to view only
`the third game of the second set, all the overhead smashes made by one player, etc.
`
`These examples only hint at the possible uses that creative multimedia-application designers will find for richly
`structured data delivered in a standardized way based on MPEG-7.
`
3. SELF-DESCRIBING SCHEMES
`
`In this section, we present description schemes for interoperable image/video content descriptions. The proposed description
`schemes are self-describing in the sense that they combine the data and the structure of the data in the same format. The
advantages of this type of description are flexibility, easy validation, and efficient exchange.
`
`3.1. eXtensible Markup Language (XML)
`
`SGML (Standard Generalized Markup Language, ISO 8879) is a standard language for defining and using document formats.
`SGML allows documents to be self-describing, i.e. they describe their own grammar by specifying the tag set used in the
`document and the structural relationships that those tags represent. SGML makes it possible to define your own formats for
`your own documents, to handle large and complex documents, and to manage large information repositories. However, full
SGML contains many optional features that are not needed for Web applications and has proven to be too complex for current
vendors of Web browsers.
`
`The World Wide Web Consortium (W3C) has created an SGML Working Group to build a set of specifications to
`make it easy and straightforward to use the beneficial features of SGML on the Web [21]. The goal of the W3C SGML activity
`is to enable the delivery of self-describing data structures of arbitrary depth and complexity to applications that require such
`structures. The first phase of this effort is the specification of a simplified subset of SGML specially designed for Web
`applications. This subset, called XML (Extensible Markup Language), retains the key SGML advantages in a language that is
`designed to be vastly easier to learn, use, and implement than full SGML.
`
`Before describing the image and video DSs, we present some of the core features of XML. Let's start with a simple
`XML element:
`
<image> Hello MPEG-7 world! </image>
`
<image> is the start tag and </image> is the end tag; Hello MPEG-7 world! is the content of the element. What does the image
tag mean? In short, it means anything you want it to mean. XML predefines no tags at all. Rather than relying on a few hundred
predefined tags, XML lets you create the tags you need to describe your data. Users define what is allowed in each document
by providing rules, collectively known as the Document Type Definition (DTD). The DTD states the element types with
their characteristics, the notations, and the entities allowed in the document. Apart from the DTD, XML documents must follow
some basic well-formedness rules. This is the minimum criterion for XML parsers and processors.
`
`Text in XML documents consists of characters. A document's text is divided into character data and markup. In a first
`approximation, markup describes a document's logical structure while character data is the basic content of the document.
`Generally, anything inside a pair of <> angle brackets is markup and anything that is not inside these brackets is character data.
`Start tags and empty tags may optionally contain attributes. An attribute is a name-value pair separated by an equal sign. Work
`is in progress to include binary data in XML tags. Currently, XML allows defining binary entities pointing to binary data (e.g.
`images). They require an associated notation describing the type of resource (e.g. GIF and JPG).
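
The distinction between markup, character data, and attributes, and the notion of well-formedness, can be illustrated with a minimal Python sketch. It uses the standard xml.etree.ElementTree module, which checks well-formedness only and does not validate against a DTD. The format attribute and the mismatched </picture> tag are hypothetical and serve only as illustrations.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The `format` attribute is a hypothetical example of a name-value pair.
good = '<image format="GIF"> Hello MPEG-7 world! </image>'
# The end tag does not match the start tag, so this is not well-formed.
bad = '<image> Hello MPEG-7 world! </picture>'

element = ET.fromstring(good)
print(element.tag)           # markup: the tag name chosen by the document author
print(element.attrib)        # markup: attributes as name-value pairs
print(element.text.strip())  # character data between the start and end tags
print(is_well_formed(good), is_well_formed(bad))
```

A parser that rejects the second string but accepts the first meets the minimum criterion mentioned above; checking the document against its DTD would be a separate, stricter step.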
`
`3.2. Image description scheme
`
`In this section, we present the proposed description scheme for images. To clarify the explanation, we will use the example
`shown in Figure 1. Using this example, we will walk through the image DS expressed in XML. Along the way, we will explain
`the use of various XML elements that are defined for the proposed image DS. The complete set of rules of the tags in the image
and video description schemes is defined in our document type definitions [15]. Another advantage of using XML as the DDL
`is that it provides the capability to import external description schemes' DTDs to incorporate them in one description by using
`namespaces. We will see an example later in this section.
`
`The basic description element of our image description scheme is the object element (<object>). An object element
`represents a region of the image for which some features are available. There are two different types of objects: physical and
`logical objects. Physical objects usually correspond to continuous regions of the image with some descriptors in common
`(semantics, features, etc.) - in other words, real objects in the image. Logical objects are groupings of objects based on some
`high-level semantic relationships (e.g. faces). The object element comprises the concepts of group of objects, objects, and
`regions in the visual literature. The set of all objects identified in an image is included within the object set element
`(<object_set>).
`
`For the image example of Figure 1.a, we have chosen to describe the objects listed below. Each object element has a
`unique identifier within an image description. The identifier is expressed as an attribute of the object element (id). Another
`attribute of the object element (type) distinguishes between physical and logical objects. We have left the content of each
`object element empty to show clearly the overall structure of the image description. Later in the section, we will describe the
`features that can be included within the object element.
`
<object_set>
  <object id="0" type="PHYSICAL"> </object> <!-- Family portrait -->
  <object id="1" type="PHYSICAL"> </object> <!-- Father -->
  <object id="2" type="PHYSICAL"> </object> <!-- Mother -->
  <object id="3" type="LOGICAL"> </object> <!-- Faces -->
  <object id="4" type="PHYSICAL"> </object> <!-- Father's face -->
  <object id="5" type="PHYSICAL"> </object> <!-- Mother's face -->
</object_set>
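
As an illustration of how such an object set might be produced programmatically, the following Python sketch builds the same structure with the standard xml.etree.ElementTree module. The function name build_object_set and the sequential numbering of identifiers are assumptions made for this sketch; the scheme itself only requires that each id be unique within the image description.

```python
import xml.etree.ElementTree as ET

# (type, label) pairs taken from the object set listed above.
OBJECTS = [
    ("PHYSICAL", "Family portrait"),
    ("PHYSICAL", "Father"),
    ("PHYSICAL", "Mother"),
    ("LOGICAL", "Faces"),
    ("PHYSICAL", "Father's face"),
    ("PHYSICAL", "Mother's face"),
]


def build_object_set(objects):
    """Build an <object_set> whose <object> children carry id and type attributes."""
    object_set = ET.Element("object_set")
    for index, (kind, label) in enumerate(objects):
        if kind not in ("PHYSICAL", "LOGICAL"):
            raise ValueError(f"unknown object type: {kind}")
        # Sequential numbering keeps each id unique within the description.
        ET.SubElement(object_set, "object", {"id": str(index), "type": kind})
        # Keep the human-readable label as an XML comment, as in the listing.
        object_set.append(ET.Comment(f" {label} "))
    return object_set


object_set = build_object_set(OBJECTS)
print(ET.tostring(object_set, encoding="unicode"))
```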
`
[Figure 1 panels: a) example image; b) Set of Objects (0 1 2 3 4 5 ...), Physical Object Hierarchy (0 with children 1 and 2;
1 with child 4; 2 with child 5), and Logical Object Hierarchy (3 with children 4 and 5).]
`
`Figure 1: a) Image example. b) High-level description of the image by proposed image description scheme.
`
`The image description scheme is comprised of object elements that are combined hierarchically in one or more object
`hierarchy elements (<object_hierarchy>). The hierarchy is a way to organize the object elements in the object set element.
`Each object hierarchy consists of a tree of object node elements (<object_node>). Each object node points to an object. The
`objects in an image can be organized by their location in the image or by their semantic relationships. These two ways to group
`objects generate two types of hierarchies: physical and logical hierarchies. A physical hierarchy describes the physical location
`of the objects in the image. On the other hand, a logical hierarchy organizes the objects based on a higher level understanding
`of their semantics, similar to semantic clustering.
`
Continuing with the image example in Figure 1.a, two possible hierarchies are shown in Figure 1.b. These hierarchies
are expressed in XML below. The type of hierarchy is included in the object hierarchy element as an attribute (type). Each
object node element has an associated unique identifier in the form of an attribute (id). The object node element references an
object element by using the latter's unique identifier. The reference to the object element is included as an attribute
(object_ref). An object element can also include links back to nodes in the object hierarchy as an attribute (object_node_ref).
`
<object_hierarchy type="PHYSICAL"> <!-- Physical hierarchy -->
  <object_node id="10" object_ref="0"> <!-- Portrait -->
    <object_node id="11" object_ref="1"> <!-- Father -->
      <object_node id="12" object_ref="4"/> <!-- Father's face -->
    </object_node>
    <object_node id="13" object_ref="2"> <!-- Mother -->
      <object_node id="14" object_ref="5"/> <!-- Mother's face -->
    </object_node>
  </object_node>
</object_hierarchy>
<object_hierarchy type="LOGICAL"> <!-- Logical hierarchy: faces in the image -->
  <object_node id="15" object_ref="3"> <!-- Faces -->
    <object_node id="16" object_ref="4"/> <!-- Father's face -->
    <object_node id="17" object_ref="5"/> <!-- Mother's face -->
  </object_node>
</object_hierarchy>
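
The way object nodes point to objects can be sketched in a few lines of Python. The sketch below assumes that the object set and the two hierarchies are wrapped in a single image element, as the scheme describes; the function name outline and the tuple layout of its result are illustrative choices, not part of the description scheme.

```python
import xml.etree.ElementTree as ET

IMAGE_XML = """
<image>
  <object_set>
    <object id="0" type="PHYSICAL"/>
    <object id="1" type="PHYSICAL"/>
    <object id="2" type="PHYSICAL"/>
    <object id="3" type="LOGICAL"/>
    <object id="4" type="PHYSICAL"/>
    <object id="5" type="PHYSICAL"/>
  </object_set>
  <object_hierarchy type="PHYSICAL">
    <object_node id="10" object_ref="0">
      <object_node id="11" object_ref="1">
        <object_node id="12" object_ref="4"/>
      </object_node>
      <object_node id="13" object_ref="2">
        <object_node id="14" object_ref="5"/>
      </object_node>
    </object_node>
  </object_hierarchy>
  <object_hierarchy type="LOGICAL">
    <object_node id="15" object_ref="3">
      <object_node id="16" object_ref="4"/>
      <object_node id="17" object_ref="5"/>
    </object_node>
  </object_hierarchy>
</image>
"""


def outline(node, objects, depth=0):
    """Return (depth, node id, object id, object type) rows in document order."""
    obj = objects[node.get("object_ref")]  # resolve the node's reference
    rows = [(depth, node.get("id"), obj.get("id"), obj.get("type"))]
    for child in node.findall("object_node"):
        rows.extend(outline(child, objects, depth + 1))
    return rows


image = ET.fromstring(IMAGE_XML)
objects = {obj.get("id"): obj for obj in image.find("object_set")}
hierarchies = {
    hierarchy.get("type"): outline(hierarchy.find("object_node"), objects)
    for hierarchy in image.findall("object_hierarchy")
}

for kind, rows in hierarchies.items():
    print(kind)
    for depth, node_id, object_id, object_type in rows:
        print("  " * (depth + 1) + f"node {node_id} -> object {object_id} ({object_type})")
```

Note that objects 4 and 5 appear in both hierarchies: the objects are stored once in the object set, and each hierarchy only references them.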
`
An object set element and one or more object hierarchy elements form the image element (<image>). The image element
symbolizes the image or picture being described.

In our image description scheme, the object element contains the feature elements; they include location, color, texture,
shape, size, motion, time, and annotation elements, among others. Time and motion descriptors only make sense when the
object belongs to a video sequence. The location element contains pointers to the locations of the image. Note that annotations
can be textual, visual or audio. These features can be extracted or assigned automatically or manually. For those features
extracted automatically, the feature descriptors can include links to external extraction and similarity matching code. An example
is included below. This example also shows how external DSs can be imported and combined with ours.
<object id="4" type="PHYSICAL" object_node_ref="12 16"> <!-- Father's face -->
  <color> </color>
  <texture>
    <tamura>
      <tamura_value coarseness="0.01" contrast="0.39" orientation="0.77"/>
      <code type="EXTRACTION" language="JAVA" version="1.2"> <!-- Link to extraction code -->
        <location> <location_site href="ftp://extraction.tamura.java"/> </location>
      </code>
    </tamura>
  </texture>
  <shape> </shape>
  <position> </position>
  <!-- import and use of external annotation DS's DTD -->
  <text_annotation xmlns:extAnDS="http://www.other.ds/annotations.dtd">
    <extAnDS:Class>Face</extAnDS:Class>
  </text_annotation>
</object>
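
A consumer of such a description might read the feature values, the link to the extraction code, and the imported annotation as in the following Python sketch. The element and attribute names follow our reading of the listing above (in particular the orientation attribute); the variable names are illustrative. The namespace prefix extAnDS is mapped to the URI declared in the text_annotation element.

```python
import xml.etree.ElementTree as ET

OBJECT_XML = """
<object id="4" type="PHYSICAL" object_node_ref="12 16">
  <texture>
    <tamura>
      <tamura_value coarseness="0.01" contrast="0.39" orientation="0.77"/>
      <code type="EXTRACTION" language="JAVA" version="1.2">
        <location> <location_site href="ftp://extraction.tamura.java"/> </location>
      </code>
    </tamura>
  </texture>
  <text_annotation xmlns:extAnDS="http://www.other.ds/annotations.dtd">
    <extAnDS:Class>Face</extAnDS:Class>
  </text_annotation>
</object>
"""

# Prefix-to-URI map used to address elements of the imported annotation DS.
NS = {"extAnDS": "http://www.other.ds/annotations.dtd"}

obj = ET.fromstring(OBJECT_XML)

# Feature values are stored as attributes of the descriptor element.
value = obj.find("texture/tamura/tamura_value")
tamura = {name: float(text) for name, text in value.attrib.items()}

# The descriptor carries a link to the code that extracted it.
code = obj.find("texture/tamura/code")
extraction_link = code.find("location/location_site").get("href")

# Elements from the external DS are resolved through their namespace.
label = obj.find("text_annotation/extAnDS:Class", NS).text

# Links back to the nodes that reference this object in the hierarchies.
node_refs = obj.get("object_node_ref").split()

print(tamura, code.get("language"), extraction_link, label, node_refs)
```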
`
In summary, both the object hierarchy and object set elements are part of the image element (<image>). The objects
in the object set are combined hierarchically in one or more object hierarchy elements. For efficient traversal of the image
description, links are provided to traverse from objects in the object set to corresponding object nodes in the object hierarchy
and vice versa. The objects include various feature descriptors that can link to external extraction and similarity matching code.
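
Because the links run in both directions, a description can be checked for consistency before it is exchanged. The following Python sketch is one possible check, not part of the proposed scheme: the function name broken_links and the wording of the reported problems are assumptions. It verifies that every object node references an existing object and that every node listed in an object's object_node_ref attribute points back to that object.

```python
import xml.etree.ElementTree as ET

CONSISTENT = """
<image>
  <object_set>
    <object id="1" type="PHYSICAL" object_node_ref="11"/>
    <object id="3" type="LOGICAL" object_node_ref="15"/>
    <object id="4" type="PHYSICAL" object_node_ref="12 16"/>
  </object_set>
  <object_hierarchy type="PHYSICAL">
    <object_node id="11" object_ref="1">
      <object_node id="12" object_ref="4"/>
    </object_node>
  </object_hierarchy>
  <object_hierarchy type="LOGICAL">
    <object_node id="15" object_ref="3">
      <object_node id="16" object_ref="4"/>
    </object_node>
  </object_hierarchy>
</image>
"""

# Same description with one back-link pointing at the wrong node.
INCONSISTENT = CONSISTENT.replace('object_node_ref="12 16"', 'object_node_ref="12 15"')


def broken_links(image):
    """Return human-readable problems found in the object <-> node references."""
    objects = {obj.get("id"): obj for obj in image.iter("object")}
    nodes = {node.get("id"): node for node in image.iter("object_node")}
    problems = []
    # Forward direction: every node must reference an existing object.
    for node_id, node in nodes.items():
        if node.get("object_ref") not in objects:
            problems.append(f"node {node_id} references a missing object")
    # Backward direction: every listed node must point back to the object.
    for object_id, obj in objects.items():
        for node_id in obj.get("object_node_ref", "").split():
            node = nodes.get(node_id)
            if node is None or node.get("object_ref") != object_id:
                problems.append(
                    f"object {object_id} links to node {node_id}, which does not point back"
                )
    return problems


print(broken_links(ET.fromstring(CONSISTENT)))
print(broken_links(ET.fromstring(INCONSISTENT)))
```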
`
`3.3. Video description scheme
`
In this section, we present the proposed description scheme (DS) for videos. To clarify the explanation, we will use the example
shown in Figure 2. Using this example, we will walk through the video DS expressed in XML. Along the way, we will
explain the use of various XML elements that are defined for the proposed MPEG-7 video DS. The structures of the image
description scheme and the video description scheme are very similar.
`
`The basic description element of our video description scheme is the event element (<event>). An event represents
`one or more shots of the video for which some features are available. We distinguish three different types of events: a shot, a
continuous group of shots, and a discontinuous group of shots. Discontinuous groups of shots will usually be associated
`together based on common features (e.g. background color) or high-level semantic relationships (e.g. actor on screen). The
`event element comprises the concepts of story, scene, and shot in the visual literature. The set of all events identified in a video
`is included within the event set element (<event_set>).
`
For the video example of Figure 2.a, we have chosen to describe the events listed below. Each event element has a
`unique identifier within a video description. The identifier is expressed as an attribute of the event element (id). Another
`attribute of the event element (type) distinguishes between the three different types of events. We have left each event element
`empty to show clearly the overall structure of the video description. Later in the section, we will describe the features that can
`be included within the event element.
`
<event_set>
  <event id="0" type="SHOT"> </event> <!-- The tiger -->
  <event id="1" type="SHOT"> </event> <!-- Stalking the prey -->
  <event id="2" type="SHOT"> </event> <!-- Chase -->
  <event id="3" type="SHOT"> </event> <!-- Capture -->
  <event id="4" type="CONTINUOUS_GROUP_SHOTS"> </event> <!-- Feeding -->
  <event id="5" type="SHOT"> </event> <!-- Hiding the food -->
  <event id="6" type="SHOT"> </event> <!-- Feeding the young -->
</event_set>
`
[Figure 2 panels: a) video timeline (time marks 0:00, 0:03, 0:09, 0:12, 0:17) showing The tiger [event 0], Stalking the prey
[event 1], Chase [event 2], Capture [event 3], Feeding [event 4], Hiding the food [event 5], and Feeding the young [event 6];
b) Set of Events (0 1 2 3 4 5 6 ...) and Physical Event Hierarchy (0 with children 1, 2, 3, and 4; 4 with children 5 and 6).]
`Figure 2: a) Video example. b) High-level description of the video by proposed video description scheme.
`The video description scheme is comprised of event elements that are combined hierarchically in one or more event
`hierarchy elements (<event_hierarchy>). The hierarchy is a way to organize the event elements in the event set element. Each
event hierarchy consists of a tree of event node elements (<event_node>). Each event node points to an event. The events in a
video can be organized by their location in the video or by their semantic relationships. These two ways to group events
`generate two types of hierarchies: physical and logical hierarchies. A physical hierarchy describes the time composition of the
`events in the video. On the other hand, a logical hierarchy organizes the events based on a higher level understanding of their
`semantics, similar to semantic clustering.
`Continuing with the video example in Figure 2.a, one possible hierarchy is shown in Figure 2.b. The corresponding
XML is below. The type of hierarchy is included in the event hierarchy element as an attribute (type). The event node element
has an associated unique identifier as an attribute (id). The event node element references an event element by using the latter's
`unique identifier. The reference to the event element is included as an attribute (event_ref). An event element can include links
`back to nodes in the event hierarchy to jump between events and event nodes in both directions (event_node_ref).
<event_hierarchy type="PHYSICAL">
  <event_node id="10" event_ref="0"> <!-- The Tiger -->
    <event_node id="11" event_ref="1"/> <!-- Stalking the prey -->
    <event_node id="12" event_ref="2"/> <!-- Chase -->
    <event_node id="13" event_ref="3"/> <!-- Capture -->
    <event_node id="14" event_ref="4"> <!-- Feeding -->
      <event_node id="15" event_ref="5"/> <!-- Hiding the food -->
      <event_node id="16" event_ref="6"/> <!-- Feeding the young -->
    </event_node>
  </event_node>
</event_hierarchy>
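
A browsing application could use the physical event hierarchy to find the shots that make up a given event, for example to play only the Feeding segment. The Python sketch below assumes the hierarchy is wrapped in a video element together with its event set (omitted here for brevity); the function names leaf_events and shots_under are illustrative and not part of the description scheme.

```python
import xml.etree.ElementTree as ET

VIDEO_XML = """
<video>
  <event_hierarchy type="PHYSICAL">
    <event_node id="10" event_ref="0">
      <event_node id="11" event_ref="1"/>
      <event_node id="12" event_ref="2"/>
      <event_node id="13" event_ref="3"/>
      <event_node id="14" event_ref="4">
        <event_node id="15" event_ref="5"/>
        <event_node id="16" event_ref="6"/>
      </event_node>
    </event_node>
  </event_hierarchy>
</video>
"""


def leaf_events(node):
    """Return the event ids at the leaves below `node`, in document order."""
    children = node.findall("event_node")
    if not children:
        return [node.get("event_ref")]
    leaves = []
    for child in children:
        leaves.extend(leaf_events(child))
    return leaves


def shots_under(video, event_id):
    """Return the leaf events that compose the event with the given id."""
    hierarchy = video.find("event_hierarchy")
    for node in hierarchy.iter("event_node"):
        if node.get("event_ref") == event_id:
            return leaf_events(node)
    raise KeyError(f"event {event_id} is not referenced by the hierarchy")


video = ET.fromstring(VIDEO_XML)
print(shots_under(video, "4"))  # shots of the Feeding event
print(shots_under(video, "0"))  # shots of the whole video
```

Because a physical hierarchy describes the time composition of the events, the document order of the leaves is taken here to correspond to their temporal order.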
`
An event set element and one or more event hierarchy elements form the video element (<video>). The video element
`symbolizes the video sequence being described.
`In our video description scheme, the event element contains the feature elements; they include location, shot
`transition (i.e. various within shot or across shot special effects), camera motion, time, key frame, annotation and object set
`elements, among others. The object element is defined in the image description scheme; it represents the relevant objects in the
`event. As in the image DS, these features can be extracted or assigned automatically or manually. For those features extracted
`automatically, the feature descriptors can include links to extraction and similarity matching code. For example,
<event id="3" type="PHYSICAL" event_node_ref="10"> <!-- Capture -->
  <object_set> </object_set>
  <camera_motion>
    <background_affine_model>
      <background_affine_motion_value>
        <panning direction="NE"/>
        <zoom direction="IN"/>
      </background_affine_motion_value>
      <code type="DISTANCE" language="JAVA" version="1.0"> <!-- Link to similarity matching code -->
        <location> <location_site href="ftp://dist.background.affine"/> </location>
      </code>
    </background_affine_model>
  </camera_motion>
  <time> </time>
</event>
In summary, both the event hierarchy and event set elements are part of the video element (<video>). The event
elements in the event set element are combined hierarchically in one or more event hierarchy elements. For efficient
traversal of the video description, links are provided to traverse from events in the event set to corresponding event nodes in
the event hierarchy and vice versa. The events include various feature descriptors that can link to extraction and similarity
matching code.
`
4. MPEG-7 TESTBED
The proposed self-describing schemes are intuitive, flexible, and efficient. We demonstrate the feasibility of our self-describing
schemes in our MPEG-7 testbed. In our testbed, we are using the self-describing schemes for descriptions of images and
`videos that are generated by a wide variety of systems we have developed. We are developing two systems that are enabled and
`enhanced by our approach for multimedia content descriptions. The first system is an intelligent search engine with an
`associated expressive query interface. The second system is a new version of MetaSEEk, a metasearch system for mediation
`among multiple search engines for audio-visual information.
`4.1. Description generator
In our MPEG-7 testbed, we are using various image/video processing, analysis, and annotation systems to generate a rich variety
of descriptions for a collection of image/video items, as shown in Figure 3. The descriptions that we generate for visual
content include low-level visual features of automatically segmented regions, user defined semantic objects, high-level scene
properties, classifications, and associated textual information. We are also including descriptions that are generated by our
collaborators. As described in section 3.1, we are using XML as the DDL for the descriptions. The descriptions have the structure
of the image/video description schemes (DSs) described in sections 3.2 and 3.3. The DS and DDL are designed to accommodate
descriptions generated by a wide variety of heterogeneous systems.
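
Since all systems emit descriptions with the same structure, combining them can be largely mechanical. The Python sketch below shows one naive way to merge the object sets of several image descriptions; it is an assumption made for illustration, not the integration procedure of the testbed. The function name merge_object_sets, the source attribute, and the system labels videoq and face_detector are hypothetical, and hierarchies are ignored for brevity.

```python
import copy
import xml.etree.ElementTree as ET


def merge_object_sets(descriptions):
    """Combine the objects of several <image> descriptions into one object set.

    `descriptions` maps a label for the producing system to its XML text.
    Ids are renumbered so that they stay unique within the merged description.
    """
    merged = ET.Element("image")
    merged_set = ET.SubElement(merged, "object_set")
    next_id = 0
    for source, xml_text in descriptions.items():
        image = ET.fromstring(xml_text)
        for obj in image.iter("object"):
            clone = copy.deepcopy(obj)      # keep the feature elements intact
            clone.set("id", str(next_id))   # renumber to avoid id collisions
            clone.set("source", source)     # hypothetical attribute, not in the DS
            merged_set.append(clone)
            next_id += 1
    return merged


descriptions = {
    "videoq": '<image><object_set><object id="0" type="PHYSICAL"/>'
              '<object id="1" type="PHYSICAL"/></object_set></image>',
    "face_detector": '<image><object_set><object id="0" type="PHYSICAL"/>'
                     '</object_set></image>',
}
merged = merge_object_sets(descriptions)
print(ET.tostring(merged, encoding="unicode"))
```

A fuller integration step would also have to rewrite the references held by the hierarchies and decide whether regions reported by different systems describe the same object.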
Once all the descriptions for an image/video item are generated, the descriptions are loaded into a database, which
`the search engine accesses. We shall describe now the systems used to generate the descriptions.
`• VideoQ: Region-based indexing and searching system.
This system extracts visual features such as color, texture, motion, shape, and size for automatically segmented regions of
a video sequence [5]. The system first decomposes a video into separate shots by means of scene change detection [13].
Scene changes may be either abrupt or transitional (e.g. dissolve, fade in/out, and wipe). For each shot, the system
estimates the global motion (i.e. the motion of the dominant background) and the camera motion. Then, it segments, detects,
and tracks regions across the frames in the shot, computing different visual features for each region. For each shot, the
description generated by this system is a set of regions with visual and motion features, and the camera motion. Some keywords,
assigned manually, are also available for each shot.
`• AMOS: Video object segmentation system.
`
`Currently, fully automatic segmentation of semantic objects is only successful in constrained visual domains. The AMOS
system [23] adopts an approach in which automatic segmentation is integrated with user input to track semantic
`objects in video sequences. For general video sources, the system allows users to define an approximate object boundary
`by using a tracing interface. Given the approximate object boundary, the system automatically refines the boundary and
`tracks the movement of the object in subsequent frames of the video. The system is robust enough to handle many real-
`world situations that are hard to model in existing approaches, including complex objects, fast and intermittent motion,
`complicated backgrounds, multiple moving objects, and partial occlusion. The description generated by this system is a
`set of semantic objects with the associated regions and features that can be manually annotated with text.
`• MPEG domain face detection system.
`This system efficiently and automatically detects faces directly in the MPEG compressed domain [20]. The human face is
`an important subject in video. It is ubiquitous in news, documentaries, movies, etc., providing key information to the
`viewer for the understanding of the video content. This system provides a set of regions with face labels.
`• WebClip: Hierarchical video browsing system.
`
This system parses compressed MPEG video streams to extract shot boundaries, moving objects, object features, and
`camera motion [12]. It also generates a hierarchical shot-based browsing interface for intuitive visualization and editing of
`videos.
`
[Figure 3 diagram: image/video content is processed by VideoQ video object feature extraction, MPEG-domain face detection,
AMOS object segmentation, WebClip hierarchical video browsing, manual text annotations, In Lumine scene classification, and
the Visual Apprentice model based classification system, each producing an MPEG-7 description; descriptions generated by
collaborators are added. The integration of MPEG-7 descriptions feeds the image/video database, which is accessed by the
image/video search engine and, through it, by the query interface and image/video browser.]
`Figure 3: Architecture for combining descriptions generated by heterogeneous systems.
`• Visual Apprentice: Model based image classification system.
`
Many automatic image classification systems are based on a pre-defined set of classes in which class-specific algorithms
are used to perform classification. The Visual Apprentice [10] allows users to define their own classes and provide examples
that are used to automatically learn visual models. The visual models are based on automatically segmented regions,
their associated visual features, and their spatial relationships. For example, the user may build a visual model of a portrait
in which one person wearing a blue suit is seated on a brown sofa, and a second person is standing to the right of the
seated person. The system uses a combination of lazy-learning, decision trees, and evolution programs during classification.
The description generated by this system is a set of text annotations, i.e. the user defined classes, for each image.
• In Lumine: Scene classification system.
`The In Lumine system [16] is a method for high-level semantic classification of images and video shots based on low-level
`visual features. The core of the system consists of various machine learning techniques such as rule induction, clustering,
`and nearest neighbor classification. The system is being used to classify images and video scenes into high-level semantic
`scene classes such as {nature landscape}, {city/suburb}, {indoor}, and {outdoor} . The