`Multimedia Tools and Applications
`
`KL557-03-Golshani
`
`February 18, 1998
`
`11:1
`
`Multimedia Tools and Applications 6, 289–312 (1998)
© 1998 Kluwer Academic Publishers. Manufactured in The Netherlands.
`
`A Language for Content-Based Video Retrieval
`
`FOROUZAN GOLSHANI
`Department of Computer Science, Arizona State University, Tempe, Arizona 85287-5406
`
`golshani@asu.edu
`
`NEVENKA DIMITROVA
`Philips Research, 345 Scarborough Rd., Briarcliff Manor, NY 10510
`
`nvd@philabs.research.philips.com
`
`Abstract. We present an effective technique for automatic extraction, representation, and classification of digital
`video, and a visual language for formulation of queries to access the semantic information contained in digital
`video. We have devised an algorithm that extracts motion information from a video sequence. This algorithm
`provides a low-cost extension to the motion compensation component of the MPEG compression algorithm. In
`this paper, we present a visual language called VEVA for querying multimedia information in general, and video
`semantic information in particular. Unlike many other proposals that concentrate on browsing the data, VEVA
`offers a complete set of capabilities for specifying relationships between the image components and formulating
`queries that search for objects, their motions and their other associated characteristics. VEVA has been shown to
`be very expressive in this context mainly due to the fact that many types of multimedia information are inherently
`visual in nature.
`
`Keywords: visual languages, content-based retrieval, digital video
`
`1.
`
`Introduction
`
`It is generally believed that the human mind is visually oriented. For those concepts that can
`be presented visually, people acquire information at a higher rate by discovering graphical
relationships in complex pictures rather than in plain text. Analogous to this human characteristic, graphical/visual interfaces attempt to augment the traditional human-computer interaction mechanisms with visual aids [25].
`Human and computer vision refer to the ability of seeing an image and understanding
`its details—generally known as image contents. This ability is often summarized as: to
`observe and to evaluate. Visual information is what humans can extract and understand
`from images and video. Making inferences is much easier in visual systems. For example,
`given two objects—one large and one small—a visual system can immediately recognize
the disparity in size. Such an inference would be much more difficult to make using text annotations.
`Spatial and motion characteristics of objects derived from images and video sequences
are inherently visual. This visual aspect behooves us to find the best-suited visual paradigms
`for content retrieval of images and video sequences.
`In a previous paper, we have presented a method for extracting object motion during
`video encoding by the MPEG method. Since a large part of the low level motion analysis
`is already performed by the video encoder for the encoding of the blocks in the predicted
`
`AVIGILON EX. 2009
`IPR2019-00314
`Page 1 of 24
`
`
`
`
`and bidirectional frames, our algorithms perform very well in creating object trajectories
`based on these coarse-grained representations of the optical flow. Simply put, we extract
`macroblock trajectories which are spatiotemporal representations of macroblock motion
`and use them for object motion recovery. By describing the movements that we derive
`from the process of motion analysis, we introduce a dual hierarchy consisting of spatial and
`temporal parts for video sequence representation. This gives us the flexibility to examine
`arbitrary frames at various levels of abstraction, and to retrieve the associated temporal
`information (say, object trajectories) in addition to the spatial representation. See [12] for
`details.
`Motion in a video segment may be a result of a number of different phenomena. Our work
`does not cover all types. For example, camera motion cannot be separated from object mo-
`tion. Basic camera movements include: vertical rotation (called tilting), vertical transverse
`movement (or booming), horizontal rotation (known as panning), horizontal transverse
`movement (called tracking), variations in focus distance (or zooming), and horizontal lat-
`eral movement (commonly known as dollying). These six along with no movement, i.e.,
`fixed, constitute the seven basic camera operations. Obviously, camera operations are the
`cause of significant characteristics of the video data and should be modeled properly. Optical
`flow techniques for motion extraction work much better for detecting this kind of change.
`See Akutsu et al. [1] for more detail.
`There are certain other types of motion that present a problem for our schemes. Our
`algorithms are primarily geared to recognizing motion in 2D, i.e., the (x, y) direction.
`When an object is moving directly toward the camera, similar to the zooming operation of
`the camera, we cannot distinguish and accurately formulate the motion. Again, to a certain
`degree, schemes based on optical flow analysis perform better in this regard.
`In this paper, we outline the design of a multimedia database language which has well
`defined semantics in both the icon-based and character-based paradigms. The reason for
`supporting both paradigms is to have the best of both worlds: intuitive visual description
`of visual languages and the scalability of the character based languages. This paper first
surveys the visual metaphors for content-based retrieval of digital video in Section 2. The video information model is then presented in Section 3. In Section 4, we present the
`formal foundation of the language Visual Extension to VArqa (VEVA) which embodies
`the representation of the visual model into a common iconic and character-based language.
`The grammar of a language based on VEVA is given in Section 4.1. The execution model
`is described in Section 4.2. Section 4.3 illustrates the usage of the VEVA language through
`example queries and results. Some final observations are presented in the conclusions
`section.
`
`2. Visual metaphors for content retrieval
`
The basic concept of visual interaction is the visual metaphor. The quality of a visual metaphor, in terms of understandability, clarity, and effectiveness (i.e., having intuitive connotations with the concept it represents), is very important for raising the level of communication and increasing
`the potential number of users. A visual metaphor may be characterized as some symbol that
`would remind the user of an object or concept. Such a symbol may be a part of the object
`
`
`or an exaggeration of one of its significant attributes. In the case of an iconic language, the
`visual metaphor is the icon. An icon is usually a simplified image of the function of the
`system or an entity, and is used for manipulation of the system. The user recognizes the ob-
`jectives from the icon images and their spatial arrangements. In a flexible setting, the shape
`and other characteristics of the icons are tailored by the user to suit her/his own mental
`representation of the tasks to be performed. The specification of icons is a whole new field
`of research.
`Visual languages are based on the direct manipulation of graphical objects. Several
`languages cater to graphical description of a wide variety of applications, and allow the
`specification of complex computing environments. We will discuss a few examples. Media
`Streams is a visual language that enables users to create multilayered, iconic annotations of
`video content [10]. The objects denoted by icons are organized into hierarchies. The icons
are used to annotate video streams in what the author calls a “Media Time Line”. While appealing for its simplicity and abundance of iconic expressions, Media Streams uses a fixed vocabulary for the video annotation process.
`A system proposed by Little et al. supports content based retrieval and playback [20].
`A specific schema is composed of movie, scene and actor relations with a fixed set of
`attributes. The system requires manual feature extraction. The features are then inserted
`into the schema. Querying may involve reference to the attributes of movie, scene or actor.
`Once a movie is selected, the user can browse from scene to scene beginning with the initial
`selection. Querying in this system is achieved by browsing the predetermined attributes
`of the movies. Browsing as a visual interaction metaphor in digital video is useful for
`applications which have a predefined set of attributes and where users are not expected to
`learn a new interaction language.
`An algebraic approach to content-based access to video is presented in [30]. Video
`presentations are composed of video segments using a Video Algebraic Model. The algebra
`contains methods for combining video segments both temporally and spatially, as well as
`methods for navigation and querying. This model leads to a framework for efficient access
`and management of video collections. However, the search process is based on the attribute
`information of the algebraic video nodes which are textually represented by human readable,
`semi-structured, algebraic video files. This approach ties together the attribute-based access
`and content browsing of the video nodes.
`One premise of our work is that visual languages and browsing are extremely impor-
`tant for interactive capabilities in digital video applications. The language presented here,
`VEVA, has much in common with the above visual languages. It is an iconic language
`which has a formal mathematical model. Video sequences can be queried based on the
`motion as well as spatial aspects of the objects. The iconic representations of all the con-
`cepts can be queried based on their temporal appearance in a way similar to that in Media
`Streams. The language closely follows a model for motion recovery and representation
`of objects in digital video [12]. The advantage of VEVA is that it is based on the same
`model which is used for extraction of semantics of video sequences. As part of the lan-
`guage, a suite of operators can be used for automatic feature extraction and matching of
`objects.
`
`
`3. Video information model and language
`
`3.1. The model
`
`In this section we present a formal foundation for our video information model. It is based
`on an algebraic framework [13, 31] which has been used in many other areas including
`programming languages, software and hardware specifications, and object oriented mod-
`eling. Algebraic specifications are developed in the following manner. Given an alphabet
`consisting of several classes of symbols for types and their associated operators (functions),
`a schema is specified. (Formally, the schema corresponds closely to the notion of signature
`in the algebraic framework.) The schema has all the necessary syntactic information along
`with typing rules, i.e., the rules that determine what type of object(s) can be given to each
`operator and what type of object(s) it will return. In our system, the domain-dependent infor-
`mation, provided by the system developer, is combined with the application-independent
`constructs, provided by the system itself, in order to create the schema. In essence, the
`schema is a formal specification of all objects of interest, real or conceptual, and the re-
`lationships between them. Given a schema, the set of well-formed expressions is defined.
`These expressions are constructed by using the numerous powerful operators that are pro-
`vided for each type. We will see that, despite a formalized underpinning, the language
`presented here is extremely simple. Strongly resembling conventional set theory, our lan-
`guage has a functional flavor, similar to Lucid [29] and others. In Section 3.2 we will
expand the treatment of the video data type. This is based on our work on the specification of
`object recognition in images and motion recovery in video [11].
`The basic constructs of the model are data types and functions that operate on the data
`types. Two main kinds of data types that are used in this discussion are:
• System-defined, fixed data types, called “deliverable types”, are: string, integer, boolean,
`text, image, audio and video. These are the application-independent constituents of the
`system and are present in any specification. The traditional data types integer, string and
`boolean are known as printable objects. Analogously, we call audio and video deliverable
`(presentable) types, since they inherently contain a time component.
• User-defined data types, called “entity types”, such as “PERSON”, “STUDENT”, are
`those that represent objects or concepts in the real world. These, generally, have properties
`that are embodied in deliverable types.
`
`Each data type has a number of operators associated with it. Operators that are associated
`with user-defined types are called user-defined functions. User-defined functions describe
`the domain related relationships, cross references between entity types, and the attributes
`of objects. Cross references typically represent multivalued relationships. Function types
`have the general form of:
f : α0 × ··· × αn−1 → αn

where every αi, called a type expression, is inductively defined as:
`
`
`— a data type
— α1 ∪ α2, α1 × α2, or P(α1), where α1 and α2 are type expressions.
`
Note that the above definition allows functions to take as arguments object types, their unions, Cartesian products, and powersets.
`In addition to the user-defined functions, we need a collection of operators that are in-
`dependent of the application domain, and as such operate on system defined data types.
`Each data type, such as text, graphics, scanned images, audio, and video, has a rich se-
lection of operators associated with it. For example, the operator appendPar performs concatenation of text paragraphs for the type Text. All set-theoretical, boolean,
`and arithmetic operators are included. In addition, there are a number of variable binding
`operators. The main characteristic of these operators is that they cause a variable to range
`over the elements of its domain. An example is the set construction operator that has the
form { f(x) | P(x) }, where f(x) denotes the desired output objects, and P(x) denotes the retrieval predicate that must hold for those objects. While x ranges over its domain, whenever P(x) is satisfied, f(x) is added to the set. There are a number of other operators
`like this, including the logical quantifiers. We will see some examples when we introduce
`video operators.
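The set construction operator corresponds directly to a set comprehension in a modern programming language. As a minimal sketch (the domain, predicate, and output function below are invented for the illustration):

```python
# Sketch of the set construction operator { f(x) | P(x) }:
# x ranges over its domain; whenever P(x) holds, f(x) is added to the set.

def construct_set(domain, f, p):
    """Return { f(x) | x in domain and P(x) }."""
    return {f(x) for x in domain if p(x)}

# Hypothetical example: squares of the even numbers below 10.
result = construct_set(range(10), lambda x: x * x, lambda x: x % 2 == 0)
print(sorted(result))  # [0, 4, 16, 36, 64]
```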
`A list of basic multimedia operators and their semantics are presented in [15]. These
`operators are categorized into: set operators, logical operators, operators for temporal syn-
`chronization and spatial composition, arithmetic operations, and media specific operators
`for text, graphics, audio, image, and video manipulation. The basic data types and their as-
`sociated operators are part of the schema (signature) of the multimedia information system.
`The syntax of the language is developed on the basis of this schema.
`In the algebraic framework, the algebra associates with each type a set of objects that
`behave as mandated by the specification of that type. Thus, the set associated with the
`type Integer is, obviously, the set of integer numbers. Other data types denote objects that
`have the predefined properties. Objects of type Text are paragraphs. Data type Audio
`denotes signals of one dimension, and can be thought of as a function of time. The type
`Image contains signals of two dimensions. Finally, Video is a signal of three dimensions
F(x, y, i), represented as F_i(x, y), where i represents the frame counter, and x and y are pixel coordinates. When no confusion is expected, we may omit the references to the coordinates x and y or to the frame counter i. When needed, we will use superscripts to distinguish between different video streams, e.g., F¹_i(x, y) and F²_i(x, y). The following notation will be used:
`
— F_b(x, y) for the first frame of the video signal
— F_e(x, y) for the last frame of the video signal
— F_c(x, y) for the current frame of the video signal
`
The algebra allows partial functions. Whenever a function is not defined over a certain part of its domain, we use the symbol µ to denote the undefined value.
`In this paper we focus specifically on the Video data type. In the next sections we describe
`operators for the video data type in detail.
`
`
`3.2. Physical video data type
`
`A video data stream carries much more complex information than any other type of media.
Operators on the video stream range from operators for delivery and editing of video streams to operators for extracting motion for video classification. In this section,
`we will present video operators for editing and delivery. We will present the operators for
`motion based content classification in Section 3.3.
`
`3.2.1. Editing operators. The primitive operators for video editing are analogous to the
list processing operators. For list processing, first head, tail, and append (car, cdr, ... in Lisp) are defined, and then, based on these, more elaborate operators are introduced. A
`similar definition is followed for type video. The primitive video operators that are defined
`first are:
— ↓ for obtaining the first frame of a video sequence. Thus F(x, y)↓ = F_b(x, y).
— ↑ for obtaining the video sequence without its first frame.
`
`Based on the above, we can define such operators as:
— ↓a returns the first portion of the video sequence up to frame number a. We write F↓a to indicate the first a frames of F.
— ↑a returns the last portion of the video sequence starting with frame number a. Thus F↑a denotes the last portion of F starting with frame number a.
— ∥ appends one video sequence to the end of another (for concatenation of two video sequences).
`
`Many editing tasks can now be introduced by means of more elaborate operators, includ-
`ing the following:
• Inserting a video stream into another video stream:

  v_insert : Video × Video × Integer → Video

  where:

  v_insert(F¹, F², a) =
      F¹↓a ∥ F² ∥ F¹↑a   if a ≥ 0 and F¹ has ≥ a frames
      µ                  otherwise
`
• Extract a video clip from a video stream:

  v_clip : Video × Integer × Integer → Video

  where

  v_clip(F, a1, a2) =
      F − F↓a1 − F↑a2   if a1 ≤ a2 and F has ≥ a2 frames
      µ                 otherwise
`
`
• Video cut extraction:

  cuts : Video → P(int)

  cuts(F) = { i | difference(F_{i−1}, F_i) > threshold }
`
The operator cuts takes an input video stream and returns a set of frame numbers which correspond to drastic scene changes, i.e., video cuts. Determining the difference between consecutive frames F_{i−1} and F_i for cut detection is not a straightforward task because of camera operations and sophisticated video effects. Cut detection is important in the initial stages
`of video processing, since shot boundaries have to be identified to extract meaningful
`information within a shot [16, 22, 24].
• Extract a set of motion icons from a video stream:

  micons : Video → P(Image)

  where

  micons(F) = { F_i | i ∈ cuts(F) }
`
`This operator extracts frames in a video sequence that correspond to video cut changes.
`A micon is a visual representation of the most representative scenes of a video stream.
• Extract a still image from a video sequence at a given frame number:

  still : Video × Integer → Image
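Reading a video as a plain list of frames, the editing operators above can be sketched as ordinary list manipulations. This is an illustrative rendering of the definitions, not the authors' implementation: integers stand in for frames, None stands in for the undefined value µ, and the frame-difference function is left as a parameter.

```python
# Sketch: a video as a list of frames; None stands for the undefined value (µ).

def take(f, a):                     # F↓a: the first a frames of F
    return f[:a]

def drop(f, a):                     # F↑a: F starting with frame number a
    return f[a:]

def v_insert(f1, f2, a):            # insert f2 into f1 at frame a
    if a >= 0 and len(f1) >= a:
        return take(f1, a) + f2 + drop(f1, a)
    return None                     # undefined (µ)

def v_clip(f, a1, a2):              # remove the first a1 frames and frames a2 onward
    if a1 <= a2 and len(f) >= a2:
        return f[a1:a2]
    return None                     # undefined (µ)

def cuts(f, difference, threshold):
    """Frame numbers where consecutive frames differ drastically (video cuts)."""
    return {i for i in range(1, len(f)) if difference(f[i - 1], f[i]) > threshold}

def micons(f, difference, threshold):
    """Representative frames at the detected cuts."""
    return [f[i] for i in sorted(cuts(f, difference, threshold))]

# Hypothetical usage with integer "frames":
video = [10, 11, 12, 50, 51, 90]
print(v_insert([1, 2, 3], [8, 9], 1))                   # [1, 8, 9, 2, 3]
print(v_clip(video, 1, 4))                              # [11, 12, 50]
print(sorted(cuts(video, lambda a, b: abs(a - b), 5)))  # [3, 5]
```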
`
3.2.2. Video delivery operators. Delivery operators take a video stream and present it at the user’s request at retrieval time. These operators do not change the contents of the input
`video stream.
Consider the video stream F_i(x, y). As before, we use F_b(x, y), F_e(x, y), and F_c(x, y)
`to denote the beginning frame, the ending frame, and the current frame of the video stream,
`respectively.
The two primary delivery (playback) operators, play and reverse, are defined as follows:

  play = F_{i=c}^{e} F_i(x, y)

  reverse = G_{i=c}^{b} F_i(x, y)

The video operators G and F are similar in nature to operators that force variables to range over their domain. Analogous to the integral sign ∫_a^b f(x) dx, which causes x to range over the interval [a, b], F and G force the frame counter to range from the current frame c to the end or to the beginning of the video stream, respectively.
`
`
`3.2.3. Video attribute operators. Finally, a host of operators for querying the physical
attributes of video are included. For example, given a video clip, v_length returns a number representing the length of the sequence:

  v_length : Video → Integer
`
Similar operators are introduced for querying other attributes of the physical video. The expected playback frame rate is given by the operator:

  frame_rate : Video → Integer

The following operator gives the physical storage size of the video:

  size : Video → Integer

Another piece of format-related information is the resolution of the individual frames:

  resolution : Video → String

The following operator returns the compression scheme used for storing the video data:

  compression : Video → String
`
`Figure 1 contains a complete description of the operators that apply to the physical video
`type. Double lines indicate set-valued operators.
`
`Figure 1. Physical video data type.
`
`
`Figure 2. Conceptual video data type.
`
`3.3. Spatiotemporal operators of the video data type
`
`The purpose of object analysis and motion analysis is to extract relevant properties of objects
`and their movements in order to represent the concepts emerging from the video sequences.
`In this section we bring together both object and motion aspects of video content analysis
`into a unique representation of the conceptual video data type.
`Figure 2 illustrates the operators that apply to the conceptual video type.
`Motion analysis starts with the motion vector recovery followed by tracing of individual
`macroblock trajectories. Conceptually, this process is captured by the operator:
  extractedTrajectories : Video → Trajectory
`
`The data type Trajectory consists of the representation of the motion path of objects
`in a video sequence. Each trajectory can be thought of as an n-tuple of motion vectors.
`This trajectory representation is a basis for various other trajectory representations, such
`as curve representation and point representation. The diversity of the trajectory represen-
`tations makes the querying process more flexible. Matching operators used for motion
`retrieval depend on the method employed for trajectory representation. Examples include:
`exact matching function that uses absolute frame coordinates; exact matching function that
`uses relative coordinates; curve comparison based on the curve fitting approach used for
`interpolated trajectory representation; approximate matching that uses chain code represen-
`tation; and qualitative matching that uses differential chain code. The result in each case
`is a similarity factor between the input trajectory and a target trajectory in the set of object
`trajectories.
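As one concrete reading of the qualitative matching mentioned above, a trajectory given as a point sequence can be reduced to an 8-direction (Freeman) chain code, and two trajectories compared by the fraction of agreeing differential codes. The encoding and similarity measure here are plausible stand-ins, not the paper's exact algorithms:

```python
import math

def chain_code(points):
    """Reduce a trajectory (list of (x, y) points) to 8-direction chain codes."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0)      # direction of this motion vector
        codes.append(round(angle / (math.pi / 4)) % 8)
    return codes

def differential(codes):
    """Differential chain code: direction changes, invariant to 45-degree rotations."""
    return [(b - a) % 8 for a, b in zip(codes, codes[1:])]

def similarity(t1, t2):
    """Fraction of positions where the differential codes agree (0.0 .. 1.0)."""
    d1, d2 = differential(chain_code(t1)), differential(chain_code(t2))
    n = min(len(d1), len(d2))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(d1, d2)) / n

straight = [(0, 0), (1, 0), (2, 0), (3, 0)]
turning  = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(similarity(straight, straight))   # 1.0
print(similarity(straight, turning))    # 0.0
```

Because the differential code records only direction *changes*, a straight trajectory matches any other straight trajectory regardless of its absolute orientation, which is the sense in which this matching is qualitative.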
`
`
`The operators still and micons were described in Section 3.2. The operator features has
`the following signature:
  features : Image → featureMatrix
`
`This operator takes an image and derives a set of features indexed into a feature matrix.
The operator leadsTo has the signature:

  leadsTo : featureMatrix → Object

It takes the input feature matrix and performs a classification of the input features into a predefined set of object categories.
`The process of feature selection and feature mapping is discussed extensively in the
`computer vision literature [2, 27]. The actual process of feature selection depends on the
`specific domain implementation.
`The operator identifiedObjects is abstracted as:
  identifiedObjects : Video → Object
where Object is the union of all user-defined data types in a particular schema. In addition, Object may contain other objects of interest. The operator identifiedObjects is a
`representation of a composition of a series of other operators such as: still, features, and
`leadsTo.
`Using object descriptions and their traversed trajectories we infer activities:
  activeObjects : Object × Trajectory → Activity
`Description of activities is derived from previously computed motion features. For ex-
`ample, if the object has been recognized as a car, then, by associating with it a straight line
`as its trajectory, it will have an activity: driveStraight.
`We should note that derivation of activities is a process in which the object representation
might be an empty set. The process of deriving activities directly from motion representation is an idea that started more than 20 years ago [18]. However, in terms of implementation, there are still unanswered questions.
`Once the activities in a video sequence are described, we can pose the question: what
`are the video sequences in which a particular activity occurs, using the operator:
  occurs : Activity → Video
`Occurs is an operator that delivers a video sequence which contains a certain activity.
`Figure 2 represents the conceptual video operators described above. The figure is an at-
`tempt to synthesize all of the information derivable from the video sequences. The operators
`for identification of objects are simplified in order to give the complete picture.
`In order to establish a correspondence between the paths traversed by objects and their
`spatial descriptions, we use a family of functions which take as input any object description
`and trajectory description. There are a variety of activities that can be inferred solely from
`trajectory description, but those activities would be the generic ones. For example, we
`
`
`Figure 3. Car racing schema.
`
`can infer that whatever object has a trajectory congruent to a straight line is performing a
`“move straight” activity. Given a certain context, say where the only moving objects in the
`application domain are cars and we do not expect any other moving objects, then we can
`infer a more specific activity such as “drive straight”.
Consider a system for storing information on car racing, in which information about race cars and drivers is stored. The user-defined entity types and the associated operators for this
`application form a schema that is represented graphically in figure 3. The type expressions
`of some sample functions are:
`
  nameOf       : Driver        → String
  pictureOf    : Driver        → Image
  drives       : Driver        → Car
  racesIn      : Driver × Car  → Race
  coverage     : Race          → Video
  yearOf       : Race          → Integer
  winnerOf     : Race          → String
  rnameOf      : Race          → String
  announcement : Race          → Audio
  modelOf      : Car           → String
  makeOf       : Car           → String
`
`
`Given the car race schema (see figure 3) the following activities can be inferred from the
`set of video sequences:
• turnLeft is true if the trajectory orientation changes to the left with respect to the current direction.
• turnRight is true if the trajectory orientation changes to the right with respect to the current direction.
• driveStraight is true if the trajectory orientation stays the same.
• speedUp is true if velocity increases.
• slowDown is true if velocity decreases.
• collision is true if trajectory t1 coincides with trajectory t2 at a particular time instant.

  collision : Trajectory × Trajectory → Activity
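A minimal sketch of how such activity predicates might be evaluated over a trajectory of (x, y) samples taken at uniform time steps; the strictness of the tests and the cross-product criterion for turn direction are assumptions made for the example, not definitions from the paper:

```python
# Sketch: inferring activities from a trajectory of (x, y) samples at uniform
# time steps. A y-up coordinate system is assumed; image coordinates with a
# downward y-axis would flip the left/right tests.

def speeds(traj):
    """Speed (distance per step) between consecutive samples."""
    return [((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
            for (x0, y0), (x1, y1) in zip(traj, traj[1:])]

def speed_up(traj, eps=1e-6):
    s = speeds(traj)
    return len(s) > 1 and all(b > a + eps for a, b in zip(s, s[1:]))

def slow_down(traj, eps=1e-6):
    s = speeds(traj)
    return len(s) > 1 and all(b < a - eps for a, b in zip(s, s[1:]))

def turn_left(traj):
    """True if orientation changes to the left (counter-clockwise), judged by
    the z-component of the cross product of consecutive motion vectors."""
    v = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(traj, traj[1:])]
    return any(a[0] * b[1] - a[1] * b[0] > 0 for a, b in zip(v, v[1:]))

accelerating = [(0, 0), (1, 0), (3, 0), (6, 0)]   # steps of length 1, 2, 3
left_turn    = [(0, 0), (1, 0), (2, 1)]           # veers counter-clockwise
print(speed_up(accelerating))   # True
print(turn_left(left_turn))     # True
print(turn_left(accelerating))  # False
```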
`
`An example query on the above schema could be: show all the video sequences in which
`John’s Ferrari is speeding up.
{ coverage(Race) | exists Driver exists Car:
    speedUp(O) occursIn coverage(Race)
    and racesIn(Driver, Car) is Race
    and nameOf(Driver) is “John”
    and makeOf(Car) is “Ferrari” }
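Rendered as a set comprehension over hypothetical in-memory records, the query above might look as follows; the record classes, field names, and the speeding_up flag are invented stand-ins for the schema operators:

```python
from dataclasses import dataclass

# Hypothetical records standing in for the schema entities.
@dataclass(frozen=True)
class Driver:
    name: str

@dataclass(frozen=True)
class Car:
    make: str

@dataclass(frozen=True)
class Race:
    driver: Driver
    car: Car
    coverage: str       # stand-in for the Video value
    speeding_up: bool   # stand-in for "speedUp(O) occursIn coverage(Race)"

races = [
    Race(Driver("John"), Car("Ferrari"), "race1.mpg", True),
    Race(Driver("John"), Car("Lotus"), "race2.mpg", True),
    Race(Driver("Mary"), Car("Ferrari"), "race3.mpg", False),
]

# { coverage(Race) | ... } rendered as a set comprehension:
result = {r.coverage for r in races
          if r.speeding_up
          and r.driver.name == "John"
          and r.car.make == "Ferrari"}
print(result)   # {'race1.mpg'}
```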
Using derived descriptions, many new types of queries that refer to the contents of video sequences can be specified. Examples include the following:
`
`— Retrieve all the video sequences in which a red car is turning left.
`— Show all the video sequences in which John’s car is slowing down.
`
`The VEVA query language is a visual one. It allows for specification of spatial properties,
`as well as exact and inexact specification of motion properties. In the next section, we will
`outline the language and discuss its implementation.
`
`4.
`
`Implementation of VEVA
`
`A query language for multimedia information systems must provide a number of features
`beyond those of ordinary textual query languages. One such feature is the ability to deal with
`spatial and temporal dimensions of multimedia objects. This capability ensures that, when
`presenting the retrieved results, temporal precedence and spatial composition of objects are
`exactly those requested by the user.
`Defined within the algebraic framework described in Section 3, VEVA is a query lan-
`guage that provides all the necessary constructs for retrieval and management of multimedia
`information [15]. The basis for the language is a schema (algebraic signature) which con-
`tains entity types (both user-defined and application-independent types) and the associated
`
`
`operators. By using these operators, the user can visually specify a query for the desired
`objects in a simple way. VEVA has a formal grammar with which the set of acceptable
`expressions can be generated. In fact, VEVA’s parent language, Varqa, yields a family of
`attractive graphical languages [14]. One example is VEENA [21], which provides graphical
`primitives for operators, functions, and sorts (sets) in the style of data-flow programming.
The main construct is the set construction operator, which has the form:

  { f(x) | P(x) }

As in most database languages, there is a specification part f(x), which stands for the objects of interest, and a filtering part P(x), which represents the conditions that must be met for the specified objects. The analogy with SQL is that the f(x) part corresponds to “SELECT ... FROM”, and P(x) corresponds to the “WHERE” clause.
`The visual language VEVA is implemented in Tcl [23] with extensions to handle image
`and video manipulation. The visual front end has the following functionalities: select a
`database, draw visual queries by connecting “sorts” and “functions” and applying operators
`to symbols.
`
`4.1. Visual grammar
`
`Sentences in visual languages are assemblies of pictorial objects such as “ovals”, “arrows”,
`or “icons” with spatial relations such as “next to” or “contains” between them. Their un-
`derlying structure is a variant of directed graphs. Therefore, graph grammars are a natural
`means for defining the syntax of visual languages.
`The grammar for the visual language VEVA is given using visual rules in the style of a
`picture description language which was developed within the syntactic approach to pattern
recognition [27]. The grammar rules contain nonterminal and terminal icons. The rules are
`given as graph rewriting rules where the left-hand side is a nonterminal icon, and the right-
`hand side is a graph containing nonterminal and terminal icons connected with links of
`certain kinds. Nonterminal icons are denoted by shaded iconic symbols, whereas terminal
`icons have a transparent background. The