`
`TO REQUEST FOR EX PARTE REEXAMINATION OF
`
`U.S. PATENT NO. 7,932,923
`
`AVIGILON EX. 2008
`IPR2019-00314
`Page 1 of 9
`
`
`
`Spatio-Temporal Modeling of Video Data for On-Line
`Object-Oriented Query Processing
`
Young Francis Day, Serhan Dagtas, Mitsutoshi Iino, Ashfaq Khokhar, and Arif Ghafoor
Distributed Multimedia Systems Laboratory
School of Electrical Engineering
Purdue University, West Lafayette, IN 47907
`
`Abstract
`
This paper presents a framework for data modeling
and semantic abstraction of image/video data. The
framework is based on spatio-temporal information
associated with salient objects in an image or in a
sequence of video frames and on a set of generalized
n-ary operators defined to specify spatial and temporal
relationships of objects present in the data. The
methodology presented in this paper can manifest itself
effectively in conceptualizing events and heterogeneous
views in multimedia data as perceived by individual
users. The proposed paradigm induces a multi-level
indexing and searching mechanism that models
information at various levels of granularity and hence
allows processing of content-based queries in real time.
We also devise a unified object-oriented interface for
users with heterogeneous views to specify queries on
the unbiased encoded data. Currently this framework
is being developed to realize a highly integrated
multimedia database architecture.¹
`
`1
`
`Introduction
`
Recent advances in broadband networking, high
performance computing, and storage systems have
resulted in a tremendous interest in digitizing large
archives of multimedia data and providing interactive
access to users. Many future multimedia applications
will require retrieval of video data, including searching,
browsing, selective replays, editing, etc. Due to
the sheer volume of the data, all these capabilities
require efficient computer vision/image processing
algorithms for automatic indexing and abstraction of
video data. Subsequently, powerful indexing and data

¹This research was supported in part by the National Science
Foundation under grant number 9418757-EEC and in part by
ARPA under contract DABT63-92-C-0022 ONR.
`
retrieval techniques need to be employed to support
content-based query processing.
`
The key characteristic of video data is the
spatial/temporal semantics associated with it, making
video data quite different from other types of data such
as text, voice, and images. A user of a video database can
generate queries containing both temporal and spatial
concepts. However, considerable semantic heterogeneity
may exist among users of such data due to differences
in their pre-conceived interpretation or intended
use of the information given in a video clip. Semantic
heterogeneity has been a difficult problem for
conventional databases [6], and even today this problem
is not clearly understood. Consequently, providing a
comprehensive interpretation of video data is a much
more complex problem.
`
Most of the existing video database systems either
employ image processing techniques for indexing of
video data [5, 3, 10, 2] or use traditional database
approaches based on keywords or annotated textual
descriptors [11, 15]. However, most of these systems
lack the ability to provide a general-purpose, automatic
indexing mechanism which renders an unbiased
description of video data. Also, they do not handle
semantic heterogeneity efficiently. In order to
address the issues related to user-independent views and
semantic heterogeneity, we propose a framework for
semantically unbiased abstraction of video data. The
framework is based on spatio-temporal information
associated with salient objects in an image or in a
sequence of video frames and on a set of generalized
n-ary operators defined to specify spatial and temporal
relationships of objects present in the data. The
methodology presented in this paper can manifest itself
effectively in conceptualizing events and heterogeneous
views in multimedia data as perceived by individual
users. The proposed paradigm induces a multi-level
indexing and searching mechanism that models
information at various levels of granularity and hence
`
0-8186-7105-X/95 $4.00 © 1995 IEEE
`
`98
`
`AVIGILON EX. 2008
`IPR2019-00314
`Page 2 of 9
`
`
`
put in one sequence. The term clip is a generic object
without any structural meaning; it is a portion
of a video sequence with starting and ending frame
numbers. In order to put things in perspective, we
first suggest the following definitions.
Generic indexing : It is the process of identifying a
clip from a video sequence and using image processing
algorithms (histograms or equivalents) to partition the
clip into ordered shots.
Structural indexing : It is the process of grouping
continuous shots to form an episode and grouping
continuous episodes to form a program.
In this paper we address issues related to structural
indexing only. Generally, most of the episodes
and programs can be expressed in the form of worldly
knowledge by describing the interplay among physical
objects in the course of time and their relationship
in space. Physical objects may include persons,
buildings, vehicles, etc. Video is a typical replica of
this worldly environment. In conceptual modeling of
video data for the purpose of structural indexing, it
is therefore important that we identify physical objects
and their relationships in time and space. Subsequently,
we can represent these relations in a suitable
data structure that is useful for users to manipulate.
Temporal relations among objects have been previously
modeled by using methods like temporal intervals
[9]. For spatial relations, most of the techniques are
based on projecting objects on a two or three dimensional
coordinate system. Very little attempt has been
made to formally express spatio-temporal interactions
of objects in a single framework. Though in [8]
spatial/temporal metadata for video databases is defined,
no detailed modeling is provided. In the following
sections, we describe a generalized framework describing
spatio-temporal relationships of objects in an
image or video.
`
2.1 Generalized Spatial and Temporal Operations
`
The generalized spatial and temporal operations
presented in this section are an extension to our earlier
work [9]. The reason for introducing the generalization
in both the spatial and temporal domains is to simplify
describing complex spatial or temporal events, which
otherwise are rather cumbersome to express [4]. We
first give a definition for the generalized n-ary relation.
`
Definition 1 : Generalized n-ary relation. A
generalized n-ary relation R(r_1, ..., r_n) is a relation
among n objects, r_i's, that satisfies one of the
conditions in Table 1 according to their positions in space
`
Figure 1: System abstraction
`
allows processing of content-based queries in real time.
However, a unified framework is needed for the users
to express, and for the system to process, semantically
heterogeneous queries on the encoded data. For this
purpose, we propose an object-oriented interface that
provides an elegant paradigm for representing heterogeneous
views of the users. The architecture of the
proposed system is shown in Figure 1.
The organization of this paper is as follows. Section
2 presents the framework for characterizing various
events in video data. A video database architecture
based on that framework is proposed in Section 3. In
Section 4, an object-oriented approach is presented for
users to specify the perceived view of video data. The
paper is concluded in Section 5.
`
2 Framework for Characterizing Events in Video Data
`
Generally, a video sequence consists of ordered
frames that can be partitioned into a collection of
shots using various image processing techniques like
histogram comparisons, motion-based indexing, and
optical flow determination. Each shot contains no
scene changes and is the basic element for characterizing
the video data [14]. Several shots can be grouped
logically into episodes or scenes, i.e., an episode is a
specific sequence of shots [15]. Several episodes can be
`
`99
`
`AVIGILON EX. 2008
`IPR2019-00314
`Page 3 of 9
`
`
`
Relation name | Symbol | Interval constraints, for all i, 1 <= i < n
before        | B      | r_i^e < r_{i+1}^s
meets         | M      | r_i^e = r_{i+1}^s
overlaps      | O      | r_i^s < r_{i+1}^s < r_i^e < r_{i+1}^e
contains      | C      | r_i^s < r_{i+1}^s < r_{i+1}^e < r_i^e
starts        | S      | r_i^s = r_{i+1}^s  and  r_i^e < r_{i+1}^e
completes     | CO     | r_i^s < r_{i+1}^s  and  r_i^e = r_{i+1}^e
equals        | E      | r_i^s = r_{i+1}^s  and  r_i^e = r_{i+1}^e

r_i^s = starting coordinate of object r_i
r_i^e = ending coordinate of object r_i

Table 1: n-ary relations
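In code, the table's conditions can be checked directly. The following sketch (ours, not the paper's) represents each operand r_i as a (start, end) pair; a relation holds when its constraint is satisfied by every pair of consecutive operands, as the "for all i, 1 <= i < n" column requires:

```python
# Pairwise constraints for the seven generalized n-ary relations of Table 1.
N_ARY_RELATIONS = {
    "before":    lambda a, b: a[1] < b[0],                  # r_i^e < r_{i+1}^s
    "meets":     lambda a, b: a[1] == b[0],                 # r_i^e = r_{i+1}^s
    "overlaps":  lambda a, b: a[0] < b[0] < a[1] < b[1],
    "contains":  lambda a, b: a[0] < b[0] and b[1] < a[1],
    "starts":    lambda a, b: a[0] == b[0] and a[1] < b[1],
    "completes": lambda a, b: a[0] < b[0] and a[1] == b[1],
    "equals":    lambda a, b: a[0] == b[0] and a[1] == b[1],
}

def holds(name, *intervals):
    """True if the generalized n-ary relation `name` holds over the operands,
    i.e. its constraint is satisfied by every consecutive pair of intervals."""
    constraint = N_ARY_RELATIONS[name]
    return all(constraint(a, b) for a, b in zip(intervals, intervals[1:]))
```

For example, `holds("before", (0, 3), (5, 8), (10, 12))` is true because each interval ends before the next one starts.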
`
Figure 2: n-ary relations

or time domain with respect to each other.

The relation is represented by the corresponding
name and symbol. The operands of the relation, i.e.
r_i (i = 1, ..., n), are either the projections of the
positions of the objects (spatial domain) or the time spans
of certain objects/events (temporal domain).
The generalized n-ary relations and the corresponding
interval constraints are shown in Figure 2 and Table
1, respectively. The same fundamental relations
can be used either in the space or time domains. The
difference is in the meaning of the operands rather than
the operation. For the spatial domain the operands
represent the physical locations of the objects, while in
the temporal case they represent the duration of a certain
temporal event (such as presence). The number of
operands, n, in the relations is assumed variable. This
generality enables any spatial or temporal situation to
be represented in terms of the seven fundamental n-ary
relations in Figure 2.

2.2 Modeling of Spatial Events in a Single Frame

Assume that computer vision/image processing
algorithms for object identification and recognition have
been applied to video frames and a spatial attribute,
called bounding volume, V, for each salient physical
object present in the frame has been extracted and
stored in a VSDG (Video Semantic Directed Graph) [7]
or equivalent data structure. The volume describes
the spatial projection of an object on the x, y, and z axes
and is defined in the following way:

Bounding Volume (V) = (x1, x2, y1, y2, z1, z2)
A 2-D bounding box is used in those cases where
only 2-D information is available. For all three
coordinates, the points with subscripts 1 and 2 specify the
beginning and end points of the projections,
respectively.
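As an illustration, the bounding volume and its per-axis projections might be represented as follows (a sketch; the class and method names are ours, not the paper's):

```python
# A bounding volume with (start, end) projections on the x, y, and z axes.
# z defaults to a degenerate interval, so a 2-D bounding box is simply the
# case where no depth information is available.
from dataclasses import dataclass

@dataclass
class BoundingVolume:
    x1: float
    x2: float
    y1: float
    y2: float
    z1: float = 0.0
    z2: float = 0.0

    def projection(self, axis):
        """Return the (start, end) interval of this volume on the given axis."""
        return {"x": (self.x1, self.x2),
                "y": (self.y1, self.y2),
                "z": (self.z1, self.z2)}[axis]
```

These projection intervals are exactly the operands the n-ary spatial relations of Table 1 take.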
The information provided by the bounding volumes
is not sufficient to describe the meaningful semantic
information present in a frame. Although it provides the
most fundamental information about a frame, e.g., the
locations of individual objects, it needs to be expanded
to construct higher level contents in the frame. Such
detailed information contents in a single frame can be
termed spatial events.
For example, presiding over a meeting attaches a
meaning to some spatial area. For this event, a person in
a frame may be identified such that he/she is either
standing or sitting on a chair in the center or front of a
meeting room. Another example of a spatial concept
is the three point position on a basketball field. Similarly,
a person may be sitting on a chair or some physical
object. In this case, we have a conceptual spatial
object 'sitting' with attributes 'a physical object which
sits' and 'a physical object being sat on', and they are
related by the 'sitting' relationship.
In order to express events in an unambiguous way,
we present a formal definition of a spatial event based
on the spatial operations discussed in the previous
section.
`
`100
`
`AVIGILON EX. 2008
`IPR2019-00314
`Page 4 of 9
`
`
`
A spatial event describes the relative positions of
objects in a frame.

Definition 2 : Spatial Event. A spatial event E_s is
a logical expression consisting of various generalized
n-ary spatial operations on projections and is described
as follows:
E_s = R_1(r_1^1, ..., r_(n_1)^1) o_1 R_2(r_1^2, ..., r_(n_2)^2) o_2 ...
o_(m-1) R_m(r_1^m, ..., r_(n_m)^m),
where R_j, j = 1, ..., m is a generalized n-ary relation,
o_k, k = 1, ..., m-1 is one of the logical operators (AND
or OR), and r_j^i is the projection of object j in relation i
on the x, y, or z axis.
`
Note that more complex spatial events can be
constructed by relating several spatial events using logical
operators.
As an example of a spatial event, consider a player
holding the ball in a basketball game. To simplify
the characterization of this situation, we assume that when
the bounding boxes of the objects player and ball are
in contact with each other, the
frame contains the event "player holding the ball". This
is characterized by six of the n-ary relations in both the x
and y coordinates and can be formally expressed as
E_s = (M(r_x^1, r_x^b) OR O(r_x^1, r_x^b) OR C(r_x^1, r_x^b) OR
S(r_x^1, r_x^b) OR CO(r_x^1, r_x^b) OR E(r_x^1, r_x^b)) AND
(M(r_y^1, r_y^b) OR O(r_y^1, r_y^b) OR C(r_y^1, r_y^b) OR
S(r_y^1, r_y^b) OR CO(r_y^1, r_y^b) OR E(r_y^1, r_y^b)),
where r_x^1 is the projection of the bounding box
associated with object player 1 on the x-axis and r_x^b is the
projection of the bounding box associated with the
object ball on the x-axis, etc. If the specified condition
is satisfied for a given frame, the event E_s exists.
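Under the same (start, end) interval representation, this event can be sketched in code; the helper names are ours, following the formula above with its fixed (player, ball) operand order:

```python
# "Player holding the ball": on each of the x and y axes, the two bounding-box
# projections must satisfy one of the six contact relations M, O, C, S, CO, E.
CONTACT = (
    lambda a, b: a[1] == b[0],                   # M: meets
    lambda a, b: a[0] < b[0] < a[1] < b[1],      # O: overlaps
    lambda a, b: a[0] < b[0] and b[1] < a[1],    # C: contains
    lambda a, b: a[0] == b[0] and a[1] < b[1],   # S: starts
    lambda a, b: a[0] < b[0] and a[1] == b[1],   # CO: completes
    lambda a, b: a == b,                         # E: equals
)

def in_contact(a, b):
    """Disjunction over the six relations (every relation except before)."""
    return any(rel(a, b) for rel in CONTACT)

def holding_ball(player_x, player_y, ball_x, ball_y):
    """E_s: contact on the x projections AND contact on the y projections."""
    return in_contact(player_x, ball_x) and in_contact(player_y, ball_y)
```

For instance, projections `(0, 4)` and `(2, 6)` on both axes overlap, so the event holds; disjoint projections such as `(0, 2)` and `(5, 7)` satisfy only the before relation, so it does not.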
As a side note, we mention that one can
maintain the spatial event information for each frame.
However, the overhead associated with such detailed
specification may be formidable. Also, tracking such
detailed information may not even be needed for many
applications. We can, therefore, maintain temporal
information by only identifying spatial events in frames
spaced δ frames apart.
Spatial events can serve as the low level (fine-grain)
indexing mechanism for video data, where information
contents at the frame level are generated. Modeling
more complex information contents, such as gloomy
weather, is a more challenging problem.
`
`2.3 Temporal Events
`
The next level of video information modeling
involves the temporal dimension. Temporal modeling of a
video clip is crucial for users to ultimately construct

complex views or to describe episodes/events in the
clip. Episodes can be expressed by collectively
interpreting the behavior of physical objects. The behavior
can be described by observing the total duration an
object appears in a given video clip and its relative
movement over the sequence of frames in which it
appears. For example, the occurrence of a slam-dunk in a
sports video clip can be an episode in a user's specified
query.
Modeling of this episode requires tracking the motion
of the player for whom the slam-dunk is being queried
and tracking the motion of the ball in a careful manner,
especially when it approaches the hoop. Tracking the
motion of the player and the motion of the ball are
two simple temporal events. These temporal events
need to be expressed prior to composing the complex
episode of the slam-dunk. It is obvious that these
simple events can be expressed formally as a temporal
sequence of various spatial events, spanning over a number
of frames. Composite temporal events are defined
in terms of other simple or complex temporal events
by relating them with the n-ary relations. We formally
define a temporal event as follows.
`
Definition 3 : Temporal Events. A simple temporal
event (E_st) is defined as a logical operation on a
set of spatial events the durations of which are related
by one of the n-ary temporal relations. Formally,
E_st = R_1(d(E_s_1), ..., d(E_s_n)) o_1 R_2(d(E_s_i), ..., d(E_s_j)) o_2
... o_(m-1) R_m(d(E_s_k), ..., d(E_s_l)),
where R_j is a generalized n-ary relation and d(E_s_i) is
the duration of the spatial event E_s_i. A composite
temporal event (E_ct) is formed by further relating the
existing temporal events using the same generalized
spatio-temporal operators. Formally,
E_ct = R_1(d(E_t_1), ..., d(E_t_n)) o_1 R_2(d(E_t_i), ..., d(E_t_j)) o_2
... o_(m-1) R_m(d(E_t_k), ..., d(E_t_l)),

where the d(E_t_i)'s in this case are durations of temporal
events which could be either simple or composite.
`
In video data, associated with each spatial event
is its duration d(E_s) during which the spatial event
persists. If the event starts at frame #α and ends at
frame #β then d(E_s) = β - α + 1. The duration of the
result of an n-ary operation is the aggregate duration,
i.e. the time interval between the earliest starting time
and the latest ending time of the involved objects.
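These two duration rules can be written down directly (a sketch; the function names are ours):

```python
def duration(alpha, beta):
    """d(E_s) = beta - alpha + 1 for an event spanning frames alpha..beta,
    inclusive on both ends."""
    return beta - alpha + 1

def aggregate_span(intervals):
    """Aggregate duration of an n-ary operation: the interval from the
    earliest starting time to the latest ending time of the operands."""
    starts, ends = zip(*intervals)
    return (min(starts), max(ends))
```

So an event over frames 10 through 14 has duration 5, and the aggregate of intervals (3, 7), (5, 12), and (1, 4) is (1, 12).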
A set of spatial/temporal events can be arranged in
nondecreasing order in terms of their (approximate)
start times. However, in many cases we may not know the exact
inter-interval delays during the definition of
`101
`
`AVIGILON EX. 2008
`IPR2019-00314
`Page 5 of 9
`
`
`
events. Therefore, instead of giving an exact value,
we can specify ranges of inter-interval delays. In the
extreme case, every duration or delay is a variable.
In summary, in defining a spatio-temporal event, we
only have to specify the components in a certain order;
the durations (r_i's) and inter-interval delays (Δ's) are
optional.
An example of a temporal event can be an extension
of the previous example "holding a ball" to "passing
of a ball between two players". This is characterized
by two events of the same type, namely "holding
of the ball by a player". The pass event is then composed
of these events with two conditions: first, these
events should follow each other, and second, the
delay between these two events should not exceed some
specified value. Accordingly, we can express the pass
event as
E_st = B(d(E_s_1), d(E_s_2)),
where B is the before n-ary operation, and the d(E_s_i)'s
are the durations of the spatial events as defined in
the previous section for players 1 and 2. A composite
temporal event "3 passes that follow each other"
can be similarly expressed as a sequence of temporal
events as
E_ct = B(d(E_st_1), d(E_st_2), d(E_st_3)),
where the d(E_st_i)'s are durations of the corresponding
simple temporal events.
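The pass example can be sketched with "holding" events given as (start_frame, end_frame) intervals; the before relation is as in Table 1, and the max_delay parameter is our illustrative stand-in for the "specified value" in the text:

```python
def before(*intervals):
    """Generalized n-ary before: each event ends before the next one starts."""
    return all(a[1] < b[0] for a, b in zip(intervals, intervals[1:]))

def pass_event(hold1, hold2, max_delay=30):
    """E_st = B(d(E_s1), d(E_s2)), with a bound on the inter-event delay."""
    return before(hold1, hold2) and (hold2[0] - hold1[1]) <= max_delay

def three_passes(p1, p2, p3):
    """Composite event "3 passes that follow each other": B over three
    pass durations, B(d(E_st1), d(E_st2), d(E_st3))."""
    return before(p1, p2, p3)
```

Two holdings at frames (0, 10) and (15, 25) form a pass, while a 90-frame gap would not under the default bound.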
`
`3 Video Database Architecture
`
Summarizing the discussion of the previous section,
we have suggested three levels of semantic indexing
of video databases. The first level identifies
spatial events from the physical positions of salient
objects. The second level maintains indexes for simple
temporal events using the information maintained
at the first level. Subsequently, composite temporal
events are identified from simple events and composite
events.
The first step is to process the raw video data and
extract the relevant information such as the identities
of the objects of interest and their bounding volumes.
This constitutes a challenging problem even for
today's advanced computer vision technology. Discussion
of this problem is not the main theme of this paper.
However, it is worthwhile to mention some issues
related to the initial processing.
For an easier and more efficient recognition, objects
should be grouped into classes. This enables
pre-defined object models to be used and simplifies
recognition through appropriate matching techniques.
`
Since the identities of human objects, which are obviously
of special interest in video databases, are determined
by their faces, their recognition should be given
special treatment. Among many different schemes
that incorporate a wide range of methods, the EIGENFACE
technique appears to be the most reliable and
robust [16]. This method recognizes faces according
to their representation in a feature space formed by
"eigenfaces" of the sample faces. The bounding volume
information for any recognized object can easily be
obtained through well-established edge detection
algorithms, to be used in the later steps to construct the
entire database.
A possible architecture of the spatio-temporal part
of the database is shown in Figure 3. At each level
of indexing (event database), the event definitions as
well as the actual event occurrences and their
corresponding information are stored. The event databases
can be implemented either in an object-oriented
environment or as a relational database. We will illustrate
using the object-oriented paradigm next.
At the first level of the architecture lies the spatial
event database. This database is built by direct use of
the information about physical objects and their
positions. This information, for example, can be stored
in a VSDG (Video Semantic Directed Graph) [7]. Each
event is represented as a class. The event definition
and recognition procedures are stored as methods in
the class definition. We also record the following
information (stored as instance variables): the object IDs
of the participating objects, the clip number of the spatial
event detected, and the starting and ending frame numbers
of the event. The actual events identified are the
instances of the corresponding event class. All the
instances of a class constitute the collection of the class.
The spatial event database is updated with new
instances and event types as they are encountered during
the archiving/retrieval process.
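A minimal sketch of such a first-level event class (the class names and the collection mechanism are illustrative, not the paper's implementation):

```python
# A spatial-event class holding the instance variables listed above, plus a
# per-class collection of all recognized instances of that event type.
class SpatialEvent:
    collection = []

    def __init__(self, object_ids, clip_number, start_frame, end_frame):
        self.object_ids = object_ids      # IDs of the participating objects
        self.clip_number = clip_number    # clip where the event was detected
        self.start_frame = start_frame    # starting frame of the event
        self.end_frame = end_frame        # ending frame of the event
        type(self).collection.append(self)   # add to the class's collection

class HoldingBall(SpatialEvent):
    """One event type; its recognition procedure would live here as a method
    operating on bounding-volume projections."""
    collection = []   # this subclass keeps its own collection of instances
```

Each recognized occurrence becomes an instance, and the subclass's `collection` plays the role of the collection of the class.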
The spatial event (E_s) information is used to construct
the second level indexing scheme, the simple temporal
event database. The classes at this level represent
the simple temporal events formally defined in the
previous section. The methods of a class are used to
represent and recognize the event as in the first level.
The instance variables of each of the classes are the
following: the durations and IDs of the component spatial
events specified in the first level indexing; the clip number
of the video clip containing the simple temporal event; and
the starting and ending frame numbers of the event.
At the highest level is the composite temporal event
database. The classes at this level represent the
temporal events among the simple/composite temporal
`
`102
`
`AVIGILON EX. 2008
`IPR2019-00314
`Page 6 of 9
`
`
`
(Figure 4 depicts example object hierarchies related by IS-A, IS-PART-OF, IS-INVOLVED-IN, and IS-MEMBER-OF abstraction links.)
Figure 4: Object-oriented abstractions.
`
an object-oriented approach which provides an elegant
paradigm for representing a user's view. It is a hierarchical
schema that allows users to compose their views
about the video data. The purpose is to offer the maximum
flexibility to a user to represent their semantics
and at the same time allow processing of heterogeneous
queries by interacting with the proposed architecture.
The objective of an object-oriented paradigm is
twofold. First, the users can define their conceptual
queries involving motion in a systematic way. Second,
it allows processing of users' conceptual queries
by using the various event databases. Therefore, the system
processes users' queries with the assistance of the
proposed object-oriented views. In other words, an
object-oriented view serves as a "knowledge base" for
the system.
Corresponding to the three entities (physical objects,
spatial events and temporal events) used in
the modeling of video data, three objects are defined
from the user point of view. These are physical objects
(PO), spatial objects (SO), and temporal objects
(TO). For video data, a user can use combinations of
various object-oriented abstractions (such as shown in
Figure 4) on these objects to specify queries. The
important feature of this hierarchy, and in general of any
object-oriented abstraction, is that terminal nodes are
either POs, SOs, or TOs. Any complex video query is
expressed as a function of these nodes, and processing
of such queries requires searching for the occurrence of
SOs and TOs over the specified POs. As an example,
consider a sports video database which can be used
by multiple users with different interests. Figure 5
describes an object hierarchy of view/knowledge which
a user would like to construct.
A sports fan may view video data as a collection
of players, events, and teams. Furthermore, in
his/her view, there are three types of players: forward,
guard, and center. There are two types of events:
individual_event and team_event. Teams consist of
those from the NBA and NCAA. Also, the composition of
the field is described in detail. A sports fan can generate
a query such as 'Give the video clips where Michael
Jordan (i.e., a PO) has a slam-dunk (TO)'. The sys-
`
`103
`
`Figure 3: An architecture for spatio-temporal event
`identification
`
events. The structure of the classes is the same as
in the second level, except the durations and IDs of
temporal events are recorded. Composite temporal
events can be recursively formed from simple and/or
composite events.
Although we have not presented a formal grammar
for expressing queries, we have proposed a general
framework for characterizing events using generalized
n-ary operators. A user can specify more events
as needed and store them. New classes can also be
formed based on the existing classes at the lower/same
level through n-ary operations and using class inheritance.
Occasionally, the system may resort to
processing the raw video data to identify objects that
were not previously identified. We expect the proposed
methodology to be helpful in providing on-line
capabilities for query processing.
`
4 An Object-Oriented Model of Video Data
`
As mentioned earlier, video data is represented by
three entities: physical objects, spatial events and temporal
events. For users to query video data, we propose
`
`
`
`
and express complex queries in a more abstract manner.
The spatial objects (SOs) and temporal objects (TOs)
defined earlier are used as operands for spatial and
temporal predicates. For additional detail on the use
of predicate logic, we refer to [7].
`
`5 Conclusion
`
We have presented a framework for semantic indexing
of video data using the generalized n-ary operators
proposed in this paper. The same set of operators
is used for modeling both the spatial and temporal
contents of an image or a sequence of video frames. This
enables a unified methodology for handling content-based
spatial and spatio-temporal queries. The system
is hierarchical in nature and allows a multi-level
indexing and searching mechanism by modeling
information at various levels of semantic granularity, and
hence allows processing of content-based queries without
processing raw image or video data. Currently
this framework is being developed at Purdue University
to realize a highly integrated multimedia database
architecture.
`
`References
`
[1] A. Abella and J. R. Kender, "Qualitatively Describing Objects Using Spatial Prepositions," in AAAI-93, Proc. of 11th National Conference on Artificial Intelligence, pp. 536-540.

[2] F. Arman, R. Depommier, A. Hsu, and M.-Y. Chiu, "Content-based Browsing of Video Sequences," Proc. of Second ACM International Conf. on Multimedia, San Francisco, CA, October 1994, pp. 97-102.

[3] T. Arndt and S.-K. Chang, "Image Sequence Compression by Iconic Indexing," IEEE VL '89 Workshop on Visual Languages, Rome, Italy, 1989.

[4] A. Del Bimbo, E. Vicario, and D. Zingoni, "A Spatio-Temporal Logic for Image Sequence Coding and Retrieval," Proceedings of IEEE VL '92 Workshop on Visual Languages, Seattle, WA, September 1992, pp. 228-230.

[5] S.-K. Chang, Q.-Y. Shi, and C.-W. Yan, "Iconic Indexing by 2-D Strings," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 3, May 1987, pp. 413-427.

[6] C. J. Date, An Introduction to Database Systems, Vol. 1, 5th edition, Addison-Wesley, 1990.

[7] Y. F. Day, S. Dagtas, M. Iino, A. Khokhar, and A. Ghafoor, "Object-Oriented Conceptual Modeling of Video Data," to appear in Proc. IEEE ICDE '95.
`104
`
Figure 5: Fan's view
`
tem first searches the slam-dunk relation, or collection,
of the composite temporal event database to find if Jordan
appears one or more times. If it is true, the clip
numbers containing Michael Jordan slam-dunks are
returned. Otherwise, the system goes to lower level
event databases and event definition databases until
it finds enough information to evaluate the query. A
fan may also want to identify those video segments where a
steal (TO) occurs around the right sideline of the front
court. In this case "right sideline" and "front court"
are also objects of interest, in addition to the player
and the ball.
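The top-down search the system performs might be sketched as follows (an illustration under our own data layout, not the paper's implementation; the database contents below are hypothetical):

```python
def find_clips(event_name, obj, databases):
    """Return clip numbers where `obj` participates in `event_name`.

    `databases` is ordered from the composite temporal event database down to
    the spatial event database; each level maps event names to lists of
    (object_ids, clip_number) records."""
    for level in databases:                      # highest level first
        clips = [clip for object_ids, clip in level.get(event_name, [])
                 if obj in object_ids]
        if clips:
            return clips          # enough information found at this level
    return []                     # may trigger re-processing of raw video

# Hypothetical contents of the three event databases
composite_db = {"slam-dunk": [(("Jordan", "ball"), 12),
                              (("Jordan", "ball"), 47)]}
simple_db, spatial_db = {}, {}
```

Here `find_clips("slam-dunk", "Jordan", [composite_db, simple_db, spatial_db])` answers the fan's query at the top level, while an event absent from every level would fall through to re-indexing.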
The definitions of some of the classes used in these
examples are given in Table 2. The methods are coded
based on generalized n-ary operators and describe the
spatio-temporal processing related to the event of the
corresponding object. In class NBA, "SetOf" is used
to specify the association abstraction.
`
Table 2: Class definitions.
The predicate logic described in our earlier work [7],
and also used in [1, 12, 13], can be used to construct
`
`
`
`
[8] R. Jain and A. Hampapur, "Metadata in Video Databases," ACM SIGMOD RECORD, Vol. 23, No. 4, December 1994, pp. 27-33.

[9] T. D. C. Little and A. Ghafoor, "Interval-Based Conceptual Models for Time-dependent Multimedia Data," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 4, August 1993, pp. 551-563.

[10] A. Nagasaka and Y. Tanaka, "Automatic Video Indexing and Full Video Search for Object Appearances," in 2nd Working Conference on Visual Database Systems, Budapest, Hungary, October 1991, IFIP WG 2.6, pp. 119-133.

[11] E. Oomoto and K. Tanaka, "OVID: Design and Implementation of a Video-Object Database System," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 4, August 1993, pp. 629-643.

[12] D. A. Randell and A. G. Cohn, "Exploiting Lattices in a Theory of Space and Time," Computers & Mathematics with Applications, Vol. 23, No. 6-9, 1992, pp. 459-476.

[13] D. A. Randell, Z. Cui, and A. G. Cohn, "A Spatial Logic Based on Regions and Connection," in Proc. of 3rd Intl. Conf. on Principles of Knowledge Representation and Reasoning, Cambridge, MA, 1992, pp. 165-176.

[14] S. W. Smoliar and H. Zhang, "Content-Based Video Indexing and Retrieval," IEEE Multimedia, Vol. 1, No. 2, Summer 1994, pp. 62-72.

[15] D. Swanberg, C.-F. Shu, and R. Jain, "Knowledge Guided Parsing in Video Databases," Proc. SPIE, San Jose, January 1993, pp. 3-11 - 3-22.

[16] M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, Vol. 3, No. 1, Winter 1991, pp. 71-86.
`
`105
`
`AVIGILON EX. 2008
`IPR2019-00314
`Page 9 of 9
`
`