IBG 1044
CBM of U.S. Patent No. 7,693,768

U.S. Patent    Mar. 20, 2001    Sheet 1 of 11    US 6,205,260 B1

[FIG. 1: drawing sheet (rotated text, OCR unrecoverable): buffers used in the sprite-based encoder at time t-1]
`
[Sheet 2 of 11, FIG. 2: drawing sheet (rotated text, OCR unrecoverable): buffers used in the sprite-based encoder at time t to t+1]
`
`
`
`
`
[Sheet 3 of 11, FIG. 3: drawing sheet (rotated text, OCR unrecoverable): buffers used in the sprite-based encoder at time t+1 to t+2]
`
`
`
`
`
[Sheet 4 of 11, FIG. 4A: flow diagram: initialize mosaic fields and buffers (S_mosaic(s,t)=0, M(s,t)=0) and warping parameters (19); position at beginning of image I(s,t) (20); compare I(s,t) with previous encoded/decoded frame C^{-1}C{I(s,t-1)} (26); loop over pixels (22, 24); tag new image regions S_change(s,t)=1b (28, 30); threshold change (32) into S_change(s,t)=0 (34) or S_change(s,t)=1a (36); identify new background area where S_change(s,t)==0 && W_{t<-t0}(S_mosaic(s,t-1))==0 (38): S_nbg(s,t)=1, else S_nbg(s,t)=0 (40)]
`
`
`
[Sheet 5 of 11, FIG. 4B (from FIG. 4A): is the associated background region for the mosaic known (W_{t<-t0}(S_mosaic(s,t-1))==1)? (42); if so, compare I(s,t) with the current warped mosaic W_{t<-t0}(M(s,t-1)); encode foreground regions in image I(s,t) where S_fg(s,t)=1a or 1b; to FIG. 4C]
`
`
`
[Sheet 6 of 11, FIG. 4C (from FIG. 4B): position at beginning of coded/decoded foreground shape (66); loop over pixels (68, 70, 72); is the mosaic region known (S_mosaic(s,t-1)=1)? is it a background region (W_{t0<-t}(S_bg(s,t))=1)? (78-82); set M'(s,t-1)=M(s,t-1) or M'(s,t-1)=W_{t0<-(t-1)}(C^{-1}C{I(s,t-1)}) (84); to FIG. 4D]
`
`
`
[Sheet 7 of 11, FIG. 4D (from FIG. 4C): encode background regions in image I(s,t) where S_bg(s,t)==1 (86-90); position at beginning of mosaic at time t-1 (92); loop over pixels (94, 96); if S_mosaic(s,t-1)=1 (98) or W_{t0<-t}(S_bg(s,t))=1 (100), set S_mosaic(s,t)=1 (104), else S_mosaic(s,t)=0 (102); to FIG. 4E]
`
[Sheet 8 of 11, FIG. 4E: mosaic update:
M(s,t) = [1 - α·W_{t0<-t}(S_bg(s,t))]·M'(s,t-1) + α·W_{t0<-t}(S_bg(s,t))·[M'(s,t-1) + W_{t0<-t}(C^{-1}C{ΔI(s,t)})]]
`
`
`
[Sheet 9 of 11, FIG. 5: functional block diagram of the sprite-based encoder (rotated text, OCR unrecoverable)]
`
`
`
`
[Sheet 10 of 11, FIG. 6: sprite-based video conferencing system 179 (camera/control; "prior to xmission" and "during xmission" paths; blocks 180, 186, 188, 190); FIG. 8: foreground object positions shown in the mosaic at times T0, T0+1 and T0+2 (192, 210)]
`
`
`
[Sheet 11 of 11, FIG. 7: block diagram of the sprite-based video search system (rotated text, OCR unrecoverable)]
`
`
`
`
`
`US 6,205,260 B1
`
`1
`SPRITE-BASED VIDEO CODING SYSTEM
`WITH AUTOMATIC SEGMENTATION
`INTEGRATED INTO CODING AND SPRITE
`BUILDING PROCESSES
`
CROSS REFERENCE TO RELATED APPLICATIONS
`
This application claims priority from U.S. Provisional Application No. 60/034,558, filed on Dec. 30, 1996, which is hereby incorporated by reference.
`
BACKGROUND OF THE INVENTION
`
The invention relates to a sprite-based coder (encoder and decoder) that automatically builds a sprite (also called a mosaic) while operating in a separate shape/texture coding environment such as MPEG-4, and to novel applications that utilize the mosaic.
`
A mosaic image (the terms mosaic and sprite are used interchangeably) is built from images of a certain scene object over several video frames. For instance, a mosaic of a background scene in the case of a panning camera will result in a panoramic image of the background. Two major types of sprites and sprite-based coding are defined in MPEG-4. The first type of mosaic is called an off-line static sprite. An off-line static sprite is a panoramic image which is used to produce a sequence of snapshots of the same video object (such as background). Each individual snapshot is generated by simply warping portions of the mosaic content and copying it to the video buffer where the current video frame is being reconstructed. Static sprites are built off-line and are transmitted as side information.
`
The second type of mosaic is called an on-line dynamic sprite. On-line dynamic sprites are used in predictive coding of a video object. A prediction of each snapshot of the video object in a sequence is obtained by warping a section of the dynamic sprite. The residual signal is coded and used to update the mosaic in the encoder and the decoder concurrently. The content of a dynamic mosaic may be constantly updated to include the latest video object information. As opposed to static sprites, dynamic sprites are built on line simultaneously in the encoder and decoder. Consequently, no additional information needs to be transmitted besides mosaic dimensions, a blending factor and the global motion parameters.
Both off-line static and on-line dynamic sprite-based coding require constructing a sprite. In the former case, the sprite is built prior to transmission. In the latter case, the sprite is built on-line during the transmission. So far, MPEG-4 has assumed that the outline (segmentation) of the object for which the sprite is going to be built is known a priori at every time instant. Although this is true in certain applications, especially in post-production or content generation using blue screen techniques, automatic segmentation does not currently exist for sprites built dynamically in encoding environments.

Therefore a need remains for sprite-based coding systems where sprite building does not require a priori knowledge of scene segmentation.
`
`SUMMARY OF THE INVENTION
`
A sprite-based coding system includes an encoder and decoder where sprite-building is automatic and segmentation of the object used to reconstruct the mosaic is automatic and integrated into the coding process. The sprite object is distinguished from the rest of the video objects on the basis of its motion. The sprite object moves according to the dominant component of the scene motion, which is usually due to
`camera motion or zoom. Hence, the sprite-based coding
`system utilizes dominant motion to distinguish the back-
`ground object in the image from foreground objects in the
`image. The sprite-based coding system is easily integrated
`into a video object—based coding framework such as MPEG-
`4, where shape and texture of individual video objects are
`coded separately. The automatic segmentation integrated in
`the sprite-based coding system identifies the shape and
`texture of the sprite object.
In one application, the sprite-based encoding/decoding system builds a background mosaic off-line prior to the bi-directional transmission of conversational and visual data. The initial mosaic reconstruction phase provides both the videophone encoder and decoder the opportunity to build either a partial or a complete representation of the background behind the videophone user. In a second phase the videophone supports a conversational service between two videophone users and the mosaic is used by the videophone encoder and decoder to implement mosaic-based predictive coding of the visual data. In very low bit rate applications, coding of video frames in terms of the video objects within them may require too much bandwidth, because the shape of such objects may consume a significant portion of the limited bit budget. In such cases, the sprite-based coding system conducts frame-based coding where automatic segmentation is only used to obtain better dominant motion estimation for sprite building and dominant motion-compensated prediction. The sprite-based encoding system can also generate a sprite that has a higher spatial resolution than the original images.
`The sprite-based coding system is suitable for applica-
`tions where camera view may change frequently, such as
`video conferencing with multiple cameras, or a talk show
`captured with more than one camera. In these applications,
`multiple sprites are built and used as needed. For instance,
`if a camera goes back and forth between two participants in
front of two different backgrounds, two background sprites are built and used as appropriate. More specifically, when background A is visible, building of the sprite for background B and its use in coding is suspended until background B appears again. The use of multiple sprites in this
`fashion is possible within the MPEG-4 framework, as
`described below.
`
`A sprite is built during the encoding process. However,
`the resulting sprite may be subsequently used, after coding,
`as a representative image of the compressed video clip. Its
features can be used to identify the features of a video clip, which is then used in feature-based (or content-based) storage and retrieval of video clips. Hence sprite-based coding, according to another embodiment of the invention, is used to populate a video library of bitstreams or decoded images where sprite images generated during the encoding process act as representative images of the video clips. The mosaics can also be coded using a still image coding method.
`
In a similar fashion, one or several event lists may be associated with a background sprite. A possible choice for an event list is the set of consecutive positions of one or several vertices belonging to each foreground object. Such a list is then used to generate token representative images of the foreground object position in the sprite. Consecutive positions of each vertex are then either linked by a straight line or share a distinct color. The consecutive positions of the vertex are then shown statically (all successive positions in the same sprite) or dynamically (vertex positions shown in
`
`the mosaic successively in time). A vertex is chosen to
`correspond to any distinctive feature of the foreground
`object, such as the center of gravity or a salient point in the
`shape of the object. In the latter case, and if several vertices
`are used simultaneously, the vertices are arranged according
`to a hierarchical description of the object shape. With this
`technique, a user or a presentation interface has the freedom
`to choose between coarse to finer shapes to show successive
`foreground object positions in the background sprite. This
`concept may be used in a video library system to retrieve
`content based on motion characteristics of the foreground
`object(s).
`The mosaic can also be warped at a fixed zooming factor
`to provide more resolution than the images used to generate
`the mosaic. An optimal view point for the mosaic is provided
`by locating an optimiried center reference frame instant and
`then building the mosaic around the reference frame instant.
`The foregoing and other objects, features and advantages
`of the invention will become more readily apparent from the
`following detailed description of a preferred embodiment of
`the invention, which proceeds with reference to the accom-
`panying drawings.
`
`BRIEF DESCRIPTIONS OF THE DRAWINGS
`
FIG. 1 illustrates the buffers used in the sprite-based encoder according to the invention at time t-1.
FIG. 2 illustrates the buffers used in the sprite-based encoder at time t to t+1.
`
FIG. 3 illustrates the buffers used in the sprite-based encoder at time t+1 to t+2.
`
FIGS. 4A-4E show a flow diagram describing operation of the sprite-based encoder illustrated in FIGS. 1-3.
FIG. 5 is a functional block diagram of the sprite-based encoder.
`
`FIG. 6 is a diagram of a sprite-based video conferencing
`system according to another embodiment of the invention.
FIG. 7 is a block diagram of a sprite-based video search system according to another embodiment of the invention.
`FIG. 8 shows how consecutive portions of a foreground
`object may be represented in a mosaic according to yet
`another aspect of the invention.
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
`
The sprite-encoder system progressively learns to dissociate foreground from background while building a background mosaic at the same time. Once an initialization step (step 1) is complete, steps 2 to 11 described below are repeated until the construction of the background is complete or until it is aborted. The meaning of a region as referred to below is either a single set of contiguous pixels or alternatively several disconnected sets of contiguous pixels in the image. The single region or the union of these disconnected regions are referred to interchangeably as a region.
The notations are as follows: I(s,t) denotes the content of the video frame at spatial position s and at time t. W_{t<-(t-1)}(I(s,t-1)) denotes a warping operator which maps the image at time (t-1) to time t. For a given pixel location s0 in a video buffer at time t, this warping operation is performed by copying the pixel value at the corresponding location s in frame (t-1). The correspondence between location s0 and location s is established by a particular and well defined transformation such as an affine or a perspective transformation, which are well known to those skilled in the art. S_J(s,t) is an indicator buffer which, for a quantity J, denotes a particular state related to J at any pixel location s within the image at time t. Thresh is a threshold value. The operations ≧Thresh and ≦Thresh are symbolic and can represent complex thresholding operations.

The size (per color component) of the current image frame I(s,t) is M_t x N_t, and the size of the previous compressed/decompressed frame after warping, W_{t<-(t-1)}(C^{-1}C{I(s,t-1)}), is such that it can be inscribed in a rectangular array of M_{t-1} x N_{t-1} pixels. A sprite M(s,t) is an image intensity (texture) buffer of size M_M x N_M per color component. The field S_mosaic(s,t) is a single component field of the same size.
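As an illustrative sketch of the warping operator defined above (not the patent's implementation; an affine map with nearest-neighbour inverse mapping stands in for the "affine or perspective transformation", and all names are assumptions):

```python
import numpy as np

def warp_affine(image, A, b, out_shape):
    """Warp `image` by the affine map s = A @ s0 + b: for each pixel
    location s0 in the output buffer, copy the value found at the
    corresponding location s in the source frame (nearest neighbour)."""
    out = np.zeros(out_shape, dtype=image.dtype)
    H, W = image.shape
    for y0 in range(out_shape[0]):
        for x0 in range(out_shape[1]):
            s = A @ np.array([y0, x0]) + b
            y = int(round(float(s[0])))
            x = int(round(float(s[1])))
            if 0 <= y < H and 0 <= x < W:   # outside the support: leave 0
                out[y0, x0] = image[y, x]
    return out

frame = np.arange(16, dtype=float).reshape(4, 4)
# Identity warp plus a one-pixel horizontal shift: s = s0 + (0, 1)
shifted = warp_affine(frame, np.eye(2), np.array([0.0, 1.0]), (4, 4))
```

Pixels whose source location falls outside the frame support are left at zero, which is what makes the "new image region" test in step 3 possible.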
Referring now to FIGS. 1-4, a mosaic of the background is built as follows. The construction of the sprite is started at time t. The image I(s,t-1) has already been compressed and decompressed and is available in both the encoder and the decoder. In the following steps, the image content is assumed to have a background and a foreground part which are transmitted as separate video objects (or VOs).

FIG. 1 illustrates steps 1 through 11 from time t-1, the instant when mosaic building is initiated, to time t when the new video frame or field has been acquired. FIGS. 2 and 3 illustrate steps 2 through 11 from time t to t+1 and from time t+1 to t+2, respectively. FIG. 4 shows the operations performed by the sprite-based encoder during steps 1-11.
At the top left corner in each of these FIGS. 1-3 is shown a newly acquired video frame (A) which is compared to the previous video frame (B) (next image field to the right), once it has been coded/decoded and warped (step 2). Step 3 is illustrated by the rightmost image field (C) in the first row of FIGS. 1-3. This field shows the area where content change has been detected. The status of the mosaic buffer at time t (before updating occurs) is shown in the leftmost image field in the second row (D). The buffer is used to identify the new background areas as described in step 4 below. These areas correspond to regions where background was not known until now.

Foreground identification is illustrated by the rightmost image in the second row (F). The operations associated with this image (F) that use the change map, the mosaic indicator buffer and the new background areas to define the foreground are described in step 5. The two leftmost image fields in the third row (G, H) illustrate steps 6 and 7 of the invention, respectively. In H, background shape information comes from the coded/decoded foreground shape information obtained in G. Finally, the mosaic updating process is illustrated by the image field (I) in the bottom right corner. This process takes place in steps 8, 9, 10 and 11 of the sprite-based encoding process.
Step 1: Initialization.
In operation 19 in FIG. 4, the binary field S_mosaic(s,t) is initialized to 0 for every position s in the buffer, meaning that the content of the mosaic is unknown at these locations. The content of the mosaic buffer M(s,t) is initialized to 0. The warping parameters from the current video frame I(s,t-1) to the mosaic are initialized to be W_{t0<-(t-1)}(any), t0 here representing an arbitrary fictive time. This initial warping is important as it provides a way to specify the "resolution" or the "time reference" used to build the mosaic. Potential applications of this initial mapping are making a mosaic with super spatial resolution or selection of an optimal time t0 minimizing the distortion introduced by the method as described in further detail below. A super spatial resolution mosaic is a mosaic with increased resolution compared to the original individual images in the sequence.
`
`
Step 2: Acquisition.
The image I(s,t) is acquired and the forward warping parameters for mapping the image I(s,t-1) to I(s,t) are computed. The number of warping parameters as well as the method for estimating these parameters are known to those skilled in the art and are therefore not specified here. For example, a dominant motion estimation algorithm such as the one given in J. Bergen et al., "A three-frame algorithm for estimating two-component image motion," IEEE Trans. Pattern Anal. Mach. Intel., Vol. 14, pp. 886-896, September 1992, may be used. The warping parameters are composed with the current warping parameters, resulting in the mapping W_{t0<-t}(any). These parameters are transmitted to a decoder either directly or indirectly in the form of a series of coded vertex coordinates. The sprite-based encoder is positioned at the beginning of the image frame I(s,t) in operation 20 in FIG. 4 and the first pixel in the image frame is obtained in operation 24.
Step 3: Detect Change in Content between Previously Coded/Decoded Frame and Current Frame.

i) Initialization of a large buffer of size M_b x N_b greater than the image (M_b > M_t, N_b > N_t) and possibly as large as the mosaic. The buffer is initialized to 3 to indicate unknown status: S_change(s,t)=3.

ii) In operations 26, 28, 32, 34 and 36 in FIG. 4, scene changes (motion compensated) are computed over common image support. Give label 0 to all locations where change in content is deemed small (34). Give label 1a to locations where change is detected to be large (36). To make regions more homogeneous, implement additional operations (e.g. morphological operations) which either reset label from 1a to 0 or set label from 0 to 1a. Regions labeled 0 will typically be considered and coded as part of the background Video Object while regions labeled 1a will typically be encoded as part of the foreground Video Object.

S_change(s,t) = 1a if |I(s,t) - W_{t<-(t-1)}(C^{-1}C{I(s,t-1)})| ≧ Thres_change, and 0 otherwise,

where Thres_change denotes a pre-defined threshold value.

iii) In operations 28 and 30 in FIG. 4, new image regions are tagged, where the support of the image at time t does not overlap with the support of the image at time (t-1), as S_change(s,t)=1b.
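A minimal sketch of the step 3 labeling (not the patent's implementation; the labels 1a/1b are encoded as integers 1/2 here so the field fits an integer array, and the threshold value is made up):

```python
import numpy as np

THRES_CHANGE = 10.0   # stand-in for Thres_change

def detect_change(I_t, warped_prev, support_prev):
    """Step 3 sketch: 3 = unknown, 0 = small change, 1 = large change
    over common support (the patent's 1a), 2 = new image region with no
    overlap with frame t-1 (the patent's 1b)."""
    S_change = np.full(I_t.shape, 3, dtype=np.uint8)
    common = support_prev                       # overlap with frame t-1
    big = np.abs(I_t - warped_prev) >= THRES_CHANGE
    S_change[common & ~big] = 0                 # content unchanged
    S_change[common & big] = 1                  # "1a": changed content
    S_change[~common] = 2                       # "1b": new image region
    return S_change

I_t = np.array([[0.0, 50.0, 5.0]])
prev = np.array([[0.0, 0.0, 0.0]])
support = np.array([[True, True, False]])       # last pixel: no overlap
S = detect_change(I_t, prev, support)
```

The morphological clean-up mentioned in the text (resetting isolated labels) would be applied to `S_change` afterwards.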
Step 4: Identify New Background Areas.
In operations 38, 40 and 41 in FIG. 4, a new background area is detected if there has not been any change in image content in the last two video frames. The corresponding area in the mosaic must also indicate that the background at this location is unknown. The resulting new background area is then pasted to any neighboring regions where background is known. As will be seen in later steps, incorporation of new background data into the mosaic must be done according to compressed/de-compressed background shape information to avoid any drift between encoder and decoder.

S_nbg(s,t) = 1 if (S_change(s,t) == 0) && (W_{t<-t0}(S_mosaic(s,t-1)) == 0), and 0 otherwise.

Here, the indicator value 0 means that the background is unknown.
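The step 4 test above reduces to a logical AND of two indicator fields; a minimal sketch (illustrative names, with the mosaic indicator assumed already warped to frame coordinates):

```python
import numpy as np

def new_background(S_change, S_mosaic_warped):
    """Step 4 sketch: a pixel is new background when no change was
    detected (S_change == 0) and the warped mosaic indicator says the
    background there is still unknown (== 0)."""
    return ((S_change == 0) & (S_mosaic_warped == 0)).astype(np.uint8)

S_change = np.array([[0, 0, 1]])
S_mosaic = np.array([[0, 1, 0]])   # warped to frame coordinates
S_nbg = new_background(S_change, S_mosaic)
```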
`
Step 5: Perform Foreground/Background Segmentation.
Operations 44, 46, 48 and 50 in FIG. 4 first look at regions where the background is known (S_mosaic(s,t-1)=1). Thresholding is performed (46) to distinguish the foreground from the background (case (i)). For regions where the background is not known, operations 56-62 tag as foreground any regions where changes have occurred (labels 1a and 1b defined in step 3) (cases (iii) and (iv)). Operations 52 and 54 in case (ii) represent new background areas that are excluded from being part of the foreground.

i) if W_{t<-t0}(S_mosaic(s,t-1)) == 1:
S_fg(s,t) = 1 if |I(s,t) - W_{t<-t0}(M(s,t-1))| > Thres_fg, and 0 otherwise,

where Thres_fg is a pre-defined threshold value, which is used here to segment foreground from background.

ii) else if S_nbg(s,t)==1: S_fg(s,t)=0
iii) else if (W_{t<-t0}(S_mosaic(s,t-1))==0) && (S_change(s,t)==1a): S_fg(s,t)=1a
iv) else if (W_{t<-t0}(S_mosaic(s,t-1))==0) && (S_change(s,t)==1b): S_fg(s,t)=1b

In cases (iii) and (iv), a sub-classification of the regions tagged as 1 into either regions 1a and 1b is used for the sole purpose of providing the encoder with the flexibility to follow different macroblock selection rules. For example, regions tagged as 1a might be preferably coded with inter-frame macroblocks since these regions occur over common image support. On the other hand, regions tagged as 1b might preferably be coded with intra-frame macroblocks since these regions do not share a common support with the previous frame. These regions can be tagged as regions "1" if such distinction does not need to be stored or is not used by the encoder.

The sprite-based encoder then jumps back to operations 22 and 24 in FIG. 4. The next pixel is then obtained for the image frame and steps 3 (case ii) through 5 are repeated for the next pixel.
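The step 5 case logic might be sketched as follows (illustrative only; labels 1a/1b are folded into a single label 1, the threshold value is made up, and all indicator fields are assumed already warped to frame coordinates):

```python
import numpy as np

THRES_FG = 20.0   # stand-in for Thres_fg

def segment_fg(I_t, mosaic_warped, S_mosaic_warped, S_nbg, S_change):
    """Step 5 sketch, per pixel:
    i)      background known -> threshold against the warped mosaic,
    ii)     new background   -> never foreground,
    iii/iv) otherwise        -> foreground wherever change was detected."""
    S_fg = np.zeros(I_t.shape, dtype=np.uint8)
    known = S_mosaic_warped == 1
    S_fg[known] = (np.abs(I_t - mosaic_warped) > THRES_FG)[known]
    unknown = ~known & (S_nbg == 0)
    S_fg[unknown] = (S_change != 0)[unknown]
    return S_fg

I_t = np.array([[100.0, 100.0, 100.0, 100.0]])
mosaic = np.array([[100.0, 10.0, 0.0, 0.0]])
S_mos = np.array([[1, 1, 0, 0]])    # background known for first two pixels
S_nbg = np.array([[0, 0, 1, 0]])    # third pixel is new background
S_chg = np.array([[0, 1, 0, 1]])
S_fg = segment_fg(I_t, mosaic, S_mos, S_nbg, S_chg)
```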
Step 6: Compress/Decompress Foreground Shape and Texture.

After all pixels in the image frame have been analyzed, and applicable pixels tagged as foreground or background regions, the sprite-based encoder jumps to operation 64 in FIG. 4. A conventional (I, P or B-VOP) prediction mode is used to encode foreground regions labeled as 1a and 1b. In the case of P or B-VOPs, individual macroblocks can either use inter-frame prediction or intra-frame coding. The pixels corresponding to regions labeled 1b (which can include revealed background not represented in the mosaic) are preferably coded as intra macroblocks.

The shape of the foreground is compressed and transmitted as well. Once de-compressed, this shape is used by the encoder and decoder to update the content of the mosaic. This process can be performed using a standard MPEG-4 VM version 9.0 encoding technique such as described in "MPEG-4 Video Verification Model Version 9.0", Document ISO/IEC JTC1/SC29/WG11/N1869. The sprite-based encoder is then positioned at the beginning of the coded/decoded foreground shape in operation 66 and the first pixel of the shape bitmap is obtained in operation 70 of FIG. 4.
Step 7: Get Background Shape.
In operations 72-76 in FIG. 4, the background shape is then obtained from the compressed/de-compressed foreground shape. Compression/de-compression is necessary here to ensure that the encoder and decoder share the same shape information.

S_bg(s,t) = 1 if C^{-1}C{S_fg(s,t)} == 0, and 0 otherwise,

where C^{-1}C{ } denotes shape coding/decoding which is performed using standard MPEG-4 coding techniques as described above.
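Step 7 is simply the complement of the decoded foreground shape; a minimal sketch (the pass-through stand-in for the shape codec C^{-1}C{ } is an assumption, since a real codec may alter the mask):

```python
import numpy as np

def background_shape(decoded_fg_shape):
    """Step 7 sketch: the background shape is the complement of the
    foreground shape *after* it has been coded and decoded, so encoder
    and decoder derive it from identical data."""
    return (decoded_fg_shape == 0).astype(np.uint8)

# Stand-in for C^{-1}C{.}: a lossy shape codec might simplify the mask;
# here the decoded mask is just given directly for illustration.
decoded = np.array([[1, 0, 0, 1]], dtype=np.uint8)
S_bg = background_shape(decoded)
```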
`
Step 8: Initialize New Background Texture in Mosaic.
Operations 78-84 in FIG. 4 identify where new background has occurred and initialize the mosaic with content found in the previous video frame (time (t-1)). Note that the field S_nbg(s,t) cannot be used here since this information is unknown to the decoder.

M'(s,t-1) = M(s,t-1) if S_mosaic(s,t-1) == 1
M'(s,t-1) = W_{t0<-(t-1)}(C^{-1}C{I(s,t-1)}) if (W_{t0<-t}(S_bg(s,t)) == 1) && (S_mosaic(s,t-1) == 0)

Step 9: Calculate Background Texture Residuals from Mosaic Prediction.

If S_bg(s,t)==1, a difference signal is calculated in operation 86 by using mosaic content as a predictor. The resulting ΔI(s,t), calculated in operation 88, is used to compute the difference signal. In operations 68 and 70 in FIG. 4, the sprite-based encoder repeats steps 7-9 above for each identified foreground image pixel. The difference signal over each macroblock is compared to conventional difference signals produced by using prediction from the previous and the next video frame (P or B prediction mode). The macroblock type is selected according to the best prediction mode. The residual signal is transmitted to the decoder along with the compressed background shape as described in "Result of N3 Experiment Using Unified S2/N3 Syntax", Regis J. Crinon and Ibrahim Sezan, Document ISO/IEC JTC1/SC29/WG11/M1404.

ΔI(s,t) = I(s,t) - W_{t<-t0}(M'(s,t-1))

Operations 90 and 92 in FIG. 4 encode background regions in the image I(s,t) where S_bg(s,t)==1 and then start subsequent processing at the beginning pixel of the mosaic.

Step 10: Update Background Shape in Mosaic.
Operations 98-104 in FIG. 4 update the mosaic indicator map to include the shape of the new background.

S_mosaic(s,t) = 1 if S_mosaic(s,t-1) == 1
S_mosaic(s,t) = 1 if (W_{t0<-t}(S_bg(s,t)) == 1) && (S_mosaic(s,t-1) == 0)
S_mosaic(s,t) = 0 otherwise

Step 11: Update Mosaic.
Operations 104 and 106 in FIG. 4 update the content of the mosaic in regions corresponding to new or non-covered background in frame t.

M(s,t) = [1 - α·W_{t0<-t}(S_bg(s,t))]·M'(s,t-1) + α·W_{t0<-t}(S_bg(s,t))·[M'(s,t-1) + W_{t0<-t}(C^{-1}C{ΔI(s,t)})]

The selection of the value of the blending parameter α (0≦α≦1) in the above equation is application dependent. The sprite-based encoder then jumps back to operations 94 and 96 and processes the next pixel of the mosaic. After all background regions have been updated, the segmentation process described above in steps 2-11 is repeated for the next image frame acquired in operation 108.

The method described above builds the mosaic with reference to time t0, which can be a time instant in the past or can be the current time or a future time instant. It is straightforward to rewrite the above equations for the case where the mosaic is continuously warped to the current time instant t.

Referring to FIG. 5, a block diagram of the sprite-based encoder 109 is depicted. The purpose of this drawing is to highlight the dependencies in the various components and quantities used by the method of the invention. It also emphasizes the various warping and un-warping stages that are necessary to align consecutive video fields.

The current image to be encoded 114 is used along with the previous original image 110 to estimate the global
motion (warping) parameters in 116. The output of estimator 116 is transmitted over a communication channel 122 (the encoding of these parameters can take various forms including a series of coded vertex coordinates). The output of estimator 116 is also passed to a warping block 120 which composes (computes accumulated) warping transformations in time. A map indicating regions to be updated 128 and initialized new background 130 are used for generating a background mosaic 134. The image 144, which represents image 114 once it has been segmented into foreground and background regions, is generated from the warped background mosaic buffer 136, the warped mosaic indicator map 133 and identified new background regions 150.
A background shape signal and background update pixel values 164 are generated from the input image 114, the warped mosaic 136, uncovered background in the previous encoded/decoded image 158 and the encoded/decoded foreground shape of the new image 114 being encoded. The coded background shape and texture residuals 166 and the coded foreground shape and texture 170 are transmitted in block 172. The contents from block 172 are decoded in block 156 and warped in blocks 152 and 154. The decoded, warped, previous image from block 152 and the foreground/background-segmented picture 144 are used to calculate the foreground residual signal in block 160. Uncovered background regions are identified in block 158 from the warped decoded previous image generated in block 154 as well as from the current mosaic indicator map 124. The uncovered background regions 153 are used to initialize new mosaic regions in block 130 (this corresponds to the bottom equality in step 8 above).
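The step 11 mosaic update can be sketched directly from its equation (illustrative only; the background shape and residual are assumed already warped to mosaic coordinates, and the α value is made up):

```python
import numpy as np

ALPHA = 0.5   # blending factor α, 0 ≦ α ≦ 1; application dependent

def update_mosaic(M_prime, S_bg_warped, residual_warped, alpha=ALPHA):
    """Step 11 sketch: outside the warped background shape the mosaic
    keeps M'(s,t-1); inside it, the decoded residual is blended in:
    M = (1 - α·S_bg)·M' + α·S_bg·(M' + residual), i.e. M = M' + α·residual
    wherever S_bg is 1."""
    a = alpha * S_bg_warped
    return (1.0 - a) * M_prime + a * (M_prime + residual_warped)

M_prime = np.array([[10.0, 10.0]])
S_bg = np.array([[0.0, 1.0]])      # warped background shape
resid = np.array([[4.0, 4.0]])     # warped decoded residual
M_new = update_mosaic(M_prime, S_bg, resid)
```

Setting alpha to 0 freezes the mosaic entirely, which is exactly the static-sprite mode used by the video conferencing system described next.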
A Mosaic-Based Video Conferencing and Videophone System.
`
FIG. 6 shows a block diagram of a video conferencing system 179 that uses an off-line built background sprite as a dynamic sprite during transmission. Prior to transmission of encoded images, the encoding system 179 conducts a time adjustable configuration phase. During the configuration phase, a switch 190 disables sprite-based encoding of images 180 until a background mosaic is built off-line by an off-line mosaic builder 186. By first building a background mosaic off-line, coding efficiency is increased because the mosaic can incorporate all the background
regions that the videophone system needs to represent in the
`course of the conversational transmission between the two
`end users.
Prior to transmission, a switch 184 connects a video camera 182 to the off-line mosaic builder 186. Small displacements of a foreground object, such as the head and shoulders of image 180, are used to build a background mosaic. The head displacements allow the off-line mosaic builder 186 to identify and build the background image behind the head and shoulders. The displacements of the foreground are voluntary where the system 179 directs the foreground in image 180 to move in different directions until a substantially complete background mosaic is derived in mosaic builder 186. Alternatively, the system 179 does not direct the user to move in different directions, reducing the gain in coding efficiency. The sprite-based encoding technique described above in steps 1-11 is used to build the background mosaic.
The sprite-based codec 188 then uses the mosaic during video transmission as a dynamic sprite with the blending factor set to 0 to prevent any updating of the mosaic. In this case, macroblock types may be dynamic or static. In one extreme case, all macroblocks are static-type macroblocks, meaning that the background mosaic is used as a static sprite. In another extreme case, all macroblocks are dynamic and the mosaic is being used as a dynamic (predictive) sprite. This latter case requires a higher data transmission bandwidth.
A Mosaic-Based Video Database
Referring to FIG. 7, a sprite-based video search system 194 uses the sprite-based encoder 109 to populate and search a database 198 of video bitstreams 202, i.e., a database of compressed bitstreams. The sprite-based encoder 109 compresses video clips 193 using the sprite-based encoding technique described above in steps 1-11. Features of the video clips 193 can no longer be visually identified in the compressed bitstream 202. Thus, a user would have to decompress the video bitstream 202 for all video clips 193 until the correct video clip is located.

However, the sprite-based encoder 109 generates both the compressed bitstream 202 and a mosaic 195 generated during the encoding process. The mosaic image 195 is used as a representative image of the video clip bitstream 202. The features in mosaic 195 are then analyzed and representative features 208 extracted by an analysis engine 196. The representative features 208 are inserted along with the video bitstream into a video database 198. A search