`Transform for Image Coding applications on the
`DSP96002 Processor
`
`Srinath V. Ramaswamy and Gerald D. Miller
`Department of Electrical Engineering
`Northern Illinois University, DeKalb, Illinois 60115
`
`Abstract-thisp paper describes the @cient implementation of
`the 2-0 DCT for image coding on the DSPY6002 processor. The
`DSP96002 is a general purpose, dual-bus IEEE flouting point
`digital signal processor. Utilizing the DSPY6(W2 :s inharent pural-
`lelprocessing capubilities, the execution of U 8 n 8 fust 2 - 0 DCT
`takes 133 microseconds. The recently proposed 2-0 und 1 -D fast
`DCTalgorithms are employed in this implementation. Transform
`coencient zigzag ordering, used in the image coding process,
`executes in less than 28 microseconds. The fust DSPY6002 rou-
`tines, incorporated within this implementution, cun be upplied in
`a number of image coding applications such us video und still
`i m g e coding as well us the newer JPEG still inzage and MPEG
`video standards.
`
`INTRODUCTION
`The Discrete Cosine Transform@CT) is widely used in digi-
`tal signal processing applications such as speech and image com-
`pression. The DCT’s performance approaches that of the, difficult
`to implement, statistically optimal Karhunen-Loeve Transform
`[l]. The DCT application to the compression of a highly corre-
`lated image results in the image’s energy being packed into fewer
`coefficients than other transform approaches. The DCT’s adapta-
`tion, in the Joint Photographer Expert Group’s(JPEG) and Motion
`Picture Experts Group’s(MPEG) image compression standards,
`as a primary compression tool has added to its importance [2].
`The DCT’s popularity has stimulated the development of sev-
`eral fast 2-D DCT algorithms. Notable among these are the algo-
`rithms developed by Vetrelli, Duh‘amel, Hou [ 13 and recently by
`Cho and Lee[3]. These fast 2-D DCT algorithm are themselves
`generated by 1-D DCTs. The 2 4 DCT’s execution time is espe-
`cially important in several real time applications such as video
`teleconferencing and video compression. The 2-D DCT’s exhibit
`slow execution times on general purpose processors when com-
`pared to their execution on DSPs or application specific chips.
`However because the 2-D DCT is often used in conjunction with
`numerous other DSP algorithms, the usefulness of specialized
`DCT chips is greatly diminished[4].
`The fast 2-D DCT algorithm proposed by Cho and Lee is one
`of the fastest algorithm proposed to date. This algorithm’s highly
`regular and systematic computational structure lends itself to
`DSP implementation. This fast DCT algorithm is implemented on
`Motorala’s DSP96002 processor. The DSPj6002 processor is a
`general purpose, IEEE floating point DSP that provides several
`
`hardware features and software instructions which aid in fast
`algorithm implementations.
`In image coding applications, the digital image is usually par-
`titioned into 8 x 8 pixel blocks. These 8 x 8 pixel blocks are then
`sequentially compresseddecompressed. The coefficients gener-
`ated by Cho and Lee’s fast 2-D DCT algorithm for an 8 x 8 input
`pixel block are entropy coded. To facilitate entropy coding, the
`coefficients are rearranged into a zigzag order. This procedure
`arranges the coefficients in the order of increasing frequency. The
`2-D DCT’s memory requirements and execution time as well as
`zigzag ordering are addressed in this paper.
`
`DSP96002 CHARACTERISTICS
`The DSP96002 processor possesses several architectural fea-
`tures that allow efficient fast transform implementations[5,6].
`The DSW6002 has a dual Harvard architecture and conforms to
`the IEEE 754 floating point standard. The X and Y data memories
`can be simultaneously accessed via separate address and data
`buses. By efficiently utilizing the two data memories, processor
`throughput is significantly increased. The three (3) DSP96002
`CPU execution units, namely the Data Arithmetic Logic, Address
`Generation and Program Control Units, operate in parallel. This
`organization allows parallel data moves together with concurrent
`ALU operations. For example, results can be stored from and
`loaded in the data registers simultaneously with ALU operations.
`In addition, the on-chip dual channel DMA controller also per-
`mits block oriented I/O operations in parallel with CPU execu-
`tions.
`
`FAST 2-D DCT ALGORITHMAND and
`its IMPLEMENTATION
`Cho and Lee’s fast 2-D DCT algorithm was selected for
`implementation[3]. Because of this algorithm’s highly regular
`and systematic structure, it is more suitable for VLSI implemen-
`tation than many other algorithms. This fast algorithm generates
`the 2-D DCT of an 8 x 8 input sequence by means of eight 1-D
`DCTs, thereby reducing the number of multiplications by 50%.
`However, from a DSP perspective, the multiplication operation
`takes the same amount of time as the addition operation. There-
`fore, the reduction in the number of multiplications is not the pri-
`mary criteria for this algorithm’s selection. Its attractiveness is its
`
`CH 3381-1/93/$01.00 01993 IEEE
`
`96
`
`Petitioners HTC and LG - Exhibit 1039, p. 96
`HTC and LG v. PUMA, IPR2015-01502
`
`
`
`highly regular and systematic structure which, computationally,
`involves numerous butterfly computations.
`The butterfly computation is illustrated in Figure 1. The x and
`y input data, and the z and w results are stored respectively in the
`X and Y memory spectrums. By using the FADDSUB instruction,
`the s u m and difference of x and y can be simultaneously com-
`puted. By means of parallel data move operations, the results can
`be stored, and the next set of input data can be concurrently
`retrieved as well. since the butterfly’s computation complexity is
`equivalent to only three instructions, it forms the basis for fast
`implementations.
`
`A
`
`Figure 1. Buttefly Structure
`
`The Signal Flow Graphs6FGs) utilized by Cho and Lee’s fast
`2-D DCT algorithm are illustrated in Figures 2 , 3 and 4. The fast
`2-D DCT algorithm’s execution time is further reduced by utiliz-
`ing a fast 1-D DCr algorithm to implement the DCT modules as
`illustrated in Figure 2. B. G. Lee’s proposed fast 1-D DCT algo-
`rithm [7] is utilized in this paper. The DCT module’s input data
`for tbefpl computations is stored in the X memory space, while
`the input data for the gpl computations is stored in the Y memory
`space. These two memory spaces are simultaneously accessed by
`the DSP96002. The DSP96002 can also perform a multiplication,
`addition and subtraction in a single instruction. This capability is
`used to concurrently execute two DCT modules: one computesfpl
`while the other computes gpl .
`The SFGs illustrate that a complicated data access scheme is
`required. Some of these complications are simplified by utilizing
`the addressing modes and modifiers available on the DSPO6002
`processor. For exzunple, the input SFG values, in Figure 1 have
`indices consistent with the following set of equations:
`
`or
`
`i= 0, 1, 2, -., 7
`for p = 1, 3, 5 , .-, 7
`The DSP96002’s modulo arithmetic addressing facility is
`used to efficiently implement equation 1. The large number of
`internal registers malres it feasible to in-place reorder some of the
`input data. B. G. Lee’s fast 1-D DCT outputs bit-reversed ordered
`data, which is efficiently unscrambled using bit-reversed adclress
`modifiers.
`The DCT module’s reordered output coefficients are distrib-
`uted between the two separate X and Y data memories. This sce-
`nario efficiently implements
`the computational structure
`
`illustrated in Figures 3 and 4. Figures 3 and 4 SFG’s output coef-
`ficients are rearranged utilizing DSP96002’s addressing capabili-
`ties. If the scale factors are ignored, the inverse 2-D DCT’s signal
`flow graph is the same as the forward transform. Consequently.
`the inverse transform is similarly implemented except for some
`necessary scaling factor modilications. Nested hardware DO
`loops are used instead of straight line coding since they involve
`little overhead.
`
`xm.
`
`-i
`
`Figure 2. Signal Flow Graph from xij to& and gpl
`
`Figure 3. Signal Flow Graph fromfpl to Ym,
`Where n is even
`
`91
`
`Petitioners HTC and LG - Exhibit 1039, p. 97
`HTC and LG v. PUMA, IPR2015-01502
`
`
`
`left while those containing an odd number of points are directed
`upward to the right. Consequently, a single loop is constructed for
`reordering elements in a line with an even number of elements,
`and a second loop in constructed of lines with an odd number of
`elements. By using the DSP96002’s register indirect addressing
`capabilities, where the offset register is loaded with constant dis-
`tance dderived above, the zigzag ordering of the two-dimensional
`array of elements was implemented efficiently and quickly. A line
`with an even number of elements is referenced using a post incre-
`ment with an offset. A line with an odd number of elements if ref-
`erenced using a post decrement with an offset. The zigzag
`ordering facilitates the entropy coding process by rearranging the
`coefficients in the order of increasing frequency.
`
`EXTERNAL DATA TRANSFERS
`The externally stored 8 x 8 pixel blocks are sequentially trans-
`ferred to the DSW6002’s internal memory utilizing the on-chip
`dual channel DMA controller. Similarly, the coefficients gener-
`ated by the 2-D DCT routine are transferred to the external mem-
`ory via the DMA. Since the DMA controller and CPU operate
`concurrently, these two transfers can take place in parallel. There-
`fore, the routines which execute the image sample coefficient
`transfers between the internal and external memories involve
`very low overhead.
`
`RESULTS
`The forward and reverse 8 x 8 DCT execution times andmem-
`ory requirements are summarized in Table 1. In contrast to the
`forward transform, some of the inverse transform instructions do
`not utilize the DSP96002’s inherent parallelism. These imple-
`mentation considerations account for the difference between the
`program word requirements of the forward and inverse transform
`computations.
`
`Routine
`
`Memory Requirements
`x
`Program
`Y
`Space Space
`Words
`
`I
`
`Execution
`Time
`(micro-
`seconds)
`
`i
`
`Forward
`g x 8 K T
`
`Inverse
`g x 8 D C ~
`
`Forward
`zigzag
`
`Inverse
`zigazg
`
`629
`
`192
`
`72
`
`133.3
`
`682
`
`200
`
`256
`
`131.05
`
`58
`
`60
`
`64
`
`I 64
`
`64
`
`64
`
`27.85
`
`27.85
`
`Table 1. Memory Requirements and
`Execution Times of the DCT and Zigazg Routines
`
`Figure 4. Signal Flow Graph from gpi to y,,,,,,
`where n is odd
`
`EFFICIENT ZIGZAG CODING
`IMPLEMENTATION
`The coefficients generated by the 2-D DCT are rearranged
`into a zigzag order as shown in Figure 5. Prior to zigzag reorder-
`ing, the coefficients are quantized using a quantization table.
`The two-dimensional array elements are stored horizontally in
`rows. The elements are reordered in a zigzag fashion as dictated
`by the lines connecting the elements. Each 8 x 8 array element is
`denoted by x i j , where i indicates the row and j indicates the col-
`umn. The adjacent line elements are denoted by x;,~ and xi+l,,.l.
`These adjacent line elements are separated by a constant distance:
`d = [ ( i - 1) - ( i
`which reduces to:
`
`I
`A cm
`Figure 5. Zig Zag Sequence
`
`I
`
`An inspection of altemate diagonal lines in Figure 5, reveals
`that they contain an even/odd number of elements. Lines which
`contain an even number of points iue directed downward to the
`
`98
`
`Petitioners HTC and LG - Exhibit 1039, p. 98
`HTC and LG v. PUMA, IPR2015-01502
`
`
`
`The zigzag routine’s execution time and memory require-
`ments, associated with the 2-D DCT routine, are also provided in
`Table 1. The forward and inverse DCT transform with the zigzag
`feature results in total computation times of 161 and 159 micro-
`seconds respectively (with a 40 MHz clock). Since the DSP96002
`contains 1024 words of full speed on-chip program memory, it
`can accommodate and quickly execute both the forward and
`inverse DCT routines which incorporate the zigzag feature.
`
`SUMMARY
`The DSF96002 is a general purpose, IEEE Hoating point dig-
`ital signal processor containing hardware and software feaqtures
`which lend themselves to the efficient implementation of fast
`DCTs. An efficient implementation of the 8 x 8 pixel 2-D DCT
`and zigzag ordering have been described. These implementitions
`use the most recently proposed fast DCT algorithms. The fast
`DCT and zigzag routines execute in 133 and 28 microseconds
`respectively. These fast routines can be utilized in image coding
`applications; these routines are especially useful in those applica-
`tions which adhere to the JPEG N P E G image coding stiindards.
`
`REFERENCES
`[I] Ran, K. R. and Yap, P. “Discrete Cosine Tr:msform. Algo-
`rithm, Advaikges, Applications.” Acodeniic Press, S:IU Diego.
`1990
`[2] Watchcases G. K. “The JPEG still Picture Compression Stan-
`dard.” IEEE Trunsuctions on Consumrr ElPctronics. Vo1.38 No.1,
`I:ebruary 1992, pp.xvii-sixties.
`[3] Cho, N. I. and Lee. S. LJ. “Fast Algorithm and IMplementTtion
`of 2-D discrete Cosine Transform.” IEEE Trunsuctions on Cir-
`cuits and Systenis. Vo1.38. No.3, March 1991, pp207-305.
`[4] David Shear, “EDN’s DSP Benchmarks” EDN. September
`29,1988.
`151 DSP96002, IEEE Flouting-Point Dud-Porr Pmcessor User’s
`Munual. MOtorala, inc., 1989.
`[6] Guy R. I,. solei and Kevin I,. Kookier, “A Digital Signal Pro-
`cessor with IEEE Floating-point Arithmetic”. IEEE Micro. W.8,
`No.6, December 1988, pop. 49-67,
`171 Lee. B. G. “A New Algorithm to Compute the Discrete Cosine
`Transform.” IEEE Truns. Acoustics, Speech, and Signirl Process-
`ing. Val. ASSP-32. No.6, December 1984, pop 1243-124s.
`
`99
`
`Petitioners HTC and LG - Exhibit 1039, p. 99
`HTC and LG v. PUMA, IPR2015-01502