throbber
Efficient Implementation of the Two Dimensional Discrete Cosine
`Transform for Image Coding applications on the
`DSP96002 Processor
`
`Srinath V. Ramaswamy and Gerald D. Miller
`Department of Electrical Engineering
`Northern Illinois University, DeKalb, Illinois 60115
`
`Abstract-thisp paper describes the @cient implementation of
`the 2-0 DCT for image coding on the DSPY6002 processor. The
`DSP96002 is a general purpose, dual-bus IEEE flouting point
`digital signal processor. Utilizing the DSPY6(W2 :s inharent pural-
`lelprocessing capubilities, the execution of U 8 n 8 fust 2 - 0 DCT
`takes 133 microseconds. The recently proposed 2-0 und 1 -D fast
`DCTalgorithms are employed in this implementation. Transform
`coencient zigzag ordering, used in the image coding process,
`executes in less than 28 microseconds. The fust DSPY6002 rou-
`tines, incorporated within this implementution, cun be upplied in
`a number of image coding applications such us video und still
`i m g e coding as well us the newer JPEG still inzage and MPEG
`video standards.
`
`INTRODUCTION
`The Discrete Cosine Transform@CT) is widely used in digi-
`tal signal processing applications such as speech and image com-
`pression. The DCT’s performance approaches that of the, difficult
`to implement, statistically optimal Karhunen-Loeve Transform
`[l]. The DCT application to the compression of a highly corre-
`lated image results in the image’s energy being packed into fewer
`coefficients than other transform approaches. The DCT’s adapta-
`tion, in the Joint Photographer Expert Group’s(JPEG) and Motion
`Picture Experts Group’s(MPEG) image compression standards,
`as a primary compression tool has added to its importance [2].
`The DCT’s popularity has stimulated the development of sev-
`eral fast 2-D DCT algorithms. Notable among these are the algo-
`rithms developed by Vetrelli, Duh‘amel, Hou [ 13 and recently by
`Cho and Lee[3]. These fast 2-D DCT algorithm are themselves
`generated by 1-D DCTs. The 2 4 DCT’s execution time is espe-
`cially important in several real time applications such as video
`teleconferencing and video compression. The 2-D DCT’s exhibit
`slow execution times on general purpose processors when com-
`pared to their execution on DSPs or application specific chips.
`However because the 2-D DCT is often used in conjunction with
`numerous other DSP algorithms, the usefulness of specialized
`DCT chips is greatly diminished[4].
`The fast 2-D DCT algorithm proposed by Cho and Lee is one
`of the fastest algorithm proposed to date. This algorithm’s highly
`regular and systematic computational structure lends itself to
`DSP implementation. This fast DCT algorithm is implemented on
`Motorala’s DSP96002 processor. The DSPj6002 processor is a
`general purpose, IEEE floating point DSP that provides several
`
`hardware features and software instructions which aid in fast
`algorithm implementations.
`In image coding applications, the digital image is usually par-
`titioned into 8 x 8 pixel blocks. These 8 x 8 pixel blocks are then
`sequentially compresseddecompressed. The coefficients gener-
`ated by Cho and Lee’s fast 2-D DCT algorithm for an 8 x 8 input
`pixel block are entropy coded. To facilitate entropy coding, the
`coefficients are rearranged into a zigzag order. This procedure
`arranges the coefficients in the order of increasing frequency. The
`2-D DCT’s memory requirements and execution time as well as
`zigzag ordering are addressed in this paper.
`
`DSP96002 CHARACTERISTICS
`The DSP96002 processor possesses several architectural fea-
`tures that allow efficient fast transform implementations[5,6].
`The DSW6002 has a dual Harvard architecture and conforms to
`the IEEE 754 floating point standard. The X and Y data memories
`can be simultaneously accessed via separate address and data
`buses. By efficiently utilizing the two data memories, processor
`throughput is significantly increased. The three (3) DSP96002
`CPU execution units, namely the Data Arithmetic Logic, Address
`Generation and Program Control Units, operate in parallel. This
`organization allows parallel data moves together with concurrent
`ALU operations. For example, results can be stored from and
`loaded in the data registers simultaneously with ALU operations.
`In addition, the on-chip dual channel DMA controller also per-
`mits block oriented I/O operations in parallel with CPU execu-
`tions.
`
`FAST 2-D DCT ALGORITHMAND and
`its IMPLEMENTATION
`Cho and Lee’s fast 2-D DCT algorithm was selected for
`implementation[3]. Because of this algorithm’s highly regular
`and systematic structure, it is more suitable for VLSI implemen-
`tation than many other algorithms. This fast algorithm generates
`the 2-D DCT of an 8 x 8 input sequence by means of eight 1-D
`DCTs, thereby reducing the number of multiplications by 50%.
`However, from a DSP perspective, the multiplication operation
`takes the same amount of time as the addition operation. There-
`fore, the reduction in the number of multiplications is not the pri-
`mary criteria for this algorithm’s selection. Its attractiveness is its
`
`CH 3381-1/93/$01.00 01993 IEEE
`
`96
`
`Petitioners HTC and LG - Exhibit 1039, p. 96
`HTC and LG v. PUMA, IPR2015-01501
`
`

`
`highly regular and systematic structure which, computationally,
`involves numerous butterfly computations.
`The butterfly computation is illustrated in Figure 1. The x and
`y input data, and the z and w results are stored respectively in the
`X and Y memory spectrums. By using the FADDSUB instruction,
`the s u m and difference of x and y can be simultaneously com-
`puted. By means of parallel data move operations, the results can
`be stored, and the next set of input data can be concurrently
`retrieved as well. since the butterfly’s computation complexity is
`equivalent to only three instructions, it forms the basis for fast
`implementations.
`
`A
`
`Figure 1. Buttefly Structure
`
`The Signal Flow Graphs6FGs) utilized by Cho and Lee’s fast
`2-D DCT algorithm are illustrated in Figures 2 , 3 and 4. The fast
`2-D DCT algorithm’s execution time is further reduced by utiliz-
`ing a fast 1-D DCr algorithm to implement the DCT modules as
`illustrated in Figure 2. B. G. Lee’s proposed fast 1-D DCT algo-
`rithm [7] is utilized in this paper. The DCT module’s input data
`for tbefpl computations is stored in the X memory space, while
`the input data for the gpl computations is stored in the Y memory
`space. These two memory spaces are simultaneously accessed by
`the DSP96002. The DSP96002 can also perform a multiplication,
`addition and subtraction in a single instruction. This capability is
`used to concurrently execute two DCT modules: one computesfpl
`while the other computes gpl .
`The SFGs illustrate that a complicated data access scheme is
`required. Some of these complications are simplified by utilizing
`the addressing modes and modifiers available on the DSPO6002
`processor. For exzunple, the input SFG values, in Figure 1 have
`indices consistent with the following set of equations:
`
`or
`
`i= 0, 1, 2, -., 7
`for p = 1, 3, 5 , .-, 7
`The DSP96002’s modulo arithmetic addressing facility is
`used to efficiently implement equation 1. The large number of
`internal registers malres it feasible to in-place reorder some of the
`input data. B. G. Lee’s fast 1-D DCT outputs bit-reversed ordered
`data, which is efficiently unscrambled using bit-reversed adclress
`modifiers.
`The DCT module’s reordered output coefficients are distrib-
`uted between the two separate X and Y data memories. This sce-
`nario efficiently implements
`the computational structure
`
`illustrated in Figures 3 and 4. Figures 3 and 4 SFG’s output coef-
`ficients are rearranged utilizing DSP96002’s addressing capabili-
`ties. If the scale factors are ignored, the inverse 2-D DCT’s signal
`flow graph is the same as the forward transform. Consequently.
`the inverse transform is similarly implemented except for some
`necessary scaling factor modilications. Nested hardware DO
`loops are used instead of straight line coding since they involve
`little overhead.
`
`xm.
`
`-i
`
`Figure 2. Signal Flow Graph from xij to& and gpl
`
`Figure 3. Signal Flow Graph fromfpl to Ym,
`Where n is even
`
`91
`
`Petitioners HTC and LG - Exhibit 1039, p. 97
`HTC and LG v. PUMA, IPR2015-01501
`
`

`
`left while those containing an odd number of points are directed
`upward to the right. Consequently, a single loop is constructed for
`reordering elements in a line with an even number of elements,
`and a second loop in constructed of lines with an odd number of
`elements. By using the DSP96002’s register indirect addressing
`capabilities, where the offset register is loaded with constant dis-
`tance dderived above, the zigzag ordering of the two-dimensional
`array of elements was implemented efficiently and quickly. A line
`with an even number of elements is referenced using a post incre-
`ment with an offset. A line with an odd number of elements if ref-
`erenced using a post decrement with an offset. The zigzag
`ordering facilitates the entropy coding process by rearranging the
`coefficients in the order of increasing frequency.
`
`EXTERNAL DATA TRANSFERS
`The externally stored 8 x 8 pixel blocks are sequentially trans-
`ferred to the DSW6002’s internal memory utilizing the on-chip
`dual channel DMA controller. Similarly, the coefficients gener-
`ated by the 2-D DCT routine are transferred to the external mem-
`ory via the DMA. Since the DMA controller and CPU operate
`concurrently, these two transfers can take place in parallel. There-
`fore, the routines which execute the image sample coefficient
`transfers between the internal and external memories involve
`very low overhead.
`
`RESULTS
`The forward and reverse 8 x 8 DCT execution times andmem-
`ory requirements are summarized in Table 1. In contrast to the
`forward transform, some of the inverse transform instructions do
`not utilize the DSP96002’s inherent parallelism. These imple-
`mentation considerations account for the difference between the
`program word requirements of the forward and inverse transform
`computations.
`
`Routine
`
`Memory Requirements
`x
`Program
`Y
`Space Space
`Words
`
`I
`
`Execution
`Time
`(micro-
`seconds)
`
`i
`
`Forward
`g x 8 K T
`
`Inverse
`g x 8 D C ~
`
`Forward
`zigzag
`
`Inverse
`zigazg
`
`629
`
`192
`
`72
`
`133.3
`
`682
`
`200
`
`256
`
`131.05
`
`58
`
`60
`
`64
`
`I 64
`
`64
`
`64
`
`27.85
`
`27.85
`
`Table 1. Memory Requirements and
`Execution Times of the DCT and Zigazg Routines
`
`Figure 4. Signal Flow Graph from gpi to y,,,,,,
`where n is odd
`
`EFFICIENT ZIGZAG CODING
`IMPLEMENTATION
`The coefficients generated by the 2-D DCT are rearranged
`into a zigzag order as shown in Figure 5. Prior to zigzag reorder-
`ing, the coefficients are quantized using a quantization table.
`The two-dimensional array elements are stored horizontally in
`rows. The elements are reordered in a zigzag fashion as dictated
`by the lines connecting the elements. Each 8 x 8 array element is
`denoted by x i j , where i indicates the row and j indicates the col-
`umn. The adjacent line elements are denoted by x;,~ and xi+l,,.l.
`These adjacent line elements are separated by a constant distance:
`d = [ ( i - 1) - ( i
`which reduces to:
`
`I
`A cm
`Figure 5. Zig Zag Sequence
`
`I
`
`An inspection of altemate diagonal lines in Figure 5, reveals
`that they contain an even/odd number of elements. Lines which
`contain an even number of points iue directed downward to the
`
`98
`
`Petitioners HTC and LG - Exhibit 1039, p. 98
`HTC and LG v. PUMA, IPR2015-01501
`
`

`
`The zigzag routine’s execution time and memory require-
`ments, associated with the 2-D DCT routine, are also provided in
`Table 1. The forward and inverse DCT transform with the zigzag
`feature results in total computation times of 161 and 159 micro-
`seconds respectively (with a 40 MHz clock). Since the DSP96002
`contains 1024 words of full speed on-chip program memory, it
`can accommodate and quickly execute both the forward and
`inverse DCT routines which incorporate the zigzag feature.
`
`SUMMARY
`The DSF96002 is a general purpose, IEEE Hoating point dig-
`ital signal processor containing hardware and software feaqtures
`which lend themselves to the efficient implementation of fast
`DCTs. An efficient implementation of the 8 x 8 pixel 2-D DCT
`and zigzag ordering have been described. These implementitions
`use the most recently proposed fast DCT algorithms. The fast
`DCT and zigzag routines execute in 133 and 28 microseconds
`respectively. These fast routines can be utilized in image coding
`applications; these routines are especially useful in those applica-
`tions which adhere to the JPEG N P E G image coding stiindards.
`
`REFERENCES
`[I] Ran, K. R. and Yap, P. “Discrete Cosine Tr:msform. Algo-
`rithm, Advaikges, Applications.” Acodeniic Press, S:IU Diego.
`1990
`[2] Watchcases G. K. “The JPEG still Picture Compression Stan-
`dard.” IEEE Trunsuctions on Consumrr ElPctronics. Vo1.38 No.1,
`I:ebruary 1992, pp.xvii-sixties.
`[3] Cho, N. I. and Lee. S. LJ. “Fast Algorithm and IMplementTtion
`of 2-D discrete Cosine Transform.” IEEE Trunsuctions on Cir-
`cuits and Systenis. Vo1.38. No.3, March 1991, pp207-305.
`[4] David Shear, “EDN’s DSP Benchmarks” EDN. September
`29,1988.
`151 DSP96002, IEEE Flouting-Point Dud-Porr Pmcessor User’s
`Munual. MOtorala, inc., 1989.
`[6] Guy R. I,. solei and Kevin I,. Kookier, “A Digital Signal Pro-
`cessor with IEEE Floating-point Arithmetic”. IEEE Micro. W.8,
`No.6, December 1988, pop. 49-67,
`171 Lee. B. G. “A New Algorithm to Compute the Discrete Cosine
`Transform.” IEEE Truns. Acoustics, Speech, and Signirl Process-
`ing. Val. ASSP-32. No.6, December 1984, pop 1243-124s.
`
`99
`
`Petitioners HTC and LG - Exhibit 1039, p. 99
`HTC and LG v. PUMA, IPR2015-01501

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket