A Quantitative Analysis of Reconfigurable Coprocessors for Multimedia Applications
`
`Takashi Miyamori
`System ULSI Engineering Laboratory
`TOSHIBA Corporation, JAPAN
`miyamori@sdel.toshiba.co.jp
`
`Kunle Olukotun
`Computer Systems Laboratory
`Stanford University
`kunle@ogun.stanford.edu
`
`Abstract
`
Recently, computer architectures that combine a reconfigurable or retargetable coprocessor with a general-purpose microprocessor have been proposed. These architectures are designed to exploit large amounts of fine-grain parallelism in applications. In this paper, we study the performance of the reconfigurable coprocessors on multimedia applications. We compare a Field Programmable Gate Array (FPGA) based reconfigurable coprocessor with the array processor called REMARC (Reconfigurable Multimedia Array Coprocessor). REMARC uses a 16-bit simple processor that is much larger than a Configurable Logic Block (CLB) of an FPGA. We have developed a simulator, a programming environment, and multimedia application programs to evaluate the performance of the two coprocessor architectures. The simulation results show that REMARC achieves
speedups ranging from a factor of 2.3 to 7.3 on these applications. The FPGA coprocessor achieves similar performance improvements. However, the FPGA coprocessor needs more hardware area to achieve the same performance improvement as REMARC.
`
`
`
1 Introduction
`
As the use of multimedia applications increases, it becomes important to achieve high performance on algorithms such as video compression, decompression, and image processing with general-purpose microprocessors. This has motivated the recent addition of multimedia instructions to most general-purpose microprocessor ISAs. These ISA extensions work by segmenting a conventional 64-bit datapath into four 16-bit or eight 8-bit datapaths. The multimedia instructions exploit fine-grain SIMD parallelism by operating on four 16-bit or eight 8-bit data values. However, a 64-bit datapath limits the speedups to a factor of four or eight, even though many multimedia applications have much more inherent parallelism.
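The sub-word SIMD model described above can be sketched in a few lines of software. The sketch below is illustrative only (plain Python standing in for 64-bit MMX-style hardware): four 16-bit lanes are packed into one 64-bit word, so one "add" performs four independent additions.

```python
# Illustrative sketch of MMX-style sub-word SIMD: one 64-bit word
# holds four 16-bit lanes, so a single "add" performs four additions.
def pack4(lanes):
    # Pack four 16-bit values (lane 0 in the low bits) into a 64-bit integer.
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def simd_add16(a, b):
    # Lane-wise 16-bit addition with wrap-around; no carry crosses a lane.
    out = 0
    for i in range(4):
        la = (a >> (16 * i)) & 0xFFFF
        lb = (b >> (16 * i)) & 0xFFFF
        out |= ((la + lb) & 0xFFFF) << (16 * i)
    return out

def unpack4(word):
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]
```

Note how the top lane wraps (0xFFFF + 1 = 0) instead of carrying into a neighbor, which is exactly the lane isolation that limits the speedup to the number of lanes.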
Computer architectures that connect a reconfigurable coprocessor to a general-purpose microprocessor have been proposed. The advantage of this approach is that the coprocessor can be reconfigured to improve the performance of a particular application. All of these proposed architectures use field programmable gate arrays (FPGAs) for the reconfigurable hardware. We refer to this coprocessor as an "FPGA coprocessor" in this paper. The FPGA architecture, which has narrow programmable logic blocks and a programmable interconnection network, provides great flexibility for implementing application-specific hardware. However, the rich programmable interconnection comes at the price of reduced operating frequency and logic density.
Array processors, such as general-purpose systolic array processors, wavefront array processors, PADDI-2, and MATRIX, are other reconfigurable architectures. These processors have 8-bit or 16-bit datapaths, and each programmable logic block has a small instruction RAM that makes it easy to support multiple functions. Because multimedia or DSP applications predominantly manipulate 8-bit or 16-bit data values, these architectures work very well on these applications. Recently, we proposed a new array processor architecture called REMARC (Reconfigurable Multimedia Array Coprocessor). REMARC is a reconfigurable coprocessor that is tightly coupled to a main RISC processor and consists of a global control unit and 64 16-bit simple processors called nano processors.
Neither the FPGA coprocessor nor REMARC is limited to the SIMD parallelism that can be exploited by multimedia extensions such as the Intel MMX. They can exploit various kinds of fine-grain parallelism in multimedia applications. Using more processing resources, they can achieve higher performance than the multimedia extensions. To understand how these two coprocessor architectures compare, in this paper we evaluate the cost and performance of both architectures. The architectures of FPGA coprocessors are still in flux, so we evaluate the performance of the FPGA coprocessor with a varying number of CLBs and vary its cycle time from 1x to 10x that of the main processor. For the performance evaluation, we use detailed simulators and two realistic application programs, DES encryption and MPEG-2 decoding. We also estimate the chip sizes of processors with REMARC and the FPGA coprocessor and compare their performance when the same die size is used for both architectures.
The rest of this paper is organized as follows. In Section 2, we describe the reconfigurable coprocessor architectures, both the REMARC array based and the FPGA based. In Section 3, we show the results of our performance evaluation. In Section 4, we estimate the chip sizes of processors
`
`INTEL - 1009
`
`
`
with REMARC and the FPGA coprocessor. Finally, we conclude in Section 5.
`
2 Reconfigurable Coprocessor Architecture
`
2.1 Architecture Overview
`
Instruction             Operands
rcon (rncon or rgcon)   src
rex                     offset(base)
lduc2                   cop-reg, offset(base)
sduc2                   cop-reg, offset(base)
mtc2                    cop-reg, src
mfc2                    cop-reg, dst
ctc2                    cop-reg, src
cfc2                    cop-reg, dst

Table 1: New Instructions Used in Reconfigurable Coprocessors
`
Figure 1: Block Diagram of a Microprocessor with Reconfigurable Coprocessor (main processor with instruction and data caches, coupled to the reconfigurable coprocessor)
`
Figure 1 shows a block diagram of a microprocessor which includes a reconfigurable coprocessor. The reconfigurable coprocessor consists of a global control unit, coprocessor data registers, and a reconfigurable logic array. Recently, we proposed the REMARC architecture, which includes an 8x8 16-bit processor (nano processor) array as its reconfigurable logic array [13]. The other reconfigurable coprocessor that we consider in this paper, the FPGA coprocessor, uses FPGAs for the reconfigurable logic array. The global control unit controls the execution of the reconfigurable logic array and the transfer of data between the main processor and the reconfigurable logic array through the coprocessor data registers.
We use the MIPS-II ISA [14] as the base architecture of the main processor. The MIPS ISA is extended for REMARC and the FPGA coprocessor using the instructions listed in Table 1. The main processor issues these instructions to the reconfigurable coprocessor, which executes them in a manner similar to a floating-point coprocessor. Unlike a floating-point coprocessor, the functions of reconfigurable coprocessor instructions are configurable (or programmable) so that they can be specialized for specific applications.
The configuration instructions, rcon, rgcon, or rncon, download the configuration data from memory and store them in the reconfigurable coprocessor. The start address of the configuration data is specified by the value of the source register (src). The rex instruction starts execution of a reconfigurable coprocessor instruction. The sum of the offset field and the base register specifies one of the operations to execute. The lduc2 and sduc2 instructions are load and store coprocessor instructions which transfer doubleword (64-bit) data between memory and the reconfigurable coprocessor data registers. The mfc2 and mtc2 instructions transfer word (32-bit) data between the general-purpose registers (integer registers) in the main processor and the reconfigurable coprocessor data registers. The cfc2 and ctc2 instructions transfer data between the integer registers and the reconfigurable coprocessor control registers.

The reconfigurable coprocessors do not have a direct interface to the data cache or memory. The main processor has to set the input data in the coprocessor data registers using lduc2 and mtc2 instructions before execution of rex instructions. Then, the reconfigurable coprocessor reads the input data, executes the operations, and stores the results into the coprocessor data registers. Finally, the main processor reads the results using sduc2 and mfc2 instructions.
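The calling convention above (load inputs, execute, read results) can be illustrated with a toy software model. The class and register names below are ours; the real mechanism is the lduc2/mtc2, rex, sduc2/mfc2 instruction sequence described in the text, not a software API.

```python
# Toy model of the coprocessor calling convention: configure an
# operation, move inputs in, execute, move the result out. Names
# and the configured operation are hypothetical illustrations.
class ReconfigCoprocessor:
    def __init__(self):
        self.data_regs = {}   # coprocessor data registers
        self.ops = {}         # operations installed by configuration

    def rcon(self, offset, fn):
        self.ops[offset] = fn           # "download" a configuration

    def mtc2(self, reg, value):
        self.data_regs[reg] = value     # main processor -> cop register

    def rex(self, offset):
        # Read inputs, execute the configured operation, store result.
        self.data_regs["out"] = self.ops[offset](self.data_regs)

    def mfc2(self, reg):
        return self.data_regs[reg]      # cop register -> main processor
```

For example, configuring offset 0 as a bitwise XOR, writing two inputs, issuing rex, and reading back the result mirrors the sequence a compiled program would issue.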
`
2.2 Pipeline Organization
`
REMARC Pipeline:           | RF | RR | RL | RI | RE | RW |
Main Processor Pipeline:   | F  | D  | E  | M  | W  |
FPGA Coprocessor Pipeline: | RF | RR | RE | RW |

Figure 2: Pipeline Organization of Reconfigurable Coprocessors
`
The pipelines for REMARC, the FPGA coprocessor, and the main processor are shown in Figure 2. The main processor pipeline is similar to the MIPS R3000 and the MIPS R5000 and consists of five stages: Instruction Fetch (F), Instruction Decode (D), Execution (E), Memory Access (M), and Register Write-back (W). The reconfigurable coprocessor pipelines are independent of the main processor pipeline; therefore, the main processor can execute concurrently with the reconfigurable coprocessors.

The REMARC pipeline starts from the M stage of the main processor and has the following six stages:

RF: An instruction of the global control unit is fetched.
RR: The REMARC data registers are read.
RL: The data are aligned or "unpacked".
RI: The instructions of the nano processors are fetched.
`
`
`
`
Figure 4: Nano Processor Architecture (nano PC, 32 x 32-bit nano instruction RAM, 16-bit ALU and data RAM, IR, DR, DIR, and DOR registers, and DINU/DIND/DINL/DINR neighbor inputs, with 32-bit HBUS and VBUS connections)
`
register (IR), eight 16-bit data registers (DR), four 16-bit data input registers (DIR), and a 16-bit data output register (DOR).

Each nano processor can use the DR registers, the DIR registers, and immediate data as the source data of ALU operations. Moreover, it can directly use the DOR registers of the four adjacent nano processors (DINU, DIND, DINL, and DINR) as sources.

The nano processors are also connected by the 32-bit Horizontal Buses (HBUSs) and the 32-bit Vertical Buses (VBUSs). Each bus operates as two 16-bit data buses. The 16-bit data in the DOR register can be sent to the upper or lower half of the VBUS or the HBUS. The HBUSs and the VBUSs allow data to be broadcast to the other nano processors in the same row or column. These buses can reduce the communication overhead between processors separated by long distances.

The DIR registers accept inputs from the HBUS, the VBUS, the DOR, or the four adjacent nano processors. Because the width of the HBUS and the VBUS is 32 bits, data on the HBUS or the VBUS are stored into a DIR register pair. Using the DIR registers, data can be transferred between nano processors during ALU operations.

It takes a half cycle to transfer data using the VBUSs or HBUSs, so they should not be a critical path of the design. Other operations, except for data inputs from the nearest neighbors, are done within the nano processor. Because the width of a nano processor's datapath is only 16 bits, a quarter of that of the general-purpose microprocessors, this careful design keeps REMARC off the critical path of the chip.
`
2.4 FPGA Coprocessor Architecture
`
The reconfigurable logic array of the FPGA coprocessor is composed of configurable logic blocks (CLBs). Each CLB
`
RE: The nano processors execute the instructions.
RW: The executed results are packed and stored into the REMARC data registers.

The FPGA coprocessor pipeline has the following four stages:

RF: The sequencer of the global control unit starts its execution.
RR: The coprocessor data registers are read.
RE: The reconfigurable logic array starts execution.
RW: The execution results are stored into the coprocessor data registers.

The RL and RI stages are unnecessary in the FPGA coprocessor because load alignment and unpack operations are realized directly in the FPGA array, and FPGAs do not have instructions to fetch and execute.
`
2.3 REMARC Architecture
`
Figure 3: Block Diagram of REMARC (global control unit and coprocessor data registers connected to the main processor over 32-bit and 64-bit paths, and an 8x8 nano processor array of rows 0-7 and columns 0-7 connected by HBUS0-HBUS7 and VBUS0-VBUS7; the global control unit broadcasts the nano PC and row/column numbers)
`
In this section, we briefly describe the REMARC architecture; more information can be found in [13]. Figure 3 shows a block diagram of REMARC. REMARC's reconfigurable logic is composed of an 8x8 array of 16-bit processors, called nano processors. The execution of each nano processor is controlled by the instructions stored in its local instruction RAM (nano instruction RAM). However, each nano processor does not directly control which instructions it executes. Every cycle the nano processor receives a PC value, the "nano PC", from the global control unit. All nano processors use the same nano PC and execute the instructions indexed by the nano PC in their nano instruction RAMs.
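The shared nano PC scheme means all nano processors step through the same instruction index, but because each fetches from its own instruction RAM, different processors can perform different operations at the same PC. A minimal sketch of this control scheme (the operations themselves are hypothetical):

```python
# Sketch of shared-nano-PC control: one PC value is broadcast to the
# whole array, but each nano processor fetches from its OWN
# instruction RAM, so the array is not restricted to pure SIMD.
def step(array_irams, regs, nano_pc):
    # array_irams[r][c] is the instruction RAM (a list of operations)
    # of the nano processor at row r, column c; regs holds its state.
    for r in range(len(array_irams)):
        for c in range(len(array_irams[r])):
            instr = array_irams[r][c][nano_pc]  # same index everywhere
            regs[r][c] = instr(regs[r][c])      # different op per PE
    return regs
```

With a 2x1 toy array where one processor's RAM holds an increment and the other's holds a doubling at the same PC, one step updates both registers differently from identical starting values.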
Figure 4 shows the architecture of the nano processor. The nano processor contains a 32-entry nano instruction RAM, a 16-bit ALU, a 16-entry data RAM, an instruction
`
`
`
`
                   REMARC                       FPGA Coprocessor
Processor Size     Large (16-bit datapath)      Small (two 4-1 LUTs)
Num. of Procs.     Small (64 processors)        Large (number of CLBs)
Execution control  Instruction                  Hardwired
Communication      Controlled by instruction    Configured by switch matrix
Interconnection    4 neighbors, VBUS, and HBUS  Short wires and long wires
Cycle Time         Tcpu                         1x, 2x, 5x, 10x Tcpu

Table 2: Coprocessor Architecture Comparison
`
is equal to a CLB of the Xilinx 4000 series. The CLB includes two 4-1 lookup tables (LUTs) and two flip-flops. We do not fix the number of CLBs in the FPGA coprocessor. Instead, we evaluate the performance of the FPGA coprocessor using a varying number of CLBs. The cycle time of the FPGA coprocessor is also parameterized. Cycle times of current FPGA systems are longer than those of microprocessors by factors of five to ten: FPGA systems usually operate at a few tens of MHz, while state-of-the-art microprocessors operate at several hundred MHz. However, the recently proposed reconfigurable coprocessor Garp aims to operate at the same operating frequency as its main processor. Therefore, we assume the cycle time of the FPGA coprocessor could be 5x or 10x that of the main processor for current FPGA architectures and 1x or 2x for future FPGA architectures.
`
2.5 Coprocessor Architecture Comparison
`
Table 2 summarizes the comparison of the two reconfigurable coprocessor architectures. REMARC has larger processing elements, the nano processors, than the FPGA coprocessor. However, the number of nano processors is smaller than the number of CLBs in the FPGA coprocessor. In REMARC, both the execution and the data transfers are controlled by instructions, while these are controlled by hardwired logic in the FPGA coprocessor. REMARC has limited hardware interconnections. Each nano processor has direct inputs from its four nearest neighbors and is connected to two 32-bit data buses. The FPGA coprocessor has more flexible hardware interconnections, which are configured in a bit-wise fashion.

We assume that REMARC operates at the same frequency as the main processor. The cycle time of the FPGA coprocessor (Tfpga) is varied. We evaluate the FPGA coprocessor performance assuming Tfpga values that are 1x, 2x, 5x, and 10x the CPU cycle time (Tcpu).
`
3 Performance Evaluation
`
3.1 Simulation Methodology
`
We developed the reconfigurable coprocessor simulator using the SimOS simulation environment. SimOS models the CPUs, memory systems, and I/O devices in sufficient detail to boot and run a commercial operating system. As the base CPU simulation model, we used "MIPSY", which models a simple single-issue RISC processor similar to the MIPS R5000.
REMARC functions are added to the MIPSY model. The latency of a reconfigurable execution (rex) instruction is the sum of the number of executed global instructions and the pipeline latency cycles. If a following instruction attempts to read the result of a rex instruction, the pipeline stalls for the remaining cycles of the rex instruction's latency (Data Dependency). Furthermore, a following rex instruction stalls if a previous rex instruction is still executing (Resource Conflict).
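The two stall rules amount to a small piece of bookkeeping. The sketch below is our own simplified model of that accounting (function and variable names are ours, not the simulator's):

```python
# Simplified sketch of rex-instruction stall accounting:
#  - Resource Conflict: a second rex waits for the previous rex to finish.
#  - Data Dependency: a reader of the rex result waits out the latency.
def rex_issue(now, busy_until):
    # A rex cannot issue while a previous rex is still executing.
    return max(now, busy_until)

def simulate(rex_latency, issue_time, reader_time, busy_until=0):
    start = rex_issue(issue_time, busy_until)   # resource conflict
    done = start + rex_latency
    stall = max(0, done - reader_time)          # data dependency
    return done, stall
```

For a 10-cycle rex issued at time 0 with a dependent read at time 5, the read stalls 5 cycles; if a prior rex occupies the coprocessor until time 3, completion slips to 13 and the stall grows to 8.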
We evaluate the execution time of the FPGA coprocessor by changing the latencies of the rex instructions. First, we estimate the number of execution cycles of the rex instructions based on the FPGA delay model. In this model, two sequences can be executed within one FPGA cycle:

- One stage of interconnection by long or short wire and one stage of function not using the carry chain
- One stage of interconnection by short wire and any one stage of function

This model is almost the same as Garp's delay model and the XC4000's medium-frequency design. Then, we normalize the number of execution cycles based on the CPU cycle time, assuming the FPGA cycle time is 1x, 2x, 5x, or 10x the CPU cycle time. Finally, we use this estimated cycle count of the rex instruction in the simulator.
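The normalization step is a single multiplication per rex latency. A small helper (the function name is ours) reproduces, for instance, the 64-FPGA-cycle serial f-box loop row that appears later in Table 5:

```python
# Normalize an operation's FPGA cycle count to CPU cycles for a
# given Tfpga/Tcpu ratio, as used to parameterize the simulator.
def to_cpu_cycles(fpga_cycles, ratio):
    return fpga_cycles * ratio

# 64 FPGA cycles for the serial 16-round f-box loop, at each ratio.
serial_row = [to_cpu_cycles(64, r) for r in (1, 2, 5, 10)]
```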
We also developed simulators which can execute multimedia instructions similar to the Intel MMX instruction set extensions.

To make the comparison fair, the same application source code is used for the evaluation, except for the use of the multimedia instructions or the augmented coprocessor instructions. The memory system parameters used throughout the performance evaluations are listed in Table 3. We used the "gcc" compiler with the "-O2" optimization option. This option enables most global compiler optimizations except for loop unrolling and function inlining.
`
I-cache          K bytes, -way set assoc.
D-cache          K bytes, -way set assoc.
L2 cache         K bytes, -way set assoc.
L1 miss penalty  cycles
L2 miss penalty  cycles

Table 3: Memory System Parameters
`
`
`
`
We used a DES encryption program based on the Electronic Codebook (ECB) mode. Although the ECB mode is less secure than the Cipher Block Chaining (CBC) mode, it is more commonly used and its operation can be pipelined.
`
3.2.1 DES Implementation on REMARC
`
We decided to divide the algorithm between the main processor and REMARC. The initial permutation and the final permutation are executed by the main processor, and the 16 rounds of f-box operations are executed by REMARC. Each row of nano processors executes two f-box operations. For instance, the eight nano processors in row 0 execute the first and second rounds, those in row 1 execute the third and fourth rounds, and so on.
`
Operation            CPU Cycles
Data Load
Exp. Permutation
Key XOR
S-box
P-box Permutation
Left XOR
Data Transfer
Total

Table 4: Execution Cycle Breakdown of Two f-box Operations on REMARC
`on REMARC
`
Table 4 shows the execution cycle breakdown of the two f-box operations implemented by the eight nano processors in each row. The 16 f-box operations are pipelined into eight stages by the eight rows of the nano processor array, so REMARC can generate a result of the f-box operations at an interval set by the per-row total in Table 4. As Table 4 shows, because of the limited interconnection of REMARC, more than half of the execution cycles are used for the expansion permutation and the P-box permutation.
`
3.2.2 DES Implementation on the FPGA Coprocessor

First, we estimate the latency and the throughput of one f-box operation. The expansion permutation and the XOR operation with the keys can be executed in one cycle because they can be implemented by long wires and simple logic. The S-box table lookup requires two cycles to execute because it consists of LUTs and MUXs. The P-box permutation and the XOR operation can be executed in one cycle. The total latency of the f-box operation is four cycles, and it can be fully pipelined. Therefore, the maximum throughput of the f-box operation is one result per FPGA cycle.

We assume three distinct cases: implementing one f-box, 16 f-boxes, and all of the DES encryption algorithm, including the 16 f-boxes, the initial permutation, and the final permutation.

In the case of the one f-box implementation, because each f-box operation takes as input the result of the previous f-box operation, the f-box operations cannot be pipelined.
`
3.2 DES Encryption
`
The Data Encryption Standard (DES) is one of the most important encryption algorithms and has been a worldwide standard for over 20 years. It is widely used to provide secure communication over the Internet. DES is also a good application for reconfigurable processors because it has a lot of fine-grained parallelism in the form of bit-level irregular data movements, which make software implementations on conventional microprocessors difficult and inefficient.
`
Figure 5: DES Encryption Algorithm (64-bit plaintext, initial permutation, 16 rounds of f-box operations on the L and R halves with round keys K1-K16, final permutation, 64-bit ciphertext)
`
Figure 6: DES f-box Algorithm (expansion permutation of the 32-bit R half, XOR with the 48-bit key, eight 6-bit-in 4-bit-out S-box lookups, and the 32-bit P-box permutation)
`
Figure 5 shows an outline of the DES algorithm. DES takes as input a 64-bit plaintext. After the initial permutation, there are 16 rounds of "f-box" operations, as shown in Figure 6, including the expansion permutation, XOR with the key, S-box table lookup, P-box permutation, and XOR with the result of the previous round. The final permutation is performed on the 64-bit result of the sixteenth round.
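The round structure in Figure 5 is a standard Feistel network: each round sets L_i = R_{i-1} and R_i = L_{i-1} XOR f(R_{i-1}, K_i). The sketch below is ours and substitutes a toy round function for the real f-box (expansion, S-boxes, P-box), but it preserves the structural property that decryption is the same network run with the round keys reversed.

```python
# Feistel skeleton of DES (Figure 5). toy_f stands in for the real
# f-box; the network structure alone guarantees that running the
# rounds with reversed round keys inverts the cipher.
def toy_f(right, key):
    return (right * 31 + key) & 0xFFFFFFFF  # placeholder round function

def feistel(left, right, keys):
    for k in keys:
        left, right = right, left ^ toy_f(right, k)
    return right, left  # final swap, as after round 16

def encrypt(left, right, keys):
    return feistel(left, right, keys)

def decrypt(left, right, keys):
    return feistel(left, right, keys[::-1])
```

Note that toy_f is never inverted; the XOR cancels it, which is why DES can use non-invertible S-box logic inside each round.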
`
`
`
`
Tfpga        FPGA Coprocessor   FPGA Coprocessor
             (1 f-box)          (16 f-boxes)
Tcpu         64                 1
Tcpu x 2     128                2
Tcpu x 5     320                5
Tcpu x 10    640                10

Table 5: Execution Throughput of 16 f-box Operations on Different FPGA Coprocessors (CPU Cycles)
`
Figure 7: DES Encryption (1M Bytes) Results - Single-Issue Architecture (cycle counts in millions for Base (Single-Issue), +REMARC, +FPGA (1 f-box), +FPGA (16 f-boxes), and +FPGA (All DES), at Tfpga = Tcpu, Tcpu x 2, Tcpu x 5, and Tcpu x 10)
`
It takes 64 FPGA cycles (4 cycles x 16 rounds) to execute the 16 rounds of f-box operations. When the 16 f-boxes are implemented in the FPGA array, the operations can be pipelined. It takes only one FPGA cycle to generate a new result for the 16 f-box operations after the pipeline is filled up. Table 5 normalizes the throughput of the 16 f-box operations in CPU cycles for both the one f-box implementation and the 16 f-box implementation.

When the complete DES encryption algorithm is implemented in the FPGA coprocessor, the initial and final permutations can be fully pipelined and a new 64-bit ciphertext block is generated every FPGA cycle.

To estimate the number of CLBs without wiring overhead, we use the following method. As shown in Figure 6, the f-box operation requires a 48-bit XOR, eight 6-bit-input 4-bit-output LUTs, a 32-bit 4-1 multiplexer, and a 32-bit XOR. A two-bit XOR or a one-bit 4-1 multiplexer fits in one CLB. A 6-4 LUT needs 16 4-1 LUTs, or 8 CLBs. In total, 136 CLBs are required for one f-box operation. The 16 f-box implementation needs 2176 (136 x 16) CLBs. We expect that in an actual implementation more CLBs would be dedicated to the extra wiring which realizes the initial permutation and the final permutation.
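The CLB estimate above can be checked arithmetically from the stated packing rules (two XOR bits or one 4-1 mux bit per CLB, eight CLBs per 6-4 LUT):

```python
# Arithmetic check of the f-box CLB estimate (no wiring overhead).
xor48 = 48 // 2    # 48-bit XOR, two XOR bits per CLB   -> 24 CLBs
sboxes = 8 * 8     # eight 6-4 LUTs, 8 CLBs each        -> 64 CLBs
mux32 = 32         # 32-bit 4-1 mux, one bit per CLB    -> 32 CLBs
xor32 = 32 // 2    # 32-bit XOR                         -> 16 CLBs
fbox_clbs = xor48 + sboxes + mux32 + xor32   # one f-box
des16_clbs = 16 * fbox_clbs                  # sixteen f-boxes
```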
`
Figure 8: DES Encryption (1M Bytes) Results - 2-way Superscalar Architecture (cycle counts in millions for Base (2-Issue), +REMARC, +FPGA (1 f-box), +FPGA (16 f-boxes), and +FPGA (All DES), at Tfpga = Tcpu, Tcpu x 2, Tcpu x 5, and Tcpu x 10)
`
`3.2.3 Performance Evaluation Results
`
Figure 7 shows the results of the DES encryption of 1M bytes of data. The base architecture of the main processor is a single-issue processor. To optimize performance we implemented a software pipelining technique which overlaps operations in the main processor and the reconfigurable coprocessors. REMARC achieves a 7.3 times performance improvement. The FPGA coprocessor achieves the same performance improvement if its execution cycles for the 16 f-box operations are less than 64 CPU cycles. This indicates that the main processor, which executes the initial permutation and the final permutation, limits the performance improvement in these cases.

If the FPGA coprocessor implements all of the DES operations, the execution cycles are reduced to 9.0 M CPU cycles. By reducing the load on the main processor, this implementation delivers a 14.5 times performance improvement over the base processor. Even though the cycle time of the FPGA coprocessor is 10x Tcpu, it takes only 10 CPU cycles to generate one 64-bit encrypted block. This execution time on the FPGA coprocessor is short enough that the main processor still limits the performance.

Using a higher performance main processor is another way to reduce the total execution time. As shown in Figure 8, if a 2-way superscalar processor is used as the main processor, the execution times are reduced from 17.3 M CPU cycles to 14.8 M CPU cycles for the +REMARC, +FPGA (1 f-box) at Tcpu cycle time, and +FPGA (16 f-boxes) implementations.
The other point demonstrated in Figure 7 and Figure 8 is that the one f-box implementation at slow cycle times, especially Tcpu x 5 and Tcpu x 10, delivers only limited performance improvements. Even though the execution time of one f-box operation is only four FPGA cycles, the slow cycle time of the FPGA coprocessor results in
`
`
`
`
small performance gains. Furthermore, the performance is limited by the reconfigurable coprocessor; there is no performance improvement even if the 2-way superscalar main processor is used, as shown in Figure 8.
`
3.3 MPEG-2 Decoding

MPEG-2 is one of the most popular video compression standards. The algorithm for MPEG-2 decoding consists of five stages: Variable Length Decoding (VLD), Inverse Quantization (IQ), Inverse DCT (IDCT), Motion Compensation (MC), and the addition of the IDCT and MC results (Add_Block). We used the mpeg2decode [19] program distributed by the MPEG Software Simulation Group as a base MPEG-2 decoding program. The mpeg2decode program was modified using the reconfigurable coprocessor instructions. The bitstream "mobile" was chosen as a benchmark.
`
Figure 9: Execution Time Breakdown of MPEG-2 Decoding (percentage of execution time spent in MC, Add_Block, IDCT, and VLD&IQ)
`
To determine the execution time breakdown we profiled the MPEG-2 decoding program using the profiling tool pixie [18]. Figure 9 shows the percentage of execution time spent in each stage while decoding 30 frames. The figure indicates that the execution time is distributed among MC, Add_Block, IDCT, and VLD/IQ. This suggests that reconfigurable coprocessors need to support multiple functions to accelerate the entire MPEG-2 decoding. As a first step, however, we investigate the acceleration of the IDCT, which represents 31% of the execution time for MPEG-2 decoding.
`
3.3.1 Acceleration of IDCT
`
We use the fast IDCT algorithm which is the reverse of the DCT algorithm presented in [20]. The 1D-IDCT algorithm consists of 12 multiplications and 32 additions/subtractions. The 8x8 2-D IDCT is computed using a row IDCT followed by a column IDCT. Each of the row and column IDCTs consists of eight 1-D IDCTs.
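Row-column decomposition applies the same 1-D transform first to every row and then to every column of the block. A generic sketch of this evaluation order (the 1-D transform below is a stand-in, not the fast IDCT butterfly itself):

```python
# Row-column evaluation of a separable 2-D transform: apply a 1-D
# transform to each row, then to each column of the result.
def transform_2d(block, t1d):
    rows = [t1d(row) for row in block]                   # row pass
    cols = [t1d([rows[r][c] for r in range(len(rows))])  # column pass
            for c in range(len(rows[0]))]
    # cols[c][r] holds element (r, c); transpose back to row-major.
    return [[cols[c][r] for c in range(len(cols))]
            for r in range(len(cols[0]))]
```

With a reversal as the stand-in 1-D transform, reversing every row and then every column rotates the block by 180 degrees, which makes the two-pass structure easy to verify by hand.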
For REMARC, the 1D-IDCT algorithm is changed slightly. The algorithm consists of 16 multiplications and 30 additions/subtractions to achieve more regular data movements. Each of the eight nano processors in the same row or column calculates a 1-D IDCT; two multiplications are therefore mapped onto each nano processor to maintain an even load balance among the nano processors. First, each of the eight nano processors in the same row calculates the 1-D row IDCT. Then, each of the eight nano processors in the same column calculates the 1-D column IDCT. Because the 2-D IDCT is done in place, there is no need to execute a matrix transposition. In this fashion an 8x8 nano processor array naturally implements the 2-D IDCT, which can be executed in 54 cycles by REMARC. REMARC can thus realize a fast 2-D IDCT implementation which exploits the large amount of parallelism in the algorithm.
We estimate the latency and the throughput of the 1-D IDCT for the FPGA coprocessor. The IDCT algorithm has five stages of additions/subtractions and one stage of multiplications. The bit width of these operations is 16 bits. We assume each stage of additions/subtractions requires two FPGA cycles. The multiplication requires four FPGA cycles to execute using a constant-coefficient multiplier (KCM). Therefore, the latency of the 1-D IDCT is 14 FPGA cycles (2 FPGA cycles x 5 + 4 FPGA cycles). The throughput is restricted to four FPGA cycles by the KCM.
We implement the IDCT in the FPGA array using two models. One model implements only one 1D-IDCT and the matrix transposition circuits; this 1D-IDCT circuit is used 16 times to realize a 2-D IDCT. The other model implements eight 1D-IDCTs and the matrix transposition circuits.

In the first IDCT implementation, eight 1D-IDCTs can be pipelined. It takes 42 FPGA cycles (14 FPGA cycles + 4 FPGA cycles x 7) to execute eight 1D-IDCTs, or a row/column IDCT.

The matrix transposition is required between the row IDCT and the column IDCT, and it breaks the IDCT execution pipeline. The total execution cycle counts for one 2D-IDCT are 84 FPGA cycles for the one 1D-IDCT implementation and 28 FPGA cycles for the eight 1D-IDCT implementation. Table 6 normalizes the execution cycles in CPU cycles.

We estimate the number of CLBs to implement the 1D-IDCT and the matrix transposition. At least 808 CLBs are necessary for the 1D-IDCT and 512 CLBs for the matrix transposition. The implementation of one 1D-IDCT and the matrix transposition requires 1320 CLBs, and the implementation of eight 1D-IDCTs and the matrix transposition requires 6976 CLBs. Again, these estimates include no wiring overhead.
`
Tfpga        FPGA Coprocessor    FPGA Coprocessor
             (1x 1D-IDCT HW)     (8x 1D-IDCT HW)
Tcpu         84                  28
Tcpu x 2     168                 56
Tcpu x 5     420                 140
Tcpu x 10    840                 280

Table 6: 2D-IDCT Execution Cycles (CPU Cycles)
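The Table 6 entries follow directly from the pipeline model above (14-cycle 1-D latency, 4-cycle KCM-limited initiation interval, two row/column passes). A quick check:

```python
# Check of the 2D-IDCT FPGA cycle counts behind Table 6.
latency, interval = 14, 4             # 1-D latency / KCM throughput
one_hw_pass = latency + interval * 7  # eight pipelined 1-D IDCTs
one_hw_2d = 2 * one_hw_pass           # transposition breaks the pipeline
eight_hw_2d = 2 * latency             # eight parallel 1-D IDCTs per pass
table6 = {r: (one_hw_2d * r, eight_hw_2d * r) for r in (1, 2, 5, 10)}
```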
`
Figure 10 shows the execution time of the MPEG-2 decoding for 15 frames of the "mobile" bitstream. Execution times of the processors with the reconfigurable coprocessors are compared with the single-issue base processor and the processor with the multimedia extensions (+MMX).

Although the multimedia extensions reduce the cycle counts of the IDCT part by a factor of 2.7 from the base processor, REMARC and the FPGA coprocessor which operate at the same frequency as the main processor achieve
`
`
`
`
Figure 10: MPEG-2 Decoding Results (cycle counts in millions, split into IDCT and Others, for the base processor, +MMX, +REMARC, and the FPGA coprocessor implementations at Tfpga = Tcpu, Tcpu x 2, Tcpu x 5, and Tcpu x 10)