A Quantitative Analysis of Reconfigurable Coprocessors
for Multimedia Applications

Takashi Miyamori
System ULSI Engineering Laboratory
TOSHIBA Corporation, JAPAN
miyamori@sdel.toshiba.co.jp

Kunle Olukotun
Computer Systems Laboratory
Stanford University
kunle@ogun.stanford.edu

Abstract

Recently, computer architectures that combine a reconfigurable or retargetable coprocessor with a general-purpose microprocessor have been proposed. These architectures are designed to exploit large amounts of fine grain parallelism in applications. In this paper, we study the performance of reconfigurable coprocessors on multimedia applications. We compare a Field Programmable Gate Array (FPGA) based reconfigurable coprocessor with the array processor called REMARC (Reconfigurable Multimedia Array Coprocessor). REMARC uses a 16-bit simple processor that is much larger than a Configurable Logic Block (CLB) of an FPGA. We have developed a simulator, a programming environment, and multimedia application programs to evaluate the performance of the two coprocessor architectures. The simulation results show that REMARC achieves speedups ranging from a factor of . to . on these applications. The FPGA coprocessor achieves similar performance improvements. However, the FPGA coprocessor needs more hardware area to achieve the same performance improvement as REMARC.
`
`
`
1 Introduction
`
As the use of multimedia applications increases, it becomes important to achieve high performance on algorithms such as video compression, decompression, and image processing with general-purpose microprocessors. This has motivated the recent addition of multimedia instructions to most general-purpose microprocessor ISAs. These ISA extensions work by segmenting a conventional 64-bit datapath into four 16-bit or eight 8-bit datapaths. The multimedia instructions exploit fine grain SIMD parallelism by operating on four 16-bit or eight 8-bit data values. However, a 64-bit datapath limits the speedups to a factor of four or eight even though many multimedia applications have much more inherent parallelism.
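To make the subword style of parallelism concrete, the following C sketch (ours, not from the paper) shows the effect of one packed 8-bit addition across a 64-bit word; an MMX-style multimedia instruction performs this in a single hardware operation.

    #include <stdint.h>

    /* Add eight unsigned 8-bit lanes packed in a 64-bit word, with per-lane
       wraparound (no carries crossing lane boundaries). */
    uint64_t paddb(uint64_t a, uint64_t b)
    {
        const uint64_t H = 0x8080808080808080ULL;   /* the MSB of each lane        */
        uint64_t sum_low = (a & ~H) + (b & ~H);     /* add the low 7 bits per lane */
        return sum_low ^ ((a ^ b) & H);             /* restore each lane's MSB     */
    }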
Computer architectures that connect a reconfigurable coprocessor to a general-purpose microprocessor have been proposed. The advantage of this approach is that the coprocessor can be reconfigured to improve the performance of a particular application. All of these proposed architectures use field programmable gate arrays (FPGAs) for the reconfigurable hardware. We refer to this coprocessor as an "FPGA coprocessor" in this paper. The FPGA architecture, which has narrow programmable logic blocks and a programmable interconnection network, provides great flexibility for implementing application specific hardware. However, the rich programmable interconnection comes at the price of reduced operating frequency and logic density.
Array processors, such as general-purpose systolic array processors, wavefront array processors, PADDI-2, and MATRIX, are other reconfigurable architectures. These processors have 8-bit or 16-bit datapaths, and each programmable logic block has a small instruction RAM that makes it easy to support multiple functions. Because multimedia and DSP applications predominantly manipulate 8-bit or 16-bit data values, these architectures work very well on such applications. Recently, we proposed a new array processor architecture called REMARC (Reconfigurable Multimedia Array Coprocessor). REMARC is a reconfigurable coprocessor that is tightly coupled to a main RISC processor and consists of a global control unit and 64 16-bit simple processors called nano processors.
Both the FPGA coprocessor and REMARC are not limited to the SIMD parallelism that can be exploited by multimedia extensions such as the Intel MMX. They can exploit various kinds of fine grain parallelism in multimedia applications. Using more processing resources, they can achieve higher performance than the multimedia extensions. To understand how these two coprocessor architectures compare, in this paper we evaluate the cost and performance of both architectures. The architectures of FPGA coprocessors are still in flux, so we evaluate the performance of the FPGA coprocessor with a varying number of CLBs and vary the cycle time of the FPGA coprocessor from 1x to 10x that of the main processor. For the performance evaluation, we use detailed simulators and two realistic application programs, DES encryption and MPEG-2 decoding. We also estimate the chip sizes of processors with REMARC and the FPGA coprocessor and compare their performance when the same die size is used for both architectures.
The rest of this paper is organized as follows. In Section 2, we describe the reconfigurable coprocessor architectures, both REMARC array based and FPGA based. In Section 3, we show the results of our performance evaluation. In Section 4, we estimate chip sizes of processors
`
with REMARC and the FPGA coprocessor. Finally, we conclude in Section 5.
`
2 Reconfigurable Coprocessor Architecture

2.1 Architecture Overview
`
Table 1: New Instructions Used in Reconfigurable Coprocessors

    Instruction            Operands
    rcon (rgcon, rncon)    src
    rex                    offset(base)
    lduc2                  creg, offset(base)
    sduc2                  creg, offset(base)
    mtc2                   reg, creg
    mfc2                   reg, creg
    ctc2                   reg, creg
    cfc2                   reg, creg
[Figure 1: Block Diagram of a Microprocessor with Reconfigurable Coprocessor. The main processor, with its instruction cache and data cache, connects to the reconfigurable coprocessor.]
`
Figure 1 shows a block diagram of a microprocessor which includes a reconfigurable coprocessor. The reconfigurable coprocessor consists of a global control unit, coprocessor data registers, and a reconfigurable logic array. Recently, we proposed the REMARC architecture which includes an 8x8 16-bit processor (nano processor) array as its reconfigurable logic array [13]. The other reconfigurable coprocessor that we consider in this paper, the FPGA coprocessor, uses FPGAs for the reconfigurable logic array. The global control unit controls the execution of the reconfigurable logic array and the transfer of data between the main processor and the reconfigurable logic array through the coprocessor data registers.
We use the MIPS-II ISA [14] as the base architecture of the main processor. The MIPS ISA is extended for REMARC and the FPGA coprocessor using the instructions listed in Table 1. The main processor issues these instructions to the reconfigurable coprocessor, which executes them in a manner similar to a floating point coprocessor. Unlike a floating point coprocessor, the functions of reconfigurable coprocessor instructions are configurable (or programmable) so that they can be specialized for specific applications.
The configuration instructions, rcon, rgcon, and rncon, download the configuration data from memory and store them in the reconfigurable coprocessor. The start address of the configuration data is specified by the value of the source register (src). The rex instruction starts execution of a reconfigurable coprocessor instruction. The sum of the offset field and the base register specifies one of the operations to execute. The lduc2 and sduc2 instructions are load and store coprocessor instructions which transfer double word (64-bit) data between memory and the reconfigurable coprocessor data registers. The mfc2 and mtc2 instructions transfer word (32-bit) data between the general-purpose registers (integer registers) in the main processor and the reconfigurable coprocessor data registers. The cfc2 and ctc2 instructions transfer data between the integer registers and the reconfigurable coprocessor control registers.
The reconfigurable coprocessors do not have a direct interface to the data cache or memory. The main processor has to set the input data in the coprocessor data registers using lduc2 and mtc2 instructions before execution of rex instructions. Then, the reconfigurable coprocessor reads the input data, executes the operations, and stores the results into the coprocessor data registers. Finally, the main processor reads the results using sduc2 and mfc2 instructions.
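To make this usage model concrete, the following C-style sketch (our illustration; the intrinsic names are invented stand-ins for the instructions of Table 1, not part of any real toolchain) shows the order in which a program drives the coprocessor for one 64-bit block of data.

    #include <stdint.h>

    /* Hypothetical intrinsics standing in for the coprocessor instructions. */
    void rcon(const void *cfg);                 /* download configuration data      */
    void lduc2(int creg, const uint64_t *mem);  /* memory -> coprocessor data reg   */
    void sduc2(int creg, uint64_t *mem);        /* coprocessor data reg -> memory   */
    void rex(int op);                           /* start a reconfigurable operation */

    void run_block(const uint64_t *in, uint64_t *out, const void *cfg)
    {
        rcon(cfg);      /* configure once (rgcon/rncon in the REMARC case)          */
        lduc2(0, in);   /* stage the input in coprocessor data register 0           */
        rex(0);         /* execute operation 0; the main CPU may continue running   */
        sduc2(1, out);  /* read the result back from coprocessor data register 1    */
    }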
`
2.2 Pipeline Organization

[Figure 2: Pipeline Organization of Reconfigurable Coprocessors.
    Main Processor Pipeline:    F  D  E  M  W
    REMARC Pipeline:            RF RR RL RI RE RW
    FPGA Coprocessor Pipeline:  RF RR RE RW]
`
The pipelines for REMARC, the FPGA coprocessor, and the main processor are shown in Figure 2. The main processor pipeline is similar to the MIPS R3000 and the MIPS R5000 and consists of five stages: Instruction Fetch (F), Instruction Decode (D), Execution (E), Memory Access (M), and Register Write-back (W). The reconfigurable coprocessor pipelines are independent of the main processor pipeline; therefore, the main processor can execute concurrently with the reconfigurable coprocessors.
The REMARC pipeline starts from the M stage of the main processor and has the following six stages:
RF: An instruction of the global control unit is fetched.
RR: The REMARC data registers are read.
RL: The data are aligned or "unpacked".
RI: The instructions of the nano processors are fetched.
RE: The nano processors execute the instructions.
RW: The execution results are packed and stored into the REMARC data registers.

The FPGA coprocessor pipeline has the following four stages:

RF: The sequencer of the global control unit starts its execution.
RR: The coprocessor data registers are read.
RE: The reconfigurable logic array starts execution.
RW: The execution results are stored into the coprocessor data registers.

The RL and RI stages are unnecessary in the FPGA coprocessor because load alignment and unpack operations are realized directly in the FPGA array, and FPGAs do not have instructions to fetch and execute.

2.3 REMARC Architecture

[Figure 3: Block Diagram of REMARC. The global control unit and the coprocessor data registers interface to the main processor over 32-bit and 64-bit paths. An 8x8 array of nano processors (rows 0-7, columns 0-7) is interconnected by HBUS0-HBUS7 and VBUS0-VBUS7, and the global control unit broadcasts the nano PC and row/column numbers to the array.]

In this section, we briefly describe the REMARC architecture; more information can be found in [13]. Figure 3 shows a block diagram of REMARC. REMARC's reconfigurable logic is composed of an 8x8 array of 16-bit processors, called nano processors. The execution of each nano processor is controlled by the instructions stored in its local instruction RAM (the nano instruction RAM). However, each nano processor does not directly control which instructions it executes. Every cycle each nano processor receives a PC value, the "nano PC", from the global control unit. All nano processors use the same nano PC and execute the instructions indexed by the nano PC in their nano instruction RAMs.
Figure 4 shows the architecture of the nano processor. The nano processor contains a 32-entry nano instruction RAM, a 16-bit ALU, a data RAM, an instruction register (IR), eight 16-bit data registers (DR), four 16-bit data input registers (DIR), and a 16-bit data output register (DOR).
[Figure 4: Nano Processor Architecture. Each nano processor contains a 5-bit nano PC, a 32 x 32-bit nano instruction RAM, an instruction register (IR), data registers (DR), a 16-bit ALU and data RAM, a data output register (DOR), direct data inputs from the four adjacent nano processors (DINU, DIND, DINL, DINR), and connections to the 32-bit HBUS and VBUS.]

Each nano processor can use the DR registers, the DIR registers, and immediate data as the source data of ALU operations. Moreover, it can directly use the DOR registers of the four adjacent nano processors (DINU, DIND, DINL, and DINR) as sources.
The nano processors are also connected by the 32-bit Horizontal Buses (HBUSs) and the 32-bit Vertical Buses (VBUSs). Each bus operates as two 16-bit data buses. The 16-bit data in the DOR register can be sent to the upper or lower 16 bits of the VBUS or the HBUS. The HBUSs and the VBUSs allow data to be broadcast to the other nano processors in the same row or column. These buses can reduce the communication overhead between processors separated by long distances.
The DIR registers accept inputs from the HBUS, the VBUS, the DOR, or the four adjacent nano processors. Because the width of the HBUS and the VBUS is 32 bits, data on the HBUS or the VBUS are stored into a pair of DIR registers. Using the DIR registers, data can be transferred between nano processors during ALU operations.
It takes half a cycle to transfer data using the VBUSs or HBUSs, so these transfers should not be a critical path of the design. Other operations, except for data inputs from the nearest neighbors, are done within the nano processor. Because the width of a nano processor's datapath is only 16 bits, a quarter of that of general-purpose microprocessors, this careful design keeps REMARC off the critical path of the chip.
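The control model is thus SIMD at the level of instruction selection: the global control unit broadcasts one nano PC per cycle, and every nano processor executes whatever instruction that index selects in its own nano instruction RAM. A minimal C sketch of this lockstep behavior (our illustration, not the paper's simulator; the sizes follow Figure 4, and the decode/ALU model is left as a stub) is:

    #include <stdint.h>

    #define ROWS 8
    #define COLS 8
    #define NANO_IMEM 32                  /* 32-entry nano instruction RAM */

    typedef struct {
        uint32_t imem[NANO_IMEM];         /* per-processor 32-bit instructions */
        uint16_t dr[8];                   /* eight 16-bit data registers       */
        uint16_t dir[4];                  /* four 16-bit data input registers  */
        uint16_t dor;                     /* 16-bit data output register       */
    } Nano;

    static void execute_nano(Nano *np, uint32_t inst)
    {
        (void)np; (void)inst;             /* decode and 16-bit ALU model omitted */
    }

    /* One REMARC cycle: the global control unit broadcasts nano_pc and every
       nano processor executes its locally stored instruction at that index.   */
    void remarc_cycle(Nano array[ROWS][COLS], unsigned nano_pc)
    {
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++) {
                uint32_t inst = array[r][c].imem[nano_pc % NANO_IMEM];
                execute_nano(&array[r][c], inst);
            }
    }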
2.4 FPGA Coprocessor Architecture

The reconfigurable logic array of the FPGA coprocessor is composed of configurable logic blocks (CLBs).
Table 2: Coprocessor Architecture Comparison

                         REMARC                          FPGA Coprocessor
    Processor Size       Large (16-bit datapath)         Small (two 4-input LUTs)
    Num. of Procs.       Small (64 processors)           Large (varying number of CLBs)
    Execution control    Instruction                     Hardwired
    Communication        Controlled by instruction       Configured by switch matrix
    Interconnection      4 neighbors, VBUS, and HBUS     Short wires and long wires
    Cycle Time           Tcpu                            1x, 2x, 5x, 10x Tcpu
Each CLB is equal to a CLB of the Xilinx 4000 series. The CLB includes two 4-input lookup tables (LUTs) and two flip-flops.
We do not fix the number of CLBs in the FPGA coprocessor. Instead, we evaluate the performance of the FPGA coprocessor using a varying number of CLBs. The cycle time of the FPGA coprocessor is also parameterized. Cycle times of current FPGA systems are longer than those of microprocessors by factors of five to ten; FPGA systems usually operate at much lower clock frequencies than state-of-the-art microprocessors. However, the recently proposed reconfigurable coprocessor Garp aims to operate at the same frequency as its main processor. Therefore, we assume the cycle time of the FPGA coprocessor could be 5x or 10x that of the main processor for current FPGA architectures and 1x or 2x for future FPGA architectures.
`
2.5 Coprocessor Architecture Comparison

Table 2 summarizes the comparison of the two reconfigurable coprocessor architectures. REMARC has larger processing elements, the nano processors, than the FPGA coprocessor. However, the number of nano processors is much smaller than the number of CLBs in the FPGA coprocessor. In REMARC, both execution and data transfer are controlled by instructions, while in the FPGA coprocessor they are controlled by hardwired logic. REMARC has limited hardware interconnections: each nano processor has direct inputs from its four nearest neighbors and is connected to two 32-bit data buses. The FPGA coprocessor has more flexible hardware interconnections, which are configured in a bit-wise fashion.
We assume that REMARC operates at the same frequency as the main processor. The cycle time of the FPGA coprocessor (Tfpga) is varied. We evaluate the FPGA coprocessor performance assuming Tfpga values that are 1x, 2x, 5x, and 10x the CPU cycle time (Tcpu).
`
3 Performance Evaluation

3.1 Simulation Methodology
`
We developed the reconfigurable coprocessor simulators using the SimOS simulation environment. SimOS models the CPUs, memory systems, and I/O devices in sufficient detail to boot and run a commercial operating system. As the base CPU simulation model, we used "MIPSY", which models a simple single issue RISC processor.
REMARC functions are added to the MIPSY model. The latency of a reconfigurable execution (rex) instruction is the sum of the number of executed global instructions and the pipeline latency. If a following instruction attempts to read the result of a rex instruction, the pipeline will stall for the cycles of the rex instruction's latency (Data Dependency). Furthermore, a following rex instruction will stall if a previous rex instruction is still executing (Resource Conflict).
We evaluate the execution time of the FPGA coprocessor by changing the latencies of the rex instructions. First, we estimate the number of execution cycles of the rex instructions based on the FPGA delay model. In this model, two kinds of sequences can be executed within one FPGA cycle:

  - One stage of interconnection by long or short wire and one stage of function not using the carry chain
  - One stage of interconnection by short wire and any one stage of function

This model is almost the same as Garp's delay model and the medium frequency design point of the XC4000 series. Then, we normalize the number of execution cycles based on the CPU cycle time, assuming the FPGA cycle time is 1x, 2x, 5x, or 10x the CPU cycle time. Finally, we use this estimated cycle count of the rex instruction in the simulator.
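As a rough illustration of how these latencies enter the simulation (our sketch, not SimOS or MIPSY code; the names are invented), the rex latency seen by the main processor is simply the estimated FPGA cycle count scaled by the cycle-time ratio, and a dependent instruction stalls until that latency has elapsed:

    /* Estimated FPGA execution cycles, scaled to CPU cycles by Tfpga/Tcpu. */
    unsigned rex_latency_cpu(unsigned fpga_cycles, unsigned tfpga_over_tcpu)
    {
        return fpga_cycles * tfpga_over_tcpu;     /* ratio is 1, 2, 5, or 10 */
    }

    /* Stall cycles for an instruction issued at time `now` that reads the
       result of a rex instruction completing at `rex_done` (data dependency);
       the same check covers a following rex instruction (resource conflict). */
    unsigned stall_cycles(unsigned long now, unsigned long rex_done)
    {
        return (now < rex_done) ? (unsigned)(rex_done - now) : 0;
    }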
We also developed simulators which can execute multimedia instructions similar to the Intel MMX instruction set extensions.
To make the comparison fair, the same application source code is used for the evaluation except for the use of the multimedia instructions or the augmented coprocessor instructions. The memory system parameters used throughout the performance evaluations are listed in Table 3. We used the "gcc" compiler with the "-O2" optimization option. This option enables most global compiler optimizations except for loop unrolling and function inlining.
`
Table 3: Memory System Parameters

    I-cache             K bytes, -way set associative
    D-cache             K bytes, -way set associative
    L2 cache            K bytes, -way set associative
    L1 miss penalty     cycles
    L2 miss penalty     cycles
`
3.2 DES Encryption

The Data Encryption Standard (DES) is one of the most important encryption algorithms and has been a worldwide standard for over 20 years. It is widely used to provide secure communication over the Internet. DES is also a good application for reconfigurable processors because it has a lot of fine-grained parallelism in the form of bit-level irregular data movements which make software implementations on conventional microprocessors difficult and inefficient.

[Figure 5: DES Encryption Algorithm. 64-bit plaintext passes through the initial permutation, 16 rounds of f-box operations on the left and right halves (L0/R0 through L16/R16) using round keys K1 through K16, and the final permutation, producing the 64-bit ciphertext.]

[Figure 6: DES f-box Algorithm. The 32-bit right half Ri-1 is expanded to 48 bits by the expansion permutation, XORed with the 48-bit round key, passed through eight S-boxes (6-bit input, 4-bit output each), permuted by the P-box, and XORed with Li-1 to form Ri.]

Figure 5 shows an outline of the DES algorithm. DES takes as input 64-bit plaintext. After the initial permutation, there are 16 rounds of "f-box" operations, as shown in Figure 6, each including an expansion permutation, XOR with the key, S-box table lookups, a P-box permutation, and an XOR with the result of the previous round. The final permutation is performed on the 64-bit result of the sixteenth round.
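For reference, the round structure described above is the standard Feistel form; a compact C sketch (ours, with the DES permutation tables and the f function itself omitted) is:

    #include <stdint.h>

    uint32_t f(uint32_t r, uint64_t k);   /* expansion, key XOR, S-boxes, P-box */

    /* The 16 Feistel rounds between the initial and final permutations. */
    void des_rounds(uint32_t *L, uint32_t *R, const uint64_t K[16])
    {
        for (int i = 0; i < 16; i++) {
            uint32_t t = *L ^ f(*R, K[i]);  /* XOR the f-box output into the left half */
            *L = *R;                        /* the halves swap each round              */
            *R = t;
        }
    }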
`
We used the DES encryption program [17] based on the Electronic Codebook (ECB) mode. Although the ECB mode is less secure than the Cipher Block Chaining (CBC) mode, it is more commonly used and its operation can be pipelined.

3.2.1 DES Implementation on REMARC

We decided to divide the algorithm between the main processor and REMARC. The initial permutation and the final permutation are executed by the main processor, and the 16 rounds of f-box operations are executed by REMARC. Each row of nano processors executes two f-box operations. For instance, the eight nano processors in row 0 execute the first and second rounds, row 1 executes the third and fourth rounds, and so on.

Table 4: Execution Cycle Breakdown of Two f-box Operations on REMARC

    Operation             CPU Cycles
    Data Load
    Exp. Permutation
    Key XOR
    S-box
    P-box Permutation
    Left XOR
    Data Transfer
    Total

Table 4 gives the execution cycle breakdown of the two f-box operations implemented by the eight nano processors in each row. The 16 f-box operations are pipelined into 8 stages by the eight rows of the nano processor array. REMARC can generate a result of the 16 f-box operations every  cycles. As Table 4 shows, because of the limited interconnection of REMARC, more than half of the execution time is used for the expansion permutation and the P-box permutation.

3.2.2 DES Implementation on the FPGA Coprocessor

First, we estimate the latency and the throughput of one f-box operation. The expansion permutation and the XOR operation with the keys can be executed in one cycle because they can be implemented by long wires and simple logic. The S-box table lookup requires two cycles to execute because it consists of LUTs and MUXs. The P-box permutation and the XOR operation can be executed in one cycle. The total latency of the f-box operation is therefore four cycles, and it can be fully pipelined, so the maximum throughput of the f-box operation is one result per FPGA cycle.
We assume three distinct cases: implementing one f-box, 16 f-boxes, and all of the DES encryption algorithm including the 16 f-boxes, the initial permutation, and the final permutation.
In the case of the one f-box implementation, because each f-box operation takes as input the result of the previous f-box operation, the f-box operations cannot be pipelined.
`
Table 5: Execution Throughput of 16 f-box Operations on Different FPGA Coprocessors (CPU Cycles)

    Tfpga        FPGA Coprocessor (1 f-box)    FPGA Coprocessor (16 f-boxes)
    Tcpu         64                            1
    Tcpu x 2     128                           2
    Tcpu x 5     320                           5
    Tcpu x 10    640                           10
`
[Figure 7: DES Encryption (1M Bytes) Results, Single Issue Architecture. Cycle counts (M cycles) for the Base (single-issue) processor and for +REMARC, +FPGA (1 f-box), +FPGA (16 f-boxes), and +FPGA (All DES), with Tfpga = Tcpu, Tcpu x 2, Tcpu x 5, and Tcpu x 10.]
`
It takes 64 FPGA cycles (4 cycles x 16 rounds) to execute the 16 rounds of f-box operations. When the 16 f-boxes are implemented in the FPGA array, the operations can be pipelined. It takes only one FPGA cycle to generate a new result for the 16 f-box operations after the pipeline is filled up. Table 5 normalizes the throughput of the 16 f-box operations in CPU cycles for both the one f-box implementation and the 16 f-box implementation.
When the complete DES encryption algorithm is implemented in the FPGA coprocessor, the initial and final permutations can be fully pipelined and a new 64-bit ciphertext block is generated every FPGA cycle.
To estimate the number of CLBs without wiring overhead, we use the following method. As shown in Figure 6, the f-box operation requires a 48-bit XOR, eight 6-bit input 4-bit output LUTs, a 32-bit 4-1 multiplexer, and a 32-bit XOR. A two-bit XOR or a 4-1 multiplexer fits in one CLB. A 6-4 LUT needs 16 4-1 LUTs, or 8 CLBs. In total, 136 CLBs are required for one f-box operation. The 16 f-box implementation needs 2176 (136 x 16) CLBs. We expect that in an actual implementation more CLBs would be dedicated to the extra wiring which realizes the initial permutation and the final permutation.
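Spelling out that count (our arithmetic, restating the figures above): the 48-bit XOR needs 24 CLBs, the eight 6-4 S-box LUTs need 8 x 8 = 64 CLBs, the 32-bit 4-1 multiplexer needs 32 CLBs, and the 32-bit XOR needs 16 CLBs, for a total of 24 + 64 + 32 + 16 = 136 CLBs per f-box.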
`
[Figure 8: DES Encryption (1M Bytes) Results, 2-way Superscalar Architecture. Cycle counts (M cycles) for the Base (2-issue) processor and for +REMARC, +FPGA (1 f-box), +FPGA (16 f-boxes), and +FPGA (All DES), with Tfpga = Tcpu, Tcpu x 2, Tcpu x 5, and Tcpu x 10.]
`
3.2.3 Performance Evaluation Results

Figure 7 shows the results for DES encryption of 1M bytes of data. The base architecture of the main processor is a single issue processor. To optimize performance we implemented a software pipelining technique which can overlap operations in the main processor and the reconfigurable coprocessors. REMARC achieves a 7.3 times performance improvement. The FPGA coprocessor achieves the same performance improvement if its execution cycles for the 16 f-box operations are less than 64 CPU cycles. This indicates that the main processor, which executes the initial permutation and the final permutation, limits the performance improvement in these cases.
If the FPGA coprocessor implements all of the DES operations, the execution cycles are reduced to 9.0 M CPU cycles. By reducing the load on the main processor, this implementation delivers a 14.5 times performance improvement over the base processor. Even when the cycle time of the FPGA coprocessor is 10x Tcpu, it takes only 10 CPU cycles to generate one 64-bit block of encrypted data. This execution time on the FPGA coprocessor is short enough that the main processor still limits the performance.
Using a higher performance main processor is another way to reduce the total execution time. As shown in Figure 8, if a 2-way superscalar processor is used as the main processor, the execution times are reduced from 17.3 M CPU cycles to 14.8 M CPU cycles for the +REMARC, +FPGA (1 f-box) at Tcpu cycle time, and +FPGA (16 f-box) implementations.
The other point demonstrated in Figure 7 and Figure 8 is that the one f-box implementation at slow cycle times, especially Tcpu x 5 and Tcpu x 10, delivers only limited performance improvements. Even though the execution time of one f-box operation is only four FPGA cycles, the slow cycle time of the FPGA coprocessor results in
`
small performance gains. Furthermore, the performance is limited by the reconfigurable coprocessor; there is no performance improvement even if the 2-way superscalar main processor is used, as shown in Figure 8.
`
3.3 MPEG-2 Decoding

MPEG-2 is one of the most popular video compression standards. The algorithm for MPEG-2 decoding consists of five stages: Variable Length Decoding (VLD), Inverse Quantization (IQ), Inverse DCT (IDCT), Motion Compensation (MC), and the addition of the IDCT and MC results (Add_Block). We used the mpeg2decode [19] program distributed by the MPEG software simulation group as the base MPEG-2 decoding program. The mpeg2decode program was modified using the reconfigurable coprocessor instructions. The bitstream "mobile" was chosen as a benchmark.
`
[Figure 9: Execution Time Breakdown of MPEG-2 Decoding. Percentage of execution time (0-100%) spent in MC, Add_Block, IDCT, and VLD&IQ.]
`
To determine the execution time breakdown we profiled the MPEG-2 decoding program using the profiling tool pixie [18]. Figure 9 shows the percentage of execution time spent in each stage while decoding 30 frames. The figure indicates that the execution time is distributed among MC, Add_Block, IDCT, and VLD/IQ. This suggests that reconfigurable coprocessors need to support multiple functions to accelerate the entire MPEG-2 decoding. As a first step, however, we investigate the acceleration of IDCT, which represents 31% of the execution time for MPEG-2 decoding.
`
3.3.1 Acceleration of IDCT

We use the fast IDCT algorithm which is the reverse of the DCT algorithm presented in [20]. The 1D-IDCT algorithm consists of 12 multiplications and 32 additions/subtractions. The 8x8 2-D IDCT is composed of a row IDCT followed by a column IDCT. Each of the row and column IDCTs consists of eight 1-D IDCTs.
For REMARC, the 1D-IDCT algorithm is changed slightly. The algorithm consists of 16 multiplications and 30 additions/subtractions to achieve more regular data movements. Each of the eight nano processors in the same row or column calculates a 1-D IDCT. Therefore, two multiplications are mapped onto each nano processor to maintain an even load balance among the nano processors. First, each of the eight nano processors in the same row calculates the 1-D row IDCT. Then, each of the eight nano processors in the same column calculates the 1-D column IDCT. Because the 2-D IDCT is done in place, there is no need to execute a matrix transposition. In this fashion an 8x8 nano processor array naturally implements the 2-D IDCT. The 2-D IDCT can be executed in 54 cycles by REMARC. REMARC can therefore realize a fast 2-D IDCT implementation which exploits the large amount of parallelism in the algorithm.
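This mapping relies on the 2-D IDCT being separable. A sketch of the in-place row-then-column structure (our C illustration, with the 1-D butterfly network itself omitted) is:

    /* 1-D 8-point IDCT over a strided vector (the multiplications and
       additions/subtractions of the fast algorithm are omitted here). */
    void idct_1d(short *v, int stride);

    /* Separable in-place 8x8 2-D IDCT: rows first, then columns. On REMARC
       each row/column is handled by one line of nano processors, so no
       matrix transposition is needed. */
    void idct_2d(short block[64])
    {
        for (int r = 0; r < 8; r++) idct_1d(&block[8 * r], 1);   /* row IDCTs    */
        for (int c = 0; c < 8; c++) idct_1d(&block[c], 8);       /* column IDCTs */
    }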
We estimate the latency and the throughput of the 1-D IDCT for the FPGA coprocessor. The IDCT algorithm has five stages of additions/subtractions and one stage of multiplication. The bit width of these operations is 16 bits. We assumed each stage of the additions/subtractions requires two FPGA cycles. The multiplication requires four FPGA cycles to execute using a constant coefficient multiplier (KCM). Therefore, the latency of the 1-D IDCT is 14 FPGA cycles (2 FPGA cycles x 5 + 4 FPGA cycles). The throughput is restricted to four FPGA cycles by the KCM.
We implement the IDCT in the FPGA array using two models. One model implements only one 1D-IDCT and the matrix transposition circuits; this 1D-IDCT circuit is used 16 times to realize a 2-D IDCT. The other model implements eight 1D-IDCTs and the matrix transposition circuits.
In the first IDCT implementation, eight 1D-IDCTs can be pipelined. It takes 42 FPGA cycles (14 FPGA cycles + 4 FPGA cycles x 7) to execute eight 1D-IDCTs, or one row/column IDCT.
The matrix transposition is required between the row IDCT and the column IDCT, and it breaks the IDCT execution pipeline. The total execution cycle counts for one 2D-IDCT are 84 FPGA cycles for the one 1D-IDCT implementation and 28 FPGA cycles for the eight 1D-IDCT implementation. Table 6 normalizes the execution cycles in CPU cycles.
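Concretely (our restatement of the cycle counts above): with one 1D-IDCT unit, the row pass and the column pass each take 42 FPGA cycles and cannot overlap across the transposition, giving 42 + 42 = 84 FPGA cycles; with eight 1D-IDCT units, each pass completes in the 14-cycle pipeline latency, giving 14 + 14 = 28 FPGA cycles.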
We estimate the number of CLBs needed to implement the 1D-IDCT and the matrix transposition. At least 808 CLBs are necessary for the 1D-IDCT and 512 CLBs for the matrix transposition. The implementation of one 1D-IDCT and the matrix transposition requires 1320 CLBs, and the implementation of eight 1D-IDCTs and the matrix transposition requires 6976 CLBs. Again, these estimates include no wiring overhead.
`
Table 6: 2D-IDCT Execution Cycles (CPU Cycles)

    Tfpga        FPGA Coprocessor (1x 1D-IDCT HW)    FPGA Coprocessor (8x 1D-IDCT HW)
    Tcpu         84                                  28
    Tcpu x 2     168                                 56
    Tcpu x 5     420                                 140
    Tcpu x 10    840                                 280
`
Figure 10 shows the execution time of MPEG-2 decoding for 15 frames of the "mobile" bitstream. Execution times of the processors with the reconfigurable coprocessors are compared with the single issue base processor and with the processor with the multimedia extensions (+MMX).
Although the multimedia extensions can reduce the cycle counts of the IDCT part by a factor of 2.7 from the base processor, REMARC and the FPGA coprocessor which operate at the same frequency as the main processor achieve
`
[Figure 10: MPEG-2 decoding execution time (cycle counts, split into IDCT and Others) for the base processor, +MMX, +REMARC, and the +FPGA (1x and 8x 1D-IDCT) configurations, with Tfpga = Tcpu, Tcpu x 2, Tcpu x 5, and Tcpu x 10.]
