`
`A Quantitative Analysis of Reconfigurable Coprocessors
`for Multimedia Applications
`
`Takashi Miyamori
`System ULSI Engineering Laboratory
`TOSHIBA Corporation. JAPAN
`miyamori©sdei.toshiba.co.jp
`
`Kunle Olukotun
`
`Computer Systems Laboratory
`Stanford University
`kunle@ogun.stanford.edu
`
`Abstract
`
`Recently, computer architectures that carnhinc n reconfig-
`urnhie (or relorgetohte) coprocessor with n gamut-purpose
`microprocessor have been proposed.
`These architectures
`are designed to exploit
`large amounu of fine grain pur-
`olleiisrn in applications.
`In this paper. we study the per-
`formance of the reconfigurable copmcessors an multimedia
`applications. We compare o Field Programmable Gate Ar-
`ray {FFGAJ based reconfigurable mpromsor with the army
`processor salted REMARC {Reconfigurable Multimedia Ar-
`roy- G‘opmccuor}. REMARC use: a 1616:"! simple processor
`that is much larger than n Configurable Logic Black (91.3)
`of cm FPGA. We have developed a simulator. a program-
`ming environment. and multimedia application programs to
`evaluate the performance of the two coprocessor architec-
`tures. The simulation results show that REMARG achieves
`specdups ranging from a factor of 2.3 to 713 on there cp-
`piicotions. The FFGA coprocessor achieves sin-lilo!- per-
`formance improvements. However, the FPGA coprocessor
`needs more hardware area to nchieve the some performance
`improvement as REMARC.
`
`Introduction
`1
`it he-
`As the use of multimedia applications increases.
`comes important to achieve high performance on algo-
`rithms such as video compression. decompression. and im-
`age processing with generalopurpose microprocessors. This
`has motivated the recent addition of multimedia instruc-
`tions to most general-purpose microprocessor ISAs [1—33.
`These ISA extensions work by segmenting a conventional
`64-bit datapath into four 15-bit or Eight 8-bit datnpaths.
`The multimedia instructions exploit fine grain SIMD par-
`allelism by operating on four 16-bit or eight 3-bit data
`values. However, a 54-bit datapath limits the speedups
`to a factor of four or eight even though many multimedia
`applications have much more inherent parallelism.
`Computer architectures that. connect a reconfigurable
`coprocessor to a general-purpose microprocessor have been
`proposed [4—9]. The advantage of this approach is that
`the coprocessor can be reconfigured to improve the parlor,
`mance of a particular application. All of these proposed
`architectures use field. programmable gate arrays {FPGAs)
`for the recufigurahie hardware. We refer to this coproce-
`
`nor as an “FPGA coprocessor" in this paper. The FPGA
`architecture. which has narrow programmable logic blocks
`nnd programmable interconnection network. provides great
`flexibility for implementing application specific hardware.
`However. the rich programmable interconnection comes at
`the price of reduced operating frequency and logic density.
`Array processors. such as general-purpose systolic er-
`rny processors. wavefront array processors [10]. PADDI-
`2 [ll]I end MATRIX [12], are other reconfigurable archi-
`tectures. These processors have 3-bit or 16-bit datapaths
`and each programmable logic block has an S to fit—entry
`instruction RAM that makes it easy to support multiple
`functions. Because multimedia (or DSP) applications pre-
`dominantly manipulate 8-bit or 1643i: data values. these
`architectures work very well on than applications.
`Re—
`cently. we proposed a new array processor architecture
`called REMARC (Reconfigurable Multimedia Array Co-
`prucwsorfllii]. REMARC is a reconfigurable coprocessor
`that is tightly coupled to a. main RJSC proceanor and con-
`sists otsglobal control unit and 64 [obit simple procmors
`called mum processors.
`Both the FPGA coprocessor and REMARC are not
`limited to SiMD parallelism that can be exploited by mul-
`timedia extensions such as the intel MMX[2]. They can
`exploit various kinds of fine grain parallelism in multime-
`din applications. Using more processing resources. they
`can achieve higher performance than the multimedia axe
`tensions. To understand how them two coprocessor archi-
`tecture compare.
`in this paper we evaluate the oust and
`performance of these architectures. The architecture of
`FPGA caprooessors are still in flux. so we evaluate the per-
`formance of the FPGA coprocessm with a varying number
`of CLBs and my the cycle time of the FPGA coprocessor
`from in to 10:: that of the main processor. For the perfor-
`mance evaluation. we use detailed simulators and two real~
`istic application programs, DES encryption and MPEG-2
`decoding. We also stimate the chip sizes of processors
`with REMARC and the FPGA coprocasor and compare
`their performance when the same die size is used for both
`architectures.
`In 580
`The rest of this paper is organized as follows.
`tion 2. we dmribe the reconfigurable ooprocasor archie
`tenures. both REMARC (array based) and FPGA based.
`In Section 3, we Show the mults of our performance eval-
`uation. In Section 4. we estimate chip sizes of processors
`
`|PFl2018-01594
`
`EXHIBIT
`
`2056
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 1
`
`IPR2018-01594
`
`EXHIBIT
`2056
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 1
`
`
`
`with REMARC and the FPGA coprocessor. Finally. we
`conclude in Section 5.
`
`2 Reconfigurable Coprocessor Ar—
`chit ect ure
`
`2.1 Architecture Overview
`
`roan
`It:
`”no.2
`Idea!
`mlcz
`we}!
`old
`1:ch
`
`Ire frown or mean)
`caring. afilzll‘baul
`0192.73, nflsctl’om'c)
`eupflJlgt oflsetfbnsej
`coping, an:
`coping. it:
`naparey, Ire
`coping, list
`
`’
`
`Inslrucliorl
`Cache
`
`'
`
`MainProcessor
`
`Bus Interface Unll
`
`Figure 1: Block Diagram ora Microprocessor with Reconfig-
`urable Coproressor
`Figure 1 shows a block diagram of a microprocessor
`which includes a reconfigurable coprocessor. The recon-
`figurable coprocessor consists ofa global control unit, co-
`procmsor data registersI and a reconfigurable logic array.
`Recently, we proposed the REMARC architecture which
`includ$ an 8x8 16-bit processor (nano processor) array as
`its reconfigurable logic array[13]. The other reconfigurable
`coprocemr that we consider in this paper, the FPGA co—
`processor. uses FPGAs for the reconfigurable logic array.
`The global control unit controls the execution of the reconv
`figurable logic array and the transfer of data between the
`main processor and the reconfigurable logic array through
`the coprocessor data registers
`We use the MIPS-H ISA [14] as the base architecture
`of the main processor. The MIPS ISA is extended for the
`REMARC and the FPGA coprocessor using the instruc‘
`tions listed in Table 1. The main processor issuer these
`instructions to the reconfigurable coprocessor which exe-
`cutes them in a manner similar to a floating point copro-
`cream. Unlike a floating point coprocessor. the functions
`of reconfigurable coprocfisor instructions are configurable
`(or programmable) so that they can be Specialized for spe-
`cific applications.
`The configuration instructions, roan. rgcon. or mourn
`download the configuration data from memory and store
`them in the reconfigurable coprocessor. The start address
`of the configuration data is specified by the value of the
`source register {are}. The for instruction starts execution
`of a reconfigurable coprocessor instruction The sum of
`
`Table 1: New instructions Used in Reconfigurable Coproces—
`son
`
`the ofisct field and the base register specifies one of the 0p-
`erations to execute. The lducfl and grind instructions are
`load and store coprocessor instructions which transfer dou-
`ble word (54-bit) data between memoryr and the reconfig-
`urable coprocessor data registers. The mid. and. mic-2 in—
`structions transfer word (32-bit) data between the general-
`purpose registers (integer registers) in the main processor
`and the reconfigurable coprocessor data registers. The cfcfl
`and 51:2 instructions transfer data. between the integer reg-
`isters and the reconfigurable coprocessor control registers.
`The reconfigurable Dopmceflors do not have I direct
`interiace to the data cache or memory. The main proces
`sot has to set the input data to the coprocessor data reg-
`isters using lduc2 and rates instructions before execution
`of re: instructions. Then, the reconfigurable coprocessor
`reads the input data, executa the operations, and stores
`the raults into the coprocessor data registers. Finally, the
`main processor reads the results using sduc? and MR2 in-
`struetions,
`
`2.2 Pipeline Organization
`
`Main 9mm- 9mm- BEEN“
`nemnc Fianna
`"-EEIIEIIEE
`
`Mah- Pmrt‘o—Hm flElEltlm
`FPGA copmwssor Pip-an:
`HEREB-
`
`Figure 2: Pipeline Organization of Reconfigurable Coproeee—
`our
`
`The pipeline for REMARC, the FPGA coprocessor.
`and the main processor are shown in Figure 2. The main
`processor pipeline is similar to the MIPS 3.3000 and the
`MIPS M000 and consists of five stages: Instruction Fetch
`(F), Instruction Decode (D), Execution (E), Memory Ac-
`cess (M). Register Write-back (W). The reconfigurable co—
`processor pipelinu are independent of the main processor
`pipeline; therefore. the main processor can execute concur-
`rently with the reconfigurable coprocessors.
`The REMARC pipeline starts from the M stage of the
`main processor and has the following six stages:
`RF : An instruction of the globe] control unit is fetched.
`RR : The REMARC data registers are read.
`KL : The data are aligned or “unpacked".
`RI : The instructions of the nano processors are fetched.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 2
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 2
`
`
`
`HBUS
`
`
`
`
`BOUT
`
`q
`
`IE'i'i'
`
` !
`
`7
`
`mPc
`
`ll— '
`
`I.
`
`figure 4: Nana Processor Architecture
`register (IR), eight 16-bit data registers (DR). four ‘16-
`hit data input registers (DIR), and a 16—bit data output
`register (DOE).
`Each nano processor can use the DH. registers. the DIR.
`registers, and immediate data as the source data of ALU
`operations. Moreover. it can directly use the DOR regis-
`ters of the four adjacent nsno processors (DINU, DIND.
`DINL, and DINR) as the source.
`The nano processors are also connected by the 32-bit
`Horizontal Buses {HBUSsl and the 32-bit Vertical Buses
`(VBUSa). Each bus operates as two 16-bit data buses. The
`16-bit data in the DOR register can he sent to the upper
`or lower 16 hits of the VBUS or the HBUS. The HBUSs
`and the VBUSs allow data to be broadcast
`to the other
`nano processors in the same row or column. These buses
`can reduce the communication overhead between proces-
`sora separated by long distances.
`The DIR registers accept inputs from the HBUS, the
`VBUS, the DOR. or the four adjacent nano processors.
`Because the width of the HBUS and. the VBUS is 32 hits,
`data on the HBUS or the VBUS are Stored into a DIR
`register pair. DIRB and DIRl, or D192 and DIR3. Using
`the DIR registers, data can be transfered between sane
`processors during ALU operations.
`It takes a half cycle to transfer data using the VBUSS
`or HBUSS.
`It should not he a critical path of the de-
`sign. Other operations, except for data inputs from near-
`est neighbors. are done within the nano processor. Because
`the width of a norm processor's detapath is only 16 bits‘
`which is a quarter of those of the general purpose micro-
`processors. this careful design does not make REMARC a
`critical path of the drip.
`
`2.4 FPGA Caprocessor Architecture
`The reconfigurable logic array of the FPGA coprocessor is
`composed of configurable logic blocks (CLBs). Each CLB
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 3
`
`RE : The new prose-hrs execute the instructions.
`RW : The executed results are padced and stored into
`the REMARC data registers.
`The FPGA coprocessor pipeline has four stages as fol-
`lows:
`
`RF : The sequencer of the global control unit starts its
`execution.
`RR : The coprocessor data registers are read.
`111?.
`r The reconfigurable logic may starts execution.
`RZW .' The execution results are stored into the coproces—
`sor data registers.
`The RL and R1 stages are unnecmary in the FPGA
`coprocessor because load alignment or unpack operations
`are realized directly in the FPGA array and FPGM do
`not have instructions to fetch and execute.
`
`2.3 REMARC Architecture
`whom Main Processor
`
`
`
`
`
`
`
`
`.1mfimIS‘fl
`r
`I!
`
`
`
`Eli=ii= —iI
`
`
`
`ill. AIME
`
`
`oats
`Ml
`Oil—2
`con: mi"!
`00L! ml CELT
`
`Figure 3: Block Diagram of REMARC
`In this section. we briefly describe the REMARK} archiv
`oecture‘, more information can he found in [13]. Figure 3
`shows a block diagram of REMARC. REMARC‘S reoon-
`figurahle logic is composed of an 8x8 arrayr of the 16-bit
`processors, called auto processors. The execution of each
`nano pmcetsnr is controlled by the instructions stored in
`the local instruction 11AM (nu-Lo instruction RAM). How-
`ever. each name processor does not directly control the
`instructions it executes. Every cycle the nano promor
`receives a PC value. "mo PC", from the global control
`unit. All nano processors use the same norm PC and axe
`cuts the instructions indexed by the nano PC in their name
`instruction RAM.
`Figure 4 shows the architecture of the nano processor.
`The one processor contains a 32-entry nano instruction
`RAM. a 16-bit ALU. a lG—entry data RAM, on instruCtion
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 3
`
`
`
`I
`
`
`I @MARC
`EPG'I't—Onproceesor
`Ina]
`64 - rooeasors
`mam , 50m LB:
`natructlon
`untmlled by instruction
`on _u -- 3" switch matrix
`4 nei hora. VBUS. and HBIS
`
`Execution CDIIHO
`interconnection
`
`1X. 2x. 5x, lofl‘cpu
`
`Table 2: CDPIDEfiSDr Architecture Comparison
`
`is equal to a. CLB of the Xillnx 4000 series. The CLB
`includes two 4'1 lookup tables (LUTs) and two flip-flops.
`We do not fix the number of the CLBa in the FPGA
`CDDTUWDL Instead. we evaluate the performance of the
`FPGA coprocessor using a varying number of CLBS. The
`cycle time of the FPGA coproceseor is also parameter-
`ized. Cycle times of current FPGA systems are longer
`than those of the microprocessors by factors of five to
`ten. For instance. FPGA systems usually operate at 30
`to 100 MHz while state-ofithe-art microprocessors Operate
`at more than 400 MHZ. However. the recently proposed
`reconfigurable coprocessor Sup [8] aims to operate at the
`same operating frequency as its main processor. Therefore.
`we assumed the cycle time of the FPGA coprocessor could
`he 5:: or 10): that of the main processor for current FFGA
`architectures and 1x or 2:: {or future FPGA architectures.
`
`2.5 Coprocessor Architecture Compar-
`ison
`Table 2 summarizes the comparison of the two reconfig-
`urable coprocunr architectures. REMARC has larger
`processing elements. the nano processors, than the FPGA
`coprocessor. However. the number of mono processors is
`less than that or CLBs in the FPGA coprocessor.
`1n
`REMARK}, both the acecution and data transfer are con-
`trolled by instructions. while these are controlled by hard-
`wired logic in the FPGA. coprocessor. REMARC has lim-
`ited hardware interconnections. Each nanu processor has
`direct inputs from the four nearest neighbors and it is con-
`nected by two 32-bit data buses The FPGA coprocessor
`has more flexible hardware interconnections which are con-
`figured in a bit-wise fashion,
`We assume that REMARC will operate at the same
`frequency as the main processor. The cycle times of the
`FPO-A copra-user (Tfpgo) are varied. We evaluate the
`FPGA coprocessor performance. assuming Tl'pga values
`that are ix, 2:. 5x‘ and. 10x the CPU cycle time (Tcpu).
`
`3 Performance Evaluation
`
`3.1 Simulation Methodology
`We developed the reconfigurable coprocessor simulator us—
`ing the SimOS simulation environment [15] . SimOS mod-
`es the CPUS, memory systems. and NO devicm in suffi-
`cient detail to boot and run a commercial operating syn-
`tern. As a has: CPU simulation model. we used the
`
`“MIPSY " which models a simple single issue RISC pro-
`cesaor similar to the MIPS RBUIJG.
`REMARC functions are added. to the MIPSY model.
`The. latency of a reconfigurable execution (Her) instruction
`is the sum of the number of the executed global instruc—
`tions and the pipeline latency (5 cycles). If a following in-
`struction attempts to read the result of a re: instruction.
`the pipeline will stall for the cycles of the m: lnstruction’s
`latency (Data Dependency) Furthermore. a following re:
`instruction will stall if a. previous re: instruction is still
`executing (Resource Conflict).
`We evaluate the execution time of the FPGA coproces-
`sor by changing the latencies of the rec instructions. First,
`we estimate the number of execution cycles of the re: in—
`structions based on the FPGA delay model. In this model.
`two sequences can he executed within one FFGA cycle:
`- One stage of interconnection by long or short wire
`and. one stage of function not using the carry chain
`- One stage of interconnection by short wire and any
`one stage of function
`the same as the Garp's delay
`This model
`is almost
`model [3] and the X04000? medium frequency design which
`will operate at about 70 MHz [16]. Then, we normalise the
`number of execution cycla based on the CPU cycle time,
`assuming the F13GA cycle time is 13:. 2x. 5x. or 10:: the
`CPU cycle time. Finally. we use this estimated cycle count
`of the re: instruction in the simulator.
`We also developed simulators which can execute mul-
`timedia instructions similar to the Intel MMX instruction
`set extensions [2].
`the same application
`To make the comparison fair,
`source codes are used for the evaluation except for the
`use of the multimedia instructions or the augmented cow
`procwsor instructions. The memory system parameters
`used commonly through the performance evaluations are
`found in Table 3‘ We used the “goo" compiler with the I'-
`02" optimization option. This option executes most global
`compiler optimizations except for loop unrolling and func-
`tion inlioing.
`
`3? K bytes. 2-way an.
`IMH- 32 K him. 2-war set-
`256 K bytes. 27w” set.
`n he mm
`
`M mi” will?
`
`Table 3: Memory System Parameters
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 4
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 4
`
`
`
`3.2 DES Encryption
`The Data Encryption Standard (DES) is one 01' the most
`important encryption algorithms and has been a worldwide
`standard for over 20 years. It is widely used to provide sew
`cure communication over the Internet DES is also a. good
`application {or reconfigurable processors because it has a
`lot of fine—grained parallelism in the 5mm of hit-level irreg-
`ular data movements which make software implementation
`on conventional microprocessors difficult and inefficient.
`Pnirum 1:6»Q oils:
`
`
`
`
`
`
`
`
`
`
`Figure 6: DES f—hox Algorithm
`Figure 5 shows an outline ol the DES aigorithm. DES
`tel-roe as input 64-bit plaintext. After the initial permuta-
`tion. there are 16 rounds of “Maori" operations. as shown
`in Figure 6. ineiuding expansion permutation. XOR with
`the key, S-box table lookup. P-hox permutation. and XOR
`with the mull. of the previous round. The final permu-
`tation is performed on the 64-bit rault of the sixteenth
`round.
`
`We used the DES encryption program [17] based on
`the Electronic Codebook (ECB) mode. Although the ECB
`mode is In- secure thau the Cipher Block Chaining (CBC)
`mode. if. is more commonly used and its operation can be
`pipelined.
`
`3.2.1 DES Implementation on REMARC
`We decided to divide the algorithm between the main pro—
`cmsnr and REMARC. The initial permutation and the fi-
`nal permutation are executed by the main processor and
`the 16 rounds of (-120): operations are executed by RE—
`MARC. Each row of name processors execut- two f—hox
`operations. For instance. eight nun precursors in the row
`i] execute the first and second rounds. the row I execute
`the third and fourth rounds, and so on.
`
`CPU Cycle
`
`—umnal
`“new.
`—sm_ 12 sum
`P-Im err-mun:- Eamon]
`
`mam-Irma:-
`
`Table 4: Execulion Cycle Breakdown of Two f—bou Operations
`on REMARC
`
`Table 4 is the execution cycle breakdown of the two i-
`bnx operations implemented by the eight nano processors
`in each row. The 16 f-box operations are pipelioed into
`8 stages by the eight rows of the nano processor array.
`REMARC can generate a. result of the 16 (their operations
`every 51 cycles. As Table 4 show, because of the limited
`interconnection of REMARC. more than half (28 eyelet) of
`the execution time are used for the expansion permutation
`and the P-box permutation.
`
`3.2.2 DES Implementation on the FPGA Co-
`processor
`First. we estimate the latency and the throughput of one
`f—hox operation. The expansion permutation and the XOR
`operation with the keys can be executed in one cycle be—
`cause it can be implemented by long wire and simple logic.
`The S—hnx table lookup requires two cycles to execute be-
`cause it consists of LUTs and MUXS. The P-hoor permu-
`tation and the XOR operation can be executed in one cy-
`cle. The total latency of the f—box operation is four cycles.
`and it can be fully pipelinedt Therefore.
`the maximum
`throughput of the f-hox operation is one FPGA cycle.
`We assume three distinct cases, implementing one f-
`hox, 16 f-boxea. and. all of the DES encryption algorithm
`including 16 f-boxes. the initial permutation. and the final
`permutation.
`in the case of one f-box implementation. because each
`f-box operation takes as input the result of the previous
`f-box operation. the I—box operations cannot be pipelined.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 5
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 5
`
`
`
`
`
`Tfpga
`
`FPGA Coprncusor
`[1 f-box)
`
`FPGA opmcessor
`(l6 f—boxee]
`
` Tcpu it ill
`
`Table 5: Execution Throughput of [6 l—box Operations on
`Difierent FPGA Coprocnesors [CPU Cycles)
`
`D BilatSinglo-Issue) El +FPeM1sr-bus)
`E memo
`eFF‘GAlAHDESJ
`9FFGA (1 tom}
`
`
`Ellio- Ell.
`
`
`
`CycleCmnls(Mcycles] -_I II I
`
`Two
`
`TOOL! It 5
`Two x 2
`Tfpga
`
`Toou x 10
`
`Figure 7: DES Encryption (1M Bylfi} Results - Single Issue
`Architecture
`
`It takes 64 FPGA cycles (4 cycles x 16 rounds) to exe-
`cute the 16 rounds [or the [-‘oox operations. When the 16
`f-boxes are implemented in the FPGA array.
`the opera.-
`tions can be pipelined.
`It takes only one FPGA cycle to
`generate a new result. for the 16 {-1101 operations after the
`pipeline is filled up. Table 5 normalizes the throughput.
`of the 16 l-‘oox operations in CPU cycles for both the one
`febox implementation and the 16 film): implementation.
`When the complete DES encryption algorithm is Im-
`plemented in the PPGA coprocessor, the initial and final
`permutations can be Fully pipelined and a new 64-bit ci-
`phertcxt character is generated every FPGA cycle.
`To estimate the number of CLBs without wiring over-
`head. we use the following method. As shown in Figure 6,
`the f-box operation requires a 48-bit XOR‘ eight 6-bit in-
`put 4-bit output LUTS. a 32-bit 4-1 Multiplexer, and a.
`32-bit XOR. A two—bit XOR or a 4-1 Multiplexer fits to
`one CLB. A 6-4 LUT needs 16 4-1 LUTs or S CLBs.
`In
`total, 136 CLBs are required for one f-box operation. The
`16 “not: implementation needs 2176 (136 x 16) (31.35. We
`expect that in an actual implementation more CLBs would
`he dedicated to the extra wiring which realizes the initial
`permutation and the final permutation.
`
`D Base (Nestle)
`E] +215ch
`
`[:1 +FPGA(1f-boxi we
`
`E +FPGA (IE I—haxs:
`+FPGA (Ni DES)
`
`so
`so
`TD
`so
`so
`40
`30
`20
`10
`
` Cycle
`
`
`
`Counts[Mcydes)
`
`
`
`
`Ililmlilm
`TCDu
`'lcpu x 2
`Tcpu x 5
`Tfpga
`
`Topu x 10
`
`Figure 8: DES Encryption (1M Bytes] Results - 2~way Su-
`perscalar Architecture
`3.2.3 Performance Evaluation Results
`Figure 1' shows the result of the DES encryption of 1M
`byte; data. The base architecture of the main Processor is
`a single issue processm.
`'I‘a Optimize performance we im-
`plemented a software pipeliuing technique whirJi can over-
`lap operations in the main processor and the reconfigurable
`coprocessor; REMARC achieves a 7.3 times performance
`improvement. The FPGA coprocessor achieves the same
`performance improvement if its execution cycles for the 16
`f-box operations are 155 than 64 CPU Cycles. This indi-
`cates that the main processor which executes the initial
`permutation and the final permutation limits the perfor‘
`mance improvement in that! cases.
`if the FPGA coprocessor implement; all of the DES op-
`erations, the execution cycles are reduced to 9.0 M CPU
`cycles. Reducing the load on the main processor, this im-
`plementation delivers a 1.4.5 times performance irriprova.L
`ment over the base processor. Even though the cycle time
`of the FPGA coprocessor is Iii): Tcpu,
`it takes only 10
`CPU cycles to generate one 64-bit encrypted data. This
`execution time on the FPGA coprocessor is short enough
`that the main processor still limits the performance.
`Using a higher performance main processor is another
`way to reduce the total execution time. As shown in Fig-
`ure 8. if a flurry superscalar processor is used as the main
`proce‘nor, the execution times are reduced from 17.3 M
`CPU cycles to 14.3 M CPU cycles for the +REMARC,
`+FPGAU. I-box] at Two cycie time, and +FPGAU$ 1'-
`box) implementations.
`The other point that is demonstrated in Figure T and
`Figure B is that the one f—box implementation at slow cy-
`cle times, especially Tcpu 'x 5 and Tcpu at 1i]. delivers only
`limited performance improvements Even though the exe-
`cution time of one Lbox operation is onlyr four FFGA cy-
`cles, the slow cycle time of the PPGA coprocessor results in
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 6
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 6
`
`
`
`small performance gains. Furthermore. the perl'ormance is
`limited by the reconfigurable coprocessor: there is no per-
`formance improvement even il‘ the 2-way superscslar main
`processor is used as shown in Figure 8.
`
`3.3 MPEG-2 Decoding
`MPEG-2 is one of the most popular video compression
`standards. The algorithm for MPEG-2 decoding consists
`of five stages. Variable Length Decoding {VLD}, Inverse
`Quantization (IQ), Inverse DCT (IDCT). Motion Com-
`pensation (MC). and the addition of the DCT and the
`MC retults [AddBlock). We used the mpegfidecadeflgl
`program distributed by MPEG software simulation group
`as a base MPEG-2 decoding program. The mpegsdeoode
`program was modified using the reconfigurable coproces-
`sor instructions. The hitstrenm “mobile“ was chosen as a
`benchmark,
`
`El MC D Adojiock El Iocr
`
`VLosIo
`
` D
`
`2!! ”£05080 TDBDW‘IW
`
`in
`
`Figure 9: Execution Time Breakdown of MPEG-2 Decoding
`To determine the execution time breakde we pro-
`filed the MPEG-2 decoding program using the profiling
`tool pm‘e [18]. Figure 9 shows the percentage of execution
`time spent in each stage while decoding 30 frames. The fig-
`ure indicates that the execution time is distributed among
`MC, AddBlock. IDCT. and VLD/lQi This suggests that
`reconfigurable coprocessors need to support multiple func-
`tions to accelerate the entire M?EG-2 decoding. As a first
`step, however. we investigate the acceleration of IDCT.
`which represents 31% of the execution time for MPEG-2
`decoding.
`
`3.3.] Acceleration of IDCT
`
`We use the fast IDCT algorithm which is the reverse of the
`DCT algorithm presean in [20]. The lD-EDCT algorithm
`coneisteof 12 multiplications and 32 additions/subtraction.
`The 518 2-D IDCT is composed using a row IDCT followed
`by a column IDCT. Each of the row and column lDCTE
`consists of eight 1-D IDCTs.
`For REMARC.
`the lD-IDCT algorithm is changed
`slightly. The algorithm consists of 16 multiplications and
`30 additions/subtractions to achieve more regular data
`movements. Each of the eight nauo processors in the same
`row or column calculates a 1-D IDCT. Therefore, two mule
`tiplications are mapped into each uanc processor to main-
`tain the load balance even among the nano processors.
`First, each of the eight nano processors in the same row
`calculate the 1-D row IDCT. Then, each of the eight nano
`processors in the same column calculate the 1-D column
`iDCT. Because the 2-D lDCT is done in place, there is
`no need to execute a matrix transposition.
`in this fashion
`
`an 5x8 nano processm may naturally implements the 2.
`D IDCT. The 2-D IDCT can be executed in .54 cycles by
`REMARC. REMARC can realize a. fast 2-D IDCT imple-
`mentation which exploits the large amount of parallelism
`from the algorithm.
`We estimate the latency and the throughput of the 1-
`D IDCT for the FPGA coprocessor. The IDCT algorithm
`has five stages of additions/subtraction and one stage of
`multiplication. The hit width of these operations is 16
`hits. We assumed each stage of the additionsfsuhtractions
`requires two FPGA cycles. The multiplication requires
`tour FPGA cycles to atecute using a constant (k) coefi-l.
`cient multiplier (KCM). Therefore. the latency of the 1-D
`IDCT is 14 FPGA cycles ( 2 FPGA cycles it 5 + 4 FPGA
`cycles). The throughput is restricted to four FPGA cycles
`by the ROM.
`We implement the IDCT in the FPGA array using two
`models. One model only implements one 1DIDCT and
`the matrix transposition circuits. This lD-IDCT circuit
`is used 16 times to realize a 2-D IDCT. The other model
`implements eight lD-IDCTS and the matrix transposition
`circuits.
`In the first IDCT implementation, eight lD-lDCTs can
`he pipelined‘ It takes 42 FPGA cycles; ( 14 FPGA cycles
`+ 4 FPGA cycles it 7 J to execute eight lD-lDCTs. or a
`row/column iDCT.
`The matrix transposition is required between the row
`IDCT and column IDCT, and it breaks the IDCT exe-
`cution pipeline, The total execution cycle counts {or one
`2D-IDCT are 84 FPGA cycles for the one leIDCT im-
`plementation and 28 FPGA cycles for the eight lD-IDCT
`implementation. Table 6 normalizes the execution cycles
`in CPU cycles.
`We estimate the number of CLBs to implement 1D-
`IDC‘T and the matrix transposition. At least 808 CLBS
`are necasary for lD-IDCT and 512 CLBS for the matrix
`transposition. The implementation of one lD-IDCT and
`the matrix transposition requires 1320 CLBs‘ and the im-
`plementation of eight iD-‘lDC‘I‘s and the matrix transpo-
`sition requires 6976 (31.35. Again these estimates include
`no wiring overhead.
`
`F‘ A uroceesor
`ix lD-I‘DCT HW.)
`
`'GA aprocesoor
`Ex Ill-IDCT HW.)
`
`
`
`Table 6: 2D-IDC‘T Execution Cycla (CPU Cycles)
`Figure 10 shows the execution time of the MPEG-2 de
`coding for 15 frames of the “mobile" bitstream. Execution
`times of the processors with the reconfigurable coprocessor
`are compared with the single issue base procmsor and the
`procmr with the multimedia extensions (+MMX).
`Although the multimedia extensions can reduce the cyv
`tie counts of the IDCT part by a factor of 2.7 from the base
`processor. REMARC and the FPGA coprocolsor which cp-
`erate at the same frequency as the main processor achieve
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 7
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 7
`
`
`
`B
`
`CycleCounts(Mcycles)_n.-NuaeeE8
`
`5
`Do
`E'EOFF‘
`géoo
`w U-IDI:
`+
`.-
`a: E;‘
`5
`In
`
`3
`n
`e
`
`5e
`gt
`he
`vli‘
`
`Dr.)
`F‘F‘
`on
`qua
`.-
`.—‘
`In
`
`:3
`as
`at
`1+
`
`00
`FF‘
`on
`an
`.-
`.'—'
`:1):
`
`in
`g
`E
`+9:
`
`Du
`l3?
`on
`on
`.1;
`xx
`
`:e
`ag
`e
`+i-i'
`
`Figure ill: MPEGZ Decoding (15 Frames} Results with Ac-
`celerating lDC'T‘
`from 14.4): to 20.43: performance improvements. They are
`from 5.3:: to Mix faster than the multimedia extensions.
`Another observation from Figure 10 is that the small
`and slow FPGA coprocessor does not achieve good perfor-
`mance improvement, which is similar to the results we saw
`with DES encryption. The FPGA copra-tester implement-
`ing a single 1-D IDCT with a cycle time of Tcpu 1 ID is
`no longer faster than the multimedia extensions.
`
`3.3.2 Acceleration of MC and AddBlock
`
`The execution time of [DCT can be dramatically reduced
`by the reconfigurable coprocessor. However, the execution
`time of IDCT is only 31% of the entire MPEG-2 decoding.
`The performance improvement of the entire program is
`only 22.8% for the multimedia instructions and 38.0% for
`REMARC. This suggests that a reconfigurable coprocessor
`needs to support multiple i'unctions to accelerate the entire
`MPEG-2 decoding.
`In addition to IDCT, we can accelerate MC and
`Addflloclt
`using
`the multimedia
`instructions
`and
`REMARC. This cavers more than 70% of the original ex-
`ecution time. The REMARC instructions for MC and
`AddBl-ck are SIMD type instructions and very similar
`to multimedia instructions. They can Operate on B or 16
`data elements at the same time using 8 or 16 nano proces-
`ears.
`The performance evaluation [melt is shown in Fig-
`ure 11. Using REMARC as a coprocessor the performance
`improves hy afactor oflfl from the base procasor and 1.33
`from the architecture with the multimedia extensions.
`We were not able to evaluate the FPGA coprocessor
`implementation of MC and Addfilock in detail. However.
`the FPGA coprocessor can also implement the some SIMD
`
`
`
`CycleCounls(Mcycles}
`
` a
`“fiefiifi
`see
`
`+MMX
`
`‘REMARC
`
`Figure 11: MPEG-2 Decoding (15 Frames} Results with A:-
`celerating IDC'I‘. MC. and Addjlock
`type instructions as those used in REMARC for these al-
`gorithm. The performance of the PPGA coprocasor will
`he almost the same at REMARC, if its cycle time is Tcpu
`or Tcpu x 2.
`If the cycle time of the FPGA coprocese
`nor is mud: slower than that of the main processor.
`its
`performance will decrease greatly because the END type
`instructions only exploit limited parallelism.