throbber
[MiOSlS]l
`
`A Quantitative Analysis of Reconfigurable Coprocessors
`for Multimedia Applications
`
`Takashi Miyamori
`System ULSI Engineering Laboratory
`TOSHIBA Corporation. JAPAN
`miyamori©sdei.toshiba.co.jp
`
`Kunle Olukotun
`
`Computer Systems Laboratory
`Stanford University
`kunle@ogun.stanford.edu
`
`Abstract
`
`Recently, computer architectures that carnhinc n reconfig-
`urnhie (or relorgetohte) coprocessor with n gamut-purpose
`microprocessor have been proposed.
`These architectures
`are designed to exploit
`large amounu of fine grain pur-
`olleiisrn in applications.
`In this paper. we study the per-
`formance of the reconfigurable copmcessors an multimedia
`applications. We compare o Field Programmable Gate Ar-
`ray {FFGAJ based reconfigurable mpromsor with the army
`processor salted REMARC {Reconfigurable Multimedia Ar-
`roy- G‘opmccuor}. REMARC use: a 1616:"! simple processor
`that is much larger than n Configurable Logic Black (91.3)
`of cm FPGA. We have developed a simulator. a program-
`ming environment. and multimedia application programs to
`evaluate the performance of the two coprocessor architec-
`tures. The simulation results show that REMARG achieves
`specdups ranging from a factor of 2.3 to 713 on there cp-
`piicotions. The FFGA coprocessor achieves sin-lilo!- per-
`formance improvements. However, the FPGA coprocessor
`needs more hardware area to nchieve the some performance
`improvement as REMARC.
`
`Introduction
`1
`it he-
`As the use of multimedia applications increases.
`comes important to achieve high performance on algo-
`rithms such as video compression. decompression. and im-
`age processing with generalopurpose microprocessors. This
`has motivated the recent addition of multimedia instruc-
`tions to most general-purpose microprocessor ISAs [1—33.
`These ISA extensions work by segmenting a conventional
`64-bit datapath into four 15-bit or Eight 8-bit datnpaths.
`The multimedia instructions exploit fine grain SIMD par-
`allelism by operating on four 16-bit or eight 3-bit data
`values. However, a 54-bit datapath limits the speedups
`to a factor of four or eight even though many multimedia
`applications have much more inherent parallelism.
`Computer architectures that. connect a reconfigurable
`coprocessor to a general-purpose microprocessor have been
`proposed [4—9]. The advantage of this approach is that
`the coprocessor can be reconfigured to improve the parlor,
`mance of a particular application. All of these proposed
`architectures use field. programmable gate arrays {FPGAs)
`for the recufigurahie hardware. We refer to this coproce-
`
`nor as an “FPGA coprocessor" in this paper. The FPGA
`architecture. which has narrow programmable logic blocks
`nnd programmable interconnection network. provides great
`flexibility for implementing application specific hardware.
`However. the rich programmable interconnection comes at
`the price of reduced operating frequency and logic density.
`Array processors. such as general-purpose systolic er-
`rny processors. wavefront array processors [10]. PADDI-
`2 [ll]I end MATRIX [12], are other reconfigurable archi-
`tectures. These processors have 3-bit or 16-bit datapaths
`and each programmable logic block has an S to fit—entry
`instruction RAM that makes it easy to support multiple
`functions. Because multimedia (or DSP) applications pre-
`dominantly manipulate 8-bit or 1643i: data values. these
`architectures work very well on than applications.
`Re—
`cently. we proposed a new array processor architecture
`called REMARC (Reconfigurable Multimedia Array Co-
`prucwsorfllii]. REMARC is a reconfigurable coprocessor
`that is tightly coupled to a. main RJSC proceanor and con-
`sists otsglobal control unit and 64 [obit simple procmors
`called mum processors.
`Both the FPGA coprocessor and REMARC are not
`limited to SiMD parallelism that can be exploited by mul-
`timedia extensions such as the intel MMX[2]. They can
`exploit various kinds of fine grain parallelism in multime-
`din applications. Using more processing resources. they
`can achieve higher performance than the multimedia axe
`tensions. To understand how them two coprocessor archi-
`tecture compare.
`in this paper we evaluate the oust and
`performance of these architectures. The architecture of
`FPGA caprooessors are still in flux. so we evaluate the per-
`formance of the FPGA coprocessm with a varying number
`of CLBs and my the cycle time of the FPGA coprocessor
`from in to 10:: that of the main processor. For the perfor-
`mance evaluation. we use detailed simulators and two real~
`istic application programs, DES encryption and MPEG-2
`decoding. We also stimate the chip sizes of processors
`with REMARC and the FPGA coprocasor and compare
`their performance when the same die size is used for both
`architectures.
`In 580
`The rest of this paper is organized as follows.
`tion 2. we dmribe the reconfigurable ooprocasor archie
`tenures. both REMARC (array based) and FPGA based.
`In Section 3, we Show the mults of our performance eval-
`uation. In Section 4. we estimate chip sizes of processors
`
`|PFl2018-01594
`
`EXHIBIT
`
`2056
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 1
`
`IPR2018-01594
`
`EXHIBIT
`2056
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 1
`
`

`

`with REMARC and the FPGA coprocessor. Finally. we
`conclude in Section 5.
`
`2 Reconfigurable Coprocessor Ar—
`chit ect ure
`
`2.1 Architecture Overview
`
`roan
`It:
`”no.2
`Idea!
`mlcz
`we}!
`old
`1:ch
`
`Ire frown or mean)
`caring. afilzll‘baul
`0192.73, nflsctl’om'c)
`eupflJlgt oflsetfbnsej
`coping, an:
`coping. it:
`naparey, Ire
`coping, list
`
`’
`
`Inslrucliorl
`Cache
`
`'
`
`MainProcessor
`
`Bus Interface Unll
`
`Figure 1: Block Diagram ora Microprocessor with Reconfig-
`urable Coproressor
`Figure 1 shows a block diagram of a microprocessor
`which includes a reconfigurable coprocessor. The recon-
`figurable coprocessor consists ofa global control unit, co-
`procmsor data registersI and a reconfigurable logic array.
`Recently, we proposed the REMARC architecture which
`includ$ an 8x8 16-bit processor (nano processor) array as
`its reconfigurable logic array[13]. The other reconfigurable
`coprocemr that we consider in this paper, the FPGA co—
`processor. uses FPGAs for the reconfigurable logic array.
`The global control unit controls the execution of the reconv
`figurable logic array and the transfer of data between the
`main processor and the reconfigurable logic array through
`the coprocessor data registers
`We use the MIPS-H ISA [14] as the base architecture
`of the main processor. The MIPS ISA is extended for the
`REMARC and the FPGA coprocessor using the instruc‘
`tions listed in Table 1. The main processor issuer these
`instructions to the reconfigurable coprocessor which exe-
`cutes them in a manner similar to a floating point copro-
`cream. Unlike a floating point coprocessor. the functions
`of reconfigurable coprocfisor instructions are configurable
`(or programmable) so that they can be Specialized for spe-
`cific applications.
`The configuration instructions, roan. rgcon. or mourn
`download the configuration data from memory and store
`them in the reconfigurable coprocessor. The start address
`of the configuration data is specified by the value of the
`source register {are}. The for instruction starts execution
`of a reconfigurable coprocessor instruction The sum of
`
`Table 1: New instructions Used in Reconfigurable Coproces—
`son
`
`the ofisct field and the base register specifies one of the 0p-
`erations to execute. The lducfl and grind instructions are
`load and store coprocessor instructions which transfer dou-
`ble word (54-bit) data between memoryr and the reconfig-
`urable coprocessor data registers. The mid. and. mic-2 in—
`structions transfer word (32-bit) data between the general-
`purpose registers (integer registers) in the main processor
`and the reconfigurable coprocessor data registers. The cfcfl
`and 51:2 instructions transfer data. between the integer reg-
`isters and the reconfigurable coprocessor control registers.
`The reconfigurable Dopmceflors do not have I direct
`interiace to the data cache or memory. The main proces
`sot has to set the input data to the coprocessor data reg-
`isters using lduc2 and rates instructions before execution
`of re: instructions. Then, the reconfigurable coprocessor
`reads the input data, executa the operations, and stores
`the raults into the coprocessor data registers. Finally, the
`main processor reads the results using sduc? and MR2 in-
`struetions,
`
`2.2 Pipeline Organization
`
`Main 9mm- 9mm- BEEN“
`nemnc Fianna
`"-EEIIEIIEE
`
`Mah- Pmrt‘o—Hm flElEltlm
`FPGA copmwssor Pip-an:
`HEREB-
`
`Figure 2: Pipeline Organization of Reconfigurable Coproeee—
`our
`
`The pipeline for REMARC, the FPGA coprocessor.
`and the main processor are shown in Figure 2. The main
`processor pipeline is similar to the MIPS 3.3000 and the
`MIPS M000 and consists of five stages: Instruction Fetch
`(F), Instruction Decode (D), Execution (E), Memory Ac-
`cess (M). Register Write-back (W). The reconfigurable co—
`processor pipelinu are independent of the main processor
`pipeline; therefore. the main processor can execute concur-
`rently with the reconfigurable coprocessors.
`The REMARC pipeline starts from the M stage of the
`main processor and has the following six stages:
`RF : An instruction of the globe] control unit is fetched.
`RR : The REMARC data registers are read.
`KL : The data are aligned or “unpacked".
`RI : The instructions of the nano processors are fetched.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 2
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 2
`
`

`

`HBUS
`
`
`
`
`BOUT
`
`q
`
`IE'i'i'
`
` !
`
`7
`
`mPc
`
`ll— '
`
`I.
`
`figure 4: Nana Processor Architecture
`register (IR), eight 16-bit data registers (DR). four ‘16-
`hit data input registers (DIR), and a 16—bit data output
`register (DOE).
`Each nano processor can use the DH. registers. the DIR.
`registers, and immediate data as the source data of ALU
`operations. Moreover. it can directly use the DOR regis-
`ters of the four adjacent nsno processors (DINU, DIND.
`DINL, and DINR) as the source.
`The nano processors are also connected by the 32-bit
`Horizontal Buses {HBUSsl and the 32-bit Vertical Buses
`(VBUSa). Each bus operates as two 16-bit data buses. The
`16-bit data in the DOR register can he sent to the upper
`or lower 16 hits of the VBUS or the HBUS. The HBUSs
`and the VBUSs allow data to be broadcast
`to the other
`nano processors in the same row or column. These buses
`can reduce the communication overhead between proces-
`sora separated by long distances.
`The DIR registers accept inputs from the HBUS, the
`VBUS, the DOR. or the four adjacent nano processors.
`Because the width of the HBUS and. the VBUS is 32 hits,
`data on the HBUS or the VBUS are Stored into a DIR
`register pair. DIRB and DIRl, or D192 and DIR3. Using
`the DIR registers, data can be transfered between sane
`processors during ALU operations.
`It takes a half cycle to transfer data using the VBUSS
`or HBUSS.
`It should not he a critical path of the de-
`sign. Other operations, except for data inputs from near-
`est neighbors. are done within the nano processor. Because
`the width of a norm processor's detapath is only 16 bits‘
`which is a quarter of those of the general purpose micro-
`processors. this careful design does not make REMARC a
`critical path of the drip.
`
`2.4 FPGA Caprocessor Architecture
`The reconfigurable logic array of the FPGA coprocessor is
`composed of configurable logic blocks (CLBs). Each CLB
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 3
`
`RE : The new prose-hrs execute the instructions.
`RW : The executed results are padced and stored into
`the REMARC data registers.
`The FPGA coprocessor pipeline has four stages as fol-
`lows:
`
`RF : The sequencer of the global control unit starts its
`execution.
`RR : The coprocessor data registers are read.
`111?.
`r The reconfigurable logic may starts execution.
`RZW .' The execution results are stored into the coproces—
`sor data registers.
`The RL and R1 stages are unnecmary in the FPGA
`coprocessor because load alignment or unpack operations
`are realized directly in the FPGA array and FPGM do
`not have instructions to fetch and execute.
`
`2.3 REMARC Architecture
`whom Main Processor
`
`
`
`
`
`
`
`
`.1mfimIS‘fl
`r
`I!
`
`
`
`Eli=ii= —iI
`
`
`
`ill. AIME
`
`
`oats
`Ml
`Oil—2
`con: mi"!
`00L! ml CELT
`
`Figure 3: Block Diagram of REMARC
`In this section. we briefly describe the REMARK} archiv
`oecture‘, more information can he found in [13]. Figure 3
`shows a block diagram of REMARC. REMARC‘S reoon-
`figurahle logic is composed of an 8x8 arrayr of the 16-bit
`processors, called auto processors. The execution of each
`nano pmcetsnr is controlled by the instructions stored in
`the local instruction 11AM (nu-Lo instruction RAM). How-
`ever. each name processor does not directly control the
`instructions it executes. Every cycle the nano promor
`receives a PC value. "mo PC", from the global control
`unit. All nano processors use the same norm PC and axe
`cuts the instructions indexed by the nano PC in their name
`instruction RAM.
`Figure 4 shows the architecture of the nano processor.
`The one processor contains a 32-entry nano instruction
`RAM. a 16-bit ALU. a lG—entry data RAM, on instruCtion
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 3
`
`

`

`I
`
`
`I @MARC
`EPG'I't—Onproceesor
`Ina]
`64 - rooeasors
`mam , 50m LB:
`natructlon
`untmlled by instruction
`on _u -- 3" switch matrix
`4 nei hora. VBUS. and HBIS
`
`Execution CDIIHO
`interconnection
`
`1X. 2x. 5x, lofl‘cpu
`
`Table 2: CDPIDEfiSDr Architecture Comparison
`
`is equal to a. CLB of the Xillnx 4000 series. The CLB
`includes two 4'1 lookup tables (LUTs) and two flip-flops.
`We do not fix the number of the CLBa in the FPGA
`CDDTUWDL Instead. we evaluate the performance of the
`FPGA coprocessor using a varying number of CLBS. The
`cycle time of the FPGA coproceseor is also parameter-
`ized. Cycle times of current FPGA systems are longer
`than those of the microprocessors by factors of five to
`ten. For instance. FPGA systems usually operate at 30
`to 100 MHz while state-ofithe-art microprocessors Operate
`at more than 400 MHZ. However. the recently proposed
`reconfigurable coprocessor Sup [8] aims to operate at the
`same operating frequency as its main processor. Therefore.
`we assumed the cycle time of the FPGA coprocessor could
`he 5:: or 10): that of the main processor for current FFGA
`architectures and 1x or 2:: {or future FPGA architectures.
`
`2.5 Coprocessor Architecture Compar-
`ison
`Table 2 summarizes the comparison of the two reconfig-
`urable coprocunr architectures. REMARC has larger
`processing elements. the nano processors, than the FPGA
`coprocessor. However. the number of mono processors is
`less than that or CLBs in the FPGA coprocessor.
`1n
`REMARK}, both the acecution and data transfer are con-
`trolled by instructions. while these are controlled by hard-
`wired logic in the FPGA. coprocessor. REMARC has lim-
`ited hardware interconnections. Each nanu processor has
`direct inputs from the four nearest neighbors and it is con-
`nected by two 32-bit data buses The FPGA coprocessor
`has more flexible hardware interconnections which are con-
`figured in a bit-wise fashion,
`We assume that REMARC will operate at the same
`frequency as the main processor. The cycle times of the
`FPO-A copra-user (Tfpgo) are varied. We evaluate the
`FPGA coprocessor performance. assuming Tl'pga values
`that are ix, 2:. 5x‘ and. 10x the CPU cycle time (Tcpu).
`
`3 Performance Evaluation
`
`3.1 Simulation Methodology
`We developed the reconfigurable coprocessor simulator us—
`ing the SimOS simulation environment [15] . SimOS mod-
`es the CPUS, memory systems. and NO devicm in suffi-
`cient detail to boot and run a commercial operating syn-
`tern. As a has: CPU simulation model. we used the
`
`“MIPSY " which models a simple single issue RISC pro-
`cesaor similar to the MIPS RBUIJG.
`REMARC functions are added. to the MIPSY model.
`The. latency of a reconfigurable execution (Her) instruction
`is the sum of the number of the executed global instruc—
`tions and the pipeline latency (5 cycles). If a following in-
`struction attempts to read the result of a re: instruction.
`the pipeline will stall for the cycles of the m: lnstruction’s
`latency (Data Dependency) Furthermore. a following re:
`instruction will stall if a. previous re: instruction is still
`executing (Resource Conflict).
`We evaluate the execution time of the FPGA coproces-
`sor by changing the latencies of the rec instructions. First,
`we estimate the number of execution cycles of the re: in—
`structions based on the FPGA delay model. In this model.
`two sequences can he executed within one FFGA cycle:
`- One stage of interconnection by long or short wire
`and. one stage of function not using the carry chain
`- One stage of interconnection by short wire and any
`one stage of function
`the same as the Garp's delay
`This model
`is almost
`model [3] and the X04000? medium frequency design which
`will operate at about 70 MHz [16]. Then, we normalise the
`number of execution cycla based on the CPU cycle time,
`assuming the F13GA cycle time is 13:. 2x. 5x. or 10:: the
`CPU cycle time. Finally. we use this estimated cycle count
`of the re: instruction in the simulator.
`We also developed simulators which can execute mul-
`timedia instructions similar to the Intel MMX instruction
`set extensions [2].
`the same application
`To make the comparison fair,
`source codes are used for the evaluation except for the
`use of the multimedia instructions or the augmented cow
`procwsor instructions. The memory system parameters
`used commonly through the performance evaluations are
`found in Table 3‘ We used the “goo" compiler with the I'-
`02" optimization option. This option executes most global
`compiler optimizations except for loop unrolling and func-
`tion inlioing.
`
`3? K bytes. 2-way an.
`IMH- 32 K him. 2-war set-
`256 K bytes. 27w” set.
`n he mm
`
`M mi” will?
`
`Table 3: Memory System Parameters
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 4
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 4
`
`

`

`3.2 DES Encryption
`The Data Encryption Standard (DES) is one 01' the most
`important encryption algorithms and has been a worldwide
`standard for over 20 years. It is widely used to provide sew
`cure communication over the Internet DES is also a. good
`application {or reconfigurable processors because it has a
`lot of fine—grained parallelism in the 5mm of hit-level irreg-
`ular data movements which make software implementation
`on conventional microprocessors difficult and inefficient.
`Pnirum 1:6»Q oils:
`
`
`
`
`
`
`
`
`
`
`Figure 6: DES f—hox Algorithm
`Figure 5 shows an outline ol the DES aigorithm. DES
`tel-roe as input 64-bit plaintext. After the initial permuta-
`tion. there are 16 rounds of “Maori" operations. as shown
`in Figure 6. ineiuding expansion permutation. XOR with
`the key, S-box table lookup. P-hox permutation. and XOR
`with the mull. of the previous round. The final permu-
`tation is performed on the 64-bit rault of the sixteenth
`round.
`
`We used the DES encryption program [17] based on
`the Electronic Codebook (ECB) mode. Although the ECB
`mode is In- secure thau the Cipher Block Chaining (CBC)
`mode. if. is more commonly used and its operation can be
`pipelined.
`
`3.2.1 DES Implementation on REMARC
`We decided to divide the algorithm between the main pro—
`cmsnr and REMARC. The initial permutation and the fi-
`nal permutation are executed by the main processor and
`the 16 rounds of (-120): operations are executed by RE—
`MARC. Each row of name processors execut- two f—hox
`operations. For instance. eight nun precursors in the row
`i] execute the first and second rounds. the row I execute
`the third and fourth rounds, and so on.
`
`CPU Cycle
`
`—umnal
`“new.
`—sm_ 12 sum
`P-Im err-mun:- Eamon]
`
`mam-Irma:-
`
`Table 4: Execulion Cycle Breakdown of Two f—bou Operations
`on REMARC
`
`Table 4 is the execution cycle breakdown of the two i-
`bnx operations implemented by the eight nano processors
`in each row. The 16 f-box operations are pipelioed into
`8 stages by the eight rows of the nano processor array.
`REMARC can generate a. result of the 16 (their operations
`every 51 cycles. As Table 4 show, because of the limited
`interconnection of REMARC. more than half (28 eyelet) of
`the execution time are used for the expansion permutation
`and the P-box permutation.
`
`3.2.2 DES Implementation on the FPGA Co-
`processor
`First. we estimate the latency and the throughput of one
`f—hox operation. The expansion permutation and the XOR
`operation with the keys can be executed in one cycle be—
`cause it can be implemented by long wire and simple logic.
`The S—hnx table lookup requires two cycles to execute be-
`cause it consists of LUTs and MUXS. The P-hoor permu-
`tation and the XOR operation can be executed in one cy-
`cle. The total latency of the f—box operation is four cycles.
`and it can be fully pipelinedt Therefore.
`the maximum
`throughput of the f-hox operation is one FPGA cycle.
`We assume three distinct cases, implementing one f-
`hox, 16 f-boxea. and. all of the DES encryption algorithm
`including 16 f-boxes. the initial permutation. and the final
`permutation.
`in the case of one f-box implementation. because each
`f-box operation takes as input the result of the previous
`f-box operation. the I—box operations cannot be pipelined.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 5
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 5
`
`

`

`
`
`Tfpga
`
`FPGA Coprncusor
`[1 f-box)
`
`FPGA opmcessor
`(l6 f—boxee]
`
` Tcpu it ill
`
`Table 5: Execution Throughput of [6 l—box Operations on
`Difierent FPGA Coprocnesors [CPU Cycles)
`
`D BilatSinglo-Issue) El +FPeM1sr-bus)
`E memo
`eFF‘GAlAHDESJ
`9FFGA (1 tom}
`
`
`Ellio- Ell.
`
`
`
`CycleCmnls(Mcycles] -_I II I
`
`Two
`
`TOOL! It 5
`Two x 2
`Tfpga
`
`Toou x 10
`
`Figure 7: DES Encryption (1M Bylfi} Results - Single Issue
`Architecture
`
`It takes 64 FPGA cycles (4 cycles x 16 rounds) to exe-
`cute the 16 rounds [or the [-‘oox operations. When the 16
`f-boxes are implemented in the FPGA array.
`the opera.-
`tions can be pipelined.
`It takes only one FPGA cycle to
`generate a new result. for the 16 {-1101 operations after the
`pipeline is filled up. Table 5 normalizes the throughput.
`of the 16 l-‘oox operations in CPU cycles for both the one
`febox implementation and the 16 film): implementation.
`When the complete DES encryption algorithm is Im-
`plemented in the PPGA coprocessor, the initial and final
`permutations can be Fully pipelined and a new 64-bit ci-
`phertcxt character is generated every FPGA cycle.
`To estimate the number of CLBs without wiring over-
`head. we use the following method. As shown in Figure 6,
`the f-box operation requires a 48-bit XOR‘ eight 6-bit in-
`put 4-bit output LUTS. a 32-bit 4-1 Multiplexer, and a.
`32-bit XOR. A two—bit XOR or a 4-1 Multiplexer fits to
`one CLB. A 6-4 LUT needs 16 4-1 LUTs or S CLBs.
`In
`total, 136 CLBs are required for one f-box operation. The
`16 “not: implementation needs 2176 (136 x 16) (31.35. We
`expect that in an actual implementation more CLBs would
`he dedicated to the extra wiring which realizes the initial
`permutation and the final permutation.
`
`D Base (Nestle)
`E] +215ch
`
`[:1 +FPGA(1f-boxi we
`
`E +FPGA (IE I—haxs:
`+FPGA (Ni DES)
`
`so
`so
`TD
`so
`so
`40
`30
`20
`10
`
` Cycle
`
`
`
`Counts[Mcydes)
`
`
`
`
`Ililmlilm
`TCDu
`'lcpu x 2
`Tcpu x 5
`Tfpga
`
`Topu x 10
`
`Figure 8: DES Encryption (1M Bytes] Results - 2~way Su-
`perscalar Architecture
`3.2.3 Performance Evaluation Results
`Figure 1' shows the result of the DES encryption of 1M
`byte; data. The base architecture of the main Processor is
`a single issue processm.
`'I‘a Optimize performance we im-
`plemented a software pipeliuing technique whirJi can over-
`lap operations in the main processor and the reconfigurable
`coprocessor; REMARC achieves a 7.3 times performance
`improvement. The FPGA coprocessor achieves the same
`performance improvement if its execution cycles for the 16
`f-box operations are 155 than 64 CPU Cycles. This indi-
`cates that the main processor which executes the initial
`permutation and the final permutation limits the perfor‘
`mance improvement in that! cases.
`if the FPGA coprocessor implement; all of the DES op-
`erations, the execution cycles are reduced to 9.0 M CPU
`cycles. Reducing the load on the main processor, this im-
`plementation delivers a 1.4.5 times performance irriprova.L
`ment over the base processor. Even though the cycle time
`of the FPGA coprocessor is Iii): Tcpu,
`it takes only 10
`CPU cycles to generate one 64-bit encrypted data. This
`execution time on the FPGA coprocessor is short enough
`that the main processor still limits the performance.
`Using a higher performance main processor is another
`way to reduce the total execution time. As shown in Fig-
`ure 8. if a flurry superscalar processor is used as the main
`proce‘nor, the execution times are reduced from 17.3 M
`CPU cycles to 14.3 M CPU cycles for the +REMARC,
`+FPGAU. I-box] at Two cycie time, and +FPGAU$ 1'-
`box) implementations.
`The other point that is demonstrated in Figure T and
`Figure B is that the one f—box implementation at slow cy-
`cle times, especially Tcpu 'x 5 and Tcpu at 1i]. delivers only
`limited performance improvements Even though the exe-
`cution time of one Lbox operation is onlyr four FFGA cy-
`cles, the slow cycle time of the PPGA coprocessor results in
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 6
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 6
`
`

`

`small performance gains. Furthermore. the perl'ormance is
`limited by the reconfigurable coprocessor: there is no per-
`formance improvement even il‘ the 2-way superscslar main
`processor is used as shown in Figure 8.
`
`3.3 MPEG-2 Decoding
`MPEG-2 is one of the most popular video compression
`standards. The algorithm for MPEG-2 decoding consists
`of five stages. Variable Length Decoding {VLD}, Inverse
`Quantization (IQ), Inverse DCT (IDCT). Motion Com-
`pensation (MC). and the addition of the DCT and the
`MC retults [AddBlock). We used the mpegfidecadeflgl
`program distributed by MPEG software simulation group
`as a base MPEG-2 decoding program. The mpegsdeoode
`program was modified using the reconfigurable coproces-
`sor instructions. The hitstrenm “mobile“ was chosen as a
`benchmark,
`
`El MC D Adojiock El Iocr
`
`VLosIo
`
` D
`
`2!! ”£05080 TDBDW‘IW
`
`in
`
`Figure 9: Execution Time Breakdown of MPEG-2 Decoding
`To determine the execution time breakde we pro-
`filed the MPEG-2 decoding program using the profiling
`tool pm‘e [18]. Figure 9 shows the percentage of execution
`time spent in each stage while decoding 30 frames. The fig-
`ure indicates that the execution time is distributed among
`MC, AddBlock. IDCT. and VLD/lQi This suggests that
`reconfigurable coprocessors need to support multiple func-
`tions to accelerate the entire M?EG-2 decoding. As a first
`step, however. we investigate the acceleration of IDCT.
`which represents 31% of the execution time for MPEG-2
`decoding.
`
`3.3.] Acceleration of IDCT
`
`We use the fast IDCT algorithm which is the reverse of the
`DCT algorithm presean in [20]. The lD-EDCT algorithm
`coneisteof 12 multiplications and 32 additions/subtraction.
`The 518 2-D IDCT is composed using a row IDCT followed
`by a column IDCT. Each of the row and column lDCTE
`consists of eight 1-D IDCTs.
`For REMARC.
`the lD-IDCT algorithm is changed
`slightly. The algorithm consists of 16 multiplications and
`30 additions/subtractions to achieve more regular data
`movements. Each of the eight nauo processors in the same
`row or column calculates a 1-D IDCT. Therefore, two mule
`tiplications are mapped into each uanc processor to main-
`tain the load balance even among the nano processors.
`First, each of the eight nano processors in the same row
`calculate the 1-D row IDCT. Then, each of the eight nano
`processors in the same column calculate the 1-D column
`iDCT. Because the 2-D lDCT is done in place, there is
`no need to execute a matrix transposition.
`in this fashion
`
`an 5x8 nano processm may naturally implements the 2.
`D IDCT. The 2-D IDCT can be executed in .54 cycles by
`REMARC. REMARC can realize a. fast 2-D IDCT imple-
`mentation which exploits the large amount of parallelism
`from the algorithm.
`We estimate the latency and the throughput of the 1-
`D IDCT for the FPGA coprocessor. The IDCT algorithm
`has five stages of additions/subtraction and one stage of
`multiplication. The hit width of these operations is 16
`hits. We assumed each stage of the additionsfsuhtractions
`requires two FPGA cycles. The multiplication requires
`tour FPGA cycles to atecute using a constant (k) coefi-l.
`cient multiplier (KCM). Therefore. the latency of the 1-D
`IDCT is 14 FPGA cycles ( 2 FPGA cycles it 5 + 4 FPGA
`cycles). The throughput is restricted to four FPGA cycles
`by the ROM.
`We implement the IDCT in the FPGA array using two
`models. One model only implements one 1DIDCT and
`the matrix transposition circuits. This lD-IDCT circuit
`is used 16 times to realize a 2-D IDCT. The other model
`implements eight lD-IDCTS and the matrix transposition
`circuits.
`In the first IDCT implementation, eight lD-lDCTs can
`he pipelined‘ It takes 42 FPGA cycles; ( 14 FPGA cycles
`+ 4 FPGA cycles it 7 J to execute eight lD-lDCTs. or a
`row/column iDCT.
`The matrix transposition is required between the row
`IDCT and column IDCT, and it breaks the IDCT exe-
`cution pipeline, The total execution cycle counts {or one
`2D-IDCT are 84 FPGA cycles for the one leIDCT im-
`plementation and 28 FPGA cycles for the eight lD-IDCT
`implementation. Table 6 normalizes the execution cycles
`in CPU cycles.
`We estimate the number of CLBs to implement 1D-
`IDC‘T and the matrix transposition. At least 808 CLBS
`are necasary for lD-IDCT and 512 CLBS for the matrix
`transposition. The implementation of one lD-IDCT and
`the matrix transposition requires 1320 CLBs‘ and the im-
`plementation of eight iD-‘lDC‘I‘s and the matrix transpo-
`sition requires 6976 (31.35. Again these estimates include
`no wiring overhead.
`
`F‘ A uroceesor
`ix lD-I‘DCT HW.)
`
`'GA aprocesoor
`Ex Ill-IDCT HW.)
`
`
`
`Table 6: 2D-IDC‘T Execution Cycla (CPU Cycles)
`Figure 10 shows the execution time of the MPEG-2 de
`coding for 15 frames of the “mobile" bitstream. Execution
`times of the processors with the reconfigurable coprocessor
`are compared with the single issue base procmsor and the
`procmr with the multimedia extensions (+MMX).
`Although the multimedia extensions can reduce the cyv
`tie counts of the IDCT part by a factor of 2.7 from the base
`processor. REMARC and the FPGA coprocolsor which cp-
`erate at the same frequency as the main processor achieve
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 7
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2073, p. 7
`
`

`

`B
`
`CycleCounts(Mcycles)_n.-NuaeeE8
`
`5
`Do
`E'EOFF‘
`géoo
`w U-IDI:
`+
`.-
`a: E;‘
`5
`In
`
`3
`n
`e
`
`5e
`gt
`he
`vli‘
`
`Dr.)
`F‘F‘
`on
`qua
`.-
`.—‘
`In
`
`:3
`as
`at
`1+
`
`00
`FF‘
`on
`an
`.-
`.'—'
`:1):
`
`in
`g
`E
`+9:
`
`Du
`l3?
`on
`on
`.1;
`xx
`
`:e
`ag
`e
`+i-i'
`
`Figure ill: MPEGZ Decoding (15 Frames} Results with Ac-
`celerating lDC'T‘
`from 14.4): to 20.43: performance improvements. They are
`from 5.3:: to Mix faster than the multimedia extensions.
`Another observation from Figure 10 is that the small
`and slow FPGA coprocessor does not achieve good perfor-
`mance improvement, which is similar to the results we saw
`with DES encryption. The FPGA copra-tester implement-
`ing a single 1-D IDCT with a cycle time of Tcpu 1 ID is
`no longer faster than the multimedia extensions.
`
`3.3.2 Acceleration of MC and AddBlock
`
`The execution time of [DCT can be dramatically reduced
`by the reconfigurable coprocessor. However, the execution
`time of IDCT is only 31% of the entire MPEG-2 decoding.
`The performance improvement of the entire program is
`only 22.8% for the multimedia instructions and 38.0% for
`REMARC. This suggests that a reconfigurable coprocessor
`needs to support multiple i'unctions to accelerate the entire
`MPEG-2 decoding.
`In addition to IDCT, we can accelerate MC and
`Addflloclt
`using
`the multimedia
`instructions
`and
`REMARC. This cavers more than 70% of the original ex-
`ecution time. The REMARC instructions for MC and
`AddBl-ck are SIMD type instructions and very similar
`to multimedia instructions. They can Operate on B or 16
`data elements at the same time using 8 or 16 nano proces-
`ears.
`The performance evaluation [melt is shown in Fig-
`ure 11. Using REMARC as a coprocessor the performance
`improves hy afactor oflfl from the base procasor and 1.33
`from the architecture with the multimedia extensions.
`We were not able to evaluate the FPGA coprocessor
`implementation of MC and Addfilock in detail. However.
`the FPGA coprocessor can also implement the some SIMD
`
`
`
`CycleCounls(Mcycles}
`
` a
`“fiefiifi
`see
`
`+MMX
`
`‘REMARC
`
`Figure 11: MPEG-2 Decoding (15 Frames} Results with A:-
`celerating IDC'I‘. MC. and Addjlock
`type instructions as those used in REMARC for these al-
`gorithm. The performance of the PPGA coprocasor will
`he almost the same at REMARC, if its cycle time is Tcpu
`or Tcpu x 2.
`If the cycle time of the FPGA coprocese
`nor is mud: slower than that of the main processor.
`its
`performance will decrease greatly because the END type
`instructions only exploit limited parallelism.

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket