`
`Advancing Technology
`for Humanity
`
`DECLARATION OF GERARD P. GRENIER
`
`I, Gerard P. Grenier, am over twenty-one (21) years of age. I have never been convicted
`of a felony, and I am fully competent to make this declaration. I declare the following to be true
`to the best of my knowledge, information and belief:
`
`1. I am Senior Director of Content Management of The Institute of Electrical and
`Electronics Engineers, Incorporated ("IEEE").
`
`2. IEEE is a neutral third party in this dispute.
`
`3. Neither I nor IEEE itself is being compensated for this declaration.
`
`4. Among my responsibilities as Senior Director of Content Management, I act as a
`custodian of certain records for IEEE.
`
`5. I make this declaration based on my personal knowledge and information contained
`in the business records of IEEE.
`
`6. As part of its ordinary course of business, IEEE publishes and makes available
`technical articles and standards. These publications are made available for public
`download through the IEEE digital library, IEEE Xplore.
`
`7. It is the regular practice of IEEE to publish articles and other writings including
`article abstracts and make them available to the public through IEEE Xplore. IEEE
`maintains copies of publications in the ordinary course of its regularly conducted
`activities.
`
`8. The article below has been attached as Exhibit A to this declaration:
`
`A. T. Miyamori; U. Olukotun, A quantitative analysis of reconfigurable
`coprocessors for multimedia applications, published in Proceedings. IEEE
`Symposium on FPGAs for Custom Computing Machines, date of
`conference April 17, 1998.
`
`9. I obtained a copy of Exhibit A through IEEE Xplore, where it is maintained in the
`ordinary course of IEEE's business. Exhibit A is a true and correct copy of the
`Exhibit, as it existed on or about December 18, 2019.
`
10. The article and abstract from IEEE Xplore show the date of publication. IEEE
Xplore populates this information using the metadata associated with the publication.
`
445 Hoes Lane Piscataway, NJ 08854
`
`INTEL - 1010
`Page 1 of 15
`
`
`
`11. T. Miyamori; U. Olukotun, A quantitative analysis of reconfigurable coprocessors for
`multimedia applications was published in Proceedings. IEEE Symposium on FPGAs
`for Custom Computing Machines, date of conference April 17, 1998. Copies of the
`conference proceedings were made available no later than the last day of the
`conference. The article is currently available for public download from the IEEE
`digital library, IEEE Xplore.
`
`12. I hereby declare that all statements made herein of my own knowledge are true and
`that all statements made on information and belief are believed to be true, and further
`that these statements were made with the knowledge that willful false statements and
`the like are punishable by fine or imprisonment, or both, under 18 U.S.C. § 1001.
`
`
`
`
`
`
`
`
`
`
`EXHIBIT A
`
`
`
`
`
`Access provided by:
`IEEE Publications Operations
`Staff
`
`A quantitative analysis of reconfigurable coprocessors for
`multimedia applications
`Publisher: IEEE
`
`2 Author(s)
`
T. Miyamori; U. Olukotun
`
`Abstract:
`Recently, computer architectures that combine a reconfigurable (or retargetable)
`coprocessor with a general-purpose microprocessor have been proposed. These
`architectures are designed to exploit large amounts of fine grain parallelism in
`applications. In this paper, we study the performance of the reconfigurable coprocessors
`on multimedia applications. We compare a Field Programmable Gate Array (FPGA)
`based reconfigurable coprocessor with the array processor called REMARC
`(Reconfigurable Multimedia Array Coprocessor). REMARC uses a 16-bit simple
`processor that is much larger than a Configurable Logic Block (CLB) of an FPGA. We
`have developed a simulator, a programming environment, and multimedia application
`programs to evaluate the performance of the two coprocessor architectures. The
`simulation results show that REMARC achieves speedups ranging from a factor of 2.3
`to 7.3 on these applications. The FPGA coprocessor achieves similar performance
`improvements. However, the FPGA coprocessor needs more hardware area to achieve
`the same performance improvement as REMARC.
`
`Published in: Proceedings. IEEE Symposium on FPGAs for Custom Computing
`Machines (Cat. No.98TB100251)
`
Date of Conference: 17 April 1998
`
`INSPEC Accession Number: 6034961
`
`
`
Date Added to IEEE Xplore: 06 August 2002

DOI: 10.1109/FPGA.1998.707876
`
`Publisher: IEEE
`
`Print ISBN: 0-8186-8900-5
`
Conference Location: Napa Valley, CA, USA
`
`
`
`
A Quantitative Analysis of Reconfigurable Coprocessors
for Multimedia Applications
`
`Takashi Miyamori
`System ULSI Engineering Laboratory
`TOSHIBA Corporation, JAPAN
`miyamori@sdel.toshiba.co.jp
`
`Kunle Olukotun
`Computer Systems Laboratory
`Stanford University
`kunle@ogun.stanford.edu
`
`Abstract
`
Recently, computer architectures that combine a reconfigurable (or
retargetable) coprocessor with a general-purpose microprocessor have
been proposed. These architectures are designed to exploit large
amounts of fine grain parallelism in applications. In this paper, we
study the performance of the reconfigurable coprocessors on multimedia
applications. We compare a Field Programmable Gate Array (FPGA) based
reconfigurable coprocessor with the array processor called REMARC
(Reconfigurable Multimedia Array Coprocessor). REMARC uses a 16-bit
simple processor that is much larger than a Configurable Logic Block
(CLB) of an FPGA. We have developed a simulator, a programming
environment, and multimedia application programs to evaluate the
performance of the two coprocessor architectures. The simulation
results show that REMARC achieves speedups ranging from a factor of
2.3 to 7.3 on these applications. The FPGA coprocessor achieves
similar performance improvements. However, the FPGA coprocessor needs
more hardware area to achieve the same performance improvement as
REMARC.
`
`
`
1 Introduction
`
As the use of multimedia applications increases, it becomes important
to achieve high performance on algorithms such as video compression,
decompression, and image processing with general-purpose
microprocessors. This has motivated the recent addition of multimedia
instructions to most general-purpose microprocessor ISAs. These ISA
extensions work by segmenting a conventional 64-bit datapath into four
16-bit or eight 8-bit datapaths. The multimedia instructions exploit
fine grain SIMD parallelism by operating on four 16-bit or eight 8-bit
data values. However, a 64-bit datapath limits the speedups to a
factor of four or eight even though many multimedia applications have
much more inherent parallelism.
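
The datapath segmentation described above can be illustrated with a small sketch. It assumes a 64-bit word split into four 16-bit lanes with wraparound arithmetic. The names `packed_add16`, `pack`, and `unpack` are hypothetical helpers for illustration and do not correspond to the instructions of any particular ISA.

```python
LANES = 4          # four lanes per 64-bit word (assumed for this sketch)
LANE_BITS = 16     # width of each lane in bits
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four 16-bit values (lane 0 first) into one 64-bit word."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & LANE_MASK) << (lane * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word into its four 16-bit lanes (lane 0 first)."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]


def packed_add16(a, b):
    """Add two 64-bit words lane by lane.

    Carries are not propagated between lanes; each lane wraps modulo 2**16.
    """
    result = 0
    for lane in range(LANES):
        shift = lane * LANE_BITS
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        result |= (lane_sum & LANE_MASK) << shift
    return result


lanes = unpack(packed_add16(pack([1, 2, 3, 0xFFFF]), pack([10, 20, 30, 1])))
```

One call processes four values, which is where the factor-of-four ceiling comes from: the lane count is fixed by the word width, however much parallelism the application has.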
Computer architectures that connect a reconfigurable coprocessor to a
general-purpose microprocessor have been proposed. The advantage of
this approach is that the coprocessor can be reconfigured to improve
the performance of a particular application. All of these proposed
architectures use field programmable gate arrays (FPGAs) for the
reconfigurable hardware. We refer to this coprocessor as an "FPGA
coprocessor" in this paper. The FPGA architecture, which has narrow
programmable logic blocks and a programmable interconnection network,
provides great flexibility for implementing application specific
hardware. However, the rich programmable interconnection comes at the
price of reduced operating frequency and logic density.
Array processors, such as general-purpose systolic array processors,
wavefront array processors , PADDI- , and MATRIX , are other
reconfigurable architectures. These processors have -bit or -bit
datapaths and each programmable logic block has an to -entry
instruction RAM that makes it easy to support multiple functions.
Because multimedia or DSP applications predominantly manipulate
8-bit or 16-bit data values, these architectures work very well on
these applications. Recently, we proposed a new array processor
architecture called REMARC (Reconfigurable Multimedia Array
Coprocessor) [13]. REMARC is a reconfigurable coprocessor that is
tightly coupled to a main RISC processor and consists of a global
control unit and 64 16-bit simple processors called nano processors.
Neither the FPGA coprocessor nor REMARC is limited to the SIMD
parallelism that can be exploited by multimedia extensions such as the
Intel MMX. They can exploit various kinds of fine grain parallelism in
multimedia applications. Using more processing resources, they can
achieve higher performance than the multimedia extensions. To
understand how these two coprocessor architectures compare, in this
paper we evaluate the cost and performance of these architectures. The
architecture of FPGA coprocessors is still in flux, so we evaluate the
performance of the FPGA coprocessor with a varying number of CLBs and
vary the cycle time of the FPGA coprocessor from 1x to 10x that of the
main processor. For the performance evaluation, we use detailed
simulators and two realistic application programs, DES encryption and
MPEG- decoding. We also estimate the chip sizes of processors with
REMARC and the FPGA coprocessor and compare their performance when the
same die size is used for both architectures.
The rest of this paper is organized as follows. In Section 2, we
describe the reconfigurable coprocessor architectures, both REMARC
array based and FPGA based. In Section 3, we show the results of our
performance evaluation. In Section 4, we estimate chip sizes of
processors with REMARC and the FPGA coprocessor. Finally, we conclude
in Section 5.
`
2 Reconfigurable Coprocessor Architecture

2.1 Architecture Overview
`
rcon     src (rncon or rgcon)
rex      cop_reg, offset(base)
lduc2    cop_reg, offset(base)
sduc2    cop_reg, offset(base)
mtc2     cop_reg, src
mfc2     cop_reg, dst
ctc2     cop_reg, src
cfc2     cop_reg, dst
`
[Figure 1 block labels: Main Processor, Instruction Cache, Data Cache]

Figure 1: Block Diagram of a Microprocessor with Reconfigurable
Coprocessor
`
Figure 1 shows a block diagram of a microprocessor which includes a
reconfigurable coprocessor. The reconfigurable coprocessor consists of
a global control unit, coprocessor data registers, and a
reconfigurable logic array. Recently, we proposed the REMARC
architecture which includes an 8x8 16-bit processor (nano processor)
array as its reconfigurable logic array [13]. The other reconfigurable
coprocessor that we consider in this paper, the FPGA coprocessor, uses
FPGAs for the reconfigurable logic array. The global control unit
controls the execution of the reconfigurable logic array and the
transfer of data between the main processor and the reconfigurable
logic array through the coprocessor data registers.

We use the MIPS-II ISA [14] as the base architecture of the main
processor. The MIPS ISA is extended for REMARC and the FPGA
coprocessor using the instructions listed in Table 1. The main
processor issues these instructions to the reconfigurable coprocessor
which executes them in a manner similar to a floating point
coprocessor. Unlike a floating point coprocessor, the functions of
reconfigurable coprocessor instructions are configurable (or
programmable) so that they can be specialized for specific
applications.

The configuration instructions, rcon, rgcon, or rncon, download the
configuration data from memory and store them in the reconfigurable
coprocessor. The start address of the configuration data is specified
by the value of the source register (src). The rex instruction starts
execution of a reconfigurable coprocessor instruction. The sum of the
offset field and the base register specifies one of the operations to
execute. The lduc2 and sduc2 instructions are load and store
coprocessor instructions which transfer double word (64-bit) data
between memory and the reconfigurable coprocessor data registers. The
mfc2 and mtc2 instructions transfer word (32-bit) data between the
general-purpose registers (integer registers) in the main processor
and the reconfigurable coprocessor data registers. The cfc2 and ctc2
instructions transfer data between the integer registers and the
reconfigurable coprocessor control registers.

Table 1: New Instructions Used in Reconfigurable Coprocessors

The reconfigurable coprocessors do not have a direct interface to the
data cache or memory. The main processor has to set the input data in
the coprocessor data registers using lduc2 and mtc2 instructions
before execution of rex instructions. Then, the reconfigurable
coprocessor reads the input data, executes the operations, and stores
the results into the coprocessor data registers. Finally, the main
processor reads the results using sduc2 and mfc2 instructions.
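
The register-based handshake described above (set inputs, execute, read results) can be sketched as a toy software model. Everything here is an illustrative assumption: the class `ToyCoprocessor`, the register count, and the idea of binding an operation id to a Python function stand in for the hardware mechanisms and are not the paper's implementation. Only the 32-bit `mtc2`/`mfc2` path is modeled.

```python
class ToyCoprocessor:
    """Toy model of the register-based interface; not the REMARC hardware."""

    def __init__(self, num_regs=8):
        self.data_regs = [0] * num_regs   # coprocessor data registers
        self.operations = {}              # configured operations, keyed by id

    def rcon(self, op_id, func):
        """Stand-in for configuration: bind an operation id to a function."""
        self.operations[op_id] = func

    def mtc2(self, reg, value):
        """Main processor writes a 32-bit word into a data register."""
        self.data_regs[reg] = value & 0xFFFFFFFF

    def rex(self, op_id):
        """Run a configured operation; it touches the data registers only."""
        self.operations[op_id](self.data_regs)

    def mfc2(self, reg):
        """Main processor reads a 32-bit word back from a data register."""
        return self.data_regs[reg]


def xor_regs(regs):
    """Hypothetical configured operation: regs[2] = regs[0] XOR regs[1]."""
    regs[2] = regs[0] ^ regs[1]


cop = ToyCoprocessor()
cop.rcon(0, xor_regs)        # 1. load a configuration
cop.mtc2(0, 0x0F0F0F0F)      # 2. set the inputs
cop.mtc2(1, 0x00FF00FF)
cop.rex(0)                   # 3. execute
result = cop.mfc2(2)         # 4. read the result
```

The point of the sketch is that the coprocessor never reaches into memory itself: all operands and results pass through the data registers, so the main processor pays an explicit transfer cost on both sides of each `rex`.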
`
2.2 Pipeline Organization

[Figure 2 pipeline stages: REMARC pipeline: RF, RR, RL, RI, RE, RW;
main processor pipeline: F, D, E, M, W; FPGA coprocessor pipeline:
RF, RR, RE, RW]

Figure 2: Pipeline Organization of Reconfigurable Coprocessor
`
The pipelines for REMARC, the FPGA coprocessor, and the main processor
are shown in Figure 2. The main processor pipeline is similar to the
MIPS R3000 and the MIPS R5000 and consists of five stages: Instruction
Fetch (F), Instruction Decode (D), Execution (E), Memory Access (M),
and Register Write-back (W). The reconfigurable coprocessor pipelines
are independent of the main processor pipeline; therefore, the main
processor can execute concurrently with the reconfigurable
coprocessors.

The REMARC pipeline starts from the M stage of the main processor and
has the following six stages:

RF : An instruction of the global control unit is fetched.
RR : The REMARC data registers are read.
RL : The data are aligned or "unpacked".
RI : The instructions of the nano processors are fetched.
`
`
`
`
[Figure 4 labels: HBUS (32 bits), VBUS (32 bits), nano PC, Nano
Instruction RAM (32 x 32 bits), Imm, DR, IR, 16-bit ALU & Data RAM,
DOR, DOUT, and data inputs DINU, DIND, DINL, DINR]

Figure 4: Nano Processor Architecture
`
register (IR), eight 16-bit data registers (DR), four 16-bit data
input registers (DIR), and a 16-bit data output register (DOR).

Each nano processor can use the DR registers, the DIR registers, and
immediate data as the source data of ALU operations. Moreover, it can
directly use the DOR registers of the four adjacent nano processors
(DINU, DIND, DINL, and DINR) as the source.

The nano processors are also connected by the 32-bit Horizontal Buses
(HBUSs) and the 32-bit Vertical Buses (VBUSs). Each bus operates as
two 16-bit data buses. The 16-bit data in the DOR register can be sent
to the upper or lower 16 bits of the VBUS or the HBUS. The HBUSs and
the VBUSs allow data to be broadcast to the other nano processors in
the same row or column. These buses can reduce the communication
overhead between processors separated by long distances.

The DIR registers accept inputs from the HBUS, the VBUS, the DOR, or
the four adjacent nano processors. Because the width of the HBUS and
the VBUS is 32 bits, data on the HBUS or the VBUS are stored into a
DIR register pair, DIR and DIR , or DIR and DIR . Using the DIR
registers, data can be transferred between nano processors during ALU
operations.

It takes a half cycle to transfer data using the VBUSs or HBUSs. It
should not be a critical path of the design. Other operations, except
for data inputs from nearest neighbors, are done within the nano
processor. Because the width of a nano processor's datapath is only 16
bits, which is a quarter of those of general purpose microprocessors,
this careful design does not make REMARC a critical path of the chip.
`
2.4 FPGA Coprocessor Architecture

The reconfigurable logic array of the FPGA coprocessor is composed of
configurable logic blocks (CLBs). Each CLB
`
`RE : The nano processors execute the instructions.
`RW : The executed results are packed and stored into
`the REMARC data registers.
`
The FPGA coprocessor pipeline has four stages as follows:

RF : The sequencer of the global control unit starts its execution.
RR : The coprocessor data registers are read.
RE : The reconfigurable logic array starts execution.
RW : The execution results are stored into the coprocessor data
registers.

The RL and RI stages are unnecessary in the FPGA coprocessor because
load alignment or unpack operations are realized directly in the FPGA
array and FPGAs do not have instructions to fetch and execute.
`
2.3 REMARC Architecture

[Figure 3 labels: to/from Main Processor (32-bit and 64-bit paths),
REMARC, Global Control Unit, Coprocessor Data Registers, nano
processors (NANO 00 through NANO 70 shown in row 0), HBUS0 to HBUS7,
VBUS0 to VBUS7, rows, columns COL. 0 to COL. 7, nano PC/Row
Num./Column Num.]

Figure 3: Block Diagram of REMARC
`
In this section, we briefly describe the REMARC architecture; more
information can be found in [13]. Figure 3 shows a block diagram of
REMARC. REMARC's reconfigurable logic is composed of an 8x8 array of
16-bit processors, called nano processors. The execution of each nano
processor is controlled by the instructions stored in the local
instruction RAM (nano instruction RAM). However, each nano processor
does not directly control the instructions it executes. Every cycle
the nano processor receives a PC value, "nano PC", from the global
control unit. All nano processors use the same nano PC and execute the
instructions indexed by the nano PC in their nano instruction RAM.
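
The control scheme described above, one program counter broadcast to processors that each hold their own instruction RAM, can be sketched as follows. This is a simplified illustration under stated assumptions: the class `NanoProcessor`, its single accumulator, and the lambda "instructions" are hypothetical and far simpler than the actual nano processor.

```python
class NanoProcessor:
    """Toy nano processor: a private instruction RAM and one accumulator."""

    def __init__(self, instruction_ram):
        # List of callables; the nano PC is an index into this list.
        self.instruction_ram = instruction_ram
        self.acc = 0

    def step(self, nano_pc):
        # The processor does not choose its instruction; it uses the PC given.
        self.acc = self.instruction_ram[nano_pc](self.acc) & 0xFFFF


def run_array(processors, nano_pc_sequence):
    """Global control unit: broadcast one nano PC per cycle to every processor."""
    for nano_pc in nano_pc_sequence:
        for proc in processors:
            proc.step(nano_pc)
    return [proc.acc for proc in processors]


# Two processors whose instruction RAMs hold different contents (illustrative).
add_one = lambda acc: acc + 1
double = lambda acc: acc * 2
add_ten = lambda acc: acc + 10

procs = [
    NanoProcessor([add_one, double]),    # index 0: +1,  index 1: *2
    NanoProcessor([add_ten, add_one]),   # index 0: +10, index 1: +1
]
final = run_array(procs, [0, 1, 0])
```

Because the instruction RAM contents differ per processor, the same nano PC can trigger different operations in different processors, which is what distinguishes this scheme from plain SIMD execution.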
Figure 4 shows the architecture of the nano processor. The nano
processor contains a 32-entry nano instruction RAM, a 16-bit ALU, a
-entry data RAM, an instruction
`
`
`
                   REMARC                        FPGA Coprocessor
Processor Size     Large (16-bit datapath)       Small (two 4-1 LUTs)
Num. of Procs.     Small (64 processors)         Large (CLBs)
Execution control  Instruction                   Hardwired
Communication      Controlled by instruction     Configured by switch matrix
Interconnection    4 neighbors, VBUS, and HBUS   Short wires and long wires
Cycle Time         Tcpu                          1x, 2x, 5x, 10x Tcpu

Table 2: Coprocessor Architecture Comparison
`
is equal to a CLB of the Xilinx series. The CLB includes two 4-1
lookup tables (LUTs) and two flip-flops.

We do not fix the number of CLBs in the FPGA coprocessor. Instead, we
evaluate the performance of the FPGA coprocessor using a varying
number of CLBs. The cycle time of the FPGA coprocessor is also
parameterized. Cycle times of current FPGA systems are longer than
those of microprocessors by factors of five to ten. For instance, FPGA
systems usually operate at to MHz while state-of-the-art
microprocessors operate at more than MHz. However, the recently
proposed reconfigurable coprocessor Garp aims to operate at the same
operating frequency as its main processor. Therefore, we assumed the
cycle time of the FPGA coprocessor could be 5x or 10x that of the main
processor for current FPGA architectures and 1x or 2x for future FPGA
architectures.
`
2.5 Coprocessor Architecture Comparison

Table 2 summarizes the comparison of the two reconfigurable
coprocessor architectures. REMARC has larger processing elements, the
nano processors, than the FPGA coprocessor. However, the number of
nano processors is less than that of CLBs in the FPGA coprocessor. In
REMARC, both the execution and data transfer are controlled by
instructions, while these are controlled by hardwired logic in the
FPGA coprocessor. REMARC has limited hardware interconnections. Each
nano processor has direct inputs from the four nearest neighbors and
it is connected by two 32-bit data buses. The FPGA coprocessor has
more flexible hardware interconnections which are configured in a
bit-wise fashion.

We assume that REMARC will operate at the same frequency as the main
processor. The cycle times of the FPGA coprocessor (Tfpga) are varied.
We evaluate the FPGA coprocessor performance, assuming Tfpga values
that are 1x, 2x, 5x, and 10x the CPU cycle time (Tcpu).

3 Performance Evaluation

3.1 Simulation Methodology
`
We developed the reconfigurable coprocessor simulator using the SimOS
simulation environment. SimOS models the CPUs, memory systems, and I/O
devices in sufficient detail to boot and run a commercial operating
system. As a base CPU simulation model, we used "MIPSY", which models
a simple single issue RISC processor similar to the MIPS R .

REMARC functions are added to the MIPSY model. The latency of a
reconfigurable execution (rex) instruction is the sum of the number of
executed global instructions and the pipeline latency cycles. If a
following instruction attempts to read the result of a rex
instruction, the pipeline will stall for the cycles of the rex
instruction's latency (Data Dependency). Furthermore, a following rex
instruction will stall if a previous rex instruction is still
executing (Resource Conflict).
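
The latency rule and the two stall conditions described above can be sketched as a tiny in-order timing model. This is one simplified reading, not the SimOS/MIPSY implementation: the function `simulate`, the tuple encoding of instructions, and the example `pipeline_latency` value are all assumptions made for illustration, and every instruction is assumed to issue in one cycle when not stalled.

```python
def simulate(program, pipeline_latency):
    """Return the issue cycle of each instruction in a simplified in-order model.

    program is a list of tuples:
      ("rex", n)  - rex instruction that executes n global instructions
      ("read",)   - instruction that reads the most recent rex result
      ("other",)  - instruction independent of the coprocessor
    """
    cycle = 0        # earliest cycle at which the next instruction may issue
    rex_done = 0     # cycle at which the latest rex result becomes available
    issue_cycles = []
    for instr in program:
        kind = instr[0]
        if kind in ("rex", "read"):
            # Resource conflict (rex) or data dependency (read):
            # wait until the previous rex has finished.
            cycle = max(cycle, rex_done)
        issue_cycles.append(cycle)
        if kind == "rex":
            # Latency = executed global instructions + pipeline latency cycles.
            rex_done = cycle + instr[1] + pipeline_latency
        cycle += 1
    return issue_cycles


timeline = simulate(
    [("rex", 4), ("other",), ("read",), ("rex", 2), ("other",)],
    pipeline_latency=3,   # assumed value for the example
)
```

In the example, the independent instruction issues right behind the first `rex`, while the dependent read waits until the `rex` latency has elapsed. This is the overlap that lets the main processor run concurrently with the coprocessor.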
We evaluate the execution time of the FPGA coprocessor by changing the
latencies of the rex instructions. First, we estimate the number of
execution cycles of the rex instructions based on the FPGA delay
model. In this model, two sequences can be executed within one FPGA
cycle:

- One stage of interconnection by long or short wire and one stage of
  function not using the carry chain
- One stage of interconnection by short wire and any one stage of
  function

This model is almost the same as Garp's delay model and the XC 's
medium frequency design which will operate at about MHz. Then, we
normalize the number of execution cycles based on the CPU cycle time,
assuming the FPGA cycle time is 1x, 2x, 5x, or 10x the CPU cycle time.
Finally, we use this estimated cycle count of the rex instruction in
the simulator.
We also developed simulators which can execute multimedia
instructions similar to the Intel MMX instruction set extensions.

To make the comparison fair, the same application source codes are
used for the evaluation except for the use of the multimedia
instructions or the augmented coprocessor instructions. The memory
system parameters used throughout the performance evaluations are
found in Table 3. We used the "gcc" compiler with the "-O"
optimization option. This option executes most global compiler
optimizations except for loop unrolling and function inlining.
`
I-cache            K bytes, -way set.
D-cache            K bytes, -way set.
L2 cache           K bytes, -way set.
L1 miss penalty    cycles
L2 miss penalty    cycles

Table 3: Memory System Parameters
`
`
`
`
We used the DES encryption program based on the Electronic Codebook
(ECB) mode. Although the ECB mode is less secure than the Cipher Block
Chaining (CBC) mode, it is more commonly used and its operation can be
pipelined.

3.2.1 DES Implementation on REMARC

We decided to divide the algorithm between the main processor and
REMARC. The initial permutation and the final permutation are executed
by the main processor and the 16 rounds of f-box operations are
executed by REMARC. Each row of nano processors executes two f-box
operations. For instance, the eight nano processors in row 0 execute
the first and second rounds, those in row 1 execute the third and
fourth rounds, and so on.
`
`Operation
`Data Load
`Exp. Permutation
`Key XOR
`S-box
`P-box Permutation
`Left XOR
`Data Transfer
`Total
`
`CPU Cycles
`
` x iter.
` x iter.
` x iter.
` x iter.
` x iter.
`
`
`
Table 4: Execution Cycle Breakdown of Two f-box Operations on REMARC

Table 4 is the execution cycle breakdown of the two f-box operations
implemented by the eight nano processors in each row. The 16 f-box
operations are pipelined into 8 stages by the eight rows of the nano
processor array. REMARC can generate a result of the 16 f-box
operations every cycles. As Table 4 shows, because of the limited
interconnection of REMARC, more than half of the cycles of the
execution time are used for the expansion permutation and the P-box
permutation.

3.2.2 DES Implementation on the FPGA Coprocessor

First, we estimate the latency and the throughput of one f-box
operation. The expansion permutation and the XOR operation with the
keys can be executed in one cycle because they can be implemented by
long wires and simple logic. The S-box table lookup requires two
cycles to execute because it consists of LUTs and MUXs. The P-box
permutation and the XOR operation can be executed in one cycle. The
total latency of the f-box operation is four cycles, and it can be
fully pipelined. Therefore, the maximum throughput of the f-box
operation is one result per FPGA cycle.

We assume three distinct cases: implementing one f-box, 16 f-boxes,
and all of the DES encryption algorithm including 16 f-boxes, the
initial permutation, and the final permutation.

In the case of the one f-box implementation, because each f-box
operation takes as input the result of the previous f-box operation,
the f-box operations cannot be pipelined.

3.2 DES Encryption

The Data Encryption Standard (DES) is one of the most important
encryption algorithms and has been a worldwide standard for over
years. It is widely used to provide secure communication over the
Internet. DES is also a good application for reconfigurable processors
because it has a lot of fine-grained parallelism in the form of
bit-level irregular data movements which make software implementation
on conventional microprocessors difficult and inefficient.
`
[Figure 5 labels: Plaintext (64 bits), Initial Permutation, Round 1
through Round 16 (each with an f-box, an XOR, and a key K1 to K16),
left halves L0 to L16, right halves R0 to R16, Final Permutation,
Ciphertext (64 bits)]

Figure 5: DES Encryption Algorithm
`
[Figure 6 labels: Li-1, Ri-1 (32 bits), 48-bit Key, Expansion
Permutation, eight S-boxes (6-bit inputs, 4-bit outputs), P-box
Permutation (32 bits), Ri (32 bits)]

Figure 6: DES f-box Algorithm
`
Figure 5 shows an outline of the DES algorithm. DES takes as input
64-bit plaintext. After the initial permutation, there are 16 rounds
of "f-box" operations, as shown in Figure 6, including expansion
permutation, XOR with the key, S-box table lookup, P-box permutation,
and XOR with the result of the previous round. The final permutation
is performed on the 64-bit result of the sixteenth round.
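
The round structure outlined above follows the standard Feistel pattern, which can be sketched as below. This is a structural illustration only: `toy_f` is a made-up placeholder round function (it has no expansion permutation, S-boxes, or P-box), the round keys are arbitrary integers, and the initial permutation, final permutation, and key schedule of real DES are omitted. It must not be used as a cipher.

```python
MASK32 = 0xFFFFFFFF


def toy_f(right, key):
    """Placeholder round function; NOT the DES f-box."""
    return ((right * 0x9E3779B1) ^ key) & MASK32


def feistel_encrypt(block64, round_keys, f=toy_f):
    """Apply the rounds: L_i = R_{i-1}, R_i = L_{i-1} XOR f(R_{i-1}, K_i)."""
    left, right = (block64 >> 32) & MASK32, block64 & MASK32
    for key in round_keys:
        left, right = right, left ^ f(right, key)
    return (left << 32) | right


def feistel_decrypt(block64, round_keys, f=toy_f):
    """Undo feistel_encrypt by reversing the rounds with the keys reversed."""
    left, right = (block64 >> 32) & MASK32, block64 & MASK32
    for key in reversed(round_keys):
        left, right = right ^ f(left, key), left
    return (left << 32) | right


keys = list(range(1, 17))            # 16 arbitrary round keys for the example
plaintext = 0x0123456789ABCDEF
ciphertext = feistel_encrypt(plaintext, keys)
recovered = feistel_decrypt(ciphertext, keys)
```

Each round needs the previous round's output, so a single block cannot be parallelized across rounds. The parallelism available to a coprocessor lies inside the round function and, in ECB mode, across independent blocks.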
`
`
`
`
Tfpga        FPGA Coprocessor   FPGA Coprocessor
             (1 f-box)          (16 f-boxes)
Tcpu         64                 1
Tcpu x 2     128                2
Tcpu x 5     320                5
Tcpu x 10    640                10

Table 5: Execution Throughput of 16 f-box Operations on Different
FPGA Coprocessors (CPU Cycles)
`
[Figure 7 bar chart: legend Base (Single-Issue), +REMARC, +FPGA (1
f-box), +FPGA (16 f-boxes), +FPGA (All DES); vertical axis ticks 0 to
140; horizontal axis Tfpga = Tcpu, Tcpu x 2, Tcpu x 5, Tcpu x 10]

Figure 7: DES Encryption (1M Bytes) Results - Single Issue
Architecture
`
It takes 64 FPGA cycles (4 cycles x 16 rounds) to execute the 16
rounds of f-box operations. When the 16 f-boxes are implemented in the
FPGA array, the operations can be pipelined. It takes only one FPGA
cycle to generate a new result for the 16 f-box operations after the
pipeline is filled up. Table 5 normalizes the throughput of the 16
f-box operations in CPU cycles for both the one f-box implementation
and the 16 f-box implementation.
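
The normalization behind Table 5 is a single multiplication, shown here as a short worked calculation. The constant and function names are illustrative; the inputs (4 cycles per f-box, 16 rounds, one result per cycle when pipelined, and cycle time ratios of 1, 2, 5, and 10) are taken from the text.

```python
FPGA_CYCLES_ONE_FBOX = 4 * 16     # 4 cycles per f-box x 16 rounds, not pipelined
FPGA_CYCLES_16_FBOXES = 1         # one result per FPGA cycle once the pipeline is full
CYCLE_TIME_RATIOS = [1, 2, 5, 10]  # Tfpga / Tcpu


def to_cpu_cycles(fpga_cycles, ratio):
    """Convert a count of FPGA cycles to CPU cycles for a given Tfpga/Tcpu ratio."""
    return fpga_cycles * ratio


table5 = {
    ratio: (to_cpu_cycles(FPGA_CYCLES_ONE_FBOX, ratio),
            to_cpu_cycles(FPGA_CYCLES_16_FBOXES, ratio))
    for ratio in CYCLE_TIME_RATIOS
}

for ratio, (one_fbox, sixteen_fboxes) in table5.items():
    print(f"Tcpu x {ratio:>2}: 1 f-box = {one_fbox:>3}, 16 f-boxes = {sixteen_fboxes:>2}")
```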
When the complete DES encryption algorithm is implemented in the FPGA
coprocessor, the initial and final permutations can be fully pipelined
and a new 64-bit ciphertext character is generated every FPGA cycle.

To estimate the number of CLBs without wiring overhead, we use the
following method. As shown in Figure 6, the f-box operation requires a
48-bit XOR, eight 6-bit input 4-bit output LUTs, a 32-bit 4-1
Multiplexer, and a 32-bit XOR. A two-bit XOR or a 4-1 Multiplexer fits
in one CLB. A 6-4 LUT needs 16 4-1 LUTs or 8 CLBs. In total, 136 CLBs
are required for one f-box operation. The 16 f-box implementation
needs 2176 (136 x 16) CLBs. We expect that in an actual implementation
more CLBs would be dedicated to the extra wiring which realizes the
initial permutation and the final permutation.
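
The CLB estimate above can be reproduced with a few lines of arithmetic. The per-component split below is one reading of the stated rules (one CLB per two XOR bits, one CLB per multiplexer bit, 8 CLBs per 6-input 4-output LUT). It is adopted here because it reproduces the stated totals of 136 and 2176; the function and variable names are illustrative.

```python
CLBS_PER_6TO4_LUT = 8   # stated: a 6-4 LUT needs 16 4-1 LUTs, i.e. 8 CLBs


def clbs_for_fbox():
    """CLB count for one f-box, excluding wiring overhead."""
    key_xor = 48 // 2                 # 48-bit XOR, one CLB per two bits
    s_boxes = 8 * CLBS_PER_6TO4_LUT   # eight 6-input, 4-output LUTs
    mux = 32                          # 32-bit 4-1 multiplexer, one CLB per bit
    left_xor = 32 // 2                # 32-bit XOR, one CLB per two bits
    return key_xor + s_boxes + mux + left_xor


one_fbox = clbs_for_fbox()       # CLBs for a single f-box
sixteen_fboxes = 16 * one_fbox   # CLBs for the fully unrolled 16 f-box design
```

Under this reading, the S-box lookups account for 64 of the 136 CLBs, close to half of the logic in each f-box.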
`
[Figure 8 bar chart: legend Base (2-Issue), +REMARC, +FPGA (1 f-box),
+FPGA (16 f-boxes), +FPGA (All DES); vertical axis ticks 0 to 100;
horizontal axis Tfpga = Tcpu, Tcpu x 2, Tcpu x 5, Tcpu x 10]

Figure 8: DES Encryption (1M Bytes) Results - 2-way Superscalar
Architecture
`
`3.2.3 Performance Evaluation Results
`
Figure 7 shows the result of the DES encryption of 1M bytes of data.
The base architecture of the main processor is a single issue
processor. To optimize performance we implemented a software
pipelining technique which can overlap operations in the main
processor and the reconfigurable coprocessors. REMARC achieves a 7.3
times performance improvement. The FPGA coprocessor achieves the same
performance improvement if its execution cycles for the 16 f-box
operations are less than 64 CPU cycles. This indicat