IEEE
Advancing Technology for Humanity
`
`DECLARATION OF GERARD P. GRENIER
`
`I, Gerard P. Grenier, am over twenty-one (21) years of age. I have never been convicted
`of a felony, and I am fully competent to make this declaration. I declare the following to be true
`to the best of my knowledge, information and belief:
`
`1. I am Senior Director of Content Management of The Institute of Electrical and
`Electronics Engineers, Incorporated ("IEEE").
`
`2. IEEE is a neutral third party in this dispute.
`
`3. Neither I nor IEEE itself is being compensated for this declaration.
`
`4. Among my responsibilities as Senior Director of Content Management, I act as a
`custodian of certain records for IEEE.
`
`5. I make this declaration based on my personal knowledge and information contained
`in the business records of IEEE.
`
`6. As part of its ordinary course of business, IEEE publishes and makes available
`technical articles and standards. These publications are made available for public
`download through the IEEE digital library, IEEE Xplore.
`
`7. It is the regular practice of IEEE to publish articles and other writings including
`article abstracts and make them available to the public through IEEE Xplore. IEEE
`maintains copies of publications in the ordinary course of its regularly conducted
`activities.
`
`8. The article below has been attached as Exhibit A to this declaration:
`
`A. T. Miyamori; U. Olukotun, A quantitative analysis of reconfigurable
`coprocessors for multimedia applications, published in Proceedings. IEEE
`Symposium on FPGAs for Custom Computing Machines, date of
`conference April 17, 1998.
`
`9. I obtained a copy of Exhibit A through IEEE Xplore, where it is maintained in the
`ordinary course of IEEE's business. Exhibit A is a true and correct copy of the
`Exhibit, as it existed on or about December 18, 2019.
`
`10. The article and abstract from IEEE Xplore shows the date of publication. IEEE
`Xplore populates this information using the metadata associated with the publication.
`
445 Hoes Lane, Piscataway, NJ 08854
`
`INTEL - 1010
`Page 1 of 15
`
`

`

`11. T. Miyamori; U. Olukotun, A quantitative analysis of reconfigurable coprocessors for
`multimedia applications was published in Proceedings. IEEE Symposium on FPGAs
`for Custom Computing Machines, date of conference April 17, 1998. Copies of the
`conference proceedings were made available no later than the last day of the
`conference. The article is currently available for public download from the IEEE
`digital library, IEEE Xplore.
`
`12. I hereby declare that all statements made herein of my own knowledge are true and
`that all statements made on information and belief are believed to be true, and further
`that these statements were made with the knowledge that willful false statements and
`the like are punishable by fine or imprisonment, or both, under 18 U.S.C. § 1001.
`
`
`

`

`
`
`
`
`
`
`EXHIBIT A
`
`
`

`

`
`
`A quantitative analysis of reconfigurable coprocessors for
`multimedia applications
`Publisher: IEEE
`
Authors: T. Miyamori; U. Olukotun
`
`
`Abstract:
`Recently, computer architectures that combine a reconfigurable (or retargetable)
`coprocessor with a general-purpose microprocessor have been proposed. These
`architectures are designed to exploit large amounts of fine grain parallelism in
`applications. In this paper, we study the performance of the reconfigurable coprocessors
`on multimedia applications. We compare a Field Programmable Gate Array (FPGA)
`based reconfigurable coprocessor with the array processor called REMARC
`(Reconfigurable Multimedia Array Coprocessor). REMARC uses a 16-bit simple
`processor that is much larger than a Configurable Logic Block (CLB) of an FPGA. We
`have developed a simulator, a programming environment, and multimedia application
`programs to evaluate the performance of the two coprocessor architectures. The
`simulation results show that REMARC achieves speedups ranging from a factor of 2.3
`to 7.3 on these applications. The FPGA coprocessor achieves similar performance
`improvements. However, the FPGA coprocessor needs more hardware area to achieve
`the same performance improvement as REMARC.
`
`Published in: Proceedings. IEEE Symposium on FPGAs for Custom Computing
`Machines (Cat. No.98TB100251)
`
`Date of Conference: 17-17 April 1998
`
`INSPEC Accession Number: 6034961
`
`
`

`

Date Added to IEEE Xplore: 06 August 2002
DOI: 10.1109/FPGA.1998.707876
`
`Publisher: IEEE
`
`Print ISBN: 0-8186-8900-5
`
Conference Location: Napa Valley, CA, USA
`
`
`

`

A Quantitative Analysis of Reconfigurable Coprocessors
for Multimedia Applications
`
`Takashi Miyamori
`System ULSI Engineering Laboratory
`TOSHIBA Corporation, JAPAN
`miyamori@sdel.toshiba.co.jp
`
`Kunle Olukotun
`Computer Systems Laboratory
`Stanford University
`kunle@ogun.stanford.edu
`
`Abstract
`
Recently, computer architectures that combine a reconfigurable
(or retargetable) coprocessor with a general-purpose
microprocessor have been proposed. These architectures are
designed to exploit large amounts of fine grain parallelism in
applications. In this paper, we study the performance of the
reconfigurable coprocessors on multimedia applications. We
compare a Field Programmable Gate Array (FPGA) based
reconfigurable coprocessor with the array processor called
REMARC (Reconfigurable Multimedia Array Coprocessor). REMARC
uses a 16-bit simple processor that is much larger than a
Configurable Logic Block (CLB) of an FPGA. We have developed a
simulator, a programming environment, and multimedia application
programs to evaluate the performance of the two coprocessor
architectures. The simulation results show that REMARC achieves
speedups ranging from a factor of 2.3 to 7.3 on these
applications. The FPGA coprocessor achieves similar performance
improvements. However, the FPGA coprocessor needs more hardware
area to achieve the same performance improvement as REMARC.
`
`
`
1 Introduction

As the use of multimedia applications increases, it becomes
important to achieve high performance on algorithms such as
video compression, decompression, and image processing with
general-purpose microprocessors. This has motivated the recent
addition of multimedia instructions to most general-purpose
microprocessor ISAs   .
These ISA extensions work by segmenting a conventional
-bit datapath into four -bit or eight -bit datapaths.
The multimedia instructions exploit fine grain SIMD parallelism
by operating on four -bit or eight -bit data
values. However, a -bit datapath limits the speedups
to a factor of four or eight even though many multimedia
applications have much more inherent parallelism.
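As a rough illustration of why a segmented datapath caps the speedup at the number of lanes, the following Python sketch performs a lane-wise add on one machine word. The 64-bit word and 16-bit lanes are assumptions chosen for the example (the widths are not legible in this copy), and `packed_add`, `pack`, and `unpack` are hypothetical helpers, not instructions of any real ISA.

```python
def pack(values, lane_bits: int = 16) -> int:
    """Pack small integers into one word, lane 0 in the low bits."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * lane_bits)
    return word


def unpack(word: int, lane_bits: int = 16, lanes: int = 4):
    """Split a packed word back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]


def packed_add(a: int, b: int, lane_bits: int = 16, lanes: int = 4) -> int:
    """Add two packed words lane by lane; carries never cross a lane boundary.

    lane_bits and lanes are illustrative assumptions (4 x 16 = 64 bits).
    """
    mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * lane_bits
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # wrap around inside the lane
    return result
```

One call to `packed_add` stands in for a single packed instruction: it completes four independent additions, and no more than four, regardless of how much additional parallelism the application contains.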
Computer architectures that connect a reconfigurable
coprocessor to a general-purpose microprocessor have been
proposed  . The advantage of this approach is that the
coprocessor can be reconfigured to improve the performance of a
particular application. All of these proposed architectures use
field programmable gate arrays (FPGAs) for the reconfigurable
hardware. We refer to this coprocessor as an "FPGA coprocessor"
in this paper. The FPGA architecture, which has narrow
programmable logic blocks and programmable interconnection
network, provides great flexibility for implementing application
specific hardware. However, the rich programmable
interconnection comes at the price of reduced operating
frequency and logic density.
Array processors, such as general-purpose systolic array
processors, wavefront array processors  , PADDI-
  , and MATRIX  , are other reconfigurable architectures.
These processors have -bit or -bit datapaths
and each programmable logic block has an  to -entry
instruction RAM that makes it easy to support multiple
functions. Because multimedia or DSP applications predominantly
manipulate -bit or -bit data values, these
architectures work very well on these applications. Recently, we
proposed a new array processor architecture called REMARC
(Reconfigurable Multimedia Array Coprocessor) [13]. REMARC is a
reconfigurable coprocessor that is tightly coupled to a main
RISC processor and consists of a global control unit and 64
16-bit simple processors called nano processors.
Neither the FPGA coprocessor nor REMARC is limited to the SIMD
parallelism that can be exploited by multimedia extensions such
as the Intel MMX. They can exploit various kinds of fine grain
parallelism in multimedia applications. Using more processing
resources, they can achieve higher performance than the
multimedia extensions. To understand how these two coprocessor
architectures compare, in this paper we evaluate the cost and
performance of these architectures. The architecture of FPGA
coprocessors is still in flux, so we evaluate the performance of
the FPGA coprocessor with a varying number of CLBs and vary the
cycle time of the FPGA coprocessor from 1x to 10x that of the
main processor. For the performance evaluation, we use detailed
simulators and two realistic application programs, DES
encryption and MPEG-
decoding. We also estimate the chip sizes of processors with
REMARC and the FPGA coprocessor and compare their performance
when the same die size is used for both architectures.
The rest of this paper is organized as follows. In Section 2,
we describe the reconfigurable coprocessor architectures, both
REMARC array based and FPGA based. In Section 3, we show the
results of our performance evaluation. In Section 4, we estimate
chip sizes of processors
`
`
`

`

with REMARC and the FPGA coprocessor. Finally, we
conclude in Section 5.

2 Reconfigurable Coprocessor Architecture

2.1 Architecture Overview
`
Figure 1: Block Diagram of a Microprocessor with Reconfigurable
Coprocessor
(Blocks legible in the figure: Main Processor, Instruction Cache,
Data Cache.)

Figure 1 shows a block diagram of a microprocessor which
includes a reconfigurable coprocessor. The reconfigurable
coprocessor consists of a global control unit, coprocessor data
registers, and a reconfigurable logic array. Recently, we
proposed the REMARC architecture which includes an 8x8 16-bit
processor (nano processor) array as its reconfigurable logic
array [13]. The other reconfigurable coprocessor that we
consider in this paper, the FPGA coprocessor, uses FPGAs for the
reconfigurable logic array. The global control unit controls the
execution of the reconfigurable logic array and the transfer of
data between the main processor and the reconfigurable logic
array through the coprocessor data registers.
We use the MIPS-II ISA [14] as the base architecture of the
main processor. The MIPS ISA is extended for REMARC and the FPGA
coprocessor using the instructions listed in Table 1. The main
processor issues these instructions to the reconfigurable
coprocessor which executes them in a manner similar to a
floating point coprocessor. Unlike a floating point coprocessor,
the functions of reconfigurable coprocessor instructions are
configurable (or programmable) so that they can be specialized
for specific applications.

Instruction   Operands
rcon          src (rncon or rgcon)
rex           cop_reg, offset(base)
lduc2         cop_reg, offset(base)
sduc2         cop_reg, offset(base)
mtc2          cop_reg, src
mfc2          cop_reg, dst
ctc2          cop_reg, src
cfc2          cop_reg, dst

Table 1: New Instructions Used in Reconfigurable Coprocessors

The configuration instructions, rcon, rgcon, or rncon, download
the configuration data from memory and store them in the
reconfigurable coprocessor. The start address of the
configuration data is specified by the value of the source
register (src). The rex instruction starts execution of a
reconfigurable coprocessor instruction. The sum of the offset
field and the base register specifies one of the operations to
execute. The lduc2 and sduc2 instructions are load and store
coprocessor instructions which transfer double word (64-bit)
data between memory and the reconfigurable coprocessor data
registers. The mfc2 and mtc2 instructions transfer word (32-bit)
data between the general-purpose registers (integer registers)
in the main processor and the reconfigurable coprocessor data
registers. The cfc2 and ctc2 instructions transfer data between
the integer registers and the reconfigurable coprocessor control
registers.
The reconfigurable coprocessors do not have a direct interface
to the data cache or memory. The main processor has to set the
input data to the coprocessor data registers using lduc2 and
mtc2 instructions before execution of rex instructions. Then,
the reconfigurable coprocessor reads the input data, executes
the operations, and stores the results into the coprocessor data
registers. Finally, the main processor reads the results using
sduc2 and mfc2 instructions.
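The load, execute, read-back sequence described above can be sketched as a toy Python model. This is only an illustration under stated assumptions: `CoprocessorModel`, the register count of 8, and the use of a Python callable in place of configuration data are all hypothetical, and only the method names mirror the instruction mnemonics in the text.

```python
class CoprocessorModel:
    """Toy model of coprocessor data registers plus configurable operations."""

    def __init__(self, num_regs: int = 8):  # register count is an assumption
        self.data_regs = [0] * num_regs
        self.operations = {}  # operation index -> callable (stands in for config data)

    def rcon(self, index: int, func) -> None:
        """Store a 'configuration' for one operation index."""
        self.operations[index] = func

    def mtc2(self, cop_reg: int, value: int) -> None:
        """Main processor writes a 32-bit word into a coprocessor data register."""
        self.data_regs[cop_reg] = value & 0xFFFFFFFF

    def rex(self, index: int) -> None:
        """Run the configured operation; it reads and writes the data registers."""
        self.operations[index](self.data_regs)

    def mfc2(self, cop_reg: int) -> int:
        """Main processor reads a 32-bit word back from a data register."""
        return self.data_regs[cop_reg]


def xor_op(regs) -> None:
    """Example configured operation: regs[2] = regs[0] XOR regs[1]."""
    regs[2] = regs[0] ^ regs[1]


def demo() -> int:
    cop = CoprocessorModel()
    cop.rcon(0, xor_op)      # configure operation 0
    cop.mtc2(0, 0xF0F0)      # set inputs through the data registers
    cop.mtc2(1, 0x0FF0)
    cop.rex(0)               # execute the configured operation
    return cop.mfc2(2)       # read the result back
```

The model makes the key constraint visible: the coprocessor never touches memory itself, so every operand and result passes through the data registers under control of the main processor.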
`
2.2 Pipeline Organization

REMARC Pipeline:            RF | RR | RL | RI | RE | RW
Main Processor Pipeline:    F | D | E | M | W
FPGA Coprocessor Pipeline:  RF | RR | RE | RW

Figure 2: Pipeline Organization of Reconfigurable Coprocessor
`
The pipelines for REMARC, the FPGA coprocessor, and the main
processor are shown in Figure 2. The main processor pipeline is
similar to the MIPS R3000 and the MIPS R5000 and consists of
five stages: Instruction Fetch (F), Instruction Decode (D),
Execution (E), Memory Access (M), and Register Write-back (W).
The reconfigurable coprocessor pipelines are independent of the
main processor pipeline; therefore, the main processor can
execute concurrently with the reconfigurable coprocessors.
The REMARC pipeline starts from the M stage of the main
processor and has the following six stages:
RF : An instruction of the global control unit is fetched.
RR : The REMARC data registers are read.
RL : The data are aligned or "unpacked".
RI : The instructions of the nano processors are fetched.
`
`
`

`

Figure 4: Nano Processor Architecture
(Labels legible in the figure: 32-bit HBUS and VBUS, nano PC,
Nano Instruction RAM (32 x 32 bits), Imm, IR, DR, 16-bit ALU &
Data RAM, DOR, DOUT, and the inputs DINU, DIND, DINL, DINR.)
`
register (IR), eight 16-bit data registers (DR), four 16-bit
data input registers (DIR), and a 16-bit data output register
(DOR).
Each nano processor can use the DR registers, the DIR
registers, and immediate data as the source data of ALU
operations. Moreover, it can directly use the DOR registers of
the four adjacent nano processors (DINU, DIND, DINL, and DINR)
as the source.
The nano processors are also connected by the 32-bit Horizontal
Buses (HBUSs) and the 32-bit Vertical Buses (VBUSs). Each bus
operates as two 16-bit data buses. The 16-bit data in the DOR
register can be sent to the upper or lower 16 bits of the VBUS
or the HBUS. The HBUSs and the VBUSs allow data to be broadcast
to the other nano processors in the same row or column. These
buses can reduce the communication overhead between processors
separated by long distances.
The DIR registers accept inputs from the HBUS, the VBUS, the
DOR, or the four adjacent nano processors. Because the width of
the HBUS and the VBUS is 32 bits, data on the HBUS or the VBUS
are stored into a DIR register pair, DIR and DIR , or DIR and
DIR . Using the DIR registers, data can be transferred between
nano processors during ALU operations.
It takes a half cycle to transfer data using the VBUSs or
HBUSs, so this transfer should not be a critical path of the
design. Other operations, except for data inputs from nearest
neighbors, are done within the nano processor. Because the width
of a nano processor’s datapath is only 16 bits, which is a
quarter of those of the general purpose microprocessors, this
careful design does not make REMARC a critical path of the chip.
`
2.4 FPGA Coprocessor Architecture

The reconfigurable logic array of the FPGA coprocessor is
composed of configurable logic blocks (CLBs). Each CLB
`
RE : The nano processors execute the instructions.
RW : The executed results are packed and stored into
the REMARC data registers.

The FPGA coprocessor pipeline has four stages as follows:

RF : The sequencer of the global control unit starts its
execution.
RR : The coprocessor data registers are read.
RE : The reconfigurable logic array starts execution.
RW : The execution results are stored into the coprocessor
data registers.

The RL and RI stages are unnecessary in the FPGA
coprocessor because load alignment or unpack operations
are realized directly in the FPGA array and FPGAs do
not have instructions to fetch and execute.
`
2.3 REMARC Architecture

Figure 3: Block Diagram of REMARC
(Labels legible in the figure: 32-bit and 64-bit paths to/from
the Main Processor, Global Control Unit, Coprocessor Data
Registers, an array of NANO processors arranged in rows and
columns (COL. 0 to COL. 7), horizontal buses HBUS0 to HBUS7,
vertical buses VBUS0 to VBUS7, and a "nano PC/Row Num./Column
Num." signal.)
`
In this section, we briefly describe the REMARC architecture;
more information can be found in [13]. Figure 3 shows a block
diagram of REMARC. REMARC’s reconfigurable logic is composed of
an 8x8 array of the 16-bit processors, called nano processors.
The execution of each nano processor is controlled by the
instructions stored in the local instruction RAM (nano
instruction RAM). However, each nano processor does not directly
control the instructions it executes. Every cycle the nano
processor receives a PC value, "nano PC", from the global
control unit. All nano processors use the same nano PC and
execute the instructions indexed by the nano PC in their nano
instruction RAM.
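The control scheme in this paragraph, one shared nano PC indexing a separate instruction RAM in every nano processor, can be sketched in Python as follows. This is a minimal sketch, not the REMARC instruction set: instructions are modeled as Python callables, and `step`, `make_ram`, and `demo` are hypothetical names introduced for illustration.

```python
ROWS, COLS = 8, 8  # 8x8 nano processor array, as stated in the text


def step(instruction_rams, states, nano_pc):
    """Broadcast one nano PC; each nano processor runs its own RAM entry at that index."""
    return [
        [instruction_rams[r][c][nano_pc](states[r][c]) for c in range(COLS)]
        for r in range(ROWS)
    ]


def make_ram(row, col):
    """Hypothetical two-entry instruction RAM; real nano instructions are not modeled."""
    return [
        lambda x: x + row,        # entry 0: add this processor's row number
        lambda x: x * (col + 1),  # entry 1: scale by column number + 1
    ]


def demo():
    rams = [[make_ram(r, c) for c in range(COLS)] for r in range(ROWS)]
    states = [[1] * COLS for _ in range(ROWS)]
    for nano_pc in (0, 1):  # the global control unit issues PC 0, then PC 1
        states = step(rams, states, nano_pc)
    return states
```

Because each RAM can hold a different instruction at the same index, the array is not restricted to SIMD behavior even though a single PC drives all 64 processors.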
Figure 4 shows the architecture of the nano processor.
The nano processor contains a 32-entry nano instruction
RAM, a 16-bit ALU, a -entry data RAM, an instruction
`
`

`

                    REMARC                        FPGA Coprocessor
Processor Size      Large (16-bit datapath)       Small (two 4-1 LUTs)
Num. of Procs.      Small (64 processors)         Large (   CLBs)
Execution control   Instruction                   Hardwired
Communication       Controlled by instruction     Configured by switch matrix
Interconnection     4 neighbors, VBUS, and HBUS   Short wires and long wires
Cycle Time          Tcpu                          1x, 2x, 5x, 10x Tcpu

Table 2: Coprocessor Architecture Comparison
`
is equal to a CLB of the Xilinx  series. The CLB
includes two 4-1 lookup tables (LUTs) and two flip-flops.
We do not fix the number of the CLBs in the FPGA
coprocessor. Instead, we evaluate the performance of the
FPGA coprocessor using a varying number of CLBs. The
cycle time of the FPGA coprocessor is also parameterized.
Cycle times of current FPGA systems are longer than those of
the microprocessors by factors of five to ten. For instance,
FPGA systems usually operate at
to MHz while state-of-the-art microprocessors operate
at more than  MHz. However, the recently proposed
reconfigurable coprocessor Garp  aims to operate at the
same operating frequency as its main processor. Therefore,
we assumed the cycle time of the FPGA coprocessor could
be 5x or 10x that of the main processor for current FPGA
architectures and 1x or 2x for future FPGA architectures.
`
2.5 Coprocessor Architecture Comparison

Table 2 summarizes the comparison of the two reconfigurable
coprocessor architectures. REMARC has larger processing
elements, the nano processors, than the FPGA coprocessor.
However, the number of nano processors is less than that of CLBs
in the FPGA coprocessor. In REMARC, both the execution and data
transfer are controlled by instructions, while these are
controlled by hardwired logic in the FPGA coprocessor. REMARC
has limited hardware interconnections. Each nano processor has
direct inputs from the four nearest neighbors and it is
connected by two 32-bit data buses. The FPGA coprocessor has
more flexible hardware interconnections which are configured in
a bit-wise fashion.
We assume that REMARC will operate at the same frequency as the
main processor. The cycle times of the FPGA coprocessor (Tfpga)
are varied. We evaluate the FPGA coprocessor performance,
assuming Tfpga values that are 1x, 2x, 5x, and 10x the CPU cycle
time (Tcpu).
`
3 Performance Evaluation

3.1 Simulation Methodology

We developed the reconfigurable coprocessor simulator using the
SimOS simulation environment   . SimOS models the CPUs, memory
systems, and I/O devices in sufficient detail to boot and run a
commercial operating system. As a base CPU simulation model, we
used "MIPSY", which models a simple single issue RISC processor
similar to the MIPS R .
REMARC functions are added to the MIPSY model. The latency of a
reconfigurable execution (rex) instruction is the sum of the
number of the executed global instructions and the pipeline
latency ( cycles). If a following instruction attempts to read
the result of a rex instruction, the pipeline will stall for the
cycles of the rex instruction’s latency (Data Dependency).
Furthermore, a following rex instruction will stall if a
previous rex instruction is still executing (Resource Conflict).
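The latency and stall rules in this paragraph can be written down as a small Python sketch. The pipeline latency value is not legible in this copy, so it is passed in as a parameter. The function names are hypothetical, and counting only the cycles that remain until the result is ready is a simplifying assumption about how the stalls are charged.

```python
def rex_latency(global_instructions: int, pipeline_latency: int) -> int:
    """Latency of one rex: executed global instructions plus pipeline latency."""
    return global_instructions + pipeline_latency


def data_dependency_stall(rex_issue: int, latency: int, read_cycle: int) -> int:
    """Cycles a consumer waits if it reads the rex result at read_cycle."""
    return max(0, rex_issue + latency - read_cycle)


def next_rex_issue(prev_issue: int, prev_latency: int, earliest: int) -> int:
    """Resource conflict: a new rex cannot start while the previous one is executing."""
    return max(earliest, prev_issue + prev_latency)
```

For example, with ten global instructions and an assumed pipeline latency of four cycles, a rex issued at cycle 0 has a latency of 14 cycles; an instruction that reads its result at cycle 5 waits nine cycles, while one that reads at cycle 20 does not stall.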
We evaluate the execution time of the FPGA coprocessor by
changing the latencies of the rex instructions. First, we
estimate the number of execution cycles of the rex instructions
based on the FPGA delay model. In this model, two sequences can
be executed within one FPGA cycle:

- One stage of interconnection by long or short wire
  and one stage of function not using the carry chain
- One stage of interconnection by short wire and any
  one stage of function

This model is almost the same as the Garp’s delay
model  and the XC’s medium frequency design which
will operate at about  MHz  . Then, we normalize the
number of execution cycles based on the CPU cycle time,
assuming the FPGA cycle time is 1x, 2x, 5x, or 10x the
CPU cycle time. Finally, we use this estimated cycle count
of the rex instruction in the simulator.
We also developed simulators which can execute multimedia
instructions similar to the Intel MMX instruction set
extensions .
To make the comparison fair, the same application source codes
are used for the evaluation except for the use of the multimedia
instructions or the augmented coprocessor instructions. The
memory system parameters used commonly through the performance
evaluations are found in Table 3. We used the "gcc" compiler
with the "-O" optimization option. This option executes most
global compiler optimizations except for loop unrolling and
function inlining.
`
I-cache             K bytes,  -way set.
D-cache             K bytes,  -way set.
L  cache            K bytes,  -way set.
L  miss penalty     cycles
L  miss penalty     cycles

Table 3: Memory System Parameters
`
`
`

`

We used the DES encryption program   based on
the Electronic Codebook (ECB) mode. Although the ECB
mode is less secure than the Cipher Block Chaining (CBC)
mode, it is more commonly used and its operation can be
pipelined.

3.2.1 DES Implementation on REMARC

We decided to divide the algorithm between the main processor
and REMARC. The initial permutation and the final permutation
are executed by the main processor and the 16 rounds of f-box
operations are executed by REMARC. Each row of nano processors
executes two f-box operations. For instance, the eight nano
processors in row 0 execute the first and second rounds, row 1
executes the third and fourth rounds, and so on.
`
`Operation
`Data Load
`Exp. Permutation
`Key XOR
`S-box
`P-box Permutation
`Left XOR
`Data Transfer
`Total
`
`CPU Cycles
`
`  x iter.
`  x iter.
`    x  iter.
`  x  iter.
`  x  iter.
`
`
`
Table 4: Execution Cycle Breakdown of Two f-box Operations
on REMARC

Table 4 is the execution cycle breakdown of the two f-box
operations implemented by the eight nano processors in each row.
The 16 f-box operations are pipelined into 8 stages by the eight
rows of the nano processor array. REMARC can generate a result
of the 16 f-box operations every  cycles. As Table 4 shows,
because of the limited interconnection of REMARC, more than half
( cycles) of the execution time is used for the expansion
permutation and the P-box permutation.
`
3.2.2 DES Implementation on the FPGA Coprocessor

First, we estimate the latency and the throughput of one f-box
operation. The expansion permutation and the XOR operation with
the keys can be executed in one cycle because they can be
implemented by long wire and simple logic. The S-box table
lookup requires two cycles to execute because it consists of
LUTs and MUXs. The P-box permutation and the XOR operation can
be executed in one cycle. The total latency of the f-box
operation is four cycles, and it can be fully pipelined.
Therefore, the maximum throughput of the f-box operation is one
result per FPGA cycle.
We assume three distinct cases, implementing one f-box, 16
f-boxes, and all of the DES encryption algorithm including 16
f-boxes, the initial permutation, and the final permutation.
In the case of one f-box implementation, because each f-box
operation takes as input the result of the previous f-box
operation, the f-box operations cannot be pipelined.
3.2 DES Encryption

The Data Encryption Standard (DES) is one of the most
important encryption algorithms and has been a worldwide
standard for over  years. It is widely used to provide secure
communication over the Internet. DES is also a good application
for reconfigurable processors because it has a lot of
fine-grained parallelism in the form of bit-level irregular data
movements which make software implementation on conventional
microprocessors difficult and inefficient.
`
Figure 5: DES Encryption Algorithm
(Labels legible in the figure: Plaintext (64 bits), Initial
Permutation, Round 1 through Round 16 with an f-box and key K1
through K16 in each round, halves L0/R0 through L16/R16, Final
Permutation, Ciphertext (64 bits).)

Figure 6: DES f-box Algorithm
(Labels legible in the figure: 48-bit Key, Li-1, Ri-1 (32 bits),
Expansion Permutation, eight S-boxes with 6-bit inputs and 4-bit
outputs, P-box Permutation (32 bits), Ri.)
`
Figure 5 shows an outline of the DES algorithm. DES
takes as input 64-bit plaintext. After the initial permutation,
there are 16 rounds of "f-box" operations, as shown
in Figure 6, including expansion permutation, XOR with
the key, S-box table lookup, P-box permutation, and XOR
with the result of the previous round. The final permutation
is performed on the 64-bit result of the sixteenth
round.
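The round structure outlined above is a Feistel network, which can be sketched in Python as below. This is a structural sketch only: `toy_f` is a placeholder round function, not the DES f-box (it has no expansion permutation, S-boxes, or P-box), the initial and final permutations are omitted, and the round keys in the usage note are arbitrary values rather than a DES key schedule.

```python
MASK32 = 0xFFFFFFFF


def toy_f(right: int, key: int) -> int:
    """Placeholder round function; NOT the real DES f-box."""
    return ((right * 0x9E3779B1) ^ key) & MASK32


def feistel(block: int, round_keys, f=toy_f) -> int:
    """Run a 64-bit block through one Feistel round per key.

    Each round computes L_i = R_(i-1) and R_i = L_(i-1) XOR f(R_(i-1), K_i).
    """
    left, right = (block >> 32) & MASK32, block & MASK32
    for key in round_keys:
        left, right = right, left ^ f(right, key)
    # Swap the halves at the end so the same routine inverts itself
    # when the round keys are supplied in reverse order.
    return (right << 32) | left
```

Calling `feistel(ct, keys[::-1])` on a ciphertext `ct = feistel(pt, keys)` returns `pt`, whatever round function is plugged in. The dependence of each round on the previous round's output is also what prevents a single f-box from being pipelined across the rounds of one block.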
`
`
`

`

Tfpga        FPGA Coprocessor   FPGA Coprocessor
             (1 f-box)          (16 f-boxes)
Tcpu         64                 1
Tcpu x 2     128                2
Tcpu x 5     320                5
Tcpu x 10    640                10

Table 5: Execution Throughput of 16 f-box Operations on
Different FPGA Coprocessors (CPU Cycles)
`
Figure 7: DES Encryption (1M Bytes) Results - Single Issue
Architecture
(Bar chart. Series: Base (Single-Issue), +REMARC, +FPGA (1 f-box),
+FPGA (16 f-boxes), +FPGA (All DES). Horizontal axis: Tfpga at
Tcpu, Tcpu x 2, Tcpu x 5, Tcpu x 10. Vertical axis scale 0 to
140; axis label illegible.)
`
It takes 64 FPGA cycles (4 cycles x 16 rounds) to execute
the 16 rounds for the f-box operations. When the 16 f-boxes are
implemented in the FPGA array, the operations can be pipelined.
It takes only one FPGA cycle to generate a new result for the
16 f-box operations after the pipeline is filled up. Table 5
normalizes the throughput of the 16 f-box operations in CPU
cycles for both the one f-box implementation and the 16 f-box
implementation.
When the complete DES encryption algorithm is implemented in
the FPGA coprocessor, the initial and final permutations can be
fully pipelined and a new 64-bit ciphertext character is
generated every FPGA cycle.
To estimate the number of CLBs without wiring overhead, we use
the following method. As shown in Figure 6, the f-box operation
requires a 48-bit XOR, eight 6-bit input 4-bit output LUTs, a
32-bit 4-1 Multiplexer, and a 32-bit XOR. A two-bit XOR or a 4-1
Multiplexer fits to one CLB. A 6-4 LUT needs 16 4-1 LUTs or 8
CLBs. In total, 136 CLBs are required for one f-box operation.
The 16 f-box implementation needs 2176 (136 x 16) CLBs. We
expect that in an actual implementation more CLBs would be
dedicated to the extra wiring which realizes the initial
permutation and the final permutation.
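The CLB totals above can be reproduced with a short Python calculation. The text gives only the packing rules and the totals, so the per-component split below is one reading of those rules (two XOR bits per CLB, one CLB per multiplexer bit, eight CLBs per 6-to-4 LUT) that happens to yield the stated 136; the function names are hypothetical.

```python
def clbs_for_one_f_box() -> int:
    """Estimate CLBs for one f-box, excluding wiring overhead."""
    xor48 = 48 // 2   # 48-bit XOR, assuming two XOR bits fit in one CLB
    s_boxes = 8 * 8   # eight 6-to-4 LUTs at 8 CLBs each
    mux32 = 32 * 1    # 32-bit 4-1 multiplexer, one CLB per bit
    xor32 = 32 // 2   # 32-bit XOR, two XOR bits per CLB
    return xor48 + s_boxes + mux32 + xor32


def clbs_for_f_boxes(count: int) -> int:
    """Scale the single f-box estimate to several f-boxes."""
    return count * clbs_for_one_f_box()
```

Under this reading the components contribute 24, 64, 32, and 16 CLBs, so the S-box lookup tables account for nearly half of each f-box.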
`
Figure 8: DES Encryption (1M Bytes) Results - 2-way Superscalar
Architecture
(Bar chart. Series: Base (2-Issue), +REMARC, +FPGA (1 f-box),
+FPGA (16 f-boxes), +FPGA (All DES). Horizontal axis: Tfpga at
Tcpu, Tcpu x 2, Tcpu x 5, Tcpu x 10. Vertical axis scale 0 to
100; axis label illegible.)
`
`3.2.3 Performance Evaluation Results
`
Figure 7 shows the result of the DES encryption of 1M bytes of
data. The base architecture of the main processor is a single
issue processor. To optimize performance we implemented a
software pipelining technique which can overlap operations in
the main processor and the reconfigurable coprocessors. REMARC
achieves a 7.3 times performance improvement. The FPGA
coprocessor achieves the same performance improvement if its
execution cycles for the 16 f-box operations are less than 64
CPU cycles. This indicat
