Article in Computer · March 2008
DOI: 10.1109/MC.2008.65 · Source: IEEE Xplore
`
RESEARCH FEATURE

The Promise of High-Performance Reconfigurable Computing

Tarek El-Ghazawi, Esam El-Araby, and Miaoqing Huang, George Washington University
Kris Gaj, George Mason University
Volodymyr Kindratenko, University of Illinois at Urbana-Champaign
Duncan Buell, University of South Carolina

Several high-performance computers now use field-programmable gate arrays as reconfigurable coprocessors. The authors describe the two major contemporary HPRC architectures and explore the pros and cons of each using representative applications from remote sensing, molecular dynamics, bioinformatics, and cryptanalysis.
`
In the past few years, high-performance computing vendors have introduced many systems containing both microprocessors and field-programmable gate arrays. Three such systems (the Cray XD1, the SRC-6, and the SGI Altix/RASC) are parallel computers that resemble modern HPC architectures, with added FPGA chips. Two of these machines, the Cray XD1 and SGI Altix, also function as traditional HPCs without the reconfigurable chips. In addition, several Beowulf cluster installations contain one or more FPGA cards per node, such as HPTi’s reconfigurable cluster from the Air Force Research Laboratory.

In all of these architectures, the FPGAs serve as coprocessors to the microprocessors. The main application executes on the microprocessors, while the FPGAs handle kernels that have a long execution time but lend themselves to hardware implementations. Such kernels are typically data-parallel overlapped computations that can be efficiently implemented as fine-grained architectures, such as single-instruction, multiple-data (SIMD) engines, pipelines, or systolic arrays, to name a few.

Figure 1 shows that a transfer of control can occur during execution of the application on the microprocessor, in which case the system invokes an appropriate architecture in a reconfigurable processor to execute the target operation. To do so, the reconfigurable processor can configure or reconfigure the FPGA “on the fly,” while the system’s other processors perform computations. This feature is usually referred to as runtime reconfiguration [1].

From an application development perspective, developers can create the hardware kernel using hardware description languages such as VHDL and Verilog. Other systems allow the use of high-level languages such as SRC Computers’ Carte C and Carte Fortran, Impulse Accelerated Technologies’ Impulse C, Mitrion C from Mitrionics, and Celoxica’s Handel-C. There are also high-level graphical programming development tools such as Annapolis Micro Systems’ CoreFire, Starbridge Systems’ Viva, Xilinx System Generator, and DSPlogic’s Reconfigurable Computing Toolbox.

Readers should consult Computer’s March 2007 special issue on high-performance reconfigurable computing for a good overview of modern HPRC systems, application-development tools and frameworks, and applications.
`
[Figure 1 diagram labels: µP, PC, RP (FPGA); pipelines, systolic arrays, SIMD, ...]
Figure 1. In high-performance reconfigurable computers, field-programmable gate arrays serve as coprocessors to the microprocessors. During execution of the application on the microprocessor, the system invokes an appropriate architecture in the FPGA to execute the target operation.

0018-9162/08/$25.00 © 2008 IEEE · Published by the IEEE Computer Society · February 2008

HPRC ARCHITECTURAL TAXONOMY

Many early HPRC systems, such as the SRC-6E and the Starbridge Hypercomputer, can be seen as attached processors. These systems were designed around one node of microprocessors and another of FPGAs. The two nodes were connected directly, without a scalable interconnection mechanism.

Here we do not address these early attached processor systems but focus instead on scalable parallel systems such as the Cray XD1, SRC-6, and SGI Altix/RASC as well as reconfigurable Beowulf clusters. These architectures can generally be distinguished by whether each node in the system is homogeneous (uniform) or heterogeneous (nonuniform) [2]. A uniform node in this context contains one type of processing element, for example, only microprocessors or only FPGAs. Based on this distinction, modern HPRCs can be grouped into two major classes: uniform node nonuniform systems and nonuniform node uniform systems.

Uniform node nonuniform systems

In UNNSs, shown in Figure 2a, nodes strictly have either FPGAs or microprocessors and are linked via an interconnection network to globally shared memory (GSM). Examples of such systems include the SRC-6 and the Altix/RASC. The major advantage of UNNSs is that vendors can vary the ratio of reconfigurable nodes to microprocessor nodes to meet the different demands of customers’ applications. This is highly desirable from an economic perspective given the cost difference between FPGAs and microprocessors, and it is particularly suitable for special-purpose systems.

On the downside, having the reconfigurable node and the microprocessor node interact over the shared interconnection network makes them compete for overall bandwidth, and it also increases the latency between the nodes. In addition, code portability could become an issue even within the same type of machine if there is a change in the ratio between the microprocessor nodes and the FPGA nodes.

A representative example of the UNNS is the SRC-6/SRC-7, which consists of one or more general-purpose microprocessor subsystems, one or more MAP reconfigurable subsystems, and global common memory (GCM) nodes of shared memory space. These subsystems are interconnected through a Hi-Bar switch communication layer. The microprocessor boards each include two 2.8-GHz Intel Xeon microprocessors and are connected to the Hi-Bar switch through a SNAP interface. The SNAP card plugs into the dual in-line memory module slot on the microprocessor motherboard to provide higher data transfer rates between the boards than the less efficient but common peripheral component interconnect (PCI) solution. The sustained transfer rate between a microprocessor board and the MAP processors is 1,400 Mbytes per second.

The MAP Series C processor consists of one control FPGA and two user FPGAs, all Xilinx Virtex II-6000-4s. Additionally, each MAP unit contains six interleaved banks of onboard memory (OBM) with a total capacity of 24 Mbytes. The maximum aggregate data transfer rate among all FPGAs and OBM is 4,800 MBps. The user FPGAs are configured such that one is in master mode and the other is in slave mode. A bridge port directly connects a MAP’s two FPGAs. Further, MAP processors can be connected via a chain port to create an FPGA array.

[Figure 2 diagram: (a) µP nodes (µP1 ... µPN) and RP nodes (RP1 ... RPM) attached to an interconnection network (IN) and/or GSM; (b) µP1 ... µPN and RP1 ... RPM grouped together in each node, attached to the IN and/or GSM.]
Figure 2. Modern HPRCs can be grouped into two major classes: (a) uniform node nonuniform systems (UNNSs) and (b) nonuniform node uniform systems (NNUSs).

Nonuniform node uniform systems

NNUSs, shown in Figure 2b, use only one type of node; thus the system level is uniform. However, each node contains both types of resources, and the FPGAs are connected directly to the microprocessors inside the node.
`
`

`

Examples of such systems are the Cray XD1 and reconfigurable clusters. NNUSs’ main drawback is their fixed ratio of FPGAs to microprocessors, which might not suit the traditional vendor-buyer economic model. However, they cater in a straightforward way to the single-program, multiple-data (SPMD) model that most parallel programming paradigms embrace. Further, the latency between the microprocessor and its FPGA coprocessor can be low, and the bandwidth between them will be dedicated, which can mean high performance for many data-intensive applications.

A representative example of the NNUS is the Cray XD1, whose direct-connected processor (DCP) architecture harnesses multiple processors into a single, unified system. The base unit is a chassis, with up to 12 chassis per cabinet. One chassis houses six compute cards, each of which contains two 2.4-GHz AMD Opteron microprocessors and one or two RapidArray Processors (RAPs) that handle communication. The two Opteron microprocessors are connected via AMD’s HyperTransport technology with a bandwidth of 3.2 GBps, forming a two-way symmetric multiprocessing (SMP) cluster. Each XD1 chassis can be configured with six application-acceleration processors based on Xilinx Virtex-II Pro or Virtex-4 FPGAs. With two RAPs per board, a bandwidth of 8 GBps (4 GBps bidirectional) between boards is available via a RapidArray switch. Half of this switch’s 48 links connect to the RAPs on the compute boards within the chassis, while the others can connect to other chassis.

NODE-LEVEL ISSUES

We have used the SRC-6E and SRC-6 systems to investigate node-level performance of HPRC architectures in processing remote sensing [3] and molecular dynamics [4] applications. These studies included the use of optimization techniques such as pipelining and data transfer overlapping with computation to exploit the inherent temporal and spatial parallelism of such applications.

Remote sensing

Hyperspectral dimension reduction [3] is representative of remote sensing applications with respect to node performance. With FPGAs as coprocessors for the microprocessor, substantial data in this data-intensive application must move back and forth between the microprocessor memory and the FPGA onboard memory. While the bandwidth for such transfers is on the order of GBps, the transfers are an added overhead and represent a challenge on the SRC-6 given the finite size of its OBM.

This overhead can be avoided altogether through the sharing of memory banks, or the bandwidth can be increased to take advantage of FPGAs’ outstanding processing speed. Overlapping memory transfers (that is, streaming) between these two processing elements and the computations also can help. As Figure 3a shows, such transfers (I/O read and write operations) take only 8 percent of the application execution time on a 1.8-GHz Pentium 4 microprocessor, while the remaining 92 percent is spent on computations.

As Figure 3b shows, the first-generation SRC-6E achieves a significant speedup over the microprocessor: 12.08× without streaming and 13.21× with streaming. However, the computation time is now only 9 percent of the overall execution time. In the follow-up SRC-6, the bandwidth between the microprocessor and FPGA increases from 380 MBps (sustained) to 1.4 GBps (sustained). As Figure 3c shows, this system achieves a 24.06× speedup (without streaming) and a 32.04× speedup (with streaming) over the microprocessor.

These results clearly demonstrate that bandwidth between the microprocessor and the FPGA must be increased to support more data-intensive applications, an area the third-generation SRC-7 is likely to address. It should be noted, however, that in most HPRCs today, transfers between the microprocessor and FPGA are explicit, further complicating programming models. These two memory subsystems should either be fused into one or integrated into a hierarchy with the objective of reducing or eliminating this overhead and making the transfers transparent.

[Figure 3 pie charts; legend: I/O-read, Comp, I/O-write]
(a) 1.8-GHz Pentium 4: total execution time is 20.21 sec; slices of 3%, 5%, and 92%.
(b) SRC-6E, P3: total execution time is 1.67 sec; slices of 58%, 33%, and 9%; speedup without streaming 12.08×, with streaming 13.21×.
(c) SRC-6: total execution time is 0.84 sec; slices of 50%, 25%, and 25%; speedup without streaming 24.06×, with streaming 32.04×.
Figure 3. Execution profiles of hyperspectral dimension reduction. (a) Total execution time on 1.8-GHz Pentium 4 microprocessor. (b) Total execution time on SRC-6E. (c) Total execution time on SRC-6.

Molecular dynamics

Nanoscale molecular dynamics (NAMD) [4] is representative of floating-point applications with respect to node performance. A recent case study revealed that when porting such highly optimized code, a sensible approach is to use several design iterations, starting with the simplest, most straightforward implementation and gradually adding to it until achieving the best solution or running out of FPGA resources [5].

The study’s final dual-FPGA-based implementation was only three times faster than the original code execution. These results, however, are data dependent. For a larger cutoff radius, the original CPU code executes in more than 800 seconds while the FPGA execution time is unchanged, which would constitute a 260× speedup. The need to translate data between the C++ data storage mechanisms and the system-defined MAP/FPGA data storage architecture required considerable development effort. When creating code from scratch to run on an FPGA architecture, a programmer would implement the data storage mechanisms compatible between the CPU and FPGA from the beginning, but this is rarely the case for existing code and adds to the amount of work required to port the code.

Although the “official benchmark” kernel employs double-precision floating-point arithmetic, the NAMD researchers applied algorithmic optimization techniques and implemented their kernel using single-precision floating-point arithmetic for atom locations and 32-bit integer arithmetic for forces. Consequently, the final design occupies most available slices (97 percent), yet utilization of on-chip memory banks (40 percent) and hardware multipliers (28 percent) is low. The fact that the slice limit was reached before any other resource limits suggests that it might be necessary to restructure code to better utilize other available resources. One possible solution is to overlap calculations with data transfer for the next data set to use more available on-chip memory.

Despite the relatively modest speedup achieved, the NAMD study clearly illustrates the potential of HPRC technology. FPGA code development traditionally begins with writing code that implements a textbook algorithm, with little or no optimization. When porting such unoptimized code to an HPRC platform and taking care to optimize the FPGA design, it is easy to obtain a 10×-100× speedup. In contrast, we began with decade-old code optimized to run on the CPU-based platform; such code successfully competes with its FPGA-ported counterpart. It is important to keep in mind that the study’s 100-MHz FPGA achieved a 3× application performance improvement over a 2.8-GHz CPU, and FPGAs are on a faster technology growth curve than CPUs [6].
`
Lessons learned

Optimization techniques such as overlapping data transfers between the microprocessors and FPGAs with computations are useful for data-intensive, memory-bound applications. However, such applications, including hyperspectral dimension reduction and NAMD, can only achieve good performance when the underlying HPRC architecture supports features such as streaming or overlapping. Streaming can be enabled by architectures that are characterized by high I/O bandwidth and/or tight coupling of FPGAs with associated microprocessors. Promising new examples of these are DCP architectures such as AMD’s Torrenza initiative for HyperTransport links as well as Intel’s QuickAssist technology supporting front side bus (FSB) systems. Sufficiently large memory bandwidth is another equally important feature.

By memory bandwidth we mean that the memory system has sufficient multiplicity as well as speed, width, or depth/size. In other words, because FPGAs can produce and consume data at a high degree of parallelism, the associated memory system should also have an equal degree of multiplicity. Simply put, a large number of narrow-word memory banks local to the FPGA can be more useful to memory-bound applications on HPRCs than larger and wider memories with fewer parallel banks.
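The point about bank multiplicity can be illustrated with a small back-of-the-envelope sketch. It is not taken from the article: the function names, the 64-bit bank width, and the 100-MHz memory clock are assumptions chosen so that six banks reproduce the 4,800 MBps aggregate figure quoted for the MAP, and the "few wide banks" layout is purely hypothetical.

```python
def aggregate_bandwidth_mbps(banks: int, word_bytes: int, clock_mhz: float) -> float:
    """Peak MB/s when every bank delivers one word per clock cycle."""
    return banks * word_bytes * clock_mhz


def independent_accesses_per_cycle(banks: int) -> int:
    """Each bank has its own address port, so it serves one independent access per cycle."""
    return banks


# Two hypothetical layouts with the same total data width (48 bytes per cycle).
many_narrow = {"banks": 6, "word_bytes": 8}   # six 64-bit banks (assumed MAP-like layout)
few_wide = {"banks": 2, "word_bytes": 24}     # two 192-bit banks (hypothetical)

for name, cfg in (("many narrow banks", many_narrow), ("few wide banks", few_wide)):
    bw = aggregate_bandwidth_mbps(cfg["banks"], cfg["word_bytes"], clock_mhz=100.0)
    ports = independent_accesses_per_cycle(cfg["banks"])
    print(f"{name}: {bw:.0f} MB/s peak, {ports} independent accesses per cycle")
```

Under these assumptions both layouts have the same peak bandwidth, but the six-bank layout lets a pipelined design read or write six unrelated addresses in the same cycle, which is the multiplicity the text argues for.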
In addition, further node architecture developments are clearly necessary to support programming models with transparent transfers of data between FPGAs and microprocessors by integrating the microprocessor memory and the FPGA memory into the same hierarchy. Vendor-provided transparent transfers can enhance performance by guaranteeing the most efficient transfer modes for the underlying platform. This will let the user focus on algorithmic optimizations that can benefit the application under investigation rather than data transfers or distribution. It also can improve productivity.
`
SYSTEM-LEVEL ISSUES

We have used the SRC-6 and Cray XD1 systems to investigate system-level performance of HPRC architectures in bioinformatics [7] and cryptanalysis [8-10] applications. These applications provide a near-practical upper bound on HPRC potential performance as well as insight into system-level programmability and performance issues apart from those associated with general high-performance computers. They use integer arithmetic, an area where HPRCs excel; they are compute-intensive, with many computations and little data transfer between the FPGAs and microprocessors; and they exhibit both spatial and temporal parallelism.

Figure 4 data. Measured values are ranges from 1 chip to the largest chip count tested (4 chips on the SRC-6, 6 chips on the XD1).

Platform | Workload | Engines/chip | Expected throughput (GCUPS) | Expected speedup | Measured throughput (GCUPS) | Measured speedup
FASTA (ssearch34), Opteron 2.4 GHz | DNA | NA | NA | NA | 0.065 | 1
FASTA (ssearch34), Opteron 2.4 GHz | Protein | NA | NA | NA | 0.130 | 1
SRC-6, 100 MHz (32x1) | DNA | 1 | 3.2 | 49.2× | 3.19 → 12.2 | 49 → 188
SRC-6, 100 MHz (32x1) | DNA | 4 | 12.8 | 197× | 12.4 → 42.7 | 191 → 656
SRC-6, 100 MHz (32x1) | DNA | 8 | 25.6 | 394× | 24.1 → 74 | 371 → 1,138
SRC-6, 100 MHz (32x1) | Protein | 1 | 3.2 | 24.6× | 3.12 → 11.7 | 24 → 90
XD1, 200 MHz (32x1) | DNA | 1 | 6.4 | 98× | 5.9 → 32 | 91 → 492
XD1, 200 MHz (32x1) | DNA | 4 | 25.6 | 394× | 23.3 → 120.7 | 359 → 1,857
XD1, 200 MHz (32x1) | DNA | 8 | 51.2 | 788× | 45.2 → 181.6 | 695 → 2,794
XD1, 200 MHz (32x1) | Protein | 1 | 6.4 | 49× | 5.9 → 34 | 45 → 262

Figure 4. DNA and protein sequencing on the SRC-6 and Cray XD1 versus the open source FASTA program. An FPGA with one engine produced a 91× speedup, while eight cores on the same chip collectively achieved a 695× speedup.

We distributed the workload of both types of applications over all nodes using the message passing interface (MPI). In the case of DNA and protein analysis, we broadcast a database of reference sequences and scattered sequence queries. The application identified matching scores locally and then gathered them together. Each FPGA had as many hardware kernels for the basic operation as possible. In the case of cryptanalysis, we broadcast the ciphertext as well as the corresponding plaintext; upon finding the key, a worker node sent it back to the master to terminate the search.
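The master-worker key search described above can be sketched in a few lines. This is only an illustration of the workload split, not the authors' code: it uses a sequential loop in place of real MPI ranks, a toy 16-bit cipher in place of DES, IDEA, or RC5, and every name in it (`toy_encrypt`, `key_range`, `worker_search`, `master`) is hypothetical.

```python
from typing import Optional, Tuple


def rotl16(x: int, r: int) -> int:
    """Rotate a 16-bit value left by r bits."""
    return ((x << r) | (x >> (16 - r))) & 0xFFFF


def toy_encrypt(block: int, key: int) -> int:
    """Stand-in cipher (NOT DES): XOR with the key, then rotate."""
    return rotl16(block ^ key, 3)


def key_range(rank: int, n_workers: int, key_space: int) -> range:
    """Contiguous slice of the key space owned by worker `rank`."""
    chunk = (key_space + n_workers - 1) // n_workers
    start = rank * chunk
    return range(start, min(start + chunk, key_space))


def worker_search(rank: int, n_workers: int, plaintext: int,
                  ciphertext: int, key_space: int = 1 << 16) -> Optional[int]:
    """Try every key in this worker's slice; return the matching key, if any."""
    for key in key_range(rank, n_workers, key_space):
        if toy_encrypt(plaintext, key) == ciphertext:
            return key
    return None


def master(plaintext: int, ciphertext: int,
           n_workers: int = 6) -> Optional[Tuple[int, int]]:
    """'Broadcast' the known pair to all workers and stop at the first reported key."""
    for rank in range(n_workers):  # sequential stand-in for parallel ranks
        found = worker_search(rank, n_workers, plaintext, ciphertext)
        if found is not None:
            return rank, found
    return None


secret = 0xBEEF
pair = (0x1234, toy_encrypt(0x1234, secret))
print(master(*pair))
```

In a real MPI program the loop over ranks would disappear: each process would call `worker_search` with its own rank after a broadcast of the plaintext-ciphertext pair, and the process that finds the key would notify the master.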
`
Bioinformatics

Figure 4 compares DNA and protein sequencing on the SRC-6 and Cray XD1 with the open source FASTA program running on a 2.4-GHz Opteron microprocessor. We used giga cell updates per second (GCUPS) as the throughput metric as well as to compute speedup over the Opteron. With its FPGA chips running at 200 MHz, the XD1 had an advantage over the SRC-6, which could run its FPGAs at only 100 MHz.

By packing eight kernels on each FPGA chip, the Cray XD1 achieved a 2,794× speedup using one chassis with six FPGAs. An FPGA with one engine produced a 91× speedup instead of the expected 98× speedup due to associated overhead such as pipeline latency, resulting in 93 percent efficiency. On the other hand, eight cores on the same chip collectively achieved a 695× speedup instead of the expected 788× speedup due to intranode communication and I/O overhead. Across all six chips, the achieved speedup for eight engines per chip was 2,794× instead of the estimated (ideal) 4,728× due to MPI internode communication overhead, resulting in 59 percent efficiency.
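The expected figures and efficiencies quoted above follow from simple arithmetic, sketched here as a check. The reading of "(32x1)" in Figure 4 as 32 processing elements per engine, each updating one cell per clock, is an assumption; it is used because it reproduces the expected-throughput column. The function names are illustrative only.

```python
FASTA_DNA_GCUPS = 0.065  # measured Opteron baseline for DNA (Figure 4)


def expected_gcups(engines: int, pes_per_engine: int, clock_mhz: float) -> float:
    """Peak giga cell updates per second, assuming one cell update per PE per clock."""
    return engines * pes_per_engine * clock_mhz / 1000.0


def speedup(gcups: float, baseline_gcups: float) -> float:
    """Throughput ratio over the software baseline."""
    return gcups / baseline_gcups


def efficiency(measured: float, expected: float) -> float:
    """Fraction of the ideal figure that was actually achieved."""
    return measured / expected


xd1_one_engine = expected_gcups(1, 32, 200.0)     # 6.4 GCUPS expected
xd1_eight_engines = expected_gcups(8, 32, 200.0)  # 51.2 GCUPS expected
ideal_six_chips = 6 * round(speedup(xd1_eight_engines, FASTA_DNA_GCUPS))

print(f"1 engine:  expected {speedup(xd1_one_engine, FASTA_DNA_GCUPS):.0f}x, "
      f"efficiency {efficiency(91, 98):.0%}")
print(f"6 chips x 8 engines: ideal {ideal_six_chips}x, "
      f"efficiency {efficiency(2794, ideal_six_chips):.0%}")
```

The 4,728× ideal figure in the text is six times the 788× expected for one fully packed chip, which is how `ideal_six_chips` is formed here.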
These results demonstrate that, with FPGAs’ remarkable speed, overhead such as internode and intranode communication must be at much lower levels in HPRCs than what is accepted in conventional high-performance computers. However, given the speed of HPRCs, very large configurations might not be needed.
`
Cryptanalysis

The cryptanalysis results, shown in Tables 1 and 2, are even more encouraging, especially since this application has even lower overhead. With the Data Encryption Standard (DES) cipher, the SRC-6 achieved a 6,757× speedup over the microprocessor (again, a 2.4-GHz Opteron), while the Cray XD1 achieved a 12,162× speedup. The application’s scalability is almost ideal.
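The speedups in Tables 1 and 2 are the ratio of hardware to software key-search throughput, which the following sketch recomputes from the tabulated values. The dictionary layout and function names are illustrative; the numbers are copied from the two tables.

```python
# (hardware search engines, hardware Mkeys/s, software Mkeys/s) from Tables 1 and 2
SRC6 = {
    "DES": (40, 4000.0, 0.592),
    "IDEA": (16, 1600.0, 2.498),
    "RC5-32/12/16": (4, 400.0, 0.351),
    "RC5-32/8/8": (8, 800.0, 0.517),
}
XD1 = {
    "DES": (36, 7200.0, 0.592),
    "IDEA": (30, 6000.0, 2.498),
    "RC5-32/8/8": (6, 1200.0, 0.517),
}


def speedup(hw_mkeys: float, sw_mkeys: float) -> float:
    """Hardware throughput divided by single-engine software throughput."""
    return hw_mkeys / sw_mkeys


def per_engine_mkeys(engines: int, hw_mkeys: float) -> float:
    """Throughput contributed by each hardware search engine."""
    return hw_mkeys / engines


for system, table in (("SRC-6", SRC6), ("Cray XD1", XD1)):
    for cipher, (engines, hw, sw) in table.items():
        print(f"{system:9s} {cipher:13s} {speedup(hw, sw):8.0f}x  "
              f"{per_engine_mkeys(engines, hw):.0f} Mkeys/s per engine")
```

Every engine contributes 100 Mkeys/s on the SRC-6 and 200 Mkeys/s on the XD1, which matches the 100-MHz and 200-MHz FPGA clocks mentioned earlier and is consistent with each pipelined engine testing roughly one key per clock cycle.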
In the case of the Cray XD1, a straightforward MPI application resulted in using all nodes. However, it made sense for the node program to run on only one microprocessor and its FPGA; the other microprocessors on each node were not used. On the SRC-6, MPI processes had to run on the microprocessors, and the system had to establish an association between each microprocessor and a MAP processor. Because the SRC-6 was limited to two network interface cards that could not be shared efficiently, two MPI processes were sufficient. This meant the program could only run on one microprocessor and one MAP processor.
`
Table 1. Secret-key cipher cryptanalysis on SRC-6.

Application | Hardware: number of search engines | Hardware: throughput (keys/s) | Software: number of search engines | Software: throughput (keys/s) | Speedup
Data Encryption Standard (DES) breaking | 40 | 4,000 M | 1 | 0.592 M | 6,757×
International Data Encryption Algorithm (IDEA) breaking | 16 | 1,600 M | 1 | 2.498 M | 641×
RC5-32/12/16 breaking | 4 | 400 M | 1 | 0.351 M | 1,140×
RC5-32/8/8 breaking | 8 | 800 M | 1 | 0.517 M | 1,547×

Table 2. Secret-key cipher cryptanalysis on Cray XD1.

Application | Hardware: number of search engines | Hardware: throughput (keys/s) | Software: number of search engines | Software: throughput (keys/s) | Speedup
Data Encryption Standard (DES) breaking | 36 | 7,200 M | 1 | 0.592 M | 12,162×
International Data Encryption Algorithm (IDEA) breaking | 30 | 6,000 M | 1 | 2.498 M | 2,402×
RC5-32/8/8 breaking | 6 | 1,200 M | 1 | 0.517 M | 2,321×

Lessons learned

Heterogeneity at the system level (namely, UNNS architectures) can be challenging to most accepted SPMD programming paradigms. This occurs because current technology utilizes the reconfigurable processors as coprocessors to the main host processor through a single unshared communication channel. In particular, when the ratio of microprocessors, reconfigurable processors, and their communication channels differs from unity, SPMD programs, which generally assume a unity ratio, might underutilize some of the microprocessors. On the other hand, heterogeneity at the node level does not present a problem for such programs.

Heterogeneity at the system level is driven by nontechnological factors such as cost savings, which developers can achieve by tailoring systems to customers using homogeneous node architectures. However, this is at least partly offset by the increased difficulty in code portability. NNUS architectures are more privileged in this respect than their UNNS counterparts.
`
HPRC PERFORMANCE IMPROVEMENT

To assess the potential of HPRC technology, we exploited the maximum hardware parallelism in the previously cited studies’ testbeds at both the chip and system levels. For each application, we filled the chip with as many hardware cores as possible that can run in parallel. We obtained additional system-level parallelism via parallel programming techniques, using MPI to break the overall problem across all available nodes in order to decrease execution time. After estimating the size of a computer cluster capable of the same level of speedup, we derived the corresponding cost, power, and size savings that can be achieved by an SRC-6, Cray XD1, and SGI Altix 4700 with an RC100 RASC module compared with a conventional high-performance PC cluster.

As Tables 3-5 show, the improvements reach several orders of magnitude. In this analysis, a 100× speedup indicates that the HPRC’s cost, power, and size are compared to those of a 100-processor Beowulf cluster. The estimates are very conservative, because when parallel efficiency is considered, a 100-processor cluster will likely produce a speedup much less than 100×; in other words, we assumed the competing cluster to be 100 percent efficient. We also assumed that one cluster node consumes about 220 watts, and that 100 cluster nodes have a footprint of 6 square feet. Based on actual prices, we estimated the cost ratio to be 1:200 in the case of the SRC-6 and 1:100 in the case of the Cray XD1. The cost reduction is actually much larger than the tables indicate when considering the systems’ associated power and size.
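The cost column of Tables 3 and 4 can be reproduced with one division, sketched below. The interpretation is an assumption rather than something the article states outright: a 1:200 cost ratio is read as "one SRC-6 costs as much as 200 cluster nodes" (1:100 for the Cray XD1), and the equivalent cluster is taken to have as many nodes as the measured speedup. The helper names are illustrative.

```python
COST_RATIO = {"SRC-6": 200, "Cray XD1": 100}  # assumed: HPRC price in cluster-node units

SPEEDUPS = {  # from Tables 3 and 4
    "SRC-6": {"DNA and protein sequencing": 1138, "DES breaking": 6757,
              "IDEA breaking": 641, "RC5 breaking": 1140},
    "Cray XD1": {"DNA and protein sequencing": 2794, "DES breaking": 12162,
                 "IDEA breaking": 2402, "RC5 breaking": 2321},
}


def cost_savings(speedup: float, cost_ratio: float) -> float:
    """Cost of an equally fast, 100%-efficient cluster relative to one HPRC."""
    return speedup / cost_ratio


def cluster_power_watts(speedup: float, watts_per_node: float = 220.0) -> float:
    """Power drawn by the equivalent cluster at about 220 W per node."""
    return speedup * watts_per_node


def cluster_footprint_sqft(speedup: float, sqft_per_100_nodes: float = 6.0) -> float:
    """Floor space of the equivalent cluster at 6 square feet per 100 nodes."""
    return speedup * sqft_per_100_nodes / 100.0


for system, apps in SPEEDUPS.items():
    for app, s in apps.items():
        print(f"{system:9s} {app:27s} cost savings ~{cost_savings(s, COST_RATIO[system]):.0f}x, "
              f"equivalent cluster {cluster_power_watts(s) / 1000:.0f} kW, "
              f"{cluster_footprint_sqft(s):.0f} sq ft")
```

The power and size savings in the tables additionally depend on each HPRC's own power draw and footprint, which the text does not list, so this sketch only reports the equivalent cluster's figures for those two quantities.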
These dramatic improvements can be viewed as realistic upper bounds on the promise of HPRC technology because the selected applications are all compute-intensive integer applications, a class at which HPRCs clearly excel. However, with additional FPGA chip improvements in the areas of size and floating-point support, and with improved data-transfer bandwidths between FPGAs and their external local memory as well as between the microprocessor and the FPGA, a much wider range of applications can harness similar levels of benefits. For example, in the hyperspectral dimension reduction study, data transfer improvements between the SRC-6E and SRC-6, while using the same FPGA chips, almost doubled the speedup.
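The "almost doubled" observation can be rechecked from the Figure 3 execution times with the sketch below. The streaming model in it is an assumption, not the authors' stated method: it supposes that streaming hides the computation entirely behind the I/O transfers, and it takes the computation share on the SRC-6 to be one of the 25 percent slices in Figure 3c (the text gives the share explicitly only for the SRC-6E, at 9 percent).

```python
CPU_TIME_S = 20.21  # 1.8-GHz Pentium 4 total execution time, Figure 3a

# Total execution time without streaming and computation share of that time.
SYSTEMS = {
    "SRC-6E": {"total_s": 1.67, "comp_fraction": 0.09},
    "SRC-6": {"total_s": 0.84, "comp_fraction": 0.25},  # share assumed from Figure 3c
}


def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Baseline time divided by accelerated time."""
    return baseline_s / accelerated_s


def streamed_time(total_s: float, comp_fraction: float) -> float:
    """Assumed model: computation is fully overlapped with the I/O transfers."""
    return total_s * (1.0 - comp_fraction)


for name, cfg in SYSTEMS.items():
    plain = speedup(CPU_TIME_S, cfg["total_s"])
    streamed = speedup(CPU_TIME_S, streamed_time(cfg["total_s"], cfg["comp_fraction"]))
    print(f"{name}: {plain:.2f}x without streaming, {streamed:.2f}x with streaming (model)")
```

Under this model the results land close to the reported 12.08× and 13.21× for the SRC-6E and 24.06× and 32.04× for the SRC-6; the small differences are plausibly due to rounding of the published times and percentages.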
`
Our research revealed that HPRCs can achieve up to four orders of magnitude improvement in performance, up to three orders of magnitude reduction in power consumption, and two orders of magnitude savings in cost and size requirements compared with contemporary microprocessors when running compute-intensive applications based on integer arithmetic.

In general, these systems were less successful in processing applications based on floating-point arithmetic, especially double precision, whose high usage of FPGA resources constitutes an upper bound on fine-grained parallelism for application cores. However, they can achieve as high performance on embarrassingly parallel floating-point applications, subject to area constraints, as on integer arithmetic applications. FPGA chips will likely become larger and have more integrated cores that can better support floating-point operations.

Our future work will include a comprehensive study of software programming tools and languages and their impact on HPRC productivity, as well as multitasking/multiuser support on HPRCs. Because porting applications from one machine to another, or even to the same machine after a hardware upgrade, is nontrivial, hardware architectural virtualization and runtime systems support for application portability is another good research candidate.
`
Table 3. Performance improvement of SRC-6 compared with a Beowulf cluster.

Application | Speedup | Cost savings | Power savings | Size savings
DNA and protein sequencing | 1,138× | 6× | 313× | 34×
DES breaking | 6,757× | 34× | 856× | 203×
IDEA breaking | 641× | 3× | 176× | 19×
RC5 breaking | 1,140× | 6× | 313× | 34×

Table 4. Performance improvement of Cray XD1 compared with a Beowulf cluster.

Application | Speedup | Cost savings | Power savings | Size savings
DNA and protein sequencing | 2,794× | 28× | 148× | 29×
DES breaking | 12,162× | 122× | 608× | 127×
IDEA breaking | 2,402× | 24× | 120× | 25×
RC5 breaking | 2,321× | 23× | 116× | 24×

Table 5. Performance improvement of SGI Altix 4700 with RC100 RASC module compared with a Beowulf cluster.

Application | Speedup | Cost savings | Power savings | Size savings
DNA and protein sequencing | 8,723× | 22× | 779× | 253×
DES breaking | 28,514× | 96× | 3,439× | 1,116×
IDEA breaking | 961× | 2× | 86× | 28×
RC5 breaking | 6,838× | 17× | 610× | 198×

References
1. M. Taher and T. El-Ghazawi, “A Segmentation Model for Partial Run-Time Reconfiguration,” Proc. IEEE Int’l Conf. Field Programmable Logic and Applications, IEEE Press, 2006, pp. 1-4.
2. T. El-Ghazawi, “Experience with Early Reconfigurable High-Performance Computers,” 2006; http://hpcl.seas.gwu.edu/talks/Tarek_DATE2006.ppt.
3. S. Kaewpijit, J. Le Moigne, and T. El-Ghazawi, “Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral Analysis,” IEEE Trans. Geoscience and Remote Sensing, vol. 41, no. 4, 2003, pp. 863-871.
4. J.C. Phillips et al., “Scalable Molecular Dynamics with NAMD,” J. Computational Chemistry, vol. 26, no. 16, 2005, pp. 1781-1802.
5. V. Kindratenko and D. Pointer, “A Case Study in Porting a Production Scientific Supercomputing Application to a Reconfigurable Computer,” Proc. 14th Ann. IEEE Symp. Field-Programmable Custom Computing Machines, IEEE CS Press, 2006, pp. 13-22.
6. K. Underwood, “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” Proc. 12th ACM/SIGDA Int’l Symp. Field Programmable Gate Arrays, ACM Press, 2004, pp. 171-180.
7. D.W. Mount, Bioinformatics: Sequence and Genome Analysis, 2nd ed., Cold Spring Harbor Laboratory Press, 2004.
8. O.D. Fidanci et al., “Implementation Trade-Offs of Triple DES in the SRC-6E Reconfigurable Computing Environment,” Proc. 5th Ann. Int’l Conf. Military and Aerospace Programmable Logic Devices, 2002; www.gwu.edu/~hpc/rcm/publications/MAPLD2002.pdf.
9. R.L. Rivest, “The RC5 Encryption Algorithm,” revised version, MIT Laboratory for Computer Science, Cambridge, Mass., 20 Mar. 1997; http://people.csail.mit.edu/rivest/Rivest-rc5rev.pdf.
10. A. Michalski, K. Gaj, and T. El-Ghazawi, “An Implementation Comparison of an IDEA Encryption Cryptosystem on Two General-Purpose Reconfigurable Computers,” Proc. 13th Ann. Conf. Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003, pp. 204-219.
`
`
`

`

Tarek El-Ghazawi is a professor in the Department of Computer and Electrical Engineering, a founder of the High-Performance Computing Lab (HPCL) at George Washington University, and cofounder of the NSF Center for High-Performance Reconfigurable Computing (CHREC). His research interests include high-performance computing, parallel computer architectures, high-performance I/O, reconfigurable computing, experimental performance evaluations, computer vision, and remote sensing. El-Ghazawi received a PhD in electrical and computer engineering from New Mexico State University. He is a senior member of the IEEE and a member of the ACM. Contact him at tarek@gwu.edu.

Esam El-Araby is a doctoral student in the Department of Computer and Electrical Engineering and a research assistant in the HPCL at George Washington University. His research interests include reconfigurable computing, hybrid architectures, evolvable hardware, performance evaluation, digital signal/image processing, and hyperspectral remote sensing. El-Araby received an MSc in computer engineering from the George Washington University. Contact him at esam@gwu.edu.

Miaoqing Huang is a doctoral student in the Department of Computer and Electrical Engineering and a research assistant in the HPCL at George Washington University. His research interests include reconfigurable computing, high-performance computing architectures, cryptography, image processing, and computer arithmetic. Huang received a BS in electronics and information systems from Fudan University, Shanghai
