`
`The Promise of High-Performance Reconfigurable Computing
`
`Article in Computer · March 2008
`
`DOI: 10.1109/MC.2008.65 · Source: IEEE Xplore
`
`
`
`
RESEARCH FEATURE
`
`The Promise of
`High-Performance
`Reconfigurable Computing
`
`Tarek El-Ghazawi, Esam El-Araby, and Miaoqing Huang, George Washington University
`Kris Gaj, George Mason University
`Volodymyr Kindratenko, University of Illinois at Urbana-Champaign
`Duncan Buell, University of South Carolina
`
`Several high-performance computers now use field-programmable gate arrays as reconfigurable
coprocessors. The authors describe the two major contemporary HPRC architectures and explore
`the pros and cons of each using representative applications from remote sensing, molecular
`dynamics, bioinformatics, and cryptanalysis.
`
In the past few years, high-performance computing vendors have introduced many systems
containing both microprocessors and field-programmable gate arrays. Three such systems, the
Cray XD1, the SRC-6, and the SGI Altix/RASC, are parallel computers that resemble modern HPC
architectures, with added FPGA chips. Two of these machines, the Cray XD1 and SGI Altix, also
function as traditional HPCs without the reconfigurable chips. In addition, several Beowulf
cluster installations contain one or more FPGA cards per node, such as HPTi’s reconfigurable
cluster from the Air Force Research Laboratory.
In all of these architectures, the FPGAs serve as coprocessors to the microprocessors. The
main application executes on the microprocessors, while the FPGAs handle kernels that have a
long execution time but lend themselves to hardware implementations. Such kernels are
typically data-parallel overlapped computations that can be efficiently implemented as
fine-grained architectures, such as single-instruction, multiple-data (SIMD) engines,
pipelines, or systolic arrays, to name a few.
Figure 1 shows that a transfer of control can occur during execution of the application on
the microprocessor, in which case the system invokes an appropriate architecture in a
reconfigurable processor to execute the target operation. To do so, the reconfigurable
processor can configure or reconfigure the FPGA “on the fly,” while the system’s other
processors perform computations. This feature is usually referred to as runtime
reconfiguration [1].
`From an application development perspective, devel-
`opers can create the hardware kernel using hardware
`description languages such as VHDL and Verilog. Other
`systems allow the use of high-level languages such as
`SRC Computers’ Carte C and Carte Fortran, Impulse
`Accelerated Technologies’ Impulse C, Mitrion C from
`Mitrionics, and Celoxica’s Handel-C. There are also
`high-level graphical programming development tools
`such as Annapolis Micro Systems’ CoreFire, Starbridge
`Systems’ Viva, Xilinx System Generator, and DSPlogic’s
`Reconfigurable Computing Toolbox.
`Readers should consult Computer’s March 2007
`special issue on high-performance reconfigurable com-
`puting for a good overview of modern HPRC systems,
`application-development tools and frameworks, and
`applications.
`
0018-9162/08/$25.00 © 2008 IEEE
Published by the IEEE Computer Society, February 2008

HPRC ARCHITECTURAL TAXONOMY
Many early HPRC systems, such as the SRC-6E and the Starbridge Hypercomputer, can be seen as
attached processors. These systems were designed around one node of microprocessors and
another of FPGAs. The two nodes were connected directly, without a scalable interconnection
mechanism.
Here we do not address these early attached processor systems but focus instead on scalable
parallel systems such as the Cray XD1, SRC-6, and SGI Altix/RASC as well as reconfigurable
Beowulf clusters. These architectures can generally be distinguished by whether each node in
the system is homogeneous (uniform) or heterogeneous (nonuniform) [2]. A uniform node in this
context contains one type of processing element, for example, only microprocessors or only
FPGAs. Based on this distinction, modern HPRCs can be grouped into two major classes: uniform
node nonuniform systems (UNNSs) and nonuniform node uniform systems (NNUSs).

Figure 1. In high-performance reconfigurable computers, field-programmable gate arrays serve
as coprocessors to the microprocessors. During execution of the application on the
microprocessor, the system invokes an appropriate architecture in the FPGA to execute the
target operation.

Figure 2. Modern HPRCs can be grouped into two major classes: (a) uniform node nonuniform
systems (UNNSs) and (b) nonuniform node uniform systems (NNUSs). In (a), microprocessor (µP)
nodes and reconfigurable processor (RP) nodes attach separately to the interconnection
network (IN) and/or GSM; in (b), each node attached to the IN and/or GSM holds both µPs and
RPs.

Uniform node nonuniform systems
In UNNSs, shown in Figure 2a, nodes strictly have either FPGAs or microprocessors and are
linked via an interconnection network to globally shared memory (GSM). Examples of such
systems include the SRC-6 and the Altix/RASC. The major advantage of UNNSs is that vendors
can vary the ratio of reconfigurable nodes to microprocessor nodes to meet the different
demands of customers’ applications. This is highly desirable from an economic perspective
given the cost difference between FPGAs and microprocessors, and it is particularly suitable
for special-purpose systems.
On the downside, having the reconfigurable node and the microprocessor node interact over the
shared interconnection network makes them compete for overall bandwidth, and it also
increases the latency between the nodes. In addition, code portability could become an issue
even within the same type of machine if there is a change in the ratio between the
microprocessor nodes and the FPGA nodes.
A representative example of the UNNS is the SRC-6/SRC-7, which consists of one or more
general-purpose microprocessor subsystems, one or more MAP reconfigurable subsystems, and
global common memory (GCM) nodes of shared memory space. These subsystems are interconnected
through a Hi-Bar switch communication layer. The microprocessor boards each include two
2.8-GHz Intel Xeon microprocessors and are connected to the Hi-Bar switch through a SNAP
interface. The SNAP card plugs into the dual in-line memory module (DIMM) slot on the
microprocessor motherboard to provide higher data transfer rates between the boards than the
less efficient but common peripheral component interconnect (PCI) solution. The sustained
transfer rate between a microprocessor board and the MAP processors is 1,400 MBps.
The MAP Series C processor consists of one control FPGA and two user FPGAs, all Xilinx Virtex
II-6000-4s. Additionally, each MAP unit contains six interleaved banks of onboard memory
(OBM) with a total capacity of 24 Mbytes. The maximum aggregate data transfer rate among all
FPGAs and OBM is 4,800 MBps. The user FPGAs are configured such that one is in master mode
and the other is in slave mode. A bridge port directly connects a MAP’s two FPGAs. Further,
MAP processors can be connected via a chain port to create an FPGA array.

Nonuniform node uniform systems
NNUSs, shown in Figure 2b, use only one type of node, so the system level is uniform.
However, each node contains both types of resources, and the FPGAs are connected directly to
the microprocessors inside the node.
Examples of such systems are the Cray XD1 and reconfigurable clusters. NNUSs’ main drawback
is their fixed ratio of FPGAs to microprocessors, which might not suit the traditional
vendor-buyer economic model. However, they cater in a straightforward way to the
single-program, multiple-data (SPMD) model that most parallel programming paradigms embrace.
Further, the latency between the microprocessor and its FPGA coprocessor can be low, and the
bandwidth between them will be dedicated, which can mean high performance for many
data-intensive applications.
A representative example of the NNUS is the Cray XD1, whose direct-connected processor (DCP)
architecture harnesses multiple processors into a single, unified system. The base unit is a
chassis, with up to 12 chassis per cabinet. One chassis houses six compute cards, each of
which contains two 2.4-GHz AMD Opteron microprocessors and one or two RapidArray Processors
(RAPs) that handle communication. The two Opteron microprocessors are connected via AMD’s
HyperTransport technology with a bandwidth of 3.2 GBps, forming a two-way symmetric
multiprocessing (SMP) cluster. Each XD1 chassis can be configured with six
application-acceleration processors based on Xilinx Virtex-II Pro or Virtex-4 FPGAs. With two
RAPs per board, a bandwidth of 8 GBps (4 GBps bidirectional) between boards is available via
a RapidArray switch. Half of this switch’s 48 links connect to the RAPs on the compute boards
within the chassis, while the others can connect to other chassis.

NODE-LEVEL ISSUES
We have used the SRC-6E and SRC-6 systems to investigate node-level performance of HPRC
architectures in processing remote sensing [3] and molecular dynamics [4] applications. These
studies included the use of optimization techniques such as pipelining and data transfer
overlapping with computation to exploit the inherent temporal and spatial parallelism of such
applications.

Remote sensing
Hyperspectral dimension reduction [3] is representative of remote sensing applications with
respect to node performance. With FPGAs as coprocessors for the microprocessor, substantial
data in this data-intensive application must move back and forth between the microprocessor
memory and the FPGA onboard memory. While the bandwidth for such transfers is on the order of
GBps, the transfers are an added overhead and represent a challenge on the SRC-6 given the
finite size of its OBM.
This overhead can be avoided altogether through the sharing of memory banks, or the bandwidth
can be increased to take advantage of FPGAs’ outstanding processing speed. Overlapping memory
transfers between these two processing elements with the computations (that is, streaming)
also can help. As Figure 3a shows, such transfers (I/O read and write operations) take only
8 percent of the application execution time on a 1.8-GHz Pentium 4 microprocessor, while the
remaining 92 percent is spent on computations.
As Figure 3b shows, the first-generation SRC-6E achieves a significant speedup over the
microprocessor: 12.08× without streaming and 13.21× with streaming. However, the computation
time is now only 9 percent of the overall execution time. In the follow-up SRC-6, the
bandwidth between the microprocessor and FPGA increases from 380 MBps (sustained) to 1.4 GBps
(sustained). As Figure 3c shows, this system achieves a 24.06× speedup (without streaming)
and a 32.04× speedup (with streaming) over the microprocessor.
These results clearly demonstrate that bandwidth between the microprocessor and the FPGA must
be increased to support more data-intensive applications, an area the third-generation SRC-7
is likely to address. It should be noted, however, that in most HPRCs today, transfers
between the microprocessor and FPGA are explicit, further complicating programming models.
These two memory subsystems should either be fused into one or integrated into a hierarchy
with the objective of reducing or eliminating this overhead and making the transfers
transparent.

Figure 3. Execution profiles of hyperspectral dimension reduction, split into I/O read,
computation, and I/O write. (a) 1.8-GHz Pentium 4 microprocessor: total execution time is
20.21 sec, with computation at 92 percent and the two I/O phases at 3 and 5 percent. (b)
SRC-6E (P3): total execution time is 1.67 sec, with computation at 9 percent and the two I/O
phases at 58 and 33 percent; speedup is 12.08× without streaming and 13.21× with streaming.
(c) SRC-6: total execution time is 0.84 sec, with profile shares of 50, 25, and 25 percent;
speedup is 24.06× without streaming and 32.04× with streaming.

Molecular dynamics
Nanoscale molecular dynamics (NAMD) [4] is representative of floating-point applications with
respect to node performance. A recent case study revealed that when porting such highly
optimized code, a sensible approach is to use several design iterations, starting with
the simplest, most straightforward implementation and
gradually adding to it until achieving the best solution
or running out of FPGA resources [5].
`The study’s final dual-FPGA-based implementation
`was only three times faster than the original code execu-
`tion. These results, however, are data dependent. For a
`larger cutoff radius, the original CPU code executes in
`more than 800 seconds while the FPGA execution time
`is unchanged, which would constitute a 260× speedup.
`The need to translate data between the C++ data storage
`mechanisms and the system-defined MAP/FPGA data
`storage architecture required considerable development
`effort. When creating code from scratch to run on an
`FPGA architecture, a programmer would implement
`the data storage mechanisms compatible between the
`CPU and FPGA from the beginning,
`but this is rarely the case for exist-
`ing code and adds to the amount of
`work required to port the code.
`Although the “official bench-
`mark” kernel employs double-pre-
`cision floating-point arithmetic, the
`NAMD researchers applied algo-
`rithmic optimization techniques
`and implemented their kernel using
`single-precision floating-point arith-
`metic for atom locations and 32-bit
`integer arithmetic for forces. Consequently, the final
`design occupies most available slices (97 percent), yet
`utilization of on-chip memory banks (40 percent) and
`hardware multipliers (28 percent) is low. The fact that the
`slice limit was reached before any other resource limits
`suggests that it might be necessary to restructure code to
`better utilize other available resources. One possible solu-
`tion is to overlap calculations with data transfer for the
`next data set to use more available on-chip memory.
`Despite the relatively modest speedup achieved, the
`NAMD study clearly illustrates the potential of HPRC
`technology. FPGA code development traditionally begins
`with writing code that implements a textbook algorithm,
`with little or no optimization. When porting such unop-
`timized code to an HPRC platform and taking care to
`optimize the FPGA design, it is easy to obtain a 10×-
`100× speedup. In contrast, we began with decade-old
`code optimized to run on the CPU-based platform; such
code successfully competes with its FPGA-ported counterpart.
It is important to keep in mind that the study’s
100-MHz FPGA achieved a 3× application performance
improvement over a 2.8-GHz CPU, and FPGAs are on a
faster technology growth curve than CPUs [6].
`
`Lessons learned
`Optimization techniques such as overlapping data
`transfers between the microprocessors and FPGAs with
`computations are useful for data-intensive, memory-
bound applications. However, such applications, including
hyperspectral dimension reduction and NAMD, can
`only achieve good performance when the underlying
`HPRC architecture supports features such as streaming
`or overlapping. Streaming can be enabled by architectures
`that are characterized by high I/O bandwidth and/or tight
`coupling of FPGAs with associated microprocessors. New
`promising examples of these are DCP architectures such
`as AMD’s Torrenza initiative for HyperTransport links
`as well as Intel’s QuickAssist technology supporting front
`side bus (FSB) systems. Large enough memory bandwidth
`is another equally important feature.
By memory bandwidth we mean that the memory system has sufficient multiplicity as well as
speed, width, or depth/size. In other words, because FPGAs can produce and consume data at a
high degree of parallelism, the associated memory system should have an equal degree of
multiplicity. Simply put, a large number of narrow memory banks local to the FPGA can be more
useful to memory-bound applications on HPRCs than larger and wider memories with fewer
parallel banks.
`In addition, further node architec-
`ture developments are clearly neces-
`sary to support programming mod-
`els with transparent transfers of data between FPGAs
`and microprocessors by integrating the microprocessor
`memory and the FPGA memory into the same hierar-
`chy. Vendor-provided transparent transfers can enhance
`performance by guaranteeing the most efficient transfer
`modes for the underlying platform. This will let the user
`focus on algorithmic optimizations that can benefit the
`application under investigation rather than data trans-
`fers or distribution. It also can improve productivity.
`
SYSTEM-LEVEL ISSUES
We have used the SRC-6 and Cray XD1 systems to investigate system-level performance of HPRC
architectures in bioinformatics [7] and cryptanalysis [8-10] applications. These applications
provide a near-practical upper bound on HPRC potential performance as well as insight into
system-level programmability and performance issues apart from those associated with general
high-performance computers. They use integer arithmetic, an area where HPRCs excel; they are
compute-intensive, with many computations and little data transfer between the FPGAs and
microprocessors; and they exhibit both spatial and temporal parallelism.
We distributed the workload of both types of applications over all nodes using the message
passing interface (MPI). In the case of DNA and protein analysis, we broadcast a database of
reference sequences and scattered sequence queries. The application identified matching
scores locally and then gathered them together. Each FPGA had as many hardware kernels for
the basic operation as possible. In the case of cryptanalysis, we broadcast the ciphertext as
well as the corresponding plaintext; upon finding the key, a worker node sent it back to the
master to terminate the search.

Figure 4. DNA and protein sequencing on the SRC-6 and Cray XD1 versus the open source FASTA
program. An FPGA with one engine produced a 91× speedup, while eight cores on the same chip
collectively achieved a 695× speedup. Throughput is in GCUPS; measured values are ranges over
the number of chips used.

Configuration                         Expected   Expected   Measured        Measured
                                      GCUPS      speedup    GCUPS           speedup
FASTA (ssearch34), 2.4-GHz Opteron
  DNA                                 NA         NA         0.065           1
  Protein                             NA         NA         0.130           1
SRC-6, 100 MHz (32x1), 1 to 4 chips
  DNA, 1 engine/chip                  3.2        49.2×      3.19 to 12.2    49 to 188
  DNA, 4 engines/chip                 12.8       197×       12.4 to 42.7    191 to 656
  DNA, 8 engines/chip                 25.6       394×       24.1 to 74      371 to 1,138
  Protein                             3.2        24.6×      3.12 to 11.7    24 to 90
XD1, 200 MHz (32x1), 1 to 6 chips
  DNA, 1 engine/chip                  6.4        98×        5.9 to 32       91 to 492
  DNA, 4 engines/chip                 25.6       394×       23.3 to 120.7   359 to 1,857
  DNA, 8 engines/chip                 51.2       788×       45.2 to 181.6   695 to 2,794
  Protein                             6.4        49×        5.9 to 34       45 to 262
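
The cryptanalysis work distribution described above (every node receives the same plaintext and ciphertext, searches its own share of the key space, and reports a hit to the master) can be sketched in plain Python. This is a minimal sequential simulation under stated assumptions, not the authors' MPI or FPGA code: `toy_encrypt` is a hypothetical 16-bit stand-in for a real cipher, and the helper names `key_ranges`, `search_slice`, and `exhaustive_search` are illustrative only.

```python
from typing import Optional, Tuple

KEY_BITS = 16  # assumption: a toy key size so the search finishes instantly


def toy_encrypt(plaintext: int, key: int) -> int:
    """Hypothetical stand-in for a real cipher: XOR with the key, then rotate left by 5."""
    x = (plaintext ^ key) & 0xFFFF
    return ((x << 5) | (x >> 11)) & 0xFFFF


def key_ranges(key_space: int, workers: int) -> list:
    """Split [0, key_space) into contiguous, nearly equal slices, one per worker."""
    base, extra = divmod(key_space, workers)
    slices, start = [], 0
    for rank in range(workers):
        size = base + (1 if rank < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


def search_slice(keys: range, plaintext: int, ciphertext: int) -> Optional[int]:
    """What one worker does: try every key in its slice against the known pair."""
    for key in keys:
        if toy_encrypt(plaintext, key) == ciphertext:
            return key
    return None


def exhaustive_search(plaintext: int, ciphertext: int,
                      key_space: int, workers: int) -> Optional[Tuple[int, int]]:
    """Sequential stand-in for the parallel run: return (rank, key) for the first hit."""
    for rank, keys in enumerate(key_ranges(key_space, workers)):
        found = search_slice(keys, plaintext, ciphertext)
        if found is not None:
            return rank, found  # in the real system the worker would notify the master
    return None


if __name__ == "__main__":
    pt, secret = 0x1234, 0xBEEF
    print(exhaustive_search(pt, toy_encrypt(pt, secret), 1 << KEY_BITS, workers=6))
```

In an actual MPI program the loop over ranks would disappear: each process would compute only its own slice from its rank, and the early return would become a message to the master.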
`
`Bioinformatics
`Figure 4 compares DNA and protein sequencing on
`the SRC-6 and Cray XD1 with the open source FASTA
`program running on a 2.4-GHz Opteron microproces-
`sor. We used giga cell updates per second (GCUPS) as the
`throughput metric as well as to compute speedup over
`the Opteron. With its FPGA chips running at 200 MHz,
`the XD1 had an advantage over the SRC-6, which could
`run its FPGAs at only 100 MHz.
`By packing eight kernels on each FPGA chip, the Cray
`XD1 achieved a 2,794× speedup using one chassis with
`six FPGAs. An FPGA with one engine produced a 91×
`speedup instead of the expected 98× speedup due to asso-
`ciated overhead such as pipeline latency, resulting in 93
`percent efficiency. On the other hand, eight cores on the
`same chip collectively achieved a 695× speedup instead
`of the expected 788× speedup due to intranode com-
`munication and I/O overhead. The achieved speedup for
`eight engines/chip was 2,794× instead of the estimated
`(ideal) of 4,728× due to MPI internode communications
`overhead, resulting in 59 percent efficiency.
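
The expected and achieved figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the "(32x1)" label in Figure 4 means each engine updates 32 cells per clock cycle; that reading is an inference, chosen because it reproduces the figure's expected throughput values. The function names are illustrative.

```python
# Clock rates and the Opteron DNA baseline as quoted in the text and Figure 4.
CLOCK_HZ = {"SRC-6": 100e6, "XD1": 200e6}
CELLS_PER_CLOCK_PER_ENGINE = 32   # assumption inferred from the "(32x1)" label
FASTA_DNA_GCUPS = 0.065           # measured FASTA throughput on the 2.4-GHz Opteron


def expected_gcups(platform: str, engines_per_chip: int) -> float:
    """Ideal throughput of one chip in giga cell updates per second."""
    return CLOCK_HZ[platform] * CELLS_PER_CLOCK_PER_ENGINE * engines_per_chip / 1e9


def expected_speedup(platform: str, engines_per_chip: int,
                     baseline_gcups: float = FASTA_DNA_GCUPS) -> float:
    """Ideal speedup of one chip over the software baseline."""
    return expected_gcups(platform, engines_per_chip) / baseline_gcups


def efficiency(measured_speedup: float, ideal_speedup: float) -> float:
    """Fraction of the ideal speedup that was actually achieved."""
    return measured_speedup / ideal_speedup


if __name__ == "__main__":
    one = round(expected_speedup("XD1", 1))      # one engine on one chip
    eight = round(expected_speedup("XD1", 8))    # eight engines on one chip
    chassis = 6 * eight                          # six chips, ideal scaling
    print(f"1 engine:  expected {one}x, efficiency {efficiency(91, one):.0%}")
    print(f"8 engines: expected {eight}x, efficiency {efficiency(695, eight):.0%}")
    print(f"6 chips:   expected {chassis}x, efficiency {efficiency(2794, chassis):.0%}")
```

The same two functions apply to the SRC-6 rows of Figure 4 by passing `"SRC-6"` as the platform.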
These results demonstrate that, with FPGAs’ remarkable speed, overhead such as internode and
intranode communication must be at much lower levels in HPRCs than what is accepted in
conventional high-performance computers. However, given the speed of HPRCs, very large
configurations might not be needed.
`
`Cryptanalysis
`The cryptanalysis results, shown in Tables 1 and 2, are
`even more encouraging, especially since this application
`has even lower overhead. With the Data Encryption Stan-
`dard (DES) cipher, the SRC-6 achieved a 6,757× speedup
`over the microprocessor—again, a 2.4-GHz Opteron—
`while the Cray XD1 achieved a 12,162× speedup. The
`application’s scalability is almost ideal.
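
The speedups in Tables 1 and 2 follow directly from the hardware and software throughput columns. The sketch below recomputes them from the tabulated values; the dictionary layout and function names are illustrative. It also divides hardware throughput by the number of search engines, which gives 100 M keys/s per engine on the SRC-6 and 200 M keys/s on the Cray XD1. Those values equal the FPGA clock rates quoted in the bioinformatics section, which suggests (as an inference, not a statement from the article) that each engine tests one key per clock cycle.

```python
# (search engines, hardware M keys/s, software M keys/s), copied from Tables 1 and 2.
RESULTS = {
    "SRC-6": {
        "DES": (40, 4000.0, 0.592),
        "IDEA": (16, 1600.0, 2.498),
        "RC5-32/12/16": (4, 400.0, 0.351),
        "RC5-32/8/8": (8, 800.0, 0.517),
    },
    "Cray XD1": {
        "DES": (36, 7200.0, 0.592),
        "IDEA": (30, 6000.0, 2.498),
        "RC5-32/8/8": (6, 1200.0, 0.517),
    },
}


def speedup(platform: str, cipher: str) -> float:
    """Hardware throughput divided by single-engine software throughput."""
    _, hardware, software = RESULTS[platform][cipher]
    return hardware / software


def per_engine_throughput(platform: str, cipher: str) -> float:
    """Hardware throughput of one search engine, in M keys/s."""
    engines, hardware, _ = RESULTS[platform][cipher]
    return hardware / engines


if __name__ == "__main__":
    for platform, ciphers in RESULTS.items():
        for cipher in ciphers:
            print(f"{platform:9s} {cipher:13s} "
                  f"speedup {speedup(platform, cipher):8.0f}x, "
                  f"{per_engine_throughput(platform, cipher):.0f} M keys/s per engine")
```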
In the case of the Cray XD1, a straightforward MPI application used all nodes. However, it
made sense for the node program to run on only one microprocessor and its FPGA; the other
microprocessors on each node were not used. On the SRC-6, MPI processes had to run on the
microprocessors, and the system had to establish an association between each microprocessor
and a MAP processor. Because the SRC-6 was limited to two network interface cards that could
not be shared efficiently, two MPI processes were sufficient. This meant the program could
only run on one microprocessor and one MAP processor.
`
Table 1. Secret-key cipher cryptanalysis on SRC-6.

                          Hardware                   Software
Application               Search    Throughput       Search    Throughput    Speedup
                          engines   (keys/s)         engines   (keys/s)
DES breaking              40        4,000 M          1         0.592 M       6,757×
IDEA breaking             16        1,600 M          1         2.498 M       641×
RC5-32/12/16 breaking     4         400 M            1         0.351 M       1,140×
RC5-32/8/8 breaking       8         800 M            1         0.517 M       1,547×

DES: Data Encryption Standard. IDEA: International Data Encryption Algorithm.

Table 2. Secret-key cipher cryptanalysis on Cray XD1.

                          Hardware                   Software
Application               Search    Throughput       Search    Throughput    Speedup
                          engines   (keys/s)         engines   (keys/s)
DES breaking              36        7,200 M          1         0.592 M       12,162×
IDEA breaking             30        6,000 M          1         2.498 M       2,402×
RC5-32/8/8 breaking       6         1,200 M          1         0.517 M       2,321×

Lessons learned
Heterogeneity at the system level (namely, UNNS architectures) can be challenging to most
accepted SPMD programming paradigms. This occurs because
`current technology utilizes the reconfigurable processors
`as coprocessors to the main host processor through a
`single unshared communication channel. In particular,
`when the ratio of microprocessors, reconfigurable pro-
`cessors, and their communication channels differs from
`unity, SPMD programs, which generally assume a unity
`ratio, might underutilize some of the microprocessors.
`On the other hand, heterogeneity at the node level does
`not present a problem for such programs.
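
The underutilization argument above can be illustrated with a small model. It is a sketch under a simplifying assumption that is not taken from the article's measurements: each SPMD process needs exactly one microprocessor, one reconfigurable processor, and one communication channel, so the scarcest resource limits how many processes can run. The function name and the example configuration are illustrative.

```python
def spmd_utilization(microprocessors: int,
                     reconfigurable_processors: int,
                     channels: int) -> float:
    """Fraction of microprocessors kept busy when every SPMD process needs
    one microprocessor, one reconfigurable processor, and one channel."""
    if microprocessors <= 0:
        raise ValueError("need at least one microprocessor")
    active_processes = min(microprocessors, reconfigurable_processors, channels)
    return active_processes / microprocessors


if __name__ == "__main__":
    # Unity ratio: every microprocessor has its own FPGA and channel.
    print(spmd_utilization(4, 4, 4))
    # A chassis with 12 microprocessors and 6 FPGAs, as in the Cray XD1 description;
    # one channel per FPGA is an assumption made for this example.
    print(spmd_utilization(12, 6, 6))
```

With a unity ratio the model reports full utilization; any scarcer resource pulls the fraction below one, which is the situation the paragraph describes.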
Heterogeneity at the system level is driven by nontechnological factors such as cost savings,
which vendors can achieve by tailoring systems to customers using homogeneous node
architectures. However, this is at least partly offset by the increased difficulty in code
portability. NNUS architectures have the advantage over their UNNS counterparts in this
respect.
`
HPRC PERFORMANCE IMPROVEMENT
To assess the potential of HPRC technology, we exploited the maximum hardware parallelism in
the previously cited studies’ testbeds at both the chip and system levels. For each
application, we filled the chip with as many hardware cores as could run in parallel. We
obtained additional system-level parallelism via parallel programming techniques, using MPI
to partition the overall problem across all available nodes in order to decrease execution
time. After estimating the size of a computer cluster capable of the same level of speedup,
we derived the corresponding cost, power, and size savings that can be achieved by an SRC-6,
Cray XD1, and SGI Altix 4700 with an RC100 RASC module compared with a conventional
high-performance PC cluster.
As Tables 3-5 show, the improvements span several orders of magnitude. In this analysis, a
100× speedup indicates that the HPRC’s cost, power, and size are compared
to those of a 100-processor Beowulf cluster. The estimates
`are very conservative, because when parallel efficiency is
`considered, a 100-processor cluster will likely produce
`a speedup much less than 100×—in other words, we
`assumed the competing cluster to be 100 percent efficient.
`We also assumed that one cluster node consumes about
`220 watts, and that 100 cluster nodes have a footprint of
`6 square feet. Based on actual prices, we estimated the
`cost ratio to be 1:200 in the case of the SRC-6 and 1:100
`in the case of the Cray XD1. The cost reduction is actually
`much larger than the tables indicate when considering the
`systems’ associated power and size.
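
The cost columns of Tables 3 and 4 can be reproduced from the stated assumptions. The sketch below reads the 1:200 and 1:100 cost ratios as "one HPRC costs about as much as 200 (or 100) cluster nodes"; that interpretation is an assumption, adopted because it matches the tabulated cost savings after rounding. The function and constant names are illustrative, and the power and footprint helpers describe only the equivalent cluster, since the article does not state the HPRCs' own power draw or footprint.

```python
# Assumed reading of the cost ratios: HPRC price expressed in cluster nodes.
HPRC_COST_IN_CLUSTER_NODES = {"SRC-6": 200, "Cray XD1": 100}
NODE_POWER_W = 220               # one cluster node, per the text
NODE_FOOTPRINT_SQFT = 6 / 100    # 100 nodes occupy 6 square feet, per the text


def cost_savings(platform: str, speedup: float) -> float:
    """Cost of a 100-percent-efficient cluster with the same speedup, relative to the HPRC."""
    return speedup / HPRC_COST_IN_CLUSTER_NODES[platform]


def equivalent_cluster(speedup: float) -> dict:
    """Size, power, and footprint of the cluster that would match the speedup."""
    return {
        "nodes": speedup,
        "power_kw": speedup * NODE_POWER_W / 1000,
        "footprint_sqft": speedup * NODE_FOOTPRINT_SQFT,
    }


if __name__ == "__main__":
    for platform, speedup in [("SRC-6", 6757), ("Cray XD1", 12162)]:
        cluster = equivalent_cluster(speedup)
        print(f"{platform}: DES cost savings about {cost_savings(platform, speedup):.0f}x; "
              f"equivalent cluster draws {cluster['power_kw']:.0f} kW "
              f"and occupies {cluster['footprint_sqft']:.0f} sq ft")
```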
`These dramatic improvements can be viewed as real-
`istic upper bounds on the promise of HPRC technol-
`ogy because the selected applications are all compute-
`intensive integer applications, a class at which HPRCs
`clearly excel. However, with additional FPGA chip
`improvements in the areas of size and floating-point
`support, and with improved data-transfer bandwidths
`between FPGAs and their external local memory as well
`as between the microprocessor and the FPGA, a much
`wider range of applications can harness similar levels of
`benefits. For example, in the hyperspectral dimension
`reduction study, data transfer improvements between
`the SRC-6E and SRC-6, while using the same FPGA
`chips, almost doubled the speedup.
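
The "almost doubled" claim can be checked against the numbers reported in Figure 3. The sketch below uses only values from the figure and the remote sensing discussion; the dictionary and function names are illustrative. Speedups recomputed from the rounded execution times can differ slightly from the reported ones, so the reported speedups are used for the generation-to-generation ratio.

```python
# Values as reported in Figure 3 and the remote sensing discussion.
TOTAL_TIME_S = {"Pentium 4": 20.21, "SRC-6E": 1.67, "SRC-6": 0.84}
REPORTED_SPEEDUP = {
    "SRC-6E": {"no_streaming": 12.08, "streaming": 13.21},
    "SRC-6": {"no_streaming": 24.06, "streaming": 32.04},
}
SUSTAINED_BANDWIDTH_MBPS = {"SRC-6E": 380, "SRC-6": 1400}


def speedup_from_times(platform: str) -> float:
    """Speedup over the Pentium 4 recomputed from total execution times."""
    return TOTAL_TIME_S["Pentium 4"] / TOTAL_TIME_S[platform]


def generation_gain(mode: str) -> float:
    """How much the SRC-6 improves on the SRC-6E for a given transfer mode."""
    return REPORTED_SPEEDUP["SRC-6"][mode] / REPORTED_SPEEDUP["SRC-6E"][mode]


def bandwidth_gain() -> float:
    """Ratio of sustained microprocessor-to-FPGA bandwidth between generations."""
    return SUSTAINED_BANDWIDTH_MBPS["SRC-6"] / SUSTAINED_BANDWIDTH_MBPS["SRC-6E"]


if __name__ == "__main__":
    print(f"SRC-6 speedup from times: {speedup_from_times('SRC-6'):.2f}x")
    print(f"gain without streaming:   {generation_gain('no_streaming'):.2f}x")
    print(f"gain with streaming:      {generation_gain('streaming'):.2f}x")
    print(f"bandwidth gain:           {bandwidth_gain():.2f}x")
```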
`
Our research revealed that HPRCs can achieve up to four orders of magnitude improvement
in performance, up to three orders of magnitude reduction
`in power consumption, and two orders of magnitude
`savings in cost and size requirements compared with
`contemporary microprocessors when running compute-
`intensive applications based on integer arithmetic.
In general, these systems were less successful in processing applications based on
floating-point arithmetic, especially double precision, whose high usage of FPGA resources
constitutes an upper bound on fine-grained parallelism for application cores. However, on
embarrassingly parallel floating-point applications they can achieve performance as high as
on integer arithmetic applications, subject to area constraints. FPGA chips will likely
become larger and have more integrated cores that can better support floating-point
operations.
`Our future work will include a comprehensive study
`of software programming tools and languages and their
`impact on HPRC productivity, as well as multitasking/
`multiuser support on HPRCs. Because porting applica-
`tions from one machine to another, or even to the same
`machine after a hardware upgrade, is nontrivial, hard-
`ware architectural virtualization and runtime systems
`support for application portability is another good
`research candidate. ■
`
Table 3. Performance improvement of SRC-6 compared with a Beowulf cluster.

Application                   Speedup    Cost savings   Power savings   Size savings
DNA and protein sequencing    1,138×     6×             313×            34×
DES breaking                  6,757×     34×            856×            203×
IDEA breaking                 641×       3×             176×            19×
RC5 breaking                  1,140×     6×             313×            34×

Table 4. Performance improvement of Cray XD1 compared with a Beowulf cluster.

Application                   Speedup    Cost savings   Power savings   Size savings
DNA and protein sequencing    2,794×     28×            148×            29×
DES breaking                  12,162×    122×           608×            127×
IDEA breaking                 2,402×     24×            120×            25×
RC5 breaking                  2,321×     23×            116×            24×

Table 5. Performance improvement of SGI Altix 4700 with RC100 RASC module compared with a
Beowulf cluster.

Application                   Speedup    Cost savings   Power savings   Size savings
DNA and protein sequencing    8,723×     22×            779×            253×
DES breaking                  28,514×    96×            3,439×          1,116×
IDEA breaking                 961×       2×             86×             28×
RC5 breaking                  6,838×     17×            610×            198×

References
1. M. Taher and T. El-Ghazawi, “A Segmentation Model for Partial Run-Time Reconfiguration,”
Proc. IEEE Int’l Conf. Field Programmable Logic and Applications, IEEE Press, 2006, pp. 1-4.
2. T. El-Ghazawi, “Experience with Early Reconfigurable High-Performance Computers,” 2006;
http://hpcl.seas.gwu.edu/talks/Tarek_DATE2006.ppt.
3. S. Kaewpijit, J. Le Moigne, and T. El-Ghazawi, “Automatic Reduction of Hyperspectral
Imagery Using Wavelet Spectral Analysis,” IEEE Trans. Geoscience and Remote Sensing, vol. 41,
no. 4, 2003, pp. 863-871.
4. J.C. Phillips et al., “Scalable Molecular Dynamics with NAMD,” J. Computational Chemistry,
vol. 26, no. 16, 2005, pp. 1781-1802.
5. V. Kindratenko and D. Pointer, “A Case Study in Porting a Production Scientific
Supercomputing Application to a Reconfigurable Computer,” Proc. 14th Ann. IEEE Symp.
Field-Programmable Custom Computing Machines, IEEE CS Press, 2006, pp. 13-22.
6. K. Underwood, “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” Proc. 12th
ACM/SIGDA Int’l Symp. Field Programmable Gate Arrays, ACM Press, 2004, pp. 171-180.
7. D.W. Mount, Bioinformatics: Sequence and Genome Analysis, 2nd ed., Cold Spring Harbor
Laboratory Press, 2004.
8. O.D. Fidanci et al., “Implementation Trade-Offs of Triple DES in the SRC-6E
Reconfigurable Computing Environment,” Proc. 5th Ann. Int’l Conf. Military and Aerospace
Programmable Logic Devices, 2002; www.gwu.edu/~hpc/rcm/publications/MAPLD2002.pdf.
9. R.L. Rivest, “The RC5 Encryption Algorithm,” revised version, MIT Laboratory for Computer
Science, Cambridge, Mass., 20 Mar. 1997; http://people.csail.mit.edu/rivest/Rivest-rc5rev.pdf.
10. A. Michalski, K. Gaj, and T. El-Ghazawi, “An Implementation Comparison of an IDEA
Encryption Cryptosystem on Two General-Purpose Reconfigurable Computers,” Proc. 13th Ann.
Conf. Field-Programmable Logic and Applications, LNCS 2778, Springer, 2003, pp. 204-219.
`
`
`
`Tarek El-Ghazawi is a professor in the Department of
`Computer and Electrical Engineering, a founder of the
`High-Performance Computing Lab (HPCL) at George
`Washington University, and cofounder of the NSF Cen-
`ter for High-Performance Reconfigurable Computing
`(CHREC). His research interests include high-perfor-
`mance computing, parallel computer architectures, high-
`performance I/O, reconfigurable computing, experi-
`mental performance evaluations, computer vision, and
`remote sensing. El-Ghazawi received a PhD in electrical
`and computer engineering from New Mexico State Uni-
`versity. He is a senior member of the IEEE and a member
`of the ACM. Contact him at tarek@gwu.edu.
`
`Esam El-Araby is a doctoral student in the Department
`of Computer and Electrical Engineering and a research
`assistant in the HPCL at George Washington University.
`His research interests include reconfigurable computing,
`hybrid architectures, evolvable hardware, performance
`evaluation, digital signal/image processing, and hyper-
`spectral remote sensing. El-Araby received an MSc in
`computer engineering from the George Washington Uni-
`versity. Contact him at esam@gwu.edu.
`
`Miaoqing Huang is a doctoral student in the Department of
`Computer and Electrical Engineering and a research assis-
`tant in the HPCL at George Washington University. His
`research interests include reconfigurable computing, high-
`performance computing architectures, cryptography, image
`processing, and computer arithmetic. Huang received a BS
`in electronics and information systems from Fudan Univer-
`sity, Shanghai