`EURASIP Journal on Embedded Systems
`Volume 2007, Article ID 93652, 8 pages
`doi:10.1155/2007/93652
`
`Research Article
`Examining the Viability of FPGA Supercomputing
`
`Stephen Craven and Peter Athanas
`
`Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University,
`Blacksburg, VA 24061, USA
`
`Received 16 May 2006; Revised 6 October 2006; Accepted 16 November 2006
`
`Recommended by Marco Platzner
`
`For certain applications, custom computational hardware created using field programmable gate arrays (FPGAs) can produce
`significant performance improvements over processors, leading some in academia and industry to call for the inclusion of FPGAs
`in supercomputing clusters. This paper presents a comparative analysis of FPGAs and traditional processors, focusing on floating-
`point performance and procurement costs, revealing economic hurdles in the adoption of FPGAs for general high-performance
`computing (HPC).
`
`Copyright © 2007 S. Craven and P. Athanas. This is an open access article distributed under the Creative Commons Attribution
`License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
`cited.
`
1. INTRODUCTION
`
`Supercomputers have experienced a resurgence, fueled by
`government research dollars and the development of low-
`cost supercomputing clusters constructed from commodity
`PC processors. Recently, interest has arisen in augmenting
`these clusters with programmable logic devices, such as FP-
`GAs. By tailoring an FPGA’s hardware to the specific task at
`hand, a custom coprocessor can be created for each HPC ap-
`plication.
`A wide body of research over two decades has repeat-
`edly demonstrated significant performance improvements
`for certain classes of applications through hardware accelera-
`tion in an FPGA [1]. Applications well suited to acceleration
`by FPGAs typically exhibit massive parallelism and small in-
`teger or fixed-point data types. Significant performance gains
`have been described for gene sequencing [2, 3], digital filter-
`ing [4], cryptography [5], network packet filtering [6], target
`recognition [7], and pattern matching [8].
`These successes have led SRC Computers [9], DRC Com-
`puter Corp. [10], Cray [11], Starbridge Systems [12], and SGI
`[13] to offer clusters featuring programmable logic. Cray’s
`XD1 architecture, characteristic of many of these systems,
`integrates 12 AMD Opteron processors in a chassis with six
`large Xilinx Virtex-4 FPGAs. Many systems feature some of
`the largest FPGAs in production.
Many HPC applications and benchmarks require double-precision
floating-point arithmetic to support a large dynamic range and
ensure numerical stability. Floating-point arithmetic is so
prevalent that LINPACK, the benchmarking application used to rank
supercomputers, heavily utilizes double-precision floating-point
math. Due to this prevalence, research in academia and industry
has focused on floating-point hardware designs [14, 15],
libraries [16, 17], and development tools [18] to effectively
perform floating-point math on FPGAs. The strong suit of FPGAs,
however, is low-precision fixed-point or integer arithmetic, and
no current device families contain dedicated floating-point
operators, though dedicated integer multipliers are prevalent.
FPGA vendors tailor their products toward their dominant
customers, driving development of architectures proficient at
digital signal processing, network applications, and embedded
computing. None of these domains demands floating-point
performance.
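The numerical-stability argument for double precision can be illustrated with a small sketch. It is not taken from the paper: the step value, the iteration count, and the helper names `to_float32` and `accumulate` are illustrative assumptions. It emulates single-precision accumulation by rounding the running sum to 32 bits after every addition and compares the drift with a double-precision sum.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (an IEEE 754 double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, count: int, single: bool) -> float:
    """Add `step` to a running total `count` times.

    When `single` is True, the step and the total are rounded to single
    precision after every addition, emulating a 32-bit accumulator.
    """
    if single:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_float32(total)
    return total


COUNT = 100_000          # illustrative iteration count (assumption)
EXACT = COUNT / 10       # the mathematically exact sum of COUNT steps of 0.1

err_sp = abs(accumulate(0.1, COUNT, single=True) - EXACT)
err_dp = abs(accumulate(0.1, COUNT, single=False) - EXACT)
print(f"single-precision accumulated error: {err_sp:.3e}")
print(f"double-precision accumulated error: {err_dp:.3e}")
```

The gap between the two errors grows with the length of the accumulation, which is one reason long-running HPC kernels tend to default to the 64-bit format.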
Published reports comparing FPGA-augmented systems to
software-only implementations generally focus solely on
performance. As a key driver in the adoption of any new
technology is cost, the exclusion of a cost-benefit analysis
fails to capture the true viability of FPGA-based supercomputing.
Of two previous works that do incorporate cost into the analysis,
one [19] limits its scope to a single intelligent network
interface design and, while the other [20] presents impressive
cost-performance numbers, details and analysis are lacking.
Furthermore, many comparisons in the literature are ineffective,
as they compare a highly optimized FPGA floating-point
implementation to nonoptimized software. A much better benchmark
would redesign the algorithm to play to the FPGA’s strengths,
comparing the design’s performance to that of an optimized
program.

Table 1: Published FPGA supercomputing application results.

Application        Platform          Format    Speedup
DGEMM [21]         SRC-6             DP        0.9x
Boltzmann [22]     XC2VP70           Float     1x
Dynamics [23]      SRC-6E            SP        2x
Dynamics [24]      SRC-6E            SP        3x
Dynamics [25]      SRC-6E            Float     3.8x
MATPHOT [26]       SRC               DP        8.5x
Filtering [27]     SRC-6E            Fixed     14x
Translation [28]   SRC-6             Integer   75x
Matching [29]      SRC-6/Cray XD1    Bit       256x/512x
Crypto [30]        SRC-6E            Bit       1700x

`The key contributions of this paper are the addition of an
`economic analysis to a discussion of FPGA supercomputing
`projects and the presentation of an effective benchmark for
`comparing FPGAs and processors on an equal footing. A sur-
`vey of current research, along with a cost-performance anal-
`ysis of FPGA floating-point implementations, is presented in
`Section 2. Section 3 describes alternatives to floating-point
`implementations in FPGAs, presenting a balanced bench-
`mark for comparing FPGAs to processors. Finally, conclu-
`sions are presented in Section 4.
`
`2. FPGA SUPERCOMPUTING TRENDS
`
`This section presents an overview of the use of FPGAs in su-
`percomputers, analyzing the reported performance enhance-
`ments from a cost perspective.
`
`2.1. HPC implementations
`
`The availability of high-performance clusters incorporating
`FPGAs has prompted efforts to explore acceleration of HPC
`applications. While not an exhaustive list, Table 1 provides
`a survey of recent representative applications. The SRC-6
`and 6E combine two Xeon or Pentium processors with two
`large Virtex-II or Virtex-II Pro FPGAs. The Cray XD1 places
`a Virtex-4 FPGA on a special interconnect system for low-
`latency communication with the host Opteron processors.
`In the table, the applications are listed by performance.
`The abbreviations SP and DP refer to single-precision
`and double-precision floating point, respectively. While the
`speedups provided in the table are not normalized to a com-
`mon processor, a trend is clearly visible. The top six examples
`all incorporate floating-point arithmetic and fare worse than
`the applications that utilize small data widths.
With no cost information regarding the SRC-6 or Cray XD1
available to the authors, a thorough cost-performance analysis is
not possible. However, as the cost of the FPGA acceleration
hardware in these machines alone is likely on the order of
US$10 000 or more, the floating-point examples may lose some of
their appeal when compared to processors on a cost-performance
basis. The observed speedups of 75–1700 for integer and bit-level
operations, on the other hand, would likely be very beneficial
from a cost perspective.
`
`2.2. Theoretical floating-point performance
`
`FPGA designs may suffer significant performance penalties
`due to memory and I/O bottlenecks. To understand the po-
`tential of FPGAs in the absence of bottlenecks, it is instructive
`to consider the theoretical maximum floating-point perfor-
`mance of an FPGA.
`Traditional processors, with a fixed data path width of
`32 or 64 bits, provide no incentive to explore reduced pre-
`cision formats. While FPGAs permit data path width cus-
`tomization, some in the HPC community are loath to utilize
`a nonstandard format owing to verification and portability
`difficulties. This principle is at the heart of the Top500 List
`of fastest supercomputers [31], where ranked machines must
`exactly reproduce valid results when running the LINPACK
`benchmarks. Many applications also require the full dynamic
`range of the double-precision format to ensure numeric sta-
`bility.
`Due to the prevalence of IEEE standard floating-point
`in a wide range of applications, several researchers have de-
`signed IEEE 754 compliant floating-point accelerator cores
`constructed out of the Xilinx Virtex-II Pro FPGA’s config-
`urable logic and dedicated integer multipliers [32–34]. Dou
`et al. published one of the highest performance benchmarks
`of 15.6 GFLOPS by placing 39 floating-point processing el-
`ements on a theoretical Xilinx XC2VP125 FPGA [14]. Inter-
`polating their results for the largest production Xilinx Virtex-
`II Pro device, the XC2VP100, produces 12.4 GFLOPS, com-
`pared to the peak 6.4 GFLOPS achievable for a 3.2 GHz Intel
`Pentium processor. Assuming that the Pentium can sustain
`50% of its peak, the FPGA outperforms the processor by a
`factor of four for matrix multiplication.
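The arithmetic behind these figures can be reproduced with a short sketch. It assumes that each MAC element contributes two floating-point operations (one multiply, one add) per clock cycle and that the Pentium peaks at two floating-point operations per cycle; both assumptions are consistent with the quoted numbers but are not stated in this paragraph. The count of 31 elements on the XC2VP100 is the figure the paper gives later in this section.

```python
def mac_array_gflops(units: int, clock_mhz: float) -> float:
    """Peak GFLOPS of an array of MAC units, counting 2 operations per unit per cycle (assumption)."""
    return units * 2 * clock_mhz / 1000.0


def processor_gflops(clock_ghz: float, flops_per_cycle: int, sustained: float) -> float:
    """Sustained GFLOPS of a processor given its clock, peak operations per cycle, and sustained fraction."""
    return clock_ghz * flops_per_cycle * sustained


xc2vp125 = mac_array_gflops(39, 200)            # 39 elements at 200 MHz
xc2vp100 = mac_array_gflops(31, 200)            # 31 elements at 200 MHz
pentium_peak = processor_gflops(3.2, 2, 1.0)    # 3.2 GHz, 2 flops/cycle (assumption)
pentium_sustained = processor_gflops(3.2, 2, 0.5)

print(f"XC2VP125: {xc2vp125:.1f} GFLOPS")
print(f"XC2VP100: {xc2vp100:.1f} GFLOPS")
print(f"Pentium peak: {pentium_peak:.1f} GFLOPS, sustained: {pentium_sustained:.1f} GFLOPS")
print(f"FPGA advantage: {xc2vp100 / pentium_sustained:.2f}x")
```

Under these assumptions the ratio comes out slightly below four, which the text rounds to a factor of four.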
Dou et al.’s design comprises a linear array of MAC elements,
linked to a host processor providing memory access. The design is
pipelined to a depth of 12, permitting operation at a frequency
up to 200 MHz. This architecture enables high computational
density by simplifying routing and control, at the cost of
requiring a host controller. Since the results of Dou et al. are
superior to other published results, and even to Xilinx’s
floating-point cores, they are taken as an absolute upper limit
on FPGA double-precision floating-point performance. Performance
in any deployed system would be lower because of the addition of
interface logic.
Table 2 extrapolates Dou et al.’s performance results for other
FPGA device families. Given the similar configurable logic
architectures between the different Xilinx families, it has been
assumed that Dou et al.’s requirements of 1419 logic slices and
nine dedicated multipliers hold for all families. While the slice
requirements may be less for the Virtex-4 family, owing to the
inclusion of an MAC function with the dedicated multipliers, all
considered Virtex-4 implementations were multiplier limited, so
the overestimate in required slices does not affect the results.
The clock frequency has been scaled by a factor obtained by
averaging the performance differential of Xilinx’s
double-precision floating-point multiplier and adder cores [35]
across the different families.

Table 2: Double-precision floating-point multiply accumulate
cost-performance in US dollars.

Device               Speed (MHz)    GFlops        Device cost    $/GFlops
xc4vlx200            280            5.6           $7010          $1,250
xc4vsx35             280            5.6           $542           $97
xc2vp100-7           200            12.4          $9610          $775
xc2vp100-6           180            11.2          $6860          $613
xc2vp70-6            180            8.3           $2780          $334
xc2vp30-6            180            3.2           $781           $244
xc3s5000-5           140            3.1           $242           $78
xc3s4000-5           140            2.8           $164           $59
ClearSpeed CSX 600   N/A            50 [36]       $7500 [37]     $150
Pentium 630          3000           3             $167           $56
Pentium D 920        2800 × 2       5.6           $203           $36
Cell processor       3200 × 9       10 [38]       $230 [39]      $23
System X             2300 × 2200    12 250 [31]   $5.8 M [40]    $473

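The extrapolation described above can be sketched as a resource-limited count of MAC units. The 1419 slices and nine multipliers per unit, the clock rates, and the 444 multipliers of the XC2VP100 come from the paper. The remaining slice and multiplier counts are assumptions recalled from vendor data sheets and should be checked before reuse, as should the assumption of two floating-point operations per MAC unit per cycle.

```python
SLICES_PER_MAC = 1419        # from the text
MULTIPLIERS_PER_MAC = 9      # from the text

# name -> (logic slices, dedicated multipliers, clock in MHz).
# Slice counts and the non-XC2VP100 multiplier counts are ASSUMPTIONS
# recalled from vendor data sheets, not values given in the paper.
DEVICES = {
    "xc2vp100-7": (44096, 444, 200),
    "xc2vp30-6": (13696, 136, 180),
    "xc4vlx200": (89088, 96, 280),
}


def mac_units(slices: int, multipliers: int) -> int:
    """Number of MAC units that fit, limited by whichever resource runs out first."""
    return min(slices // SLICES_PER_MAC, multipliers // MULTIPLIERS_PER_MAC)


def peak_gflops(units: int, clock_mhz: float) -> float:
    """Peak GFLOPS, counting one multiply and one add per unit per cycle (assumption)."""
    return units * 2 * clock_mhz / 1000.0


for name, (slices, multipliers, clock_mhz) in DEVICES.items():
    units = mac_units(slices, multipliers)
    limit = "slices" if slices // SLICES_PER_MAC <= multipliers // MULTIPLIERS_PER_MAC else "multipliers"
    print(f"{name}: {units} MAC units ({limit}-limited), {peak_gflops(units, clock_mhz):.1f} GFLOPS")
```

With these inputs the sketch yields 31, 9, and 10 units, which matches the GFLOPS values listed for the same three devices in Table 2.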
For comparison purposes, several commercial processors have been
included in the list. The peak performance for each processor was
reduced by 50%, taking into account compiler and system
inefficiencies, permitting a fairer comparison as FPGA designs
typically sustain a much higher percentage of their peak
performance than processors. This 50% performance penalty is in
line with the sustained performance seen in the Top500 List’s
LINPACK benchmark [31]. In the table, FPGAs are assumed to
sustain their peak performance.
As can be seen from the table, FPGA double-precision
floating-point performance is noticeably higher than that of
traditional Intel processors. Once the cost of this performance
is considered, however, processors fare better, with the worst
processor beating the best FPGA. In particular, Sony’s Cell
processor is more than two times cheaper per GFLOPS than the best
FPGA. The results indicate that the current generation of larger
FPGAs found on many FPGA-augmented HPC clusters is far from cost
competitive with the current generation of processors for the
double-precision floating-point tasks typical of supercomputing
applications.
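The cost-performance claims can be checked directly against the Table 2 values. The sketch below recomputes dollars per GFLOPS from the listed device cost and GFLOPS columns. Grouping the rows into "fpga" and "processor", and leaving out the ClearSpeed board and System X because their costs cover more than a single device, is an assumption made here for illustration.

```python
# (name, kind, device cost in US$, GFLOPS) copied from Table 2.
TABLE2 = [
    ("xc4vlx200", "fpga", 7010, 5.6),
    ("xc4vsx35", "fpga", 542, 5.6),
    ("xc2vp100-7", "fpga", 9610, 12.4),
    ("xc2vp100-6", "fpga", 6860, 11.2),
    ("xc2vp70-6", "fpga", 2780, 8.3),
    ("xc2vp30-6", "fpga", 781, 3.2),
    ("xc3s5000-5", "fpga", 242, 3.1),
    ("xc3s4000-5", "fpga", 164, 2.8),
    ("Pentium 630", "processor", 167, 3.0),
    ("Pentium D 920", "processor", 203, 5.6),
    ("Cell processor", "processor", 230, 10.0),
]


def dollars_per_gflops(cost: float, gflops: float) -> float:
    """Device cost divided by the GFLOPS figure used in the table."""
    return cost / gflops


def extreme(kind: str, pick):
    """Return (dollars per GFLOPS, name) chosen by `pick` (min or max) among rows of one kind."""
    rows = [(dollars_per_gflops(cost, gflops), name)
            for name, row_kind, cost, gflops in TABLE2 if row_kind == kind]
    return pick(rows)


best_fpga = extreme("fpga", min)
worst_processor = extreme("processor", max)
best_processor = extreme("processor", min)

print(f"best FPGA:       {best_fpga[1]} at ${best_fpga[0]:.0f}/GFLOPS")
print(f"worst processor: {worst_processor[1]} at ${worst_processor[0]:.0f}/GFLOPS")
print(f"best processor:  {best_processor[1]} at ${best_processor[0]:.0f}/GFLOPS")
print(f"best FPGA costs {best_fpga[0] / best_processor[0]:.1f}x the best processor per GFLOPS")
```

The margin between the worst processor and the best FPGA is only a few dollars per GFLOPS, so that particular ordering is sensitive to the device prices used.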
`With two exceptions, ClearSpeed and System X, all costs
`in Table 2 only cover the price of the device not including
`other components (motherboard, memory, network, etc.)
`that are necessary to produce a functioning supercomputer.
`It is also assumed here that operational costs are equiva-
`lent. These additional costs are nonnegligible and, while the
`FPGA accelerators would also incur additional costs for cir-
`cuit board and components, it is likely that the cost of com-
`ponents to create a functioning HPC node from a processor,
`even factoring in economies of scale, would be larger than for
creating an accelerator plug-in from an FPGA. However, as
most clusters incorporating FPGAs also include a host pro-
`cessor to handle serial tasks and communication, it is reason-
`able to assume that the cost analysis in Table 2 favors FPGAs.
`To place the additional component costs in perspec-
`tive, the cost-performance for Virginia Tech’s System X su-
`percomputing cluster has been included [41]. Constructed
`from 1100 dual core Apple XServe nodes, the supercom-
`puter, including the cost of all components, cost US$473 per
`GFLOPS. Several of the larger FPGAs cost more per GFLOPS
`even without the memory, boards, and assembly required to
`create a functional accelerator.
As the dedicated integer multipliers included by Xilinx, the
largest configurable logic manufacturer, are only 18 bits wide,
several multipliers must be combined to produce the 53-bit
significand multiplication (52 stored bits plus the implicit
leading one) needed for double-precision floating-point
multiplication. Xilinx’s double-precision floating-point core
requires 16 of these 18-bit multipliers [35] for each
floating-point multiplier, while the Dou et al. design needs only
nine. For many FPGA device families the high multiplier
requirement limits the number of floating-point multipliers that
may be placed on the device. For example, while 31 of Dou’s MAC
units may be placed on an XC2VP100, the largest Virtex-II Pro
device, the lack of sufficient dedicated multipliers permits only
10 to be placed on the largest Xilinx FPGA, an XC4VLX200. If this
device were used solely as a matrix multiplication accelerator,
as in Dou’s work, over 80% of the device would be unused. Of
course this idle configurable logic could be used to implement
additional multipliers, at a significant performance penalty.
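Two of the numbers in this paragraph can be reproduced with a short sketch. The first function counts the partial products needed when a wide unsigned multiplication is tiled onto narrower multipliers. It assumes that an 18-bit signed multiplier handles 17-bit unsigned operands and that the significand is 53 bits wide, which gives the 16 multipliers quoted for the Xilinx core. How Dou et al. reduce the count to nine is not described here and is not modeled. The second function estimates how much of the XC4VLX200 stays idle; its slice count of 89,088 is an assumption recalled from the vendor data sheet.

```python
def tiled_multipliers(operand_bits: int, unsigned_width: int) -> int:
    """Partial products needed to square-tile an operand_bits x operand_bits unsigned multiply."""
    chunks = -(-operand_bits // unsigned_width)   # ceiling division
    return chunks * chunks


def unused_fraction(mac_units: int, slices_per_unit: int, device_slices: int) -> float:
    """Fraction of a device's slices left idle by the given number of MAC units."""
    return 1.0 - (mac_units * slices_per_unit) / device_slices


# 53-bit significand on multipliers treated as 17-bit unsigned (assumption).
xilinx_core_multipliers = tiled_multipliers(53, 17)

# 10 MAC units of 1419 slices each (from the text) on an XC4VLX200
# with an ASSUMED 89,088 slices.
idle = unused_fraction(10, 1419, 89088)

print(f"multipliers per double-precision multiply: {xilinx_core_multipliers}")
print(f"idle share of XC4VLX200 slices: {idle:.0%}")
```

Under these assumptions the idle share comes out above 80%, in line with the statement in the text.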
`While the larger FPGA devices that are prevalent in com-
`putational accelerators do not provide a cost benefit for the
`double-precision floating-point calculations required by the
`HPC community, historical trends [42] suggest that FPGA
`performance is improving at a rate faster than that of pro-
`cessors. The question is then asked, when, if ever, will FPGAs
`overtake processors in cost performance?
As has been noted by some, the cost of the largest cutting-edge
FPGA remains roughly constant over time, while performance and
size improve. A first-order estimate of US$8,000 has been made
for the cost of the largest and newest FPGA, an estimate
supported by the cost of the largest Virtex-II Pro and Virtex-4
devices. Furthermore, it is assumed that the cost of a processor
remains constant at US$500 over time as well. While these
estimates are somewhat misleading, as these costs certainly do
vary over time, the variability in the cost of computing devices
between generations is much less than the increase in
performance. The comparison further assumes, as before, that
processors can sustain 50% of their peak floating-point
performance while FPGAs sustain 100%. Whenever possible,
estimates were rounded to favor FPGAs.
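With device prices held constant, cost per GFLOPS falls at the same rate that performance rises, so a crossover estimate reduces to intersecting two exponential trends. The sketch below shows that calculation under stated assumptions. The starting costs are derived from figures in the paper (US$8,000 over 12.4 GFLOPS for a large FPGA, US$500 over 3.2 sustained GFLOPS for a processor), but the performance doubling times are purely hypothetical placeholders, not values from the paper or from Underwood's analysis [42].

```python
import math


def cost_per_gflops(start_cost: float, doubling_years: float, years_elapsed: float) -> float:
    """Cost per GFLOPS after `years_elapsed`, if performance doubles every `doubling_years` at fixed price."""
    return start_cost * 2.0 ** (-years_elapsed / doubling_years)


def crossover_years(fpga_cost: float, fpga_doubling: float,
                    cpu_cost: float, cpu_doubling: float):
    """Years until the FPGA trend meets the processor trend, or None if it never does."""
    if fpga_cost <= cpu_cost:
        return 0.0
    rate_gap = 1.0 / fpga_doubling - 1.0 / cpu_doubling
    if rate_gap <= 0:
        return None  # FPGA cost-performance is not improving faster than the processor's
    return math.log2(fpga_cost / cpu_cost) / rate_gap


FPGA_START = 8000 / 12.4   # US$ per GFLOPS, from figures in the paper
CPU_START = 500 / 3.2      # US$ per GFLOPS, from figures in the paper
FPGA_DOUBLING = 1.0        # years, HYPOTHETICAL
CPU_DOUBLING = 1.5         # years, HYPOTHETICAL

years = crossover_years(FPGA_START, FPGA_DOUBLING, CPU_START, CPU_DOUBLING)
print(f"crossover after {years:.1f} years under the assumed doubling times")
print(crossover_years(FPGA_START, 2.0, CPU_START, CPU_DOUBLING))  # slower FPGA trend: None
```

The second call illustrates the sensitivity of the result: if the FPGA trend improves no faster than the processor trend, the curves never meet.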
Two sources of data were used for performance extrapolation to
increase the validity of the results. The work of Dou et al.
[14], representing the fastest double-precision floating-point
MAC design, was extrapolated to the largest parts in several
Xilinx device families. Additional data was obtained by
extrapolating the results of Underwood’s historical analysis [42]
to include the Virtex-4 family. Underwood’s data came from his
IEEE standard floating-point designs pipelined, depending on the
device, to a maximum depth of 34. The results are shown in
Figure 1(a) for the Underwood data and Figure 1(b) for Dou et al.

[Figure 1 plots, panels (a) and (b): cost per GFLOPS in US dollars
(logarithmic axis, 10 to 10000) versus year (2000 to 2010).
Series: FPGAs; Processors; Extrapolation FPGA w/o Virtex-4;
Extrapolation FPGA; Extrapolation processor.]

Figure 1: Extrapolated double-precision floating-point MAC
cost-performance, in US dollars, for: (a) Underwood design and
(b) Dou et al. design.

An additional data point exists for the Underwood graph as his
work included results for the Virtex-E FPGAs. The Dou et al.
design is higher performance and smaller, in terms of slices,
than Underwood’s design. In both graphs, the latest data point,
representing the largest Virtex-4 device, displays worse
cost-performance than the previous generation of devices. This is
due to the shortage of dedicated multipliers on the larger
Virtex-4 devices. The Virtex-4 architecture comprises three
subfamilies: the LX, SX, and FX. The Virtex-4 subfamily with the
largest devices, by far, is the LX and it is these devices that
are found in FPGA-augmented HPC nodes. However, the LX subfamily
is focused on logic density, trading most of the dedicated
multipliers found in the smaller SX subfamily for configurable
logic. This significantly reduces the floating-point
multiplication performance of the larger Virtex-4 devices.
As the graphs illustrate, if this trend towards logic-centric
large FPGAs continues, it is unlikely that the largest FPGAs will
be cost effective compared to processors anytime soon, if ever.
However, as preliminary data on the next-generation Virtex-5
suggests that the relatively poor floating-point performance of
the Virtex-4 is an aberration and not indicative of a trend in
FPGA architectures, it seems reasonable to reconsider the results
excluding the Virtex-4 data points. The Figure 1 trend lines
labeled “FPGA extrapolation w/o Virtex-4” exclude these
potentially misleading data points.
When the Virtex-4 data is ignored, the cost-performance of FPGAs
for double-precision floating-point matrix multiplication
improves at a rate greater than that for processors. While there
is always a danger in drawing conclusions from a small data set,
both the Dou et al. and Underwood design results point to a
crossover sometime around 2009 to 2012, when the largest FPGA
devices, like those typically found in commercial FPGA-augmented
HPC clusters, will be cost effective compared to processors for
double-precision floating-point calculations.

2.3. Tools

The typical HPC user is a scientist, researcher, or engineer
desiring to accelerate some scientific application. These users
are generally acquainted with a programming language appropriate
to their fields (C, FORTRAN, MATLAB, etc.) but have little, if
any, hardware design knowledge. Many have noted the requirement
of high-level development environments to speed acceptance of
FPGA-augmented clusters. These development tools accept a
description of the application written in a high-level language
(HLL) and automate the translation of appropriate sections of
code into hardware. Several companies market HLL-to-gates
synthesizers to the HPC community, including Impulse Accelerated
Technologies, Celoxica, and SRC.
The state of these tools, however, as noted by some [43], does
not remove the need for dedicated hardware expertise. Hardware
debugging and interfacing still must occur. The use of automatic
translation also drives up development costs compared to software
implementations. C compilers and debuggers are free. Electronic
design automation tools, on the other hand, may require expensive
yearly licenses. Furthermore, the added inefficiencies of
translating an inherently sequential high-level description into
a parallel hardware implementation eat into the performance of
hardware accelerators.
`
`3. FLOATING-POINT ALTERNATIVES
`
`3.1. Nonstandard data formats
`
`The use of IEEE standard floating-point data formats in
`hardware implementations prevents the user from leverag-
`ing an FPGA’s fine-grained configurability, effectively reduc-
`ing an FPGA to a collection of floating-point units with con-
`figurable interconnect. Seeing the advantages of customizing
`the data format to fit the problem, several authors have con-
`structed nonstandard floating-point units.
One of the earlier projects demonstrated a 23x speedup on a 2D
fast Fourier transform (FFT) through the use of a custom 18-bit
floating-point format [44]. More recent work has focused on
parameterizable libraries of floating-point units that can be
tailored to the task at hand [45–47]. By using a custom
floating-point format sized to match the width of the FPGA’s
internal integer multipliers, a speedup of 44 was achieved by
Nakasato and Hamada for a hydrodynamics simulation [48] using
four large FPGAs.
Nakasato and Hamada’s 38 GFLOPS of performance is impressive,
even from a cost-performance standpoint. For the cost of their
PROGRAPE-3 board, estimated at US$15,000, it is likely that a
15-node processor cluster could be constructed producing 196
single-precision peak GFLOPS. Even in the unlikely scenario that
this cluster could sustain the same 10% of peak performance
obtained by Nakasato and Hamada for their software
implementation, the PROGRAPE-3 design would still achieve a 2x
speedup.
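The equal-cost comparison in this paragraph is simple enough to restate as a calculation. All inputs below are the figures quoted in the text; the only assumption is that the 38 GFLOPS figure is compared directly against the cluster's sustained performance.

```python
BOARD_COST_USD = 15_000          # estimated PROGRAPE-3 board cost, from the text
FPGA_GFLOPS = 38.0               # Nakasato and Hamada's reported performance
CLUSTER_PEAK_GFLOPS = 196.0      # 15-node cluster at the same cost, from the text
CLUSTER_SUSTAINED_FRACTION = 0.10


def sustained_gflops(peak: float, fraction: float) -> float:
    """Sustained performance as a fraction of peak."""
    return peak * fraction


cluster_sustained = sustained_gflops(CLUSTER_PEAK_GFLOPS, CLUSTER_SUSTAINED_FRACTION)
equal_cost_speedup = FPGA_GFLOPS / cluster_sustained

print(f"cluster sustained: {cluster_sustained:.1f} GFLOPS")
print(f"equal-cost speedup of the FPGA board: {equal_cost_speedup:.2f}x")
print(f"FPGA board:  ${BOARD_COST_USD / FPGA_GFLOPS:.0f} per GFLOPS")
print(f"cluster:     ${BOARD_COST_USD / cluster_sustained:.0f} per sustained GFLOPS")
```

The result depends strongly on the sustained fraction: if the cluster sustained 20% of peak rather than 10%, the equal-cost advantage would disappear.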
`As in many FPGA to CPU comparisons, it is likely that
`the analysis unfairly favors the FPGA solution. Many com-
`parisons spend significantly more time optimizing hardware
`implementations than is spent optimizing software. Signif-
`icant compiler inefficiencies exist for common HPC func-
`tions [49], with some hand-coded functions outperform-
`ing the compiler by many times. It is possible that Nakasato
`and Hamada’s speedup would be significantly reduced, and
`perhaps eliminated on a cost-performance basis, if equal
`effort was applied to optimizing software at the assembly
`level. However, to permit their design to be more cost-
`competitive, even against efficient software implementations,
`smaller more cost-effective FPGAs could be used.
`
`3.2. GIMPS benchmark
`
`The strength of configurable logic stems from the ability to
`customize a hardware solution to a specific problem at the bit
`level. The previously presented works implemented coarse-
`grained floating-point units inside an FPGA for a wide range
`of HPC applications. For certain applications the full flexibil-
`ity of configurable logic can be leveraged to create a custom
`solution to a specific problem, utilizing data types that play
`to the FPGA’s strengths—integer arithmetic.
One such application can be found in the Great Internet Mersenne
Prime Search (GIMPS) [50]. The software used by GIMPS relies
heavily on double-precision floating-point FFTs. Through a
careful analysis of the problem, an all-integer solution is
possible that improves FPGA performance by a factor of two and
avoids the inaccuracies inherent in floating-point math.
The largest known prime numbers are Mersenne primes, prime
numbers of the form 2^q − 1, where q is also prime. The
distributed computing project GIMPS was created to identify large
Mersenne primes, and a reward of US$100,000 has been offered to
the first person to identify a prime number with more than 10
million digits. The algorithm used by GIMPS, the Lucas-Lehmer
test, is iterative, repeatedly performing modular squaring.
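The structure of the Lucas-Lehmer test can be seen in a minimal sketch. This is the textbook form of the test using Python's built-in big integers, not the GIMPS implementation, which replaces the plain squaring with FFT-based multiplication.

```python
def lucas_lehmer(q: int) -> bool:
    """Return True if 2**q - 1 is prime, for a prime exponent q.

    The sequence starts at 4 and is squared, decremented by 2, and reduced
    modulo 2**q - 1 a total of q - 2 times; the Mersenne number is prime
    exactly when the final value is zero.
    """
    if q == 2:
        return True              # 2**2 - 1 = 3 is prime; the loop below needs q > 2
    mersenne = (1 << q) - 1
    s = 4
    for _ in range(q - 2):
        s = (s * s - 2) % mersenne   # the modular squaring that dominates run time
    return s == 0


PRIME_EXPONENTS = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
print([q for q in PRIME_EXPONENTS if lucas_lehmer(q)])
```

Each iteration squares a q-bit number, so for exponents in the tens of millions the cost of one squaring determines the cost of the whole test.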
`One of the most efficient multiplication algorithms for
`large integers utilizes the FFT, treating the number being
`squared as a long sequence of smaller numbers. The linear
`convolution of this sequence with itself performs the squar-
`ing. As linear convolution in the time domain is equivalent
`to multiplication in the frequency domain, the FFT of the se-
`quence is taken and the resulting frequency domain sequence
`is squared elementwise before being brought back into the
`time domain. Floating-point arithmetic is used to meet the
`strict precision requirements across the time and frequency
`domains. The software used by GIMPS has been optimized
`at the assembly level for maximum performance on Pentium
`processors, making this application an effective benchmark
`of relative processor floating-point performance.
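The FFT-based squaring described above can be sketched in a few lines. This is an illustrative pure-Python version under stated assumptions: the recursive radix-2 FFT, the 8-bit digit size, and the function names are choices made here for clarity, and none of them reflects the optimized assembly used by GIMPS.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two. The inverse is left unscaled."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + twiddle * odd[k]
        out[k + n // 2] = even[k] - twiddle * odd[k]
    return out


def square_via_fft(x: int, digit_bits: int = 8) -> int:
    """Square a non-negative integer by convolving its digit sequence with itself."""
    if x == 0:
        return 0
    mask = (1 << digit_bits) - 1
    digits = []
    while x:
        digits.append(x & mask)      # split into small digits, least significant first
        x >>= digit_bits
    size = 1
    while size < 2 * len(digits):    # zero-pad so the cyclic convolution equals the linear one
        size *= 2
    padded = [complex(d) for d in digits] + [0j] * (size - len(digits))
    spectrum = fft(padded)
    squared = [v * v for v in spectrum]          # elementwise squaring in the frequency domain
    convolution = fft(squared, invert=True)
    result = 0
    for i, value in enumerate(convolution):
        coefficient = round(value.real / size)   # scale the inverse and round to the nearest integer
        result += coefficient << (i * digit_bits)  # big-integer addition propagates the carries
    return result


n = (1 << 127) - 1
print(square_via_fft(n) == n * n)
```

The rounding step is where precision matters: the floating-point error in each coefficient must stay below one half, which is why the digit size shrinks as the numbers grow and why the text refers to strict precision requirements.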
`Previous work focused on an FPGA hardware implemen-
`tation of the GIMPS algorithm to compare FPGA and pro-
`cessor floating-point performance [51]. Performing a tradi-
`tional port of the algorithm from software to hardware in-
`volves the creation of a floating-point FFT on the FPGA.
`On an XC2VP100, the largest Virtex-II Pro, 12 near-double-
`precision complex multipliers could be created from the 444
`dedicated integer multipliers. Such a design with pipelining
`performs a single iteration of the Lucas-Lehmer test in 3.7
`million clock cycles.
To leverage the advantages of a configurable architecture, an
all-integer number-theoretic transform was considered. In
particular, the irrational base discrete weighted transform
(IBDWT) can be used to perform integer convolution, serving the
exact same purpose as the floating-point FFT in the Lucas-Lehmer
test. In the IBDWT, all arithmetic is performed modulo a special
prime number. Normally modulo arithmetic is a demanding operation
requiring many cycles of latency, but by careful selection of
this prime number the reduction can be performed by simple
additions and shifting [51]. The resulting all-integer
implementation incorporates two 8-point butterfly structures
constructed with 24 64-bit integer multipliers and pipelined to a
depth of 10. A single iteration of Lucas-Lehmer requires 1.7
million clock cycles, a more than twofold improvement over the
floating-point design.
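To illustrate how a well-chosen prime turns modular reduction into shifts and additions, the sketch below uses the prime 2^64 − 2^32 + 1. This particular prime is an assumption chosen for illustration because it suits 64-bit arithmetic; the text does not name the prime used in the actual design [51]. The reduction relies on two congruences that hold for this modulus: 2^64 ≡ 2^32 − 1 and 2^96 ≡ −1.

```python
import random

P = (1 << 64) - (1 << 32) + 1     # illustrative prime, NOT confirmed as the one used in [51]
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def reduce128(x: int) -> int:
    """Reduce 0 <= x < 2**128 modulo P using only shifts, additions, and subtractions."""
    low = x & MASK64               # bits 0..63
    mid = (x >> 64) & MASK32       # bits 64..95, weight 2**64 == 2**32 - 1 (mod P)
    high = x >> 96                 # bits 96..127, weight 2**96 == -1 (mod P)
    r = low + (mid << 32) - mid - high
    if r < 0:                      # r > -2**32, so one correction suffices
        r += P
    if r >= P:                     # r < 2 * P, so one correction suffices
        r -= P
    return r


def mulmod(a: int, b: int) -> int:
    """Multiply two residues below P and reduce the 128-bit product."""
    return reduce128(a * b)


random.seed(2007)
samples = [random.getrandbits(128) for _ in range(1000)]
print(all(reduce128(x) == x % P for x in samples))
```

In hardware the two conditional corrections become a comparison and an adder, which is far cheaper than a general division.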
The final GIMPS accelerator, shown in Figure 2 and implemented in
the largest Virtex-II Pro FPGA, consisted of two butterflies fed
by reorder caches constructed from the internal memories. To
prevent a memory bottleneck, the design assumed four independent
banks of double data rate (DDR) SDRAM. Three sets of reorder
buffers were created out of the dedicated block memories on the
device. These memories operated concurrently, two of the buffers
feeding the butterfly units while the third exchanged data with
the external SDRAM. The final design could be clocked at 80 MHz
and used 86% of the dedicated multipliers and 70% of the
configurable logic.

[Figure 2 block diagram. Blocks shown: DDR SDRAM; XC2VP100
containing reorder RAM (16), mux, two reorder RAMs (8), and two
8-point butterflies.]

Figure 2: All-integer Lucas-Lehmer implementation.

In spite of the unique all-integer algorithmic approach, the
stand-alone FPGA implementation only achieved a speedup of 1.76
compared to a 3.4 GHz Pentium 4 processor. Amdahl’s Law limited
the FPGA’s performance due to the serial nature of certain steps
in the algorithm, namely the final modulo reduction after the
multimillion bit multiplication. A slightly reworked
implementation, designed as an FFT accelerator with all serial
functions implemented on an attached processor, could achieve a
speedup of 2.6 compared to a processor alone. From a cost
perspective, the FPGA implementation fares far worse, with the
large FPGA’s cost roughly ten times that of the processor.

4. CONCLUSION

When comparing HPC architectures many factors must be weighed,
including memory and I/O bandwidth, communication latencies, and
peak and sustained performance. However, as the recent focus on
commodity processor clusters demonstrates, cost-performance is of
paramount importance. In order for FPGAs to gain acceptance
within the general HPC community, they must be cost-competitive
with traditional processors for the floating-point arithmetic
typical in supercomputing applications. The analysis of the
cost-performance of various current generation FPGAs revealed
that only the lower-end devices were cost-competitive with
processors for double-precision floating-point matrix
multiplications.
An extrapolation of the double-precision floating-point
cost-performance of larger FPGAs using two different designs
suggests that these devices will not be cost-competitive with
processors any earlier than 2009. However, FPGA floating-point
performance is very sensitive to the mix of dedicated arithmetic
units in the architecture, and reaching this cost-performance
crossover point requires architectures with significant dedicated
multipliers.
For lower precision data formats current generation FPGAs fare
much better, being cost-competitive with processors. While
completely integer implementations of floating-point applications
permit the FPGA to fully leverage its strengths, for at least one
such application the cost-performance of an all-integer
implementation was significantly worse than that of a processor.
This benchmark suggests that only certain domains of
supercomputing problems will experience significant performance
improvements when implemented in FPGAs, and floating-point
arithmetic is not currently one of them.

REFERENCES

`[1] K. Compton and S. Hauck, “Reconfigurable computing: a sur-
`vey of systems and software,” ACM Computing Surveys, vol. 34,
`no. 2, pp. 171–210, 2002.
`[2] K. Puttegowda, W. Worek, P. Pappas, A. Dandapani, P. Atha-
`nas, and A. Dickerman, “A run-time reconfigurable system for
`gene-sequence searching,” in Proceedings of the 16th Interna-
`tional Conference on VLSI Design, pp. 561–566, New Delhi, In-
`dia, January 2003.
`[3] TimeLogic, “DeCypher Engine G4,” 2006, http://www.time-
`logic.com/decypher engine.html.
`[4] R. Tessier and W. Burleson, “Reconfigurable computing for
`digital signal processing: a survey,” Journal of VLSI Signal Pro-
`cessing Systems for Signal, Image, and Video Technology, vol. 28,
`no. 1-2, pp. 7–27, 2001.
[5] C. Patterson, “High performance DES encryption in Virtex(tm)
FPGAs using JBits(tm),” in Proceedings of the 8th An-
`nual IEEE Symposium on Field-Programmable Custom Com-
`puting Machines (FCCM ’00), p. 113, Napa Valley, Calif, USA,
`April 2000.
`[6] R. Sinnappan and S. Hazelhurst, “A reconfigurable approach
`to packet filtering,” in Proceedings of the 11th International
`Conference on Field-Programmable Logic and Applications (FPL
`’01), vol. 2147 of Lecture Notes in Computer Science, pp. 638–
`642, Belfast, Northern Ireland, UK, August 2001.
`[7] J. Jean, X. Liang, B. Drozd, and K. Tomko, “Accelerating an
`IR automatic target recognition application with FPGAs,”
`in Proceedings of the 7th Annual IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM ’99), pp.
`290–291, Napa Valley, Calif, USA, April 1999.
`[8] Z. K. Baker and V. K. Prasanna, “Time and area efficient
`pattern matching on FPGAs,” in Proceedings of the 12th
`ACM/SIGDA International Symposium on Field Programmable
`Gate Arrays (FPGA ’04), pp. 223–232, Monterey, Calif, USA,
`February 2004.
`[9] SRC, “SRC-7 Product Sheet,” 2006, http://www.srccomp.com/
`Product%20Sheets/.
`[10] A. Vance, “Start-up could kick Opteron into overdrive,” The
`Register, 2006.
`[11] G. Woods, “Cray ARSC presentation FPGA,” in Proceedings of
`ARSC High-Performance Reconfigurable Computing Workshop,
Fairbanks, Alaska, USA, August 2005.
`[12] J. Collins, G. Kent, and J. Yardley, “Using the starbridge sys-
`tems FPGA-based hypercomputer for cancer research,” in Pro-
`ceedings of the 7th International Conference on Military and
`Aerospace Programmable Logic Devices (MAPLD ’04), Wash-
`ington, DC, USA, September 2004.
`[13] SGI, “Extraordinary acceleration of workflows with reconfig-
`urable application-specific computing from SGI,” White Pa-
`per, Silicon Graphics, Mountain View, Calif, USA, November
`2004.
`[14] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev,
`“64-bit floating-point FPGA matrix multiplication,” in Pro-
ceedings of the 13th ACM/SIGDA International Sympo-
`sium on Field Programmable Gate Arrays (FPGA ’05), pp. 86–
`95, Monterey, Calif, USA, February 2005.
`[15] M. C. Smith, J. S. Vetter, and S. R. Alam, “Scientific comput-
`ing beyond CPUs: FPGA implementations of common scien-
`tific Kernels,” in Proceedings of the 8th International Confer-
`ence on Military and Aerospace Progr