IPR2018-01605, No. 2082-174 Exhibit - Ex 2082 (P.T.A.B. Jul. 26, 2019)

Hindawi Publishing Corporation
`EURASIP Journal on Embedded Systems
`Volume 2007, Article ID 93652, 8 pages
`doi:10.1155/2007/93652
`
`Research Article
`Examining the Viability of FPGA Supercomputing
`
`Stephen Craven and Peter Athanas
`
`Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University,
`Blacksburg, VA 24061, USA
`
`Received 16 May 2006; Revised 6 October 2006; Accepted 16 November 2006
`
`Recommended by Marco Platzner
`
`For certain applications, custom computational hardware created using ﬁeld programmable gate arrays (FPGAs) can produce
`signiﬁcant performance improvements over processors, leading some in academia and industry to call for the inclusion of FPGAs
`in supercomputing clusters. This paper presents a comparative analysis of FPGAs and traditional processors, focusing on ﬂoating-
`point performance and procurement costs, revealing economic hurdles in the adoption of FPGAs for general high-performance
`computing (HPC).
`
`Copyright © 2007 S. Craven and P. Athanas. This is an open access article distributed under the Creative Commons Attribution
`License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
`cited.
`
1. INTRODUCTION
`
`Supercomputers have experienced a resurgence, fueled by
`government research dollars and the development of low-
`cost supercomputing clusters constructed from commodity
`PC processors. Recently, interest has arisen in augmenting
`these clusters with programmable logic devices, such as FP-
`GAs. By tailoring an FPGA’s hardware to the speciﬁc task at
`hand, a custom coprocessor can be created for each HPC ap-
`plication.
`A wide body of research over two decades has repeat-
`edly demonstrated signiﬁcant performance improvements
`for certain classes of applications through hardware accelera-
`tion in an FPGA [1]. Applications well suited to acceleration
`by FPGAs typically exhibit massive parallelism and small in-
`teger or ﬁxed-point data types. Signiﬁcant performance gains
`have been described for gene sequencing [2, 3], digital ﬁlter-
`ing [4], cryptography [5], network packet ﬁltering [6], target
`recognition [7], and pattern matching [8].
`These successes have led SRC Computers [9], DRC Com-
`puter Corp. [10], Cray [11], Starbridge Systems [12], and SGI
`[13] to oﬀer clusters featuring programmable logic. Cray’s
`XD1 architecture, characteristic of many of these systems,
`integrates 12 AMD Opteron processors in a chassis with six
`large Xilinx Virtex-4 FPGAs. Many systems feature some of
`the largest FPGAs in production.
Many HPC applications and benchmarks require
double-precision floating-point arithmetic to support a large
dynamic range and ensure numerical stability. Floating-point
`arithmetic is so prevalent that the benchmarking application
`ranking supercomputers, LINPACK, heavily utilizes double-
`precision ﬂoating-point math. Due to the prevalence of
`ﬂoating-point arithmetic in HPC applications, research in
`academia and industry has focused on ﬂoating-point hard-
`ware designs [14, 15], libraries [16, 17], and development
tools [18] to effectively perform floating-point math on
FPGAs. The strong suit of FPGAs, however, is low-precision
fixed-point or integer arithmetic, and no current device
families contain dedicated floating-point operators, though
dedicated integer multipliers are prevalent. FPGA vendors
tailor their products toward their dominant customers,
driving development of architectures proficient at digital signal
processing, network applications, and embedded computing.
None of these domains demands floating-point performance.
`Published reports comparing FPGA-augmented systems
`to software-only implementations generally focus solely on
performance. Because cost is a key driver in the adoption of any
new technology, excluding a cost-benefit analysis fails to
capture the true viability of FPGA-based supercomputing. Of
two previous works that do incorporate cost into the analysis,
one [19] limits its scope to a single intelligent network
interface design and, while the other [20] presents impressive
cost-performance numbers, details and analysis are lacking.
Furthermore, many comparisons in the literature are ineffective,
as they compare a highly optimized FPGA floating-point
implementation to nonoptimized software. A much
`
`IPR2018-01600
`
`EXHIBIT
`2068
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2082, p. 1
`
`

Table 1: Published FPGA supercomputing application results.

Application        Platform          Format    Speedup
DGEMM [21]         SRC-6             DP        0.9x
Boltzmann [22]     XC2VP70           Float     1x
Dynamics [23]      SRC-6E            SP        2x
Dynamics [24]      SRC-6E            SP        3x
Dynamics [25]      SRC-6E            Float     3.8x
MATPHOT [26]       SRC               DP        8.5x
Filtering [27]     SRC-6E            Fixed     14x
Translation [28]   SRC-6             Integer   75x
Matching [29]      SRC-6/Cray XD1    Bit       256x/512x
Crypto [30]        SRC-6E            Bit       1700x
`
`better benchmark would redesign the algorithm to play to
`the FPGA’s strengths, comparing the design’s performance to
`that of an optimized program.
`The key contributions of this paper are the addition of an
`economic analysis to a discussion of FPGA supercomputing
`projects and the presentation of an eﬀective benchmark for
`comparing FPGAs and processors on an equal footing. A sur-
`vey of current research, along with a cost-performance anal-
`ysis of FPGA ﬂoating-point implementations, is presented in
`Section 2. Section 3 describes alternatives to ﬂoating-point
`implementations in FPGAs, presenting a balanced bench-
`mark for comparing FPGAs to processors. Finally, conclu-
`sions are presented in Section 4.
`
`2. FPGA SUPERCOMPUTING TRENDS
`
`This section presents an overview of the use of FPGAs in su-
`percomputers, analyzing the reported performance enhance-
`ments from a cost perspective.
`
`2.1. HPC implementations
`
`The availability of high-performance clusters incorporating
`FPGAs has prompted eﬀorts to explore acceleration of HPC
`applications. While not an exhaustive list, Table 1 provides
`a survey of recent representative applications. The SRC-6
`and 6E combine two Xeon or Pentium processors with two
`large Virtex-II or Virtex-II Pro FPGAs. The Cray XD1 places
`a Virtex-4 FPGA on a special interconnect system for low-
`latency communication with the host Opteron processors.
`In the table, the applications are listed by performance.
`The abbreviations SP and DP refer to single-precision
`and double-precision ﬂoating point, respectively. While the
`speedups provided in the table are not normalized to a com-
`mon processor, a trend is clearly visible. The top six examples
`all incorporate ﬂoating-point arithmetic and fare worse than
`the applications that utilize small data widths.
With no cost information regarding the SRC-6 or Cray
XD1 available to the authors, a thorough cost-performance
analysis is not possible. However, as the FPGA acceleration
hardware in these machines alone likely costs on the
order of US$10,000 or more, the floating-point
examples may lose some of their appeal when compared to
processors on a cost-performance basis. The observed speedups
of 75x to 1700x for integer and bit-level operations, on the other
hand, would likely be very beneficial from a cost perspective.
`
`2.2. Theoretical ﬂoating-point performance
`
`FPGA designs may suﬀer signiﬁcant performance penalties
`due to memory and I/O bottlenecks. To understand the po-
`tential of FPGAs in the absence of bottlenecks, it is instructive
`to consider the theoretical maximum ﬂoating-point perfor-
`mance of an FPGA.
`Traditional processors, with a ﬁxed data path width of
`32 or 64 bits, provide no incentive to explore reduced pre-
`cision formats. While FPGAs permit data path width cus-
`tomization, some in the HPC community are loath to utilize
`a nonstandard format owing to veriﬁcation and portability
`diﬃculties. This principle is at the heart of the Top500 List
`of fastest supercomputers [31], where ranked machines must
`exactly reproduce valid results when running the LINPACK
`benchmarks. Many applications also require the full dynamic
`range of the double-precision format to ensure numeric sta-
`bility.
`Due to the prevalence of IEEE standard ﬂoating-point
`in a wide range of applications, several researchers have de-
`signed IEEE 754 compliant ﬂoating-point accelerator cores
`constructed out of the Xilinx Virtex-II Pro FPGA’s conﬁg-
`urable logic and dedicated integer multipliers [32–34]. Dou
`et al. published one of the highest performance benchmarks
`of 15.6 GFLOPS by placing 39 ﬂoating-point processing el-
`ements on a theoretical Xilinx XC2VP125 FPGA [14]. Inter-
`polating their results for the largest production Xilinx Virtex-
`II Pro device, the XC2VP100, produces 12.4 GFLOPS, com-
`pared to the peak 6.4 GFLOPS achievable for a 3.2 GHz Intel
`Pentium processor. Assuming that the Pentium can sustain
`50% of its peak, the FPGA outperforms the processor by a
`factor of four for matrix multiplication.
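The factor of four follows directly from the figures just quoted. A minimal sketch of the arithmetic, using only the numbers in this paragraph (the variable names are illustrative):

```python
# Figures quoted in the paragraph above.
fpga_gflops = 12.4          # Dou et al. result scaled to the XC2VP100
pentium_peak_gflops = 6.4   # peak of a 3.2 GHz Pentium
sustained_fraction = 0.5    # assumed share of peak the Pentium sustains

pentium_sustained = pentium_peak_gflops * sustained_fraction
speedup = fpga_gflops / pentium_sustained
print(f"Pentium sustained: {pentium_sustained} GFLOPS, FPGA advantage: {speedup:.1f}x")
```

The ratio is slightly under four, which the text rounds to a factor of four.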
Dou et al.'s design consists of a linear array of MAC
elements, linked to a host processor providing memory access.
The design is pipelined to a depth of 12, permitting operation
at a frequency of up to 200 MHz. This architecture enables
high computational density by simplifying routing and
control, at the cost of requiring a host controller. Since the
results of Dou et al. are superior to other published results, and
even to Xilinx's floating-point cores, they are taken as an
absolute upper limit on FPGA double-precision floating-point
performance. Performance in any deployed system would be
lower because of the addition of interface logic.
Table 2 extrapolates Dou et al.'s performance results to
other FPGA device families. Given the similar configurable
logic architectures of the different Xilinx families, it
has been assumed that Dou et al.'s requirements of 1419
logic slices and nine dedicated multipliers hold for all
families. While the slice requirements may be lower for the
Virtex-4 family, owing to the inclusion of a MAC function with
the dedicated multipliers, all considered Virtex-4
implementations were multiplier limited, so the overestimate in
required slices does not affect the results. The clock frequency
`
Table 2: Double-precision floating-point multiply accumulate
cost-performance in US dollars.

Device               Speed (MHz)    GFlops         Device cost    $/GFlops
xc4vlx200            280            5.6            $7010          $1,250
xc4vsx35             280            5.6            $542           $97
xc2vp100-7           200            12.4           $9610          $775
xc2vp100-6           180            11.2           $6860          $613
xc2vp70-6            180            8.3            $2780          $334
xc2vp30-6            180            3.2            $781           $244
xc3s5000-5           140            3.1            $242           $78
xc3s4000-5           140            2.8            $164           $59
ClearSpeed CSX 600   N/A            50 [36]        $7500 [37]     $150
Pentium 630          3000           3              $167           $56
Pentium D 920        2800 × 2       5.6            $203           $36
Cell processor       3200 × 9       10 [38]        $230 [39]      $23
System X             2300 × 2200    12 250 [31]    $5.8 M [40]    $473
`
`has been scaled by a factor obtained by averaging the perfor-
`mance diﬀerential of Xilinx’s double-precision ﬂoating-point
`multiplier and adder cores [35] across the diﬀerent families.
`For comparison purposes, several commercial processors
`have been included in the list. The peak performance for each
`processor was reduced by 50%, taking into account compiler
`and system ineﬃciencies, permitting a fairer comparison as
FPGA designs typically sustain a much higher percentage of
their peak performance than processors. This 50% performance
penalty is in line with the sustained performance seen
`in the Top500 List’s LINPACK benchmark [31]. In the table,
`FPGAs are assumed to sustain their peak performance.
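The GFlops column of Table 2 is consistent with a simple model, sketched below. It assumes that each MAC unit or processor core completes two floating-point operations (one multiply and one add) per clock cycle; that per-cycle figure is an assumption of this sketch rather than a statement from the text, and the function names are illustrative. The MAC-unit counts of 31 and 10 are the ones the paper quotes for the XC2VP100 and XC4VLX200.

```python
def fpga_gflops(mac_units, clock_mhz, flops_per_cycle=2):
    """GFLOPS of an array of MAC units, assumed to sustain its full peak."""
    return mac_units * flops_per_cycle * clock_mhz / 1000.0


def processor_sustained_gflops(cores, clock_mhz, flops_per_cycle=2, sustained=0.5):
    """Peak GFLOPS of a processor, derated by the 50% sustained assumption."""
    return cores * flops_per_cycle * clock_mhz / 1000.0 * sustained


# xc2vp100-7: 31 MAC units at 200 MHz; xc4vlx200: 10 MAC units at 280 MHz
print(fpga_gflops(31, 200), fpga_gflops(10, 280))
# Pentium 630: one core at 3000 MHz; Pentium D 920: two cores at 2800 MHz
print(processor_sustained_gflops(1, 3000), processor_sustained_gflops(2, 2800))
```

Under these assumptions the four values match the corresponding Table 2 entries of 12.4, 5.6, 3, and 5.6 GFLOPS.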
As can be seen from the table, FPGA double-precision
floating-point performance is noticeably higher than that of
traditional Intel processors; however, considering the cost of
this performance, processors fare better, with the worst
processor beating the best FPGA. In particular, Sony's Cell
processor is more than two times cheaper per GFLOPS than the
best FPGA. The results indicate that the current generation of
larger FPGAs found on many FPGA-augmented HPC clusters
is far from cost competitive with the current generation
of processors for the double-precision floating-point tasks
typical of supercomputing applications.
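The ranking described above can be reproduced from the Table 2 figures. The following sketch recomputes cost per GFLOPS for the single-device rows only; the data structure and function names are illustrative, and ClearSpeed and System X are left out because their prices cover more than the device itself.

```python
# (device, kind, sustained GFLOPS, device cost in US$), copied from Table 2.
TABLE2 = [
    ("xc4vlx200", "fpga", 5.6, 7010),
    ("xc4vsx35", "fpga", 5.6, 542),
    ("xc2vp100-7", "fpga", 12.4, 9610),
    ("xc2vp100-6", "fpga", 11.2, 6860),
    ("xc2vp70-6", "fpga", 8.3, 2780),
    ("xc2vp30-6", "fpga", 3.2, 781),
    ("xc3s5000-5", "fpga", 3.1, 242),
    ("xc3s4000-5", "fpga", 2.8, 164),
    ("Pentium 630", "cpu", 3.0, 167),
    ("Pentium D 920", "cpu", 5.6, 203),
    ("Cell processor", "cpu", 10.0, 230),
]


def dollars_per_gflops(kind):
    """Map each device of the given kind to its cost per sustained GFLOPS."""
    return {name: cost / gflops for name, k, gflops, cost in TABLE2 if k == kind}


fpga = dollars_per_gflops("fpga")
cpu = dollars_per_gflops("cpu")
best_fpga = min(fpga, key=fpga.get)    # cheapest FPGA per GFLOPS
worst_cpu = max(cpu, key=cpu.get)      # most expensive processor per GFLOPS
cell_advantage = fpga[best_fpga] / cpu["Cell processor"]

print(best_fpga, round(fpga[best_fpga]))
print(worst_cpu, round(cpu[worst_cpu]))
print(f"Cell is {cell_advantage:.1f}x cheaper per GFLOPS than the best FPGA")
```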
With two exceptions, ClearSpeed and System X, all costs
in Table 2 cover only the price of the device, not including
other components (motherboard, memory, network, etc.)
that are necessary to produce a functioning supercomputer.
`It is also assumed here that operational costs are equiva-
`lent. These additional costs are nonnegligible and, while the
`FPGA accelerators would also incur additional costs for cir-
`cuit board and components, it is likely that the cost of com-
`ponents to create a functioning HPC node from a processor,
`even factoring in economies of scale, would be larger than for
`creating an accelerator plug-in from an FPGA. However, as
`
`most clusters incorporating FPGAs also include a host pro-
`cessor to handle serial tasks and communication, it is reason-
`able to assume that the cost analysis in Table 2 favors FPGAs.
`To place the additional component costs in perspec-
`tive, the cost-performance for Virginia Tech’s System X su-
`percomputing cluster has been included [41]. Constructed
`from 1100 dual core Apple XServe nodes, the supercom-
`puter, including the cost of all components, cost US$473 per
`GFLOPS. Several of the larger FPGAs cost more per GFLOPS
`even without the memory, boards, and assembly required to
`create a functional accelerator.
As the dedicated integer multipliers included by Xilinx,
the largest configurable logic manufacturer, are only 18 bits
wide, several multipliers must be combined to produce the
52-bit multiplication needed for double-precision
floating-point multiplication. Xilinx's double-precision
floating-point core requires 16 of these 18-bit multipliers [35]
for each floating-point multiplier, while the Dou et al. design
needs only nine. For many FPGA device families the high
multiplier requirement limits the number of floating-point
multipliers that may be placed on the device. For example, while
31 of Dou's MAC units may be placed on an XC2VP100, the
largest Virtex-II Pro device, the lack of sufficient dedicated
multipliers permits only 10 to be placed on the largest Xilinx
FPGA, an XC4VLX200. If this device were used solely as a
matrix multiplication accelerator, as in Dou's work, over 80% of
the device would be unused. Of course, this idle configurable
logic could be used to implement additional multipliers, at a
significant performance penalty.
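Why one wide multiplication consumes several narrow multipliers can be seen by splitting each operand into limbs and summing the shifted partial products. The sketch below is a generic schoolbook illustration, not the Xilinx or Dou et al. circuit. The limb widths used in the example calls are assumptions, chosen only because they happen to reproduce the counts of nine and 16 quoted above; the actual partitioning used by either design is not described here.

```python
def split_limbs(x, limb_bits, n_limbs):
    """Split a non-negative integer into n_limbs limbs, least significant first."""
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(n_limbs)]


def wide_multiply(a, b, operand_bits, limb_bits):
    """Multiply two operand_bits-wide integers using only limb-sized products.

    Returns the full product and the number of narrow multiplications used.
    """
    n_limbs = -(-operand_bits // limb_bits)  # ceiling division
    a_limbs = split_limbs(a, limb_bits, n_limbs)
    b_limbs = split_limbs(b, limb_bits, n_limbs)
    product = 0
    narrow_multiplies = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            # Each limb product is shifted to its weight and accumulated.
            product += (ai * bj) << ((i + j) * limb_bits)
            narrow_multiplies += 1
    return product, narrow_multiplies


a = (1 << 52) - 1
b = 0xF123456789ABC
print(wide_multiply(a, b, 52, 18)[1])   # 3 x 3 limb products
print(wide_multiply(a, b, 53, 17)[1])   # 4 x 4 limb products
```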
`While the larger FPGA devices that are prevalent in com-
`putational accelerators do not provide a cost beneﬁt for the
`double-precision ﬂoating-point calculations required by the
`HPC community, historical trends [42] suggest that FPGA
`performance is improving at a rate faster than that of pro-
`cessors. The question is then asked, when, if ever, will FPGAs
`overtake processors in cost performance?
As has been noted by some, the cost of the largest
cutting-edge FPGA remains roughly constant over time, while
performance and size improve. A first-order estimate of
US$8,000 has been made for the cost of the largest and newest
FPGA, an estimate supported by the cost of the largest
Virtex-II Pro and Virtex-4 devices. Furthermore, it is
assumed that the cost of a processor remains constant at
US$500 over time as well. While these estimates are somewhat
misleading, as these costs certainly do vary over time,
the variability in the cost of computing devices between
generations is much less than the increase in performance.
The comparison further assumes, as before, that processors
can sustain 50% of their peak floating-point performance
while FPGAs sustain 100%. Whenever possible, estimates
were rounded to favor FPGAs.
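Under these assumptions, cost per GFLOPS falls only as fast as performance rises, so a crossover estimate reduces to intersecting two exponential trends. The sketch below shows the shape of that calculation. The doubling periods in the example are hypothetical placeholders, not values from the paper, and the starting performance figures simply reuse the 12.4 GFLOPS and 6.4 GFLOPS quoted earlier for illustration.

```python
import math


def cost_per_gflops(device_cost, peak_gflops, sustained_fraction):
    """Device cost divided by sustained GFLOPS."""
    return device_cost / (peak_gflops * sustained_fraction)


def crossover_years(fpga_cost0, cpu_cost0, fpga_doubling_years, cpu_doubling_years):
    """Years until two exponentially falling $/GFLOPS curves meet.

    Each curve is modeled as cost0 * 2 ** (-t / doubling_years). Returns None
    when the FPGA curve does not fall faster than the processor curve. A
    negative result means the FPGA curve is already below the processor curve.
    """
    rate_gap = 1.0 / fpga_doubling_years - 1.0 / cpu_doubling_years
    if rate_gap <= 0:
        return None
    return math.log2(fpga_cost0 / cpu_cost0) / rate_gap


# Constant device costs from the text: US$8,000 per FPGA, US$500 per processor.
fpga0 = cost_per_gflops(8000, 12.4, 1.0)   # FPGAs assumed to sustain 100%
cpu0 = cost_per_gflops(500, 6.4, 0.5)      # processors assumed to sustain 50%

# Hypothetical doubling periods, for illustration only.
years = crossover_years(fpga0, cpu0, fpga_doubling_years=1.0, cpu_doubling_years=1.5)
print(f"start: FPGA ${fpga0:.0f}/GFLOPS, processor ${cpu0:.0f}/GFLOPS")
print(f"crossover after about {years:.1f} years under the assumed rates")
```

The result is highly sensitive to the assumed growth rates, which is why the paper bases its trend lines on two independent sets of design data.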
`Two sources of data were used for performance extrap-
`olation to increase the validity of the results. The work of
`Dou et al. [14], representing the fastest double-precision
`ﬂoating-point MAC design, was extrapolated to the largest
`parts in several Xilinx device families. Additional data was
`obtained by extrapolating the results of Underwood’s histor-
`ical analysis [42] to include the Virtex-4 family. Underwood’s
data came from his IEEE standard floating-point designs
pipelined, depending on the device, to a maximum depth of
34. The results are shown in Figure 1(a) for the Underwood
data and Figure 1(b) for Dou et al.

Figure 1: Extrapolated double-precision floating-point MAC
cost-performance, in US dollars, for: (a) Underwood design and
(b) Dou et al. design. [Both panels plot Cost/GFLOPS ($), with
axis ticks at 10, 100, 1000, and 10000, against Year, from 2000
to 2010. Series: FPGAs; Processors; Extrapolation FPGA w/o
Virtex-4; Extrapolation FPGA; Extrapolation processor.]

An additional data point exists for the Underwood graph
as his work included results for the Virtex-E FPGAs. The
Dou et al. design is higher performance and smaller, in terms
of slices, than Underwood's design. In both graphs, the latest
data point, representing the largest Virtex-4 device, displays
worse cost-performance than the previous generation
of devices. This is due to the shortage of dedicated multipliers
on the larger Virtex-4 devices. The Virtex-4 architecture
comprises three subfamilies: the LX, SX, and FX. The
Virtex-4 subfamily with the largest devices, by far, is the LX,
and it is these devices that are found in FPGA-augmented
HPC nodes. However, the LX subfamily is focused on logic
density, trading most of the dedicated multipliers found in
the smaller SX subfamily for configurable logic. This
significantly reduces the floating-point multiplication performance
of the larger Virtex-4 devices.
As the graphs illustrate, if this trend towards logic-centric
large FPGAs continues, it is unlikely that the largest FPGAs
will be cost effective compared to processors anytime soon,
if ever. However, as preliminary data on the next-generation
Virtex-5 suggests that the relatively poor floating-point
performance of the Virtex-4 is an aberration and not
indicative of a trend in FPGA architectures, it seems reasonable
to reconsider the results excluding the Virtex-4 data points.
Figure 1 trend lines labeled "FPGA extrapolation w/o Virtex-4"
exclude these potentially misleading data points.
When the Virtex-4 data is ignored, the cost-performance
of FPGAs for double-precision floating-point matrix
multiplication improves at a rate greater than that for processors.
While there is always a danger in drawing conclusions
from a small data set, both the Dou et al. and Underwood
design results point to a crossover sometime around
2009 to 2012, when the largest FPGA devices, like those
typically found in commercial FPGA-augmented HPC clusters,
will be cost effective compared to processors for
double-precision floating-point calculations.

2.3. Tools

The typical HPC user is a scientist, researcher, or engineer
desiring to accelerate some scientific application. These users
are generally acquainted with a programming language
appropriate to their fields (C, FORTRAN, MATLAB, etc.) but
have little, if any, hardware design knowledge. Many have
noted the requirement of high-level development environments
to speed acceptance of FPGA-augmented clusters.
These development tools accept a description of the
application written in a high-level language (HLL) and automate
the translation of appropriate sections of code into hardware.
Several companies market HLL-to-gates synthesizers to the
HPC community, including Impulse Accelerated Technologies,
Celoxica, and SRC.
The state of these tools, however, as noted by some [43],
does not remove the need for dedicated hardware expertise.
Hardware debugging and interfacing still must occur. The
use of automatic translation also drives up development costs
compared to software implementations. C compilers and
debuggers are free. Electronic design automation tools, on the
other hand, may require expensive yearly licenses. Furthermore,
the added inefficiencies of translating an inherently
sequential high-level description into a parallel hardware
implementation eat into the performance of hardware accelerators.
`
`3. FLOATING-POINT ALTERNATIVES
`
`3.1. Nonstandard data formats
`
`The use of IEEE standard ﬂoating-point data formats in
`hardware implementations prevents the user from leverag-
`ing an FPGA’s ﬁne-grained conﬁgurability, eﬀectively reduc-
`ing an FPGA to a collection of ﬂoating-point units with con-
`ﬁgurable interconnect. Seeing the advantages of customizing
`the data format to ﬁt the problem, several authors have con-
`structed nonstandard ﬂoating-point units.
`One of the earlier projects demonstrated a 23x speedup
`on a 2D fast Fourier transform (FFT) through the use of a
`custom 18-bit ﬂoating-point format [44]. More recent work
has focused on parameterizable libraries of floating-point
`units that can be tailored to the task at hand [45–47]. By us-
`ing a custom ﬂoating-point format sized to match the width
`of the FPGA’s internal integer multipliers, a speedup of 44
`was achieved by Nakasato and Hamada for a hydrodynamics
`simulation [48] using four large FPGAs.
Nakasato and Hamada's 38 GFLOPS of performance is
impressive, even from a cost-performance standpoint. For
the cost of their PROGRAPE-3 board, estimated at
US$15,000, it is likely that a 15-node processor cluster could be
constructed producing 196 single-precision peak GFLOPS.
Even in the unlikely scenario that this cluster could sustain
the same 10% of peak performance obtained by Nakasato
and Hamada for their software implementation, the
PROGRAPE-3 design would still achieve a 2x speedup.
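The 2x figure can be checked from the numbers in this paragraph. A minimal sketch of the arithmetic, with illustrative variable names:

```python
progrape_gflops = 38.0        # reported for the four-FPGA PROGRAPE-3 design
cluster_peak_gflops = 196.0   # 15-node cluster, single-precision peak
sustained_fraction = 0.10     # share of peak reached by the software implementation

cluster_sustained = cluster_peak_gflops * sustained_fraction
speedup = progrape_gflops / cluster_sustained
print(f"cluster sustains {cluster_sustained:.1f} GFLOPS; FPGA board is {speedup:.2f}x faster")
```

At equal estimated cost the board delivers a little under twice the sustained throughput of the cluster, which the text rounds to 2x.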
As in many FPGA to CPU comparisons, it is likely that
the analysis unfairly favors the FPGA solution. Many
comparisons spend significantly more time optimizing hardware
implementations than is spent optimizing software.
Significant compiler inefficiencies exist for common HPC
functions [49], with some hand-coded functions outperforming
the compiler by many times. It is possible that Nakasato
and Hamada's speedup would be significantly reduced, and
perhaps eliminated on a cost-performance basis, if equal
effort were applied to optimizing software at the assembly
level. However, to make their design more cost-competitive,
even against efficient software implementations,
smaller, more cost-effective FPGAs could be used.
`
`3.2. GIMPS benchmark
`
`The strength of conﬁgurable logic stems from the ability to
`customize a hardware solution to a speciﬁc problem at the bit
`level. The previously presented works implemented coarse-
`grained ﬂoating-point units inside an FPGA for a wide range
`of HPC applications. For certain applications the full ﬂexibil-
`ity of conﬁgurable logic can be leveraged to create a custom
`solution to a speciﬁc problem, utilizing data types that play
`to the FPGA’s strengths—integer arithmetic.
One such application can be found in the Great Internet
Mersenne Prime Search (GIMPS) [50]. The software used
by GIMPS relies heavily on double-precision floating-point
FFTs. Through a careful analysis of the problem, an
all-integer solution is possible that improves FPGA performance
by a factor of two and avoids the inaccuracies inherent in
floating-point math.
`
The largest known prime numbers are Mersenne primes:
prime numbers of the form 2^q − 1, where q is also
prime. The distributed computing project GIMPS was created
to identify large Mersenne primes, and a reward of
US$100,000 has been offered to the first person to identify
a prime number with greater than 10 million digits. The
algorithm used by GIMPS, the Lucas-Lehmer test, is iterative,
repeatedly performing modular squaring.
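The Lucas-Lehmer test itself is short. The sketch below is the standard textbook form using Python's built-in big integers, not the optimized GIMPS software; it shows the repeated modular squaring that dominates the run time.

```python
def lucas_lehmer(q):
    """Return True if the Mersenne number 2**q - 1 is prime.

    q is expected to be prime; the test is not meaningful otherwise.
    """
    if q == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs q > 2
    m = (1 << q) - 1
    s = 4
    for _ in range(q - 2):
        # One iteration: square, subtract two, reduce modulo 2**q - 1.
        s = (s * s - 2) % m
    return s == 0


print([q for q in (2, 3, 5, 7, 11, 13, 17, 19, 23, 31) if lucas_lehmer(q)])
```

For an exponent in the tens of millions, each of the q − 2 iterations squares a number millions of bits long, which is why the cost of large-integer multiplication determines overall performance.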
`One of the most eﬃcient multiplication algorithms for
`large integers utilizes the FFT, treating the number being
`squared as a long sequence of smaller numbers. The linear
`convolution of this sequence with itself performs the squar-
`ing. As linear convolution in the time domain is equivalent
`to multiplication in the frequency domain, the FFT of the se-
`quence is taken and the resulting frequency domain sequence
`is squared elementwise before being brought back into the
`time domain. Floating-point arithmetic is used to meet the
`strict precision requirements across the time and frequency
`domains. The software used by GIMPS has been optimized
`at the assembly level for maximum performance on Pentium
`processors, making this application an eﬀective benchmark
`of relative processor ﬂoating-point performance.
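The squaring procedure described above can be sketched in a few lines. This is a toy illustration that uses base-10 digits and a textbook recursive radix-2 FFT for clarity; it is not the GIMPS implementation, and the function names are illustrative. The rounding step at the end is where floating-point precision matters: the inverse transform must land close enough to the exact integer coefficients to round correctly.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def square_via_fft(x, base=10):
    """Square a non-negative integer by convolving its digit sequence with itself."""
    digits = []
    while x:
        x, d = divmod(x, base)
        digits.append(d)
    digits = digits or [0]
    size = 1
    while size < 2 * len(digits):   # zero-pad so the convolution does not wrap
        size *= 2
    spectrum = fft(digits + [0] * (size - len(digits)))
    squared = [v * v for v in spectrum]          # elementwise squaring
    conv = fft(squared, invert=True)
    coeffs = [round(v.real / size) for v in conv]  # back to exact integers
    # Summing with positional weights also performs the carry propagation.
    return sum(c * base**i for i, c in enumerate(coeffs))


print(square_via_fft(123456789) == 123456789**2)
```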
`Previous work focused on an FPGA hardware implemen-
`tation of the GIMPS algorithm to compare FPGA and pro-
`cessor ﬂoating-point performance [51]. Performing a tradi-
`tional port of the algorithm from software to hardware in-
`volves the creation of a ﬂoating-point FFT on the FPGA.
`On an XC2VP100, the largest Virtex-II Pro, 12 near-double-
`precision complex multipliers could be created from the 444
`dedicated integer multipliers. Such a design with pipelining
`performs a single iteration of the Lucas-Lehmer test in 3.7
`million clock cycles.
To leverage the advantages of a configurable architecture,
an all-integer number-theoretic transform was
considered. In particular, the irrational base discrete weighted
transform (IBDWT) can be used to perform integer
convolution, serving the exact same purpose as the floating-point
FFT in the Lucas-Lehmer test. In the IBDWT, all arithmetic is
performed modulo a special prime number. Normally modulo
arithmetic is a demanding operation requiring many cycles
of latency, but by careful selection of this prime number
the reduction can be performed by simple additions and
shifting [51]. The resulting all-integer implementation
incorporates two 8-point butterfly structures constructed with 24
64-bit integer multipliers and pipelined to a depth of 10. A
single iteration of Lucas-Lehmer requires 1.7 million clock
cycles, a more than two-fold improvement over the
floating-point design.
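How a well-chosen modulus turns reduction into shifts and additions can be illustrated with a modulus of the form 2^k − 1. This modulus is an assumption made for the sketch: the specific prime used in [51] is not stated here, and the function name is illustrative. Because 2^k is congruent to 1 modulo 2^k − 1, the high bits of a value can simply be folded onto the low bits.

```python
def mod_mersenne(x, k):
    """Reduce a non-negative integer modulo 2**k - 1 using only shifts, masks, and adds."""
    m = (1 << k) - 1
    while x > m:
        # 2**k is congruent to 1 (mod m), so the high part folds onto the low part.
        x = (x & m) + (x >> k)
    return 0 if x == m else x


print(mod_mersenne(123456789, 13), 123456789 % ((1 << 13) - 1))
```

No division or multiplication is needed, which is what makes this kind of reduction cheap in configurable logic.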
The final GIMPS accelerator, shown in Figure 2 and implemented
in the largest Virtex-II Pro FPGA, consisted of two
`butterﬂies fed by reorder caches constructed from the inter-
`nal memories. To prevent a memory bottleneck, the design
`assumed four independent banks of double data rate (DDR)
`SDRAM. Three sets of reorder buﬀers were created out of
`the dedicated block memories on the device. These mem-
`ories operated concurrently, two of the buﬀers feeding the
`butterﬂy units while the third exchanged data with the ex-
`ternal SDRAM. The ﬁnal design could be clocked at 80 MHz
and used 86% of the dedicated multipliers and 70% of the
configurable logic.

Figure 2: All-integer Lucas-Lehmer implementation. [Block
diagram of the XC2VP100 design: DDR SDRAM, a mux, three
reorder RAMs, and two 8-point butterflies.]

In spite of the unique all-integer algorithmic approach,
the stand-alone FPGA implementation only achieved a
speedup of 1.76 compared to a 3.4 GHz Pentium 4 processor.
Amdahl's Law limited the FPGA's performance due to the
serial nature of certain steps in the algorithm, namely the final
modulo reduction after the multimillion-bit multiplication.
A slightly reworked implementation, designed as an FFT
accelerator with all serial functions implemented on an
attached processor, could achieve a speedup of 2.6 compared to
a processor alone. From a cost perspective, the FPGA
implementation fares far worse, with the large FPGA's cost roughly
ten times that of the processor.

4. CONCLUSION

When comparing HPC architectures many factors must be
weighed, including memory and I/O bandwidth,
communication latencies, and peak and sustained performance.
However, as the recent focus on commodity processor
clusters demonstrates, cost-performance is of paramount
importance. In order for FPGAs to gain acceptance within the
general HPC community, they must be cost-competitive with
traditional processors for the floating-point arithmetic
typical in supercomputing applications. The analysis of the
cost-performance of various current generation FPGAs revealed
that only the lower-end devices were cost-competitive with
processors for double-precision floating-point matrix
multiplications.
An extrapolation of the double-precision floating-point
cost-performance of larger FPGAs using two different
designs suggests that these devices will not be cost-competitive
with processors any earlier than 2009. However, FPGA
floating-point performance is very sensitive to the mix of
dedicated arithmetic units in the architecture, and reaching this
cost-performance crossover point requires architectures
with significant dedicated multipliers.
For lower precision data formats, current generation
FPGAs fare much better, being cost-competitive with
processors. While completely integer implementations of
floating-point applications permit the FPGA to fully leverage its
strengths, for at least one such application the
cost-performance of an all-integer implementation was
significantly worse than that of a processor. This benchmark suggests that
only certain domains of supercomputing problems will
experience significant performance improvements when
implemented in FPGAs, and floating-point arithmetic is not
currently one of them.

REFERENCES

`[1] K. Compton and S. Hauck, “Reconﬁgurable computing: a sur-
`vey of systems and software,” ACM Computing Surveys, vol. 34,
`no. 2, pp. 171–210, 2002.
`[2] K. Puttegowda, W. Worek, P. Pappas, A. Dandapani, P. Atha-
`nas, and A. Dickerman, “A run-time reconﬁgurable system for
`gene-sequence searching,” in Proceedings of the 16th Interna-
`tional Conference on VLSI Design, pp. 561–566, New Delhi, In-
`dia, January 2003.
`[3] TimeLogic, “DeCypher Engine G4,” 2006, http://www.time-
`logic.com/decypher engine.html.
`[4] R. Tessier and W. Burleson, “Reconﬁgurable computing for
`digital signal processing: a survey,” Journal of VLSI Signal Pro-
`cessing Systems for Signal, Image, and Video Technology, vol. 28,
`no. 1-2, pp. 7–27, 2001.
[5] C. Patterson, “High performance DES encryption in
Virtex(tm) FPGAs using Jbits(tm),” in Proceedings of the 8th
Annual IEEE Symposium on Field-Programmable Custom
Computing Machines (FCCM ’00), p. 113, Napa Valley, Calif, USA,
April 2000.
`[6] R. Sinnappan and S. Hazelhurst, “A reconﬁgurable approach
`to packet ﬁltering,” in Proceedings of the 11th International
`Conference on Field-Programmable Logic and Applications (FPL
`’01), vol. 2147 of Lecture Notes in Computer Science, pp. 638–
`642, Belfast, Northern Ireland, UK, August 2001.
[7] J. Jean, X. Liang, B. Drozd, and K. Tomko, “Accelerating an
IR automatic target recognition application with FPGAs,”
in Proceedings of the 7th Annual IEEE Symposium on
Field-Programmable Custom Computing Machines (FCCM ’99), pp.
290–291, Napa Valley, Calif, USA, April 1999.
`[8] Z. K. Baker and V. K. Prasanna, “Time and area eﬃcient
`pattern matching on FPGAs,” in Proceedings of the 12th
`ACM/SIGDA International Symposium on Field Programmable
`Gate Arrays (FPGA ’04), pp. 223–232, Monterey, Calif, USA,
`February 2004.
`[9] SRC, “SRC-7 Product Sheet,” 2006, http://www.srccomp.com/
`Product%20Sheets/.
`[10] A. Vance, “Start-up could kick Opteron into overdrive,” The
`Register, 2006.
`[11] G. Woods, “Cray ARSC presentation FPGA,” in Proceedings of
`ARSC High-Performance Reconfigurable Computing Workshop,
`Fairbanks, Alaska, USA, August 2005.
`[12] J. Collins, G. Kent, and J. Yardley, “Using the starbridge sys-
`tems FPGA-based hypercomputer for cancer research,” in Pro-
`ceedings of the 7th International Conference on Military and
`Aerospace Programmable Logic Devices (MAPLD ’04), Wash-
`ington, DC, USA, September 2004.
`[13] SGI, “Extraordinary acceleration of workﬂows with reconﬁg-
`urable application-speciﬁc computing from SGI,” White Pa-
`per, Silicon Graphics, Mountain View, Calif, USA, November
`2004.
`[14] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev,
`“64-bit floating-point FPGA matrix multiplication,” in
`Proceedings of the 13th ACM/SIGDA International Symposium
`on Field Programmable Gate Arrays (FPGA ’05), pp. 86–
`95, Monterey, Calif, USA, February 2005.
`[15] M. C. Smith, J. S. Vetter, and S. R. Alam, “Scientific
`computing beyond CPUs: FPGA implementations of common
`scientific kernels,” in Proceedings of the 8th International
`Conference on Military and Aerospace Progr
