On the Viability of FPGA-Based Integrated Coprocessors
Osama T. Albaharna, Peter Y. K. Cheung, and Thomas J. Clarke
Information Engineering Section
Department of Electrical and Electronic Engineering
Imperial College of Science, Technology and Medicine
Exhibition Road, London SW7 2BT, UK
`
Abstract
This paper examines the viability of using integrated programmable logic as a coprocessor to support a host CPU core. This adaptive coprocessor is compared to a VLIW machine in terms of both die area occupied and performance. The parametric bounds necessary to justify the adoption of an FPGA-based coprocessor are established. An abstract Field Programmable Gate Array model is used to investigate the area and delay characteristics of arithmetic circuits implemented on FPGA architectures to determine the potential speedup of FPGA-based coprocessors.
Our analysis shows that integrated FPGA arrays are suitable as coprocessor platforms for realising algorithms that require only limited numbers of multiplication instructions. Inherent FPGA characteristics limit the data-path widths that can be supported efficiently for these applications. An FPGA-based adaptive coprocessor requires a large minimum die area before any advantage over a VLIW machine of a comparable size can be realised.
`
1. Introduction
The ever increasing spare transistor capacity has so far been absorbed into only a limited number of architectural features. Integrated programmable logic has emerged as one of the very few novel architectural ideas with the potential to exploit this abundant resource. A custom coprocessor can directly exploit the concurrency available in applications, algorithms, and code segments. An FPGA-based coprocessor can further adapt to any demands for special-purpose hardware by mapping an algorithm onto run-time configurable logic. These versatile "adaptive" coprocessors can be used to augment the instruction set of a core CPU or as special-purpose custom computing engines. Real-time applications can also swap multiple functions and subroutines directly onto the reconfigurable hardware during execution [1]-[5].
`
The adaptive coprocessor model challenges the more established general purpose techniques that exploit fine-grain instruction level concurrency. We ask: under what architectural conditions can the integration of a core CPU and an FPGA-based coprocessor on a single die outperform the possible alternative of using a Very Long Instruction Word (VLIW) engine on that same die area?
This paper addresses this question through four stages. First, in Section 2, the cost and performance bounds of both computational models, the VLIW and the FPGA coprocessor, are examined and a set of critical parameters is determined. Section 3 describes the experimental methodology used to establish the characteristics of arithmetic computation on FPGAs, and Section 4 summarises the results of this investigation. In Section 5, we explore the implications of these results on the achievable cost and performance limits of FPGA-based coprocessors. Finally, in Section 6 we apply the ideas and conclusions presented in earlier sections to a typical computational example to determine its suitability for FPGA-based adaptive coprocessor implementation.
`
`2. Computational Models
`An adaptive coprocessor uses silicon real-estate to
`integrate more programmable logic. This can then be used
`to implement larger custom circuits or exploit more
`concurrency. On the other hand, a VLIW machine will use
`this same die area to increase the number of ALUs and
`execute more instructions per cycle. In this section, we
`examine the cost and performance of implementing an
`algorithm on both computational models organised as in
`Figure 1.
`
Figure 1. Target coprocessor system organisation: a core microprocessor with an I-cache and a D-cache, attached to a coprocessor (FPGA or VLIW).
`
0-8186-7548-9/96 $05.00 © 1996 IEEE
`
`
`IPR2018-01600
`
`EXHIBIT
`2065
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2079, p. 1
`
`
`
2.1 FPGA-based coprocessor organisation
To achieve a high coprocessor throughput, we assume a pipelined implementation of all algorithms. This means that the performance of the FPGA-based coprocessor depends on the cycle time of the pipeline (t_fpga), the number of iterations the circuit is used (N_fpga), the number of concurrent copies of the circuit mapped onto the FPGA (k_fpga), and the number of cycles needed to fill the pipeline (c_fpga^fill). The total number of cycles is then

    T_fpga = N_fpga / k_fpga + c_fpga^fill

The area cost is the sum of the areas for all arithmetic nodes in a design. We assume integer nodes and floating-point nodes are used. All other operator nodes are expressed as a percentage (c_fpga) of the area. If a_fpga^i is the area of node type i and n_fpga^i the number of nodes of type i used in the circuit, the cost of a circuit implemented on an adaptive coprocessor can be expressed as:

    A_circuit = (1 + c_fpga) × k_fpga × ( Σ_{int-i} n_fpga^i × a_fpga^i + Σ_{fp-i} n_fpga^i × a_fpga^i )
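The cycle-count and area expressions above can be sketched in code. The function names and all numbers in the example (node areas, node counts, overhead fraction) are illustrative, not values taken from the paper.

```python
import math

def fpga_cycles(n_iter, k_copies, fill_cycles):
    """Total FPGA coprocessor cycles: N_fpga iterations shared among
    k_fpga concurrent circuit copies, plus pipeline-fill cycles."""
    return math.ceil(n_iter / k_copies) + fill_cycles

def fpga_circuit_area(node_areas, node_counts, k_copies, c_overhead):
    """Area of k_fpga copies of the circuit; non-arithmetic logic is
    folded in as the fractional overhead c_fpga."""
    nodes = sum(a * n for a, n in zip(node_areas, node_counts))
    return (1 + c_overhead) * k_copies * nodes

# 1000 iterations on 4 circuit copies with a 12-cycle pipeline fill,
# and a circuit of 9 adders plus 1 multiplier (hypothetical areas).
cycles = fpga_cycles(1000, 4, 12)
area = fpga_circuit_area([300.0, 900.0], [9, 1], 4, 0.10)
```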
`
2.2 VLIW machine organisation
We assume the VLIW utilises integer and floating-point ALU units rather than single-operation functional modules, and that they constitute most of its area. All other area is expressed as a percentage (c_vliw) of the total area. If a_vliw^i is the area of a function node of type i and n_vliw^i the number of nodes of type i used, the cost of a VLIW machine can be expressed as:

    A_vliw = (1 + c_vliw) × ( Σ_{int-i} n_vliw^i × a_vliw^i + Σ_{fp-i} n_vliw^i × a_vliw^i )

In addition to available resources, the performance of a VLIW machine is limited by two types of dependencies [6]: the data dependencies within an iteration and the ones between iterations. A VLIW program can be viewed as a dependence graph, as in Figure 4, which must be repeated N_vliw times. Data and iteration dependencies are represented by the bold and dashed edges respectively. Since our VLIW processor model uses pipelined ALUs, each node, or operation, in the graph takes a single time unit to execute. The iteration distance, attached to dashed edges, is the number of loop iterations after issuance of S_i that S_j can begin execution. The number of time units it takes to execute a cycle within a dependency graph (δ_c), given maximum resources, is the sum of all nodes along this cycle path. The number of iterations (K_c) it takes the pattern in a cycle to repeat execution is the sum of all iteration distances along this cycle's dependency path. Therefore N_vliw/K_c repetitions of a given cycle will be executed, requiring δ_c × (N_vliw/K_c) cycles.
The minimum time to execute the whole loop is max_c[ δ_c × (N_vliw/K_c) ] = p_min × N_vliw, where p_min = max_c( δ_c/K_c ) is the critical iteration period bound. Using software pipelining [7] and other advanced compiler transformations, it is possible to overlap the execution of several different iterations of the loop. If k_vliw iterations can be unrolled and then scheduled, the iteration interval t_iter is the time units needed to execute an entire iteration of k_vliw unrolled loops. It must satisfy both types of data dependencies as well as resource dependencies. If q_vliw^i is the number of operations for which a resource of type i must be used in k_vliw iterations, we can estimate lower bounds on the iteration interval and the maximum VLIW performance as follows:

    T_vliw = (N_vliw / k_vliw) × t_iter

    t_iter = max[ t_iter(resources), t_iter(dependence) ]
    t_iter(resources) ≥ max_i ⌈ q_vliw^i / n_vliw^i ⌉
    t_iter(dependence) ≥ max_c ⌈ k_vliw × (δ_c / K_c) ⌉
`
`Although the problem of finding an optimal schedule
`using software pipelining is NP-complete, it has been
`shown that near optimal results can often be obtained for
`loops with both intra- and inter-iteration data
`dependencies. It has also been shown that hierarchical
`reduction allows software pipelining to be applied to
`complex loops containing conditional statements. Program
`restructuring using loop transformations and optimising
`data locality using space tiling techniques can also be
`applied to increase both the fine-grain and coarse-grain
`parallelism available in nested loops.
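The two lower bounds on the iteration interval can be sketched as follows. The loop description in the example (operation counts, ALU counts, and a single dependence cycle) is made up for illustration.

```python
import math

def iteration_interval(q_ops, n_alus, dep_cycles, k_unroll):
    """Lower bound on the iteration interval t_iter for k_unroll
    unrolled iterations: the larger of the resource bound and the
    dependence bound."""
    # Resource bound: resource type i must issue q_i operations
    # on the n_i available ALUs of that type.
    t_res = max(math.ceil(q / n) for q, n in zip(q_ops, n_alus))
    # Dependence bound: a cycle of total delay delta whose iteration
    # distances sum to K repeats once every K iterations.
    t_dep = max(math.ceil(k_unroll * delta / k) for delta, k in dep_cycles)
    return max(t_res, t_dep)

# 100 integer operations per 10 unrolled iterations on 4 ALUs, and
# one dependence cycle of delay 3 spanning 2 iterations.
t_iter = iteration_interval([100], [4], [(3, 2)], k_unroll=10)
```

Here the resource bound (25) dominates the dependence bound (15).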
`
2.3 Effects of memory access
The power of a custom circuit often lies in its use of a custom address generator to reduce explicit address calculation instructions. This option can be successfully exploited by both coprocessor models, the FPGA-based and the VLIW. Furthermore, both models can gain from customising memory access to fit the data bit width. We expect these and other memory access optimisation techniques to produce equivalent benefits for both models; they are therefore not directly included in the analysis.
`
2.4 Best-case comparative analysis
We can now compare the performance of both models, neglecting the pipeline fill cycles, by determining the speedup:

    SU_fpga = (T_vliw × t_vliw) / (T_fpga × t_fpga) = (k_fpga / k_vliw) × t_iter × (t_vliw / t_fpga)

The speedup is affected by the number of concurrent copies of the circuit (k_fpga) mapped onto the FPGA. Since the areas of both models have to be the same, we can determine k_fpga in terms of the equivalent number of integer ALUs as follows:

    k_fpga = ( n_vliw^int-alu + Θ × n_vliw^fp-alu ) / ( Σ_{int-i} n_fpga^i × Ω_i + Θ × Σ_{fp-i} n_fpga^i × Ω_i )    Eq(1)

where Ω_i is the area of an FPGA node of type i relative to the matching ALU (a_fpga^i / a_vliw^int-alu for integer nodes, a_fpga^i / a_vliw^fp-alu for floating-point nodes), and Θ = a_vliw^fp-alu / a_vliw^int-alu.
Using t_iter(resources), t_iter(dependence), and the speedup equation, we can determine the conditions under which an FPGA-based coprocessor is virtually guaranteed to have better performance than a VLIW engine:

    p_min ≥ Δ / k_fpga    Eq(2)

    max[ q_vliw^int-alu / n_vliw^int-alu , q_vliw^fp-alu / n_vliw^fp-alu ] ≥ (k_vliw / k_fpga) × Δ    Eq(3)

We can further simplify Eq(1), Eq(2), and Eq(3) by considering integer arithmetic only and substituting k_fpga = n_vliw^int-alu / Σ_{int-i} n_fpga^i × Ω_i to get:

    SU_fpga = ( t_iter / (k_vliw × Δ) ) × ( n_vliw^int-alu / Σ_{int-i} n_fpga^i × Ω_i )    Eq(4)

    p_min ≥ ( Δ / n_vliw^int-alu ) × Σ_{int-i} n_fpga^i × Ω_i    Eq(5)

    q_vliw^int-alu ≥ k_vliw × Δ × Σ_{int-i} n_fpga^i × Ω_i    Eq(6)

We refer to Ω and Δ as the area and delay overheads, respectively, of a particular circuit implementation compared to an ALU's area and delay. They are inherent characteristics of the implementation platform (FPGA in this case) and limit its maximum achievable speedup. To sense how much speedup an adaptive coprocessor can deliver for a given fixed area, and whether an algorithm has the necessary characteristics to make it suitable for adaptive coprocessor implementation, we need to estimate the minimum value of Ω × Δ for arithmetic circuits implemented on FPGA platforms.
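A minimal sketch of the integer-only viability test implied by Eq(5) and Eq(6); the function name and every number in the example are illustrative assumptions, not values from the paper.

```python
def fpga_better(p_min, q_int, n_int_alu, k_vliw, delta, node_overheads):
    """Integer-only test in the spirit of Eq(5) and Eq(6): the FPGA
    coprocessor is virtually guaranteed to win when the loop's critical
    iteration period and its instruction count per unrolled iteration
    both meet their Omega/Delta-scaled thresholds."""
    sum_n_omega = sum(node_overheads)            # sum_i n_i * Omega_i
    p_bound = delta * sum_n_omega / n_int_alu    # Eq(5) threshold
    q_bound = k_vliw * delta * sum_n_omega       # Eq(6) threshold
    return p_min >= p_bound and q_int >= q_bound

# A loop with p_min = 1 and 20 instructions per iteration, against a
# VLIW die worth 20 integer ALUs; 10 cheap circuit nodes assumed.
ok = fpga_better(p_min=1.0, q_int=20, n_int_alu=20,
                 k_vliw=10, delta=0.2, node_overheads=[0.5] * 10)
```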
`
`3. Experimental Methodology
To examine how efficiently FPGAs implement arithmetic circuits, we need to eliminate technology and design variations and create an "even level" for comparison. This section describes how this even playing field is established. We first describe the cell architecture that is used throughout this paper and detail our models for estimating the area and delay of any FPGA cell. Then, our choices for arithmetic test circuits and implementation procedure are explained. In all discussion to follow, we consider only SRAM-programmable FPGAs, since only they provide the flexible platform necessary for field re-programmability.
`
`3.1 FPGA cell architecture
We examined 15 different FPGA cell architectures that span the range of current research and commercial arrays. The function generators included a 2-input NAND gate, a 2-input Universal Logic Module (ULM.2) capable of implementing any of the 16 2-input Boolean logic functions [8], look-up tables (LUT) of input sizes 3, 4, 5, and 6, and finally, the cell architectures of both the Altera FLEX-8000 [9] and the Xilinx XC5000 [10], which include specialised hardware to speed up carry propagation and wide gate implementation. All cells also incorporated D-type flip-flops (FF). The cell interconnection capabilities examined included extensive neighbour connections with 8, 12, and 16 possible neighbours, channelled 2D arrays with 4-neighbour connections, or channelled arrays with fully, or partially, connected clusters of cells similar to the Altera FLEX-8000 and the Xilinx XC5000 array architectures.
Of all these cell types, the 3-input LUT cell proved the best overall for arithmetic circuit implementations. We elect to use it, with a 2D channelled array architecture for communication, as the example cell throughout this paper. A neighbour-interconnection-only array may also be used and will give similar numerical results. The chosen cell is based on a look-up table design similar in functionality to other look-up table model proposals [11]. It incorporates 4-nearest-neighbour connections as a vital way to reduce delay and improve routability. Figure 2 gives a conceptual
`
`
`
`
diagram of this cell. The routing channel width, W, is assumed to be the same for both the vertical and the horizontal channels. For LUTs with 3, 4, 5, and 6 inputs, the average minimum channel widths necessary for routing have been observed to be 9, 11, 11, and 12 respectively [11]. We therefore adopt a channel width of 9 for this model cell, although the actual channel width should probably be slightly higher.
`
Figure 2. FPGA cell model with a 3-input look-up table as a function generator, a D flip-flop, direct north, south, east, and west neighbour connections, and global horizontal and vertical channel routing through switch and connection boxes.
`
Limitations. We do not account for all the factors affecting the implementation and performance. Specifically, we leave issues such as external access, programming, testability, clock and control signal distribution, clock skew, and power consumption for future work. Of the global programming logic and network, we only include the cost of the communication channel network and the number of SRAM configuration bits within a cell as part of the cost of the cell. These limitations bias Ω in favour of the FPGA-based coprocessor model.
`
3.2 Area measurement
The area of an FPGA cell is approximated using a transistor density coefficient metric (α) in µm²/transistor. This density coefficient is dependent on the fabrication process technology, layout methodology, and the circuit logic structure used. It is obtained by averaging layout area per transistor over all cells available in a library or over samples of real designs. We assume a normalised function generator logic density coefficient of α_f, a configuration memory normalised density coefficient of α_m, and a routing pitch normalised density coefficient of α_r.
`
Figure 3a is a representative model of the total cell area, showing also the routing pitch between the physical channel tracks. We assume that the Routing Configuration Memory (RCM) bits used for the channels' switch and connection boxes are distributed between the channel tracks, as shown in Figure 3b. It is therefore reasonable to assume that α_m equals α_r. Other similar models also assume a distributed RCM [11]. The number of memory bits distributed within the channels depends on the connection and switch boxes. The connection box flexibility F_c is defined as the number of channel tracks each input and output can be connected to. The switch box flexibility F_s is defined as the number of possible tracks each incoming track can be connected to.

Figure 3. A representation of the total area of an FPGA cell. (a) Area model showing the vertical and horizontal tracks, the I/O connection boxes, the switch boxes, and the logic and its configuration memory (A_func + A_mem). (b) The Routing Configuration Memory (RCM) bits are distributed between the channel tracks.
`
It has been shown [12] that F_c has to be greater than half the number of tracks for 100% routing completion to be possible. Additionally, only a small F_s value is needed to achieve 100% routing completion. In our model, we choose F_c = 0.75W and F_s = 3. The routing pitch is determined by a five-transistor SRAM bit (a_m) and a single pass-transistor PIP (a_p) and is defined as

    r_pitch = √( α_r × (a_m + a_p) ) = √( 6 × α_r )

The FPGA cell is modelled as a square die area having the following characteristics:

    A_cell = A_func + A_mem + A_route
    A_func = α_f × (N_ftrn + N_rtrn)
    A_mem = α_m × a_m × (N_fcm + N_rcm)
    A_route = (r_pitch × W_h) × (r_pitch × W_v) + r_pitch × (X × W_h + Y × W_v)
        where X × Y = A_func + A_mem
    A_comm = (α_f × N_rtrn) + (α_m × a_m × N_rcm) + A_route
`
`
`
`
`
where,
A_cell = the area of an FPGA cell
A_func = logic area used for function generation
A_mem = memory area used for configuration
A_route = the area of the routing channels within a cell
A_comm = the area of the cell used for communication
N_ftrn = # of transistors used for function generation
N_rtrn = # of transistors used for routing logic & muxes
N_fcm = # of memory bits for LUTs and control
N_rcm = # of memory bits used for routing configuration
a_m = # of transistors in a memory bit = 5
W_h = # of routing tracks in each horizontal channel
W_v = # of routing tracks in each vertical channel
The total area of a circuit implementation depends on how the mapping from logic equations to FPGA cell functions is performed and how they are placed onto the cell array. If N_cell is the number of FPGA cells used to implement the circuit, the total circuit area is

    A_circuit = N_cell × A_cell
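The cell-area model above can be sketched as follows, assuming (as the model does) a square logic-plus-memory block, so X = Y = √(A_func + A_mem). The transistor counts in the example are invented for illustration, not measured values.

```python
import math

def cell_area(alpha_f, alpha_m, alpha_r, n_ftrn, n_rtrn, n_fcm, n_rcm,
              w_h, w_v, a_m=5, a_p=1):
    """A_cell = A_func + A_mem + A_route, with the routing pitch set by
    a five-transistor SRAM bit plus one pass-transistor PIP."""
    a_func = alpha_f * (n_ftrn + n_rtrn)
    a_mem = alpha_m * a_m * (n_fcm + n_rcm)
    r_pitch = math.sqrt(alpha_r * (a_m + a_p))   # = sqrt(6 * alpha_r)
    side = math.sqrt(a_func + a_mem)             # X = Y for a square block
    a_route = (r_pitch * w_h) * (r_pitch * w_v) + r_pitch * side * (w_h + w_v)
    return a_func + a_mem + a_route

# Normalised density coefficients and a hypothetical 3-input LUT cell:
# 60 logic transistors, 30 configuration bits, channel width 9 each way.
area = cell_area(1.0, 1.0, 1.0, 60, 0, 30, 0, 9, 9)
```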
`
3.3 Delay measurement
The delay of an FPGA cell is approximated using the method of "logical effort" proposed by Sutherland and Sproull [13][14]. The method is based on a simple RC model for transistors and provides a first-order approximation of a circuit's delay. It defines τ as the actual time, for a fabrication process, that corresponds to a delay unit. The value of τ can be measured from the frequency of oscillation of a ring oscillator. For each type of logic gate, the method assigns delay unit values based on the topology of the circuit element, the difficulty that an element has in driving capacitive loads, and the parasitic capacitance exhibited by the gate. The delay of an ideal inverter that drives another identical inverter is the sum of a single unit delay (1) and the parasitic delay value P_inv. Typically, for a 3µ CMOS process, τ = 0.5 ns and P_inv = 0.6τ, while for a 0.5µ CMOS process, τ = 0.1 ns and P_inv = 0.5τ. All other gate delays are measured relative to that of an ideal inverter.
Logical effort is used to arrive at a delay value for each type of FPGA cell. Separate values are determined for each cell input to output, the set-up time, and the synchronous clock to output delay for each cell type. These delays also include the effects of internal fan-outs.
After a circuit is mapped onto an array, its delay depends on the number of cells (N_depth) along the longest path from an input to an output, as well as the routing delay between these cells. The routing delay between neighbour cells is accounted for by the explicit loading on that cell's output. The routing delay between non-neighbouring cells in a channelled array is more difficult to estimate, especially without knowledge of the exact placement and routing information and the capacitive loading on each level due to the programmable routing switches along the path. The total execution time of a circuit, in τ units, can be determined as the sum of all delays along the longest path as follows:

    T_circuit = Σ_{i=1}^{N_depth} [ d_ab^i + d_route^i ]

where d_ab^i is the delay, in τ units, between input node a and output node b of a cell at circuit depth level i, and d_route^i is the routing delay, in τ units as well, between the output of the cell at level i and the input of another cell at level i+1. In this investigation, we will assume d_route to be zero. This assumption will bias Δ in favour of the FPGA-based coprocessor model.
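The path-delay sum can be sketched as below; the per-level delays in the example are made-up values, and the routing delays default to zero as assumed in the text.

```python
def circuit_delay(cell_delays, route_delays=None):
    """Total delay T_circuit along the longest input-to-output path, in
    tau units: per-level cell delays d_ab plus routing delays d_route
    (taken as zero here, as in the text)."""
    if route_delays is None:
        route_delays = [0.0] * len(cell_delays)
    return sum(d + r for d, r in zip(cell_delays, route_delays))

# Three logic levels of 4.5 tau each; for a 0.5u process, tau = 0.1 ns.
total_tau = circuit_delay([4.5, 4.5, 4.5])
total_ns = total_tau * 0.1
```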
`
3.4 Implementation Procedure
We determine the number of cells needed to implement a circuit (N_cell) and the depth of the implementation (N_depth) by a structure-preserving direct hand mapping from the original circuit designs. Automated mapping and routing results vary significantly with different tools and for different optimisation criteria. They also significantly alter the overall high-level organisation, resulting in low area utilisation even for regular circuit structures. We can also assume that with improved FPGA cell architectures, mapping, placement, and routing technologies, the routing structure is sufficient to complete the mapped network interconnections and give a very high array utilisation. The results will therefore provide a lower bound on the cost and performance of different implementations, which is exactly what we are looking for. Different designs are compared based on their implementation efficiency, defined as the area times delay product 'AT', or cost/performance, for that circuit. The lower the 'AT' value, the more efficient the implementation.
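The 'AT' comparison can be sketched as follows; both designs' cell counts and delays are hypothetical numbers chosen only to show the selection step.

```python
def at_product(area_cells, delay_tau):
    """Implementation efficiency: the area-times-delay product 'AT'.
    Lower values indicate a more efficient implementation."""
    return area_cells * delay_tau

# Two hypothetical adder mappings: (cells used, delay in tau units).
designs = {"carry-ripple": (64, 130.0), "carry-select": (150, 40.0)}
best = min(designs, key=lambda name: at_product(*designs[name]))
```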
`
3.5 Choice of arithmetic circuits
We mapped 10 different integer addition circuit designs representing several delay and area optimisation techniques. They included serial, carry-ripple, carry-skip, several one-level and two-level carry-lookahead, conditional-sum, carry-select, and pyramid adders. For integer multipliers, we only considered 2's complement multipliers with 1-bit Booth recoding. We also mapped 6 different multiplication circuit designs, including serial, sequential, sequential with one-level and two-level carry-save logic, parallel array, and bit systolic. For the sequential and array multipliers, which require an adder, we tried all the integer adders above to determine the ones that produce the best 'AT' results. All integer circuits were examined for bit widths varying from 4 bits to 64 bits.
For floating-point numbers we implemented a subset of the 32-bit and 64-bit IEEE specification standard and mapped both addition and multiplication circuits. Not all options referred to in the standard were included. In particular, we assumed truncation is performed after addition or multiplication rather than rounding, to eliminate one normalisation step. Ten mantissa adders and six mantissa multipliers were tested, corresponding to the types of integer units above. The exponent adders used were chosen to minimise the 'AT' value for that design. Finally, we also implemented a pipelined version of those integer and floating-point circuits above that are actually pipelineable without incurring any increase in area cost.
Limitations. We did not include the control-path in either the cost or performance calculations. Furthermore, we did not examine any division algorithms or combination functions such as multiply-accumulate. We also restricted the investigation by not considering table-lookup techniques, distributed arithmetic, or the potential of other number systems.
`
4. Implementation Efficiency Gap
In an earlier study [15], we demonstrated that there is a large implementation efficiency gap between FPGA and VLSI platforms. We showed that a system implemented on FPGAs will require as much as 100 times more die area than a custom VLSI implementation and would be about 10 times slower. That result demonstrates how much worse an FPGA implementation is compared to a VLSI implementation. In this section we estimate the implementation efficiency gap between an FPGA platform and a programmable ALU as defined in Section 2. The ALU parameters we use are based on estimates for the area and delay of complementary logic CMOS circuits, which represent upper bounds on VLSI implementation cost and performance. Therefore, Ω and Δ, defined as the ratios of the area and delay results respectively of an FPGA-based implementation to those of an ALU, represent lower bounds, while actual area and delay overheads could certainly be much higher. We will now compare the efficiency of pipelined, and non-pipelined, FPGA-based arithmetic circuits to that of ALUs.
We select two pipelined integer ALUs, 32-bit and 64-bit, capable of addition, subtraction, multiplication, shift to the left, arithmetic shift to the right, and finally, logical AND, OR, and NOT operations. They use a fast array multiplier to achieve small multiplication delays at the expense of high die area. We also select two pipelined floating-point ALUs, 32-bit and 64-bit, capable of addition, subtraction, and multiplication, which also use an array multiplier as their mantissa multiplier/adder. For the integer ALUs, we assume that addition takes a single cycle while multiplication needs 2 cycles on a 32-bit ALU and 3 cycles on a 64-bit ALU. For the floating-point ALUs, we assume 4 execute stages for a 32-bit FP ALU and 6 stages for a 64-bit ALU. The result of the comparison is given in Table 1, which gives the 'best case' results for both pipelined and non-pipelined FPGA implementations. Note that multiplying Ω and Δ in the table may not correspond to the Ω × Δ entry because all results have been rounded.
By limiting pipelined integer circuits to only those pipelineable without additional area, multiplication is restricted to array multipliers, which have high area costs. Similarly, floating-point multiplication is not able to make use of the compactness of sequential multipliers and incurs a large area cost as well. The effects of these restrictions can be observed in Table 1. The first thing to notice is that the pipelined integer addition implementations give reasonably efficient results, even though the area overhead is still large. Of course, this is an improvement over the non-pipelined results, and is mostly due to the low cycle time of the FPGA implementations and, subsequently, the low delay overhead. This suggests that addition on FPGAs should be pipelined.
The second observation is that pipelined integer multiplication efficiency is better than in the non-pipelined case, but at the expense of very high area cost: up to 60 times, from a maximum of 4.1 times in the non-pipelined case. Unlike addition, however, this result suggests that multiplication is not suitable at all for FPGA implementation. If it is required, with real-estate at a premium as in our case, multiplication should not be pipelined, so as to save on area. Finally, the overhead for floating-point is much larger than that of the integer implementations. The large overhead magnitude raises doubt regarding the usefulness of floating-point operations on FPGAs under any circumstances.
TABLE 1
Characterising FPGA-based Arithmetic Circuit Implementations
(Best-case area overhead Ω, delay overhead Δ, and efficiency Ω × Δ of pipelined and non-pipelined FPGA adders and multipliers, integer and floating-point, for circuit bit widths from 4 to 64, measured against 32-bit and 64-bit ALUs.)

Serial Arithmetic. Serial arithmetic can deliver low area overheads for low operand width values. However, for larger widths, 32 bits and 64 bits for example, the area overhead is still considerably high because the registers used in accumulating the result have to be included in the cost. It is also evident from our results that the low performance of serial designs makes their use on FPGAs unacceptable for high throughput computations.
Summary. The results given here highlight the main problem with FPGA-based arithmetic circuits. They cannot compete with other programmable devices such as ALUs, at least at the circuit level. Pipelining can be used to bring the cycle time of an FPGA implementation close to the delay of an ALU, reducing the delay overhead. But the intractable problem is the area overhead. Consider, from Table 1, that m arithmetic nodes of type and data width i have a total die area equivalent to m × Ω_i ALUs. Serial and sequential arithmetic can reduce the implementation cost, but at the expense of decreasing the performance and increasing the delay overhead. Since on-chip real-estate is valuable and the area cost of any FPGA implementation is already high, FPGA design trade-offs should be optimised for area. Indeed, neither floating-point arithmetic nor integer multiplication has shown any implementation attributes favourable to FPGA platforms. For the case of integer multiplication, this was also reported by a study on commercial FPGAs [16].

5. Viability of Adaptive Coprocessors
From the results of the previous section and the 6 numbered equations in Section 2, we can evaluate the viability of the adaptive coprocessor proposal. For clarity, we assume either integer or floating-point operations only. Table II examines the critical iteration period bound (p_min) and the minimum number of instructions in an unrolled iteration (q_vliw^i) as given by Eq(5) and Eq(6). These are the conditions necessary for a VLIW coprocessor machine to have better performance than an FPGA-based coprocessor, i.e. for the speedup to be larger than one. Values of both parameters are determined for a total coprocessor die area equivalent to approximately 5M and 20M transistors, assuming a hypothetical algorithm requiring a total node count Σ n_fpga^i = 10 with a reasonable value of loop unrolling, k_vliw = 10. We examine four cases of the algorithm: all nodes are addition operations, all are multiplication operations, 9 are adds and one is a multiply, and finally, 5 are adds and 5 are multiplies.
Total area vs. speedup. Although a die area equivalent to 5M transistors may seem to be large for such a small number of function nodes, already some data-path widths, blanked out in Table II, will either not fit onto an FPGA of this size or will not achieve better performance than the VLIW model under any p_min and q_vliw^i conditions. Also, the smaller die area forces the use of the more compact, but slower, non-pipelined circuits, giving worse results than the larger die area. On the other hand, from Eq(1) and Eq(4), as the available area, or number of ALUs, increases, the iteration interval (t_iter) decreases, reducing the potential speedup.
Minimum instructions per iteration (q_vliw^i). Considering the small die area first, rows 1 and 2 show that the number of compiled VLIW instructions per iteration, before loop unrolling, that would guarantee FPGA superiority is relatively small. It can be easily satisfied by many algorithms that only require a limited number of multipliers. Rows 3 and 4 show that as the number of multiply operations increases, the number of instructions per iteration has to be very high for an FPGA to beat a VLIW engine, particularly for data bit widths larger than 16 bits. This suggests that highly concurrent algorithms of limited data-path widths, such as some image processing algorithms, are suited for implementation on integrated
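The Eq(6) threshold for the four node mixes examined in Section 5 can be sketched as below. The Ω values here are stand-ins chosen only to show the trend (adds cheap, multiplies expensive), not the paper's measured overheads.

```python
def q_threshold(k_vliw, delta, node_overheads):
    """Eq(6)-style minimum number of instructions per unrolled
    iteration needed before the FPGA coprocessor is guaranteed to
    win; node_overheads holds one Omega_i per circuit node."""
    return k_vliw * delta * sum(node_overheads)

# Illustrative overheads: adds are cheap on the FPGA, multiplies are not.
omega_add, omega_mult = 1.0, 15.0
cases = {
    "all adds": [omega_add] * 10,
    "9 adds, 1 mult": [omega_add] * 9 + [omega_mult],
    "5 adds, 5 mults": [omega_add] * 5 + [omega_mult] * 5,
    "all mults": [omega_mult] * 10,
}
thresholds = {name: q_threshold(10, 0.2, mix) for name, mix in cases.items()}
```

As in Table II, the threshold climbs steeply as multiplies replace adds, so only multiply-light loops can easily guarantee FPGA superiority.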
`
`
`
TABLE II
Conditions determining the implementation viability of an FPGA-based coprocessor
(For Σ n_fpga^i = 10 and k_vliw = 10: the values of p_min and q_vliw^i above which an FPGA-based coprocessor is better, for the four node mixes (all adds; 9 adds and 1 multiply; 5 adds and 5 multiplies; all multiplies), at the small and large die areas, against 32-bit and 64-bit ALUs, integer and FP, across data widths from 4 to 64 bits.)