On the Viability of FPGA-Based Integrated Coprocessors
Osama T. Albaharna, Peter Y. K. Cheung, and Thomas J. Clarke
Information Engineering Section
Department of Electrical and Electronic Engineering
Imperial College of Science, Technology and Medicine
Exhibition Road, London SW7 2BT, UK
Abstract
This paper examines the viability of using integrated programmable logic as a coprocessor to support a host CPU core. This adaptive coprocessor is compared to a VLIW machine in terms of both die area occupied and performance. The parametric bounds necessary to justify the adoption of an FPGA-based coprocessor are established. An abstract Field Programmable Gate Array model is used to investigate the area and delay characteristics of arithmetic circuits implemented on FPGA architectures to determine the potential speedup of FPGA-based coprocessors.
Our analysis shows that integrated FPGA arrays are suitable as coprocessor platforms for realising algorithms that require only limited numbers of multiplication instructions. Inherent FPGA characteristics limit the data-path widths that can be supported efficiently for these applications. An FPGA-based adaptive coprocessor requires a large minimum die area before any advantage over a VLIW machine of a comparable size can be realised.
1. Introduction
The ever increasing spare transistor capacity has so far been absorbed into only a limited number of architectural features. Integrated programmable logic has emerged as one of the very few novel architectural ideas with the potential to exploit this abundant resource. A custom coprocessor can directly exploit the concurrency available in applications, algorithms, and code segments. An FPGA-based coprocessor can further adapt to any demands for special-purpose hardware by mapping an algorithm onto run-time configurable logic. These versatile "adaptive" coprocessors can be used to augment the instruction set of a core CPU or as special-purpose custom computing engines. Real-time applications can also swap multiple functions and subroutines directly onto the reconfigurable hardware during execution [1]-[5].
The adaptive coprocessor model challenges the more established general-purpose techniques that exploit fine-grain instruction level concurrency. We ask: under what architectural conditions can the integration of a core CPU and an FPGA-based coprocessor on a single die outperform the possible alternative of using a Very Long Instruction Word (VLIW) engine on that same die area?
This paper addresses this question through four stages. First, in Section 2, the cost and performance bounds of both computational models, the VLIW and the FPGA coprocessor, are examined and a set of critical parameters is determined. Section 3 describes the experimental methodology used to establish the characteristics of arithmetic computation on FPGAs, and Section 4 summarises the results of this investigation. In Section 5, we explore the implications of these results on the achievable cost and performance limits of FPGA-based coprocessors. Finally, in Section 6 we apply the ideas and conclusions presented in earlier sections to a typical computational example to determine its suitability for FPGA-based adaptive coprocessor implementation.
2. Computational Models
An adaptive coprocessor uses silicon real-estate to integrate more programmable logic. This can then be used to implement larger custom circuits or to exploit more concurrency. On the other hand, a VLIW machine will use this same die area to increase the number of ALUs and execute more instructions per cycle. In this section, we examine the cost and performance of implementing an algorithm on both computational models, organised as in Figure 1.
[Figure: a core microprocessor with its I-cache and D-cache, coupled to a coprocessor (FPGA or VLIW).]
Figure 1. Target coprocessor system organisation
0-8186-7548-9/96 $05.00 © 1996 IEEE
IPR2018-01600
EXHIBIT 2065
PATENT OWNER DIRECTSTREAM, LLC
EX. 2079, p. 1
2.1 FPGA-based coprocessor organisation
To achieve a high coprocessor throughput, we assume a pipelined implementation of all algorithms. This means that the performance of the FPGA-based coprocessor depends on the cycle time of the pipeline (t_fpga), the number of iterations the circuit is used (N_fpga), the number of concurrent copies of the circuit mapped onto the FPGA (k_fpga), and the number of cycles needed to fill the pipeline (c_fill). The total number of cycles is then

T_fpga = N_fpga / k_fpga + c_fill

The area cost is the sum of the areas for all arithmetic nodes in a design. We assume integer nodes and floating-point nodes are used. All other operator nodes are expressed as a percentage (c_fpga) of the area. If a_i^fpga is the area of node type i and n_i^fpga the number of nodes of type i used in the circuit, the cost of a circuit implemented on an adaptive coprocessor can be expressed as

A_circuit = (1 + c_fpga) × k_fpga × [ Σ_int (n_i^fpga × a_i^fpga) + Σ_fp (n_i^fpga × a_i^fpga) ]
2.2 VLIW machine organisation
We assume the VLIW utilises integer and floating-point ALU units rather than single-operation functional modules, and that they constitute most of its area. All other area is expressed as a percentage (c_vliw) of the total area. If a_i^vliw is the area of a function node of type i and n_i^vliw the number of nodes of type i used, the cost of a VLIW machine can be expressed as

A_vliw = (1 + c_vliw) × [ Σ_int (n_i^vliw × a_i^vliw) + Σ_fp (n_i^vliw × a_i^vliw) ]
In addition to available resources, the performance of a VLIW machine is limited by two types of dependencies [6]: the data dependencies within an iteration and those between iterations. A VLIW program can be viewed as a dependence graph, as in Figure 4, which must be repeated N_vliw times. Data and iteration dependencies are represented by the bold and dashed edges respectively. Since our VLIW processor model uses pipelined ALUs, each node, or operation, in the graph takes a single time unit to execute. The iteration distance, attached to dashed edges, is the number of loop iterations after issuance of S_i that S_j can begin execution. The number of time units it takes to execute a cycle within a dependence graph (δ_c), given maximum resources, is the sum of all nodes along this cycle path. The number of iterations (K_c) it takes the pattern in a cycle to repeat execution is the sum of all iteration distances along this cycle's dependence path. Therefore N_vliw/K_c repetitions of a given cycle will be executed, requiring δ_c × (N_vliw/K_c) cycles.
The minimum time to execute the whole loop is max_c[δ_c × (N_vliw/K_c)] = p_min × N_vliw, where p_min = max_c[δ_c/K_c] is the critical iteration period bound. Using software pipelining [7] and other advanced compiler transformations, it is possible to overlap the execution of several different iterations of the loop. If k_vliw iterations can be unrolled and then scheduled, the iteration interval t_ini is the time units needed to execute an entire iteration of k_vliw unrolled loops. It must satisfy both types of data dependencies as well as resource dependencies. If q_i is the number of times a resource of type i must be used in k_vliw iterations, we can estimate lower bounds on the iteration interval and the maximum VLIW performance as follows:

T_vliw = (N_vliw / k_vliw) × t_ini

t_ini = max[ t_ini(resources), t_ini(dependence) ]

t_ini(resources) ≥ max_i [ q_i / n_i^vliw ]

t_ini(dependence) ≥ max_c [ k_vliw × (δ_c / K_c) ] = k_vliw × p_min

Although the problem of finding an optimal schedule using software pipelining is NP-complete, it has been shown that near-optimal results can often be obtained for loops with both intra- and inter-iteration data dependencies. It has also been shown that hierarchical reduction allows software pipelining to be applied to complex loops containing conditional statements. Program restructuring using loop transformations and optimising data locality using space tiling techniques can also be applied to increase both the fine-grain and coarse-grain parallelism available in nested loops.
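The two lower bounds above can be sketched as follows (a minimal model with our own naming; a real modulo scheduler must additionally find a legal schedule at the chosen interval):

```python
from math import ceil

def iteration_interval_bound(q, n_alus, cycle_delays, cycle_distances, k_unroll):
    # Resource bound: each resource type i must issue q_i operations
    # on its n_i units within one iteration interval.
    resource = max(ceil(qi / ni) for qi, ni in zip(q, n_alus))
    # Dependence bound: the critical iteration period p_min = max(delta_c / K_c)
    # over all dependence cycles, scaled by the k_unroll unrolled iterations.
    p_min = max(d / k for d, k in zip(cycle_delays, cycle_distances))
    dependence = ceil(k_unroll * p_min)
    return max(resource, dependence)
```

For example, 12 integer operations on 4 ALUs and 4 floating-point operations on 2 ALUs give a resource bound of 3 cycles; a dependence cycle of delay 6 spanning 2 iterations gives p_min = 3, so unrolling 2 iterations yields an interval of at least 6 cycles.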
2.3 Effects of memory access
The power of a custom circuit often lies in its use of a custom address generator to reduce explicit address calculation instructions. This option can be successfully exploited by both coprocessor models, the FPGA-based and the VLIW. Furthermore, both models can gain from customising memory access to fit the data bit width. We expect these and other memory access optimisation techniques to produce equivalent benefits for both models, and they are therefore not directly included in the analysis.
2.4 Best-case comparative analysis
We can now compare the performance of both models, neglecting the pipeline fill cycles, by determining the speedup:

SU_fpga = T_vliw / T_fpga = (k_fpga / k_vliw) × (t_ini × t_vliw) / t_fpga

The speedup is affected by the number of concurrent copies of the circuit (k_fpga) mapped onto the FPGA. Since the areas of both models have to be the same, we can determine k_fpga in terms of the equivalent number of integer ALUs as follows:

k_fpga = [ n_int-alu^vliw + σ × n_fp-alu^vliw ] / [ Σ_int (n_i^fpga × Ω_i) + σ × Σ_fp (n_i^fpga × Ω_i) ]    Eq(1)

where Ω_i is the area of FPGA node type i expressed in ALU areas, and

σ = a_fp-alu^vliw / a_int-alu^vliw
Using t_ini(resources), t_ini(dependence), and the speedup equation, we can determine the conditions for which an FPGA-based coprocessor is virtually guaranteed to have better performance than a VLIW engine:

p_min ≥ Δ / k_fpga    Eq(2)

max_i [ q_i / n_i^vliw ] ≥ (k_vliw / k_fpga) × Δ    Eq(3)

SU_fpga = (k_fpga / k_vliw) × (t_ini / Δ)    Eq(4)

We can further simplify Eq(1), Eq(2), and Eq(3) by considering integer arithmetic only and substituting k_fpga to get:

p_min ≥ (Δ / n_int-alu^vliw) × Σ_int (n_i^fpga × Ω_i)    Eq(5)

q_int ≥ k_vliw × Δ × Σ_int (n_i^fpga × Ω_i)    Eq(6)

We refer to Ω and Δ as the area and delay overheads, respectively, of a particular circuit implementation compared to an ALU's area and delay. They are inherent characteristics of the implementation platform (FPGA in this case) and limit its maximum achievable speedup. To sense how much speedup an adaptive coprocessor can deliver for a given fixed area, and whether an algorithm has the necessary characteristics that would make it suitable for adaptive coprocessor implementation, we need to estimate the minimum values of Ω and Δ for arithmetic circuits implemented on FPGA platforms.
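To make the comparison concrete, here is a small sketch that evaluates Eq(1) and Eq(4) for an integer-only workload. The function names are ours, and the Ω, Δ, and node-count values below are illustrative placeholders, not measured results:

```python
def k_fpga(n_int_alu, node_counts, omegas):
    # Eq(1), integer only: how many concurrent circuit copies fit in
    # the die area of the VLIW's n_int_alu integer ALUs, where omega_i
    # is the FPGA area of node type i expressed in ALU areas.
    return n_int_alu / sum(n * w for n, w in zip(node_counts, omegas))

def speedup(kf, k_vliw, t_ini, delta):
    # Eq(4): SU = (k_fpga / k_vliw) * t_ini / delta, where delta is the
    # FPGA-to-ALU cycle-time (delay) overhead.
    return (kf / k_vliw) * t_ini / delta

# Illustrative numbers: 4 integer ALUs; a circuit of 2 adder nodes at
# an area overhead of 2 ALUs each; delay overhead of 2.0.
kf = k_fpga(4, [2], [2.0])   # one concurrent copy fits
su = speedup(kf, 2, 8, 2.0)  # VLIW forced to an 8-cycle interval
```

The FPGA wins only when the iteration interval forced on the VLIW, by dependences or resource limits, exceeds (k_vliw / k_fpga) × Δ, which is exactly the condition expressed by Eq(2) and Eq(3).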
3. Experimental Methodology
To examine how efficiently FPGAs implement arithmetic circuits, we need to eliminate technology and design variations and create an "even level" for comparison. This section describes how this even playing field is established. We first describe the cell architecture that is used throughout this paper and detail our models for estimating the area and delay of any FPGA cell. Then, our choices of arithmetic test circuits and implementation procedure are explained. In all discussion to follow, we consider only SRAM-programmable FPGAs, since only they provide the flexible platform necessary for field reprogrammability.
3.1 FPGA cell architecture
We examined 15 different FPGA cell architectures that span the range of current research and commercial arrays. The function generators included a 2-input NAND gate, a 2-input Universal Logic Module (ULM.2) capable of implementing any of the 16 2-input Boolean logic functions [8], look-up tables (LUT) of input sizes 3, 4, 5, and 6, and finally, the cell architectures of both the Altera FLEX-8000 [9] and the Xilinx XC5000 [10], which include specialised hardware to speed up carry propagation and wide gate implementation. All cells also incorporated D-type flip-flops (FF). The cell interconnection capabilities examined included extensive neighbour connections with 8, 12, and 16 possible neighbours, channelled 2D arrays with 4-neighbour connections, or channelled arrays with fully, or partially, connected clusters of cells similar to the Altera FLEX-8000 and the Xilinx XC5000 array architectures.
Of all these cell types, the 3-input LUT cell proved the best overall for arithmetic circuit implementations. We elect to use it, with a 2D channelled array architecture for communication, as the example cell throughout this paper. A neighbour-interconnection-only array may also be used and will give similar numerical results. The chosen cell is based on a look-up table design similar in functionality to other look-up table model proposals [11]. It incorporates 4-nearest-neighbour connections as a vital way to reduce delay and improve routability. Figure 2 gives a conceptual diagram of this cell. The routing channel width, W, is assumed to be the same for both the vertical and the horizontal channels. For LUTs with 3, 4, 5, and 6 inputs, the average minimum channel widths necessary for routing have been observed to be 9, 11, 11, and 12 respectively [11]. We therefore adopt a channel width of 9 for this model cell, although the actual channel width should probably be slightly higher.
[Figure: a 3-input look-up table feeding a D flip-flop and a 2×1 output multiplexer, with inputs from and outputs to the four neighbours and to the channel routing through connection boxes and a switch box.]
Figure 2. FPGA cell model with a 3-input look-up table as a function generator, direct north, south, east, and west neighbour connections, and global horizontal and vertical channel routing.
Limitations. We do not account for all the factors affecting the implementation and performance. Specifically, we leave issues such as external access, programming, testability, clock and control signal distribution, clock skew, and power consumption for future work. Of the global programming logic and network, we include only the cost of the communication channel network and the number of SRAM configuration bits within a cell as part of the cost of the cell. These limitations bias Ω in favour of the FPGA-based coprocessor model.
3.2 Area measurement
The area of an FPGA cell is approximated using a transistor density coefficient metric (α) in µm²/transistor. This density coefficient is dependent on the fabrication process technology, layout methodology, and the circuit logic structure used. It is obtained by averaging layout area per transistor over all cells available in a library or over samples of real designs. We assume a normalised function generator logic density coefficient of α_f, a configuration memory normalised density coefficient of α_m, and a routing pitch normalised density coefficient of α_r.
Figure 3a is a representative model of the total cell area, showing also the routing pitch between the physical channel tracks. We assume that the Routing Configuration Memory (RCM) bits used for the channels' switch and connection boxes are distributed between the channel tracks as shown in Figure 3b. It is therefore reasonable to assume that α_m equals α_r. Other similar models also assume a distributed RCM [11]. The number of memory bits distributed within the channels depends on the connection and switch boxes. The connection box flexibility F_c is defined as the number of channel tracks each input and output can be connected to. The switch box flexibility F_s is defined as the number of possible tracks each incoming track can be connected to.
Figure 3. A representation of the total area of an FPGA cell. (a) Area model showing the vertical and horizontal tracks. (b) The Routing Configuration Memory (RCM) bits are distributed between the channel tracks.
It has been shown [12] that F_c has to be greater than half the number of tracks for 100% routing completion to be possible. Additionally, only a small F_s value is needed to achieve 100% routing completion. In our model, we choose F_c = 0.75W and F_s = 3. The routing pitch is determined by a five-transistor SRAM bit (a_m) and a single pass-transistor PIP (a_p) and is defined as

r_pitch = √(α_r × (a_m + a_p)) = √(6 × α_r)

The FPGA cell is modelled as a square die area having the following characteristics:

A_cell = A_func + A_mem + A_route
A_func = α_f × (N_fgen + N_rtng)
A_mem = α_m × a_m × (N_fcm + N_rcm)
A_route = [r_pitch² × W_h × W_v] + [r_pitch × (X × W_h + Y × W_v)]
    where X × Y = A_func + A_mem and X = Y
A_comm = (α_f × N_rtng) + (α_m × a_m × N_rcm) + A_route
where,
A_cell = the area of an FPGA cell
A_func = logic area used for function generation
A_mem = memory area used for configuration
A_route = the area of the routing channels within a cell
A_comm = the area of the cell used for communication
N_fgen = # of transistors used for function generation
N_rtng = # of transistors used for routing logic & muxes
N_fcm = # of memory bits for LUTs and control
N_rcm = # of memory bits used for routing configuration
a_m = # of transistors in a memory bit = 5
W_h = # of routing tracks in each horizontal channel
W_v = # of routing tracks in each vertical channel
The total area of a circuit implementation depends on how the mapping from logic equations to FPGA cell functions is performed and how they are placed onto the cell array. If N_cell is the number of FPGA cells used to implement the circuits, the total circuit area is A_circuit = N_cell × A_cell.
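The area model of this section can be sketched as follows (the parameter names are ours, and the α coefficients and transistor counts used in any example are placeholders, not calibrated values):

```python
from math import sqrt

def fpga_cell_area(alpha_f, alpha_m, alpha_r,
                   n_fgen, n_rtng, n_fcm, n_rcm,
                   w_h, w_v, a_m=5, a_p=1):
    # Routing pitch set by a 5-transistor SRAM bit plus one
    # pass-transistor PIP: sqrt(alpha_r * (a_m + a_p)) = sqrt(6 * alpha_r).
    r_pitch = sqrt(alpha_r * (a_m + a_p))
    a_func = alpha_f * (n_fgen + n_rtng)     # function generation + routing logic
    a_mem = alpha_m * a_m * (n_fcm + n_rcm)  # configuration memory bits
    side = sqrt(a_func + a_mem)              # square logic+memory core, X = Y
    a_route = r_pitch ** 2 * w_h * w_v + r_pitch * side * (w_h + w_v)
    return a_func + a_mem + a_route

def circuit_area(n_cells, a_cell):
    # Total area once the mapping determines the cell count N_cell.
    return n_cells * a_cell
```

With zero channel tracks the routing term vanishes and the cell area reduces to the logic plus configuration-memory area, which is a quick sanity check on the model.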
3.3 Delay measurement
The delay of an FPGA cell is approximated using the method of "logical effort" proposed by Sutherland and Sproull [13][14]. The method is based on a simple RC model for transistors and provides a first-order approximation of a circuit's delay. It defines τ as the actual time, for a fabrication process, that corresponds to a delay unit. The value of τ can be measured from the frequency of oscillation of a ring oscillator. For each type of logic gate, the method assigns delay unit values based on the topology of the circuit element, the difficulty that an element has in driving capacitive loads, and the parasitic capacitance exhibited by the gate. The delay of an ideal inverter that drives another identical inverter is the sum of a single unit delay (1) and the parasitic delay value P_inv. Typically, for a 3µ CMOS process, τ = 0.5 ns and P_inv = 0.6τ, while for a 0.5µ CMOS process, τ = 0.1 ns and P_inv = 0.5τ. All other gate delays are measured relative to that of an ideal inverter.
Logical effort is used to arrive at a delay value for each type of FPGA cell. Separate values are determined for each cell input-to-output path, the set-up time, and the synchronous clock-to-output delay for each cell type. These delays also include the effects of internal fan-outs.
After a circuit is mapped onto an array, its delay depends on the number of cells (N_depth) along the longest path from an input to an output, as well as the routing delay between these cells. The routing delay between neighbour cells is accounted for by the explicit loading on that cell's output. The routing delay between non-neighbouring cells in a channelled array is more difficult to estimate, especially without knowledge of the exact placement and routing information and the capacitive loading on each level due to the programmable routing switches along the path. The total execution time of a circuit, in τ units, can be determined as the sum of all delays along the longest path as follows:

T_circuit = Σ_{i=1}^{N_depth} [ d_ab^i + d_route^i ]

where d_ab^i is the delay, in τ units, between input node a and output node b of a cell at circuit depth level i, and d_route^i is the routing delay, in τ units as well, between the output of the cell at level i and the input of another cell at level i+1. In this investigation, we will assume d_route to be zero. This assumption will bias Δ in favour of the FPGA-based coprocessor model.
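Under the d_route = 0 assumption the longest-path delay is a plain sum; this sketch (our own naming) also converts τ units to nanoseconds for a given process:

```python
def circuit_delay(level_delays, route_delays=None, tau_ns=0.5):
    # Sum the cell delays d_ab (in tau units) over the N_depth levels of
    # the longest input-to-output path, plus per-level routing delays
    # d_route (taken as zero in this investigation, biasing Delta low).
    if route_delays is None:
        route_delays = [0.0] * len(level_delays)
    total = sum(d + r for d, r in zip(level_delays, route_delays))
    return total, total * tau_ns  # (tau units, nanoseconds)
```

For a 3µ process (τ = 0.5 ns), a two-level path with cell delays of 2.0τ and 3.0τ and 0.5τ of routing per level takes 6τ, i.e. 3 ns.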
3.4 Implementation Procedure
We determine the number of cells needed to implement a circuit (N_cell) and the depth of the implementation (N_depth) by a structure-preserving direct hand mapping from the original circuit designs. Automated mapping and routing results vary significantly with different tools and for different optimisation criteria. They also significantly alter the overall high-level organisation, resulting in low area utilisation even for regular circuit structures. We also assume that, with improved FPGA cell architectures and mapping, placement, and routing technologies, the routing structure is sufficient to complete the mapped network interconnections and give a very high array utilisation. The results will therefore provide a lower bound on the cost and performance of different implementations, which is exactly what we are looking for. Different designs are compared based on their implementation efficiency, defined as the area-times-delay product 'AT', or cost/performance, for that circuit. The smaller 'AT' is, the more efficient the implementation.
3.5 Choice of arithmetic circuits
We mapped 10 different integer addition circuit designs representing several delay and area optimisation techniques. They included serial, carry-ripple, carry-skip, several one-level and two-level carry-lookahead, conditional-sum, carry-select, and pyramid adders. For integer multipliers, we only considered 2's complement multipliers with 1-bit Booth recoding. We also mapped 6 different multiplication circuit designs, including serial, sequential, sequential with one-level and two-level carry-save logic, parallel array, and bit systolic. For the sequential and array multipliers, which require an adder, we tried all the integer adders above to determine the ones that produce the best 'AT' results. All integer circuits were examined for bit widths varying from 4 bits to 64 bits.
For floating-point numbers we implemented a subset of the 32-bit and 64-bit IEEE specification standard and mapped both addition and multiplication circuits. Not all options referred to in the standard were included. Particularly, we assumed truncation is performed after addition or multiplication rather than rounding, to eliminate one normalisation step. Ten mantissa adders and six mantissa multipliers were tested, corresponding to the types of integer units above. The exponent adders used were chosen to minimise the 'AT' value for that design. Finally, we also implemented pipelined versions of those integer and floating-point circuits above that are actually pipelineable without incurring any increase in area cost.
Limitations. We did not include the control-path in either the cost or performance calculations. Furthermore, we did not examine any division algorithms or combination functions such as multiply-accumulate. We also restricted the investigation by not considering table-lookup techniques, distributed arithmetic, or the potential of other number systems.
4. Implementation Efficiency Gap
In an earlier study [15], we demonstrated that there is a large implementation efficiency gap between FPGA and VLSI platforms. We showed that a system implemented on FPGAs will require as much as 100 times more die area than a custom VLSI implementation and would be about 10 times slower. That result demonstrates how much worse an FPGA implementation is compared to a VLSI implementation. In this section we estimate the implementation efficiency gap between an FPGA platform and a programmable ALU as defined in Section 2. The ALU parameters we use are based on estimates for the area and delay of complementary logic CMOS circuits, which represent upper bounds on VLSI implementation cost and performance. Therefore, Ω and Δ, defined as the ratios of the area and delay results respectively of an FPGA-based implementation to those of an ALU, represent lower bounds, while actual area and delay overheads could certainly be much higher. We will now compare the efficiency of pipelined, and non-pipelined, FPGA-based arithmetic circuits to that of ALUs.
We select two pipelined integer ALUs, 32-bit and 64-bit, capable of addition, subtraction, multiplication, shift to the left, arithmetic shift to the right, and finally, logical AND, OR, and NOT operations. They use a fast array multiplier to achieve small multiplication delays at the expense of high die area. We also select two pipelined floating-point ALUs, 32-bit and 64-bit, capable of addition, subtraction, and multiplication, which also use an array multiplier as their mantissa multiplier/adder. For integer ALUs, we assume that addition takes a single cycle while multiplication needs 2 cycles on a 32-bit ALU and 3 cycles on a 64-bit ALU. For the floating-point ALUs, we assume 4 execute stages for a 32-bit FP ALU and 6 stages for a 64-bit ALU. The result of the comparison is given in Table 1, which gives the 'best case' results for both pipelined and non-pipelined FPGA implementations. Note that multiplying Ω and Δ in the table may not correspond to the Ω×Δ entry because all results have been rounded.
By limiting pipelined integer circuits to only those pipelineable without additional area, multiplication is restricted to array multipliers, which have high area costs. Similarly, floating-point multiplication is not able to make use of the compactness of sequential multipliers and incurs a large area cost as well. The effects of these restrictions can be observed in Table 1. The first thing to notice is that pipelined integer addition implementations give reasonably efficient results, even though the area overhead is still large. Of course, this is an improvement over the non-pipelined results, and is mostly due to the low cycle time of the FPGA implementations and, subsequently, the low delay overhead. This suggests that addition on FPGAs should be pipelined.
The second observation is that pipelined integer multiplication efficiency is better than in the non-pipelined case, but at the expense of very high area cost, up to 60 times from a maximum of 4.1 times in the non-pipelined case. Unlike addition, however, this result suggests that multiplication is not suitable at all for FPGA implementation. If it is required, but with real-estate at a premium, as in our case, multiplication should not be pipelined, so as to save on area. Finally, the overhead for floating-point is much larger than that of the integer implementations. The large overhead magnitude raises doubt regarding the usefulness of floating-point operations on FPGAs under any circumstances.
Serial arithmetic. Serial arithmetic can deliver low area overheads for low operand width values. However, for larger widths, 32 bits and 64 bits for example, the area overhead is still considerably high because the registers used in accumulating the result have to be included in the cost. It is also evident from our results that the low
TABLE 1
Characterising FPGA-based Arithmetic Circuit Implementations
[Area overhead Ω, delay overhead Δ, and the product Ω×Δ, relative to 32-bit and 64-bit integer and floating-point ALUs, for pipelined and non-pipelined FPGA adder (ADD) and multiplier (Mult) circuits at bit widths from 8 to 64. The individual entries are illegible in this copy.]
performance of serial designs makes their use on FPGAs unacceptable for high throughput computations.
Summary. The results given here highlight the main problem with FPGA-based arithmetic circuits. They cannot compete with other programmable devices such as ALUs, at least at the circuit level. Pipelining can be used to bring the cycle time of an FPGA implementation close to the delay of an ALU, reducing the delay overhead. But the intractable problem is the area overhead. Consider, from Table 1, that m arithmetic nodes of type and data width i have a total die area equivalent to m×Ω_i ALUs. Serial and sequential arithmetic can reduce the implementation cost, but at the expense of decreasing the performance and increasing the delay overhead. Since on-chip real-estate is valuable and the area cost of any FPGA implementation is already high, FPGA design trade-offs should be optimised for area. Indeed, neither floating-point arithmetic nor integer multiplication has shown any implementation attributes favourable to FPGA platforms. For the case of integer multiplication, this was also reported by a study on commercial FPGAs [16].

5. Viability of Adaptive Coprocessors
From the results of the previous section and the 6 numbered equations in Section 2, we can evaluate the viability of the adaptive coprocessor proposal. For clarity, we assume either integer or floating-point operations only. Table II examines the critical iteration period bound (p_min) and the minimum number of instructions in an unrolled iteration (q_int) as given by Eq(5) and Eq(6). These are the conditions necessary for an FPGA-based coprocessor to have better performance than a VLIW machine; i.e. for the speedup to be larger than one. Values of both parameters are determined for a total coprocessor die area equivalent to approximately 5M and 20M transistors, assuming a hypothetical algorithm requiring a total node count Σ n_i^fpga = 10 with a reasonable value of loop unrolling k_vliw = 10. We examine four cases of the algorithm: all nodes are addition operations, all are multiplication operations, 9 are adds and one is a multiply, and finally, 5 are adds and 5 are multiplies.
Total area vs. speedup. Although a die area equivalent to 5M transistors may seem to be large for such a small number of function nodes, already some data-path widths, blanked out in Table II, will either not fit onto an FPGA of this size or will not achieve better performance than the VLIW model under any p_min and q_int conditions. Also, the smaller die area forces the use of the more compact, but slower, non-pipelined circuits, giving worse results than the larger die area. On the other hand, from Eq(1) and Eq(4), as the area available, or number of ALUs, increases, the iteration period (t_ini) decreases, reducing the potential speedup.
Minimum instructions per iteration (q_int). Considering the small die area first, rows 1 and 2 show that the number of compiled VLIW instructions per iteration, before loop unrolling, that would guarantee FPGA superiority is relatively small. It can be easily satisfied by many algorithms that only require a limited number of multipliers. Rows 3 and 4 show that as the number of multiply operations increases, the number of instructions per iteration has to be very high for an FPGA to beat a VLIW engine, particularly for data bit widths larger than 16 bits. This suggests that highly concurrent algorithms of limited data-path widths, such as some image processing algorithms, are suited for implementation on integrated FPGA-based adaptive coprocessors.

TABLE II
Conditions determining the implementation viability of an FPGA-based Coprocessor
[For Σ n_i^fpga = 10 and k_vliw = 10: the p_min and q_int values beyond which an FPGA-based coprocessor is better than a VLIW machine, tabulated for 32-bit and 64-bit integer and FP ALUs over FPGA circuit bit widths from 8 to 64, for four operation mixes (all adds; 9 adds and 1 mult; 5 adds and 5 mults; all mults) at both the small and the large die area. The individual entries are illegible in this copy.]
