`
`On the Viability of FPGA-Based Integrated Coprocessors
`Osoma T. Alhoharno’ , Peter Y. K. Cheang, and Thomas J. Clarke
`Information Engineering Section
`Department of Electrical and Electronic Engineering
`imperial College of Science. Technology and Medicine
`Exhibition Road. London. SWT-ZBT. UK
`
`Abstract
`
`This paper examines the viability of using integrated
`programmable logic as a coprocessor to support a host
`CPU core. This adaptive coprocessor is compared to a
`VUW machine in term of both die area occupied and
`performance. The parametric bounds necessary to jam)?
`the
`adoption of art FFGA-based coprocessor
`are
`established. An abstract Field Programmable Gate Array
`model
`is used to
`investigate
`the area and delay
`characteristics of arithmetic circuits implemented on
`FPGA architecture: to determine the potential speedup of
`FPGA-bored coprocessors.
`
`Our analysis shows that integrated FPGA arrays are
`suitable as coprocessor plog'omts for realising algorithm
`that
`require only limited numbers of multiplication
`instructions.
`inherent FPGA characteristics limit
`the
`data-path widths that can be supported efi‘icientlyfor these
`applications.
`' Art FPGA~bosed adaptive coprocessor
`requires a large minimum die area before any advantage
`over a VLlW machine of a comparable size can be
`realised.
`
`The adaptive coprocessor model challenges the more
`established general purpose techniques that exploit fine
`grain instruction level concurrency. We ask, Under what
`architectural conditions can the integration of a core
`CPU and art FPGA-oased coprocessor art a single die
`outperfonn the possible alternative of using a Very Long
`instruction Word engine rVLlW) on that some die area?
`This paper addresses this question through four stages.
`First,
`in Section 2.
`the cost and performance bounds of
`both computational models.
`the VLIW and the FPGA
`coprocessing. are examined and a set of critical parameters
`is determined.
`Section 3 describes
`the experimental
`methodology used to establish the
`characteristics of
`arithmetic
`computation
`on H’GAs
`and Section
`4
`summarises the results of this investigation.
`In Section 5.
`we explore the implications of these results on the
`achievable cost and performance limits of FPGA-based
`copmsors. Finally. in Section 6 we apply the ideas and
`conclusions presented in earlier
`section to a
`typical
`computational example to determine its suitability for
`FPGA-based adaptive coprocessor implementation.
`
`I. Introduction
`
`The ever increasing Spare transistor capacity ha; only
`been absorbed so far into a limited number of architectural
`features.
`integrated programmable logic has emerged as
`one of the very few novel architectural
`ideas with the
`potential
`to exploit
`this abundant resource. A custom
`coprocessor can directly exploit the concurrency available
`in applications. algorithms. and code segments.
`An
`FPGArbased coprocessor can funher adapt to any demands
`for special-purpose hardware by mapping an algorithm
`onto
`run-time
`configurable
`logic,
`These
`versatile
`"adaptive" coprocessors can be used to augment
`the
`instruction set of a core CPU or as special purpose custom
`computing engines. Real-time applications can also mop
`multiple
`functions and subroutines directly onto the
`reconfigurable hardware during execution IIHS].
`———_..______
`t
`c I'llllil. :t.osamtt@tc.oc ulr
`Nmo‘twww ea in a: uler-searchfinfonnattonlwwwlnnsamaluosamnhw
`
`2. Computational Models
`An adaptive coprocessor uses silicon real-estate to
`integrate more programmable logic. This can then be used
`to implement
`larger custom circuits or exploit more
`concurrency. On the other hand, a VLIW machine will use
`this same die area to increase the number of ALUs and
`execute more instructions per cycle.
`In this section. we
`examine the cost and performance of implementing an
`algorithm on both computational models organised as in
`Figure t.
`
`
`
`Figure 1. Target coprocessor system organisation.
`
`0813645487986 SUE {JD 0 1995 IEEE
`
`206
`
`|PR2018-01594
`
`EXHIBIT
`
`2055
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 1
`
`IPR2018-01594
`
`EXHIBIT
`2055
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 1
`
`
`
`2.1 FPGA-based coprocessor organisation
`To achieve a high coprocessor throughput. we assume a
`pipelined implementation of all algorithms. This means
`that
`the performance of
`the FPGA-based coprocessor
`depend on the cycle time of the pipeline (they). the number
`of iterations the circuit
`is used mm“).
`the number of
`concurrent copies of the circuit mapped onto the FPGA
`(it
`a). and the number of cycles needed to fill the pipeline
`(
`' ml). The total number of cycles is then
`N
`a... =[k—N+rt]x:5...
`The area cost is the sum of the areas for all arithmetic
`nodes in a design. We assume integer nodes and floating-
`point nodes are used.
`All other operator nodes are
`expressed as a percentage tom.) of the area,
`If “hm is the
`area of node typet' and NZ...“ the number of nodes of typet'
`used in the circuit. the cost of a circuit implemented on an
`adaptive coprocessor can he expressed as.
`
`Arifnll'l = (I + Cm...) 3" kg.”
`
`xi..§.(aih‘><ns;)+ Z (airman
`fp-I
`
`2.2 VLIW machine organisation
`We assume the VLIW utilises integer and floating-point
`ALU units rather than single operation functional modules
`and that they constitute most of its area. All other area is
`expressed as a percentage (rm...) of the total area.
`If aim... is
`the area of a function node of type t' and n'm, the number
`of nodes of type i‘ used. the cost of a VLIW machine can be
`expressed as.
`
`[Low = U +Cuilw)
`- l
`—
`- l
`-
`t
`x[(allle""Kellie“”)+(a{lf..“"‘xn.fll.“l]
`
`In addition to available resources, the performance of :t
`VLIW machine is limited by two types of dependencies [6].
`The data dependencies within an iteration and the ones
`between iterations. A VLF-V pragram can be viewed as :1
`dependence graph. as in Figure 4, which must be repealed
`M“...
`times.
`Data
`and
`iteration
`dependencies
`are
`represented by the bold and dashed edges respectively.
`Since our VLfW processor model uses piplined ALUE.
`each node. or operation.
`in the graph takes a single time
`unit to execute (Prim). The iteration distance. attached to
`dashed edges.
`is
`the number of
`loop iterations after
`
`issuance of Si that 5i can begin execution. The number of
`time units it takes to execute a cycle within a dependency
`graph (5,}, given maximum resources.
`is the sum of all
`nodes along this cycle path. The number of iterations {3...}
`it takes the pattern in a cycle to repeat execution is the sum
`of all
`iteration distances along this cycle's dependency
`path. Therefore Nap/1L, repetitions ofa given cycle will be
`executed requiring 5, x (Nun/Ll cycles.
`The minimum time to execute the whole loop is
`maxiSJNvtiijlt—lfvnul = pKIIPNMfJ-‘fvflw Wilt“ Pun is
`the
`critical iteration period bound. Using software pipelining
`['l] and other advanced compiler transformations.
`it
`is
`possible to overlap the execution of several different
`iterations of the loop.
`If it”... iteration can be unrolled and
`then scheduled, the iteration interval t,“ is the time units
`needed to execute an entire iteration of k unrolled loops.
`It
`must satisfy both types of data dependencies as well as
`resources dependencies.
`If q‘i is the number of operations
`a resource of type i must be used in in“... iterations we can
`estimate lower bounds on the iteration interval and the
`maximum VLIW performance as follows:
`
`Nat...
`kvfiw
`
`3‘ fut +Cfiih] K lint.
`beiw =[
`Int 3 maxinntrmmesl-Im [hemmed]
`
`list (“Jam") 3 m3"[ [hi/Rim] ]
`
`:th (dependence) E l'l'lflX[ (a r/lr] ]
`
`Although the problem of finding an optimal schedule
`using software pipelining is NP—complete.
`it has been
`shown that near optimal results can often be obtained for
`loops with
`both
`intra-
`and
`inteteiteration
`data
`dependencies.
`it has also been shown that hierarchical
`reduction allows software pipelining to be applied to
`complex loops containing conditional statements. Program
`restructuring using loop transformations and optimising
`data locality using space tiling techniques can also be
`applied to increase both the fine-grain and coarse-grain
`parallelism available in nested loops.
`
`2.3 Effects of memory access
`The power of a custom circuit often lies in its use of a
`custom address generator
`to reduce explicit address
`calculation instructions. This option can be successfully
`exploited by both coprocessor models. the FPGA-based and
`the VLIW.
`Furthermore. both models can gain from
`customising memory access to fit the data bit width. We
`
`207
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 2
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 2
`
`
`
`these and other memory access optimisation
`expect
`techniques to produce equivalent benefits for both models
`and are therefore not directly included in the analysis.
`
`2.4 Best-case comparative analysis
`We can now compare the performance of both models.
`neglecting the pipeline fill cycles. by determining the
`speedup:
` E
`SUM =—c~r""“ = km" mix—“If“ = —k””" X“—"* eqt)
`T155“
`kriltw
`1'qu
`kvliw
`A
`
`The speedup is effected by the number of concurrent
`copies of the circuit arm.) mapped onto the FPGA. Since
`the areas of both models have to be the same. we can
`determine km, in term of the quivalent number of integer
`ALUs as follows:
`. _£
`_
`n5i~EI+a x'fli'fium'
`knm = 2(Qi""i><rtiiu‘tj)+ Z (a foP-ixnfig
`inl-i
`#4
`
`”"9 Q
`
`int-i _
`'
`
`Iii-I _
`Fl?“
`Girl—l
`inI-ulu‘ Q "
`audiot-
`
`{ii
`a vi
`fp-ni’u‘ ”do“
`altiilt‘
`
`=
`
`HEW
`app-l
`inlv-ni'u
`WEN
`
`Using tMresources). tgdependence}. and the speedup
`equation we can determine the conditions for which an
`FPGA—based caprocessor is virtually guaranteed to have
`better performance than a VLIW engine:
`
`pm a M x A
`km,
`
`[1-
`km...
`9.1-...
`. ——'— 2 — c
`
`int
`
`qt:
`.
`
`max[ Limit} Lanai]
`
`krm x
`
`eq2}
`
`3
`
`Eat D
`
`We can further simplify eql], eqZ). and Eqfl) by
`considering integer arithmetic only and substituting km. to
`gel:
`
` l‘lvi'iwlnl-Itilt } t 't
`
`50' = A x ___ x
`ti...
`2.- njlh.‘
`fill-L >< £-
`int—i
`Fl
`9.... 2 0:1: ><AXE+._’.‘?:‘.—"at...w
`Mn.»
`
`Eqm
`
`Eq(4i
`
`qu....2
`
`a who;
`
`Eats)
`
`We refer to Q and A as the area and delay overheads.
`respectively. of
`a
`particular
`circuit
`implementation
`compared to on ALU's area and delay. They are inherent
`
`208
`
`characteristics of the implementation platform (FPGA in
`this case] and limit its maximum achievable speedup. To
`sense how much speedup an adaptive coprocessor can
`deliver for a given fixed area and whether an algorithm has
`the necessary criterion that would make it suitable for
`adaptive coprocessor implementation. we need to estimate
`the minimum value of [Q'd]
`for arithmetic circuits
`implemented on FPGA platforms.
`
`3. Experimental Methodology
`implement
`To
`examine
`how efficiently FPGAa
`arithmetic circuits we need to eliminate technology and
`design
`variations
`and
`create
`an
`“even
`level"
`for
`comparison. This section describes how this even playing
`field is established. We first describe the cell architecture
`that is used throughout this paper and detail our models for
`estimating the area and delay of any FPGA cell. Then. our
`choices for arithmetic test circuits and implementation
`procedure are explained.
`In all discussion to follow. we
`consider only SRAM programmable FPGAs since only
`they provide the flexible platform necessary for field te-
`programmabiiity.
`
`3.1 FPGA cell architecture
`We examined 15 different FPGA cell architectures that
`span the range of current research and commercial arrays.
`The function generators included a 2-inptrt NAND gate. a
`2-input Universal Logic Module (ULMJ) capable of
`implementing any of lo 2-input Boolean logic functions
`[31. look-up tables ['LUT) of input sizes 3. 4. s, and 6. and
`finally. the cell architectures of both the Alters FLEX-8000
`[9] and the Xilinx XCSODO [ill] which include specialised
`hardware to speedup carry propagation and wide gate
`implementation. All cells also incorporated D-type flip«
`flops (FF). The cells interconnection capabilities examined
`included extensive neighbour connections with 8.
`[2. and
`[6 possible neighbours. channelled 2i) arrays with 4-
`neighbour connections. or channelled arrays with fully. or
`partially. connected clusters of cells similar to the Allera
`FLEX-3000 and the Xilimt XCSOOO array architectures.
`Of all these cell types. the 3-input LUT cell proved the
`best overall tor arithmetic circuit
`implementations. We
`elect to use it and a In channelled array architecture for
`communication as the example cell throughout this paper.
`A neighbour interconnection only array may also be used
`and will give similar numerical results. The chosen cell is
`based on a look-up table design similar in functionality Io
`other look-up table model proposals [l i].
`It incorporates
`4-nearest neighbour connections as a vital way to reduce
`delay and improve mutability. Figure 2 gives a conceptual
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 3
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 3
`
`
`
`is
`diagram of this cell. The routing channel width. W.
`assumed to be. the same for both the vertical and the
`horizontal channels. For LUTs with 3. 4. S. and 6 inputs.
`the average minimum channel widths necessary for routing
`has been observed to be9. 11.11, and [2 respectively [II].
`We therefore adopt a channel width of 9 for this model cell
`although the actual channel width should probably be
`slightly higher.
`
`
`
`Figure 2. FPGA cell model With a 3-mpuut look-up table as a fitnetion
`generator. duet: north. south. east. and west neighbour connections,
`and global honeontal. and vertical charmel routing.
`—_.—_—n————
`
`the factors
`for all
`Limitations. We do not account
`performance.
`and
`effecting
`the
`implementation
`Specifically. we leave issues such as external access.
`programming.
`testability.
`clock
`and
`control
`signals
`distribution. clock skew. and power consumption for future
`work. Of the global programming logic and network we
`only include the cost of
`the communication channel
`network and the number of SRAM configuration bits
`within a cell as part of the cost of
`the cell. These
`limitations
`bias
`9.
`in
`favour
`of
`the FPGA-hased
`coprocessor model.
`
`3.2 Area measurement
`
`is approximated using a
`The area of an FPGA cell
`transistor density coefficient metric (ct) in umlltransismr.
`This density coefficient
`is dependent on the fabrication
`process technology.
`layout methodology. and the circuit
`logic Structure used. [I is obtained by averaging layout area
`per transistor over all cells available in a library or over
`samples or real designs. We assume a normalised function
`generator logic density coefficient of (1;. a configuration
`memory normalised density coefficient of or... and a
`routing pitch normalised density coefficient of (1..
`
`Figure 3a is a representative model of the total cell area
`showing also the routing pitch between the physical
`channel tracks. We assume that the Routing Configuration
`Memory (REM) bits used for the channels' switch and
`connection boxes are distributed between the channel
`tracks as shown in Figure 2b.
`It is therefore reasonable to
`assume that
`tr... equals ac. Other similar model: also
`assume a distributed RCM [11]. The number of memory
`bits distributed within the channels (IVs-witch) depend on the
`connection and switch
`boxes. The
`connection
`has
`flexibility H: is defined as the number of channel tracks
`each input and output can be connected to. The switch box
`flexibility F5 is defined as the number of possible tracks
`each incoming track can be connected to.
`figure 3. A representation oi the toss: area plan
`I-‘PGA. (I) Am model showing the vents]
`and nonmetal
`tracks.
`t'b)
`The Routing
`Configuration Memory
`bits
`(RCM)
`are
`distributed between the channel tracks.
`
`Logic and ii:
`
`Configuration (a)
`
`”'0 Connection BOX
`
`
`It has been show [12] that Fc has to be greater than half
`the number of tracks for 100% routing completion to be
`possible. Additionally. only a Small F5 value is needed to
`achieve a 100% routing completion.
`In our model. we
`choose Fc = 0.?SW and F; = 3. The routing pitch ts
`determined by a five-transistor SRAM bit (a...) and a single
`pass-transistor PI? (0,) and is defined as
`
`mm = D‘.’ ‘ (fln+ as) = Jd‘u.
`The FPGA cell is modelled as a square die area having
`the following characteristics:
`Amt = My»: ‘l' Am... + Armrlr
`Arm = W X [”13“ Mt.)
`A"... = (n..-a..]x (N ,(m-v- NH...)
`Am...” = [(rinm' WA‘ W'” + [rpmn ' (X 'Wn + l"- WtJI
`when x r = A)... - 4....
`and
`x +(r.....-wt}= r . (amine)
`An"... = (Ctr 'Nrth‘l' (um'am'NlCHl) + Alum
`
`209
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 4
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 4
`
`
`
`where.
`= the area of an FPGA cell
`Amt
`AM, = logic area used for function generation
`A...“ = memory area used for configuration
`A..." = the area of the routing channels within a cell
`Am... = the area of the cell used for conununication
`NJ“...
`= it of transistors used for function generation
`M...
`= # of transistors used for routing logic d: muss
`NR".
`= it of memory bits for LUTs and control
`NM...
`: ll of mem. his used for routing configuration
`a...
`= if of transistors in a memory hit = 5
`W}.
`= ll of routing tracks in each horizontal channel
`W,
`= it of routing tracks in each vertical channel
`
`The total area of a circuit implementation depends on
`how the mapping from logic equations to FPGA cell
`functions is performed and how they are placed onto the
`cell array.
`If N..." is the number of FPGA cells used to
`implement the circuits. the total circuit area is
`
`Arum: = Na" X Acct: .
`
`3.3 Delay measurement
`
`The delay of an FPGA cell is approximated using the
`method of "logical effort" proposed by Sutherland and
`Sproull [l3] [14]. The method is based on a simple RC
`model
`for
`transistors
`and
`provide
`a
`first
`order
`approximation of a circuit delay.
`It defines it as the actual
`time. for a fabrication process, that cot-reSponds to a delay
`unit. The value of t can be measured from the Frequency
`of oscillation of a ring oscillator. For each type of logic
`gate. the method assigns delay unit values based on the
`topology of the circuit element,
`the difficulty that an
`element has in driving capacitive loads. and the parasitic
`capacitance exhibited by the gate. The delay of an ideal
`inverter that drives another identical inverter is the sum of
`a single unit delay (1') and the parasitic delay value P"...
`Typically. for 3H CMOS process. t = 0.5ns and P... = 0.61:.
`while for 0.5u CMOS process. 1: = Illns and P... = 0.51:.
`All other gate delays are measured relative to that of an
`ideal inverter.
`
`Logical effort is used to arrive at delay value for each
`type of FPGA cell. Separate values are deten'nined for
`each cell
`input
`to output.
`the set-up time. and the
`synchronous clock to output delay for each cell type. These
`delays also include the effects of internal fan-outs.
`
`its delay
`is mapped onto an array.
`After a circuit
`depends on the number of cells (NM...) along the longest
`path from an input to an output as well as the routing delay
`between these cells. The routing delay between neighbour
`
`cells is accounted for by the explicit Inading on that cell’s
`output. The routing delay between non-neighbouring cells
`in a channelled array is more difficult to estimate specially
`without knowledge of the exact placement and routing
`information and the capacitive loading on each level due to
`the progranu'tmble routing switches along the path. The
`total execution time of a circuit.
`in t units. can be
`determined as the sum of all delays along the longest path
`as follows:
`
`.
`New»
`..
`Trims“ = E: [diss'i'd'mmi
`where c‘..s is the delay. in 1: units, between input node a
`and output node b of a cell at circuit depth level
`t'. and
`c‘._,.,... is the routing delay. in 1: units as well. between the
`output of the cell at level i and the input of another cell at
`level f+.l.
`[n this investigation. we will assume o'mm to be
`zero. This assumption will bias A in favor of the FPGA-
`hased coprocessor model.
`
`3.4 Implementation Prooedure
`
`We determine the number of cells needed to implement
`a circuit
`(Main) and the depth of the implementation
`(NM...) by a structure preserving direct hand mapping from
`the original circuit designs. Automated mapping and
`routing results vary significantly with different tools and
`for different optimisation criterion. They also significantly
`alter the overall high level organisation resulting in low
`area utilisation even for regular circuits suuctures. We can
`also assume that with improved FPGA cell architectures.
`mapping. placement. and routing technologies, the routing
`structure is sufficient
`to complete the mapped network
`interconnections and give a very high array utilisation.
`The results will therefore provide a lower bound on the
`cost and performance of different implementations which
`is exactly what We are looking for. Different designs are
`compared based on their implementation efficiency defined
`as the area times delay product ‘AT'. or cost'performance.
`for that circuit. The less 'AT'
`is. the more efi'icient is the
`implementation.
`
`3.5 Choice of arithmetic circuits
`
`We mapped 10 different integer addition circuit designs
`representing
`several
`delay
`and
`area
`optimisation
`techniques. They included. serial, catty-ripple. carry-skip.
`several
`one-level
`and
`two-levels
`carry-lookahead.
`conditional-sum. can'y~select. and pyramid adders.
`For
`integer multipliers. we only considered 2’s complement
`multipliers with l-bit Booth recoding. We also mapped 6
`
`210
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 5
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 5
`
`
`
`different multiplication circuit designs including serial.
`sequential. sequential with one-level and two-levels carry—
`save logic. parallel array. and bit
`systolic.
`For
`the
`sequential and array multipliers which require an adder.
`we tried all the integer adders above to determine the ones
`that produce the best
`'AT‘
`results. All
`integer circuits
`were examined for bit widths varying from 4-bits to 64-
`bits.
`
`For floating-point numbers we implemented a subset of
`the 32-bit and 64-bit LEEE specification standard and
`mapped both addition and multiplication circuits. Not all
`options
`referred to
`in
`the
`standard were
`included.
`Particularly. we assumed truncation is performed after
`addition
`or multiplication rather
`than
`rounding
`to
`eliminate one normalisation step. Ten mantissa adders
`and six mentissa multipliers were tested corresponding to
`the types of integer units above. The exponent adders used
`were chosen to minimise the ‘AT' value for that design.
`Finally. we also implemented a pipelined version of those
`integer and floating-point circuits above that are actually
`pipelineahle without incurring any increase in area cost.
`
`Limitations. We did not include the control-path in either
`the cost or performance calculations. Furthermore. we did
`not examine any diVision algorithms or combination
`functions such as multiply-accumulate. We also restricted
`the
`investigation
`by
`not
`considering
`table-lookup
`techniques. distributed arithmetic. or the potential of other
`number systems.
`
`4. Implementation Efficiency Gap
`In an earlier study [15]. we demonstrated that there is a
`large implementation efficiency gap between FPGA and
`VLSI platforms. We showed that a system implemented
`on FPGAs will require as much as 100 times more die area
`than a custom VLSI implementation and would be about
`10 times slower. The result demonstrates how much worse
`on FPGA implementation is
`compared to
`a VLSI
`implementation.
`In
`this
`section we
`estimate
`the
`implementation efficiency gap between an FPGA platform
`and a programmable ALU as defined in section 2. The
`ALU parameters we use are based on estimates for the area
`and delay of complementary logic CMOS circuits which
`represent upper bounds on VLSI implementation cost and
`performance. Therefore. .0 and A. defined as the ratios of
`the area and delay results respectively of an FPGA-based
`implementation to those of an ALU.
`represent
`lower
`bounds while actual area and delay overheads could
`cenainly be much higher. We will now compare the
`efficiency of pipelined. and non-pipelined. FPGA-based
`arithmetic circuits to that of ALUS.
`
`We select two pipelined integer ALUs. 32-bit and 64-
`bit. capable of addition. subtraction. multiplication. shift to
`the left, arithmetic shift to the right. and finally.
`logical
`AND. OR. and NOT operations. They use a fast array
`multiplier to achieve small multiplication delays at
`the
`expense of high die area. We also select two pipelined
`floating-palm ALUS. 32-bit and 64-bit, capable of addition.
`subtraction. and multiplication which also use an array
`multiplier as their mantissa multiplienladder. For integer
`ALUs. we assume that addition takes a single cycle while
`multiplication needs 2 cycles on a 32-bit ALU. and 3
`cycles on a 64-bit ALU. For the floating-point ALle. We
`assume 4 execute stages for a 32-bit FP ALU and 6 stages
`fora 64-bit ALU. The result of the comparison is given in
`Table I which gives the ‘best case'
`results for both
`pipelined and non-pipelined 'FPGA implementations. Note
`that multiplying .l'l and A in the table may not correspond
`to the {PA entry because all results have been rounded.
`By limiting plpelined integer circuits to only those
`pipelinable without additional area. multiplication is
`resu-icted to array multipliers which have high area costs.
`Similarly. floating-point multiplication is not able to make
`use of the compactness of sequential multipliers and incurs
`a large area cost as well. The effects of these restrictions
`can be observed in Table I. The first thing to notice is that
`pipelined integer addition implementation give reasonably
`efficient results even though the area overhead is still
`large. or course. this is an improvement over the nort-
`pipelined results. and is mostly due to the low cycle time of
`the FPGA implementations and. subsequently.
`the low
`delay overhead. This suggests that addition on F‘PGAs
`should be pipelined.
`that pipelined integer
`The second observation is
`multiplication efficiency is better than the nonwpiplined
`case but at the expense of very high area cost. up to 60
`times from a maximum of 4.1 times in the non-piplinecl
`case. Unlike addition. however. this result soggest that
`multiplication
`is
`not
`suitable
`at
`all
`for
`FPGA
`implementation.
`11' required. however. but with teal-estate
`at a premium. as in our case. multiplication should not he
`pipelined so as to save on area. Finally. the overhead for
`flaming-point
`is much larger
`than
`that of
`integer
`implementation. The large overhead magnitude raises
`doubt regarding the usefulness of floating-point operations
`on FPGAs under any circumstances.
`Serial Arithmetic. Serial arithmetic can deliver low area
`met-heads for low operand width values, However.
`for
`larger widths,
`ill-bits and 64-bits for example,
`the area
`overhead is still considerably high because the registers
`used in accumulating the result have to be included in the
`cost.
`it
`is also evident
`from our
`results that
`the low
`
`211
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 6
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 6
`
`
`
`pipelined
`HflA
`
`
`
`
`:0i_.u.
`
`
`
`hfihéfihsassesses
`
`
`
`
`pmdmd
`
`RCA
`damn
`
`or]
`line.
`antenatal-“transna-
`
`
`
`-§--I.OTE:HE! [1-408[IE
`
`
`“MEI
`
`“Inmnnnnonrsntm-Isn
`IEIIIEEEIIBJEEE
`
`
`performance of serial designs make their use on F'PGAs
`unacceptable for high throughput computations.
`
`the main
`Summary. The results given here highlight
`problem with FPGA~based arithmetic circuits.
`They
`cannot compete with other programmable devices such as
`ALUs. at least at the circuit lev'el. Pipelining can be used
`to bring the cycle time of an FPGA implementation close
`to the delay of an ALU to reduce the delay overhead. But.
`the intractable problem is the area overhead. Consider.
`from Table I.
`that m arithmetic nodes of type and data
`width 1' havc total die area equivalent
`to m‘fli ALUs.
`Serial
`and
`sequential
`arithmetic
`can
`reduce
`the
`implementation cost but at the expense of decreasing the
`performance and increasing the delay overhead. Since on-
`chip real-estate is valuable and the area cost of any' FPGA
`implementation is already high. FPGA design trade-offs
`should be optimised for area.
`Indeed, neither floating-
`poini arithmetic nor integer multiplication have shown any
`implementation attributes favourable to FPGA platforms.
`For
`the case of integer multiplication.
`this was also
`reported by a study on commercial FPGAS [I6].
`
`5. Viability of Adaptive Coprocessot‘s
`From the results of the previous section and the 6
`numbered equations in Section 2. we can evaluate the
`viability of the adaptive coprocessor proposall For clarity.
`we assume either integer or floating-point operations only.
`Table ll examines the critical iteration period bound (pan)
`and the minimum number of instructions in an unrolled
`iteration (q’m) as given by Eq(S) and Eq(6). These are the
`conditions necessary for a VLIW coprocessor machine to
`have better performance than an FPGA-based coprocessor.
`Le. speedup is larger than one, Values of both parameters
`
`212
`
`are determined for a total coprocessor die area equivalent
`to approximately 5M and 20M transistors and assuming a
`hypothetical algorithm requiring total node count Znitm 2
`ID with a reasonable value of loop uruolling lt..._=10.
`We eaten-tine four cases of the algorithm. All nodes are
`addition operations, all are multiplication operations, 9 are
`adds and one is a multiply. and finally. 5 are adds and 5
`are multiply,
`
`Total area vs. speedup. Although a die area equivalent to
`5M transistors may seem to be large for such a small
`number of function nodes. already some data-path widths.
`blanked out in Table I]. will either not fit onto an FPGA of
`this size or will not achieve better performance than the
`W model under any par. and tilts. conditions. Also.
`the smaller die area forces the use of the more compact. but
`slower, non pipelined circuits giving worse results than the
`larger die area. On the other hand. from Eqm and eq4).
`as the area available. or number of ALUs. increase.
`the
`iteration period {In} decrease reducing the potential
`speedup.
`
`Minimum instructions per iteration (film?- Considering
`the small die area first. rows I and 2 show that the number
`of compiled VLIW instructions per iteration. before loop
`unrolling.
`that would guarantee FPGA superiority is
`relatively small.
`It can be easily satisfied by many
`algorithms
`that only require
`a
`limited
`number of
`multipliers. Rows 3 and 4 show that as the number of
`multiply operations increase. the number of instructions
`per iteration has to be very high for an FPGA to beat a
`VLI'W engine particularly for data bit widths larger than
`lé—bi’ts. This suggests that highly concurrent algorithms of
`limited data-path widths. such as some image processing
`algorithms are suited for implementation on integrated
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 7
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 7
`
`
`
`léBLE ll
`Conditions determining the implementation viability of an FPGA-based Coptocessor
`32-bit ALI}
`64-bit ALU
`
`Ell-EH
`
`Eli]
`[Em
`BEE
`
`mere--
`—-“IEIIIEI nut-[lumen
`—I-IIEIEEEII Elnora-solarium
`mnnm [IF
`anon-
`—EIEIIIIIEIEIIEIH 2 n-nnm
`wunmmnunnnmmlm
`—nIEIl£I-Eflllll
`mmwmfifi
`“mammalian Emu-BEER
`—IIIIIIII nnnnnnnn
`mm!- nnnnnnnn
`—nnmm __ Ill-“WEEK!
`MEIR“!!! nmsmszotm-Itom
`_IlIl-IIIEIIIIIII--IIIIIIIIEHIEI
`—---IIIIIIEEIIIIIIIIIIIIIEIIIIEIEI
`mnnnnnnfimmmmm
`mnnnmmnunmmmmmm-m
`For any given algorithm. maximum speedup is achieved
`with lower die area sizes above an absolute minimum area.
`The F'PGA. however, would still not be able to effectively
`deal with
`algorithms
`dominated
`by
`floating-point
`operations.
`Finally. note that the parametric limits reported in this
`section regarding the suitability of one algorithm or
`architecture over another are indicative of general trends
`for a broad range of operand width values and arithmetic
`functions
`and should not be interpreted as precise
`statements of comparison for
`the specific models and
`circuits investigated.
`
`FPGA arrays. Rows 5 to 8 show that as the total die area
`increase. bigger and faSter multipliers can be used for even
`larger operation widths i lo to 32 in this case) reducing the
`number of instructions per cycle for those data-path widths
`and allowing more applications to satisfy the instmclion
`count condition.
`
`iteration
`The critical
`Iteration period hound (pm).
`bound is the minimum time to execute a loop and is shown
`in rows 9 to 12 of Table 11 for the small die area. Similar
`observations to those of the q‘t parameter can be made.
`Again. many algorithms can easily satisfy this condition
`and will
`therefore perform better on an FPGAvhased
`coprocessor. Restrictions still have to be made to the
`number of multipliers used.
`If the number of multipliers
`required is
`large however,
`the condition can still be
`satisfied for
`small
`data-path widths.
`Furthermore.
`increasing the total area available for the coprocessor. rows
`13
`to
`16.
`means
`that more multipliers
`can be
`accommodated and larger data-path widths can be u