On the Viability of FPGA-Based Integrated Coprocessors

Osama T. Albaharna, Peter Y. K. Cheung, and Thomas J. Clarke
Information Engineering Section
Department of Electrical and Electronic Engineering
Imperial College of Science, Technology and Medicine
Exhibition Road, London, SW7 2BT, UK

Abstract

This paper examines the viability of using integrated programmable logic as a coprocessor to support a host CPU core. This adaptive coprocessor is compared to a VLIW machine in terms of both die area occupied and performance. The parametric bounds necessary to justify the adoption of an FPGA-based coprocessor are established. An abstract Field Programmable Gate Array model is used to investigate the area and delay characteristics of arithmetic circuits implemented on FPGA architectures, to determine the potential speedup of FPGA-based coprocessors.

Our analysis shows that integrated FPGA arrays are suitable as coprocessor platforms for realising algorithms that require only limited numbers of multiplication instructions. Inherent FPGA characteristics limit the data-path widths that can be supported efficiently for these applications. An FPGA-based adaptive coprocessor requires a large minimum die area before any advantage over a VLIW machine of a comparable size can be realised.

1. Introduction

The ever-increasing spare transistor capacity has so far been absorbed into only a limited number of architectural features. Integrated programmable logic has emerged as one of the very few novel architectural ideas with the potential to exploit this abundant resource. A custom coprocessor can directly exploit the concurrency available in applications, algorithms, and code segments. An FPGA-based coprocessor can further adapt to any demands for special-purpose hardware by mapping an algorithm onto run-time configurable logic. These versatile "adaptive" coprocessors can be used to augment the instruction set of a core CPU or as special-purpose custom computing engines. Real-time applications can also map multiple functions and subroutines directly onto the reconfigurable hardware during execution [1-5].

The adaptive coprocessor model challenges the more established general-purpose techniques that exploit fine-grain instruction-level concurrency. We ask: under what architectural conditions can the integration of a core CPU and an FPGA-based coprocessor on a single die outperform the possible alternative of using a Very Long Instruction Word (VLIW) engine on that same die area? This paper addresses this question through four stages. First, in Section 2, the cost and performance bounds of both computational models, the VLIW and the FPGA coprocessor, are examined and a set of critical parameters is determined. Section 3 describes the experimental methodology used to establish the characteristics of arithmetic computation on FPGAs, and Section 4 summarises the results of this investigation. In Section 5, we explore the implications of these results on the achievable cost and performance limits of FPGA-based coprocessors. Finally, in Section 6 we apply the ideas and conclusions presented in earlier sections to a typical computational example to determine its suitability for FPGA-based adaptive coprocessor implementation.

2. Computational Models

An adaptive coprocessor uses silicon real-estate to integrate more programmable logic. This can then be used to implement larger custom circuits or to exploit more concurrency. On the other hand, a VLIW machine will use this same die area to increase the number of ALUs and execute more instructions per cycle. In this section, we examine the cost and performance of implementing an algorithm on both computational models, organised as in Figure 1.

Figure 1. Target coprocessor system organisation.

© 1995 IEEE

206
IPR2018-01594
EXHIBIT 2055
PATENT OWNER DIRECTSTREAM, LLC
EX. 2072, p. 1
2.1 FPGA-based coprocessor organisation

To achieve a high coprocessor throughput, we assume a pipelined implementation of all algorithms. This means that the performance of the FPGA-based coprocessor depends on the cycle time of the pipeline (t_fpga), the number of iterations the circuit is used (N_iter), the number of concurrent copies of the circuit mapped onto the FPGA (k_fpga), and the number of cycles needed to fill the pipeline (n_fill). The total execution time is then

    T_fpga = ( \lceil N_iter / k_fpga \rceil + n_fill ) \times t_fpga

The area cost is the sum of the areas for all arithmetic nodes in a design. We assume integer nodes and floating-point nodes are used. All other operator nodes are expressed as a percentage (c_fpga) of the area. If a_fpga^i is the area of node type i and n_fpga^i the number of nodes of type i used in the circuit, the cost of a circuit implemented on an adaptive coprocessor can be expressed as

    A_fpga = (1 + c_fpga) \times k_fpga \times [ \sum_{int-i} ( a_fpga^{int-i} \times n_fpga^{int-i} ) + \sum_{fp-i} ( a_fpga^{fp-i} \times n_fpga^{fp-i} ) ]

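The time and area model above can be sketched directly in code. This is a minimal illustration of the two formulas; the function and parameter names (fpga_time, node_areas, and so on) are our own choices, not the paper's.

```python
import math

def fpga_time(n_iter, k_fpga, n_fill, t_fpga):
    """Total FPGA coprocessor execution time:
    (ceil(N_iter / k_fpga) + n_fill) * t_fpga."""
    return (math.ceil(n_iter / k_fpga) + n_fill) * t_fpga

def fpga_area(c_fpga, k_fpga, node_areas, node_counts):
    """Circuit area cost: (1 + c_fpga) * k_fpga * sum_i (a_i * n_i),
    summed over all (integer and floating-point) node types."""
    nodes = sum(a * n for a, n in zip(node_areas, node_counts))
    return (1.0 + c_fpga) * k_fpga * nodes
```

For example, 1000 iterations on 4 concurrent circuit copies with a 10-cycle pipeline fill and a cycle time of 2.0 time units gives (250 + 10) x 2.0 = 520 time units.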
2.2 VLIW machine organisation

We assume the VLIW utilises integer and floating-point ALU units rather than single-operation functional modules, and that these units constitute most of its area. All other area is expressed as a percentage (c_vliw) of the total area. If a_alu^i is the area of a function node of type i and n_alu^i the number of nodes of type i used, the cost of a VLIW machine can be expressed as

    A_vliw = (1 + c_vliw) \times [ ( a_alu^{int} \times n_alu^{int} ) + ( a_alu^{fp} \times n_alu^{fp} ) ]

In addition to available resources, the performance of a VLIW machine is limited by two types of dependencies [6]: the data dependencies within an iteration and those between iterations. A VLIW program can be viewed as a dependence graph, as in Figure 4, which must be repeated N_iter times. Data and iteration dependencies are represented by the bold and dashed edges respectively. Since our VLIW processor model uses pipelined ALUs, each node, or operation, in the graph takes a single time unit to execute (t_vliw). The iteration distance, attached to dashed edges, is the number of loop iterations after issuance of S_i before S_j can begin execution. The number of time units it takes to execute a cycle within a dependency graph (s_c), given maximum resources, is the sum of all nodes along this cycle path. The number of iterations (d_c) it takes the pattern in a cycle to repeat execution is the sum of all iteration distances along this cycle's dependency path. Therefore N_iter/d_c repetitions of a given cycle will be executed, requiring s_c \times (N_iter/d_c) cycles.

The minimum time to execute the whole loop is max_c[ s_c \times (N_iter/d_c) ] \times t_vliw = p_cip \times N_iter \times t_vliw, where p_cip is the critical iteration period bound. Using software pipelining [7] and other advanced compiler transformations, it is possible to overlap the execution of several different iterations of the loop. If k_vliw iterations can be unrolled and then scheduled, the iteration interval t_ii is the number of time units needed to execute an entire iteration of the k_vliw unrolled loops. It must satisfy both types of data dependencies as well as resource dependencies. If q^i is the number of operations for which a resource of type i must be used in k_vliw iterations, we can estimate lower bounds on the iteration interval and the maximum VLIW performance as follows:

    T_vliw = ( \lceil N_iter / k_vliw \rceil \times t_ii + n_fill ) \times t_vliw

    t_ii \geq max[ t_ii(resources), t_ii(dependence) ]

    t_ii(resources) \geq max_i[ \lceil q^i / n_alu^i \rceil ]

    t_ii(dependence) \geq max_c[ \lceil k_vliw \times s_c / d_c \rceil ]
`
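The two lower bounds on the iteration interval can be evaluated mechanically from a loop's resource usage and its dependence cycles. A small sketch follows; the data layout (dicts for resources, (s_c, d_c) pairs for cycles) is our own illustrative encoding of the quantities defined above.

```python
import math

def t_ii_lower_bound(q, n_alu, cycles, k_vliw):
    """Lower bound on the VLIW iteration interval t_ii for k_vliw
    unrolled iterations.

    q:      dict, resource type -> operations needing it in k_vliw iterations
    n_alu:  dict, resource type -> number of ALUs of that type
    cycles: list of (s_c, d_c) pairs: cycle path length and summed
            iteration distance for each dependence cycle
    """
    # Resource-constrained bound: max over resource types of q^i / n_alu^i.
    t_res = max(math.ceil(q[i] / n_alu[i]) for i in q)
    # Dependence-constrained bound: max over cycles of k_vliw * s_c / d_c.
    t_dep = max(math.ceil(k_vliw * s_c / d_c) for s_c, d_c in cycles) if cycles else 0
    return max(t_res, t_dep)
```

For instance, 12 integer operations over 4 integer ALUs gives a resource bound of 3, while a dependence cycle with s_c = 3 and d_c = 1 unrolled twice gives a bound of 6, so t_ii is at least 6.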

Although the problem of finding an optimal schedule using software pipelining is NP-complete, it has been shown that near-optimal results can often be obtained for loops with both intra- and inter-iteration data dependencies. It has also been shown that hierarchical reduction allows software pipelining to be applied to complex loops containing conditional statements. Program restructuring using loop transformations and optimising data locality using space tiling techniques can also be applied to increase both the fine-grain and coarse-grain parallelism available in nested loops.
2.3 Effects of memory access

The power of a custom circuit often lies in its use of a custom address generator to reduce explicit address-calculation instructions. This option can be successfully exploited by both coprocessor models, the FPGA-based and the VLIW. Furthermore, both models can gain from customising memory access to fit the data bit width. We expect these and other memory-access optimisation techniques to produce equivalent benefits for both models; they are therefore not directly included in the analysis.
2.4 Best-case comparative analysis

We can now compare the performance of both models, neglecting the pipeline fill cycles, by determining the speedup:

    SU = T_vliw / T_fpga = ( k_fpga \times t_ii \times t_vliw ) / ( k_vliw \times t_fpga ) = ( k_fpga / k_vliw ) \times ( t_ii / \Delta )        Eq(1)

where \Delta = t_fpga / t_vliw. The speedup is affected by the number of concurrent copies of the circuit (k_fpga) mapped onto the FPGA. Since the areas of both models have to be the same, we can determine k_fpga in terms of the equivalent number of integer ALUs as follows:

    k_fpga = n_alu^{int} / [ \sum_{int-i} ( \Omega^{int-i} \times n_fpga^{int-i} ) + \sum_{fp-i} ( \Omega^{fp-i} \times n_fpga^{fp-i} ) ]

where the area overheads are normalised to the area of an integer ALU:

    \Omega^{int-i} = a_fpga^{int-i} / a_alu^{int},    \Omega^{fp-i} = a_fpga^{fp-i} / a_alu^{int}

Using t_ii(resources), t_ii(dependence), and the speedup equation, we can determine the conditions under which an FPGA-based coprocessor is virtually guaranteed to have better performance than a VLIW engine:

    p_cip \geq \Delta / k_fpga        Eq(2)

    max_i[ \lceil q^i / n_alu^i \rceil ] \geq ( k_vliw / k_fpga ) \times \Delta        Eq(3)

We can further simplify Eq(1), Eq(2), and Eq(3) by considering integer arithmetic only and substituting k_fpga to get:

    SU = [ n_alu^{int} / \sum_i ( \Omega^{int-i} \times n_fpga^{int-i} ) ] \times t_ii / ( k_vliw \times \Delta )        Eq(4)

    p_cip \geq ( \Delta / n_alu^{int} ) \times \sum_i ( \Omega^{int-i} \times n_fpga^{int-i} )        Eq(5)

    q^{int} \geq \Delta \times k_vliw \times \sum_i ( \Omega^{int-i} \times n_fpga^{int-i} )        Eq(6)

We refer to \Omega and \Delta as the area and delay overheads, respectively, of a particular circuit implementation compared to an ALU's area and delay. They are inherent characteristics of the implementation platform (FPGA in this case) and limit its maximum achievable speedup. To sense how much speedup an adaptive coprocessor can deliver for a given fixed area, and whether an algorithm has the necessary characteristics to make it suitable for adaptive coprocessor implementation, we need to estimate the minimum value of \Omega \cdot \Delta for arithmetic circuits implemented on FPGA platforms.

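Eq(1) and the k_fpga substitution above can be combined into a short best-case speedup estimate. This is a sketch of the comparison only; the example numbers in the usage note are hypothetical, not values from this paper's tables.

```python
def k_fpga_equiv(n_alu_int, omegas, counts):
    """Number of concurrent circuit copies fitting in the area of
    n_alu_int integer ALUs, given per-node area overheads (Omega_i)
    and node counts (n_i) for the algorithm's circuit."""
    return n_alu_int / sum(w * n for w, n in zip(omegas, counts))

def speedup(k_fpga, k_vliw, t_ii, delta):
    """Eq(1): SU = (k_fpga / k_vliw) * t_ii / Delta, with
    delta = t_fpga / t_vliw, the FPGA delay overhead."""
    return (k_fpga / k_vliw) * t_ii / delta
```

With a hypothetical circuit of 10 nodes, each costing twice an integer ALU's area, in the area of 32 ALUs, k_fpga = 1.6; an iteration interval of 50 with k_vliw = 10 and a delay overhead of 8 then gives a speedup of exactly 1, the break-even point.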
3. Experimental Methodology

To examine how efficiently FPGAs implement arithmetic circuits, we need to eliminate technology and design variations and create an "even level" for comparison. This section describes how this even playing field is established. We first describe the cell architecture that is used throughout this paper and detail our models for estimating the area and delay of any FPGA cell. Then, our choices for arithmetic test circuits and implementation procedure are explained. In all discussion to follow, we consider only SRAM-programmable FPGAs, since only they provide the flexible platform necessary for field re-programmability.

3.1 FPGA cell architecture

We examined 15 different FPGA cell architectures that span the range of current research and commercial arrays. The function generators included a 2-input NAND gate; a 2-input Universal Logic Module (ULM) capable of implementing any of the 16 2-input Boolean logic functions [3]; look-up tables (LUT) of input sizes 3, 4, 5, and 6; and finally, the cell architectures of both the Altera FLEX-8000 [9] and the Xilinx XC5000 [10], which include specialised hardware to speed up carry propagation and wide-gate implementation. All cells also incorporated D-type flip-flops (FF). The cell interconnection capabilities examined included extensive neighbour connections with 8, 12, and 16 possible neighbours; channelled 2D arrays with 4-neighbour connections; or channelled arrays with fully, or partially, connected clusters of cells similar to the Altera FLEX-8000 and the Xilinx XC5000 array architectures.

Of all these cell types, the 3-input LUT cell proved the best overall for arithmetic circuit implementations. We elect to use it, with a channelled 2D array architecture for communication, as the example cell throughout this paper. A neighbour-interconnection-only array may also be used and will give similar numerical results. The chosen cell is based on a look-up table design similar in functionality to other look-up table model proposals [11]. It incorporates 4-nearest-neighbour connections as a vital way to reduce delay and improve routability. Figure 2 gives a conceptual diagram of this cell. The routing channel width, W, is assumed to be the same for both the vertical and the horizontal channels. For LUTs with 3, 4, 5, and 6 inputs, the average minimum channel widths necessary for routing have been observed to be 9, 11, 11, and 12 respectively [11]. We therefore adopt a channel width of 9 for this model cell, although the actual channel width should probably be slightly higher.

Figure 2. FPGA cell model with a 3-input look-up table as a function generator; direct north, south, east, and west neighbour connections; and global horizontal and vertical channel routing.

Limitations. We do not account for all the factors affecting the implementation and performance. Specifically, we leave issues such as external access, programming, testability, clock and control signal distribution, clock skew, and power consumption for future work. Of the global programming logic and network, we include only the cost of the communication channel network and the number of SRAM configuration bits within a cell as part of the cost of the cell. These limitations bias \Omega in favour of the FPGA-based coprocessor model.

3.2 Area measurement

The area of an FPGA cell is approximated using a transistor density coefficient metric (\alpha) in \mu m^2/transistor. This density coefficient is dependent on the fabrication process technology, the layout methodology, and the circuit logic structure used. It is obtained by averaging layout area per transistor over all cells available in a library or over samples of real designs. We assume a normalised function-generator logic density coefficient of \alpha_l, a configuration memory normalised density coefficient of \alpha_m, and a routing pitch normalised density coefficient of \alpha_r.

Figure 3a is a representative model of the total cell area, showing also the routing pitch between the physical channel tracks. We assume that the Routing Configuration Memory (RCM) bits used for the channels' switch and connection boxes are distributed between the channel tracks, as shown in Figure 3b. It is therefore reasonable to assume that \alpha_m equals \alpha_r. Other similar models also assume a distributed RCM [11]. The number of memory bits distributed within the channels (N_switch) depends on the connection and switch boxes. The connection box flexibility F_c is defined as the number of channel tracks each input and output can be connected to. The switch box flexibility F_s is defined as the number of possible tracks each incoming track can be connected to.

Figure 3. A representation of the total area plan of an FPGA cell. (a) Area model showing the vertical and horizontal tracks. (b) The Routing Configuration Memory (RCM) bits are distributed between the channel tracks.

It has been shown [12] that F_c has to be greater than half the number of tracks for 100% routing completion to be possible. Additionally, only a small F_s value is needed to achieve 100% routing completion. In our model, we choose F_c = 0.75W and F_s = 3. The routing pitch is determined by a five-transistor SRAM bit (a_m) and a single pass-transistor (a_p) and is defined as

    r_pitch = \alpha_r \times ( a_m + a_p ) = 6 \alpha_r

The FPGA cell is modelled as a square die area having the following characteristics:

    A_cell  = A_logic + A_mem + A_comm
    A_logic = \alpha_l \times N_logic
    A_mem   = ( a_m \times \alpha_m ) \times N_lut
    A_route = ( r_pitch \times W_h \times X ) + ( r_pitch \times W_v \times Y ),
        where X \times Y = A_cell - A_route and X + ( r_pitch \times W_v ) = Y + ( r_pitch \times W_h )
    A_comm  = ( \alpha_r \times N_route ) + ( a_m \times \alpha_m \times N_rcm ) + A_route
where:
    A_cell  = the area of an FPGA cell
    A_logic = logic area used for function generation
    A_mem   = memory area used for configuration
    A_route = the area of the routing channels within a cell
    A_comm  = the area of the cell used for communication
    N_logic = # of transistors used for function generation
    N_route = # of transistors used for routing logic & muxes
    N_lut   = # of memory bits for LUTs and control
    N_rcm   = # of mem. bits used for routing configuration
    a_m     = # of transistors in a memory bit = 5
    W_h     = # of routing tracks in each horizontal channel
    W_v     = # of routing tracks in each vertical channel

The total area of a circuit implementation depends on how the mapping from logic equations to FPGA cell functions is performed and how the cells are placed onto the array. If N_cell is the number of FPGA cells used to implement the circuit, the total circuit area is

    A_circuit = N_cell \times A_cell
`
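The cell-area model can be approximated in a few lines of code. This sketch simplifies the channel geometry: instead of solving for the cell edges X and Y, it charges each track a fixed pitch times an edge length derived from the pre-channel core area. All parameter values in the usage note are hypothetical, not measured coefficients.

```python
def cell_area(alpha_l, alpha_m, alpha_r, n_logic, n_route, n_lut, n_rcm,
              w_h, w_v, a_m=5):
    """Rough FPGA cell area (um^2): logic + configuration memory +
    communication, following the model in this section."""
    r_pitch = alpha_r * (a_m + 1)                 # 5-T SRAM bit + 1 pass transistor
    a_logic = alpha_l * n_logic                   # function generator
    a_mem = (a_m * alpha_m) * n_lut               # LUT and control bits
    core = a_logic + a_mem + alpha_r * n_route    # core area before channels
    edge = core ** 0.5                            # assume a square core
    a_route = r_pitch * (w_h + w_v) * edge        # simplified channel area
    a_comm = alpha_r * n_route + (a_m * alpha_m) * n_rcm + a_route
    return a_logic + a_mem + a_comm

def circuit_area(n_cell, a_cell):
    """Total circuit area: A_circuit = N_cell * A_cell."""
    return n_cell * a_cell
```

Note how widening the channels (w_h, w_v) grows the cell superlinearly in routing cost, which is one reason the channel width of 9 is adopted rather than a larger value.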
3.3 Delay measurement

The delay of an FPGA cell is approximated using the method of "logical effort" proposed by Sutherland and Sproull [13][14]. The method is based on a simple RC model for transistors and provides a first-order approximation of a circuit delay. It defines \tau as the actual time, for a fabrication process, that corresponds to a delay unit. The value of \tau can be measured from the frequency of oscillation of a ring oscillator. For each type of logic gate, the method assigns delay-unit values based on the topology of the circuit element, the difficulty that an element has in driving capacitive loads, and the parasitic capacitance exhibited by the gate. The delay of an ideal inverter that drives another identical inverter is the sum of a single unit delay (\tau) and the parasitic delay value P_inv. Typically, for a 1\mu CMOS process, \tau = 0.5ns and P_inv = 0.6\tau, while for a 0.5\mu CMOS process, \tau = 0.1ns and P_inv = 0.5\tau. All other gate delays are measured relative to that of an ideal inverter.

Logical effort is used to arrive at a delay value for each type of FPGA cell. Separate values are determined for each cell input to output, the set-up time, and the synchronous clock-to-output delay for each cell type. These delays also include the effects of internal fan-outs.

After a circuit is mapped onto an array, its delay depends on the number of cells (N_depth) along the longest path from an input to an output, as well as the routing delay between these cells. The routing delay between neighbouring cells is accounted for by the explicit loading on that cell's output. The routing delay between non-neighbouring cells in a channelled array is more difficult to estimate, especially without knowledge of the exact placement and routing information and the capacitive loading on each level due to the programmable routing switches along the path. The total execution time of a circuit, in \tau units, can be determined as the sum of all delays along the longest path as follows:

    T_circuit = \sum_{i=1}^{N_depth} ( \delta_{ab}^{i} + \delta_{route}^{i} )

where \delta_{ab}^{i} is the delay, in \tau units, between input node a and output node b of a cell at circuit depth level i, and \delta_{route}^{i} is the routing delay, also in \tau units, between the output of the cell at level i and the input of another cell at level i+1. In this investigation, we will assume \delta_{route} to be zero. This assumption will bias \Delta in favour of the FPGA-based coprocessor model.

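The longest-path sum above is straightforward to compute once per-cell delays are known. A minimal sketch, with the routing delays defaulting to zero exactly as assumed in the text:

```python
def circuit_delay(cell_delays, route_delays=None):
    """Longest-path circuit delay in tau units: the sum, over depth
    levels, of the cell delay plus the inter-level routing delay
    (assumed zero here, as in the text)."""
    if route_delays is None:
        route_delays = [0.0] * len(cell_delays)
    return sum(d + r for d, r in zip(cell_delays, route_delays))
```

For a three-level path with cell delays of 2.0, 3.0, and 1.5 tau, the circuit delay is 6.5 tau; supplying non-zero routing delays would model a channelled array more faithfully.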
3.4 Implementation procedure

We determine the number of cells needed to implement a circuit (N_cell) and the depth of the implementation (N_depth) by a structure-preserving direct hand mapping from the original circuit designs. Automated mapping and routing results vary significantly with different tools and for different optimisation criteria. They also significantly alter the overall high-level organisation, resulting in low area utilisation even for regular circuit structures. We also assume that, with improved FPGA cell architectures and mapping, placement, and routing technologies, the routing structure is sufficient to complete the mapped network interconnections and give a very high array utilisation. The results will therefore provide a lower bound on the cost and performance of different implementations, which is exactly what we are looking for. Different designs are compared based on their implementation efficiency, defined as the area-times-delay product 'AT', or cost-performance, for that circuit. The lower the AT value, the more efficient the implementation.

3.5 Choice of arithmetic circuits

We mapped 10 different integer addition circuit designs representing several delay and area optimisation techniques. They included serial, carry-ripple, carry-skip, several one-level and two-level carry-lookahead, conditional-sum, carry-select, and pyramid adders. For integer multipliers, we only considered 2's complement multipliers with 1-bit Booth recoding. We also mapped 6 different multiplication circuit designs, including serial, sequential, sequential with one-level and two-level carry-save logic, parallel array, and bit systolic. For the sequential and array multipliers, which require an adder, we tried all the integer adders above to determine the ones that produce the best 'AT' results. All integer circuits were examined for bit widths varying from 4 bits to 64 bits.

For floating-point numbers we implemented a subset of the 32-bit and 64-bit IEEE specification standard and mapped both addition and multiplication circuits. Not all options referred to in the standard were included. In particular, we assumed truncation is performed after addition or multiplication, rather than rounding, to eliminate one normalisation step. Ten mantissa adders and six mantissa multipliers were tested, corresponding to the types of integer units above. The exponent adders used were chosen to minimise the 'AT' value for that design. Finally, we also implemented a pipelined version of those integer and floating-point circuits above that are pipelineable without incurring any increase in area cost.

Limitations. We did not include the control-path in either the cost or performance calculations. Furthermore, we did not examine any division algorithms or combined functions such as multiply-accumulate. We also restricted the investigation by not considering table-lookup techniques, distributed arithmetic, or the potential of other number systems.

4. Implementation Efficiency Gap

In an earlier study [15], we demonstrated that there is a large implementation efficiency gap between FPGA and VLSI platforms. We showed that a system implemented on FPGAs will require as much as 100 times more die area than a custom VLSI implementation and would be about 10 times slower. The result demonstrates how much worse an FPGA implementation is compared to a VLSI implementation. In this section we estimate the implementation efficiency gap between an FPGA platform and a programmable ALU as defined in Section 2. The ALU parameters we use are based on estimates for the area and delay of complementary-logic CMOS circuits, which represent upper bounds on VLSI implementation cost and performance. Therefore \Omega and \Delta, defined as the ratios of the area and delay results respectively of an FPGA-based implementation to those of an ALU, represent lower bounds, while actual area and delay overheads could certainly be much higher. We will now compare the efficiency of pipelined, and non-pipelined, FPGA-based arithmetic circuits to that of ALUs.

We select two pipelined integer ALUs, 32-bit and 64-bit, capable of addition, subtraction, multiplication, shift to the left, arithmetic shift to the right, and finally, logical AND, OR, and NOT operations. They use a fast array multiplier to achieve small multiplication delays at the expense of high die area. We also select two pipelined floating-point ALUs, 32-bit and 64-bit, capable of addition, subtraction, and multiplication, which also use an array multiplier as their mantissa multiplier/adder. For integer ALUs, we assume that addition takes a single cycle, while multiplication needs 2 cycles on a 32-bit ALU and 3 cycles on a 64-bit ALU. For the floating-point ALUs, we assume 4 execute stages for a 32-bit FP ALU and 6 stages for a 64-bit ALU. The result of the comparison is given in Table I, which gives the 'best case' results for both pipelined and non-pipelined FPGA implementations. Note that multiplying \Omega and \Delta in the table may not correspond to the \Omega \cdot \Delta entry because all results have been rounded.
By limiting pipelined integer circuits to only those pipelineable without additional area, multiplication is restricted to array multipliers, which have high area costs. Similarly, floating-point multiplication is not able to make use of the compactness of sequential multipliers and incurs a large area cost as well. The effects of these restrictions can be observed in Table I. The first thing to notice is that pipelined integer addition implementations give reasonably efficient results, even though the area overhead is still large. Of course, this is an improvement over the non-pipelined results, and is mostly due to the low cycle time of the FPGA implementations and, subsequently, the low delay overhead. This suggests that addition on FPGAs should be pipelined.

The second observation is that pipelined integer multiplication efficiency is better than the non-pipelined case, but at the expense of very high area cost: up to 60 times, from a maximum of 4.1 times in the non-pipelined case. Unlike addition, however, this result suggests that multiplication is not suitable at all for FPGA implementation. If required, however, with real-estate at a premium, as in our case, multiplication should not be pipelined, so as to save on area. Finally, the overhead for floating-point is much larger than that of integer implementation. The large overhead magnitude raises doubt regarding the usefulness of floating-point operations on FPGAs under any circumstances.

[Table I. 'Best case' area (\Omega) and delay (\Delta) overheads of pipelined and non-pipelined FPGA arithmetic implementations relative to comparable ALUs.]

Serial Arithmetic. Serial arithmetic can deliver low area overheads for low operand-width values. However, for larger widths, 32 bits and 64 bits for example, the area overhead is still considerably high, because the registers used in accumulating the result have to be included in the cost. It is also evident from our results that the low performance of serial designs makes their use on FPGAs unacceptable for high-throughput computations.

Summary. The results given here highlight the main problem with FPGA-based arithmetic circuits: they cannot compete with other programmable devices such as ALUs, at least at the circuit level. Pipelining can be used to bring the cycle time of an FPGA implementation close to the delay of an ALU, to reduce the delay overhead. But the intractable problem is the area overhead. Consider, from Table I, that m arithmetic nodes of type and data width i have a total die area equivalent to m \times \Omega_i ALUs. Serial and sequential arithmetic can reduce the implementation cost, but at the expense of decreasing the performance and increasing the delay overhead. Since on-chip real-estate is valuable and the area cost of any FPGA implementation is already high, FPGA design trade-offs should be optimised for area. Indeed, neither floating-point arithmetic nor integer multiplication has shown any implementation attributes favourable to FPGA platforms. For the case of integer multiplication, this was also reported by a study on commercial FPGAs [16].

5. Viability of Adaptive Coprocessors

From the results of the previous section and the 6 numbered equations in Section 2, we can evaluate the viability of the adaptive coprocessor proposal. For clarity, we assume either integer or floating-point operations only. Table II examines the critical iteration period bound (p_cip) and the minimum number of instructions in an unrolled iteration (q^int) as given by Eq(5) and Eq(6). These are the conditions necessary for an FPGA-based coprocessor to have better performance than a VLIW machine, i.e. a speedup larger than one. Values of both parameters are determined for a total coprocessor die area equivalent to approximately 5M and 20M transistors, assuming a hypothetical algorithm requiring a total node count \sum n_fpga = 10 with a reasonable value of loop unrolling k_vliw = 10. We examine four cases of the algorithm: all nodes are addition operations; all are multiplication operations; 9 are adds and one is a multiply; and finally, 5 are adds and 5 are multiplies.

Total area vs. speedup. Although a die area equivalent to 5M transistors may seem large for such a small number of function nodes, already some data-path widths, blanked out in Table II, will either not fit onto an FPGA of this size or will not achieve better performance than the VLIW model under any p_cip and q^int conditions. Also, the smaller die area forces the use of the more compact, but slower, non-pipelined circuits, giving worse results than the larger die area. On the other hand, from Eq(3) and Eq(4), as the available area, or number of ALUs, increases, the iteration interval (t_ii) decreases, reducing the potential speedup.

`Minimum instructions per iteration (film?- Considering
`the small die area first. rows I and 2 show that the number
`of compiled VLIW instructions per iteration. before loop
`unrolling.
`that would guarantee FPGA superiority is
`relatively small.
`It can be easily satisfied by many
`algorithms
`that only require
`a
`limited
`number of
`multipliers. Rows 3 and 4 show that as the number of
`multiply operations increase. the number of instructions
`per iteration has to be very high for an FPGA to beat a
`VLI'W engine particularly for data bit widths larger than
`lé—bi’ts. This suggests that highly concurrent algorithms of
`limited data-path widths. such as some image processing
`algorithms are suited for implementation on integrated
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 7
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2072, p. 7
`
TABLE II
Conditions determining the implementation viability of an FPGA-based coprocessor
(columns: 32-bit ALU and 64-bit ALU; the table body is illegible in this copy. Rows 1-16 tabulate the q_min and P_min conditions discussed in the text.)
For any given algorithm, maximum speedup is achieved with lower die area sizes above an absolute minimum area. The FPGA, however, would still not be able to deal effectively with algorithms dominated by floating-point operations.

Finally, note that the parametric limits reported in this section regarding the suitability of one algorithm or architecture over another are indicative of general trends for a broad range of operand widths and arithmetic functions, and should not be interpreted as precise statements of comparison for the specific models and circuits investigated.
Rows 5 to 8 show that as the total die area increases, bigger and faster multipliers can be used for even larger operand widths (16 to 32 bits in this case), reducing the minimum instruction count for those data-path widths and allowing more applications to satisfy the instruction count condition.
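Since the table body is not legible in this copy, the shape of the instruction-count condition can only be sketched. Assuming the FPGA wins once the VLIW's time per iteration (q instructions retired n_alus at a time) exceeds the FPGA's own iteration period, a hypothetical threshold would be:

```python
import math

def q_min(p_fpga_ns: float, vliw_cycle_ns: float, n_alus: int) -> int:
    """Hypothetical reconstruction of the Table II threshold, not the
    paper's actual formula: a VLIW with n_alus issue slots executes q
    instructions in about ceil(q / n_alus) * vliw_cycle_ns, so the FPGA
    circuit with iteration period p_fpga_ns is faster once q exceeds
    roughly n_alus * p_fpga_ns / vliw_cycle_ns."""
    return math.ceil(n_alus * p_fpga_ns / vliw_cycle_ns)

# A faster FPGA circuit (smaller iteration period) lowers the threshold,
# matching the observation for rows 5 to 8. Numbers are placeholders.
print(q_min(400.0, 10.0, 4))  # slow, compact circuit
print(q_min(200.0, 10.0, 4))  # faster pipelined circuit -> lower q_min
```

Under this model, anything that shortens the FPGA's iteration period (faster multipliers, pipelining) directly lowers the instruction count an algorithm must supply to justify the coprocessor.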
Iteration period bound (P_min). The critical iteration bound is the minimum time to execute a loop and is shown in rows 9 to 12 of Table II for the small die area. Similar observations to those made for the q_min parameter apply. Again, many algorithms can easily satisfy this condition and will therefore perform better on an FPGA-based coprocessor. Restrictions still have to be placed on the number of multipliers used. If the number of multipliers required is large, however, the condition can still be satisfied for small data-path widths. Furthermore, increasing the total area available for the coprocessor (rows 13 to 16) means that more multipliers can be accommodated and larger data-path widths can be used.
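The critical iteration bound itself is a standard quantity from loop scheduling (the recurrence bound over dependence cycles); a minimal sketch of its definition, offered as background rather than taken from the paper:

```python
def iteration_bound(cycles):
    """Critical iteration (recurrence) bound: for every dependence cycle
    in the loop's data-flow graph, divide its total operation delay by
    the number of loop iterations it spans; the loop cannot iterate
    faster than the largest such ratio, no matter how much hardware
    (VLIW ALUs or FPGA area) is available."""
    return max((delay / distance for delay, distance in cycles), default=0.0)

# Two dependence cycles: 40 ns of work spanning 2 iterations (20 ns/iter)
# and 30 ns spanning 1 iteration (30 ns/iter) -> the bound is 30 ns.
print(iteration_bound([(40.0, 2), (30.0, 1)]))
```

Because this bound depends only on the loop's dependence structure, it caps the benefit of adding ALUs to the VLIW, which is what leaves room for an FPGA implementation to win once its own circuit delay falls below the bound.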
