CHAPTER 6
`
`Cluster-Based Logic
`Blocks
`
In this chapter we investigate the speed and area-efficiency of FPGAs which use logic
clusters as their logic block. A logic cluster is composed of several look-up tables
and registers interconnected by local routing, as described in Section 3.1.1. In the
next section we motivate our research by describing some of the advantages of
cluster-based logic blocks, and by showing that these logic blocks are commercially
relevant. Section 6.2 describes the experimental flow we use to evaluate different logic
clusters. Sections 6.3 through 6.6 then explore several key architectural questions
concerning these logic blocks: how many inputs (I) should the FPGA routing provide
to each logic cluster; how should the logic block to general routing interface change
as a function of logic cluster size (N); and how are circuit speed, FPGA area-efficiency,
and design compile time affected by the size of the logic cluster used?
`
`6.1 Motivation
`
`As described in Section 2.1.2, most SRAM-based FPGAs use logic blocks based on
`look-up tables (LUTs). A look-up table with more inputs can implement more logic,
`and hence one needs fewer logic blocks to implement a circuit. This saves routing
`area, as there are fewer connections to route between logic blocks. However, look-up
`table complexity grows exponentially with the number of inputs, so it is impractical to
`use a LUT with a large number of inputs as a logic block.
`
`Instead of creating a larger logic block by increasing the number of inputs to a LUT,
`we can simply group several LUTs together and provide local routing to interconnect
`
`Intel Exhibit 1006 Part 2
`Intel v. Iida
`
`

`

`
[Figure labels: local routing; 4-input look-up tables; logic block (unused); main
(inter-block) routing.]

FIGURE 6.1 Implementation of a circuit in an FPGA with a cluster size of 2.
`
them. We call the resulting logic block a logic cluster [8, 9, 10]; the exact structure of
a logic cluster was described in Section 3.1.1. Figure 6.1 shows a circuit implemented
in an FPGA in which each logic cluster contains two four-input look-up
tables. Notice that many connections are made via the local interconnect within a
cluster.
`
One major advantage of an FPGA employing a logic block that contains several LUTs
is that fewer logic blocks will be needed to implement a circuit than would be needed
if each logic block were a single LUT. This reduces the size of the placement and routing
problem considerably. Since placement and routing is usually the most time-consuming
step in mapping a design to an FPGA, cluster-based logic blocks can
significantly reduce design compile time. As FPGAs grow larger, it is important to
keep this compile time from growing too large, or one of the key advantages of
FPGAs, rapid prototyping and design spins, will be lost [127].
`
`

`

`
`Cluster-based logic blocks also have the potential to significantly improve FPGA
`speed. In FPGAs composed of logic clusters, many connections will be made via the
`local routing within a cluster. Since this local routing can be made faster than the
`general-purpose routing between logic blocks, cluster-based logic blocks can improve
`FPGA speed. It is not obvious which logic cluster size leads to the highest speed
`FPGA, however. On the one hand, larger logic clusters lead to a higher fraction of
`connections being made via local interconnect. On the other hand, as the size of a
`logic cluster increases, the local cluster interconnect becomes slower, potentially
`resulting in a speed decrease even though more connections are captured within the
`logic clusters.
`
The area impact of grouping multiple LUTs into a logic cluster is also complex.
Grouping related LUTs together into a single logic block reduces the number of
connections to be routed between logic blocks, which saves routing area. Since the
general-purpose interconnect consumes most of the die area in SRAM-based FPGAs, this
is a significant area savings. On the other hand, in the logic clusters we study the area
required by the local routing within a cluster grows quadratically with cluster size.
For sufficiently large clusters, then, the area used by this local interconnect will
exceed the area saved in the general interconnect.
`
We explore four questions concerning the design of cluster-based logic blocks. First,
how many distinct inputs should the FPGA routing provide to a cluster of LUTs?
Reducing the number of inputs to a logic block saves routing area, but if the number
of inputs is too low many circuits will be unable to use all the LUTs in a logic cluster,
wasting area. Second, how should the flexibility of the logic block / routing interface
(i.e. Fc [1]) change as the number of LUTs in a logic cluster changes? Third,
how many LUTs should be included in a cluster to create FPGAs with the best
combination of speed and area-efficiency? Finally, how is the time required to compile a
circuit affected by the size of logic cluster used? Recent FPGAs from Xilinx, Altera,
Lucent Technologies, Actel and Vantis have all grouped several LUTs together into a
larger logic block, but there has been little published work investigating any of these
questions.
`
Recall from Section 3.1.1 that we can describe a logic cluster with four parameters:
the number of logic inputs (I), the number of BLEs (LUTs and registers) in a cluster
(N), the number of clock inputs (Mclk), and the number of inputs to each LUT (K). In
this chapter we fix the number of clocks per cluster at one for all our experiments,
since the MCNC benchmark circuits we use to evaluate architectures all have only
one clock. We set the number of inputs to each LUT, K, to 4, since previous research
has shown LUTs of this size are the most area-efficient [25], and because this is the
LUT size used in most commercial FPGAs. We will investigate logic clusters with
`
`

`

`
different values of I and N, however, as answering the questions posed above involves
finding the most appropriate values of I and N.
`
`6.2 Experimental Methodology
`
Our goal is to determine the cluster parameters that lead to the fastest and most
area-efficient FPGA architectures. There are no detailed analytic models of FPGA
architectures and circuitry, so we must evaluate architectures experimentally.
`
`We implement a set of twenty benchmark circuits into each FPGA architecture of
`interest, and measure the circuit speed and area required for each architecture. We
`implement each circuit using an automatic CAD flow similar to that used by typical
`FPGA users: technology-mapping, placement and routing. The benchmark circuits
`used are the 20 largest MCNC circuits [136]; they range in size from 1064 to 8383
`BLEs. These circuit sizes are typical of the designs being implemented in current
`commercial FPGAs.
`
`6.2.1 CAD Flow
`
Figure 6.2 illustrates the CAD flow used in these experiments. First, the SIS [142]
synthesis package is used to perform technology-independent logic optimization of
each circuit. Next, each circuit is technology-mapped into 4-LUTs and flip flops by
FlowMap, and the Flowpack post-processing algorithm is used to optimize the
mapping [46]. Our timing-driven T-VPack program (described in Section 3.1.3) then
maps this netlist of 4-LUTs and flip flops into logic clusters with the specified values
of N and I. At this point, then, the circuit is described as a set of interconnected logic
blocks of the exact type that exist in the FPGA we're targeting. Finally, we use VPR
to place and completely (combined global and detailed) route the circuit. In this
chapter all routing was performed by the VPR timing-driven router described in
Section 4.4.
`
As Figure 6.2 shows, the circuit is repeatedly routed with different channel capacities
until VPR finds the minimum number of wire segments per channel required to
successfully route the circuit, which we call Wmin. FPGA manufacturers normally build
enough routing into their FPGAs that "average" circuits have some spare routing
available. We model this by performing a final "low-stress" routing of each circuit
with the number of tracks per channel set to 1.2·Wmin. Our delay model then
estimates the circuit critical path, and our area model estimates the total transistor area
needed to lay out the FPGA. At the end of this CAD flow, then, we have enough
`
`

`

`
[Figure: flow chart with the following steps.
  Input: circuit.
  1. Logic optimization (SIS).
  2. Technology map to 4-LUTs (FlowMap + Flowpack).
  3. Pack FFs and LUTs into logic clusters (T-VPack), given the cluster parameters (N, I).
  4. Placement (VPR).
  5. Routing (VPR, timing-driven router), given the routing architecture parameters
     (Fc, etc.); adjust the channel capacities (W) and repeat until Wmin is determined.
  6. Routing with W = 1.2 Wmin (VPR, timing-driven router).
  7. Determine critical path delay and transistor area to build FPGA (VPR + TransCount).]

FIGURE 6.2 Architecture evaluation flow.
`
`information to compare both the speed and the area-efficiency of one logic block
`architecture to another.
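
The channel-width search in this flow can be sketched in a few lines of Python. This is only an illustration, not VPR's actual implementation: `routes_ok` is a hypothetical callback standing in for one complete routing attempt, the sketch assumes that a circuit which routes at some width also routes at every larger width, and rounding 1.2·Wmin up to a whole track is likewise an assumption.

```python
from typing import Callable


def find_w_min(routes_ok: Callable[[int], bool], w_start: int = 8) -> int:
    """Return the smallest channel width W for which routes_ok(W) is True.

    routes_ok is a hypothetical stand-in for one routing attempt at width W.
    Assumes routability is monotone in W (an assumption of this sketch).
    """
    # Phase 1: double W until the circuit routes, which brackets W_min.
    lo, hi = 0, max(1, w_start)
    while not routes_ok(hi):
        lo, hi = hi, hi * 2
    # Phase 2: binary search in (lo, hi]; lo is known to fail, hi to route.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if routes_ok(mid):
            hi = mid
        else:
            lo = mid
    return hi


def low_stress_width(w_min: int) -> int:
    """Track count for the final low-stress routing: 1.2 * W_min, rounded up."""
    return (12 * w_min + 9) // 10  # integer form of ceil(1.2 * w_min)
```

For example, if a circuit first routes at 13 tracks per channel, `find_w_min` returns 13 and the final low-stress routing would use `low_stress_width(13)` tracks.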
`
Notice that in the above CAD flow we are allowing the channel width to vary according
to the needs of each circuit. By allowing the channel width to vary and searching
for the minimum routable width, we can detect small improvements in FPGA
architectures or CAD algorithms that might otherwise go unnoticed. If instead we mapped
each circuit into an FPGA with a fixed channel width, we would only know whether
each circuit successfully routed or not. It is more difficult to draw architectural
conclusions from such a "binary" result.
`
`

`

`
`6.2.2 Area Model
`
Our area model is based on counting the number of minimum-width transistor areas
required to implement each FPGA architecture. A minimum-width transistor area is
simply the layout area occupied by the smallest transistor that can be contacted in a
process, plus the minimum spacing to another transistor above it and to its right, as
shown in Figure 6.3. Since the area of typical commercial FPGAs is dominated by
transistor area,¹ the most accurate way to assess the area of an FPGA architecture,
short of actually laying out each FPGA architecture studied, is to estimate the total
transistor area required by its layout. By counting the number of minimum-width
transistor areas required to implement an FPGA, rather than the number of square
microns which these transistors would occupy, we obtain a process-independent
estimate of the FPGA area.
`
Some transistors in FPGAs require a drive strength greater than that of a
minimum-width transistor. These transistors must either be made wider than minimum-width or
`
[Figure labels: minimum horizontal spacing; minimum vertical spacing; perimeter of
minimum-width transistor area; contact; polysilicon (gate).]

FIGURE 6.3 Definition of a minimum-width transistor area.
`
`1. We have discussed this issue with FPGA architects at both Xilinx and Altera, and they have
`confirmed that transistor area determines the die size of their current FPGAs.
`
`

`

`
[Figure panels: (a) Parallel diffusions; (b) Widen the transistor. Labels: minimum
contacted width; 2x minimum contacted width; polysilicon (gate); contact.]

FIGURE 6.4 Methods to create a structure with 2x the drive strength of a
minimum-width transistor.
`
their drive strength must be increased via parallel diffusion regions, as Figure 6.4
shows. The transistor active area increases in either case, but notice that if the parallel
diffusions technique of Figure 6.4(a) is used, the transistor active area increases by
less than a factor of two. As well, the spacing to the next transistor does not increase
when a transistor's drive strength is increased, so the active area plus spacing area of a
transistor with twice the minimum drive strength is less than two minimum-width
transistor areas. We examined the layout rules from a TSMC 0.35 µm process and
from an LSI Logic 0.4 µm process, and determined how much extra area was required
to give transistors greater drive strength, either by making them wider or by paralleling
diffusions. In both the LSI and TSMC processes, the number of minimum-width
transistor areas required by a transistor, trans, (averaged over the different layout
options) is:
`
$$\text{Minimum-width transistor areas}(trans) = 0.5 + \frac{\text{DriveStrength}(trans)}{2 \cdot \text{DriveStrength}(\text{Minimum Width})} \qquad (6.1)$$

`
A transistor with the minimum drive strength therefore takes 1 minimum-width
transistor area, while a transistor with double the minimum drive strength requires 1.5
minimum-width transistor areas.
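
As a quick check of Equation (6.1), the following Python sketch evaluates it for individual transistors and sums it over a structure. The function names are illustrative only (they are not taken from TransCount), drive strength is expressed relative to a minimum-width transistor, and rejecting drive strengths below 1 is an assumption of the sketch.

```python
from typing import Iterable


def min_width_transistor_areas(drive_strength: float) -> float:
    """Equation (6.1): area of one transistor in minimum-width transistor areas.

    drive_strength is relative to a minimum-width transistor (1.0 = minimum).
    """
    if drive_strength < 1.0:
        # Assumption of this sketch: nothing is weaker than a minimum-width device.
        raise ValueError("drive strength must be at least 1.0")
    return 0.5 + drive_strength / 2.0


def total_area(drive_strengths: Iterable[float]) -> float:
    """Sum Equation (6.1) over every transistor in a structure."""
    return sum(min_width_transistor_areas(d) for d in drive_strengths)
```

A structure built from two minimum-size transistors and one transistor of twice the minimum drive strength would therefore be counted as 1 + 1 + 1.5 = 3.5 minimum-width transistor areas.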
`
In order to apply (6.1) to determine an FPGA's area, we must determine the number
and size of the transistors required to build every structure in the FPGA. Appendix
B.1 provides schematics showing how we build the key structures in an FPGA, and
Section 6.2.5 discusses transistor sizing issues. In general, we tried to build an FPGA
with as few transistors as possible without unduly compromising speed. We created a
program, TransCount, that determines the area of a cluster-based logic block (including
the local cluster routing) with any values of N, I, K, and Mclk. This program is
fairly sophisticated: it models such effects as buffer resizing as a function of the
fanout of the connections within a logic block, and builds multi-stage buffers when
high drive strengths are required. Of course the area of an FPGA includes not only
the logic block area, but also routing area. VPR determines the routing area of each
FPGA of interest, and by adding this area to the logic block area we obtain the total
FPGA area. Recall that to evaluate the area of the FPGA needed to route a given
circuit we build an FPGA with a channel width, W, of 1.2·Wmin.
`
`6.2.3 Delay Model
`
Our delay values are all based on the delays in TSMC's 0.35 µm, 3.3 V CMOS
process. Some of the delays we use are listed in this section and in Appendix B.2, while
some delays cannot be listed because the process information is proprietary and was
obtained under a non-disclosure agreement.
`
`To determine the critical path of a circuit, we must:
`
`1. Determine the delay of every connection internal to a logic block,
`2. Determine the delay of every connection between logic blocks, and
`
`3. Perform a path-based timing analysis of the circuit using these delay values.
`
We found the delay of the connections within logic blocks by performing SPICE
simulations of every structure in a logic block. Figure 6.5 shows the major structures and
speed paths in a logic cluster, while Appendix B.1 contains transistor-level schematics
that show how we build the multiplexers, buffers, look-up tables, and latches
contained in a logic cluster. Since loading effects and input signal swing times can
considerably change the delay of a circuit, we always simulated speed paths with their
loads in place, and with the input to the path driven by the circuit which would drive it
in a real FPGA. Notice that as the number of BLEs in a cluster increases, the number
of inputs to the multiplexers forming the local cluster routing increases. As well, the
fanouts (within the cluster) of the I cluster input signals and N cluster output signals
increase, so the buffers driving these signals must be made larger (which results in a
longer, and slower, inverter chain). For both these reasons, the delays of some of the
paths within a logic cluster increase as the cluster size increases, as Table 6.1 shows.
`
After a routing is complete, we can perform step 2 above: determine the delay of
every routed connection. The circuits we map contain thousands of nets, so SPICE
simulation would be prohibitively time-consuming. Instead, we follow the procedure
described in Section 2.2.4; we model pass transistors and buffers by equivalent
circuits composed of resistors, capacitors, and idealized, constant delay elements. The
values of the various equivalent resistances, capacitances and buffer intrinsic delays
`
`

`

`
[Figure labels: input connection block buffers & muxes; local buffers; local routing
muxes; N BLEs; BLE; logic cluster; speed-path endpoints A, B, C and D.]

FIGURE 6.5 Structure and speed paths of a logic cluster.
`
TABLE 6.1 Important logic cluster delays in TSMC's 0.35 µm CMOS process.

Cluster Size (N)             A to B (ps)   B to C and D to C (ps)     C to D (ps)   B to D (ps)
1 (no local routing muxes)   760           55 (and no D to C path)    465           520
2                            760           540                        465           1005
4                            760           675                        465           1140
8                            760           815                        465           1280
16                           760           970                        465           1435
20                           760           1000                       465           1465
`
were again determined via SPICE simulations of the TSMC 0.35 µm process. For
details of the procedure and circuit assumptions, see Appendix B. After a routing is
complete, VPR uses these simplified models of pass transistors and buffers, as well as
metal capacitance and resistance data (all of which is specified in the architecture
description file), to build an equivalent RC-tree for each net. It then computes the
Elmore delay from the source to each of the sinks, as described in Section 4.5. We
have found this linearized circuit model to be quite accurate: in Section 7.2.3 we
`
`

`

`
`show that for a wide variety of different routing structures the delays computed by
`VPR are almost always within 9% of the delays computed by SPICE.
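
For readers unfamiliar with the Elmore delay, the sketch below computes it for a small RC tree using the standard textbook formulation: the delay to a node is the sum, over every branch on the path from the source, of the branch resistance times all the capacitance downstream of that branch. This is an illustration under stated assumptions, not VPR's code; the data layout (`parent`, `res`, `cap`) and the node names in the example are hypothetical, and buffers and their intrinsic delays are not modeled.

```python
from typing import Dict, List, Optional


def elmore_delays(parent: Dict[str, Optional[str]],
                  res: Dict[str, float],
                  cap: Dict[str, float]) -> Dict[str, float]:
    """Elmore delay from the root (source) to every node of an RC tree.

    parent[n] is the upstream node of n (None for the root),
    res[n] is the resistance of the branch from parent[n] to n,
    cap[n] is the capacitance lumped at node n.
    """
    children: Dict[str, List[str]] = {n: [] for n in parent}
    root = None
    for node, up in parent.items():
        if up is None:
            root = node
        else:
            children[up].append(node)

    # Visit nodes so that every parent precedes its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # Downstream capacitance of each node, accumulated from the leaves up.
    c_down = dict(cap)
    for node in reversed(order):
        for child in children[node]:
            c_down[node] += c_down[child]

    # Delay = parent's delay + branch resistance * capacitance that branch charges.
    delay = {root: 0.0}
    for node in order[1:]:
        delay[node] = delay[parent[node]] + res[node] * c_down[node]
    return delay
```

With a source `s` driving node `a` (R = 2, C = 1), which fans out to sinks `b` (R = 3, C = 2) and `c` (R = 5, C = 4), the branch into `a` charges a total capacitance of 7, so the delays are 14 at `a`, 14 + 3·2 = 20 at `b`, and 14 + 5·4 = 34 at `c` (in arbitrary consistent units).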
`
Finally, VPR performs a path-based timing analysis using these delay values to
determine the circuit critical path. Section 2.2.5 described the algorithms used in
path-based timing analysis, and Section 4.5 described the VPR timing analyzer
implementation.
`
`6.2.4 Architecture Evaluation Metric: Area-Delay Product
`
`One metric that we will use to evaluate the quality of different FPGA architectures is
`the area-delay product. This is a reasonable architecture metric for two reasons:
`
1. Intuitively, we want to find the point at which we are sacrificing the least amount
of area for the most improvement in speed. Given that we can always trade area
for speed (see below), and speed for area, it makes sense to combine these two
factors into one curve to see where the best trade-off occurs.

2. The computational throughput of an FPGA (on a parallel algorithm) is simply the
number of functional units multiplied by the clock speed. Another way of looking
at this is, throughput = (1/area per functional unit) · (1/delay). Therefore by
minimizing the area-delay product, we maximize throughput.
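
The relationship in item 2 can be made concrete with a short Python sketch. The function names and the idea of passing candidate architectures as a dictionary are illustrative assumptions; any area and delay values supplied to it are the caller's own and not results from this chapter.

```python
from typing import Dict, Tuple


def area_delay_product(area: float, delay: float) -> float:
    """The architecture metric: smaller is better."""
    return area * delay


def relative_throughput(area_per_unit: float, delay: float) -> float:
    """Throughput = (1 / area per functional unit) * (1 / delay)."""
    return (1.0 / area_per_unit) * (1.0 / delay)


def best_architecture(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Name of the candidate (area, delay) pair with the smallest area-delay product."""
    return min(candidates, key=lambda name: area_delay_product(*candidates[name]))
```

Because throughput is the reciprocal of the area-delay product, the candidate chosen by `best_architecture` is also the one with the largest `relative_throughput`.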
`
There are two main factors which can affect the area-delay product of an FPGA:
transistor sizing and the FPGA architecture. In general, the speed of an FPGA can be
increased (to a point) by sizing up the buffers and transistors within the FPGA, but
this increases area. Alternatively, the FPGA can be made smaller by sizing down the
buffers and transistors, but this degrades the FPGA performance.
`
`Throughout this chapter, we will size the transistors in each FPGA architecture to
`minimize the FPGA's area-delay product. Only by resizing transistors appropriately
`for each architecture in this way can we fairly compute the speed and area-efficiency
`of FPGAs with different logic block architectures.
`
`6.2.5 FPGA Architectural Assumptions
`
To evaluate the speed and area of an FPGA we must choose not only the logic block
architecture, but also a routing architecture and transistor sizes. The following
sections detail all of our architectural choices.
`
`

`

`
`Basic Architecture
`
We investigate island-style FPGAs in which each logic block is surrounded by routing
channels on all four sides. The logic block input and output pins are evenly
distributed around the logic block perimeter. Each circuit is mapped to the smallest square
FPGA with enough logic blocks and pads to accommodate it.
`
`Routing Architecture
`
We define the number of logic blocks which a routing segment spans as the logical
length of that segment. In Chapter 7 we show that an architecture in which routing
segments have a logical length of four, with 50% of the segments connected by
tri-state buffers and 50% connected by pass-transistors, provides good area-efficiency
and speed for FPGAs containing logic clusters of size four. This routing architecture
is shown in Figure 6.6. We implicitly assume that this routing architecture is good for
architectures containing logic clusters of all sizes, and we use this routing architecture
in all of our experiments. Ideally, one would find the best routing architecture for
each FPGA employing a different cluster size, but this would require a huge amount
of effort. By basing all of our experiments on this routing architecture, we may
slightly favor architectures with size four clusters over other architectures.
`
[Figure: array of logic clusters with the routing segments between them.]

FIGURE 6.6 FPGA with length 4 segments, 50% buffered and 50% pass transistor
switches.
`
`

`

`
Throughout this chapter we assume that all routing wires are laid out in metal 3,
using the minimum width and spacing. See Appendix C.4 for a discussion of metal
width and spacing issues in FPGAs.
`
`Effect of Cluster Size on the Physical Length of FPGA Routing Segments
`
As we increase the cluster size, both the logic area per cluster and routing area per
cluster grow. Figure 6.7 demonstrates how a tile (a logic block plus its associated
routing) grows as cluster size is increased. This increased tile size results in routing
segments with the same logical length having different physical lengths for logic
clusters of different sizes.
`
We define the measured length of a routing segment as its physical length. The
resistance and capacitance of a routing segment grow linearly with the segment's physical
length. We have experimentally determined the average rate at which the FPGA tiles
grow with cluster size, and have used this information to appropriately scale the
routing segment resistance and capacitance values for the various cluster sizes. The
increase in the resistance and capacitance of routing segments as the size, or
granularity, of the FPGA logic block increases is an important effect that has often been
neglected in prior FPGA architecture research.
`
`Sizing Routing Transistors to Compensate for Different Physical Segment Lengths
`
To compensate for differences in the capacitance and resistance of routing segments
in FPGAs using different sizes of logic clusters, we scale the routing pass transistors
and buffers. All of our pass transistor and buffer scaling is in relation to a base
architecture that has been area-delay optimized for clusters of size four. The method used
to size routing transistors for this base architecture is described in Appendix C. From
this base architecture, we linearly scale routing buffers and pass transistors depending
on the relation between the new segment lengths and the base segment length. For
example, in an FPGA with size 16 clusters, the physical segment length is
approximately twice as long as in an architecture with size 4 clusters. To maintain roughly the
same speed per routing segment, we increase the size of the routing switches
connecting to each wire by a factor of 2. In Section 6.5 we verify that this linear scaling of
buffers and pass-transistors with physical segment length provides good results.

[Figure labels: channel width; logic cluster; segment length; after increasing the
cluster size: increased channel width; increased logic area per cluster; increased
routing area per cluster; increased segment length.]

FIGURE 6.7 Effect of cluster size on physical length of routing segments.
`
In our architecture models, we account for variations in delay caused by resizing
buffers and pass-transistors. Also, changes in area due to the use of different sizes of
routing pass-transistors and buffers are automatically calculated by VPR.
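
The linear scaling rules described in this section can be written down directly. The Python sketch below is illustrative only: the function names are hypothetical, `length_ratio` is the physical segment length relative to the size-four base architecture, and any base values passed in are the caller's own rather than values from this chapter.

```python
from typing import Tuple


def scale_wire_rc(base_r: float, base_c: float, length_ratio: float) -> Tuple[float, float]:
    """Segment resistance and capacitance both grow linearly with physical length."""
    if length_ratio <= 0:
        raise ValueError("length_ratio must be positive")
    return base_r * length_ratio, base_c * length_ratio


def scale_switch_size(base_size: float, length_ratio: float) -> float:
    """Linearly scale a routing pass transistor or buffer with segment length."""
    if length_ratio <= 0:
        raise ValueError("length_ratio must be positive")
    return base_size * length_ratio
```

For the size 16 example in the text, the segment is about twice as long as in the size 4 base architecture, so `scale_switch_size(base_size, 2.0)` doubles each routing switch while `scale_wire_rc` doubles the wire resistance and capacitance.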
`
`6.3 Cluster Inputs Required vs. Cluster Size
`
As discussed in Section 6.1, the first question we wish to answer is how many distinct
inputs, I, should be provided to a cluster of size N. Since the number of transistors
required to implement each of the multiplexers in the cluster local routing (see
Figure 3.1(b)) grows linearly with I (for large I), we would like to make I as small as
possible. On the other hand, if I is made too small, many of the BLEs in a logic
cluster may become essentially unusable, reducing logic utilization and wasting area. We
find the minimum value of I that allows good cluster utilization by running benchmark
circuits through the first two steps shown in Figure 6.2, technology-mapping
and cluster packing, and measuring the resulting logic utilization for different values
of I. We define logic utilization to be the average number of BLEs per cluster that a
circuit is able to use divided by the total number of BLEs per cluster, N.
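
The definition of logic utilization above translates into a one-line computation. In this Python sketch the function name and the list-of-counts input format are assumptions made for illustration; the packer's real output format is not described in this section.

```python
from typing import Sequence


def logic_utilization(bles_used_per_cluster: Sequence[int], n: int) -> float:
    """Average number of BLEs used per cluster, divided by the cluster size N."""
    if not bles_used_per_cluster:
        raise ValueError("at least one cluster is required")
    if any(used < 0 or used > n for used in bles_used_per_cluster):
        raise ValueError("each cluster uses between 0 and N BLEs")
    average_used = sum(bles_used_per_cluster) / len(bles_used_per_cluster)
    return average_used / n
```

For instance, a circuit packed into four clusters of size 4 that uses 4, 4, 4 and 2 BLEs has an average of 3.5 BLEs per cluster and a logic utilization of 3.5 / 4 = 0.875.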
`
Figure 6.8 shows how the average logic utilization of our 20 benchmarks varies with I
for three different logic cluster sizes. The horizontal axis is the number of distinct
inputs to the cluster relative to the total number of BLE inputs in a cluster, i.e. I/(4N).
For very low values of I, the logic utilization is very low, as one would expect. It is
interesting, however, that when I is only 50 to 60% of the total number of BLE inputs,
the logic utilization is essentially 100%. Clearly it is possible to pack BLEs together
so that they have many common inputs and can reuse locally generated outputs. The
relative amount of input sharing and output reuse increases slightly with logic cluster
size, causing the curves in Figure 6.8 to shift to the left as cluster size increases.
`
The solid line in Figure 6.9 shows the value of I required to achieve 98% logic
utilization as the cluster size, N, is varied, while the dashed line shows how the average
`
`

`

`
[Figure: plot of fraction of BLEs used (20 benchmark average), from 0 to 1, versus
fraction of inputs accessible, I/(4N), from 0 to 1; one curve per cluster size (the
N = 4 curve label survives).]

FIGURE 6.8 Logic utilization vs. number of logic cluster inputs.
`
number of logic cluster inputs that are actually used varies with cluster size.
Although there are 4N BLE inputs in a logic cluster of size N, the number of inputs
required to achieve 98% logic utilization is approximately 2N + 2. Furthermore, the
average number of logic cluster inputs that are actually used grows even more slowly.
On average, a cluster of size 1 uses 3.57 of its inputs, while a cluster of size 20 uses
only 25.2 of its inputs. In other words, while the logic per cluster has increased by a
[Figure: plot of number of cluster inputs (I) (20 benchmark average), from 0 to 36,
versus cluster size (N), from 1 to 20. Solid line: inputs required for 98% logic
utilization. Dashed line: average inputs used.]

FIGURE 6.9 Variation in inputs required and inputs used with cluster size.
`
`

`

`
`factor of 20, the average number of connections that must be routed to each cluster
`has increased by a factor of only 7.
`
Our results indicate that commercial FPGAs can be more aggressive in reducing the
value of I. For example, the Altera Flex 8K FPGAs use logic clusters with N = 8 and
I = 24, while our results indicate that I = 18 suffices for a cluster of this size.
Similarly, the Xilinx 5200 FPGA uses a logic block very similar to a logic cluster with N =
4, and makes all 16 LUT inputs accessible, while our results suggest 10 to 11 inputs
are sufficient. Reducing I in this manner simplifies the local routing multiplexers
within a logic cluster and reduces the number of logic block pins that must be
connected to the FPGA routing, resulting in considerable area savings.
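
The rule of thumb from this section, I ≈ 2N + 2 for 4-input LUTs, can be compared against the commercial examples quoted above with a short Python sketch. The function names are illustrative, and the formula is only the approximate fit stated in the text rather than an exact reading of Figure 6.9 (Section 6.4, for example, quotes 36 inputs for a cluster of size 20).

```python
def suggested_cluster_inputs(n: int) -> int:
    """Approximate inputs needed for 98% logic utilization: 2N + 2."""
    if n < 1:
        raise ValueError("cluster size must be at least 1")
    return 2 * n + 2


def input_fraction(i: int, n: int, k: int = 4) -> float:
    """Cluster inputs as a fraction of the K*N BLE inputs, i.e. I/(4N) for K = 4."""
    return i / (k * n)
```

For N = 8 the sketch gives 18 inputs (versus the 24 of the Flex 8K), which is 18/32 ≈ 56% of the BLE inputs; for N = 4 it gives 10 inputs (versus the 16 exposed by the Xilinx 5200).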
`
`6.4 Flexibility of Logic Block to Routing Interconnect vs.
`Cluster Size
`
Before we can apply the experimental flow of Section 6.2 to see how area-efficiency
varies with cluster size, we must choose Fc, the number of routing tracks to which
each logic block pin can connect. On the one hand, using a smaller value of Fc
reduces the number of programmable switches in the FPGA routing, which improves
area-efficiency. On the other hand, smaller values of Fc make an FPGA less routable,
so that larger channel capacities, W, will be required to successfully route circuits.
This reduces area-efficiency by increasing the routing area. The goal is to choose a
value of Fc that balances these two competing objectives and achieves good
area-efficiency.
`
For a cluster of size 1, our experiments and those of Rose and Brown [30] have shown
that a good value of Fc is W; i.e. each logic block pin can be connected to any routing
track in an adjacent channel. For larger clusters, however, setting Fc to W provides far
more routing flexibility than is required, wasting area.
`
Recall from Section 3.1.1 that the logic clusters we investigate are fully-connected. In
`other words, a BLE input can be connected to any cluster input or to the output of any
`of the BLEs within the cluster via the local routing. In a fully-connected logic cluster
`all the cluster inputs and all the cluster outputs are logically-equivalent. That is, all of
`the inputs are functionally identical, and all of the outputs are functionally identical.
`This means that a net which is an input to a cluster can be connected to any of the I
`cluster inputs, and a net which is driven by a cluster output can be connected to any of
`the N cluster outputs. Changing the connections made by the multiplexer-based local
`routing can compensate for any pin swapping performed on the cluster input pins by
`the router. Changing which BLE within the cluster generates each of the functions
`
`

`

`
`required by the netlist can compensate for any pin swapping performed by the router
`on the cluster output pins. Therefore the router has a great deal of flexibility in how it
`routes inter-cluster nets.
`
The logical equivalence of cluster inputs and of cluster outputs means that keeping Fc
fixed at W, regardless of the cluster size, N, results in an excessive number of ways to
connect to large logic clusters. For example, a cluster of size one has 4 inputs and one
output. If Fc = W, then, there are 4W ways to connect to a cluster input and W ways
to connect to the cluster output. A cluster of size 20, on the other hand, has 36 inputs
and 20 outputs, so there are 36W ways to connect to a cluster input and 20W ways to
connect to a cluster output if Fc = W. This excessive routing flexibility for a cluster of
size 20 wastes a large amount of routing area, since we have added many more
programmable switches to the routing than is necessary.
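
The counting argument above is simple enough to sketch in Python. The function name is hypothetical, and the channel width of 40 used in the usage note is an arbitrary illustrative value, not a result from this chapter.

```python
def connection_choices(num_equivalent_pins: int, fc: float) -> float:
    """Number of (pin, track) choices for a net that may use any equivalent pin.

    With logically equivalent pins, each of the pins offers fc tracks,
    so the router can choose among num_equivalent_pins * fc connections.
    """
    if num_equivalent_pins < 0 or fc < 0:
        raise ValueError("pin count and Fc must be non-negative")
    return num_equivalent_pins * fc
```

With an illustrative W = 40 and Fc = W, a size 1 cluster offers 4·40 = 160 input choices and 40 output choices, while a size 20 cluster offers 36·40 = 1440 and 20·40 = 800. Dividing the output Fc by N, so that Fc = 40/20 = 2, brings the output choices of the size 20 cluster back to 20·2 = 40, the same as for a single-BLE block.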
`
We have experimentally found that a more appropriate level of routing flexibility
results when the Fc value for logic block output pins, Fc,output, is set to W/N, and all
the experiments in the next section use this value. This choice of Fc,output means that
each of the W routing tracks can be driven by one output pin on each logic block,
ensuring that all the routing tracks in a channel c
