`
Cluster-Based Logic Blocks
`
`In this chapter we investigate the speed and area-efficiency of FPGAs which use logic
`clusters as their logic block. A logic cluster is composed of several look-up tables
`and registers interconnected by local routing, as described in Section 3.1.1. In the
next section we motivate our research by describing some of the advantages of
cluster-based logic blocks, and by showing that these logic blocks are commercially
relevant. Section 6.2 describes the experimental flow we use to evaluate different logic
clusters. Sections 6.3 through 6.6 then explore several key architectural questions
concerning these logic blocks: how many inputs (I) should the FPGA routing provide
to each logic cluster; how should the logic block to general routing interface change
as a function of logic cluster size (N); and how are circuit speed, FPGA
area-efficiency, and design compile time affected by the size of the logic cluster used?
`
`6.1 Motivation
`
`As described in Section 2.1.2, most SRAM-based FPGAs use logic blocks based on
`look-up tables (LUTs). A look-up table with more inputs can implement more logic,
`and hence one needs fewer logic blocks to implement a circuit. This saves routing
`area, as there are fewer connections to route between logic blocks. However, look-up
`table complexity grows exponentially with the number of inputs, so it is impractical to
`use a LUT with a large number of inputs as a logic block.
`
`Instead of creating a larger logic block by increasing the number of inputs to a LUT,
`we can simply group several LUTs together and provide local routing to interconnect
`
`Intel Exhibit 1006 Part 2
`Intel v. Iida
`
`
`
`128 CHAPTER 6 Cluster-Based Logic Blocks
`
FIGURE 6.1 Implementation of a circuit in an FPGA with a cluster size of 2.
(Figure labels: local routing; 4-input look-up tables; logic block (unused); main (inter-block) routing.)
`
them. We call the resulting logic block a logic cluster [8, 9, 10]; the exact structure of
a logic cluster was described in Section 3.1.1. Figure 6.1 shows a circuit
implemented in an FPGA in which each logic cluster contains two four-input look-up
tables. Notice that many connections are made via the local interconnect within a
cluster.

One major advantage of an FPGA employing a logic block that contains several LUTs
is that fewer logic blocks will be needed to implement a circuit than would be needed
if each logic block were a single LUT. This reduces the size of the placement and
routing problem considerably. Since placement and routing is usually the most
time-consuming step in mapping a design to an FPGA, cluster-based logic blocks can
significantly reduce design compile time. As FPGAs grow larger, it is important to
keep this compile time from growing too large, or one of the key advantages of
FPGAs, rapid prototyping and design spins, will be lost [127].
`
`
`
`
`Cluster-based logic blocks also have the potential to significantly improve FPGA
`speed. In FPGAs composed of logic clusters, many connections will be made via the
`local routing within a cluster. Since this local routing can be made faster than the
`general-purpose routing between logic blocks, cluster-based logic blocks can improve
`FPGA speed. It is not obvious which logic cluster size leads to the highest speed
`FPGA, however. On the one hand, larger logic clusters lead to a higher fraction of
`connections being made via local interconnect. On the other hand, as the size of a
`logic cluster increases, the local cluster interconnect becomes slower, potentially
`resulting in a speed decrease even though more connections are captured within the
`logic clusters.
`
`The area impact of grouping multiple LUTs into a logic cluster is also complex.
Grouping related LUTs together into a single logic block reduces the number of
connections to be routed between logic blocks, which saves routing area. Since the
general-purpose interconnect consumes most of the die area in SRAM-based FPGAs, this
`is a significant area savings. On the other hand, in the logic clusters we study the area
`required by the local routing within a cluster grows quadratically with cluster size.
`For sufficiently large clusters, then, the area used by this local interconnect will
`exceed the area saved in the general interconnect.
`
We explore four questions concerning the design of cluster-based logic blocks. First,
how many distinct inputs should the FPGA routing provide to a cluster of LUTs?
Reducing the number of inputs to a logic block saves routing area, but if the number
of inputs is too low many circuits will be unable to use all the LUTs in a logic cluster,
wasting area. Second, how should the flexibility of the logic block / routing
interface (i.e. Fc [1]) change as the number of LUTs in a logic cluster changes? Third,
how many LUTs should be included in a cluster to create FPGAs with the best
combination of speed and area-efficiency? Finally, how is the time required to compile a
circuit affected by the size of logic cluster used? Recent FPGAs from Xilinx, Altera,
Lucent Technologies, Actel and Vantis have all grouped several LUTs together into a
larger logic block, but there has been little published work investigating any of these
questions.
`
`Recall from Section 3.1.1 that we can describe a logic cluster with four parameters:
the number of logic inputs (I), the number of BLEs (LUTs and registers) in a cluster
(N), the number of clock inputs (Mclk), and the number of inputs to each LUT (K). In
`this chapter we fix the number of clocks per cluster at one for all our experiments,
`since the MCNC benchmark circuits we use to evaluate architectures all have only
`one clock. We set the number of inputs to each LUT, K, to 4, since previous research
`has shown LUTs of this size are the most area-efficient [25], and because this is the
`LUT size used in most commercial FPGAs. We will investigate logic clusters with
`
`
`
`
different values of I and N, however, as answering the questions posed above involves
finding the most appropriate values of I and N.
`
`6.2 Experimental Methodology
`
Our goal is to determine the cluster parameters that lead to the fastest and most
area-efficient FPGA architectures. There are no detailed analytic models of FPGA
architectures and circuitry, so we must evaluate architectures experimentally.
`
`We implement a set of twenty benchmark circuits into each FPGA architecture of
`interest, and measure the circuit speed and area required for each architecture. We
`implement each circuit using an automatic CAD flow similar to that used by typical
`FPGA users: technology-mapping, placement and routing. The benchmark circuits
`used are the 20 largest MCNC circuits [136]; they range in size from 1064 to 8383
`BLEs. These circuit sizes are typical of the designs being implemented in current
`commercial FPGAs.
`
`6.2.1 CAD Flow
`
Figure 6.2 illustrates the CAD flow used in these experiments. First, the SIS [142]
synthesis package is used to perform technology-independent logic optimization of
each circuit. Next, each circuit is technology-mapped into 4-LUTs and flip flops by
FlowMap, and the Flowpack post-processing algorithm is used to optimize the
mapping [46]. Our timing-driven T-VPack program (described in Section 3.1.3) then
maps this netlist of 4-LUTs and flip flops into logic clusters with the specified values
of N and I. At this point, then, the circuit is described as a set of interconnected logic
blocks of the exact type that exist in the FPGA we're targeting. Finally, we use VPR
to place and completely (combined global and detailed) route the circuit. In this
chapter all routing was performed by the VPR timing-driven router described in
Section 4.4.
`
As Figure 6.2 shows, the circuit is repeatedly routed with different channel capacities
until VPR finds the minimum number of wire segments per channel required to
successfully route the circuit, which we call Wmin. FPGA manufacturers normally build
enough routing into their FPGAs that "average" circuits have some spare routing
available. We model this by performing a final "low-stress" routing of each circuit
with the number of tracks per channel set to 1.2·Wmin. Our delay model then
estimates the circuit critical path, and our area model estimates the total transistor area
needed to lay out the FPGA. At the end of this CAD flow, then, we have enough
`
`
`
`
FIGURE 6.2 Architecture evaluation flow. The flowchart contains the following steps:

1. Circuit: logic optimization (SIS).
2. Technology map to 4-LUTs (FlowMap + Flowpack).
3. Pack FFs and LUTs into logic clusters (T-VPack), given the cluster parameters (N, I).
4. Placement (VPR).
5. Routing (VPR, timing-driven router), given the routing architecture parameters (Fc, etc.).
6. If Wmin has not yet been determined, adjust the channel capacities (W) and route again.
7. Once Wmin is determined, route with W = 1.2 Wmin (VPR, timing-driven router).
8. Determine critical path delay and transistor area to build the FPGA (VPR + TransCount).
`
`information to compare both the speed and the area-efficiency of one logic block
`architecture to another.
`
Notice that in the above CAD flow we are allowing the channel width to vary
according to the needs of each circuit. By allowing the channel width to vary and searching
for the minimum routable width, we can detect small improvements in FPGA
architectures or CAD algorithms that might otherwise go unnoticed. If instead we mapped
each circuit into an FPGA with a fixed channel width, we would only know whether
each circuit successfully routed or not. It is more difficult to draw architectural
conclusions from such a "binary" result.
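The search for the minimum routable channel width can be sketched as follows. This is only an illustration, not VPR's actual procedure: the text says the circuit is routed repeatedly at different widths but does not give the search strategy, so the doubling-plus-binary search, the `routes_at` callback (a hypothetical stand-in for one complete routing attempt), and the rounding up of 1.2·Wmin are all assumptions.

```python
from typing import Callable


def find_min_channel_width(routes_at: Callable[[int], bool], start: int = 8) -> int:
    """Return the smallest channel width W for which routes_at(W) is True.

    Assumes routability is monotonic: a circuit that routes at W also routes
    at every larger width. routes_at is a hypothetical stand-in for one full
    routing attempt at the given width.
    """
    low = 0  # a width of 0 can never route, so it is a safe failing bound
    high = start
    # Grow W geometrically until the circuit routes, giving an upper bound.
    while not routes_at(high):
        low = high
        high *= 2
    # Binary search between the largest failing and smallest passing width.
    while high - low > 1:
        mid = (low + high) // 2
        if routes_at(mid):
            high = mid
        else:
            low = mid
    return high


def low_stress_width(w_min: int) -> int:
    """Return 1.2 * w_min rounded up (the rounding is an assumption).

    Integer arithmetic (6/5 = 1.2) avoids floating-point rounding surprises.
    """
    return (6 * w_min + 4) // 5
```

For a circuit that happens to need 13 tracks, `find_min_channel_width(lambda w: w >= 13)` returns 13 and `low_stress_width(13)` gives 16 tracks for the final low-stress routing.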
`
`
`
`
`6.2.2 Area Model
`
`Our area model is based on counting the number of minimum-width transistor areas
`required to implement each FPGA architecture. A minimum-width transistor area is
`simply the layout area occupied by the smallest transistor that can be contacted in a
`process, plus the minimum spacing to another transistor above it and to its right, as
shown in Figure 6.3. Since the area of typical commercial FPGAs is dominated by
transistor area (see footnote 1), the most accurate way to assess the area of an FPGA
architecture, short of actually laying out each FPGA architecture studied, is to estimate
the total transistor area required by its layout. By counting the number of minimum-width
transistor areas required to implement an FPGA, rather than the number of square
microns which these transistors would occupy, we obtain a process-independent
estimate of the FPGA area.
`
Some transistors in FPGAs require a drive strength greater than that of a
minimum-width transistor. These transistors must either be made wider than minimum-width or
`
FIGURE 6.3 Definition of a minimum-width transistor area.
(Figure labels: minimum horizontal spacing; minimum vertical spacing; perimeter of minimum-width transistor area; contact; polysilicon (gate).)
`
`1. We have discussed this issue with FPGA architects at both Xilinx and Altera, and they have
`confirmed that transistor area determines the die size of their current FPGAs.
`
`
`
`
FIGURE 6.4 Methods to create a structure with 2x the drive strength of a minimum-width
transistor: (a) parallel diffusions; (b) widen the transistor.
(Figure labels: 2x minimum contacted width; polysilicon (gate); contact.)
`
`their drive strength must be increased via parallel diffusion regions, as Figure 6.4
`shows. The transistor active area increases in either case, but notice that if the parallel
`diffusions technique of Figure 6.4(a) is used, the transistor active area increases by
`less than a factor of two. As well, the spacing to the next transistor does not increase
`when a transistor's drive strength is increased, so the active area plus spacing area of a
`transistor with twice the minimum drive strength is less than two minimum-width
`transistor areas. We examined the layout rules from a TSMC 0.35 µm process and
`from an LSI Logic 0.4 µm process, and determined how much extra area was required
to give transistors greater drive strength, either by making them wider or by
paralleling diffusions. In both the LSI and TSMC processes, the number of minimum-width
`transistor areas required by a transistor, trans, (averaged over the different layout
`options) is:
`
$$
\text{Minimum-width transistor areas}(trans) = 0.5 + \frac{\text{DriveStrength}(trans)}{2 \cdot \text{DriveStrength}(\text{MinimumWidth})} \tag{6.1}
$$
`
A transistor with the minimum drive strength therefore takes 1 minimum-width
transistor area, while a transistor with double the minimum drive strength requires 1.5
minimum-width transistor areas.
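As a small illustration, Equation (6.1) can be written as a function of the drive-strength ratio. The function and parameter names below are our own, chosen for the sketch; they are not taken from TransCount or VPR.

```python
from typing import Iterable


def min_width_transistor_areas(drive_strength_ratio: float) -> float:
    """Area of one transistor, in minimum-width transistor areas, per Equation (6.1).

    drive_strength_ratio is DriveStrength(trans) / DriveStrength(MinimumWidth).
    """
    return 0.5 + drive_strength_ratio / 2.0


def total_transistor_area(drive_strength_ratios: Iterable[float]) -> float:
    """Sum Equation (6.1) over the transistors of a structure."""
    return sum(min_width_transistor_areas(r) for r in drive_strength_ratios)
```

A minimum-strength transistor gives `min_width_transistor_areas(1)` = 1.0 and a double-strength transistor gives 1.5, matching the two cases stated in the text.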
`
In order to apply (6.1) to determine an FPGA's area, we must determine the number
and size of the transistors required to build every structure in the FPGA. Appendix
B.1 provides schematics showing how we build the key structures in an FPGA, and
Section 6.2.5 discusses transistor sizing issues. In general, we tried to build an FPGA
with as few transistors as possible without unduly compromising speed. We created a
program, TransCount, that determines the area of a cluster-based logic block
(including the local cluster routing) with any values of N, I, K, and Mclk. This program is
fairly sophisticated: it models such effects as buffer resizing as a function of the
fanout of the connections within a logic block, and builds multi-stage buffers when
high drive strengths are required. Of course, the area of an FPGA includes not only
the logic block area, but also routing area. VPR determines the routing area of each
FPGA of interest, and by adding this area to the logic block area we obtain the total
FPGA area. Recall that to evaluate the area of the FPGA needed to route a given
circuit we build an FPGA with a channel width, W, of 1.2·Wmin.
`
`6.2.3 Delay Model
`
Our delay values are all based on the delays in TSMC's 0.35 µm, 3.3 V CMOS
process. Some of the delays we use are listed in this section and in Appendix B.2, while
some delays cannot be listed because the process information is proprietary and was
obtained under a non-disclosure agreement.
`
`To determine the critical path of a circuit, we must:
`
`1. Determine the delay of every connection internal to a logic block,
`2. Determine the delay of every connection between logic blocks, and
`
`3. Perform a path-based timing analysis of the circuit using these delay values.
`
We found the delay of the connections within logic blocks by performing SPICE
simulations of every structure in a logic block. Figure 6.5 shows the major structures and
speed paths in a logic cluster, while Appendix B.1 contains transistor-level
schematics that show how we build the multiplexers, buffers, look-up tables, and latches
contained in a logic cluster. Since loading effects and input signal swing times can
considerably change the delay of a circuit, we always simulated speed paths with their
loads in place, and with the input to the path driven by the circuit which would drive it
in a real FPGA. Notice that as the number of BLEs in a cluster increases, the number
of inputs to the multiplexers forming the local cluster routing increases. As well, the
fanouts (within the cluster) of the I cluster input signals and N cluster output signals
increase, so the buffers driving these signals must be made larger (which results in a
longer, and slower, inverter chain). For both these reasons, the delays of some of the
paths within a logic cluster increase as the cluster size increases, as Table 6.1 shows.
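To make the growth in multiplexer size concrete, the sketch below counts multiplexer data inputs in a fully-connected cluster, where each BLE input can select any of the I cluster inputs or any of the N BLE outputs. It assumes one local routing multiplexer per BLE input (K per BLE), which is our reading of the cluster structure rather than a statement taken from TransCount, and it counts selectable signals, not transistors.

```python
def local_mux_fan_in(num_cluster_inputs: int, cluster_size: int) -> int:
    """Signals one BLE-input multiplexer can select: I cluster inputs + N BLE outputs."""
    return num_cluster_inputs + cluster_size


def total_local_mux_inputs(num_cluster_inputs: int, cluster_size: int,
                           lut_inputs: int = 4) -> int:
    """Total data inputs over all local routing multiplexers in one cluster.

    Assumes one multiplexer per BLE input, i.e. lut_inputs * cluster_size
    multiplexers. A cluster of size 1 has no local routing muxes.
    """
    if cluster_size == 1:
        return 0
    return lut_inputs * cluster_size * local_mux_fan_in(num_cluster_inputs, cluster_size)
```

With N = 4 and I = 10, each multiplexer selects among 14 signals, for 224 multiplexer inputs per cluster; with N = 8 and I = 18 the totals are 26 and 832. Because I itself grows with N, the total grows roughly quadratically with cluster size.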
`
After a routing is complete, we can perform step 2 above: determine the delay of
every routed connection. The circuits we map contain thousands of nets, so SPICE
simulation would be prohibitively time-consuming. Instead, we follow the procedure
described in Section 2.2.4; we model pass transistors and buffers by equivalent
circuits composed of resistors, capacitors, and idealized, constant delay elements. The
values of the various equivalent resistances, capacitances and buffer intrinsic delays
`
`
`
FIGURE 6.5 Structure and speed paths of a logic cluster.
(Figure labels: input connection block buffers and muxes; local buffers; local routing muxes; N BLEs; logic cluster.)
`
TABLE 6.1 Important logic cluster delays in TSMC's 0.35 µm CMOS process.

Cluster Size (N)             A to B (ps)   B to C and D to C (ps)     C to D (ps)   B to D (ps)
1 (no local routing muxes)   760           55 (and no D to C path)    465           520
2                            760           540                        465           1005
4                            760           675                        465           1140
8                            760           815                        465           1280
16                           760           970                        465           1435
20                           760           1000                       465           1465
`
were again determined via SPICE simulations of the TSMC 0.35 µm process. For
details of the procedure and circuit assumptions, see Appendix B. After a routing is
complete, VPR uses these simplified models of pass transistors and buffers, as well as
metal capacitance and resistance data (all of which is specified in the architecture
description file), to build an equivalent RC-tree for each net. It then computes the
Elmore delay from the source to each of the sinks, as described in Section 4.5. We
have found this linearized circuit model to be quite accurate: in Section 7.2.3 we
show that for a wide variety of different routing structures the delays computed by
VPR are almost always within 9% of the delays computed by SPICE.
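For readers unfamiliar with the Elmore delay, the following is a minimal sketch of the computation on an RC tree. It is not VPR's implementation: the tree representation and names are our own, and the constant intrinsic delays that the text attributes to buffers are left out, so the sketch covers only a purely resistive and capacitive tree.

```python
from typing import Dict, List, Optional, Tuple

# node name -> (parent name, or None for the source;
#               resistance of the edge from the parent, in ohms;
#               capacitance lumped at the node, in farads)
RCTree = Dict[str, Tuple[Optional[str], float, float]]


def elmore_delays(tree: RCTree) -> Dict[str, float]:
    """Return the Elmore delay from the source to every node of an RC tree."""
    children: Dict[str, List[str]] = {name: [] for name in tree}
    root = None
    for name, (parent, _, _) in tree.items():
        if parent is None:
            root = name
        else:
            children[parent].append(name)
    if root is None:
        raise ValueError("tree has no source node")

    # Pre-order traversal: every parent appears before its children.
    order: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children[node])

    # Downstream capacitance of a node = its own C plus that of its subtree.
    c_down: Dict[str, float] = {}
    for node in reversed(order):
        c_down[node] = tree[node][2] + sum(c_down[c] for c in children[node])

    # Each edge adds (edge resistance) * (capacitance downstream of the edge).
    delay: Dict[str, float] = {}
    for node in order:
        parent, resistance, _ = tree[node]
        upstream = delay[parent] if parent is not None else 0.0
        delay[node] = upstream + resistance * c_down[node]
    return delay
```

For example, with a source driving node `a` through 1 Ω (1 F at `a`), and `a` driving `b` through 2 Ω (3 F) and `c` through 1 Ω (2 F), the delays to `a`, `b` and `c` are 6, 12 and 8 seconds respectively. The unit-sized values are chosen only to keep the arithmetic easy to follow.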
`
Finally, VPR performs a path-based timing-analysis using these delay values to
determine the circuit critical path. Section 2.2.5 described the algorithms used in
path-based timing analysis, and Section 4.5 described the VPR timing analyzer
implementation.
`
`6.2.4 Architecture Evaluation Metric: Area-Delay Product
`
`One metric that we will use to evaluate the quality of different FPGA architectures is
`the area-delay product. This is a reasonable architecture metric for two reasons:
`
1. Intuitively, we want to find the point at which we are sacrificing the least amount
of area for the most improvement in speed. Given that we can always trade area
for speed (see below), and speed for area, it makes sense to combine these two
factors into one curve to see where the best trade-off occurs.

2. The computational throughput of an FPGA (on a parallel algorithm) is simply the
number of functional units multiplied by the clock speed. Another way of looking
at this is, throughput = (1/area per functional unit) · (1/delay). Therefore, by
minimizing the area-delay product, we maximize throughput.
`
There are two main factors which can affect the area-delay product of an FPGA:
transistor sizing and the FPGA architecture. In general, the speed of an FPGA can be
`increased (to a point) by sizing up the buffers and transistors within the FPGA, but
`this increases area. Alternatively, the FPGA can be made smaller by sizing down the
`buffers and transistors, but this degrades the FPGA performance.
`
`Throughout this chapter, we will size the transistors in each FPGA architecture to
`minimize the FPGA's area-delay product. Only by resizing transistors appropriately
`for each architecture in this way can we fairly compute the speed and area-efficiency
`of FPGAs with different logic block architectures.
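The metric itself is simple to compute; the sketch below picks the candidate with the smallest area-delay product. The candidate names and numbers in the usage note are invented purely for illustration and are not results from this chapter.

```python
from typing import Dict, Tuple


def area_delay_product(area: float, delay: float) -> float:
    """Area-delay product; lower is better, since throughput ~ 1 / (area * delay)."""
    return area * delay


def best_architecture(candidates: Dict[str, Tuple[float, float]]) -> str:
    """Return the name of the candidate with the smallest area-delay product.

    candidates maps a name to (area, delay) in any consistent units.
    """
    return min(candidates, key=lambda name: area_delay_product(*candidates[name]))
```

With the hypothetical candidates `{"A": (100.0, 10.0), "B": (120.0, 8.0), "C": (90.0, 12.0)}`, the products are 1000, 960 and 1080, so `best_architecture` returns `"B"`: the 20% area increase relative to A is outweighed by the 20% delay reduction.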
`
`6.2.5 FPGA Architectural Assumptions
`
To evaluate the speed and area of an FPGA we must choose not only the logic block
architecture, but also a routing architecture and transistor sizes. The following
sections detail all of our architectural choices.
`
`
`
`
`Basic Architecture
`
`We investigate island-style FPGAs in which each logic block is surrounded by routing
channels on all four sides. The logic block input and output pins are evenly
distributed around the logic block perimeter. Each circuit is mapped to the smallest square
`FPGA with enough logic blocks and pads to accommodate it.
`
`Routing Architecture
`
`We define the number of logic blocks which a routing segment spans as the logical
`length of that segment. In Chapter 7 we show that an architecture in which routing
segments have a logical length of four, with 50% of the segments connected by
tri-state buffers and 50% connected by pass-transistors, provides good area-efficiency
`and speed for FPGAs containing logic clusters of size four. This routing architecture
`is shown in Figure 6.6. We implicitly assume that this routing architecture is good for
`architectures containing logic clusters of all sizes, and we use this routing architecture
`in all of our experiments. Ideally, one would find the best routing architecture for
`each FPGA employing a different cluster size, but this would require a huge amount
`of effort. By basing all of our experiments on this routing architecture, we may
`slightly favor architectures with size four clusters over other architectures.
`
FIGURE 6.6 FPGA with length 4 segments, 50% buffered and 50% pass transistor
switches.
`
`
`
`
Throughout this chapter we assume that all routing wires are laid out in metal 3,
using the minimum width and spacing. See Appendix C.4 for a discussion of metal
width and spacing issues in FPGAs.
`
`Effect of Cluster Size on the Physical Length of FPGA Routing Segments
`
`As we increase the cluster size, both the logic area per cluster and routing area per
`cluster grow. Figure 6.7 demonstrates how a tile (a logic block plus its associated
`routing) grows as cluster size is increased. This increased tile size results in routing
segments with the same logical length having different physical lengths for logic
clusters of different sizes.
`
We define the measured length of a routing segment as its physical length. The
resistance and capacitance of a routing segment grow linearly with the segment's physical
length. We have experimentally determined the average rate at which the FPGA tiles
grow with cluster size, and have used this information to appropriately scale the
routing segment resistance and capacitance values for the various cluster sizes. The
increase in the resistance and capacitance of routing segments as the size, or
granularity, of the FPGA logic block increases is an important effect that has often been
neglected in prior FPGA architecture research.
`
FIGURE 6.7 Effect of cluster size on physical length of routing segments.
(Figure labels: channel width; logic cluster; segment length; and, after increasing the cluster size, increased channel width, increased logic area per cluster, increased routing area per cluster, and increased segment length.)

Sizing Routing Transistors to Compensate for Different Physical Segment Lengths

To compensate for differences in the capacitance and resistance of routing segments
in FPGAs using different sizes of logic clusters, we scale the routing pass transistors
and buffers. All of our pass transistor and buffer scaling is in relation to a base
architecture that has been area-delay optimized for clusters of size four. The method used
to size routing transistors for this base architecture is described in Appendix C. From
this base architecture, we linearly scale routing buffers and pass transistors depending
on the relation between the new segment lengths and the base segment length. For
example, in an FPGA with size 16 clusters, the physical segment length is
approximately 2x longer than in an architecture with size 4 clusters. To maintain roughly the
same speed per routing segment, we increase the size of the routing switches
connecting to each wire by a factor of 2. In Section 6.5 we verify that this linear scaling of
buffers and pass-transistors with physical segment length provides good results.
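The linear scaling rule amounts to a single proportion, sketched below. The function name, parameter names, and the base switch size used in the usage note are hypothetical; the actual base sizes come from the procedure in Appendix C.

```python
def scaled_switch_size(base_switch_size: float,
                       base_segment_length: float,
                       new_segment_length: float) -> float:
    """Scale a routing switch linearly with physical segment length.

    base_switch_size is the switch size in the base (N = 4) architecture;
    the two lengths may be in any consistent unit, since only their ratio matters.
    """
    return base_switch_size * (new_segment_length / base_segment_length)
```

For the example in the text, a segment that is 2x longer (size 16 clusters versus size 4) doubles the switch size: `scaled_switch_size(10.0, 1.0, 2.0)` returns 20.0, where 10.0 is an arbitrary base size.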
`
In our architecture models, we account for variations in delay caused by resizing
buffers and pass-transistors. Also, changes in area due to the use of different sizes of
routing pass-transistors and buffers are automatically calculated by VPR.
`
`6.3 Cluster Inputs Required vs. Cluster Size
`
As discussed in Section 6.1, the first question we wish to answer is how many distinct
inputs, I, should be provided to a cluster of size N. Since the number of transistors
required to implement each of the multiplexers in the cluster local routing (see
Figure 3.1(b)) grows linearly with I (for large I), we would like to make I as small as
possible. On the other hand, if I is made too small, many of the BLEs in a logic
cluster may become essentially unusable, reducing logic utilization and wasting area. We
find the minimum value of I that allows good cluster utilization by running
benchmark circuits through the first two steps shown in Figure 6.2, technology-mapping
and cluster packing, and measuring the resulting logic utilization for different values
of I. We define logic utilization to be the average number of BLEs per cluster that a
circuit is able to use divided by the total number of BLEs per cluster, N.
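The definition of logic utilization translates directly into code. The sketch below assumes the packer reports how many BLEs it placed in each cluster; the function name and input format are our own and do not describe T-VPack's actual output.

```python
from typing import Sequence


def logic_utilization(bles_used_per_cluster: Sequence[int], cluster_size: int) -> float:
    """Average number of BLEs used per cluster, divided by the cluster size N."""
    if not bles_used_per_cluster:
        raise ValueError("at least one cluster is required")
    average_used = sum(bles_used_per_cluster) / len(bles_used_per_cluster)
    return average_used / cluster_size
```

For instance, four clusters of size N = 4 holding 4, 4, 3 and 1 BLEs use 3 BLEs per cluster on average, so `logic_utilization([4, 4, 3, 1], 4)` returns 0.75.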
`
Figure 6.8 shows how the average logic utilization of our 20 benchmarks varies with I
for three different logic cluster sizes. The horizontal axis is the number of distinct
inputs to the cluster relative to the total number of BLE inputs in a cluster, i.e. I/(4N).
For very low values of I, the logic utilization is very low, as one would expect. It is
interesting, however, that when I is only 50 to 60% of the total number of BLE inputs,
the logic utilization is essentially 100%. Clearly it is possible to pack BLEs together
so that they have many common inputs and can reuse locally generated outputs. The
relative amount of input sharing and output reuse increases slightly with logic cluster
size, causing the curves in Figure 6.8 to shift to the left as cluster size increases.
`
The solid line in Figure 6.9 shows the value of I required to achieve 98% logic
utilization as the cluster size, N, is varied, while the dashed line shows how the average
`
`
`
FIGURE 6.8 Logic utilization vs. number of logic cluster inputs.
(Plot: fraction of BLEs used, 20 benchmark average, vs. fraction of inputs accessible, I/(4N).)
`
number of logic cluster inputs that are actually used varies with cluster size.
Although there are 4N BLE inputs in a logic cluster of size N, the number of inputs
required to achieve 98% logic utilization is approximately 2N + 2. Furthermore, the
average number of logic cluster inputs that are actually used grows even more slowly.
On average, a cluster of size 1 uses 3.57 of its inputs, while a cluster of size 20 uses
only 25.2 of its inputs. In other words, while the logic per cluster has increased by a
`
FIGURE 6.9 Variation in inputs required and inputs used with cluster size.
(Plot: number of cluster inputs (I), 20 benchmark average, vs. cluster size (N); one curve for the inputs required for 98% logic utilization and one for the average inputs used.)
`
`
`
`
`factor of 20, the average number of connections that must be routed to each cluster
`has increased by a factor of only 7.
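The figures quoted above can be checked with a few lines of arithmetic. The sketch below encodes the approximate 2N + 2 rule; it is a rule of thumb taken from the text, not an exact fit to Figure 6.9, and the function names are our own.

```python
def approx_inputs_for_98_percent(cluster_size: int) -> int:
    """Approximate cluster inputs I needed for 98% logic utilization (2N + 2)."""
    return 2 * cluster_size + 2


def fraction_of_ble_inputs(cluster_size: int, lut_inputs: int = 4) -> float:
    """The 2N + 2 estimate as a fraction of the K * N BLE inputs in the cluster."""
    return approx_inputs_for_98_percent(cluster_size) / (lut_inputs * cluster_size)


# Growth in average inputs used from N = 1 (3.57) to N = 20 (25.2), per the text.
inputs_used_growth = 25.2 / 3.57
```

The rule gives 10 inputs for N = 4 and 18 for N = 8, i.e. 62.5% and 56.25% of the 4N BLE inputs. The ratio 25.2 / 3.57 is about 7.06, which is the "factor of only 7" increase in routed connections for a 20-fold increase in logic per cluster.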
`
Our results indicate that commercial FPGAs can be more aggressive in reducing the
value of I. For example, the Altera Flex 8K FPGAs use logic clusters with N = 8 and
I = 24, while our results indicate that I = 18 suffices for a cluster of this size.
Similarly, the Xilinx 5200 FPGA uses a logic block very similar to a logic cluster with N =
4, and makes all 16 LUT inputs accessible, while our results suggest 10 to 11 inputs
are sufficient. Reducing I in this manner simplifies the local routing multiplexers
within a logic cluster and reduces the number of logic block pins that must be
connected to the FPGA routing, resulting in considerable area savings.
`
6.4 Flexibility of Logic Block to Routing Interconnect vs. Cluster Size
`
Before we can apply the experimental flow of Section 6.2 to see how area-efficiency
varies with cluster size, we must choose Fc, the number of routing tracks to which
each logic block pin can connect. On the one hand, using a smaller value of Fc
reduces the number of programmable switches in the FPGA routing, which improves
area-efficiency. On the other hand, smaller values of Fc make an FPGA less routable,
so that larger channel capacities, W, will be required to successfully route circuits.
This reduces area-efficiency by increasing the routing area. The goal is to choose a
value of Fc that balances these two competing objectives and achieves good
area-efficiency.
`
For a cluster of size 1, our experiments and those of Rose and Brown [30] have shown
that a good value of Fc is W; i.e. each logic block pin can be connected to any routing
track in an adjacent channel. For larger clusters, however, setting Fc to W provides far
more routing flexibility than is required, wasting area.
`
Recall from Section 3.1.1 that the logic clusters we investigate are fully-connected. In
`other words, a BLE input can be connected to any cluster input or to the output of any
`of the BLEs within the cluster via the local routing. In a fully-connected logic cluster
`all the cluster inputs and all the cluster outputs are logically-equivalent. That is, all of
`the inputs are functionally identical, and all of the outputs are functionally identical.
`This means that a net which is an input to a cluster can be connected to any of the I
`cluster inputs, and a net which is driven by a cluster output can be connected to any of
`the N cluster outputs. Changing the connections made by the multiplexer-based local
`routing can compensate for any pin swapping performed on the cluster input pins by
`the router. Changing which BLE within the cluster generates each of the functions
`
`
`
`
`required by the netlist can compensate for any pin swapping performed by the router
`on the cluster output pins. Therefore the router has a great deal of flexibility in how it
`routes inter-cluster nets.
`
The logical equivalence of cluster inputs and of cluster outputs means that keeping Fc
fixed at W, regardless of the cluster size, N, results in an excessive number of ways to
connect to large logic clusters. For example, a cluster of size one has 4 inputs and one
output. If Fc = W, then, there are 4W ways to connect to a cluster input and W ways
to connect to the cluster output. A cluster of size 20, on the other hand, has 36 inputs
and 20 outputs, so there are 36W ways to connect to a cluster input and 20W ways to
connect to a cluster output if Fc = W. This excessive routing flexibility for a cluster of
size 20 wastes a large amount of routing area, since we have added many more
programmable switches to the routing than is necessary.
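The counting argument above is a product of pin count and Fc, because logically-equivalent pins are interchangeable. The sketch below makes that explicit; the function name and the channel width W = 40 used in the usage note are illustrative assumptions only.

```python
def connection_choices(num_equivalent_pins: int, fc: int) -> int:
    """Ways a net can reach a group of logically-equivalent pins.

    Each pin can connect to fc routing tracks and any pin in the group will do,
    so the number of (pin, track) choices is the product of the two.
    """
    return num_equivalent_pins * fc
```

Taking a hypothetical W = 40 and Fc = W, a size-1 cluster offers `connection_choices(4, 40)` = 160 input choices and 40 output choices, while a size-20 cluster offers 36 · 40 = 1440 input choices and 20 · 40 = 800 output choices. Passing a smaller `fc` shows how quickly these counts, and the switches behind them, can be reduced.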
`
We have experimentally found that a more appropriate level of routing flexibility
results when the Fc value for logic block output pins, Fc,output, is set to W/N, and all
the experiments in the next section use this value. This choice of Fc,output means that
each of the W routing tracks can be driven by one output pin on each logic block,
ensuring that all the routing tracks in a channel c