`FPGAs*
`
`Jianshe He and Jonathan Rose
`Department of Electrical and Computer Engineering,
`University of Toronto, Toronto, M5S 1A4, Canada
`
`Abstract
`
`Most FPGAs use an array of identical logic blocks,
`largely because such devices are easy to design. Previous
`studies on logic block architecture, however, have con-
`cluded that while 4-input lookup tables (4-LUTs) make
`efficient use of area [Rose901 [Kou192], significantly more
`coarse-grain blocks such as 5-LUTs, 6-LUTs and 7-LUTs
`are superior in terms of system speed [KoulSl] [Sing911
`[Sing92]. These results suggest that a mixture of different
`size LUTs (for example 4-LUTs and 6-LUTs) may provide
`a better tradeoff between speed and density. This paper
`presents an architectural investigation into such heteroge-
`neous FPGAs. We consider FPGAs that use two different
`sizes of lookup table logic blocks, and investigate the area-
`efficiency of different mixtures of different sizes of LUTs.
`Experimental results on a set of benchmark circuits in-
`dicate that several heterogeneous architectures achieve
`significant reduction in the number of programming bits
`and logic block pins compared to the industry standard
`4-input lookup tables [HsieSO] [Hi1192]. Furthermore, a
`6-LUT/4-LUT combination will likely exhibit better per-
`formance with nearly equivalent area than a homogeneous
`4-LUT FPGA.
`
`1 Introduction
`
`Commercial FPGAs usually consist of an array of
`identical logic blocks [Cart861 [HsieSO] [ElGa89] [Ahrego]
`[Wong89] [Wils92] [Algo89], or logic blocks that have very
`similar levels of functionality. It is possible that a hetero-
`geneous mixture of logic blocks may provide superior area
`(which relates to logic density) because some portions of
`logic may simply be more efficient with one particular
`type of logic block than another. For example, consider
`the boolean network pictured in Figure la. Figure l b is a
`mapping of that network using 4-input lookup tables (4-
`LUTs) and Figure IC is a mapping of that network using
`3-LUTs. As shown in Table 1, the 4-LUT solution uses
`one-third more lookup table bits (64 vs 48) but 20% fewer
`pins than the 3-LUT solution (because a single 3-LUT has
`8 bits and 4 pins and a single 4-LUT uses 16 bits and 5
`
`*This work was supported by a grant from ITRC & NSERC
`Operating Grant #URF0043928.
`
`pins). Suppose that the network is mapped into a hetero-
`geneous FPGA that contains 3-LUTs and 4-LUTs in equal
`numbers, as illustrated in Figure Id. This circuit uses ex-
`actly two 3-LUTs and two 4-LUTs and hence requires
`only 48 bits and 18 pins. This heterogeneous FPGA thus
`requires 25% fewer bits and 10% fewer pins than the 4-
`LUT homogenous FPGA. It has the same number of bits
`and 25% fewer pins than the 3-LUT homogenous FPGA
`to implement this example. This example demonstrates
`that a heterogeneous mixture of logic blocks may exhibit
`superior area-efficiency.
`
`Homo-
`geneous
`Hetero.
`
`I ratio = 1 I
`
`LUT types
`only 3-LUT
`only 4-LUT
`3-LUT and
`4-LUT
`
`#LUTs
`6
`4
`2 3-LUT
`I 2 4-LUT I
`
`#Bits
`48
`64
`48
`
`#Pins
`24
`20
`18
`
`Table 1: Comparison Measures of Heterogeneous vs. Ho-
`mogenous FPGAs for Example Circuit in Fig.1
`
`In this paper we focus our attention on heterogeneous
`mixtures of two sizes of lookup table. The larger lookup
`table will be referred to as the p-LUT, and the smaller
`as the s-LUT. An important architectural parameter is
`the ratio of the number of the two types of block, r =
`that are present in the FPGA.
`8-LUTs
`$p-LUTs
`We will assume that the two types of blocks will be
`grouped together in a “supertile” consisting of r s-LUTs
`and 1 p-LUT if 7 2 1, and 1 s-LUT with f p-LUTs if
`0 < r < 1. Thus the FPGA would consist of an array of
`supertiles. Figure 2a gives an example of a supertile with
`p = 3, s = 2, and r = 2, and Figure 2b illustrates the
`array. Note that the notion of a supertile is an abstrac-
`tion intended to represent only the ratio of different types
`of LUTs and is not meant to imply anything about the
`actual physical placement and routing structure.
`In this framework, we are concerned with the questions:
`What are the best values of p, s, and r in terms of the
`area-efficiency of an FPGA?
`The following section describes our experimental
`method of answering these questions and Section 3
`presents experimental results.
`
`IEEE 1993 CUSTOM INTEGRATED CIRCUITS CONFERENCE
`
`7.4.1
`
`0-7803-0826-3193 $3.00 ’ 1993 IEEE
`
`Authorized licensed use limited to: IEEE Publications Operations Staff. Downloaded on February 28,2023 at 16:48:17 UTC from IEEE Xplore. Restrictions apply.
`
`Intel Exhibit 1015
`Intel v. Iida
`
`
`
`a b c
`
`t
`
`( c )
`
`t
`( d l
`
`Figure 1: An illustration of homogeneous and heterogeneous mappings
`
`2 Experimental Procedure and Area
`Measures
`
`The above architectural questions will be answered us-
`ing an experimental approach: a set of benchmark combi-
`national circuits will be synthesized into a set of FPGAs.
`By synthesizing each circuit into FPGAs with different
`values of s, p and r, we can measure the area-efficiency
`of each such architecture for each circuit. We first dis-
`cuss the synthesis procedure, and then describe the way
`in which the results are used to indicate area-efficiency.
`
`2.1 Logic Synthesis for Heterogenous FPGAs
`The synthesis procedure
`takes
`the combinational
`benchmark circuits and passes them through logic syn-
`thesis to determine a network of pLUTs and s-LUTs that
`will implement the function of the input boolean network.
`The key issue in synthesizing for heterogenous FPGAs
`comes in the technology mapping step. Combinational
`circuits are first optimized using technology-independent
`logic optimization [Bray87], which produces an optimized
`boolean network. Technology mapping selects which parts
`of the network are to be implemented using the available
`logic blocks. In a homogenous FPGA, (with an array of
`4-LUTs, for example) the optimization goal is to minimize
`the total number of 4-LUTs. In a heterogenous FPGA,
`for example with 3-LUTs and 2-LUTs in ratio r=l, the
`mapping algorithm must minimize the number of super-
`tiles. This means that for every 3-LUT that is used, one
`
`2-LUT must be used, and SO the objective function is to
`minimize Nsup, where
`Nsup = max(N3, Nz)
`where N3 is the number of 3-LUTs used, and Nz is the
`number of 2-LUTs.
`The difficulty we faced in performing this synthesis is
`that all logic synthesis tools for FPGAs are designed to
`solve the homogenous problem, in which all logic block
`are the same. Similarly, synthesis for standard ASIC gate
`arrays and standard cells does not apply because they are
`free to select any number of gate types available in the
`cell libraries.
`To solve this problem we developed a new technol-
`ogy mapping algorithm which deals explicitly with non-
`homogeneity. That algorithm is described in [He93a] and
`[He93b].
`The following section describes how the resulting netlist
`of pLUTs and s-LUTs is used to estimate the area that
`would be required in an FPGA.
`It should be pointed out that since our algorithm does
`not replicate logic at fanout nodes [He93a], the homoge-
`neous mapping algorithm we use is also prevented from
`this replication. It is possible that this may affect our
`conclusions.
`
`2.2 Relative Area Measurement
`To determine the area of a netlist of logic blocks, one
`could perform the placement and global routing, and mea-
`
`7.4.2
`
`Authorized licensed use limited to: IEEE Publications Operations Staff. Downloaded on February 28,2023 at 16:48:17 UTC from IEEE Xplore. Restrictions apply.
`
`
`
`Similar to the calculation of the number of bits, we
`can calculate the total number of pins of a circuit. For a
`supertile, the number of 1/0 pins, Npin, is a function of
`p, s, and r and is given by
`
`or
`
`Npin = (P + 1) + T ( S + I), if T 2 1
`Npin = ( S + 1) + -(P + I), if 0 < < 1
`1
`(4)
`where (p + 1) is for the p inputs and one output for a
`p-LUT and (s + 1) is for the s inputs and one output for
`an s-LUT.
`From equations (1) and (4) we have
`Total #Pins = Nsup x Npin
`
`(5)
`
`Figure 2: Example Supertile and Heterogeneous FPGA
`
`sure the amount of wiring needed, as well as estimate the
`size of the logic blocks. While we took that approach in
`previous studies [Rose901 [Brow92], we have found that
`simple measures, such as counting the number of lookup
`table bits and logic block pins used in the design, leads
`to the same architectural conclusions. So, rather than go-
`ing through full placement and routing, we calculate the
`number of pins and the number of bits to “measure” the
`area of a netlist in the following way:
`First, calculate the number of supertiles. If the number
`of p-LUTs in the netlist is N p and the number of s-LUTs
`is N,, then the number of supertiles is given by:
`
`where I is the ratio of the number of s-LUTs to the number
`of p-LUTs in a supertile.
`For a single K-LUT the number of bits is 2 K . In a
`supertile the number of bits is given by
`N b i t = 2’+
`if T 2 1
`
`T X 2’,
`
`or
`
`1
`if 0 < ?‘ < 1
`N b j t = 2’+ - X 2’,
`(2)
`T
`If the total number of supertiles used in a circuit is
`Nsup, then the total number of bits is given by
`Total #Bits = NsUp x Nbjt
`(3)
`Routing area is very important in the FPGA area de-
`termination, because it often requires from 50% to over
`90% of the total area. For optimized placement and rout-
`ing, the total number of pins of a circuit directly relates
`to the total amount of routing area. For this reason we
`count the total number of pins in evaluating the routing
`area.
`
`3 Experimental Results
`A total of 40 benchmark circuits from the MCNC logic
`synthesis benchmark suite were used as the basis for ex-
`perimentation. We chose p to range from 3 to 7, and s to
`vary from 2 to p - 1. The ratio r was varied between 0.1
`and 10.
`Figure 3 plots the total number of bits of several hetero-
`geneous combinations versus I, normalized with respect to
`the total number of bits of the same circuits implemented
`with homogenous 4-LUTs as the basic block. The total
`number of bits is a decreasing function of I. This figure is
`consistent with previous results for homogeneous FPGAs.
`As I increases, there are more s-LUTs which require sig-
`nificantly fewer bits to implement the same logic function
`of a circuit. The number of LUT bits, however, are not
`the dominant factor in total area ( unless the lookup ta-
`ble size is larger than about 7). The number of pins to be
`connected is far more important [RoseSO].
`Figure 4 illustrates the normalized total number of pins
`with respect to homogeneous 4-LUTs for the same set of
`combinations of p and s as in Figure 3. These curves have
`similar shapes for all combinations of p and s, represented
`as (p, s) in the figure. They present several interesting
`results.
`First, consider the combination of 4-LUT and 2-LUT
`with ratio T = 0.5. With this heterogeneous architecture,
`compared to a Homogeneous 4-LUT FPGA, the number
`of bits is reduced by 22% and the number of pins is re-
`duced by 10%. The combination of 5-LUT and 2-LUT
`has an 11% reduction in pins with a slight increase (11%)
`in bits. This data illustrate the conclusion that some het-
`erogeneous architectures are more area-efficient than the
`best homogeneous architecture.
`We believe that heterogeneous combinations such as (5,
`2), (4, a), (4, 3) are superior to homogeneous 4-LUT FP-
`GAS because many logic circuits have a significant num-
`ber of small fanin functions that have fanout greater than
`one. These can be efficiently implemented by 2 or 3-input
`LUTs.
`Secondly, observe the shape of the curves in Figure 4.
`They exhibit (as do all combinations) a minimum in be-
`tween the homogeneous extremes, indicating that hetero-
`geneous architectures are always superior to homogeneous
`
`7.4.3
`
`Authorized licensed use limited to: IEEE Publications Operations Staff. Downloaded on February 28,2023 at 16:48:17 UTC from IEEE Xplore. Restrictions apply.
`
`
`
`Normali-d
`#Bits
`
`3.00 h
`
`2.50
`
`2.00
`
`1.50
`
`1.00
`
`0.50 I
`1
`S
`
`y/
`
`( 492
`
`Normalize$
`#Pins
`
`I
`
`1.- l.m
`1.10
`
`1.05
`
`1.00
`
`o.%
`
`0.90
`
`1
`V
`
`1
`T
`
`
`1
`
`2
`
` 5 1 0
`
`log r
`
`L r
`10
`5
`M-W
`p-L.uT.
`
` - ,
`1
`
`1
`
`2
`
` 5 1 0
`
`Mostly
`s-LUTs
`
`log r
`
`Figure 3: Normalized bit count vs I for different combi-
`nations of (p, s )
`
`Figure 4: Normalized pin count vs r for the same set of
`combinations of (p, s ) as in Fig.3
`
`2 - 3
`
`5
`6
`
`1 2 - 6 1 2 - 5 1
`1 2 - 6 1
`I
`
`I
`
`I
`
`1 - 2
`
`I
`
`Table 2: Value or range of ratio I that achieves minimal
`total pin count (within 1% difference from the minimum)
`for each combination of (p, s)
`
`architectures. This shows that, as in the case of the ex-
`ample in section 1, most circuits benefit from having the
`choice of two different kinds of blocks.
`Thirdly, The minimum values of r are different for dif-
`ferent values of p and s, as shown in Table 2. Table 2 gives
`the value or range (within 1% difference of pin count from
`the minimum) of r that achieved the minimum total pin
`count for each combination (p, s). The ratio r at minimal
`pin count in all cases favors the LUT size that is closest to
`4, which makes sense since that is the best homogeneous
`block. Table 3 gives the minimum value of pin count (nor-
`malized to the pin count of homogeneous 4-LUT) and its
`corresponding bit count (normalized to the homogeneous
`bit count) for each (p, s ) corresponding to the ratio in
`Table 2.
`The combination of (6, 4) is also worth noting. In this
`combination with ratios 3 and 4, the total numbers of pins
`are reduced by 8% and 7% with bit number increasing by
`about 50% and 40%, respectively. We believe that the in-
`crease in number of bits is nearly offset by the decrease in
`number of pins because pins dominate the area. Thus this
`architecture has nearly equivalent area as homogeneous
`4-LUT FPGA. From previous research [Sing92], however,
`we expect a 6-LUT architecture to have roughly 25% less
`
`4
`
`pins
`bits
`pins
`
`0.94
`2.25
`1.00
`
`0.92
`1.47
`1.01
`
`0.94
`1.12
`
`bits
`
`3.76
`
`Table 3: Normalized minimum value of pin count and its
`corresponding normalized bits for each combination of p
`and s with respect to homogeneous 4-LUT
`
`delay than a 4-LUT and so we can expect this combina-
`tion to exhibit superior speed to the homogeneous 4-LUT
`FPGA.
`
`4 Conclusions and Future Work
`
`This paper has investigated the area-efficiency advan-
`tages of heterogeneous logic block FPGA architectures.
`We conclude that certain heterogeneous FPGAs exhibit
`better area than the most area-efficient homogeneous
`FPGA. The best ratio r corresponding to these minimum
`differs depending on the size of the lookup tables. We also
`demonstrated that some heterogeneous mixtures may de-
`liver superior speed with equivalent area to the best ho-
`mogeneous FPGA.
`In the future, we will investigate the speed-area tradeoff
`further by optimizing both delay and area.
`
`7.4.4
`
`Authorized licensed use limited to: IEEE Publications Operations Staff. Downloaded on February 28,2023 at 16:48:17 UTC from IEEE Xplore. Restrictions apply.
`
`
`
`[Rose901 J. Rose, R. Francis, D. Lewis and P. Chow,
`“Architecture of Field-Programmable Gate Arrays:
`The Effect of Logic Block Functionality on Area Ef-
`ficiency,” IEEE J. Solid-state Circuits, Vo1.25, No.5,
`Oct.1990, pp.1217 - 1225.
`[Sing911 S. Singh, J. Rose, D. Lewis, K. Chung, and
`P. Chow, “Optimization of Field- Programmable Gate
`Array Logic Block Architecture for Speed,” Proc. 1991
`CICC, May 1991, pp.6.1.1- 6.1.6.
`[Sing921 S. Singh, J. Rose, P. Chow, and D. Lewis, “The
`Effect of Logic Block Architecture on FPGA Perfor-
`mance,” IEEE J . Solid State Circuits. Vol. 27, No. 3,
`March 1992, pp.281-287.
`[Wils92] R. Wilson, “Altera Flexes Programmable Logic
`Muscles,” Electronic Engineering Times, Oct.5, 1992.
`[Wong89] S. Wong, H. So, J. Ou, and J. Costello, “A
`5000-Gate CMOS EPLD with Multiple Logic and
`Interconnect Arrays,” Proc. 1989 CICC, May 1989,
`pp.5.8.1 - 5.8.4.
`
`5 Acknowledgements
`
`The authors wish to express their thanks to Bob Francis
`and Kevin Chung for many helpful discussions.
`
`References
`
`[Ahrego] M. Ahren, A. El Gamal, D. Galbraith,
`J. Greene, S. Kaptanoglu, K. Djarmarajan, L. Hutch-
`ings, S. Ku, P. McGibney, J. McGowan, A. Samie,
`K. Shaw, N. Stiawalt, T. Whitney, T. Wong,
`W. Wong, and B. Wu, “An FPGA Family Optimized
`for High Densities and Reduced Routing Delay,” Proc.
`1990 C I C C , May 1990, pp.31.5.1 - 31.5.4
`[Algo89] CAL 1024 Datasheet, Algotronix Ltd. Edin-
`burgh, Scotland, 1989
`[Bray901 R. Brayton, G. Hachtel, and A. Sangiovanni-
`Vincentelli, “Multilevel Logic Synthesis,” Proc. IEEE,
`vo1.78, No.2, Feb. 1990, pp.264-300.
`[Brow921 S. Brown, R. Francis, J. Rose, Z. Vranesic,
`“Field-Programmable Gate Arrays”, Kluwer Aca-
`demic Publishers, June, 1992.
`[Cart861 W. Carter, K. Duong, R. Freeman, H. Hsieh,
`J. Ja, J. Mahoney, L. Ngo, and S. Sze, “A user Pro-
`grammable Reconfigurable Gate Array,” Proc. 1986
`CICC, May 1986, pp.233-235.
`[ElGa89] A. El Gamal, J. Greene, J. Reyneri, E. Ro-
`goyski, K. El-Ayat and A. Mohsen, “An Architecture
`for Electrically Configurable Gate Arrays,” IEEE J.
`Solid State Circuits. Vol. 24, No. 2, April 1989, pp.394-
`398.
`[He93a] J. He and J. Rose, “Why Are an FPGA’s Logic
`Blocks All The Same? - (A Technology Mapping Al-
`gorithm for Heterogeneous FPGAs ), ” Submitted to
`DAC93, 1993.
`[He93b] J. He, M.A.Sc. Thesis in Preparation, University
`of Toronto.
`[Hi11921 D. Hill, B. Britton, B. Oswald, N. Woo, S. Singh,
`T. Poon, and B. Krambeck, “A New Architecture for
`High- Performance F P GAS,” 2n d in te rn a tio n a1 Wo rk-
`shop on Field-Programmable Logic and Applications,
`Aug., 1992.
`[HsieSO] H. Hsieh, W. Carter, J. Ja, E. Cheung,
`S. Schreifels, C. Erickson, P. Freidin, L. TinKey, and
`R. Kanazawa, “Third-Generation Architecture Boosts
`Speed and Density of Field-Programmable Gate Ar-
`rays,” Proc. 1990 CICC, May 1990, pp.31.2.1-31.2.7
`[KoulSl] J. Kouloheris and A. El Gamal “FPGA Perfor-
`mance vs. Cell Granularity,” Proc. 1991 CICC, May
`1991, pp. 6.2.1 - 6.2.4.
`[Kou192] J. Kouloheris and A. El Gamal, “FPGA
`Area vesus Cell Granularity - lookup Tables and
`PLA Cells,” ACM/SIGDA Workshop on FPGAs
`(FPGA’92), Feb. 1992, pp.9-14.
`
`7.4.5
`
`Authorized licensed use limited to: IEEE Publications Operations Staff. Downloaded on February 28,2023 at 16:48:17 UTC from IEEE Xplore. Restrictions apply.
`
`