`
A 1.1 GOPS/mW FPGA Chip with Hierarchical Interconnect Fabric
Cheng C. Wang, Fang-Li Yuan, Henry Chen, Dejan Marković
Electrical Engineering Department, University of California, Los Angeles, CA

Abstract
A 2048 look-up-table FPGA with a radix-2 hierarchical interconnect network is realized in 3.94mm² in 65-nm CMOS. It has an interconnect-to-logic area ratio of 1:1, which is a 3–4x reduction from modern FPGAs while allowing up to 100% resource utilization. As a proof of concept, it is designed with standard cells, achieving 16.4 GOPS/mm² at 370MHz. Peak energy efficiency of 1.1 GOPS/mW is measured at 0.5V.

Introduction
Field-programmable gate arrays (FPGAs) are effective for rapid verification and prototyping of VLSI designs. They are also used in products that require periodic hardware changes and short time to market. However, FPGAs incur penalties in area (17–54x), speed (2.5–6.7x), and power (5.7–62x) over standard-cell ASICs [1], hindering their expansion into ASIC markets. The overhead is primarily due to interconnects, which account for over 75% of area and delay.
For over 20 years, FPGAs have used 2D-mesh interconnects, where look-up tables (LUTs) are placed in configurable logic blocks (CLBs), and arrays of switch boxes are placed at interconnect crossings (Fig. 1). Since a full array requires too much area, various heuristics are used to simplify switch-box arrays at the cost of resource utilization. Yet 80% of the 1.1B transistors on Virtex-5 are used for interconnects [2]. This paper demonstrates an FPGA with hierarchical interconnects where interconnect area is 51%, a 3–4x reduction from commercial FPGAs while preserving connectivity. An energy efficiency of 1.1 GOPS/mW is the highest among reported FPGAs. The chip is tested up to 400MHz.

Hierarchical Interconnect Architecture
The key issue with 2D-mesh is scalability; the number of switch boxes grows as O(N²) with the number of LUTs. Using Rent’s rule, interconnect complexity is still O(N^1.75) for random logic, requiring FPGA size to scale much faster than Moore’s Law. In the proposed hierarchical interconnect, a folded Beneš network is employed to reduce the complexity to O(N·logN) [3]: 4 LUTs are connected via 2 stages of switch matrices (SMs), and another 4 LUTs are connected with a 3rd SM stage (Fig. 2a). Each SM has 4 unidirectional connections per direction. Although this architecture reduces interconnect complexity, each SM stage doubles the routing congestion. This O(N) congestion makes physical design difficult.
To alleviate congestion, routing is alternated between x-y directions to reduce congestion to O(N^0.5) (Fig. 2b). At every hierarchy, the LUTs near the center are interconnected to create shorter routes, and the edge routes are longer. This gives routing tools options for faster paths on timing-critical routes.
The test chip has 2048 4-input LUTs: 1024 LUTs form 256 Logic CLBs, 896 LUTs form 224 DSP CLBs, and 128 LUTs form 16 Block RAMs (BRAMs) of 1kb each. In practice, the majority of the logic connections are local, requiring fewer connections on upper hierarchies. Therefore full connectivity is preserved up to 6 SM stages (Fig. 3a), then half-connectivity SMs are used to reduce the complexity of upper hierarchies. This partitions the interconnect into 3 sub-networks: N8:2, N6:2, and N6:1. The chip is divided into 16 macros (Fig. 3b). Macros N8:2 are centered for shorter top-level routing, branching into N6:2 and N6:1. Each of the macros contains 32 CLBs, a combination of Logic, DSP, and BRAM (Fig. 3c).

Circuit Implementation
The CLBs include four 4-input LUTs with selectable asynchronous/synchronous output stages (Fig. 4a). Each LUT is configurable as one 4-input LUT or two 3-input LUTs with up to 4 unique inputs. A Logic CLB includes a carry chain to support 4b additions where Propagate and Generate are driven from LUTs. The Logic CLB is especially useful when two outputs per bit are required, such as in 3:2 compressors.
The DSP CLB (Fig. 4b) has a LUT combiner to support 5/6-input LUTs, and a carry chain that is configurable as one 8b or two 4b adders. The adder cells are shared with a 4b×4b Wallace-tree multiplier. Based on the configuration, the appropriate outputs are sent to the output stage. Due to the level of configurability, the synthesized CLB has 50 logic gates on its critical path (shaded), amounting to a 1.1ns delay.
Configuration bits are required to control CLBs and SMs, but traditional SRAM arrays are not suitable because all bits cannot be accessed simultaneously. A scan chain is adopted in [4] to control 6 CLBs, but it is not scalable to larger designs. Therefore an SRAM-based bit cell (BC) is designed where the output of each BC is directly routed to the configuration inputs of CLBs and SMs (Fig. 5a). The BC area is 5x smaller than a DFF-based scan cell. The bit-line (BL) and word-line (WL) controls are implemented as scan chains to write one row of BCs at a time. The BC arrays are local to each CLB, so only the BL and WL controls are propagated to top level. Overall, the memory area is reduced (Fig. 5b), and total interconnect area is 51%, a 3–4x reduction over 2D-mesh [5] for a fixed logic area.

Automated Mapper
An automated mapper is developed to map RTL onto this FPGA. A standard-cell library of LUT functions is created to enable logic synthesis using commercial tools. The LUT netlist is imported into an automated, custom place-and-route tool that generates the bitstream for FPGA programming. This tool is also used during architecture design to evaluate interconnect connectivities by mapping Toronto20 benchmarks.

Measurement Results
Our chip achieves 16.4 GOPS/mm² when all Logic and DSP CLBs are utilized, executing 175 16b accumulators at 370MHz. Since a 16b adder uses 2 DSP CLBs or 4 Logic CLBs, the DSP adders are faster, reaching 400MHz. Performance is hindered by equipment limitations due to a 0.25ns input-clock jitter at 400MHz. The energy-delay curve and the power breakdowns for minimum delay and minimum energy are shown in Fig. 6.
In comparison, [4] has no interconnects, the full-custom CLB in 32-nm LVT is 2.5x faster, but achieves 2.6 GOPS/mW at 0.34V for 8b operations, which is 0.65 GOPS/mW for 16b (2 CLBs per operation at half the speed). With interconnects, our 65-nm chip reaches 1.1 GOPS/mW at 0.5V.
Leakage is well-controlled even without power gating. A 1.08 GOPS/mW is attainable with only 112 DSP accumulators active and most of the Logic CLBs idle (Table I). The FIR filter achieves 274MHz due to longer routing, but interconnect delay is still under 50%. The 2×2 MIMO FFT uses 10 BRAMs to implement various delay lines. With many control signals and a critical path of 11 CLBs, the FFT achieves 83MHz.
Figure 7 shows the die photo. The top 3 metal layers (out of 9) are sparsely used, leaving ample room for larger designs.

Acknowledgments
We thank STMicroelectronics and C. Yang for helpful discussions.

References
[1] I. Kuon et al., Found. Trends in Elec. Design, 2008.
[2] I. Bolsens, MPSOC, 2006.
[3] V. Konda, U.S. Patent 2010/0172349.
[4] A. Agarwal et al., ISSCC Dig. Tech. Papers, 2010.
[5] M. Lin et al., FPGA ’06.
`
`
`
`136 978-4-86348-165-7
`
`2011 Symposium on VLSI Circuits Digest of Technical Papers
`
`Page 1 of 2 IPR2020-00262
`
`VENKAT KONDA EXHIBIT 2002
`
`
`
`
`
Figure 1: 2D-mesh interconnect. (Diagram labels: CLB containing four LUTs, bi-directional switch box, I/O connection box, switch-box array.)
Figure 2: a) Hierarchical routing of 8 LUTs (4 shown) using SMs (congestion: O(N)), b) alternated x-y routing (congestion: O(√N)).
Figure 5 diagram labels extracted in this region: bit cell (BC) with BL‐, BL+, and WL inputs and a BC output to SM and CLB; BL control and WL control built from flip-flops (FF) with signals BIN, BSE, WIN, WEV, and WSE; bit cell area: 1.4μm × 1.8μm; area breakdown with interconnects + routing 43% and memory 35% (interconnect area), logic 14% and memory 8% (logic area); annotation "3–4× reduction in interconnect area".
Figure 6 inset labels extracted in this region: Clock 28%, Leak. 26%, Active 46%; Clock 18%, Leak. 44%, Active 38%.
Figure 3: a) Interconnect architecture of 2048 LUTs, floorplan of b) SM network, c) CLB placement. (Legend: NP:Q denotes P stages of full SM + Q stages of half SM; sub-networks N8:2, N6:2, and N6:1; SM stages numbered 1 to 10; CLB types Logic, DSP, and BRAM.)
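The scaling argument of the Hierarchical Interconnect Architecture section can be made concrete with a small calculation. The sketch below is illustrative only: it evaluates the three growth terms quoted in the text (N², N^1.75, and N·logN) at the chip's N = 2048 LUTs while ignoring all constant factors, and it assumes a base-2 logarithm because the network is radix-2. The function name `interconnect_growth` is hypothetical and does not come from the paper.

```python
import math


def interconnect_growth(n_luts):
    """Growth terms for n_luts LUTs; constant factors are deliberately ignored."""
    return {
        "full switch-box array, N^2": float(n_luts ** 2),
        "Rent's-rule estimate, N^1.75": n_luts ** 1.75,
        # Base-2 logarithm is an assumption, chosen to match a radix-2 network.
        "hierarchical network, N*log2(N)": n_luts * math.log2(n_luts),
    }


N = 2048  # LUT count of the test chip
growth = interconnect_growth(N)
hierarchical = growth["hierarchical network, N*log2(N)"]
for name, value in growth.items():
    print(f"{name}: {value:.3g} ({value / hierarchical:.1f}x the hierarchical term)")

# Congestion terms from the text: O(N) before and O(sqrt(N)) after x-y alternation.
print(f"congestion ratio N / sqrt(N) at N={N}: {N / math.sqrt(N):.1f}")

# LUT budget stated in the text: 4 LUTs per CLB, 128 LUTs forming the 16 BRAMs.
lut_budget = {"Logic": 256 * 4, "DSP": 224 * 4, "BRAM": 128}
assert sum(lut_budget.values()) == N
```

Because constants are dropped, the printed ratios indicate only how quickly each term grows, not actual area or wire counts.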
Figure 4: a) CLB block diagram and b) DSP CLB schematic. (Diagram labels, panel a: inputs inA to inD from the hierarchical network, four 3-/4-input LUTs, Cin and Cout, output stage driving OutA to OutD to the hierarchical network. Panel b: 5/6-input LUT combiner (DSP CLB only) with outputs LUT3_A to LUT3_D, LUT4_A to LUT4_D, LUT5_AB, LUT5_CD, and LUT6; configurable adder/multiplier with operands A[3:0], B[3:0], C[3:0], D[3:0], partial products Pij, carries C0 to C8, sums S0 to S8, and multiplier outputs M0 to M7; critical-path annotation: LUT, + ×7, + ×8.)
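The LUT and carry-chain behavior described in the Circuit Implementation section can be illustrated in software. The following is a minimal behavioral sketch, not the chip's circuit: a LUT is modeled as a truth table addressed by its inputs, a 3:2 compressor is built from two 3-input tables that share inputs (the "two outputs per bit" case), and a 4b adder ripples a carry through Propagate and Generate values read from tables. All names (`make_lut`, `read_lut`, `compress_3_2`, `add4`) are hypothetical, and the bit ordering of the tables is an assumption.

```python
from itertools import product


def make_lut(func, n_inputs):
    """Build the 2**n_inputs configuration bits (truth table) for func."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]


def read_lut(table, *inputs):
    """Read one output bit; the inputs form the table address, first input as MSB."""
    addr = 0
    for bit in inputs:
        addr = (addr << 1) | bit
    return table[addr]


# Two 3-input tables sharing the same inputs give the two outputs of a 3:2 compressor.
SUM3 = make_lut(lambda a, b, c: a ^ b ^ c, 3)
CARRY3 = make_lut(lambda a, b, c: (a & b) | (b & c) | (a & c), 3)


def compress_3_2(a, b, c):
    """Return (sum, carry) such that a + b + c == sum + 2 * carry."""
    return read_lut(SUM3, a, b, c), read_lut(CARRY3, a, b, c)


# Propagate and Generate are themselves LUT outputs, as stated in the text.
PROP = make_lut(lambda a, b: a ^ b, 2)
GEN = make_lut(lambda a, b: a & b, 2)


def add4(a, b, cin=0):
    """4b addition through a ripple carry chain; returns (4b sum, carry out)."""
    carry, total = cin, 0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        p, g = read_lut(PROP, ai, bi), read_lut(GEN, ai, bi)
        total |= (p ^ carry) << i      # sum bit for this position
        carry = g | (p & carry)        # carry into the next position
    return total, carry


print(compress_3_2(1, 1, 0))
print(add4(9, 5))
```

The two 3-input tables together hold 16 configuration bits, the same storage as one 4-input table, which is one plausible reading of how a single LUT can serve either role.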
`
Figure 5: a) Bit cell (BC) configuration circuitry, b) area comparisons of 2D-mesh vs. this chip for a fixed logic area. (Panel b labels: 2D-Mesh, This Work, Intercon. + Routing; values in extraction order: 36%, 15%, 13%, 35%.)
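The row-at-a-time configuration scheme described in the text (bit-line and word-line controls implemented as scan chains) can be sketched behaviorally. The model below is an assumption-laden illustration, not the chip's circuit: the array dimensions, the shift direction, and the one-hot word-line select are all hypothetical choices, and `program_bit_cells` is an invented name. The paper states only that one row of BCs is written at a time and that the arrays are local to each CLB.

```python
def program_bit_cells(bitstream, n_rows, n_cols):
    """Write a flat bitstream into an n_rows x n_cols bit-cell array, one row per write.

    Returns the programmed array and the number of bit-line shift cycles used.
    """
    assert len(bitstream) == n_rows * n_cols
    cells = [[0] * n_cols for _ in range(n_rows)]
    shift_cycles = 0
    for row in range(n_rows):
        row_bits = bitstream[row * n_cols:(row + 1) * n_cols]
        # Shift the row into the bit-line scan chain, one bit per cycle.
        # Feeding the bits in reverse leaves row_bits[0] at column 0.
        bl_chain = [0] * n_cols
        for bit in reversed(row_bits):
            bl_chain = [bit] + bl_chain[:-1]
            shift_cycles += 1
        # Word-line scan chain holds a one-hot token selecting the current row.
        wl_chain = [1 if r == row else 0 for r in range(n_rows)]
        # Write: every cell on the selected word line latches its bit line.
        for r in range(n_rows):
            if wl_chain[r]:
                cells[r] = list(bl_chain)
    return cells, shift_cycles


# Hypothetical 4 x 8 array; the real per-CLB array size is not given in the text.
demo_bits = [(i * 5 + 3) % 2 for i in range(32)]
demo_cells, demo_cycles = program_bit_cells(demo_bits, n_rows=4, n_cols=8)
print(demo_cycles)
```

In this model only the two scan chains need to be driven from outside the array, which mirrors the statement that only the BL and WL controls are propagated to the top level.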
Figure 6: Energy-delay curve of the mapped 175 16b accumulator with power breakdown at Fmax and Emin (insets). (Horizontal axis: Delay (ns), ticks 0 to 15; vertical axis ticks 0.8 to 2.8; labeled operating points: 370MHz at 1.0V, 300MHz at 0.88V, 200MHz at 0.74V, 100MHz at 0.59V, 55MHz at 0.5V; inset titles: Power Ratio @ Fmax, Power Ratio @ Emin.)

Figure 7: Die micrograph and chip summary. (Die dimension label: 3.06mm.)
Technology     65nm 1P9M CMOS
Core VDD       0.34 to 1.0V
Frequency      40 to 400MHz
I/Os           75 bidirectional
Core Size      2.52mm × 1.56mm
Gate Count     2.73M
CLB Count      256 Logic, 224 DSP
Block RAMs     16, each 128×8b
Config. Bits   297,472b

TABLE I: MEASUREMENT RESULTS.
                            Resource Utilization     Performance
Design                      Logic   DSP     BRAM     VDD    Freq.   Power   GOPS
                            (256)   (224)   (16)     (V)    (MHz)   (mW)    /mW
175 Logic+DSP 16b Accum.    256     224     0        1.0    370     179     0.36
                                                     0.50   55      8.6     1.13
112 DSP 16b Accum.          4       224     0        1.0    400     123     0.57
                                                     0.51   60      6.2     1.08
32-tap 16b FIR Filter       132     209     0        1.0    274     120     0.21
                                                     0.56   50      10.2    0.45
2×2 MIMO 64-point FFT       196     93      10       1.0    83      82.7    0.05
                                                     0.78   40      26.5    0.07
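The headline figures for the 175-accumulator design can be reproduced from the measured frequency and power with a few lines of arithmetic. This is a sketch under one stated assumption: each 16b accumulator contributes one operation per clock cycle, which is how the quoted GOPS values appear to be counted. The function name `gops` is hypothetical. Results are approximate because the published frequencies and powers are rounded.

```python
def gops(ops_per_cycle, freq_mhz):
    """Throughput in giga-operations per second."""
    return ops_per_cycle * freq_mhz / 1000.0


ACCUMULATORS = 175     # 16b accumulators mapped onto the chip
DIE_AREA_MM2 = 3.94    # area quoted in the abstract

# Maximum-frequency point: 1.0V, 370MHz, 179mW.
fast_gops = gops(ACCUMULATORS, 370)
fast_eff = fast_gops / 179                 # GOPS/mW
area_eff = fast_gops / DIE_AREA_MM2        # GOPS/mm^2

# Minimum-energy point: 0.5V, 55MHz, 8.6mW.
low_gops = gops(ACCUMULATORS, 55)
low_eff = low_gops / 8.6                   # GOPS/mW

# Energy per operation in pJ: mW divided by GOPS equals pJ per operation.
fast_pj_per_op = 179 / fast_gops
low_pj_per_op = 8.6 / low_gops

# 16b-equivalent efficiency of [4], as reasoned in the text:
# 2 CLBs per operation at half the speed.
ref4_16b_eff = 2.6 / (2 * 2)

print(f"{fast_eff:.2f} GOPS/mW, {area_eff:.1f} GOPS/mm^2 at 370MHz")
print(f"{low_eff:.2f} GOPS/mW at 55MHz")
print(f"{fast_pj_per_op:.2f} pJ/op and {low_pj_per_op:.2f} pJ/op")
print(f"{ref4_16b_eff:.2f} GOPS/mW for [4] at 16b")
```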
`
`
`