13-3
A 1.1 GOPS/mW FPGA Chip with Hierarchical Interconnect Fabric
Cheng C. Wang, Fang-Li Yuan, Henry Chen, Dejan Marković
Electrical Engineering Department, University of California, Los Angeles, CA

Abstract
A 2048 look-up-table FPGA with a radix-2 hierarchical interconnect network is realized in 3.94mm² in 65-nm CMOS. It has an interconnect-to-logic area ratio of 1:1, which is a 3–4x reduction from modern FPGAs while allowing up to 100% resource utilization. As a proof of concept, it is designed with standard cells, achieving 16.4 GOPS/mm² at 370MHz. Peak energy efficiency of 1.1 GOPS/mW is measured at 0.5V.

Introduction
Field-programmable gate arrays (FPGAs) are effective for rapid verification and prototyping of VLSI designs. They are also used in products that require periodic hardware changes and short time to market. However, FPGAs incur penalties in area (17–54x), speed (2.5–6.7x), and power (5.7–62x) over standard-cell ASICs [1], hindering their expansion into ASIC markets. The overhead is primarily due to interconnects, which account for over 75% of area and delay.
For over 20 years, FPGAs have used 2D-mesh interconnects, where look-up tables (LUTs) are placed in configurable logic blocks (CLBs), and arrays of switch boxes are placed at interconnect crossings (Fig. 1). Since a full array requires too much area, various heuristics are used to simplify switch-box arrays at the cost of resource utilization. Yet 80% of the 1.1B transistors on Virtex-5 are used for interconnects [2]. This paper demonstrates an FPGA with hierarchical interconnects where interconnect area is 51%, a 3–4x reduction from commercial FPGAs while preserving connectivity. An energy efficiency of 1.1 GOPS/mW is the highest among reported FPGAs. The chip is tested up to 400MHz.

Hierarchical Interconnect Architecture
The key issue with 2D-mesh is scalability; the number of switch boxes grows as O(N^2) with the number of LUTs. Using Rent’s rule, interconnect complexity is still O(N^1.75) for random logic, requiring FPGA size to scale much faster than Moore’s Law. In the proposed hierarchical interconnect, a folded Beneš network is employed to reduce the complexity to O(N·logN) [3]: 4 LUTs are connected via 2 stages of switch matrices (SMs), and another 4 LUTs are connected with a 3rd SM stage (Fig. 2a). Each SM has 4 unidirectional connections per direction. Although this architecture reduces interconnect complexity, each SM stage doubles the routing congestion. This O(N) congestion makes physical design difficult.
To alleviate congestion, routing is alternated between x-y directions to reduce congestion to O(N^0.5) (Fig. 2b). At every hierarchy, the LUTs near the center are interconnected to create shorter routes, and the edge routes are longer. This gives routing tools options for faster paths on timing-critical routes.
The test chip has 2048 4-input LUTs: 1024 LUTs form 256 Logic CLBs, 896 LUTs form 224 DSP CLBs, and 128 LUTs form 16 Block RAMs (BRAMs) of 1kb each. In practice, the majority of the logic connections are local, requiring fewer connections on upper hierarchies. Therefore full connectivity is preserved up to 6 SM stages (Fig. 3a), then half-connectivity SMs are used to reduce the complexity of upper hierarchies. This partitions the interconnect into 3 sub-networks: N_{8:2}, N_{6:2}, and N_{6:1}. The chip is divided into 16 macros (Fig. 3b). Macros N_{8:2} are centered for shorter top-level routing, branching into N_{6:2} and N_{6:1}. Each of the macros contains 32 CLBs, a combination of Logic, DSP, and BRAM (Fig. 3c).

Circuit Implementation
The CLBs include four 4-input LUTs with selectable asynchronous/synchronous output stages (Fig. 4a). Each LUT is configurable as one 4-input LUT or two 3-input LUTs with up to 4 unique inputs. A Logic CLB includes a carry chain to support 4b additions where Propagate and Generate are driven from LUTs. The Logic CLB is especially useful when two outputs per bit are required, such as in 3:2 compressors.
The DSP CLB (Fig. 4b) has a LUT combiner to support 5/6-input LUTs, and a carry chain that is configurable as one 8b or two 4b adders. The adder cells are shared with a 4b×4b Wallace-tree multiplier. Based on the configuration, the appropriate outputs are sent to the output stage. Due to the level of configurability, the synthesized CLB has 50 logic gates on its critical path (shaded), amounting to a 1.1ns delay.
Configuration bits are required to control CLBs and SMs, but traditional SRAM arrays are not suitable because all bits cannot be accessed simultaneously. A scan chain is adopted in [4] to control 6 CLBs, but it is not scalable to larger designs. Therefore an SRAM-based bit cell (BC) is designed where the output of each BC is directly routed to the configuration inputs of CLBs and SMs (Fig. 5a). The BC area is 5x smaller than a DFF-based scan cell. The bit-line (BL) and word-line (WL) controls are implemented as scan chains to write one row of BCs at a time. The BC arrays are local to each CLB, so only the BL and WL controls are propagated to top level. Overall, the memory area is reduced (Fig. 5b), and total interconnect area is 51%, a 3–4x reduction over 2D-mesh [5] for a fixed logic area.

Automated Mapper
An automated mapper is developed to map RTL onto this FPGA. A standard-cell library of LUT functions is created to enable logic synthesis using commercial tools. The LUT netlist is imported into an automated, custom place-and-route tool that generates the bitstream for FPGA programming. This tool is also used during architecture design to evaluate interconnect connectivities by mapping Toronto20 benchmarks.

Measurement Results
Our chip achieves 16.4 GOPS/mm² when all Logic and DSP CLBs are utilized, executing 175 16b accumulators at 370MHz. Since a 16b adder uses 2 DSP CLBs or 4 Logic CLBs, the DSP adders are faster, reaching 400MHz. Performance is hindered by equipment limitations due to a 0.25ns input-clock jitter at 400MHz. The energy-delay curve and the power breakdowns for minimum delay and minimum energy are shown in Fig. 6.
In comparison, [4] has no interconnects, the full-custom CLB in 32-nm LVT is 2.5x faster, but achieves 2.6 GOPS/mW at 0.34V for 8b operations, which is 0.65 GOPS/mW for 16b (2 CLBs per operation at half the speed). With interconnects, our 65-nm chip reaches 1.1 GOPS/mW at 0.5V.
Leakage is well-controlled even without power gating. A 1.08 GOPS/mW is attainable with only 112 DSP accumulators active and most of the Logic CLBs idle (Table I). The FIR filter achieves 274MHz due to longer routing, but interconnect delay is still under 50%. The 2×2 MIMO FFT uses 10 BRAMs to implement various delay lines. With many control signals and a critical path of 11 CLBs, the FFT achieves 83MHz.
Figure 7 shows the die photo. The top 3 metal layers (out of 9) are sparsely used, leaving ample room for larger designs.

Acknowledgments
We thank STMicroelectronics and C. Yang for helpful discussions.

References
[1] I. Kuon et al., Found. Trends in Elec. Design, 2008.
[2] I. Bolsens, MPSOC, 2006.
[3] V. Konda, U.S. Patent 2010/0172349.
[4] A. Agarwal et al., ISSCC Dig. Tech. Papers, 2010.
[5] M. Lin et al., FPGA ’06.
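
As a rough illustration of the scaling argument in the Hierarchical Interconnect Architecture section, the sketch below evaluates the leading-order growth terms quoted in the text for a few LUT counts. Constant factors are ignored, so the values are growth terms and not switch counts for any real device. The function name `growth_terms`, the dictionary keys, and the choice of a base-2 logarithm (suggested by the radix-2 network) are assumptions made for this example.

```python
import math


def growth_terms(n_luts: int) -> dict:
    """Leading-order growth terms for N LUTs; constant factors are ignored."""
    return {
        # 2D-mesh with a full switch-box array: O(N^2)
        "mesh_full_array": n_luts ** 2,
        # 2D-mesh under Rent's rule for random logic: O(N^1.75)
        "mesh_rent": n_luts ** 1.75,
        # Hierarchical (folded Benes) network: O(N log N), base 2 assumed
        "hierarchical": n_luts * math.log2(n_luts),
        # Routing congestion before x-y alternation: O(N)
        "congestion_plain": float(n_luts),
        # Routing congestion with alternated x-y routing: O(N^0.5)
        "congestion_xy": math.sqrt(n_luts),
    }


if __name__ == "__main__":
    for n in (8, 64, 512, 2048):
        t = growth_terms(n)
        print(
            f"N={n:5d}  N^2={t['mesh_full_array']:>9d}  "
            f"N^1.75={t['mesh_rent']:>11.0f}  "
            f"N*log2(N)={t['hierarchical']:>8.0f}  "
            f"sqrt(N)={t['congestion_xy']:.1f}"
        )
```

For N = 2048 the N·log2(N) term is 22,528, against about 4.2 million for N^2, which is the gap that motivates the hierarchical network.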

978-4-86348-165-7
2011 Symposium on VLSI Circuits Digest of Technical Papers, pp. 136–137
IPR2020-00260, VENKAT KONDA EXHIBIT 2002

Figure 1: 2D-mesh interconnect. (Labels: CLBs of LUTs, bi-directional switch boxes, connection boxes, I/O, switch-box array.)
Figure 2: a) Hierarchical routing of 8 LUTs (4 shown) using SMs, with congestion O(N); b) alternated x-y routing, with congestion O(√N).
Figure 3: a) Interconnect architecture of 2048 LUTs, where N_{P:Q} denotes P stages of full SM + Q stages of half SM; floorplan of b) SM network (N_{8:2}, N_{6:2}, N_{6:1}), c) CLB placement (Logic, DSP, BRAM).
Figure 4: a) CLB block diagram and b) DSP CLB schematic.
Figure 5: a) Bit cell (BC) configuration circuitry, with BL and WL controls driven by FF scan chains and BC outputs routed to SMs and CLBs (bit cell area: 1.4μm × 1.8μm); b) area comparisons of 2D-mesh vs. this chip for a fixed logic area, annotated "3–4× reduction in interconnect area". Extracted percentage labels for Interconnects + Routing, Logic, and Memory (assignment to bars not recoverable): 43%, 14%, 8%, 35%; 36%, 15%, 13%, 35%.
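
The area claim behind Fig. 5b can be checked with a short calculation. The sketch below takes the 51% interconnect share reported for this chip, holds everything else fixed, and scales only the interconnect by the quoted 3–4x to estimate the interconnect share of a comparable 2D-mesh design. Treating all non-interconnect area as the fixed "logic" area is an assumption of this example and not a statement from the paper; the function name is hypothetical.

```python
def mesh_interconnect_fraction(this_interconnect: float, reduction: float) -> float:
    """Estimate the interconnect share of total area for a 2D-mesh design.

    Assumption (not from the paper): all non-interconnect area stays fixed,
    and the mesh interconnect is `reduction` times larger than this chip's.
    """
    fixed_area = 1.0 - this_interconnect
    mesh_interconnect = reduction * this_interconnect
    return mesh_interconnect / (mesh_interconnect + fixed_area)


if __name__ == "__main__":
    share = 0.51  # interconnect share of this chip, from the text

    # Interconnect-to-logic ratio of this chip under the same assumption.
    print(f"interconnect : rest = {share / (1.0 - share):.2f} : 1")

    for k in (3.0, 4.0):
        frac = mesh_interconnect_fraction(share, k)
        print(f"{k:.0f}x larger interconnect -> {frac:.1%} of total area")
```

Under these assumptions the 2D-mesh interconnect share comes out between roughly 76% and 81%, which is of the same order as the "over 75%" and "80%" figures quoted in the Introduction.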
Figure 6: Energy-delay curve of the mapped 175 16b accumulator with power breakdown at Fmax and Emin (insets). Axis: Delay (ns). Labeled operating points: 370MHz at 1.0V, 300MHz at 0.88V, 200MHz at 0.74V, 100MHz at 0.59V, 55MHz at 0.5V. Power-ratio insets (extracted values; assignment to Fmax vs. Emin not recoverable): Clock 28%, Leak. 26%, Active 46%; Clock 18%, Leak. 44%, Active 38%.

TABLE I: MEASUREMENT RESULTS.
                            Resource Utilization               Performance           Result
Design                      Logic (256)  DSP (224)  BRAM (16)  VDD (V)  Freq. (MHz)  Power (mW)  GOPS/mW
175 Logic+DSP 16b Accum.    256          224        0          1.0      370          179         0.36
                                                               0.50     55           8.6         1.13
112 DSP 16b Accum.          4            224        0          1.0      400          123         0.57
                                                               0.51     60           6.2         1.08
32-tap 16b FIR Filter       132          209        0          1.0      274          120         0.21
                                                               0.56     50           10.2        0.45
2×2 MIMO 64-point FFT       196          93         10         1.0      83           82.7        0.05
                                                               0.78     40           26.5        0.07

Figure 7: Die micrograph and chip summary. (Die-photo dimension label: 3.06mm.)
Technology    65nm 1P9M CMOS
Core VDD      0.34 to 1.0V
Frequency     40 to 400MHz
I/Os          75 bidirectional
Core Size     2.52mm × 1.56mm
Gate Count    2.73M
CLB Count     256 Logic, 224 DSP
Block RAMs    16 128×8b
Config. Bits  297,472b
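
The efficiency figures in Table I and the abstract can be reproduced approximately from the reported design size, clock frequency, power, and area. The sketch below assumes that each 16b accumulator performs one operation per clock cycle, which the paper does not state explicitly; the function names are hypothetical, and small differences from the published values are expected from rounding of the reported power.

```python
def gops(ops_per_cycle: int, freq_mhz: float) -> float:
    """Throughput in giga-operations per second (one op per unit per cycle assumed)."""
    return ops_per_cycle * freq_mhz / 1000.0


def gops_per_mw(ops_per_cycle: int, freq_mhz: float, power_mw: float) -> float:
    """Energy efficiency in GOPS/mW."""
    return gops(ops_per_cycle, freq_mhz) / power_mw


def gops_per_mm2(ops_per_cycle: int, freq_mhz: float, area_mm2: float) -> float:
    """Area efficiency in GOPS/mm^2."""
    return gops(ops_per_cycle, freq_mhz) / area_mm2


if __name__ == "__main__":
    # 175 accumulators at 1.0V: 370MHz, 179mW, on 3.94mm^2 (values from the text).
    print(f"{gops_per_mw(175, 370, 179):.3f} GOPS/mW at 1.0V")
    print(f"{gops_per_mm2(175, 370, 3.94):.2f} GOPS/mm^2 at 370MHz")
    # 175 accumulators at 0.50V: 55MHz, 8.6mW.
    print(f"{gops_per_mw(175, 55, 8.6):.3f} GOPS/mW at 0.50V")
    # 112 DSP accumulators at 0.51V: 60MHz, 6.2mW.
    print(f"{gops_per_mw(112, 60, 6.2):.3f} GOPS/mW at 0.51V")
```

With these inputs the sketch gives about 0.36 GOPS/mW and 16.4 GOPS/mm² at 1.0V, about 1.12 GOPS/mW at 0.50V (reported as 1.13), and about 1.08 GOPS/mW for the 112-accumulator design at 0.51V.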
