IPR2020-00262, No. 2002-19 Exhibit - 2011 paper at VLSI Circuits Symposium (P.T.A.B. May. 6, 2020)

13-3

A 1.1 GOPS/mW FPGA Chip with Hierarchical Interconnect Fabric
Cheng C. Wang, Fang-Li Yuan, Henry Chen, Dejan Marković
Electrical Engineering Department, University of California, Los Angeles, CA

Abstract
A 2048 look-up-table FPGA with a radix-2 hierarchical interconnect network is realized in 3.94mm² in 65-nm CMOS. It has an interconnect-to-logic area ratio of 1:1, which is a 3–4x reduction from modern FPGAs while allowing up to 100% resource utilization. As a proof of concept, it is designed with standard cells, achieving 16.4 GOPS/mm² at 370MHz. Peak energy efficiency of 1.1 GOPS/mW is measured at 0.5V.

Introduction
Field-programmable gate arrays (FPGAs) are effective for rapid verification and prototyping of VLSI designs. They are also used in products that require periodic hardware changes and short time to market. However, FPGAs incur penalties in area (17–54x), speed (2.5–6.7x), and power (5.7–62x) over standard-cell ASICs [1], hindering their expansion into ASIC markets. The overhead is primarily due to interconnects, which account for over 75% of area and delay.

For over 20 years, FPGAs have used 2D-mesh interconnects, where look-up tables (LUTs) are placed in configurable logic blocks (CLBs), and arrays of switch boxes are placed at interconnect crossings (Fig. 1). Since a full array requires too much area, various heuristics are used to simplify switch-box arrays at the cost of resource utilization. Yet 80% of the 1.1B transistors on Virtex-5 are used for interconnects [2]. This paper demonstrates an FPGA with hierarchical interconnects where interconnect area is 51%, a 3–4x reduction from commercial FPGAs while preserving connectivity. An energy efficiency of 1.1 GOPS/mW is the highest among reported FPGAs. The chip is tested up to 400MHz.

Hierarchical Interconnect Architecture
The key issue with 2D-mesh is scalability; the number of switch boxes grows as O(N²) with the number of LUTs. Using Rent’s rule, interconnect complexity is still O(N^1.75) for random logic, requiring FPGA size to scale much faster than Moore’s Law. In the proposed hierarchical interconnect, a folded Beneš network is employed to reduce the complexity to O(N·logN) [3]: 4 LUTs are connected via 2 stages of switch matrices (SMs), and another 4 LUTs are connected with a 3rd SM stage (Fig. 2a). Each SM has 4 unidirectional connections per direction. Although this architecture reduces interconnect complexity, each SM stage doubles the routing congestion. This O(N) congestion makes physical design difficult.

To alleviate congestion, routing is alternated between x-y directions to reduce congestion to O(N^0.5) (Fig. 2b). At every hierarchy, the LUTs near the center are interconnected to create shorter routes, and the edge routes are longer. This gives routing tools options for faster paths on timing-critical routes.

The test chip has 2048 4-input LUTs: 1024 LUTs form 256 Logic CLBs, 896 LUTs form 224 DSP CLBs, and 128 LUTs form 16 Block RAMs (BRAMs) of 1kb each. In practice, the majority of the logic connections are local, requiring fewer connections on upper hierarchies. Therefore full connectivity is preserved up to 6 SM stages (Fig. 3a), then half-connectivity SMs are used to reduce the complexity of upper hierarchies. This partitions the interconnect into 3 sub-networks: N8:2, N6:2, and N6:1. The chip is divided into 16 macros (Fig. 3b). Macros N8:2 are centered for shorter top-level routing, branching into N6:2 and N6:1. Each of the macros contains 32 CLBs: a combination of Logic, DSP, and BRAM (Fig. 3c).

Circuit Implementation
The CLBs include four 4-input LUTs with selectable asynchronous/synchronous output stages (Fig. 4a). Each LUT is configurable as one 4-input LUT or two 3-input LUTs with up to 4 unique inputs. A Logic CLB includes a carry chain to support 4b additions where Propagate and Generate are driven from LUTs. The Logic CLB is especially useful when two outputs per bit are required, such as in 3:2 compressors.

The DSP CLB (Fig. 4b) has a LUT combiner to support 5/6-input LUTs, and a carry chain that is configurable as one 8b or two 4b adders. The adder cells are shared with a 4b×4b Wallace-tree multiplier. Based on the configuration, the appropriate outputs are sent to the output stage. Due to the level of configurability, the synthesized CLB has 50 logic gates on its critical path (shaded), amounting to a 1.1ns delay.

Configuration bits are required to control CLBs and SMs, but traditional SRAM arrays are not suitable because all bits cannot be accessed simultaneously. A scan chain is adopted in [4] to control 6 CLBs, but it is not scalable to larger designs. Therefore an SRAM-based bit cell (BC) is designed where the output of each BC is directly routed to the configuration inputs of CLBs and SMs (Fig. 5a). The BC area is 5x smaller than a DFF-based scan cell. The bit-line (BL) and word-line (WL) controls are implemented as scan chains to write one row of BCs at a time. The BC arrays are local to each CLB, so only the BL and WL controls are propagated to top level. Overall, the memory area is reduced (Fig. 5b), and total interconnect area is 51%, a 3–4x reduction over 2D-mesh [5] for a fixed logic area.

Automated Mapper
An automated mapper is developed to map RTL onto this FPGA. A standard-cell library of LUT functions is created to enable logic synthesis using commercial tools. The LUT netlist is imported into an automated, custom place-and-route tool that generates the bitstream for FPGA programming. This tool is also used during architecture design to evaluate interconnect connectivities by mapping Toronto20 benchmarks.

Measurement Results
Our chip achieves 16.4 GOPS/mm² when all Logic and DSP CLBs are utilized, executing 175 16b accumulators at 370MHz. Since a 16b adder uses 2 DSP CLBs or 4 Logic CLBs, the DSP adders are faster, reaching 400MHz. Performance is hindered by equipment limitations due to a 0.25ns input-clock jitter at 400MHz. The energy-delay curve and the power breakdowns for minimum delay and minimum energy are shown in Fig. 6.

In comparison, [4] has no interconnects, the full-custom CLB in 32-nm LVT is 2.5x faster, but achieves 2.6 GOPS/mW at 0.34V for 8b operations, which is 0.65 GOPS/mW for 16b (2 CLBs per operation at half the speed). With interconnects, our 65-nm chip reaches 1.1 GOPS/mW at 0.5V.

Leakage is well-controlled even without power gating. A 1.08 GOPS/mW is attainable with only 112 DSP accumulators active and most of the Logic CLBs idle (Table I). The FIR filter achieves 274MHz due to longer routing, but interconnect delay is still under 50%. The 2×2 MIMO FFT uses 10 BRAMs to implement various delay lines. With many control signals and a critical path of 11 CLBs, the FFT achieves 83MHz.

Figure 7 shows the die photo. The top 3 metal layers (out of 9) are sparsely used, leaving ample room for larger designs.

Acknowledgments
We thank STMicroelectronics and C. Yang for helpful discussions.

References
[1] I. Kuon et al., Found. Trends in Elec. Design, 2008.
[2] I. Bolsens, MPSOC, 2006.
[3] V. Konda, U.S. Patent 2010/0172349.
[4] A. Agarwal et al., ISSCC Dig. Tech. Papers, 2010.
[5] M. Lin et al., FPGA ’06.
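
The scaling argument in the Hierarchical Interconnect Architecture section can be made concrete with a small numeric sketch. It evaluates only the growth terms quoted in the text (N², N^1.75, N·logN for complexity; N and N^0.5 for congestion). The constant factors are not given in the paper, so they are all assumed to be 1 here, and the base-2 logarithm is an assumption chosen to match the radix-2 network. The function names are illustrative and do not come from the paper.

```python
import math


def interconnect_growth(n_luts: int) -> dict:
    """Unitless growth terms for the complexity laws quoted in the text.

    Constant factors are unknown and assumed to be 1, so only the
    ratios between entries are meaningful, not the absolute values.
    """
    return {
        "2d_mesh_full": float(n_luts ** 2),            # O(N^2) switch boxes
        "rent_random_logic": n_luts ** 1.75,           # O(N^1.75) via Rent's rule
        "hierarchical": n_luts * math.log2(n_luts),    # O(N log N), log base 2 assumed
    }


def congestion_growth(n_luts: int) -> dict:
    """Unitless congestion terms: stacked SM stages vs. alternated x-y routing."""
    return {
        "stacked_stages": float(n_luts),        # O(N)
        "alternated_xy": math.sqrt(n_luts),     # O(N^0.5)
    }


if __name__ == "__main__":
    for n in (8, 256, 2048):
        g = interconnect_growth(n)
        c = congestion_growth(n)
        ratio = g["2d_mesh_full"] / g["hierarchical"]
        print(f"N={n}: N^2 / (N log2 N) = {ratio:.1f}, "
              f"congestion {c['stacked_stages']:.0f} -> {c['alternated_xy']:.1f}")
```

Under these assumptions the gap between the N² and N·logN terms widens from under 3x at 8 LUTs to roughly 186x at the chip's 2048 LUTs, which is the motivation the text gives for moving away from a full 2D-mesh.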

978-4-86348-165-7
2011 Symposium on VLSI Circuits Digest of Technical Papers, p. 136
Page 1 of 2, IPR2020-00262, VENKAT KONDA EXHIBIT 2002

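
The resource counts stated in the text (2048 LUTs split into Logic CLBs, DSP CLBs, and BRAMs; 16 macros of 32 CLBs; 2 DSP CLBs or 4 Logic CLBs per 16b adder) can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check. Treating each BRAM as occupying two CLB-sized slots is an assumption made here because it reconciles 128 LUT positions with the 16 × 32 macro layout; the paper does not state it. The helper names are hypothetical.

```python
LUTS_PER_CLB = 4  # "The CLBs include four 4-input LUTs"

# LUT budget as stated in the text.
LUT_BUDGET = {"logic": 1024, "dsp": 896, "bram": 128}


def clb_slots(lut_budget: dict) -> dict:
    """CLB-sized slots per resource type, at four LUT positions per slot.

    For BRAM this is an assumed equivalence (LUT positions / 4), not a
    figure quoted in the paper.
    """
    return {kind: luts // LUTS_PER_CLB for kind, luts in lut_budget.items()}


def max_16b_accumulators(logic_clbs: int, dsp_clbs: int) -> int:
    """Upper bound on 16b accumulators: 2 DSP CLBs or 4 Logic CLBs each."""
    return dsp_clbs // 2 + logic_clbs // 4


if __name__ == "__main__":
    slots = clb_slots(LUT_BUDGET)
    print("total LUTs:", sum(LUT_BUDGET.values()))
    print("CLB-sized slots:", slots, "sum =", sum(slots.values()))
    print("macro capacity:", 16 * 32)
    print("accumulator bound:", max_16b_accumulators(slots["logic"], slots["dsp"]))
```

The bound works out to 112 DSP-based plus 64 Logic-based accumulators, one more than the 175 accumulators the paper reports mapping; the paper does not explain the difference.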
Figure 1: 2D-mesh interconnect. [Diagram labels: CLB containing four LUTs, bi-directional switch box, connection box, switch-box array, I/O.]

Figure 2: a) Hierarchical routing of 8 LUTs (4 shown) using SMs, b) alternated x-y routing. [Panel annotations: congestion O(N) in a), congestion O(√N) in b).]

Figure 3: a) Interconnect architecture of 2048 LUTs, floorplan of b) SM network, c) CLB placement. [Legend: NP:Q denotes P stages of full SM + Q stages of half SM. Sub-networks shown: N8:2, N6:2, N6:1. CLB types shown: Logic, DSP, BRAM.]

Figure 4: a) CLB block diagram and b) DSP CLB schematic. [Diagram labels: inputs inA to inD from the hierarchical network, outputs OutA to OutD to the hierarchical network, Cin and Cout, 3-/4-input LUT, output stage, 5/6-input LUT combiner (DSP CLB only), configurable adder/multiplier, critical path.]

Figure 5: a) Bit cell (BC) configuration circuitry, b) area comparisons of 2D-mesh vs. this chip for a fixed logic area. [Diagram labels: bit cell area 1.4μm × 1.8μm; BL and WL control scan chains built from FFs; BC outputs to SM and CLB; 3–4× reduction in interconnect area. Pie-chart percentages as extracted, chart assignment unclear: 43%, 14%, 8%, 35% and 36%, 15%, 13%, 35%.]
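
The Logic CLB described under Circuit Implementation (Fig. 4a) supports 4b additions through a carry chain whose Propagate and Generate signals are driven from LUTs. A minimal behavioral sketch of that idea follows. It models a 4-input LUT as a 16-bit truth table and uses the textbook definitions P = a XOR b and G = a AND b; how the chip actually assigns LUT inputs and wires the chain is not specified in the text, so the input ordering, the unused inputs tied to 0, and all names here are assumptions.

```python
def make_lut4(fn) -> int:
    """Build a 16-bit truth table for a 4-input function fn(a, b, c, d)."""
    table = 0
    for index in range(16):
        a, b, c, d = (index >> 0) & 1, (index >> 1) & 1, (index >> 2) & 1, (index >> 3) & 1
        table |= (fn(a, b, c, d) & 1) << index
    return table


def lut4(table: int, a: int, b: int, c: int = 0, d: int = 0) -> int:
    """Look up one output bit; input 'a' is the least significant index bit."""
    return (table >> (a | (b << 1) | (c << 2) | (d << 3))) & 1


# Assumed LUT contents: textbook propagate and generate functions.
PROPAGATE = make_lut4(lambda a, b, c, d: a ^ b)
GENERATE = make_lut4(lambda a, b, c, d: a & b)


def add4(x: int, y: int, carry_in: int = 0):
    """4b ripple addition with LUT-driven P/G; returns (sum, carry_out)."""
    total, carry = 0, carry_in
    for bit in range(4):
        a, b = (x >> bit) & 1, (y >> bit) & 1
        p = lut4(PROPAGATE, a, b)
        g = lut4(GENERATE, a, b)
        total |= (p ^ carry) << bit    # sum bit uses the incoming carry
        carry = g | (p & carry)        # carry-chain cell
    return total, carry


if __name__ == "__main__":
    print(add4(9, 8))      # 4b sum and carry-out of 9 + 8
    print(add4(15, 15, 1))
```

Chaining the carry-out of one call into the carry-in of the next gives wider adders, which matches the text's statement that a 16b adder uses 4 Logic CLBs.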
Figure 6: Energy-delay curve of the mapped 175 16b accumulator with power breakdown at Fmax and Emin (insets). [Axis label: Delay (ns). Labeled operating points: 370MHz at 1.0V, 300MHz at 0.88V, 200MHz at 0.74V, 100MHz at 0.59V, 55MHz at 0.5V. Inset power ratios as extracted, in extraction order: Clock 28%, Leak. 26%, Active 46%; Clock 18%, Leak. 44%, Active 38%.]

TABLE I: MEASUREMENT RESULTS.
                           Resource Utilization               Performance           Result
Design                     Logic (256)  DSP (224)  BRAM (16)  VDD (V)  Freq. (MHz)  Power (mW)  GOPS/mW
175 Logic+DSP 16b Accum.   256          224        0          1.0      370          179         0.36
                                                              0.50     55           8.6         1.13
112 DSP 16b Accum.         4            224        0          1.0      400          123         0.57
                                                              0.51     60           6.2         1.08
32-tap 16b FIR Filter      132          209        0          1.0      274          120         0.21
                                                              0.56     50           10.2        0.45
2×2 MIMO 64-point FFT      196          93         10         1.0      83           82.7        0.05
                                                              0.78     40           26.5        0.07

Figure 7: Die micrograph and chip summary. [Die-photo dimension label: 3.06mm.]
Technology:    65nm 1P9M CMOS
Core VDD:      0.34 to 1.0V
Frequency:     40 to 400MHz
I/Os:          75 bidirectional
Core Size:     2.52mm × 1.56mm
Gate Count:    2.73M
CLB Count:     256 Logic, 224 DSP
Block RAMs:    16 128×8b
Config. Bits:  297,472b
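
The headline figures in the Abstract and Measurement Results can be reproduced from Table I with simple arithmetic. The sketch below assumes that one operation is one 16b accumulation per clock cycle per accumulator, which the paper does not state explicitly but which matches the accumulator rows it is applied to here. It also reproduces the 8b-to-16b scaling applied to reference [4] (2 CLBs per operation at half the speed). Function and constant names are illustrative only.

```python
CORE_AREA_MM2 = 3.94  # area quoted in the abstract


def gops(n_ops_per_cycle: int, freq_mhz: float) -> float:
    """Throughput in GOPS, assuming each unit completes one op per cycle."""
    return n_ops_per_cycle * freq_mhz / 1000.0


def gops_per_mw(n_ops_per_cycle: int, freq_mhz: float, power_mw: float) -> float:
    """Energy efficiency in GOPS/mW for a measured total power."""
    return gops(n_ops_per_cycle, freq_mhz) / power_mw


def scale_8b_to_16b(eff_8b: float, clbs_per_op: int = 2, slowdown: float = 2.0) -> float:
    """Scale an 8b efficiency to 16b: more CLBs per op and a lower clock."""
    return eff_8b / (clbs_per_op * slowdown)


if __name__ == "__main__":
    # 175 accumulators at 370MHz, 1.0V, 179mW (Table I).
    print(f"{gops(175, 370) / CORE_AREA_MM2:.2f} GOPS/mm2")
    print(f"{gops_per_mw(175, 370, 179):.2f} GOPS/mW at 1.0V")
    # Same design at 55MHz, 0.50V, 8.6mW.
    print(f"{gops_per_mw(175, 55, 8.6):.2f} GOPS/mW at 0.50V")
    # 112 DSP accumulators at 60MHz, 0.51V, 6.2mW.
    print(f"{gops_per_mw(112, 60, 6.2):.2f} GOPS/mW at 0.51V")
    # Reference [4]: 2.6 GOPS/mW for 8b operations.
    print(f"{scale_8b_to_16b(2.6):.2f} GOPS/mW for 16b")
```

Under this assumption the 370MHz row gives about 16.4 GOPS/mm² and 0.36 GOPS/mW, and the 0.51V DSP-only row gives about 1.08 GOPS/mW, in line with the table. The 0.50V row comes out near 1.12 rather than the printed 1.13, a difference that is plausibly due to rounding of the printed power value.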
