`Technical and Management Proposal
`
`Title: Energy-Efficient Butterfly FPGA Hardware and Programming Tools
`
`A proposal submitted to
`Dr. William Harrod, DARPA/TCTO
`in response to
`
`DARPA-BAA 10-78: Omnipresent High Performance Computing (OHPC)
`
`Technical Area:
`
`Energy Efficient Computing
`
`Lead Organization: University of California, Los Angeles (UCLA)
` Department of Electrical Engineering
`
`Los Angeles, CA 90095-1594
`
`Type of Business: Other Educational
`
`Team Members: Dejan Markovic (PI)
` Venkat Konda (Consultant)
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Technical Point of Contact:
`
`Dr. Dejan Markovic, PI
`UCLA Associate Professor
`Electrical Engineering Department
`56-147D Engineering IV Building
`420 Westwood Plaza
`Los Angeles, CA 90095-1594
`
`Tel: (310) 825-8656
`Fax: (310) 206-8495
`Email: dejan@ee.ucla.edu
`
`Administrative Point of Contact:
`
`Ms. Julia Zhu
`UCLA Senior Grant Analyst
`Office of Contract and Grant Administration
`11000 Kinross Ave, Suite 102
`Los Angeles, CA 90095-1406
`
`
`Tel: (310) 794-0155
`Fax: (310) 943-1658
`Email: ocga5@research.ucla.edu
`
Total funds requested: $2,374,111
Year 1: $789,927
Year 2: $792,100
Year 3: $792,086

Date of proposal: August 4, 2010
`
`
`
`
`
`1
`
`Page 1 of 43 IPR2020-00261
`
`VENKAT KONDA EXHIBIT 2003
`
`
`
`
UNIVERSITY OF CALIFORNIA, LOS ANGELES

BERKELEY - DAVIS - IRVINE - LOS ANGELES - MERCED - RIVERSIDE - SAN DIEGO - SAN FRANCISCO - SANTA BARBARA - SANTA CRUZ

OFFICE OF CONTRACT AND GRANT ADMINISTRATION
BOX 951406
11000 KINROSS, SUITE 102
LOS ANGELES, CALIFORNIA 90095-1406

PHONE: (310) 794-0102
FAX: (310) 794-0631
www.research.ucla.edu/ocga
`
`August 5, 2010
`
DARPA/TCTO
ATTN: DARPA-BAA-10-78
3701 N. Fairfax Drive
Arlington, VA 22203-1714

The Regents of the University of California, Los Angeles, is pleased to submit the following proposal in
response to solicitation DARPA-BAA-10-78.
`
`Title:
`
`“Energy-Efficient Butterfly FPGA Hardware and Programming Tools.”
`
Requested Period of Performance: September 15, 2010 to September 14, 2013
`
`Amount Requested:
`
`$2,374,111
`
`Principal Investigator:
`
Dr. Dejan Markovic
`Department of Electrical Engineering
`dejan@ee.ucla.edu
`310-825-8656
`
`This application is being submitted in contemplation of an agreement containing mutually agreeable terms and
`conditions applicable to educational institutions conducting unclassified fundamental research.
`
Since UCLA is a public/State institution, open dissemination of research results and information, commitment
to students, accessibility for research purposes, and legal integrity and consistency are part of the University's
Principles/Policy. The University does not discriminate against or impose restrictions on any individual as a
result of their nationality.

If an award is made, please be advised that if it is funded by budget category 6.3 (Advanced Research) and is
considered non-fundamental research, we will not be able to accept the award due to publication restrictions.

Your favorable consideration of this proposal would be appreciated. Technical questions should be directed to
Dr. Markovic. Administrative and contractual questions should be directed to me at (310) 794-0155 or via
email at ocga5@research.ucla.edu.
`
`Sincerely,
`
`
`Julia Zhu
`
`Senior Grant Analyst
`
`
`
`
Table of Contents

Executive Summary ..... 3
Section II - Technical Details ..... 5
2.1. PowerPoint Summary Chart ..... 5
2.2. Innovative Claims for the Proposed Research ..... 6
  Problem Description ..... 6
  Research Goals ..... 6
  Expected Impact ..... 7
2.3. Proposal Roadmap ..... 8
2.4. Technical Approach ..... 10
  2.4.1. Network Architecture and Routing Tools ..... 14
  2.4.2. Hardware Design ..... 15
  2.4.3. Hardware Mapping ..... 19
  Demonstrations and Technology Transition ..... 22
2.5. Statement of Work ..... 24
2.6. Intellectual Property ..... 26
2.7. Management Plan ..... 28
2.8. Schedule and Milestones ..... 30
  2.8.1. Schedule Graphic ..... 30
  2.8.2. Detailed Task Description ..... 31
  2.8.3. Project Management and Interaction Plan ..... 33
2.9. Personnel, Qualifications, and Commitments ..... 34
2.10. Organizational Conflict of Interest Affirmations and Disclosure ..... 36
2.11. Human Use ..... 39
2.12. Animal Use ..... 38
2.13. Statement of Unique Capability Provided by Government or Government-Funded Team Member ..... 39
2.14. Government or Government-funded Team Member Eligibility ..... 40
2.15. Facilities ..... 41
References ..... 42
BEEcube Support Letter ..... 43
`
`3
`
`Page 3 of 43 IPR2020-00261
`
`VENKAT KONDA EXHIBIT 2003
`
`
`
`Executive Summary
`
UCLA offers to perform research on a revolutionary new FPGA technology consisting of FPGA
hardware and supporting mapping tools. We will design, fabricate, and test a hierarchical FPGA
interconnect network to demonstrate FPGA technology that is 15x more energy-efficient than
existing FPGAs. The new interconnect architecture allows for a significant reduction in the
number of switch points, buffers, and wire length in comparison to the standard 2D-mesh
architecture used by existing FPGAs. The proposed technology is a radical departure from the
2D-mesh design, which for N logic blocks has O(N^2) complexity and incomplete, heuristic routing.
The proposed technology has only O(N·log2(N)) complexity and complete, fully deterministic
routing. The proposed technology has significant benefits: 15x lower power, 3x lower area, and 2x
higher performance compared to existing FPGA technology. The new FPGA technology will be
used to demonstrate HPC benchmarks with a 15x higher power efficiency for DOD and
commercial users. The PI has established interactions with industrial partners that will lead to the
transition of ideas into the commercial space.
`
`
`
`
`
`
`
`
`4
`
`Page 4 of 43 IPR2020-00261
`
`VENKAT KONDA EXHIBIT 2003
`
`
`
`Section II - Technical Details
`
`2.1. PowerPoint Summary Chart
`
`
`
`
`
`
`
`
`
`
`5
`
`Energy-Efficient Butterfly FPGA Hardware and Programming Tools
`
`Technical Challenge and Objective
`
`Key Innovations
`
`• Problem: Presently, FPGA chips use 2D-mesh architecture,
`
`which is very complex (over 75% of chip area is interconnect).
`Interconnect results in energy-inefficient computations!
`
`• Our hierarchical butterfly interconnect scheme significantly
`
`reduces interconnect complexity.
`
`• Objective: significantly improve energy efficiency of FPGAs.
`
`Expected Impact
`• New FPGA hardware and mapping tools.
`
`• With significant improvements:
`
`• Power (15x)
`• Area (3x)
`• Performance (2x)
`
`• To demonstrate HPC benchmarks
`with 15x higher power efficiency.
`
`• For DOD and commercial apps.
`
`2D-Mesh Network
`
`Even with just
`4 processing
`
`units, there is a
`3x reduction in
`the number of
`connections
`(24) (8)
`
`Butterfly Network
`
`Number of connections in 2D-Mesh and Hierarchical networks
`
`Number of LUTs
`
`2D-Mesh
`
`Hierarchical
`
`Savings factor
`
`N = 1 k
`
`N = 100 k
`
`1 M
`
`10 B
`
`9,97 k
`
`1.66 M
`
`100x
`
`6,200x
`
`• Ideas verified on chip (3x reduction in interconnect area).
`
`2D-Mesh FPGA
`
`Our FPGA
`
`>2x overall chip area reduction
`>2x performance improvement
`>10x lower power
`
`• Key proposed innovations:
`
`• Interconnect architecture optimization.
`• Hardware demonstrations of area, power, and performance.
`• Mapping tools for the new FPGA architecture.
`• Demonstrations of HPC benchmarks.
`
`PI: Dejan Markovic (UCLA)
`
`Connection
`
`: IN
`: OUT
`
`Slice
`
`CLB
`
`CLB
`
`Slice
`
`CLB
`
`CLB
`
`Konda Network
`
`Connection
`
`Switch
`
`CLB
`
`
`
`CLB
`
`CLB
`
`CLB
`
`2-D Mesh Network
`
`1000
`
`Energy Efficiency
`(MOPS/mW)
`
`100
`
`10
`
`1
`
`0.1
`
`Microprocessors
`
`30-50x
`
`General
`General
`
`Purpose DSPs
`Purpose DSPs
`
`and FPGAs
`
`Dedicated
`
`~3 orders of
`magnitude!
`
`0.01
`
`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
`
`Chip Number
`
`U
`U
`U
`U
`
`C
`C
`C
`C
`
`Our FPGA chips
`
`L
`
`L
`
`L
`
`L
`
`A
`A
`A
`1
`A
`2
`
`3
`
`4
`
`FMC interface (160 pins/chip)
`
`4 Virtex-6
`
`4 Virtex-6
`FPGAs inside
`
`FPGAs inside
`
`4 Virtex-6
`FPGAs inside
`
`4 Virtex-6
`FPGAs inside
`
`www.beecube.com
`
`www.beecube.com
`
`www.beecube.com
`
`www.beecube.com
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`DSP SM_9
`
`DSP SM_9_1
`
`DSPSM_9_1
`
`DSP SM_9
`
`
`
`
`2.2. Innovative Claims for the Proposed Research
`
`Problem Description
`
Today’s programmable FPGA devices are costly in size, power, and performance, and limited in
scalability and flexibility. These costs stem from a fundamental problem in the 2D-mesh interconnect
architecture: it is large, has long latency, consumes a lot of power, and is not scalable. Interconnect
takes more than 75% of the FPGA chip area. The large number of inactive transistors also results in
significant leakage power (about 50% of the total FPGA power). Due to the inefficient interconnect
architecture, there is a 30-50x energy-efficiency gap between FPGAs and dedicated chips (Fig. 1).
`
`
`
`
`Figure 1: Energy efficiency for various computing architectures: microprocessors, general purpose
`DSPs, FPGAs, and dedicated chips. The study is based on chips from the ISSCC conference (normalized
`to the same technology). FPGAs with DSP cores are 30-50x less energy efficient than dedicated chips.
`
`
`Research Goals
`
We will integrate a hierarchical interconnect network to demonstrate significant improvements in
speed, power, and area as compared to existing FPGA technology. The hierarchical interconnect
architecture requires at least 3x fewer active network elements, switch points, and
drivers. This is illustrated in Fig. 2 for a very simple 2x2 example.
`
`
`Figure 2: 2D-Mesh and Konda networks for a design consisting of 4 CLB blocks.
`
`
`
`
`
`
`6
`
`Dedicated
`
`~3 orders of
`magnitude!
`
`
`General General
`
`Purpose DSPsPurpose DSPs
`and FPGAs
`
`30-50x
`
`Microprocessors
`
`1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
`Chip Number
`
`1000
`
`100
`
`10
`
`1
`
`0.1
`
`(MOPS/mW)
`
`Energy Efficiency
`
`0.01
`
`Even with just 4
`processing units, there
`is a 3x reduction in the
`number of connections
`(24) (8)
`
`2D-Mesh Network
`
`Butterfly Network
`
`Connection
`
`: IN
`: OUT
`
`Slice
`
`CLB
`
`CLB
`
`Slice
`
`CLB
`
`CLB
`
`Konda Network
`
`Connection
`
`Switch
`
`CLB
`
`
`
`CLB
`
`CLB
`
`CLB
`
`2-D Mesh Network
`
`
`
`
For a larger number N of configurable logic elements, the benefits of the hierarchical network will be
even more pronounced (Table 1). The large cost of the 2D-mesh architecture forces designers
to employ heuristics to reduce the number of switch points, which results in insufficient
connectivity. The hierarchical network provides complete and deterministic routing.
`
`
Table 1: Number of connections in 2D-Mesh and Konda networks.

Number of LUTs     1 k       100 k
2D-Mesh            1 M       10 B
Konda butterfly    9.97 k    1.66 M
Savings factor     100x      6,200x
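
The entries in Table 1 follow from the two complexity expressions quoted in this proposal, N^2 for the 2D-mesh and N·log2(N) for the hierarchical network. The short sketch below reproduces that arithmetic; the function names are illustrative only, and the constant factors are assumed to be 1, as the table entries suggest. The computed ratios come out near 100x for N = 1 k and near 6,000x for N = 100 k, the same order as the savings factors quoted in the table.

```python
import math


def mesh_connections(n: int) -> int:
    """Full-connectivity 2D-mesh estimate quoted in the text: N^2."""
    return n * n


def hierarchical_connections(n: int) -> float:
    """Hierarchical (butterfly) estimate quoted in the text: N * log2(N)."""
    return n * math.log2(n)


for n in (1_000, 100_000):
    mesh = mesh_connections(n)
    hier = hierarchical_connections(n)
    # Print counts in engineering-style notation plus the mesh/hierarchical ratio.
    print(f"N = {n:>7,}: 2D-mesh = {mesh:.3g}, hierarchical = {hier:.3g}, "
          f"ratio = {mesh / hier:,.0f}x")
```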
`
`
`Expected Impact
`
`The new FPGA platform will provide significant savings in power compared to today’s FPGAs
`as shown in Fig. 3. Our FPGA technology, which includes hardware and supporting mapping
`tools, will provide an estimated 15x power reduction as compared to conventional FPGAs.
`
`
`
`
Figure 3: Power consumption for a range of applications. The new FPGA will provide significant power
reduction compared to a typical Virtex-5 FPGA (normalized to the same technology).
`
`
`We will provide new FPGA technology consisting of hardware and mapping tools. The hardware
`and mapping tools will provide significant impacts: 15x lower power, 3x lower area, 2x higher
`performance compared to existing FPGA technology. The new FPGA technology will be used to
`demonstrate HPC benchmarks with a 15x higher power efficiency for DOD and commercial
`users. Equivalently, our FPGA technology can provide >10x higher throughput for the same
`amount of power (as shown in Fig. 3). This technology will be of use for HPC applications and
`many other DOD applications which use FPGA technology.
`
`
`
`
`
`
`
`7
`
`Typical FPGA (Virtex-5)
`
`LPE Goal
`
`Outcomes (arrows)
` New applications
`of FPGA (↓)
` More capability
`for existing apps
`(→)
`
`Radio DSP
`
`FFT module
`
`Mirco & small UAVs
`
`Satellite DSP
`
`MIMO DSP
`
`10
`100
`1000
`Performance (GOPS)
`
`100W
`
`10W
`
`1W
`
`Power
`
`100mW
`
`
`
`
`2.3. Proposal Roadmap
`
`
Main goals of the proposed research: The main goal of the program is to develop
energy-efficient programmable hardware and supporting software mapping tools. The hardware is based
on a hierarchical interconnect architecture that provides a significant reduction in interconnect
complexity as compared to today’s FPGA hardware. With a combination of the new interconnect
architecture and a supporting toolflow, we project over a 15x improvement in energy efficiency
while also considerably reducing chip area and improving performance. The proposed work
builds on a patent-protected network architecture and successful chip demonstrations. The work
proposed here focuses on investigating the level of connectivity needed for large-scale designs,
and on supporting mapping tools that make the technology accessible to end users.
`
`Tangible benefits to end users: Over a 15x improvement in energy-efficiency, considerable
`reduction in chip area (3-4x), and considerable improvement in performance (> 2x) compared to
`today’s FPGA chips. Mapping tools will be developed to automatically map algorithms into
`hardware and abstract away hardware-specific details from end users.
`
`Critical technical barriers: Hierarchical interconnect networks have been known to the
`academic and industrial community for a long time, but physical realization of these networks
`precluded their successful deployment. The critical difficulty associated with the hierarchical
networks is routing congestion during chip synthesis. Leopard Logic, Inc. is one example of a
`company that failed to deploy hierarchical interconnect architecture. FPGA startups today, most
`notably Abound Logic, Tier Logic, Blue Chip Designs, and Achronix, provide customized
`solutions for increased logic density or speed, but they still don’t solve the problem of power
`inefficiency associated with FPGA chip interconnects.
`
`Main elements of the proposed technical approach: Our approach is based on alternating
`vertical and horizontal routing. LUTs (or any other processing elements) are partitioned in a 2-D
`floorplan with switch-boxes placed to allow full routability. An N-LUT design requires log2(N)
levels of switch-boxes. A simple example with N = 4 is shown below to illustrate the concept.
`
`
Figure 4: Hierarchical Konda interconnect architecture. O(N·log2(N)) interconnect switches are required
for full connectivity. Routing is fully deterministic.
`
`
`
`
`
`
`8
`
`Alternating Vert./Horiz. routing: N = 4 LUTs example (2 levels)
`
`LUT0
`
`S(0,0)
`
`S(1,0)
`
`LUT2
`
`S(0,2)
`
`S(1,2)
`
`LUT1
`
`S(0,1)
`
`S(1,1)
`
`LUT3
`
`S(0,3)
`
`S(1,3)
`
`1st routing level
`(vertical)
`
`2nd routing level
`(horizontal)
`
`
`
`
In the case shown in Fig. 4, 2 levels of switch-boxes are required for N = 4 LUTs. LUTs with
indices from 0 to N/2 - 1 are placed on the left; the remaining LUTs are placed on the right.
Switch-boxes are placed next to the LUT columns. Routing between elements with adjacent
indices is provided as a vertical connection (1st level routing); routing between elements whose
indices are 2 apart is provided as a horizontal connection (2nd level routing). The routing continues
in this alternating vertical/horizontal fashion for larger N.
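
The level structure described above can be sketched in a few lines. The sketch assumes a textbook butterfly pairing, in which the switch-box at level k links a LUT to the LUT whose index differs in bit k. That is consistent with the N = 4 description (index distance 1 at the first level, distance 2 at the second) but is an illustrative assumption, not a statement of the exact Konda wiring. The helper names `partner`, `orientation`, and `num_levels` are hypothetical.

```python
def num_levels(n: int) -> int:
    """Number of switch-box levels, log2(N), for N a power of two."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    return n.bit_length() - 1


def partner(index: int, level: int) -> int:
    """LUT paired with `index` at 0-based `level` (assumed butterfly pairing)."""
    return index ^ (1 << level)


def orientation(level: int) -> str:
    """Levels alternate: 1st (level 0) vertical, 2nd horizontal, and so on."""
    return "vertical" if level % 2 == 0 else "horizontal"


n = 4
for level in range(num_levels(n)):
    # List each pair once (lower index first).
    pairs = [(i, partner(i, level)) for i in range(n) if i < partner(i, level)]
    print(f"level {level + 1} ({orientation(level)}): {pairs}")
```

For N = 4 this lists the pairs (0, 1) and (2, 3) at the vertical level and (0, 2) and (1, 3) at the horizontal level.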
`
Basis of confidence: The Konda network architecture is a patent-protected technology that is
recognized by many semiconductor companies including Cisco, Xilinx, Altera, and LSI Logic.
To demonstrate the network in hardware, the UCLA team has taped out 3 chips and successfully
implemented variants of the Konda network and also variants of processor-block features.

Chip 1 (90nm, LUT-slice FPGA, concept demo): A 1024-LUT FPGA was made in 90nm 9SF
technology (Dec 2009 run). Our synthesis estimates predict 250 mW of power and a 600 MHz
maximum performance. The chip occupies 2.6 x 2.5 mm^2 in 90nm. Status: lab testing.

Chip 2 (65nm, LUT and DSP slices, small scale): A 256-LUT 240-DSP 8-BRAM FPGA was
made in 65nm technology (June 2010 run). The chip aims to demonstrate an asymmetric network and
heterogeneous computing blocks. The chip occupies 2.1 x 3.1 mm^2 in 65nm. Status: taped out.

Chip 3 (45nm, DSP-slice FPGA, small scale): A 512-DSP slice FPGA was made in IBM 45 nm
SOI technology (June 2010 LEAP run). We expect power consumption below 500 mW. This
design will be applicable to small-scale applications such as micro UAVs. Status: taped out.
`
Nature and description of end results to be delivered to DARPA: We will provide several
deliverables to DARPA and the DOD community as listed below.

• Interconnect architecture and routing tools (software).
• Hardware library in 32nm IBM SOI process (compatible with Cadence software).
• Routing software for the new interconnect architecture and hardware library (software).
• Chip demos of varying scale to demonstrate algorithms of interest to DOD (hardware).
• Tool flow for mapping algorithms onto FPGA chips (software).
• Demonstrations of HPC benchmarks using commercial technology.

The first three items in the list are intermediate steps towards the final hardware demonstration,
which also includes a user-friendly mapping tool interface.
`
`Cost and schedule of the proposed effort: $2,374,111 over 3 years.
`
`
`
`
`
`
`
`
`9
`
`Page 9 of 43 IPR2020-00261
`
`VENKAT KONDA EXHIBIT 2003
`
`
`
`2.4. Technical Approach
`
`Problem Description: FPGAs are used in many signal processing and computing applications.
`DOD mission capability or computing performance can be greatly improved with more energy
`efficient hardware. FPGA based solutions are very attractive due to their flexibility, similar to
`that of CPUs. This flexibility comes at a very high energy cost, as shown in Fig. 1.
`
Looking at the energy efficiency (the number of operations per unit energy, MOPS/mW in Fig. 1) for a
variety of chips from different categories, we observe a 1,000x gap in energy efficiency between
microprocessors and dedicated designs. The root cause of this is architectural. Processors have general
ALU-type processing unit(s) and large amounts of memory to support time-multiplexing of instructions and
data into and out of the ALU(s). Dedicated chips have a variety of processing units, but are very
expensive in low volume and can’t be programmed, so they can’t be used for HPC applications.
General-purpose DSPs are a viable compromise between microprocessors and dedicated designs.
Recently, however, FPGA chips have started to gain attention with their increased computing
capabilities. Look-up-table (LUT) based chips have energy efficiency similar to that of CPUs
and are not very attractive alternatives to CPUs (CPUs are easier to program). Many of today’s
FPGAs have dedicated kernels such as DSP slices, ARM cores, etc. These FPGAs have energy
efficiency similar to DSP chips, but they are still 30-50x worse than dedicated chips. The root
cause of energy inefficiency in these FPGAs is their interconnect architecture.
`
Today’s FPGAs use the 2D-mesh interconnect architecture shown in Fig. 5. Interconnect consists of
switch boxes (shown as cross-points), connection points for the buses, and bus drivers (buffers).
This architecture is not very scalable: it requires O(N^2) interconnect switches for N LUTs. For 1k
processing units, this means 1M switches! To overcome this complexity issue, designers employ
heuristics to reduce the number of switches. One of the ideas is to reduce connectivity around the
edges, as shown in Fig. 5. Another idea is to reduce top-level connectivity in large designs and
utilize local connections. These approaches are heuristic and lead to inefficient utilization of
hardware resources. Readers may have experienced that utilizing more than 80% of FPGA
resources without sacrificing performance is a big challenge in commercial FPGA systems.
`
`
`
Figure 5: 2D-mesh interconnect architecture. O(N^2) interconnect switches are required for full
connectivity. Heuristics are used to reduce the network complexity. These heuristics result in
non-deterministic routing.
`
`
`
`
`10
`
`Programmable
`Switch Box
`
`LUT
`
`LUT
`
`LUT
`
`Output
`Connection
`Box
`
`LUT
`
`LUT
`
`LUT
`
`Input
`Connection
`Box
`
`LUT
`
`LUT
`
`LUT
`
`Requires careful
`heuristics to
`reduce
`crosspoints
`without
`significant loss of
`connectivity
`
`
`
`
`
`
Figure 6: Power breakdown in a Virtex-5 FPGA.

Even after reductions in network complexity, interconnect still occupies over 75% of the area in
today’s FPGAs. For example, a Xilinx Virtex-5 chip has 1.1B transistors; 275M are used for logic, 875M
(80%) are used for interconnect. Most of the FPGA power is dissipated by the interconnect, as shown in
Fig. 6. Further simplifying the interconnect (without sacrificing connectivity) would have multiple
benefits. First, the interconnect power will decrease. Second, due to the reduced interconnect area,
the overall chip area will also decrease. Third, since the chip area is reduced, the length of wires
(and wire capacitance) also decreases. The reduction in wire length and complexity implies a further
reduction in power. It also implies improvements in performance. This excess performance can be traded
for increased energy efficiency, or simply used to improve computational efficiency. Finally, we
benefit from reduced clock power since the clock is now distributed over a smaller area. Therefore,
reduction in interconnect complexity is crucially important for improved computing power and performance.
`
Proposed Network Architecture: In response to the interconnect challenge, we propose to use the
proprietary Konda hierarchical interconnect architecture. This interconnect architecture has
greatly reduced complexity, O(N·log2(N)), and it is based on fully deterministic routing. The
concept of the Konda network is to use simple unidirectional switches and 2x1 multiplexers to
hierarchically connect the computing resources (LUTs, DSP slices, ARM IP, etc.).

Eliminating routing congestion and making the 2D circuit layout possible are the key enabling
features of the Konda network. An example of an N = 8 LUT design with the Konda network is shown
in Fig. 7. For complete routing, log2(8) = 3 levels of switch matrices are needed. First, vertical
tracks connect nearest LUTs, then horizontal tracks are used to connect LUTs at the next level,
and finally vertical tracks are used to connect the last level of switches. This structure has full
connectivity and completely deterministic 2D routing.
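
To illustrate what deterministic routing through log2(N) levels can look like, the sketch below uses classic butterfly "bit-fixing" (destination-tag) routing: at each level the route either stays in place or crosses to the partner whose index differs in that level's bit. This is a well-known textbook technique swapped in for illustration. It is not taken from this proposal and should not be read as the Konda routing algorithm; the function name `route` and the S(level, index) labeling borrowed from Fig. 7 are assumptions.

```python
def route(src: int, dst: int, n: int) -> list[tuple[int, int]]:
    """Return the (level, index) switch-boxes visited from src to dst.

    Assumes N is a power of two and that the level-k switch-box links
    indices differing in bit k (textbook butterfly, illustrative only).
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("src and dst must be in range(n)")
    levels = n.bit_length() - 1
    path = []
    current = src
    for level in range(levels):
        bit = 1 << level
        if (current ^ dst) & bit:
            current ^= bit  # cross over to the partner at this level
        path.append((level, current))
    return path


# Example for N = 8: exactly one switch-box per level, ending at the destination.
print(route(0, 5, 8))
```

Under these assumptions every source/destination pair yields exactly one path of log2(N) hops, which is the sense in which such a route is deterministic.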
`
`
`Figure 7: Konda interconnect network architecture and routing tracks for N = 8 LUTs.
`
`
`
`
`
`
`11
`
`LUT0
`
`S(0,0)
`
`S(1,0)
`
`S(2,0)
`
`LUT2
`
`S(0,2)
`
`S(1,2)
`
`S(2,2)
`
`LUT1
`
`S(0,1)
`
`S(1,1)
`
`S(2,1)
`
`LUT3
`
`S(0,3)
`
`S(1,3)
`
`S(2,3)
`
`LUT4
`
`S(0,4)
`
`S(1,4)
`
`S(2,4)
`
`LUT6
`
`S(0,6)
`
`S(1,6)
`
`S(2,6)
`
`LUT5
`
`S(0,5)
`
`S(1,5)
`
`S(2,5)
`
`LUT7
`
`S(0,7)
`
`S(1,7)
`
`S(2,7)
`
`22% 19%
`
`58%
`
`Interconnect
`
`
`
`
The benefits of this network architecture were evaluated using the Toronto20 benchmarks. The
Toronto20 benchmark suite originated from an FPGA place-and-route challenge that was set up by
University of Toronto researchers [1] to encourage FPGA researchers to benchmark their
software design tool chains on large circuits. These 20 benchmarks are from real designs, and the
placed netlists are provided (for an FPGA logic block consisting of a 4-input look-up table
(LUT) and a flip-flop) to allow experiments with different routing architectures and routing algorithms.
The existing 2D-mesh results were obtained with a routing network that provides partial bandwidth,
i.e., with different switch-box flexibility, connection-box flexibility, and a certain number of
channels. The Konda hierarchical network was also evaluated with partial bandwidth provisioning, and
the results are compared along several dimensions: 1) number of cross-points, 2) route length (delay),
3) performance, 4) speed of routing, and 5) routability. The Konda hierarchical network performed
better in several easily measurable ways, and the results are presented in Tables 2 and 3.
`
`
Table 2: Comparison of 2D-Mesh and Konda interconnect networks using Toronto20 benchmarks. The 2D-Mesh
network simulation uses bidirectional wires; the Konda hierarchical network simulation uses
unidirectional wires.

Name      Size  LUTs  Connections  Mesh max width  Mesh cross-pts  Konda cross-pts  Cross-pts saved  Savings factor
alu4        40  1600         1514               9         177,174           58,737          118,437            3.02
apex2       44  1936         1875              10         237,660           83,180          154,480            2.86
apex4       36  1296         1243              11         175,890           52,482          123,408            3.35
bigkey      54  2916         1694               6         213,876           54,643          159,233            3.91
clma        92  8464         8302              10       1,026,780          359,846          666,934            2.85
des         63  3969         1347               7         338,730           57,044          281,686            5.94
diffeq      39  1521         1497               7         131,082           49,275           81,807            2.66
dsip        54  2916         1309               5         178,230           40,972          137,258            4.35
elliptic    61  3721         3604               9         408,510          129,507          279,003            3.15
ex5p        33  1089         1019              11         148,170           44,609          103,561            3.32
ex1010      68  4624         4588               9         506,790          192,391          314,399            2.63
frisc       60  3600         3556              11         483,186          134,686          348,500            3.59
misex3      38  1444         1383              10         177,900           58,866          119,034            3.02
pdc         68  4624         4535              15         844,650          239,484          605,166            3.52
s298        44  1936         1929               6         142,596           63,956           78,640            2.23
s38417      81  6561         6349               6         478,260          207,457          270,802            2.30
s38584.1    81  6561         6291               7         557,970          184,030          373,940            3.03
seq         42  1764         1717              10         216,780           73,880          142,900            2.93
spla        61  3721         3644              12         544,680          171,676          373,004            3.17
tseng       33  1089          975               6          80,820           31,599           49,221            2.56
`
`
The benefits of the Konda hierarchical network over the 2D-Mesh network using the Toronto20
benchmarks are summarized in Table 4. Various configurations of the Konda hierarchical network
were tested for each benchmark, and the results were verified as follows:
• All 20 benchmarks were routed by our algorithms in our network,
• The number of switches required for routing was reduced significantly,
• Fundamental routing algorithms are proven,
• Speed of routing is proven,
• Benchmarks were profiled for bandwidth requirements.
`
`
`
Table 3: Comparison of 2D-Mesh and Konda interconnect networks using Toronto20 benchmarks. In
addition to considerable savings in the number of cross-points, the Konda network has far better
percentage utilization (lower % is better) than the 2D-Mesh network. The 2D-Mesh network simulation
uses bidirectional wires; the Konda hierarchical network simulation uses unidirectional wires.

Name      Size  Mesh max width  Mesh cross-pts  Konda cross-pts  Savings factor  Cross-pts saved  % used (Konda)  % used (2D-Mesh)
alu4        40               9         177,174           58,737            3.02          118,437             7.9                66
apex2       44              10         237,660           83,180            2.86          154,480             9.3                70
apex4       36              11         175,890           52,482            3.35          123,408             9.4                60
bigkey      54               6         213,876           54,643            3.91          159,233             4.1                51
clma        92              10       1,026,780          359,846            2.85          666,934             8.1                71
des         63               7         338,730           57,044            5.94          281,686             2.9                33
diffeq      39               7         131,082           49,275            2.66           81,807             6.9                75
dsip        54               5         178,230           40,972            4.35          137,258             2.8                46
elliptic    61               9         408,510          129,507            3.15          279,003             7.0                63
ex5p        33              11         148,170           44,609            3.32          103,561             9.5                60
ex1010      68               9         506,790          192,391            2.63          314,399             8.4                75
frisc       60              11         483,186          134,686            3.59          348,500             7.5                56
misex3      38              10         177,900           58,866            3.02          119,034             8.7                66
pdc         68              15         844,650          239,484            3.52          605,166            10.5                56
s298        44               6         142,596           63,956            2.23           78,640             7.1                89
s38417      81               6         478,260          207,457            2.30          270,802             6.0                86
s38584.1    81               7         557,970          184,030            3.03          373,940             5.3                66
seq         42              10         216,780           73,880            2.93          142,900             9.0                68
spla        61              12         544,680          171,676            3.17          373,004             9.3                63
tseng       33               6          80,820           31,599            2.56           49,221             6.7                78
`
`
`
`
`
`
`
Table 4: Summary of the benefits of the Konda hierarchical network. Analytical and empirical results are
shown; the numbers are relative to the 2D-Mesh network.

Criteria                                  Analytical             Empirical
Interconnect area                         At most 1/3            At most 1/3
Connectivity                              2-3x                   2-3x
Interconnect power                        1/5 to 1/10            1/5 to 1/10
Interconnect latency                      1/5 to 1/10            1/5 to 1/10
Speed of compilation                      Significantly faster   Significantly faster
Scalability across process generations    Close to linear        Close to linear
`
`13
`
`Page 13 of 43 IPR2020-00261
`
`VENKAT KONDA EXHIBIT 2003
`
`
`
The conclusions of the simulation of the Toronto20 benchmarks using the Konda hierarchical network
matched the benefits derived in empirical analysis. The generic routing tool created for the Konda
hierarchical network delivers consistent and predictable results. Based on the Toronto20
benchmark results, it can be projected that the gap between ASICs and FPGAs can be closed as
shown in Fig. 8, which would significantly improve performance and energy efficiency of HPC
hardware. In the proposed work, we will explore further technology improvements.
`
`
`
`Figure 8: Konda interconnect network architecture has substantial benefits over today’s FPGAs. It is
`projected to have ASIC-like energy efficiency, power, and performance. Such energy-efficiency levels
`are more than 100x better than general purpose processors.
`
`2.4.1. Network Architecture and Routing Tools
We will next work on homogeneous and heterogeneous networks featuring an arbitrary level of
connectivity. The decision about the connectivity level will be aided by feedback from the
mapping tools (Task 6) in order to minimize hardware utilization.

Task 1) Routing Architectures for Homogeneous Blocks: A routing tool will be developed for
the FPGA with homogeneous blocks. Routing algorithms need to be developed for uni-terminal
nets and multi-terminal nets. The hierarchical routing network may be a symmetric network,
where the number of inputs and the number of outputs are the same. The routing network may
also be an asymmetric network, where the number of inputs and the number of outputs are not the
same. Rearrangeably nonblocking and strictly nonblocking multi-terminal net algorithms will be
implemented to demonstrate the routability and the speed of routing. Routing algorithms need to
be implemented for configurations of the Konda hierarchical network where some of the stages in
the network may be partially connected and the other stages are fully connected. The LUT size
of the network may or may not be a perfect power of two.
`
[Figure 8 chart: comparison of ASIC, Konda FPGA, and prevailing FPGA in terms of area, power, performance, and routability. Annotation: we keep the benefits of ASIC without giving up the benefits of FPGA.]

Task 2) Routing Architectures for Heterogeneous Blocks: We will also explore interconnect
architectures suitable for heterogeneous blocks. The key architectural challenge is to adapt the
Konda hierarchical network for the FPGA architecture. A fully connected hierarchical network is
overkill for FPGA applications. Our goal is to converge on the appropriate design of the routing
network in three phases and also adapt it to many different end-user applications.
We also need to experiment with many varieties of hierarchical network designs, such as the Benes
network, the butterfly fat-tree network, and other optimizations related to properties of FPGA
designs. One aspect is to analyze the locality typical in FPGA designs and to optimize or
adapt the Konda hierarchical network with optimum bandwidth for local connectivity and global
connectivity. The typical LUT size that is known to be optimal in a 2D-Mesh routing network
may not be optimal for the Konda hierarchical network. This is because the Konda hierarchical network
`provides richer connectivity with smaller switch and less number of t