Case 5:18-cv-07581-LHK Document 31-2 Filed 03/04/19 Page 1 of 44
`
`
`
`EXHIBIT 2
`
`

`

`
` Volume I
`Technical and Management Proposal
`
`Title: Energy-Efficient Butterfly FPGA Hardware and Programming Tools
`
`A proposal submitted to
`Dr. William Harrod, DARPA/TCTO
`in response to
`
`DARPA-BAA 10-78: Omnipresent High Performance Computing (OHPC)
`
`Technical Area:
`
`Energy Efficient Computing
`
`Lead Organization: University of California, Los Angeles (UCLA)
` Department of Electrical Engineering
`
`Los Angeles, CA 90095-1594
`
`Type of Business: Other Educational
`
`Team Members: Dejan Markovic (PI)
` Venkat Konda (Consultant)
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Technical Point of Contact:
`
`Dr. Dejan Markovic, PI
`UCLA Associate Professor
`Electrical Engineering Department
`56-147D Engineering IV Building
`420 Westwood Plaza
`Los Angeles, CA 90095-1594
`
`Tel: (310) 825-8656
`Fax: (310) 206-8495
`Email: dejan@ee.ucla.edu
`
`Administrative Point of Contact:
`
`Ms. Julia Zhu
`UCLA Senior Grant Analyst
`Office of Contract and Grant Administration
`11000 Kinross Ave, Suite 102
`Los Angeles, CA 90095-1406
`
`
`Tel: (310) 794-0155
`Fax: (310) 943-1658
`Email: ocga5@research.ucla.edu
`
Total funds requested: $2,374,111
Year 1: $789,927
Year 2: $792,100
Year 3: $792,086

Date of proposal: August 4, 2010
`
`

`

`
UNIVERSITY OF CALIFORNIA, LOS ANGELES (UCLA)

BERKELEY • DAVIS • IRVINE • LOS ANGELES • MERCED • RIVERSIDE • SAN DIEGO • SAN FRANCISCO • SANTA BARBARA • SANTA CRUZ

OFFICE OF CONTRACT AND GRANT ADMINISTRATION
BOX 951406
11000 KINROSS, SUITE 102
LOS ANGELES, CALIFORNIA 90095-1406

PHONE: (310) 794-0102
FAX: (310) 794-0631
www.research.ucla.edu/ocga
`
`August 5, 2010
`
`DARPA/TCTO
`ATTN: DARPA-BAA-10-78
`3701 N. Fairfax Drive
`Arlington, VA 22203-1714
`
The Regents of the University of California, Los Angeles, is pleased to submit the following proposal in
response to solicitation DARPA-BAA-10-78.
`
`Title:
`
`“Energy-Efficient Butterfly FPGA Hardware and Programming Tools.”
`
`Requested Period of Performance: September 15, 2010 — September 14, 2013
`
`Amount Requested:
`
`$2,374,111
`
`Principal Investigator:
`
`Dr. Dejan Markovic
Department of Electrical Engineering
`dejan@ee.ucla.edu
`310-825-8656
`
`This application is being submitted in contemplation of an agreement containing mutually agreeable terms and
`conditions applicable to educational institutions conducting unclassified fundamental research.
`
Since UCLA is a public/State institution, open dissemination of research results and information, commitment
to students, accessibility for research purposes, and legal integrity and consistency are part of the University’s
Principles/Policy. The University does not discriminate against or impose restrictions on any individual as a
result of their nationality.

If an award is made, please be advised that if it is funded by budget category 6.3 (Advanced Research) and is
considered non-fundamental research, we will not be able to accept the award due to publication restrictions.
`
Your favorable consideration of this proposal would be appreciated. Technical questions should be directed to
Dr. Markovic. Administrative and contractual questions should be directed to me at (310) 794-0155 or via
email at jzhu@research.ucla.edu.
`
`Sincerely,
`
`
`Julia Zhu
`Senior Grant Analyst
`
`

`

`
`Table of Contents
`
`Executive Summary
`
`Section II – Technical Details
`
`2.1. PowerPoint Summary Chart
`
`2.2. Innovative Claims for the Proposed Research
`
`Problem Description
`
`Research Goals
`
`Expected Impact
`
`2.3. Proposal Roadmap
`
`2.4. Technical Approach
`
`2.4.1. Network Architecture and Routing Tools
`
`2.4.2. Hardware Design
`
`2.4.3. Hardware Mapping
`
`Demonstrations and Technology Transition
`
`2.5. Statement of Work
`
`2.6. Intellectual Property
`
`2.7. Management Plan
`
`2.8. Schedule and Milestones
`
`2.8.1. Schedule Graphic
`
`2.8.2. Detailed Task Description
`
`2.8.3. Project Management and Interaction Plan
`
`2.9. Personnel, Qualifications, and Commitments
`
`2.10. Organizational Conflict of Interest Affirmations and Disclosure
`
`2.11. Human Use
`
`2.12. Animal Use
`
`2.13. Statement of Unique Capability Provided by Government or
` Government-Funded Team Member
`
`2.14. Government or Government-funded Team Member Eligibility
`
`2.15. Facilities
`
`References
`
`BEEcube Support Letter
`
`
`
`
`
`
`
`
`

`

`
`Executive Summary
`
UCLA offers to perform research on a revolutionary new FPGA technology consisting of FPGA
hardware and supporting mapping tools. We will design, fabricate, and test a hierarchical FPGA
interconnect network to demonstrate FPGA technology that is 15x more energy-efficient than
existing FPGAs. The new interconnect architecture allows for a significant reduction in the
number of switch points, buffers, and wire length in comparison to the standard 2D-mesh
architecture used by existing FPGAs. The proposed technology is a radical departure from the
2D-mesh design, which for N logic blocks has O(N^2) complexity and incomplete, heuristic routing.
The proposed technology has only O(N·log2 N) complexity and complete, fully deterministic
routing. It offers significant benefits: 15x lower power, 3x lower area, and 2x higher
performance compared to existing FPGA technology. The new FPGA technology will be
used to demonstrate HPC benchmarks with 15x higher power efficiency for DOD and
commercial users. The PI has established interactions with industrial partners that will lead to the
transition of ideas into the commercial space.
`
`
`
`
`
`
`
`
`
`Section II - Technical Details
`
`2.1. PowerPoint Summary Chart
`
`
`
`
`
`
`
`
`
`
`
Energy-Efficient Butterfly FPGA Hardware and Programming Tools
PI: Dejan Markovic (UCLA)

Technical Challenge and Objective

• Problem: Presently, FPGA chips use 2D-mesh architecture, which is very complex (over 75% of chip area is interconnect). Interconnect results in energy-inefficient computations!
• Our hierarchical butterfly interconnect scheme significantly reduces interconnect complexity.
• Objective: significantly improve energy efficiency of FPGAs.

2D-Mesh Network vs. Butterfly Network: even with just 4 processing units, there is a 3x reduction in the number of connections (24) → (8).

Number of connections in 2D-Mesh and Hierarchical networks

Number of LUTs     N = 1 k     N = 100 k
2D-Mesh            1 M         10 B
Hierarchical       9.97 k      1.66 M
Savings factor     100x        6,200x

Key Innovations

• Ideas verified on chip (3x reduction in interconnect area). 2D-Mesh FPGA vs. our FPGA:
  >2x overall chip area reduction
  >2x performance improvement
  >10x lower power
• Key proposed innovations:
  • Interconnect architecture optimization.
  • Hardware demonstrations of area, power, and performance.
  • Mapping tools for the new FPGA architecture.
  • Demonstrations of HPC benchmarks.

Expected Impact

• New FPGA hardware and mapping tools.
• With significant improvements:
  • Power (15x)
  • Area (3x)
  • Performance (2x)
• To demonstrate HPC benchmarks with 15x higher power efficiency.
• For DOD and commercial apps.
`
[Slide graphics: Konda network and 2-D mesh network diagrams for four CLBs (connections, switches, slices; IN and OUT ports); energy efficiency (MOPS/mW, 0.01 to 1000) versus chip number (1 to 20) for microprocessors, general-purpose DSPs and FPGAs, and dedicated chips, annotated with a 30-50x gap and ~3 orders of magnitude; our FPGA chips on an FMC interface (160 pins/chip) alongside 4 Virtex-6 FPGAs (www.beecube.com).]
`
`2.2. Innovative Claims for the Proposed Research
`
`Problem Description
`
Today’s programmable FPGA devices are costly in terms of size, power, performance, scalability, and
flexibility. All of this is due to a fundamental problem in the 2D-mesh interconnect architecture: it
is large, has long latency, consumes a lot of power, and is not scalable. Interconnect takes
more than 75% of the FPGA chip area. The large number of inactive transistors also results in
significant leakage power (about 50% of the total FPGA power). Due to the inefficient interconnect
architecture, there is a 30-50x energy-efficiency gap between FPGAs and dedicated chips (Fig. 1).
`
`
`
`
`Figure 1: Energy efficiency for various computing architectures: microprocessors, general purpose
`DSPs, FPGAs, and dedicated chips. The study is based on chips from the ISSCC conference (normalized
`to the same technology). FPGAs with DSP cores are 30-50x less energy efficient than dedicated chips.
`
`
`Research Goals
`
We will integrate a hierarchical interconnect network to demonstrate significant improvements in
speed, power, and area compared to existing FPGA technology. The hierarchical interconnect
architecture requires at least 3x fewer active network elements, switch points, and
drivers. This is illustrated in Fig. 2 for a very simple 2x2 example.
`
`
`Figure 2: 2D-Mesh and Konda networks for a design consisting of 4 CLB blocks.
`
`
`
`
`
`
[Figure 1 plot: energy efficiency (MOPS/mW, 0.01 to 1000) versus chip number (1 to 20) for microprocessors, general-purpose DSPs and FPGAs, and dedicated chips; annotations: 30-50x gap, ~3 orders of magnitude.]

[Figure 2 diagrams: 2-D mesh network and Konda (butterfly) network for four CLBs, showing connections, switches, slices, and IN/OUT ports. Annotation: even with just 4 processing units, there is a 3x reduction in the number of connections (24) → (8).]
`
For a larger number N of configurable logic elements, the benefits of the hierarchical network are
even more pronounced (Table 1). The large cost of the 2D-mesh architecture forces designers
to employ heuristics to reduce the number of switch points, which results in insufficient
connectivity. The hierarchical network provides complete and deterministic routing.
`
`
Table 1: Number of connections in 2D-Mesh and Konda networks.

Number of LUTs     2D-Mesh     Konda butterfly     Savings factor
1 k                1 M         9.97 k              100x
100 k              10 B        1.66 M              6,200x
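The scaling behind Table 1 can be sketched in a few lines. The sketch below assumes the connection counts are exactly N^2 for the 2D-mesh and N·log2 N for the hierarchical network, with no constant factors; the function names are illustrative and not part of the proposal. Under that assumption the 1 k row is reproduced closely, while the 100 k ratio comes out near 6,000x, the same order as the tabulated value, which may reflect rounding or constant factors not modeled here.

```python
import math


def mesh_connections(n: int) -> int:
    """Connection count under the assumed O(N^2) model for a 2D-mesh."""
    return n * n


def hierarchical_connections(n: int) -> float:
    """Connection count under the assumed O(N*log2(N)) hierarchical model."""
    return n * math.log2(n)


def savings_factor(n: int) -> float:
    """Ratio of mesh connections to hierarchical connections."""
    return mesh_connections(n) / hierarchical_connections(n)


if __name__ == "__main__":
    for n in (1_000, 100_000):
        print(
            f"N={n}: mesh={mesh_connections(n):,} "
            f"hierarchical={hierarchical_connections(n):,.0f} "
            f"savings={savings_factor(n):,.0f}x"
        )
```

Because the gap grows as N / log2 N, the savings factor keeps increasing with design size under this model.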
`
`
`Expected Impact
`
`The new FPGA platform will provide significant savings in power compared to today’s FPGAs
`as shown in Fig. 3. Our FPGA technology, which includes hardware and supporting mapping
`tools, will provide an estimated 15x power reduction as compared to conventional FPGAs.
`
`
`
`
`Figure 3: Power consumption for a range of applications. New FPGA will provide significant power
`reduction compared to typical Virtex-5 FPGA (normalized to the same technology).
`
`
`We will provide new FPGA technology consisting of hardware and mapping tools. The hardware
`and mapping tools will provide significant impacts: 15x lower power, 3x lower area, 2x higher
`performance compared to existing FPGA technology. The new FPGA technology will be used to
`demonstrate HPC benchmarks with a 15x higher power efficiency for DOD and commercial
`users. Equivalently, our FPGA technology can provide >10x higher throughput for the same
`amount of power (as shown in Fig. 3). This technology will be of use for HPC applications and
`many other DOD applications which use FPGA technology.
`
`
`
`
`
`
`
[Figure 3 plot: power (100 mW to 100 W) versus performance (10 to 1000 GOPS) for Radio DSP, FFT module, micro and small UAVs, Satellite DSP, and MIMO DSP, comparing a typical FPGA (Virtex-5) with the LPE goal. Outcomes (arrows): new applications of FPGA (↓); more capability for existing apps (→).]
`
`2.3. Proposal Roadmap
`
`
Main goals of the proposed research: The main goal of the program is to develop energy-efficient
programmable hardware and supporting software mapping tools. The hardware is based
on a hierarchical interconnect architecture that provides a significant reduction in interconnect
complexity compared to today’s FPGA hardware. With a combination of the new interconnect
architecture and a supporting toolflow, we project over a 15x improvement in energy efficiency
while also considerably reducing chip area and improving performance. The proposed work
builds on patent-protected network architecture and successful chip demonstrations. The work
proposed here focuses on investigating the level of connectivity needed for large-scale designs
and on supporting mapping tools that make the technology accessible to end users.

Tangible benefits to end users: Over a 15x improvement in energy efficiency, a considerable
reduction in chip area (3-4x), and a considerable improvement in performance (> 2x) compared to
today’s FPGA chips. Mapping tools will be developed to automatically map algorithms into
hardware and abstract away hardware-specific details from end users.

Critical technical barriers: Hierarchical interconnect networks have been known to the
academic and industrial community for a long time, but the difficulty of physically realizing these
networks has precluded their successful deployment. The critical difficulty associated with hierarchical
networks is routing congestion during chip synthesis. Leopard Logic, Inc., is one example of a
company that failed to deploy a hierarchical interconnect architecture. FPGA startups today, most
notably Abound Logic, Tier Logic, Blue Chip Designs, and Achronix, provide customized
solutions for increased logic density or speed, but they still don’t solve the problem of power
inefficiency associated with FPGA chip interconnects.

Main elements of the proposed technical approach: Our approach is based on alternating
vertical and horizontal routing. LUTs (or any other processing elements) are partitioned in a 2-D
floorplan with switch-boxes placed to allow full routability. An N-LUT design requires log2(N)
levels of switch-boxes. A simple example with N = 4 is shown below to illustrate the concept.
`
`
`Figure 4: Hierarchical Konda interconnect architecture. O(N∙log2N) interconnect switches are required
`for full connectivity. Routing is fully deterministic.
`
`
`
`
`
`
[Figure 4 diagram: alternating vertical/horizontal routing, N = 4 LUTs example (2 levels). LUT0 to LUT3, each with switch boxes S(0,i) and S(1,i); 1st routing level is vertical, 2nd routing level is horizontal.]
`
In the case shown in Fig. 4, 2 levels of switch-boxes are required for N = 4 LUTs. LUTs with
indices from 0 to N/2 – 1 are placed on the left; the remaining LUTs are placed on the right.
Switch-boxes are placed next to the LUT columns. Routing between elements with adjacent
indices is provided as a vertical connection (1st-level routing); routing between elements 2
indices apart is provided as a horizontal connection (2nd-level routing). The routing continues
in this alternating vertical/horizontal fashion for larger N.
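The level structure described above can be illustrated with a minimal sketch. It assumes a textbook butterfly pairing, in which level k links index i with i XOR 2^k, and assumes that even-numbered levels (counting from 0) are vertical and odd-numbered levels are horizontal. Both are assumptions made for illustration; the actual Konda network wiring is proprietary and may differ. All function names are hypothetical.

```python
def routing_levels(n: int) -> int:
    """Number of switch-box levels, log2(N), for a power-of-two LUT count."""
    if n < 2 or n & (n - 1):
        raise ValueError("this sketch only handles N that is a power of two")
    return n.bit_length() - 1


def level_partner(index: int, level: int) -> int:
    """Index reached from `index` at `level` (assumed butterfly pairing)."""
    return index ^ (1 << level)


def level_direction(level: int) -> str:
    """Assumed track direction: level 0 vertical, level 1 horizontal, and so on."""
    return "vertical" if level % 2 == 0 else "horizontal"


if __name__ == "__main__":
    n = 4
    for level in range(routing_levels(n)):
        pairs = sorted({tuple(sorted((i, level_partner(i, level)))) for i in range(n)})
        print(f"level {level + 1} ({level_direction(level)}): {pairs}")
```

For N = 4 this pairs LUT0 with LUT1 and LUT2 with LUT3 at the first (vertical) level, and LUT0 with LUT2 and LUT1 with LUT3 at the second (horizontal) level, which is consistent with the description of Fig. 4.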
`
Basis of confidence: The Konda network architecture is a patent-protected technology that is
recognized by many semiconductor companies including Cisco, Xilinx, Altera, and LSI Logic.
To demonstrate the network in hardware, the UCLA team has taped out 3 chips and successfully
implemented variants of the Konda network and also variants of processor-block features.

Chip 1 (90nm, LUT-slice FPGA, concept demo): A 1024-LUT FPGA was made in 90nm 9SF
technology (Dec 2009 run). Our synthesis estimates predict 250 mW of power and a 600 MHz
maximum clock frequency. The chip occupies 2.6 x 2.5 mm² in 90nm. Status: lab testing.

Chip 2 (65nm, LUT and DSP slices, small scale): A 256-LUT 240-DSP 8-BRAM FPGA was
made in 65nm technology (June 2010 run). The chip is intended to demonstrate an asymmetric network and
heterogeneous computing blocks. The chip occupies 2.1 x 3.1 mm² in 65nm. Status: taped out.

Chip 3 (45nm, DSP-slice FPGA, small scale): A 512-DSP-slice FPGA was made in IBM 45 nm
SOI technology (June 2010 LEAP run). We expect power consumption below 500 mW. This
design will be applicable to small-scale applications such as micro UAVs. Status: taped out.
`
`Nature and description of end results to be delivered to DARPA: We will provide several
`deliverables to DARPA and DOD community as listed below.
`
`
• Interconnect architecture and routing tools (software).
• Hardware library in 32nm IBM SOI process (compatible with Cadence software).
• Routing software for the new interconnect architecture and hardware library (software).
• Chip demos of varying scale to demonstrate algorithms of interest to DOD (hardware).
• Tool flow for mapping algorithms onto FPGA chips (software).
• Demonstrations of HPC benchmarks using commercial technology.
`
`
The first three items in the list are intermediate steps towards the final hardware demonstration,
which also includes a user-friendly mapping tool interface.
`
`Cost and schedule of the proposed effort: $2,374,111 over 3 years.
`
`
`
`
`
`
`
`
`
`2.4. Technical Approach
`
Problem Description: FPGAs are used in many signal processing and computing applications.
DOD mission capability and computing performance can be greatly improved with more
energy-efficient hardware. FPGA-based solutions are very attractive due to their flexibility, similar to
that of CPUs. This flexibility comes at a very high energy cost, as shown in Fig. 1.

Looking at the energy efficiency (operations per unit of energy, in MOPS/mW) for a variety of chips
from different categories, we observe a 1,000x gap in energy efficiency between microprocessors
and dedicated designs. The root cause of this is architectural. Processors have general ALU-type
processing unit(s) and large amounts of memory to support time-multiplexing of instructions and
data into and out of the ALU(s). Dedicated chips have a variety of processing units, but are very
expensive in low volume and can’t be programmed, so they can’t be used for HPC applications.
General-purpose DSPs are a viable compromise between microprocessors and dedicated designs.
Recently, however, FPGA chips have started to gain attention with their increased computing
capabilities. Look-up-table (LUT) based chips have energy efficiency similar to that of CPUs
and are not very attractive alternatives to CPUs (CPUs are easier to program). Many of today’s
FPGAs have dedicated kernels such as DSP slices, ARM cores, etc. These FPGAs have energy
efficiency similar to DSP chips, but they are still 30-50x worse than dedicated chips. The root
cause of energy inefficiency in these FPGAs is their interconnect architecture.

Today’s FPGAs use the 2D-mesh interconnect architecture shown in Fig. 5. Interconnect consists of
switch boxes (shown as cross-points), connection points for the buses, and bus drivers (buffers).
This architecture is not very scalable: it requires O(N^2) interconnect switches for N LUTs. For 1k
processing units, this means 1M switches! To overcome this complexity issue, designers employ
heuristics to reduce the number of switches. One of the ideas is to reduce connectivity around the
edges, as shown in Fig. 5. Another idea is to reduce top-level connectivity in large designs and
utilize local connections. These approaches are heuristic and lead to inefficient utilization of
hardware resources. Readers may have experienced that utilizing more than 80% of FPGA
resources without sacrificing performance is a big challenge in commercial FPGA systems.
`
`
`
Figure 5: 2D-mesh interconnect architecture. O(N^2) interconnect switches are required for full
connectivity. Heuristics are used to reduce the network complexity. These heuristics result in
non-deterministic routing.
`
`
`
`
[Figure 5 diagram: array of LUTs with programmable switch boxes, input connection boxes, and output connection boxes. Annotation: requires careful heuristics to reduce crosspoints without significant loss of connectivity.]
`
`
`
Even after reductions in network complexity, interconnect still occupies over 75% of the area in
today’s FPGAs. For example, the Xilinx Virtex-5 chip has 1.1B transistors; 275M are used for logic, 875M
(80%) are used for interconnect. Most of the FPGA power is dissipated by the interconnect, as shown
in Fig. 6. Further simplifying the interconnect (without sacrificing connectivity) would have multiple
benefits. First, the interconnect power will decrease. Second, due to the reduced interconnect area,
the overall chip area will also decrease. Third, since the chip area is reduced, the length of wires (and
wire capacitance) also decreases. The reduction in wire length and complexity implies a further
reduction in power. It also implies improvements in performance. This excess performance can be
traded for increased energy efficiency, or simply used to improve computational efficiency. Finally,
we benefit from reduced clock power since the clock is now distributed over a smaller area. Therefore,
a reduction in interconnect complexity is crucially important for improved computing power and performance.

Figure 6: Power breakdown in a Virtex-5 FPGA.
`
Proposed Network Architecture: In response to the interconnect challenge, we propose to use a
proprietary Konda hierarchical interconnect architecture. This interconnect architecture has
greatly reduced complexity, O(N·log2 N), and it is based on fully deterministic routing. The
concept of the Konda network is to use simple unidirectional switches and 2x1 multiplexers to
hierarchically connect the computing resources (LUTs, DSP slices, ARM IP, etc.).

Eliminating routing congestion and making a 2D circuit layout possible are the key enabling
features of the Konda network. An example of an N = 8 LUT design with the Konda network is shown
in Fig. 7. For complete routing, log2 8 = 3 levels of switch matrices are needed. First, vertical
tracks connect nearest LUTs, then horizontal tracks are used to connect LUTs at the next level,
and finally vertical tracks are used to connect the last level of switches. This structure has full
connectivity and completely deterministic 2D routing.
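One way to see why routing in a log2(N)-level structure can be deterministic is the textbook bit-fixing rule for butterfly networks, sketched below for N = 8. This is a generic illustration and not the proprietary Konda routing algorithm. The S(level, index) labels are borrowed from Fig. 7, but their exact meaning in the figure is assumed, and the function name is hypothetical.

```python
def bit_fixing_route(src: int, dst: int, n: int) -> list[tuple[int, int]]:
    """Return the (level, index) switch boxes visited from src to dst.

    At each level the route crosses to the partner index only when the
    corresponding bit of the current index differs from the destination,
    so every (src, dst) pair has exactly one path of log2(N) hops.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("this sketch only handles N that is a power of two")
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("src and dst must be valid LUT indices")

    levels = n.bit_length() - 1  # log2(N) switch levels
    current = src
    path = []
    for level in range(levels):
        bit = 1 << level
        if (current ^ dst) & bit:
            current ^= bit  # cross over to the partner at this level
        path.append((level, current))
    return path


if __name__ == "__main__":
    # Route from LUT0 to LUT5 through three levels of switch boxes.
    print(bit_fixing_route(0, 5, 8))
```

With N = 8 every route has three hops, one per level, and the total number of (level, index) positions is 8 x 3 = 24, matching the N·log2 N count of switch boxes labeled in Fig. 7.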
`
`
`Figure 7: Konda interconnect network architecture and routing tracks for N = 8 LUTs.
`
`
`
`
`
`
[Figure 7 diagram: LUT0 to LUT7, each with switch boxes S(0,i), S(1,i), and S(2,i).]

[Figure 6 chart labels: 22%, 19%, 58%; Interconnect.]
`
The benefits of this network architecture were evaluated using the Toronto20 benchmarks. The
Toronto20 benchmark suite originated from an FPGA place-and-route challenge set up by
University of Toronto researchers [1] to encourage FPGA researchers to benchmark their
software design tool chains on large circuits. These 20 benchmarks are from real designs, and the
placed netlists are provided (for an FPGA logic block consisting of a 4-input look-up table
(LUT) and a flip-flop) to experiment with different routing architectures and routing algorithms.
The existing results were obtained with a 2D-Mesh routing network with partial bandwidth,
i.e., with different switch-box flexibility, connection-box flexibility,
and a certain number of channels. The Konda hierarchical network was also evaluated with partial
bandwidth provisioning, and the results are compared on several dimensions: 1) number
of cross-points, 2) route length (delay), 3) performance, 4) speed of routing, and 5) routability.
The Konda hierarchical network performed better in several easily measurable ways; the results
are presented in Tables 2 and 3.
`
`
Table 2: Comparison of 2D-Mesh and Konda interconnect networks using Toronto20 benchmarks.

Column groups: Toronto20 Benchmark Information (Name, Size, LUTs, Number of connections); 2D-Mesh Network Simulation, bidirectional wires (Max channel width, Total cross-points); Konda Hierarchical Network Simulation, unidirectional wires (Total cross-points, Cross-points saved); Savings factor.

Name       Size   LUTs   Number of     2D-Mesh max     2D-Mesh total   Konda total    Cross-points   Savings
                         connections   channel width   cross-points    cross-points   saved          factor
alu4       40     1600   1514          9               177,174         58,737         118,437        3.02
apex2      44     1936   1875          10              237,660         83,180         154,480        2.86
apex4      36     1296   1243          11              175,890         52,482         123,408        3.35
bigkey     54     2916   1694          6               213,876         54,643         159,233        3.91
clma       92     8464   8302          10              1,026,780       359,846        666,934        2.85
des        63     3969   1347          7               338,730         57,044         281,686        5.94
diffeq     39     1521   1497          7               131,082         49,275         81,807         2.66
dsip       54     2916   1309          5               178,230         40,972         137,258        4.35
elliptic   61     3721   3604          9               408,510         129,507        279,003        3.15
ex5p       33     1089   1019          11              148,170         44,609         103,561        3.32
ex1010     68     4624   4588          9               506,790         192,391        314,399        2.63
frisc      60     3600   3556          11              483,186         134,686        348,500        3.59
misex3     38     1444   1383          10              177,900         58,866         119,034        3.02
pdc        68     4624   4535          15              844,650         239,484        605,166        3.52
s298       44     1936   1929          6               142,596         63,956         78,640         2.23
s38417     81     6561   6349          6               478,260         207,457        270,802        2.30
s38584.1   81     6561   6291          7               557,970         184,030        373,940        3.03
seq        42     1764   1717          10              216,780         73,880         142,900        2.93
spla       61     3721   3644          12              544,680         171,676        373,004        3.17
tseng      33     1089   975           6               80,820          31,599         49,221         2.56
`
`
The benefits of the Konda hierarchical network over the 2D-Mesh network using the Toronto20
benchmarks are summarized in Table 4. Various configurations of the Konda hierarchical network
were tested for each benchmark, and the results were verified as follows:
• All 20 benchmarks were routed by our algorithms in our network,
• The number of switches required to route was reduced significantly,
• Fundamental routing algorithms are proven,
• Speed of routing is proven,
• Benchmarks were profiled for bandwidth requirements.
`
`
`
Table 3: Comparison of 2D-Mesh and Konda interconnect networks using Toronto20 benchmarks. In
addition to considerable savings in the number of cross-points, the Konda network uses a far smaller
percentage of its cross-points (lower is better) than the 2D-Mesh network.

Column groups: Toronto20 Benchmark Information (Name, Size); 2D-Mesh Network Simulation, bidirectional wires (Max channel width, Total cross-points); Konda Hierarchical Network Simulation, unidirectional wires (Total cross-points); Other Key Results of the Simulation (Savings factor, Cross-points saved, % cross-points used).

Name       Size   2D-Mesh max     2D-Mesh total   Konda total    Savings   Cross-points   % Cross-pts    % Cross-pts
                  channel width   cross-points    cross-points   factor    saved          used (Konda)   used (2D-Mesh)
alu4       40     9               177,174         58,737         3.02      118,437        7.9            66
apex2      44     10              237,660         83,180         2.86      154,480        9.3            70
apex4      36     11              175,890         52,482         3.35      123,408        9.4            60
bigkey     54     6               213,876         54,643         3.91      159,233        4.1            51
clma       92     10              1,026,780       359,846        2.85      666,934        8.1            71
des        63     7               338,730         57,044         5.94      281,686        2.9            33
diffeq     39     7               131,082         49,275         2.66      81,807         6.9            75
dsip       54     5               178,230         40,972         4.35      137,258        2.8            46
elliptic   61     9               408,510         129,507        3.15      279,003        7.0            63
ex5p       33     11              148,170         44,609         3.32      103,561        9.5            60
ex1010     68     9               506,790         192,391        2.63      314,399        8.4            75
frisc      60     11              483,186         134,686        3.59      348,500        7.5            56
misex3     38     10              177,900         58,866         3.02      119,034        8.7            66
pdc        68     15              844,650         239,484        3.52      605,166        10.5           56
s298       44     6               142,596         63,956         2.23      78,640         7.1            89
s38417     81     6               478,260         207,457        2.30      270,802        6.0            86
s38584.1   81     7               557,970         184,030        3.03      373,940        5.3            66
seq        42     10              216,780         73,880         2.93      142,900        9.0            68
spla       61     12              544,680         171,676        3.17      373,004        9.3            63
tseng      33     6               80,820          31,599         2.56      49,221         6.7            78
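The derived columns in Tables 2 and 3 appear to follow directly from the two cross-point totals. The sketch below assumes that the savings factor is the 2D-Mesh total divided by the Konda total, and that the cross-points saved are their difference. It checks three rows copied from the tables; the dictionary and function names are illustrative only.

```python
# (2D-Mesh total cross-points, Konda total cross-points) for three benchmarks,
# copied from Tables 2 and 3.
TOTALS = {
    "alu4": (177_174, 58_737),
    "des": (338_730, 57_044),
    "tseng": (80_820, 31_599),
}


def savings(mesh_total: int, konda_total: int) -> tuple[float, int]:
    """Return (savings factor rounded to 2 decimals, cross-points saved)."""
    factor = round(mesh_total / konda_total, 2)
    saved = mesh_total - konda_total
    return factor, saved


if __name__ == "__main__":
    for name, (mesh_total, konda_total) in TOTALS.items():
        factor, saved = savings(mesh_total, konda_total)
        print(f"{name}: savings factor {factor}, cross-points saved {saved:,}")
```

For these rows the computed values agree with the tabulated savings factors (3.02, 5.94, 2.56) and cross-points saved.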
`
`
`
`
`
`
`
`Table 4: Summary of the benefits of Konda hierarchical network. Analytical and empirical results are
`shown, the numbers are relative to 2D-Mesh network.
`
Criteria                                 Analytical            Empirical
Interconnect area                        At most 1/3           At most 1/3
Connectivity                             2-3x                  2-3x
Interconnect power                       1/5 to 1/10           1/5 to 1/10
Interconnect latency                     1/5 to 1/10           1/5 to 1/10
Speed of compilation                     Significantly faster  Significantly faster
Scalability across process generations   Close to linear       Close to linear
`
The conclusions from simulating the Toronto20 benchmarks on the Konda hierarchical network
matched the benefits derived in empirical analysis. The generic routing tool created for the Konda
hierarchical network delivers consistent and predictable results. Based on the Toronto20
benchmark results, it can be projected that the gap between ASICs and FPGAs can be closed as
shown in Fig. 8, which would significantly improve the performance and energy efficiency of HPC
hardware. In the proposed work, we will explore further technology improvements.
`
`
`
`Figure 8: Konda interconnect network architecture has substantial benefits over today’s FPGAs. It is
`projected to have ASIC-like energy efficiency, power, and performance. Such energy-efficiency levels
`are more than 100x better than general purpose processors.
`
`2.4.1. Network Architecture and Routing Tools
`We will next work on homogeneous and heterogeneous networks featuring arbitrary level of
`connectivity. The decision about the connectivity level will be aided with feedback from the
`mapping tools (Task 6) in order to minimize hardware utilization.
`
Task 1) Routing Architectures for Homogeneous Blocks: A routing tool will be developed for
the FPGA with homogeneous blocks. Routing algorithms need to be developed for uni-terminal
nets and multi-terminal nets. The hierarchical routing network may be a symmetric network,
where the number of inputs and the number of outputs are the same, or an asymmetric
network, where the number of inputs and the number of outputs are not the
same. Rearrangeably nonblocking and strictly nonblocking multi-terminal net algorithms will be
implemented to demonstrate the routability and the speed of routing. Routing algorithms also need to
be implemented for configurations of the Konda hierarchical network in which some of the stages
are partially connected and the other stages are fully connected. The LUT size
of the network may or may not be a perfect power of two.
`
[Figure 8 graphic: comparison of ASIC, Konda FPGA, and prevailing FPGA in terms of area, power, performance, and routability. Title: "We keep the benefits of ASIC w/o giving up the benefits of FPGA".]

Task 2) Routing Architectures for Heterogeneous Blocks: We will also explore interconnect
architectures suitable for heterogeneous blocks. The key architectural challenge is to adapt the
Konda hierarchical network for FPGA architecture. A fully connected hierarchical network is
overkill for FPGA applications. Our goal is to converge on the appropriate design of the routing
network in three phases and also adapt it to many different end-user applications.
We also need to experiment with many varieties of hierarchical network designs such as the Benes
network, butterfly fat-tree network and other optimizations related to properties of FP
