`UCLA
`
`Peer Reviewed
`
`Title:
`Building Efficient, Reconfigurable Hardware using Hierarchical Interconnects
`Author:
`Wang, Chengcheng
`Acceptance Date:
`2013
`Series:
`UCLA Electronic Theses and Dissertations
`Degree:
`Ph.D., Electrical Engineering 0303UCLA
`Advisor(s):
`Markovic, Dejan
`Committee:
`Srivastava, Mani B., Kaiser, William J., Gerla, Mario
`Permalink:
`http://escholarship.org/uc/item/2vt0b5cb
`Abstract:
`
`Copyright Information:
`All rights reserved unless otherwise indicated. Contact the author or original publisher for any
`necessary permissions. eScholarship is not the copyright owner for deposited works. Learn more
`at http://www.escholarship.org/help_copyright.html#reuse
`
`eScholarship provides open access, scholarly publishing
`services to the University of California and delivers a dynamic
`research platform to scholars worldwide.
`
`Page 1 of 179 IPR2020-00262
`
`VENKAT KONDA EXHIBIT 2004
`
`
`
`UNIVERSITY OF CALIFORNIA
`
`Los Angeles
`
`
`
`
`
`
`
`Building Efficient, Reconfigurable Hardware using
`
`Hierarchical Interconnects
`
`
`
`
`
`A thesis submitted in partial satisfaction
`
`of the requirements for the degree
`
`Doctor of Philosophy in Electrical Engineering
`
`
`
`by
`
`Chengcheng Wang
`
`
`
`
`
`2013
`
`
`
`© Copyright by
`
`Chengcheng Wang
`
`2013
`
`
`
`
`ABSTRACT OF THE DISSERTATION
`
`
`
`Building Efficient, Reconfigurable Hardware using Hierarchical
`
`Interconnects
`
`
`
`by
`
`Chengcheng Wang
`
`Doctor of Philosophy in Electrical Engineering
`
`University of California, Los Angeles, 2013
`
`Professor Dejan Marković, Chair
`
`
`
In the semiconductor industry today, ASICs offer 10x-1000x higher energy and area efficiencies than non-dedicated chips, such as programmable DSP processors, field-programmable gate arrays (FPGAs), and microprocessors. Not surprisingly, SoCs today have become an integration of many ASIC blocks, each performing a few dedicated tasks. The growing size of modern SoC chips, accelerated by the increasing demand for functionality, has exposed the major drawback of ASICs: design cost. These large SoCs are re-designed a few times a year to rectify hardware bugs and to support new features. Because ASICs are not reconfigurable, even the smallest hardware change requires a re-design. Additionally, design cost is rising exponentially with every technology generation.

The rising design cost of ASICs has exposed a huge need today: efficiency and flexibility must co-exist. But among flexible hardware candidates, microprocessors and programmable DSP processors are far too slow to meet the throughput requirements of ASICs. FPGAs do come close in terms of performance, but are extremely inefficient due to their high energy and large area overhead. We must bridge the huge gap in efficiency for FPGAs to become a viable contender to ASICs.

The primary culprit for FPGA inefficiency is the interconnect, which accounts for over 75% of area and delay. For over 20 years, the 2D-mesh network has been the backbone of FPGA interconnects, but full connectivity in a 2D-mesh requires O(N²) switches, requiring interconnects to grow much faster than Moore's Law. As a result, various heuristics are used to simplify switch-box arrays at the cost of resource utilization, but the interconnect area of modern FPGAs is still around 80%. This work builds FPGAs using hierarchical interconnects based on Beneš networks, requiring O(N·log N) switches. Although Beneš networks are commonly used in telecommunication, this work is their first silicon realization in an FPGA. To realize a highly efficient interconnect architecture, significant pruning of the network is required. Novel techniques such as fast-path U-turns and unbalanced branching are also implemented. Custom place-and-route software is developed to map benchmark designs onto a variety of interconnect candidates. From the mapping results, the architecture is updated based on network utilization until it converges on an optimized design. The large area of an FPGA chip requires aggressive power gating (PG), but interconnect signals often lack spatial locality, making block-level PG difficult. A novel PG circuit technique is developed to power-gate individual interconnect switches with very small overhead in area and performance. Such a technique requires fundamental circuit changes, even modifying the CMOS inverter.

With innovations in chip architecture, circuit design, and extensive software development, this work has demonstrated 5 user-mappable FPGAs (from 1K–16K LUTs), all with around 50% interconnect area: a 3–4x reduction from commercial FPGAs while preserving connectivity. An energy efficiency of 1.1 GOPS/mW is the highest among reported FPGAs, and is 22x more efficient than the most efficient commercial FPGA today, significantly bridging the efficiency gap between FPGAs and ASICs.
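
The scaling argument in the abstract can be made concrete with a rough sketch. It assumes two idealized reference points that the text does not spell out: a full N-by-N crossbar with N² crosspoints as the fully connected baseline, and an unpruned Beneš network built from 2x2 switches, which has 2·log2(N) − 1 stages of N/2 switches each. The function names are hypothetical, and the pruned networks built in this work will have different counts.

```python
def crossbar_switch_count(n: int) -> int:
    """Crosspoints in a full n-by-n crossbar (assumed O(N^2) baseline)."""
    if n < 1:
        raise ValueError("n must be positive")
    return n * n


def benes_switch_count(n: int) -> int:
    """2x2 switches in an unpruned Benes network with n = 2**k terminals."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    k = n.bit_length() - 1      # k = log2(n), exact for powers of two
    stages = 2 * k - 1          # a Benes network has 2*log2(n) - 1 stages
    return stages * (n // 2)    # each stage holds n/2 two-by-two switches


if __name__ == "__main__":
    # Compare growth for a few terminal counts.
    for n in (8, 64, 2048, 16384):
        xbar = crossbar_switch_count(n)
        benes = benes_switch_count(n)
        print(f"N={n:6d}  crossbar={xbar:12d}  benes={benes:8d}  "
              f"ratio={xbar / benes:8.1f}")
```

For N = 2048 terminals the sketch gives 21,504 switches for the Beneš network against 4,194,304 crosspoints for the crossbar, roughly a 195x gap, and the gap widens as N grows.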
`
`
`
`
`
`
`
`
`The dissertation of Chengcheng Wang is approved.
`
`
`
`Mani B. Srivastava
`
`William J. Kaiser
`
`Mario Gerla
`
`Dejan Marković, Committee Chair
`
`
`
`
`
`
`
`
`University of California, Los Angeles
`
`2013
`
`
`
`
`
`
`
`
`
`
`
`
`TABLE OF CONTENTS
`
`
`
I     Introduction
      1.1  The Drive Towards Efficiency
      1.2  What is Efficiency?
      1.3  The Efficiency Tradeoff
      1.4  Efficiency and Flexibility – Current Solutions
      1.5  Keeping Up with the Standards
      1.6  The Cost of Chip Design
      1.7  Candidates for Reconfigurable Hardware
      1.8  Thesis Outline

II    FPGA Interconnects: the Source of its Inefficiency
      2.1  Brief History of FPGAs
      2.2  Interconnects: the Backbone of an FPGA
      2.3  Scaling a 2D-mesh Network
      2.4  Hierarchical Network – A Scalable Solution
      2.5  Prior Attempts at Hierarchical FPGAs
      2.6  Our Challenges

III   Architecture Design of Hierarchical FPGAs
      3.1  Realizing Large-Scale Beneš Networks
      3.2  Implementing a 2048-LUT FPGA Interconnect
      3.3  Radix-3 Boundary-less Interconnect
      3.4  Fast-Path Interconnect
      3.5  Interconnect Cost vs. Gate Cost
      3.6  Local Interconnect vs. Branch Interconnect
      3.7  Micro-architecture of a Switch Matrix
      3.8  Implementing a 16K-LUT FPGA Interconnect

IV    Interconnect Circuit Design
      4.1  Key Building Blocks in Interconnect Circuits
      4.2  Static Multiplexers and Area-Performance Tradeoff
      4.3  Strategies for Interconnect Buffering
      4.4  Designing Configuration Bit-Cells
      4.5  Power-gating Switch Matrices
      4.6  Power-On Sequence of the Interconnect Network

V     Configurable Logic Block Design and Chip Integration
      5.1  Configurable Logic Blocks for the 2048-LUT FPGA
      5.2  Macro-based Chip Integration for the 2048-LUT FPGA
      5.3  Fine-Grained CLBs for the 16K-LUT FPGA
      5.4  Medium-Grained CLBs for the 16K-LUT FPGA
      5.5  Coarse-Grained CLBs for the 16K-LUT FPGA
      5.6  Macro-based Chip Integration for the 16K-LUT FPGA

VI    Software Flow and Design Mapping
      6.1  Overview of FPGA Software Mapping Flow
      6.2  FPGA Synthesis and LUT Packing
      6.3  FPGA Partitioning and Placement
      6.4  FPGA Routing
      6.5  Bitstream Generation

VII   Test Infrastructure and Measurement Results
      7.1  Matlab Simulink-based Testing Infrastructure
      7.2  Measurement Results of our 2048-LUT FPGA
      7.3  Updated Testing Infrastructure
      7.4  Measurement Results of our 16K-LUT FPGA
      7.5  Chips Summary and Die Photos

VIII  Conclusion and Future Outlook

References
`
`
`
`
`
`
`LIST OF FIGURES
`
`
`
1-1   Energy and area efficiency of the ISSCC/VLSI chips from the past decade
1-2   Block diagram of an NVIDIA Tegra 2 SoC for smartphones
1-3   Evolution of common multimedia and radio standards
1-4   Cost of chip design with every technology node
2-1   Schematic diagram from a Xilinx XC2000 of CLB and interconnects
2-2   Illustration of Stacked-Silicon Technology in Xilinx Virtex-7
2-3   CLB diagram of Xilinx XC3000, XC4000, and XC5200
2-4   CLB diagram of a Xilinx Virtex-6 and 7 series FPGA
2-5   A sample 2D-mesh architecture with I/O connections and switch boxes
2-6   Interconnect architecture of a Xilinx XC4000 FPGA
2-7   Area, delay and power breakdown of a modern 2D-mesh FPGA
2-8   Interconnect resources per CLB for Xilinx Virtex-4 vs. Virtex-5
2-9   A simple 3-stage Beneš network connecting 2 LUTs
2-10  A 5-stage Beneš network merged into a 3-stage using 2-bit 2x2 switches
2-11  A 5-stage Beneš network connecting 8 LUTs
2-12  A 3-stage folded Beneš network connecting 8 LUTs
2-13  A hierarchical Beneš interconnect architecture using alternated x-y routing
2-14  A 5-stage Beneš network merged into a 3-stage using 2-bit 2x2 switches
2-15  The HSRA architecture without and with wiring shortcuts
2-16  The multilevel hierarchical FPGA architecture
3-1   A hierarchical macro-based implementation of a 2D-Beneš network
3-2   Interconnect architecture for our 2048-LUT FPGA, one quadrant shown
3-3   Interconnect architecture for our 2048-LUT FPGA, one quadrant shown
3-4   An original 16-LUT Beneš network, with isomorphic transformation to shorten nearest-neighbor lengths, and with boundary-less radix-3 switches in stage 1
3-5   A 16-LUT Beneš network with boundary-less radix-3 switches in stage 1, and with boundary-less radix-3 switches in stages 1 and 2
3-6   A 16-LUT Beneš network, with boundary-less radix-3 switches in stages 1 and 2, with boundary-less radix-3 switches in stages 1-3, and rearranged for distributed routing
3-8   An original radix-4 16-LUT Beneš network and with boundary-less radix-6 switches in stage 1
3-9   A routing example from LUT 2 to 16 without fast path and with fast path
3-10  A routing example with routing obstruction that still allows a slower fast-path and allowing no fast-path
3-11  Two SM designs with the same gate cost, but a) with more wiring than b)
3-12  An example where a traditional Beneš-based SM experiences local interconnect congestion, whereas an SM design with more local interconnects can utilize the fast path
3-13  A switch-matrix example with more local interconnects than branch interconnects
3-14  Internal mux interconnect of an example radix-3 switch matrix
3-15  1-D SM architecture of the 16K-LUT FPGA, showing the lower 10 SM stages
3-16  2-D SM architecture of the 16K-LUT FPGA, showing the top 5 stages of wiring
4-1   An example switch matrix with its internal circuitry
4-2   A static pass-transistor mux with high VDD for the bit-cells
4-3   A 10-input static pass-transistor mux with 2 critical-path inputs and 8 non-critical-path inputs, requiring 8 bit-cells
4-4   Illustration of input-buffer sharing inside a switch matrix
4-5   Illustration of signal buffer across interconnects of a non-inverting mux, an inverting mux with input inverters, and an inverting mux with output inverters
4-6   Physical design of the configuration bit-cells in 5T SRAM and 6T SRAM
4-7   A 4-input static mux with output inverter and traditional power gating
4-8   A 4-input static mux with output inverter and our proposed power gating
4-9   A 4-input static mux with output inverter and our proposed, tri-state PG
4-10  An example of an unconfigured mux where s0 and s3 are both conducting
4-11  An example of an unconfigured mux where VDDL is '0', no current flows
4-12  An example of an unconfigured mux from Figure 4.8, where VDDL is '0' but PG is '1', causing current flow
4-13  An example of an unconfigured mux from Figure 4.9, where VDDL is '0' but PG is '1', causing current flow
4-14  Example illustration with an updated design that uses VDDL signals, applied on a) the design from Figure 4.8 and b) the design from Figure 4.9
4-15  Example illustration with an updated design that uses VDDH,LATE signals, applied on the design from Figure 4.8 and the design from Figure 4.9
5-1   Resource allocation for interconnects and CLBs
5-2   Block diagram of a Logic CLB and a DSP CLB
5-3   Block diagram of a Logic CLB and a DSP CLB
5-4   The 6 BRAM modes: dual 8-bit read, 16-bit read, 8-bit masked write, 16-bit read, 16-bit masked write, 8-bit write, dual 8-bit read, and 16-bit write, 16-bit read
5-5   Write-logic architecture of the 1Kb reconfigurable dual-port BRAM
5-6   Read-logic architecture of the 1Kb reconfigurable dual-port BRAM
5-7   Design of a bit-cell (BC) array with its bit-line (BL) and word-line (WL) controls
5-8   Layout of a CLB-SM macro with 4 SMs, a BC array, and BL and WL controls
5-9   Top-level layout floorplan of the 2048-LUT FPGA with 512 CLBs
5-10  Area impact of our work: a 1:1 logic-to-interconnect ratio
5-11  Micro-architecture of a Slice L/M CLB with dual-edged clocking
5-12  Slice M microarchitecture of the memory and shift-register logic
5-13  Architecture of a commercial FPGA DSP accelerator
5-14  A commercial dual-port block RAM and its block architecture and datapath
5-15  Core schematic and interconnect architecture of a 16-core DSP processor
5-16  Example communication applications of the DSP processor
5-17  The FFT architecture and radix factorizations of different FFT resolutions
5-18  An example physical design of a SM macro
5-19  Illustration of the hierarchical design methodology used for chip integration
5-20  Layout examples of a) Slice L, b) Slice M, c) DSP, and d) BRAM CLBs and SMs
5-21  Top level CLB and SM architecture, illustrating scan chain for BL and WL
5-22  Area impact of our two FPGAs: a 1:1 logic-to-interconnect ratio
6-1   Software mapping flow of commercial FPGA tools and our flow
6-2   A snapshot of a synthesized netlist using our custom standard-cell library
6-3   The updated software mapping flow for our new FPGA
6-4   Hierarchical partitioning performed on top-level, and one quadrant
6-5   A routing-preference example for a point-to-point connection, LUT to LUT
7-1   An IBOB platform used for Matlab Simulink-based testing infrastructure
7-2   An example IBOB Simulink testbench for chip configuration and testing
7-3   Energy efficiency and power ratio at maximum frequency and minimum energy
7-4   Comparison of energy efficiencies against state-of-the-art reconfigurable hardware
7-5   Xilinx evaluation platforms – Kintex-7 KC705 and Virtex-7 KC707
7-6   Board layout of the chip-on-board testboard with two FMC connectors
7-7   Chip photo and summary of our 2048-LUT FPGA and our 16K-LUT FPGA
8-1   Energy and area efficiency from modern VLSI chips and our chips
8-2   NEM relays as PMOS and NMOS-equivalent devices, a static switch, and a SRAM bit-cell
8-3   A relay-interconnect concept with CMOS logic on the bottom and NEM-interconnects on the top 2 metal layers
`
`
`
`
`
`
`
`
`LIST OF TABLES
`
`
`
I-I      ASIC vs. FPGA – efficiency vs. flexibility
VI-I     Routing time of our original router vs. PathFinder-based router
VII-I    Key measurement results from our 2048-LUT FPGA chip
VII-II   Chip performance comparison against commercial FPGA and ASIC implementations, based on design mapping and conservative timing estimations
VII-III  Coarse-grain accelerator performance against commercial FPGA implementations
`
`
`
`
`
`
`ACKNOWLEDGEMENTS
`
`
`
It has been six years since I started my graduate life at UCLA, and it has certainly been the best six years of my life so far. First of all, I am wholeheartedly thankful to my advisor, Dejan Marković, for his patience, knowledge, and sheer passion for this work. Additionally, I've learned a great deal from him about presentation and communication skills.

I wish to thank Professors Mani Srivastava, William Kaiser, and Mario Gerla for being on my dissertation committee. Their helpful and thoughtful comments are definitely appreciated.

I am also grateful for having the best group members, especially Fang-Li Yuan and Tsung-Han Yu, who have endured endless nights with me during tape-out madness and chip testing. They are the hardest working colleagues I've ever had, and yet they also exert such positive energy. I would also like to acknowledge other lab members, especially Vaibhav Karkare and Yuta Toriyama, for incessant discussions in the cubicles, technical or not. It is really difficult to find a group so diverse, and yet so unified; so technically strong, and yet so pleasant and interesting to be around.

I sincerely thank my parents for their never-ending care, and for always being my closest teachers and counselors. They shaped me into who I am today, and I am forever indebted to them. I also wish to thank my soon-to-be wife, Helen, for her daily support and for being the biggest blessing in my life. Above all, I thank my God and Savior. His patience, grace, and love have been my greatest strength.
`
`
`
`
`
`
`
`
`
`
`CURRICULUM VITAE
`
`
EDUCATION
M.S., Electrical Engineering, University of California, Los Angeles (GPA 3.76), 2007-2009
B.S., Electrical Engineering and Computer Sciences, University of California, Berkeley (GPA 3.8), 2003-2005

EMPLOYMENT HISTORY
Fall 2007 – Spring 2013
Graduate Student Researcher – UCLA Electrical Engineering:
Design and Optimization of Low-Power ASIC and FPGA
• Developing FPGA with a novel interconnect architecture that significantly reduces interconnect area and power by 3-4x compared to existing FPGA architectures. Chips fabricated using IBM90, ST65, TSMC65, IBM45SOI, and TSMC40 processes. Single-handedly performing all aspects of the project, from chip architecture and circuit design to software tool design. The most recent test chip is by far the largest VLSI chip made at UCLA, and is one of the most complex chips made by any academic institution.
• Extensive experience in high-performance, low-power digital circuit design. Developed novel circuits for low-leakage power gating and high-speed interconnect performance. Also developed a more accurate delay model to compensate for the lack of accuracy in logical effort models under low-power optimizations.
Nano-electro-mechanical Relays
• Designing circuits using nano-electro-mechanical relays, which have infinite off-impedance, low on-impedance, and low threshold voltage, making them attractive for digital-circuit, power-gating, and especially FPGA applications.
Word-length Optimization
• Developed and maintained a word-length optimization tool that automatically determines the optimal word-lengths of every logic block given a quantization-error requirement. Very effective for power-performance optimization at the system level, especially when combined with architectural optimizations.

Fall 2005 - Fall 2007
VLSI Design Engineer - Zoran Corporation, Sunnyvale, CA:
• Designed numerous blocks for HDTV applications, including H.264 decoding, HD video capture (component and HDMI), histogram computation, MPEG post-processing, and others.
• Involved with the entire design flow, including RTL design, verification, synthesis, timing closure, place & route, ECO, FIB, driver design, SIMD microcode design, and chip testing (using ATPG and FPGA).

Summer 2005 - Fall 2005
HDTV Intern - Zoran Corporation, Sunnyvale, CA:
• Developed and maintained a co-simulation environment that runs the RTL testbench in parallel with a software model.
• Developed Specman code to randomly generate SIMD instructions for the co-simulation testbench.
• Co-developed a complete set of microcode for H.264 that runs on the SIMD processor, including all modes of intra-/inter-prediction and reconstruction, for Luma and Chroma modes.

HONORS AND AWARDS
Outstanding Dissertation Award, UCLA Electrical Engineering, 2013
Broadcom Fellowship (co-recipient), 2012
Jack Raper Award for Outstanding Technology Directions (co-recipient), ISSCC, 2010
Department Fellowship, UCLA, 2007-2009
High Honors, UC Berkeley, 2003-2005

PUBLICATIONS:
Journals:
C. C. Wang, C. Shi, R. W. Brodersen, and D. Markovic, “An Automated Fixed-Point Optimization Tool in MATLAB XSG/SynDSP Environment,” ISRN Signal Processing, Volume 2011
M. Spencer, F. Chen, C. C. Wang, R. Nathanael, H. Fariborzi, A. Gupta, H. Kam, V. Pott, J. Jeon, T-J. K. Liu, D. Markovic, E. Alon, V. Stojanovic, “Demonstration of Integrated Micro-Electro-Mechanical Relay Circuits for VLSI Applications,” IEEE Journal of Solid-State Circuits, Jan. 2011
C. C. Wang and D. Markovic, “Delay Estimation and Sizing of CMOS Logic Using Logical Effort with Slope Correction,” IEEE Trans. of Circuits and Systems-II, vol. 56, issue 8, pp. 634-638, August 2009

Conferences:
C. C. Wang, F.-L. Yuan, H. Chen, D. Marković, “A 1.1 GOPS/mW FPGA Chip with Hierarchical Interconnect Fabric,” in Proc. Int. Symposium on VLSI Circuits (VLSI'11), pp. 136-137, June 2011
F. Chen, M. Spencer, R. Nathanael, C. C. Wang, H. Fariborzi, A. Gupta, H. Kam, V. Pott, J. Jeon, T-J. K. Liu, D. Markovic, V. Stojanovic, E. Alon, “Demonstration of Integrated Micro-Electro-Mechanical Switch Circuits for VLSI Applications,” in Proc. IEEE Int. Solid-State Circuits Conference (ISSCC'10), pp. 26-27, Feb. 2010

Magazine Articles:
D. Markovic, C. C. Wang, L. Alarcon, T.-T. Liu, and J. Rabaey, “Ultralow-Power Design in Near-Threshold Region,” Proceedings of the IEEE, vol. 98, no. 2, pp. 237-252, Feb. 2010

Book Chapters:
D. Marković and R. W. Brodersen, DSP Architecture Design Essentials (Book Chapter 10 on Word-length Optimization), Springer, July 2012

Patents:
C. C. Wang, D. Markovic, “A Radix-3 Network Architecture For Boundary-Less Hierarchical Interconnects,” March 2013, Application No. 61/786,676
C. C. Wang, D. Markovic, “Fine-Grained Power Gating in FPGA Interconnects,” March 2013, Application No. 61/791,243
`
`
`
`
`
`
`CHAPTER I
`
`Introduction
`
`1.1 The Drive Towards Efficiency
`
For 50 years, Moore's law has driven the rapid scaling in transistor count and feature size. Transistor performance also increased at this pace, essentially doubling its operating frequency with every generation. Few seemed to care that doubling the performance also doubled the power consumption, and by the early 2000s, consumer CPUs had reached over 3 GHz, consuming around 100 watts of power. It then became clear that frequency scaling was reaching the end of the road: power, thermal, and physical constraints became just as important as circuit performance.

“I don't want a kilowatt in my laptop,” said Gordon Moore at the International Solid-State Circuits Conference (ISSCC) Keynote in 2003 [Moore03]. The industry was recognizing a turning point towards efficiency: design tradeoffs that balance performance, power, and area requirements. Often, obtaining efficiency requires fundamental hardware changes. “General-purpose hardware is generally not power-efficient,” said Shekhar Borkar of Intel at the same conference. Over the past 10 years, the industry has shifted from high-frequency, single-core CPUs to a heterogeneous integration of multi-core CPUs and dedicated accelerators.

In 2003, many were concerned about maintaining the 100W power budget. But in just a few years, the industry has commercialized sub-10W processors that fit in thin ultrabooks, and even sub-1W processors for smartphones. Dictated by the changes in the scaling trend, these products are designed with efficiency in mind.
`
`
`
`
`
`
`
`
`1.2 What is Efficiency?
`
`Efficiency, unlike many traditional criteria, requires a combination of metrics. Energy
`
`efficiency (or power efficiency) is arguably the most common efficiency metric. It quantifies
`
`work per unit energy, and is generally measured in billions of operations per second (GOPS) per
`
`milliwatt (GOPS/mW). In VLSI circuits, this translates directly to battery life, thermal limit, and
`
`reliability.
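
As a small worked illustration of the metric, the sketch below converts a throughput and power measurement into GOPS/mW and into energy per operation. The function names and the sample numbers are hypothetical; the only fact used is the unit identity 1 GOPS/mW = 10^12 operations per joule, which is 1 pJ per operation.

```python
def energy_efficiency_gops_per_mw(ops_per_second: float,
                                  power_watts: float) -> float:
    """Energy efficiency in GOPS/mW from raw throughput and power."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    gops = ops_per_second / 1e9        # operations/s -> giga-operations/s
    milliwatts = power_watts * 1e3     # W -> mW
    return gops / milliwatts


def energy_per_op_picojoules(gops_per_mw: float) -> float:
    """Energy per operation in pJ; 1 GOPS/mW corresponds to 1 pJ/op."""
    if gops_per_mw <= 0:
        raise ValueError("efficiency must be positive")
    return 1.0 / gops_per_mw


if __name__ == "__main__":
    # Hypothetical chip: 50 billion operations per second at 100 mW.
    eff = energy_efficiency_gops_per_mw(50e9, 0.1)
    print(f"{eff:.2f} GOPS/mW, {energy_per_op_picojoules(eff):.2f} pJ/op")
```

Under this identity, the 1.1 GOPS/mW reported in the abstract corresponds to roughly 0.91 pJ per operation.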
`
One may wonder, for example, how energy efficiency differs from just low power. The difference is in operations. In an extreme case, any chip can consume 0 watts if it's off! But that is trivial because it is not performing any operations. A similar analogy applies for performance: many smartphone processors today include 4 or 8 cores, but delivering peak performance in all cores will drain the battery very q