Electronic Thesis and Dissertations
UCLA

Peer Reviewed

Title: Building Efficient, Reconfigurable Hardware using Hierarchical Interconnects
Author: Wang, Chengcheng
Acceptance Date: 2013
Series: UCLA Electronic Theses and Dissertations
Degree: Ph.D., Electrical Engineering 0303UCLA
Advisor(s): Markovic, Dejan
Committee: Srivastava, Mani B.; Kaiser, William J.; Gerla, Mario
Permalink: http://escholarship.org/uc/item/2vt0b5cb
Abstract:

Copyright Information:
All rights reserved unless otherwise indicated. Contact the author or original publisher for any necessary permissions. eScholarship is not the copyright owner for deposited works. Learn more at http://www.escholarship.org/help_copyright.html#reuse
`
`Page 1 of 179 IPR2020-00260
`
`VENKAT KONDA EXHIBIT 2004
`
`

`

UNIVERSITY OF CALIFORNIA

Los Angeles

Building Efficient, Reconfigurable Hardware using Hierarchical Interconnects

A thesis submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical Engineering

by

Chengcheng Wang

2013
`
© Copyright by
Chengcheng Wang
2013
`
ABSTRACT OF THE DISSERTATION

Building Efficient, Reconfigurable Hardware using Hierarchical Interconnects

by

Chengcheng Wang

Doctor of Philosophy in Electrical Engineering
University of California, Los Angeles, 2013
Professor Dejan Marković, Chair
`
In the semiconductor industry today, ASICs are able to offer 10x-1000x higher energy and area efficiencies than non-dedicated chips, such as programmable DSP processors, field-programmable gate arrays (FPGAs), and microprocessors. Not surprisingly, SoCs today have become an integration of many ASIC blocks, each performing a few dedicated tasks. The growing size of modern SoC chips, accelerated by the increasing demand for functionality, has exposed the major drawback of ASICs: design cost. These large SoCs are re-designed a few times a year to rectify hardware bugs and to support new features. Because ASICs are not reconfigurable, even the smallest hardware change requires a re-design. Additionally, design cost is rising exponentially with every technology generation.
`
The rising design cost of ASICs has exposed a huge need today: efficiency and flexibility must co-exist. But among flexible hardware candidates, microprocessors and programmable DSP processors are far too slow to meet the throughput requirements of ASICs. FPGAs do come close in terms of performance, but are extremely inefficient due to their high energy and large area overhead. We must bridge the huge gap in efficiency for FPGAs to become a viable contender to ASICs.
`
The primary culprit for FPGA inefficiency is the interconnect, which accounts for over 75% of area and delay. For over 20 years, the 2D-mesh network has been the backbone of FPGA interconnects, but full connectivity in a 2D mesh requires O(N²) switches, requiring interconnects to grow much faster than Moore's Law. As a result, various heuristics are used to simplify switch-box arrays at the cost of resource utilization, yet the interconnect area of a modern FPGA is still around 80%. This work builds FPGAs using hierarchical interconnects based on Beneš networks, requiring O(N·log N) switches. Although Beneš networks are commonly used in telecommunication, this work is their first silicon realization in an FPGA. To realize a highly efficient interconnect architecture, significant pruning of the network is required. Novel techniques such as fast-path U-turns and unbalanced branching are also implemented. Custom place-and-route software is developed to map benchmark designs onto a variety of interconnect candidates. From the mapping results, the architecture is updated based on network utilization until it converges to an optimized design. The large area of an FPGA chip requires aggressive power gating (PG), but interconnect signals often lack spatial locality, making block-level PG difficult. A novel PG circuit technique is developed to power-gate individual interconnect switches with very small overhead in area and performance. This technique requires fundamental circuit changes, even modifying the CMOS inverter.
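As a rough illustration of the scaling argument above, the sketch below compares two textbook switch counts. It assumes a full N-by-N crossbar (N² crosspoints) as a stand-in for full connectivity, and an unpruned Beneš network with N = 2^k terminals, which has 2·log2(N) − 1 stages of N/2 two-by-two switches. The function names are hypothetical and are not taken from this work; the pruned networks built here would use fewer switches than the textbook Beneš count.

```python
def crossbar_crosspoints(n: int) -> int:
    """Crosspoints in a full n-by-n crossbar (assumed model of full connectivity)."""
    return n * n


def benes_switches(n: int) -> int:
    """2x2 switches in a textbook (unpruned) Benes network with n = 2**k terminals."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    k = n.bit_length() - 1      # k = log2(n), computed exactly for powers of two
    stages = 2 * k - 1          # a Benes network has 2*log2(n) - 1 stages
    return stages * (n // 2)    # each stage holds n/2 two-by-two switches


if __name__ == "__main__":
    # Sizes chosen to echo the 16-, 2048-, and 16K-LUT networks named in this work.
    for n in (16, 2048, 16384):
        print(n, crossbar_crosspoints(n), benes_switches(n))
```

The gap widens quickly with N, which is the motivation for moving away from quadratic growth as LUT counts increase.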
`
With innovations in chip architecture, circuit design, and extensive software development, this work has demonstrated 5 user-mappable FPGAs (from 1K to 16K LUTs), all with around 50% interconnect area: a 3–4x reduction from commercial FPGAs while preserving connectivity. An energy efficiency of 1.1 GOPS/mW is the highest among reported FPGAs, and is 22x more efficient than the most efficient commercial FPGA today, significantly bridging the efficiency gap between FPGA and ASIC.
`
The dissertation of Chengcheng Wang is approved.

Mani B. Srivastava
William J. Kaiser
Mario Gerla
Dejan Marković, Committee Chair

University of California, Los Angeles
2013
`
TABLE OF CONTENTS

I Introduction .......... 1
  1.1 The Drive Towards Efficiency .......... 1
  1.2 What is Efficiency? .......... 2
  1.3 The Efficiency Tradeoff .......... 3
  1.4 Efficiency and Flexibility – Current Solutions .......... 5
  1.5 Keeping Up with the Standards .......... 7
  1.6 The Cost of Chip Design .......... 8
  1.7 Candidates for Reconfigurable Hardware .......... 9
  1.8 Thesis Outline .......... 11

II FPGA Interconnects: the Source of its Inefficiency .......... 12
  2.1 Brief History of FPGAs .......... 12
  2.2 Interconnects: the Backbone of an FPGA .......... 18
  2.3 Scaling a 2D-mesh Network .......... 21
  2.4 Hierarchical Network – A Scalable Solution .......... 23
  2.5 Prior Attempts at Hierarchical FPGAs .......... 28
  2.6 Our Challenges .......... 31

III Architecture Design of Hierarchical FPGAs .......... 33
  3.1 Realizing Large-Scale Beneš Networks .......... 33
  3.2 Implementing a 2048-LUT FPGA Interconnect .......... 36
  3.3 Radix-3 Boundary-less Interconnect .......... 38
  3.4 Fast-Path Interconnect .......... 44
  3.5 Interconnect Cost vs. Gate Cost .......... 47
  3.6 Local Interconnect vs. Branch Interconnect .......... 48
  3.7 Micro-architecture of a Switch Matrix .......... 50
  3.8 Implementing a 16K-LUT FPGA Interconnect .......... 52

IV Interconnect Circuit Design .......... 58
  4.1 Key Building Blocks in Interconnect Circuits .......... 58
  4.2 Static Multiplexers and Area-Performance Tradeoff .......... 59
  4.3 Strategies for Interconnect Buffering .......... 63
  4.4 Designing Configuration Bit-Cells .......... 66
  4.5 Power-gating Switch Matrices .......... 68
  4.6 Power-On Sequence of the Interconnect Network .......... 73

V Configurable Logic Block Design and Chip Integration .......... 79
  5.1 Configurable Logic Blocks for the 2048-LUT FPGA .......... 79
  5.2 Macro-based Chip Integration for the 2048-LUT FPGA .......... 86
  5.3 Fine-Grained CLBs for the 16K-LUT FPGA .......... 91
  5.4 Medium-Grained CLBs for the 16K-LUT FPGA .......... 98
  5.5 Coarse-Grained CLBs for the 16K-LUT FPGA .......... 102
  5.6 Macro-based Chip Integration for the 16K-LUT FPGA .......... 105

VI Software Flow and Design Mapping .......... 113
  6.1 Overview of FPGA Software Mapping Flow .......... 113
  6.2 FPGA Synthesis and LUT Packing .......... 116
  6.3 FPGA Partitioning and Placement .......... 120
  6.4 FPGA Routing .......... 124
  6.5 Bitstream Generation .......... 129

VII Test Infrastructure and Measurement Results .......... 130
  7.1 Matlab Simulink-based Testing Infrastructure .......... 130
  7.2 Measurement Results of our 2048-LUT FPGA .......... 134
  7.3 Updated Testing Infrastructure .......... 138
  7.4 Measurement Results of our 16K-LUT FPGA .......... 140
  7.5 Chips Summary and Die Photos .......... 142

VIII Conclusion and Future Outlook .......... 147

References .......... 150
`
LIST OF FIGURES

1-1 Energy and area efficiency of the ISSCC/VLSI chips from the past decade .......... 4
1-2 Block diagram of an NVIDIA Tegra 2 SoC for smartphones .......... 5
1-3 Evolution of common multimedia and radio standards .......... 7
1-4 Cost of chip design with every technology node .......... 8
2-1 Schematic diagram from a Xilinx XC2000 of CLB and interconnects .......... 12
2-2 Illustration of Stacked-Silicon Technology in Xilinx Virtex-7 .......... 14
2-3 CLB diagram of Xilinx XC3000, XC4000, and XC5200 .......... 16
2-4 CLB diagram of a Xilinx Virtex-6 and 7 series FPGA .......... 17
2-5 A sample 2D-mesh architecture with I/O connections and switch boxes .......... 19
2-6 Interconnect architecture of a Xilinx XC4000 FPGA .......... 20
2-7 Area, delay and power breakdown of a modern 2D-mesh FPGA .......... 21
2-8 Interconnect resources per CLB for Xilinx Virtex-4 vs. Virtex-5 .......... 22
2-9 A simple 3-stage Beneš network connecting 2 LUTs .......... 24
2-10 A 5-stage Beneš network merged into a 3-stage using 2-bit 2x2 switches .......... 25
2-11 A 5-stage Beneš network connecting 8 LUTs .......... 26
2-12 A 3-stage folded Beneš network connecting 8 LUTs .......... 27
2-13 A hierarchical Beneš interconnect architecture using alternated x-y routing .......... 28
2-14 A 5-stage Beneš network merged into a 3-stage using 2-bit 2x2 switches .......... 29
2-15 The HSRA architecture without and with wiring shortcuts .......... 30
2-16 The multilevel hierarchical FPGA architecture .......... 31
3-1 A hierarchical macro-based implementation of a 2D-Beneš network .......... 35
3-2 Interconnect architecture for our 2048-LUT FPGA, one quadrant shown .......... 36
3-3 Interconnect architecture for our 2048-LUT FPGA, one quadrant shown .......... 37
3-4 An original 16-LUT Beneš network, with isomorphic transformation to shorten nearest-neighbor lengths, and with boundary-less radix-3 switches in stage 1 .......... 39
3-5 A 16-LUT Beneš network with boundary-less radix-3 switches in stage 1, and with boundary-less radix-3 switches in stages 1 and 2 .......... 40
3-6 A 16-LUT Beneš network, with boundary-less radix-3 switches in stages 1 and 2, with boundary-less radix-3 switches in stage 1-3, and rearranged for distributed routing .......... 43
3-8 An original radix-4 16-LUT Beneš network and with boundary-less radix-6 switches in stage 1 .......... 44
3-9 A routing example from LUT 2 to 16 without fast path and with fast path .......... 45
3-10 A routing example with routing obstruction that still allows a slower fast-path and allowing no fast-path .......... 46
3-11 Two SM designs with same gate cost, but a) with more wiring than b) .......... 47
3-12 An example where traditional-Beneš based SM experiences local interconnect congestion, whereas a SM design with more local interconnects can utilize the fast path .......... 49
3-13 A switch-matrix example with more local interconnects than branch interconnects .......... 50
3-14 Internal mux interconnect of an example radix-3 switch matrix .......... 51
3-15 1-D SM architecture of the 16K-LUT FPGA, showing the lower 10 SM stages .......... 55
3-16 2-D SM architecture of the 16K-LUT FPGA, showing the top 5 stages of wiring .......... 56
4-1 An example switch matrix with its internal circuitry .......... 58
4-2 A static pass-transistor mux with high VDD for the bit-cells .......... 60
4-3 A 10-input static pass-transistor mux with 2 critical-path inputs and 8 non-critical-path inputs, requiring 8 bit-cells .......... 62
4-4 Illustration of input-buffer sharing inside a switch matrix .......... 64
4-5 Illustration of signal buffer across interconnects of a non-inverting mux, an inverting mux with input inverters, and an inverting mux with output inverters .......... 65
4-6 Physical design of the configuration bit-cells in 5T SRAM and 6T SRAM .......... 68
4-7 A 4-input static mux with output inverter and traditional power gating .......... 69
4-8 A 4-input static mux with output inverter and our proposed power gating .......... 71
4-9 A 4-input static mux with output inverter and our proposed, tri-state PG .......... 72
4-10 An example of an unconfigured mux where s0 and s3 are both conducting .......... 73
4-11 An example of an unconfigured mux where VDDL is '0', no current flows .......... 74
4-12 An example of an unconfigured mux from Figure 4.8, where VDDL is '0' but PG is '1', causing current flow .......... 75
4-13 An example of an unconfigured mux from Figure 4.9, where VDDL is '0' but PG is '1', causing current flow .......... 75
4-14 Example illustration with an updated design that uses VDDL signals, applied on a) the design from Figure 4.8 and b) the design from Figure 4.9 .......... 76
4-15 Example illustration with an updated design that uses VDDH,LATE signals, applied on the design from Figure 4.8 and the design from Figure 4.9 .......... 77
5-1 Resource allocation for interconnects and CLBs .......... 80
5-2 Block diagram of a Logic CLB and a DSP CLB .......... 81
5-3 Block diagram of a Logic CLB and a DSP CLB .......... 82
5-4 The 6 BRAM modes: dual 8-bit read, 16-bit read, 8-bit masked write, 16-bit read, 16-bit masked write, 8-bit write, dual 8-bit read, and 16-bit write, 16-bit read .......... 84
5-5 Write-logic architecture of the 1Kb reconfigurable dual-port BRAM .......... 85
5-6 Read-logic architecture of the 1Kb reconfigurable dual-port BRAM .......... 86
5-7 Design of a bit-cell (BC) array with its bit-line (BL) and word-line (WL) controls .......... 87
5-8 Layout of a CLB-SM macro with 4 SMs, a BC array, and BL and WL controls .......... 88
5-9 Top-level layout floorplan of the 2048-LUT FPGA with 512 CLBs .......... 90
5-10 Area impact of our work: a 1:1 logic-to-interconnect ratio .......... 90
5-11 Micro-architecture of a Slice L/M CLB with dual-edged clocking .......... 93
5-12 Slice M microarchitecture of the memory and shift-register logic .......... 97
5-13 Architecture of a commercial FPGA DSP accelerator .......... 99
5-14 A commercial dual-port block RAM and its block architecture and datapath .......... 101
5-15 Core schematic and interconnect architecture of a 16-core DSP processor .......... 102
5-16 Example communication applications of the DSP processor .......... 103
5-17 The FFT architecture and radix factorizations of different FFT resolutions .......... 105
5-18 An example physical design of a SM macro .......... 106
5-19 Illustration of the hierarchical design methodology used for chip integration .......... 108
5-20 Layout examples of a) Slice L, b) Slice M, c) DSP, and d) BRAM CLBs and SMs .......... 109
5-21 Top level CLB and SM architecture, illustrating scan chain for BL and WL .......... 111
5-22 Area impact of our two FPGAs: a 1:1 logic-to-interconnect ratio .......... 112
6-1 Software mapping flow of commercial FPGA tools and our flow .......... 115
6-2 A snapshot of a synthesized netlist using our custom standard-cell library .......... 117
6-3 The updated software mapping flow for our new FPGA .......... 119
6-4 Hierarchical partitioning performed on top-level, and one quadrant .......... 121
6-5 A routing-preference example for a point-to-point connection, LUT to LUT .......... 125
7-1 An IBOB platform used for Matlab Simulink-based testing infrastructure .......... 131
7-2 An example IBOB Simulink testbench for chip configuration and testing .......... 132
7-3 Energy efficiency and power ratio at maximum frequency and minimum energy .......... 136
7-4 Comparison of energy efficiencies against state-of-the-art reconfigurable hardware .......... 137
7-5 Xilinx evaluation platforms – Kintex-7 KC705 and Virtex-7 KC707 .......... 139
7-6 Board layout of the chip-on-board testboard with two FMC connectors .......... 140
7-7 Chip photo and summary of our 2048-LUT FPGA and our 16K-LUT FPGA .......... 143
8-1 Energy and area efficiency from modern VLSI chips and our chips .......... 146
8-2 NEM relays as PMOS and NMOS-equivalent devices, a static switch, and a SRAM bit-cell .......... 148
8-3 A relay-interconnect concept with CMOS logic on the bottom and NEM-interconnects on the top 2 metal layers .......... 149
`
LIST OF TABLES

I-I ASIC vs. FPGA – efficiency vs. flexibility .......... 10
VI-I Routing time of our original router vs. PathFinder-based router .......... 126
VII-I Key measurement results from our 2048-LUT FPGA chip .......... 135
VII-II Chip performance comparison against commercial FPGA and ASIC implementations, based on design mapping and conservative timing estimations .......... 141
VII-III Coarse-grain accelerator performance against commercial FPGA implementations .......... 142
`
ACKNOWLEDGEMENTS

It has been six years since I started my graduate life at UCLA, and it has certainly been the best six years of my life so far. First of all, I am wholeheartedly thankful to my advisor, Dejan Marković, for his patience, knowledge, and sheer passion for this work. Additionally, I've learned a great deal from him about presentation and communication skills.

I wish to thank Professors Mani Srivastava, William Kaiser, and Mario Gerla for being on my dissertation committee. Their helpful and thoughtful comments are definitely appreciated.

I am also grateful for having the best group members, especially Fang-Li Yuan and Tsung-Han Yu, who have endured endless nights with me during tape-out madness and chip testing. They are the hardest-working colleagues I've ever had, and yet they also radiate such positive energy. I would also like to acknowledge other lab members, especially Vaibhav Karkare and Yuta Toriyama, for incessant discussions in the cubicles, technical or not. It is really difficult to find a group so diverse, and yet so unified; so technically strong, and yet so pleasant and interesting to be around.

I sincerely thank my parents for their never-ending care, and for always being my closest teachers and counselors. They shaped me into who I am today, and I am forever indebted to them. I also wish to thank my (soon-to-be-wife) Helen for her daily support and for being the biggest blessing in my life. Above all, I thank my God and Savior. His patience, grace, and love have been my greatest strength.
`
CURRICULUM VITAE

EDUCATION
M.S., Electrical Engineering, University of California, Los Angeles (GPA 3.76), 2007-2009
B.S., Electrical Engineering and Computer Sciences, University of California, Berkeley (GPA 3.8), 2003-2005

EMPLOYMENT HISTORY
Fall 2007 – Spring 2013
Graduate Student Researcher – UCLA Electrical Engineering:
Design and Optimization of Low-Power ASIC and FPGA
• Developing FPGA with a novel interconnect architecture that significantly reduces interconnect area and power by 3-4x compared to existing FPGA architectures. Chips fabricated using IBM90, ST65, TSMC65, IBM45SOI, and TSMC40 processes. Single-handedly performing all aspects of the project, from chip architecture and circuit design to software tool design. The most recent test chip is by far the largest VLSI chip made at UCLA, and is one of the most complex chips made by any academic institution.
• Extensive experience in high-performance, low-power digital circuit design. Developed novel circuits for low-leakage power gating and high-speed interconnect performance. Also developed a more accurate delay model to compensate for the lack of accuracy in logical effort models under low-power optimizations.
Nano-electro-mechanical Relays
• Designing circuits using nano-electro-mechanical relays, which have infinite off-impedance, low on-impedance, and low threshold voltage, making them attractive for digital-circuit, power-gating, and especially FPGA applications.
Word-length Optimization
• Developed and maintained a word-length optimization tool that automatically determines the optimal word-lengths of every logic block given a quantization-error requirement. Very effective for power-performance optimization at the system level, especially when combined with architectural optimizations.

Fall 2005 - Fall 2007
VLSI Design Engineer - Zoran Corporation, Sunnyvale, CA:
Designed numerous blocks for HDTV applications, including H.264 decoding, HD video capture (component and HDMI), histogram computation, MPEG post-processing, and others.
Involved with the entire design flow, including RTL design, verification, synthesis, timing closure, place & route, ECO, FIB, driver design, SIMD microcode design, and chip testing (using ATPG and FPGA).

Summer 2005 - Fall 2005
HDTV Intern - Zoran Corporation, Sunnyvale, CA:
Developed and maintained a co-simulation environment that runs the RTL testbench in parallel with a software model.
Developed Specman code to randomly generate SIMD instructions for the co-simulation testbench.
Co-developed a complete set of microcode for H.264 that runs on the SIMD processor, including all modes of intra-/inter- prediction and reconstruction, for Luma and Chroma modes.

HONORS AND AWARDS
Outstanding Dissertation Award, UCLA Electrical Engineering, 2013
Broadcom Fellowship (co-recipient), 2012
Jack Raper Award for Outstanding Technology Directions (co-recipient), ISSCC, 2010
Department Fellowship, UCLA, 2007-2009
High Honors, UC Berkeley, 2003-2005

PUBLICATIONS:
Journals:
C. C. Wang, C. Shi, R. W. Brodersen, and D. Markovic, "An Automated Fixed-Point Optimization Tool in MATLAB XSG/SynDSP Environment," ISRN Signal Processing, Volume 2011
M. Spencer, F. Chen, C. C. Wang, R. Nathanael, H. Fariborzi, A. Gupta, H. Kam, V. Pott, J. Jeon, T-J. K. Liu, D. Markovic, E. Alon, V. Stojanovic, "Demonstration of Integrated Micro-Electro-Mechanical Relay Circuits for VLSI Applications," IEEE Journal of Solid-State Circuits, Jan. 2011
C. C. Wang and D. Markovic, "Delay Estimation and Sizing of CMOS Logic Using Logical Effort with Slope Correction," IEEE Trans. on Circuits and Systems-II, vol. 56, issue 8, pp. 634-638, August 2009

Conferences:
C. C. Wang, F.-L. Yuan, H. Chen, D. Marković, "A 1.1 GOPS/mW FPGA Chip with Hierarchical Interconnect Fabric," in Proc. Int. Symposium on VLSI Circuits (VLSI'11), pp. 136-137, June 2011
F. Chen, M. Spencer, R. Nathanael, C. C. Wang, H. Fariborzi, A. Gupta, H. Kam, V. Pott, J. Jeon, T-J. K. Liu, D. Markovic, V. Stojanovic, E. Alon, "Demonstration of Integrated Micro-Electro-Mechanical Switch Circuits for VLSI Applications," in Proc. IEEE Int. Solid-State Circuits Conference (ISSCC'10), pp. 26-27, Feb. 2010

Magazine Articles:
D. Markovic, C. C. Wang, L. Alarcon, T.-T. Liu, and J. Rabaey, "Ultralow-Power Design in Near-Threshold Region," Proceedings of the IEEE, vol. 98, no. 2, pp. 237-252, Feb. 2010

Book Chapters:
D. Marković and R. W. Brodersen, DSP Architecture Design Essentials (Book Chapter 10 on Word-length Optimization), Springer, July 2012

Patents:
C. C. Wang, D. Markovic, "A Radix-3 Network Architecture For Boundary-Less Hierarchical Interconnects", March 2013, Application No. 61/786,676
C. C. Wang, D. Markovic, "Fine-Grained Power Gating in FPGA Interconnects", March 2013, Application No. 61/791,243
`
CHAPTER I

Introduction

1.1 The Drive Towards Efficiency

For 50 years, Moore's law has driven the rapid scaling in transistor count and feature size. Transistor performance also increased at this pace, essentially doubling its operating frequency with every generation. Few seemed to care that doubling the performance also doubles the power consumption, and by the early 2000s, consumer CPUs had reached over 3 GHz, consuming around 100 watts of power. It then became clear that frequency scaling was reaching the end of the road: power, thermal, and physical constraints had become just as important as circuit performance.

“I don't want a kilowatt in my laptop,” said Gordon Moore in his keynote at the International Solid-State Circuits Conference (ISSCC) in 2003 [Moore03]. The industry was recognizing a turning point towards efficiency: design tradeoffs that balance performance, power, and area requirements. Oftentimes, obtaining efficiency requires fundamental hardware changes. “General-purpose hardware is generally not power-efficient,” said Shekhar Borkar of Intel at the same conference. Over the past 10 years, the industry has shifted from high-frequency, single-core CPUs to a heterogeneous integration of multi-core CPUs and dedicated accelerators.

In 2003, many were concerned about staying within the 100 W power budget. But in just a few years, the industry has commercialized sub-10 W processors that fit in thin ultrabooks, and even sub-1 W processors for smartphones. Dictated by the changes in the scaling trend, these products are designed with efficiency in mind.
`
1.2 What is Efficiency?

Efficiency, unlike many traditional criteria, requires a combination of metrics. Energy efficiency (or power efficiency) is arguably the most common efficiency metric. It quantifies work per unit energy, and is generally measured in billions of operations per second (GOPS) per milliwatt (GOPS/mW). In VLSI circuits, this translates directly to battery life, thermal limit, and reliability.
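Because GOPS/mW is a rate divided by a power, it reduces to operations per unit energy: 1 GOPS/mW is 10^9 operations per second divided by 10^-3 watts, or 10^12 operations per joule. A minimal sketch of that unit conversion follows; the function names are hypothetical, and the 1.1 GOPS/mW input is the figure quoted in the abstract.

```python
def gops_per_mw(ops_per_second: float, power_watts: float) -> float:
    """Energy efficiency in GOPS/mW from raw throughput (ops/s) and power (W)."""
    gops = ops_per_second / 1e9       # operations per second -> GOPS
    milliwatts = power_watts * 1e3    # watts -> milliwatts
    return gops / milliwatts


def picojoules_per_op(efficiency_gops_per_mw: float) -> float:
    """Energy spent per operation, in picojoules, for a given GOPS/mW figure."""
    ops_per_joule = efficiency_gops_per_mw * 1e12   # 1 GOPS/mW = 1e12 ops/J
    return 1e12 / ops_per_joule                     # J/op -> pJ/op


if __name__ == "__main__":
    # 1.1 GOPS/mW corresponds to a little under 1 pJ per operation.
    print(picojoules_per_op(1.1))
```

Expressing the metric as energy per operation makes it independent of how fast the chip happens to be clocked.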
`
One may wonder, for example, how energy efficiency differs from just low power. The difference is in operations. In an extreme case, any chip can consume 0 watts if it's off! But that is trivial because it is not performing any operations. A similar analogy applies for performance: many smartphone processors today include 4 or 8 cores, but delivering peak performance in all cores will drain the battery very q
