`TK
`789.S
`G36
`B48 '
`1999
`
`Intel Exhibit 1006 Part 1
`Intel v. Iida
`
`
`
`ARCHITECTURE AND CAD
`FOR
`DEEP-SUBMICRON FPGAs
`
`
`
`THE KLUWER INTERNATIONAL SERIES
`THE KLUWER INTERNATIONAL SERIES
`IN ENGINEERING AND COMPUTER SCIENCE
`IN ENGINEERING AND COMPUTER SCIENCE
`
`
`
`
`
`Distributors for North, Central and South America:
`Kluwer Academic Publishers
`101 Philip Drive
`Assinippi Park
`Norwell, Massachusetts 02061 USA
`Telephone (781) 871-6600
`Fax (781) 871-6528
`E-Mail <kluwer@wkap.com>
`
`Distributors for all other countries:
`Kluwer Academic Publishers Group
`Distribution Centre
`Post Office Box 322
`3300 AH Dordrecht, THE NETHERLANDS
`Telephone 31 78 6392 392
`Fax 31 78 6546 474
`E-Mail <orderdept@wkap.nl>
`
`, . Electronic Services <http://www.wkap.nl>
`
`Library of Congress Cataloging-in-Publication Data
`
`A C.I.P. Catalogue record for this book is available
`from the Library of Congress.
`
`Copyright © 1999 by Kluwer Academic Publishers
`
`All rights reserved. No part of this publication may be reproduced, stored in a
`retrieval system or transmitted in any form or by any means, mechanical, photo(cid:173)
`copying, recording, or otherwise, without the prior written permission of the
`publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell,
`Massachusetts 02061
`
`Printed on acid-free paper.
`
`Printed in the United States of America
`
`
`
`To Corinne, William, and Isabel,
`
`Betty and Frank Rose,
`
`Marie and Ron
`
`
`
`
`
`
`
`Table of Contents
`
`CHAPTER 1 Introduction ....... . .......... ... ....... . . . ... .. ..... . . l
`1.1 Overview of FPGAs ....... . ...................................... 2
`1.2 FPGA Architectural Issues . .. .... . . . ..... .. .. . .... ... ... .. . . ... . . . . 3
`1.3 Approach and CAD Tools ............... . .......................... 7
`1.4 Book Organization ........ . ................................... . ... 8
`1.5 Acknowledgments .. ..... .. .. . ......... . .. . ....................... 9
`
`CHAPTER 2 Background and Previous Work
`. ........................ . 11
`2.1 FPGA Architecture ...... .... ..... . .. . .. . . .. .... . . .. .. ........ . .. 11
`2.1.1
`FPGA Programming Technologies . . . .. ..... ...... .. . . . .. . . .. 11
`2.1.2 FPGA Logic Block Architecture ............ .. .............. . 13
`2.1.3 FPGA Routing Architecture . .. .. . . .. . ... ....... ... .... .. ... 14
`. . ...... . .. ... . . . . .. .... .. .... ... ....... 1 ••.•. •• 18
`2.2 CAD for FPGAs
`2.2.1 Synthesis and Logic Block Packing .......................... 19
`2.2.2 Placement ...... . . . . . .. .. . . . .... . . . ............. . .. . . . .. 22
`2.2.3 Routing .. . ...... .. .... . ...... .. ........... .. ...... . . ... 25
`2.2.4 Delay Modelling . . ... .. .. ........ . .. ... ........ .. .... .. .. 31
`2.2.5 Timing Analysis ... . .. ... .. . .. . .. . . .. . .. ... .. ..... . . ... . . 32
`2.3 Summary . .............. .. .... ... ......... . . . .. .... ... . . .... . . . 34
`
`CHAPTER 3 CAD Tools: Packing and Placement ....................... 37
`3.1 Logic Block Packing .. ... .. ...................... . ............... 37
`3.1.1 Cluster-Based Logic Blocks . ............ . .. . .............. . 38
`3.1 .2 Basic Logic Block Packing Algorithm: VPack .. . . ............. 39
`3.1.3 Timing-Driven Logic Block Packing: T-VPack .. .. .. . .... .. .. . 43
`Cluster Seed and Attraction Function . .. ... . . .... . . ........... 43
`Computational Complexity vs. Frequency of Timing Analysis . . ... 47
`
`
`
`
`
`Table of Contents
`
`ix
`
`5.2.1 CAD Flow .................... . . . ....... .. ... .. . ... .... 107
`5.2.2 Area-Efficiency Metric .......... . . . ... . ... . .... . .. . . . ... . 109
`5.2.3 Significant FPGA Architectural Details ...... .- ... ... ......... 109
`5.3 Experimental Results: Directionally-Biased Routing ...... . .... . . . .... 110
`5.3.1 Results for Square Logic Block Arrays ..... . . . .. ... ...... . .. 111
`5.3.2 Results for Rectangular Logic Block Arrays
`. . .. . . .. . .. . . ..... 113
`5.4 Experimental Results: Non-Uniform Routing . . . .... . .. . . . .. . .... . . .. 115
`5.4.1 Center/Edge Capacity Ratio . ... .. .. ......... . ............. 116
`5.4.2 Single Center Channel .. . .. .... .. . .. . .. . . . .. .. . . .. ... . . . . 120
`5.4.3 1/0 Channel .. . . ...... .. ... .. . . . .. . ... . . . .. ......... .. .. 122
`5.5 Summary ..... ........... . .... . ..... . . . ...... .. .. . . . . .... . ... . 126
`
`CHAPTER 6 Cluster-Based Logic Blocks . .... . . . .... . . .... . .... .. .. . . 127
`6.1 Motivation ... .. ... . . . . ... . . . ... .. . .. . .. . ...... . .. . ..... .. . ... . 127
`6.2 Experimental Methodology .............. · . . .... . . . .. .... ... .. . .. . . 130
`6.2.1 CAD Flow . . .. ..... . ...... .... .. ........... ... ... . . .. .. 130
`6.2.2 Area Model . .... .. ... . . ... .... .. ............. . ......... 132
`6.2.3 Delay Model .. . .. . .... . .. . .. .. . ...... . ................. 134
`6.2.4 Architecture Evaluation Metric: Area-Delay Product . . .... . .. . . 136
`. . ......... . . . .. . . .. ...... 136
`6.2.5 FPGA Architectural Assumptions
`Basic Architecture .............. .. ....... . ..... . .. ~ . . . . .. 137
`Routing Architecture ............................... . . . ... 137
`Effect of Cluster Size on the Physical Length of FPGA
`Routing Segments ............ . ..... . ..... . ....... .. . . 138
`Sizing Routing Transistors to Compensate for Different
`Physical Segment Lengths .... .. ........ . ... . .. .. . . . .. . 138
`6.3 Cluster Inputs Required vs. Cluster Size ........... . .... ... . .. ... . ... 139
`6.4 Flexibility of Logic Block to Routing Interconnect vs. Cluster Size ..... . . 141
`6.5 Speed and Area-Efficiency vs. Cluster Size . . . ... . ... . . . . . ........ . .. 142
`6.5.1 Discussion of Delay vs. Cluster Size Results . .. ... .. .. . .. . . ... 145
`6.6 Effect of Cluster Size on Compile Time ..... . ... .... ... . . . . . ... ... . . 147
`6.7 Summary .... ... .. . .. ........... .. . .. . . ...... . ....... . ... . . . .. 148
`
`
`
`
`
`Table of Contents
`
`xi
`
`B.1.1 FPGA Routing Structures ....... . .... .. . .. ..... .. .. ... . . . . 208
`Gate Boosting .... ... .. .................... . ....... .. . ... 208
`Buffers ...... ........................... · ............ . .. 210
`Connection Block to Logic Block Input Pins .. .... . .. ... ...... 212
`B.1.2 Logic Block Structures . ............ . ..... . ............... 214
`B.2 Delay and RC-Equivalent Circuit Extraction ..... .. .. . .. . ..... ...... . 217
`
`APPENDIX C Sizing of Routing Transistors and Metal . .. . ...... . ... . .. 221
`C.1 Sizing Pass Transistor Routing Switches . .... ..... .. ......... . .. .... 221
`C.2 Sizing Tri-State Buffer Routing Switches .. . .... . .................... 225
`C.3 Tri-State Buffers in Output Pin Connection Blocks .................... 228
`C.4 Metal Width and Spacing ............... .. .. . ............. .. . .. .. 228
`
`References .. . .. . . .. ...... .. .................................. . . .. 233
`
`Index .... . .... . . . .. ...... ... . . ......... ... .... . . ... . . . .... . . .. . . . 243
`
`
`
`
`
`
`
`CHAPTER 1
`
`Introduction
`
`Since their introduction in 1984, Field-Programmable Gate Arrays (FPGAs) have
`become one of the most popular implementation media for digital circuits and have
`grown into a $2 billion per year industry. As process geometries have shrunk into the
`deep-submicron region, the logic capacity of FPGAs has greatly increased, making
`FPGAs a viable implementation alternative for larger and larger designs. To make the
`best use of these new deep-submicron processes, one must re-architect one's FPGAs
`and Computer-Aided Design (CAD) tools. This book addresses several key issues in
`the design of high-performance FPGA architectures and CAD tools, wi\h particular
`emphasis on issues that are important for FPGAs implemented in deep-submicron
`processes.
`
`Three factors combine to determine the performance of an FPGA: the quality of the
`CAD tools used to map circuits into the FPGA, the quality of the FPGA architecture,
`and the electrical (i.e. transistor-level) design of the FPGA. In this book, we examine
`all three of these issues in concert, and we believe this is the first systematic study of
`FPGA architecture and CAD to do so.
`
`In order to investigate the quality of different FPGA architectures, one needs CAD
`tools capable of automatically implementing circuits in each FPGA architecture of
`interest. Once a circuit has been implemented in an FPGA architecture, one next
`needs accurate area and delay models to evaluate the quality (speed achieved, area
`required) of the circuit implementation in the FPGA architecture under test. This
`t~e develo ment of a high- ualit and highly
`book therefore has three major foci:
`flexible CAD infrastructure, the creation of accurate area and delay models for
`FPGAs, and the study of several important FPGA architectural issues.
`
`
`
`
`
`1.2 FPGA Architectural Issues
`
`3
`
`Time-to-market is the other key advantage of FPGAs. Full fabrication typically takes
`6 - 8 weeks. If problems are found in the finished chip it must be thrown away, and
`one must wait another 6 - 8 weeks to fabricate a corrected design. FPGAs, on the
`other hand, can be programmed in seconds, and any bugs found once the chip is tested
`in system can be corrected in minutes by reprogramming the FPGA. With today's
`short product cycles, the time-to-market advantage this provides is often compelling.
`
`FPGA programmability carries a price, however. In MPGAs and Standard Cells cir(cid:173)
`cuitry is interconnected with metal wires. FPGAs, in contrast, must connect circuitry
`via programmable switches. These switches have higher resistance t an me al wire-s
`· and add s 1gnificant c-apac1 ance to connections, reducing circuit speed. The switches
`also take up more area than metal wires would, so an FPGA must be considerably
`larger than an MPGA to implement the same circuit. A circuit implemented in an
`FPGA is typically 10 times larger and roughly 3 times slower than the same circuit
`implemented via an MPGA in an equivalent process [1]. The larger size of FPGA cir(cid:173)
`cuitry makes FPGA implementations more expensive than MPGAs for high volume
`designs, and the limited speed of FPGAs precludes their use in very high-speed
`designs. These differences mandate research into new FPGA architectures in order to
`reduce these speed and density penalties. In addition, because the FPGA marketplace
`is highly competitive each FPGA manufacturer is constant! searching for better
`FPGA architectures in 01:_d~r !9_gain a speed and density advantage.
`
`1.2 FPGA Architectural Issues
`
`All FPGAs consist of a large number of programmable logic blocks, each of which
`implement a small amount of digital logic, and programmable routing which allows
`the logic block inputs and outputs to be connected to form larger circuits. In this book
`we investigateJ hree different issues in FPGA architecture:
`concern FPGA rout(cid:173)
`ing design, and one concerns FPGA logic block design.
`~ -=----=---'----------._.::;._ ___ ~~
`
`/ The first issue we investigate is global routing architecture [2, 3]. ,The global routing
`~ architecture of an FPGA s ecifies the relative width of the various _wiring channels
`ll
`.
`,
`- .
`-
`-
`• -
`-
`- - ··
`-
`-···· --
`/ .: within the chip. Figure 1.1 depicts an example global routing architecture in which
`)(
`the channels -near the center of the FPGA are wider than those near the edges. In
`w/
`MPGA and Standard Cell implementations, a custom chip is created for each design,
`so routing channels can easily be made wider in areas of a chip where the demand for
`routing is greater. In FPGAs, however, all routing resources are prefabricated, so the
`width of all the routing channels is set by the FPGA manufacturer. Our goal, then, is
`to find the distribution of routing resources, or tracks, to the various channels that per-
`\ mits their efficient utilization by the largest class of circuits. If there are too few
`
`
`
`
`
`1.2 FPGA Architectural Issues
`
`5
`
`blocks, cluster-based logic blocks can improve FPGA speeds. A~ well, an FPGA in
`which every logic block contains several LUTs will need fewer logic blocks to imple-
`-
`-
`ment a circuit than an FPGA in wpich each logic block is a single LUT. This reduces
`\p,"\
`f the placement and routing problem considerably. Since- pla'cement andJ JI
`the si~
`~ ting is usually tne mo st t~ s u~p -in mapping a d~sign to an FPGA,
`cluster-based logic' blo~ks can· significantly reduce design compile time. As FPGAs
`grow larger, it is important to keep this compile time from growing too large or one of
`the key advantages of FPGAs, rapid prototyping and quick design turns, will be lost.
`
`We will investigate how the number of LUTs per cluster affects FPGA speed, area,
`compile time and other important FPGA architecture parameters, such as the proper
`number of inputs to a logic block and the required flexibility of the general-purpose
`interconnect. The trade-offs among these architectural issues and metrics are quite
`complex. For example, grouping related LUTs together into a single logic block
`reduces the number of connections to be routed between logic blocks, saving routing
`area. Since the eneral-purpose interconnect consume_s most __ of__!h~d_ie ~r~~ in
`this is a significant area savings. On the other hand, in the
`SRAM-based FPGAs 1
`logic c usters we study, the area required by the local routing grows quadratically with
`the number of LUTs in a cluster. For sufficiently large clusters, the area used by this
`local interconnect will exceed the area saved in the general interconnect. Similarly,
`since the local connections within clusters can be made very fast, we expect clusters
`containing a larger number of LUTs to attain higher circuit speeds. At some point,
`however, the speed gained by increasing the cluster size will diminish or be lost
`because increasing the cluster size slows the local cluster routing.
`
`While recent FPGAs from Xilinx [11), Altera [7], Lucent Technologi~s [5], Actel
`[12) and Yantis [13) have all grouped several LUTs together into larger logic blocks,
`there has been little published work investigating the impact of the number of LUTs
`per logic block on FPGA speed or area.
`
`The final FPGA issue we examine is that of detailed routing architecture [14). The
`detailed routing architecture of an FPGA defines how logic block inputs and outputs
`can be interconnected. The style of detailed routing architecture we will investigate is
`the island-style architecture [1] used by Xilinx, Lucent, and Yantis FPGAs [4, 5, 13].
`A simplified detailed-routing architecture is depicted in Figure 1.3. In many ways,
`detailed routing architecture is the key element of an FPGA because:
`
`1. Most of an FPGA's area is devoted to routing;
`2. Most of a circuit's delay is due to routing delays rather than logic block delays
`[15, 1]; and
`
`
`
`
`
`1.3 Approach and CAD Tools
`
`7
`
`completely unroutable. Achieving an appropriate balance between pass transistor
`based and tri-state buffer based switches is also crucial - pass transistors require less
`area and are faster for connections that go through only a few switches, but buffers are
`faster for connections that go through many switches. Similarly, if we have too few
`long routing wires, long connections will have to pass through many programmable
`switches and will be slow. On the other hand, too many long wires will result in short
`connections being made via long wires, wasting area and degrading speed.
`
`( It is difficult to look at one detailed routing architecture parameter in isolation
`· because the architectural parameters interact in complex ways to determine an
`FPGA's speed and area-efficiency.
`In the research presented in this book, we have
`tried to take a holistic approach that looks at all the parameters determining a detailed
`routing together, and assesses both the speed and area of each FPGA we investigate.
`We believe that this is the first study to examine such a broad spectrum of detailed
`routing architectures.
`
`1.3 Approach and CAD Tools
`
`Our approach to exploring FPGA architecture is an experimental one. Benchmark
`circuits are technology mapped, placed and routed into each FPGA architecture of
`interest and the area and/or delay of each circuit in that architecture is measured.
`These measurements allow us to determine the best architectural parameters.
`
`In order to evaluate different architectures, the CAD tools used in the empirical
`approach must be flexible enough to target all the architectures of interest. This pre(cid:173)
`cludes the use of commercial FPGA CAD tools, which target only the FPGAs of a
`specific manufacturer. Accordingly, we have created new CAD tools to pack multiple
`LUTs into a logic cluster, t,9 place logic blocl<:s_witbi!:! an FPGA, and to route the con-
`. nections between logic blocks [ 17, 10]. In addition to making these CAD tools flexi(cid:173)
`ble, we also took cons1dera5Iecaretoensure they produced high quality results, as
`poor CAD tools can lead to inaccurate architecture conclusions. The CAD tools
`understand the special features of each architecture, and attempt to fully optimize for
`each architecture they target. They incorporate several new features and enhance(cid:173)
`ments to existing algorithms, so the CAD algorithms are of interest in their own right.
`As well, these CAD tools are in use in numerous FPGA and CAD studies at the Uni(cid:173)
`versity of Toronto and around the world, and obtain the highest quality results on the
`set of standard benchmarks used to compare academic place and route tools [17].
`
`Ideally, one would lay out each FPGA architecture of interest to obtain exact area
`measurements and highly accurate delay values. In this book we study hundreds of
`
`
`
`8
`
`CHAPTER 1 Introduction
`
`different FPGA architectures, so this is clearly impossible. Accordingly, we have
`developed more abstract area models and delay estimates that do not require layout--of(cid:173)
`the FPGAs, but which are still sufficiently accurate to allow fair comparison of differ(cid:173)
`ent architectures.
`
`1.4 Book Organization
`
`Most of the material in this book is adapted from the Ph.D. thesis of Vaughn Betz
`(18), with important additional information (the T-VPack algorithm described in
`Chapter 3 and most of the experimental results and analysis concerning logic clusters
`presented in Chapter 6) adapted from the M.A.Sc thesis of Alexander Marquardt (19).
`
`The next chapter provides background information and details some of the previous
`work in the relevant areas of both CAD and FPGA architecture. C~apter 3 describes
`the CAD algorithms we use to pack several LUTs together into a logic cluster, and the
`algorithms we use to perform placement of logic blocks in an FPGA. In Chapter 4 we
`-=---=--- ·
`detail the algorithms used in our FPGA router.
`
`Chapters 5 through 7 then employ these CAD tools to investigate the three FPGA
`architectural issues outlined earlier. Each of these three chapters begins by describing
`the CAD flow and area and/or delay models used in the experimental study, and then
`presents the architectural results obtained. Chapter 5 evaluates different global rout(cid:173)
`ing architectures for FPGAs, and determines the most area-efficient architectures. In
`Chapter 6 we investigate cluster-based logic blocks and determine the logic cluster
`In Chapter 7 we ascertain which
`types that maximize area-efficiency and speed.
`detailed routing architectures lead to FPGAs with the best combination of speed and
`area-efficiency. The final chapter summarizes the architectural conclusions and pro(cid:173)
`vides suggestions for future work.
`
`The three appendices provide additional useful information. Appendix A outlines th,~
`graphic visualization features incorporated in our placementaiio rout ing CAD tools,
`and highlights their usefulness in algorithm and architecture research. Appendix B
`describes the transistor-level design of an FPGA and our characterization of a deep(cid:173)
`submicron IC process. Finally, in Appendix C we investigate how routing transistors
`and metal routing wires should be sized in order to create an FPGA with the best
`combination of area-efficiency and speed. The detailed information concerning the
`electrical design of an FPGA provided by Appendices B and C is essential to create
`accurate FPGA area and delay models, and hence to properly compare FPGA archi(cid:173)
`tectures.
`
`
`
`1.5 Acknowledgments
`
`9
`
`1.5 Acknowledgments
`
`The authors are indebted to Carl Harris of Kluwer Academic Publishers for his help
`and advice as we prepared this book. We would like to thank the other members of
`our research group at the University of Toronto -
`Steve Wilton, Mike Hutton,
`Mohammed Khalid, Yaska Sankar, Jordan Swartz, Rob McCready, and Paul Leventis
`-
`for the many valuable suggestions and insightful discussions they have provided
`over the years. Special thanks are due to Jordan Swartz for enhancing the VPR FPGA
`architecture generator to allow targeting of an architecture very similar to that of the
`Xilinx 4000X series FPGAs. Many other students and faculty in the University of
`Toronto Electrical Engineering department lent us their expertise on vexing problems;
`we would particularly like to thank Dave Lewis, Mark Stoodley, Guy Lemieux, Jason
`Anderson, and Jason Podaima. We also appreciate several interesting discussions on
`FPGA CAD with Russ Tessier of MIT.
`
`The authors are grateful for many enlightening discussions with people in the FPGA
`industry. Particular thanks are due to Steve Trimberger, Steve Young, and Bill Carter
`of Xilinx, Frank Heile and David Mendel of Altera, Sinan Kaptanoglu of Actel, and
`Tim Lacey of Cypress. We appreciate the helpful suggestions of Jason Cong, Steve
`Trimberger, Sinan Kaptanoglu, Corinna Lee, Stephen Brown, Dave Lewis, and Derek
`Corneil concerning the Ph.D. thesis on which the majority of this book is based.
`
`We gratefully acknowledge the financial support of our work provided by the Natural
`Sciences and Engineering Research Council of Canada (via an NSERC 1967 Science
`and Engineering Scholarship), the Information Technology Research ~Centre of
`Ontario, the Walter C. Sumner Foundation, a V. L. Henderson Research Fellowship,
`and Xilinx. Finally, we appreciate the 0.35 µm process data provided to us by the
`Canadian Microelectronics Corporation and the Taiwan Semiconductor Manufactur(cid:173)
`ing Company (TSMC).
`
`
`
`- -___,,_---~-!-
`
`.
`
`r
`r
`
`--
`
`- -
`
`\
`
`-
`
`-- - -
`
`-
`
`- -
`
`-
`
`
`
`CHAPTER 2
`
`Background and
`Previous Work
`
`The first half of this chapter provides background information about FPGA architec(cid:173)
`tures, and briefly describes the prior work relevant to this book. The second half of
`the chapter describes the CAD flow used to automatically map circuits into FPGAs
`and determine their speed, a_nd summarizes some of the prior work in the relevant
`areas of CAD.
`
`2.1 FPGA Architecture
`
`All FPGAs are composed of three fundamental components: logic blocks, 1/0 blocks
`and programmable routing, as Figure 2.1 shows [20). A circuit is implemented in an
`FPGA by programming each of the logic blocks to implement a small portion of the
`logic required by the circuit, and each of the 1/0 blocks to act as either an input pad or
`an output pad, as required by the circuit. The programmable routing is configured to
`make all the necessary connections between logic blocks and from logic blocks to I/O
`blocks. The following section briefly describes the basic technologies used to make
`FPGAs programmable. Sections 2.1.2 and 2.1.3 then describe some of the different
`architectures and prior research for FPGA logic blocks and routing, respectively.
`
`2.1.1 FPGA Programming Technologies
`
`There are three different approaches to making FPGAs programmable [1, 20). The
`most popular technology today uses SRAM cells to control pass transistors, multi-
`
`
`
`
`
`
`
`
`
`
`
`
`
`2.1 FPGA Architecture 17
`
`Brown et al investigated different segment length distributions in order to optimize
`the speed and area of an island-style FPGA [36, 37, 38]. They placed circuits, then
`routed them with a two-step (global then detailed) routing procedure, and evaluated
`the resulting speed and area for each FPGA architecture of interest. They found that
`segments of length greater than 3 or 4 were unnecessary, and that segments of length
`1 were also not very important. The best segmentation distributions, in terms of
`speed/area trade-off, had 40 - 70% length 3 wires, 10 - 50% length 2 wires, and 10 -
`20% length 1 wires. Their conclusions may have been affected by several factors,
`however. First, in the two-step routing scheme they use, the global router is unaware
`of the segmentation distribution, and hence is not attempting to optimize for it. This
`may make it difficult to use long wires effectively. Second, their area metric was
`number of tracks per channel required to route, W, while FPGA area is usually deter(cid:173)
`mined by transistor area. Third, their speed metric was average net delay, rather than
`the circuit critical path. Finally, the tools did not use timing analysis to determine
`which connections needed high-speed routing, and which could use area-efficient
`routes.
`
`Chow et al [39] performed a similar study of segmentation distributions in an island(cid:173)
`style FPGA, using similar tools to Brown et al [36]. They did not evaluate circuit
`speed, but assumed that longer wires would lead to faster circuits. Hence, they looked
`for segmentation distributions that incorporated length 2 and 3 wires but required only
`slightly higher W values to route circuits than a segmentation distribution where all
`wires were length 1. They also investigated a new architectural parameter, internal
`population, that is relevant to wires which span more than one logic block. All wires
`are assumed to be able to connect to other wires and to logic blocks at their endpoints.
`A wire segment is internally switch-block populated if it can make connections from
`its interior to other wire segments (via switch blocks); otherwise it is switch-block
`depopulated. Similarly, a wire that can connect to from its middle to logic blocks (via
`connection blocks) is internally connection-block populated. Figure 2.6 illustrates
`the four different internal population possibilities for segments. Notice that a length 3
`wire that is connection-block depopulated can connect to only 2 of the 3 logic blocks
`it passes. A length 3 wire that is switch-block depopulated can connect to wires in
`only two of the four vertical channels that pass by it. Internally depopulated wire seg(cid:173)
`ments are faster than populated wire segments because there are fewer switches, and
`hence less parasitic capacitance, connected to them. Chow et al found that the use of
`switch-block depopulated wires resulted in a significant increase in the required value
`of W to route circuits, while connection-block depopulation caused only a small ( ~0 -
`10%) increase in W. They also found that many architectures with 0 - 60% length 2,
`10 - 40% length 3, and the remainder length 1 wires required only about 10% more
`tracks per channel than an all length 1 architecture.
`
`
`
`
`
`2.2 CAD for FPGAs 19
`
`Circuit description (VHDL, schematic, ... )
`
`Synthesize to logic blocks
`
`Place logic blocks in FPGA
`
`Route connections between logic blocks
`
`FPGA programming file
`
`FIGURE 2. 7 FPGA CAD flow.
`
`Figure 2.7. In the following three sections we will describe the synthesis, placement
`and routing problems and briefly outline some of the prior work in each area. Once a
`circuit is placed and routed, delay models and timing analysis are used to assess its
`speed; we describe these problems in Sections 2.2.4 and 2.2.5.
`
`2.2.1 Synthesis and Logic Block Packing
`
`The first stage of synthesis converts the circuit description, which is usually in a hard(cid:173)
`ware description language or schematic form, into a netlist of basic gates. Then, the
`logic synthesis process converts this netlist of basic gates into a netlist of FPGA logic
`blocks such that the number of logic blocks needed is minimized and/or circuit speed
`is maximized. Logic synthesis is sufficiently complex that it is usually broken into
`two or more subproblems [41]; in this book we will use a three-stage synthesis flow as
`shown in Figure 2.8.
`
`Technology-independent logic optimization removes redundant logic and simplifies
`logic wherever possible [ 41, 42]. The optimized netlist of basic gates is then mapped
`to look-up tables [ 43, 44, 45, 46, 47]. Both of these problems have been extensively
`studied and good algorithms and tools capable of targeting the FPGAs we are inter(cid:173)
`ested in studying are publicly available, so this book does not study these phases of
`the synthesis process.
`
`The third synthesis step in Figure 2.8 is necessary whenever an FPGA logic block
`contains more than a single LUT. Logic block packing groups several LUTs and reg-
`
`
`
`
`
`2.2 CAD for FPGAs 21
`
`the "natural" clustering solution enough to significantly impact the result quality. It is
`difficult for labelling methods to simultaneously satisfy both a constraint on the num(cid:173)
`ber of inputs to a cluster (pin constraint) and the maximum number of items (e.g.
`LUTs) in a cluster (size constraint). Yang and Wong [60] present a labelling algo(cid:173)
`rithm which can simultaneously satisfy both size and pin constraints, but it tends to
`produce large amounts of logic replication and to only partially fill clusters. Hence it
`requires a large number of clusters to cover a circuit.
`
`In [61], Cong et al describe a logic block packing tool capable of targeting several dif(cid:173)
`ferent types of logic blocks. It uses a closeness metric to determine the desirability of
`putting two LUTs into the same logic block, and it respects any constraints restricting
`which LUTs can be packed together. In [61], the packing step is performed via maxi(cid:173)
`mum weighted matching on a graph representing the circuit. This is an O(nm) algo(cid:173)
`rithm, where n is the number of LUTs, and mis the number of edges in the graph [62,
`63]. For the logic clusters we study, the number of edges in this "closeness" graph, m,
`is O(n2), which would lead to an O(n3) algorithm. 1
`
`In [64], Shin and Kim describe a greedy algorithm to transform a netlist of circuit
`blocks into a netlist of clusters, where each cluster contains approximately the same
`number of circuit blocks, and the circuit blocks in each cluster tend to be highly con(cid:173)
`nected. Initially each circuit block is a cluster. The "closeness" of two clusters is a
`function of how many nets the two clusters share and the size of the cluster that would
`result from merging them. More formally:
`
`(C D) _
`. ClusterSize(C, D)
`NumCommonNet(C, D)
`l
`- min(NumPin(C), NumPin(D)) - a AverageClusterSize
`'
`c oseness
`
`(2.1)
`
`where a is a weighting factor controlling how balanced the cluster sizes should be.
`The two clusters with the largest closeness value are merged, and the closeness values
`of all the other clusters are updated to reflect the change. The merging process termi(cid:173)
`nates when the number of clusters in the netlist falls below some user-specified value.
`While this algorithm was not developed for FPGAs, and does not understand the hard
`constraints on cluster size and number of inputs intrinsic to logic clusters, the general
`approach of greedily optimizing a closeness function is highly adaptable, and hence
`of interest.
`
`1. If the edge weights are all integers within a limited magnitude range, algorithms with com(cid:173)
`plexity somewhat worse than O (n.5m), but less than O(nm) exist [63]. With m of O(n2) this
`would still lead to an algorithm with complexity worse than O(n2·5).
`
`
`
`
`
`
`
`
`
`2.2 CAD for FPGAs 25
`
`in a local area that are being used. Since the Triptych FPGA is usually unroutable in
`regions where the logic blocks are completely used, maintaining a porosity of 50% or
`so across the FPGA is essential. In [86, 87], Nag and Rutenbar describe a simulated
`annealing based tool that performs placement and routing simultaneously in one com(cid:173)
`bined step. After any swap of blocks, all the affected nets are re-routed via a maze
`router. To keep the CPU time reasonable, this maze router is constrained to look at
`only a small number of potential routes when the temperature is high; at lower tem(cid:173)
`peratures, the router is allowed to spend more time looking for routes.