[Front cover]

1996

Splash 2
FPGAs in a Custom Computing Machine

Duncan A. Buell
Jeffrey M. Arnold
Walter J. Kleinfelder
Editors

IEEE
`Petitioner Microsoft Corporation - Ex. 1007, Cover 1
`
50 YEARS OF SERVICE: 1946-1996
`
`
`
`
`
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, Cover 2
`
`
`
`Splash 2
`FPGAs in a Custom
`Computing Machine
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, Cover 3
`
`
`
`Xilinx, the Xilinx logo, XC3090, XC4010, XBLOX, XACT, LCA,
`and Configurable Logic Cell are trademarks of Xilinx, Inc.
`CM-2 and Paris are trademarks of Thinking Machines Corporation.
`VMEbus is a trademark of Motorola Corporation.
`SPARC and SPARCstation are trademarks of SPARC International,
`Inc. Products bearing a SPARC trademark are based on an architec-
`ture developed by Sun Microsystems, Inc. SPARCstation is licensed
`exclusively to Sun Microsystems, Inc.
`UNIX is a trademark of UNIX System Laboratories.
`Sun, Sun Workstation, SunOS, and SBus are trademarks of Sun Mi-
`crosystems, Inc.
`
`Design Compiler and FPGA Compiler are trademarks of Synopsys,
`Inc.
`
`DEC is a trademark of Digital Equipment Corporation.
`Verilog is a trademark of Cadence Design Systems, Inc.
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, Cover 4
`
`
`
Splash 2
`FPGAs in a Custom
`Computing Machine
`
`Duncan A. Buell
`
`Jeffrey M. Arnold
`Walter J. Kleinfelder
`Editors
`
`Center for Computing Sciences
`Bowie, Maryland
`
IEEE Computer Society Press
Los Alamitos, California • Washington • Brussels • Tokyo
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, Cover 5
`
`
`
Library of Congress Cataloging-in-Publication Data
`
Buell, Duncan A.
    Splash 2: FPGAs in a custom computing machine / Duncan A. Buell,
  Jeffrey M. Arnold, Walter J. Kleinfelder.
      p.  cm.
    Includes bibliographical references and index.
    ISBN 0-8186-7413-X
    1. Splash 2 (Computer)  2. Electronic digital computers—Design
  and construction.  I. Arnold, Jeffrey M.  II. Kleinfelder, Walter J.
  III. Title.
  QA76.8.S65B84  1996
  004.2'2—dc20                                                95-47397
                                                                  CIP
`
`IEEE Computer Society Press
`
10662 Los Vaqueros Circle
`
`P.O. Box 3014
`Los Alamitos, CA 90720-1264
`
`Copyright © 1996 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted
`to photocopy isolated pages beyond the limits of US copyright law, for private use of their patrons. Other
`copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service
Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
`IEEE Computer Society Press Order Number BP07413
`Library of Congress Number 95-47397
`ISBN 0-8186-7413-X
`
`
`
IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264
Tel: +1-714-821-8380
Fax: +1-714-821-4641
Email: cs.books@computer.org

Additional copies may be ordered from:

IEEE Service Center
445 Hoes Lane
P.O. Box 1331
Piscataway, NJ 08855-1331
Tel: +1-908-981-1393
Fax: +1-908-981-9667
mis.custserv@computer.org
`
IEEE Computer Society
13, Avenue de l'Aquilon
B-1200 Brussels
BELGIUM
Tel: +32-2-770-2198
Fax: +32-2-770-8505
euro.ofc@computer.org
`
IEEE Computer Society
Ooshima Building
2-19-1 Minami-Aoyama
Minato-ku, Tokyo 107
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`Assistant Publisher: Matt Loeb
`
`Technical Editor: Dharma P. Agrawal
`Acquisitions Assistant: Cheryl Smith
`Advertising/Promotions: Tom Fink
`Production Editor: Lisa O’Conner
`
`Cover Image: Dan Kopetzky, Center for Computing Sciences
`
`Printed in the United States of America
`
© The Institute of Electrical and Electronics Engineers, Inc.
`
`Petitioner Microsoft Corporation - Ex. 1007, Cover 6
`
`
`
`
`
Contents

PREFACE                                                            xi

1  CUSTOM COMPUTING MACHINES: AN INTRODUCTION                       1
   1.1  Introduction                                                1
   1.2  The Context for Splash 2                                    4
        1.2.1  FPGAs, 4
        1.2.2  Architecture, 5
        1.2.3  Programming, 6

2  THE ARCHITECTURE OF SPLASH 2                                    10
   2.1  Introduction                                               10
   2.2  The Building Blocks                                        11
   2.3  The System Architecture                                    12
   2.4  Data Paths                                                 13
   2.5  The Splash 2 Array Board                                   16
        2.5.1  The Linear Array, 16
        2.5.2  The Splash 2 Crossbar, 16
        2.5.3  Xilinx Chip X0 and Broadcast Mode, 17
   2.6  The Interface Board and Control Features                   17
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, p. TOC v
`
`
`
3  HARDWARE IMPLEMENTATION                                         19
   3.1  Introduction                                               19
   3.2  Development Board Design                                   21
   3.3  Interface Board Design                                     21
        3.3.1  DMA Channel, 23
        3.3.2  XL and XR, 23
        3.3.3  Interrupts, 24
        3.3.4  Clock, 24
        3.3.5  Programming and Readback, 24
        3.3.6  Miscellaneous Registers, 25
   3.4  Array Board Design                                         25
        3.4.1  Processing Element, 26
        3.4.2  Control Element, 28
        3.4.3  External Memory Access, 28
        3.4.4  Crossbar, 28
        3.4.5  Programming and Readback, 29
        3.4.6  Miscellaneous Registers, 29

4  SPLASH 2: THE EVOLUTION OF A NEW ARCHITECTURE                   31
   4.1  Splash 1                                                   31
   4.2  Splash 2: Thoughts on a Redesign                           34
   4.3  Programming Language                                       36
   4.4  Choice of FPGAs                                            37
   4.5  Choice of Host and Bus                                     38
   4.6  Chip-to-Chip Interconnections                              39
   4.7  Multitasking                                               42
   4.8  Chip X0 and Broadcast                                      43
   4.9  Other Design Decisions                                     43

5  SOFTWARE ARCHITECTURE                                           46
   5.1  Introduction                                               46
   5.2  Background                                                 47
   5.3  VHDL as a Programming Language                             49
        5.3.1  History and Purpose of VHDL, 50
        5.3.2  VHDL Language Features, 50
        5.3.3  Problems with VHDL, 51
   5.4  Software Environment                                       51
`
`Petitioner Microsoft Corporation - Ex. 1007, TOC vi
`
`
`
`
   5.5  Programmer’s View of Splash 2                              55
        5.5.1  Programming Process, 55
        5.5.2  Processing Element View, 56
        5.5.3  Interface Board View, 57
        5.5.4  Host View, 57

6  SOFTWARE IMPLEMENTATION                                         60
   6.1  Introduction                                               60
   6.2  VHDL Environment                                           60
        6.2.1  Splash 2 VHDL Library, 61
        6.2.2  Standard Entity Declarations, 61
        6.2.3  Programming Style, 64
   6.3  Splash 2 Simulator                                         66
        6.3.1  Structure, 66
        6.3.2  Configuring the Simulator, 67
        6.3.3  Input and Output, 68
        6.3.4  Crossbar and Memory Models, 68
        6.3.5  Hardware Constraints, 70
   6.4  Compilation                                                70
        6.4.1  Logic Synthesis, 70
        6.4.2  Physical Mapping, 71
        6.4.3  Debugging Support, 71
   6.5  Runtime System                                             72
        6.5.1  T2: A Symbolic Debugger, 72
        6.5.2  Runtime Library, 73
        6.5.3  Device Driver, 74
   6.6  Diagnostics                                                75

7  A DATA PARALLEL PROGRAMMING MODEL                               77
   7.1  Introduction                                               78
   7.2  Data-parallel Bit C                                        80
        7.2.1  dbC Overview, 80
        7.2.2  dbC Example, 81
   7.3  Compiling from dbC to Splash 2                             82
        7.3.1  Creating a Specialized SIMD Engine, 83
        7.3.2  Generic SIMD Code, 84
        7.3.3  Generating VHDL, 84
   7.4  Global Operations                                          88
        7.4.1  Nearest-Neighbor Communication, 88
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, TOC vii
`
`
`
        7.4.2  Reduction Operations, 89
        7.4.3  Host/Processor Communication, 91
   7.5  Optimization: Macro Instructions                           92
        7.5.1  Creating a Macro Instruction, 93
        7.5.2  Discussion, 94
   7.6  Evaluation: Genetic Database Search                        94
   7.7  Conclusions and Future Work                                95

8  SEARCHING GENETIC DATABASES ON SPLASH 2                         97
   8.1  Introduction                                               97
        8.1.1  Edit Distance, 98
        8.1.2  Dynamic Programming Algorithm, 98
   8.2  Systolic Sequence Comparison                              100
        8.2.1  Bidirectional Array, 100
        8.2.2  Unidirectional Array, 103
   8.3  Implementation                                            104
        8.3.1  Modular Encoding, 105
        8.3.2  Configurable Parameters, 106
        8.3.3  Bidirectional Array, 107
        8.3.4  Unidirectional Array, 107
   8.4  Benchmarks                                                107
   8.5  Discussion                                                108
   8.6  Conclusions                                               108

9  TEXT SEARCHING ON SPLASH 2                                     110
   9.1  Introduction                                              110
   9.2  The Text Searching Algorithm                              111
   9.3  Description of the Single-Byte Splash Program             113
   9.4  Timings, Discussion                                       114
   9.5  Outline of the 16-bit Approach                            115
   9.6  Conclusions                                               116

10  FINGERPRINT MATCHING ON SPLASH 2                              117
   10.1  Introduction                                             117
   10.2  Background                                               120
`
`Petitioner Microsoft Corporation - Ex. 1007, TOC viii
`
`
`
`
        10.2.1  Pattern Recognition Systems, 121
        10.2.2  Terminology, 122
        10.2.3  Stages in AFIS, 123
   10.3  Splash 2 Architecture and Programming Models             125
   10.4  Fingerprint Matching Algorithm                           125
        10.4.1  Minutia Matching, 126
        10.4.2  Matching Algorithm, 127
   10.5  Parallel Matching Algorithm                              128
        10.5.1  Preprocessing on the Host, 131
        10.5.2  Computations on Splash, 132
        10.5.3  VHDL Specification for X0, 133
   10.6  Simulation and Synthesis Results                         134
   10.7  Execution on Splash 2                                    137
        10.7.1  User Interface, 137
        10.7.2  Performance Analysis, 137
   10.8  Conclusions                                              139

11  HIGH-SPEED IMAGE PROCESSING WITH SPLASH 2                     141
   11.1  Introduction                                             141
   11.2  The VTSplash System                                      142
   11.3  Image Processing Terminology and Architectural Issues    143
   11.4  Case Study: Median Filtering                             150
   11.5  Case Study: Image Pyramid Generation                     153
        11.5.1  Gaussian Pyramid, 154
        11.5.2  Two Implementations for Gaussian Pyramid on Splash 2, 155
        11.5.3  The Hybrid Pipeline Gaussian Pyramid Structure, 157
        11.5.4  The Laplacian Pyramid, 157
        11.5.5  Implementation of the Laplacian Pyramid on Splash 2, 159
   11.6  Performance                                              159
   11.7  Summary                                                  163

12  THE PROMISE AND THE PROBLEMS                                  166
   12.1  Some Bottom-Line Conclusions                             166
        12.1.1  High Bandwidth I/O Is a Must, 166
        12.1.2  Memory Is a Must, 167
        12.1.3  Programming Is Possible, and Becoming More So, 168
        12.1.4  The Programming Environment Is Crucial, 168
   12.2  To Where from Here?                                      169
`
`
`
`
Petitioner Microsoft Corporation - Ex. 1007, TOC ix
`
`
`
   12.3  If Not Splash 3, Then What?                              171
         12.3.1  Architectures, 172
         12.3.2  Custom Processors, 173
         12.3.3  Languages, 174
   12.4  The “Killer” Applications                                177
   12.5  Final Words                                              178

A  SPLASH 2 DEVELOPMENT—THE PROJECT MANAGER’S SUMMARY             179

B  AN EXAMPLE APPLICATION                                         186

REFERENCES                                                        190
`
`Petitioner Microsoft Corporation - Ex. 1007, TOC x
`
`
`
`
`
`Preface
`
`The Splash 2 project began at the Supercomputing Research Center1 in September of
`1991 and ended, with success, in the spring of 1994. Splash 2 is an attached processor
`system using Xilinx XC4010 FPGAs as its processing elements. As such, it is a
`custom computing machine. That is to say that much of what would be the instruction
`set architecture of the processing elements is not specified except in the details of the
`program developed by the application programmer. Although a higher-level block
`diagram of processing elements, memories, interconnect, and dataflow exists in the
`hardware structure of Splash 2, the details of the instruction set architecture level of
`the machine will vary from one application to the next.
`The Splash 2 project is significant for two reasons. First, Splash 2 is part of a
complete computer system that has achieved supercomputer-like performance on a
`number of different applications. By “complete computer system” we mean that SRC
`created or caused to be created an extant hardware system (replicated a dozen times),
`a complete programming and runtime environment, and a collection of application
`programs that exercised the unique hardware.
`The second significant aspect of Splash 2 is that we were fortunate enough to
`be able to build a large system, capable of performing real computations on real
`problems. One common complaint about performance results on novel computing
`machines or environments is that results on small problems cannot be accurately
`extrapolated to large problems. The Splash 2 system that was designed and built is
a full-sized machine and does not suffer from this defect.
`
`To get to the point: why a book?
`
`1Renamed the Center for Computing Sciences in May 1995, but referred to throughout this book
`as SRC.
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, Preface xi
`
`
`
`
`
`This is a novel computing machine. In order to understand what happens when
`the application programmer is permitted, indeed required, to design the processor
`architecture of the machine that will execute his program, it is necessary to see the
`system as a whole. Programmability and problems to be run on this machine both
`had major influences on its architecture, just as its architecture and its unique nature
`influence the kinds of problems one would expect to program for this machine and
`the nature of that programming. And standing between the user and the machine,
`as the old joke goes, is a new kind of programming environment and an evolving
`understanding of how this environment must allow the use of the hardware, without
`forcing every programmer to be a hardware-design engineer.
At the first IEEE Workshop on FPGAs for Custom Computing Machines, one
`of the industrial attendees remarked that, although nearly everyone would agree, as
`part of a coffee-room discussion or the like, that it would be interesting to think about
`building a “computer” using FPGAs, no one in management (except perhaps at SRC
`and DEC PRL) had put up the commitment necessary in time and money to do a
`real test of the idea. It was then remarked that, given the nature of the marketplace
`and of engineering management, these first attempts had to be successful in order to
`open the door for future work. We feel we have been successful, and we offer in
`this book an in-depth look at one of the handful of data points in the design space
`of this new kind of machine.
`
We would hope that this book has a broad appeal and is readable with understanding
by nearly all computer scientists and computer engineers. To the hardware
designer, perhaps we can offer a new look at programming applications on a
moderately general FPGA-based computing machine instead of designing circuits for a
specific board incorporating FPGAs. The engineering world has viewed FPGAs, to a
`great extent, as the next logical step in a continuum of electronic devices; we offer,
`we feel, the option of viewing them much more broadly than that. To the computer
architect we offer a variant hardware platform and an indication of how that general
`platform can be used. Much of computer architecture is a compromise between the
`functionality desired and the limits of what can be built given existing technology; we
`offer the use of a new technology that can offer, to a limited extent now, and could
`offer much more generally later, greatly increased functionality. For some of those
`who have hard problems in computation, we offer much of the power of special-
`purpose hardware without the inflexibility and uncertain delivery times of hardware.
The long-term task is not to map a high-level language to a particular architecture
`or range of architectures, but in some sense to create for each application program
`a suitable architecture to which the high-level language will be mapped. And to the
language designers and compiler writers we offer a world to conquer. We have
presented one imperfect but usable approach to programming such a computing machine,
`and we trust that others interested in the critical problem of making these machines
`programmable can learn both from what we did right and what we did wrong.
`Chapter 1 discusses the general concept of Custom Computing Machines, of
`which Splash 2 is one example. Chapters 2 and 3 describe at a high level and
`then in some detail the hardware architecture of Splash 2. Chapter 4 covers the
`design considerations and decisions in arriving at the second-generation Splash 2
`architecture. We present this chapter at the end of the section on hardware, on the
`basis that it is easier to understand variations in a design when those variations are
`compared against something concrete.
`
`Petitioner Microsoft Corporation - Ex. 1007, Preface xii
`
`
`
`
`
`Chapters 5 and 6 describe, also first at a higher and then at a lower level, the
`software architecture of Splash 2. All the application programs in the latter chapters
`were done using VHDL as an applications programming language and these tools
`in support. The main goal of the Splash 2 project was to show that software, as
`described in Chapters 5 and 6, could make a computer using FPGAs as its processing
`elements into something that reasonable people would call “programmable,” and, in
`that sense, the heart of the Splash 2 project is in Chapters 5 and 6. Throughout
`the life of Splash 2, however, there has been an alternative view of programming.
`This view is reflected in Chapter 7 on the Splash 2 version of the programming
`language dbC. The approach taken in dbC is to permit the programmer to use a
`version of C as the programming language. It is the compiler which then becomes
`responsible for, in essence, recognizing the instruction set architecture necessary to
`execute the program and then creating in the FPGA the requisite registers, logic units,
`and control.
`
`Chapters 8 through 11 then describe four different applications programmed
`to conclusion on Splash 2. The first of these—the sequence comparison problem—
`was the driving application, in the sense that funding for Splash 2 was based on its
`perceived ability to perform this computation. This and the text processing application
`were done at SRC.
`
`The Splash 2 project team was fortunate in that SRC’s parent company, the
`Institute for Defense Analyses, issued two contracts, to Virginia Polytechnic Institute
`and to Michigan State University, for applications work on Splash 2 in image pro-
`cessing and fingerprint identification. Both applications seemed good matches with
`the Splash 2 architecture but lay outside the normal realm of SRC’s research program.
`The faculty members involved have each prepared a chapter on these applications.
`We close in Chapter 12 with some opinions and speculations about the future.
`In an appendix, the project manager presents a chronology of the entire Splash 2
`project.
`It is incumbent on us, and a genuine pleasure, to thank the Center for Com-
`puting Sciences of the Institute for Defense Analyses and the CCS Director, Francis
`Sullivan, for supporting us in our writing and editing of this book. All royalties
`will be donated to the Center for Excellence in Education, formerly known as the
`Rickover Foundation, in McLean, Virginia. The Center for Excellence in Education
`supports science and engineering education through its sponsorship of the Research
`Science Institute each summer for high school seniors, its Role Models and Lead-
`ers Project in Washington, DC, Los Angeles, and Chicago for promising women
`and minority high school students intending to study science and engineering, and
`its support and preparation of the United States Informatics Olympiad team each
`year.
`
`The Splash 2 project was a success in large part due to the ability of those who
`were involved nearly full-time, but it might not have taken the course it did had the
`hard-core Splash 2 players not had the chance to get advice and occasional help from
`a much larger group of experts both at SRC and elsewhere.
`We acknowledge, therefore, the help and advice of Nate Bronson, Dan Burns,
Bill Carlson, Neil Coletti, Maripat Corr, Steve Cuccaro, Hillory Dean, Chuck Fiduccia,
Brad Fross, Charles Goedeke, Maya Gokhale, Frank Hady, Dzung Hoang, Bill Holmes,
Amy Johnston, Elaine (Davis) Keith, Dan Kopetzky, Andy Kopser, Steve Kratzer,
Jim Kuehn, Sara Lucas, Michael Mascagni, Marge McGarry, John McHenry,
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, Preface xiii
`
`
`
`
`
`Ron Minnich, Lindy Moran, Fred More, Mark Norder, Lou Podrazik, Dan Pryor,
`Craig Reese, Paul Schneck, Brian Schott, Nabeel Shirazi, Doug Sweely, Dennis
`Sysko, Mark Thistle, Bob Thompson, Ken Wallgren, Alice Yuen, Neal Ziring, and
`Jennifer Zito.
`
`Duncan A. Buell
`
Jeffrey M. Arnold
`Walter J. Kleinfelder
`Bowie, Maryland
`March 1996
`
`Petitioner Microsoft Corporation - Ex. 1007, Preface xiv
`
`
`
`
`CHAPTER 1
`
`Custom Computing
`Machines: An Introduction
`
`Duncan A. Buell1
`
`1.1 INTRODUCTION
`
`It is a basic observation about computing that generality and efficiency are in some
`sense inversely related to one another; the more general-purpose an object is and
`thus the greater the number of tasks it can perform, the less efficient it will be in
`performing any of those specific tasks. Design decisions are therefore almost always
`compromises; designers identify key features or applications for which competitive
`efficiency is a must and then expand the range as far as is practicable without unduly
`damaging performance on the main targets.
`This thesis has certainly been true in processor architecture of computers aimed
`at computationally intense problems. Vector processors and vector supercomputers
`have targeted computationally intense, array-oriented floating point problems, usually
`in the hard sciences and engineering, but have not sacrificed the necessary speed on
`their core applications in order to be all things to all people. Thus, on computationally
`intense problems that do not fit well on traditional supercomputers, perhaps due to
`such things as integer arithmetic or scalar code, fast workstations can often outperform
`supercomputers.
`To counter the problem of computationally intense problems for which general-
`purpose machines cannot achieve the necessary performance, special-purpose proces-
sors, attached processors, and coprocessors have been built for many years, especially
`
`1A version of this chapter appeared as Buell and Pocek [11] and is used with permission.
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, p. 1
`
`
`
`
`in such areas as image or signal processing (for which many of the computational
`tasks can be very well defined). The problem with such machines is that they are
`special-purpose; as problems change or new ideas and techniques develop, their lack
`of flexibility makes them problematic as long—term solutions.
`Enter the FPGA. Field Programmable Gate Arrays, first introduced in 1986
`by Xilinx [34], were seen rather immediately by a few people to offer a totally
`new avenue to explore in the world of processor engineering. The great strength
of the computer as a tool is in its ability to be adapted, via programming, to a
multitude of computational tasks. The possibility now existed for an FPGA-based
`computing device not only to be configured to act like special-purpose hardware
`for a particular problem, but to be reconfigured for different problems and for this
`reconfiguration to be a programming process. By being more than single-purpose,
`such a machine would have the advantage of being flexible with at least a limited
`range of different applications. By being programmable, such a machine would open
`up “design of high-performance hardware” to individuals who can “design hard-
`ware” in an abstract sense but not a concrete sense. Finally, by being designed to
operate as if they were hardware, the applications for these machines can achieve
the hardware-like performance one gets from having explicitly parallel computations,
from not having instructions and data fetched and decoded, and from having
the ability to design processing units that reflect precisely the processing being
done.
`
`It is no exaggeration to say that machines using FPGAs as their processing
`elements have demonstrated that very high performance on an absolute scale, and
`extraordinary performance when measured against price, is possible with this tech-
`nology. The PeRLe machines built at DEC’s Paris Research Laboratory have been
`programmed on a number of applications with impressive results [7, 8, 9, 31]: An
`implemented multiplier can compute a 50-coefficient, 16-bit polynomial convolution
`FIR filter at 16 times audio real time. An implementation of RSA decryption executes
`at 10 times the bit rate of an AT&T ASIC. A Hough Transform implementation for
an application in high-energy physics achieves a compute rate that a standard 64-bit
`computer could not equal without a 1.2 GHz clock rate.
`As can be seen from the later chapters of this book, some of the applications
`programmed on Splash 2 have achieved similarly promising results. It was a general
`observation made by those involved in the Splash 2 project that, on applications that
`fit the machine, one Splash 2 Array Board could deliver approximately the compute
`power of one Cray YMP processor. One of the commercial licensees of the Splash 2
`technology sells its system for about $40,000. Of course, not all applications fit well,
`and most that do not fit well actually fit very badly indeed, but this is nonetheless a
`performance-to-price ratio substantial enough to warrant continued investigation and
`experimentation.
`The idea of reconfiguring a computer is certainly not new. The Burroughs
`B1700 had multiple instruction sets with different targets (Fortran, COBOL, and
`such) implemented with different microcode. In another way, the Paris functions of
`the Thinking Machines Corp. CM-2 differed from one version to the next. In the
`former case, standard views of hardware instructions were implemented. In the latter
`case, with a novel machine and a new architecture, we presume that an effort was
`made to implement function calls that users were seen to need and to delete unneeded
`functions when the instruction store ran out.
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, p. 2
`
`
`
`
`Section 1.1
`
`Introduction
`
`3
`
A certain amount of disagreement exists over what label to give to these
machines and how to refer to them. The group at DEC’s Paris research lab refers
`to their machine as a Programmable Active Memory (PAM) [7, 9, 29, 31]. Another
`commercial entity uses the term “virtual computer” [12]. From Brown University
`we have Processor Reconfiguration through Instruction Set Metamorphosis (PRISM)
`[4, 5, 6, 32]. We have already used the term “FPGA-based computing device” here.
`That none of these are truly satisfactory was evidenced in the spring of 1994 when the
comp.arch.fpga newsgroup was discussed and established; the most contentious
`point was over the name. Both the new newsgroup and the IEEE workshops we have
`organized use the term FPGA, thus being in some sense bound in terminology to a
`particular technology (unless, of course, one can convince the developers of the next
`technology that its name should allow FPGA as its acronym). We have chosen to
`use the term Custom Computing Machine (CCM). None of these terms is perfect,
`but we believe that this one is no worse than any of the others.
`The work on CCMs also differs from what is considered reconfigurable com-
`puter architecture, in that the term “reconfigurable” usually refers to a much higher
`level of the system. In the case of CCMs, that which is reconfigurable and significant
`by virtue of its reconfigurability is the “processor architecture” itself. A reconfig-
`urable computer, by contrast, is likely to be either a multiprocessor in which the
`interconnections among the processors can change, or a heterogeneous machine with
processors of different kinds that a user can choose to include or exclude in the view
he/she has of “the computer.”
`It is to be emphasized that this is not a mature computing technology and that
`CCMs are not a panacea for all problems in high-performance computing. Among
`the many issues and problems are the following:
`
`1. Are FPGAs large enough, or will they become large enough, so that a significant
`unit of computation can be put on a single chip so as to avoid both the loss
of efficiency in going off-chip and the problems in partitioning a computation
`across multiple chips?
`2. With current technology, even in the best of circumstances, the user must be
`exposed to the hardware itself. What is the level of understanding about chip
`architecture, signal routing delays, and so forth, that a “programmer” must know
`in order to use a CCM? How much more must be known in order to obtain
`
`the performance that would warrant using a CCM instead of a general-purpose
computer?
`3. If these machines are to be viewed as “computers,” then they must be capable of
`being programmed. How will this be done? What sort of programming language
`is appropriate? How do we “compile” when we have eliminated the underlying
`machine model?
`
4. Granting the point that these are limited-purpose, but not special-purpose,
`machines, what is the range of architectures needed to cover the spectrum
`of applications for which these machines make sense?
`5. What are the cost/performance figures necessary to make this a viable approach
for getting a computing task done? General-purpose machines are cheaper and
`easier to use but can be slow. ASICs and special-purpose devices are faster
`but more expensive in small quantities, take longer to develop, and are hard to
`modify. Where might CCMs fit between these two?
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, p. 3
`
`
`
`
`
`One problem faced by those involved in Splash 1 and Splash 2 at the Super-
`computing Research Center2 has been a stubborn refusal from some quarters to
`believe that achieving high performance on a CCM is possible without a design
`or programming agony so great as to be offputting to all but the most dedicated
`of “application designers/programmers.” Even in the face of our evidence to the
`contrary, the case has been difficult to make to some critics.
`The case is especially hard because what is needed is to build a complete
`hardware system, to create or cause to be created a programming environment worthy
`of being called a programming environment, and then to develop a variety of different
`applications so that the proof of concept is complete. Further, since the goal is
`to demonstrate a competitive performance with more expensive and sophisticated
`machines, the CCM must be big enough to do real work and to be part of complete
computational processing environments; it cannot be just a toy machine suitable
`only for doing kernels of problems. To our knowledge, only two such machines
`have been built that meet these criteria—Splash 2 and the larger of the DEC PAM
`systems.
`The goal in this book is to present the Splash 2 experience as a whole. Splash 2
`was designed and developed in an iterative process from top to bottom to top and
`back again.
`
`1.2 THE CONTEXT FOR SPLASH 2
`
`1.2.1 FPGAs
`
FPGAs in general have a wide spectrum of characteristics, but the FPGAs used for
CCMs have been of two distinct types. The Xilinx XC4010, a typical example of
one type, is a chip containing a 20 x 20 square array of Configurable Logic Blocks
(CLBs) [34]. Each CLB can serve one of three functions, either as two flipflops, or
as Boolean functions of nine inputs, or as 32 bits of RAM. The function use has
two 4-input functions, each producing an output; these two bits combine with a ninth
input in a 3-input Boolean function. The RAM usage simply takes advantage of the
fact that the 4-input functions are done with lookup tables to allow the input bits to
be viewed as addresses.
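The two-function structure described above can be sketched in software. This is an illustrative model only, not Splash 2 or Xilinx source code: the names (`make_lut`, `clb`) and the particular F/G/H configuration are invented for the example, and the real CLB carry logic and routing are ignored.

```python
# Illustrative model of the CLB structure described in the text: two 4-input
# lookup tables (LUTs) whose single-bit outputs feed a 3-input combiner LUT.
# Names and configuration values are hypothetical.

def make_lut(truth_table, k):
    """A k-input LUT is 2**k bits of RAM; the input bits form the address."""
    state = [(truth_table >> i) & 1 for i in range(2 ** k)]

    def read(*bits):
        addr = 0
        for b in bits:                     # pack input bits into a RAM address
            addr = (addr << 1) | b
        return state[addr]

    def write(addr, bit):                  # "32 bits of RAM" mode: same storage
        state[addr] = bit

    return read, write

# Configure F as 4-input AND, G as 4-input OR, and H as 3-input XOR,
# yielding one possible Boolean function of nine inputs.
f, _ = make_lut(1 << 0b1111, 4)            # output 1 only for input 1111
g, _ = make_lut(0xFFFE, 4)                 # output 1 unless input is 0000
h, _ = make_lut(0x96, 3)                   # odd-parity truth table

def clb(inputs, ninth):
    """Eight inputs split between F and G; their outputs combine with a ninth."""
    return h(f(*inputs[:4]), g(*inputs[4:]), ninth)

print(clb([1, 1, 1, 1, 0, 0, 0, 0], 0))    # F=1, G=0, XOR with 0 -> prints 1
```

The RAM view falls out for free: writing a bit through `write` and reading it back through the same address decoder is exactly the dual use of the lookup table that the text describes.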
`
`Connecting the CLBs to one another and to special Input Output Blocks (IOBs)
`on the periphery of the chip are routing resources running from CLB to CLB, skipping
`one CLB, or running the full length of the chip. Configuration of the FPGA is done
`by loading a bit file onto on-chip RAM.
In contrast to the relatively coarse granularity of the Xilinx chips, the second
type of FPGA, by Algotronix, Ltd., and by Concurrent Logic, Inc. [1, 13], is fine-
grained. The Algotronix chip is a 32 x 32 array of 2-input, 1-output Boolean function
logic cells, with the signal lines running only point to point from one cell to its
neighbors in each rectangular direction.3
`
`2The Supercomputing Research Center was renamed the Center for Computing Sciences in May
`1995, but will be referred to throughout this book as SRC.
`3Algotronix is now a part of Xilinx.
`
`
`
`Petitioner Microsoft Corporation - Ex. 1007, p. 4
`
`
`
`
`Section 1.2
`
`The Context for Splash 2
`
`5
`
`To a first-order approxima