`
`Reference [1007]
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2170, p. 1
`
`
`
`Splash 2
`
FPGAs in a Custom Computing Machine
`
`Duncan A. Buell
`Jeffrey M. Arnold
`Walter J. Kleinfelder
`
COMPUTER SOCIETY
50 YEARS OF SERVICE • 1946-1996

THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, INC.

Petitioner Microsoft Corporation - Ex. 1007, Cover 1
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2170, p. 2
`
`
`
`
`
`
`
`
`Splash 2
`FPGAs in a Custom
`Computing Machine
`
`
`
`
Xilinx, the Xilinx logo, XC3090, XC4010, XBLOX, XACT, LCA,
and Configurable Logic Cell are trademarks of Xilinx, Inc.
CM-2 and Paris are trademarks of Thinking Machines Corporation.
VMEbus is a trademark of Motorola Corporation.
SPARC and SPARCstation are trademarks of SPARC International, Inc.
Products bearing a SPARC trademark are based on an architecture
developed by Sun Microsystems, Inc. SPARCstation is licensed
exclusively to Sun Microsystems, Inc.
UNIX is a trademark of UNIX System Laboratories.
Sun, Sun Workstation, SunOS, and SBus are trademarks of Sun
Microsystems, Inc.
Design Compiler and FPGA Compiler are trademarks of Synopsys, Inc.
DEC is a trademark of Digital Equipment Corporation.
Verilog is a trademark of Cadence Design Systems, Inc.
`
`
`
`
`Splash 2
`FPGAs in a Custom
`Computing Machine
`
`Duncan A. Buell
`Jeffrey M. Arnold
`Walter J. Kleinfelder
`Editors
`Center for Computing Sciences
`Bowie, Maryland
`
IEEE Computer Society Press
Los Alamitos, California

Washington • Brussels • Tokyo
`
`
`
`Library of Congress Cataloging-in-Publication Data
`
Buell, Duncan A.
    Splash 2: FPGAs in a custom computing machine / Duncan A. Buell,
Jeffrey M. Arnold, Walter J. Kleinfelder.
        p. cm.
    Includes bibliographical references and index.
    ISBN 0-8186-7413-X
    1. Splash 2 (Computer)  2. Electronic digital computers-Design
and construction.  I. Arnold, Jeffrey M.  II. Kleinfelder, Walter J.
III. Title.
    QA76.8.S65B84 1996
    004.2'2-dc20

95-47397
CIP
`
`IEEE Computer Society Press
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1264
`
Copyright © 1996 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved.
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted
`to photocopy isolated pages beyond the limits of US copyright law, for private use of their patrons. Other
`copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE Service
`Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
`IEEE Computer Society Press Order Number BP07413
`Library of Congress Number 95-47397
`ISBN 0-8186-7413-X
`
IEEE Computer Society Press
Customer Service Center
10662 Los Vaqueros Circle
P.O. Box 3014
Los Alamitos, CA 90720-1264
Tel: +1-714-821-8180
Fax: +1-714-821-4641
Email: cs.books@computer.org
`
Additional copies may be ordered from:
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, NJ 08855-1331
`Tel: +1-908-981-1393
Fax: +1-908-981-9667
`mis.custserv@computer.org
`
`IEEE Computer Society
13, Avenue de l'Aquilon
`B-1200 Brussels
`BELGIUM
`Tel: +32-2-770-2198
`Fax: +32-2-770-8505
`euro.ofc@computer.org
`
`IEEE Computer Society
`Ooshima Building
`2-19-1 Minami-Aoyama
`Minato-ku, Tokyo 107
`JAPAN
`Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`Assistant Publisher: Matt Loeb
`Technical Editor: Dharma P. Agrawal
`Acquisitions Assistant: Cheryl Smith
`Advertising/Promotions: Tom Fink
`Production Editor: Lisa O'Conner
`Cover Image: Dan Kopetzky, Center for Computing Sciences
`
`Printed in the United States of America
`
The Institute of Electrical and Electronics Engineers, Inc.
`
`
`
`
Contents

PREFACE  xi

1  CUSTOM COMPUTING MACHINES: AN INTRODUCTION  1
   1.1  Introduction  1
   1.2  The Context for Splash 2  4
        1.2.1  FPGAs, 4
        1.2.2  Architecture, 5
        1.2.3  Programming, 6

2  THE ARCHITECTURE OF SPLASH 2  10
   2.1  Introduction  10
   2.2  The Building Blocks  11
   2.3  The System Architecture  12
   2.4  Data Paths  13
   2.5  The Splash 2 Array Board  16
        2.5.1  The Linear Array, 16
        2.5.2  The Splash 2 Crossbar, 16
        2.5.3  Xilinx Chip X0 and Broadcast Mode, 17
   2.6  The Interface Board and Control Features  17
`
`
`
3  HARDWARE IMPLEMENTATION  19
   3.1  Introduction  19
   3.2  Development Board Design  21
   3.3  Interface Board Design  21
        3.3.1  DMA Channel, 23
        3.3.2  XL and XR, 23
        3.3.3  Interrupts, 24
        3.3.4  Clock, 24
        3.3.5  Programming and Readback, 24
        3.3.6  Miscellaneous Registers, 25
   3.4  Array Board Design  25
        3.4.1  Processing Element, 26
        3.4.2  Control Element, 28
        3.4.3  External Memory Access, 28
        3.4.4  Crossbar, 28
        3.4.5  Programming and Readback, 29
        3.4.6  Miscellaneous Registers, 29

4  SPLASH 2: THE EVOLUTION OF A NEW ARCHITECTURE  31
   4.1  Splash 1  31
   4.2  Splash 2: Thoughts on a Redesign  34
   4.3  Programming Language  36
   4.4  Choice of FPGAs  37
   4.5  Choice of Host and Bus  38
   4.6  Chip-to-Chip Interconnections  39
   4.7  Multitasking  42
   4.8  Chip X0 and Broadcast  43
   4.9  Other Design Decisions  43

5  SOFTWARE ARCHITECTURE  46
   5.1  Introduction  46
   5.2  Background  47
   5.3  VHDL as a Programming Language  49
        5.3.1  History and Purpose of VHDL, 50
        5.3.2  VHDL Language Features, 50
        5.3.3  Problems with VHDL, 51
   5.4  Software Environment  51
`
`
`
   5.5  Programmer's View of Splash 2  55
        5.5.1  Programming Process, 55
        5.5.2  Processing Element View, 56
        5.5.3  Interface Board View, 57
        5.5.4  Host View, 57

6  SOFTWARE IMPLEMENTATION  60
   6.1  Introduction  60
   6.2  VHDL Environment  60
        6.2.1  Splash 2 VHDL Library, 61
        6.2.2  Standard Entity Declarations, 61
        6.2.3  Programming Style, 64
   6.3  Splash 2 Simulator  66
        6.3.1  Structure, 66
        6.3.2  Configuring the Simulator, 67
        6.3.3  Input and Output, 68
        6.3.4  Crossbar and Memory Models, 68
        6.3.5  Hardware Constraints, 70
   6.4  Compilation  70
        6.4.1  Logic Synthesis, 70
        6.4.2  Physical Mapping, 71
        6.4.3  Debugging Support, 71
   6.5  Runtime System  72
        6.5.1  T2: A Symbolic Debugger, 72
        6.5.2  Runtime Library, 73
        6.5.3  Device Driver, 74
   6.6  Diagnostics  75

7  A DATA PARALLEL PROGRAMMING MODEL  77
   7.1  Introduction  78
   7.2  Data-parallel Bit C  80
        7.2.1  dbC Overview, 80
        7.2.2  dbC Example, 81
   7.3  Compiling from dbC to Splash 2  82
        7.3.1  Creating a Specialized SIMD Engine, 83
        7.3.2  Generic SIMD Code, 84
        7.3.3  Generating VHDL, 84
   7.4  Global Operations  88
        7.4.1  Nearest-Neighbor Communication, 88
`
`
`
        7.4.2  Reduction Operations, 89
        7.4.3  Host/Processor Communication, 91
   7.5  Optimization: Macro Instructions  92
        7.5.1  Creating a Macro Instruction, 93
        7.5.2  Discussion, 94
   7.6  Evaluation: Genetic Database Search  94
   7.7  Conclusions and Future Work  95

8  SEARCHING GENETIC DATABASES ON SPLASH 2  97
   8.1  Introduction  97
        8.1.1  Edit Distance, 98
        8.1.2  Dynamic Programming Algorithm, 98
   8.2  Systolic Sequence Comparison  100
        8.2.1  Bidirectional Array, 100
        8.2.2  Unidirectional Array, 103
   8.3  Implementation  104
        8.3.1  Modular Encoding, 105
        8.3.2  Configurable Parameters, 106
        8.3.3  Bidirectional Array, 107
        8.3.4  Unidirectional Array, 107
   8.4  Benchmarks  107
   8.5  Discussion  108
   8.6  Conclusions  108

9  TEXT SEARCHING ON SPLASH 2  110
   9.1  Introduction  110
   9.2  The Text Searching Algorithm  111
   9.3  Description of the Single-Byte Splash Program  113
   9.4  Timings, Discussion  114
   9.5  Outline of the 16-bit Approach  115
   9.6  Conclusions  116

10 FINGERPRINT MATCHING ON SPLASH 2  117
   10.1  Introduction  117
   10.2  Background  120
`
`
`
        10.2.1  Pattern Recognition Systems, 121
        10.2.2  Terminology, 122
        10.2.3  Stages in AFIS, 123
   10.3  Splash 2 Architecture and Programming Models  125
   10.4  Fingerprint Matching Algorithm  125
        10.4.1  Minutia Matching, 126
        10.4.2  Matching Algorithm, 127
   10.5  Parallel Matching Algorithm  128
        10.5.1  Preprocessing on the Host, 131
        10.5.2  Computations on Splash, 132
        10.5.3  VHDL Specification for X0, 133
   10.6  Simulation and Synthesis Results  134
   10.7  Execution on Splash 2  137
        10.7.1  User Interface, 137
        10.7.2  Performance Analysis, 137
   10.8  Conclusions  139

11 HIGH-SPEED IMAGE PROCESSING WITH SPLASH 2  141
   11.1  Introduction  141
   11.2  The VTSplash System  142
   11.3  Image Processing Terminology and Architectural Issues  143
   11.4  Case Study: Median Filtering  150
   11.5  Case Study: Image Pyramid Generation  153
        11.5.1  Gaussian Pyramid, 154
        11.5.2  Two Implementations for Gaussian Pyramid on Splash 2, 155
        11.5.3  The Hybrid Pipeline Gaussian Pyramid Structure, 157
        11.5.4  The Laplacian Pyramid, 157
        11.5.5  Implementation of the Laplacian Pyramid on Splash 2, 159
   11.6  Performance  159
   11.7  Summary  163

12 THE PROMISE AND THE PROBLEMS  166
   12.1  Some Bottom-Line Conclusions  166
        12.1.1  High Bandwidth I/O Is a Must, 166
        12.1.2  Memory Is a Must, 167
        12.1.3  Programming Is Possible, and Becoming More So, 168
        12.1.4  The Programming Environment Is Crucial, 168
   12.2  To Where from Here?  169
`
`
`
   12.3  If Not Splash 3, Then What?  171
        12.3.1  Architectures, 172
        12.3.2  Custom Processors, 173
        12.3.3  Languages, 174
   12.4  The "Killer" Applications  177
   12.5  Final Words  178

A  SPLASH 2 DEVELOPMENT-THE PROJECT MANAGER'S SUMMARY  179

B  AN EXAMPLE APPLICATION  186

REFERENCES  190
`
`
`
`Preface
`
`The Splash 2 project began at the Supercomputing Research Center1 in September of
`1991 and ended, with success, in the spring of 1994. Splash 2 is an attached processor
`system using Xilinx XC4010 FPGAs as its processing elements. As such, it is a
`custom computing machine. That is to say that much of what would be the instruction
`set architecture of the processing elements is not specified except in the details of the
`program developed by the application programmer. Although a higher-level block
`diagram of processing elements, memories, interconnect, and dataflow exists in the
`hardware structure of Splash 2, the details of the instruction set architecture level of
`the machine will vary from one application to the next.
`The Splash 2 project is significant for two reasons. First, Splash 2 is part of a
`complete computer system that has achieved supercomputer-like performance on a
`number of different applications. By "complete computer system" we mean that SRC
`created or caused to be created an extant hardware system (replicated a dozen times),
`a complete programming and runtime environment, and a collection of application
`programs that exercised the unique hardware.
`The second significant aspect of Splash 2 is that we were fortunate enough to
`be able to build a large system, capable of performing real computations on real
`problems. One common complaint about performance results on novel computing
`machines or environments is that results on small problems cannot be accurately
`extrapolated to large problems. The Splash 2 system that was designed and built is
`a full-sized machine and does not suffer from this defect.
`To get to the point: why a book?
`
`1 Renamed the Center for Computing Sciences in May 1995, but referred to throughout this book
`as SRC.
`
`
`This is a novel computing machine. In order to understand what happens when
`the application programmer is permitted, indeed required, to design the processor
`architecture of the machine that will execute his program, it is necessary to see the
`system as a whole. Programmability and problems to be run on this machine both
`had major influences on its architecture, just as its architecture and its unique nature
`influence the kinds of problems one would expect to program for this machine and
`the nature of that programming. And standing between the user and the machine,
`as the old joke goes, is a new kind of programming environment and an evolving
`understanding of how this environment must allow the use of the hardware, without
`forcing every programmer to be a hardware-design engineer.
`At the first IEEE Workshop on FPGAs for Custom Computing Machines, one
`of the industrial attendees remarked that, although nearly everyone would agree, as
`part of a coffee-room discussion or the like, that it would be interesting to think about
`building a "computer" using FPGAs, no one in management (except perhaps at SRC
`and DEC PRL) had put up the commitment necessary in time and money to do a
`real test of the idea. It was then remarked that, given the nature of the marketplace
`and of engineering management, these first attempts had to be successful in order to
`open the door for future work. We feel we have been successful, and we offer in
`this book an in-depth look at one of the handful of data points in the design space
`of this new kind of machine.
We would hope that this book has a broad appeal and is readable with understanding
by nearly all computer scientists and computer engineers. To the hardware
designer, perhaps we can offer a new look at programming applications on a moderately
general FPGA-based computing machine instead of designing circuits for a
specific board incorporating FPGAs. The engineering world has viewed FPGAs, to a
great extent, as the next logical step in a continuum of electronic devices; we offer,
we feel, the option of viewing them much more broadly than that. To the computer
architect we offer a variant hardware platform and an indication of how that general
platform can be used. Much of computer architecture is a compromise between the
functionality desired and the limits of what can be built given existing technology; we
offer the use of a new technology that can offer, to a limited extent now, and could
offer much more generally later, greatly increased functionality. For some of those
who have hard problems in computation, we offer much of the power of special-purpose
hardware without the inflexibility and uncertain delivery times of hardware.
The long-term task is not to map a high-level language to a particular architecture
or range of architectures, but in some sense to create for each application program
a suitable architecture to which the high-level language will be mapped. And to the
language designers and compiler writers we offer a world to conquer. We have presented
one imperfect but usable approach to programming such a computing machine,
and we trust that others interested in the critical problem of making these machines
programmable can learn both from what we did right and what we did wrong.
Chapter 1 discusses the general concept of Custom Computing Machines, of
which Splash 2 is one example. Chapters 2 and 3 describe at a high level and
then in some detail the hardware architecture of Splash 2. Chapter 4 covers the
design considerations and decisions in arriving at the second-generation Splash 2
architecture. We present this chapter at the end of the section on hardware, on the
basis that it is easier to understand variations in a design when those variations are
compared against something concrete.
`
`
`Chapters 5 and 6 describe, also first at a higher and then at a lower level, the
`software architecture of Splash 2. All the application programs in the latter chapters
`were done using VHDL as an applications programming language and these tools
`in support. The main goal of the Splash 2 project was to show that software, as
`described in Chapters 5 and 6, could make a computer using FPGAs as its processing
`elements into something that reasonable people would call "programmable," and, in
`that sense, the heart of the Splash 2 project is in Chapters 5 and 6. Throughout
`the life of Splash 2, however, there has been an alternative view of programming.
`This view is reflected in Chapter 7 on the Splash 2 version of the programming
`language dbC. The approach taken in dbC is to permit the programmer to use a
`version of C as the programming language. It is the compiler which then becomes
`responsible for, in essence, recognizing the instruction set architecture necessary to
`execute the program and then creating in the FPGA the requisite registers, logic units,
`and control.
Chapters 8 through 11 then describe four different applications programmed
to conclusion on Splash 2. The first of these, the sequence comparison problem,
was the driving application, in the sense that funding for Splash 2 was based on its
perceived ability to perform this computation. This and the text processing application
were done at SRC.
The Splash 2 project team was fortunate in that SRC's parent company, the
Institute for Defense Analyses, issued two contracts, to Virginia Polytechnic Institute
and to Michigan State University, for applications work on Splash 2 in image
processing and fingerprint identification. Both applications seemed good matches with
the Splash 2 architecture but lay outside the normal realm of SRC's research program.
The faculty members involved have each prepared a chapter on these applications.
`We close in Chapter 12 with some opinions and speculations about the future.
`In an appendix, the project manager presents a chronology of the entire Splash 2
`project.
It is incumbent on us, and a genuine pleasure, to thank the Center for
Computing Sciences of the Institute for Defense Analyses and the CCS Director, Francis
Sullivan, for supporting us in our writing and editing of this book. All royalties
will be donated to the Center for Excellence in Education, formerly known as the
Rickover Foundation, in McLean, Virginia. The Center for Excellence in Education
supports science and engineering education through its sponsorship of the Research
Science Institute each summer for high school seniors, its Role Models and Leaders
Project in Washington, D.C., Los Angeles, and Chicago for promising women
and minority high school students intending to study science and engineering, and
its support and preparation of the United States Informatics Olympiad team each
year.
`
`The Splash 2 project was a success in large part due to the ability of those who
`were involved nearly full-time, but it might not have taken the course it did had the
`hard-core Splash 2 players not had the chance to get advice and occasional help from
`a much larger group of experts both at SRC and elsewhere.
We acknowledge, therefore, the help and advice of Nate Bronson, Dan Burns,
Bill Carlson, Neil Coletti, Maripat Corr, Steve Cuccaro, Hillary Dean, Chuck Fiduccia,
Brad Fross, Charles Goedeke, Maya Gokhale, Frank Hady, Dzung Hoang, Bill
Holmes, Amy Johnston, Elaine (Davis) Keith, Dan Kopetzky, Andy Kopser, Steve
Kratzer, Jim Kuehn, Sara Lucas, Michael Mascagni, Marge McGarry, John McHenry,
`
`
`Ron Minnich, Lindy Moran, Fred More, Mark Norder, Lou Podrazik, Dan Pryor,
`Craig Reese, Paul Schneck, Brian Schott, Nabeel Shirazi, Doug Sweely, Dennis
`Sysko, Mark Thistle, Bob Thompson, Ken Wallgren, Alice Yuen, Neal Ziring, and
`Jennifer Zito.
`
`Duncan A. Buell
`Jeffrey M. Arnold
`Walter J. Kleinfelder
`Bowie, Maryland
`March 1996
`
`
`
`
`CHAPTER 1
`
`Custom Computing
`Machines: An Introduction
`
`Duncan A. Buell1
`
`1.1 INTRODUCTION
`
`It is a basic observation about computing that generality and efficiency are in some
`sense inversely related to one another; the more general-purpose an object is and
`thus the greater the number of tasks it can perform, the less efficient it will be in
`performing any of those specific tasks. Design decisions are therefore almost always
`compromises; designers identify key features or applications for which competitive
`efficiency is a must and then expand the range as far as is practicable without unduly
`damaging performance on the main targets.
`This thesis has certainly been true in processor architecture of computers aimed
`at computationally intense problems. Vector processors and vector supercomputers
`have targeted computationally intense, array-oriented floating point problems, usually
`in the hard sciences and engineering, but have not sacrificed the necessary speed on
`their core applications in order to be all things to all people. Thus, on computationally
`intense problems that do not fit well on traditional supercomputers, perhaps due to
`such things as integer arithmetic or scalar code, fast workstations can often outperform
`supercomputers.
To counter the problem of computationally intense problems for which general-purpose
machines cannot achieve the necessary performance, special-purpose processors,
attached processors, and coprocessors have been built for many years, especially

1 A version of this chapter appeared as Buell and Pocek [11] and is used with permission.
`
`
`in such areas as image or signal processing (for which many of the computational
`tasks can be very well defined). The problem with such machines is that they are
`special-purpose; as problems change or new ideas and techniques develop, their lack
`of flexibility makes them problematic as long-term solutions.
`Enter the FPGA. Field Programmable Gate Arrays, first introduced in 1986
by Xilinx [34], were seen rather immediately by a few people to offer a totally
`new avenue to explore in the world of processor engineering. The great strength
`of the computer as a tool is in its ability to be adapted, via programming, to a
`multitude of computational tasks. The possibility now existed for an FPGA-based
`computing device not only to be configured to act like special-purpose hardware
`for a particular problem, but to be reconfigured for different problems and for this
`reconfiguration to be a programming process. By being more than single-purpose,
`such a machine would have the advantage of being flexible with at least a limited
`range of different applications. By being programmable, such a machine would open
up "design of high-performance hardware" to individuals who can "design hardware"
in an abstract sense but not a concrete sense. Finally, by being designed to
operate as if they were hardware, the applications for these machines can achieve
the hardware-like performance one gets from having explicitly parallel computations,
from not having instructions and data fetched and decoded, and from having
the ability to design processing units that reflect precisely the processing being
done.
It is no exaggeration to say that machines using FPGAs as their processing
elements have demonstrated that very high performance on an absolute scale, and
extraordinary performance when measured against price, is possible with this
technology. The PeRLe machines built at DEC's Paris Research Laboratory have been
programmed on a number of applications with impressive results [7, 8, 9, 31]: An
implemented multiplier can compute a 50-coefficient, 16-bit polynomial convolution
FIR filter at 16 times audio real time. An implementation of RSA decryption executes
at 10 times the bit rate of an AT&T ASIC. A Hough Transform implementation for
an application in high-energy physics achieves a compute rate that a standard 64-bit
computer could not equal without a 1.2 GHz clock rate.
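For readers unfamiliar with the FIR computation mentioned above, the following is a plain software sketch of what such a filter computes (the function name and sample data are ours for illustration; the PeRLe implementation is, of course, in hardware, not Python):

```python
# Sketch of the computation a convolution FIR filter performs:
# y[n] = sum over k of h[k] * x[n - k], with integer samples and
# coefficients, as in the 50-coefficient, 16-bit example above.

def fir_filter(samples, coeffs):
    """Direct-form FIR convolution over a list of integer samples."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out

# Tiny illustrative example: a 3-tap moving-sum filter.
print(fir_filter([1, 2, 3, 4], [1, 1, 1]))  # [1, 3, 6, 9]
```

The hardware advantage comes from computing all the multiply-accumulate steps of the inner loop in parallel every sample period, rather than sequentially as here.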
`As can be seen from the later chapters of this book, some of the applications
`programmed on Splash 2 have achieved similarly promising results. It was a general
`observation made by those involved in the Splash 2 project that, on applications that
`fit the machine, one Splash 2 Array Board could deliver approximately the compute
`power of one Cray YMP processor. One of the commercial licensees of the Splash 2
`technology sells its system for about $40,000. Of course, not all applications fit well,
`and most that do not fit well actually fit very badly indeed, but this is nonetheless a
`performance-to-price ratio substantial enough to warrant continued investigation and
`experimentation.
`The idea of reconfiguring a computer is certainly not new. The Burroughs
B1700 had multiple instruction sets with different targets (Fortran, COBOL, and
`such) implemented with different microcode. In another way, the Paris functions of
`the Thinking Machines Corp. CM-2 differed from one version to the next. In the
`former case, standard views of hardware instructions were implemented. In the latter
`case, with a novel machine and a new architecture, we presume that an effort was
`made to implement function calls that users were seen to need and to delete unneeded
`functions when the instruction store ran out.
`
`
`A certain amount of disagreement exists over what label to give to these
`machines and how to refer to them. The group at DEC's Paris research lab refers
`to their machine as a Programmable Active Memory (PAM) [7, 9, 29, 31]. Another
`commercial entity uses the term "virtual computer" [12]. From Brown University
`we have Processor Reconfiguration through Instruction Set Metamorphosis (PRISM)
`[4, 5, 6, 32]. We have already used the term "FPGA-based computing device" here.
`That none of these are truly satisfactory was evidenced in the spring of 1994 when the
comp.arch.fpga newsgroup was discussed and established; the most contentious
`point was over the name. Both the new newsgroup and the IEEE workshops we have
`organized use the term FPGA, thus being in some sense bound in terminology to a
`particular technology (unless, of course, one can convince the developers of the next
`technology that its name should allow FPGA as its acronym). We have chosen to
`use the term Custom Computing Machine (CCM). None of these terms is perfect,
`but we believe that this one is no worse than any of the others.
The work on CCMs also differs from what is considered reconfigurable computer
architecture, in that the term "reconfigurable" usually refers to a much higher
level of the system. In the case of CCMs, that which is reconfigurable and significant
by virtue of its reconfigurability is the "processor architecture" itself. A reconfigurable
computer, by contrast, is likely to be either a multiprocessor in which the
interconnections among the processors can change, or a heterogeneous machine with
processors of different kinds that a user can choose to include or exclude in the view
he/she has of "the computer."
`It is to be emphasized that this is not a mature computing technology and that
`CCMs are not a panacea for all problems in high-performance computing. Among
`the many issues and problems are the following:
`
`1. Are FPGAs large enough, or will they become large enough, so that a significant
`unit of computation can be put on a single chip so as to avoid both the loss
`of efficiency in going off-chip and the problems in partitioning a computation
`across multiple chips?
`2. With current technology, even in the best of circumstances, the user must be
`exposed to the hardware itself. What is the level of understanding about chip
`architecture, signal routing delays, and so forth, that a "programmer" must know
`in order to use a CCM? How much more must be known in order to obtain
`the performance that would warrant using a CCM instead of a general-purpose
`computer?
`3. If these machines are to be viewed as "computers," then they must be capable of
`being programmed. How will this be done? What sort of programming language
`is appropriate? How do we "compile" when we have eliminated the underlying
`machine model?
`4. Granting the point that these are limited-purpose, but not special-purpose,
`machines, what is the range of architectures needed to cover the spectrum
`of applications for which these machines make sense?
`5. What are the cost/performance figures necessary to make this a viable approach
`for getting a computing task done? General-purpose machines are cheaper and
`easier to use but can be slow. ASICs and special-purpose devices are faster
`but more expensive in small quantities, take longer to develop, and are hard to
`modify. Where might CCMs fit between these two?
`
`
One problem faced by those involved in Splash 1 and Splash 2 at the
Supercomputing Research Center2 has been a stubborn refusal from some quarters to
believe that achieving high performance on a CCM is possible without a design
or programming agony so great as to be offputting to all but the most dedicated
of "application designers/programmers." Even in the face of our evidence to the
contrary, the case has been difficult to make to some critics.
`The case is especially hard because what is needed is to build a complete
`hardware system, to create or cause to be created a programming environment worthy
`of being called a programming environment, and then to develop a variety of different
`applications so that the proof of concept is complete. Further, since the goal is
`to demonstrate a competitive performance with more expensive and sophisticated
`machines, the CCM must be big enough to do real work and to be part of complete
`computational processing environments; it cannot be just a toy machine suitable
`only for doing kernels of problems. To our knowledge, only two such machines
`have been built that meet these criteria-Splash 2 and the larger of the DEC PAM
`systems.
`The goal in this book is to present the Splash 2 experience as a whole. Splash 2
`was designed and developed in an iterative process from top to bottom to top and
`back again.
`
`1.2 THE CONTEXT FOR SPLASH 2
`
`1.2.1 FPGAs
`
FPGAs in general have a wide spectrum of characteristics, but the FPGAs used for
CCMs have been of two distinct types. The Xilinx XC4010, a typical example of
one type, is a chip containing a 20 x 20 square array of Configurable Logic Blocks
(CLBs) [34]. Each CLB can serve one of three functions: as two flipflops, as
Boolean functions of nine inputs, or as 32 bits of RAM. In the Boolean-function use,
the CLB has two 4-input functions, each producing an output; these two bits combine
with a ninth input in a 3-input Boolean function. The RAM use simply takes
advantage of the fact that the 4-input functions are done with lookup tables to allow
the input bits to be viewed as addresses.
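As a rough behavioral sketch of the logic-function mode just described (the function and table names here are ours, not Xilinx's), a CLB can be modeled as two 4-input lookup tables whose outputs feed a third, 3-input lookup table:

```python
# Behavioral sketch of an XC4000-series CLB in its logic-function mode:
# two 4-input lookup tables (F and G, 16 bits each) plus a 3-input
# lookup table (H, 8 bits) that combines their outputs with a ninth
# input. The same two 16-bit tables, addressed by the input bits,
# give the 32-bits-of-RAM mode described above.

def lut(table, inputs):
    """Evaluate a lookup table: the input bits form the table address."""
    addr = 0
    for bit in inputs:                # bits are 0/1, MSB first
        addr = (addr << 1) | bit
    return (table >> addr) & 1

def clb_logic(f_table, g_table, h_table, f_in, g_in, h_in):
    """Nine-input Boolean function: H(F(4 inputs), G(4 inputs), h_in)."""
    f = lut(f_table, f_in)
    g = lut(g_table, g_in)
    return lut(h_table, [f, g, h_in])

# Example configuration: F = AND of its 4 inputs, G = OR of its
# 4 inputs, H = XOR (parity) of its three inputs.
AND4 = 1 << 15                        # only address 1111 -> 1
OR4 = (1 << 16) - 2                   # every address except 0000 -> 1
XOR3 = 0b10010110                     # parity of the 3 address bits
print(clb_logic(AND4, OR4, XOR3, [1, 1, 1, 1], [0, 0, 0, 0], 0))  # 1
```

The RAM mode falls out of the same structure: writing a bit of `f_table` or `g_table` at the address formed by the four inputs turns each function generator into a 16 x 1 memory.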
Connecting the CLBs to one another and to special Input/Output Blocks (IOBs)
`on the periphery of the chip are routing resources running from CLB to CLB, skipping
`one CLB, or running t