`
`Computing for Science and
`
`Engineering Applications
`
`by
`
`Peter Leonard McMahon
`
`Submitted to the Department of Electrical Engineering
`
`in partial fulfillment of the requirements for the degree of
`
`Bachelor of Science in Electrical and Computer Engineering
`
`at the
`
`University of Cape Town
`
`October 2006
`
`Advisor: Professor Michael R. Inggs
`
`
`
`Abstract
`
`This thesis investigates the feasibility of using reconfigurable computers for
`
`scientific applications. We review recent developments in reconfigurable high
`
`performance computing. We then present designs and implementation details
`
`of various scientific applications that we developed for the SRC-6 reconfig-
`
`urable computer. We present performance measurements and analysis of the
`
`results obtained.
`
`We chose a selection of applications from bioinformatics, physics and
`
`financial mathematics, including automatic docking of molecular models into
`
`electron density maps, lattice gas fluid dynamics simulations, edge detection
`
`in images and Monte Carlo options pricing simulations.
`
`We conclude that reconfigurable computing is a maturing field that may
`
`provide considerable benefit to scientific applications in the future. At present
`
`the performance gains offered by reconfigurable computers are not sufficient
`
`to justify the expense of the systems, and the software development environ-
`
`ment lacks the language features and library support that application devel-
`
`opers require so that they can focus on developing correct software rather
`
`than on software infrastructure.
`
`
`
`Contents
`
`1 Introduction
`
`1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . .
`
`1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
`
`1.2.1
`
`Investigate the State-of-the-art in Reconfigurable Com-
`
`puting . . . . . . . . . . . . . . . . . . . . . . . . . . .
`
`1.2.2
`
`Implement Scientific Computing Algorithms on Recon-
`
`figurable Computers
`
`. . . . . . . . . . . . . . . . . . .
`
`1.2.3 Analyze the Performance of Scientific Applications on
`
`Reconfigurable Computers . . . . . . . . . . . . . . . .
`
`1.2.4 Provide Guidance on the Methodology for Developing
`
`Software for Reconfigurable Computers . . . . . . . . .
`
`1.3 Motivation for Problems Studied . . . . . . . . . . . . . . . .
`
`1.4 Thesis Outline and Summary . . . . . . . . . . . . . . . . . .
`
`1
`
`1
`
`3
`
`3
`
`4
`
`4
`
`4
`
`5
`
`6
`
`2 An Introduction to Reconfigurable Computing
`
`10
`
`2.1 Reconfigurable Computing Hardware . . . . . . . . . . . . . . 11
`
`2.1.1 Where do reconfigurable computers get their speed from? 12
`
`2.2 Reconfigurable Computing Software . . . . . . . . . . . . . . . 13
`
`2.3 Measuring Performance in Reconfigurable Computing Systems
`
`14
`
`2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
`
`3 Monte Carlo Methods on Reconfigurable Computers
`
`16
`
`3.1 Monte Carlo Methods
`
`. . . . . . . . . . . . . . . . . . . . . . 16
`
`3.2 Monte Carlo Estimation of π . . . . . . . . . . . . . . . . . . . 17
`
`i
`
`
`
`3.2.1
`
`Implementation of a Parallel Pseudorandom Number
`
`Generator . . . . . . . . . . . . . . . . . . . . . . . . . 19
`
`3.2.2 Design and Implementation of the Monte Carlo π Es-
`
`timator
`
`. . . . . . . . . . . . . . . . . . . . . . . . . . 25
`
`3.2.3 Performance Results
`
`. . . . . . . . . . . . . . . . . . . 29
`
`3.3 Monte Carlo Options Pricing . . . . . . . . . . . . . . . . . . . 32
`
`3.3.1 Pricing Asian Options with Monte Carlo . . . . . . . . 33
`
`3.3.2 Generating Normal Random Variables
`
`. . . . . . . . . 36
`
`3.3.3 Design and Implementation . . . . . . . . . . . . . . . 37
`
`3.3.4 Performance Results
`
`. . . . . . . . . . . . . . . . . . . 43
`
`3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
`
`4 Cellular Automata Simulations on Reconfigurable Comput-
`
`ers
`
`48
`
`4.1 An Introduction to Cellular Automata . . . . . . . . . . . . . 49
`
`4.1.1 One-dimensional Cellular Automata . . . . . . . . . . . 49
`
`4.1.2 Two-dimensional Cellular Automata . . . . . . . . . . 51
`
`4.2 Conway’s Game of Life . . . . . . . . . . . . . . . . . . . . . . 52
`
`4.2.1 Design and Implementation . . . . . . . . . . . . . . . 53
`
`4.2.2 Performance Results
`
`. . . . . . . . . . . . . . . . . . . 71
`
`4.3 Fluid Dynamics Simulations using the Lattice Gas Method . . 73
`
`4.3.1 Design and Implementation . . . . . . . . . . . . . . . 75
`
`4.3.2 Performance Results
`
`. . . . . . . . . . . . . . . . . . . 79
`
`4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
`
`5 Image Processing – Edge Detection on Reconfigurable Com-
`
`puters
`
`81
`
`5.1 An Introduction to the Sobel Edge Detection Algorithm . . . 81
`
`5.2 Edge Detection on a Reconfigurable Computer . . . . . . . . . 84
`
`5.2.1 Design and Implementation . . . . . . . . . . . . . . . 84
`
`5.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
`
`5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
`
`ii
`
`
`
`6 Automatic Macromolecular Docking on Reconfigurable Com-
`
`puters
`
`91
`
`6.1 Macromolecular Docking using Global Correlation . . . . . . . 93
`
`6.1.1 Design and Implementation . . . . . . . . . . . . . . . 94
`
`6.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
`
`6.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
`
`7 Conclusion
`
`99
`
`7.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
`
`7.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
`
`A Monte Carlo Methods
`
`101
`
`A.1 A Review of Monte Carlo Methods
`
`. . . . . . . . . . . . . . . 101
`
`A.1.1 An Early Monte Carlo Algorithm . . . . . . . . . . . . 101
`
`A.1.2 Numerical Integration using Monte Carlo Methods
`
`. . 102
`
`A.1.3 Monte Carlo, Beyond Simple Integration . . . . . . . . 104
`
`A.2 A Review of Parallel Pseudorandom Number Generation . . . 105
`
`A.2.1 Generating Random Numbers on Deterministic Com-
`
`puters . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
`
`A.2.2 Parallel Pseudorandom Number Generation . . . . . . 108
`
`A.2.3 Assessing the Quality of Pseudorandom Number Se-
`
`quences
`
`. . . . . . . . . . . . . . . . . . . . . . . . . . 110
`
`B The SRC-6 Reconfigurable Computer
`
`114
`
`iii
`
`
`
`List of Figures
`
`1.1 A generic reconfigurable computer architecture.
`
`. . . . . . . .
`
`2
`
`2.1 Computational density of FPGAs and Intel microprocessors.
`
`Image from ref. [6]. . . . . . . . . . . . . . . . . . . . . . . . . 12
`
`2.2 Measurement of processing time for an application running on
`
`a reconfigurable computer. Image from ref. [9]. . . . . . . . . . 14
`
`3.1 Scatter plot of (x, y) pairs randomly sampled from [0, 1]2. . . . 18
`3.2 Processing engines in the FPGA, each independently generat-
`
`ing pseudorandom numbers.
`
`. . . . . . . . . . . . . . . . . . . 20
`
`3.3 Simulation of a set of 3 parallel pseudorandom number gener-
`
`ators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
`
`3.4 The operation of a single processing engine.
`
`. . . . . . . . . . 22
`
`3.5 Top-level design for random number generators on two FPGAs. 23
`
`3.6 Processing engines each performing N simulations to estimate
`
`π. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
`
`3.7 Percentage of FPGA resources used on a single FPGA, as a
`
`function of the number of processing engines included.
`
`. . . . 28
`
`3.8 Performance of Monte Carlo estimation of π on an SRC-6
`
`MAPe with 15 processing engines, using one FPGA, compared
`
`with that of an x86 processor.
`
`. . . . . . . . . . . . . . . . . . 30
`
`3.9 Performance of Monte Carlo estimation of π on an SRC-6
`
`MAPe with 30 processing engines, using two FPGAs, com-
`
`pared with that of an x86 processor.
`
`. . . . . . . . . . . . . . 31
`
`iv
`
`
`
`3.10 Performance of Monte Carlo estimation of π on an SRC-6
`
`MAPe, using one FPGA, as a function of the number of process-
`
`ing engines used.
`
`. . . . . . . . . . . . . . . . . . . . . . . . . 32
`
`3.11 Histogram, with 500 bins, of 200000 random numbers gen-
`
`erated on the SRC-6 (blue), and 2000000 random numbers
`
`generated using MATLAB’s randn function, scaled (red). . . . 37
`
`3.12 Processing engines each generating N independent paths of
`
`stock price movements to simulate possible payoff scenarios.
`
`. 39
`
`3.13 Performance of a Monte Carlo options pricing simulation on an
`
`SRC-6 MAPe, using one FPGA, showing the speed difference
`
`between unflattened loops and flattened loops. . . . . . . . . . 44
`
`3.14 Performance of a Monte Carlo options pricing simulation on
`
`an SRC-6 MAPe, compared with that of an x86 processor.
`
`. . 46
`
`4.1 Examples of Birth, Survival and Death in the Game of Life.
`
`. 52
`
`4.2 A grid representing a configuration in the Game of Life cellular
`
`automaton.
`
`. . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
`
`4.3 A grid decomposed into four subgrids, by dividing the grid
`
`vertically and horizontally.
`
`. . . . . . . . . . . . . . . . . . . . 55
`
`4.4 A grid decomposed into five subgrids, by dividing the grid into
`
`horizontal strips.
`
`. . . . . . . . . . . . . . . . . . . . . . . . . 55
`
`4.5 A grid decomposed into five subgrids, showing the ‘ghost re-
`
`gions’ for each slice. . . . . . . . . . . . . . . . . . . . . . . . . 57
`
`4.6 Sequence of processor communication to synchronize ghost re-
`
`gions, while avoiding deadlock. . . . . . . . . . . . . . . . . . . 59
`
`4.7 Processing engine performing an N timestep simulation, us-
`
`ing the design that all communication is done between the
`
`processing engines.
`
`. . . . . . . . . . . . . . . . . . . . . . . . 61
`
`4.8 Processing engines performing an N timestep simulation. The
`
`program forks processing engines, then joins them once every
`
`iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
`
`v
`
`
`
`4.9 Layout and flow of the data in the system during a cellular
`
`automata simulation. . . . . . . . . . . . . . . . . . . . . . . . 66
`
`4.10 Performance of a Game of Life simulation on an SRC-6 MAPe,
`
`compared with that of an x86 processor.
`
`. . . . . . . . . . . . 72
`
`4.11 Transition of an FHP lattice gas automata. Solid arrows show
`
`the configuration at the current timestep, and hollow arrows
`
`show the configuration at the next timestep. Image from Luo,
`
`ref. [37].
`
`. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
`
`4.12 Example transition rules for FHP lattice gas automata. Image
`
`from Luo, ref. [37].
`
`. . . . . . . . . . . . . . . . . . . . . . . . 75
`
`4.13 Layout and flow of the data in the system during a lattice gas
`
`automata simulation. . . . . . . . . . . . . . . . . . . . . . . . 77
`
`4.14 Performance of a Lattice Gas Automata simulation on an
`
`SRC-6 MAPe, with 5 processing engines, using one FPGA,
`
`compared with that of an x86 processor.
`
`. . . . . . . . . . . . 80
`
`5.1 Processing engines each performing Sobel edge detection on a
`
`slice of the input image.
`
`. . . . . . . . . . . . . . . . . . . . . 85
`
`5.2 Layout and flow of the data in the system during edge detection. 87
`
`5.3 Original image (left). Image with detected edges shown (right). 88
`
`5.4 Caching pixels to reduce the number of reads from onboard
`
`memory.
`
`. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
`
`6.1 A macromolecular model (yellow) docked in an electron mi-
`
`croscope density map (blue mesh). Image from ref. [45].
`
`. . . 92
`
`6.2 Fitting a macromolecular model by rotation and translation.
`
`Image from ref. [46].
`
`. . . . . . . . . . . . . . . . . . . . . . . 93
`
`6.3 An overview of the global correlation docking algorithm. Im-
`
`age from ref. [46]. . . . . . . . . . . . . . . . . . . . . . . . . . 95
`
`6.4 Processing engines each search to find a maximum correlation
`
`for the Euler angles they are assigned.
`
`. . . . . . . . . . . . . 97
`
`vi
`
`
`
`B.1 The SRC-6 in a rack at the National Center for Supercomput-
`
`ing Applications.
`
`. . . . . . . . . . . . . . . . . . . . . . . . . 115
`
`B.2 The architecture of the SRC MAP module. . . . . . . . . . . . 115
`
`vii
`
`
`
`List of Tables
`
`3.1 Results of statistical tests for quality of a pseudorandom se-
`
`quence generated by 15 processing engines. . . . . . . . . . . . 24
`
`3.2 Place and Route results for a two-FPGA, 30 processing engine
`
`Monte Carlo simulation to estimate π.
`
`. . . . . . . . . . . . . 27
`
`3.3 Place and Route results for a one-FPGA, one processing en-
`
`gine Monte Carlo options pricing simulation. . . . . . . . . . . 42
`
`3.4 Place and Route results for a one-FPGA, two processing en-
`
`gine Monte Carlo options pricing simulation. . . . . . . . . . . 43
`
`4.1 Place and Route results for a one-FPGA, four-processing en-
`
`gine Game of Life cellular automata simulation.
`
`. . . . . . . . 69
`
`4.2 Place and Route results for a one-FPGA, eight-processing en-
`
`gine Game of Life cellular automata simulation.
`
`. . . . . . . . 70
`
`4.3 Place and Route results for a one-FPGA, five-processing en-
`
`gine Lattice Gas Automata simulation.
`
`. . . . . . . . . . . . . 78
`
`5.1 Place and Route results for a one-FPGA, three processing en-
`
`gine edge detection program. . . . . . . . . . . . . . . . . . . . 87
`
`5.2 Performance of Sobel Edge Detection on an SRC-6 MAPe,
`
`with three processing engines, using one FPGA, compared
`
`with that of an x86 processor.
`
`. . . . . . . . . . . . . . . . . . 89
`
`viii
`
`
`
`Acknowledgements
`
`This thesis marks the culmination of my undergraduate career, and so I
`
`find this to be the perfect opportunity to acknowledge those individuals who
`
`have enriched my undergraduate experience, making it immeasurably more
`
`beneficial and exciting than it otherwise might have been. Of course by
`
`naming people, I run the risk of making serious omissions. If I do, I apologise
`
`and ask that you know that I am not any less grateful for your help over the
`
`years.
`
`First and foremost, I am exceedingly grateful to my advisor, Professor
`
`Michael Inggs. He granted me an exceptional amount of independence so
`
`that I could pursue what I thought to be the most interesting and impor-
`
`tant problems, and in so doing forced me to see first-hand what the world
`
`of research is about. This academic freedom has allowed me to enjoy an
`
`excellent six months following my every intellectual whim, and being able to
`
`count most of it as work! I am thankful for his unreasonable confidence in a
`
`student he had barely met, and for his willingness to put his reputation (and
`
`funds!) on the line by recommending me to his collaborators and contacts,
`
`and then leaving me to ‘go forth and do good things’. I can only hope that
`
`the final result, this thesis, is an apt repayment of his faith in me.
`
`This thesis would not exist as it stands were it not for the help I have
`
`received from a large number of people. I would like to extend my gratitude
`
`to all the folk at the National Center for Supercomputing Applications at
`
`the University of Illinois for their hospitality and support. Dr Radha Nand-
`
`kumar was instrumental in bringing me over to the United States for two
`
`months so that I could work with the NCSA’s Innovative Systems Labora-
`
`tory Reconfigurable Computing group, and so I am especially indebted to
`
`her. Thank you, Radha! I was apparently the first visitor in an experimental
`
`‘International Affiliates’ program, and although I certainly gained more from
`
`the experience than I was able to give back to the NCSA, I hope they found
`
`my visit to be a beneficial exercise.
`
`Dr Craig Steffen was my official host at the NCSA, and I would like to
`
`ix
`
`
`
`thank him for the logistical, technical and intellectual support that he pro-
`
`vided during my stay. Craig did far more than I ever could have reasonably
`
`imagined he would - from driving for 6 hours to Chicago to pick me up ‘so
`
`that I wouldn’t have to worry about catching the train’, to making sure
`
`I had literally everything I needed to start working straight away. I spent
`
`many hours with Craig pestering him about technical matters, discussing ap-
`
`proaches to problems and otherwise learning from him. He was an excellent
`
`mentor. Craig also had to put up with an enormous number of questions to
`
`satisfy my child-like curiosity about why things are the way they are in the
`
`U.S. - such as what the holes in electrical plug pins are for, why there is a
`
`big, slit rubber membrane in kitchen sinks, and why so many houses are built
`
`out of wood. Craig, thanks again for the lunches out and for periodic invites
`
`to your home for dinner. Thanks too, to Craig’s wife Becky for sharing her
`
`home with me and for the cooking. I hope to repay the debt to both of you
`
`one day.
`
`The ISL was an exceptionally friendly and productive group to work in. I
`
`had a lot of fun on the third floor - thanks, guys. Besides the joking around,
`
`I also learned a great deal from discussions I had with you all. Dr Volodymyr
`
`‘Vlad’ Kindratenko and David Pointer certainly deserve special mention for
`
`their technical guidance and general willingness to help me whenever I was
`
`stuck or in uncharted waters. Both Vlad and David went out of their way
`
`to help with questions I had that they didn’t have the answers to. This
`
`wasn’t very often though - their technical knowledge was an inspiration, and
`
`a resource from which I benefited regularly. Thank you.
`
`At the Reconfigurable Systems Summer Institute in Urbana in July, I
`
`met Dan Poznanovic and Dr Jeff Hammes, and would like to acknowledge
`
`helpful discussions with them. In particular, Jeff, who is a lead developer of
`
`the MAPC compiler at SRC, provided numerous useful pieces of advice and
`
`workarounds to compiler limitations.
`
`I was able to investigate the diverse mix of academic domains in this thesis
`
`that I did thanks to the generosity of several researchers who individually
`
`x
`
`
`
`agreed to meet with me and give me ‘crash courses’ in their disciplines and
`
`work, with virtually no hope of any payoff. Dr Dan Jacobsen from the
`
`National Bioinformatics Network drove out to UCT to chat with me about his
`
`work in integrating biological databases. The staff at the Square Kilometer
`
`Array / Karoo Array Telescope offices in Pinelands kindly showed me round
`
`one afternoon, and gave me a useful overview of the projects and their needs.
`
`Dr Athol Kemball, an astronomer at the NCSA, offered helpful advice. Artur
`
`Szostak, Gareth de Vaux and Bruce Becker from the UCT-ALICE group
`
`spent hours explaining the functioning of the dimuon High Level Trigger in
`
`the ALICE project to me, and giving me advice on where I would be best
`
`advised to focus my efforts.
`
`I am particularly grateful to Dr Zeblon Vilakazi from UCT-ALICE and
`
`iThemba Labs. Dr Vilakazi has spent a great deal of time discussing particle
`
`physics and the engineering behind detectors with me since I first approached
`
`him about his work on the ALICE project in 2004.
`
`I have learned a lot
`
`from him about what modern experimental physics is all about, even if his
`
`confidence in my ability as a dabbling physicist is far greater than my own!
`
`I also thoroughly enjoyed a short visit I had to CERN in June 2006 at Dr
`
`Vilakazi’s invitation - it was certainly a memorable experience, and a much
`
`appreciated detour on my way to Illinois.
`
`Dr Michelle Kuttel, besides her direct impact on this thesis, in the form
`
`of the work on atomic model docking in electron density maps and on cel-
`
`lular automata, has undoubtedly had a far more general influence as well.
`
`Dr Kuttel has been an exceptional mentor since I did a class project under
`
`her guidance in 2004. I have spent innumerable hours in her office discussing
`
`research in computer science and computational science, and academia and
`
`research in general. Besides stimulating intellectual conversations, Dr Kut-
`
`tel has also given me numerous opportunities to develop - by allowing me
`
`to guest lecture in her Distributed Systems course, involving me in the Sci-
`
`entific Clustering Applications Workshops and allowing me to speak at the
`
`second SCAW. She has also patiently explained computational chemistry to
`
`xi
`
`
`
`a neophyte and walked me through her atomic model docking work. In the
`
`CS2 project I took with her, I learned more about cellular automata that
`
`I otherwise might have, and had an exceptionally fun time building a CA
`
`lattice gas simulator with Shen Tian. This project, although not strictly
`
`research, was probably the closest I had come at the time.
`
`Shen, besides being a terrific friend, was the best project partner I could
`
`ever ask or hope for. His technical brilliance is nearly unmatched, and he is
`
`a hard and dedicated worker (when he isn’t distracted by Google or anime!).
`
`Our project in second year was undoubtedly the most exciting group project
`
`I’ve been involved in, and I benefited enormously from it, largely due to
`
`Shen’s involvement.
`
`I have continued our work in this thesis, and it goes
`
`almost without saying that had I not undertaken the original lattice gas
`
`project with Shen, this would not have happened.
`
`I would like to thank Justin Kelleher (now back in Ireland), Professor
`
`Andrew Hutchison and Professor Pieter Kritzinger from the Data Network
`
`Architectures group, as well as the various group students I’ve had interac-
`
`tions with, for involving me in the DNA activities, which gave me another
`
`perspective on what a research group does and how it goes about its busi-
`
`ness. Justin during his time at UCT was a great champion for me, and I had
`
`an excellent time discussing the state-of-the-art in software engineering with
`
`him. I would also like to especially thank Prof. Hutchison for giving me a
`
`glimpse into the security research field, and for presenting our paper at ISSA
`
`2006 while I was away.
`
`Professor Vasco Brattka, UCT’s lone quantum computing theory expert,
`
`has taught me a tremendous amount about what theory research is like.
`
`His quantum computing course was truly superb - flawlessly prepared, with
`
`excellent notes. Prof. Brattka’s deep insights into the mathematical founda-
`
`tions of quantum computing have been enlightening and inspiring. He has
`
`the enviable quality of never jumping to conclusions, and always insists on
`
`a well-thought-out argument before settling on any statement. I am grateful
`
`for his patiently sitting through descriptions of my hair-brained algorithm
`
`xii
`
`
`
`ideas. His gentle encouragement to continue working and thinking despite
`
`my minimal advances were much appreciated.
`
`Besides the abovementioned faculty, I have encountered several other lec-
`
`turers who have made my time at UCT more enjoyable. Professor Martin
`
`Braae, who headed the Department of Electrical Engineering for my first
`
`three years here, and Deputy Dean Professor Barry Downing have both been
`
`very supportive of my efforts, never quibbling when I broke rules on curricu-
`
`lum changes or otherwise tried to jump over bureaucratic barriers. Professor
`
`Anthony Chan’s seemingly unbreakable spirit and work ethic have been an
`
`inspiration to watch. In no particular order, Mr Alan Rynhoud, Mr Stephen
`
`Schrire, Professor Jonathan Tapson, Dr Fred Nicolls, Dr Andrew Wilkinson,
`
`Mr Samuel Ginsberg, Mr Simon Winberg, Professor Cathal Seoighe, Pro-
`
`fessor Raoul Viollier, Professor Sandro Perez, Dr Roger Fearick, Dr Gary
`
`Tupper and Professor Andy Buffler have all taught me at some stage, and
`
`were memorable for their depth of knowledge in their respective fields and
`
`the quality of their expositions. Mme. Ewa Swida-Reid provided an oft en-
`
`tertaining outlet for ‘les ing´enieurs’, in our solitary foray into the humanities.
`
`Not entirely outside the academic realm, I would like to thank the editors
`
`at Varsity when I worked at the paper, Steven Kenyon, David Wilson and
`
`Cathryn Reece, and the rest of the staff, for letting this engineer pretend to
`
`be a journalist for a few hours each week.
`
`Last but not least, my friends and family deserve my thanks for their
`
`support.
`
`In addition, the Smuts community has been excellent, and I’m
`
`especially grateful to my peers in the Electrical Engineering department for
`
`creating a friendly environment in which to work and learn, and for teaching
`
`me all that they have. Thanks finally to the Copeland tribe — Dean, Devin,
`
`Jason and Mike — for an excellent year, on top of those that went before.
`
`xiii
`
`
`
`Chapter 1
`
`Introduction
`
`This thesis investigates the feasibility of using reconfigurable computing tech-
`
`nology for performing scientific computations. In this introduction, we pro-
`
`vide a brief background and motivation for this investigation, provide details
`
`of the objectives of the thesis, and outline the contents of the thesis.
`
`1.1 Background
`
`The past four decades has seen an exponential rise in the speed of processors.
`
`Moore’s Law, a prediction made by Gordon Moore of Intel in 1965, has
`
`proven surprisingly accurate — the number of transistors on processors has
`
`doubled nearly every 24 months since Moore’s prediction that this would be
`
`the case. The ability to fabricate chips with more transistors has resulted in
`
`proportionate speed increases.
`
`However, during the past 3 years, microprocessor manufacturers have
`
`been experiencing difficulty in dramatically increasing the performance of
`
`their processors. Clock speed increases have all but come to a halt due to
`
`limitations with present fabrication technology, which has resulted in manu-
`
`facturers seeking alternative means to providing consumers with greater per-
`
`formance. Unfortunately no revolutionary architectural changes have been
`
`forthcoming, so the manufacturers have instead simply opted to build mul-
`
`1
`
`
`
`ticore chips.
`
`Power consumption on present microprocessors has also become a con-
`
`siderable issue, besides the problems that power dissipation causes with at-
`
`tempts to increase performance. Large high performance computing centres
`
`often consume megawatts of power, leading to electricity being a considerable
`operating expense1. Thus there is also an economic motivation to investigate
`
`lower power technologies.
`
`Field Programmable Gate Arrays (FPGAs) have long been used in digi-
`
`tal signal processing (DSP) applications, where relatively simple algorithms,
`
`involving primarily integer operations, are performed on large quantities of
`
`data. Over the past decade, several projects have been initiated to investi-
`
`gate the use of FPGAs for general-purpose scientific computation [4, 5, 6].
`
`This ideal led to its natural extension — the development of hybrid com-
`
`putational machines that use both traditional microprocessors and FPGAs.
`
`Such hybrid architectures, known as reconfigurable computers have recently
`
`been introduced commercially by vendors such as Cray, SRC and SGI.
`
`Figure 1.1: A generic reconfigurable computer architecture.
`
`Successes with current reconfigurable computers have largely been lim-
`
`1The National Center for Supercomputing Applications at the University of Illinois
`currently uses more than 3MW of power. The planned National Science Foundation
`petaflop supercomputer may require upwards of 20MW.
`
`2
`
`
`
`ited to specific signal processing applications or other specialized algorithms.
`
`However, it is certainly possible [6] that reconfigurable computers may be-
`
`come important technology for more general high performance computations,
`
`and in particular scientific simulation and computation.
`
`1.2 Objectives
`
`In this thesis we aimed to investigate the state-of-the-art in reconfigurable
`
`computing, and analyze the ability of current reconfigurable computing tech-
`
`nology to perform scientific computations. Methodology for developing soft-
`
`ware for reconfigurable computers was also to be investigated.
`
`These overall objectives are now described in more detail.
`
`1.2.1
`
`Investigate the State-of-the-art in Reconfigurable
`
`Computing
`
`A sizeable number of academic and commercial projects have been attempted
`
`over the past decade, with the intention of furthering the use of FPGAs, and
`
`more generally reconfigurable computers, in computationally intensive envi-
`
`ronments outside of signal processing. Academic projects include SPLASH
`
`from the early 1990s [4], the MIT FPGA-based Cellular Automata Machine
`
`from the mid-1990s, the Berkeley Emulation Engine (BEE) and its succes-
`
`sor, BEE2 [6], and Brigham Young University and Boston University’s ef-
`forts, amongst others. Commercial offerings now include those from Cray2,
`
`SGI, SRC, Nallatech and Linux Networx. Xilinx, the world’s leading FPGA
`
`manufacturer, has also recently demonstrated its own compiler efforts for a
`
`high-level language for programming its FPGAs.
`
`One of the broader aims of this thesis was to survey the current state-of-
`
`the-art in reconfigurable computing, to look for key differences between the
`
`2The current understanding in the industry is that Cray is discontinuing its XD1 line
`of reconfigurable computers, but may licence to third parties the technology it acquired
`from Octigabay that led to the development of the Cray XD1.
`
`3
`
`
`
`available technologies, and provide a review of what a researcher new to the
`
`area needs to know.
`
`1.2.2
`
`Implement Scientific Computing Algorithms on
`
`Reconfigurable Computers
`
`A core aim of the thesis is the implementation of several scientific computing
`
`codes on a reconfigurable computer. We aimed to identify several scientific
`
`computing problems that are computationally intensive, and of interest to re-
`
`searchers in South Africa, and rewrite them for execution on a reconfigurable
`
`computer. This, of course, implies the need to investigate how to develop
`
`software for reconfigurable computers.
`
`The specific problems we chose to implement are: Monte Carlo Simula-
`
`tions; Cellular Automata Simulations; Edge Detection (Image Processing)
`
`and Macromolecular Docking.
`
`1.2.3 Analyze the Performance of Scientific Applica-
`
`tions on Reconfigurable Computers
`
`For each problem implemented on a reconfigurable computer, we aimed to
`
`analyze the performance of the implementation on the reconfigurable ver-
`
`sus its performance on a classic x86 microprocessor architecture. We also
`
`aimed to investigate how performance on reconfigurable computers scales.
`
`Finally, we aimed to analyze the performance and available resources to de-
`
`termine which aspects of current reconfigurable computers are performance
`
`bottlenecks.
`
`1.2.4 Provide Guidance on the Methodology for De-
`
`veloping Software for Reconfigurable Computers
`
`During the implementation of algorithms on a reconfigurable computer, we
`
`expected to find some effective method for parallelizing and porting code.
`
`4
`
`
`
`We also expected to encounter a series of technical difficulties, given the
`
`immature nature of the technology, and aimed to provide a brief reference in
`
`this thesis that gives our solutions to common problems one might encounter
`
`when porting code.
`
`1.3 Motivation for Problems Studied
`
`As we have mentioned, we implemented four different types of algorithm on
`
`reconfigurable computers during our study. These were Monte Carlo Simu-
`
`lations; Cellular Automata Simulations; Edge Detection (Image Processing)
`
`and Macromolecular Docking.
`
`In selecting algorithms to implement, one of our broad objectives was to
`
`pick those that would be of interest to researchers in South Africa. The first
`
`three problems that we selected are very broad, and naturally satisfy this
`
`criterion: Monte Carlo simulations are used in a wide variety of scientific
`
`computing areas, including computational chemistry and physics, as well as
`
`financial mathematics. Cellular automata are less widely used than Monte
`
`Carlo simulations, but have equally broad application, as Wolfram [10] has
`
`demonstrated.
`
`Image processing algorithms, including edge detection, are
`
`also widely used in industry in South Africa, and there are several research
`
`groups in academia and industry that work extensively with such algorithms.
`
`Macromolecular docking is a far more specific application than the pre-
`
`ceding three. It is under active study as part of the National Bioinformatics
`
`Network research programme, and is an important research topic in experi-
`
`mental techniques for structural biology.
`
`In selecting the algorithms to implement, we also aimed to pick a set
`
`of algorithms that were dissimilar, so that we could observe how different
`
`types of algorithms performed on reconfigurable computers.
`
`In our study
`
`of Monte Carlo simulations, we implemented a simple estimator of π that
`
`wasn’t very floating-point calculation intensive, and a financial options pric-
`
`ing simulation, which was quite floating-point calculation intensive. Neither
`
`5
`
`
`
`of these applications were data intensive. The cellular automata models we
`
`implemented were both discrete, and did not involve any floating-point cal-
`
`culations. Relatively large amounts of data did, however, need to be handled.
`
`The edge detection algorithm was also data-intensive, and involved only in-
`
`teger arithmetic. Its advantage was that is performance on FPGAs and on
`
`microprocessors is fairly well known, so it provides a good basis for compar-
`
`ison. Finally, the docking problem involves a large amount of floating-