`
`Jeffrey M. Arnold, Duncan A. Buell, Dzung T. Hoang
`Daniel V. Pryor, Nabeel Shirazi, Mark R. Thistle
`IDA Supercomputing Research Center
`17100 Science Drive
`
`Bowie, MD 20715
`
`Abstract
`
`Splash 2’ is an attached parallel processor in which
`the computing elements are user programmable FPGA
`devices. The architecture ofSplash 2 is designed to ac-
`celerate the solution of problems which ezhibit at least
`modest amounts of temporal or data parallelism. Ap-
`plications are deoelopecl by writing descriptions of al-
`gorithms in VHDL, which are then iteratively refined
`and debugged within a simulator. Once an application
`is determined to be functionally correct in simulation.
`it is compiled to a gate list and optimized by logic syn-
`thesis. The gate list is then mapped onto the FPGA
`architecture by automatic placement and routing tools
`to form a loadable FPGA object module. A Clangaage
`library and a symbolic debugger comprise the execution
`environment. The Splash 9 system has been shown to
`he efiectioe on a variety of applications, including text
`searching, sequence analysis, and image processing.
`
`1
`
`Introduction
`
`In recent years the emergence of reconfigurable
`Field Programmable Gate Arrays (FPGAs)[3l[7} has
`given rise to a new form of computing in which the ar-
`chitecture of a computer may evolve over time, chang—
`ing to fit the needs of each application it executes.
`With such a computer, the programmer may tailor the
`architecture to fit the problem solution, rather than
`vice versa. Several FPGA based architecture have
`been proposed and built recently. These machines all
`have very different programming models, but each re-
`lies on the use of reconfigurable hardware to customize
`the architecture to particular applications. (Seem for
`a more extensive bibliography.)
`in
`Splash 2[2]
`is an attached parallel processor
`which the computing engines are user programmable
`FPGA devices. The architecture of Splash 2 is de-
`signed to accelerate the solution of problems which
`
`1063-6401993 $08.00 9 1993 $333
`
`43?-
`
`exhibit at. least modest amounts of temporal paral-
`lelism (pipelining) or data parallelism (single instruo
`tion multiple data stream). High [/0 bandwidth
`makes Splash 2 particularly useful for problems with
`large data sets or continuous data streams. Finally,
`because the architecture may be shaped to fit
`the
`problem, Splash 2 is well suited to problems in which
`the size of the fundamental data objects do not match
`a conventional architecture (eg. streams of character
`text}.
`
`2 Splash 2 Architecture
`
`The Splash 2 system consists of a Sun Sparcsta—
`tion host, an interface board, and from one to sixteen
`Splash array boards, as shown in figure 1. The in—
`terface board contains a programmable system clock
`and provides DMA access to the host memory through
`two user programmable FPGA devices, XL and KR,
`which are used to process incoming and outgoing data
`streams.
`
`Each array board contains 16 processing elements
`{PEs}, X1 through X16, arranged in a linear array and
`fully connected via a 16x16 crossbar switch which is
`regulated by the 17th control element, X0. Each of the
`17 computing elements consists of a Xilinx X04010
`FPGA and 256K 16 bit words of memory, which is
`also directly addressable by the host. Each processing
`element has 36 bit. iii-directional data paths to its left
`neighbor,
`to its right neighbor, and to the crossbar
`switch. The input to the array is provided by XL
`through the SIMD bus to X1 of the first board and to
`X0 of every board. Multiple array boards are linked
`together by extending the linear data path from the
`right side of one board to the left side of the next. The
`last. board in the chain is connected to the “Bus, and
`thence to KR. on the interface board.
`
`(cid:51)(cid:72)(cid:87)(cid:76)(cid:87)(cid:76)(cid:82)(cid:81)(cid:72)(cid:85)(cid:3)(cid:48)(cid:76)(cid:70)(cid:85)(cid:82)(cid:86)(cid:82)(cid:73)(cid:87)(cid:3)(cid:38)(cid:82)(cid:85)(cid:83)(cid:82)(cid:85)(cid:68)(cid:87)(cid:76)(cid:82)(cid:81)(cid:3)(cid:16)(cid:3)(cid:40)(cid:91)(cid:17)(cid:3)(cid:20)(cid:19)(cid:25)(cid:19)(cid:15)(cid:3)(cid:83)(cid:17)(cid:3)(cid:23)(cid:27)(cid:21)
`Petitioner Microsoft Corporation - EX. 1060, p. 482
`
`
`
`Splash Boards
`
`
`
`
`
`flii-
`--
`annnnmn
`
`
`
`Figure 1. The architecture of the Splash 2 system.
`
`The crossbar switch is capable of storing up to eight
`separate, dynamically selectable configurations, with
`each configuration specifying a different set of con-
`nections among the 16 ports. The configuration in
`use in any given clock cycle is selected by the con-
`trol element, X0, which shares the 36 bit connection
`to the crossbar with X16. X0 may therefore receive
`broadcast data from the host on the SIMD bus and
`rebroadcast it through the crossbar to each of the 16
`PEs.
`
`The Sparcstation host performs a variety of con—
`trol functions for Splash 2. The host is responsible
`for downloading the configuration information to the
`FPGA computing elements, XD—Xlfi, the interface el~
`ements, XL and XR, and the crossbar switch. There
`are a variety of both synchronous and asynchronous
`mechanisms with which the host may communicate
`with the Splash system, including direct access to the
`computing element memories and the ability to start
`and stop the system clock.
`
`As a research vehicle Splash 2 was designed to
`support a variety of programming models,
`includ-
`ing a single—instruction multiple—data stream {SIMD}
`
`model, a one dimensional pipelined systolic modelI
`and several higher dimensional systolic models. SIMD
`applications may be programmed on Splash 2 by using
`the X0 element and the crossbar switch on each board
`
`to broadcast instructions and data to all processing el—
`ements simultaneously. The instruction stream is sent
`from the host to the X0 chip on each board via the
`SIMD Bus. The X0 elements may in turn decode this
`instruction stream, possibly by fetching microcode
`from their external memories. This decoded instruc-
`
`tion is then broadcast through the crossbar switch to
`each of the 16 processing elements on the board. Each
`processing element is programmed with one or more
`identical SIMD computing engines. These SIMD en»
`gines simultaneously receive and execute the decoded
`instruction, performing nearest neighbor communica-
`tion through the linear data path. Global synchro-
`nization is supported through an AND ,1“ OR reduction
`network.
`
`One dimensional systolic applications may be
`mapped onto Splash 2 by using the linear data path
`to form a continuous pipeline from the host1 through
`the array, and back to the host. The crossbar switch
`
`483
`
`(cid:51)(cid:72)(cid:87)(cid:76)(cid:87)(cid:76)(cid:82)(cid:81)(cid:72)(cid:85)(cid:3)(cid:48)(cid:76)(cid:70)(cid:85)(cid:82)(cid:86)(cid:82)(cid:73)(cid:87)(cid:3)(cid:38)(cid:82)(cid:85)(cid:83)(cid:82)(cid:85)(cid:68)(cid:87)(cid:76)(cid:82)(cid:81)(cid:3)(cid:16)(cid:3)(cid:40)(cid:91)(cid:17)(cid:3)(cid:20)(cid:19)(cid:25)(cid:19)(cid:15)(cid:3)(cid:83)(cid:17)(cid:3)(cid:23)(cid:27)(cid:22)
`Petitioner Microsoft Corporation - Ex. 1060, p. 483
`
`
`
`may be configured to bypass certain PEs during some
`or all of the processing time. Many higher dimen-
`sional applications can be mapped onto Splash by us-
`ing both the linear data path and the dynamic selec-
`tion capability of the crossbar switch to build higher
`dimensional topologies, such as trees, hypercubes, and
`toroids.
`
`3 Programming Splash 2
`
`The programming environment for Splash 2 is
`based upon the VHSIC Hardware Description Lan—
`guage (VHDL)[5], simulation, and logic synthesis. Ap—
`plications for Splash are developed by writing be-
`havioral descriptions of algorithms in VHDL which
`are then iteratively refined and debugged within the
`Splash 2 simulator. During the course of this iter-
`ation, the VHDL implementation is manually parti—
`tioned into a set of Splash computing element pro-
`grams. Once the implementation is determined to be
`functionally correct in simulation it is compiled into a
`gate list and optimized by logic synthesis techniques.
`The gate list is then mapped onto the F PGA architec—
`ture by automatic placement and routing tools to form
`a loadable FPGA object module. Static timing analy~
`sis tools are applied to the object module to determine
`the maximum operating frequency and the set of crit-
`ical paths. This information may be used to manually
`optimize the design. Finally, a C language interface
`library and a symbolic debugger form the execution
`environment.
`
`4 Applications
`
`4.1 Text Searching
`
`Splash 2 has been programmed by the last three
`authors to detect key words from a dictionary in a
`stream of text [6].
`In a preproceasiug step, a dic-
`tionary is streamed past a hash function. The hash
`function computes an address from each word in the
`dictionary and sets a presence bit in a memory table.
`During the processing, the text is streamed through
`the hash function, an address computed {rom words,
`and the corresponding memory location examined. A
`set presence bit indicates a potential key word hit.
`The individual probabilities of false hits multiply if
`multiple (assumed to be) independent hash functions
`are used, so the overall probability of a false hit can
`be made as small as one likes.
`
`The interface board chip XL unpacks four-character
`(32—bit)
`input words in two two-character “super-
`bytee” which are broadcast on the SIMD bus to X0
`on the Splash 2 array boards. X0 uses its attached
`memory as a translation table to convert upper case
`letters to lower case, and white space characters to a
`simple end—of-word marker. This translation table can
`be used to accommodate multiple alphabets or other
`special processing requirements.
`X0 sends the translated superbyte to X] over the
`crossbar. Xilinx chips Xl-lei run the same program
`with di'fl'erent hash functions receiving data from the
`linear data path, computing an updated hash function
`(difierent for each chip), and performing a lookup into
`memory if the processed superbyte contained an end—
`of-word marker.
`If the lookup is done, the presence
`bit is ANDed to the incoming accumulated presence
`bits and passed onto the next chip in the linear array.
`The final determination of a hit is made by XR on
`the interface board, which reports the existence and
`position in the file of the hit.
`The memories attached to Xl-Xlfi each are 223 bits
`in size. The English language has about 213 words
`(ignoring case—sensitivity).
`In the simplistic scheme
`adopted here, the set of keywords is doubled in size
`to accommodate end-of-word markers in either the
`
`odd or even byte positions. The full English language
`would thus hash into a Splash 2 memory that is about
`If 8 full, and 16 independent hash functions would pro-
`duce false hits with a probability of 1 in 2“, which is
`certainly low enough for nearly any application. A
`“real” dictionary or keyword list could be expected
`to be much smaller (the Unix spell-checking program
`uses a dictionary of about 30,000 words);
`the corre-
`sponding substantial reduction in the probability of
`false hits will be more than make up for the fact that
`text is in fact not “random”.
`
`The program as written for Splash 2 runs at 25
`MHz, processing two bytes of text per clock, for an
`effective text throughput rate of 50 Mbytes per second.
`
`4.2 Macromolecule Sequence Compari-
`son
`
`Splash 2 has been programmed by the third au—
`thor to perform DNA and protein sequence compar-
`isons [4]. The well—known dynamic programming al—
`gorithm is used to compute the edit-distance between
`two strings of characters. The edit distances for all
`prefixes ofa source sequence 5' = 8183...8n and a tar-
`get sequence T = haul,“ form a matrix. With 5.-
`labelling row 1' and t;
`labelling column j of the ma—
`trix,
`the entry in the (£,j) cell of the matrix is the
`
`(cid:51)(cid:72)(cid:87)(cid:76)(cid:87)(cid:76)(cid:82)(cid:81)(cid:72)(cid:85)(cid:3)(cid:48)(cid:76)(cid:70)(cid:85)(cid:82)(cid:86)(cid:82)(cid:73)(cid:87)(cid:3)(cid:38)(cid:82)(cid:85)(cid:83)(cid:82)(cid:85)(cid:68)(cid:87)(cid:76)(cid:82)(cid:81)(cid:3)(cid:16)(cid:3)(cid:40)(cid:91)(cid:17)(cid:3)(cid:20)(cid:19)(cid:25)(cid:19)(cid:15)(cid:3)(cid:83)(cid:17)(cid:3)(cid:23)(cid:27)(cid:23)
`Petitioner Microsoft Corporation - EX. 1060, p. 484
`
`
`
`E. Keith, W. Kleinfelder, D. Kopetaky, A. Kopser,
`J. Kuehn, S. Lucas, M. Mascsgni, M. McCarty, R.
`Minnich, M. Moran, F. More, L. Podrazik, C. Reese,
`J. Schlesinger, P. Schneck, B. Schott, D. Smitley, D.
`Sweely, C. Tscharner. and K. Wallgren.
`
`References
`
`[1} J. M. Arnold, “The Splash 2 software enViron-
`merit,” Proc., IEEE Workshop on FPGAs for Gus-
`tom Computing Machines, 1993I to appear.
`
`[2] J. M. Arnold, D. A. Buell, and E. G. Davis,
`“Splash 2", Proc, 41!: Annual ACM Symp. on Pan-
`oHel Algorithms and Architectures, pp. 316-322,
`1992.
`
`[3] S. D. Brown. R. J. Francis, J. Rose, and Z.
`G. Vraneeic, Field—Programmable Gate Arrays,
`Kluwer Academic Publishers, 1992.
`
`[4] D. Hoang, “Searching genetic databases on Splash
`2”, Prom, IEEE Workshop on FPGAs for Custom
`Computing Machines. 1993, to appear.
`
`{5} IEEE Standard VHDL Language Reference Man-
`ual, IEEE Std 1076—4987, New York, NY, 1988.
`
`[6} D. Pryor, M. Thistle, and N. Shirasi, “Text search—
`ing on Splash 2“, Proc., IEEE Workshop on FP-
`GAs for Custom Computing Machines, 1993, to
`appear.
`
`[7] The X04000 Data Book, Xilinx Inc., San Jose, CA,
`1992.
`
`to I}; plus the cost of
`
`edit-distance from S.- : 31....” to T,- = t3...t,'. This
`distance is the smaller of three quantities:
`1) The distance from SF; to T,- plus the “cost” of
`appending s,- to 33.1.
`2) The distance from S.-
`appending t,- to Cit--1.
`3) The distance from 8,4 to 13.; plus the cost of
`substituting t,- for 5;.
`For DNA sequences, the costs in 1) and 2) are usu-
`ally taken to be 1 and the cost in 3) to be 0 if t; = s.-
`and 2 if not.
`The locality of the information needed for edit dis-
`tance computation permits a systolic method to be
`used which is also highly parallel because all the cells
`on a given antidiagonal can be updated simultane»
`ously.
`In Splash 2 implementation of this algorithm
`for DNA comparisons, 14 copies of a cell-update pro-
`cessing element fit on a single Xilinx chip for a total
`of 224 processing elements per Splash 2 board. This
`implementation can process 12 million characters per
`second. In a. protein sequence version, 8 million char-
`acters per second can be processed.
`
`4. 3 Edge Detection
`
`The high 1,10 bandwidth and the pipelined nature
`of Splash make it well suited to many image process-
`ing applications. One such application which has been
`implemented is a 3x3 gradient operator edge detector.
`The input image is streamed into the array at video
`rates (up to 16 MPixels/sec). For the duration of one
`scan line, the input stream is written to one PE mem-
`ory while the previous two scan lines are read from two
`other memories. The input scan line together with the
`two previous scan lines are used to compute the X and
`Y partial gradients. Another PE memory is used as a.
`table lookup to compute the magnitude and the angle
`of the gradient. At the end of each scan line, the three
`scan line bufl'ers exchange roles so that the second pre-
`vious buffer is over written with the new scan line, and
`the most recently written buffer becomes the previous.
`This is accomplished by rotating among three cross-
`bar configurations at the end of each scan line. An
`“image” consisting of the angle and magnitude of the
`gradient at each pixel is streamed out at the input
`data rate with a. latency of three scan line times.
`
`Acknowledgements
`
`those who have contributed
`We acknowledge all
`to the Splash 2 project, including D. Burns, N. Go—
`letti, S. Cuccaro, B. Fross, M. Gokhale, W. Holmes,
`
`(cid:51)(cid:72)(cid:87)(cid:76)(cid:87)(cid:76)(cid:82)(cid:81)(cid:72)(cid:85)(cid:3)(cid:48)(cid:76)(cid:70)(cid:85)(cid:82)(cid:86)(cid:82)(cid:73)(cid:87)(cid:3)(cid:38)(cid:82)(cid:85)(cid:83)(cid:82)(cid:85)(cid:68)(cid:87)(cid:76)(cid:82)(cid:81)(cid:3)(cid:16)(cid:3)(cid:40)(cid:91)(cid:17)(cid:3)(cid:20)(cid:19)(cid:25)(cid:19)(cid:15)(cid:3)(cid:83)(cid:17)(cid:3)(cid:23)(cid:27)(cid:24)
`Petitioner Microsoft Corporation - Ex. 1060, p. 485
`
`