`
`Building and Using
`a Highly Parallel
`Programmable Logic Array
`
`Maya Gokhale, William Holmes, Andrew Kopser, Sara
`Lucas, Ronald Minnich, and Douglas Sweely
`Supercomputing Research Center
`
`Daniel Lopresti, Brown University
`
`ith a $13,000 two-slot addition
`called Splash, a Sun worksta-
`tion can outperform a Cray-2
`on certain applications. Several applica-
`tions, most involving bit-stream computa-
`tions, have been run on Splash, which re-
`ceived a 1989 Gordon Bell Prize honorable
`mention for timings on a problem that
`compared a new DNAsequenceagainst a
`library of sequences to find the closest
`match. In essencc, Splash is a programma-
`ble linear logic array that can be config-
`ured to suit the problemat hand; it bridges
`the gap between the traditional fixed-func-
`tion VLSI systolic array and the more
`versatile programmable array.
`Asoriginally conceived, a systolic array
`is a collection of simple processing ele-
`ments, each with a fixed, data-independent
`function, along with a one- or two-dimen-
`sional nearest-neighbor communication
`pattern.? The local nature of the communi-
`cation gives the systolic array a high com-
`munications bandwidth, and the simple, fixed
`function gives a high packing density for
`VLSI implementation. However, since the
`functionis built in, the application space of
`a particular systolic arrayis ratherlimited.
`Recognizing the benefit to he gained froma
`more flexible base for systolic algorithm
`implementation, H. T. Kung and colleagues
`
`[oo
`
`Construction of real
`
`hardware and feedback
`
`from real users
`contributed to Splash’s
`design, development,
`and success. For
`certain pattern-
`matching applications
`its price/performance
`ratio is unmatched.
`
`built the Warparray, ! a linear array in which
`each cell is a powerful very-large-instruc-
`tion-word processor. Currently, a two-di-
`mensional array of custom 32-bit proces-
`sors is being built jointly by Intel and
`Carnegie Mellon University.4
`Like the simple fixed-function systolic
`
`array, the linear array of chips comprising
`Splash is programmed at a very low level.
`A hardware implementation of the desired
`algorithm must be synthesized. Unlike the
`fixed-function systolic array,
`the “hard-
`ware” can be reprogrammed and loaded
`with newalgorithms. This is madepossible
`by using field-programmable gate arrays
`(FPGAs) as the chips of the linear array.
`Unlike the programmable systolic array,
`each stageoflinear array does not have an
`instruction set architecture. Rather than
`processors with a fixed instruction set, a
`stage contains several hundred “configu-
`rable logic blocks,” cach of which can be
`configured at the gate level to compute
`certain sorts of Boolean functions. There is
`no fixed number of systolic cells in the
`Splash array. The amount oflogic in cach
`cell determines the numberofsystolic cells
`per chip and therefore the numberof cells
`in the array. Typical applications have eight
`or 16 systolic cells per chip.
`This gate-level programmability enables
`high-speed execution ofalgorithms, since
`onlynecessary circuitry executes. Systolic
`and parallel algorithms implementedat the
`gate Jevel on Splash have achieved speed-
`ups of up to 330 over one Cray-2 proces-
`sor; speedups greater than 10 times are
`achieved routinely.
`
`January 1991
`
`0018-9162/9 1/1 000-008 1$01.00 © 199] IEEE
`
`81
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 81
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 81
`
`
`
`VME bus
`
`8-megabyte VSB staging memory
`
`
`
`Xilinx control chips
`
` FIFO out
`
`
`
`Figure 1. The 32-stage linear array.
`
`Overview of Splash
`
`The Splash design was motivated by a
`systolic algorithm for DNA pattern match-
`ing.® From the outset, the application do-
`main has focused on non-floating-point
`applications suchas pattern matching. Many
`pattern-matching applications must recog-
`nize when two sequencesare similar, even
`though they maynot be identical. Exam-
`ples include speech recognition, data re-
`trieval, and genetic analysis.’ The Splash
`architecture is suited to one-dimensional
`pattern matching. A two-dimensional im-
`plementation with similar FPGAtechnolo-
`gy has been built by Digital Equipment
`Corporation Paris Research Labs.®
`The design of a prototype was begun in
`September 1988 at
`the Supercomputing
`Research Center. In June 1989, Splash was
`released to the SRC user community.
`Operational at that time were five Splash
`systems, the Logic Description Generator
`(LDG)language, and the Trigger symbolic
`debugger. Currently, 16 Splash arrays are
`in use at SRC, Brown University, and else-
`where.
`
`System. Splash consists of two boards,
`the first containing the linear array and the
`second containing a dual-ported memory
`card. The two boards reside in two VME
`(Versabus modified for Eurocard)slots of a
`Sun-3 or Sun-4 workstation (see Figure !).
`The Splash logic-array board holds 32
`Xilinx 3090 programmable gate arrays?
`and 32 memory chips. Two additional Xil-
`inx chips are used for bus control. The
`logic array card connects to the Sun VME
`bus for control and the Sun VME Subsystem
`Bus (VSB) for data I/O. The associated
`
`82
`
`dual-ported memory card connects to the
`Sun VMEbusfor data initialization and
`retrieval and to the Sun VSB busfor data
`I/O to and from the logic array.
`
`Programming. Splash is programmed
`by specifying the logic functions andinter-
`connections of each of 320 configurable
`logic blocks (CLBs) and 144 input/output
`blocks (LOBs) on each ofthe 32 chips. A
`Xilinx 3090 FPGA contains a 20 x 16 grid
`of CLBs surrounded on the perimeter by a
`single layer of IOBs (Figure 2). A CLB has
`a combinatorial logic section, two D flip-
`flops, and an internal control section. The
`CLB can be configured to generate any
`function of five variables, any two func-
`tions of four variables (see Figure 3), or
`some functions of up to seven variables.
`The IOBs provide the interface between
`external package pins andthe internal log-
`ic. Fach IOB has input and outputbuffers,
`which include both registered and direct
`data paths.
`Each Xilinx chip is programmedat the
`gate level using the Logic Description
`Generator language. LDG is a computer-
`aided design tool developed at SRC, with
`language constructs to describe andrepli-
`cate systolic cells and to place thecells on
`achip. Parameterized cell descriptions may
`be written, providing a functionality simi-
`lar to the VHDL (VHSIC hardware de-
`scription language) generate command.
`Splash designs are debugged using the
`Trigger symbolic debugger, also devel-
`oped at SRC. Triggeris similar to a software
`debugger, with user-definable procedures
`and local variables. The values of specific
`locations on the gate array can be examined
`symbolically. The array can be single
`stcpped or stepped in burst mode. Inter-
`
`rupts can either be ignored or can invoke a
`Trigger procedure.
`LDG and Trigger permit rapid design
`turnaround time that is more comparable to
`software than hardware redesign. With
`LDG,it takes only a few keystrokes to
`significantly modify a chip design, which
`can be easily tested with Trigger. These
`design tools, plus the fact that a design can
`be loaded on the board in half a second,
`make it easier to generate and test a new
`chip design on Splash hardware than to
`simulate several different designs before
`committing to hardware.
`
`Hardware development
`
`Designing and developing Splash re-
`quired numerous decisions and trade-offs
`in defining the hardware (and the LDG and
`Trigger, as described in later sections).
`The systolic array, as first envisioned,
`wasto consist of manystages connected in
`a one-dimensionalarray. Each stage wasto
`have three components: a Xilinx FPGA,
`local memory, and a floating-point chip.
`Theinitial design called for dual-ported
`local memoryso that the host could direct-
`ly access all the memory on the board. For
`the prototype, we planned to develop a
`simple one-board system that would plug
`into a Sun workstation, communicating
`with the Sun CPU over the VME bus.
`Because we opted for a single-board sys-
`tem, space becamethe constrainingfactor.
`Thirty-two FPGA/SRAMpairs fit nicely
`on a 9U x 400-millimeter card but left no
`room for floating-point chips. Since the
`application driving the design did not re-
`quire floating-point manipulation, we
`eliminated floating-point chips from the
`
`COMPLTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 82
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 82
`
`
`
`design. That left 32 stages, each with an
`FPGAand an SRAMchip.
`At that
`time,
`the biggest and fastest
`memories were single- ported 128K x 8, 50-
`nanosecond SRAMs. Thus, we were faced
`with choosing between slower, dual-port-
`ed, host-accessible SRAMsandfaster,sin-
`gle-ported memories accessible only
`through the Xilinx chips. Since we expect-
`ed applications to use the local memories
`primarily for constant tables, which would
`be loaded initially through the host and
`accessed only locally, we opted for the
`faster, single-ported SRAMs.
`Finally, we used two more Xilinx chips
`for the VMEbusinterface so that changes
`to the VME interface design could be
`quickly implemented without modifying
`the hardware.
`
`To simulateor not to simulate. Because
`the flow ofsystolic algorithmsis data inde-
`pendent, we could estimatethe prototype’s
`performance on targeted applications be-
`fore actually building the board. However,
`several potential users questioned the ac-
`curacy of these predictions. They felt we
`should simulate the design first, but ini-
`tial estimates showed that simulating the
`Xilinx chips on a Zycad SDE would be
`2,500 times slower than Splash — even for
`simple programs.
`Since the board design was relatively
`simple, we decided to build the board rath-
`er than invest resources ina simulator. This
`turned out to be the right decision. It took
`two engineers only six months to get the
`first board up and running. Since we chose
`not to write a simulator, the software cngi-
`neers developed the programming and de-
`bugging tools simultaneously with the
`hardware.
`
`Communicating with the host. A sim-
`ple VME interface was built first. The
`initial design had specified this as the only
`communication channel to the Sun host.
`Realizing that the Splash card could quickly
`outrun the 1 millisecond/32-bit VME
`transfer rate, we added a second, faster
`communication channel, the VSB bus. This
`necessitated a second off-the-shelf, dual-
`ported memory card as a staging memory
`for Splash. The original VMEinterface
`was retained for control transfers.
`In the revised design, the Sun transfers
`data to and from the staging memory via
`the VMEbus, and Splash communicates
`with the staging memory over the VSB
`bus. The addition of the VSB interface
`considerably complicated the design, but
`at the time wefelt the factor of two in I/O
`
`January 1991
`
`
`
`
`
`
`
`0,31 | 0,32
`
`
`
`
`
`
`
`
`
`
`
`
`
`20,1
`
`beast} 41,2
`
`
`
`
`
`
`
`
`
`Anyfunction
`of up to four
`variables
`
`Any function
`of up to four
`variables
`
`
`
`
`
`
`
`
`Figure 3. A configurable logic block.
`
`speedup compensated for the additional
`complexity.
`
`Design for debuggability. Development
`ofan actual hardware system often leadsto
`incorporating features that might not seem
`necessary in a simulated environment. For
`instance, hardware support for debugging
`programmedapplications on the Splash board
`would not have been a consideration in a
`
`paperstudy. It was clear earlyin the devel-
`opment, however, that we had to design for
`ease of debugging. The designerhasto have
`some way of knowing whatthe gate arrays
`are doing at any given pointin time.
`Fortunately, the Xilinx chips have a fea-
`ture called state readback. Just as the chip
`is programmedbyshifting into the chip a
`64Kserial string of bits, the state can also
`be read out. The configuration of the chip
`
`83
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 83
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 83
`
`
`
`andthe state of all user-definable flip-flops
`on the chip are shifted out. This feature,
`combined with the ability to single step (or
`step in bursts), has proven extremely valu-
`able for debugging user programs.
`Althoughstate readback and single step
`were incorporated into the 32 Xilinx chips
`in thearray, these features were not used
`on the two Xilinx chips for the VMEin-
`terface. To debugthe logic on these chips,
`wehad to use the traditional logic analyzer
`to observe signals on the chip’s external
`pins. To observeinternal signals, however,
`we had to redesign the chip, bringing these
`signals to external pins. These external
`routings made the chip more complicated
`and altered signal timings (longerpaths) of
`critical logic sections. The external routings
`changed frequently and eventually had to
`be removedafter debugging,thus the actu-
`al design was not really debuggedatall.
`Having state readback and a debugger for
`these control chips would have drastically
`reduced the design time for Splash.
`In addition to state readback and single
`stepping, other important debugging fea-
`tures include a user-definable variable-
`speed clock and maskable interrupts. If
`desired, the clock may be stopped as soon
`as an interrupt occurs. The user can op-
`tionally provide data flow control; if the
`input buffer (FIFO) between Splash and
`the VSB memory is empty,
`the clock
`“pauses” allowing the FIFO to fill, insur-
`ing contiguous data from the staging mem-
`ory. Because the contro] logic was imple-
`mented with the Xilinx chips, many
`additional features were added after the
`board was built.
`
`SRAMand Xilinx chip connection.
`Weconsidered twoalternative methods of
`connecting the SRAMsto the Xilinx chips.
`The first, perhaps more straightforward
`approach, wasto dedicate 28 pins of each
`chip to SRAM. Each chip would have its
`ownlocal memory, accessible only to that
`chip. However, we wanted high intercon-
`nectivity between Xilinx chips. Taking away
`28 pins exclusively for the SRAM was
`undesirable.
`The alJternative strategy, which we
`adopted, called for an SRAM connection
`to share lines connecting two chips. This
`had several additional advantages. The
`memory was now accessible to two adja-
`cent chips, giving the designer the option,
`for example. of having every other stage
`have a memoryofsize 256K X 8 or 128K x
`16, Anotherpossibility is for one stage to
`be the reader and the next the writer. If the
`local memoryis not required, the single
`
`84
`
`[oT
`
`Thegoal of the
`architectural study
`was to design and
`implement a
`programmable
`linear array.
`
`dedicated line to it can be disabled, and the
`other 27 pins can be used for communicat-
`ing with the adjacent chip. The hazard of
`this implementation is that the designer
`must coordinate access to the memory so
`that both chips do nottry to access the same
`SRAMat the same time.
`
`Evaluation. We have evaluated Splash’s
`performance on real applications, as well
`as on a set of synthetic benchmarks, and
`found the following of note.
`
`¢ 1/O-limited. The Splash board ts in-
`deed I/O-limited. Many applications could
`Tun al least an order of magnitude faster
`with better I/O.
`* Host inaccessibility to SRAMs. Al-
`though the application designer can work
`around the tack of dual-ported SRAM
`memoriesto the host, the inconvenience of
`getting data into and out ofthe local mem-
`ories rules out some applications that need
`host accessto local memory. Using bigger.
`wider, and faster memories would be an
`important objective for a follow-on design.
`* State readback, The state readback
`capability was originally designed for de-
`bugging. However, we found another im-
`portant, unanticipated use. Often,
`it
`is
`convenient to design a systolic algorithm
`in which the results are accumulated in
`stationary registers in the array. In this
`case, state readback can be used to read
`results at the end of a computation. This
`technique has been used in many Splash
`applications. However, the amountoftime
`to read backthestate (about half a second)
`is often too long. A 64K bit stream is sent
`to the Sun host andfiltered. The remaining
`1,024 bits constitute the desired state in-
`formation. The new Xilinx 4000 series
`discards the extraneous bits (which encode
`each chip’s CLB and IOB configurations)
`al the chip. We plan to use the new part in
`Splash follow-ons.
`
`« Using the VSB. We expected the addi-
`tion of the faster VSB to speed I/O by a
`factor of two. This proved to be the case for
`applications that made repeated passes over
`the data set; however, applications that
`madeonlya single pass paid a penally. The
`data first had to be loaded over the VME
`bus to the staging memoryand then from
`the staging memoryto the logic array board.
`Results also had to go first to the staging
`memory and then through the VMEback to
`the host. Thus, single-pass applications ran
`slower than they would have with only the
`VMEinterface for data. Because we've
`found that applications are typically onc
`pass rather than multiple pass, we are now
`making VSB-less single-board systems per
`the original design.
`
`Not all of these features would have
`been considered if a prototype had not been
`built. The applications work has given us a
`greater understanding of the limitations
`and capabilities of Splash.
`
`Path to the Logic
`Description Generator
`
`Initial language. The intent of the Splash
`project was to study the systolic model of
`computing both architecturally and at the
`languagelevel. The goal of the architectural
`study was to design and implementa pro-
`grammable linear array. The language ob-
`jective was to understand the essential
`language constructs that describe a systolic
`computation; to define or adapt an existing
`language in which those systolic constructs
`were embedded: and,if an appropriate target
`machine was built, to implement a compil-
`er for the language.
`The core systolic constructs we selected
`were (1) the notion of a /ogical systolic cell
`through which data are streamed and (2)
`the replication and interconnection of the
`logical cells to form the systolic array.
`The language construct satisfying the
`first need is called a template, which is
`associated with named input and output
`signals and whose body, in a hierarchical
`fashion, can contain other templates as
`well as language primitives. In response to
`the second need, we developed a concise
`notation to specify the replication and in-
`terconnection of the parts in a template.
`In the first iteration of LDG design, the
`language primitives consisted of the usual
`Booleanlogic operations as well as D flip-
`flops. The language processor expanded
`hierarchically invoked templates until a
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 84
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 84
`
`
`
`primitive was encountered. The primitive
`was then output in a format required by the
`Xilinx tools.
`
`Required revision. Wc implemented this
`initial version of LDG and foundthat pro-
`gramming the Xilinx chip at
`the logic
`equation leve] did not effectively utilize
`the chip. Our experience showedthat using
`Xilinx tools to automatically pack the logic
`equations imto logical CLBs, to assign the
`logical CLBs to physical locations on the
`chip, and finally to route the chip resulted
`in, at most, 10 percent use of the CLBs.
`Our performance requirements demand-
`ed high chip-area usc. Thus, the realities of
`the design environment dictated a change
`to the language: We added CLB templates
`and IOB templates as new primitive rem-
`plates. The designer could configure these
`templates just as with the Xilinx-supplied
`tools (for example, as a function of five
`variables). In addition, we added the con-
`cepts of location and shape to the LDG
`language. Each partin a template ts assigned
`a location on the CLB/IOB grid and a
`rectangular shape. The location can heei-
`ther relative, in terms of parameters passed
`into the template, or absolute. The addition
`of user-directed placement gavethe designer
`complete control over the layout of logic
`on the gate array. In conjunction with the
`replicated part, it becamepossible to spec-
`ify the configuration of an entire chip with
`relatively little effort.
`
`Ergonomics. The LDG syntax and user
`interface also evolved as designers wrote
`LDGprograms and debugged them on real
`hardware. LDG is embedded in Common
`
`Lisp, and the initial language syntax con-
`sisted simply of calls to Lisp functions.
`This required users to develop some famil-
`larity with the Lisp environment, especial-
`ly since syntax errors threw the user into
`the Lisp debugger. Responding to user
`feedback, we designed a more intuitive
`keyword-driven syntax and added exten-
`sive error-handling from within LDG so
`that users did not haveto interact with the
`Lisp debugger.
`Although a graphical editor was avail-
`able from Xilinx, users preferred the text
`interface for its ease of modification. A
`few text changes could modify every CLB
`on every chip, which is very tedious with
`the graphical editor.
`
`LDGexample. The simple example in
`Figure 4 illustrates various LDG language
`constructs. The figure shows a two-bit
`pipeline, Pipe2, with input signals sig,
`
`January 1991
`
`sig], and clk and output signals outO and
`out!. Internally, Pipe2 consists of 16 cop-
`ies of another template,cell. Figure 5 shows
`the internal structure of Pipe2. The arrays
`of signals a and b areinternal to Pipe2 and
`pass the signal between adjacentstages.
`Figure 6 shows the LDG template for
`Pipe2. Note the correspondence between
`the set of input signals in the block diagram
`and the input clause on the second line of
`the 1.DG program (and similarly for out-
`put). The location clause passes in the
`
`Figure 4. A two-bit wide pipeline.
`
`
`
`
`
`
`
`
`
`
`
`
`Figure 5. Internal structure of Pipe2.
`
`(template pipe2
`(input sigO sig} clk)
`(output outO out!)
`(location pos-x pos-y)
`
`(part-list
`( (namepl)
`(part cell)
`(input sigQ sig] clk)
`(ouput (index al) (index bl)
`(ocation row !col) )
`
`( (name p2-15)
`(range! 2 to 15)
`(shape row-major | by 14)
`(start-!row pos-x)
`(start-!col (1 + pos-y))
`(part cell)
`(input (index a(1-!) ) (index b(1-!) ) clk)
`(output (index a!) (index b!) )
`
`(location !row !col) )
`
`( (name p16)
`(start-!row pos-x)
`(start-!col (+ 15 pos-y) )
`(part cell)
`(input (index al5) (index b15) clk)
`(output out0 out))
`
`5 input signals
`; Output signals
`; position parameters
`
`; first part: pl
`; of cell, input sig, sigl, clk
`
`; output afl], b[1]
`
`; second part: p2-15
`; make 14 copies
`; for 1 row and 14 columns
`; at (pos-x, pos-y + 1) for
`
`of cell
`; with inputs af[!-1], b[!-1], clk
`; and outputs af[!], bf!)
`: for ! in the range 2... 15
`
`; third part: p16
`; at (pos-x, 15 + pos-y)
`
`, of cell
`; with inputs a[15], b[15], clk
`; output outd, out
`
`Cocation !row !col) } ; template name
`
`Figure 6. Logic-description-language template for a two-bit pipeline.
`
`85
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 85
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 85
`
`
`
`
`
`starting row and column for parts in the
`emplate. The names used here, pos-x and
`pos-y, are referenced again in a (start-
`'row...) or (start-!coL..) clause for a part.
`Within Pipe2 there are three parts: p1.
`p2-15, and p16. Part p2-15 invokes 14
`copies of the template cell becausc of the
`range clause (range! 2 to 15).
`The (shape row-major...) clause specifies
`he layout of these 14 cells. The first copy of
`cell is placed at location (pos-x, pos-y + 1).
`The next is at (pos-x, pos-y + 2), and so
`orth. Note that the numberof copies of the
`emplate cell specified by the range clause
`(2... 15 = 14) must equal
`the number of
`copies specified by the shape clause (1 row
`x 14 columns = 14), The final two parame-
`ters are being sentto cell, and their values
`are the current row index (!row) and current
`column index (!col), respectively.
`Pipe2 is instantiated with the call com-
`mand.
`
`(call pipe2)
`(input dinO din clk)
`(output doutO dout!)
`(location 1 1)
`
`the
`In the generated Xilinx commands,
`names dinO, dini, and clk will be substitut-
`ed for the dummyinput parameters sig0,
`sig!, and clk (similarly for output). (pos-x,
`pos-y) will be assigned values (1, 1), and
`the expressions using pos-x and pos-y will
`be evaluated so that the cvaluated valucs,
`constantintegers, will be output as a CLB
`location.
`Here. we’ ve omitted the LDG specifica-
`tion ofcell. This template is aCLB template,
`configured simply to pass data through
`unchanged.
`Another useful feature of LDG in im-
`plementing systolic algorithms is its abil-
`ity to write parameterized template defi-
`nitions. The hardware designer can design
`a schema ofa structure, a generalized shift
`register, for example, and then instantiate
`different schema instances byinvoking the
`schema with different parameters. The re-
`petitive layout and the control over place-
`ment are available at the schema level as
`well as in the base language.
`The readeris referred to our earlier work'®
`for examples of schema definition and use.
`
`Splash runtime
`environment
`
`The Splash runtime environment on the
`Sun workstation consists of a symbolic
`
`86
`
`Le
`
`Library routines
`created for use
`by the debugger
`can also be invoked
`from C programs,
`so applications
`programscandirect
`the Splash board.
`
`debugger Trigger and a kernel driver to
`control
`the Splash device and the VSB
`interface. (The debugger borrows muchof
`its code fromthe Horizon Simulator,'! hence
`the name Trigger, son of Horse.) Library
`routines created for use by the debugger
`can also be invoked from C programs, so
`applications programscandirect the Splash
`board and gain access to Splash-related
`symbols just as the debugger does. In ad-
`dition, there are graphical tools to view the
`activity of a single chip or of the entire
`array.
`Below wediscuss some debugger capa-
`bilities and how they can be accessed from
`independent C programs.
`
`Loadingchip designs. All the chips are
`loaded in parallel, with each bit of the 32-
`bit word going to a different chip. The
`entire board can be loadedin half a second.
`The same file may be used for multiple
`chips, a differentfile for each chip, or any
`combination thereof.
`
`Stepping the board. Trigger allows the
`user to step a selected numberof clocks.
`Commonly, the user initially single steps
`the design, monitoring variables on the
`chips at each step. As more and moreofthe
`design starts to work, the user typically
`steps the design through a larger butstill
`well-defined number of clocks. Signals
`from the chips can be designedto interrupt
`or assert a flag in one of the control and
`status registers. In either case, the counted
`clocks are stopped.
`
`Trigger procedures. Trigger allows the
`user to create and invoke procedures, a
`handy capability for frequently used com-
`mands. A basic library of Trigger proce-
`
`dures has grown overtime andis available
`to the users as part of the Triggerlibrary.
`The command language for Triggeris
`similar to the command languageforthe C-
`shell. There are conditional statements,
`while statements, for loops, user-defined
`variables that may contain strings or nu-
`meric values, and a numberofotherstate-
`ments found in many command language
`interpreters. The variables may be user-
`defined and can control flowof the Trigger
`procedure being executed.
`
`Nondestructive readback. The user may
`examine on-chip state at any time and then
`resume the program. In fact, symbols may
`be evaluated as part of conditional looping.
`Thus, a user’s procedure may runa design,
`examine an on-chip variable, and make a
`decision about whether to continue running
`the design based on that variable.
`
`Support for interrupts. The Splash
`board can generate interrupts. These in-
`terrupts are vectored through the kernel
`driver and to the user program via the sigIO
`signal. Trigger allows the user to specify a
`procedure (interrupt service routine) to be
`invoked whenthe signal is received. There
`is a default procedure, which will print out
`information aboutthe type of interrupt and
`whyil might have occurred. Interrupts may
`also be disabled. The user can decide
`whetherto defer processingof interrupts or
`to ignore them complctcly.
`
`Accessing Trigger from C. Trigger and
`auser’s C or Fortran programs can interact
`in a variety of ways. In the simplest mode,
`a user program can call Trigger library
`routines to run Splash. If desired, parts of
`the symbolic debugging environment can
`be accessed from the user program, up to
`the point of actually dropping into the de-
`bugger from a user program when some
`condition is met, such as when a user types
`Control-C to generate an interrupt or when
`an on-chip variable reachesa certain value.
`
`Sequence comparison
`on Splash
`
`A pattern-matching algorithm has been
`implemented on Splash and on a variety of
`other supercomputers. In genctic analysis,
`sequences over the four-character alphabet
`A, C, G, and 7represent DNA molecules,
`and similarity between sequences mayindi-
`cate an evolutionary or functionalrelation-
`ship. When attempting to characterize an
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 86
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 86
`
`
`
`unfamiliar sequence, a biologist will often
`compareit to collections of known DNA
`with the hope of finding close matches.
`There are many ways to measure the
`similarity between two DNA sequences.
`One appealing measure to biologists is the
`evolutionary distance, defined as the min-
`imum cost series of single character dele-
`tions, insertions, and substitutions necded
`to transform the source sequence S into the
`target sequence T. If S= 5), 89, ... 8, T=,
`th, ... f,, and d;_;is the distance between the
`subsequences 5), 5,... 8, and), fy,... t, then
`
`dog =0
`dig = dt + Cadet (5)
`do, ; = do. j-1 + Cins (4)
`
`|<igm
`P<jsa
`
`and
`
`dij + Cae (9)
`dj. j= min dij + Cing (t;)
`di. Ljel + Csub (5,6)
`lsjsn.1sism
`
`Herecgei(s;) is the costof deleting 5;, cing (f)
`is the cost ofinserting f;, and Cyyp (5;.0,) 1s the
`cost of substituting #; for s;.
`With one processor, this dynamic pro-
`gramming formulation requires time pro-
`portional to the product ofthe lengths of
`the two sequences. Because DNA sequences
`are long (tens of thousands of characters)
`and genetic databases are large (tens of
`millions ofcharacters), exhaustive searches
`can require hours of mainframe time.
`Fortunately, there is tremendous poten-
`tial for parallelism in the recurrence given
`above: All values qd); can be calculated
`simultaneously for a given k = (j+i).
`Mapping the recurrence onto a linear sys-
`tolic arrayis a straightforward procedure.
`One such arrangementis shown in Fig-
`urc 7. The characters of the source sequence
`flow in fromthe left, while the characters
`of the target flow in from the right. Each
`processing element evaluates the recurrence
`every clock “tick.”
`The Princeton Nucleic Acid Comparator
`(P-NAC) is an NMOS realization of this
`array, designed and built for the sole pur-
`pose of comparing DNA sequences.® The
`implementation assumedthat cga(5;) =Cing(t))
`= I, for all s;and ¢,, and that Coub(Sit) = Of
`s; matches t; and Csyp(s,.;) = 2 otherwise.
`Benchmarks established that P-NAC was
`several hundred times faster than mini-
`computers of the day.
`
`Splash implementation. In the Splash
`implementation, a processing element
`is
`composedof two modules: a character com-
`
`January 1991
`
`
`
`
`
`
`Icfi|t jolt{s|
`Lt tt
`
`Figure 7. A linear systolic array for sequence comparison.
`
`716,514,352)
`
`Tgt Dst Out4<2+—
`
`
`
`
`Sre Chr in4]
`
`Tgt Chr Out
`
`Character
`comparator
`
`Scr Null
`
`Tgt Null
`
`Sre DstIn
`
`Finite
`state machine
`(Mod 4}
`
`Sre Chr Out
`
`Tgt Chr In
`
`Src Dst Out
`
`Tat Dstin
`
`Figure 8. Block diagram of a sequence comparison processing element.
`
`parator and a finite state machine (see Fig-
`ure 8). The source and target characters,
`each fourbits, are compared during the first
`clock phase. Thefinite state machine com-
`putes a new distance based on the results of
`this comparison and the source and target
`distances during the second clock phase.
`
`Eachsystolic processing element is real-
`ized using 12 CLBs arranged as a 6 x 2
`module. [n the currentimplementation, this
`basic cell is replicated 24 times on each of
`the 30 middle chips, and 12 times on the
`two end chips, to yield a total of 744 pro-
`cessing elements.
`
`87
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 87
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 87
`
`
`
`
`
`Splash
`
`P-NAC
`
`Multiflow Trace
`
`0.020
`
`0.91
`
`3.7
`
`Connection Machine CM-2
`
`4.7
`
`Cray-2
`
`Convex Cl
`
`Sun 3/140
`
`Sun Sparcstation I
`
`DEC VAX11/785
`
`6.5
`
`89
`
`48
`
`5.8
`
`54
`
`2,700
`
`1 MHz, Sun 3/260 host
`
`60
`
`14
`
`11
`
`8.3
`
`6.0
`
`1.1
`
`9.3
`
`1.0
`
`Special-purpose NMOS
`device, Sun 2 host
`
`C compiler, optimization
`level 5, 14 functional units
`
`C compiler, Paris
`library 16,000 processors
`
`Vector Pascal, one head
`
`Vector C compiler,
`optimization level 2
`
`C compiler
`
`C compiler
`
`C compiler
`
`
`
`|
`
`he actual design and construction
`Table 1. Benchmarkresults for 100 comparisons of 100-long sequences.
`SS
`of real hardware formed the impe-
`Best time
`tus for many components, both
`in seconds
`Machine
`hardwareand software, that would not have
`Speedup
`Notes
`tc;nneee
`been recognized as desirable in a “paper”
`machine. Fundamental changes, such as
`adding CLB and