`a Highly Parallel
`Programmable Logic Array
`
`Maya Gokhale, William Holmes, Andrew Kopser, Sara
`
`Lucas, Ronald Minnich, and Douglas Sweely
`
`Supercomputing Research Center
`
`Daniel Lopresti, Brown University
`
`itli a $13,000 two—slot addition
`called Splash. a Sun worksta~
`tion can outperform a Cray—2
`on certain applications. Several applica-
`tions, most involving bit—stream computa-
`tions, have been run on Splash, which re
`ceived a 1989 Gordon Bell Prize honorable
`mention for timings on a problem that
`compared a new DNA sequence against a
`library of sequences to find the closest
`match. In essence, Splash is a programma—
`ble linear logic array that can be config—
`ured to suit the problem at hand; it bridges
`the gap between the traditional fixed—func—
`tion VLSI systolic array and the more
`versatile programmable array."2
`As originally conceived, a systolic array
`is a collection of simple processing ele—
`ments, each with a fixed, data—independent
`function, along with a one- or two—dimen-
`sional nearesteneighhor communication
`pattern.3 The local nature ofthe communi-
`cation gives the systolic array a high com—
`munications bandwidth, and the simple, fixed
`function gives a high packing density for
`VLSI implementation. However, since the
`function is built in, the application space of
`a particular systolic array is rather limited.
`Recognizing the benefit to he gained from a
`more flexible base for systolic algorithm
`implementation, H. T, Kung and colleagues
`
`1:]
`
`Construction of real
`
`hardware and feedback
`
`from real users
`
`contributed to Splash’s
`design, development,
`and success. For
`
`certain pattern-
`matching applications
`its price/performance
`ratio is unmatched.
`
`built the Warp array.l a linear array in which
`each cell is a powerful vcry—largc—instruc—
`tion-word processor. Currently, a two—di—
`mensional array of custom 32—bit proces—
`sors is being built jointly by Intel and
`Carnegie Mellon University.4
`Like the simple fixed—function systolic
`
`array, the linear array of chips comprising
`Splash is programmed at a very low level.
`A hardware implementation of the desired
`algorithm must be synthesized. Unlike the
`fixed—function systolic array,
`the “hard-
`ware" can be reprogrammed and loaded
`with new algorithms This is made possible
`by using field—programmable gate arrays
`(FPGAS) as the chips of the linear array.
`Unlike the programmable systolic array,
`each stage oflinear array does not have an
`instruction set architecture. Rather than
`processors with a fixed instruction set, a
`stage contains several hundred “configw
`rable logic blocks,” each of which can be
`configured at the gate level to compute
`certain sorts ofBoolean functions. There is
`no fixed number of systolic cells in the
`Splash array. The amount of logic in each
`cell determines the number ofsystolic cells
`per chip and therefore the number 01‘ cells
`in the array. Typical applications have eight
`or 16 systolic cells per chip.
`This gate—level programmability enables
`high»speed execution of algorithms. since
`only necessary circuitry executes. Systolic
`and parallel algorithms implemented at the
`gate level on Splash have achieved speed—
`ups of up to 330 over one Cray72 procesv
`sor5; speedups greater than 10 times are
`achieved routinely.
`
`January 1991
`
`0018-9162/91HGOO-008130100 © 199] llilsli
`
`81
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 81
`
`Petitioner Microsoft Corporation - EX. 1017, p. 81
`
`
`
`8-megabyte VSB staging memory
`
`Xilinx control chips
`
`
`
`
`
`FIFO out
`
`
`
`Figure 1. The 32-stage linear array.
`
`Overview of Splash
`
`The Splash design was motivated by a
`systolic algorithm for DNA pattern match-
`ing" From the outset, the application do-
`main has focused on non—floating-point
`applications such as pattern matching. Many
`pattern—matching applications must recog-
`nize when two sequences are similar. even
`though they may not be identical. Exam—
`ples include speech recognition, data re-
`trieval. and genetic analysis.7 The Splash
`architecture is suited to one—dimensional
`pattern matching. A two-dimensional im—
`plementation with similar FPGA technolo-
`gy has been built by Digital Equipment
`Corporation Paris Research Labs.8
`The design of a prototype was begun in
`September 1988 at
`the Supercomputing
`Research Center. In June 1989. Splash was
`released to the SRC user community.
`Operational at that time were five Splash
`systems, the Logic Description Generator
`(LDG) language, and the Trigger symbolic
`debugger. Currently. 16 Splash arrays are
`in use at SRC, Brown University, and else-
`where.
`
`System. Splash consists of two boards,
`the first containing the linear array and the
`second containing a dual—ported memory
`card. The two boards reside in two VME
`(Versabns modified for Eurocard) slots of a
`Sun-3 or Sun-4 workstation (see Figure l).
`The Splash logic»array board holds 32
`Xilinx 3090 programmable gate arrays0
`and 32 memory chips. Two additional Xi]—
`inx chips are used for bus control. The
`logic array card connects to the Sun VME
`bus for control andthe Sun VME Subsystem
`Bus (VSB) for data 1/0. The associated
`
`82
`
`dual-ported memory card connects to the
`Sun VME bus for data initialization and
`retrieval and to the Sun VSB bus for data
`1/0 to and from the logic array.
`
`Programming. Splash is programmed
`by specifying the logic functions and inter-
`connections of each of 320 configurable
`logic blocks (CLBs) and 144 input/output
`blocks (lOBs) on each of the 32 chips. A
`Xilinx 3090 FPGA contains a 20 X 16 grid
`of CLBs surrounded on the perimeter by a
`single layer ofIOBs (Figure 2). A CLB has
`a combinatorial logic section. two D flip»
`flops, and an internal control section. The
`CLB can be configured to generate any
`function of five variables, any two func-
`tions of four variables (see Figure 3), or
`some functions of up to seven variables.
`The lOBs provide the interface between
`external package pins and the internal log-
`ic. Each 10B has input and output buffers,
`which include both registered and direct
`data paths.
`Each Xilinx chip is programmed at the
`gate level using the Logic Description
`Generator language. LDG is a computer-
`aided design tool developed at SRC, with
`language constructs to describe and repli»
`cate systolic cells and to place the cells on
`achip. Parameterized cell descriptions may
`be written, providing a functionality simi—
`lar to the VHDL (Vi-[SIC hardware def
`scription language) generate command.
`Splash designs are debugged using the
`Trigger symbolic debugger, also devel
`oped at SRC. Trigger is similar to a software
`debugger, with user-definable procedures
`and local variables. The values of specific
`locations on the gate array can be examined
`symbolically. The array can be single
`stcpped or stepped in burst mode. Inter—
`
`rupts can either be ignored or can invoke a
`Trigger procedure.
`LDG and Trigger permit rapid design
`turnaround time that is more comparable to
`software than hardware redesign. With
`LDG, it takes only a few keystrokes to
`significantly modify a chip design. which
`can be easily tested with Trigger. These
`design tools, plus the fact that a design can
`be loaded on the board in half a second,
`make it easier to generate and test a new
`chip design on Splash hardware than to
`simulate several different designs before
`committing to hardware.
`
`Hardware development
`
`Designing and developing Splash re—
`quired numerous decisions and trade-offs
`in defining the hardware (and the LDG and
`Trigger. as described in later sections).
`The systolic array. as first envisioned.
`was to consist of many stages connected in
`a one—dimensional array. Each stage was to
`have three components: a Xilinx FPGA,
`local memory, and a floating-point chip.
`The initial design called for dual-ported
`local memory so that the host could direct»
`1y access all the memory on the board. For
`the prototype, we planned to develop a
`simple one-board system that would plug
`into a Sun workstation, communicating
`with the Sun CPU over the VME bus.
`Because we opted for a single—board sys—
`tem. space became the constraining factor.
`Thirty-two FPGA/SRAM pairs fit nicely
`on a 9U >< 400—millimeter card but left no
`room for floating-point chips. Since the
`application driving the design did not re
`quire floating-point manipulation, we
`eliminated floating—point chips from the
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 82
`
`Petitioner Microsoft Corporation - EX. 1017, p. 82
`
`
`
`design. That left 32 stages, each with an
`FPGA and an SRAM chip.
`At that
`time,
`the biggest and fastest
`memories were single— ported 128K x 8, 50—
`nanoseeond SRAMs. Thus, we were faced
`with choosing between slower, dual—port-
`ed, host—accessible SRAMs and faster, sin-
`gle-ported memories accessible only
`through the Xilinx chips. Since we expect—
`ed applications to use the local memories
`primarily for constant tables, which would
`be loaded initially through the host and
`accessed only locally, we opted for the
`faster, single-ported SRAMs.
`Finally, we used two more Xilinx chips
`for the VME bus interface so that changes
`to the VME interface design could be
`quickly implemented without tnodifying
`the hardware.
`
`To simulate or not to simulate. Because
`the flow ofsy stolic algorithms is data inde-
`pendent, we could estimate the prototype‘s
`performance on targeted applications be—
`fore actually building the board. However,
`several potential users questioned the ac-
`curacy of these predictions. They felt we
`should simulate the design first, but ini—
`tial estimates showed that simulating the
`Xilinx chips on a Zycad SDE would be
`2,500 times slower than Splash — even for
`simple programs.
`Since the board design was relatively
`simple, we decided to build the board rathi
`erthaninvestresourccsinasimulator. This
`turned out to be the right decision. It took
`two engineers only six months to get the
`first board up and running. Since we chose
`not to write a simulator, the software engi-
`neers developed the programming and de-
`bugging tools simultaneously with the
`hardware,
`
`Communicating with the host. A sim-
`ple VME interface was built first. The
`initial design had specified this as the only
`communication channel to the Sun host.
`Realizing that the Splash card could quickly
`outrun the 1 millisecond/32bit VME
`transfer rate, we added a second, faster
`communication channel, the VSB bus. This
`necessitated a second off—the—shelf, dual—
`ported memory card as a staging memory
`for Splash. The original VME interface
`was retained for control transfers.
`In the revised design, the Sun transfers
`data to and from the staging memory via
`the VME bus, and Splash communicates
`with the staging memory over the VSB
`bus. The addition of the VSB interface
`considerably complicated the design, but
`at the time we felt the factor of two in 1/0
`
`January 1991
`
`
`
`
`
`0,31 |o,32
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`20,1
`
`beast 41,2
`
`Any function
`of up to four
`variables
`
`
`
`Any function
`of up to four
`variables
`
`
`
`
`
`
`
`
`
`
`Figure 3. A configurable logic block.
`
`speedup compensated for the additional
`complexity.
`
`Design for debuggability. Development
`of an actual hardware system often leads to
`incorporating features that might not seem
`necessary in a simulated environment. For
`instance, hardware support for debugging
`programmed applications on the Splash board
`would not have been a consideration in a
`
`paper study. It was clear early in the devel-
`opment, however, that we had to design for
`ease ofdebugging. The designer has to have
`some way of knowing what the gate arrays
`are doing at any given point in time.
`Fortunately, the Xilinx chips have a fea—
`ture called state readback. Just as the Chip
`is programmed by shifting into the chip a
`64K serial string of bits, the state can also
`be read out. The configuration of the chip
`
`83
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 83
`
`Petitioner Microsoft Corporation - EX
`
`. 1017, p. 83
`
`
`
`and the state ofall user-definable flip-flops
`()n the chip are shifted out. This feature,
`combined with the ability to single step (or
`step in bursts), has proven extremely valu—
`able for debugging user programs.
`Although state readbaek and single step
`were incorporated into the 32 Xilinx chips
`in the array, these features were not used
`on the two Xilinx chips for the VME in-
`terface. To debug the logic on these chips.
`we had to use the traditional logic analyzer
`to observe signals on the chip‘s external
`pins. To observe internal signals, however,
`we had to redesign the chip, bringing these
`signals to external pins. These external
`routings made the chip more complicated
`and altered signal timings (longer paths) of
`critical logic sections. The external routings
`changed frequently and eventually had to
`be removed after debugging, thus the actu-
`al design was not really debugged at 2111.
`Having state readback and a debugger for
`these control chips would have drastically
`reduced the design time for Splash.
`In addition to state readback and single
`stepping, other important debugging fea~
`tures include a user-definable variable—
`speed clock and maskable interrupts. If
`desired, the clock may be stopped as soon
`as an interrupt occurs. The user can 0p-
`tionally provide data flow control; if the
`input buffer (FIFO) between Splash and
`the VSB memory is empty,
`the clock
`“pauses” allowing the FIFO to fill, insur—
`ing contiguous data from the staging mem-
`ory. Because the control logic was imple-
`mented with the Xilinx chips, many
`additional features were added after the
`board was built.
`
`SRAM and Xilinx chip connection.
`We considered two alternative methods of
`connecting the SRAMs to the Xilinx chips.
`The first, perhaps more straightforward
`approach. was to dedicate 28 pins of each
`chip to SRAM. Each chip would have its
`own local memory, accessible only to that
`chip. However, we wanted high interconi
`nec‘tivity between Xilinx chipsTakingaway
`28 pins exclusively for the SRAM was
`undesirable.
`The alternative strategy, which we
`adopted, called for an SRAM connection
`to share lines connecting two chips. This
`had several additional advantages. The
`memory was now accessible to two adja-
`cent chips, giving the designer the option,
`for example. of having every other stage
`have a memory of size 256K X 8 or 128K x
`to. Another possibility is for one stage to
`be the reader and the next the writer. If the
`local memory is not required, the single
`
`84
`
`E:
`
`The goal of the
`architectural study
`was to design and
`implement a
`programmable
`linear array.
`
`dedicated line to it can be disabled, and the
`other 27 pins can be used for communicat—
`ing with the adjacent chip. The hazard of
`this implementation is that the designer
`must coordinate access to the memory so
`that both chips do not try to access the same
`SRAM at the same time.
`
`Evaluation. We have evaluated Splash’s
`performance on real applications, as well
`as on a set of synthetic benchmarks, and
`found the following of note.
`
`~ ”(J-limited. The Splash board is in-
`deed l/O-liinited. Many applications could
`run at least an order of magnitude faster
`with better I/O.
`0 Host inaccessibility m SRAMs. Al-
`though the application designer can work
`around the lack of dual—ported SRAM
`memories to the host, the inconvenience of
`getting data into and out ofthe local mem—
`ories rules out some applications that need
`host access to local memory. Using bigger.
`wider, and faster memories would be an
`important Objective fora follow-on design.
`' State readback. The state readback
`capability was originally designed for de-
`bugging. However, we found another im—
`portant, unanticipated use. Often.
`it
`is
`convenient to design a systolic algorithm
`in which the results are accumulated in
`stationary registers in the array. In this
`case, state readback can be used to read
`results at the end of a computation. This
`technique has been used in many Splash
`applications. However. the amount ol'time
`to read back the state (about halfa second)
`is often too long. A 64K bit stream is sent
`to the Sun host and filtered. The remaining
`1,024 bits constitute the desired state in-
`formation. The new Xilinx 4000 series
`discards the extraneous bits (which encode
`each chip’s CLB and 108 configurations)
`at the chip. We plan to use the new part in
`Splash followions.
`
`° Using the VSB. We expected the addi—
`tion of the faster VSB to speed 1/0 by a
`factor oftwo. This proved to be the case for
`applications that made repeated passes over
`the data set", however, applications that
`made only a single pass paid a penalty. The
`data first had to be loaded over the VME
`bus to the staging memory and then from
`the staging memory to the logic array board,
`Results also had to go first to the staging
`memory and then through the VME back to
`the host. Thus, single—pass applications ran
`slower than they would have with only the
`VME interface for data. Because we’ve
`found that applications are typically one
`pass rather than multiple pass, we are now
`making VSB—less single—board systems per
`the original design.
`
`Not all of these features would have
`been considered ifa prototype had not been
`built. The applications work has given us a
`greater understanding of the limitations
`and capabilities of Splash.
`
`Path to the Logic
`Description Generator
`
`Initial language. The intent ofthe Splash
`project was to study the systolic model of
`computing both architecturally and at the
`language level. The goal ofthe architectural
`study was to design and implement a pro
`grammable linear array. The language ob—
`jective was to understand the essential
`language constructs that describe a systolic
`computation; to define or adapt an existing
`language in which those systolic constructs
`were embedded; and. if an appropriate target
`machine was built, to implement a compil—
`er for the language.
`The core systolic constructs we selected
`were ( l ) the notion ofa logicalsystolic cell
`through which data are streamed and (2)
`the replication and interconnection of the
`logical cells to form the systolic array.
`The language construct satisfying the
`first need is called a temp/rite, which is
`associated with named input and output
`signals and whose body, in a hierarchical
`fashion, can contain other templates as
`well as language primitives. In response to
`the second need. we developed a concise
`notation to specify the replication and in,
`terconnection of the parts in a template.
`In the first iteration of LDG design, the
`language primitives consisted of the usual
`Boolean logic operations as well as D flip—
`flops. The language processor expanded
`hierarchically invoked templates until a
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 84
`
`Petitioner Microsoft Corporation - EX. 1017, p. 84
`
`
`
`primitive was encountered. The primitive
`was then output in a format required by the
`Xilinx tools.
`
`Required revision. We implemented this
`initial version of LDG and found that pro—
`gramming the Xilinx chip at
`the logic
`equation level did not effectively utilize
`the chip. Our experience showed that using
`Xilirtx tools to automatically pack the logic
`equations into logical CLBs, to assign the
`logical CLBs to physical locations on the
`chip, and finally to route the chip resulted
`in, at most. 10 percent use of the CLBs.
`Our performance requirements demand~
`ed high chip-area use. Thus, the realities of
`the design environment dictated a change
`to the language: We added CLB templates
`and 1013 templates as new primitive rem—
`plates. The designer could configure these
`templates just as with the Xilinx—supplied
`tools (for example, as a function of five
`variables). In addition, we added the con—
`cepts of location and shape to the LDC
`language. Each part in a template is assigned
`a location on the CLB/10B grid and a
`rectangular shape. The location can be ei-
`ther relative, in terms of parameters passed
`into the template, or absolute. The addition
`of user-directed placement gave the designer
`complete control over the layout of logic
`on the gate array. In conjunction with the
`replicated part, it became possible to spec—
`ify the configuration of an entire chip with
`relatively little effort.
`
`Ergonomics. The LDG syntax and user
`interface also evolved as designers wrote
`LDG programs and debugged them on real
`hardware. LDG is embedded in Common
`
`Lisp. and the initial language syntax con,
`sisted simply of calls to Lisp functions.
`This required users to develop some famil-
`iarity with the Lisp environment, especial-
`ly since syntax errors threw the user into
`the Lisp debugger. Responding to user
`feedback, we designed a more intuitive
`keyword-driven syntax and added extene
`sive errorehandling from within LDG so
`that users did not have to interact with the
`Lisp debugger.
`Although a graphical editor was avail—
`able from Xilinx, users preferred the text
`interface for its ease of modification. A
`few text changes could modify every CLB
`on every chip, which is very tedious with
`the graphical editor.
`
`LDG example. The simple example in
`Figure 4 illustrates various LDG language
`constructs. The figure shows a two—bit
`pipeline, Pipe2, with input signals sigO,
`
`January 1991
`
`sigl, and clk and output signals outO and
`out I. Internally, Pipe2 consists of 16 cop—
`ies ofanother template, cell. Figure 5 shows
`the internal structure of Pipe2. The arrays
`of signals a and b are internal to Pipe2 and
`pass the signal between adjacent stages.
`Figure 6 shows the LDG template for
`Pich. Note the correspondence between
`the set ofinput signals in the block diagram
`and the input clause on the second line of
`the LDG program (and similarly for out—
`put). The location clause passes in the
`
`Figure 4. A two-bit wide pipeline.
`
`
`
`
`
`
`
`
`
`
`
`
`Figure 5. Internal structure of Pipe2.
`
`(template pipe2
`(input sigO sigl Clk)
`(output outO outl)
`(location pos-x pos—y)
`
`(partilist
`( (name pl)
`(part cell)
`(input sigO sigl clk)
`(ouput (index a1) (index bl)
`(location lrow !col) )
`
`( (name p2-15)
`(range! 2 to 15)
`(shape row-major l by 14)
`(start—how pos—x)
`(start»!col (l + pos~y) )
`(part cell)
`(input (index a(1—!) ) (index b(l-!) ) elk)
`(output (index a!) (index b!))
`
`(location !row lcol))
`
`( (name p16)
`(startvlrow pos—x)
`(start-tool (+ 15 pos—y))
`(part cell)
`(input (index alS) (index h15) elk)
`(output outO outl)
`
`(location lrow !col))
`
`; template name
`; input signals
`; output signals
`; position parameters
`
`; first part: p1
`; of cell, input sigO, sigl, clk
`
`;outpnt a[1], b[l]
`
`; second part: p2'15
`; make 14 copies
`; for 1 row and 14 columns
`; at (posex, pos—y + 1) for
`
`; of cell
`; with inputs a[!—l], b[!—l]. elk
`; and outputs all], b[!]
`: for l in the range 2
`
`15
`
`; third part: p16
`; at (pos—x, 15 + pos—y)
`
`; of cell
`; with inputs a[15]. b[15],c1k
`; output outO, outl
`
`Figure 6. Logic-description-language template for a two-bit pipeline.
`
`85
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 85
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 85
`
`
`
`
`
`starting row and column for parts in the
`emplate. The names used here. pos—x and
`pos—y, are referenced again in a (start—
`!row...) or (start—leol...) clause for a part.
`Within Pipe2 there are three parts: p1.
`32—15, and p16. Part p2—15 invokes 14
`copies of the template cell because of the
`range clause (range! 2 to 15).
`The (shape row—major...) clause specifies
`he layout ofthese 14cells. The first copy of
`cell is placed at location (posex, pos—y + l).
`The next is at (posix, pos—y + 2). and so
`orth. Note that the number of copies of the
`cmplate cell specified by the range clause
`(2 .. 15 : 14) must equal
`the number of
`copies specified by the shape clause (1 row
`X 14 columns = 14). The final two parame-
`ters are being sent to cell. and their values
`are the current row index (!row) and current
`column index (icol), respectively,
`Pipe2 is instantiated with the call com—
`mand.
`
`(call pipeZ)
`(input dinf) dinl elk)
`(output doutO doutl)
`(location 1 1)
`
`the
`In the generated Xilinx commands,
`names dinO. din l , and elk will be substitut-
`ed for the dummy input parameters sigO.
`sig 1 , and clk (similarly for output). (pos—x.
`pos—y) will be assigned values (1, 1), and
`the expressions using pos—x and pos—y will
`be evaluated so that the evaluated values,
`constant integers. will be output as a CLB
`location.
`Here. we’ ve omitted the LDG specifica—
`tion ofeell. This template is a CLB template,
`configured simply to pass data through
`unchanged.
`Another useful feature of LDG in im-
`plementing systolic algorithms is its abil—
`ity to write parameterized template defi—
`nitions. The hardware designer can design
`a schema of a structure, a generalized shift
`register. for example, and then instantiate
`different schema instances by invoking the
`schema with different parameters. The re-
`petitive layout and the control over place—
`ment are available at the schema level as
`well as in the base language.
`The readeris referred to our earlier work'0
`for examples of schema definition and use,
`
`Splash runtime
`environment
`
`The Splash runtime environment on the
`Sun workstation consists of a symbolic
`
`86
`
`l::::l
`
`Library routines
`created for use
`
`by the debugger
`can also be invoked
`
`from C programs,
`so applications
`programs can direct
`the Splash board.
`
`debugger Trigger and a kernel driver to
`control
`the Splash device and the VSB
`interface. (The debugger borrows much of
`its code from the Horizon Simulator. " hence
`the name Trigger, son of Horse.) Library
`routines created for use by the debugger
`can also be invoked from C programs, so
`applications programs can direct the Splash
`board and gain access to Splash—related
`symbols just as the debugger does. In ad—
`dition, there are graphical tools to view the
`activity of a single chip or of the entire
`array.
`Below we discuss some debugger capa—
`bilities and how they can be accessed from
`independent C programs.
`
`Loading chip designs. All the chips are
`loaded in parallel, with each bit ofthe 327
`bit word going to a different chip. The
`entire board can be loaded in half a second.
`The same file may be used for multiple
`chips, 21 different file for each chip, or any
`combination thereof.
`
`Stepping the board. Trigger allows the
`user to step a selected number of clocks.
`Commonly, the user initially single steps
`the design, monitoring variables on the
`chips at each step. As more and more ofthe
`design starts to work, the user typically
`steps the design through a larger but still
`wellidefined number of clocks. Signals
`from the chips can be designed to interrupt
`or assert a flag in one of the control and
`status registers. In either case, the counted
`Clocks are stopped.
`
`Trigger procedures. Trigger allows the
`user to create and invoke procedures, a
`handy capability for frequently used com—
`mands. A basic library of Trigger proce—
`
`dures has grown over time and is available
`to the users as part of the Trigger library.
`The command language for Trigger is
`similar to the command language for the C—
`shell. There are conditional statements.
`while statements, for loops. user—defined
`variables that may contain strings or no—
`meric values, and a number of other state-
`ments found in many command language
`interpreters. The variables may be user-
`defined and can control flow ofthe Trigger
`procedure being executed.
`
`Nondestructive readback. The user may
`examine on—chip state at any time and then
`resume the program. In fact, symbols may
`be evaluated as part ofconditional looping.
`Thus. a user’s procedure may run a design,
`examine an on—chip variable, and make a
`decision about whether to continue running
`the design based on that variable.
`
`Support for interrupts. The Splash
`board can generate interrupts. These in—
`terrupts are vectored through the kernel
`driver and to the user program via the sigIO
`signal. Trigger allows the user to specify a
`procedure (interrupt service routine) to be
`invoked when the signal is received. There
`is a default procedure, which will print out
`information about the type of interrupt and
`why it might have occurred. Interrupts may
`also be disabled. The user can decide
`whether to defer processing ofinterrupts or
`to ignore them completely.
`
`Accessing Trigger from C. Trigger and
`a user’s C or Fortran programs can interact
`in a variety of ways. In the simplest mode.
`a user program can call Trigger library
`routines to run Splash. If desired, parts of
`the symbolic debugging environment can
`be accessed from the user program, up to
`the point of actually dropping into the de—
`bugger frotn a user program when some
`condition is met, such as when a user types
`ControliC to generate an interrupt or when
`an onechip variable reaches acertain value.
`
`Sequence comparison
`on Splash
`
`A pattern-matching algorithm has been
`implemented on Splash and on a variety of
`other supercomputers. In genetic analysis,
`sequences over the four—character alphabet
`A, C, G, and 'I‘ represent DNA molecules,
`and similarity between sequences may indi—
`cate an evolutionary or functional relation—
`ship. When attempting to characterize an
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 86
`
`Petitioner Microsoft Corporation - EX. 1017, p. 86
`
`
`
`unfamiliar sequence, a biologist will often
`compare it to collections of known DNA
`with the hope of finding close matches.
`There are many ways to measure the
`similarity between two DNA sequences.
`One appealing measure to biologists is the
`evolutionary distance, defined as the min-
`imum cost series of single character dele-
`tions. insertions, and substitutions needed
`to transform the source sequence S into the
`target sequence T. IfS = 51, So,
`Sm, T: 11,
`13,
`t,,, and d'lr' is the distance between the
`subscqueneesspsy,
`s, andt], t1.
`I], then
`
`[10.0 = 0
`d”) = £11,]th + eds] (5,)
`day = dowel + Fun (Ii)
`
`l S i S m
`1 SJ" 5 n
`
`and
`
`din" : lTllIl
`
`(1H,} + ('dei (Sr)
`dial-ml + Cins ”7)
`dl'
`l._jel + Csub (Sill)
`lSan. l SiSm
`
`Here Cdei (_s,-) is the cost ofdeleting xi, cmoj)
`isthe cost ofinsertingt,,and cmb(sl.t,)is the
`cost of substituting t,- for s‘.
`\Nith one processor, this dynamic pro-
`gramming formulation requires time pro-
`portional to the product of the lengths of
`the two sequences. Because DNA sequences
`are long (tens of thousands of characlers)
`and genetic databases are large (tens of
`millionsofcharacters),exhaustive searches
`can require hours of mainframe time.
`Fortunately, there is tremendous potene
`tial for parallelism in the recurrence given
`above: All values d” can be calculated
`simultaneously for a given It : 0+1").
`Mapping the recurrence onto a linear sysv
`tolic array is a straightforward procedure.
`One such arrangement is shown in Fig
`urc 7. The characters ofthe source sequence
`flow in from the left, while the characters
`of the target flow in from the right. Each
`processingelementevaluatesthe recurrence
`every clock “tick."
`The Princeton Nucleic Acid Comparator
`(P»NAC) is an NMOS realization of this
`array. designed and built for the sole pur
`pose of comparing DNA sequences.“ The
`implementation assumed that (idem i) = cma tj)
`= I, for all Si and rj, and that cmb(s,-,tj) = 0 if
`s,- matches r,» and Csub(Sr9f_/,) : 2 otherwise.
`Benchmarks established that P—NAC was
`several hundred times faster than mini—
`computers of the day.
`
`Splash implementation. In the Splash
`implementation. a processing element
`is
`composed oftwo modules: a character com-
`
`January 199]
`
`
`
`
`
`
`
`
`
`ICIill lolllsl
`lllllll
`
`L7l645¢4i3i2d
`
`
`Tgt Dst Out4+2-
`
`Figure 7. A linear systolic array for sequence comparison.
`
`
`
`Src Chr ln—vi
`
`Tgt Chr Out
`
`Character
`comparator
`
`Src Dst In
`
`Finite
`state machine
`(Mod 4)
`
`Src Chr Out
`
`Tgt Chr In
`
`Src Dst Out
`
`Tgt Dst In
`
`Figure 8. Block diagram of a sequence comparison processing element.
`
`parator and a finite state machine (see Fig,
`me 8). The source and target characters.
`each four bits, are compared during the first
`clock phase. The finite state machine com—
`putes a new distance based on the results of
`this comparison and the source and target
`distances during the second clock phase.
`
`Each systolic processing element is reale
`izcd using 12 CLBs arranged as a 6 x 2
`module. In the currentimplementation. this
`basic cell is replicated 24 times on each of
`the 30 middle chips. and IQ times on the
`two end chips. to yield a total of 744 pro—
`cessing elements.
`
`87
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 87
`
`Petitioner Microsoft Corporation - EX
`
`. 1017, p. 87
`
`
`
`Table 1. Benchmark results for 100 comparisons of loo-long sequences.
`__———
`Best time
`in seconds
`Machine
`Speedup
`Notes
`y——————————————-——_—-——
`
`Splash
`
`P—NAC
`
`Multiflow Trace
`
`0.020
`
`0.91
`
`3.7
`
`Connection Machine CM—2
`
`4.7
`
`Cray—2
`
`Convex CI
`
`Sun 3/140
`
`Sun Sparcstation l
`
`DEC VAX 11/785
`
`6.5
`
`8.9
`
`48
`
`5.8
`
`54
`
`2,700
`
`1 MHZ, Sun 3/260 host
`
`60
`
`14
`
`11
`
`8.3
`
`6.0
`
`1.1
`
`9.3
`
`1.0
`
`Special-purpose NMOS
`device, Sun 2 host
`
`C compiler. optimization
`level 5, 14 functional units
`
`C compiler, Paris
`library 16.000 processors
`
`Vector Pascal, one head
`
`Vector C compiler,
`optimization level 2
`
`C compiler
`
`C compiler
`
`C compiler
`
`
`
`————-——————_—1
`
`
`
`he actual design and cons