throbber
[oT
`
`Building and Using
`a Highly Parallel
`Programmable Logic Array
`
`Maya Gokhale, William Holmes, Andrew Kopser, Sara
`Lucas, Ronald Minnich, and Douglas Sweely
`Supercomputing Research Center
`
`Daniel Lopresti, Brown University
`
`ith a $13,000 two-slot addition
`called Splash, a Sun worksta-
`tion can outperform a Cray-2
`on certain applications. Several applica-
`tions, most involving bit-stream computa-
`tions, have been run on Splash, which re-
`ceived a 1989 Gordon Bell Prize honorable
`mention for timings on a problem that
`compared a new DNAsequenceagainst a
`library of sequences to find the closest
`match. In essencc, Splash is a programma-
`ble linear logic array that can be config-
`ured to suit the problemat hand; it bridges
`the gap between the traditional fixed-func-
`tion VLSI systolic array and the more
`versatile programmable array.
`Asoriginally conceived, a systolic array
`is a collection of simple processing ele-
`ments, each with a fixed, data-independent
`function, along with a one- or two-dimen-
`sional nearest-neighbor communication
`pattern.? The local nature of the communi-
`cation gives the systolic array a high com-
`munications bandwidth, and the simple, fixed
`function gives a high packing density for
`VLSI implementation. However, since the
`functionis built in, the application space of
`a particular systolic arrayis ratherlimited.
`Recognizing the benefit to he gained froma
`more flexible base for systolic algorithm
`implementation, H. T. Kung and colleagues
`
`[oo
`
`Construction of real
`
`hardware and feedback
`
`from real users
`contributed to Splash’s
`design, development,
`and success. For
`certain pattern-
`matching applications
`its price/performance
`ratio is unmatched.
`
`built the Warparray, ! a linear array in which
`each cell is a powerful very-large-instruc-
`tion-word processor. Currently, a two-di-
`mensional array of custom 32-bit proces-
`sors is being built jointly by Intel and
`Carnegie Mellon University.4
`Like the simple fixed-function systolic
`
`array, the linear array of chips comprising
`Splash is programmed at a very low level.
`A hardware implementation of the desired
`algorithm must be synthesized. Unlike the
`fixed-function systolic array,
`the “hard-
`ware” can be reprogrammed and loaded
`with newalgorithms. This is madepossible
`by using field-programmable gate arrays
`(FPGAs) as the chips of the linear array.
`Unlike the programmable systolic array,
`each stageoflinear array does not have an
`instruction set architecture. Rather than
`processors with a fixed instruction set, a
`stage contains several hundred “configu-
`rable logic blocks,” cach of which can be
`configured at the gate level to compute
`certain sorts of Boolean functions. There is
`no fixed number of systolic cells in the
`Splash array. The amount oflogic in cach
`cell determines the numberofsystolic cells
`per chip and therefore the numberof cells
`in the array. Typical applications have eight
`or 16 systolic cells per chip.
`This gate-level programmability enables
`high-speed execution ofalgorithms, since
`onlynecessary circuitry executes. Systolic
`and parallel algorithms implementedat the
`gate Jevel on Splash have achieved speed-
`ups of up to 330 over one Cray-2 proces-
`sor; speedups greater than 10 times are
`achieved routinely.
`
`January 1991
`
`0018-9162/9 1/1 000-008 1$01.00 © 199] IEEE
`
`81
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 81
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 81
`
`

`

`VME bus
`
`8-megabyte VSB staging memory
`
`
`
`Xilinx control chips
`
` FIFO out
`
`
`
`Figure 1. The 32-stage linear array.
`
`Overview of Splash
`
`The Splash design was motivated by a
`systolic algorithm for DNA pattern match-
`ing.® From the outset, the application do-
`main has focused on non-floating-point
`applications suchas pattern matching. Many
`pattern-matching applications must recog-
`nize when two sequencesare similar, even
`though they maynot be identical. Exam-
`ples include speech recognition, data re-
`trieval, and genetic analysis.’ The Splash
`architecture is suited to one-dimensional
`pattern matching. A two-dimensional im-
`plementation with similar FPGAtechnolo-
`gy has been built by Digital Equipment
`Corporation Paris Research Labs.®
`The design of a prototype was begun in
`September 1988 at
`the Supercomputing
`Research Center. In June 1989, Splash was
`released to the SRC user community.
`Operational at that time were five Splash
`systems, the Logic Description Generator
`(LDG)language, and the Trigger symbolic
`debugger. Currently, 16 Splash arrays are
`in use at SRC, Brown University, and else-
`where.
`
`System. Splash consists of two boards,
`the first containing the linear array and the
`second containing a dual-ported memory
`card. The two boards reside in two VME
`(Versabus modified for Eurocard)slots of a
`Sun-3 or Sun-4 workstation (see Figure !).
`The Splash logic-array board holds 32
`Xilinx 3090 programmable gate arrays?
`and 32 memory chips. Two additional Xil-
`inx chips are used for bus control. The
`logic array card connects to the Sun VME
`bus for control and the Sun VME Subsystem
`Bus (VSB) for data I/O. The associated
`
`82
`
`dual-ported memory card connects to the
`Sun VMEbusfor data initialization and
`retrieval and to the Sun VSB busfor data
`I/O to and from the logic array.
`
`Programming. Splash is programmed
`by specifying the logic functions andinter-
`connections of each of 320 configurable
`logic blocks (CLBs) and 144 input/output
`blocks (LOBs) on each ofthe 32 chips. A
`Xilinx 3090 FPGA contains a 20 x 16 grid
`of CLBs surrounded on the perimeter by a
`single layer of IOBs (Figure 2). A CLB has
`a combinatorial logic section, two D flip-
`flops, and an internal control section. The
`CLB can be configured to generate any
`function of five variables, any two func-
`tions of four variables (see Figure 3), or
`some functions of up to seven variables.
`The IOBs provide the interface between
`external package pins andthe internal log-
`ic. Fach IOB has input and outputbuffers,
`which include both registered and direct
`data paths.
`Each Xilinx chip is programmedat the
`gate level using the Logic Description
`Generator language. LDG is a computer-
`aided design tool developed at SRC, with
`language constructs to describe andrepli-
`cate systolic cells and to place thecells on
`achip. Parameterized cell descriptions may
`be written, providing a functionality simi-
`lar to the VHDL (VHSIC hardware de-
`scription language) generate command.
`Splash designs are debugged using the
`Trigger symbolic debugger, also devel-
`oped at SRC. Triggeris similar to a software
`debugger, with user-definable procedures
`and local variables. The values of specific
`locations on the gate array can be examined
`symbolically. The array can be single
`stcpped or stepped in burst mode. Inter-
`
`rupts can either be ignored or can invoke a
`Trigger procedure.
`LDG and Trigger permit rapid design
`turnaround time that is more comparable to
`software than hardware redesign. With
`LDG,it takes only a few keystrokes to
`significantly modify a chip design, which
`can be easily tested with Trigger. These
`design tools, plus the fact that a design can
`be loaded on the board in half a second,
`make it easier to generate and test a new
`chip design on Splash hardware than to
`simulate several different designs before
`committing to hardware.
`
`Hardware development
`
`Designing and developing Splash re-
`quired numerous decisions and trade-offs
`in defining the hardware (and the LDG and
`Trigger, as described in later sections).
`The systolic array, as first envisioned,
`wasto consist of manystages connected in
`a one-dimensionalarray. Each stage wasto
`have three components: a Xilinx FPGA,
`local memory, and a floating-point chip.
`Theinitial design called for dual-ported
`local memoryso that the host could direct-
`ly access all the memory on the board. For
`the prototype, we planned to develop a
`simple one-board system that would plug
`into a Sun workstation, communicating
`with the Sun CPU over the VME bus.
`Because we opted for a single-board sys-
`tem, space becamethe constrainingfactor.
`Thirty-two FPGA/SRAMpairs fit nicely
`on a 9U x 400-millimeter card but left no
`room for floating-point chips. Since the
`application driving the design did not re-
`quire floating-point manipulation, we
`eliminated floating-point chips from the
`
`COMPLTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 82
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 82
`
`

`

`design. That left 32 stages, each with an
`FPGAand an SRAMchip.
`At that
`time,
`the biggest and fastest
`memories were single- ported 128K x 8, 50-
`nanosecond SRAMs. Thus, we were faced
`with choosing between slower, dual-port-
`ed, host-accessible SRAMsandfaster,sin-
`gle-ported memories accessible only
`through the Xilinx chips. Since we expect-
`ed applications to use the local memories
`primarily for constant tables, which would
`be loaded initially through the host and
`accessed only locally, we opted for the
`faster, single-ported SRAMs.
`Finally, we used two more Xilinx chips
`for the VMEbusinterface so that changes
`to the VME interface design could be
`quickly implemented without modifying
`the hardware.
`
`To simulateor not to simulate. Because
`the flow ofsystolic algorithmsis data inde-
`pendent, we could estimatethe prototype’s
`performance on targeted applications be-
`fore actually building the board. However,
`several potential users questioned the ac-
`curacy of these predictions. They felt we
`should simulate the design first, but ini-
`tial estimates showed that simulating the
`Xilinx chips on a Zycad SDE would be
`2,500 times slower than Splash — even for
`simple programs.
`Since the board design was relatively
`simple, we decided to build the board rath-
`er than invest resources ina simulator. This
`turned out to be the right decision. It took
`two engineers only six months to get the
`first board up and running. Since we chose
`not to write a simulator, the software cngi-
`neers developed the programming and de-
`bugging tools simultaneously with the
`hardware.
`
`Communicating with the host. A sim-
`ple VME interface was built first. The
`initial design had specified this as the only
`communication channel to the Sun host.
`Realizing that the Splash card could quickly
`outrun the 1 millisecond/32-bit VME
`transfer rate, we added a second, faster
`communication channel, the VSB bus. This
`necessitated a second off-the-shelf, dual-
`ported memory card as a staging memory
`for Splash. The original VMEinterface
`was retained for control transfers.
`In the revised design, the Sun transfers
`data to and from the staging memory via
`the VMEbus, and Splash communicates
`with the staging memory over the VSB
`bus. The addition of the VSB interface
`considerably complicated the design, but
`at the time wefelt the factor of two in I/O
`
`January 1991
`
`
`
`
`
`
`
`0,31 | 0,32
`
`
`
`
`
`
`
`
`
`
`
`
`
`20,1
`
`beast} 41,2
`
`
`
`
`
`
`
`
`
`Anyfunction
`of up to four
`variables
`
`Any function
`of up to four
`variables
`
`
`
`
`
`
`
`
`Figure 3. A configurable logic block.
`
`speedup compensated for the additional
`complexity.
`
`Design for debuggability. Development
`ofan actual hardware system often leadsto
`incorporating features that might not seem
`necessary in a simulated environment. For
`instance, hardware support for debugging
`programmedapplications on the Splash board
`would not have been a consideration in a
`
`paperstudy. It was clear earlyin the devel-
`opment, however, that we had to design for
`ease of debugging. The designerhasto have
`some way of knowing whatthe gate arrays
`are doing at any given pointin time.
`Fortunately, the Xilinx chips have a fea-
`ture called state readback. Just as the chip
`is programmedbyshifting into the chip a
`64Kserial string of bits, the state can also
`be read out. The configuration of the chip
`
`83
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 83
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 83
`
`

`

`andthe state of all user-definable flip-flops
`on the chip are shifted out. This feature,
`combined with the ability to single step (or
`step in bursts), has proven extremely valu-
`able for debugging user programs.
`Althoughstate readback and single step
`were incorporated into the 32 Xilinx chips
`in thearray, these features were not used
`on the two Xilinx chips for the VMEin-
`terface. To debugthe logic on these chips,
`wehad to use the traditional logic analyzer
`to observe signals on the chip’s external
`pins. To observeinternal signals, however,
`we had to redesign the chip, bringing these
`signals to external pins. These external
`routings made the chip more complicated
`and altered signal timings (longerpaths) of
`critical logic sections. The external routings
`changed frequently and eventually had to
`be removedafter debugging,thus the actu-
`al design was not really debuggedatall.
`Having state readback and a debugger for
`these control chips would have drastically
`reduced the design time for Splash.
`In addition to state readback and single
`stepping, other important debugging fea-
`tures include a user-definable variable-
`speed clock and maskable interrupts. If
`desired, the clock may be stopped as soon
`as an interrupt occurs. The user can op-
`tionally provide data flow control; if the
`input buffer (FIFO) between Splash and
`the VSB memory is empty,
`the clock
`“pauses” allowing the FIFO to fill, insur-
`ing contiguous data from the staging mem-
`ory. Because the contro] logic was imple-
`mented with the Xilinx chips, many
`additional features were added after the
`board was built.
`
`SRAMand Xilinx chip connection.
`Weconsidered twoalternative methods of
`connecting the SRAMsto the Xilinx chips.
`The first, perhaps more straightforward
`approach, wasto dedicate 28 pins of each
`chip to SRAM. Each chip would have its
`ownlocal memory, accessible only to that
`chip. However, we wanted high intercon-
`nectivity between Xilinx chips. Taking away
`28 pins exclusively for the SRAM was
`undesirable.
`The alJternative strategy, which we
`adopted, called for an SRAM connection
`to share lines connecting two chips. This
`had several additional advantages. The
`memory was now accessible to two adja-
`cent chips, giving the designer the option,
`for example. of having every other stage
`have a memoryofsize 256K X 8 or 128K x
`16, Anotherpossibility is for one stage to
`be the reader and the next the writer. If the
`local memoryis not required, the single
`
`84
`
`[oT
`
`Thegoal of the
`architectural study
`was to design and
`implement a
`programmable
`linear array.
`
`dedicated line to it can be disabled, and the
`other 27 pins can be used for communicat-
`ing with the adjacent chip. The hazard of
`this implementation is that the designer
`must coordinate access to the memory so
`that both chips do nottry to access the same
`SRAMat the same time.
`
`Evaluation. We have evaluated Splash’s
`performance on real applications, as well
`as on a set of synthetic benchmarks, and
`found the following of note.
`
`¢ 1/O-limited. The Splash board ts in-
`deed I/O-limited. Many applications could
`Tun al least an order of magnitude faster
`with better I/O.
`* Host inaccessibility to SRAMs. Al-
`though the application designer can work
`around the tack of dual-ported SRAM
`memoriesto the host, the inconvenience of
`getting data into and out ofthe local mem-
`ories rules out some applications that need
`host accessto local memory. Using bigger.
`wider, and faster memories would be an
`important objective for a follow-on design.
`* State readback, The state readback
`capability was originally designed for de-
`bugging. However, we found another im-
`portant, unanticipated use. Often,
`it
`is
`convenient to design a systolic algorithm
`in which the results are accumulated in
`stationary registers in the array. In this
`case, state readback can be used to read
`results at the end of a computation. This
`technique has been used in many Splash
`applications. However, the amountoftime
`to read backthestate (about half a second)
`is often too long. A 64K bit stream is sent
`to the Sun host andfiltered. The remaining
`1,024 bits constitute the desired state in-
`formation. The new Xilinx 4000 series
`discards the extraneous bits (which encode
`each chip’s CLB and IOB configurations)
`al the chip. We plan to use the new part in
`Splash follow-ons.
`
`« Using the VSB. We expected the addi-
`tion of the faster VSB to speed I/O by a
`factor of two. This proved to be the case for
`applications that made repeated passes over
`the data set; however, applications that
`madeonlya single pass paid a penally. The
`data first had to be loaded over the VME
`bus to the staging memoryand then from
`the staging memoryto the logic array board.
`Results also had to go first to the staging
`memory and then through the VMEback to
`the host. Thus, single-pass applications ran
`slower than they would have with only the
`VMEinterface for data. Because we've
`found that applications are typically onc
`pass rather than multiple pass, we are now
`making VSB-less single-board systems per
`the original design.
`
`Not all of these features would have
`been considered if a prototype had not been
`built. The applications work has given us a
`greater understanding of the limitations
`and capabilities of Splash.
`
`Path to the Logic
`Description Generator
`
`Initial language. The intent of the Splash
`project was to study the systolic model of
`computing both architecturally and at the
`languagelevel. The goal of the architectural
`study was to design and implementa pro-
`grammable linear array. The language ob-
`jective was to understand the essential
`language constructs that describe a systolic
`computation; to define or adapt an existing
`language in which those systolic constructs
`were embedded: and,if an appropriate target
`machine was built, to implement a compil-
`er for the language.
`The core systolic constructs we selected
`were (1) the notion of a /ogical systolic cell
`through which data are streamed and (2)
`the replication and interconnection of the
`logical cells to form the systolic array.
`The language construct satisfying the
`first need is called a template, which is
`associated with named input and output
`signals and whose body, in a hierarchical
`fashion, can contain other templates as
`well as language primitives. In response to
`the second need, we developed a concise
`notation to specify the replication and in-
`terconnection of the parts in a template.
`In the first iteration of LDG design, the
`language primitives consisted of the usual
`Booleanlogic operations as well as D flip-
`flops. The language processor expanded
`hierarchically invoked templates until a
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 84
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 84
`
`

`

`primitive was encountered. The primitive
`was then output in a format required by the
`Xilinx tools.
`
`Required revision. Wc implemented this
`initial version of LDG and foundthat pro-
`gramming the Xilinx chip at
`the logic
`equation leve] did not effectively utilize
`the chip. Our experience showedthat using
`Xilinx tools to automatically pack the logic
`equations imto logical CLBs, to assign the
`logical CLBs to physical locations on the
`chip, and finally to route the chip resulted
`in, at most, 10 percent use of the CLBs.
`Our performance requirements demand-
`ed high chip-area usc. Thus, the realities of
`the design environment dictated a change
`to the language: We added CLB templates
`and IOB templates as new primitive rem-
`plates. The designer could configure these
`templates just as with the Xilinx-supplied
`tools (for example, as a function of five
`variables). In addition, we added the con-
`cepts of location and shape to the LDG
`language. Each partin a template ts assigned
`a location on the CLB/IOB grid and a
`rectangular shape. The location can heei-
`ther relative, in terms of parameters passed
`into the template, or absolute. The addition
`of user-directed placement gavethe designer
`complete control over the layout of logic
`on the gate array. In conjunction with the
`replicated part, it becamepossible to spec-
`ify the configuration of an entire chip with
`relatively little effort.
`
`Ergonomics. The LDG syntax and user
`interface also evolved as designers wrote
`LDGprograms and debugged them on real
`hardware. LDG is embedded in Common
`
`Lisp, and the initial language syntax con-
`sisted simply of calls to Lisp functions.
`This required users to develop some famil-
`larity with the Lisp environment, especial-
`ly since syntax errors threw the user into
`the Lisp debugger. Responding to user
`feedback, we designed a more intuitive
`keyword-driven syntax and added exten-
`sive error-handling from within LDG so
`that users did not haveto interact with the
`Lisp debugger.
`Although a graphical editor was avail-
`able from Xilinx, users preferred the text
`interface for its ease of modification. A
`few text changes could modify every CLB
`on every chip, which is very tedious with
`the graphical editor.
`
`LDGexample. The simple example in
`Figure 4 illustrates various LDG language
`constructs. The figure shows a two-bit
`pipeline, Pipe2, with input signals sig,
`
`January 1991
`
`sig], and clk and output signals outO and
`out!. Internally, Pipe2 consists of 16 cop-
`ies of another template,cell. Figure 5 shows
`the internal structure of Pipe2. The arrays
`of signals a and b areinternal to Pipe2 and
`pass the signal between adjacentstages.
`Figure 6 shows the LDG template for
`Pipe2. Note the correspondence between
`the set of input signals in the block diagram
`and the input clause on the second line of
`the 1.DG program (and similarly for out-
`put). The location clause passes in the
`
`Figure 4. A two-bit wide pipeline.
`
`
`
`
`
`
`
`
`
`
`
`
`Figure 5. Internal structure of Pipe2.
`
`(template pipe2
`(input sigO sig} clk)
`(output outO out!)
`(location pos-x pos-y)
`
`(part-list
`( (namepl)
`(part cell)
`(input sigQ sig] clk)
`(ouput (index al) (index bl)
`(ocation row !col) )
`
`( (name p2-15)
`(range! 2 to 15)
`(shape row-major | by 14)
`(start-!row pos-x)
`(start-!col (1 + pos-y))
`(part cell)
`(input (index a(1-!) ) (index b(1-!) ) clk)
`(output (index a!) (index b!) )
`
`(location !row !col) )
`
`( (name p16)
`(start-!row pos-x)
`(start-!col (+ 15 pos-y) )
`(part cell)
`(input (index al5) (index b15) clk)
`(output out0 out))
`
`5 input signals
`; Output signals
`; position parameters
`
`; first part: pl
`; of cell, input sig, sigl, clk
`
`; output afl], b[1]
`
`; second part: p2-15
`; make 14 copies
`; for 1 row and 14 columns
`; at (pos-x, pos-y + 1) for
`
`of cell
`; with inputs af[!-1], b[!-1], clk
`; and outputs af[!], bf!)
`: for ! in the range 2... 15
`
`; third part: p16
`; at (pos-x, 15 + pos-y)
`
`, of cell
`; with inputs a[15], b[15], clk
`; output outd, out
`
`Cocation !row !col) } ; template name
`
`Figure 6. Logic-description-language template for a two-bit pipeline.
`
`85
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 85
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 85
`
`

`

`
`
`starting row and column for parts in the
`emplate. The names used here, pos-x and
`pos-y, are referenced again in a (start-
`'row...) or (start-!coL..) clause for a part.
`Within Pipe2 there are three parts: p1.
`p2-15, and p16. Part p2-15 invokes 14
`copies of the template cell becausc of the
`range clause (range! 2 to 15).
`The (shape row-major...) clause specifies
`he layout of these 14 cells. The first copy of
`cell is placed at location (pos-x, pos-y + 1).
`The next is at (pos-x, pos-y + 2), and so
`orth. Note that the numberof copies of the
`emplate cell specified by the range clause
`(2... 15 = 14) must equal
`the number of
`copies specified by the shape clause (1 row
`x 14 columns = 14), The final two parame-
`ters are being sentto cell, and their values
`are the current row index (!row) and current
`column index (!col), respectively.
`Pipe2 is instantiated with the call com-
`mand.
`
`(call pipe2)
`(input dinO din clk)
`(output doutO dout!)
`(location 1 1)
`
`the
`In the generated Xilinx commands,
`names dinO, dini, and clk will be substitut-
`ed for the dummyinput parameters sig0,
`sig!, and clk (similarly for output). (pos-x,
`pos-y) will be assigned values (1, 1), and
`the expressions using pos-x and pos-y will
`be evaluated so that the cvaluated valucs,
`constantintegers, will be output as a CLB
`location.
`Here. we’ ve omitted the LDG specifica-
`tion ofcell. This template is aCLB template,
`configured simply to pass data through
`unchanged.
`Another useful feature of LDG in im-
`plementing systolic algorithms is its abil-
`ity to write parameterized template defi-
`nitions. The hardware designer can design
`a schema ofa structure, a generalized shift
`register, for example, and then instantiate
`different schema instances byinvoking the
`schema with different parameters. The re-
`petitive layout and the control over place-
`ment are available at the schema level as
`well as in the base language.
`The readeris referred to our earlier work'®
`for examples of schema definition and use.
`
`Splash runtime
`environment
`
`The Splash runtime environment on the
`Sun workstation consists of a symbolic
`
`86
`
`Le
`
`Library routines
`created for use
`by the debugger
`can also be invoked
`from C programs,
`so applications
`programscandirect
`the Splash board.
`
`debugger Trigger and a kernel driver to
`control
`the Splash device and the VSB
`interface. (The debugger borrows muchof
`its code fromthe Horizon Simulator,'! hence
`the name Trigger, son of Horse.) Library
`routines created for use by the debugger
`can also be invoked from C programs, so
`applications programscandirect the Splash
`board and gain access to Splash-related
`symbols just as the debugger does. In ad-
`dition, there are graphical tools to view the
`activity of a single chip or of the entire
`array.
`Below wediscuss some debugger capa-
`bilities and how they can be accessed from
`independent C programs.
`
`Loadingchip designs. All the chips are
`loaded in parallel, with each bit of the 32-
`bit word going to a different chip. The
`entire board can be loadedin half a second.
`The same file may be used for multiple
`chips, a differentfile for each chip, or any
`combination thereof.
`
`Stepping the board. Trigger allows the
`user to step a selected numberof clocks.
`Commonly, the user initially single steps
`the design, monitoring variables on the
`chips at each step. As more and moreofthe
`design starts to work, the user typically
`steps the design through a larger butstill
`well-defined number of clocks. Signals
`from the chips can be designedto interrupt
`or assert a flag in one of the control and
`status registers. In either case, the counted
`clocks are stopped.
`
`Trigger procedures. Trigger allows the
`user to create and invoke procedures, a
`handy capability for frequently used com-
`mands. A basic library of Trigger proce-
`
`dures has grown overtime andis available
`to the users as part of the Triggerlibrary.
`The command language for Triggeris
`similar to the command languageforthe C-
`shell. There are conditional statements,
`while statements, for loops, user-defined
`variables that may contain strings or nu-
`meric values, and a numberofotherstate-
`ments found in many command language
`interpreters. The variables may be user-
`defined and can control flowof the Trigger
`procedure being executed.
`
`Nondestructive readback. The user may
`examine on-chip state at any time and then
`resume the program. In fact, symbols may
`be evaluated as part of conditional looping.
`Thus, a user’s procedure may runa design,
`examine an on-chip variable, and make a
`decision about whether to continue running
`the design based on that variable.
`
`Support for interrupts. The Splash
`board can generate interrupts. These in-
`terrupts are vectored through the kernel
`driver and to the user program via the sigIO
`signal. Trigger allows the user to specify a
`procedure (interrupt service routine) to be
`invoked whenthe signal is received. There
`is a default procedure, which will print out
`information aboutthe type of interrupt and
`whyil might have occurred. Interrupts may
`also be disabled. The user can decide
`whetherto defer processingof interrupts or
`to ignore them complctcly.
`
`Accessing Trigger from C. Trigger and
`auser’s C or Fortran programs can interact
`in a variety of ways. In the simplest mode,
`a user program can call Trigger library
`routines to run Splash. If desired, parts of
`the symbolic debugging environment can
`be accessed from the user program, up to
`the point of actually dropping into the de-
`bugger from a user program when some
`condition is met, such as when a user types
`Control-C to generate an interrupt or when
`an on-chip variable reachesa certain value.
`
`Sequence comparison
`on Splash
`
`A pattern-matching algorithm has been
`implemented on Splash and on a variety of
`other supercomputers. In genctic analysis,
`sequences over the four-character alphabet
`A, C, G, and 7represent DNA molecules,
`and similarity between sequences mayindi-
`cate an evolutionary or functionalrelation-
`ship. When attempting to characterize an
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 86
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 86
`
`

`

`unfamiliar sequence, a biologist will often
`compareit to collections of known DNA
`with the hope of finding close matches.
`There are many ways to measure the
`similarity between two DNA sequences.
`One appealing measure to biologists is the
`evolutionary distance, defined as the min-
`imum cost series of single character dele-
`tions, insertions, and substitutions necded
`to transform the source sequence S into the
`target sequence T. If S= 5), 89, ... 8, T=,
`th, ... f,, and d;_;is the distance between the
`subsequences 5), 5,... 8, and), fy,... t, then
`
`dog =0
`dig = dt + Cadet (5)
`do, ; = do. j-1 + Cins (4)
`
`|<igm
`P<jsa
`
`and
`
`dij + Cae (9)
`dj. j= min dij + Cing (t;)
`di. Ljel + Csub (5,6)
`lsjsn.1sism
`
`Herecgei(s;) is the costof deleting 5;, cing (f)
`is the cost ofinserting f;, and Cyyp (5;.0,) 1s the
`cost of substituting #; for s;.
`With one processor, this dynamic pro-
`gramming formulation requires time pro-
`portional to the product ofthe lengths of
`the two sequences. Because DNA sequences
`are long (tens of thousands of characters)
`and genetic databases are large (tens of
`millions ofcharacters), exhaustive searches
`can require hours of mainframe time.
`Fortunately, there is tremendous poten-
`tial for parallelism in the recurrence given
`above: All values qd); can be calculated
`simultaneously for a given k = (j+i).
`Mapping the recurrence onto a linear sys-
`tolic arrayis a straightforward procedure.
`One such arrangementis shown in Fig-
`urc 7. The characters of the source sequence
`flow in fromthe left, while the characters
`of the target flow in from the right. Each
`processing element evaluates the recurrence
`every clock “tick.”
`The Princeton Nucleic Acid Comparator
`(P-NAC) is an NMOS realization of this
`array, designed and built for the sole pur-
`pose of comparing DNA sequences.® The
`implementation assumedthat cga(5;) =Cing(t))
`= I, for all s;and ¢,, and that Coub(Sit) = Of
`s; matches t; and Csyp(s,.;) = 2 otherwise.
`Benchmarks established that P-NAC was
`several hundred times faster than mini-
`computers of the day.
`
`Splash implementation. In the Splash
`implementation, a processing element
`is
`composedof two modules: a character com-
`
`January 1991
`
`
`
`
`
`
`Icfi|t jolt{s|
`Lt tt
`
`Figure 7. A linear systolic array for sequence comparison.
`
`716,514,352)
`
`Tgt Dst Out4<2+—
`
`
`
`
`Sre Chr in4]
`
`Tgt Chr Out
`
`Character
`comparator
`
`Scr Null
`
`Tgt Null
`
`Sre DstIn
`
`Finite
`state machine
`(Mod 4}
`
`Sre Chr Out
`
`Tgt Chr In
`
`Src Dst Out
`
`Tat Dstin
`
`Figure 8. Block diagram of a sequence comparison processing element.
`
`parator and a finite state machine (see Fig-
`ure 8). The source and target characters,
`each fourbits, are compared during the first
`clock phase. Thefinite state machine com-
`putes a new distance based on the results of
`this comparison and the source and target
`distances during the second clock phase.
`
`Eachsystolic processing element is real-
`ized using 12 CLBs arranged as a 6 x 2
`module. [n the currentimplementation, this
`basic cell is replicated 24 times on each of
`the 30 middle chips, and 12 times on the
`two end chips, to yield a total of 744 pro-
`cessing elements.
`
`87
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 87
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 87
`
`

`

`
`
`Splash
`
`P-NAC
`
`Multiflow Trace
`
`0.020
`
`0.91
`
`3.7
`
`Connection Machine CM-2
`
`4.7
`
`Cray-2
`
`Convex Cl
`
`Sun 3/140
`
`Sun Sparcstation I
`
`DEC VAX11/785
`
`6.5
`
`89
`
`48
`
`5.8
`
`54
`
`2,700
`
`1 MHz, Sun 3/260 host
`
`60
`
`14
`
`11
`
`8.3
`
`6.0
`
`1.1
`
`9.3
`
`1.0
`
`Special-purpose NMOS
`device, Sun 2 host
`
`C compiler, optimization
`level 5, 14 functional units
`
`C compiler, Paris
`library 16,000 processors
`
`Vector Pascal, one head
`
`Vector C compiler,
`optimization level 2
`
`C compiler
`
`C compiler
`
`C compiler
`
`
`
`|
`
`he actual design and construction
`Table 1. Benchmarkresults for 100 comparisons of 100-long sequences.
`SS
`of real hardware formed the impe-
`Best time
`tus for many components, both
`in seconds
`Machine
`hardwareand software, that would not have
`Speedup
`Notes
`tc;nneee
`been recognized as desirable in a “paper”
`machine. Fundamental changes, such as
`adding CLB and

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket