throbber
Building and Using
`a Highly Parallel
`Programmable Logic Array
`
`Maya Gokhale, William Holmes, Andrew Kopser, Sara
`
`Lucas, Ronald Minnich, and Douglas Sweely
`
`Supercomputing Research Center
`
`Daniel Lopresti, Brown University
`
`itli a $13,000 two—slot addition
`called Splash. a Sun worksta~
`tion can outperform a Cray—2
`on certain applications. Several applica-
`tions, most involving bit—stream computa-
`tions, have been run on Splash, which re
`ceived a 1989 Gordon Bell Prize honorable
`mention for timings on a problem that
`compared a new DNA sequence against a
`library of sequences to find the closest
`match. In essence, Splash is a programma—
`ble linear logic array that can be config—
`ured to suit the problem at hand; it bridges
`the gap between the traditional fixed—func—
`tion VLSI systolic array and the more
`versatile programmable array."2
`As originally conceived, a systolic array
`is a collection of simple processing ele—
`ments, each with a fixed, data—independent
`function, along with a one- or two—dimen-
`sional nearesteneighhor communication
`pattern.3 The local nature ofthe communi-
`cation gives the systolic array a high com—
`munications bandwidth, and the simple, fixed
`function gives a high packing density for
`VLSI implementation. However, since the
`function is built in, the application space of
`a particular systolic array is rather limited.
`Recognizing the benefit to he gained from a
`more flexible base for systolic algorithm
`implementation, H. T, Kung and colleagues
`
`1:]
`
`Construction of real
`
`hardware and feedback
`
`from real users
`
`contributed to Splash’s
`design, development,
`and success. For
`
`certain pattern-
`matching applications
`its price/performance
`ratio is unmatched.
`
`built the Warp array.l a linear array in which
`each cell is a powerful vcry—largc—instruc—
`tion-word processor. Currently, a two—di—
`mensional array of custom 32—bit proces—
`sors is being built jointly by Intel and
`Carnegie Mellon University.4
`Like the simple fixed—function systolic
`
`array, the linear array of chips comprising
`Splash is programmed at a very low level.
`A hardware implementation of the desired
`algorithm must be synthesized. Unlike the
`fixed—function systolic array,
`the “hard-
`ware" can be reprogrammed and loaded
`with new algorithms This is made possible
`by using field—programmable gate arrays
`(FPGAS) as the chips of the linear array.
`Unlike the programmable systolic array,
`each stage oflinear array does not have an
`instruction set architecture. Rather than
`processors with a fixed instruction set, a
`stage contains several hundred “configw
`rable logic blocks,” each of which can be
`configured at the gate level to compute
`certain sorts ofBoolean functions. There is
`no fixed number of systolic cells in the
`Splash array. The amount of logic in each
`cell determines the number ofsystolic cells
`per chip and therefore the number 01‘ cells
`in the array. Typical applications have eight
`or 16 systolic cells per chip.
`This gate—level programmability enables
`high»speed execution of algorithms. since
`only necessary circuitry executes. Systolic
`and parallel algorithms implemented at the
`gate level on Splash have achieved speed—
`ups of up to 330 over one Cray72 procesv
`sor5; speedups greater than 10 times are
`achieved routinely.
`
`January 1991
`
`0018-9162/91HGOO-008130100 © 199] llilsli
`
`81
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 81
`
`Petitioner Microsoft Corporation - EX. 1017, p. 81
`
`

`

`8-megabyte VSB staging memory
`
`Xilinx control chips
`
`
`
`
`
`FIFO out
`
`
`
`Figure 1. The 32-stage linear array.
`
`Overview of Splash
`
`The Splash design was motivated by a
`systolic algorithm for DNA pattern match-
`ing" From the outset, the application do-
`main has focused on non—floating-point
`applications such as pattern matching. Many
`pattern—matching applications must recog-
`nize when two sequences are similar. even
`though they may not be identical. Exam—
`ples include speech recognition, data re-
`trieval. and genetic analysis.7 The Splash
`architecture is suited to one—dimensional
`pattern matching. A two-dimensional im—
`plementation with similar FPGA technolo-
`gy has been built by Digital Equipment
`Corporation Paris Research Labs.8
`The design of a prototype was begun in
`September 1988 at
`the Supercomputing
`Research Center. In June 1989. Splash was
`released to the SRC user community.
`Operational at that time were five Splash
`systems, the Logic Description Generator
`(LDG) language, and the Trigger symbolic
`debugger. Currently. 16 Splash arrays are
`in use at SRC, Brown University, and else-
`where.
`
`System. Splash consists of two boards,
`the first containing the linear array and the
`second containing a dual—ported memory
`card. The two boards reside in two VME
`(Versabns modified for Eurocard) slots of a
`Sun-3 or Sun-4 workstation (see Figure l).
`The Splash logic»array board holds 32
`Xilinx 3090 programmable gate arrays0
`and 32 memory chips. Two additional Xi]—
`inx chips are used for bus control. The
`logic array card connects to the Sun VME
`bus for control andthe Sun VME Subsystem
`Bus (VSB) for data 1/0. The associated
`
`82
`
`dual-ported memory card connects to the
`Sun VME bus for data initialization and
`retrieval and to the Sun VSB bus for data
`1/0 to and from the logic array.
`
`Programming. Splash is programmed
`by specifying the logic functions and inter-
`connections of each of 320 configurable
`logic blocks (CLBs) and 144 input/output
`blocks (lOBs) on each of the 32 chips. A
`Xilinx 3090 FPGA contains a 20 X 16 grid
`of CLBs surrounded on the perimeter by a
`single layer ofIOBs (Figure 2). A CLB has
`a combinatorial logic section. two D flip»
`flops, and an internal control section. The
`CLB can be configured to generate any
`function of five variables, any two func-
`tions of four variables (see Figure 3), or
`some functions of up to seven variables.
`The lOBs provide the interface between
`external package pins and the internal log-
`ic. Each 10B has input and output buffers,
`which include both registered and direct
`data paths.
`Each Xilinx chip is programmed at the
`gate level using the Logic Description
`Generator language. LDG is a computer-
`aided design tool developed at SRC, with
`language constructs to describe and repli»
`cate systolic cells and to place the cells on
`achip. Parameterized cell descriptions may
`be written, providing a functionality simi—
`lar to the VHDL (Vi-[SIC hardware def
`scription language) generate command.
`Splash designs are debugged using the
`Trigger symbolic debugger, also devel
`oped at SRC. Trigger is similar to a software
`debugger, with user-definable procedures
`and local variables. The values of specific
`locations on the gate array can be examined
`symbolically. The array can be single
`stcpped or stepped in burst mode. Inter—
`
`rupts can either be ignored or can invoke a
`Trigger procedure.
`LDG and Trigger permit rapid design
`turnaround time that is more comparable to
`software than hardware redesign. With
`LDG, it takes only a few keystrokes to
`significantly modify a chip design. which
`can be easily tested with Trigger. These
`design tools, plus the fact that a design can
`be loaded on the board in half a second,
`make it easier to generate and test a new
`chip design on Splash hardware than to
`simulate several different designs before
`committing to hardware.
`
`Hardware development
`
`Designing and developing Splash re—
`quired numerous decisions and trade-offs
`in defining the hardware (and the LDG and
`Trigger. as described in later sections).
`The systolic array. as first envisioned.
`was to consist of many stages connected in
`a one—dimensional array. Each stage was to
`have three components: a Xilinx FPGA,
`local memory, and a floating-point chip.
`The initial design called for dual-ported
`local memory so that the host could direct»
`1y access all the memory on the board. For
`the prototype, we planned to develop a
`simple one-board system that would plug
`into a Sun workstation, communicating
`with the Sun CPU over the VME bus.
`Because we opted for a single—board sys—
`tem. space became the constraining factor.
`Thirty-two FPGA/SRAM pairs fit nicely
`on a 9U >< 400—millimeter card but left no
`room for floating-point chips. Since the
`application driving the design did not re
`quire floating-point manipulation, we
`eliminated floating—point chips from the
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 82
`
`Petitioner Microsoft Corporation - EX. 1017, p. 82
`
`

`

`design. That left 32 stages, each with an
`FPGA and an SRAM chip.
`At that
`time,
`the biggest and fastest
`memories were single— ported 128K x 8, 50—
`nanoseeond SRAMs. Thus, we were faced
`with choosing between slower, dual—port-
`ed, host—accessible SRAMs and faster, sin-
`gle-ported memories accessible only
`through the Xilinx chips. Since we expect—
`ed applications to use the local memories
`primarily for constant tables, which would
`be loaded initially through the host and
`accessed only locally, we opted for the
`faster, single-ported SRAMs.
`Finally, we used two more Xilinx chips
`for the VME bus interface so that changes
`to the VME interface design could be
`quickly implemented without tnodifying
`the hardware.
`
`To simulate or not to simulate. Because
`the flow ofsy stolic algorithms is data inde-
`pendent, we could estimate the prototype‘s
`performance on targeted applications be—
`fore actually building the board. However,
`several potential users questioned the ac-
`curacy of these predictions. They felt we
`should simulate the design first, but ini—
`tial estimates showed that simulating the
`Xilinx chips on a Zycad SDE would be
`2,500 times slower than Splash — even for
`simple programs.
`Since the board design was relatively
`simple, we decided to build the board rathi
`erthaninvestresourccsinasimulator. This
`turned out to be the right decision. It took
`two engineers only six months to get the
`first board up and running. Since we chose
`not to write a simulator, the software engi-
`neers developed the programming and de-
`bugging tools simultaneously with the
`hardware,
`
`Communicating with the host. A sim-
`ple VME interface was built first. The
`initial design had specified this as the only
`communication channel to the Sun host.
`Realizing that the Splash card could quickly
`outrun the 1 millisecond/32bit VME
`transfer rate, we added a second, faster
`communication channel, the VSB bus. This
`necessitated a second off—the—shelf, dual—
`ported memory card as a staging memory
`for Splash. The original VME interface
`was retained for control transfers.
`In the revised design, the Sun transfers
`data to and from the staging memory via
`the VME bus, and Splash communicates
`with the staging memory over the VSB
`bus. The addition of the VSB interface
`considerably complicated the design, but
`at the time we felt the factor of two in 1/0
`
`January 1991
`
`
`
`
`
`0,31 |o,32
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`20,1
`
`beast 41,2
`
`Any function
`of up to four
`variables
`
`
`
`Any function
`of up to four
`variables
`
`
`
`
`
`
`
`
`
`
`Figure 3. A configurable logic block.
`
`speedup compensated for the additional
`complexity.
`
`Design for debuggability. Development
`of an actual hardware system often leads to
`incorporating features that might not seem
`necessary in a simulated environment. For
`instance, hardware support for debugging
`programmed applications on the Splash board
`would not have been a consideration in a
`
`paper study. It was clear early in the devel-
`opment, however, that we had to design for
`ease ofdebugging. The designer has to have
`some way of knowing what the gate arrays
`are doing at any given point in time.
`Fortunately, the Xilinx chips have a fea—
`ture called state readback. Just as the Chip
`is programmed by shifting into the chip a
`64K serial string of bits, the state can also
`be read out. The configuration of the chip
`
`83
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 83
`
`Petitioner Microsoft Corporation - EX
`
`. 1017, p. 83
`
`

`

`and the state ofall user-definable flip-flops
`()n the chip are shifted out. This feature,
`combined with the ability to single step (or
`step in bursts), has proven extremely valu—
`able for debugging user programs.
`Although state readbaek and single step
`were incorporated into the 32 Xilinx chips
`in the array, these features were not used
`on the two Xilinx chips for the VME in-
`terface. To debug the logic on these chips.
`we had to use the traditional logic analyzer
`to observe signals on the chip‘s external
`pins. To observe internal signals, however,
`we had to redesign the chip, bringing these
`signals to external pins. These external
`routings made the chip more complicated
`and altered signal timings (longer paths) of
`critical logic sections. The external routings
`changed frequently and eventually had to
`be removed after debugging, thus the actu-
`al design was not really debugged at 2111.
`Having state readback and a debugger for
`these control chips would have drastically
`reduced the design time for Splash.
`In addition to state readback and single
`stepping, other important debugging fea~
`tures include a user-definable variable—
`speed clock and maskable interrupts. If
`desired, the clock may be stopped as soon
`as an interrupt occurs. The user can 0p-
`tionally provide data flow control; if the
`input buffer (FIFO) between Splash and
`the VSB memory is empty,
`the clock
`“pauses” allowing the FIFO to fill, insur—
`ing contiguous data from the staging mem-
`ory. Because the control logic was imple-
`mented with the Xilinx chips, many
`additional features were added after the
`board was built.
`
`SRAM and Xilinx chip connection.
`We considered two alternative methods of
`connecting the SRAMs to the Xilinx chips.
`The first, perhaps more straightforward
`approach. was to dedicate 28 pins of each
`chip to SRAM. Each chip would have its
`own local memory, accessible only to that
`chip. However, we wanted high interconi
`nec‘tivity between Xilinx chipsTakingaway
`28 pins exclusively for the SRAM was
`undesirable.
`The alternative strategy, which we
`adopted, called for an SRAM connection
`to share lines connecting two chips. This
`had several additional advantages. The
`memory was now accessible to two adja-
`cent chips, giving the designer the option,
`for example. of having every other stage
`have a memory of size 256K X 8 or 128K x
`to. Another possibility is for one stage to
`be the reader and the next the writer. If the
`local memory is not required, the single
`
`84
`
`E:
`
`The goal of the
`architectural study
`was to design and
`implement a
`programmable
`linear array.
`
`dedicated line to it can be disabled, and the
`other 27 pins can be used for communicat—
`ing with the adjacent chip. The hazard of
`this implementation is that the designer
`must coordinate access to the memory so
`that both chips do not try to access the same
`SRAM at the same time.
`
`Evaluation. We have evaluated Splash’s
`performance on real applications, as well
`as on a set of synthetic benchmarks, and
`found the following of note.
`
`~ ”(J-limited. The Splash board is in-
`deed l/O-liinited. Many applications could
`run at least an order of magnitude faster
`with better I/O.
`0 Host inaccessibility m SRAMs. Al-
`though the application designer can work
`around the lack of dual—ported SRAM
`memories to the host, the inconvenience of
`getting data into and out ofthe local mem—
`ories rules out some applications that need
`host access to local memory. Using bigger.
`wider, and faster memories would be an
`important Objective fora follow-on design.
`' State readback. The state readback
`capability was originally designed for de-
`bugging. However, we found another im—
`portant, unanticipated use. Often.
`it
`is
`convenient to design a systolic algorithm
`in which the results are accumulated in
`stationary registers in the array. In this
`case, state readback can be used to read
`results at the end of a computation. This
`technique has been used in many Splash
`applications. However. the amount ol'time
`to read back the state (about halfa second)
`is often too long. A 64K bit stream is sent
`to the Sun host and filtered. The remaining
`1,024 bits constitute the desired state in-
`formation. The new Xilinx 4000 series
`discards the extraneous bits (which encode
`each chip’s CLB and 108 configurations)
`at the chip. We plan to use the new part in
`Splash followions.
`
`° Using the VSB. We expected the addi—
`tion of the faster VSB to speed 1/0 by a
`factor oftwo. This proved to be the case for
`applications that made repeated passes over
`the data set", however, applications that
`made only a single pass paid a penalty. The
`data first had to be loaded over the VME
`bus to the staging memory and then from
`the staging memory to the logic array board,
`Results also had to go first to the staging
`memory and then through the VME back to
`the host. Thus, single—pass applications ran
`slower than they would have with only the
`VME interface for data. Because we’ve
`found that applications are typically one
`pass rather than multiple pass, we are now
`making VSB—less single—board systems per
`the original design.
`
`Not all of these features would have
`been considered ifa prototype had not been
`built. The applications work has given us a
`greater understanding of the limitations
`and capabilities of Splash.
`
`Path to the Logic
`Description Generator
`
`Initial language. The intent ofthe Splash
`project was to study the systolic model of
`computing both architecturally and at the
`language level. The goal ofthe architectural
`study was to design and implement a pro
`grammable linear array. The language ob—
`jective was to understand the essential
`language constructs that describe a systolic
`computation; to define or adapt an existing
`language in which those systolic constructs
`were embedded; and. if an appropriate target
`machine was built, to implement a compil—
`er for the language.
`The core systolic constructs we selected
`were ( l ) the notion ofa logicalsystolic cell
`through which data are streamed and (2)
`the replication and interconnection of the
`logical cells to form the systolic array.
`The language construct satisfying the
`first need is called a temp/rite, which is
`associated with named input and output
`signals and whose body, in a hierarchical
`fashion, can contain other templates as
`well as language primitives. In response to
`the second need. we developed a concise
`notation to specify the replication and in,
`terconnection of the parts in a template.
`In the first iteration of LDG design, the
`language primitives consisted of the usual
`Boolean logic operations as well as D flip—
`flops. The language processor expanded
`hierarchically invoked templates until a
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 84
`
`Petitioner Microsoft Corporation - EX. 1017, p. 84
`
`

`

`primitive was encountered. The primitive
`was then output in a format required by the
`Xilinx tools.
`
`Required revision. We implemented this
`initial version of LDG and found that pro—
`gramming the Xilinx chip at
`the logic
`equation level did not effectively utilize
`the chip. Our experience showed that using
`Xilirtx tools to automatically pack the logic
`equations into logical CLBs, to assign the
`logical CLBs to physical locations on the
`chip, and finally to route the chip resulted
`in, at most. 10 percent use of the CLBs.
`Our performance requirements demand~
`ed high chip-area use. Thus, the realities of
`the design environment dictated a change
`to the language: We added CLB templates
`and 1013 templates as new primitive rem—
`plates. The designer could configure these
`templates just as with the Xilinx—supplied
`tools (for example, as a function of five
`variables). In addition, we added the con—
`cepts of location and shape to the LDC
`language. Each part in a template is assigned
`a location on the CLB/10B grid and a
`rectangular shape. The location can be ei-
`ther relative, in terms of parameters passed
`into the template, or absolute. The addition
`of user-directed placement gave the designer
`complete control over the layout of logic
`on the gate array. In conjunction with the
`replicated part, it became possible to spec—
`ify the configuration of an entire chip with
`relatively little effort.
`
`Ergonomics. The LDG syntax and user
`interface also evolved as designers wrote
`LDG programs and debugged them on real
`hardware. LDG is embedded in Common
`
`Lisp. and the initial language syntax con,
`sisted simply of calls to Lisp functions.
`This required users to develop some famil-
`iarity with the Lisp environment, especial-
`ly since syntax errors threw the user into
`the Lisp debugger. Responding to user
`feedback, we designed a more intuitive
`keyword-driven syntax and added extene
`sive errorehandling from within LDG so
`that users did not have to interact with the
`Lisp debugger.
`Although a graphical editor was avail—
`able from Xilinx, users preferred the text
`interface for its ease of modification. A
`few text changes could modify every CLB
`on every chip, which is very tedious with
`the graphical editor.
`
`LDG example. The simple example in
`Figure 4 illustrates various LDG language
`constructs. The figure shows a two—bit
`pipeline, Pipe2, with input signals sigO,
`
`January 1991
`
`sigl, and clk and output signals outO and
`out I. Internally, Pipe2 consists of 16 cop—
`ies ofanother template, cell. Figure 5 shows
`the internal structure of Pipe2. The arrays
`of signals a and b are internal to Pipe2 and
`pass the signal between adjacent stages.
`Figure 6 shows the LDG template for
`Pich. Note the correspondence between
`the set ofinput signals in the block diagram
`and the input clause on the second line of
`the LDG program (and similarly for out—
`put). The location clause passes in the
`
`Figure 4. A two-bit wide pipeline.
`
`
`
`
`
`
`
`
`
`
`
`
`Figure 5. Internal structure of Pipe2.
`
`(template pipe2
`(input sigO sigl Clk)
`(output outO outl)
`(location pos-x pos—y)
`
`(partilist
`( (name pl)
`(part cell)
`(input sigO sigl clk)
`(ouput (index a1) (index bl)
`(location lrow !col) )
`
`( (name p2-15)
`(range! 2 to 15)
`(shape row-major l by 14)
`(start—how pos—x)
`(start»!col (l + pos~y) )
`(part cell)
`(input (index a(1—!) ) (index b(l-!) ) elk)
`(output (index a!) (index b!))
`
`(location !row lcol))
`
`( (name p16)
`(startvlrow pos—x)
`(start-tool (+ 15 pos—y))
`(part cell)
`(input (index alS) (index h15) elk)
`(output outO outl)
`
`(location lrow !col))
`
`; template name
`; input signals
`; output signals
`; position parameters
`
`; first part: p1
`; of cell, input sigO, sigl, clk
`
`;outpnt a[1], b[l]
`
`; second part: p2'15
`; make 14 copies
`; for 1 row and 14 columns
`; at (posex, pos—y + 1) for
`
`; of cell
`; with inputs a[!—l], b[!—l]. elk
`; and outputs all], b[!]
`: for l in the range 2
`
`15
`
`; third part: p16
`; at (pos—x, 15 + pos—y)
`
`; of cell
`; with inputs a[15]. b[15],c1k
`; output outO, outl
`
`Figure 6. Logic-description-language template for a two-bit pipeline.
`
`85
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 85
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 85
`
`

`

`
`
`starting row and column for parts in the
`emplate. The names used here. pos—x and
`pos—y, are referenced again in a (start—
`!row...) or (start—leol...) clause for a part.
`Within Pipe2 there are three parts: p1.
`32—15, and p16. Part p2—15 invokes 14
`copies of the template cell because of the
`range clause (range! 2 to 15).
`The (shape row—major...) clause specifies
`he layout ofthese 14cells. The first copy of
`cell is placed at location (posex, pos—y + l).
`The next is at (posix, pos—y + 2). and so
`orth. Note that the number of copies of the
`cmplate cell specified by the range clause
`(2 .. 15 : 14) must equal
`the number of
`copies specified by the shape clause (1 row
`X 14 columns = 14). The final two parame-
`ters are being sent to cell. and their values
`are the current row index (!row) and current
`column index (icol), respectively,
`Pipe2 is instantiated with the call com—
`mand.
`
`(call pipeZ)
`(input dinf) dinl elk)
`(output doutO doutl)
`(location 1 1)
`
`the
`In the generated Xilinx commands,
`names dinO. din l , and elk will be substitut-
`ed for the dummy input parameters sigO.
`sig 1 , and clk (similarly for output). (pos—x.
`pos—y) will be assigned values (1, 1), and
`the expressions using pos—x and pos—y will
`be evaluated so that the evaluated values,
`constant integers. will be output as a CLB
`location.
`Here. we’ ve omitted the LDG specifica—
`tion ofeell. This template is a CLB template,
`configured simply to pass data through
`unchanged.
`Another useful feature of LDG in im-
`plementing systolic algorithms is its abil—
`ity to write parameterized template defi—
`nitions. The hardware designer can design
`a schema of a structure, a generalized shift
`register. for example, and then instantiate
`different schema instances by invoking the
`schema with different parameters. The re-
`petitive layout and the control over place—
`ment are available at the schema level as
`well as in the base language.
`The readeris referred to our earlier work'0
`for examples of schema definition and use,
`
`Splash runtime
`environment
`
`The Splash runtime environment on the
`Sun workstation consists of a symbolic
`
`86
`
`l::::l
`
`Library routines
`created for use
`
`by the debugger
`can also be invoked
`
`from C programs,
`so applications
`programs can direct
`the Splash board.
`
`debugger Trigger and a kernel driver to
`control
`the Splash device and the VSB
`interface. (The debugger borrows much of
`its code from the Horizon Simulator. " hence
`the name Trigger, son of Horse.) Library
`routines created for use by the debugger
`can also be invoked from C programs, so
`applications programs can direct the Splash
`board and gain access to Splash—related
`symbols just as the debugger does. In ad—
`dition, there are graphical tools to view the
`activity of a single chip or of the entire
`array.
`Below we discuss some debugger capa—
`bilities and how they can be accessed from
`independent C programs.
`
`Loading chip designs. All the chips are
`loaded in parallel, with each bit ofthe 327
`bit word going to a different chip. The
`entire board can be loaded in half a second.
`The same file may be used for multiple
`chips, 21 different file for each chip, or any
`combination thereof.
`
`Stepping the board. Trigger allows the
`user to step a selected number of clocks.
`Commonly, the user initially single steps
`the design, monitoring variables on the
`chips at each step. As more and more ofthe
`design starts to work, the user typically
`steps the design through a larger but still
`wellidefined number of clocks. Signals
`from the chips can be designed to interrupt
`or assert a flag in one of the control and
`status registers. In either case, the counted
`Clocks are stopped.
`
`Trigger procedures. Trigger allows the
`user to create and invoke procedures, a
`handy capability for frequently used com—
`mands. A basic library of Trigger proce—
`
`dures has grown over time and is available
`to the users as part of the Trigger library.
`The command language for Trigger is
`similar to the command language for the C—
`shell. There are conditional statements.
`while statements, for loops. user—defined
`variables that may contain strings or no—
`meric values, and a number of other state-
`ments found in many command language
`interpreters. The variables may be user-
`defined and can control flow ofthe Trigger
`procedure being executed.
`
`Nondestructive readback. The user may
`examine on—chip state at any time and then
`resume the program. In fact, symbols may
`be evaluated as part ofconditional looping.
`Thus. a user’s procedure may run a design,
`examine an on—chip variable, and make a
`decision about whether to continue running
`the design based on that variable.
`
`Support for interrupts. The Splash
`board can generate interrupts. These in—
`terrupts are vectored through the kernel
`driver and to the user program via the sigIO
`signal. Trigger allows the user to specify a
`procedure (interrupt service routine) to be
`invoked when the signal is received. There
`is a default procedure, which will print out
`information about the type of interrupt and
`why it might have occurred. Interrupts may
`also be disabled. The user can decide
`whether to defer processing ofinterrupts or
`to ignore them completely.
`
`Accessing Trigger from C. Trigger and
`a user’s C or Fortran programs can interact
`in a variety of ways. In the simplest mode.
`a user program can call Trigger library
`routines to run Splash. If desired, parts of
`the symbolic debugging environment can
`be accessed from the user program, up to
`the point of actually dropping into the de—
`bugger frotn a user program when some
`condition is met, such as when a user types
`ControliC to generate an interrupt or when
`an onechip variable reaches acertain value.
`
`Sequence comparison
`on Splash
`
`A pattern-matching algorithm has been
`implemented on Splash and on a variety of
`other supercomputers. In genetic analysis,
`sequences over the four—character alphabet
`A, C, G, and 'I‘ represent DNA molecules,
`and similarity between sequences may indi—
`cate an evolutionary or functional relation—
`ship. When attempting to characterize an
`
`COMPUTER
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 86
`
`Petitioner Microsoft Corporation - EX. 1017, p. 86
`
`

`

`unfamiliar sequence, a biologist will often
`compare it to collections of known DNA
`with the hope of finding close matches.
`There are many ways to measure the
`similarity between two DNA sequences.
`One appealing measure to biologists is the
`evolutionary distance, defined as the min-
`imum cost series of single character dele-
`tions. insertions, and substitutions needed
`to transform the source sequence S into the
`target sequence T. IfS = 51, So,
`Sm, T: 11,
`13,
`t,,, and d'lr' is the distance between the
`subscqueneesspsy,
`s, andt], t1.
`I], then
`
`[10.0 = 0
`d”) = £11,]th + eds] (5,)
`day = dowel + Fun (Ii)
`
`l S i S m
`1 SJ" 5 n
`
`and
`
`din" : lTllIl
`
`(1H,} + ('dei (Sr)
`dial-ml + Cins ”7)
`dl'
`l._jel + Csub (Sill)
`lSan. l SiSm
`
`Here Cdei (_s,-) is the cost ofdeleting xi, cmoj)
`isthe cost ofinsertingt,,and cmb(sl.t,)is the
`cost of substituting t,- for s‘.
`\Nith one processor, this dynamic pro-
`gramming formulation requires time pro-
`portional to the product of the lengths of
`the two sequences. Because DNA sequences
`are long (tens of thousands of characlers)
`and genetic databases are large (tens of
`millionsofcharacters),exhaustive searches
`can require hours of mainframe time.
`Fortunately, there is tremendous potene
`tial for parallelism in the recurrence given
`above: All values d” can be calculated
`simultaneously for a given It : 0+1").
`Mapping the recurrence onto a linear sysv
`tolic array is a straightforward procedure.
`One such arrangement is shown in Fig
`urc 7. The characters ofthe source sequence
`flow in from the left, while the characters
`of the target flow in from the right. Each
`processingelementevaluatesthe recurrence
`every clock “tick."
`The Princeton Nucleic Acid Comparator
`(P»NAC) is an NMOS realization of this
`array. designed and built for the sole pur
`pose of comparing DNA sequences.“ The
`implementation assumed that (idem i) = cma tj)
`= I, for all Si and rj, and that cmb(s,-,tj) = 0 if
`s,- matches r,» and Csub(Sr9f_/,) : 2 otherwise.
`Benchmarks established that P—NAC was
`several hundred times faster than mini—
`computers of the day.
`
`Splash implementation. In the Splash
`implementation. a processing element
`is
`composed oftwo modules: a character com-
`
`January 199]
`
`
`
`
`
`
`
`
`
`ICIill lolllsl
`lllllll
`
`L7l645¢4i3i2d
`
`
`Tgt Dst Out4+2-
`
`Figure 7. A linear systolic array for sequence comparison.
`
`
`
`Src Chr ln—vi
`
`Tgt Chr Out
`
`Character
`comparator
`
`Src Dst In
`
`Finite
`state machine
`(Mod 4)
`
`Src Chr Out
`
`Tgt Chr In
`
`Src Dst Out
`
`Tgt Dst In
`
`Figure 8. Block diagram of a sequence comparison processing element.
`
`parator and a finite state machine (see Fig,
`me 8). The source and target characters.
`each four bits, are compared during the first
`clock phase. The finite state machine com—
`putes a new distance based on the results of
`this comparison and the source and target
`distances during the second clock phase.
`
`Each systolic processing element is reale
`izcd using 12 CLBs arranged as a 6 x 2
`module. In the currentimplementation. this
`basic cell is replicated 24 times on each of
`the 30 middle chips. and IQ times on the
`two end chips. to yield a total of 744 pro—
`cessing elements.
`
`87
`
`Petitioner Microsoft Corporation - Ex. 1017, p. 87
`
`Petitioner Microsoft Corporation - EX
`
`. 1017, p. 87
`
`

`

`Table 1. Benchmark results for 100 comparisons of loo-long sequences.
`__———
`Best time
`in seconds
`Machine
`Speedup
`Notes
`y——————————————-——_—-——
`
`Splash
`
`P—NAC
`
`Multiflow Trace
`
`0.020
`
`0.91
`
`3.7
`
`Connection Machine CM—2
`
`4.7
`
`Cray—2
`
`Convex CI
`
`Sun 3/140
`
`Sun Sparcstation l
`
`DEC VAX 11/785
`
`6.5
`
`8.9
`
`48
`
`5.8
`
`54
`
`2,700
`
`1 MHZ, Sun 3/260 host
`
`60
`
`14
`
`11
`
`8.3
`
`6.0
`
`1.1
`
`9.3
`
`1.0
`
`Special-purpose NMOS
`device, Sun 2 host
`
`C compiler. optimization
`level 5, 14 functional units
`
`C compiler, Paris
`library 16.000 processors
`
`Vector Pascal, one head
`
`Vector C compiler,
`optimization level 2
`
`C compiler
`
`C compiler
`
`C compiler
`
`
`
`————-——————_—1
`
`
`
`he actual design and cons

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket