`Dzung T. Hoang✁
`
`Department of Computer Science
`Brown University, Providence, RI 02912, USA
`email: dth@cs.brown.edu
`
`Daniel P. Lopresti✂
`
`Matsushita Information Technology Laboratory
`182 Nassau Street, Princeton, NJ 08542-7072, USA
`email: dpl@mitl.com
`
`Abstract
`
`This paper describes an implementation of a novel sys-
`tolic array for sequence alignment on the SPLASH re-
`configurable logic array. The systolic array operates in
`two phases. In the first phase, a sequence comparison
`array due to Lopresti [2] is used to compute a matrix
`of distances which is stored in local RAM. In the sec-
`ond phase, the stored distances are used by the alignment
`array to produce a binary encoding of the sequence align-
`ment. Preliminary benchmarks show that the SPLASH
`implementation performs several orders of magnitude
`faster than implementation on supercomputers.
`
`1 Introduction
`
`The work presented in this paper was begun during
`one co-author’s summer internship at the National Can-
`cer Institute’s Laboratory of Mathematical Biology in
`Fredrick, Maryland. The goal was to develop genetic
`sequence analysis algorithms for the SPLASH reconfig-
`urable logic array [3]. A systolic sequence compari-
`son algorithm that computes the edit distance between
`a pair of sequences had already been implemented on
`SPLASH [4]. Certain applications of interest to biol-
`ogists at the laboratory, such as multiple alignment of
`genetic (DNA and RNA) sequences, however, require
`more than just the edit distance: a more informative
`analysis of the similarity, or homology, of the sequences
`in the form of an alignment is required. In this paper,
`we describe an implementation of a systolic algorithm
`
`and afterwards by an NSF Graduate Fellowship.
`
`✄ A longer version of this paper can be found in [1].
`☎ Supported during Summer 1991 by an NIH Summer Internship
`✆ Supported by NSF grant MIP-9020570.
`
`for computing sequence alignments on SPLASH. Prior
`to our work, we know of no systolic array for computing
`sequence alignments.
`
`1.1 Sequence Comparison and Alignment
`
`Given a source sequence ✝✟✞✡✠ 1✠ 2 ☛☞☛☞☛ ✠✍✌
`, the edit distance between ✝
`sequence ✎✡✞✑✏ 1✏ 2 ☛☞☛✒☛ ✏✔✓
`and✎
`to✎
`
`and a target
`
`is defined to be the minimum cost of transforming
`through a series of the following edit operations:
`deleting a character, inserting a character, and substitut-
`ing one character for another1.
`In some applications, such as approximate multiple
`sequence comparison [5] and protein folding [6], in ad-
`dition to the edit distance, we need to know the series
`of edit operations that leads to a minimum cost transfor-
`mation. A standard way to represent the transformation
`is with an alignment. In an alignment, the characters
`of the source and target sequences are arranged in a
`matrix with two rows. The source sequence, possibly
`’, is placed in the first
`row. Similarly, the characters of the target sequence
`are placed in the second row. The matrix is analyzed
`indicates dele-
`in-
`
`with embedded null characters, ‘✕
`column-wise. A column containing ✖✒✗✘✚✙
`tion of the character ✛ ; a column containing ✖ ✘✜ ✙
`dicates insertion of the character ✢ ; and a column ✖✣✗✜ ✙
`indicates substitution of ✢
`for✛
`ment: ✖✥✤ ✦ ✤ ✧ ★ ✤ ✘ ✦ ✦★ ✦ ✘ ✧ ★ ✤ ✤ ✦ ✧ ✙ . For a given
`
`. A column consisting of
`two nulls is not allowed. Here is an example of an align-
`
`cost function, there may be more than one minimum-
`cost alignment. The alignment algorithm presented here
`
`1A different set of edit operations may be defined to suit a particular
`application. For example, in text processing, a swap of two adjacent
`characters may be considered an edit operation. However, a different
`algorithm than presented here may be required to accomodate these
`additional edit operations.
`
`Petitioner Microsoft Corporation - Ex. 1039, p. 1
`
`1
`
`✝
`
`
`SPLASH is a reconfigurable logic array developed at the
`Supercomputer Research Center (SRC) as a coprocessor
`card for the Sun VME bus. The SPLASH board con-
`tains 32 Xilinx XC3090 field-programmable gate arrays
`(FPGA) [9] with local connections to 32 1M-bit (128K
`by 8) static RAM chips. The FPGA’s are connected lin-
`early in a ring with input coming from a 32-bit FIFO
`queue connected to chip 0 and output going to a 32-bit
`FIFO queue connected to chip 31. A RAM chip is con-
`nected between each pair of adjacent FPGA chips and
`can be accessed by either FPGA. The data path connect-
`ing the FIFO’s to the array consists of 36 unidirectional
`lines, 32 for data and 4 for control signals. Adjacent
`FPGA’s, except for chips 0 and 31, are joined by a 68-bit
`programmable bidirectional bus, which shares connec-
`tions to the local RAM. Chips 0 and 31 are connected
`with a 35-bit data path. This “wrap-around” connection
`allows data flow through the array in either direction.
`At the heart of the SPLASH board are the Xilinx
`XC3090 FPGA’s. Each FPGA contains 320 configurable
`
`logic blocks (CLB’s) arranged in a 20❏ 16 grid and sur-
`
`rounded by 144 input/output blocks (IOB’s). The 144
`IOB’s surrounding each XC3090 FPGA provides con-
`nections to the control bus and programmable intercon-
`nections between adjacent chips and local RAM. Each
`IOB can be configured as either an input port, an output
`port, or a bidirectional input/output port, with optional
`latch or flip-flop operation. The programmability of the
`IOB’s allows for flexibility in the interchip connections.
`For example, when the local RAM is not needed, it
`can be disabled and the IOB’s connected to the RAM’s
`address and data lines can be used for communication
`between adjacent FPGA’s.
`The reader is referred to [3] for a more complete de-
`scription of SPLASH.
`Using the FPGA technology in SPLASH, we were
`able to rapidly prototype the systolic array without hav-
`ing to construct any additional hardware. This approach
`also has advantages over software-only simulations in
`that it allowed us to detect and correct race conditions
`present in early prototypes.
`
`3 Systolic Array
`Alignment
`
`for Sequence
`
`Our systolic array for sequence alignment operates in
`two phases. In the first phase, the systolic array oper-
`ates in sequence comparison mode to compute entries in
`the dynamic programming table and store them in local
`RAM. In the second phase, the stored table is used to
`construct an alignment with a marker passing systolic
`
`Petitioner Microsoft Corporation - Ex. 1039, p. 2
`
`2
`
` ✁✂ ✄✄✄ ☎☎✆✂✂ ✄☎☎✆✄ ✄ ✂ ✄☎☎✆ ☎☎✆ ✂ ✄✄ ☎☎✆ ✂ ✄ ✂ ✄☎☎✆ ✄ ✄ ✁✄✄✄✄✄✄✄✄ ✂✂✂✂✂✂✂✂☎☎✆ ✄☎☎✆ ✄✂☎☎✆ ✄✂☎☎✆ ✄✂☎☎✆ ✄✂ ✁ ✄✂☎☎✆✂ ✄✄✄ ✁ ✄✂☎☎✆ ✄✂☎☎✆ ✄✂☎☎✆ ☎☎✆ ✂ ✂ ✂✂☎☎✆ ✄✂☎☎✆✂✂✄✂☎☎✆✝☎☎✆✄✂ ✄☎☎✆☎☎✆ ✄☎☎✆ ✂ ✄ ✄ ✁ ✂✂ ✄☎☎✆ ✄ ✁ ✂ ✂✞ ✂ ✂ ✄ ✂ ✄☎☎✆ ☎☎✆✂ ✄☎☎✆ ✁ ✂ ✂ ✂ ✄
`
`5
`
`6
`
`7
`
`6
`
`5
`
`6
`
`7
`
`4
`
`5
`
`6
`
`5
`
`4
`
`5
`
`6
`
`5
`
`6
`
`5
`
`6
`
`7
`
`8
`
`4
`
`5
`
`6
`
`7
`
`5
`
`6
`
`5
`
`6
`
`4
`
`5
`
`6
`
`7
`
`T G C T A A G C
`
`Figure 1: Dynamic programming table with minimiza-
`tion pointers
`
`computes one such alignment.
`
`1.2 Dynamic Programming
`
`The edit distance can be computed sequentially with a
`well-known dynamic programming algorithm [7,8] in
`be the source se-
`
`✟✡✠☞☛✍✌✏✎
`time. Let ✝ ✞ ✠ 1✠ 2✠ 3 ☛☞☛✒☛ ✠✍✌
`quence,✎ ✞ ✏ 1✏ 2✏ 3 ☛☞☛✒☛ ✏✔✓ be the target sequence, and✑✓✒✕✔✖
`be the edit distance between the subsequences✠ 1✠ 2 ☛☞☛☞☛ ✠ ✒
`and ✏ 1✏ 2 ☛☞☛✒☛ ✏✖ . Then
`
`✑ 0✔ 0 0✗✑✘✒✕✔ 0
`1✰✲✱✳✰ ☛ ✗✑ 0✔✖ ✞ ✑ 0✔✖✴✙ 1✚✵✛ ✒ ✓ ✪✶★✸✷✫✹✺✮ ✗
`✞ ✑✘✒✕✙ 1✔ 0✚✜✛✣✢✥✤✧✦✩★✫✪✭✬✯✮ ✗
`1✰✼✻✡✰ ✌ ✗
`1✾ ✹✾❀❂ ❃❄❅ ✑❆✒✕✙ 1✔✖ ✚✼✛ ✢✭✤✶✦✩★❇✪ ✬✮ ✗✑ ✒✕✔✖❈✙ 1✚✼✛ ✒ ✓ ✪✧★✯✷✹✮ ✗✑❆✒✕✙ 1✔✖✴✙ 1✚✜✛ ✪☞❉❆❊☞★✫✪ ✬✔✷✹✮✣❋
`✑✘✒✽✔✖ ✞
`1✾ ✬✾❀✿✏❁
`Here✛●✢✭✤✶✦✩★❇✪✭✬❍✮ is the cost of deleting✠✣✒ ,✛ ✒ ✓ ✪✧★✯✷✫✹☞✮ is the cost
`of inserting ✏✖ , and✛●✪■❉❆❊☞★✫✪ ✬✔✷✹ ✮
`✏✖
`for ✠ ✒ .
`
`and
`
`min
`
`is the cost of substituting
`
`An alignment can be constructed by creating pointers
`to indicate the minimization choices when evaluating
`the dynamic programming recurrence. An example dy-
`namic programming table augmented with pointers is
`shown in Figure 1. By tracing a path from the lower-
`right corner to the upper-left corner, we can construct
`an alignment in reverse. The bold pointers in Figure 1
`show the path that corresponds to the alignment given in
`a previous example.
`
`2 SPLASH Reconfigurable Logic
`Array
`
`A G A C T A G G
`
`0
`
`1
`
`2
`
`3
`
`4
`
`1
`
`2
`
`3
`
`4
`
`5
`
`2
`
`3
`
`2
`
`3
`
`4
`
`3
`
`4
`
`3
`
`4
`
`5
`
`4
`
`5
`
`4
`
`3
`
`4
`
`5
`
`4
`
`5
`
`4
`
`3
`
`4
`
`6
`
`5
`
`6
`
`5
`
`4
`
`3
`
`7
`
`6
`
`5
`
`6
`
`5
`
`4
`
`8
`
`7
`
`6
`
`7
`
`6
`
`5
`
`✞
`
`
`Target
`
` 2
`
`✁ ✁✂✂ ✡ 0
`
`✗✘ ✘ ✘ ✘✘ ✘ ✘✘ ✘✗✗✗✗✗ ✗ ✗✗ ✗✗✗ ✗✗✗✙ ✙ ✙✛✚ ✜✜✜✢
`✁✂ ✂✑✂✒✂✂✑✂✒✂✁ ✁ ✁✂✂✂✂ ✓✔✁ ✓✓✂✕✂✕✂✂✕✂✕✂ ✖ ✂ ✁✁ ✁✂✕✂✕✂ ✖ ✁ ✖ ✁ ✖ ✁ ✖ ✁ ✂✑✂✒✂✑ ✓ ✙ 1✔✓ ✙ 1
`✑ ✓ ✔✓ ✙ 2
`✑ ✓ ✙ 2✔✓✑ ✓ ✔✓ ✙ 1
`✑ ✓ ✙ 1✔✓✑ ✓ ✔✓
`
`0
`
`0
`
`0
`
`0
`
`1
`
`1
`
`0
`
`0
`
`0
`
`0
`
`Reversed
`Target
`Reversed
`Source
`Marker
`Stream
`Marker Flag
`
`☛☞✂
`
`Character
`
`Traveling Distance
`
`Stored Distance
`
`✌✌✍✂
`✁✂ ✁✂✂✄✂☎✂✂☎✂☎✂✂ ✁✆ 3 ✝ 4 ✞ 5 ✞ 6 7 ✆ 8
`✁✟✠ ✡ 2
`✎✎✏✝ 1
`✡ 1
`✡ 0
`✡ 2
`✁ ✁✂✂✡ 1
` 2
`✞ 1
` 8 7 ✞ 6 ✝ 5 ✆ 4 ✞ 3
`
`Source
`
`Figure 4: Systolic array for generating alignment
`
`Figure 2: Systolic array for sequence comparison
`
`✄To RAM
`✄MemDstOut
`
`Finite State
`Machine
`
`✄✄✄ ✄✄✄✁✂ ✁✂ ✁✂✁✂
`
`Match
`SrcNull TgtNull
`
`Character
`Comparator
`
`SrcDstIn
`TgtDstOut
`
`SrcChrIn
`TgtChrOut
`
`SrcDstOut
`TgtDstIn
`
`SrcChrOut
`TgtChrIn
`
`Figure 3: Block diagram of the sequence comparison PE
`
`array.
`Since the data path to local RAM is 8-bits wide,
`for convenience, eight processing elements (PE’s) were
`placed on each FPGA chip, except for X0 and X31,
`where only four PE’s were placed to leave room for
`I/O logic. This puts a total of 248 PE’s on SPLASH,
`allowing for alignment of sequences up to 123 in length.
`
`3.1 Phase One: Dynamic Programming
`
`The dynamic programming recurrence can be mapped
`onto a linear systolic array that computes a single antidi-
`agonal of the dynamic programming table at each step,
`with each PE in the array computing the distances along
`one diagonal. The resulting systolic array (Figure 2)
`and its implementation on SPLASH is described in [4].
`The array is modified to save the dynamic programming
`table in local RAM. The first phase ends just after the
`, has been computed.
`Figure 3 shows a block diagram of a sequence com-
`parison PE. Each PE is implemented in 13 CLB’s, eight
`for the character comparator and five for the finite state
`machine.
`
`edit distance,✑ ✌ ✔✓
`
`3
`
`3.2 Phase Two: Marker Passing
`
`The pointer traceback procedure for constructing an
`alignment, as described earlier, is performed systoli-
`cally in the second phase. We can think of the trace-
`back as a marker passing process in which the marker
`hops alongs a path created by the minimization pointers.
`Using the same antidiagonal mapping of the dynamic
`programming table to PE’s as in phase one, we seek
`to move the marker from the lower-right corner of the
`table to the upper-left. Following a horizontal pointer
`would correspond to moving the marker left one PE.
`Similarly, following a vertical pointer corresponds to
`moving the marker right one PE. Finally, following a di-
`agonal pointer corresponds to keeping the marker in the
`same PE. Where there are multiple pointers, one is arbi-
`trarily chosen. In phase one, the minimization pointers
`were never actually computed. However, we can de-
`
`duce the pointers originating from position✠✱✶✗■✻ ✎ given
`✑✘✒✕✔✖ ,✑❆✒✕✙ 1✔✖ ,✑✘✒✕✙ 1✔✖❈✙ 1, ✠●✒ , and ✏✕✖ . Therefore, by reading
`
`back the distances saved in local RAM and streaming
`the source and target sequences backwards through the
`array, the minimization pointers, and thus the movement
`of the marker, can be computed. The algorithm out-
`lined above is realized by the systolic alignment array
`diagrammed in Figure 4.
`The sequence alignment PE is diagrammed in Fig-
`ure 5. The sequence alignment array uses the same
`character comparator in the sequence comparison array.
`The additional finite state machine is implemented in
`eight CLB’s, bringing the total number of CLB’s per PE
`for both phases to 21.
`Since at any step, the marker can move at most one
`PE from its current position, the marker can be regis-
`tered on a systolic stream that moves across two PE’s
`at each step. The output of the marker stream encodes
`the movement of the marker. Two consecutive 1’s in-
`dicate that the marker moved right. Two 0’s between
`successive 1’s indicate that the marker moved left. A
`pattern of 10101 indicates that the marker did not move.
`The binary pattern exiting the marker stream can be de-
`coded into a series of edit operations by a simple finite
`
`Petitioner Microsoft Corporation - Ex. 1039, p. 3
`
`✁
`
`
`1
`
`Finite State
`Machine
`
`✗ ✗✑ ✏ ✒✕✙ 1
`✑ ✏ ✒✑ ✏ ✒✂✁
`✂ ✂
`✁ ✁
`✁✂ ✁✂
`✁ ✁
`✄✄✄ ✄✄✄✁✂ ✁✂
`
`Match
`SrcNull TgtNull
`
`Character
`Comparator
`
`MarkerStrmIn
`
`SeedOut
`
`SeedInLeft
`
`MarkerOutLeft
`
`MarkerInLeft
`
`SrcChrIn
`TgtChrOut
`
`Figure 5: Block diagram of alignment PE
`
`state automaton that counts the number of 0’s between
`successive 1’s.
`
`4 Benchmarks
`
`For timing, we performed 10,000 alignments of 100-
`long sequences on SPLASH. It took 0.50 seconds to
`initialize the SPLASH array and 3.2 seconds to run the
`alignments. Normalizing for 100 alignments gives 0.032
`seconds. For comparison, the benchmarks for 100 com-
`parisons of 100-long sequences found in [4] are summa-
`rized in Figure 6. We have not completed benchmarking
`sequence alignment on conventional computers and use
`these results for preliminary comparison. Computing
`an alignment would require additional processing and
`therefore take additional time in most implementations.
`Even including initialization time, the SPLASH imple-
`mentation performs at least an order of magnitude better
`than implementations on commercial supercomputers,
`which, as tested, compute only the edit distance.
`
`5 Conclusion
`
`A systolic array for sequence alignment is presented and
`its implementation on SPLASH is described. Prelimi-
`nary benchmarks show that the SPLASH implementa-
`tion is several orders of magnitude faster than implemen-
`tations on supercomputers costing many times more.
`
`MarkerStrmOut
`SeedInRight
`
`SeedOut
`MarkerInRight
`
`MarkerOutRight
`
`SrcChrOut
`TgtChrIn
`
`System
`SPLASH
`P-NAC
`Multiflow Trace
`Sun SPARCstation 1
`Cray 2
`Convex C1
`DEC VAX 8600
`Sun 3/140
`DEC VAX 11/785
`
`Time
`0.020 s
`0.91 s
`3.7 s
`5.8 s
`6.5 s
`8.9 s
`31 s
`48 s
`54 s
`
`Speed-Up
`2,700
`60
`14
`9.3
`8.3
`6.0
`1.7
`1.1
`1.0
`
`Figure 6: Benchmarks of 100 comparisons of 100-long
`sequences [4]
`
`References
`
`[1] D. T. Hoang, “A Systolic Array for the Sequence
`Alignment Problem," Brown University, Providence,
`RI, Technical Report CS-92-22, 1992.
`
`[2] R. J. Lipton and D. P. Lopresti, “A Systolic Array
`for Rapid String Comparison," in 1985 Chapel Hill
`ConferenceonVLSI, H. Fuchs, Ed. Rockville, MD:
`Computer Science Press, pp. 363–376, 1985.
`
`[3] M. Gokhale, W. Holmes, A. Kopser, S. Lucas, R.
`Minnich, D. Sweely and D. Lopresti, “Building and
`Using a Highly Parallel Programmable Logic Array,"
`Computer, 24, no. 1, pp. 81–89, January 1991.
`
`[4] D. P. Lopresti, “Rapid Implementation of a Genetic
`Sequence Comparator Using Field-Programmable
`Logic Arrays," presented at Advanced Research in
`VLSI Conference, Santa Cruz, March 1991, Invited
`paper.
`
`[5] B. A. Shapiro, “An Algorithm for Comparing Mul-
`tiple RNA Secondary Structures," Comput. Applic.
`Biosci., 4, no. 3, pp. 387–393, 1988.
`
`[6] H. Margalit, B. A. Shapiro, A. B. Oppenheim and
`J. V. M. Jr., “Detection of Common Motifs in RNA
`Secondary Structures," Nucleic Acids Research, 17,
`no. 12, pp. 4829–4845, 1989.
`
`[7] S. B. Needleman and C. D. Wunsch, “A General
`Method Applicable to the Search for Similarities in
`the Amino-Acid Sequence of Two Proteins," Journal
`ofMolecular Biology, 48, pp. 443–453, 1970.
`
`[8] R. A. Wagner and M. J. Fischer, “The String-to-String
`Correction Problem," J.. Assn. Comput. Mach., 1,
`pp. 168–173, 1974.
`
`[9] Xilinx, Inc., The Programmable Gate Array Data
`Book. San Jose, CA, 1991.
`
`Petitioner Microsoft Corporation - Ex. 1039, p. 4
`
`4
`
`✗✗
`