`
`(cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:7)(cid:8)(cid:9)(cid:10)(cid:2)(cid:1)(cid:11)(cid:11)(cid:12)(cid:4)(cid:2)(cid:13)(cid:7)(cid:11)(cid:2)(cid:14)(cid:12)(cid:15)(cid:9)(cid:16)(cid:2)(cid:3)(cid:6)(cid:11)(cid:9)(cid:17)(cid:18)(cid:2)(cid:19)(cid:7)(cid:20)(cid:15)(cid:12)(cid:11)(cid:9)(cid:5)(cid:7)(cid:17)
`
`(cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:7)(cid:3)(cid:31)(cid:3)(cid:32)(cid:13)(cid:10)(cid:8)(cid:13)(cid:15)(cid:33)(cid:3)(cid:34)(cid:35)(cid:36)(cid:37)
`
`(cid:38)(cid:39)(cid:40)(cid:41)(cid:40)(cid:39)(cid:42)(cid:43)(cid:1)
`
`(cid:34)(cid:34)(cid:27)
`
`(cid:8)(cid:9)(cid:10)(cid:11)(cid:3)(cid:12)(cid:13)(cid:2)(cid:14)(cid:11)(cid:3)(cid:5)(cid:10)(cid:7)(cid:18)(cid:8)(cid:4)(cid:5)(cid:10)(cid:24)(cid:20)
`
`(cid:46)(cid:13)(cid:10)(cid:5)(cid:2)(cid:18)(cid:3)(cid:47)(cid:9)(cid:16)(cid:15)(cid:2)(cid:6)(cid:12)(cid:5)
`
`(cid:47)(cid:2)(cid:14)(cid:5)(cid:24)(cid:14)(cid:3)(cid:48)(cid:10)(cid:5)(cid:49)(cid:2)(cid:15)(cid:6)(cid:5)(cid:12)(cid:33)
`
`(cid:15)(cid:16)(cid:8) (cid:50)(cid:48)(cid:51)(cid:47)(cid:39)(cid:38)(cid:41)(cid:40)(cid:39)(cid:42)(cid:43)(cid:1) (cid:17)(cid:18)(cid:19)(cid:20)(cid:20) (cid:38)(cid:39)(cid:40)(cid:41)(cid:40)(cid:39)(cid:42)(cid:43)(cid:1)
`
`(cid:1)(cid:45)(cid:45)(cid:3)(cid:50)(cid:44)(cid:42)(cid:52)(cid:39)(cid:47)(cid:45)
`
`(cid:44)(cid:45)(cid:41)(cid:46)(cid:1)
`
`(cid:34)(cid:37)(cid:34)
`
`(cid:41)(cid:18)(cid:18)(cid:3)(cid:7)(cid:9)(cid:10)(cid:12)(cid:2)(cid:10)(cid:12)(cid:3)(cid:17)(cid:9)(cid:18)(cid:18)(cid:9)(cid:22)(cid:5)(cid:10)(cid:24)(cid:3)(cid:12)(cid:14)(cid:5)(cid:6)(cid:3)(cid:16)(cid:13)(cid:24)(cid:2)(cid:3)(cid:22)(cid:13)(cid:6)(cid:3)(cid:8)(cid:16)(cid:18)(cid:9)(cid:13)(cid:4)(cid:2)(cid:4)(cid:3)(cid:19)(cid:33)(cid:3)(cid:46)(cid:13)(cid:10)(cid:5)(cid:2)(cid:18)(cid:3)(cid:47)(cid:9)(cid:16)(cid:15)(cid:2)(cid:6)(cid:12)(cid:5)(cid:3)(cid:9)(cid:10)(cid:3)(cid:25)(cid:29)(cid:3)(cid:53)(cid:13)(cid:15)(cid:7)(cid:14)(cid:3)(cid:25)(cid:29)(cid:34)(cid:30)(cid:23)
`
`(cid:40)(cid:14)(cid:2)(cid:3)(cid:8)(cid:6)(cid:2)(cid:15)(cid:3)(cid:14)(cid:13)(cid:6)(cid:3)(cid:15)(cid:2)(cid:54)(cid:8)(cid:2)(cid:6)(cid:12)(cid:2)(cid:4)(cid:3)(cid:2)(cid:10)(cid:14)(cid:13)(cid:10)(cid:7)(cid:2)(cid:55)(cid:2)(cid:10)(cid:12)(cid:3)(cid:9)(cid:17)(cid:3)(cid:12)(cid:14)(cid:2)(cid:3)(cid:4)(cid:9)(cid:22)(cid:10)(cid:18)(cid:9)(cid:13)(cid:4)(cid:2)(cid:4)(cid:3)(cid:17)(cid:5)(cid:18)(cid:2)(cid:23)
`
`Petitioner Microsoft Corporation - Ex. 1008, Cover 1
`
`
`
`
`
`D 7
`
`ry
`
`a
`
`1. Introduction
`String comparison is an important operation in many disciplines; a
`particularly interesting application comes from the field of molecular
`biology. After isolating and “sequencing” a chain of DNA, which may
`consist of from tens to thousands of the four bases Adenine, Cytosine,
`Guanine, and Thymine (abbreviated A, C, G, and T)}, biologists often
`want to search a database of known DNA for close matches.
`In doing
`so they hope that previous results will help them draw conchisions
`about
`their new strand. One such database, maintained by the
`National
`Institutes of Health, contains millions of bases, so such
`searches can take hours of mainframe time. Hence, molecular biolo-
`gists resort
`to heuristics; they only search against the portion of the
`database which they consider relevant and they use sub-optimal algo-
`rithms which trim the search.
`In doing so,
`they may miss an unex-
`pected, but
`important,
`homology.
`Still,
`these
`searches
`require
`significant computational resources and time[1], {2}, [3].
`+ This work was supported in part by DARPA Contract N00014-82-iX-0549.
`
`A Systolic Array for Rapid String
`Comparison
`Richard J. Lipton and Daniel Lopresti
`Department ofElectrical Engineering and Computer
`Science
`Princeton University, New Jersey
`
`ABSTRACT
`
`This paper presents a linear systolic array for quanti-
`fying the similarity between two strings over a given alpha-
`bet. The architecture is a parallel realization of a standard
`dynamic programming algorithm. Also introduced is a
`novel encoding scheme which minimizes the numberofbits
`required
`to represent
`a
`state
`in
`the
`computation,
`significantly reducing the size of a processor. An nMOS
`prototype,
`to be used in searching genetic databases for
`DNAstrands which closely match a target sequence,
`is
`being implemented. Preliminary results indicate that
`it
`will perform hundreds to thousands of times faster than a
`minicomputer.
`
`1985 Chapel Hill Conference on VLS/
`
`wTionBe
`
`at
`
`pe
`
`peemnaePTTTeyTyTtne
`
`
`reyTT1temeeneeD
`Petitioner Microsoft Corporation - Ex. 1008, p. 363
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 363
`
`
`
`364
`
`Systolic Array for Rapid String Comparison
`
`2. Quantifying String Similarity
`An intuitive metric for quantifying the similarity of two strings is
`their “edit” distance [5]. That is, a count of the number of basieedit-
`ing operations:
`
`i)
`ii)
`iii)
`
`insert
`delete
`substitute
`
`Presented in this paper is a linear (ie., one-dimensional) systolic
`atray for the string comparison problem in general, as illustrated by
`the DNA comparison problem in particular.
`It is a parallelization of an
`optimal dynamic programming algorithm, which promises
`to run
`thousands of times faster than the same algorithm run on a serial com-
`puter, As
`is characteristic of systolic arrays (an architecture first
`described by H.T. Kung and associates at Carnegie-Mellon University
`{4])
`the machine is composed of a large number of simple processors
`which are highly regular and have only local communication require-
`ments; henceit is ideal for implementation in VLSI.
`Perhaps even more significantly, a technique is introduced which
`greatly simplifies the processing elements, allowing a relatively large
`number to be fit on a single chip. This observation, which minimizes
`the amount of data which must flow between adjacent processors, may
`be applicable to other systolic implementations.
`
`ACG -substitute T for A+ TCG -substitute G for C-» TGG
`
`needed to transform onestring, the “source,” into the other, the “tar-
`get.”
`For example:
`
`to transform ACG into TGG
`
`ACG -delete A— CG -delete C+ G -insert G-> GG -insert T> TGG
`
`which is two deletions and two insertions, so the difference between
`ACG and TGGis four. A substitution has cost two, as it is equivalent
`to a deletion and an insertion:
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 364
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 364
`
`
`
`R. J. Lipton and D. Lopresti
`
`365
`
`The edit distance between two strings can be ealeulated using a
`dynamic programming algorithm which would, in the above case, build
`the following table:
`
`OD
`
`
`
`
`(cid:1)
`
`wTeero
`
`eanTDPaci
`
`aayTTTrange
`
`peeTd
`eeeTyTyeyytCee
`
`
`
`
`
`
`CeefttyCeemeneey|
`Petitioner Microsoft Corporation - Ex. 1008, p. 365
`
`It is known that dynamic programming algorithms map well onto
`systolic arrays [6].
`In particular,
`there is a tremendous potential for
`concurrency in the construction of an edit distance table; the entries on
`a given 45 degree diagonal can be caleulated simultaneously because
`
`The difference, four,
`trix.
`
`:
`
`is the entry in the lower right corner of the ma-
`
`In general, the value din the 2 X 2 table fragment:
`
`is determined by the rule:
`
`b4+1
`d== min|je+1
`
`if a=
`a
`a+2if 3Ab
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 365
`
`
`
`(cid:1)
`
`366
`
`Systolic Array for Rapid String Comparison
`
`they depend only on values up and to the left. This parallelism can be
`realized with a systolic array. All of the values on 45 degree diagonals
`will be calculated at the same time and all of the values on -45 degree
`diagonals will be calculated in the same processor.
`In this way the
`quadratic time serial algorithm becomes a linear time parallel algorithm
`requiring a linear number of processors.
`
`suffice for any length string comparison. The key observation is that
`
`3. The Systolic Architecture
`A rhythmic pumping of data through simple processors distin-
`guishes systolic arrays. Figure 1 demonstrates the data flow through
`one implementation of a linear comparison array using the strings of
`the earlier example. The source and target are shifted in simultane-
`ously from the left and right respectively.
`Interleaved with the charac-
`ters are the data values from the first row and column of the dynamic
`programming matrix. When two non-null characters enter a processor
`from opposite directions a comparison is performed {indicated by the
`darkened circles in the figure). On the next clock tick the characters
`shift out and the values following them shift in. This processor now
`determines a new state based on the result of the comparison, the two
`values just shifted in, and its previous state value, using the same rule
`as in the dynamic programming algorithm. When the strings are
`shifted out
`they carry with them the last
`row and column of
`the
`dynamic programming matrix, and hence the answer.
`Strings too long to be compared in one pass through the array can
`be handled by making multiple passes.
`In this case,
`the initializing
`data values shifted in with the characters reflect a matrix row or
`column at an intermediate point in the calculation.
`
`4. Minimizing Processor Size
`Using the above scheme, comparing twostrings of length n in one
`requires approximately 2n processing elements.
`Since DNA
`pass
`sequences ofinterest can be several thousand bases long it is necessary
`to fit a large number of processors onto a single chip. The greatest
`influence on the size of a single processor is the appearance that it
`must add and comparerelatively large data values, on the order of
`log(n) bits. Even being optimistic,
`such adders and comparators
`require significant silicon area; it is unlikely that a naive implementa-
`tion would fit more than a couple of processors onio one chip. There
`is, however, an important property of this algorithm which makes
`manipulating such large values unnecessary; a consfant two bits will
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 366
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 366
`
`
`
`
`
`(cid:1)
`
`an be
`sonals
`legree
`y the
`rithm
`
`istin-
`‘ough
`gs of
`tane-
`arac-
`amic
`essor
`the
`’
`cters
`now
`two
`Tule
`the
`a.
`
`can
`zing
`vor
`
`one
`NA
`ary
`test
`t
`it
`> of
`ors
`ita-
`1ere
`kes
`will
`hat
`
`ONE PROCESSING ELEMENT
`
`367
`
` R. J. Lipton and D. Lopresti
`
`
`
`]
`‘i
`i
`i
`os
`re
`
`|i
`
`z at
`
`i
`i.
`|
`rf
`EB
`i
`M1
`a
`rj
`;
`4
`r
`y
`
`ii|
`
`Ba
`r]
`
`|| i¢
`
`}
`H
`|
`|
`5
`i
`i
`4
`|
`ie
`
`FIGURE I. A SYSTOLIC ARRAY FOR STRING COMPARISON EeTyCents,
`Petitioner Microsoft Corporation - Ex. 1008, p. 367
`
`
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 367
`
`
`
`(cid:1)
`
`368
`
`Systolic Array for Rapid String Comparison
`
`adjacent processors’ state values can not differ greatly in magnitude,
`henceit is only necessary to pass low order bits between processors.
`More formally,
`
`To Prove:it is possible to construct a string comparison table:
`
`in value.
`
`In other words,
`consistently by keeping only remainders modulo 4.
`determining the remainder of d modulo 4 only requires knowing the
`characters s; and ¢; and the remainders of each ofa, 6, and ¢ modulo 4.
`The essential point is that adjacent matrix elements must be close
`
`using the rule:
`
`(rule 1):
`
`b+1
`d=min|je+1
`
`if s =
`a
`a+2if st
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 368
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 368
`
`
`
`
`
`
`
`(cid:1)
`parma=
`
`
` LeenTTLerthetwal)tebeTd
`
`
`pega
`
`
`eeeTTTTTine)[yySee
`
`Only one valueis calculated and by rule 1 it must be 0 or 2, hence
`horizontal and vertical neighbors differ by + 1
`
`Inductive hypothesis: assume the lemmais true for #+ 7 <n.
`
`Induction: say §+ 7 = n:
`
`iAiH
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 369
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 369
`
`!
`
`'
`
`the values of horizontal and vertical neighbors in the matrix
`Lemma:
`differ by +1. That is, a 2 X 2 matrix fragment must look like:
`
`R. J. Lipton and D. Lopresti
`
`369
`
`a
`
`b=atl,dil
`
`czmzmat+i,dtl
`
`d
`
`Proof of Lemma: by induction on + j,
`column indicies.
`
`the sum of the row and
`
`Basis: for {+ j = 2 the matrix fragment is the upperleft corner:
`
`“
`
`ea
`
`
`
`(cid:1)
`
`370—-Systolic Array for Rapid String Comparison
`The values of 6 and ¢ are restricted to a+ 1 by the inductive
`hypothesis, hence the value of d can be restricted. Rewriting rule
`1:
`
`to the top PLA which
`
`Determining the remainder of ¢ modulo 4 using rule 2 only
`requires deciding between the two cases d= aand d= a+ 2. Since b
`and ¢ are guaranteed to be within +1 of a,
`it
`is only necessary to
`know the remainders of each of a, 6, and ¢ modulo 4 to make this
`differentiation. This completes the proof.
`cnly the next to
`As an aside, a stronger statement is possible:
`least significant bit of both b and ¢ are necessary to differentiate the
`cases, but the two low order bits are used in the current implementa-
`tion to make initializing the array easier.
`Of course, knowing only the remainder modulo four of the edit
`distance between the source and target strings would be of little use,
`but the fact that the entire last row and column of the comparison
`matrix are shifted out means that the full difference can be constructed
`outside of the array. All that is required is to pre-load a counter with
`the length of
`the target string and to increment oF decrement
`it
`appropriately as the source values shift out of the right side of the
`array. Figure 2 demonstrates this.
`
`(rule 2):
`
`if 6or c equals a- lors, = 4;
`a
`d= |449 if band c equala+1 and 3 Ft
`
`In cither of the above cases b= d+] ande= dil, thus hor-
`izontal and vertical neighbors always differ by + 1. This com-
`pletes the proof of the lemma.
`
`5. The Implementation
`Figure 3 is a plot of a single processing element. The layout was
`done using allende [7], a procedural language developed at Princeton,
`caesar, and the Berkeley PLA tools [8]. The four DNA bases are
`encoded as 4-bit nibbles. Wild cards, characters which match any sub-
`set of the bases, are also available. The source and target characters
`shift in from the left and right respectively at the bottom of the pro-
`cessor. The character comparison is performed in the Jower PLA dur-
`ing the first clock phase. The result of the comparison, as well as the
`source and target data values, are input
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 370
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 370
`
`
`
`(cid:1)
`
`R. J. Lipton and D. Lopresti
`
`371
`
`UP/DOWN COUNTER
`
`
`
`ee|eedeeeSee. peeaspOee
`
`—_—
`
`<_<
`
`<_—
`
`—_—
`
`="stooh
`
`081008
`
`=}[=]?
`
`
`pecPTTTeytTens
`
`
`
`rkietart“TytTCaer|
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 371
`
`=|f=]|
`
`—_—
`
`_—
`
`FIGURE 2. CONSTRUCTING THE EDIT DISTANCE
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 371
`
`
`
`Systolic Array for Rapid String Comparison
`
`
`
`
`< * oS
`
`I bi
`
`te
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 372
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 372
`
`-ao qo b
`
`i
`bo
`
`a oS3h
`
`e& &B
`
`D
`gvi
`
`(cid:1)
`
`372
`
`
`
`
`
`i
`
`
`
`
`
`
`
`(cid:1)
` R. J. Lipton and D. Lopresti
`Eeethetea
`eepTTyesga
`Tekel
`poemTTT)
`TTenee
`
`
`Peeryytremit
`Petitioner Microsoft Corporation - Ex. 1008, p. 373
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`ae
`
`373
`
`determines a newstate during the second clock phase.
`The current nMOS design (lambda = 2.0 microns) fits 30 proces-
`sors in a 4.6mm X 6.8mm 40 pin MOSIS standard frame. A plot of
`the prototype is shown in figure 4. The processors are almost identical,
`they simply alternate as to whether they calculate their newstate dur-
`ing the chip’s phi a or phi 6. The design is expansible; building a
`larger array involves simple connecting a numberof chips side by side.
`Also implemented is a bypass option which shortens the path through
`a chip from 30 to 10 processors; this will keep short comparisous from
`needlessly passing through the entire array.
`The factor presently limiting the number of processors which can
`be placed on one chip is power dissipation; estimates for the nMOS
`design indicate that it will consume approximately one watt. A CMOS
`version is planned which will run much cooler, so that with the same
`feature size from 50 to 80 processors can be placed on a chip.
`The systolic array is data hungry; it requires two 4-bit characters,
`two 2-bit counts and output monitoring three million times a second to
`operate at
`full speed. Since the array is expected to function as a
`hardware accelerator for micro-and minicomputers a primary concernis
`the host to array interface. Work is just beginningin this area.
`
`6. Expected Performance
`the linear systolic array has a
`The planned configuration of
`minimum of 340 processors and a maximum of 1020 processors, select-
`able in multiples of 20 via the bypass option. This would require 34
`chips for the array plus support logic, a reasonable number to fit on a
`standard printed circuit board. The canonical minicomputer for com-
`parison is the 1 MIPS VAX 11/780}. Assume that the systolic array
`can be clocked at 3 MHz (crystal timings support this) aud that the
`VAX takes 10 instructions to do the inner loop step in the dynamic
`programming algorithm. Then the time to compare a source and tar-
`get string, both of the same length, is given by:
`
`HME,yetolic
`
`= (length + no. of PEs) /3 x 10°
`
`time yay = (length?) /1 x 10°
`
`+ VAX is a trademark of Digital Equipment Corporation.
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 373
`
`
`
`(cid:1)
`
`c3aH=oaE3oOac==wo3a.@«.62_a©E4232an
`
`nmwo
`
`
`
`
`
`374
`
`
`Eoee
`
`
`
`djqpoulobana
`
`
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 374
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 374
`
`
`
`YYier
`
`375
`
`speedup factor
`
`
`
`
`
`length
`
`PEs
`
`systolic time
`
`VAX time
`
`15
`
`340
`
`1.18 x 10° secs.
`
`2.25 X 107 secs.
`
`19.1
`
`50=-340 1.30 x 1074 2.50 x 107 192
`
`
`
`
`
`(cid:1)
` R. J. Lipton and D. Lopresti
`aeTTtbetbellaycid
`
`peTrTTeyTTeee
`
`
`7. Summary
`This paper has presented a linear systolic array for rapid string
`comparison. Also introduced is an encoding scheme, significant in its
`own right, which minimizes the required state information internal to
`the array. Further research will determine whether this technique can
`be applied to other systolic algorithms.
`An nMOS design for
`the specific problem of comparing DNA
`sequences is being implemented;
`it fits 30 processor on a single chip.
`Conservative estimates indicate that it will perform comparisons hun-
`dreds to thousands of times faster than a serial computer.
`
`+ta
`
`
`
`
`
`
`
`
`100=.340 1.47 x 10-4 0.10 680
`
`250
`
`520
`
`2.57 xX 10%
`
`500
`1000
`
`1020
`x
`
`5.07 x 1074
`2.03 x 103
`
`0.62
`
`2.50
`10.0
`
`2430
`
`4930
`4930
`
`*four times the length 500 case through 1020 processors.
`
`Figure4TheChip
`
`
`
`
`Acknowledgments
`The authors would like to thank Doug Welsh of the Princeton
`Molecular Biology Department for his help. Daniel Lopresti gratefully
`acknowledges the support of a Garden State Graduate Fellowship.
`
`References
`
`[1] W. J. Wilbur and David J. Lipman, ‘Rapid similarity searches of
`nucleic acid protein data banks," Proc. Natl. Acad. Sei. USA,
`vol. 80, pp. 726-730, February 1983.
`
`
`
`
`
`
`
`
`
`
`atEyiyRpeees
`Petitioner Microsoft Corporation - Ex. 1008, p. 375
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 375
`
`
`
`(cid:1)
`
`
`
`View publication statsView publication stats
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 376
`
`