`
`(cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:7)(cid:8)(cid:9)(cid:10)(cid:2)(cid:1)(cid:11)(cid:11)(cid:12)(cid:4)(cid:2)(cid:13)(cid:7)(cid:11)(cid:2)(cid:14)(cid:12)(cid:15)(cid:9)(cid:16)(cid:2)(cid:3)(cid:6)(cid:11)(cid:9)(cid:17)(cid:18)(cid:2)(cid:19)(cid:7)(cid:20)(cid:15)(cid:12)(cid:11)(cid:9)(cid:5)(cid:7)(cid:17)
`
`(cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:7)(cid:3)(cid:31)(cid:3)(cid:32)(cid:13)(cid:10)(cid:8)(cid:13)(cid:15)(cid:33)(cid:3)(cid:34)(cid:35)(cid:36)(cid:37)
`
`(cid:38)(cid:39)(cid:40)(cid:41)(cid:40)(cid:39)(cid:42)(cid:43)(cid:1)
`
`(cid:34)(cid:34)(cid:27)
`
`(cid:8)(cid:9)(cid:10)(cid:11)(cid:3)(cid:12)(cid:13)(cid:2)(cid:14)(cid:11)(cid:3)(cid:5)(cid:10)(cid:7)(cid:18)(cid:8)(cid:4)(cid:5)(cid:10)(cid:24)(cid:20)
`
`(cid:46)(cid:13)(cid:10)(cid:5)(cid:2)(cid:18)(cid:3)(cid:47)(cid:9)(cid:16)(cid:15)(cid:2)(cid:6)(cid:12)(cid:5)
`
`(cid:47)(cid:2)(cid:14)(cid:5)(cid:24)(cid:14)(cid:3)(cid:48)(cid:10)(cid:5)(cid:49)(cid:2)(cid:15)(cid:6)(cid:5)(cid:12)(cid:33)
`
`(cid:15)(cid:16)(cid:8) (cid:50)(cid:48)(cid:51)(cid:47)(cid:39)(cid:38)(cid:41)(cid:40)(cid:39)(cid:42)(cid:43)(cid:1) (cid:17)(cid:18)(cid:19)(cid:20)(cid:20) (cid:38)(cid:39)(cid:40)(cid:41)(cid:40)(cid:39)(cid:42)(cid:43)(cid:1)
`
`(cid:1)(cid:45)(cid:45)(cid:3)(cid:50)(cid:44)(cid:42)(cid:52)(cid:39)(cid:47)(cid:45)
`
`(cid:44)(cid:45)(cid:41)(cid:46)(cid:1)
`
`(cid:34)(cid:37)(cid:34)
`
`(cid:41)(cid:18)(cid:18)(cid:3)(cid:7)(cid:9)(cid:10)(cid:12)(cid:2)(cid:10)(cid:12)(cid:3)(cid:17)(cid:9)(cid:18)(cid:18)(cid:9)(cid:22)(cid:5)(cid:10)(cid:24)(cid:3)(cid:12)(cid:14)(cid:5)(cid:6)(cid:3)(cid:16)(cid:13)(cid:24)(cid:2)(cid:3)(cid:22)(cid:13)(cid:6)(cid:3)(cid:8)(cid:16)(cid:18)(cid:9)(cid:13)(cid:4)(cid:2)(cid:4)(cid:3)(cid:19)(cid:33)(cid:3)(cid:46)(cid:13)(cid:10)(cid:5)(cid:2)(cid:18)(cid:3)(cid:47)(cid:9)(cid:16)(cid:15)(cid:2)(cid:6)(cid:12)(cid:5)(cid:3)(cid:9)(cid:10)(cid:3)(cid:25)(cid:29)(cid:3)(cid:53)(cid:13)(cid:15)(cid:7)(cid:14)(cid:3)(cid:25)(cid:29)(cid:34)(cid:30)(cid:23)
`
`(cid:40)(cid:14)(cid:2)(cid:3)(cid:8)(cid:6)(cid:2)(cid:15)(cid:3)(cid:14)(cid:13)(cid:6)(cid:3)(cid:15)(cid:2)(cid:54)(cid:8)(cid:2)(cid:6)(cid:12)(cid:2)(cid:4)(cid:3)(cid:2)(cid:10)(cid:14)(cid:13)(cid:10)(cid:7)(cid:2)(cid:55)(cid:2)(cid:10)(cid:12)(cid:3)(cid:9)(cid:17)(cid:3)(cid:12)(cid:14)(cid:2)(cid:3)(cid:4)(cid:9)(cid:22)(cid:10)(cid:18)(cid:9)(cid:13)(cid:4)(cid:2)(cid:4)(cid:3)(cid:17)(cid:5)(cid:18)(cid:2)(cid:23)
`
`Petitioner Microsoft Corporation - Ex. 1008, Cover 1
`
`
`
`
`
`I-:".‘
`
`LII-l-II1II-.-on.-
`
`
`1 o
`
`r.
`
`I:
`
`A....Jt-I-uinn-I-I-L—"
`
`
`
`
`"3.:'I-n-Il:__,.’~‘
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 363
`
`A Systolic Array for Rapid String
`Comparison
`Richard J. Lipton and Daniel Lopresti
`Department of Electrical Engineering and Computer
`Science
`Princeton University, New Jersey
`
`ABSTRACT
`
`This paper presents a linear systolic array for quanti-
`fying the similarity between two strings over a given alpha-
`bet. The architecture is a parallel realization of a standard
`dynamic programming algorithm. Also introduced is a
`novel encoding scheme which minimizes the number of bits
`required
`to represent
`a.
`state
`in
`the
`computation,
`significantly reducing the size of a processor. An nMOS
`prototype,
`to be used in searching genetic databases for
`DNA strands which closely match a target sequence,
`is
`being implemented. Preliminary results indicate that
`it
`will perform hundreds to thousands of times faster than a
`minicomputer.
`
`1. Introduction
`
`String comparison is an important operation in many disciplines; a
`particularly interesting application comes from the field of molecular
`biology. After isolating and “sequencing“ a chain of DNA, which may
`consist of from tens to thousands of the four bases Adenine, Cytosine,
`Guanine, and Thymine (abbreviated A, C, G, and T), biologists often
`want to search a database of known DNA for close matches.
`In doing
`so they hope that previous results will help them ‘draw conclusions
`about
`their new strand. One such database, maintained by the
`National
`Institutes of Health, contains millions of bases, so such
`searches can take hours of mainframe time. Hence, molecular biolo-
`gists resort
`to heuristics; they only search against the portion of the
`database which they consider relevant and they use sub-optimal algo-
`rithms which trim the search.
`In doing so,
`they may miss an unex-
`pected, but
`important,
`homology.
`Still,
`these
`searches
`require
`significant computational resources and time [1], [2], [3].
`
`i This work was supported in part by DARPA Contract N0001»l-82—K-0549.
`
`1985 Chapel Hill Conference on VLSI
`
`
`
`
`“r.-
`
`um"
`
`
`
`
`
`
`‘.::--I-_q-I!I;;II
`
`..-_*..I-I-I
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Petitioner Microsoft Corporation - EX. 1008, p. 363
`
`
`
`364
`
`Systolic Array for Rapid String Comparison
`
`2. Quantifying String Similarity
`An intuitive metric for quantifying the similarity of two strings is
`their “edit” distance [5]. That is, a count of the number of basic edit~
`ing operations:
`
`i)
`ii)
`iii)
`
`insert
`delete
`substitute
`
`Presented in this paper is a linear (i.e., one-dimensional) systolic
`array for the string comparison problem in general, as illustrated by
`the DNA comparison problem in particular.
`It is a parallelization of an
`optimal dynamic programming algorithm, which promises
`to run
`thousands of times faster than the same algorithm run on a serial com-
`puter. As
`is characteristic of systolic arrays (an architecture first
`described by H.T. Kung and associates at Carnegie-Mellon University
`[4])
`the machine is composed of a large number of simple processors
`which are highly regular and have only local communication require-
`ments; hence it is ideal for implementation in V'LSI.
`Perhaps even more significantly, a technique is introduced which
`greatly simplifies the processing elements, allowing a relatively large
`number to be fit on a single chip. This observation. which minimizes
`the amount of data which must flow between adjacent processors, may
`be applicable to other systolic implementations.
`
`ACG -substitute Tfor A—v TCG -substitute 6' for C—+ TGG
`
`needed to transform one string, the “source,” into the other, the "tar-
`get."
`For example:
`
`to transform ACG into TGG
`
`ACG —deletc A—e CG vdelete C—v G -insert G-—> GG -in.s-erl T—v TGG
`
`which is two deletions and two insertions, so the diflerence between
`ACG and TGG is four. A substitution has cost two, as it is equivalent
`to a deletion and an insertion:
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 364
`
`Petitioner Microsoft Corporation - EX. 1008, p. 364
`
`
`
`H. J. Llpion and D. Lopresfl
`
`365
`
`The edit distance between two strings can be calculated using a
`dynamic programming algorithm which would, in the above case, build
`the following table:
`
`I
`
`
`The difference, four,
`trix.
`
`?
`
`is the entry in the lower right corner of the ma-
`
`In general, the value d in the 2 X '2 table fragment:
`
`
`
`
`
`(cid:1)
`
`
`"l-l-ll‘lI-.-4-.’ m
`
`
`
`J‘Iqr-IEA::II“-’I-IFI-:.4
`
`',nII--n:,_,;
`3».'_.‘.lI-fl-|
`
`
`A;.Jn-I-umI-u-I‘:__,'.a
`
`
`
`
`
`
`9.;quI-I-II:.__'-ia'
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 365
`
`It is known that dynamic programming algorithms map well onto
`systolic arrays [6]
`In particular,
`there is a tremendous potential for
`concurrency in the construction of an edit distance table: the entries on
`a given 45 degree diagonal can be calculated simultaneously because
`
`is determined by the rule:
`
`b+l
`d=min c+1
`
`if si=tf
`a
`a+2if 3,-7ét1-
`
`
`
`Petitioner Microsoft Corporation - EX. 1008, p. 365
`
`
`
`(cid:1)
`
`366
`
`Systolic Array tor Rapid String Comparison
`
`they depend only on values up and to the left. This parallelism can be
`realized with a systolic array. All of the values on 45 degree diagonals
`will be calculated at the same time and all of the values on -45 degree
`diagonals will be calculated in the same processor.
`In this way the
`quadratic time serial algorithm becomes a linear time parallel algorithm
`requiring a linear number of processors.
`
`suffice for any lcngth string comparison. The key observation is that
`
`3. The Systolic Architecture
`A rhythmic pumping of data through simple processors distin-
`guishes systolic arrays. Figure 1 demonstrates the data flow through
`one implementation of a linear comparison array using the strings of
`the earlier example. The source and target are shifted in simultane-
`ously from the left and right respectively.
`Interleaved with the charac-
`ters are the data values from the first row and column of the dynamic
`programming matrix. When two non-null characters enter a processor
`from opposite directions a comparison is performed (indicated by the
`darkened circles in the figure). On the next clock tick the characters
`shift out and the values following them shift in. This processor now
`determines a new state based on the result of the comparison, the two
`values just shifted in, and its previous state value, using the same rule
`as in the dynamic programming algorithm, When the strings are
`shifted out
`they carry with them the last
`row and column of
`the
`dynamic programming matrix, and hence the answer.
`Strings too long to be compared in one pass through the array can
`be handled by making multiple passes.
`In this case,
`the initializing
`data values shifted in with the characters reflect a matrix row or
`column at an intermediate point in the calculation.
`
`4. Minimizing Processor Size
`Using the above scheme, comparing two strings of length n in one
`requires approximately 271 processing elements.
`Since DNA
`pass
`sequences of interest can be several thousand bases long it is necessary
`to fit a large number of processors onto a single chip. The greatest
`influence on the size of a single processor is the appearance that it
`must add and compare relatively large data values, on the order of
`log(n) bits. Even being optimistic,
`such adders and comparators
`require significant silicon area; it is unlikely that a naive implementa-
`tion would fit more than a couple of processors onto one chip. There
`is, however, an important property of this algorithm which makes
`manipulating such large values unnecessary; a constant two bits will
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 366
`
`Petitioner Microsoft Corporation - EX. 1008, p. 366
`
`
`
`(cid:1)
`
`ONE PROCESSING ELEMENT
`
`H. J. Lipton and D. Lopresli
`
`367
`
`"ill-l-IIHl-i1-:..J-'
`..r-l.y....w,-.1.1m.
`r....,.._.-,.g_I!l
`,,Iu“-_u-I.-._-=_
`
`‘:
`i
`i.
`I.
`
`”-5.4
`
`.'.'_iJn-I-u-'.:.u-I-II.'._.«
`
`
`9&.|~‘fl-I-II;
`Petitioner Microsoft Corporation - Ex. 1008, p. 367
`
`!L
`
`,'a !H i
`
`k
`
`\.
`
`Petitioner Microsoft Corporation - EX. 1008, p. 367
`
`
`
`(cid:1)
`
`368
`
`Systolic Array for Rapld String Comparlson
`
`adjacent processors’ state values can not difl'er greatly in magnitude,
`hence it is only necessary to pass low order bits between processors.
`More formally,
`
`To Prove: it is possible to construct a string comparison table:
`
`in value.
`
`In other words,
`consistently by keeping only remainders modulo 4,
`determining the remainder of d modulo 4 only requires knowing the
`characters 3,- and ’1' and the remainders of each of a, b, and c modulo 4.
`The essential point is that adjacent matrix elements must be close
`
`using the rule:
`
`(rule 1):
`
`(2+1
`d=min c+l
`
`if 5i=t'
`“
`a+2if s;7étj
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 368
`
`Petitioner Microsoft Corporation - EX. 1008, p. 368
`
`
`
`(cid:1)
`
`
`R. J. Llpton and D. Lopresli
`
`369
`
`E
`
`w
`
`the values of horizontal and vertical neighbors in the matrix
`Lemma:
`differ by :l: 1. That is, a 2 X 2 matrix fragment must look like:
`
`a
`
`b=ai1,d:1:l
`
`c=ai1,dd:l
`
`d
`
`Proof of Lemma: by induction on i+j, the sum of the row and
`column indicies.
`
`Basis: for i+ j = 2 the matrix fragment is the upper left corner:
`
`
`
`ill-IE";‘I-N-I
`
`
`
`iu"
`
`
`
`
`
` -"l!'-'Elti"!'-IFI-:‘..‘r-l
`
`
`
`
`In...‘..l-l-la.
`
`;.*_‘i.-I-I-u'~.n-I-II.‘._-,.
`
`‘l
`
`i! !
`
`Only one value is calculated and by rule 1 it must be 0 or 2, hence
`horizontal and vertical neighbors differ by i l
`
`Inductive hypothesis: assume the lemma is true for i + j < n.
`
`Induction: say i+ j: n:
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 369
`
`Petitioner Microsoft Corporation - EX. 1008, p. 369
`
`
`
`(cid:1)
`
`370
`
`Systolic Array for Rapid String Comparison
`
`The values of b and c are restricted to a :1: 1 by the inductive
`hypothesis, hence the value of d can be restricted. Rewriting rule
`1:
`
`(rule 2):
`
`ifbor c equals aw 1 or 3;: ’1'
`a
`d: n+2 ifbandcequala+land sl-7étj
`
`In either of the above cases 6: di 1 and c = dd: 1, thus hor-
`izontal and vertical neighbors always differ by a; 1. This corn-
`pletes the proof of the lemma.
`
`to the top PLA which
`
`Determining the remainder of d modulo 4 using rule 2 only
`requires deciding between the two cues d = a and d = a + 2. Since b
`and c are guaranteed to be within ii of a,
`it
`is only necessary to
`know the remainders of each of a, b, and c modulo 4 to make this
`differentiation. This completes the proof.
`As an aside, a stronger statement is possible: only the next to
`least significant bit of both b and c are necessary to differentiate the
`cases, but the two low order bits are used in the current implementa-
`tion to make initializing the array easier.
`Of course, knowing only the remainder modulo {our of the edit
`distance between the source and target strings would be of little use,
`but the fact that the entire last row and column of the comparison
`matrix are shifted out. means that the full diilerence can be constructed
`outside of the array. All that is required is to pre—load a counter with
`the length of
`the target string and to increment or decrement
`it
`appropriately as the source values shift out of the right side of the
`array. Figure 2 demonstrates this.
`
`5. The Implementation
`Figure 3 is a plot of a single processing element. The layout was
`done using allende [7], a procedural language developed at Princeton,
`caesar, and the Berkeley PLA tools [8] The four DNA bases are
`encoded as 4—bit nibbles. Wild cards, characters which match any sub-
`set of the bases, are also available. The source and target characters
`shift in from the left and right respectively at the bottom of the pro—
`cessor. The character comparison is performed in the lower PLA dur-
`ing the first clock phase. The result of the comparison, as well as the
`source and target data values, are input
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 370
`
`Petitioner Microsoft Corporation - EX. 1008, p. 370
`
`
`
`R. J. Lipton and D. Lopresti
`
`371
`
`UP/ DOWN COUNTER
`
`-n-- *—
`
`—’ ”‘0“
`
`4—
`
`4—
`
`(~—
`
`W
`
`(cid:1)
`
`
`
`
` "r-3!’-'H'Il"!'-'Fn-:‘..‘:-I"-.'-‘E".t"E’-'.':I-;”".'
`A.—.'.-I-l-lc'l-I-I-J"-
`
`
`J-;.‘.l-I-u"uII-I-II—Z-u.
`
`
`9:.l-1v'I-I-IIJfi-u1v‘
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 371
`
`iFI .9
`
`‘—
`
`----
`
`‘—
`
`(—
`
`“OW
`
`FIGURE 2. CONSTRUCTING THE EDIT DISTANCE
`
`Petitioner Microsoft Corporation - EX. 1008, p. 371
`
`
`
`(cid:1)
`
`Systolic Array for Rapid String Comparison
`
`
`
`
`Hao ao Ba
`
`n
`a:
`
`E 3 8Hn
`
`.
`.9.no
`5VI
`.1:
`
`n ok F
`
`:In
`
`
`
`
`
`
`
`
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 372
`
`Petitioner Microsoft Corporation - EX. 1008, p. 372
`
`
`
`(cid:1)
` R. J. Lipton and D. Lopresti
`
`..‘r-v
`
`
`
`
`*I-lgllllIIE-.-:
`
`.'_.-".-‘l-I’ql;;|-!I-|.-:_-:.'
`41_I_I_:_...-.l
`n._.‘.-a-u-u
`
`373
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`determines a new state during the second clock phase.
`
`The current nMOS design (lambda :: 2.0 microns) fits 30 proces-
`sors in a 4.6mm X 6.8mm 40 pin MOSIS standard frame. A plot of
`the prototype is shown in figure 4. The processors are almost identical,
`they simply alternate as to whether they calculate their new state dur-
`ing the chip’s phi a or phi b. The design is expansible; building a
`larger array involves simple connecting a number of chips side by side.
`Also implemented is a bypass option which shortens the path through
`a chip from 30 to 10 processors; this will keep short comparisons from
`needlessly passing through the entire array.
`The factor presently limiting the number of processors which can
`be placed on one chip is power dissipation; estimates for the nMOS
`design indicate that it will consume approximately one watt. A CMOS
`version is planned which will run much cooler, so that with the same
`feature size from 50 to 80 processors can be placed on a chip.
`
`The systolic array is data hungry; it requires two 4-bit characters,
`two 2-bit counts and output monitoring three million times a second to
`operate at
`full speed. Since the array is expected to function as a
`hardware accelerator for micro-and minicomputers a primary concern is
`the host to array interface. Work is just beginning in this area.
`
`6. Expected Performance
`the linear systolic array has a
`The planned configuration of
`minimum of 340 processors and a maximum of 1020 processors, select-
`able in multiples of 20 via the bypass option. This would require 34
`chips for the array plus support logic, a reasonable number to fit on a
`standard printed circuit board. The canonical minicomputer for com-
`parison is the l MIPS VAX ll/TSOt. Assume that the systolic array
`can be clocked at 3 MHz (crystal timings support this) and that the
`VAX takes 10 instructions to do the inner loop step in the dynamic
`programming algorithm. Then the time to compare a source and tar-
`get string, both of the same length, is given by:
`
`
`
`
`
`J4....Jn-I-uall-n-im,‘,
`
`
`92w"n-n-nm._-.1J
`Petitioner Microsoft Corporation - Ex. 1008, p. 373
`
`hmerwtolic
`
`= (length + no. of PEs) /3 X 106
`
`tangy-AX: (length?) /l X 105
`
`T VAX is a trademark of Digital Equipment Corporation.
`
`Petitioner Microsoft Corporation - EX. 1008, p. 373
`
`
`
`(cid:1)
`
`374
`
`noS.naPmoC9n.nISMPaRrofyarrA.mdtsyS
`
`
`
`
`
`
`
`
`
`352:.v3&5
`
`
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 374
`
`Petitioner Microsoft Corporation - EX. 1008, p. 374
`
`
`
`(cid:1)
`
`
`length
`15
`50
`100
`250
`
`PEs
`340
`340
`340
`520
`
`systolic time
`1.18 X 10" secs.
`1.30 x 10'4
`1.47 x 10-4
`2.57 x 10-4
`
`500
`
`1020
`
`5.07 x 10’4
`
`1000
`
`*
`
`2.03 x 10‘“
`
`H. J. Lipton and D. Lopresti
`375
`VAX time
`speedup factor
`2.25 X 10'3 secs.
`19.1
`2.50 x 10-2
`192
`0.10
`680
`0.62
`2430
`
`2.50
`
`10.0
`
`4930
`
`4930
`
`
`
`."‘-'-!'-'H'::"!'-'F-'.-—.'
`
`
`Minna-u-n4;-n--ug:__.~
`£.;.-.lI-I-uz'-_'II-I-1-;,i_,
`
`'l-I-IIL‘.._-.u
`Petitioner Microsoft Corporation - Ex. 1008, p. 375
`
`=
`
`:I‘
`
`:l
`
`5‘!
`
`i'
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Figure6TheChip
`
`*four times the length 500 case through 1020 processors.
`
`7. Summary
`
`This paper has presented a linear systolic array for rapid string
`comparison. Also introduced is an encoding scheme, significant in its
`own right, which minimizes the required state information internal to
`the array. Further research will determine whether this technique can
`be applied to other systolic algorithms.
`An nMOS design for
`the specific problem of comparing DNA
`sequences is being implemented;
`it fits 30 processor 011 a single chip.
`Conservative estimates indicate that it will perform comparisons hun-
`dreds to thousands of times faster than a serial computer.
`
`Acknowledgments
`The authors would like to thank Doug Welsh of the Princeton
`Molecular Biology Department for his help. Daniel Lopresti gratefully
`acknowledges the support of a Garden State Graduate Fellowship.
`
`References
`
`[1] W. J. Wilbur and David J. Lipman, “Rapid similarity searches of
`nucleic acid protein data banks," Proc. Natl. Acad. Sci. USA,
`vol. 80, pp. 726-730, February 1983.
`
`Petitioner Microsoft Corporation - EX. 1008, p. 375
`
`
`
`(cid:1)
`
`
`
`View publication statsView publication stats
`
`Petitioner Microsoft Corporation - Ex. 1008, p. 376
`
`