`
`Matching
`
`A. E. Abdallah
`
`G. Simiakakis
`
`Centre for Applied Formal Methods
`Sc heel of Computing
`South Bank Universit y
`103 Borough Road
`London SE2 0AA
`Email: A.Abdallah@sbu.ac.uk
`
`Harocopio University
`70 Eleftheriou Venizelou St
`
`17671 Athens, Greece
`Email: gsimiak©huagr
`
`'1‘. Theoharis
`
`Department of Informatics
`The Universit yoi‘ Athens
`P anepistimioupolis 15784, .fithens, Greece
`Emaii: theothco©di.noa.gr
`
`Abstract
`
`DNA matching is a computationally demanding
`task. The Human Genome Project is producing huge
`quantities of data, which have to be analyzed. A formal
`description of the task of searching a DNA sequence is
`given and an eflicient parallel algorithm is derive das-
`ing' formal methods. The algorithm is implemented on
`an FPGA using Handel— C. a language that enables the
`compilation of high-level algorithms directly into gate
`level synchronous liar (liner [9],
`thus reducing the de—
`velopment time. The designed algorithm makes no as-
`sumptions about DNA transformations, and is there—
`fore a new powerful tool. It can lie use d in onjiinction
`with an {aspect system to automatically date ct patterns
`of inter est in the DNA.
`
`Keywords: FPGAS, Reconfigurable computing, for-
`mal methods, DNA, string matching, CSP, Handel-C,
`pipeline.
`
`1
`
`Introduction
`
`There is an increasing electronic availabilit y of DNA
`sequences from human and other species. The ambi-
`tious Human Genome Project has recently decoded the
`DNA sequence of chromosomes 21 and 22 and it is ex—
`pected that the whole human genome will he completed
`within the next three years in an admirable world—wide
`
`0-7803-6542-9/00/510.00 ® 2000 IEEE
`
`uge;
`collaboration. The amount of data produceilh
`chromosome 22 {the smallest one} consists of approxi-
`mately 32Mbytes of base data.
`Scientists who wish to use these data to test their hy—
`potheses electronically are faced with an overwhelming
`amount of processing effort. Often this effort involv es
`searc hing DNA data for traces of viruses or other or-
`ganisms that have been incorporated in the genome
`ever the long track of evolution. Also, DNA sequences
`of different Species may be compared in order to verify
`common ancestry or detect protein homology. Detec-
`tion of sequences of interest is a v ery complex task and
`several soft w are solutions exist. [410, 17 , 14.
`DNA sequences are availableas text documents.
`In computing terms the Operations that w eneed to
`perform on DNA sequences are similar to well known
`string problems, specifically the edit-distance [18] and
`the longest common subsequence problem. The first
`measures the distance betw eentw ostrings as a func-
`tion of the number of insertions, deletions and substi—
`tutions necessary in order to go from one string to the
`other. The second, as the name implies, determines
`the longest common substring oi' tw o gien strings.
`The strings involved are how ev erenormous (mil-
`lions of places in length). Efficient string processing
`is therefore a requirement. Both the above problems
`have been well studied in computer science and effi-
`cient implementations on systolic architectures have
`been proposed [13 , 15]. Sue llSpECial purpose are hi~
`
`268
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 268
`Petitioner Microsoft Corporation - Ex. 1043, p. 268
`
`
`
`Lectures. although very efficient, can be hard and ex—
`pensive to build. A recent innovation is reconfigurable
`computing, or computing systems whose hardware can
`be modified by soft w are to math a. certain application
`[16]. The main cemponent of such systems is the Field
`Pr ogremmeble Gate Amy fFPGA).
`The FPGA [8] is a silicon chip that can rapidly im—
`plement a significant amount of digital hardware under
`software control. F orexample, Xilinx and Alters are
`w ellknown FPGA manufacturers. Dramatic perfor—
`mance gains, in c0mparison to a microprocessor, can
`be acliiev edby implementing a certain algorithm di-
`rectly onto hardware and exploiting natural hardware
`parallelism while at the same time reducing overheads
`such as instruction fetdies. Furthermore one can wary
`the function of an FPGA as the problem at hand varies.
`Until recen tly,the main problem with using hard—
`w arehas been the long time required and the large
`expense involv edin producing it. Even FPGA's re
`quired training in extremely specialized tools. Then
`hardware compilation came along which allows ordi—
`nary computer programs to be turned automatically
`into hardware designs (FPGA‘s), thus promoting the
`programmer to a hardware designer. A well known and
`successful such tool is Handel—C available from ESL [9}.
`Apart from development efficiency, hardware com-
`pilation offers another important Opportunity, namely
`the potential to produce provably correct liarduare at
`the same cost as the production of provably correct
`soflw a1'e(assuming of course the correctness of the
`hardware compiler).
`In this paper we make a step in
`this direction by beginning with a functional specifica—
`tion of the problem and deriving an efficient pipelined
`implementation which can readily be programmed in
`Handel-C and thus compiled onto an FPGA. We can-
`sider a variation of the longest common subsequence
`problem which is useful for DNA operations and use
`w ellkno wnmathematical transformations 'r', 1, 2, 3]
`in the-derivation that follows. The algorithm has been
`implemented in Handel-C and performance figures are
`giv en.
`
`2 F unctionalNotation
`
`We give a brief summary of the functional notation
`used in this paper. The reader should refer to [5, G]
`for a fuller account of this notation. Lists are finite
`sequences of values of the same type. The list concate-
`nation operator is denoted by —H— and the list construc-
`tion operator is denoted by :. The elements of a list
`are displayed betw een square brackets and separated
`by commas. Function composition is denoted by a .
`Functions are usually defined using higher order func-
`
`tions or by sets of recursive equations. The operator i-
`(pron0unced “map") takes a function on the left, a list
`on the right, and applies the function to each element
`of the list. Informally, we have:
`
`f*[ai,a2,---,aul
`
`[flai )i feel. -
`
`-
`
`-
`
`‘ Haul]
`
`The operator { (pronounced “reduce”) takes an as-
`sociative binary operator on the left, a list of values on
`the right and returns the “summation" of all the ele-
`ments of the list. This can be informally described as
`folIOWs
`
`(@lllahazrwaul
`
`alone-neat
`
`In order to formulate the pattern matching prob—
`lem, we will make use of the following list manipulation
`functions. The function inits+ returns the list of non-
`empty initial segments (prefixes) of a list in increasing
`order of length.
`
`‘ ' 3551:]
`tntts+ [a] , (12, '
`= l [ails[Ghee]:"'slflia"‘aan-llulal ---[t1nl l
`
`Similarly, the function fins... returns the list of non—
`empty final segments (post-fixes) of a list in increasing
`order of length. Thus, informally, w e hare
`
`- ,on]
`fins+ [a1,e2,- -
`= [ larilylan—lraflls' '
`
`‘ )[5‘21 ' Mann}: [alvu ’ ianl]
`
`Finally, the function 8693+ which returns all the pos—
`sible Combination of consecutive elements of a list is
`
`defined as a composition of three functions.
`
`3695+ = (4—H)
`
`irttts+ *
`
`fins+
`
`3
`
`Starting Specification
`
`DNA matching can be translated to string matching
`in computing terms, where each string element is one
`of the 4 characters {.4,C, G,T} which represent the 4
`DNA bases. F or example, the DNA sequence of hlman
`chromosome 22 is downloaded into a huge text file of
`the form:
`
`TTTGGCTAAAACCGAAATCAATTATGMlGC MuGG AAG‘G .
`
`. .
`
`.
`
`This sequence represents one of the two strands of
`the chromosome; the other is complementary. Comple—
`mentary base pairs are A -— T and C — G. This huge
`sequence must then be searched to detect, for example,
`a (smaller) DNA string belonging to a virus or to reveal
`
`269
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 269
`Petitioner Microsoft Corporation - Ex. 1043, p. 269
`
`
`
`similarities with a chromosome of a particular African
`monkey.
`A mathematical specification of the problem will be
`giv on using list notation. Lists are finite sequences of
`a certain base type which in this case is the set:
`
`)3: {moor}
`
`Let Q and Q he lists of type [2] representing the
`query DNA sequence (for example virus) and the bank
`DNA sequence [for example chromosome 22) respec-
`tively.
`First, we define a function iicp which tak es tie lists
`and returns the length of the longest common prefix of
`both lists. For example, the length of the longest com-
`mon prefix of ATCCTG and ATCTTG is 3 being the length
`of ATC. A formal definition of Hep is given recursively
`as follo ws:
`
`= D
`[l
`Hop 3
`= 0
`t
`Hep []
`llcp ((1:3) (tut) = 1+Hcpst,
`: 0,
`
`if ozl)
`otherwise
`
`A t first, it mar be suggested that it is all about sim—
`ple string matching,
`the type that a word processor
`uses when trying to find a certain word in a document.
`While this is often true, DNA operations usually have
`certain peculiarities. First, mutations tak e place which
`change some bases of the subject string (eg. 0 —> T),
`th us rendering impossible its exact detection. Second,
`deletions can occur, which rein0ve a significant. part of
`the subject string altogether. And third, insertions in-
`troduce a new piece of DNA within the subject DNA
`(eg. the DNA of a virus particle), see Figure 1. Com-
`binations of these Operations are possible.
`CCGAAATC
`
`l
`l
`I
`:
`l
`l
`:
`l
`CTGACATC
`
`Hutation(:)
`
`concepts of deletion, insertion and mutation and han—
`dles them all simultaneously, but this will be at a sig-
`nificant cost of increasing the complexity of the specifi-
`cation and making it less general. instead we consider
`a simpler approach which is based on the observation
`that all the above cases (plus seine more} can be han-
`dled by detecting the longest common substm'ngs in B
`and Q of at least a minimum length. Thus for exam-
`ple,
`if w ecletect tw oconimon substriags
`length 40
`and 50 separated by a single element which is different
`in B and Q, then this points to a mutation. Similarly,
`tw 0 common substriugs thai are separated inB (rest).
`Q) by another string, point to an insertion (resp. dele-
`tion).
`In addition many other useful DNA transformatimis
`can be detected in this manner, such as the transposi-
`tion of parts of the DNA sequence. Of course, at the
`end of the day it will be the biologist who will decide
`what is of significance as well as its categorization. All
`that is therefore required is a tool which will compare
`Lw o stringsB and Q and detect equality of substrings
`within them, regardless of where they occur, in B and
`Q.
`
`A string c is said to be a maximal common segment
`of tw o stringss and t if and only if s can he expressed
`in the form 31 ++c ++.52, t can also be expressed in
`the form t; ++c ++t2 and c cannot be extended to the
`right (that is, when .32 and t; are non—empty lists, their
`first elements must be different. i.e. hd s2 gfi lid t2]
`We need to identify the maximal common segments
`of s and t. The main insight for achieving this is the
`observation that a maximal segment c of tw olists s
`and t is the longest common prefix of a final segment
`of s (i.e. c ++s_:) and a final segment of t (i.e.
`:2 ++
`t2). Therefore, a. suitable solution to this problem is
`reduced to computing the longest common prefixes of
`each final segment of s with eac 11 final segment of t.
`Hence, the lengths of maximal common segments, lines
`can be determined by computing the following table:
`
`Deletion(.)
`
`from s t = [ Hop n b l o (— fins+ s;b (— ft'nsi. t
`
`1
`
`t is m
`Given that the size of s is n and thesize of
`{where m E ii) there are oxm entries in the table; each
`(311 try requires Ohm.) calculation but by using a simple
`dynamic programming technique this can be reduced
`to 0(1).
`The lengths of maximal common segments for
`the strings ATCCATGTCATC (horizontal query) and
`CTATCTCATCG (vertical data bank] are giv on b y the
`following table. Note that both strings are displayed
`in reverse order.
`
`CCGAFLATC
`Ill. .
`l ll
`CCG
`ATC
`
`ATC
`CCG
`I ll
`lll. .
`CCGCTATC
`
`Insertion(.)
`
`Figure 1. Mutation, deletion and insertion {|
`denotes a perfect match).
`
`It. is possible to introduce a similarity measure func-
`tion between tw ostrings that incorporates the above
`
`The entry at position (Li) of
`ble computes
`the length of
`the
`
`the abo veta—
`longest prefix,
`
`270
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 270
`Petitioner Microsoft Corporation - Ex. 1043, p. 270
`
`
`
`mecnzifm=cthenn+1else0
`
`Note that >> denotes the piping CSPoperator. The
`synthesis of theabo ve processes from their functional
`description is a straigh tforiard task using the refine-
`ments rules in [1, 2, 3]. The abo veparallel version
`requires 0(m + n.)
`time steps to terminate and uses
`0(m) parallel processes in the pipeline.
`
`4.2 Pseudo Handel-C description
`
`The algorithm is implemented as a bidirectional
`pipeline, depicted in Figure 3, of length m + I where
`m is the length of string 3. The current lengths of
`matching strings flow to the righ twhile final results
`lengths (couhnns) flow downwards.
`Each stage in the pipeline holds one element e of
`the query sequence Q and compares it at each clock
`cycle with the incoming element a: of the bank sequence
`B and the value v of longest prefix in the proceeding
`segment. If these t w o elemets match, the length of the
`longest common prefix r starting with x is incremented
`otherwise, it is reseset to 0.
`In both cases this value
`is stored in a local array which will even tuallymake
`a column in the global table. After the first cycle the
`most recent value of 1' is passed to an external channel
`down, coupler] with the incoming element. of the data.
`bank and passed to the right. For space rename the
`Handel—C version is onnnitted.
`
`5 Conclusion and F urtherWor-k
`
`A fast FPGA implementation of DNA matching has
`been obtained based on a formal description of the
`problem which revealed a generally useful algorithm.
`Speed is extremely important in this application do-
`main because DNA databases, which are no w widely
`availabie, are 11 uge. The presented algorithm makes
`little to no assumptiOns on the transformations that
`may have tak en place on the DNA, and is therefore
`able to detect all cases of interest.
`It is important to observe that hardware perfor-
`mance thus becomes available out of a general pur—
`pose FPGA card. What is striking is the short devel—
`opment time that a hardware compiler like Handel-C
`allows. Although current FPGA chips allow for large
`lengths for string Q, in the present algorithm the max—
`imum length for string Q is always limited by the size
`of the FPGA. Larger strhtgiid btbrok
`en up into
`segments and processed sequentially or by multiple FP-
`GAs. Resulting matches that span over miltiple seg-
`ments could then be combined.
`
`CTRCTGTRCCTA
`
`G 000001000000
`C 100100001100
`T 020020100020
`ll 003000020003
`C 100400003100
`T 020150100020
`C 100100001100
`T 020020100020
`A 001000020003
`T 010010100010
`C 100100001100
`
`Figure 2. Table oi maximal segments for two
`strings.
`
`(drop j t), where (drop 1‘ 3) denotes
`Hep {drop t 5)
`the postfix of 5 obtained by removing the first
`2‘
`elements of s.
`It may be w orth whileto note that
`list indexing starts from 0. The first column of the
`table in Figure 2 displays Hep for the string C with
`each final segment of the bank sequence; Le. G. CG,
`TCG, ATCG, CATCG, TCATCG, CTCATCG, CTCATCG,
`TCTCATCG, ATCTCATCG. ATTCTCATCG, CTATCTCATCG.
`Similarly,
`the last column of the table displays llcp
`for the whole query with each final segment of the
`bank sequence. The length of the longest common
`segment is 5, the appropriate segment can be retrieved
`as TCATC by simply following the diagonal in the table
`starting at 5.
`
`4 Derivation of a Pipelined Algorithm
`for FPGA’s
`
`4.1 Transformation to CSP
`
`We have that lines is expressed as a pipe pattern;
`using transformational techniques from [3, 2, l], we can
`formally synthesize a pipeline network of communicat-
`ing sequential processes expressed in Hoare’s CSP [12]
`for solving this problem as follows:
`
`LMCS(3, r) = more >> ((3),: (STAGE t 3))
`
`= leot —> SKIP
`INITHJ)
`INIT(n.:s) = l(a,0) —>INIT(S)
`
`STAGE(C) z ?(.t,n) —>l{:r:,n.) —> F(c, [3: EB: 11])
`ch, 1'
`1 rs)
`?“eot” —>!eot —> PRDdownU'
`: rs)
`
`l ?
`
`(a:,n) ->!(:c,r) —+ F‘(c,(;r; 613.: n) :r : rs)
`
`271
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 271
`Petitioner Microsoft Corporation - Ex. 1043, p. 271
`
`
`
`“’1"le
`
`ST 4013
`STAGE
`(Gm—ll
`WT
`
`
`STAGE
`
`(9.)
`
`]
`
`column 1
`
`column 2
`
`column m
`
`Figure 3. The matching pipeline process LMCS.
`
`Acknowledgements
`
`The authors would like to thank The British Coun—
`cil in Greece and the University of Athens for funding
`their join tresearc h. We w ouldlike also to thank Dr.
`Nikos Yannakouris and Dr. George Dedoussis for their
`helpful comments on genetics.
`
`References
`
`[1] AB. Abdallah, Derivation of P arallel Algorithms
`from Functional Specifications to CS? Processes,
`in: Bernhard Miiller, ed., Mathematics of Program
`Construction, LNCS 94?, (Springer V erlag,1995)
`67—96
`
`[2] A. E. Abdullah, Synthesis of Massively Pipelined
`Algorithms for List Manipulation, in L. Rouge and
`P. Fraigniaud and A, Mignotte and Y. Robert (eds),
`Proceedings of the European Conference on Paral-
`lel Processing, EuroPar’Qo', LNCS 1124, (Springer
`V erlag, 1996), pp 911-920.
`
`[3] A. E. Abdalluh, Functional Process Modelling.
`In K Hammond and G. Michealson (eds), R c—
`seorch Dir cctionsiu For allch unctional Pr ogum-
`ming, (Springer Verlag. October 1999). [313339-360
`
`[4] Altshol, S.F. et 3.1., Basic Local Alignement search
`tool, Journal of Molecular Biology, pp.403-410,
`1990.
`
`[5] R. S. Bird, An Introduction to The Theory of Lists,
`in: M. Broy, ed., L ogic of Pogromming and Calculr‘:
`of Discreet Design (Springer, Berlin, 1987) 3-42.
`
`[6] R. B. Bird, Introduction to Functional Pr ogum—
`ming, (Prentice—Hall, 1998).
`
`J. Gibbons and G. Jones, Formal
`[7] R. S. Bird,
`Derivation of a P attern Matting Algorithm, Sci-
`ence of Computer Programmingm (2)
`(Elsevicr,
`1989), pages 93-104.
`
`[8] Chan, RH. and S. Mourad, Digital System Design
`Using Ffield Pr gmmmable Gale A Trays Prentice
`Hall, 1994.
`
`[9] Handel-C L anguageR efcrcnc Manual, Embedded
`Solutions Limited, 7/8 Milton Park, Abingdon, Ox~
`fordshire, 0X14 411T, United Kingdom, 2000.
`
`[10] Gaasterland, T. and C.W. Sensen, MAGPIE:
`Automated genome interpretation, TrendsGenet.,
`V oi. I2, pp.7(i-78, 1996.
`
`[11] Guerdoux-Jamet P.,Lavenier D., Wagner C. and
`Quinton P., Design and Implementation of a P ar-
`allel Architecture for Biological Sequence Compari—
`son, in L. Rouge and P. Fraigniaurl and A. Mignotte
`and Y. Robert (eds), Proceedings of the European
`Conference on For allcl Pr ocessing, EuroPar’Qo',
`LNCS 1123, (Springer Verlag, 1996), pp 11—24.
`
`[12] C. A. R. Hoare, Communicating Se quentéalPr o-
`ccsses. (Prentice-Hall, 1985).
`
`[13] Hoang D.T. and DP. Lopresti, FPGA Implemen—
`tation oBystolz'c Se queucc Alignment, LNCS 705,
`1993. pp. 183-191.
`
`{14] Reese, MG. et at, Genome Annotation Assess—
`ment
`in Drosophila. melanogastor, Genome R e-
`scarch, Vol. 10, No. 4, pp.483~501, 2000.
`
`[15] Sastry R. and N. Ranganathan, A Systolic Ar-
`ray for Apprcuximate String Matching, IEEE 1n-
`tcmott‘onol Conference on Computer Design, Cam-
`bridge, Massachusetts, 1993, pp. 402-405.
`
`[16] Schewel, J. et al (Eds), Reconfigurable Technol-
`ogy: FPGAs for Computing and Applications, Pr o~
`coatings of Spie, 20—21 September 1999, Boston,
`Massachusetts.
`
`[17] Stormo, G.D., Gene-Finding Approaches for Eu-
`kary otes,Gcnomc Research, Vol. 10, No. 4, [313.394-
`39?, 2000.
`
`[18] Wagner RA. and M.J. Fisher, The String—to-
`String Correction Problem, J. ACM, 21(1), 1974,
`pp. 168-173.
`
`[19] Waterman M. 5., Mathematical Methods for DNA
`Sequences. (CRC Press, 1989).
`
`272
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 272
`Petitioner Microsoft Corporation - EX. 1043, p. 272
`
`