throbber
Formal Development of a Reconfigurable Tool for Parallel DNA
`
`Matching
`
`A. E. Abdallah
`
`G. Simiakakis
`
`Centre for Applied Formal Methods
`Sc heel of Computing
`South Bank Universit y
`103 Borough Road
`London SE2 0AA
`Email: A.Abdallah@sbu.ac.uk
`
`Harocopio University
`70 Eleftheriou Venizelou St
`
`17671 Athens, Greece
`Email: gsimiak©huagr
`
`'1‘. Theoharis
`
`Department of Informatics
`The Universit yoi‘ Athens
`P anepistimioupolis 15784, .fithens, Greece
`Emaii: theothco©di.noa.gr
`
`Abstract
`
`DNA matching is a computationally demanding
`task. The Human Genome Project is producing huge
`quantities of data, which have to be analyzed. A formal
`description of the task of searching a DNA sequence is
`given and an eflicient parallel algorithm is derive das-
`ing' formal methods. The algorithm is implemented on
`an FPGA using Handel— C. a language that enables the
`compilation of high-level algorithms directly into gate
`level synchronous liar (liner [9],
`thus reducing the de—
`velopment time. The designed algorithm makes no as-
`sumptions about DNA transformations, and is there—
`fore a new powerful tool. It can lie use d in onjiinction
`with an {aspect system to automatically date ct patterns
`of inter est in the DNA.
`
`Keywords: FPGAS, Reconfigurable computing, for-
`mal methods, DNA, string matching, CSP, Handel-C,
`pipeline.
`
`1
`
`Introduction
`
`There is an increasing electronic availabilit y of DNA
`sequences from human and other species. The ambi-
`tious Human Genome Project has recently decoded the
`DNA sequence of chromosomes 21 and 22 and it is ex—
`pected that the whole human genome will he completed
`within the next three years in an admirable world—wide
`
`0-7803-6542-9/00/510.00 ® 2000 IEEE
`
`uge;
`collaboration. The amount of data produceilh
`chromosome 22 {the smallest one} consists of approxi-
`mately 32Mbytes of base data.
`Scientists who wish to use these data to test their hy—
`potheses electronically are faced with an overwhelming
`amount of processing effort. Often this effort involv es
`searc hing DNA data for traces of viruses or other or-
`ganisms that have been incorporated in the genome
`ever the long track of evolution. Also, DNA sequences
`of different Species may be compared in order to verify
`common ancestry or detect protein homology. Detec-
`tion of sequences of interest is a v ery complex task and
`several soft w are solutions exist. [410, 17 , 14.
`DNA sequences are availableas text documents.
`In computing terms the Operations that w eneed to
`perform on DNA sequences are similar to well known
`string problems, specifically the edit-distance [18] and
`the longest common subsequence problem. The first
`measures the distance betw eentw ostrings as a func-
`tion of the number of insertions, deletions and substi—
`tutions necessary in order to go from one string to the
`other. The second, as the name implies, determines
`the longest common substring oi' tw o gien strings.
`The strings involved are how ev erenormous (mil-
`lions of places in length). Efficient string processing
`is therefore a requirement. Both the above problems
`have been well studied in computer science and effi-
`cient implementations on systolic architectures have
`been proposed [13 , 15]. Sue llSpECial purpose are hi~
`
`268
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 268
`Petitioner Microsoft Corporation - Ex. 1043, p. 268
`
`

`

`Lectures. although very efficient, can be hard and ex—
`pensive to build. A recent innovation is reconfigurable
`computing, or computing systems whose hardware can
`be modified by soft w are to math a. certain application
`[16]. The main cemponent of such systems is the Field
`Pr ogremmeble Gate Amy fFPGA).
`The FPGA [8] is a silicon chip that can rapidly im—
`plement a significant amount of digital hardware under
`software control. F orexample, Xilinx and Alters are
`w ellknown FPGA manufacturers. Dramatic perfor—
`mance gains, in c0mparison to a microprocessor, can
`be acliiev edby implementing a certain algorithm di-
`rectly onto hardware and exploiting natural hardware
`parallelism while at the same time reducing overheads
`such as instruction fetdies. Furthermore one can wary
`the function of an FPGA as the problem at hand varies.
`Until recen tly,the main problem with using hard—
`w arehas been the long time required and the large
`expense involv edin producing it. Even FPGA's re
`quired training in extremely specialized tools. Then
`hardware compilation came along which allows ordi—
`nary computer programs to be turned automatically
`into hardware designs (FPGA‘s), thus promoting the
`programmer to a hardware designer. A well known and
`successful such tool is Handel—C available from ESL [9}.
`Apart from development efficiency, hardware com-
`pilation offers another important Opportunity, namely
`the potential to produce provably correct liarduare at
`the same cost as the production of provably correct
`soflw a1'e(assuming of course the correctness of the
`hardware compiler).
`In this paper we make a step in
`this direction by beginning with a functional specifica—
`tion of the problem and deriving an efficient pipelined
`implementation which can readily be programmed in
`Handel-C and thus compiled onto an FPGA. We can-
`sider a variation of the longest common subsequence
`problem which is useful for DNA operations and use
`w ellkno wnmathematical transformations 'r', 1, 2, 3]
`in the-derivation that follows. The algorithm has been
`implemented in Handel-C and performance figures are
`giv en.
`
`2 F unctionalNotation
`
`We give a brief summary of the functional notation
`used in this paper. The reader should refer to [5, G]
`for a fuller account of this notation. Lists are finite
`sequences of values of the same type. The list concate-
`nation operator is denoted by —H— and the list construc-
`tion operator is denoted by :. The elements of a list
`are displayed betw een square brackets and separated
`by commas. Function composition is denoted by a .
`Functions are usually defined using higher order func-
`
`tions or by sets of recursive equations. The operator i-
`(pron0unced “map") takes a function on the left, a list
`on the right, and applies the function to each element
`of the list. Informally, we have:
`
`f*[ai,a2,---,aul
`
`[flai )i feel. -
`
`-
`
`-
`
`‘ Haul]
`
`The operator { (pronounced “reduce”) takes an as-
`sociative binary operator on the left, a list of values on
`the right and returns the “summation" of all the ele-
`ments of the list. This can be informally described as
`folIOWs
`
`(@lllahazrwaul
`
`alone-neat
`
`In order to formulate the pattern matching prob—
`lem, we will make use of the following list manipulation
`functions. The function inits+ returns the list of non-
`empty initial segments (prefixes) of a list in increasing
`order of length.
`
`‘ ' 3551:]
`tntts+ [a] , (12, '
`= l [ails[Ghee]:"'slflia"‘aan-llulal ---[t1nl l
`
`Similarly, the function fins... returns the list of non—
`empty final segments (post-fixes) of a list in increasing
`order of length. Thus, informally, w e hare
`
`- ,on]
`fins+ [a1,e2,- -
`= [ larilylan—lraflls' '
`
`‘ )[5‘21 ' Mann}: [alvu ’ ianl]
`
`Finally, the function 8693+ which returns all the pos—
`sible Combination of consecutive elements of a list is
`
`defined as a composition of three functions.
`
`3695+ = (4—H)
`
`irttts+ *
`
`fins+
`
`3
`
`Starting Specification
`
`DNA matching can be translated to string matching
`in computing terms, where each string element is one
`of the 4 characters {.4,C, G,T} which represent the 4
`DNA bases. F or example, the DNA sequence of hlman
`chromosome 22 is downloaded into a huge text file of
`the form:
`
`TTTGGCTAAAACCGAAATCAATTATGMlGC MuGG AAG‘G .
`
`. .
`
`.
`
`This sequence represents one of the two strands of
`the chromosome; the other is complementary. Comple—
`mentary base pairs are A -— T and C — G. This huge
`sequence must then be searched to detect, for example,
`a (smaller) DNA string belonging to a virus or to reveal
`
`269
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 269
`Petitioner Microsoft Corporation - Ex. 1043, p. 269
`
`

`

`similarities with a chromosome of a particular African
`monkey.
`A mathematical specification of the problem will be
`giv on using list notation. Lists are finite sequences of
`a certain base type which in this case is the set:
`
`)3: {moor}
`
`Let Q and Q he lists of type [2] representing the
`query DNA sequence (for example virus) and the bank
`DNA sequence [for example chromosome 22) respec-
`tively.
`First, we define a function iicp which tak es tie lists
`and returns the length of the longest common prefix of
`both lists. For example, the length of the longest com-
`mon prefix of ATCCTG and ATCTTG is 3 being the length
`of ATC. A formal definition of Hep is given recursively
`as follo ws:
`
`= D
`[l
`Hop 3
`= 0
`t
`Hep []
`llcp ((1:3) (tut) = 1+Hcpst,
`: 0,
`
`if ozl)
`otherwise
`
`A t first, it mar be suggested that it is all about sim—
`ple string matching,
`the type that a word processor
`uses when trying to find a certain word in a document.
`While this is often true, DNA operations usually have
`certain peculiarities. First, mutations tak e place which
`change some bases of the subject string (eg. 0 —> T),
`th us rendering impossible its exact detection. Second,
`deletions can occur, which rein0ve a significant. part of
`the subject string altogether. And third, insertions in-
`troduce a new piece of DNA within the subject DNA
`(eg. the DNA of a virus particle), see Figure 1. Com-
`binations of these Operations are possible.
`CCGAAATC
`
`l
`l
`I
`:
`l
`l
`:
`l
`CTGACATC
`
`Hutation(:)
`
`concepts of deletion, insertion and mutation and han—
`dles them all simultaneously, but this will be at a sig-
`nificant cost of increasing the complexity of the specifi-
`cation and making it less general. instead we consider
`a simpler approach which is based on the observation
`that all the above cases (plus seine more} can be han-
`dled by detecting the longest common substm'ngs in B
`and Q of at least a minimum length. Thus for exam-
`ple,
`if w ecletect tw oconimon substriags
`length 40
`and 50 separated by a single element which is different
`in B and Q, then this points to a mutation. Similarly,
`tw 0 common substriugs thai are separated inB (rest).
`Q) by another string, point to an insertion (resp. dele-
`tion).
`In addition many other useful DNA transformatimis
`can be detected in this manner, such as the transposi-
`tion of parts of the DNA sequence. Of course, at the
`end of the day it will be the biologist who will decide
`what is of significance as well as its categorization. All
`that is therefore required is a tool which will compare
`Lw o stringsB and Q and detect equality of substrings
`within them, regardless of where they occur, in B and
`Q.
`
`A string c is said to be a maximal common segment
`of tw o stringss and t if and only if s can he expressed
`in the form 31 ++c ++.52, t can also be expressed in
`the form t; ++c ++t2 and c cannot be extended to the
`right (that is, when .32 and t; are non—empty lists, their
`first elements must be different. i.e. hd s2 gfi lid t2]
`We need to identify the maximal common segments
`of s and t. The main insight for achieving this is the
`observation that a maximal segment c of tw olists s
`and t is the longest common prefix of a final segment
`of s (i.e. c ++s_:) and a final segment of t (i.e.
`:2 ++
`t2). Therefore, a. suitable solution to this problem is
`reduced to computing the longest common prefixes of
`each final segment of s with eac 11 final segment of t.
`Hence, the lengths of maximal common segments, lines
`can be determined by computing the following table:
`
`Deletion(.)
`
`from s t = [ Hop n b l o (— fins+ s;b (— ft'nsi. t
`
`1
`
`t is m
`Given that the size of s is n and thesize of
`{where m E ii) there are oxm entries in the table; each
`(311 try requires Ohm.) calculation but by using a simple
`dynamic programming technique this can be reduced
`to 0(1).
`The lengths of maximal common segments for
`the strings ATCCATGTCATC (horizontal query) and
`CTATCTCATCG (vertical data bank] are giv on b y the
`following table. Note that both strings are displayed
`in reverse order.
`
`CCGAFLATC
`Ill. .
`l ll
`CCG
`ATC
`
`ATC
`CCG
`I ll
`lll. .
`CCGCTATC
`
`Insertion(.)
`
`Figure 1. Mutation, deletion and insertion {|
`denotes a perfect match).
`
`It. is possible to introduce a similarity measure func-
`tion between tw ostrings that incorporates the above
`
`The entry at position (Li) of
`ble computes
`the length of
`the
`
`the abo veta—
`longest prefix,
`
`270
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 270
`Petitioner Microsoft Corporation - Ex. 1043, p. 270
`
`

`

`mecnzifm=cthenn+1else0
`
`Note that >> denotes the piping CSPoperator. The
`synthesis of theabo ve processes from their functional
`description is a straigh tforiard task using the refine-
`ments rules in [1, 2, 3]. The abo veparallel version
`requires 0(m + n.)
`time steps to terminate and uses
`0(m) parallel processes in the pipeline.
`
`4.2 Pseudo Handel-C description
`
`The algorithm is implemented as a bidirectional
`pipeline, depicted in Figure 3, of length m + I where
`m is the length of string 3. The current lengths of
`matching strings flow to the righ twhile final results
`lengths (couhnns) flow downwards.
`Each stage in the pipeline holds one element e of
`the query sequence Q and compares it at each clock
`cycle with the incoming element a: of the bank sequence
`B and the value v of longest prefix in the proceeding
`segment. If these t w o elemets match, the length of the
`longest common prefix r starting with x is incremented
`otherwise, it is reseset to 0.
`In both cases this value
`is stored in a local array which will even tuallymake
`a column in the global table. After the first cycle the
`most recent value of 1' is passed to an external channel
`down, coupler] with the incoming element. of the data.
`bank and passed to the right. For space rename the
`Handel—C version is onnnitted.
`
`5 Conclusion and F urtherWor-k
`
`A fast FPGA implementation of DNA matching has
`been obtained based on a formal description of the
`problem which revealed a generally useful algorithm.
`Speed is extremely important in this application do-
`main because DNA databases, which are no w widely
`availabie, are 11 uge. The presented algorithm makes
`little to no assumptiOns on the transformations that
`may have tak en place on the DNA, and is therefore
`able to detect all cases of interest.
`It is important to observe that hardware perfor-
`mance thus becomes available out of a general pur—
`pose FPGA card. What is striking is the short devel—
`opment time that a hardware compiler like Handel-C
`allows. Although current FPGA chips allow for large
`lengths for string Q, in the present algorithm the max—
`imum length for string Q is always limited by the size
`of the FPGA. Larger strhtgiid btbrok
`en up into
`segments and processed sequentially or by multiple FP-
`GAs. Resulting matches that span over miltiple seg-
`ments could then be combined.
`
`CTRCTGTRCCTA
`
`G 000001000000
`C 100100001100
`T 020020100020
`ll 003000020003
`C 100400003100
`T 020150100020
`C 100100001100
`T 020020100020
`A 001000020003
`T 010010100010
`C 100100001100
`
`Figure 2. Table oi maximal segments for two
`strings.
`
`(drop j t), where (drop 1‘ 3) denotes
`Hep {drop t 5)
`the postfix of 5 obtained by removing the first
`2‘
`elements of s.
`It may be w orth whileto note that
`list indexing starts from 0. The first column of the
`table in Figure 2 displays Hep for the string C with
`each final segment of the bank sequence; Le. G. CG,
`TCG, ATCG, CATCG, TCATCG, CTCATCG, CTCATCG,
`TCTCATCG, ATCTCATCG. ATTCTCATCG, CTATCTCATCG.
`Similarly,
`the last column of the table displays llcp
`for the whole query with each final segment of the
`bank sequence. The length of the longest common
`segment is 5, the appropriate segment can be retrieved
`as TCATC by simply following the diagonal in the table
`starting at 5.
`
`4 Derivation of a Pipelined Algorithm
`for FPGA’s
`
`4.1 Transformation to CSP
`
`We have that lines is expressed as a pipe pattern;
`using transformational techniques from [3, 2, l], we can
`formally synthesize a pipeline network of communicat-
`ing sequential processes expressed in Hoare’s CSP [12]
`for solving this problem as follows:
`
`LMCS(3, r) = more >> ((3),: (STAGE t 3))
`
`= leot —> SKIP
`INITHJ)
`INIT(n.:s) = l(a,0) —>INIT(S)
`
`STAGE(C) z ?(.t,n) —>l{:r:,n.) —> F(c, [3: EB: 11])
`ch, 1'
`1 rs)
`?“eot” —>!eot —> PRDdownU'
`: rs)
`
`l ?
`
`(a:,n) ->!(:c,r) —+ F‘(c,(;r; 613.: n) :r : rs)
`
`271
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 271
`Petitioner Microsoft Corporation - Ex. 1043, p. 271
`
`

`

`“’1"le
`
`ST 4013
`STAGE
`(Gm—ll
`WT
`
`
`STAGE
`
`(9.)
`
`]
`
`column 1
`
`column 2
`
`column m
`
`Figure 3. The matching pipeline process LMCS.
`
`Acknowledgements
`
`The authors would like to thank The British Coun—
`cil in Greece and the University of Athens for funding
`their join tresearc h. We w ouldlike also to thank Dr.
`Nikos Yannakouris and Dr. George Dedoussis for their
`helpful comments on genetics.
`
`References
`
`[1] AB. Abdallah, Derivation of P arallel Algorithms
`from Functional Specifications to CS? Processes,
`in: Bernhard Miiller, ed., Mathematics of Program
`Construction, LNCS 94?, (Springer V erlag,1995)
`67—96
`
`[2] A. E. Abdullah, Synthesis of Massively Pipelined
`Algorithms for List Manipulation, in L. Rouge and
`P. Fraigniaud and A, Mignotte and Y. Robert (eds),
`Proceedings of the European Conference on Paral-
`lel Processing, EuroPar’Qo', LNCS 1124, (Springer
`V erlag, 1996), pp 911-920.
`
`[3] A. E. Abdalluh, Functional Process Modelling.
`In K Hammond and G. Michealson (eds), R c—
`seorch Dir cctionsiu For allch unctional Pr ogum-
`ming, (Springer Verlag. October 1999). [313339-360
`
`[4] Altshol, S.F. et 3.1., Basic Local Alignement search
`tool, Journal of Molecular Biology, pp.403-410,
`1990.
`
`[5] R. S. Bird, An Introduction to The Theory of Lists,
`in: M. Broy, ed., L ogic of Pogromming and Calculr‘:
`of Discreet Design (Springer, Berlin, 1987) 3-42.
`
`[6] R. B. Bird, Introduction to Functional Pr ogum—
`ming, (Prentice—Hall, 1998).
`
`J. Gibbons and G. Jones, Formal
`[7] R. S. Bird,
`Derivation of a P attern Matting Algorithm, Sci-
`ence of Computer Programmingm (2)
`(Elsevicr,
`1989), pages 93-104.
`
`[8] Chan, RH. and S. Mourad, Digital System Design
`Using Ffield Pr gmmmable Gale A Trays Prentice
`Hall, 1994.
`
`[9] Handel-C L anguageR efcrcnc Manual, Embedded
`Solutions Limited, 7/8 Milton Park, Abingdon, Ox~
`fordshire, 0X14 411T, United Kingdom, 2000.
`
`[10] Gaasterland, T. and C.W. Sensen, MAGPIE:
`Automated genome interpretation, TrendsGenet.,
`V oi. I2, pp.7(i-78, 1996.
`
`[11] Guerdoux-Jamet P.,Lavenier D., Wagner C. and
`Quinton P., Design and Implementation of a P ar-
`allel Architecture for Biological Sequence Compari—
`son, in L. Rouge and P. Fraigniaurl and A. Mignotte
`and Y. Robert (eds), Proceedings of the European
`Conference on For allcl Pr ocessing, EuroPar’Qo',
`LNCS 1123, (Springer Verlag, 1996), pp 11—24.
`
`[12] C. A. R. Hoare, Communicating Se quentéalPr o-
`ccsses. (Prentice-Hall, 1985).
`
`[13] Hoang D.T. and DP. Lopresti, FPGA Implemen—
`tation oBystolz'c Se queucc Alignment, LNCS 705,
`1993. pp. 183-191.
`
`{14] Reese, MG. et at, Genome Annotation Assess—
`ment
`in Drosophila. melanogastor, Genome R e-
`scarch, Vol. 10, No. 4, pp.483~501, 2000.
`
`[15] Sastry R. and N. Ranganathan, A Systolic Ar-
`ray for Apprcuximate String Matching, IEEE 1n-
`tcmott‘onol Conference on Computer Design, Cam-
`bridge, Massachusetts, 1993, pp. 402-405.
`
`[16] Schewel, J. et al (Eds), Reconfigurable Technol-
`ogy: FPGAs for Computing and Applications, Pr o~
`coatings of Spie, 20—21 September 1999, Boston,
`Massachusetts.
`
`[17] Stormo, G.D., Gene-Finding Approaches for Eu-
`kary otes,Gcnomc Research, Vol. 10, No. 4, [313.394-
`39?, 2000.
`
`[18] Wagner RA. and M.J. Fisher, The String—to-
`String Correction Problem, J. ACM, 21(1), 1974,
`pp. 168-173.
`
`[19] Waterman M. 5., Mathematical Methods for DNA
`Sequences. (CRC Press, 1989).
`
`272
`
`Petitioner Microsoft Corporation - Ex. 1043, p. 272
`Petitioner Microsoft Corporation - EX. 1043, p. 272
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket