`
`(19) World Intellectual Property Organization
`International Bureau
`
`(43) International Publication Date
`4 April 2002 (04.04.2002)
`
`
`
`PCT
`
`(10) International Publication Number
`WO 02/27638 Al
`
`(51)
`
`International Patent Classification’:
`
`GO06F 19/00
`
`(21) International Application Number:
`
`=PCT/NO01/00394
`
`(74) Agent: MIDTTUN,Gisle; Bryns Zacco as, P.O. Box 765,
`Sentrum, N-0106 Oslo (NO).
`
`(22) International Filing Date:
`27 September2001 (27.09.2001)
`
`(81) Designated States (national): AE, AG, AL, AM, AT, AU,
`AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CO, CR, CU,
`CZ, DE, DK, DM, DZ, EC, EL, US, FI, GB, GD, GE, GH,
`GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC,
`LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, Mw,
`MX, MZ, NO, NZ, PH, PL, PT, RO, RU, SD, SE, SG,SI,
`SK, SL, ‘TJ, 1M, TR, VT, TZ, UA, UG, US, UZ, VN, YU,
`(30) Priority Data:
`ZA, ZW.
`20004869 28 September 2000 (28.09.2000)=NO
`
`
`(25) Filing Language:
`
`(26) Publication Language:
`
`English
`
`English
`
`(71) Applicant (for all designated States except US): SEE-
`BERG, Erling, Christen [NO/NO]; Borgestadveien 25B,
`N-0875 Oslo (NO).
`
`(71)
`(72)
`
`Applicant and
`Inventor; ROGNES, Torbjern [NO/NO]; Motzfeldts
`gate 16, N-0187 Oslo (NO).
`
`(84) Designated States (regional): ARIPO patent (GH, GM,
`KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, 7W), Eurasian
`patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European
`patent (AT, BE, CH, CY, DE, DK,ES, FI, FR, GB, GR,IE,
`TT, LU, MC, NL, PT, SE, TR), OAPI patent (BF, BJ, CF,
`CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD,
`TG).
`
`[Continued on next page]
`
`(54) Title: DETERMINATION OF OPTIMAL LOCAL SEQUENCE ALIGNMENTSIMILARITY SCORE
`
`query sequence.
`
`
`
`
`
`
` aouanbaseseqeyep
`
`Sequence alignment and sequence database
`(57) Abstract;
`similarity searching are among the most important and challeng-
`ing task in bio informatics, and are used for several purposes,
`including protein function prediction. An efficient parallelisation
`of the Smith-Waterman sequence alignment algorithm using
`parallel processing in the form of SIMD (Single-Instruction,
`Multiple-Data) technology is presented. The method has been
`implementation using the MMX (MultiMedia eXtensions) and
`SSE (Streaming SIMD Extensions) technology that is embedded
`in Intel’s latest microprocessors, but
`the method can also be
`implemented using similar technology cxisting in other modern
`microprocessors. Near eight-fold speed-up relative to the fastest
`previously an optimised eight-way parallel processing approach
`achicved know non-parallel Smith-Watcrman implementation on
`the same hardware. A speed of about 200 million cell updates per
`second has been obtained on a single Intel Pentium IT 500MHz
`microprocessor.
`
`query sequence
`
` aouanbes
`
`
`
`
`
`aseqejep
`
`
`WO02/27638Al
`
`
`
`WO 02/27638 Ad
`
`—_IMMIIMTIAVTITTAT TITIAN TAMIA
`
`Declaration under Rule 4.17:
`
`of inventorship (Rule 4.17(iv)) for US only
`
`Published:
`
`with international search report
`
`For two-letter codes and other abbreviations, refer to the "Guid-
`ance Notes on Codes andAbbreviations" appearing at the begin-
`ning ofeach regular issue ofthe PCT Gazette.
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`Determination of optimal local sequence alignment similarity score
`
`Field of invention
`
`The present invention relates to the comparison of biological sequences and, more
`specifically, the invention relates to a method, a computer readable device and an
`electronic device for determination of optimal local sequence similarity score in
`accordance with claims 1,8 and 16, respectively.
`
`Backgroundof invention
`
`The rapidly increasing amounts of genetic sequence information available represent a
`constant challenge to developers of hardware and software database searching and
`handling. The size of the GenBank/EMBL/DDBJnucleotide database is now doubling
`at least every 15 months (Benson et al. 2000). The rapid expansion of the genetic
`sequence information is probably exceeding the growth in computing poweravailable at
`a constant cost, in spite of the fact that computing resources also have been increasing
`exponentially for many years. If this trend continues, increasingly longer time or
`increasingly more expensive computers will be needed to search the entire database.
`Searching databases for sequences similar to a given sequenceis one of the most
`fundamental and important tools for predicting structural and functional properties of
`uncharacterised proteins. The availability of good tools for performing these searchesis
`hence important. When looking for sequences in a database similar to a given query
`sequence, the search programs compute an alignment score for every sequencein the
`database. This score represents the degree of similarity between the query and database
`sequence. The score is calculated from the alignment of the two sequences, and is based
`on a substitution score matrix and a gap penalty function. A dynamic programming
`algorithm for computing the optimal local alignment score wasfirst described by Smith
`and Waterman (1981), improved by Gotoh (1982) for linear gap penalty functions, and
`optimised by Green (1993).
`Database searches using the optimal algorithm are unfortunately quite slow on
`ordinary computers, so many heuristic alternatives have been developed, such as
`FASTA (Pearson and Lipman, 1988) and BLAST(Altschul et al., 1990; Altschulet al.,
`1997). These methods have reduced the running time by a factor of up to 40 compared
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`to the best-known Smith-Waterman implementation on non-parallel general-purpose
`computers, however, at the expense of sensitivity. Because of the loss of sensitivity,
`somedistantly related sequences might not be detected in a search using the heuristic
`algorithms.
`Due to the demand for both fast and sensitive searches, much effort has been
`made to produce fast implementations of the Smith-Waterman method. Several special-
`purpose hardware solutions have been developed withparallel processing capabilities
`(Hughey, 1996), such as Paracel’s GeneMatcher, Compugen’s Bioccelerator and
`TimeLogic’s DeCypher. These machines are able to process more than 2 000 million
`matrix cells per second, and can be expanded to reach much higher speeds. However,
`such machinesare very expensive and cannot readily be exploited by ordinary users.
`Some hardware implementations of the Smith-Waterman algorithm are described in
`patent publications, for instance US patents 5553272, 5632041, 5706498, 5964860 and
`6112288.
`
`A more general form of parallel processing capability is available using Single-
`Instruction Multiple-Data (SIMD) technology. A SIMD computeris able to perform the
`same operation (logical, arithmetic or other) on several independent data sources in
`parallel. It is possible to exploit this by dividing wide registers into smaller units in the
`form of micro parallelism (also known as SIMD withina register - SWAR). However,
`modern microprocessors have added special registers and instructions to make the
`SIMD technologyeasier to use. With the introduction of the Pentium MMX
`(MultiMedia eXtensions) microprocessor in 1997, Intel made computing with SIMD
`technology available in a general-purpose microprocessor in the most widely used.
`computer architecture — the industry standard PC. The technologyis also available in
`the Pentium II and has been extended in the Pentium HI under the name of SSE
`
`(Streaming SIMD Extensions) (Intel, 1999). Further extension of this technology has
`been announcedfor the Pentium 4 processor (also known as Willamette) under the
`name SSE2 (Streaming SIMD extensions 2) (Intel 2000). The MMX/SSE/SSE2
`instruction sets include arithmetic (add, subtract, multiply, min, max, average,
`compare), logical (and, or, xor, not) and other instructions (shift, pack, unpack) that may
`operate on integer or floating-point numbers. This technologyis primarily designed for
`speeding up digital signal processing applicationslike sound, images and video, but
`seemssuitable also for genetic sequence comparisons. Several other microprocessors
`with SIMD technology are or will be made available in the near future, as shown in
`table 1 (Dubey, 1998).
`
`20
`
`25
`
`30
`
`35
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`Table 1. Examples of microprocessors with SEMD technology
`
`
`Manufacturer
`AMD
`
`Microprocessor
`K6/K6-2/K6-IIT
`
`Nameof technology
`MMX / 3DNow!
`
`Compag (Digital)
`Hewlett Packard (HP)
`HP / Intel
`Intel
`
`Athlon / Duron
`
`Extended MMX / 3DNow!
`
`MVI (Motion Video Instruction)
`Alpha
`MAX(-2) (Multimedia Acceleration eXtensions)
`PA-RISC
`SSE (Streaming SIMD Extensions) ?
`Itanium (Merced)
`Pentium MMX/IT MMX (MultiMedia eXtensions)
`Pentium II
`SSE (Streaming SIMD Extensions)
`Pentium 4
`SSE2 (Streaming SIMD Extensions 2)
`
`Motorola
`
`PowerPC G4
`
`Velocity Engine (AltiVec)
`
`SGI
`
`Sun
`
`MIPS
`
`SPARC
`
`MDMX (MIPSDigital Media eXtensions
`
`VIS (Visual Instruction Set)
`
`10
`
`1s
`
`20
`
`Several investigators have used SIMD technology to speed up the Smith-
`Waterman algorithm, but the increase in speed relative to the best non-parallel
`implementations have been limited.
`The general dynamic programming algorithm for optimal local alignment score
`computation wasinitially described by Smith and Waterman (1981).
`Gotoh (1982) described an implementation of this algorithm with affined gap
`penalties, where the gap penalty for a gap of size k is equal to q+rk, where q is the gap
`open penalty and ris the gap extension penalty. Undertheserestrictions the running
`time of the algorithm wasreducedto be proportional to the product of the lengths of the
`
`two sequences.
`Green (1993) wrote the SWAT program and applied some optimisations to the
`algorithm of Gotoh to achieve.a speed-up of a factor of about tworelative to a
`straightforward implementation. The SWAT-optimisations have also been incorporated
`into the SSEARCHprogram of Pearson (1991).
`The Smith-Waterman algorithm has been implemented for several different
`SIMD computers. Sturrock and Collins (1993) implemented the Smith-Waterman
`algorithm for the MasPar family of parallel computers, in a program called MPsrch.
`This solution achieved a speed of up to 130 million matrix cells per second on a MasPar
`MP-1 computer with 4096 CPUsand up to 1 500 million matrix cells per second on a
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`MasPar MP-2 with 16384 CPUs. Brutlag et al. (1993) also implemented the Smith-
`Waterman algorithm on the MasPar computers in a program called BLAZE.
`Alpern et al. (1995) presented several ways to speed up the Smith-Waterman
`algorithm including a parallel implementation utilising micro parallelism by dividing
`the 64-bit wide Z-buffer registers of the Intel Paragon i860 processors into 4 parts. With
`this approach they could compare the query sequence with four different database
`sequences simultaneously. They achieved more than a fivefold speedup over a
`conventional implementation.
`Wozniak (1997) presented a way to implement the Smith-Watermanalgorithm
`using the VIS (Visual Instruction Set) technology of Sun UltraSPARC microprocessors.
`This implementation reached a speed of over 18 million matrix cells per second on a
`167MHz UltraSPARC microprocessor. According to Wozniak (1997), this represents a
`speedup of a factor of about 2 relative to the same algorithm implemented with integer
`instructions on the same machine.
`
`Taylor (1998 and 1999) applied the MMX technologyto the Smith-Waterman
`algorithm and achieved a speed of 6.6 million cell updates per second on an Intel
`Pentium II 500 MHz microprocessor.
`Sturrock and Collins (2000) have implemented the Smith-Waterman algorithm
`using SIMD on Alpha microprocessors. Howeverno details of their method has been
`published. As far as we can see from the web service available at
`hittp://www.ebi.ac.uk/MPsrch/, they have achieved a speed of about 53 million cell
`updates per second using affine gap penalties. It is unknown exactly what computerthis
`system is running on.
`Recently, Barton et al. (2000) employed MMX technology to speed up their
`SCANPS implementation of the Smith-Waterman algorithm. They claim a speed of 71
`million cell updates per second on a Intel Pentium HI 650 MHz microprocessor. Only a
`poster abstract without any details of their implementation is currently available.
`
`Disclosure of the Invention
`
`It is described an efficient parallelisation related to a method for computing the optimal
`local sequence alignmentscore of two sequences. Increased speed ofthe overall
`computation is achieved by performing several operationsin parallel.
`
`The invention enables Smith-Waterman based searches to be performed at a much
`higher speed than previously possible on general-purpose computers.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`Using the MMX and SSEtechnologyin an Intel Pentium III 500MHz
`microprocessor, a speed of about 200 million cell updates per second was achieved
`when comparing a protein with the sequencesin a protein database. As far as known,
`this is so far the fastest implementation of the Smith-Waterman algorithm on a single-
`microprocessor general-purpose computer. Relative to the commonly used SSEARCH
`program, which is consider as a reference implementation, it represents a speedup of
`about eight. It is believed that an implementation of the present invention on the
`forthcoming Intel Pentium 4 processor running at 1.4 GHz will obtain a speed of more
`than 1000 million cell updates per second. These speeds approach or equal the speed of
`expensive dedicated hardware solutions for performing essentially the same
`calculations.
`
`Using the present invention, high speed is achieved on commonlyavailable and
`inexpensive hardware, thus significantly reducing the cost and/or computation time of
`performing sequence alignment and database searching using an optimal local
`alignment dynamic programming approach.
`
`The present invention is based on the principle that all calculations are performed using
`vectors that are parallel to the query sequence (“query-based vectors’’) (see figure 2b).
`This is in contrast to the traditional approach in which the calculations are performed
`using vectors that are parallel to the minor diagonal in the matrix (“diagonal vectors’’)
`(see figure 2a). In both cases, the vectors contain two or more elements that represent
`different cells in the alignment matrix. The values in the elements of the vectors
`represent score values, e.g. the h-, e- or f-variables in the recurrence relations described
`later.
`
`The advantage ofthe traditional approach is that the calculations of the
`individual elements of the vectors are completely independent(see figure 3a). Hence the
`use of this approach may seem obvious. The traditional approach is described in detail
`by Hughey (1996), Wozniak (1997) and Taylor (1998, 1999), and seems to be almost
`universal to all parallel implementations of the Smith-Waterman alignment procedure
`that are known in detail. However, there are several disadvantages with this approach,
`most notably the complexity of forming the vector of substitution score values. This
`problem is especially noticeable at high degrees ofparallelism. The query-based vector
`approach has not been described earlier, in spite of numerous attempts to parallelise the
`Smith-Waterman algorithm.It is believed that the present invention is sufficiently novel
`and different from the current “state of the art”.
`
`20
`
`25
`
`30
`
`35
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`The disadvantage of the query-based vector approach is the dependence between the
`individual vector elements in some of the calculations (see figure 3b). However, by
`exploiting a vector generalisation of the principles used in the SWAT-optimisations,
`these dependencies will only affect a small fraction of the calculations, and hence not
`have a major impact on the performance. The advantage of the query-based approachis
`the greatly simplified loading of the vector of substitution score values from memory
`when using a query sequenceprofile (also known as a query-specific score matrix).
`
`The present invention method uses a query profile, which is a matrix of scores for all
`combination of query positions and possible sequence symbols. The scores are arranged
`in memory in such a manner that a vector representing the scores for matching a single
`database sequence symbol with a consecutive range of query sequence symbols can
`easily be loaded from memory into the microprocessorusing a single instruction. This is
`achieved by first storing the scores for matching the first possible database sequence
`symbol with each of the symbols of the query sequence, followed by the scores for
`matching the second possible database sequence symbol with each query position, and
`so on.
`
`The scores in the query profile are all biased by a fixed amount(e.g. 4, in case of the
`BLOSUM62 matrix) so that all values become non-negative, and stored as an unsigned
`integer. This allows subsequent calculations to be performed using unsigned arithmetic.
`The constantbias is later subiracted using an unsigned integer subtraction operation.
`
`Vector elements are represented by few bits (e.g. 8), and may thus represent only a
`narrow range of scores. When the total vector size is limited to a specific numberofbits
`(e.g. 64), narrow vector elements allows the vector to be divided into more elements
`(e.g. 8) than if wider elements were used. This allows more concurrent calculations to
`take place, further increasing speed.
`
`In order to make the narrow score range useful even in cases where the scores are larger
`than what can be represented by a vector element, we employsaturated arithmetic to
`detect overflow in the score calculations. If overflow is detected, the entire alignment
`score is subsequently recomputed using an implementation with a wider score range,
`e.g. 16 bits. Because such high scoresarerelatively rare, the performance impactis
`small.
`
`15
`
`20
`
`25
`
`30
`
`35
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`Score computations are performed using unsigned values. Unsigned score values allows
`the widest possible score range. This reduces the number of recomputations necessary
`as described above.
`
`The computations are performed using saturated unsigned arithmetic. This allows easy
`clipping of negative results of subtractions at zero, an operation that is frequently
`performedin the calculations.
`
`The method, the computer readable device and the electronic device have their
`respective characteristic features as stated in the attached claims 1, 8 and 16
`respectively, and further embodiments thereof are disclosed in claims 2-7, 9-15 and 17-
`22, respectively.
`
`Brief Descriptions of Drawings
`
`Figure 1 illustrates computational dependencies in the Smith-Waterman alignment
`matrix, the arrows indicate dependencies between matrix cells that are involved in the
`computations of the h-, e- and f-values in each cell of the alignment matrix.
`
`20
`
`25
`
`30
`
`35
`
`Figure 2 illustrates vector arrangements in SIMDimplementations of the Smith-
`Waterman algorithm, each column represent one symbolof the database sequence, and
`each row represents on symbol of the query sequence, where
`
`a)
`
`represent traditional approach with vectors parallel to the minor diagonal in the
`matrix, and
`
`b)
`
`represents the novel approach with vectors parallel to the query sequence.
`
`Figure 3 illustrates vector calculation dependencies, the arrows indicates dependencies
`between matrix cells that are involved in the computationsofthe h-, e- and f-values in
`the elements of the vector using (a) the traditional and (b) the novel approach.
`
`Figure 4a illustrates pseudo-code for the new approach, the pseudo-code being just a
`detailed example showing how our method may be implemented in a computer
`
`program.
`
`Figure 4b illustrates a flow diagram of the pseudo-codein fig. 4a
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`Table 2: The symbols and phrases used in the pseudo-code.
`
`Phrase / statement / symbol
`
`BYTE a
`
`INTEGER a
`
`BYTE A[m]
`
`value
`
`integer values
`
`BYTE A[m][n]
`
`VECTOR A
`
`Il
`
`©
`
`o
`
`
`
`
`{
`
`is represented by a byte value
`Indicates that A is a two-dimensionalarray (matrix), where
`each element is represented by a byte value
`Indicates that A is a vector of containing eight elements, each
`represented by a byte value
`
`The variable a is assigned the value of expression b
`
`The vector variable A is assigned the value of expression B.
`
`Can be implemented using the MOVQ (move quadword)
`instruction on processors supporting MMX technology.
`
`
`Parenthesises indicated expressions that take precedence in the
`order of computation.
`
`
`L
`
`The FUNCTION and RETURN statements indicate the
`FUNCTIONa(b,c)
`
`
`beginning and ending of a function a taking b and c as
`parameters and returning the resulting value c.
`
`|FORi=aTObDO
`
`
`This FOR-statement indicates that the statements enclosed by
`
`
`the brackets should be repeated (b-a+1) times, with a loop
`statements
`index variable i, taking consecutive integer values a., at+1, at+2,
`
`
`
`}
`..» b-1, b.
`
`
`IF expression THEN
`This IF-statement indicates that the if-statements shall be
`
`{
`executed if the expression is true, and that the else-statements
`shall be executed if the expression is false.
`
`
`{
`
`
`else-statements
`
`Brackets enclosing the statements of a for-loop, if-statements
`
`and else-statements
`
`An expression representing the sum of the values of a and b
`
`(A+B)-C
`
`RETURN c
`
`if-statements
`
`$ E
`
`LSE
`
`
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`An expression representing the value of b subtracted from the
`value ofa
`
`
`
`
`
`Indicates scalar integer multiplication
`
`
`
`Indicates scalar integer division
`
`A scalar expression representing the largest of its arguments
`
`
`
`A vector operation taking A and B as arguments and
` representing a new vector where each element is equal to the
`
` pair wise sum of the corresponding elements of vector A and B.
`
` The computations are performed using unsigned saturated
`
` arithmetic, meaning that if any resulting elementis larger than
`
` the highest representative number, that element is replaced by
`
` the largest representative integer.
`
`
`cj = min(aj + b;, 255)
`Can be implemented using the PADDUSB(packedaddition of
`
`unsigned saturated bytes) instruction on processors supporting
`
`MMxX technology.
`
`A vector operation taking A and B as arguments and resulting
` In a new vector where each element is equal to the pair wise
`
` difference between the corresponding elements of vector A and
` B, when subtracting each value of B from the value of A.
`
` The computations are performed using unsigned saturated
`
` arithmetic, meaning that if the difference is negative, zero
`
` replaces the result.
`
` cj = max(aj — bj, 0)
`
`
`Can be implemented using the PSUBUSB (packed subtract of
`
`unsignedsaturated bytes) instruction on processors supporting
`MMX technology.
`
`A vector operation taking A and B as arguments and
` representing a new vector where each elementis equal to the
`
` larger of two corresponding elements of A and B.
`
` cj = max(aj, bi)
`
` Can be implemented using the PMAXUB (packed maximum
` unsigned byte) instruction processors supporting SSE
`technology, or using the PSUBUSBinstruction followed by the
`PADDUSB(packed subtract/add unsigned saturated bytes)
`instruction on
`
`processors supporting only MMX technology.
`
`
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`10
`
`A SHIFT b
`
`
`
`
`AORB
`
`Afi]
`
`SHIFTis a vector operation taking the vector A andthe scalar b
`as arguments and resulting in a new vector where the elements
`of vector A is shifted a numberofpositions. Element i of vector
`A is placed at position it+b in the new vector. The remaining
`
`elements of the resulting vector are assigned a value ofzero.
`
`Thus:
`
`IF ( i>=b) AND (i<b+x))
`
`{ cj =ajy }
`
`ELSE
`
`{ G=0 }
`
`The sign of b hence indicates the direction of the shift, while
`
`the magnitude of b indicates the numberofpositions the
`
`elements shall be shifted.
`
`On Intel Pentiumprocessors and otherlittle-endian
`
`microprocessors, a positive b value correspondsto a left shift,
`
`while a negative b-value correspondsto a right shift. On big-
`
`endian microprocessors the shift direction is reversed.
`
`Can be implemented using the PSLLQ and PSRLQ (packed
`shift left/right logical quadword) instructions on processors ©
`
`
`
` supporting MMX technology.
`
`Bit wise OR of all bits in vector A and B.
`
`
`
`
`Used in combination with the SHIFT-operation to combine
`elements from two vectors into a new vector.
`
`An expression representing element number1 of vectoror array
`A
`
`position b to c of the array A.
`
`An expression representing a vector consisting of elements with
`values a, b, c, d, e, f, g, and h.
`
`Preferred Embodiment of the Invention
`
`5
`
`The preferred embodimentwill be described with reference to the drawings. For each
`sequence in the database, a sequence similarity-searching program computes an
`alignment score that represent the degree of similarity between the query sequence and
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`11
`
`the database sequence. By also taking the length and composition of the query and
`database sequences into account, a statistical parameter can be computed and used to
`rank the database sequencesin order of similarity to the query sequence. The raw
`alignment score is based on a substitution score matrix, (e.g. BLOSUM62), representing
`the similarity between two symbols, and an affined gap penalty function based on a gap
`open and a gap extension penalty.
`In the following description vectors of 8 elements are used with 8 bits each,
`totalling 64 bits. However, this is only for the ease of description. Many other
`combinations of vector and element sizes are possible and can easily be generalised
`from the description below. Any numberof vector elements larger than one, and any
`numberof bits for each element is possible. However, an element size of 8 bits, and a
`vector size of 8, 16 or 32 elements are probably the most useful.
`In the following description the methodis illustrated with a possible
`implementation using Intel’s microprocessors and their MMX and SSE technology.
`However, this is just for the ease of description. An implementation using other
`microprocessors and other SIMD technologyis also possible.
`
`The Smith-Waterman algorithm
`
`To compute the optimal local alignment score, the dynamic programmingalgorithm by
`Smith and Waterman (1981), as enhanced by Gotoh (1982), is used. Given a query
`sequence A of length m, a database sequence B of length m, a substitution score matrix
`Z, a gap open penalty g and a gap extension penalty r, the optimal local alignment score
`¢ can be computed by the following recursion relations:
`
`e,, = maxie,jo h,,pit qj-r
`
`fy = mifo, ohia,, -a}—1marr[iLBLibe,,.f,,.0}
`t="max, }
`
`Here, e,, and f,, represent the maximum local alignment score involving thefirst 7
`symbols ofA and the first7 symbols ofB, and ending with a gap in sequence B or A,
`respectively. The overall maximum local alignment score involving the first i symbols
`ofA andthe firstj symbols ofB, is represented by h,,. The recursions should be
`calculated with 7 going from 1 to m andj from 1 to 1, starting with @,,= Fiz =h,; =0
`for all i= 0 or j =0. The order of computation of the values in the alignment matrix is
`
`20
`
`25
`
`30
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`12
`
`strict because the value of any cell cannot be computed before the valueofall cells to
`the left and above it has been computed, as shownby the data interdependence graph in
`figure 1. An implementation of the algorithm as described by Gotoh (1982) has a
`running time proportional to mn.
`
`Parallelisation
`
`The Smith-Waterman algorithm can be madeparallel on twoscales. It is fairly easy to
`distribute the processing of each of the database sequences on a numberof independent
`processors in a symmetric multiprocessing (SMP) machine. On a lowerscale, however,
`distributing the work involved within a single database sequenceis a bit more
`complicated. Figure 1 showsthe data interdependencein the alignment matrix. The final
`value, h, of any cell in the matrix cannot be computed before the value ofall cells to the
`left and aboveit has been computed. But the calculations of the values of diagonally
`arranged cells parallel to the minor diagonal (see figure 2a and 3a) are independent and
`can be done simultaneously in a parallel implementation. This fact has been utilised in
`earlier SIMD implementations (Hughey, 1996; Wozniak, 1997).
`
`The inventive approach
`
`The main features of the implementation accordingto the invention, are:
`
`e Vectors parallel to the query sequence
`e Vector generalisation of the SWAT-optimisations
`e
`8-way parallel processing with 8-bit values
`e Query sequence profiles
`e Unsigned arithmetics
`
`e
`
`Saturated arithmetics
`
`e General code optimisations
`
`These concepts are described in detail below.In orderto illustrate and exemplify the
`description, pseudo-code for the present method is shown in figure 4.
`
`In the pseudo-code, the method accordingto the invention through figures 4a and 4b,is
`illustrated using vectors of 8 elements, howeverthe present invented method is general
`and can be implemented with vectors of any numberof elements.
`
`10
`
`20
`
`25
`
`30
`
`35
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`13
`
`The pseudo-code assumesthat the query sequence length (m) is a multiple of the vector
`size, 8. This can be achieved by padding the query sequence and query score profile.
`—
`
`All vector indices start at zero as is usual in programming languages (notone,as is
`usual in ordinary mathematics notation).
`
`Uppercaseletters are generally used to represent vector or array variables or operations,
`while lowercase letters are generally used to represent scalar variables or operations.
`
`The S-matrix is an x times n query-specific score matrix representing the score for
`substituting any of the x different possible database sequence symbols with the query
`symbolat any of the n query positions. In general, x just represents the size of the
`alphabet from which the sequence symbols belong to. For amino acid sequences, x is
`typically 20 (representing the 20 natural aminoacids) or slightly larger (to include also
`ambiguous and other symbols). For nucleotide sequences, x is typically 4 (representing
`the 4 nucleotides adenine, cytosine, guanine and thymine/uracil), or larger (to include
`also ambiguous and other symbols). The S-matrix is usually precomputed from a query
`sequence and a substitution score matrix, but may also represent a general queryprofile,
`which may be based on scores calculated from a multiple alignmentof a protein
`sequence family.
`
`The H-vectors holds the h-values, representing the optimal local alignment score for a
`given matrix cell. The H-vectors represent a part of the HH-array. The E-vector holds
`the e-values, representing scores from alignments involving a database sequence gap.
`The E-vectors represent a part of the EE-array. The F-vector holds the f-values, and
`temporary f-values, representing scores from alignments involving a query sequence
`
`gap.
`
`In the LASTFvector, only a single element (at index 0) is used. It represents the
`potential score from previous rounds involving a gap in the query sequence. The actual
`value used has the gap opening penalty added,in order to simplify later calculations.
`
`Vectors parallel to the query sequence
`
`Despite the loss of independence between the computations of each of the vector
`elements, it was decided to use vectors of cells parallel to the query sequence (as shown
`
`10
`
`20
`
`25
`
`30
`
`35
`
`
`
`WO 02/27638
`
`PCT/NO01/00394
`
`14
`
`in figure 2b and 3b), instead of vectors ofcells parallel to the minor diagonal in the
`matrix (as shown in figure 2a and 3a). The advantage of this approach is the much-
`simplified and faster loading of the vector of substitution scores from memory. The
`disadvantage is that data dependencies within the vector related to the f-value
`computations (see figure 3b) must be handled. Eight cells are processed simultaneously
`along each column as indicated in figure 2b. Vectors are used to represent the h-, e- and
`f-values in the recurrence relations in groups of eight consecutive cells parallel to the
`query sequence. Vectors are also used to represent zero, q, r, qtr and other constants.
`Using vector processing, the value of eight h-, e- or f-values may be computed in
`parallel (the h, e and f as described in the equations on page 11).
`
`Vector generalisation of the SWAT-optimisations
`
`Asalready indicated, we have to take into account that each elementin the vectoris
`dependent on the element aboveit, as shown in figure 3b, because of the possible
`introduction of gaps in the query sequence. We employ a vector generalisation of the
`SWAT-optimisations and other optimisations to make the principle of query-parallel
`vectors efficient. The conceptis illustrated in the p