throbber
BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 14 2009, pages 1754–1760
`
`doi:10.1093/bioinformatics/btp324
`
`Sequence analysis
`Fast and accurate short read alignment with Burrows–Wheeler
`transform
`Heng Li and Richard Durbin∗
`
`Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
`Received on February 20, 2009; revised on May 6, 2009; accepted on May 12, 2009
`Advance Access publication May 18, 2009
`Associate Editor: John Quackenbush
`
`ABSTRACT
`Motivation: The enormous amount of short reads generated by the
`new DNA sequencing technologies call for the development of fast
`and accurate read alignment programs. A first generation of hash
`table-based methods has been developed, including MAQ, which
`is accurate, feature rich and fast enough to align short reads from a
`single individual. However, MAQ does not support gapped alignment
`for single-end reads, which makes it unsuitable for alignment of
`longer reads where indels may occur frequently. The speed of MAQ is
`also a concern when the alignment is scaled up to the resequencing
`of hundreds of individuals.
`Results: We implemented Burrows-Wheeler Alignment tool (BWA),
`a new read alignment package that is based on backward search
`with Burrows–Wheeler Transform (BWT), to efficiently align short
`sequencing reads against a large reference sequence such as the
`human genome, allowing mismatches and gaps. BWA supports both
`base space reads, e.g. from Illumina sequencing machines, and
`color space reads from AB SOLiD machines. Evaluations on both
`simulated and real data suggest that BWA is ∼10–20× faster than
`MAQ, while achieving similar accuracy. In addition, BWA outputs
`alignment in the new standard SAM (Sequence Alignment/Map)
`format. Variant calling and other downstream analyses after the
`alignment can be achieved with the open source SAMtools software
`package.
`Availability: http://maq.sourceforge.net
`Contact: rd@sanger.ac.uk
`
`1 INTRODUCTION
`The Illumina/Solexa sequencing technology typically produces
`50–200 million 32–100 bp reads on a single run of the machine.
`Mapping this large volume of short reads to a genome as large
`as human poses a great challenge to the existing sequence
`alignment programs. To meet the requirement of efficient and
`accurate short read mapping, many new alignment programs
`have been developed. Some of these, such as Eland (Cox, 2007,
`unpublished material), RMAP (Smith et al., 2008), MAQ (Li et al.,
`2008a), ZOOM (Lin et al., 2008), SeqMap (Jiang and Wong,
`2008), CloudBurst (Schatz, 2009) and SHRiMP (http://compbio.
`cs.toronto.edu/shrimp), work by hashing the read sequences and
`scan through the reference sequence. Programs in this category
`usually have flexible memory footprint, but may have the overhead
`
`∗
`
`To whom correspondence should be addressed.
`
`of scanning the whole genome when few reads are aligned.
`The second category of software, including SOAPv1 (Li et al.,
`2008b), PASS (Campagna et al., 2009), MOM (Eaves and
`Gao, 2009), ProbeMatch (Jung Kim et al., 2009), NovoAlign
`(http://www.novocraft.com), ReSEQ (http://code.google.com/p/
`re-seq), Mosaik (http://bioinformatics.bc.edu/marthlab/Mosaik) and
`BFAST (http://genome.ucla.edu/bfast), hash the genome. These
`programs can be easily parallelized with multi-threading, but they
`usually require large memory to build an index for the human
`genome. In addition, the iterative strategy frequently introduced by
`these software may make their speed sensitive to the sequencing
`error rate. The third category includes slider (Malhis et al., 2009)
`which does alignment by merge-sorting the reference subsequences
`and read sequences.
`Recently, the theory on string matching using Burrows–Wheeler
`Transform (BWT) (Burrows and Wheeler, 1994) has drawn the
`attention of several groups, which has led to the development of
`SOAPv2 (http://soap.genomics.org.cn/), Bowtie (Langmead et al.,
`2009) and BWA, our new aligner described in this article.
`Essentially, using backward search (Ferragina and Manzini, 2000;
`Lippert, 2005) with BWT, we are able to effectively mimic the top-
`down traversal on the prefix trie of the genome with relatively small
`memory footprint (Lam et al., 2008) and to count the number of exact
`hits of a string of length m in O(m) time independent of the size of
`the genome. For inexact search, BWA samples from the implicit
`prefix trie the distinct substrings that are less than k edit distance
`away from the query read. Because exact repeats are collapsed on
`one path on the prefix trie, we do not need to align the reads against
`each copy of the repeat. This is the main reason why BWT-based
`algorithms are efficient.
`In this article, we will give a sufficient introduction to the
`background of BWT and backward search for exact matching, and
`present the algorithm for inexact matching which is implemented
`in BWA. We evaluate the performance of BWA on simulated data
`by comparing the BWA alignment with the true alignment from
`the simulation, as well as on real paired-end data by checking
`the fraction of reads mapped in consistent pairs and by counting
`misaligned reads mapped against a hybrid genome.
`
`2 METHODS
`2.1 Prefix trie and string matching
`The prefix trie for string X is a tree where each edge is labeled with a symbol
`and the string concatenation of the edge symbols on the path from a leaf to
`
`© 2009 The Author(s)
`This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
`by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
`
`[14:59 15/6/2009 Bioinformatics-btp324.tex]
`
`Page: 1754
`
`1754–1760
`
`00001
`
`EX1038
`
`

`

`Alignment with BWT
`
`Fig. 2. Constructing suffix array and BWT string for X = googol$. String
`X is circulated to generate seven strings, which are then lexicographically
`sorted. After sorting, the positions of the first symbols form the suffix array
`(6,3,0,5,2,4,1) and the concatenation of the last symbols of the circulated
`strings gives the BWT string lo$oogg.
`
`BWT-SW (Lam et al., 2008). We adapted its source code to make it work
`with BWA.
`
`2.3 Suffix array interval and sequence alignment
`If string W is a substring of X, the position of each occurrence of W in X will
`occur in an interval in the suffix array. This is because all the suffixes that
`have W as prefix are sorted together. Based on this observation, we define:
`R(W) = min{k: W is the prefix of XS(k)}
`R(W) = max{k: W is the prefix of XS(k)}
`(2)
`In particular, if W is an empty string, R(W)= 1 and R(W)= n−1. The interval
`[R(W),R(W)] is called the SA interval of W and the set of positions of all
`occurrences of W in X is {S(k): R(W)≤ k≤ R(W)}. For example in Figure 2,
`the SA interval of string ‘go’ is [1,2]. The suffix array values in this interval
`are 3 and 0 which give the positions of all the occurrences of ‘go’.
`Knowing the intervals in suffix array we can get the positions. Therefore,
`sequence alignment is equivalent to searching for the SA intervals of
`substrings of X that match the query. For the exact matching problem, we
`can find only one such interval; for the inexact matching problem, there may
`be many.
`
`(1)
`
`2.4 Exact matching: backward search
`Let C(a) be the number of symbols in X[0,n−2] that are lexicographically
`smaller than a∈ and O(a,i) the number of occurrences of a in B[0,i].
`Ferragina and Manzini (2000) proved that if W is a substring of X:
`R(aW) = C(a)+O(a,R(W)−1)+1
`R(aW) = C(a)+O(a,R(W))
`(4)
`and that R(aW)≤ R(aW) if and only if aW is a substring of X. This result
`makes it possible to test whether W is a substring of X and to count the
`occurrences of W in O(|W|) time by iteratively calculating R and R from the
`end of W. This procedure is called backward search.
`It is important to note that Equations (3) and (4) actually realize the top-
`down traversal on the prefix trie of X given that we can calculate the SA
`interval of a child node in constant time if we know the interval of its parent.
`In this sense, backward search is equivalent to exact string matching on the
`prefix trie, but without explicitly putting the trie in the memory.
`
`(3)
`
`1755
`
`Fig. 1. Prefix trie of string ‘GOOGOL’. Symbol ∧ marks the start of the string.
`The two numbers in a node give the SA interval of the string represented by
`the node (see Section 2.3). The dashed line shows the route of the brute-force
`search for a query string ‘LOL’, allowing at most one mismatch. Edge labels
`in squares mark the mismatches to the query in searching. The only hit is the
`bold node [1,1] which represents string ‘GOL’.
`
`the root gives a unique prefix of X. On the prefix trie, the string concatenation
`of the edge symbols from a node to the root gives a unique substring of X,
`called the string represented by the node. Note that the prefix trie of X is
`identical to the suffix trie of reverse of X and therefore suffix trie theories
`can also be applied to prefix trie.
`With the prefix trie, testing whether a query W is an exact substring of
`X is equivalent to finding the node that represents W, which can be done in
`O(|W|) time by matching each symbol in W to an edge, starting from the
`root. To allow mismatches, we can exhaustively traverse the trie and match
`W to each possible path. We will later show how to accelerate this search by
`using prefix information of W. Figure 1 gives an example of the prefix trie
`for ‘GOOGOL’. The suffix array (SA) interval in each node is explained in
`Section 2.3.
`
`2.2 Burrows–Wheeler transform
`Let be an alphabet. Symbol $ is not present in and is lexicographically
`smaller than all the symbols in . A string X = a0a1 ...an−1 is always ended
`with symbol $ (i.e. an−1= $) and this symbol only appears at the end. Let
`X[i]=a i, i= 0,1,...,n−1, be the i-th symbol of X, X[i,j]=a i ...aj a substring
`and Xi = X[i,n−1] a suffix of X. Suffix array S of X is a permutation of the
`integers 0... n−1 such that S(i) is the start position of the i-th smallest suffix.
`The BWT of X is defined as B[i]=$ when S(i)= 0 and B[i]=X [S(i)−1]
`otherwise. We also define the length of string X as |X| and therefore |X|=
`|B|=n . Figure 2 gives an example on how to construct BWT and suffix array.
`The algorithm shown in Figure 2 is quadratic in time and space. However,
`this is not necessary. In practice, we usually construct the suffix array
`first and then generate BWT. Most algorithms for constructing suffix array
`
`require at least n(cid:4)log2 n(cid:5) bits of working space, which amounts to 12 GB for
`
`human genome. Recently, Hon et al. (2007) gave a new algorithm that uses
`n bits of working space and only requires <1 GB memory at peak time for
`constructing the BWT of human genome . This algorithm is implemented in
`
`[14:59 15/6/2009 Bioinformatics-btp324.tex]
`
`Page: 1755 1754–1760
`
`00002
`
`

`

`H.Li and R.Durbin
`
`Precalculation:
`Calculate BWT string B for reference string X
`Calculate array C(·) and O(·,·) from B
`(cid:8)
`Calculate BWT string B
`for the reverse reference
`(cid:8)
`(cid:8)
`(·,·) from B
`Calculate array O
`Procedures:
`InexactSearch(W ,z)
`CalculateD(W)
`return InexRecur(W ,|W|−1, z,1,|X|−1)
`
`CalculateD(W)
`k← 1
`l←|X|−1
`z← 0
`for i= 0 to |W|−1 do
`(cid:8)
`k← C(W[i])+O
`(W[i],k−1)+1
`(cid:8)
`l← C(W[i])+O
`(W[i],l)
`if k > l then
`k← 1
`l←|X|−1
`z← z+1
`D(i)← z
`
`InexRecur(W ,i,z,k,l)
`if z < D(i) then
`return ∅
`if i < 0 then
`return {[k,l]}
`I ←∅
`I ← I ∪ InexRecur(W ,i−1,z−1,k,l)
`for each b∈{A,C,G,T} do
`k← C(b)+O(b,k−1)+1
`l← C(b)+O(b,l)
`if k≤ l then
`I ← I ∪ InexRecur(W ,i,z−1,k,l)
`if b= W[i] then
`I ← I ∪ InexRecur(W ,i−1,z,k,l)
`else
`I ← I ∪ InexRecur(W ,i−1,z−1,k,l)
`return I
`
`*
`
`*
`
`Fig. 3. Algorithm for inexact search of SA intervals of substrings that
`match W. Reference X is $ terminated, while W is A/C/G/T terminated.
`Procedure InexactSearch(W ,z) returns the SA intervals of substrings
`that match W with no more than z differences (mismatches or gaps);
`InexRecur(W ,i,z,k,l) recursively calculates the SA intervals of substrings
`that match W[0,i] with no more than z differences on the condition that suffix
`Wi+1 matches interval [k,l]. Lines started with asterisk are for insertions to
`and deletions from X, respectively. D(i) is the lower bound of the number of
`differences in string W[0,i].
`
`2.5 Inexact matching: bounded traversal/backtracking
`Figure 3 gives a recursive algorithm to search for the SA intervals of
`substrings of X that match the query string W with no more than z differences
`(mismatches or gaps). Essentially, this algorithm uses backward search to
`sample distinct substrings from the genome. This process is bounded by the
`D(·) array where D(i) is the lower bound of the number of differences in
`W[0,i]. The better the D is estimated, the smaller the search space and the
`more efficient the algorithm is. A naive bound is achieved by setting D(i)= 0
`
`1756
`
`for all i, but the resulting algorithm is clearly exponential in the number of
`differences and would be less efficient.
`CalculateD(W)
`z← 0
`j← 0
`for i= 0 to |W|−1 do
`if W[j,i] is not a substring of X then
`z← z+1
`j← i+1
`D(i)← z
`
`Fig. 4. Equivalent algorithm to calculate D(i).
`
`The CalculateD procedure in Figure 3 gives a better, though not optimal,
`bound. It is conceptually equivalent to the one described in Figure 4, which
`is simpler to understand. We use the BWT of the reverse (not complemented)
`reference sequence to test if a substring of W is also a substring of X. Note
`that to do this test with BWT string B alone would make CalculateD an
`O(|W|2) procedure, rather than O(|W|) as is described in Figure 3.
`To understand the role of D, we come back to the example of searching
`for W = LOL in X = GOOGOL$ (Fig. 1). If we set D(i)= 0 for all i and
`disallow gaps (removing the two star lines in the algorithm), the call graph
`of InexRecur, which is a tree, effectively mimics the search route shown
`as the dashed line in Figure 1. However, with CalculateD, we know that
`D(0)= 0 and D(1)= D(2)= 1. We can then avoid descending into the ‘G’ and
`‘O’ subtrees in the prefix trie to get a much smaller search space.
`The algorithm in Figure 3 guarantees to find all the intervals allowing
`maximum z differences. It is complete in theory, but in practice, we also
`made various modifications. First, we pay different penalties for mismatches,
`gap opens and gap extensions, which is more realistic to biological data.
`Second, we use a heap-like data structure to keep partial hits rather than
`using recursion. The heap-like structure is prioritized on the alignment score
`of the partial hits to make BWA always find the best intervals first. The
`reverse complemented read sequence is processed at the same time. Note that
`the recursion described in Figure 3 effectively mimics a depth-first search
`(DFS) on the prefix trie, while BWA implements a breadth-first search (BFS)
`using this heap-like data structure. Third, we adopt an iterative strategy: if
`the top interval is repetitive, we do not search for suboptimal intervals by
`default; if the top interval is unique and has z difference, we only search
`for hits with up to z+1 differences. This iterative strategy accelerates BWA
`while retaining the ability to generate mapping quality. However, this also
`makes BWA’s speed sensitive to the mismatch rate between the reads and
`the reference because finding hits with more differences is usually slower.
`Fourth, we allow to set a limit on the maximum allowed differences in the
`first few tens of base pairs on a read, which we call the seed sequence. Given
`70 bp simulated reads, alignment with maximum two differences in the 32 bp
`seed is 2.5× faster than without seeding. The alignment error rate, which is
`the fraction of wrong alignments out of confident mappings in simulation
`(see also Section 3.2), only increases from 0.08% to 0.11%. Seeding is less
`effective for shorter reads.
`
`2.6 Reducing memory
`The algorithm described above needs to load the occurrence array O and the
`suffix array S in the memory. Holding the full O and S arrays requires huge
`memory. Fortunately, we can reduce the memory by only storing a small
`fraction of the O and S arrays, and calculating the rest on the fly. BWT-
`SW (Lam et al., 2008) and Bowtie (Langmead et al., 2009) use a similar
`strategy which was first introduced by Ferragina and Manzini (2000).
`
`Given a genome of size n, the occurrence array O(·,·) requires 4n(cid:4)log2 n(cid:5)
`bits as each integer takes (cid:4)log2 n(cid:5) bits and there are 4n of them in the array.
`In practice, we store in memory O(·,k) for k that is a factor of 128 and
`calculate the rest of elements using the BWT string B. When we use two bits
`to represent a nucleotide, B takes 2n bits. The memory for backward search is
`
`[14:59 15/6/2009 Bioinformatics-btp324.tex]
`
`Page: 1756 1754–1760
`
`00003
`
`

`

`thus 2n+n(cid:4)log2 n(cid:5)/32 bits. As we also need to store the BWT of the reverse
`
`genome to calculate the bound, the memory required for calculating intervals
`is doubled, or about 2.3 GB for a 3 Gb genome.
`Enumerating the position of each occurrence requires the suffix array S.
`
`Alignment with BWT
`
`to this modification, but the deviation is relatively small. For example, BWA
`wrongly aligns 11 reads out of 1 569 108 simulated 70 bp reads mapped with
`mapping quality 60. The error rate 7×10
`−6 (= 11/1 569 108) for these Q60
`−6.
`mappings is higher than the theoretical expectation 10
`
`2.8 Mapping SOLiD reads
`For SOLiD reads, BWA converts the reference genome to dinucleotide ‘color’
`sequence and builds the BWT index for the color genome. Reads are mapped
`in the color space where the reverse complement of a sequence is the same as
`the reverse, because the complement of a color is itself. For SOLiD paired-
`end mapping, a read pair is said to be in the correct orientation if either
`of the two scenarios is true: (i) both ends mapped to the forward strand of
`the genome with the R3 read having smaller coordinate; and (ii) both ends
`mapped to the reverse strand of the genome with the F3 read having smaller
`coordinate. Smith–Waterman alignment is also done in the color space.
`After the alignment, BWA decodes the color read sequences to the
`nucleotide sequences using dynamic programming. Given a nucleotide
`reference subsequence b1b2 ...bl+1 and a color read sequence c1c2 ...cl
`mapped to the subsequence, BWA infers a nucleotide sequence ˆb1
`ˆb2 ...ˆbl+1
`)+ l(cid:1)
`i=1
`
`such that it minimizes the following objective function:
`l+1(cid:1)
`i=1
`
`(cid:8)·(1−δˆbi ,bi
`
`q
`
`qi·(cid:2)
`
`1−δˆci ,g(ˆbi ,ˆbi+1)
`
`(cid:3)
`
`If we put the entire S in memory, it would use n(cid:4)log2 n(cid:5) bits. However, it is
`also possible to reconstruct the entire S when knowing part of it. In fact, S
`and inverse compressed suffix array (inverse CSA) −1 (Grossi and Vitter,
`2000) satisfy:
`S(k)= S(( −1)(j)(k))+j
`(5)
`where ( −1)(j) denotes repeatedly applying the transform −1 for j times.
`The inverse CSA −1 can be calculated with the occurrence array O:
` −1(i)= C(B[i])+O(B[i],i)
`In BWA, we only store in memory S(k) for k that can be divided by 32.
`For k that is not a factor of 32, we repeatedly apply −1 until for some j,
`( −1)(j)(k) is a factor of 32 and then S(( −1)(j)(k)) can be looked up and
`S(k) can be calculated with Equation (5).
`In all, the alignment procedure uses 4n+n(cid:4)log2 n(cid:5)/8 bits, or n bytes
`
`(6)
`
`for genomes <4 Gb. This includes the memory for the BWT string, partial
`occurrence array and partial suffix array for both original and the reversed
`genome. Additionally, a few hundred megabyte of memory is required for
`heap, cache and other data structures.
`
`2.7 Other practical concerns for Illumina reads
`2.7.1 Ambiguous bases Non-A/C/G/T bases on reads are simply treated
`as mismatches, which is implicit in the algorithm (Fig. 3). Non-A/C/G/T
`bases on the reference genome are converted to random nucleotides. Doing
`so may lead to false hits to regions full of ambiguous bases. Fortunately,
`the chance that this may happen is very small given relatively long reads.
`We tried 2 million 32 bp reads and did not see any reads mapped to poly-N
`regions by chance.
`
`2.7.2 Paired-end mapping BWA supports paired-end mapping. It first
`finds the positions of all
`the good hits, sorts them according to the
`chromosomal coordinates and then does a linear scan through all the potential
`hits to pair the two ends. Calculating all the chromosomal coordinates
`requires to look up the suffix array frequently. This pairing process is time
`consuming as generating the full suffix array on the fly with the method
`described above is expensive. To accelerate pairing, we cache large intervals.
`This strategy halves the time spent on pairing.
`In pairing, BWA processes 256K read pairs in a batch. In each batch,
`BWA loads the full BWA index into memory, generates the chromosomal
`coordinate for each occurrence, estimates the insert size distribution from
`read pairs with both ends mapped with mapping quality higher than 20, and
`then pairs them. After that, BWA clears the BWT index from the memory,
`loads the 2 bit encoded reference sequence and performs Smith–Waterman
`alignment for unmapped reads whose mates can be reliably aligned. Smith–
`Waterman alignment rescues some reads with excessive differences.
`
`2.7.3 Determining the allowed maximum number of differences Given
`a read of length m, BWA only tolerates a hit with at most k differences
`(mismatches or gaps), where k is chosen such that <4% of m-long reads with
`2% uniform base error rate may contain differences more than k. With this
`configuration, for 15–37 bp reads, k equals 2; for 38–63 bp, k= 3; for 64–92
`bp, k= 4; for 93–123 bp, k= 5; and for 124–156 bp reads, k= 6.
`
`2.7.4 Generating mapping quality scores For each alignment, BWA
`calculates a mapping quality score, which is the Phred-scaled probability
`of the alignment being incorrect. The algorithm is similar to MAQ’s except
`that in BWA we assume the true hit can always be found. We made this
`modification because we are aware that MAQ’s formula overestimates the
`probability of missing the true hit, which leads to underestimated mapping
`quality. Simulation reveals that BWA may overestimate mapping quality due
`
`(cid:8)
`
`This optimization can be done by dynamic programming because the best
`
`the best decoding score up to i. The iteration equations are
`
`where q
`is the Phred-scaled probability of a mutation, qi is the Phred quality
`)= g(b
`(cid:8)
`(cid:8),b) gives the color corresponding to
`of color ci and function g(b,b
`(cid:8)
`(cid:8)
`. Essentially, we pay a penalty q
`if
`the two adjacent nucleotides b and b
`bi (cid:12)= ˆbi and a penalty qi if ci (cid:12)= g(ˆbi,ˆbi+1).
`decoding beyond position i only depends on the choice of ˆbi. Let fi(ˆbi) be
`f1(ˆb1)= q
`(cid:8)·(1−δˆb1,b1
`)+qi·(cid:2)
`fi(ˆbi)+q
`fi+1(ˆbi+1)= minˆbi
`1−δˆci ,g(ˆbi ,ˆbi+1)
`(cid:8)·(1−δ
`bi+1,ˆbi+1
`BWA approximates base qualities as follows. Let ˆci = g(ˆbi,ˆbi+1). The i-th
`base quality ˆqi, i= 2...l, is calculated as:
`⎧⎪⎪⎨
`qi−1+qi if ci−1=ˆci−1 and ci =ˆci
`qi−1−qi if ci−1=ˆci−1 but ci (cid:12)=ˆci
`qi−qi−1 if ci =ˆci but ci−1(cid:12)=ˆci−1
`⎪⎪⎩
`0
`otherwise
`BWA outputs the sequence ˆb2 ...ˆbl and the quality ˆq2 ...ˆql as the final result
`
`)
`
`(cid:3)(cid:5)
`
`(cid:4)
`
`ˆqi =
`
`for SOLiD mapping.
`
`3 RESULTS
`3.1
`Implementation
`We implemented BWA to do short read alignment based on the BWT
`of the reference genome. It performs gapped alignment for single-
`end reads, supports paired-end mapping, generates mapping quality
`and gives multiple hits if required. The default output alignment
`format is SAM (Sequence Alignment/Map format). Users can use
`SAMtools (http://samtools.sourceforge.net) to extract alignments in
`a region, merge/sort alignments, get single nucleotide polymorphism
`(SNP) and indel calls and visualize the alignment.
`BWA is distributed under the GNU General Public License (GPL).
`Documentations and source code are freely available at the MAQ
`web site: http://maq.sourceforge.net.
`
`1757
`
`[14:59 15/6/2009 Bioinformatics-btp324.tex]
`
`Page: 1757 1754–1760
`
`00004
`
`

`

`H.Li and R.Durbin
`
`3.2 Evaluated programs
`To evaluate the performance of BWA, we tested additional
`three alignment programs: MAQ (Li et al., 2008a), SOAPv2
`(http://soap.genomics.org.cn) and Bowtie (Langmead et al., 2009).
`MAQ indexes reads with a hash table and scans through the genome.
`It is the software package we developed previously for large-scale
`read mapping. SOAPv2 and Bowtie are the other two BWT-based
`short read aligners that we are aware of. The latest SOAP-2.1.7 (Li
`et al., unpublished data) uses 2way-BWT (Lam et al., unpublished
`data) for alignment. It tolerates more mismatches beyond the 35
`bp seed sequence and supports gapped alignment limited to one
`gap open. Bowtie (version 0.9.9.2) deploys a similar algorithm to
`BWA. Nonetheless, it does not reduce the search space by bounding
`the search with D(i), but by cleverly doing the alignment for
`both original and reverse read sequences to bypass unnecessary
`searches towards the root of the prefix trie. By default, Bowtie
`performs a DFS on the prefix trie and stops when the first qualified
`hit is found. Thus, it may miss the best inexact hit even if its
`seeding strategy is disabled. It is possible to make Bowtie perform
`a BFS by applying ‘–best’ at the command line, but this makes
`Bowtie slower. Bowtie does not support gapped alignment at the
`moment.
`including BWA, randomly place a
`All
`the four programs,
`repetitive read across the multiple equally best positions. As we
`are mainly interested in confident mappings in practice, we need
`to rule out repetitive hits. SOAPv2 gives the number of equally
`best hits of a read. Only unique mappings are retained. We also ask
`SOAPv2 to limit the possible gap size to at most 3 bp. We run Bowtie
`with the command-line option ‘–best -k 2’, which renders Bowtie
`to output the top two hits of a read. We discard a read alignment
`if the second best hit contains the same number of mismatches as
`the best hit. MAQ and BWA generate mapping qualities. We use
`mapping quality threshold 1 for MAQ and 10 for BWA to determine
`confident mappings. We use different thresholds because we know
`that MAQ’s mapping quality is underestimated, while BWA’s is
`overestimated.
`
`3.3 Evaluation on simulated data
`We simulated reads from the human genome using the wgsim
`program that is included in the SAMtools package and ran the
`four programs to map the reads back to the human genome. As
`we know the exact coordinate of each read, we are able to calculate
`the alignment error rate.
`Table 1 shows that BWA and MAQ achieve similar alignment
`accuracy. BWA is more accurate than Bowtie and SOAPv2 in terms
`of both the fraction of confidently mapped reads and the error rate
`of confident mappings. Note that SOAP-2.1.7 is optimized for reads
`longer than 35 bp. For the 32 bp reads, SOAP-2.0.1 outperforms the
`latest version.
`On speed, SOAPv2 is the fastest and actually it would be 30–80%
`faster for paired-end mapping if gapped alignment was disabled.
`Bowtie with the default option (data not shown) is several times
`faster than the current setting ‘–best -k 2’ on single-end mapping.
`However, the speed is gained at a great cost of accuracy. For
`example, with the default option, Bowtie can map the two million
`single-end 32 bp reads in 151 s, but 6.4% of confident mappings are
`wrong. This high alignment error rate may complicate the detection
`of structural variations and potentially affect SNP accuracy. Between
`
`1758
`
`Table 1. Evaluation on simulated data
`
`Single-end
`
`Paired-end
`
`Program
`
`Time (s) Conf (%) Err (%) Time (s) Conf (%) Err (%)
`
`Bowtie-32
`BWA-32
`MAQ-32
`SOAP2-32
`
`Bowtie-70
`BWA-70
`MAQ-70
`SOAP2-70
`
`bowtie-125
`BWA-125
`MAQ-125
`SOAP2-125
`
`1271
`823
`19797
`256
`
`1726
`1599
`17928
`317
`
`1966
`3021
`17506
`555
`
`79.0
`80.6
`81.0
`78.6
`
`86.3
`90.7
`91.0
`90.3
`
`88.0
`93.0
`92.7
`91.5
`
`0.76
`0.30
`0.14
`1.16
`
`0.20
`0.12
`0.13
`0.39
`
`0.07
`0.05
`0.08
`0.17
`
`1391
`1224
`21589
`1909
`
`1580
`1619
`19046
`708
`
`1701
`3059
`19388
`1187
`
`85.7
`89.6
`87.2
`86.8
`
`90.7
`96.2
`94.6
`94.5
`
`91.0
`97.6
`96.3
`90.8
`
`0.57
`0.32
`0.07
`0.78
`
`0.43
`0.11
`0.05
`0.34
`
`0.37
`0.04
`0.02
`0.14
`
`One million pairs of 32, 70 and 125 bp reads, respectively, were simulated from
`the human genome with 0.09% SNP mutation rate, 0.01% indel mutation rate and
`2% uniform sequencing base error rate. The insert size of 32 bp reads is drawn from
`a normal distribution N(170,25), and of 70 and 125 bp reads from N(500,50). CPU
`time in seconds on a single core of a 2.5 GHz Xeon E5420 processor (Time), percent
`confidently mapped reads (Conf) and percent erroneous alignments out of confident
`mappings (Err) are shown in the table.
`
`BWA and MAQ, BWA is 6–18× faster, depending on the read length.
`MAQ’s speed is not affected by read length because internally it
`treats all reads as 128 bp. It is possible to accelerate BWA by not
`checking suboptimal hits similar to what Bowtie and SOAPv2 are
`doing. However, calculating mapping quality would be impossible in
`this case and we believe generating proper mapping quality is useful
`to various downstream analyses such as the detection of structural
`variations.
`On memory, SOAPv2 uses 5.4 GB. Both Bowtie and BWA uses
`2.3 GB for single-end mapping and about 3 GB for paired-end, larger
`than MAQ’s memory footprint 1 GB. However, the memory usage
`of all the three BWT-based aligners is independent of the number of
`reads to be aligned, while MAQ’s is linear in it. In addition, all BWT-
`based aligners support multi-threading, which reduces the memory
`per CPU core on a multi-core computer. On modern computer
`servers, memory is not a practical concern with the BWT-based
`aligners.
`
`3.4 Evaluation on real data
`To assess the performance on real data, we downloaded about
`12.2 million pairs of 51 bp reads from European Read Archive
`(AC:ERR000589). These reads were produced by Illumina for
`NA12750, a male included in the 1000 Genomes Project
`(http://www.1000genomes.org). Reads were mapped to the human
`genome NCBI build 36. Table 2 shows that almost all confident
`mappings from MAQ and BWA exist in consistent pairs although
`MAQ gives fewer confident alignments. A slower mode of BWA (no
`seeding; searching for suboptimal hits even if the top hit is a repeat)
`did even better. In that mode, BWA confidently mapped 89.2% of all
`reads in 6.3 hours with 99.2% of confident mappings in consistent
`pairs.
`In this experiment, SOAPv2 would be twice as fast with both
`percent confident mapping (Conf) and percent paired (Paired)
`
`[14:59 15/6/2009 Bioinformatics-btp324.tex]
`
`Page: 1758 1754–1760
`
`00005
`
`

`

`Table 2. Evaluation on real data
`
`Program
`
`Time (h)
`
`Conf (%)
`
`Paired (%)
`
`Bowtie
`BWA
`MAQ
`SOAP2
`
`5.2
`4.0
`94.9
`3.4
`
`84.4
`88.9
`86.1
`88.3
`
`96.3
`98.8
`98.7
`97.5
`
`The 12.2 million read pairs were mapped to the human genome. CPU time in hours on
`a single core of a 2.5 GHz Xeon E5420 processor (Time), percent confidently mapped
`reads (Conf) and percent confident mappings with the mates mapped in the correct
`orientation and within 300 bp (Paired), are shown in the table.
`
`dropping by 1% if gapped alignment was disabled. In contrast, BWA
`is 1.4 times as fast when it performs ungapped alignment only. But
`even with BWT-based gapped alignment disabled, BWA is still able
`to recover many short indels with Smith–Waterman alignment given
`paired-end reads.
`We also obtained the chicken genome sequence (version 2.1)
`and aligned these 12.2 million read pairs against a human–chicken
`hybrid reference sequence. The percent confident mappings is almost
`unchanged in comparison to the human-only alignment. As for the
`number of reads mapped to the chicken genome, Bowtie mapped
`2640, BWA 2942, MAQ 3005 and SOAPv2 mapped 4531 reads to
`the wrong genome. If we consider that the chicken sequences take up
`one-quarter of the human–chicken hybrid reference, the alignment
`error rate for BWA is about 0.06% (=2942×4/12.2M/0.889).
`Note that such an estimate of the alignment error rate may be
`underestimated because wrongly aligned human reads tend to be
`related to repetitive sequences in human and to be mapped back
`to the human sequences. The estimate may also be overestimated
`due to the presence of highly conservative sequences and the
`incomplete assembly of human or misassembly of the chicken
`genome.
`If we want fewer errors, the mapping quality generated by BWA
`and MAQ allows us to choose alignments of higher accuracy. If we
`increased the mapping quality threshold in determining a confident
`hit to 25 for BWA, 86.4% of reads could be aligned confidently with
`1927 reads mapped to the chicken genome, outperforming Bowtie
`in terms of both percent confident mappings and the number of reads
`mapped to the wrong genome.
`
`4 DISCUSSION
`For short read alignment against the human reference genome, BWA
`is an order of magnitude faster than MAQ while achieving similar
`alignment accuracy. It supports gapped alignment for single-end
`reads, which is increasingly important when reads get longer and
`tend to contain indels. BWA outputs

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket