throbber
BMC Bioinformatics
`
`BioMed Central
`
`Open Access
`Research article
`Using quality scores and longer reads improves accuracy of Solexa
`read mapping
`Andrew D Smith†, Zhenyu Xuan† and Michael Q Zhang*
`
`Address: Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11274, USA
`
`Email: Andrew D Smith - asmith@cshl.edu; Zhenyu Xuan - xuan@cshl.edu; Michael Q Zhang* - mzhang@cshl.edu
`* Corresponding author †Equal contributors
`
`Published: 28 February 2008
`
`BMC Bioinformatics 2008, 9:128
`
`doi:10.1186/1471-2105-9-128
`
`This article is available from: http://www.biomedcentral.com/1471-2105/9/128
`
`Received: 5 October 2007
`Accepted: 28 February 2008
`
`© 2008 Smith et al; licensee BioMed Central Ltd.
`This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
`which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
`
`Abstract
`Background: Second-generation sequencing has the potential to revolutionize genomics and
`impact all areas of biomedical science. New technologies will make re-sequencing widely available
`for such applications as identifying genome variations or interrogating the oligonucleotide content
`of a large sample (e.g. ChIP-sequencing). The increase in speed, sensitivity and availability of
`sequencing technology brings demand for advances in computational technology to perform
`associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads,
`ranging in length from ~25–50 nt, in a single experiment. Accurately mapping the reads back to a
`reference genome is a critical task in almost all applications. Two sources of information that are
`often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads,
`which contain a much higher frequency of sequencing errors, and the base-call quality scores.
`Results: To investigate whether these sources of information can be used to improve accuracy
`when mapping reads, we developed the RMAP tool, which can map reads having a wide range of
`lengths and allows base-call quality scores to determine which positions in each read are more
`important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC
`regions for varying read lengths, and varying criteria for use of quality scores. RMAP is freely
`available for downloading at http://rulai.cshl.edu/rmap/.
`Conclusion: Our results indicate that significant gains in Solexa read mapping performance can be
`achieved by considering the information in 3' ends of longer reads, and appropriately using the base-
`call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit
`this information in targeted re-sequencing projects.
`
`Background
`The main technological advances that accompanied the
`genomic and post-genomic eras are high-throughput
`sequencing and hybridization microarrays. Sequencing
`technology enabled scientists to obtain the full genomic
`sequence for many species, including the human and
`many model organisms. Sequencing technology is also
`being used to selectively re-sequence the human genome
`
`to detect genome variations such as single nucleotide pol-
`ymorphisms or large-scale structural variations. Because
`understanding these variations can immediately impact
`medical sciences, making sequencing more efficient and
`accessible is imperative. However, traditional methodolo-
`gies used to sequence the first mammalian genomes
`remain expensive, time consuming, and labor intensive.
`
`Page 1 of 8
`(page number not for citation purposes)
`
`SEQUENOM EXHIBIT 1009
`
`

`
`BMC Bioinformatics 2008, 9:128
`
`http://www.biomedcentral.com/1471-2105/9/128
`
`Oligonucleotide microarray technology, used to interro-
`gate the RNA or DNA content of a sample, has emerged as
`a widely accessible and effective tool for studying gene
`expression or detecting protein-DNA interactions (i.e.
`ChIP-chip). Microarrays have also been used to detect
`genome variations, like SNPs or structural variations.
`Array-based hybridization also has limitations. For exam-
`ple, probes can behave in highly non-uniform ways, and
`the effects of cross-hybridization and resolution limits are
`poorly understood. Although significant research efforts
`are focused on these problems, they remain inherent in all
`hybridization-based methods.
`
`The new sequencing technology referred to as "second-
`generation" shows promise to eliminate many of the
`problems associated with traditional sequencing technol-
`ogy and also those with oligonucleotide microarray tech-
`nology. Second-generation sequencers are able
`to
`sequence more quickly and at lower cost in terms of both
`money and labor. New sequencing technologies are devel-
`oped to sequence at greater depth, meaning that a clone
`can be sequenced from a sample even when that clone
`exists at very low abundance (i.e. 1 molecule per cell).
`
`Several second-generation technologies have been devel-
`oped using diverse methods. Two recent second-genera-
`tion sequencers are from 454 Life Sciences (Roche
`Diagnostics) and Solexa (Illumina). The 454 sequencers
`use an emulsion method for DNA amplification and a
`pyrosequencing protocol for sequencing by synthesis
`(SBS) at picolitre-scale volumes. Current 454 sequencers
`can produce 25–50 M nt of sequence in a single run, in the
`form of reads with length up to 500 nt (a number
`expected to increase), enabling this technology to be used
`for de-novo sequencing in addition to re-sequencing [1].
`
`Solexa/Illumina 1G sequencers also use sequencing by
`synthesis, with DNA amplified on the surface of a flow
`cell, resulting in a random array of dense clusters [2]. The
`Solexa technology is faster and cheaper than that used in
`454 sequencers, producing 1G nucleotides of sequence in
`one run but producing much shorter reads. Each individ-
`ual read is roughly 25 to 50 bases in length (also expected
`to increase slightly in coming years). The Solexa sequenc-
`ing technology has recently started producing break-
`through results. Research described in [3] and [4]
`employed Solexa sequencers to obtain high-resolution
`genomic maps of several histone modifications, as well as
`localization data for the DNA-binding proteins. Effective-
`ness of ChIP-sequencing has also been demonstrated by
`[5], who used Solexa sequencing to obtain locations of
`STAT1 binding sites in HeLa S3 cells before and following
`IFN-γ stimulation.
`
`Mapping reads from the Solexa sequencer presents an
`obvious algorithmic challenge: tens of millions of reads
`must be mapped to a large (e.g. mammalian) genome in
`a reasonable amount of time. Strong efforts to design
`short-read mapping algorithms have resulted in methods
`that are effective in particular contexts. The mapping algo-
`rithm implemented as part of the Solexa analysis pipeline
`is named ELAND (Efficient Large-Scale Alignment of
`Nucleotide Databases). ELAND is optimized to map very
`short reads, with length at most 32 nt, and ignores the
`additional bases when the sequenced reads are longer.
`ELAND also only allows at most two mismatches between
`the read and the genomic sequence to which it maps,
`which will clearly be too few for longer reads. Despite
`these restrictions, ELAND remains very useful for many
`mapping tasks because it is extremely efficient. The SXO-
`ligoSearch algorithm (by Synamatix) can quickly map
`reads of varying length, using different criteria, with per-
`formance depending on both the read length and map-
`ping criteria. The performance of SXOligoSearch depends
`on use of the proprietary SynaBASE data structure. This
`data structure is a heavily compressed and annotated
`index for the reference genome that retains all non-redun-
`dant information. The gains in mapping speed by using
`such a data structure come at a cost in terms of the mem-
`ory required for the SynaBASE data structure. Extreme
`memory requirements of the data structure makes SXOli-
`goSearch unsuitable for use on hardware available in
`most labs.
`
`In most re-sequencing applications accuracy of mapping
`is a primary concern. We must know with great accuracy
`what part of the genome was actually sequenced. There
`are several reasons why it might be difficult to determine
`the location in the reference genome from which a read
`was derived, or even if a read was derived from the refer-
`ence genome. These include problems with the experi-
`ments,
`such
`as
`sequencing
`errors or
`sample
`contamination. Mapping is also made more difficult by
`repeats in the genome, and by polymorphisms. While
`mapping algorithms cannot be expected to be robust to all
`such problems, effort should be made to make the algo-
`rithms as robust as possible.
`
`Two sources of information with the potential to improve
`mapping accuracy are the 3' ends of longer reads, which
`are often ignored because they contain a higher frequency
`of errors, and the base-call quality scores. The quality
`scores describe the confidence of bases in each read.
`Sequencing quality scores, introduced in the Phred algo-
`rithm [6,7], assign a probability to the four possible nucle-
`otides for each sequenced base. The Solexa analysis
`pipeline, for example, includes a program called BUS-
`TARD to calculate quality scores. Because the bases with
`lower quality scores are more likely to be sequencing
`
`Page 2 of 8
`(page number not for citation purposes)
`
`

`
`BMC Bioinformatics 2008, 9:128
`
`http://www.biomedcentral.com/1471-2105/9/128
`
`errors, any potential mapping for a read should be penal-
`ized less for mismatching at positions with lower scores.
`The quality scores are especially important for mapping
`longer reads, since the 3' ends of longer reads are known
`to have a higher frequency of sequencing errors.
`
`To investigate whether these sources of information can
`be used to improve accuracy when mapping reads, we
`developed the RMAP tool, which can map reads having a
`wide range of lengths and allows base-call quality scores
`to determine which positions in each read are more
`important when mapping. The only requirement on the
`base-call quality scores for use in RMAP is that they
`increase monotonically with the inverse of the error prob-
`ability for a particular base call. Our results indicate that
`significant gains in mapping performance can be achieved
`by considering the information in 3' ends of longer reads,
`and appropriately using the quality scores.
`
`Results and Discussion
`The mapping criteria
`We designed RMAP to use two different mapping criteria,
`both based on approximate matching of the read and the
`reference genome. The first criterion is a simple count of
`mismatches between a read and the aligned genomic seg-
`ment. Under this criterion, any unknown nucleotides in
`the reference genome (i.e. Ns) will induce a mismatch
`with any nucleotide. Uncalled positions, where the
`sequencing was unable to determine the nucleotide, also
`induce a mismatch. For a fixed read length, by allowing a
`greater number of mismatches, more reads can be
`mapped to reference genome. We refer to this simple mis-
`match criterion as RMAPM (RMAP using mismatch
`scores).
`
`The second criterion, also based on mismatch-counts,
`makes use of the base-call quality scores. A cutoff for the
`base-call quality score is used to designate positions as
`either high-quality (HQ) or low-quality (LQ), depending
`on whether the quality score of the highest-scoring base at
`that position exceeds the cutoff. Low-quality positions
`always induce a match (i.e. act as wild-cards). To prevent
`the possibility of trivial matches, a quality control step
`eliminates reads with too many low-quality positions. As
`with the first criterion, mapping accuracy can be control-
`led by manipulating the number of allowed mismatches
`when mapping. But for the second criterion, manipulat-
`ing the quality-score cutoff provides another means of
`adjusting sensitivity and specificity, and allows positions
`to contribute when they are of high-quality, but not be
`penalized if they are low-quality. We refer to this criterion
`as RMAPQ (RMAP using quality scores).
`
`Evaluating mapping accuracy
`Measuring mapping accuracy
`In measuring mapping accuracy, we want to quantify both
`sensitivity and specificity by using reads sequenced from
`DNA samples from selected genomic regions instead of
`the entire genome. By mapping those reads to the
`genome, we can evaluate how accurately they are mapped
`to the target region. However, there are theoretical and
`practical limits to how well these can be measured. Inabil-
`ity to map a read correctly can be attributed to sequencing
`errors (arising from any part of the experiment), to varia-
`tion between the sampled genome and the reference
`genome, or can result from ambiguities caused by repeats
`in the reference genome. These diverse sources of error
`make it difficult to measure accuracy in terms of tradi-
`tional sensitivity and specificity.
`
`Ambiguous reads, under a given mapping criterion, are
`reads that map to more than one location in the reference
`genome. Reads that map to a single location are called
`uniquely mappable (or simply mappable) reads. All reads
`that are not mapped uniquely to some location in the ref-
`erence genome are said to be unmappable (which
`includes ambiguous reads). We define target region cover-
`age (or simply coverage) as the number of bases in the tar-
`get region covered by at least one mappable read, divided
`by the total number of bases in the target region. In order
`to compare the coverage values for different read lengths,
`we use only the first base of each read to represent that
`read. By counting bases covered, rather than number of
`reads that map to the target region, greater target region
`coverage is achieved when the reads map uniformly in the
`target region. We define mapping selectivity as the number
`of mappable reads that map inside the target region,
`divided by the total number of mappable reads. A read is
`said to map inside the target region if any part of the read
`overlaps the target region. The selectivity shows how well
`the mapping criterion places mappable reads inside the
`target region. In this study, when we refer to mapping
`accuracy, we are referring to both coverage and mapping
`selectivity (and we formally treat accuracy as the mean of
`these two measures).
`
`Evaluation data
`We used the data from samples of two BACs provided for
`sequencer validation by Solexa, which covers 162 kb of
`the chromosome 6 MHC region in an A1-B8-DR3 alter-
`nate haplotype assembly based on sequence data from the
`COX library [8]. We chose one lane of reads sequenced by
`the 1G sequencers at the CSHL Genome Center. The total
`number of raw 36 nt reads in this data set is 3.4 million,
`with the quality score of each base ranging from -5 to 40
`(as called by the BUSTARD program from the Solexa anal-
`ysis pipeline). As reference genome we used hg18, all
`chromosomes except chr6, which we replaced entirely
`
`Page 3 of 8
`(page number not for citation purposes)
`
`

`
`BMC Bioinformatics 2008, 9:128
`
`http://www.biomedcentral.com/1471-2105/9/128
`
`with chr6_cox_hap1, the A1-B8-DR3 alternate haplotype
`assembly. We chose to include this alternate haplotype in
`the reference because it is the origin of the BAC that was
`sequenced. We excluded the ordinary chr6 because it has
`high similarity with chr6_cox_hap1, and including both
`of these would have resulted in a high proportion of reads
`mapping ambiguously to chr6 and chr6_cox_hap1 (see
`additional file 1).
`
`Evaluation procedure
`To investigate how information is distributed within the
`reads, we ran RMAP on all reads with lengths ranging
`from 25–36 nt, allowing mismatches in the range of 0 to
`10, and using both the RMAPM and RMAPQ criteria. For
`the RMAPQ criterion, we chose {4, 8, 12, 16, 20, 24} as
`the set of quality-score cutoffs to evaluate. Reads with
`fewer than 10 contiguous HQ bases (i.e. bases scoring
`above the quality-score cutoff) were considered unmap-
`pable and removed from consideration, as the algorithm
`requires a minimum number of high-quality bases for
`efficiency (see Methods section for details). Any read lack-
`ing 10 consecutive high-quality bases would likely have a
`very high overall amount of error.
`
`Mapping longer reads with more mismatches increases
`accuracy
`The Solexa sequencer can produce reads of more than 50
`bases, and longer reads contain more sequence informa-
`tion. Although it is known that the quality of sequenced
`bases in reads decreases toward the 3' end of the read,
`especially as read length increases, it remains to be shown
`how much useful information may still exist in bases at
`the 3' ends of longer reads. Making use of any additional
`bases is only expected to improve mapping accuracy if the
`additional bases contain information of sufficient quality.
`When the BAC reads were mapped to the human genome
`using the RMAPM criterion, with length from 25–36 nt,
`and different number of allowed mismatches, we found
`that there is generally a great deal of information in 3' end
`bases up to 36 nt. These results are presented in Figure 1
`and Supplementary Table 1 (see additional file 2) (We
`remark that the accuracy of RMAP using the RMAPM cri-
`terion for reads shorter than 32 nt, allowing at most 2 mis-
`matches, is the same as that of ELAND). The BAC coverage
`always increased with length of mapped reads, except
`when only one or zero mismatches are allowed. The map-
`ping selectivity decreases monotonically with read length
`when zero or at most one mismatch is allowed. When
`multiple mismatches are allowed, the mapping selectivity
`first increases with read length, then decreases slightly.
`Taking the mean of these two measures as overall map-
`ping accuracy, we see that the combined target region cov-
`erage and mapping selectivity is maximized when read
`length is 36 nt and up to 4 mismatches are allowed. Com-
`paring read lengths between 25 nt and 36 nt, the mapping
`
`
`
` Figure 1
`
`
`
`
`
`
`
`
`
`Comparison of mapping accuracy of RMAPM crite-
`rion under different parameter combinations. Com-
`parison of mapping accuracy for reads of different lengths,
`and allowing different numbers of mismatches without using
`quality scores. Both the target (BAC) region coverage (a) and
`the mapping selectivity (b) are displayed. The mean of these
`two measures is presented in (c) as mapping accuracy. Stand-
`ard error of displayed values was always ≤ 1.0% and usually <
`0.1%, as estimated by mapping reads obtained from the sec-
`ond lane of the same sequencing run of the same BAC
`regions (this applies also to values in Figure 2).
`
`Page 4 of 8
`(page number not for citation purposes)
`
`

`
`BMC Bioinformatics 2008, 9:128
`
`http://www.biomedcentral.com/1471-2105/9/128
`
`selectivity increases 6.4% while the BAC region coverage
`increases 8.6%. By extrapolating our results, even greater
`improvements are expected as read lengths increase
`beyond 36 nt.
`
`Using quality score information increases accuracy
`We tested different cutoffs for defining high quality bases
`to estimate the ability of the RMAPQ criterion to most
`effectively use the quality information in the reads. We
`achieve better mapping performance in both BAC cover-
`age and mapping selectivity when using longer reads with
`high quality score cutoff of 4 to 24 than without using
`quality score filtering (See Figure 2, and additional file 2).
`This is due to a larger number of reads being mapped
`unambiguously to the genome by the RMAPQ criterion,
`and a larger number of those being mapped to the target
`(BAC) regions. The best mapping accuracy was achieved
`when reads were of length 36 nt, a quality-score cutoff of
`8 was used, and when at most one mismatch was allowed.
`For these settings, the mapping selectivity further
`improved almost 3% with the same BAC coverage, com-
`pared with the best accuracy of RMAP using the RMAPM
`criterion. These results demonstrate that the way in which
`the quality scores are incorporated into RMAP results in
`improved accuracy.
`
`The reads derived from the BACs provided by Solexa for
`the purpose of validation are of very high quality. For a
`given 162 Kb BAC region, most of these 3.4 million reads
`can be mapped almost perfectly. This introduces a satura-
`tion effect, leaving limited space for improvement, as
`highly accurate mapping is achieved easily. RMAP still
`makes significant improvements within this narrow
`range. In many applications, without this ceiling effect,
`the improvement is expected to be even more pro-
`nounced.
`
`We also compared the mapping accuracy of RMAP with
`another available, but presently unpublished, method
`called MAQ [9]. Using default parameter setting, we ran
`MAQ to map the set of 3.4 M reads from the BAC.
`Although MAQ had speed and memory usage similar to
`RMAP, we found MAQ to have lower mapping accuracy
`(see additional file 3) when allowing at most 2 mis-
`matches.
`
`Discussion
`Widely-accessible second-generation sequencing technol-
`ogies promise to revolutionize many areas of bio-medical
`research. In addition to de novo sequencing of new species,
`these technologies make targeted re-sequencing a reality.
`Re-sequencing will provide more accurate means of inter-
`rogating the oligo-nucleotide content of samples, and of
`identifying important genome variations, such as disease-
`related SNPs. Mapping reads to genomes is a critical step
`
`in re-sequencing data analysis, and both the algorithmic
`and software technology must keep pace with surging
`advances in the throughput of sequencing instruments.
`
`In order to maximize the use of available information in
`mapping Solexa reads, we developed the RMAP tool,
`which incorporates base-call quality scores to improve
`accuracy. RMAP responds to an urgent need for such an
`algorithm by providing both the accuracy to handle
`emerging mapping tasks. Our results in applying RMAP
`have shown that more reads can be mapped into the target
`regions when using the RMAPM criterion to map longer
`reads and allow more mismatches. We have also shown
`that the way in which quality scores are used in RMAP (the
`RMAPQ criterion) significantly increases both coverage
`and mapping selectivity.
`
`Although second-generation sequencing technology is
`currently producing many important results, there is still
`little understanding of how this technology should
`behave with respect to sequencing errors and what are the
`general properties of typical re-sequencing data sets. As
`more knowledge accumulates about typical results from
`this new sequencing technology, more information can be
`incorporated into algorithms for mapping reads and other
`associated analysis tasks.
`
`In theory we could move toward an ideal mapping crite-
`rion by predictive modeling, where a model would be
`trained to identify the location from which each read was
`derived. Although the best mapping criteria may not be
`amenable
`to high-throughput computation,
`some
`approximation of those criteria could be developed.
`Cross-validation and the use of a wide range of data sets
`could be used to ensure that the trained criteria are suffi-
`ciently general. In practice such a procedure would require
`high-quality training data, and an extreme amount of
`computing time to train and evaluate such models.
`
`RMAP does not consider insertions or deletions (indels),
`which are potentially important in certain sequencing
`applications (e.g. indel polymorphisms). The straight-for-
`ward strategy for handling indels is to extend initial seed
`matches using a Smith-Waterman-style alignment, as is
`commonly done in database search programs like BLAST
`[10]. For short reads, with length ≤ 50 bases, providing
`this greater flexibility will require careful investigation
`into scoring the alignments, because using simple scoring
`schemes for indels may result in higher rate of false-posi-
`tive mappings and ambiguities. In addition, because
`indels have nothing analogous to the base-call quality
`scores, it will be more difficult to distinguish errors from
`real genotypic variations during the mapping stage of
`analysis. We have observed that data sets containing a
`high proportion of low-complexity reads, such as single
`
`Page 5 of 8
`(page number not for citation purposes)
`
`

`
`BMC Bioinformatics 2008, 9:128
`
`http://www.biomedcentral.com/1471-2105/9/128
`
` Figure 2
`
`
`
`
`
`
`
`Mapping accuracy of RMAPQ criterion under varying parameters. Reads with length from 25–36 nt were mapped
`and 0,1, or 2 mismatches were allowed at high quality bases defined by quality score cutoffs of 4 (d) or 8 (a-d). For reference,
`mapping performance of RMAPM criterion with at most 2 mismatches is also shown. (a) The BAC coverage; (b) the mapping
`selectivity; (c) the overall mapping accuracy (equal to the mean of the BAC coverage and selectivity).(d) 2-D performance com-
`parison in both BAC coverage and selectivity of RMAPM and RMAPQ.
`
`nucleotide repeats, cause a decrease in execution speed of
`RMAP. This is explained by such low complexity seeds
`resulting in a large increase in the number of full compar-
`isons (between reads and regions of the reference
`genome), that are induced by these highly-common
`seeds. The exclusion strategy used in RMAP is also heavily
`used by popular programs for searching sequence data-
`bases. Research effort toward improving the structure of
`seeds used in these algorithms has also led to improve-
`ments in the algorithms themselves [11], and similar
`improvements might be observed by developing seeds
`specifically for the short read mapping problem.
`
`The filtration strategy used in RMAP can also be extended
`to "multiple filtration", as described by [12]. This uses
`multiple criteria for excluding possible mappings, and
`would result in the algorithm performing fewer full com-
`parisons between reads and genomic regions. Unlike
`many other applications of approximate matching, the
`extreme volume of data that must be mapped means that
`algorithms for mapping reads must be conscious of the
`memory required. In the framework of RMAP, the most
`straight-forward implementation of multiple filtration
`would use larger (multidimensional) or additional hash-
`tables, which generally requires a great deal more mem-
`
`Page 6 of 8
`(page number not for citation purposes)
`
`

`
`BMC Bioinformatics 2008, 9:128
`
`http://www.biomedcentral.com/1471-2105/9/128
`
`ory. For short reads, as produced by the Solexa sequencer,
`the full comparisons can be computed very fast, so
`another concern is keeping the work required to imple-
`ment more powerful filtration less than that required to
`do the extra comparisons.
`
`Aside from further processing on the set of reads, effi-
`ciency improvements can be gained by indexing the refer-
`ence genome. This is the strategy used by SXOligoSearch,
`through the proprietary SynaBASE data structure. Because
`the reference genome will likely remain static for many
`mapping tasks (e.g several experiments using the hg18
`genome), this indexing can be done off-line, so the time
`required to build such an index is not important. The
`usual problem with indexing the genome is that such
`index structures are often several times larger than the
`genome itself. The more information built into the index
`to facilitate searches, the larger the index must be. As
`stated above, hardware currently available for mapping
`usually has a few GB of memory, and this restricts the
`kinds of indexing that can be done on entire genomes. To
`make genome index structures practical, they must either
`be small, or support on-line processing so that only a
`small portion of the structure must reside in memory at
`once. It is likely that special purpose data structures for
`indexing the genome can be developed to fit these
`requirements, and greatly improve efficiency of mapping
`reads. Specifically with respect to RMAP, adding any
`means of eliminating exact genomic repeats during map-
`ping will greatly improve efficiency.
`
`Conclusion
`Our results indicate that significant gains in Solexa read
`mapping performance can be achieved by considering the
`information in 3' ends of longer reads, and appropriately
`using the base-call quality scores. The RMAP tool we have
`developed will enable researchers to effiectively exploit
`this information in targeted re-sequencing projects.
`
`Availability and Requirements
`The RMAP tool is freely available for downloading at
`http://rulai.cshl.edu/rmap/.
`
`Methods
`Design of the RMAP algorithm
`In this section we describe the algorithmic strategy used in
`RMAP. We treat the mapping problem as approximately
`matching a set of patterns in a text – the set of patterns
`being the reads, and the text being the genome. This prob-
`lem has been well studied, and several general algorithmic
`strategies have emerged for solving it (see [13] for a
`detailed treatment). The major motivation for developing
`the RMAP algorithm was to incorporate base-call quality
`scores to weight mismatches and improve mapping accu-
`racy. In addition to having high mapping accuracy, RMAP
`
`was designed under the restrictions that it must be capable
`of (1) mapping reads with length exceeding 50 bases (for
`the applications discussed in the introduction), (2) allow-
`ing the number of mismatches to be controlled (not being
`restricted to a small fixed number), and (3) completing
`mapping tasks under reasonable time constraints on
`widely available computing hardware.
`
`The algorithm implemented in RMAP uses the filtration
`method described by [14]. For reads of length n, and map-
`ping with up to k mismatches, each read is partitioned
`into k + 1 contiguous seeds (each seed is a substring of the
`read, and has length &#x230A;n/(k + 1)&#x230B;).
`Because there can only be k mismatches in a mapping,
`and there are k + 1 seeds for each read, any mapping must
`have at least one seed with no mismatches. The algorithm
`first identifies locations in the genome where the seeds
`match exactly. Exact matching can be done much more
`quickly than approximate matching, and evaluating the
`approximate match between a read and a genomic region
`only needs to be done for those regions surrounding an
`exactly-matching seed.
`
`To efficiently implement the filtration strategy, RMAP pre-
`processes the set of reads, building a hash-table (which we
`refer to as the seed-table) indexed by the seeds. The table
`entry for a particular seed lists all reads containing that
`seed, along with the offset of that seed within the read. For
`a set of r reads, each having length n, if k mismatches are
`allowed in the search, the seed table has size O(4n/k + rk).
`The mapping proceeds by scanning the genome, with a
`window of size equal to the seed size. Each segment of the
`genome is tested as a seed by hashing that segment to
`determine the set of reads that must be compared in their
`entirety with a larger genomic region surrounding the seg-
`ment of the genome currently being scanned. This is a
`common strategy to implement the filtration stage of
`approximate matching. The influence of the size of the
`genome in the time complexity of RMAP is therefore lin-
`ear, and importantly the space complexity of RMAP is
`independent of the size of the genome.
`
`The step of comparing the full read to portions of the
`genome where a seed has been found is implemented to
`require time that is logarithmic in the length of the reads.
`The comparison takes advantage of bit-wise operations,
`and the reads are encoded in a binary format (see addi-
`tional file 4 for supplementary method). A series of logical
`operations produce a vector indicating the locations of
`mismatches between the read and the genomic segment
`being compared, and the weight of the bit-vector indicat-
`ing mismatches computed using a well-known technique
`described by [15].
`
`Page 7 of 8
`(page number not for citation purposes)
`
`

`
`BMC Bioinformatics 2008, 9:128
`
`http://www.biomedcentral.com/1471-2105/9/128
`
`Acknowledgements
`The authors would like to thank Richard McCombie, Vivekanand Balija,
`Melissa Kramer in Genome center of Cold Spring Harbor Laboratory for
`providing us the Solexa sequencing data and helpful suggestion. This work
`was supported partly by the National Institutes of Health (HG001696), the
`National Science Foundation (0324292) and DART Neurogenomics Alli-
`ance Foundation.
`
`RMAP is sufficiently fast that several million reads can be
`mapped to a mammalian genome in one day on a com-
`puter with a single processor. No portion of the reference
`genome is maintained by RMAP, and the size of the seed
`table dominates the space requirements. Because these
`requirements are sufficiently small, RMAP can be run on
`widely available hardware. This includes the nodes typi-
`cally used in cluster computers, and allows the processing
`to be easily and effectively parallelized by simply parti-
`tioning the set of reads. On a test data set generated by
`randomly sampling one million 50 nt segments (simu-
`lated reads) from the hg18 genome, and randomly chang-
`ing up
`to 4 bases
`in each
`read, ou

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket