`
`Nucleic Acids Research, 2008, Vol. 36, No. 16 e105
`doi:10.1093/nar/gkn425
`
`Substantial biases in ultra-short read data sets from
`high-throughput DNA sequencing
`Juliane C. Dohm1, Claudio Lottaz2, Tatiana Borodina1 and Heinz Himmelbauer1,*
`
`1Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin and 2Institute for Functional Genomics,
`Computational Diagnostics, University of Regensburg, Josef-Engert-Str. 9, 93053 Regensburg, Germany
`
`Received December 21, 2007; Revised June 16, 2008; Accepted June 19, 2008
`
`ABSTRACT
`
`Novel sequencing technologies permit the rapid
`production of large sequence data sets. These tech-
`nologies are likely to revolutionize genetics and bio-
`medical research, but a thorough characterization
`of the ultra-short read output is necessary. We gen-
`erated and analyzed two Illumina 1G ultra-short read
`data sets, i.e. 2.8 million 27mer reads from a Beta
`vulgaris genomic clone and 12.3 million 36mers from
`the Helicobacter acinonychis genome. We found
`that error rates range from 0.3% at the beginning
`of reads to 3.8% at the end of reads. Wrong base
`calls are frequently preceded by base G. Base sub-
`stitution error frequencies vary by 10- to 11-fold,
`with A > C transversion being among the most fre-
`quent and C > G transversions among the least fre-
`quent substitution errors. Insertions and deletions
`of single bases occur at very low rates. When simu-
`lating re-sequencing we found a 20-fold sequencing
`coverage to be sufficient to compensate errors by
`correct reads. The read coverage of the sequenced
`regions is biased; the highest read density was
`found in intervals with elevated GC content. High
`Solexa quality scores are over-optimistic and low
`scores underestimate the data quality. Our results
`show different
`types of biases and ways to
`detect
`them. Such biases have implications on
`the use and interpretation of Solexa data, for de
`novo sequencing, re-sequencing, the identification
`of single nucleotide polymorphisms and DNA
`methylation sites, as well as for transcriptome
`analysis.
`
`INTRODUCTION
`
`The DNA sequencing field has experienced a major boost
`with the emergence of novel sequencing technologies.
`Several systems are currently on the market, including
`Illumina’s Solexa instrument, the Applied Biosystems’
`
`Sequencing by Oligonucleotide Ligation and Detection
`(SOLiD) technology, and the GS FLX instruments from
`Roche/454 Life Sciences. The Polony cyclic sequencing by
`synthesis technology is to be launched (1).
`These technologies allow sequence determination much
`quicker and cheaper than the dideoxy chain terminator
`method presented by Sanger in 1977 (2). The main differ-
`ence between Sanger sequencing output and the output of
`the new technologies is an increased read number, asso-
`ciated with a decrease in the length of individual reads.
`To achieve high throughput, the new approaches apply
`different strategies. 454 Life Sciences has adapted pyrose-
`quencing to a microbead format to sequence 400 000
`DNA fragments simultaneously, resulting in a per-run
`dataset of 100 Mbp with reads averaging 250 bp. SOLiD
`sequencing also uses templates immobilized onto microbe-
`ads. Here, the sequence of the template DNA is decoded
`by ligation assays
`involving oligonucleotides
`labeled
`with different fluorophores. The SOLiD read length is
`currently 25–35 bases, and 2–3 Gbp of data can be col-
`lected during an 8-day run. Solexa sequencing is based on
`amplifying single molecules attached to the surface of
`a flow cell to generate clusters of identical molecules, fol-
`lowed by sequencing using fluorophore-labeled reversible
`chain terminators. Solexa sequencing proceeds a base at a
`time and read length depends on the number of sequenc-
`ing cycles. Current Illumina sequencing instrumentation
`achieves read lengths of 36 bases. The Solexa flow cell
`is composed of eight separately loadable lanes. Since
`each lane has a capacity of about 5 million reads, > 40
`million reads can be generated in a run of 3 days, equiva-
`lent to > 1.3 Gbp.
`The adoption of high-throughput sequencing will revo-
`lutionize molecular biology research, similar to the inven-
`tion of the polymerase chain reaction (PCR) twenty years
`ago (3). 454 pyrosequencing short (100 bp) reads gen-
`erated on Roche GS20 instruments (now replaced by
`GS FLX) were successfully used for the de novo sequenc-
`ing of small genomes and BACs as well as for transcript
`discovery and characterization (4–9). De novo genomic
`sequencing succeeded even when ultra-short (27–36 bp)
`reads generated by Solexa sequencing were employed for
`
`*To whom correspondence should be addressed. Tel: +49 30 8413 1354; Fax: +49 30 8413 1380; Email: himmelbauer@molgen.mpg.de
`
`ß 2008 The Author(s)
`This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
`by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
`
`SEQUENOM EXHIBIT 1007
`
`
`
`e105 Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`PAGE 2 OF 10
`
`a small genome (10). For the human genome, ultra-short
`reads were applied in studies on chromatin analysis
`(11,12).
`However, working with large data sets of short reads
`involves difficulties, especially due to wrong base calls. To
`exploit the full prospects of the novel technologies there is
`the need to know as much as possible about biases in the
`output data sets, especially with respect to errors. Previous
`studies focused on the 454 technology (13) or dealt with
`the prospects of short read sequencing as such (14). Here,
`we characterize two Solexa read data sets: 12.3 million
`36mer reads (trimmed to 32 bases) from the Helicobacter
`acinonychis genome and 2.8 million 27mer reads from
`a Beta vulgaris bacterial artificial chromosome (BAC)
`clone. We analyze these reads and detect biases with
`respect to error positions, error rates, erroneous base
`calls and their neighboring bases and single base insertions
`or deletions. We determine the compensation of erroneous
`base calls by correct base calls depending on the sequenc-
`ing coverage. We analyze read start positions, the read
`coverage along the target sequence, and dependencies of
`read coverage and local sequence characteristics. Finally,
`we assess the reliability of quality values for wrong and
`correct base calls.
`
`METHODS
`
`Solexa sequencing
`
`Helicobacter acinonychis. DNA was fragmented by nebu-
`lization as described in the Solexa protocol (www.illumi
`na.com). Beta vulgaris DNA was sheared for 1 h with a
`UTR200 sonication device (Hielscher Ultrasonics GmbH)
`at 100% amplitude and 0.5 cycle mode. Fragmented DNA
`was
`further processed as described previously (10).
`Sequencing was carried out by running 27 or 36 cycles,
`respectively, on the Illumina 1G sequencing instrument.
`The Goat module
`(Firecrest v.1.8.28 and Bustard
`v.1.8.28 programs) of the Solexa pipeline v.0.2.2.3 (for
`Helicobacter data set) and v.0.2.2.5 (for Beta data set)
`were used for image deconvolution and quality value cal-
`culation. Parameterization was auto-generated by the
`pipeline (see Supplementary Data for intensity plots and
`run parameters, i.e. frequency cross-talk matrix, offsets,
`phasing). Set up configuration was used as installed by
`Illumina’s technical staff. The Helicobacter data set was
`collected from three lanes of two flow cells. The Beta
`data set was generated in a single lane from a further
`flow cell.
`
`Data analysis
`
`We developed various Perl scripts to extract and process
`information from ELAND output files (Gerald module
`v.1.27 of the Solexa pipeline) and to find positions of
`reads that can be aligned more than once to the reference
`sequence without mismatches (the positions of those reads
`are not reported by ELAND). We wrote Perl scripts for
`the detection of deletions and insertions of single nucleo-
`tides in otherwise error-free reads and for the analysis of
`quality values per base call. Plots were generated with the
`statistical computing environment R (www.R-project.org)
`
`or OpenOffice Calc (www.openoffice.org). R: A language
`and environment for statistical computing. R Foundation
`for Statistical Computing, Vienna, Austria. www.R-project.
`org) or OpenOffice Calc (www.openoffice.org).
`
`Data availability
`
`Solexa read data are available from the SHARCGS
`project website at http://sharcgs.molgen.mpg.de.
`
`RESULTS
`
`We previously generated 12 288 791 36mer reads from
`Helicobacter acinonychis on an Illumina 1G sequencing
`device (10). The Helicobacter genome is 1.55 Mbp in size
`and has a GC content of 38%. A high-quality reference
`for Helicobacter
`sequence
`is
`available
`(GenBank
`NC_008229) (15). We ran the ELAND software on the
`read data set (trimmed by the last four bases, because
`ELAND processes the first 32 bases only) and selected
`the 8 389 548 32mer reads that ELAND reported to be
`the Helicobacter
`uniquely matched against
`reference
`sequence with zero, one or two mismatches (labeled U0,
`U1 or U2, respectively, see Figure 1b). Additionally, we
`generated a 27mer read data set for the sugar beet (Beta
`vulgaris) bacterial artificial chromosome (BAC) clone ZR-
`47B15. The data set consists of 2 788 286 reads, 2 156 266
`of which were labeled U0, U1 or U2 in the ELAND
`output (Figure 1a). The Sanger reference sequence in fin-
`ished quality of this BAC insert consists of 10 9563 bases
`with 34.85% GC (Dohm et al., manuscript submitted for
`publication). For all uniquely matched reads, ELAND
`reports the match position in the reference sequence as
`well as the error position(s) in the read.
`
`Start positions of reads and read distribution on
`the target sequence
`
`The preparation of Solexa sequencing libraries involves
`the fragmentation of the DNA, followed by the adaptor
`ligation, pre-amplification for material enrichment and
`amplification within the flow cell prior to sequencing. In
`order to detect whether the steps preceding sequencing
`show biases, we analyzed the first bases of a read and
`the bases that flank the read start position on either
`side. Of all possible 27mer tuples (Beta) and 32mer
`tuples (Helicobacter), 99.8 and 98.8% are unique, respec-
`tively. We therefore assume that potential biases are repre-
`sentative for the data set.
`We calculated the frequency of 2- to 10-base tuples
`enclosing the starting point
`for 8 389 548 uniquely
`matched Helicobacter reads and for 2 156 266 uniquely
`matched Beta reads relative to the frequency of these
`tuples in the reference sequences. Since the bases in the
`reads are subject to errors, we used for both sides the bases
`of the corresponding region in the reference sequence.
`A general sequence bias for the immediate vicinity of
`the read start position could not be deduced from the two
`data sets. The results for the Beta data set did not suggest
`any tendencies (Supplementary Figure 1a). The results
`for the reads from Helicobacter showed a weak tendency
`towards T being the most frequent base call to the left and
`
`
`
`PAGE 3 OF 10
`
`Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`e105
`
`(a)
`
`88715
`
`191683
`
`I U2
`
`E] NM
`QC
`El R012
`El U0
`I U1
`
`readsperkbp
`
`(b)
`
`647151
`
`1399772
`
`
`
`3000035000
`
`25000
`
`20000
`
`15000
`
`5000
`
`20
`
`25
`
`30
`
`35
`GC—contenl [0/9]
`
`40
`
`45
`
`50
`
`10000
`5000
`
`V
`
`
`
`
`
`6342625
`
`512436
`
`420762
`
`DNM
`IQC
`EIFt012
`Duo
`IU1
`lU2
`
`Figure l. Pie charts of tire read analysis with ELAND. The ELAND
`categories are: QC: no matching done because of low quality of the
`read (more than two positions with quality score
`5), NM. no match
`found; U0, unique exact match found; Ul, unique match with one
`error, U2. unique match with two errors; R0, multiple exact matches
`found; Rl, multiple matches with one error; R2, multiple matches with
`two errors. 1he categories R0, R1, R2 are shown as a single entity. (a)
`ELAND categorizations for 27mer reads from Bela vulgaris clone ZR
`47815 (2788 286 in total). (b) ELAND categorimtions for 32mu' reads
`from Helicobacter achronychis (12 288 79! in total, trimmed by the last
`four base calls of the original 36mer data).
`
`to the right of the read start position (Supplementary
`Figure lb). Since two different fragmentation methods
`were used, sonication for Beta and nebulization for
`Helicobacter,
`the results may indicate method-inherent
`properties.
`However, by analysing sequence characteristics and
`number of roads starting in a sliding window of lkbp in
`width, we found a correlation of read coverage and GC
`content in both data sets (Figure 2). In regions of elevated
`GC content
`the number of reads was increased. For
`
`instance, windows with a GC content of 40% contain
`almost twice as many reads as windows with 30% GC in
`the Beta data set. Thus, while the vicinity of 10 bp was not
`sufficient
`to detect a conclusive bias for read starting
`
`20000
`
`l5000
`
`readsperkbp I0000
`
`GC—Conlent [°/.]
`
`Figure 2. Correlation of the Solexa read coverage and GC content. (a)
`27mer reads generated from Bela vulgaris BAC ZR 47B15 (b) 32mer
`data set from the Helicobacrer acinmychr‘s genome Each data point
`corresponds to the number of reads recorded for a lkbp window
`(shift of l00bp in Beta and lkbp in Helicobacler).
`
`there is a strong preference towards GC-rich
`points,
`regions in lkbp sliding windows. Since both templates
`show the correlation of read coverage and GC content,
`the shift to GC rich regions seems to be a general feature
`of the current pre-sequencing procedure. A similar finding
`was reported by Hillier et a1. (16).
`The overall coverage considering matching reads only is
`[65-fold in the Helicobacter data set (l85-fold for 36mer
`reads) and 465-fold in the Beta data set. The distribution
`of matching reads along the reference sequences is shown
`in Figure 3. We calculated the read depth in windows of
`
`
`
`e105 Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`PAGE 4 OF 10
`
`Figure 3. Distribution of Solexa reads along the reference sequences considering unique match positions reported by ELAND (zero, one or two
`mismatch bases) and reads with more than one match position (no mismatch bases) detected with a Perl script. (a) Read distribution along the Beta
`vulgaris BAC sequence (with cloning vector pBeloBACII). 2 166 892 27mer reads were matched against the finished sequence (enclosed by the cloning
`vector,117 kbp in total). The read coverage was calculated in 200 consecutive 0.58 kbp windows. (b) Read distribution along the 1.55 Mbp
`Helicobacter genome, based on 8 700 113 32mer reads. The local coverage is shown in 200 consecutive windows of 7.77 kbp.
`
`size 7.77 kbp for Helicobacter (Figure 3a) and of size
`0.58 kbp for Beta (Figure 3b). The coverage varied by
`a factor of 13 and 3.8, respectively, ranging from 49- to
`652-fold for Helicobacter and from 238- to 897-fold for
`Beta (Table 1). We tested whether the distributions shown
`in Figure 3 are compatible with a uniform distribution
`of reads across the target sequences. We have applied a
`2-test (goodness of fit) to reject the hypothesis that reads
`have the same probability to fall into equally sized regions
` 10 even when dividing
`of the target sequence (P < 1e
`target sequences in only five regions).
`There is a number of ‘gap’ positions in the target
`sequences where no read starts from. However, since
`there are no gaps larger than read length all positions of
`the target sequence are covered (Supplementary Table 1).
`
`Distribution of error positions along reads
`
`We selected all ELAND U1 and U2 reads, i.e. 28 0173
`Beta
`and 2 046 923 Helicobacter
`reads
`reads
`(cf.
`Figure 1), to analyze the occurrence of errors per position.
`We performed two types of calculations. Firstly, we calcu-
`lated the fraction of wrong base calls at each read position
`considering wrong base calls only. Secondly, we calculated
`per-base error rates, i.e. the fraction of wrong base calls
`per position considering all base calls. The result is shown
`in Figure 4. The number of occurrences of wrong bases is
`increased at the first position. Rising from the lowest error
`rate at the second position, the highest error rate is
`observed at the last positions of the read [similar observa-
`tion reported in (16)]: 2.5 and 2.9% of the errors in the
`data sets of Beta and Helicobacter, respectively, were
`found at read position 1, and 11.8% of errors were
`recorded at the last read position (position 27 in the
`Beta data set and position 32 in the Helicobacter data
`set, Figure 4a). The per-base error rates range from
`0.3% to 3.8% (Figure 4b) resulting in an average error
`
`Table 1. Proportion of reference sequence and coverage ranges (based
`on ELAND U0, U1, U2, R0 matched reads and reads with single
`indels)
`
`Beta
`
`Helicobacter
`
`Coverage
`
`BAC (%)
`
`Coverage
`
`Genome (%)
`
`200 300
`300 400
`400 500
`500 600
`600 700
`700 800
`800 900
`
`4.27
`23.93
`25.64
`23.93
`12.82
`4.27
`5.13
`
`<100
`100 150
`150 200
`200 250
`250 300
`300 350
`>350
`
`3.53
`26.06
`42.28
`21.49
`4.44
`1.29
`0.90
`
`rate of 0.6% for the Beta data set and 1.0% for
`the Helicobacter data set. Note that only uniquely
`matched reads with less than three substitution errors
`are considered.
`In re-sequencing projects, sequencing errors can be
`compensated by high-coverage sequencing. In a re-sequen-
`cing project, the reads are aligned against a reference
`sequence. Wherever a mismatch between sequencing
`data and the reference is observed, a polymorphism is
`postulated. In order to avoid spurious detection of poly-
`morphisms due to sequencing errors, a consensus between
`several reads at each position of the reference is common
`practice. Here, we simulate re-sequencing at different
`depth by randomly choosing the appropriate number of
`reads from our two data sets and counting wrong and
`correct base calls [five (Helicobacter) or ten (Beta) simula-
`tions per data point]. An error was considered as compen-
`sated when at least one correct base call for the same
`position existed. A correct base call and the reference
`sequence hold the majority over one wrong base call, i.e.
`x wrong base calls at the same position can be compen-
`sated by x correct base calls (plus reference sequence).
`
`
`
`PAGE 5 OF 10
`
`
`
`
`
`
`
`
`
`Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`e105
`
`
`
`
`
`
`
`
`
`
`
`murmur-onus
`
`
`
`
`
`
`Figure 6. Distance between two errors on a read in the Helicobacrer
`and Beta vulgari: data sets.
`‘0’ indicates that the erroneous base calls
`are next to each other.
`
` 7 a s rorr121314rsrsrnoromzrmaaaraszszrmzom
`
`
`
`
` ||||
`flli'fiTTT'li'li'li'lI'li'l'lTli'lWIIlllIlII III |
`| II
`
`|
`I
`12 S 4 5 6 7 I 9 10"IZISMISIE‘I‘IIOIONZIZZSZ‘EEZ
`P ..
`
`Figure 4. Frequency of wrong base calls in Solexa reatk depending on
`the position along the read (27mer reads from Beta vulgaris and 32mer
`reads from Helicobacter). (u) Error frequency per position calculated
`from considering wrong base (211s only. The highest error frequency is
`observed at the read 3’ end.
`(1)) Per base error rates (overall error
`frequency per position considering all base calls).
`
`Helicobacter
`Beta vulgaris
`
`H-‘""'-"-.-
`
`Errorsperkbp
`
`Coverage
`
`Figure 5. Compensation of sequencing errors by deep sequencing
`in resequencing projects. The average number of errors per kbp is
`shown for different levels of coverage. For coverages below 2. reads
`are unlikely to overlap and compensation of sequencing errors is rare
`(IhlB. sequencing errors accumulate when the coverage is increased).
`For coverages above 3 fold the number of uncompensated errors drops
`rapidly with the increase of coverage.
`
`We plotted the dependency of sequencing coverage and
`error compensation in Figure 5 (range of simulation
`results:
`see Supplementary Figure 2).
`Increasing the
`sequencing coverage results in a rapid decrease of uncom-
`pensated errors. At a coverage of 20-fold the average
`number of errors per kilo base pair is close to zero and
`does not decrease any further. However, such estimates
`are likely to change with improvements of the sequencing
`technology, as less coverage will be sufficient for reduced
`error rates.
`
`Analysis of reads containing two errors
`
`ELAND reported 88 753 reads containing two errors
`in the Beta data set, corresponding to 4.1% of all uniquely
`matched reads. In Helicobacter, 647151 reads contained
`two errors
`(7.7% of all uniquely matched reads).
`We analyzed the distance between erroneous bases and
`found a preference for small distances between errors
`(Figure 6). In 25% of reads that contained two errors
`the erroneous bases were either at adjacent positions or
`separated by one base. This observation does not contra-
`dict
`the assumption that errors occur independently
`according to their position-specific probability. The heat-
`map in Supplementary Figure 3 illustrates the occurrence
`of two errors relative to the positions in the read. As
`expected from the per-base error rates, two-error occur-
`rences are concentrated at the 3’ end of reads and are
`
`therefore close together. In addition, error pairs also
`occur with increased frequency at read positions 1 and
`2. We provide even stronger evidence for the independence
`of error positions in two-error reads in Supplementary
`Figure 4.
`Although error positions seem to be independent in
`reads with two errors, there is evidence that errors accu-
`mulate in reads more easily than expected. We deduce this
`from the ratios of the observed and expected number of
`reads containing one and two errors
`respectively:
`Given the determined error rates per position (for the
`Helicobacter data set) we expect 3.5 times more correct
`reads (UO) than reads with one error (U1), but we observe
`4.5 times more U0 than U1; we expect 19.8 times more
`correct reads than reads with two errors (UZ), but we
`observe 9.8 times more U0 than U2. Thus,
`there are
`
`
`
`e105 Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`PAGE 6 OF 10
`
`fewer U1 reads than expected and more U2 reads than
`expected compared to the U0 reads. This tendency is con-
`firmed in the Beta data set (data not shown) and suggests
`dependencies in the occurrence of errors.
`
`Analysis of error base sequence context
`
`In order to find sequence composition preferences close to
`wrong base calls, we analyzed the sequence tuples flanking
`error positions. Since errors at position 1 do not have
`preceding bases and errors at the last position do not
`have subsequent bases in the read, we used the corre-
`sponding segment of the reference sequence for the analy-
`sis. This also avoids analysing wrong base calls in the
`error-prone read sequences close to the error position
`under consideration. The sequence composition before
`the read start is not considered to be responsible for
`an error at position 1 because this part of the source
`sequence
`is not part of
`the
`sequenced fragment.
`However, the bases following the end of the read could
`have an influence on the base calling. We decided to treat
`all error positions in the same manner by looking up the
`flanking bases in the reference sequence. As reference
`tuples we did not consider all tuples in the reference
`sequence but all tuples in all uniquely matched reads
`(taken from the reference sequence and adding 5 bases
`before and after the corresponding read segment). This
`is to keep the analysis clean from the read coverage bias
`towards GC-rich regions of the reference sequence. We
`calculated the relative frequencies for 3-
`to 11-base
`tuples enclosing the error at the middle position and gen-
`erated sequence logos for Beta and Helicobacter sepa-
`rately. To visualize the general trend we show the 3- and
`
`5-base tuple results in scatterplots with the tuple frequen-
`cies for both data sets (Figure 7 and Supplementary
`Figure 5). All 3-base tuples starting with a G are clearly
`dominant in both data sets with G-error-G and G-error-A
`being the top candidates (Figure 7). Error enclosing tuples
`starting with A or T are underrepresented, and error
`enclosing tuples starting with C are as frequent as in the
`reference tuples. The least frequent base after an error is
`T, being the third base in the three least frequent tuples
`A-error-T, T-error-T and C-error-T. The trend of G being
`the most frequent base before an error is preserved
`and even more emphasized in the scatterplot with 5-base
`tuples (Supplementary Figure 5). Here, Gs are still the
`preferred bases before an error, and least frequently we
`see errors enclosed by Ts. In 35 and 32% of cases (Beta
`and Helicobacter, respectively), the error position was
`preceded by G.
`
`Analysis of base substitution errors in Solexa reads
`
`Twelve substitution errors (eight transversions and four
`transitions) are possible during a base call. We compared
`the wrong base calls in the reads to the base in the refer-
`ence sequence and found that base substitution errors in
`Solexa reads are not equally frequent. Generally, the two
`data sets show similar tendencies (Figure 8, Table 2). The
`most frequent base to be changed into is a C, preferen-
`tially substituting T or A in the Beta data set and A in the
`Helicobacter data set (T in Helicobacter as well but at a
`lower frequency). Consistently for both data sets, C > G
`transversions are the least frequent substitution errors.
`The top three types of substitution errors account for
`>53% of
`all
`substitution
`errors
`found
`in
`the
`(the transversions A > C,
`Helicobacter read data set
`G > T and A > T) and for >42% of all substitution
`errors found in the Beta data set (the transition T > C
`and the transversions A > C and C > A).
`
`Figure 7. Sequence context of wrong base calls in Solexa reads from
`Helicobacter acinonychis and Beta vulgaris, considering one base
`upstream and downstream of the wrong base calls. An ‘e’ indicates
`the substituted base. The scatterplot shows the correlation of the rela
`tive frequencies (relating the frequency of 3 tuples at error positions to
`the frequency of all 3 tuples in the reads) for the two data sets.
`
`Figure 8. Frequency of substitution errors in the Helicobacter acinony
`chis and Beta vulgaris Solexa read data sets.
`
`
`
`PAGE 70F 10
`
`Nucleic Acids Research, 2008, Vol. 36, N0. 16
`
`e105
`
`
`
`
`
`
`
`
`Table 2. Base substitution frequencies in the Beta and Helicobacter
`read data sets
`
`Into
`
`A
`
`Beta
`A
`C
`C
`T
`Any
`Helicobacter
`A
`C
`C
`T
`Any
`
`0.14
`0.05
`0.05
`0.25
`
`0.25
`0.05
`0.10
`0.41
`
`C
`
`0.13
`
`0.02
`0.04
`0.19
`
`0.07
`
`0.02
`0.06
`0.15
`
`From
`
`G
`
`0.04
`0.08
`
`0.12
`0.24
`
`0.02
`0.04
`
`0. 18
`0.24
`
`T
`
`0.08
`0.15
`0.09
`
`0. 33
`
`0.04
`0. 10
`0.06
`
`0.20
`
`Any
`
`0.25
`0.38
`0. 16
`0.21
`
`0.14
`0.39
`0. 14
`0. 34
`
`(mo-2°0.13
`0.10
`0.14
`
`E 0.12
`3 mo
`
`Table 3. Observed and expected error rates for base calls of different
`quality values in the Beta and Helicobacter data sets
`
`Score
`
`Q 40
`Q 30
`Q 20
`Q
`10
`Q 0
`
`Bela (°/o)
`
`Helicobacler (°/.)
`
`Expected (°/o)
`
`1.39
`3.55
`5.2l
`9.68
`39.65
`
`0.43
`1.06
`1.70
`4.40
`28.68
`
`0.01
`0.10
`0.99
`9.09
`50.00
`
`Insertions and deletions in the Solexa read data sets
`
`The ELAND algorithm is limited to the alignment of reads
`containing up to two substitution errors. In addition to the
`reads matched by ELAN D to the reference sequence there
`is a substantial amount of unmatched reads (Figure 1).
`Some of the unmatched reads contain more than two
`
`sequencing errors, but another reason for unmatched
`reads may be the occurrence of insertions and deletions
`(indels). We implemented a Perl script to find single nucleo-
`tide indels within reads without considering additional sub-
`stitution errors. We observed a very low rate (<0.01%) of
`indel errors: 323 of 2.8 million Beta reads contain a single
`nucleotide insertion and 1258 Beta reads contain a single
`nucleotide deletion; 1215 and 2284 insertion and deletion
`errors, respectively, were found in the Helicobacter data set.
`Further inspection of the data revealed that >25% of
`base insertions occurred in homopolymer tracts of four
`or more nucleotides. However, no clear trend could be
`detected for deletions. With respect
`to the positions
`within reads where indels occur there is a slight accumula-
`tion of such events at internal positions of the reads. No
`bias for inserted or deleted bases could be detected (data
`not shown). The reason for detecting this type of error
`might be sequencing errors in the reference sequences.
`For Helicobacter, two erroneous insertions in the Sanger
`sequence were reported (10). Approximately 10% of
`lllumina reads with one deletion match to these two
`
`positions.
`
`
`
`
`
`
`
`
`
`44444:12:1ss1IInnunumununnuzaaaanaammnaasxarnn
`Score
`
`
`
`
`
`
`
`
`
`Hr
`Illlllllllll
`44444012316e1reonuuusun“wmaazaanaaunmzzuauulaw
`Scam
`
`Figure 9. Histograms of base quality valms for all correct base calls
`(a) and all wrong base calls (11) in the Beta and Helicobacler data sets.
`
`A$e$ment ofquality values
`
`The Solexa base caller Bustard reports the quality of each
`base call by estimating a quality score similar to the phred
`score based on the image output without considering the
`reference sequence. More precisely, Bustard estimates the
`probability P of a base call to be wrong and reports
`the corresponding quality score Q = -10 log", (P/(l—P).
`Thus, a quality score Q = 40 roughly corresponds to an
`expected error probability of P = 0.01%, and Q = 0 cor-
`responds to an expected error probability of P = 50%.
`Based on uniquely matched reads reported by ELAND,
`we have determined 7201633 correct and 369113 wrong
`base calls in the Beta data set as well as 70995 154 correct
`
`and 2 694074 wrong base calls in the Helicobacter data set.
`We have extracted the corresponding quality scores from
`Bustard output files and computed observed error rates
`per quality score. Table 3 shows a comparison of the
`expected and observed error rates for the base (all score
`quality in our two data sets: Theoretical values under-
`estimate the error probability for high quality values
`and overestimate the error probability for low quality
`values.
`
`We also collected the quality values for bases reported
`by ELAND as correct separately from quality values for
`bases reported as wrong (matching reads only). Figure 9
`shows the results in separate histograms. The fraction
`of the best quality value is increased for correct base
`calls and low quality values show low fractions as
`expected. However,
`there is a substantial amount of
`high quality values for wrong base calls. Six percent
`of all wrong base walls in Helicobacter and 19% of all
`wrong base calls in Beta have Solexa quality scores
`Q = 40.
`
`
`
`e105 Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`PAGE 8 OF 10
`
`DISCUSSION
`
`We have characterized two Solexa read data sets derived
`from a bacterial genome (Helicobacter acinonychis) and
`from a Beta vulgaris BAC clone. We looked for systematic
`biases of read start positions, recorded the error positions
`and the error frequency along the read length, examined
`the distribution of reads along the reference sequences,
`investigated substitution preferences, and assessed the
`reliability of quality scores. The generalization of our
`observations may be limited by the fact that the presented
`data relates to a single Illumina 1G Analyzer. However,
`since three different flow cells, four different lanes and two
`different
`library preparations for two different
`target
`sequences are involved we assume that our consistent
`observations reflect relevant aspects of the current state
`of Solexa technology.
`To explain the observed biases, a comprehensive knowl-
`edge of the Solexa technology is necessary. The source
`DNA is fragmented randomly, and adapter molecules
`are ligated at both ends of each fragment followed by
`pre-amplification for enrichment of the material. The
`DNA fragments are melted, and the single strands are
`trapped inside the flow cell which is covered by a dense
`lawn of primers. Subsequent local amplification leads to
`the formation of clusters of approximately 1000 identical
`molecules per square micrometer. The base incorporation
`is started by adding primers, polymerase and the four
`flourophore-labeled deoxynucleotidetriphosphates. The
`dNTPs act as reversible terminators, i.e. only a single
`base is added per molecule in each cycle. The cluster fluor-
`escence is measured to identify which base has been incor-
`porated. A green laser identifies the incorporation of the
`bases G and T, and a red laser identifies the bases A and C.
`Two different filters are used to distinguish between
`G/T and A/C, respectively. After signal detection, the
`fluorophore and the terminating modification of
`the
`nucleotide are removed.
`In the context of this work we could not detect a general
`sequence bias for the immediate vicinity of read start posi-
`tions, indicating that the fragmentation step is essentially
`random. Two different methods of fragmentation were
`used but potential trends for each method were rather
`weak. However, we did observe a strong correlation
`between GC richness and read coverage, with the read den-
`sity being increased in regions of elevated GC content.
`Uneven coverage of the target genome is well known
`from Sanger sequencing, but this effect has been attributed
`to a cloning bias in the underlying plasm