throbber
Published online 26 July 2008
`
`Nucleic Acids Research, 2008, Vol. 36, No. 16 e105
`doi:10.1093/nar/gkn425
`
`Substantial biases in ultra-short read data sets from
`high-throughput DNA sequencing
`Juliane C. Dohm1, Claudio Lottaz2, Tatiana Borodina1 and Heinz Himmelbauer1,*
`
`1Max Planck Institute for Molecular Genetics, Ihnestr. 63-73, 14195 Berlin and 2Institute for Functional Genomics,
`Computational Diagnostics, University of Regensburg, Josef-Engert-Str. 9, 93053 Regensburg, Germany
`
`Received December 21, 2007; Revised June 16, 2008; Accepted June 19, 2008
`
`ABSTRACT
`
`Novel sequencing technologies permit the rapid
`production of large sequence data sets. These tech-
`nologies are likely to revolutionize genetics and bio-
`medical research, but a thorough characterization
`of the ultra-short read output is necessary. We gen-
`erated and analyzed two Illumina 1G ultra-short read
`data sets, i.e. 2.8 million 27mer reads from a Beta
`vulgaris genomic clone and 12.3 million 36mers from
`the Helicobacter acinonychis genome. We found
`that error rates range from 0.3% at the beginning
`of reads to 3.8% at the end of reads. Wrong base
`calls are frequently preceded by base G. Base sub-
`stitution error frequencies vary by 10- to 11-fold,
`with A > C transversion being among the most fre-
`quent and C > G transversions among the least fre-
`quent substitution errors. Insertions and deletions
`of single bases occur at very low rates. When simu-
`lating re-sequencing we found a 20-fold sequencing
`coverage to be sufficient to compensate errors by
`correct reads. The read coverage of the sequenced
`regions is biased; the highest read density was
`found in intervals with elevated GC content. High
`Solexa quality scores are over-optimistic and low
`scores underestimate the data quality. Our results
`show different
`types of biases and ways to
`detect
`them. Such biases have implications on
`the use and interpretation of Solexa data, for de
`novo sequencing, re-sequencing, the identification
`of single nucleotide polymorphisms and DNA
`methylation sites, as well as for transcriptome
`analysis.
`
`INTRODUCTION
`
`The DNA sequencing field has experienced a major boost
`with the emergence of novel sequencing technologies.
`Several systems are currently on the market, including
`Illumina’s Solexa instrument, the Applied Biosystems’
`
`Sequencing by Oligonucleotide Ligation and Detection
`(SOLiD) technology, and the GS FLX instruments from
`Roche/454 Life Sciences. The Polony cyclic sequencing by
`synthesis technology is to be launched (1).
`These technologies allow sequence determination much
`quicker and cheaper than the dideoxy chain terminator
`method presented by Sanger in 1977 (2). The main differ-
`ence between Sanger sequencing output and the output of
`the new technologies is an increased read number, asso-
`ciated with a decrease in the length of individual reads.
`To achieve high throughput, the new approaches apply
`different strategies. 454 Life Sciences has adapted pyrose-
`quencing to a microbead format to sequence 400 000
`DNA fragments simultaneously, resulting in a per-run
`dataset of 100 Mbp with reads averaging 250 bp. SOLiD
`sequencing also uses templates immobilized onto microbe-
`ads. Here, the sequence of the template DNA is decoded
`by ligation assays
`involving oligonucleotides
`labeled
`with different fluorophores. The SOLiD read length is
`currently 25–35 bases, and 2–3 Gbp of data can be col-
`lected during an 8-day run. Solexa sequencing is based on
`amplifying single molecules attached to the surface of
`a flow cell to generate clusters of identical molecules, fol-
`lowed by sequencing using fluorophore-labeled reversible
`chain terminators. Solexa sequencing proceeds a base at a
`time and read length depends on the number of sequenc-
`ing cycles. Current Illumina sequencing instrumentation
`achieves read lengths of 36 bases. The Solexa flow cell
`is composed of eight separately loadable lanes. Since
`each lane has a capacity of about 5 million reads, > 40
`million reads can be generated in a run of 3 days, equiva-
`lent to > 1.3 Gbp.
`The adoption of high-throughput sequencing will revo-
`lutionize molecular biology research, similar to the inven-
`tion of the polymerase chain reaction (PCR) twenty years
`ago (3). 454 pyrosequencing short (100 bp) reads gen-
`erated on Roche GS20 instruments (now replaced by
`GS FLX) were successfully used for the de novo sequenc-
`ing of small genomes and BACs as well as for transcript
`discovery and characterization (4–9). De novo genomic
`sequencing succeeded even when ultra-short (27–36 bp)
`reads generated by Solexa sequencing were employed for
`
`*To whom correspondence should be addressed. Tel: +49 30 8413 1354; Fax: +49 30 8413 1380; Email: himmelbauer@molgen.mpg.de
`
`ß 2008 The Author(s)
`This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
`by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
`
`
`
`Petitioner Sequenom - Ex. 1007, p. 1
`
`

`

`e105 Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`PAGE 2 OF 10
`
`a small genome (10). For the human genome, ultra-short
`reads were applied in studies on chromatin analysis
`(11,12).
`However, working with large data sets of short reads
`involves difficulties, especially due to wrong base calls. To
`exploit the full prospects of the novel technologies there is
`the need to know as much as possible about biases in the
`output data sets, especially with respect to errors. Previous
`studies focused on the 454 technology (13) or dealt with
`the prospects of short read sequencing as such (14). Here,
`we characterize two Solexa read data sets: 12.3 million
`36mer reads (trimmed to 32 bases) from the Helicobacter
`acinonychis genome and 2.8 million 27mer reads from
`a Beta vulgaris bacterial artificial chromosome (BAC)
`clone. We analyze these reads and detect biases with
`respect to error positions, error rates, erroneous base
`calls and their neighboring bases and single base insertions
`or deletions. We determine the compensation of erroneous
`base calls by correct base calls depending on the sequenc-
`ing coverage. We analyze read start positions, the read
`coverage along the target sequence, and dependencies of
`read coverage and local sequence characteristics. Finally,
`we assess the reliability of quality values for wrong and
`correct base calls.
`
`METHODS
`
`Solexa sequencing
`
`Helicobacter acinonychis. DNA was fragmented by nebu-
`lization as described in the Solexa protocol (www.illumi
`na.com). Beta vulgaris DNA was sheared for 1 h with a
`UTR200 sonication device (Hielscher Ultrasonics GmbH)
`at 100% amplitude and 0.5 cycle mode. Fragmented DNA
`was
`further processed as described previously (10).
`Sequencing was carried out by running 27 or 36 cycles,
`respectively, on the Illumina 1G sequencing instrument.
`The Goat module
`(Firecrest v.1.8.28 and Bustard
`v.1.8.28 programs) of the Solexa pipeline v.0.2.2.3 (for
`Helicobacter data set) and v.0.2.2.5 (for Beta data set)
`were used for image deconvolution and quality value cal-
`culation. Parameterization was auto-generated by the
`pipeline (see Supplementary Data for intensity plots and
`run parameters, i.e. frequency cross-talk matrix, offsets,
`phasing). Set up configuration was used as installed by
`Illumina’s technical staff. The Helicobacter data set was
`collected from three lanes of two flow cells. The Beta
`data set was generated in a single lane from a further
`flow cell.
`
`Data analysis
`
`We developed various Perl scripts to extract and process
`information from ELAND output files (Gerald module
`v.1.27 of the Solexa pipeline) and to find positions of
`reads that can be aligned more than once to the reference
`sequence without mismatches (the positions of those reads
`are not reported by ELAND). We wrote Perl scripts for
`the detection of deletions and insertions of single nucleo-
`tides in otherwise error-free reads and for the analysis of
`quality values per base call. Plots were generated with the
`statistical computing environment R (www.R-project.org)
`
`or OpenOffice Calc (www.openoffice.org). R: A language
`and environment for statistical computing. R Foundation
`for Statistical Computing, Vienna, Austria. www.R-project.
`org) or OpenOffice Calc (www.openoffice.org).
`
`Data availability
`
`Solexa read data are available from the SHARCGS
`project website at http://sharcgs.molgen.mpg.de.
`
`RESULTS
`
`We previously generated 12 288 791 36mer reads from
`Helicobacter acinonychis on an Illumina 1G sequencing
`device (10). The Helicobacter genome is 1.55 Mbp in size
`and has a GC content of 38%. A high-quality reference
`for Helicobacter
`sequence
`is
`available
`(GenBank
`NC_008229) (15). We ran the ELAND software on the
`read data set (trimmed by the last four bases, because
`ELAND processes the first 32 bases only) and selected
`the 8 389 548 32mer reads that ELAND reported to be
`the Helicobacter
`uniquely matched against
`reference
`sequence with zero, one or two mismatches (labeled U0,
`U1 or U2, respectively, see Figure 1b). Additionally, we
`generated a 27mer read data set for the sugar beet (Beta
`vulgaris) bacterial artificial chromosome (BAC) clone ZR-
`47B15. The data set consists of 2 788 286 reads, 2 156 266
`of which were labeled U0, U1 or U2 in the ELAND
`output (Figure 1a). The Sanger reference sequence in fin-
`ished quality of this BAC insert consists of 10 9563 bases
`with 34.85% GC (Dohm et al., manuscript submitted for
`publication). For all uniquely matched reads, ELAND
`reports the match position in the reference sequence as
`well as the error position(s) in the read.
`
`Start positions of reads and read distribution on
`the target sequence
`
`The preparation of Solexa sequencing libraries involves
`the fragmentation of the DNA, followed by the adaptor
`ligation, pre-amplification for material enrichment and
`amplification within the flow cell prior to sequencing. In
`order to detect whether the steps preceding sequencing
`show biases, we analyzed the first bases of a read and
`the bases that flank the read start position on either
`side. Of all possible 27mer tuples (Beta) and 32mer
`tuples (Helicobacter), 99.8 and 98.8% are unique, respec-
`tively. We therefore assume that potential biases are repre-
`sentative for the data set.
`We calculated the frequency of 2- to 10-base tuples
`enclosing the starting point
`for 8 389 548 uniquely
`matched Helicobacter reads and for 2 156 266 uniquely
`matched Beta reads relative to the frequency of these
`tuples in the reference sequences. Since the bases in the
`reads are subject to errors, we used for both sides the bases
`of the corresponding region in the reference sequence.
`A general sequence bias for the immediate vicinity of
`the read start position could not be deduced from the two
`data sets. The results for the Beta data set did not suggest
`any tendencies (Supplementary Figure 1a). The results
`for the reads from Helicobacter showed a weak tendency
`towards T being the most frequent base call to the left and
`
`Petitioner Sequenom - Ex. 1007, p. 2
`
`

`

`PAGE 3 OF 10
`
`Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`e105
`
`(a)
`
`88715
`
`191683
`
`I U2
`
`E] NM
`QC
`El R012
`El U0
`3 U1
`
`readsperkbp
`
`(b)
`
`647151
`
`1399772
`
`
`
`3000035000
`
`25000
`
`20000
`
`15000
`
`5000
`
`20
`
`25
`
`30
`
`35
`GC—contenl PA]
`
`40
`
`45
`
`50
`
`10000
`5000
`
`
`
`
`
`\‘
`//////////”, -_
`
`6342625
`
`512436
`
`420762
`
`DNM
`IQC
`EIH012
`Duo
`Em
`IU2
`
`Figure l. Pie charts of the read analysis with ELAND. The ELAND
`categories are: QC: no matching done because of low quality of the
`read (more than two positions with quality score
`5), NM. no match
`found; U0, unique exact match found; Ul, unique match with one
`error, U2. unique match with two errors; R0, multiple exact matdres
`found; RI, multiple matches with one error; R2, multiple matches with
`two errors. The categories R0, R1, R2 are shown as a single entity. (a)
`ELAND categorizations for 27mer reads from Bela vulgaris clone ZR
`47815 (2788 286 in total). (b) ELAND categorintions for 32mm reads
`from Helieobacter acinonychis (12 288 79! in total, trimmed by the last
`four base calls of the original 36mer data).
`
`to the right of the read start position (Supplementary
`Figure lb). Since two different fragmentation methods
`were used, sonication for Beta and nebulization for
`Helicobacter,
`the results may indicate method-inherent
`properties.
`However, by analysing sequence characteristics and
`number of reeds starting in a sliding window of 1 kbp in
`width, we found a correlation of read coverage and GC
`content in both data sets (Figure 2). In regions of elevated
`GC content
`the number of reads was increased. For
`
`instance, windows with a GC content of 40% contain
`almost twice as many reads as windows with 30% GC in
`the Beta data set. Thus, while the vicinity of 10 bp was not
`sufficient
`to detect a conclusive bias for read starting
`
`20000
`
`l5000
`
`readsperkbp l0000
`
`GC—Conlent [°/.]
`
`Figure 2. Correlation of the Solexa read coverage and GC content. (a)
`27mer reads generated from Bela vulgaris BAC ZR 47B15 (b) 32mer
`data set from the Helicobacrer acinmychis genome Each data point
`corresponds to the number of reads recorded for a lkbp window
`(shift of l00bp in Beta and lkbp in Helicobacler).
`
`there is a strong preference towards GC-rich
`points,
`regions in lkbp sliding windows. Since both templates
`show the correlation of read coverage and GC content,
`the shift to GC rich regions seems to be a general feature
`of the current pre—sequencing procedure. A similar finding
`was reported by Hillier et a1. (16).
`The overall coverage considering matching reads only is
`[65-fold in the Helicobacter data set (l85-fold for 36mer
`reads) and 465-fold in the Beta data set. The distribution
`of matching reads along the reference sequences is shown
`in Figure 3. We calculated the read depth in windows of
`
`Petitioner Sequenom - EX. 1007, p. 3
`
`Petitioner Sequenom - Ex. 1007, p. 3
`
`

`

`e105 Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`PAGE 4 OF 10
`
`Figure 3. Distribution of Solexa reads along the reference sequences considering unique match positions reported by ELAND (zero, one or two
`mismatch bases) and reads with more than one match position (no mismatch bases) detected with a Perl script. (a) Read distribution along the Beta
`vulgaris BAC sequence (with cloning vector pBeloBACII). 2 166 892 27mer reads were matched against the finished sequence (enclosed by the cloning
`vector,117 kbp in total). The read coverage was calculated in 200 consecutive 0.58 kbp windows. (b) Read distribution along the 1.55 Mbp
`Helicobacter genome, based on 8 700 113 32mer reads. The local coverage is shown in 200 consecutive windows of 7.77 kbp.
`
`size 7.77 kbp for Helicobacter (Figure 3a) and of size
`0.58 kbp for Beta (Figure 3b). The coverage varied by
`a factor of 13 and 3.8, respectively, ranging from 49- to
`652-fold for Helicobacter and from 238- to 897-fold for
`Beta (Table 1). We tested whether the distributions shown
`in Figure 3 are compatible with a uniform distribution
`of reads across the target sequences. We have applied a
`2-test (goodness of fit) to reject the hypothesis that reads
`have the same probability to fall into equally sized regions
`10 even when dividing
`of the target sequence (P < 1e
`target sequences in only five regions).
`There is a number of ‘gap’ positions in the target
`sequences where no read starts from. However, since
`there are no gaps larger than read length all positions of
`the target sequence are covered (Supplementary Table 1).
`
`Distribution of error positions along reads
`
`We selected all ELAND U1 and U2 reads, i.e. 28 0173
`Beta
`and 2 046 923 Helicobacter
`reads
`reads
`(cf.
`Figure 1), to analyze the occurrence of errors per position.
`We performed two types of calculations. Firstly, we calcu-
`lated the fraction of wrong base calls at each read position
`considering wrong base calls only. Secondly, we calculated
`per-base error rates, i.e. the fraction of wrong base calls
`per position considering all base calls. The result is shown
`in Figure 4. The number of occurrences of wrong bases is
`increased at the first position. Rising from the lowest error
`rate at the second position, the highest error rate is
`observed at the last positions of the read [similar observa-
`tion reported in (16)]: 2.5 and 2.9% of the errors in the
`data sets of Beta and Helicobacter, respectively, were
`found at read position 1, and 11.8% of errors were
`recorded at the last read position (position 27 in the
`Beta data set and position 32 in the Helicobacter data
`set, Figure 4a). The per-base error rates range from
`0.3% to 3.8% (Figure 4b) resulting in an average error
`
`Table 1. Proportion of reference sequence and coverage ranges (based
`on ELAND U0, U1, U2, R0 matched reads and reads with single
`indels)
`
`Beta
`
`Helicobacter
`
`Coverage
`
`BAC (%)
`
`Coverage
`
`Genome (%)
`
`200 300
`300 400
`400 500
`500 600
`600 700
`700 800
`800 900
`
`4.27
`23.93
`25.64
`23.93
`12.82
`4.27
`5.13
`
`<100
`100 150
`150 200
`200 250
`250 300
`300 350
`>350
`
`3.53
`26.06
`42.28
`21.49
`4.44
`1.29
`0.90
`
`rate of 0.6% for the Beta data set and 1.0% for
`the Helicobacter data set. Note that only uniquely
`matched reads with less than three substitution errors
`are considered.
`In re-sequencing projects, sequencing errors can be
`compensated by high-coverage sequencing. In a re-sequen-
`cing project, the reads are aligned against a reference
`sequence. Wherever a mismatch between sequencing
`data and the reference is observed, a polymorphism is
`postulated. In order to avoid spurious detection of poly-
`morphisms due to sequencing errors, a consensus between
`several reads at each position of the reference is common
`practice. Here, we simulate re-sequencing at different
`depth by randomly choosing the appropriate number of
`reads from our two data sets and counting wrong and
`correct base calls [five (Helicobacter) or ten (Beta) simula-
`tions per data point]. An error was considered as compen-
`sated when at least one correct base call for the same
`position existed. A correct base call and the reference
`sequence hold the majority over one wrong base call, i.e.
`x wrong base calls at the same position can be compen-
`sated by x correct base calls (plus reference sequence).
`
`Petitioner Sequenom - Ex. 1007, p. 4
`
`

`

`PAGE 5 OF 10
`
`
`
`
`
`
`
`
`
`Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`e105
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Mandi-onus
`
`Figure 6. Distance between two errors on a read in the Helicobacrer
`and Beta vulgari: data sets ‘0’ indicates that the erroneous base calls
`are next to each other.
`
` a a rorr12131415151119Iomzrmaaarzszszraazom
`
`
`We plotted the dependency of sequencing coverage and
`error compensation in Figure 5 (range of simulation
`results:
`see Supplementary Figure 2).
`Increasing the
`sequencing coverage results in a rapid decrease of uncom-
`pensated errors. At a coverage of 20—fold the average
`number of errors per kilo base pair is close to zero and
`does not decrease any further. However, such estimates
`are likely to change with improvements of the sequencing
`technology, as less coverage will be sufficient for reduced
`error rates.
`
`Analysis of reads containing two errors
`
`ELAND reported 88 753 reads containing two errors
`in the Beta data set, corresponding to 4.1% of all uniquely
`matched raids. In Helicobacter, 647151 reads contained
`two errors
`(7.7% of all uniquely matched reads).
`We analyzed the distance between erroneous bases and
`found a preference for small distances between errors
`(Figure 6). In 25% of reads that contained two errors
`the erroneous bases were either at adjacent positions or
`separated by one base. This observation does not contra-
`dict
`the assumption that errors occur independently
`according to their position-specific probability. The heat-
`map in Supplementary Figure 3 illustrates the occurrence
`of two errors relative to the positions in the read. As
`expected from the per-base error rates, two-error occur-
`rences are concentrated at the 3’ end of reads and are
`
`therefore close together. In addition, error pairs also
`occur with increased frequency at read positions 1 and
`2. We provide even stronger evidence for the independence
`of error positions in two-error reads in Supplementary
`Figure 4.
`Although error positions seem to be independent in
`reads with two errors, there is evidence that errors accu-
`mulate in reads more easily than expected. We deduce this
`from the ratios of the observed and expected number of
`reads containing one and two errors
`respectively:
`Given the determined error rates per position (for the
`Helicobacrer data set) we expect 3.5 times more correct
`reads (UO) than reads with one error (U1), but we observe
`4.5 times more U0 than U1; we expect 19.8 times more
`correct reads than reads with two errors (UZ), but we
`observe 9.8 times more U0 than U2. Thus,
`there are
`
`Petitioner Sequenom - EX. 1007, p. 5
`
`
`flli'fiTTT'li'li'lTlI'li'l'lTli'll‘lllllllllIl Ill I‘I ||
`
`
`12 S 4 5 6 7 I 9 “HI1213“1515171810m21225248327
`P ..
`
`||||
`
`Figure 4. Frequency of wrong base calls in Solexa reatk depending on
`the position along the read (27mer reark from Beta vulgarr': and 32mer
`reads from Helicobacrer).
`(an) Error frequency per position calculated
`from considering wrong base calls only. The highest error frequency is
`observed at the read 3’ end.
`(1)) Per base error rates (overall error
`frequency per position considering all base calls).
`
`Helicobacter
`Beta vulgaris
`
`H-‘""“"-.-
`
`Errorsperkbp
`
`Coverage
`
`Figure 5. Compensation of sequencing errors by deep sequencing
`in resequencing projects. The average number of errors per kbp is
`shown for different levels of coverage. For coverages below 2. reads
`are unlikely to overlap and compensation of sequencing errors is rare
`(thus. sequencing errors accumulate when the coverage is increased).
`For coverages above 3 fold the number of uncompensated errors drops
`rapidly with the increase of coverage.
`
`Petitioner Sequenom - Ex. 1007, p. 5
`
`

`

`e105 Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`PAGE 6 OF 10
`
`fewer U1 reads than expected and more U2 reads than
`expected compared to the U0 reads. This tendency is con-
`firmed in the Beta data set (data not shown) and suggests
`dependencies in the occurrence of errors.
`
`Analysis of error base sequence context
`
`In order to find sequence composition preferences close to
`wrong base calls, we analyzed the sequence tuples flanking
`error positions. Since errors at position 1 do not have
`preceding bases and errors at the last position do not
`have subsequent bases in the read, we used the corre-
`sponding segment of the reference sequence for the analy-
`sis. This also avoids analysing wrong base calls in the
`error-prone read sequences close to the error position
`under consideration. The sequence composition before
`the read start is not considered to be responsible for
`an error at position 1 because this part of the source
`sequence
`is not part of
`the
`sequenced fragment.
`However, the bases following the end of the read could
`have an influence on the base calling. We decided to treat
`all error positions in the same manner by looking up the
`flanking bases in the reference sequence. As reference
`tuples we did not consider all tuples in the reference
`sequence but all tuples in all uniquely matched reads
`(taken from the reference sequence and adding 5 bases
`before and after the corresponding read segment). This
`is to keep the analysis clean from the read coverage bias
`towards GC-rich regions of the reference sequence. We
`calculated the relative frequencies for 3-
`to 11-base
`tuples enclosing the error at the middle position and gen-
`erated sequence logos for Beta and Helicobacter sepa-
`rately. To visualize the general trend we show the 3- and
`
`5-base tuple results in scatterplots with the tuple frequen-
`cies for both data sets (Figure 7 and Supplementary
`Figure 5). All 3-base tuples starting with a G are clearly
`dominant in both data sets with G-error-G and G-error-A
`being the top candidates (Figure 7). Error enclosing tuples
`starting with A or T are underrepresented, and error
`enclosing tuples starting with C are as frequent as in the
`reference tuples. The least frequent base after an error is
`T, being the third base in the three least frequent tuples
`A-error-T, T-error-T and C-error-T. The trend of G being
`the most frequent base before an error is preserved
`and even more emphasized in the scatterplot with 5-base
`tuples (Supplementary Figure 5). Here, Gs are still the
`preferred bases before an error, and least frequently we
`see errors enclosed by Ts. In 35 and 32% of cases (Beta
`and Helicobacter, respectively), the error position was
`preceded by G.
`
`Analysis of base substitution errors in Solexa reads
`
`Twelve substitution errors (eight transversions and four
`transitions) are possible during a base call. We compared
`the wrong base calls in the reads to the base in the refer-
`ence sequence and found that base substitution errors in
`Solexa reads are not equally frequent. Generally, the two
`data sets show similar tendencies (Figure 8, Table 2). The
`most frequent base to be changed into is a C, preferen-
`tially substituting T or A in the Beta data set and A in the
`Helicobacter data set (T in Helicobacter as well but at a
`lower frequency). Consistently for both data sets, C > G
`transversions are the least frequent substitution errors.
`The top three types of substitution errors account for
`>53% of
`all
`substitution
`errors
`found
`in
`the
`(the transversions A > C,
`Helicobacter read data set
`G > T and A > T) and for >42% of all substitution
`errors found in the Beta data set (the transition T > C
`and the transversions A > C and C > A).
`
`Figure 7. Sequence context of wrong base calls in Solexa reads from
`Helicobacter acinonychis and Beta vulgaris, considering one base
`upstream and downstream of the wrong base calls. An ‘e’ indicates
`the substituted base. The scatterplot shows the correlation of the rela
`tive frequencies (relating the frequency of 3 tuples at error positions to
`the frequency of all 3 tuples in the reads) for the two data sets.
`
`Figure 8. Frequency of substitution errors in the Helicobacter acinony
`chis and Beta vulgaris Solexa read data sets.
`
`Petitioner Sequenom - Ex. 1007, p. 6
`
`

`

`PAGE 70F 10
`
`Nucleic Acids Research, 2008, Vol. 36, N0. 16
`
`e105
`
`Table 2. Base substitution frequencies in the Beta and Helicobacter
`read data sets
`
`Into
`
`A
`
`Beta
`A
`C
`C
`T
`Any
`Helicobacter
`A
`C
`C
`T
`Any
`
`0.14
`0.05
`0.05
`0.25
`
`0.25
`0.05
`0.10
`0.41
`
`C
`
`0.13
`
`0.02
`0.04
`0.19
`
`0.07
`
`0.02
`0.06
`0.15
`
`From
`
`G
`
`0.04
`0.08
`
`0.12
`0.24
`
`0.02
`0.04
`
`0. 18
`0.24
`
`T
`
`0.08
`0.15
`0.09
`
`0. 33
`
`0.04
`0. 10
`0.06
`
`0.20
`
`Any
`
`0.25
`0.38
`0. 16
`0.21
`
`0.14
`0.39
`0. 14
`0. 34
`
`(b) 0200.10
`0.10
`0.14
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Table 3. Observed and expected error rates for base calls of different
`quality values in the Beta and Helicobacter data sets
`
`Score
`
`Q 40
`Q 30
`Q 20
`Q
`10
`Q 0
`
`Bela (°/o)
`
`Helicobacler (°/.)
`
`Expected (%)
`
`1.39
`3.55
`5.2l
`9.68
`39.65
`
`0.43
`1.06
`1.70
`4.40
`28.68
`
`0.01
`0.10
`0.99
`9.09
`50.00
`
`Insertions and deletions in the Solexa read data sets
`
`The ELAND algorithm is limited to the alignment of reads
`containing up to two substitution errors. In addition to the
`reads matched by ELAN D to the reference sequence there
`is a substantial amount of unmatched reads (Figure 1).
`Some of the unmatched reads contain more than two
`
`sequencing errors, but another reason for unmatched
`reads may be the occurrence of insertions and deletions
`(indels). We implemented a Perl script to find single nucleo-
`tide indels within reads without considering additional sub-
`stitution errors. We observed a very low rate (<0.01%) of
`indel errors: 323 of 2.8 million Beta reads contain a single
`nucleotide insertion and 1258 Beta reads contain a single
`nucleotide deletion; 1215 and 2284 insertion and deletion
`errors, respectively, were found in the Helicobacter data set.
`Further inspection of the data revealed that >25% of
`base insertions occurred in homopolymer tracts of four
`or more nucleotides. However, no clear trend could be
`detected for deletions. With respect
`to the positions
`within reads where indels occur there is a slight accumula-
`tion of such events at internal positions of the reads. No
`bias for inserted or deleted bases could be detected (data
`not shown). The reason for detecting this type of error
`might be sequencing errors in the reference sequences.
`For Helicobacter, two erroneous insertions in the Sanger
`sequence were reported (10). Approximately 10% of
`lllumina reads with one deletion match to these two
`
`positions.
`
`44444012:1ss1IInnunumununnuzaaaanaammnaasxarnn
`Score
`
`
`
`
`
`
`
`
`
`Hr
`Illlllllllll
`4444401231601IIouuuuswanwmaazaznaaanmzzuauulaw
`Scale
`
`E 0.12
`3 (110
`1E 0.00
`0.00
`0.04
`0.02 —
`0.00 —
`
`Figure 9. Histograms of base quality valms for all correct base calls
`(a) and all wrong base calls (11) in the Beta and Helicobacler data sets.
`
`A$e$ment ofquality values
`
`The Solexa base caller Bustard reports the quality of each
`base call by estimating a quality score similar to the phred
`score based on the image output without considering the
`reference sequence. More precisely, Bustard estimates the
`probability P of a base call to be wrong and reports
`the corresponding quality score Q = -10 log", (P/(l—P).
`Thus, a quality score Q = 40 roughly corresponds to an
`expected error probability of P = 0.01%, and Q = 0 cor-
`responds to an expected error probability of P = 50%.
`Based on uniquely matched reads reported by ELAND,
`we have determined 7201633 correct and 369113 wrong
`base calls in the Beta data set as well as 70995 154 correct
`
`and 2 694074 wrong base wlls in the Helicobacter data set.
`We have extracted the corresponding quality scores from
`Bustard output files and computed observed error rates
`per quality score. Table 3 shows a comparison of the
`expected and observed error rates for the base (all score
`quality in our two data sets: Theoretical values under-
`estimate the error probability for high quality values
`and overestimate the error probability for low quality
`values.
`
`We also collected the quality values for bases reported
`by ELAND as correct separately from quality values for
`bases reported as wrong (matching reads only). Figure 9
`shows the results in separate histograms. The fraction
`of the best quality value is increased for correct base
`calls and low quality values show low fractions as
`expected. However,
`there is a substantial amount of
`high quality values for wrong base calls. Six percent
`of all wrong base walls in Helicobacter and 19% of all
`wrong base calls in Beta have Solexa quality scores
`Q = 40.
`
`Petitioner Sequenom - EX. 1007, p. 7
`
`Petitioner Sequenom - Ex. 1007, p. 7
`
`

`

`e105 Nucleic Acids Research, 2008, Vol. 36, No. 16
`
`PAGE 8 OF 10
`
`DISCUSSION
`
`We have characterized two Solexa read data sets derived
`from a bacterial genome (Helicobacter acinonychis) and
`from a Beta vulgaris BAC clone. We looked for systematic
`biases of read start positions, recorded the error positions
`and the error frequency along the read length, examined
`the distribution of reads along the reference sequences,
`investigated substitution preferences, and assessed the
`reliability of quality scores. The generalization of our
`observations may be limited by the fact that the presented
`data relates to a single Illumina 1G Analyzer. However,
`since three different flow cells, four different lanes and two
`different
`library preparations for two different
`target
`sequences are involved we assume that our consistent
`observations reflect relevant aspects of the current state
`of Solexa technology.
`To explain the observed biases, a comprehensive knowl-
`edge of the Solexa technology is necessary. The source
`DNA is fragmented randomly, and adapter molecules
`are ligated at both ends of each fragment followed by
`pre-amplification for enrichment of the material. The
`DNA fragments are melted, and the single strands are
`trapped inside the flow cell which is covered by a dense
`lawn of primers. Subsequent local amplification leads to
`the formation of clusters of approximately 1000 identical
`molecules per square micrometer. The base incorporation
`is started by adding primers, polymerase and the four
`flourophore-labeled deoxynucleotidetriphosphates. The
`dNTPs act as reversible terminators, i.e. only a single
`base is added per molecule in each cycle. The cluster fluor-
`escence is measured to identify which base has been incor-
`porated. A green laser identifies the incorporation of the
`bases G and T, and a red laser identifies the bases A and C.
`Two different filters are used to distinguish between
`G/T and A/C, respectively. After signal detection, the
`fluorophore and the terminating modification of
`the
`nucleotide are removed.
`In the context of this work we could not detect a general
`sequence bias for the immediate vicinity of read start posi-
`tions, indicating that the fragmentation step is essentially
`random. Two different

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket