`http://www.biomedcentral.com/1471-2164/12/382
`
CORRESPONDENCE
`Open Access
`Addressing challenges in the production and
analysis of Illumina sequencing data
`Martin Kircher1, Patricia Heyn2 and Janet Kelso1*
`
`Abstract
`Advances in DNA sequencing technologies have made it possible to generate large amounts of sequence data
`very rapidly and at substantially lower cost than capillary sequencing. These new technologies have specific
`characteristics and limitations that require either consideration during project design, or which must be addressed
`during data analysis. Specialist skills, both at the laboratory and the computational stages of project design and
`analysis, are crucial to the generation of high quality data from these new platforms. The Illumina sequencers
`(including the Genome Analyzers I/II/IIe/IIx and the new HiScan and HiSeq) represent a widely used platform
`providing parallel readout of several hundred million immobilized sequences using fluorescent-dye reversible-
`terminator chemistry. Sequencing library quality, sample handling, instrument settings and sequencing chemistry
`have a strong impact on sequencing run quality. The presence of adapter chimeras and adapter sequences at the
`end of short-insert molecules, as well as increased error rates and short read lengths complicate many
`computational analyses. We discuss here some of the factors that influence the frequency and severity of these
`problems and provide solutions for circumventing these. Further, we present a set of general principles for good
`analysis practice that enable problems with sequencing runs to be identified and dealt with.
`
`Background
Recent advances in DNA sequencing have changed the
field of genomics, making it possible to generate gigabases
of genome and transcriptome sequence data at
substantially lower cost than was possible just ten years
ago (http://www.genome.gov/sequencingcosts/). The relative
affordability of these high-throughput sequencers
`and the potential to generate large amounts of sequence
`data at lower cost means that scientists outside of tradi-
`tional sequencing facilities are now faced with the chal-
`lenges associated with design of large-scale projects and
`analysis of the data generated. This poses significant
`challenges for many groups since the inherent limita-
`tions of these platforms, and particular artifacts asso-
`ciated with sequences generated on these platforms,
`need to be understood and dealt with at various stages
`of the project including planning, sample preparation,
`run processing and downstream analyses.

* Correspondence: kelso@eva.mpg.de
1Max Planck Institute for Evolutionary Anthropology, Department of
Evolutionary Genetics, Deutscher Platz 6, 04103 Leipzig, Germany
Full list of author information is available at the end of the article

We present here an analysis of challenges encountered
in using the Illumina sequencing instruments. A
thorough description of the Solexa/Illumina sequencing
technology as well as a comparison to other platforms is
available elsewhere [1-6]. We revisit here only the
aspects relevant for project design and data analysis.
`Figure 1 shows the steps from the DNA sample prepara-
`tion to sequence read outs with quality scores. Indepen-
`dent of the actual application, Solexa/Illumina sequencing
`requires that the molecules to be determined are con-
`verted into special sequencing libraries. This is achieved
by adding specific adapter sequences on both ends of fragmented
DNA molecules, allowing molecules to be amplified,
immobilized and primed for sequencing. For
`sequencing, typically double-stranded DNA libraries are
`provided and melted using sodium hydroxide to obtain
`single stranded molecules. These are then immobilized
`(hybridization and amplification on a solid phase; bridge
`amplification [1,7]) in one or more channels of an 8-chan-
`nel flow cell. The immobilization and subsequent bridge-
`amplification creates randomly scattered clusters, consist-
`ing of more than one thousand copies of the original
`sequence in very close proximity to each other. After the
`bridge amplification step, cluster molecules are largely
`double stranded. One of the strands then has to be
`removed to obtain single stranded, identically oriented
`
`© 2011 Kircher et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
`Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
`any medium, provided the original work is properly cited.
`
`Kircher et al. BMC Genomics 2011, 12:382
`
[Figure 1 panels: Sample; (a) Library preparation; Flowcell preparation; (b) Immobilization and solid phase amplification; (c) Linearization, 3' blocking, sequence primer hybridization; (d) Reversible terminator chemistry and read out of incorporated dyes (filters A, C, G and T); (e) Image registration and intensity extraction; (f) Base calling and quality scoring. Cycle labels: Cycle 1, Cycle 2, Cycle 3. Example sequence record shown in the figure:]
@SOLEXA-GA03_0001_PEi_SG:5:1:1033:5267
AGACAGACACAGAGNAAGACCCAGTCCGCCACACAGGCAAACTCA
+SOLEXA-GA03_0001_PEi_SG:5:1:1033:5267
4--'-(/.23/044!51/+//.400/-/1-62/.6021834///6

Figure 1 Illumina sample preparation and sequencing. Illumina sequencing requires that a DNA sample (a) is converted into special
sequencing libraries. This can be achieved by shearing DNA to a designated size and adding specific adapter sequences on both ends of the
DNA molecules (b). These adapters allow molecules to be amplified and immobilized in one or more channels of an 8-channel flow cell (c).
Immobilization and solid-phase amplification create randomly scattered clusters, consisting of a few thousand copies of the original molecule in
very close proximity to each other. One of the DNA strands is removed to obtain single stranded, identically oriented copies, 3’ ends of the DNA
are blocked and a sequencing primer is hybridized on the adapter sequences. Afterwards, the reversible terminator chemistry is performed (d).
Here, four differently labeled nucleotides are provided and used for extension of the primers by DNA polymerases. The polymerase reaction
terminates after the first base incorporation since the nucleotides used are not only labeled, but also 3’-blocked. After washing away free
nucleotides, the nucleotides incorporated are read out by piece-wise imaging of the flow cell. Then, the terminator and fluorophore are removed
and another incorporation cycle is started. The four images are overlaid (registered) and light intensities extracted for each cluster and cycle using
a cluster position template obtained from the first instrument cycles (e). Resulting intensity files serve as input for base calling, the conversion of
intensity values into bases and quality scores (f).
`
`copies of the starting molecule. This is achieved by selec-
`tive cleavage of base modifications of oligonucleotides on
`the flowcell. After the free 3’ ends of the DNA have been
`blocked, the copies can be sequenced by hybridizing a
`sequencing primer onto the adapter sequences and start-
`ing the reversible terminator chemistry. Here, four differ-
`ently labeled nucleotides are provided and used for
`extension of the sequencing primers by DNA polymerases.
`The DNA polymerase reaction terminates after the first
`base incorporation since the nucleotides used are not only
labeled, but also 3’-blocked (i.e. they carry a terminator
group at the third carbon atom of the sugar, which prevents
further extension). After free nucleotides are washed
away, the incorporated nucleotides are read by capturing
the light signal of the fluorophore labels after laser
excitation. The flow cell is imaged in so-called tiles,
which are the units in which images are acquired and
data are processed. The terminator and fluorophore
are then removed and another incorporation cycle is
started [1].
`Initially, the number of sequencing cycles, and thereby
`the length of the sequence reads, was limited to 26
`cycles because of steeply increasing sequencing error.
`Between 2008 and 2010 there were several technical
`updates to the Genome Analyzer (GA) platform includ-
`ing improvements in mechanics, chemistry and software.
`Even though sequencing error still increases with each
`cycle, up to 150 sequencing cycles are currently per-
`formed with reasonable error profiles (average error
`below 1%, and up to 10% in the final cycles). Further,
`flow cell cluster densities were increased from 5-12 mil-
`lion clusters to about 35-60 million clusters per lane
(and twice that for HiSeq instruments, where clusters
on the top and bottom of the flow cell are read). A
technical update made sequencing of the reverse strand
of each molecule possible. This “paired-end
sequencing” approach doubles the amount of sequence
data generated. A known insert size, and thereby a known
distance separating the paired reads obtained, provides
additional information for later assembly or mapping [8]. This
`technical update of the Genome Analyzer in 2008, the
`Paired End (PE) module, also allowed the hybridization
of further sequencing primers in the same strand orientation,
making it possible to sequence a sample index (i.e.
barcode) as part of the ligated adapter [9,10]. Such an
`index read allows for multiple samples to be sequenced
`in one lane (multiplexing). These can later be computa-
`tionally separated based on the sequence of their index.
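The computational separation of multiplexed samples can be illustrated with a minimal Python sketch. It is not the procedure of any particular pipeline: the barcode sequences, the sample names and the one-mismatch tolerance are hypothetical choices made for illustration, and reads whose index is too distant from every barcode, or equally close to two barcodes, are left unassigned.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, barcodes, max_mismatches=1):
    """Return the sample whose barcode is uniquely closest to index_read.

    Returns None when no barcode is within max_mismatches or when the
    best match is tied between two samples.
    """
    hits = sorted((hamming(index_read, bc), name) for name, bc in barcodes.items())
    best_distance, best_name = hits[0]
    if best_distance > max_mismatches:
        return None
    if len(hits) > 1 and hits[1][0] == best_distance:
        return None
    return best_name


def demultiplex(reads, barcodes, max_mismatches=1):
    """Group (sequence, index_read) pairs by assigned sample name."""
    bins = defaultdict(list)
    for sequence, index_read in reads:
        bins[assign_sample(index_read, barcodes, max_mismatches)].append(sequence)
    return dict(bins)


# Hypothetical 7nt indexes and toy reads, for illustration only.
BARCODES = {"sample_A": "ACGTACG", "sample_B": "TGCATGC"}
reads = [("AAAA", "ACGTACG"), ("CCCC", "TGCATGA"), ("GGGG", "NNNNNNN")]
print(demultiplex(reads, BARCODES))
```

Reads collected under the key None could not be assigned to a sample and would typically be excluded from sample-specific analyses.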
During progression of the sequencing run, or when
images for all cycles have been collected (depending on
the setup and version), the four images captured per tile
are overlaid (registered) and light intensities are
extracted for each cluster and cycle [1]. To do this, a
cluster position template is aligned with the images of
all cycles, and the intensities in the four images, minus
the surrounding background, are extracted for each cluster.
`Resulting intensity files serve as input for base calling -
`the conversion of intensity values into bases. Base call-
`ing on the Illumina platform is complicated by at least
`two effects: (1) a strong correlation of the A and C
`intensities as well as of the G and T intensities due to
`similar emission spectra of the fluorophores used and
`their limited separation by optical filters, and (2) depen-
`dence of the signal for a specific cycle on the signal of
`the cycles before and after, known as phasing and pre-
`phasing, respectively. Phasing and pre-phasing describe
`the loss of synchrony in the readout of the sequence
`copies of a cluster. Phasing is caused by incomplete
`removal of the 3’ terminators and fluorophores as well
`as sequences in the cluster missing an incorporation
`cycle. Pre-phasing is caused by the incorporation of
`nucleotides without effective 3’-blocking. The proportion
of sequences in each cluster which are affected by phasing
and pre-phasing increases with cycle number,
hampering correct base identification [11-14].
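The loss of synchrony described above can be made concrete with a small Python sketch. It assumes a deliberately simplified model that is not taken from any base caller: in every cycle each strand independently falls one base behind with a fixed phasing probability, runs one base ahead with a fixed pre-phasing probability, or extends by exactly one base otherwise. The default rates are hypothetical values chosen for illustration.

```python
def in_phase_fraction(n_cycles, phasing=0.005, prephasing=0.003):
    """Fraction of strands in a cluster that are in sync after each cycle.

    offsets maps the lag (negative) or lead (positive) of a strand, in
    bases relative to the expected position, to the fraction of strands
    that currently have this offset.
    """
    if phasing < 0 or prephasing < 0 or phasing + prephasing > 1:
        raise ValueError("rates must be non-negative and sum to at most 1")
    offsets = {0: 1.0}
    fractions = []
    steps = ((-1, phasing), (0, 1.0 - phasing - prephasing), (1, prephasing))
    for _ in range(n_cycles):
        updated = {}
        for offset, weight in offsets.items():
            for shift, probability in steps:
                key = offset + shift
                updated[key] = updated.get(key, 0.0) + weight * probability
        offsets = updated
        fractions.append(offsets.get(0, 0.0))
    return fractions


if __name__ == "__main__":
    fractions = in_phase_fraction(100)
    for cycle in (1, 26, 50, 100):
        print(cycle, round(fractions[cycle - 1], 3))
```

Under this model the in-phase fraction shrinks with every cycle, so the signal of later cycles is increasingly mixed with the signal of neighbouring positions.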
`
`From this whole process, the Illumina user typically
`obtains sequences and per base quality scores. The set
`of sequences for each lane is usually quality filtered and
`the user gets a summary report for judging run quality.
`Finally, the Illumina CASAVA package provides addi-
`tional tools and an interface to the visualization routines
`in Illumina’s Genome Studio. Different commercial as
`well as free programs are available that replace some
`parts of the processing such as image analysis [15], base
`calling [11-15], quality assessment (e.g. TileQC [16] or
`FastQC [17]), mapping [18-21], as well as downstream
`data analysis and processing [8,22-25]. There is a large
`community of users and developers for the Illumina
`platform; the http://seqanswers.com website is an excel-
`lent resource when starting to explore the variety of
`programs available for analyzing the data generated.
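As an illustration of the sequences and per base quality scores a user obtains, the following Python sketch decodes the quality string of the example record shown in Figure 1 into Phred-scaled scores and error probabilities. The ASCII offset of 33 is an assumption suggested by the '!' character in that record; other pipeline versions encode qualities with a different offset (for example 64), so the offset should be checked for each data set.

```python
def parse_fastq_record(lines):
    """Split four FASTQ lines into (name, sequence, quality string)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string into integer Phred scores (offset assumed)."""
    scores = [ord(character) - offset for character in quality]
    if scores and min(scores) < 0:
        raise ValueError("quality character below the assumed offset")
    return scores


def error_probability(score):
    """Phred definition: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# Example record as printed in Figure 1.
RECORD = [
    "@SOLEXA-GA03_0001_PEi_SG:5:1:1033:5267",
    "AGACAGACACAGAGNAAGACCCAGTCCGCCACACAGGCAAACTCA",
    "+SOLEXA-GA03_0001_PEi_SG:5:1:1033:5267",
    "4--'-(/.23/044!51/+//.400/-/1-62/.6021834///6",
]

name, sequence, quality = parse_fastq_record(RECORD)
scores = phred_scores(quality)
mean_error = sum(error_probability(q) for q in scores) / len(scores)
print(name, len(sequence), min(scores), max(scores), round(mean_error, 4))
```

Under the assumed offset, the uncalled base N in this record carries a score of 0, i.e. no confidence in the call.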
`
`Results
`We present each stage of a sequencing project from the
`generation of sequencing libraries to the base-calling of
`the sequences. For each step we discuss the potential
`effects on data analysis and final data quality. Where
`possible we offer suggestions and guidelines on how to
`avoid specific artifacts that arise during sequencing.
`
`Sequencing libraries, minimum insert size and adapter
`artifacts
`The most important requirement for a DNA library to
`be sequenced on the Illumina platform is the presence
`of specific outer adapter sequences, complementary to
`the oligonucleotides on the flow cell used for cluster
`generation, the so-called “grafting sequences”. As differ-
`ent sequencing primers can be used (see below), the rest
`of the library design is very flexible and various library
`preparation protocols with partially distinct adapter
`sequences are used for specific applications. Library
`adapters can be added by single strand ligation (e.g. Illu-
`mina small RNA protocol), double strand blunt-end
`ligation (e.g. for a multiplex protocol [10]), double
`strand overhang ligation (e.g. A-overhang for Illumina
`genomic library protocols, and restriction enzyme over-
`hangs in the Illumina DGE protocols), or by extension
from overhanging primers (e.g. multiplex PCR or molecular
inversion probes [26,27]). Each of these
approaches has a different susceptibility to the creation
`of library adapter dimers, chimeric sequences and other
`library artifacts. Each therefore requires a different
`approach to enrich for only those molecules with cor-
`rectly added adapters, and to remove short/no insert
`molecules and molecules which are too long (> 800nt)
`from the library before sequencing. While short-insert
`molecules will, as described below, directly impact data
`analysis, longer molecules will perform differently in
`flow cell generation and generate more wide-spread and
`
`less dense clusters. If not accounted for by modified
`cluster generation protocols, these will result in lower
`quality reads.
`Failure to perform an enrichment during library pre-
`paration has two potential effects: (i) the artifact
`sequences may have a negative impact on the image
`analysis and base-calling which are both challenged by
`an overrepresentation of one sequence population (see
`below) and (ii) sequencing of large numbers of such
`artifacts is uneconomical and lowers the potential num-
`ber of informative sequences that can be generated per
`run. Libraries prepared from small amounts of input
`material tend to suffer from a higher fraction of library
`artifacts due to the relative abundance of adapter oligo-
`nucleotides compared to insert molecules.
It is possible to computationally post-process sequencing
data in cases where enrichment has not been
performed. Figure 2 illustrates, for the Illumina NlaIII
DGE protocol (a protocol for digital gene expression tag
profiling), that adapter chimeras might be created which
are of comparable length to the targeted library molecules
and thus may not be removed by selecting a specific
library insert-size (e.g. by gel length selection, silica
column purification or Solid Phase Reversible Immobilization
(SPRI) purification [28]). In this case, a program
`like TagDust [29] can be used with the original adapter
`and primer oligonucleotide sequences to identify such
artifacts in a library (Figure 2B). This program can
either be used to remove these sequences directly or, for a
representative lane, its results can be clustered and the
most frequent artifact sequences used with other software tools.
Inappropriate size selection during library preparation
may also complicate analysis due to partial sequencing
of the adapter at the sequence ends. Thus, when
`
[Figure 2, panel A: Digital gene expression / SAGE experiment. Steps shown: mRNA enrichment with polyT beads (NlaIII recognition site, 17nt sequence, Poly-A-Tail); cDNA 1st and 2nd strand synthesis; 1st digestion (NlaIII); ligation of GEX adapter 1; 2nd digestion (MmeI recognition site); ligation of GEX adapter 2 (sequencing primer site).]
[Figure 2, panel B: Most frequent TagDust filtered sequences, listed as count and sequence, together with the oligonucleotide sequences shown alongside them.]

GEX Adapter 2.1: TCGTATGCCGTCTTCTGCTTG
GEX Adapter 1.2: TCGGACTGTAGAACTCTGAAC
35191 GTATGCCGTCTTCTGCTT
4733 AA-TCGTATGCCGTCTTCT
3109 TCGGAC-TCGTATGCCGTC
2963 G-TCGTATGCCGTCTTCTG
2875 TCGGACTGTAGAATCGTA
2818 ATGGC-TCGTATGCCGTCT
2339 AGGAG-TCGTATGCCGTCT
2307 TC-TCGTATGCCGTCTTCT
2108 TCGGACTGTAGAACTCTT
1936 A-TCGTATGCCGTCTTCTG
1886 AGGAGT-TCGTATGCCGTC
1880 CGTATGCCGTCTTCTGCT
1667 GTATGCCGTCTTCTTCTT
1527 AG-TCGTATGCCGTCTTCT
1509 CCAG-TCGTATGCCGTCTT
1366 GTGA-TCGTATGCCGTCTT
1238 GTG-TCGTATGCCGTCTTC
1209 TCGGACTGTAGA-TCGTAT

Matches human NlaIII restriction site chr20:+1868416-1868436
1907 GTTTCAGGAGTTTATTTT

GEX Adapter 1.1: ACAGGTTCAGAGTTCTACAGTCCGACATG
1268 GCCACCCTCTACAG-CCGA

Figure 2 Adapters and adapter chimeras are a common source of sequence artifacts. Specific outer adapter sequences, complementary
to the grafting sequences on the flow cell are essentially the only requirement for sequencing a DNA library on the Genome Analyzer platform.
As different sequencing primers can be used, library design is very flexible and various protocols with partially distinct adapter sequences have
been established. The Illumina NlaIII DGE tag protocol illustrated here (a protocol for digital gene expression tag profiling) uses short adapters
which are not compatible with paired end sequencing and are added by overhang ligation (A). For this protocol the majority of adapter dimers
are removed by a gel excision step after library preparation. However, the protocol may also create adapter chimeras with a length comparable
to the targeted library molecules. The resulting chimera sequences also show the sequences required for cluster generation as well as the
necessary priming site, causing them to be sequenced together with the real DGE tags. A program like TagDust [29] can be used with the
original adapter and primer oligonucleotide sequences to identify such artifacts (B). Shown are the twenty most frequent identified artifacts from
one lane with human DGE tags, as well as the oligosequences they might be based on. One of the 20 sequences seems to be a real DGE tag
that was incorrectly identified as an artifact.
`
selecting for insert-size, it should be considered that
current experimental methods generally do not provide
precise length cutoffs. The lower cutoff selected should
therefore be well above the desired sequencing length.
For sequence reads where part of the adapter sequence
is included, the position in the sequence read at which
the adapter sequence begins has to be identified and the
read trimmed appropriately. Unfortunately, this is not
part of the standard Illumina data processing and is also
non-trivial for short adapter fragments, especially given
the increasing sequencing error at the end of reads. If
reads are not filtered for known chimeras and trimmed
for adapter sequences, these may interfere with mapping/alignment
and reads will either be incorrectly
excluded or placed incorrectly. In both cases downstream
data analysis will be affected.
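The identification of the adapter start in a single read can be sketched in Python as follows. This is a simple prefix-matching heuristic written for illustration, not the method of any specific trimming program: the minimum overlap and the tolerated mismatch rate are hypothetical parameters, the adapter is the GEX adapter 2 sequence shown in Figure 2, and the insert sequence is invented.

```python
def find_adapter_start(read, adapter, min_overlap=5, max_mismatch_rate=0.1):
    """Return the first read position at which the adapter appears to begin.

    The read suffix starting at each position is compared with the
    adapter prefix of the same length; the position is accepted when the
    number of mismatches stays within the tolerated rate. Returns None
    if no position qualifies.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(read) - start, len(adapter))
        if overlap < min_overlap:
            continue
        window = read[start:start + overlap]
        mismatches = sum(a != b for a, b in zip(window, adapter))
        if mismatches <= int(max_mismatch_rate * overlap):
            return start
    return None


def trim_adapter(read, adapter, **kwargs):
    """Remove the adapter and everything after it, if an adapter is found."""
    start = find_adapter_start(read, adapter, **kwargs)
    return read if start is None else read[:start]


ADAPTER = "TCGTATGCCGTCTTCTGCTTG"   # GEX adapter 2 as shown in Figure 2
INSERT = "ACGGTTCAAGGCTTAGCA"       # invented 18nt insert
read = INSERT + ADAPTER[:12]        # a 30-cycle read running into the adapter
print(trim_adapter(read, ADAPTER))
```

The min_overlap parameter makes the trade-off discussed above explicit: a small value removes read ends that match the adapter by chance, while a large value leaves short adapter fragments in the read.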
In order to test how Illumina’s ELAND mapper as
well as the widely used mapping program BWA [20] are
affected by adapter sequence at the read ends, we simulated
101-cycle reads of an Illumina Paired End genomic
library with 10,000 reads for every adapter start point
between 1 and 350nt and the error profile observed for
an actual run of this length (Figure 3). Considering that
`both mapping programs implement very different
`approaches (seed alignment versus semi-global align-
`ment of the whole read respectively), the performance
`of Illumina’s ELAND mapper is expected to be different
`from BWA. Since ELAND requires only a fixed seed in
`the beginning of the read (typically of 32nt length) adap-
`ters starting after this seed region should not affect
`ELAND’s mapping. Indeed, ELAND maps 98% of all
`simulated reads of at least 30nt insert size (2nt of adap-
`ter sequence being compensated by 2 mismatches being
`allowed in the seed), while BWA only reports 98% suc-
`cessful mappings for reads with an insert size of at least
`97nt. More relevant for many analyses, however, is the
`number of mappings reported to be uniquely placed and
`whether they are mapped at the correct position in the
`genome. ELAND reports a uniquely placed 20nt-insert-
`size read, but it is placed incorrectly (as are all uniquely
`placed reads reported up to an insert size of 67nt).
`BWA reports the first three uniquely placed fragments
`(mapping quality above 20) for an insert size of 83nt (2
`of them are correctly placed). If we require that 98% of
`the reads are correctly placed, ELAND achieves this for
`insert sizes of 83nt and above (14nt of adapter), while
`BWA can only compensate with mismatches for 4nt of
adapter sequence (97nt insert size). However, BWA
produces a lower total number of false positive placements
due to the inclusion of adapter sequence (6308 vs.
8490 for ELAND). Moreover, for an insert size of at least
the read length, BWA reports 99.999% of uniquely placed reads
(94.2% of all reported alignments) at the correct genomic
positions, while ELAND only reports 98.757% of
the uniquely placed reads (83.8% of all reported alignments)
at the correct genome coordinates. BWA therefore
provides a more accurate mapping of these reads
for downstream analysis.
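The quantities compared above (fraction of reads reported mapped, fraction reported uniquely placed, and fraction of uniquely placed reads at the correct position) can be computed from simulated reads with a few lines of Python. The record layout below is a hypothetical one chosen for illustration; it is not the output format of ELAND or BWA, and the toy values are invented.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SimulatedAlignment:
    true_position: Tuple[str, int]                # (chromosome, start) the read was drawn from
    reported_position: Optional[Tuple[str, int]]  # None if the mapper reported no alignment
    unique: bool                                  # True if reported as uniquely placed


def summarize(alignments):
    """Return mapping fractions for a list of simulated alignments."""
    total = len(alignments)
    mapped = [a for a in alignments if a.reported_position is not None]
    unique = [a for a in mapped if a.unique]
    correct = [a for a in unique if a.reported_position == a.true_position]

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "fraction_mapped": ratio(len(mapped), total),
        "fraction_unique": ratio(len(unique), total),
        "fraction_correct_of_unique": ratio(len(correct), len(unique)),
    }


# Invented toy data: four simulated reads.
toy = [
    SimulatedAlignment(("chr1", 100), ("chr1", 100), True),
    SimulatedAlignment(("chr1", 200), ("chr2", 900), True),
    SimulatedAlignment(("chr1", 300), ("chr1", 300), False),
    SimulatedAlignment(("chr1", 400), None, False),
]
print(summarize(toy))
```

Evaluating these fractions separately for each simulated insert size gives curves of the kind shown in Figure 3.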
`While length selection and dimer removal are impor-
`tant for the cost-effective sequencing of a library and
`downstream data analysis, experimental methods to
`achieve these generally consume sample material and
`may bias molecule representation. It is therefore often
`only practical to apply a minimum of these purification
`steps in order to maintain library quantity and complex-
`ity. In such cases, downstream sequence filtering prior
`to data analysis becomes extremely important.
`
`Short-insert libraries in paired-end sequencing
`experiments
`When libraries containing inserts shorter than the sum
`of forward and reverse read cycles are created, these can
`be sequenced from both ends to obtain higher quality
`sequence information for the overlapping sequence part.
`For such paired-end reads the correct identification of
`the adapter is eased by maximizing autocorrelation of
`the two reads as well as requiring identical adapter start
`positions for both reads [30-32]. This strategy is more
`powerful than alignment(-like) approaches used for
`identifying adapter starts in single reads, which fre-
`quently remove sequence from the read ends that match
`the adapter by chance, or which do not identify real
`adapter sequence due to the higher sequencing error at
`the end of reads. Thus, for short insert libraries, paired
`end sequencing is preferable. As previously reported
`[30], the read merging performed for these short-insert
`libraries considerably decreases the number of errors
`and creates sequences reflecting the original outer mole-
`cule length (e.g. of interest for authenticity of ancient
`DNA samples [31]). Applying this merging approach to
`the simulated data set described above, but this time
using both paired-end reads, we see a roughly five-fold
reduction in the error rate of all merged sequences (average
error of 0.24% reduces to 0.05%; Additional File 1). For
sequences shorter than or equal to the read length a reduction
by a factor of about 21 (0.146% to 0.007%) is
observed.
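The merging of overlapping paired-end reads can be sketched in Python as below. This is a simplified illustration and not the published merging procedure [30]: the candidate insert length with the most matching bases in the overlap is chosen, agreeing bases receive the sum of both quality scores, and for disagreeing bases the call with the higher quality is kept with the difference of the scores. The thresholds, the quality rules and the test sequences are assumptions made for this sketch.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def merge_pair(read1, qual1, read2, qual2, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a forward and a reverse read; return (sequence, qualities) or None."""
    n1, n2 = len(read1), len(read2)
    best = None  # (number of matching bases, candidate insert length)
    for length in range(min_overlap, n1 + n2 - min_overlap + 1):
        m1, m2 = min(n1, length), min(n2, length)
        start2 = length - m2  # first insert position covered by read 2
        rc2 = reverse_complement(read2[:m2])
        overlap = range(start2, m1)
        if len(overlap) < min_overlap:
            continue
        mismatches = sum(read1[i] != rc2[i - start2] for i in overlap)
        if mismatches > max_mismatch_rate * len(overlap):
            continue
        matches = len(overlap) - mismatches
        if best is None or matches > best[0]:
            best = (matches, length)
    if best is None:
        return None
    length = best[1]
    m1, m2 = min(n1, length), min(n2, length)
    start2 = length - m2
    rc2 = reverse_complement(read2[:m2])
    rq2 = list(qual2[:m2])[::-1]
    bases, quals = [], []
    for i in range(length):
        in1, in2 = i < m1, i >= start2
        if in1 and in2:
            b1, b2 = read1[i], rc2[i - start2]
            q1, q2 = qual1[i], rq2[i - start2]
            if b1 == b2:
                base, q = b1, q1 + q2      # two agreeing observations
            elif q1 >= q2:
                base, q = b1, q1 - q2      # disagreement: keep the better call
            else:
                base, q = b2, q2 - q1
        elif in1:
            base, q = read1[i], qual1[i]
        else:
            base, q = rc2[i - start2], rq2[i - start2]
        bases.append(base)
        quals.append(q)
    return "".join(bases), quals
```

For an insert shorter than the read length, both reads cover the complete molecule, so every merged base is supported by two observations and the trailing adapter sequence is removed as a side effect of determining the insert length.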
`
`Library contamination
`Sample contamination during library preparation from
`other DNA/RNA sources might be an important issue
`for some types of analyses and applications. Contamina-
`tion may be introduced by the experimenter or may
`stem from lab chemicals and equipment. Library pre-
`parations starting from low amounts of sample DNA
`and protocols using single strand ligation procedures
`can be considered the most prone to contamination.
`While avoidance of contamination is the most desirable
`
[Figure 3 plots: panels A-D comparing BWA v0.5.5 and ELAND v1.5, with the read length (101nt) marked. X axes: molecule length (0-350; 50-110 in the detail panel). Y axes (0.0-1.0): fraction sequences reported mapped; fraction sequences reported uniquely placed; fraction correctly placed; fraction correct / placed.]

Figure 3 Effects of adapter sequence inclusion on mapping. Untrimmed adapter sequence at the read ends can interfere with alignment/
mapping. We simulated 101-cycle human genomic shotgun reads for an Illumina Paired End library with 10,000 reads for every adapter starting
point between 1 and 350nt, and the error profile observed for an actual run of this length. On this data set, we tested how ELAND and BWA are
affected by inclusion of adapter sequence: (A) ELAND requires only a fixed seed (here 32nt) in the beginning of the read. Adapters beginning
after this seed region may therefore have no effect on the output. ELAND reports 98% successful mappings for all simulated reads of at least
30nt insert size (2nt of adapter sequence being compensated by 2 mismatches allowed in the seed), BWA only reports 98% successful mappings
for reads with an insert size of at least 97nt. (B) Frequently only uniquely placed molecules are considered in data analysis. ELAND reports the
first uniquely placed fragment for 20nt insert size. BWA reports the first three uniquely placed fragments (mapping quality above 20) for an
insert size of 83nt. (C) All uniquely placed reads reported by ELAND up to an insert length of 67nt are placed incorrectly (when comparing to
the coordinates the sequence was extracted from), as is one of the 3 reported by BWA for an insert size of 83nt. When requiring 98% correct
placements, ELAND handles up to 14nt of adapter (83nt insert size), while BWA can only compensate with mismatches for 4nt of adapter
sequence (97nt insert size). (D) For analysis purposes, BWA shows the better performance due to the lower number of false positive placements.
Moreover, for an insert size of at least the read length (i.e. no adapters interfering with the alignment), BWA reports 99.999% of uniquely placed
reads (94.2% of all reported alignments) at the designated genomic positions, while ELAND only reports 98.757% of the uniquely placed reads
(83.8% of all reported alignments) at the correct position.
`
`approach, it has been suggested (e.g. [33]) that reads can
`be filtered by the alignment to the putative contaminant
`sequence before data analysis. However, such filtering
`may introduce biases in the data, especially if sequences
are short and/or the evolutionary distance between contaminant
and sample is low. This is a frequent problem
in ancient DNA studies of early modern humans and
Neandertals, where contamination with even small
amounts of modern DNA can quickly dominate sequencing
output. Here the fraction of contaminant molecules can be
estimated from informative sites (i.e. sites of known fixed
differences between species/populations) [31,32,34].
This ratio can then be used in statistical models during
data analysis. If no informative sites are known, estimates
of contamination may be obtained from biallelic
or triallelic sites in haploid/diploid sequences. Similarly,
the sex of the sample may be exploited by counting
Y chromosomal alignments in female samples and
determining X chromosomal heterozygosity for males
[35,36].
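Estimating contamination from informative sites amounts to estimating a proportion. The Python sketch below is a minimal illustration and not the statistical model used in the cited studies: it assumes that every read overlapping an informative site is an independent observation, ignores sequencing error, and reports the observed fraction of reads carrying the contaminant allele together with a Wilson score interval. The read counts in the example are invented.

```python
import math


def contamination_estimate(contaminant_reads, total_reads, z=1.96):
    """Return (point estimate, lower bound, upper bound) of contamination.

    The bounds form a Wilson score interval; z = 1.96 corresponds to an
    approximate 95% confidence level.
    """
    if total_reads <= 0 or not 0 <= contaminant_reads <= total_reads:
        raise ValueError("need 0 <= contaminant_reads <= total_reads and total_reads > 0")
    n = total_reads
    p = contaminant_reads / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Invented example: 5 of 100 reads at informative sites show the contaminant state.
estimate, lower, upper = contamination_estimate(5, 100)
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```

The width of the interval shows why a sufficient number of reads at informative sites is needed before a low contamination rate can be claimed.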
`Cross-contamination after library preparation (e.g.
`during preparation of the sequencing run) can be easily
`identified and filtered from the final sequencing data
`when sample-specific barcodes are included in the
`libraries and determined during sequencing [9,10]. A
`different problem may arise when pools of barcoded
`libraries are amplified simultaneously; in such experi-
`ments “jumping PCR” may cause barcodes to be trans-
`ferred between samples [37-40]. High error in the
`sequencing, amplification or synthesis of barcodes, espe-
`cially for those with limited sequence distance from one
`another, as well as barcode contamination during hand-
`ling or mixed clusters/small physical cluster distance on
`the flow cell can lead to sample misidentification. In
`projects for which data analysis is susceptible to any of
`these types of contamination, appropriate measures have
`to be taken to minimize contamination and to estimate
`its impact on analysis results.
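Whether a set of barcodes has sufficient sequence distance can be checked before an experiment. The Python sketch below reports the minimum pairwise Hamming distance of a barcode set and lists pairs below a chosen threshold. It considers substitutions only (no insertions or deletions), and the barcodes and the threshold of 3 are hypothetical. With a minimum distance d, up to d - 1 substitution errors can be detected and up to (d - 1) // 2 can be corrected.

```python
from itertools import combinations


def hamming(a, b):
    """Count mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def close_pairs(barcodes, min_distance=3):
    """List (barcode, barcode, distance) for pairs closer than min_distance."""
    pairs = []
    for a, b in combinations(barcodes, 2):
        distance = hamming(a, b)
        if distance < min_distance:
            pairs.append((a, b, distance))
    return pairs


# Hypothetical barcode set, for illustration only.
barcode_set = ["ACGTAC", "ACGTAG", "TGCATG"]
print(min_pairwise_distance(barcode_set))
print(close_pairs(barcode_set))
```

In this toy set the first two barcodes differ at a single position, so one sequencing or synthesis error is enough to assign a read to the wrong sample.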
`
`Alternative sequencing primers
`The sequencing primers used in sequencing of libraries
`constructed for different applications may differ. During
`flow cell preparation, different sequencing primers can
`be hybridized in each lane. However, in contrast to the
`first read prepared on the Cluster Station/cBot, it is not
`possible to use lane sp
