Quail et al. BMC Genomics 2012, 13:341
`http://www.biomedcentral.com/1471-2164/13/341
`
RESEARCH ARTICLE
Open Access
`A tale of three next generation sequencing
`platforms: comparison of Ion Torrent, Pacific
`Biosciences and Illumina MiSeq sequencers
`Michael A Quail*, Miriam Smith, Paul Coupland, Thomas D Otto, Simon R Harris, Thomas R Connor, Anna Bertoni,
`Harold P Swerdlow and Yong Gu
`
`Abstract
`
`Background: Next generation sequencing (NGS) technology has revolutionized genomic and genetic research. The
`pace of change in this area is rapid with three major new sequencing platforms having been released in 2011: Ion
`Torrent’s PGM, Pacific Biosciences’ RS and the Illumina MiSeq. Here we compare the results obtained with those
`platforms to the performance of the Illumina HiSeq, the current market leader. In order to compare these platforms,
`and get sufficient coverage depth to allow meaningful analysis, we have sequenced a set of 4 microbial genomes
`with mean GC content ranging from 19.3 to 67.7%. Together, these represent a comprehensive range of genome
`content. Here we report our analysis of that sequence data in terms of coverage distribution, bias, GC distribution,
`variant detection and accuracy.
`Results: Sequence generated by Ion Torrent, MiSeq and Pacific Biosciences technologies displays near perfect
`coverage behaviour on GC-rich, neutral and moderately AT-rich genomes, but a profound bias was observed upon
`sequencing the extremely AT-rich genome of Plasmodium falciparum on the PGM, resulting in no coverage for
`approximately 30% of the genome. We analysed the ability to call variants from each platform and found that we
`could call slightly more variants from Ion Torrent data compared to MiSeq data, but at the expense of a higher
`false positive rate. Variant calling from Pacific Biosciences data was possible but higher coverage depth was
`required. Context specific errors were observed in both PGM and MiSeq data, but not in that from the Pacific
`Biosciences platform.
`Conclusions: All three fast turnaround sequencers evaluated here were able to generate usable sequence.
`However there are key differences between the quality of that data and the applications it will support.
`Keywords: Next-generation sequencing, Ion torrent, Illumina, Pacific biosciences, MiSeq, PGM, SMRT, Bias, Genome
`coverage, GC-rich, AT-rich
`
`Background
`Sequencing technology is evolving rapidly and during
`the course of 2011 several new sequencing platforms
`were released. Of note were the Ion Torrent Personal
`Genome Machine (PGM) and the Pacific Biosciences
`(PacBio) RS that are based on revolutionary new
`technologies.
`The Ion Torrent PGM “harnesses the power of semi-
`conductor technology” detecting the protons released as
`nucleotides are incorporated during synthesis [1]. DNA
`
`* Correspondence: mq1@sanger.ac.uk
`Wellcome Trust Sanger Institute, Hinxton, UK
`
`fragments with specific adapter sequences are linked to
`and then clonally amplified by emulsion PCR on the sur-
`face of 3-micron diameter beads, known as Ion Sphere
`Particles. The templated beads are loaded into proton-
`sensing wells that are fabricated on a silicon wafer and se-
`quencing is primed from a specific location in the adapter
`sequence. As sequencing proceeds, each of the four bases
`is introduced sequentially. If bases of that type are incor-
`porated, protons are released and a signal is detected pro-
`portional to the number of bases incorporated.
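The flow-based readout described above can be illustrated with a minimal, noise-free sketch. It is not Ion Torrent's actual signal processing: the flow order `TACG`, the function names, and the simplification of passing in the strand being synthesised (rather than its complement) are all assumptions made for illustration.

```python
def simulate_flows(synthesised, flow_order="TACG", n_cycles=4):
    """Return (nucleotide, signal) for each flow of an idealised run.

    `synthesised` is the sequence of the strand being built; the
    flow order is a hypothetical choice, not a documented default.
    """
    signals = []
    pos = 0
    for _ in range(n_cycles):
        for base in flow_order:
            run = 0
            # Every consecutive base matching the flowed nucleotide is
            # incorporated in the same flow, so the signal scales with
            # the homopolymer length.
            while pos < len(synthesised) and synthesised[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
    return signals


def call_bases(signals):
    """Convert ideal flow signals back into a base sequence."""
    return "".join(base * run for base, run in signals)


flows = simulate_flows("TTAGGGC", n_cycles=2)
print(flows)
print(call_bases(flows))
```

Because a whole homopolymer is read in a single flow, any imprecision in the signal amplitude translates into an insertion or deletion rather than a substitution.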
`PacBio have developed a process enabling single mol-
`ecule real time (SMRT) sequencing [2]. Here, DNA poly-
`merase molecules, bound to a DNA template, are
`
`© 2012 Quail et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
`Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
`reproduction in any medium, provided the original work is properly cited.
`
`attached to the bottom of 50 nm-wide wells termed
`zero-mode waveguides (ZMWs). Each polymerase is
`allowed to carry out second strand DNA synthesis in the
`presence of γ-phosphate fluorescently labeled nucleo-
`tides. The width of the ZMW is such that light cannot
`propagate through the waveguide, but energy can pene-
`trate a short distance and excite the fluorophores
`attached to those nucleotides that are in the vicinity of
`the polymerase at the bottom of the well. As each base
is incorporated, a distinctive pulse of fluorescence is
detected in real time.
`In recent years, the sequencing industry has been
`dominated by Illumina, who have adopted a sequencing-
`by-synthesis approach [3], utilizing fluorescently labeled
`reversible-terminator nucleotides, on clonally amplified
`DNA templates immobilized to an acrylamide coating
`on the surface of a glass flowcell. The Illumina Genome
`Analyzer and more recently the HiSeq 2000 have set the
`standard for high throughput massively parallel sequen-
`cing, but in 2011 Illumina released a lower throughput
`fast-turnaround instrument, the MiSeq, aimed at smaller
`laboratories and the clinical diagnostic market.
`Here we evaluate the output of these new sequencing
`platforms and compare them with the data obtained
`from the Illumina HiSeq and GAIIx platforms. Table 1
`gives a summary of the technical specifications of each
`of these instruments.
`
`Results
`Sequence generation
Platform-specific libraries were constructed for a set of
microbial genomes: Bordetella pertussis (67.7% GC, with
some regions in excess of 90% GC content), Salmonella
Pullorum (52% GC), Staphylococcus aureus (33% GC)
`and Plasmodium falciparum (19.3% GC, with some
`regions close to 0% GC content). We routinely use these
`to test new sequencing technologies, as together their
`sequences represent the range of genomic landscapes
`that one might encounter.
PCR-free [4] Illumina libraries were uniquely barcoded,
pooled and run on a MiSeq flowcell with paired
`150 base reads plus a 6-base index read and also on a
`single lane of an Illumina HiSeq with paired 75 base
`reads plus an 8-base index read (Additional file 1: Table
`S1). Illumina libraries prepared with amplification using
`Kapa HiFi polymerase [5] were run on a single lane of
`an Illumina GA IIx with paired 76 base reads plus an 8-
`base index read and on a MiSeq flowcell with paired 150
`base reads plus a 6-base index read. PCR-free libraries
`represent an improvement over the standard Illumina li-
`brary preparation method as they result in more even
`sequence coverage [4] and are included here alongside
`libraries prepared with PCR in order to enable compari-
`son to PacBio which has an amplification free workflow.
`Ion Torrent libraries were each run on a single 316
chip for 65 cycles, generating mean read lengths of
`112–124 bases (Additional file 1: Table S2). Standard
`PacBio libraries, with an average of 2 kb inserts, were
`run individually over multiple SMRT cells, each using
`C1 chemistry, and providing ≥20x sequence coverage
`data for each genome (Additional file 1: Table S3).
`The datasets generated were mapped to the corre-
`sponding reference genome as described in Methods.
For a fair comparison, all sequence datasets were randomly
down-sampled (normalized) to contain reads
representing a 15x average genome coverage.
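The down-sampling step can be sketched as follows. This is an illustrative approach, not the procedure used in the study: each read is kept with a probability equal to the ratio of target depth to current depth. The function name, its parameters and the fixed seed are assumptions.

```python
import random


def downsample_reads(read_lengths, genome_size, target_depth=15.0, seed=0):
    """Return indices of reads kept so that mean depth is ~target_depth.

    Depth is estimated as total sequenced bases / genome size.
    """
    total_bases = sum(read_lengths)
    current_depth = total_bases / genome_size
    if current_depth <= target_depth:
        # Nothing to remove: the dataset is already at or below target.
        return list(range(len(read_lengths)))
    keep_fraction = target_depth / current_depth
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    return [i for i in range(len(read_lengths)) if rng.random() < keep_fraction]


# 30,000 reads of 100 bases on a 100 kb genome is 30x; aim for 15x.
lengths = [100] * 30000
kept = downsample_reads(lengths, genome_size=100_000)
print(sum(lengths[i] for i in kept) / 100_000)
```

Sampling per read rather than per base keeps reads intact, at the cost of the final depth being only approximately equal to the target.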
`
Table 1 Technical specifications of Next Generation Sequencing platforms utilised in this study

Platform | Illumina MiSeq | Ion Torrent PGM | PacBio RS | Illumina GAIIx | Illumina HiSeq 2000
Instrument cost* | $128 K | $80 K** | $695 K | $256 K | $654 K
Sequence yield per run | 1.5-2 Gb | 20-50 Mb on 314 chip, 100-200 Mb on 316 chip, 1 Gb on 318 chip | 100 Mb | 30 Gb | 600 Gb
Sequencing cost per Gb* | $502 | $1000 (318 chip) | $2000 | $148 | $41
Run time | 27 hours*** | 2 hours | 2 hours | 10 days | 11 days
Reported accuracy | Mostly > Q30 | Mostly Q20 | <Q10 | Mostly > Q30 | Mostly > Q30
Observed raw error rate | 0.80 % | 1.71 % | 12.86 % | 0.76 % | 0.26 %
Read length | up to 150 bases | ~200 bases | Average 1500 bases**** (C1 chemistry) | up to 150 bases | up to 150 bases
Paired reads | Yes | Yes | No | Yes | Yes
Insert size | up to 700 bases | up to 250 bases | up to 10 kb | up to 700 bases | up to 700 bases
Typical DNA requirements | 50-1000 ng | 100-1000 ng | ~1 μg | 50-1000 ng | 50-1000 ng

* All cost calculations are based on list price quotations obtained from the manufacturer and assume expected sequence yield stated.
** System price including PGM, server, OneTouch and OneTouch ES.
*** Includes two hours of cluster generation.
**** Mean mapped read length includes adapter and reverse strand sequences. Subread lengths, i.e. the individual stretches of sequence originating from the
sequenced fragment, are significantly shorter.
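The reported accuracies in Table 1 are Phred quality scores, while the observed raw error rates are percentages. The two can be compared using the standard Phred relation Q = -10 log10(p), as in this short sketch (the function names are illustrative only):

```python
import math


def phred_to_error(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


# Q20 and Q30 correspond to 1% and 0.1% error, respectively.
print(phred_to_error(20), phred_to_error(30))
# Observed raw error rates from Table 1, expressed as Phred scores.
for platform, rate in [("PGM", 0.0171), ("PacBio", 0.1286)]:
    print(platform, round(error_to_phred(rate), 1))
```

On this scale the observed PGM rate of 1.71% sits slightly below Q20, and the observed PacBio rate of 12.86% falls below Q10, in line with the reported accuracy column.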
`
`Workflow
`All the platforms have library preparation protocols that
`involve fragmenting genomic DNA and attaching spe-
`cific adapter sequences. Typically this takes somewhere
`between 4 and 8 hours for one sample. In addition, the
`Ion Torrent template preparation has a two hour emul-
`sion PCR and a template bead enrichment step.
`In the battle to become the platform with the fastest
`turnaround time, all the manufacturers are seeking to
`streamline library preparation protocols. Life Technolo-
`gies have developed the Ion Xpress Fragment Library Kit
`that has an enzymatic “Fragmentase” formulation for
`shearing starting DNA, thereby avoiding the labour of
`physical shearing and potentially enabling complete li-
`brary automation. We tested this kit on our four gen-
`omes alongside the standard library kit with physical
`shearing and found both to give equal genomic repre-
`sentation (see Additional file 2: Figure S1 for results
`obtained with P. falciparum). Illumina purchased Epi-
`centre in order to package the Nextera technology with
`the MiSeq. Nextera uses a transposon to shear genomic
`DNA and simultaneously introduce adapter sequences
`[6]. The Nextera method can produce sequencing ready
`DNA in around 90 minutes and gave us remarkably even
`genome representation (Additional file 2: Figures S2 and
`Additional file 2: Figure S3) with B. pertussis and S. aur-
`eus, but produced a very biased sequence dataset from
`the extremely AT-rich P. falciparum genome.
`
`Genome coverage and GC bias
`To analyse the uniformity of coverage across the genome
`we tabulated the depth of coverage seen at each position
of the genome. We utilized the coverage plots described
by Lam et al. [7], which depict the percentage of the genome
that is covered at a given read depth, and genome
coverage at different read depths, respectively, for each
dataset (Figure 1), alongside the ideal theoretical coverage
that would be predicted based on Poisson behaviour.
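The ideal Poisson curve referred to above can be computed directly. The sketch below assumes reads land uniformly at random, so that per-base depth follows a Poisson distribution with mean equal to the average coverage (15x for the normalized datasets); the function names are illustrative.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a base is covered by exactly k reads."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_covered_at_least(depth, mean):
    """Expected fraction of the genome covered at `depth` or more."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(depth))


mean_depth = 15.0
for depth in (1, 5, 10, 15, 20):
    print(depth, round(fraction_covered_at_least(depth, mean_depth), 4))
```

Under this model essentially no bases are expected to be uncovered at 15x, so a sizeable uncovered fraction in a real dataset points to bias rather than sampling chance.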
`In the context of the GC-rich genome of B. pertussis,
`most platforms gave similar uniformity of sequence
`coverage, with the Ion Torrent data giving slightly more
`uneven coverage. In the S. aureus genome the PGM
`performed better. The PGM gave very biased coverage
`when sequencing the extremely AT-rich P. falciparum
genome (Figure 1). This effect was also evident when
`we plotted coverage depth against GC content (Additional
`file 2: Figure S4). Whilst the PacBio platform gave a
`sequence dataset with quite even coverage on GC and
`extremely AT-rich contexts, it did demonstrate slight but
`noticeable unevenness of coverage and bias towards GC-
`rich sequences with the S. aureus genome. With the GC-
`neutral S. Pullorum genome all platforms gave equal
`coverage with unbiased GC representation (data not
`shown).
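A plot of coverage depth against GC content can be derived from a reference sequence and a per-base depth array. The following is a minimal sketch under assumed inputs (a plain string and a list of integers) and an assumed non-overlapping window scheme; it is not the analysis code used in the study.

```python
def gc_vs_coverage(sequence, depths, window=100):
    """Return (gc_percent, mean_depth) for consecutive windows.

    Windows are non-overlapping; a trailing partial window is ignored.
    """
    results = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window].upper()
        gc = 100.0 * (chunk.count("G") + chunk.count("C")) / window
        mean_depth = sum(depths[start:start + window]) / window
        results.append((gc, mean_depth))
    return results


# Toy example: a GC-only window at 20x followed by an AT-only window at 5x.
seq = "GGCC" * 25 + "ATAT" * 25
cov = [20] * 100 + [5] * 100
print(gc_vs_coverage(seq, cov))
```

Grouping the resulting pairs by GC percentage and averaging the depths gives the kind of GC-bias curve described in the text.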
`
`The most dramatic observation from our results was
`the severe bias seen when sequencing the extremely AT-
`rich genome of P. falciparum on the PGM. The result of
`this was deeper than expected coverage of the GC-rich
`var and subtelomeric regions and poor coverage within
`introns and AT-rich exonic segments (Figure 2), with ap-
`proximately 30% of the genome having no sequence
`coverage whatsoever. This bias was observed with librar-
`ies prepared using both enzymatic and physical shearing
`(Additional file 2: Figure S1).
`In a recent study to investigate the optimal enzyme
`for next generation library preparation [5], we found
`that the enzyme used for fragment amplification during
`next generation library preparation can have a signifi-
`cant influence on bias. We found the enzyme Kapa
`HiFi amplifies fragments with the least bias, giving
`even coverage, close to that obtained without amplifi-
`cation. Since the PGM has two amplification steps,
`one during library preparation and the other emulsion
`PCR (emPCR) for template amplification, we reasoned
`that this might be the cause of the observed bias. Sub-
`stituting the supplied Platinum Taq enzyme with Kapa
`HiFi
`for the nick translation and amplification step
`during library preparation profoundly reduced the
`observed bias (Figure 3). We were unable to further
`improve this by use of Kapa HiFi
`for the emPCR
`(results not shown).
Of the four genomes sequenced, the P. falciparum
genome is the largest and most complex and contains a
`significant quantity of repetitive sequences. We used P.
`falciparum to analyse the effect of read length versus
`mappability. As the PacBio pipeline doesn’t generate a
`mapping quality value and to ensure a fair comparison,
`we remapped the reads of all technologies using the k-
`mer based mapper, SMALT [9], and then analysed cover-
`age across the P. falciparum genome (Additional file 3:
`Table S4). This data confirms the poor performance of
`Ion Torrent on the P. falciparum genome, as only 65%
`of the genome is covered with high quality (>Q20) reads
`compared to ~98-99% for the other platforms. Whilst
the mean mapped read length of the PacBio reads with
this genome was 1336 bases, the average subread length (the
length of sequence covering the genome) is significantly
less (645 bases). The short average subread length is due
to preferential loading of short fragment constructs in
the library and the effect of lag time (non-imaged bases)
after sequencing initiation, the latter resulting in
sequences near the beginning of library constructs not
being reported.
As the median length of the PacBio subreads for this
data set is just 600 bases, we compared their coverage
with an equivalent amount of in silico filtered reads of
>620 bases. This led to a very small decrease in the
percentage of bases covered. Using paired reads on the
`
`Figure 1 Genome coverage plots for 15x depth randomly downsampled sequence coverage from the sequencing platforms tested.
`A) The percentage of the B. pertussis genome covered at different read depths; B) The number of bases covered at different depths for B.
`pertussis; C) The percentage of the S. aureus genome covered at different read depths; D) The number of bases covered at different depths for
`S. aureus; E) The percentage of the P. falciparum genome covered at different read depths; and F) The number of bases covered at different
`depths for P. falciparum.
`
Illumina MiSeq, however, gave a strong positive effect,
with 1.1% more coverage being observed from paired-end
reads compared to single-end reads.

Error rates
We observed error rates of below 0.4% for the Illumina
platforms, 1.78% for Ion Torrent and 13% for PacBio
sequencing (Table 1). The percentage of error-free reads,
without a single mismatch or indel, was 76.45%, 15.92%
and 0% for MiSeq, Ion Torrent and PacBio, respectively.
The error heatmap in Figure 2A shows that the PacBio
errors are distributed evenly over the chromosome. We
manually inspected the regions where Ion Torrent and
Illumina generated more errors. Illumina produced errors
after long (> 20-base) homopolymer tracts [10]
(Figure 4A).
Also evident in the MiSeq data were strand-specific errors due
to the GGC motif [11]. Following the finding that the
`motif GGC generates strand-specific errors, we analyzed
`this phenomenon in the MiSeq data for P. falciparum
`(Additional file 4: Table S5). We observed that the error
`is mostly generated by GC-rich motifs, principally
`GGCGGG. We found no evidence for an error if the
`triplet after the GGC is AT-rich. Other MiSeq datasets
`also showed this artifact (data not shown). In addition to
`this being a strand-specific issue, it appears that this is a
`read-specific phenomenon. Whilst there is a quality drop
`in the first read following these GC-rich motifs, there is
`
[Figure 2 image: panels A-D, each showing %GC and coverage depth tracks; colour key: PacBio, PGM, GAII, HiSeq, MiSeq]
`
`Figure 2 Artemis genome browser [8] screenshots illustrating the variation in sequence coverage of a selected region of P. falciparum
`chromosome 11, with 15x depth of randomly normalized sequence from the platforms tested. In each window, the top graph shows the
`percentage GC content at each position, with the numbers on the right denoting the minimum, average and maximum values. The middle
`graph in each window is a coverage plot for the dataset from each instrument; the colour code is shown above graph a). Each of the middle
`graphs shows the depth of reads mapped at each position, and below that in B-D are the coordinates of the selected region in the genome with
`gene models on the (+) strand above and (−) strand below. A) View of the first 200 kb of chromosome 11. Graphs are smoothed with window
`size of 1000. A heatmap of the errors, normalized by the amount of mapping reads is included just below the GC content graph (PacBio top line,
`PGM middle and MiSeq bottom). B) Coverage over region of extreme GC content, ranging from 70% to 0%. C) Coverage over the gene
`PF3D7_1103500. D) Example of intergenic region between genes PF3D7_1104200 and PF3D7_1104300. The window size of B, C and D is 50 bp.
`
`a striking loss of quality in read 2, where the reads have
`nearly half the mean quality value compared to the read
`1 reads for GC-rich triplets that follow the GGC motif.
`We could observe this low quality in read 2 in all our
`analysed Illumina lanes. For AT-rich motifs the ratio is
`nearly 1 (1.03).
`Ion Torrent didn’t generate reads at all for long (> 14-
`base) homopolymer tracts, and could not predict the
`correct number of bases in homopolymers >8 bases
`long. Very few errors were observed following short
`homopolymer stretches in the MiSeq data (Figure 4B).
`Additionally, we observed strand-specific errors in the
`PGM data but were unable to associate these with any
`obvious motif (Figure 4C).
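Regions likely to trigger the homopolymer-associated errors described above can be located in a reference sequence with a simple scan. This is a minimal sketch; the function name and the default threshold (runs longer than 8 bases, the length above which the PGM miscounted bases) are choices made for illustration.

```python
import re


def find_homopolymers(sequence, min_length=9):
    """Return (start, length, base) for each run of >= min_length identical bases."""
    # A captured base followed by at least (min_length - 1) repeats of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1), re.IGNORECASE)
    return [
        (m.start(), len(m.group(0)), m.group(1).upper())
        for m in pattern.finditer(sequence)
    ]


example = "ACGT" + "A" * 9 + "CG" + "T" * 15 + "GGG"
print(find_homopolymers(example))
```

Raising `min_length` to 15 or 21 would flag only the tracts beyond which the text reports a loss of Ion Torrent reads or the onset of Illumina errors, respectively.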
`
`SNP calling
`In order to determine whether or not the higher error
`rates observed with the PGM and PacBio affected their
`ability to call SNPs, we aligned the reads from the S.
`aureus genome, for which all platforms gave good se-
`quence representation, against the reference genome of
`the closely related strain USA300_FPR3757 [12], and
`
`compared the SNPs called against those obtained by
`aligning the reference sequences of the two genomes
`(Figure 5 and Additional file 5: Table S6). In order to
`create a fair comparison we initially used the same ran-
`domly normalized 15x datasets used in our analysis of
genome coverage, which according to the literature [3]
is sufficient to accurately call heterozygous variants, but
found that this was insufficient for the PacBio datasets,
for which 190x coverage was used.
`Overall the rate of SNP calling was slightly higher for
`the Ion Torrent data than for Illumina data (chi square
`p value 3.15E-08), with approximately 82% of SNPs
`being correctly called for the PGM and 68-76% of the
`SNPs being detected from the Illumina data (Figure 5A).
`Conversely, the rate of false SNP calls was higher with
`Ion Torrent data than for Illumina data (Figure 5B). SNP
`calling from PacBio data proved more problematic, as
`existing tools are optimized for short-read data and not
`for high error-rate long-read data. We were reliant on
`SNPs called by the SMRT portal pipeline for this ana-
`lysis. Our results showed that SNP detection from Pac-
`Bio data was not as accurate as that from the other
`
`Figure 3 The effect of substituting Platinum HiFi PCR supermix with Kapa HiFi in the PGM library prep amplification step.
`A) The percentage of the P. falciparum genome covered at different read depths. The blue line shows the data obtained with the recommended
`Platinum enzyme and the green line with Kapa HiFi. The red line depicts ideal coverage behavior. B) The number of bases covered at different
`depths. C) Sequence representation vs. GC-content plots.
`
`platforms, with overall only 71% of SNPs being detected
`and 2876 SNPs being falsely called (Additional file 5:
`Table S6).
Amongst the datasets obtained from the Illumina
sequencers, the percentage of correct SNP calls was
higher for the MiSeq (76%) than for the GAIIx (70%)
or the HiSeq (69%),
despite the same libraries being run on both MiSeq
and HiSeq. The use of Nextera library preparation
`gave similar results with 76% of SNPs being correctly
`called. It should be noted that we found the inbuilt
`automatic variant calling inadequate on both MiSeq
`and PGM, with MiSeq reporter calling just 6.6% of
`variants and Torrent suite 1.5.1 calling only 1.4% of
`variants.
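The two quantities reported in this section, the percentage of true SNPs detected and the number of false calls, can be computed by comparing a call set against a truth set. The sketch below assumes SNPs are represented as (position, alternate base) tuples and that the truth set comes from a reference-to-reference alignment; the function name and data layout are hypothetical.

```python
def evaluate_snp_calls(called, truth):
    """Compare called SNPs against a truth set of (position, alt_base) tuples."""
    called, truth = set(called), set(truth)
    true_positives = called & truth
    false_positives = called - truth
    sensitivity = 100.0 * len(true_positives) / len(truth) if truth else 0.0
    return {
        "sensitivity_percent": sensitivity,      # share of true SNPs detected
        "false_positives": len(false_positives),  # calls absent from the truth set
        "missed": len(truth - called),            # true SNPs not called
    }


truth = {(100, "A"), (200, "G"), (300, "T"), (400, "C")}
called = {(100, "A"), (200, "G"), (300, "T"), (999, "C")}
print(evaluate_snp_calls(called, truth))
```

Reporting both figures together matters here, since the text shows that a higher detection rate can come with a higher false positive count.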
`
`Discussion
`A key feature of these new platforms is their speed. De-
`creasing run time has clear advantages particularly
`within the clinical sequencing arena, but poses chal-
`lenges in itself. Whilst manufacturers may state library
`prep times on the order of a couple of hours, these times
don’t include upfront QC and library QC and quantification.
Also, typical library prep times quoted usually
apply to processing of only one sample; i.e., pipetting
`time is largely ignored. Purchasers of sequencing instru-
`ments will want to keep them running at full utilization,
`in order to maximize their investment and will also want
`to pool multiple samples on single runs for economic
`
`reasons. To obtain maximum throughput, users must
`consider the whole process, potentially investing in an-
`cillary equipment and robotics to create an automated
`pipeline for the preparation of large numbers of samples.
`To process large numbers of samples quickly, a facility’s
`instrument base must be large enough to avoid sample
`backlogs. With this in mind, manufacturers are seeking
`to develop more streamlined sample-prep protocols that
`will facilitate timely sample loading. Here we have tested
`two such developments: enzymatic fragmentation and
`the Nextera technique. We conclude that these methods
`can be very useful, but users must carefully evaluate the
`methods they use for their particular applications and
`for use with genomes of extreme base composition to
`avoid bias.
`Whilst the data generated using the Ion Torrent PGM
`platform has a higher raw error rate (~1.8%) than Illu-
`mina data (<0.4%), provided there is sufficient coverage,
`the representation and ability to call SNPs is quite
`closely matched between these technologies with more
true positives being called from PGM data but far fewer
`false positives from the Illumina data. Detection of SNPs
`using PacBio data was not as accurate; the use of single-
`molecule sequencing to detect low level variants and
`quasispecies within populations remains unproven. We
`have found PacBio’s long reads useful for scaffolding de
`novo assemblies, though our experience suggests that
this is currently not fully optimized and extensive
method development is still required.
`
`
`Figure 4 Illustration of platform-specific errors. The panels show Artemis BAM views with reads (horizontal bars) mapping to defined regions
`of chromosome 11 of P. falciparum from PacBio (P; top), Ion Torrent (I; middle) and MiSeq (M; bottom). Red vertical dashes are 1 base differences
`to the reference and white points are indels. A) Illustration of errors in Illumina data after a long homopolymer tract. Ion torrent data has a drop
`of coverage and multiple indels are visible in PacBio data. B) Example of errors associated with short homopolymer tracts. Multiple insertions are
`visible in the PacBio Data, deletions are observed in the PGM data and the MiSeq sequences read generally correct through the homopolymer
`tract. C) Example of strand specific deletions (red circles) observed in Ion Torrent data.
`
`Interestingly, the mappability didn’t increase signifi-
`cantly with longer reads, although a beneficial effect was
`obtained from having mate-pair information. Current
`PacBio protocols favor the preferential loading of smaller
`constructs, resulting in average subread lengths that are
`significantly shorter than the often quoted average read
`lengths. Further development is therefore required to
`avoid having excess short fragments and adapter-dimer
`constructs in the library and reducing their loading effi-
`ciency into the ZMWs.
`Whilst one would normally use higher coverage than
`used here for confident SNP detection (i.e., 30-40x
`depth), we were limited to 15x depth due to the yield of
`some of the platforms. Nonetheless, at least for the hap-
`loid genome, S. aureus, 15x coverage should be a reason-
`able quantity for SNP detection and even in the human
`genome, 15x coverage has been shown to be sufficient to
`accurately call heterozygous SNPs [3].
`Variant calling is a highly subjective process; the par-
`ticular software chosen as well as the specific parameters
`
`employed to make the predictions will change the results
`substantially. As such, the rate of both true SNP and
`false positive calling provided here are purely indicative
`and results obtained with each sequencing platform will
`vary. For any particular application using a specific se-
`quencing method, optimisation of the SNP- and indel-
`calling algorithm would always be recommended.
`We sequence many isolates of the malaria parasite P.
`falciparum as it represents a significant health issue in
developing countries; this organism leads to several million
deaths per annum. There are several active large sequencing
programs (e.g. MalariaGEN [13]) that are
currently aiming to sequence thousands of clinical malaria
samples. As the malaria genome has a GC content
`of only 19.4% [14], we use it as one of our test genomes,
`representing a significant challenge to most sequencing
`technologies. Based on the present study, use of Illumina
`sequencing technology with libraries prepared without
`amplification [4] leads to the least biased coverage across
`this genome. Ion Torrent semiconductor sequencing is
`
[Figure 5 image: bar charts. A) Percentage of correctly called true SNPs; B) Number of incorrect SNP calls. Datasets: Ion Torrent, Ion Torrent (SAMtools), GAII, MiSeq, HiSeq, Nextera library on MiSeq, PacBio 190x. Series: all SNPs (total false positives in B), and excluding mobile elements and indels.]
`Figure 5 Accuracy of SNP detection from the S. aureus datasets generated from each platform, compared against the reference
`genome of its close relative S. aureus USA300_FPR3757. Both the Torrent server variant calling pipeline and SAMtools were used for Ion
`Torrent data; SAMtools was used for Illumina data and SMRT portal pipeline for PacBio data. A) The percentage of SNPs detected using each
`platform overall (blue bar), and outside of repeats, indels and mobile genetic elements (red bar). B) The number of incorrect SNP calls for each
`platform overall (blue bar), and outside of repeats, indels and mobile genetic elements (red bar).
`
`
`not recommended for sequencing of extremely AT-rich
`genomes, due to the severe coverage bias observed. This
`is likely to be an artifact introduced during amplification.
Therefore, avoidance of library amplification and/or
emPCR, or use of more faithful enzymes during emPCR,
may eliminate the bias.
`Illumina sequencing has matured to the point where
`a great many applications [15-24] have been developed
`on the platform. Since the PGM is also a massively
`parallel short-read technology, many of those applica-
`tions should translate well and be equally practicable.
There are a few obvious exceptions; techniques
involving manipulations on the flow cell such as FRT-seq
[21] and OS-Seq [22] will be impossible using
`semiconductor sequencing. Also, the Ion Torrent plat-
`form currently employs fragment lengths of 100 or 200
`bases; without a mate-pair type library protocol, these
insert sizes are perhaps too short to enable accurate
`de novo assemblies such as that demonstrated using
`ALLPATHS-LG for mammalian genomes using Illu-
`mina data [25]. Conversely, Illumina sequencing on
`the HiSeq or MiSeq instruments requires heteroge-
`neous base composition across the population of
imaged clusters [26]. In order to sequence monotemplates
(where most sequenceable fragments have
exactly the same sequence), it is often necessary to
significantly dilute or mix the sample with a complex
genomic library to enable registration of clusters.
`
Semiconductor sequencing does not suffer this
problem.
The DNA-input requirements of PacBio can be prohibitive.
Illumina and PGM library preparation can be
performed with far less DNA; the standard PGM
Ion Xpress library prep requires just 100 ng DNA and
Illumina sequencing has been performed from sub-nanogram
quantities [27]. The yield, sample-input
requirements and amplification-free library prep of
PacBio potentially make it unsuitable for counting
applications and for applications involving significant
prior enrichment such as exome sequencing [15] and
ChIP-seq [18]. The PacBio platform, by virtue of its
long read lengths, should however have application in
de novo sequencing and may also benefit analysis of
linkage of alternative splicing and of variants across
long amplicons. Furthermore, the potential for direct
detection of epigenetic modifications has been
demonstrated [28].
Finally, it should be noted that this study represents a
`point in time, utilising kits and reagents available up
`until the end of 2011. Ion Torrent and Pacific Bios-
`ciences are relatively new sequencing technologies that
`have not had time to mature in the same way that the
`Illumina technology has. We anticipate that whilst some
`of the issues identified may be intrinsic, others will be
`resolved as these platforms evolve.
`
`Conclusion
`The limited yield and high cost per base currently pro-
`hibit large scale sequencing projects on the Pacific Bios-
`ciences instrument. The PGM and MiSeq are quite
`closely matched in terms of utility and ease of workflow.
`The decision on whether to purchase one or the other
`will hinge on available resources, existing infrastructure
`and personal experience, available finances and the type
`of applications being considered.
`
`Methods
`Genomic DNA
`P. falciparum 3D7 genomic DNA was a gift from Prof
`Chris Newbold, University of Oxford, UK. Bordetella
`pertussis ST24 genomic DNA was a gift from Craig
`Cummings, Stanford University School of Medicine, CA.
`Staphylococcus aureus TW20 genomic DNA was a gift
`from Jodi Lindsay, St George’s Hospital Medical School,
`University of London. S. Pullorum S449/87 genomic
`DNA was prepared at the Wellcome Trust Sanger Insti-
`tute, UK.
`
`Illumina library construction
`DNA (0.5 μg in 120 μl of 10 mM Tris–HCl pH8.5) was
`sheared in an AFA microtube using a Covaris S2 device
`
`(Covaris Inc.) with the following settings: Duty cycle 20,
`Intensity 5, cycles/burst 200, 45 seconds.
`Sheared DNA was purified by binding to an equal vol-
`ume of Ampure beads (Beckman Coulter Inc.) and
`eluted in 32 μl of 10 mM Tris–HCl, pH8.5. End-repair,
`A-tailing and paired-end adapter ligation were per-
`formed (as per the protocols supplied by Illumina, Inc.
`using reagents from New England Biolabs- NEB) with
`purification using a 1.5:1 ratio of standard Ampure to
`sample between each enzymatic reaction. PCR-free li-
`braries were constructed according to Kozarewa et al.
`[4]. After ligation, excess adapters and adapter dimers
`were removed using two Ampure clean-ups, first with a
`1.5:1 ratio of standard Ampure to sample, followed by a
`0.7:1 ratio of Ampure beads. PCR free libraries were
`then used as is. Libraries prepared with amplification
`were diluted to 2 ng/μl
