`http://www.biomedcentral.com/1471-2164/13/341
`
`RESEARCH ARTICLE
`Open Access
`A tale of three next generation sequencing
`platforms: comparison of Ion Torrent, Pacific
`Biosciences and Illumina MiSeq sequencers
`Michael A Quail*, Miriam Smith, Paul Coupland, Thomas D Otto, Simon R Harris, Thomas R Connor, Anna Bertoni,
`Harold P Swerdlow and Yong Gu
`
`Abstract
`
`Background: Next generation sequencing (NGS) technology has revolutionized genomic and genetic research. The
`pace of change in this area is rapid with three major new sequencing platforms having been released in 2011: Ion
`Torrent’s PGM, Pacific Biosciences’ RS and the Illumina MiSeq. Here we compare the results obtained with those
`platforms to the performance of the Illumina HiSeq, the current market leader. In order to compare these platforms,
`and get sufficient coverage depth to allow meaningful analysis, we have sequenced a set of 4 microbial genomes
`with mean GC content ranging from 19.3 to 67.7%. Together, these represent a comprehensive range of genome
`content. Here we report our analysis of that sequence data in terms of coverage distribution, bias, GC distribution,
`variant detection and accuracy.
`Results: Sequence generated by Ion Torrent, MiSeq and Pacific Biosciences technologies displays near perfect
`coverage behaviour on GC-rich, neutral and moderately AT-rich genomes, but a profound bias was observed upon
`sequencing the extremely AT-rich genome of Plasmodium falciparum on the PGM, resulting in no coverage for
`approximately 30% of the genome. We analysed the ability to call variants from each platform and found that we
`could call slightly more variants from Ion Torrent data compared to MiSeq data, but at the expense of a higher
`false positive rate. Variant calling from Pacific Biosciences data was possible but higher coverage depth was
`required. Context specific errors were observed in both PGM and MiSeq data, but not in that from the Pacific
`Biosciences platform.
`Conclusions: All three fast turnaround sequencers evaluated here were able to generate usable sequence.
`However there are key differences between the quality of that data and the applications it will support.
`Keywords: Next-generation sequencing, Ion torrent, Illumina, Pacific biosciences, MiSeq, PGM, SMRT, Bias, Genome
`coverage, GC-rich, AT-rich
`
`Background
`Sequencing technology is evolving rapidly and during
`the course of 2011 several new sequencing platforms
`were released. Of note were the Ion Torrent Personal
`Genome Machine (PGM) and the Pacific Biosciences
`(PacBio) RS that are based on revolutionary new
`technologies.
`The Ion Torrent PGM “harnesses the power of semi-
`conductor technology” detecting the protons released as
`nucleotides are incorporated during synthesis [1]. DNA
`
`* Correspondence: mq1@sanger.ac.uk
`Wellcome Trust Sanger Institute, Hinxton, UK
`
`fragments with specific adapter sequences are linked to
`and then clonally amplified by emulsion PCR on the sur-
`face of 3-micron diameter beads, known as Ion Sphere
`Particles. The templated beads are loaded into proton-
`sensing wells that are fabricated on a silicon wafer and se-
`quencing is primed from a specific location in the adapter
`sequence. As sequencing proceeds, each of the four bases
`is introduced sequentially. If bases of that type are incor-
`porated, protons are released and a signal is detected pro-
`portional to the number of bases incorporated.
`PacBio have developed a process enabling single mol-
`ecule real time (SMRT) sequencing [2]. Here, DNA poly-
`merase molecules, bound to a DNA template, are
`
`© 2012 Quail et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
`Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
`reproduction in any medium, provided the original work is properly cited.
`
`Quail et al. BMC Genomics 2012, 13:341
`
`attached to the bottom of 50 nm-wide wells termed
`zero-mode waveguides (ZMWs). Each polymerase is
`allowed to carry out second strand DNA synthesis in the
`presence of γ-phosphate fluorescently labeled nucleo-
`tides. The width of the ZMW is such that light cannot
`propagate through the waveguide, but energy can pene-
`trate a short distance and excite the fluorophores
`attached to those nucleotides that are in the vicinity of
`the polymerase at the bottom of the well. As each base
`is incorporated, a distinctive pulse of fluorescence is
`detected in real time.
`In recent years, the sequencing industry has been
`dominated by Illumina, who have adopted a sequencing-
`by-synthesis approach [3], utilizing fluorescently labeled
`reversible-terminator nucleotides, on clonally amplified
`DNA templates immobilized to an acrylamide coating
`on the surface of a glass flowcell. The Illumina Genome
`Analyzer and more recently the HiSeq 2000 have set the
`standard for high throughput massively parallel sequen-
`cing, but in 2011 Illumina released a lower throughput
`fast-turnaround instrument, the MiSeq, aimed at smaller
`laboratories and the clinical diagnostic market.
`Here we evaluate the output of these new sequencing
`platforms and compare them with the data obtained
`from the Illumina HiSeq and GAIIx platforms. Table 1
`gives a summary of the technical specifications of each
`of these instruments.
`
`Results
`Sequence generation
`Platform-specific libraries were constructed for a set of
`microbial genomes: Bordetella pertussis (67.7% GC, with
`some regions in excess of 90% GC content), Salmonella
`Pullorum (52% GC), Staphylococcus aureus (33% GC)
`and Plasmodium falciparum (19.3% GC, with some
`regions close to 0% GC content). We routinely use these
`to test new sequencing technologies, as together their
`sequences represent the range of genomic landscapes
`that one might encounter.
`PCR-free [4] Illumina libraries were uniquely barcoded,
`pooled and run on a MiSeq flowcell with paired
`150 base reads plus a 6-base index read and also on a
`single lane of an Illumina HiSeq with paired 75 base
`reads plus an 8-base index read (Additional file 1: Table
`S1). Illumina libraries prepared with amplification using
`Kapa HiFi polymerase [5] were run on a single lane of
`an Illumina GA IIx with paired 76 base reads plus an 8-
`base index read and on a MiSeq flowcell with paired 150
`base reads plus a 6-base index read. PCR-free libraries
`represent an improvement over the standard Illumina li-
`brary preparation method as they result in more even
`sequence coverage [4] and are included here alongside
`libraries prepared with PCR in order to enable compari-
`son to PacBio which has an amplification free workflow.
`Ion Torrent libraries were each run on a single 316
`chip for 65 cycles, generating mean read lengths of
`112–124 bases (Additional file 1: Table S2). Standard
`PacBio libraries, with an average of 2 kb inserts, were
`run individually over multiple SMRT cells, each using
`C1 chemistry, and providing ≥20x sequence coverage
`data for each genome (Additional file 1: Table S3).
`The datasets generated were mapped to the corre-
`sponding reference genome as described in Methods.
`For a fair comparison, all sequence datasets were randomly
`down-sampled (normalized) to contain reads representing
`a 15x average genome coverage.
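The down-sampling step can be illustrated with a short sketch. This is not the authors' pipeline: the helper names `downsample_fraction` and `downsample` are hypothetical, and coverage is approximated as total sequenced bases divided by genome size, with each read then kept independently with the resulting probability.

```python
import random


def downsample_fraction(n_reads, mean_read_len, genome_size, target_cov=15.0):
    """Fraction of reads to keep so that mean coverage is about target_cov.

    Coverage is approximated as (number of reads * mean read length) / genome size.
    """
    current_cov = n_reads * mean_read_len / genome_size
    if current_cov <= 0:
        raise ValueError("dataset contains no sequence")
    # Never request more reads than the dataset holds.
    return min(1.0, target_cov / current_cov)


def downsample(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (seeded for repeatability)."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]
```

For example, one million 150-base reads on a 5 Mb genome give roughly 30x coverage, so about half of the reads would be retained to reach 15x.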
`
`Table 1 Technical specifications of Next Generation Sequencing platforms utilised in this study
`Platform | Illumina MiSeq | Ion Torrent PGM | PacBio RS | Illumina GAIIx | Illumina HiSeq 2000
`Instrument cost* | $128 K | $80 K** | $695 K | $256 K | $654 K
`Sequence yield per run | 1.5-2 Gb | 20-50 Mb on 314 chip, 100-200 Mb on 316 chip, 1 Gb on 318 chip | 100 Mb | 30 Gb | 600 Gb
`Sequencing cost per Gb* | $502 | $1000 (318 chip) | $2000 | $148 | $41
`Run time | 27 hours*** | 2 hours | 2 hours | 10 days | 11 days
`Reported accuracy | Mostly > Q30 | Mostly Q20 | <Q10 | Mostly > Q30 | Mostly > Q30
`Observed raw error rate | 0.80% | 1.71% | 12.86% | 0.76% | 0.26%
`Read length | up to 150 bases | ~200 bases | Average 1500 bases**** (C1 chemistry) | up to 150 bases | up to 150 bases
`Paired reads | Yes | Yes | No | Yes | Yes
`Insert size | up to 700 bases | up to 250 bases | up to 10 kb | up to 700 bases | up to 700 bases
`Typical DNA requirements | 50-1000 ng | 100-1000 ng | ~1 μg | 50-1000 ng | 50-1000 ng
`* All cost calculations are based on list price quotations obtained from the manufacturer and assume expected sequence yield stated.
`** System price including PGM, server, OneTouch and OneTouch ES.
`*** Includes two hours of cluster generation.
`**** Mean mapped read length includes adapter and reverse strand sequences. Subread lengths, i.e. the individual stretches of sequence originating from the
`sequenced fragment, are significantly shorter.
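The "Reported accuracy" and "Observed raw error rate" rows of Table 1 can be related through the standard Phred definition, Q = -10 log10(P), where P is the per-base error probability. The sketch below only illustrates that conversion; the function names are illustrative and the calculation is not part of the authors' analysis.

```python
import math


def phred_to_error_prob(q):
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p):
    """Phred quality score equivalent to a per-base error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Q20 and Q30 correspond to 1% and 0.1% error, respectively.
q20_error = phred_to_error_prob(20)
q30_error = phred_to_error_prob(30)
# The observed PacBio raw error rate of 12.86% is equivalent to a score just below Q9.
pacbio_q = error_rate_to_phred(0.1286)
```

On this scale the observed 12.86% PacBio error rate sits below Q10, in line with the reported accuracy given for that platform.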
`
`Workflow
`All the platforms have library preparation protocols that
`involve fragmenting genomic DNA and attaching spe-
`cific adapter sequences. Typically this takes somewhere
`between 4 and 8 hours for one sample. In addition, the
`Ion Torrent template preparation has a two hour emul-
`sion PCR and a template bead enrichment step.
`In the battle to become the platform with the fastest
`turnaround time, all the manufacturers are seeking to
`streamline library preparation protocols. Life Technolo-
`gies have developed the Ion Xpress Fragment Library Kit
`that has an enzymatic “Fragmentase” formulation for
`shearing starting DNA, thereby avoiding the labour of
`physical shearing and potentially enabling complete li-
`brary automation. We tested this kit on our four gen-
`omes alongside the standard library kit with physical
`shearing and found both to give equal genomic repre-
`sentation (see Additional file 2: Figure S1 for results
`obtained with P. falciparum). Illumina purchased Epi-
`centre in order to package the Nextera technology with
`the MiSeq. Nextera uses a transposon to shear genomic
`DNA and simultaneously introduce adapter sequences
`[6]. The Nextera method can produce sequencing ready
`DNA in around 90 minutes and gave us remarkably even
`genome representation (Additional file 2: Figures S2 and
`S3) with B. pertussis and S. aureus,
`but produced a very biased sequence dataset from
`the extremely AT-rich P. falciparum genome.
`
`Genome coverage and GC bias
`To analyse the uniformity of coverage across the genome
`we tabulated the depth of coverage seen at each position
`of the genome. We utilized the coverage plots described
`by Lam et al. [7], which depict the percentage of the
`genome that is covered at a given read depth and the genome
`coverage at different read depths, for each
`dataset (Figure 1), alongside the ideal theoretical cover-
`age that would be predicted based on Poisson behaviour.
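The ideal curve can be sketched as follows, assuming that read depth at each position follows a Poisson distribution whose mean equals the normalized 15x coverage. The function names are illustrative and are not taken from Lam et al. or from the authors' scripts.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean), computed in log space for numerical stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def fraction_covered_at_least(depth, mean):
    """Expected fraction of genome positions covered by at least `depth` reads."""
    return 1.0 - sum(poisson_pmf(k, mean) for k in range(depth))


# Expected fraction of positions with no coverage at 15x mean depth: exp(-15).
uncovered_at_15x = poisson_pmf(0, 15.0)
```

Under this model only about 3 in 10 million positions would be expected to have zero coverage at 15x, which gives a sense of how far an uncovered fraction of tens of percent departs from ideal behaviour.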
`In the context of the GC-rich genome of B. pertussis,
`most platforms gave similar uniformity of sequence
`coverage, with the Ion Torrent data giving slightly more
`uneven coverage. In the S. aureus genome the PGM
`performed better. The PGM gave very biased coverage
`when sequencing the extremely AT-rich P. falciparum
`genome (Figure 1). This effect was also evident when
`we plotted coverage depth against GC content (Additional
`file 2: Figure S4). Whilst the PacBio platform gave a
`sequence dataset with quite even coverage on GC and
`extremely AT-rich contexts, it did demonstrate slight but
`noticeable unevenness of coverage and bias towards GC-
`rich sequences with the S. aureus genome. With the GC-
`neutral S. Pullorum genome all platforms gave equal
`coverage with unbiased GC representation (data not
`shown).
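Plots of coverage against GC content rely on a per-window GC fraction. A minimal sketch of that calculation is given below; the function names and the choice of non-overlapping windows are assumptions made for illustration, not a description of the software used in this study.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of seq, or None if there are none."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return None
    return (seq.count("G") + seq.count("C")) / unambiguous


def windowed_gc(seq, window=50):
    """GC fraction of consecutive, non-overlapping windows along seq."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]
```

Pairing each window's GC fraction with the mean read depth over the same window gives the points for a coverage versus GC plot.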
`
`The most dramatic observation from our results was
`the severe bias seen when sequencing the extremely AT-
`rich genome of P. falciparum on the PGM. The result of
`this was deeper than expected coverage of the GC-rich
`var and subtelomeric regions and poor coverage within
`introns and AT-rich exonic segments (Figure 2), with ap-
`proximately 30% of the genome having no sequence
`coverage whatsoever. This bias was observed with librar-
`ies prepared using both enzymatic and physical shearing
`(Additional file 2: Figure S1).
`In a recent study to investigate the optimal enzyme
`for next generation library preparation [5], we found
`that the enzyme used for fragment amplification during
`next generation library preparation can have a signifi-
`cant influence on bias. We found the enzyme Kapa
`HiFi amplifies fragments with the least bias, giving
`even coverage, close to that obtained without amplifi-
`cation. Since the PGM has two amplification steps,
`one during library preparation and the other emulsion
`PCR (emPCR) for template amplification, we reasoned
`that this might be the cause of the observed bias. Sub-
`stituting the supplied Platinum Taq enzyme with Kapa
`HiFi for the nick translation and amplification step
`during library preparation profoundly reduced the
`observed bias (Figure 3). We were unable to further
`improve this by use of Kapa HiFi for the emPCR
`(results not shown).
`Of the four genomes sequenced, the P. falciparum
`genome is the largest and most complex and contains a
`significant quantity of repetitive sequences. We used P.
`falciparum to analyse the effect of read length versus
`mappability. As the PacBio pipeline doesn’t generate a
`mapping quality value and to ensure a fair comparison,
`we remapped the reads of all technologies using the k-
`mer based mapper, SMALT [9], and then analysed cover-
`age across the P. falciparum genome (Additional file 3:
`Table S4). This data confirms the poor performance of
`Ion Torrent on the P. falciparum genome, as only 65%
`of the genome is covered with high quality (>Q20) reads
`compared to ~98-99% for the other platforms. Whilst
`the mean mapped read length of the PacBio reads with
`this genome was 1336 bases, average subread length (the
`length of sequence covering the genome) is significantly
`less (645 bases). The short average subread length is due
`to preferential loading of short fragment constructs in
`the library and the effect of lag time (non-imaged bases)
`after sequencing initiation, the latter resulting in
`sequences near the beginning of library constructs not
`being reported.
`As the median length of the PacBio subreads for this
`data set is just 600 bases, we compared their coverage
`with an equivalent amount of in silico filtered reads of
`>620 bases. This led to a very small decrease in the
`percentage of bases covered. Using paired reads on the
`
`Figure 1 Genome coverage plots for 15x depth randomly downsampled sequence coverage from the sequencing platforms tested.
`A) The percentage of the B. pertussis genome covered at different read depths; B) The number of bases covered at different depths for B.
`pertussis; C) The percentage of the S. aureus genome covered at different read depths; D) The number of bases covered at different depths for
`S. aureus; E) The percentage of the P. falciparum genome covered at different read depths; and F) The number of bases covered at different
`depths for P. falciparum.
`
`Illumina MiSeq, however, gave a strong positive effect,
`with 1.1% more coverage being observed from paired-
`end reads compared to single-end reads.
`
`Error rates
`We observed error rates of below 0.4% for the Illumina
`platforms, 1.78% for Ion Torrent and 13% for PacBio se-
`quencing (Table 1). The proportion of error-free reads,
`without a single mismatch or indel, was 76.45%, 15.92%
`and 0% for MiSeq, Ion Torrent and PacBio, respectively.
`The error heatmap in Figure 2A shows that the PacBio
`errors are distributed evenly over the chromosome. We
`manually inspected the regions where Ion Torrent and
`Illumina generated more errors. Illumina produced errors
`after long (> 20-base) homopolymer tracts [10]
`(Figure 4A).
`Also evident in the MiSeq data were strand errors due
`to the GGC motif [11]. Following the finding that the
`motif GGC generates strand-specific errors, we analyzed
`this phenomenon in the MiSeq data for P. falciparum
`(Additional file 4: Table S5). We observed that the error
`is mostly generated by GC-rich motifs, principally
`GGCGGG. We found no evidence for an error if the
`triplet after the GGC is AT-rich. Other MiSeq datasets
`also showed this artifact (data not shown). In addition to
`this being a strand-specific issue, it appears that this is a
`read-specific phenomenon. Whilst there is a quality drop
`in the first read following these GC-rich motifs, there is
`
`[Figure 2 image: panels A) to D); coverage tracks for PacBio, PGM, GAII, HiSeq and MiSeq; graph axes %GC and coverage depth.]
`
`Figure 2 Artemis genome browser [8] screenshots illustrating the variation in sequence coverage of a selected region of P. falciparum
`chromosome 11, with 15x depth of randomly normalized sequence from the platforms tested. In each window, the top graph shows the
`percentage GC content at each position, with the numbers on the right denoting the minimum, average and maximum values. The middle
`graph in each window is a coverage plot for the dataset from each instrument; the colour code is shown above graph a). Each of the middle
`graphs shows the depth of reads mapped at each position, and below that in B-D are the coordinates of the selected region in the genome with
`gene models on the (+) strand above and (−) strand below. A) View of the first 200 kb of chromosome 11. Graphs are smoothed with window
`size of 1000. A heatmap of the errors, normalized by the amount of mapping reads is included just below the GC content graph (PacBio top line,
`PGM middle and MiSeq bottom). B) Coverage over region of extreme GC content, ranging from 70% to 0%. C) Coverage over the gene
`PF3D7_1103500. D) Example of intergenic region between genes PF3D7_1104200 and PF3D7_1104300. The window size of B, C and D is 50 bp.
`
`a striking loss of quality in read 2, where the reads have
`nearly half the mean quality value compared to the read
`1 reads for GC-rich triplets that follow the GGC motif.
`We could observe this low quality in read 2 in all our
`analysed Illumina lanes. For AT-rich motifs the ratio is
`nearly 1 (1.03).
`Ion Torrent didn’t generate reads at all for long (> 14-
`base) homopolymer tracts, and could not predict the
`correct number of bases in homopolymers >8 bases
`long. Very few errors were observed following short
`homopolymer stretches in the MiSeq data (Figure 4B).
`Additionally, we observed strand-specific errors in the
`PGM data but were unable to associate these with any
`obvious motif (Figure 4C).
`
`SNP calling
`In order to determine whether or not the higher error
`rates observed with the PGM and PacBio affected their
`ability to call SNPs, we aligned the reads from the S.
`aureus genome, for which all platforms gave good se-
`quence representation, against the reference genome of
`the closely related strain USA300_FPR3757 [12], and
`
`compared the SNPs called against those obtained by
`aligning the reference sequences of the two genomes
`(Figure 5 and Additional file 5: Table S6). In order to
`create a fair comparison we initially used the same ran-
`domly normalized 15x datasets used in our analysis of
`genome coverage, a depth that according to the literature [3]
`is sufficient to accurately call heterozygous variants, but
`found that this was insufficient for the PacBio datasets,
`for which 190x coverage was used instead.
`Overall the rate of SNP calling was slightly higher for
`the Ion Torrent data than for Illumina data (chi square
`p value 3.15E-08), with approximately 82% of SNPs
`being correctly called for the PGM and 68-76% of the
`SNPs being detected from the Illumina data (Figure 5A).
`Conversely, the rate of false SNP calls was higher with
`Ion Torrent data than for Illumina data (Figure 5B). SNP
`calling from PacBio data proved more problematic, as
`existing tools are optimized for short-read data and not
`for high error-rate long-read data. We were reliant on
`SNPs called by the SMRT portal pipeline for this ana-
`lysis. Our results showed that SNP detection from Pac-
`Bio data was not as accurate as that from the other
`
`Figure 3 The effect of substituting Platinum HiFi PCR supermix with Kapa HiFi in the PGM library prep amplification step.
`A) The percentage of the P. falciparum genome covered at different read depths. The blue line shows the data obtained with the recommended
`Platinum enzyme and the green line with Kapa HiFi. The red line depicts ideal coverage behavior. B) The number of bases covered at different
`depths. C) Sequence representation vs. GC-content plots.
`
`platforms, with overall only 71% of SNPs being detected
`and 2876 SNPs being falsely called (Additional file 5:
`Table S6).
`Amongst the datasets obtained from the Illumina
`sequencers, the percentage of correct SNP calls was
`higher for the MiSeq (76%) than for the GAIIx (70%)
`or the HiSeq (69%),
`despite the same libraries being run on both MiSeq
`and HiSeq. The use of Nextera library preparation
`gave similar results with 76% of SNPs being correctly
`called. It should be noted that we found the inbuilt
`automatic variant calling inadequate on both MiSeq
`and PGM, with MiSeq reporter calling just 6.6% of
`variants and Torrent suite 1.5.1 calling only 1.4% of
`variants.
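The two quantities reported in this section, the percentage of true SNPs detected and the number of false SNP calls, can be computed from a reference SNP set and a called SNP set as sketched below. The representation of a SNP as a (position, alternative base) tuple and the function name `snp_call_metrics` are assumptions made for illustration; this is not the authors' evaluation script.

```python
def snp_call_metrics(true_snps, called_snps):
    """Return (percentage of true SNPs detected, number of false SNP calls).

    Each SNP is represented as a hashable item, e.g. a (position, alt_base) tuple.
    """
    true_snps = set(true_snps)
    called_snps = set(called_snps)
    detected = true_snps & called_snps
    false_calls = called_snps - true_snps
    pct_detected = 100.0 * len(detected) / len(true_snps) if true_snps else 0.0
    return pct_detected, len(false_calls)


# Toy example: two of four true SNPs are recovered and two calls are false.
truth = {(10, "A"), (20, "T"), (30, "G"), (40, "C")}
calls = {(10, "A"), (20, "T"), (30, "C"), (99, "G")}
pct, n_false = snp_call_metrics(truth, calls)
```

Requiring both position and base to match means that a call at the correct position with the wrong base counts as a false call as well as a missed SNP; a looser, position-only comparison would give different figures.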
`
`Discussion
`A key feature of these new platforms is their speed. De-
`creasing run time has clear advantages particularly
`within the clinical sequencing arena, but poses chal-
`lenges in itself. Whilst manufacturers may state library
`prep times on the order of a couple of hours, these times
`don’t include upfront QC and library QC and quantification.
`Also, typical library prep times quoted usually
`apply to processing of only one sample; i.e., pipetting
`time is largely ignored. Purchasers of sequencing instru-
`ments will want to keep them running at full utilization,
`in order to maximize their investment and will also want
`to pool multiple samples on single runs for economic
`
`reasons. To obtain maximum throughput, users must
`consider the whole process, potentially investing in an-
`cillary equipment and robotics to create an automated
`pipeline for the preparation of large numbers of samples.
`To process large numbers of samples quickly, a facility’s
`instrument base must be large enough to avoid sample
`backlogs. With this in mind, manufacturers are seeking
`to develop more streamlined sample-prep protocols that
`will facilitate timely sample loading. Here we have tested
`two such developments: enzymatic fragmentation and
`the Nextera technique. We conclude that these methods
`can be very useful, but users must carefully evaluate the
`methods they use for their particular applications and
`for use with genomes of extreme base composition to
`avoid bias.
`Whilst the data generated using the Ion Torrent PGM
`platform has a higher raw error rate (~1.8%) than Illu-
`mina data (<0.4%), provided there is sufficient coverage,
`the representation and ability to call SNPs are quite
`closely matched between these technologies, with more
`true positives being called from PGM data but far fewer
`false positives from the Illumina data. Detection of SNPs
`using PacBio data was not as accurate; the use of single-
`molecule sequencing to detect low level variants and
`quasispecies within populations remains unproven. We
`have found PacBio’s long reads useful for scaffolding de
`novo assemblies, though our experience suggests that
`this is currently not fully optimized and extensive
`method development is still required.
`
`Figure 4 Illustration of platform-specific errors. The panels show Artemis BAM views with reads (horizontal bars) mapping to defined regions
`of chromosome 11 of P. falciparum from PacBio (P; top), Ion Torrent (I; middle) and MiSeq (M; bottom). Red vertical dashes are 1 base differences
`to the reference and white points are indels. A) Illustration of errors in Illumina data after a long homopolymer tract. Ion torrent data has a drop
`of coverage and multiple indels are visible in PacBio data. B) Example of errors associated with short homopolymer tracts. Multiple insertions are
`visible in the PacBio Data, deletions are observed in the PGM data and the MiSeq sequences read generally correct through the homopolymer
`tract. C) Example of strand specific deletions (red circles) observed in Ion Torrent data.
`
`Interestingly, the mappability didn’t increase signifi-
`cantly with longer reads, although a beneficial effect was
`obtained from having mate-pair information. Current
`PacBio protocols favor the preferential loading of smaller
`constructs, resulting in average subread lengths that are
`significantly shorter than the often quoted average read
`lengths. Further development is therefore required to
`avoid excess short fragments and adapter-dimer
`constructs in the library and to reduce their loading
`efficiency into the ZMWs.
`Whilst one would normally use higher coverage than
`used here for confident SNP detection (i.e., 30-40x
`depth), we were limited to 15x depth due to the yield of
`some of the platforms. Nonetheless, at least for the hap-
`loid genome, S. aureus, 15x coverage should be a reason-
`able quantity for SNP detection and even in the human
`genome, 15x coverage has been shown to be sufficient to
`accurately call heterozygous SNPs [3].
`Variant calling is a highly subjective process; the par-
`ticular software chosen as well as the specific parameters
`
`employed to make the predictions will change the results
`substantially. As such, the rates of both true SNP and
`false positive calling provided here are purely indicative
`and results obtained with each sequencing platform will
`vary. For any particular application using a specific se-
`quencing method, optimisation of the SNP- and indel-
`calling algorithm would always be recommended.
`We sequence many isolates of the malaria parasite P.
`falciparum as it represents a significant health issue in
`developing countries; this organism leads to several million
`deaths per annum. There are several active large sequencing
`programs (e.g. MalariaGEN [13]) that are
`currently aiming to sequence thousands of clinical mal-
`aria samples. As the malaria genome has a GC content
`of only 19.4% [14], we use it as one of our test genomes,
`representing a significant challenge to most sequencing
`technologies. Based on the present study, use of Illumina
`sequencing technology with libraries prepared without
`amplification [4] leads to the least biased coverage across
`this genome. Ion Torrent semiconductor sequencing is
`
`[Figure 5 image: bar charts. A) Percentage of correctly called true SNPs; B) Number of incorrect SNP calls. Datasets: Ion Torrent; Ion Torrent, SAMtools; GAII; MiSeq; HiSeq; Nextera library on MiSeq; PacBio 190x. Series: all SNPs, and excluding mobile elements and indels.]
`Figure 5 Accuracy of SNP detection from the S. aureus datasets generated from each platform, compared against the reference
`genome of its close relative S. aureus USA300_FPR3757. Both the Torrent server variant calling pipeline and SAMtools were used for Ion
`Torrent data; SAMtools was used for Illumina data and SMRT portal pipeline for PacBio data. A) The percentage of SNPs detected using each
`platform overall (blue bar), and outside of repeats, indels and mobile genetic elements (red bar). B) The number of incorrect SNP calls for each
`platform overall (blue bar), and outside of repeats, indels and mobile genetic elements (red bar).
`
`not recommended for sequencing of extremely AT-rich
`genomes, due to the severe coverage bias observed. This
`is likely to be an artifact introduced during amplification.
`Therefore, avoidance of library amplification and/or
`emPCR, or use of more faithful enzymes during emPCR,
`may eliminate the bias.
`Illumina sequencing has matured to the point where
`a great many applications [15-24] have been developed
`on the platform. Since the PGM is also a massively
`parallel short-read technology, many of those applica-
`tions should translate well and be equally practicable.
`There are a few obvious exceptions; techniques
`involving manipulations on the flow cell such as FRT-
`seq [21] and OS-Seq [22] will be impossible using
`
`semiconductor sequencing. Also, the Ion Torrent plat-
`form currently employs fragment lengths of 100 or 200
`bases; without a mate-pair type library protocol, these
`insert sizes are perhaps too short to enable accurate
`de novo assemblies such as that demonstrated using
`ALLPATHS-LG for mammalian genomes using Illu-
`mina data [25]. Conversely, Illumina sequencing on
`the HiSeq or MiSeq instruments requires heteroge-
`neous base composition across the population of
`imaged clusters [26]. In order to sequence monotem-
`plates (where most sequenceable fragments have
`exactly the same sequence), it is often necessary to
`significantly dilute or mix the sample with a complex
`genomic library to enable registration of clusters.
`Semiconductor sequencing does not suffer this problem.
`The DNA-input requirements of PacBio can be
`prohibitive. Illumina and PGM library preparation can be
`performed with far less DNA; the standard PGM
`Ion Xpress library prep requires just 100 ng DNA and
`Illumina sequencing has been performed from
`sub-nanogram quantities [27]. The yield, sample-input
`requirements and amplification-free library prep of
`PacBio potentially make it unsuitable for counting
`applications and for applications involving significant
`prior enrichment such as exome sequencing [15] and
`ChIP-seq [18]. The PacBio platform, by virtue of its
`long read lengths, should however have application in
`de novo sequencing and may also benefit analysis of
`linkage of alternative splicing and of variants across
`long amplicons. Furthermore, the potential for direct
`detection of epigenetic modifications has been
`demonstrated [28].
`Finally, it should be noted that this study represents a
`point in time, utilising kits and reagents available up
`until the end of 2011. Ion Torrent and Pacific Bios-
`ciences are relatively new sequencing technologies that
`have not had time to mature in the same way that the
`Illumina technology has. We anticipate that whilst some
`of the issues identified may be intrinsic, others will be
`resolved as these platforms evolve.
`
`Conclusion
`The limited yield and high cost per base currently pro-
`hibit large scale sequencing projects on the Pacific Bios-
`ciences instrument. The PGM and MiSeq are quite
`closely matched in terms of utility and ease of workflow.
`The decision on whether to purchase one or the other
`will hinge on available resources, existing infrastructure
`and personal experience, available finances and the type
`of applications being considered.
`
`Methods
`Genomic DNA
`P. falciparum 3D7 genomic DNA was a gift from Prof
`Chris Newbold, University of Oxford, UK. Bordetella
`pertussis ST24 genomic DNA was a gift from Craig
`Cummings, Stanford University School of Medicine, CA.
`Staphylococcus aureus TW20 genomic DNA was a gift
`from Jodi Lindsay, St George’s Hospital Medical School,
`University of London. S. Pullorum S449/87 genomic
`DNA was prepared at the Wellcome Trust Sanger Insti-
`tute, UK.
`
`Illumina library construction
`DNA (0.5 μg in 120 μl of 10 mM Tris–HCl pH8.5) was
`sheared in an AFA microtube using a Covaris S2 device
`
`(Covaris Inc.) with the following settings: Duty cycle 20,
`Intensity 5, cycles/burst 200, 45 seconds.
`Sheared DNA was purified by binding to an equal vol-
`ume of Ampure beads (Beckman Coulter Inc.) and
`eluted in 32 μl of 10 mM Tris–HCl, pH8.5. End-repair,
`A-tailing and paired-end adapter ligation were performed
`(as per the protocols supplied by Illumina, Inc.,
`using reagents from New England Biolabs, NEB) with
`purification using a 1.5:1 ratio of standard Ampure to
`sample between each enzymatic reaction. PCR-free
`libraries were constructed according to Kozarewa et al.
`[4]. After ligation, excess adapters and adapter dimers
`were removed using two Ampure clean-ups, first with a
`1.5:1 ratio of standard Ampure to sample, followed by a
`0.7:1 ratio of Ampure beads. PCR-free libraries were
`then used as is. Libraries prepared with amplification
`were diluted to 2 ng/μl