`Chilamakuri et al. BMC Genomics 2014, 15:449
`http://www.biomedcentral.com/1471-2164/15/449
`
RESEARCH ARTICLE
Open Access

Performance comparison of four exome capture systems for deep sequencing
`Chandra Sekhar Reddy Chilamakuri1,3*, Susanne Lorenz1,3,4, Mohammed-Amin Madoui1,4, Daniel Vodák1,5,
`Jinchang Sun1,3,4, Eivind Hovig1,2,3,5, Ola Myklebost1,3 and Leonardo A Meza-Zepeda1,3,4*
`
`Abstract
`
`Background: Recent developments in deep (next-generation) sequencing technologies are significantly impacting
`medical research. The global analysis of protein coding regions in genomes of interest by whole exome sequencing
`is a widely used application. Many technologies for exome capture are commercially available; here we compare
`the performance of four of them: NimbleGen’s SeqCap EZ v3.0, Agilent’s SureSelect v4.0, Illumina’s TruSeq Exome,
`and Illumina’s Nextera Exome, all applied to the same human tumor DNA sample.
`Results: Each capture technology was evaluated for its coverage of different exome databases, target coverage
`efficiency, GC bias, sensitivity in single nucleotide variant detection, sensitivity in small indel detection, and technical
`reproducibility. In general, all technologies performed well; however, our data demonstrated small, but consistent
`differences between the four capture technologies. Illumina technologies cover more bases in coding and
`untranslated regions. Furthermore, whereas most of the technologies provide reduced coverage in regions with
`low or high GC content, the Nextera technology tends to bias towards target regions with high GC content.
`Conclusions: We show key differences in performance between the four technologies. Our data should help
`researchers who are planning exome sequencing to select appropriate exome capture technology for their
`particular application.
`Keywords: Exome capture technology, Next-generation sequencing, Coverage efficiency, Enrichment efficiency,
`GC bias, Single nucleotide variant, Indel
`
`Background
In general, it remains prohibitively expensive to analyze
whole genomes for population-scale studies, even though the
cost of whole genome sequencing has fallen significantly
[1]. As an alternative, the targeted resequencing of subsets
of a genome is more feasible. The most widely used approach
captures most of the protein coding region of a genome (the
exome), which makes up about 1% of the human genome, and has
become a routine technique in clinical and basic research
[2-5]. Exome sequencing offers definite advantages over whole
genome sequencing: it is significantly less expensive, easier
to interpret functionally, significantly faster to analyze,
and produces a dataset that is easier to manage. Multiple
technologies have emerged for the enrichment of target
regions of interest, as the demand for targeted resequencing
has increased over time. Broadly, these technologies can be
classified into two groups: chip-based exome capture and
solution-based exome capture. Chip-based exome capture was
the first to be developed [6,7], but required large amounts
of input DNA, and was quickly replaced by more efficient
solution-based capture systems. There are currently four
major solution-based human exome capture systems available:
Agilent’s SureSelect Human All Exon, NimbleGen’s SeqCap EZ
Exome Library [8], Illumina’s TruSeq Exome Enrichment, and
Illumina’s Nextera Exome Enrichment [9].

* Correspondence: chichi@rr-research.no; Leonardo.Meza-Zepeda@rr-research.no
1Department of Tumor Biology, Oslo University Hospital, Norwegian Radium
Hospital, 0310 Oslo, Norway
3Norwegian Cancer Genomics Consortium, Oslo, Norway
Full list of author information is available at the end of the article

Exome capture involves the capture of protein coding regions
by hybridization of genomic DNA to biotinylated
oligonucleotide probes (baits). These technologies use
biotinylated DNA or RNA baits complementary to targeted
exons, which are hybridized to genomic fragment libraries.
Magnetic streptavidin beads are used to selectively pull
down and enrich baits with bound targeted regions. The
sample preparation methods are highly similar across the
different technologies. The major differences between the
technologies correspond to the choice of their respective
target regions, bait lengths, bait density, molecules used
for capture, and genome fragmentation method (Table 1).

© 2014 Chilamakuri et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public
Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this
article, unless otherwise stated.

Clark et al. compared three capture technologies and
showed that NimbleGen technology required the fewest
reads to sensitively detect small variants,
whereas Agilent and Illumina technologies appeared to
detect a higher total number of variants with additional
reads [10]. In another study, Sulonen et al. compared
NimbleGen and Agilent technologies, and showed that
there were no major differences between the two
technologies, except that NimbleGen showed greater
efficiency in covering the exome with a minimum of 20×
coverage [11]. Asan et al. compared NimbleGen
`Sequence Capture Array, NimbleGen SeqCap EZ, and
`Agilent SureSelect, and showed that all three technolo-
`gies achieved a similar accuracy of genotype assignment
`and single nucleotide polymorphism (SNP) detection,
`and had similar levels of reproducibility and GC bias
`[12]. In another exome capture comparison study, Parla et
`al. showed that both NimbleGen SeqCap EZ Exome
`Library SR and Agilent SureSelect All Exon were similar to
`each other in performance, and able to capture most of the
`human exons targeted by their probe sets. However, they
`failed to cover a noteworthy percentage of the exons in the
`consensus coding sequence database (CCDS) [13].
`During the past few years, substantial updates have
`been made to the different capture technologies, includ-
`ing new content and improved probe design. For
`
instance, NimbleGen’s SeqCap EZ exome library v2.0
targets approximately 44 Mb of the genome, whereas the
next version, EZ exome library v3.0, targets 64.1 Mb. The
`new Illumina Nextera capture technology has to the best of
`our knowledge not been tested extensively vis-à-vis other
`technologies.
`The lack of a clear consensus from previous studies,
`updates in three major capture technologies, and the im-
`portant new Illumina Nextera capture technology, using
`an entirely different strategy, motivated us to perform a
`detailed comparative analysis before initiating a major
`exome sequencing project.
`We, therefore, systematically compared four exome cap-
`ture technologies, NimbleGen’s SeqCap EZ exome library
`v3.0, Agilent SureSelect Human all exon V4, Illumina
`TruSeq and Illumina Nextera, with respect to features such
`as design differences relative to coverage efficiency, GC
`bias, and variant discovery.
`
`Results
`Distinctive features of four exome capture technologies
`There are considerable differences between the four ex-
`ome capture technologies, as shown in Table 1. Illumina
`TruSeq and Nextera technologies are identical in many
`characteristics, except that Nextera uses transposomes
`for fragmentation, whereas TruSeq fragments the DNA
`by ultrasonication. The Agilent technology uses RNA
`molecules as probes, whereas all the other technologies
use DNA as probe molecules. NimbleGen presents the
highest number of probes and is the only technology
with an overlapping probe design, which gives it the
highest probe density of the four. Agilent
`
Table 1 Exome capture technology designs

Feature | NimbleGen | Agilent | Illumina TruSeq | Illumina Nextera
Bait type | DNA | RNA | DNA | DNA
Bait length range (bp) | NP | 114–126 | 95 | 95
Median bait length (bp) | NP | 119 | 95 | 95
Number of baits | NP | 554,079 | 347,517 | 347,517
Total bait length (Mb) | NP | 66.48 | 33.01 | 33.01
Target length range (bp) | 59–742 | 114–21,747 | 2–37,917 | 2–37,917
Median target length (bp) | 171 | 200 | 135 | 135
Number of targets | 368,146 | 185,636 | 201,071 | 201,071
Total target length (Mb) | 64.19 | 51.18 | 62.08 | 62.08
Fragmentation method | Ultrasonication | Ultrasonication | Ultrasonication | Transposomes
Species | Human, mouse, 3 plant species | Human, mouse, 14 other species custom | Human | Human
Costs | $ | $ | $$ | $$

Automation, Throughput and Flexibility (cell values in source order; their assignment to rows and columns is not recoverable from the extracted text): ++, +++, ++, +++, Custom available, Custom available, ++, +++, +++, +++, Custom available

Some NimbleGen information was not provided, indicated by NP. Relative automation and throughput are indicated by “+” symbols; more symbols
indicate easier automation and higher throughput. Relative cost is indicated by “$” symbols; more symbols indicate a higher price.
`
`
probes are non-overlapping, but lie directly adjacent to
one another. The Illumina technologies, on the other
hand, use a gapped probe approach. The technologies
also differ in the regions they target, and in the total
number of bases targeted: NimbleGen targets 64.1 Mb,
Agilent targets 51.1 Mb, and TruSeq and Nextera target
62.08 Mb of the human genome.
`
`Interestingly, only 26.2 Mb of the total targeted bases
`are common among all exome capture technologies
`(Figure 1A). Of the four, NimbleGen and Agilent technolo-
`gies have the most in common, sharing almost 40 Mb of
`targeted sequences. Illumina has 22.5 million unique target
`bases, followed by NimbleGen with 16.1 million bases, and
`Agilent with 7 million unique bases.
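
Overlap figures of this kind reduce to interval arithmetic on the target coordinate files. The following is a minimal sketch, assuming each design is available as a list of half-open (start, end) intervals on a single chromosome. The coordinates and variable names are illustrative and are not taken from the suppliers' files.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED files


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge those that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions present in both interval lists."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def total_bases(intervals: List[Interval]) -> int:
    """Number of distinct bases covered by the intervals."""
    return sum(end - start for start, end in merge(intervals))


# Toy targets on one chromosome (hypothetical coordinates).
nimblegen = [(100, 200), (300, 400)]
agilent = [(150, 250), (350, 500)]
illumina = [(180, 360)]

shared = intersect(intersect(nimblegen, agilent), illumina)
print(shared, total_bases(shared))
```

On real designs the same logic would be applied per chromosome, and the bases unique to one design are its total minus the bases it shares with any other design.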
`
`Figure 1 Venn diagram showing the overlap between different features. A) Overlap among Agilent, NimbleGen and Illumina capture
`targets. B) Overlap among RefSeq, CCDS, and ENSEMBL protein coding exon databases. Coverage of exome capture technology for C) CCDS
`coding exons, D) RefSeq coding exons, E) ENSEMBL coding exons, and F) RefSeq UTRs.
`
`
`Many different RNA databases are available, such as
`RefSeq [14] and Ensembl [15], which differ in the num-
`ber of non-coding RNAs and total number of exons re-
`ported, as well as the start and end coordinates of exons.
Significant portions of the sequences are common
among the different databases (Figure 1B). CCDS contains
protein-coding sequences with high quality annotations
[16]. RefSeq and CCDS share a greater proportion
`of bases with each other, whereas Ensembl possesses
`more unique bases (2.19 million) than the other two da-
`tabases. We investigated the coverage of RefSeq (coding
`and UTR), Ensembl (coding) and CCDS (coding).
`Illumina covers a greater portion of coding exon bases
`across all the databases, followed by NimbleGen and
Agilent (Figure 1C–E). There are 32.11 Mb common
across the three databases, but only about 24 Mb are
covered by all four technologies. The majority of
Illumina-specific bases (22.5 Mb) target untranslated
regions (UTRs) (Figure 1F), whereas NimbleGen and Agilent
target 9.5 Mb and 5.6 Mb of UTRs, respectively.
`
`Sequencing, sequence alignment, and read filtering
`To evaluate each technology, two independent exome
`libraries derived from the tumor tissue of an osteosarcoma
`sample were sequenced twice (technical replicates). The
`exome library for each technology was prepared according
`to each supplier’s recommended protocol. On average,
`136.8 million reads were generated for each technology,
varying between 95.8 and 185.1 million reads. The read
alignment rate also varied slightly among technologies:
97.4% for TruSeq, 97.7% for NimbleGen, 97.6% for Agilent,
and 98.95% for Nextera (Figure 2A). Mapped reads from each
library were further
`
`filtered for duplicates, multiple mappers, improper pairs,
`and off-target reads. Large variation was observed for the
`percentage of pass-filter mapped reads, with Agilent being
`the highest at 71.7% retained reads, NimbleGen next at
`66.0%, TruSeq at 54.8%, and Nextera at 40.1% (Figure 2A).
`We further examined the number of reads filtered out in
`each of the four steps (Figure 2B). For all the technologies,
`the greatest number of reads lost was due to the number
`of reads mapped to non-targeted regions (off-target reads).
`Agilent showed a slightly higher percentage of off-target
`reads and the fewest reads mapping to multiple sites.
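
The four filtering steps can be pictured as a sequence of predicates applied in order, with the number of reads lost recorded at each step. The sketch below is only an illustration of that bookkeeping. The read fields (`duplicate`, `unique`, `proper_pair`, `on_target`) are hypothetical names, and a real pipeline would derive them from alignment flags and the target coordinates.

```python
from typing import Callable, Dict, List, Tuple

Read = Dict[str, bool]

# Each step is (label, predicate returning True when the read is kept).
# Field names are hypothetical and chosen for this example only.
FILTER_STEPS: List[Tuple[str, Callable[[Read], bool]]] = [
    ("duplicates", lambda r: not r["duplicate"]),
    ("multiple mappers", lambda r: r["unique"]),
    ("improper pairs", lambda r: r["proper_pair"]),
    ("off-target", lambda r: r["on_target"]),
]


def filter_reads(reads: List[Read]) -> Tuple[List[Read], Dict[str, int]]:
    """Apply the filters in order and count reads removed by each one."""
    removed: Dict[str, int] = {}
    kept = list(reads)
    for label, keep in FILTER_STEPS:
        before = len(kept)
        kept = [r for r in kept if keep(r)]
        removed[label] = before - len(kept)
    return kept, removed


def make_read(duplicate=False, unique=True, proper_pair=True, on_target=True) -> Read:
    """Build a toy read record."""
    return {
        "duplicate": duplicate,
        "unique": unique,
        "proper_pair": proper_pair,
        "on_target": on_target,
    }


reads = (
    [make_read() for _ in range(6)]
    + [make_read(duplicate=True)]
    + [make_read(unique=False)]
    + [make_read(proper_pair=False)]
    + [make_read(on_target=False) for _ in range(3)]
)

kept, removed = filter_reads(reads)
pass_filter_pct = 100.0 * len(kept) / len(reads)
print(removed, pass_filter_pct)
```

Because the filters are applied sequentially, a read failing several criteria is counted only under the first step that removes it, so the per-step counts depend on the order chosen.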
`
`Target coverage efficiency differs among four
`technologies
`We used the methods described by Clark et al. [10] to
`investigate target coverage efficiency. We evaluated
`coverage efficiency by calculating base coverage over 1)
`all intended target bases, 2) common bases among the
`four technologies, 3) Ensembl exons, 4) RefSeq exons,
`and 5) CCDS exons, using 50 million randomly chosen
`reads for each technology. Target coordinates were
downloaded from the suppliers’ websites. It is worth
noting that TruSeq and Nextera, both supplied by
Illumina, use the same capture baits. At this level of
reads, the fractions of targets covered at least once
varied somewhat: the Agilent technology captured 99.8%,
the Nextera technology captured 98.2%, the TruSeq
technology captured 96.9%, and the NimbleGen technology
captured 96.5% of the intended targets (Figure 3A). The 1× coverage number
`provides the fraction of the target that can potentially be
`covered by the respective designs. Not surprisingly, all
`the technologies give high coverage of their respective
`target regions, with the Agilent technology giving high-
`est coverage (99.8%). The number of intended target
`
`Figure 2 Read statistics. A) Bar plot showing percent of initial reads, mapped reads and reads left after filtering for four different technologies;
`each bar shows the number of reads in millions. B) Stacked bar plot showing subgroups of filtered reads.
`
`
Figure 3 Coverage efficiency comparison by technology. Coverage efficiency is defined as the percent of the total targeted bases covered at
particular depths. A) Coverage efficiency for intended targeted bases for each technology. B) Coverage efficiency for bases shared by
all four technologies (26.2 Mb). Solid lines indicate replicate 1, and dotted lines indicate replicate 2.
`
`bases varies considerably, as the Agilent technology tar-
`gets 51.1 Mb, NimbleGen 64.1 Mb, and Illumina
`62.08 Mb (Figure 1A), sharing only 26.2 million bases
`between technologies. When measured at 1× coverage
`on the common bases (26.2 Mb), we observed a similar
`trend, where the Agilent technology covers the highest
`number of bases, with 99.8%, followed by Nextera with
`99.5%, TruSeq with 98.8%, and NimbleGen with 98%
`(Figure 3B and Additional file 1: Figure S1). We found
`no major difference in coverage efficiency between two
`technical replicates, indicating that all four technologies
`give high technical reproducibility.
`We next evaluated coverage efficiency as a function of se-
`quencing depth. We randomly selected filtered reads in 5
`million read increments from 5 million to 50 million. The
`fraction of the intended target bases, covered at depths of
`at least 10×, 20×, 30×, 40×, 50× and 100×, was determined
`(Figure 4). The Agilent technology covered a higher percent
`of its target bases at all read counts and depth cut-offs com-
`pared with the other three technologies. For all the tech-
`nologies, 25 million reads were sufficient to cover about
`80% of target bases with at least 10× depth, with the
`exception of the Nextera technology, which covered only
`about 60% of target bases with the same number of reads
`(Figure 4A). When using 45 million reads with all the tech-
`nologies, more than 80% of target bases were covered
`with ≥20× coverage, but the Nextera technology covered
`only 58% of the bases at the same depth (Figure 4B). For all
the read counts, Agilent and Nextera covered more bases
with ≥100× coverage than the other two technologies, but
showed a considerable difference in coverage (Figure 4F).
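
The depth-threshold analysis can be sketched as three small steps: draw a random subset of reads, compute per-base depth over the target, and report the percentage of target bases at or above each depth cut-off. This is an illustrative sketch under simplifying assumptions (one contiguous target, reads given as half-open intervals, hypothetical function names), not the pipeline used in the study.

```python
import random
from typing import Dict, List, Sequence, Tuple

Read = Tuple[int, int]  # half-open [start, end) alignment on one target


def subsample(reads: Sequence[Read], n: int, seed: int = 0) -> List[Read]:
    """Draw n reads at random without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(list(reads), n)


def per_base_depth(reads: Sequence[Read], target_start: int, target_end: int) -> List[int]:
    """Count how many reads cover each base of the target."""
    depth = [0] * (target_end - target_start)
    for start, end in reads:
        for pos in range(max(start, target_start), min(end, target_end)):
            depth[pos - target_start] += 1
    return depth


def coverage_efficiency(
    depth: Sequence[int], thresholds: Sequence[int] = (10, 20, 30, 40, 50, 100)
) -> Dict[int, float]:
    """Percent of target bases covered at >= each depth threshold."""
    n = len(depth)
    return {t: 100.0 * sum(d >= t for d in depth) / n for t in thresholds}


# Toy data: 60 reads over a 150-base target (hypothetical coordinates).
all_reads = [(0, 100)] * 30 + [(50, 150)] * 30

for n_reads in (20, 40, 60):
    depth = per_base_depth(subsample(all_reads, n_reads), 0, 150)
    print(n_reads, coverage_efficiency(depth, thresholds=(10, 20)))
```

Plotting the resulting percentages against the number of sampled reads, one curve per technology, gives curves of the kind summarized in Figure 4.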
`
`Influence of GC content on coverage
`Base composition has been shown to bias sequencing
`efficiency, thus coverage may be low for sequences with
`high GC or AT content [17]. There are two primary ex-
`planations for this bias: 1) a polymerase chain reaction
`(PCR) amplification bias, where high or low GC content
`reduces the efficiency of PCR amplification [18]; and 2)
`a reduced efficiency of capture probe hybridization to
sequences with high or low GC content [19]. Whereas
the former bias is inherent to the sequences to be
amplified, the latter is a property of the capture probes,
and may to some extent be compensated for by probe design.
`To study the GC bias effect, we utilized density plots as
`described by Clark et al. [10], where we plotted GC con-
`tent against the normalized mean read depth (Figure 5
`and Additional file 2: Figure S2). All four technologies
`showed bias against very low (<30%) and very high
`(>70%) GC content. All the technologies, except Nex-
`tera, demonstrated a sharp fall in read depth for GC
`contents of 60% or higher. Nextera gave increased cover-
`age for sequences with higher GC content, owing to the
`preference of the transposon technology used [20]. All
`the technologies gave poor coverage for sequences with
`less than 25% GC content.
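
A simplified way to look at GC bias is to compute the GC percentage of each target, normalize its mean read depth by the mean over all targets, and average the normalized depth within GC bins. The sketch below does this with coarse bins rather than the density plots used in the study. The normalization choice, bin width and function names are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def normalized_mean_depth(mean_depths: Sequence[float]) -> List[float]:
    """Divide each target's mean depth by the mean over all targets."""
    overall = sum(mean_depths) / len(mean_depths)
    return [d / overall for d in mean_depths]


def depth_by_gc_bin(
    targets: Sequence[Tuple[str, float]], bin_width: int = 10
) -> Dict[int, float]:
    """Average normalized depth per GC bin; keys are the lower bin edges."""
    norm = normalized_mean_depth([depth for _, depth in targets])
    bins: Dict[int, List[float]] = {}
    for (seq, _), nd in zip(targets, norm):
        # Place 100% GC in the top bin instead of opening a new one.
        edge = min(int(gc_percent(seq) // bin_width) * bin_width, 100 - bin_width)
        bins.setdefault(edge, []).append(nd)
    return {edge: sum(v) / len(v) for edge, v in sorted(bins.items())}


# Toy targets: (sequence, mean read depth); values are hypothetical.
targets = [
    ("ATATATATAT", 10.0),
    ("ATGCATGCAT", 40.0),
    ("GCGCGCGCGC", 10.0),
]
print(depth_by_gc_bin(targets))
```

A normalized depth below 1 in the extreme bins and above 1 in the central bins corresponds to the bias against very low and very high GC content described above.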
`
`Ability to detect SNVs
`An important goal of exome resequencing is to identify se-
`quence variants. Therefore, we systematically compared
`the efficiency of exome capture for allele detection among
`the four technologies. We used UnifiedGenotyper, imple-
`mented in the GATK package [21], to investigate the
`
`
`Figure 4 Coverage efficiency as a function of number of reads. The percent of targeted bases covered at A) ≥10x, B) ≥20x, C) ≥30x,
`D) ≥40x, E) ≥50x, and F) ≥100x depths.
`
`relationship between read counts and total single nucleo-
`tide variants (SNVs) detected within different intervals. As
`read counts increased, the number of SNVs identified in
`their target regions increased initially, and became satu-
`rated at approximately 20 million reads (Figure 6A). Very
`few additional SNVs were identified beyond 20 million
`reads. When considering the SNVs identified on their re-
`spective target regions, there is a clear correlation between
`the total number of SNVs detected and the number of
`bases targeted; NimbleGen detected the highest number of
`SNVs followed by TruSeq, Nextera, and Agilent (Figure 6A
`
`and Additional file 3: Figure S3A). A different trend was
`clear in the 26 Mb region shared by all four technologies,
`where Agilent detected the highest number of SNVs,
followed by TruSeq, Nextera, and NimbleGen (Figure 6B
`and Additional file 3: Figure S3B). The majority of newly
`detected SNVs were common.
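
Separating shared from technology-specific calls is a set operation once each variant is given a comparable key. The sketch below assumes a variant is identified by (chromosome, position, reference allele, alternate allele). The call sets are invented toy data and the function name is hypothetical.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def shared_and_specific(
    calls: Dict[str, Set[Variant]]
) -> Tuple[Set[Variant], Dict[str, Set[Variant]]]:
    """Return variants called by every technology and those unique to each."""
    shared = set.intersection(*calls.values())
    specific: Dict[str, Set[Variant]] = {}
    for tech, variants in calls.items():
        others = set().union(*(v for t, v in calls.items() if t != tech))
        specific[tech] = variants - others
    return shared, specific


# Toy call sets (hypothetical variants).
v1 = ("chr1", 100, "A", "G")
v2 = ("chr1", 250, "C", "T")
v3 = ("chr2", 75, "G", "T")
v4 = ("chr3", 900, "T", "C")

calls = {
    "Agilent": {v1, v2, v3},
    "NimbleGen": {v1, v2},
    "TruSeq": {v1, v4},
    "Nextera": {v1, v2, v4},
}

shared, specific = shared_and_specific(calls)
print(len(shared), {tech: len(v) for tech, v in specific.items()})
```

Restricting each call set to a common region before the comparison, such as the bases targeted by all four designs, keeps differences in target size from dominating the result.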
`We also investigated SNV detection in the regions cov-
`ered by the CCDS (Figure 6C), RefSeq (Figure 6D), and
Ensembl (Figure 6E) exome databases. The Illumina
technologies, TruSeq and Nextera, and NimbleGen detected
similar numbers of SNVs in CCDS and RefSeq. However, in
`
`
`Figure 5 Density plots showing GC content against normalized mean read depth for A) Agilent, B) NimbleGen, C) TruSeq, and
`D) Nextera technologies.
`
`Ensembl regions, NimbleGen detected the highest number
`of SNVs. As expected, Illumina technologies detected a
`much larger number of SNVs in UTRs. Illumina technolo-
`gies also covered the highest number of bases in the UTRs,
`followed by NimbleGen and Agilent (Figure 1F). Interest-
`ingly, at low read counts, more SNVs were detected by
`TruSeq, but at 40 million read counts, Nextera surpassed
`TruSeq.
`We also investigated whether capture technologies
`showed bias in substitution detection, but none of the
`technologies showed bias towards specific nucleotide sub-
`stitutions (Additional file 4: Figure S4 and Additional file 5:
`
`Figure S5). Transitions were expected to occur twice as fre-
`quently as transversions. The transition-transversion (ts/tv)
`ratio is a metric for assessing the specificity of new SNP
`calls. We assessed the ts/tv ratio on their respective target
`regions (including non-exonic segments), and it ranged
`from 2.215 in Nextera to 2.257 in Agilent (Additional file
`4: Figure S4). Previous studies have shown ts/tv ratios of ≈
`2.0–2.1 for whole genome datasets [22]. The Nextera and
`TruSeq technologies showed very similar ts/tv ratios,
`caused most likely by their identical target regions. Also,
`Agilent and NimbleGen had very similar ts/tv ratios. The
`difference in ts/tv ratios between Illumina technologies
`
`
`Figure 6 SNV detection by technology as a function of increasing read counts on A) intended target region, B) regions common
`among technologies, C) CCDS exons, D) RefSeq exons, E) Ensembl exons, and F) UTRs. Solid-lines indicate technology specific SNVs,
`dashed-lines indicate total number of SNVs, and solid pink lines indicate the SNVs common between the four technologies.
`
`
`(TruSeq and Nextera) and non-Illumina technologies (Agi-
`lent and NimbleGen) may be because Illumina technolo-
`gies target a significantly higher number of UTRs than the
`other technologies. We also determined the ts/tv ratio in
`CCDS coding exons (Additional file 5: Figure S5). The ts/
`tv ratio on CCDS ranges from 3.054 in Nextera to 3.109 in
`NimbleGen. It has been previously shown that the ts/tv
`ratio is ≈ 3.0–3.3 for exonic variation [23].
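
For reference, the ts/tv ratio is the number of transitions (purine to purine or pyrimidine to pyrimidine changes) divided by the number of transversions (purine to pyrimidine changes or the reverse). A minimal sketch of that calculation, using invented toy calls and hypothetical function names, is:

```python
from typing import Iterable, Tuple

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    if len(pair) != 2:
        raise ValueError("ref and alt must be different bases")
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio for (ref, alt) pairs."""
    snvs = list(snvs)
    ts = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ts
    if tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return ts / tv


# Toy SNV calls as (ref, alt) pairs (hypothetical).
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ts_tv_ratio(toy_snvs))
```

The sketch assumes biallelic single-base substitutions over A, C, G and T; multi-allelic sites would need to be split into separate (ref, alt) pairs first.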
`
`Detection of insertions and deletions
`Small insertions and deletions (indels) were called using
`the UnifiedGenotyper algorithm implemented in the
`GATK package [21]. Indel size ranged from −40 to +37
`bases in Agilent, −61 to +37 bases in NimbleGen, −66
`to +52 bases in TruSeq, and −66 to +90 bases in Nextera.
`Most indels were single bases, and more than 90% of the
`indels were less than seven bases long; this pattern was
observed for all four technologies (Additional file 6:
Figure S6A). At low read counts, TruSeq and NimbleGen
`detected a higher number of indels, followed by Nextera
`and Agilent (Figure 7A). At 15 million read counts, TruSeq
`surpassed NimbleGen, and at 20 million reads, Nextera sur-
`passed Agilent (Figure 7A). Interestingly, at 50 million
reads, Nextera surpassed NimbleGen (Figure 7A). A
concerning observation was that, at all read counts, very
few indels were common across the four technologies,
especially in CCDS, Ensembl and RefSeq regions.
`Figure 7B shows a head-to-head comparison of indel
`detection in the regions covered by all four technologies.
`At all read counts, Agilent detected the highest number
`of indels. At lower read counts, NimbleGen detected
`more indels than TruSeq and Nextera; at 15 million
`reads, both Nextera and TruSeq surpassed NimbleGen.
`Only about 50% of indels were common among four
`technologies.
`Indel detection in the regions covered by exome data-
`bases was also studied (Figure 7C–E). The number of
indels detected in exons was significantly lower than
the number detected in the respective technology target
regions and UTRs. We observed more indels of three or
`six bases (Additional file 6: Figure S6B), probably due to
`the negative selection of sizes not equal to multiples of
`three bases in coding sequences because they cause dele-
`terious frame shift mutations.
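
The reasoning about multiples of three can be made concrete with a small classifier: an indel whose length is a multiple of three leaves the reading frame intact, while any other length shifts it. The sketch below assumes indels are given as (ref, alt) allele strings; the toy alleles and function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def indel_size(ref: str, alt: str) -> int:
    """Signed indel size: negative for deletions, positive for insertions."""
    return len(alt) - len(ref)


def preserves_frame(size: int) -> bool:
    """True when the indel length is a non-zero multiple of three."""
    return size != 0 and abs(size) % 3 == 0


def classify(indels: List[Tuple[str, str]]) -> Counter:
    """Count in-frame and frameshift indels."""
    counts: Counter = Counter()
    for ref, alt in indels:
        size = indel_size(ref, alt)
        counts["in-frame" if preserves_frame(size) else "frameshift"] += 1
    return counts


# Toy indels as (ref, alt) alleles (hypothetical).
toy_indels = [
    ("A", "ACTG"),     # +3 insertion
    ("ATT", "A"),      # -2 deletion
    ("ACTGACT", "A"),  # -6 deletion
    ("A", "AT"),       # +1 insertion
    ("AGGC", "A"),     # -3 deletion
]
print(classify(toy_indels))
```

Tallying sizes this way within coding exons versus UTRs would show whether lengths of three and six bases are over-represented in coding sequence, as the size distribution in Additional file 6: Figure S6B suggests.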
`When compared between replicates, both SNVs
`(Additional file 7: Figure S7 and Additional file 8: Figure
`S8) and indels (Additional file 9: Figure S9), showed
`similar trends in detecting total number of variants and
`showed very high overlap in newly detected variants.
`
`Discussion
Continuous advancement in sequencing technologies
increases the throughput of DNA sequencing while
sharply decreasing its cost.
`
`Although sequencing costs have fallen, whole genome
`sequencing is still quite expensive, and data interpret-
`ation remains challenging. Therefore, whole genome se-
`quencing is not the most appropriate choice for all
investigations. The ability to target certain regions of the
genome, such as protein- and/or RNA-coding exons, is
an attractive alternative for many experiments. In recent
`times, target enrichment by hybridization technologies
`has demonstrated rapid progress in development and
`usage by the research and diagnostic community.
`We present a comparative study of four whole exome
capture technologies from three manufacturers, designed
to reveal important performance aspects of the
technologies. To this end, we studied six parameters for
`each technology: the portion of target bases representing
`different exome databases, target coverage efficiency, GC
`bias, sensitivity in SNV detection, sensitivity in small
`indel detection, and reproducibility.
`Although all four exome capture technologies show
`very high target enrichment efficiency and cover large
`portions of the exome, only a small portion of the
`CCDS exome is uniquely covered by each technology
`(Figure 1C). Therefore, a researcher who is planning exome
`sequencing should assess which technology best covers
`the regions of interest to the investigation. Agilent tar-
`gets the smallest part of the genome with 51.1 Mb,
`followed by Illumina technologies with 62.08 Mb, and
`NimbleGen with 64.1 Mb. There are 26.2 Mb of the hu-
`man genome shared by all four technologies; the major-
`ity of which falls in CCDS exonic regions. Illumina not
`only encompasses far more UTRs, but also shows a
`higher coverage of RefSeq, CCDS, and Ensembl exome
`databases, followed by NimbleGen and Agilent.
Target coverage efficiency differs between the four
technologies. Using pass-filter reads, Agilent shows
higher coverage efficiency than the other technologies,
`which may be partially explained by the smaller targeted
`region (51.1 Mb) compared with 64.1 Mb and 62.08 Mb
`for NimbleGen and Illumina respectively. Among the
`Illumina technologies, TruSeq gave a more uniform
`coverage than Nextera, but both had inferior efficiency
`compared with Agilent. Agilent gives the highest per-
`centage of usable reads (pass-filter reads) (71.7%), closely
`followed by NimbleGen.
`Regardless of high or low target region GC content,
`there was a negative correlation between sequencing
coverage and extreme GC content. The preference of the
transposon technology for targets with high GC content
can help explain the non-uniform coverage of the Nextera
technology.
`Most researchers aiming for exome sequencing, espe-
`cially in the medical sciences, focus on protein-coding
regions. Therefore, the ability to identify SNVs and
`indels in coding regions is critical to many applications.
`
`
Figure 7 Indel detection by technology as a function of increasing read counts on A) intended target region, B) regions common
among the technologies, C) CCDS exons, D) RefSeq exons, E) Ensembl exons, and F) UTRs. Solid lines indicate technology-specific indels,
dashed lines indicate total number of indels, and solid pink lines indicate the indels common between the four technologies.
`
`
`Figure 8 Overview of the computational pipeline.
`
NimbleGen captures the highest number of SNVs,
followed by Illumina technologies and Agilent, when
the total number of SNVs detected is correlated
with technology target size. However, the number of
bases sequenced also has cost and capacity implications.
Our results suggest that Illumina technologies detect
more SNVs than the other technologies in the CCDS and
RefSeq exomes, owing to a higher coverage of these
regions, but Agilent was better at detecting indels. We
also observed that Nextera shows a clear edge over other
technologies in the CCDS and RefSeq exomes, because it
covers a larger fraction of these sequences.
`We did not observe significant differences in technical
`reproducibility between the four technologies. However,
`we could, by comparing performance between replicates
`to the differences observed above, conclude that al-
`though some differences in SNV and indel detection
`were due to random experimental error, the major effect
`appears to be due to technological biases.
Since the comparison is based on a tumor sample,
which may contain genomic aberrations that could
differentially affect the performance of each technology, we
`investigated the coverage differences in COSMIC cancer
`genes. No significant deviation in coverage was observed
`when compared with global coverage (Figure 3 and
`Additional file 10: Figure S10).
Another important consideration is that exome capture
technologies evolve rapidly. For instance, Agilent
recently released the next version of its exome capture
kit, SureSelect Human All Exon V5. Although these versions
do differ with regard to the genomic regions they target,
about 84% of target region bases overlap. Illumina also
has a new version with a smaller panel targeting only
exons, called Nextera Rapid Capture Exome (37 Mb), while
the larger panel is now named Nextera Expanded Exome
(62 Mb). Illumina has also improved the Nextera protocol
with the Nextera Rapid kit; this improvement may reduce
the GC bias observed here.
`
`In total, our data suggest that all four technologies
`offer comparable performance. Other factors, such as
`the DNA content of the targeted regions, the amount of
`input DNA required, the extent of automation in library
`construction, and the cost of reagents to reach a certain
depth of coverage, need to be considered before selecting
the exome capture technology most appropriate for
a particular application.
`Readers should keep in mind that this study is based
on one biological sample with two replicates. The
observed technical reproducibility is very high, but
variability may be higher when biological replicates are
compared.
`
`Conclusions
We systematically evaluated the performance of four
whole exome capture technologies, and show that all
the exome capture technologies perform well, but do
exhibit consistent differences. Illumina covers a
greater portion of coding exon bases across all the
databases, followed by NimbleGen and Agilent. All
the technologies give high coverage of their respective
target regions, with the Agilent technology giving the
highest coverage (99.8%), followed by Nextera (98.2%),
TruSeq (96.9%), and NimbleGen (96.5%) of the
intended targets. Nextera shows a sharp increase in
read depth for GC content of 60% or higher compared
with the other technologies. In common regions covered
by all four technologies, Agilent detects a slightly
higher number of SNVs, followed by Nextera, TruSeq
and NimbleGen. At all read counts, very few indels
were common across the four technologies. All
technologies give high technical reproducibility. One
major limitation is that none of the capture technologies
is able to cover all of the exons of the CCDS,
RefSeq or Ensembl databases. Our study should help
researchers who are planning exome sequencing
experiments select the most appropriate technology for
their study, without having to perform expensive and
time-consuming comparisons.
`
`
`Methods
`Sample collection and library preparation
`One human osteosarcoma was selected from a tumor
`collection at the Department of Tumor Biology at the
`Norwegian Radium Hospital. The tumor was collected
`immediately after surgery after written informed con-
`sent, cut into small pieces, frozen in liquid nitrogen and
`s