`
Target-enrichment strategies for next-generation sequencing
`
`Lira Mamanova1, Alison J Coffey1, Carol E Scott1, Iwanka Kozarewa1, Emily H Turner2,
`Akash Kumar2, Eleanor Howard1, Jay Shendure2 & Daniel J Turner1
`
We have not yet reached a point at which routine sequencing of large numbers of whole eukaryotic genomes is feasible, and so it is often necessary to select genomic regions of interest and to enrich these regions before sequencing. There are several enrichment approaches, each with unique advantages and disadvantages. Here we describe our experiences with the leading target-enrichment technologies, the optimizations that we have performed and typical results that can be obtained using each. We also provide detailed protocols for each technology so that end users can find the best compromise between sensitivity, specificity and uniformity for their particular project.
`
The ability to read the sequence of bases that comprise a polynucleotide has had an impact on biological research that is difficult to overstate. For the majority of the past 30 years, dideoxy DNA 'Sanger' sequencing1 has been used as the standard sequencing technology in many laboratories, and its acme was the completion of the human genome sequence2. However, because Sanger sequencing is performed on single amplicons, its throughput is limited, and large-scale sequencing projects are expensive and laborious: the human genome sequence took hundreds of sequencing machines several years and cost several hundred million dollars.

The paradigm of DNA sequencing changed with the advent of 'next-generation' sequencing technologies (reviewed in refs. 3,4), which process hundreds of thousands to millions of DNA templates in parallel, resulting in a low cost per base of generated sequence and a throughput on the gigabase (Gb) scale. As a consequence, we can now start to define the characteristics of entire genomes and delineate differences between them. Ultimately, whole-genome sequencing of complex organisms will become routine, allowing us to gain a deeper understanding of the full spectrum of genetic variation and to define its role in phenotypic variation and the pathogenesis of complex traits.

Nevertheless, it is not yet feasible to sequence large numbers of complex genomes in their entirety because the cost and time taken are still too great. Obtaining 30-fold coverage of a human genome (90 Gb in total) would currently require several sequencing runs and would cost tens of thousands of dollars. In addition to the demands such a project would place on laboratory time and funding, the primary analysis, during which the captured image files are processed, as well as storage of the sequences, would place a substantial burden on a research center's informatics infrastructure.
Consequently, considerable effort has been devoted to developing 'target-enrichment' methods, in which genomic regions are selectively captured from a DNA sample before sequencing. Resequencing only the genomic regions that are retained is necessarily more time- and cost-effective, and the resulting data are considerably less cumbersome to analyze. Several approaches to target enrichment have been developed (Fig. 1), and there are several parameters by which the performance of each can be measured, which vary from one approach to another: (i) sensitivity, or the percentage of the target bases that are represented by one or more sequence reads; (ii) specificity, or the percentage of sequences that map to the intended targets; (iii) uniformity, or the variability in sequence coverage across target regions; (iv) reproducibility, or how closely results obtained from replicate experiments correlate; (v) cost; (vi) ease of use; and (vii) the amount of DNA required per experiment, or per megabase of target.
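To make the first three of these parameters concrete, the following is a minimal sketch (ours, not part of the original protocols) of how sensitivity, specificity and uniformity might be computed from per-base coverage over a target region and simple read counts; the function name, the twofold-of-median uniformity summary and the example numbers are illustrative assumptions.

```python
from statistics import median

def enrichment_metrics(target_coverage, on_target_reads, total_mapped_reads):
    """Compute basic target-enrichment metrics.

    target_coverage: list of per-base read depths across the target region.
    on_target_reads / total_mapped_reads: read counts used for specificity.
    """
    n = len(target_coverage)
    # Sensitivity: fraction of target bases seen by at least one read.
    sensitivity = sum(1 for depth in target_coverage if depth >= 1) / n
    # Specificity: fraction of mapped reads that land on the intended target.
    specificity = on_target_reads / total_mapped_reads
    # One common uniformity summary: fraction of bases within twofold of the median.
    med = median(target_coverage)
    uniformity = sum(1 for depth in target_coverage
                     if med / 2 <= depth <= med * 2) / n
    return sensitivity, specificity, uniformity

# Illustrative example: a 10-base "target" with uneven coverage.
sens, spec, unif = enrichment_metrics([0, 5, 12, 9, 11, 10, 40, 8, 10, 9],
                                      on_target_reads=8_000_000,
                                      total_mapped_reads=10_000_000)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} uniformity={unif:.2f}")
```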
`
`1The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. 2Department of Genome Sciences,
`University of Washington, Seattle, Washington, USA. Correspondence should be addressed to J.S. (shendure@u.washington.edu) or
`D.J.T. (djt@sanger.ac.uk).
Published online 28 January 2010; corrected after print 12 April 2010; doi:10.1038/nmeth.1419
`
`
A technology with typically high specificity and uniformity will require less sequencing to generate adequate sequence coverage for downstream analysis, making the sequencing more economical. In addition to these factors, when assessing which target-enrichment technology is the most appropriate for a particular project, thought must be given to how well matched each method is to the total size of the intended target region and the number of samples (Fig. 2), and to whether sample multiplexing is required to use sequencer throughput most efficiently.

Here we describe the most widely used approaches to target enrichment, our experiences with each and the optimizations that we have performed. We also provide detailed protocols, which we have developed with the aim of finding the best compromise between the parameters described above.
`
[Figure 1 schematic: (a) uniplex PCR (1 reaction = 1 amplicon), multiplex PCR (1 reaction = 10 amplicons) and RainStorm (1 reaction = 4,000 amplicons); (b) molecular inversion probes (10,000 exons), with gap-fill and ligation across target exons; (c) hybrid capture (>100,000 exons) of an adapter-modified shotgun library, by array capture or by solution hybridization followed by bead capture.]

Figure 1 | Approaches to target enrichment. (a) In the uniplex PCR–based approach, single amplicons are generated in each reaction. In multiplexed PCR, several primer pairs are used in a single reaction, generating multiple amplicons. On the RainStorm platform, up to 4,000 primer pairs are used simultaneously in a single reaction. (b) In the MIP-based approach, probes consisting of a universal spacer region flanked by target-specific sequences are designed for each amplicon. These probes anneal at either side of the target region, and the gap is filled by a DNA polymerase and a ligase. Genomic DNA is digested, and the target DNA is PCR-amplified and sequenced. (c) In the hybrid capture–based approach, adaptor-modified genomic DNA libraries are hybridized to target-specific probes either on a microarray surface or in solution. Background DNA is washed away, and the target DNA is eluted and sequenced.
`
`
PCR
PCR has been the most widely used presequencing sample preparation technique for over 20 years5, and it is particularly well suited to a Sanger sequencing–based approach, in which a single PCR can be used to generate a single DNA sequence and in which the sequence read length is comparable to that of a typical PCR amplicon. PCR is also potentially compatible with any next-generation sequencing platform, though to make full use of the high throughput, a large number of amplicons must be sequenced together. However, PCR is difficult to multiplex to any useful degree: the simultaneous use of many primer pairs can generate a high level of nonspecific amplification, caused by interaction between the primers, and moreover amplicons can fail to amplify6,7. Clever derivatives of multiplex PCR have been developed8–10, but in practice it is often more straightforward to perform PCRs in uniplex. Additionally, there is an upper limit to the length of amplicon that can be generated by long PCR11: in our experience, very long PCRs tend to lack robustness, and for PCR amplification of contiguous regions we prefer to design overlapping PCRs that are no more than 10 kilobases (kb) long. Each individual PCR must be validated and, ideally, optimized to make amplification as efficient as possible and to minimize the total mass of DNA required.

After amplification, the concentration of products must be normalized before pooling to avoid sequencing one dominant PCR product above all others. There are several ways to approach normalization at this stage, but the most reliable is to visually inspect the intensity of bands on an agarose gel alongside a quantitative ladder. Consequently, there is an upper limit to the size of genomic target that can realistically be selected by PCR because of the workload involved. We recommend using long PCR to target regions that are up to several hundred kilobases long, as this is feasible from the perspectives of both workload and the quantity of DNA required.

By current standards, a single lane of a paired-end, 76-base sequencing run would generate an average coverage of about 30,000-fold for a 100-kb target, clearly a massive excess. For the sequencing to be economical, it is necessary to barcode and pool many samples and to sequence these pools in a single lane. Several approaches to sample barcoding have been reported12–14, but we have found that ligation of barcodes to fragmented PCR amplicons gives uneven sequence coverage of different samples.

We developed a protocol for barcoding 96 samples, in which the library is prepared in 96-well plates and the barcode is included in the central region of the reverse PCR primer (Supplementary Protocol 1). We validated this strategy by analyzing a 25-kb region in DNA from several human populations worldwide. We sequenced 96 libraries per flowcell lane and generated 50-base paired-end sequence reads, with an additional 8 bases of sequence to generate the tag sequences. Sequence data from this study have been deposited in the European Short Read Archive. The average coverage obtained from these sequences was high: median >225-fold per lane for native DNA, and 175-fold for whole genome–amplified samples. Coverage and uniformity were poorer for whole genome–amplified samples than for genomic DNA, especially for the longest amplicon in the pool, suggesting that biases were introduced during whole-genome amplification, as has been noted previously15,16. However, the barcoding approach was successful, with 80% of sequenced bases covered within a twofold range of the median for the genomic samples. We called single-nucleotide polymorphisms (SNPs) at >99% of sites in approximately 98% of samples and detected 63 high-confidence SNPs; 27 of them were new and 23 were rare.
`
[Figure 2 plot: suitability regions for PCR, RainStorm, MIPs and hybridization (on array and in solution), plotted as log10 (number of genes) (x axis, 0–5, with the human exome marked at the upper end) against log10 (number of samples) (y axis, 0–4).]

Figure 2 | Suitability of different target-enrichment strategies to different combinations of target size and sample number. Suitability was estimated from the perspective of the feasibility with which each method could be applied to the various combinations of target size and sample number, rather than the cost.
`
`
`
`
Improvements for PCR
Although this PCR-based approach was highly effective, there are several areas in which it could be improved. First, a reduction in the cost of library preparation reagents would have a major impact on the overall cost, because a separate sequencing library is required for each DNA sample; this makes library preparation expensive even for a small number of lanes of sequencing. Second, the accuracy with which tiled amplicons are pooled, which affects sequence uniformity, needs improvement, because quantifying tiled amplicons by quantitative PCR is still difficult to achieve for tens to hundreds of amplicons per sample. Third, the use of 5′-blocked primers would achieve greater sequence uniformity across amplicons14. Fourth, a greater depth of tiling in the PCRs would help: the failure of long PCRs has a major impact on coverage uniformity, but if every base in the target locus is covered by at least two overlapping PCRs, failure of one of these PCRs will not result in the 'loss' of that base. Finally, the use of error-correcting barcodes would allow a greater proportion of pooled sequences to be deconvoluted17. Using Hamming codes for tag design18, it is possible to make tagsets in which single nucleotide-sequencing errors can be corrected, and in which two errors and single insertion-deletions can be detected unambiguously (Supplementary Table 1).
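The distance criterion underlying such tagsets can be illustrated with a simple greedy search for tags whose pairwise Hamming distance is at least 3, which is sufficient to correct a single substitution. This is a minimal sketch of the distance idea only, not the Hamming-code construction of ref. 18; the tag length and the greedy selection strategy are our assumptions.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_tagset(length=6, min_dist=3):
    """Greedily collect DNA tags with pairwise Hamming distance >= min_dist."""
    tags = []
    for candidate in ("".join(p) for p in product("ACGT", repeat=length)):
        if all(hamming(candidate, t) >= min_dist for t in tags):
            tags.append(candidate)
    return tags

def correct(observed, tags):
    """Return the unique tag within Hamming distance 1 of observed, if any."""
    hits = [t for t in tags if hamming(observed, t) <= 1]
    return hits[0] if len(hits) == 1 else None

tags = greedy_tagset()
# A single miscalled base still resolves to the intended tag, because no two
# tags in the set are closer than distance 3.
print(len(tags), "tags;", correct(tags[0][:-1] + "N", tags))
```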
It is possible to design long PCR primers for close to 100% of desired targets, but in practice not all reactions will yield a product after amplification. This can be problematic for samples in which the integrity of the DNA is low, such as clinical specimens. Similarly, when there are SNPs in the primer-annealing regions, one allele may be amplified preferentially19. Such difficulties can usually be overcome by optimization, primer redesign, greater tiling of amplicons or a combination of long and short PCR.
`
The RainStorm platform, developed by RainDance Technologies, is a convenient solution to many of the problems encountered in a standard PCR–based approach (http://www.raindancetechnologies.com/applications/next-generation-sequencing-technology.asp/). The technology uses microdroplets, similarly to emulsion PCR20,21. Each droplet supports an independent PCR and can be made to contain a single primer pair along with genomic DNA and other reagents. The entire population of droplets represents hundreds to thousands of distinct primer pairs and is subjected to thermal cycling, after which the emulsion is broken and products are recovered. The mixture of DNA amplicons can then be subjected to shotgun library construction and massively parallel sequencing. During the microdroplet PCR, different primer pairs cannot interact with each other, which removes one of the primary constraints on conventional multiplex PCR. The microdroplet approach also prevents direct competition of multiplexed PCRs for the same reagent pool, which should improve uniformity relative to conventional multiplex PCR. The current maximum number of primer pairs that can be used is 4,000, though the number is expected to reach 20,000 by mid-2010 (J. Lambert, personal communication).

The proof of concept for this approach has been published recently22. In one experiment, the authors targeted 457 amplicons of variable size (119–956 bp) and G+C content (24–78%), totalling 172 kb. In six samples, 84% of uniquely mapping reads aligned to targeted amplicons, and 90% of targeted bases were represented within a 25-fold abundance range. In a second experiment, they targeted 3,976 amplicons representing an aggregate target of 1.35 megabases (Mb) and observed that 79% of uniquely mapping reads aligned to targets and 97% of targeted bases were covered within a 25-fold abundance range. The specificity and uniformity of the approach compare well with those of the alternatives, and base calling demonstrated good concordance with expected HapMap genotypes. One limitation is that the approach currently has relatively high input requirements (7.5 µg per sample), but this may be reduced with optimization. In terms of targeting flexibility, it is reasonable to expect that this approach will have advantages and disadvantages analogous to those of conventional PCR primer design.
`
`
`
Table 1 | Performance of target-enrichment methods

| | PCR | MIP | On-array hybrid capture | In-solution hybrid capture |
| Cost | High | <10 samples, high; >100 samples, low | Medium | <10 samples, medium; >10 samples, low |
| Ease of use | Low | High | Medium | High |
| Mass DNA | ~8 µg for 1 Mb of 2× tiled, 5-kb amplicons | As little as 200 ng | 10–15 µg per array for up to 30-Mb target | 3 µg for up to 30-Mb target |
| Sensitivity | >99.5% | >98%, with stringent design constraints | 98.6% of CTRa | >99.5% of CTRa |
| Specificity | 93% for HapMap DNA samples, 72% for whole genome–amplified samples | >98% | Up to 70% mapping to CTRa for exons; higher for contiguous regions | Up to 80% mapping to CTRa for exons; higher for contiguous regions |
| Uniformity | 80% of bases within twofold range of median | 58% of CTR within tenfold coverage range; 88% within 100-fold coverage range | 60% of CTRa within 0.5–1.5-fold of mean coverage (mapping qualityb 30) | 61% of CTRa within 0.5–1.5-fold of mean coverage (mapping qualityb 30) |
| Reproducibility | Up to 100% | 0.92 rank-order correlationc | For 10^7 paired-end sequences, >96% reproducibility at tenfold between two samples | For 10^7 paired-end sequences, >95% reproducibility at tenfold between two samples |

aCTR, capture target region, that is, the regions of the desired target region to which probes could be designed after repeat masking. bMapping qualities were calculated by the mapping software, MAQ, and indicate the probability that the mapping location is correct. A score of 30 or greater indicates that the quality of a read was good and that it mapped unambiguously to that location with few mismatches. cRank-order correlation in capture-efficiency distributions between independent samples.
`
Even with an efficient, automated PCR pipeline, it is not feasible to use conventional PCR to target genomic regions that are several megabases in size, because of the high cost of primers and reagents and the DNA input requirements, particularly in large sample sets (Fig. 2). Similarly, there is a limit to the maximum target size that can be selected using the RainStorm platform (2–3 Mb), and its sample throughput is limited to approximately 8 per workday (Fig. 2). Consequently, for very large target regions, such as the approximately 30-Mb human exome, or to select moderately sized regions in very large numbers of samples, other approaches to target enrichment should be used.
`
Molecular inversion probes
Various enzymatic methods for targeted amplification are compatible with extensive multiplexing based on target circularization23–25. One approach in the latter category relies on the use of molecular inversion probes (MIPs), which were initially developed for multiplex target detection and SNP genotyping26–30. Single-stranded oligonucleotides, consisting of a common linker flanked by target-specific sequences31,32, anneal to their target sequence and become circularized by a ligase. Uncircularized species are digested by exonucleases to reduce background, and circularized species are PCR-amplified via primers directed at the common linker. To adapt this method to perform exon capture in combination with next-generation sequencing, a DNA polymerase can be used to 'gap-fill' between target-specific MIP sequences designed to flank a full or partial exon, before ligase-driven circularization, thereby capturing a copy of the intervening sequence24. The assay initially demonstrated low uniformity, largely owing to inefficiencies in the capture reaction itself, but more recently an optimized, simplified protocol for MIP-based exon capture has been reported33. This revised protocol (Supplementary Protocol 2) retains the high specificity of MIP capture, with >98% of mapped reads aligning to a targeted exon; additionally, uniformity is markedly improved, with 58% of targeted bases in 13,000 targets captured to within a tenfold range and 88% to within a 100-fold range (Fig. 3a and Table 1). The improved capture uniformity resolves the issue of stochastic allelic bias that plagued the initial proof of concept, showing that accurate genotypes can be derived from massively parallel sequencing of MIP capture products. Furthermore, MIP amplification products can be directly sequenced on a next-generation sequencing platform to interrogate variation in targeted sequences, thereby bypassing the need for shotgun library construction.
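To make the probe architecture concrete, the sketch below assembles a hypothetical MIP oligonucleotide from two target-complementary arms flanking a common linker. The arm lengths, the linker sequence, the coordinates and the strand handling are all illustrative assumptions (with orientation deliberately simplified), not the design rules of refs. 24 or 33.

```python
# Sketch: build a MIP whose two arms anneal on either side of a target region,
# leaving a gap that is filled by a polymerase and sealed by a ligase.
COMMON_LINKER = "CTTCAGCTTCCCGATATCCGACGGTAGTGT"   # placeholder linker sequence

def revcomp(seq):
    """Reverse complement of an ACGT sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def design_mip(reference, gap_start, gap_end, arm_len=20):
    """Return a single-stranded MIP targeting reference[gap_start:gap_end].

    Each arm is complementary to the reference immediately flanking the gap,
    so that after annealing, the gap between the arms spans the sequence to
    be captured (strand orientation simplified for illustration).
    """
    upstream = reference[gap_start - arm_len:gap_start]
    downstream = reference[gap_end:gap_end + arm_len]
    return revcomp(downstream) + COMMON_LINKER + revcomp(upstream)

ref = "ACGT" * 50                      # toy 200-base "genome"
probe = design_mip(ref, gap_start=80, gap_end=180)
print(len(probe), probe[:25], "...")
```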
Our current view is that the approach of MIP-based capture followed by direct sequencing may be most relevant for projects involving relatively small numbers of targets but large numbers of samples (Fig. 2). This is based on the following characteristics. (i) Gap-fill reactions and PCRs take place in aqueous solution, in small volumes, so they are easy to scale to large numbers of samples on 96-well plates; no mechanical shearing, gel-based size purification, ligation or A-tailing is required. (ii) Sample-identifying barcodes can be nested in one of the primers used in post-capture amplification, allowing products from multiple samples to be pooled and sequenced in a single lane. (iii) As with PCR, capture is performed directly on genomic DNA rather than after conversion to a shotgun library, reducing input requirements to as low as 200 ng34.
The main disadvantages of using MIPs for target enrichment are, first, that capture uniformity, though markedly improved, compares poorly with that in the most recent reports on capture by hybridization; this remains the foremost challenge for the approach. To help circumvent it, MIPs can potentially be grouped into sets with similar capture efficiencies, because biases tend to be systematically reproducible34. Also, modeling of the causes of nonuniformity can be fed back into MIP design algorithms. Second, MIP oligonucleotides can be costly and difficult to obtain in the large numbers needed to cover large target sets. To mitigate the high cost of column-based oligonucleotide synthesis, thousands of oligonucleotides can be obtained by synthesis and release from programmable microarrays (Agilent24; LC Sciences). Provided that these are designed in an amplifiable format, they can potentially be used to generate MIP probes to support thousands of samples. Alternatively, one can undertake column-based synthesis of individual MIPs followed by pooling. Although the initial cost of this can be high, sufficient material is obtained to support an extraordinarily large number of capture reactions24. The availability of individual probes would also facilitate empirical repooling to improve capture uniformity (J.S.; unpublished data). Finally, it is worth noting that MIPs offer the flexibility to address a range of related applications, for example, DNA methylation, RNA editing and allelic imbalance in expression34–36.
`
`
`
[Figure 3 plots: (a) estimated capture efficiency (arbitrary units, log scale) for 55,000 or 13,000 MIPs by rank-ordered percentile, comparing the original 55,000-plex, optimized 55,000-plex and 13,000-plex protocols; (b,c) sequence depth (0–300) across positions 68.7–70.7 Mb of the X chromosome for the array CTR (b) and the solution CTR (c), shown above the UCSC genes.]

Figure 3 | Uniformity of approaches to target enrichment. (a) Capture efficiency obtained with the MIP-based approach, showing improvements in uniformity for optimized protocols and reduced target size. Image was adapted from ref. 33. (b,c) A region of human chromosome X, detailing the regions to which capture probes could be designed: the capture target region (CTR) for array capture (b) and for solution capture (c). Below the CTR are the UCSC genes, taken from the UCSC genome browser. Below this is shown the sequence depth obtained from a single lane of Illumina sequences for a 3.5-Mb capture experiment.
`
`
Hybrid capture
On-array capture
The principle of direct selection is well established37,38: a shotgun fragment library is hybridized to an immobilized probe, nonspecific hybrids are removed by washing and the targeted DNA is eluted. Roche NimbleGen and their collaborators were the first to adapt the technology to be compatible with next-generation sequencing15,16,39. In the original format, library DNA is hybridized to a single microarray containing 385,000 isothermal probes (the HD1 NimbleGen array), ranging from 60 to 90 bases in length and with a total capture size of around 4–5 Mb. More recently, the HD2 array has been made available, with 2.1 million probes per array and the ability to capture up to 34 Mb on a single array (Fig. 2). The technology was originally designed to be used with the Roche 454 sequencer, but many groups, including ours, expended considerable effort to modify and optimize protocols for use with the Illumina Genome Analyzer. Agilent's Capture Arrays and comparative genomic hybridization (CGH) arrays are perhaps the most direct competitors to NimbleGen's HD1 arrays, though Agilent's Capture Arrays contain only 244,000 probes on the surface (10^6 for CGH arrays). We found that the performance of NimbleGen's and Agilent's arrays is similar (Table 1).

There are clear advantages to on-array target enrichment of large regions over PCR-based approaches: it is far quicker and less laborious than PCR. But there are also drawbacks: working with microarray slides requires expensive hardware, such as a hybridization station. Additionally, the number of arrays that a single person can realistically process each day is limited to approximately 24, and because arrays that are hybridized at the same time must also be eluted together, studies with very large numbers of samples are unfeasible. Finally, to have enough DNA library for a target-enrichment experiment, it is necessary to start library preparation with a relatively large amount of DNA, around 10–15 µg, irrespective of whether the capture experiment targets 100 kb or an entire exome.
`
In-solution capture
To overcome many of these disadvantages, both Agilent and NimbleGen have also developed solution-based target-enrichment protocols. The general principle is similar to that of array capture, in that specific probes are designed to select regions of interest from a sequencing library; but whereas on-array target enrichment uses a vast excess of DNA library over probes, solution capture uses an excess of probes over template, which drives the hybridization reaction further toward completion using a smaller quantity of sequencing library40. In our experiments to test the performance of array versus solution capture, we observed that for smaller target sizes (~3.5 Mb), the uniformity and specificity of sequences obtained from a solution-capture experiment tend to be slightly higher than those of array capture (Fig. 3b,c). Thus in the 3.5-Mb range, solution capture yields superior sequence coverage of the target regions from a similar yield of sequences. However, for whole-exome captures, solution and array appear to perform equivalently (Fig. 4).

In-solution target enrichment can be performed in 96-well plates, using a thermal cycler, so it is more readily scalable than on-array enrichment and does not require specialized equipment (Fig. 2). The principal difference between the Agilent and NimbleGen solution-capture products is the nature of the capture probes: the NimbleGen product uses 60–90-mer DNA capture probes, whereas the Agilent one uses 150-mer RNA capture probes. We have not noticed any appreciable difference between the performance of these products.
`
Library preparation for hybrid capture
Our aim has been to establish a robust production pipeline that can support both on-array and in-solution target enrichment. The manufacturers' workflows for these approaches are very similar, and several general principles apply, which allowed us to produce a standard library preparation protocol for both approaches (Supplementary Protocol 3).

Fragment size. Fragment size, obtained by shearing or other fragmentation approaches, has a large influence on the outcome of a target-enrichment experiment, with shorter fragments invariably being captured with higher specificity than longer ones40,41. This is not necessarily surprising, given that a longer fragment will contain a higher proportion of off-target sequence, and the effect is especially apparent for exons, whose mean length is relatively short: 164 bp40 (for example, a 100-bp exon that is part of a 200-bp fragment will be 50% off target simply because the captured fragment is larger). However, in our experiments comparing hybrid-capture protocols, the decrease in specificity with increasing template size that we observed was more pronounced than could be accounted for by the inclusion of off-target portions of longer template sequences alone, and presumably reflects the increased potential for cross-hybridization between the longer fragments themselves.

We assume that there is also a lower size limit on fragments for efficient capture, but in practice the minimum fragment size is determined by the length one wishes to sequence. Longer reads would be expected to map to the reference sequence with lower ambiguity than shorter reads and can help to reduce overrepresentation toward the ends of capture probes40. For target enrichment of human DNA, we typically generate 76-base paired-end reads, and consequently it is useful to generate fragments that are around 200 bp to avoid overlap between reads 1 and 2 (Supplementary Protocol 3).
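The read-overlap consideration is simple arithmetic: two paired reads overlap whenever their combined length exceeds the fragment length. A minimal check (illustrative only; the function name is ours):

```python
def paired_read_overlap(fragment_len, read_len):
    """Bases of overlap between read 1 and read 2 of a fragment (0 if none)."""
    return max(0, 2 * read_len - fragment_len)

# 76-base paired-end reads on a 200-bp fragment do not overlap (2 * 76 = 152),
# but on a 120-bp fragment they would overlap by 32 bases.
print(paired_read_overlap(200, 76), paired_read_overlap(120, 76))
```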
`
`
`
Target-enrichment sample preparation protocols include a size-selection step to generate a narrow fragment-size range, as this is assumed to assist with read mapping. However, this step is not compatible with a high-throughput workflow because it is too labor-intensive, and, in any case, many read-mapping software packages tolerate a range of fragment sizes, meaning that the size-selection step can be omitted (Fig. 5a). When reads were mapped using MAQ43, we found little difference between libraries made with or without a size-selection step: in a single experiment, the percentage of mapped reads with score ≥30 (indicating that the base quality of the reads is good and that the read maps unambiguously to the selected location with few mismatches) was just over 1% lower for a library made without size selection than for the same library after size selection (88.7% versus 89.9%).
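As an aside on how such a cutoff might be applied in practice, here is a minimal sketch using the pysam library to count reads passing a MAQ-style mapping-quality threshold of 30, as stored in the BAM MAPQ field; the file name is a placeholder and the snippet is our illustration, not part of the protocols.

```python
import pysam

# Count mapped reads passing a mapping-quality cutoff of 30.
QUALITY_CUTOFF = 30
passed = total = 0
with pysam.AlignmentFile("capture_library.bam", "rb") as bam:  # placeholder path
    for read in bam:
        if read.is_unmapped:
            continue
        total += 1
        if read.mapping_quality >= QUALITY_CUTOFF:
            passed += 1
if total:
    print(f"{passed}/{total} mapped reads ({100 * passed / total:.1f}%) "
          f"have MAPQ >= {QUALITY_CUTOFF}")
```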
`
[Figure 4 plot: percentage of CTR bases (40–100%) achieving a given fold coverage (0–50), comparing 3.5-Mb in-solution capture, 3.5-Mb on-array capture, whole-exome solution capture and whole-exome array capture.]

Figure 4 | Coverage plot for array and solution hybrid capture, for 3.5 Mb of exonic target and the whole human exome. Values were taken from five independent array and solution experiments using the same CTR, with each capture using a different DNA sample and each yielding roughly 10^7 mappable sequences per lane. One lane of sequencing was used for the 3.5-Mb captures, whereas two or three lanes were used for the whole exome. Error bars, s.d. (n = 5).
`
PCR optimization. The use of acoustic shearing and the removal of the size-selection step result in a greater mass of DNA being available for target enrichment than when standard approaches are used, and this allowed us to investigate the effect of performing PCR amplification at different stages of the target-enrichment process. We noted a negative influence of PCR amplification on the uniformity of enrichment in both on-array and in-solution methods: performing 18 cycles of PCR amplification of libraries both before and after hybridization can introduce severe bias toward neutral G+C content in the resulting sequences (Fig. 5b). Avoiding the PCR step altogether before hybridization greatly improved the situation (Fig. 5c), so it is desirable to keep PCR amplification to a minimum and to perform it only after hybridization.

However, an amplification-free library preparation tends to lack robustness, especially with samples of lower integrity, such as clinical specimens, compared to intact DNA. In these cases, we recommend around six cycles of amplification before hybridization and the use of blocking adapters41 to avoid a reduction in specificity caused by random concatenation of libraries, so-called 'daisy-chaining'. If no PCR is performed before hybridization, there is no need to use blocking adapters if sequencing is performed on Illumina's Genome Analyzer, because the pre-PCR adapters are partially noncomplementary44 and are thus not problematic in this way. We recommend that hybrid capture be performed following the manufacturers' standard protocols (Supplementary Protocols 4 and 5) and that 14–18 cycles of PCR be performed on the samples eluted after hybridization (Supplementary Protocol 6).
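A G+C bias of this kind can be diagnosed by comparing per-target G+C fraction with mean coverage, as in the following minimal sketch; the input structure and toy values are placeholders for figures one would extract from an alignment, and the binning scheme is our assumption.

```python
# Group targets into G+C bins and report mean coverage per bin; coverage
# concentrated at neutral G+C (~40-60%) suggests PCR-introduced bias.
def gc_fraction(seq):
    """Fraction of G and C bases in an ACGT sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def coverage_by_gc(targets, bin_width=0.1):
    """targets: list of (sequence, mean_coverage) pairs, one per captured region."""
    bins = {}
    for seq, cov in targets:
        b = min(int(gc_fraction(seq) / bin_width), int(1 / bin_width) - 1)
        bins.setdefault(b, []).append(cov)
    return {f"{b * bin_width:.1f}-{(b + 1) * bin_width:.1f}": sum(c) / len(c)
            for b, c in sorted(bins.items())}

# Toy example: an AT-rich, a balanced and a GC-rich target with mean coverages.
toy = [("ATATATATAT", 8.0), ("ATGCATGCAT", 31.0), ("GGCCGGCCGG", 12.0)]
print(coverage_by_gc(toy))
```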
`
`