`Targeted enrichment of genomic DNA
`regions for next-generation sequencing
`Florian Mertes, Abdou ElSharawy, Sascha Sauer, Joop M.L.M. van Helvoort, P.J. van der Zaag, Andre Franke,
`Mats Nilsson, Hans Lehrach and Anthony J. Brookes
`Advance Access publication date 26 November 2011
`In this review, we discuss the latest targeted enrichment methods and aspects of their utilization along with
`second-generation sequencing for complex genome analysis. In doing so, we provide an overview of issues involved
`in detecting genetic variation, for which targeted enrichment has become a powerful tool. We explain how targeted
`enrichment for next-generation sequencing has made great progress in terms of methodology, ease of use and ap-
`plicability, but emphasize the remaining challenges such as the lack of even coverage across targeted regions. Costs
`are also considered versus the alternative of whole-genome sequencing which is becoming ever more affordable.
`We conclude that targeted enrichment is likely to be the most economical option for many years to come in a
`range of settings.
`Keywords: targeted enrichment; next-generation sequencing; genome partitioning; exome; genetic variation
`Next-generation sequencing (NGS) [1, 2] is now a
`major driver in genetics research, providing a power-
`ful way to study DNA or RNA samples. New and
`improved methods and protocols have been de-
`veloped to support a diverse range of applications,
`including the analysis of genetic variation. As part of
`this, methods have been developed that aim to
`achieve ‘targeted enrichment’ of genome subregions
`[3, 4], also sometimes referred to as ‘genome parti-
`tioning’. Strategies for direct selection of genomic
`regions were already developed in anticipation of
the introduction of NGS [5, 6]. By selective
recovery and subsequent sequencing of genomic loci
of interest, costs and effort can be reduced significantly
compared with whole-genome sequencing.
Targeted enrichment can be useful in a number of
situations where particular portions of a
`Corresponding author. Florian Mertes, Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany.
Tel: +49 30 8413 1289; fax: +49 30 8413 1128; E-mail: mertes@molgen.mpg.de
`Florian Mertes studied biotechnology and earned a Doctorate from the Technical University Berlin. Currently, he is a postdoctoral
`researcher focusing on applied research to develop test/screening assays based on high-throughput technologies, using both PCR and
`next-generation sequencing.
`Abdou ElSharawy is a postdoctoral researcher (University of Kiel, CAU, Germany), and lecturer of Biochemistry and Cell Molecular
Biology (Mansoura University, Egypt). He focuses on disease-associated mutations and miRNAs, allele-dependent RNA splicing, and
`high-throughput targeted, whole exome, and genome sequencing.
`Sascha Sauer is a research group leader at the Max Planck Institute for Molecular Genetics, and coordinates the European Sequencing
`and Genotyping Infrastructure.
`Joop M.L.M. van Helvoort is CSO at FlexGen. He received his PhD at the University of Amsterdam. His expertise is in microarray
`applications currently focusing on target enrichment.
`P.J. van der Zaag is with Philips Research, Eindhoven, The Netherlands. He holds a doctorate in physics from Leiden University. At
`Philips, he has worked on a number of topics related to microsystems and nanotechnology, lately in the field of nanobiotechnology.
`Andre Franke is a biologist by training and currently holds an endowment professorship for Molecular Medicine at the Christian-
`Albrechts-University of Kiel in Germany and is guest professor in Oslo (Norway).
`Mats Nilsson is Professor of Molecular Diagnostics at the Department of Immunology, Genetics, and Pathology, Uppsala University,
`Sweden. He has pioneered a number of molecular analysis technologies for multiplexed targeted analyses of genes.
`Hans Lehrach is Director at the Max Planck Institute for Molecular Genetics. His expertise lies in genetics, genomics, systems biology,
`and personalized medicine. Highlights include key involvement in several large-scale genome sequencing projects.
`Anthony J Brookes is a Professor of Bioinformatics and Genomics at the University of Leicester (UK) where he runs a research team
`and several international projects in method development and informatics for DNA analysis through to healthcare.
© The Author(s) 2011. Published by Oxford University Press.
`This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
`by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
`whole genome need to be analyzed [7]. Efficient
`sequencing of the complete ‘exome’ (all transcribed
`sequences) represents a major current application,
`but researchers are also focusing their experiments
on far smaller sets of genes or genomic regions potentially
implicated in complex diseases [e.g. derived from genome-wide
association studies (GWAS)], pharmacogenetics, pathway analysis and
so on [1, 8, 9]. For identifying monogenic diseases,
exome sequencing can be a powerful tool.
`Across all these areas of study, a typical objective is
`the analysis of genetic variation within defined
`cohorts and populations.
`Targeted enrichment techniques can be charac-
`terized via a range of technical considerations related
`to their performance and ease of use, but the prac-
`tical importance of any one parameter may vary de-
`pending on the methodological approach applied
`and the scientific question being asked. Arguably,
`the most important features of a method, which in
`turn reflect the biggest challenges in targeted enrich-
`ment, include: enrichment factor, ratio of sequence
`reads on/off target region (specificity), coverage (read
`depth), evenness of coverage across the target region,
`method reproducibility, required amount of input
`DNA and overall cost per target base of useful
`sequence data.
`Within this review, we compare and contrast
`the most commonly used techniques for targeted
`enrichment of nucleic acids
`for NGS analysis.
`Additionally, we consider issues around the use of
`such methods for the detection of genetic variation,
`and some general points regarding the design of the
`target region, input DNA sample preparation and the
`output analysis.
`Current techniques for targeted enrichment can be
`categorized according to the nature of their core re-
`action principle (Figure 1):
(i) ‘Hybrid capture’: wherein nucleic acid strands
derived from the input sample are hybridized
specifically to preprepared DNA fragments complementary
to the targeted regions of interest,
either in solution or on a solid support, so that
one can physically capture and isolate the
sequences of interest;
(ii) ‘Selective circularization’: also called molecular
inversion probes (MIPs), gap-fill padlock probes
and selector probes, wherein single-stranded
DNA circles that include target region sequences
are formed (by gap-filling and ligation
chemistries) in a highly specific manner, creating
structures with common DNA elements that are
then used for selective amplification of the
targeted regions of interest;
(iii) ‘PCR amplification’: wherein polymerase chain
reaction (PCR) is directed toward the targeted
regions of interest by conducting multiple
long-range PCRs in parallel, a limited number
of standard multiplex PCRs or highly multiplexed
PCR methods that amplify very large
numbers of short fragments.
`Given the operational characteristics of these dif-
`ferent targeted enrichment methods, they naturally
vary in their suitability for different fields of application.
For example, where many megabases need
to be analyzed (e.g. the exome), hybrid capture
`approaches are attractive as they can handle large
`target regions, even though they achieve suboptimal
`enrichment over the complete region of interest.
`In contrast, when small target regions need to be
`examined, especially in many samples, PCR-based
`approaches may be preferred as they enable a deep
`and even coverage over the region of interest, suit-
`able for genetic variance analysis.
An overview of these different approaches is presented
in Figure 1, and Table 1 lists the most
common methods.
`Basic considerations for targeted
`enrichment experiments
`The design of a targeted enrichment experiment
`begins with a general consideration of the target
`region of interest. In particular, a major obstacle for
`targeted enrichment is posed by repeating elements,
`including interspersed and tandem repeats as well as
`elements such as pseudogenes located within and
`outside the region of interest. Exclusion of repeat
`masked elements [11] from the targeted region is a
`straightforward and efficient way to reduce the re-
`covery of undesirable products due to repeats.
`Furthermore, at extreme values (<25% or >65%),
`the guanine-cytosine (GC) content of the target
`region has a considerable impact on the evenness
and efficiency of the enrichment [12]. This can
adversely affect the enrichment of the 5′
promoter region and the first exon of genes, which
`Figure 1: Commonly used targeted enrichment techniques. (1) Hybrid capture targeted enrichment either on solid
`support-like microarrays (a) or in solution (b). A shot-gun fragment library is prepared and hybridized against a li-
`brary containing the target sequence. After hybridization (and bead coupling) nontarget sequences are washed
`away, the enriched sample can be eluted and further processed for sequencing. (2) Enrichment by MIPs which are
`composed of a universal sequence (blue) flanked by target-specific sequences. MIPs are hybridized to the region of
`interest, followed by a gap filling reaction and ligation to produce closed circles. The classical MIPs are hybridized
`to mechanically sheared DNA (a), the Selector Probe technique uses a restriction enzyme cocktail to fragment
`the DNA and the probes are adapted to the restriction pattern (b). (3) Targeted enrichment by differing PCR
`approaches. Typical PCR with single-tube per fragment assay (a), multiplex PCR assay with up to 50 fragments (b)
`and RainDance micro droplet PCR with up to 20 000 unique primer pairs (c) utilized for targeted enrichment.
`are often GC rich [13]. Therefore, expectations
`regarding the outcome of the experiment require
`careful evaluation in terms of
`the precise target
`region in conjunction with the appropriate enrich-
`ment method.
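The GC thresholds mentioned above lend themselves to a simple pre-screen of a candidate target region. The following is a minimal sketch, assuming the target sequence is available as a plain string; the function names, the 100-base window and the 50-base step are illustrative choices and do not come from the cited work. Only the 25% and 65% cut-offs are taken from the text.

```python
def gc_fraction(seq):
    """Return the G/C fraction among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def flag_extreme_gc(seq, window=100, step=50, low=0.25, high=0.65):
    """Return (start, end, gc) for windows with GC below `low` or above `high`.

    Window and step sizes are assumptions made for illustration only.
    """
    flagged = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if gc < low or gc > high:
            flagged.append((start, start + len(chunk), gc))
    return flagged
```

Windows flagged in this way could then be inspected before committing to a particular enrichment method, since they are the parts of the design most likely to show uneven coverage.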
`The performance of a targeted enrichment ex-
`periment will also depend upon the mode and qual-
`ity of processing of the input DNA sample. Having
`sufficient high-quality DNA is key for any further
`downstream handling. When limited genomic DNA
`is available, whole-genome amplification (WGA) is
usually applied. Since WGA produces only a representation
and not a replica of the genome, a bias is
assumed to be introduced, though its impact
on the final results can be compensated for, to a degree,
by identically manipulating control samples [14].
All three major targeted enrichment techniques
(hybrid capture, circularization and PCR) differ in
terms of the sample library preparation workflow that enables
sequencing on any of the current NGS instruments
(Illumina, Roche 454 and SOLiD).
Enrichment by hybrid selection relies on short-fragment
library preparations (typically ranging from 100
to 250 bp), which are generated before hybridization
to the synthetic library comprising the target region.
In contrast, enrichment by PCR is performed directly
on genomic DNA, and the library
primers for sequencing are added thereafter. Enrichment by
circularization offers the easiest library preparation for
NGS because the sequencing primers can be added
to the circularization probe, thus eliminating the
need for any further library preparation steps.
`Sequencing can be performed either as single read or
`paired-end reads of the fragment library. In general,
`mate–pair libraries are not used for hybridization-
`based targeted enrichments due to the extra compli-
`cations this implies in terms of target region design.
`In general, a single NGS run produces enough
`reads to sequence several samples enriched by one
`of the mentioned methods. Therefore, pooling stra-
`tegies and indexing approaches are a practical way
`to reduce the per sample cost. Depending on the
`method used for
`targeted enrichment, different
`multiplexing strategies can be envisaged that enable
`multiplexing in different stages of the enrichment
`process: before, during and after the enrichment.
For targeted enrichment by hybrid capture, indexing
of the sample is usually performed after the enrichment,
but to reduce the number of enrichment reactions,
the sample libraries can alternatively be
indexed during the library preparations and then
pooled for enrichment [15]. Enrichment by PCR
`and circularization offers indexing during the enrich-
`ment by using bar-coded primers in the product
`amplification steps [16]. Furthermore, two multi-
`plexing strategies can be combined in a single ex-
`periment. First, multiple samples can be enriched as a
`pool, with each harboring a unique pre-added
`bar-code. Then second, another bar-coding proced-
`ure can be applied postenrichment, to each of these
`pools, giving rise to a highly multiplexed final pool.
`If such extensive multiplexing is used, great care
`must be taken to normalize the amount of each
`sample within the pool to achieve sufficiently even
representation over all samples in the final set of sequence
reads. In addition, highly complex pooling
strategies also imply far greater challenges when it
comes to deconvoluting the final sequence data
back into the original samples.
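The deconvolution step for a two-level bar-coding scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any vendor's software: each read is assumed to begin with a postenrichment pool bar-code followed directly by the pre-added sample bar-code, matching is exact only, and the names `demultiplex`, `representation` and `sample_sheet` are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, sample_sheet, pool_len, sample_len):
    """Assign reads to samples from two leading bar-codes.

    Assumed read layout: pool bar-code + sample bar-code + insert.
    `sample_sheet` maps (pool_barcode, sample_barcode) to a sample name.
    Reads with an unknown bar-code pair go to the 'unassigned' bin.
    """
    bins = defaultdict(list)
    for read in reads:
        pool_bc = read[:pool_len]
        sample_bc = read[pool_len:pool_len + sample_len]
        insert = read[pool_len + sample_len:]
        sample = sample_sheet.get((pool_bc, sample_bc), "unassigned")
        bins[sample].append(insert)
    return dict(bins)


def representation(bins):
    """Return each sample's share of the assigned reads."""
    assigned = {s: len(r) for s, r in bins.items() if s != "unassigned"}
    total = sum(assigned.values())
    return {s: n / total for s, n in assigned.items()} if total else {}
```

Comparing the output of `representation` against the intended pooling ratios gives a quick check on whether the normalization of samples within the pool was adequate.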
`The task of designing the target region is relatively
`straightforward, and this can be managed with web-
`based tools offered by UCSC, Ensembl/BioMart,
`etc. and spreadsheet calculations (e.g. Excel) on a
`personal computer. Web-based tools like MOPeD
offer a more user-friendly approach for oligonucleotide
probe design [17]. Far more difficult, however, is
`the final sequence output analysis, which needs dedi-
`cated computer hardware and software. Fortunately,
`great progress has recently been made in read map-
`ping and parameter selection for this process, leading
`to more consistent and higher quality final results
`[18]. Reads generated by hybrid selection will always
`tend to extend into sequences beyond the
`target region and the longer the fragment library is,
`the more of these ‘near target’ sequences will be re-
`covered. Therefore, read mapping must start with a
`basic decision regarding the precise definition of the
`on/off target boundaries, as this parameter is used for
`counting on/off target reads and so influences the
`number of sequence reads considered as on target.
`This problem is not so critical for enrichments based
`on PCR and circularization as these methods do not
`suffer from ‘near target’ products. Another major
`consideration in data analysis is the coverage needed
to reliably identify sequence variants, e.g. single nucleotide
polymorphisms (SNPs). This depends on
multiple factors, such as the nature of the region of
interest and the method used for targeted
enrichment. Reported values range from
8x coverage [19], given as the minimum coverage
for reliable SNP calling, up to 200x coverage
[20], given as the average coverage across the
targeted region.
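The effect of the on/off target boundary definition on read counts can be made concrete with a small sketch. It assumes half-open coordinates on a single chromosome and a single target interval; the `flank` parameter and the three labels are illustrative conventions, not a standard defined in the cited reports.

```python
def classify_read(read_start, read_end, target_start, target_end, flank=0):
    """Label a read 'on', 'near' or 'off' target.

    Coordinates are half-open [start, end). A read overlapping the target
    is 'on'; a read overlapping only the target padded by `flank` bases on
    each side is 'near'; anything else is 'off'.
    """
    if read_start < target_end and read_end > target_start:
        return "on"
    if read_start < target_end + flank and read_end > target_start - flank:
        return "near"
    return "off"


def count_reads(reads, target, flank=0):
    """Count reads per label for a list of (start, end) read intervals."""
    counts = {"on": 0, "near": 0, "off": 0}
    for start, end in reads:
        counts[classify_read(start, end, target[0], target[1], flank)] += 1
    return counts
```

With `flank=0`, reads that end just outside the target are counted as off target; widening the flank moves them into the 'near' class, which is why the chosen boundary should be reported alongside any specificity figure.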
`Enrichment by hybrid capture
`Enrichment by hybrid capture (Figure 1.1a and b)
`builds on know-how developed over the decade or
`more of microarray research that preceded the NGS
`age [21, 22]. The hybrid capture principle is based
`upon the hybridization of a selection ‘library’ of very
`many fragments of DNA or RNA representing the
`target region against a shotgun library of DNA frag-
`ments from the genome sample to be enriched. Two
`alternative strategies are used to perform the hybrid
`capture: (i) reactions in solution [4] and (ii) reactions
`on a solid support [3]. Each of these two approaches
`brings different advantages, as listed in Table 1.
`Selection libraries for hybrid capture are typically
`produced by oligonucleotide synthesis upon micro-
`arrays, with lengths ranging from 60 to 180 bases.
`These microarrays can be used directly to perform
`the hybrid capture reaction (i.e. surface phase meth-
`ods), or the oligonucleotide pool can be harvested
`from the array and used for an in-solution targeted
enrichment (i.e. solution phase methods). The detached
oligonucleotide pool enables versatile downstream
processing: if universal 5′ and 3′
sequences are included in the design of the oligonucleotides,
the pool can be reamplified by PCR and
used to process many genomic samples. Furthermore,
`it is possible to introduce T7/SP6 transcription start
`sites via these PCRs [23], so that the pool can be
`transcribed into RNA before being used in an en-
`richment experiment.
Recently, an increasing number of protocols and
vendors have begun offering out-of-the-box solutions
for hybrid capture, meaning that the researcher need
not do development work but merely choose between
preset targeted enrichment regions (e.g.
whole exome) or specify a custom enrichment
region. Example vendors include Agilent
(SureSelect product), NimbleGen (SeqCap EZ product),
Flexgen and MYcroArray. Alternatively, a
more cost-efficient option compared with buying a
complete kit involves ordering a synthetic bait
library, reamplifying this by PCR [24], optionally
transcribing it into RNA and undertaking a
do-it-yourself enrichment experiment based upon
published protocols.
`Enrichment by circularization
`Enrichment by DNA fragment circularization is
`based upon the principle of selector probes [6, 25]
`and gap-fill padlock or MIPs [26]. This approach
`differs significantly compared with the aforemen-
`tioned hybrid capture method. Most notably, it is
`greatly superior in terms of specificity, but far less
`amenable to multiple sample co-processing in a
`single reaction. Each probe used for enrichment by
`circularization comprises a single-stranded DNA
`oligonucleotide that at its ends contains two se-
`quences that are complementary to noncontiguous
`stretches of a target genomic fragment, but in re-
`versed linear order. Specific hybridization between
`such probes and their cognate target genomic frag-
`ments generates bipartite circular DNA structures.
`These are then converted to closed single-stranded cir-
`cles by gap filling and ligation reactions (Figure 1.2). A
`rolling circle amplification step or a PCR directed
`toward sequences present in the common region of
`all the circles is then finally applied to amplify the
`target regions (circularized sequences) to generate an
`NGS library.
`Variations on this basic method concept exist, in
`particular with regard to the differences in sample
`material preparation and downstream processing for
NGS library preparation. In the gap-fill padlock or
MIP implementation (Figure 1.2a), the sample
DNA is fragmented by shearing and used in the bipartite
circular structure to provide a template for the
probe DNA to be extended by gap filling and converted
to a closed circle. In this incarnation, the design
of the MIPs merely has to consider the
uniqueness of each target region fragment and the
most suitable hybridization conditions. In contrast, a
more elaborate design is offered by the ‘Selector
Probe’ technique [6, 27]. Here the genomic DNA
is fragmented in a controlled manner by means of a
cocktail of restriction enzymes, and the selector
probes are designed to accommodate the restriction
pattern of the target region. The ends of each genomic
DNA fragment thus become adjacently positioned in the
bipartite circles, enabling them to be gap filled and
ligated into closed single-stranded circles.
A particularly appealing feature of enrichment by
circularization with MIPs and selectors is their ‘library
free’ nature [28]. Since MIPs and selectors
comprise a target-specific 5′- and 3′-end with a
common central linker, the sequencing primer information
for NGS applications can be directly included
into this common linker. Burdensome NGS library
preparations are therefore not required, reducing
processing time markedly.
`Enrichment by PCR
Enrichment by PCR (Figure 1.3a–c) is, in terms of
methodology, a more straightforward method compared
with the other genome partitioning techniques.
It takes advantage of the great power of PCR to
enrich genome regions from small amounts of
target material. Just as for circularization methods,
if the PCR product sizes fall within the sequencing
length of the applied NGS platform (maximum read
length for SOLiD: 110 bp, Illumina: 240 bp and 454:
1000 bp), PCR-based enrichment can allow one to
bypass the need for shot-gun library preparation by
using suitably 5′-tailed primers in the final amplification
steps.
`The main downside of the method is that it does
`not scale easily, in any format, to enable the targeting
`of very large genome subregions or many DNA sam-
`ples. To use this method effectively, any significant
`extent of parallelized singleplex or multiplex PCR
`would need to be supported by the use of automated
`individual PCR amplicons (or multiplex
`products) need to be carefully normalized to equiva-
`lent molarities when pooling in advance of NGS (so
`that the final coverage of the total region of interest is
`as even as possible), and the amount of DNA mater-
`ial a study requires can be substantial as this require-
`ment grows linearly with the number of utilized
`PCR reactions. But if the target region is small,
`PCR can be the method of choice. For example, a
`target region of 50–100 kb or so, could be spanned
`by a handful of long-range PCRs each of 5–10 kb
`[29], or by tiling a few hundred shorter PCRs and
`using microtiter plates and robotics, or by one or
`other approaches toward PCR multiplexing [30, 31].
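The normalization of amplicons to equivalent molarities before pooling can be illustrated with a short calculation. The sketch below assumes double-stranded products, concentrations measured in ng/µl and the common approximation of 660 g/mol per base pair; the function names and the example input layout are hypothetical.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair, a common approximation for dsDNA


def molarity_nM(conc_ng_per_ul, length_bp):
    """Convert a dsDNA mass concentration (ng/ul) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def equimolar_volumes(amplicons, fmol_each):
    """Return the volume (ul) of each amplicon that contributes `fmol_each`.

    `amplicons` maps a name to (concentration in ng/ul, length in bp).
    Since 1 nM equals 1 fmol/ul, volume = fmol / molarity in nM.
    """
    volumes = {}
    for name, (conc, length) in amplicons.items():
        volumes[name] = fmol_each / molarity_nM(conc, length)
    return volumes
```

At the same mass concentration, a 10 kb long-range product is ten times less concentrated in molar terms than a 1 kb product, so ten times the volume is needed for an even contribution to the pool.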
Long-range PCR is the most commonly applied
approach and it is reasonably straightforward to accomplish.
Many vendors now offer specially formulated
kits (e.g. Invitrogen SequalPrep, Qiagen)
that can amplify fragments of up to
20 kb in length. And obviously, this approach is
`fully compatible with automation. Long-range
`PCR products do, however, have to be cleaned,
`pooled and processed for shot-gun library prepar-
`ation so that they are ready for analysis by NGS.
To increase the throughput of PCR while keeping
`the number of PCR reactions as low as possible, there
`is the alternative of multiplex PCR (Figure 1.3b).
`Given careful primer design and reaction optimiza-
`tion, several dozen primer pairs can be used together
`effectively in a multiplexing reaction [32]. Indeed,
`software specifically created to help with multiplex
`PCR assay design is available [33]. Then, by running
many such reactions in parallel, many hundreds of different
DNA fragments can be amplified. An alternative
method, commercially available from
Fluidigm (Table 1), uses a microfluidics PCR chip
to conduct thousands of singleplex PCRs.
Yet another strikingly elegant method is the
micro-droplet PCR technology developed by
RainDance [34, 35]. Here, two libraries of lipid-encapsulated
water droplets are prepared: one in
`which each droplet contains a small amount of the
`test sample DNA and the other comprising droplets
`that harbor distinct pairs of primers. These two
`libraries are then merged (respective droplet pairs
`are fused together) to generate a highly multiplexed
`total emulsion PCR wherein each reaction is actually
`isolated from all others in its own fused droplet
`(Figure 1.3c). Using this technology, up to 20 000
`primer pairs can be used effectively in parallel in a
`single tube.
`Overall, one can draw the following conclusions
`from a comparison of the currently used enrichment
`techniques shown in Table 1: (i) that hybrid capture
`has its main advantages for medium to large target
`regions (10–50 Mb) in contrast to the other two
`approaches which typically only target small regions
`within the kilo base pairs and low mega base pairs
`range. The ability to enrich for mega base pair-sized
`targets is particularly advantageous in research studies
`where typically whole exomes or many genes are
involved. Especially for clinical applications, this
may be relevant for oncological applications where
one would expect to sequence hundreds to thousands of genes.
`(ii) The advantage of PCR and circularization-based
`methods is that they achieve very high enrichment
`factors and few off-target reads, but only for small
`target regions. This is more suited to clinical genetics
where typically only a few critical loci need to be analyzed.
`Descriptive metrics for targeted DNA
`enrichment experiments
`To allow meaningful comparison of enrichment
`methods and experiments that employ them, and
to rationally decide which technologies are most
suitable when designing a research project, it is important
that an objective set of descriptive metrics is
defined and then widely used when reporting enrichment
datasets. A series of metrics needs to be
`considered, and the importance of each can be
`weighted according to specific needs and objectives
`of any experiment. A proposal for such a set of met-
`rics is soon to be published, and it contains the fol-
`lowing (Nilsson et al., manuscript in preparation):
`(i) Region of interest (size): ROI;
`(ii) Average read depth (in ROI): D;
`(iii) Fraction of ROI sufficiently covered (at a spe-
`cified D): F;
`(iv) Specificity (fraction of reads in ROI): S;
`(v) Enrichment Factor (D for ROI versus D for rest
`of genome): EF;
`(vi) Evenness (lack of bias): E and
`(vii) Weight (input DNA requirement): W.
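Several of the listed metrics can be computed directly from a per-base depth profile of the ROI plus a few read and base totals. The following is a minimal sketch under explicit assumptions: the function name and its inputs are hypothetical, the depth threshold for F defaults to 20x, and, because the text does not define Evenness numerically, E is taken here purely as a placeholder to be the fraction of ROI bases with depth of at least 20% of the mean. W is an experimental input and is not computed.

```python
from statistics import mean


def enrichment_metrics(roi_depths, on_target_reads, total_reads,
                       total_mapped_bases, genome_size, min_depth=20):
    """Compute ROI, D, F, S, EF and a placeholder E from summary data.

    roi_depths: per-base read depth across the region of interest.
    total_mapped_bases: all mapped bases, on and off target.
    genome_size: genome size in bases (same units as len(roi_depths)).
    """
    roi = len(roi_depths)
    d = mean(roi_depths)
    f = sum(1 for x in roi_depths if x >= min_depth) / roi
    s = on_target_reads / total_reads
    on_bases = sum(roi_depths)
    # Mean depth over the rest of the genome, used as the EF denominator.
    d_rest = (total_mapped_bases - on_bases) / (genome_size - roi)
    ef = d / d_rest if d_rest > 0 else float("inf")
    # Placeholder evenness: share of bases at >= 20% of the mean depth.
    e = sum(1 for x in roi_depths if x >= 0.2 * d) / roi
    return {"ROI": roi, "D": d, "F": f, "S": s, "EF": ef, "E": e}
```

Reporting the threshold used for F, and the definition used for E, next to the values themselves keeps results comparable between experiments.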
`A theoretical examination of how a method’s
`innate enrichment capability and the size of the tar-
`geted region work together to determine other par-
`ameters (such as specificity and read depth) can be
`very instructive when choosing an enrichment
`method for a particular application. This is illustrated
`in Figure 2. For example, given a method’s specific
`enrichment factor and knowledge about the size of
`the region of interest, the corresponding sequencing
`effort can be estimated for a given desired specificity
`(percent of sequences on target). Similarly,
`for a
`given region of interest and a minimum desired spe-
`cificity, the necessary enrichment factor capabilities
`can be calculated.
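The relationships described in this paragraph can be sketched numerically. The code below assumes the definition of the enrichment factor given in the metric list (mean depth in the ROI relative to mean depth in the rest of the genome), from which the expected specificity follows by mass balance. This derivation is an assumption of the sketch and is not necessarily the exact formula used for Figure 2; all function names are hypothetical.

```python
def expected_specificity(ef, roi, genome):
    """Fraction of sequenced bases on target for a given enrichment factor.

    On-target signal scales with ef * roi, off-target with (genome - roi).
    roi and genome must be in the same units (e.g. kb).
    """
    on = ef * roi
    return on / (on + (genome - roi))


def required_enrichment_factor(specificity, roi, genome):
    """Enrichment factor needed to reach a desired specificity (0 < s < 1)."""
    return specificity * (genome - roi) / ((1.0 - specificity) * roi)


def bases_to_sequence(target_depth, roi, specificity):
    """Total sequence needed to reach a mean ROI depth at a given specificity."""
    return target_depth * roi / specificity
```

For example, under these assumptions a 1 Mb target in a genome whose remainder is 1000 Mb needs an enrichment factor of about 1000 for half of the sequence to fall on target, and halving the specificity doubles the sequencing effort for the same mean depth.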
Finally, the specific per sample costs for a targeted
enrichment are useful to consider. To make costs
Figure 2: Comparison of enrichment factor calculations on sequencing depth and percent on target sequences for
different target region sizes employed for targeted enrichment. Calculations were performed as follows:
\[ \text{percent on target sequences} = 100 \times \frac{\mathrm{EF} \times \mathrm{ROI}}{\mathrm{ROI} + \text{genome}} \quad [52] \]
\[ \text{sequencing depth} = \frac{\text{pot} \times \text{seq per run}}{100 \times \mathrm{ROI}} \]
EF, enrichment factor; ROI, region of interest in kb; genome, genome size in kb; pot, percent on target sequences; seq per run, assumed
sequences per run in kb.
`comparable, either for different target region sizes or
`across different methods, the costs can be normalized
`as costs per base pair. Costs also change with time
`and as technologies improve, and so at some stage
`the overall price of any particular experiment (i.e.
`targeted enrichment plus sequencing costs) will not
`be cheaper than the alternative of whole-genome
`sequencing combined with in silico-based isolation
`of the region of interest.
`To investigate genetic variation by NGS, many
`DNA samples need to be tested. To reduce the
`cost of such studies, researchers typically focus their
`attention on genome subregions of particular inter-
`est, and this implies a major role for targeted enrich-
`ment in such undertakings. A set of concerns then
`arises regarding the accuracy of variation discovery
`within NGS data obtained from DNA that has been
`subjected to one or other enrichment methods.
`Other questions, such as whether the input genomic
`DNA was also preamplified by WGA, whether sam-
`ple pooling or multiplexing was applied and whether
`proper experimental controls were employed, also
`come into play. Currently, however, the field is lack-
`ing a complete understanding of all the issues and
`influences relevant
`to these important questions.
`For these reasons, it is critical that thorough down-
`stream validation experiments are performed, using
`independent experimental approaches.
`Another dimension to the problem of reliably dis-
`covering sequence variation, and one where there is
`perhaps a little more clarity, is the impact of different
`software and algorithm choices used for primary se-
`quence data analysis (e.g. the choice of suitable gen-
`ome alignment tool, filter parameters for the analysis,
`coverage thresholds at intended bases). It has been
`shown that the detection of variants depends strongly
on the particular software tools employed [36].
`Indeed, because current alignment and analytical tools
`perform so heterogeneously,
`the 1000 Genomes
`Project Consortium [37] decided to avoid calling
`novel SNPs unless they were discovered by at least
`two independent analytical pipelines. In general, uni-
`fied analysis workflows can and must be developed
`[38] to enable the combination and processing of
`data produced from different machines/approaches,
to at least minimize instrument-specific biases and
errors that otherwise detract
from making high-confidence variant base calls.
`Whatever mapping and analysis approach is
`applied, sufficient coverage on a single base reso-
`lution ranging from 20 to 50x is usually deemed
`necessary for reliable detection of sequence variation
`[39–42]. In one simulation study, the SNP discovery
`performances of two NGS platforms in a specific
`disease gene were shown to fall rapidly when the
`coverage depth was below 40x [43]. In addition,
`all called variants should ideally be supported by
`data from both read orientations (forward and re-
`verse). Some researchers further insist on obtaining
`at least three reads from both the forward and the
`reverse DNA strands (double-stranded coverage) for
`any nonreference base before it is called [20]. Such
`stringent quality control practices are surely needed
`to minimize error rates and the impact of random
sampling variance, so that true variations and sequencing
artifacts can be resolved and homozygous and
heterozygous genotypes at sites of variation reliably
distinguished. Deep coverage alone seems not, by itself, to
always be sufficient for accurate variation discovery.
For example, a naïve Bayesian model for SNP calling,
even with deep coverage, can le