`
`Sequencing depth and coverage: key
`considerations in genomic analyses
`
`David Sims, Ian Sudbery, Nicholas E. Ilott, Andreas Heger and Chris P. Ponting
`
`Abstract | Sequencing technologies have placed a wide range of genomic analyses within the
capabilities of many laboratories. However, sequencing costs often set limits to the amount
of sequence data that can be generated and, consequently, the biological outcomes that can be
`achieved from an experimental design. In this Review, we discuss the issue of sequencing
`depth in the design of next-generation sequencing experiments. We review current
`guidelines and precedents on the issue of coverage, as well as their underlying considerations,
`for four major study designs, which include de novo genome sequencing, genome
`resequencing, transcriptome sequencing and genomic location analyses (for example,
`chromatin immunoprecipitation followed by sequencing (ChIP–seq) and chromosome
`conformation capture (3C)).
`
`Genomics is extending its reach into diverse fields of
`biomedical research from agriculture to clinical diag-
`nostics. Despite sharp falls in recent years1, sequencing
`costs remain substantial and vary for different types of
`experiment. Consequently, in all of these fields inves-
`tigators are seeking experimental designs that gener-
`ate robust scientific findings for the lowest sequencing
`cost. Higher coverage of sequencing (BOX 1) inevitably
`requires higher costs. The theoretical or expected cover-
`age is the average number of times that each nucleotide
`is expected to be sequenced given a certain number of
`reads of a given length and the assumption that reads
`are randomly distributed across an idealized genome2.
`Actual empirical per-base coverage represents the exact
`number of times that a base in the reference is covered
`by a high-quality aligned read from a given sequenc-
`ing experiment. Redundancy of coverage is also called
`the depth or the depth of coverage. In next-generation
`sequencing studies coverage is often quoted as average
`raw or aligned read depth, which denotes the expected
`coverage on the basis of the number and the length
`of high-quality reads before or after alignment to the
`reference. Although the terms depth and coverage can
`be used interchangeably (as they are in this Review),
`coverage has also been used to denote the breadth of
`coverage of a target genome, which is defined as the
`percentage of target bases that are sequenced a given
`number of times. For example, a genome sequencing
`study may sequence a genome to 30× average depth
`and achieve a 95% breadth of coverage of the reference
`genome at a minimum depth of ten reads.
`
`An ideal genome sequencing method would fault-
`lessly read all nucleotides just once, doing so sequen-
`tially from one end of a chromosome to the other.
`Such a perfect approach would ensure that all poly-
`morphic alleles within diploid or polyploid genomes
`could be identified, and that long identical or near-
`identical repetitive regions could be unambiguously
`placed in a genome assembly. In real-world sequenc-
`ing approaches, read lengths are short (that is, ≤250
`nucleotides) and can contain sequence errors. When
`considered alone, an error is indistinguishable from
`a sequence variant. This problem can be overcome
`by increasing the number of sequencing reads: even
`if reads contain a 1% variant-error rate, the combina-
`tion of eight identical reads that cover the location of
the variant will produce a strongly supported variant call with an associated error rate of 10^−16 (REF. 3).
`Increased depth of coverage therefore ‘rescues’ inad-
`equacies in sequencing methods (BOX 1). Nevertheless,
`generating greater depth of short reads does not cure
`all sequencing ills. In particular, it alone cannot resolve
`assembly gaps that are caused by repetitive regions
`with lengths that either approach or exceed those of
`the reads. Instead, in the paired-end read approach,
`paired reads — two ends of the same DNA molecule
`that are sequenced and which are separated by a known
`distance — are used to unambiguously place repetitive
`regions that are smaller than this distance.
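The arithmetic behind this 'rescue' effect is simple enough to check directly; the sketch below (an illustration, not code from the Review) treats errors at a given base as independent across reads:

```python
# Probability that n independent overlapping reads all miscall the same
# base, given a per-read error rate e. Illustrative calculation only.
def combined_error_rate(e: float, n: int) -> float:
    return e ** n

# Eight reads, each with a 1% error rate, give 0.01**8, i.e. about 1e-16,
# matching the figure quoted in the text.
print(combined_error_rate(0.01, 8))
```

The independence assumption is of course optimistic: systematic errors (for example, at specific sequence motifs) are shared across reads and are not reduced by depth alone.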
`Sequencing is enriching our understanding not
`only of genome sequence but also of genome organiza-
`tion, genetic variation, differential gene expression and
`
`Depth
`The average number of times
`that a particular nucleotide is
`represented in a collection of
`random raw sequences.
`
`Computational Genomics
`Analysis and Training
`Programme, Medical
`Research Council Functional
`Genomics Unit, Department
`of Physiology, Anatomy and
`Genetics, Le Gros Clark
`Building, University of
`Oxford, Parks Road, Oxford
`OX1 3PT, UK.
Correspondence to D.S. and C.P.P.
e-mails: david.sims@dpag.ox.ac.uk; chris.ponting@dpag.ox.ac.uk
doi:10.1038/nrg3642
`
`NATURE REVIEWS | GENETICS
`
` VOLUME 15 | FEBRUARY 2014 | 121
`
`© 2014 Macmillan Publishers Limited. All rights reserved
`
`
`
`
`R E V I E W S
`
`generation of high-quality, unbiased and interpretable
`data from next-generation sequencing studies.
`
`De novo genome sequencing
`The major factors that determine the required depth in
`a de novo genome sequencing study are the error rate of the
`sequencing method, the assembly algorithms used,
`the repeat complexity of the particular genome under
`study and the read length. Genomes that have been
`sequenced to high depths by short-read technologies
`are not necessarily a substantial improvement in assem-
`bly quality compared with those produced using the
`earlier lower-coverage Sanger sequencing technology.
`Although the human genome was initially assembled
`to high quality with 8–10-fold coverage using long-read
Sanger sequencing2, a raw coverage of ~73-fold was
`required to generate the first short-read-only assembly
`of the giant panda genome that was of lower quality
`than the human genome4. A similarly low coverage
`(~7.5-fold) dog genome, which is similar in size to that
`of the giant panda and was assembled using Sanger
`sequencing reads, is more complete and more contigu-
`ous than the giant panda genome3. These differences
`arise because Sanger sequencing reads are longer, are
`derived from larger insert libraries and can be assembled
`using mature assembly algorithms3.
`High-quality assemblies are now often produced
`using hybrid approaches, in which the advantages of
`high-depth, short-read sequencing are complemented
`with those of lower-depth but longer-read sequencing.
`For example, sequencing the draft assembly of the wild
`grass Aegilops tauschii was a considerable challenge
`owing to its large size (4.4 Gb) and to the fact that two-
`thirds of its sequence consists of highly repetitive trans-
`posable element-derived regions5. The draft genome
`was successfully assembled first into short fragments
`(that is, contigs) using 398 Gb (that is, a 90-fold cov-
`erage) of high-quality short reads from 45 libraries
`with insert sizes between 0.2 kb and 20 kb, and these
`fragments could then be linked into longer scaffolds
`using paired-end read information. Gaps between con-
`tigs predominantly contained repetitive sequence, the
`unique placement of which posed difficulties. These
`gaps were filled in using a subsequent addition of
`18.4 Gb (that is, a fourfold coverage) of Roche 454 long
`reads. A recently introduced approach to sequencing
`repeat-rich genomes is to barcode and sequence to an
`average of 20× depth all reads that are derived from each
`of many collections of hundreds or thousands of short
`(6–8 kb) DNA fragments6. By assembling each collec-
`tion separately, many otherwise confounding repetitive
`sequences of the Botryllus schlosseri tunicate genome
`were resolved. By applying approaches that are comple-
`mentary in aspects such as read lengths and coverage
`biases, hybrid library and assembly methods are likely
`to dominate in the near future7,8.
Twofold-coverage, lower-quality assemblies have
been produced using Sanger sequencing for a selection
`of mammalian genomes to identify sequences that are
`conserved in eutherian species, including humans9. The
`Lander–Waterman approach (BOX 1) predicts that ~86%
`
`diverse aspects of transcriptional regulation, which
`range from transcription factor-binding sites to the
`three-dimensional conformation of chromosomes. As
`these areas of genome research often adopt markedly
`different sequencing depths (FIG. 1), we review this issue
`for each area in turn. First, we examine current best
`practice in de novo genome sequencing and assembly.
`We then proceed to consider genome resequencing
`and targeted resequencing approaches, particularly
`whole-exome sequencing (WES). Second, we discuss
`the rapidly evolving area of transcriptome sequencing,
`specifically the different considerations that are needed
`for transcript discovery compared with the analyses of
`differential expression and alternative splicing. Finally,
`we explore a range of methodologies that identify the
`genomic sites of transcription factor binding, chromatin
`marks, DNA methylation and spatial interactions that
`are revealed by chromosome conformation capture (3C)
`methods. We discuss experimental considerations that
`are relevant to sequence depth, which are required for the
`
`Box 1 | Sequencing coverage theory
`
`Much of the original work on sequencing coverage stemmed from early genome
`mapping efforts. In 1988, Lander and Waterman96 described the theoretical
`redundancy of coverage (c) as LN/G, where L is the read length, N is the number of
`reads and G is the haploid genome length. The figure shows the theoretical coverage
`(shown as diagonal lines; c = 1× or 30×) according to the Lander–Waterman formula for
`human genome or exome sequencing. The coverage that is achieved by sequencing
`technologies according to the manufacturers’ websites is also indicated (see the
`figure). Unfortunately, biases in sample preparation, sequencing, and genomic
`alignment and assembly can result in regions of the genome that lack coverage (that is,
`gaps) and in regions with much higher coverage than theoretically expected. GC‑rich
`regions, such as CpG islands, are particularly prone to low depth of coverage partly
`because these regions remain annealed during amplification97. Consequently, it is
`important to assess the uniformity of coverage, and thus data quality, by calculating
`the variance in sequencing depth across the genome98. The term depth may also be
`used to describe how much of the complexity in a sequencing library has been
`sampled. All sequencing libraries contain finite pools of distinct DNA fragments. In a
`sequencing experiment only some of these fragments are sampled. The number of
`these distinct fragments sequenced is positively correlated with the depth of the true
`biological variation that has been sampled.
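The Lander–Waterman formula can be turned around to estimate how many reads a target depth requires (N = cG/L); a minimal sketch, with round-number genome and exome sizes as assumptions:

```python
def expected_coverage(read_length: int, n_reads: float, genome_length: float) -> float:
    """Lander-Waterman expected redundancy of coverage: c = LN/G."""
    return read_length * n_reads / genome_length

# Assumed round-number sizes: ~3.1 Gb haploid human genome, ~30 Mb exome target.
GENOME = 3.1e9
EXOME = 30e6

# Reads needed for 30x expected coverage of the human genome with 100-bp reads:
n_reads = 30 * GENOME / 100          # N = cG/L -> 9.3e8 reads
print(f"{n_reads:.2e} reads")
print(expected_coverage(100, n_reads, GENOME))   # recovers c = 30.0
# The same 9.3e8 reads spread over an exome capture would give a far higher c:
print(expected_coverage(100, n_reads, EXOME))
```

Note that this is the theoretical coverage only; as the box explains, preparation and alignment biases mean the empirical per-base coverage is considerably less uniform.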
`
[Figure: theoretical Lander–Waterman coverage (diagonal lines at c = 1× and 30× for the human genome and human exome) plotted as number of reads per run (10^4–10^11) against read length (bases; ~100–1,000), with the positions of sequencing platforms indicated: Illumina HiSeq 2000, Illumina HiSeq 2500 Rapid Run, Illumina GAIIx, SOLiD 5500xl, Ion Torrent PGM 318 Chip, 454 FLX Titanium XL+ and PacBio RS II.]
GAIIx, Genome Analyzer IIx; PacBio, Pacific Biosciences; PGM, personal genome machine.
`
`
`single-nucleotide variants (SNVs), small insertions and
`deletions (indels), larger structural variants (such as
`inversions and translocations) and copy number vari-
`ants (CNVs). Naturally, the design of a particular study
`depends on the biological hypothesis in question, and
`different sequencing strategies are used for population
`studies compared with those for studies of Mendelian
`disease or of somatic mutations in cancer. Furthermore,
`targeted resequencing approaches allow a trade-off
`between sequencing breadth and sample numbers: for
`the same cost, more samples can be sequenced to the
`same depth but over a smaller genomic region. Here, we
`discuss the merits of whole-genome sequencing (WGS)
`relative to targeted resequencing approaches, including
`WES, in the context of these different variant types and
`disease models.
`
`WGS versus WES. High-depth WGS is the ‘gold stand-
`ard’ for DNA resequencing because it can interrogate all
`variant types (including SNVs, indels, structural variants
`and CNVs) in both the minority (1.2%) of the human
`genome that encodes proteins and the remaining major-
`ity of non-coding sequences. WES is focused on the
`detection of SNVs and indels in protein-coding genes
`and on other functional elements such as microRNA
`sequences; consequently, it omits regulatory regions
`such as promoters and enhancers. Although costs vary
`depending on the sequence capture solution, WES can
`be an order of magnitude less expensive than WGS to
`achieve an approximately equivalent breadth of coverage
`of protein-coding exons. These reduced costs offer the
`potential to greatly increase sample numbers, which is a
`key factor for many studies. However, WES has various
`limitations that are discussed below.
`
`SNV and indel detection. Early genome resequencing
`studies focused specifically on the two most common
`classes of sequence variation, which are SNVs and small
`indels. The first human genome that was sequenced
`using Illumina short-read technology showed that,
`although almost all homozygous SNVs are detected at a
`15× average depth, an average depth of 33× is required
`to detect the same proportion of heterozygous SNVs12.
`Consequently, an average depth that exceeds 30× rapidly
`became the de facto standard13,14. In 2011, one study15
`suggested that an average mapped depth of 50× would
`be required to allow reliable calling of SNVs and small
`indels across 95% of the genome. However, improve-
`ments in sequencing chemistry reduced GC bias and
`thus yielded a more uniform coverage of the genome,
`which later reduced the required average mapped depth
`to 35× (REF. 15). The power to detect variants is reduced
`by low base quality and by non-uniformity of coverage.
`Increasing sequencing depth can both improve these
`issues and reduce the false-discovery rate for variant
`calling. Although read quality is mostly governed by
`sequencing technology, the uniformity of depth of cov-
`erage can also be affected by sample preparation. A GC
`bias that is introduced during DNA amplification by
`PCR has been identified as a major source of variation
`in coverage. Elimination of PCR amplification results in
`
[Figure 1: density plot of the number of bases sequenced per sample (10^7–10^12) against frequency of studies, with separate curves for WES, WGS, ChIP–seq and RNA-seq.]

Figure 1 | Sequencing depths for different applications. The frequency of studies
that use read counts of all runs (which are typically flow-cell lanes) and that were
deposited from 2012 to June 2013 for the Illumina platform in the European Nucleotide
Archive (ENA) is shown. The plot provides an overview of sequencing depths that are
usually chosen for the four most common experimental strategies. Densities have been
smoothed and normalized to provide an area under the curve that is equal to one.
The depth and therefore the cost of an experiment increase in the order of chromatin
immunoprecipitation followed by sequencing (ChIP–seq), RNA sequencing (RNA-seq),
whole-exome sequencing (WES) to whole-genome sequencing (WGS). Although
ChIP–seq, WES and WGS have typical applications and thus standardized read depths,
the sequencing depth of RNA-seq data sets varies over several orders of magnitude.
Multimodal distributions of WES and WGS reflect different target coverage. To
generate this figure, runs were summed by experiment and, for each study, one
experiment was chosen at random to avoid counting large studies more than once.
Note that the ENA archive only contains published data sets and excludes medically
relevant data sets. The plot was created from 771 studies.
`
`(that is, 1 – e−2) of bases in such genomes are covered
`once by a sequencing depth of 2× although, in reality,
`this decreases to ~65% for mammalian genomes that are
`sequenced at twofold coverage10. In these and other stud-
`ies, low coverage has two principal effects on subsequent
`analyses and biological interpretation. First, it is not pos-
`sible to resolve whether an absence of a protein-coding
`gene, or a disruption of its open reading frame, repre-
`sents a deficiency of the assembly or a real evolutionary
`gene loss. Second, and perhaps more seriously, low depth
`can introduce sequence errors that are in danger of being
`mistakenly propagated through downstream analyses
`and misdirecting conclusions of a study. To mitigate this
`possibility, two approaches are recommended. First, low-
`quality bases or sequences that align poorly against a
`closely related genome should be discarded from such
analyses. Second, bases adjacent to these, even when they have high-quality
scores, should also be discarded because they can contain
a high density of residual sequence errors11.
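The ~86% figure quoted above follows from the Poisson model underlying the Lander–Waterman calculation: the expected breadth of coverage at a minimum depth of k reads is 1 − Σ_{i<k} e^(−c)·c^i/i!. A quick illustration of the idealized model only; as noted in the text, real mammalian genomes fall short of it:

```python
import math

def breadth_at_least(c: float, k: int) -> float:
    """Expected fraction of target bases covered >= k times under a
    Poisson model with mean depth c (the Lander-Waterman idealization)."""
    return 1 - sum(math.exp(-c) * c**i / math.factorial(i) for i in range(k))

print(round(breadth_at_least(2, 1), 3))    # ~0.865: the ~86% figure at 2x depth
print(round(breadth_at_least(30, 10), 4))  # near-complete breadth at >= 10 reads, 30x depth
```

The second line mirrors the worked example in the introduction: a 30× genome is expected, under the ideal model, to cover virtually all bases at a minimum depth of ten reads, so the ~95% achieved in practice reflects real-world biases rather than sampling.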
`
`DNA resequencing
`DNA resequencing explores genetic variation in indi-
`viduals, families and populations, particularly with
`respect to human genetic disease. Requirements for
`sequencing depth in these studies are governed by
`the variant type of interest, the disease model and the
`size of the regions of interest. Resequencing can reveal
`
`Sequence capture
`The enrichment of fragmented
`DNA or RNA species of interest
`by hybridization to a set of
`sequence-specific DNA or RNA
`oligonucleotides.
`
`GC bias
`The difference between the
`observed GC content of
`sequenced reads and the
`expected GC content based
`on the reference sequence.
`
`Variant calling
`The process of identifying
`consistent differences between
`the sequenced reads and the
`reference genome; these
`differences include single base
`substitutions, small insertions
`and deletions, and larger copy
`number variants.
`
`
`Box 2 | Genomic alignment and mappability
`
`The first major data processing step in sequencing studies for species with a
`reference genome is the alignment of sequencing reads to this reference. The choice
`of alignment algorithm often influences final coverage values, as different
`algorithms show varying false‑positive and false‑negative rates99,100. Even the best
`mapping algorithms cannot align all reads to the reference genome, which is
`perhaps due to sequencing errors, structural rearrangements or insertions in the
`query genome, or deletions in the reference. Indeed, analyses of unmapped reads
`are often used for the identification of structural variants and non‑reference
`insertions40,101. Furthermore, it is not possible to unambiguously assign reads to all
`genomic regions, as some regions will contain low‑degeneracy repeats or
`low‑complexity sequences. The ‘mappability’ (also known as uniqueness) of a
`sequence within a genome has a major influence on the average mapped depth
`and is an important source of false‑negative single‑nucleotide variant calls102.
`Mappability improves with increased read length and generally shows an inverse
`correlation with genomic repeats103. One approach to increase coverage in regions
`of low mappability is to use longer reads that improve the chance of a read
`encompassing a unique sequence that anchors all remaining sequences. A second
`approach is to generate paired‑end libraries with longer insert sizes, which increases
`the chance of one read of the pair mapping to a unique region outside the repeat
`sequence. It is often useful to use mappability data to normalize read depth, for
`example, when using depth of coverage to estimate DNA copy number.
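A toy illustration of mappability, assuming the simplest definition (a position is 'mappable' at read length k if its k-mer occurs exactly once on the forward strand; real mappability tracks also consider the reverse complement and allow mismatches):

```python
from collections import Counter

def mappability(genome: str, k: int) -> list[int]:
    """1 if the k-mer starting at each position is unique in the genome,
    else 0. Toy forward-strand sketch for illustration only."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1 if counts[km] == 1 else 0 for km in kmers]

genome = "ACGTACGTTT"           # contains the repeat 'ACGT' at positions 0 and 4
print(mappability(genome, 4))   # positions inside the repeat are non-unique
print(mappability(genome, 8))   # longer reads span the repeat and resolve it
```

The two calls show the behaviour described in the box: increasing the read length from 4 to 8 turns the two ambiguous repeat positions into uniquely mappable ones.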
`
`improved coverage of high GC regions of the genome
`and in fewer duplicate reads16.
`In WES, differences in the hybridization efficiency
`of sequence capture probes, which are possibly again
`attributable to GC content variation, can result in tar-
`get regions that have little or no coverage. Uniformity
`of coverage will also be influenced by repetitive or
`low-complexity sequences, which either restrict bait design
`or lead to off-target capture. Furthermore, unlike WGS,
`WES still routinely uses PCR amplification, which must
`be carefully optimized to reduce GC bias17. As a result
`of increased variation in coverage, a greater average read
`depth is required to achieve the same breadth of cov-
`erage as that from WGS, and an 80× average depth is
`required to cover 89.6–96.8% of target bases, depending
`on the platform, by at least tenfold18. Different sequence
`capture kits yield different coverage profiles, and
`designs with higher density seem to be more efficient,
`which provide better uniformity of coverage and better
`sensitivity for SNV detection18,19. As capture kits have
`improved sequence coverage, the amount of sequenc-
`ing required has inevitably increased. Regardless of the
`capture protocol or the sequencing platform used, there
`has been a trend for recent exome studies to require a
`minimum of 80% of the target region to be covered by
`at least tenfold20–22. All WES kits are prone to reference
`bias, which arises from capture probes that match the
`reference sequence and thus tend to preferentially enrich
`the reference allele at heterozygous sites; such bias can
`produce false-negative SNV calls23.
`
`CNV detection. CNVs can be detected from WGS
`and WES24,25 data using methods that analyse depth of
`coverage. These methods pile up aligned reads against
`genomic coordinates, then calculate read counts in
`windows to provide the average depth across a region.
`Copy number changes can then be inferred from
`variation in average depth across genomic regions.
`
`Low-complexity sequences
`DNA regions that have a
`biased nucleotide composition,
`which are enriched with simple
`sequence repeats.
`
`Clonal evolution
`An iterative process of
`clonal expansion, genetic
`diversification and clonal
`selection that is thought to
`drive the evolution of cancers,
`which gives rise to metastasis
`and resistance to therapy.
`
`In WGS, reasonable specificity can be obtained with
`an average depth of as little as 0.1× (REF. 26). However,
`sensitivity, break-point detection and absolute copy
`number estimation all improve with increasing read
depth26,27. Regardless of average read depth, depth-of-coverage methods are vulnerable to false positives that arise from local variations in coverage
even after correction for both GC bias and ‘mappability’
`(BOX 2), and cross-sample calling is required to reduce
`this effect28.
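The windowed depth-of-coverage idea can be sketched as follows; this is an illustrative toy, not any of the cited callers, and it omits the GC-bias and mappability corrections that the text notes are essential:

```python
import math
from collections import Counter

def window_log2_ratios(read_starts: list[int], genome_length: int,
                       window: int) -> list[float]:
    """Bin aligned read start positions into fixed-size windows and return
    the log2 ratio of each window's read count to the median window count.
    Toy sketch: real CNV callers also correct for GC bias and mappability
    and segment jointly across samples."""
    n_windows = math.ceil(genome_length / window)
    counts = Counter(pos // window for pos in read_starts)
    depths = [counts.get(w, 0) for w in range(n_windows)]
    median = sorted(depths)[n_windows // 2]
    # +0.5 pseudocount avoids log(0) in empty (deleted) windows
    return [math.log2((d + 0.5) / (median + 0.5)) for d in depths]

# Uniform background of 10 reads per 100-bp window, plus 10 extra reads in
# window 4: a region with twice the reads shows a log2 ratio near +1.
reads = list(range(0, 1000, 10)) + list(range(400, 500, 10))
ratios = window_log2_ratios(reads, 1000, 100)
print([round(r, 2) for r in ratios])
```

Copy number would then be inferred by segmenting these ratios; the sensitivity of that step is what improves with read depth, as discussed above.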
`
`Study design for different disease models. In contrast to
`the high depth that is required to accurately call SNVs
`and indels in individual genomes, population genomics
`studies benefit from a trade-off between sample num-
`bers and sequencing depth, in which many genomes are
`sequenced at low depth (for example, 400 samples at 4×)
`and their variants are called jointly across all samples29–31.
`Variant calls on individual low-depth genomes have a
`high false-positive rate, but this is mitigated by combin-
`ing information across samples. This approach provides
`good power to detect common variants at a proportion
`of the sequencing cost of deep sequencing29,30. Indeed,
`even ultra-low-coverage sequencing (that is, sequencing
`at 0.1–0.5×) captures almost as much common variation
`(that is, variants with >1% allele frequency) as single-
`nucleotide polymorphism (SNP) arrays32. Conversely,
`reliable identification of variants in either highly
`aneuploid genomes or heterogeneous cell populations,
`such as those from tumours, requires greater depth
`of coverage than those from normal tissue33. Targeted
`enrichment and ultra-deep sequencing (that is, sequenc-
`ing at 1,000×) of limited regions of interest can be used
`to study clonal evolution in cancer samples, in which spe-
`cific variants are present in <1% of the cell population34.
`The identification of disease-causing de novo or recessive
`variants is often best served by sequencing parent–child
`trios. In this case, it is recommended that the same depth
`of sequencing is obtained for each of the family members
`in order to minimize false-positive calls in the proband
`and false-negative calls in the parents35.
`
`Analyses of DNA resequencing data. A typical analy-
`sis pipeline for DNA resequencing data involves the
`alignment of sequencing reads to a reference genome
followed by variant calling. A post-alignment step to remove duplicates (that is, to retain only one of two or more read pairs whose forward and reverse reads map to identical genomic coordinates) is important for accurate variant calling, as it ensures that errors that are introduced and amplified during PCR do not result in erroneous calls36. Duplicate read removal
`can significantly reduce the number of high-quality
`mapped reads and thus the average depth of coverage
`(TABLE 1). Even in species with a complete reference
`genome, assembly approaches (reviewed and com-
`pared in REFS 37–39) offer several advantages over those
`using reference alignment. First, assembly can faithfully
`recapitulate divergent sequence, such as that of the
`human leukocyte antigen (HLA) locus, which often does
`not align well to a reference genome. Second, assembly
`
`
`
`
Table 1 | Sources of uninformative reads for different experiments

Source of uninformative reads                   WGS   WES   ChIP–seq   RNA-seq
Sequencing adaptor reads                         •     •       •          •
Low-quality reads                                •     •       •          •
Unmapped reads                                   •     •       •          •
Reads that do not map uniquely                   •     •       •          •
PCR duplicates                                   •     •       •          •
Reads that map outwith peaks,                    –     •       •          •
transcript models or exons
Reads that map to uninformative                  –     –       –          •
transcripts (for example, rRNA)

ChIP–seq, chromatin immunoprecipitation followed by sequencing; RNA-seq, RNA sequencing;
rRNA, ribosomal RNA; WES, whole-exome sequencing; WGS, whole-genome sequencing.
`
`
`considerations than transcriptome-wide coverage statis-
`tics. Furthermore, when used for differential expression
`analyses, RNA-seq can be considered as a tag-counting
`application. In this case, a sufficient number of reads are
`required to quantify exons and splice junctions in the
`sample. Therefore, the number of reads that is required
`in an experiment is determined by the least abundant
`RNA species of interest — a variable that is not known
`before sequencing.
`The number of useful reads that is generated in a
`study can be optimized either by depleting the ribo-
`somal RNA (rRNA) fraction, which constitutes ~90% of
`total RNA in mammalian cells, or by enriching for the
`RNA species of interest, such as the use of immobilized
`oligo-deoxythymidine to enrich for polyadenylated
`RNAs43. Total RNA that is depleted in rRNA contains
`reads from both non-polyadenylated transcripts and
`pre-processed mRNA transcripts. Consequently, many
`reads will align to intronic sequences, thereby decreasing
`the proportion of reads that map to expressed exons and
`reducing the power to detect splice junctions. A good
`indication of the performance of an RNA-seq experi-
`ment is provided by the proportion of reads that are
`mapped to rRNA and other highly expressed RNAs,
`and by the proportion that are mapped to splice junc-
`tions and coding exons. Using a poly(A) selection pro-
`tocol with paired reads of lengths that are >76 bp, >80%
`of read pairs can be expected to map to the reference
`genome in experiments using human samples, and >70%
`of these reads can be expected to map with zero mis-
`matches44. With this approach, the number of reads that
`map to rRNA will be minimal (that is, <1%), and ~15%
`of reads will map across splice junctions.
`
`Transcript discovery. One application of transcrip-
`tome sequencing that is not possible using microarrays
`is the identification of novel transcripts, such as long
`non-coding RNAs (lncRNAs) and alternative transcripts
`of protein-coding genes. Many of these transcripts are
`expressed at low levels45,46, and their discovery therefore
`requires either deep sampling of the transcriptome or
`mapping of transcription start sites using cap analysis
`of gene expression (CAGE). The power to detect a
`transcript depends on its length and abundance in
`the sequencing library, as well as on its mappability to the
`reference genome. The sequencing of RNA standards
`from the External RNA Control Consortium47 revealed
that molecules that are present at frequencies of 0.6–2.5
molecules per 10^7 molecules could not be detected
using 12.4 million uniquely mapping 36-bp reads48.
`Furthermore, the accuracy of abundance estimations
`using spike-in control RNAs in deeply sequenced human
`data sets (which contain >94 million uniquely mapped
`76-bp paired-end reads) showed a clear dependence
`on both length and GC composition of an RNA mole-
`cule48. Sampling of transcripts is also affected by library
`preparation. Sequenced reads that are generated using
`Illumina protocols show compositional biases at their
`5ʹ ends owing to the nonrandomness of the hexamer
`primers that are used in cDNA synthesis49. This results
`in nonrandom sampling of the transcriptome and an
`
`Dynamic range
`The range of expression
`levels over which genes and
`transcripts can be accurately
`quantified in gene expression
`analyses. In theory, RNA
`sequencing offers an infinite
`dynamic range, whereas
`microarrays are limited by the
`range of signal intensities.
`
`Long non-coding RNAs
`(lncRNAs). RNA molecules
`that are transcribed from
`non-protein-coding loci; such
`RNAs are >200 nt in length
`and show no predicted
`protein-coding capacity.
`
`Cap analysis of gene
`expression
`(CAGE). In contrast to RNA
`sequencing, CAGE produces
`short ‘tag’ sequences that
`represent the 5ʹ end of the
`RNA molecule. As CAGE does
`not sequence across an entire
`cDNA, it requires a lower depth
`of sequencing than RNA
`sequencing to quantify
`low-abundance transcripts.
`
`Spike-in control RNAs
`A pool of RNA molecules of
`known length, sequence
`composition and abundance
`that is introduced into an
`experiment to assess the
`performance of the technique.
`
`Fragments per kilobase of
`exon per million reads
`mapped
`(FPKM). A method for
`normalizing read counts over
`genes or transcripts. Read
`counts are first normalized by
`gene length and then by library
`size. After normalization, the
`expression value of each gene
`is less dependent on these
`variables.
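The FPKM definition above reduces to a one-line formula: the fragment count is divided by transcript length in kilobases and by millions of mapped fragments. A sketch with an invented worked example:

```python
def fpkm(fragment_count: int, gene_length_bp: int, total_fragments: int) -> float:
    """Fragments per kilobase of exon per million mapped fragments:
    FPKM = count * 1e9 / (length_bp * total_mapped_fragments)."""
    return fragment_count * 1e9 / (gene_length_bp * total_fragments)

# Hypothetical example: 500 fragments on a 2-kb transcript in a library of
# 20 million mapped fragments.
print(fpkm(500, 2_000, 20_000_000))  # 12.5
```

After this normalization, a gene's value can be compared across libraries of different sizes, which is the property the glossary entry describes.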
`
`can avoid the mis-mapping of reads that originate from
`incomplete regions of the reference genome. Third,
`assembly enables multiple variant types to be analysed
`at once, which minimizes errors around clusters of vari-
`ants. The latest assembly methods, such as Cortex40, can
`consider multiple eukaryotic genomes simultaneously
`while incorporating information about known varia-
`tion. This allows variant calling against a range of dif-
`ferent genomes rather than a single reference genome.
`This method required only an average depth of 16×
`during the assembly of human HLA regions to provide
results that are in good agreement with laboratory-based typing40. However, as assembly methods are
`still unable to fully reconstruct entire genomes owing
`mainly to repeat content, they are only able to call
`variants in 80% of the genome.
`
`Transcriptome sequencing
`RNA sequencing (RNA-seq) allows the detection and
`the quantification of expressed transcripts in a biological
`sample. Its applications include novel transcript discov-
`ery, and analyses of differential expression and alterna-
`tive splicing. RNA-seq has advantages over microarray
`gene expression analyses, as it provides an unbiased
`assessment of the full range of transcripts with a greater
`dynamic range41,42. Large numbers of RNA-seq experi-
`ments have now been carried out in many cell and tissue
`types across diverse conditions, yet few clear guidelines
`on read counts have emerged. This is because sequenc-
`ing requirements are often dependent on the biologi-
`cal question under investigation, as well as on the size
`and the complexity of the transcriptome being assayed.
`Here, we describe the concepts that govern the coverage
`required in RNA-seq experiments and illustrate these
`with examples from the literature.
`
`Coverage in transcriptome sequencing. Coding and
`non-coding transcripts can be expressed at vastly dif-
`ferent levels — from one copy to millions of copies
`per cell — in differe