REVIEWS

Sequencing depth and coverage: key considerations in genomic analyses

David Sims, Ian Sudbery, Nicholas E. Ilott, Andreas Heger and Chris P. Ponting

Abstract | Sequencing technologies have placed a wide range of genomic analyses within the capabilities of many laboratories. However, sequencing costs often set limits to the amount of sequence that can be generated and, consequently, the biological outcomes that can be achieved from an experimental design. In this Review, we discuss the issue of sequencing depth in the design of next-generation sequencing experiments. We review current guidelines and precedents on the issue of coverage, as well as their underlying considerations, for four major study designs, which include de novo genome sequencing, genome resequencing, transcriptome sequencing and genomic location analyses (for example, chromatin immunoprecipitation followed by sequencing (ChIP–seq) and chromosome conformation capture (3C)).

Genomics is extending its reach into diverse fields of biomedical research, from agriculture to clinical diagnostics. Despite sharp falls in recent years1, sequencing costs remain substantial and vary for different types of experiment. Consequently, in all of these fields investigators are seeking experimental designs that generate robust scientific findings for the lowest sequencing cost. Higher coverage of sequencing (BOX 1) inevitably requires higher costs. The theoretical or expected coverage is the average number of times that each nucleotide is expected to be sequenced given a certain number of reads of a given length and the assumption that reads are randomly distributed across an idealized genome2. Actual empirical per-base coverage is the exact number of times that a base in the reference is covered by a high-quality aligned read from a given sequencing experiment. Redundancy of coverage is also called the depth or depth of coverage. In next-generation sequencing studies, coverage is often quoted as average raw or aligned read depth, which denotes the expected coverage on the basis of the number and the length of high-quality reads before or after alignment to the reference. Although the terms depth and coverage can be used interchangeably (as they are in this Review), coverage has also been used to denote the breadth of coverage of a target genome, which is defined as the percentage of target bases that are sequenced a given number of times. For example, a genome sequencing study may sequence a genome to 30× average depth and achieve a 95% breadth of coverage of the reference genome at a minimum depth of ten reads.

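The relationship between average depth and breadth of coverage can be sketched numerically. The following is an illustrative calculation, not from the Review: it assumes the idealized model in which reads land uniformly at random, so that per-base depth is approximately Poisson-distributed around the average depth.

```python
from math import exp, factorial

def breadth_at_min_depth(c: float, min_depth: int) -> float:
    """Fraction of bases covered at least `min_depth` times, assuming
    per-base depth is Poisson-distributed with mean coverage c."""
    return 1.0 - sum(exp(-c) * c**k / factorial(k) for k in range(min_depth))

# At 1x average depth, ~63% of bases are covered at least once.
print(round(breadth_at_min_depth(1, 1), 3))

# At 30x average depth, the idealized model predicts near-total breadth
# at a minimum depth of ten reads; real experiments fall short of this
# (for example, the ~95% quoted above) because coverage is not uniform.
print(breadth_at_min_depth(30, 10) > 0.999)
```

The gap between the idealized prediction and observed breadth is one way to quantify how non-uniform a given experiment's coverage is.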
An ideal genome sequencing method would faultlessly read all nucleotides just once, doing so sequentially from one end of a chromosome to the other. Such a perfect approach would ensure that all polymorphic alleles within diploid or polyploid genomes could be identified, and that long identical or near-identical repetitive regions could be unambiguously placed in a genome assembly. In real-world sequencing approaches, read lengths are short (that is, ≤250 nucleotides) and can contain sequence errors. When considered alone, an error is indistinguishable from a sequence variant. This problem can be overcome by increasing the number of sequencing reads: even if reads contain a 1% variant-error rate, the combination of eight identical reads that cover the location of the variant will produce a strongly supported variant call with an associated error rate of 10^−16 (REF. 3). Increased depth of coverage therefore 'rescues' inadequacies in sequencing methods (BOX 1). Nevertheless, generating greater depth of short reads does not cure all sequencing ills. In particular, it alone cannot resolve assembly gaps that are caused by repetitive regions with lengths that either approach or exceed those of the reads. Instead, in the paired-end read approach, paired reads — two ends of the same DNA molecule that are sequenced and which are separated by a known distance — are used to unambiguously place repetitive regions that are smaller than this distance.
Sequencing is enriching our understanding not only of genome sequence but also of genome organization, genetic variation, differential gene expression and

Depth
The average number of times that a particular nucleotide is represented in a collection of random raw sequences.

Computational Genomics Analysis and Training Programme, Medical Research Council Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, Le Gros Clark Building, University of Oxford, Parks Road, Oxford OX1 3PT, UK.
Correspondence to D.S. and C.P.P.
e-mails: david.sims@dpag.ox.ac.uk; chris.ponting@dpag.ox.ac.uk
doi:10.1038/nrg3642

NATURE REVIEWS | GENETICS | VOLUME 15 | FEBRUARY 2014 | 121
© 2014 Macmillan Publishers Limited. All rights reserved

generation of high-quality, unbiased and interpretable data from next-generation sequencing studies.

De novo genome sequencing
The major factors that determine the required depth in a de novo genome sequencing study are the error rate of the sequencing method, the assembly algorithms used, the repeat complexity of the particular genome under study and the read length. Genomes that have been sequenced to high depths by short-read technologies do not necessarily show a substantial improvement in assembly quality compared with those produced using the earlier lower-coverage Sanger sequencing technology. Although the human genome was initially assembled to high quality with 8–10-fold coverage using long-read Sanger sequencing2, a raw coverage of ~73-fold was required to generate the first short-read-only assembly of the giant panda genome, which was of lower quality than the human genome4. A similarly low-coverage (~7.5-fold) dog genome, which is similar in size to that of the giant panda and was assembled using Sanger sequencing reads, is more complete and more contiguous than the giant panda genome3. These differences arise because Sanger sequencing reads are longer, are derived from larger insert libraries and can be assembled using mature assembly algorithms3.
High-quality assemblies are now often produced using hybrid approaches, in which the advantages of high-depth, short-read sequencing are complemented by those of lower-depth but longer-read sequencing. For example, sequencing the draft assembly of the wild grass Aegilops tauschii was a considerable challenge owing to its large size (4.4 Gb) and to the fact that two-thirds of its sequence consists of highly repetitive transposable element-derived regions5. The draft genome was successfully assembled first into short fragments (that is, contigs) using 398 Gb (that is, a 90-fold coverage) of high-quality short reads from 45 libraries with insert sizes between 0.2 kb and 20 kb; these fragments could then be linked into longer scaffolds using paired-end read information. Gaps between contigs predominantly contained repetitive sequence, the unique placement of which posed difficulties. These gaps were filled in using a subsequent addition of 18.4 Gb (that is, a fourfold coverage) of Roche 454 long reads. A recently introduced approach to sequencing repeat-rich genomes is to barcode and sequence to an average of 20× depth all reads that are derived from each of many collections of hundreds or thousands of short (6–8 kb) DNA fragments6. By assembling each collection separately, many otherwise confounding repetitive sequences of the Botryllus schlosseri tunicate genome were resolved. By applying approaches that are complementary in aspects such as read lengths and coverage biases, hybrid library and assembly methods are likely to dominate in the near future7,8.
Twofold-coverage and lower-quality assemblies have been produced using Sanger sequencing for a selection of mammalian genomes to identify sequences that are conserved in eutherian species, including humans9. The Lander–Waterman approach (BOX 1) predicts that ~86%

diverse aspects of transcriptional regulation, which range from transcription factor-binding sites to the three-dimensional conformation of chromosomes. As these areas of genome research often adopt markedly different sequencing depths (FIG. 1), we review this issue for each area in turn. First, we examine current best practice in de novo genome sequencing and assembly. We then proceed to consider genome resequencing and targeted resequencing approaches, particularly whole-exome sequencing (WES). Second, we discuss the rapidly evolving area of transcriptome sequencing, specifically the different considerations that are needed for transcript discovery compared with the analyses of differential expression and alternative splicing. Finally, we explore a range of methodologies that identify the genomic sites of transcription factor binding, chromatin marks, DNA methylation and spatial interactions that are revealed by chromosome conformation capture (3C) methods. We discuss experimental considerations that are relevant to sequencing depth, which are required for the

Box 1 | Sequencing coverage theory
Much of the original work on sequencing coverage stemmed from early genome mapping efforts. In 1988, Lander and Waterman96 described the theoretical redundancy of coverage (c) as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length. The figure shows the theoretical coverage (shown as diagonal lines; c = 1× or 30×) according to the Lander–Waterman formula for human genome or exome sequencing. The coverage that is achieved by sequencing technologies according to the manufacturers' websites is also indicated (see the figure). Unfortunately, biases in sample preparation, sequencing, and genomic alignment and assembly can result in regions of the genome that lack coverage (that is, gaps) and in regions with much higher coverage than theoretically expected. GC-rich regions, such as CpG islands, are particularly prone to low depth of coverage, partly because these regions remain annealed during amplification97. Consequently, it is important to assess the uniformity of coverage, and thus data quality, by calculating the variance in sequencing depth across the genome98. The term depth may also be used to describe how much of the complexity in a sequencing library has been sampled. All sequencing libraries contain finite pools of distinct DNA fragments, and in a sequencing experiment only some of these fragments are sampled. The number of distinct fragments sequenced is positively correlated with the depth of the true biological variation that has been sampled.

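The Lander–Waterman relation c = LN/G in Box 1 can be rearranged to estimate how many reads a target depth requires. A minimal sketch; the genome size and read length below are illustrative round numbers, not values from the Review.

```python
def reads_required(c: float, G: float, L: float) -> float:
    """Number of reads N needed for theoretical coverage c = L*N/G,
    where G is genome length and L is read length."""
    return c * G / L

# ~30x over a 3.1-Gb human genome with 100-base reads:
n = reads_required(30, 3.1e9, 100)
print(f"{n:.2e}")  # 9.30e+08 reads
```

Because of the coverage biases described in Box 1, real experiments typically budget more than this theoretical minimum to reach a target breadth of coverage.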
[Box 1 figure | Theoretical coverage (diagonal lines; c = 1× and 30×) for human genome and human exome sequencing, plotted as the number of reads per run (10^4–10^11) against read length (bases), with manufacturer-reported positions for the Illumina HiSeq 2000, Illumina HiSeq 2500 Rapid Run, Illumina GAIIx, SOLiD 5500xl, Ion Torrent PGM 318 Chip, 454 FLX Titanium XL+ and PacBio RS II. GAIIx, Genome Analyzer IIx; PacBio, Pacific Biosciences; PGM, personal genome machine.]


single-nucleotide variants (SNVs), small insertions and deletions (indels), larger structural variants (such as inversions and translocations) and copy number variants (CNVs). Naturally, the design of a particular study depends on the biological hypothesis in question, and different sequencing strategies are used for population studies compared with those for studies of Mendelian disease or of somatic mutations in cancer. Furthermore, targeted resequencing approaches allow a trade-off between sequencing breadth and sample numbers: for the same cost, more samples can be sequenced to the same depth but over a smaller genomic region. Here, we discuss the merits of whole-genome sequencing (WGS) relative to targeted resequencing approaches, including WES, in the context of these different variant types and disease models.

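The breadth-versus-sample-numbers trade-off described above is a matter of arithmetic on a fixed budget of sequenced bases. The budget, target sizes and depths below are hypothetical, chosen only to illustrate the calculation.

```python
def samples_affordable(budget_bases: float, region_size: float, depth: float) -> int:
    """How many samples a fixed budget of sequenced bases supports
    at a given target-region size and average depth."""
    return int(budget_bases // (region_size * depth))

budget = 9e12  # 9 Tb of sequence (hypothetical budget)

# Whole genome (~3.1 Gb) at 30x versus a 50-Mb exome target at 100x:
print(samples_affordable(budget, 3.1e9, 30))  # 96
print(samples_affordable(budget, 5e7, 100))   # 1800
```

The same budget that covers roughly a hundred genomes therefore supports well over a thousand exomes at a higher average depth, which is why targeted designs dominate large case collections.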
WGS versus WES. High-depth WGS is the 'gold standard' for DNA resequencing because it can interrogate all variant types (including SNVs, indels, structural variants and CNVs) in both the minority (1.2%) of the human genome that encodes proteins and the remaining majority of non-coding sequence. WES focuses on the detection of SNVs and indels in protein-coding genes and in other functional elements such as microRNA sequences; consequently, it omits regulatory regions such as promoters and enhancers. Although costs vary depending on the sequence capture solution, WES can be an order of magnitude less expensive than WGS for an approximately equivalent breadth of coverage of protein-coding exons. These reduced costs offer the potential to greatly increase sample numbers, which is a key factor for many studies. However, WES has various limitations that are discussed below.

SNV and indel detection. Early genome resequencing studies focused on the two most common classes of sequence variation: SNVs and small indels. The first human genome that was sequenced using Illumina short-read technology showed that, although almost all homozygous SNVs are detected at 15× average depth, an average depth of 33× is required to detect the same proportion of heterozygous SNVs12. Consequently, an average depth that exceeds 30× rapidly became the de facto standard13,14. In 2011, one study15 suggested that an average mapped depth of 50× would be required to allow reliable calling of SNVs and small indels across 95% of the genome. However, improvements in sequencing chemistry reduced GC bias and thus yielded a more uniform coverage of the genome, which later reduced the required average mapped depth to 35× (REF. 15). The power to detect variants is reduced by low base quality and by non-uniformity of coverage. Increasing sequencing depth can both mitigate these issues and reduce the false-discovery rate for variant calling. Although read quality is mostly governed by sequencing technology, the uniformity of depth of coverage can also be affected by sample preparation. A GC bias that is introduced during DNA amplification by PCR has been identified as a major source of variation in coverage. Elimination of PCR amplification results in

Figure 1 | Sequencing depths for different applications. The frequency of studies that use read counts of all runs (which are typically flow-cell lanes) and that were deposited from 2012 to June 2013 for the Illumina platform in the European Nucleotide Archive (ENA) is shown as smoothed density curves for WES, WGS, ChIP–seq and RNA-seq against the number of bases sequenced per sample (10^7–10^12). The plot provides an overview of sequencing depths that are usually chosen for the four most common experimental strategies. Densities have been smoothed and normalized to provide an area under the curve that is equal to one. The depth, and therefore the cost, of an experiment increases in the order of chromatin immunoprecipitation followed by sequencing (ChIP–seq), RNA sequencing (RNA-seq) and whole-exome sequencing (WES) to whole-genome sequencing (WGS). Although ChIP–seq, WES and WGS have typical applications and thus standardized read depths, the sequencing depth of RNA-seq data sets varies over several orders of magnitude. Multimodal distributions of WES and WGS reflect different target coverage. To generate this figure, runs were summed by experiment and, for each study, one experiment was chosen at random to avoid counting large studies more than once. Note that the ENA only contains published data sets and excludes medically relevant data sets. The plot was created from 771 studies.

(that is, 1 − e^−2) of bases in such genomes are covered once by a sequencing depth of 2×, although, in reality, this decreases to ~65% for mammalian genomes that are sequenced at twofold coverage10. In these and other studies, low coverage has two principal effects on subsequent analyses and biological interpretation. First, it is not possible to resolve whether the absence of a protein-coding gene, or a disruption of its open reading frame, represents a deficiency of the assembly or a real evolutionary gene loss. Second, and perhaps more seriously, low depth can introduce sequence errors that are in danger of being mistakenly propagated through downstream analyses and misdirecting the conclusions of a study. To mitigate this possibility, two approaches are recommended. First, low-quality bases, or sequences that align poorly against a closely related genome, should be discarded from such analyses. Second, adjacent bases should also be discarded, even when they have high quality scores, because they can contain a high density of residual sequence errors11.

DNA resequencing
DNA resequencing explores genetic variation in individuals, families and populations, particularly with respect to human genetic disease. Requirements for sequencing depth in these studies are governed by the variant type of interest, the disease model and the size of the regions of interest. Resequencing can reveal

Sequence capture
The enrichment of fragmented DNA or RNA species of interest by hybridization to a set of sequence-specific DNA or RNA oligonucleotides.

GC bias
The difference between the observed GC content of sequenced reads and the expected GC content based on the reference sequence.

Variant calling
The process of identifying consistent differences between the sequenced reads and the reference genome; these differences include single-base substitutions, small insertions and deletions, and larger copy number variants.

Box 2 | Genomic alignment and mappability
The first major data processing step in sequencing studies of species with a reference genome is the alignment of sequencing reads to this reference. The choice of alignment algorithm often influences final coverage values, as different algorithms show varying false-positive and false-negative rates99,100. Even the best mapping algorithms cannot align all reads to the reference genome, owing perhaps to sequencing errors, to structural rearrangements or insertions in the query genome, or to deletions in the reference. Indeed, analyses of unmapped reads are often used for the identification of structural variants and non-reference insertions40,101. Furthermore, it is not possible to unambiguously assign reads to all genomic regions, as some regions will contain low-degeneracy repeats or low-complexity sequences. The 'mappability' (also known as uniqueness) of a sequence within a genome has a major influence on the average mapped depth and is an important source of false-negative single-nucleotide variant calls102. Mappability improves with increased read length and generally shows an inverse correlation with genomic repeats103. One approach to increase coverage in regions of low mappability is to use longer reads, which improve the chance of a read encompassing a unique sequence that anchors all remaining sequences. A second approach is to generate paired-end libraries with longer insert sizes, which increase the chance of one read of the pair mapping to a unique region outside the repeat sequence. It is often useful to use mappability data to normalize read depth, for example, when using depth of coverage to estimate DNA copy number.

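The mappability normalization suggested at the end of Box 2 can be sketched minimally: divide each window's read count by the fraction of uniquely mappable positions in that window, masking windows where mappability is too low to trust. The function name, threshold and values below are invented for illustration; they are not from the Review.

```python
def normalize_depth(counts, mappability, min_map=0.5):
    """Divide each window's read count by its mappability fraction;
    windows below `min_map` mappability are masked (None) as unreliable."""
    return [
        c / m if m >= min_map else None
        for c, m in zip(counts, mappability)
    ]

raw = [100, 52, 98, 10]
mapp = [1.0, 0.5, 0.95, 0.1]  # fraction of uniquely mappable positions per window
print(normalize_depth(raw, mapp))
```

After normalization, a half-mappable window with 52 reads is comparable to a fully mappable window with ~104, so apparent dips in depth caused by repeats are no longer mistaken for deletions.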
improved coverage of high-GC regions of the genome and in fewer duplicate reads16.
In WES, differences in the hybridization efficiency of sequence capture probes, which are possibly again attributable to GC content variation, can result in target regions that have little or no coverage. Uniformity of coverage will also be influenced by repetitive or low-complexity sequences, which either restrict bait design or lead to off-target capture. Furthermore, unlike WGS, WES still routinely uses PCR amplification, which must be carefully optimized to reduce GC bias17. As a result of this increased variation in coverage, a greater average read depth is required to achieve the same breadth of coverage as that from WGS: an 80× average depth is required to cover 89.6–96.8% of target bases, depending on the platform, by at least tenfold18. Different sequence capture kits yield different coverage profiles, and designs with higher probe density seem to be more efficient, providing better uniformity of coverage and better sensitivity for SNV detection18,19. As capture kits have improved sequence coverage, the amount of sequencing required has inevitably increased. Regardless of the capture protocol or the sequencing platform used, there has been a trend for recent exome studies to require a minimum of 80% of the target region to be covered at least tenfold20–22. All WES kits are prone to reference bias, which arises because capture probes match the reference sequence and thus tend to preferentially enrich the reference allele at heterozygous sites; such bias can produce false-negative SNV calls23.

CNV detection. CNVs can be detected from WGS and WES24,25 data using methods that analyse depth of coverage. These methods pile up aligned reads against genomic coordinates and then calculate read counts in windows to provide the average depth across a region. Copy number changes can then be inferred from variation in average depth across genomic regions.

Low-complexity sequences
DNA regions that have a biased nucleotide composition and are enriched in simple sequence repeats.

Clonal evolution
An iterative process of clonal expansion, genetic diversification and clonal selection that is thought to drive the evolution of cancers, giving rise to metastasis and resistance to therapy.

In WGS, reasonable specificity can be obtained with an average depth of as little as 0.1× (REF. 26). However, sensitivity, break-point detection and absolute copy number estimation all improve with increasing read depth26,27. Regardless of average read depth, depth-of-coverage methods are vulnerable to false positives that are called owing to local variations in coverage, even after correction for both GC bias and 'mappability' (BOX 2); cross-sample calling is required to reduce this effect28.

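The depth-of-coverage approach described above can be sketched as counting reads in fixed windows and comparing each window with the genome-wide average. The window size and simulated read positions below are invented; real tools additionally correct for GC bias and mappability, as noted in the text.

```python
from collections import Counter

def window_depth_ratios(read_starts, window=1000):
    """Count reads per fixed-size window and return each window's
    depth relative to the mean depth across windows."""
    counts = Counter(start // window for start in read_starts)
    n_windows = max(counts) + 1
    mean = sum(counts.values()) / n_windows
    return {w: counts.get(w, 0) / mean for w in range(n_windows)}

# Simulated read starts: window 1 receives roughly double the reads,
# mimicking a duplicated region.
reads = [i * 10 for i in range(300)] + [1000 + i * 10 for i in range(100)]
ratios = window_depth_ratios(reads)
print({w: round(r, 2) for w, r in ratios.items()})  # window 1 stands out
```

A ratio near 1.5 against a diploid background is what a single-copy gain looks like in this representation; thresholds and segmentation across adjacent windows turn such ratios into CNV calls.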
Study design for different disease models. In contrast to the high depth that is required to accurately call SNVs and indels in individual genomes, population genomics studies benefit from a trade-off between sample numbers and sequencing depth, in which many genomes are sequenced at low depth (for example, 400 samples at 4×) and their variants are called jointly across all samples29–31. Variant calls on individual low-depth genomes have a high false-positive rate, but this is mitigated by combining information across samples. This approach provides good power to detect common variants at a fraction of the cost of deep sequencing29,30. Indeed, even ultra-low-coverage sequencing (that is, sequencing at 0.1–0.5×) captures almost as much common variation (that is, variants with >1% allele frequency) as single-nucleotide polymorphism (SNP) arrays32. Conversely, reliable identification of variants in either highly aneuploid genomes or heterogeneous cell populations, such as those from tumours, requires greater depth of coverage than in normal tissue33. Targeted enrichment and ultra-deep sequencing (that is, sequencing at 1,000×) of limited regions of interest can be used to study clonal evolution in cancer samples, in which specific variants are present in <1% of the cell population34. The identification of disease-causing de novo or recessive variants is often best served by sequencing parent–child trios. In this case, it is recommended that the same depth of sequencing be obtained for each of the family members in order to minimize false-positive calls in the proband and false-negative calls in the parents35.

Analyses of DNA resequencing data. A typical analysis pipeline for DNA resequencing data involves the alignment of sequencing reads to a reference genome followed by variant calling. A post-alignment step to remove all but one of each set of duplicates (that is, the removal of two or more read pairs whose forward and reverse reads map to identical genomic coordinates) is important for accurate variant calling, as it ensures that errors that are introduced and amplified during PCR do not result in erroneous calls36. Duplicate read removal can substantially reduce the number of high-quality mapped reads and thus the average depth of coverage (TABLE 1). Even in species with a complete reference genome, assembly approaches (reviewed and compared in REFS 37–39) offer several advantages over those using reference alignment. First, assembly can faithfully recapitulate divergent sequence, such as that of the human leukocyte antigen (HLA) locus, which often does not align well to a reference genome. Second, assembly


Table 1 | Sources of uninformative reads for different experiments

Source of uninformative reads                                  WGS   WES   ChIP–seq   RNA-seq
Sequencing adaptor reads                                       •     •     •          •
Low-quality reads                                              •     •     •          •
Unmapped reads                                                 •     •     •          •
Reads that do not map uniquely                                 •     •     •          •
PCR duplicates                                                 •     •     •          •
Reads that map outside of peaks, transcript models or exons    –     •     •          •
Reads that map to uninformative transcripts (for example, rRNA)–     –     –          •

ChIP–seq, chromatin immunoprecipitation followed by sequencing; RNA-seq, RNA sequencing; rRNA, ribosomal RNA; WES, whole-exome sequencing; WGS, whole-genome sequencing.

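The duplicate-removal step listed in Table 1 and described in the text keeps one read pair per distinct pair of mapping coordinates. A minimal sketch over (chrom, fwd_start, rev_start) tuples; the record layout and read names are invented for illustration, and real tools operate on alignment files and also consider strand and read quality.

```python
def remove_duplicates(pairs):
    """Keep the first read pair seen at each (chrom, fwd_start, rev_start);
    later pairs with identical coordinates are treated as PCR duplicates."""
    seen = set()
    kept = []
    for pair in pairs:
        key = (pair["chrom"], pair["fwd_start"], pair["rev_start"])
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept

pairs = [
    {"chrom": "chr1", "fwd_start": 100, "rev_start": 350, "name": "r1"},
    {"chrom": "chr1", "fwd_start": 100, "rev_start": 350, "name": "r2"},  # duplicate
    {"chrom": "chr1", "fwd_start": 180, "rev_start": 420, "name": "r3"},
]
print([p["name"] for p in remove_duplicates(pairs)])  # ['r1', 'r3']
```

Using both mate coordinates as the key is what distinguishes genuine PCR duplicates from independent fragments that merely share one end.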
considerations than transcriptome-wide coverage statistics. Furthermore, when used for differential expression analyses, RNA-seq can be considered as a tag-counting application. In this case, a sufficient number of reads is required to quantify exons and splice junctions in the sample. Therefore, the number of reads that is required in an experiment is determined by the least abundant RNA species of interest — a variable that is not known before sequencing.
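The point that the least abundant RNA species of interest sets the required read number can be sketched with a simple sampling model. This is an illustrative assumption, not from the Review: if a transcript contributes a fraction p of the fragments in a library and reads are drawn independently, the chance that N reads sample it at least once is approximately 1 − e^(−Np).

```python
from math import exp

def detection_probability(n_reads: float, fraction: float) -> float:
    """Probability of sampling a transcript at least once, assuming reads
    are drawn independently and it makes up `fraction` of the library."""
    return 1.0 - exp(-n_reads * fraction)

# A transcript at 1 part in 10 million, sequenced to different depths:
for n in (1e6, 1e7, 1e8):
    print(f"{n:.0e} reads: P(detected) = {detection_probability(n, 1e-7):.3f}")
```

Under this model, an order-of-magnitude increase in read number moves a rare transcript from rarely sampled to reliably sampled, which is why discovery of low-abundance species demands deep sequencing.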
The number of useful reads that is generated in a study can be optimized either by depleting the ribosomal RNA (rRNA) fraction, which constitutes ~90% of total RNA in mammalian cells, or by enriching for the RNA species of interest, for example by using immobilized oligo-deoxythymidine to enrich for polyadenylated RNAs43. Total RNA that is depleted of rRNA contains reads from both non-polyadenylated transcripts and pre-processed mRNA transcripts. Consequently, many reads will align to intronic sequences, thereby decreasing the proportion of reads that map to expressed exons and reducing the power to detect splice junctions. A good indication of the performance of an RNA-seq experiment is provided by the proportion of reads that map to rRNA and other highly expressed RNAs, and by the proportion that map to splice junctions and coding exons. Using a poly(A) selection protocol with paired reads of lengths >76 bp, >80% of read pairs can be expected to map to the reference genome in experiments using human samples, and >70% of these reads can be expected to map with zero mismatches44. With this approach, the number of reads that map to rRNA will be minimal (that is, <1%), and ~15% of reads will map across splice junctions.

Transcript discovery. One application of transcriptome sequencing that is not possible using microarrays is the identification of novel transcripts, such as long non-coding RNAs (lncRNAs) and alternative transcripts of protein-coding genes. Many of these transcripts are expressed at low levels45,46, and their discovery therefore requires either deep sampling of the transcriptome or mapping of transcription start sites using cap analysis of gene expression (CAGE). The power to detect a transcript depends on its length and abundance in the sequencing library, as well as on its mappability to the reference genome. The sequencing of RNA standards from the External RNA Control Consortium47 revealed that molecules that are present at frequencies of 0.6–2.5 molecules per 10^7 molecules could not be detected using 12.4 million uniquely mapping 36-bp reads48. Furthermore, the accuracy of abundance estimates using spike-in control RNAs in deeply sequenced human data sets (which contain >94 million uniquely mapped 76-bp paired-end reads) showed a clear dependence on both the length and the GC composition of an RNA molecule48. Sampling of transcripts is also affected by library preparation. Sequenced reads that are generated using Illumina protocols show compositional biases at their 5ʹ ends owing to the non-randomness of the hexamer primers that are used in cDNA synthesis49. This results in non-random sampling of the transcriptome and an

Dynamic range
The range of expression levels over which genes and transcripts can be accurately quantified in gene expression analyses. In theory, RNA sequencing offers an infinite dynamic range, whereas microarrays are limited by the range of signal intensities.

Long non-coding RNAs
(lncRNAs). RNA molecules that are transcribed from non-protein-coding loci; such RNAs are >200 nt in length and show no predicted protein-coding capacity.

Cap analysis of gene expression
(CAGE). In contrast to RNA sequencing, CAGE produces short 'tag' sequences that represent the 5ʹ end of the RNA molecule. As CAGE does not sequence across an entire cDNA, it requires a lower depth of sequencing than RNA sequencing to quantify low-abundance transcripts.

Spike-in control RNAs
A pool of RNA molecules of known length, sequence composition and abundance that is introduced into an experiment to assess the performance of the technique.

Fragments per kilobase of exon per million reads mapped
(FPKM). A method for normalizing read counts over genes or transcripts. Read counts are first normalized by gene length and then by library size. After normalization, the expression value of each gene is less dependent on these variables.

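The FPKM normalization defined in the glossary above divides fragment counts by transcript length (in kilobases) and by library size (in millions of mapped fragments). A minimal sketch with invented counts:

```python
def fpkm(fragments: int, length_bp: int, total_fragments: int) -> float:
    """Fragments per kilobase of exon per million fragments mapped."""
    return fragments / (length_bp / 1e3) / (total_fragments / 1e6)

# 500 fragments on a 2-kb transcript in a library of 20 million mapped fragments:
print(fpkm(500, 2000, 20_000_000))  # 12.5
```

Dividing by both length and library size makes values comparable between transcripts of different sizes and between libraries sequenced to different depths, which is exactly the dependence the glossary entry says the normalization removes.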
can avoid the mis-mapping of reads that originate from incomplete regions of the reference genome. Third, assembly enables multiple variant types to be analysed at once, which minimizes errors around clusters of variants. The latest assembly methods, such as Cortex40, can consider multiple eukaryotic genomes simultaneously while incorporating information about known variation. This allows variant calling against a range of different genomes rather than against a single reference genome. This method required an average depth of only 16× during the assembly of human HLA regions to provide results that are in good agreement with laboratory-based typing40. However, as assembly methods are still unable to fully reconstruct entire genomes, owing mainly to repeat content, they are able to call variants in only ~80% of the genome.

Transcriptome sequencing
RNA sequencing (RNA-seq) allows the detection and quantification of expressed transcripts in a biological sample. Its applications include novel transcript discovery, and analyses of differential expression and alternative splicing. RNA-seq has advantages over microarray gene expression analyses, as it provides an unbiased assessment of the full range of transcripts with a greater dynamic range41,42. Large numbers of RNA-seq experiments have now been carried out in many cell and tissue types across diverse conditions, yet few clear guidelines on read counts have emerged. This is because sequencing requirements are often dependent on the biological question under investigation, as well as on the size and the complexity of the transcriptome being assayed. Here, we describe the concepts that govern the coverage required in RNA-seq experiments and illustrate these with examples from the literature.

Coverage in transcriptome sequencing. Coding and non-coding transcripts can be expressed at vastly different levels — from one copy to millions of copies per cell — in differe
