throbber
Downloaded from https://academic.oup.com/bioinformatics/article/27/19/2648/231564 by guest on 29 September 2023
`
`BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 19 2011, pages 2648–2654
`
`doi:10.1093/bioinformatics/btr462
`
`Advance Access publication August 9, 2011
`
`Sequence analysis
`Exome sequencing-based copy-number variation and loss of
`heterozygosity detection: ExomeCNV
`Jarupon Fah Sathirapongsasuti1,2,3,∗, Hane Lee3,4, Basil A. J. Horst4,5,6, Georg Brunner7,
`Alistair J. Cochran4, Scott Binder4, John Quackenbush1,2 and Stanley F. Nelson3,4,∗
`1Department of Biostatistics, Harvard School of Public Health, 2Department of Biostatistics and Computational
`Biology, Dana-Farber Cancer Institute, Boston, MA 02115, 3Department of Human Genetics and 4Department of
`Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles,
`CA 90095, 5Department of Dermatology, 6Department of Pathology and Cell Biology, Columbia University Medical
`Center, New York, NY 10032, USA and 7Department of Cancer Research, Skin Cancer Center Hornheide, Münster,
`Germany
`Associate Editor: Alfonso Valencia
`
`ABSTRACT
`Motivation: The ability to detect copy-number variation (CNV) and
`loss of heterozygosity (LOH) from exome sequencing data extends
`the utility of this powerful approach that has mainly been used for
`point or small insertion/deletion detection.
`Results: We present ExomeCNV, a statistical method to
`detect CNV and LOH using depth-of-coverage and B-allele
`frequencies,
`from mapped short sequence reads, and we
`assess both the method’s power and the effects of confounding
`variables. We apply our method to a cancer exome resequencing
`dataset. As expected, accuracy and resolution are dependent on
`depth-of-coverage and capture probe design.
`Availability: CRAN package ‘ExomeCNV’.
`Contact: fsathira@fas.harvard.edu; snelson@ucla.edu
`Supplementary information: Supplementary data are available at
`Bioinformatics online.
`
`Received on January 22, 2011; revised on July 8, 2011; accepted on
`August 2, 2011
`
`1 INTRODUCTION
`The development of next-generation sequencing has enabled
`routine large-scale resequencing projects, permitting us to perform
`increasingly more comprehensive DNA variant analysis. However,
`the cost and analytical complexity of sequencing still limit the
`number of whole genomes that can be sequenced in any single
`project (Teer and Mullikin, 2010). In fact, the analysis of complete
`human genome sequence often interprets DNA alterations in protein
`coding regions primarily. This is in practice a reasonable strategy
`since ∼85% of the disease-causing mutations are found in the coding
`regions or canonical splice sites (Choi et al., 2009). Thus, whole-
`exome sequencing presents an effective alternative to whole-genome
`sequencing and provides an unbiased, cost-effective and time-
`efficient tool for the study of the genetic basis for disease. Following
`the first successful application of whole-exome sequencing in re-
`discovering the cause of a dominantly inherited rare Mendelian
`disorder Freeman–Sheldon syndrome (Ng et al., 2009), a number
`
`∗
`
`To whom correspondence should be addressed.
`
`of studies have reported similar successes (Chou et al., 2010;
`Hoischen et al., 2010; Johnston et al., 2010; Raca et al., 2010;
`Rehman et al., 2010; Volpi et al., 2010). Although the cost of both
`genome and exome sequencing continues to fall at a rapid pace,
`whole-exome sequencing has a number of advantages, including a
`lower cost, more straightforward data analysis and interpretation and
`significantly greater depth of coverage with a corresponding overall
`improvement in data quality. Exome sequencing is rapidly becoming
`a fundamental tool for genetic and functional genomic research
`laboratories and a diagnostic tool in clinics. At present the main
`applications of targeted exonic sequencing is for the determination
`of single nucleotide variants (SNVs) or small indel variants but not
`structural variation.
`Structural variation, especially copy-number variation (CNV) and
`loss of heterozygosity (LOH), is an important class of genetic
`variability in Mendelian, common inherited diseases and cancer
`(Choi et al., 2007; Deng et al., 2010; Fong et al., 1989; Jankowska
`et al., 2009; Sha et al., 2009; Shuin et al., 1994; Stankiewicz
`and Lupski, 2010; Wain et al., 2009). As is true of SNVs, there
`are population-specific, common CNVs and rare, disease-causing
`CNVs (Kato et al., 2010; Sudmant et al., 2010). Many large-scale
`projects (Conrad et al., 2010a,b; McCarroll, 2010) and technological
`platforms (Pinkel et al., 1998; Urban et al., 2006) have been devised
`to estimate the prevalence and impact of CNV. Array Comparative
`Genomic Hybridization (CGH) (Pinkel et al., 1998) and SNP
`genotyping arrays have been widely used as standard methods to
`detect CNV and LOH. However, with the rapid increase in genomic
`and exomic sequence, there is growing interest in the use of these
`data to detect CNVs.
`While methods have been developed for CNV estimation in
`whole-genome sequencing (Chiang et al., 2009; Yoon et al.,
`2009), these methods make key assumptions that fail to hold in
`the exome sequencing setting. For example, Yoon et al. (2009)
`assumes random, unbiased distribution of sequence reads, such
`that read depth can be modeled as a normal distribution across
`the genome, and deviation from the background indicates the
`presence of CNV. This random read distribution assumption breaks
`down in the context of exome capture as the probes have variable
`specificity and efficiency for the targeted exonic regions. The
`discrete nature of exome sequences also presents problems to
`
`2648
`
`© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
`
`[09:50 21/9/2011 Bioinformatics-btr462.tex]
`
`Page: 2648 2648–2654
`
`Personalis EX2158
`
`

`

`Downloaded from https://academic.oup.com/bioinformatics/article/27/19/2648/231564 by guest on 29 September 2023
`
`ExomeCNV
`
`.
`
`(1)
`

`
`,
`
`(2)
`
`(3a)
`
`(3b)
`
`if ρ <1
`if ρ≥ 1
`
`control, respectively. Although we discuss our method in terms of depth-of-
`coverage, our method is developed in terms of the count statistics X and Y .
`Let NX and NY be the total numbers of aligned reads in case and control,
`respectively. Define the read count ratio:
`R= X/NX
`Y /NY
`We divide the raw counts X and Y by the total number of reads NX
`and NY to mitigate the effect of overall increase in local counts due to the
`increase in total depth-of-coverage. Finally, we adjust the ratio so that the
`exome-wide median is 1. Without lost of generality, we assume NX = NY
`and reduce R= X/Y . Because X and Y follow Poisson distributions with
`parameters λX and λY , respectively, with sufficient depth-of-coverage the
`Poisson distributions converge to normals with equal means and variances:
`N(λX , λX ) and N(λY , λY ). Under the null hypothesis of no CNV, λX = λY ,
`and under the alternative hypothesis, λX = ρλY = ρλ. ρ indicates the copy-
`number ratio; for example, ρ = 0.5 for deletion and ρ = 1.5 for duplication.
`By Geary–Hinkley transformation (Geary, 1930,1944; Hinkley, 1969), let
`√
`(cid:1)
`(cid:2)
`(cid:2)
`t(ρ)= µY R−µX
`= λY R−λX
`= λR−ρλ
`= (R−ρ)
`√
`λR2+ρλ
`R2+ρ
`λY R2+λX
`Y R2+σ2
`and t(ρ) follows the standard normal distribution. Thus, the specificity and
`(cid:3)
`sensitivity are 1−α and 1−β where
`(cid:4)
`(cid:4)
`(cid:5)(cid:5)
`(cid:4)
`(cid:4)
`α=
`1
`t
`1−φ
`(cid:3)
`(cid:4)
`(cid:4)
`(cid:4)
`(cid:4)
`(cid:5)(cid:5)
`1−φ
`if ρ <1,
`if ρ≥ 1
`t
`The formulas above describe the achievable specificity and sensitivity of
`a given cutoff ratio R. Considering values of R from zero to infinity, we
`can plot the receiver operating characteristic (ROC) curve (Supplementary
`Materials). In practice, we may wish to identify a cutoff r(ρ) which yields
`the desired minimum specificity and/or sensitivity for testing a particular
`copy-number ratio ρ at an exon with certain depth-of-coverage and length.
`This can be achieved by solving above equations, and an appropriate cutoff
`r(ρ) can be chosen from a set of solutions to maximize user-selected quantity
`metrics such as specificity, sensitivity or area under curve.
`In the presence of sample admixture, the ‘true’copy-number ratio will tend
`to 1. In particular, if a fraction c of the tumor sample has a normal copy-
`number (either by contamination of normal tissue or heterogeneity within
`the tumor), the copy-number ratio of this admixed sample will be ρ(cid:5)= c+
`ρ(1−c). Thus, in heterogeneous samples, the only change to the method
`described above is the replacement of ρ by ρ(cid:5)
`. The admixture rate c can be
`estimated from data by back-calculating c from empirical ρ(cid:5)
`in LOH regions
`(see Supplementary Materials).
`
`σ2
`
`X
`

`
`t
`
`(cid:5)(cid:5)
`(cid:5)(cid:5)
`
`1
`
`β=
`
`t
`

`

`

`
`Fig. 1. Overview of ExomeCNV analysis workflows. Two workflows
`are present: CNV detection and LOH detection. Each involves similar
`steps of exon/position/segment-wise CNV/LOH calling, Circular Binary
`Segmentation, and interval merging. User inputs and parameters are listed
`at each step.
`
`existing methods. Many whole-genome CNV detection tools use
`segmentation algorithms that assume continuity of search space and
`do not function properly when given discontinuous and variable
`length exome sequencing data. SegSeq (Chiang et al., 2009), for
`example, merges windows of a fixed length based on a log-ratio
`difference statistic. Lastly, because exons are generally smaller than
`insert sizes for paired-end sequencing (200–500 bp), paired-end
`based CNV detection methods are not generally applicable to exome
`data.
`Here we present ExomeCNV, which uses depth-of-coverage
`and B-allele frequencies from mapped short sequence reads to
`estimate CNV (Fig. 1, left side) and LOH (Fig. 1, right side).
`We describe an assessment of its validity, sensitivity, specificity
`and limitations through an analysis of a melanoma tumor and a
`matched normal sample. Important model assumption and the effect
`of important confounding factor such as sample admixture rate are
`also considered.
`
`2 METHODS
`2.1 Correlation of depth-of-coverage
`To plot the correlation of depth-of-coverage (Fig. 2), we used the internal
`exome data that were captured by the same Agilent SureSelect Human
`All Exon Kit and sequenced on the Illumina GAIIx. Samples 1 through
`4 were generated by two lanes of 76 bp single-end sequencing, Sample 5
`was generated by three lanes of 76 bp single-end sequencing and Sample 6
`was generated by one lane of 76 + 76 bp paired-end sequencing. For each
`sample, the average depth-of-coverage per exon was normalized by dividing
`the average coverage by the overall exome average coverage and then, the
`normalized depth-of-coverage were compared between 15 pairs of samples.
`
`2.2 The CNV detection algorithm
`2.2.1 Power analysis of CNV Detection Consider an exon of length L,
`let X and Y denote the numbers of reads, each of length w, mapped within
`the exon in question in case (e.g. tumor) and control (e.g. matched normal),
`respectively. The depth-of-coverage is then Xw/L and Yw/L for case and
`
`2.3 Segmentation and sequential merging
`We used the circular binary segmentation (CBS) algorithm (Olshen et al.,
`2004), as implemented in the R package DNAcopy (Venkatraman and
`Olshen, 2007), to subdivide the genome (exome). For each segment we
`combined the coverage by direct sum, and used mean coverage log ratio
`as the segment’s log ratio log(R). Log ratio is used here to satisfy the input
`requirement of CBS algorithm. CNV call proceeds on each segment in the
`same manner as described above.
`In order to achieve the most sensitive segmentation, we chose to start
`CBS with parameters that produce a large number of small segments, call
`CNV on the segments, and repeat the process with a smaller number of
`larger segments. We then merge the CNV segments sequentially, from finest
`segmentation to coarsest. By nature of CBS, finer segments are contained
`within coarser segments, and in merging step, we need to resolve conflicting
`CNV calls between finer segments within a larger segment. If a finer (smaller)
`segment has sufficient coverage to call a CNV event, the call persists.
`
`2649
`
`[09:50 21/9/2011 Bioinformatics-btr462.tex]
`
`Page: 2649 2648–2654
`
`Personalis EX2158
`
`

`

`Downloaded from https://academic.oup.com/bioinformatics/article/27/19/2648/231564 by guest on 29 September 2023
`
`autosomes were inferred from these values using the genoCN R package (Sun
`et al., 2009). The genoCNA function was used with the default parameters.
`
`2.7 Copy-number analysis using ERDS
`The same PILEUP file we used to generate the average coverage per capture
`interval was used to run Estimation by Read Depth with SNVs (ERDS)
`to demonstrate the need to control for the variability of capture of exons
`observed in exome sequencing (M.Zhu et al., manuscript in preparation).
`The SNV file that is required by ERDS was generated by using SAMtools
`varFilter tool default parameters and SVA (Sequence Variant Analyzer)
`snp_filter.pl script. The result file generated by ERDS was summarized using
`the SVA software (http://people.genome.duke.edu/∼dg48/sva/index.php).
`
`2.8 Comparison between CNV calling methods
`In assessing performance of ExomeCNV, we used all mapped exons as the
`sample space. CNV calls on other platforms were mapped to the exons and
`compared to calls by ExomeCNV. Thus specificity is the proportion of copy-
`neutral exons correctly identified by ExomeCNV, while sensitivity is the
`proportion of amplified (or deleted) exons correctly identified. Similarly,
`LOH performance is assessed using all polymorphic positions as the sample
`space.
`
`3 RESULTS
`3.1 ExomeCNV for CNV and LOH detection
`ExomeCNV uses a normalized depth-of-coverage ratio approach
`to identify CNV and LOH from exome sequencing information
`of paired case/control samples (for example, paired tumor/normal)
`in a way that optimizes sensitivity and specificity. We begin by
`assuming that although there are potentially exon-specific biases due
`to laboratory capture methods and sequence-specific biases, these
`are independent of sample and so are nearly uniform for a particular
`exon across samples. As a result, simply assessing the ratio of depth-
`of-coverage of each exon reduces such bias (see Supplementary
`Materials).
`
`3.1.1 Correlation of depth-of-coverage across exome sequencing
`To establish validity of this fundamental assumption,
`samples
`we compared depth-of-coverage of exons across five independent
`samples from five different subjects (Samples 1–5 in Fig 2). All
`samples were captured using the same probe set (Agilent SureSelect
`Human All Exon G3362) and sequenced at mean base coverages
`of 36–39× as a result of two (Samples 1–4) or three (Sample 5)
`lanes of GAIIx single-end sequencing per sample (see Section 2).
`As shown in Figure 2, a high correlation was observed among
`the five samples (Pearson correlation 0.908–0.975, mean = 0.947,
`SD = 0.027), arguing for the validity of our assumption.
`The same level of consistency was not observed when single-
`end data were compared with paired-end data (Sample 6; Pearson
`correlation 0.855–0.877, mean = 0.871, SD = 0.009) due to the lack
`of independence between pairs of reads in paired-end data. Thus,
`care must be taken to ensure consistency of library preparation
`and sequencing method between samples used in analysis. Here,
`all of our analyses used exome sequencing data from a melanoma
`(Sample 5) and a matched normal skin, both processed and
`sequenced in the same manner (see Section 2).
`
`3.1.2 Analytic power calculation of exonic CNV and LOH
`For each exon, the number of sequencing reads aligning
`detection
`
`J.F.Sathirapongsasuti et al.
`
`However, if it does not have sufficient coverage to reject the null hypothesis
`(i.e. not being called, or being called as copy-number neutral), a positive
`CNV call in a larger segment overrides the negative calls. An illustration of
`this sequential merging algorithm is given in the Supplementary Materials.
`
`2.4 LOH detection
`First, we consider all polymorphic positions in the exome of the control
`sample, and for each of the positions, the B-allele count B is the number of
`reads with non-reference- or B-allele at that position. For a polymorphic
`position i, let Ni be the total number of reads mapped to that position
`(i.e. depth-of-coverage); thus the B-allele count Bi follows a binomial
`distribution Binomial (pi, Ni). A binomial that rejects the null hypothesis:
`pi = 0.5 can be used to detect LOH at each polymorphic position.
`Segmentation is done using CBS algorithm based on the absolute
`difference in B-allele frequencies (BAFs) |BAFi,case – BAFi,control|, where
`BAFi = Bi/Ni. Within each segment, based on the realization that B-allele
`frequencies deviate from the null value of 0.5 under LOH, an F-test for
`equality of variance is used to detect significance increase in variance of
`BAFcase from that of BAFcontrol (other statistics were also considered, see
`Supplementary Materials). Finally, the LOH calls are merged sequentially
`as described above.
`
`2.5 Exome sequencing and data analysis
`2.5.1 Exome sequencing High molecular weight whole genomic DNA
`from matched skin and tumor pair samples of a melanoma patient
`were sequenced on the Illumina Genome Analyzer (GAIIx). The UCLA
`Institutional Review Board (IRB) approved the collections of the DNA
`samples. The libraries were generated following the Agilent SureSelect
`Human All Exon Kit version 1.0.1 protocol, and the Illumina Genome
`Analyzer (GAIIx) flowcell was prepared according to the manufacturer’s
`protocol. We performed three and four lanes of single-end sequencing for
`each of the skin and tumor samples, respectively, within the UCLA Center
`of High-throughput Biology (CHTB). The base-calling was performed by
`the real time analysis (RTA) software (version 1.6) provided by Illumina.
`
`2.5.2 Sequence data analysis Novoalign from Novocraft Short Read
`Alignment Package (http://www.novocraft.com/index.html) was used to
`align each lane’s QSEQ file to the reference genome. Human Genome
`reference sequence (hg18, March 2006, build 36.1), downloaded from the
`UCSC genome database located at http://genome.ucsc.edu and mirrored
`locally, was indexed using novoindex program (-k 14 -s 3). The output
`format was set to SAM and default settings were used for all options. Using
`SAMtools (http://samtools.sourceforge.net/), the SAM files of each lane were
`converted to BAM files, sorted and merged for each sample and potential
`PCR duplicates were removed using Picard (http://picard.sourceforge.net/)
`(Li et al., 2009). To retrieve the depth of coverage information of each base,
`we generated a PILEUP file for each sample using SAMtools and calculated
`the average coverage per capture interval using a custom script. Here, we
`used processed BAM files that were used to call the SNVs while reducing
`the likelihood of using spuriously mismapped reads to call the variants: the
`last 5 bases were trimmed and only the reads lacking indels were retained
`(Clark et al., 2009). The detailed description of the mutational landscape of
`this tumor sample is in preparation.
`
`2.6 Genome-wide SNP genotyping
`Both the skin and the tumor samples were submitted to the Southern
`California Genotyping Consortium (SCGC) at UCLA for genotyping on
`the Illumina Omni-1 Quad BeadChip, which consists of 1 140 419 SNPs
`(1 016 423 genotyping probes and 123 996 CNV probes) distributed across
`the genome. The Illumina GenomeStudio V2010.1 Genotyping Module
`version 1.6.3 was used to calculate the BAF values and the log R ratio
`(LRR) for each probe and the copy number aberration (CNA) and LOH of the
`
`2650
`
`[09:50 21/9/2011 Bioinformatics-btr462.tex]
`
`Page: 2650 2648–2654
`
`Personalis EX2158
`
`

`

`Downloaded from https://academic.oup.com/bioinformatics/article/27/19/2648/231564 by guest on 29 September 2023
`
`within it appears to follow a Poisson with mean directly proportional
`to the size of the exon and the copy-number (see Supplementary
`Materials), but assuming that we have sufficiently deep coverage,
`we can approximate this by a normal distribution with mean
`
`Fig. 2. Correlation of depth-of-coverage across exome sequencing samples.
`To demonstrate the consistency of capture and sequencing efficiency
`of individual exons represented by the depth-of-coverage per exon, the
`normalized individual exon coverage in all pairs of six-independent exomes
`were plotted. All 6 samples were captured using the Agilent SureSelect
`Human All Exon G3362. Samples 1–5 had mean base coverages of 36∼ 39×
`as a result of 2 (Samples 1–4) or 3 (Sample 5) lanes of GAIIx single-end
`sequencing per sample. Sample 6 had mean base coverage of 60× as a result
`of 1 lane of GAIIx paired-end sequencing and demonstrates substantially
`different biases in individual exonic depth-of-coverage.
`
`ExomeCNV
`
`equal to variance. We apply the Geary–Hinkley transformation
`(Geary, 1930,1944; Hinkley, 1969), which converts a ratio of two
`normally distributed variables to a standard normal distribution, and
`a CNV is identified by a significant deviation of the transformed ratio
`from the null, standard normal distribution (see Section 2). Allowing
`only one false positive per genome, we analytically determine the
`statistical power of this CNV detection approach for different depth-
`of-coverage, and the results are shown in Figure 3a–b. For detecting
`deletions, 95% power is achieved for segments of size 500 bp or
`more (Fig. 3a), while detection of a single copy duplication is
`achieved with 95% power for segments of size 1,000 bp or more
`(Fig. 3b) with a mean segmental base coverage of 35×. We note
`that the power of the method improves substantially with higher
`depth-of-coverage, and an individual exons deletion/duplication
`status would be more powerfully observed by including additional
`flanking intronic sequence in the capture probe design. Genomic
`DNA admixture, as expected, diminishes power, but even with
`35× coverage of a given exon, a length of greater than 1000 bp
`is observed over 95% of the time. Exons or segments captured with
`500 bp at target sequence are observed at 95% power only with
`greater than 55× base coverage. A more thorough consideration
`of specificity and ROC curves are produced in the Supplementary
`Materials.
`To estimate LOH, we focused on the non-reference-allele or BAF
`of polymorphic positions in the sequenced regions. The observed
`B-allele count at a polymorphic position can be modeled by a
`binomial distribution with depth-of-coverage as sample size and
`the probability of observing a B-allele proportional to the B-allele
`copy-number, which is equivalent to the LOH state. Because the
`
`Fig. 3. Examples of the power of ExomeCNV to detect segmental duplication, deletion and LOH based on an analytical calculation. Power is plotted relative
`to mean depth-of-coverage in the genomic segment, setting false positive to 1 per genome based on an analytical model of genome-wide power of detection
`at different window sizes (inset, a–d). Windows are the total length of a given sequence at a given exon or the sum of length of exons adjacent to each other
`in the genome. The effect of admixture (rate of 30%) on the power to detect deletions and single copy duplications are shown in (c) and (d), respectively. (e)
`plots the power of LOH detection versus depth-of-coverage of individual polymorphic position (single base pair) with variable admixture rates (inset). The
`periodicity of the power curve is due to discrete nature of the binomial test. The 35× depth of coverage is chosen because it is a typical minimal average
`depth of coverage for exome sequencing and is thus a conservative view of power within typical exome sequencing datasets.
`
`2651
`
`[09:50 21/9/2011 Bioinformatics-btr462.tex]
`
`Page: 2651 2648–2654
`
`Personalis EX2158
`
`

`

`Downloaded from https://academic.oup.com/bioinformatics/article/27/19/2648/231564 by guest on 29 September 2023
`
`are sufficient to achieve at least 90% sensitivity and specificity based
`on the power calculation above.
`
`3.2.1 Validating false positive and false negative rates We first
`estimated the false positive rate of the algorithm by calling CNVs
`on two sequencing lanes of the same normal tissue library, treating
`one as case and the other as control; any CNV call from this
`would be false positive. Our method correctly called most exons
`as non-CNV. In particular, setting P-value thresholds to ensure
`minimum specificity of 0.9, 0.99 and 0.999, we observed specificity
`of 0.916, 0.995 and 1.0, respectively (Supplementary Materials).
`Furthermore, we tested the sensitivity of ExomeCNV by analyzing
`copy-number of sex chromosomes in a pair of male and female
`exome data that were available internally (see Section 2). Using
`the male exome as control, ExomeCNV correctly identified female
`chromosome X as being ‘duplicated’ and chromosome Y as being
`‘deleted’ (Supplementary Materials) with no false negatives.
`
`3.2.2 Comparison with SNP genotyping array We then used
`ExomeCNV to predict CNV and LOH in the melanoma samples
`and compared our results to those obtained from Illumina Omni-1
`Quad Beadchip genotyping array assessment of the same samples
`(Fig. 4 and Supplementary Materials). The sizes of the CNV
`segments from ExomeCNV range from single exon (120 bp) to
`whole chromosome (chr 10 and 18) (size distribution of CNV calls
`is presented in the Supplementary Materials). Treating calls from
`the genotyping array experiment as a standard, ExomeCNV had
`97% specificity and 86% sensitivity for detecting deletions, 92%
`specificity and 88% sensitivity for detecting amplifications, and
`88% specificity and 68% sensitivity for detecting LOH even though
`there is substantial variability across the genome. Higher depth-of-
`coverage from the sequence data for each exome would likely further
`improve concordance. We note that this is a dramatic improvement
`of ExomeCNV relative to the inappropriate application of ERDS
`(M.Zhu et al., manuscript in preparation) CNV caller which, when
`applied to these data, achieves only 16% sensitivity and 83%
`specificity for deletion and 50% sensitivity and 56% specificity for
`amplification (see Circo plot showing results from the three methods
`in the Supplementary Materials). This is due to the fact that there is
`substantial variability of the efficiency of capture of exons in exome
`sequencing not accounted for in ERDS. For CNV segments called by
`ExomeCNV but not by the genotyping arrays, we found that most
`lie within regions in which there is a low density of genotyping
`markers; thus the false positive rates (and the associated specificity)
`for ExomeCNV here may in fact be lower.
`
`4 DISCUSSION AND CONCLUSION
`The resolution of CNV detection with ExomeCNV is limited
`largely by the probe design. The CNV segments identified by our
`method range from 120 bp (single exons with higher than average
`coverage) to 240 Mb in size (whole chromosomes); however, the
`true breakpoint can be anywhere in the space between the terminal
`exon called within a CNV region and the adjacent exon in a non-
`CNV region. Hence, although a given CNV event can be detected
`at a single exon in some instances, the absolute resolution of our
`method is in fact limited to the inter-exon distance around an exon,
`which can be as small as 125 bp or as large as 22.8 Mb with the
`
`J.F.Sathirapongsasuti et al.
`
`expected value of BAF at a normal (non-LOH) polymorphic position
`is 0.5, a significant deviation of BAF from 0.5 identifies LOH.
`With sufficient depth-of-coverage, LOH can be detected at a single
`polymorphic position.
`
`The specificity and
`3.1.3 The effect of sample admixture
`sensitivity of this CNV detection method depends not only on the
`depth-of-coverage but also the rate of admixture, whereby non-
`mutated genomes contaminate mutated genomes in the sampled
`tissue/cells. In the absence of admixture, the average depth-of-
`coverage ratio is 0.5 for deletion, 1.5 for one-copy duplication and
`the BAF at an LOH site is either 0 or 1; however, in cancer biopsy
`sequencing, this is rarely observed in practice due to admixture with
`normal or non-mutated tumor cells. Thus, we assess the effect of
`admixture ranging from 10% to 70%, which is frequently observed
`in tumor samples (Fig. 3e). With admixture, the depth-of-coverage
`ratio and the BAF will tend to the null values of 1 and 0.5,
`respectively, making the CNV and LOH detection harder. Figure 3c
`and d show a reduction of power in detecting deletion (Fig. 3c) and
`one-copy duplication (Fig. 3d) as a result of 30% admixture. There is
`an approximate two-fold increase in the size of the exonic sequence
`detectable in the presence of 30% non-mutated genomic DNA.
`Capping the false positive rate at 0.001 and assuming 35× mean
`depth-of-coverage, a power curve (Fig. 3e) shows 0.95 sensitivity
`of detecting LOH at a single polymorphic position with admixture
`up to 30%.
`
`3.1.4 Using circular binary segmentation to merge exonic
`CNV/LOH Because CNV and LOH can, and usually do, span
`multiple exons, we extended our method above to call CNV/LOH on
`larger segments derived from summing data of sequentially spaced
`exons in the human genome. We apply circular binary segmentation
`(CBS) (Olshen et al., 2004; Venkatraman and Olshen, 2007) to
`subdivide the genome and then combine depth-of-coverage of exons
`and BAF of polymorphic positions within each segment, composed
`of arbitrary number of individual exons, to search for larger CNV and
`LOH. In the case of CNV, since reads are independent of each other,
`the sum of depth-of-coverage of all exons in a segment constitutes
`the segment’s depth-of-coverage, and the CNV test can be performed
`as described above. In the case of LOH, since B-alleles are not
`always on the same chromosome, BAF cannot be combined by
`direct summation. Instead, since BAF deviates from the null value
`0.5 under LOH, a significance increase in variance of BAF from
`control (F-test for equality of variances) indicates LOH (several
`other statistics were also considered, see Supplementary Materials).
`Finally, we repeated the process of CBS and CNV/LOH-calling,
`ranging granularity of segmentation from finest to coarsest, and
`merged the CNV/LOH calls by prioritizing positive calls of finer
`segments over coarser ones (see Section 2 for details). In the case
`of our melanoma sample, we performed CBS/sequential merging at
`five levels of granularity and observed 165 130 merging events in
`the first iteration followed by 121, 79, 105 and 66 in the subsequent
`iterations for a total of 165 501 merging events.
`
`3.2 Validation
`To test
`the performance of ExomeCNV we analyzed exome
`sequencing data from a melanoma and a matched normal skin
`(Supplementary Materials); the average depth-of-coverage of the
`data is 42.8× for the tumor and 37.5× for the normal sample, which
`
`2652
`
`[09:50 21/9/2011 Bioinformatics-btr462.tex]
`
`Page: 2652 2648–2654
`
`Personalis EX2158
`
`

`

`Downloaded from https://academic.oup.com/bioinformatics/article/27/19/2648/231564 by guest on 29 September 2023
`
`ExomeCNV
`
`a strong assumption that the samples do not share common CNV
`regions and that the population has an average genomic copy number
`of two. Other potential challenges of using the pooled sample as
`control are discussed in the Supplementary Materials.
`Because ExomeCNV depends on an estimate of the admixture
`rate c, misspecification of c would affect its performance. We
`performed sensitivity analysis and found that misestimating c
`would have a strong effect on sensitivity and specificity of CNV
`detection. Fortunately, LOH detection provides some data to directly
`estimate c, as LOH detection does not depend on a prior knowledge
`of c (Supplementary Materials). For the melanoma sample, our
`estimate of 30% admixture rate matches that from genotyping
`arrays, confirming the validity of this approach. However, there are
`advantages to slightly overestimating c as it makes the method more
`conservative and reduces false positives.
`As we have shown, CNV and LOH detection is readily possible
`from exome sequencing data, extending the utility of this powerful
`approach. The fundamental basis that makes this approach possible
`is the consistency of depth-of-coverage of each exon (and BAF
`by extension) across multiple samples for each individual exon,
`as demonstrated in five samples performed in our laboratory
`(Supplementary Materials, Fig. 2). This consistency permits reliable
`parametric modeling of the shift in depth-of-coverage and BAF
`distributions, hence accurate identification of CNV and LOH.
`However, we do not observe the same level of consistency when
`comparing depth-of-coverage across different library types. For
`instance, a sixth sample was performed using a paired-end approach
`that results in very different coverage of each exon (Fig. 2), and as a
`result, ExomeCNV does not perform well when the control sample
`library is of one type and the case is of another, or when the case

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket