Raphael et al. Genome Medicine 2014, 6:5
`http://genomemedicine.com/content/6/1/5
`
REVIEW
`Identifying driver mutations in sequenced cancer
`genomes: computational approaches to enable
`precision medicine
`Benjamin J Raphael1,2*, Jason R Dobson1,2,3, Layla Oesper1 and Fabio Vandin1,2
`
`Abstract
`
`High-throughput DNA sequencing is revolutionizing the
`study of cancer and enabling the measurement of the
`somatic mutations that drive cancer development.
`However, the resulting sequencing datasets are large and
`complex, obscuring the clinically important mutations in a
`background of errors, noise, and random mutations. Here,
`we review computational approaches to identify somatic
`mutations in cancer genome sequences and to
`distinguish the driver mutations that are responsible for
`cancer from random, passenger mutations. First, we
`describe approaches to detect somatic mutations from
`high-throughput DNA sequencing data, particularly for
`tumor samples that comprise heterogeneous populations
`of cells. Next, we review computational approaches that
`aim to predict driver mutations according to their
`frequency of occurrence in a cohort of samples, or
`according to their predicted functional impact on protein
`sequence or structure. Finally, we review techniques to
`identify recurrent combinations of somatic mutations,
`including approaches that examine mutations in known
`pathways or protein-interaction networks, as well as de
`novo approaches that identify combinations of mutations
`according to statistical patterns of mutual exclusivity. These
`techniques, coupled with advances in high-throughput
`DNA sequencing, are enabling precision medicine
`approaches to the diagnosis and treatment of cancer.
`
`Challenges of cancer genome sequencing and analysis
Cancer is driven largely by somatic mutations that accumulate in the
genome over an individual’s lifetime, with additional contributions from
epigenetic and transcriptomic alterations. These somatic mutations range
in scale from single-nucleotide variants (SNVs) and insertions and
deletions of a few to a few dozen nucleotides (indels) to larger
copy-number aberrations (CNAs) and large-genome rearrangements, also
called structural variants (SVs). These genomic alterations have been
studied for decades using low-throughput approaches such as targeted gene
sequencing or cytogenetic techniques, which have led to the identification
of a number of highly recurrent somatic mutations [1,2]. Importantly, a
subset of these mutations have been successfully targeted therapeutically;
for example, imatinib has been used to target cells expressing the BCR-ABL
fusion gene in chronic myeloid leukemia [3], and gefitinib has been used
to inhibit the epidermal growth factor receptor in lung cancer [4].
Unfortunately, highly recurrent mutations with a corresponding drug
treatment are unknown for most cancer types, in part due to our lack of
comprehensive knowledge of somatic mutations present in different patients
from a variety of cancer types.

* Correspondence: braphael@brown.edu
1Department of Computer Science, Brown University, 115 Waterman Street,
Providence, RI 02912, USA
Full list of author information is available at the end of the article

`In the past few years, high-throughput DNA sequen-
`cing has revolutionized the identification of somatic mu-
`tations in cancer genomes. Whole-genome sequencing
`reveals somatic mutations of all types, whereas whole-
`exome sequencing identifies coding mutations at a lower
`cost, but does not allow the analysis of non-coding re-
`gions or the detection of SVs. When applied to many
`samples of the same cancer type, these technologies enable
`the identification of novel recurrent somatic mutations, a
`subset of which present new targets for cancer diagnostics
`and treatment [5-15]. These advances hold promise for
`precision medicine, or precision oncology, where a cancer
`treatment could be tailored to a patient’s mutational pro-
`file [16]. Fulfilling this promise of precision oncology will
`require researchers to overcome several challenges in the
`analysis and interpretation of sequencing data.
In this review, we focus on three key challenges in cancer genome
sequencing. First is the issue of identifying somatic mutations from the
short sequence reads generated by high-throughput technologies,
particularly in the presence of intra-tumor heterogeneity. Second is the
problem of distinguishing the relatively small number of driver mutations
that are responsible for the development and progression of cancer from
the large number of passenger mutations that are irrelevant for the cancer
phenotype. Third is the challenge of determining the biological pathways
and processes that are altered by somatic mutation. We survey recent
computational approaches that address each of these challenges.

© 2014 Raphael et al.; licensee BioMed Central Ltd. The licensee has exclusive rights to distribute this article, in any medium,
for 12 months following its publication. After this time, the article is available under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.

`The rapid advances in high-throughput DNA sequen-
`cing technologies and their application to cancer genome
sequencing have led to a proliferation of approaches to
`analyze the resulting data. Moreover, there are multi-
`ple signals in sequencing data that can be used to
`address the challenges listed above, and different compu-
`tational methods use different combinations of these sig-
`nals. This rapid pace of progress, the diversity of strategies
`and the lack, for the most part, of rigorous comparisons
`among different methods explain why a standard pipeline
`for the analysis of high-throughput cancer genome se-
`quencing data has yet to emerge. Hence, we are able to in-
`clude only a fraction of possible approaches. Moreover, we
`restrict attention to methods for DNA sequencing data
`and do not discuss the analysis of other high-throughput
`sequencing data, such as RNA sequencing data, that also
`provide key components for precision medicine [17].
`
`Detection of somatic mutations
`Many of the recent advances in our understanding of driver
`mutations have been the result of the increasing availability
`and affordability of DNA-sequencing technologies produced
`by companies such as Illumina, Ion Torrent, 454, Pa-
`cific Biosciences, and others. Such technologies enabled
`the sequencing of the first cancer genome [18] and the
`subsequent sequencing of thousands of additional can-
`cer genomes, particularly through collaborative pro-
`jects such as The Cancer Genome Atlas (TCGA) and
`the International Cancer Genome Consortium (ICGC).
`Some of these projects employ whole-genome sequen-
`cing, whereas others use exome sequencing, a targeted
`approach that sequences only the coding regions of the
`genome, enabling deeper coverage sequencing of genes
`but at the expense of ignoring non-coding regions. At
`the moment, the dominant approach is to perform
`whole-exome sequencing using one of several target-
`enrichment protocols followed by Illumina sequencing.
`However, the cost-benefit analysis of different tech-
`nologies and approaches is continually changing, and
`we refer the reader to recent surveys for additional in-
`formation [17,19,20].
`The advances in DNA sequencing technologies have
`been dramatic, but these technologies still face some sig-
`nificant limitations in measuring genomes. In particular,
`all of the technologies that sequence human genomes at
`reasonable cost produce millions to billions of short se-
`quences, or reads, of approximately 50–150 bp in length.
`
`To detect somatic mutations in cancer genomes, these
`reads are aligned to the human reference genome and
`differences between the reference genome and the can-
`cer genome are identified (Figure 1a). A matched normal
sample from the same individual is typically analyzed
simultaneously to distinguish somatic from germline
`mutations. The process of detecting somatic mutations
`from aligned reads is not straightforward. Numerous er-
`rors and artifacts are introduced during both the se-
`quencing and the alignment processes including: optical
`PCR duplicates, GC-bias, strand bias (where reads indi-
`cating a possible mutation only align to one strand of
`DNA) and alignment artifacts resulting from low com-
`plexity or repetitive regions in the genome. These lead
`to somatic mutation predictions containing both incor-
`rect variants (false positives) and missing variants (false
`negatives) [21].
`While standard pre-processing handles some sources of
`error (such as the removal of PCR duplicates), most
`methods for somatic mutation detection address only a
`subset of the possible sources of error. For instance, the
`methods MuTect [22] and Strelka [23] for predicting SNVs
`both employ stringent filtering after initial SNV detection
`to remove false positives resulting from strand bias or from
poor mapping resulting from repetitive sequence in the reference
genome. Such filtering may, however, result in a high false-negative
rate. On the other hand, the VarScan 2 method
`[24] does not specifically address either of these issues, but
`still outperforms the previously mentioned methods on
`some datasets [25]. These differences demonstrate that the
`performance of methods can vary by dataset, and suggest
`that running multiple methods is advisable at present.
`Table 1 lists a number of publicly available algorithms for
`the detection of somatic SNVs, CNAs, and SVs from DNA-
`sequencing data. New methods and further refinements of
`existing methods for somatic mutation detection continue
`to be developed.
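
The advice to run several callers can be illustrated with a simple consensus step. The following is a minimal sketch and is not part of MuTect, Strelka, or VarScan 2: it assumes that each caller's output has already been parsed into a set of (chromosome, position, reference allele, alternate allele) tuples, and the function name `consensus_calls` and the example variants are hypothetical.

```python
from collections import Counter

def consensus_calls(call_sets, min_callers=2):
    """Return variants reported by at least `min_callers` of the call sets.

    Each call set is a collection of (chrom, pos, ref, alt) tuples, an
    assumed, simplified stand-in for a caller's parsed output.
    """
    counts = Counter()
    for calls in call_sets:
        counts.update(set(calls))  # count each variant once per caller
    return {variant for variant, n in counts.items() if n >= min_callers}

# Hypothetical calls from three callers on the same tumor/normal pair.
caller_a = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")}
caller_b = {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")}
caller_c = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")}

majority = consensus_calls([caller_a, caller_b, caller_c], min_callers=2)
unanimous = consensus_calls([caller_a, caller_b, caller_c], min_callers=3)
```

Requiring agreement among more callers trades sensitivity for specificity: the unanimous set is a subset of the majority set, so a stricter threshold removes more false positives at the risk of discarding true variants that only one method detects.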
`
`Intra-tumor heterogeneity
`One particular challenge in identifying and characteriz-
`ing somatic mutations in tumors is the fact that most
`tumor samples are a heterogeneous collection of cells,
`containing both normal cells and different populations
`of cancerous cells [26]. The clonal theory of cancer [27]
`posits that all cancerous cells in a tumor descended from
`a single cell in which the first driver mutation occurred,
`and that subsequent clonal expansions and selective
`sweeps lead to a tumor with a dominant (majority)
`population of cancerous cells containing early driver
`events. Most cancer-genome sequencing studies gener-
`ate data from a bulk tumor sample that contains both
`normal cells and one or more subpopulations of tumor
cells. This intra-tumor heterogeneity complicates the identification of
all types of somatic mutations, and specialized methods [28-31] have been
developed to quantify the extent of heterogeneity in a sample. The
simplest form of intra-tumor heterogeneity is admixture by normal cells.
The tumor purity of a sample is defined as the fraction of cells in the
sample that are cancerous. A read from a tumor sample represents a
sequence in the cell, or subpopulation of cells, from which the read was
derived. Thus, lower tumor purity results in a reduction in the number of
sequence reads derived from the cancerous cells, and hence a reduction in
the signal that can be used to detect somatic mutations (Figure 1b).

[Figure 1 image not reproduced. Panels: (a) 100% tumor purity; (b) 60% tumor purity; each shows reads aligned to the reference genome. Key: read; sequencing error; heterozygous germline SNV; heterozygous somatic SNV.]

Figure 1 Somatic mutation detection in tumor samples. DNA-sequence reads from a tumor sample are aligned to a reference genome
(shown in gray). Single-nucleotide differences between reads and the reference genome indicate germline single-nucleotide variants (SNVs; green
circles), somatic SNVs (red circles), or sequencing errors (black diamonds). (a) In a pure tumor sample, a location containing mismatches or single
nucleotide substitutions in approximately half of the reads covering the location indicates a heterozygous germline SNV or a heterozygous somatic
SNV - assuming that there is no copy number aberration at the locus. Algorithms for detecting SNVs distinguish true SNVs from sequencing errors by
requiring multiple reads with the same single-letter substitution to be aligned at the position (gray boxes). (b) As tumor purity decreases, the fraction
of reads containing somatic mutations decreases: cancerous and normal cells, and the reads originating from each, are shown in blue and orange,
respectively. The number of reads reporting a somatic mutation decreases with tumor purity, diminishing the signal to distinguish true somatic
mutations from sequencing errors. In this example, only one heterozygous somatic SNV and one heterozygous germline SNV are detected
(gray boxes) as the mutation in the middle set of aligned reads is not distinguishable from sequencing errors.

`Tumor purity is an important parameter in the detection
`of somatic mutations. To obtain reasonable sensitivity and
`specificity, methods to predict somatic aberrations must
`utilize, either implicitly or explicitly, an estimate of
`tumor purity. The VarScan 2 program [24] for calling
`somatic SNVs and indels allows a user to provide an
`estimate of tumor purity in order to calibrate the ex-
`pected number of reads containing a somatic mutation
`at a single locus. Conversely, methods such as MuTect
`[22] and Strelka [23] explicitly model tumor and nor-
`mal allele frequencies using observed data to calibrate
`sensitivity. As a result, MuTect and Strelka may pro-
`vide improved sensitivity for detecting mutations that
`occur in lower frequencies, especially when tumor pur-
`ity is unknown a priori. The performance of these and
`other somatic mutation-calling algorithms depends on
`accurate estimates of tumor purity.
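
The way tumor purity calibrates the expected number of variant reads at a locus can be made concrete with a small calculation. The sketch below is illustrative only and does not reproduce the models used by VarScan 2, MuTect, or Strelka. It assumes a single clonal tumor population, error-free reads, and read counts that follow a binomial distribution; the function names and default copy numbers are assumptions introduced here.

```python
from math import comb

def expected_vaf(purity, mutated_copies=1, tumor_copies=2, normal_copies=2):
    """Expected fraction of reads carrying a somatic variant at a locus.

    Assumes every tumor cell carries `mutated_copies` mutated copies out of
    `tumor_copies`, and normal cells carry `normal_copies` unmutated copies.
    """
    tumor_dna = purity * tumor_copies
    normal_dna = (1.0 - purity) * normal_copies
    return purity * mutated_copies / (tumor_dna + normal_dna)

def prob_at_least(k, depth, p):
    """P(X >= k) for X ~ Binomial(depth, p): chance of seeing k or more variant reads."""
    return sum(comb(depth, i) * p**i * (1 - p) ** (depth - i)
               for i in range(k, depth + 1))

vaf_pure = expected_vaf(1.0)   # heterozygous diploid variant, pure tumor
vaf_mixed = expected_vaf(0.6)  # same variant at 60% tumor purity

# Chance of observing at least 5 supporting reads at 30x coverage.
detect_pure = prob_at_least(5, 30, vaf_pure)
detect_mixed = prob_at_least(5, 30, vaf_mixed)
```

Under these assumptions a heterozygous somatic SNV is expected in half of the reads in a pure sample but in only about 30% of the reads at 60% purity, so a fixed supporting-read threshold detects it less often. This is the loss of signal illustrated in Figure 1b.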
Table 1 Methods for detecting somatic mutations
Objective | Data | Method | Description
Somatic mutation detection | SNV | MuTect [22] | Designed to detect low-frequency mutations in both whole-genome and exome data.
 | | Strelka [23] | Can be applied to both whole-genome and whole-exome data. Uses stringent post-call filtration.
 | | VarScan 2 [24] | Demonstrates high sensitivity for detecting SNVs in relatively pure tumor samples from both whole-genome and exome data.
 | | JointSNVMix [128] | A probabilistic model that describes the observed allelic counts in both tumor and normal samples.
 | CNA or SV | BIC-Seq [129] | Detects CNAs from whole-genome data.
 | | APOLLOH [130] | Predicts loss of heterozygosity regions from whole-genome sequencing data.
 | | CoNIFER [131] | Detects CNAs from exome data.
 | | BreakDancer [132] | Cluster paired-end alignments to detect SVs. One version to detect large aberrations and another to detect smaller indels.
 | | VariationHunter-CommonLaw [133], HYDRA [70] | Cluster paired-reads, including reads with multiple possible alignments. Support simultaneous analysis of multiple samples.
 | | GASV/GASVPro [134,135], PeSV-Fisher [136] | Combine paired-read and read-depth analysis to detect SVs.
 | | Meerkat [130] | Combines paired-end split-read and multiple alignment information to detect structural aberrations.
 | | Delly [137], Break-Pointer [138] | Combines paired-end and split-read signals to detect structural aberrations.
Tumor purity estimation | SNV | ABSOLUTE [28] | Originally designed for SNP array data, but may be adapted for whole-genome sequencing data. Handles subclonal populations as outliers.
 | | ASCAT [29] | Designed for SNP array data, but may be adapted for whole-genome sequencing data. Only considers a single tumor population.
 | CNA | THetA [30] | Able to consider multiple subclonal tumor populations, but only if they differ by large CNAs. Designed for whole-genome sequencing data.
 | | SomatiCA [31] | Only uses aberrations that are identified as clonal to estimate tumor purity.
CNA, copy number aberration; SNV, single-nucleotide variant; SV, structural variant.
A representative list of software available for the detection of somatic mutations from high-throughput sequencing data of cancer genomes. Some methods
detect more than one type of mutation but are listed only once for clarity.

Standard methods for estimating tumor purity involve visual inspection by
a pathologist or automated analysis of cellular images [32]. Recently,
several alternative approaches have been developed to estimate tumor purity
`directly from sequencing data by identifying shifts in the
`expected number of reads that align to a locus (Table 1).
`This is not an easy task as most cancer genomes are an-
`euploid and thus do not contain two copies of each
`chromosomal locus. The tumor ploidy, defined as the
`total DNA content in a tumor cell, also results in shifts
`in the sequencing coverage. Thus, estimation of tumor
`purity and tumor ploidy are closely intertwined. ABSO-
`LUTE [28] and ASCAT [29] are two algorithms that are
`used to infer both tumor purity and tumor ploidy from
`single-nucleotide polymorphism (SNP) array data. Al-
`though both methods may be modified to work with
`DNA-sequencing data [33], they model a tumor sample
`as consisting of only two populations: normal cells and
`tumor cells. As they do not directly model the possible
`existence of multiple distinct tumor subpopulations, the
`tumor purity estimates that result can be inaccurate,
`and reflect either an average over all tumor subpopula-
`tions or a bias for the dominant tumor subpopulation
`[30]. Furthermore, accurate identification of tumor sub-
`populations may provide important information on tu-
`mors that do not respond well to treatments [34-36].
`Recently, the Tumor Heterogeneity Analysis (THetA)
`algorithm [30] was developed to infer the composition
`of a tumor sample (including tumor purity) containing
any number of tumor subpopulations directly from
`DNA-sequencing data. Although THetA overcomes
`some of the limitations of earlier methods, it is unable
`to distinguish distinct tumor subpopulations that do not
`contain CNAs, necessitating the development of add-
`itional approaches to identify tumor subpopulations that
`are distinguished only by SNVs and/or small indels. The
`identification of somatic mutations and the estimation of
`intra-tumor heterogeneity are closely related, and so
`methods that jointly perform these tasks while allowing
`for multiple tumor subpopulations are desirable for
`obtaining highly sensitive and specific estimates of all
`somatic aberrations in tumors.
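
The entanglement of tumor purity and tumor ploidy noted above can be seen in a toy calculation of expected read-depth ratios. This is a simplified sketch and not the model used by ABSOLUTE, ASCAT, or THetA: it assumes equal-length genomic segments, a single tumor population, a diploid normal genome, and depth ratios normalized by the sample-wide mean copy number. The function name `depth_ratios` and the copy-number profiles are hypothetical.

```python
def depth_ratios(purity, tumor_copy_numbers, normal_copies=2):
    """Expected tumor/normal read-depth ratio for each segment.

    The DNA contributed by a segment is a purity-weighted mix of tumor and
    normal copy numbers; ratios are normalized by the mean over segments
    (assumes equal-length segments and equal sequencing effort).
    """
    mixed = [purity * c + (1.0 - purity) * normal_copies
             for c in tumor_copy_numbers]
    mean_copies = sum(mixed) / len(mixed)
    return [m / mean_copies for m in mixed]

# Scenario A: 50% purity, tumor segments with 2 and 4 copies.
scenario_a = depth_ratios(0.5, [2, 4])
# Scenario B: 100% purity, tumor segments with 2 and 3 copies.
scenario_b = depth_ratios(1.0, [2, 3])
```

Both scenarios yield the same depth ratios (0.8 and 1.2), so read depth alone cannot separate a low-purity sample with a large amplification from a pure sample with a smaller one. Resolving such ambiguity requires further constraints, such as requiring tumor copy numbers to be integers across many segments or using additional signals.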
Advances in DNA-sequencing technologies have also enabled the direct
quantification of intra-tumor heterogeneity. One approach is to perform
targeted, ultra-deep-coverage sequencing of SNVs, followed by
`clustering of the read counts for each SNV into distinct
`subpopulations [37,38]. Ding et al. [37] identified two
`distinct clonal evolution patterns for acute myeloid
`leukemia (AML) patients: a relapse sample evolved ei-
`ther from the founding clone in the primary tumor or
`from a minor subclone that survived initial treatment.
`Shah et al. [38] demonstrated extreme variability in the
`total number of tumor subpopulations (ranging from 1–
`2 to more than 15 subpopulations) in tumors from a
`large cohort of breast cancer patients. Another approach
`to measure intra-tumor heterogeneity is to sequence
`samples from multiple regions within the same tumor.
`Gerlinger et al. [39] sequenced multiple regions from
`several kidney tumors and found that a majority (63-
`69%) of the somatic mutations identified were present in
`only a subset of the sequenced regions of the tumor.
`Navin and colleagues [40,41] found similar heterogeneity
`in the CNAs present within different regions of breast
`tumors. These results demonstrate that a single sample
`from a tumor might not fully represent the complete
`landscape of somatic mutations (including driver muta-
`tions) present in the tumor.
`Finally, Nik-Zainal et al. [42] demonstrated how care-
`ful computational analysis can reveal information about
`the composition of a tumor sample, including the identi-
`fication of clonal mutations that are present in nearly all
`cells of the tumor (and thus presumably are early events
`in tumorigenesis) and subclonal mutations that are
`present in a fraction of tumor cells. Using high-coverage
`(188X) whole-genome DNA sequencing of a breast
`tumor, they inferred the proportion of tumor cells con-
`taining somatic SNVs and CNAs and grouped these pro-
`portions into several clusters, demonstrating different
`mutational events during the evolutionary progression
`from the founder cell of the tumor to the present tumor
`cell population. Eventually, single-cell sequencing tech-
`nologies [41,43-47] promise to provide a comprehensive
`view of intra-tumor heterogeneity, but these approaches
`remain limited by artifacts introduced during whole-
`genome amplification [47]. In the interim, there is an
`immediate need for better methods to detect somatic
`mutations that occur in heterogeneous tumor samples.
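
As a rough illustration of how the proportion of tumor cells carrying a mutation can be inferred from read counts, the following sketch converts a variant allele fraction into an estimated cancer cell fraction. It is not the procedure of Nik-Zainal et al. [42]; it assumes that tumor purity and the local copy number are known, that each mutated cell carries a fixed number of mutated copies, and that a simple fixed threshold (a hypothetical choice) separates clonal from subclonal mutations.

```python
def cancer_cell_fraction(vaf, purity, tumor_copies=2, mutated_copies=1,
                         normal_copies=2):
    """Estimate the fraction of tumor cells carrying a mutation.

    Inverts the expected variant allele fraction
        vaf = ccf * purity * mutated_copies / total_copies
    where total_copies is the purity-weighted copy number at the locus.
    The estimate is capped at 1.0 because sampling noise can exceed it.
    """
    total_copies = purity * tumor_copies + (1.0 - purity) * normal_copies
    return min(1.0, vaf * total_copies / (purity * mutated_copies))

def classify(ccf, clonal_threshold=0.9):
    """Label a mutation as clonal or subclonal (threshold is an assumption)."""
    return "clonal" if ccf >= clonal_threshold else "subclonal"

# Hypothetical mutations in a diploid region of a sample with 70% purity.
ccf_early = cancer_cell_fraction(vaf=0.35, purity=0.7)
ccf_late = cancer_cell_fraction(vaf=0.14, purity=0.7)
labels = [classify(ccf_early), classify(ccf_late)]
```

In this example the first mutation is estimated to be present in essentially all tumor cells and the second in about 40% of them. In practice read counts are noisy, so published analyses cluster many mutations jointly rather than thresholding each one separately.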
`
`Computational prioritization of driver mutations
`Following the sequencing of a cancer genome, the next
`step is to identify driver mutations that are responsible
`for the cancer phenotype. Ultimately, the determination
`that a mutation is functional requires experimental val-
`idation, using in vitro or in vivo models to demonstrate
`that a mutation leads to at least one of the characteris-
`tics of the cancer phenotype, such as DNA repair defi-
`ciency, uncontrolled proliferation and growth, or
immune evasion. As a result of advances in DNA-sequencing technology, the
measurement of somatic mutations is now significantly cheaper and faster
than the functional characterization of a mutation. Moreover, as
cancer-genome sequencing moves from the research laboratory into the
clinic, there is a strong need to automate the categorization of mutations
to prioritize rapid, accurate diagnoses and treatments for patients.
Unfortunately, distinguishing driver from passenger mutations solely from
the resulting DNA-sequence change is extremely complicated, as the effect
of most DNA-sequence changes is poorly understood, even in the simplest
case of single nucleotide substitutions in coding regions of well-studied
proteins.

[Figure 2 image not reproduced. Flowchart titled "Identifying driver mutations": whole genome sequencing or whole exome sequencing; somatic mutation identification (driver mutations and passenger mutations); prioritization of mutations (identify recurrent mutations; predict functional impact of mutations; identify recurrent combinations of mutations); experimental and functional validation.]

Figure 2 Overview of strategies for cancer-genome sequencing. A cancer-genome sequencing project begins with whole-genome or
whole-exome sequencing. Various methods are used to detect somatic mutations in the resulting sequence (see Table 1), yielding a long list
of somatic mutations. Several strategies can then be employed to prioritize these mutations for experimental or functional validation. These
strategies include: testing for recurrent mutations, predicting functional impact, and assessing combinations of mutations (see Table 2). None of
these approaches are perfect, and each returns a subset of driver mutations as well as passenger mutations. The mutations returned by these
approaches can then be validated using a variety of experimental techniques.

Table 2 Methods for prediction of driver mutations and genes
Objective | Data | Method | Description
Recurrent somatic mutation identification | SNV | MutSigCV [48] | Uses coverage information and genomic features (e.g. DNA replication time) to estimate the background mutation rate of a gene.
 | | MuSiC [49] | Uses a per-gene background mutation rate; allows for user-defined regions of interest.
 | | Youn et al. [51] | Includes predicted impact on protein function in determining recurrent mutations.
 | | Sjöblom et al. [52] | Defines a cancer mutation prevalence score for each gene.
 | | DrGaP [139] | Uses Bayesian approach to estimate background mutation rate; helpful for cancer types with low mutation rate.
 | CNA | GISTIC2 [61], JISTIC [63] | Uses ‘peel-off’ techniques to find smaller recurrent aberrations inside larger aberrations.
 | | CMDS [62] | Identifies recurrent CNAs from unsegmented data.
 | | ADMIRE [65] | Multi-scale smoothing of copy number profiles.
Functional impact prediction | General | SIFT [72] | Uses conservation of amino acids to predict functional impact of a non-synonymous amino-acid change.
 | | Polyphen-2 [74] | Infers functional impact of non-synonymous amino-acid changes through alignments of related peptide sequences and a machine-learning-based probabilistic classifier.
 | | MutationAssessor [75] | Uses protein homologs to calculate a score based on the divergence in conservation caused by an amino-acid change.
 | | PROVEAN [73] | Benchmarks favorably against MutationAssessor, Polyphen-2 and SIFT.
 | Cancer-specific | CHASM [77] | Uses a machine-learning approach to classify mutations as drivers or passengers based on sequence conservation, protein domains, and protein structure.
 | | Oncodrive-FM [79] | Combines scores from SIFT, Polyphen-2, and MutationAssessor into a single ranking.
 | Positional or structural clustering | NMC [83] | Finds clusters of non-synonymous mutations across patients. Typically used with missense mutations to detect so-called ‘activating’ mutations.
 | | iPAC [84] | Extends the NMC approach to search for clusters of mutations in three-dimensional space using crystal structures of proteins.
Pathway analysis and combinations of mutations | Known pathways | GSEA [92] | A general technique for testing ranked lists of genes for enrichment in known gene sets. Can be used on rankings derived from significance of observed mutations.
 | | PathScan [95] | Finds pathways with excess of mutations in a gene set (pathway), by combining P-values of enrichment across samples.
 | | Patient-oriented gene sets [94] | Tests known pathways using a binary indicator for a pathway in each patient.
 | Interaction networks | NetBox [140] | Finds network modules in a user-provided list of genes. Significance depends only on the topology of the genes in the network, and not on mutation scores.
 | | HotNet [102] | Finds subnetworks with significantly more aberrations than would be expected by chance, using both network topology and user-defined gene or protein scores.
 | | MEMo [104] | Finds subnetworks whose interacting pairs of genes have mutually exclusive aberrations [105]; recommends including only recurrent SNVs and CNAs in the analysis.
 | De novo | Dendrix [102] | Identifies groups of genes with mutually exclusive aberrations.
 | | Multi-Dendrix [112] | Simultaneously finds multiple groups of genes with mutually exclusive aberrations.
 | | RME [110] | Finds groups of genes with mutually exclusive aberrations by building from gene pairs; best results obtained when restricting to genes with high mutation frequencies (e.g. > 10%).
CNA, copy number aberration; SNV, single-nucleotide variant.
A representative list of software available to predict driver mutations or genes by detecting their recurrence across multiple samples, functional impact, or
interactions with other mutations in pathways or combinations. Some methods fall into multiple categories but are listed only once for clarity.

`In the following sections, we describe three ap-
`proaches for computational prioritization of driver mu-
`tations: identifying recurrent mutations; predicting the
`functional impact of individual mutations; and assessing
`combinations of mutations using pathways, interaction
`networks, or statistical correlations. These approaches
`provide alternative strategies to filter the long list of
`measured somatic mutations, and to identify a smaller
`subset enriched for driver mutations to undergo further
`experimental and functional validation (Figure 2).
`
`Statistical tests for recurrent mutations
`One approach to prioritize mutations for further experi-
`mental characterization is to identify recurrent mutations.
`Each cancer sample has undergone an independent evolu-
`tionary process in which acquired driver mutations that
`provide selective advantage result in clonal expansion of
`these lineages [27]. As these mutational processes converge
`to a common oncogenic phenotype, the mutations that
`drive cancer progression should appear more frequently
`than expected by chance across patient samples. Recur-
`rence may be revealed at different levels of resolution, such
`as an individual nucleotide, a codon, a protein domain, a
`whole gene, or even a pathway. In this section, we describe
`the techniques and difficulties in identifying recurrently
`mutated driver genes.
`
`Statistical tests for genes with recurrent single-nucleotide
`mutations
`Several methods have been designed to find recurrent
`mutations in a cohort of cancer patients, including Mut-
`SigCV [48], MuSiC [49], and others [50-53] (Table 2).
`The fundamental calculation in all these approaches is
`to determine whether the observed number of mutations
`in the gene is significantly greater than the number ex-
`pected according to a background mutation rate (BMR).
`The BMR is the probability of observing a passenger
`mutation in a specific location of the genome. From the
`BMR and the number of sequenced nucleotides within a
`gene, a binomial model can be used to derive the prob-
`ability of the observed number of mutations in a gene
`across a cohort of patients (Box 1).
`
`Box 1. The binomial model: a statistical test for
`detecting recurrent mutations.
`
Using the background mutation rate (BMR) and the number $n$ of
sequenced nucleotides within a gene $g$, the probability $P_g$ that a
passenger mutation is observed in $g$ is given by
$P_g = 1 - (1 - \mathrm{BMR})^n$.
Since somatic mutations arise independently in each sample, the
occurrences of passenger mutations in $g$ are modeled by flipping a
biased coin with probability $P_g$ of heads (mutation). Thus, if somatic
mutations have been measured in $m$ samples, the number of
patients in which gene $g$ is mutated is described by a binomial
random variable $B(m, P_g)$ with parameters $m$ and $P_g$. From $B(m, P_g)$,
it is possible to compute the probability that the observed number
or more samples contain passenger mutations in $g$; this is the P-value
of the statistical test. A multiple-hypothesis testing correction is
applied when examining multiple genes.
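
The binomial test in Box 1 can be sketched in a few lines of code. The example takes the per-gene probability to be 1 - (1 - BMR)^n, the probability that at least one of the n sequenced nucleotides carries a passenger mutation, and applies a Bonferroni correction as one possible multiple-testing adjustment. It assumes a single constant BMR for all genes and samples, which the methods in Table 2 refine in different ways; the gene names, lengths, and counts are hypothetical.

```python
from math import comb

def gene_mutation_prob(bmr, n_bases):
    """P_g = 1 - (1 - BMR)^n: chance of at least one passenger mutation in a gene."""
    return 1.0 - (1.0 - bmr) ** n_bases

def binomial_tail(k, m, p):
    """P(X >= k) for X ~ Binomial(m, p)."""
    return sum(comb(m, i) * p**i * (1.0 - p) ** (m - i)
               for i in range(k, m + 1))

def recurrence_pvalues(observed, gene_lengths, bmr, m):
    """P-value per gene: probability of >= observed mutated samples by chance."""
    return {gene: binomial_tail(k, m, gene_mutation_prob(bmr, gene_lengths[gene]))
            for gene, k in observed.items()}

def bonferroni_significant(pvalues, alpha=0.05):
    """Genes whose P-value passes a Bonferroni-corrected threshold."""
    threshold = alpha / len(pvalues)
    return {gene for gene, p in pvalues.items() if p <= threshold}

# Hypothetical cohort: 200 samples, BMR of 1e-6 mutations per nucleotide.
gene_lengths = {"GENE_A": 1500, "GENE_B": 3000}
observed = {"GENE_A": 12, "GENE_B": 1}  # number of samples with a mutation

pvalues = recurrence_pvalues(observed, gene_lengths, bmr=1e-6, m=200)
significant = bonferroni_significant(pvalues)
```

Here `GENE_A` is mutated in far more samples than the background rate predicts, whereas a single mutated sample in the longer `GENE_B` is unremarkable. A real analysis would correct for all tested genes, typically on the order of 20,000, rather than the two shown.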
`
`The main differences between methods for identifying
`recurrently mutated genes are in how they estimate the
`BMR and how many different mutational contexts they
`analyze. Regarding the former, the BMR is not constant
`across the genome, but depends on the genomic context
`of a nucleotide [52] and the type of mutation [7]. More-
`over, the BMR of a gene is correlated with both its rate
`of transcription [54] and replication timing [55,56]. The
`BMR is also not constant across patients, and cancer co-
`horts often present hypermutated samples [6]. Finally,
`certain genomic regions may display localized somatic
`hypermutation, termed kataegis [57]. Different combina-
`tions of these effects can cause the BMR to vary by as
`much as an order of magnitude across different genes.
`The estimated BMR greatly affects the identification of
`recurrent mutations, as an estimate that is higher than
`the true value fails to identify recurrent mutations (false
`negatives), whereas an estimate that is lower than the
true value leads to false positives. Of course, if a
`driver gene is mutated in a very high percentage of sam-
`ples (more than 20%, for example), even an inaccurate
`estimate of the BMR is sufficient to correctly identify
`such a gene as recurrently mutated. Thus, well-known
`cancer genes (such as TP53) are readily identified as re-
`currently mutated genes by all computational methods.
`The priority now is to identify rare driver mutations that
`are important for precision oncology. The tools that are
`currently available often report different rare mutations
`as drivers, and more work is needed in order to improve
`the sensitivity in the detection of rare driver mutations
`and to compare and combine the results from different
`tools [58]. In general, reporting rarely mutated genes as
recurrently mutated with high confidence requires better
estimates of the BMR, much larger patient cohorts, or both.
`
`Statistical tests for genes with
