`http://genomemedicine.com/content/6/1/5
`
REVIEW
`Identifying driver mutations in sequenced cancer
`genomes: computational approaches to enable
`precision medicine
`Benjamin J Raphael1,2*, Jason R Dobson1,2,3, Layla Oesper1 and Fabio Vandin1,2
`
`Abstract
`
`High-throughput DNA sequencing is revolutionizing the
`study of cancer and enabling the measurement of the
`somatic mutations that drive cancer development.
`However, the resulting sequencing datasets are large and
`complex, obscuring the clinically important mutations in a
`background of errors, noise, and random mutations. Here,
`we review computational approaches to identify somatic
`mutations in cancer genome sequences and to
`distinguish the driver mutations that are responsible for
`cancer from random, passenger mutations. First, we
`describe approaches to detect somatic mutations from
`high-throughput DNA sequencing data, particularly for
`tumor samples that comprise heterogeneous populations
`of cells. Next, we review computational approaches that
`aim to predict driver mutations according to their
`frequency of occurrence in a cohort of samples, or
`according to their predicted functional impact on protein
`sequence or structure. Finally, we review techniques to
`identify recurrent combinations of somatic mutations,
`including approaches that examine mutations in known
`pathways or protein-interaction networks, as well as de
`novo approaches that identify combinations of mutations
`according to statistical patterns of mutual exclusivity. These
`techniques, coupled with advances in high-throughput
`DNA sequencing, are enabling precision medicine
`approaches to the diagnosis and treatment of cancer.
`
* Correspondence: braphael@brown.edu
1Department of Computer Science, Brown University, 115 Waterman Street,
Providence, RI 02912, USA
Full list of author information is available at the end of the article

Challenges of cancer genome sequencing and analysis
Cancer is driven largely by somatic mutations that accumulate
in the genome over an individual’s lifetime, with
additional contributions from epigenetic and transcriptomic
alterations. These somatic mutations range in
scale from single-nucleotide variants (SNVs) and insertions
and deletions of a few to a few dozen nucleotides
(indels) to larger copy-number aberrations (CNAs) and
large-genome rearrangements, also called structural variants
(SVs). These genomic alterations have been studied
for decades using low-throughput approaches such as
targeted gene sequencing or cytogenetic techniques,
which have led to the identification of a number of
highly recurrent somatic mutations [1,2]. Importantly, a
subset of these mutations have been successfully targeted
therapeutically; for example, imatinib has been
used to target cells expressing the BCR-ABL fusion gene
in chronic myeloid leukemia [3], and gefitinib has been
used to inhibit the epidermal growth factor receptor in
lung cancer [4]. Unfortunately, highly recurrent mutations
with a corresponding drug treatment are unknown
for most cancer types, in part due to our lack of comprehensive
knowledge of somatic mutations present in different
patients from a variety of cancer types.
`In the past few years, high-throughput DNA sequen-
`cing has revolutionized the identification of somatic mu-
`tations in cancer genomes. Whole-genome sequencing
`reveals somatic mutations of all types, whereas whole-
`exome sequencing identifies coding mutations at a lower
`cost, but does not allow the analysis of non-coding re-
`gions or the detection of SVs. When applied to many
`samples of the same cancer type, these technologies enable
`the identification of novel recurrent somatic mutations, a
`subset of which present new targets for cancer diagnostics
`and treatment [5-15]. These advances hold promise for
`precision medicine, or precision oncology, where a cancer
`treatment could be tailored to a patient’s mutational pro-
`file [16]. Fulfilling this promise of precision oncology will
`require researchers to overcome several challenges in the
`analysis and interpretation of sequencing data.
`In this review, we focus on three key challenges in
`cancer genome sequencing. First is the issue of identify-
`ing somatic mutations from the short sequence reads
`generated by high-throughput technologies, particularly
`in the presence of intra-tumor heterogeneity. Second is
`the problem of distinguishing the relatively small
`
`© 2014 Raphael et al.; licensee BioMed Central Ltd. The licensee has exclusive rights to distribute this article, in any medium,
`for 12 months following its publication. After this time, the article is available under the terms of the Creative Commons
`Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
`reproduction in any medium, provided the original work is properly cited.
`
`Raphael et al. Genome Medicine 2014, 6:5
`
`number of driver mutations that are responsible for
`the development and progression of cancer from the
`large number of passenger mutations that are irrelevant
`for the cancer phenotype. Third is the challenge of
`determining the biological pathways and processes that are
`altered by somatic mutation. We survey recent computa-
`tional approaches that address each of these challenges.
The rapid advances in high-throughput DNA sequencing
technologies and their application to cancer genome
sequencing have led to a proliferation of approaches to
analyze the resulting data. Moreover, there are multiple
signals in sequencing data that can be used to
`address the challenges listed above, and different compu-
`tational methods use different combinations of these sig-
`nals. This rapid pace of progress, the diversity of strategies
`and the lack, for the most part, of rigorous comparisons
`among different methods explain why a standard pipeline
`for the analysis of high-throughput cancer genome se-
`quencing data has yet to emerge. Hence, we are able to in-
`clude only a fraction of possible approaches. Moreover, we
`restrict attention to methods for DNA sequencing data
`and do not discuss the analysis of other high-throughput
`sequencing data, such as RNA sequencing data, that also
`provide key components for precision medicine [17].
`
`Detection of somatic mutations
`Many of the recent advances in our understanding of driver
`mutations have been the result of the increasing availability
`and affordability of DNA-sequencing technologies produced
`by companies such as Illumina, Ion Torrent, 454, Pa-
`cific Biosciences, and others. Such technologies enabled
`the sequencing of the first cancer genome [18] and the
`subsequent sequencing of thousands of additional can-
`cer genomes, particularly through collaborative pro-
`jects such as The Cancer Genome Atlas (TCGA) and
`the International Cancer Genome Consortium (ICGC).
`Some of these projects employ whole-genome sequen-
`cing, whereas others use exome sequencing, a targeted
`approach that sequences only the coding regions of the
`genome, enabling deeper coverage sequencing of genes
`but at the expense of ignoring non-coding regions. At
`the moment, the dominant approach is to perform
`whole-exome sequencing using one of several target-
`enrichment protocols followed by Illumina sequencing.
`However, the cost-benefit analysis of different tech-
`nologies and approaches is continually changing, and
`we refer the reader to recent surveys for additional in-
`formation [17,19,20].
`The advances in DNA sequencing technologies have
`been dramatic, but these technologies still face some sig-
`nificant limitations in measuring genomes. In particular,
`all of the technologies that sequence human genomes at
`reasonable cost produce millions to billions of short se-
`quences, or reads, of approximately 50–150 bp in length.
`
`To detect somatic mutations in cancer genomes, these
`reads are aligned to the human reference genome and
differences between the reference genome and the cancer
genome are identified (Figure 1a). A matched normal
sample from the same individual is typically analyzed
`simultaneously to distinguish somatic from germline
`mutations. The process of detecting somatic mutations
from aligned reads is not straightforward. Numerous errors
and artifacts are introduced during both the sequencing
and the alignment processes, including optical and
PCR duplicates, GC bias, strand bias (where reads indicating
a possible mutation align to only one strand of
DNA) and alignment artifacts resulting from low-complexity
or repetitive regions in the genome. These lead
`to somatic mutation predictions containing both incor-
`rect variants (false positives) and missing variants (false
`negatives) [21].
`While standard pre-processing handles some sources of
`error (such as the removal of PCR duplicates), most
`methods for somatic mutation detection address only a
`subset of the possible sources of error. For instance, the
`methods MuTect [22] and Strelka [23] for predicting SNVs
`both employ stringent filtering after initial SNV detection
`to remove false positives resulting from strand bias or from
poor mapping resulting from repetitive sequence in the reference
genome. Such filtering may, however, result in a high
false-negative rate. On the other hand, the VarScan 2 method
`[24] does not specifically address either of these issues, but
`still outperforms the previously mentioned methods on
`some datasets [25]. These differences demonstrate that the
`performance of methods can vary by dataset, and suggest
`that running multiple methods is advisable at present.
`Table 1 lists a number of publicly available algorithms for
`the detection of somatic SNVs, CNAs, and SVs from DNA-
`sequencing data. New methods and further refinements of
`existing methods for somatic mutation detection continue
`to be developed.
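
As a rough illustration of the tumor versus matched-normal comparison described above, the following Python sketch applies simple count thresholds to the bases observed at a single locus. It is not the algorithm used by MuTect, Strelka, or VarScan 2; the function name `call_somatic_snv` and every threshold are hypothetical values chosen only for illustration, and the sketch ignores base quality, mapping quality, and strand bias.

```python
from collections import Counter


def call_somatic_snv(ref_base, tumor_bases, normal_bases,
                     min_alt_reads=3, min_tumor_vaf=0.05,
                     max_normal_alt_reads=0):
    """Return a candidate somatic alternate base at one locus, or None.

    tumor_bases / normal_bases are the bases reported by the reads that
    cover the locus in each sample. All thresholds are illustrative
    assumptions, not values taken from any published caller.
    """
    # Count non-reference bases observed in the tumor reads.
    alt_counts = Counter(b for b in tumor_bases if b != ref_base)
    if not alt_counts:
        return None

    # Consider only the most frequent alternate base.
    alt_base, alt_reads = alt_counts.most_common(1)[0]
    tumor_vaf = alt_reads / len(tumor_bases)

    # Too few supporting reads: indistinguishable from sequencing error.
    if alt_reads < min_alt_reads or tumor_vaf < min_tumor_vaf:
        return None

    # The same base in the matched normal suggests a germline variant.
    normal_alt_reads = sum(1 for b in normal_bases if b == alt_base)
    if normal_alt_reads > max_normal_alt_reads:
        return None

    return alt_base


if __name__ == "__main__":
    print(call_somatic_snv("A", list("AAAAATTTTT"), list("AAAAAAAAAA")))
```

Raising `min_alt_reads` trades false positives for false negatives, which mirrors the filtering trade-off discussed above.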
`
`Intra-tumor heterogeneity
`One particular challenge in identifying and characteriz-
`ing somatic mutations in tumors is the fact that most
`tumor samples are a heterogeneous collection of cells,
`containing both normal cells and different populations
`of cancerous cells [26]. The clonal theory of cancer [27]
`posits that all cancerous cells in a tumor descended from
`a single cell in which the first driver mutation occurred,
`and that subsequent clonal expansions and selective
`sweeps lead to a tumor with a dominant (majority)
`population of cancerous cells containing early driver
`events. Most cancer-genome sequencing studies gener-
`ate data from a bulk tumor sample that contains both
`normal cells and one or more subpopulations of tumor
`cells. This intra-tumor heterogeneity complicates the
identification of all types of somatic mutations and
`
[Figure 1 panels: (a) 100% tumor purity; (b) 60% tumor purity. Each panel shows reads aligned to the reference genome. Key: read; sequencing error; heterozygous germline SNV; heterozygous somatic SNV.]
`
`Figure 1 Somatic mutation detection in tumor samples. DNA-sequence reads from a tumor sample are aligned to a reference genome
`(shown in gray). Single-nucleotide differences between reads and the reference genome indicate germline single-nucleotide variants (SNVs; green
`circles), somatic SNVs (red circles), or sequencing errors (black diamonds). (a) In a pure tumor sample, a location containing mismatches or single
`nucleotide substitutions in approximately half of the reads covering the location indicates a heterozygous germline SNV or a heterozygous somatic
`SNV - assuming that there is no copy number aberration at the locus. Algorithms for detecting SNVs distinguish true SNVs from sequencing errors by
`requiring multiple reads with the same single-letter substitution to be aligned at the position (gray boxes). (b) As tumor purity decreases, the fraction
`of reads containing somatic mutations decreases: cancerous and normal cells, and the reads originating from each, are shown in blue and orange,
`respectively. The number of reads reporting a somatic mutation decreases with tumor purity, diminishing the signal to distinguish true somatic
mutations from sequencing errors. In this example, only one heterozygous somatic SNV and one heterozygous germline SNV are detected
`(gray boxes) as the mutation in the middle set of aligned reads is not distinguishable from sequencing errors.
`
`specialized methods [28-31] have been developed to
`quantify the extent of heterogeneity in a sample. The
`simplest form of intra-tumor heterogeneity is admixture
`by normal cells. The tumor purity of a sample is defined
`as the fraction of cells in the sample that are cancerous.
`A read from a tumor sample represents a sequence in
`the cell, or subpopulation of cells, from which the read
`was derived. Thus, lower tumor purity results in a reduc-
`tion in the number of sequence reads derived from the
`cancerous cells, and thus a reduction in the signal that
`can be used to detect somatic mutations (Figure 1b).
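
The effect of purity on the detection signal can be made concrete with a small Python sketch. It assumes a simple two-population mixture (tumor and normal cells), a default of two copies of the locus in both populations, and binomial sampling of reads; the function names and the example depth and read threshold are hypothetical.

```python
from math import comb


def expected_vaf(purity, tumor_copies=2, mutated_copies=1, normal_copies=2):
    """Expected fraction of reads carrying a somatic variant.

    Assumes every tumor cell carries the variant on `mutated_copies`
    of its `tumor_copies` copies of the locus (an illustrative model).
    """
    tumor_dna = purity * tumor_copies
    normal_dna = (1 - purity) * normal_copies
    return purity * mutated_copies / (tumor_dna + normal_dna)


def prob_at_least_k_variant_reads(depth, vaf, k):
    """Binomial probability of observing >= k variant reads at a locus."""
    return sum(comb(depth, i) * vaf**i * (1 - vaf)**(depth - i)
               for i in range(k, depth + 1))


if __name__ == "__main__":
    depth, k = 30, 5  # hypothetical coverage and read-support threshold
    for purity in (1.0, 0.6, 0.2):
        vaf = expected_vaf(purity)
        p = prob_at_least_k_variant_reads(depth, vaf, k)
        print(f"purity={purity:.1f} expected VAF={vaf:.2f} P(>= {k} reads)={p:.3f}")
```

Under these assumptions a heterozygous somatic SNV is expected in half of the reads at 100% purity but in only 30% of the reads at 60% purity, the two purities shown in Figure 1.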
`Tumor purity is an important parameter in the detection
`of somatic mutations. To obtain reasonable sensitivity and
`specificity, methods to predict somatic aberrations must
utilize, either implicitly or explicitly, an estimate of
tumor purity. The VarScan 2 program [24] for calling
`somatic SNVs and indels allows a user to provide an
`estimate of tumor purity in order to calibrate the ex-
`pected number of reads containing a somatic mutation
`at a single locus. Conversely, methods such as MuTect
`[22] and Strelka [23] explicitly model tumor and nor-
`mal allele frequencies using observed data to calibrate
`sensitivity. As a result, MuTect and Strelka may pro-
`vide improved sensitivity for detecting mutations that
`occur in lower frequencies, especially when tumor pur-
`ity is unknown a priori. The performance of these and
`other somatic mutation-calling algorithms depends on
`accurate estimates of tumor purity.
`Standard methods for estimating tumor purity involve
`visual inspection by a pathologist or automated analysis
`
`
Table 1 Methods for detecting somatic mutations

| Objective | Data | Method | Description |
|---|---|---|---|
| Somatic mutation detection | SNV | MuTect [22] | Designed to detect low-frequency mutations in both whole-genome and exome data. |
| | | Strelka [23] | Can be applied to both whole-genome and whole-exome data. Uses stringent post-call filtration. |
| | | VarScan 2 [24] | Demonstrates high sensitivity for detecting SNVs in relatively pure tumor samples from both whole-genome and exome data. |
| | | JointSNVMix [128] | A probabilistic model that describes the observed allelic counts in both tumor and normal samples. |
| | CNA or SV | BIC-Seq [129] | Detects CNAs from whole-genome data. |
| | | APOLLOH [130] | Predicts loss of heterozygosity regions from whole-genome sequencing data. |
| | | CoNIFER [131] | Detects CNAs from exome data. |
| | | BreakDancer [132] | Cluster paired-end alignments to detect SVs. One version to detect large aberrations and another to detect smaller indels. |
| | | VariationHunter-CommonLaw [133], HYDRA [70] | Cluster paired-reads, including reads with multiple possible alignments. Support simultaneous analysis of multiple samples. |
| | | GASV/GASVPro [134,135], PeSV-Fisher [136] | Combine paired-read and read-depth analysis to detect SVs. |
| | | Meerkat [130] | Combines paired-end split-read and multiple alignment information to detect structural aberrations. |
| | | Delly [137], Break-Pointer [138] | Combines paired-end and split-read signals to detect structural aberrations. |
| Tumor purity estimation | SNV | ABSOLUTE [28] | Originally designed for SNP array data, but may be adapted for whole-genome sequencing data. Handles subclonal populations as outliers. |
| | | ASCAT [29] | Designed for SNP array data, but may be adapted for whole-genome sequencing data. Only considers a single tumor population. |
| | CNA | THetA [30] | Able to consider multiple subclonal tumor populations, but only if they differ by large CNAs. Designed for whole-genome sequencing data. |
| | | SomatiCA [31] | Only uses aberrations that are identified as clonal to estimate tumor purity. |

CNA, copy number aberration; SNV, single-nucleotide variant; SV, structural variant.
A representative list of software available for the detection of somatic mutations from high-throughput sequencing data of cancer genomes. Some methods
detect more than one type of mutation but are listed only once for clarity. Blank Objective or Data cells continue the value from the row above.
`
`of cellular images [32]. Recently, several alternative ap-
`proaches have been developed to estimate tumor purity
`directly from sequencing data by identifying shifts in the
`expected number of reads that align to a locus (Table 1).
`This is not an easy task as most cancer genomes are an-
`euploid and thus do not contain two copies of each
`chromosomal locus. The tumor ploidy, defined as the
`total DNA content in a tumor cell, also results in shifts
`in the sequencing coverage. Thus, estimation of tumor
`purity and tumor ploidy are closely intertwined. ABSO-
`LUTE [28] and ASCAT [29] are two algorithms that are
`used to infer both tumor purity and tumor ploidy from
`single-nucleotide polymorphism (SNP) array data. Al-
`though both methods may be modified to work with
`DNA-sequencing data [33], they model a tumor sample
`as consisting of only two populations: normal cells and
`tumor cells. As they do not directly model the possible
`existence of multiple distinct tumor subpopulations, the
`tumor purity estimates that result can be inaccurate,
and reflect either an average over all tumor subpopulations
or a bias for the dominant tumor subpopulation
[30]. Furthermore, accurate identification of tumor subpopulations
may provide important information on tumors
that do not respond well to treatments [34-36].
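
To illustrate why purity and ploidy estimates are intertwined, the Python sketch below computes the expected tumor-to-normal read-depth ratio at a locus under a two-population mixture model. This is a generic mixture relationship assumed here for illustration, not the specific model implemented in ABSOLUTE, ASCAT, or THetA; the function name and parameters are hypothetical.

```python
def expected_depth_ratio(purity, tumor_copy_number, tumor_ploidy,
                         normal_copies=2):
    """Expected read depth at a locus relative to the genome-wide average.

    purity            fraction of cells in the sample that are cancerous
    tumor_copy_number copies of this locus in each tumor cell
    tumor_ploidy      average copies per locus across the tumor genome
    Assumes one tumor population mixed with diploid normal cells.
    """
    locus_dna = purity * tumor_copy_number + (1 - purity) * normal_copies
    genome_dna = purity * tumor_ploidy + (1 - purity) * normal_copies
    return locus_dna / genome_dna


if __name__ == "__main__":
    # Three different scenarios that produce the same observed ratio.
    scenarios = [
        ("pure diploid tumor, one copy lost", 1.0, 1, 2.0),
        ("50% purity, both copies lost", 0.5, 0, 2.0),
        ("pure tetraploid tumor, two copies", 1.0, 2, 4.0),
    ]
    for label, purity, copies, ploidy in scenarios:
        ratio = expected_depth_ratio(purity, copies, ploidy)
        print(f"{label}: ratio={ratio:.2f}")
```

Because all three scenarios yield the same ratio at a single locus, a method has to combine many loci (and, where available, allelic information) to separate purity from ploidy.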
`Recently, the Tumor Heterogeneity Analysis (THetA)
`algorithm [30] was developed to infer the composition
`of a tumor sample (including tumor purity) containing
any number of tumor subpopulations directly from
`DNA-sequencing data. Although THetA overcomes
`some of the limitations of earlier methods, it is unable
`to distinguish distinct tumor subpopulations that do not
`contain CNAs, necessitating the development of add-
`itional approaches to identify tumor subpopulations that
`are distinguished only by SNVs and/or small indels. The
`identification of somatic mutations and the estimation of
`intra-tumor heterogeneity are closely related, and so
`methods that jointly perform these tasks while allowing
`for multiple tumor subpopulations are desirable for
`obtaining highly sensitive and specific estimates of all
`somatic aberrations in tumors.
Advances in DNA-sequencing technologies have also
enabled the direct quantification of intra-tumor
heterogeneity. One approach is to perform targeted,
ultra-deep-coverage sequencing of SNVs, followed by
clustering of the read counts for each SNV into distinct
`subpopulations [37,38]. Ding et al. [37] identified two
`distinct clonal evolution patterns for acute myeloid
`leukemia (AML) patients: a relapse sample evolved ei-
`ther from the founding clone in the primary tumor or
`from a minor subclone that survived initial treatment.
`Shah et al. [38] demonstrated extreme variability in the
`total number of tumor subpopulations (ranging from 1–
`2 to more than 15 subpopulations) in tumors from a
`large cohort of breast cancer patients. Another approach
`to measure intra-tumor heterogeneity is to sequence
`samples from multiple regions within the same tumor.
`Gerlinger et al. [39] sequenced multiple regions from
`several kidney tumors and found that a majority (63-
`69%) of the somatic mutations identified were present in
`only a subset of the sequenced regions of the tumor.
`Navin and colleagues [40,41] found similar heterogeneity
`in the CNAs present within different regions of breast
`tumors. These results demonstrate that a single sample
`from a tumor might not fully represent the complete
`landscape of somatic mutations (including driver muta-
`tions) present in the tumor.
`Finally, Nik-Zainal et al. [42] demonstrated how care-
`ful computational analysis can reveal information about
the composition of a tumor sample, including the identification
of clonal mutations that are present in nearly all
cells of the tumor (and thus presumably are early events
`in tumorigenesis) and subclonal mutations that are
`present in a fraction of tumor cells. Using high-coverage
`(188X) whole-genome DNA sequencing of a breast
`tumor, they inferred the proportion of tumor cells con-
`taining somatic SNVs and CNAs and grouped these pro-
`portions into several clusters, demonstrating different
`mutational events during the evolutionary progression
`from the founder cell of the tumor to the present tumor
`cell population. Eventually, single-cell sequencing tech-
`nologies [41,43-47] promise to provide a comprehensive
`view of intra-tumor heterogeneity, but these approaches
`remain limited by artifacts introduced during whole-
`genome amplification [47]. In the interim, there is an
`immediate need for better methods to detect somatic
`mutations that occur in heterogeneous tumor samples.
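
The idea of inferring the proportion of tumor cells that carry a mutation can be sketched as follows. This is a simplified conversion from variant allele fraction (VAF) to cancer cell fraction under an assumed two-population mixture; it is not the procedure of Nik-Zainal et al. [42], and the function names, default copy numbers, and the 0.9 clonality threshold are hypothetical.

```python
def cancer_cell_fraction(vaf, purity, tumor_copies=2, mutated_copies=1,
                         normal_copies=2):
    """Estimate the fraction of tumor cells carrying a somatic SNV.

    Inverts the expected-VAF relationship for a tumor/normal mixture.
    Copy numbers are illustrative defaults (diploid locus, one mutated copy).
    """
    if purity <= 0:
        raise ValueError("purity must be positive")
    total_dna = purity * tumor_copies + (1 - purity) * normal_copies
    return vaf * total_dna / (purity * mutated_copies)


def label_mutation(ccf, clonal_threshold=0.9):
    """Label a mutation using an assumed, arbitrary clonality threshold."""
    return "clonal" if ccf >= clonal_threshold else "subclonal"


if __name__ == "__main__":
    purity = 0.6  # hypothetical purity estimate
    for vaf in (0.30, 0.15, 0.05):
        ccf = cancer_cell_fraction(vaf, purity)
        print(f"VAF={vaf:.2f} -> CCF={ccf:.2f} ({label_mutation(ccf)})")
```

In practice the estimate is noisy at low coverage, so grouping mutations into clusters, as described above, is more robust than thresholding each mutation separately.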
`
`Computational prioritization of driver mutations
`Following the sequencing of a cancer genome, the next
`step is to identify driver mutations that are responsible
`for the cancer phenotype. Ultimately, the determination
`that a mutation is functional requires experimental val-
`idation, using in vitro or in vivo models to demonstrate
`that a mutation leads to at least one of the characteris-
`tics of the cancer phenotype, such as DNA repair defi-
`ciency, uncontrolled proliferation and growth, or
immune evasion. As a result of advances in DNA-sequencing
technology, the measurement of somatic
`
[Figure 2 flowchart: Identifying driver mutations. Whole-genome or whole-exome sequencing, then somatic mutation identification, then prioritization of mutations (identify recurrent mutations; predict functional impact of mutations; identify recurrent combinations of mutations) to separate driver mutations from passenger mutations, then experimental and functional validation.]
`
`Figure 2 Overview of strategies for cancer-genome sequencing. A cancer-genome sequencing project begins with whole-genome or
`whole-exome sequencing. Various methods are used to detect somatic mutations in the resulting sequence (see Table 1), yielding a long list
`of somatic mutations. Several strategies can then be employed to prioritize these mutations for experimental or functional validation. These
`strategies include: testing for recurrent mutations, predicting functional impact, and assessing combinations of mutations (see Table 2). None of
`these approaches are perfect, and each returns a subset of driver mutations as well as passenger mutations. The mutations returned by these
`approaches can then be validated using a variety of experimental techniques.
`
`
Table 2 Methods for prediction of driver mutations and genes

| Objective | Data | Method | Description |
|---|---|---|---|
| Recurrent somatic mutation identification | SNV | MutSigCV [48] | Uses coverage information and genomic features (e.g. DNA replication time) to estimate the background mutation rate of a gene. |
| | | MuSiC [49] | Uses a per-gene background mutation rate; allows for user-defined regions of interest. |
| | | Youn et al. [51] | Includes predicted impact on protein function in determining recurrent mutations. |
| | | Sjöblom et al. [52] | Defines a cancer mutation prevalence score for each gene. |
| | | DrGaP [139] | Uses Bayesian approach to estimate background mutation rate; helpful for cancer types with low mutation rate. |
| | CNA | GISTIC2 [61], JISTIC [63] | Uses ‘peel-off’ techniques to find smaller recurrent aberrations inside larger aberrations. |
| | | CMDS [62] | Identifies recurrent CNAs from unsegmented data. |
| | | ADMIRE [65] | Multi-scale smoothing of copy number profiles. |
| Functional impact prediction | General | SIFT [72] | Uses conservation of amino acids to predict functional impact of a non-synonymous amino-acid change. |
| | | Polyphen-2 [74] | Infers functional impact of non-synonymous amino-acid changes through alignments of related peptide sequences and a machine-learning-based probabilistic classifier. |
| | | MutationAssessor [75] | Uses protein homologs to calculate a score based on the divergence in conservation caused by an amino-acid change. |
| | | PROVEAN [73] | Benchmarks favorably against MutationAssessor, Polyphen-2 and SIFT. |
| | Cancer-specific | CHASM [77] | Uses a machine-learning approach to classify mutations as drivers or passengers based on sequence conservation, protein domains, and protein structure. |
| | | Oncodrive-FM [79] | Combines scores from SIFT, Polyphen-2, and MutationAssessor into a single ranking. |
| | Positional or structural clustering | NMC [83] | Finds clusters of non-synonymous mutations across patients. Typically used with missense mutations to detect so-called ‘activating’ mutations. |
| | | iPAC [84] | Extends the NMC approach to search for clusters of mutations in three-dimensional space using crystal structures of proteins. |
| Pathway analysis and combinations of mutations | Known pathways | GSEA [92] | A general technique for testing ranked lists of genes for enrichment in known gene sets. Can be used on rankings derived from significance of observed mutations. |
| | | PathScan [95] | Finds pathways with excess of mutations in a gene set (pathway), by combining P-values of enrichment across samples. |
| | | Patient-oriented gene sets [94] | Tests known pathways using a binary indicator for a pathway in each patient. |
| | Interaction networks | NetBox [140] | Finds network modules in a user-provided list of genes. Significance depends only on the topology of the genes in the network, and not on mutation scores. |
| | | HotNet [102] | Finds subnetworks with significantly more aberrations than would be expected by chance, using both network topology and user-defined gene or protein scores. |
| | | MEMo [104] | Finds subnetworks whose interacting pairs of genes have mutually exclusive aberrations [105]; recommends including only recurrent SNVs and CNAs in the analysis. |
| | De novo | Dendrix [102] | Identifies groups of genes with mutually exclusive aberrations. |
| | | Multi-Dendrix [112] | Simultaneously finds multiple groups of genes with mutually exclusive aberrations. |
| | | RME [110] | Finds groups of genes with mutually exclusive aberrations by building from gene pairs; best results obtained when restricting to genes with high mutation frequencies (e.g. > 10%). |

CNA, copy number aberration; SNV, single-nucleotide variant.
A representative list of software available to predict driver mutations or genes by detecting their recurrence across multiple samples, functional impact, or
interactions with other mutations in pathways or combinations. Some methods fall into multiple categories but are listed only once for clarity. Blank Objective or Data cells continue the value from the row above.
`
`mutations is now significantly cheaper and faster than
`the functional characterization of a mutation. Moreover,
`as cancer-genome sequencing moves from the research
`laboratory into the clinic, there is a strong need to auto-
`mate the categorization of mutations to prioritize rapid,
`accurate diagnoses and treatments for patients. Unfortu-
`nately, distinguishing driver from passenger mutations
`solely from the resulting DNA-sequence change is
`extremely complicated, as the effect of most DNA-
`sequence changes is poorly understood, even in the sim-
`plest case of single nucleotide substitutions in coding
`regions of well-studied proteins.
`In the following sections, we describe three ap-
`proaches for computational prioritization of driver mu-
`tations: identifying recurrent mutations; predicting the
`functional impact of individual mutations; and assessing
`combinations of mutations using pathways, interaction
`networks, or statistical correlations. These approaches
`provide alternative strategies to filter the long list of
`measured somatic mutations, and to identify a smaller
`subset enriched for driver mutations to undergo further
`experimental and functional validation (Figure 2).
`
`Statistical tests for recurrent mutations
`One approach to prioritize mutations for further experi-
`mental characterization is to identify recurrent mutations.
`Each cancer sample has undergone an independent evolu-
`tionary process in which acquired driver mutations that
`provide selective advantage result in clonal expansion of
`these lineages [27]. As these mutational processes converge
`to a common oncogenic phenotype, the mutations that
`drive cancer progression should appear more frequently
`than expected by chance across patient samples. Recur-
`rence may be revealed at different levels of resolution, such
`as an individual nucleotide, a codon, a protein domain, a
`whole gene, or even a pathway. In this section, we describe
`the techniques and difficulties in identifying recurrently
`mutated driver genes.
`
`Statistical tests for genes with recurrent single-nucleotide
`mutations
`Several methods have been designed to find recurrent
`mutations in a cohort of cancer patients, including Mut-
`SigCV [48], MuSiC [49], and others [50-53] (Table 2).
`The fundamental calculation in all these approaches is
`to determine whether the observed number of mutations
`in the gene is significantly greater than the number ex-
`pected according to a background mutation rate (BMR).
`The BMR is the probability of observing a passenger
`mutation in a specific location of the genome. From the
`BMR and the number of sequenced nucleotides within a
`gene, a binomial model can be used to derive the prob-
`ability of the observed number of mutations in a gene
`across a cohort of patients (Box 1).
`
Box 1. The binomial model: a statistical test for
detecting recurrent mutations.

Using the background mutation rate (BMR) and the number n of
sequenced nucleotides within a gene g, the probability $P_g$ that a
passenger mutation is observed in g is given by
$P_g = 1 - (1 - \mathrm{BMR})^n$, that is, one minus the probability
that none of the n nucleotides is mutated.
Since somatic mutations arise independently in each sample, the
occurrences of passenger mutations in g are modeled by flipping a
biased coin with probability $P_g$ of heads (mutation). Thus, if somatic
mutations have been measured in m samples, the number of
patients in which gene g is mutated is described by a binomial
random variable $B(m, P_g)$ with parameters m and $P_g$. From $B(m, P_g)$,
it is possible to compute the probability that the observed number
or more samples contain passenger mutations; this is the P-value
of the statistical test. A multiple-hypothesis testing correction is
applied when examining multiple genes.
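
The test in Box 1 can be sketched in a few lines of Python. The sketch assumes that the per-gene probability is one minus the probability that none of the n sequenced nucleotides is mutated, that is 1 - (1 - BMR)^n, and it uses a Bonferroni adjustment only as one simple example of a multiple-hypothesis correction; the function names and example numbers are hypothetical and do not reproduce any published tool.

```python
from math import comb


def gene_mutation_probability(bmr, n):
    """Probability that a gene with n sequenced nucleotides carries at
    least one passenger mutation in a single sample."""
    return 1.0 - (1.0 - bmr) ** n


def recurrence_p_value(k, m, bmr, n):
    """P-value: probability that k or more of m samples contain a
    passenger mutation in the gene, under the binomial model B(m, Pg)."""
    pg = gene_mutation_probability(bmr, n)
    return sum(comb(m, i) * pg**i * (1 - pg)**(m - i)
               for i in range(k, m + 1))


def bonferroni(p_values):
    """Bonferroni adjustment (an assumed choice of correction)."""
    return [min(1.0, p * len(p_values)) for p in p_values]


if __name__ == "__main__":
    # Hypothetical cohort: 200 samples, BMR of 1e-6 per nucleotide.
    genes = {"geneA": (8, 1500), "geneB": (1, 3000)}  # (mutated samples, length)
    raw = [recurrence_p_value(k, 200, 1e-6, n) for k, n in genes.values()]
    for name, p_raw, p_adj in zip(genes, raw, bonferroni(raw)):
        print(f"{name}: P={p_raw:.3g} adjusted={p_adj:.3g}")
```

Summing the upper tail directly is adequate for cohorts of a few hundred samples; much larger cohorts would call for a numerically stable survival function.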
`
`The main differences between methods for identifying
`recurrently mutated genes are in how they estimate the
`BMR and how many different mutational contexts they
`analyze. Regarding the former, the BMR is not constant
`across the genome, but depends on the genomic context
`of a nucleotide [52] and the type of mutation [7]. More-
`over, the BMR of a gene is correlated with both its rate
`of transcription [54] and replication timing [55,56]. The
`BMR is also not constant across patients, and cancer co-
`horts often present hypermutated samples [6]. Finally,
`certain genomic regions may display localized somatic
`hypermutation, termed kataegis [57]. Different combina-
`tions of these effects can cause the BMR to vary by as
`much as an order of magnitude across different genes.
`The estimated BMR greatly affects the identification of
`recurrent mutations, as an estimate that is higher than
`the true value fails to identify recurrent mutations (false
`negatives), whereas an estimate that is lower than the
true value leads to false positives. Of course, if a
`driver gene is mutated in a very high percentage of sam-
`ples (more than 20%, for example), even an inaccurate
`estimate of the BMR is sufficient to correctly identify
`such a gene as recurrently mutated. Thus, well-known
`cancer genes (such as TP53) are readily identified as re-
`currently mutated genes by all computational methods.
`The priority now is to identify rare driver mutations that
`are important for precision oncology. The tools that are
`currently available often report different rare mutations
`as drivers, and more work is needed in order to improve
`the sensitivity in the detection of rare driver mutations
`and to compare and combine the results from different
`tools [58]. In general, reporting rarely mutated genes as
recurrently mutated with high confidence requires better
estimates of the BMR, much larger patient cohorts, or
both.
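
The sensitivity of the result to the BMR estimate can be illustrated by evaluating the same observation under several assumed rates. The Python sketch below uses the binomial model with a per-gene probability of 1 - (1 - BMR)^n; the cohort size, gene length, mutation count, and candidate BMR values are hypothetical.

```python
from math import comb


def recurrence_p_value(k, m, bmr, n):
    """Probability that k or more of m samples carry a passenger mutation
    in a gene of n nucleotides, given a background mutation rate."""
    pg = 1.0 - (1.0 - bmr) ** n
    return sum(comb(m, i) * pg**i * (1 - pg)**(m - i)
               for i in range(k, m + 1))


if __name__ == "__main__":
    # Same observation, three candidate BMR estimates (all hypothetical).
    observed, samples, gene_length = 6, 200, 1500
    for bmr in (1e-6, 3e-6, 1e-5):
        p = recurrence_p_value(observed, samples, bmr, gene_length)
        print(f"BMR={bmr:.0e} P={p:.3g}")
```

The P-value grows as the assumed BMR grows, so an overestimated BMR can hide a truly recurrent gene while an underestimated BMR can make a passenger gene look significant.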
`
`
`Statistical tests for genes with