Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence
`
`Frederick E. Dewey1, Rong Chen2, Sergio P. Cordero3, Kelly E. Ormond4,5, Colleen Caleshu1, Konrad J.
`Karczewski3,4, Michelle Whirl-Carrillo4, Matthew T. Wheeler1, Joel T. Dudley2,3, Jake K. Byrnes4, Omar E.
`Cornejo4, Joshua W. Knowles1, Mark Woon4, Katrin Sangkuhl4, Li Gong4, Caroline F. Thorn4, Joan M.
`Hebert4, Emidio Capriotti4, Sean P. David4, Aleksandra Pavlovic1, Anne West6, Joseph V. Thakuria7,
`Madeleine P. Ball8, Alexander W. Zaranek8, Heidi L. Rehm9, George M. Church8, John S. West10, Carlos D.
`Bustamante4, Michael Snyder4, Russ B. Altman4,11, Teri E. Klein4, Atul J. Butte2, Euan A. Ashley1*
`
`1 Center for Inherited Cardiovascular Disease, Division of Cardiovascular Medicine, Stanford University, Stanford, California, United States of America, 2 Division of Systems
`Medicine, Department of Pediatrics, Stanford University School of Medicine, Stanford, California, United States of America, 3 Biomedical Informatics Graduate Training
`Program, Stanford University School of Medicine, Stanford, California, United States of America, 4 Department of Genetics, Stanford University School of Medicine,
`Stanford, California, United States of America, 5 Center for Biomedical Ethics, Stanford University, Stanford, California, United States of America, 6 Wellesley College,
`Wellesley, Massachusetts, United States of America, 7 Division of Genetics, Massachusetts General Hospital, Boston, Massachusetts, United States of America,
`8 Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America, 9 Department of Pathology, Harvard Medical School, Boston,
`Massachusetts, United States of America, 10 Personalis, Palo Alto, California, United States of America, 11 Department of Bioengineering, Stanford University, Stanford,
`California, United States of America
`
`Abstract
`
`Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation.
`Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of
`genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele
`reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites
to the lowest median resolution demonstrated to date (<1,000 base pairs). We use family inheritance state analysis to
`control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound
`heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to
`disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding
`and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying
`multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific,
`family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk
`assessment using whole-genome sequencing.
`
`Citation: Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, et al. (2011) Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele
`Reference Sequence. PLoS Genet 7(9): e1002280. doi:10.1371/journal.pgen.1002280
`
`Editor: Gregory P. Copenhaver, The University of North Carolina at Chapel Hill, United States of America
`
`Received April 21, 2011; Accepted July 26, 2011; Published September 15, 2011
Copyright: © 2011 Dewey et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
`unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
`
`Funding: FED was supported by NIH/NHLBI training grant T32 HL094274-01A2 and the Stanford University School of Medicine Dean’s Postdoctoral Fellowship.
`MTW was supported by NIH National Research Service Award fellowship F32 HL097462. JKB, OEC, and CDB were supported by NHGRI grant U01HG005715. CFT,
`JMH, KS, LG, MW-C, MW, and RBA were supported by grants from the NIH/NIGMS U01 GM61374. KEO was supported by NIH/NHGRI 5 P50 HG003389-05. AJB was
`supported by the Lucile Packard Foundation for Children’s Health, Hewlett Packard Foundation, and NIH/NIGMS R01 GM079719. JTD and KJK were supported by
NIH/NLM T15 LM007033. EAA was supported by NIH/NHLBI K08 HL083914, NIH New Investigator DP2 Award OD004613, and a grant from the Breetwor Family
`Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
`
`Competing Interests: JVT and AWZ are founders, consultants, and equity holders in Clinical Future; GMC has advisory roles in and research sponsorships from
`several companies involved in genome sequencing technology and personal genomics (see http://arep.med.harvard.edu/gmc/tech.html); MS is on the scientific
`advisory board of DNA Nexus and holds stock in Personalis; RBA has received consultancy fees from Novartis and 23andMe and holds stock in Personalis; AJB is a
`scientific advisory board member and founder for NuMedii and Genstruct, a scientific advisory board member for Johnson and Johnson, has received consultancy
`fees from Lilly, NuMedii, Johnson and Johnson, Genstruct, Tercica, and Prevendia and honoraria from Lilly and Siemens, and holds stock in NuMedii, Genstruct,
`and Personalis. EAA holds stock in Personalis.
`
`* E-mail: euan@stanford.edu
`
PLoS Genetics | www.plosgenetics.org
1
September 2011 | Volume 7 | Issue 9 | e1002280
Personalis EX2122

Author Summary

An individual’s genetic profile plays an important role in determining risk for disease and response to
medical therapy. The development of technologies that facilitate rapid whole-genome sequencing will
provide unprecedented power in the estimation of disease risk. Here we develop methods to characterize
genetic determinants of disease risk and response to medical therapy in a nuclear family of four,
leveraging population genetic profiles from recent large scale sequencing projects. We identify the way
in which genetic information flows through the family to identify sequencing errors and inheritance
patterns of genes contributing to disease risk. In doing so we identify genetic risk factors associated
with an inherited predisposition to blood clot formation and response to blood thinning medications. We
find that this aligns precisely with the most significant disease to occur to date in the family, namely
pulmonary embolism, a blood clot in the lung. These ethnicity-specific, family-based approaches to
interpretation of individual genetic profiles are emblematic of the next generation of genetic risk
assessment using whole-genome sequencing.

Introduction

Whole genome sequencing of related individuals provides a window into human recombination as well as
superior error control and the ability to phase genomes assembled from high throughput short read
sequencing technologies. The interrogation of the entire euchromatic genome, as opposed to the coding
exome, provides superior sensitivity to recombination events, allows for full interrogation of regulatory
regions, and permits comprehensive exploration of disease associated variant loci, of which nearly 90%
fall into non-protein-coding regions [1]. The recent publication of low-coverage sequencing data from
large numbers of unrelated individuals offers a broad catalog of genetic variation in three major
population groups that is complementary to deep sequencing of related individuals [2]. Recently,
investigators used a family-sequencing approach to fine map recombination sites, and combined broad
population genetic variation data with phased family variant data to identify putative compound
heterozygous loci associated with the autosomal recessive Miller syndrome [3]. We previously developed
and applied a methodology for interpretation of genetic and environmental risk in a single subject using
a combination of traditional clinical assessment, whole genome sequencing, and integration of genetic
and environmental risk factors [4]. The combination of these methods and resources and their application
to phased genetic variant data from family based sequencing has the potential to provide unique insight
into topology of genetic variation, haplotype information, and genetic risk.
`One of the challenges to interpretation of massively parallel
`whole genome sequence data is the assembly and variant calling of
`sequence reads against the human reference genome. Although de
`novo assembly of genome sequences from raw sequence reads
`represents an alternative approach, computational limitations and
`the large amount of mapping information encoded in relatively
`invariant genomic regions make this an unattractive option
`presently. The National Center for Biotechnology Information
`(NCBI) human reference genome in current use [5] is derived
`from DNA samples from a small number of anonymous donors
`and therefore represents a small sampling of the broad array of
`human genetic variation. Additionally, this reference sequence
`contains both common and rare disease risk variants, including
`rare susceptibility variants for acute lymphoblastic leukemia and
the Factor V Leiden allele associated with hereditary thrombophilia
[6]. Thus, the use of the haploid NCBI reference for variant
`identification using high throughput sequencing may complicate
`detection of rare disease risk alleles. Furthermore, the detection of
`alternate alleles in high-throughput sequence data may be affected
`by preferential mapping of short reads containing the reference
`base over those containing an alternate base [7]. The effects of
`such biases on genotype accuracy at common variant loci remain
`unclear.
Here we report the development of a novel, ethnically concordant major allele reference sequence and the
evaluation of its use in variant detection and genotyping at disease risk loci. Using this major allele
reference sequence, we provide an assessment of haplotype structure and phased genetic risk in a family
quartet with familial thrombophilia.

Genetic Risk in a Family Quartet
`
`Results
`
`Study subjects and genome sequence generation
`Clinical characteristics of the study subjects and the heuristic
`for the genome sequence generation and analysis are described
in Figure 1. Two first-degree family members, including the
father in the sequenced quartet, have a history of venous
thrombosis; notably, the sequenced father has a history of
recurrent venous thromboembolism despite systemic anticoagulation.
Both parents self-reported northern European ancestry.
`We used the Illumina GAII sequencing platform to sequence
`genomic DNA from peripheral blood monocytes from four
`individuals in this nuclear family, providing 39.3x average
`coverage of 92% of known chromosomal positions in all four
`family members (Figure S1).
`
`Development of ethnicity-specific major allele references
We developed three ethnicity-specific major allele references for European (European ancestry in Utah
(CEU)), African (Yoruba from Ibadan, Nigeria (YRI)), and East Asian (Han Chinese from Beijing and
Japanese from Tokyo (CHB/JPT)) HapMap population groups using estimated allele frequency data at
7,917,426, 10,903,690, and 6,253,467 positions, respectively, cataloged in the 1000 Genomes Project.
Though relatively insensitive for very rare genetic variation, the low coverage pilot sequencing data
(which comprises the majority of population-specific variation data) has a sensitivity for an alternative
allele of >99% at allele frequencies >10% and thus has high sensitivity for detecting the major allele
[2]. Substitution of the ethnicity-specific major allele for the reference base resulted in single base
substitutions at 1,543,755, 1,658,360, and 1,676,213 positions in the CEU, YRI, and CHB/JPT populations,
respectively (Figure 2A). There were 796,548 positions common to all three HapMap population groups at
which the major allele differed from the NCBI reference base. Variation from the NCBI reference genome
was relatively uniform across chromosomal locations with the exception of regions in and near the Human
Leukocyte Antigen (HLA) loci on chromosome 6p21 (Figure 2C). Of variant positions associated with
disease in our manually curated database of 16,400 genotype-disease phenotype associations, 4,339,
4,451, and 4,769 are represented in the NCBI reference sequence by the minor allele in the CEU, YRI, and
CHB/JPT populations, respectively (Figure 2B). There are 1,971 disease-associated variant positions
represented on the NCBI reference sequence by the minor allele in all three population groups
(Figure 2B). Of these manually-curated disease-associated variants, 23 are represented on the NCBI
reference sequence by minor alleles with frequencies of less than 5% in all three population groups, and
18 are represented by minor alleles with frequencies of less than 1% in at least one population group
(Table S1).
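
The substitution step described in this section can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the function name `build_major_allele_reference`, the 0-based position keys, and the toy allele-frequency table are all assumptions made for the example.

```python
def build_major_allele_reference(reference, allele_freqs):
    """Return (new_sequence, substituted_positions).

    reference    -- reference sequence as a string (toy input, hypothetical)
    allele_freqs -- dict mapping 0-based position -> {allele: frequency}
                    for a single population (hypothetical layout)
    """
    seq = list(reference)
    substituted = []
    for pos, freqs in sorted(allele_freqs.items()):
        # The major allele is the allele with the highest estimated frequency.
        major = max(freqs, key=freqs.get)
        # Substitute only where the major allele differs from the reference base.
        if major != seq[pos]:
            seq[pos] = major
            substituted.append(pos)
    return "".join(seq), substituted


# Toy data: made-up frequencies, not values from the 1000 Genomes pilot.
reference = "ACGTACGT"
toy_freqs = {
    1: {"C": 0.30, "T": 0.70},  # reference base C is the minor allele
    4: {"A": 0.95, "G": 0.05},  # reference base A is already the major allele
    6: {"G": 0.49, "A": 0.51},  # reference base G is narrowly the minor allele
}
new_ref, changed = build_major_allele_reference(reference, toy_freqs)
print(new_ref, changed)
```

Running the same function once per population frequency table would give one reference per population, which is the sense in which the three references above are ethnicity-specific.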
To test the alignment performance of the major allele reference sequences, we performed alignments of
one lane of sequence data to chromosome 6, which demonstrated the greatest population-specific divergence
between the major allele in each HapMap population and the NCBI reference, and chromosome 22 in the NCBI
and CEU major allele references (Table S2). These analyses demonstrated that ~0.01% more reads mapped
uniquely to the major allele reference sequence than to the NCBI reference sequence. We identified
sequence variants in the family quartet by comparison with the HG19 reference as well as the CEU major
allele reference we developed, resulting in single nucleotide substitutions at an average distance of
699 base pairs when compared with the NCBI reference and 809 base pairs when compared with the CEU major
allele reference. We also identified 859,870 indels at an average inter-marker distance of 3.6 kbp.

Figure 1. Pedigree and genetic risk prediction workflow. A, Family pedigree with known medical history. The displayed ages represent the
age of death for deceased subjects or the age at the time of medical history collection (9/2010) for living family members. Arrows denote sequenced
family members. Abbreviations: AD, Alzheimer’s disease; CABG, coronary artery bypass graft surgery; CHF, congestive heart failure; CVA,
cerebrovascular accident; DM, diabetes mellitus; DVT, deep venous thrombosis; GERD, gastroesophageal reflux disease; HTN, hypertension; IDDM,
insulin-dependent diabetes mellitus; MI, myocardial infarction; SAB, spontaneous abortion; SCD, sudden cardiac death. B, Workflow for phased
genetic risk evaluation using whole genome sequencing.
doi:10.1371/journal.pgen.1002280.g001
`
`Figure 2. Development of major allele reference sequences. Allele frequencies from the low coverage whole genome sequencing pilot of the
`1000 genomes data were used to estimate the major allele for each of the three main HapMap populations. The major allele was substituted for the
`NCBI reference sequence 37.1 reference base at every position at which the reference base differed from the major allele, resulting in approximately
`1.6 million single nucleotide substitutions in the reference sequence. A, Approximately half of these positions were shared between all three HapMap
`population groups, with the YRI population containing the greatest number of major alleles differing from the NCBI reference sequence. B, Number
`of disease-associated variants represented in the NCBI reference genome by the minor allele in each of the three HapMap populations. C, Number of
`positions per Mbp at which the major allele differed from the reference base by chromosome and HapMap population.
`doi:10.1371/journal.pgen.1002280.g002
`
`A major allele reference sequence reduces genotyping
`error at variant loci associated with disease traits
`Specific to the family quartet, of 16,400 manually-curated
`single nucleotide polymorphisms associated with disease traits,
`10,396 were variant in the family when called against the NCBI
`reference genome, and 9,389 were variant in the family when
`called against the major allele reference genome. The genotyping
`error rate for these disease-associated variants, estimated by the
`Mendelian inheritance error (MIE) rate per variant, was 38%
`higher in the variants called by comparison with the NCBI
`reference genome (5.8 per 10,000 variants) than in variants called
`by comparison with the major allele reference genome (4.2 per
10,000 variants). There were 233 genotype calls at 130 disease-associated variant positions that
differed across the quartet between the NCBI reference genome and the major allele reference genome
(summary for each genome is provided in Table S3). Among these variants, 161/188 genotypes (85.6%) in
the major allele call set were concordant with genotypes from orthogonal genotyping technology, whereas
only 68/188 (36.2%) in the NCBI call set were concordant with independent genotyping.
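
The Mendelian inheritance error (MIE) rate used above as a proxy for genotyping error can be illustrated with a small consistency check. The sketch below assumes biallelic, unphased genotypes stored as 2-tuples; the function names and the toy sites are hypothetical and are not taken from the study's software.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """True if the child's unordered genotype can be formed from one allele of each parent."""
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((f, m))) == child_sorted for f, m in product(father, mother)
    )


def mie_rate(sites):
    """Fraction of sites where at least one child is inconsistent with the parents.

    sites -- list of (father, mother, [child genotypes]); genotypes are 2-tuples.
    """
    errors = sum(
        any(not is_mendelian_consistent(f, m, c) for c in children)
        for f, m, children in sites
    )
    return errors / len(sites)


# Toy sites (made up for illustration).
toy_sites = [
    (("A", "A"), ("A", "G"), [("A", "G"), ("A", "A")]),  # consistent
    (("A", "A"), ("A", "A"), [("A", "G"), ("A", "A")]),  # first child carries an allele absent from both parents
    (("C", "T"), ("C", "T"), [("T", "T"), ("C", "C")]),  # consistent
    (("G", "G"), ("T", "T"), [("G", "T"), ("G", "G")]),  # second child lacks a maternal allele
]
rate = mie_rate(toy_sites)
print(f"toy MIE rate: {rate:.2f}")

# Relative difference between the two rates reported in the text
# (5.8 vs 4.2 errors per 10,000 variants).
relative_increase = (5.8 - 4.2) / 4.2
print(f"relative increase: {relative_increase:.0%}")
```

The last two lines only restate the arithmetic behind the reported 38% figure; they do not recompute anything from the underlying genotype data.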
`
Inheritance state analysis identifies >90% of sequencing
errors
Sequencing family quartets allows for precise identification of meiotic crossover sites from boundaries
between inheritance states and superior error control [3]. We resolved contiguous blocks of single
nucleotide variants into one of four Mendelian inheritance states or two error states. Using this
methodology, we identified 3.1% of variant positions as associated with error prone regions (Figure 3A).
Using a combination of these methods and quality score calibration with orthogonal genotyping
technology, we identified 94% of genotyping errors, with the greatest reduction in error rate resulting
from filtering of variants in error prone regions (Figure 3B). We estimated a final genotype error rate
by three methods of between 5.26 × 10^-7, estimated by the state consistency error rate in
identical-by-descent regions, and 2.1 × 10^-6, estimated by the MIE rate per bp sequenced.
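
To make the idea of an inheritance state concrete, the sketch below enumerates which of the four Mendelian states are compatible with the genotypes observed at one site, then intersects the compatible sets across adjacent sites. It is a deliberately simplified stand-in for the Hidden Markov Model used in the study: the state names follow common quartet-analysis terminology and are an assumption here, the two error states are not modeled, and an empty result simply flags a Mendelian inconsistency.

```python
from itertools import product

STATES = (
    "identical",                # children share both parental chromosomes
    "haploidentical_paternal",  # children share only the paternal chromosome
    "haploidentical_maternal",  # children share only the maternal chromosome
    "nonidentical",             # children share neither
)


def consistent_states(father, mother, child1, child2):
    """Return the inheritance states compatible with unphased genotypes at one site."""
    obs1, obs2 = tuple(sorted(child1)), tuple(sorted(child2))
    states = set()
    # Enumerate which chromosome (index 0 or 1) each child received from each parent.
    for f1, m1, f2, m2 in product((0, 1), repeat=4):
        g1 = tuple(sorted((father[f1], mother[m1])))
        g2 = tuple(sorted((father[f2], mother[m2])))
        if g1 != obs1 or g2 != obs2:
            continue
        same_pat, same_mat = f1 == f2, m1 == m2
        if same_pat and same_mat:
            states.add("identical")
        elif same_pat:
            states.add("haploidentical_paternal")
        elif same_mat:
            states.add("haploidentical_maternal")
        else:
            states.add("nonidentical")
    return states


def block_states(sites):
    """Intersect per-site compatible states across a run of adjacent sites."""
    states = set(STATES)
    for site in sites:
        states &= consistent_states(*site)
    return states


# Toy genotypes in the order (father, mother, child 1, child 2).
site_a = (("A", "G"), ("A", "A"), ("A", "G"), ("A", "A"))
site_b = (("A", "A"), ("C", "T"), ("A", "C"), ("A", "C"))
block = block_states([site_a, site_b])
print(block)
```

A single site is usually compatible with more than one state, which is why contiguous blocks of variants, rather than individual positions, are resolved into states; a change in the compatible state along a chromosome marks a candidate crossover boundary.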
`
`Prior population mutation rate estimates are biased
`upwards by the reference sequence
After excluding variants in sequencing-error prone regions, we identified 4,302,405 positions at which
at least one family member differed from the NCBI reference sequence and 3,733,299 positions at which at
least one family member differed from the CEU major allele reference sequence (Figure S2). With respect
to the NCBI reference sequence, this corresponds to an estimated population mutation rate (Watterson’s
θ [8]) of 9.2 × 10^-4, matching previous estimates [3]. However, in comparison with the CEU major allele
reference, we estimated a lower population mutation rate of 7.8 × 10^-4, suggesting that previous
estimates may have been biased upwards by comparison with the NCBI reference sequence.
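
For readers unfamiliar with the estimator, Watterson's θ divides the number of segregating sites S by the sequence length L and by the harmonic sum a_n = 1 + 1/2 + ... + 1/(n-1), where n is the number of sampled chromosomes. The sketch below implements that standard formula; the inputs are illustrative only, because the text does not state the values of n and L used in the study.

```python
def harmonic_sum(n_chromosomes):
    """a_n = sum over i = 1 .. n-1 of 1/i, the normalizing term in Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def wattersons_theta(segregating_sites, n_chromosomes, sequence_length):
    """Per-base Watterson estimate: S / (a_n * L)."""
    return segregating_sites / (harmonic_sum(n_chromosomes) * sequence_length)


# Illustrative inputs only: n and L here are assumptions, not the study's values.
theta = wattersons_theta(
    segregating_sites=1_000, n_chromosomes=4, sequence_length=1_000_000
)
print(f"theta = {theta:.2e}")

# Ratio of the two estimates reported in the text.
ratio = 7.8e-4 / 9.2e-4
print(f"major allele estimate / NCBI estimate = {ratio:.2f}")
```

Because θ scales linearly with S for fixed n and L, any reference base that is itself a minor allele adds to S and pushes the estimate upwards, which is the bias the section describes.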
`
`Male and female recombinations occur with nearly equal
`frequency in this family and approximately half occur in
`hotspots
Boundaries between contiguous inheritance state blocks defined 55 maternal and 51 paternal recombination
events across the quartet at a median resolution of 963 base pairs. A parallel heuristic analysis of
recombination sites confirmed our observation of nearly equal paternal and maternal recombination
frequency (Figure 3C). Fine scale recombination mapping and long range phasing revealed that the mother
has two haplotypes ([C, T] and [T, T]) at SNPs rs3796619 and rs1670533 that are associated with low
recombination rates in females, while the father has one haplotype ([T, C]) associated with a low
recombination rate in males [9]. The father also has the common [C, T] haplotype, which is associated
with high recombination rates in males when compared with the other two observed haplotypes. We found
that 25 of 51 paternal recombination windows (49%) and 27 of 55 maternal recombination windows (49%,
Figure 3) were in hotspots (defined by maximum recombination rate of >10 cM/Mbp), while only ~4 (4.1%)
would be expected by chance alone (p = 2.0 × 10^-73 for hotspot enrichment according to Monte Carlo
permutation). Both parents carry 13 zinc finger repeats in the PRDM9 gene (Entrez Gene ID 56979) and are
homozygous for the rs2914276-A allele; both of these loci have been previously associated with
recombination hotspot usage [10–13]. We used a combination of per-trio phasing, inheritance state of
adjacent variants, and population linkage disequilibrium data to provide long range phased haplotypes
(Figure 3D).
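
The scale of the hotspot enrichment can be checked with simple arithmetic. The sketch below uses an exact binomial upper tail as a stand-in for the Monte Carlo permutation test reported in the text, so its p-value is not expected to reproduce the published figure; it treats the 4.1% chance expectation as a fixed per-window probability, which is an assumption of this example.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_windows = 51 + 55      # paternal + maternal recombination windows
n_in_hotspots = 25 + 27  # windows that fell in hotspots
background = 0.041       # chance expectation quoted in the text (4.1%)

expected = n_windows * background
observed_fraction = n_in_hotspots / n_windows
p_value = binomial_upper_tail(n_in_hotspots, n_windows, background)

print(f"expected by chance: {expected:.1f} of {n_windows} windows")
print(f"observed: {n_in_hotspots} of {n_windows} ({observed_fraction:.0%})")
print(f"binomial upper-tail p-value: {p_value:.1e}")
```

The expected count of roughly four windows matches the "~4 (4.1%)" stated above, and the observed 52 windows is about twelve times that expectation.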
`
`Rare variant analysis identifies multi-genic risk for familial
`thrombophilia
It has been estimated from population sequencing data that apparently healthy individuals harbor between
50 and 100 putative loss of function variants in genes associated with Mendelian diseases and traits
[2]. Many of these variants are of limited import, either because they result in subtle phenotypes or
have no biological effect. Thus, prioritization of putative loss of function variants remains a
significant challenge. We used a combination of manually-curated rare-variant disease risk association
data, an internally-developed method for scoring risk variants according to potential clinical impact,
and existing prediction algorithms [14,15] (Figure S3 and Table S4) to provide genetic risk predictions
for phased putative loss-of-function variants for the family quartet (Table 1). To further characterize
the potential adverse effects of non-synonymous single nucleotide variants (nsSNVs), we implemented a
multiple sequence alignment (MSA) of 46 mammalian genomes, described further in Text S1, that is similar
to that implemented in the Genomic Evolutionary Rate Profiling score [16,17]. For coding variants of
unknown significance, the mammalian evolutionary rate is proportional to the fraction of selectively
neutral alleles [18] and can therefore serve as a prior expectation in determining the likelihood that
an observed nsSNV is deleterious.
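
One way to read the last sentence is as a simple prior: if the substitution rate at a site relative to the neutral rate approximates the fraction of mutations that are selectively neutral, then one minus that ratio is a rough prior that a new variant at the site is not neutral. The sketch below encodes only that reading; the function name, the capping at 1, and the example rates are assumptions, and the scoring actually used in the study is described in Text S1.

```python
def prior_probability_deleterious(site_rate, neutral_rate):
    """Rough prior that a variant at this site is non-neutral.

    site_rate    -- observed substitution rate at the site across the alignment
    neutral_rate -- rate expected under neutrality (same units)
    Both inputs are hypothetical quantities for this illustration.
    """
    if neutral_rate <= 0:
        raise ValueError("neutral_rate must be positive")
    # Fraction of mutations tolerated at this site, bounded to [0, 1].
    fraction_neutral = min(1.0, max(0.0, site_rate / neutral_rate))
    return 1.0 - fraction_neutral


# Made-up rates: a slowly evolving (conserved) site and a fast-evolving site.
conserved = prior_probability_deleterious(site_rate=0.2, neutral_rate=2.0)
fast = prior_probability_deleterious(site_rate=1.8, neutral_rate=2.0)
print(round(conserved, 2), round(fast, 2))
```

Under this reading, a variant at a fast-evolving site carries a low prior of being deleterious, which is the kind of reasoning that lets a variant be deprioritized on the basis of a high evolutionary rate.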

Figure 3. Inheritance state analysis, error estimation, and phasing. A, A Hidden Markov Model (HMM) was used to infer one of four
Mendelian and two non-Mendelian inheritance states for each allele assortment at variant positions across the quartet. “MIE-rich” refers to
Mendelian-inheritance error (MIE) rich regions. “Compression” refers to genotype errors from heterozygous structural variation in the reference or
study subjects, manifest as a high proportion of uniformly heterozygous positions across the quartet. B, A combination of quality score calibration
using orthogonal genotyping technology and filtering SNVs in error prone regions (MIE-rich and compression regions) identified by the HMM
resulted in >90% reduction in the genotype error rate estimated by the MIE rate. C, Consistent with PRDM9 allelic status, approximately half of all
recombinations in each parent occurred in hotspots. The mother has two haplotypes in the gene RNF212 associated with low recombination rates,
while the father has one haplotype each associated with high and low recombination rates. Notation denotes base at [rs3796619, rs1670533]. D,
Variant phasing using pedigree, inheritance state, and population linkage disequilibrium data. Pedigree data were first used to phase informative
allele assortments in trios (top). The inheritance state of neighboring regions was used to phase positions in which all members of a mother-father-
child trio were heterozygous and the sibling was homozygous for the reference or non-reference allele (middle). For uniformly heterozygous
positions, we phased the non-reference allele using a maximum likelihood model to assign the non-reference allele to paternal or maternal
chromosomes based on population linkage disequilibrium with phased SNVs within 250 kbp (bottom). In all panels a corresponds to the reference
allele and b to the non-reference allele.
doi:10.1371/journal.pgen.1002280.g003

Of 354,074 rare or novel variants compared with the CEU major allele reference sequence, we identified
200 non-synonymous variants in coding regions, 1 nonsense variant, 1 single nucleotide variant (SNV) in
the conserved 3′ splice acceptor dinucleotides, and 27 novel frame-shifting indels in genes associated
with Mendelian diseases or traits. Consistent with our prior observations and a conserved regulatory
role for microRNAs (miRNAs), we found no rare or novel SNVs in mature miRNA sequence regions or miRNA
target regions in 3′ UTRs. There were four compound heterozygous variants in disease-related genes and
three homozygous variants in disease-related genes (Table S6). Five variants across the family quartet
are associated with Mendelian traits (Table 2). One variant in the gene F5 (Entrez Gene ID 2153),
encoding the coagulation factor V, confers activated protein C resistance and increased risk for
thrombophilia [19,20]. Another variant (the thermolabile C677T variant) in the gene MTHFR (Entrez Gene
ID 4524), encoding methylenetetrahydrofolate reductase, predisposes heterozygous carriers to
hyperhomocysteinemia and may have a synergistic effect on risk for recurrent venous thromboembolism
[21,22]. Follow-up serological analysis demonstrated the father’s serum homocysteine concentration was
11.5 μmol/L (Table S11). We were able to exclude a homozygous variant in F2 (Entrez Gene ID 2147), a gene
known to be associated with hereditary thrombophilia, based on its high evolutionary rate in multiple
sequence alignment (Table S5). It is likely that these variants in F5 and MTHFR contribute digenic risk
for thrombophilia passed to the daughter but not son from the father. This is consistent with the
father’s clinical history of two venous thromboemboli, the second of which occurred on systemic
anticoagulation. The daughter has a third variant inherited from her mother, the Marburg I polymorphism,
in the hyaluronan binding protein 2 (HABP2, Entrez Gene ID 3026) gene known to be associated with
inherited thrombophilia, thus contributing to multigenic risk for this trait [23–25]. Thus, our
prediction pipeline recapitulated multigenic risk for the only manifest phenotype, recurrent
thromboembolism, in the family quartet and provided a basis for a rational prescription for preventive
care for the daughter.
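
Phased variant data are what make compound heterozygosity countable: two damaging variants in the same gene matter differently depending on whether they sit on the same or on opposite parental chromosomes. The sketch below shows that check in its simplest form. The record layout, gene names, and variant identifiers are placeholders invented for the example, and no filtering for pathogenicity or allele frequency is attempted.

```python
from collections import defaultdict


def compound_heterozygous_genes(phased_variants):
    """Return genes with at least one variant on each parental haplotype.

    phased_variants -- iterable of (gene, variant_id, haplotype) tuples, where
                       haplotype is "paternal" or "maternal" (hypothetical layout).
    """
    by_gene = defaultdict(lambda: {"paternal": [], "maternal": []})
    for gene, variant_id, haplotype in phased_variants:
        by_gene[gene][haplotype].append(variant_id)
    # Keep only genes hit on both the paternal and the maternal chromosome.
    return {
        gene: haps
        for gene, haps in by_gene.items()
        if haps["paternal"] and haps["maternal"]
    }


# Toy heterozygous variants for one child; all names are placeholders.
child_variants = [
    ("GENE_A", "var1", "paternal"),
    ("GENE_A", "var2", "maternal"),  # second hit on the other chromosome
    ("GENE_B", "var3", "paternal"),
    ("GENE_B", "var4", "paternal"),  # both hits in cis, one intact copy remains
]
hits = compound_heterozygous_genes(child_variants)
print(sorted(hits))
```

Without phase, GENE_A and GENE_B in this toy example would look identical (two heterozygous variants each); only the haplotype assignment separates them.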
Association between synonymous SNVs (sSNVs) and disease has recently been described [26]. sSNVs may
affect gene product function in several ways, including codon usage bias, mRNA decay rates, and splice
site creation and/or disruption (Figure S4). We developed and applied an algorithm (Text S1) for
predicting loss of function effects of 186 rare and novel sSNVs in Mendelian disease associated genes
based on change in mRNA stability, splice site creation and loss, and codon usage bias. We found that
one sSNV in the gene ATP6V0A4 (Entrez Gene ID 50617) was predicted to significantly reduce mRNA
stability, quantified by the change in free energy in comparison with the reference base at that
position (Figure S5). Further secondary structure prediction demonstrated that this SNV likely disrupts
a short region of self-complementarity that forms a stable tetraloop (Figure S5) in the resultant mRNA.
Homozygosity for loss of function (largely protein truncating) variants in this gene is associated with
distal renal tubular acidosis, characterized by metabolic acidosis, potassium imbalance, urinary calcium
insolubility, and disturbances in bone calcium physiology [27].

Table 1. Putative loss of function variants across the family quartet.

                          All variants                    All rare/novel                  Rare/novel and OMIM-disease
                                                                                          associated gene
Variant type              HG19 reference  CEU reference   HG19 reference  CEU reference   HG19 reference  CEU reference
                          (n = 4302405)   (n = 3733299)   (n = 351555)    (n = 354074)
Coding-missense           9468            7982            1276            1276            203             200
Coding-nonsense           52              50              13              13              1               1
Coding-synonymous         11663           9928            1061            1059            186             186
Intronic                  1303341         1128283         116276          115397          19544           19766
Splice-5′                 156             147             16              16              0               0
Splice-3′                 98              96              9               9               1               1
UTR-5′                    40142           37794           3637            3619            510             516
UTR-3′                    61826           59396           5989            5953            848             857
miRNA target              0               0               0               0               0               0
Pri-miRNA                 2               2               1               1               0               0
Mature miRNA              0               0               0               0               0               0
Coding indels             1519            1476            432             412             73              71
Coding frameshift indels  440             418             273             253             29              27

Abbreviations: CEU reference, variant calls against CEU major allele reference; HG19 reference, variant calls against NCBI reference sequence 37.1;
miRNA, micro RNA; Pri-miRNA, primary microRNA transcript; OMIM, Online Mendelian Inheritance In Man database; UTR, un-translated region.
doi:10.1371/journal.pgen.1002280.t001
`
`Common variant risk prediction identifies risk for obesity
`and psoriasis
Results from Genome Wide Association Studies (GWAS) provide a rich data source for assessment of common
disease risk in individuals. To provide a population risk framework for genetic risk predictions for
this family quartet, we first localized ancestral origins based on principal components analysis of
common single nucleotide polymorphism (SNP) data in each parent and the Population Reference Sample
(POPRES) dataset [28] (Figure 4A). This analysis demonstrated North/Northeastern and Western European
ancestral origins for maternal and paternal lineages, respectively.

Table 2. Rare variants with known clinical associations.

Chromosome  Gene    rsid        Affected family members  Disease                          Inheritance          Onset-earliest  Onset-median  Severity  Actionability  Lifetime risk  Variant pathogenicity
12          VWF     rs61750615  M, S, D                  Von Willebrand disease           Incomplete dominant  1               1             5         5              variable       7
10          HABP2   rs7080536   M, S, D                  Carotid stenosis, thrombophilia  AD                   4               4             1         5              variable       7
19          SLC7A9  rs79389353  M, D                     Cystinuria - kidney stones       AR                   1               1             3         5              7              7
1           F5      rs6025      F, D                     Thrombophilia                    Incomplete dominant  4               4             4         5              2              7
1           MTHFR   rs1801133   F, D                     Hyperhomocysteinemia             AR                   1               1             1         6              2              7

Key: Father, mother, son, daughter = F, M, S, D. Abbreviations: AD, autosomal dominant; AR, autosomal recessive. Variants were scored according to
disease phenotype features and variant pathogenicity as outlined in Table S4.
doi:10.1371/journal.pgen.1002280.t002

Figure 4. Ancestry and immunogenotyping using phased variant data. A, Ancestry analysis of maternal and paternal origins based on
principal components analysis of SNP genotypes intersected with the Population Reference Sample dataset. B, The HMM identified a recombination
spanning the HLA–B locus and facilitated resolution of haplotype phase at HLA loci. Contig colors in the lower panel correspond to the inheritance
state as depicted in Figure 3A. C, Common HLA types for family quartet based on phased sequence data.
doi:10.1371/journal.pgen.1002280.g004

HLA groups are associated with several disease traits and are known to modify other genotype-disease
trait associations [29–31]. We used long-range phased haplotypes and an iterative search (described in
full in Text S1) for the nearest HLA tag haplotype [32] to provide HLA types for each individual prior
to downstream risk prediction (Figure 4B and 4C). We then calculated composite likelihood ratios (LR)
for 28 common diseases for 174 ethnically-concordant HapMap CEU individuals, and provided percentile
scores for each study subject’s composite LR for each disease studied (Figure 5A). All four family
members were at high risk for psoriasis, with the mother and daughter at highest risk (98th and 79th
percentiles, respectively). We also found that both parents were predisposed to obesity, while both
children had low genetic risk for obesity. Discordant risks for common disease between parents and at
least one child were also seen for esophagitis and Alzheimer’s disease. Phased variant data were
`further used to provide estimates of parental contribution to
`disease risk in each child according to parental risk haploty