`From:
`To: Scott Kirk
`Cc: Richard Chen; Mark Pratt
`Subject: Accuracy documents to convert into project plan format
`Date: Tuesday, May 29, 2012 12:58:39 PM
`Attachments:
`Accuracy framework JW 5Jan2012.ppt
`Accuracy JW 5Mar2012.ppt
`Accuracy Gantt JW 15Mar2012.ppt
`Accuracy slides for mtg 13Apr2012.ppt
`Accuracy slides 2May2012 JW.pptx
`Methods to determine SNP error rate 3May2012 JW.pptx
`Accuracy JW 16May2012.ppt
`Accuracy differentiation JW 22May2012.xls
`Accuracy Gantt Spreadsheet 21April2012 JW.xls
`
`Hi Scott,
`
`I am attaching a sequence of documents related to accuracy planning which have been
`developed starting in early January. Since they are sequential in time, there is overlap of
`content, but perhaps walking through them will give you an idea of how the ideas developed
`and where we are now. These should complement the material you have from Mark. I would
`be happy to discuss these with you, Mark & Rich as you see fit, so this can become more
`integrated with the other project plans. Please let me know.
`
`Thanks,
`
`John
`
`Personalis EX2056
`
`
`
`ACCURACY DIFFERENTIATION
`Draft for discussion, JW, 22 May, 2012
`
`ORGANIZATION OF THIS FILE
`Company differentiation around accuracy
`Better sequencing in the laboratory
`Better variant detection & reporting
`More accurate databases, for interpretation
`
`COMPANY DIFFERENTIATION AROUND ACCURACY
`Focus on accuracy relevant to medical interpretation
`Better understanding of the issues than anyone
`Best track record of publications on the issue
`Unbiased from a platform standpoint & able to combine platforms
`More comprehensive data on accuracy than anyone
`World's only collection of genomes sequenced on both ILMN and CG platforms, plus arrays and karyotyping
`Largest family pedigree sequenced to high coverage
`Only genome sequenced on Sanger, ILMN and CGI
`Databases more accurate than those publicly available
`Able to provide a detailed quantitative view of mechanisms underlying errors
`Deep understanding of accuracy issues used to create better results
`Not just flagging errors, or filtering out those loci, but fixing the problems
`More insightful approaches deliver accuracy affordably
`
`BETTER SEQUENCING IN THE LABORATORY (WHEN MANAGED BY PERSONALIS)
`
`Focus on getting the whole medically-interpretable genome, accurately, even if more expensive
`Use insight into error types & medical content to keep this affordable
`
`Combine data from multiple different runs of a single platform
`Combine paired-end libraries made with multiple insert lengths
`Use longer read lengths (e.g. 2 x 250 bases, when available later in 2012)
`More expensive because only MiSeq, but clearly better
`Combine with bulk shorter-read data from HiSeq
`Substantially more efficient at split-read & junction-sequence SV detection
`Key to single-base breakpoint determination
`
`Combine data from multiple platforms
`More experience with Illumina / Complete Genomics than anyone
`May add Ion Torrent, Oxford Nanopore
`Guided by deep understanding of differential error mechanisms in each platform
`Not tied to any one platform
`Use whatever it takes to get the best possible combination
`
`Combine data from outside next-gen sequencing
`
`Several major areas of medical genetics are not well assayed by next gen sequencing
`Example 1 : Diseases caused by STR-expansion (e.g. Huntington's)
`Example 2 : Robertsonian translocations
`
`Personalis will combine NGS data with Non-NGS technologies to create a complete assessment
`Add karyotyping
`Add electrophoresis where appropriate (TBD)
`Others
`
`Orthogonal technologies also provide validation of NGS results
`Integrate NGS with array (fluorescence for SV's, in addition to genotypes)
`Sanger and/or (PCR + electrophoresis) as follow-up to SNP's / SV's of specific genomes (option TBD)
`
`Ability to create semi-custom products focused on medically interpretable parts of the genome
`Leverages Personalis advantages in content
`Custom hybridization array
`Custom pullout set
`Other assays to fix specific error types
`
`Question : Should there be a "Personalis exome" option ?
`More comprehensive / accurate at exome price level ?
`
`What is proprietary about this approach ?
`
`Personalis can focus its efforts based on the world's best content (re medical interpretation)
`Personalis will develop proprietary understanding about how best to combine multiple technologies
`Personalis does not face the competitive & anti-trust barriers that platform companies do
`Personalis' people combine deep experience with both platforms and interpretation & can leverage the two against each other
`Personalis can combine work in the lab and in bioinformatics, in a way that pure informatics companies can't
`
`BETTER VARIANT DETECTION & REPORTING
`
`Fewer false positive SNP's due to method of generating laboratory data
`Combination of paired-end insert lengths covers more of the genome uniquely
`
`Fewer false negative SNP's due to method of generating laboratory data
`More uniform coverage by combining library prep methods & platforms
`
`Orthogonal validation of millions of SNP genotypes by array
`Integrated with next gen sequencing data, not just another separate report
`
`Better alignment, due to better reference sequence
`SNP major allele ref by ethnicity (in our first product)
`InDel major allele ref by ethnicity (later)
`Other advances as R&D develops them:
`Iterative alignment
`
`# TBD changes in SNP alleles called with Personalis reference vs public standard
`Likely more improvement in non-European ethnicities
`Include (eventually) changes in InDels called as well
`
`We provide the only support available specifically for admixed genomes
`Major-allele non-ethnic reference, or even more advanced options
`
`Focused effort to align in the presence of SNP / InDel clusters, MNP's
`May leverage Hugo's BreakSeq approach
`May need time in the development plan
`
`Better SNP reporting due to better reference
`
`We report variants when sequence is a homozygous match to the public ref but that's the minor allele
`Entirely missed by systems which use the public reference
`Example : Factor V Leiden
`Rong had a whole paper on all the disease variants in the public ref
`> 1M loci where we can be different
`We should calculate the average # actual loci / genome, by ethnicity
`
`At het loci, we report both alleles, but we report the minor allele as the variant
`Not the allele which is different from the public reference
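The two reporting rules above can be contrasted in a short sketch. It is only an illustration under stated assumptions: the function names, the two-allele genotype tuples and the allele frequencies in the usage note are hypothetical, not Personalis code or data.

```python
def variants_vs_public_reference(genotype, reference_allele):
    """Conventional report: alleles that differ from the public reference base."""
    return sorted({allele for allele in genotype if allele != reference_allele})


def variants_vs_major_allele(genotype, allele_frequencies):
    """Major-allele report: alleles that differ from the most common allele
    in the population matching the sample (frequencies are an assumed input)."""
    major_allele = max(allele_frequencies, key=allele_frequencies.get)
    return sorted({allele for allele in genotype if allele != major_allele})


# Hypothetical locus where the public reference base "A" is the minor allele.
frequencies = {"A": 0.03, "G": 0.97}
homozygous_ref = ("A", "A")
heterozygous = ("A", "G")
```

With these made-up inputs, the conventional report is empty for the homozygous-reference sample and lists "G" for the het sample, while the major-allele report lists "A" in both cases, which is the behaviour the bullets describe.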
`
`Better detection of SV's
`
`Better lab data for SV detection:
`Longer reads (better for approaches based on split-read & junction-sequences)
`MiSeq 2x250 or other platform
`Multiple insert lengths
`Electrophoretic assay of STR-expansions
`Karyotyping for Robertsonian translocations
`
`Orthogonal technologies for validation of SV's:
`Fluorescence intensity data from hybridization arrays
`
`We combine the results from five different algorithmic approaches
`
`We test our SV algorithms by Mendelian Inheritance in high coverage whole genome family data sets
`one which was sequenced with ten different paired-end libraries spanning 200 - 40,000 bases
`and validate them using fluorescent intensities from high density hybridization arrays
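One simple way to combine SV calls from several algorithms is to count how many callers support a candidate interval by reciprocal overlap. The sketch below assumes that approach; the slides do not say how the five results are actually combined, and the caller names, coordinates and 50% threshold are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between intervals a and b.

    Intervals are (start, end) tuples on the same chromosome, with end > start.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def supporting_callers(candidate, calls_by_caller, min_overlap=0.5):
    """Names of callers with at least one call matching the candidate."""
    return sorted(
        caller
        for caller, calls in calls_by_caller.items()
        if any(reciprocal_overlap(candidate, call) >= min_overlap for call in calls)
    )


# Hypothetical deletion calls from three approaches.
calls = {
    "paired_end": [(1050, 2050)],
    "read_depth": [(1500, 3500)],
    "split_read": [(990, 2005)],
}
support = supporting_callers((1000, 2000), calls)
```

A candidate supported by several independent approaches could then be reported with higher confidence than one seen by a single caller.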
`
`We don't treat all SV's as novel - we have the world's best database of known SV's and their junction sequences
`Detection is better when you know exactly what you are looking for
`We should have a meeting to discuss how we can (easily ?) build this
`Start with 1,000 Genomes result Hugo has helped create
`Large data set but low coverage may make detection less certain in low MAF SV's
`Others will be able to access this eventually, potentially catching up, or claiming to
`Augment this with (more confident ?) SV's from:
`Full coverage (30-40x) genomes (West, Altman, 40 Koreans, others we can download)
`High coverage (>60x) genomes (Snyder, CEPH1463, Venter, others ?)
`
`Better reporting of SV's
`We determine the zygosity of deletions and report it
`Deletions integrated with SNP report, e.g. "A-" vs "AA" inside a het deletion
`We report SV's with their allele frequencies in the ethnicity matching the sample
`
`Flagging of potential errors
`Many subtle error types not recognized by others
`Error mechanisms underlying differences when the same person is sequenced twice (it's not just Poisson !)
`Error loci determined from deep & multi-platform sequencing of large families
`Error loci determined by extensive platform comparison, both NGS/NGS and NGS/Non-NGS
`Detailed understanding of compressions, and large unpublished catalog of them
`
`MORE ACCURATE DATABASES, FOR INTERPRETATION
`
`Cleaner databases:
`Well financed, systematic manual curation to industrial QC standard
`Standardized medical language hierarchy
`Extensive cross checking of databases developed independently
`VariMed vs HGMD
`MendelDB vs OMIM
`Personalis PharmGKB vs public PGKB (need to be careful in this positioning)
`
`Databases others will not have:
`Regulome
`BreakSeq (esp if augmented with private Personalis data)
`Compression list (described in publications but not released)
`Variant data derived from a broad collection of genomes
`Multiple public data sets, some processed in proprietary ways by Personalis
`Access to private data sets, sequenced by others
`Access to private data sets, sequenced by Personalis
`
`Framework for assessment &
`improvement of accuracy in whole
`genome medical interpretation
`
`Draft Jan 5, 2012
`John West
`
`Research Market Products
`
`• Comprehensive medically-focused interpretation of whole human genome
`sequences
`– Huge # of potential results
`– Even a low % error can swamp customer & waste time trying to track
`down invalid hypotheses
`– If research is in a clinical setting, particularly with return of results to
`patients, IRB’s may want “FDA-like” quality systems in place
`• Standard quality approaches
`• Standard quality terminology & reporting
`– Accuracy / quality to “near-FDA” standards can be a sales advantage in
`the research market
`• Maximal-accuracy analysis of a genome. Personalis :
`– Takes responsibility for the sample
`– Specifies the combination of next-gen sequencing performed
`– Augments NGS with other technologies to determine key variants
`– Conducts genotyping & assesses concordance
`– Conducts follow-up genetic & non-genetic testing to validate results
`
`Eventual clinical product
`
`• Also focused on comprehensive medically-focused interpretation of whole
`human genomes
`– FDA anticipates “Class 3” (highest) level of regulatory oversight
`– Need to undo bad reputation created by DTC genotyping companies
`– Pro-active early leadership in quality systems for whole genome may
`create a positive reputation for Personalis at FDA
`
`CDC Process – Established process & terminology
`(Augment with FDA processes, likely to be similar)
`
`[Figure: CDC ACCE model wheel. Legible labels: Setting; Analytic Sensitivity and Specificity; Robustness; Clinical Sensitivity and Specificity; Penetrance; Evaluation; Monitoring; Ethical, Legal & Social Implications (safeguards & impediments).]
`
`Supplementary info: Medical record, Family history, Family tree
`
`Study participant(s)
`Samples
`Laboratory testing
`Raw data, QC report
`Alignment & variant detection
`Alleles at known loci & novel variants
`Variant interpretation
`
`Types of testing to run: Mix of NGS & non-NGS
`
`List of medically relevant allele types & coordinates
`
`Focus on Analytic Validity (Primary topic of this slide set)
`
`Draft genetic report
`Technical validation (e.g. Sanger sequencing)
`Validated genetic report
`Follow-on testing (non-genetic)
`Draft medical report
`Physician / researcher / counselor
`Optional return of final medical results
`Study participant
`
`Focus on Clinical Validity & Utility (Discuss in a future slide set)
`
`Elements of Analytic Validity in CDC’s ACCE Model
`See Appendix for additional detail
`
`• Analytic Sensitivity : How often positive when mutation is present ?
`– One minus the false negative rate ?
`• Analytic Specificity : How often negative when mutation is not present ?
`– One minus the false positive rate ?
`• Assay Robustness : How often does the test fail to give a usable result ?
`• Repeatability :
`– On the same sample
`– Within & between labs
`– Between sample types & other process variables
`• Confirmatory testing to resolve false positives ?
`– Genetic validation tests
`– Non-genetic tests (molecular, imaging, physiological)
`• Quality Control: Internal QC program defined & externally monitored ?
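Under the standard definitions, the two questions above resolve to simple ratios of counts against a truth set. The sketch below shows that arithmetic; the function name and the counts in the usage note are hypothetical.

```python
def analytic_validity(true_pos, false_pos, true_neg, false_neg):
    """Analytic sensitivity and specificity from counts against a truth set."""
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one truly positive and one truly negative locus")
    false_negative_rate = false_neg / (true_pos + false_neg)
    false_positive_rate = false_pos / (true_neg + false_pos)
    return {
        "sensitivity": 1.0 - false_negative_rate,  # positive when mutation present
        "specificity": 1.0 - false_positive_rate,  # negative when mutation absent
        "false_negative_rate": false_negative_rate,
        "false_positive_rate": false_positive_rate,
    }


# Hypothetical counts: 90 of 100 true variants found, 5 false calls at 100 non-variant loci.
result = analytic_validity(true_pos=90, false_pos=5, true_neg=95, false_neg=10)
```

Assay robustness (the fraction of loci or samples with no usable result) is tracked separately, since non-calls are neither false positives nor false negatives.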
`
`Crucial to develop a model which links analytic
`validity results to monitor-able process variables
`
`• Allele determination performance is known to vary widely between loci
`• With millions of known loci, many quite rare, it is impractical to determine the
`assay performance for each as an independent system.
`• Novel loci, inherently, cannot be validated in advance
`• Proposed approach:
`– Create a quantitative process model of allele determination
`– Incorporate known process variables (sample type, storage & preparation,
`coverage, read length, repeat structure, algorithm parameters, etc)
`– Augment with models of process variability & failure mechanisms
`– Instrument the model with QC observables, related to normal & failure modes
`– Compare model predictions with real data & characterize differences
`– Iterate until model can confidently predict analytic validity metrics for both
`known & novel alleles, including in combination
`– System performance metric should prioritize medically interpretable alleles
`– Use insights from experimentally assessed error model to :
`• Evaluate / prioritize process improvements
`• Select disease-associated alleles which can be assayed with NGS+
`
`Genetic variant types to include in model
`
`• SNP’s
`• InDel’s
`• Larger deletions
`• STR / VNTR’s
`• Structural variants
`• Copy Number Variation
`• Trisomies
`
`• Combinations:
`– Any of these can be heterozygous
`– Compound heterozygosity
`– Small variants within larger het deletions or other CNV’s
`– Clusters of closely spaced variants, which might interfere with each other’s detection
`
`Examples of known error mechanisms
`Many more likely to follow
`
`• Locally low / no coverage due to:
`– GC bias of biochemistry
`– Alignment degeneracy (repeats)
`– Clusters of variant alleles exceed alignment mismatch allowance
`– Alignment poor due to problems with reference sequence
`• Reads align, but incorrectly
`– InDels missed, or inconsistently placed in homopolymers
`– Misplacements lead to :
`• Phantom het SNP’s
`• Apparent tri-allelic loci
`• Apparent triple haplotype structure
`• Raw read errors, random or systematic
`• Allelic imbalance of reads due to:
`– Random extremes of expected binomial distribution, esp where low
`coverage
`– Allelic biases - biochemical or bioinformatic (e.g. clusters of SNP’s in LD)
`• Read length too short to quantify length of VNTR, or VNTR embedded in other
`repeat structures
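The "random extremes of expected binomial distribution" mechanism can be quantified directly. The sketch below assumes each read at a het locus independently carries the alternate allele with probability 0.5 (or a biased fraction), and that a caller needs a minimum number of alternate reads; both the function name and that calling rule are illustrative assumptions.

```python
from math import comb


def prob_too_few_alt_reads(depth, min_alt_reads, alt_fraction=0.5):
    """P(fewer than min_alt_reads alternate-allele reads) at a het locus.

    Assumes reads are independent draws, each showing the alternate allele
    with probability alt_fraction (0.5 when there is no allelic bias).
    """
    return sum(
        comb(depth, k) * alt_fraction**k * (1.0 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )


# At 10x coverage, requiring 2 alternate reads misses about 1.1% of het loci.
miss_at_10x = prob_too_few_alt_reads(depth=10, min_alt_reads=2)
```

Lowering `alt_fraction` below 0.5 models a biochemical or bioinformatic allelic bias, which raises the miss probability at the same depth.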
`
`Examples of process variability
` Many more likely to follow
`
`• Sample source (blood, saliva, tissue, cell culture)
`• Sample preservation, culturing & prep for sequencing
`• DNA sequencing platform, lab & operator(s)
`• Coverage
`• Read length
`• Paired-end insert length (a distribution)
`• Cluster creation & Sequencing chemistry versions
`• Cluster density
`• Variation in data processing algorithms, input parameters, reference sequence
`• DNA contamination or incomplete filtering of primer-related artifacts
`• Availability of supplementary data from the same sample (e.g. genotyping)
`• Inconsistent supplementary information from / about the individual sequenced
`
`QC Readouts, to monitor the process
`(Particularly given that it may be impractical to standardize all
`process variables completely, over multiple years)
`
`• Paired-end insert length distribution vs expected
`• Coverage non-uniformity vs % GC content; relative to a standard
`• Coverage histogram in regions selected to avoid repeats & %GC variation
`• Total coverage ratios of chromosomes
`• Raw read Q-score %-iles vs position along a read; relative to a standard
`• Allelic distribution of reads vs expected binomial, at a “diagnostic” set of het
`SNP loci
`• Sex, ethnicity, admixture and family relationships between samples vs
`expected from supplementary data provided
`• Raw read error rates (monitored at homozygous loci), systematic / random
`• Mendelian inheritance errors (family genome data sets)
`• Clusters of variants which are close / dense enough to predict higher
`probability of errors
`• Two alleles / haplotypes in single sex chromosomes, vs a standard
`• Three alleles / haplotypes in diploid chromosomes, vs a standard
`• Amount and location of autozygosity, vs a standard
`• Screening for non-human DNA
`• Concordance with genotyping data, vs a standard
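As one example of such a readout, the chromosome coverage ratio can be computed from per-chromosome mean depth and compared with what the supplied sex and karyotype imply. This is a sketch under assumptions: the `chr1`..`chr22` naming, the use of the unweighted autosome mean as baseline, and the 0.15 tolerance are all illustrative choices.

```python
AUTOSOMES = [f"chr{i}" for i in range(1, 23)]


def chromosome_coverage_ratios(mean_depth):
    """Ratio of each chromosome's mean depth to the autosome average."""
    missing = [c for c in AUTOSOMES if c not in mean_depth]
    if missing:
        raise ValueError(f"missing autosomes: {missing}")
    baseline = sum(mean_depth[c] for c in AUTOSOMES) / len(AUTOSOMES)
    return {chrom: depth / baseline for chrom, depth in mean_depth.items()}


def flag_unexpected(ratios, expected, tolerance=0.15):
    """Chromosomes whose ratio differs from the expected value (default 1.0)."""
    return sorted(
        chrom
        for chrom, ratio in ratios.items()
        if abs(ratio - expected.get(chrom, 1.0)) > tolerance
    )


# Hypothetical 30x genome with single-copy X and Y.
depths = {chrom: 30.0 for chrom in AUTOSOMES}
depths.update({"chrX": 15.0, "chrY": 15.0})
ratios = chromosome_coverage_ratios(depths)
```

Here `flag_unexpected(ratios, {"chrX": 0.5, "chrY": 0.5})` returns nothing, while comparing the same data against a two-X expectation flags both sex chromosomes, which is the kind of mismatch with supplementary data the list describes.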
`
`Categories of experimental data to assess
`error types, mechanisms, frequencies
`
`• Repeatability :
`– Same person twice (or more ?), trying to keep everything else constant
`– Alternatives: Genomes of identical twins, or children where identical
`• Controlled process experiments :
`– Same person sequenced multiple times, each time varying just one
`parameter of the process (many of these can be done computationally)
`• Concordance of other technology platforms on the same sample
`• Compare process QC metrics from many sources of raw genome data, even
`low coverage (e.g. 1,000 Genomes)
`• Mendelian inheritance errors (MIE’s) determined from family genome sets
`• Absolute accuracy :
`– Genomes of individuals known to have clear Mendelian diseases
`– Sequencing of complex synthesized oligo sets (synthetic genome)
`– Targeted modifications of DNA from a naturally occurring genome
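Mendelian inheritance errors in a trio can be counted with a very small check. The sketch below assumes diploid autosomal loci with genotypes given as two-allele tuples; the function names and example genotypes are hypothetical, and real MIE analysis would also need to handle non-calls, sex chromosomes and deletions.

```python
def is_mendelian_consistent(child, mother, father):
    """True if one child allele can come from each parent (autosomal, diploid)."""
    first, second = child
    return (first in mother and second in father) or (
        first in father and second in mother
    )


def count_mendelian_errors(trio_genotypes):
    """Count loci where the child genotype is incompatible with the parents.

    trio_genotypes is an iterable of (child, mother, father) genotype tuples.
    """
    return sum(
        1
        for child, mother, father in trio_genotypes
        if not is_mendelian_consistent(child, mother, father)
    )


# Hypothetical loci: the second one cannot be explained by inheritance alone.
loci = [
    (("A", "G"), ("A", "A"), ("G", "G")),
    (("G", "G"), ("A", "A"), ("A", "G")),
    (("A", "A"), ("A", "G"), ("A", "G")),
]
errors = count_mendelian_errors(loci)
```

An MIE flags that at least one of the three genotypes is wrong (or that a de novo event occurred), but not which one, so clusters of MIE's are most useful as pointers to problematic regions.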
`
`Lists of problematic regions in the genome
`Capture all in one place, unify, & link to underlying error mechanisms of the model
`
`• Mappability / alignability assessments (some on Santa Cruz browser site)
`• Genes with known CNV’s
`• Genes with known pseudogenes
`• List of problematic genes for NGS (e.g. VWF)
`• Evan Eichler’s list of problematic regions
`• 1,000 Genomes list of regions with top 0.1% of coverage
`• List of compressions
`• MIE’s & MIE-cluster regions from family data
`• Difficult regions from phasing of a family
`• Regions of high raw read error rates (monitor at homozygous loci)
`• Loci apparently tri-allelic in the genome of a healthy individual, or 3 haplotypes
`• Regions of allelic bias
`• Regions of no or low aligned coverage, determined experimentally
`• Segmental duplications & other repeat structures
`
`List of Variants to Optimize Against
`
`• Medically interpretable variants (not just SNP’s !):
`– Mendelian “Top 200”
`• Specific loci / variant types, where known
`• Genes in which to identify novel variants
`– PharmGKB
`• Curated & novel variants
`– Alleles in top Risk-O-Grams
`• Diseases
`• Molecular or physiologic medically relevant phenotypes
`– Blood & tissue typing
`• Functional elements (broad interest for discovery research)
`– RefSeq genes
`– Linc-RNA
`– Regulome elements
`– eQTL loci
`
`Proposed Initial Priorities
`
`• Literature review related to genome sequencing accuracy, followed by
`discussion with SAB
`• Develop an initial list of top priority medically interpretable variants to optimize
`against
`• Create / collect lists of problematic regions of the genome
`• Screen target medical variant list against problematic regions to get 1st cut
`scale & composition of the accuracy problem
`• Develop initial quantitative model of allele determination, including known error
`mechanisms. Predict & characterize accuracy & repeatability levels
`• Obtain existing experimental data, reprocess with current version algorithms,
`and conduct 1st cut repeatability & accuracy assessment. Compare with
`model on a genome-wide basis and at the medically interpretable loci.
`• Launch program to obtain long-lead-time experimental data, including
`prototype of “maximally accurate genome” product (based on NGS augmented
`by other technologies).
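The screening step in the priorities above (target medical variants against problematic regions) amounts to an interval lookup. The following is a minimal sketch assuming half-open coordinates and a plain linear scan; every variant name, coordinate and region label in it is made up for illustration.

```python
def screen_variants(variants, problem_regions):
    """Map each variant name to the labels of problematic regions containing it.

    variants: iterable of (name, chrom, pos)
    problem_regions: iterable of (chrom, start, end, label), half-open [start, end)
    Variants outside every region are omitted from the result.
    """
    regions = list(problem_regions)
    flagged = {}
    for name, chrom, pos in variants:
        labels = [
            label
            for region_chrom, start, end, label in regions
            if region_chrom == chrom and start <= pos < end
        ]
        if labels:
            flagged[name] = labels
    return flagged


# Hypothetical inputs, not real coordinates.
regions = [
    ("chr1", 1000, 2000, "segmental duplication"),
    ("chr1", 1500, 1600, "compression"),
    ("chr2", 500, 900, "low coverage"),
]
variants = [("var_a", "chr1", 1550), ("var_b", "chr1", 5000), ("var_c", "chr2", 899)]
flagged = screen_variants(variants, regions)
```

The fraction of target variants that land in `flagged`, broken down by label, would give the first-cut scale and composition of the accuracy problem. For genome-scale lists an interval tree or sorted sweep would replace the linear scan.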
`
`Appendix
`US Centers for Disease Control (CDC)
`ACCE Model Process for Evaluating Genetics Tests
`
`Potentially useful framework & terminology for moving towards FDA
`approval of genome-based diagnostics
`
`Centers for Disease Control and Prevention
`CDC 24/7: Saving Lives. Protecting People. Saving Money through Prevention.
`
`http://www.cdc.gov/genomics/gtesting/ACCE/
`
`Genomic Testing
`ACCE Model Process for Evaluating Genetic Tests
`
`From 2000 to 2004, CDC's Office of Public Health Genomics (OPHG)
`established and supported the ACCE Model Project, which developed
`the first publicly-available analytical process for evaluating scientific
`data on emerging genetic tests. The ACCE framework has guided or
`been adopted by various entities in the United States and worldwide
`for evaluating genetic tests; the CDC-supported EGAPP™ initiative
`builds on the ACCE model structure and experience.
`
`Introduction to ACCE
`ACCE, which takes its name from the four main criteria for
`evaluating a genetic test (analytic validity, clinical validity, clinical
`utility and associated ethical, legal and social implications), is a
`model process that includes collecting, evaluating, interpreting, and
`reporting data about DNA (and related) testing for disorders with a
`genetic component in a format that allows policy makers to have access to up-to-date and reliable
`information for decision making. The ACCE model process is composed of a standard set of 44 targeted
`questions (3) that address disorder, testing, and clinical scenarios, as well as analytic and clinical validity,
`clinical utility, and associated ethical, legal, and social issues.
`
`An important by-product of the ACCE model process is the identification of gaps in knowledge that will
`help to define future research agendas. The ACCE approach builds on a methodology originally described
`by Wald and Cuckle (1) and on terminology introduced by the Secretary's Advisory Committee on
`Genetic Testing (2).
`
`
`Genomic Testing
`ACCE Model List of 44 Targeted Questions Aimed at a Comprehensive Review of
`Genetic Testing
`
`Columns: Element / Component / Specific Question
`
`Element: Disorder/Setting
`1. What is the specific clinical disorder to be studied?
`2. What are the clinical findings defining this disorder?
`3. What is the clinical setting in which the test is to be performed?
`4. What DNA test(s) are associated with this disorder?
`5. Are preliminary screening questions employed?
`6. Is it a stand-alone test or is it one of a series of tests?
`7. If it is part of a series of screening tests, are all tests performed in all instances (parallel) or are only some tests performed on the basis of other results (series)?
`
`
`Element: Analytic Validity
`8. Is the test qualitative or quantitative?
`9. (Sensitivity) How often is the test positive when a mutation is present?
`10. (Specificity) How often is the test negative when a mutation is not present?
`11. Is an internal QC program defined and externally monitored?
`12. Have repeated measurements been made on specimens?
`13. What is the within- and between-laboratory precision?
`14. If appropriate, how is confirmatory testing performed to resolve false positive results in a timely manner?
`15. What range of patient specimens have been tested?
`16. How often does the test fail to give a useable result?
`17. How similar are results obtained in multiple laboratories using the same, or different technology?
`
`
`Element: Clinical Validity
`18. (Sensitivity) How often is the test positive when the disorder is present?
`19. (Specificity) How often is the test negative when a disorder is not present?
`20. Are there methods to resolve clinical false positive results in a timely manner?
`21. (Prevalence) What is the prevalence of the disorder in this setting?
`22. Has the test been adequately validated on all populations to which it may be offered?
`23. What are the positive and negative predictive values?
`24. What are the genotype/phenotype relationships?
`25. What are the genetic, environmental or other modifiers?
`
`
`Element: Clinical Utility
`26. (Intervention) What is the natural history of the disorder?
`27. (Intervention) What is the impact of a positive (or negative) test on patient care?
`28. (Intervention) If applicable, are diagnostic tests available?
`29. (Intervention) Is there an effective remedy, acceptable action, or other measurable benefit?
`30. (Intervention) Is there general access to that remedy or action?
`31. Is the test being offered to a socially vulnerable population?
`32. (Quality Assurance) What quality assurance measures are in place?
`33. (Pilot Trials) What are the results of pilot trials?
`
`
`34. (Health Risks) What health risks can be identified for follow-up testing and/or intervention?
`35. (Economic) What are the financial costs associated with testing?
`36. (Economic) What are the economic benefits associated with actions resulting from testing?
`37. (Facilities) What facilities/personnel are available or easily put in place?
`38. (Education) What educational materials have been developed and validated and which of these are available?
`39. Are there informed consent requirements?
`40. (Monitoring) What methods exist for long term monitoring?
`41. What guidelines have been developed for evaluating program performance?
`
`
`Element: ELSI
`42. (Impediments) What is known about stigmatization, discrimination, privacy/confidentiality and personal/family social issues?
`43. Are there legal issues regarding consent, ownership of data and/or samples, patents, licensing, proprietary testing, obligation to disclose, or reporting requirements?
`44. (Safeguards) What safeguards have been described and are these safeguards in place and effective?
`
`
`Conceptual Gantt chart for analytical accuracy development. In practice, there will be
`overlap of the elements.
`
`Chart elements (bar positions and timing not recoverable from the extracted text):
`• Review literature
`• Develop list of medically interpretable loci / regions for testing
`• Develop error model based on hypothesized mechanisms of action
`• Create QC tools
`• Analysis of experimental data to quantify error rates for all variant types & characterize by mechanism of action
`• Iterate error model & refine experimental analysis
`• Develop strategy
`• Design, implement & test technology improvements
`• Release version x.x
`
`Develop a List of Medically Interpretable Loci /
`Regions for Testing
`• Develop a list, some of each variant-type, which we want to be able to interpret
`medically. An initial list can be anecdotal / representative. Eventually, it
`should be more comprehensive.
`• PharmGKB SNP’s
`• VariMed SNP’s which contribute to Risk-O-Grams
`– 1st exon, other exons, non-exonic
`• Create an initial list (maybe 20 each ?) of medically interpretable:
`– Small InDels, CNV’s, VNTR’s, Deletions spanning 10 – 100k bases,
`Insertions, Duplications, Balanced translocations
`• Create a list of genes which we want to be able to assay confidently in their
`entirety:
`– PharmGKB VIP genes, genes for 20 major Mendelian Diseases (e.g. CFTR),
`20 major cancer genes (e.g. p53, BRCA 1&2, …)
`• Variants required for blood typing
`• Variants for tissue typing
`• Linc-RNA regions, Regulome elements (TF binding site ranges), eQTL loci
`• Loci / genes for which phasing is most likely to be necessary for interpretation
`
`Literature Review
`• Avoid re-inventing the wheel
`• Literature already reviewed:
`– Margulies (NHGRI, coverage analysis & error filtering)
`– Lam (Platform comparison)
`– Kahn (Accuracy advances, Illumina technology)
`– Complete Genomics technology whitepaper
`– Finishing of the Human Genome (2004)
`– Nature Biotech Dec 2011, Dutch group. (Filtering out errors)
`– Some literature on Huntington’s disease assays
`• To be reviewed:
`– Complete Genomics data pipeline paper (recent)
`• To look for:
`– Papers systematically addressing the mechanisms underlying errors by type
`– Assay methods & problems for specific difficult cases
`– UCLA (Stan Nelson’s group) error model (published ?)
`• Mark to review algorithms and parameter settings of HugeSeq elements (Hugo
`may already be familiar):
`– Alignments : BWA
`– SNP & InDel detection : GATK, SAMtools
`– SV & CNV detection :
`• Breakdancer (paired-end mapping)
`• CNVnator (read-depth analysis)
`• Pindel (split-read analysis)
`• BreakSeq (junction mapping)
`
`Develop an error model based on
`hypothesized mechanisms of action
`
`• Coverage model incorporating: GC-bias, alignment degeneracy, Poisson
`sampling
`• SNP detection error model incorporating: Binomial distribution of het alleles,
`Raw read error rate, proximity to InDels (detected or not), clusters of SNPs,
`compressions, alignment degeneracy, coverage, SNP loci within SV’s (esp
`large deletions), aligner & variant caller parameters, allelic biases. False
`positive, false negative & non-call rates.
`• InDel error model, including zygosity
`• SV error model, including zygosity (separate models for the multiple SV &
`CNV algorithms used, and then their combination)
`• Can the model correctly reflect the differences between Illumina & CGI data
`sets on the same sample ?
`• Can the model reflect errors in the reference sequence & guide its
`improvement ?
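A first-cut version of the coverage and het-SNP parts of such a model can be written by combining Poisson-distributed depth with a binomial split of reads between alleles. The sketch below makes strong simplifying assumptions (no GC bias, no alignment degeneracy, no read errors, unbiased alleles, and a caller that needs a fixed minimum number of alternate reads); the function name and parameters are hypothetical.

```python
from math import comb, exp


def het_miss_probability(mean_depth, min_alt_reads, max_depth=None):
    """P(a het SNP shows fewer than min_alt_reads alternate-allele reads).

    Depth is modelled as Poisson(mean_depth); given the depth, the number of
    alternate-allele reads is Binomial(depth, 0.5).
    """
    if max_depth is None:
        max_depth = int(4 * mean_depth) + 20  # truncate the negligible upper tail
    total = 0.0
    p_depth = exp(-mean_depth)  # Poisson probability of depth 0
    for depth in range(max_depth + 1):
        if depth > 0:
            p_depth *= mean_depth / depth  # Poisson recurrence P(d) = P(d-1) * mean / d
        too_few = sum(comb(depth, k) for k in range(min(min_alt_reads, depth + 1)))
        total += p_depth * too_few / 2**depth
    return total


false_negative_10x = het_miss_probability(mean_depth=10.0, min_alt_reads=3)
false_negative_30x = het_miss_probability(mean_depth=30.0, min_alt_reads=3)
```

Comparing predictions like these with observed non-call and false negative rates is one way to measure how much of the error is not explained by sampling alone, which is where the other listed mechanisms would enter the model.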
`
`Create QC Tools to Measure Error-Model
`Parameters From Experimental Data Sets
`
`• Determinants of coverage:
`– Coverage vs % GC content
`– Aligned coverage vs reference degeneracy (multiple parameters?)
`– Coverage histogram in regions selected to avoid repeats & %GC variation
`• Should match Poisson ?
`• Look for deviations from Poisson (e.g. palindromes)
`• Total coverage ratios of chromosomes (e.g. each chromosome vs total of Chr 1-22)
`• Paired-end insert length distribution vs expected
`– Tails of the distribution may impact BreakDancer
`• Raw read Q-score %-iles vs position along a read; vs expected
`• Raw read error rates (monitored at homozygous loci)
`– In regions devoid of known problems (i.e. baseline)
`– Profiled across the genome, to look for problematic regions
`• Allelic distribution of reads vs expected binomial (alignment, biochemistry problems)
`– In regions devoid of known problems (i.e. baseline)
`– Profiled across the genome, to look for problematic regions (allelic bias in the genomic sequence reads)
`• May be best with a very high coverage genome
`• Sex, ethnicity, admixture and family relationships between samples vs expected from
`supplementary data provided
`• Clusters of variants (not just SNPs) which are close / dense enough to predict higher
`probability of errors
`• Amount and location of autozygosity, vs expectation
`• Screening for non-human DNA (esp in data sets from saliva ?)
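The "coverage vs % GC content" readout can be prototyped by binning genomic windows by GC fraction and averaging depth per bin. This is a sketch only: the window tuples, bin count and function names are assumptions, and a real tool would read depth from alignment files rather than take it as input.

```python
def gc_fraction(sequence):
    """Fraction of G/C among unambiguous bases, or None if there are none."""
    sequence = sequence.upper()
    gc = sequence.count("G") + sequence.count("C")
    at = sequence.count("A") + sequence.count("T")
    return gc / (gc + at) if gc + at else None


def mean_depth_by_gc_bin(windows, n_bins=10):
    """Mean depth per GC bin; windows is an iterable of (sequence, mean_depth).

    Bin i covers GC fractions [i / n_bins, (i + 1) / n_bins); a fraction of
    exactly 1.0 goes in the last bin. Empty bins are reported as None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for sequence, depth in windows:
        fraction = gc_fraction(sequence)
        if fraction is None:
            continue  # skip windows made only of N or other ambiguous bases
        index = min(int(fraction * n_bins), n_bins - 1)
        sums[index] += depth
        counts[index] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]


# Hypothetical toy windows.
profile = mean_depth_by_gc_bin(
    [("AAAA", 20.0), ("ATAT", 30.0), ("GCGC", 10.0), ("NNNN", 5.0)], n_bins=2
)
```

Dividing each bin's mean by the genome-wide mean would give a normalized GC-bias curve that can be compared against a standard run.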
`
`Analysis of experimental data to quantify error
`rates for all variant types & characterize by
`mechanism of action
`
`• Analysis of a single genome:
`– Coverage assessment & errors related to that, including non-calls
`– Two alleles / haplotypes in single sex chromosomes, vs a standard
`– Three alleles / haplotypes in diploid chromosomes, vs a standard
`– Comparison of variant detection between multiple algorithms used in HugeSeq for the same thing:
`• SNP’s & InDels : GATK & SAMtools
`• SV’s & CNV’s : Breakdancer, CNVnator, Pindel, BreakSeq
`• Aligned with the standard reference sequence vs ethnically specific major allele
`
`• Comparison of genomes:
`– Independent subsamples from a single deep genome data set
`– Same genome sequenced twice (or more) with all the same platform settings
`– Same genome vs paired-end insert length
`– Same genome vs read-length
`– Mendelian Inheritance Errors (MIE) & Inheritance State Consistency Errors (ISCE)
`– Same genome