From: Scott Kirk
To: John West; Christian Haudenschild; Richard Chen; Hugo Lam; Rong Chen; Mark Pratt
Subject: PGS v1 Requirements draft (rev 1.10)
Date: Wednesday, October 10, 2012 12:51:06 PM
Attachments: PGS_v1.10_RemainingRequirements.docx
`
`Hi,
`
V1.10 is attached. It includes a few edits and additional items from yesterday's accuracy
discussion, as well as a number of updates from this morning's discussion, which covered
sections 8, 9, 10, 11, and part of section 6 on SVs.
`
`Thanks,
`Scott
`
`Personalis EX2095
`
`
`
`
`
`
`
`
`
`
`Personalis Genome Service
`
`
`
`
`Remaining Requirements for v1.0
`
`Draft v1.9
`
`
`
`
`
`
Version | Author | Date | Description
V1 | RC, HL, MP, SK, ROC, GC, MC | 9/30/12 | First consolidated version
V1.1 | ROC, SK, HL | 10/1/12 | Updated SV and pipeline requirements post discussion with Hugo. Merged in first round of changes from Rong
V1.2 | RC, SK, ROC, MP | 10/2/12 | Second round of changes from Rong. First round of changes from Mark.
V1.3 | SK | 10/3/12 | Reorganization and miscellaneous edits and some priority tagging
V1.4 | HL | 10/4/12 | Pipeline performance testing
V1.5 | SK, MP, ROC, HL, RC | 10/5/12 | Priority tagging and miscellaneous edits. Additional refinement of exome plus requirements.
V1.6 | SK, MP | 10/7/12 | Further refinement of exome + requirements
V1.7 | SK, RC, MP, HL, ROC | 10/7/12 | Revisions post VA gap analysis – VA gaps is now traced to the numbering convention. Try not to change numbers.
V1.8 | SK | 10/8/12 | Included mockup graphics in various sections and updated accuracy section some.
V1.9 | SK | 10/8/12 | Additions and edits from accuracy discussion

`
`
Table of Contents

1 Key Differentiators
2 Scope
3 Key Dependencies (if any)
4 Format and Priority Definitions
5 Exome Plus Requirements
6 Structural Variants Requirements
7 Additional Pipeline Requirements:
8 Build a repository of public genomes/exomes to calculate population
frequencies, serve as controls, test our pipeline, and identify disease
mechanisms.
9 Curate proprietary databases with quality control
10 Additional Annotation requirements
11 Case/control discovery module
12 Database accuracy study
13 Reports Requirements
14 Performance:
15 Additional testing requirements
16 Additional Accuracy Related Features
17 Commercial
`
`
`
1 Key Differentiators

• Exome Plus support:
o Pulldowns for long and short read sequencing that include custom
content. Provides a differentiated laboratory component in our
offering covering our proprietary content outside of standard Exome
assays as well as regions with known accuracy issues.

• End to End SV support:
o SV is an area where most competitors do very little. Offering more
here significantly differentiates us. We will also be producing our own
frequency data as well as deriving it from public data sets.

• Database accuracy study:
o Provides a check on the quality of our proprietary databases
o Provides marketing content around our proprietary databases

2 Scope

• Base Pipeline and Annotation engine.
• V1 of PGS
• Document also highlights features that are critical for VA proposal
• VA proposal

3 Key Dependencies (if any)

• Exome Plus:
o Custom pulldown orders from external manufacturers (Agilent,
ILMN).
• End to End SV support:
o external tools
o test data
o databases
• Database accuracy study:
o tools to map the variants and phenotypes across proprietary and
public databases

4 Format and Priority Definitions

Each requirement should include:
• Requirement statement: Statement of what the requirement is. Can be
simple descriptive sentences or in user-based story format (as a [persona], I
want [feature], so that I can [do something]). Independent of priority, “shall”
generally means required and “may” means optional.
• Additional Description: provides additional details and technical
requirements as needed.
• Definition of Done: the criteria that need to be satisfied for the requirement to
be considered “done”. Helps to define how the requirement is tested.
• Priority: see definitions.
• Unique identifier: will be added when we enter the requirements into Jira.

Priority definitions:
• P1 - Must have. Critical for the release.
• VA - Required for the VA proposal.
• P2 - Nice to have. Implement if time allows.
• P3 - Future: not required, but is perhaps desirable in the future. May require
some work to make them easier to implement later on.
• P4 - Deferred: the feature is deferred until a future/next release.

5 Exome Plus Requirements

5.1 Exome plus extends the standard exome to areas of medical significance
outside the exome including regulatory regions, pharmGKB, Varimed,
HGMD, and mendelian variants.
Priority: P1

Area | Priority | Comments
Varimed | P1 |
PharmGKB | P1 |
Regulome tier 1 | P1 |
Mendelian gene exons | P1 |
HGMD | P1 |
Regulome tier 2 | P2 |
Gene panels | P2 | Could be a P2, it’s next up if time allows.

5.2 Exome plus fixes the following types of problematic regions in areas of
medical significance in both exomes and whole genomes:
Priority: P3
Additional Description: for the first release of this, work is to be scoped
through the lens of the clinical exome panels. For example, focus on fixing
problematic regions in the cardiomyopathy, FH, and LongQT syndrome panels.

Area | Priority | Comments
Degenerate Alignment | P2 | use long reads to improve these areas
GC rich regions | P2 | can we improve coverage in GC rich regions (more iterative than the other three and implies more research)
Unstable expanding repeat regions/VNTRs/STRs | P2 |
Content intersecting SVs | P2 |
Content intersecting compressions | P2 |
Phased exons(?) | P2 |
 | P2 |
 | P2 |
HLA typing | P2 |
Allelotyping, unstable expanding repeats | P2 |

5.3 Design exome plus to augment established low coverage regions across
the interpretable exome to achieve more uniform coverage and thus
higher quality.
Priority: P2
Additional Description:
• High GC and poorly mapped (degenerately mapped) regions
• Solution is longer reads and a GC specific process.

5.4 Implement first pass Exome Plus to Achieve 5.1, 5.2, 5.3

5.4.1 An exome supplement pulldown (ESP) to cover missed and low coverage areas
in standard NGS exome sequencing shall be developed.
Priority: P1
Read length is 2x150 (?) and content shall include coverage of:

Area | Priority | What it covers/details
VariMed | P1 |
pharmGKB | P1 | Sarah’s set of 7800 genes and any SNP within vicinity of these genes
HGMD genes | P1 | 3502 HGMD genes
Low coverage | ? |
High GC content | ? |
Missed splice sites | ? |
Homopolymer flanks | P2 |
Important Regulome 1 – overlapped with medical variants | P1 |
Important Regulome 2 | P1 |
Highly conserved regions | P1 |
Hgmd, omim, varimed, mendelDB overlap | P1 | cover key disease related variants in our proprietary databases as part of the pullout.
Intronic variants (possible differentiator) | P1 |
HGMD 3400 disease causing genes | P1 |
ILMN Truseq exome expanded by 50 bases(?), trimmed down by low and zero coverage areas. | P1 |

5.4.2 An exome long read pulldown (LRP) set to cover areas that are interpretable,
but challenging with standard NGS exome sequencing shall be developed.
Priority: P3
Read length is 2x250. Is therefore dependent upon selection, integration, and
testing of a long read aligner.

Area | Priority | What it covers/details
zero coverage regions | P3 | areas in OMIM gene, pharmGKB, HGMD gene sets with mapping problems.
mapping problems | P3 | where assembly and compression is an issue hitting anything interpretable.
HLA regions | P2 |
Unstable repeat dz | P3 |
Exomic seg dups | P3 |
Exomic compressions | P3 |
Content sv breakpoints | P3 |
Exomic long strs | P3 |
Pharmgkb problems | P3 |
Common exomic indels | P3 |
Suspect medical snps | P3 |
Small, high value sets | P3 |
3 whole compressions on OMIM genes | P3 |
amylase family | P3 |
SMN1 | P3 |
SMN2 | P3 |

5.4.3 An exome High GC Supplement (HGCS) to cover regions of high GC content
known to be problematic in NGS exome sequencing shall be developed.
Priority: P3
Content shall include coverage of:

Area | Priority | What it covers/details
Exomic high GC regions | P3 | Including high GC SNPs
First exons | P3 | Very high GC. Poor coverage now in ILMN v3 chemistry is an improvement over the past, so not necessarily a knock against the method.

`
5.5 Exome+ pulldowns shall be evaluated to determine how well various
experimental strategies work and to allow marketing communication
about product performance and quality.

5.5.1 The overall quality of the Exome+ pulldown assay will be evaluated and
summarized by measuring the following characteristics:
Priority: P1
We may want to add these to 5.11 for measuring within each type of content
region as they could be different? They could also vary by vendor selected.
<<Need to determine spec numbers to test against>>

Metric | Target | Min. Acceptable
On/Off target quality / enrichment efficiency (specificity) | Base coverage of XX% at >= 10x coverage depth. <XX> | XX% at >= 10x coverage depth. <XX>
Sensitivity | SNPs:<XX> Indels:<XX> SVs:<XX> | SNPs:<XX> Indels:<XX> SVs:<XX>
Uniformity of coverage | 10, 20, 30x??<XX> | <XX>
Reproducibility | 95%<XX> | <XX>
Other measures? | |
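
As a rough illustration of how the coverage rows in the 5.5.1 metrics could be computed, the sketch below summarizes a list of per-base depths over the pulldown's target regions. It is a minimal sketch under stated assumptions, not the pipeline's implementation: the function name `coverage_summary`, the input format, and the uniformity measure (share of bases at or above 0.2x the mean depth) are all hypothetical choices made for the example, and the actual spec numbers still need to be determined.

```python
from statistics import mean


def coverage_summary(depths, thresholds=(10, 20, 30)):
    """Summarize per-base read depths over the target regions.

    depths: list of integer depths, one per targeted base (assumed input).
    Returns the mean depth, the percentage of bases at or above each
    depth threshold, and one possible uniformity measure.
    """
    if not depths:
        raise ValueError("no target bases supplied")
    n = len(depths)
    avg = mean(depths)
    summary = {"mean_depth": avg}
    for t in thresholds:
        # Percentage of targeted bases covered at >= t reads.
        summary[f"pct_ge_{t}x"] = 100.0 * sum(d >= t for d in depths) / n
    # Assumed uniformity measure: share of bases at >= 0.2x the mean depth.
    summary["pct_ge_0.2x_mean"] = 100.0 * sum(d >= 0.2 * avg for d in depths) / n
    return summary


if __name__ == "__main__":
    print(coverage_summary([0, 5, 10, 20, 30, 40]))
```

The same function could be run once per content region (VariMed, pharmGKB, and so on) to produce the per-area figures asked for in 5.5.2.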
`
`
5.5.2 The performance of Exome+ will be evaluated for each of the defined areas of
content by measuring the following characteristics:
Priority: P1
<<Need to determine spec numbers to test against>>

Metric | Target | Min. Acceptable
Comparison to standard exome results | <XX> | <XX>
On/Off target quality / enrichment efficiency (specificity) | Base coverage of XX% at >= 10x coverage depth. <XX> | XX% at >= 10x coverage depth. <XX>
Sensitivity | SNPs:<XX> Indels:<XX> SVs:<XX> | SNPs:<XX> Indels:<XX> SVs:<XX>
Uniformity of coverage | 10, 20, 30x??<XX> | <XX>
Coverage of content region | 10, 20, 30x?<XX> | <XX>
Sensitivity of detecting specific medically relevant loci (per panel basis?) | SNPs:<XX> Indels:<XX> SVs:<XX> | SNPs:<XX> Indels:<XX> SVs:<XX>
Specificity of detecting specific medically relevant loci (per panel basis?) | SNPs:<XX> Indels:<XX> SVs:<XX> | SNPs:<XX> Indels:<XX> SVs:<XX>
Reproducibility | 95%<XX> | <XX>

5.5.3 To improve our ability to measure quality and performance of the Exome +
pulldowns, produce specialized reports and analytics that summarize
performance characteristics for the following areas:
For example: a report on “here’s our performance on HGMD variants:
coverage, variant detections, mean confidence”
Possibly include ref calls at critical variant loci. Nice to know you have an
affirmed null detection of some nasty variant.

Area | Priority | Comments
Varimed | P1 |
pharmGKB | P1 |
Regulome tier 1 | P1 |
Regulome tier 2 | P1 |
Mendelian gene exons | P1 |
Blood typing | P1 | Double check VA proposal
HLA typing | P2 |
Allelotyping, unstable expanding repeats | P3 |
Content intersecting SVs | P3 |
Content intersecting compressions | P3 |
Phased exons(?) | P3 |
Gene panels | P3 | Could be a P2, it’s next up if time allows.

5.6 Overall cost of exome plus needs to be significantly less than a whole
genome. (What is the target number?)
Priority: P1
`
5.6.1 Produce a detailed cost model of the entire exome plus workflow and use it to
determine what parameters we can work with to impact cost/sample.
Priority: P1
Additional Description: Determine what parameters can be worked with. It’s
not necessarily content that reduces cost. Multiplexing, for example, is a
possibility to reduce cost/sample.

5.6.2 Develop an itemized value assessment of content to show the marginal value
of the data in various content sets.
Priority: P1

5.7 The clinical exome panel must “finish” the well established gene panels
for cardiomyopathy, FH, LongQT syndrome.
Priority: P3
Additional Description: First release is a research (exome plus) and doesn’t
include panels. Second release: clinical (exome panel)

5.7.1 For the cardiomyopathy, FH, and LongQT syndrome clinical panels, scope what the
current accuracy is.
Priority: P3

5.7.2 Determine the data combination and weighting strategy(??)
?Provide the ability to combine and weight data to ???? Not sure what this is.
Priority:

5.7.3 The pipeline analysis shall be enabled to make calls on combined data sets for
single genome/exome/supplement (how is this different than multisample
support?)
Priority: P1

5.7.4 Provide the ability to allelotype unstable expanding repeat diseases.
Priority: P3

5.7.5 Provide the ability to perform reassembly of SDs, compressions, and SVs to ???
Priority: P1

5.7.6 Provide the ability to detect a sample’s HLA type from exome + data.
Priority: P1

5.8 Integrate support for exome+ workflow(s) into pipeline to enable analysis,
variant calling, and annotation of the data in an automated fashion for
multiple samples.
Priority: P1

5.8.1 A long read aligner shall be enabled to support 2x250 reads from the LRP
analysis.
Priority: P3

5.8.2 GATK-lite version currently implemented in the pipeline shall be examined to
determine its compatibility with Exome + data.
`
6 Structural Variants Requirements

6.1 Pipeline shall detect the following SV types:
Priority: P1
Definition of Done:
We do most of these already. It’s a matter of improvements and doing really well
with long deletions and stay with what we have for the others for V1.

6.1.1 CNV
Priority: P2, VA

6.1.2 Insertion >50 bp (long insertion)
Priority: P2, VA

6.1.3 Deletion >50 bp (long deletion)
Priority: P1, VA
Feel we can get the most gain by doing well with long deletions.
Prioritize those that are medically interpretable.
Exome data deletion detection.

6.1.4 Inversion
Priority: P2, VA

6.1.5 Interchromosomal translocation (potential place we can stand out)
Priority: P2, VA

6.1.6 Intrachromosomal translocation
Priority: P2, VA

6.1.7 VNTR (detect STR, get some VNTR with Breakseq)
Priority: P2, VA

6.1.7.1 Integrate LobSTR for STR detection
Priority: P2, VA
Contributors: Hugo, Dan, Nan
Additional Description: Detecting low complexity SVs is one of the most
difficult SV detections. No known products specifically target
comprehensive STR detection. By using LobSTR, particularly in
conjunction with our long reads capture, we shall detect the most STRs.
1. Integrate LobSTR as an SV detection tool in the SV module of the
pipeline
a) Installing dependent modules on the server and in the
pipeline
b) Adding a new SV algorithm to the pipeline
c) Integrate the STR calls with existing SV call set
Definition of Done:

6.1.8 MEI
6.1.8.1 Implement MEI (Mobile Element Insertion) detection which can be implicated
in human disease such as schizophrenia (?)
Priority: P1, VA
Contributors: Hugo, Jing, Dan
Additional Description: Detecting repetitive elements has always been a
big challenge in the field, particularly with relatively short reads;
however, MEI has been shown in various studies to cause different
diseases and somatic differences. There is currently no known
(publicly) available tool for MEI detection. For Complete, they have MEI
detection but the quality is unknown. We shall develop a new detection tool
with the best algorithms detecting mobile element insertions such as L1 and
ALUs.
Looks for gene disruption.
ILMN data on BWA needs its own algorithm written. Start from algorithms
which have been published and rewrite for ILMN/BWA data.
There may be a way to take what we’re doing already as a first pass
implementation for identifying “existing” MEI. A more involved
implementation would consider “denovo” MEI.
LOE is 1.5 Mo (1FTE) + validation time for more involved implementation.
Definition of Done:

6.1.9 Polyploidy/aneuploidy
Priority: P3
Session with Sarah and Gemma to determine what we should be scanning
for.
Definition of Done:

6.1.10 SVs shall be flagged in the GFF file output by the pipeline.
Priority: P1, VA

6.2 Pipeline shall improve the accuracy for detecting the above SVs at
increased sensitivity and specificity compared to Complete or other
competitive solutions (including validation).
`
Improve to add a “p” value to the detection replacing the high/low classes
that currently exist.
Testing:
Definition of Done:

6.2.1 Provide the ability to trade off between sensitivity and specificity on a per
run/study basis.
Priority: P1, VA
Additional Description:
For example in discovery research we may want to emphasize sensitivity
over specificity while for clinical applications specificity may be more
important. ROC curves shall be constructed to measure.
Definition of Done:

6.2.2 Specifically assess accuracy and performance for medically important genes
that are the focal point of our disease panels. For example:
cardiomyopathy/HCM, long QT syndrome, Familial hypercholesterolemia, and
the associated pharmacogenomic regions.
Priority: P1, VA
Definition of Done:

6.2.3 Generate latest breakpoint library from 1KG to improve accuracy of all types of
SVs.
Priority: P1, VA
Contributors: Hugo, Jing
Additional Description: With the collaboration with 1KG, particularly the
SV group (Lam et al. >4yrs experience), Yale (Gerstein’s lab, SV expert)
and EBI (Jan Korbel, the SV leader in 1KG), a latest SV library from the
1000 genomes shall be generated with breakpoint and validation
information. We shall also run breakseq and possibly other SV tools to
generate stringent breakpoints. Improves both sensitivity and specificity.
Definition of Done:

6.2.4 Local reassembly to be more precise in resolving breakpoints and to validate
the SVs.
Priority: P1, VA
Contributors: Hugo, Jing
Additional Description: None of the existing SV callers perform local
reassembly. SVs detected by certain algorithms are also of low
resolution where breakpoints are unclear. Performing local reassembly in
breakpoint regions to determine more breakpoints and to validate the SVs.
Definition of Done:
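
The per-run sensitivity/specificity trade-off and ROC curves called for in 6.2.1 could be prototyped along the following lines. This is a minimal sketch, assuming each candidate SV call carries a numeric confidence score and a truth label from a validation set; the function names `roc_points` and `pick_threshold` and the input format are hypothetical and not part of the existing pipeline.

```python
def roc_points(scored_calls):
    """Compute (threshold, sensitivity, specificity) for each distinct score.

    scored_calls: list of (score, is_true_sv) pairs, where is_true_sv comes
    from a validated truth set (assumed input format).
    A call is accepted when its score is >= the threshold.
    """
    positives = sum(1 for _, truth in scored_calls if truth)
    negatives = len(scored_calls) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true and one false candidate")
    points = []
    for threshold in sorted({score for score, _ in scored_calls}, reverse=True):
        tp = sum(1 for s, t in scored_calls if s >= threshold and t)
        fp = sum(1 for s, t in scored_calls if s >= threshold and not t)
        points.append((threshold, tp / positives, 1 - fp / negatives))
    return points


def pick_threshold(points, min_specificity):
    """Return the most sensitive point that still meets a specificity floor."""
    eligible = [p for p in points if p[2] >= min_specificity]
    return max(eligible, key=lambda p: p[1]) if eligible else None


if __name__ == "__main__":
    calls = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
    pts = roc_points(calls)
    print(pick_threshold(pts, min_specificity=0.9))  # clinical-style setting
    print(pick_threshold(pts, min_specificity=0.5))  # discovery-style setting
```

A clinical run would pass a high specificity floor and a discovery run a lower one, which is one way to expose the trade-off as a per-run/study parameter.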
`
`
`
`
`6.2.5 Report breakpoint resolution when possible to better assess the downstream
`effect on gene function
`
`
`Priority:P1, VA
`Contributors:Hugo, Jing
`Dependency: 6.2.3, 6.2.4
`Additional Description: Many of the SVs detected nowadays are not of
`breakpoint resolution. Algorithms shall be developed to report the
`breakpoint information for as many SVs as possible, e.g. by doing local
`reassembly in breakpoint regions.
`Definition of Done:
`
6.2.6 Report zygosity for SVs when possible to better assess impact on gene function
and disease, P1
Priority: P1, VA
Contributors: Hugo, Dan
Dependency: 6.2.5
Additional Description: Thus far, none of the SV algorithms report the SV
zygosity. For the SVs detected by the pipeline, algorithms shall be developed
to report the SV genotype with zygosity such as homozygous/heterozygous
deletions/insertions.
Definition of Done:

6.2.7 Refine SV merging algorithms to better annotate SV and increase precision of
merging overlapping SVs (for example better detection of compound
heterozygosity or overestimate on size of SVs)
Priority: P1, VA
Contributors: Hugo, Dan
Dependency: 6.2.6
Additional Description: SV merging is critical as SVs detected by the
algorithms are sometimes broken into fragments or overestimated, and
multiple algorithms also report SVs that largely overlap; however, no known
algorithm or software can accurately consolidate, merge and then
categorize these SVs. Without phasing information, it is particularly hard
to do it accurately. Algorithms shall be developed to refine the current SV
merging in the pipeline with the aid of multiple orthogonal SV algorithms, the
zygosity information and the SV breakpoint/frequency data sets we are
generating.
Definition of Done:

6.2.8 Provide an SV characterization tool to understand the formation mechanism of
SVs (targeted to discovery researcher)
Priority: P2
Additional Description: Inference of the ancestral state of an SV locus is
crucial for relating SV surveys to primate genome evolution and
population genetics. Inference of its formation mechanism is crucial for
understanding how the SV was generated, which may then intersect
with exons of genes or lead to gene fusion events causing diseases.
Characterization remained a challenging problem until our first
publication (Lam et al. 2010). Based on our previous experience, we shall
develop an in-house tool to infer SV formation mechanism and ancestral
state.
Definition of Done:

6.3 Sensitivity and specificity for each type of SV shall be tested via the
following methods
Main focus on long deletions for V1 (P1)

6.3.1 Create simulation data sets to model different types of SVs
Priority: P2
Contributors: Hugo, Ming, Mark, Jason
Additional Description: There is currently no known robust SV
accuracy/error model. Using an existing simulation box or creating our
own to generate simulation data (e.g. sequence reads) to model different
types of SV and to test their detection. If possible, a p-value should be
generated and assigned to the SVs being called.
Definition of Done:

6.3.2 Comparison with SV results generated with orthogonal technologies (for
example Complete Genomics data we’ve had sequenced).
Priority: P2
Manual validation of SV detected across multiple algorithms.
Definition of Done:

6.3.3 Describe relative performance statistics (sensitivity and specificity) for our SV
detection algorithm versus:
6.3.3.1 Illumina seq + each individual SV detection algorithm (Pindel, etc)
6.3.3.2 Complete genomics pipeline

6.3.4 Comparison to 1000 genomes and available results in the literature.
Priority: P2
As a golden data set to start working with.
Definition of Done:
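
One way to produce the comparison statistics described in 6.3.3 and 6.3.4 is to match called SVs against a golden set by reciprocal overlap. The sketch below is illustrative only: the 50% reciprocal-overlap cutoff, the `(chrom, start, end)` tuple format, and the function names are assumptions, and precision is reported in place of specificity because true negatives are not well defined for SV calls.

```python
def reciprocal_overlap(a, b):
    """Return the reciprocal overlap of two intervals (chrom, start, end).

    Intervals are half-open. The result is the smaller of the two
    fractions (overlap / length of a, overlap / length of b).
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def compare_to_truth(calls, truth, min_ro=0.5):
    """Return (sensitivity, precision) of calls against a golden set."""
    found = sum(
        1 for t in truth if any(reciprocal_overlap(c, t) >= min_ro for c in calls)
    )
    supported = sum(
        1 for c in calls if any(reciprocal_overlap(c, t) >= min_ro for t in truth)
    )
    sensitivity = found / len(truth) if truth else 0.0
    precision = supported / len(calls) if calls else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    golden = [("chr1", 100, 200), ("chr1", 500, 900), ("chr2", 100, 300)]
    called = [("chr1", 110, 210), ("chr1", 800, 1500), ("chr3", 1, 50)]
    print(compare_to_truth(called, golden))
```

Running the same comparison for each individual detection algorithm and for the merged call set would give the side-by-side figures listed under 6.3.3.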
`
6.3.5 Perform validations of detected SVs with PCR methods to validate accuracy
and resolve discordance
Priority: P2
Contributors: Hugo, Jing, Shujun
Additional Description: For regions that are not validated, or are poorly
validated, by public data sets such as 1KG and that fall in content regions of
interest/importance, PCR shall be performed to validate the existence and
zygosity of the SVs.
Definition of Done:

6.3.6 Assess the impact of PCR or Gel free sequencing on SV detection
Priority: P3
Contributors: Hugo, Jing, Shujun
Additional Description: PCR free might change the GC of sequences and gel
free might change the fragment size, which may have impact on various
detection algorithms.
Definition of Done:

6.3.7 Finalize the new breakseq by testing full-scale using 1KG data to fine tune the
sensitivity and performance of the algorithm(s)
Priority: P1, VA
Contributors: Hugo, Jing
Dependency: 6.2.3
Additional Description: We are the author of BreakSeq (Lam et al.) and a
new version with zygosity information has just been released at
Personalis. We will be testing the new version of breakseq on 1KG data
(~1000 samples) and fine tune the sensitivity and performance.
Definition of Done:

6.4 Build a unified and comprehensive SV/CNV database for downstream
annotation
Priority: P1, VA (ask them)
Contributors: Hugo, Dan, Jing, Ming
Dependency: 6.2.5-6.2.8
Potentially focus on deletions at first.
Additional Description: Personalis will be the first one with a cleaned,
unified, comprehensive, and largest SV database with different levels of
resolution, different studies, validation data, highest amount of
breakpoints, breakpoint formation mechanisms, ancestral states,
individual zygosity information where possible, comprehensive
frequency data as well as possible phenotypic data.
1. Public call set (Hugo, Jing, Dan)
a. Gather SVs with different levels of resolution (e.g. breakseq
library, 1KG CNVs, 1KG SVs, etc)
b. Develop algorithms to resolve the differences and enhance the
quality of heterogeneous SVs and CNVs
2. Internal call set (Hugo, Dan, Ming)
a. Run SV module of the pipeline on our data sets to produce our
own call set with possible zygosity and frequency data
i. Using uk10k as well (6000 exomes) – get allele
frequency data for SVs.
ii. Look at the 25 genomes that we have.
iii. Grow these over time (aggregated stats from
everything we run)
iv. Systematic approach to pulling in data to generate the
information such as frequency
3. Consolidate the call sets to generate the highest resolution data
possible (Hugo, Dan)
4. Validate with public datasets (e.g. 1KG), local re-assembly, and
internal experiments where necessary (e.g. important content
regions) (Hugo, Jing)
5. Generate frequency data and perform statistical validation such as
Hardy-Weinberg. (Hugo, Dan)
6. Correlate the final, cleaned, unified database with any phenotypic
data available. (Hugo, Dan, Jing)
Definition of Done:
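
The Hardy-Weinberg check mentioned in step 5 above can be sketched as a simple chi-square test on genotype counts for a biallelic SV. This is an illustrative sketch only: the function name and input format are assumptions, and a production check would likely use an exact test for low-frequency SVs.

```python
# Chi-square critical value for 1 degree of freedom at the 0.05 level.
CHI2_CRITICAL_1DF_05 = 3.841


def hwe_chi_square(n_ref_ref, n_het, n_alt_alt):
    """Return (alt allele frequency, chi-square statistic) for one SV.

    Inputs are the numbers of samples that are homozygous reference,
    heterozygous, and homozygous for the SV allele (assumed input).
    """
    n = n_ref_ref + n_het + n_alt_alt
    if n == 0:
        raise ValueError("no genotyped samples")
    p = (2 * n_ref_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p                              # SV allele frequency
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ref_ref, n_het, n_alt_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return q, chi2


if __name__ == "__main__":
    freq, chi2 = hwe_chi_square(50, 0, 50)
    flag = "deviates from" if chi2 > CHI2_CRITICAL_1DF_05 else "consistent with"
    print(f"SV allele frequency {freq:.2f}, chi2 {chi2:.1f}: {flag} HWE")
```

An SV whose genotype counts deviate strongly from the expectation is a candidate for a merging or zygosity-calling error and could be flagged for the validation in step 4.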
6.5 Detect SV in exome regions and assess accuracy to strengthen our ability
to “finish” exomes (esp. for clinical application of exome+). This is also a
key differentiator because nobody else is doing this commercially.
Priority: P1
Additional Description: Detecting SVs in exome regions is known to be
hard and existing SV detection tools usually don’t work with exome data;
however, SVs in the exome regions are critical, e.g. indels in BRCA genes.
Also, no known exome product generates SV calls. We shall
detect CNV and SV from exome data using the latest exome-specific algorithms
as well as algorithms developed in-house.
Definition of Done:

6.6 Choose an SV file format and multi-sample representation that works well
for multi-algorithms, multi-sample, and downstream analyses.
Priority: P1
Contributors: Hugo, Dan, Steve, Rong
Additional Description: There is currently no known good file format for
representing SVs, particularly for multiple samples. 1KG is using VCF, but it’s
known to be not optimal.
1. Currently GFF for every sample. (Dan)
2. Come up with a format or a way that works well for multi-algorithms
and multi-sample, as well as the downstream analyses (Hugo, Dan)
3. Changing the pipeline accordingly (Dan)
4. Think about what a multisample file should have in order to work for
downstream analyses. GVF is a possible option for representing
multisamples. (Rong, Steve)
Definition of Done:

6.7 Create stats, QC, and annotation report for SV to deliver results.
Priority: P1, VA
Contributors: Hugo, Dan, Rong, Michael, Steve
Additional Description:
1. For VA we’ll be annotating the GFF file for each individual sample.
2. In addition to annotated GFF, for v1 consider creating a multisample
representation/merged SVs for annotation and comparison across
samples. (Risk in this lies in the “fuzzy” criteria for merging across
samples and annotating based on overlap with gene). Need to
explain methodology and validate.
3. Variant (stat) report: We are currently reporting SV association with
different genomic elements (e.g. repeat elements, genes) (Hugo, Dan)
a. elements (e.g. with other public datasets like DGV)
4. Gene report: Annotate disease for genes overlapped with SV (Rong,
Michael, Steve) (see 6.5)
Definition of Done:

6.8 Create report for SV associations with diseases (for annotation) to deliver
results
Priority: P1, VA?
Contributors: Rong, Steve, Michael, Hugo
Additional Description:
1. dbVar (Rong, Steve)
a) Investigate and download dbVar
b) Parse dbVar into MySQL
c) Separate SV in clinical and normal samples
d) Connect ID, genomic coordinate, and disease
e) Design and implement dbVar report
2. Generate report for diseases associated with SVs
Definition of Done:

6.9 Create report for statistically significant SV associations with genes,
pathways, disease in case/control.
Priority: P2
Dependency: 6.7
Research project, but would be a large differentiator.

7 Additional Pipeline Requirements:

7.1 Provide the ability to determine relatedness, ethnicity, and sex to improve
downstream annotation.
Priority: P1, VA for Sex determination. P2 for relatedness and ethnicity.
Contributors: Hugo, Dan, Jing
Additional Description:
Consult with Carlos and he may have suggestions or some things we could use.
Definition of Done:

7.2 Provide the ability to determine blood type from sequence data to
improve ability to identity match sequence to sample record for the VA
proposal.
Priority: P1, VA
Definition of Done:

7.3 Provide the ability to determine HLA type to improve downstream
annotation (especially for Exome +)
Priority: P2
Definition of Done:

7.4 Determine how the raw data should be delivered (e.g. file structures,
formats, and transfer/devices)
Priority: P1, VA
Additional Description: Look at Complete and ILMN outputs we’ve received.
Definition of Done:

7.5 Test browser compatibility for output
Priority: P1
Contributors: Hugo, Ming
Additional Description: We will be testing IE8+, Firefox, Chrome, and maybe
Safari for all HTML outputs. Prioritize Firefox and Chrome.
Definition of Done:
`
`
`
`
7.6 Test compatibility with public browsing tools and document how to use
them. Tools include Broad-IGV, ??other browsing favorites?? Part of a
larger interaction design.
Priority: P1
Definition of Done:

7.7 Get GATK2.0 license when available and consider performing update to
the full GATK2.0 version.
Priority: P1
Contributors: Hugo, Dan, Scott
Additional Description:
Full GATK update in the fall when Broad releases the version from beta.
Mostly consists of integration work.
Definition of Done:

7.8 Unique Personalis IDs shall be assigned to all variants generated.
Priority: P1
Contributors: Hugo, Dan, Michael
Additional Description: Generate unique Personalis variant IDs for all variants
generated. Particularly important for the variants that don’t have RSIDs.
Definition of Done:

7.9 Integrate analysis and annotation pipeline with our sample tracking/LIMS
system from Genologics to pull necessary metadata to automate analysis
and report generation.
Priority: P1
Contributors: Hugo, Ming, Nan
Additional Description: At a minimum includes sample names, case/control,
sex, ethnicity, age as well as customer information for creating business
elements reports. Additional information that is provided is contained in
Christian’s DB layout and Genologics documentation.
Definition of Done:

7.10 Pipeline shall provide the ability to cross-validate sequencing with array
results – multiplatform support.
Priority: P1, VA
Contributors: Hugo, Dan, Jing
Additional Description:
1. On samples run at ILMN the array is run alongside the sequencing
and we get the data back from them. Need specifics of the results
returned.
2. Use for validating SNPs and Genotypes
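
For the array cross-validation in 7.10, one simple measure is genotype concordance at the sites present on both platforms. The sketch below is a minimal illustration under stated assumptions: the dictionary-based input format and the function name `genotype_concordance` are hypothetical, and the actual format of the array results returned by ILMN still needs to be confirmed.

```python
def genotype_concordance(seq_calls, array_calls):
    """Compare sequencing genotypes with array genotypes at shared sites.

    Both inputs map (chrom, pos) to a genotype given as a pair of alleles,
    e.g. ("A", "G"). Allele order within a genotype is ignored.
    Returns (concordance rate or None, sorted list of discordant sites).
    """
    shared = seq_calls.keys() & array_calls.keys()
    discordant = sorted(
        site
        for site in shared
        if sorted(seq_calls[site]) != sorted(array_calls[site])
    )
    if not shared:
        return None, discordant
    return (len(shared) - len(discordant)) / len(shared), discordant


if __name__ == "__main__":
    seq = {("chr1", 100): ("A", "G"), ("chr1", 200): ("C", "C"), ("chr2", 50): ("T", "T")}
    arr = {("chr1", 100): ("G", "A"), ("chr1", 200): ("C", "T"), ("chr3", 5): ("A", "A")}
    rate, mismatches = genotype_concordance(seq, arr)
    print(rate, mismatches)
```

A low concordance rate for a sample could also serve as a sample-identity check, in addition to validating individual SNP calls.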
`
`Personalis EX