From: John West
`To: Richard Chen
`Subject: Accuracy next steps, spreadsheet to organize the discussion
`Date: Thursday, June 7, 2012 6:24:18 PM
`Attachments: Error mech Detect n Poss solutions JW 6Jun2012.xls
`
`Hi Rich,
`
`I put this together earlier this week to begin organizing my thoughts on how we move towards
`accuracy solutions, somewhat in parallel with ongoing analysis. I would be happy to go
`through it when you have time.
`
`John
`
`Personalis EX2061
`
`

`

`MECHANISM OF ERROR GENERATION
`
`Mis-alignment due to:
`
`Repeats / degeneracy
`
`CAN WE DETECT & COUNT EACH
`ERROR FROM THIS MECHANISM ?
`HOW ?
`ARE WE SET UP TO LOOK FOR THIS ?
`
`MIE due to false positive het SNP's, clustering
`in known degenerate regions
`Reproducibly abnormal coverage in known
`degenerate regions
`Ratio of reads at het loci far from 50/50
`Tri-allelic in a single genome (fairly easy to see)
`Triple-haplotype (more work to make QC tool)
`Systematically different result from paired-end
`reads of different insert lengths
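One way to make "ratio of reads at het loci far from 50/50" countable is an exact binomial test per het call. The sketch below is an illustration only: the function names, the `alpha` cutoff and the example read counts are assumptions, not part of the spreadsheet or of any existing pipeline.

```python
from math import comb

def binom_two_sided_p(ref_reads, alt_reads, p=0.5):
    """Exact two-sided binomial p-value for the observed ref/alt read split."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = pmf[alt_reads]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))

def flag_imbalanced(sites, alpha=1e-3):
    """Return the het sites whose read split is unlikely under a 50/50 expectation."""
    return [s for s in sites if binom_two_sided_p(s[1], s[2]) < alpha]

# (locus, ref reads, alt reads); made-up counts for illustration
sites = [("chr1:1000", 15, 15), ("chr1:2000", 28, 2)]
flagged = flag_imbalanced(sites)
```

Flagged loci that cluster in known degenerate regions would be candidates for the mis-alignment mechanism. This simple version is only suitable for ordinary read depths, since it enumerates every possible outcome.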
`
`HOW WE COULD ELIMINATE / REDUCE THIS ERROR TYPE
`BIOINFORMATICALLY
`LAB METHODS
`REFERENCE
`
`ALGORITHMS
`
`Chain-link SNP alleles by shared raw
`reads and PE-reads across a degen
`region
`
`Special focus on junction sequences
`
`Variable alignment stringency across
`the genome, tailored to known
`differences in degeneracy, likely as
`a 2nd stage re-alignment
`
`Solve degenerate regions
`in "gold standard" genome
`by "spare no expense"
`
`Develop MAF data in
`degenerate regions to guide
`later alignment of others
`
`SNP/InDel cluster
`
`Low raw sequence quality:
`
`Reproducible coverage drop around SNP clusters ?
`Ratio of reads at het loci far from 50/50
`InDel MIE due to failure to detect InDel due to cluster
`Apparent failure of LD - Alleles at nearby loci
`don't "travel together" as expected
`
`New version of GATK is better ?
`
`Use BreakSeq approach for SNP/InDel
`clusters
`
`Do SNP clusters near 50% MAF
`split 49/51 in non-real
`combinations in our ethnically
`specific major allele refs ?
`
`Check for allele combinations which
`are unlikely from known LD
`
`Add InDels to our ref where
`major allele
`
`Raw Q-score depression downstream from
`specific sequence motifs
`
`Directional Q-score average dips, particularly
`in standard locations
`
`Motif-specific pattern recognition
`including length, not just substitutions
`
`Clusters systematically too faint in certain
`sequence contexts
`
`Phasing worse than normal in certain
`sequence contexts
`
`Low raw Q-scores near the start of a read, e.g.
`bases 2-20. We don't currently look
`for this, but could.
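A check for low raw Q-scores near the start of a read could begin with the mean Q-score per read position. The sketch below assumes FASTQ quality strings in Phred+33 encoding (an assumption; some Illumina data of this period used Phred+64), and the function names and the 5-point `drop` threshold are hypothetical.

```python
def mean_q_by_position(quality_strings, offset=33):
    """Mean Phred score at each read position (0-based) across all reads."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

def early_dip(means, start=1, end=20, drop=5.0):
    """True if the mean Q over 1-based bases 2..20 is at least `drop` below the rest of the read."""
    window = means[start:end]
    rest = means[end:]
    if not window or not rest:
        return False
    return sum(rest) / len(rest) - sum(window) / len(window) >= drop

# Made-up reads: Q40 at base 1, Q20 at bases 2-20, Q40 afterwards
reads = ["I" + "5" * 19 + "I" * 10] * 3
means = mean_q_by_position(reads)
```

Running this per lane or per sequence motif would show whether the dip is reproducible, which is the question the spreadsheet raises.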
`We could confirm with access to fluorescence
`intensity data - an option on the
`system but not usually on
`
`We could confirm with access to fluorescence
`intensity data - an option on the
`system but not usually on
`
`Look specifically for these & if they
`are reproducible. Should we exclude
`them ? What motifs cause them ?
`
`Motif-specific algorithms
`
`Motif-based sequencing primers
`
`Pile-up of low raw Q-scores by random
`near-registration of the ends of reads,
`which are inherently low-Q
`
`We don't currently look for this, but could
`May be a good refinement of the
`directional Q-score metric
`
`Look for this and flag it (at least)
`
`Some shorter reads, which are
`highly accurate all the way to
`the ends
`
`Longer reads
`ILMN 2x250
`Ion Proton
`Oxford Nanopore
`Moleculo
`Longer PE inserts
`Multiple PE insert lengths
`
`Custom pullout for long
`insert paired ends or other
`assays
`
`Custom SNP array or pullout
`incorporating the allele sets
`known to be in LD
`
`If GC or hairpin driven, use
`nucleotide analog kits
`
`Lower cluster density, longer
`imaging exposure times ?
`Different cluster growth, e.g.
`# cycles
`
`Low coverage:
`
`Leading to a lack of calls,
`missed heterozygosity, or missed SV
`The low coverage being due to:
`
`Extremes of %GC
`
`Hairpin
`
`Poisson
`
`Look for directional coverage reproducible
`dropouts
`
`Look for where this is unusual, i.e.
`not seen in test genomes sequenced
`the same way
`
`Need to find best scale and lateral offset for GC
`correlation with coverage
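Finding the "best scale and lateral offset" can be framed as a small grid search: for each window size and offset, correlate per-base coverage with the GC fraction of the window at that offset, and keep the combination with the strongest correlation. The sketch below is one possible setup on synthetic data; the function names, window sizes and offsets are assumptions.

```python
import random

def gc_fraction(seq, start, size):
    """Fraction of G or C bases in seq[start:start+size]."""
    window = seq[start:start + size]
    return sum(1 for base in window if base in "GC") / len(window)

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def scan_gc_scale_offset(seq, coverage, sizes, offsets):
    """Correlate coverage at position i with GC of the window starting at i + offset."""
    results = {}
    for size in sizes:
        for off in offsets:
            xs, ys = [], []
            for i in range(len(seq)):
                start = i + off
                if start < 0 or start + size > len(seq):
                    continue  # skip windows that run off either end
                xs.append(gc_fraction(seq, start, size))
                ys.append(coverage[i])
            if len(xs) >= 3:
                results[(size, off)] = pearson(xs, ys)
    return results

# Synthetic check: coverage is built to depend on GC in a 4-base window 2 bases downstream.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(200))
coverage = [10.0 - 8.0 * gc_fraction(seq, i + 2, 4) if i + 6 <= len(seq) else 6.0
            for i in range(len(seq))]
results = scan_gc_scale_offset(seq, coverage, sizes=[2, 4, 8], offsets=[-2, 0, 2])
best_key = max(results, key=lambda k: abs(results[k]))
```

On real data the windows would be far larger (tens to hundreds of bases) and the correlation far weaker, so the scan only suggests where to look for a mechanism.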
`
`Need to understand mechanism of
`GC bias to know how to fix it
`
`Nucleotide analogs
`Extreme GC pullout(s)
`
`JW spreadsheet-based model can find these but
`is too slow for genome scale.
`Also, not all large hairpins lead to
`low coverage - needs work
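A k-mer index is one way to find candidate hairpins faster than a cell-by-cell spreadsheet model. The sketch below is not the JW model; it only finds perfect stems of a fixed length whose reverse complement occurs a short loop distance downstream. The stem length and loop limits are assumed values.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_hairpins(seq, stem=6, min_loop=3, max_loop=20):
    """Return (left_start, right_start, stem_seq) for perfect inverted repeats.

    The stem at left_start is the reverse complement of the stem at
    right_start, and the two are separated by min_loop..max_loop bases.
    """
    index = {}
    for i in range(len(seq) - stem + 1):
        index.setdefault(seq[i:i + stem], []).append(i)
    hits = []
    for i in range(len(seq) - stem + 1):
        left = seq[i:i + stem]
        for j in index.get(revcomp(left), []):
            loop = j - (i + stem)
            if min_loop <= loop <= max_loop:
                hits.append((i, j, left))
    return hits

# Made-up sequence: stem GATTCC, 4-base loop, then its reverse complement GGAATC
example = "AA" + "GATTCC" + "TTTT" + "GGAATC" + "AA"
hairpins = find_hairpins(example)
```

Because not all large hairpins lead to low coverage, each hit would still need properties such as stem length, GC content and loop size recorded alongside observed coverage before any correlation can be drawn.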
`
`Refine model to understand correlation
`between hairpin properties and
`error occurrence, so we can think
`about how to fix the problem
`
`Genome-wide database of
`hairpins, their properties &
`their impact
`
`Nucleotide analogs
`Pullouts or PCR sets which have
`just one side of the hairpin
`
`The part of coverage variability which is not
`reproducible run to run on same library
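The non-reproducible part of coverage variability can be compared against a pure Poisson expectation. As a rough sketch, assuming read starts are independent and the mean depth is uniform (which real data violates), the fraction of positions expected at or below a given depth is the Poisson cumulative probability; the function name is hypothetical.

```python
from math import exp

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); returns 0.0 for negative k."""
    if k < 0:
        return 0.0
    term = exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        total += term
    return total

# Fraction of positions expected to have fewer than 10 reads at 30x mean depth
fraction_under_10 = poisson_cdf(9, 30.0)
```

Positions that fall below the threshold much more often than this baseline predicts point to a systematic cause such as GC extremes or hairpins rather than sampling noise.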
`
`Exome pullout imperfection
`
`Coverage vs position will show this
`
`Maybe room to add SV detection to
`exomes, esp via BreakSeq
`
`Failure to detect small InDel's, due to:
`
`Algorithm weakness
`
`InDel MIE's may identify some of these ?
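A Mendelian inheritance error (MIE) check for one child and two parents can be stated in a few lines. The sketch below treats a genotype as a pair of allele strings, so it applies to InDel alleles as well as SNPs; it ignores de novo mutation, sex chromosomes and missing calls, and the function name is an assumption.

```python
from itertools import product

def is_mie(child, mother, father):
    """True if the child's genotype cannot be formed from one maternal and one paternal allele."""
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) not in possible

# Made-up InDel example: both parents homozygous reference, child called het for a 2-base insertion
child_is_error = is_mie(("A", "ATT"), ("A", "A"), ("A", "A"))
```

In a quartet the same check runs once per child. An InDel MIE of this kind can reflect a missed InDel in a parent (a false negative) as easily as a false call in the child.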
`
`New GATK may improve this quite a bit
`
`Can BAM crawler identify InDels in alignment of
`raw reads which were not called in the
`variant lists (possible false negatives) ?
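The question about the BAM crawler amounts to collecting InDels from read alignments and subtracting the called variant list. The sketch below is not the BAM crawler itself; it works on (position, CIGAR) pairs as they appear in SAM records, and the `min_reads` threshold and function names are assumptions. Positions are the 1-based reference coordinate of the first deleted base, or of the base just after an insertion, which differs from the anchor-base convention used in VCF.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def indels_from_cigar(pos, cigar):
    """Return (ref_pos, op, length) for each insertion or deletion in one alignment."""
    ref = pos  # 1-based leftmost aligned reference position, as in SAM
    found = []
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "ID":
            found.append((ref, op, n))
        if op in "MDN=X":  # only these operations consume reference bases
            ref += n
    return found

def uncalled_indels(reads, called, min_reads=2):
    """InDels supported by at least min_reads alignments but absent from the called set."""
    support = Counter()
    for pos, cigar in reads:
        support.update(indels_from_cigar(pos, cigar))
    return {key: n for key, n in support.items() if n >= min_reads and key not in called}

# Made-up alignments: two reads support a 2-base deletion at position 110
reads = [(100, "10M2D5M"), (105, "5M2D10M"), (100, "20M")]
candidates = uncalled_indels(reads, called=set())
```

Candidates found this way are possible false negatives and would be the natural input to a local re-alignment step.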
`
`If BAM crawler can do this (see left)
`then local re-alignment may correct
`the miss
`
`Difficult seq contexts
`Seq with InDel maps elsewhere
`InDel creates degeneracy
`InDel creates hairpin
`InDel near reference gap
`InDel close to, but not in, STR
`InDel close to, but not in, SV
`
`Many reads too close to either end
`at location of InDel to be able to
`detect
`
`Special case not currently handled
`Special case not currently handled
`Special case not currently handled
`Special case not currently handled
`Special case not currently handled
`Special case not currently handled
`
`Similar to issue of pile-up of low raw Q scores
`by random near-registration of
`ends of reads, but either end is a
`problem here.
`
`InDels should be in LD with nearby SNP's
`so we could look for those and use that
`
`Could create dedicated algorithm if priority
`Could create dedicated algorithm if priority
`Could create dedicated algorithm if priority
`Could create dedicated algorithm if priority
`Could create dedicated algorithm if priority
`Could create dedicated algorithm if priority
`
`Extend the InDel detection algorithms for
`loci with known high-allele-frequency
`InDels, to support detection even near
`either end of a read. Different from
`detection of novel InDels.
`
`STR errors due to:
`
`Biochemistry fails
`
`Shows up as a drop in coverage and/or a reproducible
`drop in raw Q-scores downstream
`
`Failure is generally for very long STR's
`which are generally heterozygous.
`This may be detectable as an allelic read
`depth imbalance at nearby het SNP's
`
`Probably worst problem where
`combined with other sources of
`low coverage, so fix those
`
`Supplementary pullout
`
`Ethnically-specific major InDel
`allele reference, on Hugo's
`plan.
`
`Longer Illumina reads should
`help: Minimal false positive
`but a much bigger % of the
`read will be usable
`
`See other hairpin options, above
`
`Detectable by electrophoresis
`Standard technology we could
` incorporate
`
`Long STR's can also be strong
`hairpins, so hairpin-reducing
`nucleotide analogs could help
`
`Seq OK but algorithm fails, esp if het
`Worse (?) if both alleles
`don't match reference
`
`Would this show up as a low alignment Q-score ?
`Would we notice a pile-up of those ?
`Should we add this to our BAM crawler ?
`
`STR MIE's, if we look for these
`
`Since all STR loci should already be known
`we could write a dedicated algorithm for
`this. lobSTR may already be this.
`Mark has suggested trying it.
`
`Current reference almost
`certainly does not have the
`most common # STR units at
`each position, let alone by
`ethnicity. This should be easy
`to change in the reference,
`though it requires mapping
`coordinate systems.
`
`Homopolymer errors due to:
`
`Biochemistry fails
`
`Shows up as a drop in coverage and/or a reproducible
`drop in raw Q-scores downstream
`
`A simple dedicated algorithm could
`prioritize getting the sequence right
`downstream of the homopolymer, vs the
`length of the homopolymer, by working
`with reads which begin near the end of the
`
`Dedicated sample prep which
` intentionally and reproducibly
`spikes different bases into these
`regions, to break them up and
`make them more sequenceable.
`
`Seq OK but algorithm fails, esp if het
`Worse (?) if both alleles
`don't match reference
`
`Would this show up as a low alignment Q-score ?
`Would we notice a pile-up of those ?
`Should we add this to our BAM crawler ?
`
`Homopolymer MIE's, if we look for these
`
`Translocation errors:
`
`False positive due to repeat seq
`
`Translocation MIE's, if we look for them
`
`Missed reads which span junction
`
`Translocation detected by paired-end algorithm but
`not split-read algorithm
`
`Robertsonian translocations missed
`Not detectable with next gen seq.
`
`Larger (>50bp) deletion errors:
`
`Zygosity not reported, or wrong
`
`MIE in large deletion detection
`
`Del detected but not junction seq so
`length & position inexact
`
`HugeSeq's SV algorithm requires detection by at
`least two algorithms, but doesn't
`require junction seq detection
`Should it ?
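The "Should it ?" question can be explored by re-filtering existing SV calls with and without a junction requirement and comparing MIE rates. The sketch below is a hypothetical filter, not HugeSeq's implementation: the record fields `callers` and `has_junction` and the caller labels are invented for illustration.

```python
def confident_svs(calls, min_callers=2, require_junction=False):
    """Keep SV calls seen by enough distinct algorithms, optionally requiring junction sequence."""
    kept = []
    for call in calls:
        if len(set(call["callers"])) < min_callers:
            continue
        if require_junction and not call["has_junction"]:
            continue
        kept.append(call)
    return kept

# Made-up calls; caller labels are placeholders
calls = [
    {"id": "del1", "callers": ["read_depth", "paired_end"], "has_junction": False},
    {"id": "del2", "callers": ["paired_end", "split_read"], "has_junction": True},
    {"id": "del3", "callers": ["read_depth"], "has_junction": False},
]
two_callers = [c["id"] for c in confident_svs(calls)]
with_junction = [c["id"] for c in confident_svs(calls, require_junction=True)]
```

Comparing the two lists on a family data set would show how many two-algorithm calls lack junction support and whether those account for a disproportionate share of the MIEs.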
`
`homopolymer, or end just into it.
`Almost all homopolymer loci in the genome
`should be known now, so we can use that info
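Cataloguing homopolymer loci from the reference is straightforward. The sketch below lists runs at or above a minimum length; the 6-base default and the function name are assumptions.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=6):
    """Return (start, base, length) for each run of a single base of at least min_len (0-based start)."""
    runs, pos = [], 0
    for base, group in groupby(seq):
        n = len(list(group))
        if n >= min_len:
            runs.append((pos, base, n))
        pos += n
    return runs

# Made-up sequence with one 7-base A run starting at index 3
runs = homopolymer_runs("ACG" + "A" * 7 + "TTG")
```

With the loci known in advance, a dedicated algorithm could select reads that begin near the downstream end of each run and use them to get the following sequence right, independently of the run length.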
`
`Does lobSTR potentially help here ?
`
`Current reference likely does
`not have the most common
`# homopolymer base units
`in all of these regions, let alone
`by ethnicity
`
`Paper circulated fairly recently (by Gemma ?)
`showed success dealing with this by filtering
`out repeat regions & using long mate pairs
`
`HugeSeq has two algorithms which look
`for junction sequences & could filter out
`potential translocations for which it did not
`find them
`
`Map out regions of systematically low
`coverage (in the absence of a deletion)
`Potentially useful in distinguishing deletions
`from normal coverage variation
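A map of systematically low coverage could be built by comparing the same windows across several genomes sequenced the same way. The sketch below flags windows whose mean depth is below a fraction of each sample's own overall mean in every sample; the window size, the 0.5 fraction and the function names are assumptions.

```python
def window_means(depth, window):
    """Mean depth of consecutive non-overlapping windows."""
    means = []
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means

def systematic_low_windows(coverage_by_sample, window=100, frac=0.5):
    """Indices of windows that are low (below frac x sample mean) in every sample."""
    low_sets = []
    for depth in coverage_by_sample.values():
        overall = sum(depth) / len(depth)
        means = window_means(depth, window)
        low_sets.append({i for i, m in enumerate(means) if m < frac * overall})
    return sorted(set.intersection(*low_sets)) if low_sets else []

# Made-up per-base depths: window 1 is low in both samples, window 3 only in s2
coverage_by_sample = {
    "s1": [30] * 10 + [5] * 10 + [30] * 20,
    "s2": [20] * 10 + [3] * 10 + [20] * 10 + [4] * 10,
}
shared_low = systematic_low_windows(coverage_by_sample, window=10)
```

A window that is low in all unrelated samples is unlikely to be a deletion in any one of them, which is the distinction the read-depth algorithm needs.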
`
`Junction sequences are most likely to be
`missed where coverage is systematically low
`and a deletion is heterozygous.
`Also where deletion removes one unit of
`a multiple unit tandem repeat or STR
`because junction seq can be same with or
`without deletion. Potentially solvable with
`a dedicated algorithm if prioritized
`
`Could use a targeted
`meganuclease.
`What priority / value ?
`
`Supplement sequencing with
` long mate pairs
`
`Supplement next gen seq with
`karyotyping
`
`Longer reads should be
`proportionately better at detecting
`junction sequences (with & w/o
`deletion, in case of het del).
`If we can put more weight on
`junction-based algorithms, we
`will be less subject to false
`positives from coverage variation.
`
`Potential for pull-out in junction
`sequence regions, to increase
`ability to detect these known
` junctions
`
`False positive from read-depth algorithm
`due to coverage variation caused
`by other mechanisms
`
`MIE in large deletion detection, and HugeSeq
`requires two algorithms to detect
`SV's confidently
`
`See above
`
`Larger insertion errors:
`
`Haven't investigated enough to categorize
`
`MIE's and reproducibility with multiple PE insert
`lengths
`
`Haven't investigated enough to
`propose approaches
`
`Can gold standard genomes
`and 1,000 Genomes' SV data
`help produce a better ref
`sequence re insertions ?
`
`Longer insert read-pairs might
`be quite helpful.
`
`CNV errors:
`
`Copy number wrong due to mis-alignment
`esp in repeats
`
`Total copy # OK, but heterozygosity wrong
`e.g. 4x as 2+2 vs 3+1
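The 2+2 versus 3+1 distinction shows up in the allele fraction: at a het SNP inside a 4-copy region, 2+2 predicts about 50% alternate reads, while 3+1 predicts about 25% or 75%. The sketch below picks the split with the highest binomial log-likelihood. It assumes unbiased sampling of alleles and a known total copy number, and the function names are hypothetical.

```python
from math import log

def log_lik(alt, total, frac):
    """Binomial log-likelihood of alt reads out of total, up to a constant term."""
    return alt * log(frac) + (total - alt) * log(1 - frac)

def best_allele_split(alt, total, copies=4):
    """Return the heterozygous (alt_copies, ref_copies) split that best explains the read counts."""
    best = None
    for alt_copies in range(1, copies):
        ll = log_lik(alt, total, alt_copies / copies)
        if best is None or ll > best[0]:
            best = (ll, (alt_copies, copies - alt_copies))
    return best[1]

# Made-up counts: 25 alternate reads out of 100 in a 4-copy region
split = best_allele_split(25, 100)
```

Combining several het SNPs within the same CNV block would make the call more robust than any single locus.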
`
`Junction sequences of CNV's not detected
`
`CNV MIE's, if we look for these
`
`CNV MIE's, if we look for these
`
`Algorithms to improve alignment in
`degenerate & compression regions
`might also help with CNV's - another
`variation on the same type of problem
`
`Gold standard genomes, esp
`using molecules 10-100kbp,
`might help solve some of this
`and create a framework on
`which to organize the
`population statistics
`
`HugeSeq's SV algorithm requires detection by at
`least two algorithms, but doesn't
`require junction seq detection
`Should it ?
`
`Complex CNV's: differing, overlapping blocks
`Also re-arrangements, deletions, etc
`
`MIE's and reproducibility with multiple PE insert
`lengths
`
`Reference sequence errors:
`
`Incorporates rare SNP alleles,
`leading to mis-alignment
`and missed variant calls
`
`Incorporates rare InDel alleles,
`leading to mis-alignment
`
`Incorporates rare SV alleles, leading to SV
`inverse (e.g. Insert for Del)
`
`Compressions (large duplications not
`recognized in the reference)
`
`Gaps in the reference
`
`Reference is not 1,000 Genomes major allele
`Done by Rick Dewey for PLoS paper
`
`Hugo plans to look for InDels with >50% Alt
`Allele Frequency in 1,000 Genomes
`result he calculated
`
`Hugo plans to look for SV's with >50% Alt
`Allele Frequency in 1,000 Genomes
`result he calculated
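Putting alleles with more than 50% alternate allele frequency into the reference changes coordinates whenever the allele is an InDel. The sketch below is not Hugo's method; it is an illustration that applies non-overlapping variants above a frequency threshold to a sequence and records the shifts needed to map old coordinates to new ones. The tuple layout and function names are assumptions.

```python
def apply_major_alleles(ref, variants, threshold=0.5):
    """Apply variants whose alt allele frequency exceeds threshold.

    variants: (pos0, ref_allele, alt_allele, alt_af) tuples, 0-based and non-overlapping.
    Returns (new_sequence, shifts), where each shift is (old_pos0, delta) and applies
    to old positions at or after old_pos0.
    """
    out, shifts, cursor = [], [], 0
    for pos, ref_allele, alt, af in sorted(variants):
        if af <= threshold:
            continue
        if pos < cursor or ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"variant at {pos} overlaps another or does not match the reference")
        out.append(ref[cursor:pos])
        out.append(alt)
        cursor = pos + len(ref_allele)
        if len(alt) != len(ref_allele):
            shifts.append((cursor, len(alt) - len(ref_allele)))
    out.append(ref[cursor:])
    return "".join(out), shifts

def lift(pos, shifts):
    """Map an old 0-based coordinate to the new sequence."""
    return pos + sum(delta for start, delta in shifts if pos >= start)

# Made-up example: a common 2-base insertion, a common 1-base deletion, and a rare SNP
ref = "ACGTACGT"
variants = [(2, "G", "GTT", 0.8), (5, "CG", "C", 0.6), (0, "A", "T", 0.1)]
new_ref, shifts = apply_major_alleles(ref, variants)
```

Positions inside a replaced allele have no exact counterpart in the new sequence, so `lift` is only meaningful for bases outside the edited alleles.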
`
`All quartet family members het where inheritance
`state would not allow that
`Split of reads at het loci is far from 50/50
`esp in very high coverage genome
`Blocks of elevated coverage seen identically in
`other ethnicities
`Loci tri-allelic in a single individual
`Three haplotypes in one region of a single genome
`
`These are well documented in the reference
`itself.
`
`More complex cases could likely be
`addressed with more sophisticated
`and dedicated algorithms.
`Development of these would benefit from
`interaction with an effort to build gold
`standard genomes using maximal
`technology.
`
`Update with newer 1,000 Genomes data ?
`Cross-check against NHLBI exome major alleles ?
`Call all interpretable loci regardless of
`variance vs reference sequence
`
`Addressing this is on Hugo's project plan
`(right ? When ?)
`
`Addressing this is on Hugo's project plan
`(right ? When ?)
`
`Make an effort to validate more compressions
`from Rick Dewey's list: Top 36 was not a limit -
`We just ran out of time manually validating from
`the original "top 50".
`Alignment algorithm proposed based on
`population data, supplemented by gold
`standard genomes
`
`Look for reads which align at edge of gap
`and extend into the "unknown" to build
`our own extensions to narrow the gaps.
`Others have done some of this, which we
`might leverage.
` If we wanted to make significant progress
`here, it could be very hard and we would
`certainly want to understand what kept the
`original human genome project from
`closing these gaps. They certainly tried.
`
`Reads of very long molecules
`would help. Not just mate pairs,
`but possibly very long insert
`clones, Moleculo approach or
`Shendure approach or CGI "long
`fragment read" approach.
`