`Case 1:20-cv-01580-LPS Document 40-3 Filed 03/05/21 Page 1 of 287 PageID #: 8603
`
`EXHIBIT 76
`
`OUTSIDE ATTORNEYS' EYES ONLY INFORMATION
`
`GUARDFM00914128
`
`A1090
`
`
`Detection of ultra-rare mutations by
`next-generation sequencing
`
Michael W. Schmitt^a, Scott R. Kennedy^a, Jesse J. Salk^a, Edward J. Fox^a, Joseph B. Hiatt^b, and Lawrence A. Loeb^a,c,1

Departments of ^aPathology, ^bGenome Sciences, and ^cBiochemistry, University of Washington School of Medicine, Seattle, WA 98195

Edited* by Mary-Claire King, University of Washington, Seattle, WA, and approved July 3, 2012 (received for review June 6, 2012)
`
`Next-generation DNA sequencing promises to revolutionize clinical
`medicine and basic research. However, while this technology has
`the capacity to generate hundreds of billions of nucleotides of DNA
`sequence in a single experiment, the error rate of ~1% results in
`hundreds of millions of sequencing mistakes. These scattered errors
`can be tolerated in some applications but become extremely prob-
`lematic when "deep sequencing" genetically heterogeneous mix-
`tures, such as tumors or mixed microbial populations. To overcome
`limitations in sequencing accuracy, we have developed a method
`termed Duplex Sequencing. This approach greatly reduces errors
`by independently tagging and sequencing each of the two strands
`of a DNA duplex. As the two strands are complementary, true muta-
`tions are found at the same position in both strands. In contrast, PCR
`or sequencing errors result in mutations in only one strand and can
`thus be discounted as technical error. We determine that Duplex
`Sequencing has a theoretical background error rate of less than
`one artifactual mutation per billion nucleotides sequenced. In addi-
`tion, we establish that detection of mutations present in only one of
`the two strands of duplex DNA can be used to identify sites of
`DNA damage. We apply the method to directly assess the frequency
`and pattern of random mutations in mitochondrial DNA from
`human cells.
`
cancer | diagnostics | subclone | quasispecies | biomarker
`
The advent of massively parallel DNA sequencing has ushered
`in a new era of genomic exploration by making simultaneous
`genotyping of hundreds of billions of base pairs possible at a small
`fraction of the time and cost of traditional Sanger methods (1).
`Unlike conventional techniques, which simply report the average
`genotype of an aggregate collection of molecules, next-generation
`sequencing technologies digitally tabulate the sequence of many
`individual DNA fragments, thus offering the unique ability to
`detect minor variants within heterogeneous mixtures. This con-
`cept of “deep sequencing” has been implemented in a variety of
`fields including metagenomics (2), paleogenomics (3), forensics
`(4), and human genetics (5) to disentangle subpopulations in
`complex biological samples. Clinical applications are rapidly being
`developed, such as prenatal screening for fetal aneuploidy (6),
`early detection of cancer (7), and monitoring its response to
`therapy (8) with nucleic acid-based serum biomarkers.
`Although, in theory, DNA subpopulations of any size should be
`detectable when deep sequencing a sufficient number of mole-
`cules, a practical limit of detection is imposed by errors introduced
`during sample preparation and sequencing (9). PCR amplification
`of heterogeneous mixtures can result in population skewing due to
`differential amplification (10, 11), and polymerase mistakes gen-
`erate point mutations resulting from base misincorporations and
`rearrangements due to template switching (10, 12). Combined
`with the additional errors that arise during cluster amplification,
cycle sequencing, and image analysis, ~1% of bases are incorrectly
identified, depending on the specific platform and sequence context
(1). This background level of artifactual heterogeneity
establishes a limit below which the presence of true rare variants is
obscured (9).
`A variety of improvements at the level of biochemistry (13, 14)
and data processing (14-19) have been developed to improve sequencing
accuracy. In addition, techniques whereby PCR duplicates
arising from individual DNA fragments can be resolved on
`the basis of unique random shear points (20) or via exogenous
tagging (21, 22) before amplification (23-28) have recently been
`reported. Because all amplicons derived from a particular starting
`molecule can be explicitly identified, any variation in the sequence
`or copy number of identically tagged sequencing reads can be
`discounted as technical error. This approach has been used to
`improve counting accuracy of DNA (25, 26, 28) and RNA tem-
`plates (24, 25, 27, 29) and to correct base errors arising during
`PCR or sequencing (20, 23, 24, 26). For example, Kinde et al. (23)
`reported a reduction in error frequency of ~20-fold with a tagging
`method that is based on labeling single-stranded DNA fragments
`with a primer containing a 14-bp degenerate sequence. This ap-
`proach allowed for an observed mutation frequency of ~0.001%
`mutations/bp in normal human genomic DNA. Nevertheless,
`a number of highly sensitive genetic assays have indicated that the
`true mutation frequency in normal cells is likely to be far lower,
`with estimates of per-nucleotide mutation frequencies generally
ranging from 10^-8 to 10^-11 (30, 31). Thus, the majority of mutations
seen in normal human genomic DNA by this method potentially
still represent technical artifacts.
Prevailing next-generation sequencing platforms generate sequence
data from single-stranded fragments of DNA. As a consequence,
artifactual mutations introduced during the initial
`round of PCR amplification are undetectable as errors—even with
`tagging techniques—if the base change is propagated to all sub-
`sequent PCR duplicates. Multiple types of DNA damage are
`highly mutagenic and may lead to this scenario. Spontaneous
`DNA damage arising from normal metabolic processes results in
`thousands of damaging events per cell per day (32), and additional
`DNA damage is generated ex vivo during tissue processing and
`DNA extraction (33).
`Limitations inherent to sequencing of single-stranded DNA
`can be overcome, however, as DNA naturally exists as a double-
`stranded entity, with one molecule reciprocally encoding the se-
`quence information of its partner. Thus, it should be feasible to
`identify and correct nearly all forms of sequencing errors by
`comparing the sequence of individual tagged amplicons derived
`from one half of a double-stranded complex with those of the
`other half of the same molecule. Herein, we present an approach
`for tag-based error correction, termed Duplex Sequencing, which
`capitalizes on the redundant information stored in complexed
`double-stranded DNA. Our method has a theoretical background
error rate of less than one artifactual error per 10^9 nucleotides
`sequenced and thus allows rare variants in heterogeneous pop-
`ulations to be detected with unprecedented sensitivity.
`
Author contributions: M.W.S., S.R.K., J.J.S., and L.A.L. designed research; M.W.S., S.R.K.,
and E.J.F. performed research; M.W.S., S.R.K., and J.B.H. contributed new reagents/analytic
tools; M.W.S., S.R.K., J.J.S., and L.A.L. analyzed data; and M.W.S., S.R.K., J.J.S., and
L.A.L. wrote the paper.
`The authors declare no conflict of interest.
`
`*This Direct Submission article had a prearranged editor.
`Freely available online through the PNAS open access option.
^1To whom correspondence should be addressed. E-mail: laloeb@u.washington.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1208715109/-/DCSupplemental.
`
PNAS Early Edition

`Results
`
`To improve the sensitivity of variant detection by next-generation
`DNA sequencing, we designed an alternative approach to library
`preparation and analysis that we term Duplex Sequencing. The
`method entails tagging both strands of duplex DNA with a random,
`yet complementary double-stranded nucleotide sequence, which
`we refer to as a Duplex Tag. Double-stranded tag sequences are
`incorporated into standard Illumina sequencing adapters by first
`introducing a single-stranded randomized nucleotide sequence into
`one adapter strand and then extending the opposite strand with
`a DNA polymerase to yield a complementary, double-stranded tag
`(Fig. 1A). Following ligation of tagged adapters to sheared DNA,
the individually labeled strands are PCR amplified from asymmetric
primer sites on the adapter tails (Fig. 1B) and subjected to
`paired-end sequencing. Every PCR duplicate that arises from
`a single strand of DNA will carry the original strand’s tag sequence.
`Owing to the complementary nature of the Duplex Tags on the two
`strands, each strand in a DNA duplex pair generates a distinct, yet
`related, population of PCR duplicates. Comparing the sequence
`obtained from each of the two strands in a duplex facilitates dif-
`ferentiation of sequencing errors from true mutations: when an
`apparent mutation is, in fact, due to a PCR or sequencing error, the
`substitution will only be seen on a single strand. In contrast, with
`a true DNA mutation, complementary substitutions will be present
`on both strands.
`
`During the PCR amplification step after tagging, many duplicate
`“families” of molecules are created, each of which arose from
`a single strand of an individual DNA molecule. After sequencing,
members of each PCR family are identified and grouped by virtue
of sharing an identical tag sequence (Fig. 1C). The sequences of
`uniquely tagged PCR duplicates are then compared to create
`a PCR consensus sequence. Only PCR families consisting of at
`least three duplicates and yielding the same sequence in at least
`90% of the members at a given position are used to create the
`consensus sequence. This step filters out random errors introduced
`during sequencing or PCR to yield a set of sequences, each of which
`derives from an individual molecule of single-stranded DNA. We
`refer to these as single strand consensus sequences (SSCSs).
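
The family-collapsing rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the (tag, sequence) input shape, the requirement that reads in a family have equal length, and the choice to mask low-agreement positions with N are all hypothetical. Only the two thresholds (at least three duplicates per family, at least 90% agreement at a position) come from the text.

```python
from collections import Counter, defaultdict

MIN_FAMILY_SIZE = 3   # from the text: at least three duplicates per family
MIN_AGREEMENT = 0.9   # from the text: same base in at least 90% of members


def single_strand_consensus(reads):
    """Collapse equal-length reads from one tag family into a consensus.

    Positions where no base reaches MIN_AGREEMENT are written as 'N'
    (an assumption made for this sketch). Returns None when the family
    has fewer than MIN_FAMILY_SIZE members.
    """
    if len(reads) < MIN_FAMILY_SIZE:
        return None
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= MIN_AGREEMENT else "N")
    return "".join(consensus)


def build_sscs(tagged_reads):
    """Group (tag, sequence) pairs by tag and return {tag: consensus}."""
    families = defaultdict(list)
    for tag, seq in tagged_reads:
        families[tag].append(seq)
    sscs = {}
    for tag, seqs in families.items():
        consensus = single_strand_consensus(seqs)
        if consensus is not None:
            sscs[tag] = consensus
    return sscs


# Hypothetical reads: one family of three (one read has an error at the
# last position) and one family of two, which is too small to be used.
example_reads = [
    ("AAAC", "ACGT"), ("AAAC", "ACGT"), ("AAAC", "ACGA"),
    ("GGGT", "TTTT"), ("GGGT", "TTTT"),
]
print(build_sscs(example_reads))
```

In the three-member family, two of three reads agree at the last position, which falls short of 90%, so that position is masked rather than called.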
`Next, sequences belonging to the two complementary strands of
`each DNA duplex are identified by searching for complementary
`tag sequences among SSCS reads. Specifically, a 24-nucleotide tag
`sequence consists of two 12-nucleotide sequences at each end of
the molecule that can be designated α and β. For a tag of form αβ
in read 1, the opposite strand’s tag will be of form βα in read 2.
`Following partnering of the two strands, the sequences of the
`strands are compared. A sequence base at a given position is kept
`only if the read data from each of the two strands matches per-
`fectly. A detailed illustration of the approach is provided in SI
`Materials and Methods. Comparing the sequences obtained from
`both strands eliminates errors introduced during the first round of
`PCR where an artifactual mutation may be propagated to all PCR
`duplicates of one strand and would not be removed by SSCS fil-
`tering alone. We refer to the resulting high-confidence sequences
`of individual DNA duplex molecules as duplex consensus
`sequences (DCSs).
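
The strand-partnering step can be sketched as follows. This is a simplified illustration, not the published implementation: it assumes the single-strand consensus sequences are held in a dictionary keyed by the 24-nucleotide tag, that both strands' sequences have already been expressed in the same reference orientation so they can be compared position by position, and it ignores read 1/read 2 bookkeeping. The function names and the use of N for discordant positions are hypothetical.

```python
TAG_HALF = 12  # from the text: the 24-nt tag is two 12-nt sequences


def partner_tag(tag):
    """Swap the two halves of a tag: alpha+beta becomes beta+alpha."""
    return tag[TAG_HALF:] + tag[:TAG_HALF]


def duplex_consensus(strand_a, strand_b):
    """Keep a base only where both strand consensuses agree, else 'N'."""
    if len(strand_a) != len(strand_b):
        raise ValueError("strand consensus sequences differ in length")
    return "".join(
        a if a == b and a != "N" else "N" for a, b in zip(strand_a, strand_b)
    )


def build_dcs(sscs_by_tag):
    """Pair each alpha-beta consensus with its beta-alpha partner.

    Families without a partner, and tags whose halves are identical
    (which cannot be assigned to a strand), are skipped. Each duplex is
    stored once, under the lexicographically smaller of its two tags.
    """
    dcs = {}
    for tag, seq in sscs_by_tag.items():
        mate = partner_tag(tag)
        if mate == tag or mate not in sscs_by_tag:
            continue
        key = min(tag, mate)
        if key not in dcs:
            dcs[key] = duplex_consensus(seq, sscs_by_tag[mate])
    return dcs


# Hypothetical input: one complete duplex whose strands disagree at the
# last position, plus one family that has no partner strand.
alpha, beta = "A" * 12, "C" * 12
example_sscs = {
    alpha + beta: "ACGTA",
    beta + alpha: "ACGTT",
    "T" * 12 + "G" * 12: "CCCCC",
}
print(build_dcs(example_sscs))
```

A substitution present in only one strand's consensus, such as a first-round PCR error, is dropped at this step because the partner strand does not support it.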
`
`Duplex Sequencing of M13 DNA. To establish the sensitivity of
`Duplex Sequencing, we first applied the method to M13mp2
`

[Figure 1 graphic not reproduced. Labels recoverable from the graphic: fixed sequence; 12-randomized-base Duplex Tag; extension with polymerase + 4 dNTPs; A-tailing with polymerase + dATP; Duplex Tags; T-tailed DNA fragment; ligation & size selection; PCR with flow-cell adapter sequence; αβ family; βα family; single-strand consensus sequences (SSCS); duplex consensus sequence (DCS); non-mutants; true mutants.]

Fig. 1. Overview of Duplex Sequencing. (A) Adapter synthesis. A double-stranded, randomized Duplex Tag sequence is appended to a sequencing adapter by
copying a degenerate sequence in one strand of the adapter with DNA polymerase. Complete adapter A-tailing is ensured by extended incubation with
polymerase and dATP. (B) Duplex Sequencing workflow. Sheared, T-tailed double-stranded DNA is ligated to A-tailed adapters. Because every adapter contains
a Duplex Tag on each end, every DNA fragment becomes labeled with two distinct tag sequences (arbitrarily designated α and β in the single fragment shown).
PCR amplification with primers containing Illumina flow-cell-compatible tails is carried out to generate families of PCR duplicates. Two types of PCR products are
produced from each DNA fragment. Those derived from one strand will have the α tag sequence adjacent to flow cell sequence 1 and the β tag sequence
adjacent to flow cell sequence 2. PCR products originating from the complementary strand are labeled reciprocally. (C) Error correction. (i-iii) Sequence reads
sharing a unique set of tags are grouped into paired families with members having strand identifiers in either the αβ or βα orientation. Each family pair reflects
the amplification of one double-stranded DNA fragment. (i) Mutations (colored spots) present in only one or a few family members represent sequencing
mistakes or PCR-introduced errors occurring late in amplification. (ii) Mutations occurring in many or all members of one family in a pair arise from PCR errors
during the first round of amplification such as might occur when copying across sites of mutagenic DNA damage. (iii) True mutations (green) present on both
strands of a DNA fragment appear in all members of a family pair. Whereas artifactual mutations may co-occur in a family pair with a true mutation, all except
those arising during the first round of PCR amplification can be independently identified and discounted when producing (iv) an error-corrected single-strand
consensus sequence (SSCS). The sequences obtained from each of the two strands of an individual DNA duplex can then be compared to obtain (v) the duplex
consensus sequence (DCS), which eliminates remaining errors that occurred during the first round of PCR.
`
`
[Figure 2 graphic not reproduced. Labels recoverable from the graphic: Q30 reads; SSCS; y-axis "Mutation frequency" (log scale).]
`
DNA, which is a substrate that has been used extensively in sensitive
genetic mutation assays and has a well-established base
substitution frequency of 3.0 × 10^-6 (34). M13mp2 DNA was
sheared and ligated to Duplex Sequencing adapters and subjected
to deep sequencing on an Illumina HiSeq 2000 (Fig. 2A). Analysis
of the data by standard methods (i.e., without consideration of the
double-stranded tag sequences and with quality filtering for
a Phred score of 30) resulted in an error frequency of 3.8 × 10^-3,
more than 1,000-fold higher than the true mutation frequency of
M13mp2 DNA. Thus, >99.9% of the apparent mutations identified
by standard sequencing are erroneous.
We generated SSCSs by using the unique tag affixed to each
molecule to create a consensus of all PCR products that came
from an individual molecule of single-stranded DNA. This
resulted in a mutation frequency of 3.4 × 10^-5, suggesting that
~99% of sequencing errors are corrected in SSCS reads. However,
this mutation frequency is >10-fold higher than the reference
value of 3.0 × 10^-6, indicating that ~90% of the mutations
identified by SSCSs are still artifacts.
Next, we further corrected errors by using the complementary
tags to compare the DNA sequence arising from the two strands of
each single molecule of duplex DNA to create DCSs. This approach
resulted in a mutation frequency of 2.5 × 10^-6, nearly
identical to the frequency of 3.0 × 10^-6 determined by well-established
genetic methods (34). The number of nucleotides of
DNA sequence obtained by a standard sequencing approach, and
after SSCS and DCS analysis, may be found in Table S1.
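
The percentages quoted for the M13mp2 experiment follow from simple ratios of the reported frequencies. The short sketch below reproduces that arithmetic; the function names are hypothetical, and treating everything above the reference frequency as artifact is the same simplification the text makes.

```python
REFERENCE = 3.0e-6  # M13mp2 base substitution frequency from ref. 34
OBSERVED = {"standard (Q30)": 3.8e-3, "SSCS": 3.4e-5, "DCS": 2.5e-6}


def artifact_fraction(observed, reference=REFERENCE):
    """Fraction of observed mutations in excess of the reference value."""
    return max(0.0, 1.0 - reference / observed)


def fraction_corrected(before, after):
    """Fraction of apparent mutations removed by a correction step."""
    return 1.0 - after / before


for method, freq in OBSERVED.items():
    print(
        f"{method}: {freq / REFERENCE:.2f}x reference, "
        f"{artifact_fraction(freq):.2%} above reference"
    )

print(
    "removed by SSCS:",
    f"{fraction_corrected(OBSERVED['standard (Q30)'], OBSERVED['SSCS']):.1%}",
)
```

The DCS value falls slightly below the reference, so its excess over the reference is reported as zero.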
`
DNA Damage Alters SSCS Mutation Spectrum. We next examined the
spectrum of mutations identified by both SSCS and DCS analysis
relative to literature reference values (34) for the M13mp2 substrate
(Fig. 2B). SSCS analysis revealed a large excess of G→A/C→T
and G→T/C→A mutations relative to reference (P < 10^-[exponent illegible],
two-sample t test). In contrast, DCS analysis was in excellent
agreement with the literature values with the exception of a decrease
relative to reference of these same mutational events:
G→A/C→T and G→T/C→A (P < 0.01). To probe the potential
cause of these spectrum deviations, the SSCS data were filtered to
consist of forward-mapping reads from read 1 (i.e., direct sequencing
of the reference strand) and the reverse complement of
reverse-mapping reads from read 1 (i.e., direct sequencing of the
antireference strand). True double-stranded mutations should
result in an equal balance of complementary mutations observed
on the reference and antireference strand. However, SSCS analysis
revealed a large number of single-stranded G→T mutations,
with a much smaller number of C→A mutations (Fig. 2C). A
similar bias was seen with a large excess of C→T mutations relative
to G→A mutations.
Base-specific mutagenic DNA damage is a likely explanation of
these imbalances. Excess G→T mutations are consistent with the
oxidative product 8-oxo-guanine (8-oxo-G) causing first round
PCR errors and artifactual G→T mutations. DNA polymerases,
including those commonly used in PCR, have a strong tendency
to insert adenine opposite 8-oxo-G (35, 36), and misinsertion of
A opposite 8-oxo-G would result in erroneous scoring of a G→T
mutation. Likewise, the excess C→T mutations are consistent
with spontaneous deamination of cytosine to uracil (37), a particularly
common DNA damage event that results in insertion
during PCR of adenine opposite uracil and erroneous scoring of
a C→T mutation.
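
The strand-imbalance reasoning above amounts to comparing each substitution type with its reciprocal on the opposite strand. A minimal sketch follows; the function names are hypothetical and the counts are invented for illustration (chosen so the ratios echo the fold excesses reported for Fig. 2C), not data from the paper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reciprocal(mutation):
    """Return the opposite-strand form of a substitution.

    For example ('G', 'T') on one strand corresponds to ('C', 'A')
    when the same change is read from the complementary strand.
    """
    ref, alt = mutation
    return (COMPLEMENT[ref], COMPLEMENT[alt])


def strand_bias(counts, mutation):
    """Ratio of a substitution's count to the count of its reciprocal."""
    other = counts.get(reciprocal(mutation), 0)
    if other == 0:
        return float("inf")
    return counts.get(mutation, 0) / other


# Invented single-strand counts, for illustration only.
example_counts = {
    ("G", "T"): 150, ("C", "A"): 10,
    ("C", "T"): 110, ("G", "A"): 10,
}
for mutation in (("G", "T"), ("C", "T")):
    print(mutation, "vs", reciprocal(mutation), strand_bias(example_counts, mutation))
```

A ratio near 1 is what true double-stranded mutations would give; a large ratio points to a lesion that miscodes on one strand only.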
To determine whether the excess G→T mutations seen in
SSCSs might reflect oxidative DNA damage at guanine nucleotides,
before sequencing library preparation we incubated
M13mp2 DNA with the free radical generator hydrogen peroxide
in the presence of iron, a protocol that induces DNA damage (38).
This treatment resulted in a substantial further increase in G→T
mutations by SSCS analysis (Fig. 3A), consistent with PCR errors
at sites of DNA damage as the likely mechanism of this biased
mutation spectrum. In contrast, induction of oxidative damage did
not alter the mutation spectrum seen with DCS analysis (Fig. 3B),
`
Fig. 2. Duplex Sequencing of M13mp2 DNA. (A) Average mutation frequency
of M13mp2 DNA as measured by a standard sequencing approach, SSCS, and
DCS. Reference value of 3.0 × 10^-6 is from ref. 34. Note that the axis is plotted on
a split-log scale. (B) Single-strand consensus sequences (SSCSs) reveal a large
excess of G→A/C→T and G→T/C→A mutations, whereas duplex consensus
sequences (DCSs) yield a balanced spectrum. Mutation frequencies are grouped
into reciprocal mispairs, as DCS analysis only scores mutations present in both
strands of duplex DNA. All significant (P < 0.05) differences between DCS
analysis and the literature reference values are noted. (C) Complementary types
of mutations should occur at approximately equal frequencies within a DNA
fragment population derived from duplex molecules. However, SSCS analysis
yields a 15-fold excess of G→T mutations relative to C→A mutations and an 11-fold
excess of C→T mutations relative to G→A mutations. All significant (P <
0.05) differences between paired reciprocal mutation frequencies are noted.
`
`
[Figure 3 graphic not reproduced. Labels recoverable from the graphic: (A) SSCS analysis and (B) DCS analysis, each comparing untreated DNA with iron + peroxide-treated DNA; y-axis "Mutation frequency".]
`
Fig. 3. Effect of DNA damage on mutation spectrum. DNA damage was
induced by incubating purified M13mp2 DNA with hydrogen peroxide and
FeSO4. (A) SSCS analysis reveals a further elevation from baseline of G→T
mutations, indicating these events to be the artifactual consequence of
nucleotide oxidation. All significant (P < 0.05) changes from baseline mutation
frequencies are noted. (B) Induced DNA damage had no effect on the
overall frequency or spectrum of DCS mutations.
`
`indicating that duplex consensus sequences are not similarly sus-
`ceptible to DNA damage artifacts.
Furthermore, relative to the literature reference values, DCS
analysis results in a lower frequency of G→T/C→A and C→T/G→A
mutations (Fig. 2B), which are the same mutations elevated
in SSCS analysis as a probable result of DNA damage. Notably,
the M13mp2 LacZ assay, from which reference values have been
derived, is dependent upon bacterial replication of a single molecule
of M13mp2 DNA. Thus, the presence of oxidative damage
within this substrate could cause an analogous first-round replication
error by Escherichia coli, converting a single-stranded
damage event into a fixed, double-stranded mutation during
replication. The slight reduction in the frequency of these two
types of mutations measured by DCS analysis may, therefore,
reflect the absence of damage-induced errors that are scored by
the in vivo LacZ assay.
`
`Mutant Recovery. To further validate the capability of DCS analysis
`to detect rare mutations, we constructed a series of M13mp2
`variants containing specific single base substitutions and mixed the
`variants together at known ratios. The final mixture was then se-
`quenced with Duplex Sequencing adapters. With conventional
`analysis of the sequencing data (i.e., without consideration of the
`tag sequences and filtering for a read quality score of 30), variants
`present at a level of <1/100 could not be accurately identified be-
`cause artifactual mutations occurring at a background frequency of
`about 1/100 obscured the presence of less abundant true mutations
`(Fig. S1). In contrast, when the data are analyzed as duplex con-
`sensus sequences with ~20,000-fold final depth, accurate recovery
`of mutant sequences was possible down to the lowest tested level of
`one mutant molecule per 10,000 wild-type molecules.
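
To give a feel for why a final depth of about 20,000 is relevant at a dilution of one mutant per 10,000 wild-type molecules, the sketch below estimates how many mutant duplex observations to expect at a site. The binomial sampling model, the function name, and the treatment of depth as a count of independent duplex consensus observations are assumptions made for this illustration; the text does not state a statistical model.

```python
def recovery_expectation(depth, wildtype_per_mutant):
    """Expected mutant observations and chance of seeing at least one.

    Assumes each of `depth` independent observations is mutant with
    probability 1 / (1 + wildtype_per_mutant) (binomial sampling).
    """
    freq = 1.0 / (1.0 + wildtype_per_mutant)
    expected = depth * freq
    p_at_least_one = 1.0 - (1.0 - freq) ** depth
    return expected, p_at_least_one


# Values taken from the text: ~20,000-fold depth, 1 mutant per 10,000 wild type.
expected, p_seen = recovery_expectation(20_000, 10_000)
print(f"expected mutant observations: {expected:.2f}")
print(f"probability of at least one:  {p_seen:.3f}")
```

Under these assumptions only a handful of mutant observations are expected at the lowest dilution, which is why a background error rate well below the variant frequency matters.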
`
`Duplex Sequencing of Human Mitochondrial DNA. Having estab-
`lished the methodology for Duplex Sequencing with M13mp2
`DNA, which is a substrate for which the mutation frequency and
`spectrum are fairly well established, we next wished to apply the
`approach to a human DNA sample. Thus, we isolated mitochon-
`drial DNA from human brain tissue and sequenced the DNA after
`ligation of Duplex Sequencing adapters. A standard sequencing
approach with quality filtering for a Phred score of 30 resulted in
a mutation frequency of 2.7 × 10^[exponent illegible], and SSCS analysis yielded
a mutation frequency of 1.5 × 10^[exponent illegible]. In contrast, DCS analysis
revealed a much lower overall mutation frequency of 3.5 × 10^-5
(Fig. 4A). The frequency of mutations in mitochondrial DNA has
previously been difficult to measure directly due in part to sources
of error in existing assays that can result in either overestimation or
underestimation of the true value. An additional confounder has
been that most approaches are limited to interrogation of mutations
within a small fraction of the genome (39). The method of
single-molecule PCR, which has been proposed as an accurate
method of measuring mitochondrial mutation frequency (39) and
is considered resistant to damage-induced background errors (40),
has resulted in a reported mitochondrial mutation frequency in
human colonic mucosa of 5.9 × 10^-5 ± 3.2 × 10^-5 (39), which is in
excellent agreement with our result. Likewise, mitochondrial DNA
sequence divergence rates in human pedigrees are consistent with
a mitochondrial mutation frequency of 3-5 × 10^-5 (41, 42).
`When the distribution of mutations throughout the mitochon-
`drial genome is considered, the quality filtered reads (analyzed
`without consideration of the tags) have many artifactual errors,
`such that identification of mutational hotspots is difficult or im-
`possible (Fig. 4B). DCS analysis removed these artifacts (Fig. 4C)
`and revealed striking hypermutability of the region of replication
`initiation (D loop), which is consistent with prior estimates of
`mutational patterns in mitochondrial DNA based upon sequence
`variation at this region within the population (43).
SSCS analysis produced a strong mutational bias, with a 130-fold
excess of G→T relative to C→A mutations (Fig. 4D), consistent
with oxidative damage of the DNA leading to first-round
PCR mutations as a significant source of background error. A high
level of oxidative damage is expected in mitochondrial DNA, due
to extensive exposure of mitochondria to free radical species
generated as a byproduct of metabolism (44). DCS analysis (Fig.
4E) removed the mutational bias and revealed that transition
mutations are the predominant replication errors in mitochondrial
DNA. The DCS mutation spectrum is in accord with prior
estimates of deamination events (45) and T-dGTP mispairing by
the mitochondrial DNA polymerase (46) as primary mutational
forces in mitochondrial DNA. Furthermore, the mutation spectrum
of our mitochondrial data is consistent with previous
reports of heteroplasmic mutations in human brain showing an
increased load of A→G/T→C and G→A/C→T transitions, relative
to transversions (47, 48). A similar spectral bias has also been
reported in mice (45, 49) and in population studies of Drosophila
melanogaster (50).
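
For readers less familiar with the terminology, the transition/transversion distinction used above can be stated as a small classifier: a transition exchanges one purine for another or one pyrimidine for another, and any other substitution is a transversion. The function name is hypothetical; the definitions themselves are standard.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Label a base substitution as 'transition' or 'transversion'."""
    if ref == alt:
        raise ValueError("reference and alternate base are identical")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# The transitions named in the text, and the G-to-T change
# associated with oxidative damage.
for ref, alt in (("A", "G"), ("T", "C"), ("G", "A"), ("C", "T"), ("G", "T")):
    print(f"{ref}->{alt}: {classify_substitution(ref, alt)}")
```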
`
`Discussion
`
`The accuracy of standard approaches to next-generation se-
`quencing is constrained by a general reliance on analysis of sin-
`gle-stranded DNA, which makes certain technical sources of
`single-stranded errors fundamentally limiting. The complemen-
`tary strands of native duplex DNA harbor redundant sequence
`information and here we have demonstrated an approach for
`error correction, termed Duplex Sequencing, which capitalizes
`on this biochemical redundancy to greatly lower the error rate of
`sequencing.
`The most sensitive approach previously reported for improving
`accuracy of next-generation sequencing involves use of a random
`tag sequence in a PCR primer (23). In this technique, PCR
`duplicates are generated from a single strand of DNA, and the
`
`
[Figure 4 graphic not reproduced. Labels recoverable from the graphic: Q30 reads, SSCS, DCS; Q30 analysis and DCS analysis plotted against genome position; SSCS analysis and DCS analysis plotted against type of mutation (including transversions); y-axis "Mutation frequency" (log scale in the SSCS panel).]
`
Fig. 4. Duplex Sequencing of human mitochondrial
DNA. (A) Overall mutation frequency as measured
by a standard sequencing approach, SSCS, and DCS.
(B) Pattern of mutation in human mitochondrial
DNA by a standard sequencing approach. The mutation
frequency (vertical axis) is plotted for every
position in the ~16-kb mitochondrial genome. Due
to the substantial background of technical error, no
obvious mutational pattern is discernible by this
method. (C) DCS analysis eliminates sequencing
artifacts and reveals the true distribution of mitochondrial
mutations to include a striking excess
adjacent to the mtDNA origin of replication. (D)
SSCS analysis yields a large excess of G→T mutations
relative to complementary C→A mutations, consistent
with artifacts from damage-induced 8-oxo-G
lesions during PCR. All significant (P < 0.05) differences
between paired reciprocal mutation frequencies
are noted. (E) DCS analysis removes the
SSCS strand bias and reveals the true mtDNA mutational
spectrum to be characterized by an excess
of transitions.
`
`
`
`sequences of tag-identified duplicates are compared such that
`mutations are scored only when present in multiple duplicates.
`This method, conceptually analogous to our approach of SSCS
`analysis, results in ~20-fold improvement in accuracy relative to
`standard Illumina sequencing, but is presumably susceptible to
`the same sort of artifactual, largely damage-mediated, first-round
`PCR errors we observed in SSCS.
`Notably, because SSCS is prone to damage-induced PCR errors,
SSCS analysis can be used as a tool for detection of sites and patterns
of DNA damage occurring in vivo. For example, the occurrence
of G→T mutations in SSCS analysis in excess of reciprocal
C→A mutations can be used as a marker for the extent of oxidative
`DNA damage in a sample. The ability to detect damage by SSCS
`could be further enhanced by using different DNA polymerases in
`the initial rounds of PCR, which have a proclivity to catalyze spe-
`cific misinsertions opposite defined types of damage (51, 52).
`In contrast to the damage sensitivity of single-strand consensus
`sequences, for DNA damage to result in an artifactual mutation in
`DCS, mutagenic lesions (or spontaneous, recurrent first-round
`PCR errors) would need to occur at the same nucleotide position
`on both strands of a molecule of duplex DNA and result in
`complementary errors. Thus, the background error frequency of
our method may be calculated as (probability of error on one
strand) × (probability of error on other strand) × (probability that
both errors are complementary).
Based on the SSCS background error frequency of 3.4 × 10^-5
from the M13mp2 DNA sequencing experiment, the error frequency
of DCS can be approximated as: (3.4 × 10^-5) × (3.4 × 10^-5) ×
1/3 = 3.8 × 10^-10. This calculated error frequency represents a 10
million-fold improvement over the 3.8 × 10^-3 value we obtained by
`standard methods. Of note, the calculation simplistically assumes
`that all mutational events are equally likely by multiplying by the
`factor one-third (because any given nucleotide can mutate to any
`one of three other nucleotides). In reality, the strong mutational
`bias observed in SSCSs indicates that reciprocal mispairs are not
`equally probable and, hence, the background of DCS is expected
`to be lower than this estimate.
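
The background-error estimate above is a product of three probabilities and is easy to check numerically. The sketch below uses the frequencies given in the text; the function name is hypothetical, and the default of one-third for the complementarity factor is the simplifying assumption the authors themselves describe.

```python
def dcs_background(sscs_error, p_complementary=1 / 3):
    """Estimate the DCS background error frequency.

    Multiplies the per-strand error probability by itself (independent
    errors at the same position on both strands) and by the probability
    that the two errors happen to be complementary.
    """
    return sscs_error * sscs_error * p_complementary


SSCS_ERROR = 3.4e-5      # SSCS background frequency, M13mp2 experiment
STANDARD_ERROR = 3.8e-3  # frequency obtained by standard methods

estimate = dcs_background(SSCS_ERROR)
improvement = STANDARD_ERROR / estimate
print(f"estimated DCS background: {estimate:.2e}")
print(f"fold improvement over standard methods: {improvement:.2e}")
```

Passing a smaller `p_complementary` models the point made in the text that reciprocal mispairs are not equally probable, which lowers the estimate further.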
`
`In addition to their application for high sensitivity detection of
`rare DNA variants, the degenerate tags in our Duplex Sequencing
`adapters can also be used for single-molecule counting to precisely
`determine absolute DNA or RNA copy numbers (25, 29). Because
`tagging occurs before amplification, the relative abundance of
`variants in a population can be accurately assessed given that
`proportional representation is not subject to skewing by amplifi-
`cation biases. As with their use for error correction, because the
`degenerate tags are present in the adapters, there are no addi-
`tional steps required during library preparation, which is in con-
`trast to many existing methods of tag-based counting.
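
The counting idea can be illustrated by tallying distinct tags rather than reads. This is a minimal sketch with hypothetical names and invented reads; it assumes each read has already been assigned a tag and an allele call, and it ignores tag collisions and sequencing errors within tags.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct tags per allele.

    reads: iterable of (tag, allele) pairs, one per sequencing read.
    Returns {allele: number of distinct tags}, an estimate of how many
    starting molecules carried each allele.
    """
    tags_by_allele = defaultdict(set)
    for tag, allele in reads:
        tags_by_allele[allele].add(tag)
    return {allele: len(tags) for allele, tags in tags_by_allele.items()}


# Invented reads: the single mutant molecule (tag t3) amplified more
# efficiently than the two wild-type molecules (tags t1 and t2).
example = [("t1", "wt")] * 5 + [("t2", "wt")] + [("t3", "mut")] * 6

molecules = count_molecules(example)
read_fraction = sum(1 for _, allele in example if allele == "mut") / len(example)
molecule_fraction = molecules["mut"] / sum(molecules.values())
print(molecules)
print(f"mutant fraction by reads: {read_fraction:.2f}")
print(f"mutant fraction by molecules: {molecule_fraction:.2f}")
```

In this toy case the read-level fraction overstates the mutant abundance, while the tag-level count is unaffected by how many duplicates each molecule produced.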
`In principle, Duplex Sequencing could be performed on the
`Illumina or similar platforms without the use of Duplex Tags, but
`instead by using the randomly sheared ends of the DNA fragments
`as unique identifiers (20): specifically, for a given DNA sequence
seen in sequencing read 1 with 5’ sheared end sequence α and 3’
sheared end sequence β used as a tag of form αβ, the partner
strand will occur as a matching sequence in read 2 tagged with 5’
shear point β and 3’ shear point α. In practice,