`WO 2013/142389
`This application claims priority to U.S. Provisional Patent Application No.
`61/613,413, filed March 20, 2012; U.S. Provisional Patent Application No. 61/625,623,
`filed April 17, 2012; and U.S. Provisional Patent Application No. 61/625,319, filed April
`17, 2012; the subject matter of all of which are hereby incorporated by reference as if
`fully set forth herein.
`The present invention was made with government support under Grant
Nos. R01 CA115802 and R01 CA102029 awarded by the National Institutes of Health.
`The Government has certain rights in the invention.
`The advent of massively parallel DNA sequencing has ushered in a new
`era of genomic exploration by making simultaneous genotyping of hundreds of billions
of base-pairs possible at a small fraction of the time and cost of traditional Sanger
methods [1]. Because these technologies digitally tabulate the sequence of many
`individual DNA fragments, unlike conventional techniques which simply report the
`average genotype of an aggregate collection of molecules, they offer the unique ability
`to detect minor variants within heterogeneous mixtures [2].
This concept of "deep sequencing" has been implemented in a variety of
fields including metagenomics [3, 4], paleogenomics [5], forensics [6], and human
genetics [7, 8] to disentangle subpopulations in complex biological samples. Clinical
applications, such as prenatal screening for fetal aneuploidy [9, 10], early detection of
cancer [11] and monitoring its response to therapy [12, 13] with nucleic acid-based
serum biomarkers, are rapidly being developed. Exceptional diversity within microbial
[14, 15], viral [16-18] and tumor cell populations [19, 20] has been characterized through
`next-generation sequencing, and many low-frequency, drug-resistant variants of
`therapeutic importance have been so identified [12, 21, 22]. Previously unappreciated
intra-organismal mosaicism in both the nuclear [23] and mitochondrial [24, 25] genome
`has been revealed by these technologies, and such somatic heterogeneity, along with
`that arising within the adaptive immune system [13], may be an important factor in
`phenotypic variability of disease.
`Deep sequencing, however, has limitations. Although, in theory, DNA
`subpopulations of any size should be detectable when deep sequencing a sufficient
`number of molecules, a practical limit of detection is imposed by errors introduced
`during sample preparation and sequencing. PCR amplification of heterogeneous
mixtures can result in population skewing due to stochastic and non-stochastic
amplification biases and lead to over- or under-representation of particular variants [26].
Polymerase mistakes during pre-amplification generate point mutations resulting from
base mis-incorporations and rearrangements due to template switching [26, 27].
Combined with the additional errors that arise during cluster amplification, cycle
sequencing and image analysis, approximately 1% of bases are incorrectly identified,
`depending on the specific platform and sequence context [2, 28]. This background level
`of artifactual heterogeneity establishes a limit below which the presence of true rare
`variants is obscured [29].
`A variety of improvements at the level of biochemistry [30-32] and data
`processing [19, 21, 28, 32, 33] have been developed to improve sequencing accuracy.
The ability to resolve subpopulations below 0.1%, however, has remained elusive.
Although several groups have attempted to increase sensitivity of sequencing, several
limitations remain. For example, techniques whereby DNA fragments to be sequenced
`are each uniquely tagged [34, 35] prior to amplification [36-41] have been reported.
`Because all amplicons derived from a particular starting molecule will bear its specific
`tag, any variation in the sequence or copy number of identically tagged sequencing
`reads can be discounted as technical error. This approach has been used to improve
`counting accuracy of DNA [38, 39, 41] and RNA templates [37, 38, 40] and to correct
base errors arising during PCR or sequencing [36, 37, 39]. Kinde et al. reported a
`reduction in error frequency of approximately 20-fold with a tagging method that is
`based on labeling single-stranded DNA fragments with a primer containing a 14 bp
degenerate sequence. This allowed for an observed mutation frequency of ~0.001%
mutations/bp in normal human genomic DNA [36]. Nevertheless, a number of highly
sensitive genetic assays have indicated that the true mutation frequency in normal cells
is likely to be far lower, with estimates of per-nucleotide mutation frequencies generally
ranging from 10^-9 to 10^-11 [42]. Thus, the mutations seen in normal human genomic
`DNA by Kinde et al. are likely the result of significant technical artifacts.
`Traditionally, next-generation sequencing platforms rely upon generation
`of sequence data from a single strand of DNA. As a consequence, artifactual mutations
`introduced during the initial rounds of PCR amplification are undetectable as errors -
`even with tagging techniques - if the base change is propagated to all subsequent PCR
`duplicates. Several types of DNA damage are highly mutagenic and may lead to this
`scenario. Spontaneous DNA damage arising from normal metabolic processes results
`in thousands of damaging events per cell per day [43].
`In addition to damage from
`oxidative cellular processes, further DNA damage is generated ex vivo during tissue
`processing and DNA extraction [44]. These damage events can result in frequent
copying errors by DNA polymerases: for example, a common DNA lesion arising from
oxidative damage, 8-oxo-guanine, has the propensity to incorrectly pair with adenine
during complementary strand extension with an overall efficiency greater than that of
correct pairing with cytosine, and thus can contribute a large frequency of artifactual
G→T mutations [45]. Likewise, deamination of cytosine to form uracil is a particularly
common event which leads to the inappropriate insertion of adenine during PCR, thus
producing artifactual C→T mutations with a frequency approaching 100% [46].
`It would be desirable to develop an approach for tag-based error
`correction, which reduces or eliminates artifactual mutations arising from DNA damage,
`PCR errors, and sequencing errors; allows rare variants in heterogeneous populations
`to be detected with unprecedented sensitivity; and which capitalizes on the redundant
`information stored in complexed double-stranded DNA.
`In one embodiment, a single molecule identifier (SMI) adaptor molecule
`for use in sequencing a double-stranded target nucleic acid molecule is provided. Said
`SMI adaptor molecule includes a single molecule identifier (SMI) sequence which
`comprises a degenerate or semi-degenerate DNA sequence; and an SMI ligation
`adaptor that allows the SMI adaptor molecule to be ligated to the double-stranded target
`nucleic acid sequence. The SMI sequence may be single-stranded or double-stranded.
In some embodiments, the double-stranded target nucleic acid molecule is a
double-stranded DNA or RNA molecule.
In another embodiment, a method of obtaining the sequence of a double-stranded
target nucleic acid (also known as Duplex Consensus Sequencing
or DCS) is provided. Such a method may include steps of ligating a double-stranded
target nucleic acid molecule to at least one SMI adaptor molecule to form a
double-stranded SMI-target nucleic acid complex; amplifying the double-stranded SMI-target
nucleic acid complex, resulting in a set of amplified SMI-target nucleic acid products;
and sequencing the amplified SMI-target nucleic acid products.
`In some embodiments, the method may additionally include generating an
`error-corrected double-stranded consensus sequence by (i) grouping the sequenced
SMI-target nucleic acid products into families of paired target nucleic acid strands based
`on a common set of SMI sequences; and (ii) removing paired target nucleic acid strands
`having one or more nucleotide positions where the paired target nucleic acid strands
`are non-complementary (or alternatively removing individual nucleotide positions in
`cases where the sequence at the nucleotide position under consideration disagrees
`among the two strands). In further embodiments, the method confirms the presence of
`a true mutation by (i) identifying a mutation present in the paired target nucleic acid
`strands having one or more nucleotide positions that disagree; (ii) comparing the
mutation present in the paired target nucleic acid strands to the error-corrected
double-stranded consensus sequence; and (iii) confirming the presence of a true mutation
`when the mutation is present on both of the target nucleic acid strands and appears in
`all members of a paired target nucleic acid family.
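By way of illustration only, the consensus and strand-comparison steps described above could be sketched in Python as follows. The function names, the 0.7 majority threshold, and the use of "N" to mask unresolved positions are assumptions made for this sketch and are not taken from Figure 10. Reads within a family are assumed to be of equal length and already expressed in a common orientation.

```python
from collections import Counter


def single_strand_consensus(reads, min_fraction=0.7):
    """Collapse the PCR duplicates of one strand family into one sequence.

    A position is called only if one base reaches min_fraction of the
    family (an illustrative threshold); otherwise it is masked as 'N'.
    """
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)


def duplex_consensus(strand_a, strand_b):
    """Keep only positions where the two strand consensuses agree."""
    return "".join(
        a if a == b and a != "N" else "N" for a, b in zip(strand_a, strand_b)
    )


# One family per strand of the same captured fragment (hypothetical reads).
family_ab = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]  # one late PCR error
family_ba = ["ACTTAC", "ACTTAC", "ACTTAC"]            # first-round error, one strand only

sscs_ab = single_strand_consensus(family_ab)
sscs_ba = single_strand_consensus(family_ba)
dcs = duplex_consensus(sscs_ab, sscs_ba)
```

In this sketch the late error in `family_ab` is outvoted within its own family, while the change carried by every member of `family_ba` is masked because the partner strand does not support it.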
`Figure 1 illustrates an overview of Duplex Consensus Sequencing.
`Sheared double-stranded DNA that has been end-repaired and T-tailed is combined
`with A-tailed SMI adaptors and ligated according to one embodiment. Because every
`adaptor contains a unique, double-stranded, complementary n-mer random tag on each
`end (n-mer = 12 bp according to one embodiment), every DNA fragment becomes
labeled with two distinct SMI sequences (arbitrarily designated α and β in the single
capture event shown). After size-selecting for appropriate length fragments, PCR
amplification with primers containing Illumina flow-cell-compatible tails is carried out to
generate families of PCR duplicates. By virtue of the asymmetric nature of adapted
fragments, two types of PCR products are produced from each capture event. Those
derived from one strand will have the α SMI sequence adjacent to flow-cell sequence 1
and the β SMI sequence adjacent to flow-cell sequence 2. PCR products originating
`from the complementary strand are labeled reciprocally.
`Figure 2 illustrates Single Molecule Identifier (SMI) adaptor synthesis
`according to one embodiment. Oligonucleotides are annealed and the complement of
`the degenerate lower arm sequence (N's) plus adjacent fixed bases is produced by
`polymerase extension of the upper strand in the presence of all four dNTPs. After
`reaction cleanup, complete adaptor A-tailing is ensured by extended incubation with
`polymerase and dATP.
Figure 3 illustrates error correction through Duplex Consensus
Sequencing (DCS) analysis according to one embodiment. (a-c) show sequence
reads (brown) sharing a unique set of SMI tags, grouped into paired families with
members having strand identifiers in either the αβ or βα orientation. Each family pair
reflects one double-stranded DNA fragment.
`(a) shows mutations (spots) present in
`only one or a few family members representing sequencing mistakes or PCR-introduced
`errors occurring late in amplification.
`(b) shows mutations occurring in many or all
`members of one family in a pair representing mutations scored on only one of the two
`strands, which can be due to PCR errors arising during the first round of amplification
such as might occur when copying across sites of mutagenic DNA damage. (c) shows
true mutations (* arrow), which are present on both strands of a captured fragment and
appear in all members of a family pair. While artifactual mutations may co-occur in a family pair with
`a true mutation, these can be independently identified and discounted when producing
`(d) an error-corrected consensus sequence (i.e., single stranded consensus sequence)
`(+ arrow) for each duplex.
`(e) shows consensus sequences from all independently
`captured, randomly sheared fragments containing a particular genomic site are
`identified and (f) compared to determine the frequency of genetic variants at this locus
`within the sampled population.
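As a non-limiting sketch of the comparison in panels (e) and (f), the frequency of a variant at one genomic site might be computed from the duplex consensus base calls covering that site. The function name and the convention that "N" marks a masked position are assumptions of this example.

```python
def variant_frequency(consensus_bases, reference_base):
    """Fraction of duplex consensus calls at one site that differ from the reference.

    consensus_bases: one base call per independently captured fragment;
    'N' marks a position that was masked during error correction.
    Returns None when no fragment yields a usable call.
    """
    called = [base for base in consensus_bases if base != "N"]
    if not called:
        return None
    variants = sum(1 for base in called if base != reference_base)
    return variants / len(called)


# Ten hypothetical fragments covering the same site: one variant, one masked call.
site_calls = list("GGGGAGGGNG")
frequency = variant_frequency(site_calls, "G")
```

Masked calls are excluded from the denominator so that positions removed during error correction do not dilute the estimate.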
Figure 4 illustrates an example of how an SMI sequence with n-mers of 4
nucleotides in length (4-mers) is read by Duplex Consensus Sequencing (DCS)
`according to some embodiments. (A) shows the 4-mers with the PCR primer binding
`sites (or flow cell sequences) 1 and 2 indicated at each end.
`(B) shows the same
`molecules as in (A) but with the strands separated and the lower strand now written in
`the 5'-3' direction. When these molecules are amplified with PCR and sequenced, they
will yield the following sequence reads: The top strand will give a read 1 file of TAAC----
and a read 2 file of CGGA----. Combining the read 1 and read 2 tags will give
TAACCGGA as the SMI for the top strand. The bottom strand will give a read 1 file of
CGGA---- and a read 2 file of TAAC----. Combining the read 1 and read 2 tags will give
`CGGATAAC as the SMI for the bottom strand. (C) illustrates the orientation of paired
`strand mutations in DCS.
`In the initial DNA duplex shown in Figures 4A and 4B, a
`mutation "x" (which is paired to a complementary nucleotide "y") is shown on the left
`side of the DNA duplex. The "x" will appear in read 1, and the complementary mutation
`on the opposite strand, "y," will appear in read 2. Specifically, this would appear as "x"
`in both read 1 and read 2 data, because "y" in read 2 is read out as "x" by the
`sequencer owing to the nature of the sequencing primers, which generate the
`complementary sequence during read 2.
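The tag handling in this 4-mer example can be sketched as follows, purely for illustration. The helper names are hypothetical. The sketch assumes that the tag occupies the first four bases of each read and that a top-strand read 1 tag of TAAC and read 2 tag of CGGA combine to give TAACCGGA.

```python
def combined_smi(read1, read2, tag_len=4):
    """Concatenate the tags found at the start of read 1 and read 2."""
    return read1[:tag_len] + read2[:tag_len]


def partner_smi(smi):
    """Swap the two halves of a combined tag to get the opposite strand's tag."""
    half = len(smi) // 2
    return smi[half:] + smi[:half]


# Dashes stand in for the target sequence that follows each tag.
top = combined_smi("TAAC----", "CGGA----")
bottom = combined_smi("CGGA----", "TAAC----")
same_duplex = partner_smi(top) == bottom
```

Because the two strands of one duplex carry the same pair of 4-mers in opposite order, swapping the halves of one combined tag is enough to find its partner family.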
`Figure 5 illustrates duplex sequencing of human mitochondrial DNA. (A)
`Overall mutation frequency as measured by a standard sequencing approach, SSCS,
`and DCS. (B) Pattern of mutation in human mitochondrial DNA by a standard
`sequencing approach. The mutation frequency (vertical axis) is plotted for every position
in the ~16-kb mitochondrial genome. Due to the substantial background of technical
`error, no obvious mutational pattern is discernible by this method. (C) DCS analysis
`eliminates sequencing artifacts and reveals the true distribution of mitochondrial
`mutations to include a striking excess adjacent to the mtDNA origin of replication. (D)
SSCS analysis yields a large excess of G→T mutations relative to complementary C→A
mutations, consistent with artifacts from damage-induced 8-oxo-G lesions during PCR.
`All significant (P < 0.05) differences between paired reciprocal mutation frequencies are
`noted. (E) DCS analysis removes the SSCS strand bias and reveals the true mtDNA
`mutational spectrum to be characterized by an excess of transitions.
Figure 6 shows that consensus sequencing removes artifactual
sequencing errors as compared to Raw Reads. Duplex Consensus Sequencing (DCS)
results in an approximately equal number of mutations as the reference and single
strand consensus sequencing (SSCS).
`Figure 7 illustrates duplex sequencing of M13mp2 DNA. (A) Single-strand
consensus sequences (SSCSs) reveal a large excess of G→A/C→T and G→T/C→A
`mutations, whereas duplex consensus sequences (DCSs) yield a balanced spectrum.
`Mutation frequencies are grouped into reciprocal mispairs, as DCS analysis only scores
`mutations present in both strands of duplex DNA. All significant (P < 0.05) differences
between DCS analysis and reference values are noted.
Complementary types of mutations should occur at approximately equal frequencies
within a DNA fragment population derived from duplex molecules. However, SSCS
analysis yields a 15-fold excess of G→T mutations relative to C→A mutations and an
11-fold excess of C→T mutations relative to G→A mutations. All significant (P < 0.05)
`differences between paired reciprocal mutation frequencies are noted.
`Figure 8 shows the effect of DNA damage on the mutation spectrum. DNA
`damage was induced by incubating purified M13mp2 DNA with hydrogen peroxide and
FeSO4. (A) SSCS analysis reveals a further elevation from baseline of G→T mutations,
`indicating these events to be the artifactual consequence of nucleotide oxidation. All
`significant (P < 0.05) changes from baseline mutation frequencies are noted. (B)
Induced DNA damage had no effect on the overall frequency or spectrum of DCS mutations.
Figure 9 shows that duplex sequencing results in accurate recovery of
spiked-control mutations. A series of variants of M13mp2 DNA, each harboring a known
single-nucleotide substitution, were mixed together at known ratios and the mixture
was sequenced to ~20,000-fold final depth. Standard sequencing analysis cannot
accurately distinguish mutants present at a ratio of less than 1/100, because artifactual
`mutations occurring at every position obscure the presence of less abundant true
`mutations, rendering apparent recovery greater than 100%. Duplex consensus
`sequences, in contrast, accurately identify spiked-in mutations down to the lowest
`tested ratio of 1/10,000.
Figure 10 shows Python code that may be used to carry out methods described
herein according to one embodiment.
`Single molecule identifier adaptors and methods for their use are provided
`herein. According to the embodiments described herein, a single molecule identifier
`(SMI) adaptor molecule is provided. Said SMI adaptor molecule is double stranded,
`and may include a single molecule identifier (SMI) sequence, and an SMI ligation
`adaptor (Figure 2). Optionally, the SMI adaptor molecule further includes at least two
`PCR primer binding sites, at least two sequencing primer binding sites, or both.
`The SMI adaptor molecule may form a "Y-shape" or a "hairpin shape." In
`some embodiments, the SMI adaptor molecule is a "Y-shaped" adaptor, which allows
`both strands to be independently amplified by a PCR method prior to sequencing
`because both the top and bottom strands have binding sites for PCR primers FC1 and
`FC2 as shown in the examples below. A schematic of a Y-shaped SMI adaptor
`molecule is also shown in Figure 2. A Y-shaped SMI adaptor requires successful
`amplification and recovery of both strands of the SMI adaptor molecule.
`In one
`embodiment, a modification that would simplify consistent recovery of both strands
`entails ligation of a Y-shaped SMI adaptor molecule to one end of a DNA duplex
`molecule, and ligation of a "U-shaped" linker to the other end of the molecule. PCR
`amplification of the hairpin-shaped product will then yield a linear fragment with flow cell
`sequences on either end. Distinct PCR primer binding sites (or flow cell sequences
`FC1 and FC2) will flank the DNA sequence corresponding to each of the two SMI
`adaptor molecule strands, and a given sequence seen in Read 1 will then have the
`sequence corresponding to the complementary DNA duplex strand seen in Read 2.
`Mutations are scored only if they are seen on both ends of the molecule (corresponding
`to each strand of the original double-stranded fragment), i.e. at the same position in
`both Read 1 and Read 2. This design may be accomplished as described in the
`examples relating to double stranded SMI sequence tags.
`In other embodiments, the SMI adaptor molecule is a "hairpin" shaped (or
`"U-shaped") adaptor. A hairpin DNA product can be used for error correction, as this
`product contains both of the two DNA strands. Such an approach allows for reduction
`of a given sequencing error rate N to a lower rate of N*N*(1/3), as independent
`sequencing errors would need to occur on both strands, and the same error among all
`three possible base substitutions would need to occur on both strands. For example,
the error rate of 1/100 in the case of Illumina sequencing [32] would be reduced to
`(1/100)*(1/100)*(1/3) = 1/30,000.
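The arithmetic above may be checked with a short sketch. The function name is hypothetical, and exact fractions are used so that the result is not subject to floating-point rounding.

```python
from fractions import Fraction


def paired_strand_error_rate(single_strand_rate):
    """Rate N*N*(1/3): the same one of three possible substitutions on both strands."""
    return single_strand_rate * single_strand_rate * Fraction(1, 3)


rate = paired_strand_error_rate(Fraction(1, 100))
```

The factor of 1/3 reflects that, given an error on the first strand, only one of the three possible substitutions on the second strand would produce a matching error.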
An additional, more remarkable reduction in errors can be obtained by
inclusion of a single-stranded SMI in either the hairpin adaptor or the "Y-shaped"
adaptor, which will also function to label both of the two DNA strands. Amplification of
hairpin-shaped DNA may be difficult as the polymerase must synthesize through a product
containing significant regions of self-complementarity; however, amplification of
hairpin-shaped structures has already been established in the technique of hairpin PCR, as
`described below. Amplification using hairpin PCR is further described in detail in U.S.
`Patent No. 7,452,699, the subject matter of which is hereby incorporated by reference
`as if fully set forth herein.
`According to the embodiments described herein, the SMI sequence (or
`"tag") may be a double-stranded, complementary SMI sequence or a single-stranded
`SMI sequence.
`In some embodiments, the SMI adaptor molecule includes an SMI
`sequence (or "tag") of nucleotides that is degenerate or semi-degenerate.
`In some
`embodiments, the degenerate or semi-degenerate SMI sequence may be a random
`degenerate sequence. A double-stranded SMI sequence includes a first degenerate or
`semi-degenerate nucleotide n-mer sequence and a second n-mer sequence that is
`complementary to the first degenerate or semi-degenerate nucleotide n-mer sequence,
`while a single-stranded SMI sequence includes a first degenerate or semi-degenerate
`nucleotide n-mer sequence. The first and/or second degenerate or semi-degenerate
`nucleotide n-mer sequences may be any suitable length to produce a sufficiently large
`number of unique tags to label a set of sheared DNA fragments from a segment of DNA.
Each n-mer sequence may be between approximately 3 and 20 nucleotides in length.
Therefore, each n-mer sequence may be approximately 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, or 20 nucleotides in length. In one embodiment, the SMI sequence
`is a random degenerate nucleotide n-mer sequence which is 12 nucleotides in length.
`A 12 nucleotide SMI n-mer sequence that is ligated to each end of a target nucleic acid
molecule, as described in the Example below, results in generation of up to 4^24 (i.e., 2.8
x 10^14) distinct tag sequences.
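As a worked check of this figure, the number of possible tag combinations can be computed directly. The function name and parameters are illustrative only, and the sketch assumes four possible bases at each fully degenerate position.

```python
def distinct_tag_count(n_mer_length, ends=2, alphabet_size=4):
    """Number of tag combinations when an n-mer is ligated to each end."""
    return alphabet_size ** (n_mer_length * ends)


count = distinct_tag_count(12)      # 12-mer on each of two ends: 4**24
formatted = f"{count:.1e}"          # scientific notation for comparison
```

A semi-degenerate design, in which some positions allow fewer than four bases, would yield a correspondingly smaller count.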
In some embodiments, the SMI tag nucleotide sequence may be
completely random and degenerate, wherein each sequence position may be any
nucleotide (i.e., each position, represented by "X," is not limited, and may be an
adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U)) or any other natural or
non-natural DNA or RNA nucleotide or nucleotide-like substance or analog with
base-pairing properties (e.g., xanthosine, inosine, hypoxanthine, xanthine, 7-methylguanine,
7-methylguanosine, 5,6-dihydrouracil, 5-methylcytosine, dihydrouridine,
isoguanine, deoxynucleosides, nucleosides, peptide nucleic acids, locked nucleic acids,
glycol nucleic acids and threose nucleic acids). The term "nucleotide" as described
`herein, refers to any and all nucleotide or any suitable natural or non-natural DNA or
`RNA nucleotide or nucleotide-like substance or analog with base pairing properties as
`described above.
`In other embodiments, the sequences need not contain all possible
`bases at each position. The degenerate or semi-degenerate n-mer sequences may be
`generated by a polymerase-mediated method described in the Example below, or may
`be generated by preparing and annealing a library of individual oligonucleotides of
`known sequence. Alternatively, any degenerate or semi-degenerate n-mer sequences
`may be a randomly or non-randomly fragmented double stranded DNA molecule from
`any alternative source that differs from the target DNA source.
`In some embodiments,
`the alternative source is a genome or plasmid derived from bacteria, an organism other
`than that of the target DNA, or a combination of such alternative organisms or sources.
`The random or non-random fragmented DNA may be introduced into SMI adaptors to
`serve as variable tags. This may be accomplished through enzymatic ligation or any
`other method known in the art.
`In some embodiments, the SMI adaptor molecules are ligated to both
`ends of a target nucleic acid molecule, and then this complex is used according to the
`methods described below.
In certain embodiments, it is not necessary to include n-mers
on both adaptor ends; however, it is more convenient because it means that one
`does not have to use two different types of adaptors and then select for ligated
`fragments that have one of each type rather than two of one type. The ability to
`determine which strand is which is still possible in the situation wherein only one of the
`two adaptors has a double-stranded SMI sequence.
`In some embodiments, the SMI adaptor molecule may optionally include a
`double-stranded fixed reference sequence downstream of the n-mer sequences to help
`make ligation more uniform and help computationally filter out errors due to ligation
`problems with improperly synthesized adaptors. Each strand of the double-stranded
fixed reference sequence may be 4 or 5 nucleotides in length; however, the
`fixed reference sequence may be any suitable length including, but not limited to 3, 4, 5
`or 6 nucleotides in length.
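One way the fixed reference sequence might be used for computational filtering is sketched below. This is an assumption-laden illustration: the function name, the 12-nucleotide tag length, and the 5-nucleotide sequence "CAGTA" are placeholders chosen for the example and are not the sequences of any particular embodiment.

```python
def split_tag(read, tag_len=12, fixed_seq="CAGTA"):
    """Split a read into (tag, target sequence), or return None if it fails the filter.

    A read is rejected when the expected fixed reference sequence is not
    found immediately downstream of the tag, or when the tag contains an
    uncalled base ('N').
    """
    end = tag_len + len(fixed_seq)
    if len(read) < end or read[tag_len:end] != fixed_seq:
        return None
    tag = read[:tag_len]
    if "N" in tag:
        return None
    return tag, read[end:]


good = split_tag("AAAACCCCGGGG" + "CAGTA" + "TTTT")
bad = split_tag("AAAACCCCGGGG" + "CAGTT" + "TTTT")  # fixed sequence mismatch
```

Reads rejected in this way would simply be excluded before families are assembled, so that a malformed adaptor cannot contribute a spurious tag.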
`The SMI ligation adaptor may be any suitable ligation adaptor that is
`complementary to a ligation adaptor added to a double-stranded target nucleic acid
`sequence including, but not limited to a T-overhang, an A-overhang, a CG overhang, a
`blunt end, or any other ligatable sequence.
`In some embodiments, the SMI ligation
`adaptor may be made using a method for A-tailing or T-tailing with polymerase
`extension; creating an overhang with a different enzyme; using a restriction enzyme to
`create a single or multiple nucleotide overhang, or any other method known in the art.
According to the embodiments described herein, the SMI adaptor
`molecule may include at least two PCR primer or "flow cell" binding sites: a forward
`PCR primer binding site (or a "flow cell 1" (FC1) binding site); and a reverse PCR primer
`binding site (or a "flow cell 2" (FC2) binding site). The SMI adaptor molecule may also
`include at least two sequencing primer binding sites, each corresponding to a
`sequencing read. Alternatively, the sequencing primer binding sites may be added in a
`separate step by inclusion of the necessary sequences as tails to the PCR primers, or
`by ligation of the needed sequences. Therefore, if a double-stranded target nucleic acid
`molecule has an SMI adaptor molecule ligated to each end, each sequenced strand will
`have two reads - a forward and a reverse read.
`Double-stranded SMI sequences
`Adaptor 1 (shown below) is a Y-shaped SMI adaptor as described above
`(the SMI sequence is shown as X's in the top strand (a 4-mer), with the complementary
`bottom strand sequence shown as Y's):
