(19) World Intellectual Property Organization
International Bureau

(10) International Publication Number
WO 2013/142389 A1

(43) International Publication Date
26 September 2013 (26.09.2013)
WIPO | PCT
`
`(51) International Patent Classification:
`C04B 20/04 (2006.01)
`
(21) International Application Number: PCT/US2013/032665

(22) International Filing Date: 15 March 2013 (15.03.2013)

(25) Filing Language: English

(26) Publication Language: English
`
(30) Priority Data:
61/613,413    20 March 2012 (20.03.2012)    US
61/625,623    17 April 2012 (17.04.2012)    US
61/625,319    17 April 2012 (17.04.2012)    US

(71) Applicant: UNIVERSITY OF WASHINGTON THROUGH ITS CENTER FOR COMMERCIALIZATION [US/US]; 4311 11th Avenue NE, Seattle, WA 98105-4608 (US).

(72) Inventors: SCHMITT, Michael; Seattle, WA 98105 (US). SALK, Jesse; Seattle, WA 98105 (US). LOEB, Lawrence, A.; Bellevue, WA (US).
`
`(74) Agent: DUEPPEN, Lara, J.; Perkins Coie LLP, P.O. Box
`1208, Seattle, WA 98111-1208 (US).
`
(81) Designated States (unless otherwise indicated, for every
kind of national protection available): AE, AG, AL, AM,
AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY,
BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DK, DM,
DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT,
HN, HR, HU, ID, IL, IN, IS, JP, KE, KG, KM, KN, KP,
KR, KZ, LA, LC, LK, LR, LS, LT, LU, LY, MA, MD,
ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI,
NO, NZ, OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU,
RW, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ,
TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA,
ZM, ZW.
(84) Designated States (unless otherwise indicated, for every
kind of regional protection available): ARIPO (BW, GH,
GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, SZ, TZ,
UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ,
TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK,
EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV,
MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM,
TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW,
ML, MR, NE, SN, TD, TG).
`
(54) Title: METHODS OF LOWERING THE ERROR RATE OF MASSIVELY PARALLEL DNA SEQUENCING USING DUPLEX CONSENSUS SEQUENCING

(57) Abstract: Next Generation DNA sequencing promises to revolutionize
clinical medicine and basic research. However, while this technology has
the capacity to generate hundreds of billions of nucleotides of DNA sequence
in a single experiment, the error rate of approximately 1% results in
hundreds of millions of sequencing mistakes. These scattered errors can be
tolerated in some applications but become extremely problematic when
"deep sequencing" genetically heterogeneous mixtures, such as tumors or
mixed microbial populations. To overcome limitations in sequencing accuracy,
a method termed Duplex Consensus Sequencing (DCS) is provided. This approach
greatly reduces errors by independently tagging and sequencing each
of the two strands of a DNA duplex. As the two strands are complementary,
true mutations are found at the same position in both strands. In contrast,
PCR or sequencing errors will result in errors in only one strand.
`
Figure 1 [drawing omitted; recoverable labels: SMI sequences; T-tailed DNA fragment; ligation and size selection; PCR with flow-cell adaptor sequences; αβ SMI family; βα SMI family; capture target regions; second round PCR; massively parallel sequencing]
`
`
`
Published:

- with international search report (Art. 21(3))

- with sequence listing part of description (Rule 5.2(a))

`
`WO 2013/142389
`
PCT/US2013/032665
`
`METHODS OF LOWERING THE ERROR RATE OF MASSIVELY PARALLEL DNA
`
`SEQUENCING USING DUPLEX CONSENSUS SEQUENCING
`
`PRIORITY CLAIM
`
`[0001]
`
`This application claims priority to U.S. Provisional Patent Application No.
`
`61/613,413, filed March 20, 2012; U.S. Provisional Patent Application No. 61/625,623,
`
`filed April 17, 2012; and U.S. Provisional Patent Application No. 61/625,319, filed April
`
`17, 2012; the subject matter of all of which are hereby incorporated by reference as if
`
`fully set forth herein.
`
`STATEMENT OF GOVERNMENT INTEREST
`
`[0002]
`
`The present invention was made with government support under Grant
`
`Nos. RO1 CA115802 and RO1 CA102029 awarded by the National Institutes of Health.
`
`The Government has certain rights in the invention.
`
`BACKGROUND
`
`[0003]
`
`The advent of massively parallel DNA sequencing has ushered in a new
`
`era of genomic exploration by making simultaneous genotyping of hundreds of billions
`
of base-pairs possible at a small fraction of the time and cost of traditional Sanger

methods [1]. Because these technologies digitally tabulate the sequence of many
`
`individual DNA fragments, unlike conventional techniques which simply report the
`
`average genotype of an aggregate collection of molecules, they offer the unique ability
`
`to detect minor variants within heterogeneous mixtures [2].
`
`[0004]
`
This concept of "deep sequencing" has been implemented in a variety of

fields including metagenomics [3, 4], paleogenomics [5], forensics [6], and human

genetics [7, 8] to disentangle subpopulations in complex biological samples. Clinical

applications, such as prenatal screening for fetal aneuploidy [9, 10], early detection of

cancer [11] and monitoring its response to therapy [12, 13] with nucleic acid-based

serum biomarkers, are rapidly being developed. Exceptional diversity within microbial

[14, 15], viral [16-18] and tumor cell populations [19, 20] has been characterized through
`
`next-generation sequencing, and many low-frequency, drug-resistant variants of
`
`therapeutic importance have been so identified [12, 21, 22]. Previously unappreciated
`
intra-organismal mosaicism in both the nuclear [23] and mitochondrial [24, 25] genome
`
`has been revealed by these technologies, and such somatic heterogeneity, along with
`
`that arising within the adaptive immune system [13], may be an important factor in
`
`phenotypic variability of disease.
`
`[0005]
`
`Deep sequencing, however, has limitations. Although, in theory, DNA
`
`subpopulations of any size should be detectable when deep sequencing a sufficient
`
`number of molecules, a practical limit of detection is imposed by errors introduced
`
`during sample preparation and sequencing. PCR amplification of heterogeneous
`
mixtures can result in population skewing due to stochastic and non-stochastic
`
`amplification biases and lead to over- or under-representation of particular variants [26].
`
`Polymerase mistakes during pre-amplification generate point mutations resulting from
`
base mis-incorporations and rearrangements due to template switching [26, 27].
`
`Combined with the additional errors that arise during cluster amplification, cycle
`
sequencing and image analysis, approximately 1% of bases are incorrectly identified,
`
`depending on the specific platform and sequence context [2, 28]. This background level
`
`of artifactual heterogeneity establishes a limit below which the presence of true rare
`
`variants is obscured [29].
`
`[0006]
`
`A variety of improvements at the level of biochemistry [30-32] and data
`
`processing [19, 21, 28, 32, 33] have been developed to improve sequencing accuracy.
`
The ability to resolve subpopulations below 0.1%, however, has remained elusive.
`
`Although several groups have attempted to increase sensitivity of sequencing, several
`
limitations remain. For example, techniques whereby DNA fragments to be sequenced
`
`are each uniquely tagged [34, 35] prior to amplification [36-41] have been reported.
`
`Because all amplicons derived from a particular starting molecule will bear its specific
`
`tag, any variation in the sequence or copy number of identically tagged sequencing
`
`reads can be discounted as technical error. This approach has been used to improve
`
`counting accuracy of DNA [38, 39, 41] and RNA templates [37, 38, 40] and to correct
`
base errors arising during PCR or sequencing [36, 37, 39]. Kinde et al. reported a
`
`reduction in error frequency of approximately 20-fold with a tagging method that is
`
`based on labeling single-stranded DNA fragments with a primer containing a 14 bp
`
degenerate sequence. This allowed for an observed mutation frequency of ~0.001%
`
`mutations/bp in normal human genomic DNA [36]. Nevertheless, a number of highly
`
`sensitive genetic assays have indicated that the true mutation frequency in normal cells
`
`is likely to be far lower, with estimates of per-nucleotide mutation frequencies generally
ranging from 10^-9 to 10^-11 [42]. Thus, the mutations seen in normal human genomic
`
`DNA by Kinde et al. are likely the result of significant technical artifacts.
`
`[0007]
`
`Traditionally, next-generation sequencing platforms rely upon generation
`
`of sequence data from a single strand of DNA. As a consequence, artifactual mutations
`
`introduced during the initial rounds of PCR amplification are undetectable as errors -
`
`even with tagging techniques - if the base change is propagated to all subsequent PCR
`
`duplicates. Several types of DNA damage are highly mutagenic and may lead to this
`
`scenario. Spontaneous DNA damage arising from normal metabolic processes results
`
`in thousands of damaging events per cell per day [43].
`
`In addition to damage from
`
`oxidative cellular processes, further DNA damage is generated ex vivo during tissue
`
`processing and DNA extraction [44]. These damage events can result in frequent
`
copying errors by DNA polymerases: for example, a common DNA lesion arising from
`
`oxidative damage, 8-oxo-guanine, has the propensity to incorrectly pair with adenine
`
`during complementary strand extension with an overall efficiency greater than that of
`
`correct pairing with cytosine, and thus can contribute a large frequency of artifactual
`
G→T mutations [45]. Likewise, deamination of cytosine to form uracil is a particularly

common event which leads to the inappropriate insertion of adenine during PCR, thus
producing artifactual C→T mutations with a frequency approaching 100% [46].
`
`[0008]
`
`It would be desirable to develop an approach for tag-based error
`
`correction, which reduces or eliminates artifactual mutations arising from DNA damage,
`
`PCR errors, and sequencing errors; allows rare variants in heterogeneous populations
`
`to be detected with unprecedented sensitivity; and which capitalizes on the redundant
`
`information stored in complexed double-stranded DNA.
`
`SUMMARY
`
`[0009]
`
`In one embodiment, a single molecule identifier (SMI) adaptor molecule
`
`for use in sequencing a double-stranded target nucleic acid molecule is provided. Said
`
`SMI adaptor molecule includes a single molecule identifier (SMI) sequence which
`
`comprises a degenerate or semi-degenerate DNA sequence; and an SMI ligation
`
`adaptor that allows the SMI adaptor molecule to be ligated to the double-stranded target
`
`nucleic acid sequence. The SMI sequence may be single-stranded or double-stranded.
`
In some embodiments, the double-stranded target nucleic acid molecule is a

double-stranded DNA or RNA molecule.
`
`[0010]
`
In another embodiment, a method of obtaining the sequence of a

double-stranded target nucleic acid (also known as Duplex Consensus Sequencing

or DCS) is provided. Such a method may include steps of ligating a double-stranded

target nucleic acid molecule to at least one SMI adaptor molecule to form a

double-stranded SMI-target nucleic acid complex; amplifying the double-stranded SMI-target

nucleic acid complex, resulting in a set of amplified SMI-target nucleic acid products;
`
`and sequencing the amplified SMl-target nucleic acid products.
`
`[0011]
`
`In some embodiments, the method may additionally include generating an
`
`error-corrected double-stranded consensus sequence by (i) grouping the sequenced
`
SMI-target nucleic acid products into families of paired target nucleic acid strands based
`
`on a common set of SMI sequences; and (ii) removing paired target nucleic acid strands
`
`having one or more nucleotide positions where the paired target nucleic acid strands
`
`are non-complementary (or alternatively removing individual nucleotide positions in
`
`cases where the sequence at the nucleotide position under consideration disagrees
`
`among the two strands). In further embodiments, the method confirms the presence of
`
`a true mutation by (i) identifying a mutation present in the paired target nucleic acid
`
`strands having one or more nucleotide positions that disagree; (ii) comparing the
`
mutation present in the paired target nucleic acid strands to the error corrected

double-stranded consensus sequence; and (iii) confirming the presence of a true mutation
`
`when the mutation is present on both of the target nucleic acid strands and appears in
`
`all members of a paired target nucleic acid family.
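
The grouping and strand-comparison steps described above can be sketched in Python as follows. This is a minimal illustration and not the code of Figure 10: the function names, the minimum family size of 3, the 90% agreement threshold and the use of "N" to mask a disagreeing position are all assumptions made for the example. It further assumes that reads are of equal length, are already oriented to the same reference strand, and are keyed by a combined SMI tag whose two halves are swapped between the two strands of one duplex. Masking a position corresponds to the alternative of removing individual nucleotide positions rather than discarding the whole strand pair.

```python
from collections import Counter, defaultdict


def partner_tag(tag):
    """Tag expected on the opposite strand: the two SMI halves swapped."""
    half = len(tag) // 2
    return tag[half:] + tag[:half]


def single_strand_consensus(seqs, min_size=3, min_agreement=0.9):
    """Consensus of one tag family; 'N' where agreement is too weak.

    min_size and min_agreement are hypothetical thresholds.
    """
    if len(seqs) < min_size:
        return None
    consensus = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_agreement else "N")
    return "".join(consensus)


def duplex_consensus(reads):
    """reads: iterable of (combined_smi_tag, sequence) pairs."""
    # (i) group reads into families sharing the same SMI tag
    families = defaultdict(list)
    for tag, seq in reads:
        families[tag].append(seq)

    sscs = {}
    for tag, seqs in families.items():
        consensus = single_strand_consensus(seqs)
        if consensus is not None:
            sscs[tag] = consensus

    # (ii) compare each family with its partner strand, masking disagreements
    duplexes = {}
    for tag, seq in sscs.items():
        mate = partner_tag(tag)
        if mate in sscs and tag < mate:  # visit each pair once
            duplexes[tag] = "".join(
                a if a == b else "N" for a, b in zip(seq, sscs[mate])
            )
    return duplexes


# One strand carries a change at position 2 in every copy (as after a
# first-round PCR error); the partner strand does not.
reads = [("TAACCGGA", "ACTTA")] * 3 + [("CGGATAAC", "ACGTA")] * 3
print(duplex_consensus(reads))  # {'CGGATAAC': 'ACNTA'}
```

In this sketch a change is retained in the duplex consensus only when both strand families agree on it, which is the property the embodiment relies on to separate true mutations from single-strand artifacts.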
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0012]
`
`Figure 1 illustrates an overview of Duplex Consensus Sequencing.
`
`Sheared double-stranded DNA that has been end-repaired and T-tailed is combined
`
`with A-tailed SMI adaptors and ligated according to one embodiment. Because every
`
`adaptor contains a unique, double-stranded, complementary n-mer random tag on each
`
`end (n-mer = 12 bp according to one embodiment), every DNA fragment becomes
`
labeled with two distinct SMI sequences (arbitrarily designated α and β in the single
`
`capture event shown). After size-selecting for appropriate length fragments, PCR
`
amplification with primers containing Illumina flow-cell-compatible tails is carried out to
`
`generate families of PCR duplicates. By virtue of the asymmetric nature of adapted
`
`fragments, two types of PCR products are produced from each capture event. Those
`
derived from one strand will have the α SMI sequence adjacent to flow-cell sequence 1

and the β SMI sequence adjacent to flow-cell sequence 2. PCR products originating
`
`from the complementary strand are labeled reciprocally.
`
`[0013]
`
`Figure 2 illustrates Single Molecule Identifier (SMI) adaptor synthesis
`
`according to one embodiment. Oligonucleotides are annealed and the complement of
`
`the degenerate lower arm sequence (N's) plus adjacent fixed bases is produced by
`
`polymerase extension of the upper strand in the presence of all four dNTPs. After
`
`reaction cleanup, complete adaptor A-tailing is ensured by extended incubation with
`
`polymerase and dATP.
`
`[0014]
`
Figure 3 illustrates error correction through Duplex Consensus

Sequencing (DCS) analysis according to one embodiment. (a-c) show sequence

reads (brown) that share a unique set of SMI tags and are grouped into paired families with
members having strand identifiers in either the αβ or βα orientation. Each family pair
reflects one double-stranded DNA fragment. (a) shows mutations (spots) present in
`
`only one or a few family members representing sequencing mistakes or PCR-introduced
`
`errors occurring late in amplification.
`
`(b) shows mutations occurring in many or all
`
`members of one family in a pair representing mutations scored on only one of the two
`
`strands, which can be due to PCR errors arising during the first round of amplification
`
`such as might occur when copying across sites of mutagenic DNA damage. (c) shows
`
`true mutations (* arrow) present on both strands of a captured fragment appear in all
`
`members of a family pair. While artifactual mutations may co-occur in a family pair with
`
`a true mutation, these can be independently identified and discounted when producing
`
`(d) an error-corrected consensus sequence (i.e., single stranded consensus sequence)
`
`(+ arrow) for each duplex.
`
`(e) shows consensus sequences from all independently
`
`captured, randomly sheared fragments containing a particular genomic site are
`
`identified and (f) compared to determine the frequency of genetic variants at this locus
`
`within the sampled population.
`
`[0015]
`
`Figure 4 illustrates an example of how a SMI sequence with n-mers of 4
`
`nucleotides in length (4-mers) are read by Duplex Consensus Sequencing (DCS)
`
`according to some embodiments. (A) shows the 4-mers with the PCR primer binding
`
`sites (or flow cell sequences) 1 and 2 indicated at each end.
`
`(B) shows the same
`
`molecules as in (A) but with the strands separated and the lower strand now written in
`
`the 5'-3' direction. When these molecules are amplified with PCR and sequenced, they
`
will yield the following sequence reads: The top strand will give a read 1 file of TAAC---

and a read 2 file of CGGA---. Combining the read 1 and read 2 tags will give

TAACCGGA as the SMI for the top strand. The bottom strand will give a read 1 file of

CGGA--- and a read 2 file of TAAC---. Combining the read 1 and read 2 tags will give
`
`CGGATAAC as the SMI for the bottom strand. (C) illustrates the orientation of paired
`
`strand mutations in DCS.
`
`In the initial DNA duplex shown in Figures 4A and 4B, a
`
`mutation "x" (which is paired to a complementary nucleotide "y") is shown on the left
`
`side of the DNA duplex. The "x" will appear in read 1, and the complementary mutation
`
`on the opposite strand, "y," will appear in read 2. Specifically, this would appear as "x"
`
`in both read 1 and read 2 data, because "y" in read 2 is read out as "x" by the
`
`sequencer owing to the nature of the sequencing primers, which generate the
`
`complementary sequence during read 2.
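
The tag bookkeeping in this example can be written out as a short Python sketch. The helper names are hypothetical; the only facts taken from the text are that the combined SMI is the read 1 tag followed by the read 2 tag, and that the two strands of the duplex give TAACCGGA and CGGATAAC.

```python
def combined_tag(read1_tag, read2_tag):
    """Combined SMI for one strand: read 1 tag followed by read 2 tag."""
    return read1_tag + read2_tag


def partner_tag(tag):
    """Combined SMI expected for the opposite strand (halves swapped)."""
    half = len(tag) // 2
    return tag[half:] + tag[:half]


top = combined_tag("TAAC", "CGGA")
bottom = combined_tag("CGGA", "TAAC")

print(top)                         # TAACCGGA
print(bottom)                      # CGGATAAC
print(partner_tag(top) == bottom)  # True
```

Swapping the two halves of a combined tag is therefore enough, under these assumptions, to find the family that came from the complementary strand.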
`
`[0016]
`
`Figure 5 illustrates duplex sequencing of human mitochondrial DNA. (A)
`
`Overall mutation frequency as measured by a standard sequencing approach, SSCS,
`
`and DCS. (B) Pattern of mutation in human mitochondrial DNA by a standard
`
`sequencing approach. The mutation frequency (vertical axis) is plotted for every position
`
in the ~16-kb mitochondrial genome. Due to the substantial background of technical
`
`error, no obvious mutational pattern is discernible by this method. (C) DCS analysis
`
`eliminates sequencing artifacts and reveals the true distribution of mitochondrial
`
`mutations to include a striking excess adjacent to the mtDNA origin of replication. (D)
`
SSCS analysis yields a large excess of G→T mutations relative to complementary C→A

mutations, consistent with artifacts from damage-induced 8-oxo-G lesions during PCR.
`
`All significant (P < 0.05) differences between paired reciprocal mutation frequencies are
`
`noted. (E) DCS analysis removes the SSCS strand bias and reveals the true mtDNA
`
`mutational spectrum to be characterized by an excess of transitions.
`
`[0017]
`
Figure 6 shows that consensus sequencing removes artifactual

sequencing errors as compared to Raw Reads. Duplex Consensus Sequencing (DCS)

results in an approximately equal number of mutations as the reference and single

strand consensus sequencing (SSCS).
`
`[0018]
`
`Figure 7 illustrates duplex sequencing of M13mp2 DNA. (A) Single-strand
consensus sequences (SSCSs) reveal a large excess of G→A/C→T and G→T/C→A
`mutations, whereas duplex consensus sequences (DCSs) yield a balanced spectrum.
`
`Mutation frequencies are grouped into reciprocal mispairs, as DCS analysis only scores
`
`mutations present in both strands of duplex DNA. All significant (P < 0.05) differences
`
between DCS analysis and the literature reference values are noted. (B)
`
`Complementary types of mutations should occur at approximately equal frequencies
`
`within a DNA fragment population derived from duplex molecules. However, SSCS
`
analysis yields a 15-fold excess of G→T mutations relative to C→A mutations and an

11-fold excess of C→T mutations relative to G→A mutations. All significant (P < 0.05)
`
`differences between paired reciprocal mutation frequencies are noted.
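
The notion of a reciprocal (complementary) mutation used here can be made concrete with a small Python sketch. The function name and the tuple representation of a substitution are assumptions for illustration only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reciprocal(substitution):
    """Return the same substitution as it reads on the complementary strand."""
    ref, alt = substitution
    return (COMPLEMENT[ref], COMPLEMENT[alt])


print(reciprocal(("G", "T")))  # ('C', 'A')
print(reciprocal(("C", "T")))  # ('G', 'A')
```

Because every true duplex mutation is read as one member of such a pair on each strand, the two members of a pair are expected at similar frequencies; a strong imbalance between them points to damage or error on a single strand.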
`
`[0019]
`
`Figure 8 shows the effect of DNA damage on the mutation spectrum. DNA
`
`damage was induced by incubating purified M13mp2 DNA with hydrogen peroxide and
`
FeSO4. (A) SSCS analysis reveals a further elevation from baseline of G→T mutations,
`
`indicating these events to be the artifactual consequence of nucleotide oxidation. All
`
`significant (P < 0.05) changes from baseline mutation frequencies are noted. (B)
`
`Induced DNA damage had no effect on the overall frequency or spectrum of DCS
`
`mutations.
`
`[0020]
`
Figure 9 shows that duplex sequencing results in accurate recovery of spiked-

control mutations. A series of variants of M13mp2 DNA, each harboring a known

single-nucleotide substitution, were mixed together at known ratios and the mixture
`
was sequenced to ~20,000-fold final depth. Standard sequencing analysis cannot

accurately distinguish mutants present at a ratio of less than 1/100, because artifactual
`
`mutations occurring at every position obscure the presence of less abundant true
`
`mutations, rendering apparent recovery greater than 100%. Duplex consensus
`
`sequences, in contrast, accurately identify spiked-in mutations down to the lowest
`
`tested ratio of 1/10,000.
`
`[0021]
`
Figure 10 shows Python code that may be used to carry out methods described
`
`herein according to one embodiment.
`
`DETAILED DESCRIPTION
`
`[0022]
`
`Single molecule identifier adaptors and methods for their use are provided
`
`herein. According to the embodiments described herein, a single molecule identifier
`
`(SMI) adaptor molecule is provided. Said SMI adaptor molecule is double stranded,
`
`and may include a single molecule identifier (SMI) sequence, and an SMI ligation
`
`adaptor (Figure 2). Optionally, the SMI adaptor molecule further includes at least two
`
`PCR primer binding sites, at least two sequencing primer binding sites, or both.
`
`[0023]
`
`The SMI adaptor molecule may form a "Y-shape" or a "hairpin shape." In
`
`some embodiments, the SMI adaptor molecule is a "Y-shaped" adaptor, which allows
`
`both strands to be independently amplified by a PCR method prior to sequencing
`
`because both the top and bottom strands have binding sites for PCR primers FC1 and
`
`FC2 as shown in the examples below. A schematic of a Y-shaped SMI adaptor
`
`molecule is also shown in Figure 2. A Y-shaped SMI adaptor requires successful
`
`amplification and recovery of both strands of the SMI adaptor molecule.
`
`In one
`
`embodiment, a modification that would simplify consistent recovery of both strands
`
`entails ligation of a Y-shaped SMI adaptor molecule to one end of a DNA duplex
`
`molecule, and ligation of a "U-shaped" linker to the other end of the molecule. PCR
`
`amplification of the hairpin-shaped product will then yield a linear fragment with flow cell
`
`sequences on either end. Distinct PCR primer binding sites (or flow cell sequences
`
`FC1 and FC2) will flank the DNA sequence corresponding to each of the two SMI
`
`adaptor molecule strands, and a given sequence seen in Read 1 will then have the
`
`sequence corresponding to the complementary DNA duplex strand seen in Read 2.
`
`Mutations are scored only if they are seen on both ends of the molecule (corresponding
`
`to each strand of the original double-stranded fragment), i.e. at the same position in
`
`both Read 1 and Read 2. This design may be accomplished as described in the
`
`examples relating to double stranded SMI sequence tags.
`
`[0024]
`
`In other embodiments, the SMI adaptor molecule is a "hairpin" shaped (or
`
`"U-shaped") adaptor. A hairpin DNA product can be used for error correction, as this
`
`product contains both of the two DNA strands. Such an approach allows for reduction
`
`of a given sequencing error rate N to a lower rate of N*N*(1/3), as independent
`
`sequencing errors would need to occur on both strands, and the same error among all
`
`three possible base substitutions would need to occur on both strands. For example,
`
the error rate of 1/100 in the case of Illumina sequencing [32] would be reduced to
`
`(1/100)*(1/100)*(1/3) = 1/30,000.
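
The arithmetic above can be checked with a few lines of Python. The function name is hypothetical, and the calculation assumes, as the text does, that errors on the two strands are independent and that the three possible substitutions are equally likely.

```python
from fractions import Fraction


def hairpin_error_rate(per_strand_rate):
    """Chance that both strands show the same wrong base at one position."""
    # Both strands must err independently (N * N), and the second error must
    # match the first among the three possible substitutions (1/3).
    return per_strand_rate * per_strand_rate * Fraction(1, 3)


rate = hairpin_error_rate(Fraction(1, 100))
print(rate)  # 1/30000
```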
`
`[0025]
`
An additional, more remarkable reduction in errors can be obtained by

inclusion of a single-stranded SMI in either the hairpin adaptor or the "Y-shaped"

adaptor, which will also function to label both of the two DNA strands. Amplification of

hairpin-shaped DNA may be difficult, as the polymerase must synthesize through a product

containing significant regions of self-complementarity; however, amplification of

hairpin-shaped structures has already been established in the technique of hairpin PCR, as
`
`described below. Amplification using hairpin PCR is further described in detail in U.S.
`
`Patent No. 7,452,699, the subject matter of which is hereby incorporated by reference
`
`as if fully set forth herein.
`
`[0026]
`
`According to the embodiments described herein, the SMI sequence (or
`
`"tag") may be a double-stranded, complementary SMI sequence or a single-stranded
`
`SMI sequence.
`
`In some embodiments, the SMI adaptor molecule includes an SMI
`
`sequence (or "tag") of nucleotides that is degenerate or semi-degenerate.
`
`In some
`
`embodiments, the degenerate or semi-degenerate SMI sequence may be a random
`
`degenerate sequence. A double-stranded SMI sequence includes a first degenerate or
`
`semi-degenerate nucleotide n-mer sequence and a second n-mer sequence that is
`
`complementary to the first degenerate or semi-degenerate nucleotide n-mer sequence,
`
`while a single-stranded SMI sequence includes a first degenerate or semi-degenerate
`
`nucleotide n-mer sequence. The first and/or second degenerate or semi-degenerate
`
`nucleotide n-mer sequences may be any suitable length to produce a sufficiently large
`
`number of unique tags to label a set of sheared DNA fragments from a segment of DNA.
`
`Each n-mer sequence may be between approximately 3 to 20 nucleotides in length.
`
`Therefore, each n-mer sequence may be approximately 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
`
`14, 15, 16, 17, 18, 19, 20 nucleotides in length. In one embodiment, the SMI sequence
`
`is a random degenerate nucleotide n-mer sequence which is 12 nucleotides in length.
`
`A 12 nucleotide SMI n-mer sequence that is ligated to each end of a target nucleic acid
molecule, as described in the Example below, results in generation of up to 4^24 (i.e., 2.8
x 10^14) distinct tag sequences.
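
The tag count quoted above follows from four possible bases at each of 12 positions on each of the two fragment ends. A short Python check, with a hypothetical function name and assuming a fully degenerate four-base alphabet:

```python
BASES = 4  # A, C, G, T at every degenerate position (assumed)


def distinct_tag_pairs(n_mer_length, ends=2):
    """Number of distinct tag combinations for n-mers on each fragment end."""
    return BASES ** (n_mer_length * ends)


count = distinct_tag_pairs(12)
print(count)           # 281474976710656
print(f"{count:.1e}")  # 2.8e+14
```

A semi-degenerate tag, in which some positions allow fewer than four bases, would give a smaller number than this upper bound.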
`
`[0027]
`
In some embodiments, the SMI tag nucleotide sequence may be

completely random and degenerate, wherein each sequence position may be any

nucleotide (i.e., each position, represented by "X," is not limited, and may be an
`
`adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U)) or any other natural or
`
non-natural DNA or RNA nucleotide or nucleotide-like substance or analog with

base-pairing properties (e.g., xanthosine, inosine, hypoxanthine, xanthine, 7-methylguanine,

7-methylguanosine, 5,6-dihydrouracil, 5-methylcytosine, dihydrouridine, isocytosine,
`
`isoguanine, deoxynucleosides, nucleosides, peptide nucleic acids, locked nucleic acids,
`
`glycol nucleic acids and threose nucleic acids). The term "nucleotide" as described
`
`herein, refers to any and all nucleotide or any suitable natural or non-natural DNA or
`
`RNA nucleotide or nucleotide-like substance or analog with base pairing properties as
`
`described above.
`
`In other embodiments, the sequences need not contain all possible
`
`bases at each position. The degenerate or semi-degenerate n-mer sequences may be
`
`generated by a polymerase-mediated method described in the Example below, or may
`
`be generated by preparing and annealing a library of individual oligonucleotides of
`
`known sequence. Alternatively, any degenerate or semi-degenerate n-mer sequences
`
`may be a randomly or non-randomly fragmented double stranded DNA molecule from
`
`any alternative source that differs from the target DNA source.
`
`In some embodiments,
`
`the alternative source is a genome or plasmid derived from bacteria, an organism other
`
`than that of the target DNA, or a combination of such alternative organisms or sources.
`
`The random or non-random fragmented DNA may be introduced into SMI adaptors to
`
`serve as variable tags. This may be accomplished through enzymatic ligation or any
`
`other method known in the art.
`
`[0028]
`
`In some embodiments, the SMI adaptor molecules are ligated to both
`
`ends of a target nucleic acid molecule, and then this complex is used according to the
`
`methods described below.
`
In certain embodiments, it is not necessary to include

n-mers on both adaptor ends; however, it is more convenient because it means that one
`
`does not have to use two different types of adaptors and then select for ligated
`
`fragments that have one of each type rather than two of one type. The ability to
`
`determine which strand is which is still possible in the situation wherein only one of the
`
`two adaptors has a double-stranded SMI sequence.
`
`[0029]
`
`In some embodiments, the SMI adaptor molecule may optionally include a
`
`double-stranded fixed reference sequence downstream of the n-mer sequences to help
`
`make ligation more uniform and help computationally filter out errors due to ligation
`
`problems with improperly synthesized adaptors. Each strand of the double-stranded
`
fixed reference sequence may be 4 or 5 nucleotides in length; however, the
`
`fixed reference sequence may be any suitable length including, but not limited to 3, 4, 5
`
`or 6 nucleotides in length.
`
`[0030]
`
`The SMI ligation adaptor may be any suitable ligation adaptor that is
`
`complementary to a ligation adaptor added to a double-stranded target nucleic acid
`
`sequence including, but not limited to a T-overhang, an A-overhang, a CG overhang, a
`
`blunt end, or any other ligatable sequence.
`
`In some embodiments, the SMI ligation
`
`adaptor may be made using a method for A-tailing or T-tailing with polymerase
`
`extension; creating an overhang with a different enzyme; using a restriction enzyme to
`
`create a single or multiple nucleotide overhang, or any other method known in the art.
`
`[0031]
`
According to the embodiments described herein, the SMI adaptor
`
`molecule may include at least two PCR primer or "flow cell" binding sites: a forward
`
`PCR primer binding site (or a "flow cell 1" (FC1) binding site); and a reverse PCR primer
`
`binding site (or a "flow cell 2" (FC2) binding site). The SMI adaptor molecule may also
`
`include at least two sequencing primer binding sites, each corresponding to a
`
`sequencing read. Alternatively, the sequencing primer binding sites may be added in a
`
`separate step by inclusion of the necessary sequences as tails to the PCR primers, or
`
`by ligation of the needed sequences. Therefore, if a double-stranded target nucleic acid
`
`molecule has an SMI adaptor molecule ligated to each end, each sequenced strand will
`
`have two reads - a forward and a reverse read.
`
`Double-stranded SMI sequences
`
`[0032]
`
`Adaptor 1 (shown below) is a Y-shaped SMI adaptor as described above
`
`(the SMI sequence is shown as X's in the top strand (a 4-mer), with the complementary
`
`bottom strand sequence shown as Y's):
`
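[Diagram: Y-shaped adaptor with two non-complementary arms; in the duplex stem the top strand carries the SMI sequence XXXX and the bottom strand carries the complementary sequence YYYY]

(Adaptor 1)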
`
`[0033]
`
Adaptor 2 (shown below) is a hairpin (or "U-shaped") linker:
`
`(Adaptor 2)
`
`[0034]
Following ligation of both adaptors to a double-stranded target nucleic acid,
the following structure is obtained:
[Diagram: the FC1 arm of Adaptor 1 joined to the tagged duplex; strand layout as far as it can be recovered:]

-----XXXX--------DNA-----------
-----YYYY--------DNA'----------