`
WIPO
WORLD INTELLECTUAL PROPERTY ORGANIZATION

DOCUMENT MADE AVAILABLE UNDER THE PATENT COOPERATION TREATY (PCT)

International application number: PCT/US2013/032665
International filing date: 15 March 2013 (15.03.2013)
Document type: Certified copy of priority document
Document details:
    Country/Office: US
    Number: 61/625,623
    Filing date: 17 April 2012 (17.04.2012)
Date of receipt at the International Bureau: 01 April 2013 (01.04.2013)

Remark: Priority document submitted or transmitted to the International Bureau in compliance with Rule 17.1(a), (b) or (b-bis)

34, chemin des Colombettes
1211 Geneva 20, Switzerland
www.wipo.int
`
`00001
`
`EX1083
`
`
`
UNITED STATES DEPARTMENT OF COMMERCE
United States Patent and Trademark Office

March 31, 2013

THIS IS TO CERTIFY THAT ANNEXED HERETO IS A TRUE COPY FROM THE RECORDS OF THE UNITED STATES PATENT AND TRADEMARK OFFICE OF THOSE PAPERS OF THE BELOW IDENTIFIED PATENT APPLICATION THAT MET THE REQUIREMENTS TO BE GRANTED A FILING DATE.

APPLICATION NUMBER: 61/625,623
FILING DATE: April 17, 2012
RELATED PCT APPLICATION NUMBER: PCT/US13/32665

THE COUNTRY CODE AND NUMBER OF YOUR PRIORITY APPLICATION, TO BE USED FOR FILING ABROAD UNDER THE PARIS CONVENTION, IS US61/625,623

Certified by

Under Secretary of Commerce for Intellectual Property
and Director of the United States Patent and Trademark Office
`
`00002
`
`
`
`Doc Code: TR.PROV
`Document Description: Provisional Cover Sheet (SB16)
`
`PTO/SB/16 (04-07)
Approved for use through 06/30/2010 OMB 0651-0032
U.S. Patent and Trademark Office: U.S. DEPARTMENT OF COMMERCE
Under the Paperwork Reduction Act of 1995, no persons are required to respond to a collection of information unless it displays a valid OMB control number
`Provisional Application for Patent Cover Sheet
`This is a request for filing a PROVISIONAL APPLICATION FOR PATENT under 37 CFR 1.53(c)
`
Inventor(s)

Inventor 1
    Given Name: Michael
    Middle Name:
    Family Name: Schmitt
    City: Seattle
    State: WA
    Country: US

Inventor 2
    Given Name: Jesse
    Middle Name:
    Family Name: Salk
    City: Seattle
    State: WA
    Country: US

Inventor 3
    Given Name: Lawrence
    Middle Name: A.
    Family Name: Loeb
    City: Bellevue
    State: WA
    Country: US

All Inventors Must Be Listed - Additional Inventor Information blocks may be generated within this form by selecting the Add button.
`
`Title of Invention
`
`METHODS OF LOWERING THE ERROR RATE OF MASSIVELY PARALLEL
`DNA SEQUENCING USING DUPLEX CONSENSUS SEQUENCING
`
`Attorney Docket Number (if applicable)
`
72227.8043.US01
`
`Correspondence Address
`
`Direct all correspondence to (select one):
`
(x) The address corresponding to Customer Number    ( ) Firm or Individual Name
`
`Customer Number
`
`94991
`
`The invention was made by an agency of the United States Government or under a contract with an agency of the United
`States Government.
( ) No.
(x) Yes, the name of the U.S. Government agency and the Government contract number are:
NIH R01 CA115802; NIH R01 CA102029
`
`EFS - Web 1.0.1
`
`00003
`
`
`
`
`Entity Status
`Applicant claims small entity status under 37 CFR 1.27
`
(x) Yes, applicant qualifies for small entity status under 37 CFR 1.27
( ) No
`Warning
`
Petitioner/applicant is cautioned to avoid submitting personal information in documents filed in a patent application that may
contribute to identity theft. Personal information such as social security numbers, bank account numbers, or credit card
numbers (other than a check or credit card authorization form PTO-2038 submitted for payment purposes) is never required
by the USPTO to support a petition or an application. If this type of personal information is included in documents submitted
to the USPTO, petitioners/applicants should consider redacting such personal information from the documents before
submitting them to USPTO. Petitioner/applicant is advised that the record of a patent application is available to the public
after publication of the application (unless a non-publication request in compliance with 37 CFR 1.213(a) is made in the
application) or issuance of a patent. Furthermore, the record from an abandoned application may also be available to the
public if the application is referenced in a published application or an issued patent (see 37 CFR 1.14). Checks and credit
card authorization forms PTO-2038 submitted for payment purposes are not retained in the application file and therefore are
not publicly available.
`
`Signature
`
`Please see 37 CFR 1.4(d) for the form of the signature.
`
Signature: /Lara J. Dueppen/
Date (YYYY-MM-DD): Apr 17, 2012
First Name: Lara J.
Last Name: Dueppen
Registration Number (If appropriate): 65002
`
`This collection of information is required by 37 CFR 1.51. The information is required to obtain or retain a benefit by the public which is to
`file (and by the USPTO to process) an application. Confidentiality is governed by 35 U.S.C. 122 and 37 CFR 1.11 and 1.14. This collection
`is estimated to take 8 hours to complete, including gathering, preparing, and submitting the completed application form to the USPTO.
`Time will vary depending upon the individual case. Any comments on the amount of time you require to complete this form and/or
`suggestions for reducing this burden, should be sent to the Chief Information Officer, U.S. Patent and Trademark Office, U.S. Department
`of Commerce, P.O. Box 1450, Alexandria, VA 22313-1450. DO NOT SEND FEES OR COMPLETED FORMS TO THIS ADDRESS. This
`form can only be used when in conjunction with EFS-Web. If this form is mailed to the USPTO, it may cause delays in handling
`the provisional application.
`
`
`00004
`
`
`
`Privacy Act Statement
`
`The Privacy Act of 1974 (P.L. 93-579) requires that you be given certain information in connection with your submission of
the attached form related to a patent application or patent. Accordingly, pursuant to the requirements of the Act, please be
`advised that: (1) the general authority for the collection of this information is 35 U.S.C. 2(b)(2); (2) furnishing of the
`information solicited is voluntary; and (3) the principal purpose for which the information is used by the U.S. Patent and
`Trademark Office is to process and/or examine your submission related to a patent application or patent. If you do not
`furnish the requested information, the U.S. Patent and Trademark Office may not be able to process and/or examine your
`submission, which may result in termination of proceedings or abandonment of the application or expiration of the patent.
`
`The information provided by you in this form will be subject to the following routine uses:
`
1. The information on this form will be treated confidentially to the extent allowed under the Freedom of Information Act (5 U.S.C. 552) and the Privacy Act (5 U.S.C. 552a). Records from this system of records may be disclosed to the Department of Justice to determine whether disclosure of these records is required by the Freedom of Information Act.

2. A record from this system of records may be disclosed, as a routine use, in the course of presenting evidence to a court, magistrate, or administrative tribunal, including disclosures to opposing counsel in the course of settlement negotiations.

3. A record in this system of records may be disclosed, as a routine use, to a Member of Congress submitting a request involving an individual, to whom the record pertains, when the individual has requested assistance from the Member with respect to the subject matter of the record.

4. A record in this system of records may be disclosed, as a routine use, to a contractor of the Agency having need for the information in order to perform a contract. Recipients of information shall be required to comply with the requirements of the Privacy Act of 1974, as amended, pursuant to 5 U.S.C. 552a(m).

5. A record related to an International Application filed under the Patent Cooperation Treaty in this system of records may be disclosed, as a routine use, to the International Bureau of the World Intellectual Property Organization, pursuant to the Patent Cooperation Treaty.

6. A record in this system of records may be disclosed, as a routine use, to another federal agency for purposes of National Security review (35 U.S.C. 181) and for review pursuant to the Atomic Energy Act (42 U.S.C. 218(c)).

7. A record from this system of records may be disclosed, as a routine use, to the Administrator, General Services, or his/her designee, during an inspection of records conducted by GSA as part of that agency's responsibility to recommend improvements in records management practices and programs, under authority of 44 U.S.C. 2904 and 2906. Such disclosure shall be made in accordance with the GSA regulations governing inspection of records for this purpose, and any other relevant (i.e., GSA or Commerce) directive. Such disclosure shall not be used to make determinations about individuals.

8. A record from this system of records may be disclosed, as a routine use, to the public after either publication of the application pursuant to 35 U.S.C. 122(b) or issuance of a patent pursuant to 35 U.S.C. 151. Further, a record may be disclosed, subject to the limitations of 37 CFR 1.14, as a routine use, to the public if the record was filed in an application which became abandoned or in which the proceedings were terminated and which application is referenced by either a published application, an application open to public inspection or an issued patent.

9. A record from this system of records may be disclosed, as a routine use, to a Federal, State, or local law enforcement agency, if the USPTO becomes aware of a violation or potential violation of law or regulation.
`
`00005
`
`
`
`Attorney Docket No. 72227.8043.US00
`
`METHODS OF LOWERING THE ERROR RATE OF MASSIVELY PARALLEL DNA
`SEQUENCING USING DUPLEX CONSENSUS SEQUENCING
`
`STATEMENT OF GOVERNMENT INTEREST
`
[0001] The present invention was made with government support under Grant Nos. R01 CA115802 and R01 CA102029 awarded by the National Institutes of Health. The Government has certain rights in the invention.
`
`BACKGROUND
`
[0002] The advent of massively parallel DNA sequencing has ushered in a new era of genomic exploration by making simultaneous genotyping of hundreds of billions of base-pairs possible at a small fraction of the time and cost of traditional Sanger methods [1]. Because these technologies digitally tabulate the sequence of many individual DNA fragments, unlike conventional techniques which simply report the average genotype of an aggregate collection of molecules, they offer the unique ability to detect minor variants within heterogeneous mixtures [2].
`
[0003] This concept of "deep sequencing" has been implemented in a variety of fields including metagenomics [3, 4], paleogenomics [5], forensics [6], and human genetics [7, 8] to disentangle subpopulations in complex biological samples. Clinical applications, such as prenatal screening for fetal aneuploidy [9, 10], early detection of cancer [11] and monitoring its response to therapy [12, 13] with nucleic acid-based serum biomarkers, are rapidly being developed. Exceptional diversity within microbial [14, 15], viral [16-18] and tumor cell populations [19, 20] has been characterized through next-generation sequencing, and many low-frequency, drug-resistant variants of
`
`72227-8043/LEGAL23158670.l
`
therapeutic importance have been so identified [12, 21, 22]. Previously unappreciated intra-organismal mosaicism in both the nuclear [23] and mitochondrial [24, 25] genome has been revealed by these technologies, and such somatic heterogeneity, along with that arising within the adaptive immune system [13], may be an important factor in phenotypic variability of disease.
`
[0004] Deep sequencing, however, has limitations. Although, in theory, DNA subpopulations of any size should be detectable when deep sequencing a sufficient number of molecules, a practical limit of detection is imposed by errors introduced during sample preparation and sequencing. PCR amplification of heterogeneous mixtures can result in population skewing due to stochastic and non-stochastic amplification biases and lead to over- or under-representation of particular variants [26]. Polymerase mistakes during pre-amplification generate point mutations resulting from base mis-incorporations and rearrangements due to template switching [26, 27]. Combined with the additional errors that arise during cluster amplification, cycle sequencing and image analysis, approximately 1% of bases are incorrectly identified, depending on the specific platform and sequence context [2, 28]. This background level of artifactual heterogeneity establishes a limit below which the presence of true rare variants is obscured [29].
`
[0005] A variety of improvements at the level of biochemistry [30-32] and data processing [19, 21, 28, 32, 33] have been developed to improve sequencing accuracy. The ability to resolve subpopulations below 0.1%, however, has remained elusive. Although several groups have attempted to increase sensitivity of sequencing, several limitations remain. For example, techniques whereby DNA fragments to be sequenced are each uniquely tagged [34, 35] prior to amplification [36-41] have been reported. Because all amplicons derived from a particular starting molecule will bear its specific tag, any variation in the sequence or copy number of identically tagged sequencing reads can be discounted as technical error. This approach has been used to improve counting accuracy of DNA [38, 39, 41] and RNA templates [37, 38, 40] and to correct base errors arising during PCR or sequencing [36, 37, 39]. Kinde et al. reported a reduction in error frequency of approximately 20-fold with a tagging method that is based on labeling single-stranded DNA fragments with a primer containing a 14 bp degenerate sequence. This allowed for an observed mutation frequency of ~0.001% mutations/bp in normal human genomic DNA [36]. Nevertheless, a number of highly sensitive genetic assays have indicated that the true mutation frequency in normal cells is likely to be far lower, with estimates of per-nucleotide mutation frequencies generally ranging from 10^-9 to 10^-11 [42]. Thus, the mutations seen in normal human genomic DNA by Kinde et al. are likely the result of significant technical artifacts.
`
[0006] Traditionally, next-generation sequencing platforms rely upon generation of sequence data from a single strand of DNA. As a consequence, artifactual mutations introduced during the initial rounds of PCR amplification are undetectable as errors, even with tagging techniques, if the base change is propagated to all subsequent PCR duplicates. Several types of DNA damage are highly mutagenic and may lead to this scenario. Spontaneous DNA damage arising from normal metabolic processes results in thousands of damaging events per cell per day [43]. In addition to damage from oxidative cellular processes, further DNA damage is generated ex vivo during tissue processing and DNA extraction [44]. These damage events can result in frequent copying errors by DNA polymerases: for example, a common DNA lesion arising from oxidative damage, 8-oxo-guanine, has the propensity to incorrectly pair with adenine during complementary strand extension with an overall efficiency greater than that of correct pairing with cytosine, and thus can contribute a large frequency of artifactual G→T mutations [45]. Likewise, deamination of cytosine to form uracil is a particularly common event which leads to the inappropriate insertion of adenine during PCR, thus producing artifactual C→T mutations with a frequency approaching 100% [46].
`
[0007] It would be desirable to develop an approach for tag-based error correction, which reduces or eliminates artifactual mutations arising from DNA damage, PCR errors, and sequencing errors; allows rare variants in heterogeneous populations to be detected with unprecedented sensitivity; and which capitalizes on the redundant information stored in complexed double-stranded DNA.
`
`SUMMARY
`
[0008] In one embodiment, a single molecule identifier (SMI) adaptor molecule for use in sequencing a double-stranded target nucleic acid molecule is provided. Said SMI adaptor molecule includes a double-stranded single molecule identifier (SMI) sequence which comprises a double-stranded degenerate or semi-degenerate DNA sequence; and an SMI ligation adaptor that allows the SMI adaptor molecule to be ligated to the double-stranded target nucleic acid sequence. In some embodiments, the double-stranded target nucleic acid molecule is a double-stranded DNA or RNA molecule.
`
[0009] In another embodiment, a method of obtaining the sequence of a double-stranded target nucleic acid (also known as Duplex Consensus Sequencing or DCS) is provided. Such a method may include steps of ligating a double-stranded target nucleic acid molecule to at least one SMI adaptor molecule to form a double-stranded SMI-target nucleic acid complex; amplifying the double-stranded SMI-target nucleic acid complex, resulting in a set of amplified SMI-target nucleic acid products; and sequencing the amplified SMI-target nucleic acid products.
`
[0010] In some embodiments, the method may additionally include generating an error-corrected double-stranded consensus sequence by (i) grouping the sequenced SMI-target nucleic acid products into families of paired target nucleic acid strands based on a common set of SMI sequences; and (ii) removing paired target nucleic acid strands having one or more nucleotide positions where the paired target nucleic acid strands are non-complementary (or alternatively removing individual nucleotide positions in cases where the sequence at the nucleotide position under consideration disagrees among the two strands). In further embodiments, the method confirms the presence of a true mutation by (i) identifying a mutation present in the paired target nucleic acid strands having one or more nucleotide positions that disagree; (ii) comparing the mutation present in the paired target nucleic acid strands to the error corrected double-stranded consensus sequence; and (iii) confirming the presence of a true mutation when the mutation is present on both of the target nucleic acid strands and appears in all members of a paired target nucleic acid family.
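The grouping and strand-comparison steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicants' implementation: reads are assumed to be already aligned, trimmed to equal length, and reported in the same orientation for both strands; the function names, the `min_fraction` threshold, and the `min_family_size` cutoff are all hypothetical. The sketch follows the alternative of masking individual disagreeing positions (written as N) rather than discarding the whole pair.

```python
from collections import Counter, defaultdict


def partner_tag(tag):
    """Swap the two halves of a combined SMI tag to get the other strand's tag."""
    half = len(tag) // 2
    return tag[half:] + tag[:half]


def strand_consensus(reads, min_fraction=0.7):
    """Per-position consensus of one tag family; unresolved positions become N."""
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_fraction else "N")
    return "".join(consensus)


def duplex_consensus(tagged_reads, min_family_size=3):
    """Return {(tag, partner): duplex consensus} for families with both strands."""
    families = defaultdict(list)
    for tag, sequence in tagged_reads:
        families[tag].append(sequence)

    results = {}
    for tag in sorted(families):
        partner = partner_tag(tag)
        # Visit each pair once; skip families whose partner strand is missing.
        if tag >= partner or partner not in families:
            continue
        first, second = families[tag], families[partner]
        if min(len(first), len(second)) < min_family_size:
            continue
        a, b = strand_consensus(first), strand_consensus(second)
        # Keep a base only when both strands agree on it; otherwise mask it.
        results[(tag, partner)] = "".join(
            x if x == y else "N" for x, y in zip(a, b)
        )
    return results


reads = (
    [("TAACCGGA", "ATGTAC")] * 3
    + [("TAACCGGA", "ATGTAA")]      # late error in a single family member
    + [("CGGATAAC", "ATGGAC")] * 3  # first-round error on one strand only
)
print(duplex_consensus(reads))
```

In this toy input the change shared by both strand families survives into the duplex consensus, the error seen in a single read is outvoted within its family, and the position where only one strand carries a change is masked.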
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
[0011] Figure 1 illustrates an overview of Duplex Consensus Sequencing. Sheared double-stranded DNA that has been end-repaired and T-tailed is combined with A-tailed SMI adaptors and ligated according to one embodiment. Because every adaptor contains a unique, double-stranded, complementary n-mer random tag on each end (n-mer = 12 bp according to one embodiment), every DNA fragment becomes labeled with two distinct SMI sequences (arbitrarily designated α and β in the single capture event shown). After size-selecting for appropriate length fragments, PCR amplification with primers containing Illumina flow-cell-compatible tails is carried out to generate families of PCR duplicates. By virtue of the asymmetric nature of adapted fragments, two types of PCR products are produced from each capture event. Those derived from one strand will have the α SMI sequence adjacent to flow-cell sequence 1 and the β SMI sequence adjacent to flow-cell sequence 2. PCR products originating from the complementary strand are labeled reciprocally.
`
[0012] Figure 2 illustrates Single Molecule Identifier (SMI) adaptor synthesis according to one embodiment. Oligonucleotides are annealed and the complement of the degenerate lower arm sequence (N's) plus adjacent fixed bases is produced by polymerase extension of the upper strand in the presence of all four dNTPs. After reaction cleanup, complete adaptor A-tailing is ensured by extended incubation with polymerase and dATP.
`
[0013] Figure 3 illustrates error correction through Duplex Consensus Sequencing (DCS) analysis according to one embodiment. (a-c) show sequence reads (brown) sharing a unique set of SMI tags, which are grouped into paired families with members having strand identifiers in either the αβ or βα orientation. Each family pair reflects one double-stranded DNA fragment. (a) shows mutations (spots) present in only one or a few family members, representing sequencing mistakes or PCR-introduced errors occurring late in amplification. (b) shows mutations occurring in many or all members of one family in a pair, representing mutations scored on only one of the two strands, which can be due to PCR errors arising during the first round of amplification such as might occur when copying across sites of mutagenic DNA damage. (c) shows true mutations (* arrow) present on both strands of a captured fragment, which appear in all members of a family pair. While artifactual mutations may co-occur in a family pair with a true mutation, these can be independently identified and discounted when producing (d) an error-corrected consensus sequence (+ arrow) for each duplex. (e) shows consensus sequences from all independently captured, randomly sheared fragments containing a particular genomic site, which are identified and (f) compared to determine the frequency of genetic variants at this locus within the sampled population.
`
[0014] Figure 4 illustrates an example of how an SMI sequence with n-mers of 4 nucleotides in length (4-mers) is read by Duplex Consensus Sequencing (DCS) according to some embodiments. (A) shows the 4-mers with the PCR primer binding sites (or flow cell sequences) 1 and 2 indicated at each end. (B) shows the same molecules as in (A) but with the strands separated and the lower strand now written in the 5'-3' direction. When these molecules are amplified with PCR and sequenced, they will yield the following sequence reads: The top strand will give a read 1 file of TAAC--- and a read 2 file of CGGA---. Combining the read 1 and read 2 tags will give TAACCGGA as the SMI for the top strand. The bottom strand will give a read 1 file of CGGA--- and a read 2 file of TAAC---. Combining the read 1 and read 2 tags will give CGGATAAC as the SMI for the bottom strand. (C) illustrates the orientation of paired strand mutations in DCS. In the initial DNA duplex shown in Figures 4A and 4B, a mutation "x" (which is paired to a complementary nucleotide "y") is shown on the left side of the DNA duplex. The "x" will appear in read 1, and the complementary mutation on the opposite strand, "y," will appear in read 2. Specifically, this would appear as "x" in both read 1 and read 2 data, because "y" in read 2 is read out as "x" by the sequencer owing to the nature of the sequencing primers, which generate the complementary sequence during read 2.
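The tag bookkeeping in the Figure 4 example can be written out as a short Python sketch. It assumes the first n bases of each read are the SMI n-mer and takes the top-strand read 2 tag to be CGGA, as implied by the combined SMI TAACCGGA; the helper names `smi_tag` and `partner_tag` are hypothetical and used only for illustration.

```python
def smi_tag(read1, read2, n=4):
    """Combine the n-mer at the start of read 1 with the n-mer at the start of read 2."""
    return read1[:n] + read2[:n]


def partner_tag(tag):
    """Tag expected for the opposite strand: the two n-mer halves swapped."""
    half = len(tag) // 2
    return tag[half:] + tag[:half]


top = smi_tag("TAAC", "CGGA")     # top strand: read 1 then read 2
bottom = smi_tag("CGGA", "TAAC")  # bottom strand: labeled reciprocally

print(top, bottom)
print(partner_tag(top) == bottom)
```

Swapping the two halves of one strand's combined tag yields the other strand's tag, which is what lets reads from the two strands of a single duplex be paired during analysis.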
`
`DETAILED DESCRIPTION
`
[0015] Single molecule identifier adaptors and methods for their use are provided herein. According to the embodiments described herein, a single molecule identifier (SMI) adaptor molecule is provided. Said SMI adaptor molecule may include a double-stranded single molecule identifier (SMI) sequence, and an SMI ligation adaptor (Figure 2). Optionally, the SMI adaptor molecule further includes at least two PCR primer binding sites, at least two sequencing primer binding sites, or both.
`
[0016] In some embodiments, the SMI adaptor molecule includes a double-stranded, complementary SMI sequence (or "tag") of nucleotides that is degenerate or semi-degenerate. In some embodiments, the degenerate or semi-degenerate SMI sequence may be a random degenerate sequence. The double-stranded SMI sequence includes a first degenerate or semi-degenerate nucleotide n-mer sequence and a second n-mer sequence that is complementary to the first degenerate or semi-degenerate nucleotide n-mer sequence. The first and second degenerate or semi-degenerate nucleotide n-mer sequences may be any suitable length to produce a sufficiently large number of unique tags to label a set of sheared DNA fragments from a segment of DNA. Each n-mer sequence may be between approximately 4 and 20 nucleotides in length. Therefore, each n-mer sequence may be approximately 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20 nucleotides in length. In one embodiment, the SMI sequence is a random degenerate nucleotide n-mer sequence which is 12 nucleotides in length. A 12 nucleotide SMI n-mer sequence that is ligated to each end of a target nucleic acid molecule, as described in the Example below, results in generation of up to 4^24 (i.e., 2.8 x 10^14) distinct tag sequences.
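The tag-count arithmetic can be checked directly. The sketch below assumes four possible bases at every position and one n-mer on each of the two fragment ends, so a 12-mer gives 24 degenerate positions in total; `distinct_tags` is a hypothetical helper name.

```python
def distinct_tags(n_mer_length, tagged_ends=2, alphabet_size=4):
    """Upper bound on distinct combined tags for fully degenerate n-mers."""
    return alphabet_size ** (n_mer_length * tagged_ends)


count = distinct_tags(12)
print(count)           # 281474976710656
print(f"{count:.1e}")  # 2.8e+14
```

The same function shows how quickly the tag space shrinks for shorter n-mers: a 4-mer on each end allows only 4^8 = 65,536 combinations.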
`
[0017] In some embodiments, the SMI tag nucleotide sequence may be completely random and degenerate, wherein each sequence position may be any nucleotide (i.e., each position, represented by "X," is not limited, and may be an adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U)) or any other natural or non-natural DNA or RNA nucleotide or nucleotide-like substance or analog with base-pairing properties (e.g., xanthosine, inosine, hypoxanthine, xanthine, 7-methylguanine, 7-methylguanosine, 5,6-dihydrouracil, 5-methylcytosine, dihydrouridine, isocytosine, isoguanine, deoxynucleosides, nucleosides, peptide nucleic acids, locked nucleic acids, glycol nucleic acids and threose nucleic acids). The term "nucleotide" as described herein refers to any and all nucleotides or any suitable natural or non-natural DNA or RNA nucleotide or nucleotide-like substance or analog with base pairing properties as described above. In other embodiments, the sequences need not contain all possible bases at each position. The degenerate or semi-degenerate n-mer sequences may be generated by a polymerase-mediated method described in the Example below, or may be generated by preparing and annealing a library of individual oligonucleotides of known sequence. Alternatively, any degenerate or semi-degenerate n-mer sequences may be a randomly or non-randomly fragmented double stranded DNA molecule from any alternative source that differs from the target DNA source. In some embodiments, the alternative source is a genome or plasmid derived from bacteria, an organism other than that of the target DNA, or a combination of such alternative organisms or sources. The random or non-random fragmented DNA may be introduced into SMI adaptors to serve as variable tags. This may be accomplished through enzymatic ligation or any other method known in the art.
`
[0018] In some embodiments, the SMI adaptor molecules are ligated to both ends of a target nucleic acid molecule, and then this complex is used according to the methods described below. In certain embodiments, it is not necessary to include n-mers on both adaptor ends; however, it is more convenient because it means that one does not have to use two different types of adaptors and then select for ligated fragments that have one of each type rather than two of one type. The ability to determine which strand is which is still possible in the situation wherein only one of the two adaptors has a double-stranded SMI sequence.
`
[0019] In some embodiments, the SMI adaptor molecule may optionally include a double-stranded fixed reference sequence downstream of the n-mer sequences to help make ligation more uniform and help computationally filter out errors due to ligation problems with improperly synthesized adaptors. Each strand of the double-stranded fixed reference sequence may be 4 or 5 nucleotides in length; however, the fixed reference sequence may be any suitable length including, but not limited to, 3, 4, 5 or 6 nucleotides in length.
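One way such a fixed reference sequence could be used as a computational filter is sketched below. This is an illustration only: the read layout (n-mer, then fixed sequence, then target), the 12-base tag length default, and the fixed sequence "CAGT" are assumptions chosen for the example and are not taken from the text.

```python
def split_tagged_read(read, n_mer_length=12, fixed_seq="CAGT"):
    """Return (tag, target sequence), or None if the fixed sequence is not
    found immediately downstream of the n-mer."""
    tag = read[:n_mer_length]
    spacer = read[n_mer_length:n_mer_length + len(fixed_seq)]
    if spacer != fixed_seq:
        # Likely a mis-synthesized adaptor or a ligation problem: discard.
        return None
    return tag, read[n_mer_length + len(fixed_seq):]


good = "ACGTACGTACGT" + "CAGT" + "TTGACC"
bad = "ACGTACGTACG" + "CAGT" + "TTGACC"   # tag one base short, spacer shifted

print(split_tagged_read(good))
print(split_tagged_read(bad))
```

Reads whose fixed sequence is absent or shifted are dropped before tag families are built, so a truncated or extended n-mer does not create a spurious tag.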
`
[0020] The SMI ligation adaptor may be any suitable ligation adaptor that is complementary to a ligation adaptor added to a double-stranded target nucleic acid sequence including, but not limited to a T-overhang, an A-overhang, a CG overhang, a blunt end, or any other ligatable sequence. In some embodiments, the SMI ligation adaptor may be made using a method for A-tailing or T-tailing with polymerase extension; creating an overhang with a different enzyme; using a restriction enzyme to create a single or multiple nucleotide overhang, or any other method known in the art.
`
[0021] According to the embodiments described herein, the SMI adaptor molecule may include at least two PCR primer or "flow cell" binding sites: a forward PCR primer binding site (or a "flow cell 1" (FC1) binding site); and a reverse PCR primer binding site (or a "flow cell 2" (FC2) binding site). The SMI adaptor molecule may also include at least two sequencing primer binding sites, each corresponding to a sequencing read. Alternatively, the sequencing primer binding sites may be added in a separate step by inclusion of the necessary sequences as tails to the PCR primers, or by ligation of the needed sequences. Therefore, if a double-stranded target nucleic acid molecule has an SMI adaptor molecule ligated to each end, each sequenced strand will have two reads: a forward and a reverse read.
`
[0022] In some embodiments, the SMI adaptor molecule is a "Y-shaped" adaptor, which allows both strands to be independently amplified by a PCR method prior to sequencing because both the top and bottom strands have binding sites for PCR primers FC1 and FC2 as shown below. A schematic of a Y-shaped SMI adaptor molecule is also shown in Figure 2.
`
[0023] A Y-shaped SMI adaptor requires successful amplification and recovery of both strands. In one embodiment, a modification that would simplify consistent recovery of both strands entails ligation of a Y-shaped SMI adaptor molecule to one end of a DNA duplex molecule, and ligation of a "U-shaped" linker to the other end of the molecule. PCR amplification of the hairpin-shaped product will then yield a linear fragment with flow cell sequences on either end. Distinct PCR primer binding sites (or flow cell sequences FC1 and FC2) will flank the DNA sequence corresponding to each of the two strands, and a given sequence seen in Read 1 will then have the sequence corresponding to the complementary DNA duplex strand seen in Read 2. Mutations can be scored only if they are seen on both ends of the molecule (corresponding to each strand of the original double-stranded fragment), i.e. at the same position in both Read 1 and Read 2. This design may be accomplished as follows.
`
[0024] Adaptor 1 (shown below) is a Y-shaped SMI adaptor as described above (the SMI sequence is shown as X's in the top strand (a 4-mer), with the complementary bottom strand sequence shown as Y's):
`
[Figure: schematic of the Y-shaped SMI adaptor, with the SMI shown as X's (top strand) and Y's (bottom strand) and an arm labeled FC2]
`
`(Adap