Attorney Docket No. 42534-708-101
`
PROVISIONAL PATENT APPLICATION

METHODS FOR COPY NUMBER VARIATION DETECTION

Inventor(s):

Helmy ELTOUKHY,
a citizen of the United States, residing at
2 Barry Lane
Atherton, CA 94027

AmirAli TALASAZ,
a citizen of the United States, residing at
2181 Camino a Los Cerros
Menlo Park, CA 94025

Assignee:

Guardant Health, Inc.
2686 Middlefield Rd., Suite D
Redwood City, CA 94063 U.S.A.

Entity:

Large entity

John Storella, P.C.
2625 Alcatraz Ave. #197
Berkeley, CA 94705
510-501-0567
Reg. No. 32,944
Customer # 115823

Filed: December 28, 2013
`
METHODS FOR COPY NUMBER VARIATION DETECTION

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] None.

STATEMENT AS TO FEDERALLY SPONSORED RESEARCH

[0002] None.
`
BACKGROUND OF THE INVENTION

[0003] The detection and quantification of polynucleotides is important for molecular biology and medical applications such as diagnostics. Genetic testing is particularly useful for a number of diagnostic methods. For example, disorders that are caused by rare genetic alterations (e.g., sequence variants) or changes in epigenetic markers, such as cancer and partial or complete aneuploidy, may be detected or more accurately characterized with DNA sequence information.

[0004] Early detection and monitoring of genetic diseases, such as cancer, is often useful and needed in the successful treatment or management of the disease. One approach may include the monitoring of a sample derived from cell free nucleic acids, a population of polynucleotides that can be found in different types of bodily fluids. In some cases, disease may be characterized or detected based on detection of genetic aberrations, such as a change in copy number variation and/or sequence variation of one or more nucleic acid sequences, or the development of other certain rare genetic alterations. Cell free DNA may contain genetic aberrations associated with a particular disease. With improvements in sequencing and techniques to manipulate nucleic acids, there is a need in the art for improved methods and systems for using cell free DNA to detect and monitor disease.

[0005] In particular, many methods have been developed for accurate copy number variation estimation, especially for heterogeneous genomic samples, such as tumor-derived gDNA, or for cfDNA, for many applications (e.g., prenatal, transplant, immune, metagenomics or cancer diagnostics). Most of these methods consist of sample preparation whereby the original nucleic acids are converted into a sequenceable library, followed by massively parallel sequencing, and finally bioinformatics to estimate copy number variation at one or more loci.

[0006] International Publication WO 2012/142213 (Vogelstein et al.) refers to an approach called "Safe-SeqS" (for Safe-Sequencing System) that is said to include (i) assignment of a unique identifier (UID) to each template molecule; (ii) amplification of each uniquely tagged template molecule to create UID-families; and (iii) redundant sequencing of the amplification products.

[0007] International Publication WO 2013/142389 (Schmidt et al.) refers to a method referred to as Duplex Consensus Sequencing (DCS). The approach is described as reducing errors by independently tagging and sequencing each of the two strands of a DNA duplex.

[0008] Statements in the Background are not necessarily meant to endorse the characterization in the cited references.
`
SUMMARY OF THE INVENTION

[0009] Although many of these methods are able to reduce or combat the errors introduced by the sample preparation and sequencing processes for all molecules that are converted and sequenced, none of these methods is able to infer the counts of molecules that were converted but not sequenced. Since this count of converted but unsequenced molecules can be highly variable from genomic region to region, these counts can dramatically and adversely affect the sensitivity that can be achieved.

[0010] To combat this issue, input double-stranded DNA can be converted by a process that tags both halves of the individual double-stranded molecule differently. This can be performed using a variety of different techniques, including ligation of hairpin, bubble or forked adapters. If tagged correctly, each original Watson and Crick side of the input double-stranded DNA molecule can be uniquely identified by the sequencer and subsequent bioinformatics. For all molecules in a particular region, counts of molecules where both Watson and Crick sides were recovered ("Pairs") versus those where only one half was recovered ("Singlets") can be recorded. The number of unseen molecules can be estimated based on the number of Pairs and Singlets detected.
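By way of illustration only, the estimate can be sketched with a simple binomial model. The sketch assumes, and this assumption is not stated in the text above, that each strand of every tagged molecule is recovered independently with the same probability p. Under that assumption the expected counts for N original molecules are Pairs = N·p², Singlets = 2·N·p·(1−p) and Unseen = N·(1−p)², which can be solved for p and N from the observed counts. The function name and return keys are hypothetical.

```python
def estimate_total_molecules(pairs, singlets):
    """Estimate total and unseen duplex molecules at one locus.

    Illustrative assumption: each strand is recovered independently
    with the same probability p, so that
        E[pairs]    = N * p**2
        E[singlets] = 2 * N * p * (1 - p)
        E[unseen]   = N * (1 - p)**2
    """
    if pairs <= 0:
        raise ValueError("at least one Pair is needed to estimate p")
    # Strands observed: two per Pair, one per Singlet.
    # Solving the expectations above gives p = 2P / (2P + S).
    p = 2 * pairs / (2 * pairs + singlets)
    # N = P / p**2, written without intermediate rounding.
    total = (2 * pairs + singlets) ** 2 / (4 * pairs)
    # Unseen = N - P - S simplifies to S**2 / (4P).
    unseen = singlets ** 2 / (4 * pairs)
    return {"p": p, "total": total, "unseen": unseen}
```

For example, 400 Pairs and 400 Singlets at a locus would correspond, under this model, to a per-strand recovery probability of 2/3, about 900 original molecules, and about 100 Unseen molecules. Other distributions could be substituted for the binomial assumption.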
`
[0011] In one aspect disclosed herein is a method comprising: (a) providing a sample comprising a set of double-stranded polynucleotide molecules, each double-stranded polynucleotide molecule including first and second complementary strands; (b) tagging the double-stranded polynucleotide molecules with a set of duplex tags, wherein each duplex tag differently tags the first and second complementary strands of a double-stranded polynucleotide molecule in the set; (c) sequencing at least some of the tagged strands to produce a set of sequence reads; (d) reducing or tracking redundancy in the set of sequence reads; (e) sorting sequence reads into paired reads and unpaired reads, wherein (1) each paired read is formed from sequence reads generated from a first tagged strand and a second differently tagged complementary strand derived from a double-stranded polynucleotide molecule in the set; and (2) each unpaired read represents a first tagged strand having no second differently tagged complementary strand derived from a double-stranded polynucleotide molecule represented among the sequence reads in the set of sequence reads; (f) determining quantitative measures of (1) the paired reads and (2) the unpaired reads that map to each of one or more genetic loci; and (g) estimating a quantitative measure of total double-stranded polynucleotide molecules in the set that map to each of the one or more genetic loci based on the quantitative measure of paired reads and unpaired reads mapping to each locus. In one embodiment the method further comprises: (h) detecting copy number variation in the sample by determining a normalized total quantitative measure determined in step (g) at each of the one or more genetic loci and determining copy number variation based on the normalized measure. In another embodiment the double-stranded polynucleotide molecules are DNA. In another embodiment the sample comprises double-stranded polynucleotide molecules sourced substantially from cell-free nucleic acids, e.g., cfDNA. In another embodiment the sample comprises no more than 100 ng of double-stranded polynucleotide molecules. In another embodiment the sample is selected from the group consisting of blood, plasma, serum, urine, saliva, mucosal excretions, sputum, stool and tears. In another embodiment the sample comprises double-stranded polynucleotide molecules from healthy cells and from malignant cells. In another embodiment the sample comprises maternal double-stranded polynucleotide molecules and fetal double-stranded polynucleotide molecules. In another embodiment any of at least 10%, 25%, 50%, 75%, 90% or 99% of the double-stranded polynucleotide molecules in the set bear an identifying tag shared with at least one other double-stranded polynucleotide molecule in the set (e.g., the set of polynucleotide molecules is non-uniquely tagged). In another embodiment any of at most 25%, 10%, 2%, 1% or 0.1% of the double-stranded polynucleotide molecules in the set bear an identifying tag shared with at least one other polynucleotide molecule in the set. In another embodiment the double-stranded polynucleotide molecules in the set are tagged with between 2 and 1000 different identifying tags or between 2 and 100 different identifying tags. In another embodiment each duplex tag comprises a polynucleotide identifier. In another embodiment each polynucleotide identifier comprises a non-complementary region. In another embodiment each duplex tag is Y-shaped, bubble shaped or hairpin shaped. In another embodiment the double-stranded polynucleotides are converted into tagged polynucleotides with a conversion efficiency of at least 10%, at least 20%, at least 30%, at least 40%, at least 50%, at least 60%, at least 80% or at least 90%. In another embodiment tagging comprises any of blunt-end ligation, sticky end ligation, molecular inversion probes, PCR, ligation-based PCR, multiplex PCR, single strand ligation and single strand circularization. In another embodiment sequencing comprises amplification of the tagged strands, e.g., by PCR. In another embodiment the method comprises filtering out sequence reads that are introduced into the sample through contamination. In another embodiment the method comprises filtering out reads that fail to meet a set threshold. In another embodiment reducing redundancy in the set of sequence reads comprises collapsing sequence reads produced from amplified products of an original polynucleotide molecule in the sample back to the original polynucleotide molecule. In another embodiment the method further comprises determining a consensus sequence for the original polynucleotide molecule. In another embodiment the method further comprises identifying polynucleotide molecules at one or more genetic loci comprising a sequence variant. In another embodiment the method further comprises determining a quantitative measure of paired reads that map to a locus, wherein both strands of the pair comprise a sequence variant. In another embodiment the method further comprises determining a quantitative measure of paired molecules in which only one member of the pair bears a sequence variant and/or determining a quantitative measure of unpaired molecules bearing a sequence variant. In another embodiment the sequence variant is selected from a single nucleotide variant, an indel, a transversion, a translocation, an inversion, a deletion, a chromosomal structure alteration, a gene fusion, a chromosome fusion, a gene truncation, a gene amplification, a gene duplication and a chromosomal lesion. In another embodiment the quantitative measures are numbers of molecules. In another embodiment the one or more genetic loci are a plurality of genetic loci. In another embodiment the one or more genetic loci correspond to one or more oncogenes (e.g., a panel of oncogenes). In another embodiment the plurality of the genetic loci map to a single nucleotide, a gene, a fragment of a chromosome, a full chromosome or a genome. In another embodiment estimating the quantitative measure comprises estimating a quantitative measure of polynucleotide molecules in the sample for which no sequence reads are detected. In another embodiment estimating the quantitative measure uses a binomial distribution, exponential distribution, beta distribution or empirical distribution based on the redundancy of sequence reads. In another embodiment copy number variation is selected from aneuploidy, partial aneuploidy and polyploidy. In another embodiment nucleotide sequences from sequence reads are assembled into combined sequences and wherein combined sequences are partitioned into non-overlapping windows. In another embodiment CNV is determined between loci.
`
[0012] In another aspect disclosed herein is a system comprising a computer readable medium comprising machine-executable code that, upon execution by a computer processor, implements a method comprising: (a) receiving into memory sequence reads of polynucleotides tagged with duplex tags; (b) reducing and/or tracking redundancy in the set of sequence reads; (c) sorting sequence reads into paired reads and unpaired reads, wherein (1) each paired read is formed from sequence reads generated from a first tagged strand and a second differently tagged complementary strand derived from a double-stranded polynucleotide molecule in a set; and (2) each unpaired read represents a first tagged strand having no second differently tagged complementary strand derived from a double-stranded polynucleotide molecule represented among the sequence reads in the set of sequence reads; (d) determining quantitative measures of (1) the paired reads and (2) the unpaired reads that map to each of one or more genetic loci; and (e) estimating a quantitative measure of total double-stranded polynucleotide molecules in the set that map to each of the one or more genetic loci based on the quantitative measure of paired reads and unpaired reads mapping to each locus. In one embodiment the method further comprises: (f) detecting copy number variation in the sample by determining a normalized total quantitative measure determined in step (e) at each of the one or more genetic loci and determining copy number variation based on the normalized measure.
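The normalization step can be illustrated with a short sketch. The text does not specify how the total measure is normalized, so the choices here are assumptions made only for illustration: the baseline is the median of the estimated totals across loci, and the gain and loss thresholds (1.5 and 0.5) are hypothetical values. The function names are likewise hypothetical.

```python
from statistics import median


def normalized_copy_ratio(totals_by_locus):
    """Divide each locus's estimated molecule total by the median total.

    Assumption: the median across loci approximates the copy-neutral level.
    """
    baseline = median(totals_by_locus.values())
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return {locus: total / baseline for locus, total in totals_by_locus.items()}


def call_cnv(ratios, gain=1.5, loss=0.5):
    """Label each locus using hypothetical ratio thresholds."""
    calls = {}
    for locus, ratio in ratios.items():
        if ratio >= gain:
            calls[locus] = "gain"
        elif ratio <= loss:
            calls[locus] = "loss"
        else:
            calls[locus] = "neutral"
    return calls
```

With estimated totals of 1000 molecules at three loci, 3000 at a fourth and 400 at a fifth, the median baseline is 1000, so the fourth locus has a ratio of 3.0 and the fifth a ratio of 0.4. In practice the baseline and thresholds would be chosen for the assay at hand.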
`
[0013] In another aspect disclosed herein is a composition comprising between 300 and 300,000 haploid genome equivalents of fragmented DNA, wherein DNA fragments are tagged with duplex tags and bear between 2 and 10,000 different identifiers. In one embodiment the composition comprises between 1000 and 100,000 haploid genome equivalents of fragmented DNA. In another embodiment the DNA fragments bear between 10 and 1000 different identifiers. In another embodiment the fragmented DNA is human-derived DNA. In another embodiment the fragmented DNA is cfDNA (e.g., human cfDNA). In another embodiment the fragmented DNA is tumor DNA. In another embodiment the tumor DNA is formalin-fixed, paraffin-embedded.
`
[0014] In another aspect provided herein is a method comprising: (a) providing a sample comprising a set of double-stranded polynucleotide molecules, each double-stranded polynucleotide molecule including first and second complementary strands; (b) tagging the double-stranded polynucleotide molecules with a set of duplex tags, wherein each duplex tag differently tags the first and second complementary strands of a double-stranded polynucleotide molecule in the set; (c) sequencing at least some of the tagged strands to produce a set of sequence reads; (d) reducing or tracking redundancy in the set of sequence reads; (e) sorting sequence reads into paired reads and unpaired reads, wherein (1) each paired read is formed from sequence reads generated from a first tagged strand and a second differently tagged complementary strand derived from a double-stranded polynucleotide molecule in the set; and (2) each unpaired read represents a first tagged strand having no second differently tagged complementary strand derived from a double-stranded polynucleotide molecule represented among the sequence reads in the set of sequence reads; (f) determining quantitative measures of (1) the paired reads and (2) the unpaired reads that map to each of one or more genetic loci and (3) read depth of the paired reads and (4) read depth of unpaired reads; and (g) estimating a quantitative measure of total double-stranded polynucleotide molecules in the set that map to each of the one or more genetic loci based on the quantitative measure of paired reads and unpaired reads and their read depths mapping to each locus.
`
BRIEF DESCRIPTION OF THE DRAWINGS

[0015] Figure 1 is a flowchart representation of a method of this invention for determining CNV.

[0016] Figure 2 depicts mapping of Pairs and Singlets to Locus A and Locus B in a genome. The number of Unseen original molecules in a sample can be estimated based on the relative number of Pairs and Singlets.

[0017] Figure 3 shows a reference sequence encoding a genetic Locus A. Double-stranded polynucleotides mapped to Locus A include sequence m and its complement, m'; sequence n and its complement, n'; and sequence p and its complement, p'. The polynucleotides are each tagged with duplex primers that include non-complementary regions. Molecule nn' includes a sequence variant at Locus A, indicated by a dot.

[0018] Figures 4A-C show amplification, sequencing, redundancy reduction and pairing of complementary molecules. In Figure 4A duplex-tagged polynucleotides mm' and nn' are subject to amplification by, for example, PCR. The strand of the duplex polynucleotide including sequence m bears sequence tags w and y, while the strand of the duplex polynucleotide including sequence m' bears sequence tags x and z. Similarly, the strand of the duplex polynucleotide including sequence n bears sequence tags a and c, while the strand of the duplex polynucleotide including sequence n' bears sequence tags b and d. During amplification, each strand produces itself and its complementary sequence. However, for example, an amplification progeny of original strand m that includes the complementary sequence, m', is distinguishable from an amplification progeny of original strand m' because the progeny from original strand m will have the sequence 5'-y'm'w'-3' and the progeny of the original m' strand will have the sequence 5'-zm'x-3'. Figure 4B shows amplification in more detail. During amplification, errors can be introduced into the amplification progeny, represented by dots. The amplification progeny are sampled for sequencing, so that not all strands produce sequence reads, resulting in the sequence reads indicated. Because sequence reads can come from either a strand or its complement, both sequences and complement sequences will be included in the set of sequence reads. In Figure 4C, sequence reads are corrected for complementary sequences. However, sequences generated from an original Watson strand or an original Crick strand can be differentiated on the basis of their duplex tags. Sequences generated from the same original strand are grouped. Examination of the sequences allows one to infer the sequence of the original strand (the "consensus sequence"). In this case, for example, the sequence variant in the nn' molecule is included in the consensus sequence because it is included in every sequence read, while other variants are seen to be stray errors. After collapsing sequences, original polynucleotide pairs can be identified based on their complementary sequences and duplex tags.
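The collapsing step described for Figure 4C can be sketched as follows. This is a minimal illustration that uses a per-position majority vote, which is one common way to infer a consensus and is not necessarily the procedure contemplated here. It assumes the reads in a family have already been grouped by duplex tag, corrected for complementarity, and aligned to equal length. The function name is hypothetical.

```python
from collections import Counter


def consensus(reads):
    """Return the per-position majority base for one family of reads.

    Assumes all reads derive from the same original strand and are
    already oriented and aligned to the same length.
    """
    if not reads:
        raise ValueError("a read family must contain at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads in a family must have equal length")
    # zip(*reads) yields one column of bases per position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )
```

A base present in every read of the family survives into the consensus, while an error present in only one of several reads is voted out. Ties between bases are resolved arbitrarily in this sketch, so small families give weaker evidence.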
`
[0019] Figure 5 shows increased confidence in detecting sequence variants by pairing reads from Watson and Crick strands. Sequence nn' includes a sequence variant indicated by a dot. Sequence pp' does not include a sequence variant. Amplification, sequencing, redundancy reduction and pairing results in both Watson and Crick strands of the same original molecule including the sequence variant. In contrast, as a result of errors introduced during amplification and sampling during sequencing, the consensus sequence of the Watson strand p contains a sequence variant, while the consensus sequence of the Crick strand p' does not. It is less likely that amplification and sequencing will introduce the same variant into both strands (nn' sequence) of a duplex than into one strand (pp' sequence). Therefore, the variant in the pp' sequence is more likely to be an artifact, and the variant in the nn' sequence is more likely to exist in the original molecule.

[0020] Figure 6 shows a computer system that is programmed or otherwise configured to implement various methods of the present disclosure.

[0021] Figure 7 is a schematic representation of a system for analyzing a sample comprising nucleic acids from a User, including a sequencer; bioinformatic software and internet connection for report analysis by, for example, a hand held device or a desktop computer.
`
DETAILED DESCRIPTION OF THE INVENTION

[0022] As used herein the terms "at least", "at most" or "about", when preceding a series, refer to each member of the series, unless otherwise identified.

[0023] I. Methods

[0024] An embodiment of the method of determining CNV is shown in FIG. 1.

[0025] A. Polynucleotide Isolation

[0026] In step (102) double-stranded polynucleotides are isolated from a sample, for example, from a bodily fluid, e.g., blood. In certain embodiments the polynucleotide can be cell-free DNA. A sample of about 30 ng DNA can contain about 10,000 (10^4) haploid human genome equivalents and, in the case of cfDNA, about 200 billion (2x10^11) individual polynucleotide molecules. Similarly, a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion individual molecules.
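The arithmetic behind these figures can be reproduced with a short sketch. It uses the value of about 3 picograms of DNA per haploid human genome equivalent given elsewhere in this description. The genome size of about 3 billion base pairs and the mean cfDNA fragment length of 150 base pairs are assumptions introduced here for illustration, as are the constant and function names.

```python
PG_PER_HAPLOID_GENOME = 3.0  # "about 3 picograms", per the description
HAPLOID_GENOME_BP = 3.0e9    # assumption: about 3 billion base pairs
MEAN_FRAGMENT_BP = 150       # assumption: typical cfDNA fragment length


def haploid_genome_equivalents(mass_ng):
    """Convert a DNA mass in nanograms to haploid genome equivalents."""
    return mass_ng * 1000.0 / PG_PER_HAPLOID_GENOME


def approx_fragment_count(mass_ng):
    """Rough number of fragments if every genome copy is fully fragmented."""
    fragments_per_genome = HAPLOID_GENOME_BP / MEAN_FRAGMENT_BP
    return haploid_genome_equivalents(mass_ng) * fragments_per_genome
```

Under these assumptions 30 ng corresponds to 10,000 genome equivalents and about 2x10^11 fragments, while 100 ng corresponds to roughly 33,000 genome equivalents and roughly 6.7x10^11 fragments, in line with the rounded figures above.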
`
[0027] B. Tagging

[0028] In step (104) the double-stranded polynucleotides are tagged with duplex tags. Duplex tags are tags that differently label the complementary strands (i.e., the "Watson" and "Crick" strands) of a double-stranded molecule. In one embodiment the duplex tags are polynucleotides having complementary and non-complementary portions. One such example is the so-called "Y adapters", e.g., used in Illumina sequencing. Other examples include hairpin shaped adapters or bubble shaped adapters. Bubble shaped adapters have non-complementary sequences flanked on both sides by complementary sequences.

[0029] As used herein, a collection of molecules is considered to be "uniquely tagged" if each of at least 95% of the molecules in the collection bears an identifying tag ("identifier") that is not shared by any other molecule in the collection ("unique tag" or "unique identifier"). A collection of molecules is considered to be "non-uniquely tagged" if each of at least 1% of the molecules in the collection bears an identifying tag that is shared by at least one other molecule in the collection ("non-unique tag" or "non-unique identifier"). Accordingly, in a non-uniquely tagged population no more than 1% of the molecules are uniquely tagged.

[0030] In unique tagging, at least two times as many different tags are used as the estimated number of molecules in the sample.

[0031] In certain embodiments, the molecules in the sample are non-uniquely tagged. The number of different identifying tags used to tag molecules in a collection can range, for example, between any of 2, 4, 8, 16, or 32 at the low end of the range, and any of 50, 100, 500, 1000, 5000 and 10,000 at the high end of the range. So, for example, a collection of between 100 billion and 1 trillion molecules can be tagged with between 4 and 100 different identifying tags.
`
[0032] In certain embodiments the number of different tags can be fewer than 1 trillion, fewer than 1 billion, fewer than 1 million, fewer than 100,000, fewer than 10,000 or fewer than 1000.

[0033] In certain embodiments, a population of polynucleotides in a sample of fragmented genomic DNA is tagged with n different unique identifiers, wherein n is at least 2 and no more than 100,000*z, wherein z is a measure of central tendency (e.g., mean, median, mode) of an expected number of duplicate molecules having the same start and stop positions. In certain embodiments, n is at least any of 2*z, 3*z, 4*z, 5*z, 6*z, 7*z, 8*z, 9*z, 10*z, 11*z, 12*z, 13*z, 14*z, 15*z, 16*z, 17*z, 18*z, 19*z, or 20*z (e.g., lower limit). In other embodiments, n is no greater than 100,000*z, 10,000*z, 1000*z or 100*z (e.g., upper limit). Thus, n can range between any combination of these lower and upper limits. In certain embodiments, n is between 5*z and 15*z, between 8*z and 12*z, or about 10*z. For example, a haploid human genome equivalent has about 3 picograms of DNA. A sample of about 1 microgram of DNA contains about 300,000 haploid human genome equivalents. Improvements in sequencing can be achieved as long as at least some of the duplicate or cognate polynucleotides bear unique identifiers with respect to each other, that is, bear different tags. However, in certain embodiments, the number of tags used is selected so that there is at least a 95% chance that all duplicate molecules starting at any one position bear unique identifiers. For example, in a sample comprising about 10,000 haploid human genome equivalents of fragmented genomic DNA, e.g., cfDNA, z is expected to be between 2 and 8. Such a population can be tagged with between about 10 and 100 different identifiers, for example, about 36 different identifiers, about 49 different identifiers, about 64 different identifiers, about 81 different identifiers or about 100 different identifiers. DNA barcodes having identifiable sequences of 6, 7, 8, 9 or 10 nucleotides can be used. When attached to both ends of a polynucleotide, they produce 36, 49, 64, 81 or 100 possible different identifiers, respectively. Samples tagged in such a way can be those with a range of about 10 ng to any of about 100 ng, about 1 µg, about 10 µg of fragmented polynucleotides, e.g., genomic DNA, e.g., cfDNA.
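The relationship between the number of identifiers and the chance that duplicate molecules are all distinguishable can be illustrated with a short sketch. Two assumptions are made here that the text does not state. First, the figures 6 to 10 are read as the number of distinguishable barcodes available at each end, since that is the reading under which two-ended tagging yields 36 to 100 combinations. Second, identifiers are assumed to be assigned to molecules uniformly at random and independently. The function names are hypothetical.

```python
def identifier_combinations(barcodes_per_end):
    """Number of ordered two-ended identifiers from k barcodes per end."""
    return barcodes_per_end ** 2


def prob_all_distinct(n_identifiers, z):
    """Probability that z duplicate molecules all receive different identifiers.

    Assumes each molecule draws one of n identifiers uniformly at random.
    """
    if z > n_identifiers:
        return 0.0
    probability = 1.0
    for already_used in range(z):
        probability *= (n_identifiers - already_used) / n_identifiers
    return probability
```

For instance, `identifier_combinations(8)` gives 64, and with 100 identifiers and two duplicate molecules `prob_all_distinct(100, 2)` gives 0.99. The probability falls as z grows, which is why the number of identifiers is scaled to the expected number of duplicates.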
`
[0034] Accordingly, this invention also provides compositions of duplex-tagged polynucleotides. The polynucleotides can comprise fragmented DNA, e.g., cfDNA. A set of polynucleotides in the composition that map to a mappable base position in a genome can be non-uniquely tagged, that is, the number of different identifiers can be at least 2 and fewer than the number of polynucleotides that map to the mappable base position. A composition of between about 10 ng to about 10 µg (e.g., any of about 10 ng-1 µg, about 10 ng-100 ng, about 100 ng-10 µg, about 100 ng-1 µg, about 1 µg-10 µg) can bear between any of 2, 5, 10, 50 or 100 to any of 100, 1000, 10,000 or 100,000 different identifiers. For example, between 5 and 100 different identifiers can be used to tag the polynucleotides in such a composition.

[0035] The systems and methods disclosed herein may be used in applications that involve the assignment of molecular barcodes to cell free polynucleotides. Often, the identifier is a barcode oligonucleotide that is used to tag the polynucleotide. The barcode identifier may be a nucleic acid oligonucleotide, in which case the attachment to the polynucleotide sequences may comprise a ligation reaction between the oligonucleotide and the sequences or incorporation through PCR. In other cases, the reaction may comprise addition of a metal isotope, either directly to the analyte or by a probe labeled with the isotope. Generally, assignment of unique or non-unique identifiers, or molecular barcodes, in reactions of this disclosure may follow methods and systems described by, for example, US patent applications 20010053519, 20030152490, 20110160078 and US patent US 6,582,908.

[0036] In some cases, identifiers may be predetermined, random or semi-random sequence oligonucleotides. In other cases, a plurality of barcodes may be used such that barcodes are not necessarily unique to one another in the plurality. In this example, barcodes may be ligated to individual molecules such that the combination of the barcode and the sequence it may be ligated to creates a specific sequence that may be individually tracked. As described herein, detection of non-unique barcodes in combination with sequence data of beginning (start) and end (stop) portions of sequence reads can allow assignment of a unique identity to a particular molecule. The length, or number of base pairs, of an individual sequence read may also be used to assign a unique identity to such a molecule. As described herein, fragments from a single strand of nucleic acid having been assigned a unique identity, may thereby permit subsequent identification of fragments from the parent strand. In this way the polynucleotides in the sample can be uniquely or substantially uniquely tagged.
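The combination of a non-unique barcode with start and stop positions can be sketched as a grouping key. The record layout (dictionaries with `barcode`, `start` and `stop` fields) and the function name are hypothetical and chosen only for illustration. Reads sharing all three values are treated as descending from the same original molecule.

```python
from collections import defaultdict


def group_reads_by_identity(reads):
    """Group reads whose barcode, start and stop positions all match.

    reads: iterable of dicts with keys 'barcode', 'start' and 'stop'.
    Returns a dict mapping (barcode, start, stop) to the list of reads.
    """
    families = defaultdict(list)
    for read in reads:
        identity = (read["barcode"], read["start"], read["stop"])
        families[identity].append(read)
    return dict(families)
```

Two reads carrying the same barcode but different start positions fall into separate groups, which is how a small set of barcodes can still resolve individual molecules. Read length could be added to the key in the same way.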
`
[0037] A duplex tag can include a degenerate or semi-degenerate nucleotide sequence, e.g., a random degenerate sequence. The sequence can be, for example, between approximately 3 to 20 nucleotides in length. Therefore, each n-mer sequence may be any of 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides in length. The duplex tag can comprise a double-stranded fixed reference sequence downstream of the n-mer sequences. Each strand of the double-stranded fixed reference sequence can be, for example, 3, 4, 5 or 6 nucleotides in length.

[0038] In certain embodiments barcodes comprise 8 n-mer nucleotide sequences. When attached to both ends of a double-stranded polynucleotide, this results in 64 different possible identifying combinations. When a polynucleotide molecule bears tags at both ends, each tag can be an identifier, or the combination of tags can function as an identifier.
`
[0039] C. Sequencing

[0040] In step (106) the tagged polynucleotides are sequenced to generate sequence reads. Both strands of a tagged duplex polynucleotide can generate sequence reads. Because they are differently tagged, sequence reads generated from a Watson strand can be distinguished from sequence reads generated from a Crick strand. Sequencing can involve generating multiple sequence reads for each molecule. This occurs, for example, as a result of the amplification of individual polynucleotide strands during the sequencing process, e.g., by PCR.

[0041] Amplification of a single strand of a polynucleotide by PCR will generate copies both of that strand and its complement. During sequencing, both the strand and its complement will generate sequence reads. However, sequence reads generated from the complement of, for example, the Watson strand, can be identified as such because they bear the complement of the portion of the duplex tag that tagged the original Watson strand. In contrast, a sequence read generated from a Crick strand or its amplification product will bear the portion of the duplex tag that tagged the original Crick strand. In this way, a sequence read generated from an amplified product of a complement of the Watson strand can be distinguished from a complement sequence read generated from an amplification product of the Crick strand of the original molecule.

[0042] Typically, a sampling, or subset, of all of the amplified polynucleotides is submitted to a sequencing device for sequencing. With respect to any original double-stranded polynucleotide there can be three results with respect to sequencing. First, sequence reads can be generated from both complementary strands of the original molecule (that is, from both the Watson strand and from the Crick strand). Second, sequence reads can be generated from only one of the two complementary strands (that is, either from the Watson strand or from the Crick strand, but not both). Third, no sequence read may be generated from either of the two complementary strands. Consequently, counting unique sequence reads mapping to a genetic locus will underestimate the number of double-stranded polynucleotides in the original sample mapping to the locus. Described herein are methods of estimating the unseen and uncounted polynucleotides.
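The first two outcomes can be tallied per locus with a short sketch. The input layout is hypothetical: one tuple per collapsed strand family, holding a locus label, a `molecule_id` that stands for whatever combination of duplex tag and start/stop positions links the two strands of one original molecule, and a strand label (`"W"` for Watson, `"C"` for Crick). The third outcome, molecules with no reads, is by definition absent from the input and would have to be estimated separately.

```python
from collections import defaultdict


def count_pairs_and_singlets(families):
    """Count Pairs and Singlets at each locus.

    families: iterable of (locus, molecule_id, strand) tuples, one per
    collapsed strand family, where strand is 'W' or 'C'.
    """
    strands_seen = defaultdict(set)
    for locus, molecule_id, strand in families:
        strands_seen[(locus, molecule_id)].add(strand)

    counts = defaultdict(lambda: {"pairs": 0, "singlets": 0})
    for (locus, _molecule_id), strands in strands_seen.items():
        if strands == {"W", "C"}:
            counts[locus]["pairs"] += 1   # both strands recovered
        else:
            counts[locus]["singlets"] += 1  # only one strand recovered
    return dict(counts)
```

The resulting per-locus counts are the inputs from which the number of unseen molecules can then be estimated.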
`
[0043] The sequencing method can be massively parallel sequencing, that is, simultaneously (or in rapid succession) sequencing any of at least 100,000, 1 million, 10 million, 100 million, or 1 billion polynucleotide molecules. Sequencing methods may include, but are not limited to: high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), Next generation sequencing, Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert or Sanger sequencing, primer walking, sequencing using PacBio, SOLiD, Ion Torrent, or Nanopore platforms and any other sequencing methods known in the art.

[0044] D. Reducing or Tracking Redundancy

[0045] In step (108) redundancy in sequence reads is reduced and/or tracked. Sequencing of amplified polynucleotides typically produces reads of the several
`
