`
`PROVISIONAL PATENT APPLICATION
`
`METHODS FOR COPY NUMBER VARIATION DETECTION
`
`Inventor(s):
`
`Helmy ELTOUKHY
`a citizen of the United States, residing at
`2 Barry Lane
`Atherton, CA 94027
`
`AmirAli TALASAZ,
`a citizen of the United States, residing at
`2181 Camino a Los Cerros
`Menlo Park, CA 94025
`
`Assignee:
`
`Guardant Health, Inc.
`2686 Middlefield Rd., Suite D
`Redwood City, CA 94063 U.S.A.
`
`Entity:
`
`Large entity
`
`John Storella, P.C.
`2625 Alcatraz Ave. #197
`Berkeley, CA 94705
`510-501-0567
`Reg. No. 32,944
`Customer # 115823
`
`Filed: December 28, 2013
`
`
`
`Attorney Docket No: 42534-708-101
`
`METHODS FOR COPY NUMBER VARIATION DETECTION
`
`CROSS-REFERENCE TO RELATED APPLICATIONS
`
`[0001]
`
`None.
`
`STATEMENT AS TO FEDERALLY SPONSORED RESEARCH
`
`[0002]
`
`None.
`
`BACKGROUND OF THE INVENTION
`
`[0003]
`
`The detection and quantification of polynucleotides is important for molecular biology
`
`and medical applications such as diagnostics. Genetic testing is particularly useful for a number of
`
`diagnostic methods. For example, disorders that are caused by rare genetic alterations (e.g.,
`
`sequence variants) or changes in epigenetic markers, such as cancer and partial or complete
`
`aneuploidy, may be detected or more accurately characterized with DNA sequence information.
`
`[0004] Early detection and monitoring of genetic diseases, such as cancer, is often useful and
`
`needed in the successful treatment or management of the disease. One approach may include the
`
`monitoring of a sample derived from cell free nucleic acids, a population of polynucleotides that
`
`can be found in different types of bodily fluids.
`
`In some cases, disease may be
`
`characterized or detected based on detection of genetic aberrations, such as a change in
`
`copy number variation and/or sequence variation of one or more nucleic acid sequences, or
`
`the development of other certain rare genetic alterations. Cell free DNA may contain
`
`genetic aberrations associated with a particular disease. With improvements in
`
`sequencing and techniques to manipulate nucleic acids, there is a need in the art for
`
`improved methods and systems for using cell free DNA to detect and monitor disease.
`
`[0005]
`
`In particular, many methods have been developed for accurate copy number
`
`variation estimation, especially for heterogeneous genomic samples, such as tumor-derived
`
`gDNA or for cfDNA, for many applications (e.g., prenatal, transplant, immune,
`
`metagenomics or cancer diagnostics). Most of these methods consist of sample
`
`preparation whereby the original nucleic acids are converted into a sequenceable library,
`
`followed by massively parallel sequencing, and finally bioinformatics to estimate copy
`
`number variation at one or more loci.
`
`[0006]
`
`International Publication WO 2012/142213 (Vogelstein et al.) refers to an
`
`approach called "Safe-SeqS" (for Safe-Sequencing System) that is said to include (i)
`
`assignment of a unique identifier (UID) to each template molecule; (ii) amplification of each
`
`uniquely tagged template molecule to create UID-families; and (iii) redundant sequencing of
`
`the amplification products.
`
`[0007] International Publication WO 2013/142389 (Schmidt et al.) refers to a method
`
`referred to as Duplex Consensus Sequencing (DCS). The approach is described as
`
`reducing errors by independently tagging and sequencing each of the two strands of a DNA
`
`duplex.
`
`[0008]
`
`Statements in the Background are not necessarily meant to endorse the
`
`characterization in the cited references.
`
`SUMMARY OF THE INVENTION
`
`[0009]
`
`Although many of these methods are able to reduce or combat the errors
`
`introduced by the sample preparation and sequencing processes for all molecules that are
`
`converted and sequenced, none of these methods is able to infer the counts of molecules
`
`that were converted but not sequenced. Since this count of converted but unsequenced
`
`molecules can be highly variable from genomic region to region, these counts can
`
`dramatically and adversely affect the sensitivity that can be achieved.
`
`[0010]
`
`To combat this issue, input double-stranded DNA can be converted by a process
`
`that tags both halves of the individual double-stranded molecule differently. This can be
`
`performed using a variety of different techniques, including ligation of hairpin, bubble or
`
`forked adapters. If tagged correctly, each original Watson and Crick side of the input
`
`double-stranded DNA molecule can be uniquely identified by the sequencer and
`
`subsequent bioinformatics. For all molecules in a particular region, counts of molecules
`
`where both Watson and Crick sides were recovered (“Pairs”) versus those where only one
`
`half was recovered (“Singlets”) can be recorded. The number of unseen molecules can be
`
`estimated based on the number of Pairs and Singlets detected.
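One way to make this estimate concrete is sketched below. It assumes, purely for illustration, that each strand of an original molecule is recovered independently with the same probability p, so that the expected counts are Pairs = N*p^2 and Singlets = 2*N*p*(1-p). The independence model and the function name `estimate_unseen` are assumptions of this sketch and are not taken from the text.

```python
def estimate_unseen(pairs: int, singlets: int) -> dict:
    """Estimate total and unseen molecules at a locus from Pair and Singlet counts.

    Assumed model (not from the text): each strand is recovered independently
    with probability p, so E[pairs] = N*p**2 and E[singlets] = 2*N*p*(1-p).
    """
    if pairs <= 0:
        raise ValueError("at least one Pair is needed to estimate p")
    strands_seen = 2 * pairs + singlets
    p = 2 * pairs / strands_seen              # per-strand recovery probability
    total = strands_seen ** 2 / (4 * pairs)   # N = (2P + S)^2 / (4P)
    unseen = singlets ** 2 / (4 * pairs)      # N * (1 - p)^2 = S^2 / (4P)
    return {"p": p, "total": total, "unseen": unseen}
```

Under this model, a locus with 400 Pairs and 400 Singlets would imply about 100 unseen molecules, for an estimated total of about 900.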
`
`[0011]
`
`In one aspect disclosed herein is a method comprising: (a) providing a sample
`
`comprising a set of double-stranded polynucleotide molecules, each double-stranded
`
`polynucleotide molecule including first and second complementary strands; (b) tagging the
`
`double-stranded polynucleotide molecules with a set of duplex tags, wherein each duplex
`
`tag differently tags the first and second complementary strands of a double-stranded
`
`polynucleotide molecule in the set; (c) sequencing at least some of the tagged strands to
`
`produce a set of sequence reads; (d) reducing or tracking redundancy in the set of
`
`sequence reads; (e) sorting sequence reads into paired reads and unpaired reads, wherein
`
`(1) each paired read is formed from sequence reads generated from a first tagged strand
`
`and a second differently tagged complementary strand derived from a double-stranded
`
`polynucleotide molecule in the set; and (2) each unpaired read represents a first tagged
`
`strand having no second differently tagged complementary strand derived from a double-stranded
`
`polynucleotide molecule represented among the sequence reads in the set of
`
`sequence reads; (f) determining quantitative measures of (1) the paired reads and (2) the
`
`unpaired reads that map to each of one or more genetic loci; and (g) estimating a
`
`quantitative measure of total double-stranded polynucleotide molecules in the set that map
`
`to each of the one or more genetic loci based on the quantitative measure of paired reads
`
`and unpaired reads mapping to each locus. In one embodiment the method further comprises:
`
`(h) detecting copy number variation in the sample by determining a normalized total
`
`quantitative measure determined in step (g) at each of the one or more genetic loci and
`
`determining copy number variation based on the normalized measure.
`
`In another
`
`embodiment the double-stranded polynucleotide molecules are DNA. In another
`
`embodiment the sample comprises double-stranded polynucleotide molecules sourced
`
`substantially from cell-free nucleic acids, e.g., cfDNA.
`
`In another embodiment the sample
`
`comprises no more than 100 ng of double-stranded polynucleotide molecules.
`
`In another
`
`embodiment the sample is selected from the group consisting of blood, plasma, serum,
`
`urine, saliva, mucosal excretions, sputum, stool and tears.
`
`In another embodiment the
`
`sample comprises double-stranded polynucleotide molecules from healthy cells and from
`
`malignant cells.
`
`In another embodiment the sample comprises maternal double-stranded
`
`polynucleotide molecules and fetal double-stranded polynucleotide molecules.
`
`In another
`
`embodiment any of at least 10%, 25%, 50%, 75%, 90% or 99% of the double-stranded
`
`polynucleotide molecules in the set bear an identifying tag shared with at least one other
`
`double-stranded polynucleotide molecule in the set (e.g., the set of polynucleotide
`
`molecules is non-uniquely tagged).
`
`In another embodiment any of at most 25%, 10%, 2%,
`
`1% or 0.1% of the double-stranded polynucleotide molecules in the set bear an identifying
`
`tag shared with at least one other polynucleotide molecule in the set.
`
`In another
`
`embodiment the double-stranded polynucleotide molecules in the set are tagged with
`
`between 2 and 1000 different identifying tags or between 2 and 100 different identifying
`
`tags.
`
`In another embodiment each duplex tag comprises a polynucleotide identifier.
`
`In
`
`another embodiment each polynucleotide identifier comprises a non-complementary region.
`
`In another embodiment each duplex tag is Y-shaped, bubble-shaped or hairpin-shaped.
`
`In
`
`another embodiment the double-stranded polynucleotides are converted into tagged
`
`polynucleotides with a conversion efficiency of at least 10%, at least 20%, at least 30%, at
`
`least 40%, at least 50%, at least 60%, at least 80% or at least 90%.
`
`In another
`
`embodiment tagging comprises any of blunt-end ligation, sticky end ligation, molecular
`
`inversion probes, PCR, ligation-based PCR, multiplex PCR, single strand ligation and
`
`single strand circularization.
`
`In another embodiment sequencing comprises amplification of
`
`the tagged strands, e.g., by PCR.
`
`In another embodiment the method comprises filtering
`
`out sequence reads that are introduced into the sample through contamination.
`
`In another
`
`embodiment the method comprises filtering out reads that fail to meet a set threshold.
`
`In
`
`another embodiment reducing redundancy in the set of sequence reads comprises
`
`collapsing sequence reads produced from amplified products of an original polynucleotide
`
`molecule in the sample back to the original polynucleotide molecule.
`
`In another
`
`embodiment the method further comprises determining a consensus sequence for the
`
`original polynucleotide molecule.
`
`In another embodiment the method further comprises
`
`identifying polynucleotide molecules at one or more genetic loci comprising a sequence
`
`variant.
`
`In another embodiment the method further comprises determining a quantitative
`
`measure of paired reads that map to a locus, wherein both strands of the pair comprise a
`
`sequence variant.
`
`In another embodiment the method further comprises determining a
`
`quantitative measure of paired molecules in which only one member of the pair bears a
`
`sequence variant and/or determining a quantitative measure of unpaired molecules bearing
`
`a sequence variant.
`
`In another embodiment the sequence variant is selected from a single
`
`nucleotide variant, an indel, a transversion, a translocation, an inversion, a deletion, a
`
`chromosomal structure alteration, a gene fusion, a chromosome fusion, a gene truncation,
`
`a gene amplification, a gene duplication and a chromosomal lesion.
`
`In another
`
`embodiment the quantitative measures are numbers of molecules.
`
`In another embodiment
`
`the one or more genetic loci are a plurality of genetic loci.
`
`In another embodiment the one
`
`or more genetic loci correspond to one or more oncogenes (e.g., a panel of oncogenes).
`
`In
`
`another embodiment the plurality of the genetic loci map to a single nucleotide, a gene, a
`
`fragment of a chromosome, a full chromosome or a genome.
`
`In another embodiment
`
`estimating the quantitative measure comprises estimating a quantitative measure of
`
`polynucleotide molecules in the sample for which no sequence reads are detected.
`
`In
`
`another embodiment estimating the quantitative measure uses binomial distribution,
`
`exponential distribution, beta distribution or empirical distribution based on the redundancy
`
`of sequence reads.
`
`In another embodiment copy number variation is selected from
`
`aneuploidy, partial aneuploidy and polyploidy.
`
`In another embodiment nucleotide
`
`sequences from sequence reads are assembled into combined sequences and wherein
`
`combined sequences are partitioned into non-overlapping windows.
`
`In another
`
`embodiment CNV is determined between loci.
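The normalization in step (h) can be illustrated with a minimal sketch. It assumes that the per-locus totals from step (g) are already available, uses the median across loci as the baseline, and applies cut-offs of 1.25 and 0.75; the baseline choice, the thresholds, and the function names are hypothetical and are not specified in the text.

```python
from statistics import median

def normalized_copy_ratio(totals_by_locus: dict) -> dict:
    """Divide each locus's estimated molecule total by the median across loci."""
    baseline = median(totals_by_locus.values())
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return {locus: total / baseline for locus, total in totals_by_locus.items()}

def call_cnv(ratios: dict, gain: float = 1.25, loss: float = 0.75) -> dict:
    """Label each locus using hypothetical gain/loss thresholds on the ratio."""
    calls = {}
    for locus, ratio in ratios.items():
        if ratio >= gain:
            calls[locus] = "gain"
        elif ratio <= loss:
            calls[locus] = "loss"
        else:
            calls[locus] = "neutral"
    return calls
```

A median baseline is used here only because it is insensitive to a small number of amplified loci; other normalizations, such as a matched control sample, would fit the same interface.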
`
`[0012]
`
`In another aspect disclosed herein is a system comprising a computer readable
`
`medium comprising machine-executable code that, upon execution by a computer
`
`processor, implements a method comprising: (a) receiving into memory sequence reads of
`
`polynucleotides tagged with duplex tags; (b) reducing and/or tracking redundancy in the set
`
`of sequence reads; (c) sorting sequence reads into paired reads and unpaired reads,
`
`wherein (1) each paired read is formed from sequence reads generated from a first tagged
`
`strand and a second differently tagged complementary strand derived from a double-stranded
`
`polynucleotide molecule in a set; and (2) each unpaired read represents a first
`
`tagged strand having no second differently tagged complementary strand derived from a
`
`double-stranded polynucleotide molecule represented among the sequence reads in the
`
`set of sequence reads; (d) determining quantitative measures of (1) the paired reads and
`
`(2) the unpaired reads that map to each of one or more genetic loci; and (e) estimating a
`
`quantitative measure of total double-stranded polynucleotide molecules in the set that map
`
`to each of the one or more genetic loci based on the quantitative measure of paired reads
`
`and unpaired reads mapping to each locus.
`
`In one embodiment the method further
`
`comprises:
`
`(f) detecting copy number variation in the sample by determining a normalized
`
`total quantitative measure determined in step (e) at each of the one or more genetic loci
`
`and determining copy number variation based on the normalized measure.
`
`[0013]
`
`In another aspect disclosed herein is a composition comprising between 300 and
`
`300,000 haploid genome equivalents of fragmented DNA, wherein DNA fragments are
`
`tagged with duplex tags and bear between 2 and 10,000 different identifiers.
`
`In one
`
`embodiment the composition comprises between 1000 and 100,000 haploid genome
`
`equivalents of fragmented DNA.
`
`In another embodiment the DNA fragments bear between
`
`10 and 1000 different identifiers.
`
`In another embodiment the fragmented DNA is human-derived
`
`DNA. In another embodiment the fragmented DNA is cfDNA (e.g., human cfDNA).
`
`In another embodiment the fragmented DNA is tumor DNA.
`
`In another embodiment the
`
`tumor DNA is formalin-fixed, paraffin-embedded.
`
`[0014]
`
`In another aspect provided herein is a method comprising:
`
`(a) providing a
`
`sample comprising a set of double-stranded polynucleotide molecules, each double-
`
`stranded polynucleotide molecule including first and second complementary strands; (b)
`
`tagging the double-stranded polynucleotide molecules with a set of duplex tags, wherein
`
`each duplex tag differently tags the first and second complementary strands of a double-
`
`stranded polynucleotide molecule in the set; (c) sequencing at least some of the tagged
`
`strands to produce a set of sequence reads; (d) reducing or tracking redundancy in the set
`
`of sequence reads; (e) sorting sequence reads into paired reads and unpaired reads,
`
`wherein (1) each paired read is formed from sequence reads generated from a first tagged
`
`strand and a second differently tagged complementary strand derived from a double-stranded
`
`polynucleotide molecule in the set; and (2) each unpaired read represents a first
`
`tagged strand having no second differently tagged complementary strand derived from a
`
`double-stranded polynucleotide molecule represented among the sequence reads in the
`
`set of sequence reads; (f) determining quantitative measures of (1) the paired reads and (2)
`
`the unpaired reads that map to each of one or more genetic loci and (3) read depth of the
`
`paired reads and (4) read depth of unpaired reads; and (g) estimating a quantitative
`
`measure of total double-stranded polynucleotide molecules in the set that map to each of
`
`the one or more genetic loci based on the quantitative measure of paired reads and
`
`unpaired reads and their read depths mapping to each locus.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0015]
`
`Figure 1 is a flowchart representation of a method of this invention for
`
`determining CNV.
`
`[0016]
`
`Figure 2 depicts mapping of Pairs and Singlets to Locus A and Locus B in a
`
`genome. The number of Unseen original molecules in a sample can be estimated based on
`
`the relative number of Pairs and Singlets.
`
`[0017]
`
`Figure 3 shows a reference sequence encoding a genetic Locus A. Double-stranded
`
`polynucleotides that map to Locus A include sequence m and its complement,
`
`m’; sequence n and its complement, n’; and sequence p and its complement, p’. The
`
`polynucleotides are each tagged with duplex primers that include non-complementary
`
`regions. Molecule nn’ includes a sequence variant at Locus A, indicated by a dot.
`
`[0018]
`
`Figures 4A-C show amplification, sequencing, redundancy reduction and pairing
`
`of complementary molecules. In Figure 4A duplex-tagged polynucleotides mm’ and nn’ are
`
`subject to amplification by, for example, PCR. The strand of the duplex polynucleotide
`
`including sequence m bears sequence tags w and y, while the strand of the duplex
`
`polynucleotide including sequence m’ bears sequence tags x and z. Similarly, the strand of
`
`the duplex polynucleotide including sequence n bears sequence tags a and c, while the
`
`strand of the duplex polynucleotide including sequence n’ bears sequence tags b and d.
`
`During amplification, each strand produces itself and its complementary sequence.
`
`However, for example, an amplification progeny of original strand m that includes the
`
`complementary sequence, m’, is distinguishable from an amplification progeny of original
`
`strand m’ because the progeny from original strand m will have the sequence 5’-y’m’w’-3’
`
`and the progeny of the original m’ strand will have the sequence 5’-zm’x-3’.
`
`Figure 4B shows amplification in more detail. During amplification, errors can be
`
`introduced into the amplification progeny, represented by dots. The amplification progeny are
`
`sampled for sequencing, so that not all strands produce sequence reads, resulting in the
`
`sequence reads indicated. Because sequence reads can come from either a strand or its
`
`complement, both sequences and complement sequences will be included in the set of
`
`sequence reads. In Figure 4C, sequence reads are corrected for complementary
`
`sequences. However, sequences generated from an original Watson strand or an original
`
`Crick strand can be differentiated on the basis of their duplex tags. Sequences generated
`
`from the same original strand are grouped. Examination of the sequences allows one to
`
`infer the sequence of the original strand (the “consensus sequence”). In this case, for
`
`example, the sequence variant in the nn’ molecule is included in the consensus sequence
`
`because it is included in every sequence read while other variants are seen to be stray errors.
`
`After collapsing sequences, original polynucleotide pairs can be identified based on their
`
`complementary sequences and duplex tags.
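The grouping, consensus and pairing steps of Figure 4C can be sketched as follows. The sketch assumes that each read has already been oriented to the reference and annotated with a molecule label and a strand of origin ("W" or "C") derived from its duplex tags; that input format, the per-position majority vote, and the function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def consensus(seqs):
    """Majority base at each position across equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def collapse_and_pair(reads):
    """Collapse reads per original strand, then split molecules into Pairs and Singlets.

    reads: iterable of (molecule_id, strand, sequence), with strand "W" or "C".
    Returns (pairs, singlets), each mapping molecule_id -> {strand: consensus}.
    """
    families = defaultdict(list)
    for molecule_id, strand, seq in reads:
        families[(molecule_id, strand)].append(seq)

    collapsed = defaultdict(dict)
    for (molecule_id, strand), seqs in families.items():
        collapsed[molecule_id][strand] = consensus(seqs)

    # A molecule is a Pair when both strands were recovered, otherwise a Singlet.
    pairs = {m: s for m, s in collapsed.items() if len(s) == 2}
    singlets = {m: s for m, s in collapsed.items() if len(s) == 1}
    return pairs, singlets
```

In this sketch a base that appears in only a minority of the reads of a family is treated as a stray error and drops out of the consensus, while a base present in every read of the family is retained.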
`
`[0019]
`
`Figure 5 shows increased confidence in detecting sequence variants by pairing
`
`reads from Watson and Crick strands. Sequence nn’ includes a sequence variant indicated
`
`by a dot. Sequence pp’ does not include a sequence variant. Amplification, sequencing,
`
`redundancy reduction and pairing results in both Watson and Crick strands of the same
`
`original molecule including the sequence variant.
`
`In contrast, as a result of errors
`
`introduced during amplification and sampling during sequencing, the consensus sequence
`
`of the Watson strand p contains a sequence variant, while the consensus sequence of the
`
`Crick strand p’ does not.
`
`It is less likely that amplification and sequencing will introduce the
`
`same variant into both strands (nn’ sequence) of a duplex than into one strand (pp’
`
`sequence). Therefore, the variant in the pp’ sequence is more likely to be an artifact, and
`
`the variant in the nn’ sequence is more likely to exist in the original molecule.
`
`[0020]
`
`Figure 6 shows a computer system that is programmed or otherwise configured
`
`to implement various methods of the present disclosure.
`
`[0021]
`
`Figure 7 is a schematic representation of a system for analyzing a sample
`
`comprising nucleic acids from a User, including a sequencer; bioinformatic software and
`
`internet connection for report analysis by, for example, a hand held device or a desk top
`
`computer.
`
`DETAILED DESCRIPTION OF THE INVENTION
`
`[0022]
`
`As used herein the terms “at least”, “at most” or “about”, when preceding a
`
`series, refer to each member of the series, unless otherwise identified.
`
`[0023]
`
`I. Methods
`
`[0024]
`
`An embodiment of the method of determining CNV is shown in FIG. 1.
`
`[0025]
`
`A. Polynucleotide Isolation
`
`[0026]
`
`In step (102) double-stranded polynucleotides are isolated from a sample, for
`
`example, from a bodily fluid, e.g., blood. In certain embodiments the polynucleotide can be
`cell-free DNA. A sample of about 30 ng DNA can contain about 10,000 (10^4) haploid
`human genome equivalents and, in the case of cfDNA, about 200 billion (2x10^11) individual
`
`polynucleotide molecules. Similarly, a sample of about 100 ng of DNA can contain about
`
`30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion
`
`individual molecules.
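The arithmetic behind these figures can be reproduced with a short sketch. It uses the approximate value of 3 picograms of DNA per haploid human genome equivalent given elsewhere in this description, and a fragments-per-genome factor of 2x10^7 that is simply back-calculated from the numbers above (2x10^11 molecules per 10^4 genome equivalents); the constant and function names are hypothetical.

```python
PG_PER_HAPLOID_GENOME = 3.0   # approximate mass of one haploid human genome, in pg
FRAGMENTS_PER_GENOME = 2e7    # back-calculated from the text: 2e11 molecules / 1e4 genomes

def haploid_genome_equivalents(mass_ng: float) -> float:
    """Convert a DNA mass in nanograms to haploid human genome equivalents."""
    return mass_ng * 1000.0 / PG_PER_HAPLOID_GENOME

def cfdna_molecules(mass_ng: float) -> float:
    """Rough count of individual cfDNA fragments in a sample of the given mass."""
    return haploid_genome_equivalents(mass_ng) * FRAGMENTS_PER_GENOME
```

With these constants a 30 ng sample gives 10,000 genome equivalents and 2x10^11 fragments, and a 100 ng sample gives roughly 33,000 genome equivalents, which the text rounds to about 30,000.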
`
`[0027]
`
`B. Tagging
`
`[0028]
`
`In step (104) the double-stranded polynucleotides are tagged with duplex tags.
`
`Duplex tags are tags that differently label the complementary strands (i.e., the "Watson"
`
`and "Crick" strands) of a double-stranded molecule. In one embodiment the duplex tags are
`
`polynucleotides having complementary and non-complementary portions. One such
`
`example is the so-called "Y adapter", e.g., as used in Illumina sequencing. Other examples
`
`include hairpin shaped adapters or bubble shaped adapters. Bubble shaped adapters have
`
`non-complementary sequences flanked on both sides by complementary sequences.
`
`[0029] As used herein, a collection of molecules is considered to be "uniquely tagged" if
`
`each of at least 95% of the molecules in the collection bears an identifying tag ("identifier")
`
`that is not shared by any other molecule in the collection (“unique tag” or “unique
`
`identifier”). A collection of molecules is considered to be "non-uniquely tagged" if each of at
`
`least 1% of the molecules in the collection bears an identifying tag that is shared by at least
`
`one other molecule in the collection ("non-unique tag" or "non-unique identifier").
`
`Accordingly, in a non-uniquely tagged population no more than 1% of the molecules are
`
`uniquely tagged.
`
`[0030]
`
`In unique tagging, at least two times as many different tags are used as the
`
`estimated number of molecules in the sample.
`
`[0031]
`
`In certain embodiments, the molecules in the sample are non-uniquely tagged.
`
`The number of different identifying tags used to tag molecules in a collection can range, for
`
`example, between any of 2, 4, 8, 16, or 32 at the low end of the range, and any of 50, 100,
`
`500, 1000, 5000 and 10,000 at the high end of the range. So, for example, a collection of
`
`between 100 billion and 1 trillion molecules can be tagged with between 4 and 100 different
`
`identifying tags.
`
`
`
`[0032] In certain embodiments the number of different tags can be fewer than 1 trillion,
`
`fewer than 1 billion, fewer than 1 million, fewer than 100,000, fewer than 10,000 or fewer
`
`than 1000.
`
`[0033]
`
`In certain embodiments, a population of polynucleotides in a sample of
`
`fragmented genomic DNA is tagged with n different unique identifiers, wherein n is at least
`
`2 and no more than 100,000*z, wherein z is a measure of central tendency (e.g., mean,
`
`median, mode) of an expected number of duplicate molecules having the same start and
`
`stop positions.
`
`In certain embodiments, n is at least any of 2*z, 3*z, 4*z, 5*z, 6*z, 7*z, 8*z,
`
`9*z, 10*z, 11*z, 12*z, 13*z, 14*z, 15*z, 16*z, 17*z, 18*z, 19*z, or 20*z (e.g., lower limit).
`
`In
`
`other embodiments, n is no greater than 100,000*z, 10,000*z, 1000*z or 100*z (e.g., upper
`
`limit). Thus, n can range between any combination of these lower and upper limits.
`
`In
`
`certain embodiments, n is between 5*z and 15*z, between 8*z and 12*z, or about 10*z. For
`
`example, a haploid human genome equivalent has about 3 picograms of DNA. A sample of
`
`about 1 microgram of DNA contains about 300,000 haploid human genome equivalents.
`
`Improvements in sequencing can be achieved as long as at least some of the duplicate or
`
`cognate polynucleotides bear unique identifiers with respect to each other, that is, bear
`
`different tags. However, in certain embodiments, the number of tags used is selected so
`
`that there is at least a 95% chance that all duplicate molecules starting at any one position
`
`bear unique identifiers. For example, in a sample comprising about 10,000 haploid human
`
`genome equivalents of fragmented genomic DNA, e.g., cfDNA, z is expected to be between
`
`2 and 8. Such a population can be tagged with between about 10 and 100 different
`
`identifiers, for example, about 36 different identifiers, about 49 different identifiers, about 64
`
`different identifiers, about 81 different identifiers or about 100 different identifiers. DNA
`
`barcodes having identifiable sequences of 6, 7, 8, 9 or 10 nucleotides can be used. When
`
`attached to both ends of a polynucleotide, they produce 36, 49, 64, 81 or 100 possible
`
`different identifiers, respectively. Samples tagged in such a way can be those with a range
`
`of about 10 ng to any of about 100 ng, about 1 µg, about 10 µg of fragmented
`
`polynucleotides, e.g., genomic DNA, e.g., cfDNA.
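The identifier counts and the 95% criterion above can be checked with a short calculation. The sketch reads the passage as k distinguishable barcodes available at each end of a polynucleotide, giving k*k combined identifiers, and assumes that each molecule receives an identifier uniformly and independently at random; both the reading and the random-assignment model, as well as the function names, are assumptions of this sketch.

```python
from math import prod

def combined_identifiers(barcodes_per_end: int) -> int:
    """Number of identifiers when one of k barcodes is attached at each end."""
    return barcodes_per_end ** 2

def prob_all_distinct(n_identifiers: int, z: int) -> float:
    """Probability that z duplicate molecules all receive different identifiers.

    Assumes uniform, independent assignment from n_identifiers possibilities.
    """
    if z > n_identifiers:
        return 0.0
    return prod((n_identifiers - i) / n_identifiers for i in range(z))
```

For example, `combined_identifiers(8)` gives 64, and `prob_all_distinct(n, z) >= 0.95` can be used to test whether a candidate number of identifiers n meets the stated criterion for an expected duplicate count z.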
`
`[0034]
`
`Accordingly, this invention also provides compositions of duplex-tagged
`
`polynucleotides. The polynucleotides can comprise fragmented DNA, e.g. cfDNA. A set of
`
`polynucleotides in the composition that map to a mappable base position in a genome can
`
`be non-uniquely tagged, that is, the number of different identifiers can be at least 2
`
`and fewer than the number of polynucleotides that map to the mappable base position. A
`
`composition of between about 10 ng to about 10 µg (e.g., any of about 10 ng-1 µg, about
`
`10 ng-100 ng, about 100 ng-10 µg, about 100 ng-1 µg, about 1 µg-10 µg) can bear
`
`between any of 2, 5, 10, 50 or 100 to any of 100, 1000, 10,000 or 100,000 different
`
`identifiers. For example, between 5 and 100 different identifiers can be used to tag the
`
`polynucleotides in such a composition.
`
`[0035]
`
`The systems and methods disclosed herein may be used in applications that
`
`involve the assignment of molecular barcodes to cell free polynucleotides. Often, the
`
`identifier is a bar-code oligonucleotide that is used to tag the polynucleotide. The barcode
`
`identifier may be a nucleic acid oligonucleotide, in which case the attachment to the
`
`polynucleotide sequences may comprise a ligation reaction between the oligonucleotide
`
`and the sequences or incorporation through PCR.
`
`In other cases, the reaction may
`
`comprise addition of a metal isotope, either directly to the analyte or by a probe labeled
`
`with the isotope. Generally, assignment of unique or non-unique identifiers, or molecular
`
`barcodes in reactions of this disclosure may follow methods and systems described by, for
`
`example, US patent applications 20010053519, 20030152490, 20110160078 and US
`
`patent US 6,582,908.
`
`[0036]
`
`In some cases, identifiers may be predetermined, random or semi-random
`
`sequence oligonucleotides.
`
`In other cases, a plurality of barcodes may be used such that
`
`barcodes are not necessarily unique to one another in the plurality.
`
`In this example,
`
`barcodes may be ligated to individual molecules such that the combination of the barcode
`
`and the sequence it may be ligated to creates a specific sequence that may be individually
`
`tracked. As described herein, detection of non-unique barcodes in combination with
`
`sequence data of beginning (start) and end (stop) portions of sequence reads can allow
`
`assignment of a unique identity to a particular molecule. The length, or number of base
`
`pairs, of an individual sequence read may also be used to assign a unique identity to such
`
`a molecule. As described herein, fragments from a single strand of nucleic acid having
`
`been assigned a unique identity, may thereby permit subsequent identification of fragments
`
`from the parent strand.
`
`In this way the polynucleotides in the sample can be uniquely or
`
`substantially uniquely tagged.
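A minimal sketch of this idea follows: reads are grouped by the combination of their barcode and their mapped start and stop positions, so that a barcode shared by many molecules still resolves to a single molecule at a given position. The dictionary keys `barcode`, `start` and `stop` and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def group_by_identity(reads):
    """Group reads by (barcode, start, stop), treated here as a molecule identity.

    reads: iterable of dicts with the hypothetical keys 'barcode', 'start', 'stop'.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["barcode"], read["start"], read["stop"])
        groups[key].append(read)
    return dict(groups)
```

Read length is implied by the start and stop positions in this sketch, so it is not stored as a separate component of the key.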
`
`[0037]
`
`A duplex tag can include a degenerate or semi-degenerate nucleotide sequence,
`
`e.g., a random degenerate sequence. The sequence can be, for example, between
`
`approximately 3 to 20 nucleotides in length. Therefore, each n-mer sequence may be any
`
`of 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 or 20 nucleotides in length. The
`
`duplex tag can comprise a double-stranded fixed reference sequence downstream of the n-mer
`
`sequences. Each strand of the double-stranded fixed reference sequence can be, for
`
`example, 3, 4, 5 or 6 nucleotides in length.
`
`[0038]
`
`In certain embodiments barcodes comprise 8 n-mer nucleotide sequences. When
`
`attached to both ends of a double-stranded polynucleotide, this results in 64 different
`
`possible identifying combinations. When a polynucleotide molecule bears tags at both ends,
`
`each tag can be an identifier, or the combination of tags can function as an identifier.
`
`[0039]
`
`C. Sequencing
`
`[0040]
`
`In step (106) the tagged polynucleotides are sequenced to generate sequence
`
`reads. Both strands of a tagged duplex polynucleotide can generate sequence reads.
`
`Because they are differently tagged, sequence reads generated from a Watson strand can
`
`be distinguished from sequence reads generated from a Crick strand. Sequencing can
`
`involve generating multiple sequence reads for each molecule. This occurs, for example, as
`
`a result of the amplification of individual polynucleotide strands during the sequencing
`
`process, e.g., by PCR.
`
`[0041]
`
`Amplification of a single strand of a polynucleotide by PCR will generate copies
`
`both of that strand and its complement. During sequencing, both the strand and its
`
`complement will generate sequence reads. However, sequence reads generated from the
`
`complement of, for example, the Watson strand, can be identified as such because they
`
`bear the complement of the portion of the duplex tag that tagged the original Watson
`
`strand.
`
`In contrast, a sequence read generated from a Crick strand or its amplification
`
`product will bear the portion of the duplex tag that tagged the original Crick strand. In this
`
`way, a sequence read generated from an amplified product of a complement of the Watson
`
`strand can be distinguished from a complement sequence read generated from an
`
`amplification product of the Crick strand of the original molecule.
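The distinction described above can be illustrated with a small sketch that uses the tag letters of Figure 4A: the Watson strand is written 5'-w m y-3' and the Crick strand 5'-z m' x-3'. A read is assigned to the strand whose tag pair flanks it, either directly or after reverse complementing. The tag and insert sequences in the usage note are invented for illustration, and the function names are hypothetical.

```python
_COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMP)[::-1]

def strand_of_origin(read: str, watson_tags: tuple, crick_tags: tuple) -> str:
    """Return 'W' or 'C' according to which strand's (5' tag, 3' tag) pair flanks the read.

    Both the read and its reverse complement are checked, because amplification
    yields copies of the original strand and of its complement. Returns '?' if
    neither tag pair matches.
    """
    for label, (left, right) in (("W", watson_tags), ("C", crick_tags)):
        for candidate in (read, revcomp(read)):
            if candidate.startswith(left) and candidate.endswith(right):
                return label
    return "?"
```

With made-up tags w = AAC, y = GGT, z = TCG, x = CTA and insert m = ATGCCGTT, the complement of the Watson strand reads ACCAACGGCATGTT (5'-y'm'w'-3') and the original Crick strand reads TCGAACGGCATCTA (5'-zm'x-3'). Both carry the insert m', yet the function assigns the first to the Watson strand and the second to the Crick strand.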
`
`[0042]
`
`Typically, a sampling, or subset, of all of the amplified polynucleotides are
`
`submitted to a sequencing device for sequencing. With respect to any original double-
`
`stranded polynucleotide there can be three results with respect to sequencing. First,
`
`sequence reads can be generated from both complementary strands of the original
`
`molecule (that is, from both the Watson strand and from the Crick strand). Second,
`
`sequence reads can be generated from only one of the two complementary strands (that is,
`
`either from the Watson strand or from the Crick strand, but not both). Third, no sequence
`
`read may be generated from either of the two complementary strands. Consequently,
`
`counting unique sequence reads mapping to a genetic locus will underestimate the number
`
`of double-stranded polynucleotides in the original sample mapping to the locus. Described
`
`herein are methods of estimating the unseen and uncounted polynucleotides.
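The three outcomes can be illustrated with a small simulation. It assumes, for illustration only, that each of the two strands of every original molecule is sequenced independently with the same probability; the function name, parameters and seed are hypothetical.

```python
import random

def simulate_recovery(n_molecules: int, p_strand: float, seed: int = 0) -> dict:
    """Count molecules seen on both strands, on one strand, or not at all.

    Assumes each strand is recovered independently with probability p_strand.
    """
    rng = random.Random(seed)
    counts = {"pairs": 0, "singlets": 0, "unseen": 0}
    outcome = ("unseen", "singlets", "pairs")
    for _ in range(n_molecules):
        # Number of strands of this molecule that produced a read: 0, 1 or 2.
        strands_seen = (rng.random() < p_strand) + (rng.random() < p_strand)
        counts[outcome[strands_seen]] += 1
    return counts
```

Whenever `p_strand` is below 1, the sum of `pairs` and `singlets` tends to fall short of `n_molecules`, which is the underestimate described above.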
`
`[0043]
`
`The sequencing method can be massively parallel sequencing, that is,
`
`simultaneously (or in rapid succession) sequencing any of at least 100,000, 1 million, 10
`
`million, 100 million, or 1 billion polynucleotide molecules. Sequencing methods may
`
`include, but are not limited to: high-throughput sequencing, pyrosequencing,
`
`sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor
`
`sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina),
`
`Digital Gene Expression (Helicos), Next generation sequencing, Single Molecule
`
`Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single
`
`Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert or Sanger sequencing,
`
`primer walking, sequencing using PacBio, SOLiD, Ion Torrent, or Nanopore platforms and
`
`any other sequencing methods known in the art.
`
`[0044]
`
`D. Reducing or Tracking Redundancy
`
`[0045] In step (108) redundancy in sequence reads is reduced and/or tracked.
`
`Sequencing of amplified polynucleotides typically produces reads of the several
`