`
`PROVISIONAL PATENT APPLICATION
`
`COPY NUMBER ESTIMATION
`
`Inventor(s):
`
`Mike LUCERO,
`Citizen of USA, Residing at
`634 Pine Terrace
`South San Francisco, CA 94080
`
`Serge SAXONOV,
`Citizen of USA, Residing at
`10 De Anza Ct,
`San Mateo, CA 94402
`
`Ben HINDSON,
`Citizen of Australia, Residing at
`1039 Bannock Street
`Livermore, CA 94551
`
`Kevin NESS,
`Citizen of Canada, Residing at
`24 Baytree Way Apt #10
`San Mateo, CA 94402
`
`Phil BELGRADER,
`Citizen of USA, Residing at
`89 Robinson Landing Rd.
`Severna Park, MD 21146
`
`Wilson Sonsini Goodrich & Rosati
`PROFESSIONAL CORPORATION
`
`650 Page Mill Road
`Palo Alto, CA 94304
`(650) 493-9300 (Main)
`(650) 493-6811 (Facsimile)
`
`Filed Electronically on: February 18, 2011
`
`WSGR Docket No. 38983-713.102
`
`
`
`COPY NUMBER ESTIMATION
`
`CROSS-REFERENCE
`
`[0001] This application is related to co-pending U.S. Provisional Patent Application No. 61/443,156, filed February 15, 2011, which is incorporated herein by reference in its entirety.
`
`BACKGROUND OF THE INVENTION
`
`[0002] There is a need for improved methods for copy number estimation of nucleic acid.
`
`SUMMARY OF THE INVENTION
`
`[0003] Described herein are methods for estimating the copy number of nucleic acids.
`
`INCORPORATION BY REFERENCE
`
`[0004] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0005] The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:
`
`[0006] Figure 1 illustrates FAM and VIC separated by 1K, 10K, or 100K bases.
`
`DETAILED DESCRIPTION OF THE INVENTION
`
`[0007] A practical method for high-resolution CNV estimation and validation using ddPCR™
`
`[0008] Droplet digital PCR (ddPCR™) is a practical solution for validating copy number variations identified by next generation sequencers and microarrays. The ddPCR™ method can empower one person to screen hundreds of samples for CNV analysis in a single work shift. The ddPCR™ workflow involves using restriction enzymes to separate tandem copies of a target gene prior to assembling a duplex TaqMan® assay that includes reagents to detect both the target gene and a single-copy reference gene. The reaction mixture can then be partitioned into 20,000 nanoliter-sized droplets that are thermo-cycled to end-point before being analyzed in a two-color reader. The fraction of positive-counted droplets enables the absolute concentrations of the target and reference genes to be measured, from which a relative copy number is determined. 20,000 PCR replicates per well provide the statistical power to resolve higher-order copy number differences. This low-cost method reliably generates copy number measurements with 95% confidence intervals that span an integer without overlapping adjacent copy number states. This technology is capable of phasing copy number variants, and it can determine whether gene copies are on the same or different chromosomes. Applications of this technology include, e.g., high-resolution CNV measurements, follow-up to genome-wide association studies, cytogenetic analysis, CNV alterations in cancerous tissue, and CNV phasing.
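
The statistical-power claim above can be illustrated with a minimal Python sketch. It is not taken from the application: it assumes Poisson loading of droplets, a delta-method variance, independence of the target and reference counts, and a hypothetical reference loading of 0.5 copies per droplet. The function names are illustrative only.

```python
import math


def conc_and_var(positive: int, total: int) -> tuple[float, float]:
    """Poisson-corrected copies per droplet and a delta-method variance."""
    p = positive / total
    lam = -math.log(1.0 - p)
    var = p / (total * (1.0 - p))
    return lam, var


def copy_number_ci(target_pos, ref_pos, total, ref_copies=2, z=1.96):
    """Return (lower, estimate, upper) for the copy number per genome."""
    lam_t, var_t = conc_and_var(target_pos, total)
    lam_r, var_r = conc_and_var(ref_pos, total)
    cn = ref_copies * lam_t / lam_r
    # Assumption: target and reference counts are treated as independent.
    se_log = math.sqrt(var_t / lam_t**2 + var_r / lam_r**2)
    return cn * math.exp(-z * se_log), cn, cn * math.exp(z * se_log)


def expected_positives(copies_per_droplet: float, total: int) -> int:
    """Expected number of positive droplets at a given mean loading."""
    return round(total * (1.0 - math.exp(-copies_per_droplet)))


if __name__ == "__main__":
    total = 20000
    ref_pos = expected_positives(0.5, total)  # hypothetical reference loading
    for cn_true in (5, 6):
        target_pos = expected_positives(0.5 * cn_true / 2, total)
        lo, cn, hi = copy_number_ci(target_pos, ref_pos, total)
        print(f"CN {cn_true}: {cn:.2f} [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the approximate 95% intervals for copy numbers 5 and 6 do not overlap. Real data would also carry sampling and assay variation that this sketch ignores.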
`
`[0009] In general, described herein is the use of restriction enzymes for copy number estimation in a digital format.
`
`[0010] The disclosure includes: 1) use of restriction enzymes to separate target copies so that the copies can be assorted independently into partitions for the digital readout, and 2) use of the readout of undigested DNA together with the readout from digested DNA to estimate how targets are phased, i.e., whether they are present close to each other on the same chromosome or on different chromosomes.
`
`[0011] Separating target copies for copy number analysis
`
`[0012] Digital PCR counts targets by partitioning the sample and identifying partitions containing the target. The digital readout is an all-or-nothing process in that it specifies whether a given partition contains the target of interest and not necessarily how many copies of the target are in the partition.
`
`[0013] In copy number applications, one is interested in counting the number of times a particular sequence (i.e., the target) is found in a given genome. This can be done by assessing the concentration of the target and of a reference sequence that is known to be present at some fixed number of copies in every genome. Typically, for the reference one uses a housekeeping gene that is present at two copies per diploid genome. Dividing the concentration of the target by the concentration of the reference, and scaling by the known number of reference copies per genome, yields an estimate for the number of target copies per genome.
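
As a minimal Python sketch of this calculation (not part of the application), the following assumes Poisson loading, so that the mean number of copies per partition is -ln(1 - fraction of positive partitions), and a reference present at two copies per diploid genome. The function names and the droplet counts are hypothetical.

```python
import math


def copies_per_partition(positive: int, total: int) -> float:
    """Mean copies per partition from the positive fraction (Poisson correction)."""
    if not 0 <= positive < total:
        raise ValueError("need 0 <= positive < total")
    return -math.log(1.0 - positive / total)


def copy_number(target_pos: int, ref_pos: int, total: int, ref_copies: int = 2) -> float:
    """Target copies per genome, scaled by the reference copies per genome."""
    target = copies_per_partition(target_pos, total)
    reference = copies_per_partition(ref_pos, total)
    return ref_copies * target / reference


if __name__ == "__main__":
    # Hypothetical counts from one 20,000-partition well.
    print(round(copy_number(target_pos=10512, ref_pos=7869, total=20000), 2))
```

The Poisson correction matters because a positive partition may hold more than one copy, so the raw positive fraction understates concentration at higher loadings.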
`
`[0014] For a particular genome, two or more targets can be linked closely together on the same chromosome.
`
`In that case, if the DNA is not sufficiently fragmented, a digital PCR partition containing one target will also
`
`contain all the targets that are linked to it. Because of its digital nature, the readout would count multiple
`
`copies as one. In order for the linked targets to be counted separately they need to be separated before digital
`
`analysis.
`
`[0015] One technique that has been used for target separation is STA (Specific Target Amplification) [Qin et al. 2008], which entails performing a short pre-amplification step to generate separate unlinked amplicons for the target and the reference. The main shortcoming of STA is that it requires the amplification efficiencies of the target and the reference to be matched. A slight difference in amplification efficiencies would result in a bias in CNV estimates. For example, if the target has an efficiency of 95% and the reference has an efficiency of 100%, a five-cycle STA would result in a 23% underestimate of the CNV. Similarly, any fluctuations in the relative efficiencies of the two reactions (due to sample composition, instrument performance, or operator variation) can result in a significant loss of precision. Also, pre-amplification imposes additional burdens on the workflow. For example, it requires setting up a separate PCR reaction, performing PCR, and diluting the PCR amplicons.
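
The 23% figure can be reproduced with a short Python sketch, under the assumption that "efficiency" is read as the relative per-cycle yield of target versus reference (0.95 versus 1.00), so the ratio shrinks by 0.95 per cycle. The alternative convention, in which each cycle multiplies the template by (1 + efficiency), is included for comparison. The function name and the `per_cycle_gain` flag are hypothetical.

```python
def sta_underestimate(eff_target: float, eff_ref: float, cycles: int,
                      per_cycle_gain: bool = False) -> float:
    """Fractional underestimate of the target/reference ratio after pre-amplification."""
    if per_cycle_gain:
        # Convention: each cycle multiplies template by (1 + efficiency).
        per_cycle = (1.0 + eff_target) / (1.0 + eff_ref)
    else:
        # Assumed reading of the text: relative yield per cycle.
        per_cycle = eff_target / eff_ref
    return 1.0 - per_cycle ** cycles


if __name__ == "__main__":
    print(f"{sta_underestimate(0.95, 1.00, 5):.1%}")
    print(f"{sta_underestimate(0.95, 1.00, 5, per_cycle_gain=True):.1%}")
```

Under the first convention the underestimate is about 23%; under the second it is about 12%. Either way the bias compounds with each added cycle.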
`
`[0016] Here we present an approach of using restriction enzymes to separate target copies in order to estimate copy number states accurately. The basic outline is that one digests the sample using a restriction enzyme or a restriction enzyme cocktail. The enzymes are chosen so that the DNA between the targets is restricted, but the regions to be amplified are not. The digested sample is then used in the digital PCR reaction for copy number estimation.
`
`[0017] Choosing appropriate restriction enzymes:
`
`[0018] The enzyme should not cut the target or the reference amplicon. One can use a reference genome sequence to predict the cutting.
`
`[0019] The enzyme should not cut the target or the reference amplicon even if the amplicons contain SNPs. SNP information can be obtained from several databases, most readily from dbSNP. Methylation-sensitive restriction enzymes can also be used.
`
`[0020] The enzyme needs to cut between the amplicons. It is best to ensure this by choosing enzymes whose recognition sequences occur, preferably multiple times, near the amplicons. One has to make sure that these recognition sequences are not affected by the presence of SNPs.
`
`[0021] The enzyme needs to be an efficient but specific (no star activity) cutter. This, along with digestion time and enzyme concentration, can be determined in advance by performing appropriate enzyme titration experiments.
`
`[0022] Multiple digests can be employed if some of the enzymes are not efficient at cutting, or if they do not all work universally well across all samples (e.g., because of SNPs).
`
`[0023] There is some evidence that the PCR reaction works better when the size of the fragment containing the amplicon is smaller. Therefore, picking restriction enzymes with cutting sites near the amplicons is desirable.
`
`[0024] One often needs to analyze the same sample for multiple CNVs. In this case, it is desirable to select
`
`the smallest number of digests that would work well for the entire set of CNVs. Ideally a single restriction
`
`enzyme cocktail could be found that does not cut within any of the amplicons but has recognition sites near
`
`each one of them.
`
`[0025] Appropriate software can be written to automate the process of restriction enzyme choice and present an interface for experimental biologists to choose the most appropriate enzymes given the criteria above. Additional considerations can be employed by the software, such as enzyme cost or availability.
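
A minimal Python sketch of such selection software is shown below. It is an illustration under stated assumptions, not the application's implementation: it covers only two of the criteria above (no recognition site overlapping any amplicon, and at least one site within a flanking window of every amplicon), it ignores SNPs, methylation, cost, and cutting efficiency, and it scans one strand only, which suffices for the palindromic sites listed. The toy sequence, amplicon coordinates, and function names are hypothetical.

```python
from dataclasses import dataclass

# Recognition sequences of a few common enzymes (all palindromic, so
# scanning one strand is enough for this sketch).
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "AluI": "AGCT",
    "MseI": "TTAA",
}


@dataclass(frozen=True)
class Amplicon:
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def site_positions(sequence: str, site: str) -> list[int]:
    """All 0-based start positions of a recognition site (overlaps allowed)."""
    positions = []
    i = sequence.find(site)
    while i != -1:
        positions.append(i)
        i = sequence.find(site, i + 1)
    return positions


def rank_enzymes(sequence, amplicons, enzymes=ENZYMES, flank=1000):
    """Enzymes that spare every amplicon, ranked by the fewest nearby sites per amplicon."""
    ranked = []
    for name, site in enzymes.items():
        hits = site_positions(sequence, site)
        width = len(site)
        if any(h < a.end and h + width > a.start for h in hits for a in amplicons):
            continue  # a site overlaps an amplicon
        nearby = [
            sum(
                (a.start - flank <= h and h + width <= a.start)
                or (a.end <= h and h + width <= a.end + flank)
                for h in hits
            )
            for a in amplicons
        ]
        if min(nearby) > 0:
            ranked.append((name, min(nearby)))
    return sorted(ranked, key=lambda item: -item[1])


# Hypothetical toy locus: EcoRI sites flank both amplicons, BamHI cuts the reference.
TOY_SEQUENCE = "TT" + "GAATTC" + "CACACACACA" + "GAATTC" + "CAGGATCCCA" + "GAATTC" + "TT"
TOY_AMPLICONS = [Amplicon("target", 8, 18), Amplicon("reference", 24, 34)]

if __name__ == "__main__":
    print(rank_enzymes(TOY_SEQUENCE, TOY_AMPLICONS, flank=10))
```

Ranking by the minimum count across amplicons favors a single enzyme that works for the whole set of assays, in line with the preference for the smallest number of digests.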
`
`[0026] Digestion with more than one enzyme, performed serially or together in a single tube, may help to ensure complete cutting of difficult targets.
`
`[0027] Most restriction enzymes can be heat-inactivated after restriction by raising the temperature of the
`
`restriction reaction. The temperature of heat-inactivation can be below the melt point of the restricted target
`
`fragments, so as to maintain double-stranded template copies.
`
`[0028] A control assay and template can be used to measure the efficiency of the RE digestion step.
`
`[0029] Extracting haplotype information for CNV analysis
`
`[0030] If multiple copies of a gene are present in a particular sample, it is often important to determine whether both maternal and paternal chromosomes carry some of the copies or if one of the chromosomes lacks that gene. For example, if a sample contains two copies of a gene, it may carry one chromosome with two copies and one with zero, or it may carry two chromosomes with one copy each. Similarly, a sample with three copies may carry one chromosome with three copies and the other with zero, or one chromosome with two and the other with one. Distinguishing between these possibilities (establishing whether sequences of interest are linked) is called phasing or haplotyping.
`
`[0031] Currently, no method can resolve phasing in a practical and general manner. It is especially difficult to resolve phase for copy number variants. In some applications long-range PCR can be used. In some cases, genotypes of parents or other relatives can be used to infer the copy number state of the target individual.
`
`[0032] Here we present a method of extracting phasing information for CNVs by assaying the same sample twice on a digital PCR platform: once after applying a treatment that separates copies of the target and once without applying such a treatment. This approach requires high precision for final copy number estimation and is thus best suited for use with a digital PCR platform that can produce a large number of partitions.
`
`[0033] Steps:
`
`[0034] 1. Split the sample into two aliquots. Aliquot (A) should contain the sample processed so that linked copies are separated. For this portion one can use STA or the RE digestion method outlined above. Aliquot (B) should contain the sample not treated for target copy separation (meaning no STA or digestion).
`
`[0035] 2. Assess copy number in aliquot (A).
`
`[0036] 3. Assess copy number in aliquot (B). The sample needs to be of sufficiently high molecular weight so that if a pair of targets is on the same chromosome, they are mostly linked in solution as well. If the DNA is completely unfragmented, the readout should be 0, 1, or 2 copies. However, because DNA will usually be at least partially degraded, we expect copy numbers to span non-integer values as well as numbers greater than 2. We can add another step to assess fragmentation in the sample using gels or a digital PCR co-location method (mile-post assay). If the sample is deemed overly fragmented, no information can be gleaned about its haplotypes.
`
`[0037] 4. The greater the difference between the readings in (2) and (3), the more likely it is that one of the chromosomes does not carry a copy of the target. See examples below.
`
`[0038] This approach is particularly valuable for smaller copy number states: 2, 3, 4. It yields less information about phasing (and is more difficult technically) at higher copy numbers. Conveniently, most CNV work has focused on lower copy number states and there is reason to believe that resolving phase is more relevant for such states.
`
`EXAMPLES
`
`[0039] Example 1:
`
`[0040] If we have a CN of 2.0 for a particular assay post-digestion, we do not necessarily know if the composition is (A) 2 copies on one chromosome and 0 on the other; or (B) 1 on each. If we run the same assay on undigested DNA, then we should be able to resolve between the two possibilities. In principle, we should get a CN of 1 if the arrangement is (A) and a CN of 2 if the arrangement is (B). Because DNA is fragmented, the readings are not going to be as clean: (A) should yield a reading higher than 1.0, but presumably significantly less than 2.0; (B) should yield exactly 2.0.
`
`[0041] Higher fragmentation of the starting material would bring the CN reading in (A) closer to 2.0. As an example, suppose that for a given assay the linked copies are separated by about 10 kb and that, based on our fragmentation analysis, 30% of 10 kb segments are fragmented. In that case, scenario (A) should yield a reading of 1.3 (= 0.7 * 1 + 0.3 * 2) and scenario (B) a reading of 2.0.
`
`[0042] Example 2:
`
`[0043] Alternatively, if the CN reading is 3.0 post-digestion, we do not know if the composition is (A) 3 copies on one chromosome and 0 on the other; or (B) 2 on one and 1 on the other. If we run the same assay on undigested DNA, and assume the same parameters of a 10 kb separation and 30% fragmentation as above, scenario (A) would yield a reading of 1.6 copies (= 0.7 * 0.3 * 2 + 0.7 * 0.7 * 1 + 0.3 * 0.7 * 2 + 0.3 * 0.3 * 3), whereas (B) would yield a reading of 2.3 copies (= 1.3 + 1.0).
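
The arithmetic in Examples 1 and 2 can be generalized with a short Python sketch. It assumes that copies on one chromosome sit in tandem and that each link between adjacent copies breaks independently with the same probability, so a chromosome carrying n copies contributes 1 + (n - 1) * p independently segregating fragments on average. The function names are hypothetical, and picking the nearest expected reading is a simplification that ignores measurement uncertainty.

```python
def expected_undigested_reading(copies_per_chromosome, p_break):
    """Expected CN reading on undigested DNA for a given phasing arrangement."""
    return sum(1 + (n - 1) * p_break for n in copies_per_chromosome if n > 0)


def arrangements(total_copies):
    """Ways to split the copies over two homologous chromosomes."""
    return [(total_copies - k, k) for k in range(total_copies // 2 + 1)]


def most_consistent(total_copies, undigested_reading, p_break):
    """Arrangement whose expected reading is closest to the observed one."""
    return min(
        arrangements(total_copies),
        key=lambda a: abs(expected_undigested_reading(a, p_break) - undigested_reading),
    )


if __name__ == "__main__":
    for arrangement in arrangements(2) + arrangements(3):
        reading = expected_undigested_reading(arrangement, 0.3)
        print(arrangement, round(reading, 2))
```

With a digested reading of 3 and an undigested reading near 2.2, for instance, `most_consistent(3, 2.2, 0.3)` selects the (2, 1) arrangement.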
`
`[0044] For the digital PCR-realtime hybrids, a la Life’s Biotrove arrays, one might be able to extract additional phasing information from undigested DNA. One can estimate how many of the partitions contain two, three, etc. copies of the target by analyzing real-time curves within each partition.
`
`[0045] We can attempt to do the same by looking at endpoint fluorescence, if segments with two copies of
`
`the target yield higher amplitudes than segments with one copy. Then under scenario (A) you should have
`
`fewer positives, but those positives should on average have higher fluorescence.
`
`[0046] One could tweak the number of cycles to gain the best separation (making it so that segments carrying one copy do not reach the endpoint).
`
`[0047] Example 3:
`
`% Consider three types of DNA fragments: FAM-VIC together (not chopped),
`% FAM fragment, VIC fragment. We observe some probabilities (counts in
`% FAM-VIC cross plot), and the goal is to infer the concentrations.
`% First let us do forward. Given concentrations, compute counts. Then to do
`% inverse, we simply try out different values of concentrations and select
`% the one which gives the actual counts.
`N = 20000;   % Number of partitions
`A = 10000;   % Fragmented FAM molecules
`B = 20000;   % Fragmented VIC molecules
`AB = 10000;  % Joined together
`cA = A/N;
`cB = B/N;
`cAB = AB/N;
`fprintf(1, '%f %f %f\n', cAB, cA, cB);
`pA = 1 - exp(-cA);
`
`
`pB = 1 - exp(-cB);
`pAB = 1 - exp(-cAB);
`% A is X and B is Y in cross plot
`p(2,1) = (1 - pA) * (1 - pB) * (1 - pAB); % Bottom left
`p(2,2) = pA * (1 - pB) * (1 - pAB); % Bottom right
`p(1,1) = (1 - pA) * pB * (1 - pAB); % Top left
`p(1,2) = 1 - p(2,1) - p(2,2) - p(1,1); % Top right
`disp(round(p * N));
`% Also compute marginals directly
`cAorAB = (A + AB)/N; % = c_A + c_AB
`cBorAB = (B + AB)/N; % = c_B + c_AB
`pAorAB = 1 - exp(-cAorAB); % Can be computed from p too
`pBorAB = 1 - exp(-cBorAB);
`% Inverse
`H = p * N; % We are given some hits
`%H = [0 8000; 2000 0];
`% Compute prob
`estN = sum(H(:));
`i_p = H / estN;
`i_pAorAB = i_p(1,2) + i_p(2,2); % A-positive partitions (right column)
`i_pBorAB = i_p(1,1) + i_p(1,2); % B-positive partitions (top row)
`i_cAorAB = -log(1 - i_pAorAB);
`i_cBorAB = -log(1 - i_pBorAB);
`maxVal = min(i_cAorAB, i_cBorAB);
`delta = maxVal/1000;
`errArr = [];
`gcABArr = [];
`for gcAB = 0:delta:maxVal
`    gcA = i_cAorAB - gcAB;
`    gcB = i_cBorAB - gcAB;
`    gpA = 1 - exp(-gcA);
`    gpB = 1 - exp(-gcB);
`    gpAB = 1 - exp(-gcAB);
`    gp(2,1) = (1 - gpA) * (1 - gpB) * (1 - gpAB); % Bottom left
`    gp(2,2) = gpA * (1 - gpB) * (1 - gpAB); % Bottom right
`    gp(1,1) = (1 - gpA) * gpB * (1 - gpAB); % Top left
`    gp(1,2) = 1 - gp(2,1) - gp(2,2) - gp(1,1); % Top right
`
`
`    gH = gp * estN;
`    err = sqrt(sum((H(:) - gH(:)).^2));
`    errArr = [errArr; err];
`    gcABArr = [gcABArr; gcAB];
`end
`figure, plot(gcABArr, errArr);
`minidx = find(errArr == min(errArr(:)));
`minidx = minidx(1);
`estAB = gcABArr(minidx);
`estA = i_cAorAB - estAB;
`estB = i_cBorAB - estAB;
`fprintf(1, '%f %f %f\n', estAB, estA, estB);
`gpA = 1 - exp(-estA);
`gpB = 1 - exp(-estB);
`gpAB = 1 - exp(-estAB);
`gp(2,1) = (1 - gpA) * (1 - gpB) * (1 - gpAB); % Bottom left
`gp(2,2) = gpA * (1 - gpB) * (1 - gpAB); % Bottom right
`gp(1,1) = (1 - gpA) * gpB * (1 - gpAB); % Top left
`gp(1,2) = 1 - gp(2,1) - gp(2,2) - gp(1,1); % Top right
`gH = gp * estN;
`disp(round(gH));
`% Confirm the results using simulation
`numMolA = round(estA * estN);
`numMolB = round(estB * estN);
`numMolAB = round(estAB * estN);
`A = unique(randsample(estN, numMolA, 1));
`B = unique(randsample(estN, numMolB, 1));
`AB = unique(randsample(estN, numMolAB, 1));
`U = 1:estN;
`notA = setdiff(U, A);
`notB = setdiff(U, B);
`notAB = setdiff(U, AB);
`AorBorAB = union(A, union(B, AB));
`none = setdiff(U, AorBorAB);
`simcount(2,1) = length(none);
`simcount(2,2) = length(intersect(A, intersect(notB, notAB)));
`simcount(1,1) = length(intersect(B, intersect(notA, notAB)));
`
`
`simcount(1,2) = length(AorBorAB) - simcount(2,2) - simcount(1,1);
`disp(simcount);
`
`[00139] Example 4:
`
`[00140] Milepost Assay Analysis—Probability of Fragmentation
`
`[00141] Problem statement
`
`[00142] Normally, there are two species of molecules (corresponding to FAM and VIC probes). Here, there are three species: fragmented FAM, fragmented VIC, and linked FAM-VIC.
`
`[00143] There are two dyes, so there can be ambiguity. There is a need to compute concentrations of all three species. (See Figure 1)
`
`[00144] Algorithm: Get a 2x2 table of FAM versus VIC counts. Compute the concentration of fragmented FAM and linked FAM-VIC as if there is 1 species. Compute the concentration of fragmented VIC and linked FAM-VIC as if there is 1 species. Try out different concentrations of linked FAM-VIC (from which the concentrations of fragmented FAM and VIC can be found), and find the best fit of the probability table with the observed counts:
`
`
`
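`             FAM-                    FAM+
`VIC-         (1-f) (1-v) (1-c)       f (1-v) (1-c)
`VIC+         (1-f) v (1-c)           1 - (sum of the other three cells)
`
`Here f, v, and c are the probabilities that a partition contains at least one fragmented FAM, fragmented VIC, or linked FAM-VIC molecule, respectively, following the probability table used in Example 3.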
`
`Probability of fragmentation (in %)
`
`Sample         Probability of fragmentation (%)
`1K Uncut       6
`10K Uncut      29.4, 29.8, 29.5
`100K Uncut     98.7, 97.7, 99.9
`
`[00145] Next steps can include determining whether a closed-form formula can be easily derived and/or integrating the analysis with QTools.
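
One candidate closed form, offered here as a hedged Python sketch and not as the application's method, follows from the same Poisson model used in Example 3. Under that model, P(FAM negative) = exp(-(c_FAM + c_linked)), P(VIC negative) = exp(-(c_VIC + c_linked)), and P(both negative) = exp(-(c_FAM + c_VIC + c_linked)), so c_linked = ln P(both negative) - ln P(FAM negative) - ln P(VIC negative). The function names and the dictionary layout keyed by (fam_positive, vic_positive) are assumptions made for illustration.

```python
import math


def forward_table(c_fam: float, c_vic: float, c_linked: float) -> dict:
    """2x2 partition probabilities keyed by (fam_positive, vic_positive)."""
    f, v, c = (1.0 - math.exp(-x) for x in (c_fam, c_vic, c_linked))
    neg_neg = (1 - f) * (1 - v) * (1 - c)
    fam_only = f * (1 - v) * (1 - c)
    vic_only = (1 - f) * v * (1 - c)
    return {
        (False, False): neg_neg,
        (True, False): fam_only,
        (False, True): vic_only,
        (True, True): 1.0 - neg_neg - fam_only - vic_only,
    }


def invert_table(counts: dict) -> tuple[float, float, float]:
    """Closed-form estimates of (c_fam, c_vic, c_linked) from 2x2 counts."""
    total = sum(counts.values())
    neg_neg = counts[(False, False)] / total
    fam_neg = (counts[(False, False)] + counts[(False, True)]) / total
    vic_neg = (counts[(False, False)] + counts[(True, False)]) / total
    c_linked = math.log(neg_neg) - math.log(fam_neg) - math.log(vic_neg)
    c_fam = -math.log(fam_neg) - c_linked
    c_vic = -math.log(vic_neg) - c_linked
    return c_fam, c_vic, c_linked


if __name__ == "__main__":
    # Round trip with the concentrations used in Example 3 (copies per partition).
    table = forward_table(c_fam=0.5, c_vic=1.0, c_linked=0.5)
    counts = {cell: prob * 20000 for cell, prob in table.items()}
    print(invert_table(counts))
```

With noisy counts the linked estimate can come out slightly negative, so a practical implementation would likely clamp it at zero or fall back to the grid search described above.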
`
`
`[00146] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
`
`
`