throbber
brief communications
`
`counting absolute numbers
`of molecules using unique
`molecular identifiers
`Teemu Kivioja1–3,5, Anna Vähärautio1,3,5,
`Kasper Karlsson4, Martin Bonke1, Martin Enge3,
`Sten Linnarsson4 & Jussi Taipale3
`
`counting individual rna or dna molecules is difficult because
`they are hard to copy quantitatively for detection. to overcome
`this limitation, we applied unique molecular identifiers (umis),
`which make each molecule in a population distinct, to genome-
`scale human karyotyping and mrna sequencing in Drosophila
`melanogaster. use of this method can improve accuracy of
`almost any next-generation sequencing method, including
`chromatin immunoprecipitation–sequencing, genome assembly,
`diagnostics and manufacturing-process control and monitoring.
`
`Determining the relative abundance of two different molecular
`species or the absolute number of molecules in a single sample is
`challenging. We describe an absolute counting method that can
`use amplification but does not require detecting each original
`molecule or keeping track of the number of copies made. In this
`method, each molecule in a population is first made unique. This
`can be accomplished by adding a random DNA sequence label,
`by fragmenting or by taking an aliquot of a complex mixture that
`is small enough to contain only distinct molecules (Fig. 1a–c).
`Any combination of these manipulations can be used to gener-
`ate a library in which each molecule has a distinct identifying
`
`sequence. We designate the resulting sequences that can be used
`to uniquely identify copies derived from each molecule unique
`molecular identifiers (UMIs; Fig. 1).
`As long as the complexity of the library of molecules is main-
`tained, the library can be amplified, normalized or otherwise
`processed without loss of information about the original molecule
`count because the number of UMIs in the library acts as a molecular
`memory of the number of molecules in the starting sample
`(Supplementary Fig. 1). Upon deep sequencing, each UMI will
`be observed multiple times, and the number of original DNA
`molecules can be determined simply by counting each UMI only
`once. However, long before all UMIs are observed, increasingly
`precise estimates of the absolute molecule number can be made
`(Online Methods). This is in contrast to other counting methods,
`such as direct single-molecule sequencing1–3, which require that
`all counted molecules are observed directly. In addition, many
`existing digital molecule counting methods such as digital PCR4,
`digital microarray profiling5 and direct single-molecule sequenc-
`ing1–3 cannot be effectively multiplexed and are thus generally only
`applicable to measuring one or few molecular species from many
`samples, or many species from a single sample. Counting methods
`that introduce random tags to make molecules unique before
`amplification have been suggested5,6 and applied to analysis of
`RNA-protein interactions7,8. In addition, three recent publica-
`tions applied such labeling methods to the analysis of selected
`target genes by using either microarrays9 or sequencing9–11. Here
`we apply the idea more generally and show that it can be used for
`absolute quantification.
`The UMI method is very effective on simulated data
`(Supplementary Fig. 1). To assess whether it can be used to
`improve experimental measurements, we applied UMI count-
`ing to digital karyotyping and mRNA sequencing (mRNA-seq).
`
`b
`
`c
`
`Labeling
`
`Amplification,
`normalization
`
`Sequencing
`
`Fragmentation
`
`Aliquot
`
`Amplification,
`normalization
`
`Amplification,
`normalization
`
`Sequencing
`
`Sequencing
`
`1 UMI
`
`3 UMIs
`
`4 UMIs
`
`a
`
`2 UMIs
`
`13 UMIs
`≈ 20 original molecules
`
`figure 1 | UMIs can be generated by adding oligonucleotide labels,
`fragmenting, taking a small enough aliquot or a combination thereof.
`(a) Three different DNA species (green, blue and black lines) are labeled
`with a collection of random labels (colored filled circles). Two green
`molecules are originally present (top), corresponding to two different UMIs
`(red, blue) among the sequenced molecules (green; bottom). Information
`about the original number of molecules (top) is preserved in the number
`of different UMIs detected by sequencing a sample of the amplified and
`normalized library (bottom). Even if some UMIs are not observed, the
`original number of molecules can be estimated using count statistics.
`(b) The original molecule is randomly fragmented, and a short unique
`sequence from the resulting fragments constitutes each UMI; here only the
`fragment adjacent to the poly(A) sequence (red vertical bars) is amplified.
`(c) An aliquot is taken from a sample that has many identical molecules
`such that on average, less than one copy of each molecule remains.
`
`1Genome-Scale Biology Program, Institute of Biomedicine, University of Helsinki, Helsinki, Finland. 2Department of Computer Science, University of Helsinki, Helsinki, Finland.
`3Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, Stockholm, Sweden. 4Department of Medical Biochemistry and Biophysics, Karolinska
`Institutet, Stockholm, Sweden. 5These authors contributed equally to this work. Correspondence should be addressed to J.T. (jussi.taipale@ki.se) or S.L. (sten.linnarsson@ki.se).
`Received 10 MaRch; accepted 27 septeMbeR; published online 20 noveMbeR 2011; doi:10.1038/nMeth.1778
`
`72  |  VOL.9  NO.1  |  JANUARY 2012  |  nature methods
`
`© 2012 Nature America, Inc. All rights reserved.
`
`00001
`
`EX1006
`
`

`

`brief communications
`
`21 X
`
`21 X
`
`21 X
`
`21 X
`
`18 million reads; CV = 7.8%
`
`Genomic position in chromosome order
`
`279 million reads; CV = 7.5%
`
`50
`
`40
`
`30
`
`20
`
`10
`
`0
`
`500
`
`400
`
`300
`
`200
`
`100
`
`0
`
`a
`
`Number of reads
`
`(×1,000)
`
`b
`
`Number of reads
`
`(×1,000)
`
`Genomic position in chromosome order
`
`3,500
`
`Digital karyotype (1.28 million molecules); CV = 3.0%
`
`2,500
`
`1,500
`
`500
`
`Genomic position in chromosome order
`
`3,500
`
`Simulated (1.28 million molecules); CV = 2.2%
`
`2,500
`
`1,500
`
`500
`
`c
`
`Number of molecules
`
`d
`
`Number of molecules
`
`Genomic position in chromosome order
`
`must all be derived from different copies of the same chromosome.
`In simulated experiments, we found that this information can be
`used to accurately estimate the original number of molecules even
`when only a small fraction of the fragments derived from them are
`detected (Supplementary Fig. 2 and Supplementary Note).
`Generating UMIs from both high- and low-abundance species
`in a single reaction requires the use of sequence labels (Fig. 1a and
`Online Methods). We tested a labeling protocol for counting cellular
`
`figure 2 | Digital karyotyping by counting the absolute number of molecules.
`(a) Standard digital karyotype based on genomic DNA from a boy with trisomy
`21 and from his mother, mixed 1:1. (b) Standard digital karyotype of a
`sample from a male with a normal chromosome count. (c) The same sample as
`in a was analyzed by UMI counting (CV = 3.0%). Arrow highlights uniformly
`elevated copy number of regions in chromosome 21. (d) Simulated sample
`by uniform random sampling of 1.28 million molecules in silico from the
`NCBI human genome build 37 (CV = 2.2%). Number of reads and of molecules
`aligned to each 5-megabase-pair window is indicated. Chromosomes 21 and
`X are indicated by shading (the Y chromosome was excluded because its
`sequence was too repetitive for reliable alignments).
`
`For digital karyotyping, we mixed equal amounts of genomic DNA
`from a boy with Down’s syndrome and his mother. We fragmented
`the mixed DNA to generate a library of molecules, after which we
`took a sample containing less than a single genome copy. In com-
`bination with fragmentation, the use of a small aliquot reduces
`complexity such that each molecule is expected to have unique
`ends (Fig. 1b,c). The mapped genomic position of either end can
`be used as an UMI. After amplification by PCR and sequencing
`of 20 million reads, we binned read counts in 5-megabase-pair
`intervals. Total counts did not clearly show that half of the sample
`was derived from DNA with trisomy 21 and a single copy of the
`X chromosome (Fig. 2). In contrast, reanalyzing the sample by
`counting UMIs clearly revealed increased and decreased copy
`numbers of all 5-megabase-pair intervals in chromosomes 21 and
`X, respectively. As cell-free DNA from the plasma of pregnant
`women contains a mixture of parental and fetal DNA, detection
`of aberrant chromosome copy numbers is relevant to noninvasive
`prenatal diagnostics12,13, although in clinical samples the ratio of
`fetal to parental DNA is generally lower than what we used here.
`The accuracy of the UMI method was close to the theoretical limit
`imposed by the sample size (Fig. 2), suggesting that development
`of the UMI method for diagnostic use appears feasible.
`Unlike read-counting methods that are inherently limited by
`errors introduced during amplification, the UMI method can be
`made more accurate by increasing sample size and sequencing
`depth. Accuracy can be increased additionally by considering the
`number of unique consecutive and unique overlapping fragments.
`This is because consecutive fragments are likely to be derived from
`a single chromosome molecule, whereas overlapping fragments
`
`Counts
`38
`36
`33
`31
`29
`26
`24
`22
`20
`17
`15
`13
`10
`
`8 6 3 1
`
`P < 2.2 × 10–16
`
`Adjusted R2 = 0.0727
`
`40
`20
`G+C content (%)
`
`60
`
`15
`
`10
`
`5
`
`d
`
`Copy number
`
`R2 = 0.99993
`CV = 5.2%
`
`105
`
`103
`
`101
`
`c
`
`Molecules at cycle 25
`
`R2 = 0.94664
`CV = 14.3%
`
`106
`
`104
`
`102
`
`b
`
`Reads at cycle 25
`
`102
`
`106
`104
`Reads at cycle 15
`
`101
`
`105
`103
`Molecules at cycle 15
`
`a R
`
`NA fragmentation
`
`First-strand synthesis
`
`Template switch
`N10
`
`N10
`PCR ampli(cid:31)cation
`
`N10
`
`N10
`
`figure 3 | Accuracy of mRNA-seq can be improved by the UMI method. (a) mRNA-seq libraries were generated by fragmenting total RNA and reverse-
`transcribing it to cDNA using an oligo(dT) primer with an Illumina linker (blue) and a 5′ template switch adaptor containing another Illumina linker
`(magenta), a 10-base-pair random label (N10) and an index sequence (green). The combination of label sequence and the 5′ mapped position of the
`RNA fragment forms the UMI. (b,c) Measurements of expression of the same set of genes after 15 (x axes) and 25 (y axes) PCR amplification cycles were
`obtained using total read counts (b) or the UMI method (c). Most individual transcript total read counts obtained in the two measurements are far
`from each other and from diagonal (dashed gray line), and this effect is corrected by the UMI method (c). Genes whose mean in the two measurements
`deviated more than 5% from the fitted line (gray, solid) are in red. (d) A density plot shows average copy number of UMIs after 15 PCR cycles as a
`function of the average G+C content of the fragments for each measured gene from b and c. Red line in d indicates a least-squares fit, for which a P value
`and adjusted R2 value are given.
`
`nature methods  |  VOL.9  NO.1  |  JANUARY 2012  |  73
`
`© 2012 Nature America, Inc. All rights reserved.
`
`00002
`
`

`

`brief communications
`
`mRNA molecules. We fragmented Drosophila melanogaster S2 cell
`RNA, converted it to cDNA using oligo(dT)-primed reverse tran-
`scription and incorporated an oligonucleotide with a 10-base-pair
`random label by template switching (Fig. 3). We amplified the
`resulting cDNA fragments directly by PCR and sequenced them.
`In this method, only one fragment is derived from each mRNA.
`The sequence of the label and the 5′ mapped position of the frag-
`ment together define the UMI (Supplementary Fig. 3).
`Counting the total reads after 15 or 25 PCR cycles revealed an
`amplification bias that resulted in loss of accuracy, with 418 of the
`5,097 measured genes differing more than 5% between the sam-
`ples (Fig. 3b). Using UMIs to estimate the number of cDNAs in
`the original sample resulted in much higher correlation between
`the samples (R2 = 0.99993), and the number of genes differing
`by 5% or more was only ten (Fig. 3c). Whereas robust measure-
`ment of gene expression by read counting requires normaliza-
`tion14 that renders all measurements dependent on each other,
`with the UMI method one can reproducibly detect the number of
`molecules after different numbers of PCR cycles without normali-
`zation. Analysis of the average copy number of UMIs mapping
`to each gene revealed a clear bias in G+C content in the raw read
`counts (Fig. 3d), presumably owing to preferential amplification
`of sequences with low G+C content15. However, the G+C content
`explained only a small fraction of the variance, indicating that a
`simple correction cannot be used to substantially improve the
`accuracy of the total read counting method.
`Analysis of replicate samples revealed that the precision
`of the total read counting and UMI methods were similar
`(Supplementary Fig. 4). This was at least partly due to a repro-
`ducible bias in the total read counting method (Supplementary
`Fig. 5). Read counting biases introduced by PCR (Fig. 3d) or
`in silico (Supplementary Fig. 6) could be identified using the
`UMI method. Furthermore, when we used small amounts of
`input RNA, with the UMI method we could infer the relative
`sizes of the different samples. Also, estimates based on the UMI
`method were better correlated with smaller coefficients of vari-
`ation (CV) between different aliquots than the total read counts
`(Supplementary Fig. 7).
`The UMI method is compatible with sample indexing using
`separate DNA barcodes, allowing parallel analysis of samples. In
`addition to the applications described above, the UMI method
`could be used to monitor mixing of complex solutions and to trace
`flow patterns. In principle, the method can be applied to count
`all types of molecules or particles such as proteins or viruses that
`can be stoichiometrically labeled with DNA and subsequently
`purified from free label.
`In contrast to previous approaches, the UMI method also can
`be used to accurately estimate the number of molecules without
`actually observing all of them. Use of overlapping and consecu-
`tive fragments can extend the method to fragments that were lost
`during sample preparation (Supplementary Fig. 2). Furthermore,
`nonrandom UMI labels can provide information about relation-
`ships or interactions between molecules. For example, UMIs can
`contain information about fragments that were consecutive in the
`original molecule (‘junction labels’) or that were linked together
`as one macromolecular complex (‘correlative labels’). Junction
`
`74  |  VOL.9  NO.1  |  JANUARY 2012  |  nature methods
`
`labels could be introduced by inserting a DNA label using viral
`integrase or recombinase; this method is likely to have utility in
`shotgun genome assembly of repetitive sequences. Correlative
`labels, in turn, can be introduced by a ‘split-and-pool’ method,
`whereby the sample is split into a large number of wells, labeled
`with different labels and then pooled. Fragments from a macro-
`molecular complex are more likely to contain the same labels than
`unconnected fragments. Correlative labels can be used to analyze
`chromatin structure.
`The UMI method and its variations are thus likely to improve
`a large number of next-generation sequencing–based molecule-
`counting applications and also enable new methods for tracking
`relationships between molecules.
`
`methods
`Methods and any associated references are available in the online
`version of the paper at http://www.nature.com/naturemethods/.
`
`Accession codes. ArrayExpress E-MTAB-816 and European
`Nucleotide Archive ERA063165 (sequences derived from RNA
`from the S2 cell line).
`
`Note: Supplementary information is available on the Nature Methods website.
`
`acknowledgments
`We thank M. Taipale, H. Secher Lindroos, E. Ukkonen and T. Whitington for
`critical review of the manuscript, and E. Iwarsson (Karolinska University
`Hospital) for the trisomy-21 DNA. This work was supported by European Research
`Council project Growth Control, Academy of Finland postdoctoral researcher’s
`projects 122197 and 134073, and the Swedish Foundation for Strategic Research
`grant MDB09-0052.
`
`author contributions
`S.L., J.T., A.V., T.K. and M.E. conceived and designed experiments. A.V., K.K. and
`M.B. performed biological experiments. S.L., J.T., A.V. and T.K. analyzed data.
`J.T., A.V., T.K., M.E. and S.L. wrote the paper.
`
`comPeting financial interests
`The authors declare competing financial interests: details accompany the
`full-text HTML version of the paper at http://www.nature.com/naturemethods/.
`
`
`Published online at http://www.nature.com/naturemethods/.
`reprints and permissions information is available online at http://www.nature.
`com/reprints/index.html.
`
`1. Ozsolak, F. et al. Nat. Methods 7, 619–621 (2010).
`2. Lipson, D. et al. Nat. Biotechnol. 27, 652–658 (2009).
`3. Ozsolak, F. et al. Nature 461, 814–818 (2009).
`4. Vogelstein, B. & Kinzler, K.W. Proc. Natl. Acad. Sci. USA 96, 9236–9241 (1999).
`5. Macevicz, S.C. US patent application 11/125,043 (2005).
`6. Hug, H. & Schuler, R. J. Theor. Biol. 221, 615–624 (2003).
`7. Konig, J. et al. Nat. Struct. Mol. Biol. 17, 909–915 (2010).
`8. Wang, Z. et al. PLoS Biol. 8, e1000530 (2010).
`9. Fu, G.K., Hu, J., Wang, P.H. & Fodor, S.P. Proc. Natl. Acad. Sci. USA 108,
`9026–9031 (2011).
`10. Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K.W. & Vogelstein, B.
`Proc. Natl. Acad. Sci. USA 108, 9530–9535 (2011).
`11. Casbon, J.A., Osborne, R.J., Brenner, S. & Lichtenstein, C.P. Nucleic Acids
`Res. 39, e81 (2011).
`12. Chiu, R.W. et al. Proc. Natl. Acad. Sci. USA 105, 20458–20463 (2008).
`13. Fan, H.C., Blumenfeld, Y.J., Chitkara, U., Hudgins, L. & Quake, S.R.
`Proc. Natl. Acad. Sci. USA 105, 16266–16271 (2008).
`14. Anders, S. & Huber, W. Genome Biol. 11, R106 (2010).
`15. Benita, Y., Oosting, R.S., Lok, M.C., Wise, M.J. & Humphery-Smith, I.
`Nucleic Acids Res. 31, e99 (2003).
`
`© 2012 Nature America, Inc. All rights reserved.
`
`00003
`
`

`

`online methods
`Digital karyotyping. Genomic DNA was obtained by informed
`consent from three individuals, a boy with diagnosed trisomy 21,
`his mother and an unrelated adult male (here referred to as ‘adult
`male’ sample). This study was conducted with the approval of
`the Regional Ethical Review Board in Stockholm (Regionala
`etikprövningsnämnden i Stockholm). The samples from the boy
`and his mother were mixed 1:1 (‘mixed’ sample). Samples were
`prepared by enzymatic fragmentation, adaptor ligation and PCR
`as previously described16, except that the mixed sample was ali-
`quoted before PCR, aiming to obtain ~20 million molecules, and
`was ligated with a mixture of eight adapters carrying distinct
`6-base-pair (bp) barcodes. Thus, before amplification, the adult
`male sample was expected to contain billions of molecules, whereas
`the mixed sample was expected to contain 20 million molecules
`(the actual number of UMIs was 1.28 million; we attribute the
`difference to losses in sample preparation). Sequences were gener-
`ated on an Illumina Genome Analyzer with 76-bp single reads for
`the mixed sample and 100-bp paired-end reads for the adult male
`sample. Reads were mapped to the genome using Bowtie17.
`We analyzed the genome in non-overlapping 5-megabase-
`pair windows. To obtain a reliable estimate of the effective size
`of each window, accounting for repeats and other unmappable
`sequences, we generated a simulated dataset with 34 million reads
`and mapped this to the genome. The number of hits per window
`was taken as the effective size of that window, and windows
`having more than 10% repeats were discarded; this eliminated all
`of the chromosome Y data.
`To ensure that sequencing depth did not limit accuracy of
`the total read count method, we performed the same analysis
`on an adult male genome sequenced to 279 million reads. The
`CV (determined from the 5-megabase intervals) decreased only
`slightly (from 7.8% to 7.5%), showing that standard read counting
`does not converge on the true copy number.
`For absolute molecule counting, we used the 5′ genomic posi-
`tion of each mapped read as the UMI. Among the 20 million
`reads, we observed 1.28 million UMIs. The library was inten-
`tionally made from a small aliquot and sequenced until all
`molecules were observed multiple times to improve precision
`and accuracy of the UMI method. Expected copy numbers per
`genome for chromosomes 21 and X were 1.25 and 0.75, respec-
`tively, as the sample should contain five copies of chromosome
`21, three copies of chromosome X and four copies of all other
`chromosomes. We observed 1.26 copies of chromosome 21 and
`0.75 copies of chromosome X. To verify that UMIs did in fact
`identify single molecules, we searched for instances in which
`copies of a UMI (that is, multiple reads aligned to the same posi-
`tion) had different barcodes. We found only 25 such instances in
`the entire genome, demonstrating that UMIs indeed represented
`single molecules.
`To determine the best accuracy theoretically obtainable with
`1.28 million UMIs, we generated a simulated sample with this
`number of molecules and analyzed it along with the real samples.
`The CV for the UMI method was 3.0%, close to the theoretically
`maximal accuracy of 2.2% obtained by uniform random sampling
`of 1.28 million molecules.
`
`species are present in very different concentrations in cells.
`Taking a small aliquot can result in loss of low-abundance species.
`Fragmentation of high-abundance species, in turn, can result
`in generation of multiple fragments with the same start and/or
`end positions.
`For mRNA-seq, total RNA from Drosophila S2 cells trans-
`fected with GFP dsRNA was hydrolyzed via a 3-min incubation at
`70 °C in 1× RNA fragmentation buffer (Ambion). The reaction
`was terminated as instructed by the manufacturer. cDNA syn-
`thesis was performed according to the SMART protocol18 with
`addition of adapters for massively parallel sequencing19,20 using
`an oligo(dT)-containing adaptor (see Supplementary Table 1
`for sequence; Eurofins MWG Operon). For absolute molecule
`counting, a random 10-base DNA sequence label (N) was added
`to the 5′ adaptor (see Supplementary Table 1 for sequence;
`Integrated DNA Technologies). To denature RNA and anneal
`the 3′ adaptor, 12 pmol of both adapters were incubated at
`72 °C for 2 min with 3 µl of the unpurified solution containing
`50 ng of fragmented total RNA. RNA was then reverse-transcribed
`with 200 U SuperScriptIII (Invitrogen) in a 15 µl volume with the
`provided buffer, 1 mM dNTPs, 2 mM dithiothreitol (DTT) and
`excess MgCl2 (added to 15 mM). The reaction was carried out at
`55 °C for 1 h and enzyme was inactivated by incubation at 70 °C
`for 15 min. Uracil-specific excision reagent was used to degrade
`the random label sequence in the template-switch oligonucleo-
`tide (5 U of USER per 50 ng of total RNA at 37 °C for 30 min;
`New England Biolabs).
`The libraries were amplified using Phusion High-Fidelity
`DNA polymerase (Finnzymes) from 2 µl of unpurified cDNA
`reaction mixture with 300 nM Illumina single-read sequencing
`library primers. PCR was performed according to manufacturers’
`instructions. In the 50 µl reactions 20% trehalose was included
`and the following cycle settings were used: denaturation, 1 min
`at 98 °C, followed by 15 to 25 cycles of 10 s at 98 °C, 30 s at 64 °C
`and 1 min at 72 °C. Final extension was 11 min. In the PCR-cycle
`experiment, half of the reaction volume was extracted at cycle 15
`and replaced with fresh master mix. PCR products were purified
`with one volume of Agencourt XP beads (Beckman) and sub-
`jected to Illumina GAIIx massively parallel sequencing according
`to manufacturer’s instructions (54-bp reads).
`
`mRNA-seq with low input. Samples with RNA amounts corres-
`ponding to 10, 100 or 1,000 Drosophila S2 cells (Supplementary
`Fig. 7) were prepared with about 0.07, 0.7 or 7 ng of fragmented
`RNA from the same total RNA from GFP dsRNA–transfected
`cells used in the mRNA-seq experiments (Fig. 3). RNA was con-
`verted to cDNA using a paired-end–compatible oligo(dT) adaptor
`(12 pmol; see Supplementary Table 1 for sequence; Integrated DNA
`Technologies), the labeled 5′ adaptor described in the section above
`(12 pmol) and 300 U SuperScriptIII in 30 µl volume with the pro-
`vided buffer, 0.8 mM dNTPs, 2.9 mM DTT and excess MgCl2 (added
`to 17 mM). The cDNA was treated with 60 U of exonuclease I (New
`England Biolabs) before uracil-specific excision reagent. The whole
`volume of treated cDNA was included in a 100 µl PCR; samples were
`amplified as described above and sequenced (Illumina HiSeq 2000
`instrument, 54-nucleotide reads from both ends).
`
`mRNA-seq. Application of the UMI method to mRNA sequenc-
`ing requires the use of labels. This is because different mRNA
`
`Estimation of unobserved molecules. The principle of the UMI
`method is that the original number of molecules in a sample
`
`doi:10/103/nmeth.1778
`
`nature methods
`
`© 2012 Nature America, Inc. All rights reserved.
`
`00004
`
`

`

`can be estimated as the sum of observed and unobserved UMIs.
`The number of unobserved UMIs can be estimated based on
`the distribution of the copy numbers of the observed UMIs. For
`example, if one observes UMIs on average ten times (average copy
`number = 10), it is likely that very few UMIs have been missed.
`However, if the average copy number is two, a substantial fraction
`of all UMIs have not yet been observed. In general, we assumed
`that all of the UMIs of a gene had an equal probability of being
`observed. Thus, the number of molecules from each gene was
`estimated by fitting a zero-truncated Poisson distribution to the
`UMI copy number distribution using the generalized additive
`models for location, scale and shape (GAMLSS) R package21
`and adding the predicted number of unobserved UMIs to the
`observed UMI count.
`
`mRNA-seq data analysis. The sequencing reads were analyzed
`as follows: after removal of the label and index sequences and the
`following two bases, the sequencing reads were mapped to refer-
`ence sequences from Ensembl version 52 using Burrows-Wheeler
`aligner (BWA) software version 0.5.8 with default parameter val-
`ues22. Two bases were removed from the 5′ end of the reads after
`the index and label sequences to exclude additional guanines
`occasionally added by the template-switch method.
`For each gene, the sequence of its longest transcript was used
`as the reference sequence. Reads were discarded from further ana-
`lysis if they did not contain the constant sequences expected based
`on oligonucleotide design, mapped to the wrong strand, had a
`BWA mapping quality score lower than 20 or a base in the label
`sequence with an Illumina base call quality score lower than 20.
`A total of 14.8 million reads and 23.9 million reads passed these
`criteria in Drosophila S2 cell samples taken after 15 and 25 PCR
`amplification cycles, respectively. Total read count of a gene is the
`number of accepted reads mapped to its reference sequence.
`The mapped reads with the same gene, position and label
`were collected to one UMI and the number of such reads was
`recorded as the copy number of that UMI. Average copy numbers
`were 10.7 and 17.0 for samples taken after 15 and 25 PCR cycles,
`respectively. Sequence errors introduced by library preparation,
`amplification and sequencing can produce false UMIs with a low
`copy number. To limit the effect of such errors, two UMIs were
`merged if they either had identical positions and one mismatch
`in the label sequences (probable substitution) or consecutive
`positions, identical label sequences and the UMI closer to the 3′
`end of the mRNA had a copy number of one and the UMI closer
`to 5′ end had at least a copy number of two (probable deletion).
`
`In addition, all UMIs from positions where UMI average map-
`ping quality was less than 30 were discarded.
`The approximately one million random labels used were suffi-
`cient to generate UMIs from an mRNA amount that corresponds
`to the amount found in ~1000 Drosophila S2 cells. On average,
`only 0.0005% of all labels were observed per position, and even
`in the position with the highest number of labels, less than 2% of
`over one million labels were observed. In addition, a large fraction
`of all labels (>70%) were observed, and less than 0.12% overlap
`of UMIs (position label pairs) was observed between replicate
`experiments, indicating that the UMIs were not exhausted in the
`experiments and that the labels were incorporated into cDNAs
`effectively randomly, independent of position- or label-specific
`factors. Moreover, the incorporation of the random label sequence
`did not interfere with the mRNA-seq process; similar counts
`of reads mapping to each gene were observed in labeled and
`unlabeled samples (data not shown).
`The expression level of a gene was considered to be measured if
`its total read count was at least 100 and the estimate of the number
`of molecules was at least 10, and at least one of the UMIs had
`two or more copies. These cutoffs correspond to approximately
`0.2–1 mRNA molecules per cell based on yield estimates from RNA
`quantification of total RNA and spike controls (data not shown).
`The G+C content of the sequenced gene fragments was cal-
`culated as the average G+C content of the subsequences from
`the position of the mapped read to the 3′ end of the reference
`sequence. The solid gray line in Figure 3b indicating the size
`factor (1.79) needed to render the total read counts from differ-
`ent samples comparable was fitted analogously to the method
`described in reference 14. First, the mean relative difference
`between the samples was calculated from the Equation d = (1 / N)
`Σi = [1..N](xi − yi) / (xi + yi), where xi and yi are the individual
`total read counts for transcript i, and N is the total number
`of transcripts. Then, the size factor s was calculated from the
`Equation s = (1 − d) / (1 + d), and its logarithm used as the inter-
`cept for the solid fit line shown. The corresponding size factor for
`the absolute molecule counts was 1.02.
`
`16. Linnarsson, S. Exp. Cell Res. 316, 1339–1343 (2010).
`17. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Genome Biol. 10, R25
`(2009).
`18. Zhu, Y.Y., Machleder, E.M., Chenchik, A., Li, R. & Siebert, P.D.
`Biotechniques 30, 892–897 (2001).
`19. Cloonan, N. et al. Nat. Methods 5, 613–619 (2008).
`20. Levin, J.Z. et al. Nat. Methods 7, 709–715 (2010).
`21. Stasinopoulos, D.M. & Rigby, R.A. J. Stat. Softw. 23, 1–46 (2007).
`22. Li, H. & Durbin, R. Bioinformatics 25, 1754–1760 (2009).
`
`nature methods
`
`doi:10/103/nmeth.1778
`
`© 2012 Nature America, Inc. All rights reserved.
`
`00005
`
`

`


`
`1 
`
`Supplementary information
`
`Counting absolute number of molecules using unique molecular identifiers
`
`Teemu Kivioja, Anna Vähärautio, Kasper Karlsson, Martin Bonke, Martin Enge, Sten
`Linnarsson and Jussi Taipale
`
`
`
`
`Supplementary Figure 1
`
`Normalized cDNA library simulation
`
`Supplementary Figure 2
`
`Estimating the absolute number of molecules before library
`preparation
`
`Supplementary Figure 3
`
`The fraction of used positions and labels
`
`Supplementary Figure 4
`
`Reproducibility of the read and umi counting methods
`
`Supplementary Figure 5
`
`Copy number correlation in technical replicates
`
`Supplementary Figure 6
`
`Estimating the number of molecules after removing part of the
`reads in silico
`
`Supplementary Figure 7 Measuring mRNA abundances from small amount of RNA
`
`Sequences for oligomers used in mRNA-seq
`
`Normalized cDNA library simulation
`
`Modeling absolute number of molecules before library 
`preparation 
`Simulation of absolute number of molecules before library 
`preparation
`
`Supplementary Table 1
`
`Supplementary Note
`

`
`Nature Methods: doi:10.1038/nmeth.1778
`
`00006
`
`

`

`2 
`
`105
`
`104
`
`Number of sequenced reads
`
`103
`
`102
`
` 
` 
`
`
`
`
`
`
` GAPD
`ACTB
`UBC
`YWHAZ
`RPS9
`RPL13a
`NFKB1
`IGF2R
`
`101
`
`104
`103
`102
`Original number of molecules
`
`101
`
`105
`

`
`105
`
`104
`
`103
`
`102
`
`101
`
`Estimated number of molecules
`

`
`
`
`
`
`Supplementary Figure 1. Normalized cDNA library simulation. Simulation of an experiment where labels are
`used to estimate the original number of mRNA species after normalization of a cDNA library. Ten simulations
`were performed and the original number of cDNA molecules prior to normalization (x-axis) was estimated (y-axis,
`blue symbols; see Supplementary Note for details). The raw number of observed cDNA sequences for each gene
`is also shown (red symbols). Note that accurate estimates (blue symbols) can be derived even when the
`normalization decreases high abundance cDNA (GAPD, red circles) to a level that is lower than that of medium-
`abundance cDNA (RPS9, red diamond).
`
`

`
`Nature Methods: doi:10.103

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket