`
`METHODS AND COMPOSITIONS FOR NUCLEIC ACID ANALYSIS
`
`WSGR Docket No. 38938-719.101
`
`Inventor(s):
`
`Serge SAXONOV,
`Citizen of USA, Residing at
`10 De Anza Court
`San Mateo, CA 94402
`
`Wilson Sonsini Goodrich & Rosati
`PROFESSIONAL CORPORATION
`
`650 Page Mill Road
`Palo Alto, CA 94304
`(650) 493-9300 (Main)
`(650) 493-6811 (Facsimile)
`
`Filed Electronically on: April 25, 2011
`
`4338427_1.DOCX
`
`
`
`METHODS AND COMPOSITIONS FOR NUCLEIC ACID ANALYSIS
`
`BACKGROUND OF THE INVENTION
`
`[0001] There is a need for means of measuring collocated species in plasma, measuring fetal load and
`
`multiplexing on the same channel, multiplexing to align the dynamic range of targets whose concentrations
`
`are very different and to smooth out biological variations of reference genes, and for sample partitioning and
`
`barcode tagging for sequencing.
`
`SUMMARY OF THE INVENTION
`
`[0002] In general, in one aspect, a method is provided comprising partitioning one or more nucleic acids into isolated partitions, adding a unique barcode to nucleic acids in each partition, pooling the nucleic acids from the partitions, analyzing the nucleic acids, and determining which nucleic acids were in the same partition.
`
`INCORPORATION BY REFERENCE
`
`[0003] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0004] The novel features of the invention are set forth with particularity in the appended claims. A better
`understanding of the features and advantages of the present invention will be obtained by reference to the
`following detailed description that sets forth illustrative embodiments, in which the principles of the invention
`are utilized, and the accompanying drawings of which:
`
`DETAILED DESCRIPTION OF THE INVENTION
`
`[0005] In general, described herein are methods, compositions, and kits for library preparation for sequencing comprising partitioning a given sample and furnishing those partitions with their own sets of barcode adaptors. Library preparation can be performed in separate partitions. The contents of the partitions can subsequently be sequenced, and the barcodes can be used to identify which reads came from the same partition.
`
`[0006] A partition can be any mode of separating that can be used for digital PCR, e.g., droplets, microfluidic channels, or wells. Barcode adaptors can be bundled within droplets.
`[0007] Currently, barcoding is used to pool samples in order to reduce the cost of sequencing per sample. Thus separate library preps are done for each sample, each with its own barcodes. The libraries are then pooled and run through a sequencer. Every read of the resulting dataset can then be traced back to the original sample via the barcode. Our approach is analogous, but instead of tagging for the purpose of resolving which sample produced a given read, we propose to use barcoding to group reads according to their partition. Given a large set of barcodes, this enables a number of breakthrough applications. Many of these applications will become increasingly relevant as the throughput of nextgen sequencers increases.
`
`[0008] Barcode tagging can be accomplished by merging adapter-filled droplets (AFD) with sample-containing partitions (SCP), which themselves can be droplets. The adapter-filled droplets can be made smaller than the sample-partitioning droplets, and SCPs can be formed so that they contain AFDs. In one implementation, we pre-make a large batch of AFDs and emulsify the sample so that sample-partitioning droplets end up containing AFDs. Through a temperature adjustment, the AFDs can be burst to release reaction components necessary for library prep.
`
`[0009] Alternatively, one could form larger adapter-filled droplets and have them encompass sample-containing droplets.
`
`[0010] One can also construct a microfluidic device that merges a large set of pre-made adapter reagents with sample partitions such that every sample partition ends up with its own reagents. For example, if we have a square-shaped device with 1000 x 1000 = 1 million partitions, and our chemistry is such that it allows tagging each read with two barcodes, we can construct one million unique identifiers with a modest number (2,000) of different barcodes. We load reagents with 1,000 different barcodes in the horizontal channels and reagents for another set of 1,000 different barcodes in the vertical channels. Every one of the million partitions ends up with its own unique combination of barcodes.
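
The two-dimensional addressing described in [0010] can be illustrated with a minimal Python sketch. The barcode labels (`H0000`, `V0000`, ...) and the function names are hypothetical placeholders introduced only for this illustration; they are not barcode sequences or a device design from this disclosure.

```python
def partition_barcode_pair(row, col):
    """Barcode pair received by the partition at (row, col) of the grid.

    Horizontal channel `row` delivers barcode H<row>; vertical channel `col`
    delivers barcode V<col>. The labels are hypothetical placeholders.
    """
    return (f"H{row:04d}", f"V{col:04d}")


def count_unique_pairs(n_rows, n_cols):
    """Count the distinct barcode pairs across an n_rows x n_cols grid."""
    pairs = {
        partition_barcode_pair(r, c)
        for r in range(n_rows)
        for c in range(n_cols)
    }
    return len(pairs)


n_rows = n_cols = 1000
barcodes_needed = n_rows + n_cols        # distinct barcodes loaded into channels
partitions_addressed = n_rows * n_cols   # partitions, each with its own pair

# Uniqueness is checked explicitly on a smaller grid to keep the example cheap.
assert count_unique_pairs(50, 40) == 50 * 40
print(barcodes_needed, partitions_addressed)
```

The point of the sketch is only the counting argument: the number of barcodes grows with the sum of the grid dimensions while the number of addressable partitions grows with their product.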
`
`[0011] One can also merge the two types of droplets (SCPs and AFDs) in a controlled manner: one droplet of sample with one droplet of adapters.
`
`[0012] If we use droplets for tagging, we can pre-make a large set of droplets of N types. Each type is loaded with its own barcode. N is partially determined by the length of the barcode (L). In principle, N can be as large as 4^L. So for L = 10, we can generate up to 1 million different droplet types.
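
As a quick check of the arithmetic in [0012], the following Python sketch computes the upper bound 4^L and the shortest barcode length that can encode a requested number of types. The function names are illustrative assumptions; the bound also assumes every sequence of length L is usable as a barcode, which a practical design might not allow.

```python
from itertools import islice, product


def max_barcode_types(length):
    """Upper bound on distinct barcodes of a given length over A, C, G, T."""
    return 4 ** length


def min_barcode_length(n_types):
    """Smallest length L such that 4**L >= n_types."""
    length = 0
    while 4 ** length < n_types:
        length += 1
    return length


def enumerate_barcodes(length):
    """Lazily yield every possible barcode of the given length."""
    for bases in product("ACGT", repeat=length):
        yield "".join(bases)


print(max_barcode_types(10))                 # 1048576, i.e. about 1 million
print(min_barcode_length(1_000_000))         # 10
print(list(islice(enumerate_barcodes(10), 3)))
```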
`
`[0013] These adapter-filled droplets (AFDs) are randomly merged with sample-containing partitions (SCPs). One can then perform standard sequencing library preparation within each partition. Once the libraries are prepped, the contents of all the partitions are merged (e.g., by breaking droplets) and loaded onto a sequencer. The sequencer generates reads for many of the library molecules. Molecules that were prepared within the same droplet would contain the same barcode. If the number of barcodes is sufficiently large, one can surmise that molecules containing the same barcode came from the same partition.
`
`[0014] That is, if N is sufficiently large (i.e., larger than the number of AFDs actually used in the experiment), we would expect that any two SCPs will be tagged by different AFDs. If N is not very large, we may find that distinct SCPs are tagged with the same adaptors. In that case we can estimate probabilistically the likelihood that any two reads came from the same or different SCPs. For many applications, a probabilistic assessment would be sufficient.
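
One way to make the probabilistic assessment in [0014] concrete is sketched below in Python. The model is an assumption introduced for illustration, not a method stated in this disclosure: each tagged SCP receives exactly one barcode type drawn uniformly at random from N types, and the two reads being compared are drawn independently and uniformly across M tagged SCPs. Under that model, two reads share an SCP with probability 1/M and share a barcode with probability 1/M + (1 - 1/M)/N, which gives the posterior N / (N + M - 1).

```python
def prob_barcode_unique(n_types, n_tagged):
    """Probability that a given SCP's barcode type is used by no other SCP.

    Assumes each of the n_tagged SCPs draws one barcode type uniformly at
    random from n_types, independently of the others.
    """
    return (1 - 1 / n_types) ** (n_tagged - 1)


def prob_same_scp_given_same_barcode(n_types, n_tagged):
    """Posterior probability that two same-barcode reads share an SCP.

    Assumes the two reads are drawn independently and uniformly over SCPs:
      P(same SCP)     = 1 / M
      P(same barcode) = 1 / M + (1 - 1 / M) / N
    so the posterior simplifies to N / (N + M - 1).
    """
    return n_types / (n_types + n_tagged - 1)


# Illustrative numbers: 10,000 barcode types, 2,000 tagged SCPs.
print(round(prob_barcode_unique(10_000, 2_000), 3))
print(round(prob_same_scp_given_same_barcode(10_000, 2_000), 3))
```

In practice the genomic positions of the reads carry additional evidence, so this posterior is best read as a baseline under the stated simplifications.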
`
`[0015] Applications
`[0016] Long reads and Phasing
`
`
`
`
`[0017] Short-read sequencers, such as those made by Illumina and ABI, suffer from being unable to provide phasing information. These sequencers can produce reads of 100-200 bp and as short as 30 bp. 454 can do a slightly better job because the reads can get up to 400 bp, but even that is generally far from sufficient to yield phasing information. PacBio and some other technologies promise much longer reads, but even with 1000 bp reads, much of the phasing information will be lacking.
`
`[0018] All of the existing nextgen sequencing platforms entail a library preparation step, where genomic DNA is appropriately fragmented and potentially sized, then ligated with a common set of primers. This common set of primers is used in the sequencing step for massive clonal amplification, either in solution or on solid support. These clones can then be sequenced because the presence of a massive amount of identical sequence in a tightly confined space allows for the amplification of the fluorescent (or other) signal emitted by the sequencing reaction.
`
`[0019] It is now becoming common practice to use tag sequences appended to the primers, so that a common barcode is ligated to every sequence from a particular sample. Libraries from different samples can then be mixed and sequenced in a single run. Since every read contains a barcode, it is straightforward to infer which sample produced any given read. This is known as sample multiplexing and allows for much more cost-effective pricing per sample for many sequencing applications. Note that in this case a part of every read is consumed by the barcode. This has become less and less of an issue as reads get longer (100-200 bp), since in principle one can tag one million samples with 10 bp tags.
`
`[0020] Short reads make it challenging to sequence large genomes de novo. Short reads are also incapable of delivering phasing information for all but a very small fraction of polymorphisms. Our partition-barcoding scheme can be used to effectively reconstruct much longer reads, help with long-range assembly, and supply phasing information while making use of existing sequencing approaches.
`
`[0021] The key is to start with high molecular weight DNA and partition the sample so that a given partition is very unlikely to contain two fragments from the same locus but different chromosomes. Library prep is then performed within droplets as described above.
`
`[0022] The core concept is that reads that map somewhat close to each other in the genome and are found in the same droplet are very likely linked to each other and thus reside on the same chromosome. In this fashion individual short reads can be strung together into longer sequence fragments.
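
The grouping idea in [0022] can be sketched in Python as follows. This is a minimal illustration under stated assumptions: reads are already aligned and represented as `(barcode, chromosome, position)` tuples, and the `max_gap` threshold of 50,000 bp is a hypothetical value chosen for the example rather than a parameter given in this disclosure.

```python
from collections import defaultdict


def link_reads(reads, max_gap=50_000):
    """Group reads that share a barcode and map close together.

    reads: iterable of (barcode, chrom, pos) tuples from aligned reads.
    max_gap: hypothetical distance threshold; neighbouring reads farther
        apart than this start a new inferred fragment.
    Returns a list of (barcode, chrom, [positions]) inferred fragments.
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        by_key[(barcode, chrom)].append(pos)

    fragments = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= max_gap:
                current.append(pos)      # same inferred long fragment
            else:
                fragments.append((barcode, chrom, current))
                current = [pos]          # too far away: start a new fragment
        fragments.append((barcode, chrom, current))
    return fragments


example_reads = [
    ("BC1", "chr1", 100),
    ("BC1", "chr1", 20_000),
    ("BC1", "chr1", 900_000),
    ("BC2", "chr1", 150),
]
for fragment in link_reads(example_reads):
    print(fragment)
```

Reads in the same inferred fragment are candidates for residing on the same chromosome copy, so heterozygous variants they cover can be assigned a common phase.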
`
`[0023] Library prep within droplets would entail fragmentation and ligation of adaptors. Fragmentation can be accomplished enzymatically using an endonuclease, followed by a ligation step. Alternatively, a transposon-based approach such as Nextera’s can be used; in that approach DNA is fragmented and adapted in a single-step reaction. One can follow adapter ligation with PCR amplification of ligated products to increase their concentrations.
`
`
`
`
`[0024] One can also perform an MDA (multiple-displacement amplification) step within the droplet prior to fragmentation and adapter ligation to amplify the amount of DNA in each droplet in order to cover more of the captured fragments.
`
`[0025] Example:
`
`[0026] Let’s say we load 1,000 GE (genome equivalents) = 6 ng of DNA to be sequenced at 100x depth. This implies that there are 2,000 copies of every (normal copy number) target. For every locus, we want to make sure that a large majority of fragments end up in their own partitions and that most of the 2,000 fragments are tagged with a unique barcode.
`
`[0027] The first requirement is accomplished by increasing the number of partitions. With 100,000 partitions, we expect that only about 0.5% of fragments at a particular locus from different chromosomes would end up in the same partition. Note that many such cases will be readily identified by the appearance of distinct alleles from heterozygous SNPs with the same barcode as well as by increased coverage of the locus by a barcode.
`
`[0028] In order to ensure that most fragments are tagged with distinct barcodes, we need a large number of different barcodes and an approach that distributes barcodes so that any given partition is furnished with a small number (preferably one) of barcode-containing droplets. The distribution can be random so that some partitions receive zero droplets, some one, some multiple. Thus for 100,000 partitions we can supply 100,000 barcoding droplets. In that case, 37% of the partitions will receive no adapters and will thus be unavailable for sequencing. The number of barcoding droplets can be increased if sample preservation is of paramount importance. 37% of the partitions will be barcoded with a single barcode and roughly 26% will be coded with multiple, potentially different barcodes. In the case above, about 740 fragments will be unavailable for sequencing, about 740 will be sequestered with their own barcodes, and about 520 will be sequestered with multiple barcodes. Ideally all of the 740*1 + 360*2 + ... = 2,000 barcodes in the partitions associated with a particular fragment would be unique. If we have 10,000 different barcode types, then more than 80% of the fragments would be uniquely tagged.
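
The percentages in [0028] follow from Poisson loading statistics, which the Python sketch below reproduces. It is an illustration under assumptions not spelled out in the text: barcoding droplets land in partitions independently and uniformly at random (so the count per partition is approximately Poisson), and each barcoding droplet carries one barcode type drawn uniformly from the available types. The function and key names are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson mean of lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def loading_summary(n_partitions, n_barcode_droplets, n_fragments, n_barcode_types):
    """Summarise random barcode loading for the fragments of one locus."""
    lam = n_barcode_droplets / n_partitions   # mean barcode droplets per partition
    p_none = poisson_pmf(0, lam)
    p_single = poisson_pmf(1, lam)
    p_multiple = 1 - p_none - p_single
    # Expected number of barcode droplets sharing a partition with the locus fragments.
    barcodes_at_locus = n_fragments * lam
    # Chance that one such barcode's type is not repeated among the others.
    p_unique = (1 - 1 / n_barcode_types) ** (barcodes_at_locus - 1)
    return {
        "p_none": p_none,
        "p_single": p_single,
        "p_multiple": p_multiple,
        "fragments_none": n_fragments * p_none,
        "fragments_single": n_fragments * p_single,
        "fragments_multiple": n_fragments * p_multiple,
        "p_unique": p_unique,
    }


summary = loading_summary(100_000, 100_000, 2_000, 10_000)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With one barcoding droplet per partition on average, a little over a third of partitions receive none, a little over a third receive exactly one, and about a quarter receive several; with 10,000 barcode types the uniqueness probability comes out above 80%, in line with the paragraph above.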
`[0029] If the number of genome equivalents were lower, then we would need fewer partitions and barcodes.
`[0030] Note that perfection is not necessary for this application, because we only need to capture a small subset of SNPs from any given genomic location to yield phasing information. It is acceptable if a substantial fraction of fragments is not informative.
`[0031] One can attain greater efficiency of sample processing if each partition is supplied with a barcode in a controlled manner, as could be done with RainDance-like merging of droplets or via a microfluidic circuit similar to Fluidigm’s array designs. That is, if we can guarantee that a given partition receives precisely one AFD, we can make do with fewer AFDs and fewer AFD types.
`
`
`
`
`[0032] A microfluidic chip can be used in an analogous manner for partitioning. Sample partitions can be supplied with their own barcodes via a two-dimensional arrangement of channels as described above. A very large number of unique barcodes can be readily supplied by combining vertical and horizontal barcodes.
`
`[0033] Single cell transcriptome sequencing
`
`[0034] Single cells can be captured within separate droplets. These can be lysed and reverse transcribed with
`
`partition-specific barcoded primers (the appropriate reagents can be sequestered in their own inner droplets to
`
`be burst by heating when appropriate). Alternatively, a generic RT reaction can be followed by library prep,
`which would incorporate unique barcodes.
`[0035] Calculations for the number of droplets and barcodes are similar to what we covered above for phasing. For 2,000 cells, we need sufficient partitioning to capture them in separate droplets; 20,000 partitions would be plenty. We then need to ensure that every one of the partitions with cells receives a unique barcode, which can be accomplished reasonably well with 10,000 barcode types.
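
The numbers in [0035] can be explored with a small Monte Carlo sketch in Python. The model is an assumption made for illustration: cells are assigned to droplets uniformly at random, and every occupied droplet receives exactly one barcode type drawn uniformly at random (i.e., controlled barcode delivery). The function name and the fixed seed are hypothetical choices.

```python
import random
from collections import Counter


def simulate_cell_capture(n_cells, n_droplets, n_barcode_types, seed=0):
    """Simulate random cell capture followed by one random barcode per droplet.

    Returns (fraction of cells alone in their droplet,
             fraction of cells alone AND carrying a barcode type used by
             no other occupied droplet).
    """
    rng = random.Random(seed)
    droplet_of_cell = [rng.randrange(n_droplets) for _ in range(n_cells)]
    occupancy = Counter(droplet_of_cell)

    # Assumption: each occupied droplet gets exactly one barcode type.
    barcode_of_droplet = {d: rng.randrange(n_barcode_types) for d in occupancy}
    barcode_use = Counter(barcode_of_droplet.values())

    alone = 0
    resolved = 0
    for droplet in droplet_of_cell:
        if occupancy[droplet] == 1:
            alone += 1
            if barcode_use[barcode_of_droplet[droplet]] == 1:
                resolved += 1
    return alone / n_cells, resolved / n_cells


frac_alone, frac_resolved = simulate_cell_capture(2_000, 20_000, 10_000)
print(f"cells alone in a droplet: {frac_alone:.2f}")
print(f"cells alone with a unique barcode: {frac_resolved:.2f}")
```

Under these assumptions roughly nine in ten cells end up alone in a droplet, and most of those also carry a barcode type not shared with any other occupied droplet; more barcode types raise the second fraction.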
`
`[0036] After partitioning, lysing, barcoding and sequencing, the read data can be analyzed to determine which transcripts came from the same cell. This way the massive capacity of nextgen sequencing can be applied to large collections of cells while preserving single cell resolution.
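
The analysis step in [0036] amounts to grouping reads by barcode. A minimal Python sketch follows, assuming reads have already been reduced to `(barcode, gene)` pairs by upstream alignment; the barcode strings and gene names in the example are hypothetical.

```python
from collections import Counter, defaultdict


def count_transcripts_per_cell(reads):
    """Tally reads per gene for each partition barcode.

    reads: iterable of (barcode, gene) pairs, one per aligned read.
    Returns {barcode: {gene: read_count}}.
    """
    counts = defaultdict(Counter)
    for barcode, gene in reads:
        counts[barcode][gene] += 1
    return {barcode: dict(genes) for barcode, genes in counts.items()}


example_reads = [
    ("ACGTACGTAC", "GENE_A"),
    ("ACGTACGTAC", "GENE_A"),
    ("ACGTACGTAC", "GENE_B"),
    ("TTGCATTGCA", "GENE_B"),
]
print(count_transcripts_per_cell(example_reads))
```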
`
`[0037] Single cell genomic sequencing
`
`[0038] Similar to the idea of single cell transcriptome sequencing, one can capture individual cells in separate partitions and sequence the genomes while preserving single cell resolution.
`
`[0039] Depending on the library prep chemistry used, the sequence coverage per cell is likely to be shallow (very few reads per locus), but can nevertheless be useful for discerning large CNVs at single cell resolution.
`
`[0040] One can also perform MDA within the droplet on the cell’s genome prior to fragmentation and adapter ligation. This would provide more genetic material from the cell to sequence at the cost of introducing some bias and potentially losing CNV information.
`
`[0041] Single cell methylome sequencing
`[0042] Idea: partition cells, expose them to methyl-sensitive enzymes to digest away methylated sites, and sequence what’s left.
`
`[0043] Exosome sequencing.
`
`[0044] This is the same basic concept as single cell transcriptome sequencing. Exosomes are small extracellular organelles that contain RNA. Sequencing that RNA while preserving the information about which transcripts derive from the same exosome is currently impossible but is likely to be interesting and valuable.
`
`[0045] Metagenomics sequencing
`
`[0046] Analogous to what’s described above for cells, we can capture viruses/bacteria and sequence their genomes and transcriptomes. Due to the large number and variety of micro-organisms, it is often very challenging to sequence them de novo with short reads. Stringing the reads together via partitioning would allow for high-throughput sequencing of many species whose genomes are currently effectively inaccessible.
`
`[0047] Additional variation on the idea
`
`[0048] One can construct a microfluidics device that partitions the sample so that every cell ends up with its own set of barcodes. The contents of each chamber are then processed separately to dilute and further partition (perhaps now through an emulsion) in order to enable whole genome or transcriptome amplification separately for each cell. The idea is that WGA and other amplification schemes benefit from partitioning to reduce competition between different parts of the genome or transcriptome.
`[0049] We can make large slugs to capture individual cells and supply them with their own barcodes. We can then break those slugs into many (thousands or more) much smaller droplets in order to perform unbiased whole genome/transcriptome amplification in droplets. (WGA can work much better in droplets than in bulk.) Note that the droplets from all the slugs can be mixed together because they are already furnished with appropriate adaptors, so we can determine from the sequencing information which reads came from which slug.
`
`[0050] Another application:
`
`[0051] We pre-make batches of antibodies linked to beads coated with short DNA fragments. Each antibody would be associated with its own unique sequence. (The antibodies could also be linked to droplets containing DNA fragments, which can be burst as appropriate. This is actually what I thought of first, but Ben suggested beads, which made more sense.) Cells can be pre-coated with these antibodies, then captured in larger droplets along with droplet/cell-specific barcode adaptors. We can then go through library prep as described above, sequence the contents of all the droplets, and infer which reads came from which cell by the barcodes. Now in addition to sequencing the cells' genomes or transcriptomes we can also get information about their proteins. Some of the same information can be captured via FACS, but here we can have an essentially arbitrary amount of multiplexing. Also, this could be a relatively straightforward way of capturing cell-specific nucleic acid information along with the protein data for many cells in a single shot.
`
`[0052] Multiplexing to align the dynamic range of targets whose concentrations are very different and to smooth out biological variation of reference genes
`
`[0053] We estimate CNVs by measuring concentrations of a target gene and a reference gene in a single reaction using one fluorescence dye for the target and another for the reference.
`
`[0054] For high copy number, the concentration of the target is expected to be higher than the concentration of the reference. In that case it may be challenging to measure both in a single digital PCR reaction due to the limited dynamic range of digital PCR.
`
`
`
`
`[0055] We can address this problem by choosing several different targets for the reference, multiplexed together and all using the same fluorescence dye. This would boost the counts of the reference and bring them closer to those of the target.
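
The effect described in [0055] can be sketched numerically in Python. The sketch relies on assumptions that go beyond the text: droplet occupancy follows Poisson statistics, the multiplexed reference assays target loci that segregate into droplets independently, and every reference locus is present at `ref_copies` copies per genome. All function and parameter names, and the droplet counts, are hypothetical.

```python
import math


def copies_per_droplet(n_positive, n_total):
    """Poisson-corrected mean copies per droplet from the positive fraction."""
    if not 0 <= n_positive < n_total:
        raise ValueError("need 0 <= n_positive < n_total")
    return -math.log(1 - n_positive / n_total)


def copy_number(target_pos, ref_pos, n_total, n_ref_assays=1, ref_copies=2):
    """Estimate target copy number against a (possibly multiplexed) reference.

    The reference channel counts droplets positive for ANY of the
    n_ref_assays reference assays, so its Poisson mean is divided by the
    number of assays to recover the per-locus concentration.
    """
    lam_target = copies_per_droplet(target_pos, n_total)
    lam_ref_per_locus = copies_per_droplet(ref_pos, n_total) / n_ref_assays
    return ref_copies * lam_target / lam_ref_per_locus


# Hypothetical counts for a target at 8 copies per genome, 20,000 droplets.
print(round(copy_number(11_013, 3_625, 20_000), 2))                   # one reference assay
print(round(copy_number(11_013, 11_013, 20_000, n_ref_assays=4), 2))  # four on one dye
```

In the second call the reference channel yields about as many positive droplets as the target channel, which is the alignment of dynamic range that the multiplexing is meant to achieve.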
`
`[0056] Depending on the number of targets to be multiplexed, we may need to use universal probes, LNA probes, or ligation approaches.
`
`[0057] The same idea applies if one is trying to measure several gene expression targets in a single reaction. We can design several assays to target the lowest expressed gene and bring the measured counts closer to those of the higher expressed gene(s).
`[0058] For expression we may need to do a restriction digest on the cDNA in order to make sure that the different targets measured in a given gene end up in different droplets.
`
`[0059] This scheme may also apply to measuring viral load levels in a single reaction.
`
`[0060] This kind of multiplexing for CNVs may also be useful for evening out biological variation where the reference may vary in copy number from individual to individual. By averaging across multiple targets we reduce the impact of the variation. This becomes of particular interest for diagnostic tests, in particular those measuring copy number alterations.
`
`[0061] Measuring fetal load and multiplexing on the same channel
`
`[0062] When measuring fetal load we measure markers of fetal-specific DNA (such as Y chromosome markers, paternal SNPs, or methyl-digested sequence) as well as those of total DNA from cell-free plasma.
`
`[0063] Plasma contains very little DNA and we are limited in the amount of blood we can draw. Therefore it is desirable to use as little DNA as possible to achieve a satisfactory measurement of fetal load.
`[0064] One can reduce the amount of plasma used (or, equivalently, increase precision while consuming the same amount of sample) by multiplexing several markers of the same type on the same fluorescence channel.
`
`[0065] For example, we can simultaneously measure N markers of total DNA on genes of known stable copy number on the VIC channel, and several markers on the Y chromosome on the FAM channel. We don’t need to know the concentration of each individual marker, just their combined total, which will come directly from the appropriate channel.
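
A fetal-load estimate from the two combined channel totals in [0065] might be computed as in the Python sketch below. It rests on assumptions introduced here for illustration: the fetus is male, each Y marker is present once per fetal genome, each total-DNA marker is autosomal with two copies per genome, markers segregate into droplets independently, and droplet occupancy is Poisson. The function names and droplet counts are hypothetical.

```python
import math


def copies_per_droplet(n_positive, n_total):
    """Poisson-corrected mean copies per droplet from the positive fraction."""
    return -math.log(1 - n_positive / n_total)


def fetal_fraction(fam_pos, vic_pos, n_total, n_y_markers, n_total_markers):
    """Estimate the fetal fraction from two multiplexed channels.

    FAM: droplets positive for any of n_y_markers Y-chromosome assays.
    VIC: droplets positive for any of n_total_markers total-DNA assays.
    Assumes one copy per Y marker per fetal genome and two copies per
    total-DNA marker per genome, hence the factor of 2.
    """
    y_per_marker = copies_per_droplet(fam_pos, n_total) / n_y_markers
    total_per_marker = copies_per_droplet(vic_pos, n_total) / n_total_markers
    genomes_fetal = y_per_marker            # fetal genomes per droplet
    genomes_total = total_per_marker / 2    # all genomes per droplet
    return genomes_fetal / genomes_total


# Hypothetical counts: 20,000 droplets, 3 Y markers on FAM, 3 total markers on VIC.
estimate = fetal_fraction(fam_pos=591, vic_pos=9_024, n_total=20_000,
                          n_y_markers=3, n_total_markers=3)
print(f"estimated fetal fraction: {estimate:.3f}")
```

Only the combined positive-droplet count per channel enters the calculation, consistent with the observation that individual marker concentrations are not needed.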
`
`[0066] Measuring collocated species in plasma
`
`[0067] Our technology allows us to measure when two species of nucleic acids are in physical proximity to each other. We construct assays with different fluorophores corresponding to different targets. When there is an excess of droplets with two (or more) fluorophores relative to what chance co-occupancy would produce, we infer that the two species are spatially linked.
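
The notion of an "excess" of double-positive droplets in [0067] can be quantified against a chance baseline, as in the following Python sketch. The baseline assumes the two targets distribute into droplets independently when they are not linked; the function names and droplet counts are hypothetical.

```python
def expected_double_positives(n_a, n_b, n_total):
    """Double-positive droplets expected by chance for unlinked targets.

    n_a, n_b: droplets positive for target A and target B respectively
              (each count includes the double-positive droplets).
    """
    return n_a * n_b / n_total


def colocalization_excess(n_a, n_b, n_ab, n_total):
    """Observed double positives minus the chance expectation."""
    return n_ab - expected_double_positives(n_a, n_b, n_total)


# Hypothetical run: 10,000 droplets, 1,000 positive for A, 2,000 positive for B.
print(colocalization_excess(1_000, 2_000, 200, 10_000))   # 0.0: consistent with no linkage
print(colocalization_excess(1_000, 2_000, 500, 10_000))   # 300.0: suggests linkage
```

A positive excess well beyond sampling noise is the signal of collocation; a proper analysis would attach a statistical test to the difference, which is omitted here.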
`[0068] microRNAs and other RNAs are believed to be packaged in exosomes in blood. It is also possible that some of them are packaged together in protein complexes.
`
`
`
`
`[0069] It is of interest to be able to tell which transcripts are co-localized, whether in exosomes or protein complexes.
`
`[0070] The idea is that we would partition plasma or some derivative of plasma processed in a way that would preserve exosomes or protein complexes of interest.
`
`[0071] We then break the exosomes, digest proteins and run the appropriate PCR reactions within the partitions in a standard ddPCR fashion. Bursting of the exosomes can be accomplished through a temperature adjustment or by releasing an inner emulsion that would carry an exosome or protein-complex breaking agent.
`
`[0072] Cell-free DNA molecules may also travel similarly aggregated, and it is of interest to determine when the molecules are collocated.
`
`[0073] The same logic applies to proteins. In principle we can detect proteins through proximity ligation approaches.
`
`[0074] We can therefore detect when particular RNA, DNA, or protein targets travel together in plasma. Additional fluorescence channels allow collocation measurement of more targets simultaneously. For example, with three fluorophores we can measure collocation frequency across three targets.
`
`[0075] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
`
`
`