(19) World Intellectual Property Organization, International Bureau
(10) International Publication Number: WO 2007/073171 A2
(43) International Publication Date: 28 June 2007 (28.06.2007)
(51) International Patent Classification: C12Q 1/68 (2006.01)
(21) International Application Number: PCT/NL2006/000654
(22) International Filing Date: 21 December 2006 (21.12.2006)
(25) Filing Language: English
(26) Publication Language: English
(30) Priority Data: 60/752,591, 22 December 2005 (22.12.2005), US
(71) Applicant (for all designated States except US): KEYGENE N.V. [NL/NL]; 90, Agro Business Park, NL-6708 PW Wageningen (NL).
`
(81) Designated States (unless otherwise indicated, for every kind of national protection available): AE, AG, AL, AM, AT, AU, AZ, BA, BB, BG, BR, BW, BY, BZ, CA, CH, CN, CO, CR, CU, CZ, DE, DK, DM, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IS, JP, KE, KG, KM, KN, KP, KR, KZ, LA, LC, LK, LR, LS, LT, LU, LV, LY, MA, MD, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PG, PH, PL, PT, RO, RS, RU, SC, SD, SE, SG, SK, SL, SM, SV, SY, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.

(84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LS, MW, MZ, NA, SD, SL, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European (AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HU, IE, IS, IT, LT, LU, LV, MC, NL, PL, PT, RO, SE, SI, SK, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).
`
`
(72) Inventor; and
(75) Inventor/Applicant (for US only): VAN EIJK, Michael, Josephus, Theresia [NL/NL]; 12, Pastoor Strijboschstraat, NL-5373 EJ Herpen (NL).

(74) Agent: DE LANG, R.-J.; Exter Polak & Charlouis B.V., P.O. Box 3241, NL-2280 GE Rijswijk (NL).

Published:
- without international search report and to be republished upon receipt of that report

For two-letter codes and other abbreviations, refer to the "Guidance Notes on Codes and Abbreviations" appearing at the beginning of each regular issue of the PCT Gazette.
`
(54) Title: IMPROVED STRATEGIES FOR TRANSCRIPT PROFILING USING HIGH THROUGHPUT SEQUENCING TECHNOLOGIES

(57) Abstract: Described is a method for determining a nucleotide sequence within cDNA, the frequency of a nucleotide sequence in a cDNA sample, as well as a method for (unbiased) determination of relative transcript levels of genes without sequence information of these genes being required, said methods using complexity reduction and (high throughput) sequencing.
`
`
`
`WO 2007/073171
`
`PCT/NL2006/000654
`
Title: Improved strategies for transcript profiling using High Throughput sequencing technologies.

Technical Field
The present invention relates to the fields of molecular biology and genetics. The invention relates to improved strategies for determining the sequence of transcripts based on the use of high throughput sequencing technologies. The invention further relates to improved strategies for unbiased transcript profiling.
`
`Background of the invention
Transcript profiling is one of the cornerstone technologies used in modern day biotechnology research. The main application domain of transcript profiling is discovery of genes involved in complex traits. This includes a wide range of biological phenomena such as discovery of genes involved in (human) disease in order to identify targets for development of medication (target discovery), unraveling biochemical pathways controlling synthesis of biomolecules (fermentation industry), dissection of complex traits for plant and animal breeding (gene discovery) and many others.
A second application domain follows the reverse route, i.e. to use transcript profiling for routine diagnostic determination of transcript profiles of (a selected subset of) genes in order to predict a complex phenotype. Examples in this category are molecular classification, diagnosis and prediction of clinical prognosis of human breast cancer (Van de Vijver et al., 2002, N. Engl. J. Med., vol. 347(25):1999-2009; van 't Veer et al., 2002, Breast Cancer Res., vol. 5(1):57-87; www.agendia.com) and papillary renal cell carcinoma (Yang et al., 2005). Approaches for the identification of relevant genes based on transcript profiling data collected in segregating populations are described by Schadt and co-workers (2005, Sci. STKE, vol. 296:pe40). In brief, transcript profiling is of paramount importance in life sciences research.
`
Technologies for transcript profiling have evolved rapidly over the past 10 years. Until the early nineties (shortly after the widespread availability of PCR), transcript profiling was performed by Northern blot analysis or RNAse protection assays. While these techniques are fairly specific and sensitive (especially RNAse protection assays), limitations of these technologies are that only one or a few genes can be analyzed at a time (low throughput), while the procedures are tedious and time-consuming. In addition, both methods require the use of radioactive labeling techniques, which poses health hazards.
With the advent of the differential display (DD) technique in 1992 (Liang & Pardee, 1992, Science, vol. 257(5072):967-71), and many modifications and improvements of DD (e.g. Ordered Differential Display, Matz et al., 1997, Nucl. Acids. Res., vol. 25(12):2541-2), a first step was taken towards multiplexed transcript profiling. Characteristics of DD are that random subsets of genes are targeted by low-stringency annealing of a randomly designed PCR primer to the cDNA sample to be analyzed, resulting in preferential amplification of expressed transcripts containing sequences with high homology to the PCR primer used. Next, the amplification products are resolved on sequence gels, resulting in a fingerprint pattern representing subsets of transcribed genes. While DD methods have higher throughput compared to Northern blots and RNAse protection assays, their limitations are the fairly low reproducibility / robustness of these techniques. This is in part due to non-specific annealing of the random PCR primer used. Consequently, fingerprint patterns generated using different random primers do not systematically target different (complementary) subsets of transcripts. A further disadvantage is that DD methods require preparation of slab-gels or detection by capillary gel-electrophoresis. Yet another limitation is that the gene origin of observed bands in the fingerprints is not known, which requires band excision, elution, re-amplification and DNA sequencing to reveal; the latter limitation is shared with other fingerprint-based transcript profiling methods. Finally, with detection of 50-100 fragments per lane on a gel / capillary trace, the technology is moderately multiplexed.
The cDNA-AFLP method (Bachem et al., 1996, Plant J., vol. 9(5):745-53) addresses two of the main limitations of DD technology, namely reproducibility/robustness and complementarity of information obtained in fingerprints generated with different PCR primers. The robustness and reproducibility of the cDNA-AFLP method is very high because amplification of adaptor-ligated restriction fragments using selective AFLP® (Keygene N.V., the Netherlands; see e.g. EP 0 534 858 and Vos P., et al. (1995). AFLP: a new technique for DNA fingerprinting. Nucleic Acids Research, vol. 23, No. 21, p. 4407-4414) primers takes place under high-stringency conditions, resulting in highly reproducible fingerprint patterns. In addition, the use of selective AFLP primers with different selective nucleotides ensures that fingerprints containing complementary information are obtained. Hence cDNA-AFLP technology enables reproducible sampling of subsets of the transcriptome. Another advantage of (cDNA-)AFLP (and DD) is that no prior sequence information is needed and the technology can therefore be applied to a wide range of organisms. Limitations of cDNA-AFLP are its moderate multiplexing levels per lane/trace and the fact that the gene origin of bands is not known directly (see also DD).
`
The limitations in multiplexing levels of the above described transcript profiling methods have been addressed by both SAGE (Serial Analysis of Gene Expression; Velculescu et al., 1995, Science, vol. 270(5235):484-7) and Massively Parallel Signature Sequencing (MPSS; Brenner et al., 2000, Nature Biotechnology, vol. 18(6):630-4; Meyers et al., 2004, Nature Biotechnology, vol. 22(8):1006-11). Like cDNA-AFLP, both methods use type IIs restriction enzymes to cut sample cDNA, followed by adapter ligation.
In SAGE, adaptor-ligated fragments are subsequently concatenated and sequenced by Sanger sequencing. Short 14-20 bp sequence tags are extracted from the Sanger sequence trace, providing quantitative information about the transcribed genes ("digital Northern"). By comparing the frequency of tags between samples, information is obtained about relative expression levels between investigated samples, without the need for prior sequence information. Although this results in (accurate) determination of relative transcript abundance in different samples, given the short sequence tags obtained it is difficult to assess from which genes the tags are derived, unless large EST collections or the whole genome sequence of the investigated organism is available and tag sequences can be subjected to homology searches such as BLAST (Basic Local Alignment Search Tool) analysis. Hence, although SAGE is highly multiplexed, reproducible and robust, its value is limited to organisms with sequenced genomes. Another limitation is that the method is not very amenable to processing large samples (low throughput) due to the costs of large-scale Sanger sequencing.
Contrary to SAGE, MPSS is based on solid phase sequencing reactions. However, MPSS essentially suffers from the same limitations as SAGE, i.e. that very short sequence tags (approximately 20 bp) are obtained, which strongly limits further follow-up (gene identification / assay conversion) of interesting sequence tags in organisms for which limited (genome) sequence is available. In summary, although SAGE and MPSS are robust and highly multiplexed transcript profiling technologies which do not require prior sequence information to apply, their value is in practice limited to organisms for which the whole genome sequences have been determined or large EST collections are available in order to connect sequence tags to genes. Both methods are low-throughput and technically complex.
Conceptual strong points are that both methods rely on statistical sampling of transcript libraries (resulting in "digital Northerns") in combination with accurate sequence determination, which provides for unbiased estimates of (relative) transcription levels of many genes simultaneously, and the fact that transcript profiling does not suffer from cross-hybridization to probes on solid supports.
In 1995, gene expression microarrays were introduced (Schena et al., 1995, Science, vol. 270(5235):467-70), which presented a paradigm shift in the transcript profiling field. While initially so-called "spotted" microarrays containing EST-derived PCR products as probes were used, in subsequent years the focus has shifted towards oligonucleotide DNA chips (Pease et al., 1994, Proc. Nat. Ac. Sci. USA, vol. 91(11):5022-6), because of their higher robustness and scaling flexibility. Currently, the transcript profiling market is dominated by oligonucleotide DNA chips from various suppliers (e.g. Affymetrix, Nimblegen, Agilent etc). The power of DNA chips lies in the large number of DNA sequences that can be attached / synthesized on their surface, which enables massively parallel transcript profiling, allowing e.g. transcript profiling for all known human genes (= high multiplexing level of genes). In addition, the process of chip fabrication and hybridization can be automated and controlled, allowing for high throughput and robustness, respectively. Consequently, DNA chips are the state-of-the-art for transcript profiling anno 2005. However, while multiplexing capacity, throughput and robustness are very important strong points of DNA chips, two important limitations of chip-based transcript profiling are that sequence information is needed in order to be able to build the chip and that cross-hybridization between highly homologous sequences such as those derived from members of duplicated gene families may affect the accuracy of the results. The latter is very difficult to monitor/exclude, because it is an intrinsic characteristic of hybridization-based detection. Due to these facts, comparison of results obtained using DNA chips from different suppliers (reflecting different underlying production technologies and application protocols) is difficult to perform (Yauk et al., 2005, Nucleic Acids Research, vol. 32(15):e124). Within one platform, validation of results by an independent method such as real-time PCR assays (e.g. TaqMan, Invader) is needed. Thus, DNA chips do not provide data fitting the concept of a digital Northern but are useful for determination of relative expression levels if the same platform is used for all samples.
Ideally, a transcript profiling technology is highly multiplexed, i.e. many genes can be investigated simultaneously, high throughput, very robust and reproducible, highly accurate (not suffering from cross-hybridization) and applicable without the need for prior sequence information. The invention described below provides for methods fitting such criteria.
`
`Summary of the invention
The present inventors have now found that with a different strategy this problem can be solved and the high throughput sequencing technologies can be efficiently used in transcript profiling.
The invention comprises employing a technology that preferably divides the transcriptome in reproducible subsets. The subsets are sequenced and assembled into contigs corresponding to individual transcripts. By repeating this step in such a way that a different reproducible subset is provided, different sets of contigs are obtained. These different contigs are used to assemble the draft sequences of the transcripts. The invention does not require any knowledge of the sequence and can be applied to transcripts of any complexity. The invention is also applicable to a combination of transcripts e.g. derived from different tissues of the same organism or different organisms. The present invention provides quicker and more reliable access to any transcript of interest and thereby provides for accelerated analysis of the transcript.
The invention is also directed to (unbiased) determination of relative transcript levels of genes without sequence information of these genes being required. To this end, the frequency of a sequence within a cDNA sample is determined by sequencing of complexity-reduced libraries of said cDNA sample and alignment of the sequence to determine the number of times the sequence is identified in the libraries. This may be repeated for a second cDNA sample, and the frequencies of the two cDNA samples may be normalized, if required, and compared to determine relative transcription levels.
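By way of illustration only, the frequency comparison between two cDNA samples could be sketched as follows. This is a minimal sketch, not part of the described method: exact string matching stands in for sequence alignment, the reads are made-up toy data, and the function names and the per-million scaling are assumptions chosen for the example.

```python
from collections import Counter


def sequence_frequencies(reads):
    """Count how often each distinct sequence occurs in one library."""
    return Counter(reads)


def normalize(counts, scale=1_000_000):
    """Express counts per `scale` reads so libraries of different depth are comparable."""
    total = sum(counts.values())
    return {seq: n * scale / total for seq, n in counts.items()}


def relative_levels(norm_a, norm_b):
    """Ratio of normalized frequencies (sample B over sample A) for shared sequences."""
    shared = norm_a.keys() & norm_b.keys()
    return {seq: norm_b[seq] / norm_a[seq] for seq in shared}


# Toy reads (hypothetical data); identical strings are treated as the same transcript.
sample_a = ["ACGT", "ACGT", "TTGA", "GGCC"]
sample_b = ["ACGT", "TTGA", "TTGA", "TTGA", "GGCC", "GGCC", "GGCC", "GGCC"]

norm_a = normalize(sequence_frequencies(sample_a))
norm_b = normalize(sequence_frequencies(sample_b))
ratios = relative_levels(norm_a, norm_b)
```

In this toy case the second library is twice as deep as the first, so raw counts alone would overstate the change; the normalization step removes that effect before the ratio is taken.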
`
`
`
Definitions
In the following description and examples a number of terms are used. In order to provide a clear and consistent understanding of the specification and claims, including the scope to be given such terms, the following definitions are provided. Unless otherwise defined herein, all technical and scientific terms used have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The disclosures of all publications, patent applications, patents and other references are incorporated herein in their entirety by reference.
Nucleic acid: a nucleic acid according to the present invention may include any polymer or oligomer of pyrimidine and purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively (See Albert L. Lehninger, Principles of Biochemistry, at 793-800 (Worth Pub. 1982) which is herein incorporated by reference in its entirety for all purposes). The present invention contemplates any deoxyribonucleotide, ribonucleotide or peptide nucleic acid component, and any chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of these bases, and the like. The polymers or oligomers may be heterogenous or homogenous in composition, and may be isolated from naturally occurring sources or may be artificially or synthetically produced. In addition, the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states.
Complexity reduction: the term complexity reduction is used to denote a method wherein the complexity of a nucleic acid sample, such as genomic DNA, is reduced by the generation of a subset of the sample. This subset can be representative for the whole (i.e. complex) sample and is preferably a reproducible subset. Reproducible means in this context that when the same sample is reduced in complexity using the same method, the same, or at least comparable, subset is obtained. The method used for complexity reduction may be any method for complexity reduction known in the art. Non-limiting examples of methods for complexity reduction include AFLP® (Keygene N.V., the Netherlands; see e.g. EP 0 534 858), the methods described by Dong (see e.g. WO 03/012118, WO 00/24939), indexed linking (Unrau et al., 1994, Gene, 145:163-169), those disclosed in US 2005/260628, WO 03/010328, US 2004/10153, genome portioning (see e.g. WO 2004/022758), Serial Analysis of Gene Expression (SAGE; see e.g. Velculescu et al., 1995, see above, and Matsumura et al., 1999, The Plant Journal, vol. 20(6):719-726) and modifications of SAGE (see e.g. Powell, 1998, Nucleic Acids Research, vol. 26(14):3445-3446; and Kenzelmann and Mühlemann, 1999, Nucleic Acids Research, vol. 27(3):917-918), MicroSAGE (see e.g. Datson et al., 1999, Nucleic Acids Research, vol. 27(5):1300-1307), Massively Parallel Signature Sequencing (MPSS; see e.g. Brenner et al., 2000, Nature Biotechnology, vol. 18:630-634 and Brenner et al., 2000, PNAS, vol. 97(4):1665-1670), self-subtracted cDNA libraries (Laveder et al., 2002, Nucleic Acids Research, vol. 30(9):e38), Real-Time Multiplex Ligation-dependent Probe Amplification (RT-MLPA; see e.g. Eldering et al., 2003, vol. 31(23):e153), High Coverage Expression Profiling (HiCEP; see e.g. Fukumura et al., 2003, Nucleic Acids Research, vol. 31(16):e94), a universal micro-array system as disclosed in Roth et al., 2004, Nature Biotechnology, vol. 22(4):418-426, a transcriptome subtraction method (see e.g. Li et al., Nucleic Acids Research, vol. 33(16):e136), and fragment display (see e.g. Metsis et al., 2004, Nucleic Acids Research, vol. 32(16):e127). The complexity reduction methods used in the present invention have in common that they are reproducible. Reproducible in the sense that when the same sample is reduced in complexity in the same manner, the same subset of the sample is obtained, as opposed to more random complexity reduction such as microdissection or the use of mRNA (cDNA) which represents a portion of the genome transcribed in a selected tissue and for its reproducibility is depending on the selection of tissue, time of isolation, and the like.
Tagging: the term tagging refers to the addition of a tag to a nucleic acid sample in order to be able to distinguish it from a second or further nucleic acid sample. Tagging can e.g. be performed by the addition of a sequence identifier during complexity reduction or by any other means known in the art. Such sequence identifier can e.g. be a unique base sequence of varying but defined length uniquely used for identifying a specific nucleic acid sample. Typical examples thereof are for instance ZIP sequences. Using such a tag, the origin of a sample can be determined upon further processing. In case of combining processed products originating from different nucleic acid samples, the different nucleic acid samples should be identified using different tags.
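As a rough illustration of how a sequence identifier allows the origin of pooled reads to be recovered, the following sketch assigns each read to a sample by its leading tag. The tag sequences, sample names and the function `demultiplex` are hypothetical and chosen only for the example; it is assumed here that the tag sits at the start of the read and that no tag is a prefix of another.

```python
def demultiplex(reads, tags):
    """Assign each read to a sample by its leading tag and strip the tag.

    `tags` maps a tag sequence to a sample name. Reads that start with no
    known tag are collected under the key None.
    """
    by_sample = {}
    for read in reads:
        for tag, sample in tags.items():
            if read.startswith(tag):
                by_sample.setdefault(sample, []).append(read[len(tag):])
                break
        else:
            # No tag matched: origin of this read cannot be determined.
            by_sample.setdefault(None, []).append(read)
    return by_sample


tags = {"ACAC": "leaf", "TGTG": "root"}  # hypothetical 4-base identifiers
reads = ["ACACGATTACA", "TGTGCCGGAA", "ACACTTTT", "GGGGAAAA"]
pools = demultiplex(reads, tags)
```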
Tagged library: the term tagged library refers to a library of tagged nucleic acid.
Sequencing: the term sequencing refers to determining the order of nucleotides (base sequences) in a nucleic acid sample, e.g. DNA or RNA.
`
`
`
Aligning and alignment: with the terms "aligning" and "alignment" is meant the comparison of two or more nucleotide sequences based on the presence of short or long stretches of identical or similar nucleotides. Several methods for alignment of nucleotide sequences are known in the art, as will be further explained below. Sometimes the terms 'assembling' or 'clustering' are used as a synonym, although these terms are technically not identical. Alignment takes place based on comparing maximum homology, whereas assembling means preparing a contig based on an overlap.
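The notion of preparing a contig based on an overlap can be illustrated with a deliberately simple sketch. It assumes error-free reads and exact suffix/prefix matches, which real assemblers do not require; the function names and the `min_overlap` threshold are assumptions made for this example.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left, right, min_overlap=3):
    """Join two reads into one contig if they overlap, otherwise return None."""
    size = overlap_length(left, right, min_overlap)
    return left + right[size:] if size else None


# The last four bases of the first read equal the first four of the second.
contig = merge("ATGGCGTAC", "GTACCTTGA")
no_contig = merge("AAAA", "CCCC")
```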
High-throughput screening: High-throughput screening, often abbreviated as HTS, is a method for scientific experimentation especially relevant to the fields of biology and chemistry. Through a combination of modern robotics and other specialized laboratory hardware, it allows a researcher to effectively screen large amounts of samples simultaneously.
High-throughput sequencing: determining the sequence of a nucleotide sequence using high-throughput techniques.
`Restriction endonuclease: a restriction endonuclease or
`restriction enzyme is an enzyme that recognizes a specific
`nucleotide sequence (target site) in a double-stranded DNA
`molecule, and will cleave both strands of the DNA molecule at
`every target site.
Restriction fragments: the DNA molecules produced by digestion with a restriction endonuclease are referred to as restriction fragments. Any given genome (or nucleic acid, regardless of its origin) will be digested by a particular restriction endonuclease into a discrete set of restriction fragments. The DNA fragments that result from restriction endonuclease cleavage can be further used in a variety of techniques and can for instance be detected by gel electrophoresis.
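That a given sequence is digested into a discrete, reproducible set of fragments can be shown with a small in-silico sketch. Only the top strand is modelled, the input sequence is made up, and the function `digest` is a hypothetical helper; the EcoRI recognition site GAATTC, cut after the first G, is used as the example enzyme.

```python
def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site` (top strand only).

    `cut_offset` is the number of bases of the site that stay on the left
    fragment, e.g. 1 for EcoRI, which cuts G^AATTC.
    """
    fragments, start = [], 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


sequence = "TTGAATTCAAGGGAATTCCC"  # toy sequence with two EcoRI sites
fragments = digest(sequence, "GAATTC", 1)
```

Running the same digestion on the same sequence always yields the same fragments, which is the property the definition relies on.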
Gel electrophoresis: in order to detect restriction fragments, an analytical method for fractionating double-stranded DNA molecules on the basis of size can be required. The most commonly used technique for achieving such fractionation is (capillary) gel electrophoresis. The rate at which DNA fragments move in such gels depends on their molecular weight; thus, the distances traveled decrease as the fragment lengths increase. The DNA fragments fractionated by gel electrophoresis can be visualized directly by a staining procedure e.g. silver staining or staining using ethidium bromide, if the number of fragments included in the pattern is sufficiently small. Alternatively further treatment of the DNA fragments may incorporate detectable labels in the fragments, such as fluorophores or radioactive labels.
`
Ligation: the enzymatic reaction catalyzed by a ligase enzyme in which two double-stranded DNA molecules are covalently joined together is referred to as ligation. In general, both DNA strands are covalently joined together, but it is also possible to prevent the ligation of one of the two strands through chemical or enzymatic modification of one of the ends of the strands. In that case the covalent joining will occur in only one of the two DNA strands.
Synthetic oligonucleotide: single-stranded DNA molecules having preferably from about 10 to about 50 bases, which can be synthesized chemically are referred to as synthetic oligonucleotides. In general, these synthetic DNA molecules are designed to have a unique or desired nucleotide sequence, although it is possible to synthesize families of molecules having related sequences and which have different nucleotide compositions at specific positions within the nucleotide sequence. The term synthetic oligonucleotide will be used to refer to DNA molecules having a designed or desired nucleotide sequence.
Adaptors: short double-stranded DNA molecules with a limited number of base pairs, e.g. about 10 to about 30 base pairs in length, which are designed such that they can be ligated to the ends of restriction fragments. Adaptors are generally composed of two synthetic oligonucleotides, which have nucleotide sequences that are partially complementary to each other. When mixing the two synthetic oligonucleotides in solution under appropriate conditions, they will anneal to each other forming a double-stranded structure. After annealing, one end of the adaptor molecule is designed such that it is compatible with the end of a restriction fragment and can be ligated thereto; the other end of the adaptor can be designed so that it cannot be ligated, but this need not be the case (double ligated adaptors).
Adaptor-ligated restriction fragments: restriction fragments that have been capped by adaptors as a result of ligation.
Primers: in general, the term primers refers to a DNA strand which can prime the synthesis of DNA. DNA polymerase cannot synthesize DNA de novo without primers: it can only extend an existing DNA strand in a reaction in which the complementary strand is used as a template to direct the order of nucleotides to be assembled. We will refer to the synthetic oligonucleotide molecules that are used in a polymerase chain reaction (PCR) as primers.
DNA amplification: the term DNA amplification will be typically used to denote the in vitro synthesis of double-stranded DNA molecules using PCR. It is noted that other amplification methods exist and they may be used in the present invention without departing from the gist.
`
Detailed description of the invention
The present invention provides for a method for determining a nucleotide sequence of cDNA comprising the steps of:
(a) Providing cDNA;
(b) Performing a complexity reduction on at least a portion of the cDNA to obtain a first library of the cDNA comprising cDNA fragments;
(c) Determining at least part of the nucleotide sequences of the cDNA fragments of the first library by high-throughput sequencing;
(d) Aligning the nucleotide sequences of the cDNA fragments of the first library of step (c) to generate contigs of the first library; and
(e) Determining the nucleotide sequence of the cDNA.
`Hitherto in the art of sequencing technology,
`the use of
`this complexity reduction in combination with high-throughput
`sequence determination of cDNA to represent transcripts has not
`been disclosed or suggested.
In step (a) of the method, cDNA is provided. It is well known in the art how to prepare cDNA. A method for the preparation is set forth below. However, any method for the preparation of cDNA may be used.
cDNA (complementary DNA) is usually prepared from mRNA using reverse transcriptase. In that case, reverse transcriptase synthesizes a DNA strand complementary to an RNA template if it is provided with a primer that is base-paired to the RNA and contains a free 3'-OH group. Such primer can e.g. be an oligo-dT primer that pairs with the poly-A sequence at the 3' end of most eucaryotic mRNA molecules. The rest of the cDNA strand can then be synthesized in the presence of the four deoxyribonucleoside triphosphates. The RNA strand of the resulting RNA-DNA hybrid is subsequently hydrolyzed, e.g. by raising the pH. Unlike RNA, DNA is resistant to alkaline hydrolysis, such that the DNA strand remains intact. An alternative primer can be a random primer. The random priming of cDNA may be beneficial when the reverse transcriptase fails to fully transcribe an mRNA template or if secondary structures exist. Yet an alternative primer can be a sequence-specific primer.
`
Methods for isolation of RNA from cells of a tissue of an organism or an organism itself are well known in the art of molecular biology. Moreover, many commercially available kits for cDNA synthesis can be purchased, such as e.g. from ABgene, Ambion, Applied Biosystems, BioChain, Bio-Rad, Clontech, GE Healthcare, GeneChoice, Invitrogen, Novagen, Qiagen, Roche Applied Science, Stratagene, and the like. Such methods are e.g. described in Sambrook et al. (Sambrook, J., Fritsch, E.F., and Maniatis, T., in Molecular Cloning: A Laboratory Manual. Cold Spring Harbor Laboratory Press, NY, Vol. 1, 2, 3 (1989)). RNA may be isolated from several sources such as a cell culture, a tissue, etc.
`In step (b) of the method according to the present
`invention, a complexity reduction is performed on at least a
`portion of the cDNA to obtain a first library of the cDNA
comprising cDNA fragments. Many methods for complexity
`reduction are known in the art, as indicated in the definition
`section.
`
In one embodiment of the invention, the step of complexity reduction of the nucleic acid sample comprises enzymatically cutting the nucleic acid sample in restriction fragments, separating the restriction fragments and selecting a particular pool of restriction fragments. Optionally, the selected fragments are then ligated to adaptor sequences containing PCR primer templates/binding sequences.
`In one embodiment of complexity reduction, a type IIs
endonuclease is used to digest the nucleic acid sample and the
`restriction fragments are selectively ligated to adaptor
`sequences. The adaptor sequences can contain various
`nucleotides in the overhang that is to be ligated and only the
`adaptor with the matching set of nucleotides in the overhang is
`ligated to the fragment and subsequently amplified. This
`technology is depicted in the art as ‘indexing linkers’.
`Examples of this principle can be found inter alia in Unrau and
`Deugau (1994) Gene 145:163-169.
`In one embodiment,
`the method of complexity reduction
`utilizes two restriction endonucleases having different target
sites and frequencies and two different adaptor sequences to
`provide adaptor-ligated restriction fragments, such as in AFLP.
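As noted in the background section, selective AFLP primers with different selective nucleotides yield fingerprints with complementary information. The following is a strongly simplified sketch of that idea, not the AFLP protocol itself: only one fragment end is modelled, the fragment sequences are invented, and the function `selective_amplification` and the site remnant `AATTC` are assumptions made for the example.

```python
def selective_amplification(fragments, site_remnant, selective_bases):
    """Return the fragments a selective primer would amplify in this toy model.

    A fragment is kept when it starts with the restriction-site remnant
    immediately followed by the primer's selective bases.
    """
    target = site_remnant + selective_bases
    return [frag for frag in fragments if frag.startswith(target)]


# Hypothetical fragments, each beginning with the same site remnant.
fragments = ["AATTCAAGGG", "AATTCCCTGA", "AATTCGTTAC", "AATTCATCGG"]

# One selective nucleotide splits the pool into up to four subsets.
subsets = {
    base: selective_amplification(fragments, "AATTC", base) for base in "ACGT"
}
```

In this model every fragment falls into exactly one subset, so primers with different selective nucleotides sample different, non-overlapping parts of the same pool, and repeating the selection gives the same subsets each time.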
`In one embodiment of the invention,
`the step of complexity
`reduction comprises performing an Arbitrarily Primed PCR upon
`the sample.
`In one embodiment of the invention,
`the step of complexity
`reduction comprises removing repeated sequences by denaturing
`and re-annealing the DNA and then removing double-stranded
`duplexes.
In certain embodiments of the invention, the step of complexity reduction comprises hybridising the nucleic acid sample to a magnetic bead that is bound to an oligonucleotide probe containing a desired sequence. This embodiment may further comprise exposing the hybridised sample to a single strand DNA nuclease to remove the single-stranded DNA, ligating an adaptor sequence containing a Class IIs restriction enzyme to release the magnetic bead. This embodiment may or may not comprise amplification of the isolated DNA sequence. Furthermore, the adaptor sequence may or may not be used as a template for the PCR oligonucleotide primer. In this embodiment, the adaptor sequence may or may not contain a sequence identifier or tag.
In certain embodiments of the invention, the complexity reduction utilises differential display technology or READS (Gene Logic) technology.
`
`In certain embodiments of the invention,
`the method of
`complexity reduction comprises exposing the DNA sample to a
mismatch binding protein and digesting the sample with a 3' to
5' exonuclease and then a single strand nuclease. This
`embodiment may or may not include the use of a magnetic bead
`attached to the mismatch binding prot