(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

(19) World Intellectual Property Organization
International Bureau

(10) International Publication Number: WO 2012/168815 A2
(43) International Publication Date: 13 December 2012 (13.12.2012)

(51) International Patent Classification: G06F 19/00 (2011.01)

(21) International Application Number: PCT/IB2012/052613

(22) International Filing Date: 24 May 2012 (24.05.2012)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data: 61/493,541; 6 June 2011 (06.06.2011); US

(71) Applicant (for all designated States except US): KONINKLIJKE PHILIPS ELECTRONICS N.V. [NL/NL]; Groenewoudseweg 1, NL-5621 BA Eindhoven (NL).

(72) Inventors; and
(75) Inventors/Applicants (for US only): KUMAR, Sunil [IN/IN]; c/o High Tech Campus, Building 44, NL-5656 AE Eindhoven (NL). SINGH, Randeep [IN/IN]; c/o High Tech Campus, Building 44, NL-5656 AE Eindhoven (NL). DIMITROVA, Nevenka [US/US]; c/o High Tech Campus, Building 44, NL-5656 AE Eindhoven (NL).

(74) Agents: VAN VELZEN, Maaike et al.; c/o High Tech Campus, Building 44, NL-5656 AE Eindhoven (NL).

(81) Designated States (unless otherwise indicated, for every kind of national protection available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IS, JP, KE, KG, KM, KN, KP, KR, KZ, LA, LC, LK, LR, LS, LT, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.

(84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).

(54) Title: METHOD FOR ASSEMBLY OF NUCLEIC ACID SEQUENCE DATA

(57) Abstract: The present invention relates to a method for assembly of nucleic acid sequence data comprising nucleic acid fragment reads into (a) contiguous nucleotide sequence segment(s), comprising the steps of: (a) obtaining a plurality of nucleic acid sequence data from a plurality of nucleic acid fragment reads; (b) aligning said plurality of nucleic acid sequence data to a reference sequence; (c) detecting one or more gaps or regions of non-assembly, or non-matching with the reference sequence in the alignment output of step (b); (d) performing de novo sequence assembly of nucleic acid sequence data mapping to said gaps or regions of non-assembly; and (e) combining the alignment output of step (b) and the assembly output of step (d) in order to obtain (a) contiguous nucleotide sequence segment(s). The present invention further relates to a method wherein the detection of gaps or regions of non-assembly is performed by implementing a base quality, coverage, complexity of the surrounding region, or length of mismatch filter or threshold. Also envisaged is the masking out of nucleic acid sequence data relating to known polymorphisms, disease related mutations or modifications, repeats, low mapability regions, CpG islands, or regions with certain biophysical features. In addition, a corresponding program element or computer program for assembly of nucleic acid sequence data and a sequence assembly system for transforming nucleic acid sequence data comprising nucleic acid fragment reads into (a) contiguous nucleotide sequence segment(s) is provided.

[FIG. 4: flow chart. Box labels: Raw reads; Reference sequence; Polymorphic landmark can be provided with RefSeq; Highly repetitive region (YES); Discard; Perform de novo assembly; Check QC and coverage to fill the gap; Extract average coverage from RefSeq and use as cut-off; RefSeq alignment; Consensus assembly.]

`
Declarations under Rule 4.17:

- as to applicant's entitlement to apply for and be granted a patent (Rule 4.17(ii))
- as to the applicant's entitlement to claim the priority of the earlier application (Rule 4.17(iii))

Published:

- without international search report and to be republished upon receipt of that report (Rule 48.2(g))
`
`

`

`WO 2012/168815
`
`PCT/IB2012/052613
`
`METHOD FOR ASSEMBLY OF NUCLEIC ACID SEQUENCE DATA
`
`FIELD OF THE INVENTION
`
The present invention relates to a method for assembly of nucleic acid sequence data comprising nucleic acid fragment reads into (a) contiguous nucleotide sequence segment(s), comprising the steps of: (a) obtaining a plurality of nucleic acid sequence data from a plurality of nucleic acid fragment reads; (b) aligning said plurality of nucleic acid sequence data to a reference sequence; (c) detecting one or more gaps or regions of non-assembly, or non-matching with the reference sequence in the alignment output of step (b); (d) performing de novo sequence assembly of nucleic acid sequence data mapping to said gaps or regions of non-assembly; and (e) combining the alignment output of step (b) and the assembly output of step (d) in order to obtain (a) contiguous nucleotide sequence segment(s). The present invention further relates to a method wherein the detection of gaps or regions of non-assembly is performed by implementing a base quality, coverage, complexity of the surrounding region, or length of mismatch filter or threshold. Also envisaged is the masking out of nucleic acid sequence data relating to known polymorphisms, disease related mutations or modifications, repeats, low mapability regions, CpG islands, or regions with certain biophysical features. In addition, a corresponding program element or computer program for assembly of nucleic acid sequence data and a sequence assembly system for transforming nucleic acid sequence data comprising nucleic acid fragment reads into (a) contiguous nucleotide sequence segment(s) is provided.
`
`BACKGROUND OF THE INVENTION
`
With the introduction of next generation or ultra-high-throughput sequencing techniques, the amount of sequence data has increased enormously, while the costs for obtaining sequence information and the time needed for the provision of this information have been dramatically reduced and will be further decreased in the future. Research as well as clinical applications of next generation sequencing approaches will have an impact on transcriptome analysis and gene annotation, allow RNA splice identification, SNP discovery or genome methylation analysis, provide a way to identify the etiology of diseases, and make it possible to screen for genomic patterns on a personal basis.
`
Next generation sequencing (NGS) is currently based on only a handful of platforms, including the Roche/454, the Illumina/Solexa and the ABI SOLiD systems. The underlying technology relies on a template amplification step before the sequencing starts. In consequence, the read length is shortened in comparison to the traditional Sanger-based technology: whereas the dideoxy terminator approach provided read lengths of 650 to 800 bp, NGS approaches have read lengths of 35 to 400 bp (Bao et al., Journal of Human Genetics, 28 April 2011, p. 1-9). Furthermore, the raw data obtained from the NGS platforms is not standardized and shows differences in read lengths, error profiles, matching thresholds etc. Thus, the implementation of NGS approaches entails an increase in the amount and complexity of sequence information.
`
However, the output of NGS sequence machines is essentially worthless by itself, since the sequence reads only become meaningful upon a reconstruction of the underlying contiguous genomic sequence. Furthermore, for routine uses of NGS, e.g. in clinical setups, a high sequence accuracy and an expedient way to select genomic subsets of interest are of importance. As genome sequencing becomes more closely integrated into the practice of medical counseling, there will be an increased responsibility of geneticists to ensure that the information obtained is in fact true and represents the original genome of the individual.

There is, thus, a need for a method allowing the accurate and time-saving alignment and assembly of nucleic acid sequence data as derivable from NGS approaches.
`
`SUMMARY OF THE INVENTION
`
The present invention addresses this need and provides means and methods which allow the assembly of nucleic acid sequence data comprising nucleic acid fragment reads into contiguous nucleotide sequence segments. The above objective is in particular accomplished by a method comprising the steps of:

(a) obtaining a plurality of nucleic acid sequence data from a plurality of nucleic acid fragment reads;

(b) aligning said plurality of nucleic acid sequence data to a reference sequence;

(c) detecting one or more gaps or regions of non-assembly, or non-matching with the reference sequence in the alignment output of step (b);

(d) performing de novo sequence assembly of nucleic acid sequence data mapping to said gaps or regions of non-assembly; and

(e) combining the alignment output of step (b) and the assembly output of step (d) in order to obtain (a) contiguous nucleotide sequence segment(s).
`
`

`

`
This method provides the advantage that a bias, which is typically generated when a reference sequence alignment is performed, can be overcome by using de novo assembly steps. Furthermore, typical problems associated with the filling of the gaps that are created during reference sequence alignment, with the detection of polymorphism lengths and, in particular, with the fitting of un-aligned sequence into the consensus assembly may be solved when closing these information gaps or breaks via de novo assembly. At the same time, annotation problems known from de novo assembly approaches can be mitigated by basing parts of the analysis on a reference sequence. The method accordingly starts with a reference sequence alignment and, when it finds a gap or region of non-assembly, switches to de novo assembly, e.g. until it again detects the reference alignment. This creates a consensus assembly or contiguous nucleotide sequence segments with a significantly increased sequence accuracy. In fact, the accordingly assembled sequence represents individual genomes rather than reference genomes and avoids reference sequence associated bias problems. The presently described method is accordingly assumed to have huge implications, inter alia in medical genetics, where it may help in determining the genetic basis of complex genetic disorders.
`
In a preferred embodiment of the present invention, the above mentioned plurality of nucleic acid sequence data is converted to a unified format.

In another preferred embodiment of the present invention the detection of step (c) as mentioned herein above is performed by implementing a filter or threshold.

In further preferred embodiments, said filter or threshold is a base quality, coverage, complexity of the surrounding region or length of mismatch filter or threshold.
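
As an illustration of how a filter or threshold for step (c) might look, the following minimal sketch flags reference regions whose coverage or mean base quality falls below a cut-off. The function name, the parameter names and the default cut-off values are assumptions chosen for the example; the text does not prescribe them.

```python
from typing import List, Tuple


def flag_regions(coverage: List[int], mean_quality: List[float],
                 min_coverage: int = 5, min_quality: float = 20.0,
                 min_length: int = 10) -> List[Tuple[int, int]]:
    """Return half-open (start, end) reference intervals in which every
    position fails the coverage or the base quality threshold and which
    are at least min_length positions long. All defaults are hypothetical."""
    failing = [c < min_coverage or q < min_quality
               for c, q in zip(coverage, mean_quality)]
    regions = []
    start = None  # start of the failing run currently being scanned
    for pos, bad in enumerate(failing):
        if bad and start is None:
            start = pos
        elif not bad and start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # close a failing run that extends to the end of the reference
    if start is not None and len(failing) - start >= min_length:
        regions.append((start, len(failing)))
    return regions


# Toy per-position statistics: a low-coverage stretch and a low-quality tail.
coverage = [30] * 10 + [2] * 12 + [30] * 10
quality = [35.0] * 25 + [10.0] * 7
candidate_regions = flag_regions(coverage, quality, min_length=5)
```

Reads mapping to the returned intervals would then be candidates for the de novo assembly of step (d). A complexity or mismatch-length filter could be added as a further condition in the same way.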
`
In another preferred embodiment of the present invention, prior to the above mentioned aligning step (b) a masking out of nucleic acid sequence data relating to known polymorphisms, highly variable regions, disease related mutations or modifications, repeats, low mapability regions, CpG islands, or regions with specific biophysical features is performed.

In a particularly preferred embodiment said masked out nucleic acid sequence data is subjected to a de novo sequence assembly of step (d) as mentioned herein above.
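
A minimal sketch of such a masking step is given below. It assumes that the regions to be masked (e.g. known polymorphic sites or repeats) are already available as half-open coordinate intervals on the reference, and it uses the character N as the mask symbol; both choices, like the function name, are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def mask_regions(reference: str, regions: Iterable[Tuple[int, int]],
                 mask_char: str = "N") -> str:
    """Replace every position inside the given half-open (start, end)
    intervals with mask_char so that an aligner will not place reads there.
    Intervals reaching beyond the sequence ends are clipped."""
    seq = list(reference)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(seq))):
            seq[i] = mask_char
    return "".join(seq)


# Hypothetical reference with one annotated region at positions 2 to 4.
masked_reference = mask_regions("ACGTACGTAC", [(2, 5)])
```

Reads that fail to align because they originate from a masked interval can then be handed to the de novo assembly of step (d) instead of being forced onto the reference.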
`
In another preferred embodiment of the present invention the above defined step (b) is carried out with a reference alignment algorithm. In a particularly preferred embodiment said reference alignment algorithm is BFAST, ELAND, GenomeMapper, GMAP, MAQ, MOSAIK, PASS, SeqMap, SHRiMP, SOAP, SSAHA, or CLD. Even more preferred is Bowtie or BWA.

`
In yet another preferred embodiment of the present invention, the above defined step (d) is carried out with a de novo assembly algorithm. In a particularly preferred embodiment said de novo assembly algorithm is ALLPATHS, Edena, EULER-SR, MIRA2, SEQAN, SHARCGS, SSAKE, SOAPdenovo, or VCAKE. Even more preferred is ABySS or Velvet.

In a further preferred embodiment the herein above mentioned reference sequence is an essentially complete prokaryotic, eukaryotic or viral genome sequence, or a sub-portion thereof. In a particularly preferred embodiment of the present invention said reference sequence is a human genome sequence, an animal genome sequence, a plant genome sequence, a bacterial genome sequence, or a sub-portion thereof.

In a further preferred embodiment of the present invention said reference sequence is selected from a group or taxon which is phylogenetically related to the organism whose nucleic acid sequence data is to be assembled.

In yet another preferred embodiment of the present invention said reference sequence is a genomic sub-portion having regulatory potential selected from the group comprising exon sequences, promoter sequences, enhancer sequences, transcription factor binding sites, or any grouping or sub-grouping thereof.

In a further preferred embodiment said reference sequence is a virtual sequence based on sequence composition parameters, or based on biophysical nucleic acid properties. In a particularly preferred embodiment of the present invention said composition parameter is the presence of monomers, dimers and/or trimers. In a further preferred embodiment of the present invention said biophysical nucleic acid property is the stacking energy, the presence of propeller twist, the bendability of the nucleic acid, duplex stability, the amount of disrupt energy, the amount of free energy, the presence of DNA denaturation or DNA bending stiffness.

In a further aspect the present invention relates to a program element or computer program for assembly of nucleic acid sequence data comprising nucleic acid fragment reads into contiguous nucleotide sequence segments, which when being executed by a processor is adapted to carry out the steps of a method as defined herein above.

In yet another aspect the present invention relates to a sequence assembly system for transforming nucleic acid sequence data comprising nucleic acid fragment reads into (a) contiguous nucleotide sequence segment(s), comprising a computer processor, memory, and (a) data storage device(s), the memory having programming instructions to execute a program element or computer program as defined herein above.
`
`

`

`
In a preferred embodiment of the present invention said sequence assembly system is associated with or connected to a sequencer device. In a further preferred embodiment said sequence assembly system is a medical decision support system. In a particularly preferred embodiment said medical decision support system is a diagnostic decision support system.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`
Fig. 1 provides an overview of reference and de novo sequence alignment and assembly procedures. Reference sequence alignment and assembly shows the mapping of reads to the reference sequences. De novo assembly shows the generation of contigs using the ABySS algorithm, based on an excerpt from an ABySS-Explorer view, where edges represent contigs and nodes represent common (k-1)-mers between adjacent contigs. The labels correspond to SET contig IDs. Contig lengths and coverage are indicated by the length and the thickness of the edges, respectively. Arrows and edge arc shape indicate the direction of contigs, and the polarity of the nodes distinguishes reverse complements of common (k-1)-mers between adjacent contigs.
`
Fig. 2 shows examples of different sequence file formats. Depicted are the qseq format (sequence read output from an Illumina instrument, which contains machine, run and quality information), the fastq format (Illumina read name, sequence and quality, derived from the qseq file) and the SAM format (Sequence Alignment/Map), which is the output of the BWA aligner. The SAM format allows read alignment information against a reference to be stored.
`
Fig. 3 depicts an overview of the alignment and assembly steps according to the present invention. It shows the overall method of combining reference alignment and de novo assembly. Initially the reads are aligned to a reference sequence. Wherever a gap of N/A/T/G/C (e.g. of user defined size, for example >10 bases) is identified, in which the reads do not match the reference in continuation with the previous read in an overlapping fashion, the de novo assembly is started. De novo contig formation continues until the next read matching the reference is identified. This de novo contig is then merged with the intermediate consensus to give the final consensus sequence.
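
The contig formation mentioned for Fig. 3 can be pictured with a small greedy overlap-merge sketch. This is a deliberately simple stand-in and not the algorithm of ABySS, Velvet or any other assembler named in this document; the function names and the minimum overlap of 3 bases are assumptions, and repeats longer than the overlap would mislead it.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if that length is below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if n > 0 and a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads: List[str], min_overlap: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap
    until no pair overlaps by at least min_overlap bases."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three toy reads that did not match the reference and overlap pairwise.
gap_contigs = greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"])
```

In the scheme of Fig. 3, a contig obtained in this way would grow until it reaches a read that matches the reference again, at which point it can be merged with the intermediate consensus.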
`
`Fig. 4 shows a process chart of method steps of a combination of reference
`
`sequence alignment and de novo assembly according to the present invention.
`
Fig. 5 depicts the determination of the exact length of the GT polymorphism in the AVPR1A gene using a combination of reference alignment and de novo assembly following the method according to the present invention. First, reads were aligned to the reference genome to extract the AVPR1A gene for the analyzed sample. As RS3 is a highly polymorphic site and is associated with a clinical phenotype, a de novo assembly of the reads falling in this chromosome was carried out and contigs were subsequently generated. After obtaining the contigs, a relaxed sequence alignment (allowing mismatches and gaps) was performed to merge the de novo contig with the reference consensus. The obtained consensus sequence showed the true polymorphic repeat for the analyzed sample.
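
Once a consensus sequence is available, the length of a dinucleotide repeat such as the GT polymorphism can be read off directly. The sketch below counts the longest uninterrupted run of a repeat unit; the function name and the toy sequence are assumptions made for illustration and are not data from the figure.

```python
import re
from typing import Tuple


def longest_repeat_run(sequence: str, unit: str = "GT") -> Tuple[int, int]:
    """Return (start, repeat_count) of the longest uninterrupted run of
    unit in sequence, or (-1, 0) if the unit does not occur."""
    pattern = re.compile("(?:%s)+" % re.escape(unit.upper()))
    best_start, best_count = -1, 0
    for match in pattern.finditer(sequence.upper()):
        count = (match.end() - match.start()) // len(unit)
        if count > best_count:
            best_start, best_count = match.start(), count
    return best_start, best_count


# Toy consensus with a (GT)3 run followed by a (GT)5 run.
start, repeats = longest_repeat_run("AAGTGTGTCCGTGTGTGTGTAA")
```

Applying the same function to the reference-based consensus and to the merged consensus would expose a difference in repeat number of the kind discussed for the figure.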
`
Fig. 6 shows a direct comparison between the reference sequence assembly and the de novo assembly of the AVPR1A gene. Reads were aligned to the reference and a de novo assembly was performed. The consensus generated from the reference was then aligned against the de novo contig using ClustalW. Shown is a difference in GT repeats: the reference-based consensus is biased towards the reference, whereas the de novo contig displays a different repeat content.
`
`DETAILED DESCRIPTION OF EMBODIMENTS
`
`The inventors have developed means and methods, which allow the assembly
`
`of nucleic acid sequence data comprising nucleic acid fragment reads into contiguous
`
`nucleotide sequence segments.
`
`Although the present invention will be described with respect to particular
`
`embodiments, this description is not to be construed in a limiting sense.
`
`Before describing in detail exemplary embodiments of the present invention,
`
`definitions important for understanding the present invention are given.
`
As used in this specification and in the appended claims, the singular forms of "a" and "an" also include the respective plurals unless the context clearly dictates otherwise.

In the context of the present invention, the terms "about" and "approximately" denote an interval of accuracy that a person skilled in the art will understand to still ensure the technical effect of the feature in question. The term typically indicates a deviation from the indicated numerical value of ±20 %, preferably ±15 %, more preferably ±10 %, and even more preferably ±5 %.
`
It is to be understood that the term "comprising" is not limiting. For the purposes of the present invention the term "consisting of" is considered to be a preferred embodiment of the term "comprising of". If hereinafter a group is defined to comprise at least a certain number of embodiments, this is meant to also encompass a group which preferably consists of these embodiments only.
`
`
Furthermore, the terms "first", "second", "third" or "(a)", "(b)", "(c)", "(d)" etc. and the like in the description and in the claims are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein.

In case the terms "first", "second", "third" or "(a)", "(b)", "(c)", "(d)" etc. relate to steps of a method or use, there is no time or time interval coherence between the steps, i.e. the steps may be carried out simultaneously or there may be time intervals of seconds, minutes, hours, days, weeks, months or even years between such steps, unless otherwise indicated in the application as set forth herein above or below.
`
`It is to be understood that this invention is not limited to the particular
`
`methodology, protocols, reagents etc. described herein as these may vary. It is also to be
`
`understood that the terminology used herein is for the purpose of describing particular
`
`embodiments only, and is not intended to limit the scope of the present invention that will be
`
`limited only by the appended claims. Unless defined otherwise, all technical and scientific
`
`terms used herein have the same meanings as commonly understood by one of ordinary skill
`
`in the art.
`
As has been set out above, the present invention concerns in one aspect a method for assembly of nucleic acid sequence data comprising nucleic acid fragment reads into (a) contiguous nucleotide sequence segment(s), comprising the steps of:

(a) obtaining a plurality of nucleic acid sequence data from a plurality of nucleic acid fragment reads;

(b) aligning said plurality of nucleic acid sequence data to a reference sequence;

(c) detecting one or more gaps or regions of non-assembly, or non-matching with the reference sequence in the alignment output of step (b);

(d) performing de novo sequence assembly of nucleic acid sequence data mapping to said gaps or regions of non-assembly; and

(e) combining the alignment output of step (b) and the assembly output of step (d) in order to obtain (a) contiguous nucleotide sequence segment(s).
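
The interplay of the steps can be sketched on toy data as follows. The sketch makes strong simplifying assumptions that are not part of the method as described: reads are placed by exact string matching instead of a real aligner, a gap is any uncovered stretch of at least five bases, and the contig for the gap is supplied as a ready-made string standing in for the output of step (d). All function names, sequences and cut-offs are hypothetical.

```python
from typing import List, Optional, Tuple


def align_exact(reads: List[str], reference: str) -> List[Tuple[str, Optional[int]]]:
    """Step (b), simplified: place each read at its first exact match, else None."""
    placed = []
    for read in reads:
        pos = reference.find(read)
        placed.append((read, pos if pos != -1 else None))
    return placed


def find_gaps(placed: List[Tuple[str, Optional[int]]], ref_len: int,
              min_length: int = 5) -> List[Tuple[int, int]]:
    """Step (c), simplified: half-open intervals not covered by any aligned read."""
    covered = [False] * ref_len
    for read, pos in placed:
        if pos is not None:
            for i in range(pos, pos + len(read)):
                covered[i] = True
    gaps, start = [], None
    for pos, is_covered in enumerate(covered + [True]):  # sentinel closes a trailing gap
        if not is_covered and start is None:
            start = pos
        elif is_covered and start is not None:
            if pos - start >= min_length:
                gaps.append((start, pos))
            start = None
    return gaps


def splice_contig(reference: str, gap: Tuple[int, int], contig: str,
                  anchor: int = 4) -> str:
    """Step (e), simplified: replace the gap with the contig segment lying
    between the reference bases that flank the gap on both sides."""
    start, end = gap
    if start < anchor or end + anchor > len(reference):
        return reference  # not enough flanking sequence to anchor the contig
    left = reference[start - anchor:start]
    right = reference[end:end + anchor]
    i = contig.find(left)
    j = contig.find(right, i + anchor) if i != -1 else -1
    if i == -1 or j == -1:
        return reference  # contig does not bridge the gap; keep the reference
    return reference[:start] + contig[i + anchor:j] + reference[end:]


reference = "ACGTTGCAAGGCTTAACCGGTCAT"
reads = ["ACGTTGCA", "AACCGGTC", "CCGGTCAT", "TGCACTAG", "AGGATCAA"]  # step (a)
placed = align_exact(reads, reference)
unmapped = [read for read, pos in placed if pos is None]  # input to step (d)
gaps = find_gaps(placed, len(reference))
contig = "TGCACTAGGATCAACC"  # stands in for the output of step (d)
consensus = splice_contig(reference, gaps[0], contig)
```

In this toy case the two unmapped reads come from a region in which the sample differs from the reference, and the spliced consensus carries the sample's sequence in that region while the flanks are still taken from the reference alignment.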
`
The term "assembly" of nucleic acid sequence data as used herein refers to the arrangement of singularly or independently provided sequence data into a contiguous nucleotide sequence segment. The term "contiguous nucleotide sequence segment(s)" as used herein refers to the output of the presently claimed method being a coherent, non-redundant and preferably error-free or substantially error-free sequence context. A "sequence segment" as used herein may be any stretch comprising the information content of more than about 50 reads. Preferably, a sequence segment may be an entire genome, an entire chromosome, a chromosome arm, one or more sub-portions of a chromosome, a conjunction of interrelated sequence, e.g. exomes, transcriptome-related sequences, a conjunction of open reading frames, introns, transposon sequences, repeats, regulome-related sequences such as transcription factor binding sites, methylation binding protein sites, specific regions with a higher probability of histone 3 lysine 4 mono-, di- and tri-methylation etc. A "nucleic acid fragment read" as used herein refers to a single, short contiguous information piece or stretch of sequence data. A read may have any suitable length, preferably a length of between about 30 nucleotides and about 1000 nucleotides. The length generally depends on the sequencing technology used for obtaining it. In specific embodiments, the reads may also be longer, e.g. 2 to 10 kb or more. The present invention generally envisages any read or read length and is not to be understood as being limited to the presently available read lengths, but also includes further developments in this area, e.g. the development of long read sequencing approaches etc.
`
In a first step of the method, a plurality of nucleic acid sequence data from a plurality of nucleic acid fragment reads may be obtained. "Nucleic acid sequence data" as used herein may be any sequence information on nucleic acid molecules known to the skilled person. The sequence data preferably includes information on DNA or RNA sequences, modified nucleic acids, single strand or duplex sequences, or alternatively amino acid sequences, which have to be converted into nucleic acid sequences. The sequence data may additionally comprise information on the sequencing machine, date of acquisition, read length, direction of sequencing, origin of the sequenced entity, neighbouring sequences or reads, presence of repeats or any other suitable parameter known to the person skilled in the art. The sequence data may be presented in any suitable format, archive, coding or document known to the person skilled in the art. The data may, for example, be in the format of FASTQ, Qseq, CSFASTA, BED, WIG, EMBL, Phred, GFF, SAM, SRF, SFF or ABI-ABIF, as depicted and further explained in the following Table 1:
`
`
Table 1:

FASTQ
  Developed by / used by: Sanger Institute; Illumina, Sanger
  Representation: Text based, for sequence and quality score
  Remarks: Simple (FASTA-like); de facto standard for many sequencing instruments, like fastq-Sanger, fastq-Solexa, fastq-illumina; encodes Phred quality score using ASCII

Qseq
  Developed by / used by: Illumina; Illumina
  Extension: .qseq
  Representation: Sequence and quality score
  Remarks: A single file will be created for each lane

CSFASTA
  Developed by / used by: ABI; ABI
  Extension: .csfasta
  Representation: Color-space sequence reads
  Remarks: Conversion from color-space to base-space leads to error propagation

BED
  Developed by / used by: UCSC Genome Bioinformatics Group; genome browsers
  Representation: Text based
  Remarks: Better visualization and alignment; flexible way to define the data lines that are displayed in an annotation track

WIG
  Developed by / used by: UCSC Genome Bioinformatics Group; genome browsers
  Representation: Genome browser track format
  Remarks: Accepted by most genomic browsers; display of continuous-valued data in a track format

EMBL
  Developed by / used by: European Molecular Biology Laboratory; EMBL, GenBank databases
  Representation: Several sequences in a single file
  Remarks: Represents database records for nucleotide and peptide sequences from EMBL databases; meta information can be optimally stored

Phred
  Developed by / used by: Research output; all sequencing projects
  Representation: Stores serialized chromatogram data
  Remarks: Widely used in storing quality scores for bases

GFF
  Developed by / used by: Sanger Institute; all sequencing projects
  Representation: Genomic feature in a text file
  Remarks: Data exchange and representation of genomic data

SAM
  Developed by / used by: Collaborative result of several major genome centres; developed using an open process
  Representation: Generic nucleotide alignment format
  Remarks: Supports longer reads and alignment with more than one indel; used by the 1000 Genomes Project Committee

SRF
  Developed by / used by: All sequencing technologies
  Representation: Generic binary format for DNA sequence data
  Remarks: Simple, compact in size; format flexible to store data from different DNA sequencing technologies

SFF
  Developed by / used by: 454 (Roche); 454 FLX
  Representation: Binary container file to encode results from 454 FLX
  Remarks: Can store one or more than one reads from the 454 Life Sciences platform

ABI-ABIF
  Developed by / used by: ABI; SOLiD (ABI)
  Extension: .ab1, .fsa
  Representation: Binary chromatogram files
  Remarks: Accommodates heterogeneous data; stores data directory-wise
`
Preferably, the data or data sets are present in one data format, more preferably in a unified data format, e.g. in the fastq format, along with their base quality either in Phred/Phrap or modified format. It is further preferred that the data format at least covers the sequence read and its associated base quality.
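
To show what "sequence read and its associated base quality" means in practice for the fastq format, the sketch below parses four-line FASTQ records and decodes the quality string into Phred scores. It assumes the Sanger convention, in which each quality character is the Phred score plus an ASCII offset of 33, and it assumes unwrapped records of exactly four lines; the function name and the example record are hypothetical.

```python
from typing import List, Tuple


def parse_fastq(text: str, offset: int = 33) -> List[Tuple[str, str, List[int]]]:
    """Parse four-line FASTQ records into (name, sequence, phred_scores).
    offset=33 corresponds to the Sanger quality encoding."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must consist of four-line records")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record starting at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for %s" % header)
        scores = [ord(ch) - offset for ch in quality]
        records.append((header[1:], sequence, scores))
    return records


# One hypothetical read of four bases.
records = parse_fastq("@read1\nACGT\n+\nII5!\n")
```

A unified in-memory representation of this kind keeps exactly the two items the paragraph above asks for, the read and its per-base quality, regardless of the instrument the data came from.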
`
In a particularly preferred embodiment of the present invention, the plurality of sequence data may be converted into a unified format. Such a conversion may be carried out by any suitable conversion tool known to the person skilled in the art, for example standard conversion tools which are capable of converting an Illumina format into a Sanger format, which may be used by several alignment algorithms, or any other comparable tool capable of converting a format indicated in Table 1 into another format indicated in Table 1 or known to the person skilled in the art. The conversion may be performed such that at least a minimum amount of essential data is kept. Such a minimum amount of data may comprise, for example, the sequence itself, the run information, paired end library information, mate pair library information, single end library information, and the base QC value. The preferred format into which the sequence data may be converted is any suitable format which is recognized by reference sequence alignment algorithms as well as by de novo assembly algorithms. A preferred example is the fastq format. Alternatively, the sequence data may also be converted into the csfasta/SCARF format. The present invention further envisages any further, e.g. newly defined or developed, format that can be used by both reference sequence alignment and de novo assembly procedures.
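
One concrete instance of an Illumina-to-Sanger conversion is the re-encoding of the quality string. The sketch below assumes Illumina pipeline 1.3 to 1.7 output, in which quality characters are Phred scores with an ASCII offset of 64, and rewrites them with the Sanger offset of 33. Older Solexa files use a different score scale and are not handled; the function name is hypothetical.

```python
def illumina_to_sanger(quality: str) -> str:
    """Convert a Phred+64 quality string (assumed Illumina 1.3 to 1.7)
    into the Phred+33 encoding used by Sanger-style fastq files."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the Illumina offset
        if score < 0:
            raise ValueError("character %r is not valid Phred+64 quality" % ch)
        converted.append(chr(score + 33))  # re-encode with the Sanger offset
    return "".join(converted)


# Phred scores 40, 40, 20 and 2 in Illumina encoding.
sanger_quality = illumina_to_sanger("hhTB")
```

The sequence line and the read name are left untouched by such a conversion, so the minimum amount of essential data named above is preserved.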
`
The data may comprise single entries or multiple entries within one data set. The data may also include one or more data sets, or a plurality of data sets. The term "plurality" as used herein accordingly refers to one or more data sets coming from one or more origins or sources. The data sets or data may, for example, have the same format and/or come from the same origin, e.g. the same sequencing machine, the same patient or subject, or have been obtained with the same sequencing technology, or they may have different formats and/or come from different origins such as different sequencing machines or different patients or subjects, or have been obtained with different sequencing technologies.
`
The term "obtaining sequence data from a plurality of nucleic acid fragment reads" as used herein refers to the process of determining the sequence information of a subject, or a group of subjects, by the performance of nucleic acid sequencing reactions. The present invention in one alternative embodiment uses previously obtained sequence data, e.g. derivable from databases, external sequencing projects, laboratories, archives etc. In another alternative embodiment the present invention also envisages the step of obtaining the sequence data as an integral part of method step (a).
`
Methods for sequence determination are generally known to the person skilled in the art. Preferred are next generation sequencing methods or high throughput sequencing methods. For example, a subject's, group of subjects', or population's genomic sequence may be obtained by using Massively Parallel Signature Sequencing (MPSS). An example of an envisaged sequence method is pyrosequencing, in particular 454 pyrosequencing, e.g. based on the Roche 454 Genome Sequencer. This method amplifies DNA inside water droplets in an oil solution, with each droplet containing a single DNA template attached to a single primer-coated bead that then forms a clonal colony. Pyrosequencing uses luciferase to generate light for detection of the individual nucleotides added to the nascent DNA, and the combined data are used to generate sequence read-outs. Yet another envisaged example is Illumina or Solexa sequencing, e.g. by using the Illumina Genome Analyzer technology, which is based on reversible dye-terminators. DNA molecules are typically attached to primers on a slide and amplified so that local clonal colonies are formed. Subsequently one type of nucleotide at a time may be added, and non-incorporated nucleotides are washed away. Subsequently, images of the fluorescently labeled nucleotides may be taken and the dye is chemically removed from the DNA, allowing a next cycle. Yet another example is the use of Applied Biosystems' SOLiD technology, which employs sequencing by ligation. This method is based on the use of a pool of all possible oligonucleotides of a fixed length, which are labe
