(19) World Intellectual Property Organization
     International Bureau

(43) International Publication Date: 13 December 2012 (13.12.2012)
(10) International Publication Number: WO 2012/168815 A2

(51) International Patent Classification: G06F 19/00 (2011.01)
(21) International Application Number: PCT/IB2012/052613
(22) International Filing Date: 24 May 2012 (24.05.2012)
(25) Filing Language: English
(26) Publication Language: English
(30) Priority Data: 61/493,541, 6 June 2011 (06.06.2011), US
(71) Applicant (for all designated States except US): KONINKLIJKE PHILIPS ELECTRONICS N.V.
     [NL/NL]; Groenewoudseweg 1, NL-5621 BA Eindhoven (NL).
(72) Inventors; and
(75) Inventors/Applicants (for US only): KUMAR, Sunil [IN/IN]; c/o High Tech Campus,
     Building 44, NL-5656 AE Eindhoven (NL). SINGH, Randeep [IN/IN]; c/o High Tech Campus,
     Building 44, NL-5656 AE Eindhoven (NL). DIMITROVA, Nevenka [US/US]; c/o High Tech
     Campus, Building 44, NL-5656 AE Eindhoven (NL).
(74) Agents: VAN VELZEN, Maaike et al.; c/o High Tech Campus, Building 44, NL-5656 AE
     Eindhoven (NL).
(81) Designated States (unless otherwise indicated, for every kind of national protection
     available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BR, BW, BY, BZ, CA, CH, CL,
     CN, CO, CR, CU, CZ, DE, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN,
     HR, HU, ID, IL, IN, IS, JP, KE, KG, KM, KN, KP, KR, KZ, LA, LC, LK, LR, LS, LT, LU, LY,
     MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PE, PG, PH, PL, PT, QA,
     RO, RS, RU, RW, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA,
     UG, US, UZ, VC, VN, ZA, ZM, ZW.
`
(84) Designated States (unless otherwise indicated, for every kind of regional protection
     available): ARIPO (BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, SZ, TZ, UG, ZM, ZW),
     Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK,
     EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO,
     RS, SE, SI, SK, SM, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN,
     TD, TG).
`
`(54) Title: METHOD FOR ASSEMBLY OF NUCLEIC ACID SEQUENCE DATA
`
`
(57) Abstract: The present invention relates to a method for assembly of nucleic acid
sequence data comprising nucleic acid fragment reads into (a) contiguous nucleotide sequence
segment(s), comprising the steps of: (a) obtaining a plurality of nucleic acid sequence data
from a plurality of nucleic acid fragment reads; (b) aligning said plurality of nucleic acid
sequence data to a reference sequence; (c) detecting one or more gaps or regions of
non-assembly, or non-matching with the reference sequence in the alignment output of step
(b); (d) performing de novo sequence assembly of nucleic acid sequence data mapping to said
gaps or regions of non-assembly; and (e) combining the alignment output of step (b) and the
assembly output of step (d) in order to obtain (a) contiguous nucleotide sequence
segment(s). The present invention further relates to a method wherein the detection of gaps
or regions of non-assembly is performed by implementing a base quality, coverage, complexity
of the surrounding region, or length of mismatch filter or threshold. Also envisaged is the
masking out of nucleic acid sequence data relating to known polymorphisms, disease related
mutations or modifications, repeats, low-mappability regions, CpG islands, or regions with
certain biophysical features. In addition, a corresponding program element or computer
program for assembly of nucleic acid sequence data and a sequence assembly system for
transforming nucleic acid sequence data comprising nucleic acid fragment reads into (a)
contiguous nucleotide sequence segment(s) is provided.
`
[FIG. 4 (representative drawing): process chart. Recoverable labels: Raw reads; Reference
sequence; RefSeq alignment; Polymorphic landmark can be provided with RefSeq; Highly
repetitive region; YES; Discard; Perform de novo assembly; Check QC and coverage to fill the
gap; Extract avg. coverage from RefSeq and use as cut-off; Consensus assembly.]
Declarations under Rule 4.17:

- as to applicant's entitlement to apply for and be granted a patent (Rule 4.17(ii))
- as to the applicant's entitlement to claim the priority of the earlier application
  (Rule 4.17(iii))

Published:

- without international search report and to be republished upon receipt of that report
  (Rule 48.2(g))
`
`
`
`WO 2012/168815
`
`PCT/IB2012/052613
`
`METHOD FOR ASSEMBLY OF NUCLEIC ACID SEQUENCE DATA
`
`FIELD OF THE INVENTION
`
The present invention relates to a method for assembly of nucleic acid sequence data
comprising nucleic acid fragment reads into (a) contiguous nucleotide sequence segment(s),
comprising the steps of: (a) obtaining a plurality of nucleic acid sequence data from a
plurality of nucleic acid fragment reads; (b) aligning said plurality of nucleic acid
sequence data to a reference sequence; (c) detecting one or more gaps or regions of
non-assembly, or non-matching with the reference sequence in the alignment output of step
(b); (d) performing de novo sequence assembly of nucleic acid sequence data mapping to said
gaps or regions of non-assembly; and (e) combining the alignment output of step (b) and the
assembly output of step (d) in order to obtain (a) contiguous nucleotide sequence
segment(s). The present invention further relates to a method wherein the detection of gaps
or regions of non-assembly is performed by implementing a base quality, coverage, complexity
of the surrounding region, or length of mismatch filter or threshold. Also envisaged is the
masking out of nucleic acid sequence data relating to known polymorphisms, disease related
mutations or modifications, repeats, low-mappability regions, CpG islands, or regions with
certain biophysical features. In addition, a corresponding program element or computer
program for assembly of nucleic acid sequence data and a sequence assembly system for
transforming nucleic acid sequence data comprising nucleic acid fragment reads into (a)
contiguous nucleotide sequence segment(s) is provided.
`
`BACKGROUND OF THE INVENTION
`
With the introduction of next generation or ultra-high-throughput sequencing techniques the
amount of sequence data has increased enormously, while the costs for obtaining sequence
information and the time needed for the provision of this information have been dramatically
reduced and will be further decreased in the future. Research as well as clinical
applications of next generation sequencing approaches will have an impact on transcriptome
analysis and gene annotation, allow RNA splice identification, SNP discovery or genome
methylation analysis, and provide a way to identify the etiology of diseases and to screen
for genomic patterns on a personal basis.
`
Next generation sequencing (NGS) is currently based on only a handful of platforms including
the Roche/454, the Illumina/Solexa and the ABI SOLiD systems. The underlying technology
relies on a template amplification step before the sequencing starts. In consequence, the
read length is shortened in comparison to the traditional Sanger-based technology: whereas
the dideoxy terminator approach provided read lengths of 650 to 800 bp, NGS approaches have
read lengths of 35 to 400 bp (Bao et al., Journal of Human Genetics, 28 April 2011, p. 1-9).
Furthermore, the raw data obtained from the NGS platforms is not standardized and shows
differences in read lengths, error profiles, matching thresholds etc. Thus, the
implementation of NGS approaches entails an increase in the amount and complexity of
sequence information.
`
`However, the output of NGS sequence machines is essentially worthless by
`
`itself, since the sequence reads only become meaningful upon a reconstruction of the
`
`underlying contiguous genomic sequence. Furthermore, for routine uses of NGS, e.g. in
`
`clinical setups, a high sequence accuracy and an expedient way to select genomic subsets of
`
`interest are of importance. Upon a higher integration of genome sequencing into the practice
`
`of medical counseling, there will be an increased responsibility of geneticists to ensure that
`
`the information obtained is in fact true and represents the original genome of the individual.
`
`There is, thus, a need for a method allowing the accurate and timesaving
`
`alignment and assembly of nucleic acid sequence data as derivable from NGS approaches.
`
`SUMMARY OF THE INVENTION
`
`The present invention addresses this need and provides means and methods,
`
`which allow the assembly of nucleic acid sequence data comprising nucleic acid fragment
`
`reads into contiguous nucleotide sequence segments. The above objective is in particular
`
`accomplished by a method comprising the steps of:
`
(a) obtaining a plurality of nucleic acid sequence data from a plurality of nucleic acid
    fragment reads;

(b) aligning said plurality of nucleic acid sequence data to a reference sequence;

(c) detecting one or more gaps or regions of non-assembly, or non-matching with the
    reference sequence in the alignment output of step (b);

(d) performing de novo sequence assembly of nucleic acid sequence data mapping to said gaps
    or regions of non-assembly; and

(e) combining the alignment output of step (b) and the assembly output of step (d) in order
    to obtain (a) contiguous nucleotide sequence segment(s).
`
`
`
`
This method provides the advantage that a bias, which is typically generated when a
reference sequence alignment is performed, can be overcome by using de novo assembly steps.
Furthermore, typical problems associated with the filling of the gaps that are created
during reference sequence alignment, with the detection of polymorphism lengths and in
particular with the fitting of un-aligned sequence into the consensus assembly may be solved
when closing these information gaps or breaks via de novo assembly. At the same time,
annotation problems known from de novo assembly approaches can be mitigated by basing parts
of the analysis on a reference sequence. The method accordingly starts with a reference
sequence alignment and, when it finds a gap or region of non-assembly, switches to de novo
assembly, e.g. until it again detects the reference alignment. This creates a consensus
assembly or contiguous nucleotide sequence segments with a significantly increased sequence
accuracy. In fact, the accordingly assembled sequence represents individual genomes rather
than reference genomes and avoids reference sequence associated bias problems. The presently
described method is accordingly assumed to have huge implications, inter alia in medical
genetics where it may help in determining the genetic basis of complex genetic disorders.
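
By way of illustration only, steps (b) to (e) can be sketched on toy data. The sketch below
is not the claimed implementation: it assumes error-free reads, exact substring matching in
place of a real aligner such as BWA, a greedy overlap merge in place of a real assembler
such as Velvet, and a single gap. All function names and sequences are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def align_exact(reads, reference):
    """Step (b), toy version: place each read where it matches the reference exactly."""
    placed, unplaced = [], []
    for read in reads:
        pos = reference.find(read)
        if pos >= 0:
            placed.append((pos, read))
        else:
            unplaced.append(read)
    return placed, unplaced


def uncovered_regions(placed, ref_len):
    """Step (c): half-open (start, end) intervals of the reference with no read coverage."""
    covered = [False] * ref_len
    for pos, read in placed:
        for i in range(pos, pos + len(read)):
            covered[i] = True
    gaps, start = [], None
    for i, is_covered in enumerate(covered + [True]):  # sentinel closes a trailing gap
        if not is_covered and start is None:
            start = i
        elif is_covered and start is not None:
            gaps.append((start, i))
            start = None
    return gaps


def greedy_assemble(reads, min_len):
    """Step (d), toy version: repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_i is None:
            break  # no remaining pair overlaps by at least min_len
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)] + [merged]
    return contigs


def hybrid_assemble(reads, reference, min_len=4):
    """Steps (b) to (e) for a single gap: stitch the de novo contig between the aligned flanks."""
    placed, unplaced = align_exact(reads, reference)
    gaps = uncovered_regions(placed, len(reference))
    if not gaps or not unplaced:
        return reference  # toy simplification: covered bases equal the reference
    contig = greedy_assemble(unplaced, min_len)[0]
    start, end = gaps[0]
    left, right = reference[:start], reference[end:]
    merged = left + contig[overlap(left, contig, min_len):]    # step (e), left junction
    return merged + right[overlap(merged, right, min_len):]    # step (e), right junction


# Hypothetical data: the sample carries a longer GT-rich stretch than the reference.
reference = "GATTACAGGCTTCATCGGATCC"
sample = "GATTACAGGCTGTGTGTTCATCGGATCC"
reads = ["GATTACAGGC", "AGGCTGTG", "TGTGTGTTCA", "GTTCATCG", "CATCGGATCC"]
consensus = hybrid_assemble(reads, reference)
```

In this toy case the two flanking reads align to the reference, the three reads spanning the
variant region do not, and the stitched consensus reproduces the sample rather than the
reference.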
`
In a preferred embodiment of the present invention, the above mentioned plurality of nucleic
acid sequence data is converted to a unified format.
`
`In another preferred embodiment of the present invention the detection of step
`
`(c) as mentioned herein above is performed by implementing a filter or threshold.
`
`In further preferred embodiments, said filter or threshold is a base quality,
`
`coverage, complexity of the surrounding region or length of mismatch filter or threshold.
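
A minimal sketch of such a filter, assuming per-position coverage and mean base quality have
already been computed from the alignment output. The function name and the cut-off values
(coverage below 5, mean Phred quality below 20) are illustrative assumptions, not values
taken from this description.

```python
def flag_low_confidence(coverage, mean_quality, min_coverage=5, min_quality=20):
    """Return reference positions that fail the coverage or the base quality threshold."""
    return [
        position
        for position, (depth, quality) in enumerate(zip(coverage, mean_quality))
        if depth < min_coverage or quality < min_quality
    ]


# Hypothetical per-position values along a short stretch of the reference.
coverage = [12, 9, 3, 0, 0, 7, 15]
mean_quality = [35, 32, 30, 0, 0, 18, 36]
flagged = flag_low_confidence(coverage, mean_quality)
```

Positions 2 to 4 fail on coverage and position 5 fails on quality, so all four would be
handed to the de novo step in this example.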
`
In another preferred embodiment of the present invention, prior to the above mentioned
aligning step (b) a masking out of nucleic acid sequence data relating to known
polymorphisms, highly variable regions, disease related mutations or modifications, repeats,
low-mappability regions, CpG islands, or regions with specific biophysical features is
performed.
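
The masking step can be sketched as follows, assuming the regions to be masked are supplied
as half-open, non-overlapping (start, end) coordinates and that "N" is used as the masking
character. Both conventions and the function name are assumptions made for illustration.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Replace each half-open (start, end) region with mask_char and return the
    masked sequence together with the sequence segments that were masked out."""
    masked = list(sequence)
    extracted = []
    for start, end in regions:
        extracted.append(sequence[start:end])
        masked[start:end] = mask_char * (end - start)
    return "".join(masked), extracted


# Hypothetical reference and region coordinates.
masked_reference, masked_out = mask_regions("ACGTACGTACGT", [(2, 5), (8, 10)])
```

Returning the masked-out segments alongside the masked sequence keeps them available for a
separate de novo assembly.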
`
In a particularly preferred embodiment said masked out nucleic acid sequence data is
subjected to the de novo sequence assembly of step (d) as mentioned herein above.
`
In another preferred embodiment of the present invention the above defined step (b) is
carried out with a reference alignment algorithm. In a particularly preferred embodiment
said reference alignment algorithm is BFAST, ELAND, GenomeMapper, GMAP, MAQ, MOSAIK, PASS,
SeqMap, SHRiMP, SOAP, SSAHA, or CLD. Even more preferred is Bowtie or BWA.

In yet another preferred embodiment of the present invention, the above defined step (d) is
carried out with a de novo assembly algorithm. In a particularly preferred embodiment said
de novo assembly algorithm is AAPATHS, Edena, EULER-SR, MIRA2, SEQAN, SHARCGS, SSAKE,
SOAPdenovo, or VCAKE. Even more preferred is ABySS or Velvet.
`
`In a further preferred embodiment the herein above mentioned reference
`
`sequence is an essentially complete prokaryotic, eukaryotic or viral genome sequence, or a
`
`sub-portion thereof. In a particularly preferred embodiment of the present invention said
`
`reference sequence is a human genome sequence, an animal genome sequence, a plant
`
`genome sequence, a bacterial genome sequence, or a sub-portion thereof.
`
`In a further preferred embodiment of the present invention said reference
`
`sequence is selected from a group or taxon, which is phylogenetically related to the organism,
`
`whose nucleic acid sequence data is to be assembled.
`
`In yet another preferred embodiment of the present invention said reference
`
`sequence is a genomic sub-portion having regulatory potential selected from the group
`
`comprising exon sequences, promoter sequences, enhancer sequences, transcription factor
`
binding sites, or any grouping or sub-grouping thereof.
`
In a further preferred embodiment said reference sequence is a virtual sequence based on
sequence composition parameters, or based on biophysical nucleic acid properties. In a
particularly preferred embodiment of the present invention said composition parameter is the
presence of monomers, dimers and/or trimers. In a further preferred embodiment of the
present invention said biophysical nucleic acid property is the stacking energy, the
presence of propeller twist, the bendability of the nucleic acid, duplex stability, the
amount of disrupt energy, the amount of free energy, the presence of DNA denaturation or DNA
bending stiffness.
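
As an illustration of a sequence composition parameter, the occurrence of monomers, dimers
and trimers can be counted with a sliding window. This is a sketch under the assumption
that overlapping windows are counted; how such counts would be turned into a virtual
reference sequence is not specified here, and the function name is hypothetical.

```python
from collections import Counter


def composition(sequence, k):
    """Count all overlapping substrings of length k (k=1 monomers, k=2 dimers, k=3 trimers)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical sequence.
monomers = composition("ACGTAC", 1)
dimers = composition("ACGTAC", 2)
trimers = composition("ACGTAC", 3)
```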
`
`In a further aspect the present invention relates to a program element or
`
`computer program for assembly of nucleic acid sequence data comprising nucleic acid
`
`fragment reads into contiguous nucleotide sequence segments, which when being executed
`
`by a processor is adapted to carry out the steps of a method as defined herein above.
`
In yet another aspect the present invention relates to a sequence assembly system for
transforming nucleic acid sequence data comprising nucleic acid fragment reads into (a)
contiguous nucleotide sequence segment(s), comprising a computer processor, memory, and (a)
data storage device(s), the memory having programming instructions to execute a program
element or computer program as defined herein above.
`
`
`
`
`In a preferred embodiment of the present invention said sequence assembly
`
`system is associated or connected to a sequencer device. In a further preferred embodiment
`
`said sequence assembly system is a medical decision support system. In a particularly
`
`preferred embodiment said medical decision support system is a diagnostic decision support
`
`system.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`
Fig. 1 provides an overview of reference and de novo sequence assembly and alignment
procedures. Reference sequence alignment and assembly shows mapping of reads to the
reference sequences. De novo assembly shows the generation of contigs using the ABySS
algorithm, based on an excerpt from an ABySS-Explorer view, where edges represent contigs
and nodes represent common k-1-mers between adjacent contigs. The labels correspond to SET
contig IDs. Contig lengths and coverage are indicated by the length and the thickness of the
edges, respectively. Arrows and edge arc shape indicate the direction of contigs, and the
polarity of the nodes distinguishes reverse complements of common k-1-mers between adjacent
contigs.
`
Fig. 2 shows examples of different sequence file formats. Depicted are the qseq format
(sequence read output from an Illumina instrument, which contains machine, run and quality
information), the fastq format (Illumina read name, sequence and quality, derived from the
qseq file) and the SAM format (Sequence Alignment/Map), which is the output of the BWA
aligner. The SAM format allows read alignment information against a reference to be stored.
`
Fig. 3 depicts an overview of the alignment and assembly steps according to the present
invention. It shows the overall method of combining reference alignment and de novo
assembly. Initially the reads are aligned to a reference sequence. Wherever a gap (of
user-defined size, e.g. >10 bases) of N/A/T/G/C is identified where the reads do not match
the reference in continuation with the previous read in an overlapping fashion, the de novo
assembly is started. De novo contig formation continues until the next read matching the
reference is identified. This de novo contig is then merged with the intermediate consensus
to give the final consensus sequence.
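
The gap criterion described for Fig. 3 can be sketched as a scan over per-position coverage,
assuming a gap is a run of consecutive zero-coverage positions longer than a user-defined
size (10 bases here, following the example above). The function name and data are
hypothetical.

```python
from itertools import groupby


def find_gaps(coverage, min_gap=10):
    """Return half-open (start, end) runs of zero coverage that are longer than min_gap."""
    gaps = []
    position = 0
    for is_uncovered, run in groupby(coverage, key=lambda depth: depth == 0):
        length = sum(1 for _ in run)
        if is_uncovered and length > min_gap:
            gaps.append((position, position + length))
        position += length
    return gaps


# Hypothetical coverage profile: a 4-base dropout and a 15-base gap.
coverage_profile = [8] * 20 + [0] * 4 + [6] * 10 + [0] * 15 + [9] * 5
long_gaps = find_gaps(coverage_profile)
```

With the default cut-off only the 15-base run triggers de novo assembly; lowering min_gap
would also report the short dropout.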
`
`Fig. 4 shows a process chart of method steps of a combination of reference
`
`sequence alignment and de novo assembly according to the present invention.
`
Fig. 5 depicts the determination of the exact length of the GT polymorphism in the AVPR1A
gene using a combination of reference alignment and de novo assembly following the method
according to the present invention. First, reads were aligned to the reference genome to
extract the AVPR1A gene for the analyzed sample. As RS3 is a highly polymorphic site and is
associated with a clinical phenotype, a de novo assembly of the reads falling in this
chromosome was carried out and contigs were subsequently generated. After obtaining the
contigs, a relaxed sequence alignment (allowing mismatches and gaps) was performed to merge
the de novo contig with the reference consensus. The obtained consensus sequence showed the
true polymorphic repeat for the analyzed sample.
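
To illustrate how a repeat length could be read off a consensus sequence, the sketch below
counts the copies in the longest uninterrupted GT run. The sequences are invented for the
example and are not AVPR1A data; the function name is hypothetical.

```python
import re


def longest_tandem_repeat(sequence, unit="GT"):
    """Number of copies of unit in the longest uninterrupted tandem run (0 if absent)."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = (match.group(0) for match in pattern.finditer(sequence))
    return max((len(run) // len(unit) for run in runs), default=0)


# Invented sequences: a reference-biased consensus and a de novo contig.
reference_consensus = "ACC" + "GT" * 5 + "CA"
de_novo_contig = "ACC" + "GT" * 8 + "CA"
reference_copies = longest_tandem_repeat(reference_consensus)
de_novo_copies = longest_tandem_repeat(de_novo_contig)
```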
`
Fig. 6 shows a direct comparison between the reference sequence assembly and the de novo
assembly of the AVPR1A gene. Reads were aligned to the reference and de novo assembly was
performed. The consensus generated from the reference was then aligned against the de novo
contig using ClustalW. Shown is a difference in GT repeats: the reference-based consensus is
biased towards the reference, whereas the de novo contig displays a different repeat
content.
`
`DETAILED DESCRIPTION OF EMBODIMENTS
`
`The inventors have developed means and methods, which allow the assembly
`
`of nucleic acid sequence data comprising nucleic acid fragment reads into contiguous
`
`nucleotide sequence segments.
`
`Although the present invention will be described with respect to particular
`
`embodiments, this description is not to be construed in a limiting sense.
`
`Before describing in detail exemplary embodiments of the present invention,
`
`definitions important for understanding the present invention are given.
`
As used in this specification and in the appended claims, the singular forms of "a" and "an"
also include the respective plurals unless the context clearly dictates otherwise.

In the context of the present invention, the terms "about" and "approximately" denote an
interval of accuracy that a person skilled in the art will understand to still ensure the
technical effect of the feature in question. The term typically indicates a deviation from
the indicated numerical value of ±20 %, preferably ±15 %, more preferably ±10 %, and even
more preferably ±5 %.
`
It is to be understood that the term "comprising" is not limiting. For the purposes of the
present invention the term "consisting of" is considered to be a preferred embodiment of the
term "comprising of". If hereinafter a group is defined to comprise at least a certain
number of embodiments, this is meant to also encompass a group which preferably consists of
these embodiments only.
`
`
Furthermore, the terms "first", "second", "third" or "(a)", "(b)", "(c)", "(d)" etc.
`
`and the like in the description and in the claims, are used for distinguishing between similar
`
`elements and not necessarily for describing a sequential or chronological order. It is to be
`
`understood that the terms so used are interchangeable under appropriate circumstances and
`
`that the embodiments of the invention described herein are capable of operation in other
`
`sequences than described or illustrated herein.
`
`In case the terms "first", "second", "third" or "(a)", "(b)", "(c)", "(d)" etc. relate
`
`to steps of a method or use there is no time or time interval coherence between the steps, i.e.
`
`the steps may be carried out simultaneously or there may be time intervals of seconds,
`
`minutes, hours, days, weeks, months or even years between such steps, unless otherwise
`
`indicated in the application as set forth herein above or below.
`
`It is to be understood that this invention is not limited to the particular
`
`methodology, protocols, reagents etc. described herein as these may vary. It is also to be
`
`understood that the terminology used herein is for the purpose of describing particular
`
`embodiments only, and is not intended to limit the scope of the present invention that will be
`
`limited only by the appended claims. Unless defined otherwise, all technical and scientific
`
`terms used herein have the same meanings as commonly understood by one of ordinary skill
`
`in the art.
`
`As has been set out above, the present invention concerns in one aspect a
`
`method for assembly of nucleic acid sequence data comprising nucleic acid fragment reads
`
`into (a) contiguous nucleotide sequence segment(s), comprising the steps of:
`
(a) obtaining a plurality of nucleic acid sequence data from a plurality of nucleic acid
    fragment reads;

(b) aligning said plurality of nucleic acid sequence data to a reference sequence;

(c) detecting one or more gaps or regions of non-assembly, or non-matching with the
    reference sequence in the alignment output of step (b);

(d) performing de novo sequence assembly of nucleic acid sequence data mapping to said gaps
    or regions of non-assembly; and

(e) combining the alignment output of step (b) and the assembly output of step (d) in order
    to obtain (a) contiguous nucleotide sequence segment(s).
`
The term "assembly" of nucleic acid sequence data as used herein refers to the arrangement
of singularly or independently provided sequence data into a contiguous nucleotide sequence
segment. The term "contiguous nucleotide sequence segment(s)" as used herein refers to the
output of the presently claimed method being a coherent, non-redundant and preferably
error-free or substantially error-free sequence context. A "sequence segment" as used herein
may be any stretch comprising the information content of more than about 50 reads.
Preferably, a sequence segment may be an entire genome, an entire chromosome, a chromosome
arm, one or more sub-portions of a chromosome, or a conjunction of interrelated sequences,
e.g. exomes, transcriptome-related sequences, a conjunction of open reading frames, introns,
transposon sequences, repeats, or regulome-related sequences such as transcription factor
binding sites, methylation binding protein sites, specific regions with higher probability
of histone 3 lysine 4 mono-, di- and tri-methylation etc. A "nucleic acid fragment read" as
used herein refers to a single, short contiguous information piece or stretch of sequence
data. A read may have any suitable length, preferably a length of between about 30
nucleotides and about 1000 nucleotides. The length generally depends on the sequencing
technology used for obtaining it. In specific embodiments, the reads may also be longer,
e.g. 2 to 10 kb or more. The present invention generally envisages any read or read length
and is not to be understood as being limited to the presently available read lengths, but
also includes further developments in this area, e.g. the development of long-read
sequencing approaches etc.
`
In a first step of the method, a plurality of nucleic acid sequence data from a plurality of
nucleic acid fragment reads may be obtained. "Nucleic acid sequence data" as used herein may
be any sequence information on nucleic acid molecules known to the skilled person. The
sequence data preferably includes information on DNA or RNA sequences, modified nucleic
acids, single strand or duplex sequences, or alternatively amino acid sequences, which have
to be converted into nucleic acid sequences. The sequence data may additionally comprise
information on the sequencing machine, date of acquisition, read length, direction of
sequencing, origin of the sequenced entity, neighbouring sequences or reads, presence of
repeats or any other suitable parameter known to the person skilled in the art. The sequence
data may be presented in any suitable format, archive, coding or document known to the
person skilled in the art. The data may, for example, be in the format of FASTQ, Qseq,
CSFASTA, BED, WIG, EMBL, Phred, GFF, SAM, SRF, SFF or ABI-ABIF, as depicted and further
explained in the following Table 1:
`
`
Table 1:

FASTQ
  Developed by:   Sanger Institute
  Platform:       Illumina, Sanger
  Representation: Text based, for sequence and quality score
  Features:       Simple (FASTA-like); de facto standard for many sequencing instruments,
                  like fastq-Sanger, fastq-Solexa, fastq-Illumina; encodes the Phred quality
                  score using ASCII

Qseq
  Developed by:   Illumina
  Platform:       Illumina
  Extension:      .qseq
  Representation: Sequence and quality score
  Features:       A single file is created for each lane

CSFASTA
  Developed by:   ABI
  Platform:       ABI
  Extension:      .csfasta
  Representation: Color-space sequence reads
  Features:       Conversion from color-space to base-space leads to error propagation

BED
  Developed by:   UCSC Genome Bioinformatics Group
  Platform:       Genome browsers
  Representation: Text based
  Features:       Better visualization and alignment; flexible way to define the data lines
                  that are displayed in an annotation track

WIG
  Developed by:   UCSC Genome Bioinformatics Group
  Platform:       Genome browsers
  Representation: Genome browser track format
  Features:       Accepted by most genomic browsers; display of continuous-valued data in a
                  track format

EMBL
  Developed by:   European Molecular Biology Laboratory
  Platform:       EMBL, GenBank databases
  Representation: Several sequences in a single file
  Features:       Represents database records for nucleotide and peptide sequences from EMBL
                  databases; meta information can be optionally stored

Phred
  Developed by:   Research output
  Platform:       All sequencing projects
  Representation: Stores serialized chromatogram data
  Features:       Widely used for storing quality scores for bases

GFF
  Developed by:   Sanger Institute
  Platform:       All sequencing projects
  Representation: Genomic features in a text file
  Features:       Data exchange and representation of genomic data

SAM
  Developed by:   Collaborative result of several major genome centres
  Representation: Generic nucleotide alignment format
  Features:       Supports longer reads and alignments with more than one indel; used by the
                  1000 Genomes Project Committee

SRF
  Developed by:   Developed using an open process
  Platform:       All sequencing technologies
  Representation: Generic binary format for DNA sequence data
  Features:       Simple, compact in size; format flexible enough to store data from
                  different DNA sequencing technologies

SFF
  Developed by:   454 (Roche)
  Platform:       454 FLX
  Representation: Binary container file to encode results from 454 FLX
  Features:       Can store one or more reads from the 454 Life Sciences platform

ABI-ABIF
  Developed by:   ABI
  Platform:       SOLiD (ABI)
  Extension:      .ab1, .fsa
  Representation: Binary chromatogram files
  Features:       Accommodates heterogeneous data; stores data directory-wise
`
Preferably, the data or data sets are present in one data format, more preferably in a
unified data format, e.g. in the fastq format, along with their base quality either in
Phred/Phrap or modified format. It is further preferred that the data format at least covers
the sequence read and its associated base quality.
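
For illustration, a fastq record carrying the sequence read and its base quality can be
parsed as sketched below. The sketch assumes four-line records without line wrapping and
Sanger-style quality encoding (ASCII code minus 33); the function name and the example
record are hypothetical.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (name, sequence, qualities) from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality_line = handle.readline().rstrip()
        yield header[1:], sequence, [ord(char) - offset for char in quality_line]


# Hypothetical single-record input.
example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
```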
`
`In a particularly preferred embodiment of the present invention, the plurality
`
`of sequence data may be converted into a unified format. Such a conversion may be carried
`
`out by any suitable conversion tool known to the person skilled in the art, for example
`
`standard conversion tools which are capable of converting an Illumina format into a Sanger
`
`format, which may be used by several alignment algorithms, or any other comparable tool
`
`capable of converting a format indicated in Table 1 into another format indicated in Table 1
`
`or known to the person skilled in the art. The conversion may be performed such that at least
`
`a minimum amount of essential data is kept. Such a minimum amount of data may comprise,
`
`for example, the sequence itself, the run information, paired end library information, mate
`
`pair library information, single end library information, and base QC value. The preferred
`
`format into which the sequence data may be converted is any suitable format, which is
`
`recognized by reference sequence alignment algorithms, as well as de novo assembly
`
`algorithms. A preferred example is the fastq format. Alternatively, the sequence data may
`
`also be converted into the cfasta/SCARF format. The present invention further envisages any
`
`further, e.g. newly defined or developed format being able to be used by both, reference
`
`sequence alignments and de novo assembly procedures.
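
One such conversion can be sketched as a re-encoding of the quality string. The sketch
assumes the Illumina input uses Phred scores with an ASCII offset of 64 (as in Illumina
pipeline versions 1.3 to 1.7) and the Sanger output an offset of 33; older Solexa-scaled
scores would need a different transformation. The function name is hypothetical.

```python
def illumina_to_sanger(quality_line):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger fastq)."""
    converted = []
    for char in quality_line:
        score = ord(char) - 64
        if score < 0:
            raise ValueError(f"{char!r} is not a valid Phred+64 quality character")
        converted.append(chr(score + 33))
    return "".join(converted)


# 'h', 'T' and '@' encode Phred 40, 20 and 0 in the Phred+64 scheme.
sanger_quality = illumina_to_sanger("hT@")
```

The sequence line itself is unchanged by this conversion; only the quality line is
re-encoded.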
`
The data may comprise single entries or multiple entries within one data set. The data may
also include one or more data sets, or a plurality of data sets. The term "plurality" as
used herein accordingly refers to one or more data sets coming from one or more origins or
sources. The data sets or data may, for example, have the same format and/or come from the
same origin, e.g. the same sequencing machine, the same patient or subject or have been
obtained with the same sequencing technology, or they may have different formats and/or come
from different origins such as different sequencing machines or different patients or
subjects or have been obtained with different sequencing technologies.
`
The term "obtaining sequence data from a plurality of nucleic acid fragment reads" as used
herein refers to the process of determining the sequence information of a subject, or a
group of subjects, by the performance of nucleic acid sequencing reactions. The present
invention in one alternative embodiment uses previously obtained sequence data, e.g.
derivable from databases, external sequencing projects, laboratories, archives etc. In
another alternative embodiment the present invention also envisages the step of obtaining
the sequence data as an integral part of method step (a).
`
Methods for sequence determination are generally known to the person skilled in the art.
Preferred are next generation sequencing methods or high throughput sequencing methods. For
example, a subject's, group of subjects', or population's genomic sequence may be obtained
by using Massively Parallel Signature Sequencing (MPSS). An example of an envisaged
sequencing method is pyrosequencing, in particular 454 pyrosequencing, e.g. based on the
Roche 454 Genome Sequencer. This method amplifies DNA inside water droplets in an oil
solution, with each droplet containing a single DNA template attached to a single
primer-coated bead that then forms a clonal colony. Pyrosequencing uses luciferase to
generate light for detection of the individual nucleotides added to the nascent DNA, and the
combined data are used to generate sequence read-outs. Yet another envisaged example is
Illumina or Solexa sequencing, e.g. by using the Illumina Genome Analyzer technology, which
is based on reversible dye-terminators. DNA molecules are typically attached to primers on a
slide and amplified so that local clonal colonies are formed. Subsequently one type of
nucleotide at a time may be added, and non-incorporated nucleotides are washed away.
Subsequently, images of the fluorescently labeled nucleotides may be taken and the dye is
chemically removed from the DNA, allowing a next cycle. Yet another example is the use of
Applied Biosystems' SOLiD technology, which employs sequencing by ligation. This method is
based on the use of a pool of all possible oligonucleotides of a fixed length, which
`
`are labe