`http://www.investigativegenetics.com/content/2/1/23
`
`REVIEW
`
`
`
`
`Investigative
`Genetics
`
`Open Access
`
`Next-generation sequencing technologies and
`applications for human genetic history and
`forensics
`
`
`
Eva C Berglund, Anna Kiialainen and Ann-Christine Syvänen*
`
`
`
`Abstract
`
Rapid advances in the development of sequencing technologies in recent years have enabled an increasing
number of applications in biology and medicine. Here, we review key technical aspects of the preparation of DNA
templates for sequencing, the biochemical reaction principles and assay formats underlying next-generation
sequencing systems, methods for imaging and base calling, quality control, and bioinformatic approaches for
sequence alignment, variant calling and assembly. We also discuss some of the most important advances that the
new sequencing technologies have brought to the fields of human population genetics, human genetic history
and forensic genetics.
`
`
`Background
Determining the DNA sequence is the most comprehensive
way of obtaining information about the genome of
any living organism. For decades, Sanger sequencing [1]
`using fluorescently labeled terminating nucleotides and
`electrophoresis has been the gold standard sequencing
`technology. Sanger sequencing made an early impact in
`the field of microbial genomics, with the first complete
`bacterial genome, Haemophilus influenzae, sequenced in
`1995 [2]. Multicenter collaborations using numerous
`sequencing instruments and automated sample prepara-
`tion also made it possible to use Sanger sequencing in
`the human genome project, which took more than 10
years and US$2.7 billion to complete [3,4].
`In recent years, we have witnessed a rapid develop-
`ment of a new generation of DNA sequencing systems
`followed by a multitude of novel applications in biology
and medicine. The major advantage of the new 'second-generation'
or 'massively parallel' sequencing technologies,
compared to Sanger sequencing, is their considerably
higher throughput and thereby lower cost per
`sequenced base. On a second-generation sequencing
`(SGS) machine several human genomes can be
`sequenced in a single run in a matter of days. Here, we
`
`
* Correspondence: ann-christine.syvanen@medsci.uu.se
`Department of Medical Sciences, Molecular Medicine and Science for Life
`Laboratory, Uppsala University, 751 85 Uppsala, Sweden
`
`review recent technological advances of SGS technolo-
`gies and discuss the bioinformatic and computational
`implications of the sequencing revolution. Finally we
`highlight some applications of SGS technology with a
`focus on human population genetics and genetic history,
`and genetic forensics.
`
`Second-generation sequencing technologies
`There are three major SGS systems that are routinely
used in many laboratories today. The first system to
become commercially available was the Genome
`Sequencer from 454 Life Sciences (Branford, CT, USA)
`(later acquired by Roche [5]) in 2005, which was also
`the first SGS technology to sequence a complete human
`genome, that of Dr. James D. Watson [6]. The Genome
`Analyzer, first conceived by Solexa and later further
developed by Illumina (San Diego, CA, USA) [7], was
launched in 2006, and the SOLiD system from Applied
Biosystems [8] (now part of Life Technologies (Carlsbad,
CA, USA)) in 2007. The key steps of a sequencing project
are the same for all of these technologies: preparation
and amplification of template DNA, distribution of
templates on a solid support, sequencing and imaging,
base calling, quality control and data analysis (Figure 1).
`In terms of applications, there are two major types of
`projects, de novo sequencing and resequencing. In a de
`novo sequencing project, the genome of an organism is
`sequenced for the first time. In contrast, in resequencing
`
`
© 2011 Berglund et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
`
`Personalis EX2017.1
`
`
`
`
Berglund et al. Investigative Genetics 2011, 2:23
`
`
`
[Figure 1: flow chart with the boxes DNA sample, Library preparation, Distribution on solid support, PCR amplification, Sequencing and imaging, Base/color calling, Quality control, Data analysis]
Figure 1 Steps of a sequencing experiment. Black arrows indicate steps that are common for all second-generation sequencing (SGS)
technologies, white arrows refer to the Illumina systems, and grey arrows refer to the Roche 454 and SOLiD systems.
`
applications, the genome, or parts of it, is sequenced for
a species for which a reference sequence is already available.
This difference affects both the selection of sequencing
strategy and the data analysis (further discussed below).
In human forensics and population genetics the resequencing
approach is used, but in microbial forensics
both de novo sequencing and resequencing of microbial
genomes may be required.
`
`
`Two common measures of the amount of sequence
`data generated in a project are the sequencing depth
and breadth. Sequencing depth, or coverage, is the average
number of times each base in the genome is
sequenced. For example, to sequence a 3 Gb human
genome to 30× coverage, 90 Gb of sequence data is
needed. The coverage will be uneven over the genome,
however, and sequencing breadth, sometimes also
referred to as genome coverage, is the percentage of the
`genome that is covered by sequence reads.
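
The two measures defined above can be illustrated with a short calculation. The following is a minimal sketch, not part of the original review; the function names and the toy per-base depth vector are assumptions made only for illustration.

```python
def required_bases(genome_size, target_depth):
    """Total sequence data needed to reach an average depth."""
    return genome_size * target_depth


def sequencing_depth(total_bases, genome_size):
    """Average number of times each base in the genome is sequenced."""
    return total_bases / genome_size


def sequencing_breadth(per_base_depth):
    """Percentage of positions covered by at least one read."""
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return 100.0 * covered / len(per_base_depth)


# Example from the text: a 3 Gb genome at 30x needs 90 Gb of data.
print(required_bases(3_000_000_000, 30))

# Hypothetical 10-base "genome" with uneven coverage.
toy_depths = [0, 2, 5, 0, 1, 3, 4, 1, 0, 2]
print(sequencing_depth(sum(toy_depths), len(toy_depths)))  # average depth
print(sequencing_breadth(toy_depths))                      # percent covered
```

In the toy example the average depth is 1.8 although three of the ten positions are not covered at all, which is why depth and breadth are reported separately.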
`
`DNA samples for sequencing
`High-quality DNA in sufficient quantity is the basis for
any successful sequencing experiment. For most sequencing
applications, 1 to 5 µg of purified DNA is needed,
`an amount that may not always be available. Whole gen-
`ome amplification (WGA) has frequently been used to
`increase the amount of DNA for genotyping [9] and can
`be applied also in combination with SGS. Several micro-
`bial genomes have been sequenced using SGS after
`WGA (for example, the genome of uncultured bacterial
`symbionts of termites isolated from a single host cell
[10]). WGA was also recently used to amplify DNA
`from single cells from primary breast tumors, and
`although sequence data was retrieved from only 6% of
`the genome of each cell, this genomic representation
was enough to identify subpopulations of cancer cells by
copy number variations [11].
SGS can also be used to detect rare and unknown variants
in genomic regions of interest in a cost-efficient
`way, and in a larger number of samples than by whole
`genome sequencing. Another advantage of targeted
`sequencing is the reduced issue of sequence reads align-
`ing to multiple locations in the genome. The most com-
`monly used methods for enrichment of genomic regions
for sequencing are either based on hybridization to biotinylated
probes in solution or probes immobilized on
`microarrays, or on multiplexed amplification by PCR
`(reviewed in [12]). The recently developed selector probe
technology, which is based on rolling circle amplification,
provides efficient and highly multiplexed enrichment of
small regions totaling up to 1 Mb in size and is particularly
`useful for ultra-deep sequencing at low cost and high
`specificity [12]. Sequencing of human exomes enriched
`by hybridization-based capture in solution is becoming
`widely used, and has proven to be particularly successful
`for identification of mutated genes underlying monogenic
disorders (see, for example, [14-17]). The methods for
`hybridization-based capture are also applicable to cus-
`tom-selected genomic regions of interest [18].
`
`Preparation of sequencing libraries
The DNA samples to be sequenced are first converted
into one of two main types of sequencing libraries,
fragment libraries or mate-pair libraries. The first step
in the preparation of a sequencing library is to fragment
the DNA sample, usually using sonication or nebulization.
For preparation of fragment libraries, sequencing
`adapters are ligated to both ends of the DNA fragments,
`followed by PCR amplification using primers comple-
`mentary to the adapters.
`In the Illumina SGS technology, adapter-ligated DNA
fragments are amplified directly in the flow cell subsequently
used for sequencing. Each flow cell has eight
channels (lanes) coated with oligonucleotides that are
complementary to the adapters. The adapter-ligated
`DNA fragments are hybridized to the flow cell, in which
`they are distributed randomly and amplified by a pro-
`cess called bridge amplification. After amplification,
`DNA molecules are linearized to form clusters, each of
`which consists of about 1,000 copies of the original
`DNA molecule at that position.
In the 454 and SOLiD technologies, adapter-ligated
fragments are hybridized to beads coated with an oligonucleotide
that is complementary to one of the adapters
`for amplification in a water-in-oil emulsion PCR. Each
`water droplet constitutes a microreactor containing the
`PCR reagents and optimally a single bead with a single
`immobilized DNA fragment. Thus, multiple PCRs can
`be performed in parallel in a single tube. After breaking
`the emulsion, the beads, which are now coated with
`thousands or millions of copies of the original DNA
`molecule, are loaded onto the solid support for sequen-
`cing. In the 454 system, the solid support is called Pico-
`TiterPlate and consists of wells that can fit a single
DNA-coated bead each. SOLiD uses a glass slide to
`which the beads are distributed randomly.
`The amplified fragments are then sequenced either
`from one end (single-end) or from both ends (paired-
`end). Paired reads allow more accurate alignment to a
`reference genome, and are also very useful to resolve
`repeats and improve assembly in de novo sequencing
projects. The Illumina system generates sequence reads
of the same length from both ends, whereas the second
read from SOLiD is shorter (Table 1). The 454 system
currently does not support paired-end sequencing of
fragment libraries.
Mate-pair libraries are constructed by circularizing
fragmented DNA, thereby bringing the two ends of the
original DNA fragment adjacent to each other (Figure
2). After fragmentation of the circular DNA, the fragment
containing the ends of the original linear DNA is
selected using biotin capture. Sequencing both ends of
the selected fragment will yield reads that are separated
by the distance of the original fragment. In order to
avoid chimeric sequence reads that span over both original
fragment ends, the 454 and SOLiD systems include
an internal adapter. In the Illumina mate-pair
`
`
Table 1 Characteristics of second-generation and third-generation sequencing instruments

Instrument | Read length (nucleotides) | No. of reads^a | Output (Gb)^a | No. of samples^a,b | Run time | Advantages | Disadvantages
-----------|---------------------------|----------------|---------------|--------------------|----------|------------|--------------
Roche 454 GS FLX+ | 700^c | 1 × 10^6 | 0.7 | 192^d | 23 h | Long reads, short run time | Homopolymer errors, expensive
Illumina HiSeq2000 | 100^e | 3 × 10^9 | 600 | 384 | 11 days^f | High yield | No. of index tags limiting
Life Technologies SOLiD 5500xl | 75^g | 1.5 × 10^9 | 180 | 1,152 | 14 days^f | Inherent error correction | Short reads
Roche 454 GS Junior | 400^c | 1 × 10^5 | 0.035 | 132 | 9 h | Long reads | Homopolymer errors, expensive
Illumina MiSeq | 150 | 5 × 10^6 | 1.5 | 96 | 27 h | Short run time, ease of use | Expensive per base
Ion Torrent PGM, Ion 316 chip | > 100^h | 1 × 10^6 | 0.1 | 16 | 2 h | Short run time, low reagent cost | Not well evaluated
Helicos BioSciences HeliScope | 35^h | 1 × 10^9 | 35 | 4,800 | 8 days | SMS, sequences RNA | Short reads, high error rate
Pacific Biosciences PacBio RS | > 1,000^h | 1 × 10^5 | 0.1 | 1 | 90 min | SMS, long reads, short run time | High error rate, low yield

Most of this information is subject to rapid changes, and the aim of this table is not to present absolute numbers but to provide a general comparison between
different sequencing systems.
a Numbers calculated for two flow cells on HiSeq2000 and SOLiD 5500xl.
b Calculated as no. of index tags (provided by the sequencing company) × no. of divisions on solid support.
c Average for single-end sequencing, paired-end reads are shorter.
d No. of reads decreases when the PicoTiterPlate is divided.
e 36 nucleotides for mate-pair reads.
f Run time depends on the read length, and on whether one or two flow cells are used.
g Second read in paired-end sequencing is limited to 35 nucleotides, and mate-pair reads to 60 nucleotides.
h Average.
SMS = single molecule sequencing.
`
`preparation, no internal adapter is used, and to mini-
`mize the risk of sequencing over the original junction
the recommended read length is limited to 36 nucleotides.
Mate-pair libraries allow larger insert sizes (2 to
20 kb) than paired-end sequencing of fragment libraries.
Drawbacks of mate-pair sequencing are that the laboratory
protocols are more complicated and that a substantially
larger amount of DNA (5 to 120 µg) is required.
`In contrast to paired-end reads, which are oriented
`towards each other, mate-pair reads are either both
`oriented outwards from the original fragment or both
`have the same orientation (Figure 2), which needs to be
`accounted for in the data analysis. Large inserts are
`especially valuable in de novo sequencing projects,
where they can substantially improve scaffolding (ordering
of assembled contigs). Mate-pair sequencing is not
used as frequently in resequencing projects, where DNA
resources are often limited and the analysis is mainly
based on alignment to a reference genome.
Introducing an additional index tag (barcoding) to
each DNA fragment makes it possible to sequence
pooled samples that can be distinguished in silico after
sequencing. Multiplexing is useful in applications where
a relatively small amount of data is needed from each
sample, such as sequencing of small genomes or
enriched regions of large genomes (see, for example,
`
`[19,20]). As the capacity of the sequencing instruments
`has increased, multiplex sequencing of indexed samples
`has become more and more important to minimize
`sequencing costs. Indexing also decreases the risk of
sample mix-ups and contamination during library preparation.
Currently, Illumina provides 24 different index
tags, Life Technologies 96 and Roche 12 (Table 1).
Additional index tags for Illumina (48 tags in total) and
454 (120 additional tags) can be purchased from other
companies. It is also possible to use custom-designed
index tags [21,22].
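
How pooled, indexed samples can be separated in silico is sketched below. This is a simplified illustration and not any vendor's demultiplexing software: it assumes, for simplicity, that the index tag is the first few bases of each read and that only exact matches are accepted. The tag sequences and sample names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample):
    """Assign each read to a sample by its leading index tag.

    Assumes all tags have the same length and sit at the start of
    the read; reads with an unknown tag go to 'undetermined'.
    """
    tag_len = len(next(iter(tag_to_sample)))
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_len], read[tag_len:]
        sample = tag_to_sample.get(tag, "undetermined")
        bins[sample].append(insert)  # store the read without its tag
    return dict(bins)


# Hypothetical 4-base tags for two pooled samples.
tags = {"ACGT": "sample_A", "TGCA": "sample_B"}
pooled = ["ACGTTTGACC", "TGCAGGATCC", "ACGTCCCAAA", "NNNNACGTAC"]
for sample, inserts in demultiplex(pooled, tags).items():
    print(sample, inserts)
```

A production tool would typically also tolerate a limited number of mismatches in the tag, which is one reason index tags are designed to differ from each other at several positions.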
`
`Sequencing and imaging principles
`Sequencing-by-synthesis
The Illumina and 454 technologies are based on sequencing-by-synthesis.
A DNA polymerase is used to extend
a sequencing primer by incorporating nucleotides that
form a growing sequence complementary to the template
DNA. In the Illumina system, fluorescent reversibly
terminating nucleotides are used. All four
nucleotides are added at the same time, each with a
`unique fluorescent label, which allows incorporation of
`one base per cycle into each template molecule [23]
`(Figure 3a). After incorporation and fluorescence regis-
`tration at four wavelengths, the terminating and fluores-
`cent moieties are removed from the nucleotides to allow
`
[Figure 2: schematic with three panels: (a) Illumina, (b) Roche 454, (c) SOLiD]
Figure 2 Principles for construction of mate-pair sequencing libraries. (a) Preparation of Illumina mate-pair libraries. Fragments are end-
repaired using biotinylated nucleotides (1). After circularization, the two fragment ends (green and red) become located adjacent to each other
(2). The circularized DNA is fragmented, and biotinylated fragments are purified by affinity capture. Sequencing adapters (A1 and A2) are ligated
to the ends of the captured fragments (3), and the fragments are hybridized to a flow cell, in which they are bridge amplified. The first
sequence read is obtained with adapter A2 bound to the flow cell (4). The complementary strand is synthesized and linearized with adapter A1
bound to the flow cell, and the second sequence read is obtained (5). The two sequence reads (arrows) will be directed outwards from the
original fragment (6). (b) Preparation of Roche 454 paired-end libraries (these are called paired-end, but are based on the same principles as the
mate-pair libraries in the other technologies). Original fragments (1) are end-repaired with unlabeled nucleotides, and biotin-labeled
circularization adapters (CA) are ligated to the fragment ends (2). After circularization (3), fragmentation and affinity purification, library adaptors
(LA1 and LA2) are ligated to the new fragment ends (4) and the fragments are amplified on beads by emulsion PCR. A sequence read
that covers the two original ends and the internal adapter is generated (5). Adapter sequence is removed in silico, and the sequence is split into
two reads, which both have the same orientation (6). (c) Preparation of SOLiD mate-pair libraries. Steps 1 to 4 are analogous with preparation of
Roche 454 paired-end libraries, with a biotin-labeled internal adapter (IA) and two sequencing adapters (P1 and P2). Sequencing is performed
with two different primers, complementary to the P1 adapter and internal adapter, respectively (5). The resulting reads will have the same
orientation (6).

`
`
`the next sequencing cycle. In 2010, Illumina released the
`HiSeq2000 instrument, which uses the same chemistry
`as the original Genome Analyzer instrument, but has
improved imaging optics and can process two flow cells
`in parallel. The HiSeq2000 system has the highest
`throughput of all currently available SGS instruments,
`with around 600 Gb sequence produced per run (Table
`1). Examples of what can be achieved with the current
capacity of HiSeq2000 are shown in Table 2. Sequencing
`errors are primarily substitution errors and occur more
`frequently in the distal bases of a read.
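
The kind of capacity estimate summarized in Table 2 follows from dividing the output of a run by the data needed per sample. The sketch below is an assumption-laden illustration rather than the authors' calculation: it ignores uneven coverage, filtered reads, capture efficiency and the number of available index tags, and the function name is hypothetical.

```python
def samples_per_run(output_bases, target_size_bases, coverage):
    """Whole number of samples that fit in one run at a given coverage."""
    bases_per_sample = target_size_bases * coverage
    return output_bases // bases_per_sample


# Roughly 600 Gb per run, 3 Gb human genome, 40x coverage.
print(samples_per_run(600_000_000_000, 3_000_000_000, 40))
```

With these inputs the estimate is five human genomes per run, in line with the first row of Table 2.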
`In the 454 sequencing-by-synthesis reaction, natural
`non-terminating deoxynucleotides are added to the sys-
`tem sequentially (Figure 3b). In homopolymeric regions
`several bases will thus become incorporated in the same
step. The 454 technology is based on the pyrosequencing
principle [24], where pyrophosphate is released as a
consequence of nucleotide incorporation and converted
into ATP by sulfurylase. ATP is then used as a substrate
for the production of light by luciferase, and the emission
of light is registered by a charge-coupled device
(CCD) camera. The major advantage of the 454 technology
is the long read length. The misincorporation rate
for the natural deoxynucleotides is low, resulting in low
levels of nucleotide substitution errors. Insertion-deletion
errors are frequent in homopolymeric regions, however,
due to the non-linear light response when several
nucleotides are incorporated simultaneously into the same
molecule. Compared to the Illumina and SOLiD systems,
the 454 technology is more expensive per base
due to the lower capacity and the higher reagent cost
associated with the multiple enzymes required. Thus,
the 454 technology is mainly used in applications where
long reads are desired, such as de novo sequencing
`
[Figure 3: schematic with three panels: (a) Illumina, (b) Roche 454, (c) SOLiD]
Figure 3 Principles for sequencing and imaging. (a) Illumina sequencing of three template molecules. All four nucleotides, carrying
terminating moieties and unique fluorescent labels, and DNA polymerase are added, and one complementary nucleotide becomes incorporated
at each template molecule (1). After washing, fluorescence is registered at four wavelengths (2). Fluorescent dyes and terminating groups are
cleaved off. A new set of nucleotides is added (3), and imaged (4). Sequence reads of equal length are obtained (5). (b) 454 sequencing of three
template molecules. One type of natural non-terminating deoxynucleotide and DNA polymerase are added and a pyrophosphate molecule is
released at each nucleotide incorporation (1). Pyrophosphate is converted into light using sulfurylase and luciferase, and the light intensity is
measured in each well (2). Free deoxynucleotides are destroyed with apyrase before adding the next type of deoxynucleotide (3) and imaging
(4). Light signals are converted to flowgrams with higher signal intensity bars in homopolymer regions (5). Sequence reads that may differ in
length are obtained (6). (c) SOLiD sequencing of one template molecule. A sequencing primer, DNA ligase and 1,024 unique probes, which are
fluorescently labeled according to their first two bases, are added, and the complementary probe is ligated to the template (1). After washing,
fluorescence is registered at four wavelengths. The three universal bases and the fluorophore are cleaved off (2). Addition of a new probe set is
repeated for the desired number of cycles (3,4). The newly built strand is melted off. A new sequencing primer is added, which anneals one
base off from the first primer and therefore interrogates different positions (5). Sequencing is repeated for the desired number of cycles (6).
Additional primers are added, until each base is sequenced twice. The colors from all sequencing rounds are merged and can be converted to
nucleotides (7).

`
`
`
projects, where the read length is the most important
factor determining the quality of the assembly, and in
metagenomics, where the sample contains a mix of different
organisms.

Table 2 Capacity of the HiSeq2000 instrument from Illumina

Target region | Coverage | Samples per run
--------------|----------|----------------
Human genome (3 Gb) | 40 | 5
[illegible] (30 Mb) | 100 | [illegible]
Escherichia coli genome | 200 | [illegible]
Large genes (1 Mb) | [illegible] | 6,000

Sequencing-by-ligation
The SOLiD technology is based on sequencing-by-ligation,
where a DNA ligase is used to add probes to a
growing oligonucleotide chain [25] (Figure 3c). The
`
`probes consist of eight bases, five that are specific and
`complementary to the template and three that are uni-
`versal and support hybridization to the DNA template.
In the sequencing reaction, probes containing all possible
combinations of the first five nucleotides are added.
The probe that matches the template perfectly becomes
hybridized and ligated to the sequencing primer or previous
probe. The probes are fluorescently labeled
according to the first two bases using a scheme for two-base
encoding with four fluorophores. After imaging,
`the fluorescent label and the three universal bases are
`cleaved off, and a new set of probes is added. After the
`first round of sequencing, the newly built DNA strand is
`melted off, a new sequencing primer which starts one
`base off from the first primer binding site is hybridized,
`and the sequencing reaction is repeated now interrogat-
`ing different positions. This process is repeated several
`times with different primers, so that all bases in the
`template become sequenced twice. Since each base is
sequenced twice in the SOLiD system, most sequencing
errors can be corrected during alignment, resulting in a
low error rate of mapped data. It is possible to use different
chemistries and read lengths in different lanes.
The current read length is 75 nucleotides for fragment
libraries, and the total yield per run (two glass slides) is
around 180 Gb. Data analysis has traditionally been performed
in color space; however, with the recent upgrade
to the 5500xl system, it is now also possible to get error-corrected
reads in base space.
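
The two-base encoding described above can be sketched in a few lines. The example assumes the commonly described SOLiD color code, in which each of four colors represents four of the sixteen possible dinucleotides and can be computed as the XOR of two-bit base codes. The numeric color labels 0 to 3 are a convention chosen here for illustration, and decoding requires a known first base.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Translate a base sequence into colors, one per adjacent base pair."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Translate colors back to bases, starting from a known first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


reference = "ATGGCA"
ref_colors = encode_colors(reference)
print(ref_colors)
print(decode_colors("A", ref_colors))

# A true single-base variant changes two adjacent colors ...
print(encode_colors("ATGTCA"))

# ... whereas a single miscalled color changes every base after it.
one_color_error = ref_colors[:]
one_color_error[2] = 1
print(decode_colors("A", one_color_error))
```

Because an isolated color change is inconsistent with a single-base variant, it can be recognized as a likely sequencing error when reads are aligned in color space, which is the basis of the error correction mentioned in the text.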
`Sequencing-by-ligation is also used by Complete
`Genomics (Mountain View, CA, USA) [26], a company
`that sequences human genomes as a service. In their
technology, the template DNA is first inserted into a
single-stranded DNA circle, which is then copied several
`times to make up DNA nanoballs. The nanoballs are
`attached to arrays and sequenced by ligation reactions,
`which use multiple priming sites [27]. The current capa-
`city of Complete Genomics is more than 600 genomes
`per month, and they are driving down the price of
`whole-genome sequencing.
`Base calling and quality control
`Intensities of light signals from the sequencing reactions
`are converted to bases (Illumina and 454 systems) or
colors (SOLiD system). In addition to sequence data,
base calling produces quality scores for each base, which
are estimates of the probability of the call being erroneous.
After base calling, reads with indications of
mixed signals or other errors are filtered out. To facilitate
troubleshooting and discrimination between problems
caused by instrument/reagent factors and sample
factors, the Illumina and 454 platforms include standardized
control DNA in each run.
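
The relation between a quality score and the estimated error probability can be made concrete with the widely used Phred convention, Q = -10 log10(p). The review does not name a specific scale, so the Phred formula and the ASCII offset of 33 used below are assumptions; some platforms and software versions have used other offsets.

```python
import math


def error_prob_to_quality(p):
    """Phred-style quality score for an estimated error probability."""
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Estimated probability that a base call with score q is wrong."""
    return 10.0 ** (-q / 10.0)


def decode_quality_string(qual, offset=33):
    """Convert an ASCII-encoded quality string to integer scores."""
    return [ord(ch) - offset for ch in qual]


print(quality_to_error_prob(20))      # about 1 error in 100 calls
print(error_prob_to_quality(0.001))   # about 30
print(decode_quality_string("II5+"))
```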
In the Illumina system, the clusters are identified during
the first four cycles of sequencing. Intensities are
registered for each cluster in every cycle and converted
to nucleotide sequence. If the initial recognition of clusters
was not perfect, some clusters may contain more
than one original template molecule. It is also possible
that some clusters contain many molecules that have
incorporated fewer (phasing) or more (prephasing)
nucleotides than the number of cycles. Such clusters are
filtered out by a so-called chastity filter, which is based
on the ratio of signal intensities of the bases with the
strongest and the second strongest intensity. Control
DNA from the phage phiX is sequenced in each flow cell.
In the analysis pipeline, phiX reads are identified by comparison
to the phiX genome and the error rate is determined
and used as a measure of the quality of the run.
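
A chastity-style filter can be sketched as follows. The text only states that the filter is based on the strongest and second strongest intensities; the specific formula (strongest divided by the sum of the two strongest), the threshold of 0.6 and the allowance of one failing cycle are assumptions used here for illustration, as are the intensity values.

```python
def chastity(intensities):
    """Strongest intensity divided by the sum of the two strongest."""
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)


def passes_filter(cycles, threshold=0.6, max_failures=1):
    """Keep a cluster unless too many cycles show mixed signals."""
    failures = sum(1 for c in cycles if chastity(c) < threshold)
    return failures <= max_failures


# Hypothetical four-channel intensities (A, C, G, T) for two cycles.
clean_cluster = [(900, 50, 30, 20), (40, 800, 30, 20)]
mixed_cluster = [(500, 450, 20, 10), (480, 460, 10, 5)]

print(passes_filter(clean_cluster))
print(passes_filter(mixed_cluster))
```

A cluster seeded by two different template molecules tends to show two strong channels in many cycles, which pushes the ratio towards 0.5 and causes the cluster to be discarded.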
During QC filtering of data from the 454 system, possible
polyclonal beads and beads with no template are
identified based on the number of positive and negative
flows. In addition, reads that do not start with a specific
key sequence, which is part of the adapter, and reads
that have a high number of off-peak signal intensities
(indicative of homopolymer errors) are filtered out.
Sequence reads are also trimmed from the 3' end to
remove adapter sequence and bases of low quality, arising
from phasing/prephasing issues and loss of signal
intensity. Beads with control DNA, labeled with a different
key sequence, are included in each run. With the
`aid of the key sequence, these sequence reads are identi-
`fied and aligned to a reference sequence, and the per-
`centage of reads that match with 95, 98 and 100%
`similarity is reported.
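
Trimming low-quality bases from the 3' end can be illustrated with a very simple rule. This is a generic sketch and not the algorithm used in the 454 software: bases are removed from the 3' end until one with at least the minimum quality is reached, and the cutoff of 20 is an arbitrary assumption.

```python
def trim_3prime(seq, quals, min_quality=20):
    """Remove trailing bases whose quality is below min_quality."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return seq[:end], quals[:end]


read = "ACGTACGT"
scores = [35, 34, 36, 30, 28, 15, 10, 5]
trimmed_seq, trimmed_scores = trim_3prime(read, scores)
print(trimmed_seq, trimmed_scores)
```

More robust schemes, such as sliding-window averages, avoid stopping at a single good base inside a poor-quality tail.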
After color calling in the SOLiD system, possible polyclonal
reads and reads with color combinations that do
not make sense according to the two-base encoding
scheme are filtered out. No control DNA is used, but
`the quality of a run is assessed from the color intensity
`distribution.
`Despite the standard QC steps, not all obtained data
`will be of high quality. To recognize potential problems
and biases it is useful to apply additional quality control
`measures. A good resource for assessing the quality of
`sequence data is the FastQC software [28], which
`reports, for example, distributions of base qualities, GC
`content, redundancy and over-representation of adapter
`or primer sequence.
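
Two of the summary statistics mentioned above, GC content and base quality by read position, are simple to compute. The sketch below only illustrates the idea and is not how FastQC is implemented; the function names and input data are hypothetical.

```python
def gc_content(seq):
    """Percentage of G and C bases in a sequence."""
    gc = sum(1 for base in seq.upper() if base in "GC")
    return 100.0 * gc / len(seq)


def mean_quality_per_position(quality_lists):
    """Mean quality at each read position across reads of any length."""
    longest = max(len(q) for q in quality_lists)
    means = []
    for i in range(longest):
        column = [q[i] for q in quality_lists if len(q) > i]
        means.append(sum(column) / len(column))
    return means


print(gc_content("ACGTGGCC"))
print(mean_quality_per_position([[40, 38, 20], [36, 30], [38, 34, 10]]))
```

A steady decline of the per-position mean towards the end of the reads is the typical pattern behind the 3' trimming described earlier.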
`
`Trends and upcoming technologies
`Low-capacity sequencing systems
`While the sequencing companies compete to increase
`throughput, they have also launched systems with lower
`capacity. The reason for this trend is that the high capa-
`city of the original systems is not always required, and
`the current multiplexing possibilities do not match their
`throughput, which results in a much higher coverage
`(and cost) than needed for many applications.
`
`
`Roche 454 Technologies was first to launch the Gen-
`ome Sequencer Junior system in 2010 and Illumina
`launched the MiSeq system in 2011. The capacity of
`these systems is 35 Mb and 1.5 Gb per run, respectively.
Life Technologies recently acquired the company Ion
Torrent, whose technology has a concept similar to that
of the 454 system. However, detection is based on pH
changes caused by release of hydrogen ions upon nucleotide
incorporation rather than pyrophosphate release. Since
`no enzymes are used for detection, the reagent cost for
`this instrument is low.
`Compared to the high-capacity instruments, the cost
`per base is high for the smaller machines, but they are
`suitable for sequencing small genomes, amplicons or
`DNA enriched by targeted capture. Since the run time
`for these systems is short, they are also useful for tech-
`nology development runs and to optimize reaction con-
`ditions for a larger run. These machines have been
`referred to as ‘personal sequencers’, meaning that they
`are easily obtainable also by laboratories with smaller
`resources that want to have rapid access to a sequencing
`instrument, but need relatively small amounts of data.
`Single molecule sequencing
`In single molecule sequencing, sometimes also referred
`to as third-generation sequencing, no amplification of
`the template molecules is performed prior to sequen-
`cing. These technologies provide improved quantitative
`accuracy by eliminating the risk of biases introduced
`during preparation of sequencing libraries. Single mole-
`cule sequencing also allows direct sequencing of RNA
`molecules, detection of chemically modified bases such
`as DNA methylation, and increased read lengths. Longer
reads will be useful in de novo sequencing projects and
`open up perspectives for experimental phasing (determi-
`nation of which variant alleles are on the same chromo-
`some), in contrast to statistical phasing that is used
`today.
In the HeliScope Single Molecule Sequencer system
from Helicos BioSciences (Cambridge, MA, USA) [29],
single-stranded poly(dA)-tailed templates are attached to
poly(dT) oligonucleotide primers that are anchored on a
flow cell. In each sequencing cycle one type of reversibly
terminating fluorescently labeled nucleotide is added
and incorporated by a polymerase, the slide is washed
and imaged, and the dye labels are cleaved off [30,31].
This technology generates around 35 Gb per run, and
the read length is 35 nucleotides on average (Table 1).
`Pacific Biosciences (Menlo Park, CA, USA) [32] has
`developed a system called single molecule real time
`(SMRT) sequencing, which uses a DNA polymerase
`anchored on a glass surface and nucleotides with phos-
`pholinked fluorescent labels that are cleaved off when
`the nucleotides are incorporated. The sequencing reac-
`tion takes place on zero-mode waveguide nanostructure
`
arrays [33]. The incorporation of the fluorescently
labeled nucleotides is monitored in real time, which
results in very short run times. This system has read
lengths over a thousand bases, but error rates are high
and throughput is currently limited to 0.1 Gb per run
(Table 1). The Pacific Biosciences system was successfully
used for rapid analysis of the cholera strains in the
outbreak in Haiti in 2010 [34].
`Several new technologies for single molecule sequen-
`cing are under development. Nanopore sequencing tech-
`nologies (Oxford Nanopore (Oxford, UK) [35], NABsys
`(Providence, RI, USA) [36]) are based on detecting nat-
`ural electric or chemical differences between nucleo-
`tides, and do not require labeling of DNA. The Starlight
`system (Life Technologies) is a real-time technology that
`uses a quantum-dot-labeled polymerase and distinctly
`labeled fluorescent nuc