`Platforms
`
`Elaine R. Mardis
`
`The Genome Institute at Washington University School of Medicine, St. Louis,
`Missouri 63108; email: emardis@wustl.edu
`
`Annu. Rev. Anal. Chem. 2013. 6:287-303
`
`The Annual Review of Analytical Chemistry is online
`at anchem.annualreviews.org
`
`This article's doi:
`10.1146/annurev-anchem-062012-092628
`
`Copyright © 2013 by Annual Reviews.
`All rights reserved
`
`Keywords
`massively parallel sequencing, next-generation sequencing, reversible dye
`terminators, sequencing by synthesis, single-molecule sequencing,
`genomics
`
`Abstract
`
`Automated DNA sequencing instruments embody an elegant interplay
`among chemistry, engineering, software, and molecular biology and have
`built upon Sanger's founding discovery of dideoxynucleotide sequencing to
`perform once-unfathomable tasks. Combined with innovative physical mapping
`approaches that helped to establish long-range relationships between
`cloned stretches of genomic DNA, fluorescent DNA sequencers produced
`reference genome sequences for model organisms and for the reference human
`genome. New types of sequencing instruments that permit amazing
`acceleration of data-collection rates for DNA sequencing have been
`developed. The ability to generate genome-scale data sets is now transforming
`the nature of biological inquiry. Here, I provide an historical perspective of
`the field, focusing on the fundamental developments that predated the
`advent of next-generation sequencing instruments and providing information
`about how these instruments work, their application to biological research,
`and the newest types of sequencers that can extract data from single DNA
`molecules.
`
`EX1012
`
`
`
`1. INTRODUCTION
`Automated DNA sequencing instruments embody an elegant interplay among chemistry, engineering,
`software, and molecular biology and have built upon Sanger's founding discovery of
`dideoxynucleotide sequencing to perform once-unfathomable tasks. Combined with innovative
`physical mapping approaches that helped to establish long-range relationships between cloned
`stretches of genomic DNA, fluorescent DNA sequencers have been used to produce reference
`genome sequences for model organisms (Escherichia coli, Drosophila melanogaster, Caenorhabditis
`elegans, Mus musculus, Arabidopsis thaliana, Zea mays) and for the reference human genome. Since
`2005, however, new types of sequencing instruments that permit amazing acceleration of
`data-collection rates for DNA sequencing have been introduced by commercial manufacturers. For
`example, single instruments can generate data to decipher an entire human genome within only
`2 weeks. Indeed, we anticipate instruments that will further accelerate this whole-genome
`sequencing data-production timeline to days or hours in the near future. The ability to generate
`genome-scale data sets is now transforming the nature of biological inquiry, and the resulting
`increase in our understanding of biology will probably be extraordinary. In this review, I provide
`an historical perspective of the field, focusing on the fundamental developments that predated
`the advent of next-generation sequencing instruments, providing information about how
`massively parallel instruments work and their application to biological research, and finally discussing
`the newest types of sequencers that are capable of extracting sequence data from single DNA
`molecules.
`
`2. A BRIEF HISTORY OF DNA SEQUENCING
`DNA sequencing and its manifest discipline, known as genomics, are relatively new areas of
`endeavor. They are the result of combining molecular biology with nucleotide chemistry, both
`of which blossomed as scientific disciplines in the 1950s. Dr. Frederick Sanger's laboratory at the
`Medical Research Council (MRC) in Cambridge, United Kingdom, began research to devise a
`method of DNA sequencing in the early 1970s (1-3) after having first published methods for RNA
`sequencing in the late 1960s (4-6). Sanger et al.'s (7) seminal 1977 publication describes a method
`for essentially tricking DNA polymerase into incorporating nucleotides with a slight chemical
`modification: the exchange of the 3' hydroxyl group needed for chain elongation with a hydrogen
`atom that is functionally unable to participate in the reaction with the incoming nucleotide to
`extend the synthesized strand. Mixing proportions of the four native deoxynucleotides with one
`of four of their analogs, termed dideoxynucleotides, yields a collection of nucleotide-specific
`terminated fragments for each of the four bases (Figure 1). The fragments resulting from these
`reactions were separated by size on thin slab polyacrylamide gels; the A, C, G and T reactions
`were performed for each template run in adjacent lanes. The fragment positions were identified
`by virtue of 32P, which was supplied in the reaction as labeled dATP molecules. When dried and
`exposed to X-ray film, the gel-separated fragments were visualized and subsequently read from
`the exposed film from bottom to top (shortest to longest fragments) by the naked eye. Thus, a long
`and labor-intensive process was completed, and the sequencing data for the DNA of interest were
`in hand and ready for assembly, translation to amino acid sequence, or other types of analysis.
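The logic of reading a four-lane gel can be sketched in a few lines of Python. This is an illustrative toy rather than a model of the chemistry: the function names (`synthesized_strand`, `terminated_lengths`, `read_gel`) and the example template are assumptions made for this sketch, and every possible termination position is assumed to be represented in its lane.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def synthesized_strand(template_3to5):
    """Strand made by the polymerase (5'->3') from a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def terminated_lengths(new_strand, dd_base):
    """Lengths of the fragments that end in the given dideoxynucleotide."""
    return [i + 1 for i, base in enumerate(new_strand) if base == dd_base]


def read_gel(lanes):
    """Read bands across the four lanes from shortest to longest fragment."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "GATTCGGA"                      # hypothetical template, written 3'->5'
strand = synthesized_strand(template)      # what the polymerase synthesizes
lanes = {base: terminated_lengths(strand, base) for base in "ACGT"}
sequence = read_gel(lanes)                 # bottom-to-top read of the gel
print(lanes)
print(sequence)
```

Each lane holds only the fragment lengths that end in one base, so sorting all bands by length and noting the lane of each band recovers the synthesized strand.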
`Sequencing by radiolabeled methods underwent numerous improvements following its invention
`until the mid-1980s. These improvements included the invention of DNA synthesis chemistry
`(8, 9) and, ultimately, of DNA synthesizers that can be used to make oligonucleotide primers for
`the sequencing reaction (providing a 3'-OH for extension); improved enzymes from the original
`E. coli Klenow fragment polymerase (more uniform incorporation of dideoxynucleotides) (10, 11);
`
`Figure 1
`Sanger sequencing. Panel labels: primer; template strand; polymerase + dNTPs with ddNTPs (e.g., ddCTPs, ddGTPs);
`terminated fragments (C, CT, CTAA, CTAAG); long fragments; short fragments; direction of electrophoresis;
`direction of sequence read.
`
`use of 35S- in place of 32P-dATP for radiolabeling (sharper banding and hence longer read lengths);
`and the use of thinner and/or longer polyacrylamide gels (improved separation and longer read
`lengths), among others. Although there were attempts at automating various steps of the process,
`notably the automated pipetting of sequencing reactions and the automated reading of the
`autoradiograph banding patterns, most improvements were not sufficient to make this sequencing
`approach truly scalable to high-throughput needs.
`
`3. IMPACT OF FLUORESCENCE LABELING
`A significant change in the scalability of DNA sequencing was introduced in 1986, when Applied
`Biosystems, Inc. (ABI), commercialized a fluorescent DNA sequencing instrument that had been
`invented in Leroy Hood's laboratory at the California Institute of Technology (12). By replacing
`the use of radiolabeled dATP with reactions primed by fluorescently labeled primers (a different
`fluor for each nucleotide reaction), this instrument eliminated the laborious processes of gel drying,
`X-ray film exposure and developing, reading autoradiographs, and performing hand entry of the
`resulting sequences. In this instrument, a raster scanning laser beam crossed the surface of the gel
`plates to provide an excitation wavelength for the differentially labeled fluorescent primers to
`be detected during the electrophoretic separation of fragments. Thus, significant manual effort
`and several sources of error were eliminated. By use of the initial versions of this instrument,
`great increases were made in the daily throughput of sequencing data production, and several
`
`www.annualreviews.org • Next-Generation Sequencing Platforms
`
`laboratories used newly available automated pipetting stations to decrease the effort and error
`rate of the upstream sequencing reaction pipetting steps (13). During this time, investigators
`made additional improvements to sequencing enzymology and processes, including the ability
`to perform cycled sequencing reactions catalyzed by thermostable sequencing polymerases (14)
`that were patterned after the polymerase chain reaction (PCR), which was first described in 1988
`by Mullis and colleagues (15). By incorporating linear (cycled) amplification into the sequencing
`reaction, one could begin with significantly lower input template DNA and hence could produce
`uniform results across a range of DNA yields (from automated isolation methods in multiwell
`plates, for example). Improvements to chemistry were also important, as fluorescent dye-labeled
`dideoxynucleotides (known as terminators) were introduced (16). Because the terminating
`nucleotide was identified by its attached fluor, all four reactions could be combined into a single
`reaction, greatly decreasing the cost of reagents and the input DNA requirements. Finally, the
`per run throughput of the sequencers increased during this time (17), ultimately permitting
`96 samples to be loaded on one gel. These technological breakthroughs combined to make
`96-well and ultimately 384-well sequencing reactions a major contributor to scalability. These
`high-throughput slab gel fluorescence instruments largely contributed to the sequencing of
`several model organism genomes, and although they were impressive in their capacity to produce
`data, they still contained several manual and hence labor-intensive and error-prone steps. These
`limitations largely centered around casting polyacrylamide gels and loading samples by hand.
`
`4. IMPACT OF CAPILLARY OVER SLAB GEL ELECTROPHORESIS
`The rate-limiting manual steps in slab gels were addressed in 1999 with the introduction of
`capillary sequencing instruments, first the MegaBACE™ sequencer from Molecular Dynamics
`(18) and then the ABI PRISM® 3700. These instruments solved the slab gel problem by directly
`injecting a polymeric separation matrix into capillaries that provided single-nucleotide resolution.
`Samples, by definition, could also be loaded directly from the microtiter plate to the capillaries for
`separation by use of electrical current pulses through a process known as electrokinetic injection.
`Following the separation and detection of reaction products, the polymer matrix was replaced by
`pumping in new matrix. Thus, these instruments eliminated an entire series of rate-limiting steps.
`Downstream activities were further simplified because the capillaries were fixed in their positions,
`so there was no need for tracking lanes on the slab gel image, and subsequent data extraction and
`base-calling were much faster and more accurate. Lastly, the run times were greatly accelerated
`due to the rapid heat dissipation of the capillaries over thick glass plates. The ABI PRISM 3700
`instruments and a later upgrade (ABI 3730) were principal data-generating instruments for the
`human and mouse genome projects, among others. Their scalability and ease of use came at a
`crucial time, when large-scale robotics to perform DNA extraction and sequencing were available
`in specialized facilities for the clone-based front end of the process.
`Indeed, these reference genomes that were produced for major model organisms, human and
`plant, provided not only a fundamental advance for biological studies in these organisms but also
`the basis for the utility of next-generation sequencing instruments. Next-generation sequencing
`is described in the next section.
`
`5. GENERAL PRINCIPLES OF NEXT-GENERATION SEQUENCING
`Beginning in 2005, the traditional Sanger-based approach to DNA sequencing has experienced
`revolutionary changes (19, 20). The previous "top-down" approach involved characterizing large
`clones by low-resolution mapping as a means to organize the high-resolution sequencing of smaller
`
`subclones that were assembled and finished to recapitulate each originating, larger clone (21). The
`sequences of the larger clones were then stitched together at their overlapped ends to reconstruct
`entire chromosomes (with small gaps). By contrast, next-generation sequencing instruments do
`not require a cloning step per se. Rather, the DNA to be sequenced is used to construct a library
`of fragments that have synthetic DNAs (adapters) added covalently to each fragment end by use
`of DNA ligase. These adapters are universal sequences, specific to each platform, that can be
`used to polymerase-amplify the library fragments during specific steps of the process. Another
`difference is that next-generation sequencing does not require performing sequencing reactions in
`microtiter plate wells. Rather, the library fragments are amplified in situ on a solid surface, either
`a bead or a flat glass microfluidic channel that is covalently derivatized with adapter sequences
`that are complementary to those on the library fragments. This amplification is digital in nature;
`in other words, each amplified fragment yields a single focus (a bead- or surface-borne cluster
`of amplified DNA, all of which originated from a single fragment). Amplification is required
`to provide sufficient signal from each of the DNA sequencing reaction steps that determine the
`sequencing data for that library fragment. The scale and throughput of next-generation sequencing
`are often referred to as massively parallel, which is an appropriate descriptor for the process
`that follows fragment amplification to yield sequencing data. In Sanger sequencing, the reaction
`that produces the nested fragment set is distinct from the process that separates and detects the
`fragments by size to produce a linear sequence of bases. In massively parallel sequencing, the
`process is a stepwise reaction series that consists of (a) a nucleotide addition step, (b) a detection
`step that determines the identity of the incorporated nucleotides on each fragment focus being
`sequenced, and (c) a wash step that may include chemistry to remove fluorescent labels or blocking
`groups. In essence, next-generation sequencing instruments conduct sequencing and detection
`simultaneously rather than as distinct processes, one of which is completed before the other takes
`place. Moreover, these steps are performed in a format that allows hundreds of thousands to
`billions of reaction foci to be sequenced during each instrument run and, hence, at a capacity per
`instrument that can produce enormous data sets.
`One final difference between Sanger sequencing data and next-generation sequencing data is
`the read length, or the number of nucleotides obtained from each fragment being sequenced.
`In Sanger sequencing, the read length was determined largely by a combination of gel-related
`factors, such as the percentage of polyacrylamide, the electrophoresis conditions, the time of
`separation, and the length and thickness of the gel. In next-generation sequencing, the read
`length is a function of the signal-to-noise ratio. Because the sources of noise differ according to
`the technology, specifics are described for each type of sequencing below. However, the major
`impact of the signal-to-noise ratio is to limit the read length from all next-generation sequencing
`instruments, all of which produce shorter reads than does Sanger sequencing.
`Shorter read lengths, in turn, are a differentiation point because, although short reads can be
`assembled as are traditional Sanger reads, based on shared sequence, the lower extent of shared
`sequence (due to read length) limits the ability to assemble these reads, so the overall length of
`contiguous sequence that can be assembled is limited. This limitation is exacerbated by genome
`size and complexity (e.g., repetitive content and gene families), so genomes such as that of the
`human (3 Gb and ~48% repetitive content) cannot be reassembled from the component reads
`of a whole-genome shotgun of next-generation sequencing data. Rather, because a high-quality
`reference genome exists for many model organisms and for humans, sequence read alignment is a
`more practical approach to sequencing data analysis from next-generation read lengths. Specific
`algorithms to approach short read alignment have been devised; they provide a score-based metric
`indicative of that sequence's best fit in the genome, whereby sequences that contain mostly or
`entirely repetitive content score lowest due to the uncertainty of their origin (22, 23). Improved
`
`Figure 2
`Comparison between (a) paired-end and (b) mate-pair sequencing library-construction processes.
`Panel labels: (a) genomic DNA; fragment (200-500 bp); ligate adapters; generate clusters; sequence first end;
`regenerate clusters and sequence paired end. (b) Genomic DNA; fragment (2-5 kb); biotinylate ends;
`circularize; fragment (400-600 bp); enrich biotinylated fragments; ligate adapters; generate clusters;
`sequence first end; regenerate clusters and sequence paired end.
`
`certainty can be obtained from longer read lengths, and several next-generation sequencers have
`offered increases in read length over time and refinement of their signal-to-noise characteristics
`to allow this certainty. Another fundamental improvement has resulted from so-called paired-end
`sequencing, namely producing sequence data from both ends of each library fragment. Read pairs
`can be obtained by one of two mechanisms: (a) paired ends or (b) mate pair (Figure 2).
`In paired-end sequencing, a linear fragment with a length of less than 1 kb has adapter sequences
`at each end with different priming sites on each adapter. The sequencing instrument is designed to
`sequence from one adapter priming site by use of the stepwise sequencing described above; then,
`
`in a subsequent reaction, the opposite adapter is primed and sequence data are obtained. These
`reads are paired with one another during the alignment step in data analysis, which provides higher
`overall certainty of placement than does a single end read of the same length. Most alignment
`algorithms also take into account the average length of fragments in the sequencing library to
`make the most accurate placement possible. In mate-pair sequencing, the library is constructed
`of fragments longer than 1 kb, and instead of ligating two adapters at each fragment end, the
`fragment is circularized around a single adapter and both fragment ends ligate to the adapter
`ends (24). These circular molecules are then treated by various molecular biology schemes (e.g.,
`by type IIS endonuclease digestion or by nick translation) to produce a single linear fragment
`that holds both ends of the original DNA fragment with a central adapter. The remaining DNA
`remnants are removed by washing steps, as the central adapter that carries the mate-pair ends
`is biotinylated and can be captured using streptavidin magnetic beads. Typically, the resulting
`linear fragments have distinct adapters ligated to their ends, and sequencing is obtained from two
`sequential reads as described above. Again, the resulting reads are aligned as a pair to the genome
`of interest, wherein the separation distance between the reads is longer overall than that obtained
`with the paired-end approach. Often, mate-pair and paired-end reads are used in combination
`to achieve genome coverage when attempting longer-range assemblies through difficult regions
`of a genome or when attempting to assemble a genome for the first time (de novo sequencing)
`(25). In this combined coverage approach, the mate-pair reads provide longer-range order and
`orientation (a separation of up to 20 kb is possible), and the paired ends provide the ability to
`assemble, in a localized way, difficult-to-sequence regions that can then be layered on top of the
`scaffold provided by an assembly of mate-pair reads.
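The way a read pair improves placement certainty can be illustrated with a small sketch. It assumes exact matching against a toy reference and a known expected fragment length; `find_all`, `place_pair`, the sequences, and the tolerance are hypothetical choices for this example and do not describe any published aligner.

```python
def find_all(reference, read):
    """Return every 0-based position at which the read matches exactly."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def place_pair(reference, read1, read2, expected_fragment, tolerance):
    """Pick the placement whose implied fragment length best fits the library."""
    mate = reverse_complement(read2)       # read 2 comes from the opposite strand
    best = None
    for p1 in find_all(reference, read1):
        for p2 in find_all(reference, mate):
            fragment = p2 + len(mate) - p1
            deviation = abs(fragment - expected_fragment)
            if fragment <= 0 or deviation > tolerance:
                continue
            if best is None or deviation < abs(best[2] - expected_fragment):
                best = (p1, p2, fragment)
    return best


repeat = "GATTACAGATTACA"                  # toy repeat present twice
reference = repeat + "CCGTAGCTAGGCTTAACG" + repeat + "TTGCACGGATCCATGC"
read1 = repeat                             # ambiguous on its own
read2 = reverse_complement("GGATCCATGC")   # unique mate from the other end
print(find_all(reference, read1))
print(place_pair(reference, read1, read2, expected_fragment=30, tolerance=10))
```

Read 1 alone matches both copies of the repeat; only one of those positions lies at a plausible distance from the uniquely placed mate, so the pair resolves the ambiguity.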
`
`6. DIGITAL DATA TYPE AND RAMIFICATIONS
`
`Next-generation sequencing libraries, carefully constructed to avoid sources of biasing and
`duplication, are highly digital. Specifically, the fact that each read originates from a consistently
`detected focus that results from the amplification of a single library fragment means that the
`data are inherently digital in nature. Thus, a quantitation of abundance can be inferred from this
`one-to-one relationship, which has ramifications for biological systems that are being investigated
`by next-generation sequencing. For example, chromosomal amplifications that are common in
`cancer genomes can be quantitated with respect to the extent of amplification (ploidy) on each
`chromosome (26). Similarly, the read prevalence of expressed genes identified by RNA sequencing
`can be directly correlated to their expression level and compared across replicates or with other
`samples from the same study (27). In population-based studies that use next-generation sequencing
`to characterize the individual species present in an isolate (metagenomics), a similar ability
`to correlate the presence of each species as a proportion of the overall population can be derived
`from the digital nature of next-generation sequencing data (28).
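Because each read counts one library fragment, simple ratios of read counts already carry quantitative meaning. The sketch below is a deliberately naive illustration under that assumption; the function names, the species labels, and the counts are invented for the example, and real analyses would also correct for feature length, G + C bias, duplicates, and sample purity.

```python
def relative_abundance(read_counts):
    """Fraction of all reads assigned to each feature (species, gene, ...)."""
    total = sum(read_counts.values())
    return {name: count / total for name, count in read_counts.items()}


def copy_number(sample_reads, control_reads, sample_total, control_total,
                control_copies=2):
    """Estimate copies of a region from library-size-normalized read depth."""
    depth_ratio = (sample_reads / sample_total) / (control_reads / control_total)
    return control_copies * depth_ratio


# Hypothetical metagenomic read counts per species.
counts = {"species_A": 600, "species_B": 300, "species_C": 100}
print(relative_abundance(counts))

# Hypothetical region with 3x the normalized depth of a diploid control.
print(copy_number(3000, 1000, 1_000_000, 1_000_000))
```

The same ratio logic applies to expression data, where the features are genes instead of species or chromosomal regions.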
`
`7. SOURCES OF NOISE AND ERROR MODELS
`As mentioned above, although read length in next-generation sequencing is not limited by an
`electrophoretic separation step, the major limitation of read length is the signal-to-noise ratio during
`stepwise sequencing. Depending on the platform, the contributors to noise in the sequencing reaction
`differ, and there is interplay between the sources of noise and the sequencing errors that may
`result. This interplay gives rise to what is commonly referred to as the error model and is highly
`instrument and chemistry specific. In general, one typically explores both read-length limitations
`and error types by sequencing a reference set of genes or an entire genome, then comparing the
`
`sequence obtained with the high-quality reference gene set or genome (29). In this approach, the
`different types of errors (substitution errors or insertion and deletion errors) can be identified, and
`the error model (random versus systematic errors) can be defined. Representation biases can also be
`uncovered by this approach when one examines the aligned reads for evidence of complete or
`partial lack of representation. If this lack of representation can be classified (for example, regions with
`>95% G + C content), then the bias can be defined. Typically, the more sequence reads are
`examined, the better defined are the error model, coverage biases, and their contributing sources. For
`example, the use of PCR or other types of enzymatic amplification may contribute systematic errors
`during the library construction or amplification processes described above. One might address this
`problem, independently of the instrument system used, by employing a high-fidelity polymerase
`and/or by limiting the number of amplification cycles when possible. Some sources of error,
`however, are simply instrument specific and may not be readily addressed by the end user (although
`they may improve over time with new chemistry and software from the manufacturer). As discussed
`below, instruments that use library amplification to enhance signal produced from the sequencing
`process forego some of the signal-to-noise issues that are experienced in single-molecule systems
`because there are so many identical fragments being sequenced per focus that the number of
`fragments that are not misreporting far exceeds the number of fragments that are. In general, noise
`accumulates during the stepwise sequencing process and ultimately limits the read length obtained
`once the signal from any base incorporation step is outcompeted by incorrect or out-of-phase
`incorporation events, residual signal from prior reactions or reactants, and other sources of noise.
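As a rough illustration of how error types are tallied once a read has been aligned to a trusted reference, consider the sketch below. It assumes the alignment is already available as two equal-length strings with `-` marking gaps; `classify_errors`, `error_rate`, and the example alignment are hypothetical and are not drawn from any specific analysis package.

```python
from collections import Counter


def classify_errors(aligned_ref, aligned_read):
    """Tally matches and error types from a gapped pairwise alignment."""
    if len(aligned_ref) != len(aligned_read):
        raise ValueError("aligned strings must have equal length")
    tally = Counter()
    for ref_base, read_base in zip(aligned_ref, aligned_read):
        if ref_base == read_base:
            tally["match"] += 1
        elif ref_base == "-":
            tally["insertion"] += 1      # base present only in the read
        elif read_base == "-":
            tally["deletion"] += 1       # base present only in the reference
        else:
            tally["substitution"] += 1
    return tally


def error_rate(tally):
    """Errors per aligned column."""
    errors = tally["substitution"] + tally["insertion"] + tally["deletion"]
    return errors / sum(tally.values())


tally = classify_errors("ACGT-ACGTA", "ACCTTAC-TA")   # made-up alignment
print(dict(tally))
print(error_rate(tally))
```

Repeating such a tally over many reads, and recording the cycle or sequence context of each error, is what allows random errors to be distinguished from systematic ones.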
`
`
`8. NEXT-GENERATION SEQUENCING WITH REVERSIBLE
`DYE TERMINATORS
`It is informative to discuss some of the predominant approaches to next-generation sequencing
`as a means of tying together the concepts presented herein. The first instrument system involves
`the use of reversible dye terminators in enzymatic sequencing of amplified foci of library
`fragments. This system was initially developed in 2007 by Solexa and was subsequently acquired by
`Illumina®, Inc. (30). The library work flow follows steps similar to those outlined above, namely
`fragmentation of high-molecular-weight DNA, enzymatic trimming, and adenylation of the
`fragment ends and ligation of specific adapters (Figure 3a). The Illumina microfluidic conduit is a
`flow cell composed of flat glass with eight microfluidic channels, each decorated by covalent
`attachment of adapter sequences complementary to the library adapters. By careful quantitation of
`the library concentration, a precisely diluted solution of library fragments is amplified in situ on
`the flow cell surfaces by use of a bridge amplification step to produce foci for sequencing (clusters)
`(Figure 3b). A subsequent step chemically effects the release of fragment ends carrying the same
`adapter, which is then primed with a complementary synthetic DNA (primer) to provide free
`3'-OH groups that can be extended in subsequent stepwise sequencing reactions. In reversible dye
`terminator sequencing, all four nucleotides are provided in each cycle because each nucleotide
`carries an identifying fluorescent label. The sequencing occurs as single-nucleotide addition
`reactions because a blocking group exists at the 3'-OH position of the ribose sugar, preventing
`additional base incorporation reactions by the polymerase. As such, the series of events in each
`step includes the following in order of occurrence: (a) The nucleotide is added by polymerase,
`(b) unincorporated nucleotides are washed away, (c) the flow cell is imaged on both inner
`surfaces to identify each cluster that is reporting a fluorescent signal, (d) the fluorescent groups are
`chemically cleaved, and (e) the 3'-OH is chemically deblocked (Figure 3c). This series of steps
`is repeated for up to 150 nucleotide addition reactions, whereupon the second read preparations
`begin. To read from the opposite end of each fragment cluster, the instrument first removes the
`
`Figure 3
`(a) Illumina® library-construction process. (b) Illumina cluster generation by bridge amplification. (c) Sequencing by synthesis with
`reversible dye terminators.
`Panel labels: (a) DNA fragments; blunting by fill-in and exonuclease; phosphorylation; addition of A-overhang; ligation to adapters.
`(b) Grafted flow cell (P7, P5); template hybridization; initial extension; denaturation; first and second cycles of denaturation,
`annealing, and extension; cluster amplification; P5 linearization; block with ddNTPs; denaturation and sequencing primer hybridization.
`(c) Incorporate; detect; deblock; cleave fluor.
`
`synthesized strands by denaturation and regenerates the clusters by performing a limited bridge
`amplification to improve the signal-to-noise ratio in the second read. After the amplification step,
`the opposite ends of the fragments are released from the flow cell surfaces by a different chemical
`cleavage reagent (corresponding to a labile group on the reverse adapter), and the fragments are
`primed with the reverse primer. Sequencing proceeds as described above. All of these steps occur
`on-instrument with the flow cell in place and without manual intervention, so the correlation of
`position from forward (first) to reverse (second) reads is maintained and yields a very high read-pair
`concordance upon read alignment to the reference genome.
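The per-cycle series of events, (a) incorporate, (b) wash, (c) image, (d) cleave the fluor, and (e) deblock, can be mimicked with a small idealized simulation. The `Cluster` class, `sequence_by_synthesis`, and the template strings are assumptions made for this sketch; it treats every cluster as perfectly in phase and error free, which real chemistry is not.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


class Cluster:
    """One amplified focus; all strands are assumed to stay in phase."""

    def __init__(self, template):
        self.template = template   # template bases in the order they are copied
        self.position = 0          # index of the next template base to copy
        self.blocked = False       # True while a 3' blocking group is present
        self.fluor = None          # label currently attached to the cluster

    def incorporate(self):
        # (a) polymerase adds exactly one labeled, 3'-blocked nucleotide
        if not self.blocked and self.position < len(self.template):
            self.fluor = COMPLEMENT[self.template[self.position]]
            self.position += 1
            self.blocked = True

    def image(self):
        # (c) report the label seen on this cluster, or "N" if there is none
        return self.fluor or "N"

    def cleave_and_deblock(self):
        # (d) remove the fluorescent group, (e) restore a free 3'-OH
        self.fluor = None
        self.blocked = False


def sequence_by_synthesis(templates, cycles):
    clusters = [Cluster(t) for t in templates]
    reads = [""] * len(clusters)
    for _ in range(cycles):
        for cluster in clusters:
            cluster.incorporate()
        # (b) the wash removes unincorporated nucleotides; nothing to model here
        reads = [read + c.image() for read, c in zip(reads, clusters)]
        for cluster in clusters:
            cluster.cleave_and_deblock()
    return reads


reads = sequence_by_synthesis(["TACGGA", "GGCATT"], cycles=4)
print(reads)
```

Every cluster advances by one base per cycle and is imaged in the same pass, which is the sense in which the process is massively parallel; the number of cycles sets the read length.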
`Illumina data have an error model that is described as having decreasing accuracy with
`increasing nucleotide addition steps. When errors occur, they are predominantly substitution
`errors, in which an incorrect nucleotide identity is assigned to the base. The error percentage
`of most Illumina reads is approximately 0.5% at best (i.e., 1 error in 200 bases). Sources of noise
`include (a) phasing, wherein increasing numbers of fragments fall out of phase with the majority
`
`of fragments in the cluster due to incomplete deblocking in prior cycles or, conversely, due to lack
`of a blocking group that allows an additional base to be incorporated and (b) residual fluorescence
`interference noise due to incomplete fluorescent label cleavage from previous cycles.
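A simple way to see why phasing caps the read length is to assume that, in every cycle, a fixed small fraction of in-phase strands falls behind (incomplete deblocking) and another fixed fraction runs ahead (missing blocking group). Under that assumption, and ignoring strands that drift back into phase, the in-phase fraction decays geometrically. The per-cycle rates and the 50% cutoff below are illustrative guesses, not measured instrument values.

```python
import math


def in_phase_fraction(cycles, p_lag, p_lead):
    """Fraction of strands in a cluster still in phase after a number of cycles."""
    keep = 1.0 - p_lag - p_lead
    if not 0.0 < keep < 1.0:
        raise ValueError("p_lag + p_lead must lie strictly between 0 and 1")
    return keep ** cycles


def max_cycles(p_lag, p_lead, min_fraction):
    """Largest cycle count at which the in-phase fraction is >= min_fraction."""
    keep = 1.0 - p_lag - p_lead
    if not 0.0 < keep < 1.0 or not 0.0 < min_fraction < 1.0:
        raise ValueError("rates and min_fraction must lie strictly between 0 and 1")
    return math.floor(math.log(min_fraction) / math.log(keep))


# Assumed rates: 0.3% of strands lag and 0.2% lead in each cycle.
limit = max_cycles(p_lag=0.003, p_lead=0.002, min_fraction=0.5)
print(limit, in_phase_fraction(limit, 0.003, 0.002))
```

In this toy model, halving the per-cycle loss roughly doubles the usable number of cycles, which is consistent with read length growing as deblocking and fluor-removal chemistry becomes more complete.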
`Read lengths have increased from the original Solexa instrument at 25-bp single-end reads, to
`the current Illumina HiSeq 2000 instrument's 150-bp paired-end reads. Increased read length has
`been one component that is contributing to an explosion in throughput-per-instrument run over
`a relatively short time frame (5 years), from 1 Gb for the Solexa 1G to 600 Gb for the HiSeq 2000.
`The latter instrument can thus produce sufficient data coverage for six whole-human genome
`sequences in approximately 11 days. The coverage per genome needed is approximately 30-fold,
`and with a 3-Gb genome wherein approximately 90% of the reads will map, 100 Gb are required
`to produce the necessary 90 Gb of data per genome.
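The coverage arithmetic in this paragraph can be checked directly. The helper names below are made up for this sketch, and the mapped fraction is passed as a whole-number percentage so that the results stay exact.

```python
def required_yield_gb(genome_gb, fold_coverage, mapped_percent):
    """Raw yield needed so that the mapped reads reach the target coverage."""
    return genome_gb * fold_coverage * 100 / mapped_percent


def genomes_per_run(run_yield_gb, genome_gb, fold_coverage, mapped_percent):
    """Whole genomes that one run can cover at the target depth."""
    return (run_yield_gb * mapped_percent) // (genome_gb * fold_coverage * 100)


# Values from the text: 3-Gb genome, 30-fold coverage, ~90% of reads mapping.
print(required_yield_gb(3, 30, 90))       # raw Gb needed per genome
print(genomes_per_run(600, 3, 30, 90))    # genomes per 600-Gb run
```

With the figures quoted above, 30 × 3 Gb = 90 Gb of mapped data requires 100 Gb of raw data, so a 600-Gb run covers six genomes.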
`The other contributor to throughput has been the ability to use increasingly more-concentrated
`library dilutions onto the flow cell, resulting in significant increases in cluster density. The HiSeq
`2000 was the first instrument to read clusters from both surfaces of the flow cell channel, effectively
`doubling the throughput per run. Improvements in chemistry have made deblocking and
`fluor-removal steps more complete; polymerase engineering has improved incorporation fidelity and
`decreased errors and has decreased the G + C biases associated with the instrument at the bridge
`amplification step.
`
`
`9. NEXT-GENERATION SEQUENCING BY pH CHANGE MONITORING
`A completely different approach to next-generation sequencing is embodied in an instrument
`system that detects the release of hydrogen ions, a by-product of nucleotide incorporation, as
`quantitated changes in pH through a novel coupled silicon detector. This instrument was commercialized
`in 2010 by Ion Torrent (31), a company that was later purchased by Life Technologies™ Corp.
`For this approach, library construction includes DNA fragmentation, enzymatic end polishing,
`and adapter ligation. Amplification of library fragments occurs by a unique approach known as
`emulsion PCR, which quantitates the library fragments and dilutes them to be mixed in equimolar
`quantities with small beads, PCR reactants, and DNA polymerase molecules (32). The beads have
`covalently linked adapter complementary sequences on their surfaces to facilitate amplification
`on the bead. This mixture is then shaken to form an emulsion so that the beads and DNA are
`encapsulated in a 1:1 ratio (on average) in oil micelles that also contain the reactants needed for
`PCR-based amplification. The resulting mixture is placed into a specific apparatus that performs
`thermal cycling of the emulsion, effectively allowing hundreds of thousands of individual PCR
`amplifications to occur in parallel in one vessel. Subsequent steps are required first to separate
`the oil from the aqueous solution and beads (so-called emulsion breaking) and then to enrich the
`beads that were successfully amplified (to remove beads with insufficient DNA). Enriched beads
`are primed for sequencing by annealing a