Berglund et al. Investigative Genetics 2011, 2:23
http://www.investigativegenetics.com/content/2/1/23

REVIEW    Open Access

Next-generation sequencing technologies and applications for human genetic history and forensics

Eva C Berglund, Anna Kiialainen and Ann-Christine Syvänen*

Abstract
Rapid advances in the development of sequencing technologies in recent years have enabled an increasing number of applications in biology and medicine. Here, we review key technical aspects of the preparation of DNA templates for sequencing, the biochemical reaction principles and assay formats underlying next-generation sequencing systems, methods for imaging and base calling, quality control, and bioinformatic approaches for sequence alignment, variant calling and assembly. We also discuss some of the most important advances that the new sequencing technologies have brought to the fields of human population genetics, human genetic history and forensic genetics.

Background
Determining the DNA sequence is the most comprehensive way of obtaining information about the genome of any living organism. For decades, Sanger sequencing [1] using fluorescently labeled terminating nucleotides and electrophoresis has been the gold standard sequencing technology. Sanger sequencing made an early impact in the field of microbial genomics, with the first complete bacterial genome, Haemophilus influenzae, sequenced in 1995 [2]. Multicenter collaborations using numerous sequencing instruments and automated sample preparation also made it possible to use Sanger sequencing in the Human Genome Project, which took more than 10 years and US$2.7 billion to complete [3,4].

In recent years, we have witnessed the rapid development of a new generation of DNA sequencing systems, followed by a multitude of novel applications in biology and medicine. The major advantage of the new ‘second-generation’ or ‘massively parallel’ sequencing technologies, compared to Sanger sequencing, is their considerably higher throughput and thereby lower cost per sequenced base. On a second-generation sequencing (SGS) machine, several human genomes can be sequenced in a single run in a matter of days. Here, we review recent technological advances of SGS technologies and discuss the bioinformatic and computational implications of the sequencing revolution. Finally, we highlight some applications of SGS technology with a focus on human population genetics and genetic history, and genetic forensics.

* Correspondence: ann-christine.syvanen@medsci.uu.se
Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, 751 85 Uppsala, Sweden

Second-generation sequencing technologies
There are three major SGS systems that are routinely used in many laboratories today. The first system to become commercially available, in 2005, was the Genome Sequencer from 454 Life Sciences (Branford, CT, USA) (later acquired by Roche [5]), which was also the first SGS technology used to sequence a complete human genome, that of Dr. James D. Watson [6]. The Genome Analyzer, first conceived by Solexa and later further developed by Illumina (San Diego, CA, USA) [7], was launched in 2006, and the SOLiD system from Applied Biosystems [8] (now part of Life Technologies (Carlsbad, CA, USA)) in 2007. The key steps of a sequencing project are the same for all of these technologies: preparation and amplification of template DNA, distribution of templates on a solid support, sequencing and imaging, base calling, quality control and data analysis (Figure 1).

In terms of applications, there are two major types of projects: de novo sequencing and resequencing. In a de novo sequencing project, the genome of an organism is sequenced for the first time. In contrast, in resequencing applications, the genome or parts of it are sequenced for a species where a reference sequence is already available. This difference affects both the selection of sequencing strategy and the data analysis (further discussed below). In human forensics and population genetics the resequencing approach is used, but in microbial forensics both de novo sequencing and resequencing of microbial genomes may be required.

Figure 1 Steps of a sequencing experiment. Black arrows indicate steps that are common for all second-generation sequencing (SGS) technologies, white arrows refer to the Illumina systems, and grey arrows refer to the Roche 454 and SOLiD systems.
[Flowchart boxes: DNA sample; Library preparation; Distribution on solid support; PCR amplification; Sequencing and imaging; Base/color calling; Quality control; Data analysis]

© 2011 Berglund et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Two common measures of the amount of sequence data generated in a project are the sequencing depth and breadth. Sequencing depth, or coverage, is the average number of times each base in the genome is sequenced. For example, to sequence a 3 Gb human genome to 30× coverage, 90 Gb of sequence data is needed. The coverage will be uneven over the genome, however, and sequencing breadth, sometimes also referred to as genome coverage, is the percentage of the genome that is covered by sequence reads.

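The two measures defined above can be illustrated with a minimal Python sketch. The function names and the toy per-base counts are hypothetical and chosen only for illustration; the 3 Gb and 30× figures are the example from the text.

```python
def required_bases(genome_size_bp, depth):
    """Total amount of sequence needed to reach a given average depth."""
    return genome_size_bp * depth


def depth_and_breadth(per_base_counts):
    """Return (average depth, breadth in percent) from per-base read counts."""
    n = len(per_base_counts)
    depth = sum(per_base_counts) / n
    breadth = 100.0 * sum(1 for c in per_base_counts if c > 0) / n
    return depth, breadth


# The 3 Gb human genome at 30x coverage from the text: 90 Gb of sequence.
print(required_bases(3_000_000_000, 30) / 1e9)  # 90.0

# Toy 10-base "genome": uneven coverage, two positions not covered at all.
print(depth_and_breadth([4, 5, 3, 0, 0, 6, 2, 4, 3, 3]))  # (3.0, 80.0)
```

The toy example shows why both numbers are reported: an average depth of 3 still leaves 20% of the positions without any read.
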
DNA samples for sequencing
High-quality DNA in sufficient quantity is the basis for any successful sequencing experiment. For most sequencing applications, 1 to 5 µg of purified DNA is needed, an amount that may not always be available. Whole genome amplification (WGA) has frequently been used to increase the amount of DNA for genotyping [9] and can also be applied in combination with SGS. Several microbial genomes have been sequenced using SGS after WGA (for example, the genome of uncultured bacterial symbionts of termites isolated from a single host cell [10]). WGA was also recently used to amplify DNA from single cells from primary breast tumors, and although sequence data was retrieved from only 6% of the genome of each cell, this genomic representation was enough to identify subpopulations of cancer cells by copy number variations [11].

SGS can also be used to detect rare and unknown variants in genomic regions of interest in a cost-efficient way, and in a larger number of samples than by whole genome sequencing. Another advantage of targeted sequencing is the reduced issue of sequence reads aligning to multiple locations in the genome. The most commonly used methods for enrichment of genomic regions for sequencing are based either on hybridization to biotinylated probes in solution or probes immobilized on microarrays, or on multiplexed amplification by PCR (reviewed in [12]). The recently developed selector probe technology, which is based on rolling circle amplification, provides efficient and highly multiplexed enrichment of small regions totaling up to 1 Mb in size, and is particularly useful for ultra-deep sequencing at low cost and high specificity [13]. Sequencing of human exomes enriched by hybridization-based capture in solution is becoming widely used, and has proven to be particularly successful for identification of mutated genes underlying monogenic disorders (see, for example, [14-17]). The methods for hybridization-based capture are also applicable to custom-selected genomic regions of interest [18].

Preparation of sequencing libraries
The DNA samples to be sequenced are first converted into one of two main types of sequencing libraries: fragment libraries or mate-pair libraries. The first step in the preparation of a sequencing library is to fragment the DNA sample, usually using sonication or nebulization. For preparation of fragment libraries, sequencing adapters are ligated to both ends of the DNA fragments, followed by PCR amplification using primers complementary to the adapters.

In the Illumina SGS technology, adapter-ligated DNA fragments are amplified directly in the flow cell subsequently used for sequencing. Each flow cell has eight channels (lanes) coated with oligonucleotides that are complementary to the adapters. The adapter-ligated DNA fragments are hybridized to the flow cell, in which they are distributed randomly and amplified by a process called bridge amplification. After amplification, DNA molecules are linearized to form clusters, each of which consists of about 1,000 copies of the original DNA molecule at that position.

In the 454 and SOLiD technologies, adapter-ligated fragments are hybridized to beads coated with an oligonucleotide that is complementary to one of the adapters for amplification in a water-in-oil emulsion PCR. Each water droplet constitutes a microreactor containing the PCR reagents and optimally a single bead with a single immobilized DNA fragment. Thus, multiple PCRs can be performed in parallel in a single tube. After breaking the emulsion, the beads, which are now coated with thousands or millions of copies of the original DNA molecule, are loaded onto the solid support for sequencing. In the 454 system, the solid support is called PicoTiterPlate and consists of wells that can fit a single DNA-coated bead each. SOLiD uses a glass slide to which the beads are distributed randomly.

The amplified fragments are then sequenced either from one end (single-end) or from both ends (paired-end). Paired reads allow more accurate alignment to a reference genome, and are also very useful to resolve repeats and improve assembly in de novo sequencing projects. The Illumina system generates sequence reads of the same length from both ends, whereas the second read from SOLiD is shorter (Table 1). The 454 system currently does not support paired-end sequencing of fragment libraries.

Mate-pair libraries are constructed by circularizing fragmented DNA, thereby bringing the two ends of the original DNA fragment adjacent to each other (Figure 2). After fragmentation of the circular DNA, the fragment containing the ends of the original linear DNA is selected using biotin capture. Sequencing both ends of the selected fragment will yield reads that are separated by the distance of the original fragment. In order to avoid chimeric sequence reads that span over both original fragment ends, the 454 and SOLiD systems include an internal adapter. In the Illumina mate-pair preparation, no internal adapter is used, and to minimize the risk of sequencing over the original junction the recommended read length is limited to 36 nucleotides. Mate-pair libraries allow larger insert sizes (2 to 20 kb) than paired-end sequencing of fragment libraries. Drawbacks of mate-pair sequencing are that the laboratory protocols are more complicated and that a substantially larger amount of DNA (5 to 120 µg) is required. In contrast to paired-end reads, which are oriented towards each other, mate-pair reads are either both oriented outwards from the original fragment or both have the same orientation (Figure 2), which needs to be accounted for in the data analysis. Large inserts are especially valuable in de novo sequencing projects, where they can substantially improve scaffolding (ordering of assembled contigs). Mate-pair sequencing is not used as frequently in resequencing projects, where DNA resources are often limited and the analysis is mainly based on alignment to a reference genome.

Table 1 Characteristics of second-generation and third-generation sequencing instruments

Instrument | Read length (nucleotides) | No. of reads (a) | Output (Gb) (a) | No. of samples (b) | Run time | Advantages | Disadvantages
Roche 454 GS FLX+ | 700 (c) | 1 × 10^6 | 0.7 | 192 (d) | 23 h | Long reads, short run time | Homopolymer errors, expensive
Illumina HiSeq2000 | 100 (e) | 3 × 10^9 | 600 | 384 | 11 days (f) | High yield | No. of index tags limiting
Life Technologies SOLiD 5500xl | 75 (g) | 1.5 × 10^9 | 180 | 1,152 | 14 days (f) | Inherent error correction | Short reads
Roche 454 GS Junior | 400 (h) | 1 × 10^5 | 0.035 | 132 | 9 h | Long reads | Homopolymer errors, expensive
Illumina MiSeq | 150 | 5 × 10^6 | 1.5 | 96 | 27 h | Short run time, ease of use | Expensive per base
Ion Torrent PGM, Ion 316 chip | > 100 (h) | 1 × 10^6 | 0.1 | 16 | 2 h | Short run time, low reagent cost | Not well evaluated
Helicos BioSciences HeliScope | 35 (h) | 1 × 10^9 | 35 | 4,800 | 8 days | SMS, sequences RNA | Short reads, high error rate
Pacific Biosciences PacBio RS | > 1,000 (h) | 1 × 10^5 | 0.1 | 1 | 90 min | SMS, long reads, short run time | High error rate, low yield

Most of this information is subject to rapid changes, and the aim of this table is not to present absolute numbers but to provide a general comparison between different sequencing systems.
(a) Numbers calculated for two flow cells on HiSeq2000 and SOLiD 5500xl.
(b) Calculated as no. of index tags (provided by the sequencing company) × no. of divisions on solid support.
(c) Average for single-end sequencing; paired-end reads are shorter.
(d) No. of reads decreases when the PicoTiterPlate is divided.
(e) 36 nucleotides for mate-pair reads.
(f) Run time depends on the read length, and on whether one or two flow cells are used.
(g) Second read in paired-end sequencing is limited to 35 nucleotides, and mate-pair reads to 60 nucleotides.
(h) Average.
SMS = single molecule sequencing.

Introducing an additional index tag (barcoding) to each DNA fragment makes it possible to sequence pooled samples that can be distinguished in silico after sequencing. Multiplexing is useful in applications where a relatively small amount of data is needed from each sample, such as sequencing of small genomes or enriched regions of large genomes (see, for example, [19,20]). As the capacity of the sequencing instruments has increased, multiplex sequencing of indexed samples has become more and more important to minimize sequencing costs. Indexing also decreases the risk of sample mix-ups and contaminations during library preparation. Currently, Illumina provides 24 different index tags, Life Technologies 96 and Roche 12 (Table 1). Additional index tags for Illumina (48 tags in total) and 454 (120 additional tags) can be purchased from other companies. It is also possible to use custom-designed index tags [21,22].

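To make the idea of distinguishing pooled samples in silico concrete, the following Python sketch assigns reads to samples by their index tag. It is only an illustration under stated assumptions: the 6-nucleotide tags, the sample names and the one-mismatch tolerance are hypothetical, and vendor pipelines may use different rules.

```python
from collections import defaultdict

# Hypothetical index tags; real tags are defined by the library kit in use.
INDEX_TO_SAMPLE = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, index_to_sample, max_mismatches=1):
    """Group (index_read, insert_read) pairs by sample.

    A read is assigned only if exactly one known tag lies within
    max_mismatches of its index read; otherwise it is 'undetermined'.
    """
    bins = defaultdict(list)
    for index_read, insert_read in reads:
        hits = [sample for tag, sample in index_to_sample.items()
                if len(tag) == len(index_read)
                and hamming(tag, index_read) <= max_mismatches]
        sample = hits[0] if len(hits) == 1 else "undetermined"
        bins[sample].append(insert_read)
    return dict(bins)


pooled = [("ACGTAC", "AAAA"), ("ACGTAA", "CCCC"),
          ("TGCATG", "GGGG"), ("GGGGGG", "TTTT")]
print(demultiplex(pooled, INDEX_TO_SAMPLE))
```

Tolerating a mismatch rescues reads with a sequencing error in the tag, which is only safe when all tags in the pool differ from each other at several positions.
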
Sequencing and imaging principles
Sequencing-by-synthesis
The Illumina and 454 technologies are based on sequencing-by-synthesis. A DNA polymerase is used to extend a sequencing primer by incorporating nucleotides that form a growing sequence complementary to the template DNA. In the Illumina system, fluorescent reversibly terminating nucleotides are used. All four nucleotides are added at the same time, each with a unique fluorescent label, which allows incorporation of one base per cycle into each template molecule [23] (Figure 3a). After incorporation and fluorescence registration at four wavelengths, the terminating and fluorescent moieties are removed from the nucleotides to allow the next sequencing cycle. In 2010, Illumina released the HiSeq2000 instrument, which uses the same chemistry as the original Genome Analyzer instrument, but has improved imaging optics and can process two flow cells in parallel. The HiSeq2000 system has the highest throughput of all currently available SGS instruments, with around 600 Gb of sequence produced per run (Table 1). Examples of what can be achieved with the current capacity of HiSeq2000 are shown in Table 2. Sequencing errors are primarily substitution errors and occur more frequently in the distal bases of a read.

Figure 2 Principles for construction of mate-pair sequencing libraries. (a) Preparation of Illumina mate-pair libraries. Fragments are end-repaired using biotinylated nucleotides (1). After circularization, the two fragment ends (green and red) become located adjacent to each other (2). The circularized DNA is fragmented, and biotinylated fragments are purified by affinity capture. Sequencing adapters (A1 and A2) are ligated to the ends of the captured fragments (3), and the fragments are hybridized to a flow cell, in which they are bridge amplified. The first sequence read is obtained with adapter A2 bound to the flow cell (4). The complementary strand is synthesized and linearized with adapter A1 bound to the flow cell, and the second sequence read is obtained (5). The two sequence reads (arrows) will be directed outwards from the original fragment (6). (b) Preparation of Roche 454 paired-end libraries (these are called paired-end, but are based on the same principles as the mate-pair libraries in the other technologies). Original fragments (1) are end-repaired with unlabeled nucleotides, and labeled circularization adapters (CA) are ligated to the fragment ends (2). After circularization (3), fragmentation and affinity purification, library adaptors (LA1 and LA2) are ligated to the new fragment ends (4) and the fragments are amplified on beads by emulsion PCR. A sequence read that covers the two original ends and the internal adapter is generated (5). Adapter sequence is removed in silico, and the sequence is split into two reads, which both have the same orientation (6). (c) Preparation of SOLiD mate-pair libraries. Steps 1 to 4 are analogous with preparation of Roche 454 paired-end libraries, with a biotin-labeled internal adapter (IA) and two sequencing adapters (P1 and P2). Sequencing is performed with two different primers, complementary to the P1 adapter and internal adapter, respectively (5). The resulting reads will have the same orientation (6).

In the 454 sequencing-by-synthesis reaction, natural non-terminating deoxynucleotides are added to the system sequentially (Figure 3b). In homopolymeric regions several bases will thus become incorporated in the same step. The 454 technology is based on the pyrosequencing principle [24], where pyrophosphate is released as a consequence of nucleotide incorporation and converted into ATP by sulfurylase. ATP is then used as a substrate for the production of light by luciferase, and the emission of light is registered by a charge-coupled device (CCD) camera. The major advantage of the 454 technology is the long read length. The misincorporation rate for the natural deoxynucleotides is low, resulting in low levels of nucleotide substitution errors. Insertion-deletion errors are frequent in homopolymeric regions, however, due to the non-linear light response when several nucleotides are incorporated simultaneously into the same molecule. Compared to the Illumina and SOLiD systems, the 454 technology is more expensive per base due to the lower capacity and the higher reagent cost associated with the multiple enzymes required. Thus, the 454 technology is mainly used in applications where long reads are desired, such as de novo sequencing projects, where the read length is the most important factor determining the quality of the assembly, and in metagenomics, where the sample contains a mix of different organisms.

Figure 3 Principles for sequencing and imaging. (a) Illumina sequencing of three template molecules. All four nucleotides, carrying terminating moieties and unique fluorescent labels, and DNA polymerase are added, and one complementary nucleotide becomes incorporated at each template molecule (1). After washing, fluorescence is registered at four wavelengths (2). Fluorescent dyes and terminating groups are cleaved off. A new set of nucleotides is added (3), and imaged (4). Sequence reads of equal length are obtained (5). (b) 454 sequencing of three template molecules. One type of natural non-terminating deoxynucleotides and DNA polymerase are added and a pyrophosphate molecule is released at each nucleotide incorporation (1). Pyrophosphate is converted into light using sulfurylase and luciferase, and the light intensity is measured in each well (2). Free deoxynucleotides are destroyed with apyrase before adding the next type of deoxynucleotide (3) and imaging (4). Light signals are converted to flowgrams with higher signal intensity bars in homopolymer regions (5). Sequence reads that may differ in length are obtained (6). (c) SOLiD sequencing of one template molecule. A sequencing primer, DNA ligase and 1,024 unique probes, which are fluorescently labeled according to their first two bases, are added, and the complementary probe is ligated to the template (1). After washing, fluorescence is registered at four wavelengths. The three universal bases and the fluorophore are cleaved off (2). Addition of a new probe set is repeated for the desired number of cycles (3,4). The newly built strand is melted off. A new sequencing primer is added, which anneals one base off from the first primer and therefore interrogates different positions (5). Sequencing is repeated for the desired number of cycles (6). Additional primers are added, until each base is sequenced twice. The colors from all sequencing rounds are merged and can be converted to nucleotides (7).

Table 2 Capacity of the HiSeq2000 instrument from Illumina
Columns: Target region; Coverage; Samples per run.
[Table body largely illegible in the source scan. Legible entries: genome (3 Gb), coverage 40, 5 samples per run; a 30 Mb target; Escherichia coli genome; large genes (1 Mb).]

Sequencing-by-ligation
The SOLiD technology is based on sequencing-by-ligation, where a DNA ligase is used to add probes to a growing oligonucleotide chain [25] (Figure 3c). The probes consist of eight bases, five that are specific and complementary to the template and three that are universal and support hybridization to the DNA template. In the sequencing reaction, probes containing all possible combinations of the first five nucleotides are added. The probe that matches the template perfectly becomes hybridized and ligated to the sequencing primer or previous probe. The probes are fluorescently labeled according to the first two bases using a scheme for two-base encoding with four fluorophores. After imaging, the fluorescent label and the three universal bases are cleaved off, and a new set of probes is added. After the first round of sequencing, the newly built DNA strand is melted off, a new sequencing primer which starts one base off from the first primer binding site is hybridized, and the sequencing reaction is repeated, now interrogating different positions. This process is repeated several times with different primers, so that all bases in the template become sequenced twice. Since each base is sequenced twice in the SOLiD system, most sequencing errors can be corrected during alignment, resulting in a low error rate of mapped data. It is possible to use different chemistries and read lengths in different lanes. The current read length is 75 nucleotides for fragment libraries, and the total yield per run (two glass slides) is around 180 Gb. Data analysis has traditionally been performed in color space; however, with the recent upgrade to the 5500xl system, it is now also possible to get error-corrected reads in base space.

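The two-base encoding described above can be sketched in a few lines of Python. The text does not spell out the color table, so this sketch assumes the commonly described scheme in which bases are numbered A=0, C=1, G=2, T=3 and the color of a dinucleotide is the XOR of the two numbers; the function names are hypothetical.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def encode(seq):
    """Color (0-3) for each pair of adjacent bases in seq."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Rebuild the bases from a known first base and a list of colors."""
    seq = [first_base]
    for color in colors:
        seq.append(BASES[CODE[seq[-1]] ^ color])
    return "".join(seq)


colors = encode("ATGGCA")
print(colors)               # [3, 1, 0, 3, 1]
print(decode("A", colors))  # ATGGCA

# A true single-base variant (G -> T at position 4) changes two adjacent colors.
print(encode("ATGTCA"))     # [3, 1, 1, 2, 1]

# One miscalled color instead corrupts every base downstream of it.
print(decode("A", [3, 1, 1, 3, 1]))  # ATGTAC
```

Under this assumed scheme, an isolated color difference against the reference is more likely a measurement error than a variant, which is one way to see how errors can be corrected during alignment in color space.
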
Sequencing-by-ligation is also used by Complete Genomics (Mountain View, CA, USA) [26], a company that sequences human genomes as a service. In their technology, the template DNA is first inserted into a single-stranded DNA circle, which is then copied several times to make up DNA nanoballs. The nanoballs are attached to arrays and sequenced by ligation reactions, which use multiple priming sites [27]. The current capacity of Complete Genomics is more than 600 genomes per month, and they are driving down the price of whole-genome sequencing.

Base calling and quality control
Intensities of light signals from the sequencing reactions are converted to bases (Illumina and 454 systems) or colors (SOLiD system). In addition to sequence data, base calling produces quality scores for each base, which are estimates of the probability of the call being erroneous. After base calling, reads with indications of mixed signals or other errors are filtered out. To facilitate troubleshooting and discrimination between problems caused by instrument/reagent factors and sample factors, the Illumina and 454 platforms include standardized control DNA in each run.

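The relation between a quality score and the probability of an erroneous call can be sketched as follows. The text does not name a scale, so this sketch assumes the widely used Phred scale, Q = -10 log10(P); the function names are hypothetical.

```python
import math


def error_probability(q):
    """Probability that a base call with Phred-scaled quality q is wrong."""
    return 10 ** (-q / 10)


def phred_quality(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read with these qualities."""
    return sum(error_probability(q) for q in qualities)


# Q10, Q20 and Q30 correspond to roughly 1 error in 10, 100 and 1,000 calls.
for q in (10, 20, 30):
    print(q, error_probability(q))
```

Summing the per-base probabilities gives a simple per-read quality measure that can be used to decide whether a read is kept.
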
In the Illumina system, the clusters are identified during the first four cycles of sequencing. Intensities are registered for each cluster in every cycle and converted to nucleotide sequence. If the initial recognition of clusters was not perfect, some clusters may contain more than one original template molecule. It is also possible that some clusters contain many molecules that have incorporated fewer (phasing) or more (prephasing) nucleotides than the number of cycles. Such clusters are filtered out by a so-called chastity filter, which is based on the ratio of signal intensities of the bases with the strongest and the second strongest intensity. Control DNA from the phage phiX is sequenced in each flow cell. In the analysis pipeline, phiX reads are identified by comparison to the phiX genome and the error rate is determined and used as a measure of the quality of the run.

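A chastity-style filter can be sketched in Python as below. The text only states that the filter uses the ratio of the strongest and second strongest intensities; the exact formula used here (brightest divided by the sum of the two brightest), the 0.6 threshold, the 25-cycle window and the single allowed failure are assumptions based on commonly described defaults, not values taken from this review.

```python
def chastity(intensities):
    """Brightest intensity divided by the sum of the two brightest."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_chastity_filter(cycles, threshold=0.6, n_cycles=25, max_failures=1):
    """True if at most max_failures of the first n_cycles fall below threshold.

    cycles is a list with one entry per cycle, each entry holding the four
    channel intensities measured for a single cluster.
    """
    failures = sum(1 for c in cycles[:n_cycles] if chastity(c) < threshold)
    return failures <= max_failures


clean = [[900, 50, 30, 20]] * 25   # one dominant signal in every cycle
mixed = [[500, 450, 20, 10]] * 25  # two templates competing in one cluster
print(passes_chastity_filter(clean))  # True
print(passes_chastity_filter(mixed))  # False
```

A cluster that contains two different templates, or many out-of-phase molecules, tends to show two strong channels in the same cycle, which pushes the ratio towards 0.5 and causes the cluster to be discarded.
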
During QC filtering of data from the 454 system, possible polyclonal beads and beads with no template are identified based on the number of positive and negative flows. In addition, reads that do not start with a specific key sequence, which is part of the adapter, and reads that have a high number of off-peak signal intensities (indicative of homopolymer errors) are filtered out. Sequence reads are also trimmed from the 3’ end to remove adapter sequence and bases of low quality, arising from phasing/prephasing issues and loss of signal intensity. Beads with control DNA, labeled with a different key sequence, are included in each run. With the aid of the key sequence, these sequence reads are identified and aligned to a reference sequence, and the percentage of reads that match with 95, 98 and 100% similarity is reported.

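Trimming low-quality bases from the 3’ end of a read can be illustrated with a deliberately simple Python sketch. It is not the algorithm of the 454 software, which the text does not describe; the function name, the quality threshold of 20 and the rule of cutting trailing bases until the first base at or above the threshold are all assumptions made for illustration.

```python
def trim_3prime(seq, quals, min_q=20):
    """Remove trailing bases whose quality is below min_q.

    seq is the read sequence and quals the per-base quality scores,
    both ordered from the 5' end to the 3' end.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


read = "ACGTACGT"
qualities = [30, 32, 31, 28, 25, 12, 8, 5]
print(trim_3prime(read, qualities))  # ('ACGTA', [30, 32, 31, 28, 25])
```

Production trimmers usually work on a sliding window or a running sum rather than single bases, so that one good base near the end does not stop the trimming too early.
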
After color calling in the SOLiD system, possible polyclonal reads and reads with color combinations that do not make sense according to the two-base encoding scheme are filtered out. No control DNA is used, but the quality of a run is assessed from the color intensity distribution.

Despite the standard QC steps, not all obtained data will be of high quality. To recognize potential problems and biases it is useful to apply additional quality control measures. A good resource for assessing the quality of sequence data is the FastQC software [28], which reports, for example, distributions of base qualities, GC content, redundancy and over-representation of adapter or primer sequence.

Trends and upcoming technologies
Low-capacity sequencing systems
While the sequencing companies compete to increase throughput, they have also launched systems with lower capacity. The reason for this trend is that the high capacity of the original systems is not always required, and the current multiplexing possibilities do not match their throughput, which results in a much higher coverage (and cost) than needed for many applications.

Roche 454 was the first to launch such a system, the Genome Sequencer Junior, in 2010, and Illumina launched the MiSeq system in 2011. The capacity of these systems is 35 Mb and 1.5 Gb per run, respectively. Life Technologies recently acquired the company Ion Torrent, whose technology has a concept similar to that of the 454 system. However, detection is based on pH changes caused by the release of protons (hydrogen ions) upon nucleotide incorporation rather than on pyrophosphate release. Since no enzymes are used for detection, the reagent cost for this instrument is low.

Compared to the high-capacity instruments, the cost per base is high for the smaller machines, but they are suitable for sequencing small genomes, amplicons or DNA enriched by targeted capture. Since the run time for these systems is short, they are also useful for technology development runs and to optimize reaction conditions for a larger run. These machines have been referred to as ‘personal sequencers’, meaning that they are easily obtainable also by laboratories with smaller resources that want to have rapid access to a sequencing instrument, but need relatively small amounts of data.

Single molecule sequencing
In single molecule sequencing, sometimes also referred to as third-generation sequencing, no amplification of the template molecules is performed prior to sequencing. These technologies provide improved quantitative accuracy by eliminating the risk of biases introduced during preparation of sequencing libraries. Single molecule sequencing also allows direct sequencing of RNA molecules, detection of chemically modified bases such as DNA methylation, and increased read lengths. Longer reads will be useful in de novo sequencing projects and open up perspectives for experimental phasing (determination of which variant alleles are on the same chromosome), in contrast to statistical phasing that is used today.

In the HeliScope Single Molecule Sequencer system from Helicos BioSciences (Cambridge, MA, USA) [29], single-stranded poly(dA)-tailed templates are attached to poly(dT) oligonucleotide primers that are anchored on a flow cell. In each sequencing cycle one type of reversibly terminating fluorescently labeled nucleotide is added and incorporated by a polymerase, the slide is washed and imaged, and the dye labels are cleaved off [30,31]. This technology generates around 35 Gb per run, and the read length is 35 nucleotides on average (Table 1).

Pacific Biosciences (Menlo Park, CA, USA) [32] has developed a system called single molecule real time (SMRT) sequencing, which uses a DNA polymerase anchored on a glass surface and nucleotides with phospholinked fluorescent labels that are cleaved off when the nucleotides are incorporated. The sequencing reaction takes place on zero-mode waveguide nanostructure arrays [33]. The incorporation of the fluorescently labeled nucleotides is monitored in real time, which results in very short run times. This system has read lengths over a thousand bases, but error rates are high and throughput is currently limited to 0.1 Gb per run (Table 1). The Pacific Biosciences system was successfully used for rapid analysis of the cholera strains in the outbreak in Haiti in 2010 [34].

Several new technologies for single molecule sequencing are under development. Nanopore sequencing technologies (Oxford Nanopore (Oxford, UK) [35], NABsys (Providence, RI, USA) [36]) are based on detecting natural electric or chemical differences between nucleotides, and do not require labeling of DNA. The Starlight system (Life Technologies) is a real-time technology that uses a quantum-dot-labeled polymerase and distinctly labeled fluorescent nuc
