throbber
Vol 456 | 6 November 2008 | doi:10.1038/nature07517
`
`ARTICLES
`
`Accurate whole human genome
`sequencing using reversible terminator
`chemistry
`
`A list of authors and their affiliations appears at the end of the paper
`
`DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit.
`Sequencing projects have traditionally used long (400–800 base pair) reads, but the existence of reference sequences for
`the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter
`reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates
`several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a
`flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator
`deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of
`this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the
`genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from .303 average
`depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand
`structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical
`whole-genome re-sequencing and many other biomedical applications.
`
`DNA sequencing yields an unrivalled resource of genetic informa-
`tion. We can characterize individual genomes, transcriptional states
`and genetic variation in populations and disease. Until recently, the
`scope of sequencing projects was limited by the cost and throughput
`of Sanger sequencing. The raw data for the three billion base
`(3 gigabase (Gb)) human genome sequence, completed in 2004 (ref. 1),
`was generated over several years for ,$300 million using several hun-
`dred capillary sequencers. More recently an individual human gen-
`ome sequence has been determined for ,$10 million by capillary
`sequencing2. Several new approaches at varying stages of development
`aim to increase sequencing throughput and reduce cost3–6. They
`increase parallelization markedly by imaging many DNA molecules
`simultaneously. One instrument run produces typically thousands or
`millions of sequences that are shorter than capillary reads. Another
`human genome sequence was recently determined using one of these
`approaches7. However, much bigger improvements are necessary to
`enable routine whole human genome sequencing in genetic research.
`We describe a massively parallel synthetic sequencing approach that
`transforms our ability to use DNA and RNA sequence information in
`biological systems. We demonstrate utility by re-sequencing an indivi-
`dual human genome to high accuracy. Our approach delivers data at
`very high throughput and low cost, and enables extraction of genetic
`information of high biological value, including single-nucleotide
`polymorphisms (SNPs) and structural variants.
`
`DNA sequencing using reversible terminators
`We generated high-density single-molecule arrays of genomic DNA
`fragments attached to the surface of the reaction chamber (the flow
`cell) and used isothermal ‘bridging’ amplification to form DNA ‘clus-
`ters’ from each fragment. We made the DNA in each cluster single-
`stranded and added a universal primer for sequencing. For paired
`read sequencing, we then converted the templates to double-stranded
`DNA and removed the original strands, leaving the complementary
`
`strand as template for the second sequencing reaction (Fig. 1a–c). To
`obtain paired reads separated by larger distances, we circularized
`DNA fragments of the required length (for example, 2 6 0.2 kb)
`and obtained short junction fragments for paired end sequencing
`(Fig. 1d).
`We sequenced DNA templates by repeated cycles of polymerase-
`directed single base extension. To ensure base-by-base nucleotide
`incorporation in a stepwise manner, we used a set of four reversible
`terminators, 39-O-azidomethyl 29-deoxynucleoside triphosphates
`(A, C, G and T), each labelled with a different removable fluorophore
`(Supplementary Fig. 1a)8. The use of 39-modified nucleotides
`allowed the incorporation to be driven essentially to completion
`without risk of over-incorporation. It also enabled addition of all
`four nucleotides simultaneously rather than sequentially, minimiz-
`ing risk of misincorporation. We engineered the active site of 9uN
`DNA polymerase to improve the efficiency of incorporation of these
`unnatural nucleotides9. After each cycle of incorporation, we deter-
`mined the identity of the inserted base by laser-induced excitation of
`the fluorophores and imaging. We added tris(2-carboxyethyl)pho-
`sphine (TCEP) to remove the fluorescent dye and side arm from a
`linker attached to the base and simultaneously regenerate a 39
`hydroxyl group ready for the next cycle of nucleotide addition
`(Supplementary Fig. 1b). The Genome Analyzer (GA1) was designed
`to perform multiple cycles of sequencing chemistry and imaging to
`collect the sequence data automatically from each cluster on the
`surface of each lane of an eight-lane flow cell (Supplementary Fig. 2).
`To determine the sequence from each cluster, we quantified the
`fluorescent signal from each cycle and applied a base-calling algo-
`rithm. We defined a quality (Q) value for each base call (scaled as by
`the phred algorithm10) that represents the likelihood of each call
`being correct (Supplementary Fig. 3). We used the Q-values in sub-
`sequent analyses to weight the contribution of each base to sequence
`alignment and detection of sequence variants (for example, SNP
`
`53
`
` ©2008 Macmillan Publishers Limited. All rights reserved
`
`00001
`
`EX1035
`
`

`

`ARTICLES
`
`NATURE | Vol 456 | 6 November 2008
`
`calling). We discarded all reads from mixed clusters and used the
`remaining ‘purity filtered’ reads for analysis. Typically we generated
`1–2 Gb of high-quality purity filtered sequence per flow cell from
`,30–60-million single 35-base reads, or 2–4 Gb in a paired read
`experiment (Supplementary Table 1).
`To demonstrate accurate sequencing of human DNA, we sequenced
`a human bacterial artificial chromosome (BAC) clone (bCX98J21) that
`contained 162,752 bp of the major histocompatibility complex on
`
`a
`
`b
`
`c
`
`d
`
`B
`
`B
`
`B
`
`B
`
`B
`
`Figure 1 | Preparation of samples. a, DNA fragments are generated, for
`example, by random shearing and joined to a pair of oligonucleotides in a
`forked adaptor configuration. The ligated products are amplified using two
`oligonucleotide primers, resulting in double-stranded blunt-ended material
`with a different adaptor sequence on either end. b, Formation of clonal
`single-molecule array. DNA fragments prepared as in a are denatured and
`single strands are annealed to complementary oligonucleotides on the flow-
`cell surface (hatched). A new strand (dotted) is copied from the original
`strand in an extension reaction that is primed from the 39 end of the surface-
`bound oligonucleotide; the original strand is then removed by denaturation.
`The adaptor sequence at the 39 end of each copied strand is annealed to a new
`surface-bound complementary oligonucleotide, forming a bridge and
`generating a new site for synthesis of a second strand (dotted). Multiple
`cycles of annealing, extension and denaturation in isothermal conditions
`result in growth of clusters, each ,1 mm in physical diameter. This follows
`the basic method outlined in ref. 33. c, The DNA in each cluster is linearized
`by cleavage within one adaptor sequence (gap marked by an asterisk) and
`denatured, generating single-stranded template for sequencing by synthesis
`to obtain a sequence read (read 1; the sequencing product is dotted). To
`perform paired-read sequencing, the products of read 1 are removed by
`denaturation, the template is used to generate a bridge, the second strand is
`re-synthesized (shown dotted), and the opposite strand is then cleaved (gap
`marked by an asterisk) to provide the template for the second read (read 2).
`d, Long-range paired-end sample preparation. To sequence the ends of a
`long (for example, .1 kb) DNA fragment, the ends of each fragment are
`tagged by incorporation of biotinylated (B) nucleotide and then circularized,
`forming a junction between the two ends. Circularized DNA is randomly
`fragmented and the biotinylated junction fragments are recovered and used
`as starting material in the standard sample preparation procedure illustrated
`in a. The orientation of the sequence reads relative to the DNA fragment is
`shown (magenta arrows). When aligned to the reference sequence, these
`reads are oriented with their 59 ends towards each other (in contrast to the
`short insert paired reads produced as shown in a–c). See Supplementary Fig.
`17a for examples of both. Turquoise and blue lines represent
`oligonucleotides and red lines represent genomic DNA. All surface-bound
`oligonucleotides are attached to the flow cell by their 59 ends. Dotted lines
`indicate newly synthesized strands during cluster formation or sequencing.
`(See Supplementary Methods for details.)
`
`54
`
`human chromosome 6 (accession AL662825.4, previously determined
`using capillary sequencing by the Wellcome Trust Sanger Institute).
`We developed a fast global alignment algorithm ELAND that aligns a
`read to the reference only if the read can be assigned a unique position
`with 0, 1 or 2 differences. We collected 0.17 Gb of aligned data for the
`BAC from one lane of a flow cell. Approximately 90% of the 35-base
`reads matched perfectly to the reference, demonstrating high raw read
`accuracy (Supplementary Fig. 4). To examine consensus coverage
`and accuracy, we used 5 Mb of 35-base purity filtered reads (30-fold
`average input depth of the BAC) and obtained 99.96% coverage of the
`reference. There was one consensus miscall, at a position of very low
`coverage (just above our cutoff threshold), yielding an overall con-
`sensus accuracy of .99.999%.
`
`Detecting genetic variation of the human X chromosome
`For an initial study of genetic variation, we sequenced flow-sorted X
`chromosomes of a Caucasian female (sample NA07340 originating
`from the Centre d’Etude du Polymorphisme Humain (CEPH)). We
`generated 278-million paired 30–35-bp purity filtered reads and
`aligned them to the human genome reference sequence. We carried
`out separate analyses of the data using two alignment algorithms:
`ELAND (see above) or MAQ (Mapping and Assembly with
`Qualities)11. Both algorithms place each read pair where it best
`matches the reference and assign a confidence score to the alignment.
`In cases where a read has two or more equally likely positions (that is,
`in an exact repeat), MAQ randomly assigns the read pair to one
`position and assigns a zero alignment quality score (these reads are
`excluded from SNP analysis). ELAND rejects all non-unique align-
`ments, which are mostly in recently inserted retrotransposons (see
`Supplementary Fig. 5). MAQ therefore provides an opportunity to
`assess the properties of a data set aligned to the entire reference,
`whereas ELAND effectively excludes ambiguities from the short read
`alignment before further analysis.
`We obtained comprehensive coverage of the X chromosome from
`both analyses. With MAQ, 204 million reads aligned to 99.94% of the
`X chromosome at an average depth of 433. With ELAND, 192 mil-
`lion reads covered 91% of the reference sequence, showing what can
`be covered by unique best alignments. These results were obtained
`after excluding reads aligning to non-X sequence (impurities of flow
`sorting) and apparently duplicated read pairs (Supplementary Table 2).
`We reasoned that these duplicates (,10% of the total) arose during
`initial sample amplification.
`The sampling of sequence fragments from the X chromosome is
`close to random. This is evident from the distribution of mapped
`read depth in the MAQ alignment in regions where the reference is
`unique (Fig. 2a): the variance of this distribution is only 2.26 times
`that of a Poisson distribution (the theoretical minimum). Half of this
`excess variance can be accounted for by a dependence on G1C con-
`tent. However, the average mapped read depth only falls below 103
`in regions with G1C content less than 4% or greater than 76%,
`comprising in total just 1% of unique chromosome sequence and
`3% of coding sequence (Fig. 2b).
`We identified 92,485 candidate SNPs in the X chromosome using
`ELAND (Supplementary Fig. 6). Most calls (85%) match previous
`entries in the public database dbSNP. Heterozygosity (p) in this data
`set is 4.3 3 1024 (that is, one substitution per 2.3 kb), close to a
`previously published X chromosome estimate (4.7 3 1024)12. Using
`MAQ we obtained 104,567 SNPs, most of which were common to the
`results of the ELAND analysis. The differences between the two sets of
`SNP calls are largely the consequence of different properties of the
`alignments as described earlier. For example, most of the SNPs found
`only by the MAQ-based analysis were at positions of low or zero
`sequence depth in the ELAND alignment (Supplementary Fig. 6c).
`We assessed accuracy and completeness of SNP calling by compar-
`ison to genotypes obtained for this individual using the Illumina
`HumanHap550 BeadChip (HM550). The sequence data cov-
`ered .99.8% of the 13,604 genotyped positions and we found excellent
`
` ©2008 Macmillan Publishers Limited. All rights reserved
`
`00002
`
`

`

`NATURE | Vol 456 | 6 November 2008
`
`ARTICLES
`
`depth and/or anomalous read pair spacing, similar to previous
`approaches13–15. We detected 115 indels in total, 77 of which were
`visible from anomalous read-pair spacing (see Supplementary Tables
`4 and 5). We developed Resembl, an extension to the Ensembl
`browser16, to view all variants (Supplementary Fig 9). Inversions
`can be detected when the orientation of one read in a pair is reversed
`(for example, see Supplementary Fig. 10). In general, inversions
`occur as the result of non-allelic homologous recombination, and
`are therefore flanked by repetitive sequence that can compromise
`alignments. We found partial evidence for other inversion events,
`but characterization of inversions from short read data is complex
`because of the repeats and requires further development.
`
`Sequencing and analysis of a whole human genome
`Our X chromosome study enabled us to develop an integrated set of
`methods for rapid sequencing and analysis of whole human genomes.
`We sequenced the genome of a male Yoruba from Ibadan, Nigeria
`(YRI, sample NA18507). This sample was originally collected for the
`HapMap project17,18 through a process of community engagement
`and informed consent19 and has also been studied in other pro-
`jects20,21. We were therefore able to compare our results with publicly
`available data from the same sample. We constructed two libraries:
`one of short inserts (,200 bp) with similar properties to the previous
`X chromosome library and one from long fragments (,2 kb) to
`provide longer-range read-pair information (see Supplementary
`Fig. 11 for size distributions). We generated 135 Gb of sequence
`(,4 billion paired 35-base reads; see Supplementary Table 6) over
`a period of 8 weeks (December 2007 to January 2008) on six GA1
`instruments
`averaging
`3.3 Gb
`per
`production
`run
`(see
`Supplementary Table 1 for example). The approximate consumables
`cost (based on full list price of reagents) was $250,000. We aligned
`97% of the reads using MAQ and found that 99.9% of the human
`reference (NCBI build 36.1) was covered with one or more reads at an
`average of 40.6-fold depth. Using ELAND, we aligned 91% of the
`reads over 93% of the reference sequence at sufficient depth to call a
`strong consensus (.three Q30 bases). The distribution of mapped
`read depth was close to random, with slight over-dispersion as seen
`for the X chromosome data. We observed comprehensive representa-
`tion across a wide range of G1C content, dropping only at the very
`extreme ends, but with a different pattern of distribution compared
`to the X chromosome (see Supplementary Fig. 12).
`We identified ,4 million SNPs, with 74% matching previous
`entries in dbSNP (Fig. 3). We found excellent agreement of our
`SNP calls with genotyping results: sequence-based SNP calls covered
`almost all of the 552,710 loci of HM550, with .99.5% concordance
`of sequencing versus genotyping calls (Table 1 and Supplementary
`Table 7a). The few disagreements were mostly under-calls of hetero-
`zygous positions (GT.Seq) in areas of low sequence depth, provid-
`ing us with a false-negative rate of ,0.35% from the ELAND analysis
`(see Table 1). The other disagreements (0.09% of all genotypes)
`included errors
`in genotyping plus apparent
`tri-allelic SNPs
`(Supplementary Table 7a). The main cause of genotype error
`(0.05% of all genotypes) is the existence of a second ‘hidden’ SNP
`close to the assayed locus that disrupts the genotyping assay, leading
`to loss of one allele and an erroneous homozygous genotype
`(Supplementary Figs 13 and 14).
`To examine the accuracy of SNP calling in more detail, we com-
`pared our sequence-based SNP calls with 3.7 million genotypes (HM-
`All) generated for this sample during the HapMap project (Table 1
`and Supplementary Table 7b)18 and found excellent concordance
`between the data sets. Disagreements included sequence-based
`under-calls of heterozygous positions in regions of low read depth.
`The slightly higher level of other disagreements (0.76%) seen in this
`analysis compared to that of the HM550 data (0.09%) is in line with
`the higher level of underlying genotype error rate of 0.7% for the
`HapMap data18. To refine this analysis further, we generated a set of
`530,750 very high confidence reference genotypes comprising
`
`55
`
`All
`Unique only
`Poisson
`
`0
`
`0
`
`20
`
`40
`Mapped depth (fold)
`
`60
`
`80
`
`G+C content (%)
`40
`
`30
`
`50 60
`
`6
`
`5
`
`4
`
`3
`
`2
`
`1
`
`0
`
`a
`
`Frequency (Mb)
`
`b
`
`60
`
`50
`
`40
`
`30
`
`20
`
`10
`
`0
`
`Mapped depth
`
`100
`80
`20
`0
`40
`60
`Percentile of unique sequence ordered by G+C content
`
`Figure 2 | X chromosome data. a, Distribution of mapped read depth in the
`X chromosome data set (NA07340), sampled at every 50th position along the
`chromosome and displayed as a histogram (‘All’). An equivalent analysis of
`mapped read depth for the unique subset of these positions is also shown
`(‘Unique only’). The solid line represents a Poisson distribution with the
`same mean. b, Distribution of X chromosome uniquely mapped reads as a
`function of G1C content. Note that the x axis is per cent G1C content and is
`scaled by percentile of unique sequence. The solid line is average mapped
`depth of unique sequence; the grey region is the central 80% of the data (10th
`to 90th centiles); the dashed lines are 10th and 90th centiles of a Poisson
`distribution with the same mean as the data.
`
`agreement between sequence-based SNP calls and genotyping data
`(99.52% or 99.99% using ELAND or MAQ,
`respectively;
`Supplementary Table 3). There was complete concordance of all
`homozygous calls and a low level of ‘under-calling’ from the sequence
`data (denoted as ‘GT.Seq’ in Table 1) at a small number of the
`heterozygous sites, caused by inadequate sampling of one of the two
`alleles. The depth of input sequence influences the coverage and accu-
`racy of SNP calling. We found that reducing the read depth to 153 still
`gives 97% coverage of genotype positions and only 1.27% of the het-
`erozygous sites are under-called. We observed no other types of dis-
`agreement at any input depth (Supplementary Fig. 7).
`We detected structural variants (defined as any variant other than
`a single base substitution) as follows. We found 9,747 short inser-
`tions/deletions (‘short indels’; defined here as less than the length of
`the read) by performing a gapped alignment of individual reads
`(Supplementary Fig. 8). We identified larger indels based on read
`
` ©2008 Macmillan Publishers Limited. All rights reserved
`
`00003
`
`

`

`ARTICLES
`
`NATURE | Vol 456 | 6 November 2008
`
`Table 1 | Comparison of SNP calls made from sequence versus genotype data for the human genome (NA18507) and X chromosome (NA07340)
`ELAND
`MAQ
`
`X
`
`Human
`
`Human
`
`X
`
`Human
`
`Human
`
`Human
`
`HM550 (13,604
`SNPs)
`(%)
`99.77
`
`HM550 (552,710
`SNPs)
`(%)
`99.60
`
`HM-All (3,699,592
`SNPs)
`(%)
`99.24
`
`HM550 (13,604
`SNPs)
`(%)
`99.91
`
`HM550 (552,710
`SNPs)
`(%)
`99.74
`
`HM-All (3,699,592
`SNPs)
`(%)
`99.29
`
`99.52
`0.48
`0.48
`0
`0
`
`99.57
`0.43
`0.35
`0.05
`0.03
`
`98.80
`1.20
`0.46
`0.52
`0.22
`
`99.99
`0.01
`0.01
`0
`0
`
`99.90
`0.1
`0.03
`0.05
`0.02
`
`99.12
`0.88
`0.15
`0.54
`0.2
`
`Combined (530,750 SNPs)
`
`(%)
`99.78
`
`99.94
`0.06
`0.02
`0.02
`0.01
`
`(n)
`529,589
`
`529,285
`304
`130
`130
`44
`
`Covered by
`sequence
`Concordant calls
`All disagreements
`GT.Seq
`Seq.GT
`Other
`discordances
`
`SNP panels referred to are HM550 (Illumina Infinium HumanHap550 BeadChip) and HM-All (complete data from phase 1 and phase 2 of the International HapMap Project). ‘Combined’ is a set of
`concordant genotypes from both sets (HM550 and HM-All; see text). GT.Seq denotes a heterozygous genotyping SNP call where there is a homozygous sequencing SNP call (one of the two
`alleles); Seq.GT denotes the converse (that is, a heterozygous sequencing SNP call where there is a homozygous genotyping call). Other discordances are differences in the two SNP calls that
`cannot be accounted for by one allele being missing from one call.
`
`concordant calls in both the HM550 and HM-All genotype data sets.
`Comparing the results of the MAQ analysis to this high confidence
`set (see Table 1), we found 130 heterozygote under-calls GT.Seq
`(that is, a false-negative rate of 0.025%). There were also 130 hetero-
`zygote over-calls Seq.GT, but most of these are probably genotype
`errors as 82 have a nearby ‘hidden’ SNP and 3 have a nearby indel. A
`further 41 are tri-allelic loci, leaving at most 4 potential wrong calls by
`sequencing (that is, false-positive rate of 4 per 529,589 positions).
`Finally we selected a subset of novel SNP calls from the sequence data
`and tested them by genotyping. We found 96.1% agreement between
`sequence and genotype calls (Supplementary Table 8). However, the
`47 disagreements included 10 correct sequencing calls (genotyping
`under-calls owing to hidden SNPs) and 7 sequencing under-calls. On
`this basis, therefore, the false-positive discovery rate for the one mil-
`lion novel SNPs is 2.5% (30 out of 1,206). For the entire data set of
`four million SNPs detected in this analysis, the false-positive and
`-negative rates both average ,1%.
`This genome from a Yoruba individual contains significantly more
`polymorphism than a genome of European descent. The autosomal
`heterozygosity (p) of NA18507 is 9.94 3 1024 (1 SNP per 1,006 bp),
`higher than previous values for Caucasians (7.6 3 1024, ref. 12).
`Heterozygosity in the pseudoautosomal region 1 (PAR1) is substan-
`tially higher (1.92 3 1023) than the autosomal value. PAR1 (2.7 Mb)
`
`a
`
`Call
`
`ELAND
`In dbSNP
`SNPs
`(n)
`(%)
`
`MAQ
`In dbSNP
`(%)
`
`SNPs
`(n)
`
`Homozygote
`Heterozygote
`All
`
`1,417,320
`2,411,022
`3,828,342
`
`90.1
`63.9
`73.6
`
`1,503,420
`2,635,776
`4,139,196
`
`90.8
`63.8
`73.6
`
`b
`
`ELAND
`
`MAQ
`
`215,844
`(42.4% dbSNP)
`
`3,612,498
`(75.5% dbSNP)
`
`526,698
`(60.8% dbSNP)
`
`Figure 3 | SNPs identified in the human genome sequence of NA18507.
`a, Number of SNPs detected by class and percentage in dbSNP (release 128).
`Results from ELAND and MAQ alignments are reported separately.
`b, Analysis of SNPs detected in each analysis reveals extensive overlap. The
`percentage of NA18507 SNP calls that match previous entries in dbSNP is
`lower than that of our X chromosome study (see Supplementary Fig. 6). We
`expect this because individual NA07340 (from the X chromosome study)
`was also previously used for discovery and submission of SNPs to dbSNP
`during the HapMap project, in contrast to NA18507.
`
`56
`
`at the tip of the short arm of chromosomes X and Y undergoes
`obligatory recombination in male meiosis, which is equivalent to
`203 the autosome average. This illustrates a clear correlation
`between recombination and nucleotide diversity. By contrast, the
`0.33-Mb PAR2 region has a much lower recombination rate than
`PAR1; we observed that heterozygosity in PAR2 is identical to that
`of the autosomes in NA18507. Heterozygosity in coding regions is
`lower (0.54 3 1023) than the total autosome average, consistent with
`the model that some coding changes are deleterious and are lost as the
`result of natural selection22. Nevertheless, the 26,140 coding SNPs
`(Supplementary Fig. 15) include 5,361 non-conservative amino acid
`substitutions
`plus
`153
`premature
`termination
`codons
`(Supplementary Table 9), many of which are expected to affect pro-
`tein function.
`We performed a genome-wide survey of structural variation in this
`individual and found excellent correlation with variants that had
`been reported in previous studies, as well as detecting many new
`variants. We
`found
`0.4 million
`short
`indels
`(1–16 bp;
`Supplementary Fig. 16), most of which are length polymorphisms
`in homopolymeric tracts of A or T. Half of these events are corrobo-
`rated by entries in dbSNP, and 95 of 100 examined were present in
`amplicons sequenced from this individual in ENCODE regions, con-
`firming the high specificity of this method of short indel detection.
`For larger structural variants (detected by anomalously spaced paired
`ends) we found that some were detected by both long and short insert
`data sets (Supplementary Fig. 17a), but most were unique to one or
`other data set. We observed two reasons for this: first, small events
`(,400 bp) are within the normal size variance of the long insert data;
`second, nearby repetitive structures can prevent unique alignment of
`read pairs (see Supplementary Fig. 17b, c). In some cases, the high
`resolution of the short insert data permits detection of additional
`complexity in a structural rearrangement that is not revealed by
`the long insert data. For example, where the long insert data indicate
`a 1.3-kb deletion in NA18507 relative to the reference, the short insert
`data reveal an inversion accompanied by deletions at both break-
`points (Fig. 4). We carried out de novo assembly of reads in this
`region and constructed a single contig that defines the exact structure
`of the rearrangement (data not shown).
`We discovered 5,704 structural variants ranging from 50 bp
`to .35 kb where there is sequence absent from the genome of
`NA18507 compared to the reference genome. We observed a steadily
`decreasing number of events of this type with increasing size, except
`for two peaks (Supplementary Fig. 18). Most of the events repre-
`sented by the large peak at 300–350 bp contain a sequence of the
`AluY family. This is consistent with insertion of short interspersed
`nuclear elements (SINEs) that are present in the reference genome
`but missing from the genome of NA18507. Similarly, the second,
`smaller peak at 6–7 kb is the consequence of insertion of the long
`interspersed nuclear element (LINE) L1 Homo sapiens (L1Hs) in
`
` ©2008 Macmillan Publishers Limited. All rights reserved
`
`00004
`
`

`

`NATURE | Vol 456 | 6 November 2008
`
`ARTICLES
`
`8.00 kb
`
`4 kb
`
`a
`
`b
`
`c
`
`d
`
`e
`
`many cases. We found good correspondence between our results and
`the data of ref. 23, which reported 148 deletions of ,100 kb in this
`individual on the basis of abnormal fosmid paired-end spacing. We
`found supporting evidence for 111 of these events. We detected a
`further 2,345 indels in the range 60–160 bp which are sequences
`present in the genome of NA18507 and absent from the reference
`genome (Supplementary Fig. 19). One example is shown in
`
`Figure 4 | Homozygous complex rearrangement detected by anomalous
`paired reads. The rearrangement involves an inversion of 369 bp
`(blue–turquoise bar in the schematic diagram) flanked by deletions (red
`bars) of 1,206 and 164 bp, respectively, at the left- and right-hand
`breakpoints. a, Summary tracks in the Resembl browser, denoting scale,
`simulated alignability of reads to reference (blue plot), actual aligned depth
`of coverage by NA18507 reads (green plot), density of anomalous reads
`indicating structural variants (red plot; peaks denote ‘hotspots’) and density
`of singleton reads (pink plot). b, Anomalous long-insert read pairs (orange
`lines denote DNA fragment; blocks at either end denote each read); the data
`indicate loss of ,1.3 kb in NA18507 relative to the reference. c, Anomalous
`short-insert pairs of two types (red and pink) indicate an inverted sequence
`flanked by two deletions. d, Normal short-insert read-pair alignments (each
`green line denotes the extent of the reference that is covered by the short
`fragment, including the two reads). e, The schematic diagram depicts the
`arrangement of normal and anomalous read pairs relative to the
`rearrangement. Top line, structure of NA18507; second line, structure of
`reference sequence. Green bars denote sequence that is collinear in the
`reference and NA18507 genomes. The turquoise–blue bar illustrates the
`inverted segment. Red bars indicate the sequences present in the reference
`but absent in NA18507. Arrows denote orientation of reads when aligned to
`the reference. The display in a–d is a composite of screen shots of the same
`window, overlapped for display purposes.
`
`Supplementary Fig. 20. The ‘singleton’ reads on either side of the
`event, which have partners that do not align to the reference, form
`part of a de novo assembly that precisely defines the novel sequence
`and breakpoint (Supplementary Fig. 21).
`
`Effect of sequence depth on coverage and accuracy
`We investigated the impact of varying input read depth (and hence
`cost) on SNP calling using chromosome 2 as a model. SNP discovery
`increases with increasing depth: essentially all homozygous positions
`are detected at 153, whereas heterozygous positions accumulate
`more gradually to 333 (Fig. 5a). This effect is influenced by the
`stringency of the SNP caller. To call each allele in this analysis we
`required the equivalent of two high-quality Q30 bases (as opposed to
`three used in full depth analyses). Homozygotes could be detected at
`read depth of 23 or higher, whereas heterozygote detection required
`at least double this depth for sampling of both alleles. Missing calls
`(not covered by sequence) and discordances between sequence-based
`SNP calls and genotype loci (mostly under-calls of heterozygotes due
`to low depth) progressively reduced with increasing depth (Fig. 5b).
`We observed very few other types of discordance at any depth; many
`of these are genotyping errors as described above.
`
`Concluding remarks
`Reversible terminator chemistry is a defining feature of this sequen-
`cing approach, enabling each cycle to be driven to completion while
`minimizing misincorporation. The result is a system that generates
`accurate data at very high throughput and low cost. We determined
`an accurate whole human genome sequence in 8 weeks to an average
`depth of ,403. We built a consensus sequence, optimized methods
`for analysis, assessed accuracy and characterized the genetic variation
`of this individual in detail.
`We assessed accuracy relative to genotype data over the entire
`fraction of the human sequence where SNP calling was possible
`(.90%). We established very low false-positive and -negative rates
`for the ,four million SNPs detected (,1% over-calls and under-
`calls). This compares favourably with previous individual genome
`analyses which reported a 24% under-calling of heterozygous posi-
`tions2,7.
`Paired reads were very powerful in all areas of the analysis. They
`provided very accurate read alignment and thus improved the accu-
`racy and coverage of consensus sequence and SNP calling. They were
`essential for developing our short indel caller, and for detecting larger
`structural variants. Our short-insert paired-read data set introduced
`a new level of resolution in structural variation detection, revealing
`thousands of variants in a size range not characterized previously. In
`
`57
`
` ©2008 Macmillan Publishers Limited. All rights reserved
`
`00005
`
`

`

`ARTICLES
`
`NATURE | Vol 456 | 6 November 2008
`
`filtered read data are available for download from the Short Read Archive at
`NCBI or from the European Short Read Archive (ERA) at the EBI.
`Analysis software. Image analysis software and the ELAND aligner are provided
`as part of the Genome Analyzer analysis software. SNP and structural variant
`detectors will be available as future upgrades of the analysis pipeline. The
`Resembl extension to Ensembl is available on request. The MAQ (Mapping
`and Assembly with Qualities) aligner is freely available for download from
`http://maq.sourceforge.net.
`Data access. Sequence data for NA18507 are freely available from the NCBI short
`read archive,
`accession SRA000271 (ftp://ftp.ncbi.nih.gov/pub/TraceDB/
`ShortRead/SRA000271). X chromosome data are freely available fro

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket