`
`Identification of genetic variants using bar-coded
`multiplexed sequencing
`
`David W Craig1•3, John V Pearson 1•3, Szabolcs Szelinger1•3, Aswin Sekar1, Margot Redman 1, Jason J Corneveaux1
`,
`Traci L Pawlowski 1, Trisha Laub1, Gary Nunn2, Dietrich A Stephan1, Nils Homer1 & Matthew J Huentelman1
`
`We developed a generalized framework for multiplexed
`resequencing of targeted human genome regions on the Illumina
`Genome Analyzer using degenerate indexed DNA bar codes
`ligated to fragmented DNA before sequencing. Using this
`method, we simultaneously sequenced the DNA of multiple
`HapMap individuals at several Encyclopedia of DNA Elements
`(ENCODE) regions. We then evaluated the use of Bayes factors
`for discovering and genotyping polymorphisms. For
`polymorphisms that were either previously identified within the
`Single Nucleotide Polymorphism database (dbSNP) or visually
`evident upon re-inspection of archived ENCODE traces, we
`observed a false positive rate of 11.3% using strict thresholds
`for predicting variants and 69.6% for lax thresholds. Conversely,
`false negative rates were 10.8-90.8%, with false negatives at
`stricter cut-offs occurring at lower coverage ( < 10 aligned
`reads). These results suggest that > 90% of genetic variants are
`discoverable using multiplexed sequencing provided sufficient
`coverage at the polymorphic base.
`
`Genome-wide association, candidate gene and linkage studies have
`identified thousands of moderately sized genomic regions that are
`associated with human disease but for which comprehensive
`resequencing is needed to identify the genetic variant causing
`the association. In particular, genome-wide association studies
`have identified hundreds of disease-associated hap lo types, typically
`spanning 5-100 kb 1- 3. A logical next step is to identify and
`resequence all genetic variants within the associated haplotype to
`identify the functional variants among the many nonfunctional,
`evolutionarily linked neighboring polymorphisms. Next-genera(cid:173)
`tion DNA sequencing technologies are in principle well-suited to
`this task owing to their capability for high-throughput low-cost
`sequencing. Although these technologies offer massive sequencing
`capacity, it is still difficult, time-consuming and/or expensive to
`resequence large numbers of samples across moderately sized
`genomic regions (5 kb-1 Mb).
`Sinmltaneous resequencing of a target region in large numbers of
`individuals is possible by bar-coding or indexing the reads from
`identifying oligonucleotide4-7.
`each
`individual with a short
`Although indexing has the obvious benefit of m ultiplexing samples
`
`within a run, DNA indexing offers two key additional advantages:
`direct measure of base-by-base error rate and reduction of array(cid:173)
`to-array or day-to-day variability. Previous pioneering efforts to
`develop DNA indexing have shown considerable promise, but their
`adoption is still in its infancy, and considerable challenges remain,
`including
`the development of practical and cost-effective
`approaches for short-read platforms. Beyond these experimental
`challenges, there are few analytical frameworks that are character(cid:173)
`ized for discovering and genotyping genetic variants across a
`targeted interval using multiplexed short-read sequence data
`from multiple individuals.
`Here we report an experimental and analytical approach for
`simultaneo us sequencing of multiple individuals using DNA bar
`codes, which we call indexes here, on the Illurnina Genome
`Analyzer (GA). We used a six-base index with built-in redundancy
`for error correction and assessed the performance of the method by
`resequencing Encyclopedia of DNA Elements (ENCODE) regions
`of HapMap individuals that have previously been capillary(cid:173)
`sequenced. We developed a Bayesian analytical framework that
`leverages the inherent ability of indexing to measure error and to
`reflect variability in sequencing coverage.
`
`RESULTS
`Experimental design
`We amplified multiple 5-kb regions (Supplementary Tables 1 and
`2 online) by long-range PCR for 46 individuals genotyped by the
`ENCODE projects 1•8 (Fig. 1 and Supplementary Methods online).
`For each individual, we pooled the amplicons in equ imolar
`amounts, digested them, blunt end-repaired them , added to
`them an adenosine overhang and ligated these modified amplicons
`to one of the 46 indexed adapters (Supplementary Tables 3 and 4
`online). After ligation, we pooled samples from all individuals into
`a single sample (referred to as an indexed library), purified the
`samples, enriched them by PCR ampli fication and sequenced them
`on the Illumina GA on a single lane of an 8-lane flow-cell. We
`prepared two libraries: library A, consisting of ten 5-kb amplicons
`covering 50 kb and libra ry B, consisting of fou rteen 5-kb ampLicons
`covering 70 kb (Supplementary Table 2). Library A contained
`regions that were previously capillary-sequenced and regions that
`
`1The Translatio nal Genomics Research Institute, 445 N. 51h St. 5th Floo r, Phoenix, Arizona 85004, USA. 2Illumina, 9885 Town Centre Drive, San Diego, Cali fo rnia 92 12 1,
`USA. 3These authors contributed equally to this wo rk. Co rrespondence sho uld be addressed to D.W.C. (dcraig@tgen.o rg).
`RECEIVED 17 APRIL; ACCEPTED 22 AUGUST; PUBLISHED ONLINE 14 SEPTEMBER 2008; DOl:10.1038/NMETH.1251
`
`NATURE METHODS I VOL.5 N0.10 I OCTOBER 2008 I 887
`
`00001
`
`EX1007
`
`
`
`I ARTICLES
`
`"' C:
`0 ·c,
`~
`
`~ a.
`E
`<(
`
`a.
`- .0
`"'o
`
`a.. g
`
`c:o o"' g,
`a. 0 E-
`"'c 8l
`
`C:
`
`0 ·c, .. a:
`\.!/
`0
`
`l ------...
`JT ------
`
`-0 C:
`C:"'
`
`"'
`"' -E - .,
`
`"[~
`~ <(
`-6 M
`C: -0 w -0
`"'
`
`-0 .,
`
`X
`
`<I)
`
`-
`
`\.!/
`0
`
`\.!/
`0
`
`\.!/
`0
`
`l ------.... ....
`l ------... ...
`l ---...
`-
`• .... •
`...
`I' -A
`1r ------
`,t ----
`r -
`
`l
`
`Tj
`
`l
`
`8-lane flow cell, though early sequencing runs exhibited greater
`variability in the number of sequenced reads. After filtering using
`Illumina analysis pipeline defaults, approximately 45-50% of the
`reads remained. We observed a large spread in the number of
`counts per index (Fig. 2). Although we did not identify a systematic
`reason fo r the initial spread in index performance, weaknesses in
`index design were obvious in some cases. For example, 'AAAAAT'
`was frequently read as 'AAAAAAT', perhaps because of an oligo(cid:173)
`nucleotide synthesis bias. A few indexes that were not well(cid:173)
`represented were complementary to other sections of the adaptor
`sequence, possibly hindering adaptor formation. Resequencing
`the same library gave nearly the identical distribution of reads
`regardless of run performance, indicating that the distribution is
`likely not due to a post-PCR emichment step. Furthermore,
`recreating libraries and sequencing DNA from different individuals
`in additional sequencing runs did not substantially alter perfor(cid:173)
`mance for indexes that were substantially underrepresented or
`overrepresented. Of the 46 initial indexes, 19 indexes varied by
`less than a factor of 5 between the most and least common index,
`and 13 indexes varied by less than a factor of 2. Although some of
`the initial index variability was co nsistent between sequencing runs,
`retrospective analysis of the products by gel electrophoresis after
`ligation of adapters suggested that a portion of the index variance
`may have been due to subtle differences in DNAse digestion of
`pooled am pl icons, whereby the number of available ligation targets
`was higher for samples that were digested with higher efficiency.
`In rw1s after sequencing of these initial libraries ( data not shown ),
`we observed that quantification and normalization of the amount
`of ligated adaptor before pooling, usi ng gel electrophoresis of
`the PCR-enriched products or quantitative PCR, reduced index
`variabil ity such that the index with highest number of reads
`aligning to the reference sequence was observed fivefo ld more
`frequently than the ind ex with the fewest number of aligned
`reads. By comparison, the same ratio was 11 fold without quanti(cid:173)
`fication of the ligated primers before pooling. Although future
`studies may improve index variability still further, it may be
`effecti vely managed without substa ntially affecting workflow, by
`
`., -~l
`., "'
`-0 "'"' "'
`:::; ~~
`0
`i
`
`PCR enrichment, gel purification, cluster generation and sequencing
`
`Figure 1 I Schematic describing the preparation of indexed libraries.
`
`were not sequenced as a part of the ENCODE project, whereas
`library B contained only regions previously sequenced as part of the
`ENCODE project.
`
`Index design
`We used a six-base design, which allowed us to co ntrol, tolerate and
`measure error during base calling of the index. We designed indexes
`so that one, and in some cases two, sequencing errors could be
`tolerated without an index being incorrectly identified as being a
`different valid index. We synthesized only 48 of the 4,096 possible
`nucleotide combinatio ns (see Supplemen-
`tary Table 4 fo r indexes). Perfect alignment
`of any index to a randomly generated 6-base
`sequence should occur at - 0.1 o/o by chance.
`The sixth base of the index was an obligate
`thymidine necessary for
`ligation of the
`adenosine overhan g. The first and fifth
`bases were identical to detect biases during
`normalization and calculation of the decon(cid:173)
`volution matrix. In practice, we used 46 of
`the 48 indexes to allow for plate layo uts that
`included positive and negative co ntrols.
`Although we did not implement this in
`this study, the use of each of the fo ur
`nucleotides within an index may provide
`for higher-accuracy base calling as each base
`would have to be correctly called at least
`once within a sequenced read.
`
`II Perfect matches
`• One-error matches
`
`19 <fivefold difference
`
`13 <twofold difference
`
`10
`
`5
`
`0
`
`Index performance
`We generated 3-10 million short-read (32-
`or 42-base) sequences for each lane of an
`
`888 I VOL. 5 N0.10 I OCTOBER 2008 I NATURE METHODS
`
`Figure 2 I Comparison of index performance. Index variability in initial sequencing runs (library A)
`used for evaluating index performance are shown (top). Percentages of reads aligning to the reference
`sequence are listed by index, without introduction of normalization methods. A total of 30 indexes were
`present in > 0.05% of all aligned reads. There were 19 indexes with less than fivefo ld difference in index
`frequencies, which we used in subsequent studies. The bottom graph shows the location of errors by base
`for each index.
`
`00002
`
`
`
`Maximum reads/base
`Minimum reads/base
`250
`
`191
`
`135
`O
`
`191
`O
`
`119
`O
`
`TGGTTT
`Average coverage = 52.45
`160 145 104
`141
`162
`O
`O
`O
`O
`O
`
`200
`a,
`~ 150
`
`~
`
`U 100
`
`50
`
`Observed
`
`0
`
`ARTICLES I
`
`122
`
`70
`O
`
`115
`O
`
`54
`o
`
`TCTCTT
`Average coverage = 33. 76
`136
`144
`87
`96
`71
`O
`o
`o
`o
`o
`
`Maximum reads/base
`Minimum reads/base
`160
`140
`
`0 ~ . ' . . lLL . . . ! . J . '
`10,000
`20,000 Base30,000
`40,000
`50.000 Expecteo
`O
`
`67
`O
`
`38
`O
`
`59
`O
`
`AGCTAT
`Average coverage = 13.95
`32
`33
`62 160 49
`33
`O
`O
`O
`O
`O
`O
`
`Maximum reads/base
`Minimum reads/base
`80
`70
`~ 60
`l" 50
`~ U 40
`30
`20
`10
`
`18
`19
`O O
`
`20
`O
`
`ACTGAT
`Average coverage = 3. 72
`60
`18
`19
`14
`17
`O O
`O
`O
`O
`
`11
`O
`
`Maximum reads/base
`Minimum reads/base
`40
`35
`& 30
`i 25
`8 20
`15
`10
`5
`
`Expected
`D 20+ reads/base O 10-19 reads/base 5-9 reads/base • 1-4 reads/base • 0 reads/base
`
`Base
`
`Figure 3 I Relationship betwee n mean and local coverage. Exa mple coverage in 4 individuals sequenced within a single lane of an 8-lane flow-ce ll for 10 pooled
`amplicons as part of library A. Amplicons are shown consecutively for each individual. Index sequence and mea n coverage for that individual are shown above
`each graph. The maximum an d minimum coverage is shown for each amplicon at the top. Pie charts show the observed distribution of bases across all amplicons
`and the ex pected distribution determined from a Poi sson distri bution of the mean coverage, binned as indicated.
`
`requiring higher average coverage within a study, by sequencing in
`two lanes with different indexes or by sequestering samples with
`deficient coverage for later runs.
`
`Index-level coverage
`As shown for a subset of library A, coverage ac ross individual 5-kb
`amplicons was even and generally free of large gaps (Fig. 3).
`We observed base-to-base variability in the coverage, as expected
`from alignment of short reads. We observed some deviatio n
`from the expected Poisson distribution both between amplicons
`and within an amplicon. Clearly, amplicon-to-amplico n varia(cid:173)
`bility contributes to some extent to the departure fro m th e
`expected Poisson distribution. For a given index, we observed an
`approximately 1.5- to 2.0-fold difference between the amplicons
`with the most and fewest number of reads. Inspecting gel images
`for selected arnplicons confirmed that these observed differ(cid:173)
`ences within regions were largely due to uneven pooling of
`amplicons. The observed amplicon-to-amplicon variability is likely
`due to the fact that we used median concentrations across the
`plate when pooling amplicons for an individual, rather than
`separately pipetting the reaction for each amplicon based on
`its co ncentration.
`Comparing a given amplicon across indexes (that is, across
`individuals), there was clearly some base-level correlation in cover(cid:173)
`age based o n the positions of spikes and valleys within the coverage
`plots (Fig. 3). Within a single amplicon, there was also departure
`from a Poisson distribution, evident from the fact that the same
`bases had little or no coverage across individuals. Indeed, there is
`co nsistency between individuals with regard to bases that were
`under o r over- represented. The rank correlatio n coefficient
`between indexes at a given base averaged 0.408, suggesting that
`local sequence ( or base order) accounts for slightly less than half of
`the base-to-base variability in coverage.
`
`Error reduction and alignment strategy
`Depending on alignment rules, aligning a short read to a reference
`sequence reduces
`the sequencing error rate at
`the cost of
`limiting discovery. We aligned 35-base-pair sequences, allowing
`fo r only a single error. We were thus essentially limited to
`id entifying single-base substitutions in an aligned read, while
`limiting error to 1/35 o r 2.8%, as explained below. We also required
`that two stretches of 11 or more consecutive bases match
`least one
`the reference sequence or that the read have at
`stretch of 15 consecutive matches to the reference sequence.
`In both cases, our aligner required
`that the final 2 bases
`match the reference sequence to insure that we did not overalign
`an error at the final base. We chose the rules for alignment largely
`to control error, and a randomly generated sequence wo uld
`falsely align in less than 0.1 % of alignments in a 100-kb region.
`Given our toleran ce for 1 error in alignment, we expected
`a maximum per-base error rate of 2-3% (1 error
`in 35
`bases = - 2.8%).
`One would expect that we would have greater difficulty detecting
`closely neighboring single- nucleotide polymorphi sms (SNPs)
`because we mostly limited our aligner to o ne nonconsecutive
`mismatch. However, the short-reads stochastically overlapped,
`and neighbo ring genetic variants were observed by alignment of
`multiple sequences not spanning both variants.
`
`Polymorphism discovery
`Polymorphism discovery is a primary goal for resequencing an
`association interval for a genome-wide association study, particu(cid:173)
`larly under the common variant hypothesis. Indeed, in some cases
`one may only wish to know which bases are polymorphic for
`custom genotyping on a separate platform.
`We first provide an
`intuitive explanatio n of our analysis
`fo r polymorphism discovery
`(see Methods
`for
`approach
`
`NATURE METHODS I VOL.5 NO.10 I OCTOBER 2008 I 889
`
`00003
`
`
`
`i~A_R_TI_CL_E_S _____ ____________ ______ _
`
`a 60
`
`45
`
`~
`ci 30
`0
`...J
`
`~w
`co
`~8
`if z
`'('w
`~c
`~:c
`l]
`
`(.)
`
`.
`
`• - r 12533005
`
`rs11960,p2
`
`s1361963
`
`~
`
`.'1l
`"'
`u
`
`g
`·-s11167787 "
`2:- ·-
`[
`·-s11167786
`... rs1116778 a
`rs24 926 . . rs6975798
`.
`. .
`..
`.
`. .
`..
`~ . . . .. \,. ~
`
`•.- rs1021 JOSS
`
`. .
`.. .
`. .
`
`..
`
`15
`
`0
`
`• - r 452812£
`
`rs22 7790
`• --rs1 ~57644
`r rs1 5i 645
`
`-·
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9
`
`10
`
`• Previously documented SNP
`• New unannotated SNP with capillary-sequence data
`• Neighboring documented SNPs
`
`Predicted SNP in homologous region
`(possible fal se positive)
`
`rs1079114
`
`.
`
`.
`.
`. -· A•~ .
`
`rs1122259
`
`.
`.
`
`Figure 4 I Discovery of variant bases by
`simultaneous analysis of all individuals. (a) The
`Bayes factor for polymorphism discovery (Ks) is
`plotted for each of the 10 sequenced 5-kb
`amplicons from library A. Exact positions matching
`known polymorphisms are colored as red spheres,
`and the dbSNP identifier is provided for most SNPs
`with high Ks values. (b,c) Magnified views of
`amplicon 1 (b) and amplicon 6 (c) to compare
`variants predicted by indexed-multiplexed
`sequencing to previous deep capillary sequencing
`results for the same individuals as part of the
`ENCODE project. (d,e) Example traces of predicted
`SNPs in homologous regions with ambiguous trace
`data. (f-i) Examples of sequence traces validating
`the discovery of new SNPs not previously
`annotated in ENCODE capillary sequencing traces.
`The top row shows traces from HapMap individuals
`with the rare variant genotype and the bottom row
`shows reference traces. Similar analysis was
`conducted on library B (shown in Supplementary
`Fig. 1). Red arrows mark predicted SNPs in
`capillary-sequence data.
`
`C
`
`50
`
`25
`
`0
`
`epeat re ions
`Res,_e~V
`proximity
`
`Homology
`to other regions
`
`0Trlr:
`
`• o
`
`Oo
`
`• •
`
`e
`
`0
`
`1,000
`
`3,000
`2,000
`Base position
`
`4 ,000
`
`5,000
`
`b 15
`
`10
`
`5
`
`0
`
`d
`
`r ..,_ -
`....
`_ ,........_ •
`3 ,000
`1,000
`2,000
`0
`Base position
`
`......
`
`~LMtC ra
`4,000
`
`5,000
`
`!
`
`We analyzed sequenced regions base by
`base for all individuals by calcu lating a
`polymorphism discovery Bayes
`factor,
`where K, is the Bayes factor for polymorph(cid:173)
`ism discovery for the sth base as derived in
`equation 2 (see Methods). An example plot
`of K, across each base ( of 50 kb) is shown in
`Figure 4 for library A; we conducted a
`similar analysis for library B (Supplemen(cid:173)
`tary Fig. 1 online).
`We next evaluated false positive and false
`negative rates to assess our experimental
`and analytical framework for variant dis(cid:173)
`covery (Table 1 for library B and Fig. 5 for
`both library A and B). False positives are
`particularly difficult to quantify as not all
`polymorphic sites are known, even in pre-
`viously resequenced regions. In our analy(cid:173)
`sis, to be defined as a false positive, a variant must not exactly match
`the location of variants within the Single Nucleotide Polymorphism
`database (dbSNP) and must not have trace sequencing data indicat(cid:173)
`ing a previously missed variant. In some cases, trace· sequence data
`were not available or were unreliable. Consequently, the false
`positive rate is expected to be an upper estimate as the exact
`position must be validated as polymorphic by an existing database.
`We determined false negative rates by calculating whether a base
`known to be polymorphic in our library of HapMap individuals
`reached previously specified Ks thresholds of 3, 10, 100 or 1,000.
`This calculation of false negative rates does have some bias, as it does
`not take into account coverage of the polymorphic base.
`As expected, setting a higher threshold for Ks gives fewer false
`positives. For library B, as Ks increased from 10 to 1,000, the false
`positive rate decreased from 69.6 to 11.3% (Table 1). Likewise, with
`fixed coverage, we observed the false negative rate increasing from
`I 0.8 to 90.8% as Ks increased from 10 to 1,000. A more detailed
`discussion of false-negative and false-positive rates is available in
`the Supplementary Methods. We found that false negatives
`occurred when the cumulative coverage of individuals with the
`
`l!iP. ,cu ,••~ •,o u3?f'.c.a•
`
`to compare the
`detailed equations). We used Bayes factors
`probability that the distribution of mismatched bases arises
`from sequencing error to the probability that the distribution
`of mismatches arises from diploid polymorphism. For example,
`for a given base were nonconcordant
`if 20% of reads
`with the reference sequence across all
`individuals, and the
`the presence of a SNP,
`nonconcordant bases were due to
`one would expect each individual to be homozygous (0% or
`100% concordance with reference) or heterozygous (concordance
`split 50:50). In contrast, if the 20% nonconcordant bases were due
`to sequencing error, then the number of nonconcordant bases for
`each individual would follow a binomial distribution around 20%
`(for example, person 1, -20.5%; person 2, -19.3%; person 3,
`-20.7%; and so forth). As described below, the error estimates
`required to calculate the probability of a genetic variant being a true
`variant are readily obtainable when samples from individuals are
`indexed and multiplex-sequenced. Additionally, indexed and mul(cid:173)
`tiplexed sequencing removes run-to-run biases, which would con(cid:173)
`found these estimates if all aspects of experimental design were not
`properly randomized.
`
`890 I VOL.5 N0.10 I OCTOBER 2008 I NATURE METHODS
`
`00004
`
`
`
`Table 1 I Polymorphism discovery and individual variant calling
`
`Polymorphism discovery by Ks threshold"
`
`Threshold (Ks)
`
`3
`10
`100
`1,000
`
`Polymorphisms
`predicted
`
`True
`positivesb
`
`False
`positivesc
`
`False
`negativesd
`
`932
`352
`131
`106
`
`112
`107
`99
`94
`
`88.0%
`69.6%
`24.4%
`11.3%
`
`9.2%
`10.8%
`32.3%
`90.8%
`
`ARTICLES
`
`individual given a certain number of reads
`(Fig. Sc). For example, when the coverage
`for a base was ~ 20 reads ( averaging from
`to 24), we detected > 80-90% of
`16
`the bases at K; > 10, with a false-positive
`rate of 1.6%. In comparison to poly(cid:173)
`morphism discovery, the low false positive
`rates of genotyping at a known poly(cid:173)
`morphic base are due to the fact that we
`were no longer assessing thousands of bases
`for a rare event but rather assessing samples
`from a few dozen individuals for a more
`frequent event.
`
`Individual variant calling or genotyping (AA, AB, BB) by K; threshold
`
`Threshold (K;)
`
`3
`10
`100
`1,000
`
`Genotyped
`correctly
`
`3,376
`3,144
`2,677
`2,397
`
`Genotyped
`incorrectly
`
`115
`58
`8
`7
`
`aPredicted polymorphic bases at a given threshold for Ks were evaluated by comparison to known polymorphisms within dbSNP and to ENCODE
`capillary sequencing traces. False negatives rates reflect that greater base coverage is required to exceed larger K5 thresholds and that many
`polymorphisms become insufficiently covered for polymorphism discovery at these levels (see Fig. 5 for relation between coverage and Ks)(cid:173)
`bValidated in dbSNP or NCBI trace archive. 'Not identified in dbSNP or NCBI trace archive. dRates of polymorphism discovery were evaluated
`irrespective of coverage. False negatives rates were calculated across Libraries A and B. False positive rates were calculated usi ng only library
`B since not all region s of library A were previously resequenced within the ENCODE project.
`
`DISCUSSION
`Our experience suggests that achieving ade(cid:173)
`quate coverage is one of the most important
`factors in the design of a multiplexed tar-
`geted resequencing experiment. Depending
`on assumptions made in the experiment,
`the desired coverage (and as a consequence,
`the cost) can vary substantially. Key con(cid:173)
`siderations include whether the objective is
`(i) discovering genetic variants for genotyp(cid:173)
`ing by a separate method such as custom SNP genotyping, (ii)
`conducting polymorphism discovery and variant calling within one
`sequencing experiment, and/or (ii i) exhaustively resequencing for
`all common and rare variants.
`Exhaustive polymorphism discovery is the next major phase for
`genome-wide association studies. Indexing of short-reads was
`surprisingly robust at polymorphism identification. For example,
`
`rarer variant was less than 10 reads (Fig. 5). Highlighting the
`dependence of false negatives on coverage, all polymorphisms that
`were covered by 20 or more reads (summed across individuals
`known to differ from the reference) had a Ks > 1,000. Overall, we
`observed that 90% of variants were detectable, though designing
`experiments for an average of greater than 20 reads will be essential
`for controlling false negatives.
`While analyzing bases with a K, > 100 for false positives using
`US National Center for Biotechnology Information (NCBI)(cid:173)
`archived ENCODE traces, we discovered new SNPs that were
`evident in visual reinspection of capillary traces but that had not
`been annotated in dbSNP (Fig. 4f- h). These examples demonstrate
`that index-based resequencing can identify new variants even in
`heavily sequenced and heavily annotated
`regions. Notably,
`within library B, two variants with a Ks > 100 were not SNPs
`but actually insertions ( with dbSNP, rsl 1279266 is a 1-bp insertion
`and rsl0555419 is a 6-bp insertion). Thus, it is possible to identify
`genetic variants explicitly not allowed within the alignment scheme.
`
`Genotyping individuals at known polymorphisms
`As false negatives are clearly tied to coverage, we explored the
`influence of coverage further by analyzing sequenced regions in an
`individual-by-individual analysis. Derived in equation 3 (see
`below), Ki is the analogous Bayes factor for the ith individual
`having the rarer allele at a known polymorphic base. Conceptually,
`it can be thought of as a specific individual's contribution to Ks.
`We calculated the percentage of variants correctly identified in an
`
`Figure S I Relationship between base-level coverage and Bayes factor for
`polymorphism discovery and variant genotyping . (a) Total coverage across
`those individuals with a nonreference genotype at a known polymorphism .
`(b) Magnification of the graph in a. (c) The percent of the time the correct
`genotype was determined versus the coverage of the variant within the
`individual. Plots contain cumulative statistics using variant discovery and
`genotyping within both libraries A and B.
`
`• Identified known SNPs
`
`• Missed known SNPs "f :.
`I •
`••••
`.i}-
`. ·It-~-
`: ..,.._.
`••••
`• ,._ 11! "'
`\~ .1
`
`b 10
`
`8
`
`6
`
`Q>
`oi 4
`0
`__J
`
`2
`
`0
`
`.
`.
`
`K5 =1 0
`
`- '!
`
`0
`
`~
`
`---"
`
`:}:_s = 10
`
`1,000
`100
`10
`Total number of reads for all individuals
`with variant at polymorphic base
`
`1,000
`100
`10
`Total number of reads for all individuals
`with variant at polymorphic base
`
`a
`
`100
`
`80
`
`60
`
`40
`
`20
`
`Q>
`oi
`0
`__J
`
`C
`
`.
`.
`.
`.
`..
`-.
`..
`.
`. .
`-.
`"'
`I•.
`.. ......
`: ·=· ...... .. . . .
`
`•
`
`• •
`• •• •
`•
`• •
`• •
`•
`
`•
`
`•
`
`.!!!
`
`c:: " ·~
`
`0
`
`40
`
`20
`
`•
`
`•
`
`Threshold
`a K1=3
`a K1= 10
`a K1= 100
`
`Genotyped
`correct
`3,376
`3,144
`2,677
`
`Genotyped
`incorrect
`115
`58
`8
`
`Number of reads at base with a
`known variant in a sequenced individual
`
`NATURE METHODS I VOL.5 N0.10 I OCTOBER 2008 I 891
`
`00005
`
`
`
`I ARTICLES
`
`even when we used a highly restrictive error alignment scheme, we
`identified a new coding SNP 3 bases from an anno tated SNP.
`Additionally, it is encouraging to see the discovery of insertions
`using an alignment scheme only aUowing substitutions. FinaUy, we
`show that an automated analysis strategy for short-read data can
`uncover new variants, even in regions that had been previously
`sequenced for variant discovery using samples from these indivi(cid:173)
`duals in the HapMap database.
`Based on the false positive and false negative rates, a critical
`factor will be how to balance coverage and cost. Using Ks > 1,000,
`we observed a low false positive rate and a high false negative rate.
`As we required that the exact base must be polymorphic in an
`existing database, the actual false- positive rate may be lower.
`False negatives are due to the additional coverage ( > 10 reads)
`required for overcoming higher Ks thresholds. Considering the
`substantial base-to-base variability (Fig. 3), one would not simply
`want to design for an average coverage of 10 reads. Rather,
`by designing for - 50 reads or more, one m ay minimize both the
`false negative and false-positive rates, given coverage variability of
`short-read sequencing.
`Although whole-genome sequencing may be the primary moti(cid:173)
`vator for in1provements in sequencing technology, it is clear that
`next-generation technologies are immediately useful for foc used,
`hypothesis-driven sequencing of linkage peaks, groupings of can(cid:173)
`didate genes or sequencing the entire known coding sequence of the
`human genome. In this report, we developed per-individual
`indexing of pooled PCR amplicons to carry o ut targeted sequen(cid:173)
`cing. However, it is straightforward to envision using other sample(cid:173)
`preparation methods such as genome partitio ning9- 12. One could
`replace the pooled amplicons from our experimental strategy with
`total genomic DNA and complete the partitioning by a hybridiza(cid:173)
`tion approach after pooling ligated amplicons. Indeed, variant
`discovery through the resequencing o f all ca ndidate regions impli (cid:173)
`cated in a disease across genomes of dozens, and possibly hundreds,
`of individuals could be considerably accelerated by merging multi(cid:173)
`plex capture, indexing and next-generation sequencing approaches
`into a single protocol.
`
`METHODS
`Amplification and pre-ligation sample preparation. Two ampli(cid:173)
`con libraries (library A and B; see Supplementary Table 2 for
`specific regions) were constructed from individually amplified 5-
`kb regio ns using long-range PCR. Regions comprising a library
`were chosen from the ENCODE project and provide a sampling of
`different genomic region types. Only a portion of the ENCODE
`regions have been previously sequenced as part of the SNP
`discovery portion o f the ENCODE project. Library A was a
`composite of previously sequenced regions and regions that had
`not been sequenced in ENCODE. Library B was entirely composed
`of regions that had been previously sequenced. Flanking primers
`for each amplicon (Supplementary Table 5 o nline) we re manually
`selected. Although we did not screen for the existence of known
`this wo uld be
`polymorphisms within the primer sequences,
`advisable in future experiments.
`
`Base calling and alignment. Illumina GA images were analyzed
`using a modified Illumina GA (0.2.2.6) processing pipeline.
`Descriptions of the modifications and all scripts are available in
`Supplementary Methods. A default m atrix deconvolution file was
`
`892 I VOL.5 N0.10 I OCTOBER 2008 I NATURE METHODS
`
`used for base-calling by 'Bustard', based on a control phi-X library
`provided by Illumina. Following base-calling, a script was used to
`access all sequences, regardless of quality score. We deviated from
`the default cu t-offs provided by Illumina's 'GERALD' process
`because we found that sequence quality was better controlled by
`matching to an index (46/4,096 or 0.1% by chance) or matching
`with I or fewer errors to the reference sequence. Bases were aligned
`by progressively truncating the sequence at the 5' end until a unique
`alignment was obtained with a probability of stochastic alignment
`of less than 1 %. This ap proach is distinct from recently described
`short-read alignment schem es 13 but has so me of the same features.
`Aligned sequences were summarized into a binomial of counts
`agreeing (a; ) or disagreeing (b;) with the reference sequence for the
`i th individual of n total individuals for each base (s). The error rate
`(Os)
`is the percentage of reads disagreeing with the reference
`sequence across aU samples. Model 1 (M1) assumes that the error
`rate for an individual equals the error rate for all individuals, o r
`O; = Os. Therefore, we can estimate 05 as:
`0s= L _ a; _
`
`II
`(
`)
`i = I a; + b;
`
`.
`
`(1)
`
`Model 2 (M2) assumed that for some individual i, we have O;#- Os.
`To calculate this likelihood, we used a hyperprior offset (as) fo r
`three possible genotypes. The hyperprior can be thought of as
`conditioning on the zygosity of the individual and thus reflecting
`the uncertainty of the genotype of the individual at a given base. In
`this analysis, we focused on the detection of biallelic SNPs, but
`triallelic o r o ther types of SNPs could be considered under more
`complex models. The Bayes fac tor for a base position is
`
`( a,