throbber
ARTICLES I
`
`Identification of genetic variants using bar-coded
`multiplexed sequencing
`
`David W Craig1•3, John V Pearson 1•3, Szabolcs Szelinger1•3, Aswin Sekar1, Margot Redman 1, Jason J Corneveaux1
`,
`Traci L Pawlowski 1, Trisha Laub1, Gary Nunn2, Dietrich A Stephan1, Nils Homer1 & Matthew J Huentelman1
`
`We developed a generalized framework for multiplexed
`resequencing of targeted human genome regions on the Illumina
`Genome Analyzer using degenerate indexed DNA bar codes
`ligated to fragmented DNA before sequencing. Using this
`method, we simultaneously sequenced the DNA of multiple
`HapMap individuals at several Encyclopedia of DNA Elements
`(ENCODE) regions. We then evaluated the use of Bayes factors
`for discovering and genotyping polymorphisms. For
`polymorphisms that were either previously identified within the
`Single Nucleotide Polymorphism database (dbSNP) or visually
`evident upon re-inspection of archived ENCODE traces, we
`observed a false positive rate of 11.3% using strict thresholds
`for predicting variants and 69.6% for lax thresholds. Conversely,
`false negative rates were 10.8-90.8%, with false negatives at
`stricter cut-offs occurring at lower coverage ( < 10 aligned
`reads). These results suggest that > 90% of genetic variants are
`discoverable using multiplexed sequencing provided sufficient
`coverage at the polymorphic base.
`
`Genome-wide association, candidate gene and linkage studies have
`identified thousands of moderately sized genomic regions that are
`associated with human disease but for which comprehensive
`resequencing is needed to identify the genetic variant causing
`the association. In particular, genome-wide association studies
`have identified hundreds of disease-associated hap lo types, typically
`spanning 5-100 kb 1- 3. A logical next step is to identify and
`resequence all genetic variants within the associated haplotype to
`identify the functional variants among the many nonfunctional,
`evolutionarily linked neighboring polymorphisms. Next-genera(cid:173)
`tion DNA sequencing technologies are in principle well-suited to
`this task owing to their capability for high-throughput low-cost
`sequencing. Although these technologies offer massive sequencing
`capacity, it is still difficult, time-consuming and/or expensive to
`resequence large numbers of samples across moderately sized
`genomic regions (5 kb-1 Mb).
`Sinmltaneous resequencing of a target region in large numbers of
`individuals is possible by bar-coding or indexing the reads from
`identifying oligonucleotide4-7.
`each
`individual with a short
`Although indexing has the obvious benefit of m ultiplexing samples
`
`within a run, DNA indexing offers two key additional advantages:
`direct measure of base-by-base error rate and reduction of array(cid:173)
`to-array or day-to-day variability. Previous pioneering efforts to
`develop DNA indexing have shown considerable promise, but their
`adoption is still in its infancy, and considerable challenges remain,
`including
`the development of practical and cost-effective
`approaches for short-read platforms. Beyond these experimental
`challenges, there are few analytical frameworks that are character(cid:173)
`ized for discovering and genotyping genetic variants across a
`targeted interval using multiplexed short-read sequence data
`from multiple individuals.
`Here we report an experimental and analytical approach for
`simultaneo us sequencing of multiple individuals using DNA bar
`codes, which we call indexes here, on the Illurnina Genome
`Analyzer (GA). We used a six-base index with built-in redundancy
`for error correction and assessed the performance of the method by
`resequencing Encyclopedia of DNA Elements (ENCODE) regions
`of HapMap individuals that have previously been capillary(cid:173)
`sequenced. We developed a Bayesian analytical framework that
`leverages the inherent ability of indexing to measure error and to
`reflect variability in sequencing coverage.
`
`RESULTS
`Experimental design
`We amplified multiple 5-kb regions (Supplementary Tables 1 and
`2 online) by long-range PCR for 46 individuals genotyped by the
`ENCODE projects 1•8 (Fig. 1 and Supplementary Methods online).
`For each individual, we pooled the amplicons in equ imolar
`amounts, digested them, blunt end-repaired them , added to
`them an adenosine overhang and ligated these modified amplicons
`to one of the 46 indexed adapters (Supplementary Tables 3 and 4
`online). After ligation, we pooled samples from all individuals into
`a single sample (referred to as an indexed library), purified the
`samples, enriched them by PCR ampli fication and sequenced them
`on the Illumina GA on a single lane of an 8-lane flow-cell. We
`prepared two libraries: library A, consisting of ten 5-kb amplicons
`covering 50 kb and libra ry B, consisting of fou rteen 5-kb ampLicons
`covering 70 kb (Supplementary Table 2). Library A contained
`regions that were previously capillary-sequenced and regions that
`
`1The Translatio nal Genomics Research Institute, 445 N. 51h St. 5th Floo r, Phoenix, Arizona 85004, USA. 2Illumina, 9885 Town Centre Drive, San Diego, Cali fo rnia 92 12 1,
`USA. 3These authors contributed equally to this wo rk. Co rrespondence sho uld be addressed to D.W.C. (dcraig@tgen.o rg).
`RECEIVED 17 APRIL; ACCEPTED 22 AUGUST; PUBLISHED ONLINE 14 SEPTEMBER 2008; DOl:10.1038/NMETH.1251
`
`NATURE METHODS I VOL.5 N0.10 I OCTOBER 2008 I 887
`
`00001
`
`EX1007
`
`

`

`I ARTICLES
`
`"' C:
`0 ·c,
`~
`
`~ a.
`E
`<(
`
`a.
`- .0
`"'o
`
`a.. g
`
`c:o o"' g,
`a. 0 E-
`"'c 8l
`
`C:
`
`0 ·c, .. a:
`\.!/
`0
`
`l ------...
`JT ------
`
`-0 C:
`C:"'
`
`"'
`"' -E - .,
`
`"[~
`~ <(
`-6 M
`C: -0 w -0
`"'
`
`-0 .,
`
`X
`
`<I)
`
`-
`
`\.!/
`0
`
`\.!/
`0
`
`\.!/
`0
`
`l ------.... ....
`l ------... ...
`l ---...
`-
`• .... •
`...
`I' -A
`1r ------
`,t ----
`r -
`
`l
`
`Tj
`
`l
`
`8-lane flow cell, though early sequencing runs exhibited greater
`variability in the number of sequenced reads. After filtering using
`Illumina analysis pipeline defaults, approximately 45-50% of the
`reads remained. We observed a large spread in the number of
`counts per index (Fig. 2). Although we did not identify a systematic
`reason fo r the initial spread in index performance, weaknesses in
`index design were obvious in some cases. For example, 'AAAAAT'
`was frequently read as 'AAAAAAT', perhaps because of an oligo(cid:173)
`nucleotide synthesis bias. A few indexes that were not well(cid:173)
`represented were complementary to other sections of the adaptor
`sequence, possibly hindering adaptor formation. Resequencing
`the same library gave nearly the identical distribution of reads
`regardless of run performance, indicating that the distribution is
`likely not due to a post-PCR emichment step. Furthermore,
`recreating libraries and sequencing DNA from different individuals
`in additional sequencing runs did not substantially alter perfor(cid:173)
`mance for indexes that were substantially underrepresented or
`overrepresented. Of the 46 initial indexes, 19 indexes varied by
`less than a factor of 5 between the most and least common index,
`and 13 indexes varied by less than a factor of 2. Although some of
`the initial index variability was co nsistent between sequencing runs,
`retrospective analysis of the products by gel electrophoresis after
`ligation of adapters suggested that a portion of the index variance
`may have been due to subtle differences in DNAse digestion of
`pooled am pl icons, whereby the number of available ligation targets
`was higher for samples that were digested with higher efficiency.
`In rw1s after sequencing of these initial libraries ( data not shown ),
`we observed that quantification and normalization of the amount
`of ligated adaptor before pooling, usi ng gel electrophoresis of
`the PCR-enriched products or quantitative PCR, reduced index
`variabil ity such that the index with highest number of reads
`aligning to the reference sequence was observed fivefo ld more
`frequently than the ind ex with the fewest number of aligned
`reads. By comparison, the same ratio was 11 fold without quanti(cid:173)
`fication of the ligated primers before pooling. Although future
`studies may improve index variability still further, it may be
`effecti vely managed without substa ntially affecting workflow, by
`
`., -~l
`., "'
`-0 "'"' "'
`:::; ~~
`0
`i
`
`PCR enrichment, gel purification, cluster generation and sequencing
`
`Figure 1 I Schematic describing the preparation of indexed libraries.
`
`were not sequenced as a part of the ENCODE project, whereas
`library B contained only regions previously sequenced as part of the
`ENCODE project.
`
`Index design
`We used a six-base design, which allowed us to co ntrol, tolerate and
`measure error during base calling of the index. We designed indexes
`so that one, and in some cases two, sequencing errors could be
`tolerated without an index being incorrectly identified as being a
`different valid index. We synthesized only 48 of the 4,096 possible
`nucleotide combinatio ns (see Supplemen-
`tary Table 4 fo r indexes). Perfect alignment
`of any index to a randomly generated 6-base
`sequence should occur at - 0.1 o/o by chance.
`The sixth base of the index was an obligate
`thymidine necessary for
`ligation of the
`adenosine overhan g. The first and fifth
`bases were identical to detect biases during
`normalization and calculation of the decon(cid:173)
`volution matrix. In practice, we used 46 of
`the 48 indexes to allow for plate layo uts that
`included positive and negative co ntrols.
`Although we did not implement this in
`this study, the use of each of the fo ur
`nucleotides within an index may provide
`for higher-accuracy base calling as each base
`would have to be correctly called at least
`once within a sequenced read.
`
`II Perfect matches
`• One-error matches
`
`19 <fivefold difference
`
`13 <twofold difference
`
`10
`
`5
`
`0
`
`Index performance
`We generated 3-10 million short-read (32-
`or 42-base) sequences for each lane of an
`
`888 I VOL. 5 N0.10 I OCTOBER 2008 I NATURE METHODS
`
`Figure 2 I Comparison of index performance. Index variability in initial sequencing runs (library A)
`used for evaluating index performance are shown (top). Percentages of reads aligning to the reference
`sequence are listed by index, without introduction of normalization methods. A total of 30 indexes were
`present in > 0.05% of all aligned reads. There were 19 indexes with less than fivefo ld difference in index
`frequencies, which we used in subsequent studies. The bottom graph shows the location of errors by base
`for each index.
`
`00002
`
`

`

`Maximum reads/base
`Minimum reads/base
`250
`
`191
`
`135
`O
`
`191
`O
`
`119
`O
`
`TGGTTT
`Average coverage = 52.45
`160 145 104
`141
`162
`O
`O
`O
`O
`O
`
`200
`a,
`~ 150
`
`~
`
`U 100
`
`50
`
`Observed
`
`0
`
`ARTICLES I
`
`122
`
`70
`O
`
`115
`O
`
`54
`o
`
`TCTCTT
`Average coverage = 33. 76
`136
`144
`87
`96
`71
`O
`o
`o
`o
`o
`
`Maximum reads/base
`Minimum reads/base
`160
`140
`
`0 ~ . ' . . lLL . . . ! . J . '
`10,000
`20,000 Base30,000
`40,000
`50.000 Expecteo
`O
`
`67
`O
`
`38
`O
`
`59
`O
`
`AGCTAT
`Average coverage = 13.95
`32
`33
`62 160 49
`33
`O
`O
`O
`O
`O
`O
`
`Maximum reads/base
`Minimum reads/base
`80
`70
`~ 60
`l" 50
`~ U 40
`30
`20
`10
`
`18
`19
`O O
`
`20
`O
`
`ACTGAT
`Average coverage = 3. 72
`60
`18
`19
`14
`17
`O O
`O
`O
`O
`
`11
`O
`
`Maximum reads/base
`Minimum reads/base
`40
`35
`& 30
`i 25
`8 20
`15
`10
`5
`
`Expected
`D 20+ reads/base O 10-19 reads/base 5-9 reads/base • 1-4 reads/base • 0 reads/base
`
`Base
`
`Figure 3 I Relationship betwee n mean and local coverage. Exa mple coverage in 4 individuals sequenced within a single lane of an 8-lane flow-ce ll for 10 pooled
`amplicons as part of library A. Amplicons are shown consecutively for each individual. Index sequence and mea n coverage for that individual are shown above
`each graph. The maximum an d minimum coverage is shown for each amplicon at the top. Pie charts show the observed distribution of bases across all amplicons
`and the ex pected distribution determined from a Poi sson distri bution of the mean coverage, binned as indicated.
`
`requiring higher average coverage within a study, by sequencing in
`two lanes with different indexes or by sequestering samples with
`deficient coverage for later runs.
`
`Index-level coverage
`As shown for a subset of library A, coverage ac ross individual 5-kb
`amplicons was even and generally free of large gaps (Fig. 3).
`We observed base-to-base variability in the coverage, as expected
`from alignment of short reads. We observed some deviatio n
`from the expected Poisson distribution both between amplicons
`and within an amplicon. Clearly, amplicon-to-amplico n varia(cid:173)
`bility contributes to some extent to the departure fro m th e
`expected Poisson distribution. For a given index, we observed an
`approximately 1.5- to 2.0-fold difference between the amplicons
`with the most and fewest number of reads. Inspecting gel images
`for selected arnplicons confirmed that these observed differ(cid:173)
`ences within regions were largely due to uneven pooling of
`amplicons. The observed amplicon-to-amplicon variability is likely
`due to the fact that we used median concentrations across the
`plate when pooling amplicons for an individual, rather than
`separately pipetting the reaction for each amplicon based on
`its co ncentration.
`Comparing a given amplicon across indexes (that is, across
`individuals), there was clearly some base-level correlation in cover(cid:173)
`age based o n the positions of spikes and valleys within the coverage
`plots (Fig. 3). Within a single amplicon, there was also departure
`from a Poisson distribution, evident from the fact that the same
`bases had little or no coverage across individuals. Indeed, there is
`co nsistency between individuals with regard to bases that were
`under o r over- represented. The rank correlatio n coefficient
`between indexes at a given base averaged 0.408, suggesting that
`local sequence ( or base order) accounts for slightly less than half of
`the base-to-base variability in coverage.
`
`Error reduction and alignment strategy
`Depending on alignment rules, aligning a short read to a reference
`sequence reduces
`the sequencing error rate at
`the cost of
`limiting discovery. We aligned 35-base-pair sequences, allowing
`fo r only a single error. We were thus essentially limited to
`id entifying single-base substitutions in an aligned read, while
`limiting error to 1/35 o r 2.8%, as explained below. We also required
`that two stretches of 11 or more consecutive bases match
`least one
`the reference sequence or that the read have at
`stretch of 15 consecutive matches to the reference sequence.
`In both cases, our aligner required
`that the final 2 bases
`match the reference sequence to insure that we did not overalign
`an error at the final base. We chose the rules for alignment largely
`to control error, and a randomly generated sequence wo uld
`falsely align in less than 0.1 % of alignments in a 100-kb region.
`Given our toleran ce for 1 error in alignment, we expected
`a maximum per-base error rate of 2-3% (1 error
`in 35
`bases = - 2.8%).
`One would expect that we would have greater difficulty detecting
`closely neighboring single- nucleotide polymorphi sms (SNPs)
`because we mostly limited our aligner to o ne nonconsecutive
`mismatch. However, the short-reads stochastically overlapped,
`and neighbo ring genetic variants were observed by alignment of
`multiple sequences not spanning both variants.
`
`Polymorphism discovery
`Polymorphism discovery is a primary goal for resequencing an
`association interval for a genome-wide association study, particu(cid:173)
`larly under the common variant hypothesis. Indeed, in some cases
`one may only wish to know which bases are polymorphic for
`custom genotyping on a separate platform.
`We first provide an
`intuitive explanatio n of our analysis
`fo r polymorphism discovery
`(see Methods
`for
`approach
`
`NATURE METHODS I VOL.5 NO.10 I OCTOBER 2008 I 889
`
`00003
`
`

`

`i~A_R_TI_CL_E_S _____ ____________ ______ _
`
`a 60
`
`45
`
`~
`ci 30
`0
`...J
`
`~w
`co
`~8
`if z
`'('w
`~c
`~:c
`l]
`
`(.)
`
`.
`
`• - r 12533005
`
`rs11960,p2
`
`s1361963
`
`~
`
`.'1l
`"'
`u
`
`g
`·-s11167787 "
`2:- ·-
`[
`·-s11167786
`... rs1116778 a
`rs24 926 . . rs6975798
`.
`. .
`..
`.
`. .
`..
`~ . . . .. \,. ~
`
`•.- rs1021 JOSS
`
`. .
`.. .
`. .
`
`..
`
`15
`
`0
`
`• - r 452812£
`
`rs22 7790
`• --rs1 ~57644
`r rs1 5i 645
`
`-·
`
`2
`
`3
`
`4
`
`5
`
`6
`
`7
`
`8
`
`9
`
`10
`
`• Previously documented SNP
`• New unannotated SNP with capillary-sequence data
`• Neighboring documented SNPs
`
`Predicted SNP in homologous region
`(possible fal se positive)
`
`rs1079114
`
`.
`
`.
`.
`. -· A•~ .
`
`rs1122259
`
`.
`.
`
`Figure 4 I Discovery of variant bases by
`simultaneous analysis of all individuals. (a) The
`Bayes factor for polymorphism discovery (Ks) is
`plotted for each of the 10 sequenced 5-kb
`amplicons from library A. Exact positions matching
`known polymorphisms are colored as red spheres,
`and the dbSNP identifier is provided for most SNPs
`with high Ks values. (b,c) Magnified views of
`amplicon 1 (b) and amplicon 6 (c) to compare
`variants predicted by indexed-multiplexed
`sequencing to previous deep capillary sequencing
`results for the same individuals as part of the
`ENCODE project. (d,e) Example traces of predicted
`SNPs in homologous regions with ambiguous trace
`data. (f-i) Examples of sequence traces validating
`the discovery of new SNPs not previously
`annotated in ENCODE capillary sequencing traces.
`The top row shows traces from HapMap individuals
`with the rare variant genotype and the bottom row
`shows reference traces. Similar analysis was
`conducted on library B (shown in Supplementary
`Fig. 1). Red arrows mark predicted SNPs in
`capillary-sequence data.
`
`C
`
`50
`
`25
`
`0
`
`epeat re ions
`Res,_e~V
`proximity
`
`Homology
`to other regions
`
`0Trlr:
`
`• o
`
`Oo
`
`• •
`
`e
`
`0
`
`1,000
`
`3,000
`2,000
`Base position
`
`4 ,000
`
`5,000
`
`b 15
`
`10
`
`5
`
`0
`
`d
`
`r ..,_ -
`....
`_ ,........_ •
`3 ,000
`1,000
`2,000
`0
`Base position
`
`......
`
`~LMtC ra
`4,000
`
`5,000
`
`!
`
`We analyzed sequenced regions base by
`base for all individuals by calcu lating a
`polymorphism discovery Bayes
`factor,
`where K, is the Bayes factor for polymorph(cid:173)
`ism discovery for the sth base as derived in
`equation 2 (see Methods). An example plot
`of K, across each base ( of 50 kb) is shown in
`Figure 4 for library A; we conducted a
`similar analysis for library B (Supplemen(cid:173)
`tary Fig. 1 online).
`We next evaluated false positive and false
`negative rates to assess our experimental
`and analytical framework for variant dis(cid:173)
`covery (Table 1 for library B and Fig. 5 for
`both library A and B). False positives are
`particularly difficult to quantify as not all
`polymorphic sites are known, even in pre-
`viously resequenced regions. In our analy(cid:173)
`sis, to be defined as a false positive, a variant must not exactly match
`the location of variants within the Single Nucleotide Polymorphism
`database (dbSNP) and must not have trace sequencing data indicat(cid:173)
`ing a previously missed variant. In some cases, trace· sequence data
`were not available or were unreliable. Consequently, the false
`positive rate is expected to be an upper estimate as the exact
`position must be validated as polymorphic by an existing database.
`We determined false negative rates by calculating whether a base
`known to be polymorphic in our library of HapMap individuals
`reached previously specified Ks thresholds of 3, 10, 100 or 1,000.
`This calculation of false negative rates does have some bias, as it does
`not take into account coverage of the polymorphic base.
`As expected, setting a higher threshold for Ks gives fewer false
`positives. For library B, as Ks increased from 10 to 1,000, the false
`positive rate decreased from 69.6 to 11.3% (Table 1). Likewise, with
`fixed coverage, we observed the false negative rate increasing from
`I 0.8 to 90.8% as Ks increased from 10 to 1,000. A more detailed
`discussion of false-negative and false-positive rates is available in
`the Supplementary Methods. We found that false negatives
`occurred when the cumulative coverage of individuals with the
`
`l!iP. ,cu ,••~ •,o u3?f'.c.a•
`
`to compare the
`detailed equations). We used Bayes factors
`probability that the distribution of mismatched bases arises
`from sequencing error to the probability that the distribution
`of mismatches arises from diploid polymorphism. For example,
`for a given base were nonconcordant
`if 20% of reads
`with the reference sequence across all
`individuals, and the
`the presence of a SNP,
`nonconcordant bases were due to
`one would expect each individual to be homozygous (0% or
`100% concordance with reference) or heterozygous (concordance
`split 50:50). In contrast, if the 20% nonconcordant bases were due
`to sequencing error, then the number of nonconcordant bases for
`each individual would follow a binomial distribution around 20%
`(for example, person 1, -20.5%; person 2, -19.3%; person 3,
`-20.7%; and so forth). As described below, the error estimates
`required to calculate the probability of a genetic variant being a true
`variant are readily obtainable when samples from individuals are
`indexed and multiplex-sequenced. Additionally, indexed and mul(cid:173)
`tiplexed sequencing removes run-to-run biases, which would con(cid:173)
`found these estimates if all aspects of experimental design were not
`properly randomized.
`
`890 I VOL.5 N0.10 I OCTOBER 2008 I NATURE METHODS
`
`00004
`
`

`

`Table 1 I Polymorphism discovery and individual variant calling
`
`Polymorphism discovery by Ks threshold"
`
`Threshold (Ks)
`
`3
`10
`100
`1,000
`
`Polymorphisms
`predicted
`
`True
`positivesb
`
`False
`positivesc
`
`False
`negativesd
`
`932
`352
`131
`106
`
`112
`107
`99
`94
`
`88.0%
`69.6%
`24.4%
`11.3%
`
`9.2%
`10.8%
`32.3%
`90.8%
`
`ARTICLES
`
`individual given a certain number of reads
`(Fig. Sc). For example, when the coverage
`for a base was ~ 20 reads ( averaging from
`to 24), we detected > 80-90% of
`16
`the bases at K; > 10, with a false-positive
`rate of 1.6%. In comparison to poly(cid:173)
`morphism discovery, the low false positive
`rates of genotyping at a known poly(cid:173)
`morphic base are due to the fact that we
`were no longer assessing thousands of bases
`for a rare event but rather assessing samples
`from a few dozen individuals for a more
`frequent event.
`
`Individual variant calling or genotyping (AA, AB, BB) by K; threshold
`
`Threshold (K;)
`
`3
`10
`100
`1,000
`
`Genotyped
`correctly
`
`3,376
`3,144
`2,677
`2,397
`
`Genotyped
`incorrectly
`
`115
`58
`8
`7
`
`aPredicted polymorphic bases at a given threshold for Ks were evaluated by comparison to known polymorphisms within dbSNP and to ENCODE
`capillary sequencing traces. False negatives rates reflect that greater base coverage is required to exceed larger K5 thresholds and that many
`polymorphisms become insufficiently covered for polymorphism discovery at these levels (see Fig. 5 for relation between coverage and Ks)(cid:173)
`bValidated in dbSNP or NCBI trace archive. 'Not identified in dbSNP or NCBI trace archive. dRates of polymorphism discovery were evaluated
`irrespective of coverage. False negatives rates were calculated across Libraries A and B. False positive rates were calculated usi ng only library
`B since not all region s of library A were previously resequenced within the ENCODE project.
`
`DISCUSSION
`Our experience suggests that achieving ade(cid:173)
`quate coverage is one of the most important
`factors in the design of a multiplexed tar-
`geted resequencing experiment. Depending
`on assumptions made in the experiment,
`the desired coverage (and as a consequence,
`the cost) can vary substantially. Key con(cid:173)
`siderations include whether the objective is
`(i) discovering genetic variants for genotyp(cid:173)
`ing by a separate method such as custom SNP genotyping, (ii)
`conducting polymorphism discovery and variant calling within one
`sequencing experiment, and/or (ii i) exhaustively resequencing for
`all common and rare variants.
`Exhaustive polymorphism discovery is the next major phase for
`genome-wide association studies. Indexing of short-reads was
`surprisingly robust at polymorphism identification. For example,
`
`rarer variant was less than 10 reads (Fig. 5). Highlighting the
`dependence of false negatives on coverage, all polymorphisms that
`were covered by 20 or more reads (summed across individuals
`known to differ from the reference) had a Ks > 1,000. Overall, we
`observed that 90% of variants were detectable, though designing
`experiments for an average of greater than 20 reads will be essential
`for controlling false negatives.
`While analyzing bases with a K, > 100 for false positives using
`US National Center for Biotechnology Information (NCBI)(cid:173)
`archived ENCODE traces, we discovered new SNPs that were
`evident in visual reinspection of capillary traces but that had not
`been annotated in dbSNP (Fig. 4f- h). These examples demonstrate
`that index-based resequencing can identify new variants even in
`heavily sequenced and heavily annotated
`regions. Notably,
`within library B, two variants with a Ks > 100 were not SNPs
`but actually insertions ( with dbSNP, rsl 1279266 is a 1-bp insertion
`and rsl0555419 is a 6-bp insertion). Thus, it is possible to identify
`genetic variants explicitly not allowed within the alignment scheme.
`
`Genotyping individuals at known polymorphisms
`As false negatives are clearly tied to coverage, we explored the
`influence of coverage further by analyzing sequenced regions in an
`individual-by-individual analysis. Derived in equation 3 (see
`below), Ki is the analogous Bayes factor for the ith individual
`having the rarer allele at a known polymorphic base. Conceptually,
`it can be thought of as a specific individual's contribution to Ks.
`We calculated the percentage of variants correctly identified in an
`
`Figure S I Relationship between base-level coverage and Bayes factor for
`polymorphism discovery and variant genotyping . (a) Total coverage across
`those individuals with a nonreference genotype at a known polymorphism .
`(b) Magnification of the graph in a. (c) The percent of the time the correct
`genotype was determined versus the coverage of the variant within the
`individual. Plots contain cumulative statistics using variant discovery and
`genotyping within both libraries A and B.
`
`• Identified known SNPs
`
`• Missed known SNPs "f :.
`I •
`••••
`.i}-
`. ·It-~-
`: ..,.._.
`••••
`• ,._ 11! "'
`\~ .1
`
`b 10
`
`8
`
`6
`
`Q>
`oi 4
`0
`__J
`
`2
`
`0
`
`.
`.
`
`K5 =1 0
`
`- '!
`
`0
`
`~
`
`---"
`
`:}:_s = 10
`
`1,000
`100
`10
`Total number of reads for all individuals
`with variant at polymorphic base
`
`1,000
`100
`10
`Total number of reads for all individuals
`with variant at polymorphic base
`
`a
`
`100
`
`80
`
`60
`
`40
`
`20
`
`Q>
`oi
`0
`__J
`
`C
`
`.
`.
`.
`.
`..
`-.
`..
`.
`. .
`-.
`"'
`I•.
`.. ......
`: ·=· ...... .. . . .
`
`•
`
`• •
`• •• •
`•
`• •
`• •
`•
`
`•
`
`•
`
`.!!!
`
`c:: " ·~
`
`0
`
`40
`
`20
`
`•
`
`•
`
`Threshold
`a K1=3
`a K1= 10
`a K1= 100
`
`Genotyped
`correct
`3,376
`3,144
`2,677
`
`Genotyped
`incorrect
`115
`58
`8
`
`Number of reads at base with a
`known variant in a sequenced individual
`
`NATURE METHODS I VOL.5 N0.10 I OCTOBER 2008 I 891
`
`00005
`
`

`

`I ARTICLES
`
`even when we used a highly restrictive error alignment scheme, we
`identified a new coding SNP 3 bases from an anno tated SNP.
`Additionally, it is encouraging to see the discovery of insertions
`using an alignment scheme only aUowing substitutions. FinaUy, we
`show that an automated analysis strategy for short-read data can
`uncover new variants, even in regions that had been previously
`sequenced for variant discovery using samples from these indivi(cid:173)
`duals in the HapMap database.
`Based on the false positive and false negative rates, a critical
`factor will be how to balance coverage and cost. Using Ks > 1,000,
`we observed a low false positive rate and a high false negative rate.
`As we required that the exact base must be polymorphic in an
`existing database, the actual false- positive rate may be lower.
`False negatives are due to the additional coverage ( > 10 reads)
`required for overcoming higher Ks thresholds. Considering the
`substantial base-to-base variability (Fig. 3), one would not simply
`want to design for an average coverage of 10 reads. Rather,
`by designing for - 50 reads or more, one m ay minimize both the
`false negative and false-positive rates, given coverage variability of
`short-read sequencing.
`Although whole-genome sequencing may be the primary moti(cid:173)
`vator for in1provements in sequencing technology, it is clear that
`next-generation technologies are immediately useful for foc used,
`hypothesis-driven sequencing of linkage peaks, groupings of can(cid:173)
`didate genes or sequencing the entire known coding sequence of the
`human genome. In this report, we developed per-individual
`indexing of pooled PCR amplicons to carry o ut targeted sequen(cid:173)
`cing. However, it is straightforward to envision using other sample(cid:173)
`preparation methods such as genome partitio ning9- 12. One could
`replace the pooled amplicons from our experimental strategy with
`total genomic DNA and complete the partitioning by a hybridiza(cid:173)
`tion approach after pooling ligated amplicons. Indeed, variant
`discovery through the resequencing o f all ca ndidate regions impli (cid:173)
`cated in a disease across genomes of dozens, and possibly hundreds,
`of individuals could be considerably accelerated by merging multi(cid:173)
`plex capture, indexing and next-generation sequencing approaches
`into a single protocol.
`
`METHODS
`Amplification and pre-ligation sample preparation. Two ampli(cid:173)
`con libraries (library A and B; see Supplementary Table 2 for
`specific regions) were constructed from individually amplified 5-
`kb regio ns using long-range PCR. Regions comprising a library
`were chosen from the ENCODE project and provide a sampling of
`different genomic region types. Only a portion of the ENCODE
`regions have been previously sequenced as part of the SNP
`discovery portion o f the ENCODE project. Library A was a
`composite of previously sequenced regions and regions that had
`not been sequenced in ENCODE. Library B was entirely composed
`of regions that had been previously sequenced. Flanking primers
`for each amplicon (Supplementary Table 5 o nline) we re manually
`selected. Although we did not screen for the existence of known
`this wo uld be
`polymorphisms within the primer sequences,
`advisable in future experiments.
`
`Base calling and alignment. Illumina GA images were analyzed
`using a modified Illumina GA (0.2.2.6) processing pipeline.
`Descriptions of the modifications and all scripts are available in
`Supplementary Methods. A default m atrix deconvolution file was
`
`892 I VOL.5 N0.10 I OCTOBER 2008 I NATURE METHODS
`
`used for base-calling by 'Bustard', based on a control phi-X library
`provided by Illumina. Following base-calling, a script was used to
`access all sequences, regardless of quality score. We deviated from
`the default cu t-offs provided by Illumina's 'GERALD' process
`because we found that sequence quality was better controlled by
`matching to an index (46/4,096 or 0.1% by chance) or matching
`with I or fewer errors to the reference sequence. Bases were aligned
`by progressively truncating the sequence at the 5' end until a unique
`alignment was obtained with a probability of stochastic alignment
`of less than 1 %. This ap proach is distinct from recently described
`short-read alignment schem es 13 but has so me of the same features.
`Aligned sequences were summarized into a binomial of counts
`agreeing (a; ) or disagreeing (b;) with the reference sequence for the
`i th individual of n total individuals for each base (s). The error rate
`(Os)
`is the percentage of reads disagreeing with the reference
`sequence across aU samples. Model 1 (M1) assumes that the error
`rate for an individual equals the error rate for all individuals, o r
`O; = Os. Therefore, we can estimate 05 as:
`0s= L _ a; _
`
`II
`(
`)
`i = I a; + b;
`
`.
`
`(1)
`
`Model 2 (M2) assumed that for some individual i, we have O;#- Os.
`To calculate this likelihood, we used a hyperprior offset (as) fo r
`three possible genotypes. The hyperprior can be thought of as
`conditioning on the zygosity of the individual and thus reflecting
`the uncertainty of the genotype of the individual at a given base. In
`this analysis, we focused on the detection of biallelic SNPs, but
`triallelic o r o ther types of SNPs could be considered under more
`complex models. The Bayes fac tor for a base position is
`
`( a,

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket