`
Performance comparison of whole-genome sequencing platforms
`Hugo Y K Lam1,8, Michael J Clark1, Rui Chen1, Rong Chen2,8, Georges Natsoulis3, Maeve O’Huallachain1,
`Frederick E Dewey4, Lukas Habegger5, Euan A Ashley4, Mark B Gerstein5–7, Atul J Butte2, Hanlee P Ji3 & Michael Snyder1
`
`Whole-genome sequencing is becoming commonplace, but
`the accuracy and completeness of variant calling by the most
`widely used platforms from Illumina and Complete Genomics
`have not been reported. Here we sequenced the genome
`of an individual with both technologies to a high average
`coverage of ~76×, and compared their performance with
`respect to sequence coverage and calling of single-nucleotide
`variants (SNVs), insertions and deletions (indels). Although
`88.1% of the ~3.7 million unique SNVs were concordant
`between platforms, there were tens of thousands of platform-
`specific calls located in genes and other genomic regions.
`In contrast, 26.5% of indels were concordant between
`platforms. Target enrichment validated 92.7% of the
`concordant SNVs, whereas validation by genotyping array
`revealed a sensitivity of 99.3%. The validation experiments
`also suggested that >60% of the platform-specific variants
`were indeed present in the genome. Our results have important
`implications for understanding the accuracy and completeness
`of the genome sequencing platforms.
`
`The ability to sequence entire human genomes has the potential to
`provide enormous insights into human diversity and genetic disease,
`and is likely to transform medicine1,2. Several platforms for whole-
`genome sequencing have emerged3–7. Each uses relatively short reads
`(up to 450 bp) and through high-coverage DNA sequencing, vari-
`ants are called relative to a reference genome. The platforms of two
`companies, Illumina and Complete Genomics (CG), have become
`particularly commonplace, and >90% of the complete human
`genome sequences reported thus far have been sequenced using these
`platforms5,8–11. Each of these platforms uses different technologies,
and despite their increasingly common use, a detailed comparison
of their performance has not been reported previously. Such a
comparison is crucial for understanding accuracy and completeness
of variant calling by each platform so that robust conclusions can be
drawn from their genome sequencing data.

1Department of Genetics, Stanford University, Stanford, California, USA. 2Division
of Systems Medicine, Department of Pediatrics, Stanford University, Stanford,
California, USA. 3Department of Medicine, Stanford University, Stanford, California,
USA. 4Center for Inherited Cardiovascular Disease, Division of Cardiovascular
Medicine, Stanford University, Stanford, California, USA. 5Program in Computational
Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA.
6Department of Molecular Biophysics and Biochemistry, Yale University, New Haven,
Connecticut, USA. 7Department of Computer Science, Yale University, New Haven,
Connecticut, USA. 8Present address: Personalis, Inc., Palo Alto, California, USA.
Correspondence should be addressed to M.S. (mpsnyder@stanford.edu).

Received 1 September; accepted 15 November; published online 18 December
2011; corrected after print 7 June 2012; doi:10.1038/nbt.2065
`
`RESULTS
`Sequence data generation
`To examine the performance of Illumina and CG whole-genome
`sequencing technologies, we used each platform to sequence two
`sources of DNA, peripheral blood mononuclear cells (PBMCs) and
`saliva, from a single individual to high coverage. An Illumina HiSeq
`2000 was used to generate 101-bp paired-end reads, and CG gener-
`ated 35-bp paired-end reads. The average sequence coverage for each
`sample was ~76× (Table 1), which resulted in a total coverage equiva-
`lent to 300 haploid human genomes.
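The coverage figures quoted above can be checked with simple arithmetic on the per-sample values in Table 1. The sketch below is only an illustration: the variable names are ours, and the implied genome length is back-calculated from the table rather than taken from the authors' methods.

```python
# Per-sample sequencing yield (Gb) and mean coverage (x), copied from Table 1.
samples = {
    ("CG", "PBMC"): {"bases_gb": 233.2, "coverage": 78},
    ("CG", "saliva"): {"bases_gb": 218.6, "coverage": 73},
    ("Illumina", "PBMC"): {"bases_gb": 151.4, "coverage": 50},
    ("Illumina", "saliva"): {"bases_gb": 307.1, "coverage": 102},
}

coverages = [s["coverage"] for s in samples.values()]
mean_coverage = sum(coverages) / len(coverages)  # average depth per sample
total_coverage = sum(coverages)                  # haploid genome equivalents

# Genome length implied by each row: Gb of sequence divided by fold coverage.
implied_genome_gb = {
    key: s["bases_gb"] / s["coverage"] for key, s in samples.items()
}

print(f"mean coverage per sample: {mean_coverage:.2f}x")
print(f"total coverage: {total_coverage}x")
```

The mean of the four samples is close to the ~76× quoted in the text, and the sum is close to the stated 300 haploid genome equivalents.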
`We aligned reads from both platforms to the human reference
`genome (NCBI build 37/HG19)12 and called SNVs. For Illumina, a
`total of 4,539,328,340 sequence reads, comprising 1,499,021,500 reads
`(151.4 Gb) from PBMCs and 3,040,306,840 reads (307.1 Gb) from
`saliva, were mapped to the reference genome using the Burrows-
`Wheeler Aligner13. About 88% mapped successfully. Duplicate reads
`were removed using the Picard software tool, resulting in 3,588,531,824
(79%, 362 Gb) mapped, nonduplicate reads (Table 1). Targeted realignment
and base recalibration were performed using the Genome Analysis
`ToolKit (GATK)14. We used GATK to detect a total of 3,640,123 SNVs
`(3,570,658 from PBMCs and 3,528,194 from saliva) with a quality
`filter as defined by the 1000 Genomes Project11. CG generated a gross
`mapping yield of 233.2 Gb for the PBMC sample and 218.6 Gb for the
`saliva sample for a total of 451.8 Gb of sequence (Table 1). We analyzed
`these data using the CG Analysis pipeline to identify 3,394,601 SNVs
`(3,277,339 from PBMCs and 3,286,645 from saliva). A detailed com-
`parison of PBMCs versus saliva differences has revealed that few of the
`tissue-specific calls could be validated by independent methods, and
`these results will be published elsewhere.
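The mapping and duplicate-removal percentages reported for the Illumina data follow directly from the read counts in Table 1. The following sketch recomputes them; the helper name `pct` is ours, and the code is a consistency check rather than part of the authors' pipeline.

```python
# Illumina read counts copied from Table 1.
reads = {
    "PBMC": {"total": 1_499_021_500, "mapped": 1_367_988_241, "dedup": 1_233_937_084},
    "saliva": {"total": 3_040_306_840, "mapped": 2_614_663_882, "dedup": 2_354_594_740},
}


def pct(part: int, whole: int) -> int:
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


# Sum the two samples to obtain the combined row of the table.
combined = {
    key: sum(sample[key] for sample in reads.values())
    for key in ("total", "mapped", "dedup")
}

for name, counts in {**reads, "combined": combined}.items():
    mapped_pct = pct(counts["mapped"], counts["total"])
    dedup_pct = pct(counts["dedup"], counts["total"])
    print(f"{name}: {mapped_pct}% mapped, {dedup_pct}% mapped and nonduplicate")
```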
`To examine the completeness of sequencing, we analyzed the
`depth and breadth of genomic coverage by each platform with the
`PBMC genome sequences. Both platforms covered the majority of
`the genome, and >95% of the genome was covered by 17 or more reads
`(Fig. 1a). The Illumina curve drops to zero coverage at much lower
`read depth than the CG curve because there are substantially fewer
`reads in the Illumina data set. We also noticed that CG generally is less
`uniform in coverage (Fig. 1b). This suggests that to achieve a certain
`level of coverage for most of the genome, CG requires more overall
`sequencing than Illumina.
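A breadth-of-coverage curve of the kind described above can be derived from a per-depth histogram of bases. The sketch below shows one way to compute the fraction of the genome covered at or above a given read depth; the function name and the toy histogram are hypothetical and do not reproduce the study's data.

```python
from typing import Dict


def fraction_covered(depth_hist: Dict[int, int], min_depth: int) -> float:
    """Fraction of genome positions covered by at least `min_depth` reads.

    `depth_hist` maps a read depth to the number of bases seen at that depth.
    """
    total = sum(depth_hist.values())
    if total == 0:
        raise ValueError("empty depth histogram")
    covered = sum(n for depth, n in depth_hist.items() if depth >= min_depth)
    return covered / total


# Hypothetical toy histogram (depth -> number of bases); not data from the study.
toy_hist = {0: 10, 5: 20, 17: 30, 40: 100, 80: 40}

print(fraction_covered(toy_hist, 17))
```

Evaluating the function over a range of depths gives a cumulative curve comparable in form to Fig. 1a, while the histogram itself corresponds to Fig. 1b.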
`
`78
`
`VOLUME 30 NUMBER 1 JANUARY 2012 nature biotechnology
`
`©2012 Nature America, Inc. All rights reserved.
`
`Foresight EX1027
`Foresight v Personalis
`
`
`
ANALYSIS
`
Table 1 Whole-genome sequencing using CG and Illumina platforms

          CG                         Illumina
Sample    Bases (Gb)  Coverage (×)   Bases (Gb)  Coverage (×)  Reads          Mapped (% of reads)  After duplicate removal (% of reads)
Blood     233.2        78            151.4        50           1,499,021,500  1,367,988,241 (91%)  1,233,937,084 (82%)
Saliva    218.6        73            307.1       102           3,040,306,840  2,614,663,882 (86%)  2,354,594,740 (77%)
Total     451.8       151            458.5       153           4,539,328,340  3,982,652,123 (88%)  3,588,531,824 (79%)
`
To directly determine accuracy, we randomly selected concordant and
platform-specific regions for Sanger sequencing. We
`found that 20 of 20 concordant SNVs could be validated, whereas 2
`of 15 (13.3%) Illumina-specific and 17 of 18 (94.4%) CG-specific
`SNVs could be validated. This suggests CG has higher accuracy than
`Illumina and that almost all the concordant calls are correct.
`To attempt to examine accuracy on a larger scale, we used Agilent
`SureSelect target enrichment capture technology to capture 33,084
`(9.6%) Illumina-specific, 3,015 (3.0%) CG-specific and 24,247 (0.7%)
`concordant SNVs for sequencing on an Illumina Hi-Seq instrument
`(Table 2). We found that the validation rate for the concordant SNVs
was 92.7%, whereas the validation rates were 61.9% and 64.3% for the
CG-specific and Illumina-specific SNVs, respectively. These results indicate that
`the platform-specific calls have a very high false-positive rate of at
`least 35%. We also found that 12.6–21.4% of the targeted SNVs were
`not called in the validation, possibly owing to nonunique regions
`that are difficult to map precisely. Because the capture validation was
`performed using Illumina DNA sequencing technology, it is diffi-
`cult to directly compare the Illumina versus CG SNV rates with this
`approach. Nonetheless, these overall results indicate that concordant
`SNVs have high accuracy and platform-specific SNVs have a high
`false-positive rate.
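The validation rates quoted above are consistent with dividing the validated calls by the calls that could be assessed (validated plus invalidated), leaving out targets that received no call. This reading is inferred from the counts in Table 2 and is not stated explicitly in the text; the sketch below, with function names of our own, recomputes the rates under that assumption.

```python
# Counts copied from Table 2 (target enrichment capture with Illumina sequencing).
table2 = {
    "CG-specific": {
        "targeted": 3_015, "not_validated": 388,
        "invalidated": 1_001, "validated": 1_626,
    },
    "Illumina-specific": {
        "targeted": 33_084, "not_validated": 7_088,
        "invalidated": 9_280, "validated": 16_716,
    },
    "Concordant": {
        "targeted": 24_247, "not_validated": 3_053,
        "invalidated": 1_543, "validated": 19_651,
    },
}


def validation_rate(row: dict) -> float:
    """Validated calls as a percentage of assessable calls (assumed definition)."""
    assessed = row["validated"] + row["invalidated"]
    return 100 * row["validated"] / assessed


def not_called_rate(row: dict) -> float:
    """Targeted SNVs with no call in the validation, as a percentage of targets."""
    return 100 * row["not_validated"] / row["targeted"]


for name, row in table2.items():
    print(f"{name}: validation {validation_rate(row):.1f}%, "
          f"not called {not_called_rate(row):.1f}%")
```

Under this definition the complement of the platform-specific validation rates (roughly 36 to 38%) matches the "at least 35%" false-positive estimate given in the text.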
`
`Association of genes with variant calling differences
`To better understand the platform-specific calls, we investigated
`the association of SNVs from each platform with different genomic
`elements. We annotated both the platform-specific SNVs and con-
`cordant SNVs with gene and repeat annotations using Annovar16. In
`general, we did not find a significant difference between the associa-
`tions of the platform-specific SNVs and the concordant SNVs with
`gene elements, such as exons and introns (Fig. 3a,b). For example,
`1% and 32–38% of the platform-specific SNVs were associated with
`exonic and intronic regions, respectively, regardless of the platform.
`This correlates well with the portions of exons (~1.3%) and introns
`(~37%) in the whole human genome. Nonetheless, the CG-specific
`SNVs had a slightly stronger association (14%) with noncoding
`RNA than the Illumina-specific SNVs (12%) and concordant SNVs
`(11%). Overall, many platform-specific SNVs lie in RNA coding
`regions of the human genome, and thus deducing their accuracy is
`of high importance.
`
[Figure 1 plots. (a) Percentage of genome (y-axis, 0 to 100) versus cumulative read depth (x-axis, 0 to 100) for Complete Genomics and Illumina. (b) Number of bases in millions (y-axis, 0 to 100) versus read depth (x-axis, 0 to 200) for Complete Genomics and Illumina.]

`
`Figure 1 Genome coverage at different read depths. (a) Percentage
`of genome covered by different read depths in different platforms.
`(b) Histogram of genome coverage at different read depths.
`
`Extensive differences in variant calling
`We sought to compare the sensitivity and accuracy of each platform for
`SNV calling. In total, 88.1% (3,295,023 out of 3,739,701) of the unique
`SNVs were concordant—that is, either a homozygous or heterozygous
`SNV was detected at the same locus by the two platforms in at least one
`sample (Fig. 2a). We detected 444,678 SNVs by only one platform or the
`other but not both, of which 345,100 were specific to Illumina (10.5%
`of the Illumina combined SNVs) and 99,578 were CG-specific (3.0%
`of the CG combined SNVs). Among the Illumina-specific SNVs, 67%
`were ‘no-calls’ (that is, not a reference or variant call), 11% were reference
`calls and 22% were other types of calls (that is, complex and substitution
`calls) in CG. Similarly, 75% of the CG-specific SNVs were no-calls in
`Illumina, and 25% were reference calls (Fig. 2b). The higher percentage of
`no-calls in Illumina is likely because GATK does not make the complex and
`substitution calls as does the CG pipeline.
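The concordance figures in this paragraph come from a locus-level intersection of the two call sets. The sketch below illustrates that comparison with Python sets on hypothetical toy loci, then checks that the counts reported in the text add up. It is not the authors' custom script, and matching on chromosome and position alone is a simplifying assumption.

```python
from typing import Set, Tuple

Variant = Tuple[str, int]  # (chromosome, position); locus-level comparison


def compare_call_sets(a: Set[Variant], b: Set[Variant]) -> dict:
    """Split two call sets into concordant and platform-specific loci."""
    concordant = a & b
    union = a | b
    return {
        "concordant": len(concordant),
        "a_specific": len(a - b),
        "b_specific": len(b - a),
        "union": len(union),
        "concordance_pct": 100 * len(concordant) / len(union) if union else 0.0,
    }


# Hypothetical toy loci, for illustration only.
illumina = {("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr3", 7)}
cg = {("chr1", 100), ("chr2", 50), ("chr4", 9)}
toy = compare_call_sets(illumina, cg)

# Consistency check on the counts reported in the text.
concordant, il_specific, cg_specific = 3_295_023, 345_100, 99_578
union = concordant + il_specific + cg_specific
reported_pct = 100 * concordant / union

print(toy)
print(union, round(reported_pct, 1))
```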
`To assess the quality of the calls, we used four criteria: the
`transition/transversion ratio (ti/tv), quality scores, the heterozygous/
`homozygous call ratio and novel, platform-specific SNVs. The ti/tv
`ratio of 2.1 for SNVs in humans has been described in several previous
`studies, including the 1000 Genomes Project11. The ti/tv ratio for all
`of the SNVs detected in these genomes was 2.04, but in our data the
`ti/tv of SNVs concordant between the two platforms was 2.14. For all
`SNVs detected by the Illumina platform, ti/tv was 2.05, but for SNVs
`specific to Illumina it was only 1.40. Similarly, for SNVs detected by
`CG, ti/tv was 2.13, but for CG-specific SNVs, it was 1.68. Thus, the
`ti/tv of concordant SNVs was very close to that expected, whereas the
` platform-specific ti/tv was much lower, suggesting that the platform-
`specific calls were of lower accuracy. Inspection of the quality scores of
`the platform-specific SNVs showed that they were indeed lower than
`those for the concordant calls (Supplementary Fig. 1). Furthermore,
`the heterozygous/homozygous call ratio was 1.48 for the concordant
`calls, whereas the platform-specific ratios were indeed higher: 2.48
`for Illumina-specific calls and 1.98 for CG-specific calls.
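The transition/transversion ratio used above as a quality indicator is straightforward to compute from reference and alternate alleles: transitions are purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) changes, and all other substitutions are transversions. A minimal sketch, with function names and toy data of our own:

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) allele pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; ratio undefined")
    return ti / tv


# Hypothetical toy calls: four transitions and two transversions.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(toy_snvs))
```

A call set enriched for random errors drifts toward a lower ratio, which is why the platform-specific values of 1.40 and 1.68 are read as a sign of lower accuracy.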
`To examine the fraction of novel platform-specific SNVs, we
noted that 3,160,905 (95.9%) of the concordant SNVs were present
`in dbSNP131 (ref. 15). In contrast, only 260,108 (75.4%) of the SNVs
`in the Illumina-specific set, and 72,735 (73.0%) of the SNVs in the
`CG-specific set were present in dbSNP131. Thus, the platform-specific
`call sets were enriched for novel SNVs, suggesting that they likely
`contain more errors. In addition, the overall genotype concordance
`rate (that is, the proportion of concordant calls having a consistent
`genotype—heterozygous or homozygous—across both platforms) for
`the concordant SNVs was 98.9%. The high genotype concordance rate
`and percentage of known SNVs indicate that the concordant SNVs
`were of high quality and accuracy.
`To further assess the accuracy of the variant calling, we sought to vali-
`date our SNVs by using Omni Quad 1M Genotyping arrays, traditional
`Sanger sequencing and Agilent SureSelect target enrichment capture
`followed by sequencing on an Illumina HiSeq for both samples. Of the
`260,112 heterozygous calls detected with the Omni array, 99.5% were
`present in the entire SNV data set, 99.34% were concordant calls and only
`0.16% were platform-specific SNVs. This demonstrates that both plat-
`forms are sensitive to known SNVs and that few known single-nucleotide
`polymorphisms (SNPs) are detected by only one platform.
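Sensitivity against the genotyping array is the share of array-detected variant loci that also appear in the sequencing call set. The sketch below shows the calculation on hypothetical toy loci and confirms that the two reported components sum to the overall figure; the function name is ours.

```python
from typing import Set, Tuple

Locus = Tuple[str, int]  # (chromosome, position)


def sensitivity_pct(array_calls: Set[Locus], sequencing_calls: Set[Locus]) -> float:
    """Percentage of array-genotyped variant loci recovered by sequencing."""
    if not array_calls:
        raise ValueError("no array calls to compare against")
    return 100 * len(array_calls & sequencing_calls) / len(array_calls)


# Hypothetical toy loci, for illustration only.
array = {("chr1", 10), ("chr1", 20), ("chr2", 30), ("chr2", 40)}
seq = {("chr1", 10), ("chr1", 20), ("chr2", 30), ("chr5", 99)}
toy_sensitivity = sensitivity_pct(array, seq)

# Reported breakdown of the array heterozygous calls found in the SNV data set.
concordant_pct, platform_specific_pct = 99.34, 0.16
overall_pct = concordant_pct + platform_specific_pct

print(toy_sensitivity, round(overall_pct, 1))
```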
`
`Figure 2 SNV detection and intersection.
`(a) SNVs detected from the PBMC and saliva
`samples in each platform were combined.
`The unions of SNVs in each platform were
`then intersected. Sensitivity was measured
`against the Illumina Omni array. Ti/Tv is the
`transition-to-transversion ratio. The known
`and novel counts were based on dbSNP.
`‘Sanger’ and ‘validated’ represent validation by
`Sanger sequencing and Illumina sequencing
`(with Agilent target enrichment capture),
`respectively. (b) Comparing platform-specific
`SNVs to non-SNV calls in another platform. IL,
`Illumina; CG, Complete Genomics.
`
[Figure 2 panel data, reconstructed from the figure labels]

(a) SNV counts per platform and their intersection
                       Complete Genomics   Illumina           Overall (CG+IL)
Blood                  3,277,339           3,570,658
Saliva                 3,286,645           3,528,194
Total (union)          3,394,601           3,640,123          3,739,701
Ti/Tv (total)          2.13                2.05               2.04
Sensitivity                                                   99.5%
Specific / concordant  99,578 (3.0%)       345,100 (10.5%)    3,295,023 (88.1%)
Ti/Tv                  1.68                1.40               2.14
Known                  72,735 (73.0%)      260,108 (75.4%)    3,160,905 (95.9%)
Novel                  26,843 (27.0%)      84,992 (24.6%)     134,118 (4.1%)
Sanger                 94.4% (17/18)       13.3% (2/15)       100% (20/20)
Validated              61.9%               64.3%              92.7%
Venn diagram: CG-specific 2.7%; concordant SNPs 3,295,023 (88.1%), sensitivity 99.34%; IL-specific 9.2%.

(b) Platform-specific SNVs by call type in the other platform
Complete Genomics specific (99,578): IL no-call 74,556 (75%); IL ref. 25,022 (25%)
Illumina specific (345,100): CG no-call 230,119 (67%); CG sub & other 77,196 (22%); CG ref. 37,785 (11%)

`
`To further ascertain whether the platform-
`specific SNVs might be located in functionally
`important regions, we examined whether the
`variant calls were present in the Varimed data-
`base2,17, which contains variants catalogued
`through genome-wide association studies and
`other genetic linkage studies. We found that
`31 Illumina- and 3 CG-specific SNVs were
`present in Varimed, from which we were able
`to estimate associations between diseases
`and platform-specific SNPs (Supplementary
`Table 1). One of these, rs2672598, was called
`in both PBMCs and saliva by the Illumina
`platform, but not called in either PBMCs or
`saliva by the CG platform. This SNP is at the
`5′ end of HTRA1 and known to increase the
`risk of age-related macular degeneration by
4.89-fold (P = 3.39 × 10^−11)18,19. Another
`example is the A202T allele in the TERT gene
`encoding telomerase. This allele has been associated with aplastic ane-
`mia20 and was only detected by the Illumina platform. Thus, some
`platform-specific calls are of high importance.
`
`Association of repetitive regions with variant calling differences
`In contrast to coding SNVs, we found that overall the platform-
` specific SNVs had a substantially stronger association with repeti-
`tive elements such as Alu, telomere and simple repeat sequences
`(Fig. 3c,d). For example, only 0.3% of the concordant SNVs were
`associated with telomere or centromere sequences, but 4% and 2%
`of the CG-specific SNVs and Illumina-specific SNVs, respectively,
`were associated with telomeric or centromeric repeats (Fig. 3c,e).
`The enrichment of platform-specific SNVs with simple repeats and
`low-complexity repeats was particularly evident. We found that <1%
`of the concordant SNVs were associated with simple repeats, but
`8% and 15% of the CG-specific SNVs and Illumina-specific SNVs,
`respectively, were associated with these sequences. Among the
platform-specific SNVs, CG had a stronger association with the Alu
element and centromere and telomere sequences, whereas Illumina
had a stronger association with L1, simple repeat and low-complexity
repeat. Overall, these results indicate that many platform-specific
SNVs lie in repetitive regions, suggesting that these calls may be due
to mapping difficulties and errors.
We also measured GC content and read depth of the SNVs in the
gene and repeat regions. The average GC content of the concordant,
CG-specific and Illumina-specific SNVs were 0.46, 0.45 and 0.41,
respectively. The average read depths were 48, 47 and 44, respectively.
Thus, the Illumina-specific SNVs showed a lower GC content and read
depth compared to the concordant SNVs. Analysis by gene and repeat
regions did not reveal any strong correlation with GC content. However,
we found that Illumina-specific SNVs had a strikingly higher read
depth in centromeric and telomeric regions, whereas CG had higher
read depth in the tRNA and rRNA regions (Supplementary Fig. 2).

Table 2 Agilent SureSelect target enrichment capture with Illumina sequencing

                  CG specific       Illumina specific   Concordant
Total             99,578    -       345,100    -        3,295,023    -
Targeted           3,015   3.0%      33,084   9.6%         24,247   0.7%
Not validated        388  12.9%       7,088  21.4%          3,053  12.6%
Invalidated        1,001  33.2%       9,280  28.0%          1,543   6.4%
Validated          1,626  53.9%      16,716  50.5%         19,651  81.0%
Validation rate        -  61.9%           -  64.3%              -  92.7%

[Figure 3 plots. Bar charts of percent association of SNVs for concordant, CG-specific and IL-specific calls: (a) exonic, intronic, intergenic, UTR3 and UTR5 regions; (b) upstream, downstream, ncRNA and splicing regions; (c) tRNA, rRNA, telomere and centromere; (d) L1, Alu, simple repeat and low complexity. (e) Circular plot of SNV frequency by chromosome.]

Figure 3 SNV association with different genomic
elements. (a) Gene elements: UTR, exonic, intronic
and intergenic regions. Inset: number of SNVs
associated with UTR5, UTR3 and exonic regions.
(b) Gene elements: splicing sites, noncoding RNA
and upstream/downstream (<1 kb) regions of genes.
(c) Repetitive elements: centromere, telomere, tRNA
and rRNA. (d) Repetitive elements: L1, Alu, simple
repeat and low-complexity repeat. (e) SNV frequency
at different chromosomal locations. Tracks from outer
to inner: SNV frequency for Illumina (IL), Complete
Genomics (CG), concordant, IL-specific and CG-specific
calls. Outermost: chromosome ideogram.

Differences in indel calls
We also examined small indel calls from Illumina and CG platforms.
Small indels ranged in size from −107 to +36 bp by Illumina and −190
to +48 bp by CG. Illumina calls were made using GATK with the
Dindel model21, and CG calls were obtained from their standard
pipeline and converted to VCF format22 using the CG conversion
tool. A stringent quality score cutoff of 30 was used for each platform.
This resulted in a total of 811,903 indel calls with 611,110 for Illumina
and 430,258 for CG. We found that only 215,382 (26.5%) indels were
detected by both Illumina and CG, whereas 390,060 (48.1%) and
206,461 (25.4%) were Illumina- and CG-specific, respectively (Fig. 4a).
Owing to the complexity of indels compared to SNVs, the number of
concordant indels was much lower than the number of concordant
SNVs. We also observed that the indels detected by both platforms
were similar in their size distribution and type (Fig. 4b), though it is
noteworthy that the Illumina data showed a slight enrichment of 1-bp
insertions, whereas the CG data showed a slight enrichment of 1-bp
deletions.
`
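The indel counts reported above can be reconciled with the per-platform totals once the calls removed by window merging (noted in the Figure 4 legend) are taken into account. The sketch below is a bookkeeping check on the published numbers only; the variable names are ours.

```python
# Indel counts reported in the text and in Figure 4a.
concordant = 215_382
il_specific = 390_060
cg_specific = 206_461
il_total, cg_total = 611_110, 430_258
il_merged_away, cg_merged_away = 5_668, 8_415  # removed by window merging (Fig. 4 note)

union = concordant + il_specific + cg_specific
concordance_pct = 100 * concordant / union

# Per-platform totals should equal concordant + specific + merged-away calls.
il_balance = concordant + il_specific + il_merged_away
cg_balance = concordant + cg_specific + cg_merged_away

print(union, round(concordance_pct, 1))
print(il_balance == il_total, cg_balance == cg_total)
```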
[Figure 4 panel data, reconstructed from the figure labels]

(a) Indel counts per platform and their intersection
                     Complete Genomics   Illumina          Overall
Blood                361,783             523,445
Saliva               341,172             555,770
Total (union)        430,258             611,110           811,903
Platform specific    206,461 (48.0%)     390,060 (63.8%)
Concordant                                                 215,382
Venn diagram (CG+IL): CG-specific 206,461 (25.4%); concordant indels 215,382 (26.5%); IL-specific 390,060 (48.1%).

(b) Histogram of indel size for Complete Genomics and Illumina (x-axis: indel size, −72 to +48; y-axis: 0 to 160,000).

`
`Figure 4 Indel detection and intersection. (a) Indels detected from the PBMC and saliva samples in each platform were combined. The unions of
indels in each platform were then intersected. Note: 5,668 IL and 8,415 CG indels were removed after 5-bp window merging. (b) Indel size distribution.
`Negative size represents deletion and positive size represents insertion.
`
`Detection accuracy was assessed for concordant and platform-
` specific indels by comparing them to indels detected by exome sequenc-
`ing of the same individual23. We validated 2.2% (4,681) of concordant
`indels but only 1.2% (4,682) of Illumina-specific and 0.3% (561) of
`CG-specific indels. These lower validation rates for platform-specific
`indels suggest that they are indeed less robust than those detected by
`both platforms. Because exome sequencing was performed using the
`Illumina HiSeq platform, bias toward greater consistency between
`the Illumina-specific and exome sequencing–specific indels was
`not unexpected.
`We further validated indels by randomly selecting indels for tra-
`ditional Sanger sequencing. For 24 concordant indels, 15 could be
`amplified by PCR allowing us to validate 14 of them (93.33%). For
`42 platform-specific indels, 19 could be amplified allowing us to vali-
`date 10 of 11 Illumina-specific indels and 8 of 8 CG-specific indels.
`Although the platform-specific indels could be validated at a high
`rate, the increased frequency of failed PCR amplification for platform-
`specific versus concordant indels (54.8% versus 37.5%, respectively)
`suggests that there may have been issues with the sequence context
`around a larger fraction of the platform-specific calls. We therefore
`examined whether both the concordant and platform-specific indels
`overlapped with known repeats. We found that 72% of Illumina-
` specific and 63% of CG-specific indels overlapped repeats, whereas
`only 52% of concordant indels overlapped with repeats. Although
`there is a clear enrichment of platform-specific indels over problem-
`atic repeat regions, many bona fide indels were detected by only one
`platform, as demonstrated by their high validation rate. This suggests
`that indel detection by both Illumina and CG lacks sensitivity.
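The PCR failure and Sanger validation percentages in this paragraph can be recomputed from the counts given in the text. The following sketch does so; the helper name `pct` and the dictionary layout are ours.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100 * part / whole


# Counts reported in the text for Sanger validation of indels.
concordant = {"selected": 24, "amplified": 15, "validated": 14}
platform_specific = {"selected": 42, "amplified": 19, "validated": 10 + 8}

results = {}
for label, group in (("concordant", concordant),
                     ("platform-specific", platform_specific)):
    failed = group["selected"] - group["amplified"]
    results[label] = {
        "pcr_failure_pct": round(pct(failed, group["selected"]), 1),
        "validated_pct": round(pct(group["validated"], group["amplified"]), 2),
    }

print(results)
```

The recomputed failure rates (37.5% for concordant and 54.8% for platform-specific indels) match those quoted in the text.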
`
`DISCUSSION
`Overall, we conclude that each genome sequencing approach is
`generally capable of detecting most SNVs. Based on the transition/
`transversion ratio and Sanger sequencing, CG appears to be more
`accurate, but also slightly less sensitive. Illumina, in contrast, covers
`more bases and makes a higher number of overall calls, but also has
`more false positives. This may be in part because Illumina has longer
`reads and is therefore able to map more reads in difficult regions,
`which leads to both increased sensitivity and decreased specificity.
`Nonetheless, both methods clearly call variants missed by the other
`technology. Many of these lie in exons and thus can affect coding
`potential. In fact, 1,676 genes have platform-specific SNVs in exons;
`one of the Illumina-specific SNVs lies in a telomerase gene and is
`likely to affect function. We also found that indel detection is subject
`to a much larger platform bias, with each platform detecting a large
`quantity of indels missed by the other platform. It may therefore be
`beneficial to sequence on both platforms and analyze both data sets
`together, using evidence from one to bolster discovery in the other.
`We demonstrated that the best approach for comprehensive vari-
`ant detection is to sequence genomes with both platforms if budget
`permits. We assessed the cost effectiveness of sequencing on both
`platforms and found that on average it costs about four cents per
`additional variant (Online Methods). Alternatively, supplementing
`with exome sequencing can assess the most interpretable part of the
`genome at higher depth of coverage and accuracy and fill in the gaps
`in the detection of coding variants23. If genome sequencing is per-
`formed on both platforms, platform-specific variants can be validated
`by Sanger sequencing and array capture experiments or disregarded
`if they map to difficult regions (that is, simple repeats) or have low
`quality scores. Using this strategy, variant detection sensitivity and
`specificity can be maximized, and meaningful variants that may
`otherwise have been missed can be discovered.
`
`METHODS
`Methods and any associated references are available in the online version
`of the paper at http://www.nature.com/naturebiotechnology/.
`
`Accession code. Sequence Read Archive: SRA045736.
`
`Note: Supplementary information is available on the Nature Biotechnology website.
`
ACKNOWLEDGMENTS
`This work is supported by the Stanford Department of Genetics and the US
`National Institutes of Health.
`
AUTHOR CONTRIBUTIONS
`H.Y.K.L. and M.J.C. did the analysis. G.N. and L.H. assisted in the analysis. Rui C.
`did DNA sequencing. Rong C. did the disease-association study. Rui C. and M.O’H.
`did the validation experiments. H.Y.K.L., F.E.D., E.A.A., M.B.G., A.J.B., H.P.J. and
`M.S. coordinated the analysis and revised the manuscript. H.Y.K.L., M.J.C. and
`M.S. wrote the manuscript.
`
COMPETING FINANCIAL INTERESTS
`The authors declare competing financial interests: details accompany the full-text
`HTML version of the paper at http://www.nature.com/nbt/index.html.
`
`
`Published online at http://www.nature.com/nbt/index.html.
`Reprints and permissions information is available online at http://www.nature.com/
`reprints/index.html.
`
1. Ajay, S.S., Parker, S.C., Ozel Abaan, H., Fuentes Fajardo, K.V. & Margulies, E.H.
Accurate and comprehensive sequencing of personal genomes. Genome Res.
21, 1498–1505 (2011).
`2. Ashley, E.A. et al. Clinical assessment incorporating a personal genome. Lancet
`375, 1525–1535 (2010).
`3. Wheeler, D.A. et al. The complete genome of an individual by massively parallel
`DNA sequencing. Nature 452, 872–876 (2008).
`4. McKernan, K.J. et al. Sequence and structural variation in a human genome
`uncovered by short-read, massively parallel ligation sequencing using two-base
`encoding. Genome Res. 19, 1527–1541 (2009).
`5. Roach, J.C. et al. Analysis of genetic inheritance in a family quartet by whole-
`genome sequencing. Science 328, 636–639 (2010).
`6. Pushkarev, D., Neff, N. & Quake, S. Single-molecule sequencing of an individual
`human genome. Nat. Biotechnol. 27, 847–852 (2009).
`7. Korbel, J.O. et al. Paired-end mapping reveals extensive structural variation in the
`human genome. Science 318, 420–426 (2007).
`8. Snyder, M., Du, J. & Gerstein, M. Personal genome sequencing: current approaches
`and challenges. Genes Dev. 24, 423–431 (2010).
9. Rios, J., Stein, E., Shendure, J., Hobbs, H.H. & Cohen, J.C. Identification by
whole-genome resequencing of gene defect responsible for severe
hypercholesterolemia. Hum. Mol. Genet. 19, 4313–4318 (2010).
`10. Lee, W. et al. The mutation spectrum revealed by paired genome sequences from
`a lung cancer patient. Nature 465, 473–477 (2010).
`11. The 1000 Genomes Project Consortium. A map of human genome variation from
`population-scale sequencing. Nature 467, 1061–1073 (2010).
`12. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature
`409, 860–921 (2001).
`13. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler
`transform. Bioinformatics 25, 1754–1760 (2009).
`14. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing
`next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
`15. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids
`Res. 29, 308–311 (2001).
`16. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants
`from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
`17. Chen, R., Davydov, E.V., Sirota, M. & Butte, A.J. Non-synonymous and synonymous
`coding SNPs show similar likelihood and effect size of human disease association.
`PLoS ONE 5, e13574 (2010).
`18. Kaur, I. et al. Variants in the 10q26 gene cluster (LOC387715 and HTRA1) exhibit
`enhanced risk of age-related macular degeneration along with CFH in Indian
`patients. Invest. Ophthalmol. Vis. Sci. 49, 1771–1776 (2008).
`19. Tam, P.O. et al. HTRA1 variants in exudative age-related macular degeneration and
`interactions with smoking and CFH. Invest. Ophthalmol. Vis. Sci. 49, 2357–2365
`(2008).
`20. Yamaguchi, H. et al. Mutations in TERT, the gene for telomerase reverse
`transcriptase, in aplastic anemia. N. Engl. J. Med. 352, 1413–1424 (2005).
`21. Albers, C.A. et al. Dindel: Accurate indel calls from short-read data. Genome Res.
`21, 961–973 (2011).
`22. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158
`(2011).
`23. Clark, M.J. et al. Performance comparison of exome DNA sequencing technologies.
`Nat. Biotechnol. 29, 908–914 (2011).
`
`DP = depth of coverage
`SB = strand bias
`MQ0 = number of reads with mapping quality equal to zero
`
`SNVs were combined and compared using custom program scripts. ANNOVAR
`(http://www.openbioinformatics.org/annovar/) was used to annotate the
`SNVs with gene and repeat annotations downloaded from the UCSC browser
`(http://www.genome.ucsc.edu/).
`
`Small indel detection. For CG, small insertions and deletions were
`derived from the masterVar file. Indels were extracted and converted
`to VCF format using the CG masterVar-to-VCF conversion tool avail-
`able at the CG community website. For Illumina, small indels were
`detected using GATK with the Dindel model for indel detection. Indels
`from both platforms were filtered based on quality score such that only
`those with QUAL ≥ 30 remained. Indels were compared using VCFtools
`(http://www.vcftools.sf.net).
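The QUAL ≥ 30 filter described above can be illustrated with a few lines of plain Python. In a VCF file, QUAL is the sixth tab-delimited column of each record and header lines begin with `#`. This sketch is not the authors' procedure, which relied on GATK, the CG conversion tool and VCFtools; the function name and the toy records are hypothetical.

```python
from typing import Iterable, Iterator


def filter_by_qual(vcf_lines: Iterable[str], min_qual: float = 30.0) -> Iterator[str]:
    """Yield header lines unchanged and records whose QUAL >= min_qual."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == ".":
            continue  # malformed record or missing QUAL: drop it
        if float(fields[5]) >= min_qual:
            yield line


# Hypothetical toy VCF content, for illustration only.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tAT\t45.2\tPASS\t.",
    "chr1\t200\t.\tGC\tG\t12.0\tPASS\t.",
    "chr2\t300\t.\tT\tTA\t30\tPASS\t.",
]
kept = list(filter_by_qual(toy_vcf))
print(len(kept))
```

Dropping records with a missing QUAL value is a design choice made here for simplicity; a real pipeline might prefer to keep or flag them.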
`
`Disease association with SNV. Varimed, a manually curated database
`(comprising data from 5,478 human genetics papers) of human disease-SNP
`associations, was used to perform disease association with our SNVs. We que-
`ried the subject’s genotypes from the platform-specific SNVs against Varimed,
`and identified SNVs that were known to increase the subject’s risk of diseases
with P < 1 × 10^−6. The evidence for their disease associations was evaluated
using the number of studies, cohort size, P-value and the odds ratio. For
`risk genotypes validated in multiple studies, we reported the most significant
`P-values, t