Analysis
`
`Performance comparison of whole-genome sequencing
`platforms
`Hugo Y K Lam1,8, Michael J Clark1, Rui Chen1, Rong Chen2,8, Georges Natsoulis3, Maeve O’Huallachain1,
`Frederick E Dewey4, Lukas Habegger5, Euan A Ashley4, Mark B Gerstein5–7, Atul J Butte2, Hanlee P Ji3 & Michael Snyder1
`
`Whole-genome sequencing is becoming commonplace, but
`the accuracy and completeness of variant calling by the most
`widely used platforms from Illumina and Complete Genomics
`have not been reported. Here we sequenced the genome
`of an individual with both technologies to a high average
`coverage of ~76×, and compared their performance with
`respect to sequence coverage and calling of single-nucleotide
`variants (SNVs), insertions and deletions (indels). Although
`88.1% of the ~3.7 million unique SNVs were concordant
`between platforms, there were tens of thousands of platform-
`specific calls located in genes and other genomic regions.
`In contrast, 26.5% of indels were concordant between
`platforms. Target enrichment validated 92.7% of the
`concordant SNVs, whereas validation by genotyping array
`revealed a sensitivity of 99.3%. The validation experiments
`also suggested that >60% of the platform-specific variants
`were indeed present in the genome. Our results have important
`implications for understanding the accuracy and completeness
`of the genome sequencing platforms.
`
`The ability to sequence entire human genomes has the potential to
`provide enormous insights into human diversity and genetic disease,
`and is likely to transform medicine1,2. Several platforms for whole-
`genome sequencing have emerged3–7. Each uses relatively short reads
`(up to 450 bp) and through high-coverage DNA sequencing, vari-
`ants are called relative to a reference genome. The platforms of two
`companies, Illumina and Complete Genomics (CG), have become
`particularly commonplace, and >90% of the complete human
`genome sequences reported thus far have been sequenced using these
platforms5,8–11. Each of these platforms uses different technologies,
and despite their increasingly common use, a detailed comparison
of their performance has not been reported previously. Such a
comparison is crucial for understanding accuracy and completeness
of variant calling by each platform so that robust conclusions can be
drawn from their genome sequencing data.

1Department of Genetics, Stanford University, Stanford, California, USA. 2Division
of Systems Medicine, Department of Pediatrics, Stanford University, Stanford,
California, USA. 3Department of Medicine, Stanford University, Stanford, California,
USA. 4Center for Inherited Cardiovascular Disease, Division of Cardiovascular
Medicine, Stanford University, Stanford, California, USA. 5Program in Computational
Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA.
6Department of Molecular Biophysics and Biochemistry, Yale University, New Haven,
Connecticut, USA. 7Department of Computer Science, Yale University, New Haven,
Connecticut, USA. 8Present address: Personalis, Inc., Palo Alto, California, USA.
Correspondence should be addressed to M.S. (mpsnyder@stanford.edu).

Received 1 September; accepted 15 November; published online 18 December
2011; corrected after print 7 June 2012; doi:10.1038/nbt.2065
`
`RESULTS
`Sequence data generation
`To examine the performance of Illumina and CG whole-genome
`sequencing technologies, we used each platform to sequence two
`sources of DNA, peripheral blood mononuclear cells (PBMCs) and
`saliva, from a single individual to high coverage. An Illumina HiSeq
`2000 was used to generate 101-bp paired-end reads, and CG gener-
`ated 35-bp paired-end reads. The average sequence coverage for each
`sample was ~76× (Table 1), which resulted in a total coverage equiva-
`lent to 300 haploid human genomes.
`We aligned reads from both platforms to the human reference
`genome (NCBI build 37/HG19)12 and called SNVs. For Illumina, a
`total of 4,539,328,340 sequence reads, comprising 1,499,021,500 reads
`(151.4 Gb) from PBMCs and 3,040,306,840 reads (307.1 Gb) from
`saliva, were mapped to the reference genome using the Burrows-
`Wheeler Aligner13. About 88% mapped successfully. Duplicate reads
`were removed using the Picard software tool, resulting in 3,588,531,824
(79%, 362 Gb) mapped, nonduplicate reads (Table 1). Targeted realignment
and base recalibration were performed using the Genome Analysis
Toolkit (GATK)14. We used GATK to detect a total of 3,640,123 SNVs
`(3,570,658 from PBMCs and 3,528,194 from saliva) with a quality
`filter as defined by the 1000 Genomes Project11. CG generated a gross
`mapping yield of 233.2 Gb for the PBMC sample and 218.6 Gb for the
`saliva sample for a total of 451.8 Gb of sequence (Table 1). We analyzed
`these data using the CG Analysis pipeline to identify 3,394,601 SNVs
`(3,277,339 from PBMCs and 3,286,645 from saliva). A detailed com-
`parison of PBMCs versus saliva differences has revealed that few of the
`tissue-specific calls could be validated by independent methods, and
`these results will be published elsewhere.
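The mapping and duplicate-removal percentages quoted above can be rechecked from the read counts in Table 1. The following is a minimal sketch, not part of the original analysis; the function and variable names are illustrative, and both fractions are taken relative to the total number of reads, which is what the reported percentages appear to use.

```python
# Read counts copied from the Illumina columns of Table 1.
ILLUMINA_READS = {
    "blood": {"total": 1_499_021_500, "mapped": 1_367_988_241, "dedup": 1_233_937_084},
    "saliva": {"total": 3_040_306_840, "mapped": 2_614_663_882, "dedup": 2_354_594_740},
}


def read_fractions(counts):
    """Return (mapped fraction, nonduplicate fraction), both relative to total reads."""
    total = counts["total"]
    return counts["mapped"] / total, counts["dedup"] / total


def combine(samples):
    """Sum the per-sample counts into one pooled record."""
    keys = ("total", "mapped", "dedup")
    return {key: sum(sample[key] for sample in samples.values()) for key in keys}


if __name__ == "__main__":
    rows = dict(ILLUMINA_READS, total=combine(ILLUMINA_READS))
    for name, counts in rows.items():
        mapped, dedup = read_fractions(counts)
        print(f"{name:>6}: mapped {mapped:.0%}, nonduplicate {dedup:.0%}")
```

Pooling the two samples reproduces the totals in the text: 3,588,531,824 nonduplicate reads, or about 79% of all reads.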
`To examine the completeness of sequencing, we analyzed the
`depth and breadth of genomic coverage by each platform with the
`PBMC genome sequences. Both platforms covered the majority of
`the genome, and >95% of the genome was covered by 17 or more reads
`(Fig. 1a). The Illumina curve drops to zero coverage at much lower
`read depth than the CG curve because there are substantially fewer
`reads in the Illumina data set. We also noticed that CG generally is less
`uniform in coverage (Fig. 1b). This suggests that to achieve a certain
`level of coverage for most of the genome, CG requires more overall
`sequencing than Illumina.
`
`78
`
`VOLUME 30 NUMBER 1 JANUARY 2012 nature biotechnology
`
`©2012 Nature America, Inc. All rights reserved.
`
`Foresight EX1027
`Foresight v Personalis
`
`
Table 1 Whole-genome sequencing using CG and Illumina platforms

CG
Sample   Bases (Gb)   Coverage (×)
Blood    233.2         78
Saliva   218.6         73
Total    451.8        151

Illumina
Sample   Bases (Gb)   Coverage (×)   Reads           Mapped (% of reads)    After duplicate removal (% of reads)
Blood    151.4         50            1,499,021,500   1,367,988,241 (91%)    1,233,937,084 (82%)
Saliva   307.1        102            3,040,306,840   2,614,663,882 (86%)    2,354,594,740 (77%)
Total    458.5        153            4,539,328,340   3,982,652,123 (88%)    3,588,531,824 (79%)
`
To directly determine accuracy, we randomly selected concordant
and platform-specific SNVs for Sanger sequencing. We
`found that 20 of 20 concordant SNVs could be validated, whereas 2
`of 15 (13.3%) Illumina-specific and 17 of 18 (94.4%) CG-specific
`SNVs could be validated. This suggests CG has higher accuracy than
`Illumina and that almost all the concordant calls are correct.
To examine accuracy on a larger scale, we used Agilent
SureSelect target enrichment to capture 33,084
(9.6%) Illumina-specific, 3,015 (3.0%) CG-specific and 24,247 (0.7%)
concordant SNVs for sequencing on an Illumina HiSeq instrument
`(Table 2). We found that the validation rate for the concordant SNVs
`was 92.7%, whereas the validation rate was 61.9% and 64.3% for the
`CG-specific and Illumina-specific SNVs. These results indicate that
`the platform-specific calls have a very high false-positive rate of at
`least 35%. We also found that 12.6–21.4% of the targeted SNVs were
`not called in the validation, possibly owing to nonunique regions
`that are difficult to map precisely. Because the capture validation was
`performed using Illumina DNA sequencing technology, it is diffi-
`cult to directly compare the Illumina versus CG SNV rates with this
`approach. Nonetheless, these overall results indicate that concordant
`SNVs have high accuracy and platform-specific SNVs have a high
`false-positive rate.
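The validation rates quoted above are consistent with dividing the validated SNVs by the SNVs that received a definite result (validated plus invalidated), leaving out targeted SNVs that were not called in the validation. The sketch below checks that reading against the counts in Table 2. The formula is inferred from the numbers rather than stated in the text, and the dictionary keys are illustrative names.

```python
# Counts copied from Table 2 (Agilent SureSelect capture, Illumina sequencing).
TABLE_2 = {
    "CG specific": {"targeted": 3_015, "not_validated": 388, "invalidated": 1_001, "validated": 1_626},
    "Illumina specific": {"targeted": 33_084, "not_validated": 7_088, "invalidated": 9_280, "validated": 16_716},
    "Concordant": {"targeted": 24_247, "not_validated": 3_053, "invalidated": 1_543, "validated": 19_651},
}


def validation_rate(row):
    """Validated fraction among SNVs with a definite validation result.

    Assumption: targeted SNVs that were not called in the validation
    experiment ("not_validated") are excluded from the denominator.
    """
    return row["validated"] / (row["validated"] + row["invalidated"])


def uncalled_fraction(row):
    """Fraction of targeted SNVs that were not called in the validation."""
    return row["not_validated"] / row["targeted"]


if __name__ == "__main__":
    for name, row in TABLE_2.items():
        print(f"{name:<18} validation rate {validation_rate(row):.1%}, "
              f"not called {uncalled_fraction(row):.1%}")
```

Under this reading, one minus the validation rate gives the lower bound on the false-positive rate of the platform-specific calls cited in the text.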
`
`Association of genes with variant calling differences
`To better understand the platform-specific calls, we investigated
`the association of SNVs from each platform with different genomic
`elements. We annotated both the platform-specific SNVs and con-
`cordant SNVs with gene and repeat annotations using Annovar16. In
`general, we did not find a significant difference between the associa-
`tions of the platform-specific SNVs and the concordant SNVs with
`gene elements, such as exons and introns (Fig. 3a,b). For example,
`1% and 32–38% of the platform-specific SNVs were associated with
`exonic and intronic regions, respectively, regardless of the platform.
`This correlates well with the portions of exons (~1.3%) and introns
`(~37%) in the whole human genome. Nonetheless, the CG-specific
`SNVs had a slightly stronger association (14%) with noncoding
`RNA than the Illumina-specific SNVs (12%) and concordant SNVs
`(11%). Overall, many platform-specific SNVs lie in RNA coding
`regions of the human genome, and thus deducing their accuracy is
`of high importance.
`
[Figure 1 plots not reproduced. Panel a: percentage of genome (y axis) against cumulative read depth (x axis). Panel b: number of bases (M) (y axis) against read depth (x axis). Both panels show one curve each for Complete Genomics and Illumina.]
`
`Figure 1 Genome coverage at different read depths. (a) Percentage
`of genome covered by different read depths in different platforms.
`(b) Histogram of genome coverage at different read depths.
`
`Extensive differences in variant calling
`We sought to compare the sensitivity and accuracy of each platform for
`SNV calling. In total, 88.1% (3,295,023 out of 3,739,701) of the unique
`SNVs were concordant—that is, either a homozygous or heterozygous
`SNV was detected at the same locus by the two platforms in at least one
`sample (Fig. 2a). We detected 444,678 SNVs by only one platform or the
`other but not both, of which 345,100 were specific to Illumina (10.5%
`of the Illumina combined SNVs) and 99,578 were CG-specific (3.0%
`of the CG combined SNVs). Among the Illumina-specific SNVs, 67%
`were ‘no-calls’ (that is, not a reference or variant call), 11% were reference
`calls and 22% were other types of calls (that is, complex and substitution
`calls) in CG. Similarly, 75% of the CG-specific SNVs were no-calls in
`Illumina, and 25% were reference calls (Fig. 2b). The higher percentage of
`no-calls in Illumina is likely because GATK does not make the complex and
`substitution calls as does the CG pipeline.
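The comparison described above amounts to set operations on variant loci: calls from the two samples are pooled per platform, and the pooled sets are then intersected. The sketch below illustrates this under simplifying assumptions: a locus is represented as a hypothetical (chromosome, position) tuple, genotype is ignored, and the function names are not taken from the scripts used in the study.

```python
def union_of_samples(*sample_calls):
    """Pool the loci called in any sample (for example PBMC and saliva)."""
    loci = set()
    for calls in sample_calls:
        loci |= set(calls)
    return loci


def compare_platforms(calls_a, calls_b):
    """Return (concordant, a_specific, b_specific) sets of loci."""
    return calls_a & calls_b, calls_a - calls_b, calls_b - calls_a


def concordance_rate(n_concordant, n_a_specific, n_b_specific):
    """Concordant loci as a fraction of all unique loci."""
    return n_concordant / (n_concordant + n_a_specific + n_b_specific)


if __name__ == "__main__":
    # Toy loci for illustration only.
    illumina = union_of_samples({("chr1", 100), ("chr1", 200)}, {("chr1", 200), ("chr2", 50)})
    cg = union_of_samples({("chr1", 200)}, {("chr2", 50), ("chr3", 7)})
    concordant, il_only, cg_only = compare_platforms(illumina, cg)
    print(len(concordant), len(il_only), len(cg_only))

    # Counts reported in the text: concordant, Illumina-specific, CG-specific.
    print(f"{concordance_rate(3_295_023, 345_100, 99_578):.1%}")
```

With the reported counts, the three groups sum to the 3,739,701 unique SNVs and the concordant share comes to 88.1%.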
`To assess the quality of the calls, we used four criteria: the
`transition/transversion ratio (ti/tv), quality scores, the heterozygous/
`homozygous call ratio and novel, platform-specific SNVs. The ti/tv
`ratio of 2.1 for SNVs in humans has been described in several previous
`studies, including the 1000 Genomes Project11. The ti/tv ratio for all
`of the SNVs detected in these genomes was 2.04, but in our data the
`ti/tv of SNVs concordant between the two platforms was 2.14. For all
`SNVs detected by the Illumina platform, ti/tv was 2.05, but for SNVs
`specific to Illumina it was only 1.40. Similarly, for SNVs detected by
`CG, ti/tv was 2.13, but for CG-specific SNVs, it was 1.68. Thus, the
`ti/tv of concordant SNVs was very close to that expected, whereas the
` platform-specific ti/tv was much lower, suggesting that the platform-
`specific calls were of lower accuracy. Inspection of the quality scores of
`the platform-specific SNVs showed that they were indeed lower than
`those for the concordant calls (Supplementary Fig. 1). Furthermore,
`the heterozygous/homozygous call ratio was 1.48 for the concordant
`calls, whereas the platform-specific ratios were indeed higher: 2.48
`for Illumina-specific calls and 1.98 for CG-specific calls.
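For readers unfamiliar with the ti/tv metric: a transition exchanges a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T), and every other single-base substitution is a transversion. The sketch below computes the ratio for a list of (reference, alternate) base pairs. It is an illustrative helper, not the code used in the study, and the example substitutions are made up.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions for a list of (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio undefined")
    return transitions / transversions


if __name__ == "__main__":
    toy_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
    print(ti_tv_ratio(toy_calls))  # 3 transitions, 2 transversions
```

Because there are twice as many possible transversions as transitions, a call set dominated by random errors would be expected to drift toward a ratio of about 0.5, which is why the lower platform-specific values are read as a sign of lower accuracy.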
To examine the fraction of novel platform-specific SNVs, we
noted that 3,160,905 (95.9%) of the concordant SNVs were present
in dbSNP131 (ref. 15). In contrast, only 260,108 (75.4%) of the SNVs
`in the Illumina-specific set, and 72,735 (73.0%) of the SNVs in the
`CG-specific set were present in dbSNP131. Thus, the platform-specific
`call sets were enriched for novel SNVs, suggesting that they likely
`contain more errors. In addition, the overall genotype concordance
`rate (that is, the proportion of concordant calls having a consistent
`genotype—heterozygous or homozygous—across both platforms) for
`the concordant SNVs was 98.9%. The high genotype concordance rate
`and percentage of known SNVs indicate that the concordant SNVs
`were of high quality and accuracy.
`To further assess the accuracy of the variant calling, we sought to vali-
`date our SNVs by using Omni Quad 1M Genotyping arrays, traditional
`Sanger sequencing and Agilent SureSelect target enrichment capture
`followed by sequencing on an Illumina HiSeq for both samples. Of the
`260,112 heterozygous calls detected with the Omni array, 99.5% were
`present in the entire SNV data set, 99.34% were concordant calls and only
`0.16% were platform-specific SNVs. This demonstrates that both plat-
`forms are sensitive to known SNVs and that few known single-nucleotide
`polymorphisms (SNPs) are detected by only one platform.
`
`
`Figure 2 SNV detection and intersection.
`(a) SNVs detected from the PBMC and saliva
`samples in each platform were combined.
`The unions of SNVs in each platform were
`then intersected. Sensitivity was measured
`against the Illumina Omni array. Ti/Tv is the
`transition-to-transversion ratio. The known
`and novel counts were based on dbSNP.
`‘Sanger’ and ‘validated’ represent validation by
`Sanger sequencing and Illumina sequencing
`(with Agilent target enrichment capture),
`respectively. (b) Comparing platform-specific
`SNVs to non-SNV calls in another platform. IL,
`Illumina; CG, Complete Genomics.
`
[Figure 2 diagram, reconstructed as text]

(a) SNVs per platform (blood and saliva merged by union)

                      Complete Genomics    Illumina
Blood                 3,277,339            3,570,658
Saliva                3,286,645            3,528,194
Total (union)         3,394,601            3,640,123
Ti/Tv                 2.13                 2.05
Specific              99,578 (3.0%)        345,100 (10.5%)
Ti/Tv (specific)      1.68                 1.40
Known (specific)      72,735 (73.0%)       260,108 (75.4%)
Novel (specific)      26,843 (27.0%)       84,992 (24.6%)
Sanger (specific)     94.4% (17/18)        13.3% (2/15)
Validated (specific)  61.9%                64.3%

Overall (CG+IL, intersection of the two unions)

Total                   3,739,701
Ti/Tv                   2.04
Sensitivity             99.5%
Concordant              3,295,023 (88.1%; sensitivity 99.34%)
Ti/Tv (concordant)      2.14
Known (concordant)      3,160,905 (95.9%)
Novel (concordant)      134,118 (4.1%)
Sanger (concordant)     100% (20/20)
Validated (concordant)  92.7%
CG-specific share       2.7%
IL-specific share       9.2%

(b) Platform-specific SNVs by call type in the other platform

Complete Genomics specific (99,578)
  IL no-call          74,556 (75%)
  IL ref.             25,022 (25%)

Illumina specific (345,100)
  CG no-call         230,119 (67%)
  CG sub & other      77,196 (22%)
  CG ref.             37,785 (11%)
`To further ascertain whether the platform-
`specific SNVs might be located in functionally
`important regions, we examined whether the
`variant calls were present in the Varimed data-
`base2,17, which contains variants catalogued
`through genome-wide association studies and
`other genetic linkage studies. We found that
`31 Illumina- and 3 CG-specific SNVs were
`present in Varimed, from which we were able
`to estimate associations between diseases
`and platform-specific SNPs (Supplementary
`Table 1). One of these, rs2672598, was called
`in both PBMCs and saliva by the Illumina
`platform, but not called in either PBMCs or
`saliva by the CG platform. This SNP is at the
`5′ end of HTRA1 and known to increase the
`risk of age-related macular degeneration by
`4.89-fold (P = 3.39 × 10−11)18,19. Another
`example is the A202T allele in the TERT gene
`encoding telomerase. This allele has been associated with aplastic ane-
`mia20 and was only detected by the Illumina platform. Thus, some
`platform-specific calls are of high importance.
`
`Association of repetitive regions with variant calling differences
`In contrast to coding SNVs, we found that overall the platform-
` specific SNVs had a substantially stronger association with repeti-
`tive elements such as Alu, telomere and simple repeat sequences
`(Fig. 3c,d). For example, only 0.3% of the concordant SNVs were
`associated with telomere or centromere sequences, but 4% and 2%
`of the CG-specific SNVs and Illumina-specific SNVs, respectively,
`were associated with telomeric or centromeric repeats (Fig. 3c,e).
`The enrichment of platform-specific SNVs with simple repeats and
`low-complexity repeats was particularly evident. We found that <1%
`of the concordant SNVs were associated with simple repeats, but
`8% and 15% of the CG-specific SNVs and Illumina-specific SNVs,
`respectively, were associated with these sequences. Among the
` platform-specific SNVs, CG had a stronger association with the Alu
`element and centromere and telomere sequences, whereas Illumina
`
had a stronger association with L1, simple repeat and low-complexity
repeat. Overall, these results indicate that many platform-specific
SNVs lie in repetitive regions, suggesting that these calls may be due
to mapping difficulties and errors.
We also measured GC content and read depth of the SNVs in the
gene and repeat regions. The average GC content of the concordant,
CG-specific and Illumina-specific SNVs were 0.46, 0.45 and 0.41,
respectively. The average read depths were 48, 47 and 44, respectively.
Thus, the Illumina-specific SNVs showed a lower GC content and read
depth compared to the concordant SNVs. Analysis by gene and repeat
regions did not reveal any strong correlation with GC content. However,
we found that Illumina-specific SNVs had a strikingly higher read
depth in centromeric and telomeric regions, whereas CG had higher
read depth in the tRNA and rRNA regions (Supplementary Fig. 2).

Table 2 Agilent SureSelect target enrichment capture with Illumina sequencing

                  CG specific       Illumina specific   Concordant
Total             99,578            345,100             3,295,023
Targeted          3,015 (3.0%)      33,084 (9.6%)       24,247 (0.7%)
Not validated     388 (12.9%)       7,088 (21.4%)       3,053 (12.6%)
Invalidated       1,001 (33.2%)     9,280 (28.0%)       1,543 (6.4%)
Validated         1,626 (53.9%)     16,716 (50.5%)      19,651 (81.0%)
Validation rate   61.9%             64.3%               92.7%

Differences in indel calls
We also examined small indel calls from Illumina and CG platforms.
Small indels ranged in size from −107 to +36 bp by Illumina and −190
to +48 bp by CG. Illumina calls were made using GATK with the
Dindel model21, and CG calls were obtained
from their standard pipeline and converted
to VCF format22 using the CG conversion
tool. A stringent quality score cutoff of
30 was used for each platform. This resulted
in a total of 811,903 indel calls with 611,110
for Illumina and 430,258 for CG. We found
that only 215,382 (26.5%) indels were
detected by both Illumina and CG, whereas
`
`
[Figure 3 plots not reproduced. Panels a to d are bar charts of "Percent association of SNVs" for concordant, CG-specific and IL-specific SNVs. Panel a: exonic, intronic, intergenic, UTR3 and UTR5. Panel b: upstream, downstream, ncRNA and splicing. Panel c: tRNA, rRNA, telomere and centromere. Panel d: L1, Alu, simple repeat and low complexity. Panel e: SNV frequency tracks along the chromosomes.]
`Figure 3 SNV association with different genomic
`elements. (a) Gene elements: UTR, exonic, intronic
`and intergenic regions. Inset: number of SNVs
`associated with UTR5, UTR3 and exonic regions.
`(b) Gene elements: splicing sites, noncoding RNA
`and upstream/downstream (<1 kb) regions of genes.
`(c) Repetitive elements: centromere, telomere, tRNA
`and rRNA. (d) Repetitive elements: L1, Alu, simple
`repeat and low-complexity repeat. (e) SNV frequency
`at different chromosomal locations. Tracks from outer
`to inner: SNV frequency for Illumina (IL), Complete
`Genomics (CG), concordant, IL-specific and CG-
`specific calls. Outermost: chromosome ideogram.
`
`
`390,060 (48.1%) and 206,461 (25.4%) were Illumina- and CG-specific,
`respectively (Fig. 4a). Owing to the complexity of indels compared
`to SNVs, the number of concordant indels was much lower than
`the number of concordant SNVs. We also observed that the indels
`
`detected by both platforms were similar in their size distribution
`and type (Fig. 4b), though it is noteworthy that the Illumina data
`showed a slight enrichment of 1-bp insertions, whereas the CG data
`showed a slight enrichment of 1-bp deletions.
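Comparing indels across platforms is harder than comparing SNVs because the same event can be reported at slightly different positions. One common workaround is to treat two calls as the same indel when they are of the same type and lie within a small window of each other. The sketch below shows such a match with a 5-bp window. The window size, the matching rule and the (chromosome, position, size) tuple layout are all assumptions made for illustration; this section does not spell out the rule actually applied.

```python
from bisect import bisect_left


def match_indels(calls_a, calls_b, window=5):
    """Return the calls in calls_a that have a partner in calls_b.

    Each call is a (chrom, pos, size) tuple; size < 0 is a deletion and
    size > 0 an insertion. Two calls match when they share a chromosome
    and type and their positions differ by at most `window` bp.
    """
    # Index the positions of calls_b by (chromosome, is_insertion).
    index = {}
    for chrom, pos, size in calls_b:
        index.setdefault((chrom, size > 0), []).append(pos)
    for positions in index.values():
        positions.sort()

    matched = []
    for chrom, pos, size in calls_a:
        positions = index.get((chrom, size > 0), [])
        # First indexed position that is not left of the window.
        i = bisect_left(positions, pos - window)
        if i < len(positions) and positions[i] <= pos + window:
            matched.append((chrom, pos, size))
    return matched


if __name__ == "__main__":
    illumina = [("chr1", 100, -2), ("chr1", 500, 1), ("chr2", 100, 3)]
    cg = [("chr1", 103, -2), ("chr1", 520, 1), ("chr2", 100, -3)]
    print(match_indels(illumina, cg))
```

In the toy data only the first Illumina call finds a partner: the second is 20 bp away from its nearest CG call, and the third coincides in position with a call of the opposite type.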
`
[Figure 4 diagram, reconstructed as text]

(a) Indels per platform (blood and saliva merged by union)

                 Complete Genomics    Illumina
Blood            361,783              523,445
Saliva           341,172              555,770
Total (union)    430,258              611,110
Specific         206,461 (48.0%)      390,060 (63.8%)

Overall (CG+IL, intersection of the two unions)

Total            811,903
Concordant       215,382 (26.5%)
CG-specific      206,461 (25.4%)
IL-specific      390,060 (48.1%)

(b) [Histogram not reproduced. x axis: indel size, from −72 to 48. y axis: tick marks from 0 to 160,000. One series each for Complete Genomics and Illumina.]
`
Figure 4 Indel detection and intersection. (a) Indels detected from the PBMC and saliva samples in each platform were combined. The unions of
indels in each platform were then intersected. Note: 5,668 IL and 8,415 CG indels were removed after 5-bp window merging. (b) Indel size distribution.
Negative size represents deletion and positive size represents insertion.
`
`
`Detection accuracy was assessed for concordant and platform-
` specific indels by comparing them to indels detected by exome sequenc-
`ing of the same individual23. We validated 2.2% (4,681) of concordant
`indels but only 1.2% (4,682) of Illumina-specific and 0.3% (561) of
`CG-specific indels. These lower validation rates for platform-specific
`indels suggest that they are indeed less robust than those detected by
`both platforms. Because exome sequencing was performed using the
`Illumina HiSeq platform, bias toward greater consistency between
`the Illumina-specific and exome sequencing–specific indels was
`not unexpected.
`We further validated indels by randomly selecting indels for tra-
`ditional Sanger sequencing. For 24 concordant indels, 15 could be
`amplified by PCR allowing us to validate 14 of them (93.33%). For
`42 platform-specific indels, 19 could be amplified allowing us to vali-
`date 10 of 11 Illumina-specific indels and 8 of 8 CG-specific indels.
`Although the platform-specific indels could be validated at a high
`rate, the increased frequency of failed PCR amplification for platform-
`specific versus concordant indels (54.8% versus 37.5%, respectively)
`suggests that there may have been issues with the sequence context
`around a larger fraction of the platform-specific calls. We therefore
`examined whether both the concordant and platform-specific indels
`overlapped with known repeats. We found that 72% of Illumina-
` specific and 63% of CG-specific indels overlapped repeats, whereas
`only 52% of concordant indels overlapped with repeats. Although
`there is a clear enrichment of platform-specific indels over problem-
`atic repeat regions, many bona fide indels were detected by only one
`platform, as demonstrated by their high validation rate. This suggests
`that indel detection by both Illumina and CG lacks sensitivity.
`
`DISCUSSION
`Overall, we conclude that each genome sequencing approach is
`generally capable of detecting most SNVs. Based on the transition/
`transversion ratio and Sanger sequencing, CG appears to be more
`accurate, but also slightly less sensitive. Illumina, in contrast, covers
`more bases and makes a higher number of overall calls, but also has
`more false positives. This may be in part because Illumina has longer
`reads and is therefore able to map more reads in difficult regions,
`which leads to both increased sensitivity and decreased specificity.
`Nonetheless, both methods clearly call variants missed by the other
`technology. Many of these lie in exons and thus can affect coding
`potential. In fact, 1,676 genes have platform-specific SNVs in exons;
`one of the Illumina-specific SNVs lies in a telomerase gene and is
`likely to affect function. We also found that indel detection is subject
`to a much larger platform bias, with each platform detecting a large
`quantity of indels missed by the other platform. It may therefore be
`beneficial to sequence on both platforms and analyze both data sets
`together, using evidence from one to bolster discovery in the other.
`We demonstrated that the best approach for comprehensive vari-
`ant detection is to sequence genomes with both platforms if budget
`permits. We assessed the cost effectiveness of sequencing on both
`platforms and found that on average it costs about four cents per
`additional variant (Online Methods). Alternatively, supplementing
`with exome sequencing can assess the most interpretable part of the
`genome at higher depth of coverage and accuracy and fill in the gaps
`in the detection of coding variants23. If genome sequencing is per-
`formed on both platforms, platform-specific variants can be validated
`by Sanger sequencing and array capture experiments or disregarded
`if they map to difficult regions (that is, simple repeats) or have low
`quality scores. Using this strategy, variant detection sensitivity and
`specificity can be maximized, and meaningful variants that may
`otherwise have been missed can be discovered.
`
`METHODS
`Methods and any associated references are available in the online version
`of the paper at http://www.nature.com/naturebiotechnology/.
`
`Accession code. Sequence Read Archive: SRA045736.
`
`Note: Supplementary information is available on the Nature Biotechnology website.
`
ACKNOWLEDGMENTS
`This work is supported by the Stanford Department of Genetics and the US
`National Institutes of Health.
`
AUTHOR CONTRIBUTIONS
`H.Y.K.L. and M.J.C. did the analysis. G.N. and L.H. assisted in the analysis. Rui C.
`did DNA sequencing. Rong C. did the disease-association study. Rui C. and M.O’H.
`did the validation experiments. H.Y.K.L., F.E.D., E.A.A., M.B.G., A.J.B., H.P.J. and
`M.S. coordinated the analysis and revised the manuscript. H.Y.K.L., M.J.C. and
`M.S. wrote the manuscript.
`
COMPETING FINANCIAL INTERESTS
`The authors declare competing financial interests: details accompany the full-text
`HTML version of the paper at http://www.nature.com/nbt/index.html.
`
`
`Published online at http://www.nature.com/nbt/index.html.
`Reprints and permissions information is available online at http://www.nature.com/
`reprints/index.html.
`
1. Ajay, S.S., Parker, S.C., Ozel Abaan, H., Fuentes Fajardo, K.V. & Margulies, E.H.
Accurate and comprehensive sequencing of personal genomes. Genome Res.
21, 1498–1505 (2011).
`2. Ashley, E.A. et al. Clinical assessment incorporating a personal genome. Lancet
`375, 1525–1535 (2010).
`3. Wheeler, D.A. et al. The complete genome of an individual by massively parallel
`DNA sequencing. Nature 452, 872–876 (2008).
`4. McKernan, K.J. et al. Sequence and structural variation in a human genome
`uncovered by short-read, massively parallel ligation sequencing using two-base
`encoding. Genome Res. 19, 1527–1541 (2009).
`5. Roach, J.C. et al. Analysis of genetic inheritance in a family quartet by whole-
`genome sequencing. Science 328, 636–639 (2010).
`6. Pushkarev, D., Neff, N. & Quake, S. Single-molecule sequencing of an individual
`human genome. Nat. Biotechnol. 27, 847–852 (2009).
`7. Korbel, J.O. et al. Paired-end mapping reveals extensive structural variation in the
`human genome. Science 318, 420–426 (2007).
`8. Snyder, M., Du, J. & Gerstein, M. Personal genome sequencing: current approaches
`and challenges. Genes Dev. 24, 423–431 (2010).
9. Rios, J., Stein, E., Shendure, J., Hobbs, H.H. & Cohen, J.C. Identification by
whole-genome resequencing of gene defect responsible for severe
hypercholesterolemia. Hum. Mol. Genet. 19, 4313–4318 (2010).
`10. Lee, W. et al. The mutation spectrum revealed by paired genome sequences from
`a lung cancer patient. Nature 465, 473–477 (2010).
`11. The 1000 Genomes Project Consortium. A map of human genome variation from
`population-scale sequencing. Nature 467, 1061–1073 (2010).
`12. Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature
`409, 860–921 (2001).
`13. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler
`transform. Bioinformatics 25, 1754–1760 (2009).
`14. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing
`next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
`15. Sherry, S.T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids
`Res. 29, 308–311 (2001).
`16. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants
`from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).
`17. Chen, R., Davydov, E.V., Sirota, M. & Butte, A.J. Non-synonymous and synonymous
`coding SNPs show similar likelihood and effect size of human disease association.
`PLoS ONE 5, e13574 (2010).
`18. Kaur, I. et al. Variants in the 10q26 gene cluster (LOC387715 and HTRA1) exhibit
`enhanced risk of age-related macular degeneration along with CFH in Indian
`patients. Invest. Ophthalmol. Vis. Sci. 49, 1771–1776 (2008).
`19. Tam, P.O. et al. HTRA1 variants in exudative age-related macular degeneration and
`interactions with smoking and CFH. Invest. Ophthalmol. Vis. Sci. 49, 2357–2365
`(2008).
`20. Yamaguchi, H. et al. Mutations in TERT, the gene for telomerase reverse
`transcriptase, in aplastic anemia. N. Engl. J. Med. 352, 1413–1424 (2005).
`21. Albers, C.A. et al. Dindel: Accurate indel calls from short-read data. Genome Res.
`21, 961–973 (2011).
`22. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158
`(2011).
`23. Clark, M.J. et al. Performance comparison of exome DNA sequencing technologies.
`Nat. Biotechnol. 29, 908–914 (2011).
`
`VOLUME 30 NUMBER 1
`
`JANUARY 2012 nature biotechnology
`
`©2012 Nature America, Inc. All rights reserved.
`
`DP = depth of coverage
`SB = strand bias
`MQ0 = number of reads with mapping quality equal to zero
`
`SNVs were combined and compared using custom program scripts. ANNOVAR
`(http://www.openbioinformatics.org/annovar/) was used to annotate the
`SNVs with gene and repeat annotations downloaded from the UCSC browser
`(http://www.genome.ucsc.edu/).
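
The custom comparison scripts are not described in further detail. As a minimal sketch, assuming each platform's SNVs have been reduced to (chromosome, position, reference allele, alternate allele) tuples, the concordant and platform-specific sets could be derived with set operations. The function name `compare_calls` and the toy calls are hypothetical, and concordance is taken here as the shared fraction of the union of calls, which is an assumption about how the reported percentages were computed.

```python
def compare_calls(illumina, cg):
    """Split two SNV call sets into concordant and platform-specific calls.

    Each call is a (chrom, pos, ref, alt) tuple. This keying is an
    assumption for illustration, not the authors' actual script.
    """
    illumina, cg = set(illumina), set(cg)
    concordant = illumina & cg
    union = illumina | cg
    return {
        "concordant": concordant,
        "illumina_only": illumina - cg,
        "cg_only": cg - illumina,
        # Fraction of all unique calls that both platforms report.
        "concordance": len(concordant) / len(union) if union else 0.0,
    }


# Toy data (hypothetical calls, not from the study).
illumina_calls = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")]
cg_calls = [("chr1", 100, "A", "G"), ("chr2", 75, "G", "A"), ("chr3", 10, "T", "C")]

result = compare_calls(illumina_calls, cg_calls)
print(result["concordance"])    # 2 shared calls out of 4 unique calls
print(result["illumina_only"])
```

Keying on the alleles as well as the position means that two platforms reporting different alternate alleles at the same site are counted as discordant.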
`
`Small indel detection. For CG, small insertions and deletions were
`derived from the masterVar file. Indels were extracted and converted
`to VCF format using the CG masterVar-to-VCF conversion tool avail-
`able at the CG community website. For Illumina, small indels were
`detected using GATK with the Dindel model for indel detection. Indels
`from both platforms were filtered based on quality score such that only
`those with QUAL ≥ 30 remained. Indels were compared using VCFtools
`(http://www.vcftools.sf.net).
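
The quality filter described above can be illustrated with a short sketch. It assumes tab-delimited VCF records with one alternate allele per line and reads only the standard CHROM, POS, REF, ALT and QUAL columns; the function name `filter_indels` and the example records are hypothetical. It stands in for, and does not reproduce, the VCFtools-based workflow.

```python
def filter_indels(vcf_lines, min_qual=30.0):
    """Return (chrom, pos, ref, alt) for indel records with QUAL >= min_qual.

    Assumes one ALT allele per record; multi-allelic lines would need
    to be split before this check.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):      # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual = fields[:6]
        if qual == ".":               # missing quality score: cannot pass the filter
            continue
        if len(ref) == len(alt):      # equal allele lengths: substitution, not an indel
            continue
        if float(qual) >= min_qual:
            kept.append((chrom, int(pos), ref, alt))
    return kept


# Hypothetical records: a passing deletion, a low-quality insertion, and an SNV.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tAT\tA\t45.2\tPASS\t.",
    "chr1\t200\t.\tG\tGCC\t12.0\tPASS\t.",
    "chr2\t300\t.\tC\tT\t99\tPASS\t.",
]
passing = filter_indels(example)
print(passing)
```

Applying the same threshold to both platforms before comparison keeps the concordance estimate from being driven by low-confidence calls on either side.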
`
`Disease association with SNV. Varimed, a manually curated database
`(comprising data from 5,478 human genetics papers) of human disease-SNP
`associations, was used to perform disease association with our SNVs. We que-
`ried the subject’s genotypes from the platform-specific SNVs against Varimed,
`and identified SNVs that were known to increase the subject’s risk of diseases
with P < 1 × 10^−6. The evidence for each disease association was evaluated
using the number of studies, cohort size, P-value and odds ratio. For
`risk genotypes validated in multiple studies, we reported the most significant
`P-values, t
