throbber

`
`©2008NaturePublishingGrouphttp:/Iwwwnaturecomlnaturemethods
`
`
`
`
`
`ARTICLES I
`
`Whole-genome sequencing and variant discovery in
`C. elegans
`
`LaDeana W Hillier1’3, Gabor T Marth2’3, Aaron R Quinlanz, David Doolingl, Ginger Fewelll, Derek Barnettz,
`Paul Foxl, Jarret I Glasscockl, Matthew Hickenbothaml, Weichun Huangz, Vincent J Magrinil, Ryan I Richtl,
`Sacha N Sanderl, Donald A Stewartz, Michael Strombergz, Eric F Tsungz, Todd Wylie], Tim Schedll,
`Richard K Wilson1 8: Elaine R Mardis1
`
`Massively parallel sequencing instruments enable rapid and
`inexpensive DNA sequence data production. Because these
`instruments are new, their data require characterization with
`respect to accuracy and utility. To address this, we sequenced a
`Caemohabditis elegans N2 Bristol strain isolate using the Solexa
`Sequence Analyzer, and compared the reads to the reference
`genome to characterize the data and to evaluate coverage and
`representation. Massiwa parallel sequencing facilitates strain-
`to-reference comparison for genome-wide sequence variant
`discovery. Owing to the short-read-length sequences produced,
`we developed a revised approach to determine the regions
`of the genome to which short reads could be uniquely mapped.
`We then aligned Solexa reads from C. elegans strain (34858 to
`the reference, and screened for single-nucleotide polymorphisms
`(SNPs) and small indels. This study demonstrates the utility
`of massively parallel short read sequencing for whole
`genome resequencing and for accurate discovery of
`genome-wide polymorphisms.
`
`In 1998 the decoding of the first animal genome sequence, that
`of C. elegans, was published]. C. elegans was first suggested as a
`model organism in the 19605 by Sydney Brenner, and subsequent
`work produced a physical map of its genomez. As a result, the
`C. elegans genome sequencing project formed the cornerstone
`of efforts ultimately aimed at decoding the human genome“.
`The entire C. elegans biology community has benefited enor
`mously from the availability of the genome sequence and the
`ever improving genome annotations, and from the comparative
`power of the availability of sequenced genomes for C. elegans’
`relatives such as C. briggsae‘s.
`The emerging availability of massively parallel sequencing instru
`mentation provides the capability to resequence genomes in a
`fraction ofthe time, effort and expense than ever before. Compared
`to capillary sequencing, these instruments produce relatively short
`read length sequences that require characterization, including read
`error profiles and base call accuracy (which we refer to as base
`
`quality) values. Furthermore, the general utility of short read
`sequences, coverage models for resequencing and approaches for
`read mapping to reference genomes requires investigation. To
`address these, we sequenced an isolate of the C. elegans N2 Bristol
`strain using the Solexa Sequence Analyzer (Illumina Inc). Our
`analyses ofthese sequences included (i) an elucidation ofthe Solexa
`read error model, (ii) an evaluation of sequence differences between
`the two isolates and (iii) identification and investigation of repre
`sentational biases in Solexa data. We revealed possible sequencing
`errors in the C. elegans reference genome, and putative variants that
`had occurred in our passaged N2 Bristol strain.
`Massively parallel sequencing can be applied to strain to
`reference comparisons that reveal genome wide sequence differ
`ences, either for evolutionary studies or for discovering genetic
`variation that may explain phenotypic variation. Implementing this
`application requires a new approach that assesses the fraction of a
`genome to which short read sequences can be uniqudy mapped
`because they are more susceptible to multiple placements than are
`longer capillary instrument derived sequences. Computational
`identification and markup of these ‘microrepeats’ is therefore an
`important precursor to accurate short read analysis, and must
`allow for mismatches resulting from sequencing errors or poly
`morphisms. We aligned Solexa sequence reads from the C. elegans
`strain CB4858 (originally isolated in Pasadena, California, USA)7 to
`the microrepeat masked N2 Bristol reference sequence, and identi
`fied SNPs and small indels with a modified PolyBayes8 version.
`Orthologous validation yidded a high validation rate.
`
`RESULTS
`
`Experimental design
`In this study we explored two applications of Solexa sequencing:
`(i) genome resequencing and (ii) genome wide polymorphism
`discovery. For the first application, we sequenced an isolate of the
`C. elegans N2 Bristol strain at a high coverage depth with single end
`reads and at a much lower coverage depth with paired end reads.
`Using the reference genome as our alignment target, we determined
`
`1Washington University School of Medicine, Department of Genetim and Genome Sequencing Center, 4444 Forest Park Blvd., St. Iouis, Missouri 63108, USA. 2Boston
`College, Department of Biology, 140 Commonwealth Ave., Chestnut Hill, Massachusetts 02467, USA. ’I‘hese authors contributed equally to this work. Correspondence
`should be addrmsed to E.R.M. (emardis@watson.wustl.edu).
`RECEIVED 19 SEPTEMBER monument) 21 MEMBER 2N7; PUBLISHED cm 20 JANUARY 2008; D01:10.1038/NHEI'H.1179
`
`NATURE METHODS I VOL5 no.2 I FEBRUARY 2008 I 183
`Petltloner Sequenom - EX. 1006, p. 1
`
`Petitioner Sequenom - Ex. 1006, p. 1
`
`

`

`
`
`©2008NaturePublishingGrouphttp:llwww.nature.comlnaturemethods
`
`
`
`
`
` ARTICLES
`
`85,498,844 Solexa N2 Bristol reads
`
`0
`
`. ‘65
`‘0 e. d)“
`
`$03
`
`1,525,093 (1.78%) reads
`
`No match to
`C. dams,
`E. 00! or
`Solera priners
`
`Exad match
`to C. elegans
`reterenoe genome,
`catalog poeflions
`
`18,442,727 reads
`
`Quality trim
`reeds (keephg only
`reads 2 20 hp at
`2 025)
`
`64,636,331 (75.55%) reads
`
`6,699,337 reads
`
`No match to
`0. 9199815 /
`1,843,353 reads
`
`Exact match to
`\ C. dogma
`4,855,984 reads
`
`phrap-based
`assenbty
`
`BLAT-based allgrments to Identity norkeotad
`matches to C. elegans genome
`
`an accuracy estimate and an error model for Solexa reads. We next
`aligned all reads possible using a tiered approach (Fig. 1), identified
`sequence differences between the two isolates and evaluated both
`representational bias and copy number detection.
`We developed a genome wide polymorphism discovery approach
`by first sequencing C. elegans strain CB4858, using Solexa single
`end reads of about ninefold coverage. To decrease the possibility
`of erroneous variant detection because of paralogous read place
`ments, we identified and masked ‘microrepeat’ regions in the
`genome based on a 32 bp read length. We then aligned CB4858
`reads to the reference genome using Mosaik and applied a modified
`PolyBayes version to detect variants. Our predicted polymorphic
`sites were validated by PCR amplification and Sanger sequencing at
`a high rate.
`
`figure 1 I N2 Brigol Solexa read analysis. The diagram shows the processing steps used to evaluate
`Solexa single end reads from the N2 Bristolisolate. The majority of reads mapped exactly to the
`
`reference genome.
`
`0%;
`°x§$
`“a,
`
`6’
`884,698 (1.05%)reads
`
`given base quality value depends upon that
`base’s position in the sequence read.
`To determine Solexa N2 Bristol single end
`read coverage, we first devised an iterative
`read alignment
`strategy for
`these reads
`(Fig.
`l
`and Supplementary Methods
`online). We then determined the average
`coverage of the genome to be 19.2 (s.d. =
`9.0; Supplementary Fig. 1 online).
`We examined the genome for over
`represented regions and found ~1.7% of
`the genome had >40 fold coverage in
`unique 32 mers. We expected ribosomal
`DNA genes to have higher than average
`coverage because the reference represents
`these as single copies but they actually exist
`in multiple copies. By examining unique
`32 mers within rDNA segments, we found
`evidence of excess coverage (> 100x for the
`duomosome 1
`rDNA unique 32 mers).
`This finding lends credibility to the use of
`read coverage as a quantitative metric of
`region specific copy number. Based on our
`analysis of regions with higher than average
`coverage, combined with the assembly and
`analysis of unmapped reads (see Supplementary Methods), we
`estimate a maximum of ~ 0.5 Mb of repetitive sequence is missing
`from the C. elegans reference genome.
`To determine the genome coverage of the sequence reads we
`proceeded as follows. After aligning exactly matching 32 bp reads,
`exactly matching quality trimmed reads (that is, reads with at least
`20 consecutive base pairs of quality score 220) and quality
`trimmed reads with 2 or more mismatches to the reference genome,
`we found that 99.9% ofthe unique C. elegans genome was covered,
`mostly in large spans; the longest was 194 kb on chromosome V. Of
`the regions left uncovered by Solexa reads, there were 9,492 gaps
`comprising 95,913 bp. These coverage gaps ranged in size from
`1 base to over 1,000 bases; 77.9% were 1 9 bp and another 36.8%
`were 10 50 bp. Ifwe only consider 32 mer exact mapping reads, the
`largest coverage gap was a 4,601 bp region on chromosome X
`(5907815 5912415). Notably,
`the entire 4,600 bp region is
`bounded by a transposon (TC5A#DNA/Tc4)
`in the reference
`sequence and is completely contained within a single fosmid
`clone (H05L03) that extends into an overlapping yeast artificial
`chromosome (YAC; Y23B4A).
`areas
`There were a total of 1,728 zero base pair gaps discovered
`of the genome where two adjacent reference genome 32 mers were
`covered by reads that aligned exactly to each 32 mer but across
`
`Resequencing a C. elegans N2 Bristol strain isolate
`We used the single end C. elegans N2 Bristol reads to evaluate
`the overall accuracy and quality of Solexa pipdine passed reads.
`Table 1 provides several metrics of our Solexa singe end read
`dataset, including Eland alignment results to the wsl70 release of
`the C. elegans reference genome9 (http://www.wormbase.org).
`Based on the Bland metrics, we estimated 20 fold coverage of N2
`Bristol for the quality passed and aligned single end reads.
`We performed read aligmnent with
`EagleDisooverer, and subsequent error ana
`lysis revealed that 57.2% of the uniquely
`mapping single end reads contained zero
`.
`.
`mismatches and 7.93% had 0 or .1 mis
`match. We determined the full distribution
`of mismatches for the Solexa N2 Bristol
`
`Table 1 l Solexa run metrics for N2 Bristol and CB 4858 single-end reads
`
`Genome
`N2 Bristol
`(34858
`
`Number
`of runs
`3.5
`1.5
`
`Total bass
`generated
`4.06 Gb
`2.52 Gb
`
`Total passed
`bases
`2.67 Gb
`1.35 Gb
`
`Percentage
`passed bases
`66%
`54%
`
`Percentage
`aligning (w5170)
`79%
`67%
`
`Percentage error
`(alignment based)
`0.6%
`0.52%
`
`Solera run metrics obtained fertile entwined 30 and 32 hp single-end reads from both the N2 Bristol and CB 4858 isolates
`Results indndlng the totd number of bases generated, the totd number of passed (forerample high quality) bases, and fire
`percentage of aligning reads were obtained from the output ofthe Salem-provided data analyst pipeline. The Hand-generated
`error rate is reported, based on the reference genome alignments ofSolera passed reads.
`
`reads (Fig. 2), and the position specific
`dependency of Solexa base calls at phred
`qualities of 25 and 30 (Fig. 3), which
`illustrates that
`the base accuracy for a
`
`184 I VOL5'HOZ2‘1 mxuimr 2008 I NATURE METHODS
`
`Petitioner Sequenom - EX. 1006, p. 2
`
`Petitioner Sequenom - Ex. 1006, p. 2
`
`

`

`Percentagedreal:
`
`22.73%
`
`1 1.24%
`5.87% 2.96%
`
`0
`1
`2
`3
`4
`Number 01 mismatches
`
`figure 2 I Accuracy distribution of N2 Bristol Solexa single end reads.
`As described in the text, after alignment of N2 Bristol Solara reads to
`the reference genome sequence using EagleDiscoverer and tabulating
`any differences between the two sequencs, we determined that ~80°lo
`of the reads exhibited 0 or 1 mismatch when uniquely aligned to the
`reference genome.
`
`which no exactly matching spanning read exists. Of these, 1,564
`(90%) had single read representation of the flanking 32 met,
`consistent with the notion that these regions are under represented
`by Salem reads. Further investigation of these coverage gaps
`revealed that (i) they are located primarily in noncoding sequence
`(2% are in exons), (ii) only a few regions could be explained by
`hairpin formation (see Supplementary Data and Supplementary
`Table 1 online) and (iii) the A+T content
`in these regions is
`substantially higher than the genome average (85% versus 65%,
`respectively; Supplementary Fig. 2 onlirre). Furthermore, this A+T
`bias more likely occurs during amplification than during sequen
`cing (see Supplementary Data). We identified 125 zero base pair
`gaps with a non identical spanning Solexa read, suggesting an
`insertion in the Solexa sequenced strain. We resequenced 22 of
`these and validated them as true differences between the two N2
`Bristol isolates.
`
`ARTICLES I
`
`reads) for which a Salem read had extra bases relative to the
`reference. These also could be deletions in the reference.
`
`We identified 1,396 nonrepetitive, uncovered regions with at
`least one read having an unaligned or mismatched base, suggesting
`a Salem base calling error, a polymorphism in the Salem
`sequenced N2 Bristol isolate or a substitution error in the reference
`genome. Of these, 1,011 were covered by more than one read, and
`544 were covered by more than two reads These suggest a
`maximum substitution error rate in the C. elegans reference
`sequence of 1 in 99 kb. We included a limited number of these
`putative errors in our validation efforts, described below.
`We were able to produce and analyze limited numbers ofpaired
`end roads for C. elegans N2 Bristol, providing an average coverage
`of 0.84x with a mean physical coverage (measured by the span of
`matching paired ends) of 3.08x. Because paired end reads are used
`to evaluate structural variation based on deviations in end read
`
`distance from expectation"), we determined that 37,352 read pairs
`had a mapped distance > 3 s.d. from the 218 bp average (Supple-
`mentary Data; 36,209 were < 104 bp and 1,143 were > 332 bp). If
`we required more than one read pair placement to confirm an
`event, only 5,908 pairs remained (5,670 were <104 bp). Notably,
`these 5,670 reed pairs spaced ~ 100 bp closer than expected
`support our estimate that ~0.5 Mb of the C. elegans repetitive
`genome is missing fi'orn the reference genome. The vast majority of
`multiple read pairs that confirmed a structural variation event were
`within introns and/or were annotated as repetitive. Many such
`read pairs confirmed regions already annotated as “difficult to
`sequence”; about 1.5 times as many fell
`in genomic regions
`sequenced from YACs or plasmids (both clone types were used to
`sequence regions unclonable in cosrnids for the reference genome
`sequencing). For example, on chromosome III, a complex tandem
`repeat annotated as “restriction digest data indicate 3 kb is missing
`from the assembly of this region” was identified by 238 Solexa
`paired ends placed at > 3 s.d. apart (50% were 332 400 base pairs
`apart), firrther substantiating the initial suspicion of misassembly.
`Hence, paired end data enhance the utility of Solexa reads, provid
`ing an important tool for identifying putative structural variation.
`
`Polymorphism discovery in C. elegans strain (34858
`We sequenced an isolate of the CB4858 strain using the
`Solexa tedmology to produce ~ninefold coverage in single end
`reeds. This strain was selected because previous work had suggested
`
`
`
`
`
`
`
`©2008NaturePubllshlngGrouphup:llwww.nature.comlnaturamethods
`
`‘0‘
`36-
`k 30_
`g
`3
`g 20-
`g 15-
`3 10_
`
`
`
`5 r
`o
`
`Genomic alignment of nonexact match reads (that is, N2 Bristol
`single end reads without an exact match to C. elegans; Supple-
`mentary Methods) allowed us determine diEerences between the
`two N2 Bristol isolates and to identify possible errors in the
`reference sequence as these reads are highly similar but contain
`inserted or deleted bases that preclude an exact match. Here three
`or more Solexa reads were required to predict a reference error, to
`reduce contributions from Salem base calling errors. Sud'r align
`ments putatively identified 2,981 insertions, deletions and indels of
`1 20 bp. Of these, 2,082 occurred at posi
`tions also having emctly matching Salem
`reads,
`thus
`confirming the
`reference
`sequence and indicating an allelic poly
`morphism between the two N2 Bristol iso
`lates. By contrast, 235 ofthe indels occurred
`in regions with no perfectly aligning Solexa
`read, suggesting a possible error in the
`C. elegans reference genome, and indicating
`a potential indel error rate of 1 in 373 kb.
`We detected 56 different putative deletion
`events, for which a Salem read spanned one
`or more bases in the reference genome,
`aligning immediately on either side. Alter
`natively, these could be insertions in the
`reference genome. Iastly, 53 different puta
`tive insertions were suggested (by 502
`
`12 3 4 6 6 7 8 910111213141516171819m212223242526272829303132
`Base posllon
`
`Figure 3 I Position dependency of base calling accuracy for N2 Bristol Solexa single end reads. The
`calculated based quality is shown as a distribution at phred base quality 25 or at base quality 30.
`
`NATURE MEI'HODS I VOL5 no.2 I FEBRUARY 2008 I 185
`Pet1t1oner Sequenom - EX. 1006, p. 3
`
`Petitioner Sequenom - Ex. 1006, p. 3
`
`

`

`
`
`
`
`
`
`©2008NaturePublishingGrouphttp://wwwnaturecomlnaturemethods
`
`l ARTICLES
`
`
`
`Genome:
`100,231,244 up
`
`Microrep eats:
`19.8%
`
`
`
`RepealMasker
`14.5%
`
`figure 4 I Repetitive content in C. elegans. Venn diagram depicting the
`faction of bases in the genome covered by microrepeats and by
`RepeatMasker, and file overlapping set.
`
`a polymorphism rate of 1:1,600 (ref. 11). The Solexa analysis
`pipeline produced metrics of our Solexa single end read data
`for CB4858 (Table 1).
`As a precursor to variant discovery in CB4858, we identified
`regions of the reference genome with a high potential for ambig
`uous read alignment, based on the Solexa 32 bp read length. First,
`we identified all unique 32 mers in the reference sequence, but as
`our error rate analysis (Fig. 2) indicated a drop offin the error rate
`beyond 2 errors per read, we defined a repetitive 32 mer as one that
`appears in the genome more than once, allowing 0 2 mismatdied
`bases (substitutions,
`insertions or deletions). We called these
`‘microrepeats’ to distinguish than from repeats marked by the
`RepeatMasker program”, which masks 14.5% of the bases in the
`genome. The fraction of the genome comprising perfect and near
`perfect microrepeats totaled 19.8%. We illustrate the relationship
`between RepeatMasker masked bases and microrepeat bases iden
`tified by our methods as a Venn diagram (Fig. 4). Although there is
`a substantial overlap (11.11%) between the regions masked by both
`methods, 8.7% of the genome that we identified as microrepeats
`was not masked by RepeatMasker. Conversely, 3.4% of the genome
`was masked by RepeatMasker only, indicating that some fraction of
`C. elegans repeat elements can be uniquely sequenced with 32 bp
`reads. Taken together, RepeatMasker repeats and microrepeats
`cover 23.2% the genome.
`
`hble 2 l PolyBayes SNP and indel validation data
`
`Mask type applied
`
`Assay
`type
`
`Submitted to
`validation
`
`Assay
`successful
`
`Once we aligned CB4858 Solexa reads to the conservatively
`masked C. elegans genome, we applied our combined repeat
`masking to filter the alignments, identified high quality sequence
`differences with PolyBayes, and finalized a set of 45,539 SNPs and
`7,353 single base pair indels. This yields a rate of one SNP per
`1,629.81 bp and one indel per 9,894.99 bp. Hence the pair wise
`nucleotide diversity (theta) between the CB4858 and the N2 Bristol
`strains is 6.136 x 10 4, in good agreement with the ~ 1:1,500 rate
`posited in a previous description ofCB4858 (ref. 1 1). As 37,856,444
`CB4858 Solexa reads yielded a total number of 45,539 SNPs, the
`‘read per SNP’ yield was 831. All confirmed CB4858 sequence
`variants are available in Wormbase.
`
`We orthologously validated roughly 1,000 candidate SNPs and
`indels by PCR directed capillary sequencing to gauge the perfor
`mance of our Mosaik PolyBayes approach. After sequencing and
`evaluation, we determined a SNP validation rate of 96.3% (438/
`455) and an 89.0% conversion rate (438l492)
`for candidates
`identified by PolyBayes (Table 2). We sequenced 239 of our
`putative single base indels, finding they validated (93.8%) and
`converted (87.7%) at practically the same rates as SNPs. Both
`insertions and deletions predicted in the reference genome
`sequence were represented (insertions: 2,948 or 47.1%, and dele
`tions: 3,316 or 52.9%). Many ofthe indels were variable numbers of
`bases in mono nucleotide repeats, for example, 5 versus 4 adeno
`Sines. Although mononucleotide runs are typically very difficult
`areas for indel detection, our high validation rate indicates that
`Solexa reads resolve base numbers in these runs very well. Micro
`repeat masking has a marked impact on accurate SNP discovery by
`eliminating putative SNPs and indels resulting from paralogous
`read mapping (Table 2).
`We estimated false negative rates for PolyBayes by running
`PolyPhred‘3'ls (version 5.0) on the validation trace data. This
`algorithm indicated PolyBayes had missed 26 SNPs, for a false
`negative rate of 3.75%.
`To determine the chromosomal distribution of CB4858 poly
`morphisms, we placed CB4858 SNPs and indels along the six
`C. elegans chromosomes, and identified both chromosome wide
`and chromosomal position specific differences (Supplementary
`Data and Supplementary Fig. 3 online). Our data confirmed an
`earlier study in C. elegans‘6 suggesting that nonsynonymous sub
`stitution rates are higher in the first and second codon positions
`than in the third (Supplementary Fig. 4 online). Furthermore, over
`half of CB4858 SNPs positioned in exons putatively introduce an
`amino acid change.
`
`Sequencing
`successful
`557
`518
`458
`
`222
`217
`208
`
`SNP candidate
`confirmed
`482
`475
`438
`
`202
`201
`193
`
`Validation
`late (9»)
`86.5
`91.7
`96.3
`
`91.0
`92.6
`93.8
`
`Conversion
`rate (“/o)
`80.6
`82.0
`89.0
`
`84.5
`86.6
`87.7
`
`Known repeat
`Exact microrepeats
`Near exact micmrepeats
`(2 or fewer mismatdies)
`Known repeats
`Exact microrepeas
`Near exact microrepeats
`(2 or fewer mismatches)
`Validation and conversion rats for PoyBalesrselected SNPs and single base indel candidates. Successive application of masking filters, as described in the text. reduced the number of paralegals
`placements and identified high confidence putatiie variant sites
`
`SNP
`SNP
`SNP
`
`Indel
`Indel
`Indel
`
`598
`579
`492
`
`239
`232
`220
`
`582
`559
`482
`
`228
`223
`213
`
`186 I VOLTWJ 'l 1‘1!!!ka 2008 I NATURE METHODS
`
`Petitioner Sequenom - EX. 1006, p. 4
`
`Petitioner Sequenom - Ex. 1006, p. 4
`
`

`

`
`
`©2008NaturePublishingGrouphttp:llwww.nature.comlnaturemethods
`
`
`
`
`
`DISCUSSION
`
`Masside parallel sequencing approaches hold great promise for
`genome wide discovery of sequence variation, when comparing
`different isolates or strains to reference genomes. It is apparent that
`short read technologies must initially be characterized with respect
`to their quality and accuracy, providing a baseline for devising
`analytiml methods. Dramatically shorter read lengths also increase
`the coverage level needed for adequate depth and breadth of reads
`to predict varhtion with high confidence, when compared to
`capillary sequencing reads. Although these short reads presently
`are too short for de novo assembly, producing regional assemblies of
`resequencing reads, followed by reference genome alignment,
`apparently has merit for detecting insertions and ddetions, and
`should be pursued in future resequencing efforts.
`Paired end reads clearly increase the power to properly interpret
`problematic areas of the genome, including collapsed or misas
`sembled repeats, and to detect structural variations. As genomes
`increase in size and complexity, paired ends will also be more
`efficiently placed than single end reads, as only one end of each
`read pair needs a unique genome placement to properly place most
`reads, given that a precise paired end read distance has been
`achieved in library construction.
`Solexa reads provide a rapid vehicle for genome wide SNP and
`small indel discovery, once additional masking of ‘microrepeat’
`sequences is achieved. Aside from SNP or indel discovery, whole
`genome resequencing also can be used after random mutagenesis
`to identify and characterize each mutagenized base. Our results
`establish the utility of short read length massively parallel
`sequencing for the accurate discovery of both single nucleotide
`and small insertion deletion polymorphisms, and establishes a
`framework for human genome resequencing toward similar
`discovery aims.
`
`METHODS
`
`Determining Solexa single-end read accuracy. To isolate sequen
`cing errors from simple aligmnent errors, we used a version of
`the Smith Waterman based global alignment algorithm that
`reports all optimal and suboptimal aligmnents above a prespeci
`fled aligmnent score (EagleDiscover; W.H., unpublished data).
`Although time intensive,
`this algorithm identifies all alignable
`positions in the C. elegans genome for a 32 bp read. Here we
`generated three random sample sets of 20,000 Solexa N2 Bristol
`single end reads and aligned each read set
`to the unmasked
`reference genome, allowing up to 4 mismatches (substitution,
`insertion or deletion). For fiirther consideration of accuracy, we
`kept only reads that aligned at a single locus in the genome. For
`each of the three read sets we tabulated the number of sequence
`differences between each read and the reference sequence, and
`combined the results to make a histogram of reads (Fig. 2). Then
`we evaluated unique alignments to ealculate the observed error
`rate at ead'r base position for a given Solexa base quality score. We
`converted these rates to phred scores (Solexa base qualities are
`expressed as a probability of each of the 4 bases being the correct
`call rather than as a single phred like probability of correctness)
`and graphed the dependence of observed base quality on base
`position (Fig. 3).
`
`Alignment and analysis of Solexa single-end reads. We compared
`Solexa N2 Bristol
`reads to the reference genome to identify
`
`ARTICLES I
`
`sequence variants, to analyze coverage and to evaluate representa
`tional bias. These alignments consisted of a combination of exact
`hash match based comparisons, followed by BLAST like alignment
`tool
`(BLAT) based comparisons. Our methods are detailed
`below, and are presented in a flowchart format
`(Fig.
`l and
`Supplementary Data).
`
`Paired-end read evaluation. We mapped paired end reads (for
`example, a 25 35 bp read from each end of a ~200 bp genomic
`fragment) from the N2 Bristol isolate to the C. elegans genome
`using the exact hash match based method described above. After
`read mapping of individual paired ends, we determined final
`placements by asking that the ‘forward’ and ‘reverse’ read of the
`pair match on the same chromosome and within 1,000 bases of
`each other.
`
`Mosaik alignment of CB4858 Solexa reads. We identified both
`perfect microrepeats and microrepeats with up to two mismatches
`(substitutions, deletions or insertions) to encompass the possibi
`lity of sequencing errors (nucleotide misincorporation or base
`calling) in the reads or of polymorphism in the genomes being
`compared. Custom scripts then produced a microrepeat masked
`reference genome.
`We next aligned the Solexa CB4858 single end reads to the
`microrepeat masked C. elegans reference genome with our Mosaik
`program. Mosaik consists of two parts: the aligner (aligns each
`read to the reference genome separately in a pair wise fashion) and
`the assembler (pads the individual reads and the reference ge
`nome sequence so that every aligned base within each read
`remains in register in the resulting multiple read alignment).
`The details of Mosaik processing are described in Supplementary
`Methods. The resulting multiple read aligmnents were then
`reported either in ACEl7 or in binary formats used by downstream
`analysis software.
`
`SNP and indel discovery in strain CB4858. Starting with the
`multiple read alignments produced by the Mosaik aligner and
`assembler, we analyzed the resulting alignments using a version of
`PolyBayes8 that was completely reengineered to enable efficient
`analyses of millions of aligned short read sequences. The program
`evaluated each aligned base and its base quality value at each
`position, to indicate putative SNPs and small (1 3 bp) putative
`indels, and their corresponding SNP probability value (PSNP). Base
`quality values were converted to base probabilities corresponding
`to every one of the four possible nucleotides (and to the prob
`ability that the nucleotide in question was an actual insertion
`error in the sequence). Using a Bayesian formulations, a Psz
`(or indel probability value, as appropriate) was calculated as the
`likelihood that multiple different alleles are present between the
`reference genome sequence and the reads aligned at that position.
`If the probability value exceeds a prespecified threshold,
`the
`SNP or indel candidate is reported in the output. For the
`collection of bases contributed by such reads, a single ‘con
`sensus’ base call and its base quality value are computed. The
`corresponding base probabilities are then used in the Bayesian
`PSN'p calculation. In this study, we used a PSNp cutoff value of
`0.7 to define a high certainty SNP or small indel site. Validated
`CB4858 SNPs and indels were assigned Wormbase accession
`numbers pasl pa550906.
`
`NATURE MEI'HODS I VOLS no.2 I FEBRUARY 2008 I 187
`Petltloner Sequenom - EX. 1006, p. 5
`
`Petitioner Sequenom - Ex. 1006, p. 5
`
`

`

`
`
`©2008NaturePublishingGrouphup:llwww.nature.c0mlnaturemethods
`
`
`
`
`
`I ARTICLES
`
`Software availability. The combined microrepeat plus Repeat
`Masker masked genome sequence annotations and PASTA files are
`available
`at http://bioinformaticsbc.edulmicrorepeats/elegansl.
`Mosaik and the updated version of PolyBayes are now in beta
`release and available for users to wish to participate in software
`testing (http://bioinformatics.bc.edulmarthlab/Beta Release). After
`the testing period, both programs will be released for public use,
`free of charge for academic users.
`
`Additional methods. Details of Solexa library construction and
`sequencing, data analysis of primary sequence data and its align
`ment to the C. elegans refermce genome (both single and paired
`end reads) as well as detailed descriptions of Mosaik and PolyBayes
`analysis of CB4858 read data and its validation are available in
`Supplementary Methods.
`
`Note: Supplementary information is available on the Nature Methods website.
`
`ACKNOWLEDGMENTS
`We acknowledge National Human Genome Research Institute funding
`(HGOO3079 04 to R.K.W. and H6003698 to G.T.M.). We thank K. Hall and
`D. Bentley of Illumina, Inc. for generously producing the paired end read
`data described in the manuscript. M. Wendl for careful reading of
`are manuscript and T. Bieri for submitting the CB4858 variants
`to Wonnbase.
`
`AUTHOR CONTRIBUTIONS
`LW.H., N2 Bristol read, coverage, variant and gap analyses; G.T.M., CB4858 SNP
`discovery and N2 Bristol error profile analysis; A.R.0., CB4858 SNP discovery and
`validation analysis; D.D., Solara analysis pipeline; G.F., validation assay design and
`analysis, D.B., Solexa base quality value analysis, P.F., preparation of N2 Bristol
`and CB4858 DNA, J.I.G., N2 Bristol read analysis; M.H., Solexa libraries and
`sequencing, W.H., microrepeat analysis, VJ.M., Solera libraries and sequencing,
`R.J.R., N2 Bristol analysis; S.N.S., validation assays; OAS, microrepeat masking of
`C elegans; M.S., Mosaik adaptation; E.F.T., microrepeat finding; T.W., N2 Bristol
`analysis, T.S., C. elegans strain selection; R.I(.W., project origination; ER.M.,
`project coordination and manuscript preparation.
`
`Published onllne at http://mnamre.com/naturemethods/
`Reprints and permissions information is available online at
`http://npg.namre.com/reprintsandpemissions
`
`4.
`
`1. C. elegans Sequencing Consortium. Genome sequence ofthe nematode C. elegans:
`a platform for investigating biology. Scienrr 282, 2012 2018 (1998).
`2. Waterston, R. et aL The genome of the nematode Caenorhabrfitis elegans. Cold
`Spring Harlz Syrnp. Ouant. Biol. 58, 367 376 (1993).
`3. Lander, E.S. et aL Initial sequencing and analysis of the human genome. Nature
`409, 860 921 (2001).
`International Human Genome Sequencing Consortium. Finishing fire euchromatic
`sequence of the human genome. Nature 431, 931 945 (2004).
`5. Harris, T.W. et aL WormBase: a multi species resource for nematode biology and
`genornics. Nucleic Acids Res. 32, D411 D417 (2004).
`6. Stein, L.D. et aL The genome sequence of Caenorhabditis briggsae: a platfonn for
`comparative genomies. PLoS Biol. 1, e45 (2003).
`7. Hodgkin, J. St Doniach, T. Natural variation and copulatory plug formation in
`Caenorhabditis elegans. Genetics 146, 149 164 (1997).
`8. Marth, G.T. et at A general approach to single nucleotide polymorphism
`discovery. Nat. Garret 23, 452 456 (1999).
`9. Bieri, T. et aL WormBase: new content and better access. Nucleic

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket