IPR2013-00390, No. 1006 Exhibit - Exhibit 1006 (P.T.A.B. Jun. 26, 2013)

I
`
`IIIIIIIII
`
`I
`
`
`-mmuun.
`InllllnnIIIII.
`::sss;'.-
`~u::'JI
`
`
`
`i3:':'.I1I.‘.:1|r"1|1-t::::::‘it:II"'1:-«tin»
`
`IIIII|'.I.“nu' rllllIlull~
`
`
`
`

`
`

`
`
`
`
` ||ll|||I||I||||||||||||||||||||||||||
`Wllllmllllll
`
`E
`
`
`
`

`
`l ARTICLES
`
`Genome:
`1oo,2s1,244 up
`
`Microrep eats:
`19.8%
`
`
`
`RepealMasker
`14.5%
`
`Figure 4 I Repetitive content in C. elegans. Venn diagram depicting the
`fraction of bases in the genome covered by microrepeats and by
`RepeatMasker, and die overlapping set.
`
`a polymorphism rate of 1:1,600 (ref. 11). The Solexa analysis
`pipeline produced metrics of our Solexa single end read data
`for CB4858 (Table 1).
`As a precursor to variant discovery in CB4858, we idmtified
`regions of the reference genome with a high potential for ambig
`uous read alignment, based on the Solexa 32 bp read length. First,
`we identified all unique 32 mers in the reference sequence, but as
`our error rate analysis (Fig. 2) indicated a drop offin the error rate
`beyond 2 errors per read, we defined a repetitive 32 mer as one that
`appears in the genome more than once, allowing 0 2 mismatdied
`bases (substitutions,
`insertions or deletions). Mk called these
`‘microrepeats’ to distinguish than from repeats marked by the
`RepeatMasker program”, which masks 14.5% of the bases in the
`genome. The fraction of the genome comprising perfect and near
`perfect rnicrorepeats totaled 19.8%. We illustrate the relationship
`between RepeatMasl<er masked bases and microrepeat bases iden
`tified by our methods as a Venn diagram (Fig. 4). Although there is
`a substantial overlap (11.1 1%) between the regions masked by both
`methods, 8.7% of the genome that we identified as microrepeats
`was not masked by RepeatMasker. Conversely, 3.4% of the genome
`was masked by Repeatlviasker only, indicating that some fraction of
`C. elegans repeat elements can be uniquely sequenced with 32 bp
`reads. Taken together, RepeatMasker repeats and microrepeats
`cover 23.2% the genome.
`
`Ihble 2 I PolyBayes SNP and indel validation data
`
`Mask type applied
`
`Assay
`type
`
`Submitted to
`validation
`
`Assay
`successful
`
`
`
`
`
`
`
`®©2008NaturePublishingGrouphttp:IIwww.nature.comInaturemethods
`
`Once we aligned CB4858 Solexa reads to the conservatively
`masked C. elegans genome, we applied our combined repeat
`masking to filter the alignments, identified high quality sequence
`differences with PolyBayes, and finalized a set of 45,539 SNPs and
`7,353 single base pair indels. This yields a rate of one SNP per
`1,629.81 bp and one indel per 9,894.99 bp. Hence the pair wise
`nucleotide diversity (theta) between the CB4858 and the N2 Bristol
`strains is 6.136 X 10 4, in good agreement with the ~ l:1,500 rate
`posited in a previous description ofCB4858 (ref. 1 1). As 37,856,444
`CB4858 Solexa reads yielded a total number of 45,539 SNPs, the
`‘read per SNP’ yield was 831. All confirmed CB4858 sequence
`variants are available in Wormbase.
`
`We orthologously validated roughly 1,000 candidate SNPs and
`indels by PCR directed capillary sequencing to gauge the perfor
`mance of our Mosaik PolyBayes approach. After sequencing and
`evaluation, we determined a SNP validation rate of 96.3% (438/
`455) and an 89.0% conversion rate (438l492)
`for candidates
`identified by PolyBayes (Table 2). We sequenced 239 of our
`putative single base indels, finding they validated (93.8%) and
`converted (87.7%) at practically the same rates as SNPs. Both
`insertions and deletions predicted in the reference genome
`sequence were represented (insertions: 2,948 or 47.1%, and dele
`tions: 3,316 or 52.9%). Many ofthe indds were variable numbers of
`bases in mono nudeotide repeats, for example, 5 versus 4 adeno
`sines. Although mononucleotide runs are typically very difficult
`areas for indel detection, our high validation rate indicates that
`Solexa reads resolve base numbers in these runs very well. Micro
`repeat masking has a marked impact on accurate SNP discovery by
`eliminating putative SNPs and indels resulting from paralogous
`read mapping (Table 2).
`We estimated false negative rates for PolyBayes by nmning
`PolyPhred‘3“5 (version 5.0) on the validation trace data. This
`algorithm indicated PolyBayes had missed 26 SNPs, for a false
`negative rate of 3.75%.
`To determine the chromosomal distribution of CB4858 poly
`morphisms, we placed CB4858 SNPs and indels along the six
`C. elegans chromosomes, and identified both chromosome wide
`and chromosomal position specific differences (Supplementary
`Data and Supplementary Fig. 3 online). Our data confinned an
`earlier study in C. elegans“ suggesting that nonsynonymous sub
`stitution rates are higher in the first and second codon positions
`than in the third (Supplementary Fig. 4 online). Furthermore, over
`half of CB4858 SNPs positioned in exons putatively introduce an
`amino acid change.
`
`Sequencing
`successful
`557
`518
`458
`
`222
`217
`208
`
`SNP candidate
`confirmed
`482
`475
`438
`
`202
`201
`193
`
`Validation
`rate (‘'In)
`86.5
`91.7
`96.3
`
`91.0
`92.6
`93.8
`
`Conversion
`rate ("/o)
`80.6
`82.0
`89.0
`
`84.5
`86.6
`87.7
`
`Known repeats
`Exact microrepeats
`Near exact microrepeats
`(2 or fewer mismatches)
`Known repeats
`Exact microrepeats
`Near exact microrepeats
`(2 or fewer mismatdres)
`Validation and eonvesion rats fovPouBa1esrselected SNPs and single base indel candidates. Successive application of masking filters. as dacrihed in the text. reduced the number of paralogous
`placements and identified high confidence putatixe variant sites
`
`SNP
`SNP
`SNP
`
`Indel
`Indel
`Indel
`
`598
`579
`492
`
`239
`232
`220
`
`582
`559
`482
`
`228
`223
`213
`
`186 I voE?H6’.24|
`
`2003 I NATURE METHODS
`
`

`
`

`
`Whole-genome sequencing and variant discovery in C.
`elegans
`
`LaDeana W Hillier, Gabor T Marth, Aaron R Quinlan, David Dooling, Ginger Fewell,
`Derek Barnett, Paul Fox, Jarret I Glasscock, Matthew Hickenbotham, Weichun Huang,
`Vincent J Magrini, Ryan J Richt, Sacha N Sander, Donald A Stewart, Michael Stromberg,
`Eric F Tsung, Todd Wylie, Tim Schedl, Richard K Wilson & Elaine R Mardis
`
`
`
`Supplementary figures and text:
`
`
`Supplementary Figure 1 Solexa single end read coverage levels mapped onto the C.
`elegans genome.
`
`Supplementary Figure 2 Average N2 Bristol Solexa single end read coverage at different
`A+T percentages.
`
`Supplementary Figure 3 Chromosomal distribution pattern of CB4858 SNPs and indels.
`
`Supplementary Figure 4 Codon position bias in validated CB4858 SNPs.
`
`Supplementary Data
`
`Supplementary Methods
`
`Supplementary Table 1 Results of hairpin formation potential analysis in regions of the
`C. elegans genome with no exact match coverage by Solexa reads.
`
`
`Page 7 of 20
`
`

`
`
`
`Supplementary Figure 1. Solexa single end read coverage levels mapped onto the
`C. elegans genome. Solexa single end read coverage of exactly matching reads
`mapped to the unique portion of the C. elegans reference genome.
`Average coverage was calculated at 19.2X.
`
`
`
`
`
`Page 8 of 20
`
`

`
`
`
`
`
`Supplementary Figure 2. Average N2 Bristol Solexa single end read coverage at
`different A+T percentages. The curves indicate the average coverage per base in A+T
`percentage bins ranging from 0-99 percent considering either a window size of
`200 bp (red line), indicative of amplicon-associated biases or of 32 bp (blue line),
`indicative of read-associated biases.
`
`
`
`
`
`
`
`
`
`Page 9 of 20
`
`

`
`
`
`
`
`Supplementary Figure 3. Chromosomal distribution pattern of CB4858 SNPs and indels.
`Chromosomal SNP (red) and indel (blue) distribution across the C. elegans
`genome, based on our analysis of CB4858 sequence alignment to the N2 Bristol
`reference genome sequence and variant detection by PolyBayes.
`
`
`
`
`
`Page 10 of 20
`
`

`
`
`
`
`
`Supplementary Figure 4. Codon position bias in validated CB4858 SNPs.
`N2 Bristol-CB4858 codon position distribution of non-synonymous (black)
`and synonymous (white) SNPs, identified by analysis of CB3848 Solexa
`single end reads in comparison to the N2 Bristol reference sequence.
`
`
`
`
`
`Page 11 of 20
`
`

`
`Supplementary Data
`
`Resequencing of a C. elegans N2 Bristol strain isolate
`
`Identifying unmapped read placements
`Of our phrap-assembled contigs, 12,635 aligned with only a single mismatch to
`the C. elegans reference. An additional 3,088 contigs align with two mismatches, while
`2,295 match over fewer than 50% of their bases. We further identified 124 contigs
`spanning one or more bases determined previously to be not covered (78 were gaps of 3
`or fewer bases), suggesting insertion/deletion differences between the sequenced isolate
`and the reference. Thus, some unplaced reads simply are due to sequence differences
`between two isolates, not to missing sequences in the reference.
`
`Evaluating Solexa single end read coverage
`We calculated the ratio of observed to expected coverage by dividing the number
`of alignments that included each base by the number of times that base is involved in a
`32mer in the C. elegans genome and found the average coverage ratio was 0.316, with
`the peak of the distribution around 0.25. This result is as expected, given that the peak of
`the coverage curve occurs between 16-17X (Fig. 1), and the representation value for a
`unique-in-genome base is 64, equaling a coverage ratio of 16/64=0.25. Paired end data
`provided a read-based average of expected to observed coverage of 0.014, with a
`corresponding physical coverage based average of 0.052.
`To define a threshold for a significantly “over-represented” region (suggesting
`missing sequence from the reference or a possible coverage bias), we examined the
`Solexa read coverage in unique regions of the C. elegans genome (i.e. where the base had
`a representation value of 64 and was not in an annotated repeat in the C. elegans wb170
`version at Wormbase). At the extremes, we found 1.32M base positions (1.7%) with
`>40X coverage in unique 32mers. Two areas of expected over-coverage are the rDNA
`regions on chromosomes I and V where uncertainty about the exact copy number of these
`large tandem repeats led to the placement of sole representative members of each in the
`reference sequence. Specifically, on chromosome I, the 18S, 28S and 5.8S are transcribed
`separately by RNA polymerase I in the ~55 copies of the 7.2kb rDNA repeat 1,9. On
`chromosome V, the 5S gene along with the SL1 spliced leader gene lies in a 1kb tandem
`repeat with ~110 copies 9,10. Examining unique 32mers within the rDNA segments, we
`ascertained that the unique chromosome I rDNA 32mers exceed a coverage ratio of 100
`in Solexa reads; the only unique regions in the genome with this high a coverage level.
`
`Analysis of regions not covered by Solexa reads
`We further investigated the characteristics of regions not covered by Solexa reads,
`pursuing three hypotheses; 1) representational bias due to the potential for hairpin
`formation, 2) decreased coverage as a function of A+T content at the 32mer level and at
`the 200bp amplicon level (the fragment size utilized for Solexa libraries) and 3)
`correlation to sequence features such as repeats, exons, introns, etc. One possible
`explanation for uncovered regions, the formation of a hairpin that could cause deletion of
`the intervening sequence prior to Solexa library construction, was tested by examining
`the 31bp flanking each uncovered region for 5bp or more of sequence identity on
`
`Page 12 of 20
`
`

`
`complementary strands. A control random set of flanking 31bp sequences was created
`for random pseudo-gaps across the genome (randomizing the chromosome and the start
`site and with the “gap” or “uncovered” size mimicking those of the true uncovered
`regions). We then required any sequences having 5 or more base pairs of identity to also
`have those identical sequences align immediately at the boundaries of the uncovered
`region. This analysis provided evidence of putative hairpins (Table 1) for the real and
`control gaps, and indicated that hairpin formation might account for only a handful of the
`uncovered genomic regions.
`Our investigation of A+T content at the read- and amplicon-length level provided
`a compelling explanation for uncovered sequences. For zero base pair gaps, the A+T
`content of all 32mers centered on the base revealing the zero size gap was 76% and of all
`200bp centered on that base was 73%. For gaps in coverage larger than zero bases, the
`AT content was 82%. When we stratified these regions by size, we found that regions
`between 25-100 bp had the highest A+T content; remarkably 85%. Examining regions
`covered by 2 or fewer reads and indicated an 84% A+T content.
`To evaluate whether A+T biasing was occurring during cluster amplification or
`sequencing, we examined A+T content in 32 bp “read” windows compared to 200bp
`“amplicon” windows, then evaluated coverage of Solexa reads as a function of A+T
`content, examining the distribution for all bases and for all non-repetitive bases with a
`representation value = 64 (thus unique in genome). We considered these distributions for
`only 32mer exact match reads, and for the combined 32mer exact match and quality-
`trimmed exact match with 2 or fewer mismatches. For all permutations attempted we
`found the same trend; as A+T content reaches either extreme (very low or very high), the
`Solexa read coverage decreases. This is illustrated (Fig. 2), where a higher percentage of
`bases have zero coverage for a 200bp amplicon window, compared to a 32bp read
`window (51.6% vs. 9.7% for non-repetitive bases) suggesting that A+T content biasing
`may be related to the amplicon rather than the sequence read.
`Finally, we correlated the locations of 13,035 regions without an exact match of
`>1 bp with other sequence features by number of events. Here, 33.4% of the uncovered
`base events occurred within repeats, 51.1% within introns, 2.3% within exons and 2.1%
`within “tagged” regions (annotated in the reference sequence as being a problematic
`region during conventional sequence finishing). Therefore, the relatively lower A+T
`content of C. elegans exons means that we can adequately characterize variant bases in
`exons, but this may not be the case for an organism with higher exonic A+T content.
`
`Paired end coverage analysis
`Although our paired end read coverage was significantly lower than from single
`end reads, it represented portions of the genome not represented in the other set,
`suggesting it may not have the same biases. Of the ~100kb of bases not covered by exact
`matching single end reads, 41% were covered by a read in the paired end set. Further,
`80% of the not covered bases fell between paired ends (thus represented by physical but
`not by sequence coverage). The A+T content of the 49.5 Mb not covered by reads in the
`paired set, however, was 63%, similar to the genome average as a whole. Additional
`paired end coverage would be required for a definitive bias evaluation.
`
`Genome-wide polymorphism discovery in C. elegans strain CB4858
`
`Page 13 of 20
`
`

`
`
`Mosaik alignment of CB4858 reads
`Of the total 37,856,444 CB4858 Solexa reads, the Mosaik aligner/assembler was
`able to align 24,518,452 (64.7%) to the conservatively masked C. elegans reference
`genome (ws170), allowing no more than two sequence differences (mismatches plus
`gaps), to avoid misalignments due to paralogy.
`
`Chromosomal locations of CB4858 polymorphisms
`In 2000, Koch1 illustrated in C. elegans that the non-synonymous substitution rate
`was much higher in the first and second codon positions than in the third. Our study
`confirms these earlier results (Fig. 3) and provides a detailed, genome-wide estimate of
`coding polymorphisms in the CB4858 strain. In total, we found 6,255 SNPs positioned
`within an exon, of which 3,275 putatively introduce an amino acid change. Through our
`experimental validation, 100 of 119 (84%) non-synonymous mutations were confirmed,
`indicating that our methods provide an important first step in describing the complete
`mutational profile in a strain-to-reference paradigm. Furthermore, we evaluated SNP
`positioning on a chromosome-by-chromosome basis (Fig. 4), finding the resulting
`polymorphism density to be much higher on Chromosomes II, III and X. The densities
`on Chromosomes IV and V suggest a very low variation rate for much of the
`chromosome, yet a comparatively high mutation rate on the right half of each
`chromosome.
`
`Supplementary Methods
`
`Preparation of Solexa fragment libraries
`Genomic DNA (5µg) was nebulized for 2 minutes at 45 psi of compressed air, to
`obtain an average fragment size of 500bp, then further purified and concentrated with
`Qiaquick PCR purification spin columns (Qiagen Inc., Valencia CA). Treatment to
`remove 3’ overhangs and fill in 5’ overhangs resulted in blunt ended genomic fragments.
`An A residue was added by terminal transferase to the 3’ end and the resulting fragments
`were ligated with Solexa adapters. Adapter-modified DNA fragments were enriched by
`an 18 cycle PCR using 50 ng of the ligation reaction and Solexa universal adapter
`primers. The resulting PCR products were separated by agarose gel electrophoresis and
`the band between 150-200 bp was excised from the gel. The DNA fragments were
`extracted from the agarose slice using a Qiaquick Gel Extraction Kit (Qiagen Inc.), and
`further purified by drop dialysis using a 0.025mm/25mm filter (Millipore Inc., Billerica
`MA) and tissue culture-grade water (Sigma Chemical, St. Louis MO). The DNA
`fragment library was quantitated, then diluted to a 10 nM working stock for cluster
`generation.
`
`Solexa cluster generation
`Adapter-ligated fragments (2 nM) were denatured in 0.1N NaOH for 5 minutes,
`then were further diluted to a final 9 pM concentration in 1 ml of pre-chilled
`hybridization buffer, and introduced onto the Solexa flow cell using the Cluster Station,
`an automated device supplied by Solexa. On this apparatus, the oligo-derivatized flow
`cell surfaces hybridize to library fragments by adapter-to-oligo pairing. “Clusters”,
`
`Page 14 of 20
`
`

`
`representing discrete populations of unique single-stranded library fragments amplified in
`situ, are generated by isothermal amplification using a proprietary process. In practice,
`each cluster produces a single Solexa read. Our experiments aimed for an average cluster
`density of 30,000 (+/- 5,000) clusters per flow cell lane.
`
`Preparing clusters for sequencing
`Following isothermal amplification, clusters were made single-stranded by 0.1N
`NaOH denaturation, metered across the flow cell by the Solexa Cluster Station. A
`sequencing primer complementary to one Solexa adapter was added to prime the single-
`strands of each cluster. Once hybridized and with excess primer removed by a wash, the
`flow cell was ready for sequencing.
`
`Solexa single end read sequencing process
`The Solexa Genome Analyzer was programmed to provide up to 32 sequential
`flows of fluorescently labeled, 3’-OH blocked nucleotides and polymerase to the surface
`of the flow cell, thus producing a fixed 32bp read length. After each base incorporation
`step, the flow cell surface is washed to remove reactants and then imaged by microscope
`objective. Our experiments collected 200 tiled images (“tiles”) per flow cell lane, each
`containing on average 30,000 clusters. Solexa single end reads were generated for N2
`Bristol (85,498,849 total reads) and CB4858 strains (37,856,444 total reads) using a 30
`base read length for the titration run (used to determine the correct input library DNA
`amount), and a 32 base read length for standard runs.
`
`Solexa paired end N2 Bristol sequencing process
`Paired end library construction involved the DNA fragmentation steps outlined
`above, but ligated the end-repaired fragments with Solexa paired end adapters to
`introduce two unique sequence priming sites at each fragment end. The adapter-ligated
`fragments were separated by agarose gel electrophoresis and a scalpel was used to cut a
`thin horizontal slit in the lane at 200bp. The DNA fragments from this slit were rinsed
`from the scalpel blade using 30 µl of EB (Qiagen Inc., Valencia CA), then were enriched
`by an 18 cycle PCR (conditions described above). The resulting PCR products were
`purified and concentrated with a Qiaquick PCR purification spin column (Qiagen Inc.).
`The paired end library was quantitated and diluted to a 10 nM working stock. Cluster
`generation was accomplished on a modified flow cell that contained three primer-
`complementary sequences. These sequences promote the formation of heterogeneous
`clusters in which two populations exist; one that can be subsequently released for
`sequencing primer annealing using the standard chemistry, and a second population that
`remains intact during release and sequencing of the first population, then is subsequently
`released for sequencing primer hybridization and extension by a separate chemical step
`that is applied whilst the flow cell remains in place on the Solexa Genome Analyzer. A
`total of 3,671,972 paired end reads were produced, with a first end read length of 35 bp
`and a second end read length of 25 bp.
`
`Solexa data processing pipeline
`The raw images from the Solexa Genome Analyzer are synchronously mirrored as
`the run proceeds, from the instrument computer to an NAS (Network Attached Storage)
`
`Page 15 of 20
`
`

`
`filesystem. Once the run completes, image analysis, base calling and read quality filtering
`are performed by the Solexa run analysis software. The read quality filtering acts to first
`remove poor quality reads using Solexa-defined criteria (“% passed bases”), then uses the
`program Gerald to align and compute the percentage of bases aligning to the reference C.
`elegans genome (ws170 version). The alignment of the resulting short reads to the
`reference genome is performed by the Solexa pipeline. Although we did not utilize the
`Eland alignments in the detailed data analyses described here, Eland metrics did provide
`an initial evaluation of Solexa data quality and helped us to estimate accumulating
`coverage.
`
`Solexa N2 Bristol reads-to-reference alignment approach
`We initiated read alignment to the reference by first identifying all exactly
`matching reads. Here, we generated a hash table of all 30mers and 32mers (to match the
`Solexa read lengths on titration and full runs) in the reference C. elegans genome
`sequence, and the positions for each mer within the C. elegans genome were cataloged.
`The UNIX sort function was then used in a hash-based approach, to sort the Solexa reads
`with all reference-derived 32mers and 30mers, determining that 75.55% of reads had an
`exact match. Reads were placed in all regions where they had an exact match. Of the
`reads without an exact match, we removed all reads with a best match (by BLAT) to
`Escherichia coli (C. elegans feed on E. coli). Further, we removed all reads (1.05%)
`matching a portion of the Solexa primer sequence (which has no match in either C.
`elegans or E. coli).
`The remaining reads were quality trimmed, meaning we retained reads with at
`least 20 consecutive base pairs having a quality score of 25 or higher (36.3% retained).
`These reads were again sorted by exact match hash-based methods to an index of all
`20mer, 21mers, etc. through 32mers for the C. elegans and E. coli reference sequences.
`The remaining quality-trimmed reads without an exact match to C. elegans or to E. coli
`were then considered as potentially falling into one of the following categories: 1)
`contamination, 2) containing one or more base calling errors, 3) indicative of potential
`errors in the reference genome or of a difference between the reference and Solexa-
`sequenced strain, 4) sequences altogether missing from the current reference genome.
`We used BLAT-based14 alignments with the parameters (tileSize=6 –stepsize=1
`repMatch=1000000 –minIdentity=70 minScore=20) to identify non-exact matching
`reads. These were identified as Solexa reads placed uniquely by gapped alignment, that
`indicated an insertion in the Solexa-sequenced N2 strain relative to the reference (defined
`by an exact Solexa read alignment on either side of a one base (or larger) gap), or
`uniquely-placed Solexa reads with a deletion in the Solexa-sequenced strain (defined by
`exact alignment that excluded one or more unmatched bases in the reference sequence).
`We also found evidence of indels, where the Solexa read and the reference matched
`exactly except for one or more bases that were not matched between the two in the
`unaligned region.
`
`Assembly and analysis of unplaced reads
`To investigate the remaining 1,843,353 unmapped reads, we attempted assembly
`using phrap (P. Green, unpublished) with stringent parameters (-minmatch 20 –minscore
`20 –forcelevel 0 –repeat_stringency 1 –new_ace_penalty -4). Only 331,568 reads
`
`Page 16 of 20
`
`

`
`assembled into 34,382 contigs with an average length of 35 bp. These contigs were
`compared to the reference genome using BLAT. The remaining reads likely differ
`significantly or are missing completely from the C. elegans or E. coli reference sequences
`(E. coli strain OP50 is fed to C. elegans), or are contamination or sequencing artifacts.
`
`Solexa N2 Bristol single end read coverage
`To identify regions with above- or below-average coverage by Solexa N2 Bristol
`reads, and to evaluate the potential for biases in coverage, we first created a map
`enumerating Solexa reads in which each basepair in the C. elegans genome participates.
`If a base is in a unique region of the genome, the “representation value” for that base is
`64 since it will only be located in the 32mer including that base, as well as on the
`opposite strand. Of the 100,281,244 bases in the C. elegans reference, 87,558,992 bases
`participate in a single 32mer (i.e. representation value of 64). To obtain an estimate of
`unique genome coverage, we then examined the number of times each 32mer occurred
`within the reference genome.
`
`Evaluating sequences not covered in the reference
`Genomic regions that remained not covered by Solexa reads were identified as all
`bases not participating in an alignment with a Solexa read after mapping all exactly
`matching reads and quality-trimmed reads, and by further mapping quality-trimmed reads
`allowing up to 2 mismatches or unaligned bases. In addition to obvious large gaps in
`coverage, our gap analysis also revealed so-called “zero base pair gaps”. We defined a
`zero base pair gap as a site where two adjacent reference genome 32mers are covered by
`reads that align exactly to each 32mer but across which no exactly matching spanning
`read exists. All of the uncovered regions we computed were evaluated for their A+T
`content, potential for hairpin formation, presence of transposon sequences, and other
`sequence features.
`
`Paired end alignment
`
`Of the 3,671,972 paired end reads produced, 2,158,677 (58.8%) forward reads
`and 2,663,404 (72.5%) reverse reads had an exact match. The difference in matching
`percentage likely was due the lower base calling accuracy near the end of the longer
`forward read. We identified 1,581,763 N2 Bristol quality filtered read pairs that could be
`uniquely placed in the genome and had an average insert size of 218 bp (s.d. = 38 bp).
`These were evaluated further for their potential to identify putative areas of structural
`variation, as described in the Results section of the manuscript.
`
`Mosaik alignment of CB4858 Solexa reads
`Our approach used two fundamental methods: (1) hash-based, and (2) sequence
`alignment-based. Our hash-based method enumerated every 32mer in the C. elegans
`reference genome sequence, and recorded its map location. If the same 32mer occurred in
`multiple locations (either strand), it was marked as a microrepeat. Near-perfect
`microrepeats were identified by asking whether each specific 32mer, or any of the other
`32mers obtained by introducing “mutations” up to a pre-specified number of mismatches,
`appeared at any other genomic location. The sequence alignment-based method used the
`BLAT algorithm with the following parameters: -stepSize=8 -tileSize=16 -minMatch=1 -
`
`Page 17 of 20
`
`

`
`minScore=28 -oneOff=1 to enumerate every 32mer in the reference genome, and then to
`search the rest of the genome for a perfect or near-perfect match, up to a pre-specified
`number of mismatches. Custom scripts identified hits that had up to 2 mismatches (any
`combination of substitutions, insertions, or deletions), and combined the results of our
`hash-based and sequence alignment-based methods to produce a microrepeat-masked
`reference genome.
`Mosaik consists of two parts: the aligner and the assembler. The aligner employs
`a two-stage process: an initial short hash-based search to explore possible map locations
`for each read and to register locations with multiple initial hash matches, followed by a
`full Smith-Waterman-Gotoh pair-wise local alignment between the read and each
`possible map location. The best-scoring map location is identified, and the alignment is
`reported if 1) the alignment score is above a pre-specified threshold, 2) the alignment
`extends over a pre-specified fraction of the reads, and 3) the total number of mismatches
`within the alignment does not exceed a pre-specified maximum. In this analysis, we
`allowed a maximum of two mismatches (substitutions and indels), a decision motivated
`by the fact that over 79% of the Solexa reads had either one or zero error. Allowing a
`single polymorphic difference in such reads would bring the maximum number of
`mismatches to two. The assembler algorithm first registers every gap introduced by any
`of the aligned reads on the reference sequence, then introduces alignment gaps into all
`aligned reads to preserve the positions of all pair-wise aligned bases in every pair-wise
`read alignment.
`
`Validation of putative sequence differences
`A subset of candidate insertion, deletion, indel and polymorphic sites from our
`analyses were submitted for orthologous validation using PCR-based sequencing and
`variant analysis. Because we had a very limited amount of the original N2 Bristol
`genomic DNA isolate used for the reference genome sequencing, we had a similarly
`limited ability to validate the sequence variants we identified between Solexa N2 Bristol
`reads and the reference genome. Hence, the preponderance of our validations centered
`on the putative variants identified for the CB4858-to-N2 Bristol comparison.
`For each type of CB4858 variant, we attempted validation for 50% of the
`candidates identified in coding and 50% in non-coding sequences. We designed primers
`with 300bp of sequence both to the left and to the right of the target site, using Primer3
`(http://primer3.sourceforge.net/). A 5’ universal M13 forward or reverse sequence was
`added to each primer pair to allow processing of the resulting PCR products in our high-
`throughput sequencing pipeline. PCR was performed in a 10µL reaction containing 5ng
`of genomic DNA (the original DNA stock utilized in Solexa sequencing, or the original
`N2 Bristol reference strain DNA, according to the assay type), 1.2nmol of each
`universally tailed amplification primer, and 5.0µL of Amplitaq Gold 2X mix (Applied
`Biosystems P/N 4327059, Foster City,CA), with a final 8% glycerol concentration).
`Reactions were cycled at 96 ºC for 5 minutes, followed by 40 cycles of 94 ºC for 30
`seconds, 60 ºC for 45 seconds, and 72 ºC for 45 seconds, and a final extension at 72 ºC
`for 10 minutes. Following amplification, PCR products were treated with 3.7U of
`Exonuclease I (USB P/N 70073X, Cleveland,OH) and 0.18U of Shrimp Alkaline
`Phosphatase (USB P/N 70092X), and incubated at 37 ºC for 30 minutes. The reaction
`was stopped by incubation at 80 ºC for 15 minutes. PCR products were sequenced with
`
`Page 18 of 20
`
`

`
`BigDye Terminator Sequencing Kit (Applied Biosystems P/N 4336943, Foster City CA)
`using either universal forward or reverse sequencing primers, and analyzed on ABI
`3730XL DNA sequencers.
`
`Visual inspection of validation data
`We aligned validation reads to the reference sequence using PolyPhred2,
`determining by manual inspection whether a Solexa variant was confirmed in the 3730
`trace. Both a validation rate (defined as the number of confirmed variants divided by the
`number of successful sequencing reactions) and a conversion rate (defined as the number
`of confirmed variants divided by the number of variants submitted for validation) were
`calculated based on our visual confirmation results.
`
`Evaluating variants for exonic disruption
`We investigated SNP candidates in CB4858 that introduced amino acid changes
`relative to N2 Bristol. Conceivably, such

This document is available on Docket Alarm but you must sign up to view it.

Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

Up-to-date information for this case.
Email alerts whenever there is an update.
Full text search for other cases.
Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.

Access Government Site

We are redirecting you
to a mobile optimized page.

Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket

Supplemental Search

Search for PTAB Motions

PTAB Analytics

TTAB Analytics

Basic Search

Filters

Party Search

Advanced

Selected Courts

Recently Selected Courts

Find PTAB Decisions

PTAB Analytics

Special PTAB Alerts

Orange Book

Directly Search Federal Courts

Search Trademark ...

This document is available on Docket Alarm but you must sign up to view it.

Accessing this document will incur an additional charge of $.

Still Working On It

A few More Minutes ... Still Working

This document could not be displayed.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

One Moment Please

Your document is on its way!

Sealed Document

We are redirecting youto a mobile optimized page.

Document Unreadable or Corrupt

We are unable to display this document.

STEP 2 of 2

Choose your membership type

Flat-Fee

Pay-As-You-Go

Add your payment information

Login or Join

Enter your corporate Email

Thousands of your peers are saving time and gaining a competitive advantage with Docket Alarm.

Join Docket Alarm to perform smarter legal research.

Download this document and millions of others instantly with a Docket Alarm membership.

Join Docket Alarm and start performing smarter legal research.

Start tracking this docket instantly with a Docket Alarm membership.

Join thousands of your peers and start performing smarter legal research.

STEP 1 of 2

Millions of Documents | 15 Seconds to Signup

Hi !

Welcome to Docket Alarm

Welcome to Docket Alarm!

Explore Litigation Insights andManage Your Cases

Reset Password

What is PACER?

Why do I need it?

What will I be charged?

Do other courts have fees?

Basic Free Access

Welcome

Thank you

Check Firm Account

We are redirecting you
to a mobile optimized page.

Explore Litigation Insights and
Manage Your Cases