Supplementary Information for Initial Sequencing and Analysis of the
`Human Genome.
`International Human Genome Sequencing Consortium.
`
`Methods and additional notes
`
`Section: Generating the draft genome sequence (p. 864)
` Subsection: Clone selection (p. 865)
`
Page 866, col. 2, para. 3: “Fingerprint data were reviewed … bias against rearranged
clones).”
`
`
`
`
`Seed clones were picked from the growing contigs as follows: We began by
`identifying fingerprint clone contigs that had been localized to targeted locations and
`that did not contain any clones that had previously been selected for sequencing.
`Contigs were localized using mapping data from a variety of sources that could be
`attached to the fingerprinted clones, including STS/hybridization data from
`McPherson and colleagues86, FISH data from several sources (C. McPherson et al.,
`ref. 103), STS/PCR mapping data from several sources92,95,103, electronic PCR data
`(http://www.ncbi.nlm.nih.gov/STS/) matching the BAC end sequences with mapped STSs
and others. Beginning with the largest available clone in a valid contig (clones >250
kb were excluded to avoid artifacts), the FPC program451 evaluated the fingerprints
of all of the clones in the contig to determine the largest clone for which all but two of the
individual bands in the restriction fragment pattern were shared with
(confirmed by; that is, matched a band of equivalent size, ±3%) bands in the patterns of
flanking clones (again, ignoring flanking clones >250 kb). (Because the
restriction enzyme used to produce the clone inserts differs from the enzyme
used to produce the fingerprints, two bands may arise from the insert-vector junctions;
these bands are not found in the genome or in flanking clones.) Selected clones were then
`checked for excessive overlap with previously selected or sequenced clones and
`with each other. The allowable overlap at this stage was varied to suit the demands
`of the project.
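
The band-confirmation rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the FPC implementation: the function names, the pooling of bands from all flanking clones, and the choice to measure the ±3% tolerance relative to the query band are assumptions, and the fragment sizes are invented.

```python
def band_confirmed(size, neighbour_bands, tolerance=0.03):
    """True if some neighbour band lies within +/- tolerance of `size`."""
    return any(abs(size - other) <= tolerance * size for other in neighbour_bands)


def is_valid_seed(clone_bands, flanking_clones, max_unconfirmed=2):
    """Check that all but `max_unconfirmed` bands are shared with flanking clones."""
    # Assumption: bands from every flanking clone are pooled before matching.
    pooled = [band for bands in flanking_clones for band in bands]
    unconfirmed = sum(1 for band in clone_bands if not band_confirmed(band, pooled))
    return unconfirmed <= max_unconfirmed


# Hypothetical fragment sizes in base pairs.
clone = [1000, 2000, 3000, 4000, 5000]
flanks = [[1010, 2950, 7000], [2040, 4100, 9000]]
print(is_valid_seed(clone, flanks))
```

Here only the 5000 bp band lacks a partner, so the clone stays within the allowance of two unconfirmed (possibly vector-junction) bands.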
`
`Clones (walking clones) extending from seed or other selected clones were selected
`as follows: In the early phases of the effort, clones were not necessarily correctly
`ordered within a fingerprint clone contig and indeed not all of the available clones
`had necessarily been incorporated into the contig. Starting with a previously
`selected (seed) clone, the FPC program compared the restriction fragment pattern of
`that clone with the patterns of all of the clones in the fingerprint database that
`overlapped with the seed clone. It then iteratively analyzed the clones identified in
`the first round of analysis to identify the additional clones that overlapped with those.
`In this way, a set of overlapping clones was identified and the clones in the set were
`ordered based on their overlap statistics. After ordering, all of the valid clones were
`identified (valid clones were defined as those with all but three of their bands
`confirmed by clones within 4 clones on either side). Any clone that also had outside
`evidence of overlap, e.g. through BAC end sequence matches or shared
`SEQUENOM EXHIBIT 1105
`Sequenom v. Stanford
`IPR2013-00390
`
`

`

`STS/hybridization data was selected for further evaluation. In cases with more than
`one clone with such outside evidence, the clone with the lowest overlap statistic (i.e.,
`the one that was least redundant) was selected (in the case of ties, the largest clone
`was favored). Where there was no outside evidence, a clone was picked based on
`evaluation of the overlaps. The candidate clone was the first one that was found to
`have the minimal overlap with the seed clone (initially <20% overlap, rising to 30% in
`later phases of the mapping effort; the percentage overlap was estimated by dividing
`the sum of the sizes of the common bands by the size of the smaller of the two
`clones). To be picked, the clone also had to be bridged to the seed clone by a third,
`intermediate clone that confidently (<1e-4) overlapped both the seed clone and the
`candidate clone. The candidate clone was then further evaluated for fingerprint
`overlap with previously selected or sequenced clones.
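
As a rough illustration of the overlap estimate described above (the sum of the sizes of the common bands divided by the size of the smaller clone), the following Python sketch may help. It assumes that a clone's size can be approximated by the sum of its band sizes and that two bands count as common when they agree within ±3%; both are assumptions made for this example, as are the function names and the fragment sizes.

```python
def shared_band_total(bands_a, bands_b, tolerance=0.03):
    """Sum the sizes of bands in A that have a not-yet-used match in B."""
    remaining = sorted(bands_b)
    total = 0
    for size in sorted(bands_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * size:
                total += size
                del remaining[i]  # each band in B may confirm only one band in A
                break
    return total


def overlap_fraction(bands_a, bands_b):
    """Common-band size divided by the size of the smaller clone."""
    smaller = min(sum(bands_a), sum(bands_b))
    return shared_band_total(bands_a, bands_b) / smaller


# Hypothetical fragment sizes in base pairs.
seed = [1000, 2000, 3000, 4000, 10000]
candidate = [3000, 9000, 12000]
fraction = overlap_fraction(seed, candidate)
print(fraction, fraction < 0.20)
```

With these invented fingerprints the candidate shares a single 3000 bp band with the seed, giving an estimated overlap below the initial 20% limit.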
`
`Once clones were ordered within fingerprint clone contigs, a similar algorithm that
`exploited the known clone order was used to pick the walking clones. This algorithm
`was also adapted to pick a spanning/walking clone for complex contigs with 2 or
`more clones in the sequencing pipeline, using the fingerprint map as a guide.
`
`
`
`
` Subsection: Sequencing (p. 867)
`
`Page 868, left-hand column, line 20: “By examining … 500 bp.”
`
`
`The sizes of the gaps between adjacent initial sequence contigs in draft clones were
`measured using alignments of the initial sequence contigs from individual draft
clones to contigs of size ≥ 40 kb from overlapping clones, usually finished clones.
`10,999 gaps were examined. 1,726 gaps larger than 6,000 bp were discarded as
`probable artefacts due to misassemblies or incorrect alignments. The mean size of
`the gaps between the initial sequence contigs in draft clones was 554 bases. When
`the cutoff for discarding gaps was lowered to 3000 bp or raised to 12,000 bp, the
`mean gap size decreased to about 400 bp (estimated from 9,801 gaps) and
`increased to about 800 bp (estimated from 11,972 gaps) accordingly, indicating that
`there is still considerable uncertainty in the mean value. The 554 bp estimate for the
`mean gap size was used, along with the number of initial sequence contigs (Table 7)
`and the total number of bases in the initial sequence contigs (data not shown) to
`estimate the percentage of the draft clones that were covered by the initial sequence
`contigs. It was thus determined that, on average, about 96% of the draft clones was
`covered; assuming a mean gap size between 400 and 800 bp, the range in coverage
`is about 94-97%.
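
The coverage estimate can be illustrated with a short calculation. The sketch below assumes that coverage is the number of bases in contigs divided by that number plus the estimated gap bases; the clone used here (160,000 bases in 13 contigs) is hypothetical and is not taken from Table 7.

```python
def draft_coverage(contig_bases, n_gaps, mean_gap):
    """Fraction of a clone covered by its initial sequence contigs."""
    return contig_bases / (contig_bases + n_gaps * mean_gap)


# Hypothetical clone: 160,000 bases in 13 contigs, hence 12 internal gaps.
for gap in (400, 554, 800):
    print(gap, round(draft_coverage(160_000, 12, gap), 3))
```

For this invented clone the three gap sizes give coverage values in roughly the same 94-97% range quoted above, which shows how strongly the estimate depends on the assumed mean gap size.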
`
`This comment also pertains to page 874, left-hand column, line 57: “Assuming that the
`sequence gaps … gaps within the draft sequenced clones”
`
` Subsection: Assembly of the draft genome (p. 868)
`
`Page 868, right-hand column, l. 47, "To eliminate such problems, sequenced clones were
`associated with the fingerprint clone contigs in the physical map…"
`
`

`

An FPC match statistic better than 1e-7 for the sequenced clone against the FPC
fingerprint database was considered significant, based on empirical evidence. This
`match level was the weakest value used for placement when there was other
`confirmatory evidence to support the placement. In the absence of additional
`supportive data, a match score of better than 1e-9 was required for placement. In
`general, only the best match was used. Other confirmatory evidence included BAC
`end matches; the BAC end sequences were obtained from NCBI (dbGSS;
http://www.ncbi.nlm.nih.gov/dbGSS/index.html). To exclude repetitive sequences, only BAC end
sequences with 15 or fewer matches to the genomic sequence were used.
`Additional information used to place clones included BAC paired-end sequence
`matches, shared STS matches, and "believed" sequence overlap relationships
`determined by investigators at the NCBI and at UC-Santa Cruz. In instances in which
`the data led to conflicting placements, the data were weighted based on estimates of
`reliability. In some cases, if there was conflicting placement data or only weak data
`for placement and, according to GigAssembler, the sequenced clone failed to
`overlap any clones in the assembly at their original placement positions, a placement
`was attempted at secondary sites suggested by the placement data.
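
The two-tier significance rule can be written as a small predicate. This is only a sketch of the thresholds stated above; the function name and the boolean flag are hypothetical, and the weighting of conflicting evidence is not modelled.

```python
def accept_placement(match_score, has_confirming_evidence):
    """Apply the FPC match-score cutoff; smaller scores are better matches."""
    threshold = 1e-7 if has_confirming_evidence else 1e-9
    return match_score < threshold


print(accept_placement(1e-8, has_confirming_evidence=True))
print(accept_placement(1e-8, has_confirming_evidence=False))
```

A score of 1e-8 is accepted only when other evidence, such as a BAC end match, supports the placement.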
`
`
Page 869, left-hand column, line 48: “Of these 942 contigs with sequenced clones…”
`
`
`In general, merges between fingerprint clone contigs were based primarily on
`evaluation of the fingerprint data. Information about the STS map location of the
`fingerprint contigs was used to prevent spurious merges, to break spurious contigs
`and to suggest possible merges that had not been previously recognized. In
`addition, 62 contigs were merged on the basis of sequence overlap information,
`supported by STS map positions.
`
`
`
`
`
` Subsection: Quality assessment (p. 871)
` Sub-subsection: Alignment of the fingerprint clone contigs (p. 873)
`
Page 873, right-hand column, line 28: “The positions of most of the STSs… about 1.7%
differed from one or more of them.”
`
`
We localized the STS markers from seven different physical maps (the Genethon (ref. 101)
and Marshfield (http://research.marshfieldclinic.org/genetics/) genetic maps,
GeneMap99 (ref. 100), the G3 and Stanford TNG radiation hybrid maps
(http://www-shgc.stanford.edu/Mapping/Marker/STSindex.html), and the Whitehead YAC and radiation
hybrid map (ref. 29)) on the draft genome sequence using e-PCR, allowing one mismatch
per primer and the default distance constraints between primers (50 bp deviation
from the expected size of the product). Only those markers that were uniquely placed on the
draft sequence were considered. There were 62,239 such markers. Of these, 1,095,
or 1.7%, were mapped by e-PCR to a chromosome of the draft sequence that was
`different from the chromosome indicated by the information from a genetic or
`radiation hybrid map.
`
`

`

 Subsection: Representation of random raw sequences (p. 874)
`
`Page 875, left-hand column, line 9: “We compared the raw sequences … using the BLAST
`computer program.”
`
`
`We processed whole genome shotgun reads from four independently constructed
`libraries as follows. All reads with fewer than 300 bases of PHRED quality 20 or
`greater were removed. The remaining reads were then trimmed for vector and for
`quality, looking at the 5’ end for the first window with at least 15 continuous non-
`vector bases of >PHRED20 and at the 3’ end, starting from the left cutoff, for 12
`contiguous non-vector bases with <PHRED20 scores. Only trimmed reads that had
`>95% of their trimmed bases with PHRED>20 and a length of >250 bases were kept.
`The reads after trimming were composed of 40% GC base pairs. Reads were
`masked for repeats using the RepeatMasker program (A.F.A. Smit & P. Green,
`http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl) and for low entropy data using the
nseg option of BLAST (W. Gish, unpublished; http://blast.wustl.edu). Reads were
`retained and used only if there were at least 100 consecutive bases of PHRED
`quality 20 or greater and 100 consecutive unmasked bases.
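
The final retention rule can be sketched as follows. The sketch treats the two run-length requirements independently (a run of high-quality bases and a run of unmasked bases, not necessarily the same bases), which is one reading of the sentence above and is an assumption; the function names and the toy read are also hypothetical.

```python
def longest_run(flags):
    """Length of the longest stretch of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def keep_read(qualities, masked, min_run=100, min_quality=20):
    """Retain a read with long enough high-quality and unmasked stretches."""
    good_quality = [q >= min_quality for q in qualities]
    unmasked = [not m for m in masked]
    return longest_run(good_quality) >= min_run and longest_run(unmasked) >= min_run


# Toy read: 120 good bases followed by 50 poor ones; the last 70 bases are masked.
qualities = [30] * 120 + [10] * 50
masked = [False] * 100 + [True] * 70
print(keep_read(qualities, masked))
```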
`
`
`
`
`
`
`
`Page 875, left-hand column, line 30: “We found that 88% of the bases of these cDNAs
`could be aligned ...”
`
`
`Based on a test data set of random reads from finished projects, the following
BLAST parameters were found to match 100% of the reads without false matches:
-filter seg S=170 S2=150 W=13 gapW=4 gapS2=150 M=5 N=-11 Q=11 R=11. The
set of masked trimmed reads was compared to the 7 October 2000 freeze of the
`HTGS data set, to all of Genbank and to the TSC SNP database using BLASTN
`2.0MP (W. Gish, unpublished; http://blast.wustl.edu). The highest scoring match was
`aligned against the read using CROSSMATCH, demanding alignment of the full
trimmed read at ≥97% identity for genomic sequence and with appropriate
`topological constraints for the SNP reads. Typically 1-2% of the matches were
`eliminated by this step.
`
`We aligned the RefSeq cDNA sequences to the draft genome using the psLayout
`program104 and gathered statistics on the percentage of cDNA bases that aligned at
`various percent identity thresholds.
`
`The distal 200 bases of each cDNA were not included in the computation of the
`percentage of aligning bases because alignments in these regions are less reliable.
`If any cDNA aligned in more than one way, each cDNA base involved in any
`alignment was counted only once. At a threshold of 98% identity for the alignments,
`we found that 87.9% of the cDNA bases aligned somewhere in the draft genome.
`When the threshold was increased to 99% identity, the percentage of aligning bases
`fell to 85.83%, and when the threshold was decreased to 97% identity, it rose to
`88.5%. Further decreases in the threshold all the way down to 90% identity only
increased the percentage of aligning bases by one more percentage point, so the value
`of approximately 88% aligning bases, achieved by requiring 98% identity, represents
`a knee in the curve.
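
The rule that each cDNA base is counted only once, even when the cDNA aligns in more than one way, amounts to taking the union of the aligned intervals. The Python sketch below illustrates this. It assumes that the "distal 200 bases" are the last 200 bases of the cDNA and that alignments are given as half-open (start, end) intervals in cDNA coordinates; neither detail is specified above, and the function name and example intervals are hypothetical.

```python
def aligned_fraction(cdna_length, alignments, trim=200):
    """Fraction of non-distal cDNA bases covered by at least one alignment."""
    usable = cdna_length - trim  # assumption: the distal bases are at the 3' end
    if usable <= 0:
        raise ValueError("cDNA is not longer than the trimmed region")
    covered = set()
    for start, end in alignments:
        # Overlapping alignments add each position to the set only once.
        covered.update(range(max(start, 0), min(end, usable)))
    return len(covered) / usable


fraction = aligned_fraction(1200, [(0, 600), (500, 900), (950, 1200)])
print(fraction)
```

In this invented case, 950 of the 1,000 counted bases are covered, and the overlap between the first two alignments is not double-counted.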
`
`
`
`Section: Broad genomic landscape (p. 875)
`
Page 876, right-hand column, line 9: “In addition, the human cytogenetic map ...”
`
`
`The locations of the cytogenetically mapped clones on the draft genome sequence
`can be viewed at http://genome.ucsc.edu/goldenPath/mapPlots . Further information about the
`individual clones can be obtained at http://www.ncbi.nlm.nih.gov/genome/cyto/ and
`http://www.ncbi.nlm.nih.gov/genome/guide. Here, as well as on the browser at
`http://genome.ucsc.edu and http://www.ensembl.org/ , they can be viewed in the context of other
`genome annotation.
`
`
` Subsection: Long-range variation in GC content (p. 876)
`
`Page 877, left-hand column, line 30 “About three-quarters of the genome-wide variance…
`consistent with a homogeneous distribution”
`
`
`All 3,312 windows of length 300 kb that had at least eight gap-free 20 kb
`subwindows and did not contain more than 50% simple repeats were extracted from
`the draft genome sequence. The average sample variance of the GC content of the
`subwindows of a window was 7.3%. The sample variance of all subwindows
`genome-wide (N = 36,562) was 27.4%. Hence, the variance of GC content within
`the 20 kb subwindows of a 300 kb window accounts for approximately one quarter of
`the overall variance of the GC content among all 20 kb subwindows in this sample.
`The average sample standard deviation of the GC content of the subwindows of a
`window was 2.4%.
`
`
`Page 877, left-hand column, line 34: “In fact, the hypothesis … draft genome sequence.”
`
`
`For each of the 3,312 windows of length 300 kb, we tested the hypothesis that its 20
`kb subwindows were sampled from a homogeneous GC distribution. The distribution
`was defined to have mean m equal to the GC-content in the combined subwindows
`of the 300 kb window, and the bases were taken as independent. Under this
distribution, the GC-content of a 20 kb subwindow would have mean m and variance
σ² = m(100 − m)/20000. For m = 41%, the typical value, this gives σ² = 0.121%, which
is about 0.017 times the average sample variance of 7.3%. For each window, the
variance σ² and the sample variance s² were determined, along with the value χ² =
(n − 1)s²/σ², where n is the number of subwindows of the window. Under the
hypothesis of homogeneity, the statistic χ² should have an approximately chi-square
distribution with n − 1 degrees of freedom. However, for every one of the 3,312
windows, χ² > 31.5, which rejects the hypothesis of homogeneity at a confidence level
well above 0.995 (that is, p << 0.005).
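
The test statistic for a single window can be computed as in the sketch below. It assumes equal-length subwindows, so that the GC content of the combined subwindows equals the mean of the subwindow values; the function name and the eight GC percentages are invented for illustration.

```python
from statistics import variance


def homogeneity_statistic(subwindow_gc, subwindow_length=20_000):
    """(n - 1) * sample variance / binomial variance, with GC in percent."""
    n = len(subwindow_gc)
    m = sum(subwindow_gc) / n                     # window GC content, in percent
    expected_var = m * (100 - m) / subwindow_length
    sample_var = variance(subwindow_gc)           # unbiased (n - 1) estimator
    return (n - 1) * sample_var / expected_var


# Hypothetical GC percentages for eight 20 kb subwindows of one 300 kb window.
gc_values = [38, 39, 40, 41, 42, 43, 44, 45]
statistic = homogeneity_statistic(gc_values)
print(statistic, statistic > 31.5)
```

Even this modest spread of a few percentage points yields a statistic far above 31.5, because the variance expected from independent bases in a 20 kb subwindow is so small.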
`
`

`

`
`Another way to test the hypothesis of homogeneity is to look in each 300 kb window
`for one 20 kb subwindow whose GC content differs significantly from the mean m for
`that window. In these tests, all 300 kb windows with less than 50% simple repeats
`and less than 25% gaps were tested (N = 10,596). Under the assumptions above, if
`X is the GC content of a subwindow, then D = (X-m)/sqrt[m(100-m)/20000] should
have an approximately normal distribution. However, in all but four windows there is
a subwindow with |D| > 3.0, i.e. the GC content of the subwindow is more than 3.0
standard deviations from the mean of the window. The p-value for such a deviation is
0.0026. Considering that there are 15 possible subwindows, this gives an overall
p-value of 0.039, i.e. the hypothesis of homogeneity is rejected with a confidence greater
than 0.96.
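
The single-subwindow test can be reproduced with the standard library alone. In the sketch below the two-sided normal tail probability is obtained from the complementary error function, and the correction for 15 subwindows is a simple Bonferroni-style multiplication, which is what the text appears to use; the function names and the example GC value are assumptions.

```python
import math


def deviation(x, m, subwindow_length=20_000):
    """Standardised deviation D of subwindow GC content x from window mean m (percent)."""
    return (x - m) / math.sqrt(m * (100 - m) / subwindow_length)


def two_sided_p(d):
    """Two-sided tail probability of a standard normal at |d|."""
    return math.erfc(abs(d) / math.sqrt(2))


def corrected_p(p, tests):
    """Multiply by the number of tests, capped at 1."""
    return min(1.0, p * tests)


p_single = two_sided_p(3.0)
p_window = corrected_p(p_single, 15)
print(p_single, p_window)
print(deviation(42.05, 41.0))
```

The computed values are close to the 0.0026 and 0.039 quoted above; small differences reflect rounding. With m = 41%, a subwindow GC content only about one percentage point away from the window mean already exceeds three standard deviations.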
`
The above analysis was repeated using 5 kb subwindows of 300 kb windows, and
the hypothesis of homogeneity was rejected for all windows with confidence greater than
0.96, and with greater confidence for those windows tested with the chi-square test.
Similar results were also obtained for 5 kb subwindows of 100 kb windows: homogeneity was
rejected with confidence greater than approximately 0.95 for all but thirteen windows, and for all
but three of those examined with the chi-square test. Since any
`region of 200 kb must contain one of the regions of 100 kb we tested for
`homogeneity, this indicates that there are few if any regions of 200 kb in the genome
`with homogeneous GC content.
`
`
`Page 877, right-hand column, line 25: “Estimated band locations …”
`
`
`Bands were assigned by a dynamic programming algorithm that attempted to
`maximize the number of cytogenetically mapped clones that lie within the range of
`possible sub-bands predicted from FISH, with special emphasis on high-resolution
`FISH-mapped clones provided by investigators at the National Cancer Institute103.
`The band positions were optimized subject to the constraint that the bands must
`appear in the known order along the draft genome sequence. Slight penalties for
`band size deviation from the standard fractional sizes were also imposed, so that in
`the absence of any FISH-mapped clones at all in a particular region, and given that
`there are no constraints from surrounding regions, the program would produce sub-
`bands corresponding to the standard fractional band lengths.
`
`
`Section: Repeat content of the human genome (p. 879)
`Subsection: Distribution of GC content (p. 884)
`
Concerning the subdivision of the draft genome sequence into 50 kb pieces of
similar GC level: the same results will be obtained however the sequence is
subdivided, as long as the fragments are around 50 kb long. Specifically,
for the analyses shown in Figures 22 to 26, the draft genome sequence was
subdivided into fragments of 40-60 kb (averaging 50 kb) overlapping by 1 kb. These
fragments were created on the fly by the RepeatMasker program, and for each a
`repeat analysis was done. The repeat information files were grouped by the GC level
`of the fragment, and processed according to need.
`
`
`
`For the analyses shown in Figures 23 and 25, the number of repeat copies was
`compared. The number of individual insertions per megabase of DNA of a particular
`GC level was extracted from the RepeatMasker output (RepeatMasker provides
`information on which fragments originated from the same inserted transposable
`element). The Y axis is the ratio of the frequency of Alu (fig 23) or LINE1 (fig 25) over
`the average frequency of these elements in the genome.
`
`Subsection: Segmental Duplications (p. 889)
`
`
Our assessment of low copy repeats (genomic duplications) within the draft genome
sequence involved a global analysis of all non-overlapping sequence. The analysis
used a combination of DNA sequence analysis software and a suite of Perl scripts
developed for paralogy detection (J. A. Bailey and E. E. Eichler, in preparation).
The basic methodology included: repeat masking (RepeatMasker v.4/20) of all
reference sequences for common repeats; the removal and splicing of such repeat
segments; global BLAST analysis of the segments for the identification of
non-overlapping high-scoring segments, using relaxed affine gapping parameters that
allowed large gaps of up to 1 kb to be traversed (parameters: -G 180 -E 1 -q -80 -r 30
-z 3000000000 -Y 3000000000 -e 1e-10 -F F); and the reintroduction of common
repeat elements into each pairwise alignment, followed by optimal global alignment
of the segments using the program ALIGN (E.W. Myers and W. Miller, CABIOS
(1989) 4:11-17). To detect internal duplications within each query segment, a
modified version of BLASTZ (W. Miller, unpublished) was used with similar relaxed
gap parameters (B=2 M=30 I=-80 V=-80 O=180 E=1 W=14 Y=1400). Alignment
statistics were generated (program: ALIGN_SCORER), and alignments that equaled or
exceeded the threshold of 1000 bases aligned with over 90% similarity (i.e. gaps
excluded) were analyzed. Generation of global alignments also acted as a
safeguard against false positives from BLAST analysis. In cases of extremely large
gaps (>1 kb), alignments were fractured. Such cases were detected and merged for
gaps up to 20 kb.
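
The alignment threshold (at least 1000 aligned bases with over 90% similarity, gaps excluded) can be expressed as a simple filter on a gapped pairwise alignment. The sketch below is not ALIGN_SCORER; it assumes the alignment is given as two equal-length strings with "-" marking gaps, and the function name and toy sequences are hypothetical.

```python
def passes_threshold(row_a, row_b, min_aligned=1000, min_similarity=0.90):
    """Check aligned length and identity over the non-gap columns only."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    matches = sum(1 for a, b in pairs if a == b)
    # The length test comes first, so an empty alignment never divides by zero.
    return len(pairs) >= min_aligned and matches / len(pairs) > min_similarity


# Toy alignment: 1200 columns, with mismatches introduced in the first 100.
row_a = "ACGT" * 300
row_b = "A" * 100 + row_a[100:]
print(passes_threshold(row_a, row_b))
print(passes_threshold(row_a[:900], row_b[:900]))
```

The full toy alignment passes (1,125 of 1,200 columns match), whereas the 900-column prefix fails on length alone.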
`
`Subsection: Pericentromeres and telomeres (p. 890)
`
`
Chromosome 22 (May 2000, Sanger Centre) and Chromosome 21 (Sept., NCBI)
were analyzed for large duplications as described. For interchromosomal
duplications, each chromosome was analyzed versus the NT accession contigs
(NCBI) and versus all remaining HTGS accessions (draft and finished).
A final global alignment threshold (>90%; ≥1000 bases) was used.
Sequences containing highly similar alignments (>99.5% NT; >99.0% HTGS)
were excluded as probable unassembled allelic overlaps.
The duplicated sequences for chromosome 21 and chromosome 22
were viewed graphically using the program PARASIGHT (J. A. Bailey and E.E. Eichler,
in preparation).
`
`
`

`

`Subsection: Genome-wide analysis of segmental duplications. (p. 891)
`
`
Finished sequence included all assembled sequence from NCBI within the NT
dataset (version of 5 September 2000). A global alignment threshold (>90%; ≥1000
bases) was used for comparisons between finished sequences. Further selection
limited the alignments analyzed to those with less than 99.5% identity, as those with
greater identity were likely to represent unassembled allelic overlaps.
`
`The 15 July 2000 version of the draft genome sequence was used as the basis for
the duplication analysis of the entire human draft. A final global alignment threshold
(>90% and <98%; ≥1000 bases) defined the limits of detection for duplicated
sequence. Sequence alignments with >98% identity appear to represent mainly missed allelic
overlaps, many of which were subsequently merged in later releases of the assembly
(e.g. 7 October 2000). Final validation of duplicated segments with >98% identity within the
working draft will require finished sequence data and/or experimental validation.
`
`Section: Gene content of the human genome (p. 892)
` Subsection: Noncoding RNAs (p. 892)
`
`
`
`
`To identify transfer RNA genes, we used tRNAscan-SE version 1.21 [T.M. Lowe,
`S.R. Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes
in genomic sequence. Nucleic Acids Res. 25, 955-964 (1997)] to analyze the 7
October 2000 version of the draft genome sequence. tRNAscan-SE predicted 504
`tRNA genes and 144 tRNA-derived pseudogenes. Three of the predicted genes had
`a non-canonical anticodon loop length, preventing tRNAscan-SE from
`unambiguously identifying the anticodon; although there are many possible
`explanations for them, for our current purposes we classified these as probable
`pseudogenes. After manual examination of the tRNAs with unlikely anticodons, four
`more of the predicted genes were also classified as probable pseudogenes: a
`putative UAA suppressor, a putative UAG suppressor, and two putative UGA-reading
`selenocysteine tRNAs. The remaining gene predictions were not examined
`manually. We know that a small number of the 497 "true" tRNA genes are likely to
`be pseudogenes or parts of tRNA-derived repetitive sequence elements because
`tRNAscan-SE's ability to separate pseudogenes from true genes is not perfect.
`Because tRNAscan-SE models tRNA consensus secondary structure, it is not a
`reliable detector of divergent tRNA pseudogenes. To more accurately estimate the
`number of tRNA-derived pseudogenes, all 648 sequences detected by tRNAscan-SE
`were used as WU-BLASTN queries (see below), and another 173 significantly
`related sequences were detected, bringing the estimated pseudogene count to 324.
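
The gene and pseudogene totals in this paragraph follow from simple bookkeeping, reproduced below as a check. All counts are taken from the text above; only the variable names are invented.

```python
predicted_genes = 504
predicted_pseudogenes = 144
noncanonical_loop = 3     # reclassified as probable pseudogenes
unlikely_anticodon = 4    # reclassified after manual examination
blast_only_hits = 173     # additional related sequences found with WU-BLASTN

total_detected = predicted_genes + predicted_pseudogenes
true_genes = predicted_genes - noncanonical_loop - unlikely_anticodon
pseudogenes = (predicted_pseudogenes + noncanonical_loop
               + unlikely_anticodon + blast_only_hits)
print(total_detected, true_genes, pseudogenes)  # prints 648 497 324
```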
`
To identify all ncRNA homologues other than tRNA genes, we performed sequence
similarity searches using WashU BLASTN 2.0MP (W. Gish, unpublished;
http://blast.wustl.edu) on the 7 October 2000 genome assembly, with parameters "-kap
wordmask=seg B=50000 W=8" and the default DNA scoring matrix. True genes
were operationally defined as BLAST hits with ≥95% identity over ≥95% of the length of
the query. Related sequences (e.g. pseudogenes) were operationally defined as all
`other BLAST hits with P-values <= 0.001. To reconcile our tRNA gene count of 497
`with the larger number of 1310 generally found in textbook references, we
reexamined the primary data in a classic paper by Hatlen and Attardi (ref. 252). The
textbook estimate of 1310 human tRNA genes was based on their observation that
purified and labelled human 4S RNA (i.e. the tRNA population) hybridizes to HeLa
genomic DNA and saturates at a fraction of about 1.1 × 10^-5 of the genome. The
molecular weight of the human genome was thought at that time to be 3.1 × 10^12
(about 4.7 billion bases). Recalculation using the current estimated genome size of
`3.2 billion bases [T.R. Tiersch, R.W. Chandler, S.S. Wachtel, S. Elias. Reference
`standards for flow cytometry and application in comparative studies of nuclear DNA
`content. Cytometry 10, 706-710 (1989); this paper] gives an estimate of 890 tRNA-
`complementary loci instead of 1310. Hatlen and Attardi also noted, but at the time
`could not explain, a puzzling length heterogeneity in their hybridized genomic loci.
`We believe that they were observing the tRNA pseudogene population, many of
`which are truncated copies of tRNA genes; therefore we believe their hybridization-
`based estimate of ~890 loci included tRNA pseudogenes (of which we count 324 in
`the genome) in addition to the true tRNA genes (of which we count 497 in the
`genome).
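
The recalculated estimate can be checked by rescaling the textbook figure in proportion to genome size. This sketch assumes that the saturation fraction is unchanged and that the number of loci therefore scales linearly with the assumed genome size; the variable names are hypothetical.

```python
textbook_estimate = 1310      # tRNA genes, based on a 4.7 billion base genome
old_genome_size = 4.7e9       # bases, as assumed at the time
current_genome_size = 3.2e9   # bases, current estimate

rescaled = textbook_estimate * current_genome_size / old_genome_size
counted = 497 + 324           # true genes plus pseudogenes counted in the draft
print(round(rescaled), counted)
```

The rescaled value is close to the ~890 loci quoted above, and the combined count of genes and pseudogenes from the draft sequence is of the same order.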
`
`
` Subsection: Protein-coding genes (p. 896)
` Sub-subsection: Exploring properties of known genes (p. 896)
`
`
`Known genes were aligned with Spidey (S. Wheelan et al., manuscript in
`preparation) and Acembly (D. Thierry-Mieg and J. Thierry-Mieg, unpublished;
`http://www.acedb.org/ ), which in both cases align the cDNA to the genome while
`allowing for introns. The results from the two programs were in broad agreement.
5,364 RefSeq entries (from the 1 September 2000 release) were used as a source of
`the cDNAs. The alignments of the cDNAs to the genome could be classified by the
`proportion of the cDNA that aligned to the genome and by the percentage of identical
`nucleotides between the cDNA and the genomic sequence. In most cases, there was
`an unambiguous location for a cDNA. However, some proportion at each level of
`coverage had more than one site with high identity matches; in these cases, one of
`the locations was arbitrarily chosen.
`
`
` Sub-subsection: Towards a complete index of human genes (p. 898)
` Creating an initial gene index (p. 899)
`
`Ensembl: Ensembl aims to predict coding sequences of true genes with high
`confidence, by only predicting coding sequence regions which have confirming
`evidence across their entire length. The sources of confirmation are cDNA, EST and
`protein-based similarity. The Genscan computer program was run across the
`individual fragments of the genome and the resulting peptides were used to search
`vertebrate mRNA sources (extracted from the EMBL databank;
`http://www.ebi.ac.uk/index.html), EST (vertebrate dbEST; ftp://ncbi.nlm.nih.gov/genbank ) and a
`non-redundant protein database (SWIR; http://www.ebi.ac.uk/swissprot/ ). Protein hits of
`greater than 200 bits similarity were then further processed by using the GeneWise
`program with the similar protein against the assembled draft genome sequence (the
17 July 2000 version). A final gene-building method was then used to merge all the
resulting information, namely the Genscan predictions with confirming similarity at a
number of exons and the GeneWise gene predictions. The method only accepted a
join between two exons if consistent similarity evidence was found on each exon with
the following thresholds: (a) all GeneWise predictions were accepted, although
redundant GeneWise predictions were discarded; and (b) exons predicted by
Genscan required a single protein or cDNA similarity of at least 100 bits, or at least
two EST hits of at least 100 bits. This final process allows for alternative splicing,
`although modeling alternative splicing has not been optimised. Ensembl produced
`35,500 gene predictions with 44,860 transcripts.
`
`Merge procedure to produce a final protein set: To generate a single protein set for
`further analysis we merged the known protein sequences from RefSeq (version of
29 Sept 2000), SWISSPROT (Release 39.6 of 30 Aug 2000), TREMBL (TrEMBL
`Release 14.17 of 1 Oct 2000) and TREMBL_NEW (1 Oct 2000) with the gene
`predictions. The later protein analysis required a non-redundant protein set where
`genes were represented as a single protein sequence; in the case of alternative
`splicing, a single, representative protein sequence was required. We are aware of
`the obvious limitations of this representation of the human proteome, but
`accommodating alternative splicing in the downstream analysis was very complex.
`
`The genome prediction data set was prepared as follows: the Ensembl and Genie
`predictions were merged by examining overlap of coding exons in genomic
`coordinates. Two gene predictions were merged if a single coding exon on the same
`strand overlapped. From this set of merged predictions, we used only the
`Ensembl+Genie and the Ensembl-only predictions. In cases where there was more
`than one prediction, or for Ensembl genes, more than one transcript, we chose the
`longest protein sequence from each merged unit to represent the gene. The protein
`level merge then occurred by comparing the union of all the data sources in an all-
`vs-all FASTA comparison using default parameters. Two protein sequences were
`merged if the match covered at least 95% of the shorter sequence, and identity was
≥ 95%, which takes into account both nearly identical protein sequences and also
`nearly identical fragments.
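
The protein-level merge criterion can be stated as a small predicate. The sketch below only encodes the two thresholds given above (match covering at least 95% of the shorter sequence, identity of at least 95%); the function name and argument layout are hypothetical, and the parsing of FASTA output is not shown.

```python
def should_merge(len_a, len_b, match_length, identity,
                 min_cover=0.95, min_identity=0.95):
    """Merge two proteins if the match covers most of the shorter one at high identity."""
    shorter = min(len_a, len_b)
    return match_length / shorter >= min_cover and identity >= min_identity


# A 300-residue fragment matching a 1000-residue protein over 290 residues.
print(should_merge(300, 1000, match_length=290, identity=0.97))
# The same pair with a match covering only 200 residues of the fragment.
print(should_merge(300, 1000, match_length=200, identity=0.99))
```

Measuring coverage against the shorter sequence is what allows a nearly identical fragment to be merged with its full-length counterpart.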
`
`Special attention was needed to prevent overrepresentation of alternative splice
`forms. Firstly we expanded the Swissprot and Trembl databases to represent known
`splice variants in the protein merge, but only took a single protein (the canonical
`database sequence) for the final protein set. An additional cull for alternative splice
`forms which remained as separate proteins was produced by taking the
`corresponding DNA sequences of the known proteins (RefSeq, SWISSPROT,
`TREMBL and TREMBL_NEW) and matching back to the genome using the SSAHA
`program without requiring a valid gene structure alignment. If the DNA derived from
`two protein sequences matched at over 28 base pairs at the same location, the
longest protein sequence was used. Finally, clear bacterial contaminants (proteins
that had an almost identical match to a bacterial protein) were removed.
`
`

`

`
`
`
`
`Quality Control on the protein set: We took 31 genes which we could confirm as
`being unavailable at the time of the gene builds (22 from RefSeq, 9 from the Sanger
`Centre gene identification program on chromosome X). 3 of the 31 sequences could
`not be found in the genome assembly. Using the wublastp program
`(http://blast.wustl.edu) with default parameters, we matched the 31 sequences to the IPI.1
`set and visually inspected the alignments. 19 sequences showed a clear match to an
`IPI protein; 14 hit a single IPI protein, 3 hit 2 IPI proteins, 1 hit 3 IPI proteins and 1 hit
`4 IPI proteins.
`
RIKEN mouse cDNAs. We took a random sample (1,000) of known genes,
Ensembl-Genie genes and Ensembl-only genes and matched them to the RIKEN cDNA set of
`15,294 cDNAs using the TBLASTN program (http://www.ncbi.nlm.nih.gov/BLAST/ ) with
`default parameters, at the 1e-6 E-value significance level.
`
`The IPI and IGI can be found at http://www.ensembl.org/IPI/.
`
`
`
`
Additional information for Table 23 (p. 902). All of the tables of InterPro are
accessible through http://www.sanger.ac.uk/Users/agb/Ensembl.
`
`Section: Segmental history of the human genome (p. 908)
` Subsection: Conserved segments between human and mouse (p. 908)
`
`Putatively orthologous sequences were determined in two ways. Curated
`orthologues determined at the Jackson Laboratory (www.informatics.jax.org) were
`obtained by FTP. In addition, orthologues were calculated at the NCBI using the
`program megaBLAST [Z. Zhang et al., J. Comput. Biol. 7, 203-214 (2000)]. In order
`to calculate orthologues, non-EST mRNA sequences found in LocusLink
`(http://www.ncbi.nlm.nih.gov:80/LocusLink/) were obtained for both human and mouse. The
`megaBLAST analysis was performed first using the mouse sequence as the query
`and the human sequence as the database. A second analysis was performed in
`which the human sequence was the query and the mouse sequence was the
`database. R
