`Human Genome.
`International Human Genome Sequencing Consortium.
`
`Methods and additional notes
`
`Section: Generating the draft genome sequence (p. 864)
` Subsection: Clone selection (p. 865)
`
`Page 866 col. 2, para. 3: “Fingerprint data were reviewed … bias against rearranged
`clones).”
`
`
`
`
`Seed clones were picked from the growing contigs as follows: We began by
`identifying fingerprint clone contigs that had been localized to targeted locations and
`that did not contain any clones that had previously been selected for sequencing.
`Contigs were localized using mapping data from a variety of sources that could be
`attached to the fingerprinted clones, including STS/hybridization data from
`McPherson and colleagues (ref. 86), FISH data from several sources (C. McPherson et al.,
`ref. 103), STS/PCR mapping data from several sources (refs 92, 95, 103), electronic PCR data
`(http://www.ncbi.nlm.nih.gov/STS/) matching the BAC end sequences with mapped STSs
`and others. Beginning with the largest available clone in a valid contig (clones >250
`kb were excluded to avoid artifacts), the FPC program (ref. 451) evaluated the fingerprints
`of all of the clones in the contig to determine the largest clone for which all but two of the
`individual bands in the restriction fragment pattern were confirmed (matched by a
`band of equivalent size, ±3%) by bands in the patterns of
`flanking clones (again, ignoring flanking clones >250 kb). (Since the
`restriction enzyme used to produce the clone inserts is different from the enzyme
`used to produce the fingerprints, two bands may arise from the insert-vector junction,
`which are not found in the genome or in flanking clones.) Selected clones were then
`checked for excessive overlap with previously selected or sequenced clones and
`with each other. The allowable overlap at this stage was varied to suit the demands
`of the project.
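
The band-confirmation test described above can be sketched as follows. This is a minimal illustration, not FPC itself: it assumes a fingerprint is a list of band sizes, that the ±3% tolerance is taken relative to the band being tested, and that bands from all flanking clones are pooled. All function names are hypothetical.

```python
def band_confirmed(size, neighbor_bands, tol=0.03):
    """True if any neighbor band lies within +/- tol (fractional) of size."""
    return any(abs(size - b) <= tol * size for b in neighbor_bands)


def unconfirmed_bands(clone_bands, flanking_clones, tol=0.03):
    """Bands of the clone that no flanking clone confirms."""
    # Assumption: bands from all flanking clones are pooled together.
    pool = [b for bands in flanking_clones for b in bands]
    return [s for s in clone_bands if not band_confirmed(s, pool, tol)]


def is_valid_seed(clone_bands, flanking_clones, max_unconfirmed=2, tol=0.03):
    """A clone qualifies if all but (at most) two bands are confirmed."""
    missing = unconfirmed_bands(clone_bands, flanking_clones, tol)
    return len(missing) <= max_unconfirmed


# Hypothetical fingerprints (band sizes in bp).
clone = [1000, 2000, 3000, 4000, 5000]
flanks = [[1010, 2050], [2990, 7000]]
print(unconfirmed_bands(clone, flanks))  # the two bands with no partner
print(is_valid_seed(clone, flanks))
```

The allowance of two unconfirmed bands corresponds to the two insert-vector junction fragments mentioned in the text.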
`
`Clones (walking clones) extending from seed or other selected clones were selected
`as follows: In the early phases of the effort, clones were not necessarily correctly
`ordered within a fingerprint clone contig and indeed not all of the available clones
`had necessarily been incorporated into the contig. Starting with a previously
`selected (seed) clone, the FPC program compared the restriction fragment pattern of
`that clone with the patterns of all of the clones in the fingerprint database that
`overlapped with the seed clone. It then iteratively analyzed the clones identified in
`the first round of analysis to identify the additional clones that overlapped with those.
`In this way, a set of overlapping clones was identified and the clones in the set were
`ordered based on their overlap statistics. After ordering, all of the valid clones were
`identified (valid clones were defined as those with all but three of their bands
`confirmed by clones within 4 clones on either side). Any clone that also had outside
`evidence of overlap, e.g. through BAC end sequence matches or shared
`SEQUENOM EXHIBIT 1105
`Sequenom v. Stanford
`IPR2013-00390
`
`
`
`STS/hybridization data was selected for further evaluation. In cases with more than
`one clone with such outside evidence, the clone with the lowest overlap statistic (i.e.,
`the one that was least redundant) was selected (in the case of ties, the largest clone
`was favored). Where there was no outside evidence, a clone was picked based on
`evaluation of the overlaps. The candidate clone was the first one that was found to
`have the minimal overlap with the seed clone (initially <20% overlap, rising to 30% in
`later phases of the mapping effort; the percentage overlap was estimated by dividing
`the sum of the sizes of the common bands by the size of the smaller of the two
`clones). To be picked, the clone also had to be bridged to the seed clone by a third,
`intermediate clone that confidently (<1e-4) overlapped both the seed clone and the
`candidate clone. The candidate clone was then further evaluated for fingerprint
`overlap with previously selected or sequenced clones.
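
The overlap estimate described above (sum of the sizes of the common bands divided by the size of the smaller clone) can be sketched as below. This is an illustration under stated assumptions: clone size is approximated by the sum of its band sizes, bands are paired greedily within a ±3% tolerance, and the names are hypothetical rather than taken from FPC.

```python
def common_band_sizes(bands_a, bands_b, tol=0.03):
    """Sizes of bands in a that pair with a not-yet-used band in b."""
    remaining = sorted(bands_b)
    common = []
    for size in sorted(bands_a):
        match = next((b for b in remaining if abs(size - b) <= tol * size), None)
        if match is not None:
            remaining.remove(match)  # each band of b is used at most once
            common.append(size)
    return common


def percent_overlap(bands_a, bands_b, tol=0.03):
    """Common-band size as a percentage of the smaller clone."""
    # Assumption: clone size is approximated by the sum of its band sizes.
    smaller = min(sum(bands_a), sum(bands_b))
    return 100.0 * sum(common_band_sizes(bands_a, bands_b, tol)) / smaller


def acceptable_walk(seed_bands, candidate_bands, max_overlap=20.0):
    """Candidate passes if overlap with the seed is below the cutoff."""
    return percent_overlap(seed_bands, candidate_bands) < max_overlap


seed = [10000, 20000, 30000, 40000]
candidate = [10100, 50000, 60000]
print(percent_overlap(seed, candidate))
print(acceptable_walk(seed, candidate))
```

Raising `max_overlap` to 30.0 would correspond to the relaxed cutoff used in later phases of the mapping effort.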
`
`Once clones were ordered within fingerprint clone contigs, a similar algorithm that
`exploited the known clone order was used to pick the walking clones. This algorithm
`was also adapted to pick a spanning/walking clone for complex contigs with 2 or
`more clones in the sequencing pipeline, using the fingerprint map as a guide.
`
`
`
`
` Subsection: Sequencing (p. 867)
`
`Page 868, left-hand column, line 20: “By examining … 500 bp.”
`
`
`The sizes of the gaps between adjacent initial sequence contigs in draft clones were
`measured using alignments of the initial sequence contigs from individual draft
`clones to contigs of size ≥ 40 kb from overlapping clones, usually finished clones.
`10,999 gaps were examined. 1,726 gaps larger than 6,000 bp were discarded as
`probable artefacts due to misassemblies or incorrect alignments. The mean size of
`the gaps between the initial sequence contigs in draft clones was 554 bases. When
`the cutoff for discarding gaps was lowered to 3,000 bp or raised to 12,000 bp, the
`mean gap size decreased to about 400 bp (estimated from 9,801 gaps) or
`increased to about 800 bp (estimated from 11,972 gaps), respectively, indicating that
`there is still considerable uncertainty in the mean value. The 554 bp estimate for the
`mean gap size was used, along with the number of initial sequence contigs (Table 7)
`and the total number of bases in the initial sequence contigs (data not shown) to
`estimate the percentage of the draft clones that were covered by the initial sequence
`contigs. It was thus determined that, on average, about 96% of the draft clones was
`covered; assuming a mean gap size between 400 and 800 bp, the range in coverage
`is about 94-97%.
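
The coverage calculation can be sketched as follows, assuming coverage is the number of bases in the initial sequence contigs divided by those bases plus the estimated gap bases. The base and gap counts below are hypothetical placeholders; the actual totals are not given in these notes.

```python
def clone_coverage(contig_bases, n_gaps, mean_gap_bp):
    """Fraction of the clones covered by initial sequence contigs."""
    gap_bases = n_gaps * mean_gap_bp
    return contig_bases / (contig_bases + gap_bases)


# Hypothetical inputs: 1,000,000 sequenced bases separated by 75 gaps.
for gap in (400, 554, 800):
    print(gap, round(clone_coverage(1_000_000, 75, gap), 3))
```

With these placeholder inputs, the three mean gap sizes considered in the text give coverage values that bracket the central estimate in the same way as the reported 94-97% range.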
`
`This comment also pertains to page 874, left-hand column, line 57: “Assuming that the
`sequence gaps … gaps within the draft sequenced clones”
`
` Subsection: Assembly of the draft genome (p. 868)
`
`Page 868, right-hand column, l. 47, "To eliminate such problems, sequenced clones were
`associated with the fingerprint clone contigs in the physical map…"
`
`
`
`An FPC match statistic better than 1e-7 for the sequenced clone against the FPC
`fingerprint database was considered significant, based on empirical evidence. This
`match level was the weakest value used for placement when there was other
`confirmatory evidence to support the placement. In the absence of additional
`supportive data, a match score of better than 1e-9 was required for placement. In
`general, only the best match was used. Other confirmatory evidence included BAC
`end matches; the BAC end sequences were obtained from NCBI (dbGSS;
`http://www.ncbi.nlm.nih.gov/dbGSS/index.html). To exclude repetitive sequences, only BAC end
`sequences with 15 or fewer matches to the genomic sequence were used.
`Additional information used to place clones included BAC paired-end sequence
`matches, shared STS matches, and "believed" sequence overlap relationships
`determined by investigators at the NCBI and at UC-Santa Cruz. In instances in which
`the data led to conflicting placements, the data were weighted based on estimates of
`reliability. In some cases, if there was conflicting placement data or only weak data
`for placement and, according to GigAssembler, the sequenced clone failed to
`overlap any clones in the assembly at their original placement positions, a placement
`was attempted at secondary sites suggested by the placement data.
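
The two placement thresholds can be expressed as a simple rule. This sketch assumes that "better than" means a numerically smaller match statistic; the function name and boolean flag are hypothetical.

```python
def placement_supported(match_stat, has_confirmatory_evidence):
    """Apply the 1e-7 / 1e-9 FPC match thresholds described in the text."""
    # Weaker matches are tolerated only when other evidence agrees
    # (e.g. BAC end matches, shared STSs, believed sequence overlaps).
    threshold = 1e-7 if has_confirmatory_evidence else 1e-9
    return match_stat < threshold


print(placement_supported(1e-8, True))   # weak match plus other evidence
print(placement_supported(1e-8, False))  # same match with no other evidence
```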
`
`
`Page 869, left-hand column, line 48: “Of these 942 contigs with sequenced clones …”
`
`
`In general, merges between fingerprint clone contigs were based primarily on
`evaluation of the fingerprint data. Information about the STS map location of the
`fingerprint contigs was used to prevent spurious merges, to break spurious contigs
`and to suggest possible merges that had not been previously recognized. In
`addition, 62 contigs were merged on the basis of sequence overlap information,
`supported by STS map positions.
`
`
`
`
`
` Subsection: Quality assessment (p. 871)
` Sub-subsection: Alignment of the fingerprint clone contigs (p. 873)
`
`Page 873, right-hand column, line 28: “The positions of most of the STSs … about 1.7%
`differed from one or more of them.”
`
`
`We localized the STS markers from seven different physical maps (the Genethon (ref. 101)
`and Marshfield (http://research.marshfieldclinic.org/genetics/) genetic maps,
`GeneMap99 (ref. 100), the G3 and Stanford TNG radiation hybrid maps
`(http://www-shgc.stanford.edu/Mapping/Marker/STSindex.html), and the Whitehead YAC and radiation
`hybrid map (ref. 29)) on the draft genome sequence using e-PCR, allowing one mismatch
`per primer and the default distance constraints between primers (50 bp deviation
`from expected size of product). Only those markers that were uniquely placed on the
`draft sequence were considered. There were 62,239 such markers. Of these, 1,095,
`or 1.7%, were mapped by e-PCR to a chromosome of the draft sequence that was
`different from the chromosome indicated by the information from a genetic or
`radiation hybrid map.
`
`
`
` Subsection: Representation of random raw sequences (p. 874)
`
`Page 875, left-hand column, line 9: “We compared the raw sequences … using the BLAST
`computer program.”
`
`
`We processed whole genome shotgun reads from four independently constructed
`libraries as follows. All reads with fewer than 300 bases of PHRED quality 20 or
`greater were removed. The remaining reads were then trimmed for vector and for
`quality, looking at the 5’ end for the first window with at least 15 contiguous
`non-vector bases of >PHRED20 and at the 3’ end, starting from the left cutoff, for 12
`contiguous non-vector bases with <PHRED20 scores. Only trimmed reads that had
`>95% of their trimmed bases with PHRED>20 and a length of >250 bases were kept.
`The reads after trimming were composed of 40% GC base pairs. Reads were
`masked for repeats using the RepeatMasker program (A.F.A. Smit & P. Green,
`http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl) and for low entropy data using the
`nseg option of BLAST (W. Gish, unpublished; http://blast.wustl.edu). Reads were
`retained and used only if there were at least 100 consecutive bases of PHRED
`quality 20 or greater and 100 consecutive unmasked bases.
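
The final retention rule can be sketched as below. It assumes per-base PHRED scores and a per-base repeat/low-entropy mask are available as parallel lists, and that the two 100-base runs are checked independently (the text does not say whether they must coincide). The names are hypothetical.

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def keep_read(quals, masked, min_run=100, min_q=20):
    """Keep a read with >=100 consecutive Q>=20 bases and >=100 unmasked bases."""
    high_quality = longest_run(q >= min_q for q in quals)
    unmasked = longest_run(not m for m in masked)
    return high_quality >= min_run and unmasked >= min_run


# Hypothetical read: 150 bases, all Q30, first 50 unmasked and the rest masked.
print(keep_read([30] * 150, [False] * 50 + [True] * 100))
```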
`
`
`
`
`
`
`
`Page 875, left-hand column, line 30: “We found that 88% of the bases of these cDNAs
`could be aligned ...”
`
`
`Based on a test data set of random reads from finished projects, the following
`BLAST parameters were found to match 100% of the reads without false matches:
`-filter seg S=170 S2=150 W=13 gapW=4 gapS2=150 M=5 N=-11 Q=11 R=11. The
`set of masked trimmed reads was compared to the 7 October 2000 freeze of the
`HTGS data set, to all of Genbank and to the TSC SNP database using BLASTN
`2.0MP (W. Gish, unpublished; http://blast.wustl.edu). The highest scoring match was
`aligned against the read using CROSSMATCH, demanding alignment of the full
`trimmed read at ≥97% identity for genomic sequence and with appropriate
`topological constraints for the SNP reads. Typically 1-2% of the matches were
`eliminated by this step.
`
`We aligned the RefSeq cDNA sequences to the draft genome using the psLayout
`program (ref. 104) and gathered statistics on the percentage of cDNA bases that aligned at
`various percent identity thresholds.
`
`The distal 200 bases of each cDNA were not included in the computation of the
`percentage of aligning bases because alignments in these regions are less reliable.
`If any cDNA aligned in more than one way, each cDNA base involved in any
`alignment was counted only once. At a threshold of 98% identity for the alignments,
`we found that 87.9% of the cDNA bases aligned somewhere in the draft genome.
`When the threshold was increased to 99% identity, the percentage of aligning bases
`fell to 85.83%, and when the threshold was decreased to 97% identity, it rose to
`88.5%. Further decreases in the threshold all the way down to 90% identity only
`increased the percentage of aligning bases by one more percentage point, so the value
`of approximately 88% aligning bases, achieved by requiring 98% identity, represents
`a knee in the curve.
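
The base-counting rule (each cDNA base counted once, distal 200 bases excluded) can be sketched as follows. This is an illustration only: it assumes alignments are given as 0-based half-open intervals on the cDNA and that the excluded "distal" bases are the last 200 of the sequence. It does not reproduce psLayout.

```python
def aligned_fraction(cdna_len, alignments, trim=200):
    """Fraction of usable cDNA bases covered by at least one alignment."""
    usable = max(cdna_len - trim, 0)  # assumption: drop the final 200 bases
    covered = set()
    for start, end in alignments:
        # A set ensures a base hit by several alignments is counted once.
        covered.update(range(start, min(end, usable)))
    return len(covered) / usable if usable else 0.0


# Hypothetical cDNA of 1,200 bases with three partly overlapping alignments.
print(aligned_fraction(1200, [(0, 500), (400, 900), (950, 1200)]))
```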
`
`
`
`Section: Broad genomic landscape (p. 875)
`
`Page 876, right-hand column, line 9: “In addition, the human cytogenetic map ...”
`
`
`The locations of the cytogenetically mapped clones on the draft genome sequence
`can be viewed at http://genome.ucsc.edu/goldenPath/mapPlots . Further information about the
`individual clones can be obtained at http://www.ncbi.nlm.nih.gov/genome/cyto/ and
`http://www.ncbi.nlm.nih.gov/genome/guide. Here, as well as on the browser at
`http://genome.ucsc.edu and http://www.ensembl.org/ , they can be viewed in the context of other
`genome annotation.
`
`
` Subsection: Long-range variation in GC content (p. 876)
`
`Page 877, left-hand column, line 30 “About three-quarters of the genome-wide variance…
`consistent with a homogeneous distribution”
`
`
`All 3,312 windows of length 300 kb that had at least eight gap-free 20 kb
`subwindows and did not contain more than 50% simple repeats were extracted from
`the draft genome sequence. The average sample variance of the GC content of the
`subwindows of a window was 7.3%. The sample variance of all subwindows
`genome-wide (N = 36,562) was 27.4%. Hence, the variance of GC content within
`the 20 kb subwindows of a 300 kb window accounts for approximately one quarter of
`the overall variance of the GC content among all 20 kb subwindows in this sample.
`The average sample standard deviation of the GC content of the subwindows of a
`window was 2.4%.
`
`
`Page 877, left-hand column, line 34: “In fact, the hypothesis … draft genome sequence.”
`
`
`For each of the 3,312 windows of length 300 kb, we tested the hypothesis that its 20
`kb subwindows were sampled from a homogeneous GC distribution. The distribution
`was defined to have mean m equal to the GC-content in the combined subwindows
`of the 300 kb window, and the bases were taken as independent. Under this
`distribution, the GC-content of a 20 kb subwindow would have mean m and variance
`σ² = m(100-m)/20000. For m = 41%, the typical value, this gives σ² = 0.121%, which
`is about 0.017 times the average sample variance of 7.3%. For each window, the
`variance σ² and the sample variance s² were determined, along with the value χ² =
`(n-1)s²/σ², where n is the number of subwindows of the window. Under the
`hypothesis of homogeneity, the statistic χ² should have an approximately chi-square
`distribution with n-1 degrees of freedom. However, for every one of the 3,312
`windows, χ² > 31.5, which rejects the hypothesis of homogeneity at a confidence level
`well above 0.995 (that is, a p-value well below 0.005).
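
A worked sketch of this test: the model variance m(100-m)/20000 for independent bases, and the statistic formed as (n-1) times the sample variance divided by the model variance. The subwindow GC values below are hypothetical, and the window mean is taken as the plain average of equal-sized subwindows.

```python
def model_variance(m, subwindow_bp=20000):
    """Variance of GC percentage for subwindow_bp independent bases."""
    return m * (100.0 - m) / subwindow_bp


def chi_square_stat(gc_values, subwindow_bp=20000):
    """(n-1) * sample variance / model variance for one window."""
    n = len(gc_values)
    m = sum(gc_values) / n
    sample_var = sum((x - m) ** 2 for x in gc_values) / (n - 1)
    return (n - 1) * sample_var / model_variance(m, subwindow_bp)


print(round(model_variance(41), 3))        # model variance at m = 41%
print(round(model_variance(41) / 7.3, 3))  # ratio to the observed 7.3%

# Hypothetical window of eight 20 kb subwindows (GC content in percent).
window = [38, 39, 40, 41, 42, 43, 44, 41]
print(chi_square_stat(window) > 31.5)
```

Even the modest spread in this made-up window (sample variance 4, below the reported average of 7.3) gives a statistic far above the 31.5 cutoff.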
`
`
`
`
`Another way to test the hypothesis of homogeneity is to look in each 300 kb window
`for one 20 kb subwindow whose GC content differs significantly from the mean m for
`that window. In these tests, all 300 kb windows with less than 50% simple repeats
`and less than 25% gaps were tested (N = 10,596). Under the assumptions above, if
`X is the GC content of a subwindow, then D = (X-m)/sqrt[m(100-m)/20000] should
`have an approximately normal distribution. However, in all but four windows there is
`a subwindow with |D| > 3.0, i.e. the GC content of the subwindow is more than 3.0
`standard deviations from the mean of the window. The p-value for such a deviation is
`0.0026. Considering that there are 15 possible subwindows, this gives an overall
`p-value of 0.039, i.e. the hypothesis of homogeneity is rejected at a confidence level
`greater than 0.96.
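
The single-subwindow test can be sketched as below, using the normal approximation and a simple multiplication by the number of subwindows (a Bonferroni-style bound, which is our reading of the arithmetic in the text). Note that the exact two-sided normal tail beyond 3.0 standard deviations is about 0.0027; the text quotes 0.0026. The example GC values are hypothetical.

```python
import math


def d_statistic(x, m, subwindow_bp=20000):
    """Standardized deviation of a subwindow's GC percentage from the mean m."""
    return (x - m) / math.sqrt(m * (100.0 - m) / subwindow_bp)


def two_sided_p(d):
    """Two-sided tail probability of a standard normal deviate."""
    return math.erfc(abs(d) / math.sqrt(2.0))


def overall_p(p_single, n_subwindows):
    """Upper bound on the chance that any of n subwindows deviates this far."""
    return min(1.0, n_subwindows * p_single)


print(round(two_sided_p(3.0), 4))
print(round(overall_p(0.0026, 15), 3))
print(d_statistic(42.5, 41.0) > 3.0)  # a 1.5-point GC shift already exceeds 3 SD
```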
`
`The above analysis was repeated using 5 kb subwindows of 300 kb windows, and
`the hypothesis of homogeneity was rejected for all windows at a confidence level greater than
`0.96, and with greater confidence for those windows tested with the chi-square test.
`Similar results were also obtained for 5 kb subwindows of 100 kb windows: homogeneity was
`rejected for all but thirteen windows at a confidence level greater than approximately 0.95, and
`for all but three of those examined with the chi-square test. Since any
`region of 200 kb must contain one of the regions of 100 kb we tested for
`homogeneity, this indicates that there are few if any regions of 200 kb in the genome
`with homogeneous GC content.
`
`
`Page 877, right-hand column, line 25: “Estimated band locations …”
`
`
`Bands were assigned by a dynamic programming algorithm that attempted to
`maximize the number of cytogenetically mapped clones that lie within the range of
`possible sub-bands predicted from FISH, with special emphasis on high-resolution
`FISH-mapped clones provided by investigators at the National Cancer Institute (ref. 103).
`The band positions were optimized subject to the constraint that the bands must
`appear in the known order along the draft genome sequence. Slight penalties for
`band size deviation from the standard fractional sizes were also imposed, so that in
`the absence of any FISH-mapped clones at all in a particular region, and given that
`there are no constraints from surrounding regions, the program would produce sub-
`bands corresponding to the standard fractional band lengths.
`
`
`Section: Repeat content of the human genome (p. 879)
`Subsection: Distribution of GC content (p. 884)
`
`Concerning the subdivision of the draft genome sequence into 50 kb pieces of
`similar GC level: the same results will be obtained however the sequence is
`subdivided, as long as the fragments are around 50 kb long. Specifically, however,
`for the analyses shown in Figures 22 to 26, the draft genome sequence was
`subdivided into fragments of 40-60 kb (averaging 50 kb) overlapping by 1 kb. These
`fragments were created on the fly by the RepeatMasker program, and for each a
`repeat analysis was done. The repeat information files were grouped by the GC level
`of the fragment, and processed according to need.
`
`
`
`For the analyses shown in Figures 23 and 25, the number of repeat copies was
`compared. The number of individual insertions per megabase of DNA of a particular
`GC level was extracted from the RepeatMasker output (RepeatMasker provides
`information on which fragments originated from the same inserted transposable
`element). The Y axis is the ratio of the frequency of Alu (Fig. 23) or LINE1 (Fig. 25) over
`the average frequency of these elements in the genome.
`
`Subsection: Segmental Duplications (p. 889)
`
`
`Our assessment of low copy repeats (genomic duplications) within the draft genome
`sequence involved a global analysis of all non-overlapping sequence. The analysis
`used a combination of DNA sequence analysis software and a suite of Perl scripts
`developed for paralogy detection (J. A. Bailey and E. E. Eichler, in preparation).
`The basic methodology included: repeat masking (RepeatMasker v.4/20) of all
`reference sequences for common repeats; the removal and splicing of such repeat
`segments; global BLAST analysis of the segments for the identification of
`non-overlapping high-scoring segments, using relaxed affine gapping parameters which
`allowed large gaps up to 1 kb to be traversed (parameters: -G 180 -E 1 -q -80 -r 30
`-z 3000000000 -Y 3000000000 -e 1e-10 -F F); and the reintroduction of common
`repeat elements into each pairwise alignment followed by optimal global alignment
`of the segments using the program ALIGN (E.W. Myers and W. Miller, CABIOS
`(1989) 4:11-17). To detect internal duplications within each query segment, a
`modified version of BLASTZ (W. Miller, unpublished) was used with similar relaxed
`gap parameters (B=2 M=30 I=-80 V=-80 O=180 E=1 W=14 Y=1400). Alignment
`statistics were generated (program: ALIGN_SCORER), and alignments that equaled or
`exceeded the threshold of 1000 bases aligned with over 90% similarity (i.e. gaps
`excluded) were analyzed. Generation of global alignments also acted as a
`safeguard against false positives from BLAST analysis. In cases of extremely large
`gaps (>1 kb), alignments were fractured. Such cases were detected and merged for
`gaps up to 20 kb.
`
`Subsection: Pericentromeres and telomeres (p. 890)
`
`
`Chromosome 22 (May 2000, Sanger Centre) and Chromosome 21 (Sept., NCBI)
`were analyzed for large duplications as described. For interchromosomal
`duplications, the chromosome was analyzed versus the NT accession contigs
`(NCBI) and versus all remaining HTGS accessions (draft and finished).
`A final global alignment threshold (>90%; ≥1000
`bases) was used. Because of unassembled allelic overlaps, sequences containing
`highly similar alignments (>99.5% NT; >99.0% HTGS) were excluded as probable
`allelic overlaps. The duplicated sequences for chromosome 21 and chromosome 22
`were viewed graphically using the program PARASIGHT (J. A. Bailey and E.E. Eichler,
`in preparation).
`
`
`
`
`Subsection: Genome-wide analysis of segmental duplications. (p. 891)
`
`
`Finished sequence included all assembled sequence from NCBI within the NT
`dataset (version of 5 September 2000). A global alignment threshold (>90%; ≥1000
`bases) was used for comparisons between finished sequences. Further selection
`limited alignments for analyses to those with less than 99.5% identity, as those above
`that level were likely to represent unassembled allelic overlaps.
`
`The 15 July 2000 version of the draft genome sequence was used as the basis for
`the duplication analysis of the entire human draft. A final global alignment threshold
`(>90%, ≥1000 bases and <98%) defined the limits of detection for duplicated
`sequence. Sequence alignments (>98%) appear to represent mainly missed allelic
`overlaps, many of which were subsequently merged in later releases of the assembly
`(e.g. 7 October 2000). Final validation of duplicated segments >98% within the
`working draft will require finished sequence data and/or experimental validation.
`
`Section: Gene content of the human genome (p. 892)
` Subsection: Noncoding RNAs (p. 892)
`
`
`
`
`To identify transfer RNA genes, we used tRNAscan-SE version 1.21 [T.M. Lowe,
`S.R. Eddy. tRNAscan-SE: a program for improved detection of transfer RNA genes
`in genomic sequence. Nucleic Acids Res. 25, 955-964 (1997)] to analyze the 7
`October 2000 version of the draft genome sequence. tRNAscan-SE predicted 504
`tRNA genes and 144 tRNA-derived pseudogenes. Three of the predicted genes had
`a non-canonical anticodon loop length, preventing tRNAscan-SE from
`unambiguously identifying the anticodon; although there are many possible
`explanations for them, for our current purposes we classified these as probable
`pseudogenes. After manual examination of the tRNAs with unlikely anticodons, four
`more of the predicted genes were also classified as probable pseudogenes: a
`putative UAA suppressor, a putative UAG suppressor, and two putative UGA-reading
`selenocysteine tRNAs. The remaining gene predictions were not examined
`manually. We know that a small number of the 497 "true" tRNA genes are likely to
`be pseudogenes or parts of tRNA-derived repetitive sequence elements because
`tRNAscan-SE's ability to separate pseudogenes from true genes is not perfect.
`Because tRNAscan-SE models tRNA consensus secondary structure, it is not a
`reliable detector of divergent tRNA pseudogenes. To more accurately estimate the
`number of tRNA-derived pseudogenes, all 648 sequences detected by tRNAscan-SE
`were used as WU-BLASTN queries (see below), and another 173 significantly
`related sequences were detected, bringing the estimated pseudogene count to 324.
`
`To identify all ncRNA homologues other than tRNA genes, we performed sequence
`similarity searches using WashU BLASTN 2.0MP (W. Gish, unpublished;
`http://blast.wustl.edu) on the 7 October 2000 genome assembly, with parameters "-kap
`wordmask=seg B=50000 W=8" and the default DNA scoring matrix. True genes
`were operationally defined as BLAST hits with ≥95% identity over ≥95% of the length of
`the query. Related sequences (e.g. pseudogenes) were operationally defined as all
`other BLAST hits with P-values <= 0.001. To reconcile our tRNA gene count of 497
`with the larger number of 1310 generally found in textbook references, we
`reexamined the primary data in a classic paper by Hatlen and Attardi (ref. 252). The
`textbook estimate of 1310 human tRNA genes was based on their observation that
`purified and labelled human 4S RNA (i.e. the tRNA population) hybridizes to HeLa
`genomic DNA and saturates at a fraction of about 1.1 × 10⁻⁵ of the genome. The
`molecular weight of the human genome was thought at that time to be 3.1 × 10¹²
`(about 4.7 billion bases). Recalculation using the current estimated genome size of
`3.2 billion bases [T.R. Tiersch, R.W. Chandler, S.S. Wachtel, S. Elias. Reference
`standards for flow cytometry and application in comparative studies of nuclear DNA
`content. Cytometry 10, 706-710 (1989); this paper] gives an estimate of 890 tRNA-
`complementary loci instead of 1310. Hatlen and Attardi also noted, but at the time
`could not explain, a puzzling length heterogeneity in their hybridized genomic loci.
`We believe that they were observing the tRNA pseudogene population, many of
`which are truncated copies of tRNA genes; therefore we believe their hybridization-
`based estimate of ~890 loci included tRNA pseudogenes (of which we count 324 in
`the genome) in addition to the true tRNA genes (of which we count 497 in the
`genome).
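
The tRNA counts and the rescaled hybridization estimate can be checked with a few lines of arithmetic. The sketch assumes the hybridization-based locus count scales linearly with the assumed genome size; the function names are hypothetical, and all numbers are taken from the text above.

```python
def reclassified_counts(genes=504, pseudogenes=144, odd_loop=3,
                        unlikely_anticodon=4, blast_related=173):
    """True-gene and pseudogene totals after the reclassifications in the text."""
    moved = odd_loop + unlikely_anticodon  # predictions reclassified as pseudogenes
    true_genes = genes - moved
    all_pseudogenes = pseudogenes + moved + blast_related
    return true_genes, all_pseudogenes


def rescale_loci(old_estimate=1310, old_size_gb=4.7, new_size_gb=3.2):
    """Rescale a locus count from the old to the current genome size."""
    return old_estimate * new_size_gb / old_size_gb


true_genes, pseudo = reclassified_counts()
print(true_genes, pseudo, true_genes + pseudo)
print(round(rescale_loci()))
```

The combined count of true genes and pseudogenes (821) falls somewhat below the rescaled hybridization estimate of roughly 890 loci.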
`
`
` Subsection: Protein-coding genes (p. 896)
` Sub-subsection: Exploring properties of known genes (p. 896)
`
`
`Known genes were aligned with Spidey (S. Wheelan et al., manuscript in
`preparation) and Acembly (D. Thierry-Mieg and J. Thierry-Mieg, unpublished;
`http://www.acedb.org/ ), which in both cases align the cDNA to the genome while
`allowing for introns. The results from the two programs were in broad agreement.
`5,364 RefSeq entries (from the 1 September 2000 release) were used as a source of
`the cDNAs. The alignments of the cDNAs to the genome could be classified by the
`proportion of the cDNA that aligned to the genome and by the percentage of identical
`nucleotides between the cDNA and the genomic sequence. In most cases, there was
`an unambiguous location for a cDNA. However, some proportion at each level of
`coverage had more than one site with high identity matches; in these cases, one of
`the locations was arbitrarily chosen.
`
`
` Sub-subsection: Towards a complete index of human genes (p. 898)
` Creating an initial gene index (p. 899)
`
`Ensembl: Ensembl aims to predict coding sequences of true genes with high
`confidence, by only predicting coding sequence regions which have confirming
`evidence across their entire length. The sources of confirmation are cDNA, EST and
`protein-based similarity. The Genscan computer program was run across the
`individual fragments of the genome and the resulting peptides were used to search
`vertebrate mRNA sources (extracted from the EMBL databank;
`http://www.ebi.ac.uk/index.html), EST (vertebrate dbEST; ftp://ncbi.nlm.nih.gov/genbank ) and a
`non-redundant protein database (SWIR; http://www.ebi.ac.uk/swissprot/ ). Protein hits of
`greater than 200 bits similarity were then further processed by using the GeneWise
`program with the similar protein against the assembled draft genome sequence (the
`17 July 2000 version). A final gene-building method was then used to merge all the
`resulting information, namely Genscan predictions with confirming similarity at a
`number of exons and the GeneWise gene predictions. The method only accepted a
`join between two exons if consistent similarity evidence was found on each exon with
`the following thresholds: (a) all GeneWise predictions were accepted, although
`redundant GeneWise predictions were discarded; and (b) for exons predicted by
`Genscan, a single protein or cDNA similarity of at least 100 bits, or at least
`two EST hits of at least 100 bits, was required. This final process allows for alternative splicing,
`although modeling alternative splicing has not been optimised. Ensembl produced
`35,500 gene predictions with 44,860 transcripts.
`
`Merge procedure to produce a final protein set: To generate a single protein set for
`further analysis we merged the known protein sequences from RefSeq (version of
`29 Sept 2000), SWISSPROT (Release 39.6 of 30 Aug 2000), TREMBL (TrEMBL
`Release 14.17 of 1 Oct 2000) and TREMBL_NEW (1 Oct 2000) with the gene
`predictions. The later protein analysis required a non-redundant protein set where
`genes were represented as a single protein sequence; in the case of alternative
`splicing, a single, representative protein sequence was required. We are aware of
`the obvious limitations of this representation of the human proteome, but
`accommodating alternative splicing in the downstream analysis was very complex.
`
`The genome prediction data set was prepared as follows: the Ensembl and Genie
`predictions were merged by examining overlap of coding exons in genomic
`coordinates. Two gene predictions were merged if a single coding exon on the same
`strand overlapped. From this set of merged predictions, we used only the
`Ensembl+Genie and the Ensembl-only predictions. In cases where there was more
`than one prediction or, for Ensembl genes, more than one transcript, we chose the
`longest protein sequence from each merged unit to represent the gene. The protein
`level merge then occurred by comparing the union of all the data sources in an all-
`vs-all FASTA comparison using default parameters. Two protein sequences were
`merged if the match covered at least 95% of the shorter sequence, and identity was
`≥ 95%, which takes into account both nearly identical protein sequences and also
`nearly identical fragments.
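
The two merge criteria described above can be sketched as follows: gene predictions are joined when any coding exons overlap on the same strand, and proteins are joined when a match covers at least 95% of the shorter sequence at 95% identity or more. Exons are represented here as hypothetical (strand, start, end) tuples with half-open coordinates; none of these names come from the Ensembl code.

```python
def exons_overlap(exon_a, exon_b):
    """True if two (strand, start, end) exons overlap on the same strand."""
    strand_a, start_a, end_a = exon_a
    strand_b, start_b, end_b = exon_b
    return strand_a == strand_b and start_a < end_b and start_b < end_a


def predictions_overlap(pred_a, pred_b):
    """Gene predictions merge if any single pair of coding exons overlaps."""
    return any(exons_overlap(a, b) for a in pred_a for b in pred_b)


def proteins_merge(len_a, len_b, match_len, identity_pct,
                   min_cover=0.95, min_identity=95.0):
    """Protein-level merge: coverage of the shorter sequence and identity."""
    shorter = min(len_a, len_b)
    return match_len / shorter >= min_cover and identity_pct >= min_identity


ensembl = [("+", 100, 200), ("+", 300, 400)]
genie = [("+", 350, 500)]
print(predictions_overlap(ensembl, genie))
print(proteins_merge(400, 1000, 390, 97.0))  # fragment nearly identical to a longer protein
```

Measuring coverage against the shorter sequence is what lets a nearly identical fragment merge with its full-length counterpart.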
`
`Special attention was needed to prevent overrepresentation of alternative splice
`forms. Firstly we expanded the Swissprot and Trembl databases to represent known
`splice variants in the protein merge, but only took a single protein (the canonical
`database sequence) for the final protein set. An additional cull for alternative splice
`forms which remained as separate proteins was produced by taking the
`corresponding DNA sequences of the known proteins (RefSeq, SWISSPROT,
`TREMBL and TREMBL_NEW) and matching back to the genome using the SSAHA
`program without requiring a valid gene structure alignment. If the DNA derived from
`two protein sequences matched at over 28 base pairs at the same location, the
`longest protein sequence was used. Finally, clear bacterial contaminants (proteins
`which had an almost identical match to a bacterial protein) were removed.
`
`
`
`
`
`
`
`Quality Control on the protein set: We took 31 genes which we could confirm as
`being unavailable at the time of the gene builds (22 from RefSeq, 9 from the Sanger
`Centre gene identification program on chromosome X). 3 of the 31 sequences could
`not be found in the genome assembly. Using the wublastp program
`(http://blast.wustl.edu) with default parameters, we matched the 31 sequences to the IPI.1
`set and visually inspected the alignments. 19 sequences showed a clear match to an
`IPI protein; 14 hit a single IPI protein, 3 hit 2 IPI proteins, 1 hit 3 IPI proteins and 1 hit
`4 IPI proteins.
`
`RIKEN mouse cDNAs. We took a random sample (1,000) of known genes, Ensembl-
`Genie genes and Ensembl-only genes and matched them to the Riken cDNA set of
`15,294 cDNAs using the TBLASTN program (http://www.ncbi.nlm.nih.gov/BLAST/ ) with
`default parameters, at the 1e-6 E-value significance level.
`
`The IPI and IGI can be found at http://www.ensembl.org/IPI/.
`
`
`
`
`Additional information for Table 23 (p. 902). All of the tables of Interpro are
`accessible through http://www.sanger.ac.uk/Users/agb/Ensembl.
`
`Section: Segmental history of the human genome (p. 908)
` Subsection: Conserved segments between human and mouse (p. 908)
`
`Putatively orthologous sequences were determined in two ways. Curated
`orthologues determined at the Jackson Laboratory (www.informatics.jax.org) were
`obtained by FTP. In addition, orthologues were calculated at the NCBI using the
`program megaBLAST [Z. Zhang et al., J. Comput. Biol. 7, 203-214 (2000)]. In order
`to calculate orthologues, non-EST mRNA sequences found in LocusLink
`(http://www.ncbi.nlm.nih.gov:80/LocusLink/) were obtained for both human and mouse. The
`megaBLAST analysis was performed first using the mouse sequence as the query
`and the human sequence as the database. A second analysis was performed in
`which the human sequence was the query and the mouse sequence was the
`database. R