Downloaded from genome.cshlp.org on July 8, 2014 - Published by Cold Spring Harbor Laboratory Press
RESEARCH

Complete Genomic Sequence and Analysis of 117 kb of Human DNA Containing the Gene BRCA1

Todd M. Smith,1 Ming K. Lee,2 Csilla I. Szabo,2 Nicole Jerome,1 Mark McEuen,1 Matthew Taylor,1 Leroy Hood,1 and Mary-Claire King2,3

1Department of Molecular Biotechnology and 2Departments of Medicine and Genetics, University of Washington Medical School, Seattle, Washington 98195

Over 100 distinct disease-associated mutations have been identified in the breast-ovarian cancer susceptibility gene BRCA1. Loss of the wild-type allele in >90% of tumors from patients with inherited BRCA1 mutations indicates tumor suppressive function. The low incidence of somatic mutations suggests that BRCA1 inactivation in sporadic tumors occurs by alternative mechanisms, such as interstitial chromosomal deletion or reduced transcription. To identify possible features of the BRCA1 genomic region that may contribute to chromosomal instability as well as potential transcriptional regulatory elements, a 117,143-bp DNA sequence encompassing BRCA1 was obtained by random sequencing of four cosmids identified from a human chromosome 17 specific library. The 24 exons of BRCA1 span an 81-kb region that has an unusually high density of Alu repetitive DNA (41.5%), but relatively low density (4.8%) of other repetitive sequences. BRCA1 intron lengths range in size from 403 bp to 9.2 kb and contain the intragenic microsatellite markers D17S1323, D17S1322, and D17S855, which localize to introns 12, 19, and 20, respectively. In addition to BRCA1, the contig contains two complete genes: Rho7, a member of the rho family of GTP binding proteins, and VAT1, an abundant membrane protein of cholinergic synaptic vesicles. Partial sequences of the 1A1-3B B-box protein pseudogene and IFP 35, an interferon induced leucine zipper protein, reside within the contig. An L21 ribosomal protein pseudogene is embedded in BRCA1 intron 13.
The order of genes on the chromosome is: centromere-IFP 35-VAT1-Rho7-BRCA1-1A1-3B-telomere.

[The sequence data described in this paper have been submitted to GenBank under accession no. L78833.]

3Corresponding author. E-MAIL mcking@u.washington.edu; FAX (206) 616-4295.
6:1029-1049 ©1996 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/96 $5.00

Inherited mutations in the breast-ovarian cancer susceptibility gene, BRCA1 (Hall et al. 1990; Miki et al. 1994), confer a lifetime risk of breast cancer greater than 80% and an increased risk of ovarian cancer (Newman et al. 1988; Easton et al. 1993; Ford et al. 1994). Evidence for tumor suppressive function of BRCA1 derives from the high proportion (87%) of truncating germ-line mutations, which likely represent loss-of-function alterations (Castilla et al. 1994; Friedman et al. 1994b; Futreal et al. 1994; Miki et al. 1994; Simard et al. 1994; Friedman et al. 1995b; Gayther et al. 1995, 1996; Hogervorst et al. 1995; Hosking et al. 1995; Merajver et al. 1995; Shattuck-Eidens et al. 1995; Struewing et al. 1995; Szabo and King 1995; Takahashi et al. 1995; Couch et al. 1996; Durocher et al. 1996; FitzGerald et al. 1996; Johansson et al. 1996; Langston et al. 1996; Serova et al. 1996); loss of the wild-type allele in >90% of breast and ovarian patients with inherited BRCA1 mutations (Smith et al. 1992; Friedman et al. 1994a; Neuhausen and Marshall 1994); decreased levels of BRCA1 expression in breast (Thompson et al. 1995) and ovarian (R. Hernandez, M. Skelly, C. Laird, M.-C. King, and A. Gown, in prep.) tumors from patients not selected for family history; accelerated growth of normal and malignant mammary epithelial cells upon experimental inhibition of BRCA1 expression with antisense oligonucleotides (Thompson et al. 1995); and inhibition of malignant breast and ovarian cancer cell growth in culture as well as suppression of MCF7 breast cancer cell tumorigenesis in mice by
overexpression of wild-type BRCA1 (Holt et al. 1996).

Few somatic point mutations or frame-shift alterations have been identified thus far (Futreal et al. 1994; Hosking et al. 1995; Merajver et al. 1995; Takahashi et al. 1995), suggesting that somatic inactivation of BRCA1 may occur through different mechanisms, such as interstitial chromosomal deletion or epigenetic silencing of BRCA1 expression. Gross somatic mutations at BRCA1 are frequently found in breast and ovarian tumors from patients not selected for family history. Loss of heterozygosity (LOH) in the BRCA1 region ranges from 40% to 80% among sporadic breast carcinomas (Cropp et al. 1993; Saito et al. 1993; Ford et al. 1994) and from 30% to 70% among sporadic ovarian carcinomas (Russell et al. 1990; Cliby et al. 1993; Yang-Feng et al. 1993; Takahashi et al. 1995). BRCA1 transcript expression in breast carcinomas of patients not selected for family history ranges from none detectable to <50% of normal levels (Thompson et al. 1995) and protein expression in ovarian carcinomas is lost (R. Hernandez, M. Skelly, C. Laird, M.-C. King, and A. Gown, in prep.).

Characterization of the genomic sequence of BRCA1 may lead to identification of putative transcriptional regulatory sequences and structural elements that potentially contribute to chromosomal instability. We present the complete sequence and analysis of a 117,143-bp region of human chromosome 17 that encompasses the 81-kb BRCA1 gene.

RESULTS

Mapping BRCA1 Cosmid Clones

Clones containing portions of the BRCA1 gene were isolated from a chromosome 17 specific cosmid library by screening arrayed filters with probes prepared from PCR products of BRCA1 exons amplified from human genomic DNA (Friedman et al. 1994b). Overlapping cosmids were identified by PCR using the same primer pairs as for the preparation of the probes.
From these screens, 10 overlapping cosmids were identified that spanned the BRCA1 gene (Table 1), four of which (BRCA1-5, 1-7, 1-8, and 1-9) were selected for sequencing.

Table 1. BRCA1 Cosmid Contig

Library designation   Primer-pair results          Contig identifier
43H2                  + + + +                      BRCA1-2
48C3                  + + + + +                    BRCA1-3
154B12                + + + + + +                  BRCA1-4
145E5                 + + + + + + + +              BRCA1-10
73F5                  + + + + + + + + + + + + +    BRCA1-5
141H4                 + + + + + + + + + + + +      BRCA1-6
14F8                  + + + + + + + + + +          BRCA1-7
38H2                  + + + + + + - + + +          BRCA1-1
87E7                  + + + + + + + + +            BRCA1-8
90F2                  + + + + + +                  BRCA1-9

Chromosome 17 specific cosmid library designations and contig identifiers used in the text are indicated. Primers used to map the extent of the cosmid inserts are described for BRCA1 (Friedman et al. 1994b, 1995b) and microsatellites (Anderson et al. 1993; Neuhausen et al. 1994).

Sequencing

The four overlapping cosmids containing portions of the BRCA1 gene were sequenced from randomly selected M13 subclones (Deininger 1983; Wilson et al. 1994) using fluorescence based automated detection (Hunkapiller et al. 1991). The 117,143-bp sequence (Fig. 1; GenBank accession no. L78833) was assembled from three separate overlapping contigs of 42 kb, 36 kb, and 40 kb. For each contig, between 1300 and 1500 ABI trace files were analyzed with the phred base-calling program (P. Green and B. Ewing, unpubl.; see Methods). The resulting sequence strings were assembled using phrap (P. Green, unpubl.) to give an average redundancy of eight- to tenfold for the entire region after vector-containing and low-
quality sequences were removed. Complete assembly of the 36-kb contig (BRCA1-7/8) was possible with the combined data from two cosmids (BRCA1-7 and BRCA1-8), whereas the 42-kb (BRCA1-5) and 40-kb (BRCA1-9) contigs required additional data to close gaps that remained after an initial phase of random sequencing. To obtain sequence spanning the gaps, primers were designed from existing data, and single-stranded M13 templates known to contain the region of interest were sequenced using dye terminator chemistry (Wilson et al. 1994). Assembly accuracy was verified by comparison of restriction enzyme digests of the cosmid DNA (data not shown) to computer-generated restriction maps, and by alignment with known cDNA sequences.

Accuracy of the sequence data was estimated using three methods: observing discrepancies in overlapping regions between cosmid clones, observing discrepancies between genomic sequence data and published cDNA sequences from this region, and estimating quality values using phred and phrap for each nucleotide of the sequence. The 946-bp overlap between BRCA1-5 and BRCA1-7/8 did not show any discrepancies. Between BRCA1-7/8 and BRCA1-9 there are two overlapping regions (resulting from a 16-kb deletion in BRCA1-8) of 2500 bp and 7467 bp. The 2500-bp overlap did not show any discrepancies, and the 7467-bp overlap showed four discrepancies: Two were a result of sequencing errors that were resolved by replacing the consensus but low-quality nucleotides with nucleotides assigned high-quality values; the other two likely represent haplotype variants. These data indicate an error estimate of less than one error in 3000 bases. When the 5711-bp BRCA1 cDNA sequence (Miki et al. 1994; GenBank accession no. U14680) was aligned with the germ-line sequence, six discrepancies [five mismatches and one insertion/deletion (indel)] were detected.
Of these, only one mismatch in the coding sequence was definitely a result of an incorrect base in the germ-line consensus sequence. The other four mismatches and indel are all in the 5' untranslated region (UTR) and may represent either sequencing discrepancies or polymorphisms. Differences between the genomic sequence and published cDNA sequences of two other genes within the contig, Rho7 (GenBank accession no. X95456) and VAT1 (GenBank accession no. U18009), were confirmed to be true discrepancies between the data sets.

Figure 1 Global analysis of the BRCA1 contig based on the data bases available, July 1996. (A) Graph of the frequency of CpG dinucleotides divided by the frequency of GpC dinucleotides. Dinucleotide frequencies were calculated in a 1000-nucleotide window moved at 100-nucleotide intervals using the program CpG (T. Smith, unpubl.). (B) Structure of the genes and gene fragments found in the 117,143-bp contig. (C) Distribution of Alu interspersed repeat elements in the contig. The human specific (AluY) elements are highlighted in red. (D) Distribution of non-Alu interspersed repeats, including L1, Mer, Mir, and Mlt families, and a complete LTR element. (E) All simple-sequence repeats of ≥10 nucleotides found using the program sputnik (C. Abajian, unpubl.); STS repeats known to be polymorphic are highlighted in red. (F) Locations of exons predicted by GRAIL (Xu et al. 1994). (G) Locations of exons predicted by genefinder (P. Green, unpubl.). (H) Blastn search of the 117,143-bp DNA sequence against dbEST (NCBI). For this search interspersed repeats were masked (by replacing repeat region sequence with strings of x's or n's) using cross_match (P. Green, unpubl.) and a library of interspersed repeat sequences (A. Smit, unpubl.) (masked repeats displayed in C and D). The resulting sequence data was used to search dbEST with blastn (Altschul et al. 1990) and the accession numbers of the sequences (subject) predicted to align with the contig sequence (query) by blastn were obtained over the network and used to make a library of subject sequences. Cross_match was then used to refine the alignments by searching the query sequence against the library of subject sequences. The three steps of this process were automated with FindMatches (T. Smith, unpubl.; see Methods). The results from the analysis are plotted as a three-dimensional histogram where the x-axis represents the contig, the y-axis gives the number of sequences aligning in a particular region, and the z-axis lists bins of decreasing identity between the query and subject sequences. (I) Blastn search against the nonredundant version (nr) of GenBank (NCBI). This analysis was carried out as in H.

Based on automated quality values from phrap, which include quality values obtained from analysis of the peaks in a sequence trace by phred and confirmation of bases from overlaps between individual sequence strings within the project, ~95% of the data is estimated to have one error or less in 10,000 bases, and 98% of the data is estimated to be >99.9% accurate. Regions that are sequenced on both strands (~99% of the total 117,143-bp contig) result in the highest quality values. The remaining 1% of the sequence that was determined from a single strand contained at least three overlapping high-quality sequences
from independent clones. From these estimates the sequence should contain approximately one error in 3000-6000 bases.

Analysis of the 117,143-bp Contig

Repetitive Elements

Low-complexity DNA sequences containing interspersed repetitive elements and simple repeating units or homopolynucleotide stretches were identified using cross_match (P. Green, unpubl.) and a library of human repeat sequences (A. Smit, unpubl.; see Methods) and masked prior to carrying out data-base searches. This analysis identified 138 individual Alu elements within the BRCA1 gene, comprising 41.5% of the 81-kb sequence (Fig. 1, Table 2). Ninety-four of these Alu elements are complete (containing >90% of the consensus length in the alignment), whereas the remaining 44 represent partial elements ranging in length from 69 to 231 nucleotides, with 70-100% sequence homology to the consensus sequences of Alu subfamilies (Deininger 1989; Jurka 1995; Batzer et al. 1996). The density of Alus is significantly lower in the regions flanking BRCA1, with 32 elements comprising 21.8% of the flanking sequence. In addition to the Alu sequences, 46 other interspersed repetitive elements comprise 6.6% of the contig, including fragments belonging to the L1, Mer, Mir, and Mlt families, as well as a complete long terminal repeat (LTR) element, pTR5 (La Mantia et al. 1989), at the 5' end of the BRCA1 contig. L1 fragments occur more frequently within the BRCA1 gene, whereas Mer and Mlt elements are more frequent in the flanking sequences (Table 2). A total of 41.9% of the BRCA1 contig is made up of interspersed repetitive elements. With the exception of a single Alu half element (FLAM) in the Rho7 gene (described below), none of the repetitive elements appear to be in naturally transcribed regions within the contig. The data masking procedure with cross_match is very efficient: Only a few regions of similarity were identified in the data-base searches by virtue of repeat elements.
Without masking, well over 300 sequence similarities to repetitive elements with blastn P values <10^-70 were identified throughout the contig.

Table 2. Interspersed Repeats in the BRCA1 Contig

                 BRCA1 gene                     Rest of locus
                 (nucleotides 3,344-84,436)     (nucleotides 1-3,343, 84,435-117,143)
Subfamily        Number    Bases   % region     Number    Bases   % region
AluFLAM A             1      106       0.1           0        0       0.0
AluFLAM C             3      358       0.4           0        0       0.0
AluFRAM               1       69       0.1           1      170       0.5
AluJb                12    2,925       3.6           4      773       2.1
AluJo                20    4,066       5.0           1      369       1.0
AluSc                 6    1,655       2.0           2      298       0.8
AluSg                13    3,202       4.0           2      354       1.0
AluSp                15    4,305       5.3           3      900       2.5
AluSq                14    3,155       3.9           4    1,046       2.9
AluSx                37    9,545      11.8           7    1,675       4.7
AluY                 16    4,255       5.3           8    2,086       5.8
Total Alu           138   33,641      41.5          32    7,671      21.3
L1                    7      969       1.2           1      107       0.3
MER                   5      374       0.5           2      478       1.3
MIR                  17    2,352       2.9           8      948       2.6
MLT                   1       95       0.1           2      412       1.1
SVA                   1      108       0.1           1       50       0.1
pTR5 (LTR)            0        0       0             1    1,849       5.1
Total non-Alu        31    3,898       4.8          15    3,844      10.5
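The percentages in Table 2 can be checked by simple arithmetic from the base counts and the coordinate ranges in the table header. The following is a minimal sketch of that check, not part of the original analysis; the helper name `percent_of_region` is hypothetical, and the flanking length is taken here as the contig length minus the gene length, which is an assumption about how the regions were partitioned.

```python
# Region lengths derived from the coordinate ranges given in Table 2.
BRCA1_GENE_LEN = 84_436 - 3_344 + 1      # 81,093 bp spanned by the BRCA1 gene
CONTIG_LEN = 117_143                     # full sequenced contig
REST_LEN = CONTIG_LEN - BRCA1_GENE_LEN   # assumed flanking length (36,050 bp)


def percent_of_region(bases, region_len):
    """Percentage of a region occupied by `bases` nucleotides, to one decimal."""
    return round(100.0 * bases / region_len, 1)


# Base counts copied from the "Total" rows of Table 2.
alu_gene = percent_of_region(33_641, BRCA1_GENE_LEN)       # Alu within BRCA1
non_alu_gene = percent_of_region(3_898, BRCA1_GENE_LEN)    # non-Alu within BRCA1
alu_rest = percent_of_region(7_671, REST_LEN)              # Alu in flanking sequence
non_alu_contig = percent_of_region(3_898 + 3_844, CONTIG_LEN)
all_repeats_contig = percent_of_region(33_641 + 7_671 + 3_898 + 3_844, CONTIG_LEN)

print(alu_gene, non_alu_gene, alu_rest, non_alu_contig, all_repeats_contig)
```

Under these assumptions the recomputed values agree with the 41.5% and 4.8% figures for the gene, the 6.6% non-Alu figure for the contig, and the 41.9% total quoted in the text.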
In addition to the interspersed repetitive elements, 68 simple sequence repeats (SSRs) of at least 10 nucleotides in length were detected by the sputnik program (C. Abajian, unpubl.; see Methods) (Fig. 1). Of these, 15 form six overlapping blocks with two or more different or alternating repeat unit types and 43 are individual repeats extending 15 bp or more. Thirty-two of the SSRs are not components of Alu elements (Table 3).

Comparison of Repeat Elements in BRCA1 and Other Genes

The distribution of repeat elements within BRCA1 was compared with that of other genes selected from version 95.0 of the GenBank data base (Benson et al. 1996). First, all entries in GenBank with the words "human" or "Homo sapiens" along with the word "complete" on the definition line were identified. Entries identified by this method but which contained no coding sequence were discarded. The remaining sequences were screened to eliminate redundant entries. If more than one entry contained DNA sequences of the same gene, the most recent entry was retained and all others were discarded. Finally, any entry lacking an indication of introns was also discarded. After this screening process, 326 entries were retained for analysis. Sequences between the first base of the first exon and the last base of the last exon in each entry were analyzed and compared with the sequence within the same boundaries of BRCA1.

Table 3. Non-Alu SSRs in the BRCA1 Contig

Repeat type      Begin          End            Gene         Intron   Marker
                 (nucleotide)   (nucleotide)
(AT) x 6           4587           4599         BRCA1        1
(TAAAC) x 2        6395           6409         BRCA1        2
(TTCT) x 3        22175          22189         BRCA1        3
(AAAT) x 4        23317          23333         BRCA1        5
(AT) x 21         28032          28075         BRCA1        7
(TA) x 5          28088          28099         BRCA1        7
(TA) x 24         28125          28174         BRCA1        7
(CA) x 7          28174          28188         BRCA1        7
(TA) x 6          28196          28208         BRCA1        7
(CA) x 7          28208          28222         BRCA1        7
(GT) x 6          37797          37810         BRCA1        12
(TG) x 19         42582          42621         BRCA1        12       D17S1323
(TA) x 5          42876          42887         BRCA1        12
(AAAT) x 3        44853          44868         BRCA1        12
(TTCC) x 5        53645          53665         BRCA1        14
(TCCTT) x 3       53666          53684         BRCA1        14
(CCTT) x 5        53687          53708         BRCA1        14
(CTTTT) x 2       53754          53768         BRCA1        14
(GCT) x 7         64072          64093         BRCA1        17
(TTG) x 15        69217          69263         BRCA1        19       D17S1322
(GT) x 25         75906          75956         BRCA1        20       D17S855
(TG) x 5          88724          88735         intergenic
(TG) x 8          89743          89759         intergenic
(GT) x 23         90565          90611         intergenic
(GGAA) x 3        94405          94418         intergenic
(TGAA) x 3        98327          98342         intergenic
(TCCCA) x 2      101127         101141         Rho7         3
(GCC) x 6        106513         106531         intergenic
(AGGGC) x 3      106804         106819         intergenic
(TAT) x 6        109336         109355         VAT1         1
(TTA) x 5        109784         109799         VAT1         1
(TG) x 6         113978         113990         VAT1         3'UTR
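To illustrate the kind of search that produces entries like those in Table 3, the sketch below reports perfect tandem repeats of at least 10 nucleotides using regular expressions. It is a simplified stand-in and does not reproduce the sputnik algorithm, whose scoring rules are not described in this text; the function names, the unit-length range, and the toy sequence are all assumptions made for the example.

```python
import re
from math import ceil


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter motif."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq, min_unit=2, max_unit=5, min_len=10):
    """Find perfect tandem repeats spanning at least `min_len` nucleotides.

    Returns sorted (begin, end, unit, copies) tuples with 1-based inclusive
    coordinates, mirroring the Begin/End columns of Table 3.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_unit, max_unit + 1):
        min_copies = ceil(min_len / k)
        # One captured unit followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):  # skip e.g. "TGTG", already found as "TG"
                copies = (m.end() - m.start()) // k
                hits.append((m.start() + 1, m.end(), unit, copies))
    return sorted(hits)


# Toy sequence (not BRCA1 data): a (TG) x 8 block and a (GGAA) x 3 block.
toy = "CCC" + "TG" * 8 + "AAC" + "GGAA" * 3 + "TT"
for begin, end, unit, copies in find_ssrs(toy):
    print(f"({unit}) x {copies}\t{begin}\t{end}")
```

A real SSR finder would normally tolerate mismatches and partial trailing units, which is one reason the spans in Table 3 are not always exact multiples of the unit length.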
The length of analyzed sequences for these entries ranged from 436 to 175,019 bases. Overall, Alus and miscellaneous types of repeats occur with about the same frequencies. Given the different distribution of intron lengths in the genes, and that repeat elements most frequently occur in intronic sequences, the percentage of repeats was compared with the percentage of intronic sequence for each entry. The percentage of repeats is modestly correlated with the percentage of intronic sequences [correlation (r^2) = 0.52]. BRCA1 is a relatively large gene with introns comprising 90.9% of its sequence; 46.3% of its sequence consists of repeat elements. Among the 14 large genes (>30,000 bases) in the analyzed set, the average percentage of intronic sequence is 89.8% (range: 70.3% to 98.41%) while the average repeat element content is 30.39% (range: 3.39% to 50.76%). Only three genes had higher overall Alu densities than BRCA1: apolipoprotein C-I (VLDL; GenBank accession no. M20903) with 60.8% Alus; Blym transforming gene (GenBank accession no. K01884) containing 53.7% Alus, and apolipoprotein C-IV (APOC4; GenBank accession no. HSU32576) with 41.3% Alu composition.

The BRCA1 Gene

Comparative analysis of the genomic sequence for BRCA1 and the cDNA sequence (Miki et al. 1994) revealed the positions of all 24 exons (Fig. 2A, Table 4).

Table 4. Positions of Exons for Genes in the BRCA1 Contig

Gene     Exon   Note   Start     End
BRCA1    1a            3344      3464
BRCA1    1b            3621      3998
BRCA1    2             4620      4718
BRCA1    3             12955     13008
BRCA1    5             22201     22278
BRCA1    6             23778     23866
BRCA1    7             24473     24612
BRCA1    8      a      28853     28958
BRCA1    9      a      31443     31488
BRCA1    10     a      32810     32886
BRCA1    11     a      33872     37297
BRCA1    12     a      37700     37788
BRCA1    13     a      46156     46327
BRCA1    14            52118     52244
BRCA1    15            54211     54401
BRCA1    16            57494     57804
BRCA1    17            61038     61125
BRCA1    18     a      64782     64859
BRCA1    19     a      65360     65400
BRCA1    20     a      71598     71681
BRCA1    21            77620     77674
BRCA1    22     a      79543     79616
BRCA1    23     a      81034     81094
BRCA1    24     b      82936     83872
BRCA1    24     b      84012     84436
Rho7     6      c      98031     96693
Rho7     5      c      100302    100054
Rho7     4             100673    100539
Rho7     3             101551    101442
Rho7     2             102774    102687
Rho7     1      c      103386    103285
VAT1     1             106659    106789
VAT1     2             109927    110134
VAT1     3             110520    110690
VAT1     4             110796    110885
VAT1     5             112774    113015
VAT1     6             113178    114716
IFP35    5             115216    115033
IFP35    4             115546    115440
IFP35    3             115846    115660
IFP35    2             116055    115949
IFP35    1             116275    116128

a Exons are shifted relative to those reported in the BRCA1 cDNA sequence (U14680) so that the introns begin with a GT donor sequence and end with an AG acceptor sequence.
b The 3'-UTR region is predicted by alignment with available human 3'-UTR sequence data (U68041) and a mouse cDNA sequence (U36475). A poly(A) signal sequence is located at nucleotide 84413.
c In the Rho7 gene the final exon (6) is predicted by alignment with EST sequences and contains a poly(A) signal sequence. Exon 5 is possibly truncated, and the first exon positions are based on alignment with the Rho7 cDNA sequence (X95456).

Characterization of an aberrant BRCA1 cDNA clone in the original report (Miki et
al. 1994) led to the misidentification of an inserted Alu element as exon 4. Not normally found in BRCA1 transcripts, insertion of this Alu would lead to introduction of a STOP codon. Hence, BRCA1 exons and introns are numbered 1a, 1b, 2, 3, 5, 6, and so on. Most of the exon/intron boundaries were accurately predicted in the published cDNA sequence (HSU14680); however, 11 boundaries required adjustment with the available genomic data to preserve the nearly invariant consensus 5' GT and 3' AG dinucleotides of intron boundaries (Table 4; GenBank accession no. L78833). The coding regions between the two sequences were in perfect agreement at both the nucleotide and amino acid levels.

Figure 2 Map of the BRCA1 gene. (A) Structure of the gene predicted by alignment with the human BRCA1 cDNA (U14680), dark boxes. Below this alignment are exons predicted by grail and genefinder from the complete genomic sequence. The shaded boxes at the bottom represent alignments with five mouse BRCA1 cDNAs (U32446, U35641, U31625, U36475, and U68174). (B) Expansion of the 5' end of BRCA1 showing the alternate starting exons of the BRCA1 gene and exons 1a and 1b of the 1A1-3B pseudogene. (C) Expansion of the 3' end of the BRCA1 gene showing the 3' UTRs, corresponding mouse cDNA sequences, and poly(A) signal sequence sites.

The BRCA1 gene is conserved in mammals (Miki et al. 1994), and five complete cDNA sequences from the mouse have been determined (GenBank accession nos. U31625, U32446, U35641, U36475, and U68174). All these sequences show a high degree of similarity when aligned against the genomic human sequence using the program cross_match (P. Green, unpubl.). Overall the human and mouse sequences are 76% identical at the nucleotide level with exons 2 (87%), 3 (90%), 5 (90%), 12 (85%), 19 (91%), and 21 (87%) containing the highest identities and the identities of the other exons ranging from 68% to 83%. The U36475 cDNA sequence identified two additional 3' regions of high similarity between human and mouse (Fig. 2C), one at nucleotides 83,326-83,545 (two windows of 95% identity at nucleotides 83,326-83,349, and 74% identity at nucleotides 83,432-83,545), and the other at nucleotides 84,012-84,436 (80% identity). These homologies overlap the 833-bp
segment of 3'-UTR sequence (96% identity to nucleotides 83,061-83,893) derived from a human placental BRCA1 cDNA clone (Friedman et al. 1995b; GenBank accession no. U68041). The presence of a putative poly(A) signal sequence at nucleotide 84,416 and the high similarity of the human and mouse sequences suggests that nucleotides 83,061-84,416 correspond to the 3' UTR.

A 3798-nucleotide fragment containing putative promoter sequences for BRCA1 has been cloned and sequenced (Xu et al. 1995; GenBank accession no. U37574). Our data are virtually identical to this sequence: 13 mismatches and one indel were observed; nine of the differences occur at Ns in the U37574 sequence. Analysis of this region originally suggested that it represents a bidirectional promoter controlling expression of BRCA1 and 1A1-3B, a B-box containing protein with homology to CA125 (Brown et al. 1994). However, subsequent characterization of a 300-kb region revealed a duplication of the 5' ends of both BRCA1 and 1A1-3B (Brown et al. 1996). Although the full-length genes are separated by ~50 kb, 5' pseudogene sequences of each exist within 550 bp of the transcription initiation site of the other but in the opposite orientation. The genomic BRCA1 and U37574 sequences are identical to the 1A1-3B pseudogene exon 1a and 1b sequences (Brown et al. 1996; Fig. 2B) but differ from the alternative first exon sequences in 1A1-3B transcripts characterized from an ovarian tumor cell line (exon 1b; Campbell et al. 1994) and the myeloblast cell line KG-1 (exon 1a; N. Nomura, unpubl.; GenBank accession no. D30756).

Promoter and Enhancer Elements

BRCA1 transcription initiates from either of two sites separated by 277 bp that encode alternative first exons 1a and 1b. Both transcription initiation sites are utilized in most tissues, although there is preferentially higher expression of the exon 1a transcript in mammary gland and of the exon 1b transcript in placenta (Xu et al. 1995).
TATA boxes are not evident in the sequences 5' of either exon 1a or 1b (Brown et al. 1994; Xu et al. 1995); however, both have features similar to initiator elements (Inr) and reside in GC-rich regions characteristic of TATA-less promoters (Azizkhan et al. 1993). Furthermore, GC boxes (GGGCGG), which bind the Sp1 transcription factor and have been shown to be required for interaction of transcription factor IID (TFIID) with TATA-less promoters, are present 5' of exon 1a [163 and 233 nucleotides upstream of the exon 1a transcription initiation site (nucleotide 3344), i.e., at positions -163 and -233], exon 1b [-7, -46, -130 from exon 1b initiation (nucleotide 3621)] and overlapping exon 1a (-200 and -248 from exon 1b).

Other potential regulatory elements in the sequences preceding exons 1a and 1b were identified using SIGNALSCAN (Prestridge 1991) and the TRANSFAC data base (Wingender 1994) to identify transcription factor binding sites. Among those identified in the region preceding exon 1a are cyclic AMP regulatory element binding protein (CREB) at position -176, CCAAT binding factor (-149, -340), serum response factor (SRF: -148), polyomavirus enhancer A binding protein 3 (PEA3: -183), and pituitary transcription factor-1 (Pit-1: -6); sites preceding exon 1b are CREB (-59) and activator protein 2 (AP2: -10). Sequence alignment was also used to identify potential progesterone (PRE) and estrogen (ERE) response elements. With the exception of two imperfect ERE elements in introns 2 (nucleotide 7238) and 7 (nucleotide 25,455), only ERE half sites were detected, which although sufficient for estrogen receptor binding do not confer hormone-inducible transcriptional activation. A single putative PRE element was identified at nucleotide 1222, ~2 kb upstream of the start of transcription of exon 1a and exon 1b, and two matches were embedded in the coding regions of exon 2 (nucleotide 4668) and exon 11 (nucleotide 37,063).
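The positions quoted above are offsets relative to a transcription initiation site. As a minimal sketch of how such offsets can be derived, the code below scans a sequence for a motif on either strand and subtracts the initiation coordinate. It is not the SIGNALSCAN or TRANSFAC procedure; the function names are hypothetical, the toy sequence is invented for the example, and measuring from the first base of each hit is an assumed convention that the text does not specify.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def motif_offsets(seq, motif, tss):
    """Offsets of motif hits on either strand relative to a 1-based TSS.

    A negative offset means the hit starts upstream of the initiation site.
    Assumed convention: the offset is measured from the first base of the hit.
    """
    seq = seq.upper()
    motif = motif.upper()
    targets = {motif, reverse_complement(motif)}  # a set avoids double counting palindromes
    offsets = []
    for target in targets:
        start = seq.find(target)
        while start != -1:
            offsets.append((start + 1) - tss)     # convert to 1-based, then shift
            start = seq.find(target, start + 1)   # allow overlapping hits
    return sorted(offsets)


# Toy sequence (not BRCA1 data): one GC box on each strand upstream of position 31.
toy = "AT" * 5 + "GGGCGG" + "AT" * 5 + "CCGCCC" + "ATATAT"
print(motif_offsets(toy, "GGGCGG", tss=31))
```

Applied to the contig, the same calculation would use nucleotide 3344 or 3621 as the initiation coordinate for exon 1a or exon 1b, respectively.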
Other Genes

In addition to BRCA1, five genes were identified within the 117,143-bp sequenced region (Fig. 1; Table 4). Most of these were known to map close to BRCA1 and expressed tags had been identified in the search for BRCA1. Two are complete genes (Rho7, VAT1), one is incomplete (a 3' portion of IFP 35), and two are pseudogenes (rpL21 and two 5' exons of 1A1-3B were identified). These genes were identified by a combination of homology searching (blastx and blastn; Altschul et al. 1990); searches against the nonredundant (nr) data base at the National Center for Biotechnology Information (NCBI) and blastn searches against dbEST (the data base of expressed sequence tags, using the e-mail or blastn client servers at NCBI), and exon prediction using grail (version 1.3; Xu et al. 1994)
and genefinder (C. Wilson and P. Green, unpubl.).

Rho7: The Rho7 gene (nucleotides 96,693-103,386) was identified initially as a homolog to the rho family of GTP-binding proteins (Chardin 1991) by a similarity search with blastp against the SWISS-PROT data base using deduced amino acid sequences from grail predicted exons. Recent searches using blastn against the nr data base identified the putative cDNA corresponding to this gene (X95456; P. Chardin, unpubl.); only two bases out of 684 were in conflict. The reported cDNA sequence contains the coding sequence only; neither 5' nor 3' UTRs are in GenBank. A portion of the potential 5' UTR was identified by genefinder (Fig. 3). Other exon/intron boundaries were identical when genefinder and similarity analysis (alignment against X95456) were compared. Grail 1.3 identified these exons as well but did not predict any 5' UTR sequence, and split the 3' terminal coding exon into two smaller exons. This gene is also highly similar to a mouse EST homolog (R74747; D. Beier and K. Brady, unpubl.); >90% identity is observed between the two nucleotide sequences. The only significant difference is that the mouse EST lacks the first exon relative to the human germ-line sequence. None of the rho homologs identified any additional exons 5' or 3' to the coding sequence; both grail and genefinder predicted the end of the last exon to precede the terminal A of the TGA stop codon by four bases to give a GT dinucleotide acceptor sequence in the intron. Five EST sequences (R42098, N66093, R15355, H48939, with 97-99% identity over the entire lengths of the clones, and R74748, a mouse cDNA that shows 74% identity) map close to the 3' end of the Rho7 gene. R42098, N66093, and R15355 form one region of similarity from nucleotides 96,693-98,031, which contains two potential poly(A) signal sites at nucleotides 96,710 and 97,592.
R42098 (343-bp) and R15355 (457-bp) are derived from end sequences of clone yf90b08 (Washington University-Merck EST Project, unpubl.), which likely spans the 1338-bp region from nucleotides 96,693 to 98,031. Another region of similarity, aligning with H48939 and R74748, would extend the exon containing the TGA stop codon an additional ~240-400 bases. This is plausible because neither a GT

Figure 3 Map of the region from nucleotides 96,000 to 105,000 containing Rho7. The arrows at the top represent similar sequences identified by searches against the expressed sequence data base (dbEST). The identity of the pairwise alignment with the 117,143-bp contig is shown in parentheses. ESTs T58202, X95282, H96700, N66093, and yf90b08 are of human origin; R74747 and R74748 are of mouse origin; and D23963 is from rice. The next line shows the deduced structure of the Rho7 gene. Dark boxes represent the coding region of the gene as determined by alignment with the Rho7 cDNA seque
