`structure and conservation of marsupial and
`monotreme genomes
`
Elliott H. Margulies*, NISC Comparative Sequencing Program*†‡, Valerie V. B. Maduro*, Pamela J. Thomas†, Jeffery P. Tomkins§, Chris T. Amemiya¶, Meizhong Luo‖, and Eric D. Green*†**

*Genome Technology Branch and †NISC, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892; §Clemson University Genomics Institute, Department of Genetics and Biochemistry and Life Science Studies, Clemson University, Clemson, SC 29634; ¶Benaroya Research Institute at Virginia Mason, Seattle, WA 98101; and ‖Arizona Genomics Institute, Department of Plant Sciences, University of Arizona, Tucson, AZ 85721
`
`Communicated by Francis S. Collins, National Institutes of Health, Bethesda, MD, November 18, 2004 (received for review August 30, 2004)
`
Sequencing and comparative analyses of genomes from multiple vertebrates are providing insights about the genetic basis for biological diversity. To date, these efforts largely have focused on eutherian mammals, chicken, and fish. In this article, we describe the generation and study of genomic sequences from noneutherian mammals, a group of species occupying unusual phylogenetic positions. A large sequence data set (totaling >5 Mb) was generated for the same orthologous region in three marsupial (North American opossum, South American opossum, and Australian tammar wallaby) and one monotreme (platypus) genomes. These ancient mammalian genomes are characterized by unusual architectural features with respect to G + C and repeat content, as well as compression relative to human. Approximately 14% and 34% of the human sequence forms alignments with the orthologous sequence from platypus and the marsupials, respectively; these numbers are distinctly lower than those observed with nonprimate eutherian mammals (45–70%). The alignable sequences between human and each marsupial species are not completely overlapping (only 80% common to all three species), nor are the platypus-alignable sequences completely contained within the marsupial-alignable sequences. Phylogenetic analysis of synonymous coding positions reveals that platypus has a notably long branch length, with the human–platypus substitution rate being on average 55% greater than that seen with human–marsupial pairs. Finally, analyses of the major mammalian lineages reveal distinct patterns with respect to the common presence of evolutionarily conserved vertebrate sequences. Our results confirm that genomic sequence from noneutherian mammals can contribute uniquely to unraveling the functional and evolutionary histories of the mammalian genome.
`
comparative genomics | genome sequencing | genome analysis | phylogenetics | mammalian evolution
`
3354–3359 | PNAS | March 1, 2005 | vol. 102 | no. 9
www.pnas.org/cgi/doi/10.1073/pnas.0408539102
GENETICS

Freely available online through the PNAS open access option.

Abbreviations: NISC, National Institutes of Health Intramural Sequencing Center; mya, million years ago; N.A., North American; S.A., South American; BAC, bacterial artificial chromosome; TBA, THREADED BLOCKSET ALIGNER; MCS, multispecies conserved sequence; 4D, 4-fold degenerate; SINEs, short interspersed nucleotide elements; LINEs, long interspersed nucleotide elements.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database [accession nos. AC127465, AC129065, AC129066, AC129885, AC142561, AC144364, AC144365, AC144600, AC144690, AC144691, AC144755, and AC144756 (N.A. opossum); AC147869, AC147870, AC147871, AC147872, AC147873, AC147874, AC148151, and AC148214 (S.A. opossum); AC127464, AC129882, AC129883, AC129884, AC130185, AC138553, AC144363, AC144689, AC144753, AC144754, AC144788, AC146535, and AC146754 (platypus); and AC145041, AC145042, AC145183, AC145184, AC145249, AC145250, AC145407, AC145408, AC145409, and AC145841 (wallaby)]. See Table 3, which is published as supporting information on the PNAS web site, for specific versions of all GenBank accession nos. used in this study.

‡National Institutes of Health Intramural Sequencing Center (NISC) Comparative Sequencing Program: Leadership provided by Robert W. Blakesley, Gerard G. Bouffard, Nancy F. Hansen, Baishali Maskeri, and Jennifer C. McDowell.

**To whom correspondence should be addressed at: National Human Genome Research Institute, National Institutes of Health, 50 South Drive, Building 50, Room 5222, Bethesda, MD 20892. E-mail: egreen@nhgri.nih.gov.

© 2005 by The National Academy of Sciences of the USA

Comparisons of genome sequences from evolutionarily diverse species are central to decoding the functions of vertebrate genomes (1). Of particular interest is the use of highly diverged species for detecting and characterizing sequences under purifying selection (2). Large-scale sequence comparisons have been reported for eutherian (commonly referred to as "placental") mammals (3) or fish (4), with the most detailed studies to date emphasizing human–rodent comparisons (5, 6). We previously described our efforts to sequence the same orthologous regions from large collections of vertebrates (7, 8) and to perform multispecies sequence comparisons (9). These analyses have helped to refine phylogenetic relationships (7), to gain insight about the mutational process (10, 11), and to reveal differences between eutherian mammals and other vertebrates (e.g., birds and fish) with respect to their utility for detecting highly conserved regions in the human genome (9). However, these studies also demonstrate that for comparative sequence analyses, the optimal phylogenetic distances among species vary, depending on the question(s) being addressed [with the distance between humans and eutherian mammals sometimes being too close, and that between humans and birds (or fish) sometimes being too far].
Within this large phylogenetic gap between eutherian mammals and birds reside the marsupials and monotremes (12, 13). These metatherian and prototherian mammals diverged before the eutherian radiation, estimated at 185 and 200 million years ago (mya), respectively (14). Indeed, these divergence dates, as well as the origins of prototherian mammals relative to metatherian mammals, remain a source of scientific debate, in part because of insufficient molecular data (13, 15–17). Until recently, very little marsupial or monotreme DNA sequence was available in public databases. Although comparative studies involving small amounts of genomic sequence from a marsupial species [the stripe-faced dunnart (Sminthopsis macroura)] have been described (18), no comparisons involving large, contiguous blocks of marsupial or monotreme sequence have been reported to date.
In this article, we present the results of comparative sequence analyses involving >5 Mb of sequence from four noneutherian mammals. Specifically, we describe the features of their genomes, provide insights about their phylogenetic relationships, and reveal similarities and differences among mammalian lineages with respect to the presence of evolutionarily conserved vertebrate sequences.

Materials and Methods
Genomic Sequence Data Set. Genomic segments orthologous to a 1.9-Mb region on human chromosome 7q31.3, encompassing the cystic fibrosis transmembrane conductance regulator (CFTR) gene (referred to as the "greater CFTR region"), were isolated from the North American (N.A.) opossum, South American (S.A.) opossum, Australian tammar wallaby, and duckbilled platypus, and the segments were subjected to shotgun sequencing, as detailed in the supporting information, which is published on the PNAS web site. Sequences from an additional 23 vertebrates were generated and used for comparative analyses; the sequence data [including a listing of individual GenBank records for each bacterial artificial chromosome (BAC), assimilated and annotated sequences for each species, and multispecies sequence alignments (see below)] are available in the supporting information and at www.nisc.nih.gov/data.

Table 1. General characteristics of comparative sequence data set

Species       No. sequenced BACs  No. sequencing gaps*  No. mapping gaps†  Total nonredundant sequence, Mb  Amount relative to human,‡ Mb
N.A. opossum  12                  3                     3                  1.63                             1.36
S.A. opossum  8                   7                     7                  1.17                             1.19
Wallaby       10                  5                     5                  1.35                             1.18
Platypus      13                  0                     0                  1.26                             1.65

*Gaps reflecting missing sequence in the assembly of shotgun sequence data from an individual BAC; these are typically 100 bp or less. See the supplement in ref. 7 for details.
†Gaps reflecting the lack of BAC coverage across an interval. See the supplement in ref. 7 for details.
‡The amount of human sequence in or between pair-wise alignments for the covered portions of each species’ sequence; this value includes an estimate of sequence that might be proximal to the first and distal to the last alignment (utilizing the estimated degree of compression relative to human for that species).
`
Repeat Identification. Repetitive elements in noneutherian mammalian sequences were identified by using a RECON-based approach (19), as described in the supporting information. Importantly, this approach was tuned to correctly detect repetitive elements in the human sequence at high specificity (99.8%) but at the cost of a lower sensitivity (63%). In turn, the identified repeats were used with REPEATMASKER (July 13, 2002; www.repeatmasker.org) and the standard REPEATMASKER mammalian repeat libraries to detect and mask all repetitive sequences. This process involved adding the identified repeats in the noneutherian mammalian sequence to the standard artiodactyl repeat library and then running REPEATMASKER with the -cow option.
`
Generation and Characterization of Sequence Alignments. A multisequence alignment of the assembled sequences from 27 vertebrates was generated by using the THREADED BLOCKSET ALIGNER (TBA) (20). The resulting alignment then was "projected" onto the human reference sequence for subsequent analyses (see the supporting information for details). A portion of the sequenced interval (541 kb distributed across nine distinct regions; see the supporting information) was selected where there was complete sequence coverage in a subset of species (chimpanzee, cat, cow, mouse, wallaby, N.A. opossum, S.A. opossum, platypus, and chicken). For each human–species pair-wise combination, the number of human-referenced positions of TBA-aligned bases was determined; these data then were used to calculate the number of bases in alignments for each human–species combination.
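The counting step described above can be illustrated with a minimal sketch. It is not the code used in this study: it assumes that the human-referenced aligned blocks for one species are already available as half-open coordinate intervals, and the names `merge_intervals`, `aligned_fraction`, and the toy coordinates are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in human-reference coordinates


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so that no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def aligned_fraction(aligned: Iterable[Interval], region: Interval) -> float:
    """Fraction of the human region covered by aligned blocks from one species."""
    region_start, region_end = region
    covered = 0
    for start, end in merge_intervals(aligned):
        # Clip each block to the analyzed region before counting bases.
        lo, hi = max(start, region_start), min(end, region_end)
        if hi > lo:
            covered += hi - lo
    return covered / (region_end - region_start)


# Hypothetical toy data: a 1,000-bp region and three aligned blocks.
toy_blocks = [(0, 100), (50, 200), (900, 1100)]
toy_fraction = aligned_fraction(toy_blocks, (0, 1000))
```

Merging before counting matters because blocks projected onto the human reference can overlap; summing raw block lengths would overstate the alignable fraction.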
`
Estimating Phylogenetic Branch Lengths. A "virtual" multisequence alignment consisting solely of synonymous [4-fold degenerate (4D)] coding positions was generated by using the human-referenced annotations. Sites that fell within sequence gaps or that were no longer synonymous (because of changes in the first two bases) were treated as missing data. Substitution rates were estimated from this multisequence alignment by maximum likelihood with the PHAST package (21). A generally accepted tree topology for the analyzed species was used (7, 22). The most general reversible substitution model (REV) was used, and no molecular clock was assumed. Errors associated with the resulting branch length calculations were estimated by bootstrapping (both nonparametric and parametric methods; see the supporting information), with the tree topology fixed.
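As a rough illustration of how a 4D-only alignment might be assembled (a sketch under stated assumptions, not the procedure or software used here), the following code selects third codon positions whose first two bases define a 4-fold degenerate family in the standard genetic code. It assumes an in-frame, gap-padded coding alignment with human as reference, and it adopts one simple reading of "no longer synonymous": a site is marked missing when a species' first two codon bases differ from the human ones. The function names are hypothetical.

```python
from typing import Dict, List

# First two codon bases for which any third base encodes the same amino acid
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_positions(reference_cds: str) -> List[int]:
    """Return 0-based indices of third codon positions that are 4D in the reference."""
    positions = []
    for i in range(0, len(reference_cds) - 2, 3):
        if reference_cds[i:i + 2].upper() in FOURFOLD_PREFIXES:
            positions.append(i + 2)
    return positions


def fourfold_columns(aligned_cds: Dict[str, str], reference: str = "human") -> Dict[str, str]:
    """Build a 4D-only alignment; gapped or non-comparable sites become 'N'."""
    ref = aligned_cds[reference].upper()
    sites = fourfold_positions(ref)
    result = {}
    for species, seq in aligned_cds.items():
        seq = seq.upper()
        column = []
        for pos in sites:
            # Assumption: keep a site only if the first two codon bases match
            # the reference and the third base is an unambiguous nucleotide.
            same_prefix = seq[pos - 2:pos] == ref[pos - 2:pos]
            base = seq[pos]
            column.append(base if same_prefix and base in "ACGT" else "N")
        result[species] = "".join(column)
    return result


# Hypothetical toy alignment (three codons per species).
toy_alignment = {
    "human": "CTGAAAGGC",
    "speciesA": "CTAAAAGAC",
    "speciesB": "CT-AAAGGT",
}
toy_4d = fourfold_columns(toy_alignment)
```

The resulting columns could then be passed to a maximum-likelihood program; that step is not shown because it depends on the specific software and options chosen.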
`
Examining Lineage Specificity of Multispecies Conserved Sequences (MCSs). MCSs were identified by using the multisequence alignment generated with sequences from 27 vertebrate species (8). A portion of the sequenced interval (571 kb distributed across seven separate regions; see the supporting information) was selected where there was complete sequence coverage in a subset of species (cat, dog, cow, pig, rat, mouse, N.A. opossum, wallaby, and platypus). Note that this limited data set is distinct from the one above used for characterizing the multisequence alignments. Each of the nine species’ sequences was analyzed for the presence of the above-identified MCSs; specifically, each MCS in the relevant interval was scored as being present or absent based on BLASTZ analysis (see the supporting information).
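The presence/absence scoring can be sketched as a simple interval-overlap test. This is an illustrative simplification, not the pipeline used in this study: it assumes the pair-wise alignments have already been reduced to human-coordinate intervals per species, and the names `overlaps` and `score_mcs_presence` and all coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in human coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def score_mcs_presence(
    mcss: List[Interval],
    alignments: Dict[str, List[Interval]],
) -> Dict[str, List[bool]]:
    """For each species, mark every MCS as present (True) or absent (False)."""
    return {
        species: [any(overlaps(mcs, block) for block in blocks) for mcs in mcss]
        for species, blocks in alignments.items()
    }


# Hypothetical toy data: three MCSs and aligned blocks for two species.
toy_mcss = [(100, 150), (400, 420), (900, 950)]
toy_alignments = {
    "wallaby": [(90, 160), (410, 500)],
    "platypus": [(120, 130)],
}
toy_presence = score_mcs_presence(toy_mcss, toy_alignments)
```

A stricter variant could require a minimum overlap length or identity; the threshold actually applied is described in the supporting information and is not reproduced here.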
`
Results
Comparative Sequence Data Set. We generated large blocks of high-quality sequence from three marsupial species (N.A. opossum, S.A. opossum, and wallaby) and one monotreme species (platypus). All sequences correspond to genomic segments orthologous to the greater CFTR region on human chromosome 7q31.3 (7), with 1.17–1.63 Mb of nonredundant sequence generated from each species (Table 1). Based on comparisons with available genome-wide human (23), mouse (5), and rat (6) sequence, the greater CFTR region is close to average with respect to general genomic properties (e.g., repeat content, G + C content, fraction of coding sequence, and synonymous substitution rate). The resulting sequences from the four noneutherian mammals were analyzed individually and also compared with corresponding sequences from 23 additional vertebrates (7, 8).
`
Genomic Architecture. Analysis of the orthologous genes in this region reveals no gross differences in the content, order, orientation, or intron-exon structure between human and the noneutherian mammals (note that there are two instances of a missing exon within noneutherian sequence, but these appear to be due to gaps in sequence coverage; data not shown). However, examination of several architectural features associated with each species’ sequence uncovered a number of differences. For example, the size of this genomic region (relative to human) varies by as much as 24% among the noneutherian mammals (Table 2). Specifically, evidence of both genome compression (e.g., 24% in platypus) and expansion (e.g., 17% and 15% in N.A. opossum and wallaby, respectively) is seen; these findings are generally consistent with previous estimates of genome sizes (refs. 24 and 25; also see www.genomesize.com).
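The relative-size values can be approximated from the Table 1 columns, following the definition in the Table 2 footnote (species sequence length divided by the corresponding amount of human sequence). The sketch below is illustrative only; the function names are hypothetical, and because the Table 1 entries are rounded to two decimals, the computed ratios need not match the published Table 2 values exactly.

```python
# Values transcribed from Table 1 (Mb): total nonredundant sequence and the
# corresponding amount of human sequence for each species.
TABLE1_MB = {
    "N.A. opossum": (1.63, 1.36),
    "S.A. opossum": (1.17, 1.19),
    "Wallaby": (1.35, 1.18),
    "Platypus": (1.26, 1.65),
}


def relative_size(species_mb: float, human_mb: float) -> float:
    """Ratio of species sequence length to the corresponding human length."""
    return species_mb / human_mb


def percent_change(ratio: float) -> float:
    """Positive values indicate expansion, negative values compression."""
    return (ratio - 1.0) * 100.0


ratios = {name: relative_size(*values) for name, values in TABLE1_MB.items()}
changes = {name: percent_change(ratio) for name, ratio in ratios.items()}
```

With these rounded inputs, the platypus ratio is about 0.76, corresponding to roughly 24% compression relative to human.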
`
`
`
Table 2. Architectural features of different species’ sequences

Species       G + C total*  G + C nonrepetitive sites*  G + C synonymous 4D sites*  Relative size†  Percentage repetitive‡
Human         0.384         0.369                       0.432                       NA              40.3
Cat           0.383         0.372                       0.434                       0.95            36.4
Pig           0.377         0.366                       0.455                       0.92            31.9
Mouse         0.401         0.391                       0.479                       0.90            32.6
N.A. opossum  0.358         0.358                       0.415                       1.17            43.2
S.A. opossum  0.358         0.358                       0.380                       0.99            34.2
Wallaby       0.373         0.374                       0.412                       1.15            37.0
Platypus      0.459         0.457                       0.642                       0.76            44.9
Chicken       0.412         0.407                       0.423                       0.44            6.0
Fugu          0.486         0.485                       0.721                       0.16            2.3

The noneutherian mammals (shown in boldface in the original table) are N.A. opossum, S.A. opossum, wallaby, and platypus.
*Fraction of G + C bases in the entire sequence (total), the nonrepetitive portion of sequence (i.e., sequence not masked by REPEATMASKER), and synonymous 4D sites (the third position of codons that can be any base and still code for the same amino acid).
†Ratio of sequence length in each species to the amount of corresponding human sequence (as defined in Table 1).
‡Percentage of sequence masked by REPEATMASKER.
`
The asserted correlation between genome size and repeat content (4, 26) prompted us to investigate the amount and composition of repetitive elements within each species’ sequence. Because repetitive sequences in noneutherian mammals have not been fully characterized, this analysis first required assembling repeat libraries for each marsupial and monotreme species (see Materials and Methods). Fig. 1 shows a summary of the content and types of repeats in each species’ sequence, with data from several other vertebrates provided for comparison. Note the considerable variation in total repeat content among these species and the lack of correlation with genome size (relative to human; see Table 2). Specifically, the orthologous platypus genomic region is smaller than the human region yet contains a larger proportion of repetitive sequences; similarly, the wallaby genomic region is larger than the human region yet contains a smaller proportion of repetitive sequences. Another finding is the relatively large proportion of short interspersed nucleotide elements (SINEs) in the platypus sequence (27, 28), markedly different from other vertebrate sequences. The latter is consistent with the PCR-based identification of an abundant SINE repeat within monotreme genomes (J. A. M. Graves and P. J. Kirby, personal communication).

Fig. 1. Comparison of the content and types of repetitive elements among different species’ sequences. Sequences from the orthologous regions of the indicated species’ genomes were analyzed by REPEATMASKER, allowing detection and quantification of the indicated types of repetitive elements. The data for the noneutherian mammals are highlighted for emphasis. SINEs, short interspersed nucleotide elements; LINEs, long interspersed nucleotide elements.

The overall G + C content is similar among the three marsupial sequences (35.8–37.3%; see Table 2), which is slightly lower than that of the orthologous human genomic region (38.4%). In contrast, the overall G + C content of the platypus sequence is notably high (45.9%), more like that seen with the orthologous Fugu genomic region (48.6%). A similarly high G + C content for platypus is seen in the nonrepetitive sites and at synonymous 4D sites (see Table 2). Examining the distribution of G + C content in 1-kb windows across the noneutherian sequences reveals the same general trends (see the supporting information).
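A windowed G + C calculation of the kind mentioned above can be sketched as follows. This is a minimal illustration rather than the analysis code used here; it assumes non-overlapping windows, ignores ambiguous bases such as N, and uses hypothetical function names.

```python
from typing import List, Optional


def gc_fraction(seq: str) -> Optional[float]:
    """Fraction of G + C among unambiguous bases; None if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def windowed_gc(seq: str, window: int = 1000) -> List[Optional[float]]:
    """G + C fraction in consecutive, non-overlapping full-length windows."""
    return [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


# Hypothetical toy sequence split into two 4-bp windows.
toy_windows = windowed_gc("GGGGAAAA", window=4)
```

Any trailing partial window is dropped in this sketch so that every reported value is based on the same number of positions.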
`
Multispecies Sequence Comparisons. Analyses of a multisequence alignment generated by using data from 27 vertebrates revealed notable patterns of sequence conservation. For example, the fraction of the human sequence forming alignments with nonprimate eutherian mammals is typically 45–70% (Fig. 2A) (7); these alignments include both neutrally evolving and functionally constrained portions of the sequence. This fraction of alignable sequence is significantly lower for the noneutherian mammals (14–34%), with the decrease mostly reflecting fewer alignments within nonannotated regions (i.e., those reflecting sequences not thought to be genes or repeats). A substantially larger amount of noneutherian sequence could be aligned to the human sequence by generating a true multisequence alignment with the program TBA (20) as opposed to simple pair-wise alignments (Fig. 2A, purple bars). In the case of eutherian mammals (where no such difference is seen), it is thought that both pair-wise and multisequence alignments contain virtually all neutrally evolving sequence (5). However, with the noneutherian mammals, the dramatic difference likely reflects a larger amount of neutrally evolving sequence within the multisequence alignment; it remains to be determined whether this accounts for all neutrally evolving sequence.

Fig. 2. Patterns of sequence conservation among different vertebrates. (A) The fraction of human sequence forming alignments with sequences from each of the indicated species is shown, broken down for four annotated categories. The additional alignable sequence (indicated in purple, see text for details) found exclusive to the TBA-generated multisequence alignment (20) falls largely within nonexonic regions. For data that include a larger set of vertebrates, see the supporting information. (B) The relationships between the fraction of human sequence aligned and estimated branch length from human (calculated as substitutions per site) are shown for the indicated vertebrate species.

We examined more closely the relationships among the human-alignable portions of each species’ sequence, focusing our analyses on a 571-kb portion of the region with complete sequence coverage in a representative subset of species (see Materials and Methods). Although each marsupial sequence individually aligns with ≈34% of the human sequence, only 27% of the human sequence aligns with all three marsupial sequences, indicating that the human-alignable portions of each marsupial sequence are not completely overlapping. Similarly, whereas the platypus sequence aligns with ≈14% of the human sequence, only 11% of the human sequence aligns with all four noneutherian sequences, indicating that 21% of the human sequence that aligns with the platypus sequence is distinct from that aligning to all three marsupial sequences. These results demonstrate that the human-alignable sequence from more distantly related species is not fully contained within that from more closely related species. This finding also was observed with additional combinations of species (i.e., cat and mouse, but not cow; see the supporting information). These nonoverlapping alignable sequences may represent neutrally evolving, lineage-specific insertions and deletions.
To better understand the phylogenetic relationships among the noneutherian mammals, as well as their relationship to other vertebrate species, we calculated the substitution rates at synonymous coding positions within the multisequence alignment. These rates then were used to scale the branch lengths of the phylogenetic tree depicted in Fig. 3; note that the total branch lengths between human and each species also are indicated (with all possible pair-wise branch lengths provided as supporting information). The synonymous substitution rate (per site) between the two opossum species is 0.09, whereas that between wallaby and either opossum species is 0.18. These rates are similar to those observed with primate–primate comparisons. Interestingly, platypus has a notably long branch length, with the platypus–marsupial substitution rate averaging 0.85. Also note that the human–platypus substitution rate is 55% higher (on average) than that for all human–marsupial pairs, providing further evidence for the considerable divergence of monotremes relative to both the marsupial and eutherian mammals (16). The synonymous substitution rates we calculated for the mouse and rat sequences are similar to the genome-wide estimates (5, 6), whereas that for the chicken sequence is substantially lower than the genome-wide estimate (29). The latter is likely attributable to differences in the methods and assumptions used and/or characteristics of the respective data sets (i.e., pair-wise whole-genome analyses vs. multisequence targeted analyses).

Fig. 3. Phylogenetic tree of vertebrate species. By using the generated 27-species multisequence alignment, branch lengths were calculated based on analysis of synonymous coding positions. The branch lengths (as substitutions per synonymous site) between human and each species are listed (with additional pair-wise branch lengths provided in the supporting information). The last common ancestor among the catarrhine primates (A) is estimated at 25 mya (36, 37), between the rodents and primates (B) at 75 mya (5, 6), between eutherians and metatherians (C) at 185 mya (14), between monotremes and other therians (D) at 200 mya (14), and between mammals and birds (E) at 310 mya (13).

These findings reinforce the distinct phylogenetic positions of marsupials and monotremes within the vertebrate and mammalian radiations (12, 13). In addition, the simultaneous examination of alignment and branch length properties of each species’ sequence compared to human (Fig. 2B) reveals a clear grouping of the marsupials at an intermediate position between the eutherian mammals and birds, consistent with the purported phylogenetic relationships. In contrast, the grouping of platypus and chicken in this analysis is surprising based on the significant evolutionary distance thought to separate these species (30, 31).
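The overlap percentages quoted in this section for human-alignable sequence follow from simple ratios of the reported fractions. The short sketch below reproduces that arithmetic from the rounded percentages given in the text (34%, 27%, 14%, and 11%); the function names are hypothetical, and results derived from rounded inputs may differ slightly from figures computed on the underlying data (e.g., the 80% stated in the abstract).

```python
def shared_share(individual_pct: float, shared_pct: float) -> float:
    """Share of one species' alignable sequence that is also aligned in all others."""
    return shared_pct / individual_pct


def distinct_share(individual_pct: float, shared_pct: float) -> float:
    """Share of one species' alignable sequence NOT aligned in all others."""
    return 1.0 - shared_share(individual_pct, shared_pct)


# Each marsupial aligns with about 34% of human; 27% aligns with all three.
marsupial_common = shared_share(34.0, 27.0)

# Platypus aligns with about 14% of human; 11% aligns with all four species.
platypus_distinct = distinct_share(14.0, 11.0)
```

Here `marsupial_common` is about 0.79 and `platypus_distinct` is about 0.21, matching the 21% reported for platypus.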
`
Presence of Evolutionarily Conserved Sequences in Different Lineages. The unique genomic properties of marsupials and monotremes make their sequences of particular interest for identifying and characterizing the small portion of the mammalian genome under purifying selection (5, 32, 33). We previously described an approach for using sequences from multiple vertebrates to detect evolutionarily conserved sequences in the human genome (called MCSs) and demonstrated that different species’ sequences vary greatly in their relative contribution to the identification of MCSs (7–9).
Given the diverse representation of mammalian species in our sequence data set, especially with the inclusion of metatherian and prototherian sequences, we next investigated the presence of MCSs among the different mammalian lineages. For this analysis, we studied a set of 418 MCSs falling within a 571-kb portion of the targeted genomic region where there was complete sequence coverage from cat and dog (carnivores), cow and pig (artiodactyls), rat and mouse (rodents), N.A. opossum and wallaby (marsupials), and platypus (monotreme). Note that S.A. opossum sequence was not included in this analysis, so that each lineage would be represented by two species (except monotremes, where only one species was available). The presence or absence of each of the 418 MCSs in each species’ sequence was determined based on whether there was a human–species sequence alignment that overlapped that MCS in the human sequence (note that virtually all such alignments reflect high levels of sequence identity). Although virtually all 58 MCSs overlapping coding regions and 46 MCSs overlapping UTRs are present in all species, the remaining noncoding MCSs show interesting patterns of conservation (Fig. 4; also see the supporting information for additional details).
Just over one-half (52%) of the human-referenced noncoding MCSs are present in all nine nonhuman mammals analyzed. These regions thus represent the most anciently constrained sequences in the mammalian lineage. An additional 3.8% of the MCSs are present in all mammals except one or both rodents; this could be due to the known high deletion rate in the rodent lineage (5) or imprecision of current MCS-detection methods. An additional 17% of MCSs are present in all mammals except monotremes, with an additional 2% present in all mammals except monotremes and both rodents. The other major combinations are MCSs in all mammals except N.A. opossum (4.5%), in all mammals except N.A. opossum and platypus (4.5%), and in all eutherian mammals (4.0%). Together, these data provide evidence for lineage specificity with respect to the presence of evolutionarily conserved sequences in the human genome.
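Lineage categories of this kind can be derived by grouping MCSs according to the set of species in which they are absent. The following sketch shows one way such a tally might be computed; it is illustrative only, the function names and the toy presence matrix are hypothetical, and it does not reproduce the data or category definitions used in this study.

```python
from collections import Counter
from typing import Dict, FrozenSet


def tally_absence_patterns(presence: Dict[str, Dict[str, bool]]) -> Counter:
    """Count MCSs by the set of species in which each MCS is absent."""
    patterns: Counter = Counter()
    for by_species in presence.values():
        missing = frozenset(sp for sp, found in by_species.items() if not found)
        patterns[missing] += 1
    return patterns


def as_percentages(patterns: Counter) -> Dict[FrozenSet[str], float]:
    """Convert pattern counts to percentages of all scored MCSs."""
    total = sum(patterns.values())
    return {pattern: 100.0 * count / total for pattern, count in patterns.items()}


# Hypothetical toy presence matrix: MCS identifier -> species -> present?
toy_presence = {
    "mcs1": {"cat": True, "mouse": True, "platypus": True},
    "mcs2": {"cat": True, "mouse": True, "platypus": False},
    "mcs3": {"cat": True, "mouse": False, "platypus": True},
    "mcs4": {"cat": True, "mouse": True, "platypus": True},
}
toy_patterns = tally_absence_patterns(toy_presence)
toy_percentages = as_percentages(toy_patterns)
```

The empty set corresponds to MCSs present in every species scored; other keys name the species lacking the MCS and can be mapped onto lineage labels such as "all mammals except monotremes".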
`
Fig. 4. Lineage specificity of MCSs. The proportion of nonexonic MCSs found in the sequences of species in each category is indicated. Note that virtually all MCSs overlapping known exonic sequences are present in all mammals (data not shown). All Mammals: cat, dog, cow, pig, rat, mouse, N.A. opossum, wallaby, and platypus; Eutherian: cat, dog, cow, pig, rat, and mouse; Marsupials: N.A. opossum and wallaby; and Other: species combinations containing <2% of the analyzed MCSs (see the supporting information for the complete data set). Hashed areas of "All Mammals" reflect portions lacking one or both rodents, and hashed portions of "Eutherian + Marsupials" reflect portions lacking both rodents.
`
Discussion
Phylogenetic diversity is an important component of comparative genomic studies (8, 34). To date, the comparative sequencing of mammalian genomes largely has involved species within the eutherian radiation, each contributing relatively short branch lengths. Although short branch lengths allow for accurate sequence alignments, many species’ sequences then are needed to identify those bases under purifying selection. The more diverged metatherian and prototherian mammals contribute longer branch lengths, making their sequences particularly valuable for identifying genomic regions under purifying selection, while still allowing for reliable alignments to the human sequence. The latter has been challenging with nonmammalian vertebrates, such as chicken and fish (W. Miller, personal communication).
Here, we report the large-scale generation and comparative studies of genome sequences from noneutherian mammals. This initial in-depth glimpse revealed several intriguing properties of these species’ genomes. The platypus genome, at least for the region studied, shows: (i) ≈25% compression relative to the human genome; (ii) an unusually high G + C content for a mammal; (iii) a disproportionately high fraction of SINEs among its repetitive sequences; (iv) a notably low fraction of human-alignable sequence (14% compared with 34% for marsupials); and (v) a markedly long branch length revealed by phylogenetic analyses. Interestingly, these last two properties of platypus are quite similar to those of chicken (see Fig. 2B), despite the large difference in their evolutionary distances from human [estimated at 200 versus 310 mya, respectively (12–14)]. Although the long branch length for platypus is intriguing, it was calculated by using the reversible substitution model (REV), which assumes similar nucleotide composition among analyzed sequences. Because this is not the case for platypus (Table 2), and because the synonymous 4D sites analyzed in this study might not be entirely neutrally evolving, caution should be used in making strong claims about the phylogenetic position of monotremes based on our data. Finally, it is interesting to note that the observed compression of the platypus genome (relative to human) cannot be explained fully by differences in gene or repeat content. The evolutionary events that led to this relative compression are not obvious from the analyses performed here; however, more detailed examination of larger data sets of platypus sequence, with particular emphasis on cataloging repetitive versus nonrepetitive sequences and searching for evidence of insertions and deletions, should shed light on this issue.
It is interesting to note that we were able to align a greater amount of sequence by using a multisequence alignment tool [TBA (20)] compared to simpler pair-wise alignment methods. Importantly, this enhancement was most evident with the sequences from the noneutherian mammals, which showed roughly a 2-fold increase in the fraction of human-alignable sequence (purple portion of bars in Fig. 2A). Similar improvements likely would enhance comparative sequence analyses involving more distantly related, nonmammalian vertebrates (e.g., birds, reptiles, and fish). At the same time, the observed increase in alignability in part reflected the large number of species’ sequences being studied (a total of 27); the minimal number and phylogenetic characteristics of mammalian species required for such enhanced alignments remain to be established.
Analyses of the multisequence alignment revealed that the 14% of the human sequence that aligns with the platypus sequence is not completely contained within the larger fraction of the human sequence that aligns with all three marsupial sequences. Similar situations were encountered among the similarly diverged marsupials as well as other combinations of eutherians and nonmammals (see the supporting information). Although there is a general trend that alignments of more diverged sequences are contained within the alignments of more closely related sequences, significant exceptions emerge that may point to lineage-specific aspects of genome evolution.
Our studies confirm that sequences from noneutherian mammals will play an important role in identifying evolutionarily conserved regions of the human genome, which is important for establishing a comprehensive catalog of all functional genomic
`
`1. Nobrega, M. A. & Pennacchio, L. A. (2004) J. Physiol. 554, 31–39.
`2. Cooper, G. M. & Sidow, A. (2003) Curr. Opin. Genet. Dev. 13, 604–610.
`3. Ureta-Vidal, A., Ettwiller, L. & Birney, E. (2003) Nat. Rev. Genet. 4, 251–262.
4. Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.-M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. (2002) Science 297, 1301–1310.
5. International Mouse Genome Sequencing Consortium (2002) Nature 420, 520–562.
6. Rat Genome Sequencing Project Consortium (2004) Nature 428, 493–521.
7. Thomas, J. W., Touchman, J. W., Blakesley, R. W., Bouffard, G. G., Beckstrom-Sternberg, S. M., Margulies, E. H., Blanchette, M., Siepel, A. C., Thomas, P. J., McDowell, J. C., et al. (2003) Nature 424,