`DOI: 10.1007/s00239-002-2346-9
`
`A Sliding Window-Based Method to Detect Selective Constraints in
`Protein-Coding Genes and Its Application to RNA Viruses
`
`Mario A. Fares,1 Santiago F. Elena,1 Javier Ortiz,2 Andre s Moya,1 Eladio Barrio1
`
`1 Institut Cavanilles de Biodiversitat i Biologia Evolutiva and Departament de GeneÁ tica, Universitat de ValeÁ ncia, ValeÁ ncia, Spain
`2 Servei de BioinformaÁ tica, Universitat de ValeÁ ncia, ValeÁ ncia, Spain
`
`Received: 30 January 2002 / Accepted: 10 May 2002
`
`Abstract. Here we present a new sliding window-
`based method specially designed to detect selective
`constraints in speci®c regions of a multiple protein-
`coding sequence alignment. In contrast to previous
`window-based procedures, our method is based on a
`nonarbitrary statistical approach to ®nd the appro-
`priate codon-window size to test deviations of syno-
`nymous (dS) and nonsynonymous (dN) nucleotide
`substitutions from the expectation. The probabilities
`of dN and dS are obtained from simulated data and
`used to detect signi®cant deviations of dN and dS in a
`speci®c window region of the real sequence align-
`ment. The nonsynonymous-to-synonymous rate ratio
`(x = dN/dS) was used to highlight selective con-
`straints in any window wherein dS or dN was signi®-
`cantly dierent
`from the expectation.
`In these
`signi®cant windows, x and its variance [V(x)] were
`calculated and used to test the neutral hypothesis.
`Computer simulations showed that the method is
`accurate even for highly divergent sequences. The
`main advantages of the new method are that it (i) uses
`a statistically appropriate window size to detect dif-
`ferent selective patterns, (ii) is computationally less
`intensive than maximum likelihood methods, and (iii)
`detects saturation of synonymous sites, which can
`give deviations from neutrality. Hence, it allows the
`analysis of highly divergent sequences and the test of
`dierent alternative hypothesis as well. The applica-
`
`Correspondence to: Mario A. Fares, Institute of Genetics, De-
`partment of Genetics, Trinity College, Dublin 2, Ireland; email:
`fares@evalga.uv.es
`
`SEQUENOM EXHIBIT 1093
`Sequenom v. Stanford
`IPR2013-00390
`
`tion of the method to dierent human immunode®-
`ciency virus type 1 and to foot-and-mouth disease
`virus genes con®rms the action of positive selection
`on previously described regions as well as on new
`regions.
`
`Key words: FMDV Ð HIV-1 Ð Positive selec-
`tion Ð Selective constraints Ð Structural/func-
`tional domains Ð Window-based method
`
`Introduction
`
`One of the interests of evolutionary genetics is to
`unveil the mechanisms of natural selection whereby
`proteins evolve. Positive (adaptive) and negative
`(purifying) selection are the two sides of natural se-
`lection that can explain the appearance and mainte-
`nance of protein functions, respectively. In the light
`of the neutral theory of molecular evolution, the
`®xation of a majority of nonsynonymous mutations
`is explained not by positive selection but by random
`®xation of neutral or nearly neutral mutations
`(Kimura 1983). However, evidence supporting the
`®xation of nonsynonymous mutations by positive
`selection has been documented in the mammalian
`major histocompatibility complex (MHC) (Hughes
`and Nei 1988, 1989), the abalone sperm lysine (Metz
`and Palumbi 1996), the envelope proteins of HIV-1
`(Seibert et al. 1995; Yamaguchi and Gojobori 1997),
`and the capsid proteins of FMDV (Fares et al. 2001).
`Most proteins appear to be under strong purifying
`
`SEQUENOM EXHIBIT 1093
`
`
`
`510
`
`selection most of the time (Li 1997), ®xation of amino
`acid replacements by positive selection being rare and
`relegated to small regions of the molecule. Therefore,
`dierent structural and functional domains are likely
`to be subjected to distinct selective constraints and,
`thus, to evolve at dierent rates. Good examples of
`this have been provided by the analyses of the pro-
`insulin protein (Kimura 1983),
`thyroid hormone
`receptors (Green and Chambor 1986), MHC (Hughes
`and Nei 1988), and paralogous genes coding for in-
`teracting polipeptides, such as the a and b subunits of
`the ATPase complex (MarõÂ n et al. 2001).
`The study of these protein domains can be per-
`formed by taking single codons as units of selection
`or by averaging rates of nucleotide substitutions
`along the entire protein-coding gene. However, the
`analysis of a speci®c domain or region of the protein-
`coding sequence represents a better approach to show
`nucleotide variability between dierent structural or
`functional regions and might provide detailed infor-
`mation on the selective constraints driving the evo-
`lution of protein-coding genes. When no knowledge
`about the functional or structural domains of a pro-
`tein is available, a promising approach to unveil se-
`lective constraints is to devise statistical models to
`measure the intensity of selection acting on these
`protein domains. In this respect, comparison of non-
`synonymous nucleotide substitutions (causing amino
`acid replacements) to synonymous (silent) changes
`constitutes an important tool for studying the mech-
`anisms of DNA evolution (Kimura 1983; Gillespie
`1991; Ohta 1993). The intensity of natural selection
`can be measured simply by estimating the nonsyn-
`onymous-to-synonymous nucleotide substitution rate
`ratio (x = dN/dS). Thus, values of x > 1, x = 1,
`and x < 1 indicate positive selection, neutrality, and
`purifying selection, respectively. This is the most ac-
`cepted and stringent way to detect positively selected
`changes
`in protein-coding nucleotide
`sequences
`(Sharp 1997; Akashi 1999; Crandall et al. 1999).
`Several methods have been developed to estimate
`the number of synonymous and nonsynonymous
`nucleotide substitutions per site (Miyata and Yasu-
`naga 1980; Li et al. 1985; Nei and Gojobori 1986; Li
`1993; Pamilo and Bianchi 1993; Goldman and Yang
`1994; Muse and Gaut 1994; Comeron 1995; Ina
`1995). Nonetheless, these methods require the use of
`a large number of codons to avoid the eect of
`variance on estimates of nucleotide distances. More-
`over, the likelihood ratio test (LRT) has been exten-
`sively applied by several authors to detect positively
`selected codons under a known phylogeny (e.g.,
`Goldman and Yang 1994; Yang and Nielsen 1998;
`Yang et al. 2000a; Zanotto et al. 2000). A dierent
`approach to show nucleotide substitution rate varia-
`tion among dierent genomic regions is to plot dif-
`ferences as averages generated by sliding a window
`
`along a sequence alignment (Tajima 1991). Although,
`these methods were not originally developed to ana-
`lyze selective constraints in protein-coding genes,
`Hughes and Nei (1989) used a sliding window-based
`method to estimate the dierences between the aver-
`age rate of nonsynonymous and that of synonymous
`nucleotide substitutions per site and window, detect-
`ing selective pressures acting on the MHC. However,
`these authors did not use any statistical approach to
`select an appropriate window size to explore protein-
`coding genes or test the signi®cance of the dierences
`between the two types of nucleotide substitutions,
`which could lead to misinterpreting results. Although
`the methods developed by Nielsen and Yang (1998)
`have been used extensively to detect positive selec-
`tion, these methods rely on the assumption that all
`amino acid sites have the same x value or that there is
`a limited and speci®ed number of codon categories
`characterized by their x values, which seems unreal-
`istic. Suzuki and Gojobori (1999) developed a maxi-
`mum parsimony method that, given a phylogeny,
`detects positive selection at single amino acid sites.
`Here we propose a new sliding window procedure,
`based on the same principles as those proposed by
`Suzuki and Gojobori (1999) but using the most ap-
`propriate window size to highlight regions within a
`protein-coding gene subjected to dierent selective
`pressures. The central focus of this work is that dif-
`ferent protein regions are subjected to dierent se-
`lective constraints that can be underpinned by an
`appropriate sliding window procedure to detect re-
`gions with higher nucleotide substitutions than ex-
`pected.
`
`Materials and Methods
`
`Our method identi®es regions from a protein-coding gene that have
`accumulated a dierent number of nucleotide substitutions than
`expected by chance, hence rejecting neutrality as the best expla-
`nation for the evolution of these speci®c regions or domains. For
`this purpose, our method implements dierent steps (Fig. 1) that
`can be summarized as follows.
`
`Estimation of the Probability of Synonymous and
`Nonsynonymous Nucleotide Substitutions in Speci®c
`Window Regions Along the Sequence Alignment
`
`The ®rst step of the method is the estimation of the expected
`probability of nonsynonymous, P(dN), and synonymous, P(dS),
`nucleotide substitutions per codon site in each sliding window. To
`estimate these probabilities, let us ®rst suppose that the Xth non-
`synonymous nucleotide substitution and the Yth synonymous
`nucleotide substitution are variables with probability 1/L of oc-
`curring in the sequence under study, where L is the total number of
`codon sites in the sequence. If we assume that nonsynonymous and
`synonymous substitutions are discrete and take the values xi and
`yi, respectively, then the expectations of Xr- and Yr- are the rth
`moment about zero of the nonsynonymous substitution variable
`
`
`
`511
`
`3
`
`Fig. 1. Diagram of the sliding window procedure to detect selective constraints in speci®c regions of protein-coding genes. Z is the window
`size in codons and
`n is the total length of the sequence alignment after adding the random sequence pieces to the beginning and end of the
`sequence alignment.
`
`and evolves the sequence along the branches of the given phy-
`logeny using speci®ed branch lengths and substitution parameters
`(Yang et al. 2000b). Simulations were performed using as pa-
`rameters the branch lengths estimated by the modi®ed Nei and
`Gojobori method (Zhang et al. 1998) as well as codon frequencies
`estimated from the real data set. All simulations were carried out
`assuming neutral evolution (x = 1), which is the null hypothesis
`against which the x value estimated in each window region of
`the real sequence alignment is compared. Once sequences are
`simulated, nonsynonymous substitutions per nonsynonymous site
`(dNij
`),
`) and synonymous substitutions per synonymous site (dSij
`between sequence i and its inferred ancestral sequence j, are
`estimated by the unbiased method of Li (1993). Given the large
`distances among the sequences under study, ancestral sequences
`were inferred by maximum likelihood (Yang et al. 1995; Koshi
`and Goldstein 1996; Schultz et al. 1996; Zhang and Nei 1997).
`However, whenever the distances between sequences are short
`
`XN and synonymous change variable YS. These moments are de-
`®ned as
`
`y r
`i
`
` 1
`
`X i
`
`X i
`
`N E XN 1
`hr
`
`n
`
`xr
`i ;
`
`S E YS 1
`hr
`
`s
`
`where n and s are the total number of nonsynonymous and syn-
`onymous nucleotide sites, respectively.
`The ®rst moments, r = 1, are the expectations of XN and YS
`and are the mean of the variables that describes nonsynonymous
`and synonymous nucleotide substitutions on an alignment of se-
`quences, respectively.
`
`
`h1S and h1N are estimated using K random sequence alignments
`simulated by maximum likelihood according to a known phylo-
`genetic tree using the EVOLVER program from the PAML
`package, v3.0 (Yang 2000). This program generates a codon
`sequence for the root of the tree (inferred from the real data set)
`
`
`
`512
`
`enough, both maximum likelihood and maximum parsimony
`methods give similarly reliable sequence inferences (Yang et al.
`1995; Zhang and Nei 1997). The simulations allow us to avoid
`the eect of the nucleotide compositional bias in third codon
`positions on the codon usage, the estimation of dN and dS under
`neutrality, and the buering of the regional codon-composition
`eect on the estimates of dN and dS.
`The second step consists in the estimation of the probabilities of
`nucleotide substitutions as
`
`P hN hN
`hN hS
`
`; P hS hS
`hS hN
`
` 2
`
`where hN and hS are the mean number of synonymous and non-
`synonymous substitutions estimated from the K random sequence
`alignments using the expressions
`
`XK
`
`hN 1
`NK
`
`XN
` dNij
`
`l1
`
`f1
`
`fl;
`
`hS 1
`NK
`
`XK
`
`XN
` dSij
`
`l1
`
`f1
`
`fl
`
` 3
`
`conserved sequences require smaller window sizes to detect selec-
`tive constraints than more variable sequences do. The estimation of
`the window size in our method is a trade-o between avoiding false
`signi®cant results and still getting as much biological information
`as possible. To achieve this goal, we generate random window sizes
`(di) in each alignment of the K randomly simulated data sets and
`slide each ith window of di size along the random sequence align-
`ment. For every window (l) the average numbers of nonsynony-
`mous ( dNl
`) and synonymous ( dSl
`) nucleotide substitutions for all
`pairwise sequence comparisons, and given the phylogenetic tree,
`for each lth
`are estimated. Thereafter, the probability of having dNl
`window is calculated using Eq. (5). By repeating this operation, a
`probability distribution along the sequence alignment, with mean
`dN, is obtained. Before starting the window-sliding procedure, two
`sequence alignment pieces of size (d ) 1) are randomized and joined
`to the beginning and end of the sequence alignment to avoid
`undercounting the ®rst and last d ) 1 codons in the ®rst and last
`windows, respectively. Therefore, every codon site is counted d
`times in all the sliding steps. The generation of random pieces of the
`sequence alignment is repeated during the (d ) 2) ®rst and last
`sliding steps, avoiding nucleotide composition eects of the ran-
`dom pieces on the calculations performed for each window.
`Thereafter, we obtain the distribution of probabilities of dNl
`in the
`K random alignments for the dierent window sizes randomly
`chosen. Finally, we plot the mean P(dN) values (95% con®dence
`interval) for each window size against the window size and choose,
`as the appropriate window size, the largest window having a 5%
`lower probability higher than 0.05 (Fig. 3).
`Once the appropriate window size is determined, we slide
`windows of this size along the real data set, in the same way as done
`for the random data sets, and calculate the probabilities of the
`estimated dNijZ
`in each Zth window to test against chance
`and dSijZ
`using Eq. (5), as described in the previous section.
`This method allows us, by direct comparison of the expected
`and observed nucleotide substitutions, to discriminate between
`dierent hypotheses that, in addition to neutral evolution, positive
`selection, or purifying selection, can explain dierent mutational
`dynamics (summarized in Table 1).
`
`Nonsynonymous-to-Synonymous Rate Ratio and Its
`Variance as an Estimator of Selection Intensity
`
`The ®nal complementary step to our method, in those windows with
`values dierent than expected by chance, consists in the
`or dSZ
`dNZ
`analysis of the type and intensity of selection by estimating the
`nonsynonymous-to-synonymous rate ratio (xZ dNZ
`=dSZ
`) for com-
`parison of sequence i and its inferred ancestral sequence j. As men-
`tioned in the Introduction, a x ¹ 1 is a good indication of the action
`of selection. However, to avoid obtaining biased values due to sat-
`uration of synonymous sites, especially in regions with multiple hits,
`those values of xZ ¹ 1 have to be tested for signi®cance. Conse-
`quently, the variance of x[V(x)] needs to be estimated from the data
`and used to test the signi®cance of x against neutrality. An estimator
`of V(x) was obtained by means of Fisher's d method (Weir 1996):
`^V B2 L2
`^V B0
`#
` L0 L22
`^V A2
`^V A4 L2
` L2 L42
` L0
`Cov A0; B0
`L0 L2
`xL2
` L0 L2 L2 L4 Cov A2; B2
`
`2
` x2L4
`Cov A4; B4
`L2 L4
`
`ij
`
`ij
`
`0
`
`2
`
`(
`^V A0 L2
`^Vx 1
`"
`d2
`S
` x2 ^V B4 L2
`
`2
`
`4
`
`
`
` 7
`
`Here N stands for the total number of pairwise comparisons be-
`tween the random simulated sequence i and its inferred ancestral
`sequence j.
`If the sequence alignment is large enough (>300 codons) we
`can assume that both nonsynonymous (dN) and synonymous (dS)
`nucleotide substitutions follow a Poisson distribution with pa-
`rameters
`
`kN nP hN;
`
`kS sP hS
`
` 4
`
`Therefore, the probabilities of observing dN = a and dS = b nu-
`cleotide changes in a speci®c window Z of the real sequence
`alignment are, respectively,
`
`PZ XNZ
`
`ij
`
`jkN e
`
` k
`NZ
`ij
`
`X
`NZ
`k
`ij
`NZ
`ij
`XNZ
`
`ij
`
`!
`
`and
`
`PZ YSZ
`
`ij
`
`jkS e
`
` k
`SZ
`ij
`
`Y
`SZ
`ij
`
`k
`
`SZ
`ij
`YSZ
`
`ij
`
`!
`
` 5
`
`ij
`
`are the observed number of nonsynonymous
`and YSZ
`where XNZ
`and synonymous nucleotide substitutions, respectively, between
`sequence i and its inferred ancestral sequence j in window Z and are
`calculated as
`
`ij
`
` na; YSZ
`XNZ
`This procedure of estimating PZ XNZ
`
`ij
`
`ij
`
`ij
`
`jkN and PZ YSZ
`
`ij
`
`jkS is re-
`peated in every window and the possible action of selection in each
`region is also tested. Once those window regions with a number of
`substitutions signi®cantly dierent than expected are detected, an
`analogy of
`selection intensity is performed estimating the nonsyn-
`onymous-to-synonymous rate ratio and its variance.
`
`1
`
`Finding an Appropriate Window Size to Detect
`Selective Constraints
`
`The sliding window-based method requires a previous estimation
`of the most appropriate window size to analyze the dierent re-
`gions of the alignment. We have to stress caution when the window
`size is chosen randomly because there is a strong eect of the codon
`number and composition of the window size used on the results
`obtained (see Results and Discussion). Furthermore, the use of a
`speci®cally optimized window size strongly depends on the data
`under analysis. Thus, assuming absence of saturation of synony-
`mous sites, there is a threshold of variability above which highly
`
` sb
`
` 6
`
`ij
`
`ij
`
`
`
`List of alternative hypotheses that explain values of
`Table 1.
`nonsynonymous-to-synonymous rate ratios (x), dN, and dS, sig-
`ni®cantly dierent than expected under the neutral modela
`
`x > 1
`
`x = 1
`
`x < 1
`
`dN
`
`>
`>
`>
`=
`=
`=
`<
`>
`>
`=
`=
`<
`<
`<
`>
`>
`>
`=
`=
`=
`<
`<
`<
`
`dS
`
`>
`=
`<
`>
`=
`<
`<
`>
`=
`=
`<
`>
`=
`<
`>
`=
`<
`>
`=
`<
`>
`=
`<
`
`Hypothesis accepted
`
`1
`1
`1, 3, 5
`1, 3
`1
`3, 5
`3, 4
`6
`6
`0
`0, 3
`4, 6
`0, 4
`3, 4
`6
`2, 6
`3, 6
`2, 6
`2
`2, 4
`4, 6
`2, 4
`3, 4
`
`a A value of dN or dS signi®cantly higher or lower than the mean
`values for the sequence alignments is indicated by > or <, re-
`spectively. Hypotheses: 0, neutrality; 1, positive selection; 2, pur-
`ifying selection; 3, saturation of synonymous sites; 4, saturation of
`nonsynonymous sites; 5, translational selection; and 6, high mu-
`tation rates.
`
`where Ai and Bi are the transition and transversion rates in the ith
`degenerated site and their variances ^V Ai; ^V Bi are given by Eqs.
`(3) and (4) in Li (1993), Li is the number of the ith degenerated sites
`in the region analyzed, and Cov(Ai, Bi) is the covariance of the
`transition and transversion rates in the ith degenerated site, given
`by Eq. (A8) of Ina (1998). It should be noted that this variance is
`calculated for x values along the sequence alignment comparing
`sequence i and its immediate simulated ancestral sequence j, thus
`there is no need to include covariances of pairwise comparisons in
`the estimation of ^V x.
`To test the reliability of the estimator of ^V x, we simulated 200
`alignment data sets with the EVOLVER program, calculated their
`empirical V(x), and compared them with the estimated variance
`obtained with Eq. (7). Empirical variance of x was estimated from
`the simulated data as
`
` 8
`
`XN
` xi x2
`
`i1
`
`V x 1
`N
`
`Here xi is the estimation of x in the ith window and x is the mean
`of x along the sequence alignment.
`
`Data Sets Analyzed by the Sliding
`Window-Based Method
`
`To examine the usefulness of this method, we used sequences of the
`env and gag genes from HIV-1 (Seibert et al. 1995) and the VP1 to
`VP4 capsid genes from FMDV (Fares et al. 2001), where positive
`selection has previously been demonstrated.
`
`513
`
`HIV-1 sequence data from Seibert et al. (1995) were analyzed to
`test the selective pressure acting on dierent genomic regions of this
`retrovirus. These authors examined the action of selection on the
`gag (structural proteins), env (envelope region, including both re-
`gion gp120 and region gp41), and pol (polymerase protein) genes
`and they observed that the pattern of nucleotide substitution in the
`case of the env gene diered signi®cantly from that of either pol or
`gag. Furthermore, they showed that, when the env gene was used,
`signi®cantly higher nonsynonymous than synonymous substitu-
`tions were observed for sequence families B, C, and E in the V2
`region, for sequence families B and C in the V3 region, and for
`family C in the V4 region. [Each family included those sequences
`identi®ed by Seibert et al. (1995) as being phylogenetically closely
`related.] When overall means were computed for the ®ve families,
`the average dN value was signi®cantly higher than the dS value
`estimated in V2 and V3 from the gp120 belonging to the env gene.
`In the case of gp41, this region showed signi®cantly higher dS than
`dN values for families A±D.
`In our study, a subset of the data analyzed by Seibert et al.
`(1995) of the env gene (23 sequences) and gag gene (17 sequences)
`from the ®ve families were used to test the validity of the method.
`The sequences are identi®ed by their accession numbers in the
`phylogenetic trees shown in Fig. 2.
`Sequences of FMDV from the capsid region (available upon
`request) with a known experimental history were also used to test
`the validity of the method. Fares et al. (2001) studied 31 sequences
`belonging to the capsid region of FMDV. These sequences were
`obtained from virus isolated under dierent experimental proce-
`dures that emulated several environmental and epidemiological
`conditions such as genetic drift, massive infections, immune pres-
`sure, and persistent infections. This study showed that several
`amino acid changes were ®xed by positive selection in the main
`antigenic site (A) and in the 50 end of the variable capsid protein
`VP3. Here we use the same 31 sequences to con®rm the previously
`obtained results and to test the possible presence of additional
`regions under selective pressures.
`
`Phylogenetic Reconstruction
`
`sequences were obtained with CLUSTAL-X
`Alignments of
`(Thompson et al. 1994) and adjusted manually when necessary.
`Phylogenetic analysis of the aligned sequences was done using
`dierent methods: neighbor-joining (NJ) (Saitou and Nei 1987),
`maximum likelihood (ML) (Felsenstein 1981), and maximum par-
`simony (MP) (Fitch 1971). The MEGA v2.01 program (Kumar
`et al. 2001) was used to obtain NJ trees and the bootstrap support
`values for 1000 pseudoreplicates. MP and ML trees were recon-
`structed using DNPARS and DNAML, respectively, from the
`PHYLIP v3.5 package (Felsenstein 1993).
`
`Results
`
`Positive Selection in HIV-1 Regions Coding
`for the env Protein
`
`Phylogenetic trees for the env gene showed the same
`topology despite the phylogenetic reconstruction
`method used (Fig. 2A). The P(hN) value estimated
`from random alignments of the env region was 0.497.
`Analysis of the suitable window size generated the
`distribution probabilities depicted in Fig. 3. Accord-
`ing to this ®gure, only windows including more than
`nine codons had a 5% lower P(dN) value higher than
`
`
`
`514
`
`Fig. 2. Maximum-likelihood phylogenetic trees for the env (A) and gag (B) genes of HIV-1 and for capsid (C) genes of FMDV. Numbers in
`the internal nodes correspond to 1000-pseudoreplicate bootstrap values. A±D in the env phylogenetic tree (A) identify families of closely
`related sequences.
`
`The estimation of the average x for the whole gene
`gave a value of 1.168, indicating a weak action of the
`positive selection on some regions of this gene. The
`use of a window of nine codons on the real data set
`identi®ed several codons subjected to positive selec-
`tion. When each HIV-1 env family was analyzed sep-
`arately, several regions, not detected before, appear to
`be subjected to positive selection. Thus, HIV-1 fami-
`lies A, B, C, and D show a dN value signi®cantly
`higher than expected and higher than dS in the V1
`variable region and in a region located between codon
`450 and codon 460. Furthermore, in families A and B,
`variable region 2 (V2) showed positive selection of
`amino acid changes, whereas families B, C, and D
`showed additional regions in V3 to be under positive
`selection of amino acid substitutions (see Table 2). In
`contrast to the rest of the HIV-1 sequence families,
`family C showed regions with higher dN than dS values
`in the V4 variable region (Table 2). Table 2 lists those
`regions with dN values higher than expected, P(dN), x
`values, and the signi®cance of x compared to the
`average value, under a t test. None of the sliding
`windows showed dS rates lower or higher than ex-
`pected, which indicates that, in these regions, there is
`no saturation of synonymous sites in comparison with
`the overall gene region, or codon composition bias
`
`Fig. 3. Distribution of the probability of nonsynonymous nu-
`cleotide substitutions, P(dN), assuming dierent window sizes (d).
`P(dN) is depicted versus d in 100 random data sets of sequence
`alignments. Circles represent the mean P(dN) for each window size;
`error bars are based on the 5% higher and lower P(dN) values in
`each window size, respectively. The 5% threshold probability limit
`is indicated by the dashed line.
`
`0.05. This means that, sliding a window of nine co-
`dons in the real sequence alignment, we should not
`expect to have any signi®cant P(dN) deviation from
`that expected by chance, given that more than 95% of
`the probability values are higher than 0.05.
`
`
`
`515
`
`Table 2. Detection of positively selected regions in the HIV-1 env gene corresponding to several phylogenetic families of sequencesa
`
`gp120 variable
`region
`
`Codon window
`
`V1
`
`V2
`
`V3
`
`V4
`
`132±140
`134±142
`140±148
`142±150
`143±151
`144±152
`176±184
`179±189
`180±188
`181±189
`297±305
`310±318
`311±319
`312±320
`313±321
`314±322
`390±398
`391±399
`
`Family A
`
`Family B
`
`Family C
`
`Family D
`
`P(dN)
`
`0.003
`0.003
`Ð
`Ð
`0.015
`Ð
`0.028
`Ð
`0.014*
`0.003
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`
`x
`
`4.827*
`6.906*
`Ð
`Ð
`5.368*
`Ð
`3.035*
`Ð
`3.521*
`3.886*
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`
`P(dN)
`
`x
`
`P(dN)
`
`x
`
`P(dN)
`
`x
`
`Ð
`0.000
`Ð
`Ð
`Ð
`Ð
`Ð
`0.001
`0.019
`0.006
`Ð
`0.039
`0.041
`0.041
`0.001
`Ð
`Ð
`Ð
`
`Ð
`2.720*
`Ð
`Ð
`Ð
`Ð
`Ð
`3.465*
`3.451*
`4.313*
`Ð
`3.079*
`1.839*
`1.839*
`5.311*
`Ð
`Ð
`Ð
`
`Ð
`0.000
`Ð
`0.000
`0.000
`0.041
`Ð
`Ð
`Ð
`Ð
`0.027
`0.048
`Ð
`Ð
`0.008
`Ð
`0.045
`0.027
`
`Ð
`3.932*
`Ð
`7.863*
`4.510*
`3.588*
`Ð
`Ð
`Ð
`Ð
`6.375*
`5.349*
`Ð
`Ð
`5.150*
`Ð
`3.747*
`5.035*
`
`0.006
`0.003
`0.000
`0.023
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`0.002
`0.001
`Ð
`0.000
`0.003
`Ð
`Ð
`
`6.211*
`3.467*
`11.912**
`3.776*
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`Ð
`2.838*
`2.363
`Ð
`37.385**
`2.457*
`Ð
`Ð
`
`a V1 to V4 are the four variable regions where positive selection was detected using the sliding-window procedure. P(dN) and x are the
`probability of the observed nonsynonymous substitutions and the nonsynonymous-to-synonymous rate ratio, respectively. *Probabilities
`lower than 5% and **probabilities lower than 1% (z test).
`
`Table 3. Window midpoint codons where positive selection was
`detected in the env gene from HIV-1a
`
`Region
`
`V1
`V2
`V3
`Epitope 4
`V4
`
`Midpoint
`codons
`
`130±152
`178±192
`296±323
`336±344
`387±408
`
`x
`
`t
`
`3.289
`3.150
`4.204
`4.679
`2.715z
`
`18.39
`17.19
`26.34
`30.45
`13.42
`
`One-
`tailed P
`
`0.0173
`0.0185
`0.0121
`0.0104
`0.0236
`
`a V1±V4 are the variable regions of gp120 from the env gene and
`epitope 4 represents the fourth positional epitope of T cells located
`in the gp41 region of the env gene. x is the mean estimated non-
`synonymous-to-synonymous rate ratio for the dierent regions
`located in each variable region. t is the Student test value for x (H0:
`x = 1).
`
`due to translational selection. However, saturation of
`synonymous sites was detected for the four families in
`the V5 region of gp120 (region between codon 427 and
`codon 438), with P( dS) = 0.048, and in gp41 T cell
`epitopes 5 [codons 731±740; P( dS) = 0.037] and 6
`[codons 761±777; P( dS) = 0.035]. In addition, family
`A showed saturation of synonymous sites in epitope 1
`[codon region 517±541, with P( dS) = 0.048]. In these
`regions, we also detected fewer nonsynonymous sub-
`stitutions than expected by chance and x < 1. Taken
`together, these results indicate that these regions are
`subjected to very strong purifying selection.
`When the four families were compared, positively
`selected regions were detected in the V1±V4 variable
`regions, with x values signi®cantly higher than 1
`(Table 3). The location of the midpoints of the
`
`windows showing positive selection is given in Fig. 4.
`On average, we detected 6.5% of codon sites under
`positive selection, 9.3% neutral sites, and 84.2% of
`codon sites under strong purifying selection.
`
`Dierences in Selective Pressures Among Dierent
`Regions of the gag Gene of HIV-1
`
`Once again, dierent phylogenetic methods gave the
`same tree for the gag gene (Fig. 2B). We examined an
`alignment of 17 sequences of the HIV-1 gag gene
`following the new method. Seibert et al.
`(1995)
`demonstrated that the average dS value along the
`sequence alignment exceeded signi®cantly that of dN
`in the gag gene of HIV-1 (i.e., x < 1). However, as
`mentioned, the estimation of the average x value over
`the region is very simplistic for concluding anything
`about the selective pressures acting on a protein. In
`our analysis, several regions were identi®ed as being
`subjected to strong purifying selection. The average
`value of x for the whole region was estimated to be
`0.8, which means that, on average, the gag region of
`the HIV-1 genome evolves under purifying selection.
`The optimal window size for analyzing the gag region
`sequences was seven codons. Sliding a window of this
`length, we detected three types of regions in the gag
`gene according to the selective constraints acting on
`them. Most regions evolve under neutrality (95.87%),
`and other regions are under strong purifying selection
`(3%), while a few codons (0.13%) showed signi®-
`cantly more nonsynonymous changes than expected
`by chance in all families examined (codons 73±79),
`
`
`
`516
`
`Positively selected regions detected by the sliding window
`Fig. 4.
`method in the env gene of HIV-1. Only those regions with proba-
`bilities lower than 5% are represented.
`
`Positively selected regions detected by the sliding window
`Fig. 5.
`method in the FMDV region coding for capsid proteins. Only re-
`gions with probabilities lower than 5% are depicted.
`
`with x = 15.034 (strong positive selection), signi®-
`cantly higher than the average (P < 0.001) and
`clearly higher than 1. Analysis of the distribution of
`synonymous nucleotide
`substitutions along the
`alignment did not give any signi®cant results,
`in-
`dicating either the absence of saturation in synon-
`ymous sites or the absence of nucleotide or codon
`composition bias.
`
`Positive Selection on the Capsid Proteins of FMDV
`
`2
`
`The genomic region of the FMDV coding for the four
`capsid proteins (VP1±VP4) was also analyzed. In this
`case, we used 31 sequences from viruses isolated
`under dierent
`experimental conditions (Fares et al.
`2001). As in the other two cases, the phylogenetic
`reconstruction methods gave the same tree topology
`for the relationships between FMDV sequences (Fig.
`2C). In our previous study we adjusted 13 codon-
`based models (Yang et al. 2000a) to these sequences
`and used the likelihood ratio test to determine the
`best model ®tting the data. Among the discrete
`models, the positive selection model with all x cate-
`gories estimated from the data (called M3) gave sig-
`ni®cantly higher likelihood estimates than the neutral
`model (M1) or the positive selection model with only
`one x value estimated from the data (M2). Also, the
`continuous positive selection model assuming a b-
`distribution for x < 1 and estimating x for a frac-
`tion of codons with x > 1 (M8) provided a sig-
`ni®cantly higher
`log-likelihood value
`than the
`continuous model with a b-distribution (M7) or
`models with a mixture of b- and c-distributions (M9±
`M13). Each model implements an empirical Bayesian
`approach to identify those codons under positive se-
`lection. For all models, codons 310, 666, and 669
`were identi®ed to be under positive selection with
`posterior probabilities higher than 95%.
`
`Computer simulations of sequences with the pro-
`gram EVOLVER and posterior analysis by our
`method gave an estimate of P(hN) = 0.2. Taking this
`probability into account, the optimal window size
`was determined to be three codons. Estimation of the
`x parameter using the whole capsid region gave a
`value of x = 0.578,
`indicating that
`this region
`evolves mainly under purifying selection. Applying
`the sliding-window procedure to the capsid region,
`with a window size of three codons, revealed four
`regions where dN was signi®cantly higher than ex-
`pectation under neutrality, whereas dS did not dier
`from the mean e