throbber
J Mol Evol (2002) 55:509±521
`DOI: 10.1007/s00239-002-2346-9
`
`A Sliding Window-Based Method to Detect Selective Constraints in
`Protein-Coding Genes and Its Application to RNA Viruses
`
`Mario A. Fares,1 Santiago F. Elena,1 Javier Ortiz,2 Andre s Moya,1 Eladio Barrio1
`
`1 Institut Cavanilles de Biodiversitat i Biologia Evolutiva and Departament de GeneÁ tica, Universitat de ValeÁ ncia, ValeÁ ncia, Spain
`2 Servei de BioinformaÁ tica, Universitat de ValeÁ ncia, ValeÁ ncia, Spain
`
`Received: 30 January 2002 / Accepted: 10 May 2002
`
`Abstract. Here we present a new sliding window-
`based method specially designed to detect selective
`constraints in speci®c regions of a multiple protein-
`coding sequence alignment. In contrast to previous
`window-based procedures, our method is based on a
`nonarbitrary statistical approach to ®nd the appro-
`priate codon-window size to test deviations of syno-
`nymous (dS) and nonsynonymous (dN) nucleotide
`substitutions from the expectation. The probabilities
`of dN and dS are obtained from simulated data and
`used to detect signi®cant deviations of dN and dS in a
`speci®c window region of the real sequence align-
`ment. The nonsynonymous-to-synonymous rate ratio
`(x = dN/dS) was used to highlight selective con-
`straints in any window wherein dS or dN was signi®-
`cantly di€erent
`from the expectation.
`In these
`signi®cant windows, x and its variance [V(x)] were
`calculated and used to test the neutral hypothesis.
`Computer simulations showed that the method is
`accurate even for highly divergent sequences. The
`main advantages of the new method are that it (i) uses
`a statistically appropriate window size to detect dif-
`ferent selective patterns, (ii) is computationally less
`intensive than maximum likelihood methods, and (iii)
`detects saturation of synonymous sites, which can
`give deviations from neutrality. Hence, it allows the
`analysis of highly divergent sequences and the test of
`di€erent alternative hypothesis as well. The applica-
`
`Correspondence to: Mario A. Fares, Institute of Genetics, De-
`partment of Genetics, Trinity College, Dublin 2, Ireland; email:
`fares@evalga.uv.es
`
`SEQUENOM EXHIBIT 1093
`Sequenom v. Stanford
`IPR2013-00390
`
`tion of the method to di€erent human immunode®-
`ciency virus type 1 and to foot-and-mouth disease
`virus genes con®rms the action of positive selection
`on previously described regions as well as on new
`regions.
`
`Key words: FMDV Ð HIV-1 Ð Positive selec-
`tion Ð Selective constraints Ð Structural/func-
`tional domains Ð Window-based method
`
`Introduction
`
`One of the interests of evolutionary genetics is to
`unveil the mechanisms of natural selection whereby
`proteins evolve. Positive (adaptive) and negative
`(purifying) selection are the two sides of natural se-
`lection that can explain the appearance and mainte-
`nance of protein functions, respectively. In the light
`of the neutral theory of molecular evolution, the
`®xation of a majority of nonsynonymous mutations
`is explained not by positive selection but by random
`®xation of neutral or nearly neutral mutations
`(Kimura 1983). However, evidence supporting the
`®xation of nonsynonymous mutations by positive
`selection has been documented in the mammalian
`major histocompatibility complex (MHC) (Hughes
`and Nei 1988, 1989), the abalone sperm lysine (Metz
`and Palumbi 1996), the envelope proteins of HIV-1
`(Seibert et al. 1995; Yamaguchi and Gojobori 1997),
`and the capsid proteins of FMDV (Fares et al. 2001).
`Most proteins appear to be under strong purifying
`
`SEQUENOM EXHIBIT 1093
`
`

`

`510
`
`selection most of the time (Li 1997), ®xation of amino
`acid replacements by positive selection being rare and
`relegated to small regions of the molecule. Therefore,
`di€erent structural and functional domains are likely
`to be subjected to distinct selective constraints and,
`thus, to evolve at di€erent rates. Good examples of
`this have been provided by the analyses of the pro-
`insulin protein (Kimura 1983),
`thyroid hormone
`receptors (Green and Chambor 1986), MHC (Hughes
`and Nei 1988), and paralogous genes coding for in-
`teracting polipeptides, such as the a and b subunits of
`the ATPase complex (MarõÂ n et al. 2001).
`The study of these protein domains can be per-
`formed by taking single codons as units of selection
`or by averaging rates of nucleotide substitutions
`along the entire protein-coding gene. However, the
`analysis of a speci®c domain or region of the protein-
`coding sequence represents a better approach to show
`nucleotide variability between di€erent structural or
`functional regions and might provide detailed infor-
`mation on the selective constraints driving the evo-
`lution of protein-coding genes. When no knowledge
`about the functional or structural domains of a pro-
`tein is available, a promising approach to unveil se-
`lective constraints is to devise statistical models to
`measure the intensity of selection acting on these
`protein domains. In this respect, comparison of non-
`synonymous nucleotide substitutions (causing amino
`acid replacements) to synonymous (silent) changes
`constitutes an important tool for studying the mech-
`anisms of DNA evolution (Kimura 1983; Gillespie
`1991; Ohta 1993). The intensity of natural selection
`can be measured simply by estimating the nonsyn-
`onymous-to-synonymous nucleotide substitution rate
`ratio (x = dN/dS). Thus, values of x > 1, x = 1,
`and x < 1 indicate positive selection, neutrality, and
`purifying selection, respectively. This is the most ac-
`cepted and stringent way to detect positively selected
`changes
`in protein-coding nucleotide
`sequences
`(Sharp 1997; Akashi 1999; Crandall et al. 1999).
`Several methods have been developed to estimate
`the number of synonymous and nonsynonymous
`nucleotide substitutions per site (Miyata and Yasu-
`naga 1980; Li et al. 1985; Nei and Gojobori 1986; Li
`1993; Pamilo and Bianchi 1993; Goldman and Yang
`1994; Muse and Gaut 1994; Comeron 1995; Ina
`1995). Nonetheless, these methods require the use of
`a large number of codons to avoid the e€ect of
`variance on estimates of nucleotide distances. More-
`over, the likelihood ratio test (LRT) has been exten-
`sively applied by several authors to detect positively
`selected codons under a known phylogeny (e.g.,
`Goldman and Yang 1994; Yang and Nielsen 1998;
`Yang et al. 2000a; Zanotto et al. 2000). A di€erent
`approach to show nucleotide substitution rate varia-
`tion among di€erent genomic regions is to plot dif-
`ferences as averages generated by sliding a window
`
`along a sequence alignment (Tajima 1991). Although,
`these methods were not originally developed to ana-
`lyze selective constraints in protein-coding genes,
`Hughes and Nei (1989) used a sliding window-based
`method to estimate the di€erences between the aver-
`age rate of nonsynonymous and that of synonymous
`nucleotide substitutions per site and window, detect-
`ing selective pressures acting on the MHC. However,
`these authors did not use any statistical approach to
`select an appropriate window size to explore protein-
`coding genes or test the signi®cance of the di€erences
`between the two types of nucleotide substitutions,
`which could lead to misinterpreting results. Although
`the methods developed by Nielsen and Yang (1998)
`have been used extensively to detect positive selec-
`tion, these methods rely on the assumption that all
`amino acid sites have the same x value or that there is
`a limited and speci®ed number of codon categories
`characterized by their x values, which seems unreal-
`istic. Suzuki and Gojobori (1999) developed a maxi-
`mum parsimony method that, given a phylogeny,
`detects positive selection at single amino acid sites.
`Here we propose a new sliding window procedure,
`based on the same principles as those proposed by
`Suzuki and Gojobori (1999) but using the most ap-
`propriate window size to highlight regions within a
`protein-coding gene subjected to di€erent selective
`pressures. The central focus of this work is that dif-
`ferent protein regions are subjected to di€erent se-
`lective constraints that can be underpinned by an
`appropriate sliding window procedure to detect re-
`gions with higher nucleotide substitutions than ex-
`pected.
`
`Materials and Methods
`
`Our method identi®es regions from a protein-coding gene that have
`accumulated a di€erent number of nucleotide substitutions than
`expected by chance, hence rejecting neutrality as the best expla-
`nation for the evolution of these speci®c regions or domains. For
`this purpose, our method implements di€erent steps (Fig. 1) that
`can be summarized as follows.
`
`Estimation of the Probability of Synonymous and
`Nonsynonymous Nucleotide Substitutions in Speci®c
`Window Regions Along the Sequence Alignment
`
`The ®rst step of the method is the estimation of the expected
`probability of nonsynonymous, P(dN), and synonymous, P(dS),
`nucleotide substitutions per codon site in each sliding window. To
`estimate these probabilities, let us ®rst suppose that the Xth non-
`synonymous nucleotide substitution and the Yth synonymous
`nucleotide substitution are variables with probability 1/L of oc-
`curring in the sequence under study, where L is the total number of
`codon sites in the sequence. If we assume that nonsynonymous and
`synonymous substitutions are discrete and take the values xi and
`yi, respectively, then the expectations of Xr- and Yr- are the rth
`moment about zero of the nonsynonymous substitution variable
`
`

`

`511
`
`3
`
`Fig. 1. Diagram of the sliding window procedure to detect selective constraints in speci®c regions of protein-coding genes. Z is the window
`size in codons and
`n is the total length of the sequence alignment after adding the random sequence pieces to the beginning and end of the
`sequence alignment.
`
`and evolves the sequence along the branches of the given phy-
`logeny using speci®ed branch lengths and substitution parameters
`(Yang et al. 2000b). Simulations were performed using as pa-
`rameters the branch lengths estimated by the modi®ed Nei and
`Gojobori method (Zhang et al. 1998) as well as codon frequencies
`estimated from the real data set. All simulations were carried out
`assuming neutral evolution (x = 1), which is the null hypothesis
`against which the x value estimated in each window region of
`the real sequence alignment is compared. Once sequences are
`simulated, nonsynonymous substitutions per nonsynonymous site
`(dNij
`),
`) and synonymous substitutions per synonymous site (dSij
`between sequence i and its inferred ancestral sequence j, are
`estimated by the unbiased method of Li (1993). Given the large
`distances among the sequences under study, ancestral sequences
`were inferred by maximum likelihood (Yang et al. 1995; Koshi
`and Goldstein 1996; Schultz et al. 1996; Zhang and Nei 1997).
`However, whenever the distances between sequences are short
`
`XN and synonymous change variable YS. These moments are de-
`®ned as
`
`y r
`i
`
`…1†
`
`X i
`
`X i
`
`N ˆ E…XN† ˆ 1
`hr
`
`n
`
`xr
`i ;
`
`S ˆ E…YS† ˆ 1
`hr
`
`s
`
`where n and s are the total number of nonsynonymous and syn-
`onymous nucleotide sites, respectively.
`The ®rst moments, r = 1, are the expectations of XN and YS
`and are the mean of the variables that describes nonsynonymous
`and synonymous nucleotide substitutions on an alignment of se-
`quences, respectively.
`
`
`h1S and h1N are estimated using K random sequence alignments
`simulated by maximum likelihood according to a known phylo-
`genetic tree using the EVOLVER program from the PAML
`package, v3.0 (Yang 2000). This program generates a codon
`sequence for the root of the tree (inferred from the real data set)
`
`

`

`512
`
`enough, both maximum likelihood and maximum parsimony
`methods give similarly reliable sequence inferences (Yang et al.
`1995; Zhang and Nei 1997). The simulations allow us to avoid
`the e€ect of the nucleotide compositional bias in third codon
`positions on the codon usage, the estimation of dN and dS under
`neutrality, and the bu€ering of the regional codon-composition
`e€ect on the estimates of dN and dS.
`The second step consists in the estimation of the probabilities of
`nucleotide substitutions as
`
`P…hN† ˆ hN
`hN ‡ hS
`
`; P…hS† ˆ hS
`hS ‡ hN
`
`…2†
`
`where hN and hS are the mean number of synonymous and non-
`synonymous substitutions estimated from the K random sequence
`alignments using the expressions
`
`XK
`
`hN ˆ 1
`NK
`
`XN
`…dNij
`
`lˆ1
`
`fˆ1
`
`†fl;
`
`hS ˆ 1
`NK
`
`XK
`
`XN
`…dSij
`
`lˆ1
`
`fˆ1
`
`†fl
`
`…3†
`
`conserved sequences require smaller window sizes to detect selec-
`tive constraints than more variable sequences do. The estimation of
`the window size in our method is a trade-o€ between avoiding false
`signi®cant results and still getting as much biological information
`as possible. To achieve this goal, we generate random window sizes
`(di) in each alignment of the K randomly simulated data sets and
`slide each ith window of di size along the random sequence align-
`ment. For every window (l) the average numbers of nonsynony-
`mous ( dNl
`) and synonymous ( dSl
`) nucleotide substitutions for all
`pairwise sequence comparisons, and given the phylogenetic tree,
`for each lth
`are estimated. Thereafter, the probability of having dNl
`window is calculated using Eq. (5). By repeating this operation, a
`probability distribution along the sequence alignment, with mean
`dN, is obtained. Before starting the window-sliding procedure, two
`sequence alignment pieces of size (d ) 1) are randomized and joined
`to the beginning and end of the sequence alignment to avoid
`undercounting the ®rst and last d ) 1 codons in the ®rst and last
`windows, respectively. Therefore, every codon site is counted d
`times in all the sliding steps. The generation of random pieces of the
`sequence alignment is repeated during the (d ) 2) ®rst and last
`sliding steps, avoiding nucleotide composition e€ects of the ran-
`dom pieces on the calculations performed for each window.
`Thereafter, we obtain the distribution of probabilities of dNl
`in the
`K random alignments for the di€erent window sizes randomly
`chosen. Finally, we plot the mean P(dN) values (95% con®dence
`interval) for each window size against the window size and choose,
`as the appropriate window size, the largest window having a 5%
`lower probability higher than 0.05 (Fig. 3).
`Once the appropriate window size is determined, we slide
`windows of this size along the real data set, in the same way as done
`for the random data sets, and calculate the probabilities of the
`estimated dNijZ
`in each Zth window to test against chance
`and dSijZ
`using Eq. (5), as described in the previous section.
`This method allows us, by direct comparison of the expected
`and observed nucleotide substitutions, to discriminate between
`di€erent hypotheses that, in addition to neutral evolution, positive
`selection, or purifying selection, can explain di€erent mutational
`dynamics (summarized in Table 1).
`
`Nonsynonymous-to-Synonymous Rate Ratio and Its
`Variance as an Estimator of Selection Intensity
`
`The ®nal complementary step to our method, in those windows with
`values di€erent than expected by chance, consists in the
`or dSZ
`dNZ
`analysis of the type and intensity of selection by estimating the
`nonsynonymous-to-synonymous rate ratio (xZ ˆ dNZ
`=dSZ
`) for com-
`parison of sequence i and its inferred ancestral sequence j. As men-
`tioned in the Introduction, a x ¹ 1 is a good indication of the action
`of selection. However, to avoid obtaining biased values due to sat-
`uration of synonymous sites, especially in regions with multiple hits,
`those values of xZ ¹ 1 have to be tested for signi®cance. Conse-
`quently, the variance of x[V(x)] needs to be estimated from the data
`and used to test the signi®cance of x against neutrality. An estimator
`of V(x) was obtained by means of Fisher's d method (Weir 1996):
`^V…B2† ‡ L2
`^V…B0†
`#
`…L0 ‡ L2†2
`^V…A2†
`^V…A4† ‡ L2
`…L2 ‡ L4†2
`‡ L0
`Cov…A0; B0†
`L0 ‡ L2
`xL2
`…L0 ‡ L2†…L2 ‡ L4† Cov…A2; B2†
`
`2
`‡ x2L4
`Cov…A4; B4†
`L2 ‡ L4
`
`ij
`
`ij
`
`0
`
`2
`
`(
`^V…A0† ‡ L2
`^Vx ˆ 1
`"
`d2
`S
`‡ x2 ^V…B4† ‡ L2
`
`2
`
`4
`

`
`…7†
`
`Here N stands for the total number of pairwise comparisons be-
`tween the random simulated sequence i and its inferred ancestral
`sequence j.
`If the sequence alignment is large enough (>300 codons) we
`can assume that both nonsynonymous (dN) and synonymous (dS)
`nucleotide substitutions follow a Poisson distribution with pa-
`rameters
`
`kN ˆ nP…hN†;
`
`kS ˆ sP…hS†
`
`…4†
`
`Therefore, the probabilities of observing dN = a and dS = b nu-
`cleotide changes in a speci®c window Z of the real sequence
`alignment are, respectively,
`
`PZ…XNZ
`
`ij
`
`jkN† ˆ e
`
`k
`NZ
`ij
`
`X
`NZ
`k
`ij
`NZ
`ij
`XNZ
`
`ij
`
`!
`
`and
`
`PZ…YSZ
`
`ij
`
`jkS† ˆ e
`
`k
`SZ
`ij
`
`Y
`SZ
`ij
`
`k
`
`SZ
`ij
`YSZ
`
`ij
`
`!
`
`…5†
`
`ij
`
`are the observed number of nonsynonymous
`and YSZ
`where XNZ
`and synonymous nucleotide substitutions, respectively, between
`sequence i and its inferred ancestral sequence j in window Z and are
`calculated as
`
`ij
`
`ˆ na; YSZ
`XNZ
`This procedure of estimating PZ…XNZ
`
`ij
`
`ij
`
`ij
`
`jkN† and PZ…YSZ
`
`ij
`
`jkS† is re-
`peated in every window and the possible action of selection in each
`region is also tested. Once those window regions with a number of
`substitutions signi®cantly di€erent than expected are detected, an
`analogy of
`selection intensity is performed estimating the nonsyn-
`onymous-to-synonymous rate ratio and its variance.
`
`1
`
`Finding an Appropriate Window Size to Detect
`Selective Constraints
`
`The sliding window-based method requires a previous estimation
`of the most appropriate window size to analyze the di€erent re-
`gions of the alignment. We have to stress caution when the window
`size is chosen randomly because there is a strong e€ect of the codon
`number and composition of the window size used on the results
`obtained (see Results and Discussion). Furthermore, the use of a
`speci®cally optimized window size strongly depends on the data
`under analysis. Thus, assuming absence of saturation of synony-
`mous sites, there is a threshold of variability above which highly
`
`ˆ sb
`
`…6†
`
`ij
`
`ij
`
`

`

`List of alternative hypotheses that explain values of
`Table 1.
`nonsynonymous-to-synonymous rate ratios (x), dN, and dS, sig-
`ni®cantly di€erent than expected under the neutral modela
`
`x > 1
`
`x = 1
`
`x < 1
`
`dN
`
`>
`>
`>
`=
`=
`=
`<
`>
`>
`=
`=
`<
`<
`<
`>
`>
`>
`=
`=
`=
`<
`<
`<
`
`dS
`
`>
`=
`<
`>
`=
`<
`<
`>
`=
`=
`<
`>
`=
`<
`>
`=
`<
`>
`=
`<
`>
`=
`<
`
`Hypothesis accepted
`
`1
`1
`1, 3, 5
`1, 3
`1
`3, 5
`3, 4
`6
`6
`0
`0, 3
`4, 6
`0, 4
`3, 4
`6
`2, 6
`3, 6
`2, 6
`2
`2, 4
`4, 6
`2, 4
`3, 4
`
`a A value of dN or dS signi®cantly higher or lower than the mean
`values for the sequence alignments is indicated by > or <, re-
`spectively. Hypotheses: 0, neutrality; 1, positive selection; 2, pur-
`ifying selection; 3, saturation of synonymous sites; 4, saturation of
`nonsynonymous sites; 5, translational selection; and 6, high mu-
`tation rates.
`
`where Ai and Bi are the transition and transversion rates in the ith
`degenerated site and their variances ‰ ^V…Ai†; ^V…Bi†Š are given by Eqs.
`(3) and (4) in Li (1993), Li is the number of the ith degenerated sites
`in the region analyzed, and Cov(Ai, Bi) is the covariance of the
`transition and transversion rates in the ith degenerated site, given
`by Eq. (A8) of Ina (1998). It should be noted that this variance is
`calculated for x values along the sequence alignment comparing
`sequence i and its immediate simulated ancestral sequence j, thus
`there is no need to include covariances of pairwise comparisons in
`the estimation of ^V…x†.
`To test the reliability of the estimator of ^V…x†, we simulated 200
`alignment data sets with the EVOLVER program, calculated their
`empirical V(x), and compared them with the estimated variance
`obtained with Eq. (7). Empirical variance of x was estimated from
`the simulated data as
`
`…8†
`
`XN
`…xi x†2
`
`iˆ1
`
`V…x† ˆ 1
`N
`
`Here xi is the estimation of x in the ith window and x is the mean
`of x along the sequence alignment.
`
`Data Sets Analyzed by the Sliding
`Window-Based Method
`
`To examine the usefulness of this method, we used sequences of the
`env and gag genes from HIV-1 (Seibert et al. 1995) and the VP1 to
`VP4 capsid genes from FMDV (Fares et al. 2001), where positive
`selection has previously been demonstrated.
`
`513
`
`HIV-1 sequence data from Seibert et al. (1995) were analyzed to
`test the selective pressure acting on di€erent genomic regions of this
`retrovirus. These authors examined the action of selection on the
`gag (structural proteins), env (envelope region, including both re-
`gion gp120 and region gp41), and pol (polymerase protein) genes
`and they observed that the pattern of nucleotide substitution in the
`case of the env gene di€ered signi®cantly from that of either pol or
`gag. Furthermore, they showed that, when the env gene was used,
`signi®cantly higher nonsynonymous than synonymous substitu-
`tions were observed for sequence families B, C, and E in the V2
`region, for sequence families B and C in the V3 region, and for
`family C in the V4 region. [Each family included those sequences
`identi®ed by Seibert et al. (1995) as being phylogenetically closely
`related.] When overall means were computed for the ®ve families,
`the average dN value was signi®cantly higher than the dS value
`estimated in V2 and V3 from the gp120 belonging to the env gene.
`In the case of gp41, this region showed signi®cantly higher dS than
`dN values for families A±D.
`In our study, a subset of the data analyzed by Seibert et al.
`(1995) of the env gene (23 sequences) and gag gene (17 sequences)
`from the ®ve families were used to test the validity of the method.
`The sequences are identi®ed by their accession numbers in the
`phylogenetic trees shown in Fig. 2.
`Sequences of FMDV from the capsid region (available upon
`request) with a known experimental history were also used to test
`the validity of the method. Fares et al. (2001) studied 31 sequences
`belonging to the capsid region of FMDV. These sequences were
`obtained from virus isolated under di€erent experimental proce-
`dures that emulated several environmental and epidemiological
`conditions such as genetic drift, massive infections, immune pres-
`sure, and persistent infections. This study showed that several
`amino acid changes were ®xed by positive selection in the main
`antigenic site (A) and in the 50 end of the variable capsid protein
`VP3. Here we use the same 31 sequences to con®rm the previously
`obtained results and to test the possible presence of additional
`regions under selective pressures.
`
`Phylogenetic Reconstruction
`
`sequences were obtained with CLUSTAL-X
`Alignments of
`(Thompson et al. 1994) and adjusted manually when necessary.
`Phylogenetic analysis of the aligned sequences was done using
`di€erent methods: neighbor-joining (NJ) (Saitou and Nei 1987),
`maximum likelihood (ML) (Felsenstein 1981), and maximum par-
`simony (MP) (Fitch 1971). The MEGA v2.01 program (Kumar
`et al. 2001) was used to obtain NJ trees and the bootstrap support
`values for 1000 pseudoreplicates. MP and ML trees were recon-
`structed using DNPARS and DNAML, respectively, from the
`PHYLIP v3.5 package (Felsenstein 1993).
`
`Results
`
`Positive Selection in HIV-1 Regions Coding
`for the env Protein
`
`Phylogenetic trees for the env gene showed the same
`topology despite the phylogenetic reconstruction
`method used (Fig. 2A). The P(hN) value estimated
`from random alignments of the env region was 0.497.
`Analysis of the suitable window size generated the
`distribution probabilities depicted in Fig. 3. Accord-
`ing to this ®gure, only windows including more than
`nine codons had a 5% lower P(dN) value higher than
`
`

`

`514
`
`Fig. 2. Maximum-likelihood phylogenetic trees for the env (A) and gag (B) genes of HIV-1 and for capsid (C) genes of FMDV. Numbers in
`the internal nodes correspond to 1000-pseudoreplicate bootstrap values. A±D in the env phylogenetic tree (A) identify families of closely
`related sequences.
`
`The estimation of the average x for the whole gene
`gave a value of 1.168, indicating a weak action of the
`positive selection on some regions of this gene. The
`use of a window of nine codons on the real data set
`identi®ed several codons subjected to positive selec-
`tion. When each HIV-1 env family was analyzed sep-
`arately, several regions, not detected before, appear to
`be subjected to positive selection. Thus, HIV-1 fami-
`lies A, B, C, and D show a dN value signi®cantly
`higher than expected and higher than dS in the V1
`variable region and in a region located between codon
`450 and codon 460. Furthermore, in families A and B,
`variable region 2 (V2) showed positive selection of
`amino acid changes, whereas families B, C, and D
`showed additional regions in V3 to be under positive
`selection of amino acid substitutions (see Table 2). In
`contrast to the rest of the HIV-1 sequence families,
`family C showed regions with higher dN than dS values
`in the V4 variable region (Table 2). Table 2 lists those
`regions with dN values higher than expected, P(dN), x
`values, and the signi®cance of x compared to the
`average value, under a t test. None of the sliding
`windows showed dS rates lower or higher than ex-
`pected, which indicates that, in these regions, there is
`no saturation of synonymous sites in comparison with
`the overall gene region, or codon composition bias
`
`Fig. 3. Distribution of the probability of nonsynonymous nu-
`cleotide substitutions, P(dN), assuming di€erent window sizes (d).
`P(dN) is depicted versus d in 100 random data sets of sequence
`alignments. Circles represent the mean P(dN) for each window size;
`error bars are based on the 5% higher and lower P(dN) values in
`each window size, respectively. The 5% threshold probability limit
`is indicated by the dashed line.
`
`0.05. This means that, sliding a window of nine co-
`dons in the real sequence alignment, we should not
`expect to have any signi®cant P(dN) deviation from
`that expected by chance, given that more than 95% of
`the probability values are higher than 0.05.
`
`

`

`515
`
`Table 2. Detection of positively selected regions in the HIV-1 env gene corresponding to several phylogenetic families of sequencesa
`
`gp120 variable
`region
`
`Codon window
`
`V1
`
`V2
`
`V3
`
`V4
`
`132±140
`134±142
`140±148
`142±150
`143±151
`144±152
`176±184
`179±189
`180±188
`181±189
`297±305
`310±318
`311±319
`312±320
`313±321
`314±322
`390±398
`391±399
`
`Family A
`
`Family B
`
`Family C
`
`Family D
`
`P(dN)
`
`0.003
`0.003


`0.015

`0.028

`0.014*
`0.003








`
`x
`
`4.827*
`6.906*


`5.368*

`3.035*

`3.521*
`3.886*








`
`P(dN)
`
`x
`
`P(dN)
`
`x
`
`P(dN)
`
`x
`

`0.000





`0.001
`0.019
`0.006

`0.039
`0.041
`0.041
`0.001



`

`2.720*





`3.465*
`3.451*
`4.313*

`3.079*
`1.839*
`1.839*
`5.311*



`

`0.000

`0.000
`0.000
`0.041




`0.027
`0.048


`0.008

`0.045
`0.027
`

`3.932*

`7.863*
`4.510*
`3.588*




`6.375*
`5.349*


`5.150*

`3.747*
`5.035*
`
`0.006
`0.003
`0.000
`0.023







`0.002
`0.001

`0.000
`0.003


`
`6.211*
`3.467*
`11.912**
`3.776*







`2.838*
`2.363

`37.385**
`2.457*


`
`a V1 to V4 are the four variable regions where positive selection was detected using the sliding-window procedure. P(dN) and x are the
`probability of the observed nonsynonymous substitutions and the nonsynonymous-to-synonymous rate ratio, respectively. *Probabilities
`lower than 5% and **probabilities lower than 1% (z test).
`
`Table 3. Window midpoint codons where positive selection was
`detected in the env gene from HIV-1a
`
`Region
`
`V1
`V2
`V3
`Epitope 4
`V4
`
`Midpoint
`codons
`
`130±152
`178±192
`296±323
`336±344
`387±408
`
`x
`
`t
`
`3.289
`3.150
`4.204
`4.679
`2.715z
`
`18.39
`17.19
`26.34
`30.45
`13.42
`
`One-
`tailed P
`
`0.0173
`0.0185
`0.0121
`0.0104
`0.0236
`
`a V1±V4 are the variable regions of gp120 from the env gene and
`epitope 4 represents the fourth positional epitope of T cells located
`in the gp41 region of the env gene. x is the mean estimated non-
`synonymous-to-synonymous rate ratio for the di€erent regions
`located in each variable region. t is the Student test value for x (H0:
`x = 1).
`
`due to translational selection. However, saturation of
`synonymous sites was detected for the four families in
`the V5 region of gp120 (region between codon 427 and
`codon 438), with P( dS) = 0.048, and in gp41 T cell
`epitopes 5 [codons 731±740; P( dS) = 0.037] and 6
`[codons 761±777; P( dS) = 0.035]. In addition, family
`A showed saturation of synonymous sites in epitope 1
`[codon region 517±541, with P( dS) = 0.048]. In these
`regions, we also detected fewer nonsynonymous sub-
`stitutions than expected by chance and x < 1. Taken
`together, these results indicate that these regions are
`subjected to very strong purifying selection.
`When the four families were compared, positively
`selected regions were detected in the V1±V4 variable
`regions, with x values signi®cantly higher than 1
`(Table 3). The location of the midpoints of the
`
`windows showing positive selection is given in Fig. 4.
`On average, we detected 6.5% of codon sites under
`positive selection, 9.3% neutral sites, and 84.2% of
`codon sites under strong purifying selection.
`
`Di€erences in Selective Pressures Among Di€erent
`Regions of the gag Gene of HIV-1
`
`Once again, di€erent phylogenetic methods gave the
`same tree for the gag gene (Fig. 2B). We examined an
`alignment of 17 sequences of the HIV-1 gag gene
`following the new method. Seibert et al.
`(1995)
`demonstrated that the average dS value along the
`sequence alignment exceeded signi®cantly that of dN
`in the gag gene of HIV-1 (i.e., x < 1). However, as
`mentioned, the estimation of the average x value over
`the region is very simplistic for concluding anything
`about the selective pressures acting on a protein. In
`our analysis, several regions were identi®ed as being
`subjected to strong purifying selection. The average
`value of x for the whole region was estimated to be
`0.8, which means that, on average, the gag region of
`the HIV-1 genome evolves under purifying selection.
`The optimal window size for analyzing the gag region
`sequences was seven codons. Sliding a window of this
`length, we detected three types of regions in the gag
`gene according to the selective constraints acting on
`them. Most regions evolve under neutrality (95.87%),
`and other regions are under strong purifying selection
`(3%), while a few codons (0.13%) showed signi®-
`cantly more nonsynonymous changes than expected
`by chance in all families examined (codons 73±79),
`
`

`

`516
`
`Positively selected regions detected by the sliding window
`Fig. 4.
`method in the env gene of HIV-1. Only those regions with proba-
`bilities lower than 5% are represented.
`
`Positively selected regions detected by the sliding window
`Fig. 5.
`method in the FMDV region coding for capsid proteins. Only re-
`gions with probabilities lower than 5% are depicted.
`
`with x = 15.034 (strong positive selection), signi®-
`cantly higher than the average (P < 0.001) and
`clearly higher than 1. Analysis of the distribution of
`synonymous nucleotide
`substitutions along the
`alignment did not give any signi®cant results,
`in-
`dicating either the absence of saturation in synon-
`ymous sites or the absence of nucleotide or codon
`composition bias.
`
`Positive Selection on the Capsid Proteins of FMDV
`
`2
`
`The genomic region of the FMDV coding for the four
`capsid proteins (VP1±VP4) was also analyzed. In this
`case, we used 31 sequences from viruses isolated
`under di€erent
`experimental conditions (Fares et al.
`2001). As in the other two cases, the phylogenetic
`reconstruction methods gave the same tree topology
`for the relationships between FMDV sequences (Fig.
`2C). In our previous study we adjusted 13 codon-
`based models (Yang et al. 2000a) to these sequences
`and used the likelihood ratio test to determine the
`best model ®tting the data. Among the discrete
`models, the positive selection model with all x cate-
`gories estimated from the data (called M3) gave sig-
`ni®cantly higher likelihood estimates than the neutral
`model (M1) or the positive selection model with only
`one x value estimated from the data (M2). Also, the
`continuous positive selection model assuming a b-
`distribution for x < 1 and estimating x for a frac-
`tion of codons with x > 1 (M8) provided a sig-
`ni®cantly higher
`log-likelihood value
`than the
`continuous model with a b-distribution (M7) or
`models with a mixture of b- and c-distributions (M9±
`M13). Each model implements an empirical Bayesian
`approach to identify those codons under positive se-
`lection. For all models, codons 310, 666, and 669
`were identi®ed to be under positive selection with
`posterior probabilities higher than 95%.
`
`Computer simulations of sequences with the pro-
`gram EVOLVER and posterior analysis by our
`method gave an estimate of P(hN) = 0.2. Taking this
`probability into account, the optimal window size
`was determined to be three codons. Estimation of the
`x parameter using the whole capsid region gave a
`value of x = 0.578,
`indicating that
`this region
`evolves mainly under purifying selection. Applying
`the sliding-window procedure to the capsid region,
`with a window size of three codons, revealed four
`regions where dN was signi®cantly higher than ex-
`pectation under neutrality, whereas dS did not di€er
`from the mean e

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket