Journal of Molecular Evolution
© Springer-Verlag 1983

What is a Conservative Substitution?

Simon French¹ and Barry Robson²

¹ Department of Decision Theory, University of Manchester, Manchester, M13 9PL, Great Britain
² Department of Biochemistry, University of Manchester, Manchester, M13 9PL, Great Britain
`
Summary. It is commonly recognised that many evolutionary changes of amino acid sequence in proteins are conservative: a substitution of one amino acid residue for another has a far greater chance of being accepted if the two residues are similar in properties. Here we investigate what properties are most important in determining the similarity of two amino acids, from the evolutionary point of view. Our results confirm earlier observations that the hydrophobicity and the molecular bulk of the side chain tend to be conserved. More importantly they also show that evolutionary pressures favour the conservation of secondary structure, and that all these properties can be arranged in a two dimensional diagram in which distances well preserve the observed substitution frequencies between amino acids. These results were obtained by a multidimensional scaling technique, and are independent of any a priori opinions about conserved properties. Thus, it is demonstrated that all relations of importance to single amino acid substitutions can be represented by a single figure which is much more comprehensible and useful than the usual tabular representation of substitution frequencies. Such a figure conveniently portrays the "biochemical code" for conservative substitution.
`
Key words: Amino acid substitution – Protein evolution – Conservation of secondary structure – Hydrophobicity – Bulk – Multidimensional scaling
`
Offprint requests to: B. Robson

Introduction

Dayhoff et al. (1972) have collated much data concerning the amino acid sequences of proteins. From their work and that of others it is apparent that natural selection has favoured changes in protein sequence in which certain physical and chemical properties of residues are conserved ('conservative substitution'). The question that concerned us was whether the relevant properties have been properly and completely identified. We have been brought to the conclusion that interesting details have not been appreciated, except perhaps by inspection of similar sequences, which does not allow all the significant properties to be considered together in a quantitative, objective, and useful manner. We have sought a more objective, quantitative approach, finally using a data-analytic method not widely exploited by evolutionary molecular biologists; this study therefore serves also to bring this technique to their attention.
The topic of conservative substitutions is of interest for three reasons. First, there is the obvious need for such information in order to consider the similarity and relatedness of sequences. Second, it has often been hypothesised that natural selection pressures mainly favour the conservation of 3-dimensional structure while allowing for extensive substitution (cf. 'neutralist theories'). If this is so, the few chemical and physical properties conserved would presumably be those that most generally determine the 3-dimensional structure of a protein. Third, such knowledge is a pointer to which properties of residues must be modelled in computer simulation of protein folding. In speaking of 3-dimensional structure, we include the secondary structure on which the gross 3-dimensional structure depends.
MSN Exhibit 1026 - Page 1 of 5
MSN v. Bausch - IPR2023-00016

Our analysis starts from that of Dayhoff et al. (1972) (their table 9.10), which constitutes a 'relatedness odds matrix'. The elements of this matrix give the ratio of two probabilities: the probability that two residues at the same locus in two proteins are the consequence of common ancestry, and the probability that the relation occurred only by chance. The data were derived from comparing sequences within the cytochrome c, haemoglobin, myoglobin, virus coat, chymotrypsinogen, glyceraldehyde 3-phosphate dehydrogenase, clupeine, insulin, and ferredoxin families of proteins. By combining several quite different families, they have obtained an account of the selective pressures on proteins in general rather than in specific instances. The measure of Dayhoff et al. thus provides a matrix ($\sigma_{ij}$), where the elements have the property that $\sigma_{ij} > \sigma_{kl}$ if amino acids i and j appear more similar to each other from the viewpoint of evolutionary pressure than amino acids k and l.
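As an illustration only, a ratio of this kind can be sketched in Python. The pair probabilities below are invented for three residues and are not taken from Dayhoff et al.; the function names are hypothetical, and dividing the observed pair probability by the product of the background frequencies is one common way of forming such odds, not necessarily the exact construction behind their table 9.10.

```python
# Hypothetical probabilities that residues a and b occupy the same locus in
# two related proteins (ordered pairs, symmetric, summing to 1).
# These numbers are invented for illustration.
joint = {
    ("L", "L"): 0.20, ("I", "I"): 0.15, ("K", "K"): 0.30,
    ("L", "I"): 0.15, ("I", "L"): 0.15,
    ("L", "K"): 0.0125, ("K", "L"): 0.0125,
    ("I", "K"): 0.0125, ("K", "I"): 0.0125,
}


def background_frequencies(joint):
    """Marginal frequency of each residue, summed over all partners."""
    freq = {}
    for (a, _b), p in joint.items():
        freq[a] = freq.get(a, 0.0) + p
    return freq


def relatedness_odds(joint):
    """Observed pair probability divided by the chance expectation f_a * f_b."""
    freq = background_frequencies(joint)
    return {(a, b): p / (freq[a] * freq[b]) for (a, b), p in joint.items()}


odds = relatedness_odds(joint)
for pair in [("L", "I"), ("L", "K")]:
    print(pair, round(odds[pair], 3))
```

With these made-up numbers the Leu/Ile pair has odds above one and the Leu/Lys pair odds below one, which is the kind of ordering that the subsequent analysis relies on.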
Dayhoff et al. have noted that these similarity data point naturally to a classification of amino acids into 5 groups:

Hydrophilic: Ala, Pro, Gly, Glu, Asp, Gln, Asn, Ser, Thr
Sulphydryl: Cys
Aliphatic: Val, Ile, Leu, Met
Basic: Lys, Arg, His
Aromatic: Phe, Tyr, Trp.
`
Here we extend Dayhoff et al.'s analysis through the statistical technique of multidimensional scaling. We refine their grouping and show that this new grouping corresponds to a very high degree with one deduced by Robson and Suzuki (1976). This latter classification grouped amino acid residues according to their tendencies to be involved in different forms of secondary structure. This correspondence between the two classifications is the first objective evidence from substitution probabilities for the reasonable conjecture that natural selection strongly favours the maintenance of the intrinsic stability of secondary structure features.
`
Method

Dayhoff et al. (1972) gave a similarity matrix, an ordering of elements reflecting the ordering of pairwise similarities between objects, here amino acid residues. The occurrence of such data is commonplace in psychology and sociology. Within those disciplines a family of statistical techniques, known collectively as multidimensional scaling, has been developed to explore and analyse similarity matrices. Surveys of these methods may be found in Shepard (1974), Shepard et al. (1972) and Sibson (1972). Briefly, the similarity matrix of Dayhoff et al. is analysed as follows. Using iterative optimisation techniques described in Kruskal (1964) and Guttman (1968), a set of 20 points (one for each amino acid residue) is found in m dimensions such that the nearer two points are, the more similar are the corresponding amino acids with respect to evolutionary pressures. Essentially, this is comparable to finding the geographical distribution of towns from only an ordering of the (approximately determined) intertown distances (Kendall 1971) and, furthermore, without knowing that the solution is two dimensional. More precisely, the optimisation is a best fit (in a particular least squares sense; Kruskal 1964) of the interpoint distances to the negatives of the measures of Dayhoff et al., asking only that
`
$$d_{ij} > d_{kl} \iff \sigma_{ij} < \sigma_{kl},$$
`
`
where the $d_{ij}$ are the interpoint distances corresponding to the $\sigma_{ij}$. Because this demands only that the $\sigma_{ij}$ are reasonably ordered and does not assume any functional relationship between the $d_{ij}$ and $\sigma_{ij}$, this method is known to be very robust.
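The ordinal fit described here can be made concrete with a short Python sketch. It measures how far a given configuration departs from the required ordering, using a monotone (pool-adjacent-violators) regression of the distances and the commonly quoted 'stress-1' normalisation associated with Kruskal (1964). The function names, the four-object toy data and the neglect of tied similarities are all assumptions made for illustration; this is not the MDS(X) implementation.

```python
from itertools import combinations
from math import dist, sqrt


def pava(values):
    """Least-squares non-decreasing fit to a sequence (pool adjacent violators)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge blocks while the earlier block has the larger mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted


def kruskal_stress(points, similarity):
    """Stress-1 of a configuration against an ordering of similarities."""
    pairs = list(combinations(sorted(points), 2))
    # Most similar pair first, so ideal distances are non-decreasing.
    pairs.sort(key=lambda p: -similarity[p])
    d = [dist(points[a], points[b]) for a, b in pairs]
    d_hat = pava(d)
    num = sum((x - y) ** 2 for x, y in zip(d, d_hat))
    den = sum(x ** 2 for x in d)
    return sqrt(num / den)


# Invented similarities for four objects (larger means more similar).
similarity = {("A", "B"): 6, ("B", "C"): 5, ("A", "C"): 4,
              ("C", "D"): 3, ("B", "D"): 2, ("A", "D"): 1}

good = {"A": (0, 0), "B": (1, 0), "C": (3, 0), "D": (7, 0)}
bad = {"A": (7, 0), "B": (1, 0), "C": (3, 0), "D": (0, 0)}

stress_good = kruskal_stress(good, similarity)
stress_bad = kruskal_stress(bad, similarity)
print(stress_good, stress_bad)
```

Only the rank order of the similarities enters the calculation, which is why no functional relationship between similarity and distance has to be assumed.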
`
Results

A representation was readily obtained in two dimensions without any evidence that the use of a higher dimension would display any further information (Fig. 1). The obtained stress (Kruskal 1964) was 9% and the Monte Carlo test procedure of Spence and Graef (1974) suggested clearly that a 2-dimensional representation was adequate.
Since the optimisation technique underlying multidimensional scaling is iterative, it requires an initial configuration. To avoid any possible bias we started with ten distinct random configurations. In each case the result converged to one with no significant difference from the one shown in Figure 1.
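The use of several random starting configurations can be sketched as follows. To keep the example short, a metric least-squares loss and a naive hill climb stand in for the non-metric optimisers of Kruskal and Guttman, so this illustrates only the restart procedure; the four-point target, the seed and all function names are hypothetical.

```python
import random
from itertools import combinations
from math import dist


def loss(config, target):
    """Sum of squared differences between configuration and target distances."""
    return sum((dist(config[a], config[b]) - t) ** 2
               for (a, b), t in target.items())


def hill_climb(target, names, rng, steps=2000, scale=0.5):
    """Naive stochastic descent from one random 2-D starting configuration."""
    config = {n: (rng.uniform(-1, 1), rng.uniform(-1, 1)) for n in names}
    start = current = loss(config, target)
    for _ in range(steps):
        name = rng.choice(names)
        x, y = config[name]
        trial = dict(config)
        trial[name] = (x + rng.gauss(0, scale), y + rng.gauss(0, scale))
        trial_loss = loss(trial, target)
        if trial_loss < current:  # keep only moves that improve the fit
            config, current = trial, trial_loss
    return start, current, config


names = ["A", "B", "C", "D"]
true_xy = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (0, 2)}  # invented layout
target = {(a, b): dist(true_xy[a], true_xy[b])
          for a, b in combinations(names, 2)}

rng = random.Random(0)  # fixed seed so that the ten starts are reproducible
runs = [hill_climb(target, names, rng) for _ in range(10)]
best_start, best_final, best_config = min(runs, key=lambda r: r[1])
print(best_final)
```

Comparing the final loss values across the ten runs gives a rough check on whether the starts agree; a recovered configuration is in any case defined only up to rotation, reflection and translation.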
As expected (Dayhoff et al. 1972; Dickerson and Geis 1969), conservation of the hydrophobic nature of the residue is the most visually apparent feature. All points lie fairly close to a curve, the distance along which (from charged sidechains such as lysine, arginine and histidine to nonpolar aromatic residues, tyrosine and tryptophan) correlated well visually with increasing hydrophobicity. However, the "horse shoe" shape of the curve also suggests a property of secondary importance, namely bulk, which increases towards the right of the diagram. These two properties, hydrophobicity and bulk, are the only two amino acid properties that can be clearly seen to vary systematically along a trajectory, linear or otherwise, on the diagram. An automatic search for other properties of importance was also undertaken by analysis of the variation of many properties (Jungck 1978) using a method developed by Carroll (1972) and a program in the MDS(X) suite for multidimensional scaling. However, this failed to discover any further systematic variation. Thus it appears that the representation can answer the very general question: what amino acid properties tend to be conserved in evolution? Hydrophobicity and molecular bulk are the ones that we observed.
However, the innovation in the diagram is that it can answer more specific questions than that general one. Namely, what amino acid properties are conserved in evolutionary change starting from a specific amino acid? Closer inspection of Figure 1 reveals features that at first glance seem curious. For example, the proximity of glycine and proline and of alanine and glutamic acid at the left side of the diagram is quite inconsistent with their bulk or degree of hydrophobicity. However, these apparent anomalies are in the nature of groupings which are strikingly similar to those obtained by Robson and Suzuki (1976), who undertook a clustering analysis of the amino acid residues in a space whose dimensions corresponded to the helix, extended chain, and coil forming power of a residue. The work took no account of evolutionary relationships between proteins or residues, but only between sequence and conformation. Their figures are reproduced for comparison in Fig. 2. The similar groupings obtained in Fig. 1 demonstrate for the first time that the preferences for different types of backbone conformational (secondary) structure are also a property of considerable importance for evolutionary pressures on amino acid mutation. As discussed by Robson and Suzuki, this effect arises as a result of sidechain-backbone interactions in a way largely determined by the nature or absence of hydrogen bonding groups in the sidechain. Glycine, proline, alanine and glutamic acid were treated as special cases on the grounds of special stereochemical effects and this is strongly supported by the present study.

[Figure 1. Plot annotation: direction of increasing molecular volume]

Fig. 1. Multidimensional scaling plot (see text) of the odds-relatedness matrix of Dayhoff et al. (1972). The symbols correspond to those of Robson and Suzuki (1976). ● Hydrophobic residues; ° Hydrophobic residues which have ability to form hydrogen bonds; ○ Residues which may receive or donate hydrogen bonds; ⊗ Residues which may receive and donate hydrogen bonds; = Gly; # Pro; a separate symbol marks His (see text). The numbers associated with each amino acid are their hydrophobicities as given by Levitt (1976). The indicated directions give general trends in the diagram of increasing molecular weight and volume. Note that the axes have no a priori significance in this technique, much as would also arise if a map of Britain were constructed from tables of distances between towns and villages. That is to say, such a distance table contains no information about North-South, East-West axes (though the direction of South might be deduced a posteriori on the grounds that a warmer climate encouraged habitation; in a similar way the properties of importance are deduced above, bearing in mind that important trends need not be linear or even lie on a curve)
There are, however, interesting differences. These authors classified residues according to whether sidechains were non-hydrogen bonding (filled circles in Fig. 1), could receive and donate a hydrogen bond (crossed circles), or could receive or donate a hydrogen bond (open circles). Histidine is 10% protonated at neutral pH and with reservations was assigned to the group of residues whose sidechains can both receive and donate a hydrogen bond. From the point of view of evolutionary pressures, Fig. 1 places it firmly alongside lysine and arginine, which are close to fully charged at neutral pH. Indeed, it reveals that evolutionary pressure places much greater emphasis on whether a sidechain is negatively charged (glutamate and aspartate) or positively charged (lysine, arginine, histidine) than did the tentative assignments of Robson and Suzuki based on clustering analysis and the sidechain-backbone hydrogen bonding interactions. Cysteine also deviates from the largely nonpolar but weakly hydrogen bonding group to which it was assigned by the cluster analysis of Robson and Suzuki, but this may be expected from the point of view of evolutionary pressure because of its special role in forming covalent disulphide bridges in some cases. Use of tables of substitution distances obtained independently for intracellular and extracellular proteins might well clarify this point, though this would depart from the idea of seeking "gross" global determinants of substitution frequencies independent of any kind of family grouping, and independent of any specific interactions peculiar to a protein class. On the whole, however, the agreement is remarkable and this illustrates the value of multidimensional scaling in revealing patterns which may be meaningful to the observer.

[Figure 2. Axes: helix information, pleated sheet information, turn (reverse turn) information and coil information, in decinats]

Fig. 2. The groupings of Robson and Suzuki based on conformational tendencies and physicochemical properties alone, i.e. without reference to comparison of homologous sequences. Symbols as Fig. 1
The similarity between alanine and glutamate (AE) and proline and glycine (PG) in terms of substitution distances may seem surprising in view of the fact that the former are strong helix formers, the latter strong helix breakers. A preliminary view might be that molecular bulk dominates here, perhaps along with a few physical properties of less general importance. However, other types of secondary structure tendency must exist and it may be that the ability to disrupt extended (primarily pleated sheet) structure, to introduce local bends in it, and to demarcate its boundaries, is of prime evolutionary importance. These aspects are now under investigation.
`
Conclusions

Evolution of proteins in general has tended to conserve (1) the degree of hydrophobicity of a residue, (2) the conformational preferences of its backbone and (3) its bulk. All these are continuous properties and the extent to which a substitution is conservative is correspondingly a matter of degree. Since the maximum distance in Fig. 1 is between glycine and tryptophan, changes between residues at less than one third this distance might be conveniently classified as "good" conservative substitutions. Because most substitutions which would conserve bulk involve greater distances in Fig. 1 and indeed constitute "bad" conservative substitutions, the dominance of importance seems to be in the order given above, with bulk playing a subservient if significant role.
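The one-third rule suggested above is easy to apply once coordinates are available. The Python sketch below uses invented two-dimensional coordinates for five residues; they are placeholders rather than values read from Fig. 1, and the function name is hypothetical.

```python
from itertools import combinations
from math import dist


def good_substitutions(coords, fraction=1 / 3):
    """Pairs closer than `fraction` of the largest interpoint distance."""
    pairs = {(a, b): dist(coords[a], coords[b])
             for a, b in combinations(sorted(coords), 2)}
    cutoff = fraction * max(pairs.values())
    return {pair for pair, d in pairs.items() if d < cutoff}


# Made-up coordinates, chosen only so that Gly and Trp are the most distant pair.
coords = {
    "Gly": (0.0, 0.0),
    "Ala": (1.0, 0.0),
    "Leu": (6.0, 1.0),
    "Ile": (6.5, 1.5),
    "Trp": (10.0, 0.0),
}

good = good_substitutions(coords)
print(sorted(good))
```

Because conservativeness is a matter of degree, the threshold `fraction` is a convenience and can be varied; with real coordinates from the scaling solution the same function would list the "good" conservative pairs.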
This work emphasizes the value of multidimensional scaling in reaching conclusions without any initial a priori assumptions. Jorre and Curnow (1975) have applied the technique to a similar problem but their analysis had a strong theoretical input which modelled their prior beliefs about relationships. Moreover it was confined to a single well-defined protein family, the cytochrome c group, and therefore considered only evolutionary pressures relating to the structure, stability and function of cytochrome c. Hence they arrived at different conclusions and answered a different question. The advantage of the present work is, again, that it applies to the conservation of substitutions in proteins in general, using extensive data from which effects peculiar to conformations of specific families have presumably been almost entirely averaged out.
Acknowledgements. The programs in our analysis were from the MDS(X) package developed by A.P.M. Coxon and funded by the Social Science Research Council. Computing facilities were provided by the University of Manchester Regional Computer Centre. One of us (SF) is most grateful to Dr. C.C.F. Blake for encouraging him to work in this area. The other (BR) is grateful for S.R.C. funding relevant to the discovery of properties relating to protein folding simulations.
After preparation of this manuscript Dr. W. Taylor has drawn our attention to his very similar results and conclusions independently obtained (Taylor 1982). We are grateful to him for useful discussions.
`
References

Carroll JD (1972) Individual differences and multidimensional scaling. In: Shepard RN, Romney AK, Nerlove SB (eds) Multidimensional scaling: Theory and applications in the behavioural sciences. Seminar Press, London, pp 105–155
Dayhoff MO, Eck RV, Park CM (1972) A model of evolutionary change in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure. National Biomedical Research Foundation, Georgetown University, Washington DC, pp 89–99
Dickerson RE, Geis I (1969) The structure and action of proteins. Harper and Row, New York
Guttman L (1968) A general non-metric technique for finding the smallest co-ordinate space for a configuration of points. Psychometrika 33:469–506
Jorre RP, Curnow RN (1975) A model for the evolution of proteins. Biochimie 57:1147–1156
Jungck JR (1978) The genetic code as a periodic table. J Mol Evol 11:211–224
Kendall DG (1971) Construction of maps from odd bits of information. Nature 231:158–159
Kruskal JB (1964) Non-metric multidimensional scaling. Psychometrika 29:1–27
Levitt M (1976) A simplified representation of protein conformations for rapid simulation of protein folding. J Mol Biol 104:59–107
Robson B, Suzuki E (1976) Conformational properties of amino acid residues in globular proteins. J Mol Biol 107:327–356
Shepard RN (1974) Representation of structure in similarity data: problems and prospects. Psychometrika 39:373–421
Shepard RN, Romney AK, Nerlove SB (1972) Multidimensional scaling: Theory and applications in the behavioural sciences. Vols I and II. Seminar Press, London
Sibson R (1972) Order invariant methods for data analysis (with discussion). J Roy Statist Soc B34:311–349
Spence I, Graef J (1974) The determination of the underlying dimensionality of an empirically obtained matrix of proximities. Multivariate Behavioural Research 9:331–342
Taylor W (1982) Private communication
`
Received July 20/Accepted November 1, 1982
`
`
`