Elastic network models
Let the conformation of an N-site protein be represented by the column vector of the 3N Cartesian coordinates of its N Cα atoms: $\mathbf{r} = (x_1\, y_1\, z_1\, x_2\, y_2\, z_2 \ldots x_N\, y_N\, z_N)^T$. $\mathbf{r}_i = (x_i\, y_i\, z_i)^T$ is the position vector of the ith Cα. The vector joining sites i and j is $\mathbf{d}_{ij} = \mathbf{r}_j - \mathbf{r}_i$, with length $d_{ij} = |\mathbf{d}_{ij}|$. We use $\mathbf{r}^0$ for the protein’s equilibrium conformation, in which the ith site is at $\mathbf{r}_i^0$.
An Elastic Network Model (ENM) represents the folded protein as a network of sites connected by springs. ENMs have proved accurate and useful in a variety of applications [17, 19]. The potential energy landscape is given by:
$$V(\mathbf{r}) = \frac{1}{2} \sum_{i<j} k_{ij} \left( d_{ij} - d_{ij}^0 \right)^2 \qquad (1)$$
where $d_{ij}^0$ and $k_{ij}$ are, respectively, the equilibrium length and force constant of spring ij. As far as we know, all models proposed so far assume that $d_{ij}^0 = d_{ij}(\mathbf{r}^0)$, i.e. that at the equilibrium conformation $\mathbf{r}^0$ all springs are relaxed.
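The energy function can be illustrated with a minimal sketch, assuming the usual harmonic form $V = \frac{1}{2}\sum_{i<j} k_{ij}(d_{ij} - d_{ij}^0)^2$. The function names `enm_energy` and `relaxed_lengths` and the dictionary layout are hypothetical choices made for this illustration only.

```python
import math
from itertools import combinations


def relaxed_lengths(coords0):
    """Equilibrium spring lengths d0_ij measured on the equilibrium structure."""
    pairs = combinations(range(len(coords0)), 2)
    return {(i, j): math.dist(coords0[i], coords0[j]) for i, j in pairs}


def enm_energy(coords, eq_lengths, k):
    """Harmonic network energy: 0.5 * sum over i<j of k_ij * (d_ij - d0_ij)**2.

    coords     : list of (x, y, z) tuples, one per site
    eq_lengths : dict {(i, j): d0_ij} with i < j
    k          : dict {(i, j): k_ij} with i < j; absent pairs carry no spring
    """
    energy = 0.0
    for i, j in combinations(range(len(coords)), 2):
        k_ij = k.get((i, j), 0.0)
        if k_ij == 0.0:
            continue
        d_ij = math.dist(coords[i], coords[j])
        energy += 0.5 * k_ij * (d_ij - eq_lengths[(i, j)]) ** 2
    return energy


# Two sites joined by one spring of force constant 3 and relaxed length 2.
r0 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
d0 = relaxed_lengths(r0)
springs = {(0, 1): 3.0}
stretched = [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
print(enm_energy(r0, d0, springs), enm_energy(stretched, d0, springs))
```

Because the equilibrium lengths are measured on the equilibrium structure itself, the energy vanishes there, which is the "all springs relaxed" condition stated above.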
Fluctuations and flexibility
No protein is frozen at its equilibrium conformation. At non-zero absolute temperature, the folded protein fluctuates around $\mathbf{r}^0$, sampling conformational space with the equilibrium Boltzmann probability density function:
$$\rho(\mathbf{r}) = \frac{e^{-\beta V(\mathbf{r})}}{\int e^{-\beta V(\mathbf{r})}\, d\tau} \qquad (2)$$
where $\beta = 1/(k_B T)$, with T the absolute temperature and $k_B$ Boltzmann’s constant. The denominator of Eq. (2) is the partition function of the folded protein:
$$Z = \int e^{-\beta V(\mathbf{r})}\, d\tau \qquad (3)$$
where ∫ … dτ stands for integration over the whole of conformational space.
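The dynamical flexibility (mobility) of a site is ordinarily quantified using its Mean Square Fluctuation (MSF):
$$\mathrm{MSF}_i = \left\langle \left| \mathbf{r}_i - \mathbf{r}_i^0 \right|^2 \right\rangle \qquad (4)$$
where the angle brackets denote the average over the Boltzmann distribution of Eq. (2).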
To calculate $\mathrm{MSF}_i$ using Eq. (4), the potential energy function Eq. (1) is approximated using a second-order Taylor expansion around its equilibrium conformation. First, the Hessian matrix H of second derivatives of the potential Eq. (1) with respect to the atoms’ Cartesian coordinates is calculated. Then, H is inverted to obtain the 3N × 3N variance-covariance matrix C, which is composed of a 3 × 3 block $\mathbf{C}_{ij}$ for each pair of sites. Finally, a site’s MSF is given by [20]:
$$\mathrm{MSF}_i = \mathrm{Tr}(\mathbf{C}_{ii}) \qquad (5)$$
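The three steps (Hessian, inversion, trace of the diagonal blocks) can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the code used in this work: the function names are hypothetical, the covariance is taken as $k_B T$ times the inverse Hessian, and, because the Hessian of a free network has six zero eigenvalues (rigid translations and rotations), the inversion is done as a pseudo-inverse that discards those modes.

```python
import numpy as np


def anm_hessian(coords, k):
    """3N x 3N Hessian of a spring network evaluated at its relaxed conformation.

    coords : (N, 3) array of equilibrium positions
    k      : (N, N) symmetric array of spring constants (0 means no spring)
    """
    n = len(coords)
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            if k[i, j] == 0.0:
                continue
            d = coords[j] - coords[i]
            # Off-diagonal block: -k_ij * (unit vector)(unit vector)^T
            block = -k[i, j] * np.outer(d, d) / (d @ d)
            hessian[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hessian[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            # Diagonal blocks accumulate minus the off-diagonal blocks.
            hessian[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hessian[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    return hessian


def covariance_from_hessian(hessian, kT=1.0, tol=1e-8):
    """Pseudo-inverse of the Hessian, skipping the (near-)zero rigid-body modes."""
    evals, evecs = np.linalg.eigh(hessian)
    keep = evals > tol * evals.max()
    return kT * (evecs[:, keep] / evals[keep]) @ evecs[:, keep].T


def mean_square_fluctuations(cov):
    """MSF_i as the trace of the i-th 3 x 3 diagonal block of the covariance."""
    return np.diag(cov).reshape(-1, 3).sum(axis=1)


# Toy example: six randomly placed sites, all connected with unit springs.
rng = np.random.default_rng(0)
coords = rng.normal(size=(6, 3)) * 5.0
k = np.ones((6, 6)) - np.eye(6)
msf = mean_square_fluctuations(covariance_from_hessian(anm_hessian(coords, k)))
print(msf)
```

Working in units where $k_B T = 1$ is harmless here, since a common scale factor disappears once the values are z-normalized.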
An empirical flexibility model
Several studies have investigated the correlation between site-specific rates of evolution, or other sequence-variability measures, and the corresponding flexibility. Since such studies use Pearson’s correlation coefficients as a measure of association, the underlying assumption is that there is a linear relationship between rate of evolution and flexibility. To make this assumption explicit, here we postulate the following Flexibility Model:
$$\omega_i^{\mathrm{rel}} = a_P + b_P\, \mathrm{MSF}_i^{\mathrm{rel}} \qquad (6)$$
where $\omega_i^{\mathrm{rel}}$ is the relative rate of substitution of the ith site. In general, for site-specific scalar properties we will use relative values obtained by z-score normalization. For any given site-specific property $x_i$, the z-score normalized values are $x_i^{\mathrm{rel}} = (x_i - \langle x \rangle)/\sigma_x$, where the average $\langle x \rangle$ and the standard deviation $\sigma_x$ are calculated over all sites of the same protein. The subscript P is used to note that a priori the coefficients may depend on the protein considered. We emphasize that the Flexibility Model is empirical: rather than derived from first principles, it is postulated, based on the intuitive notion that flexible sites should accommodate mutations more easily.
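As an illustration of the normalization, the sketch below z-scores a list of per-site values for one protein. The name `z_normalize` is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption; the text does not say which was used.

```python
from statistics import fmean, pstdev


def z_normalize(values):
    """Return (x - mean) / sd for every value, using one protein's sites only."""
    mean = fmean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]


raw = [1.0, 2.0, 3.0, 4.0]
print(z_normalize(raw))
# A decreasing linear transform of the same data gives the mirrored z-scores.
print(z_normalize([1.0 - 0.5 * v for v in raw]))
```

The second call shows a property used repeatedly below: z-scores are unchanged by adding a constant or multiplying by a positive factor, and they flip sign when the factor is negative.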
A mechanistic stress model
We introduce here a mechanistic model that explicitly includes the effects of mutations and natural selection. We consider mutations as random perturbations of the wild-type ENM potential [21–23]. A random mutation at site i results in a mutant whose potential $V_{mut}$ is obtained from Eq. (1) by adding perturbations to the equilibrium length of each of its springs: $d_{ij}^0 \to d_{ij}^0 + \delta_{ij}$. We further assume that the springs are independently perturbed and that the perturbations are spring-independent, randomly drawn from a distribution with zero mean and constant variance $\alpha^2$:
$$\langle \delta_{ij} \rangle = 0, \qquad \langle \delta_{ij}^2 \rangle = \alpha^2 \qquad (7)$$
As mentioned above, when the wild type is at its equilibrium conformation $\mathbf{r}^0_{wt}$, all springs are relaxed by construction. In contrast, when the mutant is at $\mathbf{r}^0_{wt}$, the mutated site’s springs will be stressed (stretched or compressed). For further reference, we define the Mean Local mutational Stress (MLmS) as follows:
$$\mathrm{MLmS}_i = \left\langle V_{mut}(\mathbf{r}^0_{wt}) \right\rangle_{mut@i} \qquad (8)$$
where $\langle \ldots \rangle_{mut@i}$ stands for averaging over random mutations at the ith site.
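The averaging over random mutations can be checked numerically with a small Monte Carlo sketch. It rests on assumptions that go beyond the text: the stress of one mutant is taken to be its harmonic energy at the wild-type equilibrium, $\frac{1}{2}\sum_j k_{ij}\delta_{ij}^2$, and the perturbations are drawn from a Gaussian (the model only fixes their mean and variance). The function names are hypothetical.

```python
import random


def mutant_stress_energy(k_row, alpha, rng):
    """Energy of one random mutant of a site, evaluated at the wild-type equilibrium.

    k_row : force constants of the springs attached to the mutated site
    alpha : standard deviation of the perturbation of each equilibrium length
    """
    return sum(0.5 * k_ij * rng.gauss(0.0, alpha) ** 2 for k_ij in k_row)


def mean_local_mutational_stress(k_row, alpha, n_mutants=20000, seed=0):
    """Monte Carlo average of the stress energy over random mutations at one site."""
    rng = random.Random(seed)
    total = sum(mutant_stress_energy(k_row, alpha, rng) for _ in range(n_mutants))
    return total / n_mutants


k_row = [1.0, 0.5, 2.0]   # springs of the mutated site (toy values)
alpha = 0.3
estimate = mean_local_mutational_stress(k_row, alpha)
closed_form = 0.5 * alpha ** 2 * sum(k_row)
print(estimate, closed_form)
```

Under these assumptions the sampled average converges to $\frac{\alpha^2}{2}\sum_j k_{ij}$, so the stress of a site grows linearly with the total stiffness of the springs attached to it.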
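To complete the model, we derive a simple selection function. First, we assume that there is a single specific active conformation $\mathbf{r}_{active}$. Next, we acknowledge fluctuations and assume that the protein’s activity (either the wild type’s or a mutant’s) is proportional to the concentration of the active conformation $\mathbf{r}_{active}$. Finally, we assume that $\mathbf{r}_{active} = \mathbf{r}^0_{wt}$ and, accordingly, we model the acceptance probability of a mutant as:
$$p_{accept} = \frac{C_{mut}\, \rho_{mut}(\mathbf{r}_{active})}{C_{wt}\, \rho_{wt}(\mathbf{r}_{active})} \qquad (9)$$
where $C_{mut}$ and $C_{wt}$ are the concentrations of folded protein for the mutant and wild type, respectively, and $\rho_{mut}$ and $\rho_{wt}$ are the corresponding Boltzmann probability densities (Eq. 2). From statistical mechanics, the folded-unfolded equilibrium constants for the wild-type and mutant proteins are, respectively, $K_{wt} = C_{wt}/C_U = Z_{wt}/Z_U$ and $K_{mut} = C_{mut}/C_U = Z_{mut}/Z_U$. We further assume that the partition function $Z_U$ and the concentration $C_U$ of unfolded protein are the same for the mutant and wild type. Therefore $C_{mut}/C_{wt} = Z_{mut}/Z_{wt}$. Replacing this relationship and Eq. (2) into Eq. (9) we find:
$$p_{accept} = e^{-\beta \Delta V}, \qquad \Delta V = V_{mut}(\mathbf{r}^0_{wt}) - V_{wt}(\mathbf{r}^0_{wt}) = V_{mut}(\mathbf{r}^0_{wt}) \qquad (10)$$
Finally, averaging over random mutations at site i and using Eq. (8) we obtain the acceptance rate:
$$\omega_i \approx 1 - \beta\, \mathrm{MLmS}_i \qquad (11)$$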
where β may be thought of as representing not just temperature but also selection pressure, and we have assumed that βΔV << 1 (mild selection) to approximate the exponential to first order. To finish, we z-normalize the variables of Eq. (11) to get the relative rates of evolution:
$$\omega_i^{\mathrm{rel}} = -\,\mathrm{MLmS}_i^{\mathrm{rel}} \qquad (12)$$
This equation specifies the stress model.
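The step from the acceptance probability to the acceptance rate can be spelled out as a short worked derivation. It is a sketch that assumes the acceptance probability of a single mutant has the Boltzmann form $e^{-\beta \Delta V}$, with $\Delta V$ the energy of the mutant at the wild-type equilibrium conformation, and that the MLmS of a site is the average of $\Delta V$ over random mutations at that site; the symbol $\omega_i$ for the acceptance rate is used here for illustration.

```latex
\begin{align*}
\omega_i &= \left\langle e^{-\beta \Delta V} \right\rangle_{\mathrm{mut}@i} \\
         &\approx \left\langle 1 - \beta \Delta V \right\rangle_{\mathrm{mut}@i}
            && \text{first order in } \beta \Delta V \ll 1 \\
         &= 1 - \beta \left\langle \Delta V \right\rangle_{\mathrm{mut}@i} \\
         &= 1 - \beta\, \mathrm{MLmS}_i
\end{align*}
```

Since z-normalization removes additive constants and positive scale factors, under these assumptions the relative rate depends on the relative MLmS alone, with a negative sign: more stressed sites evolve more slowly.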
Relationship of flexibility and stress with packing density
The purpose of this work is to investigate why local packing density (LPD) correlates with rate of evolution at the site level. The previous models relate rates of evolution with MSF (Eq. 6) and MLmS (Eq. 12). Here we derive the relationship between these properties and LPD measures.
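First, we relate flexibility and stress with the potential energy parameters of Eq. (1). Let us define:
$$k_i = \sum_{j \ne i} k_{ij} \qquad (13)$$
Regarding flexibility, replacing Eqs. (1), (2), and (3) into Eq. (4), following [12], and using Eq. (13), it can be found that:
$$\mathrm{MSF}_i \approx \frac{c}{k_i} \qquad (14)$$
where c is a site-independent constant. Regarding stress, from Eqs. (1), (7), and (8), after some algebra, we get:
$$\mathrm{MLmS}_i = \frac{\alpha^2}{2}\, k_i \qquad (15)$$
Note that Eq. (14) is an approximation while Eq. (15) is an identity.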
Second, to relate the previous models to LPD we need to specify the ENM’s spring constants. A variety of ENMs have been developed (see [24] for a recent comparison). Here, we consider two models. First, the “parameter-free Anisotropic Network Model” (pfANM) [25], which uses:
$$k_{ij} = \frac{1}{\left( d_{ij}^0 \right)^2} \qquad (16)$$
Second, the “Anisotropic Network Model” (ANM) [20], for which:
$$k_{ij} = \begin{cases} 1 & \text{if } d_{ij}^0 < R_{cut} \\ 0 & \text{otherwise} \end{cases} \qquad (17)$$
where $R_{cut}$ is typically between 10 Å and 18 Å.
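From Eqs. (13), (16), and (17) and z-normalizing we find:
$$k_i^{\mathrm{rel}} = \mathrm{LPD}_i^{\mathrm{rel}} \qquad (18)$$
where $k_i$ is the sum of the spring constants of site i (Eq. 13) and the superscript rel denotes z-normalized values. For the pfANM, LPD is the Weighted Contact Number (WCN) of [26], and for the ANM, it is the Contact Number (CN): the number of sites closer than $R_{cut}$. Finally, from Eqs. (14) and (18) it follows:
$$\mathrm{MSF}_i^{\mathrm{rel}} \approx \left( 1/\mathrm{LPD} \right)_i^{\mathrm{rel}} \qquad (19)$$
Similarly, from Eqs. (15) and (18) we get:
$$\mathrm{MLmS}_i^{\mathrm{rel}} = \mathrm{LPD}_i^{\mathrm{rel}} \qquad (20)$$
Note that while MSF is approximately equal to 1/LPD, MLmS is exactly equal to LPD (for relative z-normalized values).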
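The two packing measures can be computed directly from Cα coordinates. The sketch below is illustrative: the function names are hypothetical, WCN is taken as the sum of inverse squared distances to all other sites, and CN counts the sites strictly closer than the cut-off.

```python
import math


def weighted_contact_number(coords):
    """WCN_i: sum over all other sites j of 1 / d_ij**2."""
    return [
        sum(1.0 / math.dist(ri, rj) ** 2 for j, rj in enumerate(coords) if j != i)
        for i, ri in enumerate(coords)
    ]


def contact_number(coords, r_cut=13.0):
    """CN_i: number of other sites closer than r_cut (same length units as coords)."""
    return [
        sum(1 for j, rj in enumerate(coords) if j != i and math.dist(ri, rj) < r_cut)
        for i, ri in enumerate(coords)
    ]


# Toy coordinates in angstroms: two nearby sites and one distant site.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 20.0, 0.0)]
print(weighted_contact_number(coords))
print(contact_number(coords))
```

Each value equals the sum of the spring constants attached to the site in the corresponding network model (inverse squared distances for the pfANM, unit springs within the cut-off for the ANM), which is why the packing density and the total stiffness of a site coincide after z-normalization.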
Calculation details
We used the dataset of 213 monomeric enzymes of Yeh et al. [11]. The dataset includes proteins of diverse sizes and of diverse functional and structural classes (Additional file 1: Table S1).
We used the evolutionary rates of [11]. They were inferred from multiple alignments of homologous sequences using Rate4Site, which builds the phylogenetic tree using a neighbour-joining algorithm and estimates rates with an empirical Bayesian approach and the JTT model of sequence evolution [27, 28]. To keep in mind that we are not dealing with the (unknown) “true rates”, but with Rate4Site-inferred rates, we use the notation ωR4S.
From the PDB equilibrium structure of each protein we calculated the spring constants of pfANM (Eq. 16) and ANM (Eq. 17), for which we used a cut-off distance of 13 Å [11]. Given a protein and an ENM, we calculated the Hessian matrix, inverted it to obtain the variance-covariance matrix, and calculated the site-specific flexibility values using Eq. (5) and z-normalizing. Regarding stress, we obtained the relative site-specific values using Eq. (15) and z-normalizing.
Since we always use z-normalized relative values, for the sake of notational simplicity, we shall use ωR4S, MSF, and MLmS to refer to z-normalized values from now on.
We performed two analyses. In a protein-by-protein analysis, we performed linear fits of ωR4S with either MSF (Flexibility Model) or MLmS (Stress Model) for each protein, using the lm() function of R. In a global analysis, we pooled together all sites of all proteins and performed similar global fits.
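For readers who do not use R, the kind of fit performed by lm() with a single predictor can be sketched in plain Python as ordinary least squares. The function name `linear_fit` and the toy data are hypothetical; this is an illustration, not the analysis script used here.

```python
from statistics import fmean


def linear_fit(x, y):
    """Ordinary least squares fit of y = a + b * x.

    Returns the intercept a, the slope b and the residual sum of squares.
    """
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, rss


# Protein-by-protein use: one fit per protein; pooled use: concatenate all sites first.
sites_by_protein = {
    "toy_protein": ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]),
}
for name, (predictor, rate) in sites_by_protein.items():
    print(name, linear_fit(predictor, rate))
```

When both variables are z-normalized within a protein, the fitted intercept is zero and the slope equals Pearson’s correlation coefficient, so the per-protein fit and the per-protein correlation carry the same information.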
To assess the goodness-of-fit of a model to the data, we used the Akaike Information Criterion AIC = 2k − 2 ln L, where k is the number of parameters and L is the model’s likelihood given the data. When comparing models, the AIC weight of evidence for each model is given by $w = e^{-\Delta(\mathrm{AIC})/2} / \sum e^{-\Delta(\mathrm{AIC})/2}$, where the sum runs over the models being compared and Δ(AIC) = AIC − min(AIC) [29, 30].
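The AIC comparison can be sketched as follows. Two assumptions are made for the illustration: residuals are Gaussian, so the maximized log-likelihood of a linear fit follows from its residual sum of squares, and the residual variance is counted as one extra parameter (as R does for lm fits). The function names are hypothetical.

```python
import math


def gaussian_aic(rss, n_points, n_coefficients):
    """AIC = 2k - 2 ln L for a least-squares fit with Gaussian residuals.

    k counts the regression coefficients plus the residual variance.
    """
    log_lik = -0.5 * n_points * (math.log(2.0 * math.pi) + math.log(rss / n_points) + 1.0)
    k = n_coefficients + 1
    return 2.0 * k - 2.0 * log_lik


def akaike_weights(aic_values):
    """Weight of evidence of each model: exp(-delta/2), normalized to sum to 1."""
    best = min(aic_values)
    relative_likelihoods = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative_likelihoods)
    return [value / total for value in relative_likelihoods]


# Two models fitted to the same 100 sites, each with intercept and slope.
aic_a = gaussian_aic(rss=60.0, n_points=100, n_coefficients=2)
aic_b = gaussian_aic(rss=64.0, n_points=100, n_coefficients=2)
print(akaike_weights([aic_a, aic_b]))
```

The weights only depend on AIC differences, so they are meaningful only among models fitted to exactly the same data.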
We also calculated Pearson’s correlation coefficients between evolutionary rates and the independent variable that defines each model. When comparing two models, we calculated partial correlation coefficients of evolutionary rates with the independent variable of each model, controlling for that of the other.
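A first-order partial correlation can be obtained from the three pairwise Pearson coefficients. The sketch below uses that textbook formula; the function names and the toy data are hypothetical and are not taken from the dataset.

```python
import math
from statistics import fmean


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Toy data: rate vs. one predictor, controlling for a second predictor.
rate = [-1.0, -2.0, 0.0, 2.0, 1.0]
predictor = [-2.0, -1.0, 0.0, 1.0, 2.0]
control = [2.0, -1.0, -2.0, -1.0, 2.0]
print(pearson(predictor, rate), partial_correlation(predictor, rate, control))
```

In this toy example the control variable is uncorrelated with both other variables, so the partial correlation equals the plain correlation; with correlated predictors such as MSF and MLmS the two values would differ.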