The two-bead model
The protein structure is represented by a reduced two-bead model originally described by Levitt [53] and more recently used by Mukherjee and Bagchi [54] as well as Grahnen et al. [35]. Each residue is represented by one backbone bead (Cαi) and one side chain bead (Cβi) (Figure 1). The Cα bead is centered on the Cα atom in an all-atom representation and has a radius of 1.8 Å. For a residue i (except glycine), the Cβi bead center is placed at a distance b
i
from Cαi, which is determined as the centroid of all the side chain atoms (including the Cαi atom), and assigning a residue-dependent radius r
i
. Glycine is simply represented with a Cα bead with the appropriate properties (hydrophobicity etc.). For residue i, where 2 < = i < = N-2 in a protein with N residues, we define θ
i
as the angle between Cαi-1, Cαi and Cβi, and φ
i
as the torsion angle between Cαi-1, Cαi, Cαi+1 and Cαi+2. Residues within two bonds of the termini do not have a defined φ
i
, and the residues at the N-terminus do not have a defined θ
i
. Finally, r
ij
describes the distance between the center of any two beads i and j in the model. In the informational representation of protein structure r
ij
was measured by taking the minimum distance between the Cβ beads in residues i and j (Cα is substituted for Cβ for glycines).
Threading sequences into conformations
To thread a sequence s into a protein backbone conformation c, the side chain replacement algorithm SARA [35] was used. SARA is a very fast approach based on iterative Markov Chain Monte Carlo refinement of the replaced side chain geometry. Backbone coordinates were not adjusted during the replacement procedure, and c was assumed to be constant throughout.
Scoring protein folding with the physics-based model
To evaluate the fit of a sequence s to a backbone conformation c, where the number of residues N is the same for both s and c, s is threaded into c (see above) and then a weighted scoring function V(s, c) is computed:
(1)
where:
V
bend
is a harmonic bending potential around the equilibrium bond angles for Cα-Cβ bonds:
(2)
K
Θ
is the force constant for the bending potential (10.0 kJ mol-1 rad-2), Θ
i
and Θ
i+1
are defined as described above, and is the equilibrium bond angle for a Cβ bead of type t.
V
LJ
is the pair-wise vdW interaction potential, approximated as a sum of Lennard-Jones potentials:
(3)
ε
ij
is the interaction parameter between beads i and j (a pair of Cα beads, a pair of one Cα bead and one Cβ bead or a pair of two Cβ beads), based on their respective hydropathy indexes, σ
ij
is the collision diameter of beads i and j (i.e. the sum of their radii), and r
ij
is the defined in Figure 1. The sum ij runs over all pairs of beads noted above.
V
helix
is the helical potential, an approximation of the entropic and steric effects of the backbone forming an α-helix:
(4)
K
i
1-3 is average helical propensity of residues i, i+1 and i+2. r
i, i+2
is the distance between Cα beads i and i+2. r
i, i+3
is the distance between Cα beads i and i+3. is the average helical propensity of residues i, i+1, i+2, and i+3. Finally, r
h
is the equilibrium helix inter-bead distance (5.5 Å).
V
beta
is the beta-sheet potential, constructed analogously to V
helix
but using an equilibrium beta torsion angle instead of a bead-bead distance:
(5)
is the average beta propensity of residues i-1, i, i+1, and i+2. The beta propensity for a given residue type is constructed using the same linear scaling of helical propensities used by Mukherjee, but is based on Kim and Berg's beta propensity scale [55] rather than Pace and Scholtz's helix propensity scale [32]. C
b
is a scaling constant (0.01) selected to scale the range of torsion angles to a similar magnitude as the range of bead-bead distances in V
helix
. Torsional angle φ
i
is defined above (Figure 1), and φ
b
is the equilibrium beta sheet value (210°) for a two-bead representation as described by Bahar and coworkers [56].
V
ion
is an electrostatic potential based on Coulomb's Law:
(6)
C
c
is a scaling constant (1,000) selected to scale the term to similar magnitude as other terms, q
i
and q
j
are the charges of residues i and j (+1 or -1 depending on residue type), r
ij
is the distance between beads i and j, and ε is the dielectric constant of the protein interior (3.0)[57], with the sum running over all pairs of charged residues with a SASA (see below) of less than 0.25. This roughly approximates the strong screening of charged interactions due to the highly polarizable surrounding water, and the nearly negligible effects of the largely non-polarizable interior of the protein [58]. The whole term is then scaled by a weighting term w
ion
when used in V(s, c) (eq. 1), so C
c
has no effect on the parameterization, but is necessary computationally so as not to introduce errors due to lack of floating point precision.
V
solv
is the potential due to solvation of the protein, approximated by a novel implicit solvent model based on solvent-accessible surface area (SASA):
(7)
h
i
is the hydrophobicity-based interaction parameter of residue i (equivalent to ε
ii
above) and p
i
is the "polarity index", constructed as p
i
= (h
max
-h
min
) - h
i
. SASA(i) is the fraction of solvent-accessible surface area of residue i calculated by the NeighborVector method of Durham [59] as adapted for the two-bead representation.
V
S-S
represents the stability contribution due to formation of predicted disulfide bonds:
(8)
where
(9)
and r
ij
is the distance between two cysteine Cβ beads and r
SS
is the maximal Cβ-Cβ distance in a typical disulfide-bonded cysteine pair (4.5 Å[60]). The sum ij runs over all pairs of cysteine residues. This form of V
cys
predicts disulfide bonds with a specificity of 0.98, sensitivity of 0.91 and Matthews Correlation Coefficient of 0.79 as measured on the structural data set used to construct SCWRL 3.0 [61].
Finally, w
x
are weights for each individual term in eq. 1, determined in a procedure described below.
Scoring protein-protein interactions with the physics-based model
When using protein-protein interaction as a measure of protein function, it becomes necessary to evaluate the strength of the interaction, a proxy for the free energy change of binding or the dissociation constant. The non-covalent terms from the folding scoring function were adapted to compose an interaction score V(s
1
, c
1
, s
2
, c
2
) between a sequence s
1
in conformation c
1
and a sequence s
2
in conformation c
2
:
(10)
and are the same as above but evaluated for inter- rather than intra-molecular bead pairs.
V
Δsolv
is the change in solvation potential upon binding:
(11)
where complex is the bound state. The weights w
x
are determined as described below.
Scoring protein folding and interactions with the knowledge-based model
The approach of Bastolla and co-workers [9] to was used to score folding under a knowledge-based model, and adapted to scoring protein-protein interactions. Briefly, a sequence s in conformation c has a folding score of
(12)
where U
ab
is a matrix describing the free energy gain when amino acids of type a and b are in contact, C is the contact map of c (1 when the Cβ beads of i and j are closer than 4.5 Å, 0 otherwise) and the sum runs over all residue pairs (i, j) that are separated by more than four residues in primary sequence.
To evaluate a protein-protein interaction, a protein complex score is calculated as
(13)
where C is evaluated between c1 and c2 (inter-molecularly), and the sum runs over all residue pairs (i, j) such that i comes from c
1
and j comes from c
2
.
Calculating folding and binding specificity
Folding and binding specificity were evaluated with several considerations. For example, with folding, a given sequence should have the specified backbone rather than an alternative as its most stable state and also present a gap between this state and unfolded or misfolded states. The size of this gap can be measured as a Z-score [62–64]:
(14)
where Gis the distribution of free energies (ΔG) in the alternative conformations, G
nat
is the free energy of folding into the native conformation, and <G> and σ( G ) describe the location and dispersal of that distribution. In other words, Z
fold
can be interpreted as measuring the specificity of a protein sequence for its native structure, or equivalently a combination of unfolding and misfolding stability as well as the preference for the specified fold over alternative folds.
During simulation and sequence sampling, G
nat
is approximated by V(s, c) (eq. 1) or E(s, c) (eq. 12) for the current sequence and a native conformation, and G by calculation of the same measure for 100 random decoy conformations. The decoy conformations were generated by independently randomizing each type of geometrical measurement in the native structure, in accordance with the Random Energy Model (REM) of Bryngelson and Wolynes [6]. For the informational model this means randomizing the residue-residue contacts, whereas in the physics-based model the angles, distances, etc. involved in each term are shuffled. The fold score gap is then calculated as
(15)
where <G> is estimated by the median of G. The dispersal σ( G ) is assumed to be constant. This assumption is computationally efficient, which is a major consideration when evaluating tens of millions of sequences. Also, since the dispersal is purely dependent on the amino acid composition under the REM [65], and the gross features of the composition (the proportion of hydrophobic residues for example) should not change when simulating biologically realistic protein sequences, this enables the approximation Z
fold
≈ S
gap
.
The case of specific protein-protein interaction is simpler due to the lower number of states available. It was previously shown that specific binding can be adequately modeled as a combination of selection for the native ligand and against a single non-specific decoy ligand [14]. Here this approach is adopted during simulations, which are described in more detail below.
Parameterizing the physics-based scoring functions
It has previously been shown that no single parameterization of an energy function fits all proteins optimally [34, 33]. Therefore the individual weights in the scoring function were changed for each protein under investigation to maximize Z
fold
(maximize the specificity of sequence-structure fit in the context of alternative states). The parameter space has many dimensions, is expected to be rugged, and likely contains many local maxima. In order to sample possible parameter values efficiently, and avoid becoming trapped in the course of the search procedure, the Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithm [66, 67] was used. Acceptance probability for a move in parameter space was based on Z
fold
:
(16)
where θ
k
is the set of weights in V(s, c) (eq. 1) for step k of the Markov chain, and T is the temperature of the chain. θ
k
is proposed by perturbing q
k-1
in each dimension with values drawn from a Gaussian distribution with mean 0.1 and variance 0.1. The search space was restricted to (0,1) for all weights, and 100,000 moves were attempted at T = 0.1. The parameter set with the largest Z
fold
found during sampling was retained as the best set of weights.
To obtain the distribution of non-native scores Gnecessary to calculate Z
fold
in this procedure, a two-pronged approach was taken. The goal was to find a set of weights that makes the native sequence-structure combination as specific as possible, both with respect to the fold recognition problem (which conformation does a fixed sequence fold into?) and the inverse folding problem (which sequence best fits a fixed conformation?). Because random sequences may fold into alternative structures, the random sequences also play a role in parameterizing the fold recognition problem, where having a hydrophobic core and preferring a given backbone to an alternative state is not enough. To address specificity of conformation, 1,000 decoy conformations were generated as described above to obtain the distribution of scores G
struct
. Specificity of sequence was addressed by sampling 1,000 random sequences of the same length as the native sequence (with equal probability of drawing any amino acid), threading each one through the native conformation, and obtaining from these the distribution of scores G
seq
. For each score G, composed of individual term scores from V(s, c) (eq. 1), the value of each term score was re-scaled such that the overall term-specific distributions of values were of equal range. The full distribution Gwas formed by the union of G
struct
and G
seq
for each ϴ
k
.
When threading random sequences, or randomly perturbing the native conformation, the resulting score distribution is noticeably skewed (data not shown). In particular, the Lennard-Jones potential produces a long tail of very large scores due to steric clashes. This is not unexpected since not all structures can accommodate all sequences [68], but it does make mean and variance poor estimators of location and dispersal. Fortunately, the probability of observing a sequence s in a particular conformation c follows a Boltzmann distribution:
(17)
where i runs over all possible other conformations, ΔG is the free energy of folding, k is Boltzmann's constant and T is the absolute temperature. It follows that only those alternative conformations with large negative ΔG contribute greatly to this probability, or equivalently that only those decoys with very low scores V(s, c) (eq. 1) are important for folding specificity. Therefore the median was used as a location estimator, and the difference between the first quartile and the median as the estimator of dispersal.
To choose the term weights for a particular protein-protein interaction, the same procedure was carried out, but only with respect to sequence specificity in one binding partner. The score distribution G
seq
was generated by threading 1,000 random sequences through the constant conformation of one binding partner (while the other partner is kept constant in both sequence and structure), and scoring the resulting complex using V(s
1
, c
1
, s
2
, c
2
) (eq. 10) described above. In other words, s
2
is randomly sampled while s
1
, c
1
and c
2
remain constant. It is then possible to calculate a gap score Z
bind
in the same manner as Z
fold
:
(18)
where G
bind
= G
seq
. Weight parameters w
x
can then be chosen using the MCMC algorithm described above (eq. 16).
Sampling sequences
Once a set of weights for V(s, c) (eq. 1) are obtained (the parameters U
ab
of E(s, c) in eq. 12 are fixed), the folding score gap landscape and its minima in sequence space were characterized for both scoring functions by varying s. Sequences were sampled based on S
gap
(eq. 15) both near to and far from the native state using a typical [69] MCMC approach, with acceptance probability:
(19)
where each step k was a single substitution in the protein sequence.
For near-native sampling two temperatures were employed: a very high temperature for nearly unbiased sampling (T = 10 for the informational function E(s, c) of eq. 12, T = 50 for the physics-based function V(s, c) of eq. 1) and a lower temperature to find local minima (T = 0.1 for E(s, c), T = 1.5 for V(s, c)). The low temperature was chosen to make the chain capable of passing the barrier in the energy landscape caused by an average deleterious substitution under the informational model (~0.1 units of E(s, c)) with a reasonably high probability (~0.1). This was also calibrated with respect to the expected fraction of sequences with better folding stability for real proteins and indirectly on dN/dS. The high temperature was chosen to make the sampling nearly independent of S
gap
(> 0.95 expected acceptance probability). Temperatures for the physics-based function were then chosen such that the actual acceptance frequencies obtained from sampling under the informational model were reproduced. Differences in the observed dN/dS ratios between the physics-based and informational models were due to differences in the ruggedness of the landscape and its sequence context dependence (local ruggedness).
To increase sample density, divergence of the chains from the starting state was restricted to 5%-30%, in increments of 5%, and two chains were run at each temperature and divergence limit. The number of attempted steps was adjusted to obtain ~8,500 unique samples for each function. Samples were projected onto a two-dimensional representation of sequence space using Sammon mapping [70] and score surfaces were calculated using Gnuplot v 4.2.
To discover regions of the landscape where most mutations are destabilizing (i.e S
gap
is near a maximum), an important feature of biological protein sequences leading to observed dN/dS ratios, ten replicate chains attempting 100,000 steps were run for each function at the lower sampling temperature without any divergence restrictions. The chains were thinned by taking every fourth unique sample and the final 100 such samples were retained, resulting in 1,000 unique samples for each scoring function. Samples were Sammon-mapped and sequence logos [71] were calculated to assess convergence. Structural properties of the sampled equilibria were further characterized by projecting the consensus sequence at each position with > 2 bits of information onto the protein conformation.
Measuring scoring performance
To test the accuracy of the scoring functions, Z
fold
was calculated for the native sequence and 1,000 random sequences. The test was performed on the same diverse set of structures used in the development of the side chain replacement method employed [35].
The interaction score was tested by a similar scoring procedure. In this case a subset of the structures from the PepX database [72], which describes protein-peptide interactions, was constructed. As for the folding score test set, 100 structures (25 from each major SCOP Class) were selected, each a centroid of a cluster of binding interfaces at the 3 Å-75% level of PepX. For each Class with more than 25 available centroids, the structures were sorted by average B-factor across the binding interface, and the 25 structures with the lowest B-factor were selected. A list of these structures can be found in Additional File 1. Z
bind
was calculated as described above (eq 18) for the native peptide sequence and a set of 1,000 random peptide sequences for each complex.
In addition, the distribution of weights for each term in V(s, c) (eq. 1) and V(s
1
, c
1
, s
2
, c
2
) (eq. 10) across the structural data sets was generated to compare the relative importance of the terms. The effect on the gap score of using each term individually was also examined. Furthermore, the numerical stability of the MCMC algorithm for choosing weights and the characteristics of the parameter-stability landscape were probed by producing eight different sets of non-native scores Gfor the SAP protein, and re-running the parameterization 100 times for each set.
Simulating sequences
As a test of the utility of the methods in an evolutionary context, simulations were performed under negative selection for protein folding and binding. In a manner similar to that of Rastogi et al. [11], a population of 1,000 virtual organisms containing a single copy of the SAP protein [25] (PDB ID: 1D4T) was evolved for 200,000 generations at a rate of 10-5 mutations per bp per generation at the DNA level, with transitions being twice as probable as tranversions. To obtain a thermodynamically stable starting point, non-native optima of S
gap
(eq. 15) in protein sequence space were sampled as described above and a starting DNA sequence was randomly selected from the reverse translation of those samples.
In each generation, the mutated DNA sequence in each organism was translated to protein, threaded through the native SAP conformation c
protein
, and the fitness calculated as
(20)
where S
gap
(s) is the fold gap score (eq. 15) of the current sequence s, S
gap
start is the fold gap score of the starting sequence, V
bind
(s) is the binding score V(s, c
protein
, s
ligand
, c
ligand
) (eq. 10), s
ligand
and c
ligand
correspond to the SLAM peptide (the native ligand of SAP), V
bind
start is the binding score of the starting protein sequence to the same ligand, and V
decoy
(s) is the binding score V(s, c
protein
, s
decoy
, c
ligand
) for a decoy ligand. Simulations under the informational model used E(s, c) (eq. 12) to compute S
gap
and E(s
1
, c
1
, s
2
, c
2
) (eq. 13) to compute binding scores. In other words, mutations that make the protein or the protein-ligand complex less stable than the starting point are infinitely deleterious, mutations that maintain those stabilities but enable binding of the decoy ligand are somewhat deleterious, and all other changes are neutral. Organisms were then propagated to the next generation by random sampling weighted by fitness w (eq. 20), with replacement, while maintaining a constant population size.
The decoy ligand were constructed by threading a sequence s
decoy
= IWMTIYMIIIT through the SLAM peptide conformation c
ligand
for the informational function, and s
decoy
= RLPTIYICITG for the physics-based function. Both sequences are variations on the experimentally determined XXXTIYXX(VI)XX SAP-binding motif [25], and do not bind the protein at the starting point of the simulation.
Finally, each simulation was replicated 10×, and the resulting sequences analyzed. Structural patterns of evolutionary rates (dN/dS) were measured by comparing sequences to the original sequence and calculating according to the PBL method [73, 16], and the distribution of position-specific residue preference was compared with that in the Pfam database [30] for the same fold by constructing sequence logos [71] from the evolved sequences. The sequences were also tested for emergence of rate heterogeneity by assessing the superior fit of the data to the rates-across-sites model over an equal-rates model using ProtTest [74]. To discern if the novel solvation model preserves the general pattern of hydrophobic core and hydrophilic surface the proportion of hydrophobic and hydrophilic residues in parts of the protein were measured before and after simulation.
Availability
The scoring function and simulation software are available as C++ code by request. The code will be released as part of open source software for sequence simulation to be described in a future publication.