Using equilibrium frequencies in models of sequence evolution
BMC Evolutionary Biology volume 5, Article number: 21 (2005)
The f factor is a new parameter for accommodating the influence of both the starting and ending states in the rate matrices of "generalized weighted frequencies" (+gwF) models for sequence evolution. In this study, we derive an expected value for f, starting from a nearly neutral model of weak selection, and then assess the biological interpretation of this factor with evolutionary simulations.
An expected value of f = 0.5 (i.e., equal dependency on the starting and ending states) is derived for sequences that are evolving under the nearly neutral model of this study. However, this expectation is sensitive to violations of its underlying assumptions as illustrated with the evolutionary simulations.
This study illustrates how selection, drift, and mutation at the population level can be linked to the rate matrices of models for sequence evolution to derive an expected value of f. However, as f is affected by a number of factors that limit its biological interpretation, this factor should normally be estimated as a free parameter rather than fixed a priori in a +gwF analysis.
Felsenstein  was the first to introduce an evolutionary model for DNA sequences, which allows for unequal nucleotide frequencies (see also ). His F81 model allows for substitutions at a rate proportional to the frequencies of the ending nucleotides. It is considered the simplest rate matrix for accommodating variable nucleotide frequencies and is therefore the starting point for the consideration of more complex models with frequency variation (e.g., the HKY model of Hasegawa et al. ). Goldman and Whelan  described new variants of these F81-based models (their +gwF (generalized weighted frequencies) models; e.g., JC+gwF for Jukes and Cantor , and K2P+gwF for Kimura ). At the heart of their +gwF variants was a new free parameter (f) to accommodate the frequencies of the starting, as well as ending, nucleotides in the evolutionary process:
where q ij refers to the substitution rate from nucleotide i to j, π i and π j correspond to their equilibrium base frequencies, and s ij is the exchangeability between the two. In the +gwF variants, the substitution rate becomes more dependent on the ending nucleotide as f decreases from 1 to 0, with f = 0 for the classic F81-type models.
This study starts with a population genetics model to derive equations that link weak selection, genetic drift, and mutation to the f parameter and evolutionary rate matrices of the +gwF variants. These theoretical derivations lead to an expected value of f = 0.5. However, as illustrated with simulations, the f parameter is complex and thus its biological interpretation must be considered with caution.
Derivation of the rate matrix for the weak selection model
The nearly neutral model of molecular evolution states that most DNA mutations of longer-term evolutionary consequence are under weak selection and are therefore prone to drift [7, 8]. For a diploid population of size N, a neutral mutation has a probability of 1/2N of becoming fixed in the population. However, because of drift, even slightly deleterious mutations can become fixed, but at a probability of less than 1/2N. Advantageous mutations have higher fixation probabilities than neutral mutations. In the nearly neutral model, the distribution of alleles is determined by an equilibrium of selection, drift, and mutation.
Consider a number of sites under identical evolutionary constraints and with a bias in nucleotide distribution. Assume that weak selection and drift are the causes of this bias; e.g., as for the codon usage biases in micro-organisms and Drosophila [9, 10]. In our model, some nucleotides confer a slightly higher fitness onto the organism than do others, regardless of their position, and these can become fixed in the population through drift and/or selection. Here, we also assume that selective advantages are additive for the two alleles of the diploid organism [11, 12]. Let the selective advantages of the four nucleotides be given by s A , s C , s G , and s T . The differences between these selection coefficients will be very close to zero, since no strong selection is expected.
Consider a mutation from nucleotide i to j, with a selective advantage of s = s j - s i (a selective disadvantage exists when s is negative). For a population of size N and an effective size of N e , Kimura  showed that the fixation probability in this population is given by:
when s ≠ 0. For s = 0, we have P(s) = 1/2N. This approximation is valid for small values of s, which is the case here.
The substitution rate from nucleotide i to j is proportional to P(s j - s i ):
q ij = 2N μ ij P(s j - s i ), (3)
where μ ij is the mutation rate from i to j. For different i and j, μ ij can vary because of unequal transition versus transversion rates (for example). Furthermore, let us assume that the mutation rate is the same for either direction of substitutions between i and j. This assumption is necessary to maintain the widely used condition of time reversibility in the evolutionary process, which thereby keeps the following derivations tractable [1, 13].
We then have:
Since q ij /q ji can be written as a function evaluated at s j divided by the same function evaluated at s i , evolution is time reversible according to this model with:
Here, c and c' are constants with c' = -l/4N e log c, which will be chosen to make the equilibrium frequencies sum to one. The substitution rates can now be approximated as:
Given an exchangeability of s ij = μ ij , this equation reduces to equation (1) with f = 0.5 and an adjustment factor of:
This adjustment factor is close to one for moderate ratios of π, with a horizontal tangent around π j /π i = 1 and a slight bending downwards when deviating from this value (Fig. 1). Thus, a value of f = 0.5 is suggested for the +gwF variants according to these derivations of the weak selection model.
Evolutionary simulations were conducted to examine the effects of violating certain assumptions in the above model of weak selection. Unless otherwise noted, these simulations were based on the K2P+gwF model with f = 0.5 and k = 2 (for the transition/transversion ratio). Simulations consisted of four sequences of length 10,000 and relied on a symmetric rooted phylogeny with all branch lengths equal to 0.10 expected substitutions per site under the model in question [i.e., ((seq1:0.10, seq2:0.10):0.10, (seq3:0.10, seq4:0.10):0.10)]. Violations of the weak selection model were incorporated in the simulations by: (1) heterogeneous sequences with sites drawn from different equilibrium base frequencies; (2) populations in disequilibrium due to changing N e ; and (3) an accelerated C to T substitution rate. Estimates of f for the simulated sequences were made with the K2P+gwF model. Forty simulations were run for each test condition, with the results for the f estimates summarized as their means and twice their standard errors. In the first set of simulations, six categories of sites with different equilibrium distributions were considered (Table 1). The f estimates for the simulations with each category alone were not significantly different from 0.5 (i.e., the value under which the sequences were generated). In contrast, for the simulated heterogeneous sequences (i.e., those composed of equal numbers of sites from two or three different categories), their values of f varied significantly in either direction from 0.5. Analogous results were obtained for the simulations of homogeneous and heterogeneous sequences under the HKY model (with f = 0.0 instead of 0.5). Thus, the value of f can vary considerably when heterogeneous sequences are analyzed with a +gwF model. Here, such deviations are a consequence of using a single rate matrix to analyze sequences that were derived from two or three different ones.
In the second set of simulations, N e was kept constant until the time of the most recent common ancestor for the four simulated sequences. Then, N e was either left unchanged or was suddenly changed by a certain factor. The latter was done by replacing the rate matrix derived from equation (4), resulting in new equilibrium frequencies of the nucleotides. When N e was kept constant, the selective pressures and drift were left unchanged, thereby maintaining the same starting equilibrium frequencies throughout the phylogeny. Thus, the corresponding f estimates did not significantly differ from 0.5 (Fig. 2). In contrast, increases in N e lowered the value of f as the efficiency of selection was increased relative to drift . Correspondingly, the evolutionary process became more dominated by the ending nucleotide. This increasing dominance can be expected to continue until a new equilibrium is restored (which occurs on a longer time scale than that in these simulations).
In the third set of simulations, an acceleration in the C to T substitution rate was incorporated, thereby modeling an increase in their mutation rate due to the deamination of methylated C's in CpG pairs . The introduction of this bias resulted in significant deviations of f in either direction from 0.5, even though their sequences were simulated in equilibrium (Fig. 2). Thus, the value of f can vary considerably when the rates for reciprocal mutations are unequal.
This study illustrates how selection, drift, and mutation within a population can be linked to the f parameter and rate matrices of the +gwF variants for sequence evolution. Our weak selection model relies on the fixation probabilities of mutant alleles with additive genie selection and equal mutation rates for reciprocal substitutions. What is now needed are additional studies that link other population genetics models to the +gwF variants . For example, the population genetics models of Li , which focus on allele frequency distributions and different modes of selection and mutation, could be studied for their connections to the f parameter and +gwF rate matrices.
Collectively, the three sets of simulations highlight that the f parameter is complex and can be influenced by a number of different factors . This complexity limits its biological interpretation and the use of its expected value of 0.5 as derived for the weak selection model. Correspondingly, in many +gwF analyses, f ill need to be estimated as a free parameter rather than fixed beforehand.
Goldman and Whelan  focused on amino acid sequences, where they found that the +gwF models provided better fits to the majority of their protein data sets. They also analyzed two rather small nucleotide data sets for which the general reversible model (REV) outperformed the +gwF variants. As noted by them, the REV model provides enough free parameters to cover the effects of a +gwF analysis. Thus, given sufficient data, this model will consistently outperform the simpler +gwF variants, since it can always accommodate more of the evolutionary process by virtue of its extra parameters. Nevertheless, as widely acknowledged, simpler models have their place, since they allow one to maximize analytical power for more limited data, while minimizing the risk of over-parameterization [13, 16]. Thus, as for the JC, K2P, and HKY models, we expect their +gwF variants to remain of interest as part of the hierarchy of simple to complex models for sequence evolution.
Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17: 368-376.
Tajima F, Nei M: Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. J Mol Evol. 1982, 18: 115-120.
Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22: 160-174.
Goldman N, Whelan S: A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol. 2002, 19: 1821-1831.
Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism. Edited by: Munro HN. 1969, New York: Academic Press, 21-132.
Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111-120.
Ohta T: Slightly deleterious mutant substitutions in evolution. Nature. 1973, 246: 96-98.
Ohta T: The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 1992, 23: 263-286. 10.1146/annurev.es.23.110192.001403.
Buhner M: The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991, 129: 897-907.
Carlini DB, Stephan W: In vivo introduction of unpreferred synonymous codons into the Drosophila Adh gene results in reduced levels of ADH protein. Genetics. 2003, 163: 239-243.
Kimura M: On the probability of fixation of mutant genes in populations. Genetics. 1962, 47: 713-719.
Kimura M, Ohta T: Population genetics, molecular biometry, and evolution. Proc Sixth Berkeley Symp Math Stat Prob. 1972, 5: 43-68.
Felsenstein J: Inferring phylogenies. 2004, Sunderland, MA: Sinauer Associates
Razin A, Riggs AD: DNA methylation and gene function. Science. 1980, 210: 604-610.
Li WH: Maintenance of genetic variability under mutation and selection pressures in a finite population. Proc Natl Acad Sci U S A. 1977, 74: 2509-2513.
Posada D, Crandall KA: Selecting models of nucleotide substitution: an application to human immunodeficiency virus 1 (HIV-1). Mol Biol Evol. 2001, 18: 897-906.
B.K. thanks the Carlsberg Foundation, the University of Aarhus, and the Danish National Science Research Council (grant number 21-00-0283) for their support. Both authors also thank the Department of Zoology, University of Florida for its support.
Both authors contributed to the conception and design of this study and to the writing, reviewing, and final approval of this article. B.K. performed the simulations and parameter estimations.
About this article
Cite this article
Knudsen, B., Miyamoto, M.M. Using equilibrium frequencies in models of sequence evolution. BMC Evol Biol 5, 21 (2005). https://doi.org/10.1186/1471-2148-5-21