 Research article
 Open Access
 Published:
Efficient contextdependent model building based on clustering posterior distributions for noncoding sequences
BMC Evolutionary Biology volume 9, Article number: 87 (2009)
Abstract
Background
Many recent studies that relax the assumption of independent evolution of sites have done so at the expense of a drastic increase in the number of substitution parameters. While additional parameters cannot be avoided to model contextdependent evolution, a large increase in model dimensionality is only justified when accompanied with careful modelbuilding strategies that guard against overfitting. An increased dimensionality leads to increases in numerical computations of the models, increased convergence times in Bayesian Markov chain Monte Carlo algorithms and even more tedious Bayes Factor calculations.
Results
We have developed two modelsearch algorithms which reduce the number of Bayes Factor calculations by clustering posterior densities to decide on the equality of substitution behavior in different contexts. The selected model's fit is evaluated using a Bayes Factor, which we calculate via modelswitch thermodynamic integration. To reduce computation time and to increase the precision of this integration, we propose to split the calculations over different computers and to appropriately calibrate the individual runs. Using the proposed strategies, we find, in a dataset of primate Ancestral Repeats, that careful modeling of contextdependent evolution may increase model fit considerably and that the combination of a contextdependent model with the assumption of varying rates across sites offers even larger improvements in terms of model fit. Using a smaller nuclear SSU rRNA dataset, we show that contextdependence may only become detectable upon applying modelbuilding strategies.
Conclusion
While contextdependent evolutionary models can increase the model fit over traditional independent evolutionary models, such complex models will often contain too many parameters. Justification for the added parameters is thus required so that only those parameters that model evolutionary processes previously unaccounted for are added to the evolutionary model. To obtain an optimal balance between the number of parameters in a contextdependent model and the performance in terms of model fit, we have designed two parameterreduction strategies and we have shown that model fit can be greatly improved by reducing the number of parameters in a contextdependent evolutionary model.
Background
The past decades have seen the rise of increasingly complex models to describe evolution, both in coding and in noncoding datasets, using a range of different inferential methods of varying complexity. More accurate mathematical models of molecular sequence evolution continue to be developed for good reasons. First, the additional complexity of such models can lead to the identification of important evolutionary processes that would be missed with simpler models. Such discoveries may increase our understanding of molecular evolution. Second, using more accurate models may help to infer biological factors, such as phylogenetic topologies and branch lengths, more reliably. This may arise from the improved ability of those complex models to account for factors that simpler models neglect and whose influence on observed data might otherwise be misinterpreted [1].
Currently, the importance of modeling varying rates across sites in recovering the correct tree topology is wellknown (see e.g. [2]). Acknowledging that the evolutionary rate at different sites might differ may, however, not be sufficient. Takezaki and Gojobori [3] used concatenated sequences of all proteincoding genes in mitochondria to recover the phylogeny of 28 vertebrate species. When the tree was rooted by lampreys or lampreys and sea urchins, the root of the vertebrate tree was incorrectly placed in the maximumlikelihood tree even when accounting for varying rates across sites. The authors suggest the importance of using the appropriate model for probabilities of substitution among different amino acids or nucleotides, as well as the assumption of varying rates across sites. Several other studies confirm the importance of using appropriate evolutionary models (see e.g. [4, 5]).
In this article, we focus specifically on relaxing the assumption of siteindependent evolution, motivated by the fact that a number of empirical studies have found this assumption to be overly restrictive (e.g. [6–11]). Also the detection of the CpGmethylationdeamination process in mammalian data has given rise to many contextdependent studies (for an overview of such socalled CpGrelated studies, studies using codonbased models, as well as the empirical studies mentioned, see [12]). In our previous work [12], we have introduced a contextdependent approach using data augmentation which builds upon standard evolutionary models, but incorporates site dependencies across the entire tree by letting the evolutionary parameters in these models depend upon the ancestral states at the two immediate flanking sites. Indeed, once that ancestral sequences have been estimated, the evolution of a given site across a branch is allowed to depend upon the identities of its immediate flanking bases at the start (i.e. the ancestor) of that branch. The use of existing evolutionary models avoids the need for introducing new and highdimensional evolutionary models for sitedependent evolution, such as those proposed by Siepel and Haussler [13] and Hwang and Green [14]. Indeed, using a general timereversible (GTR) model, which contains six evolutionary parameters on top of the four base frequencies used in the model, for each of the 16 neighboring base compositions results in a total of 96 parameters (although the GTR model is often regarded to have five free parameters, which results in 80 parameters instead of 96). This number of parameters does not include the set of four stationary equilibrium frequencies, which is assumed to be contextindependent. Using a Markov chain Monte Carlo approach with data augmentation, one may then infer the evolutionary parameters under the resulting model for a large genomic dataset under a fixed tree topology. Previous analyses [12] based on this model have revealed large variations in substitution behavior dependent upon the neighbouring base composition.
The increase in dimensionality of such contextdependent models warrants model reduction strategies [15] based on merging similar evolutionary contexts. One approach is to evaluate Bayes Factors [16] to compare models with and without merged contexts. Here, the Bayes Factor is a ratio of two marginal likelihoods (i.e. two normalizing constants of the form p(Y_{ obs }M), with Y_{ obs }the observed data and M an evolutionary model under evaluation) obtained under the two models, M_{0} and M_{1}, to be compared [16, 17]:
Bayes Factors greater (smaller) than 1 suggest evidence in favor of M_{1} (M_{0}). In this paper, we will use log Bayes Factors, which are typically divided into 4 categories depending on their value: from 0 to 1, indicating nothing worth reporting; from 1 to 3, indicating positive evidence of one model over the other; from 3 to 5, indicating strong evidence of one model over the other; and larger than 5, indicating significant (or very strong) evidence of one model over the other [16].
We have chosen to calculate Bayes Factors using thermodynamic integration [18], since the harmonic mean estimator of the marginal likelihood systematically favors parameterrich models. Thermodynamic integration is a generalization of the bridge sampling approach and is therefore often referred to as 'pathsampling' (see e.g. [19–21]). Lartillot and Phillipe [18] present two methods to calculate the Bayes Factor between two models. Using their socalled annealing or melting approach one model at a time is evaluated, resulting in a marginal likelihood for each model. The ratio of these individual marginal likelihoods then yields a Bayes Factor. When this approach yields marginal likelihoods with large error estimates, the resulting log Bayes Factor can be inaccurate. This can be avoided by using modelswitch thermodynamic integration, which directly calculates the log Bayes Factor (Lartillot and Philippe, 2006). By construction, this approach results in lower error estimates for the Bayes Factor and allows one to make use of the additivity property of the logarithmic function to calculate Bayes Factors [22].
Unfortunately, building models based on Bayes Factorbased model comparisons is not feasible because calculating Bayes Factors requires vast amounts of computation time. In view of this, we reduce the number of model evaluations by proposing two schemes for clustering posterior density estimates from different evolutionary contexts (and thus for merging these contexts). To evaluate the fit of the resulting model, a Bayes Factor must be calculated. We propose 2 adaptations of the thermodynamic integration method proposed by Lartillot and Philippe [18] to make this practically feasible. First, we show how the Bayes Factor calculation can be performed in parallel independent runs on different nodes in a cluster system, thus greatly reducing the time needed to obtain results. Second, we show that these independent runs can be adjusted depending on the part of the integrand that is being integrated to allow for more intensive calculations in hardtoevaluate parts of the integrand in the modelswitch integration procedure, resulting in more accurate (log) Bayes Factor estimates.
Methods
Data
We analyze two datasets which we have discussed in earlier work [12]. The first dataset consists of 10 sequences from vertebrate species, each consisting of 114,726 sites, and is analyzed using the following rooted tree topology (((((Human, Chimpanzee), Gorilla), Orangutan), ((Baboon, Macaque), Vervet)), ((Marmoset, Dusky Titi), Squirrel Monkey)). We refer to this dataset as the 'Ancestral Repeats' dataset. The second dataset consists of 20 small subunit (SSU) rRNA genes (nuclear), consists of 1,619 sites for each sequence and is analyzed using the 50% majority rule posterior consensus tree obtained under the general timereversible model. This dataset contains the following sequences: Cyanophora paradoxa, Nephroselmis olivacea, Chlamydomonas moewusii, Volvox carteri, Paulschulzia pseudovolvox, Coleochaete orbicularis 2651, Coleochaete solute 32d1, Coleochaete irregularis 3d2, Coleochaete sieminskiana 10d1, Zygnema peliosporum, Mougeotia sp 758, Gonatozygon monotaenium 1253, Onychonema sp 832, Cosmocladium perissum 2447, Lychnothamnus barbatus 159, Nitellopsis obtusa F131B, Chara connivens F140, Lamprothamnium macropogon X695, Arabidopsis thaliana and Taxus mairei. We refer to this dataset as the 'Nuclear SSU rRNA' dataset.
Evolutionary models
We have used the general timereversible model (GTR; [23]) to study site interdependencies, with the following substitution probabilities:
with π = {π_{ A }, π_{ C }, π_{ G }, π_{ T }} the set of base frequencies and rAC, rAG, rAT, rCG, rCT and rGT the evolutionary substitution parameters. As in our previous work (Baele et al., 2008), let θ = {2π_{ A }π_{ G }rAG, 2π_{ A }π _{ C }rAC, 2π_{ A }π_{ T }rAT, 2π_{ G }π_{ C }rCG, 2π_{ G }π_{ T }rGT, 2π_{ C }π_{ T }rCT} be the terms of the scaling formula that binds the parameters of the model and T be the set of branch lengths with t_{ b }(t_{ b }≥ 0) one arbitrary branch length and μ a hyperparameter in the prior for t_{ b }in T. As in Baele et al. (2008), the following prior distributions q(·) were chosen for our analysis, with Γ(.) the Gamma function:
π ~ Dirichlet (1,1,1,1), q (π) = Γ (4) on ,
θ ~ Dirichlet (1,1,1,1,1,1), q (θ) = Γ (6) on ,
t_{ b }μ ~ Exponential (μ), for each t_{ b }in T
and
μ ~ Invgamma (2.1, 1.1), , μ > 0.
Branch lengths are assumed i.i.d. given μ. When the model allows for the presence of multiple contexts of evolution, each context is assumed to have its own prior, independently of other contexts.
As there are 16 possible neighboring base combinations, we use a distinct GTR model per neighboring base composition, thus increasing the number of evolutionary contexts from 1 to16 for a full contextdependent model (Baele et al., 2008). The goal of this article is to reduce the dimension of such a model in an accurate and computationally efficient manner, by sharing parameters between contexts, which will improve the fit to the data. Note that the independent GTR model will be used as the reference model (i.e. the model to which all other models will be compared) throughout the remainder of this paper.
Thermodynamic integration – split calculation
The splitcalculation approach discussed in this section can be skipped by the less technicallyminded people.
Different contextdependent models can be compared in terms of model fit with the independent GTR model by calculating the appropriate (log) Bayes Factors. One may use modelswitch thermodynamic integration for this purpose [18]. This is a computationally intensive approach which yields reliable estimates for the (log) ratio of the marginal likelihoods corresponding to two models. Below, we explain how one can use a split calculation approach to make this integration procedure computationally more tractable.
Suppose our goal is to calculate the Bayes Factor corresponding to models M_{0} and M_{1} defined on the same parameter space Θ. The true data densities (conditional on the parameter θ) are denoted by
for the models M_{ i }, i = 0, 1, where q_{ i }(θ) denotes the joint density of the observed data and the parameter θ, and
is a normalizing constant. The latter encodes the marginal data density, which is needed in the calculation of the (log) Bayes Factor. The key idea behind modelswitch integration is to translate the problem of integration w.r.t. θ into the relatively simpler problem of averaging over θ. For this purpose, a continuous and differentiable path (q_{ β }(θ))_{0≤β≤1 }(with corresponding p_{ β }(θ) and Z_{ β }(θ)) is chosen in the space of unnormalized densities, joining q_{0}(θ) and q_{1}(θ), which thus goes directly from model M_{0} to model M_{1}. When β tends to 0 (resp. 1), p_{ β }(θ) converges pointwise to p_{0} (θ) (resp. p_{1} (θ)), and Z_{ β }(θ) to Z_{0}(θ) (resp. Z_{1}(θ)). The log Bayes Factor of model M_{1} versus M_{0} can now be calculated as the logratio [18]
where E_{ β }[...] denotes the expectation with respect to θ under the density p_{ β }(θ), and with U(θ) the potential
This expectation E_{ β }[U(θ)]may be approximated with a sample average once a sample of random draws from p_{ β }(θ) is obtained using MCMC. The integration problem is now simplified to the problem of integrating w.r.t. a scalar parameter β, which is relatively easily handled by numerical approximation using the composite trapezoidal rule.
These calculations can be partitioned over a number of computers upon rewriting the integral in expression (3) as
with α_{0} = 0 <α_{1}, < ... <α_{ n }< 1 = α_{n+1 }dividing the interval [0,1] into n subintervals with the number of MCMCupdates of β in each subinterval resp. equal to chosen values K_{0}, K_{1},..., K_{ n }. For each value of β, the Markov chain is updated during a number of Q iterations (during which β is held constant), after which β is increased (or decreased). As in the work of Lartillot and Philippe [18], each of these integrals can be calculated using the composite trapezoidal rule to obtain the socalled quasistatic estimator
for the m^{th} subinterval, with δ_{ m = }α_{m+1 }α_{ m }, (θ_{ i }), i = p...r (with ) the saved parameter draws. A possible approach to calculate the quasistatic estimator is thus to save the set of parameter values in the iteration before β is increased (or decreased), this way obtaining a set of parameter values θ_{ i }for each value of β during the transition of β from α_{ m }to α_{m+1}. Calculating this quasistatic estimator for each subintegral and adding yields the following expression for the quasistatic estimator of :
The obtained estimates of the log Bayes Factor are subject to a discretization error (due to numerical integration) and sampling variance (due to the limited number of MCMCdraws used in the calculation of the expected potential). Below we report on how to quantify these errors under the split calculation algorithm proposed above. The discretization error of is characterized by its worstcase upper (resp. lower) error which, because E_{ β }[U(θ)] is monotone in β, is given by the area between the piecewise continuous function joining the measured values of E_{ β }[U(θ)] and the upper (resp. lower) step function built from them [18]. Both areas (i.e. between both upper and lower step functions and the continuous function) are equal to:
By splitting the calculation over different subintervals, we obtain a sum of discretization errors, one for each integral, which is given by
The sampling variance can be estimated by summing the variances over the parallel chains
assuming independence between the successive draws from the chain. The total error on the log Bayes Factor equals σ = σ_{ d }+ 1.645 σ_{ s }, with σ_{ s }the square root of the sampling variance [18]. In general, a sufficiently long burnin is necessary to obtain reliable estimates and low error margins.
Data augmentation
Because of the computational complexity, Baele et al. [12] use data augmentation for estimating the parameters of a contextdependent model, whereby ancestral data are repeatedly imputed. Indeed, the computational complexity of using contextdependent models does not allow for easy calculation of the observed data likelihood and requires the use of a full (or complete) data likelihood to make inference possible. As a result, each missing ancestor in the tree needs to be provided with an estimated ancestral nucleotide in each iteration. This has implications for the modelswitch thermodynamic integration scheme, which was developed for settings where inference is based on the observed data likelihood [18]. In our approach, i.e. data augmentation with modelswitch thermodynamic integration, the ancestral data can be shared between both models (i.e. the imputations take identical values under both models) and in that case must be part of the unknown parameter θ. In particular, each ancestral "augmented" site is imputed with a draw from a multinomial distribution from the probability density p_{ β }(θ) since the expectation E_{ β }[U(θ)] will be approximated with a sample average of random draws from p_{ β }(θ) [18]. In our approach, this probability density for the ancestral site i has the following form (with Y_{mis, i}representing the state of the ancestor that is being augmented at site i, Y_{mis,i}representing the set of states for all remaining ancestors, r_{ i }the evolutionary rate at site i and β the current position along the path between the two posterior distributions)
Upon noting that p_{ β }(θ) = q_{ β }(θ)/∫q_{ β }(θ), expression (11) yields
In our approach, we choose q_{ β }(Y_{ mis }, {r_{ i }}, Y_{ obs }M_{0}, M_{1}) = (L_{ X }M_{0})^{1β}(L_{ X }M_{1})^{β}, implying that each ancestral "augmented" site is imputed with a draw from a multinomial distribution with probability
where L_{ X }M_{ i }is the complete data likelihood under model M_{ i }when X ∈ {A, C, G, T} is the value augmented for the considered ancestral site. This result in a probability for each nucleotide to be inferred at a given site, with the four probabilities summing to one. The ancestral sequences are then updated sequentially, i.e. one site at a time, from top to bottom in the considered tree and from the first site moving along the sequence up to the last site, each ancestral site is updated during each update cycle.
When β equals 0 (1), the ancestral sequences are random draws from the posterior distribution of Y_{ mis }under M_{0} (M_{1}). At certain ancestral positions, this may result in imputed values with small likelihoods under M_{1} (M_{0}), which in turn leads to larger differences between the log likelihoods of the two models. Because of this, the contributions of the modelswitch integration scheme to the log Bayes Factor are most tedious to calculate when β is close to 0 and 1, which is why we use smaller update steps for β in those situations. In the case of an observed data likelihood, which involves summing over all missing ancestral nucleotides, this situation does not occur.
Evolutionary rate augmentation
To accommodate varying rates across sites (or amongsite rate variation), we use a similar data augmentation approach as before, which now additionally imputes evolutionary rates in order to avoid summing the likelihood over all the possible rate classes. Given a discrete approximation to the gamma distribution with n rate classes (where n = 1 encodes the assumption of equal rates), the rate r_{ i }at each site i for model M_{1} is updated by drawing from a multinomial distribution with probability
where r_{ i }represents the rate of site i, which is being augmented, r_{i}represents the set of rates for all remaining sites, β represents the current position along the path between the two posterior distributions, and is the complete data likelihood under the ratesacrosssites model M_{1} when X ∈ {r_{1},..., r_{ n }} is the value imputed for the considered missing rate. Note that, when comparing model M_{1} with a model which assumes equal rates, only the rate parameters indexing M_{1} need to be updated with a new set of rates at each modelswitch iteration in the calculation of a Bayes Factor.
Context reduction
Our contextdependent model consists of 16 possibly different GTR models, one for each neighbouring base composition (a.k.a. 'evolutionary context'). In practice, it is likely that the evolutionary processes are similar in a number of neighboring base compositions, or that the data are insufficiently informative to distinguish these. This suggests reducing the model's dimensionality by merging contexts, which may subsequently lead to evolutionary models with reduced parameter uncertainty which fit the data better than the independent model. Unfortunately, the timeconsuming calculation of Bayes Factors makes exhaustive model search using Bayes Factors currently prohibiting. In view of this, we have sampled 1,000 values of each of the 96 parameters in our full contextdependent model from the Markov chain every 50^{th} iteration after an initial burnin of 50,000 iterations. On the basis of the 1,000 values for each of the six parameters per context, the first two principal components are calculated and displayed in a scatterplot, thus resulting in 16 sixdimensional clusters each consisting of one context.
The location of certain contexts in such a scatterplot may indicate strong differences between some contexts, but not between others, and may thus be informative of contexts that can meaningfully be merged. However, this is not always the case partly because information is inevitably lost by considering only two principal components. Using a scatterplot matrix of the first three principal components might add information, but would still require arbitrary decisions from the researcher on the clustering of different contexts. In this section, we therefore propose two algorithmic, automated methods for clustering contexts by progressive agglomeration. Each decision taken by these algorithms is then confirmed by calculating the corresponding log Bayes Factor.
A likelihoodbased reduction approach
The parameters in each of the 16 neighboring base compositions can be described by a sixdimensional mean with corresponding variancecovariance matrix. Assuming a multivariate normal posterior density within each context (which is asymptotically valid), a likelihood function of all sampled parameter values can thus be calculated. This initial likelihood is the starting point for our first reduction approach, which uses the following iteration scheme:

1.
Reduce the number of contexts with 1 by merging 2 contexts. Calculate the likelihood of the corresponding model. Repeat this for all pairs of contexts.

2.
Select the highest likelihood obtained in the previous step, merge the two corresponding clusters and recalculate the empirical mean and variancecovariance matrix of the parameters corresponding to the merged clusters. To make the calculations more feasible, we do not enforce to run a Markov chain for each newly obtained model to infer new estimates of the posterior means and variancecovariance matrices.

3.
Iterate steps 1 and 2 until only one cluster remains.
Through the remainder of this work, we define a cluster as the merge of two or more evolutionary contexts. As the merging of clusters progresses, the likelihood will gradually decrease in value. This is expected as the parameter estimates can be better approximated by contextspecific means and variancecovariance matrices instead of clusterspecific means and variancecovariance matrices. Since the likelihood only decreases (and additionally depends on the chosen number of samples in an arbitrary fashion), it cannot be used to determine the optimal number of clusters/contexts. In terms of the (log) Bayes Factor, it is typically expected that the model fit will first gradually increase, reach an optimum, and then decrease. In each step of the algorithm, we therefore calculate the Bayes Factor corresponding to the selected model. In principle, the algorithm can be stopped when the Bayes Factors decrease convincingly with additional context reductions.
In each step of the above algorithm, the number of parameters in the model decreases with 6. Since each step selects the clustering which minimizes the decrease in log likelihood, this approach is likely to detect a model with nearoptimal (log) Bayes Factor.
A graphbased reduction approach
While the parameterreduction approach of the previous section has a statistical basis, it is likely to yield models with suboptimal fit. Indeed, the likelihoodbased approach systematically favors merging two separate contexts over merging a context with already merged contexts (to build clusters with three or more contexts). Consider a scenario of two proposed clusters, one already containing two contexts and being expanded with a third context, and one containing a single context and being merged with another context. These two clusters will each have one mean and one variancecovariance matrix to represent the parameter estimates. However, the threecontext cluster is not represented so easily with this reduced parameterization due to the increased variance of the threecontexts cluster. Such an increase in variance leads to a drastic decrease of the likelihood which implies that merging small clusters will tend to be preferred over expanding existing clusters.
This artifact may be resolved by rerunning a Markov chain after each merge using the newly obtained model, which will allow to reestimate the posterior variancecovariance matrices needed to predict the next merge operation. However, this requires additional computational efforts, making it much less suited for reducing the model complexity. We thus propose the following graphbased approach, which avoids the need for reestimating the posterior variance of each cluster by using the loglikelihood difference between the 16context model and each 15context model as costs in a graphbased algorithm. It requires one Markov chain run to predict all necessary clustering steps, to determine a possibly more optimal solution:

1.
Calculate the likelihoods of all possible pair wise context reductions (120 in total), starting from the initial clustering of each context separately (which yielded the initial likelihood).

2.
Use the difference between the newly obtained likelihoods and the initial likelihood as edge costs in a fully connected undirected graph of 16 nodes. Similar contexts will be connected by lowcost edges while dissimilar contexts will be connected by highcost edges. The cost function then consists of the sum of each edge that participates in a cluster (i.e. one edge for a cluster of two contexts, three edges for a cluster of three contexts, six edges for a cluster of four contexts ...).

3.
Sort the 120 likelihood differences, smallest differences first (each difference corresponds with a merge of two clusters).

4.
Using the sorted list, merge the clusters following the order proposed by the list. If the proposed merge is between two contexts which both have not yet participated in a merge operation, proceed with merging them and color the two nodes and the connecting edge in the graph. If at least one of the contexts is already colored in the graph, proceed with the merging of the clusters if the resulting fully interconnected network in the graph (i.e. between the contexts to be merged) yields the lowest value for the cost function of the possible networks of similar structure (i.e. the fully connected networks with an equal amount of nodes) that have not yet been colored (see the Results section for a practical example). If there is a lowercost alternative, do not merge the clusters and proceed with the following entry from the list, until only one cluster remains. The objective of this graphbased approach is thus not simply to minimize the cost function, but to minimize the cost function conditional on the proposed new clustering. This means that when a proposed merge is accepted, it was the cheapest merge possible between competing networks of similar structure.
This graphbased reduction approach has the advantage that it does not need additional Markov chains to be run. Given the costs of the various reductions and their order to be evaluated in, determined in the first two steps, the algorithm attempts to cluster those contexts closest to one another, with the enforced constraints that only competing clusters of the same composition are compared. This should result in larger clusters and thus possibly in larger improvements of the model fit.
Results and discussion
The Ancestral Repeats dataset
Approaches Compared
To compare our Bayes Factor calculation approach with the original approach of Lartillot and Philippe [18], we have initially opted for a constant increment for β and an equal number of Q updates for all the parameters and ancestral sites to estimate the log Bayes Factor and its error for the large Ancestral Repeats dataset. The results are shown in Table 1 and in Figure 1. Thermodynamic integration requires a drastic increase in CPU time compared to a plain posterior sampling under the more demanding of the two models that are being compared [18]. This requires running a chain for several days, up to several weeks for more complex models (where mixing can be more challenging). A singlerun log Bayes Factor calculation with low accuracy (i.e. only Q = 200 updates for each value of β, with step size 0.001) of the full 16contexts model (GTR16C) against the independent GTR model takes 42 days for one direction (i.e. either annealing or melting) on a single computer, given the large number of sites. More accurate settings for the modelswitch integration will further increase calculation time and are in fact necessary as the log Bayes Factor estimates in both directions, i.e. 592.7 and 693.6, are still far apart. In contrast, our proposed approach yields very similar results in terms of the log Bayes Factor estimates in both directions as can be seen in Table 1. The calculation time for our approach is reduced to 6 days on 10 cluster nodes and further reductions are possible since we opted for a lengthy burnin sequence of 10,000 iterations. Its only disadvantage lies in slightly broader confidence intervals for the log Bayes Factor, which is an expected consequence of using several independent chains.
As can be seen from Table 1, running 200 full chain updates at each value of β works well only in the integration interval [0.1;0.9]. Indeed, the quasistatic estimates in both directions produce very similar results when β is in the interval [0.1;0.9]. However, as a result of the ancestral data augmentation, the same settings for β should not be applied when one of the models converges to its prior, i.e. in the integration intervals [0.0;0.1] and [0.9;1.0] for β, as the ancestral sites imputed in those intervals converge towards essentially arbitrary data for one of the models and yield very small values for the log likelihood. This makes it more difficult and timeconsuming to yield similar and reliable estimates for both directions of the modelswitch integration scheme. We have therefore opted to split our calculations into 20 sections, each with the same amount of chain updates (Q = 200), but using a larger number of updates for β as the chain progresses closer to one of the prior distributions. A referee remarked that Lepage et al. [24] have used a sigmoidal schedule for β to circumvent this problem.
The results for each of the subintervals and in each direction are reported in Tables 2 and 3. The log Bayes Factor estimates in both directions are now 630.7 (95% CI: [623.2; 638.2]) and 653.7 (95% CI: [645.5; 661.9]). Given the magnitude of the increase in terms of model fit, we have refrained from increasing the converging times in order to obtain more similar confidence intervals and we have taken the average of both estimates to be the resulting log Bayes Factor. Further, the width of the confidence intervals has decreased from 35 to 15 log units, suggesting that this approach also reduces the variance(s) of the log Bayes Factor estimates.
Varying rates across sites and CpG effects
To determine the impact of assuming varying rates across sites on the model fit, we calculated the log Bayes Factor comparing the independent GTR model with equal rates to the GTR model with varying rates across sites using n = 4 discrete rate classes, as proposed by Yang [25]. The log Bayes Factor equals 355.6, indicating a very strong preference towards the varying rates across sites assumption using four discrete rate classes. The mean estimate for the shape parameter of the gamma distribution using four rate classes equals 1.156 (Baele et al., 2008). Because four rates classes may not be sufficient to approximate the continuous gamma distribution, we have gradually increased the number of rate classes n as reported in Table 4. The log Bayes Factor equals 354.3 for n = 5, 354.6 for n = 6, 354.4 for n = 7 and 356.0 for n = 8 rate classes. Increasing the number of rate classes beyond n = 4 hence does not yield important improvements in model fit.
The previous results show that allowing for varying rates across sites drastically increases model fit compared to assuming equal rates for all sites. Analysis of the data using the contextdependent evolutionary model has further shown that substitution behavior is heavily dependent upon the neighbouring base composition [12]. A wellknown contextdependent substitution process is the 5methylcytosine deamination process (i.e., the CpG effect), which has been the subject of several studies (see e.g. [26–28]). We have calculated the log Bayes Factor of two different CpG effects. We have modeled the traditional CpG effect where the substitution behavior of a site can differ from the other sites when the 3' neighbor is guanine. The mean log Bayes Factor for this model, which contains a mere 12 parameters, equals 137.8 (annealing: [127.1; 141.0], melting: [137.0; 146.2]), a significant improvement in terms of model fit compared to the independent model. We have also modeled a CpG effect that is dependent upon its 5' neighbor, i.e. those sites with guanine as a 3' neighbor are assumed to have a different substitution behavior depending on the 5' neighbor. Such a model has 30 parameters and a mean log Bayes Factor of 157.8 when compared to the independent GTR model (annealing: [142.7; 155.0], melting: [159.6; 173.8]), i.e. this model is preferred over both the model assuming the traditional CpG effect and the independent model. However, the log Bayes Factor of 642.2 attained by the full contextdependent model (as compared to the independent GTR model) suggests that many more complex evolutionary patterns exist besides the CpG effect.
The likelihoodbased reduction approach: results
Figure 2 shows the stepwise clustering of contexts with corresponding log Bayes Factors reported in Table 5. The full contextdependent model, consisting of 16 clusters each containing one context (denoted GTR16C), is shown by the white bar which corresponds to a log Bayes Factor of 642.2 (as compared to the independent GTR model). Each step of the algorithm yields a reduction of one context, resulting in the light grey bars in Figure 2, which are annotated with the new cluster structure that is being formed at that step. For example: in the first step, i.e. the reduction from 16 to 15 clusters (i.e. model GTR15C in Table 5), the CXC and TXC contexts are merged, reducing the number of parameters from 96 to 90. The log Bayes Factor of this 15context model over the GTR model equals 669.2. In the second step, i.e. the reduction from 15 to 14 clusters (i.e. model GTR14C in Table 5), the GXG and TXG context are merged, further reducing the number of parameters to 84. While this 14clusters model yields a lower log Bayes Factor over the GTR model (665.9) than the 15clusters model, there is no reason to stop here as this decrease may well be the result of sampling and discretization error on the Bayes Factor and thus more optimal models might still be obtained by further context reduction. After further reductions, our likelihoodbased reduction scenario yields an optimal clustering scheme for the Ancestral Repeats dataset consisting of 10 clusters (GTR10C; using a total of 60 parameters), as indicated by the dark grey bar in Figure 2. This 10clustersmodel yields a log Bayes Factor of 700.1 over the independent GTR model. The 10 clusters are shown in Figure 3, which identifies 6 separate singlecontext clusters (for the evolutionary contexts AXA, AXG, CXA, CXT, TXA and TXT) and 4 clusters consisting of two or three contexts. A first cluster contains 2 contexts: AXT and CXG, a second cluster contains 3 contexts: AXC, GXC and GXA, a third cluster contains 2 contexts: GXG and TXG, and a final cluster contains 3 contexts: CXC, GXT and TXC.
The graphbased reduction approach: results
As predicted above, the likelihoodbased reduction approach favors small clusters. To confirm this assumption, we have rerun a Markov chain using a contextdependent model consisting of the optimal number of 10 clusters derived using the likelihoodbased approach. Using the parameter estimates from this model, we have calculated the posterior variances of the (yellow) cluster containing the AXC, GXA and GXC contexts and compared them to the empirical variances obtained from merging these three contexts but not rerunning the chain. The actual posterior variances were much smaller, equaling merely between 3% and 24% of the empirical variances that were used. However, calculating these posterior variances is practically not feasible for fast model building because running a new Markov chain for the Ancestral Repeats dataset takes about 4 days per run. Further, the result of each run needs to be awaited to decide upon the next step in the clustering algorithm, which greatly increases the time needed to obtain an optimal contextdependent model.
In view of this, the graphbased reduction approach was designed. The decisions taken in the first 22 iterations are shown in Table 6, with corresponding log Bayes Factors in Figure 4 and Table 7. Starting from the full contextdependent model (GTR16C in Table 7), each step of the algorithm yields a reduction of one context, as shown by the light grey bars in Figure 4 which are annotated as in Figure 2. The first two reduction steps of the graphbased approach are identical to those of the likelihoodbased approach. Further reductions show that fewer clusters are constructed by instead expanding existing clusters. The reduction to 13 clusters (GTR13C in Table 7), for example, consists of merging the previously separated GXT context with the cluster constructed in the first reduction step, thus creating a new cluster with 3 contexts: CXC, GXT and TXC.
To illustrate step 4 of the algorithm (see the Materials & Methods section), we discuss the 5^{th} step of the graphbased reduction approach. The 4^{th} iteration has yielded a model allowing for twelve clusters: a first cluster consists of four contexts (CXC, GXC, GXT and TXC), a second cluster consists of two contexts (GXG and TXG) and ten other clusters consist of a single context. To calculate the current clustering cost, the cost of the branch connecting contexts GXG and TXG (1064.87 units) is added to the cost of all the interconnecting branches between the CXC, GXC, GXT and TXC contexts (see Table 6): CXCGXC (1524.96), CXCGXT (1925.79), CXCTXC (872.10), GXCGXT (1940.97), GXCTXC (1777.24) and GXTTXC (1300.93). The clustering in step 4 thus has a cost of 10,406.86 units. The proposed step in the 5^{th} iteration to expand the fourcontexts cluster (CXC, GXC, GXT and TXC) with a fifth context, i.e. AXC, then results in a cost of 20,124.32 units. However, expanding the fourcontext cluster with CXA instead of AXC yields a cost of 19,423.12 units, as the CXA context lies reasonably close to all four contexts whereas AXC lies mainly close to the GXC context. Therefore, the fourcontext cluster is not expanded in this 5^{th} iteration.
After further reductions, the graphbased method yields a different optimal model than the likelihoodbased approach for the Ancestral Repeats dataset. The optimal clustering consists of 8 clusters (GTR8C; using a total of 48 parameters for the model) with a log Bayes Factor of 712.7 (see Table 7), thus yielding an improvement in model fit over the optimal clustering found by the likelihoodbased approach. This model is illustrated by the dark grey bar in Figure 4. This model is reached in the 12^{th} step in Table 6, which corresponds to the graph coloring scheme shown in Figure 5. The 8 clusters are shown in Figure 6, which identifies 5 separate singlecontext clusters (for the evolutionary contexts AXA, AXG, CXT, TXA and TXT) and 3 clusters consisting of two or more contexts. A first cluster contains 2 contexts: AXT and CXG, a second cluster contains 4 contexts: AXC, GXA, GXG and TXG, and a final cluster contains 5 contexts: CXA, CXC, GXC, GXT, and TXC.
We have concluded earlier (see Table 4) that using more than four rate classes for the discrete approximation does not yield clear improvements in model fit. Hence, we have combined the optimal model, obtained with the graphbased reduction approach, with the assumption of varying rates across sites using four rate classes. We have compared this reduced model with varying rates across sites to the full contextdependent model with varying rates across sites (also with four rate classes). The full contextdependent model yields a log Bayes Factor of 960.1 (annealing: [935.5; 959.4], melting: [690.9; 984.6]), thereby clearly outperforming the full contextdependent model with equal rates (with a log Bayes Factor of 642.2) and the independent model with varying rates across sites (with a log Bayes Factor of 355.6). Further, the optimal model yields an even higher log Bayes Factor of 1029.8 (annealing: [1001.8; 1022.4], melting: [1037.0; 1058.1]), thereby conserving the increase in model fit obtained with equal rates (see Table 7).
Interpretation of the optimal model
The graphbased reduction approach yields the best performing contextdependent model for the Ancestral Repeats dataset, but the interpretation of the clustering of neighboring base compositions is far from obvious. To gain insight, we have studied the parameter estimates of the GTR model for the 16 neighboring base compositions, which have been reported and discussed in previous work (Baele et al., 2008). In a first step, we try to determine why the five contexts AXA, AXG, CXT, TXA and TXT are clustered separately in the 'optimal' contextdependent model. The AXA and AXG contexts have much higher rCT parameter estimates than all other contexts. For the AXG context, this could be attributed to a CpG effect, conditional on the preceding adenine. This might mean that, for the AXA context, a nonCpG methylation process is present although we are unaware of earlier reports of such a process in the AXA context. In previous work (see [12]), we already elaborated on the possibility of a TpA effect, especially in the AXA context. Such an effect could occur conditional on the preceding base, as is the case for the CpG effect.
NonCpG methylation is the subject of many discussions in mammalian evolution. Woodcock et al. [29] reported that 55% of all methylation in human spleen DNA could be at dinucleotides other than CpG. In their analysis of mammalian DNA, Ramsahoye et al. [30] found that embryonic stem cells, but not somatic tissues, have significant cytosine5 methylation at CpA and, to a lesser extent, at CpT. However, high relative rates between C and T have been observed in aligned gene/pseudogene sequences of human DNA in the past [31]. The reason for the separate clustering of the CXT and TXT contexts seems to be the lower than average (and again, than all other contexts) rCT parameter estimates (or the higher than average rAG parameter estimates). When considering that the CXT and TXT contexts can be found on the complementary strand of the AXG and AXA contexts, this makes perfect sense. This complementarity aspect is actually dominantly present in Figure 6 when considering the green and red clusters. Indeed, the green cluster contains contexts AXC, GXA, GXG and TXG while the red cluster contains all the complementary contexts, resp. GXT, TXC, CXC and CXA, further augmented with GXC. This latter context, along with the other symmetrical contexts (i.e. whose complementary context is the context itself) AXT, CXG and TXA correspond to a zero first principal component in Figure 6. This first principal component has a loading of 0.73 for the rAG parameter and 0.68 for the rCT parameter, with loadings for the other parameters all below 0.03. Hence, this principal component roughly measures the differences between the rAG and rCT parameter estimates. This explains why most of the clustering patterns in Figures 3 and 6 are retrieved in the rAG and rCT parameter estimates.
Only the separate TXA context cannot be explained using the transition estimates. Because both the rAG and rCT parameter estimates for this context are lower than average, the transversion estimates must be studied (see [12], online supplementary material). The TXA context has the highest rAT and rGT parameter estimates of all 16 contexts and the rAC parameter estimates are also above average, which seems to lead to a significantly differing evolutionary behavior when compared to all other contexts. This observation reinforces our opinion that modeling different substitution behavior of the transition parameters (as is mainly the case when modeling CpG effects) cannot by far account for the complexity of mammalian DNA evolution. Indeed, the separate clustering of the TXA context suggests that modeling different substitution behavior of the transversion parameters depending on the nearest neighbors can increase model fit. This is supported by a clear preference of the sixcluster model (GTR6C in Table 7), clustering TXA separately, over the fivecluster model (GTR5C in Table 7), which includes TXA in a large cluster with 11 other contexts.
We have already shown that modeling CpG effects, both dependent and independent of the preceding base, does not even come close to modeling a full contextdependence scheme based on the flanking bases in terms of model fit. The evolutionary patterns of sites with guanine as the 3' neighbor can nonetheless be seen to lie close in the principal components plot (see Figure 6). All four occurrences lie in the lower left section of the plot, even though only the GXG and TXG contexts are effectively clustered together. This reinforces our finding that CpG effects are only one aspect of contextdependent evolution and that CpG effects are dependent of the preceding base, with adenine being the most influential 5' neighbor resulting in a separate cluster. In terms of the transition parameter estimates, the CXG context has lower (higher) rCT (rAG) parameter estimates than both GXG and TXG, which explains why only GXG and TXG are clustered together. Apart from a small difference in the rCG parameter estimates, both contexts are very similar in their parameter estimates. The same goes for the AXT and CXG contexts, which only differ in the rCG parameter estimates. However, the cluster containing both GXG and TXG also contains the GXA and AXC contexts, meaning that this cluster (like all clusters determined) does not contain all the contexts with either identical 5' or 3' neighboring bases, i.e. the cluster containing GXG and TXG does not contain the AXG and CXG contexts. The GXA context differs from the GXG and TXG contexts in its parameter estimates for rCT and rAC, and a small difference for rAT. The transversion estimates of the AXC context yield no drastically differing observations when compared to those of the other three contexts in the cluster. The difference seems to lie in the transition parameters, where the AXC context is observed to have decreased rCT parameter estimates compared to the other contexts in the cluster.
One cluster left for discussion is the one containing the CXA, CXC, GXC, GXT and TXC contexts. Different from those sites with guanine as the 3' neighbor, those sites with cytosine as the 3' neighbor are clustered closer to one another, with CXC, GXC and TXC being part of the same cluster and thus only AXC being part of another cluster. In other words, as is the case when the 3' neighbor is guanine, those sites with adenine as their 5' neighbor are positioned away from the other occurrences with the same 3' neighbor. Apart from the rCG parameter estimates, the different contexts show only small differences in the parameter estimates. The CXA context has lower rCG parameter estimates than the other contexts in the cluster, which might explain why CXA is the last context to be added to the cluster in the graphbased reduction approach.
The Nuclear SSU rRNA dataset
The likelihoodbased reduction approach: results
Given the larger increase in model fit brought about by the graphbased reduction approach for the Ancestral Repeats dataset, we have opted to test this method on a previously analyzed nuclear small subunit ribosomal RNA dataset [12]. As this dataset is much smaller than the Ancestral Repeats dataset, calculation of the necessary log Bayes Factors is much faster and does not require applying our split calculation approach for the thermodynamic integration method. Instead, we have used a larger number of chain updates (Q = 1000) while increasing or decreasing β by 0.001 throughout the whole integration interval.
The starting point of the analysis of the nuclear SSU rRNA dataset is different from that of the Ancestral Repeats dataset in that the standard contextdependent model yields a log Bayes Factor of 17.65 compared to the independent GTR model (see Table 8), suggesting a large decrease in terms of model fit of the contextdependent model. While this could mean that there are no dependencies in this dataset, it might also be the result of overparameterization.
The first four reductions made by the likelihoodbased reduction approach yield a contextdependent model with equal model fit to that of the independent GTR model. Further reductions yield a contextdependent model consisting of six contexts, which significantly outperforms the independent model with a log Bayes Factor of 19.62. This indicates that the true contextdependent effects were initially undetectable due to the drastic increase in parameters. As we show here, a careful modelbuilding strategy can unveil the important evolutionary contexts, leading to an increased performance in terms of model fit. This will become an even more important aspect when modelling additional dependencies. The stepwise clustering of contexts for the likelihoodbased clustering, in terms of the log Bayes Factor, can be seen in Figure 7 and Table 8. The optimal clustering using this likelihoodbased reduction approach can be seen in Figure 8.
The extended likelihoodbased reduction approach: results
Because the nuclear SSU rRNA dataset is relatively small, it is feasible to reestimate the evolutionary parameters after each merge of contexts or clusters. This allows for a more accurate calculation of the posterior variances for each cluster of contexts and may result in a different contextdependent model. We call this approach the extended likelihoodbased reduction approach and compare its results to the regular likelihoodbased reduction approach. Note that reestimating a posterior variance would take over four days of computation time in the Ancestral Repeats dataset, which in turn would lead to over sixty days of computation time in total until all Bayes Factor calculations can be started.
As can be seen from Table 9, the extended likelihoodbased reduction approach yields an identical optimal model as the simple likelihoodbased reduction approach although the path that both approaches take towards this optimal model is different (data not shown). This illustrates that the simple approach may yield a good approximation and that it is not always necessary to perform tedious calculations to achieve a decent parameterperformance tradeoff. The approximation may become poorer, however, as the clustered contexts are further apart (because this increases the difference between empirical and posterior variance of each cluster).
The graphbased reduction approach: results
In this dataset, the graphbased reduction approach yields an optimal model with three clusters, containing only 18 parameters. The stepwise reduction, starting from the full contextdependent model, can be seen in Figure 9, with the corresponding log Bayes Factors for each step given in Table 10. A representation of the optimal clustering scenario is shown in Figure 10, where the three clusters can be identified: a first (red) cluster containing the contexts AXG, GXG and TXG, a second (yellow) cluster containing the single context TXC and a large (green) cluster containing all remaining 12 contexts. The log Bayes Factor for this model equals 16.55 when compared to the independent model, which is just below the log Bayes Factor generated by the optimal model using the likelihoodbased reduction approach, although the two model performances are not significantly differing from one another. Note however that the confidence intervals in both directions seem to overlap more using the graphbased reduction approach, resulting in higher accuracy for the calculated log Bayes Factors.
Interpretation of the optimal model
The optimal model for the nuclear SSU rRNA dataset consists of six clusters and has 36 parameters. The reasons for this specific clustering scenario can be identified by considering the parameter estimates for the 96 parameters of the model, as reported in earlier work [12]. The fact that there is support for the presence of CpG effects in this dataset only leads to the clustering of contexts AXG and GXG while TXG is clustered with TXC and CXG with GXC. The TXG context has higher estimates for the rGT parameter and its substitution behavior bears more resemblance to that of TXC. The CXG context has much higher rAG estimates compared to the AXG and GXG contexts, with the highest rAG estimates for the CXA and CXT contexts which are clustered together. The larger clusters are harder to explain as there are more differences between the individual parameters. For example, it can be seen that the largest cluster has more or less average estimates for all six parameters, although this is not the case for the rAT parameter where the contexts with higher estimates are clustered together with the exception of the TXC context.
Note that the increases in terms of model fit of these contextdependencies (as expressed through the log Bayes Factor) might seem less relevant in this dataset given the drastic increase brought about by assuming varying rates across sites, i.e. a log Bayes Factor of 499.10 units compared to the independent GTR model (see [12]). In other datasets (e.g. the Ancestral Repeats dataset), however, the opposite is true. Hence, a contextdependent model might also be useful for those datasets where no drastic increases in terms of model fit are brought about by assuming varying rates across sites, such as the frequently analyzed dataset of six ψηglobin pseudogenes [32] which originally appeared in the work of Miyamoto et al. [33].
Conclusion
In this work, we have introduced a parallelization of the modelswitch integration method, as proposed by Lartillot and Philippe [18], to accurately calculate Bayes Factors for model comparisons on large datasets. The main advantage of our approach consists of the ability to split the calculations of the Bayes Factors over different computers, to minimize the time to yield results at the cost of only a slightly larger variation around the actual estimates. Given the increase in computation power using cluster computing during recent years, the ability to split certain calculations over different computers is a valuable aid in analyzing large datasets, such as the one studied in this work. Using thermodynamic integration with bidirectional checks, we have evaluated 28 models against the independent GTR model for the Ancestral Repeats dataset, a large dataset consisting of 10 sequences each containing 114,726 sites. As indicated by Lartillot and Philippe [18], thermodynamic integration requires tremendous amounts of CPU time.
Calculations for all the log Bayes Factors for the Ancestral Repeats dataset reported in this work were split over 20 processors, using the settings shown in Tables 2 and 3. Calculating one of these 20 integration intervals takes about 5 days on an Intel Xeon 2.4 Ghz processor. Given that bidirectional checks were performed and that 28 models were evaluated, this amounts to over 15 years of computation time that was used to obtain the log Bayes Factors and the corresponding confidence intervals for the Ancestral Repeats dataset. Given the magnitude of the improvements shown by our models, we did not have to use the most stringent settings for the thermodynamic integration method, which would obviously have increased computational demands even more.
Using this splitcalculation approach, we have shown that significant improvements in terms of model fit (calculated through the use of Bayes Factors) of contextdependent models are possible, by optimizing the performanceparameter tradeoff. Indeed, standard contextdependent models tend to estimate too many parameters, reducing the amount of data present to estimate each pattern accurately. Furthermore, many parameters in such contextdependent models might be nearly equal to one another, meaning that several parameters in the model may just as well be replaced by one single parameter.
Using both our likelihood and graphbased clustering approaches, we have shown that evolutionary patterns in the Ancestral Repeats dataset do not solely depend upon a similar identity of either the 5' or 3' neighboring base. We have shown that even CpG effects depend upon the 5' neighbor of the site under consideration and that many more substitution patterns are present in the data, as the CpG effect yields a much smaller increase in model fit than the full contextdependent model. The clustering results of our optimal contextdependent model, obtained through our graphbased reduction approach, confirm this finding. No cluster contains all four occurrences of the neighboring base combinations, conditional on either 5' or 3' neighboring base. At most, three contexts with either a similar 5' or 3' neighboring base are clustered together, i.e. CXC, GXC and TXC appear in a cluster together with CXA and GXT.
In contrast with the calculation for the Ancestral Repeats dataset, each model comparison for the nuclear SSU rRNA dataset could be performed in a single run of 5 days. Using the likelihood and graphbased reduction approaches, we calculated all 15 reduction steps from the full contextdependent model to the independent model and found that the optimal clustering scheme contains six clusters. The likelihoodbased approach yields the largest improvement in terms of model fit for this nuclear SSU rRNA dataset and also overthrows the initial conclusion that a full contextdependent model does not explain the data better than an independent model. This clearly illustrates that great care must be taken when building complex models that contain many parameters. A similar conclusion can be drawn for other contextdependent approaches (see e.g. [13, 14]) than the one used in this paper [12].
One might argue that the visualization of the different contexts on a scatter plot (see Figures 3, 6, 8 and 10) along with the credibility intervals for all the available parameters of the contextdependent model (see [12]) provide enough clues to determine an optimal clustering scenario. While certain aspects of the optimal clustering scenario may be retrieved through the use of visualization aids, many other aspects of the clustering would then depend upon subjective decisions, thereby reducing reproducibility. To avoid the need for subjective decisions and restriction to the first two principal components, we have presented two automatic approaches which we found to yield better contextdependent models in a straightforward manner.
The approaches discussed in this work can be used with different underlying evolutionary models. While we have assumed contextdependent evolution using 16 different GTR models in this work, we have assumed the equilibrium frequencies to be contextindependent. In other words, we assume a stationary equilibrium distribution for the base frequencies. Such an assumption may be overly restrictive and its relaxation in a contextdependent framework may result in a larger increase in terms of model fit at the expense of increased computation times.
The computational complexity of the model selection approaches presented in this paper may be reduced by using a reversible jump MCMC (RJMCMC) approach [35]. By designing appropriate transdimensional jumps, flexible increases and decreases of the number of evolutionary contexts can be implemented. Analyzing the posterior distribution of the identities of the models that are required can then identify the optimal combination of contexts to be used in subsequent analyses.
Finally, the approaches discussed in this work can be applied to a wider range of problems whereby decisions on the equality of parameter values must be taken. For instance, when assuming branch or lineagespecific evolutionary models (using the independence assumption), the number of models may initially be too large as not each branch or lineage may require a specific model differing from all other models present in the analysis.
References
 1.
Goldman N, Whelan S: A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol. 2002, 19: 18211831.
 2.
Yang Z: Amongsite rate variation and its impact on phylogenetic analyses. Trends Ecol Evol. 1996, 11: 367372. 10.1016/01695347(96)100410.
 3.
Takezaki N, Gojobori T: Correct and incorrect vertebrate phylogenies obtained by the entire mitochondrial DNA sequences. Mol Biol Evol. 1999, 16: 590601.
 4.
Cunningham CW, Zhu H, Hillis DM: Bestfit maximumlikelihood models for phylogenetic inference: empirical tests with known phylogenies. Evolution. 1998, 52: 978987. 10.2307/2411230.
 5.
Posada D, Crandall K: Selecting models of nucleotide substitution: an application to Human Immunodeficiency Virus 1 (HIV1). Mol Biol Evol. 2001, 18: 897906.
 6.
Morton BR: Neighbouring base composition and transversion/transition bias in a comparison of rice and maize chloroplast noncoding regions. Proc Natl Acad Sci. 1995, 92: 97179721. 10.1073/pnas.92.21.9717.
 7.
Morton BR: The influence of neighbouring base composition on substitutions in plant chloroplast coding sequences. Mol Biol Evol. 1997, 14: 189194.
 8.
Morton BR, Clegg MT: Neighbouring base composition is strongly correlated with base substitution bias in a region of the chloroplast genome. J Mol Evol. 1995, 41: 597603. 10.1007/BF00175818.
 9.
Morton BR, Oberholzer VM, Clegg MT: The influence of specific neighbouring bases on substitution bias in noncoding regions of the plant chloroplast genome. J Mol Evol. 1997, 45: 227231. 10.1007/PL00006224.
 10.
Bulmer M: Neighbouring base effects on substitution rates in pseudogenes. Mol Biol Evol. 1986, 3: 322329.
 11.
Blake RD, Hess ST, NicholsonTuell J: The influence of nearest neighbours on the rate and pattern of spontaneous points mutations. J Mol Evol. 1992, 34: 189200. 10.1007/BF00162968.
 12.
Baele G, Peer Van de Y, Vansteelandt S: A modelbased approach to study nearestneighbor influences reveals complex substitution patterns in noncoding sequences. Syst Biol. 2008, 57: 675692. 10.1080/10635150802422324.
 13.
Siepel A, Haussler D: Phylogenetic estimation of contextdependent substitution rates by maximum likelihood. Mol Biol Evol. 2004, 21: 468488. 10.1093/molbev/msh039.
 14.
Hwang DG, Green P: Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci USA. 2004, 101: 1399414001. 10.1073/pnas.0404142101.
 15.
Steel M: Should phylogenetic models be trying to 'fit an elephant'?. Trends Genet. 2005, 21: 307309. 10.1016/j.tig.2005.04.001.
 16.
Kass RE, Raftery AE: Bayes factors. J Am Stat Assoc. 1995, 90: 773795. 10.2307/2291091.
 17.
Jeffreys H: Some tests of significance treated by theory of probability. Proceedings of the Cambridge Philosophical Society. 1935, 31: 203222. 10.1017/S030500410001330X.
 18.
Lartillot N, Philippe H: Computing Bayes factors using thermodynamic integration. Syst Biol. 2006, 55: 195207. 10.1080/10635150500433722.
 19.
Ogata Y: A Monte Carlo method for high dimensional integration. Num Math. 1989, 55: 137157. 10.1007/BF01406511.
 20.
Gelman A, Meng XL: Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Stat Sci. 1998, 13: 163185. 10.1214/ss/1028905934.
 21.
Neal RM: Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat. 2000, 9: 249265. 10.2307/1390653.
 22.
Rodrigue N, Philippe H, Lartillot N: Assessing siteinterdependent phylogenetic models of sequence evolution. Mol Biol Evol. 2006, 23: 17621775. 10.1093/molbev/msl041.
 23.
Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. J Mol Evol. 1984, 20: 8693. 10.1007/BF02101990.
 24.
Lepage T, Bryant D, Philippe H, Lartillot N: A general comparison of relaxed molecular clock models. Mol Biol Evol. 2007, 24: 26692680. 10.1093/molbev/msm193.
 25.
Yang Z: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994, 39: 306314. 10.1007/BF00160154.
 26.
Fryxell KJ, Zuckerkandl E: Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol Biol Evol. 2000, 17: 13711383.
 27.
Arndt PF, Burge CB, Hwa T: DNA sequence evolution with neighbordependent mutation. J Comp Biol. 2003, 10: 313322. 10.1089/10665270360688039.
 28.
Arndt PF, Hwa T: Identification and measurement of neighbordependent nucleotide substitution processes. Bioinformatics. 2005, 21: 23222328. 10.1093/bioinformatics/bti376.
 29.
Woodcock DM, Crowther PJ, Diver WP: The majority of methylated deoxycytidines in human DNA are not in the CpG dinucleotide. Biochem Biophys Res Commun. 1987, 145: 888894. 10.1016/0006291X(87)910485.
 30.
Ramsahoye BH, Biniszkiewicz D, Lyko F, Clark V, Bird AP, Jaenisch R: NonCpG methylation is prevalent in embryonic stem cells and may be mediated by DNA methyltransferase 3a. Proc Natl Acad Sci USA. 2000, 97: 52375242. 10.1073/pnas.97.10.5237.
 31.
Hess ST, Blake JD, Blake RD: Wide variations in neighbordependent substitution rates. J Mol Biol. 1994, 236: 10221033. 10.1016/00222836(94)900094.
 32.
Yang Z: Estimating the pattern of nucleotide substitution. J Mol Evol. 1994, 39: 105111.
 33.
Miyamoto MM, Slighton JL, Goodman M: Phylogenetic relations of humans and African apes from DNA sequences in the ψηglobin region. Science. 1987, 238: 369373. 10.1126/science.3116671.
 34.
Peeters A, Coolsaet K, Brinkmann G, Van Cleemput N, Fack V: GrInvIn in a nutshell. J Math Chem. 2008, 45: 471477. 10.1007/s1091000894205.
 35.
Green PJ: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995, 82: 711732. 10.1093/biomet/82.4.711.
Acknowledgements
We would like to thank Tom Degroote for his assistance with our departmental cluster and Stijn De Smet for his assistance in using the BEgrid (Belgian Grid for Research) infrastructure. The last author acknowledges support from IAP research network grant nr. P06/03 from the Belgian government (Belgian Science Policy).
Author information
Affiliations
Corresponding author
Additional information
Authors' contributions
GB initiated the study, codesigned the split calculation approach, designed the graphbased reduction approach, performed all the analyses, programmed the software routines and wrote a first complete version of the manuscript. YVdP contributed biological expertise to the analyses and edited the manuscript. SV contributed statistical expertise to the analyses, codesigned the split calculation approach, designed the likelihoodbased reduction strategy and edited the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Baele, G., Van de Peer, Y. & Vansteelandt, S. Efficient contextdependent model building based on clustering posterior distributions for noncoding sequences. BMC Evol Biol 9, 87 (2009). https://doi.org/10.1186/14712148987
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/14712148987
Keywords
 Marginal Likelihood
 Rate Class
 Thermodynamic Integration
 Posterior Variance
 Substitution Behavior