The mixture branch length (MBL) model
The mixture model assumes several components, each with its own set of branch lengths. When sites are assumed to be independent, the likelihood of the data D under the mixture model is the product of N site-specific likelihoods, each site's likelihood being the sum of the likelihoods over all Nc components, weighted by the component probabilities w_k:

p(D \mid \mathbf{l}, \tau, \theta) = \prod_{i=1}^{N} \sum_{k=1}^{N_c} w_k \, p(C_i \mid l_k, \tau, \theta)

where l denotes the Nc sets of (2s-3) branch lengths (s being the number of species); τ is the topology; θ is the remaining set of parameters (such as the rate matrix and stationary probabilities); and C_i is the alignment column at site i. The MBL model is implemented in custom software, which uses a Bayesian Markov chain Monte Carlo (MCMC) sampler. Maximum likelihood estimates were obtained via simulated annealing.
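The per-site mixture computation described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the per-component site likelihoods are assumed to have been computed already (in practice, by Felsenstein's pruning algorithm run with each component's branch lengths).

```python
import math

def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a branch-length mixture.

    site_likelihoods[i][k] is the likelihood of site i under component k
    (hypothetical precomputed inputs); weights[k] is the component
    probability w_k.
    """
    logL = 0.0
    for per_component in site_likelihoods:
        # Each site's likelihood is the weighted sum over components...
        site_L = sum(w * L for w, L in zip(weights, per_component))
        # ...and sites are independent, so log-likelihoods add up.
        logL += math.log(site_L)
    return logL
```
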
The covarion model
The covarion model corresponds to a doubly stochastic process. The switching between the off and on states is described by the rate matrix:

S = \begin{pmatrix} -s_{01} & s_{01} \\ s_{10} & -s_{10} \end{pmatrix}

where s01 is the rate of switching from off to on and s10 is the rate of switching from on to off; two parameters are therefore sufficient to describe this process. When a site is in the on state, it undergoes substitutions among the 20 amino acids according to a first-order Markov process, described by a rate matrix Q. Here, for both the covarion and MBL models, this substitution process was described by a JTT+Γ model with four discrete categories.
The full rate matrix can then be written as:

R = \begin{pmatrix} Q - s_{10} I & s_{10} I \\ s_{01} I & -s_{01} I \end{pmatrix}

where I is the r × r identity matrix (r is the number of states; for a protein data set, r = 20). R is therefore a 40 × 40 rate matrix for the covarion Markov process. For more details on the implementation, see refs.
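The block structure of the covarion rate matrix can be illustrated with a short sketch (the function name is hypothetical; Q is any r × r rate matrix with rows summing to zero):

```python
def covarion_rate_matrix(Q, s01, s10):
    """Build the 2r x 2r covarion rate matrix R from an r x r substitution
    rate matrix Q and the off->on / on->off switching rates s01 and s10.
    State order: the r 'on' states first, then the r 'off' states."""
    r = len(Q)
    R = [[0.0] * (2 * r) for _ in range(2 * r)]
    for i in range(r):
        for j in range(r):
            # On block: substitutions follow Q...
            R[i][j] = Q[i][j]
        # ...minus the on->off switching rate on the diagonal.
        R[i][i] -= s10
        R[i][r + i] = s10          # on -> off switch
        R[r + i][i] = s01          # off -> on switch
        R[r + i][r + i] = -s01     # off states undergo no substitutions
    return R
```

Every row of R sums to zero, as required of a rate matrix: the on rows inherit this from Q (the −s10 on the diagonal cancels the s10 switching entry), and the off rows contain only the switching terms.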
Maximum likelihood estimation using simulated annealing
We use simulated annealing, within our MCMC sampler, to obtain the maximum likelihood estimate. Simulated annealing is a straightforward generalization of the MCMC algorithm, well suited to high-dimensional models such as MBL. In a normal MCMC run, at each cycle a new parameter value x′, slightly different from the current one x, is proposed according to a stochastic kernel q(x, dx′), and accepted according to the Metropolis-Hastings rule, i.e. with probability

\min\left(1, \frac{L(x')\, q(x', dx)}{L(x)\, q(x, dx')}\right)

where L(x) is the likelihood of the current state; L(x′) is the likelihood of the proposed state; q(x′, dx) is the probability of proposing a move from x′ back to x; and q(x, dx′) is the probability of proposing the move from x to x′. The only additional feature needed for simulated annealing is to replace this Metropolis-Hastings rule by its thermal version:

\min\left(1, \left[\frac{L(x')}{L(x)}\right]^{\beta} \frac{q(x', dx)}{q(x, dx')}\right)
Here, β is analogous to an inverse temperature. If β < 1, the Markov chain is heated up (the equilibrium distribution is flatter than the posterior distribution), and if β > 1, it is cooled down (the equilibrium distribution is more peaked around its mode). At the reference temperature (β = 1), the equilibrium distribution reduces to the posterior distribution.
Based on this modification of the Metropolis principle, one can mimic the process of thermodynamic annealing to obtain the maximum: we start at a high temperature (β = 1), at which the posterior distribution is extensively visited; then, as the temperature decreases (as β increases), the distribution explored by the MCMC becomes progressively more peaked around the mode, until, at a sufficiently low temperature, the Markov chain "freezes" at the ML estimate. Our cooling schedule consists of starting with β = 1 and increasing its value geometrically (i.e. β ← 1.01 × β) until β = 50000. To check whether the chain gets stuck in local maxima, several independent runs with random starting points were performed and compared with each other. All the independent runs were found to converge to the same maximum.
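The annealing scheme just described can be sketched generically. The function names and the symmetric proposal kernel are assumptions for illustration (with a symmetric kernel the Hastings ratio equals 1); the actual sampler operates on branch lengths and the other model parameters.

```python
import math
import random

def simulated_annealing(log_lik, propose, x0, factor=1.01, beta_max=5e4):
    """Minimal simulated-annealing sketch: a Metropolis sampler whose
    likelihood ratio is raised to the power beta, with beta increased
    geometrically from 1 to 50000 as in the cooling schedule above.
    `log_lik` and `propose` are user-supplied callables; the proposal
    is assumed symmetric."""
    x, beta = x0, 1.0
    while beta < beta_max:
        x_new = propose(x)
        # Thermal Metropolis rule: accept with prob min(1, (L'/L)**beta).
        log_ratio = beta * (log_lik(x_new) - log_lik(x))
        if random.random() < math.exp(min(0.0, log_ratio)):
            x = x_new
        beta *= factor   # geometric cooling: beta <- 1.01 * beta
    return x
```

As β grows, downhill moves are accepted ever more rarely, so the chain behaves like a random walk early on and like greedy hill-climbing near the end.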
The BIC is defined as:

BIC = -2 \ln p(D \mid \hat{\theta}) + K \ln N

where \hat{\theta} is the overall set of parameters maximizing the log-likelihood ln p(D | \hat{\theta}), K is the number of parameters adjusted in \hat{\theta}, and N is the number of sites. The penalty depends both on the number of parameters and on the number of sites; the smaller the BIC, the better the fit of the model. Another criterion, similar to the BIC but less strict, is the Akaike Information Criterion (AIC), for which the penalty depends only on the number of parameters:

AIC = -2 \ln p(D \mid \hat{\theta}) + 2K
A second-order correction for the AIC has a negligible impact in the present context, and so is not reported here.
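The two criteria are direct to compute once the maximized log-likelihood is known; a minimal sketch (function names are illustrative):

```python
import math

def bic(max_log_lik, k, n):
    """Bayesian Information Criterion: penalty grows with both the
    number of free parameters k and the number of sites n.
    Smaller values indicate a better-fitting model."""
    return -2.0 * max_log_lik + k * math.log(n)

def aic(max_log_lik, k):
    """Akaike Information Criterion: penalty depends only on k."""
    return -2.0 * max_log_lik + 2.0 * k
```
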
We also compared models by cross-validation (CV). Briefly, for a given model, we first optimize the parameters on one portion of the dataset, the learning set (D_learn), then use these parameters (\hat{\theta}) to compute the likelihood of the testing set (D_test). The CV score is obtained by sampling the learning set and the testing set several times and taking the expectation of the likelihood over these replicates (parameters being inferred from the learning sets):

CV = E\left[\ln p(D_{test} \mid \hat{\theta}(D_{learn}))\right]
By averaging over replicates, one gets rid of sampling errors in the partitioning of the dataset into a learning set and a testing set. In particular, one smooths out possible (albeit unlikely) uneven partitions in which sites corresponding to distinct components of the mixture would be partially segregated.
The learning set (D_learn) and the testing set (D_test) can be created in various ways. One method is the so-called v-fold cross-validation: the original dataset is partitioned into v disjoint subsets of equal size; each subset is then used in turn as the testing set (D_test), the union of the other v-1 subsets being used as the learning set (D_learn), until a total of v tests have been performed. In this work, we used the widely used 2-fold cross-validation scheme. The random sampling of half of the dataset was performed ten times, which yielded a CV score precise enough to discriminate among the models under study. This small number of replicates is therefore a good compromise between computational time and accuracy.
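The replicated 2-fold splitting can be sketched as follows (the function name is hypothetical; the fit and score steps would be model-specific and are omitted):

```python
import random

def two_fold_cv_splits(n_sites, n_replicates=10, seed=1):
    """Replicated 2-fold cross-validation splits: each replicate
    randomly partitions the site indices into equal-sized learning
    and testing halves, as in the scheme described above."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_replicates):
        idx = list(range(n_sites))
        rng.shuffle(idx)
        half = n_sites // 2
        splits.append((idx[:half], idx[half:]))  # (learning, testing)
    return splits
```
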
Identifying the optimal component for each site
Since we do not know exactly which component a given site belongs to, the likelihood for one site is the weighted sum of the likelihoods conditional on each possible allocation of the site to the available components. We can, however, calculate the posterior probability that site i belongs to component k:

p(k \mid C_i) = \frac{w_k \, p(C_i \mid l_k, \tau, \theta)}{\sum_{j=1}^{N_c} w_j \, p(C_i \mid l_j, \tau, \theta)}
These posterior probabilities were then averaged over the sites of each gene of the alignment. Alternatively, each site was affiliated to the component with the highest posterior probability, and a chi-square test of independence between component affiliation and gene affiliation was performed.
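The posterior allocation rule above amounts to Bayes' formula applied site by site; a minimal sketch (the per-component site likelihoods are again assumed precomputed):

```python
def component_posteriors(site_likelihoods, weights):
    """Posterior probability that site i belongs to component k:
    p(k | C_i) = w_k * L_ik / sum_j(w_j * L_ij), where L_ik is the
    likelihood of column C_i under component k."""
    posteriors = []
    for per_component in site_likelihoods:
        joint = [w * L for w, L in zip(weights, per_component)]
        total = sum(joint)
        posteriors.append([p / total for p in joint])
    return posteriors

def map_allocation(posteriors):
    """Affiliate each site to its highest-posterior component."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]
```
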
All the simulations were done with the JTT replacement matrix, rate-across-site heterogeneity being modeled by a Γ distribution (four discrete categories). Heterotachous data were simulated by concatenating two alignments generated under the same tree topology, but with different branch lengths [24, 54]. Briefly, a reference tree with specified branch lengths is chosen (Fig. 1). Next, each branch length of the two partitions is adjusted by multiplying the length of the reference tree either by (1 + τ) or by (1 - τ), where τ ∈ [0, 1] is a parameter tuning the extent of heterotachy. The choice between the two opposite multipliers is made at random, independently for each branch, subject to two constraints: (a) the corresponding branch in the two partitions must be adjusted with opposite multipliers; (b) within a partition, sister branches must also be adjusted with opposite multipliers. In other words, if a branch length in one partition is increased by a factor (1 + τ), then the same branch in the other partition is decreased by a factor (1 - τ), and so is the sibling branch within the same partition. In this way, the average length over the alignment remains equal to the reference length, and the branch length heterogeneity strictly follows the strategy of Kolaczkowski and Thornton, i.e., the branch lengths in each component tend to lie in a Felsenstein zone. In total, 16 simulated datasets were generated, with different values of α (0.5, 1, 1.5, 2) and of τ (0.2, 0.4, 0.6, 0.8).
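The paired-multiplier scheme can be sketched as follows. This is a simplified illustration enforcing only constraint (a), the opposite-multiplier pairing between partitions; the sister-branch constraint (b) depends on the tree structure and is omitted here.

```python
import random

def heterotachous_multipliers(n_branches, tau, rng=None):
    """For each branch, draw (1+tau) for one partition and (1-tau)
    for the other, the assignment being random per branch.  The mean
    of the two multipliers is always 1, so the average branch length
    over the concatenated alignment equals the reference length."""
    rng = rng or random.Random(0)
    part1, part2 = [], []
    for _ in range(n_branches):
        if rng.random() < 0.5:
            part1.append(1.0 + tau)
            part2.append(1.0 - tau)
        else:
            part1.append(1.0 - tau)
            part2.append(1.0 + tau)
    return part1, part2
```
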
Three protein datasets were used to examine the fit of the covarion model, the mixture branch length models, and the homotachous (one-component) model:
• Nuclear alignment: a subsample was obtained from the dataset of 133 nuclear genes and 57 animal species. The twenty most complete species were selected. For reasons of computing time, only the first 5000 sites were used.
• Plastid alignment: the dataset was created by concatenating plastid ribosomal proteins (rpl14, rpl20, rpl2, rpl33, rps12, rps16, rpl16, rpl22, rpl32, rpl37, rps19, rps3, rps7, rps11, rps14, rps18, rps2, rps4 and rps8) and RNA polymerase proteins (rpolA, rpolBp, and rpolB) from green plants, glaucophytes, red algae, cryptophytes, stramenopiles and haptophytes. The ambiguously aligned regions were removed using Gblocks . The final alignment contains 22 species and 3754 sites.
• Mitochondrial alignment: we used a concatenation of 12 mitochondrial genes (atp6, atp8, cox1, cox2, cox3, cytochrome b, nad1, nad2, nad3, nad4, nad4L and nad5), totaling 3591 sites, from 17 mammals.
The computing times for one CV replicate (on a Pentium 4, 3.2 GHz) are approximately 80 and 190 hours (two-component MBL and covarion, respectively) for the nuclear dataset, 40 and 110 hours for the plastid dataset, and 35 and 80 hours for the mitochondrial dataset.