 Methodology article
 Open Access
 Published:
EM for phylogenetic topology reconstruction on nonhomogeneous data
BMC Evolutionary Biology volume 14, Article number: 132 (2014)
Abstract
Background
The reconstruction of the phylogenetic tree topology of four taxa is, still nowadays, one of the main challenges in phylogenetics. Its difficulties lie in considering not too restrictive evolutionary models, and correctly dealing with the longbranch attraction problem. The correct reconstruction of 4taxon trees is crucial for making quartetbased methods work and being able to recover large phylogenies.
Methods
We adapt the well known expectationmaximization algorithm to evolutionary Markov models on phylogenetic 4taxon trees. We then use this algorithm to estimate the substitution parameters, compute the corresponding likelihood, and to infer the most likely quartet.
Results
In this paper we consider an expectationmaximization method for maximizing the likelihood of (time nonhomogeneous) evolutionary Markov models on trees. We study its success on reconstructing 4taxon topologies and its performance as input method in quartetbased phylogenetic reconstruction methods such as QFIT and QuartetSuite. Our results show that the method proposed here outperforms neighborjoining and the usual (timehomogeneous continuoustime) maximum likelihood methods on 4leaved trees with amonglineage instantaneous rate heterogeneity, and perform similarly to usual continuoustime maximumlikelihood when data satisfies the assumptions of both methods.
Conclusions
The method presented in this paper is well suited for reconstructing the topology of any number of taxa via quartetbased methods and is highly accurate, specially regarding largely divergent trees and time nonhomogeneous data.
Background
Obtaining a good method for reconstructing the phylogenetic topology of four taxa is one of the crucial goals in phylogenetics. The fourtaxon trees, if correctly inferred, can be used as input of quartetbased methods in order to reconstruct larger trees. But due to the complexity of real data, the problem of reconstructing fourtaxon trees is not so easy. Most phylogenetic reconstruction methods assume simple evolutionary models that may not really fit real data, which leads to incorrect phylogenetic inference ([1–5]). For example, many of them rely on continuoustime Markov processes with a constant instantaneous mutation rate matrix along the tree (the socalled global homogeneity), and also assume timereversibility (and hence stationarity). On the other hand, as pointed out in [6], in order to make quartetbased methods work it is extremely important to obtain 4taxon tree reconstruction methods that are not affected by the presence of longbranch attraction [7].
Most evolutionary models are described by a Markov process over the tree, that is, conditional rates of change at two diverging sequences depend only on the current state and are independent of the previous sates ([8], chapter 8.2). Markov processes on trees are specified by a distribution at the root of the tree and a transition matrix at each branch and, in contrast to continuoustime models, the Markov process on each branch is not assumed to be timehomogeneous [9] (that is, they are locally heterogeneous). These models directly consider as parameters the entries of the substitution matrices and the root distribution (see [8], chapter 8, [10], § 4.2). Barry and Hartigan [11] considered such a general Markov model (henceforth called GMM), which does not assume any other constraints. In particular, it is locally and globally time nonhomogeneous (instantaneous substitution rates are not constant on any edge, nor on the whole tree) and it is neither timereversible nor stationary. The only restriction underlying this model is that sites evolve independently and are identically distributed (i.i.d. hypothesis). Considering this model and its submodels is one way of covering more general scenarios, in contrast to those phylogenetic methods that implement timehomogeneous and timereversible continuoustime models (GTR and its submodels) cf [9].
The GMM above accounts for 12 parameters per edge plus three parameters for the root distribution. When some symmetries on the transition matrices or on the root distribution are imposed, one obtains the substitution matrices of the corresponding JukesCantor and Kimura (2 and 3 parameters) models among others (([10], § 4.2),[12], see Methods section). For instance, the Markov version of the K81 model [13] (henceforth referred to as K81*) deals with 3 parameters per edge (one for the conditional probability of transitions, and two for the two types of transversions, see the Methods section) which makes a total of 3 ∗ (2n  3) parameters in unrooted trivalent trees with n leaves, whereas the usual timehomogeneous continuoustime version accounts for 2 parameters for a normalized instantaneous rate matrix constant over the tree plus one parameter per edge length (that is, 2 + (2n  3) parameters on an unrooted trivalent tree). Notice, however, that if one considers a time nonhomogeneous continuoustime Kimura 3parameter model, then the number of parameters is exactly the same as for K81*. In this case, the only difference between both models is that K81* does not even assume local homogeneity (that is, time homogeneity over each branch), while all timecontinuous models do. The huge amount of parameters for nonhomogeneous models makes a maximumlikelihood approach unfeasible and unreliable for a whole tree on n taxa if n is large. Nevertheless, there is some hope that these more general models lead to accurate methods on 4taxon trees. In our setting, we only deal with substitutions of nucleotides (not aminoacids) on 4taxon trees and we will always assume the i.i.d. hypothesis (thus excluding the possibility of heterogeneity across sites).
In this paper we develop a maximum likelihood framework for inferring the best tree topology under (general) Markov processes. Our approach is based on the widely used ExpectationMaximization algorithm. The ExpectationMaximization algorithm (EM), as introduced in [14], is an iterative algorithm for finding maximum likelihood estimates of parameters in statistical models that deal with unobserved data. We have adapted this algorithm to the case of phylogenetic 4taxon trees in what we call EMtree. EM iteratively gives an expectation of the distribution of the nucleotide sequences at the interior nodes (this is called the Estep) and finds the parameters that maximize the likelihood for these data in the socalled Mstep (because the parameters for which the maximum likelihood is achieved can be computed in a closed form for complete data). The EM algorithm has been applied to many other disciplines (see for example [15]).
The use of the EM algorithm to estimate the continuous parameters of a phylogenetic tree under a Markov process (namely, the root distribution and the entries of the transition matrices) has been already discussed in [9] and [16]. In this paper we focus on the use of EMtree for estimating the tree topology for four taxa and test its performance in reconstructing the correct tree topology on simulated data. We shall see that, although it is a time consuming algorithm, it can lead to very accurate results.
First of all we test it as a 4taxon tree reconstruction method by using the tree space proposed by Huelsenbeck [17] on data simulated from a general Markov process first and then restricting to a timehomogeneous (continuoustime) process. We compare EMtree to neighborjoining and to the usual (continuoustime) maximum likelihood approach under both global homogeneity and nonhomogeneity (note that throughout the experiments we will restrict to timereversible models to simulate and to recover trees). As all models considered will be stationary (and with uniform stationary distribution), we will not be evaluating compositional heterogeneity but only the effect of the variation of substitution rates among lineages [18]. Afterwards, we use these three methods as input of two quartetbased methods: one weighted (Willson’s QuartetSuite [19, 20]) and one unweighted (Maximum Quartet Fit, QFIT as implemented in Clann [21]); and compare their performance on the 12taxon trees proposed in [6]. Specially for largely divergent trees, we observe that EMtree gives the best results and is less subject to longbranch attraction.
Results and discussion
In this section we present the results of the topology reconstruction method EMtree on simulated data evolving both under homogeneous and nonhomogeneous Markov processes.
In Figure 1 we present the performance of three different topology reconstruction methods on the 4taxon tree space proposed by Huelsenbeck [17] (see Methods section): Figure 1(a) for the usual neighborjoining (NJ) with the K81 distance, 1(b) for the usual maximum likelihood tree (ML) assuming timehomogeneous continuoustime K81 model, and 1(c) EMtree for K81* model. The Huelsenbeck tree space covers a wide range of branch lengths for 4taxon trees: parameters a and b denote branch lengths as indicated in Figure 2 and they were varied from 0.01 to 0.75. For each pair (a,b) we draw a black cell if the topology is correctly reconstructed on the 100 simulated alignments; a white cell denotes less than 33% success; and the gray scale used in between is shown in the figures. The alignments were generated following the K81* model (see the Methods section) and had length 300 (Figure 1 left) and 1000 bp (Figure 1 right).
The results show that NJ has difficulties in the Felsenstein zone where the longbranch attraction problem is present (that is, small a and large b) and ML fails for largely divergent trees (note that trees here have at most 2.25 pairwise divergence, whereas similar studies have even used divergences of about 7.0 expected substitutions per site [18]). The EMtree algorithm seems to overcome these two difficulties and is clearly more accurate than NJ and ML in this tree space. Nevertheless, we need to point out that the data were simulated according to K81* model, which fits the assumptions of EMtree but not of NJ nor ML methods (indeed, these last two methods are based on the continuoustime Kimura model and assume homogeneous mutation rates along lineages). This model misspecification for ML leads to in incorrect inference of parameters, which is even more extreme for long branches [18] (this justifies the bad performance of ML in the lower right corner of Figure 1(b)). To confront this situation, we also generated data under timehomogeneous continuoustime K81 model. The results are depicted in the left column of Figure 3 and clearly show a similar success of ML and EMtree in this case.
In order to compare EMtree to time nonhomogeneous but continuoustime ML, we generated data evolving under a continuoustime nonhomogeneous K81 model. This is a special case of K81* and therefore this model matches both approaches: EMtree (on its K81* version) and bppml[22] restricted to a nonhomogeneous continuoustime K81 model. On Figure 3 (right) we present the results of these two methods on nonhomogeneous K81 data to show that both methods perform similarly and outperform NJ.
Although EMtree may give more accurate results on more general data, it is a time consuming algorithm. Indeed, on 4taxon trees, EMtree is almost 1000 times slower than NJ and more than 2000 times slower than ML. For example on 100 4taxon alignments of 600bp, the execution time on a Intel(R) Core(TM) i54200U CPU @ 1.60GHz (only using one of the four CPU) was 186.66s for EMtree on K81*, 0.06s for PAML ML on homogeneous continuoustime K81, 211.45 for bppml on nonhomogeneous K81 model, and 0.16s for NJ (with K81 distances).
Now we present the results that test the use of EMtree, NJ, and (timehomogeneous continuoustime) ML as methods to obtain the input for the quartetbased methods QuartetSuite and QFIT on data generated under K81* model.
For the three topologies on twelve taxa studied here (called cc, cd, and dd –see Figure 4 and the Methods section), we give the proportion of the (one thousand) reconstructed trees whose RobinsonFoulds distance to the original topology is equal to 0,2,4,6 or >6, for both QuartetSuite and QFIT, and for the three input methods under study. The results are displayed in Figures 5 (QuartetSuite) and 6 (QFIT) for simulated alignments of 600 bp and in Figures A1 and A2 of the Additional file 1 for 300 bp. The figures for both alignment lengths are similar and present the same trends (slightly better for 600 bp for all methods, as expected), so we let the results on 300 bp for the Additional file 1. In Figure 5 we also present the performance of a global NJ and a global ML (estimating timehomogeneous continuoustime K81 model).
In both quartetbased methods (QuartetSuite and QFIT), the reconstruction of the cc topology presents the best results compared to the others, independently of the algorithm used as input quartets (EMtree, NJ, or ML). The same tendency is shared by global ML and NJ. Conversely, the topology dd is never correctly reconstructed for any of the methods or branch lengths. It is worth noting that cc is the topology that would have the least long branch attraction and dd is the one that would have the most (cd is in between because half of the tree comes from cc and the other half comes from dd). Therefore our results are consistent with the observation in [6] that the success of a quartet based method depends on the capacity of the input method to correctly reconstruct 4taxon trees under the longbranch attraction problem.
In both Figures 5 and 6 we observe that the performance of ML and EMtree is quite similar in most cases, although ML never outperforms EMtree. A detailed look at these figures reveals that for largely divergent trees (that is, b=0.1, the last bar in each plot), EMtree is the best quartet input method among those considered here, as its results outperform NJ and ML for both QuartetSuite and QFIT and in all tree topologies. We find the explanation of this result in the management of longbranch attraction by the different methods considered. Long branches lead to similar sequences as a result of multiple substitutions and, as ML estimates have been computed on a wrong substitution model, this method is more influenced by longbranch attraction. EMtree has also a better success than a global NJ and a global ML on the trees cd and dd, but on the “easiest” tree cc, a global ML is more accurate. For the largely divergent cc tree, a global ML has better average performance than EMtree (see also below), but it never succeeds in fully reconstructing the topology (whereas EMtree does in about 2% of the alignments).
When considering the QuartetSuite method, NJ is clearly the worst quartet input method for all tree topologies and branch lengths. This is probably due to the fact that the Willson method implemented in QuartetSuite is intended for weighted quartets, whereas NJ quartets are given only binary weights. On the contrary, for the unweighted method QFIT, NJ seems to be more accurate than ML and EMtree for topologies cc and dd but only for low divergence (b=0.005 or b=0.015).
We want to point out that, in general, QFIT gives worse results than QuartetSuite (except for the NJ algorithm used as input in the cc and cd topologies with low divergence), reinforcing the idea that weighted methods are more reliable [6, 23].
In Tables 1 and 2 we display the mean of the RobinsonFoulds distance of the same study on the 12taxon trees, and its variance in parentheses (Table 1 for QuartetSuite, and Table 2 for QFIT). Results are presented for alignments of 600 bp and have to be interpreted as less mean, better approximation (results for 300 bp are similar and appear in the Additional file 1, Tables S1 and S2). For each tree topology and each choice of parameter b, the best method (according to the lowest mean) is marked in bold print.
In these tables we observe that QuartetSuite gives the lowest distance to the original tree in general and that the best results for largely divergent trees (b = 0.1) are obtained by EMtree (for both QuartetSuite and QFIT and all tree topologies). Whenever the mean of RobinsonFoulds distance for (quartetbased) ML is lower than the mean for EMtree, there is no significant difference. Overall, EMtree is the method that outperforms the other two quartet input methods in most cases. As far as global methods are concerned, it is worth pointing out the bad performance of a global NJ on trees cd and dd (or cc with b = 0.05 or 0.1), specially if one takes into account that the RobinsonFoulds distance for the (less resolved) star tree is 9.
When inspecting the variances, one sees that EMtree is the only quartet input method that preserves low variances in all cases (global methods are the ones presenting lower variance, though). Conversely, the variances are extremely large for NJ with QuartetSuite in all different scenarios, and they are also huge for (quartet input) ML when the trees are largely divergent (b = 0.1). QFIT presents low variances in all cases, for all input methods, probably because the input information is less “subject to vary” (the input only considers tree topologies, not weights). It is worth pointing out that whenever quartet input NJ outperforms ML and EMtree (only for QFIT), it does so with larger variance than the other two methods.
Conclusion
We tested the accuracy of the (likelihoodbased) method EMtree as a method to infer 4taxon topologies under (time nonhomogeneous) Markov models, and compared it to NJ and ML (homogeneous and nonhomogeneous). When EMtree and ML are tested on data satisfying the assumptions of both methods, they have a similar performance. Nevertheless, EMtree is based on time nonhomogeneous models (both local and global time heterogeneity), and hence outperforms the other methods when these assume homogeneity.
There are only few nonhomogeneous continuoustime models that could be fairly compared to general Markov processes, and one expects that under more complex evolutionary scenarios (such as nonstationary or not timereversible data), the success of usual ML or NJ methods (based on these assumptions) will be poorer (as shown in [18]), confirming that an EM approach based on general Markov processes, could be more recommendable.
EMtree is a timeconsuming algorithm, however, and the user has to decide whether it is worth performing such an analysis (we only recommend it for at most 6 taxa).
We have also assessed EMtree, NJ, and ML as input for the quartetbased methods QFIT and QuartetSuite. To do so, we have considered three different topologies on twelve taxa evolving under tome nonhomogeneous processes, and a wide set of branch lengths values. EMtree turns out to be the input method that performs best in most cases on this type of data. Regardless of the quartetbased method chosen and the tree topology, EMtree gives the best results for trees with large divergence among taxa (b = 0.1).
Summing up, an EM approach on Markov models provides an accurate 4taxon tree reconstruction method suited for data not known to satisfy homogeneity and very useful as input of quartetbased methods, specially for largely divergent trees. However, the method presented here is not valid for data violating the i.i.d. hypotheses, such as data with variation across sites (Gammarates, invariable sites, mixtures, and others), or dependency among sites. The method should be strongly modified in order to accommodate these generalizations, and this might be an interesting future project.
Methods
EM algorithm
We have implemented an expectationmaximization (EM) algorithm on fourtaxon trees evolving under Markov processes. The core of the EM algorithm for phylogenetic trees is:

Input: Data D (multiple alignment of n sequences), unrooted trivalent tree topology T with n leaves (and $\mathfrak{e}:=2n3$ edges), Markov model .

Initialization: Root the tree at some internal node and provide tentative initial values for the root distribution π and the substitution matrices S_{ i } ($i=1,\dots ,\mathfrak{e}$), and a threshold ε > 0.

Recursion:

Estep: Provide complete data cD (for all nodes in the tree) that maximizes the posterior probability ${P}_{T,\mathcal{M}}({S}_{i},\pi \mathit{\text{cD}})$ (unique maximum that can be computed efficiently by Felsenstein’s algorithm [24]).

Mstep: Compute the parameters $\widehat{{S}_{i}},\widehat{\pi}$ that maximize the loglikelihood $l({\left\{{S}_{i}\right\}}_{i},\pi )={P}_{T,\mathcal{M},\left\{{S}_{i}\right\},\pi}\left(\mathit{\text{cD}}\right)$ for these complete data (this maximum is unique and can be computed in a closed form, see the Additional file 1, section 2).

If $l({\left\{\widehat{{S}_{i}}\right\}}_{i},\widehat{\pi})l({\left\{{S}_{i}\right\}}_{i},\pi )>\epsilon $, then set $\pi =\widehat{\pi}$, ${S}_{i}=\widehat{{S}_{i}}$ ($i=1,\dots ,\mathfrak{e}$), and go back to Estep.


Output:${\left\{\widehat{{S}_{i}}\right\}}_{i},\widehat{\pi}$ and $l({\left\{\widehat{{S}_{i}}\right\}}_{i},\widehat{\pi})$
We set up the EM algorithm with ε = 10^{3} and forced it to stop after 100 iterations if it had not finished.
Given four aligned nucleotide sequences s1, s2, s3, s4, we run the EM algorithm for the three possible trivalent trees on four taxa: T_{1} = (s1,s2s3,s4), T_{2} = (s1,s3s2,s4), T_{2} = (s1,s4s2,s3). Our tree reconstruction method returns the tree topology for which the likelihood output by EM is higher. This method is called EMtree throughout the paper.
Models
To test the method proposed in this paper, we mainly used the K81* model on fourleaved trees. This is, the evolutionary process is a Markov process with uniform distribution at the root and is specified by transition matrices of type
on each branch e_{ i }. In the matrix above, the rows and columns are labeled by nucleotides adenine, cytosine, guanine and thymine (in this order), so that entry (j,k) stands for the conditional probability that nucleotide j at the parent node of edge e_{ i } is substituted by nucleotide k at the child node. When trees evolve under this model, the root becomes unidentifiable (different root locations may give rise to the same joint distribution at the leaves, [10]) and, as a consequence, one can only expect to reconstruct unrooted trees. This model does not assume a constant instantaneous mutation rate matrix over the tree and, therefore, trees evolving under this model are time nonhomogeneous. The more restrictive models K80* and JC* (Markov versions of Kimura 2parameter and JukesCantor models) are obtained by imposing b_{ i } = d_{ i } and b_{ i } = c_{ i } = d_{ i }, respectively ([10], § 4.2). All of them are timereversible (and hence stationary), and are part of the socalled groupbased models ([12, 25]). The most general model among Markov processes is the general Markov model GMM, which considers transition matrices and root distribution without any further restriction and is neither stationary nor timereversible [10].
The other two methods that are confronted to EMtree in this paper are NeighborJoining (NJ) and a usual continuoustime maximumlikelihood (ML). The ML estimates for homogeneous continuoustime K81 model [13] (with a instantaneous mutation rate matrix constant over the tree) have been obtained by the free software PAML [26] (baseml program), whereas for nonhomogeneous continuoustime K81 model we used the program bppml in the Bio++ package [22]. NeighborJoining was implemented considering the K81 distance [13]. When a global ML was used on 12 taxa, we used PAML and set up the stepwise addition option on this software.
Simulations
The simulated data in this paper has been produced using the program GenNonh of [27] and SeqGen [28]. GenNonh produces directly transition matrices of the required branch length (for any of the Markov models described above) and therefore does not assume timehomogeneity (not globally over the tree, nor locally at the edges). SeqGen was used to generate data evolving under continuoustime K81 model (both homogeneous and globally nonhomogeneous). In order to generate nonhomogeneous data, we made the software generate various edges evolving at different instantaneous rates by recording ancestral sequences. Alternatively, the software Hetero[29] could be also used to generate (global) time nonhomogeneous continuoustime data.
Tree space on 4leaf trees
In order to test the performance of the proposed method on fourtaxon trees, we based our tests on the tree space proposed by Huelsenbeck [17], so that it is possible to compare our results to those obtained there with different phylogenetic reconstruction methods. In this tree space, two branch length parameters a, b on trees of four taxa are varied. Parameter a assigns the branch length to the internal branch and two opposite peripheral branches, and parameter b assigns the branch length to the two remaining branches as in Figure 2. Parameters a and b represent expected number of substitution per site and in this paper are varied from 0.01 to 0.75 in increments of 0.02.
For each pair (a,b), we simulated one hundred alignments of lengths 300 and 1000 basepairs (briefly bp) under the tree topology 1234 (see Figure 2) and we inferred the topology using the three methods EMtree, ML, and NJ. The results for each method are shown in Figure 1.
Quartetbased methods
In order to assess the fourtaxon tree reconstruction method EMtree proposed above, it is important to test its performance in quartetbased reconstruction methods. To do so, we considered two of these methods: Maximum Quartet Fit (QFIT) and the method proposed by Willson in [19]. QFIT choses the supertree that shares the maximum number of quartets with the source trees, whereas the other method choses the supertree that optimizes a certain criterion based on the weights assigned at the input quartets. We use the implementation of QFIT distributed in Clann [21] and the implementation of Willson’s method called QuartetSuite in [20]. While QuartetSuite produces a tree that is expected to minimize inconsistencies with quartets, QFIT uses heuristics to search through the tree space (and we restricted this search to 100000 trees). As our intention was not to evaluate quartetbased methods but testing our 4taxon method in comparison to others, our criterion to choose these two quartetbased methods was the fact that they were freely available and that one of them allowed the use of weighted quartets.
In QuartetSuite weights are understood as log of the frequency of a quartet, so that weights are nonnegative and zero denotes the tree with most support. Therefore we used the opposite of the loglikelihood output by EMtree and baseml as weights of the corresponding quartets. Weights for NJ were set up to be 0 for the topology output by NJ and 999 for the other two trees.
We followed [6] to test the performance of our method as input of quartetbased method. In that paper, the authors consider three tree topologies on 12 taxa, denoted as cc, cd and dd (see Figure 4), and fix the proportions among their branch lengths in order to compare different reconstruction methods. A parameter b denoting the internal branch lengths is varied between 0.005, 0.015, 0.05, and 0.1, which gives a maximum pairwise divergence along the tree of about 0.1, 0.3, 1.0, and 2.0 nucleotides per site, respectively. The lengths of the simulated alignments are 300 and 600 bp.
For each of these scenarios we generated 1000 alignments using GenNonh and the Kimura 3parameter model as explained above. For each alignment, the tree was estimated using QFIT and QuartetSuite with the fourtaxon methods EMtree, ML and NJ as input. Then the RobinsonFoulds distance to the original tree (that is, the number of partitions that are present in one tree but not in the other) was computed using DendroPy symmetric_difference function [30]. The results are shown in Figures 5, 6, A1 and A2 (in the Additional file 1), and in Tables 1, 2, and S1, S2 (in the Additional file 1), and explained in the Results and discussion section.
Author’s contributions
The first author implemented the methods, produced results, and wrote the paper. The second author had the idea, directed the work, implemented methods and wrote the paper. Both authors read and approved the final manuscript.
References
 1.
Kelchner SA, Thomas MA: Model use in phylogenetics: nine key questions. Trends Ecol Evol. 2007, 22 (2): 8794. 10.1016/j.tree.2006.10.004.
 2.
Ripplinger J, Sullivan J: Assessment of substitution model adequacy using frequentist and Bayesian methods. Mol Biol Evol. 2010, 27 (12): 27902803. 10.1093/molbev/msq168.
 3.
Jermiin LS, Ho SY, Ababneh F, Robinson J, Larkum AW: The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst Biol. 2004, 53 (4): 638643. 10.1080/10635150490468648.
 4.
Galtier N, Gouy M: Inferring pattern and process: maximum likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol. 1998, 154 (4): 871879.
 5.
Yang Z, Yoder AD: Estimation of the transition/transversion rate bias and species sampling. J Mol Evol. 1999, 48: 274283. 10.1007/PL00006470.
 6.
Ranwez V, Gascuel O: Quartetbased phylogenetic inference: improvements and limits. Mol Biol Evol. 2001, 18 (6): 11031116. 10.1093/oxfordjournals.molbev.a003881.
 7.
Anderson FE, Swofford DL: Should we be worried about longbranch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol Phylogenet Evol. 2004, 33 (2): 440451. 10.1016/j.ympev.2004.06.015.
 8.
Semple C, Steel M: Phylogenetics, Volume 24 of Oxford Lecture Series in Mathematics and its Applications. 2003, Oxford: Oxford University Press
 9.
Jayaswal V, Jermiin LS, Robinson J: Estimation of phylogeny using a general Markov model. Evolutionary Bioinformatics Online. 2005, 1: 62
 10.
Allman ES, Rhodes JA: Phylogenetic invariants. Reconstructing Evolution. Edited by Gascuel O, Steel M. 2007, New York: Oxford University Press
 11.
Barry D, Hartigan JA: Statistical analysis of hominoid molecular evolution. Stat Sci. 1987, 2 (2): 191207. 10.1214/ss/1177013353.
 12.
Evans S, Speed T: Invariants of some probability models used in phylogenetic inference. Ann Statist. 1993, 21: 355377. 10.1214/aos/1176349030.
 13.
Kimura M: Estimation of evolutionary distances between homologous nucleotide sequences. Proc Natl Acad Sci USA. 1981, 78: 454458. 10.1073/pnas.78.1.454.
 14.
Dempster A, Laird N, Rubin D: Maximum likelihood estimation from incomplete data via the EM algorithm. J Roy Stat Soc. 1977, 39: 138.
 15.
McLachlan G, Krishnan T: The EM Algorithm and Extensions, Volume 382. 2007, New York: WileyInterscience
 16.
Kedzierska AM, Casanellas M: EMpar: EMbased algorithm for parameter estimation of Markov models on trees. [http://arxiv.org/abs/1207.1236],
 17.
Huelsenbeck J: Performance of phylogenetic methods in simulation. Syst Biol. 1995, 44: 1748. 10.1093/sysbio/44.1.17.
 18.
Ho SY, Jermiin LS: Tracing the decay of the historical signal in biological sequence data. Syst Biol. 2004, 53 (4): 623637. 10.1080/10635150490503035.
 19.
Willson SJ: Building phylogenetic trees from quartets by using local inconsistency measures. Mol Biol Evol. 1999, 16: 685693. 10.1093/oxfordjournals.molbev.a026151.
 20.
Department of Computer Science, Iowa State University: QuartetSuite by Raul Piaggio. [http://genome.cs.iastate.edu/CBL/download/],
 21.
Creevey CJ, McInerney JO: Clann: investigating phylogenetic information through supertree analyses. Bioinformatics. 2005, 21: 390392. 10.1093/bioinformatics/bti020.
 22.
Dutheil J, Boussau B: Nonhomogeneous models of sequence evolution in the Bio++ suite of libraries and programs. BMC Evol Biol. 2008, 8: 25510.1186/147121488255.
 23.
Strimmer K, Goldman N, von Haeseler A: Bayesian probabilities and quartet puzzling. Mol Biol Evol. 1997, 14 (2): 21010.1093/oxfordjournals.molbev.a025756.
 24.
Felsenstein J: Inferring Phylogenies. 2004, Sunderland: Sinauer Associates
 25.
Szekely LA, Steel MA, Erdos P: Fourier calculus on evolutionary trees. Adv Appl Math. 1993, 14 (2): 200216. 10.1006/aama.1993.1011.
 26.
Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS. 1997, 15: 555556. [http://abacus.gene.ucl.ac.uk/software/paml.html],
 27.
Kedzierska AM, Casanellas M: GenNonh: generating multiple sequence alignments on nonhomogeneous phylogenetic trees. BMC Bioinformatics. 2012, 13: 21610.1186/1471210513216.
 28.
Rambaut A, Grassly N: SeqGen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997, 13: 235238.
 29.
Jermiin LS, Ho SY, Ababneh F, Robinson J, Larkum AW: Hetero: a program to simulate the evolution of DNA on a fourtaxon tree. Appl Bioinformatics. 2003, 2 (3): 159163.
 30.
Sukumaran J, Holder MT: DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010, 26: 15691571. 10.1093/bioinformatics/btq228.
Acknowledgements
MC acknowledges funding from Ministerio de Ciencia y Competitividad of Government of Spain, project MTM201238122C0301/FEDER, and Generalitat de Catalunya, 2009 SGR 1284. We thank the editor and reviewers for highly valuable comments that lead to major improvements of the manuscript.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Electronic supplementary material
A pdf file containing figures A1, A2, and tables S1, S2 in section 1 and the explicit computation of the
Additional file 1: Mstep for K81* model in section 2.(PDF 598 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
IbáñezMarcelo, E., Casanellas, M. EM for phylogenetic topology reconstruction on nonhomogeneous data. BMC Evol Biol 14, 132 (2014). https://doi.org/10.1186/1471214814132
Received:
Accepted:
Published:
Keywords
 Tree topology reconstruction
 Expectationmaximization
 Quartetbased method
 Evolutionary Markov model