Improving the performance of Bayesian phylogenetic inference under relaxed clock models.

BACKGROUND
Bayesian MCMC has become a common approach for phylogenetic inference. But the growing size of molecular sequence data sets has created a pressing need to improve the computational efficiency of Bayesian phylogenetic inference algorithms.


RESULTS
This paper develops a new algorithm to improve the efficiency of Bayesian phylogenetic inference for models that include a per-branch rate parameter. In a Markov chain Monte Carlo algorithm, the presented proposal kernel changes evolutionary rates and divergence times at the same time, under the constraint that the implied genetic distances remain constant. Specifically, the proposal operates on the divergence time of an internal node and the three adjacent branch rates. For the root of a phylogenetic tree, there are three strategies discussed, named Simple Distance, Small Pulley and Big Pulley. Note that Big Pulley is able to change the tree topology, which enables the operator to sample all the possible rooted trees consistent with the implied unrooted tree. To validate its effectiveness, a series of experiments have been performed by implementing the proposed operator in the BEAST2 software.


CONCLUSIONS
The results demonstrate that the proposed operator is able to improve the performance by giving better estimates for a given chain length and by using less running time for a given level of accuracy. Measured by effective samples per hour, use of the proposed operator results in overall mixing more efficient than the current operators in BEAST2. Especially for large data sets, the improvement is up to half an order of magnitude.


Background
Bayesian phylogenetics puts an emphasis on estimating a probability distribution over parameters of interest, including the phylogenetic tree topology and divergence times, given the data. The Metropolis-Hastings Markov chain Monte Carlo (MCMC) [1,2] algorithm has been the primary computational tool used in Bayesian phylogenetics for sampling from the posterior distribution. This paper is aimed at improving the performance of the relaxed clock model in Bayesian phylogenetic analysis.
*Correspondence: rzha419@aucklanduni.ac.nz 1 School of Computer Science, University of Auckland, Auckland, New Zealand Full list of author information is available at the end of the article Early implementations of Bayesian phylogenetic inference assumed a strict molecular clock where the evolutionary rates are the same at every branch [3]. This was the preferred method for estimating divergence times [4,5]. The introduction of relaxed molecular clocks allowed for the estimation of divergence times [6] and phylogeny [7] in the presence of rate heterogeneity among branches. Since then, the relaxed clock model has been widely applied, such as the study of Nothofagus [8] and flowering plants [9]. Many aspects of the performance and accuracy of relaxed clock models have subsequently been investigated (e.g. [10,11]).
Bayesian phylogenetic inference via MCMC is computationally intensive for large data sets. Two approaches to improve efficiency are (i) by making faster likelihood calculations, and (ii) by incorporating more effective proposal kernels. Calculating the phylogenetic likelihood is computationally expensive. Hence, researchers have tried many ways to tackle the computation burden in the likelihood calculations, such as detection of repeating sites [12], approximate methods (e.g. [13,14]) and the use of parallelisation strategies (e.g. BEAGLE [15]).
However the overall efficiency of the sampling process also depends strongly on the construction of the proposal mechanism. An effective proposal mechanism is proficient at exploring the posterior distribution, and can do so with fewer steps in the MCMC chain. Therefore fewer likelihood calculations are required, since each step in the chain that changes the tree or substitution parameters requires a likelihood calculation.
A major limitation in Bayesian MCMC analysis of phylogeny lies in the efficiency with which operators sample the tree space [16,17]. Fast and reliable estimation is dependent on a good mixture of operators, since the posterior distribution often exhibits correlations between the tree and other random variables.
In this paper, we present a novel operator that works alongside standard operators by proposing moves within a subspace of constant genetic distances. Namely, the proposed operator changes both divergence times of nodes and neighbouring branch rates so that the implied genetic distances are not changed. For time-reversible substitution models the phylogenetic likelihood will also be unchanged under this operation. The proposed operator has been implemented and tested in BEAST2 [18].

Bayesian MCMC
Let D, g and denote the data, phylogenetic time-tree and a set of evolutionary parameters respectively. The time-tree g = {E, t} consists of a directed edge graph, E, defining a rooted tree topology on a set of labelled taxa and a set of associated divergence times t (for details see e.g. [19]). The posterior probability density can be calculated using Eq. 1. It consists of prior distributions for the tree and the parameters, a phylogenetic likelihood that conveys information from data, and the posterior distribution to be inferred. These are denoted in the form of probability densities by p(g), p( ), p(D|g, ), p(g, |D) respectively. From a Bayesian perspective, the phylogenetic trees and the parameters are random variables described by a posterior probability distribution given the observed data D.
p(g, |D) = p(D|g, )p(g)p( ) However, due to the state space being high dimensional and the marginal likelihood being infeasible to calculate, MCMC is adopted to sample the posterior distribution. Specifically, MCMC algorithms construct a Markov chain whose stationary distribution is the posterior distribution p(g, |D), in such a way that the computation of the marginal likelihood p(D) is avoided.

Tree proposals
We use the term "operator" to describe an algorithm that can be used to draw a new state θ given an existing state θ = {g, } from a specific proposal kernel q(θ |θ) and also return the Hastings-Green ratio for the proposed state transition [2,20].
Standard naïve operators such as the random walk operator propose the new state θ by adding a random variate to a component of the current state θ [21]. Similarly, scale operators multiply a subset of the current state by a random scale factor [22]. They are suitable for working on a single random variable, or a single component of the model, for example the population size parameter of the coalescent tree prior. Standard operators for the tree topology and divergence times include the subtree slide operator, Wilson-balding and narrow exchange operators [19,23].
In this paper, the novelty of the proposed operators lies in maintaining the genetic distance d while changing the rate r and divergence time t. The reason is that the likelihood along one branch is constant if its distance is fixed, i.e. d = r × t, noting that the likelihood is calculated based on transition probability matrix for each branch of e Qd i , where d i is the branch length in units of substitutions per site for branch i. In this way, the joint distribution on rates and divergence times can be explored without proposing states that would adversely affect the phylogenetic likelihood.

Uncorrelated relaxed clock model
Molecular clocks model how molecular sequences evolve along branches in the phylogenetic tree, so that a time tree can be reconciled with the genetic distances between sequences. In this paper, uncorrelated relaxed clock models are adopted, where the rates are drawn independently and identically from a given prior distribution, such as the log-normal distribution [7]. As a result, the rates can vary markedly between parent and child branches.
Referring to the Bayesian framework in Eq. (1), the joint inference of evolutionary rates r and the time tree g can be obtained by the conditional distribution in Eq. 2: where p(r) is the prior for rates specified in uncorrelated relaxed clock model. In the constructed Markov chain, the While the proposed operator is introduced based on uncorrelated clock models, it could equally be applied to any other relaxed clock that applies a rate parameter to each branch, such as autocorrelated clock models [6].

Results
To validate the correctness and determine the efficiency, we conducted a series of experiments by implementing the Constant Distance operator in BEAST2 [18].
First, we perform a well-calibrated simulation study, which tests our operator alongside existing operators. Correctness was further confirmed by sampling trees from the prior distribution i.e. without data (Refer to Appendix 2 section for more details). By comparing effective sample sizes (ESS) [24] and running times, it is demonstrated that the performance is improved when including our proposed operator. Finally, the posterior correlation of rates and node times are discussed.

Well-calibrated simulation study
A well-calibrated simulation study is a powerful tool for evaluating and validating the implementation of a Bayesian model [25]. Figure 1 shows the Bayesian model used in this study, which includes the evolutionary model and the prior distributions of parameters. As is shown in the figure, the sequence alignment is simulated by a phylogenetic continuous-time Markov chain in BEAST2. It contains a substitution rate matrix given by the HKY85 [26] model and a substitution tree determined by an uncorrelated relaxed clock model and Yule model. More specifically, base frequencies π follow a Dirichlet distribution and the transition-transversion ratio κ follows a log-normal prior distribution. The distribution of node times is described in a Yule tree ψ with hyperparameter birth rate λ following a log-normal distribution. The rates r i follow a log-normal distribution with mean of 1 and standard deviation s 1 following a hyperprior distribution.
First, we sampled parameters and trees from the full model 100 times. The random parameters included: standard deviation of rates across branchess 1 , birth rate λ, base frequencies π and transition-transversion bias κ. Second, we simulated nucleotide alignments using the simulated parameters. In total, 100 data sets were simulated, each with 120 taxa. Third, we used BEAST2 with the Constant Distance operator to infer the tree and parameters from each of the 100 simulated data sets in turn. Finally, the posterior estimates of the parameters were compared with the real values that were used to simulate the corresponding sequence alignment. The comparisons are shown in Fig. 2.
These results show that the true values of the parameters are within the 95% highest posterior density (HPD) interval approximately 95% of the time (Table 1). This well-calibrated simulation study formed part of the validation of our implementation of the Constant Distance operator.

Performance comparison
To evaluate the performance of Constant Distance operator in a Bayesian phylogenetic analysis, we explored the time required to adequately sample the posterior distribution. This was achieved by examining (i) the total time The models and prior distributions to simulate the sequence data. The sequence alignment (SA) is simulated through a phylogenetic continuous-time Markov Chain (PhyloCTMC) that consists of a substitution model (HKY) and an uncorrelated relaxed clock model (UCRelaxedClockModel). The random variables in HKY model construct the mutation rate matrix (Q), including base frequencies (π = {π A , π C , π G , π T }) and kappa (κ). The time trees (ψ) and branch rates (r i for each branch i in ψ) construct the substitution tree (ST). The branch rates have a LogNormal prior with fixed mean 1 and certain standard deviation (denoted by s 1 ). And the time trees have a Yule model prior with birth rate (λ) having a LogNormal prior. The other prior distributions include a Dirichlet distributions on π, a LogNormal distribution on κ, and a LogNormal distribution on s 1 . For notations in LogNormal distributions, the uppercase letters represent the parameters in real space, and the lowercase letters represent the parameters in log space. In all the simulations, the number of taxa is fixed at 120 (n = 120) taken by BEAST2 to complete the MCMC inference (running time), and (ii) the effective sample size (ESS) of the sampled parameters. The effective sample size of a parameter is the number of effectively independent samples from the posterior distribution. Larger ESS indicates a better approximation of the marginal posterior distribution of the parameter. We used Tracer [24] to compute ESS.
For each dataset, we compared two operator configurations. 1) Using the current operators in BEAST2 to sample discrete rate categories (Category). 2) Using the Constant Distance operator to sample continuous rates specified by an uncorrelated related clock model (Cons). The Category configuration is the default setting in BEAST 2.5. We aim to compare the performance of the Constant Distance operator to that of the existing operator schedule. In each configuration, the data set was analyzed 20 times with the prior distributions and all other model specifications held constant. The details of operator weights used are given in Appendix 3.1. Each setting is benchmarked using an Intel(R) Xeon(R) Gold 6138 CPU (2.00 GHz).
We performed the analysis on two sets of simulated sequence alignment (See Appendix 3.2 for more details).
The simulated data sets both have 20 taxa but different sequence lengths, i.e. one data set containing 500 sites, the other containing 1,000 sites. Moreover, we used four real data sets to further evaluate the performance of Constant Distance operator, including a primate data set [27] and three other data sets (Anolis [28], RSV2 [29,30] and HIV-1 [31]) in BEAST2 [32].
The ESS and running time are summarised in Fig. 3 and Table 2. To be more specific, we measure the efficiency by ESS per hour, which is calculated by the ESS of parameters in one simulation divided by the running time in hours. Then we compare the efficiency of two configurations by calculating the ratios of ESS per hour for simulations in the two configurations. If the ratio is larger than 1, then ESS per hour of Cons configuration is larger than that of Category configuration. As is shown in Fig. 3, the efficiency varies in different data sets and also depends on what model is used in the analysis. For Anolis data set (29 taxa and 1456 sites), Category configuration performs better than Cons configuration, since most ratios of the parameters are slightly below the red line (which means smaller than 1). Moreover, for RSV2 (129 taxa and 629 sites) and HIV-1 data sets (117 taxa and Well-calibrated simulation study with 120 taxa. Each point is a separate simulated dataset 663 sites), some ratios of the parameters, "posterior" and "prior" in particular, are above the red line (larger than 1), which indicates that Cons configuration provides larger ESS per hour. Although there are several parameters sampled by Cons configuration having smaller ESS per hour, it should be noticed that the ratio is calculated by choosing random simulations in the two configurations (See Appendix 3.3 for more details). Additionally, it is worth noting that the efficiency is improved more obviously in simulated data set having 1000 sites, compared with the data sets having 500 sites. This indicates that the proposed operators behave better when sequence length is long. More specifically, in Primates data set (87 taxa and 19220 sites), the longer molecular sequence provides more accurate genetic distances, which leads to peaked likelihood distributions. In this circumstance, the proposed operators sample rates and node times that fit the constant genetic distances more efficiently. Table 2 lists the average running time of 20 simulations for each data set. It can be seen that Cons configuration finished simulations with a little bit more time in most cases. This is because the continuous rates have to be adjusted for a new clock standard deviation (See Appendix Section 4 for more details). Moreover, Table 2 also shows the parameter that has the smallest ESS in Category configuration, and is compare with the corresponding ESS in Cons configuration. Although the improvement in ESS is not obvious for both simulate data sets, it is noticed that ESS of the parameters are much larger in Cons configuration for all the real data sets. After calculating the ESS per hour, we conclude that Cons configuration improved the

Correlation analysis of rates and branch lengths
In this section, we conduct a pairwise comparison between rates and branch lengths in units of time. We used a data set of ratite mitochondrial genomes [33]. This data set includes 7 species of ratites and an alignment of 10767 sites. After analysing the ratites data set in BEAST2 using the Constant Distance operator, we calculated the Pearson coefficient between the rates and the times across branches to investigate the posterior correlation of these parameters.
The results are summarised in Fig. 4. Figure 4a presents the topology of the maximum clade credibility tree. We utilised the programme TreeStat2 [34] to obtain the fil-tered trees that have the same topology as the maximum clade credibility tree from the sampled trees in MCMC chain. This means the trees that have different shared common ancestors of each taxon from the reference tree are filtered out.
Afterwards, Fig. 4b shows the pairwise comparison of the 12 branch rates and 12 branch lengths (in time) on these filtered trees. As can be seen from the diagonal, the rate on one branch is negatively correlated with the length of that branch, which indicates that an older divergence time will lead to a smaller rate. This is because the primary signal in the data is genetic distance, so that there will be a range of rates and divergence times that are consistent with the genetic distances, but the products of these quantities will vary less than the individual parameters. The consequence is that there will tend to be a negative rela- tionship between rate r i and branch length l i i.e. r i = d i /l i . At the same time, there will tend to be a positive relationship between rate r i and its parent's branch length l ip , since a larger l ip leads to a smaller l i . Moreover, for cherries that share the same branch length in the tree, they will tend to have the same correlation pattern. Take ANDI and DIGI as an example. r 1 and l 1 are negatively correlated, but r 1 and l 8 are positively correlated, which is also the correlation of r 2 , l 2 and l 8 .
It is precisely this form of correlation structure in the posterior that our operator anticipates, and these correlations are the reason that our operator performs better than naive alternatives.

Sampling a fixed unrooted tree
A limiting case for the relaxed molecular clock model (and one exploited in some of our validation tests) occurs for long sequences, when the branch lengths of the unrooted Fig. 4 Correlation analysis in the ratites tree. l represents the length of a branch, that is the time difference between a parent node and a child node, The rates and branch lengths are converted into log space and then Pearson's coefficients are computed, which range from -1 to 1. Blue indicates positive correlations and red indicates negative correlations. The darker the colour, the stronger the correlation tree, in units of expected substitutions per site, become known without error. With full length genomes now available, although inferring trees from genomes involves complexities and assumptions such as a good partition scheme [35], this limiting case might be approached in some data sets. As a simple test in this paper, this gives rise to an alternative approach to analysis, where divergence times, a root position and the branch rates are random variables, and the data are a set of branch lengths in units of substitution on a known unrooted tree topology.
Previous work done by Reis and Yang [14] also tried to approximate the likelihood of such an unrooted tree in Bayesian phylogenetic inference. Similar researches in [6,13] show that these methods can account for rate changes in a relaxed clock model, but the genetic distances are not fixed, for example Stéphane Guindon used a Gibbs sampling algorithm [13]. Outside of the Bayesian MCMC formalism, least-squares criteria [36] and maximum likelihood [37,38], can also be applied to estimate substitution rates and divergence times in unrooted trees.
In this section, we investigated this approach on a fixed substitution tree reconstructed from whole mitochondrial genomes from a set of ratite species [33]. Since no uncertainty is admitted in the genetic distances and the proposed operator doesn't change the genetic distances, the phylogenetic likelihood is no longer needed and the unrooted tree becomes the data, rather than a multiple sequence alignment.
First of all, we used the ratites data set to construct an unrooted tree with PhyML 3.0 [39,40]. Figure 5a shows the unrooted tree with the genetic distances on the branches which are fixed in the subsequent relaxed clock analysis in BEAST2.
As an initial starting point, the root is assigned using the midpoint method. After that, according to the genetic distances among seven taxa and the position of the root, consistent divergence times are specified and assigned to each ancestral node, so that a valid rooted time tree is obtained. Once divergence times are determined, rates on the branches are also calculated Fig. 5 Illustration of sampling a fixed unrooted tree. In subfigure (a), the unrooted tree is obtained from the ratites data set [33] by a maximum likelihood method [39] and the labeled numbers represent genetic distances. The three unique tree topologies in (b) (c) and (d) are obtained from the sampled trees by using program TreeTraceAnalysis [41]. The branch rates and node times are summarised by using program TreeAnnotator [42]. The labeled numbers represent the posterior mean of rates on the corresponding branches. The colour of branches from green to red indicates the rates increasing from small to large, and the blue bars represent the 95% HPD of the corresponding node times so that the products match the unrooted substitution tree.
Then we used Constant Distance operator to sample a Markov chain initiated by this starting tree. The resulting posterior distribution is shown in Fig. 5b-d. As can be seen, despite that there is some uncertainty in the root position, the most probable tree in Fig. 5b is consistent with previous analyses of this data (see Figure 2 in Ref. [33]). For large data sets of long sequences, the proposed operators may prove useful to provide faster divergence time estimates based on the assumption of known unrooted topology and branch lengths in units of expected substitutions per site.

Discussion
We have demonstrated that the presented operator is valid and able to improve the efficiency of phylogenetic MCMC for relaxed clock models. The overall performance of a Bayesian phylogenetic analysis will be affected by the proportion of MCMC steps that this operator is chosen to make the proposal. In the BEAST2 software, this can be changed by modifying the relative weights operators in the operator schedule. The ideal proportion is non-trivial to determine for an arbitrary data set. In this study, we assigned equal weights on operations to all internal nodes (including the root). How to assign weights to achieve better performance is not studied in this paper, and users may assign different weights in practice. Hence, an optimal method of assigning weights still needs further investigation.
The key idea of the presented operator (to maintain the genetic distances) shows a novel direction for more efficient proposals in Bayesian phylogenetic MCMC. For example, the operations on the internal nodes, in the current study, involve one random internal node, one node time and three branch rates. If two or more nodes are selected, then more associated rates and node times can be sampled in one proposal, which may achieve even better efficiency. Another possible approach is to make small changes to the genetic distances as well. To minimise the number of changes to genetic distances, a twodimensional random draw will be used to change four parameters (one divergence time and one rate changed directly, the other two rates derived so as to minimise changes to genetic distances). What's more, it should be pointed out that Small Pulley and Big Pulley can only be applied to reversible continuous-time Markov chain models where unrooted trees can be used in inference, because these operators require the underlying unrooted tree to be unchanged. Future work could elaborate a larger class of operators along these lines.
As data sets have increased in size the impetus to improve efficiency of Bayesian phylogenetic inference algorithms has steadily increased. Besides more effec-tive proposal mechanisms within Metropolis-Hastings MCMC, completely novel approaches to Bayesian phylogenetics have also begun to get some attention. Variational methods are one alternative for approximating Bayesian posterior distributions [43]. These approaches make inference an optimisation problem and take advantage of tractable variational distributions that approximate the posterior distribution, thus decreasing the computational cost by avoiding high-dimensional integrals in MCMC sampling schemes. Recent work has investigated the potential for applying variational methods to phylogenetics [44,45]. Our improved MCMC methods provide a performance baseline for these new approaches.

Conclusions
As data sets have increased in size, the need for computational efficiency of Bayesian phylogenetic analyses has also increased. In this paper, we have discussed a new tree proposal that substantially increases the efficiency of Bayesian phylogenetic inference under a popular class of relaxed molecular clock models.
We demonstrate the correctness of this algorithm with a series of tests including a well-calibrated simulation study. Based on both simulated and real data sets, the proposed operator is more efficient than current algorithms implemented in BEAST2 for datasets with long sequences. This is a desirable property because efficiency is most important for larger datasets. The proposed operator is available for use as a package of BEAST2.

Methods
In this section, we define the Constant Distance Operator. Figure 6 illustrates the flow chart of the proposed operators. In a phylogenetic tree, the node to operate on is denoted by X and the Constant Distance Operator works differently on internal nodes and the root node. The details of the operations are introduced step by step in the following subsections. Figure 7 represents the tree (or subtree) with the node X that is randomly selected from among the internal nodes. Let g be the tree in the current state. The following steps propose a new divergence time in g and three rates in r .

Operations on internal nodes
Step 1 Identify the parent node and two child nodes of X, denoted by P, L and R respectively.
Step 2 Denote the nodes times of X, P, L and R by t X , t P , t L , t R respectively. Denote the rates on the branches above the nodes by r X , r L and r R respectively.
Step 3 Propose a new node time for X by t X ← t X + a, where a follows a Uniform distribution with a symmetric window size w, i.e. a ∼ Uniform[-w, +w], for some window size w. Make sure that the proposed time is valid, i.e. Fig. 6 The flow chart of the Constant Distance operator max{t L , t R } < t X < t P holds. Otherwise, we reject the proposal.
Step 4 Propose new rates by using Eq. 3.
Step 5 Return the Green ratio α IN (Refer to Calculating the Green Ratio in the following subsection).

Operations on the root
We present three strategies for proposing the new rates and a new divergence time for the special case when X is the root node. i) The Simple Distance operator only Fig. 7 Illustration of the operation on an internal node. The operator proposes t X , r X , r L and r R , during which d L , d R , d X are kept constant proposes a new root time. ii) Small Pulley adjusts the distances of branches on both sides of the root. iii) Big Pulley proposes a new tree topology by rearranging the root, without perturbing the unrooted tree. As is illustrated in Fig. 8a, all the operations on the root, including Big Pulley that changes the tree topology, do not change the underlying unrooted tree. For instance, no matter where the root X is (either on branch EF or AE), the operators maintain the distances (d AB , d AC , d AD , d BC , d BD , d CD ) and preserve the unrooted tree at the same time. Figure 8b, c and d show the trees that are rooted at the node X. The original tree g in the current state is shown in Fig. 8b. Similar to the operations on internal nodes, we will use the following steps to propose a new root time in g and two rates in r , as is illustrated in Fig. 8c. At the same time, the genetic distances of two branches linked to the root, i.e. d L and d R , are kept constant.

Simple distance
Step 1 Identify the child nodes of the root X, denoted by L and R. Their corresponding node times and branch rates are t X , t L , t R and r L , r R .
Step 2 Propose a new node time for the root X by t X ← t X + a, where a ∼ Uniform[-w, +w]. Make sure that t X > max{t L , t R } holds. Otherwise, we reject the proposal.
Step 3 Propose new rates for branches on both sides of the root by using Eq. 4.
Step 4 Return the Green ratio α SD .

Small pulley
In contrast to Simple Distance, Small Pulley changes genetic distances of branches on both sides of the root. As is illustrated in Fig. 8d, two new rates in r are proposed based on those in the original tree g. In order to maintain the total genetic distance d L + d R of the two branches linked to the root, after d L is proposed, d R will be adjusted simultaneously. In other words, Small Pulley keeps D = d L +d R constant. The detailed process includes the following 4 steps.
Step 1 Identify the child nodes of the root X, denoted by L and R. Their corresponding node times and branch rates are t X , t L , t R and r L , r R . The implied genetic distances of the two branches linked to the root can be calculated by: Step 2 Propose a new genetic distance for d L by adding a random number that follows a Uniform distribution, i.e. d L ← d L + b, where b ∼ Uniform[-v, +v], for some window size v. Make sure that 0 < d L < D holds. Otherwise, we reject the proposal.
Step 3 Propose new rates for branches on each side of the root: Step 4 Return the Green ratio α SP .

Big pulley
Big Pulley resamples the rates and times while maintaining the implied unrooted tree in units of genetic distance. So the genetic distances between the taxa are held constant, but the location of the root in the time tree is readjusted. Before describing the detailed steps, we introduce a method Exchange that proposes a new root position. In Fig. 9, let (i) X denote the root of tree g, (ii) C and N denote the two child nodes of X, (iii) S and M denote the two child nodes of C. The Exchange(M, N) method involves the following steps: • Swap the two nodes by pruning and regrafting, i.e.
cutting M (N) at its original position and attaching it to the original position of N (M).
Make sure that 0 < d C < D holds, where D = d C + d N . Otherwise, we reject the proposal. • The distances on the other three branches, i.e. d S , d M and d N , will be adjusted: As can be seen from the above descriptions, the method Exchange(M, N) swaps two nodes and adjusts distances (d S , d M , d N and d C ) on the four branches so as to maintain the implied genetic distances among three taxa S, M and N.
Additionally, operations in Big Pulley vary depending on the shape of phylogenetic tree. In Fig. 10, a symmetric tree is shown on the left, in which both the child nodes of the root have two child nodes, i.e. L having children H1, H2 and R having children H3, H4. But in the asymmetric tree on the right, only one of the child nodes of the root has two child nodes below it, i.e. O having children G1, G2. But the other child node Y doesn't have any child node, which is a heterochronous tip. The corresponding operations are detailed in the following two parts.
Symmetric tree For the symmetric tree in Fig. 10, the operations are illustrated in Fig. 11, after which one of the four possible trees ( 1 2 3 4 ) will be proposed. The detailed process is described in Algorithm 1.
For example, suppose we are going to propose tree 1 . After the new node times for the root X and L are proposed, we apply the method by Exchange (H1, R), so that four distances are adjusted, as follows: Finally, in this example the new rates would be updated by: Asymmetric tree For an asymmetric tree such as in Fig. 10 we would operate as illustrated in Fig. 12, in which  To give an example, assume we are going to propose tree 5 . Firstly, t X and t O are proposed in Step 3 and Step 4. Then, in Step 4, the method Exchange (G1, Y) is applied, after which the four distances are adjusted as follows: And the four rates are updated as follows: Calculating the green ratio MCMC operators must use reversible proposal distributions to satisfy the detailed balance requirements of the MCMC algorithm (Refer to Appendix section 1 for more details). Therefore, all four of our operators involve a final step of calculating the Green ratio for the proposal.
According to the third and fourth steps in the operations for internal nodes, three rates on the branches linked to the selected internal node are proposed by one random number a that is used to change the node time. There are four parameters involved in this proposal, comprised of a 3-dimensional rate space and a 1-dimensional time space. The proposed operator utilises one random number in time space and makes changes in both time and rate space, which leads to a dimensionmatching problem. To solve this dimension-matching problem, as is mentioned in Green's paper [20], it is necessary to construct the Jacobian matrix. In Eq. (12), J 1 deals with the parametric spaces before the proposal in vector IN =[ t X , r X , r L , r R ] and after the proposal in vector OUT =[ t X , r X , r L , r R ].
where the functions f 1 , f 2 , f 3 and f 4 represent how the operator makes a proposal. After substituting Eq. (3) in Eq. (12), the Green ratio for the internal nodes can be derived: where the proposal density p(−a) is equal to p(a) since the random number a is drawn from Uniform distribution. Likewise, the Green ratio for Simple Distance, Small Pulley and Big Pulley can be obtained:

the green ratio
When developing an operator for MCMC, the proposal function must be reversible. In other words, the proba-bility that the operator propose a new state from the current state is required to be equal to the probability that the proposed state goes back to current state. To be specific, let π(x) be the target probability distribution and p(x, x ) be the transition kernel in the continuous Markov chain. The reversibility condition requires that π(x)p(x, x ) = π(x )p(x , x). And an operator provides a proposal q(x, x ) with some probability α(x, x ) that the proposal is accepted. Thus, the reversibility condition is rewritten as   Considering the subspace ϕ 1 on x and subspace ϕ 2 on x , it is assumed that there is a symmetric measure on the combined parametric space ϕ = ϕ 1 × ϕ 2 , so that π(x)q(x, x ) has a density with respect to a single measure on ϕ. Then, Green suggested that the reversibility condition should be satisfied by detailed balance [20], as represented by Eq. (17). And according to Peskun' proof, it is optimal to take Eq. (18) as the acceptance probability to retain the detailed balance [46].

Algorithm 3 Calculation of μ for Big pulley
Original tree is symmetric: if the node that has been exchanged with L or R has child nodes then where A ∈ ϕ 1 and B ∈ ϕ 2 are two Borel sets. q(x, x ) denotes the probability that the operator proposes a new state x given the current state x.
where p(x , dx)/p(x, dx ) is known as the Hastings ratio. However, for operators that do not have a symmetric measure, it is necessary to include the Jacobian matrix J in order to deal with the dimension matching problem, as is discussed in Green's paper [20]. In this case, Eq. (18) is extended, as is shown in Eq. (19).
where J = ∇h(x, x ) represents a vector differential matrix of deterministic function h. α = p(x ,x) p(x,x ) |J| is defined as the Green ratio, and J ensures that the proposal have a symmetric measure on each subspace in state x and x .

calculating the green ratio for operations on internal nodes
The Constant Distance Operator firstly proposes a new time for the randomly selected internal node (Eq. (20a)), and then proposes three rates by the original distances and new node times(Eqs. (20b)∼ (20d)).

calculating the green ratio for simple distance
Simple Distance proposes two rates by using Eqs. (23b) and (23c), according the new root time in Eq. (23a). So the Jacobian matrix can be obtained as is shown in Eq. (24).
So the determinant of J 2 is calculated by Eq. (25)

Calculating the green ratio for small pulley
Small Pulley proposes a new genetic distance of a branch on one side of the root by adding a random number b, which is equal to adding a random number b to the original product of rate and time on that branch. As a result, a new rate is proposed by Eq. (26a). Similarly, a new rate on another branch is proposed by Eq. (26b), because the total distance of the two branches linked to the root should remain constant.
Then, as is illustrated in Eq. (27), the Jacobian matrix J 3 is simply obtained, which makes the determinant |J 3 | = 1.

calculating the green ratio for big pulley
Two new node times are proposed in Big Pulley. One is the root time (Eq. (28a)), the other is the node time of the child node of the root. It can be either children of the root, i.e. son and dau. So t C is used to denote the node time proposed, as is seen in Eq. (28b). In addition, the distances are adjusted by the method Exchange (M, N), dependent on which nodes are chosen. As a result, the four rates are proposed, as is shown in Eq. (28c)∼Eq. (28f) where a 1,2,3 is the random number to propose a new node time for the child node of the root. Depending on which child node is selected, the notation is different, i.e. a 1 , a 2 , a 3 . Here, to make it a general case, a x is used.
Therefore, the Jacobian matrix J 4 for the six parameters in Eq. (28) is obtained by Eq. (29). And the determinant of J 4 is calculated shown in Eq. (30).
Last but not least, due to the change of tree topology in Exchange (M, N), the probability of the proposed tree going back to the original tree p(g|g ), as well as the probability of making the proposal p(g |g), should be considered. As the ratio of p(g|g )/p(g |g) is defined as μ, the calculation of μ is detailed in the following algorithm.

sampling from the prior
In this section, we aim to validate the correctness of the proposed operators. To be more specific, we firstly run the simulations by sampling from prior distributions in BEAST2. Since the prior distributions are deterministic, we can analytically calculate the theoretical jointdistributions of sampled parameters in MCMC chains. By comparing the sampled distributions with the analytical results, we demonstrate whether the proposed operators are able to sample parameters correctly.
In Fig. 13, a tree with three taxa A, B and C (plus one internal node D, and root E) is used as a small example in the experiments in this section. In the figure, g 1 is set as the initial tree. Firstly, a LogNormal distribution is used as the rate prior in the uncorrelated relaxed clock model, given by Eq. (31).
In addition, a Coalescent model [47] with constant population size (N = 0.3) is used to describe the tree prior. Hence, for the tree in Fig. 13, the probability of node times is calculated by Eq. (32).
After the priors are specified, the distribution to sample can be exactly known, since the samples are drawn from the prior distributions. In other words, as the rates are functions of its genetic distance and times, the joint distribution to sample can be represented by Eq. (33).
where p(.) is the probability of certain rate values in the LogNormal distribution. Therefore, the whole probability can be obtained by conducting numerical integration on Eq. (33), which shows the probability distribution over all the possible values of parameters.

test the operator on internal nodes
The genetic distances, node times and rates for g 1 in Fig. 13 are given in Table 3. To test roundly, two scenarios are designed. In each scenario, the genetic distances are fixed, the node time t D starts from the initial value and will be changed by the proposed operator during the sampling process. Essentially, the proposed operator makes node D Fig. 13 The illustration of sampling from prior. g 1 is set to be the original tree where an MCMC chain starts. When testing Big Pulley, the proposed operator samples the trees among g 1 , g 2 and g 3

Using Big Pulley
For g 1 in Fig. 13, a new tree, together with the root time t E and node time of its older child t D , as well as a genetic distance d i , is proposed by Big Pulley. In this case, the initial tree g 1 will either go to g 2 or g 3 , as is shown in Fig. 13. So the samples are repeatedly drawn from the 3 trees. Besides, according to the initial settings in Table 5, the genetic distances remain unchanged during the process, i.e. d AB = 1, d AC = 1 and d BC = 1 hold. Hence, the distribution we are about to achieve can be calculated by Eq. (37).   Fig. 13. The two scenarios sample two trees with different distances specified in Table ??
The statistical measurements, i.e. mean and standard deviation, are compared in Table 6. The histograms of samples and numerical curves of d D and t E are pictured in Fig. 15c and d. It is shown that the two distributions are consistent within the acceptable error range. Therefore, Big Pulley can also give the right combinations of rates and node times, under the condition that the genetic distances among taxa are constant.

performance analysis of operators
This section provides the details of the results presented in Performance comparison section.

Operator weights
The weights on operators for the simulations when comparing efficiency are listed in Table 7. Although how to assign weights to achieve better performance is not studied in this paper, we maintain the percentage of weights on three operator class in Category and Cons configurations. But we modified some weights on the operators inside the same class, and we assigned different weights for different data sets.

Simulated data sets
We simulated two sets of sequence alignment on the same tree with 20 taxa that is shown in Fig. 16. We used HKY model as substitution model with κ = 2.4751, and the base frequencies are π = (0.21930.22680.30070.2531). In the uncorrelated relaxed clock model, the standard deviation of the branch rates (Ucldstdev) is 0.1803. The models and prior distributions are the same as is described in Fig. 1.

Efficiency measured by ESS per hour
Since we compare the efficiency based on ESS per hour using two configurations, i.e. Category and Cons, the ratio of ESS per hour is calculated by a random simulation in the two configurations, as is shown in Fig. 3. Then Table 2 lists the average running time and ESS of particular parameters in the simulations using different data sets. Here, we present the detailed running time and ESS of the simulations, which can be seen in Figs. 17, 18, 19, 20,   -: The parameter is not sampled and no operator is assigned.

Fig. 16
The tree used to simulate sequence alignment. The taxa are denoted by t1 to t20. The divergence times are drawn near the node and 21. Overall, we conclude that the proposed operators are able to provide better performance, because the figures suggest that Cons configuration requires less running time and have larger ESS for most parameters in most simulations. Especially, for those poorly estimated parameters in Category configuration, the improvement is more obvious. For data sets such as primates and simulated data with 500 sites, the running time is slightly larger in Cons configuration, but the ESS are much larger, which makes it acceptable to reduce the MCMC chain length and get the same performance.

Efficiency measured by proposals
The operators introduced in the paper utilise a random walk proposal for the new node time, which draws a random number from a uniform distribution and moves the node uniformly on the branch. However, others proposals, such as a Bactrian proposal [48] and a Beta proposal [49], assign  a specific distribution on the new node time so that it is more probable to move to a certain height on the branch, either far away from or close to its original position. This section applied Random walk proposal (the operators in this paper), Bactrian proposal and Beta proposal to the three data sets, and the results are compared to those using Category configuration.
The comparisons are shown in Figs. 22, 23, and 24. It is indicated that Beta proposal achieved worst performance in the three analysed data sets. The performance of the Constant Distance operator (Random walk) and Bactrian proposal achieved similar performance in RSV2 data set, while Bactrian proposal provided larger ESS per hour for most parameters in HIV-1 data set. Therefore, it still   needs further investigation to demonstrate the effectiveness of different proposals when analysing various data sets. Our current implementation of the operators enables users to specify which proposal style will be used in Beast2 analysis.

ucldstdevScaleOperator: a scale operator on standard deviation
It should be noted that the proposed ConstantDistance operator parameterises branch rates as continuous random variables, instead of discrete rate categories as is used  in current BEAST2 settings. In uncorrelated relaxed clock model, branch rates are assumed to have a lognormal prior distribution, where the real mean is fixed to 1 and the standard deviation (denoted by Ucldstdev) is usually sampled with a hyper prior such as gamma(α = 0.5396, β = 0.3819). When a new Ucldstdev is proposed in one state during MCMC sampling by normal operators, the probability of all rates change as well under the new log normal distribution. Therefore, the authors implemented a separate operator working on Ucldstdev, which is able to solve this problem properly.
The first step is to propose a new Ucldstdev by a scale operation, which multiplies current Ucldstdev by a random factor, as is shown in Eq. (38).
where scale = Factor+ ξ × ( 1 Factor −Factor) and ξ is a random variable from a Uniform(0, 1), Factor is a userdefined parameter to specify how bold the proposal is.
Secondly, all the branch rates are proposed based on the new Ucldstdev , given the probability of original Ucldstdev, which is calculated using Eq. (39).
where the notations cdf (·) and icdf (·) represent the cumulative and inverse cumulative density function of log normal distribution. Because of the calculation of cdf (·) and icdf (·) for each branch rates, the "Cons" configuration requires more running time than "category", as is discusses in "Performance comparison" section. However, it is acceptable as ConstantDistance operator gives larger ESS. Finally, it is important to return the corrected hastings ratio, since the proposal is associated with one random variable, Ucldstdev and (2n − 1) branch rates. As is shown in Eq. (40), the ratio includes the scale operation and rates changing under the same probability.
In the comparison of ESS for the clock standard deviation (denoted by ucld.stdev in Fig. 3), we specified a normal scale operator in "Category" configuration. In "Cons" configuration, the UcldstdevScaleOperator is used to sample the clock standard deviation of continuous rates. To avoid the concern that the difference between "Category" and "Cons" is a result of how rates are parameterised (i.e. discrete or continuous), we set another configuration where continuous rates are sampled without using the ConstantDistance operator (denoted by "NoCons" configuration). The weights of the operators in "NoCons" are the same as those in "Category" which is detailed in Table 7. We ran the analysis using the three real data sets (Anolis, RSV2 and HIV-1) and the comparison of ESS per hour between "Category", "Cons" and "NoCons" is summarised in Fig. 25. The figure shows ESS per hour in log 10 space of ucld.stdev in 20 independent MCMC chains. As can be seen, "Cons" configuration gives similar performance, comparing with "Category". This indicates UcldstdevScaleOperator works properly on continuous rates. Moreover, ESS per hour is much larger in "Cons" than in "NoCons", where both continuous rates are sampled. Therefore, the proposed operators contribute to the improved performance. However, we noticed that the rate parameterisation does have some mixing issues in MCMC chains. In the future, we will further investigate how to parameterise branch rates to get better performance when using the proposed operators.
Abbreviations MCMC: Markov chain Monte Carlo; ESS: effective sample sizes; HPD: highest posterior density; Category: Using the current operators in BEAST2 to sample discrete rate categories; Cons: Using the Constant Distance operator to sample continuous rates specified by an uncorrelated related clock model