### Definitions and notations

We define a rooted tree as a collection of clades with ages. Specifically, a tree is a strict hierarchy of clades, where each clade is a subset of the taxa, and a non-negative age is associated with each clade.

Formally, a tree *T* is a triplet $(L, \mathbb{C}, h)$, where $L = \{x_1, x_2, \dots, x_l\}$ is the set of taxa and $\mathbb{C} = \{C_1, C_2, \dots, C_n\}$ is a set of clades. Each clade $C_i \subseteq L$ is a subset of taxa, and $h : \mathbb{C} \to \mathbb{R}_{\ge 0}$ is a function assigning an age to each clade. The set $\mathbb{C}$ describes only the clade hierarchy and is referred to as the tree topology. Sometimes we shall use $c \in T$ as shorthand for $T = (L, \mathbb{C}, h)$ and $c \in \mathbb{C}$ (“clade *c* is present in tree *T*”).

To qualify as a tree, the following conditions must hold:

i. The tree contains all leaves: $\forall i : \{x_i\} \in \mathbb{C}$.

ii. The tree contains a root: $L \in \mathbb{C}$.

iii. Strict hierarchy of clades: for any two clades $C_1, C_2 \in \mathbb{C}$, either $C_1 \subset C_2$, $C_2 \subset C_1$ or $C_1 \cap C_2 = \emptyset$. (Note that $C_1 \subset C_2$ implies $C_1 \ne C_2$; otherwise we write $C_1 \subseteq C_2$.)

iv. Non-negative branches: for $C_1, C_2 \in \mathbb{C}$, $C_1 \subset C_2 \Rightarrow h(C_1) \le h(C_2)$.

For any clade *c*, the elements of the set $A(c) = \{y \in \mathbb{C} : c \subset y\}$ are called ancestors of *c*, and the minimal element $P(c)$ of $A(c)$ is the parent of *c*. Every clade except the root has a parent, and by association a branch to its parent with length $b(c) = h(P(c)) - h(c)$. For convenience, the branch length of a subset not in $\mathbb{C}$ is defined as zero. Any subset of taxa *x* has a *Most Recent Common Ancestor* in the tree, the minimal clade containing all members of *x*. Formally, $\text{ca}(x)$ is the minimal element of $\{y \in \mathbb{C} : x \subseteq y\}$. For brevity we omit the tree when the context is clear, and use $\text{ca}(x, T)$ to explicitly associate the clade with the tree *T*.

Extending the domain of $b(\cdot)$ to all taxa subsets simplifies definitions involving sets of trees with different topologies. We extend $h(\cdot)$ for the same reason and define the age of any subset $x \subseteq L$ to be the age of the common ancestor of *x*, $\bar{h}(x) = h(\text{ca}(x))$.
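The definitions above lend themselves to a compact sketch: a tree as a mapping from clades (sets of taxa) to ages. The following Python sketch is ours, not the paper's; the dict-of-frozensets representation and all names are illustrative.

```python
# A tree is a dict mapping each clade (a frozenset of taxa) to its age;
# leaves have age 0. Representation and names are illustrative only.

def ca(tree, x):
    """Most Recent Common Ancestor: minimal clade containing all of x."""
    x = frozenset(x)
    return min((c for c in tree if x <= c), key=len)

def parent(tree, c):
    """P(c): the minimal ancestor of c (None for the root)."""
    ancestors = [y for y in tree if c < y]
    return min(ancestors, key=len) if ancestors else None

def branch_length(tree, c):
    """b(c) = h(P(c)) - h(c); by convention zero for subsets not in the tree.
    The root, which has no parent, is also given length zero here."""
    if c not in tree:
        return 0.0
    p = parent(tree, c)
    return 0.0 if p is None else tree[p] - tree[c]

def h_bar(tree, x):
    """Extended age: the age of the common ancestor of any taxa subset x."""
    return tree[ca(tree, x)]

# The tree ((a,b),c) with h(ab) = 1 and h(abc) = 2:
T = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
     frozenset("ab"): 1.0, frozenset("abc"): 2.0}

print(h_bar(T, "ac"))                      # 2.0: ca({a,c}) is the root
print(branch_length(T, frozenset("ab")))   # 1.0
```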

Using $\bar{h}$, we define the *heights error*, a discrepancy score between the clade ages of $T = (L, \mathbb{C}, h)$ and a reference tree $T_{\text{ref}}$,

$$\epsilon_H(T, T_{\text{ref}}) = \sum_{c \in \mathbb{C}} \left| h(c) - \bar{h}^{\text{ref}}(c) \right|. \tag{1}$$

The heights error is the total sum of clade age errors, whether or not the clades appear in the reference tree. The age of a clade which is not in the reference tree is taken to be the age of the MRCA of the clade's taxa, which spans a larger clade in the reference tree. Note that the definition is not symmetric. Alternatively, we define the *divergence times error*, which focuses on the times lineages split from each other. The divergence time for any clade $x \subseteq L$ is the mean divergence time over all pairs of taxa in *x*. Formally, we start with the pairs of taxa which split at the clade; those are the pairs in *x* whose common ancestor is the clade,

$$D(x, T) = \left\{ \{a, b\} \subseteq x : \text{ca}(T, \{a, b\}) = x \right\}. \tag{2}$$

Now the average split time is the mean of all pair splits,

$$\tilde{h}(x, T) = \left| D(x, T) \right|^{-1} \sum_{\{a,b\} \in D(x, T)} \bar{h}(\{a, b\}). \tag{3}$$

Finally, the total error is,

$$\epsilon_D(T, T_{\text{ref}}) = \sum_{c \in T} \left| h(c) - \tilde{h}(c, T_{\text{ref}}) \right|. \tag{4}$$
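A sketch of equations (2)–(4), using the dict-of-frozensets tree representation (a convention of ours, not the paper's). We interpret the condition in equation (2) as comparing MRCAs, so that $\tilde{h}$ is also defined for clades absent from the reference tree; this reading is our assumption.

```python
from itertools import combinations

def ca(tree, x):
    """MRCA: minimal clade of `tree` containing every taxon in x."""
    x = frozenset(x)
    return min((c for c in tree if x <= c), key=len)

def split_pairs(tree, x):
    """D(x, T), eq. (2): pairs of taxa in x splitting at the clade.
    We compare against ca(tree, x) so x need not itself be a clade of tree."""
    target = ca(tree, x)
    return [p for p in combinations(sorted(x), 2) if ca(tree, p) == target]

def mean_split_time(tree, x):
    """h-tilde(x, T), eq. (3): mean age over all pairs splitting at the clade."""
    pairs = split_pairs(tree, x)
    return sum(tree[ca(tree, p)] for p in pairs) / len(pairs)

def divergence_error(tree, ref):
    """epsilon_D(T, T_ref), eq. (4); leaves are skipped (their term is
    undefined/zero, an assumption on our part)."""
    return sum(abs(age - mean_split_time(ref, c))
               for c, age in tree.items() if len(c) > 1)

# T = ((a,b),c) with ages 1, 2 and T_ref = (a,(b,c)) with ages 1, 2:
T = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
     frozenset("ab"): 1.0, frozenset("abc"): 2.0}
T_ref = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
         frozenset("bc"): 1.0, frozenset("abc"): 2.0}
print(divergence_error(T, T_ref))  # 1.0
```

Only the clade (a,b) contributes: in the reference, the pair {a,b} splits at the root (age 2), giving |1 − 2| = 1.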

The heights and divergence errors are equal for trees with the same topology, but they differ when the topologies disagree, and the difference usually increases with the topological distance.

The *clades missed error* counts the number of clades in $T_{\text{ref}}$ not present in *T*,

$$\epsilon_{\text{cm}}(T, T_{\text{ref}}) = \left| \mathbb{C}^{\text{ref}} \right| - \left| \left\{ c \in \mathbb{C} \cap \mathbb{C}^{\text{ref}} : b(c) > 0 \right\} \right|. \tag{5}$$

This number is equal to (half) the Robinson-Foulds tree distance [10] when *T* has no zero-length branches. A clade with a zero branch does not count as a match because it is potentially confused with its parent. The *clades called error* awards a −1 for each correctly called clade and a +1 penalty for each incorrectly called clade,

$$\epsilon_{\text{cc}}(T, T_{\text{ref}}) = \left| \left\{ c \in \mathbb{C} \setminus \mathbb{C}^{\text{ref}} : b(c) > 0 \right\} \right| - \left| \left\{ c \in \mathbb{C} \cap \mathbb{C}^{\text{ref}} : b(c) > 0 \right\} \right|. \tag{6}$$
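Both clade scores reduce to set operations on the clade sets. A sketch under the dict-of-frozensets convention; since the text leaves *b* undefined for the root, we treat the root as always matched (an assumption).

```python
def branch(tree, c):
    """b(c); zero when c is not a clade of the tree. The root, which has
    no parent, is treated as having a positive branch (an assumption)."""
    if c not in tree:
        return 0.0
    ancestors = [y for y in tree if c < y]
    if not ancestors:
        return float("inf")
    return tree[min(ancestors, key=len)] - tree[c]

def clades_missed(tree, ref):
    """epsilon_cm, eq. (5): clades of the reference not matched in the tree."""
    matched = [c for c in tree if c in ref and branch(tree, c) > 0]
    return len(ref) - len(matched)

def clades_called(tree, ref):
    """epsilon_cc, eq. (6): incorrectly called minus correctly called clades."""
    wrong = [c for c in tree if c not in ref and branch(tree, c) > 0]
    right = [c for c in tree if c in ref and branch(tree, c) > 0]
    return len(wrong) - len(right)

# ((a,b),c) versus the reference (a,(b,c)):
T = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
     frozenset("ab"): 1.0, frozenset("abc"): 2.0}
T_ref = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
         frozenset("bc"): 1.0, frozenset("abc"): 2.0}
print(clades_missed(T, T_ref))   # 1: the clade (b,c) is missed
print(clades_called(T, T_ref))   # -3: one wrong clade (a,b), four right
```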

A *tree set* $\mathbb{T} = \{T_1, T_2, \dots, T_k\}$ is a set of trees on shared taxa. Typically these sets are samples from a Bayesian analysis, and we define the posterior frequency $F(x)$ of $x \subseteq L$ as the fraction of times *x* is present as a clade in the trees:

$$F(x) = \frac{1}{|\mathbb{T}|} \left| \left\{ T \in \mathbb{T} : x \in T \right\} \right|. \tag{7}$$

The posterior frequency of a subset not in any of the trees is zero.
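Posterior frequency is a one-liner under the dict-of-frozensets convention; the sketch below (names are ours) counts the trees containing a given clade.

```python
def posterior_frequency(tree_set, x):
    """F(x), eq. (7): the fraction of trees containing x as a clade."""
    x = frozenset(x)
    return sum(x in t for t in tree_set) / len(tree_set)

# Two trees with topology ((a,b),c), one with (a,(b,c)):
t1 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("ab"): 1.0, frozenset("abc"): 2.0}
t2 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("ab"): 1.5, frozenset("abc"): 2.5}
t3 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("bc"): 1.0, frozenset("abc"): 2.0}
trees = [t1, t2, t3]
print(posterior_frequency(trees, "ab"))  # 2/3
print(posterior_frequency(trees, "ac"))  # 0.0 -- never appears as a clade
```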

### Distance between trees

The Rooted Branch Score (RBS) measures the distance between two rooted time trees, and is the total sum of the difference in branch lengths of matching clades. This definition is motivated by the distance between unrooted trees [11], but the space of rooted trees is more complex than its unrooted counterpart since branch lengths are not free to vary independently of each other [12]. Since by convention the branch length of a missing clade is zero, any clade present only in one tree contributes its total length to the score.

Formally, for $T_1 = (L, \mathbb{C}^1, h^1)$ and $T_2 = (L, \mathbb{C}^2, h^2)$ we have,

$$\text{RBS}(T_1, T_2) = \sum_{c \in \mathbb{C}^1 \cup \mathbb{C}^2} \left| b^{(1)}(c) - b^{(2)}(c) \right|. \tag{8}$$

The Squared Rooted Branch Score (SRBS) is similar, but takes the square of the difference instead of the absolute value,

$$\text{SRBS}(T_1, T_2) = \sum_{c \in \mathbb{C}^1 \cup \mathbb{C}^2} \left( b^{(1)}(c) - b^{(2)}(c) \right)^2. \tag{9}$$
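Both branch scores follow directly from the definitions. A sketch under the dict-of-frozensets convention, taking the root's branch as zero so it never contributes to the distance (an assumption on our part):

```python
def branch(tree, c):
    """b(c); zero for the root and for subsets not in the tree."""
    if c not in tree:
        return 0.0
    ancestors = [y for y in tree if c < y]
    return 0.0 if not ancestors else tree[min(ancestors, key=len)] - tree[c]

def rbs(t1, t2):
    """Rooted Branch Score, eq. (8)."""
    return sum(abs(branch(t1, c) - branch(t2, c)) for c in set(t1) | set(t2))

def srbs(t1, t2):
    """Squared branch score, eq. (9)."""
    return sum((branch(t1, c) - branch(t2, c)) ** 2 for c in set(t1) | set(t2))

# ((a,b),c) and (a,(b,c)), both with node ages 1 and 2:
T1 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("ab"): 1.0, frozenset("abc"): 2.0}
T2 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("bc"): 1.0, frozenset("abc"): 2.0}
print(rbs(T1, T2))   # 4.0
print(srbs(T1, T2))  # 4.0
```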

The Heights Score (HS) takes the difference between clade ages instead of branches. Like the RBS, branches appearing in only one tree are added to the sum,

$$\text{HS}(T_1, T_2) = \sum_{c \in \mathbb{C}^1 \cap \mathbb{C}^2} \left| h^{(1)}(c) - h^{(2)}(c) \right| + \sum_{c \in \mathbb{C}^1 \setminus \mathbb{C}^2} b(c) + \sum_{c \in \mathbb{C}^2 \setminus \mathbb{C}^1} b(c). \tag{10}$$

The heights score is a (non-optimal) edit distance, where the score is the total sum of a sequence of moves which transform one tree into the other. Each move involves sliding an internal node, and two nodes may “merge” into one when they meet.

The Rooted Agreement Score (RAS) measures the disagreement between branches by treating them as intervals. Two branches may be of the same length and still contribute to the distance if they span different intervals as measured from the time of the tips. The score, when divided by the sum of the lengths of the two trees, is the probability that a point chosen uniformly at random on one of the trees has no corresponding point on the other tree. Formally,

$$\text{RAS}(T_1, T_2) = \sum_{c \in \mathbb{C}^1 \cap \mathbb{C}^2} \mu\left( \vec{b}^{(1)}(c) \,\triangle\, \vec{b}^{(2)}(c) \right) + \sum_{c \in \mathbb{C}^1 \setminus \mathbb{C}^2} b(c) + \sum_{c \in \mathbb{C}^2 \setminus \mathbb{C}^1} b(c), \tag{11}$$

where $\vec{b}(x) = [h(x), h(x) + b(x)]$ is the interval spanned by the clade branch, △ is the symmetric difference operator and $\mu$ is its measure, that is,

$$\mu\left( [l_1, h_1] \,\triangle\, [l_2, h_2] \right) = (h_1 - l_1) + (h_2 - l_2) - 2 \max\left( \min(h_1, h_2) - \max(l_1, l_2),\, 0 \right). \tag{12}$$
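The measure in equation (12) is plain interval arithmetic: the total length of both intervals minus twice their overlap. A sketch:

```python
def mu_symdiff(i1, i2):
    """mu([l1,h1] symdiff [l2,h2]), eq. (12): the total length of the parts
    of the two intervals that do not overlap."""
    (l1, h1), (l2, h2) = i1, i2
    overlap = max(min(h1, h2) - max(l1, l2), 0.0)
    return (h1 - l1) + (h2 - l2) - 2.0 * overlap

print(mu_symdiff((0.0, 2.0), (1.0, 3.0)))  # 2.0: only [1,2] is shared
print(mu_symdiff((0.0, 1.0), (0.0, 1.0)))  # 0.0: identical intervals
print(mu_symdiff((0.0, 1.0), (2.0, 3.0)))  # 2.0: disjoint branches
```

Note that two branches of equal length placed at different times (the last example) still contribute their full length, as the RAS discussion above requires.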

RBS and RAS are metrics in tree space, while SRBS and HS are not. RBS is a metric since branches can be mapped to the vector space $\mathbb{R}^{2^n - 1}$ [8], and a similar argument works for RAS. However, we only require that distances are semimetrics, and make no use of the triangle inequality.

### Summary trees

#### BEAST Tree annotator

The *Tree Annotator* utility in BEAST generates a summary tree using a two-stage procedure. First, each posterior tree is assigned a score based on its topology. The *Clade Credibility* of a tree is the product of the posterior frequencies (equation (7)) of all clades in the tree,

$$CC(T = (L, \mathbb{C}, h)) = \prod_{c \in \mathbb{C}} F(c).$$

The *Maximal Clade Credibility* (MCC) tree is the tree with the highest score, and we shall refer to its topology as the MCC topology. In the second stage, each clade is assigned an age based on its ages in the posterior trees. Formally, the age is set to either the mean or the median of the set of ages

$$H(c, \mathbb{T}) = \left\{ h_i(c) : (L, \mathbb{C}_i, h_i) \in \mathbb{T} \ \text{and} \ c \in \mathbb{C}_i \right\}.$$

Since each age is set independently, the end result is not guaranteed to be a tree (condition iv). A few “negative branches” are not an unusual occurrence in trees with a medium to large number of taxa and moderate posterior uncertainty.
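The topology-selection stage can be sketched as follows; clade frequencies come from equation (7), and the dict-of-frozensets representation is our own convention.

```python
from math import prod

def clade_frequencies(tree_set):
    """F(c) for every clade observed in the set (eq. (7))."""
    n = len(tree_set)
    counts = {}
    for t in tree_set:
        for c in t:
            counts[c] = counts.get(c, 0) + 1
    return {c: k / n for c, k in counts.items()}

def mcc_tree(tree_set):
    """The posterior tree maximizing the product of clade frequencies."""
    freq = clade_frequencies(tree_set)
    return max(tree_set, key=lambda t: prod(freq[c] for c in t))

t1 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("ab"): 1.0, frozenset("abc"): 2.0}
t2 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("ab"): 1.5, frozenset("abc"): 2.5}
t3 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("bc"): 1.0, frozenset("abc"): 2.0}
best = mcc_tree([t1, t2, t3])
print(frozenset("ab") in best)  # True: the majority topology wins
```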

#### Minimum distance trees

The distance between the tree set $\mathbb{T}$ and a tree *T* is defined as the mean distance of *T* to all members of $\mathbb{T}$,

$$d(T, \mathbb{T}) = |\mathbb{T}|^{-1} \sum_{t_i \in \mathbb{T}} d_T(T, t_i), \tag{13}$$

where $d_T$ is one of the tree scores defined previously. A *Minimum Distance Tree* is a tree which minimizes $d(T, \mathbb{T})$. While the definition is simple and natural, the details are not. First, the minimal tree is not necessarily unique; there may be several, or even an infinite number of, minimal trees in some cases. Second, with anything more than a few taxa the space of trees is vast and topologically complex, so there is no guarantee of finding the minimal tree. We therefore limit the search to the topologies present in the posterior, and designate this approach by a lowercase ‘m’ followed by the distance method (mRBS, mRAS, etc.). However, even this can be time consuming when the posterior contains many topologies, so in addition we examine a family of methods which consider just a single topology, chosen using one of the heuristics outlined in the next section. The details of the algorithm for finding the best branch assignment for a specific topology are given in Appendix 2.
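Restricting the search to the posterior samples themselves gives a simple baseline: score every sampled tree by equation (13) and keep the minimizer. This sketch (ours) omits the per-topology branch re-optimization used by the full ‘m’ methods.

```python
def branch(tree, c):
    """b(c); zero for the root and for subsets not in the tree."""
    if c not in tree:
        return 0.0
    ancestors = [y for y in tree if c < y]
    return 0.0 if not ancestors else tree[min(ancestors, key=len)] - tree[c]

def rbs(t1, t2):
    """Rooted Branch Score, used here as the tree score d_T."""
    return sum(abs(branch(t1, c) - branch(t2, c)) for c in set(t1) | set(t2))

def mean_distance(tree, tree_set, score=rbs):
    """d(T, TT), eq. (13): mean distance from `tree` to the tree set."""
    return sum(score(tree, t) for t in tree_set) / len(tree_set)

def min_distance_tree(tree_set, score=rbs):
    """A candidate minimum-distance tree, searching only posterior samples."""
    return min(tree_set, key=lambda t: mean_distance(t, tree_set, score))

t1 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("ab"): 1.0, frozenset("abc"): 2.0}
t2 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("ab"): 1.5, frozenset("abc"): 2.5}
t3 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("bc"): 1.0, frozenset("abc"): 2.0}
print(min_distance_tree([t1, t2, t3]) == t1)  # True: t1 has the lowest mean RBS
```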

#### Selecting a topology

All of the two-stage methods we consider select a topology first and assign branch lengths conditional on that topology. We examined three alternatives to the MCC for selecting a topology.

The first alternative uses the recently published Conditional Clade Probability Distribution (CCD). The CCD computes a probability for each tree based upon the posterior probability of the splits in the tree, conditional on the clade posterior frequency [6]. The second is the Total Clade Branch (TCB), which assigns a score to each clade in the tree equal to the total length of matching branches in the posterior. The total length reflects the support for a clade by combining frequency (the number of trees containing the clade) with confidence, under the assumption that longer branches are more likely to be “real” than shorter ones. The third is the Highest Posterior Frequency (HPF), which picks the topology of the tree most frequent in the posterior. To break ties, the HPF picks the tree whose height is closest to the mean root height of the posterior.

#### CA Tree

Negative branches in the TreeAnnotator tree result from using a different subset of posterior trees for estimating each clade age. In the *Common Ancestor Tree* (CAT), every clade $c \in \mathbb{C}$ is assigned an age using the mean of the clade age in **all** posterior trees. Formally,

$$h(c) := |\mathbb{T}|^{-1} \sum_{T_i \in \mathbb{T}} \bar{h}(c, T_i). \tag{14}$$

The generated ages always produce a tree, since $x \subset y \Rightarrow \bar{h}(x, T_i) \le \bar{h}(y, T_i)$. Unlike TreeAnnotator, which may end up using a small number of values for some clades, CAT uses $|\mathbb{T}|$ posterior values for estimating the age of each clade.
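A sketch of equation (14): given a topology and the full tree set, each clade's age is the mean over **all** trees of its MRCA's age ($\bar{h}$ from the definitions section). The representation is our dict-of-frozensets convention.

```python
def ca(tree, x):
    """MRCA: minimal clade containing every taxon in x."""
    x = frozenset(x)
    return min((c for c in tree if x <= c), key=len)

def cat_ages(topology, tree_set):
    """Eq. (14): mean MRCA age over all posterior trees, for each clade.
    Monotonicity of the MRCA age guarantees the result is a valid tree."""
    n = len(tree_set)
    return {c: sum(t[ca(t, c)] for t in tree_set) / n for c in topology}

# Posterior: ((a,b),c) with ages 1, 2 and (a,(b,c)) with ages 1, 3.
t1 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("ab"): 1.0, frozenset("abc"): 2.0}
t2 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("bc"): 1.0, frozenset("abc"): 3.0}
topo = [frozenset("a"), frozenset("b"), frozenset("c"),
        frozenset("ab"), frozenset("abc")]
ages = cat_ages(topo, [t1, t2])
print(ages[frozenset("ab")])   # 2.0: mean of h(ab)=1 in t1 and h_bar(ab)=3 in t2
print(ages[frozenset("abc")])  # 2.5
```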

#### Taxa partitions tree

We now present the Taxa Partition (TP) tree, a single-stage method which does not commit to a particular topology before assigning ages. The TP is inspired by the tree representation described by Mau *et al.* [13]. In this representation each internal node is assigned a left/right orientation, inducing a linear order on the taxa and positioning each internal node between two tips (Figures one and two in [13]). We reverse the process by first ordering the taxa, then using the posterior to assign ages between tips, and finally reconstructing the tree topology from the ages.

For a given ordering of the taxa, each posterior tree provides ages according to its topology. A clade contributes an age only if it spans an unbroken range in the ordering. For example, for the order [a b c d], the tree (((a,b),c),d) contributes the age of (a,b) to [a | bcd], the age of ((a,b),c) to [ab | cd] and the root height to [abc | d]. The tree ((a,d),(b,c)) contributes only the age of (b,c) to [ab | cd], and (a,((d,b),c)) contributes only its root height to [a | bcd].
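The contribution rule above can be checked mechanically. Our reading (an interpretation, not spelled out in the text): a clade donates its age to the boundary between the tip ranges of its two children, provided both children occupy unbroken ranges of the ordering. The sketch below reproduces the examples in the previous paragraph under the dict-of-frozensets convention.

```python
def children(tree, c):
    """The maximal proper sub-clades of c (its two children)."""
    subs = [y for y in tree if y < c]
    return [y for y in subs if not any(y < z for z in subs)]

def contiguous(positions):
    """True if a sorted list of tip positions forms an unbroken range."""
    return positions[-1] - positions[0] + 1 == len(positions)

def split_ages(tree, order):
    """Ages this tree contributes to each between-tip boundary; boundary i
    sits between order[i] and order[i+1]."""
    pos = {t: i for i, t in enumerate(order)}
    out = {}
    for c, age in tree.items():
        if len(c) < 2:
            continue
        spans = [sorted(pos[t] for t in k) for k in children(tree, c)]
        if not all(contiguous(s) for s in spans):
            continue  # a child is broken by the ordering: no contribution
        if not contiguous(sorted(pos[t] for t in c)):
            continue  # the clade itself is broken
        left = min(spans, key=lambda s: s[0])
        out.setdefault(left[-1], []).append(age)
    return out

order = ["a", "b", "c", "d"]
# (((a,b),c),d) with ages 1, 2, 3 contributes to all three boundaries:
t1 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("d"): 0.0, frozenset("ab"): 1.0, frozenset("abc"): 2.0,
      frozenset("abcd"): 3.0}
print(split_ages(t1, order))  # {0: [1.0], 1: [2.0], 2: [3.0]}
# ((a,d),(b,c)) contributes only the age of (b,c) to [ab | cd]:
t2 = {frozenset("a"): 0.0, frozenset("b"): 0.0, frozenset("c"): 0.0,
      frozenset("d"): 0.0, frozenset("ad"): 1.0, frozenset("bc"): 1.0,
      frozenset("abcd"): 2.0}
print(split_ages(t2, order))  # {1: [1.0]}
```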

After collecting ages for all splits from the posterior, a point estimate of the height at each split is used to build the tree. The precise definitions are given in Appendix 2.

TP incorporates clade ages from competing topologies before committing to the final topology. For example, take a tree set with a mixture of two topologies, ((a,b),c) and (a,(b,c)). With the obvious ordering [a b c], TP uses all the ages in every tree, and the choice between the two topologies is determined by the ages of the [ab | c] and [a | bc] splits. If [ab | c] is higher we end up with ((a,b),c), otherwise with (a,(b,c)).

Finding an optimal ordering is hard: assigning an orientation which minimizes the distance between the taxa orders of just two trees is NP-complete [14]. We use a fast heuristic which proved effective in practice: build a distance matrix for pairs of taxa and use simple clustering to construct the ordering. The distance between taxa *a* and *b* in each tree is the size of the clade of their common ancestor, $d(a, b) = |\text{ca}(\{a, b\})|$. The overall distance is the mean of the pair distances over all posterior trees. The clustering starts with each taxon in its own group, then progressively joins the two closest groups.

### Test cases

To evaluate the various methods we generated 2000 test cases, divided into 20 groups of 100 repeats. For each case, a tree with *n* tips was drawn from the Kingman coalescent distribution [15] with population size $N_e$. All repeats shared the same *n* and $N_e$, and each group was assigned one pair from the 5×4 grid formed by $n = 8, 16, 32, 64, 128$ and $N_e = 1, 2, 4, 8$.

A sequence of length 800bp was generated for the tips of the tree, starting with an ancestral sequence at the root and mutating the sequence along the branches using the Jukes-Cantor substitution model [16] with a mutation rate of 0.005. The sequences were analyzed using BEAST-2 [17] under the same model (Jukes-Cantor and a coalescent prior with constant population size). The tree and population size were estimated, but the mutation rate was fixed at its true value. The chain was run for 2.2M steps, sampled every 2k steps; the initial 200k steps were discarded as burn-in, leaving 1000 posterior samples. These were used as input for building a summary tree by each of the methods under consideration.

The test trees contain 8 to 128 tips and range (on average) from a height of 0.01 substitutions to 0.08, or 2 to 16 million years for a nuclear mammalian gene. Sampling the posterior of such trees normally requires a longer MCMC chain, but here a relatively short one is sufficient: the data was generated under a simple model and the exact same model is used for inference, resulting in excellent mixing. Not only was the effective sample size high for all parameters, we made sure the clades were adequately sampled by running a second independent chain starting from a different seed. We then computed the maximum absolute difference between the posterior frequencies of all clades; this number was well below 5% in most settings, and around 6% for the most diffuse case (128 tips and a height of 0.01 substitutions).

The posterior for trees with 32 or more taxa was completely diffuse, with a distinct topology for each sample. Even the easiest cases ($n = 8$ and $N_e = 8$) contained between 1 and 45 distinct topologies, with a mean of 6. Also note that even when the posterior has a single topology, a method may do better than others by setting more accurate branch lengths.

Summary trees were compared using two main criteria: accuracy in estimating ages and model fit. The first criterion was broken into three related error measures: accuracy in estimating the root height, accuracy in estimating clade ages (equation 1) and accuracy in estimating divergence times (equation 4). The second criterion was also divided into three: the log-likelihood of the sequence data given the tree (tree likelihood), the log-likelihood of the tree under the coalescent (coalescent likelihood), and the overall model fit, which is the sum of the tree and coalescent likelihoods.

### How methods are ranked

The methods were compared by aggregating the results from all test cases. Let us take the root height as an example. For each test case, an error value is computed for each method by taking the absolute difference between the summary and true tree heights. Next, the methods are ranked by error using dense ranking (the 1-2-2-3 rule). Finally, the mean rank of each method is computed by averaging its rank over all 2000 tests.
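Dense ranking (the 1-2-2-3 rule) gives tied methods the same rank without leaving gaps. A sketch with hypothetical error values:

```python
def dense_rank(errors):
    """Rank methods by error; ties share a rank and no ranks are skipped
    (the 1-2-2-3 rule). `errors` maps method name -> error value."""
    distinct = sorted(set(errors.values()))
    rank_of = {e: r for r, e in enumerate(distinct, start=1)}
    return {method: rank_of[e] for method, e in errors.items()}

print(dense_rank({"A": 0.10, "B": 0.25, "C": 0.25, "D": 0.40}))
# {'A': 1, 'B': 2, 'C': 2, 'D': 3}
```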

This scoring procedure was repeated (bootstrapped) 4000 times. In each repeat, 2000 test cases are sampled with replacement from the pool of 2000 test cases, and a mean score is computed for each method. Method A was deemed better than B only if A's mean ranking was better than B's in 90% (3600) of the bootstraps. A method gets a final score of 0 (best) if no other method is better, and a score of *R*+1 if there is a better method with score *R*.

The same process is repeated using the normalized error values instead of the error rankings. The normalization takes the errors of each case and transforms them to have a mean of 0 and a variance of 1. This ranks the methods by the magnitude of the errors they make relative to other methods.

This may seem overly complex, but making a fair comparison requires extra care. The methods and error measures are correlated in both obvious and subtle ways, and multiple criteria allow for a more nuanced comparison. Ideally, the particular mix of methods should not matter: adding a duplicate (or a very close variant) of one method should not penalize the ranking of lesser methods. Using dense ranking should minimize those effects. Strong correlations exist between the test settings ($N_e$ and *n*) and the magnitude of errors, so aggregating results from the 20 groups requires some care. Rankings based on comparison alone are insensitive to those correlations, and the normalization of errors makes aggregation possible without going through the complex exercise of modeling the relations between settings. Another reason for using two rankings is that method A may be slightly better than B in (say) 60% of the cases, yet its errors in the other 40% may be large. The difference between the two ranks would alert us to such situations.

Finally, any number of test cases, 2000 included, is small when considering the size of tree space. Bootstrapping provides some confidence that the results are stable and not due to random noise.