Basic definitions
A rooted tree is a rooted binary tree whose leaves are labeled by species names (not necessarily uniquely). Let T be a gene tree. By L(T) we denote the set of all leaf labels (i.e., species) present in T. The root of T is denoted root(T). A node n is called internal if it has two children, which are denoted by Ch(n). A cluster of a node g, denoted by clu(g), is the set of leaf labels below g. A species tree is a rooted tree whose leaves are uniquely labeled (that is, there are no two leaves with the same species label).
Let S=〈VS,ES〉 be a species tree. For nodes a,b∈VS, by a+b we denote the least common ancestor of a and b in S. We also use the binary order relation a≤b if b is a node on the path between a and root(S) (note that a≤a). Two nodes a and b are called siblings if they are children of a+b.
For a rooted tree G, called here a rooted gene tree, and a species tree S such that L(G)⊆L(S), a least common ancestor mapping, or lca-mapping, is a function M from the nodes of G to the nodes of S such that M(g)=s if g and s are leaves with the same label, and M(g)=M(a)+M(b) if g is an internal node of G whose children are a and b.
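As an illustration of the lca-mapping, the following minimal Python sketch may help. It assumes (our choice, not the article's) that trees are encoded as nested tuples with string leaves and that a species node is identified by its cluster, so that a+b is the smallest species cluster containing both a and b. The names `species_clusters`, `lca` and `lca_mapping` are hypothetical, and every gene label is assumed to occur in S.

```python
def species_clusters(s):
    """Clusters of all nodes of species tree s in post-order (root last)."""
    if isinstance(s, str):
        return [frozenset([s])]
    left, right = species_clusters(s[0]), species_clusters(s[1])
    return left + right + [left[-1] | right[-1]]


def lca(x, y, clusters):
    """x + y: the smallest species cluster containing both x and y."""
    return min((c for c in clusters if x <= c and y <= c), key=len)


def lca_mapping(g, clusters, out):
    """Return M(g) and record (gene node, M(gene node)) pairs in `out`."""
    if isinstance(g, str):
        m = frozenset([g])  # a leaf maps to the species leaf with the same label
    else:
        m = lca(lca_mapping(g[0], clusters, out),
                lca_mapping(g[1], clusters, out), clusters)
    out.append((g, m))
    return m


S = (("a", "b"), "c")          # species tree with unique labels
G = (("a", "c"), ("a", "b"))   # rooted gene tree, label a occurs twice
mapping = []
root_image = lca_mapping(G, species_clusters(S), mapping)
```

In this example the root of G and its child (a,c) are both mapped to the root of S, while (a,b) is mapped to the species node with cluster {a,b}.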
An unrooted gene tree is a tree whose internal nodes are of degree 3 and species names label the leaves. An unrooted gene tree G can be rooted by placing the root on an edge e (that is, by subdividing edge e with a new node ρ and designating it to be the root). Such a rooting (i.e., a resulting rooted gene tree) is denoted Ge.
Cost functions
Now, we introduce several cost functions used when reconciling a rooted gene tree G and a species tree S. An internal node g∈VG is a (gene) duplication if M(g)=M(g′) for g′ a child of g. The total number of gene duplications is called the duplication cost and is denoted by D(G,S). The deep coalescence cost is defined as follows [17]: \(\text {DC}(G,S)=\sum _{a,b\textrm {\ siblings in\ }G}(|\pi (M(a),M(b))| - 1)\), where π(x,y) is the set of all nodes on the shortest path connecting x and y in S. Note that the standard definition of the deep coalescence cost function [40] simply differs by 1−|VG| from our definition, which is a constant value for a fixed G, and the results presented in this work can be easily adapted for the standard definition. Next, we define the loss cost based on the formula derived in [19]: L(G,S)=DC(G,S)+2·D(G,S)−|VG|+1 and the duplication-loss cost: DL(G,S)=D(G,S)+L(G,S).
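The four rooted costs can be evaluated directly from these definitions. The sketch below is an illustration under our own assumptions: nested-tuple trees, species nodes identified by their clusters, and a quadratic (not linear-time) implementation. The helper names are hypothetical. The term |π(M(a),M(b))|−1 is computed as the number of edges on the path between M(a) and M(b) in S.

```python
def species_clusters(s):
    """Clusters of all nodes of species tree s in post-order (root last)."""
    if isinstance(s, str):
        return [frozenset([s])]
    left, right = species_clusters(s[0]), species_clusters(s[1])
    return left + right + [left[-1] | right[-1]]


def lca(x, y, S):
    """Smallest species cluster containing both x and y."""
    return min((c for c in S if x <= c and y <= c), key=len)


def depth(c, S):
    """Number of proper ancestors of the species node with cluster c."""
    return sum(1 for d in S if c < d)


def reconcile(g, S, stats):
    """Return M(g); accumulate D, the DC sum and |V_G| in `stats`."""
    stats["nodes"] += 1
    if isinstance(g, str):
        return frozenset([g])
    ma, mb = reconcile(g[0], S, stats), reconcile(g[1], S, stats)
    m = lca(ma, mb, S)
    if m == ma or m == mb:          # M(g) = M(child): duplication
        stats["D"] += 1
    # edges on the path between M(a) and M(b), i.e. |pi(M(a), M(b))| - 1
    stats["DC"] += depth(ma, S) + depth(mb, S) - 2 * depth(m, S)
    return m


def costs(gene_tree, species_tree):
    S = species_clusters(species_tree)
    stats = {"D": 0, "DC": 0, "nodes": 0}
    reconcile(gene_tree, S, stats)
    D, DC = stats["D"], stats["DC"]
    L = DC + 2 * D - stats["nodes"] + 1
    return {"D": D, "DC": DC, "L": L, "DL": D + L}


S = (("a", "b"), "c")
G = (("a", "c"), ("a", "b"))
result = costs(G, S)   # {'D': 1, 'DC': 6, 'L': 2, 'DL': 3}
```

For a gene tree identical to S the same functions give D=0 and L=0, while DC equals |VG|−1, which reflects the constant offset from the standard deep coalescence definition noted above.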
The above cost functions can be naturally extended to unrooted gene trees. For every cost function c defined above for rooted gene trees, we define its unrooted counterpart as \(\min _{e\in E(G)} c(G_{e},S)\). For convenience, we adopt the same notation (i.e., D,DC,L,DL) to denote unrooted cost functions. The edge e such that \(G_{e}\) has the minimal cost \(c(G_{e},S)\) is called optimal (for c).
For a given rooted or unrooted gene tree G and a species tree S, c(G,S) can be computed in linear time [41, 42].
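The linear-time algorithms of [41, 42] are not reproduced here. As a plain illustration of the definition of the unrooted cost, the following sketch simply enumerates all rootings of an unrooted gene tree and takes the minimum duplication cost. The encoding is our assumption: an unrooted tree is a triple of rooted nested-tuple subtrees hanging off one internal node, and all function names are hypothetical.

```python
def species_clusters(s):
    """Clusters of all nodes of species tree s in post-order (root last)."""
    if isinstance(s, str):
        return [frozenset([s])]
    left, right = species_clusters(s[0]), species_clusters(s[1])
    return left + right + [left[-1] | right[-1]]


def lca(x, y, S):
    return min((c for c in S if x <= c and y <= c), key=len)


def dup_cost(g, S):
    """Return (M(g), number of duplications in the rooted tree g)."""
    if isinstance(g, str):
        return frozenset([g]), 0
    (ma, da), (mb, db) = dup_cost(g[0], S), dup_cost(g[1], S)
    m = lca(ma, mb, S)
    return m, da + db + (m in (ma, mb))


def rootings_below(sub, rest):
    """Yield the rootings on the edge above `sub` and on every edge inside it."""
    yield (sub, rest)
    if not isinstance(sub, str):
        left, right = sub
        yield from rootings_below(left, (right, rest))
        yield from rootings_below(right, (left, rest))


def all_rootings(u):
    """All rooted trees G_e for the unrooted tree u = (a, b, c)."""
    a, b, c = u
    yield from rootings_below(a, (b, c))
    yield from rootings_below(b, (a, c))
    yield from rootings_below(c, (a, b))


def unrooted_dup_cost(u, species_tree):
    S = species_clusters(species_tree)
    return min(dup_cost(g, S)[1] for g in all_rootings(u))


S = (("a", "b"), ("c", "d"))
G1 = (("a", "b"), "c", "d")   # agrees with S when rooted on the middle edge
G2 = (("a", "c"), "b", "d")   # conflicts with S under every rooting
```

Each edge is visited exactly once, so a tree with k leaves yields 2k−3 rootings; here G1 has unrooted duplication cost 0 and G2 has cost 1.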
Problems
We say that a species tree S is over a set of species I if L(S)=I. Let Q be a collection of unrooted gene trees \(G_{1},G_{2},\dots,G_{n}\). By L(Q) we denote the set of all species present in the trees of Q. We extend the notion of the cost function to collections of gene trees. For a given species tree S over L(Q), by c(Q,S) we denote the total cost \(\sum _{G\in Q}c(G,S)\).
Problem 1
(uMinST - Minimal Species Tree) Given a collection of unrooted gene trees Q and a cost function c, find a species tree Smin that minimizes the total cost c(Q,S) in the set of all species trees S over L(Q).
We call a species tree Smin minimal (for Q and c) and we denote the respective minimal cost by cmin(Q). We also define a simpler variant of the previous problem that does not require finding a tree Smin explicitly.
Problem 2
(uMinCC - Minimal Cost Computation) Given a collection of unrooted gene trees Q and a cost function c, compute cmin(Q).
Similarly, we define problems for the collections of rooted trees, called rMinST and rMinCC, respectively. Note that [28] provides a dynamic programming solution to rMinST and rMinCC.
Solution to rMinST and rMinCC - overview
In this section, we show how rMinCC is solved in [28]. Any internal node v of a gene or species tree determines a split A|B, where A and B are the clusters of children of v. For a collection Q of rooted gene trees let r(Q) be the multiset of all rooted splits present in trees of Q. We also set r(T) to be r({T}) for any rooted tree T. For brevity, here we present the dynamic programming formulations only for D and DC costs. For DL and L, please refer to [28].
For a collection of gene trees Q and a species s, Λ(Q,s) is the total cost contribution of the nodes from Q to a leaf of some species tree over L(Q) labeled by s. Now let X and Y be two disjoint sets of species. Then Γ(Q,X|Y) is the total cost contribution of the nodes from Q to an internal node v of some species tree over L(Q) such that the cluster of v is X∪Y and v has two children whose clusters are X and Y. Given a species leaf s, a split X|Y and a cost c∈{D,DC}:
$$\begin{array}{@{}rcl@{}} \Lambda^{c}(Q,s)& =& \sum_{q\in r(Q)}\lambda^{c}(q,s), \\ \Gamma^{c}(Q,X|Y)& =& \sum_{q\in r(Q)}\gamma^{c}(q,X|Y), \end{array} $$
where
$$\begin{aligned} \lambda^{\mathrm{D}}(A|B,s)&=\lambda^{\text{DL}}(A|B,s)=\mathbf{1}\left[A\cup B=\{s\}\right], \\ \lambda^{\text{DC}}(A|B,s)&=\mathbf{1}\left[\exists i \colon A_{i}=\{s\}\neq A_{i+1}\right] \\ \gamma^{\mathrm{D}}(A_{1}|A_{2},X_{1}|X_{2}) & = \mathbf{1}\left[A_{1}\cup A_{2}\subseteq X_{1}\cup X_{2}\wedge \exists i\colon X_{1} \nsupseteq A_{i}\nsubseteq X_{2} \right],\\ \gamma^{\text{DC}}\left(A_{1}|A_{2},X_{1}|X_{2}\right) & = \mathbf{1}\left[\exists i,j\colon X_{j} \nsupseteq A_{i}\subseteq X_{1}\cup X_{2} \nsupseteq A_{i+1} \right]. \end{aligned} $$
The above functions can be used to compute the cost as follows:
$$ c(Q,S)= \sum_{\substack{s \in L(S)}} \Lambda^{c}(Q,s)+\sum_{q \in r(S)} \Gamma^{c}(Q,q). $$
(1)
The dynamic programming solution to rMinCC from [28] is as follows:
$$\Delta^{c}(Q,Z)=\left\{\begin{array}{ll} \Lambda^{c}(Q,s) & \text{if}\ Z=\{s\},\\ \min_{X|Y \in {\text{splits}}(Z)} \left(\Delta^{c}(Q,X)+\Delta^{c}(Q,Y) +\Gamma^{c}(Q,X|Y)\right) & \text{otherwise}, \end{array}\right. $$
where splits(Z) is the set of all splits (2-partitions) of Z and the solution is given by Δc(Q,L(Q)).
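For the duplication cost, the recurrence can be sketched in Python as follows. This is a minimal illustration under our assumptions (nested-tuple rooted gene trees, memoization over frozensets, hypothetical function names), not the implementation from [28]. The functions `lam_D` and `gamma_D` transcribe λ^D and γ^D as given above.

```python
from functools import lru_cache
from itertools import combinations


def rooted_splits(t, out):
    """Append the split (A, B) of every internal node of t; return clu(t)."""
    if isinstance(t, str):
        return frozenset([t])
    a, b = rooted_splits(t[0], out), rooted_splits(t[1], out)
    out.append((a, b))
    return a | b


def lam_D(split, s):
    A, B = split
    return int(A | B == {s})


def gamma_D(split, X, Y):
    A1, A2 = split
    inside = (A1 | A2) <= (X | Y)
    # some child cluster fits neither X nor Y, so it maps to the node X|Y
    straddles = any(not Ai <= X and not Ai <= Y for Ai in (A1, A2))
    return int(inside and straddles)


def min_dup_cost(gene_trees):
    """Delta^D(Q, L(Q)) for a collection of rooted gene trees."""
    splits, species = [], frozenset()
    for t in gene_trees:
        species |= rooted_splits(t, splits)

    @lru_cache(maxsize=None)
    def delta(Z):
        if len(Z) == 1:
            (s,) = Z
            return sum(lam_D(q, s) for q in splits)
        first, *rest = sorted(Z)
        best = None
        for k in range(len(rest)):            # X holds `first`, Y is nonempty
            for extra in combinations(rest, k):
                X = frozenset((first, *extra))
                Y = Z - X
                cost = (delta(X) + delta(Y)
                        + sum(gamma_D(q, X, Y) for q in splits))
                best = cost if best is None else min(best, cost)
        return best

    return delta(species)


Q = [(("a", "b"), "c"), (("a", "c"), "b")]   # two conflicting gene trees
```

Fixing the smallest label in X enumerates every unordered split of Z exactly once, which gives the 2^{|Z|−1}−1 candidate splits per cluster.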
Unrooted reconciliation
We now summarize the previous structural results on reconciliation of an unrooted gene tree with a species tree. These results will be used for the design of our main dynamic programming solutions to uMinST and uMinCC problems.
Without loss of generality, we assume that every unrooted/rooted gene tree has at least 3 distinct labels and contains all labels of the species tree, i.e., L(G)=L(S) and |L(G)|≥3. Let G be an unrooted gene tree and S be a fixed species tree. We call the split of the root of S the top-split.
A set of species Z is simple if Z is a subset of a cluster from the top-split. Let ζ(Z) be a predicate that is true if Z is simple. Further, let G be an unrooted tree. Any internal node g of G determines a star A|B|C, where A, B and C are the leaves of three subtrees obtained from G by removing g. Note that A∪B∪C=L(S). Let \(\bar {A}=B \cup C, \bar {B}=C \cup A\) and \(\bar {C}=A \cup B\). Then, it follows from [42, 43], that given a top-split we have five disjoint types of stars (see Fig. 1) in unrooted gene trees (reordering of A, B and C may be required; e.g., the two stars in G2 from Fig. 2 are represented as a|b|c (or a|c|b) and bc|a|a in the context of the top-split a|bc):
- S1, if \(\neg \zeta (A) \wedge \zeta (\bar {A})\),
- S2, if \(\zeta (A) \wedge \zeta (\bar {A})\),
- S3, if \(\neg \zeta (A) \wedge \neg \zeta (\bar {A}) \wedge \zeta (B) \wedge \zeta (C)\),
- S4, if ¬ζ(A)∧¬ζ(B)∧¬ζ(C),
- S5, if ζ(A)∧¬ζ(B)∧¬ζ(C).
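The classification can be sketched as a small Python function. The sketch assumes that the five conditions in the list above correspond, in the order listed, to the star types S1 to S5, and that sets of species are frozensets; the names `is_simple` and `star_type` are hypothetical. Since a reordering of A, B and C may be required, all permutations are tried.

```python
from itertools import permutations


def is_simple(Z, top):
    """zeta(Z): Z is a subset of one of the two clusters of the top-split."""
    return Z <= top[0] or Z <= top[1]


def star_type(star, top):
    """Return 'S1'..'S5' for the star (A, B, C) in the context of `top`."""
    for A, B, C in permutations(star):
        zA, zB, zC = (is_simple(x, top) for x in (A, B, C))
        zAbar = is_simple(B | C, top)
        if not zA and zAbar:
            return "S1"
        if zA and zAbar:
            return "S2"
        if not zA and not zAbar and zB and zC:
            return "S3"
        if not zA and not zB and not zC:
            return "S4"
        if zA and not zB and not zC:
            return "S5"
    return None


fs = frozenset
top = (fs("ab"), fs("cd"))
examples = {
    "S1": (fs("abc"), fs("d"), fs("c")),
    "S2": (fs("ab"), fs("c"), fs("d")),
    "S3": (fs("ac"), fs("a"), fs("c")),
    "S4": (fs("ac"), fs("ad"), fs("bc")),
    "S5": (fs("a"), fs("ac"), fs("bd")),
}
```

Each example star is built by hand so that it satisfies exactly one condition under the top-split ab|cd; sets may overlap because gene tree leaves are not uniquely labeled.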
Given a species tree S, an edge of an unrooted tree G is symmetric if removing the edge from G creates two trees whose top-clusters are either both simple or both are not simple. The remaining edges are called asymmetric. E.g., all edges of S4 are symmetric, while all edges of S1 are asymmetric. We have the main theorem for unrooted reconciliation theory.
Theorem 1
(Adapted from [42–45]) If e is a symmetric edge in any star or e is the asymmetric edge from S5, then c(Ge,S)=c(G,S), i.e., e is optimal for every cost c∈{D,DC,L,DL}.
The edges satisfying the conditions from the above theorem induce a connected unrooted subtree in G called a plateau. The plateau in our article equals the DL-plateau from [42], where the c-plateau is the graph induced by the set of optimal rooting edges for the cost c. Since any optimal rooting for DL is optimal for D, DC and L [42], to solve our problems it is sufficient to focus on the plateau (or DL-plateau) rootings only. Examples are depicted in Fig. 2.
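To make the notion of a symmetric edge concrete, the sketch below lists every edge of an unrooted gene tree as a pair of label sets and tests whether both sides are simple or both are not. It is an illustration only: the triple-of-subtrees encoding of unrooted trees and the function names are our assumptions, and the sketch does not detect the asymmetric S5 edge that also belongs to the plateau.

```python
def leaves(t):
    """Set of leaf labels of a rooted nested-tuple tree."""
    if isinstance(t, str):
        return frozenset([t])
    return leaves(t[0]) | leaves(t[1])


def edge_bipartitions(u):
    """Yield (labels on one side, labels on the other) for each edge of u."""
    def walk(sub, rest):
        yield leaves(sub), rest
        if not isinstance(sub, str):
            left, right = sub
            yield from walk(left, rest | leaves(right))
            yield from walk(right, rest | leaves(left))

    a, b, c = u
    yield from walk(a, leaves(b) | leaves(c))
    yield from walk(b, leaves(a) | leaves(c))
    yield from walk(c, leaves(a) | leaves(b))


def is_simple(Z, top):
    return Z <= top[0] or Z <= top[1]


def is_symmetric(edge, top):
    side1, side2 = edge
    return is_simple(side1, top) == is_simple(side2, top)


G = (("a", "b"), "c", "d")
top = (frozenset("ab"), frozenset("cd"))
symmetric = [e for e in edge_bipartitions(G) if is_symmetric(e, top)]
```

For this tree and the top-split ab|cd, only the middle edge separating {a,b} from {c,d} is symmetric, so the plateau consists of that single edge.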
Decomposing unrooted gene trees
Before we start with the main results, we show how given a species tree (or its top-split only), an unrooted gene tree can be decomposed into two parts: one part which is of the rooted nature and the second part related to the plateau. Note that the notion of a cost function refers to D, DC, L or DL.
We start with the following observation, which follows from Theorem 1 and the definitions of the stars.
Lemma 1
The top-split of the species tree is sufficient to determine the optimal rooting of a given unrooted gene tree.
A rooted subtree of an unrooted gene tree G is a proper subtree of some rooting of G. Then, we have the following property.
Lemma 2
Let T be a rooted subtree of an unrooted gene tree G and let S be a species tree over L(G). T is a subtree in every plateau rooting of G if and only if the following conditions are satisfied:
where t=root(T).
Proof
(<=) The statement is obvious if t is a leaf. We show that no edge in T is symmetric in G if t is an internal node in G. If M(t)<root(S) then t is a center of S1 or S2 and the children of t in T are right-hand in the star. Thus, the remaining edges in T are asymmetric (only S1 can be present in T) with the star-arrows directed towards the leaves of T. A similar property holds in the second case with the difference that t is a center of S3. Thus, the edges of T are disjoint with the plateau and T is a rooted subtree in every plateau rooting of G. (=>) If t is mapped below the root of S, the tree contains only asymmetric non-plateau edges. Otherwise, t is a duplication mapped to the root, then at least one edge e′ connecting t with its child is symmetric. Thus, T is not present in the rooting placed on e′. A contradiction. □
We conclude that every unrooted gene tree G can be decomposed into two parts: a plateau G∗ and the rooted forest \(\check {G}\) obtained from G by removing the internal nodes and edges of G∗. Note that G∗ is an unrooted tree with at least one edge and \(\check {G}\) is a forest whose edges are asymmetric in G. Moreover, if \(\check {G}\) contains a tree with a non-root internal node, then this node is a center of S1. The leaves of G∗, which are also the roots of \(\check {G}\), are called border nodes. Since not every two stars can share an edge, the possible topologies of stars in a gene tree are limited. Figure 3 depicts all possible types of gene trees depending on the stars (see also Fig. 2).
Cost contribution functions
For an unrooted gene tree G let stars(G) be the multiset of all stars present in G (it is defined similarly for collections of gene trees). Here we define the cost contribution functions for our four standard costs. For every cost, we define \(\hat {\lambda }\) as the contribution of a given star to a species (i.e., a leaf of the species tree) in the context of a top-split, \(\hat {\gamma }\) as the contribution of a given star to a non-root internal node of a species tree (a split X|Y) in the context of a top-split, σ as the contribution of a given star to the root of a species tree in the context of a top-split, and ε as the cost correction depending on the gene tree. For an unrooted gene tree G, a species s and a top-split ⊤, \(\hat {\Lambda }(G,s,\top)\) is the total cost contribution of the stars (in the context of a top-split ⊤) from G to a leaf of some species tree over L(G) labeled by s. Given two disjoint sets of species X and Y and a top-split ⊤, \(\hat {\Gamma }(G,X|Y,\top)\) is the total cost contribution of the stars (in the context of a top-split ⊤) from G to an internal node v of some species tree over L(G) such that (i) the cluster of v is X∪Y and (ii) v has two children whose clusters are X and Y, respectively. Then, the total cost contributions are defined as follows:
$$\begin{array}{@{}rcl@{}} \hat{\Lambda}^{c}(G,s,\top) &=& \sum_{a \in {\text{stars}}(G)}\hat{\lambda}^{c}(a,s,\top), \\ \hat{\Gamma}^{c}(G,X|Y,\top) & =& \sum_{a \in {\text{stars}}(G)} \left\{\begin{array}{ll} \sigma^{c}(a,\top) & X|Y=\top,\\ \hat{\gamma}^{c}(a,X|Y,\top) & \text{otherwise},\\ \end{array}\right. \end{array} $$
where the contribution functions \(\hat {\lambda }^{c}, \hat {\gamma }^{c}\) and σc for c∈{D,DL,DC} are depicted below. Here, ⊤ is a top-split of some species tree S, X|Y is a split of some node from S, s is a species, G is an unrooted gene tree and \(\mathcal {A}=A|B|C\) is a star of type τ from G.
-
Duplication cost (D):
$$\begin{array}{*{20}l} \hat{\lambda}^{\mathrm{D}}(\mathcal{A},s,\top)&=\mathbf{1}[B=C=\{s\} \wedge \tau \in \{\text{S1}, \text{S2}\}] \end{array} $$
(2)
$$\begin{array}{*{20}l} \hat{\gamma}^{\mathrm{D}}(\mathcal{A},X|Y,\top)&=\gamma^{\mathrm{D}}(B,C,X|Y) \end{array} $$
(3)
$$\begin{array}{*{20}l} \epsilon^{D}(G)&=1. \end{array} $$
(4)
-
Deep Coalescence cost (DC):
$$\begin{aligned}{\hat{\lambda}^{\text{DC}}(\mathcal{A},s,\top)\,=\,\!\left\{\begin{array}{ll} \lambda^{\text{DC}}(B|C,s) & \text{if}\ \tau \in \{\text{S1, S3}\}\\ \lambda^{\text{DC}}(A|\bar{A},s) & \text{if}\ \tau=\text{S5}\\ \max \left\{ \lambda^{\text{DC}}(B|C,s), \lambda^{\text{DC}}(A|\bar{A},s) \right\} & \text{if}\ \tau=\text{S2} \wedge \neg (B\,=\,C \!\wedge |B|=1)\\ 0 & \text{otherwise} \end{array}\right.} \end{aligned} $$
$$\begin{aligned}{\hat{\gamma}^{\text{DC}}(\mathcal{A},X|Y,\top)\,=\,\! \left\{\begin{array}{ll} \gamma^{\text{DC}}(B,C,X|Y) & \tau \in \{\text{S1, S3}\}\\ \gamma^{\text{DC}}(A,\bar{A},X|Y) & \tau=\text{S5}\\ \gamma^{\text{DC}}(B,C,X|Y)+ \\ \ \ + \frac{1+[|A|=1]}{2}\gamma^{\text{DC}}(A,\bar{A},X|Y) & \tau=\text{S2}\ \text{and}\ \neg (B\,=\,C \!\wedge\! |B|\!=1)\\ 0 & \text{otherwise} \end{array}\right.} \end{aligned} $$
$$\begin{array}{*{20}l} \sigma^{\text{DC}}(\mathcal{A},\top)&=0, \\ \epsilon^{\text{DC}}(G)&=0. \end{array} $$
-
Loss and Duplication-Loss cost; for c∈{L,DL}, let ωL=2 and ωDL=3 in
$${}{\begin{aligned} \hat{\lambda}^{c}(\mathcal{A},s,\top)&=\hat{\lambda}^{\text{DC}}(\mathcal{A},s,\top)+\omega_{c}\cdot \hat{\lambda}^{\mathrm{D}}(\mathcal{A},s,\top) \\ \hat{\gamma}^{c}(\mathcal{A},X|Y,\top)&=\hat{\gamma}^{\text{DC}}(\mathcal{A},X|Y,\top) +\omega_{c}\cdot \hat{\gamma}^{\mathrm{D}}(\mathcal{A},X|Y,\top) \\ \sigma^{c}(\mathcal{A},\top)&=\sigma^{\text{DC}}(\mathcal{A},\top) +\omega_{c}\cdot \sigma^{\mathrm{D}}(\mathcal{A},\top) \\ \epsilon^{c}(G)&=1-|V_{G}|+\omega_{c} \end{aligned}} $$
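The formulas above use the root contribution σ^D, whose cases are cited as (5a)-(5d) in the proof of Theorem 2. As a reading aid, a form consistent with that proof can be written as below. This is our reconstruction from the case analysis in the proof (stars S4/S5 contribute 1, a star S2 contributes −1 or −0.5 depending on |A|, and the remaining stars contribute 0), and it should be treated as an assumption rather than as the authors' exact statement.

$$ \sigma^{\mathrm{D}}(\mathcal{A},\top)= \left\{\begin{array}{lll} 1 & \text{if}\ \tau \in \{\text{S4}, \text{S5}\}, & \text{(5a)}\\ -1 & \text{if}\ \tau=\text{S2} \wedge |A|=1, & \text{(5b)}\\ -\frac{1}{2} & \text{if}\ \tau=\text{S2} \wedge |A| \neq 1 \neq |\bar{A}|, & \text{(5c)}\\ 0 & \text{otherwise}. & \text{(5d)} \end{array}\right. $$

Under this reading, the weights for S2 mirror the factor \(\frac{1+[|A|=1]}{2}\) that appears in \(\hat{\gamma}^{\text{DC}}\): a gene tree with two stars S2 splits the correction between them unless one of them has |A|=1.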
The next theorem defines how these functions can be used to compute the unrooted cost given an unrooted gene tree and a species tree. This result is the unrooted analogue of Eq. (1).
Theorem 2
Let G be an unrooted gene tree with |L(G)|≥3 and S be a species tree over L(G). If c∈{D,DL,L,DC} is a cost function and ⊤ is the top-split of S, then
$$\begin{aligned} c(G,S) & = \sum_{s \in L(S)} \hat{\Lambda}^{c}(G,s,\top)+ \sum_{x,y \text{\ siblings in\ } S} \hat{\Gamma}^{c}(G,{\text{clu}}(x)|{\text{clu}}(y),\top)+\epsilon^{c}(G). \end{aligned} $$
Proof
Let R denote the right-hand side of the above formula. We start with the duplication cost and gene trees of type U4.
Cost D vs. U4: The plateau of G contains stars S4/S5. Then, the remaining stars have type S1/S3. We show that D(Ge,S)=R, where e is a plateau edge. Let g be an internal node of Ge, M(g)=s and, if g≠root(Ge), then g is a center of star A|B|C of type τ.
(U4.a) If s is a leaf, then g is a duplication and τ is S1 with B=C={s}. This case is incorporated in \(\hat {\lambda }^{D}\) (see (2)). Note that τ cannot be S3/S4/S5 under the assumption that |L(S)|≥3.
(U4.b) If s is an internal non-root node of S with the split X|Y, then B∪C⊆X∪Y and X∪Y is a subset of an element of ⊤. Then, τ=S1. In such a case, g is a duplication if and only if γD(B|C,X|Y)=1 (see (3), where γD was defined in the “Solution to rMinST and rMinCC - overview” section).
(U4.c) The remaining case is when s=root(S). We have three subcases: τ=S3 (border of G∗), τ∈{S4,S5} (internal of G∗) or g=root(Ge). If τ=S3, then g is a border node and, by Lemma 2, it is not a duplication. In our formula, the contribution of stars S3 is 0. If τ∈{S4,S5}, then g is a duplication. This case is accounted for in (5a). Finally, if g=root(Ge), then it is also a duplication, counted in εD(G). Note that, given a plateau with n internal nodes (or equivalently, n stars S4/S5), the total number of duplications at the root of S equals n+1 (one duplication for each star S4/S5 plus one for the root of G). Thus, our formulas count exactly the number of duplications in such a case.
This completes the proof for gene trees of the type U4.
Gene tree type U3: The proof is almost the same as in the previous case. The only difference is that the plateau consists of a single edge (n=0; no stars in the plateau). Thus, the third subcase in (U4.c) can be omitted.
Gene tree type U1-U2: Now, the gene tree contains one or two stars S2 (sharing an edge) and the remaining stars are of type S1 (if present).
(U1-2.a) If s is a leaf, then τ is S1 or S2. Then, g is a duplication and the rest is similar to (U4.a).
(U1-2.b) The same as (U4.b) plus the case when τ=S2.
(U1-2.c) If g=root(Ge), then it is not a duplication, and its contribution is 0.
Since there is no duplication mapped to root(S) if G contains S2, the above cases count every duplication present in Ge. However, the presence of εD(G)=1 which is needed for proper calculation in other cases, requires correction of the total contribution. The correction by −1 is only performed when S2 is present in G, but the difficulty is that G can have one or two such stars. This correction is embedded in (5b) and (5c) as follows. Assume that the first star S2 is A|B|C. If the second star S2 is present we assume it is \(\bar {A}|B'|C'\), where B′∪C′=A.
-
|A|=1 and G has only one star S2. Then, the star contributes −1 in (5b).
-
|A|=1 and G has two stars S2. The first star contributes −1 from (5b), while the second star contributes 0 from (5d).
-
\(|A| \neq 1 \neq |\bar {A}|\). In such a case G has two stars S2, each contributes −0.5 from (5c).
Thus, the total correction is −1 in every case. This completes the proof for the case of duplication cost.
Cost DC: The technical proof for DC is analogous to D.
Cost DL and L: Note that the formulas for DL and L are linear combinations of D and DC with the scalar addition of 1−|VG| (see “Cost functions” section). Further, the optimal rootings are shared across all costs; therefore, the proof for this case follows from D and DC proofs and the corresponding contribution formulas. □
Dynamic programming solution to uMinCC
Let Q be a collection of unrooted gene trees, c be a cost function and Z⊆L(Q). Here we extend \(\hat {\Gamma }, \hat {\Lambda }\) and ε to collections of unrooted gene trees, e.g., \(\hat {\Gamma }(Q)=\sum _{G \in Q} \hat {\Gamma }(G)\) and so on. The dynamic programming formula for the solution to the uMinCC problem is defined as follows (the superscript c is omitted):
$$\begin{array}{@{}rcl@{}} \Upsilon(Q,Z,\top) &=& \left\{\begin{array}{ll} \hat{\Lambda}(Q,s,\top) & \text{if}\ Z=\{s\},\\ \min_{X|Y \in {\text{splits}}(Z)} \Upsilon^{*}(Q,X|Y,\top) & \text{otherwise}, \end{array}\right. \\ \Upsilon^{*}(Q,X|Y,\top)&=&\Upsilon(Q,X,\top)+\Upsilon(Q,Y,\top)+\hat{\Gamma}(Q,X|Y,\top). \end{array} $$
Informally, in the above recurrence, for each top-split ⊤=T|T′ and for each Z such that Z⊆T or Z⊆T′, Υ(Q,Z,⊤) is the minimal cost contribution of the non-plateau parts of the input gene trees in the set of all species trees over L(Q) having a node v whose cluster is Z, where the contribution is only calculated for the nodes strictly descendant from v. The formula for Υ∗(Q,X|Y,⊤) extends Υ by including the contribution of the nodes whose split equals X|Y. Note that the contribution of the plateau is incorporated in Υ∗ in the special case when X|Y=⊤ (see the definition of \(\hat {\Gamma }\)).
Theorem 3
Let c be a cost function and Q be a collection of unrooted gene trees such that |L(Q)|≥3, and L(Q)=L(G) for every gene tree G from Q. Then, the solution to uMinCC is \(\min _{\top \in \text{splits}(L(Q))} \Upsilon ^{*}(Q,\top,\top)+\epsilon (Q)\).
Proof
Without loss of generality, we may assume that Q consists of one gene tree G. Given a species tree S with top-split ⊤ and having a node s with a cluster Z, by partial Z-contribution of S we denote the partial cost defined recursively as follows:
$$\begin{aligned} c_{Z}(G,S)= \left\{ \begin{array}{ll} \hat{\Lambda}^{c}(G,s,\top) & \text{if } Z=\{s\}, \\ c_{X}(G,S)+c_{Y}(G,S)+\hat{\Gamma}^{c}(G,X|Y,\top) & \text{if }X|Y \text{ is the split of \textit{s}}. \end{array}\right. \end{aligned} $$
It follows from Theorem 2, that c(G,S)=c⊤(G,S)+εc(G).
Let ⊤=T|T′. It is sufficient to prove that, for each Z⊆T or Z⊆T′, Υc(Q,Z,⊤) is the minimal partial Z-contribution in the set of all species trees over L(Q) having a node whose cluster is Z. Then, it follows that Υ∗(Q,⊤,⊤) is the minimal partial ⊤-contribution in the set of all species trees over L(Q) having a top-split ⊤.
The proof follows by induction from Theorem 2. We omit the technical details. □
Solving uMinST follows directly from the values of Υ: it is sufficient to track which partitions of the cluster Z into X and Y under a given top-split yield the minimal value. Such partitions determine clusters in optimal species trees and they can be used to infer one or all optimal species trees.
Theorem 4
(Complexity) Given a collection Q of unrooted gene trees with m leaves in total and n species, the time complexity of the dynamic programming formula is \(O(nm3^{n})\) and the space complexity is \(O(3^{n}+m)\).
Proof
We have the following identities: |stars(Q)|=m−2|Q|, and the number of splits of a set X with |X|=k equals \(\frac {1}{2}\sum _{i=1}^{k-1} \binom {k}{i} = 2^{k-1}-1\). Then, computing stars requires O(m) time, while each value of λ, γ, etc. requires O(n) time. Thus, a single \(\hat {\gamma }\) or \(\hat {\lambda }\) computation needs O(mn) time. Note that Z must be a nonempty subset of an element from ⊤ in Υ. Thus, the size of the Υ array is \(\sum _{k=1}^{n-1} \binom {n}{k}(2^{k}-1) \approx 3^{n}\). Similarly, the size of the \(\hat {\gamma }\) array is \(\sum _{k=1}^{n-1}\binom {n}{k}(2^{k-1}-1) \approx 3^{n}\). Thus, the time complexity of the algorithm is \(O(mn3^{n})\) and the space complexity is \(O(3^{n}+m)\). □
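The counting arguments in this proof can be checked numerically. The short sketch below (function names are ours) counts the splits of a k-element set by brute force and evaluates the two table sizes; the closed forms in the comments follow from the binomial theorem and show that both sums grow as 3^n.

```python
from itertools import combinations
from math import comb


def count_splits(k):
    """Brute-force number of unordered 2-partitions of a k-element set."""
    items = frozenset(range(k))
    seen = set()
    for size in range(1, k):
        for part in combinations(sorted(items), size):
            X = frozenset(part)
            seen.add(frozenset([X, items - X]))
    return len(seen)


def upsilon_table_size(n):
    # equals 3**n - 2**(n + 1) + 1
    return sum(comb(n, k) * (2**k - 1) for k in range(1, n))


def gamma_table_size(n):
    # equals (3**n - 3 * 2**n + 3) // 2
    return sum(comb(n, k) * (2**(k - 1) - 1) for k in range(1, n))
```

For example, a set of four species has 2^3−1=7 splits, and for n=3 the two tables have 12 and 3 entries, respectively.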
Now, we compare the above complexity with the naïve approach, i.e., trying all rooting variants to compute the minimum cost. We may assume that every gene tree has the size of n. Then, we have 2n−3 rooting variants of a single gene tree. As REXACT requires \(O(mn3^{n})\) time, we need \(O((2n-3)^{m}mn3^{n})\) time for the naïve algorithm.
The general case when L(G)⊆L(S)
When L(G)⊆L(S), the split needed to determine the star type can be located below the root of S. For example, the type of the star ab|c|d cannot be determined for the split abcde|f, since the set of labels of the star is a subset of an element of the split. To determine the type of a star A|B|C, the species tree split X|Y has to satisfy the following condition: L⊆X∪Y, L∩X≠∅ and L∩Y≠∅, where L=A∪B∪C. A split X|Y satisfying the above condition will be called a rooting split for L. Then, Theorem 2 can be reformulated as follows.
Theorem 5
Let G be an unrooted gene tree with |L(G)|≥3 and S be a species tree such that L(G)⊆L(S). If c∈{D,DL,L,DC} is a cost function and T is a split in S such that T is a rooting split for L(G), then
$$\begin{array}{@{}rcl@{}} c(G,S)&= &\sum_{s \in L(S)} \hat{\Lambda}^{c}(G,s,T)\\&+& \sum_{x,y \text{\ siblings in\ } S} \hat{\Gamma}^{c}(G,{\text{clu}}(x)|{\text{clu}}(y),T)+ \epsilon^{c}(G). \end{array} $$
Proof
Note that the rooting split is uniquely determined. The rest follows from the proof of Theorem 2. □
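The rooting-split condition and its uniqueness can be illustrated with a short sketch. It assumes a nested-tuple species tree and frozensets of labels; `is_rooting_split` and `find_rooting_split` are hypothetical helper names. The search descends from the root of S until the label set L is no longer contained in a single child cluster.

```python
def cluster(t):
    """Set of leaf labels of a nested-tuple tree."""
    if isinstance(t, str):
        return frozenset([t])
    return cluster(t[0]) | cluster(t[1])


def is_rooting_split(star, X, Y):
    """Check L <= X u Y, L n X != {} and L n Y != {} for L = A u B u C."""
    L = frozenset().union(*star)
    return L <= (X | Y) and bool(L & X) and bool(L & Y)


def find_rooting_split(species_tree, labels):
    """Return the unique rooting split (X, Y) for `labels`, or None."""
    if not labels <= cluster(species_tree):
        return None
    node = species_tree
    while not isinstance(node, str):
        X, Y = cluster(node[0]), cluster(node[1])
        if labels <= X:
            node = node[0]
        elif labels <= Y:
            node = node[1]
        else:
            return (X, Y)      # labels occur on both sides of this split
    return None                # a single label has no rooting split


fs = frozenset
S = ((("a", "b"), (("c", "d"), "e")), "f")
star = (fs("ab"), fs("c"), fs("d"))
```

For the star ab|c|d, the root split abcde|f fails the condition because no label of the star lies in {f}, and the search returns the split ab|cde one level below the root.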
It follows from the above theorem that the rooting split is crucial for proper computation of the cost. Therefore, the dynamic programming formula has to be modified to capture possible different rooting splits. Let τ be the set of pairs 〈a,R〉, where a is a star and R is its rooting split. Then, we have the following dynamic programming formula:
$$\Upsilon(Q,Z,\tau)= \left\{\begin{array}{ll} \sum_{\left\langle{a,R}\right\rangle \in \tau} \hat{\lambda}^{c}(a,s,R) & \text{if}\ Z=\{s\},\\ \min_{X|Y \in {\text{splits}}(Z)} \Upsilon^{*}(Q,X|Y,\tau') & \text{otherwise}, \end{array}\right. $$
$$\begin{array}{@{}rcl@{}} \Upsilon^{*}(Q,X|Y,\tau)&=&\Upsilon(Q,X,\tau)+\Upsilon(Q,Y,\tau)\\ &+&\sum_{\left\langle{a,X|Y}\right\rangle \in \tau} \sigma(a,X|Y)\\ &+&\sum_{\left\langle{a,R}\right\rangle \in \tau, R \neq X|Y} \hat{\gamma}(a,X|Y,R), \end{array} $$
where τ′=τ∪{〈a,X|Y〉: X|Y is a rooting split for a∈stars(Q)}. Now, the time complexity of the algorithm is \(O(mn3^{2n})\), as τ is an additional component with \(O(m3^{n})\) possible values.
Similarly, we can adapt the dynamic programming algorithm to solve species tree inference using the minus method, in which the cost between a gene tree G and a species tree S such that L(G)⊆L(S) is defined as c(G,S|L(G)), where S|L(G) is the species tree obtained from S by contracting the set of species in S to L(G). Then, the above dynamic programming algorithm solves the problem under the minus method setting; however, the contribution formulas require modification. For the rooted component, the formulas (λ and γ) are provided in [28], while for the plateau component (σ) they remain unchanged for all our cost functions.