Horizstory
Comparing trees for two different genes amounts to exchanging branches within the first gene's tree (editing it) to match the other's, and assessing the relative likelihood of this scenario. The topological comparison of phylogenetic trees is inherently difficult, as the number of possible edit paths (where one rearranges the branches of one tree to match another's) increases rapidly with the number of differences between trees. Where edits are mutually independent, they may furthermore occur in any order within an edit path (evolutionary scenario), so that the number of paths grows factorially. Without resorting to heuristics and their inherent approximations, and without constraining the types of possible edits (apart from those that are biologically impossible: LGT with one's ancestor), we are therefore limited to comparing trees that are relatively similar. However, if one abandons the assumption that trees are fully resolved [33], which is most often not the case, then many apparent differences disappear and the problem becomes more tractable. One should be concerned with explaining only robust differences between trees, which can be determined, for example, by bootstrap support in ML phylogenies.
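To give a sense of scale for the factorial growth mentioned above, the following minimal sketch counts the orderings of n mutually independent edits. The function name is ours and purely illustrative; it is not part of Horizstory.

```python
from math import factorial


def orderings_of_independent_edits(n_edits: int) -> int:
    """Number of distinct orders in which n mutually independent edits
    can be applied within a single edit path (n factorial)."""
    if n_edits < 0:
        raise ValueError("n_edits must be non-negative")
    return factorial(n_edits)


# Even a modest number of independent differences yields many orderings.
for n in (2, 5, 10):
    print(n, orderings_of_independent_edits(n))
```

With ten independent edits there are already 3,628,800 orderings of the same scenario, which is why limiting the comparison to robust differences matters.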
It is also important to compare the right trees, which amounts to a wise choice of reference tree against which test trees (typically gene-based trees) are compared. A good reference tree may be one that minimizes the overall number of differences it finds when matched with a variety of test trees; thus references based on concatenated sequences, supertrees, or genomic phylogenies should be most suitable for this purpose. Still, one might be uncertain about the exact choice of reference tree (or its optimal rooting), where alternatives are equally attractive. One could then use each candidate reference tree in turn as a reference against the set of test trees, and measure the extent of LGT predicted with each choice. This may suggest that one reference tree is better than another in this context, a result interesting on its own with respect to the issue of organismal phylogeny and the "true" tree.
Our approach to the comparison of phylogenetic trees, with the purpose of detecting lateral gene transfer (LGT) as well as determining the degree of concordance of vertical inheritance relationships, makes use of a recursive procedure of consolidation and rearrangement. Consolidation involves the simplification of identical topological features (vertical inheritance relationships) by collapsing such features [for example, a triplet-taxon relationship (("A","B"),"C") in common to both trees is collapsed to the single-taxon "((A,B),C)"]. This is followed in a second step by the collapse of compatible topological features. A compatible feature is, for example, a relationship of (("A","B"),"C") in the reference tree and an unresolved relationship of ("A","B","C") in the candidate tree, leading to "[[A,B],C]". Compatible features do not necessarily support vertical relationships, but neither do they provide evidence for lateral gene transfer. Once the pair of trees to be compared is thus simplified, each of the candidate tree's leaves is moved to every alternative node in its tree in turn, with each move being tested by consolidation. Where simplification is possible (where the topologies can further converge), the move is productive and launches another pathway of rearrangement, where further rearrangements and simplifications are tried until the pair of trees can converge to identity, or until the pathway is abandoned. A pathway is abandoned if, for example, it would require more steps than another pathway already explored.
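The first consolidation step (collapse of identical features) can be sketched as follows. This is a minimal illustration under our own assumptions, not Horizstory's actual code: trees are represented as nested Python tuples with string leaves, and the helper names `canonical`, `subtree_forms`, `collapse` and `consolidate` are hypothetical. The second step (compatible features) and the rearrangement search are not covered.

```python
def canonical(tree):
    """Order-independent string form of a nested-tuple tree."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canonical(child) for child in tree)) + ")"


def subtree_forms(tree):
    """Canonical forms of every internal node (clade) in the tree."""
    forms = set()
    if not isinstance(tree, str):
        forms.add(canonical(tree))
        for child in tree:
            forms |= subtree_forms(child)
    return forms


def collapse(tree, shared):
    """Replace each maximal shared clade by a single leaf named after it."""
    if isinstance(tree, str):
        return tree
    form = canonical(tree)
    if form in shared:
        return form
    return tuple(collapse(child, shared) for child in tree)


def consolidate(reference, candidate):
    """Collapse the topological features that are identical in both trees."""
    shared = subtree_forms(reference) & subtree_forms(candidate)
    return collapse(reference, shared), collapse(candidate, shared)


reference = ((("A", "B"), "C"), ("D", "E"))
candidate = (((("A", "B"), "C"), "D"), "E")
print(consolidate(reference, candidate))
```

In this example the shared triplet is replaced in both trees by the single leaf "((A,B),C)", leaving only the placement of D and E to be explained by rearrangement.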
We suppose that a rearrangement that can bring a pair of trees closer to one another topologically is equivalent to "undoing" an event of lateral gene transfer. The position from which a taxon had to be moved in order to make the trees more similar is taken to be the LGT donor, whereas the taxon being moved is then the recipient. Given that rearrangement reconstitutes reference topologies (vertical inheritance relationships), presumed LGT targets are disqualified from suggesting such relationships by being pruned from the trees prior to the next recursive rearrangement.
Last but not least, rearrangement pathways must go to completion in order to be reported, resulting in full convergence of the pair of trees, but some edit distances are shorter than others, and suggest a smaller number of LGT events. For reasons of evolutionary parsimony as well as computational economy, this conservative route should be preferred.
Our scenarios can include different kinds of LGT events. Events of LGT may be nested, or otherwise intertwined, needing reversal by multiple rearrangements. The clade founded by an LGT donor may have subsequently had its species membership obfuscated by later exchanges of genetic material, yielding an unnatural assemblage of nomenclatural tags (species labels) in a presumed lineage. We distinguish such intermediary groupings in our output using an asterisk, indicating an ambiguity in deducing the root taxon, that being the actual organism that served as LGT donor. We also found it necessary to indicate (using a prime mark) when an LGT event cannot be attributed directly to a clade found in the reference tree, but rather to a phantom sister of that clade. This signals a 'basal transfer', which is observed when a taxon migrates out of its own clade to sit just outside of that clade in the candidate tree. Since LGT cannot occur with one's ancestor, the best explanation in a context of LGT is that the taxon received genetic material from a sister clade which happens not to be represented among the taxa in the dataset under investigation. Other nomenclatural marks in our program's output, mentioned above, include the parenthesis indicating an identical topological feature, and the square bracket indicating a compatible topological feature. The program's output thus consists of a list of the rearrangement pathways and the consequently deduced vertical and lateral (LGT) features. The frequencies of these features are also summarized in the output for each pair of trees, and a global summary is presented for all pairs of trees analyzed.
For various reasons, a user might have a collection of trees to compare that include different taxa, or a user might more generally wish to exclude specific taxa from certain individual phylogenetic analyses. Where trees have different complements of taxa, pruning of taxa outside of the intersection of the two sets is done automatically. Where a pair of trees includes common taxa that the user wishes to exclude from analysis explicitly, he or she need only supply files listing which taxa should be excluded, either globally (applied to the reference tree and all of the candidate trees), or locally (applied to the reference tree and a candidate tree on a one-by-one basis). Pruning might be prescribed if, for example, one suspects causes other than LGT to be responsible for some of the differences between a pair of trees, such as long branch attraction. Absence of such files indicates that no pruning is desired.
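The pruning described above can be illustrated with a short sketch. It assumes trees stored as nested tuples with string leaves and an in-memory set of excluded taxa; the function names and this representation are our own assumptions, and the format of Horizstory's exclusion files is not modelled here.

```python
def leaves(tree):
    """Set of taxon labels found at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Restrict a tree to the taxa in `keep`; return None if none remain."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [p for p in (prune(child, keep) for child in tree) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # suppress a node left with a single descendant
    return tuple(kept)


def prune_pair(reference, candidate, exclude=frozenset()):
    """Keep only taxa common to both trees, minus any explicitly excluded."""
    keep = (leaves(reference) & leaves(candidate)) - set(exclude)
    return prune(reference, keep), prune(candidate, keep)


reference = ((("A", "B"), "C"), ("D", "E"))
candidate = (("A", "C"), ("B", "F"), "D")
print(prune_pair(reference, candidate))                 # automatic pruning of E and F
print(prune_pair(reference, candidate, exclude={"D"}))  # user additionally excludes D
```

Here E and F are dropped automatically because each occurs in only one tree, while D is removed only when the user asks for it.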
We have successfully tested Horizstory with many different sets of real data, including hundreds of trees with 13 species and several dozen trees with over 27 species. Simple and nested LGTs simulated by manual modifications of a reference tree were also correctly reported in the scenarios.
Lumbermill
Lumbermill is a phylogeny editor, written in Java, for drawing trees and syntheses, using Horizstory output as input. It realizes our notion of synthesis (see Fig. 2) by first representing a vertical phylogram as its backbone. Line thickness is drawn in proportion to the percentage of genes supporting a given grouping; common patterns of vertical inheritance are thus easy to identify. (The assumption here is that over the short term, vertical inheritance is the dominant pattern.) Overlaid onto this backbone are links that represent presumed LGTs, completing the synthesis.
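The idea of drawing line thickness in proportion to gene support can be sketched as below. This is an illustration only, written in Python rather than Lumbermill's Java: the representation of groupings as frozensets of taxa, the linear mapping, and the `min_width` and `max_width` values are all our assumptions and do not describe Lumbermill's actual rendering code.

```python
def clade_support(backbone_clades, gene_clade_sets):
    """Percentage of genes whose tree contains each backbone grouping."""
    n_genes = len(gene_clade_sets)
    if n_genes == 0:
        raise ValueError("at least one gene tree is required")
    return {
        clade: 100.0 * sum(clade in clades for clades in gene_clade_sets) / n_genes
        for clade in backbone_clades
    }


def line_width(percent, min_width=0.5, max_width=6.0):
    """Map a support percentage linearly onto a drawing width (assumed units)."""
    return min_width + (max_width - min_width) * percent / 100.0


backbone = [frozenset("AB"), frozenset("ABC")]
genes = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB")},
    {frozenset("AC"), frozenset("ABC")},
    {frozenset("AB")},
]
for clade, percent in clade_support(backbone, genes).items():
    print(sorted(clade), percent, line_width(percent))
```

A grouping supported by three of four genes is drawn thicker than one supported by two of four, so dominant vertical patterns stand out at a glance.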
The images of trees or syntheses are highly editable, allowing the user to change fonts and line colours to provide a customized view of the data. The order of a node's descendants can be swapped. The backbone can be rerooted on another node or even unrooted. To help describe events in the synthesis, nodes are labeled numerically by distance from the root, then alphabetically from top to bottom. In addition, each link is interactive and, when clicked on, displays information such as the proportion of genes inherited along this link, and their names.
Often, multiple equally parsimonious LGT scenarios are proposed by Horizstory in order to explain differences between a given gene tree and the reference topology. Lumbermill allows the user to restrict the display of individual LGT events to those suggested by a specified minimum fraction of the scenarios, such as 1.0 (strict consensus) or values greater than 0.5 (majority rule). One can also elect to omit the display of specific LGTs for some genes in Lumbermill, such as when the conflicting signal involving this gene is found to be due to some cause other than lateral gene transfer.
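The thresholding of LGT events across alternative scenarios can be sketched as follows. This is a hedged illustration, not Lumbermill's implementation: each scenario is assumed to be a set of (donor, recipient) pairs, and the function name `filter_events` is hypothetical. The comparison used here is "at least `min_fraction`", so a strict majority rule would need a value just above 0.5.

```python
from collections import Counter


def filter_events(scenarios, min_fraction):
    """Return events present in at least `min_fraction` of the scenarios,
    mapped to the fraction of scenarios that propose them."""
    n_scenarios = len(scenarios)
    if n_scenarios == 0:
        return {}
    # Count each event once per scenario in which it appears.
    counts = Counter(event for scenario in scenarios for event in set(scenario))
    return {
        event: count / n_scenarios
        for event, count in counts.items()
        if count / n_scenarios >= min_fraction
    }


# Three equally parsimonious scenarios for one gene (hypothetical labels).
scenarios = [
    {("X", "A"), ("Y", "B")},
    {("X", "A"), ("Z", "B")},
    {("X", "A"), ("Y", "B")},
]
print(filter_events(scenarios, 1.0))  # strict consensus
print(filter_events(scenarios, 0.5))  # events found in at least half the scenarios
```

With a threshold of 1.0 only the transfer from X to A survives, whereas a threshold of 0.5 also retains the transfer from Y to B, which appears in two of the three scenarios.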
Putative LGT events are drawn as arrows originating from a donor (indicated by a circle) and terminating at a recipient. Since the exact time at which a transfer occurred cannot be determined, the relative order of multiple arrows on any given segment is irrelevant, as is the position of an arrow on the segment. In order to avoid clutter in busy regions of the tree, we chose to extend segments to provide more room. Such artificially extended segments are drawn as dotted lines so as not to confuse them with actual branch lengths (solid lines).
When genes have apparently been inherited from a taxon missing from the reference tree, we insert a basal group in the tree where appropriate. This donor, a contemporary clade absent from the current dataset and that may or may not have since gone extinct, collects all of the LGT events originating from outside of a represented clade. It is meant as a convenient catchall, and where multiple LGTs appear to originate from such a given basal group, it is understood that different donors may actually have contributed genes independently.
Our method allows for nested LGT scenarios, where a compound donor in the evolutionary scenario for a gene is not represented as an organismal clade in the reference tree. In this case, one therefore cannot point to an actual donor for the gene at this intermediary step in the scenario, since its identity is ambiguous, appearing to parent species from different clades. Lumbermill deals with such organismal assemblages by indicating several candidate donors all leading to the same target. It is assumed that the LGT event in question involved a single donor parenting one or more species in this assemblage, or when a basal group is also indicated, a single donor related to the parent of one or more of the species in the assemblage. LGTs involving such intermediate assemblages are represented using a double-headed arrow.