The prevalence of terraced treescapes in analyses of phylogenetic data sets

Dobrin, Barbara H.; Zwickl, Derrick J.; Sanderson, Michael J.

doi:10.1186/s12862-018-1162-9

Research article
Open access
Published: 04 April 2018

BMC Evolutionary Biology volume 18, Article number: 46 (2018)

Abstract

Background

The pattern of data availability in a phylogenetic data set may lead to the formation of terraces, collections of equally optimal trees. Terraces can arise in tree space if trees are scored with parsimony or with partitioned, edge-unlinked maximum likelihood. Theory predicts that terraces can be large, but their prevalence in contemporary data sets has never been surveyed. We selected 26 data sets and phylogenetic trees reported in recent literature and investigated the terraces to which the trees would belong, under a common set of inference assumptions. We examined terrace size as a function of the sampling properties of the data sets, including taxon coverage density (the proportion of taxon-by-gene positions with any data present) and a measure of gene sampling “sufficiency”. We evaluated each data set in relation to the theoretical minimum gene sampling depth needed to reduce terrace size to a single tree, and explored the impact of the terraces found in replicate trees in bootstrap methods.

Results

Terraces were identified in nearly all data sets with taxon coverage densities < 0.90. They were not found, however, in high-coverage-density (i.e., ≥ 0.94) transcriptomic and genomic data sets. The terraces could be very large, and size varied inversely with taxon coverage density and with gene sampling sufficiency. Few data sets achieved a theoretical minimum gene sampling depth needed to reduce terrace size to a single tree. Terraces found during bootstrap resampling reduced overall support.

Conclusions

If certain inference assumptions apply, trees estimated from empirical data sets often belong to large terraces of equally optimal trees. Terrace size correlates to data set sampling properties. Data sets seldom include enough genes to reduce terrace size to one tree. When bootstrap replicate trees lie on a terrace, statistical support for phylogenetic hypotheses may be reduced. Although some of the published analyses surveyed were conducted with edge-linked inference models (which do not induce terraces), unlinked models have been used and advocated. The present study describes the potential impact of that inference assumption on phylogenetic inference in the context of the kinds of multigene data sets now widely assembled for large-scale tree construction.

Background

Among the methodological challenges in phylogenetic inference are those posed by missing data. Problems tied to incomplete data sets first emerged in the context of paleontological data matrices [1,2,3], from which character states may be missing because of inapplicable characters or fossil incompleteness, leading to parsimony reconstruction (used widely for morphological data sets) recovering multiple, equally good trees. A large literature (e.g., [4,5,6,7,8,9,10,11,12,13,14,15,16]) has since assessed the risks and identified advantages linked to the use of incomplete data sets for inference, and the issues remain salient in the modern phylogenetics context because few data sets are fully sampled (i.e., include data at every taxon-by-gene position). Incomplete data can be analyzed accurately [10, 12, 14, 16,17,18], but studies also find that sparse data can undermine phylogenetic accuracy [4,5,6, 8] and confidence [9, 19, 20]. Recent work shows, for example, that abundant or nonrandom missing data can bias estimates of model parameters [21]; promote the emergence of support artifacts [22, 23]; and worsen biases built into heuristic search procedures [24, 25], leading to artifactual tree search outcomes [25].

Adding to these difficulties are terraces [26, 27], collections of equally optimal trees that may arise in tree space because of the taxon coverage patterns (the pattern of gene presence/absence across taxa) in partitioned alignments, such as are commonly found in multigene data matrices. Terraces can slow tree search [26, 28] and mislead heuristic search algorithms [27]; when a tree search algorithm returns one putatively optimal tree that actually lies on a terrace, the result adds ambiguity to tree inference. The presence of terraces can also confound confidence assessment: in bootstrapping (under some conditions), replicates are more likely to return a spurious clade if the clade occurs frequently on a terrace of optimal trees; and in Bayesian assessment, long-branch bias in the presence of missing data can elevate posterior probabilities of some of the trees belonging to a terrace [27]. The latter “phantom” support phenomenon resembles the “star paradox” [29] and Bayesian long-branch repulsion effects [30] observed elsewhere.

Precise necessary and sufficient conditions for the occurrence of terraces have been described elsewhere [26, 27]. Roughly speaking they include: 1) the tree optimality criterion is parsimony or partitioned maximum likelihood (ML) and, if the latter, edge lengths are optimized independently across data partitions (i.e., the inference model is “edge-unlinked” (EUL)) [27]; and 2) each partition is sampled for fewer than the full complement of taxa. For any “parent” tree, T, having all the taxa in a data matrix, each partition of the matrix with fewer than this number of taxa sampled induces (“displays”) a subtree of T with those “missing” taxa pruned. Depending on the taxon coverage pattern, these subtrees may be compatible not only with T but with an assortment of other parent trees, each displaying the induced subtree (Fig. 1). If the optimality function is one of those cited above, scores of all parent trees will be identical [26, 27], and collectively the parent trees are called a terrace. Because terraces consist of parent trees that display the same compatible subtrees, they can be characterized using algorithms from the supertree literature; in particular, terraces can be discovered and described without the need to search tree space once the first tree, T, is found [26, 27].

All else being equal, terraces should arise more often from data sets with sparser taxon coverage, and more often when data span many taxa and few genes (as in a “tall” matrix), than the converse (as in a “wide” matrix) [31]. The increased prevalence of next-generation sequencing (NGS) sampling approaches will reduce the incompleteness of data matrices, but “gappiness” currently characterizes much large-scale phylogenetic data, for reasons including 1) the use of public sequence archives, which store disparate data sets composed of different taxa and different numbers of taxa; 2) biological [30] or methodological [32] barriers to obtaining orthologous sequences; 3) the use of shallow coverage protocols with NGS methods; and 4) loss of genes. In this paper, we investigate the terraces that would arise from 26 large data sets under the necessary inference assumptions. In particular, we investigated whether the published optimal trees – generally maximum likelihood trees – were on a terrace, and the properties of those terraces. When we reviewed the methods and models used originally to recover the trees, we found surprising variability: one author [33] conducted unpartitioned analysis, another [34] reported having used an edge-linked (EL) model, and many authors left inference model details unspecified. Of those in the latter category, some may have used linked edge-length parameters (and consequently EL models), often the default parameter setting of tree reconstruction programs. However, we were less interested in evaluating the findings of the published studies than in constructing a test bed of data sets for examining the size and diversity of terraces that would emerge under EUL inference (or parsimony). EUL models have been used in likelihood-based tree reconstruction, and may confer advantages in analysis of some time-heterogeneous data sets (see Discussion); and terraces are predicted to emerge from incompletely sampled data under EUL assumptions.

Accordingly, we evaluated the terraces that would have arisen if the reported trees had been products of EUL maximum likelihood inference (ML-EUL) or parsimony. We characterized the terraces, measuring their size and the diversity of their trees. We examined terrace properties in relation to the data availability characteristics of the data sets, including taxon coverage density and a measure of data sampling sufficiency derived from theory. When bootstrapping to obtain tree support values, each replicate tree may belong to a terrace. We used consensus methods to measure the impact of these terraces on bootstrap support. Finally, we examined terrace size as a function of a simple measure of overall data coverage, the percentage of taxon triples sampled within partitions. Because terrace formation in likelihood inference occurs with the use of EUL models, we also used the Akaike Information Criterion (AIC) [35] to identify the more suitable model (EL or EUL) for each data set in the study sample.

Methods

Concepts and definitions

Earlier articles [26, 27] provide detailed exposition of terraces and their properties. Here, we outline terrace theory in brief.

Terraces and inference models

Consider a data matrix consisting of aligned, homologous sites (these may be nucleotides or other characters) and n taxa, with the sites subdivided into k loci. We may denote the set of n taxon labels as X. Each locus corresponds to a unit such as a gene or a codon position, or perhaps to a collection of sites demarcated by some a posteriori criterion. Throughout this article, we will refer to loci variously as “loci,” “partitions,” and “genes,” without regard to the scheme used to cluster the data. If any data are present for a taxon at a locus, we consider that locus sampled for the taxon. The coverage pattern S for the data and partitioning scheme consists of the subsets of taxa Y₁,...,Y_k sampled for each of the k loci. Taxon coverage density, or just coverage density, refers to the proportion of taxon-by-locus combinations that have any data present. We also speak of a taxon coverage matrix, which differs from the coverage pattern only in that it records the presence and absence of samples at taxon-by-locus locations.
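
As an illustration of these definitions, the following minimal Python sketch builds a taxon coverage matrix and computes the coverage density from a coverage pattern. The function names and the five-taxon, two-locus example are hypothetical and are not taken from terraphy or from any data set in the study.

```python
from typing import Dict, List, Set


def coverage_matrix(taxa: List[str], pattern: List[Set[str]]) -> Dict[str, List[int]]:
    """Return a taxon-by-locus presence/absence matrix (1 = sampled, 0 = missing)."""
    return {t: [1 if t in subset else 0 for subset in pattern] for t in taxa}


def coverage_density(taxa: List[str], pattern: List[Set[str]]) -> float:
    """Proportion of taxon-by-locus cells that have any data present."""
    label_set = set(taxa)
    sampled = sum(len(subset & label_set) for subset in pattern)
    return sampled / (len(taxa) * len(pattern))


# Hypothetical label set X and coverage pattern S = (Y1, Y2).
X = ["a", "b", "c", "d", "e"]
S = [{"a", "b", "c", "d"}, {"b", "c", "d", "e"}]

matrix = coverage_matrix(X, S)    # e.g. matrix["a"] is [1, 0]
density = coverage_density(X, S)  # 8 sampled cells out of 10
```

Here the coverage pattern is the list of taxon subsets, whereas the coverage matrix records both presences and absences, mirroring the distinction drawn above.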

Given a tree T on X, each of the taxon subsets Y_j in S induces a subtree T|Y_j composed only of taxon labels in Y_j; that is, T|Y_j is the subtree of T remaining after all taxa not present in Y_j are removed. The tree T displays the set of induced subtrees T|Y₁,…,T|Y_k, and is a parent tree of T|Y₁,…,T|Y_k. Fig. 1 illustrates how more than one tree may display (i.e., be a parent tree of) a set of subtrees induced by a taxon coverage pattern: the two-locus coverage pattern Y₁, Y₂ induces the subtrees T|Y₁, T|Y₂, of which T, T’, and T” are parent trees. If the parent trees are scored with an optimality function such as parsimony or maximum likelihood, and if all parent trees score the same, collectively the parent trees are called a terrace.
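
The notion of induced subtrees can be made concrete with a small sketch. It assumes, for simplicity, rooted trees written as nested Python tuples (the study itself concerns trees in general, and terraphy uses supertree algorithms rather than this brute-force comparison). The helper names and the example trees are hypothetical.

```python
def restrict(tree, keep):
    """Return the subtree of `tree` induced by the taxon set `keep`.

    Leaves are strings and internal nodes are tuples of children. Nodes left
    with a single child are suppressed, and children are sorted so that equal
    topologies compare equal. Returns None if no taxon in `keep` is present.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (restrict(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(sorted(children, key=repr))


def displayed(tree, pattern):
    """List of induced subtrees T|Y_1, ..., T|Y_k for a coverage pattern."""
    return [restrict(tree, subset) for subset in pattern]


# Hypothetical two-locus coverage pattern in which d and e are never co-sampled.
S = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}]

T = ((("a", "b"), "c"), ("d", "e"))
T_prime = (((("a", "b"), "c"), "d"), "e")
T_other = (((("a", "b"), "d"), "c"), "e")

same_subtrees = displayed(T, S) == displayed(T_prime, S)      # both are parent trees
different_subtrees = displayed(T, S) != displayed(T_other, S)  # T_other is not
```

In this toy case T and T_prime induce identical subtrees on both loci, so under parsimony or an edge-unlinked likelihood model they would receive the same score, whereas T_other induces a different subtree on the first locus.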

If the scoring criterion is parsimony, the set of parent trees is always a terrace [26]. If the criterion is maximum likelihood (ML), the parent trees are a terrace if edge-length parameters of the inference model are optimized independently across loci [27]. In this paper, we refer to models with such parameters as edge-unlinked (EUL). An edge-linked (EL) model has a single length parameter per edge (i.e., optimization is joint across loci). A partially edge-linked (PEL) model joins edge-length parameters across loci by one or more proportionality constants. Use of an EUL model is a sufficient condition for the emergence of terraces, while optimization with a linked model (EL or PEL) is insufficient – terraces do not arise under their assumptions. No conditions apply to the rate matrix of the model, which may be defined independently or jointly across loci [27].

As noted in the Background, often we could not discern rigorously the details of the inference models used in the phylogenetic studies in our sample. In particular, authors often left unspecified the linkage type (i.e., whether optimized jointly [linked] or separately [unlinked] across partitions) assigned to edge-length parameters, and some authors may have relied on inference tool default settings. In RAxML [36], the program used most often across the sample, parameters for edge length are linked by default, implementing a model (EL) that does not induce terraces. The authors of one analysis [34] explicitly reported having used an EL model. As we have noted, we were more interested in the impact of the structure of the data than the particular inference assumptions of the published papers, and accordingly, we investigated the terraces that would have arisen had the reported trees been recovered with parsimony or with some form of ML-EUL inference model.

Defining and decisiveness

If a tree T on X is the only parent tree of a set of subtrees induced by a coverage pattern S, we say that the subtrees define T. Similarly, a coverage pattern S is said to be decisive for T if T is the only parent tree of the subtrees induced by S. Theory [31, 37] establishes necessary and sufficient conditions under which a coverage pattern achieves decisiveness. A theory of defining sets out conditions under which a set of subtrees define a tree. Here we summarize a selection of these theoretical results, described previously in [31]:

For a coverage pattern S to be decisive for all (unrooted) trees on X, it is sufficient that one locus is fully sampled (i.e., for every label in X). This condition follows trivially from a condition (which we do not describe here) applying to the distribution of taxon quadruples among label subsets in S.
For a coverage pattern S to be decisive for all (unrooted) trees on X, it is necessary that every triple of taxa (set of 3 taxa) is present (i.e., sampled or observed) in at least one of the taxon subsets in S.

The latter result suggests intuitively that the distribution of triples in a coverage pattern, and the number of parent trees that can be constructed from its induced subtrees, may be empirically correlated. Sanderson et al. 2010 [31] speculated that the percentage of observed taxon triples might indeed predict the impact of a given quantity of missing data. Further theory developed in [31, 38] similarly suggests such a relationship. We state here one such further result, given in [38]:
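
The two conditions above, and the observed triple percentage discussed in this paragraph, can be checked directly from a coverage pattern. The sketch below is a brute-force illustration with hypothetical function names; enumerating all triples is practical only for small label sets.

```python
from itertools import combinations
from math import comb


def has_fully_sampled_locus(taxa, pattern):
    """Sufficient condition for decisiveness: some locus is sampled for every taxon."""
    label_set = set(taxa)
    return any(label_set <= subset for subset in pattern)


def observed_triple_counts(taxa, pattern):
    """Return (observed, total) counts of taxon triples across the coverage pattern."""
    label_set = set(taxa)
    observed = set()
    for subset in pattern:
        observed.update(combinations(sorted(subset & label_set), 3))
    return len(observed), comb(len(label_set), 3)


# Hypothetical example: taxa a and e are never sampled for the same locus.
X = ["a", "b", "c", "d", "e"]
S = [{"a", "b", "c", "d"}, {"b", "c", "d", "e"}]

observed, total = observed_triple_counts(X, S)
triple_fraction = observed / total
all_triples_covered = observed == total  # necessary condition for decisiveness
```

In this example the three triples containing both a and e are unobserved, so the necessary condition fails and the pattern cannot be decisive for all trees on X.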

Given a rooted tree T and a coverage pattern S, the set of induced subtrees T|Y₁,…,T|Y_k defines T if every edge of T is distinguished by some rooted triplet from T|Y₁,…,T|Y_k. To describe the concept of distinguishing informally, let π be a leaf taxon whose incident edge subtends the root of T, but which is not found in X (i.e., the label set of T); let a, b, and c be taxa belonging to X. The rooted triplet a|bc distinguishes an edge e of T if each taxon in the set {π,a,b,c} has one label found in each subtree in T whose roots are adjacent to e, and e corresponds to the edge of the resolved quartet πa|bc.

Whether a taxon triple is associated with a distinguishing triplet depends on the shape of T, but taxon triple percentage can be thought of as a (numerically smaller) proxy for the proportion of edges of T distinguished by rooted triplets. Edges not fixed by induced subtrees can be broken and their subtended partial trees placed elsewhere, forming equally optimal alternative topologies.

Terrace discovery and analysis

Selection and preparation of empirical data sets

From recent phylogenetics literature, we selected 13 multi-locus data sets, each consisting of at least 7 loci and at least 95 taxa [33, 34, 39,40,41,42,43,44,45,46,47,48]. From the largest of these, the ~33,000-taxon vascular plant “megamatrix” of Zanne et al. [42], we extracted 13 disjoint data subsets, each corresponding to a named genus or family, and each including (with one exception) at least 95 taxa. Some of these data subsets contained fewer than the 7 loci present in the megamatrix. Across all data sets (including vascular plant subsets), the number of taxa ranged from 57 to 7000, the number of loci from 5 to 1122, and the number of aligned sites from 5054 to 504,850. Taxon coverage densities ranged from 0.06 to 0.98 (Table 1). Of the studies selected, all but two reported maximum likelihood trees. We explored the terraces (if present) associated with these trees, characterizing terraces as they would have arisen had the published trees been products of parsimony or ML-EUL. To analyze the data set of [44], we used the published maximum clade credibility (MCC) Bayesian tree. To analyze the data set of [34], we used the published partitioning scheme and a tree that we constructed ourselves from the aligned data using parsimony heuristic search in PAUP [49]. The authors of [34] reported a tree estimated from the data (with an edge-linked (EL) model), but no machine-readable copy of the tree accompanied the article. For each data subsample of the plant megamatrix, we extracted the corresponding subtree from the ~33,000-taxon megaphylogeny. Polytomies were absent from all trees except that reported by [33].

Table 1 Data set profiles and results of terrace and decisiveness analyses

Several of the published data alignments included sequences for taxa not found in the accompanying trees. We deleted these taxa; consequently, some taxon counts in our experimental data sets differ from the published counts. We also deleted a small number of additional taxa (three or fewer across all data sets) when we encountered difficulties processing their sequence data into the format required for terrace analysis. All final data alignments, partitioning schemes, and trees analyzed for this study have been posted on the GitHub website.

Discovering and characterizing terraces

We used the Python program ‘terraphy’ [50], written by DJZ, to discover and characterize the terraces. Terraphy accepts as input a data matrix of aligned sites, a partition scheme, and a tree. It computes the taxon coverage matrix for the alignment and partitioning scheme, the size of the terrace to which the tree belongs, and the strict and Adams (BUILD) [51] consensus trees of the trees on the terrace. To compute terrace size, terraphy uses the Constantinescu & Sankoff [52] supertree algorithm, created to construct the full set of parent trees of a group of compatible input trees. To compute the Adams consensus tree, the program uses the BUILD algorithm of Aho et al. [51]. To construct the strict consensus tree, the program relies on algorithms of Constantinescu & Sankoff [52] and Steel [38]. The scaling properties of these operations have been described in [26, 27]. The terraphy package also includes functionality to: 1) construct and output samples of trees from a given terrace, 2) determine whether two trees (found in bootstrap replicates, for example) belong to the same terrace, and 3) report the number of equally good subtree resolutions within each clade in the strict consensus tree of a terrace.

Terraphy treats input tree polytomies as “soft” or irresolvable. When the program receives a nonbinary tree, it evaluates the terraces of the alternative polytomy resolutions, and its output is the sum of tree counts from those terraces. The impact of polytomies on terrace tree counts is minimally relevant to this study because all but one of the trees we examined were binary.

Variability among trees on terraces

To describe the diversity of trees on the terraces, we constructed the strict consensus tree of each terrace and calculated its resolution, ρ (defined as the ratio of the number of a tree’s bipartitions to the number of bipartitions of a fully resolved (binary) tree of the same size).
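
A minimal sketch of the resolution calculation follows. It assumes a tree written as nested Python tuples and treated as unrooted, counts only nontrivial bipartitions, and uses the fact that a binary unrooted tree on n taxa has n − 3 of them. The function names and example trees are hypothetical.

```python
def leaf_set(tree):
    """Set of taxon labels below a node (leaves are strings, nodes are tuples)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Nontrivial bipartitions of the tree, each stored as the side lacking a reference taxon."""
    all_taxa = leaf_set(tree)
    reference = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        side = all_taxa - clade if reference in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(all_taxa) - 2:
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def resolution(tree):
    """Ratio of observed bipartitions to those of a binary tree (requires n > 3)."""
    n = len(leaf_set(tree))
    return len(bipartitions(tree)) / (n - 3)


binary_tree = ((("a", "b"), "c"), ("d", "e"))
strict_consensus_tree = (("a", "b"), "c", "d", "e")  # one resolved clade only

rho_binary = resolution(binary_tree)
rho_consensus = resolution(strict_consensus_tree)
```

Storing each bipartition as the side that lacks a fixed reference taxon prevents the two edges adjacent to a bifurcating root from being counted twice.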

Number of loci that must be sampled to ensure decisiveness; “gene sampling sufficiency”

A probability model of random taxon sampling, described in [31, 53], predicts the lower bound on the number of loci, k_min, that would need to be sampled to guarantee that a taxon coverage pattern, S, given its taxon coverage density and taxon number, n, would be decisive for some (random) tree constructed on the label set, X, of S:

$$ k_{\mathrm{min}} = \frac{\ln\left(\binom{n}{3}/p\right)}{-\ln\left(1-d^{4}\right)} $$

which approximates to

$$ k_{\mathrm{min}} \approx \frac{\ln\left(\frac{n^{3}}{6p}\right)}{-\ln\left(1-d^{4}\right)} $$

([53], Mike Steel, personal communication, 2015), where d is the taxon coverage density of S, n the number of taxa in S, and p the desired confidence level. Henceforth, k_min stands for the approximation. To compare data sets, we used a normalized value that we call “gene sampling sufficiency” (i.e., the depth of the gene sampling), or ζ:

$$ \zeta = \ln \frac{k}{k_{\mathrm{min}}} $$

where k is the number of loci (partitions) sampled. If decisiveness for a random tree on X is highly probable (p < 0.05), then ζ ≥ 0. Otherwise, ζ < 0.
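
The calculation can be sketched in a few lines of Python. The sketch takes ζ as ln(k / k_min), the sign convention under which ζ < 0 indicates insufficient sampling, and uses p = 0.05 as a default; the function names and the example values of n, d, and k are hypothetical and do not correspond to any data set in Table 1.

```python
import math


def k_min_exact(n: int, d: float, p: float = 0.05) -> float:
    """Lower bound on the number of loci, using the binomial coefficient C(n, 3).

    Requires 0 < d < 1; at d = 1 the denominator is undefined.
    """
    return math.log(math.comb(n, 3) / p) / -math.log(1.0 - d ** 4)


def k_min_approx(n: int, d: float, p: float = 0.05) -> float:
    """Approximation that replaces C(n, 3) with n^3 / 6."""
    return math.log(n ** 3 / (6.0 * p)) / -math.log(1.0 - d ** 4)


def sampling_sufficiency(k: int, n: int, d: float, p: float = 0.05) -> float:
    """zeta = ln(k / k_min); negative when fewer than k_min loci are sampled."""
    return math.log(k / k_min_approx(n, d, p))


# Hypothetical data set: 100 taxa, coverage density 0.3, 7 loci.
exact = k_min_exact(100, 0.3)
approx = k_min_approx(100, 0.3)
zeta = sampling_sufficiency(7, 100, 0.3)
```

Because d enters as d⁴, k_min grows steeply as coverage density falls, which is consistent with the very large k_min values reported for sparse data sets.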

Impact of terraces on bootstrap support

In bootstrapping, the tree returned by each bootstrap replicate may be part of a terrace of equally good trees. To examine the impact of terraces on resampling support, we selected three small-to-medium-sized (112–225 taxa) data sets whose terraces were among the larger of those recovered. From each of the three data sets, we constructed 100 RAxML rapid bootstrap trees. We used PAUP to construct a majority rule consensus tree of each bootstrap replicate set and computed ρ for each majority rule consensus tree (Fig. 2). Next, using terraphy, we evaluated the terrace of each replicate tree and constructed the strict consensus tree of each terrace. Finally, we used PAUP to construct the majority rule consensus tree of each collection of strict consensus trees. We call these majority rule trees “terrace-aware,” because they exclude clades present in fewer than 100% of trees on the terrace found in each bootstrap replicate. We computed ρ for each “terrace-aware” consensus tree.
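
The logic of the “terrace-aware” consensus can be illustrated with a toy sketch in which each tree is reduced to a set of clades. This is not the PAUP or terraphy implementation; the function names, the 50% threshold, and the three hypothetical replicates (with made-up terraces) are assumptions chosen only to show the two-step procedure.

```python
from collections import Counter


def strict_consensus(trees):
    """Clades present in every tree; each tree is a set of frozenset clades."""
    return set.intersection(*(set(tree) for tree in trees))


def majority_rule(trees, threshold=0.5):
    """Clades present in more than `threshold` of the trees."""
    trees = [set(tree) for tree in trees]
    counts = Counter(clade for tree in trees for clade in tree)
    return {clade for clade, count in counts.items() if count / len(trees) > threshold}


AB, ABC, ABD = frozenset("ab"), frozenset("abc"), frozenset("abd")

# Hypothetical trees returned by three bootstrap replicates.
replicate_trees = [{AB, ABC}, {AB, ABC}, {AB, ABD}]

# Hypothetical terraces: the first replicate tree lies on a two-tree terrace.
replicate_terraces = [
    [{AB, ABC}, {AB, ABD}],
    [{AB, ABC}],
    [{AB, ABD}],
]

standard = majority_rule(replicate_trees)
terrace_aware = majority_rule([strict_consensus(t) for t in replicate_terraces])
```

In this example the clade {a, b, c} appears in two of three replicate trees and survives the standard consensus, but it is absent from the strict consensus of the first terrace and so drops out of the terrace-aware consensus.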

Observed taxon triples and terrace size

To test the conjecture of Sanderson et al. [31] that the fraction of triples sampled in taxon subsets (i.e., Y₁,...,Y_k) might predict the effects of a given amount of missing data, we computed the observed triple proportion (see Concepts and Definitions, earlier) for 12 data sets of relatively similar taxon coverage density and ζ values. Taxon coverage densities for this group ranged from 0.19 to 0.43, and ζ values ranged from −4.35 to −7.81. All data sets were chosen from among the vascular plant subsamples.

Edge-length model choice

Although terraces are only known to occur with EUL models, EL may not be the best model for all data sets. We used the Akaike Information Criterion (AIC) [35] to identify the most appropriate edge-length model for each data matrix in the study sample. For each matrix, we obtained maximum likelihood scores for a tree previously inferred from the data (in each case the tree used for terrace analysis) with both models. For likelihood analyses, we used RAxML v. 8.2.11 [36], with separate HKY85 [54] substitution matrices for each partition of DNA data sets, and separate WAG [55] transition models for each partition of protein data sets. We used the GAMMA model of rate heterogeneity for all data sets. Within pairs of inference models, the EUL model differed from the corresponding EL model only in estimating branch lengths independently across loci. We computed ΔAIC [56, 57] for each pair of models.
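
The model comparison rests on the standard definition AIC = 2K − 2 ln L, where K is the number of free parameters and ln L the maximized log-likelihood. The sketch below is illustrative only: the log-likelihoods and parameter counts are invented, and the edge-count helpers assume binary unrooted trees (2m − 3 edges for m taxa), with an EUL model estimating one set of edge lengths per locus on the taxa sampled for that locus.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2K - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def delta_aic(scores: dict) -> dict:
    """Difference between each model's AIC and the lowest AIC in the set."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}


def edge_params_linked(n_taxa: int) -> int:
    """Edge-length parameters under EL: one per edge of a binary unrooted tree."""
    return 2 * n_taxa - 3


def edge_params_unlinked(taxa_per_locus) -> int:
    """Edge-length parameters under EUL: one set per locus (assumed counting rule)."""
    return sum(2 * m - 3 for m in taxa_per_locus)


# Invented values for illustration only.
scores = {
    "EL": aic(log_likelihood=-50000.0, n_params=200),
    "EUL": aic(log_likelihood=-49500.0, n_params=900),
}
deltas = delta_aic(scores)
preferred = min(scores, key=scores.get)
```

With these invented numbers the EUL model fits better in raw likelihood, but its additional edge-length parameters outweigh the gain, so EL has the lower AIC.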

Results

Size of terraces; relationship to taxon coverage percentage

We succeeded in measuring the terraces present in 25 data sets; the sizes ranged from one tree (a nominal terrace) to an astonishing 10³⁸⁸ trees (Fig. 3a, Table 1). The latter terrace was that found in the 7000-taxon data matrix, the largest (in terms of taxa) of those analyzed. In evaluating the terrace of the large (946 taxa), low-sampling-density (coverage density of 0.06) data set of [33], we terminated the program run after several weeks. Although the time required to count trees on this terrace implies that it is very large (as run time scales linearly with terrace size [26, 27]), the polytomous topology (ρ = 0.82) of the tree may have extended the program running time (see Methods). No terrace of a data set of coverage density greater than 0.90 exceeded one tree. For the 13 plant “megamatrix” data subsets, terraces ranged in size from 1 to ~10²³ trees, although taxon coverage densities for these data sets spanned a narrow range, from 0.19 to 0.43. In general, terrace size varied inversely with the taxon coverage density of the data (Fig. 3a).

Minimum gene sampling depth needed for decisiveness

k_min was often very large, exceeding 1000 loci for 16 data sets, and exceeding 2 million loci for one data set (Table 1). Sampling sufficiency, ζ, measured less than zero (i.e., insufficient) for all but six data sets (Fig. 4a, Table 1). As with taxon coverage density, terrace size generally varied inversely with sampling sufficiency (ζ) (Fig. 4b). Terraces found in two data sets for which values of ζ were low (−5.44 and −5.60) comprised 1 and 3 trees, respectively, results at odds with the predictions of the Steel [53] and Sanderson et al. [31] probability model. The uniform taxon sampling assumed by their model, however, may not reflect samples found in many empirical data sets.