The disappearing tree phenomenon
Concordance between the branches in individual gene trees and their concatenated incarnation is weak, as suggested by earlier studies [5],[35]. For the present data, this is shown in Figure 1, using a dataset of 48 genes present in three samples of 100 prokaryotic genomes spanning three phylogenetic depths: 50 archeabacteria and 50 eubacteria (Figure 1A), 100 proteobacteria (Figure 1B), and 100 gammaproteobacteria (Figure 1C). In each 100 genome, 48 gene sample, the frequency of branches in 48 individual gene trees were compared to the set of branches in the concatenation tree. For each internal node within the concatenation tree, the node score was specified as the number of times that the corresponding node was observed among the 48 individual gene trees.
For the most divergent data set (Figure 1A), deeper internal nodes of the concatenated tree have almost no congruence with the nodes in single gene trees, except the branch separating archaebacteria and eubacteria. At the tips of the tree, much greater congruence between the individual genes and the concatenation tree is observed. Surprisingly, the same "tree of tips" [15] or "disappearing tree" [35] phenomenon was observed for the proteobacterial sample (Figure 1B) and for the gammaproteobacterial sample (Figure 1C). For all three samples of phylogenetic/taxonomic depth, congruence between the deeper internal branches and branches recovered in individual trees disappears, yet the bootstrap proportions (BP) for virtually all branches in the concatenation trees were very high: for all three concatenation trees combined, only 15 internal branches had a BP below 80 (nodes marked with a red dot in Figure 1) and the average BP was 90, 99, and 96 for Figure 1a-c respectively.
For the deep prokaryote sample, there are mainly two areas of low BPs within the tree: one within the euryarchaeota, and one spanning firmicutes, actinobacteria and tenericutes. The corresponding node support values relative to individual trees are low as well. But it is clearly visible that the rest of the internal branches have low node scores (congruence among individual trees) and high bootstrap support (site pattern sampling in the concatenation tree). The total number of splits present within the 48 single gene trees (split pool), is a simple measure to reflect the observed incongruence within the concatenation tree. On the range between total congruence with a split pool of 97 splits and total incongruence with 4,656 possible splits, 1,830 different splits were observed for the set of 48 trees summarized in Figure 1A. For trees in Figure 1B and C, 1,905 and 1,804 splits were observed, respectively. In other words, each internal branch of the species tree generates more than 18 conflicting splits on average. Especially for deeper phylogenetic relationships, the topology in the concatenation tree is not present in any of the family gene trees, despite the corresponding branches of the concatenation tree showing high BP values.
Influence of LGT – Comparing eukaryotic and prokaryotic data
The main effect of LGT on prokaryote genome evolution is to alter the number and kinds of genes that are found in a prokaryotic genome, not to promote orthologous replacement [36]. But there is also evidence that some of the core genes in prokaryotes might be replaced during evolution [37]-[39]. Thus, if LGT is the main reason why the present set of prokaryotic "core" genes analyzed individually tend to obtain different phylogenetic results, then this tendency should be more pronounced in prokaryotes than in eukaryotes. This is because eukaryotes counteract Muller's ratchet using meiosis and sex (process that generate reciprocal recombination), while prokaryotes rely on mechanisms of LGT — transformation, conjugation and transduction — processes that spread genes unidirectionally from donors to recipients. In order to see whether the congruence between concatenation trees and individual phylogenies is greater in prokaryotes or eukaryotes, we compared two additional datasets: one comprising of 50 fungal genomes, and one comprising of 50 eukaryotes, spanning plants, animals and fungi (PAF). Both datasets were composed of 50 genes with comparable length and different average pairwise identities (61% in fungi, 49% in the mixed set).
The results, summarised in Figure 2A and B, show that both eukaryotic concatenation trees tend to have weaker node scores in the deeper branches than at the tips, like the prokaryote concatenation trees, but the overall agreement between concatenation trees and individual gene trees is far better for the eukaryotic data than for the prokaryotic data. As in the prokaryotic example, the eukaryotic concatenation trees show high BPs, averaging 96 and 97, respectively. The PAF tree shows a clear correspondence between low bootstrap support and low node score in the clade spanning the higher plants. But, as in the case of the prokaryotic trees, sampling at increasing phylogenetic depth does not reduce the congruence between individual gene trees and concatenated trees, as the average node score, 25% ± 14, for the fungal data set (Figure 2B) is slightly higher than the value for the plant-animal-fungi dataset, 19% ± 11 (Figure 2A) (P = 0.026). Out of possible 2,350 splits we observed 350 different splits within the PAF dataset and 390 splits within the fungi dataset.
Factors affecting node scores
We investigated different factors that might affect node scores, which are a proxy for the tendency of individual trees to recover branches found in the concatenated tree. For this, we plotted, for each node in the concatenation tree, the frequency with which it was recovered in different data samples in order of increasing frequency (abscissas in Figure 3).
First we looked at phylogenetic depth (Figure 3A) because distantly related groups have distantly related sequences, which are notoriously hard to align, and their phylogenetic analysis can be further hampered by substitution levels that can approach saturation or algorithmic biases such as long branch attraction. The prokaryotic datasets shown in Figure 1— prokaryotes, proteobacteria, gammaproteobacteria — encompass the same 48 genes, but because of their different phylogenetic depth, they span different levels of sequence divergence, the average pairwise identity being 32%, 48% and 67% respectively. Perhaps surprisingly, there is no significant difference (P = 0.67, P = 0.40, P = 0.70) between the node score distributions of the three samples (Figure 3A), despite the samples spanning a twofold decrease in average pairwise sequence identity. Thus, for these samples, phylogenetic depth is not a cause of low node scores.
In Figure 3B we plotted node score distributions for the eukaryotic data sets shown in Figure 2. The comparison of the plant-animal-fungi vs. the fungal samples also revealed no significant difference, such that, like the prokaryotic samples, increasing sequence divergence stemming from greater phylogenetic depth (52% average pairwise identity PAF vs. 58% fungi) had no detectable effect on node scores. To see if differences between prokaryote and eukaryote samples could be detected, we constructed a gammaproteobacterial sample with the same number of sequences and taxa (50) as the eukaryotic samples and consisting of genes with similar lengths (avg. 441 gammaproteobacteria, avg 438 PAF, avg. 441 fungi) and similar sequence conservation (avg. 58% for the gammaproteobacteria). Despite having very similar sequence atttributes as the eukaryotic samples, the node score distribution for the 50-genome gammaproteobacterial sample is strongly shifted towards lower values and is significantly different from that for the eukaryotes (P = 0.0007 , P = 1.96 × 10−5) (Figure 3B). This would be consistent with an effect of LGT in the gammaproteobacterial sample, but if so, it remains puzzling why we do not see a decrease in the prokaryotic node score with increasing phylogenetic depth (Figure 1, Figure 3A). Notably here the two eukaryote sets show no significant difference in their node score distribution (P = 0.2377).
Figure 3C shows the node score distributions for gammaproteobacterial gene samples of 50 genes each that were separated into three categories of sequence divergence (average pairwise sequence identity 44%, 61% and 70% respectively). No significant difference between the node score distributions was observed (P = 0.59, P = 0.86, P = 0.46). This suggests that sequence divergence at similar phylogenetic depth is not a factor affecting node score.
Another possible factor affecting generation of incongruent branches in individual and concatenated analyses is sequence length, or small site sample size. To check this, we assembled three more samples from the gammaproteobacterial data, each consisting of 50 genes for the same 100 species. The three samples consist of sequences with different average sequence length (124aa, 305aa, 692aa). The distributions of the node scores for the individual genes vs. the respective concatenation tree in two of the three samples are significantly different (P =3.8 x 10−5 , P =0.0172), with the longer sequences providing higher values than the shorter sequences (Figure 3D).
To investigate this effect further, we assembled fungal and proteobacterial datasets consisting of 200 genes each for 50 genomes and binned the individual alignments by their sequence length. In each of the 40 bins, we simply counted the number of different splits observed for the five trees in each bin. In the case of five identical topologies, we would observe 47 splits, in the case that no common branches were observed across all five trees in a bin, we would observe 235 splits. The numbers of splits observed in each bin are plotted against sequence length in Figure 4A. A very strong correlation is observed both for the gammaproteobacterial (r = −0.87, P = 2.2 × 10−13) and for the fungal bins (r = −0.8, P = 3.0 × 10−10). The corresponding analysis for alignment length, rather than sequence length, takes the influence of gaps into account, and very similar distributions to those obtained for sequence length were obtained (Additional file 7, gammaproteobacterial sample r = −0.78, P = 2.7 × 10−9, fungal sample r = −0.73, P = 5.6 × 10−8). Although these fungal and gammaproteobacterial samples have comparable sequence lengths and similar average pairwise identity distributions (fungi: 56% ± 5; gamma: 59% ± 6), the fungal data tends much more strongly to recover the same tree than the gammaproteobacterial sample does. This again might point to a greater role for LGT in the gammaproteobacterial genes than in the fungal genes sampled.
To estimate the presence of LGT in this data we used two established programms to search for potential LGT events, PRUNIER [31] and RANGER-DTL [32]. RANGER-DTL estimates the number of gene duplications, horizontal gene transfers and gene losses that are needed to reconcile a species and a gene tree. As a species tree we used a concatenation tree of all 200 genes for the fungi and gammaproteobacteria data. The mean number of estimated transfers per tree in the fungi data was 9, whereas it was 17.2 in the gammaproteobacteria data. PRUNIER uses a more conservative approach to estimate lateral transfer events, since it also includes a bootstrap cutoff. Using a cutoff of 70% yielded on average 2.3 transfers per tree within the fungi data and on average 3.3 transfers within the gammaproteobacteria data. Lowering the cutoff to 55% in the gammaproteobacteria data, since it has in average low BP values, yielded on average 5.7 transfers. The effect of LGT present in trees on incongruence is visible in Additional file 8. Again, length sorted bins were used to visualize the effect. This time gammaproteobacterial trees with low LGT rate (max. 1 event) and high LGT rate (at least 5) were compared. The low rate trees show higher congruency than the high rate trees. It might not explain all the difference between prokaryotic and eukaryotic data, but it shows that LGT has some influence as well.
Because bootstrapping provides information about the number of sites with similar distributions of site patterns needed to obtain the same tree in every pseudosample [40], it is perhaps not surprising that the average BP for each tree in the 200-gene gammaproteobacterial and fungal samples is strongly correlated with sequence length (Figure 4c). In the case of the gammaproteobacteria data (200 trees, all having the same 50 species), there is a strong positive correlation between bootstrap support in a gene family tree and sequence length (Figure 3B, r = 0.76, P = 5.72 × 10−40). A similar strong positive correlation between BP and sequence length is observed in the fungi dataset (Figure 3B, r = 0.73, P = 1.7 × 10−35). Other parameters do not show this strong correlation with BPs, or show no correlation at all. The alignment length has a slightly lower correlation with BP, than sequence length (gammaproteobacteria: r = 0.69, P = 2.23 × 10−30, fungi: r = 0.68, P = 1.06 × 10−28). The pairwise identity of genes within one gene family tree appears to correlate with BP (gammaproteobacteria: r = −0.31, P = 4.81 × 10−6, fun: r = −0.44, P = 3.5 10−11), but much less strongly than sequence length. Moreover, sequence length and pairwise identity are themselves only weakly or not correlated (fungi: r = −0.16, P = 0.017, gammaproteobacteria: r = −0.03, P = 0.598).
Simulations to investigate the influence of sequence length
To see if the sequence length effect is repeatable with perfect alignments, we simulated alignments along a known evolutionary history. As an input for these alignments we used the concatenation tree made of the 48 conserved genes from 100 gammaproteobacteria species (Figure 1). Two datasets consisting of 50 alignments were generated, one with an initial alignment length of 1,000 positions, one with 200 positions. The dataset based on 1,000-position long alignments yield a nearly perfect distribution of splits. Nearly all of the 50 trees supporting the same splits, meaning all the trees are almost identical. The 200-position dataset trees have twice the amount of splits in their split pool than the longer ones (107 vs. 225 splits). To check whether the alignment process itself makes a difference, two additional datasets were made by recovering the sequences from the simulated alignment and aligning them using the same procedure as for the biological sequences. Again, no effect was detected. Increasing the tree length (sequence divergence) by a factor of three for the shorter 200 position alignments increases the number of individual splits to 350, which is still much less than observed in real data.
Comparing the distribution of incompatible splits between simulated data and real data make the differences more obvious (Figure 5). In real data, in this case the gammaproteobacteria, most of the observed splits appear only in one of the trees. This is true for data made from short sequences as well as for long sequence data (Figure 5A,B). Whereas in the long sequence data, the number of splits observed in single trees is strongly reduced. Within the simulated data most of the splits are present in all trees (Figure 4C,D). In the short simulated data, some splits are only present in single trees.