Genomic census and trees of fold families
We have searched for controlled vocabularies that have multiple genomic occurrences and that are appropriate for surveying ancient evolutionary history. We already found that domain structures at the fold and fold superfamily levels and their domain combinations harbor phylogenetic signatures that are congruent [17, 20, 24–27]. Here we study the evolution of protein domains at the FF level to determine if lower levels of structural abstraction still preserve these ancient signatures. We note that our focus is on the structure of protein domains and not on how they interact with each other, within or between molecules, or with nucleic acids and other molecules of significance. The census therefore takes protein domains out of their natural molecular and cellular context.
Figure 1 shows a flow diagram of our experimental strategy. The 645 completely sequenced genomes that we here analyze consist of 49 archaeal (A), 421 bacterial (B), and 175 eukaryotic (E) organisms. Manual inspection of lifestyles showed these organisms can be divided into 420 free-living (48 A, 239 B, and 133 E), 93 facultative parasitic (0 A, 71 B, and 22 E), and 132 obligate parasitic (1 A, 111 B, and 20 E) organisms. Their proteomes contained 3,114 FFs. We used an E-value cutoff of 10⁻⁴ to extract reliable HMM hits of the FFs in individual proteomes. As a result, the structural census revealed that 2,493 FFs out of 3,114 FFs were present in the 645 proteomes. Data matrices of genomic abundance (g; see Methods) and genomic occurrence (presence/absence) of FFs for all possible pairs of FFs and proteomes were generated from the census. These matrices were then used to build intrinsically rooted phylogenomic trees of FF domain structures (with FFs as taxa and proteomes as characters) and trees of proteomes (with proteomes as taxa and FFs as characters) by transposing columns and rows in the matrix.
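The census step can be illustrated with a minimal sketch. It assumes a hypothetical list of HMM hits given as (proteome, FF, E-value) tuples, treats the cutoff as a strict inequality, and uses raw hit counts as abundance; none of these details is stated in the text, and the abundance transformation described in Methods is not reproduced here.

```python
from collections import Counter

# Hypothetical HMM hits: (proteome, fold family, E-value). Values are illustrative only.
hits = [
    ("proteome_A", "c.37.1.12", 1e-30),
    ("proteome_A", "c.37.1.12", 1e-12),
    ("proteome_A", "d.122.1.1", 1e-2),   # fails the cutoff and is discarded
    ("proteome_B", "c.37.1.12", 1e-8),
    ("proteome_B", "d.122.1.1", 1e-6),
]
E_CUTOFF = 1e-4  # cutoff quoted in the text; strict "<" is an assumption


def build_census(hits, cutoff=E_CUTOFF):
    """Return FF labels, proteome labels, and FF-by-proteome matrices."""
    counts = Counter((p, ff) for p, ff, e in hits if e < cutoff)
    proteomes = sorted({p for p, _, _ in hits})
    ffs = sorted({ff for _, ff, _ in hits})
    # Rows are FFs (taxa), columns are proteomes (characters).
    abundance = [[counts[(p, ff)] for p in proteomes] for ff in ffs]
    occurrence = [[1 if n > 0 else 0 for n in row] for row in abundance]
    return ffs, proteomes, abundance, occurrence


def transpose(matrix):
    """Swap rows and columns so that proteomes become taxa and FFs characters."""
    return [list(col) for col in zip(*matrix)]


ffs, proteomes, abundance, occurrence = build_census(hits)
abundance_by_proteome = transpose(abundance)
```

The same matrix thus serves both kinds of trees: read row-wise it describes FFs, and after transposition it describes proteomes.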
Since trees of FFs are highly unbalanced, the relative age of individual FFs can be obtained directly from the tree by counting the numbers of nodes that exist from its base to each leaf, and expressing this node index (nd) in a 0-1 scale (see Methods). The age of FFs derived from abundance-based trees (nd_a) was strongly correlated with the age derived from occurrence-based trees (nd_o) (y = 1.03x - 0.04, R² = 0.883; Additional file 1, Figure S1). While genomic occurrence of domains has been used previously to build trees of proteomes at fold superfamily level [28], a comparison of the two methods produces phyletic patterns that are largely congruent [19, 24]. We thus chose to build trees of domains and trees of proteomes from FF abundance to incorporate phylogenetic signal embedded in the proteomic reuse of domains and in FFs that are widely distributed in life and had an origin that predated the last universal common ancestor (LUCA). This is not possible with an occurrence-based approach. Indeed, genomic occurrence underestimates the age of the most ancient FFs (nd < 0.3) (Additional file 1, Figure S1). This is expected since these FFs are widely shared and are the most abundant (see Results and Discussion below). In addition, we find that the tree of FFs based on genomic occurrence displayed a polytomy among the most ancient structural lineages (data not shown), which is fully resolved in the tree reconstructed from genomic abundance. Mechanistically, domain structures spread by recruitment as genes duplicate and diversify and genomes rearrange; their numbers are expected to increase in proteomes with evolutionary time and as species diversify. The abundance-based phylogenetic approach is therefore in line with the processes of genome evolution. Given these considerations, we here concentrate on results obtained using genomic abundance.
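The node-counting idea behind nd can be sketched on a toy tree. The sketch assumes that the root itself is not counted and that counts are rescaled by the largest count observed, so that the most basal leaf receives nd = 0 and the most derived leaf nd = 1; the tree and its labels are hypothetical.

```python
# Toy rooted tree stored as child -> parent links; all node names are hypothetical.
parent = {
    "n1": "root", "FF_a": "root",
    "n2": "n1", "FF_b": "n1",
    "FF_c": "n2", "FF_d": "n2",
}


def nodes_from_root(leaf, parent):
    """Count internal nodes (excluding the root) on the path from root to leaf."""
    count = 0
    node = parent[leaf]
    while node in parent:  # the root has no parent entry, so the walk stops there
        count += 1
        node = parent[node]
    return count


def node_distances(parent):
    """Rescale node counts to a 0-1 scale by dividing by the largest count."""
    leaves = set(parent) - set(parent.values())
    counts = {leaf: nodes_from_root(leaf, parent) for leaf in leaves}
    max_count = max(counts.values())
    return {leaf: counts[leaf] / max_count for leaf in sorted(leaves)}


nd = node_distances(parent)
```

In this comb-like toy tree the leaf attached directly to the root is the oldest, which mirrors how highly unbalanced trees translate branching order into relative age.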
We note that our strategy for the construction of rooted phylogenomic trees is based on the fundamental premise that 'FFs that are more popular are more ancient'. This premise of increased representation of FFs in the protein world is not constrained by how FFs spread in the proteomes that we sample, for example by gains, losses, convergent evolution, and horizontal gene transfer. In other words, our evolutionary model of tree reconstruction is not governed by the assumption that 'FFs that are more widely spread are more ancient'. While this outcome is quite frequent in our analysis, the model is agnostic about how FF growth occurs in proteomes.
Trees of proteomes, genome reduction, and horizontal gene transfer
Reconstruction of a tree of organisms describing the evolution of 645 proteomes resulted in one most parsimonious rooted tree (Additional file 1, Figure S2). The tree was built from genomic abundances of 2,493 FFs and embodied the canonical rooting of the tree of organisms typically recovered when studying rRNA [3]. It clustered superkingdoms Archaea and Eukarya, each of which formed a monophyletic group. Bacteria was divided into two groups. One of them (group B1) was positioned at the base of the tree and contained a few bacterial facultative and obligate parasitic lineages (e.g., Chlorobium, Candidatus Sulcia, and Candidatus Carsonella). In fact, the total set of 225 parasitic organisms was dispersed throughout the tree but their presence was particularly evident at the bases of their respective superkingdoms (e.g., Giardia, Encephalitozoon, etc. in Eukarya; Nanoarchaeum in Archaea; Mycoplasma, Anaplasma, etc. in group B1; see Additional file 1, Figure S2), regardless of their original taxonomic positions in rRNA trees. Parasitic organisms generally discard enzymatic and cellular machineries in exchange for resources from their hosts [19, 29]. In most cases, these reductive tendencies result in small genomes and highly reduced domain repertoires. In previous studies, we found that the inclusion of these highly reduced proteomes in trees of organisms results in abnormal phylogenetic relationships [19, 27]. We thus excluded proteomes from parasitic organisms and tested if their presence biased the rooting of the tree. Indeed, a tree of organisms describing the evolution of 420 proteomes of free-living organisms that was reconstructed from the abundance of 2,397 FFs (2,262 of which were parsimony-informative) showed it was rooted in Archaea (Additional file 1, Figure S3). Superkingdoms Bacteria and Eukarya formed monophyletic clades, each strongly supported by 100% bootstrap support (BS) values.
These two superkingdoms were sister taxa to each other (53% BS) and clustered paraphyletically to archaeal proteomes, which in turn were positioned at the base of the tree. Compared with the tree of organisms that describes the evolution of the 645 proteomes, the phyletic patterns of the tree of proteomes of free-living organisms were highly congruent with those from trees of organisms built from rRNA sequences or repertoires of folds and fold superfamilies [19, 24, 27]. In addition, there was significant phylogenetic signal (g_1 = -0.241), confirming that FF data is appropriate for deep phylogenetic studies.
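For readers unfamiliar with the g_1 statistic, the following sketch assumes it is the usual skewness of the length distribution of randomly sampled trees, where a left-skewed (negative) distribution is taken as evidence of phylogenetic signal. The tree lengths below are invented for illustration and are unrelated to the values reported here.

```python
def g1_skewness(tree_lengths):
    """Skewness g1 = m3 / m2**1.5, using population (1/n) central moments."""
    n = len(tree_lengths)
    mean = sum(tree_lengths) / n
    m2 = sum((x - mean) ** 2 for x in tree_lengths) / n
    m3 = sum((x - mean) ** 3 for x in tree_lengths) / n
    if m2 == 0:
        raise ValueError("all tree lengths are identical; skewness is undefined")
    return m3 / m2 ** 1.5


# Hypothetical lengths of random trees: a few short trees form a left tail.
random_tree_lengths = [10, 12, 12, 13, 13, 13]
g1 = g1_skewness(random_tree_lengths)  # negative for a left-skewed distribution
```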
While horizontal gene transfer seems rampant at sequence level, its impact appears quite limited at higher levels of structural organization [15, 20, 22]. We tested however if FFs evolved without major horizontal gene transfer biases. Informational genes that are involved in transcription, translation, and DNA replication have been reported to be refractory to the effects of horizontal gene transfer [30]. We therefore divided the 2,262 parsimony-informative FFs into informational (182 FFs) and non-informational (2,080 FFs) domains using as reference Vogel and Chothia's functional classification [23]. It is also well established that horizontal gene transfer occurs more frequently in Bacteria than in the other superkingdoms. We thus extracted informational (34 FFs) and non-informational (488 FFs) domains that are uniquely present in the proteomes of the 239 bacterial free-living organisms. For each of the groups, we calculated retention indexes (r_i) of individual FF characters and plotted them against the age of the corresponding FFs (nd) derived from the tree of FF structures we describe below. The index portrays the relative amount of homoplasy of individual phylogenetic characters (conflict in how data matches the reconstructed tree) and processes other than vertical inheritance, such as convergent evolution, horizontal gene transfer and recruitment [20]. It is important to note that the measure is independent of the number of taxa in reconstructed trees. Both r_i distributions for informational and non-informational FFs were highly consistent with each other and consistency was still maintained in the FFs of Bacteria (Additional file 1, Figure S4). These results do not support the argument that horizontal gene transfer is rare in informational genes since they generally interact with a large number of other molecules [30]. Instead, results indicate that in contrast with sequence, horizontal gene transfer occurs with no functional preference at the FF level.
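As a reminder of what the retention index measures, the sketch below uses the standard per-character definition r_i = (g - s) / (g - m), where m is the minimum possible number of steps for the character, g the maximum, and s the number of steps observed on the tree. The argument names and the step counts are illustrative assumptions, not values from this study.

```python
def retention_index(min_steps, observed_steps, max_steps):
    """Per-character retention index: 1 means no homoplasy, 0 means maximal."""
    if max_steps == min_steps:
        # The character cannot show homoplasy on any tree; the index is undefined.
        raise ValueError("retention index is undefined when max_steps == min_steps")
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical character: at least 1 step, at most 4, and 2 observed on the tree.
ri_example = retention_index(min_steps=1, observed_steps=2, max_steps=4)
```

Values near 1 indicate characters that fit the tree with little conflict, whereas lower values point to homoplasy of the kind that horizontal transfer or recruitment would introduce.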
Global evolutionary patterns of FF domain structures
Intrinsically rooted trees of FFs were reconstructed from the structural census of FFs in the 420 proteomes of free-living organisms we analyzed. The most parsimonious tree describing the evolution of 2,397 FFs had significant phylogenetic signal (g_1 = -0.070) despite the large number of taxa (Figure 2A). We assigned relative ages of FFs (nd) and calculated the fraction of proteomes containing FFs (f; see Methods) to examine the relationship between the age and genomic distribution of domains (Figure 2B). As expected, the 13 most ancient FFs were present in all proteomes (f = 1), indicating that the most ancient FFs are both widely distributed and are highly conserved. However, domain loss and their distribution in emerging lineages are expected to reduce the wide distribution of domains and decrease f values. Indeed, the f values of FFs decreased with the increase of nd until f reaches 0 at about nd = 0.550. After this point, the pattern of change reverses and both f and nd values become positively correlated. This probably results from horizontal gene transfer, domain duplication and recruitment, and rearrangement, among other factors.
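The distribution index f can be computed directly from a presence/absence row, as in this minimal sketch. The occurrence vectors are hypothetical, and restricting the row to the proteomes of one superkingdom would give the superkingdom-specific values.

```python
def fraction_f(occurrence_row):
    """Fraction of proteomes in which an FF is present (1) rather than absent (0)."""
    if not occurrence_row:
        raise ValueError("at least one proteome is required")
    return sum(1 for x in occurrence_row if x) / len(occurrence_row)


# Hypothetical presence/absence of three FFs across four proteomes.
f_universal = fraction_f([1, 1, 1, 1])   # present everywhere
f_patchy = fraction_f([1, 0, 0, 1])      # present in half of the proteomes
f_absent = fraction_f([0, 0, 0, 0])      # absent from this set of proteomes
```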
The evolutionary patterns in these plots are remarkably similar to those observed in trees of folds and fold superfamilies [19] or their domain combinations [31]. However, they are clearly apparent with lower variance of f values at every time point. Moreover, the global trend of f in the timeline can be better dissected into superkingdom-specific patterns. In the case of Archaea, the f values declined heavily early in time (nd < 0.151), reached zero at about nd = 0.151, rose suddenly within 0.551 ≤ nd ≤ 0.661, an interval in which all Archaea-specific FFs (A in Figure 2B) appeared, and were dispersed in the remaining parts of the timeline. On the other hand, the trend of f values for Bacteria was quite similar to the global trend but showed additional features: (1) At nd ≥ 0.151, the f distribution of FFs shared by Bacteria and Eukarya (BE in Figure 2B) was similar to that of FFs shared by all superkingdoms (ABE in Figure 2B); (2) The f values of FFs in the 0.151 ≤ nd ≤ 0.256 interval were slightly lower; (3) FFs that were unique to Bacteria or were shared by Archaea and Bacteria (AB in Figure 2B) were only present in the 0.256 ≤ nd ≤ 0.661 interval and showed two abnormal peaks in the distribution of f values at about nd = 0.4 and 0.6; and (4) After nd = 0.661, many FFs were lost (had f values of zero). Finally, in the case of Eukarya, the f values in the early part of the timeline (nd ≤ 0.256) decreased more than those of Bacteria but less than those of Archaea. The extent of f-value dispersal in the 0.256 ≤ nd ≤ 0.550 interval was highly reduced in comparison to that of Bacteria. Starting at about nd = 0.550, f values increased dramatically along the timeline. In this period, the majority of FFs are Eukarya-specific. Consequently, while loss of the domain structures occurred in all superkingdoms before the inflection point at nd = 0.550, a new trend in architectural innovation by gain of domains became predominant after that time.
The 2,397 FFs are not equally distributed between superkingdoms. A Venn diagram shows FFs that are uniquely present in one (taxonomic groups A, B, or E), two (BE, AB, and AE) or three (ABE) superkingdoms, with A, B and E group labels representing Archaea, Bacteria and Eukarya, respectively (Figure 2A). Only 20% of FFs are common to all superkingdoms (group ABE). Previous studies of the distribution of folds or fold superfamilies in proteomes showed the ABE group was the most abundant of all taxonomic groups [19, 20]. For example, about 65% and 62% of folds and fold superfamilies belonged to this group, respectively [19]. In contrast, the numbers of FFs unique to Bacteria (group B) and Eukarya (group E) were each larger than the number of common FFs (ABE) (Figure 2A). The clear reduction of the number of universal domain structures with lower levels of structural abstraction is expected and showcases the decreased evolutionary conservation of FFs relative to fold superfamilies and folds.
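Assigning an FF to one of the seven taxonomic groups of the Venn diagram amounts to collecting the superkingdoms of the proteomes in which it occurs. The sketch below assumes a hypothetical mapping from proteome names to the labels A, B and E.

```python
# Hypothetical proteome -> superkingdom assignments.
superkingdom_of = {"p1": "A", "p2": "B", "p3": "E", "p4": "B"}


def taxonomic_group(ff_proteomes, superkingdom_of):
    """Return one of A, B, E, AB, AE, BE or ABE for the proteomes holding an FF."""
    present = {superkingdom_of[p] for p in ff_proteomes}
    return "".join(k for k in "ABE" if k in present)


group_universal = taxonomic_group({"p1", "p2", "p3"}, superkingdom_of)
group_bacterial = taxonomic_group({"p2", "p4"}, superkingdom_of)
group_shared = taxonomic_group({"p1", "p3"}, superkingdom_of)
```

An FF absent from every proteome yields an empty label, which matches the fact that such FFs do not enter the diagram.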
The structural timeline (0 ≤ nd ≤ 1) can be divided into five different phases by studying the emergence, distribution and diversity of FFs (Figure 2):
(1) A primordial (communal) protein world (phase I; 0 ≤ nd ≤ 0.150): In this ancient phase, domain structures diversified but were rapidly shared by the emergent cells. Proteomes of the three superkingdoms share all 76 FFs (ABE FFs). However, some FFs were lost in a few proteomes (f < 1; Figure 2B), most notably in Archaea, indicating the start of diversification at the protein structural level. Remarkably, the ancient FFs of this phase correspond to fold superfamilies that were previously identified as being part of LUCA [27]. We believe that this phase describes the emergence of a diverse community of primordial cells that consist of genetic founders of the three superkingdoms [32]. During this phase, however, there were no lineages of organisms as we know them today. Instead, selective sweeps ensured structural innovations were retained but were tolerant of considerable diversity in the emerging proteomic repertoires. Most proteins were also multifunctional. That multifunctionality is retained today in the many functions of the corresponding fold superfamilies that unify these ancient FFs [22, 31].
(2) Reductive evolution of primordial proteins (phase II: 0.151 ≤ nd ≤ 0.256): This phase consists of 232 FFs, many of which (181 ABE FFs) experienced reductive evolution (f < 1) or were completely lost (f = 0) in archaeal lineages (51 BE FFs that are shared by Bacteria and Eukarya) (Figures 2B and 2C). The first domains lost in Archaea were d.122.1.1 (heat shock protein 90, N-terminal domain) and d.14.1.8 (the middle domain of heat shock protein 90), which appeared at nd = 0.151. Consequently, this phase features the emergence of Archaea from LUCA by reductive evolution of ancient ABE FFs. The overall evolutionary trend of domain loss was higher in Archaea than in Bacteria and Eukarya. This is exemplified by significantly reduced f values (Figure 2B). This phase also marks the start of a slow process of diversification in superkingdom Archaea. We thus expect that many ancient though ill-defined archaeal lineages arose during this time. Since many archaeal species have adapted to extreme environments, we propose that the marked proteomic reduction of primordial archaeal species was probably caused by adaptive expansions of the LUCA into the harsh environments of early Earth.
(3) Development of the three superkingdoms (phase III: 0.257 ≤ nd ≤ 0.550): Here, the ancestral lineage that is sister to Archaea gives rise to superkingdoms Bacteria and Eukarya. The primordial trend of domain loss responsible for superkingdom Archaea is still maintained (Figure 2B). FFs unique to Bacteria (138 B FFs) probably appear from loss of BE or ABE FFs. For example, the first FFs lost in Eukarya, c.40.1.1 (C-terminal domain of methylesterase) and c.116.1.4 (tRNA-methyltransferase), occurred at nd = 0.257 and had considerable representation in superkingdoms (f = 0.41 and f = 0.57, respectively). This suggests that the most recent eukaryal ancestor was derived from the common ancestor of Bacteria and Eukarya. Results also exclude the possibility that Eukarya originated from Archaea, a conclusion that is also supported globally by the archaeal rooting and the sister relationship between Bacteria and Eukarya in the trees of proteomes of free-living organisms (Additional file 1, Figure S3). Consequently, the topology of the tree of proteomes should be [A, [B, E]]. Most importantly, all three superkingdoms reduced their proteomic complements by domain loss during this phase of superkingdom development. This is clearly evident in the substantial decrease in the appearance of FFs in the proteomes of Archaea, Bacteria and Eukarya during this phase (Figure 2D).
(4) Organismal diversification (phase IV: 0.551 ≤ nd ≤ 0.661): This period embodies the 'big bang' of domain organization in proteins [31]. Despite its short time span, phase IV is responsible for over 42% of modern FFs (see the sharp slope of 'Total' in Figure 2D). At nd ≥ 0.551, f values for all superkingdoms are positively (instead of negatively) correlated with nd values. The looser trend was therefore replaced by massive domain gains and structural innovations. A total of 1,008 FFs appear as part of all seven taxonomic groups (ABE, BE, AB, B, AE, A and E). Widespread appearance of domain structures in organismal lineages across the three superkingdoms signals massive diversification of proteins and proteomes. In addition, Archaea and Bacteria (but not Eukarya) showed abnormal peaks in the f distribution plots (Figure 2B) and r_i values of the FFs of this phase were significantly lower than the rest (Additional file 1, Figure S4). These observations suggest that horizontal gene transfer and processes of recruitment (e.g., genome rearrangement mechanisms responsible for domain combinations) largely contributed to the make-up and diversification of the superkingdoms. For example, the appearance of 384 FFs unique to Bacteria (Figures 2C and 2D) supports the conclusion.
(5) Eukaryal diversification (phase V: 0.662 ≤ nd ≤ 1): The majority of new FFs appearing in this final period were unique to the emerging eukaryotic lineages (515 out of 750 E FFs; Figure 2C). In contrast, FFs belonging to the A, AB, and B taxonomic groups were conspicuously absent, suggesting a halt of domain innovation in microbial superkingdoms. Similarly, domain appearance in the AE, BE, and ABE taxonomic groups was considerably reduced. Massive duplication of genes, genome duplications and rearrangements, meiosis, sex, and other reproductive innovations should be considered ultimately responsible for domain combination, domain recruitment and emergence of new domains in Eukarya, fundamentally by fission [31], which is typical of the most modern phase of the protein world.
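The five phases listed above partition the timeline by nd, which can be expressed as a simple lookup. The sketch takes the upper bounds of phases I to IV from the list; how to treat an nd value falling between two quoted bounds (for example 0.1505) is an assumption, and here such a value is assigned to the later phase.

```python
from bisect import bisect_left

# Upper nd bounds of phases I-IV as quoted in the list; phase V extends to nd = 1.
PHASE_UPPER_BOUNDS = [0.150, 0.256, 0.550, 0.661]
PHASE_NAMES = ["I", "II", "III", "IV", "V"]


def phase_of(nd):
    """Map a node distance in [0, 1] to its evolutionary phase."""
    if not 0.0 <= nd <= 1.0:
        raise ValueError("nd must lie between 0 and 1")
    # bisect_left keeps a value equal to an upper bound inside that phase.
    return PHASE_NAMES[bisect_left(PHASE_UPPER_BOUNDS, nd)]


phases = [phase_of(x) for x in (0.0, 0.151, 0.257, 0.551, 0.662)]
```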
Domain diversity increases in evolution
The accumulation of FFs along the timeline shows that the numbers of different FFs always increase in the proteomes of superkingdoms despite the early and massive episodes of domain loss and the lack of appearance of new FFs specific to microbial superkingdoms in the late phases of protein evolution (Figure 2D). This observation provides support to the evolutionary model used to root the trees of proteomes, which polarizes character state changes in proteomes towards increases in genome abundance (see details in Methods).
Evolution of molecular functions associated with FFs
Molecular functions are linked to corresponding protein domain structures. For the most part, structure-function relationships are unambiguous at the FF level of structural abstraction. In order to simplify the description of the functions of hundreds of FFs, we used the functional classification of Vogel and Chothia [23]. A total of 1,299 FFs were grouped into one of 7 major categories (General, Information, Metabolism, Intra-cellular processes, Extra-cellular processes, Regulation, and Other) and into one of 49 minor categories of molecular functions. For simplicity, the names of the categories were displayed in italics and the initial letters of the major categories were capitalized. The emergence time points of the major and minor categories of molecular functions in the timeline of FFs revealed remarkable patterns of origin in the five evolutionary phases (Figure 3).
Phase I: Only three of the seven major categories were present very early in the FFs of phase I. They included minor categories small molecule binding and protein interaction of General, ion m/tr (m/tr stands for metabolism and transport) of Intra-cellular processes, and nucleotide m/tr, other enzymes, coenzyme m/tr, transferases, and redox of Metabolism. Since small molecule binding and ion m/tr involve popular multifunctional enzymes and membrane transporters (e.g., ATPases), the vast majority of molecules emerging at the beginning of modern cellular life were involved in making up modern metabolic enzymes and enabling transport processes across primordial membranes. This suggests primitive cells acted as containers of the emerging protein domains already during this first evolutionary phase. The notable absence of molecular functions involved in Information indicates that ancient catalytic proteins with primordial metabolic functions initiated life in the absence of a translational apparatus. This conclusion is supported by the mapping of functions in a timeline of fold superfamilies [13, 19] and by phylogenomic analyses of structures and functional ontologies [20, 22]. The minor categories translation (Information), amino acids m/tr, carbohydrate m/tr, and energy (Metabolism), and proteases (Intra-cellular processes) appeared for the first time very late in phase I. The first FFs of translation were the catalytic domains of aminoacyl-tRNA synthetases [20]. Thus, translation emerges after crucial metabolic activities and together with amino acid biosynthesis and polypeptide breakdown [20, 22]. Results once again support the metabolism-first hypothesis of the origin of life and refute the existence of an RNA world (see [20] for an extended discussion and [33] for a review).
Phase II: This period starts with the emergence of FFs belonging to DNA replication/repair and transcription (Information), suggesting that early during this time nucleic acids started to be used as a genetic repository. In addition, the appearance of protein modification and cell motility (Intra-cellular processes) suggests the start of cellular diversification. Late in phase II, functions related to signal transduction (Regulation), secondary metabolism and e-transfer (electron transfer) (Metabolism), and transport (Intra-cellular processes) suggest more advanced cellular systems capable of regulatory control of cellular processes and more efficient energy management.
Phase III: With the exception of Extra-cellular processes and Other, all major categories are represented in this period and include ligand binding and general (General), DNA binding, kinases/phosphatases, RNA binding m/tr and other regulatory functions (Regulation), nitrogen m/tr, polysaccharide m/tr, lipid m/tr and cell envelope m/tr (Metabolism), RNA processing (Information) and cell cycle (Intra-cellular processes) (Figure 3). Functions such as lipid m/tr and cell envelope m/tr emerged quite late in the period and are clearly associated with the rise of superkingdoms Bacteria and Eukarya (the fundamental feature that defines this phase) (Figure 2C). For example, FFs involved in these processes established the chirality and chemistry of glycerol membranes by diversifying primordial ether and ester lipids that were present in LUCA into the sn2,3 isoprenoid ether lipids of Archaea and the sn1,2 fatty acid ester lipids of Bacteria and Eukarya [27]. Remarkably, molecular functions and FFs withered as the phase progressed and as a truly diversified world of organisms approached.
Phase IV: The molecular functions added in this relatively short phase of protein and proteomic diversification start with chromatin structure (Information), cell adhesion (Extra-cellular processes), and viral proteins (Other), and are followed by ion binding and structural protein (General), receptor activity (Regulation), photosynthesis (Metabolism), phospholipid m/tr (Intra-cellular processes), and toxins/defense, blood clotting and immune response (Extra-cellular processes). These functions are quite advanced and involve complex variants of Bacteria and Eukarya that engage in multicellularity, cell communication, and interaction with the environment at various biological levels (e.g., between cells or among organisms).
Phase V: This final phase has the longest time span but introduced only four functional innovations: lipid/membrane binding (General), storage (Metabolism), nuclear structure (Information), and intracellular trafficking/secretion (Intra-cellular processes). All of these processes are involved in establishing a much more complex cellular structure, such as the formation of compartments (e.g., the nucleus), lipid and polysaccharide storage, and targeting of proteins to proper compartments, sorting and translocation, and protein secretion mechanisms. All of these innovations are quite elaborate in Eukarya and involve many of the Eukarya-specific FFs that appear abundantly in this phase.
Phase-specific trees of proteomes along the timeline
In order to examine how the deep phyletic patterns of the three superkingdoms changed along the timeline, we reconstructed trees of proteomes for each of the five evolutionary phases (Figure 4). Again, we avoided the relatively modern reductive effects of parasitism by extracting phase-specific FFs present in the set of proteomes of free-living organisms. The main assumption in these studies is that different phases carry FFs with different phylogenetic signatures that describe selected aspects of life's evolution.
The most parsimonious tree of proteomes for phase I was reconstructed using genomic abundances of the universal 76 ABE FFs that appeared during the 0 ≤ nd ≤ 0.150 time interval (Figure 4A). The tree shows that the three superkingdoms formed separate groups. Proteomes of Archaea and Bacteria appeared paraphyletic while proteomes of Eukarya formed a moderately supported (70% BS) monophyletic group. The tree was rooted in Archaea, which was positioned at its base. Thermofilum pendens, a hyperthermophilic archaeon belonging to the phylum Crenarchaeota, was the most basal taxon. On the other hand, bacterial proteomes spanned the ancient archaeal lineages and the more derived eukaryal counterparts. The timeline derived from the tree of FFs shows no separation of the three superkingdoms in this phase, since all FFs of this phase are common to all life (Figure 2C). However, the phylogenetic signal embedded in the genomic abundances of these very old FFs, which contain domains of all ages in their make-up (the 'modern effect' sensu [27]), is strong and dissects the appearance of the three superkingdoms. The archaeal root of the tree of proteomes that is apparent already in phase I is consistent with the first emergence of Archaea from LUCA in the timeline of domain structures (Figure 2C). Remarkably, the tree of proteomes reconstructed from genomic abundances of the 181 ABE and 51 BE FFs of phase II is congruent with the tree reconstructed from phase I-specific FFs (Figure 4B). The tree is rooted in Archaea and shows Eukarya as a weakly supported (< 50% BS) monophyletic group. Interestingly, the most ancient 19 archaeal lineages of the phase I and phase II trees, including the T. pendens root, are thermophiles and hyperthermophiles and are consistently followed by methanogenic archaeal lineages in both trees.
These basal topologies that are congruently recovered from trees reconstructed from the most ancient protein domain characters lend support to the hypothesis of a thermophilic bottleneck during the rise of diversified lineages.
However, the deep relationships of the three superkingdoms present in phases I and II are broken in the tree of proteomes reconstructed from genomic abundances of the 331 FFs (66 ABE + 110 BE + 17 AB + 138 B FFs) of phase III (Figure 4C). Bacterial proteomes now clustered monophyletically and eukaryotic species formed a polyphyletic group at the base of the tree that included a monophyletic archaeal group. The eukaryotic placozoan Trichoplax adhaerens roots the tree of proteomes. It is also noteworthy that distributions of branch lengths show high levels of divergence in Bacteria in this phase when compared to the basal Archaea and Eukarya. The many bacteria-specific FFs of this period provide further support to the existence of high levels of bacterial diversification. The tree of proteomes reconstructed using the 1,008 FFs of phase IV that belong to all seven taxonomic groups was star-like and was rooted in the β-proteobacterium Polynucleobacter sp. (Figure 4D). Most lineages in the three superkingdoms formed polytomies. Bacterial and eukaryal species were polyphyletic. Instead, archaeal species formed a poorly supported clade. The star-like tree suggests horizontal gene transfer occurred rampantly across the three superkingdoms (also supported by peaks of f distribution in Figure 2B). Finally, the tree of proteomes reconstructed from the 750 eukaryotic FFs (78 ABE, 32 AE, 125 BE, and 515 E FFs) of phase V supported monophyletic Archaea and Eukarya and was rooted in Bacteria. However, the archaeal group bisected bacterial groups. Unlike the trees of proteomes for the previous four phases, eukaryal lineages were highly divergent, indicating that duplication of genes and genomes has frequently occurred in eukaryal lineages.
The canonical rooting of the tree of organisms derived from phylogenetic analyses of rRNA and other sequences (e.g., ATPases, aminoacyl-tRNA synthetases, elongation factors) generally shows hyperthermophilic bacteria (e.g., Thermotogae) at the base of the tree [32]. Our results do not support this topology. Instead, results are compatible with the hypothesis that the tree of organisms is rooted in an ancestor of modern archaeal proteomes [22, 27]. The archaeal rooting has been reliably obtained in numerous studies with different proteomic sets [13, 22, 24, 27] and is congruent with results from phylogenetic analysis of the structure of tRNA [34, 35], 5S rRNA [36] and RNase P [37], and of tRNA paralogs [38–41]. Remarkably, a molecular clock of folds also revealed that the first fold lost in a superkingdom disappeared in Archaea 2.6 billion years ago, within the span of the rise of planetary oxygen that preceded the great oxidation event on Earth [21]. Similarly, a careful reconstruction of the fold superfamily repertoire of LUCA showed it emerged sometime between 2.9 and 2 billion years ago, after the development of primordial ribosomal protein synthesis [27]. Trees of proteomes reconstructed from FFs appearing in the five evolutionary phases of domain diversification and from the entire set of FFs now confirm the archaeal rooting of diversified life.
Growth of FF repertoires in proteomes
Plots describing the evolutionary accumulation of FFs in proteomes that were directly derived from the intrinsically rooted trees of FFs (Figure 2D) show that domain gains always overwhelm domain loss. Moreover, they show that the repertoires of FFs always increase in all superkingdoms and in all taxonomic groups of FFs, regardless of the strong reductive evolutionary trends identified in the timelines. Even FF repertoires of individual free-living organisms exhibit these same trends. This overwhelming tendency of domain growth in proteome evolution that occurs throughout the timeline (regardless of how widely shared FFs are in the protein world) supports the character polarization statements that we use to root the trees of proteomes, and falsifies any character polarization scheme that may be applied in an opposite direction for the reconstruction of trees of proteomes. This trend, in conjunction with strong reductive evolutionary episodes of domain loss that occur in Archaea, Bacteria and Eukarya, also dissects the three superkingdoms in representations of occurrence and abundance of FFs in proteomes. A simple 'non-historical' plot of use of FFs (number of distinct FFs in a proteome; i.e., FF diversity) versus reuse of FFs (the sum of multiple occurrences of FFs in a proteome; i.e., FF abundance) shows a clear increase in values for the individual proteomes analyzed, starting with the proteomes of Archaea, then those of Bacteria, and finally those of Eukarya (Figure 5). Besides dissecting the three superkingdoms without phylogenetic reconstruction and supporting our character polarization statements, these patterns, in conjunction with the results of Figure 2, suggest the plot of FF use and reuse (Figure 5) should be interpreted as a temporal progression of proteome appearance.
The index of proteomes in the figure indeed confirms that both axes of the plot are correlated with evolutionary time and reveals once again the temporal progression of Archaea, followed by Bacteria and Eukarya.
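The two axes of the use-versus-reuse plot follow directly from per-proteome FF counts, as in this minimal sketch. The proteome names and counts are hypothetical and only illustrate the definitions of FF diversity (use) and FF abundance (reuse).

```python
# Hypothetical FF counts per proteome; zero means the FF is absent.
proteome_counts = {
    "archaeon_x": {"c.37.1.12": 5, "d.122.1.1": 0, "c.116.1.4": 2},
    "bacterium_y": {"c.37.1.12": 9, "d.122.1.1": 1, "c.116.1.4": 3},
    "eukaryote_z": {"c.37.1.12": 40, "d.122.1.1": 6, "c.116.1.4": 0},
}


def use_and_reuse(ff_counts):
    """Return (use, reuse): distinct FFs present and total FF occurrences."""
    use = sum(1 for n in ff_counts.values() if n > 0)
    reuse = sum(ff_counts.values())
    return use, reuse


points = {name: use_and_reuse(counts) for name, counts in proteome_counts.items()}
```

Plotting the resulting pairs for all proteomes, typically on logarithmic axes, gives a non-historical view of the kind described for Figure 5.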
Protein structures are unevenly distributed in the world of proteins and proteomes [13]. Genomic surveys reveal they follow power-law distributions and establish networks with scale-free properties. This shows a preference for duplication of genes encoding protein structures that are already common, a "rich get richer" process, which we here use to root our trees of FFs. Interestingly, frequency plots of fold structures for microbial superkingdoms Archaea and Bacteria had steeper slopes than those of Eukarya, showing folds accumulate at higher rates in the proteomes of complex organisms [17]. However, the most ancient folds that are shared by all organisms or are shared by Bacteria and Eukarya fitted Gaussian-like distributions characteristic of random graphs, suggesting the spread of these structures across superkingdoms is complex [17]. Figure 5 uncovers the interplay between forces that produce redundancy (e.g., gene duplication) and forces that degrade it (e.g., mutation), an interplay that is ultimately responsible for the rise and diversification of FF structural modules. In contrast to redundancy, modularity can spread pervasively in genomes, increasing their size and slowing down replication time and proliferation. Consequently, the costs of limited proliferation curb excessive increases in modularity, especially in r-selected organisms such as those of microbial superkingdoms, which can only pack a limited gene repertoire in their genomes and thrive in competitive environments. In contrast, K-selected organisms such as eukaryotes can tolerate module expansion within confines of rates of error correction in DNA replication and growth conditions dictated by the environment.