Phylogeny reconstruction of the prokaryotic genomes
We downloaded from GenBank (Oct 2013) more than 15,000 bacterial and archaeal genomes, henceforth referred to as the prokaryotic genomes. After discarding genomes containing more than 1000 contigs and genomes with less than 400 protein-coding genes, the remaining 14,727 prokaryotic genomes were analyzed in this study and their metadata was provided in Additional file 1. A phylogenetic tree of these genomes (see Figure 1) was built using PhyloPhlAn [21] and FastTree [22] based on proteins found by Prodigal [23]. Prodigal was chosen in order to standardize gene calling across all genomes, as GenBank gene predictions were shown to have more variation as a result of the different methods used by the submitters [24]. Prodigal predictions were also found to have more consistent identification of translation initiation sites than GenBank records, as in Burkholderia genomes [25]. The phylogenetic tree in the Newick format is available in the Additional file 2. The full lineages of the genomes were extracted from the NCBI Taxonomy database and their phyla are color-coded in Ring 1 of Figure 1. Most phyla were monophyletic and occupied contiguous sectors of the Ring 1, except that Proteobacteria was separated into two clades. This indicated a general agreement between the taxonomic classification and phylogenetic distribution of these genomes. Due to the errors in either taxonomic classification or phylogeny reconstruction, there was also occasional inconsistency, indicated by distinct colored lines within some contiguous phylum sections in Ring 1.
To compare the gene calling results from Prodigal predictions and GenBank annotations, we built two phylogenetic trees for the 10,075 genomes that had GenBank annotations. One tree was based on proteins called by Prodigal and the other was based on proteins provided in GenBank (see Additional file 3). The Robinson-Foulds [26] distance between the two trees, defined as the total number of partitions of tips implied by one tree but not the other, was 8494. This was 45% of 18,916 total partitions in the two trees. The branch score [27] defined by the square root of sum of squared differences between branches in the two trees was 2.24, comparing to the total internal branch length of 1074.04 in the two trees. These two distance measures indicated that there were many discordant internal branches between the two trees, but the lengths of these branches were very small.
The taxonomic classification and the phylogenetic tree (see Figure 1) exhibited highly uneven distribution of the sequenced genomes among different phyla. 85% of the genomes belonged to the three most sequenced phyla (7120 genomes in Proteobacteria, 4137 in Firmicutes, and 1294 in Actinobacteria), whereas 17 phyla have 3 or fewer sequenced genomes. The completion status of the genomes is indicated in Ring 2 of Figure 1 (finished genomes in black lines and draft genomes in white lines) and the number of contigs in a genome is shown by the height of its corresponding bar in Ring 3. A few phyla had very high percentages of finished genomes, including Chlamydiae (71% of 146 genomes), Tenericutes (58% of 133 genomes) in the bacterial domain, as did the most phyla in the archaeal domain.
Re-annotation of the prokaryotic genomes
The prokaryotic genomes were annotated by their submitters to GenBank. We observed a significant degree of inconsistency between the GenBank annotations of different genomes in the following aspects: (i) terminology: genomes have different names for orthologous proteins, (ii) ontology: some genomes have extensive annotations in GO terms and EC numbers and some others do not, and (iii) annotation coverage: closely related genomes have different percentages of genes with assigned functions. The inconsistency probably stemmed from the submitters’ different annotation procedures that may involve a variety of algorithms and databases for inferring functions. In addition, the inconsistency also existed between genomes deposited by the same submitter at different times, because the algorithms and databases used for annotation are continuously improved, but the genome annotation in GenBank was rarely updated after initial submission. Because such inconsistency in the existing GenBank annotations may prevent accurate comparison of cellular functions across genomes, we re-annotated the 14,727 prokaryotic genomes in three steps: gene calling, protein functional annotation, and metabolic pathway inference.
Prodigal [23] was used to predict protein-coding genes in the genomes. The numbers of predicted genes (shown by the height of green bars in Ring 4 of Figure 1) were comparable between neighboring genomes in the phylogenetic tree. The number of genes in a genome was not correlated with the completion status of the genome or its number of contigs. This suggested that, although draft genomes were likely fragmented by many repeat sequences, most of them could still provide sufficient coverage of the gene content of the genomes.
The predicted protein-coding genes were re-annotated using UniFam_Prok. The UniFam protein family database (freely available at http://unifam.omicsbio.org) was constructed from the UniProt database as described in the Material and Methods. The manually curated annotations of proteins in SwissProt were used as the source of function information in standardized terminology. Protein families were built around the SwissProt proteins with their close homologs in TrEMBL. Substantial sequence diversity was obtained for many protein families, owing to the enormous sequence space of TrEMBL. In comparison with many existing protein family databases, UniFam provided the following advantages for genome annotation. First, UniFam provided a comprehensive coverage to alleviate the need for searching multiple databases and combining the results. Ring 5 of Figure 1 shows the percentages of annotated proteins for all the genomes. On average, 70% of the predicted proteins were annotated in a genome. 10,992 out of 14,727 genomes had more than 65% of their proteins annotated by UniFam. Second, because only the manually curated annotations of proteins in SwissProt were used as the source of function information, the UniFam families were associated with standardized protein product names, extensive GO term annotations, and EC numbers. Third, because the UniFam families were built from proteins with stringent whole sequence alignment requirement, we believe the genome annotation by UniFam is reliable. Although it is difficult to rigorously benchmark the accuracy of genome annotation, our manual analysis of selected genomes indicated high quality annotation of UniFam. Finally, the annotation of a microbial genome takes approximately 4 CPU hours on average using UniFam, which made it computationally feasible to re-annotate 14,727 genomes. UniFam will be updated in synchronization with the UniProt database. The annotation of the UniFam families will improve with the ongoing SwissProt curation effort. The sequence diversity of the UniFam families will increase with the expansion of the TrEMBL database.
The UniFam annotation was compared with annotations by RAST [28] and Prokka [29] on five representative genomes: two Escherichia coli K-12 genomes (a finished W3110 genome and a draft MG1655 genome), two Pseudomonas aeruginosa genomes (a finished M18 genome and a draft SJTD1 genome), and one finished archaeal genome Methanocaldococcus jannaschii DSM 2661. The same proteins predicted by Prodigal were provided to the three annotation pipelines. The annotation results in gene names, gene product names and EC numbers from the three pipelines are available in Additional file 4. Only UniFam annotations provided GO terms. UniFam annotated more proteins in the two E. coli genomes and the M. jannaschii genome, but less in the two P. aeruginosa genomes, than Prokka and RAST (see Additional file 5). The consistency between the genome annotations were measured based on EC numbers (see Additional file 6), because two EC numbers can be compared exactly, and the EC number annotation is a primary information source for pathway inference by Pathway Tools. The EC number consistency between the three pipelines was highest in E. coli, followed by P. aeruginosa, and finally by M. jannaschii. The agreement between UniFam and each of the other two methods was better than that between Prokka and RAST. UniFam annotation was more computationally expensive than Prokka and RAST. The P. aeruginosa M18 genome was annotated by UniFam using 4.2 CPU hours, by Prokka using 1.3 CPU hours, and by RAST using 0.2 CPU hours.
Metabolic pathways were inferred for all genomes with Pathway Tools [30] based on UniFam annotations. In comparison with pathway inference tools such as KEGG [31] and others [32], Pathway Tools provides a comprehensive coverage of metabolic reactions and pathways [33] and allows automatic processing of a large number of genomes. The pathway inference was facilitated by the extensive EC number and GO term annotation by UniFam. 706 prokaryotic pathways in the MetaCyc database [34] were found in the studied genomes. The numbers of predicted pathways across genomes are shown in Ring 6 of Figure 1. Genomes in the Proteobacteria phylum had more pathways inferred than other phyla, whereas the archaeal genomes had much fewer inferred pathways. The number of inferred pathways in a genome correlated with numbers of predicted and annotated proteins of the genome. In addition, the distribution of curated pathways in MetaCyc among different clades also affected the number of inferred pathways in a genome.
The genome annotation results are provided at http://unifam.omicsbio.org and will be regularly updated to keep pace with the growth of GenBank and newer releases of Prodigal, UniFam and MetaCyc. Although every annotation step was performed with stringent settings to minimize errors, we cannot obtain the ground truth or perform extensive manual curation for such a large set of prokaryotic genomes. Thus, it was not the goal of this study to examine the cellular functions of individual organisms. Instead, the functional phylogenomics analysis was focused on the aggregate results of a large number of diverse genomes to minimize the adverse effect of occasional errors in the phylogenetic tree reconstruction, genome annotation, and pathway inference.
Association of cellular functions with the taxonomic classification
The overall functional profile of a taxon can be represented by the collection of the biological process GO terms of the proteins encoded in its genomes. The differentiation between the functional profiles of taxa at a given taxonomic rank was measured using GO term enrichment analysis [35]. Because there were too many bacterial genomes for enrichment analysis and result presentation, archaeal genomes were chosen to showcase the differentiation at the phylum and family levels (see Additional file 7). The enrichment analysis was conducted only on taxa containing sufficient genomes: ≥ 10 genomes for a phylum and ≥ 5 for a family. Additional file 7 lists the top ten most enriched GO terms in each phylum compared to the average in the whole archaeal domain, and the top five most enriched GO terms of each family compared to the average in the corresponding phylum.
The Crenarchaeota phylum contains many thermophilic organisms. The enrichment of the biological processes in defense response to viruses indicated that Crenarchaeota has heavier genomic investment in virus defense than the other phyla, which may suggest prevalent viral infection in its environments. Euryarchaeota was specialized in methanogenesis as expected, but it also appeared to have more “intelligence-related” cellular functions, such as signal transduction and transcription regulation. Out of eight families in Euryarchaeota, the five known methanogen families were found to be specialized in methanogenesis along with different sets of other biological processes.
The metabolic pathway profile of a genome was represented by a 0–1 vector of length 706 with 0 for absence and 1 for presence of the 706 pathways in this genome. The metabolic pathway profiles of individual genomes in a genus were merged into an aggregate genus pathway profile (a pathway profile of a genus pan-genome) to reduce the effect of uneven genera representation on the analysis. The 1206 genera were hierarchically clustered based on their metabolic pathway profiles and 706 metabolic pathways were clustered based on their distribution across the genera (see Figure 2). There were some universal pathways clustered in the right section, which were detected in almost all genera. Examples included tRNA charging, UTP and CTP dephosphorylation I, adenosine ribonucleotides de novo biosynthesis, and glutamine degradation I. Lack of these pathways in the very few genera is likely due to the incompleteness of the genomes. There were also rare pathways that existed only in very few genera, such as quinate degradation II (13 genomes in 2 genera: Corynebacterium and Ketogulonicigenium) and myrcene degradation (4 genomes in 2 genera: Gordonia and Nocardia).
The hierarchical clustering of the genera was based on the similarity between their pathway profiles without using the taxonomic or phylogenetic information. The domain and phylum classifications of the genera are marked in the left and right vertical strips, respectively, in Figure 2. The archaeal genera are clearly clustered together. They were distinguished by some Archaea-specific pathways, such as methanogenesis from CO2, tRNA splicing, starch degradation V, selenocysteine biosynthesis II, and chitin degradation I. They also lacked some pathways universally found in Bacteria, such as tRNA processing and thiazole biosynthesis I. The clustering of genera was also relatively consistent with their phylum classification, in that the clusters formed continuous blocks of genera from the same phyla. This was a result of many taxa-specific pathways that were commonly found in only a few selected taxa. On the other hand, genera from the same phylum were also often separated into several disjoint clusters, indicating divergence of the functional profiles of genera in the same phylum.
Association of cellular functions with the phylogenetic tree
The metabolic pathway profiles of genomes were superimposed on their phylogenetic tree to assess the phylogeny-function correlation. Nine representative pathways are shown in Figure 3. The tRNA charging pathway (Ring 1) was found in 99.9% of the genomes and, therefore, was likely to have been passed from an ancient ancestor to all the descendants. The absence of this pathway in 3 genomes can be attributed to the low genome quality, instead of biological effect. The mercury detoxification pathway (Ring 6) existed in only 15.7% of the genomes, but dispersed among many distant clades of the tree. The phylogenetic distribution of this pathway suggested extensive horizontal gene transfer events across the bacterial domain, probably as a result of the selection force of potentially mercury-contaminated environments and the high transferability of this cellular function. In contrast, the arsenate detoxification pathway (Ring 7) found in 14.8% of the genomes was much more concentrated on selected clades. It was not clear why the two heavy-metal detoxification pathways have such different levels of correlation with the phylogeny.
A pathway can be viewed as a binary character observed at the tips of the phylogenetic tree, with its presence and absence in a genome as the two states. The parsimony score [36] for a pathway on the tree is the minimum number of state changes (from presence to absence or vice versa) needed along the branches to produce the observed presence/absence pattern of the pathway at the tips. Retention index (RI) [37], which normalizes the parsimony score of a pathway with the number of its occurrences in the genomes, was also calculated to measure how well a character conforms to the phylogenetic tree. A higher RI indicates a better fit of a character to the tree. For example, aerobic respiration (cytochrome C) (Ring 2) was found in 9383 genomes with parsimony score 290 and RI 0.95, whereas methylotrophy (Ring 5) was found in 1353 genomes with a parsimony score 592 and RI 0.56. This is in line with the previous observation that methylotrophy has undergone extensive horizontal gene transfer, while aerobic respiration is consistent with the phylogeny [20]. The parsimony scores and the genome occurrence frequencies of the 706 pathways are shown in Figure 4. Pathways were categorized into consistent pathways with RI > 0.9 and inconsistent pathways with RI < 0.7. Out of the 706 pathways in Bacteria and Archaea, 31% were consistent pathways and 26% were inconsistent pathways. The complete list of the 706 pathways with their parsimony scores, occurrence frequencies in the genomes, and RIs is available in Additional file 8.
Phosphate acquisition was an example of consistent pathways (Ring 4). This pathway was found in 6005 genomes with parsimony score 528 and RI 0.91. It was concentrated in 16 bacterial phyla and 2 archaeal phyla. Of the bacterial genomes, it was mostly found in Proteobacteria (3882 out of 7120 genomes), Firmicutes (1343 out of 4137), Actinobacteria (434 out of 1294), Bacteroidetes (170 out of 488), and Cyanobacteria (69 out of 173). Nitrogen fixation was an example of inconsistent pathways (Ring 5). It was found in 1121 genomes with parsimony score 409 and RI 0.64. This pathway is known to have been horizontally transferred between many species [20]. Although both phosphate acquisition and nitrogen fixation provide essential nutrients for microorganisms, they have very different occurrence frequencies and horizontal gene transfer behaviors in the bacterial genomes.
Two biosynthesis pathways are also shown in Figure 3. Isopenicillin N biosynthesis (Ring 8) occurred in 2236 genomes with parsimony score 387 and RI 0.83. Lysine biosynthesis I (Ring 9), which involves succinylated intermediates, was found in 8372 genomes with parsimony score 149 and RI 0.97, showing much higher prevalence and consistency. The two pathways were both mostly found in the phyla of Proteobacteria, Actinobacteria and Bacteroidetes. However, isopenicillin N biosynthesis had a very sparse distribution, while lysine biosynthesis I had a nearly uniform distribution in these phyla.
From the presence/absence pattern of a pathway at the tips of the phylogenetic tree, the ancestral states of the pathway at the internal nodes of the tree were inferred with the parsimony method [36],[38],[39]. The divergence time of the internal nodes of the tree was also estimated [40] with a scaled reference age of 100 for the root. Combining the ancestral states of a pathway and the divergence time at internal nodes, the state changes (gain or loss) of the pathway were clocked on the phylogenetic tree.
Figure 5 shows four pairs of subtrees for the representative pathways examined in Figure 3. Without considering horizontal gene transfer from eukaryotes, the first red node from the root on the subtree for a pathway marks the first occurrence of the pathway in nature through evolutionary innovation. The subsequent red nodes may result from independent evolutionary innovation or horizontal gene transfer. The first pair (5A and B) displayed very distinct evolutionary patterns. Aerobic respiration had very early first appearance and was well maintained in most clades, whereas methylotrophy was a more recent innovation and underwent many changes. The second pair (Figure 5C and D) had similar parsimony scores (528 and 409), but very different genome occurrence frequencies (6005 and 1121). The gains of phosphate acquisition were concentrated in only a few clades, while the gains of nitrogen fixation were spread across many clades. The third pair (Figure 5E and F) shows two heavy-metal detoxification pathways. The arsenate one showed a small number of losses and gains in a few clades, while the mercury one had a few ancient gains and losses followed by considerable recent gains and losses in many clades. For the two biosynthesis pathways (Figure 5G and H), isopenicillin N biosynthesis went through frequent gains and losses, probably in response to selection pressure in specific environments; whereas lysine biosynthesis I was maintained stably with few losses in almost all clades since their ancestors first gained this pathway.