Sequence data and alignment
We assembled and aligned protein sequences of all 467 potentially orthologous groups from complete genome databases and these were supplemented with additional eukaryote taxa from the sequence database of the National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/entrez/. Data from the following species were assembled into presumptive orthology groups (hereafter, proteins) and aligned [36]: Aquifex aeolicus, Bacillus subtilis, Borrelia burgdorferi, Chlamydia trachomatis, Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Mycobacterium tuberculosis, Mycoplasma genitalium, Mycoplasma pneumoniae, Rickettsia prowazekii, Synechocystis sp., and Treponema pallidum (Eubacteria), Archaeoglobus fulgidus, Methanobacterium thermoautotrophicum, Methanococcus jannaschii, Pyrococcus abyssi, and P. horikoshii (Archaebacteria: Euryarchaeota), and Arabidopsis thaliana, Caenorhabditis elegans, Dictyostelium discoideum, Drosophila melanogaster, Gallus gallus, Homo sapiens, Mus musculus, Oryza saticva, Plasmodium falciparum, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Trypanosoma sp., and Xenopus laevis (Eukaryota). The genome sequence of Aeropyrum pernix (Archaebacteria: Crenarchaeota) became available during completion of study and was included in phylogenetic analyses only.
Global alignment algorithms differ from local alignment algorithms in that they sometimes align unrelated (non-homologous) sites together with homologous sites. Using a computational tool, xcons [36], such unrelated sites were removed from these CLUSTALW [37] alignments to increase probability of site homology. During construction of protein alignments using the WAT system [36], short fragmented sequences were manually removed. Of the 204 proteins that could be calibrated for time estimation, the orthology of roughly half (116 proteins) was ambiguous for unknown reasons (e.g., lateral gene transfer, gene loss, or poor phylogenetic resolution) leaving 87 proteins for phylogeny and time estimation. The seven shortest (<75 amino acids) of those were used only in phylogenetic analyses; the remaining proteins averaged 196 amino acids each. Where possible, proteins were rooted by duplicate proteins (duplicate genes); otherwise, they were midpoint-rooted.
Separately, for timing the origin of Giardia, sequences of 17 proteins were obtained from the public databases and aligned [37] in which the following taxa were available: Giardia and other eukaryotes (including calibration taxa; see below), archaebacteria, and eubacteria. Correspondence and requests for materials should be addressed to S.B.H. (e-mail: sbh1@psu.edu) or see http://www.evogenomics.org/Publications/data/Eukaryotes/ for alignments and other information.
Time estimation
Methods are described elsewhere [38] except as follows. Our initial goal was to estimate divergence times for the last common ancestor (LCA), the divergence between archaebacteria and eukaryotes (AK), cyanobacteria and closest eubacterial relatives (origin of cyanobacteria, BC), eubacteria and mitochondrial eukaryotes (origin of mitochondria, BK-m), and Giardia and other eukaryotes (GK) (Fig. 1). The importance of Giardia is its lack of mitochondria and basal location in many phylogenies of eukaryotes [27, 39].
However, our initial phylogenetic analyses revealed that many eukaryotes did not cluster with Rickettsia, the α-proteobacterium, as predicted by current genomic models [8, 25]. Instead, they typically formed a basal lineage among eubacteria in the tree. This result was consistent with the serial endosymbiosis theory [3] and with other findings [6] and therefore we designated this divergence as BK-o (origin of eukaryotes). Estimation of the divergence time of the origin of plastids (BK-p) was not a goal of this study, and the LCA was not estimated because of an insufficient number of duplicate proteins needed for reciprocal rooting [23]. Thus, five divergence times were studied: AK, BC, BK-o, BK-m, and GK. Eukaryotes derived from different prokaryotes are referred to herein as KA (from AK), KB-o (from BK-o), and KB-m (from BK-m).
Because of the large amount of sequence conservation in these proteins, it was not possible to calibrate directly by extrapolation from vertebrates [40], for which an extensive fossil record exists. For example, sequences often were identical among rodents, primates, and birds. Instead, multiple calibrations were used from older divergences among kingdoms (plants, animals, fungi) and animal phyla, derived from analysis of 75 nuclear proteins calibrated with the vertebrate fossil record [38]. This two-step calibration reduced the error involved in extrapolation. Two classes of time estimation methods were used and compared. The multigene (MG) approach uses the mean or mode of many single-gene time estimates [40, 41] whereas the average-distance (AD) approach [42–44] involves the combining of distances and rates among genes or proteins to yield a single time estimate. For the AD approach, we weight each single-gene distance, before combining, by the length of the protein (aligned amino acids).
Protein-specific rates were estimated by regression, fixed through the origin, of these calibration points within eukaryotes: arthropod-chordate (0.993 Ga), chordate-nematode (1.177 Ga), and plant-animal-fungi (1.576 Ga) [38]. During the course of the study, it was discovered that the shape parameter (α) of the gamma distribution used to account for rate variation among sites, estimated by a likelihood method [45], differed consistently between calibration taxa (average, 1.99) and the overall data set (1.44) for each gene (Fig. 2). Therefore, a dual-gamma approach was taken whereby the eukaryote rate was estimated using the eukaryote gamma parameter and the time estimate (involving prokaryotes) was made using the overall gamma parameter. There is insufficient evidence at present to determine whether or not this difference is biologically based, related to the covarion model [46], or follows a simple scaling relationship with time or total protein distance. If the relationship is scaled, additional modification in methods may be necessary in the future.
We compared rates of change in archaebacterial versus eubacterial sequences using paralogous sequences (those related by gene duplication) as a root for relative rate tests [47, 48]. To examine rate differences between eukaryotic sequences and their closest prokaryote orthologs (those representing the same gene), we used the more distant prokaryote (archaebacteria or eubacteria) as root. For examining rates in eukaryotic sequences derived from either archaebacteria or eubacteria, we compared pairwise distances of the same taxa present in both locations (e.g., one pair clustering with archaebacteria and the other with eubacteria) in the same protein. The discovery of rate differences among prokaryotes and eukaryotes required rate adjustments for all proteins and comparisons, including those accepted in rate tests. These adjustments were made by estimating time only with the eukaryote lineage, or in the case of BC, using a cyanobacterial rate adjusted by direct comparison of the cyanobacteria branch and eukaryote branch in rate tests. For example, the AK divergence time was estimated only with the KA calibration and the BC, BK-o, and BK-m divergence times were estimated only with the KB-o calibration. These restrictions further reduced the number of proteins available for time estimation to the following: AK (36 total, 21 constant-rate), BK-o (25, 16), BK-m (7, 5), BC (20, 16), and Giardia-eukaryotes (17, 11).
Modes were used in the MG approach, as described previously [40], except with the BK-m comparison where the median was used because of the small number of proteins. The mode is preferred over the mean or median because it eliminates or reduces the effect of outliers (e.g., unusually high estimates resulting from paralogous comparisons). In this study, a large number of overlapping bins was used initially to better define the distribution of time estimates, followed by use of a smaller number of non-overlapping bins and standard estimation of mode by interpolation. This two-step procedure was found to reduce the influence of bin size on mode estimation. For the AD approach, single-gene distances and rates were weighted by sequence length and then combined distances were divided by combined rates.
Phylogeny estimation
Phylogenetic trees [49] were constructed for each gene from amino acid data to assist in gene selection and interpretation. A gamma distance was used for all trees, with α estimated from the entire data set [45]. An analysis involving combined protein alignments was performed with maximum likelihood [50], neighbor joining [49, 51], and maximum parsimony [51], using bootstrapping. Bootstrap consensus trees show branch-lengths estimated by ordinary least-squares method [51]. Bootstrap support ≥ 95% was considered significant.