Stage 1. Detecting homologs
Often, finding a set of homologous sequences is the first step in an evolutionary analysis. The first important point to consider is the goal of the search. When the goal is to reconstruct a species tree, for most methods only orthologs should be sought, because mis-identification of paralogs as orthologs can yield an incorrect result. A few methods, however, reconstruct the species tree from the reconciliation of gene trees, overcoming the restriction to orthologous sequences, and are therefore promising; such methods depend upon the model and assumptions made during the reconciliation process. For studies of gene families, orthologs, paralogs and xenologs are needed. Another point to consider is whether to search for all homologous sequences or to limit the search to a specific group (for example only vertebrates or only mammals). Where the search is limited to a specific group, it is necessary to explain the motivation behind such a decision. Finally, the outgroup sequences used to root the tree should, when possible, be carefully chosen. Notably, numerous publications about taxon and sequence sampling exist, and considering this accumulated knowledge can help guide the search for homologous sequences.
Once the scope of the search is determined, there is the question of choosing the query sequences for the homologous sequence search. When searching for all homologs in a specific gene family, a single BLAST search of the human sequence against a standard database may miss many homologs. It is likely that another search, starting from sequences identified in the first run, would lead to the detection of additional homologs. In this respect, homology detection is often an iterative procedure in which sequences identified at each step are used to refine the search. The search stops when no new homologs are detected. Taking into account context-dependency (non-independence of sites) can further increase the power of a homology search and help identify remote homologs.
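The iterative procedure described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the `search` argument is a hypothetical stand-in for whatever homology search tool is used (it is not the interface of any real program), and the `max_rounds` cap is an added safeguard that the text does not prescribe.

```python
from typing import Callable, Iterable, Set


def iterative_homolog_search(
    seeds: Iterable[str],
    search: Callable[[str], Set[str]],
    max_rounds: int = 10,
) -> Set[str]:
    """Expand a set of homologs until a round detects nothing new.

    `search` is a hypothetical callable mapping one query identifier
    to the set of identifiers it detects.
    """
    found = set(seeds)
    frontier = set(found)  # sequences not yet used as queries
    for _ in range(max_rounds):
        hits: Set[str] = set()
        for query in frontier:
            hits |= search(query)
        frontier = hits - found  # keep only newly detected sequences
        if not frontier:  # stopping rule: no new homologs
            break
        found |= frontier
    return found


# Toy stand-in for a real search: each identifier maps to what it detects.
toy_hits = {
    "human_A": {"mouse_A", "fish_A"},
    "mouse_A": {"human_A"},
    "fish_A": {"fly_A"},
    "fly_A": {"fish_A"},
}
result = iterative_homolog_search(["human_A"], lambda q: toy_hits.get(q, set()))
```

In the toy data, `fly_A` is only reachable through `fish_A`, so a single search from the human sequence would miss it. In practice each round should also be checked for unrelated sequences, since errors propagate through later rounds.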
While BLAST is clearly the most commonly used algorithm for homology search, there are many alternative methods that can potentially detect homologs missed by BLAST and exclude unrelated sequences that it erroneously includes. These methods can be divided into sequence-based and sequence-structure-based methods. Within the sequence-based methods, we mention the PSI-BLAST algorithm, other profile search algorithms, methods using a Hidden Markov Model, and other advanced machine-learning techniques. For each method, one should remember that the default cost matrix and gap penalties used by the algorithm may not be ideal for the specific data analyzed.
While structural information may aid the detection of remote homologs, a structure-based search should be carefully considered: structural similarity alone may identify sequences that are the result of convergent evolution, rather than sequences that evolved from a common ancestral molecule.
Finally, the end result of the search should be evaluated with regard to the research question at hand: is there enough data (for example have enough taxa been sampled) to answer the set of hypotheses? Should some sequences be filtered out in order to increase the reliability of the alignment?
Stage 2. Multiple Sequence Alignment
The choice of an alignment method is critical to downstream analyses and should be considered carefully. Each alignment column is a statement of homology, representing the descent from a common ancestry. Several recent reviews have offered a detailed perspective on the field of alignment. Here we outline only some of the key issues in a phylogenetic pipeline.
The first consideration is the data that one will align. Alignment is most commonly performed at the DNA or the amino acid level. However, for protein-coding genes codon alignment is often necessary, for example for tasks which involve characterizing selection on the protein, or other codon-based analyses. In these cases, DNA alignment lacks codon structural information, and it is typically preferred to align at the protein level and back-translate amino acid gaps to three-nucleotide gaps in the corresponding DNA sequences, resulting in a codon alignment. This assumes that frameshifts never happen; statistical alignment approaches using codon models may be more robust to violations of this assumption. Alignment with empirical codon matrices is now possible in a few software packages.
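The back-translation step described above can be illustrated with a short sketch. It assumes that the coding sequence is in frame, contains no frameshifts, and has had its stop codon removed, so that it holds exactly three nucleotides per aligned residue; the function name and the example sequences are hypothetical.

```python
def back_translate(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread an unaligned coding sequence onto its aligned protein.

    Each amino acid consumes one codon and each protein gap becomes a
    three-nucleotide gap. Assumes no frameshifts and no stop codon.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) != 3 * n_residues:
        raise ValueError(
            f"CDS length {len(cds)} does not match {n_residues} residues"
        )
    codons = []
    pos = 0  # current offset into the unaligned CDS
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


codon_alignment_row = back_translate("M-KL", "ATGAAACTG")
```

The length check is deliberately strict: a mismatch between the protein and its coding sequence usually signals an annotation problem or a frameshift, which is better inspected than silently padded.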
Structural alignment (sequence alignment that is guided by one or more available structures of the proteins or RNA being analyzed) is sometimes employed for more distantly related sequences. However, sequences can slide through structures during evolution and fit of a sequence to a structure assayed with a force field is not necessarily a statement of evolutionary history. This has the potential to lead to incorrect evolutionary inference if subsequent steps like tree building are performed at the sequence level using the alignment.
Different sequence-based alignment methods can also give very different results, due to differences in assumptions and statistical and algorithmic approaches. Further, the substitution matrix and gap penalties used for alignment scoring should be tuned to the divergence of the sequences being aligned. Once an alignment is obtained, software to identify sequences or regions that are poorly aligned can be applied. However, one should keep in mind that substitutions, insertions, and deletions happen in evolution and an alignment that does not minimize such events may still be evolutionarily correct. Caution should be taken with repetitive sequences, which may introduce highly variable regions within the alignment. Furthermore, repetitive sequences and mobile elements may be homoplasious, and may thus lead to false inference of homology.
For these reasons, alignments should generally not be manually adjusted, as this is subjective and therefore not repeatable. However, it can be justified to adjust alignments based upon expected conserved biochemical or structural features (with the explicit assumption that these are conserved consistent with the evolutionary homology and therefore can be used as alignment ‘anchors’). If this is done, the criterion and justification for doing so should be explicitly stated and the pre- and post-adjustment alignments should be included in supplementary materials.
Stage 3. Quality control
The accuracy of a phylogenetic analysis does not depend only on the models and methods used, but also on the quality of the data. Multiple errors, such as taxon misidentification, sequencing error, annotation error or sequence contamination, can occur during data collection and lead to errors in public databases. Such errors are becoming more frequent, because the level and efficiency of quality controls, which were often manual in phylogenetics, have not kept pace with the flood of high-throughput data production, and because publication pressure favors the release of draft, instead of complete, genomes.
The incorrect assignments of sequences to species are particularly problematic, because erroneous, yet strong, signals are included in the data matrix, potentially yielding deeply flawed results. These errors are due to an initial incorrect taxonomic identification or, more frequently, to a contamination. Contaminations can occur at the level of the biological sample (such as DNA from parasites), of the sequencing center (such as DNA from previously sequenced organisms) and of the computational processing. For instance, the very small contigs of a draft eukaryotic genome may in fact be of prokaryotic origin. It is therefore crucial to verify the correct taxonomic assignment of each sequence, especially in multi-gene phylogenetic inference, using phylogenetic congruence, nucleotide composition, codon usage and, if necessary, additional wet experiments.
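As one illustration of a composition-based check, the sketch below flags sequences whose GC content departs strongly from the rest of a dataset. It is a crude screen under stated assumptions, not a validated contamination detector: the function names, the toy contigs and the `max_deviation` threshold of 0.10 are all hypothetical, and a flagged sequence only warrants closer inspection with the other lines of evidence listed above.

```python
from statistics import median
from typing import Dict, List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous nucleotides (A, C, G, T)."""
    seq = seq.upper()
    n_acgt = sum(seq.count(base) for base in "ACGT")
    if n_acgt == 0:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return (seq.count("G") + seq.count("C")) / n_acgt


def flag_gc_outliers(seqs: Dict[str, str], max_deviation: float = 0.10) -> List[str]:
    """Names of sequences whose GC fraction is far from the dataset median.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    gc = {name: gc_fraction(seq) for name, seq in seqs.items()}
    centre = median(gc.values())
    return sorted(name for name, value in gc.items()
                  if abs(value - centre) > max_deviation)


toy_contigs = {
    "contig_1": "ATGCATGCAT",
    "contig_2": "ATATGCATGC",
    "contig_3": "GCGCGCGCGA",
}
suspects = flag_gc_outliers(toy_contigs)
```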
Sequencing, assembly and annotation errors are also quite frequent. They are particularly detrimental to molecular evolution studies, since they can seriously inflate the number of sites inferred to evolve under positive selection or the number of insertion and deletion events. For instance, a frameshift or an incorrect exon prediction will create a long string of amino acids without similarity to those of other species, creating multiple indels and non-synonymous substitutions, even for a highly conserved region. Researchers should be aware of such potential errors, and whenever possible, aim to detect them. For example, two protein coding sequences from two diverged mammals which are identical both at synonymous and non-synonymous sites may indicate contamination.
Finally, even in the absence of errors, the accuracy of a phylogenetic inference is sensitive to the completeness of the alignment. As the effective number of taxa (hence the ability to detect multiple substitutions) is directly related to the number of known states per site, the higher the amount of missing data, the higher the risk of tree reconstruction artifacts. The existence of missing data is unavoidable, especially in a large data matrix, because of gene loss and difficulty in obtaining the sequences, but the amount and distribution of incompleteness should be clearly stated and its potentially misleading effects should be studied and/or carefully discussed.
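Reporting the amount and distribution of incompleteness can be as simple as the following sketch, which summarizes missing data per taxon, per site and overall. The function name and the default set of characters treated as missing (`-` and `?`) are assumptions; ambiguity codes such as `N` or `X` can be added through the `missing` argument where appropriate for the data type.

```python
from typing import Dict, List, Tuple


def missing_data_summary(
    alignment: Dict[str, str], missing: str = "-?"
) -> Tuple[Dict[str, float], List[float], float]:
    """Return missing-data fractions per taxon, per site, and overall."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    if n_sites == 0:
        raise ValueError("alignment has no sites")
    n_taxa = len(alignment)
    per_taxon = {
        name: sum(ch in missing for ch in seq) / n_sites
        for name, seq in alignment.items()
    }
    per_site = [
        sum(seq[i] in missing for seq in alignment.values()) / n_taxa
        for i in range(n_sites)
    ]
    # With equal row lengths, the mean over taxa equals the overall fraction.
    overall = sum(per_taxon.values()) / n_taxa
    return per_taxon, per_site, overall


toy_alignment = {"taxA": "ACGT-A", "taxB": "AC--??", "taxC": "ACGTAA"}
per_taxon, per_site, overall = missing_data_summary(toy_alignment)
```

The per-taxon and per-site views matter separately: the same overall fraction can come from a few very incomplete taxa or from gaps scattered across the matrix, and these cases may affect tree reconstruction differently.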
Stage 4. Model selection
It is important to remember that all evolutionary models are approximations of the course of evolution and thus a model can never be considered as ‘truth’. There is always a balance between over-simplified models and models which over-fit the data. Over-simplified models often ignore important aspects of the data and may lead to biased conclusions. In contrast, models that use too many parameters may over-fit the specific data, which can result in large errors in estimated parameters. In addition, over-fitting models may capture patterns that are specific to the data analyzed, and may thus lead to conclusions that do not reflect the population from which the data were sampled. Thus, the number of parameters should be tuned based on the dataset size, with larger datasets (which are becoming more frequent) allowing for parameter-rich models. Choosing the ‘right’ model for specific data is not a trivial task and thus, model selection procedures were developed in order to find the best model.
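The balance between fit and parameter number is commonly formalized with information criteria. The sketch below uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The log-likelihoods and parameter counts are invented for illustration only, and taking the number of alignment sites as the sample size n is an assumption that is itself debated.

```python
import math
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def best_model(scores: Dict[str, float]) -> str:
    """Name of the model with the lowest criterion value."""
    return min(scores, key=scores.get)


# Hypothetical fits: model -> (ln L, number of free substitution parameters).
# Branch lengths are shared by all models here, so they are left out of k.
fits: Dict[str, Tuple[float, int]] = {
    "JC69": (-5200.0, 0),
    "HKY85": (-5100.0, 4),
    "GTR+G": (-5090.0, 9),
}
n_sites = 1000  # assumed sample size

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
bic_scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in fits.items()}
aic_choice = best_model(aic_scores)
bic_choice = best_model(bic_scores)
```

With these invented numbers the two criteria disagree: AIC prefers the parameter-rich model, while the stronger penalty of BIC favors the simpler one. This is the over-fitting trade-off described above, and a reason to report which criterion was used.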
To summarize, models used in phylogenetic analyses should be justified. Notably, more than one model can often fit the data equally well, because different models handle different evolutionary properties more adequately (for example codon structure or non-stationarity of nucleotide composition). Using several well-fitting models allows the robustness of conclusions to be demonstrated. This relates to a later point, namely the benefit of using multiple methods and models to analyze a specific dataset, especially if mechanistically-motivated models are not available. It should be noted that the most widely used software for model selection analyzes only a limited diversity of models, for example variations of the GTR+I+Γ model. However, numerous alternative models that incorporate codon structure or heterogeneity of the substitution process across sites and/or over time are available and were generally shown to fit the data better. They should therefore also be considered.
The model choice should ensure that the assumptions and the features of the model enable the inferences relevant to the study goals, such as testing of specific biological hypotheses. Once a suitable model (or a set of models) has been selected, it is important to show that the model adequately describes the data under scrutiny. Indeed, goodness-of-fit tests are common practice in statistics and have been widely applied in phylogenetics. Notably, testing the adequacy of a model is not always straightforward. It is important to test the relationship between model parameters and the biological processes studied. In addition, robustness to violation of certain model assumptions may also be tested. Conclusions drawn from non-fitting models should be discussed with care. Alternatively, the researcher can apply recently developed tools to identify the misfitting parts in the data.
When relevant and possible, combining data from different sources should be considered. One has to consider whether a concatenation of molecular sequences is sensible or whether different segments should be separately analyzed. It is important to remember that different sequence segments may evolve with different evolutionary patterns as they are affected by different mutational and selective constraints. This could be reflected in a combined dataset by defining data partitions. Here model choice for different data partitions in the combined dataset will be crucial for further interpretation of results. One can consider using mixture models or asking if a network rather than a tree describes the data in this context. Networks are particularly suitable for intraspecific studies, where ancestral haplotypes/genotypes may still be extant. However, the shorter coalescent times associated with such studies mean that the use of multiple loci should be considered, since incomplete lineage sorting increases the chance that the genealogy of any single locus might not be fully representative of that of the species or populations under study. In addition, several types of data may be considered for specific analyses. For example, when inferring trees from genomic data it is possible that partitions that refer to protein sequences are analyzed at the amino-acid level while other partitions that refer to non-coding sequences are analyzed at the nucleotide level. Thus, when applicable, alternative coding and partitioning schemes for the data should be considered and justified.
Stage 5. Phylogeny inference
A vast number of evolutionary studies build and test their hypotheses based on inferred phylogenies, which should reflect the evolutionary history of a set of homologous sequences. Consequently, phylogeny inference became one of the most standard tasks in evolutionary pipelines. Until the early nineties, parsimony and distance-based tree-building methods were preferred. More recently, probabilistic model-based methods, namely the maximum likelihood (ML) and the Bayesian approaches, have grown to prominence due to their statistical properties and inferential power. Moreover, these approaches go beyond simple phylogeny inference, providing a convenient statistical framework for further model selection and biological hypothesis testing. While parsimony is sometimes justified as model-free, it has mathematical properties and is not assumption-free; therefore explicit models should be generated for many biological problems. Likewise, distance-based methods may be unreliable for highly diverged data, yet they are often model-based and have nice mathematical properties, and thus they may enable very fast and relatively accurate estimation of relevant biological parameters. Distance-based methods for tree reconstruction, such as neighbor joining, are extremely fast, and can provide reasonable solutions for extremely large data sets, something that would be much more computationally challenging with ML or Bayesian methods, even with recent computational advances. Furthermore, a candidate tree obtained with a distance method can be taken as a starting tree for ML heuristic searches. With the Bayesian approach, as a general rule, care should be taken to study the convergence diagnostics and the sensitivity of the estimates to prior distributions.
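To illustrate how a distance can be model-based, and why distances become unreliable for highly diverged data, the sketch below computes the observed proportion of differing sites p and the Jukes-Cantor corrected distance d = -(3/4) ln(1 - 4p/3). The Jukes-Cantor model is chosen here only because it is the simplest nucleotide model; the function names and the convention of returning infinity once p reaches 0.75 are assumptions of this sketch.

```python
import math


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Proportion of differing sites, ignoring columns with a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(a: str, b: str) -> float:
    """Jukes-Cantor distance in expected substitutions per site.

    The correction is undefined for p >= 0.75 (saturation); infinity is
    returned in that case, which is a convention of this sketch.
    """
    p = p_distance(a, b)
    if p == 0.0:
        return 0.0
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


d_close = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
d_saturated = jukes_cantor_distance("AAAA", "CCCC")
```

For the close pair, one difference in ten sites gives p = 0.1 and a slightly larger corrected distance, reflecting inferred multiple substitutions. As p approaches 0.75 the correction grows without bound, so small sampling errors in p translate into very large errors in the distance.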
Since the conclusions of an evolutionary study rely on an inferred tree, the statistical support for the inferred tree or particular nodes should be reported. Fortunately, the current arsenal of methods for branch support includes not only the traditional bootstrap and jackknife, but also a number of alternative methods. In particular, both Bayesian and ML programs for tree inference offer a variety of support values that are estimated along with a tree, such as posterior probabilities of clades or supports based on approximate likelihood tests. Once again, both the inferred tree and the support values of a specific node may change depending on the model assumed during the analysis. Failure of a model to account for major biological forces shaping the evolution of a sequence may lead to various systematic biases, such as the well-known Long Branch Attraction (LBA) artifact. Moreover, it is now well-documented that gene trees often do not coincide with species trees. The species tree concept has recently been questioned, particularly in prokaryotes. Considering a distribution of gene trees or a distribution of candidate trees for one specific DNA region can bring real insights into evolutionary biology, setting new standards for phylogenetic studies.
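The resampling step behind the traditional nonparametric bootstrap mentioned above can be sketched as follows: alignment columns are drawn with replacement to build a pseudo-replicate of the same size. Only the resampling is shown; inferring a tree from each replicate and counting how often a clade recurs would rely on a tree-building program and is outside this sketch. The function name and toy alignment are hypothetical.

```python
import random
from typing import Dict


def bootstrap_alignment(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement (one pseudo-replicate).

    The same column indices are applied to every taxon, so each
    resampled site remains an intact alignment column.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    if any(len(alignment[name]) != n_sites for name in names):
        raise ValueError("all sequences must have the same aligned length")
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(alignment[name][i] for i in columns)
        for name in names
    }


toy_alignment = {"taxA": "ACGTAC", "taxB": "ACGTTC", "taxC": "AGGTTC"}
rng = random.Random(42)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(100)]
```

Fixing and reporting the random seed is a small design choice that makes the support values repeatable.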
Testing some evolutionary hypotheses (for example, testing the monophyly of a group of species) may require a rooted tree. The choice of a root is not always trivial. The most common rooting method is via an outgroup, which should be selected and justified carefully. If data present a signal of non-homogeneous and non-reversible evolution in time (such as drifting GC content through evolution), it might be possible to infer the position of the root using non-reversible models. This, however, increases the complexity of the estimation problem and requires large samples with informative sequence divergence. Formal gene tree/species tree reconciliation can also be used to root a tree, when a reference species tree is available.
Further, a weak or conflicting tree signal may be indicative of the biological factors that perturb the tree-like ancestral relationships. Possibilities of such events have to be considered and may include lateral gene transfers, recombination, gene conversion, incomplete lineage sorting, gene duplication and gene loss, and sequence convergence. While a tree representation is convenient for computational purposes, methods that relax this assumption already exist, such as phylogenetic network reconstruction.
Stage 6. Inference of evolutionary selective forces
Many fundamental questions in evolutionary biology involve estimating the type and intensity of natural selection from molecular sequences. As with other tasks, methodological choices are crucial for the accuracy and power of the inference. Particularly, both ML and Bayesian approaches have been shown to be well-suited for evaluating selective pressures. Considering the state-of-the-art methodology, pairwise sequence comparisons have little value for investigating selection on specific genes. Instead, MSA-based analyses coupled with explicit evolutionary models enable estimating a variety of intricate details about the evolutionary process.
Selection analyses typically require a careful formulation of biological hypotheses to be tested. These will dictate the choice of methods and models and the types of analyses that are most appropriate for a particular case study. For example, focusing on lineage-specific selection on a protein will require using branch-specific models, while searching for specific residues under positive selection in a 3D structure will necessitate models allowing for variable selection pressure among sites or sites and branches together. Once again, the size and divergence of the dataset define the limits to the power of the analyses. For instance, detecting episodic selection at a handful of residues may require a particularly large number of sufficiently diverged sequences. Note that hypotheses should be formulated a priori (before ‘looking’ at data), and cannot be based on estimates from a related statistical analysis. In molecular biology, positive selection is ultimately used as a predictor of lineage-specific functional change, in which case additional analysis might be desirable. It is known that compensatory co-variation within proteins can account for elevated lineage-specific rates of amino acid substitution and dN/dS (possibly linked to changes in population genetic parameters). Still, lineage-specific rate variation also contains signals for lineage-specific functional change in proteins and is valid as an imperfect predictor of functional change. Additional lines of structural and functional evidence involving observed substitutions are ultimately necessary to validate predictions of lineage-specific functional change.
Reliability of estimates and tests should be reported. Additionally, the possibility of known artifacts and model assumptions should be addressed: how would the results be affected by uncertainty in the alignment and phylogeny and various data biases (for example recombination, selection on synonymous positions, saturation of substitutions, codon usage preference, heterogeneous GC)? When the same hypothesis for positive selection is tested multiple times, a suitable correction is often necessary.
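As an example of such a correction, the sketch below computes Benjamini-Hochberg adjusted p-values, which control the false discovery rate. This is only one possible choice and is not prescribed by the text; family-wise procedures such as Bonferroni or Holm are stricter alternatives. The p-values are invented for illustration.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, in the input order.

    The i-th smallest p-value is scaled by m / i, then a running minimum
    from the largest rank downwards keeps the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from testing four genes for positive selection.
raw = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
```

In this toy case three of the four raw p-values fall below 0.05, but only the smallest remains below that level after adjustment.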
Stage 7. Interpretation and conclusions
Even when all the analyses are completed, it is premature to conclude that the study is complete. The interpretation of the results is as important as the motivation and design of the study. The study has little value without understanding the significance of results and putting them into a wider biological context. Here the first prerequisite is a thorough knowledge of the literature concerning the system of interest: mining for additional evidence from the literature, experiments, or support from complementary data is essential at this stage. Valuable insights may sometimes come from connecting previously disparate reports, and fully using additional available information, such as the paleontological record, functional and structural annotations, expression levels, ethological data, phenotypic characteristics, and protein sequence-structure-function studies (for example, those involving mutagenesis experiments). Indeed, a multitude of factors at play need to be considered to truly understand the workings of complex biological systems.
For example, to gain a meaningful interpretation of a phylogeny, the integrity of signals from ecosystem, development, physiology, protein structure and function, gene linkage, and other biological sources needs to be considered. Similarly, studies of selective pressures on a lineage of a particular gene family may benefit from further analysis of the positively selected sites in the context of functional data and/or protein structure, including additional computational experiments using the techniques of structural bioinformatics. This may help to understand how natural selection actually acts on a protein.