Skip to main content

Relative model selection of evolutionary substitution models can be sensitive to multiple sequence alignment uncertainty



Multiple sequence alignments (MSAs) represent the fundamental unit of data inputted to most comparative sequence analyses. In phylogenetic analyses in particular, errors in MSA construction have the potential to induce further errors in downstream analyses such as phylogenetic reconstruction itself, ancestral state reconstruction, and divergence time estimation. In addition to providing phylogenetic methods with an MSA to analyze, researchers must also specify a suitable evolutionary model for the given analysis. Most commonly, researchers apply relative model selection to select a model from candidate set and then provide both the MSA and the selected model as input to subsequent analyses. While the influence of MSA errors has been explored for most stages of phylogenetics pipelines, the potential effects of MSA uncertainty on the relative model selection procedure itself have not been explored.


We assessed the consistency of relative model selection when presented with multiple perturbed versions of a given MSA. We find that while relative model selection is mostly robust to MSA uncertainty, in a substantial proportion of circumstances, relative model selection identifies distinct best-fitting models from different MSAs created from the same set of sequences. We find that this issue is more pervasive for nucleotide data compared to amino-acid data. However, we also find that it is challenging to predict whether relative model selection will be robust or sensitive to uncertainty in a given MSA.


We find that that MSA uncertainty can affect virtually all steps of phylogenetic analysis pipelines to a greater extent than has previously been recognized, including relative model selection.

Peer Review reports


One of the first steps in a phylogenetic analysis, and indeed any comparative sequence analysis, is to construct a multiple sequence alignment (MSA) from homologous sequences of interest. In spite of decades of advances in MSA construction algorithms, it remains difficult to confidently generate a reliable and accurate MSA that fully considers the evolutionary process, and different software platforms and/or algorithm parameterizations are known yield different MSAs of varying quality [24, 29, 36, 39]. A substantial body of research has shown that errors in the MSA itself can influence, with the potential to produce spurious results, downstream analyses such as phylogenetic reconstruction [5, 39], phylogenetic topology tests [18], ancestral state reconstruction [5], divergence time estimation [10], and identification of positive selection [15, 20, 27].

While the effects of MSA uncertainty in phylogenetic pipelines have been heavily studied, the MSA is not the only piece of information that is inputted to phylogenetic reconstruction and other evolutionary-informed analyses. A suitable model of sequence evolution for the data at hand must also be provided [4]. Most models of sequence evolution consider a continuous-time reversible Markov process, and there are dozens, if not hundreds, of available parameterizations for nucleotide and amino-acid models [40]. One of the most common approaches used to identify a suitable model for phylogenetic inference is relative model selection, wherein a set of candidate models are ranked according to a given goodness-of-fit measurement, and the best-fitting model is then used in the phylogenetic reconstruction [34]. Although recent studies have suggested that relative model selection may not be a critical step in phylogenetic studies [2, 31, 33], it remains an enduring staple of most analysis pipelines. Henceforth, we use the phrase “model selection” to refer specifically to relative model selection, unless otherwise stated.

When performing model selection, the MSA of interest is generally fixed, yet there is rarely a guarantee that the MSA has precisely identified site homology and/or insertion/deletion (indel) events. Indeed, there are many MSAs which could be in theory derived from a given set of homologous sequences, but how relative model selection would perform on a different MSA version is unclear. In other words, if there were N different MSAs which could be generated from the same set of orthologs, it is unknown whether relative model selection will consistently identify the same best-fitting model for all N MSAs. Previously, Abdo et al. [3] showed that relative model selection is robust to the guide tree used to assess model fit, but it remains unknown to what extent uncertainty in the MSA will influence the procedure. This dearth of understanding is a notable omission from both the model selection literature and from the literature investigating effects of MSA uncertainty on phylogenetic applications.

Here, we aimed to bridge this gap through a systematic analysis of whether MSA uncertainty can affect the results of model selection. We examined whether model selection is robust to MSA uncertainty across thousands of sets of natural orthologous sequences, at both the amino-acid and nucleotide levels, by interrogating whether model selections identifies different optimal models for different versions of MSAs created from the same underlying set of orthologs. We broadly found that there is potential for model selection, in particular on nucleotide data, to identify different best-fitting evolutionary models for different MSA versions created from the same ortholog set. Moreover, we find that it is challenging to predict whether model selection will be robust or sensitive to uncertainty for a given MSA. Our results demonstrate that MSA uncertainty may be even more of a pervasive influence in phylogenetic analysis pipelines than has previously been recognized.


We obtained sets of unaligned orthologs (“datasets”; Table 1) from databases Selectome [21], including both Euteleostomi and Drosophila sequences, and PANDIT [37]. For each dataset, we generated 50 unique MSA variants using a GUIDANCE2-based approach [29, 32], with one MSA designated as the reference MSA and the remaining 49 designated as perturbed MSAs. The GUIDANCE2 algorithm generates alternative MSAs by varying the guide tree and gap opening penalties used by progressive alignment algorithms during MSA reconstruction. For each resulting MSA, we used ModelFinder [16, 23] to determine the best-fitting model with each of the three information theoretic criteria AIC, AICc, and BIC. For all steps, we analyzed both the nucleotide (NT) and amino-acid (AA) versions (“datatypes”; Table 1) of each dataset. In total, for each datatype, we processed 1000 Drosophila and Euteleostomi datasets each from Selectome and 236 datasets from PANDIT, more details for which are available in “Methods” secion. Associated acronyms and terms are defined in Table 1.

Table 1 Terminology used throughout text

Distinct models are often associated with variant MSAs

We first investigated model selection’s consistency by tabulating how many distinct best-fitting models were identified across a given dataset’s 50 variant MSAs. If model selection were robust to MSA uncertainty, we would observe the same best-fitting model for all variant MSAs. Alternatively, if model selection were sensitive to MSA uncertainty, we would observe multiple distinct best-fitting models identified across a given dataset’s variant MSAs. We refer to datasets adhering to the former scenario (one model for all variant MSAs) as a “stable” and datasets adhering to the latter scenario as “unstable” (Table 1).

Overall, we observed distinct stability patterns across datatypes and dataset source (Fig. 1). Drosophila datsets tended to be more unstable, with a pronounced instability increase for nucleotide datasets compared to amino-acid datasets. By contrast, Euteleostomi datasets tended to be more stable, with a pronounced stability increase for amino-acid datasets compared to nucleotide datasets. Finally, PANDIT amino-acid datasets tended to be more stable compared to their nucleotide counterparts. Stability was broadly consistent across the theoretic information criterion used, suggesting that all criteria are similarly influenced by MSA uncertainty.

Fig. 1
figure 1

Percent of stable and unstable (defined in Table 1) datasets within each of the three data sources, Selectome Euteleostomi [1000 amino-acid (AA) and nucleotide (NT) datasets each], Selectome Drosophila (1000 AA and NT datasets each), and PANDIT (236 AA and NT datasets each). Bars are colored based on the information theoretic criterion used for model selection

We next counted the total number of selected models identified for unstable datasets (Fig. 2a for AIC results and Additional file 1: Fig. S1a, c for BIC and AICc results, respectively). Results were again broadly consistent among information theoretic criteria. Across both amino-acid and nucleotide datatypes, the vast majority of unstable datasets only had two associated best-fitting models; for example, as shown in Fig. 2a, that roughly 65% of all Drosophila amino-acid datasets were associated with two distinct best-fitting models with model selection by AIC. Across all amino-acid datasets, roughly 17–34% of ortholog datasets were associated with three or more models, and across all nucleotide datasets, roughly 32–52% of ortholog datasets were associated with three or more models, demonstrating that model selection is indeed influenced by MSA uncertainty in a substantial proportion of cases. Most notably, model selection was most sensitive to MSA uncertainty on Drosophila NT datasets, of which nearly 10% of datasets selected five or more best-fitting models.

Fig. 2
figure 2

RMS with AIC on unstable datasets. Corresponding figures for model selection with BIC and AICc are in Additional file 1: Fig. S1). a Barplot of the total number of unique selected models across the 50 variant MSAs generated for each dataset. Percentages of unstable datasets are shown separately for each dataset source (Drosophila, Euteleostomi, PANDIT) and datatype (AA and NT). b Histogram of the percentage of MSA variants whose selected model matched \(M_0\) (as determined by the respective information theoretic criterion), the most commonly selected model among all 50 MSA variants (Table 1)

We next assessed, for each unstable dataset, the percentage of MSA variants whose selected model matched the dataset’s most frequently identified best-fitting model, \(M_0\) (Table 1). We observed substantial variation across datasets in how commonly the \(M_0\) model was selected (Fig. 2B for AIC and Additional file 1: Fig. S1b, d for BIC and AICc, respectively). Taken together, these results suggest that there is strong potential for MSA uncertainty to affect model selection, and there is no guarantee that model selection will consistently identify the same best-fitting model for different MSA versions, amino-acid or nucleotide alike, derived from the same underlying data.

We next looked more closely at the differences among selected models for unstable datasets. Specifically, assuming the same datatype (amino-acid or nucleotide), evolutionary models can differ in two primary ways. First, models may have entirely different Q matrices, reflecting fundamentally different evolutionary patterns (e.g. WAG [38] vs. JTT [14] for amino-acid data). Alternatively, models can have the same underlying Q matrix, reflecting a shared evolutionary process, but differ in their additional parameterizations including the specification of stationary frequencies (F), presence of among-site rate variation (ASRV, represented here by a four-category discrete Gamma distribution and denoted +G), or proportion of invariant sites (+I). For example, the two models WAG+G and WAG+G+I would have the same Q matrix but different additional parameterizations.

We therefore asked, specifically among unstable datasets, whether best-fitting models across MSA variants shared the same Q matrix (Fig. 3 for AIC and Additional file 1: Fig. S2 for BIC and AICc). To this end, we compared the Q matrix selected for each variant MSA to the respective \(M_0\) model’s Q matrix, identifying each as either the same or different Q matrix.

As with previous analyses, we observed distinct trends between amino-acid and nucleotide datasets, but overall similar trends among information theoretic criteria and dataset sources examined. For Drosophila and PANDIT amino-acid datasets, roughly half of the selected models shared the \(M_0\) model’s Q matrix, but for over 60% of Euteleostomi amino-acid datasets, the selected model’s Q matrix matched the \(M_0\) matrix (Fig. 3). By contrast, for unstable nucleotide datasets from all data sources, the vast majority of selected Q matrices differed from the \(M_0\) model’s Q matrix.

Fig. 3
figure 3

Consistency of Q matrices of selected models across MSA variants. “Different Q matrices” implies the selected model’s matrix differs from the respective \(M_0\) model’s Q matrix, and “Same Q matrices” implies the selected model’s matrix matches the respective \(M_0\) model’s Q matrix

One reason why uncertainty in nucleotide MSAs may show increased model selection instability may be the relative model selection procedure itself. The ModelFinder algorithm used here for model selection [16] 88 candidate model parameterizations for nucleotide MSAs, and 168 model parameterizations for amino acid MSAs. More specifically, nucleotide model selection evaluates 21 different Q matrices each with four different combinations of +F (use observed stationary frequencies), +I (proportion of invariant sites), and +G (four-category discrete gamma distribution to model among-site rate variation, ASRV) parameterizations, and amino-acid model selection evaluates 22 different Q matrices each with six different combinations of +F, +I, and +G parameterizations. As such, there are more model parameterizations per Q matrix being evaluated for amino-acid model selection compared to nucleotide model selection. It is therefore possible that this analysis is somewhat biased in favor of amino-acids datasets showing increased Q matrix consistency compared to nucleotide datasets simply because there are more models per Q matrix. In addition, exchangeability parameters in amino-acid models are not optimized by maximum likelihood, but rather are fixed to a priori determined values. By contrast, nucleotide model parameters are all optimized by maximum likelihood and none are a priori fixed. This distinction in the source of model parameters may also explain the differences observed between nucleotide and amino-acid datasets.

Unstable datasets are challenging to distinguish from stable datasets

We next asked whether there are any overall differences between stable and unstable datasets. To this end, we constructed logistic regressions to explore whether MSA stability could be explained by underlying properties of each dataset. We built models for each combination of information theoretic criterion and for each datatype, each considering these four predictors to capture dataset-level information: (1) number of MSA sequences, (2) the mean number of sites among the given dataset’s de-gapped sequences, (3) the mean edit (Hamming) distance between all pairwise comparisons of dataset sequences, and (4) the mean GUIDANCE residue pair scores calculated based on each dataset’s reference MSA [24]. Smaller mean edit scores indicate overall more similar sequences, and higher mean edit scores indicate more diverged sequences. Similarly, GUIDANCE scores which range between [0, 1] reflect dataset robustness to MSA uncertainty as a whole. Smaller mean GUIDANCE scores indicate that the dataset is challenging to align with high expected uncertainty, and higher mean GUIDANCE scores indicate that the dataset can be more reliably aligned with less expected uncertainty. We calculated edit distances from all-to-all pairwise alignments [constructed with default settings in MAFFT v7 [17]], and we averaged resulting edit distances to derive a single mean edit distance score for each dataset. GUIDANCE residue pair scores were calculated for the reference MSA against all 49 associated perturbed MSAs, and we averaged all residue pair scores to derive a single mean GUIDANCE score for each dataset.

For all logistic regressions, the mean number of sites and mean number of sequences were significant (all \(P<10^{-5}\)) predictors of dataset stability but with effect sizes barely different from zero (Fig. 4a). By contrast, higher mean GUIDANCE scores were consistently associated with increased log-odds of dataset stability, demonstrating that datasets whose variant MSAs are more similar to another are more likely to be stable, as expected. Surprisingly, the mean edit distance predictor showed opposite trends between amino-acid and nucleotide datasets. For amino-acid data, higher mean edit distance was associated with higher log-odds of dataset stability, except for BIC model selection where this predictor was not significant. For nucleotide data, on the other hand, lower mean edit distances were associated with increased log-odds of dataset stability. Regardless, consistent across information theoretic criteria and datatypes, the mean GUIDANCE score had the largest absolute effect on explaining dataset stability. ROC (receiver operating characteristic) analyses for these logistic regressions revealed an AUC (area under the curve) of at most 0.71 (Fig. 4). While better than random chance (AUC = 0.5), AUC values ranging from 0.68 to 0.71 do not suggest a high predictive ability for assessing whether model selection will be sensitive to MSA uncertainty for any given dataset in a real-world analysis. That said, that the predictive ability of these logistic regressions, and similarly AUC scores, may be influenced by having only \(N=50\) perturbed MSAs for each dataset, and exploring more variant MSAs may influence these overall findings. We further note that large ranges of predictor variables (Fig. 4b) may also have an impact on resulting AUC values.

Fig. 4
figure 4

Logistic regression coefficients to explain dataset stability. a Filled circles represent logistic model coefficients that are significant at \(P\le 0.01\), and error bars represent the 99% confidence intervals. Values in the top left of each panel represent the AUC from an associated ROC (receiver operating characteristic) analysis. Note that effect sizes for “mean number of sites” and “number of sequences” are significant but extremely close to zero. Dataset sizes in each logistic regressions are equivalent to the number of datasets explored: 236 PANDIT datasets, 1000 Drosophila datasets, and 1000 Euteleostomi datasets, for each of AA and NT datatypes. b Distributions of predictor variables used in each logistic regression

We next examined MSA-level properties to ascertain whether more similar variant MSAs were more likely to have the same best-fitting model, and conversely whether more dissimilar variant MSAs are more likely to have different best-fitting models. For this analysis, we considered two common MSA scores, total columns (TC) and sum-of-pairs (SP) scores, to quantify similarity between each dataset’s perturbed MSA and the given reference MSA, using a procedure outlined in Fig. 5. For all stable datasets, all perturbed MSA selected models are by-definition the same as the reference MSA’s model (\(\mathtt {M_{ref}}\); Table 1). For each stable dataset, we calculated the SP and TC score between each of the 49 perturbed MSA and the reference MSA (49 total comparisons), finally averaging these scores to derive an overall mean SP and TC score for the dataset. For all unstable datasets, we classified each perturbed MSA either as having the same or different selected model as \(\mathtt {M_{ref}}\), resulting in two groups of MSAs: differs \(\mathtt {M_{ref}}\) and matches \(\mathtt {M_{ref}}\). We calculated mean TC and SP scores for each of the two groups. Any unstable dataset with fewer than three perturbed MSAs per group was fully excluded from this analysis. In total, these calculations led to three distributions of mean dataset SP and TC scores for each of the stable, differs, and matches groups (Fig. 6 for AIC and Additional file 1: Fig. S3a, b for BIC and AICc, respectively).

Fig. 5
figure 5

Flowchart showing procedure for grouping MSAs to analyze SP and TC scores. Endpoints in the flowchart are colored according to the same groupings shown in Fig. 6

For each datatype (NT and AA) and MSA score (SP and TC), we built an ANOVA to assess the mean score differences among the three groupings, applying a post-hoc Tukey test to compare differences among groupings (Table 2 for AIC, and Additional file 1: Tables S1, S2 for BIC and AICc, respectively). Overall, we observed no significant differences the differs and matches groupings for all information theoretic criteria (differs-matches comparisons in Table 2 and Additional file 1: Tables S1, S2). By contrast, all stable-differs and all except one stable-matches comparison were significant at \(P\le 0.01\), with one exception: The stable-matches comparison of TC scores on amino-acid datasets with AIC model selection was significant only at the level \(P\le 0.05\) (Table 2). Moreover, all effect sizes were fairly small, except for significant nucleotide TC score comparisons. In total, these results support findings shown in Fig. 4 that MSAs are overall more similar for stable datasets compared to unstable datasets, but specifically predicting whether two given variant MSAs will select the same evolutionary model may not be possible.

Fig. 6
figure 6

Mean SP and TC scores across variant MSA groups for model selection with AIC. Values in each boxplot represent mean dataset SP and TC scores across MSA variant groupings. Corresponding figures for model selection with BIC and AICc are in Additional file 1: Fig. S3

Table 2 SP and TC score comparisons among variant MSA groups for model selection with AIC

Discussion and conclusions

We have conducted a large-scale empirical data analysis to ascertain the extent to which the results of relative model selection may be influenced by MSA uncertainty. We assessed the consistency of model selection, for three different commonly-used information theoretic criteria (AIC, BIC, AICc), across variant MSAs created from the same dataset, finding that model selection is often sensitive to MSA uncertainty. As has been often observed in previous studies, we observed very few practical differences among results from different information theoretic criteria used in model selection [2, 28]. Furthermore, we have shown that it is challenging, if at all possible, to predict the specific circumstances under which MSA uncertainty influences model selection, although we do observe that model selection is, as expected, more consistent when the MSA itself is more reliable (Fig. 4). In total, these results build on previous findings that MSA uncertainty has the potential to influence even more stages of phylogenetic analyses than has been previously recognized.

We note that one limitation of this study is that we only explore the influence of MSA uncertainty on relative model selection and not other approaches to identifying best-fitting models, including tests of model adequacy [11,12,13], Bayesian assessments of absolute model fit [7, 19], or more recently-developed machine-learning methods for model selection [1]. Indeed, it may be possible that other approaches to model selection are more robust to MSA uncertainty, but the computational demands of these methods prohibit similar large-scale benchmarking. As the computational efficiency of these methods improves, it will be possible to thoroughly examine the potential effects of MSA uncertainty on model adequacy approaches.

How can the effects of MSA uncertainty be mitigated for relative model selection? In other scenarios in phylogenetic methods where MSA uncertainty is a known confounding factor, researchers will sometimes filter and/or mask MSAs to remove columns and/or residues that are suspected to have been poorly aligned. However, there have been mixed results on whether such actions will improve analyses. For example, in the context of positive selection, Privman et al. [27] found that masking poorly aligned residues may reduce false positives, but Spielman et al. [32] found that there was limited practical utility to masking residues. Similarly, in the context of phylogenetic reconstruction, Sela et al. [29] found that MSA filtering may improve inferences, but Tan et al. [35] found that MSA filtering more frequently worsens inferences. Overall, these studies point to the conclusion that MSA filtering can reduce spurious inferences, but may also reduce power to the detriment of overall performance. In spite of the lackluster promise shown by other applications of MSA filtering, examining the effect of MSA filtering on model selection itself may begin to reveal the source of its uncertainty. Another potential option for mitigating the effects of MSA uncertainty more generally may be to pursue a model averaging and/or mixture modeling strategy [6, 25, 30], for example by jointly considering candidate models identified across variant MSAs, during phylogenetic reconstruction. Alternatively, it may be possible to address MSA uncertainty by averaging across a candidate set of MSAs, which has been previously shown to improve accuracy in phylogenetic reconstruction [29], and selecting a model for subsequent analyses based on the averaged MSA rather than from a single variant MSA.

While it may be possible to ameliorate the influence of MSA uncertainty on relative model selection, we must also ask: Do we need to mitigate this issue in the first place? For example, recent studies have shown that, for both nucleotide and amino-acid models, the model selection procedure itself may not be a critical step in phylogenetic reconstruction, since different models with extreme differences in relative fit may not actually result in systematically different results [2, 31, 33] although how the precise model used may influence branch length and/or divergence estimation remains an important question [1, 2]. As such, if distinct models may yield highly similar inferences, optimizing the model selection procedure itself has diminishing returns. If models themselves are overly similar, there may be little practical consequence to the observed uncertainty in model selection.

Considering the results of this study in the context of literature questioning the utility of model selection itself, we suggest that, rather expending additional efforts to perfect existing paradigms in phylogentic modeling, the time has come to explore novel modeling paradigms. For example, one reason that all models may yield highly similar inferences is that, while they may have somewhat different focal parameters, virtually all commonly-used models, and certainly all those evaluated by model selection, adhere to the same underlying mechanism: A stationary, time-reversible, and homogeneous Markov process that assumes sites evolve independently (e.g. no epistasis) [22]. We conclude that concerted efforts to develop and extend these paradigms while facilitating widespread familiarity represent the most promising next steps in advancing phylogenetic methods.


Construction of perturbed alignments

We analyzed multiple sequence alignments (MSAs) from two databases, Selectome v6 [21] and PANDIT [37], specifically considering only those MSAs which had fully compatible amino-acid and nucleotide versions. This approach allowed us to use the same underlying biological data to examine the robustness of relative model selection when performed on both nucleotide and amino-acid MSAs. From Selectome, we analyzed 1000 randomly selected Drosophila alignments and 1000 randomly selected Eutelestomi MSAs, ensuring that each MSA corresponded to a different gene (i.e., only a single transcript for a given gene was analyzed). From PANDIT, we analyzed all MSAs which contained at least 25 sequences and whose shortest protein sequence contained at least 100 amino-acids, which totalled 254 PANDIT MSAs.

After collecting these datasets, we removed all gaps from these MSAs. We refer to each resulting set of unaligned orthologs as a “dataset.” For each nucleotide and amino-acid version of each dataset, we used a GUIDANCE2-based approach to generate 50 distinct MSAs [29, 32]. We created perturbed alignments by varying the guide tree and gap penalty parameters provided to the MAFFT v7.407 aligner [17]. In this procedure, a “reference” MSA is first made using default parameters in MAFFT. Bootstrapped versions of this reference MSA are used to create guide trees for the progressive alignment with FastTree2 [26]. Each unique guide tree is then supplied to MAFFT as a guide tree to construct a perturbed MSA from the underlying data. As in GUIDANCE2, the MAFFT gap opening penalty X for specified each perturbed MSA was randomly drawn from the uniform distribution \(X \sim U(1,3)\). We created 49 perturbed MSAs for each dataset, leading to a total of 50 distinct MSAs (including the reference MSA) for each dataset. We ensured that all MSAs for a given dataset were fully unique.

Relative model selection

We used ModelFinder [16, 23] to conduct relative model selection, using the -m TEST flag to perform a standard model selection similar to jModelTest and ProtTest [8, 9]. This procedure evaluates 88 candidate model parameterizations for nucleotide MSAs, and 168 model parameterizations for amino acid MSAs. More specifically, nucleotide model selection evaluates 21 different Q matrices each with four different combinations of +F (use observed stationary frequencies), +I (proportion of invariant sites), and +G (four-category discrete gamma distribution to model among-site rate variation, ASRV) parameterizations, and amino-acid model selection evaluates 22 different Q matrices each with six different combinations of +F, +I, and +G parameterizations. For each dataset, we identified the best-fitting model under each of the three information theoretic criterion AIC (Akaike Information Criterion), AICc (small-sample Akaike Information Criterion), and BIC (Bayesian Information Criterion).

Availability of data and materials

All code used in this study is freely available from All variant MSA datasets are available from the FigShare repository


  1. Abadi S, Avram O, Rosset S, Pupko T, Mayrose I. ModelTeller: model selection for optimal phylogenetic reconstruction using machine learning. Mol Biol Evol. 2020;37(11):3338–52.

    Article  CAS  Google Scholar 

  2. Abadi S, Azouri D, Pupko T, Mayrose I. Model selection may not be a mandatory step for phylogeny reconstruction. Nat Commun. 2019;10(1):934.

    Article  Google Scholar 

  3. Abdo Z, Minin VN, Joyce P, Sullivan J. Accounting for uncertainty in the tree topology has little effect on the decision-theoretic approach to model selection in phylogeny estimation. Mol Biol Evol. 2005;22(3):691–703.

    Article  CAS  Google Scholar 

  4. Arenas M. Trends in substitution models of molecular evolution. Front Genet. 2015;6:319.

    Article  Google Scholar 

  5. Ashkenazy H, Sela I, Levy Karin E, Landan G, Pupko T. Multiple sequence alignment averaging improves phylogeny reconstruction. Syst Biol. 2019;68(1):117–30.

    Article  CAS  Google Scholar 

  6. Bouckaert RR, Drummond AJ. bModelTest: Bayesian phylogenetic site model averaging and model comparison. BMC Evol Biol. 2017;17(1):42.

    Article  Google Scholar 

  7. Brown JM. Predictive approaches to assessing the fit of evolutionary models. Syst Biol. 2014;63(3):289–92.

    Article  PubMed  Google Scholar 

  8. Darriba D, Taboada GL, Doallo R, Posada D. Prottest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 2011;27:1164–5.

    Article  CAS  Google Scholar 

  9. Darriba D, Taboada GL, Doallo R, Posada D. jModelTest 2: more models, new heuristics and parallel computing. Nat Methods. 2012;9(8):772.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Du Y, Wu S, Edwards SV, Liu L. The effect of alignment uncertainty, substitution models and priors in building and dating the mammal tree of life. BMC Evol Biol. 2019;19(1):203.

    Article  Google Scholar 

  11. Duchêne DA, Duchêne S, Ho SYW, Kelso J. PhyloMAd: efficient assessment of phylogenomic model adequacy. Bioinformatics. 2018;34(13):2300–1.

    Article  Google Scholar 

  12. Goldman N. Simple diagnostic statistical tests of models for DNA substitution. J Mol Evol. 1993;37(6):650–61.

    CAS  PubMed  Google Scholar 

  13. Goldman N. Statistical tests of models of DNA substitution. J Mol Evol. 1993;36(2):182–98.

    Article  CAS  Google Scholar 

  14. Jones DT, Taylor WR, Thornton JM. The rapid generation of mutation data matrices from protein sequences. CABIOS. 1992;8:275–82.

    CAS  PubMed  Google Scholar 

  15. Jordan G, Goldman N. The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Mol Biol Evol. 2012;29(4):1125–39.

    Article  CAS  Google Scholar 

  16. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods. 2017;14(6):587–9.

    Article  CAS  Google Scholar 

  17. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80.

    Article  CAS  Google Scholar 

  18. Levy Karin E, Susko E, Pupko T. Alignment errors strongly impact likelihood-based tests for comparing topologies. Mol Biol Evol. 2014;11:3057–67.

    Article  Google Scholar 

  19. Lewis PO, Xie W, Chen MH, Fan Y, Kuo L. Posterior predictive bayesian phylogenetic model selection. Syst Biol. 2014;63(3):309–21.

    Article  Google Scholar 

  20. Markova-Raina P, Petrov D. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 drosophila genomes. Genome Res. 2011;21(6):863–74.

    Article  CAS  Google Scholar 

  21. Moretti S, Laurenczy B, Gharib WH, Castella B, Kuzniar A, Schabauer H, Studer RA, Valle M, Salamin N, Stockinger H, Robinson-Rechavi M. Selectome update: quality control and computational improvements to a database of positive selection. Nucleic Acids Res. 2014;42:D917–21.

    Article  CAS  Google Scholar 

  22. Naser-Khdour S, Minh BQ, Zhang W, Stone EA, Lanfear R. The prevalence and impact of model violations in phylogenetic analysis. Genome Biol Evol. 2019;11(12):3341–52.

    Article  Google Scholar 

  23. Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015;32(1):268–74.

    Article  CAS  Google Scholar 

  24. Penn O, Privman E, Landan G, Graur D, Pupko T. An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol. 2010;27(8):1759–67.

    Article  CAS  Google Scholar 

  25. Posada D, Buckley TR. Model selection and model averaging in phylogenetics: advantages of akaike information criterion and bayesian approaches over likelihood ratio tests. Syst Biol. 2004;53(5):793–808.

    Article  Google Scholar 

  26. Price M, Dehal P, Arkin A. FastTree2: approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5:e9490.

    Article  Google Scholar 

  27. Privman E, Penn O, Pupko T. Improving the performance of positive selection inference by filtering unreliable alignment regions. Mol Biol Evol. 2012;29(1):1–5.

    Article  CAS  Google Scholar 

  28. Ripplinger J, Sullivan J. Does choice in model selection affect maximum likelihood analysis? Syst Biol. 2008;57(1):76–85.

    Article  Google Scholar 

  29. Sela I, Ashkenazy H, Katoh K, Pupko T. GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res. 2015;43:W7–14.

    Article  CAS  Google Scholar 

  30. Si Quang L, Gascuel O, Lartillot N. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics. 2008;24(20):2317–23.

    Article  Google Scholar 

  31. Spielman S. Relative model fit does not predict topological accuracy in single-gene protein phylogenetics. Mol Biol Evol. 2020;37(7):2110–23.

    Article  CAS  Google Scholar 

  32. Spielman SJ, Dawson ET, Wilke CO. Limited utility of residue-masking for positive-selection inference. Mol Biol Evol. 2014;31(9):2496–500.

    Article  CAS  Google Scholar 

  33. Spielman SJ, Kosakovsky Pond SL. Relative evolutionary rates in proteins are largely insensitive to the substitution model. Mol Biol Evol. 2018;35(9):2307–17.

    Article  CAS  Google Scholar 

  34. Sullivan J, Joyce P. Model selection in phylogenetics. Ann Rev Ecol Evol Syst. 2005;36(1):445–66.

    Article  Google Scholar 

  35. Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, Dessimoz C. Current methods for automated filtering of multiple sequence alignments frequently worsen single-gene phylogenetic inference. Syst Biol. 2015;64(5):778–91.

    Article  CAS  Google Scholar 

  36. Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS ONE. 2011;6(3):e18093.

    Article  CAS  Google Scholar 

  37. Whelan S. PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res. 2006;34(90001):D327–31.

    Article  CAS  Google Scholar 

  38. Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol Biol Evol. 2001;18:691–9.

    Article  CAS  Google Scholar 

  39. Wong KM, Suchard MA, Huelsenbeck JP. Alignment uncertainty and genomic analysis. Science. 2008;319(5862):473–6.

    Article  CAS  Google Scholar 

  40. Yang Z. Molecular evolution: a statistical approach. Oxford: Oxford University Press; 2014.

    Book  Google Scholar 

Download references


High-performance computing was conducted, in part, using the Rowan University Computing Cluster (RUCC).


SJS was supported by start-up funds from Rowan University. MM was supported by funds awarded by NASA to the New Jersey Space Grant Consortium. The funding bodies had no role in the study, analysis, or interpretation of the data or the writing of the manuscript.

Author information

Authors and Affiliations



SJS conceived of and designed the study. SJS and MM performed data analysis. SJS wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Stephanie J. Spielman.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Additional Figures and Tables.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Spielman, S.J., Miraglia, M.L. Relative model selection of evolutionary substitution models can be sensitive to multiple sequence alignment uncertainty. BMC Ecol Evo 21, 214 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: