### SNR codon models M0, M1a, M2a, M3 and M8

Goldman and Yang [26] and Muse and Gaut [47] independently proposed similar formulations for modelling the Markovian substitution process between sense codons. Here we present the core formulation of Goldman and Yang [26], as it was developed into models that form some LRTs investigated within this study. The instantaneous substitution rate between codon *i* and *j* (*i* ≠ *j*) at a single site within an alignment of protein coding sequences is defined as:

$$ {q}_{ij}=\left\{\begin{array}{c}0,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{more}\ \mathrm{than}\ \mathrm{one}\ \mathrm{nucleotide}\\ {}{\pi}_j,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{a}\ \mathrm{synonymous}\ \mathrm{transversion}\\ {}\kappa {\pi}_j,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{a}\ \mathrm{synonymous}\ \mathrm{transition}\ \\ {}\omega {\pi}_j,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{a}\ \mathrm{nonsynonymous}\ \mathrm{transversion}\\ {}\omega \kappa {\pi}_j,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{a}\ \mathrm{nonsynonymous}\ \mathrm{transition}\end{array}\right. $$

where the matrix *Q* specifies a continuous-time, stationary, time-reversible Markov process. Parameters *π*_{j}, *κ* and *ω* specify the stationary frequencies of codon *j*, the transitions to transversion rate ratio, and the nonsynonymous to synonymous rate ratio, respectively. Because this formulation models all nonsynonymous changes using a single *ω* parameter, this is an example of a SNR model. The transition probability matrix *P*(*t*) is related to matrix *Q* by *P*(*t*) = *e*^{Qt}, thereby giving the probabilities for state changes over a branch of length *t*. The likelihood of a codon site for a given phylogenetic tree and branch lengths can then be calculated using the pruning algorithm of Felsenstein [48]. The above formulation is widely referred to as model M0, and it assumes that the intensity of natural selection (as captured by parameter *ω*) is the same for all sites in the codon sequence alignment. Model M0 was extended to a series of models that permit the *ω* parameter to vary among sites [6], which includes the models known as M1a, M2a, M3 and M8. Hereafter, the family of codon models derived from M0 that permit the *ω* parameter to vary among sites will be referred as “M-series” models. All members of the M-series family are SNR models.

Models M1a and M2a [45] are widely used as the basis of an LRT for positive selection, and for empirical Bayes identification of positively selected sites within a multi-species alignment [49]. These models employ a restricted form of the *ω* distribution that, although highly idealized, leads to desirable properties for the LRT [11, 46]. Model M1a (a.k.a. nearly neutral) is a discrete mixture of two classes of sites: strictly neutral sites with *ω*_{1} = 1, and sites subject to purifying selection with *ω*_{0} estimated from the data but constrained to take a value < 1. The mixture weights for these classes of sites (*p*_{0} and *p*_{1}) also are estimated from the data. Model M2a extends model M1a by adding a third class of sites for positive selection (*ω*_{+} > 1). As these models are nested they serve as the basis of a LRT for sites evolving by positive selection.

Model M3 employs an unconstrained discrete distribution for *ω* [6]. In this model, sites are assumed to belong to *k* discrete classes, each having a parameter for selection (*ω*_{i}) and a proportion of sites (*p*_{i}) within the gene. An LRT of M3 against M0 (a special case of M3 where *k* = 1 and all sites have just a single *ω*) constitutes a test for variable selection intensity among sites [11]. In this study we use the LRT of M0 versus M3_{k = 2} to pre-screen the real datasets and thereby ensure each contains signal for among-site variation in the intensity of natural selection.

Model M8 uses a flexible *β* distribution to permit *ω* to vary among sites within the interval (0,1) and an extra discrete category that can allow *ω*_{+} > 1 [6]. For computational convenience the *β* distribution is divided into 10 bins. An LRT for positive selection is obtained by comparing a restricted form of M8 (*ω*_{+} = 1, fixed) to an alternative form of M8 (*ω*_{+} ≥ 1, estimated). In both models the mixture weights for the *β* distribution (*p*_{0}) and *ω*_{+} (*p*_{+}) are estimated from the data. This LRT represents a popular alternative to M1a and M2a as a test for sites evolving by positive selection.

### GPP codon models G1a and G2a

We developed GPP codon models that employ the same discrete distributions for *ω* as employed by M1a and M2a, but without requiring that any other simplifying assumptions be imposed on the data (e.g., SNRs, zero probability for DT changes, and restrictions on the GTR). These models are hereafter referred to as G1a and G2a. Like M1a, model G1a assumes that data evolve under one of two discrete selective regimes: purifying selection and strict neutral evolution. Model G2a extends this by adding a class of sites evolving under positive selection. The restrictions, as well as the notation, are the same for the *ω* parameters (*ω*_{0} < 1, *ω*_{1} = 1, and *ω*_{+} > 1) and mixture weights (*p*_{0}, *p*_{1} and *p*_{+}).

G1a and G2a are derived from a simple GPP codon model that includes the current models such as Goldman and Yang [26] and Muse and Gaut [47] as special cases. We refer to the basic form of this model, which has only a single class of sites, as G0. The GPP model exploits the fact that a time-reversible process is expressible as the product of a matrix of exchangeability parameters (*R*) and the steady state frequencies (*π*), and uses a logarithm link function to link the non-zero off-diagonal elements of the 61 × 61 instantaneous codon matrix, *Q = Rπ*, to a linear model format (see online Additional file 1 for details). We assume *R* is symmetric, and the instantaneous rates can be written as *q*_{ij} = *π*_{j}*r*_{ij}, where *π*_{j}is the equilibrium frequency of the *j*^{th} codon, and the parameter *r*_{ij} determines the exchangeability between codons. In G0 the matrix of exchangeability parameters, *R*, is determined by a set of model parameters, *β*_{0}, …, *β*_{n}. For each *β*_{k} there is a corresponding matrix *X*^{(k)}, and the value of *r*_{ij} for *i* ≠ *j* is determined by *log*(*r*_{ij}) = ∑_{k}*β*_{k}(*X*^{(k)})_{ij}. The diagonal elements of *R* are set such that rows of *Q* sum to 0. The first model parameter, *β*_{0}, is a scaling factor set so that the branch lengths can be interpreted as the expected numbers of substitutions per codon sites, and the other parameters *β*_{1}, …, *β*_{n} are intended to represent different mechanisms of the evolutionary process. This framework allows specification of all possible instantaneous codon substitutions, and any restrictions on the process are special cases of the general model where the instantaneous rate is set to zero (e.g., prohibition of codon substitutions involving DT nucleotide changes is a special case of the general model).

As the familiar SNR codon model M0 [26] is a special case of G0, it serves as a convenient way to illustrate how a GPP model is specified. M0 can be expressed within the GPP framework as follows:

$$ {q}_{ij}=\left\{\begin{array}{c}0,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{more}\ \mathrm{than}\ \mathrm{one}\ \mathrm{nucleotide}\\ {}{e}^{\beta_0}{\pi}_j,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{a}\ \mathrm{synonymous}\ \mathrm{transversion}\\ {}{e}^{\beta_0}{e}^{\beta_1}{\pi}_j,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{a}\ \mathrm{synonymous}\ \mathrm{transition}\ \\ {}{e}^{\beta_0}{e}^{\beta_2}{\pi}_j,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{a}\ \mathrm{nonsynonymous}\ \mathrm{transversion}\\ {}{e}^{\beta_0}{e}^{\beta_1}{e}^{\beta_2}{\pi}_j,\ \mathrm{i}\mathrm{f}\ \mathrm{i}\ \mathrm{a}\mathrm{nd}\ \mathrm{j}\ \mathrm{differ}\ \mathrm{by}\ \mathrm{a}\ \mathrm{nonsynonymous}\ \mathrm{transition}\end{array}\right. $$

where \( {e}^{\beta_0} \) is the required matrix scale factor, \( {e}^{\beta_1} \) is equivalent to the transition/transversion rate ratio (*κ*), and \( {e}^{\beta_2} \) is equivalent to the nonsynonymous/synonymous rate ratio (*ω*). Transitions are indicated by a matrix *X*^{(1)} whose entries are 1 for all single nucleotide changes between codons that are transitions (and 0 for all other entries). Nonsynonymous changes are indicated by a matrix *X*^{(2)} whose entries are 1 for all single nucleotide changes that yield a change in the encoded amino acid (and 0 for all other entries). Note that the requirement that *q*_{ij} = 0 if *i* and *j* differ in more than one nucleotide position is explicitly enforced after applying the link function. By removing this requirement and extending *X*^{(1)} and *X*^{(2)} to include DT changes, we obtain an extension of G0 that permits multiple nucleotide changes between codons.

Model G0 (like model M0) is a SNR model because the nonsynonymous exchangeabilities are all equal. However, nonsynonymous exchangeabilities need not be constrained in this way. Any number of mechanisms for differences in nonsynonymous exchangeabilities can be added to the model through additional *β*_{i} parameters. For example, empirical data indicate that differences in hydrophobicity among pairs of amino acids is well known to impact the probability of an amino acid substitution (e.g., Clark [24]; Grantham [25]). Taking hydrophobicity as an example, a matrix of pairwise differences in hydrophobicity between amino acids can be constructed from a given scale (e.g., HI of Monera et al. [50]), and the nonsynonymous transition rate can then be linked to the exponent of the entries in this matrix via \( {e}^{\beta_{HI}} \), where *β*_{HI} is a fitted parameter in the model. Any such addition to the model yields a process of codon evolution having MNRs. Restrictions on the DNA-level process of evolution also can be relaxed. For example, rather than the single parameter for the transition/transverion rate ratio (*β*_{1}, in the above model), each DNA-level exchangeability can be modelled with a separate parameter (*β*_{AC}, *β*_{CT}, *β*_{AT}, *β*_{TG}, *β*_{CG}). This leads to a codon model having a GTR process at the DNA level, which has been recommended when testing for positive selection (e.g., Kosakovsky Pond and Frost [7]).

Parameterization of a codon model in terms of *β*_{1}, …, *β*_{n} means that process-variation among sites can be modelled with different random effects for different model parameters. In this study we develop GPP models motivated by M1a and M2a by using constrained discrete distributions to model among site variation in the nonsynonymous rate (*β*_{2} in the above model). These models (G1a and G2a) extend M1a and M2a by permitting double and triple changes between codons, a full GTR process at the DNA level, and model MNRs via the addition of *β*_{1}, …, *β*_{n} for different aspects of physiochemical constraints.

### Simulation based assessment of the G-series and M-series models

Simulation is used to evaluate MLE estimation under the new G-series models and the performance of several LRTs for positive selection (e.g., G1a vs G2a). Our overall design is comprised of 32 distinct evolutionary scenarios (Fig. 1), which serve as the basis for four simulation studies focused on different ways in which model based inference could be impacted. Although the evolutionary details differ between the 32 scenarios, each is comprised of 100 replicate datasets, each having sequences of 300 codons in length.

Simulated datasets were generated using methods implemented in version 1.2 of the COLD program “www.mathstat.dal.ca/~tkenney/Cold/”. COLD is an open source software package available for download from the COLD website “www.mathstat.dal.ca/~tkenney/Cold/”, and from GitHub “https://github.com/tjk23/COLD”. The commands used to generate the sequence data for this study, the relevant Newick tree files, and all multi-sequence alignments that were produced for each of the simulation studies, are available to download from the DRYAD repository for this study [51].

#### Simulation study 1

The purpose of this study is to investigate the impact of DT codon changes on the false positive rate. For this simulation we start with the 5-taxon tree and branch lengths of Wong et al. [45] (Fig. 1a). The generating process for this study is based on a selective regime at the codon level derived from a strictly neutral model of codon evolution (Fig. 1b). In this scenario 50% of the sites are subject to perfect purifying selection (*ω* = 0) and 50% are subject to neutral evolution (*ω* = 1). This scenario is often included in simulation studies as a “benchmark case” for LRTs (e.g., Kosakovski Pond and Frost [7]; Anisimova et al. [11]; Wong et al. [45]; Bao et al. [46]). Here, we extend this benchmark case by adding DT changes between codons, with rates 0.06 and 0.03 respectively. These are in accordance with the notion that their rates are substantially lower than the rate of single nucleotide substitution between codons [39, 40]. To enhance interpretability, we began by setting all GTR exchangeabilities to 1 and specified equal nucleotide frequencies. This scenario is referred to as case 1a (Fig. 1b). We then extended this simulation study in two ways. The first extension was to increase the complexity of the nucleotide-level process by adding unequal GTR exchangeabilities and nucleotide frequencies (from [6]). This extension is referred to as case 1b (Fig. 1b). The next extension was designed to investigate the impact of taxon sampling. Each terminal branch of the 5-taxon tree in Fig. 1a was split by the addition of a second lineage, resulting in a 10-taxon tree. The total length of the new tree (sum of the branches) was set equal to that of the 5-taxon tree, but with the tree length re-distributed evenly among all branches (see online Additional file 2). Simulation over the 10-taxon tree was based on the more complex process of case 1b, and is referred to as case 1c. Each dataset was analysed with M1a and M2a, and variants that permit DT changes, hereafter called G1a^{DT} and G2a^{DT}).

#### Simulation study 2

The purpose of this study is to investigate model performance using much more complex scenarios than the strictly neutral case above. The tree and branch lengths are derived from a set of 17 real *β*-globin sequences (Fig. 1a), and thus are the same for all scenarios. This tree has been used widely in previous simulation studies (e.g., [6, 11]). This study is comprised of 24 distinct scenarios (Fig. 1c). Each scenario is based on a mixture of sites having three distinct selective regimes. All scenarios have a large fraction of sites (77%) dominated by purifying selection (*ω*_{0} = 0.05). A moderate fraction of sites (20%) assumed to evolve under moderate purifying selection (*ω*_{1} = 0.5) or neutrality (*ω*_{1} = 1.0). A small fraction of sites (3%) evolve with *ω* ≥ 1 (*ω*_{+} = 1.0, 1.5, 2.0 or 5.0). In addition we also employ heterogeneous GTR exchangeabilities, and unequal nucleotide frequencies at the three positions of the codon, as estimated from a set of real *β*-globin sequences. Lastly, we cover a range of nonsynonymous rate heterogeneity by specifying hydrophobicity factors \( \left({e}^{\beta_{HI}}\right) \) of 1.0, 0.4 or 0.05. The hydrophobicity index of Monera et al. [50] was re-scaled by a factor of 100, so that it takes values in the interval [− 1,1], and the absolute value of the difference between the hydrophobicity of amino acids was computed for all pairs of amino acids. The matrix of these scores (online Additional file 3) was linked to the nonsynonymous substitution rate via a parameter in the GPP generating process (*β*_{HI}). When *β*_{HI} = 0, the matrix of hydrophobicity scores will have no impact on nonsynonymous rates, yielding a SNR codon model (\( {e}^{\beta_{HI}}=1\Big) \). When \( {e}^{\beta_{HI}}=0.4\ \mathrm{and}\ 0.05 \), the process of codon evolution has MNRs, with \( {e}^{\beta_{HI}}=0.05 \) yielding an extremely biased MNR model. As our primary interest is the effect of MNRs, we do not include DT codon changes in this study. Note that hydrophobicity is used for convenience to induce MNRs here; any property scale can be similarly used within this GPP framework. Figure 1c indicates the relationship between the different scenarios in this study.

Each scenario was analysed with three different pairs of models. The first was the pair of SNR models M1a and M2a. This pair represents an under-fit modelling scenario. The second pair was G1a^{x} and G2a^{x}, which represent GPP models having perfect fit to the generating process. The superscript of *x* represents the number of mechanistic model parameters required for a perfect fit to a given scenario. The third pair of models was G1a^{13} and G2a^{13}. In addition to the branch lengths, and each model’s parameters for the *ω* distribution, these models have *x* = 13 additional parameters. The 13 additional parameters account for DT changes (2 parameters), 6 amino acid properties (polarity, volume, hydropathy, isoelectric point, polar requirement & composition), and GTR exchangeabilities (5 free parameters). Models G1a^{13} and G2a^{13} are used here to represent an over-fit modelling scenario.

#### Simulation study 3

The purpose of this study is to extend Study 2 by adding simultaneous DT nucleotide changes between codons. To minimize the computational burden, the impact of DT nucleotide changes was explored in a selected subset of six scenarios covered in Simulation Study 2. Specifically, we chose three different distributions for *ω* (see 2b, 2d, 2g in Fig. 1)*,* and applied two hydrophobicity factors to each one. The hydrophobicity factors \( \left({e}^{\beta_{HI}}\right) \) were 1.0 (yielding an SNR model) and 0.05 (yielding a highly variable MNR model). One *ω* distribution excluded positive selection (77% *ω* = 0.05 and 23% *ω* = 1.0). The other two *ω* distributions included positive selection (77% *ω* = 0.05; 20% *ω* = 0.50; 3% *ω* = 2.0, and 77% *ω* = 0.05; 20% *ω* = 1.0; 3% *ω* = 2.0). As in Study 2, the tree, branch lengths, GTR parameters and codon frequencies were derived from a set of real *β*-globin sequences. Also like Study 2, we used an under-fit model pair (M1a and M2a), a perfectly fit model pair (G1a^{x} and G2a^{x}), and an over-fit model (G1a^{13} and G2a^{13}).

#### Simulation study 4

The purpose of this study was to investigate the impact of alternative model formulations on false positive rates for the M-series LRTs. Users of M-series models have many choices for how to (*i*) model the distribution of *ω* variability among sites, and (*ii*) parameterize codon frequencies within the model. A comprehensive assessment of alternative *ω* distributions is beyond the scope of this study. For this reason we chose to assess the LRT for positive selection that compares M8_{ω + =1} with M8_{ω+ > 1} because it is a popular alternative, and because M8 is based on a discretized *β* distribution. There are two fundamentally different approaches to parameterize codon frequencies. One of them emphasizes the context of the nucleotide change within the complete codon, and employs the equilibrium frequency of the target codon (*π*_{j}) to model transition probabilities ([26], hereafter denoted GY). The other emphasizes the independence of the process of mutation among sites, and employs the equilibrium frequency of the target nucleotide (*j*) at a single position (*k*) averaged over all codons (*π*_{j}^{k}) to model transition probabilities among codons ([47], hereafter denoted MG). Both approaches employ estimates of four nucleotide frequencies at each position of the codon (denoted F3 × 4), and thus each requires 9 free parameters. Despite having similar instantaneous rate matrices, these two Markov processes have different properties when codon frequencies are uneven (e.g., [52]). To investigate the effect of both kinds of modelling choices (*ω* distribution and codon frequencies), we applied both frequency parameterizations (*π*_{j} vs. *π*_{j}^{k}) to the LRT of M1a vs. M2a and to the LRT of M8_{ω + =1} with M8_{ω+ > 1}. This comparison yields four LRTs per simulation scenario, and because we were interested in false positive rates we applied those four LRTs to all nine null scenarios of Simulations Studies 1–3 (SNR: 1a, 1b, 1c, 2a, 2b & 3a; MNR: 2a, 2b & 3a). In these we covered M-series false positive rates for DT changes, MNRs, and the combination of DT changes and MNRs.

### Real data analyses

We analyzed a set of 24 *Streptococcus* transmembrane proteins. The data are derived from a previous phylogenomic analysis of *Streptococcus* genomes [53]. The homologous gene clusters identified in that study were filtered for clusters of transmembrane proteins with ≥4 unique sequences. The sequence alignments for these gene clusters range from 4 to 19 lineages, and included pathogens and their non-pathogenic relatives. The data were then pre-screened with a LRT for among-codon heterogeneity in *ω* ([6]: M0 vs M3). Three genes had no significant evidence for heterogeneity in *ω* according to this LRT and were excluded from subsequent analyses. The remaining 21 genes were tested using a pair of models that are (presumably) under-fit with respect to DT changes and MNRs (M1a and M2a), and a pair that can be considered mechanistically over-fit for at least some of their parameters (G1a^{13} and G2a^{13}). As we do not know the true generating process for these data, we cannot analyze them using a perfectly fit model pair.

### Likelihood calculations and likelihood ratio tests

The values of the model parameters, including branch lengths, were estimated from the data via maximum likelihood. The only exception was the equilibrium frequencies, which were obtained from the empirical codon frequencies within each dataset. The SNR codon models M0, M3, M1a, M2a and M8 were fit to the data as implemented in the codeml program of the PAML package [54]. Fitting the G-series models described above was made possible by an efficient Hessian calculation for phylogenetic likelihood [55], and the GPP modelling framework implemented in version 1.2 of the COLD program “www.mathstat.dal.ca/~tkenney/Cold/”. Model M1a differs from M2a only in the parameters of the *ω* distribution. As these models are nested, and differ by two free parameters, the log likelihood statistic (2Δ) should be approximately *χ*^{2} distributed with 2 degrees of freedom. However, the alternative model (M2a) is related to the null model (M1a) by fixing one of its mixture weights on the boundary (*p*_{+} = 0). This means that the LRT statistic \( {\chi}_2^2 \) is not the correct distribution; however, we use it here because it is expected to be conservative in many scenarios. The GPP models used in this study employ the same *ω* distributions, and their LRTs are carried out in the same way.

The method for calculation of phylogenetic likelihood under a GPP model is fully described in Kenney and Gu [55]. The implementation of the unique hessian likelihood calculation, and the optimization routines employed to fit the GPP models to sequence data, are distributed via the COLD package as open source software “www.mathstat.dal.ca/~tkenney/Cold/, https://github.com/tjk23/COLD”. COLD uses a variety of metrics to monitor convergence, but COLD’s main convergence test is whether the expected improvement from the next step is less than 1e-10. To deal with some difficult cases, COLD will also claim convergence if the moving average of either expected or actual improvement is less than 1e-5, and will signal if the program has failed to make progress for a long time. Problematic cases of optimization are indicated when either (*i*) COLD fails to converge within the maximum number of iterations, or (*ii*) the likelihood of an alternative model is lower than the null (indicating convergence to a sub-optimal peak). In this study, if ether outcome occurred, models were re-run several times with different initial values for the model parameters.