Decomposing Mutation and Selection to Identify Mismatched Codon Usage

For decades, codon usage has been used as a measure of adaptation for translational efficiency of a gene’s coding sequence. These patterns of codon usage reflect both the selective and mutational environment in which the coding sequences evolved. Over this same period, gene transfer between lineages has become widely recognized as an important biological pheonmena. Nevertheless, most studies of codon usage implicitly assume that all genes within a genome evolved under the same selective and mutational environment, an assumption violated when introgression occurs. In order to better understand the effects of introgression on codon usage patterns and vice versa, we examine the patterns of codon usage in the yeast which has experienced a large introgression, Lachancea kluyveri. We quantify the effects of mutation bias and selection for translation efficiency on the codon usage pattern of the endogenous and introgressed exogenous genes using a Bayesian mixture model, ROC SEMPPR, which is built on mechanistic assumptions of protein synthesis and grounded in population genetics. We find substantial differences in codon usage between the endogenous and exogenous genes, and show that these differences can be largely attributed to a shift in mutation bias from A/T ending codons in the endogenous genes to C/G ending codons in the exogenous genes. Recognizing the two different signatures of mutation and selection bias improves our ability to predict protein synthesis rate by 17% and allowed us to accurately assess codon preferences. In addition, using our estimates of mutation and selection bias, we to identify Eremothecium gossypii as the most likely source lineage, estimate the introgression occurred ~ 6 × 108 generation ago, and estimate its historic and current genetic load. Together, our work illustrates the advantage of mechanistic, population genetic models like ROC SEMPPR and the quantitative estimates they provide when analyzing sequence data.


Introduction
than endogenous genes to their new cellular environment, we expect a greater genetic load of transferred 48 genes if donor and recipient environment differ greatly in their selection bias, making such transfers less 49 likely. More practically, if differences in codon usage of transferred genes are unaccounted for, they may 50 distort parameter estimates. Such distortion could lead to the wrong codon preference for an amino 51 acid, underestimate the variation in protein synthesis rate, or bias mutation estimates when analyzing a 52 genome. 53 To illustrate these ideas, we analyze the CUB of the genome of Lachancea kluyveri, which is sister 54 to all other Lachancea. The Lachancea clade diverged from the Saccharomyces clade, prior to its whole 55 genome duplication ∼ 100 Mya ago (Marcet-Houben and Gabaldn, 2015;Beimforde et al., 2014). Since

90
To better understand the differences in the endogenous and exogenous cellular environments, we com-91 pared our parameter estimates of mutation bias ∆M and selection ∆η for the two sets of genes. Our 92 estimates of ∆M for the endogenous and exogenous genes were negatively correlated (ρ = −0.49), in-93 dicating weak concordance of ∼ 5% between the two mutation environments (Figure 2). For example, 94 the endogenous genes show a mutational preference for A and T ending codons in ∼ 95% of the codon 95 families. In contrast, the exogenous genes display an equally consistent mutational preference towards 96 C and G ending codons (Table S1). As a result, only the two codon amino acid Phenylalanine (Phe, F) 97 shares the same rank order across the endogenous and exogenous ∆M estimates.

98
In contrast, our estimates of ∆η for the endogenous and exogenous genes were positively correlated 99 (ρ = 0.69) and showing concordance of ∼ 53% between the two selection environments ( Figure 2). ROC 100 SEMPPR constraints E[φ] = 1, allowing us to interpret ∆η as selection on codon usage of the average 101 gene with φ = 1 and gives us the ability to compare the efficacy of selection sNe across genomes. We 102 find that the strength of selection within each codon family differs between sets of genes. Overall, the 103 endogenous genes only show a selection preference for C and G ending codons in ∼ 58% of the codon 104 families. In contrast, the exogenous genes display a strong preference for A and T ending codons in 105 ∼ 89% of the codon families.

106
The difference in codon usage between endogenous and exogenous genes is striking. As a result, our 107 estimates of the optimal codon differ in nine cases between endogenous and exogenous genes (Table S2).  Figure 2: Comparison of (a) mutation bias ∆M and (b) selection bias ∆η parameters for endogenous and exogenous genes. Estimates are relative to the mean for each codon family. Black dots indicate ∆M or ∆η parameters with the same sign for the endogenous and exogenous genes, red dots indicate parameters with different signs. Black line shows the type II regression line (Sokal and Rohlf, 1981). Dashed lines mark quadrants.
Fits to the complete L. kluyveri genome reveal that the relatively small exogenous gene set (∼ 10% 109 of genes) has a disproportional effect on the model fit. We find that the complete L. kluyveri genome 110 is estimated to share the mutational preference with the exogenous genes in ∼ 78% of the 19 codon 111 families that are discordant between the endogenous and exogenous genes. In two cases, Isoleucine (Ile, 112 I) and Arginine (Arg, R), the strong discordance in mutation preference results in an estimated codon 113 preference in the complete L. kluyveri genome that differs from both the endogenous, and the exogenous 114 genes.

115
The effect of the small exogenous gene set on the fit to the complete L. kluyveri genome is smaller 116 in our estimates of selection bias ∆η than ∆M , but still large. We find that the complete L. kluyveri 117 genome is estimated to share the selection preference with the exogenous genes in ∼ 60% of codon 118 families that show discordance between endogenous and exogenous genes. These results clearly show 119 that it is important to recognize the difference in endogenous and exogenous genes and treat these genes 120 as separate sets to avoid the inference of incorrect synonymous codon preferences and better predict 121 protein synthesis.

123
We combined our estimates of mutation bias ∆M and selection bias ∆η with synteny information and 124 searched for potential source lineages of the introgressed exogenous region. We examined 38 yeast lineages 125 (Table S3)   introgression assuming that E. gossypii still represents the mutation bias of its ancestral source lineage 141 at the time of the introgression and a constant mutation rate. We infer the age of the introgression to 142 be on the order of 6.2 ± 1.2 × 10 8 generations. Assuming L. kluyveri experiences between one and eight 143 generations per day, we estimate the introgression to have occurred between 212, 000 to 1, 700, 000 years 144 ago. Our estimate places the time of the introgression earlier than previously assumed (Friedrich et al., .

146
Using the same approach, we also estimated the persistence of the signal of the exogenous cellular 147 environment. We assume that differences in mutation bias will decay more slowly than differences in 148 selection bias to be able to utilize our bias free estimates of ∆M . We predict that the ∆M signal of the 149 source cellular environment will have decayed to be within one percent of the L. kluyveri environment 150 in ∼ 5.4 ± 0.2 × 10 9 generations, or between 1, 800, 000 and 15, 000, 000 years. Together, these results

151
indicate that the mutation signature of the exogenous genes will persist for a very long time.

152
Genetic Load due to Mismatching Codon Usage of the Exogenous Genes

153
We define genetic load as the difference between the fitness of an expected, replaced endogenous gene and 154 the exogenous gene, s ∝ φ∆η due to the mismatch in codon usage parameters (See Methods for details).

155
Estimates of selection bias for the exogenous genes show that, while well correlated with the endogenous 156 genes, only nine amino acids share the same optimal codon. Exogenous genes are, therefore, expected to 157 represent a significant reduction in fitness, or genetic load for L. kluyveri due to this mismatch in codon 4a) and currently ( Figure 4b). We find that the genetic load due to mismatched codon usage was -0.0008 164 at the time of the introgression and still represents a genetic load of -0.0003 today.

165
In order to account for differences in the efficacy of selection on codon usage between the donor 166 lineage and L. kluyveri using a linear scaling factor κ (See Methods for details). We predict that a small 167 number of low expression genes (φ < 1) were weakly exapted at the time of the introgression ( Figure 4a).

168
High expression genes (φ > 1) are predicted to have carried the largest genetic load in the novel cellular

213
In order to estimate the introgression's genetic load due to codon mismatch, we had to make three triggered by a speciation event, this scenario seems very unlikely.

245
In the second scenario, where we assume the introgression contained advantageous loci, one may 246 wonder why recombination events did not limit the introgression to only the adaptive loci. Payen Choice of reference codon does reorganize codon families coding for an amino acid relative to each other, therefore all parameter estimates are relative to the mean for each codon family.

282
The parameter ∆η can be interpreted as the difference in fitness between codon i and j for the average 283 gene with φ = 1 scaled by the effective population size Ne, and the selective cost of an ATP q (Gilchrist, 284 2007; Gilchrist et al., 2015). Type II regression was performed with re-centered parameter estimates, 285 accounting for noise in dependent and independent variable (Sokal and Rohlf, 1981).

291
We modeled the change in codon frequency over time using an exponential model for all two codon amino acids, and describing the change in codon c1 as where µi,j is the rate at which codon i mutates to codon j and c1 is the frequency of the reference codon.
Our estimates of ∆M endo can be used to calculate the steady state of equation 3.
µ2,1 µ1,2 + µ2,1 Solving for µ1,2 gives us µ1,2 = ∆M endo exp[µ2,1] which allows us to rewrite and solve equation 3 as where K is Equation 5 was solved with a mutation rate m2,1 of 3.8 × 10 −10 per nucleotide per generation (Lang To estimate the genetic load due to mismatched codon usage, we made three key assumptions. First, we 301 assumed that the current exogenous amino acid sequence of a gene is representative of its ancestral state 302 and the replaced endogenous gene it replaced. Second, we assume that the currently observed cellular 303 environment of E. gossypii reflects the cellular environment that the exogenous genes experienced before 304 transfer to L. kluyveri. Lastly, we assume that the difference in the efficacy of selection between the 305 cellular environments due to differences in either effective population size Ne or the selective cost of an 306 ATP q of the source lineage and L. kluyveri can be expressed as a scaling constant and that protein 307 synthesis rate φ has not changed between the replaced endogenous and the introgressed exogenous genes.

310
We scale the difference in the efficacy of selection on codon usage between the donor lineage and L.
kluyveri using a linear scaling factor κ. As ∆η is defined as ∆η = 2Neq(ηi − ηj), we can not distinguish if κ is a scaling on protein synthesis rate φ, effective population size Ne, or the selective cost of an ATP q (Gilchrist, 2007;Gilchrist et al., 2015). We calculated the genetic load each gene represents due to its mismatched codon usage assuming additive fitness effects as where sg is the overall strength of selection for translational efficiency on gene, g in the exogenous gene 311 set, κ is a constant, scaling the efficacy of selection between the endogenous and exogenous cellular 312 environments, ng is length of the protein, φg is the estimated protein synthesis rate of the gene in the 313 endogenous environment, and ∆η i , is the ∆η' for the codon at position i. As stated previously, our ∆η 314 are relative to the mean of the codon family. We find that the genetic load of the introgressed genes 315 is minimized at κ ∼ 5 ( Figure S5b). Thus, we expect a five fold difference in the efficacy of selection 316 between L. kluyveri and E. gossypii, either due to differences in either protein synthesis rate φ, effective 317 population size Ne, or the selective cost of an ATP q. Therefore, we set κ = 1 if we calculate the sg for 318 the endogenous and the current exogenous genes, and κ = 5 for sg for the genetic load at the time of 319 introgression.
Since we are unable to observe codon counts for the replaced endogenous genes and for the exogenous genes at the time of introgression, we calculate expected codon counts ma i is the number of occurrences of amino acid a that codon i codes for. We report the genetic load due 321 to mismatched codon usage of the introgression as E[sg] = sintro,g − s endo,g where sintro,g is the genetic 322 load of an introgressed gene g either at the time of the introgression or presently.   S  TCA  TCC  TCT  TCT  Thr T  ACT  ACC  ACT  ACT  Val V  GTT  GTC  GTT  GTT  Tyr Y  TAT  TAC  TAT  TAC  Ser 2 Z  AGT  AGT AGT AGT   Figure S2: Comparison of (a) mutation bias ∆M and (b) selection bias ∆η parameters for endogenous genes and combined gene sets. Estimates are relative to the mean for each codon family. Black dots indicate ∆M or ∆η parameters with the same sign for the endogenous and exogenous genes, red dots indicate parameters with different signs. Black line shows the type II regression line (Sokal and Rohlf, 1981). Dashed lines mark quadrants.    Figure S7: Codon usage patterns for 19 amino acids. Amino acids are indicated as one letter code. The amino acids Serine was split into two groups (S and Z) as Serine is coded for by two groups of codons that are separated by more than one mutation. Solid line indicates the endogenous codon usage, dashed line indicates the exogenous codon usage, dotted line indicates the combined codon usage.