The simplicity hypothesis and constructive neutral evolution
Let T represent a transferable gene encoding a single protein product P that confers a fitness advantage to a host cell under specific environmental conditions (e.g., depending on available nutrients, temperature, salinity, pH, the presence of antimicrobial substances, etc.). The copies of a transferable gene that reside within a microbial host population will, for the purpose of our model, be construed as a population of transferable genes or a “population of T”. The milieu of the simplicity hypothesis is a “population of populations” or “metapopulation” [29] of transferable genes. Naïve microbial populations into which T can be transferred are also assumed to exist.
The simplicity hypothesis requires connectivity to vary across populations of T. One possible source of variation, proposed by Novick and Doolittle [32], is constructive neutral evolution (CNE). Broadly speaking, CNE occurs when a mutation in one gene (whether transferable or not) that would be deleterious is rendered neutral or nearly neutral due to a fortuitous or previously selected association with another gene that “pre-suppresses” the deleterious effect of that mutation [19]. The fixation of the mutation by drift will result in an increase in the dependency between the two genes, which would subsequently be maintained by purifying selection. In the case of gene duplication, for example, a mutation in one copy that would otherwise reduce the fitness of an organism can be fixed by drift due to the presence of the second copy. The performance of a function that was once carried out by the original copy might thereby come to depend on the existence of both paralogs via a process known as subfunctionalization [15, 38]. See Muñoz-Gómez et al. [31] and citations therein for other examples of genetic features that can be explained by CNE.
In the specific context of a transferable gene, it is assumed that mutations in T that reduce or eliminate the fitness advantage P confers to a host cell can sometimes arise. It is further assumed that the impact of such mutations on P can sometimes be neutralized by the presence of host genes or gene products. The fixation of any such mutation by drift will increase the degree to which the fitness advantage P provides depends on genes or gene products specific to the microbial population that currently hosts the population of T. In this way, CNE can increase the connectivity of T and decrease the probability that P will function in a way that provides a fitness advantage to the next naïve microbial host cell that T enters by HGT. Similarly, it is assumed that mutations in host genes that reduce host fitness can sometimes be neutralized by the presence of T. The fixation of such mutations by drift will increase the degree to which the viability of the host population depends on the presence of T. This will reduce the probability that T will be lost from that host population, as would otherwise be likely in the event of a change in the environment that negates the fitness advantage P provides (e.g., [5, 25]). Any such increase in host dependency is referred to as an increase in the indispensability of T to its current host population.
The preceding demonstrates how, in theory, constructive neutral evolution can gradually increase the indispensability and, independently, the connectivity of a population of transferable genes while it resides within its current microbial host population. CNE can therefore act as a complexity ratchet to produce what Novick and Doolittle [32] call a sessile transferable gene, one unlikely to suffer gene loss from its current microbial host population due to its high indispensability, but also unlikely to be fixed following transfer into a naïve microbial population due to its high connectivity. This trend toward greater complexity and the sessile state can be opposed by the opportunity to colonize naïve microbial populations by HGT. The indispensability of T in a newly colonized microbial population is minimal since it takes time for the new host to accumulate dependencies on T. High rates of colonization will therefore reduce the mean indispensability of T across a metapopulation of transferable genes and increase the probability that some populations of T will be eliminated by gene loss. However, the opportunity to colonize also favors the dispersal of variants of T with lower connectivity. The opportunity to disperse by HGT can therefore act as a simplicity ratchet to produce what Novick and Doolittle [32] call an itinerant transferable gene, one quite likely to colonize naïve microbial populations due to its low connectivity, but also unlikely to persist in any one microbial host population for long due to its low indispensability.
Changes in connectivity
A functional module is a group of genes or gene products related by genetic or intracellular interactions [41]. Functional modules are often displayed as a graph with nodes representing genes or their protein products and edges indicating relationships between nodes. In this context, the connectivity of a gene (whether transferable or not) is just the number of edges connecting it to other nodes in the same gene co-expression or protein interaction network [8]. A change in the connectivity of a gene corresponds to a change in the number of such edges. This can occur in several ways, depending on the gene. If the gene codes for a protein that is part of a supramolecular complex, then any change in the number of subunits that make up the complex will change the gene’s connectivity. The evolution of tetrameric hemoglobin from a monomeric ancestral protein provides an example [4]. Connectivity can also be altered by a change in the number of proteins involved in a metabolic, signaling, or regulatory pathway. The transcription factor SIM1, for example, plays several roles in humans, from the development of neurons during embryogenesis to the regulation of functions in the adult form. The STRING [39] database indicates that SIM1 is involved in eight direct protein–protein interactions in humans but only one in Mus musculus, suggesting that the connectivity of SIM1 might have changed over macroevolutionary time scales.
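The graph-based definition of connectivity given above can be made concrete with a minimal sketch. The edge list below is a hypothetical toy module, not data from STRING or any other database, and the function name `connectivity` is an illustrative choice.

```python
from collections import defaultdict


def connectivity(edges):
    """Return a dict mapping each node to its number of distinct neighbours."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # a self-loop does not connect a gene to another node
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


# Hypothetical toy module: T interacts with three host gene products.
toy_edges = [("T", "hostA"), ("T", "hostB"), ("T", "hostC"), ("hostA", "hostB")]
degrees = connectivity(toy_edges)
```

Adding or removing an edge that touches a node changes its entry in `degrees`, which corresponds to a change in the connectivity of that gene.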
Here we posit an additional and, in some ways, more subtle process of change. The protein product P of T needs to fold into a specific stable configuration and may require access to one or more specific binding partners to carry out its selected function (i.e., the function that is beneficial to the host under some environmental conditions). We assume that mutations in T that would cause P to become unstable or unable to carry out its selected function can be pre-suppressed or rendered neutral by the presence of genes or gene products native to the host. The fixation of such mutations by drift would increase the connectivity of the transferable gene, here broadly construed as the degree to which P depends on the specific intracellular milieu provided by its current host to function. We also entertain the possibility that other mutations in T can remedy the need for the suppression of previously fixed mutations, and that these can lead to a reduction in the connectivity of T if fixed, although such reversions are presumably rare (e.g., [19]).
Changes in indispensability
An essential gene is defined to be one that supports a function that is necessary for reproductive success (e.g., genes required for transcription and translation). Such genes tend to correspond to nodes in functional modules with many edges and so are typically not transferable but rather part of a core genome common to a wide range of strains or species [37]. Interestingly, there is evidence to suggest that the essentiality of a gene is nevertheless mutable and subject to evolutionary processes (ibid). Here we define the indispensability of a transferable gene to be the degree to which the viability of a microbial host population comes to depend on T via a process of gene-host coevolution. By this definition, an indispensable transferable gene is in some ways like an essential gene. However, an increase in the indispensability of T does not necessarily make the transferable gene essential or necessary for reproductive success. Instead, we imagine that a transferable gene can sometimes insinuate itself into the protein networks of its host by CNE in a Rube Goldberg fashion until the cell can no longer survive without it [19]. This can occur if the presence of T acts to pre-suppress the deleterious effects of mutations that arise in host genes. The fixation of such mutations by drift will increase the indispensability of T while it resides in its current microbial host population, making it less likely that the host population will lose T in an environment in which the selected function of P provides no fitness advantage.
Accounting for gene-host coevolution
Let \(P\left(\Delta {x}_{i}\right), {x}_{i}\in \left\{{y}_{i},{z}_{i}\right\}\) represent the probability that a mutation that changes the indispensability \(\left({y}_{i}\right)\) or connectivity \(\left({z}_{i}\right)\) of T from \({x}_{i}\) to \({x}_{i}+\Delta {x}_{i}\) arises in one copy of the transferable gene and is subsequently fixed in its current microbial host population by drift. Three outcomes are considered, \(\Delta {x}_{i}\in \left\{-1,0,+1\right\}\), when \({x}_{i}\ge 1\), and two, \(\Delta {x}_{i}\in \left\{0,+1\right\}\), when \({x}_{i}=0\). In both cases, \(\Delta {x}_{i}=0\) indicates that no mutation occurred or that one occurred but was not fixed. It is assumed that change rarely occurs, so \(P\left(0\right)\approx 1\), and that mutations are biased to increase both the indispensability and connectivity of T, so \(P\left(+1\right)\gg P\left(-1\right)\). This is consistent with the general view that constructive neutral evolution is a rare process that tends to increase complexity over time [31]. The expected change in \({y}_{i}\) and \({z}_{i}\) over one ancestor–descendant mapping is therefore:
$$\mathrm{E}\left(\Delta {x}_{i}\right)=\left\{\begin{array}{ll} P\left(+1\right)-P\left(-1\right), & {x}_{i}\ge 1\\ \frac{P\left(+1\right)}{P\left(0\right)+P\left(+1\right)}, & {x}_{i}=0\end{array}\right.$$
(1)
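As a minimal sketch, Eq. 1 can be evaluated as follows. The function name and the numerical values are hypothetical, and the code assumes that the three probabilities sum to one when \({x}_{i}\ge 1\), so that \(P\left(0\right)=1-P\left(+1\right)-P\left(-1\right)\).

```python
def expected_change(x, p_plus, p_minus):
    """Expected one-step change in a character (y or z), following Eq. 1.

    Assumes P(-1) + P(0) + P(+1) = 1, so P(0) is not passed explicitly.
    """
    if x < 0:
        raise ValueError("character states are non-negative")
    if x >= 1:
        return p_plus - p_minus
    # x == 0: the -1 outcome is unavailable, so the remaining
    # probabilities are renormalised over the outcomes {0, +1}.
    p_zero = 1.0 - p_plus - p_minus
    return p_plus / (p_zero + p_plus)


# Hypothetical values chosen only so that P(+1) >> P(-1) and P(0) is near 1.
bias_interior = expected_change(3, p_plus=0.01, p_minus=0.001)
bias_boundary = expected_change(0, p_plus=0.01, p_minus=0.001)
```

Under these assumptions the expected change is positive in both cases, which is the transmission bias toward greater indispensability and connectivity described in the text.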
The key assumption of our model is that the indispensability and connectivity of a transferable gene can independently increase or decrease over time via gene-host coevolution. Although CNE provides a plausible mechanism for change [32], the assumption that change occurs by CNE alone is not crucial. In other words, \(P\left(\Delta {x}_{i}\right)\) can be interpreted as the probability of change due to all evolutionary processes that might impact the state \(\left({y}_{i},{z}_{i}\right)\) of a population of transferable genes, including but not necessarily limited to CNE.
The fitness of an ancestral population of transferable genes
The loss of a transferable gene from its current microbial host population can be construed as the death of a population of T. Likewise, colonization of a naïve microbial population by HGT can be construed as the birth of a new population of T. It is therefore possible to treat a population of transferable genes as an individual unit that can be assigned fitness in the form of the number of descendant populations it generates. Fitness, once defined, can then be used to map an ancestral metapopulation of transferable genes onto a descendant metapopulation.
The fitness advantage T confers to a host cell is a function of the state of the environment in which its host population resides. This is assumed to vary across microbial populations and over time. Whether the state of the environment changes in such a way as to negate the fitness advantage P confers (i.e., due to a shift to a neutral environmental state) is determined by a Bernoulli random variable with expected value \(\delta \in \left[0,1\right]\). Whether an ancestral population of T will suffer death by gene loss following a temporary switch to the neutral environment is assumed to be a Bernoulli random variable with expected value \({p}_{D}\left({y}_{i}\right)\), a function of indispensability. The expected probability of death by gene loss over one ancestor–descendant mapping is therefore the product \(\delta {p}_{D}\left({y}_{i}\right)\). It follows that the probability that an ancestral population of transferable genes will persist into the descendant metapopulation by evading death is \({w}_{i}^{p}=1-\delta {p}_{D}\left({y}_{i}\right)\) (superscript “p” for “persistence”).
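The product rule for death by gene loss can be checked with a small Monte Carlo sketch that draws the two Bernoulli variables in sequence. The values of \(\delta\) and \({p}_{D}\) below are hypothetical, and the two draws are assumed to be independent.

```python
import random


def dies_by_gene_loss(delta, p_d, rng):
    """One ancestor-descendant mapping: True if the population of T is lost."""
    neutral_shift = rng.random() < delta  # environment turns neutral
    # Gene loss is only possible once the environment has turned neutral.
    return neutral_shift and (rng.random() < p_d)


rng = random.Random(1)
delta, p_d = 0.5, 0.4  # hypothetical values for illustration only
trials = 100_000
death_freq = sum(dies_by_gene_loss(delta, p_d, rng) for _ in range(trials)) / trials
persistence = 1.0 - delta * p_d  # w_i^p for this population
```

The simulated frequency `death_freq` should lie close to the product `delta * p_d`, and `persistence` is its complement.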
The number of naïve microbial populations an ancestral T enters by HGT is assumed to be a Poisson random variable with expected value \({\beta }_{N}=\beta \left(1-N/{N}_{max}\right)\), where \(N\) is the current number of ancestral populations of T and \({N}_{max}\) is an upper bound placed on the size of the metapopulation of transferable genes (i.e., the maximum number of populations of T it may contain). Whether T is fixed following HGT is assumed to be a Bernoulli random variable with expected value \({p}_{B}\left({z}_{i}\right)\), a function of connectivity. The expected number of new populations of the transferable gene generated by an ancestral population over one ancestor–descendant mapping is therefore \({w}_{i}^{m}={\beta }_{N}{p}_{B}\left({z}_{i}\right)\) (superscript “m” for “multiplication”).
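The birth process described above can be sketched as a Poisson number of HGT events followed by a Bernoulli fixation trial for each. The function names and parameter values are hypothetical, and the Poisson sampler is a standard multiplication method that is adequate only for small means.

```python
import math
import random


def colonization_rate(beta, n, n_max):
    """Density-dependent expected number of HGT events, beta_N."""
    return beta * (1.0 - n / n_max)


def poisson_draw(mean, rng):
    """Draw a Poisson variate by multiplying uniforms (suitable for small means)."""
    if mean <= 0:
        return 0
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def new_populations(beta, n, n_max, p_b, rng):
    """Number of naive populations in which T is fixed after HGT."""
    attempts = poisson_draw(colonization_rate(beta, n, n_max), rng)
    return sum(rng.random() < p_b for _ in range(attempts))


rng = random.Random(7)
# Hypothetical values: beta_N = 2 * (1 - 50/100) = 1 and p_B = 0.5.
draws = [new_populations(2.0, 50, 100, 0.5, rng) for _ in range(50_000)]
mean_births = sum(draws) / len(draws)
```

The sample mean `mean_births` should approximate \({w}_{i}^{m}={\beta }_{N}{p}_{B}\left({z}_{i}\right)\), and the rate falls to zero as \(N\) approaches \({N}_{max}\).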
The expected fitness of an ancestral population of transferable genes as a function of its indispensability and connectivity is just the sum of the contributions made by persistence and multiplication, \({w}_{i}={w}_{i}^{p}+{w}_{i}^{m}\). The specification of fitness is complete once the functional forms for the probabilities \({p}_{D}\left({y}_{i}\right)\) and \({p}_{B}\left({z}_{i}\right)\) have been chosen. It is not clear what forms these probabilities should take to best reflect what might occur in nature apart from the plausible assumption that both are decreasing functions. For the sake of simplicity, we assume a common exponential form:
$${w}_{i}={w}_{i}^{p}+{w}_{i}^{m}=1-\delta \mathrm{exp}\left(-{y}_{i}s\right)+{\beta }_{N}\mathrm{exp}\left(-{z}_{i}s\right)$$
(2)
The scaling parameter \(s\) controls the rate at which each exponential function approaches its horizontal asymptote at zero. This was set to \(s=0.20\) to simulate a relatively slow approach, with an e-fold decrease in probability when \({y}_{i}\) or \({z}_{i}=5\). To restate, \(\mathrm{exp}\left(-{y}_{i}s\right)\) models the probability that, following a temporary shift to the neutral environmental state, T is lost from one cell and the lineage of cells without T is subsequently fixed by drift. This is equated to the death of an ancestral population of T over one ancestor–descendant mapping. And \(\mathrm{exp}\left(-{z}_{i}s\right)\) models the probability that a lineage of cells with T in an otherwise naïve microbial population reaches fixation. This is equated to the birth of a descendant population of T. The number of microbial generations separating an ancestral metapopulation from its descendant metapopulation is assumed to be more than sufficient for these within-population processes to reach completion (e.g., billions of microbial generations). See Fig. 4 for a depiction of these birth and death processes.
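Eq. 2 translates directly into code. The sketch below uses the stated value \(s=0.20\); the function names and the values of \(\delta\), \(\beta\), \(N\), and \({N}_{max}\) in the example are hypothetical.

```python
import math

S = 0.20  # scaling parameter s as stated in the text


def persistence_fitness(y, delta, s=S):
    """w_i^p = 1 - delta * exp(-y * s)."""
    return 1.0 - delta * math.exp(-y * s)


def multiplication_fitness(z, beta, n, n_max, s=S):
    """w_i^m = beta_N * exp(-z * s), with beta_N = beta * (1 - N / N_max)."""
    return beta * (1.0 - n / n_max) * math.exp(-z * s)


def fitness(y, z, delta, beta, n, n_max, s=S):
    """Expected fitness of a population of T in state (y, z), following Eq. 2."""
    return persistence_fitness(y, delta, s) + multiplication_fitness(z, beta, n, n_max, s)


# Hypothetical parameter values for illustration only.
w_new = fitness(y=0, z=0, delta=0.1, beta=0.5, n=0, n_max=100)
e_fold = math.exp(-5 * S)  # probability multiplier at y or z = 5
```

With \(s=0.20\), `e_fold` equals \({e}^{-1}\), which is the e-fold decrease at a character value of 5 noted above.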
The character state of a descendant population of transferable genes
The indispensability and connectivity of a descendant population of transferable genes generated by persistence are inherited, in a manner of speaking, from its ancestral population subject to transmission bias due to gene-host coevolution (Eq. 1). The descendant character state is therefore \(\left({y}_{i}+\Delta {y}_{i} ,{z}_{i}+\Delta {z}_{i}\right)\) where \(\left({y}_{i},{z}_{i}\right)\) is the state of the ancestral population of T and \(\Delta {y}_{i}\) and \(\Delta {z}_{i}\) represent any change that might be realized during one ancestor–descendant mapping. The state of a descendant population produced by HGT, by contrast, is \(\left(0,{z}_{i}\right)\). Indispensability is set to zero because of the assumption that the dependencies accumulated by an ancestral host population are absent in the naïve microbial population. Connectivity is preserved due to the additional simplifying assumption that gene-host coevolution does not occur until T is fixed in any naïve microbial population it enters. The difference between the two kinds of descendant populations is illustrated in Fig. 5.
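The two inheritance rules can be written as a pair of small functions. This is a sketch under the assumptions stated in the text; the function names and the example state are hypothetical.

```python
def persisting_descendant(state, dy=0, dz=0):
    """State of a descendant generated by persistence: (y + dy, z + dz)."""
    y, z = state
    new_y, new_z = y + dy, z + dz
    if new_y < 0 or new_z < 0:
        # A decrease is not among the permitted outcomes when a character is zero.
        raise ValueError("indispensability and connectivity cannot fall below zero")
    return (new_y, new_z)


def colonizing_descendant(state):
    """State of a descendant generated by HGT: indispensability resets to zero."""
    _, z = state
    return (0, z)


ancestor = (4, 3)  # hypothetical (y_i, z_i)
by_persistence = persisting_descendant(ancestor, dy=1, dz=0)
by_hgt = colonizing_descendant(ancestor)
```

Only the persisting descendant carries forward the dependencies accumulated by the ancestral host, whereas both kinds of descendant inherit the ancestral connectivity.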
The Price equation
We use the Price equation [16, 17, 35] to write an expression for the change in the mean character state of T in a metapopulation of transferable genes over one ancestor–descendant mapping. Of primary interest is to identify conditions under which the mean connectivity of T will decrease. If \({q}_{i}\) is the proportion of ancestral populations with \(\left({y}_{i},{z}_{i}\right)\), then the change in the mean connectivity of T over one mapping is (see Appendix):
$$\Delta \overline{z }=\frac{1}{\overline{w}}{\sum }_{i}{q}_{i}\left({w}_{i}^{m}-\overline{w }\right){z}_{i}+\frac{1}{\overline{w}}{\sum }_{i}{q}_{i}{w}_{i}^{p}\left({z}_{i}+\mathrm{E}\left(\Delta {z}_{i}\right)\right)$$
(3)
The first sum accounts for differences in the number of descendant populations that each ancestral population of T generates by HGT. Fitness \({w}_{i}^{m}\) and connectivity \({z}_{i}\) are negatively correlated, so this sum is interpreted as the effect of selection that favors populations of T with lower connectivity. The second sum accounts for the expected change in the connectivity of T due to gene-host coevolution within each ancestral population that persists into the descendant metapopulation (see Fig. 5). The expectation \(\mathrm{E}\left(\Delta {z}_{i}\right)>0\) is biased toward greater connectivity, so this sum is interpreted as the effect of an evolutionary complexity ratchet. The first sum will tend to decrease the mean connectivity of T in the metapopulation over one ancestor–descendant mapping provided \(\mathrm{var}\left({z}_{i}\right)>0\). The second sum will tend to increase the mean connectivity of T over one mapping due to the assumed evolutionary bias toward greater complexity. The direction of change in the mean connectivity of T will therefore depend on the relative size of these two sums, or equivalently, on the tradeoff between multiplication and persistence.
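The two sums in Eq. 3 can be evaluated numerically and checked against a direct calculation of the descendant mean. The sketch below assumes the exponential forms of Eq. 2 and that the probabilities in Eq. 1 sum to one; the two-population example and all parameter values are hypothetical.

```python
import math


def expected_change(x, p_plus, p_minus):
    """Eq. 1, assuming P(-1) + P(0) + P(+1) = 1."""
    if x >= 1:
        return p_plus - p_minus
    return p_plus / (1.0 - p_minus)


def price_sums(pops, delta, beta_n, s, p_plus, p_minus):
    """Return the first and second sums of Eq. 3.

    pops is a list of (q_i, y_i, z_i) with the q_i summing to one.
    """
    w_p = [1.0 - delta * math.exp(-y * s) for _, y, _ in pops]
    w_m = [beta_n * math.exp(-z * s) for _, _, z in pops]
    w_bar = sum(q * (wp + wm) for (q, _, _), wp, wm in zip(pops, w_p, w_m))
    first = sum(q * (wm - w_bar) * z for (q, _, z), wm in zip(pops, w_m)) / w_bar
    second = sum(q * wp * (z + expected_change(z, p_plus, p_minus))
                 for (q, _, z), wp in zip(pops, w_p)) / w_bar
    return first, second


def direct_change(pops, delta, beta_n, s, p_plus, p_minus):
    """Descendant mean connectivity minus ancestral mean, computed directly."""
    total_w = total_z = 0.0
    for q, y, z in pops:
        wp = 1.0 - delta * math.exp(-y * s)
        wm = beta_n * math.exp(-z * s)
        total_w += q * (wp + wm)
        total_z += q * (wm * z + wp * (z + expected_change(z, p_plus, p_minus)))
    z_bar = sum(q * z for q, _, z in pops)
    return total_z / total_w - z_bar


# Hypothetical metapopulation: two equally frequent states differing in z.
pops = [(0.5, 2, 1), (0.5, 2, 6)]
params = dict(delta=0.1, beta_n=0.5, s=0.2, p_plus=0.01, p_minus=0.001)
first, second = price_sums(pops, **params)
delta_z_bar = first + second
```

The sum `first + second` should agree with `direct_change(pops, **params)` up to floating-point error, which serves as a consistency check on the decomposition.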
It is important to note that fitness in the Price equation is a realized value, which in our model is equated to an expectation (Eq. 2). The transmission bias is likewise equated to an expected value (Eq. 1). It follows that Eq. 3 is deterministic. It is nevertheless possible to account for stochastic variation in birth, death, and transmission bias by simulating these as random processes, and to use simulations to explore the conditions under which the transferable gene might evolve to become more sessile or more itinerant. See Additional file 1 for details.
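A stochastic version of one ancestor–descendant mapping might be sketched as follows. This is an illustrative reconstruction from the description in the text, not the simulation code of Additional file 1; the function names, the clamping of \({\beta }_{N}\) at zero, and the choice to apply gene-host coevolution only to persisting populations are assumptions.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw a Poisson variate by multiplying uniforms (suitable for small means)."""
    if mean <= 0:
        return 0
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def draw_change(x, p_plus, p_minus, rng):
    """Sample a change of -1, 0, or +1 with the outcome probabilities behind Eq. 1."""
    u = rng.random()
    if x == 0:
        # The -1 outcome is unavailable; renormalise over {0, +1}.
        return 1 if u < p_plus / (1.0 - p_minus) else 0
    if u < p_plus:
        return 1
    if u < p_plus + p_minus:
        return -1
    return 0


def next_metapopulation(pops, delta, beta, n_max, s, p_plus, p_minus, rng):
    """Map a list of ancestral states (y, z) onto a list of descendant states."""
    beta_n = beta * max(0.0, 1.0 - len(pops) / n_max)  # clamp is an assumption
    descendants = []
    for y, z in pops:
        # Birth: each HGT event is fixed with probability exp(-z * s).
        for _ in range(poisson_draw(beta_n, rng)):
            if rng.random() < math.exp(-z * s):
                descendants.append((0, z))
        # Death: neutral shift, then gene loss with probability exp(-y * s).
        lost = rng.random() < delta and rng.random() < math.exp(-y * s)
        if not lost:
            descendants.append((y + draw_change(y, p_plus, p_minus, rng),
                                z + draw_change(z, p_plus, p_minus, rng)))
    return descendants


rng = random.Random(42)
founders = [(0, 3)] * 20  # hypothetical starting metapopulation
offspring = next_metapopulation(founders, delta=0.1, beta=0.5, n_max=100,
                                s=0.2, p_plus=0.01, p_minus=0.001, rng=rng)
```

Calling `next_metapopulation` repeatedly on its own output gives one possible trajectory of the metapopulation, and averaging over many seeds would allow comparison with the deterministic expectation of Eq. 3.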