### Markov basis of the interaction space

Our first point is that the genotype space in Table 1 allows many more tests of epistasis than the 27 standard tests performed by Elena and Lenski [2]. Notice that the standard test *w·ar - a·r* compares two genotypes having zero and two mutations with two others each having one mutation. The minimal Markov basis (see Methods for a mathematical definition) of the space of tables of the type displayed in Table 1 reveals an additional 216 non-standard tests of the following two sorts. First, there are 108 "double-double" tests that compare two double mutants with two other double mutants, holding the distribution of the mutant alleles constant, for example, *ar·bs* - *as·br*. Second, there are another 108 orthogonal "single-double" tests that compare one single mutant and one double mutant with another single mutant and another double mutant, again holding constant the allele frequencies, for example, *a·br* - *b·ar*. For both types of non-standard tests, the two genotypes on the right-hand side can be regarded as the products of recombination between the two genotypes on the left-hand side. Notice that the same relationship also holds true for the standard tests.

Experimental biologists (including two authors of this paper) are likely to raise three concerns about these non-standard tests. What biological insights can non-standard tests provide beyond those obtained using standard tests? Are these additional tests independent of the standard tests? What computational tools are available to perform such tests on other datasets?

As we show in the sections that follow, the non-standard tests are potentially useful in at least three respects. First, they allow one to focus attention on features of epistasis that are not quantifiable by the standard tests. For example, we perform non-standard tests of the "single-double" type to explore whether some mutations are better overall mixers than others. Second, non-standard tests span greater genetic distances than do pairwise tests, allowing more powerful analyses of the structure of fitness landscapes. For example, we use the "double-double" tests to test curvature at genetic distances of four, whereas standard tests allow curvature to be examined only at distances of two. Third, non-standard tests are an integral part of the complete geometric description of a fitness landscape. While this high-dimensional geometry is abstract and even foreign, we describe how biological features of gene interactions, such as mixing ability, are embedded in the geometric shape of the landscape.

The non-standard tests are not independent, in a statistical sense, of the pairwise tests or of one another because all the tests are calculated from the same underlying data. Nonetheless, this complication can be addressed by employing appropriate statistical methods (Tukey's jackknife, Bonferroni correction, etc.) to ensure that significance levels reflect the data structure. Regarding the availability of computational tools, we provide references to programs that automate the calculations of the Markov basis and perform the triangulations necessary to describe the geometry of landscapes, and these tools can be applied to other datasets. In supplementary material, we illustrate the use of these computational tools and show their output (see Additional files 1-5).

### Comparing standard and non-standard epistasis terms overall

An important feature of the standard tests is that the sign of epistasis, either positive or negative, is always expressed in reference to the same wild-type strain. The key result reported by Elena and Lenski, based on the standard tests, was that there were many significant epistatic deviations in both directions, in contrast to one hypothesis that predicted negative epistasis should be the general rule [17]. At first glance, the non-standard tests described above would not seem to allow this prediction to be tested, because the sign of the difference is arbitrary; that is, the labels that put one pair of mutants on one side of the equation, and not on the other side, are arbitrary. However, the biological interest in epistasis is often framed in terms of deleterious mutations, and Elena and Lenski ensured that they had a high-fitness wild-type strain by using one that had evolved for 10,000 generations in the exact same environment employed for measuring the relative fitness of all the genotypes in their study. In the same vein, we can anchor all 216 non-standard equations simply by ensuring that the genotype with the highest fitness (of the four used in any given calculation) appears on the left-hand side of the equation with positive sign.

Figure 2 shows the distribution of epistatic deviations for all 216 of the non-standard equations (dashed curve) as well as the 27 standard equations (solid curve). The non-standard tests support the inference of Elena and Lenski, based on the standard tests, that many epistatic deviations are positive and many others negative. In other words, one directional form does not predominate to the exclusion of the others. To the eye, it would appear that the non-standard form is more skewed toward positive epistasis. If so, that difference would be interesting because the non-standard tests include genotype sets in which the maximal fitness is usually below the high-fitness wild-type strain. Previous research has shown that compensatory mutations, in which some mutations are conditionally beneficial only in the presence of certain otherwise deleterious mutations, are biologically important [32–34]. Compensatory mutations contribute to positive epistasis and, moreover, they become more important farther away from a local fitness peak [33–36]. In the context of the *E. coli* data that we are analyzing, we predict that epistatic deviations tend to more positive values when the component mutations have more deleterious effects (thus corresponding to greater distances from the local peak).

To test this prediction for the fitness peak at the wild-type, we calculated for each standard test of epistasis (such as *w·ar* - *a·r*) the average fitness decrement Δ = 1 - (*a* + *r*)/2 of the two single mutants relative to the wild-type. We then plot the epistasis values as a function of Δ for all 27 equations. In this way, we can test the hypothesis that epistatic deviations tend more toward positive values when mutations in the wild-type background are more deleterious (large Δ) than when mutations are less deleterious (small Δ). The scope of this analysis can be extended substantially by also including the non-standard "double-double" epistatic equations (such as *ar·bs* - *as·br*). We anchored each double-double test with the fittest genotype (say *ar*) on the left-hand side, and calculated the average fitness decrement, Δ = *ar* - (*as* + *br*)/2, of the two genotypes that appear on the right-hand side of the equation. These tests ask more generally for the relationship between epistasis and the average fitness loss relative to any local fitness peak rather than only the one centred on the wild-type. Moreover, these non-standard tests may have greater power because their "reach" extends over mutational distances of four, rather than the two allowed by standard tests. (We have excluded the non-standard single-double tests from this analysis, because in those equations each genotype has different Hamming distances to the two genotypes on the opposite side of the difference equation. This heterogeneity prohibits direct comparisons with the other tests).

Figure 3 shows the resulting relationship between the average fitness decrement associated with component mutations and the value of the epistatic deviation. For both standard and non-standard tests, there is a strong relationship in the predicted direction, such that epistatic interactions tend to be more positive when the component genotypes are less fit, and more negative when those genotypes are fitter. This trend is marginally significant for the 27 standard tests (*r* = 0.433, 25 d.f., *p* = 0.012), but highly significant for the larger set of 108 double-double tests (*r* = 0.597, 106 d.f., *p* <10^{-11}), if each of the deviations is viewed as independent. However, these values all rest on 37 genotypes, whose fitness values were estimated with error (albeit with replication), and hence the errors are not independent for those epistatic terms that share a genotype. To take this complication into account, we performed Tukey's jackknife test [37]. Even so, the association remains strongly significant for both the standard (mean *r* = 0.440, *t*_{
s
}= 3.761, 26 d.f., one-tailed *p* = 0.0004) and non-standard tests (mean *r* = 0.578, *t*_{
s
}= 3.672, 26 d.f., one-tailed *p* = 0.0005). It is important to understand that this inference concerns the relationship between average fitness decrements and the form of epistasis around any local peak within the particular landscape represented by these 37 genotypes. In other words, it is an inference about a restricted region of the complete *E. coli* genotypic space.

If we want to make the same type of inference about the complete genotype space, rather than the specific subset sampled by Elena and Lenski, we can apply a similar, but not identical, test. More precisely, we want to investigate the correlation between average fitness decrease and epistasis among any single and double mutants of the *E. coli* genome. In that case, the unit of independent observation becomes the nine single mutations, and the resulting inference concerns the landscape of all possible double mutations derived from the same parental strain. We again applied Tukey's jackknife, this time eliminating in turn all tests containing a particular mutation in any of the four double mutants, rather than eliminating a particular genotype as before. In this case, the association is only marginally significant (mean *r* = 0.572, *t*_{
s
}= 1.363, 8 d.f., one-tailed *p* = 0.105 for the standard tests, and mean *r* = 0.457, *t*_{
s
}= 1.780, 8 d.f., one-tailed *p* = 0.056 for the non-standard tests), but still supports the trend for increasingly positive epistasis with increasing average fitness loss. Given that these tests aim to infer a complex genome-wide relationship between average mutational effects and the form of epistasis from a small sample of mutations, it is encouraging to find such a strong trend.

We present these alternative analyses to emphasize the subtly different hypotheses that can be addressed by using our mathematical approach. Comparing the last two analyses suggests that individual mutations might have pervasive effects on the shape of the local landscape. While pervasive effects of certain mutations can make it more difficult to test broader generalizations, the precise nature of these pervasive effects is of biological interest. In the next section, we follow the same general approach, but focusing on a different set of epistatic interactions, to examine differences between individual mutations in greater detail.

### Some non-standard tests reveal differences in mixing ability

In this section, we use non-standard tests of the "single-double" type to explore a particular aspect of the fitness landscape, specifically whether certain mutations are better mixers than others. The mixing ability of any particular mutation indicates whether its epistatic interactions with other mutations tend to be positive or negative. We can then measure the relative mixing ability of two mutations by holding constant the identity of other mutations with which the two of interest are mixed. Consider the polynomial *a·br* - *b·ar*. This test asks, in effect, whether mutation *a* or *b* mixes better with a third mutation *r*. An individual test of this form might be interesting when one has specific knowledge about the identity of the three mutated genes and the position of their products in a metabolic pathway, for example. By contrast, Elena and Lenski emphasized the statistical properties of epistatic interactions. In this context, we can examine related sets of these single-double equations to ask whether one mutation is a better mixer than another in the context of the sample of mutations with which they were each tested.

Any pair of mutations belonging to the same set of three ({*a*, *b*, *c*}, {*r*, *s*, *t*}, or {*x*, *y*, *z*}) was tested with the exact same set of six mutations belonging to the other two sets. For example, the following six equations examine the relative mixing ability of "focal mutations" *a* and *b* with respect to "tester mutations" *r*, *s*, *t*, *x*, *y*, and *z*: *a·br* - *b·ar*; *a·bs* - *b·as*; *a·bt* - *b·at*; *a·bx* - *b·ax*; *a·by* - *b·ay*; and *a·bz* - *b·az*. All in all, there are nine groups of six equations each that compare the general mixing abilities of two focal mutations from the same set: *a* versus *b*, *a* versus *c*, *b* versus *c*, *r* versus *s*, *r* versus *t*, *s* versus *t*, *x* versus *y*, *x* versus *z*, and *y* versus *z*. There are another 18 groups of three equations each that compare the mixing abilities of two focal mutations from different sets. (Nine more groups of this type are redundant, in the sense that their values are determined by the first 18 groups.)

In our analysis, we focus on the nine comparisons of mixing ability that each involves six tester mutations, because these provide more statistical power that might reveal differences between focal mutations. Figure 4 summarizes these nine comparisons. Six points are plotted above each pair of focal mutations, corresponding to the six different tester mutations with which each one was combined. A value above zero indicates that the first-listed mutation in the focal pair was the better mixer. For example, for the first pair (*a*, *b*) of focal mutations, *a* was the better mixer with two tester mutations whereas *b* mixed better with four others. For each focal pair, we performed a *t*-test to ask whether the average epistatic deviation was significantly different from zero. The focal pairs (*x*, *y*) and (*y*, *z*) were both significant (*p* ≤ 0.027), with *y* being the better mixer in both cases. Two other focal pairs, (*a*, *b*) and (*b*, *c*), were marginally non-significant with *p* = 0.083 and *p* = 0.068, respectively, and *b* was the better mixer in both of these cases. The (*y*, *z*) comparison survives even a stringent Bonferroni correction that accounts for the multiplicity of related tests (*p* = 0.0054·9 < 0.05). Therefore, we can conclude that mutations are variable in their general mixing ability. The molecular identities of mutations were not identified by Elena and Lenski [2], and so we cannot say anything about the potential physiological bases for the observed differences in mixing ability. However, future studies of epistatic interactions might systematically compare mixing ability between different classes of mutations, such as those affecting protein structures and regulatory domains.

### Geometry of the fitness landscape

Our final set of results is concerned with the geometric shape of the fitness landscape. Since fitness landscapes are high-dimensional and complicated objects, it is desirable to classify them into a finite set of distinct shapes; with the general idea that fitness landscapes with the same shape are likely to share biological properties. This approach generalizes the classification of bi-allelic two-locus landscapes into those with positive epistasis versus those with negative epistasis. This appealing binary classification has been linked, for example, to the advantage of sex, but it does not extend to higher-dimensional genotype spaces. We present here a notion of the shape of a fitness landscapes for any genotypic space. This concept is intimately related to the interaction tests discussed so far, because the shape is determined by a certain subset of the gene interactions that includes the Markov basis. Thus, the proposed classification of landscapes into shapes can be regarded as a formal summary of all the various standard and non-standard tests.

The fitness landscape studied in this paper consists of the 37 *E. coli* genotypes and their fitness values as shown in Table 2. The genotope is the set of all possible allele frequencies that can be realized by any population on these genotypes. It is a nine-dimensional figure with 37 vertices, and it contains all the three-dimensional genotopes shown in Figure 1. By the shape of the fitness landscape we mean the triangulation of the genotope that is induced by the fitness values (see Methods for details). Since most of us have trouble visualizing and interpreting nine-dimensional objects, we therefore study the shape by analyzing its restrictions to genotypes on two and three of the nine loci. In so doing, we aim to illustrate how certain features of the fitness landscape - in particular, differences in mixing ability revealed by the non-standard tests - are reflected in the geometry of the fitness landscape.

Consider the bi-allelic two-locus system with genotypes *w, a, r*, and *ar*. Its genotope is the unit square in (*a, r*)-space, and a generic fitness landscape has exactly one of two possible shapes corresponding to either negative or positive epistasis, i.e., to *w ar* - *a r* being either negative or positive. Negative epistasis induces the triangulation of the square consisting of the two triangles {*w, a, r*} and {*a, r, ar*}, whereas positive epistasis results in the other possible triangulation with {*w, a, ar*} and {*w, r, ar*}.

For larger genetic systems, the role of the triangles is played by simplices. The shape of the present fitness landscape on 37 genotypes is a triangulation of the genotope into 362 nine-dimensional simplices (see Additional file 5). The following analysis of the geometry of this space is based on the general framework developed elsewhere [1], as applied to the specific experimental design that generated the *E. coli* fitness landscape investigated in this paper. The general idea is to investigate and describe the interaction space. This space has finite dimension, but infinitely many elements, and each element of this space represents one interaction. There are many different ways of extracting potentially interesting subsets of the interaction space. We suggest looking at the circuits, which generate the space but are not linearly independent. The signs of the circuits determine the triangulation of the genotope and thus the shape of the fitness landscape. However, the number of circuits grows fast with the size of the fitness landscape; the present *E. coli* landscape has 772,731 circuits. Therefore, we restrict our attention to the much smaller subset provided by the Markov basis. The circuits and the Markov basis provide a natural generalization of the concept of pairwise epistasis.

The three-locus subsystems that occur in this dataset are represented in Figure 1b–d by their genotopes. Notice that type (d) cannot be further subdivided. By contrast, type (c) has six triangulations and type (b) has 16 triangulations. We focus on type (b) in order to show how the different shapes are related. The signs of a total of nine circuits (or tests) determine the geometry, including three standard tests like *w·ar* - *a·r*, three non-standard single-double tests, and three cubic tests that are not part of the minimal Markov basis. In Figure 5, each of the 16 shapes is represented by a vertex in the graph and labelled by an integer. The labels refer to the 74 shapes of the 3-cube (see Table 5.1 in [1] for the complete list), 16 of which occur here as the shapes of the type (b) genotype space. Two shapes in the graph are connected by an edge if they differ only by the sign of a single test. For example, shapes 36 and 37 differ by the sign of *r·ax* - *x·ar*. Thus, the graph represents the 16 possible shapes of a fitness landscape over the three-locus genotype space consisting of seven genotypes (all but the triple mutant). This graph provides the basis for statistical inference about the three-way interactions in the given fitness landscape. We find the following shapes among all 27 of the three-dimensional genotopes of type (b) that appear as subsets of the complete dataset: shape 56 (frequency: 6), 7 (5), 25 (4), 8 (3), 52 (3), 21 (2), 23 (2), 2 (1), and 20 (1). A closer inspection of this shape distribution reveals many deviations from linearity in the fitness landscape, but no single dominating shape. This result therefore confirms and extends the earlier finding of Elena and Lenski [2] about the commonness of deviations from linearity.

The analogous graph of possible shapes for the three-locus subsystem corresponding to the triangular prism (Figure 1c) is a hexagon. In fact, that hexagon occurs three times as a sub-graph in the graph for type (b) illustrated in Figure 5, because the genotope (c) is contained within the genotope (b) in three different orientations. Intuitively, any three-locus genotype space lacking a double and the triple mutant can be regarded as a subspace of the space that only lacks the triple mutant. The six triangulations of genotope (c) are represented in Figure 5 by the three hexagons with vertices {35, 36, 52, 21, 20, 50} (upper left), {36, 37, 56, 25, 23, 52} (upper right), and {37, 35, 50, 22, 24, 56} (bottom), respectively, that are related by symmetry.

The shape of fitness landscapes on type (c) spaces, such as {*w*, *a*, *b*, *r*, *ar*, *br*}, is determined by the signs of the three tests *w·ar* - *a·r*, *w·br* - *b·r*, and *b·ar* - *a·br*. Therefore, the shape summarizes the information about the standard pairwise interactions between mutations *a* and *r* and between *b* and *r*, as well as the relative mixing ability of *a* and *b* with respect to *r*. For example, five of the six subsystems of type (c) involving mutations *y* and *z* have shape 21, while the subspace {*w*, *y*, *z*, *b*, *by*, *bz*} has shape 52. Shape 21 is defined by a negative sign for the second of the three tests above and a positive sign for the other two, whereas shape 52 is defined by positive signs for all three tests. Hence, these shapes reflect positive epistasis involving *y* (6/6 tests), positive epistasis involving *z* (5/6 tests), and superior mixing ability of *y* over *z* (6/6 tests). As such, the geometric shapes provide more details about the form of mutational interactions than the average values of the tests analyzed in the previous section. On the other hand, these shapes summarize the data by classifying continuous fitness values into discrete shape classes which reflect the sign pattern of the interaction tests.