The enzyme phosphoenolpyruvate carboxykinase, PEPCK, occurs in its guanosine-nucleotide-using form in animals and a few prokaryotes. We study its natural genetic variation in Colias (Lepidoptera, Pieridae). PEPCK offers a route, alternative to pyruvate kinase, for carbon skeletons to move between cytosolic glycolysis and mitochondrial Krebs cycle reactions.
PEPCK is expressed in both cytosol and mitochondrion, but differently in diverse animal clades. In vertebrates and independently in Drosophila, compartment-specific paralogous genes occur. In a contrasting expression strategy, compartment-specific PEPCKs of Colias and of the silkmoth, Bombyx, differ only in their first, 5′, exons; these are alternatively spliced onto a common series of following exons. In two Colias species from distinct clades, PEPCK sequence is highly variable at nonsynonymous and synonymous sites, mainly in its common exons. Three major amino acid polymorphisms, Gly 335 ↔ Ser, Asp 503 ↔ Glu, and Ile 629 ↔ Val occur in both species, and in the first two cases are similar in frequency between species. Homology-based structural modelling shows that the variants can alter hydrogen bonding, salt bridging, or van der Waals interactions of amino acid side chains, locally or at one another’s sites which are distant in PEPCK’s structure, and thus may affect its enzyme function. We ask, using coalescent simulations, if these polymorphisms’ cross-species similarities are compatible with neutral evolution by genetic drift, but find the probability of this null hypothesis is 0.001 ≤ P ≤ 0.006 under differing scenarios.
Our results make the null hypothesis of neutrality of these PEPCK polymorphisms quite unlikely, but support an alternative hypothesis that they are maintained by natural selection in parallel in the two species. This alternative can now be justifiably tested further via studies of PEPCK genotypes’ effects on function, organismal performance, and fitness. This case emphasizes the importance, for evolutionary insight, of studying gene-specific mechanisms affected by natural genetic variation as an essential complement to surveys of such variation.
Phosphoenolpyruvate carboxykinase, PEPCK, converts phosphoenolpyruvate (PEP) plus nucleotide diphosphate and carbon dioxide to and from oxaloacetic acid (OAA) plus nucleotide triphosphate, in multiple metabolic contexts among the domains of life. Its guanosine-nucleotide-using form (EC 22.214.171.124; for reaction see Figure 1), while present in some Bacteria and Archaea, occurs mainly in Animalia and in both cytosol and mitochondrial matrix. Its production of PEP from OAA begins gluconeogenesis or glycerol synthesis from Krebs cycle metabolites, or through them from dietary amino acids or lipids . Its production of OAA from PEP may ″replenish″ Krebs cycle metabolites, or play a role in reaction paths which produce moderate ATP yields during chronic anoxia in some invertebrates . In mice, its overexpression in skeletal muscle yields striking extensions of exercise capacity, lifespan, and reproduction .
We have studied natural genetic variation in enzymes of energy metabolism, using Colias (Lepidoptera, Pieridae) as a test system for evolutionary functional genomics [4, 5]. While surveying such variation across all enzyme-coding genes of glycolysis and its links to other processes, separate study of PEPCK was prompted by finding 3 high-frequency amino acid polymorphisms at the same PEPCK codons in two Colias species from distinct clades. We first clarify the basis of dual cell compartment expression of PEPCK in Colias vs. other animals. We next study PEPCK’s natural variation in the two Colias species, especially the high-frequency amino acid variants shared between species. We locate these variants in PEPCK’s structure and explore their possible effects on structure-function relations. With coalescent simulations, we test a population-genetic null hypothesis of genetic drift as cause of this variation, the alternative cause being natural selection.
Animals and basic molecular biology
PEPCK cDNA was made by reverse-transcription (with Invitrogen MMLV enzyme) of mRNA extracted from fat body (preserved in Ambion RNAlater) of 18 Colias eurytheme (randomly sampled near Tracy, California, elevation 25 m) and 18 Colias meadii (randomly sampled from Cottonwood Pass, Colorado, elevation 3780 m). Using ButterflyBase , we compared PEPCK sequences from Bombyx mori (BMP026541_1) and Heliconius erato (HEP05212_1) to design consensus primers for initial PCR amplification (using Invitrogen HiFi Platinum Taq and Stratagene Robo-cyclers) of a central sequence fragment of Colias PEPCK cDNA. Primers matching this fragment were designed, using Oligo 6 software (Molecular Biology Insights, Inc.), first to amplify the gene’s 3′ end with an antisense primer matching the cDNA polyA tail (GEN22(15)-A3end: [4, 5]), and then, using the Ambion RLM-RACE kit, to amplify the 5′end (unexpectedly complex, as discussed below). Colias-specific primers were then designed (Additional file 1) to amplify and sequence the whole gene from 5′ to 3′ untranslated regions (UTRs). Nucleic acids were purified with Qiagen kits and a Qiacube processing robot. Sequences were read in sense and antisense directions with ABI BigDye 3.1 reagents and an ABI 377 sequencer.
Data processing and bioinformatics
Colias PEPCK sequences were cross-checked and edited using BioEdit . DnaSP 5.1  was used to estimate haplotype compositions from individuals’ heterozygous PEPCK sequences using the PHASE algorithm [9, 10], and to tabulate diverse evolutionary-genetic statistics from these sequences; previously written filter programs  were used to organize DnaSP analysis of linkage disequilibrium. Sequences of Bombyx mori ’s PEPCK were retrieved from expressed sequence tag libraries in ButterflyBase  and from assembled genomic DNA in SilkDB 2.0 , using search tools of each site. Drosophila sequences were drawn from FlyBase , and vertebrate and prokaryotic sequences from GenBank .
All sequences were evaluated for cell compartment specificity using the TargetP server . This server’s elaborate algorithm assesses mitochondrial targeting of proteins, as contrasted to properties of proteins retained by default in the cytosol, on the basis of characteristics of their N-terminal amino acid sequences: a) richness in basic (Arg, Lys) and hydroxylated (Ser, Thr) amino acids; b) absence of acidic amino acids (Asp, Glu); c) certain secondary structure features .
Sequences were aligned with the ClustalW algorithm as implemented in BioEdit. Additional file 2 lists accession numbers of all sequences, including those from Colias as submitted to GenBank. Phylogenetic relationships among sequences were evaluated with PHYML and PHYLIP software [15, 16]. Colias sequences were matched to the best available structural templates, for homology-based structural modelling, by the 3D-Jury metaserver . Template structure files were drawn from the Protein Data Bank . Homology-based calculation of Colias PEPCK structures using these templates was done with MODELLER 9.8 ; in each case, 5 replicates were run and the best-scoring one (i.e. with lowest value of the molpdf criterion ) was used. These models were visualized, and their structural features measured, with DeepView (Swiss-PDB Viewer) 4.0.1 .
Basic genomic structure of PEPCK in Colias, other insects, and vertebrates
During primer set development for the autosomal Colias PEPCK gene, we found two sequence forms, differing in their 5′ ends and 5′-untranslated regions (UTRs). Inspection of Bombyx mori′s PEPCK sequences  clarified this: PEPCK sequence BMP026541_1, closely matching one of the Colias forms, is annotated to cytosolic expression, and BMP000643_1, close in sequence to the other Colias form, to mitochondrial expression. The TargetP server confirmed this compartment targeting for the two forms in each species. In each taxon, these sequences differ only in their 5′ ends, being mRNA splice variants whose alternative 5′ exons (each associated with a unique 5′ untranslated region, in which unique 5′ amplifying primers are located) are attached to common exons 2–13. (That the sequences following the 5′ exon are the same, and not parts of fully distinct paralogs, was shown in Colias by the fact that in every case, in amplifying from the 5′ untranslated region, regardless of which of the two 5′ exons was amplified, all varying base positions following the 5′ exon, whether heterozygous or variant-homozygous in an individual, were the same between the alternately amplified sequences.) But in Drosophila melanogaster, distinct, though closely linked, paralogous genes code for PEPCK of cytosol and of mitochondria (12; Figure 2 shows these insects′ PEPCK 5′ ends). Pairs of compartment-specific paralogs also occur in diverse vertebrates .
A truncated expressed-sequence-tag sequence from Bombyx, similar but not identical to BMP026541_1, occurs in ButterflyBase as BMP026778_1. Consultation of the Bombyx genome assembly  clarified this. The whole gene corresponding to BMP026541_1, including the cytosolic exon 1, occupies one locus, punctuated by 12 introns, on the minus strand of scaffold nscaf2789. Roughly 20 kb beyond this on the minus strand begins the second locus BMP026778_1, which is interrupted by base dropouts causing frameshifts in comparison to BMP026541_1; if the first of these is ″repaired″ by substitution from BMP026541_1, more PEPCK-like sequence is recovered to about codon 235, after which more frame-shifting results in premature stop codons. Thus this locus behaves like an incipient, but not yet fully silenced, pseudogene. We've found no expressed mRNA sequence evidence of any such locus in Colias.
Bombyx’ mitochondrial exon 1 of BMP000643_1, with its distinctive 5′ -untranslated region, is again on the minus strand, 15 kb beyond BMP026778_1. It is not now annotated in SilkDB, so we give its specific location here: nscaf 2789: exon bp 1207835 – 1207867, 5′-UTR to ~ 1207900.
Evolutionary history of PEPCK
What is the evolutionary history of the alternate PEPCK expression strategies – paralogs vs. splice variants? One study has examined the phylogenetic relationships of the GTP-using (E.C. 126.96.36.199) and ATP-using (E.C. 188.8.131.52) PEPCK enzymes . It focused on distribution of these co-substrate types among domains Bacteria, Archaea, and Eukarya, but did not address the eukaryotic cell-compartment-specific forms. Therefore we reconstructed, using protein sequences, the phylogeny of vertebrate and insect GTP-using PEPCKs, with prokaryotic GTP-using PEPCKs as outgroups, drawing sequences from sources listed above. Figure 3 shows that vertebrate mitochondrial and cytosolic PEPCKs form two compartment-specific paralogous branches whose most basal members on each branch are fish sequences. This apparent duplication-and-divergence event may have been part of the whole-genome duplications found at the base of vertebrate evolution . The Drosophila paralogs form a coherent sub-branch of an Insecta branch, originated independently of the vertebrate paralogs, evolving the duplication-and-divergence genomic mechanism for compartment-specificity in parallel. The apparent Bombyx pseudogene (bmocyto2; BMP026778_1) groups closely with the main Bombyx cytosolic gene; it is unrelated to the Drosophila paralog pair. The exon-splicing mechanism of the two Lepidoptera, Bombyx and Colias, is a distinct alternative to whole gene paralogy for compartment-specificity of PEPCK.
Overall sequence variation of Colias PEPCK
The cDNA of Colias cytosolic PEPCK includes 1929 bp, 643 codons, and the mitochondrial form includes 1896 bp, 632 codons (omitting the stop in each case). The length differences lie in the 5′ exons of 22 and 11 codons respectively; base pair and amino acid sites for the gene as a whole are numbered beginning with the mitochondrial 5′ exon. Cytosol form sequences add 11 codons/33 base pairs (bp) to numbers beyond the end of the cytosol exon 1. The mitochondrion-targeting exon 1 may be excised once its protein has been imported into mitochondria, as this happens with other such proteins, ostensibly to prevent further interaction with the mitochondrial membranes . If so, this would affect interpretation of both structural and population-genetic aspects of mitochondrial PEPCK′s functional evolutionary interactions (see below).
18 C. eurytheme and 18 C. meadii were sequenced for both 5′ exons and the common exons 2–13, and haplotype phases were estimated for each species as noted above. Genetic statistics are tabulated in Table 1 for the common exons 2–13 and for the two 5′ exons. PEPCK is highly variable at both amino acid and DNA levels: e.g., for exons 2–13 of Colias eurytheme, in common between the compartment forms, overall nucleotide diversity π = 0.0269, synonymous diversity πss = 0.1054, and θ = 4Neμ = 0.028 (estimated from the number of segregating sites S). Both 5′ exons are less variable than exons 2–13. In comparison, Colias eurytheme PGI, one of the most variable animal genes known (maintained so by strong natural selection ), shows πΣ = 0.0267, πss = 0.0993, and θ = 0.034 , while average values for a sample of Drosophila melanogaster genes are πΣ = 0.0040, πss = 0.0135, and θ = 0.0040 . Colias’ PEPCK thus matches its PGI in level of variability. It also shows similarly high estimates of the minimum number of intragenic recombination events , i.e. 60 and 41 for C. eurytheme and C. meadii respectively, vs. 58 in the 1668 bp of C. eurytheme PGI .
Patterns of allelic amino acid variation
Each unique PEPCK sequence at either nucleotide or amino acid level of organization constitutes a distinct genetic allele at that level. We focus here on amino acid variation as the possible basis of naturally selected enzyme properties, thence effects on higher-level phenotypes and eventually on Darwinian fitness. Figure 4 shows all mitochondrial-form amino acid allelic variants as inferred by PHASE (above). Nearly all the variation occurs in those exons, 2–13, which are in common between the compartment forms. 23 codons have singleton amino acid variants in C. eurytheme or C. meadii while 20 codons have major polymorphism (p2 ≥ 0.05) in either species (the cytosol 5′ exons have only one singleton in C. eurytheme, none in C. meadii). These combine into 28 (of 36) distinct alleles in C. eurytheme and 24 (of 36) distinct alleles in C. meadii.
Three polymorphic codons, 335, 503, and 629, have p2 ≥ 0.05 in both species: Gly/Ser 335, Asp/Glu 503, and Ile/Val 629. All these changes are charge-neutral. By exact binomial test , corrected for multiple tests , codon 629 differs significantly in its variant frequencies between species, while codons 335 and 503 do not (Table 2). These variants combine into 8 allele classes or allele ″macrostates″ , which are identified by one-letter amino acid codes at each position, e.g. GEI for Gly Glu Ile). Frequencies of these alleles for the two species are given in Table 3; their differences mostly follow C. eurytheme’s increase in Ile 629 frequency compared to C. meadii.
We asked if there is intragenic linkage disequilibrium among codon 335, 503, and 629 variants; as Additional file 3 shows, there is not. For reasons noted below, we also tested C. eurytheme for disequilibrium between these 3 codons and the 5′-mitochondrial-exon codon 11 Arg/Lys polymorphism of C. eurytheme, but none was found (Additional file 3). Indeed, scanning whole cDNA sequences of each species for linkage disequilibrium with DnaSP and filters (Methods, above) found only one instance of disequilbrium between nonsynonymous variants (codons 122 and 220 gave a locally significant Fisher’s exact test at P = 0.001, but this was not significant by DnaSP’s Bonferroni criterion for multiple testing) in C. eurytheme, and no such instances in C. meadii. Absence of disequilibria among amino acid variants fits with the finding above of extensive intragenic recombination in PEPCK of both species. Some mainly nearby disequilibria involving synonymous variants are seen, but no interpretation is evident and we omit these data for the sake of brevity.
On a hypothesis of selective neutrality some variable sites are expected to be shared between species, given large θ and thus a number of sites variable in each species at a time . But on this hypothesis, the vast majority of those variants are expected to be of very low frequency and destined to be lost by drift to fixation . A finding of multiple high-frequency variants shared between species at even roughly similar frequencies is unusual on this hypothesis, and so merits study of its possible causes – neutrality as null hypothesis, or some form of natural selection as an alternative. Direct study of the variants’ functional effects, and fitness consequences in the wild, will be the future and final arbiter of this issue. But we can even now make more use of present data, by homology-based structural modelling and by population-genetic simulation, to gain further insight.
Structural nature and potential impacts of amino acid sequence variation
We summarize PEPCK’s protein structure in order to study placement of its polymorphic variants in Colias for clues to their evolutionary meaning. This is a first step in assessing neutrality or selection in functional terms: if molecular modelling shows changes in intramolecular protein bonding by amino acid variants, that may at least suggest variants’ possible functional effects, while if no such changes are evident, functional neutrality of the variants is strongly suggested.
PEPCK is one of a few enzymes of glycolysis and related processes which are active as monomers, without oligomeric structure. The best homologous modelling template available, per 3D-Jury , is PEPCK of Rattus: high-resolution crystal structures exist for two catalytic conformations of this ([30, 31], see Additional file 4 for a ColiasRattus alignment). Sequence identity between Colias and Rattus PEPCKs is 0.61, and their Dayhoff similarity is 0.77. These values support homology-based modelling, with accuracy of mid-range crystallographic resolution , to explore variants’ potential structural effects. We modelled both conformations of each amino acid polymorph allele for each of three compartment-specific forms: cytosolic exon 1 plus common exons 2–13, mitochondrial exon 1 plus exons 2–13, and exons 2–13 alone in light of the above-noted likelihood that mitochondrial exon 1 is excized once mitochondrial PEPCK has reached its target.
Figures 5a,b show that Colias PEPCK’s tertiary structure is built up from its secondary structure as an irregular lattice of β-strands. This supports α-helices which form much of the protein’s surfaces. Ends of these α- and β-structures are connected by loops. The catalytic center includes a mobile ″lid″ loop which when open (structure PDB 2qew, Figure 5a) allows substrate/product binding or release, but closes (structure PDB 2qf2, Figure 5b) over these ligands when they are bound during catalysis [30–32]. Other kinds of changes accompany this lid movement, e.g.:
the ″p-loop″, including substrate-binding residues Cys 304 and Lys 306, moves in the catalytic site with the lid's movement [30, 32];
amino acids 592–610 form an α-helix when the lid is open, but the helix appears to shorten to 592–609 when the lid is closed (see Figures 5a,b);
invariant and polymorphic amino acids can change spatial relations and resulting bond patterns when the lid is open vs. closed, as seen in Figure 5 and supplemented by Figure 6.
Figures 5a and b also locate the polymorphic amino acid sites shared between species, which are illustrated in more detail in Figures 6, 7, and 8. Each can have different interactions with nearby invariant amino acids, depending on their own or other polymorph segregations, on catalytic stage as above, and on different compartment forms, e.g.:
Gly/Ser 335 begins a very short loop connecting two β-strands, Gly 318 – Asp 334 and Val 338 – Ile 342. Gly’s ″side chain″ is one immobile proton, while Ser’s hydroxymethyl side chain is much larger, more polar, and can hydrogen-bond via its mobile hydroxyl proton. Either Gly or Ser can hydrogen-bond between its backbone nitrogen and the side-chain carboxyl of Asp 334; Ser's hydroxyl can hydrogen-bond with the carboxyls of either Asp 334 or Asp 336. Bonding alternatives for Gly/Ser 335, tracking segregation of Ile/Val 629, are illustrated in Figure 6.
Asp/Glu 503 is in α-helix Phe 501 – Ser 510. Asp and Glu differ in length of side chains (1 vs. 2 –CH2– groups), each ending in a carboxyl group. The guanidino side chain of nearby Arg 272 can form a salt bridge with the carboxyl of Asp/Glu 503, or hydrogen-bond to the backbone carbonyl group of Glu 503. Arg 506’s guanidino side chain, in the same α-helix as Asp/Glu 503, can also form a salt bridge with their carboxyl. These bonding possibilities, again tracking segregation of Ile/Val 629 but in the ″lid closed″ configuration of the cytosol form, are shown in Figure 7.
Ile/Val 629 is near one end of the 3′ α-helix Asn 616 – Gln 630. Their nonpolar side chains, whose volumes differ by one –CH2– group, make van der Waals contact with Trp 595, and often Leu 596, in a different α-helix starting with Lys 592. In comparison to salt bridges or hydrogen bonds, van der Waals contacts occur over a wider range of carbon-carbon distances (hence different values of contact energy), which may be grouped: contact 2.95 – 4.5 Ǻ, marginal contact 4.5 – 5.2 Ǻ, no contact > 5.2 Ǻ cf. . Figure 8 shows the absence or presence of van der Waals contact with Leu 596 for Ile 629 vs. Val 629, with the other polymorphic sites the same in each case.
Polymorphic variants may alter their nearby bonding contacts by changes of volume or polarity, as in absence/presence of a salt bridge to Arg 506 with the extension of 503’s side chain between alleles SDI and SEI (Figures 7a,c). Further, variants’ changes in properties can propagate their effects across PEPCK′s structure to alter distant bonding patterns, as, e.g., shown by responses of Ser 335 (Figures 6b,c) or Asp 503 (Figures 7a,b) bonding patterns to segregation of Ile/Val 629.
Additional file 5 lists all combinations of the eight allelic variants’ intramolecular bonding across catalytic stages and compartment forms. No two alleles display the same combination of bonds. Some additional suggestive patterns emerge from this file, for example:
among the 48 cases, there is just one of side-chain polar bonds – hydrogen bond or salt bridge – occurring at both 335 and 503 sites in the same allele (GEV, in closed conformation of the full mitochondrial form), though four would be expected at random from the frequencies of those bonds′ occurrences. This might represent a mutual steric constraint.
GDV and GDI alleles never show polar bonding between Asp 503 and Arg 272 in 12 cases, in contrast to the other 6 alleles which show 13 such bonds among 36 cases (exact binomial test  ×* = 2.44, P < 0.02). This may arise from the combination of Gly 335’s small volume and Asp 503’s short side chain, keeping Arg 272 and Asp 503 side chains distant from one another.
the full mitochondrial form shows the fewest polar bonds by variants at site 335 (1/16) compared to the cytosol and the 5′-exon-excised mitochondrial forms (13/32; ×* = 2.47, P < 0.02).
The cytosol form shows generally greater van der Waals contact distance between Ile/Val 629 and Leu 596 when closed than when open. This effect is less pronounced in the full mitochondrial form and is slightly reversed in the 5′-exon-excised mitochondrial form.
In summary, these results show that each allelic bond combination may make a different potential energy contribution to the stabilization of the protein’s structure as a whole, or of one part of the catalytic cycle (i.e. open or closed) vs. the other. By such effects, the alleles might alter either catalytic function or thermal stability, or both, of the PEPCK enzyme, and might do so differently among compartment-specific forms.
Population-genetic testing of hypotheses for shared PEPCK polymorphisms
The species studied here, Colias eurytheme and C. meadii, represent in North America the lowland species complex and an alpine/northern species complex, respectively. They are fully reproductively isolated  and are well separated in phylogeny . We now test population-genetic explanations for their sharing of 3 major (p2 ≥ 0.05) PEPCK polymorphisms, at codons 335, 503, and 629, and their close similarity of allele frequencies at codons 335 and 503 (above). Obvious alternative hypotheses are that some form of balancing selection maintains these polymorphisms in the two species, or instead that the variation is selectively neutral and subject to genetic drift.
We first consider the null hypothesis of selective neutrality. We assume the two Colias species diverged from an ancestral species t generations before the present. A non-synonymous polymorphism shared by the two species could result from a single mutation that occurred in the ancestral species and drifted to the observed frequencies in the descendant species. The probability of this approaches zero as t increases. It is also possible that independent mutations occurred in each descendant species after the split and drifted to the observed frequencies.
We have used the coalescent simulation program ″ms″  to simulate this scenario with a symmetric two-allele mutation assumption, so that mutations can occur both in the ancestral species and in the two descendant species. The simulations are used to assess the probability, given the null hypothesis that the variants are drifting neutrally, that the variants' frequencies in the descendant species would be as similar as or more similar than their observed values. We denote the frequency of the second allele at a particular codon in species A (C. eurytheme) and B (C. meadii) by p2A and p2B, respectively. From the ms program's output we estimate the probability that |p2A - p2B| is less than or equal to the observed allele frequency difference. This probability will be a function of time since species A and B split from their common ancestor, scaled by effective population size: i.e., t/2Ne. To choose a focal set of simulation outcomes from which to estimate this null probability, we set the condition that (p2A + p2B) = observed value. This allows an unequivocal ordering of |p2A – p2B| values from maximum to minimum agreement with the null hypothesis, hence minimum to maximum agreement with its selective alternative.
We estimate overall θss = 4Neμss [Ne = effective population size, μss = synonymous (= neutral) mutation rate], a central parameter for these simulations, based on values of Sss, the number of segregating synonymous sites, from glycolytic genes of each Colias species: glyceraldehyde phosphate dehydrogenase GAPdH, phosphoglycerate kinase PGK, phosphoglycerate mutase PGAM (C. eurytheme only), pyruvate kinase PK, and triose phosphate isomerase TPI. These genes give no evidence of balancing or positive-directional selection on nonsynonymous sites, which if present could bias estimates by causing neutral variants to "hitchhike" on such sites (W. Watt et al., unpublished). We correct estimates from sex-linked TPI for its Ne being a priori ¾ that of the other genes. The average value among estimates from these genes is θss = 0.0556. We subdivide this following the common observation that transition mutants are twice as common as transversion mutants overall, and the fact that for any base pair one transition and two transversions are possible as single mutants. Thus, in our primary simulation analysis, to test transition polymorphisms we set θss-tr = 0.037, and for transversion polymorphisms we set θss-tv = 0.009.
We do not use a value of θss from PEPCK itself in our primary analysis, as on the alternative hypothesis of selective maintenance of the shared polymorphisms, θss might well be elevated (by hitchhiking) above neutral expectations, biasing any further calculations. However, a variant of the neutral null hypothesis would invoke a PEPCK-specific elevation of the mutation-rate component of θss to explain high levels of PEPCK variation. We can, therefore, ask what is the probability of finding |p2A - p2B| less than or equal to the observed allele frequency difference using PEPCK θss in simulations, to see if elevated mutation rate could explain observed results on a neutral assumption. The overall θss value averaged between C. eurytheme and C. meadii = 0.0892, so θss-tr = 0.059 and θss-tv = 0.015 for these additional simulations.
Next, what should be the “choice rule” for which PEPCK codons to test? Test power would be poor for small total numbers of the second allele at each codon, so our rule is that the total counts (n2A + n2B) be ≥ 10 (out of 72, given 36 sequences for each species). Therefore, on one hand, besides the shared codons of primary interest, we should also test codon 11 of the mitochondrial PEPCK form, polymorphic for Arg/Lys. This has (n2A + n2B) = 16, which satisfies the choice rule although it is polymorphic only in C. eurytheme, not C. meadii, as this is a possible outcome of the null hypothesis. But on the other hand, if the mitochondrial-targeting exon which includes codon 11 is indeed excised after PEPCK enters the mitochondrion (above), codon 11's polymorphism would not be expressed, would be synonymous in functional terms, and should not be tested with the other codons. Hence we report our significance testing both with and without inclusion of codon 11.
We ran 2 × 107 simulations of 36 sampled sequences for each of two species and for each of the codon polymorphisms (codons 11, 335, and 629, transitions, and codon 503, transversion), filtered their output for cases satisfying [(p2A + p2B) = observed value], and tabulated the fraction of those with (|p2A – p2B| ≤ observed value) at intervals of t/2Ne between 0.2 and 2.6 as our null-hypothesis probability. Table 4 presents these results.
For comparison to Table 4, we estimate t/2Ne for our two Colias species via genetic statistics of synonymous (assumed neutral) variation at Colias GAPdH, hexokinase HK, PGK, PK, PGAM (both species), and TPI. We define these symbols: πssA or πssB = synonymous nucleotide diversity in species A or B; πssAB = between-species synonymous nucleotide diversity; πssCA = synonymous nucleotide diversity in the most recent common ancestor (MRCA); μ = mutation rate; Ne = effective population size; t = time in generations since MRCA; E(x) = expected value of x. Next, we assume πssA ~ πssB ~ πssCA. πssA ~ πssB is evident for all 6 genes (W.B. Watt et al., unpublished); that these values also reflect πssCA is reasonable, given present patterns of Colias speciation by differentiation of large "semispecies" populations without evident bottlenecking or founder effects [37, 38]. Then:
Assuming πssCA∼πssA, E(πssAB - πssA) = 2μt
By a parallel argument, E[(πssAB - πssB)/πssB≅ 2μt/4Neμ = t/2Ne.
Estimates of t/2Ne were made for each species in turn at each of the 6 genes listed above, again correcting estimates from TPI for its smaller Ne. The final average t/2Ne over all 6 genes and 2 species is 1.828 ± 0.635 (mean ± standard error of mean).
Table 5 applies “Fisher’s method” of combining probabilities  to the joint analysis of “ms” simulation results for the three polymorphic codons shared among species, with and without the mitochondrial codon 11 polymorphism as discussed above. We used t/2Ne = 1.80 as the closest tabulated value less than our averaged estimate. As Table 5 shows, using θss estimated from the five genes as above, it is highly unlikely that these polymorphisms would be as or more similar than observed due to neutral drift, whether (P = 0.004) or not (P = 0.001) codon 11 is included in the analysis. This is so even though at codon 629 the polymorphism does differ significantly in frequency between species (above). Using θss from PEPCK itself does not change these conclusions importantly: including codon 11 in the simulations, P = 0.006, while without codon 11 P = 0.002. Thus an elevated PEPCK-specific mutation rate combined with neutrality is also rejected as an explanatory hypothesis for the shared polymorphisms. These results support the alternative working hypothesis that codon polymorphisms 335, 503, and 629 are shared between species because they are maintained by parallel natural selection in the two species.
PEPCK and PGI: different forms of chronically maintained polymorphism?
We have found that neutral drift is quite unlikely to explain the sharing of PEPCK’s amino acid polymorphisms between Colias species. Thus it makes sense that PEPCK, as a candidate for selectively maintained chronic polymorphism among species, shows very high levels of genetic variability comparable to those of Colias’ PGI gene, whose amino acid polymorphism is maintained widely across the genus by strong balancing selection [4, 39, 40].
However, the cases differ in detail. PGI polymorphism is maintained in C. eurytheme and C. meadii without preserving allelic identity between species. At PGI, a two-transversion change, Gly 370 GGG → Ser 370 TCG, was fixed by a selective sweep in the midst of chronic polymorphism at other codon sites , somewhere in phylogeny between more basal C. meadii and derived C. eurytheme. This increases the eurytheme PGI genotypes’ thermal stabilities compared to those of meadii, fitting with differences in thermal ecology between the species. In contrast, PEPCK’s polymorphism engages the same 3 codons between species although, despite the inter-species similarity of amino acid variant frequencies at codons 335 and 503, the increase of p2 frequency at codon 629 from basal C. meadii to derived C. eurytheme does change several multicodon allele frequencies (Table 3).
The amino acid polymorphs of PEPCK and PGI have one structural feature in common: they occur outside their enzymes' catalytic centers. This is often so for natural variants that change catalytic (or stability) properties of enzymes without altering their reaction mechanisms . But other structural aspects of these polymorphisms differ between the genes. PGI, with interpenetrated dimeric structure, is completely inactive as a dissociated monomer, so that no separable allelic properties exist beyond sequence differences themselves, and the genotype is the minimum unit of function and thus of performance or fitness effects. In contrast, PEPCK is active catalytically as a single polypeptide, so allelic structural and functional properties exist distinct from genotypic properties, which would be linear combinations of allelic ones. If PEPCK’s genotypes do differ in function, they could, for example, increase heterozygotes’ breadth of function across the range of a state variable such as temperature, or differ in balances of alternate metabolic roles (below).
Molecular evolutionists have long recognized the prevalence of conservation of sequence via stabilizing (“purifying”) selection across broad clades. In contrast, polymorphism is often seen as transient, whether neutral or selected, with “exceptions” recognized for recombination-suppressing inversion blocks as in Drosophila, or for frequency-dependent cases such as host-pathogen “arms races” [44–46] or self-incompatibility systems . Cases such as PGI, phosphoglucomutase [48–50], and now probably PEPCK, demonstrate that chronic polymorphism may often be a long-term, stable response to multiple-scale environmental variation, albeit perhaps predisposed by complexities of protein structure (PGI) and/or metabolic role (PEPCK).
Accordingly, studies of the kinetic properties and thermal stabilities of the PEPCK variants will be of high priority, testing the present working hypothesis of selective maintenance for their polymorphism. If functional differences are indeed found, these may well give clues to the genotype-phenotype-environment interactions responsible for variants’ maintenance – as was the case for Colias PGI. Field studies to test those interactions will follow in turn.
Compartmentation: functional rôles and expression strategies among taxa
The concept of “elementary flux modes” expresses how a group of enzyme steps can be deployed to execute different reaction series serving distinct metabolic functions – e.g., glycolysis vs. gluconeogenesis, or interactions of glycolysis with the pentose shunt . As seen above, expression of PEPCK in both cytosol and mitochondria may support alternative elementary flux modes. Krebs cycle carbon skeletons derived from dietary or stored lipids or amino acids could be converted by mitochondrial PEPCK from OAA to PEP, then moved to the cytoplasm (by the tricarboxyl transporter ) for reverse-glycolytic support of glycerol synthesis or storage in glycogen . Otherwise, cytosolic PEPCK could prime the Krebs cycle to match large transients in glycolytic flux (such as seen in insect flight), by diverting part of this flux from PEP into OAA, thence into mitochondria, perhaps via the malate shuttle, to react with acetyl-CoA derived from pyruvate .
How these flux modes might interact with functional effects of the Colias PEPCK amino acid polymorphisms remains to be explored. But in addition, the alternative strategies for PEPCK forms’ expression – splice variants in higher Lepidoptera vs. full-scale paralogous genes of independent origin in vertebrates and in Diptera – bespeak a level of adaptive specialization on a large scale. They may, for example, reflect deep clade differences in nutritional mass-energy budget structures. Thus multiple levels of evolutionary comparison are evoked by our present findings.
These expression strategy differences have implications for studies of clade structure itself. It was proposed that PEPCK sequences might offer good phylogenetic signal differentiating Mesozoic to early Cenozoic divergences of insect taxa, mainly in Lepidoptera but with other insects including Drosophila as outgroups . The part of PEPCK studied is in the common block of splice-variant sequences in Bombyx and Colias, a region also quite similar between the Drosophila paralogs. However, what are the most appropriate outgroups, and whether basal Lepidoptera concur in the strategy of Bombyx and Colias, or display another expression pattern which may complicate sorting out sequence homology vs. paralogy, are open questions which systematists must address if they study this gene.
PEPCK and the place of specific-gene studies in a time of genomic variation surveying
High-throughput sequencing and variation surveys using it have remarkable power to screen genomes for genetic evidence of evolution . But it is increasingly recognized that “genomics is not enough” to overcome underdetermination of genetic variation patterns by theoretically possible alternative processes . (This problem should be clear even from simple population genetics: e.g., a heterozygote deficiency compared to Hardy-Weinberg expectation may arise from inbreeding or a Wahlund effect or underdominance in fitness, and only study of process can choose the right explanation.) Genome-wide surveys can at best evoke working hypotheses to be tested by study of varying mechanisms in specific gene systems [56–58]. The focus of our molecular survey on a central, functionally well-known pathway has allowed augmentation of PEPCK’s genetic statistics with initial structural and population-genetic studies, and poises the case for mechanistic testing. This study of Colias PEPCK, like our earlier work , engages diverse processes which can shape natural variation, from protein-specific structural predispositions or constraints to epistatic interaction among nearby enzyme steps, systemic pathway organization and enzymes’ roles in it, or “global” issues of energy allocation or network connectedness. Case studies of natural variation’s effects are a potent source of insight into partition of evolutionary causes among these processes.
This does not imply a surrender to evolutionary particularism. On the contrary, we seek, with Whitehead , “…to see the forest by means of the trees”. The now-obvious universality of the genetic code, the “unity of biochemistry”, and other unifying concepts of molecular and physiological evolution were established by detailed studies of the molecular mechanisms of diverse organisms – in complement to the distillation of natural selection and other early evolutionary generalities out of many specific cases by Darwin and his successors. In biology, the path to heuristic generalization runs through the comparative study of specificity [42, 60].
This situation underscores the importance of a difference of evolutionary paradigms: an approach which is self-limited to amechanistic pattern analysis in evolution , vs. a view which values patterns as starting points but, as in our earlier work and as begun here, tests their causes by mechanistic studies of genotype-phenotype-environment interactions [62–64] which are the actual drivers of natural selection . Increasing focus on the power of the latter paradigm [66, 67] will lead to deeper insight into evolutionary processes and into realistic generalities concerning them.
We've pursued diverse approaches to PEPCK’s evolution in two Colias species:
phylogenetic comparisons of strategy for expression of cell compartment forms, finding splice variation in Colias (like Bombyx) as contrasted to paralogous gene divergence in other clades;
finding extensive genetic variation at nucleotide and amino acid levels, including three amino acid polymorphisms which are shared among species, in two cases with similar frequencies;
homology-based modelling, finding that these three polymorphisms may have both local structural impacts and longer-range interactions among their distinct locations in PEPCK structure;
population genetic simulation, testing the null hypothesis of neutrality of amino acid polymorphs and finding it improbable (0.001 ≤ P ≤ 0.006), leaving the alternative of natural-selective maintenance.
Each by itself gives important clues to causes of the gene’s extensive variation in context of the splice-based evolutionary strategy of compartment-specific PEPCK expression, in contrast to the paralogy seen in Drosophila and in vertebrates. Together they offer a coherent hypothesis of selectively maintained polymorphism, chronically persistent among species. This hypothesis is now poised for further test, clarifying PEPCK’s genotype-phenotype-environment interaction by studies of PEPCK’s allelic and genotypic functions, performances, and fitnesses.
Glyceraldehyde phosphate dehydrogenase
Most recent common ancestor
Effective population size
Nonsynonymous (nt) sites
Probability of a statistical outcome being due to chance
Frequncy of second allele at a gene or codon
Polymerase chain reaction
Protein Data Bank
Synonymous (nt) sites; Standard one-letter abbreviations for amino acids; Standard one-letter abbreviations for nucleotide bases; Standard three-letter abbreviations for amino acids
Sum or total
Time of separation of species scaled by effective population size(s)
Triose phosphate isomerase
Test parameter (normal deviate) for exact binomial test.
Yang J, Kalhan SC, Hanson RW: What is the metabolic role of phosphoenolpyruvate carboxykinase?. J Biol Chem. 2009, 284: 27025-27029. 10.1074/jbc.R109.040543.
Hakimi P, Yang J, Casadesus G, Massillon D, Tolentino-Silva F, Nye CK, Cabrera ME, Hagen DR, Utter CB, Baghdy Y, Johnson DH, Wilson DL, Kirwan JP, Kalhan SC, Hanson RW: Overexpression of the cytosolic form of phosphenolpyruvate carboxykinase (GTP) in skeletal muscle repatterns energy metabolism in the mouse. J Biol Chem. 2007, 282: 32844-32855. 10.1074/jbc.M706127200.
Wang B, Watt WB, Aakre C, Hawthorne N: Emergence of complex haplotypes from microevolutionary variation in sequence and structure of Colias phosphoglucose isomerase. J Mol Evol. 2009, 68: 433-447. 10.1007/s00239-009-9237-2.
Emanuelsson O, Brunak S, von Heijne G, Nielsen H: Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protocols. 2007, 2: 953-971. 10.1038/nprot.2007.131. http://www.cbs.dtu.dk,
Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520. http://www.atgc.lirmm.fr/phyml/,
Li A, Nussinov R: A set of van der Waals and Coulombic radii of protein atoms for molecular and solvent-accessible surface calculation, packing evaluation, and docking. Proteins. 1998, 32: 111-127. 10.1002/(SICI)1097-0134(19980701)32:1<111::AID-PROT12>3.0.CO;2-H.
Watt WB: Adaptation, constraint, and neutrality: mechanistic case studies with butterflies and their general implications. The evolution of population biology. Edited by: Singh R, Uyenoyama M. 2004, Cambridge, UK: Cambridge University Press, 275-296.
Watt WB, Donohue K, Carter PA: Adaptation at specific loci. VI. Divergence vs. parallelism of polymorphic allozymes in molecular function and fitness-component effects among Colias species (Lepidoptera, Pieridae). Mol Biol Evol. 1996, 13: 699-709. 10.1093/oxfordjournals.molbev.a025631.
Hedrick PW, Kim TJ: Genetics of complex polymorphisms: parasites and maintenance of the major histocompatibility complex variation. Evolutionary genetics: from molecules to morphology. Edited by: Singh RS, Krimbas CB. 2000, Cambridge, UK: Cambridge University Press, 204-234.
Uyenoyama M, Takebayashi N: Genus-specific diversification of mating types. The evolution of population biology. Edited by: Singh R, Uyenoyama M. 2004, Cambridge, UK: Cambridge University Press, 254-271.
Schuster S, Fell DA, Dandekar T: A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks. Nature Biotech. 2000, 18: 326-332. 10.1038/73786.
We thank Carol Boggs, Michael Bramson, Richard Hanson, Jason Hill, Todd Holyoak, Mark Longo, Noah Rosenberg, and George Somero for helpful discussions and comments on the manuscript. The work was supported by grants to WBW from the US National Science Foundation (DEB 05–20315 and MCB 08–46970). Our results do not represent official policy of any government agency or corporate entity.
Present address: Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 0213, USA
Authors and Affiliations
Department of Biology, Stanford University, Stanford, CA, 94305-5020, USA
Ward B Watt, Baiqing Wang & Eddie Wang
Rocky Mountain Biological Laboratory, Crested Butte, CO, 81224, USA
Ward B Watt
Department of Ecology and Evolution, University of Chicago, Chicago, IL, 60637, USA
The authors declare that they have no competing interests.
WBW conceived and organized the study. BW and WBW collected specimens. BW, EW, and WBW collected and analyzed sequence data. WBW carried out structural studies. RRH developed the algorithm for estimating t/2Ne, and designed and ran the simulations with input from WBW. WBW wrote the paper with input from other authors, all of whom read and approved the final manuscript.
Additional file 1: Amplifying and sequencing primers for study ofColiasPEPCK. Amplifying primers are of course used for sequencing as well. ″5end″ and ″3end″ primers are located in 5′- and 3′-untranslated regions (UTRs). S and A denote sense and antisense primer directions, respectively. Overlapping gene subsets amplified for sequencing: MVY-S-5end and MLH-S2-5end / A1005 (one each per individual); S511C / A1519C; and S1487 / A3end. Primer numbers denote primers’ 3′-terminal nucleotide positions in the gene. (DOCX 12 KB)
Additional file 3: Tests of linkage disequilibrium among varying amino acid sites. Abbreviations: amino acids, standard one-letter codes; obs, observed; exp, expected; D, linkage disequilibrium = pX1pX4 – pX2pX3; GYates(1), G statistic with Yates’ correction and 1 degree of freedom; Fisher’s exact, test of that name [ref]; P, probability of this or more extreme result by chance alone. a) tests of sites polymorphic in both Colias species. Site variant frequencies in main text Table 2. Correction of significance threshold for multiple tests by Dunn-Sidak method [ref], 6 tests, α′ = 0.0085. No linkage disequilibria are significant. b) tests of Arg/Lys 11, polymorphic only in mitochondrial 5′ exon of C. eurytheme, and the other polymorphic sites in that species. Site frequencies at site 11: Arg 0.556, Lys 0.444; other site frequencies as above. 3 tests, α′ = 0.017. No linkage disequilibria are significant. (DOCX 14 KB)
Additional file 5: Absence/presence of polymorphic amino acid variable bonds with sidechains of nearby invariant amino acids inColiasPEPCK. Abbreviations: H bond, hydrogen bond; Ǻ, Ǻngstrom unit of length; vdW, van der Waals; marg, marginal. Alleles identified by one-letter amino acid codes at each of the three polymorphic sites shared between species. Invariant amino acid sites bonding to variable amino acids are identified by number prior to bond length in Ǻ. 272-503 H bonds engage the backbone carbonyl of amino acid 503. All bond types, including distance criteria for marginal or absent van der Waals contacts, are discussed in the main text and many are illustrated in Figures 6, 7, 8. (DOCX 13 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.