High correlation between the turnover of nucleotides under mutational pressure and the DNA composition
BMC Evolutionary Biology volume 1, Article number: 13 (2001)
Any DNA sequence is the result of a compromise between the selection and mutation pressures exerted on it during evolution. It is difficult to estimate the relative influence of each of these pressures on the rate at which substitutions accumulate. However, when studying the phylogenetic relations between taxa, it is important to discriminate between the effects of mutation and the effects of selection.
We have tested, in computer simulations and analytically, the available substitution matrices for many genomes, and we have found that DNA strands in equilibrium under mutational pressure have a unique feature: the fraction of each type of nucleotide is linearly dependent on the time needed for substitution of half of the nucleotides of that type, with a correlation coefficient close to 1. Substitution matrices found for sequences under selection pressure do not have this property. In computer simulation, a substitution matrix for the leading strand of the Borrelia burgdorferi genome drives a DNA sequence to an equilibrium whose nucleotide composition and asymmetry correspond precisely to those of the third positions in codons of protein coding genes located on the leading strand.
The parameters of mutational pressure allow us to compute the DNA composition in equilibrium with that pressure. By comparing any real DNA sequence with the sequence in equilibrium, it is possible to estimate the distance between them, which could be used as a measure of the selection pressure. Furthermore, the parameters of the mutational pressure enable direct estimation of the relative mutation rates in any DNA sequence in the studied genome.
Nucleotide sequences of all contemporary genomes are the result of a compromise between mutational pressure and selection. Many mutations that took place in the past have been eliminated by genetic death. Even so, there are differences in the nucleotide composition of protein coding sequences and intergenic sequences, and it is very difficult to discriminate between the effects of selection and of mutation on their composition. Furthermore, many prokaryotic genomes have a very asymmetric nucleotide composition of chromosomes [e.g. 2–9]. The composition of a DNA strand depends on the role that the strand plays in replication, leading or lagging. Usually the leading strand is richer in guanine (G) than in cytosine (C) and richer in thymine (T) than in adenine (A). Replication-associated mutational pressure is thought to be the most probable cause of this asymmetry [10–12].
Analyses of long-range correlations in DNA sequences revealed that a very strong triplet signal can be detected in the intergenic sequences [13, 14]. This signal can be created by fragments of coding sequences transferred into intergenic space by recombination mechanisms. Since the nucleotide compositions of the first, second and third nucleotide positions in coding sequences are strongly correlated, these correlations are seen even in some noncoding intergenic sequences. We have assumed that some intergenic sequences are derived from coding sequences and could freely accumulate mutations with frequencies determined by the replication-associated mutational pressure. If the divergence time has not been very long, homology between the intergenic sequences and their original protein coding sequences can still be found (we call these original coding sequences the reference sequences).
We have assumed that mutations accumulated only in the intergenic sequences and not in the reference sequences. This is not exactly true, but it made our study feasible and should give a good approximation of the mutational pressure exerted on intergenic sequences. Many other authors who have constructed substitution matrices from mutations accumulated in pseudogene sequences have made the same assumption [15, 16]. Such an assumption could give a higher estimate of the mutational rate than the real one. Nevertheless, the substitution rates in the matrices are expressed as relative values, so this should not change the values in the matrix.
We chose the B. burgdorferi genome for our analyses because there are many indications that this genome is in a steady state. The B. burgdorferi genome is very asymmetric, which suggests that its structure is conserved. Either there have not been many inversions of genes between the leading and lagging strands, or the mutational pressure has had enough time to make the inverted genes resemble the genes of the new strand. The nucleotide composition of the third positions in codons testifies to the very conserved structure of the chromosome. These positions follow precisely the sign of the asymmetry of intergenic sequences, and the third positions of Open Reading Frames (ORFs) situated on the leading and lagging strands show precisely mirrored asymmetry, which is even stronger than that of intergenic sequences. This paradox could be explained by assuming that the highly degenerate third positions have accumulated more neutral or nearly neutral mutations introduced by replication-associated processes, because they stay at their positions longer than intergenic sequences do. There are constraints on inversions of coding sequences but no constraints on inversions of intergenic sequences. Thus, some newly inverted intergenic sequences could complement the asymmetry of the "new host" strand.
Results and Discussion
Testing the table of substitutions and verifying the assumptions
Having found the rates of all types of substitutions (Table 1, Borrelia burgdorferi Table of Substitutions (BbTS)), we were able to test these data and to verify our previous assumptions. In equilibrium, the number of nucleotides of a given type replaced by other nucleotides should be balanced by the number of nucleotides of that type replacing the others. The following four equations should be fulfilled:
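$$N_{A>G} + N_{A>C} + N_{A>T} = N_{G>A} + N_{C>A} + N_{T>A} \tag{1}$$
$$N_{G>A} + N_{G>C} + N_{G>T} = N_{A>G} + N_{C>G} + N_{T>G} \tag{2}$$
$$N_{C>A} + N_{C>G} + N_{C>T} = N_{A>C} + N_{G>C} + N_{T>C} \tag{3}$$
$$N_{T>A} + N_{T>G} + N_{T>C} = N_{A>T} + N_{G>T} + N_{C>T} \tag{4}$$
where $N_{A>G} = N_A \cdot p(A>G)$, $N_A$ is the number of A nucleotides in the sequence, and $p(A>G)$ is the probability of substitution of A by G, taken from the BbTS (the other symbols are defined analogously).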
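The balance conditions in equations 1–4 can be checked numerically. The sketch below is only an illustration under stated assumptions: the values in `TABLE` are hypothetical (they are not the published BbTS values), and the names `equilibrium_composition` and `balance` are introduced here for the example. It iterates the substitution process on nucleotide fractions until they stop changing and then compares the left and right sides of each equation.

```python
# Hypothetical relative substitution frequencies (NOT the published BbTS values).
# Keys are (from, to); the twelve values sum to 1.
TABLE = {
    ("A", "T"): 0.07, ("A", "G"): 0.10, ("A", "C"): 0.05,
    ("T", "A"): 0.06, ("T", "G"): 0.04, ("T", "C"): 0.08,
    ("G", "A"): 0.12, ("G", "T"): 0.09, ("G", "C"): 0.04,
    ("C", "A"): 0.08, ("C", "T"): 0.20, ("C", "G"): 0.07,
}
BASES = "ATGC"


def equilibrium_composition(table, tol=1e-14, max_iter=100_000):
    """Iterate the substitution process on fractions until they stop changing."""
    frac = {b: 0.25 for b in BASES}  # start from an equimolar composition
    for _ in range(max_iter):
        new = {}
        for b in BASES:
            lost = frac[b] * sum(table[(b, x)] for x in BASES if x != b)
            gained = sum(frac[x] * table[(x, b)] for x in BASES if x != b)
            new[b] = frac[b] - lost + gained
        if max(abs(new[b] - frac[b]) for b in BASES) < tol:
            return new
        frac = new
    return frac


def balance(table, counts):
    """Return {base: (left side, right side)} of equations 1-4 for given counts."""
    sides = {}
    for b in BASES:
        left = sum(counts[b] * table[(b, x)] for x in BASES if x != b)
        right = sum(counts[x] * table[(x, b)] for x in BASES if x != b)
        sides[b] = (left, right)
    return sides


fractions = equilibrium_composition(TABLE)
# Equations 1-4 use numbers, so scale the fractions to a 10,000-nucleotide sequence.
counts = {b: fractions[b] * 10_000 for b in BASES}
for base, (left, right) in balance(TABLE, counts).items():
    print(base, round(counts[base]), round(left, 3), round(right, 3))
```

For a composition in equilibrium, the two sides printed for each nucleotide should agree; for any other composition they generally differ.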
Note that the equations contain numbers, not frequencies. Fulfilling these equations means that the nucleotide composition of sequences subjected to the mutational pressure determined by the parameters of BbTS is in equilibrium. We have assumed that, for the B. burgdorferi genome, the best approximation of such sequences is the set of third codon positions of ORFs, as argued in the Introduction. Thus, the nucleotide composition of this set of nucleotides should not change significantly under such mutational pressure. To test this, we simulated the mutational pressure on a sequence of the same composition, as described previously, and after 10,000 Monte Carlo Steps (MCS), when the sequence was in equilibrium, we compared it with the sequence before the simulation. The ratios of nucleotides were 0.994, 1.008, 0.992 and 0.988 for A, T, G, and C, respectively (note that the ratios do not sum to 1 because they are not weighted). There are no significant changes in the nucleotide composition of the third positions after prolonged exposure to the mutational pressure described by BbTS (chi-square test, p = 0.99987). Thus, BbTS generates a DNA sequence whose nucleotide composition corresponds precisely to that of the third codon positions. Fig. 1 shows the evolution of two DNA sequences, one that originally had equimolar nucleotide composition and one that had the nucleotide composition of the third codon positions of ORFs from the leading strand. Both sequences reach the same final nucleotide composition. Furthermore, a sequence obtained after long evolution in the computer has very similar asymmetry, in terms of GC skew and AT skew, to the sequence of the third codon positions before evolution. GC skew is [G-C]/[G+C] and AT skew is [A-T]/[A+T]. The AT skew is -0.23 and -0.22 for the sequences before and after simulation, respectively. The GC skew is 0.34 for the sequences both before and after simulation. Note that the most frequent substitution is the C->T transition, which agrees with the cytosine deamination theory (see ref. 10 and references therein), and that the average transition frequency is twofold higher than the transversion frequency.
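The two skew measures defined above are simple to compute from nucleotide counts. The following minimal sketch uses a toy sequence chosen only to illustrate the arithmetic; the function name `skews` is introduced here and is not taken from the original methods.

```python
from collections import Counter


def skews(sequence):
    """Return (GC skew, AT skew) of a DNA sequence given as a string."""
    n = Counter(sequence.upper())
    gc_skew = (n["G"] - n["C"]) / (n["G"] + n["C"])
    at_skew = (n["A"] - n["T"]) / (n["A"] + n["T"])
    return gc_skew, at_skew


# Toy sequence with A=2, T=4, G=3, C=1, used only to illustrate the formulas.
gc, at = skews("AATTTTGGGC")
print(f"GC skew = {gc:.2f}, AT skew = {at:.2f}")
```

A positive GC skew together with a negative AT skew, as in this toy case, has the same sign pattern as the leading-strand values reported above.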
Properties of the substitution matrices
Let us consider only the nucleotides present in the original sequence, which is already in the steady state. The substitution of each of the four nucleotides follows the same rule as the decay of radioactive isotopes, with a "half time of substitutions" characteristic of each nucleotide ($\tau_A$, $\tau_G$, $\tau_T$, $\tau_C$ for A, G, T, and C, respectively) determined by the sum of the probabilities of substitution of a given nucleotide by the other three. More formally:
$$\tau_A = \frac{\ln 2}{p_{mut}\,\bigl(p(A>G) + p(A>T) + p(A>C)\bigr)}$$
(and analogously for the other nucleotides), where $p_{mut}$ is a parameter that denotes the overall rate of mutations and does not influence the ratios between the $\tau$ values of different nucleotides.
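As a worked illustration of the half-time formula, the sketch below evaluates it for hypothetical summed substitution probabilities (they are not the published BbTS values) and an assumed overall mutation rate. The exponential form of the surviving fraction is the continuous first-order approximation, which is reasonable when the per-step substitution probability is small.

```python
import math

# Hypothetical sums of substitution probabilities out of each nucleotide,
# e.g. p(A>G) + p(A>T) + p(A>C) for A. These are NOT the published BbTS values.
OUT_RATE = {"A": 0.22, "T": 0.18, "G": 0.25, "C": 0.35}
P_MUT = 0.01  # assumed overall mutation rate per Monte Carlo step


def half_time(out_rate, p_mut):
    """Half time of substitutions, in Monte Carlo steps."""
    return math.log(2) / (p_mut * out_rate)


def surviving_fraction(out_rate, p_mut, steps):
    """Fraction of original nucleotides not yet substituted (first-order decay)."""
    return math.exp(-p_mut * out_rate * steps)


tau = {base: half_time(rate, P_MUT) for base, rate in OUT_RATE.items()}
for base, rate in OUT_RATE.items():
    # By construction, half of the original nucleotides survive at t = tau.
    print(base, round(tau[base], 1), surviving_fraction(rate, P_MUT, tau[base]))
```

Changing `P_MUT` rescales every half time by the same factor, so the ratios between the half times of different nucleotides stay the same.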
In equilibrium, the fraction of a nucleotide that has been substituted is exactly the same as the fraction of that nucleotide that has replaced the others (the left and right sides of equations 1–4, respectively). Thus, after the half time of substitutions, the ratio between the "old" and "new" nucleotides is 1:1 (see Fig. 2A and Fig. 2C). This is a general property of any table of substitutions in the equilibrium state. But BbTS has another property: the half time of substitutions is precisely correlated with the frequency of the given nucleotide in the sequence in equilibrium, with a correlation coefficient of 0.999 (p = 0.0007, Fig. 2B). This is not a feature of every matrix of substitutions. We have tried to find analytically a table of substitutions that would generate a DNA sequence with the nucleotide composition of the analysed sequence of B. burgdorferi. One such table is presented in Table 1. This "artificial table" generates a DNA sequence of the same nucleotide composition as BbTS does, but the correlation coefficient between the half time of substitutions and the fraction of nucleotides in the sequence is close to zero (Fig. 2C and 2D).
We claim that the mutational pressure causing nucleotide substitutions is very strongly correlated with the DNA composition of the genome, in such a way that the higher the substitution turnover of a nucleotide, the lower the fraction of that nucleotide in the DNA sequence.
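The test described here, correlating the half time of substitutions with the equilibrium fraction of each nucleotide, can be sketched as follows. The table is hypothetical (not the published BbTS values), so the coefficient it yields is not expected to reproduce the reported 0.999; the helper names `out_rate`, `equilibrium` and `pearson` are introduced only for this example.

```python
import math

BASES = "ATGC"
# Hypothetical (from, to) -> relative frequency; NOT the published BbTS values.
TABLE = {
    ("A", "T"): 0.07, ("A", "G"): 0.10, ("A", "C"): 0.05,
    ("T", "A"): 0.06, ("T", "G"): 0.04, ("T", "C"): 0.08,
    ("G", "A"): 0.12, ("G", "T"): 0.09, ("G", "C"): 0.04,
    ("C", "A"): 0.08, ("C", "T"): 0.20, ("C", "G"): 0.07,
}


def out_rate(table, base):
    """Summed probability of substituting `base` by any other nucleotide."""
    return sum(table[(base, x)] for x in BASES if x != base)


def equilibrium(table, iterations=5000):
    """Equilibrium nucleotide fractions, found by repeated application of the table."""
    frac = {b: 0.25 for b in BASES}
    for _ in range(iterations):
        frac = {
            b: frac[b] * (1 - out_rate(table, b))
            + sum(frac[x] * table[(x, b)] for x in BASES if x != b)
            for b in BASES
        }
    return frac


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


fractions = equilibrium(TABLE)
# p_mut is set to 1 here; it only rescales every tau and does not affect r.
taus = [math.log(2) / out_rate(TABLE, b) for b in BASES]
r = pearson(taus, [fractions[b] for b in BASES])
print(f"r = {r:.3f}")
```

Applied to a published substitution table in place of `TABLE`, the same calculation gives the coefficient used in the text to distinguish pure mutational matrices from those affected by selection.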
It seems highly unlikely that such a correlation in the B. burgdorferi genome arose by chance. We have tested many other tables of substitutions that had been published for different genomes and different sequences (data collected in Table 1). As long as such matrices describe the substitutions for sequences that are not under selection pressure (i.e. pseudogenes or the third positions in codons), they follow the same rule, with an extremely high correlation between τ_N and the fraction of the nucleotide N in the DNA sequence in equilibrium (all correlations were statistically significant). This rule holds for asymmetric DNA, as in the B. burgdorferi genome, and for much less biased eukaryotic DNA. Matrices found for the third positions in the four-fold degenerate codons in Drosophila mitochondrial DNA fulfil this rule more precisely than matrices for all third positions in codons in that organelle's genome (the same results were obtained for published matrices of primates' mtDNA; data not shown). These differences could be expected if some mutations in the third positions that lead to amino acid substitutions are not neutral. Furthermore, in some instances, for example for the table describing substitution rates in sequences under strong selection, we have not found the correlation between τ_N and the fraction of nucleotide N (see the last column in Table 1). That supports the hypothesis that the rule is a specific property of pure mutational pressure. One can also notice that matrices found by analysis of substitutions in different pseudogenes in the same organism, or in very closely related organisms, give a different DNA composition in equilibrium, which supports the thesis that the mutational pressure varies for different regions of the same eukaryotic genome [23–25].
We have no clear answer to the question of what selection forces have tuned the mutational pressure in such a way that it follows strict rules for sequences released from selection. It is logical that nucleotides with higher turnover destabilise the genetic information and that selection would tend to eliminate them from the DNA molecule. On the other hand, a lower frequency of a nucleotide gives it a higher informative value, while at the same time deviation from equimolar fractions of nucleotides in DNA diminishes the coding capacity of the whole molecule. Perhaps mathematical analysis of this phenomenon, taking into consideration the properties of the universal genetic code, will show that the optimum for information transfer by the DNA molecule lies at just such points. Further studies may reveal other properties of these strategic points, where τ determines a very specific balance between DNA composition and mutational pressure.
Such evolutionarily established relations between the DNA composition and the turnover rates of nucleotides would have a great impact on the understanding of genome evolution itself. They make it possible to estimate analytically the relations between the mutational pressures exerted on specific nucleotides of each genome, i.e. by simply computing the nucleotide composition of sequences that are not subject to selection pressure. Given the mutational pressure in terms of nucleotide turnover, one can estimate the selection pressure exerted on any sequence or codon position. For example, consider the fate of the first positions of ORFs from the leading strand of the B. burgdorferi genome under the BbTS mutational pressure, without selection (Fig. 3). Note that the half time of substitutions of each nucleotide is the same as for other sequences under such mutational pressure, but the rate of appearance of new nucleotides is different, and the composition of the sequence would change non-linearly during evolution. It is also simple to compute, from the results of computer simulations, the corrections for multiple substitutions and reversions, which are important for estimating the real divergence time. Such corrections should clearly take into account the different contribution of each nucleotide's turnover to the overall frequency of multiple substitutions. Given a precise mutational pressure, one can not only predict the selection pressure but also reconstruct the history of the sequence.
Keeping in mind the precise relation between the fraction of a nucleotide and its turnover time, symmetric DNA (with A=T and G=C) is a specific case where the turnover times of the nucleotides in each pair are equal. Asking which type of substitution should be blamed for DNA asymmetry makes sense for the mutational pressure exerted on DNA released from selection pressure. A simple test for such a mutational pressure is now available: it should generate DNA in equilibrium whose nucleotide composition fulfils the rule of linear interdependence between the sizes of the nucleotide fractions and their turnover times.
Substitution matrices make it possible to compute the DNA composition in equilibrium with a given mutational pressure. It is possible to test whether a given substitution matrix is a pure mutational matrix or whether it is "contaminated" with the effects of selection. The difference between the DNA composition in equilibrium with mutational pressure and a DNA sequence under both mutational and selection pressures allows estimation of the effect of the selection pressure exerted on the particular sequence.
Materials and Methods
Construction of the substitution table
To estimate the frequency of substitutions, we analysed the differences between coding sequences of the B. burgdorferi genome and sequences homologous to them found in the intergenic regions. For the data, see Additional file 1. The sequence of the B. burgdorferi genome was downloaded from http://www.ncbi.nlm.nih.gov. We extracted all intergenic sequences longer than 90 nucleotides and translated them into amino acid sequences in all six reading frames. The amber and ochre stop codons were translated as tyrosine residues and opal as tryptophan. We then searched databases for homology with the B. burgdorferi protein sequences using the FASTA program. For detailed amino acid alignment data see Additional file 2. After selecting homologues (with E < 0.05) whose previously (presumably) coding strands were duplicated on the leading strand, we aligned the nucleotide sequences of these intergenic sequences with the reference ORF sequences using the CLUSTAL X program, and we counted the nucleotide substitutions. The number of analysed alignment sites was 3737 and the average number of substitutions per site was 0.46. For detailed nucleotide alignment data see Additional file 3. The observed numbers of nucleotide substitutions from nucleotide i to j (where i and j stand for A, T, G or C, and i ≠ j) were converted to relative substitution frequencies according to Gojobori, Li, and Graur and Francino and Ochman. That allowed us to compute the frequency of each of the twelve possible substitutions on the leading strand. Since the observed substitution rates were different for each of the four nucleotides, we introduced corrections for multiple substitutions and reversions for each type of substituted nucleotide instead of one general correction. That is, for each type of nucleotide we counted the fraction substituted (the observed number) and used it to estimate the corrected substitution number according to Kimura's formula. The frequencies of substitutions, normalised so that the sum of all 12 frequencies equals 1, are shown in Table 1 (Table of Substitutions, BbTS).
Computer simulations were performed on DNA sequences corresponding to the real protein coding sequences of the B. burgdorferi leading strand or on random DNA sequences generated by computer. In the first case, all ORFs longer than 100 codons situated on the leading strand of the Borrelia genome were spliced together. In the second case, the DNA sequence was constructed by computer by drawing consecutive nucleotides with probabilities given by the assumed composition of this artificial sequence.
During the simulation of mutational pressure, in one Monte Carlo Step (MCS) each nucleotide in the sequence was drawn with probability pmut and was then substituted with the probability described by the substitution matrix. Note that a nucleotide drawn for substitution does not necessarily have to be substituted. After each MCS, each substitution was counted under its specific type of substitution; in addition, the evolving sequence was compared with the original sequence and the accumulated substitutions were counted. This allowed us to measure not only the divergence rate but also the mutational rate subdivided into 12 different classes. The dynamics of substitution were also computed analytically using the equations describing a first-order reaction rate. For more details on computing methods see Kowalczuk et al. [19, 32].
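The simulation procedure described above can be sketched in a few lines. This is only an illustrative reading of the text, not the authors' program: the substitution table is hypothetical (not the published BbTS values), and the function names, sequence length, `p_mut`, step count and random seeds are assumptions made for the example.

```python
import random
from collections import Counter

BASES = "ATGC"
# Hypothetical (from, to) -> relative frequency; NOT the published BbTS values.
TABLE = {
    ("A", "T"): 0.07, ("A", "G"): 0.10, ("A", "C"): 0.05,
    ("T", "A"): 0.06, ("T", "G"): 0.04, ("T", "C"): 0.08,
    ("G", "A"): 0.12, ("G", "T"): 0.09, ("G", "C"): 0.04,
    ("C", "A"): 0.08, ("C", "T"): 0.20, ("C", "G"): 0.07,
}


def random_sequence(length, composition, rng):
    """Draw consecutive nucleotides with the probabilities given in `composition`."""
    return rng.choices(BASES, weights=[composition[b] for b in BASES], k=length)


def monte_carlo_step(seq, table, p_mut, rng, tally):
    """One MCS: each site is drawn with probability p_mut, then possibly substituted."""
    for i, old in enumerate(seq):
        if rng.random() >= p_mut:
            continue  # site not drawn in this step
        r = rng.random()
        cumulative = 0.0
        for new in BASES:
            if new == old:
                continue
            cumulative += table[(old, new)]
            if r < cumulative:
                seq[i] = new
                tally[(old, new)] += 1  # count the substitution in its class
                break
        # If r exceeds the summed probabilities, the drawn site stays unchanged.


def simulate(seq, table, p_mut, steps, seed=0):
    """Evolve a copy of `seq`; return it, the per-class counts and the divergence."""
    rng = random.Random(seed)
    original = list(seq)
    evolved = list(seq)
    tally = Counter()
    for _ in range(steps):
        monte_carlo_step(evolved, table, p_mut, rng, tally)
    divergence = sum(a != b for a, b in zip(original, evolved)) / len(original)
    return evolved, tally, divergence


start = random_sequence(2000, {b: 0.25 for b in BASES}, random.Random(1))
evolved, tally, divergence = simulate(start, TABLE, p_mut=0.05, steps=200)
print(Counter(evolved), divergence, sum(tally.values()))
```

Because of multiple substitutions and reversions, the total number of substitutions recorded in `tally` can exceed the number of sites that differ from the original sequence, which is why the two quantities are tracked separately.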
BbTS: table of probabilities of nucleotide substitutions in the Borrelia burgdorferi leading DNA strand.
References
1. Sueoka N: On the genetic basis of variation and heterogeneity of DNA base composition. Proc Natl Acad Sci USA. 1962, 48: 582-591.
2. Lobry JR: Asymmetric substitution patterns in the two DNA strands of bacteria. Mol Biol Evol. 1996, 13: 660-665.
3. Freeman JM, Plasterer TN, Smith TF, Mohr SC: Patterns of genome organization in bacteria. Science. 1998, 279: 1827-1830. 10.1126/science.279.5358.1827a.
4. Grigoriev A: Analyzing genomes with cumulative skew diagrams. Nucleic Acids Res. 1998, 26: 2286-2290. 10.1093/nar/26.10.2286.
5. McLean MJ, Wolfe KH, Devine K: Base composition skews, replication orientation, and gene orientation in 12 prokaryote genomes. J Mol Evol. 1998, 47: 691-696.
6. Mrazek J, Karlin S: Strand compositional asymmetry in bacterial and large viral genomes. Proc Natl Acad Sci USA. 1998, 95: 3720-3725. 10.1073/pnas.95.7.3720.
7. Francino MP, Ochman H: A comparative genomics approach to DNA asymmetry. Ann N Y Acad Sci. 1999, 870: 428-431.
8. Mackiewicz P, Gierlik A, Kowalczuk M, Dudek MR, Cebrat S: Asymmetry of nucleotide composition of prokaryotic chromosomes. J Appl Genet. 1999, 40: 1-14.
9. Rocha EP, Danchin A, Viari A: Universal replication biases in bacteria. Mol Microbiol. 1999, 32: 11-16. 10.1046/j.1365-2958.1999.01334.x.
10. Frank AC, Lobry JR: Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene. 1999, 238: 65-77. 10.1016/S0378-1119(99)00297-8.
11. Mackiewicz P, Gierlik A, Kowalczuk M, Dudek MR, Cebrat S: How does replication-associated mutational pressure influence amino acid composition of proteins?. Genome Res. 1999, 9: 409-416.
12. Tillier ER, Collins RA: The contributions of replication orientation, gene direction, and signal sequences to base-composition asymmetries in bacterial genomes. J Mol Evol. 2000, 50: 249-257.
13. Voss R: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett. 1992, 68: 3805-3808. 10.1103/PhysRevLett.68.3805.
14. Gierlik A, Mackiewicz P, Kowalczuk M, Dudek MR, Cebrat S: Some hints on Open Reading Frame statistics – how ORF length depends on selection. Int J Modern Phys C. 1999, 10: 635-643. 10.1142/S0129183199000474.
15. Li WH, Wu CI, Luo CC: Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications. J Mol Evol. 1984, 21: 58-71.
16. Yang Z: Estimating the pattern of nucleotide substitution. J Mol Evol. 1994, 39: 105-111.
17. Mackiewicz P, Gierlik A, Kowalczuk M, Szczepanik D, Dudek MR, Cebrat S: Mechanisms generating long-range correlation in nucleotide composition of the Borrelia burgdorferi. Physica A. 1999, 273: 103-115. 10.1016/S0378-4371(99)00345-3.
18. Mackiewicz P, Mackiewicz D, Gierlik A, Kowalczuk M, Nowicka A, Dudkiewicz M, Dudek MR, Cebrat S: The Differential Killing of Genes by Inversions in Prokaryotic Genomes. J Mol Evol. 2001, 53: 615-621. 10.1007/s002390010248.
19. Kowalczuk M, Gierlik A, Mackiewicz P, Cebrat S, Dudek MR: Optimization of gene sequences under constant mutational pressure and selection. Physica A. 1999, 273: 116-131. 10.1016/S0378-4371(99)00346-5.
20. Tamura K: The rate and pattern of nucleotide substitution in Drosophila mitochondrial DNA. Mol Biol Evol. 1992, 9: 814-825.
21. Adachi J, Hasegawa M: Tempo and mode of synonymous substitutions in mitochondrial DNA of primates. Mol Biol Evol. 1996, 13: 200-208.
22. Zhang J: Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol Biol Evol. 1999, 16: 868-875.
23. Filipski J: Why the rate of silent codon substitutions is variable within a vertebrate's genome. J Theor Biol. 1988, 134: 159-164.
24. Wolfe KH, Sharp PM, Li WH: Mutation rates differ among regions of the mammalian genome. Nature. 1989, 337: 283-285. 10.1038/337283a0.
25. Matassi G, Sharp PM, Gautier C: Chromosomal location effects on gene sequence evolution in mammals. Curr Biol. 1999, 9: 786-791. 10.1016/S0960-9822(99)80361-3.
26. Fraser CM, Casjens S, Huang WM, Sutton GG, Clayton R, Lathigra R, White O, Ketchum KA, Dodson R, Hickey EK: Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature. 1997, 390: 580-586. 10.1038/37551.
27. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988, 85: 2444-2448.
28. Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ: Multiple sequence alignment with Clustal X. Trends Biochem Sci. 1998, 23: 403-405. 10.1016/S0968-0004(98)01285-7.
29. Gojobori T, Li WH, Graur D: Patterns of nucleotide substitution in pseudogenes and functional genes. J Mol Evol. 1982, 18: 360-369.
30. Francino MP, Ochman H: Strand symmetry around the beta-globin origin of replication in primates. Mol Biol Evol. 2000, 17: 416-422.
31. Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111-120.
32. Kowalczuk M, Mackiewicz P, Mackiewicz D, Nowicka A, Dudkiewicz M, Dudek MR, Cebrat S: Multiple base substitution corrections in DNA sequence evolution. Int J Modern Phys C. 2001, 12: 1043-1053. 10.1142/S0129183101002279.
Funding sources: This work was supported by The State Committee for Scientific Research, grant numbers 6 P04A 025-18 and 6 P04A 016 20. P.M. was supported by Foundation for Polish Science.
Electronic supplementary material
Additional file 1: Contains information on intergenic sequences and homologous ORFs used to obtain the table of substitutions. This file is to be opened in Excel with tab as the separator. (XLS)
Kowalczuk, M., Mackiewicz, P., Mackiewicz, D. et al. High correlation between the turnover of nucleotides under mutational pressure and the DNA composition. BMC Evol Biol 1, 13 (2001). https://doi.org/10.1186/1471-2148-1-13