Reduced mRNA stability at the translation-initiation region in viral genomes
We performed a sliding-window analysis of mRNA secondary-structure stability in 650 fully sequenced ds-DNA viruses. For each ORF in each virus, we calculated Z scores ZΔG. ZΔGmeasures to what extent mRNA secondary-structure stability deviates from random expectation given the amino-acid sequence and codon composition of the ORF [15]. A ZΔG> 0 indicates that the structure is less stable than expected, and a ZΔG< 0 indicates the opposite. We calculated ZΔG> 0 for windows of length 30 nucleotides (nt), and we covered the first 150 nt of each ORF in steps of 10 nt, as described [15]. For each window, we then averaged ZΔGover all ORFs in each genome. We refer to this genome-wide average as
. Below, we test whether this genome-wide average is significantly different from zero. Note that the genome-wide average can be significantly non-zero even if the Z scores for individual genes are relatively small and would individually not be considered significant.
Figure 1 shows
as a function of window position for bacteriophage T7. In this virus,
is significantly larger than zero in the first window (t-test,
= 0.35, P = 0.005), and it is not significantly different from zero in windows further downstream. We carried out a similar analysis on all virus strains in our data set. As in cellular organisms, there was substantial variation in the
of the first window among different virus strains and somewhat less variation in windows further downstream (Figure 2). Allowing for a false-discovery rate of 5%, we found 181 dsDNA viruses (28%) whose
in the first window was significantly non-zero. With the exception of 3 viruses infecting eukaryotes (epiphyas postvittana nucleopolyhedrovirus, cowpox virus, and canarypox virus),
was positive in all these cases. For windows further downstream, the number of virus strains with significantly non-zero
declined rapidly, and
tended to be negative rather than positive (Figures 2 and 3). These results mirror the results of Gu et al. [15], who found that
was generally positive near the start codon and negative further downstream. The main difference in the virus data set is that virus genomes tend to be small, and the error estimates on
are consequently large. (E.g., compare Figure 1 of the present study to Figure 1 of [15].)
In aggregate,
values for eukaryotic and archaeic viruses were lower than those for prokaryotic viruses. On average, archaeic viruses had a
of 0.0049, not significantly different from zero (t-test, P = 0.91). Eukaryotic viruses had an average of 0.057, which was significantly different from zero (t-test, P = 3.2 × 10-5). Prokaryotic viruses had an average of 0.22, also significantly different from zero (t-test, P < 10-10). The
distribution for prokaryotic viruses was significantly different from those for eukaryotic viruses (t-test, P < 10-10) and archaeic viruses (t-test, P = 2.1 × 10-5).
Since we obtained ZΔGby shuffling codons within genes, we implicitly assumed that there is no sub-stantial, site-specific selection on synonymous sites outside the focal analysis window. This assumption is violated in regions where reading frames overlap and synonymous sites are primarily determined by the amino-acid sequence of the overlapping reading frame. Therefore, we tested whether our ZΔGvalues were confounded by overlapping reading frames. For all ORFs in all virus genomes, we determined whether they overlapped with any other ORFs and classified them into overlapping and non-overlapping ORFs. We found that on average 50% of the ORFs in a virus genome were overlapping, with a standard deviation of 15.9 percentage points. We then tested for each window in each virus genome whether the mean ZΔGfor overlapping genes was different from the mean ZΔGfor non-overlapping genes, using t tests. We found that this was generally not the case. Allowing for a false-discovery rate of 5%, not a single virus genome showed a significant difference between overlapping and non-overlapping ORFs in the first four windows. Over all 13 windows, there were only two cases were we could reject the null hypothesis of no difference, invertebrate iridescent virus 3 in window 5 and clanis bilineata nucleopolyhedrosis virus in window 11.
However, when pooling data from all viruses into a single analysis, we found a small shift towards lower
values for overlapping ORFs in eukaryotic and archaeic (but not prokaryotic) viruses (Additional File 1 Figure S1). We concluded that there was no evidence that overlap influences
in prokaryotic viruses, and weak evidence that it does so in the other two types of viruses. Since gene overlap certainly does not increase
, we concluded that including both overlapping and non-overlapping ORFs in our analysis was a conservative approach. Therefore, we freely mixed overlapping and non-overlapping ORFs throughout our analysis. The reduced
in eukaryotic and archaeic viruses compared to prokaryotic viruses was not due to the inclusion of overlapping ORFs;
values for all virus groups were nearly identical regardless of whether overlapping ORFs were included or not (not shown).
For a few select virus species, we also tested whether our results were confounded by dinucleotide frequencies. We calculated alternate ZΔGvalues using reshuffled sequences in which all dinucleotide frequencies had been held constant. We found that our standard shuffling method and the dinucleotide shuffling method resulted in nearly identical Z scores (Additional File 1 Figure S2). Note that in the dinucleotide shuffling method, we shuffled synonymous codons such that both amino-acid sequence and dinucleotide frequencies were held constant. The algorithm to perform this shuffling is fairly computationally expensive and runs approximately 24 times slower than regular codon shuffling.
Relationship between genomic GC composition and mean 5' ZΔG
For the remainder of this work, we refer to the
at the very start of the coding sequence (in sliding window #1) as the 5'
. To explain the variation observed in the 5'
, we correlated it with the mean GC content in coding sequences, since this quantity is a good predictor of the 5'
in cellular organisms [15].
Because different virus strains are evolutionarily related, a relationship between 5'
and GC content may be confounded by the viral phylogeny [16]. We can avoid this issue by correlating phylogenetically independent contrasts (PIC), which are differences of variables among organisms [16]. We found that the PIC of the 5'
were well correlated with the PIC of the GC content in coding sequences (r = 0.53, P = 10-31 for prokaryotic viruses, r = 0.54, P = 10-17 for eukaryotic viruses, and r = 0.49, P = 0.009 for archaeic viruses, see also Figure 4). Genomes with higher GC content had comparatively less stable mRNA secondary structure near the start codon.
Because mRNA stability was reduced only at the translation-initiation region, we expected that the correlation between PIC of
and PIC of GC content should decrease when
was caluclated for windows further downstream. Thus, we calculated the corresponding correlation coefficient for all windows. We found that indeed the correlation declined continuously and was consistently near zero (for eukaryotic viruses) or negative (for prokaryotic or archaeic viruses) from the 5th window onwards (Additional File 1 Figure S3).
Since the thermodynamic stability of RNA secondary structure tends to be correlated to the RNA's GC content, we also considered local deviations in a gene's GC content. We calculated ZGC, which measures the deviation in GC content in a 30 nt window relative to the average GC content in the gene (see Materials and Methods). We found a negative correlation between PIC of genomic GC content and PIC of
in the first window (r = -0.67, P = 10-84 for prokaryotic viruses, r = -0.68, P = 10-58 for eukaryotic viruses, and r = -0.74, P = 10-35 for archaeic viruses, see also Additional File 1 Figure S4). Thus, in GC-rich viruses, the sequence regions immediately downstream of the start codon have undergone stronger GC reduction.
We also analyzed to what extent ΔG (rather than its deviation from expectation, as measured by ZΔG) varied with GC content. Our results mirrored those we had previously found for cellular organisms [15]. There was a strong negative correlation between the mean ΔG and GC content (Additional File 1 Figure S5). The higher the GC content, the more stable the mRNA secondary structure, even in the first window. We also tested for a correlation between GC content and the difference in stability between the first window and the tenth window, and found that this difference increased in viruses infecting prokaryotic or eukaryotic hosts, but not in those infecting archaeic hosts (r = 0.68, P < 10-15 for prokaryotic viruses, r = 0.37, P = 10-7 for eukaryotic viruses, and r = 0.37, P = 0.06 for archaeic viruses, see also Additional File 1 Figure S6). These results remained unchanged when correcting for phylogeny (not shown).
Host-specific patterns in bacteriophages
Finally, we asked to what extent sequence features in viruses correlated with corresponding features in their hosts. Previous work has shown that synonymous codon usage in bacteriophages exhibits significant bias towards host-preferred codons [17], and that genomic GC content in some phages is close to the genomic GC content of their host organisms [18, 19]. Thus, we would expect more generally that both GC content and 5'
in viruses correlate with the same quantities in the appropriate host organisms.
For all bacteriophages in our data set, we identified the corresponding host organism based on the information provided by RefSeq. We then compared GC content in phages and hosts. We found that the GC content in phages correlated strongly with the GC content of the host (correlation coefficient r = 0.89, P ≪ 10-15, Figure 5). More specifically, Figure 5 suggests that the phage's GC content places a lower limit on the GC content of host organisms the phages can infect. Nearly all data points fall below the dashed line indicating identical GC content for phages and hosts. Moreover, some phages with low GC content are associated with hosts with high GC content, but phages with high GC content are never associated with hosts with low GC content.
The correlation between phage and host GC content may, however, be confounded by phylogeny, as explained in the previous subsection. One complication here is that the phylogeny of phages is not necessarily the same as the phylogeny of the hosts. We are not aware of any method that can correctly compare two data sets with distinct covariance structures. Therefore, we opted for two strategies. On the one hand, we calculated correlations without considering phylogeny at all, and thus obtained the value r = 0.89 cited above. On the other hand, we considered the GC content of the host a measurement on the virus, and thus used the virus phylogeny to calulate PIC for both virus and host GC content. The correlation we obtained in this way was nearly indistinguishable from the one obtained not controlling for phylogeny at all (correlation coefficient r = 0.88, P = 10-60, Additional File 1 Figure S7).
We also analyzed whether the 5'
in bacteriophages correlated with that in their hosts. The 5'
values for the phage hosts were obtained from [15]. We found a significant positive correlation, both when ignoring the phylogeny (correlation coefficient r = 0.53, P ≪ 10-12, Figure 6) and when using the phage phylogeny to calculate PIC for both phage and host 5'
(correlation coefficient r = 0.50, P ≪ 10-12, Additional File 1 Figure S8).