The above results seem to indicate that conserved proteins are, in general, longer than non-conserved ones. It is highly unlikely that these results are due to a detection bias, given that the observations were unchanged when the cutoff expectation value used for the sequence comparisons (see Methods) was varied between E < 10^-3 and E < 10^-9. A possible explanation for the insensitivity of these results to the similarity threshold is that, e.g., for the E. coli Conserved Set, 80% of the proteins had conserved regions (shared with proteins from other kingdoms) over more than 75% of their length and thus would be easy to detect with most sequence comparison methods over a wide range of thresholds.
For the protein sequences determined by conceptual translation of genomic DNA, annotation artifacts would likely be more common among the shorter sequences, and these would be classified into the Nonconserved Set. Thus, a possible explanation for the difference in the length distributions between the Nonconserved and Conserved Set proteins would be annotation artifacts among the proteins derived from genomic sequence. Skovgaard and colleagues [8] compared the length distribution of annotated microbial genome proteins that match known proteins with that of proteins that do not match any known protein. The sequences without matches were shorter, and this was taken as evidence that too many short genes have been annotated in many genomes (i.e., many of these short genes are artifacts). To test this possibility for the Escherichia coli proteins, we generated the length distribution for the Salmonella Set, a subset of E. coli proteins that match proteins from Salmonella but do not match proteins from more distant organisms. It is estimated that Salmonella and E. coli diverged about 100 million years ago [9], and thus a statistically significant similarity between sequences from these bacteria indicates that the corresponding genes evolve under purifying selection. Although this does not prove that all these genes encode proteins (i.e., some of them might encode heretofore uncharacterized regulatory RNAs), requiring a statistically significant similarity to Salmonella sequences greatly reduces the chance of retaining annotation artifacts. Although there are fewer proteins in the Salmonella Set, its length distribution is essentially the same as that of the Nonconserved Set (Figure 1).
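As an illustration of how two length distributions such as those of the Salmonella Set and the Nonconserved Set might be compared, the following minimal sketch computes medians and a two-sample Kolmogorov-Smirnov statistic. It is not the procedure used to produce Figure 1; the function name and all of the protein lengths below are hypothetical and were invented solely for the example.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distributions of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every distinct value from either sample.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical protein lengths (residues), invented purely for illustration.
conserved = [210, 340, 455, 520, 610, 780, 905]
nonconserved = [85, 110, 130, 150, 175, 240, 310]
salmonella_like = [90, 105, 135, 155, 170, 250, 300]

print("median lengths:",
      median(conserved), median(nonconserved), median(salmonella_like))
print("KS, nonconserved vs salmonella-like:",
      ks_statistic(nonconserved, salmonella_like))
print("KS, nonconserved vs conserved:",
      ks_statistic(nonconserved, conserved))
```

A statistic near 0 means the two empirical distributions nearly coincide, and a statistic near 1 means they barely overlap. Judging significance would additionally require the sample sizes and an appropriate reference distribution, which this sketch does not attempt.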
Furthermore, although it is likely that there is a greater fraction of annotation artifacts among the Nonconserved Set proteins derived from genome annotations, this is unlikely to be true for the human and Drosophila proteins analyzed here because they have been derived from cDNA sequences. To further reduce the chance of annotation errors, for the Drosophila set, we avoided cDNA sequences generated by high-throughput cDNA projects. Thus, annotation artifacts are unlikely to explain the results shown in Figures 1-5.
A challenging problem for biologists trying to make sense of genomic sequence, particularly for the eukaryotes, is that shorter proteins are more difficult to predict on purely statistical grounds [8] and are also less likely to have confirmatory homologs in other organisms. Thus, without functionally cloned cDNA transcripts, it becomes hard to distinguish artifacts from rapidly evolving genes, and a conservative approach may result in underrepresentation of the shorter eukaryotic proteins in the databases. Consistent with this possibility is the rightward shift of the length distribution of the Nonconserved Set proteins of the eukaryotes as compared to that of the prokaryotes.
One generally assumes that the length of a protein is largely determined by its functions. The relatively wide variance in sequence length among the members of the Conserved Set reflects the diverse range of specific functional roles of these proteins. The Nonconserved Set proteins, however, are on average shorter than the conserved proteins, with the poorly conserved E. coli and A. fulgidus proteins closely approximating the minimal length distribution possible for globular proteins, as represented by the Minimal Structural Domain Set. In this sense, the poorly conserved proteins from these organisms appear to be as small as proteins can be and still fold into a stable globular structure.
Many biologists implicitly assume that functionally important proteins are more evolutionarily conserved than less vital proteins, and recent work has confirmed this belief [10, 11]. Here, we identified another substantial difference between highly conserved and poorly conserved proteins: the less conserved (i.e., less important) proteins are, on average, smaller than the more conserved (and more important) proteins. What global evolutionary forces would favor shorter proteins in the absence of other functional constraints? It seems logical to think of these potential forces in terms of minimizing the cost of having extra sequences that do not substantially affect fitness. Such costs might be associated with several distinct processes. One possibility is simply the cost of protein translation, and another is the cost of the chaperones that are required to fold longer, particularly multidomain, proteins [12]. Although perhaps less likely, yet another cost of longer proteins could be their increased risk of "side effects", i.e., deleterious interactions with other cellular components. For any given protein, the cost differential is likely to be almost negligible, but this difference becomes more significant when one considers the entire set of poorly conserved proteins. In a somewhat similar context, Akashi & Gojobori [13] have shown that highly expressed proteins in the proteomes of E. coli and B. subtilis have a greater abundance of less energetically costly amino acids than other proteins encoded in these genomes. Another related observation is that of Castillo-Davis and colleagues [14], who have shown that highly expressed genes have smaller introns on average than other genes, presumably due to the cost of transcription and/or splicing.
The action of random genetic drift and selection pressure on genome size (cf. [15]) could also favor shorter proteins. If deletions are more common than insertions for a given organism, then proteins that can tolerate more mutations (i.e., are evolving under weaker functional constraints) will tend to get smaller over time. Several studies in E. coli have indeed shown that, on average, deletions are eight times more frequent than insertions (cf. [16]). Similarly, analysis of human mutations (A. Kondrashov, personal communication) has shown that deletions are approximately three times more frequent than insertions. It is reasonable to assume that evolutionary forces acting on genome size might be a more important factor favoring smaller proteins in prokaryotic and unicellular eukaryotic genomes because these genomes are primarily composed of protein-coding sequence. This is less obvious for the larger eukaryotic genomes; in particular, the metazoan and plant genomes are primarily composed of noncoding DNA, so reductions in protein length would tend to have far less impact on overall genome size.
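The deletion-bias argument can be made concrete with a back-of-the-envelope calculation. The sketch below is a toy model rather than an analysis from this work: it assumes that every indel adds or removes a single residue, that a fixed fraction of indels is tolerated by selection, and that length cannot fall below a floor. The function name, its parameters, and the numerical values other than the 8:1 and 3:1 deletion-to-insertion ratios quoted above are hypothetical.

```python
def expected_length(start_length, indel_events, deletion_to_insertion_ratio,
                    tolerated_fraction, minimum_length):
    """Expected protein length after a number of single-residue indel events.

    Toy assumptions (not taken from the paper): each event is a deletion
    with probability ratio / (ratio + 1) and otherwise an insertion; only
    `tolerated_fraction` of events escape purifying selection; the length
    never drops below `minimum_length`.
    """
    p_deletion = deletion_to_insertion_ratio / (deletion_to_insertion_ratio + 1)
    p_insertion = 1 - p_deletion
    # Average change in length contributed by one indel event.
    net_change_per_event = (p_insertion - p_deletion) * tolerated_fraction
    length = start_length + indel_events * net_change_per_event
    return max(minimum_length, length)


# Hypothetical scenario: a 300-residue protein exposed to 100 indel events.
weakly_constrained = expected_length(300, 100, 8.0, 0.5, 50)
strongly_constrained = expected_length(300, 100, 8.0, 0.05, 50)
human_like_bias = expected_length(300, 100, 3.0, 0.5, 50)

print(round(weakly_constrained, 1),
      round(strongly_constrained, 1),
      round(human_like_bias, 1))
```

Under these assumptions the weakly constrained protein shrinks roughly ten times faster than the strongly constrained one, and a 3:1 bias shrinks it more slowly than an 8:1 bias. With no bias (a ratio of 1) the expected length does not change at all.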
All of the above constraints would tend to favor shorter proteins but do not seem to explain why the tendency to economize on unnecessary residues increases with greater sequence length, as seen in Figure 7. To have this effect, a constraint must initially increase in intensity more than linearly with sequence length. Given the globular nature of a folded protein, the average number of intramolecular contacts per residue should grow with increasing sequence length (as length increases, the volume of the globule grows faster than its surface), and these contacts would constrain the possible residues at any given site within a protein. However, the size of a single globular domain of a protein does not continue to grow with sequence length beyond a certain limit (~150 residues); rather, longer proteins typically have multiple globular domains, and thus the rate of increase in intramolecular contacts for a protein should level off. This is exactly what is seen in the plot of average contact density versus length shown in Figure 7, and the similarity of the contact density plot to that for the fraction of conserved residues is noteworthy. This similarity over a range of sequence lengths is consistent with an evolutionary force minimizing the cost of having extra sequences that do not substantially affect fitness.
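The surface-to-volume reasoning above can be illustrated with a deliberately crude lattice model. This is not the contact-density calculation behind Figure 7. It assumes that a domain packs its residues into a cube on a simple cubic lattice, that a buried residue makes six contacts and a surface residue an average of four, and that a protein longer than the domain limit is built from domains of the limiting size. Only the ~150-residue limit comes from the text; the function name and the other parameter values are hypothetical.

```python
def contact_density(length, max_domain_size=150,
                    buried_contacts=6.0, surface_contacts=4.0):
    """Average contacts per residue in a toy cubic-lattice globule.

    Hypothetical model: residues fill a cube with side n ** (1/3), so the
    buried (non-surface) fraction is (1 - 2 / side) ** 3. Proteins longer
    than `max_domain_size` are treated as chains of maximal-size domains,
    so their contact density stops growing.
    """
    domain_size = min(length, max_domain_size)
    side = domain_size ** (1.0 / 3.0)
    # Clamp at zero: very small globules have no buried residues at all.
    buried_fraction = max(0.0, 1.0 - 2.0 / side) ** 3
    return surface_contacts + (buried_contacts - surface_contacts) * buried_fraction


for length in (30, 60, 100, 150, 300, 600):
    print(length, round(contact_density(length), 2))
```

In this model the contact density rises with length while a single domain is still growing and is flat beyond the domain limit, which is the qualitative shape described above. The absolute values carry no meaning.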
The results presented here show that, for all the organisms studied, poorly conserved proteins are, on average, shorter than highly conserved ones. In general, there appears to be a significant trend towards shorter proteins in the absence of other, more specific functional constraints. This is compatible with the existence of an evolutionary force acting to minimize the costs associated with sequences that have no functional role. Thus, the size of the poorly conserved proteins tends toward the minimal domain size, whereas the size of highly conserved proteins varies to a greater extent, reflecting their broad range of functions. It appears that analysis of functionally relatively unimportant proteins allows one to uncover general evolutionary trends that have so far remained unnoticed.