Divergence of the upstream transcriptional regulation of duplicated genes in S. cerevisiae
The first measure of the divergence of duplicated genes compares the sets of their transcriptional regulators. Such a set contains information about the different conditions under which a given gene is expressed, and thus reflects the spectrum of its functional roles in the cell. To quantify the similarity of transcriptional regulation of a pair of genes we use the "regulatory overlap" Ω_reg, given by the number of transcription factors that bind to the upstream regulatory regions of both genes (see Fig. 1 for a general illustration). The information about gene duplications used in this study was extracted from the list of all pairs of paralogous (evolutionarily related) proteins found in the yeast genome by the blastp program [9] with a conservative E-value cutoff of 10^-10 (see Methods for more details). The system-wide data for the transcription regulatory network in yeast were taken from the chip-on-chip experiment by Lee et al. [2], which investigated in vivo binding patterns between 106 yeast transcription factors and the upstream regulatory regions of all 6270 yeast genes. Fig. 2A shows the distribution of the regulatory overlap for different values of the percent identity (PID) of the amino acid sequences of paralogous proteins. From this figure one can see that the regulatory overlap tends to decrease with decreasing PID. While multiple overlaps dominate the distribution for PID ≥ 80%, they gradually disappear at lower PIDs.
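To make the definition of the regulatory overlap concrete, here is a minimal Python sketch that counts the transcription factors bound upstream of both genes in a pair. The binding map, the gene and factor names, and the helper functions are hypothetical toy examples, not data or code from Ref. [2].

```python
# Toy binding data: transcription factor -> set of genes whose upstream
# regions it binds. All names are hypothetical placeholders.
binding = {
    "TF1": {"geneA", "geneB", "geneC"},
    "TF2": {"geneA", "geneB"},
    "TF3": {"geneA"},
    "TF4": {"geneC"},
}


def regulators_of(gene, binding):
    """Return the set of transcription factors that bind upstream of `gene`."""
    return {tf for tf, targets in binding.items() if gene in targets}


def regulatory_overlap(gene_1, gene_2, binding):
    """Count the transcription factors bound upstream of both genes."""
    return len(regulators_of(gene_1, binding) & regulators_of(gene_2, binding))


print(regulatory_overlap("geneA", "geneB", binding))  # 2 (TF1 and TF2)
print(regulatory_overlap("geneB", "geneC", binding))  # 1 (TF1 only)
```

A gene with no bound factor simply yields an empty regulator set, so its overlap with any other gene is zero.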
Fig. 2B shows the average value of the regulatory overlap as a function of PID. The regulatory overlap in this plot is normalized by a proxy for the ancestral connectivity of a gene, estimated as the total number of distinct transcription factors involved in the regulation of at least one of the two proteins in the pair (see Fig. 1). The correlation between the normalized regulatory overlap Ω_reg and the PID is highly statistically significant: the Pearson correlation is 0.34 (P-value around 10^-70 for 2275 data points). Even for the lowest value of PID = 20% the average Ω_reg significantly exceeds its value in non-paralogous proteins. One interesting feature of the graph in Fig. 2B is that even pairs of proteins whose amino acid sequences are 100% identical have on average only about 30% overlap in their upstream regulation. Such a low regulatory overlap of recently duplicated genes can be partially attributed to false positives and false negatives present in the dataset of Ref. [2] (see Methods for an extended discussion). It might also sometimes be caused by an incomplete duplication of the upstream regulatory region of a gene, or by a burst of very rapid evolution of the regulatory region immediately following the duplication event. The second feature of Fig. 2B is a gradual decline of the average regulatory overlap over the whole range of sequence similarities. The data in Fig. 2B can be fitted with an exponential decay with a rate corresponding to an average 3% loss of common regulators of a paralogous pair for every 1% decrease in their amino acid sequence identity. Thus already at PID = 80% about half of the common regulatory connections present at PID = 100% are lost. The decline in the regulatory overlap at lower PIDs, clearly visible in Figs. 2A and 2B, is in accord with a recently published analysis [11] of the similarity between microarray profiles of paralogs. In fact, because the chip-on-chip dataset of Ref. [2] contains more direct information about transcriptional regulation than microarray experiments do, our analysis extends the gradual decline to much lower PID than was detected in Ref. [11].
After we submitted this manuscript another group of authors [12] reported a rapid decline in the number of shared regulatory motifs of duplicated genes. Their study, carried out as a function of the much faster silent substitution rate K_s, nicely complements our own findings. Indeed, in their analysis Papp et al. [12] logarithmically binned K_s into four broad bins: below 0.01, 0.01–0.1, 0.1–1, and above 1. Since the reliability of the measured silent substitution rate decreases dramatically at high values of K_s, the whole long-time behavior of the regulatory overlap (i.e. that for PID < 75%, which in yeast roughly corresponds to K_s > 1) remained inaccessible to the analysis of Ref. [12].
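The normalization and the decay arithmetic quoted above for Fig. 2B can be checked with a short Python sketch. It makes two assumptions that are ours rather than the paper's: the quoted "3% loss per 1% of PID" is read as a continuous exponential rate of 0.03 per PID point, and a pair with no known regulators is treated as undefined. The function names are hypothetical.

```python
import math


def normalized_overlap(regs_1, regs_2):
    """Shared regulators divided by all distinct regulators of the pair."""
    union = regs_1 | regs_2
    if not union:
        return None  # neither gene has a known regulator (our assumption)
    return len(regs_1 & regs_2) / len(union)


def expected_fraction_retained(pid, rate_per_pid=0.03):
    """Fraction of the PID = 100% overlap expected to remain at `pid`.

    Assumes a continuous exponential decay in (100 - PID); the default
    rate of 0.03 is our reading of the 3%-per-1% figure in the text.
    """
    return math.exp(-rate_per_pid * (100.0 - pid))


print(normalized_overlap({"TF1", "TF2", "TF3"}, {"TF1", "TF2"}))  # 2 of 3 shared
print(round(expected_fraction_retained(80), 2))  # 0.55
print(round(expected_fraction_retained(70), 2))  # 0.41
```

Under these assumptions roughly 45% of the shared regulators are lost at PID = 80%, consistent with the "about half" stated above.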
Divergence in downstream functional roles of duplicated genes in S. cerevisiae
The rate of divergence between the sets of upstream transcriptional regulators of paralogous proteins has an obvious downstream counterpart: the rate at which paralogous transcription factors lose their downstream targets. Unfortunately, an attempt to quantify this rate using the same dataset that we used above for the rate of upstream divergence would be limited to only 4 paralogous pairs formed by the 106 transcriptional regulators studied in Ref. [2]. In general, the relatively small number of paralogous transcription factors in any given species makes it difficult to go beyond describing anecdotal cases in such an analysis. Thus in the remaining part of this study we concentrate on another measure of downstream divergence, systematically comparing the functional roles of duplicated (paralogous) proteins. The functional similarity of a pair of proteins is in part reflected in the "interaction overlap" Ω_int, given by the number of other proteins that physically interact with both of them (see Fig. 1). In our study we use the system-wide information about protein-protein physical interactions obtained by combining two high-throughput two-hybrid experiments [3, 4]. Fig. 3A shows the average value of the interaction overlap Ω_int between pairs of paralogous proteins as a function of PID, their amino acid sequence similarity. Again Ω_int typically decreases with decreasing PID, reflecting the gradual loss or change of binding partners of proteins in the course of evolution. A similar analysis, but as a function of the silent substitution rate K_s, was previously reported by Wagner [13]. In agreement with that study, we find that paralogous proteins are more likely to share interaction partners than one would expect by pure chance (see the caption to Fig. 3). Our set of yeast paralogs contains 189 paralogous pairs in which both paralogs physically interact with at least one other protein in the combined dataset of Refs. [3, 4]. Of these pairs, 60 (about 30%) share at least one interaction partner. The correlation between Ω_int and the PID in the combined two-hybrid dataset is highly statistically significant: the Pearson correlation is 0.36 (P-value around 5 × 10^-6 for 189 data points). We also find that in yeast the divergence in the set of binding partners becomes systematic only for PID < 70%, while above 70% the overlap remains roughly constant in the Uetz [3], Ito [4], and combined datasets (Fig. 3A).
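The interaction overlap defined above, and the fraction of paralogous pairs sharing at least one partner, can be sketched in Python as follows. The interaction list and protein names are hypothetical toy data, and ignoring self-interactions is our assumption rather than a rule stated in the text.

```python
from collections import defaultdict


def build_partner_map(interactions):
    """Map each protein to the set of proteins it physically interacts with."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored (our assumption)
        partners[a].add(b)
        partners[b].add(a)
    return partners


def interaction_overlap(protein_1, protein_2, partners):
    """Count the other proteins that interact with both members of a pair."""
    return len(partners.get(protein_1, set()) & partners.get(protein_2, set()))


# Hypothetical undirected two-hybrid interactions and paralogous pairs.
interactions = [("P1", "X"), ("P2", "X"), ("P1", "Y"),
                ("P2", "Z"), ("P3", "Y"), ("P4", "W")]
paralog_pairs = [("P1", "P2"), ("P3", "P4")]

partners = build_partner_map(interactions)
overlaps = [interaction_overlap(a, b, partners) for a, b in paralog_pairs]
sharing = sum(1 for value in overlaps if value > 0) / len(overlaps)

print(overlaps)  # [1, 0]
print(sharing)   # 0.5
```

In this toy example only the pair (P1, P2) shares a partner (X), so half of the pairs have a nonzero overlap.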
An alternative way to quantify the extent of divergence/redundancy of duplicated genes is to examine the phenotypes of null mutants lacking one of them. A systematic gene-deletion study in yeast [10] was recently used [14] to compare the fraction of essential genes (those whose null mutants have a lethal phenotype) between genes with and without paralogs in the genome. It was found that the fraction of essential genes is approximately 4 times higher among singleton genes than among those protected by a highly similar paralog. It was also demonstrated that such protection by a paralog persists down to rather low levels of its amino acid sequence similarity (PID) with the deleted protein. In Fig. 3B we confirm these findings using a more recent and larger systematic study [5] of the viability of null mutants in yeast, and also demonstrate that this protective effect is strongest in the nucleus, where the largest fraction of essential proteins resides. Notice that the fraction of essential proteins (especially that of nuclear proteins) shows a dramatic increase as the PID to their closest paralog falls below 50%. Thus paralogous proteins with sequence similarity above 50% can typically substitute for each other.
Having presented different measures of upstream and downstream divergence of duplicated genes in the yeast S. cerevisiae, we are now in a position to discuss them in a wider context. Comparing Fig. 2B to Figs. 3A and 3B, one concludes that changes in the upstream regulation of duplicated genes happen more readily than changes in their downstream function. The overlap in the set of binding partners (Fig. 3A) and the ability of duplicates to substitute for each other (Fig. 3B) remain virtually constant down to a PID of 70%, at which point their average regulatory overlap has dropped to about 40% of its maximum (Fig. 2B). To summarize: our results indicate that duplicated genes still have the ability to partially substitute for each other's downstream functions even when the repertoire of their regulatory connections has already changed substantially from its ancestral state before the duplication. Such genes would be less constrained in evolving new functions [15], and thus would contribute to a greater evolutionary plasticity of the network.
Functional redundancy of paralogous proteins from RNAi experiments on C. elegans
One might expect the protective effect of paralogs to be unique to single-celled organisms such as yeast. Indeed, in multicellular organisms duplicated proteins are often expressed only in specific tissues and are therefore unable to substitute for each other. However, using a systematic study of RNAi (RNA interference) phenotypes in the nematode worm C. elegans [8], we found such protection [16] to be equally strong in this multicellular organism (see Fig. 4). As in Fig. 3B, the x-axis in Fig. 4 is the PID, i.e. the similarity of amino acid sequences between a given protein and its most closely related paralog (all singleton proteins without paralogs are lumped into the 0% PID bin). The y-axis is the fraction of tested proteins whose elimination by the RNAi technique was found [8] to give rise to a nonviable phenotype (embryonic or larval lethality or sterility). In the worm the protection afforded by a paralog starts to weaken gradually for PID < 70%. In both worm and yeast there seems to be a four-fold drop in the fraction of essential proteins between PID = 0% and 100%.
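The construction of a plot such as Fig. 4 can be illustrated with a minimal Python sketch that bins genes by the PID to their closest paralog and reports the fraction with a nonviable phenotype in each bin. The bin width, the folding of PID = 100% into the top bin, the function name, and the toy records are all our assumptions; only the placement of singletons in the 0% bin comes from the text.

```python
def essential_fraction_by_pid(genes, bin_width=10):
    """Return {bin lower edge: fraction of essential genes in that bin}.

    `genes` is an iterable of (pid, is_essential) tuples, where `pid` is the
    percent identity to the closest paralog, or None for a singleton.
    """
    counts = {}  # bin lower edge -> [number essential, number tested]
    for pid, is_essential in genes:
        if pid is None:
            edge = 0  # singletons go into the 0% PID bin, as in the text
        else:
            # PID = 100 is folded into the top bin (our assumption).
            edge = min(int(pid // bin_width) * bin_width, 100 - bin_width)
        tally = counts.setdefault(edge, [0, 0])
        tally[0] += int(is_essential)
        tally[1] += 1
    return {edge: ess / total for edge, (ess, total) in sorted(counts.items())}


# Hypothetical records: (PID to closest paralog or None, nonviable phenotype?)
toy_genes = [
    (None, True), (None, True), (None, False), (None, True),
    (95.0, False), (100.0, False), (92.0, True), (99.0, False),
    (45.0, True), (42.0, False),
]

print(essential_fraction_by_pid(toy_genes))  # {0: 0.75, 40: 0.5, 90: 0.25}
```

With real data a paralog whose PID falls below the bin width would share the 0% bin with the singletons, so a dedicated singleton key may be preferable in that case.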
In the inset to Fig. 4 we kept all successfully cloned genepairs, while in the main panel we dropped those genepairs whose product was predicted [8] to target the mRNA product of more than one gene in the genome (see Methods for more details). It is instructive that the fraction of essential genes as a function of PID shown in the inset to Fig. 4 has a well-pronounced minimum around PID = 70% and then starts to rise for higher values of PID. A tentative explanation for this behavior is that, unlike the single-gene deletion technique used in yeast, the RNAi technique is based on RNA complementarity and can eliminate several different mRNAs with similar sequences. Therefore, paralogous genes with nearly identical DNA sequences offer no protection against RNAi, since their mRNA products would be eliminated at nearly the same rate as the intended targets. This neatly explains why in the inset to Fig. 4 the fraction of nonviable phenotypes for genes with a 100% identical paralog in the genome approaches that of unprotected genes without paralogous partners (keep in mind that in this plot we use the amino acid sequence identity of proteins and not that of their mRNA precursors). This observation also reinforces the view that the decline in the fraction of essential genes vs. PID shown in Figs. 3B and 4 is indeed caused by the protective effects of paralogs and cannot be explained by a possible tendency of nonessential genes to duplicate more frequently.
Divergence of physical interactions of paralogous genes in H. pylori and D. melanogaster
The analysis of the evolution of molecular networks advocated in this paper requires a large (preferably genome-wide) and unbiased (i.e. free of the anthropogenic selection present in databases) dataset describing a molecular network in a given species. Apart from yeast, which is arguably the best-studied model organism, system-wide two-hybrid physical interaction assays have been published for a simple bacterium, Helicobacter pylori [6], and a fly, Drosophila melanogaster [7]. In Fig. 5 we use these two datasets to quantify the decay of the average interaction overlap as a function of amino acid sequence similarity (see Fig. 3A for the same analysis in yeast). The correlation between Ω_int and PID is highly statistically significant in both cases: the Pearson correlation is 0.43 (P-value around 3 × 10^-4 for 65 data points) for H. pylori and 0.19 (P-value around 10^-26 for 2843 data points) for D. melanogaster. Our basic conclusions hold for all of the quite diverse organisms used in this study: paralogous proteins are much more likely to share binding partners than expected by pure chance. Furthermore, the number of common interaction partners goes down as the PID of their amino acid sequences decreases. In yeast and H. pylori we see evidence of an initial plateau over which the average overlap appears to be independent of PID. In the fly, on the other hand, there is no evidence of such a plateau, which makes the average rate of loss of common binding partners (about 4.5% for every 1% of change in PID) quite high in this organism. However, in the absence of system-wide data on transcription factor binding in the fly and H. pylori we could not quantify the rates of upstream changes in these two organisms, and consequently cannot compare them to the corresponding downstream rates.
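The Pearson correlations quoted throughout this section can be computed from per-pair (PID, overlap) values with a few lines of Python. The sketch below uses only the standard library; the function names and the toy data are hypothetical, and the t statistic is the standard one for testing a zero correlation, which may differ from the exact procedure used to obtain the P-values in the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0 with n data points (n - 2 d.o.f.)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical per-pair values: PID and interaction overlap.
pids = [20, 40, 60, 80, 100]
overlaps = [0, 0, 1, 1, 3]

print(round(pearson_r(pids, overlaps), 2))  # 0.9
print(round(t_statistic(0.43, 65), 2))      # 3.78
```

Converting the t statistic into a P-value requires the Student t distribution; in practice a statistics library routine such as `scipy.stats.pearsonr` returns both the correlation and its P-value.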