Directional divergence of Ep300 duplicates in teleosts and its implications

Background EP300 is a conserved protein in vertebrates, which serves as a key mediator of cellular homeostasis. Mutations and dysregulation of EP300 give rise to severe human developmental disorders and malignancy. Danio rerio is a promising model organism to study EP300 related diseases and drugs; however, the effect of EP300 duplicates derived from teleost-specific whole genome duplication should not just be neglected. Results In this study, we obtained EP300 protein sequences of representative teleosts, mammals and sauropsids, with which we inferred a highly supported maximum likelihood tree. We observed that Ep300 duplicates (Ep300a and Ep300b) were widely retained in teleosts and universally expressed in a variety of tissues. Consensus sequences of Ep300a and Ep300b had exactly the same distribution of conserved domains, suggesting that their functions should still be largely overlapped. We analyzed the molecular evolution of Ep300 duplicates in teleosts, using branch-site models, clade models and site models. The results showed that both duplicates were subject to strong positive selection; however, for an extant species, generally at most one copy was under positive selection. At the clade level, there were evident positive correlations between evolutionary rates, the number of positively selected sites and gene expression levels. In Ostariophysi, Ep300a were under stronger positive selection than Ep300b; in Neoteleostei, another species-rich teleost clade, the contrary was the case. We also modeled 3D structures of zf-TAZ domain and its flanking regions of Ep300a and Ep300b of D. rerio and Oryzias latipes and found that in either species the faster evolving copy had more short helixes. Conclusions Collectively, the two copies of Ep300 have undoubtedly experienced directional divergence in main teleost clades. The divergence of EP300 between teleosts and mammals should be greater than the divergence between different teleost clades. Further studies are needed to clarify to what extent the EP300 involved regulatory network has diverged between teleosts and mammals, which would also help explain the huge success of teleosts.

double after genome duplication represent the subset under strongest purifying selection [22]. On the other hand, two duplicates generally evolve asymmetrically [22], which will nally lead to functional divergence and even gene separation (genesis of new genes). By searching against the Selectome database, we found that both EP300 and CREBBP of teleosts have positively selected branches, while mammals and sauropsids have none [23][24][25][26]. If EP300 and CREBBP of teleosts have experienced strong and constant positive selection, they may have diverged considerably in functions from their mammalian orthologs. Given the fundamental roles of EP300 and CREBBP in regulatory networks, positive selection on them may also correlate with the huge success of teleosts [27]. However, the Selectom database cannot provide more details since it only includes a limited number of teleost species and the method used is restricted to branch-site models.
In this study, we focused on the molecular evolution of Ep300 in teleosts. We are particularly interested in the way and the extent of divergence of the retained duplicates in different teleost clades, which were explored through analyses of molecular evolution of Ep300 duplicates based on diverse evolutionary models, tissue expression pro les and protein structures.

Retention of Ep300 duplicates in teleosts
Through blastp search against NCBI nr database, we obtained 114 EP300 protein sequences from 28 shes, 30 mammals and 25 sauropsids (a detailed list of these species and their respective lineage information can be found in Additional le 1). All mammals and sauropsids had only one copy; the shes had 2.1 copies on average, with 21 shes had exactly 2 copies and only one teleost sh had one copy (Additional le 2). Sinocyclocheilus anshuiensis, Carassius auratus (gold sh), Austrofundulus limnaeus and Oncorhynchus kisutch (coho salmon) had 3~4 copies, which was in accordance with the fact that their respective ancestors underwent recent genome duplications [28][29][30]. The two non-teleost shes, Erpetoichthys calabaricus (reed sh) and Lepisosteus oculatus (spotted gar), had only one copy. We also conducted tblastn search against RefSeq genomes of 19 sh species that have available assembled chromosomes and found no evidence of any species to have additional copies that might originate from small-scale duplications (SSDs) (Additional le 3). Therefore, the best explanation is that the commonly appeared two copies in teleost shes originated from the teleost-speci c WGD.
To get more direct evidence, we inferred phylogenetic trees by both maximum likelihood (ML) and Bayesian methods, both of which showed clear separation of two big teleost clades (Additional les 4 and 5). The topologies of the two trees were very similar: at least 98% of edges of one tree could be found on the other and the normalized Robinson-Foulds distance was 0.03 (see Table S1 in Additional le 6). However, the Bayesian tree unreasonably placed Ep300a of Scleropages formosus (Asian bonytongue) and Paramormyrops kingsleyae (Additional le 5); therefore, only the ML tree was used for further analyses.
We also generated consensus protein sequences of Ep300a and Ep300b of teleost shes and EP300 of mammals and sauropsids. We queried the conserved domains within these consensus sequences and found that Ep300a and Ep300a had exactly the same distribution of conserved domains as EP300 of mammals and sauropsids (Fig. 1). That is not unexpected, since EP300 and CREBBP are also highly similar to each other (Additional le 6, Fig. S1 and S2) [31]. It should be noted, however, that conserved domains shown in Fig. 1 were just speci c hits reported by CDD search; there were also non-speci c hits called superfamilies (Additional le 6, Fig. S3-S8). For speci c species, two copies of Ep300 can differ in both speci c hits and superfamilies. Take D. rerio as an example,: its Ep300b did not have the speci c PHD_p300 domain that existed in Ep300a, but had a PHD_SF superfamily in the corresponding region (Additional le 6, Fig. S5 and S6).
Ep300a and Ep300b were widely subject to positive selection We used aBSREL [32] to test branches that were subject to episodic positive selection throughout the ML tree (Fig. 2). Since multiple testing greatly reduces the statistical power in an exploratory analysis, we considered all branches with uncorrected p value lower than 0.05. Of the 55 branches of Ep300a, 22 were under positive selection; of the 57 branches of Ep300b, 21 were under positive selection. By contrast, the proportions of positively selected branches of mammals and sauropsids were 5/59 and 4/49, respectively. The proportion of positively selected branches of either Ep300a or Ep300b was very signi cantly higher than that in mammals and sauropsids (p values all lower than 0.01, see Table S2 in Additional le 6); however,the difference between the two copies was not signi cant (all comparisons were conducted by Fisher's exact test [33]). Of the 43 positively selected branches of Ep300a and Ep300b, over half were internal branches, viz. common ancestors. Furthermore, in 9 ancestral species (branches #1-#8 and Neoteleostei, which were all dated back to more than 100 MYA), both duplicates were under positive selection, while in most extant species, only one or none duplicate was under positive selection.
When we inspected the ω ratio of every branch (reported by aBSREL test; see Additional le 7), we observed that the type of asymmetric evolutionary rates of Ep300a and Ep300b was different in different species: e.g. in D. rerio Ep300a evolved faster than Ep300b, while in Oryzias latipes the contrary was the case. From the common ancestor #1 to the majority of extant species, there was not a constant trend of which copy evolves faster.
Faster evolving Ep300a/Ep300b contained more positively selected sites We used Clade model C (CmC) [34] and RELAX [35] to compare overall evolutionary rates of Ep300a and Ep300b in different clades. One signi cant difference between CmC and RELAX is that the latter incorporates rate variation in synonymous sites (ds) across sites and branches. Still, their results were accordant: in each comparison, a higher ω 2 in CmC result was accompanied by a higher mean ω value in RELAX result (Table 1). Both duplicates evolved faster than mammals and sauropsids, which was in accordance with previous results [22,36]. At teleost level, the two duplicates evolved at almost the same rate; p values of CmC test and RELAX test were not su ciently small either (Table 1). In four smaller clades (Neoteleostei, Atherinomorphae, Ostariophysi and Cypriniformes), the two duplicates evolved at signi cantly different rates and with very low p values: in the clade Neoteleostei and its subclade Atherinomorphae, Ep300b evolved faster than Ep300a; in the clade Ostariophysi and its subclade Cypriniformes, however, Ep300a evolved faster than Ep300b. Although the duplicates generally evolved at different rates, the moderate k values reported by RELAX indicated that there was no strong evidence of one copy to be under intensi ed or relaxed selection relative to the other.
To get a thorough exploration of positively selected sites of Ep300a and Ep300b, we used MEME [37] to detect sites subject to episodic positive selection and FUBAR [38] and M8&M7 models [39] to detect sites subject to pervasive positive selection. To speculate the possible consequences of positively selected sites, we matched their positions with conserved domains of the consensus sequence (of the full sequence set). As shown in Table 2, mammals and sauropsids had much fewer positively selected sites than teleosts, which was consistent with the above results that they also had fewer positively selected branches and slower evolutionary rates. At any clade, we can nd that the dominant proportion of positively selected sites was detected by MEME, con rming that natural selection is predominantly episodic [37]. An unexpected observation was that there were much fewer detected positively selected sites in big clades like teleosts than in smaller clades like Neoteleostei or Ostariophysi, suggesting that more data will not necessarily provide greater power to detect positive selection [37]. In either big or small clades of teleosts, positively selected sites detected in two duplicates were generally non-redundant, indicating that they were subject to divergent selection. In smaller clades, it was evident that the copy with a faster evolutionary rate generally had more positively selected sites. However, these positively selected sites were most commonly appeared in non-conserved regions, especially in regions before zf-TAZ domain and between KIX and Bromo_cbp_like domains.

Structural features of zf-TAZ domain and its anking regions
We modeled structures of zf-TAZ domain and its anking regions of Ep300a and Ep300b of D. rerio and O. latipes (as representatives of Ostariophysi and Neoteleostei, respectively) using I-TASSER suit. All four best models of respective sequences had signi cantly greater structure density (by the number of decoys) than respective lower-rank models; three of them even had TM-score greater than 0.5 (see Table S3 in Additional le 6). In three best models (except for Ep300b of D. rerio), the region corresponding to zf-TAZ domain was characterized by long α-helixes, further con rming the credibility of the best models (Fig. 3). In anking regions, especially the N-terminal side, short helixes were frequently appeared. Ep300a of D. rerio and Ep300b of O. latipes, which evolved faster and had more positively selected sites, also contained more short helixes than their respective paralogs (Fig. 3). It should be noted that the anking regions are mainly loops, the modeling of which would heavily rely on remote homology (or ab initio modeling) and consequently may not be very stable. We also modeled structures of the N-terminal side anking region of zf-TAZ domain and found that the number and distribution of short helixes changed greatly in best models of this region (Additional le 6, Table S4 and Fig. S9). However, Ep300a of D. rerio and Ep300b of O. latipes still contained more short helixes than their respective paralogs.

Correlation between tissue expression pro le and evolutionary rate
We analyzed the tissue expression pro les of ep300a/ep300b of ve teleosts, D. rerio, Astyanax mexicanus (Mexican tetra), O. latipes, Esox lucius (northern pike) and Oreochromis niloticus (Nile tilapia); ep300 of one holostean sh, L. oculatus, which had not been affected by the teleost-speci c WGD event; and Ep300 of Mus musculus (house mouse) as control. In three teleosts (D. rerio, E. lucius and O. niloticus), tissue expression pro les of ep300a/ep300b did not correlate with ep300 of L. oculatus, neither individually nor on average (Fig. 4). The extraordinarily high level of ep300b of A. mexicanus and ep300a of O. latipes in testis made their tissue expression pro les signi cantly correlate with ep300 of L. oculatus. Simply removing the testis expression data will make the correlations not signi cant. On the other hand, in all ve teleosts, the expression pro les of the two copies were signi cantly correlated (for O. latipes, the testis expression data should be excluded), suggesting that their functions have not su ciently diverged yet. Compared to teleosts and M. musculus, ep300 gene of L. oculatus was highly expressed in a smaller subset of tissues. According to the PhyloFish database, the quality of sequencing data of L. oculatus was not signi cantly inferior to that of other shes (see Table S5 in Additional le 6).
We also found that tissue expression pro les of the two duplicates correlated with evolutionary rates: in four shes (D. rerio, A. mexicanus, O. latipes and E. lucius) where one copy evolved faster than the other (Additional le 7), the copy with a faster rate had higher gene expression levels in more tissues (Table 1 and Fig. 4). At Neoteleostei level, Ep300b evolved faster than Ep300a; therefore even for O. niloticus of which the duplicates evolved at similar rates, the above correlation between evolutionary rate and gene expression levels still holds true. We further analyzed tissue expression pro les of ep300a/ep300b of D. rerio and O. latipes based on the NCBI SRA Study ERP121186. The relative abundance of ep300a and ep300b were largely in accordance with the above observations (Additional le 6, Table S6).

Discussion
It has been widely acknowledged that the most common fate of duplicates originated from WGD is loss of one copy and becoming singleton again [22,30,40]. Duplicates may be successfully retained due to subfunctionalization, neofunctionalization and requirement to keep dosage balance [21,30]. It was reported that in D. rerio, the Ep300b KAT domain does not have detectable acetyltransferase activity [41]. In this study, we found that Ep300b of D. rerio even lost the conserved PHD_p300 domain (Additional le 6, Fig. S6), which could be found in EP300 of mammals and Ep300a/Ep300b of O. latipes. It has been reported that PHD domain, together with other domains that ank the KAT domain, regulates acetyltransferase activity and also promotes SUMOylation of the adjacent CRD1 cell cycle regulatory domain [9]. The loss of the PHD_p300 domain is likely to be an important cause of the inactivation of KAT domain in Ep300b of D. rerio. Meanwhile, it was suggested that the transcriptional coactivator function of Ep300b (conferred by other conserved domains) might still play subtle but important roles in development [41]. Therefore, Ep300b of D. rerio had undoubtedly experienced subfunctionalization. On the other hand, we observed that the faster evolving copy of Ep300a/Ep300b generally contained more positively selected sites and more structural innovations (short helixes) in the most intensively selected regions (Tables 1 and 2, Fig. 3), suggesting that they had also been subject to neofunctionalization. The moderate k values reported by RELAX means that selective constraints acted on the duplicates were largely the same (Table 1); therefore, divergence between the duplicates is not likely to cause signi cant subfunctionalization or signi cant neofunctionalization in short periods of time, but ne tuning of them both. Still, it seems that teleosts favor functional innovations: in ve representative species, the faster evolving copy had higher expression levels in more tissues (Table 1 and Fig. 4). From tissue expression pro les we can also conclude that Ep300 of L. oculatus is still very primitive, while its orthologs in mammals and teleosts are more nely tuned (Fig. 4). Since the divergence between teleosts and L. oculatus occurred later than the divergence between teleosts and mammals [16], the paths teleosts and mammals took to tune functions of EP300 may be distinct, which will inevitably affect the tuning results.
Constituting around half of all vertebrate species, teleosts are by far the most successful vertebrate clade [27]. Given the fact that teleosts and some other diverse taxa have all experienced WGD events before their radiation [42,43], it was thought that there is a causal correlation between WGD, evolutionary success and radiation [30]. However, the universal time delay between WGD and phases of radiation [44][45][46] suggests that WGD itself has not been the direct factor generating diversity. It is more likely that duplication and subsequent divergence of some essential genes enabled by WGD directly facilitate radiation [30]. Cells of multicellular organisms generally contain the same DNA sequence, yet they are differentiated into extremely diverse cell types that differ in both structure and function. This variation is largely due to the transcriptional activity of different sets of genes in different cell types, which is closely controlled by a number of epigenetic mechanisms, including acetylation, methylation, non-coding RNAs, etc [6,47]. Thousands of proteins have recently been identi ed as substrates of EP300, indicating that their binding a nities are generally weak [2,6]. Even slight changes in EP300 structure might have profound effects on substrate speci city. Therefore, the expanded number and diversi ed function of Ep300 in teleosts should enable more sophisticated transcriptional regulation and additional morphological diversity, and nally facilitate their radiation. The question is how tightly they are correlated. In this study, we observed that the evolutionary process of Ep300a/Ep300b had coincided with the radiation of teleosts. In early stages, there were enough ecological niches to occupy; therefore natural selection should favor evolutionary innovation of both copies to explore a wider subset of the phenotypic space. Correspondingly, we found that duplicates of many ancestral species were both subject to positive selection (Fig. 2). As the number of species increases, ecological niches are tending to be exhaustively partitioned, which would decrease the requirement for innovation. Consequently, we found that in most extant species at most one copy was under positive selection (Fig. 2). In Ostariophysi and Neoteleostei, the two most species-rich teleost clades [30,45], the directions of divergence of Ep300 duplicates were just the opposite, suggesting that their requirements for ne tuning of Ep300 controlled transcriptional regulation were also divergent. A deeper reason for this directional divergence is possibly the ecological segregation of their respective ancestors. For either Ep300a or Ep300b, natural selection mainly acted on regions anking conserved domains (Table 2); however, these regions are not just linkers that allow exible spatial arrangement of the structured domains, they can also bind transcription factors [48]. Therefore, positively selected sites detected in Ep300 duplicates can more or less change their binding a nity to substrates, alter transcriptional regulation, and nally facilitate radiation of teleosts.

Conclusions
The importance of EP300 as a key mediator of cellular homeostasis has been well established, yet the knowledge about the divergence of EP300 between teleosts and mammals is very limited, which will inevitably affect the effectiveness of using D. rerio as a model organism to study EP300 related diseases and drugs. In this study, we found that WGD derived duplicates of Ep300 were widely retained in teleosts. In representative teleosts, the two copies were both expressed in many tissues, suggesting that their functions were also widely retained. Based on analyses of positively selected branches, positively selected sites, relative evolutionary rates, protein structures and tissue expression pro les, we observed divergent evolution of Ep300 duplicates in teleosts. The directions of divergence of Ep300 duplicates in Ostariophysi and Neoteleostei were just the opposite, suggesting that tuning functions of Ep300 duplicates may promote adaptation to new ecological niches and speciation of teleosts. The divergence of EP300 between teleosts and mammals should be greater than between different teleost clades. Further studies are needed to clarify the difference of EP300 involved regulatory network between teleosts and mammals.

Obtainment of EP300 homologs
To obtain homologs of EP300 from interested species (detailed information is listed in Additional le 1), we selected D. rerio, M. musculus and Gallus gallus as representatives of bony shes (NCBI:txid7898), mammals (NCBI:txid40674) and sauropsids (NCBI:txid8457), respectively. The protein sequences (Genbank accession No.: XP_021335970.1, NP_808489.4 and XP_004937767.1) of the above three species were used as query sequences to conduct blastp search against nr database of their respective clade, with the max target sequences set to be 20000 and expect threshold to be 1e-5.
To extract sequences of interested species from the blastp results, we rst ltered non-wanted hits: if the word "p300" did not appear in the description of hit sequence or if the source organism was not interested, this hit would be ignored. Then we extracted the NCBI gene id, sequence status (validated, model, etc) and respective nucleotide sequence accession number of each hit from the le "gene2accession" (downloaded from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/). The remaining hits would be selected based on gene ids: for each gene id, if none sequence had the status "VALIDATED", the top hit would be selected; if at least one sequence had the status "VALIDATED", the top hit of them would be selected.

Phylogenetic analyses
Based on the above information (see Additional le 2), we downloaded the protein and nucleotide sequences of each selected hit/gene.
We also added CREBBP sequences (of D. rerio, M. musculus and G. gallus) into the EP300 sequences to serve as outgroup. After that, the protein sequences were subject to multiple sequence alignment by MAFFT [49], using the L-INS-i method. The alignment was trimmed to the average length of the original sequences by removing columns with excessive gaps [50].
We used RAxML 8.2.8 [51] and MrBayes 3.2.7 [52] to infer phylogenetic trees from the trimmed alignment, respectively. The RAxML tree was inferred using the GAMMA model of rate heterogeneity, automatically determined substitution model and 500 bootstraps. The Bayesian tree was inferred under mixed models, run for 1 million generations with defaulted 25% burn-in and two parallel analyses. To use Metropolis coupling to improve the MCMC sampling of the target distribution, the Nchains parameter was set to be 12. Convergence was con rmed by checking that the standard deviations of split frequencies approached zero (below 0.01) and that there was no obvious trend in the log likelihood plot. The topologies of the two trees were compared with ete-compare [53]. The alignment les and original tree les are supplied in Additional le 8.

Molecular evolutionary analyses
Before conducting evolutionary analyses, we arranged the nucleotide sequence alignment based on the trimmed protein sequence alignment described above. To estimate episodic positive selection acting on speci c branches, we performed aBSREL test [32] with HYPHY 2.5.0 [54] on all branches within a tree. To compare selective pressures of duplicates at different clade levels, we performed CmC test [34] with codeml program from the PAML 4 software package [55] and RELAX test [35] with HYPHY. In a CmC test, all internal and external branches of two interested clades in a tree le were labeled with "#1" and "#2", respectively; in a RELAX test, the labels were "Test" and "Reference", respectively. To estimate sites subject to positive selection, we rst extracted subtrees containing only interested species; respective nucleotide sequences were also extracted from alignments of full sequence set. We performed M8&M7 models test [39] with codeml program and FUBAR test [38] with HYPHY to estimate sites subject to pervasive positive selection. To estimate sites subject to episodic positive selection, we performed MEME test [37] with HYPHY, in which we consider all branches of a subtree.

Consensus sequences and conserved domains
To get a consensus sequence of the full sequence set, we extracted the most frequent residue of each column (if it is a gap, then the less frequent residue would be selected) from the trimmed protein sequence alignment. For smaller sequence sets, e.g. a set only contained mammals' sequences, the sequences would be aligned with MAFFT rst. Then the alignment would be and trimmed to the average length of the original sequences (after a rst calculation, sequences with length shorter than 60% of the average value would be excluded, the remaining sequences were used to calculate the nal average length). The conserved domains and their exact locations within consensus sequences were predicted by the NCBI online tool CDD search (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) [56].

Modeling of protein structure
We used I-TASSER suit (version 5.1) [57] to model structures of zf-TAZ domain and its anking regions of Ep300a and Ep300b of D. rerio (Genbank accession No.: XP_021335970.1 and XP_009297687.1) and O. latipes (Genbank accession No.: XP_023805552.1 and XP_011476788.1). For Ep300a and Ep300b of D. rerio, the top 500 aa were used as query sequences; for Ep300a of O. latipes, the top 550 aa were used as query sequence; for Ep300b of O. latipes, the top 520 aa were used as query sequence. The best models were visualized using PyMOL [58]. We used MAFFT to conduct multiple alignment of the four query sequences, the result of which can be seen in Additional le 9.

Gene expression data
To get gene expression data of ep300a/ep300b of D. rerio, A. mexicanus, O. latipes and E. lucius and ep300 of L. oculatus, the nucleotide sequence of each gene was used as a query to search against PhyloFish database [15]. The best hit was selected to further explore its own length and expression data in different tissues (indicated by the number of matched reads). The total number of RNA-seq reads of respective tissues and species were also collected from PhyloFish. The above data were combined together to calculate RPKMs of each gene in respective tissues and species. The RPKMs of ep300a/ep300b of O. niloticus and Ep300 of M. musculus in different tissues were obtained from the NCBI gene database with their respective gene ids as queries (see Additional le 2). Correlation of expression pro les between duplicates (and between ep300a/ep300b of teleosts and ep300 of L. oculatus) were calculated using the pearsonr function of scipy.stats module [33].

Consent for publication
Not applicable.

Availability of data and materials
All data generated or analyzed during this study are included in this published article and its supplementary information les.

Competing interests
The authors declare that they have no competing interests. Authors' contributions XW carried out the collection, processing and analyses of data and wrote the manuscript; JY participated in the design of the study. Both authors read and approve the nal manuscript.