Region-specific expression of young small-scale duplications in the human central nervous system

Background The duplication of genes is one of the main genetic mechanisms that led to the gain in complexity of biological tissue. Although the implication of duplicated gene expression in brain evolution was extensively studied through comparisons between organs, their role in the regional specialization of the adult human central nervous system has not yet been well described. Results Our work explored intra-organ expression properties of paralogs through multiple territories of the human central nervous system (CNS) using transcriptome data generated by the Genotype-Tissue Expression (GTEx) consortium. Interestingly, we found that paralogs were associated with region-specific expression in CNS, suggesting their involvement in the differentiation of these territories. Beside the influence of gene expression level on region-specificity, we observed the contribution of both duplication age and duplication type to the CNS region-specificity of paralogs. Indeed, we found that small scale duplicated genes (SSDs) and in particular ySSDs (SSDs younger than the 2 rounds of whole genome duplications) were more CNS region-specific than other paralogs. Next, by studying the two paralogs of ySSD pairs, we observed that when they were region-specific, they tend to be specific to the same region more often than for other paralogs, showing the high co-expression of ySSD pairs. The extension of this analysis to families of paralogs showed that the families with co-expressed gene members (i.e. homogeneous families) were enriched in ySSDs. Furthermore, these homogeneous families tended to be region-specific families, where the majority of their gene members were specifically expressed in the same region. Conclusions Overall, our study suggests the involvement of ySSDs in the differentiation of human central nervous system territories. Therefore, we show the relevance of exploring region-specific expression of paralogs at the intra-organ level. Supplementary Information The online version contains supplementary material available at 10.1186/s12862-021-01794-w.

maximal expression over all these regions was lower on average for a paralog than for a singleton, mostly due to SSDs (Figure 2), which could represent a confounding effect for the association that we found between region-specificity and paralogy.
In order to confirm the association of paralogy with CNS region-specific expression, in addition to the potential effect of expression level, we performed a multivariate linear regression using R version 3.5 (package stats). The model was fitted on all protein coding genes to predict Tau scores with explanatory variables corresponding to the maximal expression of each gene across the 7 CNS regions and its duplication status (singleton or paralog) (Additional File 2: Table S16).
We observed that the maximal expression had a significantly negative effect on the Tau score, while the duplication status had a smaller but independent and significantly positive effect. The association between a high region-specificity and a low expression could be explained by two reasons: i) the calculation of the region-specificity Tau score can be artificially inflated for genes with low expression values in all regions, since expression ratios can be highly variable in the case of low discrete expression values (ie RPKM derived from low read counts), and ii) a low expression level can be related to region-specificity for some biological reasons. Thus, the same analysis was done on a more restricted set of genes by removing genes whose maximal expression is lower than 1 RPKM (instead of 0.1), in order to reduce the potential bias of low expressions. With the threshold of 1 RPKM, the effect of maximal expression was much less significant but the duplication effect was still significant and positive. These results indicate that 1 / 14 the specific expression of paralogous genes for a particular CNS region is influenced by both gene abundance value and duplication status.

Confirmation of the effect of the age and the type of duplication on the region-specificity in the CNS
Since SSDs (and in particular ySSDs) have a lower maximal expression than WGDs (Figure 2A), it is also difficult to distinguish between the confounding effects of expression and duplication type orage on Tau.
We then estimated the respective effects of the expression level (the maximal expression over CNS regions), the phyletic age of the duplication and the type of duplication (SSD or WGD) on Tau scores using multivariate linear regression models on paralogs only (Additional File 2: Table   S16B). When using all paralogs with a maximal expression over CNS regions > 0.1 RPKM, we observed that the maximal expression had the stronger significant negative effect, while the duplication age had a weaker but still significant negative effect on region-specificity (the same results were obtained with the three age categories coded quantitatively: "0" for paralogs younger than WGD, "1" for WGD-old paralogs and "2" for paralogs older than WGD; Additional File 2: Table S16C). The duplication type effect did not appear since it is hidden by the duplication age effect, probably because WGDs are roughly all of the same age and constitute the very large majority of the paralogs of this age. We confirmed this hypothesis by applying the multivariate linear model on the subset of genes with the same age (WGD and wSSD) (Additional File 2: Table S16D). We found with this last model that indeed the SSD duplication type significantly contributed to a high region-specificity. Therefore, these results seem to confirm that ySSD genes tend to be more region-specific than other genes, due to both their SSD origin and their young duplication age, in addition to the effect of their low expression level.
As mentioned previously, the effect of expression on Tau score may be a true biological effect or a bias. We thus applied the same linear models after filtering out genes whose maximal expression over CNS regions was less than 1 RPKM, in order to reduce both the confounding effects and the risk of potential bias. We found that the effect of the duplication age was even 2 / 14 more significant than before this filter and that the effect of the duplication type became significant, while the effect of the maximal expression was really lower and less significant than before.

Result S2. Association between region-specific expression in the same region and paralog pairs
We analyzed the 130 paralog pairs for which both genes showed some region-specificity and with a single SSD or WGD annotation. We observed that, among these pairs, SSD pairs were slightly enriched in pairs where both paralogs were region-specific to the same region (50% of SSD pairs versus 31% of WGD pairs, p-value= 0.04483). The proportion of ySSD pairs that were region-specific to the same region was even higher (59%) but the enrichment was not significant, probably due to the very low number of ySSD pairs (22 pairs). This result suggests that within an SSD pair, especially a ySSD pair, with region-specific expression the two paralogs tend to be co-expressed.

Result S3. Optimization of WGCNA parameters
The WGCNA algorithm comprises five steps: 1) computation of a correlation measure for each gene pair to estimate the similarity of their expression profiles 2) computation of an adjacency matrix by applying a soft-threshold to the correlation measure (power applied to the correlation) 3) computation of the Topological Overlap Matrix (which reflects pairwise gene similarity in terms of connectivity with the other genes in the adjacency matrix), 4) hierarchical clustering of the Topological Overlap Matrix and 5) identification of modules of co-expressed genes from the hierarchical clustering.
Soft-threshold. The WGCNA tool generates the hypothesis that the majority of genes are weakly

/ 14
connected by co-expression and only a small number of genes are highly connected. To respect this topology, known as scale-free, we had to find the appropriate soft threshold ( ) implicated β in the adjacency matrix computation (Equation 1). WGCNA tests different values and selects β the one that most respects the scale-free topology. In this study, the soft threshold used was 6 ( Figure S3A). For the adjacency matrix, we chose to use the equation that captures both the correlated and anti-correlated gene profiles, without distinction. Thus, for each gene pair (i,j) the following was defined: Parameter selection. One of our goals was to compare gene families to co-expression modules.
Given that 47% of gene families have a size equal to 2, we optimized WGCNA parameters to obtain small highly co-expressed modules. Out of the collection of WGCNA parameters, we only evaluated those that impacted module size: the Cut tree, the deepSplit and the merge of clusters.
The Cut tree parameter corresponds to the height at which the hierarchical clustering tree is cut Finally to obtain small module sizes, we did not merge modules based on hierarchical clustering of their first principal components (called eigengenes).

Enrichments of co-expression modules:
The biological relevance of our gene modules was determined using over-representation tests for gene ontology (GO) terms derived from molecular functions and biological processes. From the list of modules larger than 20 genes (81 modules), 67% and 86% respectively were found to be significantly enriched for at least one molecular function and one biological process (p-value threshold= 6.17E-04, after Bonferroni correction for the number of tested terms in each of these two categories, and the number of modules) (Additional File 2: Table S8).

Result S5. Genomic distances between pairs of paralogs
We then studied further the relationship between the type and categorical age of the duplication by analyzing the proximity between pairs of duplicates on the genome. We assessed the association between duplicate proximity and each group of paralog pairs (SSD, ySSD, wSSD, oSSD and WGD), considering only the 2,918 paralog pairs with a single SSD or WGD annotation (Additional File 2: In the families associated with heterogeneous expression of their member genes across CNS regions, we found enrichments in many signaling pathways, such as MET, NTRKs and MAPKs. We also identified the family of hyperpolarization-activated cation genes (HCN), which was previously known to be preferentially expressed in the brain (Santoro and Tibbs  The evolutionary constraints associated with the co-expression of genes within a family can highlight genes that work together to carry out their functions. On the opposite, families whose gene members are scattered over different co-expression modules could consist of genes which work independently and perhaps diverged in function through evolution. We found that these two categories of families differed in their level of enrichment in different biological processes and molecular functions. In homogeneous families, we showed that the majority of genes from the HOX and AP-2 families of transcription factors and from the NOTCH signaling pathway were strongly co-expressed.

Result S7. Association between co-expression and shared region-specificity
To first check the relationship between co-expression and shared region-specificity, we analyzed the distribution of region-specific genes across the 932 modules of co-expressed paralogs and found that 177 modules included at least two region-specific genes. We then looked at whether within each of these modules the region-specific genes were expressed in the same or in different regions. We found that among these 177 modules, 66% consisted of region-specific genes associated with the same region (Additional File 2: Table S15). Therefore, gene modules identified from correlation-based co-expression networks also capture shared region-specificity.

Result S8. Classification of human CNS regions from gene expression
We