Here, we evaluated the level of population differentiation for human genes on autosomal chromosomes among three populations (African, European and East Asian) based on the HapMap data (Phase II) [4], using the parameter FST according to methods described previously [3, 5]. A previous study reported a higher level of population differentiation in genic regions than in non-genic regions of the genome [6]. In our analysis, however, several chromosomes, including 5, 6, 8, 11, 13, and 20, did not show this pattern; that is, their genic regions did not have an excess of SNPs with a higher FST (≥0.6) (Figure S1 in Additional file 1).
Functional significance of genes with higher levels of population differentiation
Since an analysis of categories that contain only a few genes will have low statistical power, here we only summarize categories that contain at least 10 genes. Figure 1 summarizes the biological processes that are enriched with higher FST SNPs at a significance level of P ≤ 1E-10 (see Method), together with their λ values, where λ is the ratio of the proportion of higher FST SNPs (≥0.6) in the analyzed category to the proportion of higher FST SNPs in genome-wide genes (which is 0.0049). The categories listed in Figure 1 include a large number involved in organ development, such as pancreas, lung, and heart development. For example, GO: 0021983, pituitary gland development, is enriched with high FST SNPs and has the highest λ value, 19.37. The pituitary gland produces and secretes many hormones, some of which stimulate other glands to produce other types of hormones; this organ thus controls many biochemical processes, e.g. growth, homeostasis, stress response, reproduction, and metabolism [7, 8]. Categories related to these processes similarly demonstrate a high level of population differentiation, such as developmental growth (GO: 0048589), reproduction-related processes (GO: 0030317, GO: 0007286, GO: 0007276), and several metabolic pathways (GO: 0006641, GO: 0042593, GO: 0042632) (see following text and Figure 1).
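The definition of λ given above can be illustrated with a minimal sketch. The function name and the SNP counts below are hypothetical and chosen only for illustration; only the genome-wide proportion (0.0049) and the FST threshold (≥0.6) come from the text.

```python
GENOME_WIDE_PROPORTION = 0.0049  # proportion of higher FST (>=0.6) SNPs in genome-wide genes (from the text)


def enrichment_ratio(high_fst_snps, total_snps, background_proportion):
    """Return lambda: the proportion of higher-FST SNPs in a category
    divided by the proportion of higher-FST SNPs in the background."""
    if total_snps <= 0:
        raise ValueError("total_snps must be positive")
    if background_proportion <= 0:
        raise ValueError("background_proportion must be positive")
    return (high_fst_snps / total_snps) / background_proportion


# Hypothetical category: 49 of 1000 SNPs have FST >= 0.6,
# so the category proportion is 0.049 and lambda is approximately 10.
lam = enrichment_ratio(49, 1000, GENOME_WIDE_PROPORTION)
print(f"lambda = {lam:.2f}")
```

A λ value above 1 indicates that the category carries proportionally more higher FST SNPs than genome-wide genes; the statistical significance of the enrichment is assessed separately (see Method).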
An intriguing observation is that osteoblast development is significantly enriched in high FST SNPs (λ = 12.28, P = 4.92E-88 after multiple testing correction). Osteoblasts are mononucleate cells that are responsible for bone formation. Modern humans demonstrate substantial phenotypic variation, which to a large extent is reflected in the skeletal system, for example in height, body mass, body mineral density, and craniofacial differences. Indeed, evidence indicates that the human skeletal system has evolved rapidly since the advent of agriculture [9], and our recent study concluded that the high levels of population differentiation of skeletal genes among human populations were driven by positive selection [10].
Another interesting category is hair follicle development, which also showed a higher level of population differentiation (GO: 0001942, λ = 4.09, P = 2.07E-08 after multiple testing correction). Hair is produced by hair follicles. As with the skeletal system, hair morphology, including water swelling diameter and section, shape of fiber, mechanical properties, combability and hair moisture, differs distinctively among human populations [11]. Previous studies have identified some genes involved in hair follicle development, such as EDAR and EDA2R, that have undergone recent positive selection, as detected by the long range haplotype homozygosity test [12, 13]. These studies, together with our evidence of higher population differentiation in the genes involved in hair follicle development, support the hypothesis that adaptive evolution accounts for the diversification of human hair.
Consistent with previous observations [12, 14], genes involved in pigmentation, including the GO processes pigmentation during development, pigmentation, and melanocyte differentiation, demonstrated significantly higher population differentiation. Similarly, reproduction-associated processes, e.g. sperm motility, spermatid development, and gamete generation, have higher levels of population differentiation (Figure 1). Among the categories with a significant enrichment of higher FST SNPs, many are involved in the nervous system, e.g. dorsoventral neural tube patterning (GO: 0021904, λ = 15.67), hindbrain development (GO: 0030902, λ = 11.08), positive regulation of neuron differentiation (GO: 0045666, λ = 8.50), and neuron development (GO: 0048666, λ = 5.27) (Figure 1). Other categories include metabolic processes, such as the triglyceride metabolic process (GO: 0006641, λ = 6.69), glucose homeostasis (GO: 0042593, λ = 4.64), and cholesterol homeostasis (GO: 0042632, λ = 4.35), possibly reflecting variation in metabolism among humans.
Immunity-related genes, however, which are a common target of positive selection [2, 15, 16], are involved in only a small number of the categories with a higher proportion of higher FST SNPs. This observation is probably attributable to the fact that many genes of the immune system evolve under balancing selection in human populations owing to heterozygote advantage, which would reduce the level of population differentiation [17, 18].
Table S1 in Additional file 2 and Table S2 in Additional file 3 summarize the GO categories in cellular component and molecular function with an enrichment of higher FST SNPs.
In addition, to discern which population(s) contribute more to the pattern, we generated three pairwise sets of FST values: FST (CEU-YRI), FST (EA-YRI) and FST (CEU-EA). For the genes in the biological processes described in Figure 1, all three data sets demonstrate a consistent pattern of a significantly higher proportion of higher FST SNPs than in genome-wide genes (Figure 2), which suggests that the population differentiation is commonly present between each pair of populations.
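As an illustration of how three pairwise FST values can be obtained for a single biallelic SNP, the following minimal sketch uses a simple heterozygosity-based definition, FST = (HT - HS) / HT, with equal population weights. This is an assumption made for illustration only and is not necessarily the estimator used in this study (see [3, 5]); the function names and allele frequencies are hypothetical.

```python
from itertools import combinations


def pairwise_fst(p1, p2):
    """Simple FST for one biallelic SNP between two populations,
    given the frequency of the same allele in each (equal weights).
    Illustrative only; not necessarily the estimator of refs [3, 5]."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                        # expected heterozygosity, pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population heterozygosity
    if h_total == 0:
        return 0.0  # SNP is monomorphic in both populations
    return (h_total - h_within) / h_total


def all_pairwise_fst(freqs):
    """Return FST for every pair of populations in a {label: frequency} dict."""
    return {
        (a, b): pairwise_fst(freqs[a], freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }


# Hypothetical allele frequencies for one SNP
snp = {"CEU": 0.9, "YRI": 0.1, "EA": 0.85}
for pair, fst in all_pairwise_fst(snp).items():
    print(pair, round(fst, 3))
```

With these hypothetical frequencies, the CEU-YRI and EA-YRI comparisons would exceed the 0.6 threshold while CEU-EA would not, which is the kind of pair-specific contribution the three data sets are designed to reveal.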
Population differentiation under neutral evolution is mostly influenced by demographic history (that is, genetic drift and gene flow), which can generate patterns similar to those produced by biological factors such as natural selection. However, demographic history tends to influence all loci in the genome equally, whereas natural selection acts only on a single gene or a group of functionally related genes. Relative to the proportion of higher FST SNPs in genome-wide genes, we identified several groups of functionally related genes that are enriched with high FST SNPs. This enrichment is most likely driven by positive natural selection, although the confounding effect of demographic history cannot be excluded entirely.
Population differentiation in disease-related genes
Studies of the pattern of molecular evolution of human disease-related genes will provide insight into the origin, maintenance and mechanism of disease [19]. Previous reports suggested that disease-related genes tend to evolve under purifying selection, based on the comparison of non-synonymous to synonymous substitution rates [19–21]. Here, as expected, we found that disease-related genes (including Mendelian disease genes and complex disease genes) demonstrate a significant excess of SNPs with lower FST (≤0.05) relative to other genes (χ2 = 23.16, P = 1.49E-06 for the OMIM gene panel; χ2 = 193.78, P = 4.76E-44 for the complex-disease gene panel; Figure S2 in Additional file 1). These disease genes demonstrate an excess of lower FST SNPs in the lower frequency bins but not in the high frequency bins (Figure 3), suggesting that negative selection, rather than balancing selection, operated on these genes.
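A comparison of this kind can be sketched as a χ2 test on a 2×2 table of SNP counts (lower FST versus other SNPs, in the gene panel versus in other genes). The sketch below is illustrative only: it assumes a Pearson χ2 test with one degree of freedom and no continuity correction, which may differ from the exact procedure used here, and the function names and counts are hypothetical.

```python
import math


def p_value_df1(chi2):
    """Upper-tail P-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
        [[a, b],   a: lower-FST SNPs in panel,  b: other SNPs in panel
         [c, d]]   c: lower-FST SNPs elsewhere, d: other SNPs elsewhere
    Returns (chi2, p_value)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column must contain at least one SNP")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return chi2, p_value_df1(chi2)


# Hypothetical counts, for illustration only
chi2, p = chi_square_2x2(20, 10, 10, 20)
print(f"chi2 = {chi2:.2f}, P = {p:.4f}")
```

Under the one-degree-of-freedom assumption, `p_value_df1(23.16)` gives a value of roughly 1.5E-06, in line with the P-value reported above for the OMIM gene panel.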
Surprisingly, higher FST (≥0.6) SNPs are significantly enriched in Mendelian disease genes (OMIM) relative to other genes (χ2 = 30.47, P = 3.39E-08), with three MAF bins demonstrating statistical significance (Figure 4). These higher FST SNPs are probably under positive selection. This pattern, however, was not observed in complex disease genes and appears inconsistent with the previous study by Blekhman et al. (2008) [20]. Blekhman et al. (2008) found that Mendelian disease genes appear to be under widespread purifying selection, but that genes that influence complex disease risk show lower levels of evolutionary conservation, as assessed by the ratio of nonsynonymous to synonymous substitutions (Dn/Ds), possibly because they were targeted by both purifying and positive selection. The difference in results is probably attributable to the different methods used to assess sequence evolution: the Dn/Ds method measures changes over a long time scale (i.e., between human and other species), while FST measures recent evolution (i.e., since the separation of modern human populations). The incidence of, and susceptibility to, some Mendelian diseases might therefore differ more strongly among modern human populations.
Lower levels of population differentiation in microRNA targeted genes
The regulation of gene expression is crucial to the development of an organism, and it has been increasingly recognized that a remarkable fraction of this regulation is carried out by microRNAs (miRNAs) [22, 23]. miRNAs are a group of ~23 nt endogenous RNAs that are important for a diverse range of biological functions and that direct the posttranscriptional repression of mRNAs by cleavage or translational repression [22, 23]. Evidence has shown that negative selection operates on miRNA-regulated genes [24]. Here, we observed that miRNA-targeted genes present a significant excess of lower FST (≤0.05) SNPs (χ2 = 29.76, P = 4.90E-08) and significantly fewer high FST (≥0.6) SNPs (χ2 = 37.61, P = 8.63E-10) relative to other genes (Figure S3 in Additional file 1). The lower FST SNPs are mainly restricted to the lower minor allele frequency bins and are not found in excess in the intermediate frequency bin (Figure 5), suggesting that widespread purifying selection operated on miRNA-targeted genes.
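The frequency-bin comparison described above can be sketched as follows: SNPs are grouped by minor allele frequency (MAF), and the fraction of lower FST (≤0.05) SNPs is computed per bin. The bin edges and the function name are hypothetical, since the bins actually used are those shown in the figures; only the FST threshold comes from the text.

```python
from bisect import bisect_left


def low_fst_fraction_by_maf_bin(snps, bin_edges=(0.1, 0.2, 0.3, 0.4, 0.5),
                                low_threshold=0.05):
    """snps: iterable of (maf, fst) pairs with 0 < maf <= 0.5.
    bin_edges: upper edges of the MAF bins (hypothetical choice);
    bin i covers (bin_edges[i-1], bin_edges[i]], the first bin starts at 0.
    Returns the fraction of SNPs with fst <= low_threshold in each bin,
    or None for a bin that contains no SNPs."""
    counts = [[0, 0] for _ in bin_edges]  # [lower-FST count, total count] per bin
    for maf, fst in snps:
        if not 0 < maf <= bin_edges[-1]:
            raise ValueError(f"MAF out of range: {maf}")
        i = bisect_left(bin_edges, maf)
        counts[i][1] += 1
        if fst <= low_threshold:
            counts[i][0] += 1
    return [low / total if total else None for low, total in counts]


# Hypothetical (MAF, FST) pairs for a gene set
example = [(0.05, 0.01), (0.08, 0.30), (0.45, 0.02), (0.50, 0.70)]
print(low_fst_fraction_by_maf_bin(example))
```

Comparing these per-bin fractions between a gene set and the remaining genes shows whether an excess of lower FST SNPs is confined to the low-frequency bins, as expected under purifying selection, or extends to intermediate frequencies.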