Dataset preparation
We gathered fecal (n = 247) and hair samples (n = 223) from wild-living, non-habituated chimpanzee populations from 36 remote regions spanning eastern Nigeria through southern Cameroon (Figure 2 and Additional file 1). Most samples originating from the regions of Belgique (BQ), Boumba Bek (BB), Diang (DG), Dja Biosphere (DB), Duomo Pierre (DP), Ekom (EK), Lobeke (LB), Mambele (MB) and Minta (MT) were collected between 2004 and 2006 by the team of BHH and MP. Hair and tissue samples from Nigeria and Cameroon were collected during a series of studies from 1994 – 2010 by the team of MKG. We recruited a team of ~9 field assistants for each mission and base camps were established in the vicinity of forests were chimpanzee presence was reported by local hunters and villagers. We initially walked transects, hunters’ paths or elephants paths in search of chimpanzees presence (e.g. night nests, foot prints and vocalizations). The Campo Ma’an (CP), Cross River (CR), Deng Deng (DD), Diang (DG), Dja Biosphere (DB), Douala-Edea (DE), Gashaka Gumti (GG), Mbam et Djerem (MD) and Mount Cameroon (MC) sites are located in National Parks or forests reserves, whereas the remaining field sites are located in unprotected forests. Fecal samples were identified to be of likely chimpanzee origin by experienced trackers and/or by the researchers. Fecal samples (15–20 g) were placed into 30 or 50 mL tubes and mixed with equal amounts of RNAlater® (Ambion, Austin, TX). Hair samples were collected directly from abandoned sleeping nests and a minimum of three hairs per nest were stored into glassine envelopes, and kept dry in silica gel. Fecal samples were inspected to estimate their likely time of deposition, whereas night nests age was estimated according to a leaf decay index [58]. Time, date, location, longitude, latitude and name of collector were also recorded. Fecal and hair samples were generally kept at ambient temperature for no longer than 2 weeks and subsequently stored at −20°C once back in Yaoundé, Cameroon. Samples were shipped to the United States at ambient temperature, then stored at −20°C upon receipt. All samples were transported from Cameroon to the United States in full compliance with Convention of International Trade in Endangered Species of Wild Fauna and Flora (CITES) and Center for Disease Control (CDC) export and import regulations. This research was carried out with IACUC approval from the University at Albany – State University of New York.
We extracted fecal DNA using the QIAamp Stool DNA Mini kits (Qiagen, Valencia, CA) following well-established protocols [59]. We briefly resuspended 1.5 mL of fecal RNAlater® mixture in stool lysis buffer and clarified them by centrifugation. We treated the supernatants with an InhibitEx tablet, subjected them to proteinase K digestion, and passed them through a DNA binding column. Bound DNA was eluted in 100 μL elution buffer. We extracted DNA from hair samples using a chelating resin protocol [60] followed by filtration using Microcon 100 columns (Millipore, Billerica, MA) to concentrate DNA extracts. We conducted the preparation of the DNA quantification standards, reagents and reactions according to the Quantifiler® DNA Manufacturer’s protocol [61]. We analyzed the data using a 7500 System v1.2.3 software as an ‘Absolute Quantification (standard curve)’ assay with the settings recommended in the Quantifiler® Kit User’s Manual [61].
We used primers L15997 (5′-CACCATTAGCACCCAAAGCT-3') and H16498 (5′-CCTGAAGTAGGAACCAGATG-3′) to amplify a 460 – 500-bp mtDNA fragment spanning the hypervariable D-loop region (HVRI) in all samples that were newly collected for this study, using methods described in previous studies [10,13,59]. We assembled and aligned the resulting sequences with SEQMAN DNASTAR software (Lasergene, DNASTAR, Inc., Madison, WI), along with georeferenced sequences from previous studies [9,10,59,62]. We deposited all newly generated sequences from this study in GenBank under accession numbers KM401682-KM401815.
We used twenty-one autosomal microsatellite loci to produce genotype profiles from both hair and fecal samples. Additional file 14 lists information about the markers including the primers flanking the selected regions and the fluorescent dye set (Applied Biosystems, Foster City, CA) chosen. These markers were originally developed for use in humans [63], but have been shown to be highly informative for use in chimpanzee population genetics studies [39,64]. We also included the Amelogenin locus to determine the sex of the individual sampled. We divided these 22 loci into 4 multiplex PCR reactions that were performed using the Quiagen Multiplex PCR Kit (Qiagen, Valencia, CA) in Eppendorf Mastercyclers (Eppendorf, Westbury, NY). We used 0.5-1 ng DNA, along with Q-Solution (included in the kit). We adopted the following PCR conditions: an initialization at 95°C for 15 min, followed by 40 cycles including a denaturation step at 95°C for 30 sec, an annealing temperature of 60°C for 1 min, an elongation at 72°C for 1 min and 30 sec, and a final extension at 72°C for 10 min. For samples that failed to produce reliable genotype profiles, we increased the amount of DNA template to up to 8 μL per reaction, regardless of the DNA quantitation results. Many of the hair samples from the Nigerian locations had been typed previously for some of the loci selected, and were adjusted to the differences in base pair sizes due to apparatus and protocol discrepancies [13,65]. All PCR reactions included negative control samples for quality assurance. We analyzed each multiplex PCR product on an ABI 3130 capillary array genetic analyzer (Applied Biosystems, Foster City, CA). We determined fragment sizes against a Genescan 600 Liz size standard (Applied Biosystems, Foster City, CA), and allele sizes using the Genemapper ID version 2.7 software (Applied Biosystems, Foster City, CA). We scored alleles between three and 6 times to avoid problems associated with allelic dropout, which frequently occurs when genotyping low-yield DNA samples [66]. Samples that did not include at least 15 (70%) or more loci after multiple attempts at PCR fragment amplification were excluded from this study.
We subjected all 21 autosomal microsatellite loci to outlier tests using LOSITAN [67]. LOSITAN is a Java based selection detection platform, based on the fdist F
ST
outlier methods [68]. We ran 1,000,000 simulations of the data while (i) assuming a stepwise mutation model, (ii) assuming three populations, (iii) forcing a correct mean F
ST
, and (iv) calculating a “neutral” mean F
ST
. We also subjected the 21 autosomal microsatellites, in addition to the mtDNA HVRI sequences, to an exact test of Hardy-Weinberg equilibrium [69] using the Arlequin version 3.5 software package [37]. We tested deviations from Hardy-Weinberg equilibrium for each locus against 1,000,000 random permutations of the data.
We calculated pairwise estimates of relatedness for all samples that produced reliable microsatellite genotypes using the software program COANCESTRY [23]. COANCESTRY implements three inbreeding estimators and seven relatedness estimators [24,70-76] to estimate relatedness between individuals and inbreeding coefficients from multi-locus genotype data. We tested 95% confidence intervals for relatedness and inbreeding estimates for all genotyped individuals against 1000 bootstrap permutations of the data. Samples estimated to come from the same individual or close relatives, with a relatedness index [24] above 0.75, were excluded in order to remove all pairs of individuals down to first cousins from all further data analyses.
mtDNA diversity analysis
We generated haplotype networks for HVRI mtDNA sequences via the median-joining algorithm of Network 4.5 (http://www.fluxus-engineering.com). Because it allows for reticulation, the median-joining approach for inferring haplotype relationships is appropriate for the analyses of mtDNA control region sequences, which exhibits high levels of homoplasy in humans [77,78]. We identified hypermutable sites by post-processing using the Steiner maximum parsimony algorithm within Network 4.5 and were excluded from the network analyses.
We performed an AMOVA using Arlequin version 3.5 software package [37]. We tested population differentiation values against 10,000 random permutations of the data. For the AMOVA, we grouped individuals according to their origin either north or south of the Sanaga River. In addition to the AMOVA, we ran a spatial AMOVA (SAMOVA) at k = 2, to confirm the groupings used in the AMOVA, as well as to test how geospatial partitioning of the sampled populations affected population differentiation [28].
We generated mismatch distributions by computing distribution of the number of pairwise differences between mtDNA haplotypes using Arlequin version 3.5 software package [37]. We generated distributions by grouping individuals into either two or three separate groups (as identified by microsatellite cluster analysis), using 100 bootstrap replicates.
Microsatellite genotype analysis
We examined the relationship between genetic differentiation and geographic distance by carrying out partial Mantel tests on the microsatellite data using the Arlequin version 3.5 software package [37]. Pairwise F
ST
values were generated between all sampling locations and genetic differentiation was plotted against straight-line, geographic distance. We fit the data with a linear regression, calculated a correlation coefficient and tested for statistical significance.
We used the EIGENSOFT software package [79] to perform PCA on individual genotypes to identify individuals that could be grouped into significantly different populations. We converted microsatellite data into a false SNP format by scoring the presence or absence of each of n-1 alleles (where n is the number of alleles in the sample) using a script in MATLAB (The MathWorks, Natick, MA) described previously [39]. We processed this file in SmartPCA, which produced eigenvectors and eigenvalues. We conducted this analysis blindly to a priori population labels for individuals in the dataset. We tested the statistical significance of each eigenvector using Tracy-Widom statistics. Each significant eigenvector recovered by this PCA approach separates the samples in such a way that the first and subsequent eigenvectors distinguish, in order, the most to least differentiated populations in the sample [79]. All analyses using EIGENSOFT were performed blinded to a priori population labels.
We examined population structure and individual ancestry using a Bayesian clustering approach implemented in the TESS (version 2.3) software package [30]. This program estimates the shared population history of individuals based on their genotypes and geographic origin under a model that assumes Hardy–Weinberg equilibrium and linkage equilibrium, thereby making no a priori assumptions regarding population classifications. TESS estimates individual proportions of ancestry into K clusters, where K is specified for the program in advance across independent runs and corresponds to the number of putative ancestral populations. The program then assigns admixture estimates for each individual (Q) from each inferred ancestral population cluster. It also produces posterior predictive values of Q for geographic areas located in between and around the individual data points.
TESS runs were performed: (i) with a model that allows individuals to have ancestry in multiple populations (CAR model [80]); (ii) with correlated allele frequencies; and (iii) blinded to a priori population labels. Runs were performed with a burn-in step of 500,000 Markov Chain Monte Carlo (MCMC) iterations and 1,000,000 MCMC iterations. Fifty runs each for K = 2 through K = 8 were performed for all datasets. We processed TESS outputs with CLUMPP [81] and a G-statistic >99% was used to assign groups of runs to a common clustering pattern. CLUMPP outputs for each K value were plotted with DISTRUCT [82]. We used a combination of methods to infer a maximum number of chimpanzee populations (K
MAX
) including, (i) the K value that had the highest average DIC [31], (ii) high stability of clustering patterns between runs, (iii) the K
MAX
value at which K
MAX
+ 1 no longer split the cluster distinguished by K
MAX
[83], (iv) correspondence between maximum PPD values from TESS runs and significant eigenvectors recovered by PCA, and (v) calculating an ad hoc statistic, ΔK [32].
We created spatial interpolations of the Q matrix using the option in TESS to generate posterior predictive maps of admixture proportions. We created a template map of the extent of probable chimpanzee ranges in Nigeria and Cameroon in ArcMap Version 10 (ESRI Corp., Redlands, CA) in the ASCII-raster format and input into TESS in order to generate posterior predictive maps of admixture proportions. Using these predictive maps, we generated spatial interpolations of the Q matrices by implementing a matrix based vector algorithm in the program TESS Ad-Mixer [33]. These spatial interpolations use color mixing to predict admixture proportions across space in order to better visualize the spatial partitioning of genetic differentiation.
We calculated allele size range, observed and expected heterozygosity, and an M-Ratio [84] for each locus for three populations as identified by cluster analysis. We tested these values against 10,000 random permutations of the data using the Arlequin version 3.5 software package [37].
We calculated three measures of population genetic differentiation using the Arlequin version 3.5 software package [37]: D
2, R
ST
and δμ
2. The D
2 [34] genetic distance is based on a model in which genetic drift is the only force influencing allele frequency differences across populations and is sensitive to recent differentiation events. R
ST
[35] and δμ
2 [36] are similar to D
2, but both assume a stepwise mutation model (SMM). Consequently, R
ST
and δμ
2 are more likely to capture whether differences in the mutation processes are important in driving population differentiation and enable assessment of sensitivity to mutation model assumptions [83]. These latter models differ in that R
ST
is based on the fraction of the total variance in allele size between populations and is analogous to F
ST
[35], whereas δμ
2 is based on differences in the means of microsatellite allele sizes [36]. Recent work has shown convincing evidence that the loci typed for this study appear to follow the SMM in both chimpanzees and bonobos [64,85]. D
2 calculations were completed on untransformed allele size calls. Since R
ST
and δμ
2 assume the SMM, we transformed allele sizes to repeat size units prior to analysis in Arlequin version 3.5 [37]. We transformed allele sizes such that the smallest allele for each locus was scored as n and each subsequent allele was scored as n + 1. In infrequent cases where repeat unit sizes did not follow the n + 1 model, and instead repeat units skipped x repeat(s), we scored the next allele in the data as (n + × +1), as described in a previous study [39]. We determined each pairwise genetic distance calculation by 100,000 replications in Arlequin, and the significance of these pairwise population genetic distances were evaluated by a significance test at p < 0.05.
We calculated distributions of alleles within and between populations using ADZE [86]. ADZE is a generalized rarefaction approach that counts alleles private to populations as well as combinations of populations. We used this program to infer (i) the mean number of distinct alleles per locus, per population; (ii) the mean number of private alleles per locus, per population; and (iii) the mean number of uniquely shared private alleles per locus, between populations.
Population history
We used the program IMa [38] to estimate: (i) the population mutation parameter, θ, for both descendant populations (θ1 and θ2) and an ancestral population (θA); (ii) rates of migration from population 1 to population 2(m
1) and from population 2 to population 1 (m
2); (iii) and time of population divergence (t) for all possible pairs of the three assumed populations: P. t. ellioti in western Cameroon & Nigeria, P. t. ellioti in central Cameroon, and P. t. troglodytes in southern Cameroon. We used IMa instead of the more recently released IMa2 [40] for several reasons. The dataset is composed mostly of microsatellites with poorly characterized mutation rates [39,64,87]. Also, due to the complexity of the IMa2 model, the high number of microsatellite alleles, and few number of microsatellite loci, we used the version of the IM model with the fewest assumptions. Additionaly, IMa2 requires the user to input a well resolved, rooted phylogeny of the populations to be tested, and based on the results of the cluster analysis, it was counterproductive to swap through an already well resolved, and simple phylogeny.
We ran multiple iterations of IMa in parallel, for each paired population, using a multi-processor computer cluster located at the SUNY College of Nanoscale Science and Engineering. First, we ran the IMa analysis using “M-Mode” (MCMC Mode) with a full complement of model parameters, and a broad range of priors for all parameters (θ1, θ2, θA, m
1
m
2, and t). Each run was performed with heated chains using the two-step scheme [88]. For each run, we ran burn-in replicates for a total of five days. We then reduced the ranges of the model parameters over repeated runs in order to sample more densely their respective posterior distributions for each of the three population comparisons. This resulted in different lengths of analysis for each comparison: the western Cameroon/central Cameroon comparison generated 1,004,512 genealogies using 58 chains; the western Cameroon/southern Cameroon comparison generated 684,192 genealogies using 40 chains; and the central Cameroon/southern Cameroon comparison generated 947,723 genealogies using 38 chains. After the runs converged toward the same values, we loaded the saved genealogies from multiple runs into “L-Mode” (Load Trees Mode) in order to calculate values for model parameters for each pair of populations pooled across these multiple independent runs.
We converted the migration parameters, m
1 and m
2, into the number of migrants exchanged per generation (Nm) using the equation Nm = (θ*m)/2. We converted the demographic parameter (t) into years by scaling by mutation rate (t/μ). We scaled the t parameter for mtDNA assuming a mutation rate (μ) of 1.64 × 10−7 [89]. We scaled the t parameter for microsatellite loci using several microsatellite mutation rates, given the uncertainty in microsatellite mutation rates in the literature [36,39,64,87]. We used a slow mutation rate of 3.53 × 10−5 [64], an intermediate mutation rate of 7.75 × 10−5 (calculated from the geometric mean of rates from Wegmann and Excoffier [87]), and a fast mutation rate of 1.6 × 10−4 [36,39]. We scaled final values for demographic parameters assuming a 20-year generation time for chimpanzees [90]. We scaled effective population sizes (N
E
) using θ and a per generation μ using the equation N
E
= θ/(4*μ*20).