The PSAgene family
A BLAST search analysis was performed against the complete genome sequences of the three species L. major, L. infantum and L. braziliensis http://www.genedb.org/, using as a query sequence some of the PSA protein sequences reported in the literature. Some genes were discarded from the candidate PSA genes obtained from this BLAST analysis: LmjF12.1005, LinJ12_v4.0662, LbrM_V2.1660 and proteophosphoglycan (PPG) genes (4 in each species). LmjF12.1005 of L. major and LinJ12_v4.0662 of L. infantum, which code for N-terminal truncated forms of PSA (LmjF12.1005 protein actually lacks a LRR domain) are probably pseudogenes. LbrM_V2.1660 is almost identical to gene LbrM_V2.1670, besides harbouring a N-terminal truncation, but this is questionable since the genomic sequence immediately upstream LbrM_V2.1660 is lacking, which renders the determination of the N-terminus uncertain. PPGs indeed possess PSA-related LRR motifs, but their overall domain architecture differs from that of PSA proteins. More importantly, PPGs are most probably functionally different from PSA, owing to their very long (often more than 1000 residues) serine-rich stretches on which are anchored very large amounts of phosphoglycan chains [5]. At the end, 32, 14 and 8 PSA genes were found in L. major, L. infantum and L. braziliensis, respectively. Analysis of the primary structure of the encoded PSA proteins reveals that they all possess the same architecture (Fig. 1). The central part of the protein is composed of LRRs in tandem. These repeats, which show high variability, both in terms of number of repeats and primary sequence, are flanked by relatively well conserved domains, including Cys-rich domains, which are expected for extracellular LRR-containing proteins [20]. On the N-terminal side of the LRR domain, one finds a 29 to 31 residue-long domain reminiscent of the LRR motif, but which is clearly not a LRR. This is preceded by a Cx(6)C domain and finally by a signal peptide. At the other extremity, the last LRR is followed by a CGCx(23–26)CxxxxxC containing domain. This marks the end of the sequence for some PSA proteins. Others have extensions that contain either a Thr/Ser-rich region of variable length or a transmembrane domain preceded by a Cys-rich region, or both. This configuration of a N-terminal signal sequence and a C-terminal transmembrane domain predicts that all PSA proteins are either secreted or anchored to the plasma membrane by their C-terminal end. Furthermore, all PSA proteins containing a C-terminal transmembrane domain (and a signal peptide) are predicted to be GPI-anchored (see Fig. 2).
The main signature of PSA proteins is their specific 24–25 amino-acids LRR motif, of consensus sequence G(T/S)LPxxWxx (M/I)xxLxxLxLxxxx(x)(V/L/I)(S/T) (Fig. 1). The most striking feature of this motif relative to other LRRs is the presence of quasi invariant glycine, proline and tryptophane residues at positions 1, 4 and 7, respectively. This motif is not present in the predicted proteins from the fully sequenced genomes of two other trypanosomatidae, T. brucei and T. cruzi, and no clear homologs of PSA could be identified in these organisms. The only PSA domain for which homology was found in the other trypanosomatidae is the Thr/Ser-rich domain. The T(n)KP2 and T(n)EAPT repeats found in most of these domains are present in T. cruzi mucins. However, while it makes up the bulk of the mucin protein architecture, this domain is not always present in PSA proteins (see Fig. 2), and mucins do not contain any LRR repeat. Therefore, PSA proteins and mucins are likely to play different roles. Actually, the LRRs mostly similar to those of Leishmania PSA proteins belong to the plant NBS-LRR resistance proteins.
PSA genes are dispersed on chromosomes (Chr) 5, 9, 12, 21 and 31 for all three Leishmania species. Chromosomes 9 and 21 contain a single gene, while Chr 5 contains 2 genes, and Chr 31, 3 genes (4 in L. major). PSA genes localised on different chromosomes (and also those of the different Chr31 loci) are sufficiently divergent to define PSA gene classes (see below). Inside each chromosome, all PSA genes are clustered in tandem, except those of Chr 5 and gene LbrM31_V2.3700, which are well separated. The PSA subfamily present on Chr 12 is the most complex (except in L. braziliensis where there is only one gene). In L. major, the 24 Chr 12 PSA genes belong to a single cluster of tandem repeats, with many of these genes separated by a single non-related intervening gene. This pattern is also observed for the 7 Chr12 PSA genes of L. infantum. All PSA genes arranged in clusters have the same orientation relative to transcription. The only PSA genes studied so far in vivo belong to the Chr12 array. Although the PSA genes we have identified on the other chromosomes show significant sequence divergence with the Chr12 genes (see below), the conservation of the precise domain organisation between all these genes justifies their being included in the same unique PSA family.
Phylogeny of the PSAgene family
An alignment of all PSA protein sequences from all three Leishmania species was generated. For phylogenic analysis, only the domains present in all PSA proteins were included in the alignment. The Thr/Ser and Cys-rich/transmembrane C-terminal domains were thus discarded, as well as the central LRR domain (except for the last LRR motif of each sequence, which was retained). LRR units probably evolve not only by nucleotide substitution but also by birth and loss of repeats following recombination events. Therefore, alignments of rather distant LRR domains (with low similarity between repeats and varying number of repeats) may not reflect properly their evolution. This truncated alignment was transformed into its DNA coding version (Additional file 1), which was used to generate a phylogenetic tree by PhyML (Fig. 2). This tree shows the existence of eight PSA subfamilies which can be defined by their chromosome localisation: Chr5 (A and B), 9, 12, 21 and three subfamilies on Chr31 (A, B, C). Subfamilies Chr5A and B are the most distantly related and pair apart from all other clusters. Among the latter, only the Chr21/Chr31A subfamilies can be phylogenetically linked with high confidence bootstrap values. Each of the three Leishmania species has at least one gene in each of these eight subfamilies. The L. infantum gene orthologs always group with those of L. major, which is expected from the known species tree. The relatively short branch length separating orthologs in comparison to those separating the inparalogs (ancestors of the seven gene subfamilies) might suggest that speciation of the three Leishmania is a late event in the evolution history of the PSA genes. In keeping with this interpretation, a tree with similar relative branch lengths was inferred when using the rate of synonymous substitutions (dS), a more neutral distance metric (not shown).
Chr5A, B, 9, 21, 31A, B, C subfamilies of PSAgenes
While the Chr12 subfamily is most diversified, having very different numbers (1 to 24) of rather distantly related paralogs in the three species, the PSA genes lying on the other chromosomes form very simple clusters. For each of the latter subfamilies, each species contains a single gene (except L. major, which has two highly similar paralogs for the Chr31A cluster) and the orthologs or pseudo-orthologs are quite similar (median = 75% amino acid identity for all three species and 93% for L. infantum/L. major orthologs). This degree of homology between orthologs, which was calculated on the same domains used above for phylogenetic tree determination, was also observed over the entire LRR domain. Indeed, confident alignments covering this domain can be obtained inside each of the Chr5A and B, 9, 21, 31A, B, and C clusters since the number of LRRs differ at most by one inside each subfamily and the similarity between orthologs is (median = 80% amino acid identity for all three species and 96% for L. infantum/L. major orthologs) (see Additional files 2, 3, 4, 5, 6, 7, 8 for sequence alignments). On the other hand, similarity between genes of different subfamilies is rather low over this domain (median = 32% amino acid identity, and much difference in number of repeats). This may suggest that the LRR domains characteristic of each cluster represent specific binding properties, which could confer specialized roles to their respective PSA proteins. Another link between these subfamilies and functional specialisation is given by the nature of the C-terminal domain. Indeed, all members of the same subfamily share the same membrane anchorage/secretion determinant. In particular, all genes of the phylogenetically linked Chr21 and 31A subgroups are predicted to code for secreted proteins (Fig. 2).
Chr12 subfamily of PSAgenes
The tree of Fig. 2 shows that the phylogenetic structure of the Chr12 genes is quite different from that of the above subfamilies. Here, all genes group according to the species, which suggests that generation of paralogs in L. major and L. infantum occurred independently in each species. This clustering pattern was also obtained for an alignment corresponding to the C-ter extension containing the Cys-rich/transmembrane domains of the relevant subset of Chr12 PSA genes (not shown). We then produced an alignment of the LRR domain for these Chr12 genes. Although sequence diversity and variability in numbers of repeats render this task uncertain, this alignment could be validated by the high similarity between many genes and singularities of some LRRs (length of LRRs, sequence specificities). The most robust portion of this LRR alignment (Additional file 9) was used to generate a phylogenetic tree (Fig. 3). While many clades were conserved from the tree of Fig. 2, new well supported clusters were generated. Notably, two link together genes from L. major and L. infantum, thus identifying clear orthologs: LmjF12.1090/LinJ12_v4.0671 and LmjF12.0850, LmjF12.1070/LinJ12_v4.0663. Phylogenetic analysis scanning of all individual domains including single LRR repeats confirmed the above orthologies for all LRR repeats except those at both ends of the LRR domain (not shown). On the other hand, the N- and C-terminal domains, particularly the signal peptide region and the Cys-rich/transmembrane domain are species specific. The identification of LmjF12.1090/LinJ12_v4.0671 as orthologous genes was also strengthened by the absence of C-terminal Cys-rich and transmembrane coding domains in their respective ORFs. Yet, these domains are still encoded by very similar sequences located immediately downstream of the stop codons, suggesting that a common frameshift mutation resulted in this deletion. Besides the above orthologs, some clustering of L. infantum with L. major genes was observed, but with low bootstrap values. New trees were generated for the L. infantum genes alone, or for those of L. major, but they did not reveal any new well-supported clusters. If we consider the LRR domain as the main functional determinant of the PSA protein, the following picture of the genes on Chr12 emerges. In L. infantum, only two genes (LinJ12_v4.0666 and LinJ12_V3.0690) are closely related. Besides this clade, the history of the L. infantum Chr12 genes can not be resolved, due to the important apparent divergence between the sequences, which might result in part of recombination events involving LRR repeats. In L. major, we can distinguish four clusters of paralogs. Only three of them are supported by strong bootstrap values. Cluster Lm12B contains the two pseudo-orthologs of LinJ12_v4.0663. Subgroup C comprises seven genes, out of which six code for secreted proteins. As for the LmjF12.1090/LinJ12_v4.0671 genes, truncation of the Cys-rich and transmembrane domains in these genes originated from a common non-sense frameshift mutation. Cluster D comprises seven genes, five of which are almost identical (>99% nucleotide identity) and were most probably created by recent duplication events. In comparison to L. infantum, L. major has a much younger cluster of Chr12 genes, with 9 recently duplicated genes (dS <0.02 and no insertions or deletions). We must note, however, that nearly identical copies of PSA genes might have been missed in the L. infantum genome assembly, where contigs were assembled solely from shotgun sequence reads.
Recombination in PSAgenes
The repetitive nature of the LRR domain and the number of paralogs imply that recombination has largely contributed to the diversity of the PSA gene. Some recombination events can be traced in the LRR domains. For example, while LRRs inside a given PSA protein are generally quite divergent, nearly identical repeats can be found. For example, LbrM12_V2.0750 has tandem triplicates of a LRR doublet, and LmjF12.0860, 1020, 1040 have up to 19 copies of a conserved LRR motif. These nearly exact duplications could have been generated by recent intragenic cross-overs, or cross-overs involving closely related paralogs. More diverging chimeric genes could result from intergenic recombination. To detect such events, we partitioned the LRR alignment of the L. major Chr12 genes into single LRR alignments and inferred the phylogenetic tree for the most conserved of them. Fig. 4 shows how these genes cluster according to these LRRs as well as to the conserved regions outside the LRR domain. It can be noticed that while genes from clade C are quite homogeneous in their clustering pattern across the entire alignment, others are more mosaic. This is the case for most of the genes of clade E, and this probably explains the low resolution (weak bootstrap values) of this subtree (see Fig. 3). Simple intergenic recombination events are readily suggested. For example, genes LmjF12.0830 and 12.0960 appear to be chimeric products originating from strand exchanges between genes belonging to clades D and E. Likewise, gene LmjF12.0990 must have been generated by the fusion of the N-terminal pre-LRR domain of a gene from clade C with a gene of clade D. To validate these conclusions, alignments of relevant PSA gene domains were analysed by SimPlot (Fig. 5) [21]. First, similarity profiles against the putative chimeric genes (LmjF12.0830, LmjF12.0960 and LmjF12.0990) show a clear switch between potential parent genes at the breakpoint region suggested above in Fig. 4. Second, bootscanning analysis reveals an alternation of high bootstrap support across the same regions. Finally, these recombination events are confirmed by analysis of informative sites by the Maximum Chi-Squared method (Fig. 5).
More profound rearrangements can be created by intergenic recombinations (both at coding and non-coding sites), which lead to whole gene duplication and deletions. One such rearrangement can be traced in the L. major Chr12 PSA gene cluster, where PSA genes are organised in tandem, tough often separated by one of two unrelated genes of type A or B. The distribution of these intervening genes together with the phylogenic signature of each PSA genes suggest that the two pairs of very similar paralogs LmjF12.0810/LmjF12.0940 and LmjF12.0830/LmjF12.0960 were created by a same segmental duplication involving a bloc of five genes (see Additional file 10).
Positive selection of PSAgenes
The high gene diversity found for the Chr12 subfamily in L. major and L. infantum could be indicative of selective pressure acting on these genes. The preservation of a high number of paralogs on chromosome 12 (with the exception of the single L. braziliensis Chr12 gene) in comparison to that of paralogs found on the other chromosomes (one or two copies only) is itself an indication that diversification is taking place in the Chr12 subfamily. This is strengthened by the high protein sequence diversity among the L. infantum and L. major paralogs of Chr 12. After grouping pairs of genes with dS values below 0.02 (recently duplicated genes) as single entities, we found that the median pairwise p-distances (calculated over the coding sequence deleted of the Thr/Ser and Cys-rich/transmembrane C-terminal domains) for the L. infantum and L. major Chr12 paralogs were 0.336 and 0.285, respectively (it was only slightly higher, 0.348, for the L. infantum-L. major genes pairwise comparison). That is to compare to the high protein similarity found between the L. infantum and L. major orthologs of Chr5, 9, 21, 31, ranging from 91 to 95% identity (median = 94%). Orthologs of PSA genes on Chr5, 9, 21, 31 are therefore much more conserved even than paralogs of Chr12. To gain a better insight on selective pressure, we conducted analysis of ratios of dS, the number of synonymous substitutions per synonymous sites, to dN, the number of non-synonymous substitutions per non-synonymous sites, over alignments covering the entire LRR domain. We calculated a median dN/dS ratio of 0.26 for the pairwise comparison of Chr5, 9, 21, 31 orthologs of L. infantum and L. major (0.26 when those of L. braziliensis where included). The pairwise comparison between all genes of Chr12 of L. infantum and L. major (after grouping pairs of genes with dS values below 0.02 as single entities) revealed a much higher median dN/dS of 0.74 (0.71 and 0.86 when comparing only paralogs of L. major and L. infantum, respectively). Fig. 6 shows the dN/dS versus dS plot of all pairwise comparisons. dN/dS values of the Chr5,9,21,31 orthologs are quite uniformly distributed along the dS axis, well below those of the Chr12 genes, with the exception of two points which refer to the same LbrM31_V2.1680 gene. It is worth noting that the dN/dS values associated with the lowest dS values for the Chr12 paralogs (and where the alignments are the most confident) are relatively high. The two ortholog pairs of Chr12 genes (LmjF12.1090/LinJ12_v4.0671 and LmjF12.0850, LmjF12.1070/LinJ12_v4.0663) position somewhat outside the distribution of Chr12 genes towards low dN/dS values, especially for LmjF12.1090/LinJ12_v4.0671, whose dN/dS value (0.31) rather falls inside the Chr5,9,21,31 ortholog values. These results suggest that the PSA genes on chromosome 5, 9, 21 and 31, as well as orthologs LmjF12.1090/LinJ12_v4.0671 and LmjF12.0850, LmjF12.1070/LinJ12_v4.0663 of chromosome 12 have been more subjected to purifying selection than the other Chr12 genes.
A more incisive method to detect positive selection involves analysing dN/dS ratios at each codon site. Indeed, dN/dS ratios calculated over the entire length of the sequence cannot always identify positive selection (dN/dS > 1). That is because only a subset of sites in a given gene is subjected to diversifying selection. The alignment of the entire coding sequence of the Chr12 genes, except the C-terminal Cys-rich and transmembrane domains was analysed using codeml of the PAML package. Likelihood ratio test analysis of three pairs of simple and more complex models of dN/dS categories (M1a/M2a, M7/M8 and M8a/M8) all revealed the presence of positive selection with high probabilities (P < 0.001). The M1a/M2a models comparison predicted the presence of 43 positively selected sites with P < 0.05 (59 sites with P < 0.1), among which 33 (44 with P < 0.1) were located in the LRR domain. Very similar results were obtained with the M7/M8 and M8a/M8 model comparisons. Codeml also revealed the presence of positive selection (P < 0.001) in alignments comprising all Chr 12 genes of either L. infantum or L. major alone, or the 7 closely related genes of L. major 12B cluster (see tree of Fig. 3). On the other hand, PSA genes on the other chromosomes could not be analysed properly by codeml: each of these phylogenetically defined clusters possess too few genes (2 to 4), and those often offer poor diversity. It is known that inference of positive selection by codeml can lead to false positives when the genes under investigation have a recombination history [22]. To minimise the impact of recombination, we performed codeml analysis on single or contiguous pairs of LRRs of the Chr12 genes. The majority of these analyses revealed the presence of positive selection and could identify sites under positive selection. Under model M8, still 15 such sites were found at P < 0.05 for pairs of LRRs. We then asked if the numerous predicted positively selected sites found above by codeml for the entire LRR domain of Chr12 genes corresponded to particular positions in each LRR. Fig. 7A shows that all these sites are located at non-conserved sites of the LRR consensus sequence. The posterior dN/dS values calculated at each site by codeml were grouped according to their corresponding amino acid position in the LRR and a mean dN/dS value was determined for each residue position in the LRR (Fig. 7B). As expected, variable positions have higher dN/dS values than more conserved positions. An even better correlation is observed if we consider the localisation of the sites in the 3D structure of the LRRs. Crystal structures and computer modelling of several LRR domains in other proteins allow for the prediction of the positioning of residues outside or inside the LRR domain 3D structure [23]. The relatively well conserved hydrophobic residues are involved in structuring of the consecutive α-helices and -sheet domains of the LRRs that arrange into a horseshoe shape, and are buried inside the structure. On the opposite, the most divergent residues of the LRR orientate outward. It is clear from Fig. 7A that positively selected sites all correspond to positions predicted to reside at the surface of the LRR domain structure. In addition, the latter positions have higher posterior mean values of dN/dS (Fig. 7B).
3' Intergenic region
Expression of some PSA genes of Chr12 has been shown to be stage specific. Strong increases in the levels of both mRNA and protein were observed during metacyclogenesis in vitro [10]. This regulation was shown to depend on the 3' intergenic region (IR) of the PSA gene [17]. The question arises, then, as to whether the different classes of PSA might be regulated the same way. We therefore wished to compare the phylogeny of the 3' (IR) to that obtained above for the coding sequence. Around 2 kb of the 3' IR of PSA genes from all three Leishmania species (with few exceptions for L. infantum and L. braziliensis due to gaps in their sequences) were analysed by local BLAST and aligned by Dialign2. We found that the 3' IRs are conserved between all orthologs (and pseudo-orthologs for he Chr12 clade), but show no significant homology between genes of the 8 different PSA clades (05, 09, 12, 21, 31A, B, C). While the similarities found for pairwise comparisons involving L. braziliensis sequences are variable and relatively low, those of the other comparisons are high. The L. infantum/L. major orthologs of clades 5, 9, 21, 31A, B, C are equally well conserved (ranging from 87 to 92% identity; see Additional file 11 for sequence alignments), as are the set of pseudo-orthologs and paralogs of clade 12 (median = 95% identity; see Additional file 12 for sequence alignment). Yet, the alignment of the 3' IR of these Chr12 PSA genes is quite mosaic, showing large blocks of deletions/insertions. Phylogenetic analysis of an alignment retaining 1.2 kb of domains conserved in at least 70% of the genes revealed that Chr12 genes cluster according to the species (not shown). This is reminiscent of the extremities of the coding regions (Fig. 2). Functional studies will be necessary to determine if this high degree of conservation is the result of gene conversion or rather, of purifying selection driven by imperatives of gene expression. In any case, if indeed the 3' IR of the different PSA clades play a role in the control of gene expression, either the pattern of expression is variable between the clades or the cues governing this control are subtler than the primary sequence per se.