Evolution and expansion of the Mycobacterium tuberculosis PE and PPE multigene families and their association with the duplication of the ESAT-6 (esx) gene cluster regions
BMC Evolutionary Biology volume 6, Article number: 95 (2006)
The PE and PPE multigene families of Mycobacterium tuberculosis comprise about 10% of the coding potential of the genome. The function of the proteins encoded by these large gene families remains unknown, although they have been proposed to be involved in antigenic variation and disease pathogenesis. Interestingly, some members of the PE and PPE families are associated with the ESAT-6 (esx) gene cluster regions, which are regions of immunopathogenic importance, and encode a system dedicated to the secretion of members of the potent T-cell antigen ESAT-6 family. This study investigates the duplication characteristics of the PE and PPE gene families and their association with the ESAT-6 gene clusters, using a combination of phylogenetic analyses, DNA hybridization, and comparative genomics, in order to gain insight into their evolutionary history and distribution in the genus Mycobacterium.
The results showed that the expansion of the PE and PPE gene families is linked to the duplications of the ESAT-6 gene clusters, and that members situated in and associated with the clusters represent the most ancestral copies of the two gene families. Furthermore, the emergence of the repeat protein PGRS and MPTR subfamilies is a recent evolutionary event, occurring at defined branching points in the evolution of the genus Mycobacterium. These gene subfamilies are thus present in multiple copies only in the members of the M. tuberculosis complex and close relatives. The study provides a complete analysis of all the PE and PPE genes found in the sequenced genomes of members of the genus Mycobacterium such as M. smegmatis, M. avium paratuberculosis, M. leprae, M. ulcerans, and M. tuberculosis.
This work provides insight into the evolutionary history for the PE and PPE gene families of the mycobacteria, linking the expansion of these families to the duplications of the ESAT-6 (esx) gene cluster regions, and showing that they are composed of subgroups with distinct evolutionary (and possibly functional) differences.
The genome of Mycobacterium tuberculosis contains five copies of the immunopathologically important ESAT-6 (esx) gene clusters. Each gene cluster encodes proteins involved in energy provision for active transport, membrane pore formation and protease processing, which assemble to form a dedicated biosynthesis, transport and processing system for the secretion of the potent T-cell antigens belonging to the ESAT-6 protein family [1–9]. Although other, chromosomally unlinked but homologous, genes seem to play a role in this novel secretory system [10, 11], two families of genes present within the clusters have no apparent function in the secretion system, namely the PE and PPE gene families (Figure 1A).
The PE and PPE gene families of M. tuberculosis encode large multi-protein families (99 and 69 members, respectively) of unknown function [12, 13]. These protein families comprise about 10% of the coding potential of the genome of M. tuberculosis. The PE family is characterized by the presence of a proline-glutamic acid (PE) motif at positions 8 and 9 in a highly conserved N-terminal domain of approximately 110 amino acids. Similarly, the PPE family also contains a highly conserved, but unique, N-terminal domain of approximately 180 amino acids, with a proline-proline-glutamic acid (PPE) motif at positions 7–9 (Figure 2A). Although the N-terminal domains are conserved within each family, there is very little N-terminal homology between the two families. The C-terminal domains of both protein families are of variable size and sequence and frequently contain repeat sequences of different copy numbers.
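The positional definitions of the two motifs can be illustrated with a short check. This is a minimal sketch and not the method used in this study: the function name and the two toy sequences are hypothetical, and a real family assignment would rely on homology across the whole conserved N-terminal domain rather than on the motif alone.

```python
def classify_n_terminus(protein_seq: str) -> str:
    """Return 'PPE', 'PE' or 'neither' from the motif position alone.

    The text gives 1-based positions: PE at residues 8-9 and PPE at
    residues 7-9. Python slices are 0-based and end-exclusive.
    """
    seq = protein_seq.upper()
    # Test PPE first: a PPE motif at 7-9 also shows "PE" at 8-9.
    if seq[6:9] == "PPE":
        return "PPE"
    if seq[7:9] == "PE":
        return "PE"
    return "neither"


# Hypothetical toy N-termini, used only to exercise the function.
TOY_PE = "MSFVIAAPEALAAA"
TOY_PPE = "MDFGALPPEINSAR"
```

Calling classify_n_terminus on TOY_PE and TOY_PPE separates the two toy sequences by motif position only; sequences shorter than nine residues fall through to 'neither'.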
Both the PE and PPE protein families can be divided into subfamilies according to the homology and presence of characteristic motifs in their C-terminal domains. The polymorphic GC-rich-repetitive sequence (PGRS) subfamily of the PE family is the largest subfamily (65 members) and contains proteins with multiple tandem repeats of a glycine-glycine-alanine (Gly-Gly-Ala) or a glycine-glycine-asparagine (Gly-Gly-Asn) motif in the C-terminal domain. The other PE subfamily (34 members) consists of proteins with C-terminal domains of low homology. The PPE family can be broadly divided into four subfamilies [14, 16], of which the PPE-SVP subfamily is the largest (24 members). The proteins of this subfamily are characterized by the motif Gly-X-X-Ser-Val-Pro-X-X-Trp between positions 300 and 350 in the amino acid sequence (Figure 2B). The major polymorphic tandem repeat (MPTR) PPE subfamily is the second largest (23 members) and contains multiple C-terminal repeats of the motif Asn-X-Gly-X-Gly-Asn-X-Gly, encoded by a consensus repeat sequence GCCGGTGTTG, separated by 5 bp spacers [17, 18]. The third subfamily (10 members), recently identified by Adindla and Guruprasad, is characterized by a conserved 44 amino acid residue region in the C-terminus comprising highly conserved Gly-Phe-X-Gly-Thr and Pro-X-X-Pro-X-X-Trp sequence motifs (Figure 2C; named the "PPE-PPW" subfamily for the purpose of this study). The last PPE subfamily (12 members) consists of proteins with a low percentage of homology at the C-terminus.
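The PPE subfamily motifs described above translate directly into pattern searches over one-letter amino acid codes. The following is a minimal sketch under stated assumptions, not the classification procedure of this study: the function name, the order in which the subfamilies are tested, the minimum repeat count, and the reading of "between positions 300 and 350" as a constraint on the first residue of the motif are all illustrative choices.

```python
import re

# One-letter translations of the motifs given in the text ("." = any residue).
SVP_MOTIF = re.compile(r"G..SVP..W")     # Gly-X-X-Ser-Val-Pro-X-X-Trp
MPTR_REPEAT = re.compile(r"N.G.GN.G")    # Asn-X-Gly-X-Gly-Asn-X-Gly
PPW_MOTIFS = (re.compile(r"GF.GT"),      # Gly-Phe-X-Gly-Thr
              re.compile(r"P..P..W"))    # Pro-X-X-Pro-X-X-Trp


def classify_ppe_c_terminus(seq: str, min_mptr_repeats: int = 2) -> str:
    """Assign a PPE protein to a subfamily from C-terminal motifs alone."""
    seq = seq.upper()
    # PPE-SVP: motif must start within residues 300-350 (1-based, assumed).
    for match in SVP_MOTIF.finditer(seq):
        if 300 <= match.start() + 1 <= 350:
            return "PPE-SVP"
    # PPE-MPTR: "multiple" repeats; the threshold of 2 is an assumption.
    if len(MPTR_REPEAT.findall(seq)) >= min_mptr_repeats:
        return "PPE-MPTR"
    # PPE-PPW: both conserved motifs must be present.
    if all(pattern.search(seq) for pattern in PPW_MOTIFS):
        return "PPE-PPW"
    return "other"
```

Proteins that match none of the patterns are returned as "other", which corresponds loosely to the fourth subfamily with low C-terminal homology. Real sequences with degenerate repeats would need a more tolerant approach than exact pattern matching.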
An early paper by Doran and coworkers suggested that the members of the PPE-MPTR family were likely to be cell wall associated. Association of a PPE protein with the mycobacterial cell wall was first demonstrated experimentally for the PPE-MPTR protein Rv1917c (PPE34), which was also demonstrated to be at least partly exposed on the cell surface. It has subsequently been shown that certain PE_PGRS proteins are cell-surface constituents [21–23] which influence the cellular architecture and colony morphology as well as the interactions of the organism with other cells. More recently, it has been demonstrated that the PPE proteins Rv2108 (PPE36) and Rv3873 (PPE68) are also both cell-wall associated [24, 25]. Furthermore, Pajon and coworkers have identified at least one outer membrane anchoring domain with the potential to form a beta-barrel outer-membrane protein-like structure in 40 different PE and PPE proteins. It has yet to be shown whether all PE and PPE proteins localize to the cell wall, and secretion into the extracellular environment has not been ruled out.
Although the function of the 168 members of the PE and PPE protein families has not been established, various hypotheses have been advanced. The fact that these genes encode about 4% of the total protein species in the organism (if all genes are expressed) suggests that they most probably fulfill an important function or functions in the organism. The most widely supported theory suggests the involvement of these proteins in antigenic variation due to the highly polymorphic nature of their C-terminal domains [12, 14, 27]. In agreement with this, sequence variation has been observed between the orthologues of the PE and PPE protein families in in silico analyses of the sequenced genomes of M. tuberculosis H37Rv, M. tuberculosis CDC1551 and M. bovis [28–30]. Extensive variation of a subset of PPE genes in clinical isolates of M. tuberculosis has also been observed (S. Sampson, unpublished results) and a recent study by Talarico et al. found sequence variation for PE_PGRS33 (Rv1818c) in 68% of clinical isolates spanning all three M. tuberculosis principal genetic groups. Additionally, Srivastava et al. showed in an analysis of more than 300 clinical isolates of M. tuberculosis that the MPTR domain of the PPE gene Rv0355c (PPE8) displayed several polymorphisms. There is also ample evidence for differential expression of members of the PE/PE_PGRS family between different strains of M. tuberculosis as well as under different environmental and experimental conditions [35–38]. However, the observed sequence variation and differential expression have yet to be related to antigenic variation.
An alternative way in which the PE and PPE proteins may interact with the host immune system is by the inhibition of antigen processing. Some support for this hypothesis is provided by a report that a DNA vaccine construct based on the conserved N-terminal PE region of the PE_PGRS protein Rv1818c (PE_PGRS33) is able to elicit a cellular immune response, whereas a construct containing the whole PE_PGRS region is unable to do so, suggesting that the PGRS repeats are in some way able to influence antigen processing and presentation. This is supported by a recent follow-up study, in which Dheenadhayalan and coworkers demonstrated that expression of the complete PE_PGRS33 protein in the non-pathogenic fast-growing M. smegmatis causes the strain to survive better in infected macrophage cultures and mice than a parental strain or a strain expressing only the PE domain of the protein. Work done by Delogu et al. showed that the PE domain of PE_PGRS33 is necessary for subcellular localization, while the PGRS domain, but not PE, affects the bacterial shape and colony morphology. It was also shown previously that an M. bovis BCG strain containing a transposon insertion in PE_PGRS33 could not infect (and survive in) macrophages and showed dispersed growth in liquid media. Complementation of this mutant restored infectivity of macrophages as well as aggregative growth (clumping) in liquid media.
Other diverse clues to the potential functions of the members of these families exist. For example, Rodriguez and colleagues [41, 42] have found that the PPE gene Rv2123 (PPE37) is upregulated under low iron conditions, leading to the hypothesis that this gene may encode a siderophore involved in iron uptake. One member of the PE_PGRS family, Rv1759c (wag22), has been characterized as a fibronectin binding protein [43, 44]. Interestingly, the orthologue of this gene in the closely related genome of M. bovis is a pseudogene, the absence of which could potentially play a role in influencing host or tissue tropism. It was also shown that two M. marinum orthologues of the PE_PGRS subfamily are essential for replication in macrophages as well as persistence in granulomas. More recently, an M. avium PPE protein (Rv1787/PPE25 orthologue), expressed only in macrophages, has been shown to influence macrophage vacuole acidification, phagosome-lysosome fusion and replication in macrophages, and to be associated with virulence in mice. Additional data support the notion that members of the PPE gene family may be involved in disease pathogenesis, as a transposon mutant of the PPE gene Rv3018c (PPE46) was attenuated for growth in macrophages. Sassetti et al. confirmed the importance of Rv3018c and identified a further 5 PPE genes (Rv0286/PPE4, Rv0755c/PPE12, Rv1753c/PPE24, Rv3135/PPE50 and Rv3343c/PPE54) and 3 PE genes (Rv0285/PE5, Rv0335c/PE6 and Rv1169c/PE11) as essential for in vitro growth in a transposon-mutagenesis-based screen, although a follow-up study by the same group showed that only two PPEs (Rv1807/PPE31 and Rv3873/PPE68) and one PE (Rv3872/PE35) are specifically required for mycobacterial growth in vivo during infection of mice.
The authors speculated that the detection of such a small fraction in their system suggests either that most of these genes are able to functionally complement each other, or that they are required under conditions that were not tested. Interestingly, Rv3872 (PE35) and Rv3873 (PPE68), required for in vivo growth, are both situated within the ESAT-6 gene cluster region 1, which has been previously shown to be involved in pathogenicity of the organism [4, 6, 8, 49–51], while Rv0285 (PE5) and Rv0286 (PPE4), required for in vitro growth, are both situated within the ESAT-6 gene cluster region 3. Recently, Jain and coworkers identified three PE_PGRS genes (Rv0977/PE_PGRS16, Rv0978c/PE_PGRS17 and Rv0980c/PE_PGRS18) and two PPE genes (Rv1801/PPE29 and Rv3021c/PPE47) to be up-regulated by at least 8-fold in human brain microvascular endothelial-cell-associated M. tuberculosis and showed that at least Rv0980c and Rv1801 are potentially required for endothelial-cell invasion and/or intracellular survival. This confirmed data from Talaat et al., who identified the same PE_PGRS genes Rv0977, Rv0978c and Rv0980c as part of a so-called in vivo-expressed genomic island that was highly expressed only in vivo and not in vitro.
The evolution and distribution of the members of the PE and PPE gene families in the genus Mycobacterium, as well as their association with the ESAT-6 gene cluster regions within these organisms, are unknown. The only attempt to obtain some insight into the relationships among members of the large PE_PGRS gene family specifically was an analysis by Espitia et al., performed in order to identify the closest relatives of a PE_PGRS sequence involved in fibronectin-binding. This resulted in an uninformative unrooted tree that suggested only a complex evolutionary history for this gene family.
Sequencing of the complete genomes of organisms has provided a wealth of information concerning phenotype and evolution. The information obtained from these sequencing projects can be used to trace the evolution of genes and gene families using comparative genomics. This study investigates the evolutionary history of the mycobacterial PE and PPE gene families using in silico sequence analyses, phylogenetic analyses, DNA hybridization and comparative genomics of a selected set of mycobacterial genome sequences. We attempt to answer the question of why and how these PE and PPE genes were duplicated, as well as provide insight into the relationship between these genes and the ESAT-6 (esx) gene clusters. We envisage that this data will provide a better understanding of the factors involved in the considerable expansion of the PE and PPE families, their evolutionary and functional relationship to the ESAT-6 (esx) gene cluster regions, and the evolution of the mycobacterial genome.
Results and Discussion
Identification of the most ancestral PE and PPE genes
The PE and PPE gene families are not present outside the genus Mycobacterium
In order to construct a robust evolutionary history of the PE and PPE gene families through phylogenetic analysis, it is of critical importance to first identify the most ancestral representatives of both these families. These ancestral genes are used as the root for the construction of the relationship tree, and represent the origin of the family. Comparative genomics, in which the genomes of different species are compared to look for differences and similarities, is the tool of choice for the identification of orthologues of genes in these species. To date, 31 mycobacterial genome sequencing projects are in various stages of completion (see Table 1), representing a valuable resource for comparative genomics analyses within the genus Mycobacterium. A detailed examination of the sequenced genomes of species belonging to genera closely related to the mycobacteria (e.g. Corynebacteria, Nocardia etc.) has shown that the PE and PPE genes are not found outside of the genus Mycobacterium (data not shown). This is in agreement with the published genome analyses of these organisms [54–60]. Where repetitive proteins with some homology to the PE and PPE gene families have been identified previously (e.g. nfa8180 in Nocardia farcinica and SAV5103, SAV6636, SAV6731, SAV7299 in Streptomyces avermitilis – see Ishikawa et al.), this is merely due to unspecific alignment of the repetitive regions and these proteins do not contain the conserved N-terminal PE and PPE domains or the conserved PE and PPE motifs. The answer to the evolution and expansion of these multigene PE and PPE families thus lies within the genus Mycobacterium.
Generation of a mycobacterial phylogenetic tree
A phylogenetic tree was generated using the 16S rRNA gene sequence of 83 species of the genus Mycobacterium, with the sequence of the species Gordonia aichiensis as the outgroup (Figure 3). This was done in order to determine the evolutionary history of the genus Mycobacterium and to identify the sequenced species closest to the origin/last common ancestor of the genus. This species would provide the most valuable data with regard to the presence and origin of the ancestral PE and PPE genes. The taxonomical relationships between members of the genus Mycobacterium based on the 16S rRNA gene sequence information in this tree are comparable to data published previously by Pitulle et al., Shinnick and Good, and Springer et al. The phylogenetic positions of all the sequenced mycobacterial species are indicated in yellow in Figure 3. From this analysis it is apparent that the non-pathogenic, fast-growing mycobacterium M. smegmatis is the sequenced species closest to the last common ancestor (the genome sequences of M. abscessus and M. chelonae have not been released publicly), and the genome sequence of this species thus represents the ancestral reference point for the investigation of the evolution of these gene families within the mycobacteria.
Comparative genomics analyses between M. tuberculosis H37Rv and M. smegmatis
Analysis of the genome sequence of M. smegmatis revealed only two pairs of PE and PPE genes. None of the other members of the PE or PPE gene families, including any of the PE_PGRS or PPE-MPTR genes, could be detected within the M. smegmatis genome. The first pair corresponds to the Rv3872/3 orthologues (MSMEG0062 and MSMEG0063) from ESAT-6 (esx) gene cluster region 1 (70% and 55% similarity to the M. tuberculosis H37Rv proteins, respectively), while the second pair corresponds to the Rv0285/6 orthologues (MSMEG0608 and MSMEG0609) from ESAT-6 (esx) gene cluster region 3 (87% and 64% similarity to the M. tuberculosis H37Rv proteins, respectively). These two gene pairs have been shown to be required for in vivo and in vitro growth, respectively, in M. tuberculosis H37Rv [47, 48]. Thus, the only PE and PPE genes present within the M. smegmatis genome are found within two ESAT-6 (esx) gene cluster regions.
The PE and PPE genes from ESAT-6 region 1 are the most ancestral genes of the two gene families
PE/PPE gene pairs are frequently associated with the ESAT-6 (esx) gene clusters in M. tuberculosis [1, 64]. The duplication order of the ESAT-6 (esx) gene clusters within the genome of M. tuberculosis was previously predicted by systematic phylogenetic analyses of the constituent genes. This duplication order was shown to extend from the ancestral region named region 4 (Rv3444c-Rv3450c) to region 1 (Rv3866-Rv3883c), 3 (Rv0282-Rv0292), 2 (Rv3884c-Rv3895c), and lastly to region 5 (Rv1782-Rv1798) (Figure 1A). The absence of a pair of PE and PPE genes within the most ancestral ESAT-6 region, region 4 (a region which is also present in species outside of the genus Mycobacterium), indicates that these genes may have been integrated into the first duplicate of this region (region 1), and have subsequently been co-duplicated together with the rest of the genes within the subsequent four regions (Figure 1).
The genome of M. smegmatis only contains three of the five ESAT-6 (esx) gene cluster regions (regions 4, 1, and 3), with regions 2 and 5 being absent. Although it is possible that regions 2 and 5 may have been deleted from the genome of this organism, it is more likely that they only evolved after the divergence of M. smegmatis, as these regions were determined to be the last two duplicates in the evolution of the ESAT-6 (esx) gene cluster. This is supported by comparative genomics analyses of the genomes of the closely related fast-growing mycobacteria M. flavescens, M. vanbaalenii, M. sp MCS and M. sp JLS, in which ESAT-6 (esx) gene cluster regions 2 and 5 were also found to be absent, as well as M. sp KMS, in which ESAT-6 (esx) gene cluster region 2 was present, but region 5 was absent (results not shown). This is further supported by the fact that the genome of M. smegmatis is approximately 1.7 times larger than that of M. tuberculosis, and thus does not display the same reductive properties as those observed in the genome of, for example, M. leprae (which was confirmed to have lost ESAT-6 (esx) gene cluster regions 2 and 4 by deletion). As the only copies of the PE and PPE gene families found in the genome of M. smegmatis were present in ESAT-6 (esx) regions 1 and 3, and as the PE and PPE genes are not found outside of the genus Mycobacterium, it is clear that the PE and PPE genes found within the ESAT-6 (esx) gene cluster regions 1 and 3 are the most ancestral representatives of these two gene families. Furthermore, as ESAT-6 (esx) gene cluster region 1 is the first duplicate of the ESAT-6 gene cluster regions, the PE and PPE gene copies from region 1 are probably the progenitors of all other PE and PPE genes.
This is further supported by the observation that, although these two genes do contain the conserved N-terminal PE and PPE regions, respectively, they do not contain any long and complex C-termini as found in other representatives of the families, and thus represent a pre-C-terminal elongation and repeat-region formation stage.
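The reasoning about absent regions can be made explicit with a small consistency check against the duplication order. This is a minimal sketch, assuming that the duplication order (4, 1, 3, 2, 5) is correct and that a lost region is never regained; the function name is hypothetical, and the region sets for M. sp KMS and M. leprae are inferred from the statements about which regions are present or absent rather than listed in the text.

```python
# Duplication order of the ESAT-6 (esx) regions: ancestral region 4 first.
DUPLICATION_ORDER = [4, 1, 3, 2, 5]

# Regions present per genome. KMS and M. leprae sets are inferred (assumption).
REGIONS_PRESENT = {
    "M. tuberculosis": {4, 1, 3, 2, 5},
    "M. smegmatis": {4, 1, 3},
    "M. sp KMS": {4, 1, 3, 2},
    "M. leprae": {1, 3, 5},
}


def candidate_deletions(present):
    """Return regions that are missing although a later duplicate exists.

    An empty list means the pattern can be explained by divergence before
    the later duplications arose; listed regions must have been lost.
    """
    if not present:
        return []
    last = max(DUPLICATION_ORDER.index(region) for region in present)
    return [r for r in DUPLICATION_ORDER[: last + 1] if r not in present]
```

Under these assumptions the M. smegmatis pattern needs no deletions, whereas the M. leprae pattern returns regions 4 and 2, in line with the deletions reported for that genome.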
Phylogeny of the PE and PPE protein families in M. tuberculosis H37Rv
Phylogenetic analysis of the ancestral PE and PPE genes situated within the ESAT-6 (esx) gene clusters in M. tuberculosis H37Rv
To confirm that the PE and PPE genes found within the ESAT-6 (esx) gene cluster regions in M. tuberculosis shared an evolutionary history with the other genes within the clusters (indicating co-duplication/evolution), we constructed separate phylogenetic trees based on the results of the independent analyses of the members of the PE and PPE families present in the 4 PE/PPE-containing ESAT-6 (esx) gene cluster regions (regions 1, 3, 2 and 5). The resulting phylogenetic trees (Figure 4) showed topologies congruent with those of phylogenetic trees obtained for all the other gene families situated in the ESAT-6 (esx) gene clusters. From this we concluded that the PE and PPE genes were duplicated together with the ESAT-6 (esx) gene clusters after their initial insertion (into region 1), rather than being inserted during multiple separate subsequent events. These results also confirm the previously determined duplication order of the ESAT-6 (esx) gene clusters.
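Topological congruence between two rooted trees can be tested by comparing their branching patterns while ignoring the left-to-right order of the children. The following is a minimal sketch, not the procedure used to produce Figure 4: the trees are hypothetical nested tuples whose leaves are labelled by ESAT-6 region, chosen only to be consistent with the duplication order 1, 3, 2, 5, and branch lengths are ignored.

```python
def canonical(tree):
    """Convert a nested-tuple tree into an order-independent form."""
    if isinstance(tree, tuple):
        return frozenset(canonical(child) for child in tree)
    return tree  # a leaf label


def congruent(tree_a, tree_b):
    """True if both rooted trees have the same topology."""
    return canonical(tree_a) == canonical(tree_b)


# Hypothetical rooted trees with one leaf per PE/PPE-containing region.
PE_TREE = ("region1", ("region3", ("region2", "region5")))
PPE_TREE = ((("region5", "region2"), "region3"), "region1")
ALT_TREE = ("region1", ("region2", ("region3", "region5")))
```

Here PE_TREE and PPE_TREE are written with their children in different orders but describe the same branching, whereas ALT_TREE swaps the positions of regions 2 and 3 and is therefore not congruent. The sketch requires unique leaf labels.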
Phylogenetic analysis of all the PE and PPE genes present in M. tuberculosis H37Rv
To obtain a global picture of the evolutionary relationships of all PE and PPE genes within M. tuberculosis and not only those situated within the ESAT-6 (esx) gene clusters, we constructed independent phylogenetic trees based on the results of the multiple sequence alignments of all proteins encoded by members of the two gene families. The phylogenetic tree constructed from the ninety-six chosen PE protein family N-terminal sequences (see Methods) was rooted to the ancestral PE outgroup from ESAT-6 (esx) gene cluster region 1, namely Rv3872 (PE35, Figure 5). Similarly, the PPE protein from ESAT-6 (esx) gene cluster region 1, namely Rv3873 (PPE68), was chosen as the outgroup to root the phylogenetic tree constructed independently from the sixty-four PPE sequences (Figure 6). Both trees (from the PE and PPE families, respectively) showed a similar topology, which was conserved when the complete protein sequences were used for analysis instead of only the conserved N-termini (data not shown). Each tree was characterized by five distinct (but corresponding) sublineages (indicated by Roman numerals in Figure 5 and 6). Four of these sublineages match the PE_PGRS, PPE-PPW, PPE-SVP and PPE-MPTR subfamilies, respectively, and these results are thus in accordance with the subgroupings of the PE and PPE families proposed previously [12, 14, 16].
Since the tree topologies correspond to each other, a co-evolutionary history for the two gene families is also suggested. Interestingly, this evolutionary scenario is also congruent with the evolutionary history determined for the five ESAT-6 (esx) gene clusters, with duplication events of PE and PPE genes contained in and associated with these regions expanding sequentially from region 1 to 3, 2 and lastly region 5. The topology of the phylogenetic trees suggests that the PE_PGRS and the PPE-MPTR subfamilies are the result of the most recent evolutionary events and have evolved from the sublineage that includes the ESAT-6 (esx) gene cluster region 5 PE and PPE genes (Figure 5 and 6, sublineage IV). This is supported by the finding that some members (Rv1361c/PPE19, Rv3135/PPE50 and Rv3136/PPE51) of the PPE sublineage IV (PPE-SVP subfamily) contain isolated MPTR-like repeats, suggesting the existence of a common progenitor gene from which the PPE-MPTR subfamily expanded (data not shown). The proteins outside of the PE_PGRS and PPE-MPTR subfamilies seem to be closer in homology to the ancestral genes, and are thus collectively called the "ancestral-type" PE and PPE genes for the purpose of discussion in this study.
The genes from ESAT-6 (esx) gene cluster region 5 seem to be highly prone to duplication, as region 5 is the only one of the five ESAT-6 (esx) gene clusters which contains multiple copies of the PE and PPE genes situated inside the cluster (Figure 1). Furthermore, ESAT-6 (esx) gene cluster region 5 is also the parent of a number of secondary duplications containing only the genes for PE, PPE, ESAT-6 (esx) and CFP-10 (a member of the esx family) (see Figure 1B and 1C). It appears that this region plays an important role in the propagation of both the ESAT-6/CFP-10 and the PE/PPE genes. It is thus tempting to speculate that the duplication propensity of the region 5 genes may have initiated the subsequent expansion of the PGRS and MPTR subfamilies, although inherent properties of the PGRS and MPTR repeats themselves certainly also contributed to this phenomenon.
Closer inspection of the relative positions of the PE and PPE genes in the M. tuberculosis genome sequence revealed that in a number of cases a copy from each of these families was found situated adjacent to the other (Table 2, see also Tundup et al. and Strong et al.). By examining the relative positions of the PE and PPE genes from each pair on the separate PE and PPE phylogenetic trees, it was found that these pairs of genes are always situated in the same sublineage on the trees, indicating that they were likely to be co-duplicated. Furthermore, the order of their positions is always conserved, with the PE gene situated upstream of the PPE gene. These paired genes are found in all the sublineages except in the highly polymorphic PGRS and MPTR subfamilies (sublineage V). In this sublineage, member genes were found situated on their own within a specific genomic location. Thus, it is clear that the expansion of the PGRS and MPTR subfamilies was associated with a change in their duplication characteristics, and although the cause and significance of this is unknown, it may point to a corresponding change in function. In support of this, in a computational identification of beta-barrel outer-membrane proteins of M. tuberculosis, Pajon et al. identified 40 PE and PPE proteins from a total of 114 predicted beta-barrel structures. Closer inspection of the identified proteins indicates that they all form part of sublineage V, the PE_PGRS and PPE-MPTR subfamilies (23 and 17 members, respectively), indicating a shared function between the members of these two subfamilies.
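The conserved gene order of the pairs (PE upstream of PPE) can be checked from the locus tags alone. This is a minimal sketch under stated assumptions: consecutive Rv numbers are taken to mean adjacent genes, the "c" suffix is taken to mark a gene on the complementary strand, and the function names are hypothetical. The example pairs are ones named in the text.

```python
import re

LOCUS_TAG = re.compile(r"Rv(\d{4})(c?)$")


def parse_locus(tag):
    """Split a tag such as 'Rv2431c' into (number, strand)."""
    match = LOCUS_TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognised locus tag: {tag}")
    return int(match.group(1)), "-" if match.group(2) else "+"


def is_pe_upstream_pair(pe_tag, ppe_tag):
    """True if the PE gene lies directly upstream of the PPE gene."""
    pe_num, pe_strand = parse_locus(pe_tag)
    ppe_num, ppe_strand = parse_locus(ppe_tag)
    if pe_strand != ppe_strand:
        return False
    # On the complementary strand, "upstream" means the higher locus number.
    step = 1 if pe_strand == "+" else -1
    return ppe_num - pe_num == step


# PE/PPE pairs named in the text (PE tag first).
PAIRS = [("Rv3872", "Rv3873"), ("Rv0285", "Rv0286"),
         ("Rv2431c", "Rv2430c"), ("Rv1169c", "Rv1168c")]
```

All four listed pairs satisfy the check, including the two on the complementary strand, where the PE gene carries the higher number. Locus tags with letter suffixes other than "c" are rejected by this simple parser.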
The reason for the maintenance of the gene pairing of the ancestral PE and PPE genes is still unclear, although these genes may be functionally related and co-transcribed. There is some early evidence for the latter from gene expression data obtained during adaptation to nutrient starvation (the gene pairs Rv0285/86 (PE5/PPE4), Rv1195/96 (PE13/PPE18), Rv1386/87 (PE15/PPE20) and Rv2431c/30c (PE25/PPE41) are downregulated and the pair Rv1169c/68c (PE11/PPE17) is upregulated). Furthermore, it was recently demonstrated that the genes from at least one of these PE-PPE gene pairs, Rv2430c/31c, are co-transcribed and that the gene products interact with each other to form a hetero-tetramer. This finding was expanded upon by Strong et al., who determined the structure of the Rv2430c/31c protein interaction, and demonstrated that the PE/PPE protein pair forms a 1:1 complex. Intriguingly, this is similar to the situation observed for the proteins encoded by the CFP-10 and ESAT-6 genes (adjacently situated to many of the PE-PPE gene pairs – see Figure 1A and 1B), which also form a tight 1:1 complex [69–72] and are secreted by the ESAT-6 transport system [4–6, 8]. There is evidence that the PPE protein encoded by Rv3873 (PPE68 from ESAT-6 (esx) gene cluster region 1) interacts with CFP-10, ESAT-6 and at least one other esx family member (Rv0288). It is thus tempting to speculate that the PE/PPE and esx genes are not only intricately linked phylogenetically, but also functionally, and that the PE/PPE complex may also be secreted by the ESAT-6 transport system. In support of this, Fortune et al. have shown that the product of the PE gene situated in ESAT-6 gene cluster region 1 (PE35 or Rv3872) is present (together with ESAT-6 and CFP-10 from ESAT-6 gene cluster region 1) in culture filtrates of M. tuberculosis.
Although a previous study by Espitia and colleagues aimed to address PE gene phylogeny, the authors had excluded 19 PE sequences from their phylogenetic calculations. The absence of these sequences, which included the PE proteins belonging to the ESAT-6 (esx) gene cluster regions 1 (Rv3872/PE35), 2 (Rv3893c/PE36) and 3 (Rv0285/PE5), left a major gap in the study of the evolutionary expansion of this family. Our results differ from this study because we included these sequences, which have been shown in the current study to be the most ancestral representatives of the family, and thus form the roots from which the rest of the family expanded. We were thus able to root the tree and explain the evolutionary history of this gene family on the basis thereof.
Comparative genomics analyses to verify the PE and PPE evolutionary history
In order to support the hypothesized evolutionary history deduced from the topologies of the PE and PPE phylogenetic trees generated in this study, we performed comparative genomics analyses of the sequenced genomes of M. avium paratuberculosis, M. avium, M. leprae, M. ulcerans and M. marinum, chosen as representative sequenced mycobacterial species phylogenetically situated between M. smegmatis and M. tuberculosis H37Rv (Figure 3).
M. tuberculosis H37Rv vs. M. avium and M. avium paratuberculosis
The results from the analysis between the genomes of M. tuberculosis H37Rv and M. avium paratuberculosis are summarized in Table 3. We found a total of 10 "ancestral-type" PE genes in the genome of M. avium paratuberculosis (compared to the 34 "ancestral-type" PEs in M. tuberculosis), of which one is M. avium paratuberculosis-specific. We could not find any genes belonging to the PE_PGRS subfamily, consistent with the observation by Li et al. We also identified 37 PPE genes in the genome of M. avium paratuberculosis (compared to the 69 in M. tuberculosis), of which only one (NT03MA4150, an orthologue of Rv0442c/PPE10) belongs to the PPE-MPTR subfamily, and 18 are M. avium paratuberculosis-specific. When these results were superimposed on the phylogenetic trees generated for the PE and PPE gene families in M. tuberculosis H37Rv (Figures 7 and 8, respectively; M. avium paratuberculosis-specific genes were omitted), they showed clearly that all the members of the PE and PPE gene families that are present in the genome of M. avium paratuberculosis form part of the "ancestral-type" genes, except for the orthologue of Rv0442c. This supports the notion that these "ancestral-type" genes represent the earliest members of the PE and PPE gene families, and shows that the PE_PGRS and PPE-MPTR subfamilies have evolved only after the divergence of M. avium paratuberculosis. These results were compared with those obtained from the unfinished genome sequence database of M. avium 104, which were found to correspond to what is observed in the paratuberculosis subspecies (data not shown). This also confirmed previously published hybridization analyses which showed the absence of PGRS sequences in the genome of M. avium [15, 75].
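The gene counts quoted in the text can be put side by side to show how much of each family belongs to the recently expanded subfamilies. This is a minimal sketch for illustration only: the counts are taken from the text, while the dictionary layout and the function name are hypothetical.

```python
# (total genes in family, genes in the PE_PGRS or PPE-MPTR subfamily),
# using the counts stated in the text.
COUNTS = {
    "M. tuberculosis H37Rv": {"PE": (99, 65), "PPE": (69, 23)},
    "M. avium paratuberculosis": {"PE": (10, 0), "PPE": (37, 1)},
}


def recent_fraction(species, family):
    """Fraction of a family that belongs to the recently expanded subfamily."""
    total, recent = COUNTS[species][family]
    return recent / total
```

With these counts, roughly two-thirds of the PE genes and one-third of the PPE genes of M. tuberculosis H37Rv fall into the PE_PGRS and PPE-MPTR subfamilies, against none of the PE genes and under 3% of the PPE genes of M. avium paratuberculosis.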
One of the most interesting results from the M. avium paratuberculosis analysis was the identification of NT03MA4150, an orthologue of the MPTR subfamily gene Rv0442c and the only MPTR orthologue identified in the genome of M. avium paratuberculosis. Closer inspection of the sequence of this gene and its surrounding genes showed that it is a true orthologue of the M. tuberculosis MPTR gene Rv0442c (i.e. it is situated between the orthologues of Rv0441c and Rv0443, and has the highest level of homology to Rv0442c). However, this gene in M. avium paratuberculosis does not contain the polymorphic MPTR C-terminal region characteristic of the MPTR subfamily and found in Rv0442c in M. tuberculosis. To confirm the result, a complete sequence alignment was done with the protein sequences of the orthologues of this gene from the genomes of all available mycobacterial species (Figure 9). From this analysis it was clear that members of the M. avium complex (M. avium paratuberculosis and M. avium 104) do not contain the MPTR region in this gene, while members of species closer to M. tuberculosis (M. marinum, M. ulcerans, M. bovis and M. microti) do contain the repeat region. The homology between the orthologues of the M. avium complex and those of the other species ends at exactly amino acid 180 (the consensus end for the conserved N-terminal region of the members of the PPE family). Furthermore, the tail region could not have been omitted from the annotation of the genome of M. avium paratuberculosis, as the 3' flanking gene (orthologue of Rv0441c) follows 27 bp after the stop codon of NT03MA4150 (the intergenic region is 26 bp in M. tuberculosis, see Figure 9). This suggests that Rv0442c represents the first member of the MPTR subfamily to have been duplicated, before the acquisition of the MPTR repeat region. It is possible that M. avium and M. avium paratuberculosis could have lost all the genes belonging to the PE_PGRS and PPE-MPTR subfamilies; however, this is highly unlikely, as we could find no residues of genes or pseudogenes that would indicate a loss of function and degeneration.
M. tuberculosis H37Rv vs. M. leprae
To gain insight into the events taking place in the phylogenetic gap between the M. tuberculosis complex and the M. avium complex, we performed a comparative genomics analysis between the completed genome sequences of M. tuberculosis H37Rv and M. leprae. The genome sequence of M. leprae is known to have undergone extensive loss of synteny, inversion and genome downsizing , which may have resulted from recombination between dispersed copies of repetitive elements . This has caused the loss of hundreds of genes, resulting in a genome littered with pseudogenes in various stages of decay and elimination. Our primary aim was thus not to identify the reason for the absence of members of the PE and PPE gene families (which could either be due to the fact that they were never present/duplicated, or that they were deleted), but rather to identify whether members were present (in an intact form), and if not, whether there were any residues left of members (pseudogenes) which may have been lost in the process of genome downsizing. Table 4 provides a summary of the members of the PE and PPE gene families present in the genome of M. leprae. We identified 14 genes from the "ancestral-type" PE family, of which 9 were pseudogenes and 5 were M. leprae-specific. In addition, 8 members of the PGRS subfamily could be identified in M. leprae (of which 7 were pseudogenes and 4 were M. leprae-specific), indicating that the expansion of the PGRS subfamily must have started before the divergence of this organism (Figure 7 – M. leprae-specific genes were omitted). It is interesting to note that, although there were 8 detectable PGRS members, 7 of them were pseudogenes and only one intact PGRS gene could be identified in this species, consistent with previously published hybridization studies which showed a general absence of PGRS sequences in the genome of M. leprae . 
Analysis of the PPE subfamily led to the identification of 26 members of the "ancestral-type" (of which 19 were pseudogenes and 13 were M. leprae-specific), with no MPTR subfamily members present, except for ML2369c, the orthologue of Rv0442c/PPE10 (which is also the only representative present in the genomes of M. avium and M. avium paratuberculosis). In Figure 8, members of the PPE family identified in this study were superimposed on the phylogenetic tree generated for the PPE gene family in M. tuberculosis H37Rv (M. leprae-specific genes were omitted). With the exception of the orthologue of Rv0442c (ML2369c), no residues or pseudogenes of any of the other MPTR subfamily genes present in M. tuberculosis H37Rv could be identified in the genome of M. leprae (including the M. leprae-specific genes). This suggests that the MPTR subfamily was not duplicated in the genome of this organism, and that the expansion of the MPTR subfamily thus occurred after the divergence of M. leprae. Although it is possible that the extensive genome downsizing in M. leprae could have caused the loss of all the members of this gene subfamily, it is highly unlikely, and no evidence for this was observed (no pseudogenes or residues of genes were found as in the case of the PGRS subfamily).
To confirm the absence of MPTR genes in this species, we analyzed the sequence of ML2369c (the Rv0442c orthologue) to determine whether it contains the C-terminal MPTR region which is present in Rv0442c in M. tuberculosis, but absent in the Rv0442c orthologues of M. avium and M. avium paratuberculosis. Although the gene is a pseudogene and has undergone extensive degradation at the C-terminus, complicating the sequence alignment, it is clear that there are no MPTR repeats present in this region, even when the C-terminal region is translated in any of the three possible reading frames (data not shown). This suggests that M. leprae diverged after the start of the expansion of the PGRS subfamily, but before that of the MPTR subfamily.
M. tuberculosis H37Rv vs. M. ulcerans and M. marinum
M. ulcerans and M. marinum are phylogenetically closely-related and are also phylogenetically close relatives of the members of the M. tuberculosis complex (see Figure 3). The genomes of both of these organisms have been sequenced, with the M. ulcerans Agy99 genome annotation completed and the M. marinum M genome sequence in the process of being annotated (Table 1). These genome sequences thus provide an excellent resource to determine the status of the expansion of the MPTR subfamily of the PPE gene family in two species situated immediately outside of the M. tuberculosis complex.
An analysis of the genome of M. ulcerans was carried out to determine the presence or absence of orthologues of the members of the PE and PPE gene families of M. tuberculosis H37Rv in this organism. The results from the analysis between M. tuberculosis H37Rv and M. ulcerans are summarized in Additional file 1. We identified 21 genes from the "ancestral-type" PE family in the genome of M. ulcerans (compared to the 34 in M. tuberculosis), of which 6 were pseudogenes and 8 were M. ulcerans-specific. Of the 6 pseudogenes, 4 were M. ulcerans-specific. In addition, 121 members of the PE_PGRS subfamily could be identified in M. ulcerans (compared to the 65 in M. tuberculosis), of which 66 were pseudogenes and 104 were M. ulcerans-specific. Of the 66 pseudogenes, 59 were M. ulcerans-specific. Analysis of the PPE subfamily led to the identification of 81 members (compared to the 69 in M. tuberculosis), of which 34 were pseudogenes and 55 were M. ulcerans-specific. Of the 34 pseudogenes, 25 were M. ulcerans-specific. Six orthologues of members of the M. tuberculosis PPE-MPTR subfamily were present in the genome of M. ulcerans, including the orthologue of Rv0442c/PPE10 (MUL_1395), in this case containing an MPTR repeat region (see also Figure 9). Interestingly, 5 of these 6 PPE-MPTR orthologues were pseudogenes, with the only intact subfamily member being the orthologue of Rv0442c, although 9 intact M. ulcerans-specific PPE-MPTR subfamily members were also detected (MUL_0782, MUL_0890, MUL_0893, MUL_0902, MUL_0964, MUL_0965, MUL_2586, MUL_0098 and MUL_3169). These results are superimposed on the phylogenetic trees generated for the PE and PPE gene families in M. tuberculosis H37Rv in Figures 7 and 8 (M. ulcerans-specific genes are omitted). This suggests that the acquisition of the MPTR repeat region in the C-terminus of Rv0442c and the expansion of the MPTR subfamily took place before the divergence of M. ulcerans. M. ulcerans has also undergone a vast species-specific expansion of the PE and PPE families, resulting in 55 more genes belonging to these two gene families than in M. tuberculosis H37Rv (223 versus 168). However, a large number of them have become pseudogenes, leaving a smaller number of functional genes in M. ulcerans (117 genes) than in M. tuberculosis H37Rv (168 genes). It is interesting to note that the majority of the pseudogenes from these two gene families in the genome of M. ulcerans are M. ulcerans-specific copies (88 out of 106 pseudogenes), and may thus represent "unsuccessful evolutionary experiments".
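The totals quoted in this paragraph can be cross-checked from the per-subfamily counts reported above. The following is a minimal sketch of that tally; the dictionary layout and the function name `summarize` are illustrative assumptions, and only the numbers themselves come from the text.

```python
# Counts for M. ulcerans as reported in the text:
# (total genes, pseudogenes, species-specific pseudogenes).
# The data layout is an illustrative assumption, not part of the study.
MUL = {
    "PE (ancestral-type)": (21, 6, 4),
    "PE_PGRS": (121, 66, 59),
    "PPE": (81, 34, 25),
}

# Totals for M. tuberculosis H37Rv as reported in the text.
MTB = {"PE (ancestral-type)": 34, "PE_PGRS": 65, "PPE": 69}


def summarize(mul, mtb):
    """Aggregate the per-subfamily counts into the family-wide totals."""
    total = sum(t for t, _, _ in mul.values())
    pseudo = sum(p for _, p, _ in mul.values())
    specific_pseudo = sum(s for _, _, s in mul.values())
    mtb_total = sum(mtb.values())
    return {
        "mul_total": total,
        "mtb_total": mtb_total,
        "excess": total - mtb_total,
        "mul_pseudogenes": pseudo,
        "mul_intact": total - pseudo,
        "mul_specific_pseudogenes": specific_pseudo,
    }


summary = summarize(MUL, MTB)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The sums agree with the figures given in the text: 223 versus 168 genes (an excess of 55), 106 pseudogenes of which 88 are M. ulcerans-specific, and 117 intact genes.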
An analysis of the genome of M. marinum was carried out to determine the presence or absence of orthologues of the members of the PE and PPE gene families of M. tuberculosis H37Rv in this organism, in order to confirm the observations made in the M. ulcerans genome. As the genome sequence of M. marinum is still in the annotation phase, no gene names or numbers are available, but the results of the analyses are superimposed on the phylogenetic trees generated for the PE and PPE gene families in M. tuberculosis H37Rv in Figures 7 and 8 (M. marinum-specific genes are omitted). The results are analogous to what was observed in M. ulcerans (confirming their relatedness), and show the presence of multiple copies of both the PGRS and MPTR subfamilies. This confirms the previously published hybridization data which indicated the presence of multiple copies of the PGRS sequence in the genome of M. marinum. There are, analogous to M. ulcerans, also 6 orthologues of members of the M. tuberculosis PPE-MPTR subfamily present in the genome of M. marinum, one of which is the orthologue of Rv0442c, in this case also containing an MPTR repeat region (see Figure 9). This supports the observation made in the M. ulcerans genome sequence and confirms that the acquisition of the MPTR repeat region in the C-terminus of Rv0442c and the expansion of the MPTR subfamily took place before the divergence of M. marinum and M. ulcerans.
Comparative genomics for extent of sequence variation
To further examine the relationships between, and evolutionary history of, the members of the subfamilies of the PE and PPE protein families, to identify subfamily-specific characteristics, and to determine the extent of PE and PPE sequence similarity and variation, orthologues in the fully sequenced and annotated genomes of M. tuberculosis H37Rv and CDC1551 were analyzed by comparative genomics. During this analysis, a complete investigation of the presence and absence of genes, gene sizes, frameshifts, insertions and deletions (indels), alternative start sites, protein mismatches and conservative substitutions was performed. Although other strains of M. tuberculosis are also being sequenced (including strains 210, A1, Ekat-4, K, F11, C, Haarlem, Peruvian1, Peruvian2 and W-148; see Table 1), these sequences are not completed and verified and are thus not useful for an analysis where, for example, single nucleotide polymorphisms are investigated. Additional file 2 provides an overview of the reasons for size differences between annotated genes from the two genome databases. This analysis shows that the "ancestral-type" members of both the PE and PPE families, and specifically the members present within the ESAT-6 (esx) gene cluster regions, have remained conserved between the two different strains (with the only reason for a difference in size being artificial, due to the use of an alternative start site during genome annotation). This is in contrast to the members of the PGRS and MPTR subfamilies, which show considerable variation in size due to frameshifts, insertions and deletions. Additional file 3 shows a summary of the extent of sequence variation on a protein level between the orthologues of these gene families in the two M. tuberculosis strains. From this it is clear that the "ancestral-type" PE and PPE genes are highly conserved between strains, while the MPTR and especially the PGRS subfamilies are more prone to sequence variation (the only exception is PPE60, which is not an MPTR member but shows a high level of variation between the strains). These variations mostly occur in the C-terminal polymorphic domain (after the conserved N-terminal domain of approximately 110 amino acids for the PE members, and 180 amino acids for the PPE members), clearly demonstrating the importance of the conservation of the N-terminal domain. The results from this study are in agreement with previously published results by Garnier and coworkers, who found blocks of sequence variation in genes encoding 29 different PE_PGRS and 28 PPE proteins (most of which belong to the PPE-MPTR subfamily) resulting from frameshifts, insertions and deletions in a comparison between the annotated genes from the completed genomes of M. bovis AF2122/97 and M. tuberculosis H37Rv. The authors speculate that this indicates that these families can support extensive sequence polymorphism and could thus provide a potential source of antigenic variation. It is thus possible that the members of the PGRS and MPTR subfamilies have evolved to function as a source of antigenic variation, a function which probably differs from the original function still performed by the members of the "ancestral-type" subgroup (including the members present within and associated with the ESAT-6 (esx) gene cluster regions). The genome sequencing of other members of the M. tuberculosis complex that is currently being performed (M. microti, M. africanum, and M. canettii) will undoubtedly shed more light on the variation observed between the orthologues of these two large polymorphic subfamilies.
Presence of the PPE-MPTR's in members of the genus Mycobacterium
In order to confirm the exclusive expansion of the PPE-MPTR subfamily in the genomes of members of the M. tuberculosis complex and species closely-related to it, we performed Southern blot analyses of different mycobacterial species using two selected PPE-MPTR gene probes (Table 5), and compared this to previously published data on the distribution of the MPTR repeat sequence. A probe for the mycosin gene mycP5 (Rv1796), was also selected to be used as a marker for the presence or absence of ESAT-6 (esx) gene cluster region 5 within the genomes of these different species. The mycosins are a family of subtilisin-like serine proteases found within the ESAT-6 (esx) gene cluster regions (Figure 1) [1, 77, 78] and represent the most conserved genes within the ESAT-6 (esx) cluster regions when orthologues of different species are compared (data not shown). The Southern blot analysis was done with genomic DNA of species of both the fast- and slow-growing mycobacterial groups (see Figure 3 and Table 6) and the results are summarized in Figure 10.
The first analysis was done using the probe for mycP5, the mycosin present in ESAT-6 (esx) gene cluster region 5. This probe gave an indication of the distribution of the ESAT-6 (esx) gene cluster region 5 within the genomes of other mycobacterial species, as region 5 was hypothesized in this study to be the origin of both the SVP and MPTR subfamilies of the PPE gene family. The results showed that the ESAT-6 (esx) gene cluster region 5 was only present within the genomes of the slow-growing mycobacterial species tested. The only exception for this is the slow-growing species M. nonchromogenicum, which might have undergone a deletion of this region. No hybridization was found with any members of the fast-growing group except for M. chitae, indicating either that the ESAT-6 (esx) gene cluster region 5 is absent from the genomes of these species, or that the species are evolutionarily so far removed from the slow-growers that the gene homology was insufficient to allow hybridization under the stringent conditions used in the analysis. Given the absence of region 5 in the genomes of M. smegmatis, M. flavescens, M. vanbaalenii, M. sp. KMS, M. sp. MCS and M. sp. JLS, it is highly likely that this region is absent from all fast-growing species and that these species have diverged before the duplication of region 5.
In order to obtain insight into the expansion and distribution of the PPE-MPTR subfamily within the slow-growing mycobacterial species, we used the two genes Rv1917c (PPE34) and Rv1753c (PPE24) as representatives of the PPE-MPTR sublineage (V) for Southern hybridization analysis. The hybridization signals were specific and appeared to be restricted to specific members of the slow-growing mycobacterial group within and surrounding the M. tuberculosis complex, namely M. gordonae, M. asiaticum, M. tuberculosis, M. bovis and M. africanum in the case of Rv1917c, and M. tuberculosis, M. bovis and M. africanum in the case of Rv1753c (Figure 10). The fact that neither Rv1917c nor Rv1753c hybridized to M. marinum and M. ulcerans is in agreement with the genome sequencing data, which indicated the absence of both of these genes from the genomes of these species. The results also confirm the absence of these genes from the genomes of the members of the M. avium complex. Furthermore, the results compared favorably to previously published data (see Column 4, Figure 10) in which the MPTR repeat region probe was used for hybridization, and in which only species situated in the M. tuberculosis complex, or closely related to the complex, were identified.
Previously published hybridization data on the PGRS repeat sequence [15, 75] also confirm the broader distribution and earlier expansion of this subfamily in comparison to the PPE-MPTR subfamily within the slow-growing members of the genus Mycobacterium (see Column 5, Figure 10). These data support the evolutionary history proposed in this study, with the expansion of the PGRS subfamily (after the divergence of the M. avium complex) preceding that of the MPTR subfamily.
In summary, the hybridization results support the proposed phylogenetic relationships of the gene families, and are likely to reflect evolutionary divergence/branch points of different mycobacterial species, interspersed by periods of PE/PPE/ESAT-6 duplication and expansion.
Phylogenetic reconstruction of the evolutionary history of the PE and PPE gene families suggests that the first pair of these genes was initially inserted into the ESAT-6 (esx) gene cluster region 1, and has subsequently been duplicated along with the regions (Figure 11). After each main duplication event involving a complete ESAT-6 (esx) gene cluster region, a number of secondary subduplications of the PE and PPE genes (in some cases associated with a copy of the ESAT-6 and CFP-10 genes) occurred from the newly duplicated ESAT-6 (esx) gene cluster region. This phenomenon is predicted to have culminated in the duplication of the ESAT-6 (esx) gene cluster region 5, from which a large number of PE and PPE genes (the so-called SVP subfamily of the PPE gene family) were duplicated separately to the rest of the genome. Furthermore, the evolutionary history predicted by the phylogenetic trees suggests that the highly duplicated PE_PGRS subfamily, and subsequently the PPE-MPTR subfamily, originated from a duplication from ESAT-6 (esx) gene cluster region 5. It thus seems that the PE and PPE genes present within region 5 have an enhanced propensity for duplication, their mobility driving the expansion of these genes into the highly polymorphic PGRS and MPTR subfamilies, respectively.
The data presented in this study suggest that the PE_PGRS subfamily expansion preceded the emergence of the PPE-MPTR subfamily. A possible explanation for this observation comes from the fact that there is some resemblance between the MPTR repeat sequence (GCCGGTGTTG) and the complementary sequence of the core region of two PGRS repeat elements arranged in tandem (TTGCCGCCGTTG CCGCCG) [15, 17]. This may indicate a potential role for the C-terminal PGRS repeat of the PE gene family in the emergence of the C-terminal MPTR element of the PPE gene family, and may point to an evolutionary event involving insertion/recombination between the two gene families, followed by expansion of the MPTR subfamily. In support of this, Adindla and Guruprasad have identified three PPE-MPTR proteins (Rv1800/PPE28, Rv3539/PPE63 and Rv2608/PPE42) which showed sequence similarity to five PE proteins (Rv1430/PE16, Rv0151/PE1, Rv0152/PE2, Rv0159/PE3 and Rv0160/PE4) corresponding to a 225 amino acid C-terminal region, which they named the "PE-PPE domain". Although not identified as true PGRS-containing PE genes, all five of these genes form part of sublineage V (the PGRS-containing sublineage) and may therefore represent precursors to the PE_PGRS sequences. There are thus some genes from the PE and MPTR subfamilies which share levels of homology in their C-termini. This is further supported by the data from Pajon et al., which showed that a large proportion of the members of the PE_PGRS and PPE-MPTR subfamilies share beta-barrel outer-membrane protein structures, and that one of these outer-membrane anchoring domains consists of the proposed conserved "PE-PPE domain" identified by Adindla and Guruprasad.
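One simple way to put a number on the resemblance between the two repeat sequences quoted above is an ungapped sliding comparison of the MPTR repeat against the complementary strand of the tandem PGRS core. The sketch below is illustrative only and is not the method used in the cited studies. It assumes that the space in the printed PGRS sequence is a typesetting break, and it checks both the plain complement and the reverse complement because the text does not specify the orientation. The function names are hypothetical.

```python
# Map each nucleotide to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Complement of a DNA sequence, read in the same direction."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Complement of a DNA sequence, read 5' to 3' on the opposite strand."""
    return complement(seq)[::-1]


def best_ungapped_match(query, target):
    """Slide `query` along `target` without gaps.

    Returns (matches, offset) for the fully overlapping window with the
    most identical positions; ties keep the leftmost window.
    """
    best = (0, 0)
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        matches = sum(a == b for a, b in zip(query, window))
        if matches > best[0]:
            best = (matches, offset)
    return best


MPTR = "GCCGGTGTTG"
# Assumption: the space in the printed sequence is a line-break artifact.
PGRS_TANDEM = "TTGCCGCCGTTGCCGCCG"

for label, target in (
    ("complement", complement(PGRS_TANDEM)),
    ("reverse complement", reverse_complement(PGRS_TANDEM)),
):
    matches, offset = best_ungapped_match(MPTR, target)
    print(f"{label}: {matches}/{len(MPTR)} identities at offset {offset}")
```

An ungapped identity count is a crude measure; a proper assessment would use a local alignment with gaps and a scoring matrix.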
A number of recent studies using diverse approaches have shown that the ESAT-6 (esx) gene clusters encode a novel secretory apparatus [1–5, 50]. Most recently, the demonstration by Okkels et al. that Rv3873 (PPE68), the PPE gene present in the RD1 region, is a potent T-cell antigen led these authors to speculate that the ESAT-6 (esx) gene cluster promotes the presentation of key antigens, including members of the PE and PPE protein families, to the host immune system. It is tempting to speculate that the ESAT-6/CFP-10 loci together with their associated PE/PPE genes represent what might be thought of as an "immunogenicity island". Further studies are under way to determine whether the ESAT-6 (esx) gene cluster regions are able to secrete members of the PE and PPE protein families, whether this secretion is specific for members of the "ancestral-type" group found in the cluster regions, and whether the recently evolved PGRS/MPTR types can also use this secretion system.
The large number of genes within the PE and PPE gene families has confounded past attempts to choose representative members of the families for further analysis. This study provides a logical starting point by defining the evolutionary history of the gene families, and elucidating the relationships and specific features of the different subgroups. An informed choice concerning candidate genes for further study can now be made, based on the position of the member on the evolutionary tree, its association or not with the ESAT-6 gene clusters, and subgroup-specific features. In this way, studies based upon a random choice of members, which may be biased in not being representative of the whole spectrum of different members within these families, can be avoided. It also provides the opportunity to study subgroups instead of individual members, to determine what functional differences, if any, exist between these different subgroups.
In conclusion, we aimed to investigate the evolutionary history of the PE and PPE gene families in relation to their observed association with four of the five ESAT-6 (esx) gene cluster regions. We have demonstrated that the expansion of the PE and PPE families is linked to the duplications of the ESAT-6 (esx) gene clusters. We have also shown that this association has led to the absence of multiple duplications of the PE and PPE families, including the total absence of the multigene PE_PGRS and PPE-MPTR subfamilies, in the fast-growing mycobacteria, including M. smegmatis. We have shown that the expansion of the PE_PGRS and PPE-MPTR subfamilies took place after the divergence of the M. avium complex, and that the PGRS and the MPTR expansions started before the divergence of M. leprae and M. marinum, respectively. This study contributes to the understanding of the PE and PPE gene families in terms of their stability, the absence or presence of the PE and PPE genes within the genomes of various mycobacteria, and their association with the ESAT-6 (esx) gene clusters. The results of this study also provide a logical starting point for the selection of candidates for further study of these large multigene families.
Genome sequence data and comparative genomics analyses
Annotations, descriptions, gene and protein sequences of individual genes belonging to the PE and PPE families were obtained from the publicly available finished and unfinished genome sequence databases of the organisms listed in Table 1. For comparative genomics, the genome sequence databases were compared to that of M. tuberculosis H37Rv, in order to identify orthologous genes. BLAST similarity searches  using the respective M. tuberculosis H37Rv protein sequences and the tblastn algorithm were performed using the WU-BLAST version 2.0 (Gish, W. 1996–2005 – ) server in the database search services of the TIGR , Sanger Centre  and Genolist (Pasteur Institute)  websites. To confirm the identity of the resulting sequences, open reading frames adjacent to the identified genes were examined to determine if they matched the genes surrounding the corresponding M. tuberculosis PE and PPE genes, thereby confirming the identity of the orthologue. The unfinished genome sequences were examined in a similar manner, but were not analyzed in detail as sequencing is still incomplete.
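The orthologue confirmation step described above (best similarity hit plus matching adjacent open reading frames) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure as implemented in the study: the function names, the requirement of at least one matching flanking gene, and the placeholder identifiers `flank_left` and `flank_right` are all hypothetical, and the best-hit table stands in for parsed tblastn results.

```python
def flanking(gene_order, gene):
    """Return the genes immediately before and after `gene` (None at ends)."""
    i = gene_order.index(gene)
    left = gene_order[i - 1] if i > 0 else None
    right = gene_order[i + 1] if i + 1 < len(gene_order) else None
    return left, right


def synteny_supported(ref_order, ref_gene, query_order, query_gene, best_hit):
    """Check that `query_gene` is a positional orthologue of `ref_gene`.

    `best_hit` maps each query gene to its best-matching reference gene
    (for example, the top hit of a similarity search). The candidate is
    accepted when its own best hit is `ref_gene` and at least one of its
    neighbours maps to a neighbour of `ref_gene`. Gene orientation is
    ignored. The one-neighbour threshold is an illustrative assumption.
    """
    ref_neighbours = {g for g in flanking(ref_order, ref_gene) if g is not None}
    query_neighbours = {
        best_hit.get(g) for g in flanking(query_order, query_gene) if g is not None
    }
    query_neighbours.discard(None)
    same_best_hit = best_hit.get(query_gene) == ref_gene
    return same_best_hit and bool(ref_neighbours & query_neighbours)


# Reference gene order around Rv0442c, as described in the text.
ref_order = ["Rv0441c", "Rv0442c", "Rv0443"]

# Query locus: NT03MA4150 with two hypothetical placeholder neighbours.
query_order = ["flank_left", "NT03MA4150", "flank_right"]
best_hit = {
    "flank_left": "Rv0441c",
    "NT03MA4150": "Rv0442c",
    "flank_right": "Rv0443",
}

print(synteny_supported(ref_order, "Rv0442c", query_order, "NT03MA4150", best_hit))
```

Requiring agreement in the flanking genes guards against assigning orthology to a paralogue that merely shares the conserved N-terminal domain, which matters in families as large as PE and PPE.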
Phylogenetic tree of all the members of the genus Mycobacterium
The 16S rRNA gene sequences of 83 species of the genus Mycobacterium, as well as the species Gordonia aichiensis, were used to generate a phylogenetic tree of the genus Mycobacterium. All species were selected from the Ribosomal Database Project-II Release 9 to be type strains containing only near-full-length 16S rRNA sequences (>1200 bases, no short partials), except for the species M. chelonae, M. sphagni, M. abscessus, M. confluentis, M. genavense, M. interjectum, M. intermedium, M. marinum, M. ulcerans, M. haemophilum, M. acapulcensis, M. lentiflavum, M. pulveris, M. manitobense, M. monacense, M. brumae, and M. moriokaense, which did not have any type strains with a near-full-length sequence of longer than 1200 bases available in the database. For some of these species (M. abscessus, M. confluentis, M. marinum), sequences from type strains from the German Collection of Microorganisms and Cell Cultures (DSM) were available and could thus be used. For the rest, representatives with sequences of longer than 1200 bases were chosen according to correct alignment with type strains. The following strains were chosen for all species (type strain indicated by the letter T in brackets after the name): M. abscessus (T); DSM 44196, M. acapulcensis; ATCC 14473, M. aichiense (T); ATCC 27280, M. alvei (T); CIP 103464, M. asiaticum (T); ATCC 25276, M. aurum (T); ATCC 23366, M. austroafricanum (T); ATCC 33464, M. avium subsp. paratuberculosis (T); ATCC 19698, M. botniense (T); E347, M. brumae; ATCC 51384, M. celatum (T); L08169, M. chelonae; ATCC 35752, M. chitae (T); ATCC 19627, M. chlorophenolicum (T); PCP-I, M. chubuense (T); ATCC 27278, M. confluentis (T); DSM 44017, M. cookii (T); ATCC 49103 (T) = NZ2., M. diernhoferi (T); ATCC 19340, M. doricum (T); FI-13295, M. duvalii (T); ATCC 43910, M. elephantis (T); AJ010747, M. fallax (T); M29562, M. farcinogenes (T); ATCC 35753, M. flavescens (T); ATCC 14474, M. fortuitum (T); ATCC 6841, M. frederiksbergense (T); DSM 44346, M. gadium (T); ATCC 27726, M. gastri (T); ATCC 15754, M. genavense; X60070, M. gilvum (T); ATCC 43909, M. goodii (T); M069, M. gordonae (T); ATCC 14470, M. haemophilum; X88923, M. heckeshornense (T); S369, M. heidelbergense (T); 2554/91, M. hiberniae (T); ATCC 9874, M. hodleri (T); DSM 44183, M. holsaticum (T); 1406, M. interjectum; X70961, M. intermedium; X67847, M. intracellulare (T); ATCC 15985, M. kansasii (T); M29575, M. komossense (T); ATCC 33013, M. kubicae (T); CDC 941078, M. lacus (T); NRCM 00-255, M. lentiflavum; ATCC 51985, M. leprae (T); X53999, M. malmoense (T); ATCC 29571, M. manitobense; NRCM 01-154, M. marinum (T); DSM 44344, M. monacense; B9-21-178, M. moriokaense; DSM 44221T, M. neoaurum (T); M29564, M. nonchromogenicum (T); ATCC 19530, M. novocastrense (T); 73, M. obuense (T); ATCC 27023, M. palustre (T); E846, M. parafortuitum (T); DSM 43528, M. peregrinum (T); ATCC 14467, M. phlei (T); M29566, M. pulveris; DSM 44222T, M. scrofulaceum (T); ATCC 19981, M. senegalense (T); M29567, M. septicum (T); W4964, M. shimoidei (T); ATCC 27962, M. shottsii (T); M175, M. simiae (T); ATCC 25275, M. smegmatis (T); ATCC 19420, M. sp. KMS; AY083217, M. sp. MCS; CP000384, M. sp. JLS; AF387804, M. sphagni; ATCC 33026, M. szulgai (T); ATCC 25799, M. terrae (T); ATCC 15755, M. thermoresistibile (T); M29570, M. triviale (T); ATCC 23292, M. tuberculosis (T); H37/Rv, M. tusciae (T); FI-25796, M. ulcerans; X58954, M. vaccae (T); ATCC 15483, M. vanbaalenii (T); DSM 7251 = PYR-1, M. wolinskyi (T); 700010, M. xenopi (T); M61664, G. aichiensis (T); ATCC 33611T. Multiple sequence alignments of these gene sequences were done using ClustalW 1.8 on the WWW server at the European Bioinformatics Institute website [86, 87]. The alignments were manually checked for errors and refined where appropriate using BioEdit version 5.0.9. The final tree was taken as the strict consensus of the 230 most parsimonious trees generated using PAUP 4.0b10 (heuristic search, gaps = fifth state) from the 1286 aligned nucleotides of the 16S rRNA DNA sequence of the 83 species of the genus Mycobacterium, with the sequence of the species Gordonia aichiensis as the outgroup.
Clean-up and generation of PE and PPE datasets
The phylogenetic reconstruction of the evolutionary relationships of the members of the PE and PPE protein families of M. tuberculosis H37Rv was done by analyses of four separate datasets. Clean-up of sample sets involved preliminary alignment to check for sequence instability or misalignments, as well as confirmation of gene annotation by comparative analyses. The first two datasets included the protein sequences of all the members of the PE and PPE protein families, respectively, that are present within the four ESAT-6 (esx) gene clusters in the genome of M. tuberculosis H37Rv.
The third dataset comprised the protein sequences of the sixty-nine members of the PPE family in the M. tuberculosis H37Rv database. Eleven of the predicted PPE proteins did not contain the characteristic N-terminal PPE motif. However, in six of these (Rv0305c/PPE6, Rv3425/PPE57, Rv3426/PPE58, Rv3429/PPE59, Rv3539/PPE63 and Rv3892c/PPE69) this was only due to a substitution in one of the two proline residues in the conserved motif. These six protein sequences could thus be reliably aligned to the rest of the family members due to a high percentage of sequence homology and were included in the dataset. The other five proteins (Rv0304c/PPE5, Rv0354c/PPE7, Rv2353c/PPE39, Rv3021c/PPE47 and Rv3738c/PPE66) were excluded from the analysis as it was found that their upstream regions were disrupted by either IS6110 insertion or apparent frameshift mutations, and they could thus not be aligned for phylogenetic analyses.
The fourth dataset contained the protein sequences of the ninety-nine members of the PE family in the M. tuberculosis H37Rv database. One of the members of the predicted PE family (Rv3020c) was found to have been annotated incorrectly as a PE by Cole et al., while two other members (Rv3018A/PE27A and Rv2126c/PE_PGRS37) could not be reliably aligned due to a loss of the N-terminal conserved region, and all three were thus excluded from further analyses. Six members (Rv0833/PE_PGRS13, Rv1089/PE10, Rv2098c/PE_PGRS36, Rv3344c/PE_PGRS49, Rv3512/PE_PGRS56, and Rv3653/PE_PGRS61), which also did not have conserved N-termini, were shown to be situated adjacent to a gene encoding the N-terminus (Rv0832/PE_PGRS12, Rv1088/PE9, Rv2099c/PE21, Rv3345c/PE_PGRS50, Rv3511/PE_PGRS55, and Rv3652/PE_PGRS60, respectively). Closer inspection of this organization suggested that each of these gene pairs in fact represented one gene that was split by stop codon formation during frameshifting. Thus, each pair of proteins from this group was combined and included as one protein sequence in the analyses. Stop codons were left out of these combined sequences.
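The merging of split gene pairs described above amounts to joining the N-terminal and C-terminal fragments and dropping the stop codon between them. The following is a minimal sketch, assuming protein sequences are held as plain strings with `*` marking a translated stop codon. The function names and the toy sequences are hypothetical; the gene pairs are those listed in the text.

```python
# (N-terminal gene, C-terminal gene) pairs as listed in the text.
SPLIT_PAIRS = [
    ("Rv0832/PE_PGRS12", "Rv0833/PE_PGRS13"),
    ("Rv1088/PE9", "Rv1089/PE10"),
    ("Rv2099c/PE21", "Rv2098c/PE_PGRS36"),
    ("Rv3345c/PE_PGRS50", "Rv3344c/PE_PGRS49"),
    ("Rv3511/PE_PGRS55", "Rv3512/PE_PGRS56"),
    ("Rv3652/PE_PGRS60", "Rv3653/PE_PGRS61"),
]


def merge_split_pair(n_terminal, c_terminal):
    """Join two protein fragments and remove stop-codon symbols ('*')."""
    return (n_terminal + c_terminal).replace("*", "")


def merge_all(pairs, sequences):
    """Return one combined sequence per pair, keyed by 'N-gene+C-gene'.

    Pairs whose members are missing from `sequences` are skipped.
    """
    merged = {}
    for n_gene, c_gene in pairs:
        if n_gene in sequences and c_gene in sequences:
            merged[f"{n_gene}+{c_gene}"] = merge_split_pair(
                sequences[n_gene], sequences[c_gene]
            )
    return merged


# Toy sequences for illustration only; these are not the real proteins.
toy_sequences = {
    "Rv1088/PE9": "MSFVTT*",
    "Rv1089/PE10": "GAGGAG*",
}

print(merge_all(SPLIT_PAIRS, toy_sequences))
```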
Multiple sequence alignments
Due to the highly polymorphic nature of the C-terminal region of the PE and PPE proteins, the conserved N-terminal domains of 100 aa and 180 aa for the PE and PPE proteins, respectively, were initially used to construct the multiple sequence alignments. Multiple sequence alignments of the protein sequences of the ninety-six PE and sixty-four PPE proteins were done using ClustalW 1.8 on the WWW server at the European Bioinformatics Institute website [86, 87]. The alignments were manually checked for errors and refined where appropriate. Subsequent alignments using the complete sequences (containing both conserved N- and polymorphic C-terminal regions) were done to confirm results obtained with only conserved N-termini.
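The trimming step described above can be sketched as follows. This assumes the conserved domain is taken simply as the first 100 (PE) or 180 (PPE) residues of each protein; the dictionary and function names are hypothetical, and the actual alignments were produced with ClustalW and refined by hand.

```python
# Domain lengths as stated in the text; the lookup table itself is illustrative.
N_TERMINAL_LENGTH = {"PE": 100, "PPE": 180}


def n_terminal_domain(sequence: str, family: str) -> str:
    """Return the leading residues used as the conserved N-terminal domain."""
    if family not in N_TERMINAL_LENGTH:
        raise ValueError(f"unknown family: {family}")
    # Slicing never fails on short sequences; they are returned unchanged.
    return sequence[: N_TERMINAL_LENGTH[family]]


if __name__ == "__main__":
    toy_protein = "M" + "A" * 299  # placeholder 300-residue sequence
    print(len(n_terminal_domain(toy_protein, "PE")))
    print(len(n_terminal_domain(toy_protein, "PPE")))
```

Restricting the input in this way keeps the highly polymorphic C-terminal repeats from dominating the alignment, which is why the full-length alignments were used only as a confirmation.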
Phylogenetic analyses were done using the neighbor-joining algorithm in the program PAUP 4.0b10, and 1000 replicates were generated for bootstrap resampling of the data. Confidence values for the internal topology of the trees were obtained from the resampling analyses, and only nodes occurring in over 50% of the trees were considered significant. All branches with a zero branch length were collapsed. Based on the evolutionary order defined for the ESAT-6 (esx) gene clusters and the results from the analysis of the genome sequence of M. smegmatis, the ancestral PE and PPE genes present within ESAT-6 (esx) gene cluster region 1 (Rv3872/PE35 and Rv3873/PPE68, respectively) were used as outgroups to root the trees. The consensus trees were calculated using the majority rule formula and were drawn using the program TreeView 1.5.
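The resampling and majority-rule steps can be illustrated with a small sketch. It is not PAUP's implementation: tree building itself is omitted, each replicate tree is assumed to be summarized as a set of clades (frozensets of taxon names), and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Create one bootstrap replicate by resampling columns with replacement."""
    n_cols = len(next(iter(alignment.values())))
    # The same resampled column indices are applied to every sequence.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def clade_support(replicate_clades: list, n_replicates: int, threshold: float = 0.5) -> dict:
    """Keep clades occurring in more than `threshold` of the replicate trees."""
    counts = Counter(clade for clades in replicate_clades for clade in clades)
    return {
        clade: count / n_replicates
        for clade, count in counts.items()
        if count / n_replicates > threshold
    }


if __name__ == "__main__":
    rng = random.Random(0)
    toy_alignment = {"seqA": "MKTLV", "seqB": "MRTLI", "seqC": "MKSLV"}  # placeholder data
    print(bootstrap_alignment(toy_alignment, rng))

    # Toy clade sets standing in for four replicate trees.
    trees = [
        {frozenset({"seqA", "seqC"})},
        {frozenset({"seqA", "seqC"})},
        {frozenset({"seqA", "seqB"})},
        {frozenset({"seqA", "seqC"})},
    ]
    print(clade_support(trees, n_replicates=4))
```

In a full analysis each replicate alignment would be passed to a neighbor-joining routine, and the clades retained by the 50% filter would define the majority-rule consensus tree.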
Comparative genomics for extent of sequence variation
To determine the extent of PE and PPE sequence variation and elucidate the differences between orthologues of subfamilies of these gene families in the genomes of M. tuberculosis H37Rv and CDC1551, a complete comparative analysis of the presence and absence of genes, gene sizes, frameshifts, insertions and deletions (indels), alternative start sites, protein mismatches and conservative substitutions was done.
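A comparison of this kind can be sketched for a single pair of pre-aligned orthologues. The residue groupings used to call a substitution "conservative" are an illustrative assumption, since no grouping is specified here, and the function name and category keys are hypothetical.

```python
# Illustrative physicochemical groupings; NOT taken from the study.
CONSERVATIVE_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ")


def compare_aligned(seq_a: str, seq_b: str, gap: str = "-") -> dict:
    """Tally identities, conservative substitutions, mismatches and gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"identical": 0, "conservative": 0, "mismatch": 0, "indel": 0}
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # column is empty in both sequences
        if a == gap or b == gap:
            counts["indel"] += 1  # counts gap columns, not indel events
        elif a == b:
            counts["identical"] += 1
        elif any(a in group and b in group for group in CONSERVATIVE_GROUPS):
            counts["conservative"] += 1
        else:
            counts["mismatch"] += 1
    return counts


if __name__ == "__main__":
    # Placeholder aligned fragments, not real H37Rv/CDC1551 sequences.
    print(compare_aligned("MKT-LVG", "MRTALIW"))
```

Frameshifts and alternative start sites cannot be detected from a protein alignment alone and would need the underlying nucleotide sequences.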
Primers and probes
The primers used to generate probes for Southern hybridization to genomic DNA are listed in Table 5. PPE-MPTR and mycP probes were generated using the selected primers to individually PCR amplify regions from the PPE-MPTR genes Rv1917c (PPE34) and Rv1753c (PPE24), as well as from the mycosin gene mycP5 (Rv1796).
Genomic DNA was isolated from different mycobacterial species (obtained from the American Type Culture Collection (ATCC), see Table 6) as previously described. Genomic DNA was digested with AluI or BstEII, electrophoretically fractionated, transferred by Southern blotting and hybridized as previously described. Probing of Southern blots was done using selected ECL-labeled probes as listed in Table 5.
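For readers who want to anticipate fragment sizes, a digest can be simulated in silico. The sketch below is not part of the described protocol: it assumes a linear sequence, uses the standard recognition sites AG^CT (AluI) and G^GTNACC (BstEII), and the table and function names are hypothetical.

```python
import re

# Recognition pattern and cut offset (bases from the start of the site).
ENZYMES = {
    "AluI": ("AGCT", 2),            # AG^CT, blunt cutter
    "BstEII": ("GGT[ACGT]ACC", 1),  # G^GTNACC
}


def digest(sequence: str, enzyme: str) -> list:
    """Return fragment lengths for a linear sequence cut by one enzyme."""
    pattern, offset = ENZYMES[enzyme]
    sequence = sequence.upper()
    # A lookahead is used so that overlapping sites are also found. Both sites
    # are palindromic, so scanning one strand is sufficient.
    cuts = [m.start() + offset for m in re.finditer(f"(?={pattern})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    toy_dna = "TTAGCTAAGGTCACCTT"  # placeholder sequence, not genomic data
    print("AluI:", digest(toy_dna, "AluI"))
    print("BstEII:", digest(toy_dna, "BstEII"))
```

On a Southern blot only the fragments overlapping the probe produce a signal, so a simulated digest indicates possible band sizes rather than the full banding pattern.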
Abbreviations
PE: protein family characterized by Proline-Glutamic Acid motif
PPE: protein family characterized by Proline-Proline-Glutamic Acid motif
PGRS: "polymorphic GC-rich-repetitive sequence" subfamily of the PE family
MPTR: "major polymorphic tandem repeat" subfamily of the PPE family
SVP: subfamily of the PPE family characterized by the motif Gly-X-X-Ser-Val-Pro-X-X-Trp
PPW: subfamily of the PPE family characterized by the motifs Gly-Phe-X-Gly-Thr and Pro-X-X-Pro-X-X-Trp
indels: insertions or deletions
ESAT-6: 6 kDa Early Secreted Antigenic Target (esx)
CFP-10: 10 kDa Culture Filtrate Protein
References
Gey van Pittius NC, Gamieldien J, Hide W, Brown GD, Siezen RJ, Beyers AD: The ESAT-6 gene cluster of Mycobacterium tuberculosis and other high G+C Gram-positive bacteria. Genome Biol. 2001, 2: 0044-10.1186/gb-2001-2-10-research0044.
Tekaia F, Gordon SV, Garnier T, Brosch R, Barrell BG, Cole ST: Analysis of the proteome of Mycobacterium tuberculosis in silico. Tuber Lung Dis. 1999, 79: 329-342. 10.1054/tuld.1999.0220.
Pallen MJ: The ESAT-6/WXG100 superfamily -- and a new Gram-positive secretion system?. Trends Microbiol. 2002, 10: 209-212. 10.1016/S0966-842X(02)02345-4.
Hsu T, Hingley-Wilson SM, Chen B, Chen M, Dai AZ, Morin PM, Marks CB, Padiyar J, Goulding C, Gingery M, Eisenberg D, Russell RG, Derrick SC, Collins FM, Morris SL, King CH, Jacobs WR: The primary mechanism of attenuation of bacillus Calmette-Guerin is a loss of secreted lytic function required for invasion of lung interstitial tissue. Proc Natl Acad Sci U S A. 2003, 100: 12420-12425. 10.1073/pnas.1635213100.
Pym AS, Brodin P, Majlessi L, Brosch R, Demangel C, Williams A, Griffiths KE, Marchal G, Leclerc C, Cole ST: Recombinant BCG exporting ESAT-6 confers enhanced protection against tuberculosis. Nat Med. 2003, 9: 533-539. 10.1038/nm859.
Stanley SA, Raghavan S, Hwang WW, Cox JS: Acute infection and macrophage subversion by Mycobacterium tuberculosis require a specialized secretion system. Proc Natl Acad Sci U S A. 2003, 100: 13001-13006. 10.1073/pnas.2235593100.
Brodin P, Rosenkrands I, Andersen P, Cole ST, Brosch R: ESAT-6 proteins: protective antigens and virulence factors?. Trends Microbiol. 2004, 12: 500-508. 10.1016/j.tim.2004.09.007.
Guinn KM, Hickey MJ, Mathur SK, Zakel KL, Grotzke JE, Lewinsohn DM, Smith S, Sherman DR: Individual RD1-region genes are required for export of ESAT-6/CFP-10 and for virulence of Mycobacterium tuberculosis. Mol Microbiol. 2004, 51: 359-370. 10.1046/j.1365-2958.2003.03844.x.
Converse SE, Cox JS: A protein secretion pathway critical for Mycobacterium tuberculosis virulence is conserved and functional in Mycobacterium smegmatis. J Bacteriol. 2005, 187: 1238-1245. 10.1128/JB.187.4.1238-1245.2005.
Fortune SM, Jaeger A, Sarracino DA, Chase MR, Sassetti CM, Sherman DR, Bloom BR, Rubin EJ: Mutually dependent secretion of proteins required for mycobacterial virulence. Proc Natl Acad Sci U S A. 2005, 102: 10676-10681. 10.1073/pnas.0504922102.
MacGurn JA, Raghavan S, Stanley SA, Cox JS: A non-RD1 gene cluster is required for Snm secretion in Mycobacterium tuberculosis. Mol Microbiol. 2005, 57: 1653-1663. 10.1111/j.1365-2958.2005.04800.x.
Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry CE, Tekaia F, Badcock K, Basham D, Brown D, Chillingworth T, Connor R, Davies R, Devlin K, Feltwell T, Gentles S, Hamlin N, Holroyd S, Hornsby T, Jagels K, Barrell BG: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998, 393: 537-544. 10.1038/31159.
Camus JC, Pryor MJ, Medigue C, Cole ST: Re-annotation of the genome sequence of Mycobacterium tuberculosis H37Rv. Microbiology. 2002, 148: 2967-2973.
Gordon SV, Eiglmeier K, Brosch R, Garnier T, Honore N, Barrell B, Cole ST: Genomics of Mycobacterium tuberculosis and Mycobacterium leprae. Mycobacteria: molecular biology and virulence. Edited by: Ratledge C and Dale J. 1999, Oxford, Blackwell Science Ltd, 93-109.
Poulet S, Cole ST: Characterization of the highly abundant polymorphic GC-rich-repetitive sequence (PGRS) present in Mycobacterium tuberculosis. Arch Microbiol. 1995, 163: 87-95.
Adindla S, Guruprasad L: Sequence analysis corresponding to the PPE and PE proteins in Mycobacterium tuberculosis and other genomes. J Biosci. 2003, 28: 169-179.
Hermans PW, van Soolingen D, van Embden JD: Characterization of a major polymorphic tandem repeat in Mycobacterium tuberculosis and its potential use in the epidemiology of Mycobacterium kansasii and Mycobacterium gordonae. J Bacteriol. 1992, 174: 4157-4165.
Cole ST, Barrell BG: Analysis of the genome of Mycobacterium tuberculosis H37Rv. Novartis Found Symp. 1998, 217: 160-172.
Doran TJ, Hodgson AL, Davies JK, Radford AJ: Characterisation of a novel repetitive DNA sequence from Mycobacterium bovis. FEMS Microbiol Lett. 1992, 75: 179-185. 10.1111/j.1574-6968.1992.tb05413.x.
Sampson SL, Lukey P, Warren RM, van Helden PD, Richardson M, Everett MJ: Expression, characterization and subcellular localization of the Mycobacterium tuberculosis PPE gene Rv1917c. Tuberculosis (Edinb). 2001, 81: 305-317. 10.1054/tube.2001.0304.
Brennan MJ, Delogu G, Chen Y, Bardarov S, Kriakov J, Alavi M, Jacobs WR: Evidence that mycobacterial PE_PGRS proteins are cell surface constituents that influence interactions with other cells. Infect Immun. 2001, 69: 7326-7333. 10.1128/IAI.69.12.7326-7333.2001.
Banu S, Honore N, Saint-Joanis B, Philpott D, Prevost MC, Cole ST: Are the PE-PGRS proteins of Mycobacterium tuberculosis variable surface antigens?. Mol Microbiol. 2002, 44: 9-19. 10.1046/j.1365-2958.2002.02813.x.
Delogu G, Pusceddu C, Bua A, Fadda G, Brennan MJ, Zanetti S: Rv1818c-encoded PE_PGRS protein of Mycobacterium tuberculosis is surface exposed and influences bacterial cell structure. Mol Microbiol. 2004, 52: 725-733. 10.1111/j.1365-2958.2004.04007.x.
Okkels LM, Brock I, Follmann F, Agger EM, Arend SM, Ottenhoff TH, Oftung F, Rosenkrands I, Andersen P: PPE protein (Rv3873) from DNA segment RD1 of Mycobacterium tuberculosis: strong recognition of both specific T-cell epitopes and epitopes conserved within the PPE family. Infect Immun. 2003, 71: 6116-6123. 10.1128/IAI.71.11.6116-6123.2003.
Le Moigne V, Robreau G, Borot C, Guesdon JL, Mahana W: Expression, immunochemical characterization and localization of the Mycobacterium tuberculosis protein p27. Tuberculosis (Edinb). 2005, 85: 213-219. 10.1016/j.tube.2005.02.002.
Pajon R, Yero D, Lage A, Llanes A, Borroto CJ: Computational identification of beta-barrel outer-membrane proteins in Mycobacterium tuberculosis predicted proteomes as putative vaccine candidates. Tuberculosis (Edinb). 2006, 86: 290-302. 10.1016/j.tube.2006.01.005.
Cole ST: Learning from the genome sequence of Mycobacterium tuberculosis H37Rv. FEBS Lett. 1999, 452: 7-10. 10.1016/S0014-5793(99)00536-0.
Gordon SV, Eiglmeier K, Garnier T, Brosch R, Parkhill J, Barrell B, Cole ST, Hewinson RG: Genomics of Mycobacterium bovis. Tuberculosis (Edinb). 2001, 81: 157-163. 10.1054/tube.2000.0269.
Fleischmann RD, Alland D, Eisen JA, Carpenter L, White O, Peterson J, DeBoy R, Dodson R, Gwinn M, Haft D, Hickey E, Kolonay JF, Nelson WC, Umayam LA, Ermolaeva M, Salzberg SL, Delcher A, Utterback T, Weidman J, Khouri H, Gill J, Mikula A, Bishai W, Jacobs JWR, Venter JC, Fraser CM: Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J Bacteriol. 2002, 184: 5479-5490. 10.1128/JB.184.19.5479-5490.2002.
Garnier T, Eiglmeier K, Camus JC, Medina N, Mansoor H, Pryor M, Duthoy S, Grondin S, Lacroix C, Monsempe C, Simon S, Harris B, Atkin R, Doggett J, Mayes R, Keating L, Wheeler PR, Parkhill J, Barrell BG, Cole ST, Gordon SV, Hewinson RG: The complete genome sequence of Mycobacterium bovis. Proc Natl Acad Sci U S A. 2003, 100: 7877-7882. 10.1073/pnas.1130426100.
Talarico S, Cave MD, Marrs CF, Foxman B, Zhang L, Yang Z: Variation of the Mycobacterium tuberculosis PE_PGRS33 Gene among Clinical Isolates. J Clin Microbiol. 2005, 43: 4954-4960. 10.1128/JCM.43.10.4954-4960.2005.
Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, Whittam TS, Musser JM: Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci U S A. 1997, 94: 9869-9874. 10.1073/pnas.94.18.9869.
Srivastava R, Kumar D, Waskar MN, Sharma M, Katoch VM, Srivastava BS: Identification of a repetitive sequence belonging to a PPE gene of Mycobacterium tuberculosis and its use in diagnosis of tuberculosis. J Med Microbiol. 2006, 55: 1071-1077. 10.1099/jmm.0.46379-0.
Flores J, Espitia C: Differential expression of PE and PE_PGRS genes in Mycobacterium tuberculosis strains. Gene. 2003, 318: 75-81. 10.1016/S0378-1119(03)00751-0.
Voskuil MI, Schnappinger D, Rutherford R, Liu Y, Schoolnik GK: Regulation of the Mycobacterium tuberculosis PE/PPE genes. Tuberculosis (Edinb). 2004, 84: 256-262. 10.1016/j.tube.2003.12.014.
Li Y, Miltner E, Wu M, Petrofsky M, Bermudez LE: A Mycobacterium avium PPE gene is associated with the ability of the bacterium to grow in macrophages and virulence in mice. Cell Microbiol. 2005, 7: 539-548. 10.1111/j.1462-5822.2004.00484.x.
Delogu G, Sanguinetti M, Pusceddu C, Bua A, Brennan MJ, Zanetti S, Fadda G: PE_PGRS proteins are differentially expressed by Mycobacterium tuberculosis in host tissues. Microbes Infect. 2006
Dheenadhayalan V, Delogu G, Sanguinetti M, Fadda G, Brennan MJ: Variable expression patterns of Mycobacterium tuberculosis PE_PGRS genes: evidence that PE_PGRS16 and PE_PGRS26 are inversely regulated in vivo. J Bacteriol. 2006, 188: 3721-3725. 10.1128/JB.188.10.3721-3725.2006.
Delogu G, Brennan MJ: Comparative immune response to PE and PE_PGRS antigens of Mycobacterium tuberculosis. Infect Immun. 2001, 69: 5606-5611. 10.1128/IAI.69.9.5606-5611.2001.
Dheenadhayalan V, Delogu G, Brennan MJ: Expression of the PE_PGRS 33 protein in Mycobacterium smegmatis triggers necrosis in macrophages and enhanced mycobacterial survival. Microbes Infect. 2005
Rodriguez GM, Gold B, Gomez M, Dussurget O, Smith I: Identification and characterization of two divergently transcribed iron regulated genes in Mycobacterium tuberculosis. Tuber Lung Dis. 1999, 79: 287-298. 10.1054/tuld.1999.0219.
Rodriguez GM, Voskuil MI, Gold B, Schoolnik GK, Smith I: ideR, An essential gene in Mycobacterium tuberculosis: role of IdeR in iron-dependent gene expression, iron metabolism, and oxidative stress response. Infect Immun. 2002, 70: 3371-3381. 10.1128/IAI.70.7.3371-3381.2002.
Abou-Zeid C, Garbe T, Lathigra R, Wiker HG, Harboe M, Rook GA, Young DB: Genetic and immunological analysis of Mycobacterium tuberculosis fibronectin-binding proteins. Infect Immun. 1991, 59: 2712-2718.
Espitia C, Laclette JP, Mondragon-Palomino M, Amador A, Campuzano J, Martens A, Singh M, Cicero R, Zhang Y, Moreno C: The PE-PGRS glycine-rich proteins of Mycobacterium tuberculosis: a new family of fibronectin-binding proteins?. Microbiology. 1999, 145 (Pt 12): 3487-3495.
Ramakrishnan L, Federspiel NA, Falkow S: Granuloma-specific expression of Mycobacterium virulence proteins from the glycine-rich PE-PGRS family. Science. 2000, 288: 1436-1439. 10.1126/science.288.5470.1436.
Camacho LR, Ensergueix D, Perez E, Gicquel B, Guilhot C: Identification of a virulence gene cluster of Mycobacterium tuberculosis by signature-tagged transposon mutagenesis. Mol Microbiol. 1999, 34: 257-267. 10.1046/j.1365-2958.1999.01593.x.
Sassetti CM, Boyd DH, Rubin EJ: Genes required for mycobacterial growth defined by high density mutagenesis. Mol Microbiol. 2003, 48: 77-84. 10.1046/j.1365-2958.2003.03425.x.
Sassetti CM, Rubin EJ: Genetic requirements for mycobacterial survival during infection. Proc Natl Acad Sci U S A. 2003, 100: 12989-12994. 10.1073/pnas.2134250100.
Pym AS, Brodin P, Brosch R, Huerre M, Cole ST: Loss of RD1 contributed to the attenuation of the live tuberculosis vaccines Mycobacterium bovis BCG and Mycobacterium microti. Mol Microbiol. 2002, 46: 709-717. 10.1046/j.1365-2958.2002.03237.x.
Lewis KN, Liao R, Guinn KM, Hickey MJ, Smith S, Behr MA, Sherman DR: Deletion of RD1 from Mycobacterium tuberculosis mimics bacille Calmette-Guerin attenuation. J Infect Dis. 2003, 187: 117-123. 10.1086/345862.
Volkman HE, Clay H, Beery D, Chang JC, Sherman DR, Ramakrishnan L: Tuberculous granuloma formation is enhanced by a mycobacterium virulence determinant. PLoS Biol. 2004, 2: e367-10.1371/journal.pbio.0020367.
Jain SK, Paul-Satyaseela M, Lamichhane G, Kim KS, Bishai WR: Mycobacterium tuberculosis invasion and traversal across an in vitro human blood-brain barrier as a pathogenic mechanism for central nervous system tuberculosis. J Infect Dis. 2006, 193: 1287-1295. 10.1086/502631.
Talaat AM, Lyons R, Howard ST, Johnston SA: The temporal expression profile of Mycobacterium tuberculosis infection in mice. Proc Natl Acad Sci U S A. 2004, 101: 4602-4607. 10.1073/pnas.0306023101.
Bentley SD, Chater KF, Cerdeno-Tarraga AM, Challis GL, Thomson NR, James KD, Harris DE, Quail MA, Kieser H, Harper D, Bateman A, Brown S, Chandra G, Chen CW, Collins M, Cronin A, Fraser A, Goble A, Hidalgo J, Hornsby T, Howarth S, Huang CH, Kieser T, Larke L, Murphy L, Oliver K, O'Neil S, Rabbinowitsch E, Rajandream MA, Rutherford K, Rutter S, Seeger K, Saunders D, Sharp S, Squares R, Squares S, Taylor K, Warren T, Wietzorrek A, Woodward J, Barrell BG, Parkhill J, Hopwood DA: Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature. 2002, 417: 141-147. 10.1038/417141a.
Cerdeno-Tarraga AM, Efstratiou A, Dover LG, Holden MT, Pallen M, Bentley SD, Besra GS, Churcher C, James KD, De Zoysa A, Chillingworth T, Cronin A, Dowd L, Feltwell T, Hamlin N, Holroyd S, Jagels K, Moule S, Quail MA, Rabbinowitsch E, Rutherford KM, Thomson NR, Unwin L, Whitehead S, Barrell BG, Parkhill J: The complete genome sequence and analysis of Corynebacterium diphtheriae NCTC13129. Nucleic Acids Res. 2003, 31: 6516-6523. 10.1093/nar/gkg874.
Ikeda H, Ishikawa J, Hanamoto A, Shinose M, Kikuchi H, Shiba T, Sakaki Y, Hattori M, Omura S: Complete genome sequence and comparative analysis of the industrial microorganism Streptomyces avermitilis. Nat Biotechnol. 2003, 21: 526-531. 10.1038/nbt820.
Kalinowski J, Bathe B, Bartels D, Bischoff N, Bott M, Burkovski A, Dusch N, Eggeling L, Eikmanns BJ, Gaigalat L, Goesmann A, Hartmann M, Huthmacher K, Kramer R, Linke B, McHardy AC, Meyer F, Mockel B, Pfefferle W, Puhler A, Rey DA, Ruckert C, Rupp O, Sahm H, Wendisch VF, Wiegrabe I, Tauch A: The complete Corynebacterium glutamicum ATCC 13032 genome sequence and its impact on the production of L-aspartate-derived amino acids and vitamins. J Biotechnol. 2003, 104: 5-25. 10.1016/S0168-1656(03)00154-8.
Nishio Y, Nakamura Y, Kawarabayasi Y, Usuda Y, Kimura E, Sugimoto S, Matsui K, Yamagishi A, Kikuchi H, Ikeo K, Gojobori T: Comparative complete genome sequence analysis of the amino acid replacements responsible for the thermostability of Corynebacterium efficiens. Genome Res. 2003, 13: 1572-1579. 10.1101/gr.1285603.
Ishikawa J, Yamashita A, Mikami Y, Hoshino Y, Kurita H, Hotta K, Shiba T, Hattori M: The complete genomic sequence of Nocardia farcinica IFM 10152. Proc Natl Acad Sci U S A. 2004, 101: 14925-14930. 10.1073/pnas.0406410101.
Tauch A, Kaiser O, Hain T, Goesmann A, Weisshaar B, Albersmeier A, Bekel T, Bischoff N, Brune I, Chakraborty T, Kalinowski J, Meyer F, Rupp O, Schneiker S, Viehoever P, Puhler A: Complete genome sequence and analysis of the multiresistant nosocomial pathogen Corynebacterium jeikeium K411, a lipid-requiring bacterium of the human skin flora. J Bacteriol. 2005, 187: 4671-4682. 10.1128/JB.187.13.4671-4682.2005.
Pitulle C, Dorsch M, Kazda J, Wolters J, Stackebrandt E: Phylogeny of rapidly growing members of the genus Mycobacterium. Int J Syst Bacteriol. 1992, 42: 337-343.
Shinnick TM, Good RC: Mycobacterial taxonomy. Eur J Clin Microbiol Infect Dis. 1994, 13: 884-901. 10.1007/BF02111489.
Springer B, Stockman L, Teschner K, Roberts GD, Bottger EC: Two-laboratory collaborative study on identification of mycobacteria: molecular versus phenotypic methods. J Clin Microbiol. 1996, 34: 296-303.
Tundup S, Akhter Y, Thiagarajan D, Hasnain SE: Clusters of PE and PPE genes of Mycobacterium tuberculosis are organized in operons: Evidence that PE Rv2431c is co-transcribed with PPE Rv2430c and their gene products interact with each other. FEBS Lett. 2006
Reyrat JM, Kahn D: Mycobacterium smegmatis: an absurd model for tuberculosis?. Trends Microbiol. 2001, 9: 472-473. 10.1016/S0966-842X(01)02168-0.
Cole ST, Eiglmeier K, Parkhill J, James KD, Thomson NR, Wheeler PR, Honore N, Garnier T, Churcher C, Harris D, Mungall K, Basham D, Brown D, Chillingworth T, Connor R, Davies RM, Devlin K, Duthoy S, Feltwell T, Fraser A, Hamlin N, Holroyd S, Hornsby T, Jagels K, Lacroix C, Maclean J, Moule S, Murphy L, Oliver K, Quail MA, Rajandream MA, Rutherford KM, Rutter S, Seeger K, Simon S, Simmonds M, Skelton J, Squares R, Squares S, Stevens K, Taylor K, Whitehead S, Woodward JR, Barrell BG: Massive gene decay in the leprosy bacillus. Nature. 2001, 409: 1007-1011. 10.1038/35059006.
Strong M, Sawaya MR, Wang S, Phillips M, Cascio D, Eisenberg D: Toward the structural genomics of complexes: crystal structure of a PE/PPE protein complex from Mycobacterium tuberculosis. Proc Natl Acad Sci U S A. 2006, 103: 8060-8065. 10.1073/pnas.0602606103.
Betts JC, Lukey PT, Robb LC, McAdam RA, Duncan K: Evaluation of a nutrient starvation model of Mycobacterium tuberculosis persistence by gene and protein expression profiling. Mol Microbiol. 2002, 43: 717-731. 10.1046/j.1365-2958.2002.02779.x.
Renshaw PS, Panagiotidou P, Whelan A, Gordon SV, Hewinson RG, Williamson RA, Carr MD: Conclusive evidence that the major T-cell antigens of the Mycobacterium tuberculosis complex ESAT-6 and CFP-10 form a tight, 1:1 complex and characterization of the structural properties of ESAT-6, CFP-10, and the ESAT-6*CFP-10 complex. Implications for pathogenesis and virulence. J Biol Chem. 2002, 277: 21598-21603. 10.1074/jbc.M201625200.
Lightbody KL, Renshaw PS, Collins ML, Wright RL, Hunt DM, Gordon SV, Hewinson RG, Buxton RS, Williamson RA, Carr MD: Characterisation of complex formation between members of the Mycobacterium tuberculosis complex CFP-10/ESAT-6 protein family: towards an understanding of the rules governing complex formation and thereby functional flexibility. FEMS Microbiol Lett. 2004, 238: 255-262.
Renshaw PS, Veverka V, Kelly G, Frenkiel TA, Williamson RA, Gordon SV, Hewinson RG, Carr MD: Sequence-specific assignment and secondary structure determination of the 195-residue complex formed by the Mycobacterium tuberculosis proteins CFP-10 and ESAT-6. J Biomol NMR. 2004, 30: 225-226. 10.1023/B:JNMR.0000048852.40853.5c.
Renshaw PS, Lightbody KL, Veverka V, Muskett FW, Kelly G, Frenkiel TA, Gordon SV, Hewinson RG, Burke B, Norman J, Williamson RA, Carr MD: Structure and function of the complex formed by the tuberculosis virulence factors CFP-10 and ESAT-6. EMBO J. 2005, 24: 2491-2498. 10.1038/sj.emboj.7600732.
Okkels LM, Andersen P: Protein-protein interactions of proteins from the ESAT-6 family of Mycobacterium tuberculosis. J Bacteriol. 2004, 186: 2487-2491. 10.1128/JB.186.8.2487-2491.2004.
Li L, Bannantine JP, Zhang Q, Amonsin A, May BJ, Alt D, Banerji N, Kanjilal S, Kapur V: The complete genome sequence of Mycobacterium avium subspecies paratuberculosis. Proc Natl Acad Sci U S A. 2005, 102: 12344-12349. 10.1073/pnas.0505662102.
Ross BC, Raios K, Jackson K, Dwyer B: Molecular cloning of a highly repeated DNA element from Mycobacterium tuberculosis and its use as an epidemiological tool. J Clin Microbiol. 1992, 30: 942-946.
Cole ST, Supply P, Honore N: Repetitive sequences in Mycobacterium leprae and their impact on genome plasticity. Lepr Rev. 2001, 72: 449-461.
Brown GD, Dave JA, Gey van Pittius NC, Stevens L, Ehlers MR, Beyers AD: The mycosins of Mycobacterium tuberculosis H37Rv: a family of subtilisin-like serine proteases. Gene. 2000, 254: 147-155. 10.1016/S0378-1119(00)00277-8.
Dave JA, Gey van Pittius NC, Beyers AD, Ehlers MR, Brown GD: Mycosin-1, a subtilisin-like serine protease of Mycobacterium tuberculosis, is cell wall-associated and expressed during infection of macrophages. BMC Microbiol. 2002, 2: 30-10.1186/1471-2180-2-30.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.
WU-BLAST version 2.0. [http://blast.wustl.edu/]
The Institute for Genomics Research (TIGR). [http://www.tigr.org]
Sanger Centre. [http://www.sanger.ac.uk]
Genolist (Pasteur Institute). [http://genolist.pasteur.fr/]
Ribosomal Database Project-II Release 9. [http://rdp.cme.msu.edu/]
German Collection of Microorganisms and Cell Cultures (DSM). [http://www.dsmz.de/]
ClustalW 1.8 (European Bioinformatics Institute). [http://www.ebi.ac.uk/clustalw/]
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.
Hall TA: BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser. 1999, 41: 95-98.
Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. 1998, Sinauer Associates, Sunderland, Massachusetts.
Felsenstein J: PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5: 164-166.
Page RD: TreeView: an application to display phylogenetic trees on personal computers. Comput Appl Biosci. 1996, 12: 357-358.
Ausubel FM, Brent R, Kingston RE, Moore DD, Seidman JG, Smith JA, Struhl K: Current protocols in molecular biology - Section 2.4. 1989
Sambrook J, Fritsch EF, Maniatis T: Molecular Cloning: A Laboratory Manual. 1989, New York, Cold Spring Harbor, 2nd edition.
Leclerc MC, Haddad N, Moreau R, Thorel MF: Molecular characterization of environmental Mycobacterium strains by PCR-restriction fragment length polymorphism of hsp65 and by sequencing of hsp65, and of 16S and ITS1 rDNA. Res Microbiol. 2000, 151: 629-638. 10.1016/S0923-2508(00)90129-3.
Leclerc MC, Thomas F, Guegan JF: Evidence for phylogenetic inheritance in pathogenicity of Mycobacterium. Antonie Van Leeuwenhoek. 2003, 83: 265-274. 10.1023/A:1023327929535.
Acknowledgements
This study was supported by the DST/NRF Centre of Excellence for Biomedical TB Research.
Authors' contributions
NCGvP conceived of and designed the study, carried out the sequence alignments, comparative genomics and phylogenetics, interpreted the results and drafted the manuscript. SLS helped conceive of the study, participated in its design, carried out sequence alignments and was involved in interpretation of the results and drafting of the manuscript. HL and YK carried out the DNA extractions and Southern hybridizations. PDvH and RMW participated in the design and coordination of the study, were involved in interpreting the results and helped to draft the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: M. ulcerans PE, PGRS and PPE genes. The data provided represent presence and absence of all orthologues of the members of the PE and PPE gene families of M. tuberculosis H37Rv in M. ulcerans (this file is the M. ulcerans equivalent to the data that is presented for M. avium paratuberculosis and M. leprae in Tables 3 and 4). (DOC 276 KB)
Additional file 2: Comparative genomics for gene size differences between M. tuberculosis H37Rv and CDC1551. The data in this table provide an overview of the reasons for size differences observed between annotated PE and PPE genes from the two M. tuberculosis genome databases, indicating that variation in size due to frameshifts, insertions and deletions is largely associated with the PE_PGRS and PPE-MPTR subfamilies. (DOC 246 KB)
Additional file 3: Comparative genomics for extent of sequence variation between M. tuberculosis H37Rv and CDC1551. The data in this table provide an overview of the extent of sequence variation on a protein level between the orthologues of the PE and PPE families in the two M. tuberculosis strains, indicating that the "ancestral-type" PE and PPE genes are highly conserved between strains, while the PPE-MPTR and PE_PGRS subfamilies are prone to sequence variation. (DOC 218 KB)
Cite this article
Gey van Pittius, N.C., Sampson, S.L., Lee, H. et al. Evolution and expansion of the Mycobacterium tuberculosis PE and PPE multigene families and their association with the duplication of the ESAT-6 (esx) gene cluster regions. BMC Evol Biol 6, 95 (2006). https://doi.org/10.1186/1471-2148-6-95