The basic step in the overall procedure to obtain PHOGs at all nodes of the evolutionary tree is to compare several supergenomes, find orthologs and paralogs, put them into one PHOG and to merge these supergenomes into the supergenome lying higher in the evolutionary tree. Since each PHOG represents a multiple alignment of protein sequences, it has first to be converted into an "ancestral" sequence (a supergene), and then consensus sequences from both supergenomes are compared to find orthologs and paralogs. Sequences in each newly created PHOG are multiply aligned, and all PHOGs are then stored in the relational database to launch the procedure at next nodes of the evolutionary tree.
Obtaining a supergene from a PHOG multiple alignment
Our accompanying paper [17] shows that each column of the multiple alignment in more than 98% cases belongs to one of the 20 frequency column clusters, which can be thought to be derived from a single amino acid residue. Rarely, we obtain "garbage' columns which will get the special symbol "X". We convert a column of the protein multiple alignment to a frequency vector and find the nearest cluster as described in our paper [17].
Running PHOG-BLAST
After all PHOGs from supergenomes are converted into consensus sequences, a special BLAST-like procedure is run between each pair of supergenomes lying at a single node of the evolutionary tree which is called PHOG-BLAST [17]. PHOG-BLAST combines ideas from FASTA [18], original BLAST [19] and dynamic programming.
After PHOG-BLAST scores are computed for all possible pairs of supergenes from each pair of supergenomes, bi-directional best hits are obtained (BBHs). Since BBHs with low scores can potentially lead to false-positive orthologs, BBHs with scores less than a given threshold (100) are discarded. If we form a graph with vertices as supergenes and BBH relationships as edges we can obtain connected components in this graph. Fig. 2 shows a connected component in this graph consisting of eight genes. Genes A1, A2, A3, A4 form one orthologous group and genes B1, B2, B3, B4 form another orthologous group. Due to the false BBH bridge between genes A1 and B3, both orthologous groups are merged into a single orthologous group. Some of these connected components can be quite big. For example, when our procedure was run at the Archaen node (Fig. 3), we frequently obtained connected components having 60 vertices and more. Since we believe that each orthologous group is the result of the node evolution of just one ancestral gene we have to split this connected component into several parts.
Splitting procedure
Our splitting procedure is based on the assumption that the higher the BBH score is between a pair of supergenes, the greater is the chance that these BBH supergenes are orthologs and they are not false BBH bridges. Therefore, we are looking for the pair of supergenes in the connected component with the greatest BBH score, and we consider this pair as the seed of a new orthologous group. For all other supergenes in the connected components we calculate a sum of PHOG-BLAST scores to the seeds. Then we arrange these supergenes in descending order for these scores. After that we "fill" the growing orthologous group starting from the top ranking genes in this order and omitting genes that have already representatives in the orthologous group from their taxon. We repeat this procedure for all genes that are not included in orthologous groups until we cannot find seeds anymore. As a result of this procedure, the connected component is split into several orthologous groups, each with its pair of seed supergenes. To reduce the rate of erroneous assignment of supergenes to orthologous groups, we reshuffle all supergenes assigning them to those seeds for which they have the maximum PHOG-BLAST score.
The possible scenario for our procedure for the situation in Fig. 2 might be like this. A1A2 is the strongest BBH, the arrangement of other supergenes is A3A4B1B2B3B4, and the first "filled" orthologous group is A1A2A3A4. From all BBHs that are not included in this orthologous group, B1B2 is the strongest BBH, the arrangement of other supergenes is B3B4. The second "filled" orthologous group is B1B2B3B4. Reshuffling has not changed the composition of both orthologous groups.
In each orthologous group, the Smith-Waterman algorithm [20] is applied to the seeds. Only seed segments giving the maximum score are left for further processing. They are called seed cores. A seed consensus sequence is formed from these cores by finding the nearest frequency profile cluster in each position of this seed pairwise alignment. Since N/C out-of-core ends of seeds might represent protein domains, it is very important to look at them once more. To this end, N/C out-of-core ends having length greater than 100 are stored in the database and they are used to launch the second round of the procedure at a single node of the evolutionary tree (see the "Second round" section below).
Similarly, the Smith-Waterman algorithm is applied to the seed consensus and all non-seed supergenes to get non-seed cores. N/C out-of-core ends of non-seeds having length greater than 100 are stored in the database.
Multiple alignment of core sequences in the orthologous group
In our earlier experiments with PHOGs we used ClustalW [21] to multiply align supergene sequences. However, this approach resulted in a very slow overall procedure. Therefore, we decided to develop our own procedure for the multiple alignment, following the traditional iterative approach. The computational experiments showed that this procedure produced multiple alignments of good quality (data not shown).
Our alignment procedure is based on the well-known observation that more similar protein sequences produce less error prone alignments [22]. The input for our multiple alignment procedure is a set of supergene core sequences belonging to one orthologous group obtained at the previous step. The procedure runs as follows:
(i) Compute a sum of PHOG-BLAST scores for non-seed cores to the seed cores.
(ii) Arrange all core sequences in the descending order for these scores. Seed cores will head the ordered list.
(iii) Set the consensus sequence equal to the first sequence in the ordered list.
(iv) Set the current sequence equal to the second sequence in the ordered list.
(v) Apply the Needleman-Wunsch algorithm [23] to align the consensus sequence and the current core sequence. Form the new consensus sequence from this multiple alignment of the two aligned sequences by finding the nearest frequency column cluster in each position.
3. Repeat step (v) for all other sequences in the ordered list.
Finding paralogs
After gene duplications, paralogs experience a period of relaxed evolution [14], and generally it is difficult to assess how long this period was. One safe approach to find paralogs is to select those gene as paralogs whose evolutionary distance to an ortholog in its own taxon is smaller than the distance between orthologs belonging to different taxons [24]. We think, however, that this approach is too restrictive, and the procedure based on it can result in too many orphan genes, even these genes have high similarities to other genes that found their counterparts in other species and fell into PHOGs.
Therefore, for all supergenes that are not declared orthologs at a particular node of the evolutionary tree, we compute PHOG-BLAST scores to PHOG supergenes and for each such supergene we find the best hit. If the PHOG-BLAST score to this consensus exceeds 100, we declare this supergene to be a paralog to the best-hit PHOG. After all paralogs are added to PHOGs, PHOGs are aligned as described in the previous section.
Second round of the procedure
This round is needed because orthologs can have different domain structures due to gene fusion events. If both orthologs have a homologous core, but the first ortholog has an additional domain that is absent in the second ortholog, then we can cut out this additional domain. This additional domain can find its match among other domains or orphan genes in other supergenomes. Therefore, all previous steps are repeated for all N/C cuts and all orphan genes at a particular node.