Skip to main content
  • Research article
  • Open access
  • Published:

Structural similarity of loops in protein families: toward the understanding of protein evolution



Protein evolution and protein classification are usually inferred by comparing protein cores in their conserved aligned parts. Structurally aligned protein regions are separated by less conserved loop regions, where sequence and structure locally deviate from each other and do not superimpose well.


Our results indicate that even longer protein loops can not be viewed as "random coils" and for the majority of protein families in our test set there exists a linear correlation between the measures of sequence similarity and loop structural similarity. Results suggest that distance matrices derived from the loop (dis)similarity measure may produce in some cases more reliable cluster trees compared to the distance matrices based on the conventional measures of sequence and structural (dis)similarity.


We show that by considering "dissimilar" loop regions rather than only conserved core regions it is possible to improve our understanding of protein evolution.


Globular proteins are considered to be structurally similar if their regular secondary structure elements can be superimposed well and are connected in the same order. The loop regions connecting secondary structures demonstrate less regularity in their conformations even though short loops linking specific secondary structures can be classified into distinct classes [16]. The structures and sequences in loop regions may deviate from each other so that they do not superimpose well and as a result loops are very often not aligned by structure-structure or sequence alignment methods. Loops apparently do not contribute much to protein stability but may be quite important for protein specific function and for the interaction with other components of the cell. In our previous work we showed that a measure derived from the loop regions can distinguish homologous from analogous proteins with the same or higher accuracy compared to the conventional measures which are based on comparing proteins in structurally aligned regions only [7].

Recently it has been observed that structural variation in the core of homologous proteins is linearly correlated with sequence changes [8, 9]. As was also shown several years ago, the probability of insertion and deletion events, which occur predominantly in the loop regions, strongly depends on the evolutionary distance between two homologous proteins [10, 11]. Based on these observations one might argue that more closely related proteins may exhibit more similarity in the structure of their loop regions compared to distantly related proteins and the structural loop (dis)similarity should correlate with evolutionary distance.

To check this hypothesis we performed an analysis of structural variation in the loop regions within different homologous protein families using a recently introduced new measure of loop similarity [7]. This new measure is based on the concept of the Hausdorff metric, which is used in mathematical topology to define a distance between two point sets of a metric space. It does not require an alignment or one to one correspondence between two point sets. We show that there exists a linear correlation between the average structural change in the loop regions and the evolutionary distance, which allows us to use the loop (dis)similarity measure for inferring the phylogenetic history of homologous protein families.


Test set

To select sets of homologous proteins the Conserved Domain Database (CDD) version 1.62 was taken, which can be accessed at [12]. The CDD collection of protein domain alignments included curated CDDs [13] and preprocessed domain families imported from SMART and PFAM, altogether 6222 protein domain families[14]. Upon import, the sequences from SMART/PFAM alignments with more than 75% identity with known structures were substituted by the most similar structures from the Protein Data Bank [15].

Each CDD family was decomposed into a set of pairwise structure-structure alignments. Structural alignments were computed by the VAST algorithm [16] and only those structures which had more than 80% mutual overlap between the VAST alignment footprint and CDD footprint were considered in the analysis. The footprint for a given sequence was defined as a region between the first and the last residues aligned by VAST or CDD. Those families containing short sequence repeats and having average alignment length less than 50 residues were excluded from the test set. The structural pairs within the remaining CDD families were disregarded if at least one of the following conditions held true:

- at least one structure in a pair had X-ray resolution of greater than 3.0 Å

- the Blast E-value calculated for the VAST alignment exceeded 0.01

- at least one structure in a pair contained a chain discontinuous domain inconsistently aligned between VAST and CDD

- at least one structure in a pair contained more than 25% of its nonaligned loops with missing residues.

To ensure that protein families span a wide range of sequence similarity, all families were examined and those having less than 30% sequence identity span were not considered in further analysis. The redundancy between protein families was checked by using the procedure implemented in the CDART algorithm [17] and not more than 2 protein families from the same CDD cluster were retained in the final test set. At the end, the test set comprised 59 CDD families with more than 10 structurally aligned pairs of homologs. This test set covered a wide range of functional and structural classes and the list of test families together with their length, number of protein pairs and correlation coefficients is shown in Table 1.

Table 1 List of the names of 59 test protein families together with their CDD accession names, lengths, number of protein pairs, Pearson correlation coefficients between LHM (AHM) and normalized Blast bitscore. The families are ordered with respect to decreasing quality of LHM correlation. The supplementary table is available at [27].

Measures of structural and sequence similarity

To measure the sequence similarity between homologous proteins from the same family we used a Blast bitscore normalized by the alignment length. Among structure similarity measures used in this paper, two of them, RMSD and alignment-based Hausdorff measure (AHM) were computed by comparing the proteins in structurally aligned regions, while the loop-based Hausdorff measure (LHM) quantified the difference in the loop regions.

The root mean squared deviation (RMSD) was calculated using the superposition algorithm due to McLachlan [18]. The AHM and LHM measures were based on the mathematical concept of Hausdorff distance[19]. Let A = {a1,..., a m } and B = {b1,..., b n } be finite point sets in a Euclidean space. The Hausdorff distance between the sets A and B is then defined by:

d H (A, B) = max {min j d(a1, b j ),..., min j d(a m , b j ), min i d(a i , b1),..., min i d(a i , b n )}     (1)

Here the terms d(a i , b j ) denote the usual Euclidean distance between the points. In other words, the Hausdorff distance between the sets A and B is the smallest distance such that every point a i A is within this distance of some point b j B and vice versa. Hausdorff distance can be calculated under the assumption that the C α atoms for both structures are in a common coordinate frame which is defined by the structural alignment between two domains. The Hausdorff measure for loops (LHM) was calculated as an average of Hausdorff distances over all loops in the protein pair, where n s is the number of aligned secondary structure elements:

The "loop" was defined as a region between two consecutive aligned secondary structure elements and:

h i = 0, if the i-th loop regions do not have any unaligned residues;

h i = d H (A i , B i ), where A i contains the set of C α coordinates of non-aligned residues in the i-th loop of the first structure in a pair, the last aligned residue from the preceding aligned region and the first aligned residue from the following aligned region. Similarly, B i is defined for the second structure in a pair. The sets (A i , B i ) are defined to include two aligned residues so that the measure can be defined even if one of the sets of non-aligned residues is empty. The Hausdorff measure for the structurally aligned regions (AHM) was defined similarly. In this case, instead of the sets that contain the coordinates for the C α atoms in the loops, we use the coordinates for the C α atoms in the aligned segments and average over the number of aligned segments.

The correlation analysis between the measures of sequence and structural similarity, linear/nonlinear regression analyses and cluster analysis were performed using Splus version 6. Pearson (ρ) and Spearman correlation coefficients were calculated to quantify the accuracy of linear correlation. The P-value under the null hypothesis that the correlation coefficient between two variables is equal to zero has been estimated and those families with the P-values less than 0.01 were considered as having statistically significant correlation. The cluster analysis was done using the complete linkage clustering [20] where the distance between two clusters was measured as a maximum distance between a point in one cluster and a point in another cluster. The cluster trees based on p-distance and LHM were compared using the Phylip program [21] by generating 1000 bootstrap alignments from the structural alignments of a protein family and by calculating p-distance based cluster trees from the bootstrap alignments. The bootstrap support for the LHM based tree or different partitions of this tree was calculated by counting how many times the LHM topology occurs among the bootstrap cluster trees.

Results and discussion

Tables 1 and 2 show the accuracy of correlation obtained between the various measures of structural similarity (RMSD, AHM and LHM). As can be seen from these tables, the correlation quantified by the Pearson correlation coefficient is quite high for most of the families and half of the families have coefficients between -0.76 and -0.81 depending on the structural similarity measure used (Spearman rank correlation coefficients were shown to be very close to those reported in Tables 1 and 2). This result is consistent with the studies of Wood and Pearson who showed on a smaller test set of 35 protein families that half of them have correlation coefficients greater than 0.878 [8]. In their case the sequence-structure correlation was quantified, however, by using only the measures based on the structurally aligned regions of the proteins.

The dependence of structural similarity on sequence similarity in some cases can be more accurately described by the nonlinear regression model taking into account higher order quadratic terms. To quantify how much the nonlinear terms improve the data fitting, we use the ratio of squared correlation coefficient for linear (

) and nonlinear () models (). In the overall test set only 12 families have r2 – ratio smaller than 0.9 (with LHM used as a structural similarity measure) indicating that for these cases adding the non-linear term improves the performance of modeling by about 10%.

As was shown previously, the evolutionary relatedness between proteins can be successfully gauged from the comparison of their loop regions [7]. Indeed, Table 2 and Figure 1 show that within the families of homologous proteins, the structural changes in loops are strongly coupled with evolutionary distance, which in the first approximation can be estimated using normalized Blast score. The structural-sequence dependence in loop regions for 71% of our protein families can be well described by a linear model and for 88% of protein families the linear correlation coefficients are found to be statistically significant. Comparing different measures of structural similarity one can see that AHM performs somewhat better than other quantities yielding 90% of families with statistically significant linear correlation coefficients (with P-value < 0.01) and 80% of families with r2 > 0.9.

Table 2 Table shows the median of Pearson correlation coefficients, fraction of families with statistically significant correlation (P-value less than 0.01) and the fraction of families with the ratio r2 higher than 0.9 for each measure of structural similarity used in the study.
Figure 1
figure 1

Hausdorff measure (in Angstroms) for loop (LHM) and aligned (AHM) regions is plotted versus the normalized Blast bitscore for three families: Pancreatic ribonucleases (RnaseA), Ig-like plexins/transcription factors (IPT) and Trypsin-like serine proteases (Tryp_SPc). Solid line shows the linear regression fit of the data.

However, not all families exhibit such good correlation. One example of a protein family showing particularly low LHM correlation is the family of Actin depolymerisation factor/cofilin-like domains (ADF). The sequence-structure correlation for loop regions of this family is not statistically significant (the Pearson correlation coefficient is close to zero) whereas the sequence-structure correlation for the protein core is very high (ρ = -0.85 with AHM). Indeed, different proteins of this family show distinctly different loop conformations and evolutionary analysis of ADF family argued that the insertions present in the vertebrate ADF/cofilins (and not present in non-vertebrate cofilins) might be important for nuclear function of mammalian cofilins [22]. Therefore, in this case the structural heterogeneity of loop regions can be explained by the acquisition of a new distinct function by some members of this family. For some families, for example, Trypsin-like serine protease (Tryp_SPc), neither LHM (ρ = -0.31) nor AHM (ρ = -0.55) similarity measures exhibit a good sequence-structure correlation (Figure 1(c)).

Among families with particularly high LHM correlation are the families of Xylose isomerase (Xylose_isom), Class I Histocompatibility antigen (domains alpha 1 and 2, MHC_I), Protein tyrosine phosphatase (PTPc) and others. Figure 1 shows two families with high sequence-structure correlation using the LHM measure: Ig-like plexins (IPT) and Ribonucleases A (RnaseA). The IPT family is characterized by high sequence-structure correlation for both core (ρAHM = -0.90) and loop regions (ρLHM = -0.94). On the other hand, the protein core structure of the RnaseA family changes very little with sequence whereas the loop structure gradually diverges as sequence becomes more and more dissimilar (ρAHM = -0.48, ρLHM = -0.87).

To understand whether significant sequence-structure correlation for loop regions has an underlying biological meaning, we performed a cluster analysis of proteins from two diverse families, Ribonuclease A (RnaseA), and SH2 domain (SH2, ρAHM = -0.48, ρLHM = -0.78), using different measures of sequence and structural similarity. Figure 2 depicts the cluster trees constructed using distance/similarity matrices which were based on the fraction of non-identical residues (p-distance), RMSD and LHM for these two families.

Figure 2
figure 2

Complete linkage cluster tree produced using fraction of non-identical residues (p-distance), RMSD (Å), and LHM (Å) is plotted between proteins from Pancreatic ribonuclease family (RnaseA). Five major groups of RnaseA family according to Rosenberg et al [23] are: eosinophil ribonucleases (ER), pancreatic ribonucleases (PR), angiogenins (ANG), Rana ribonucleases (RR) and ribonuclease 4 (R4). The maximum parsimony tree described by Rosenberg et al [23] is given in the Phylip format: (RR, ((ANG, R4), (PR, ER))).

The RnaseA family represents a very interesting example to study as it is characterized by considerably different catalytic efficiency and substrate preferences among family members and the different aspects of its activity is not well understood. Although cysteines that form disulfide bonds, catalytic histidines and lysine residues are mostly structurally and sequence conserved, there is a great variability in sequence between other regions of RnaseA proteins [23, 24]. We compared the obtained cluster trees (Figure 2) with the maximum-parsimony phylogenetic tree derived by Rosenberg et al [23], the Phylip format of this tree is given in the captions of Figure 2. As shown in this figure, the RMSD-based tree divides pancreatic ribonucleases (PR) into two groups and puts together two very different proteins: angiogenin (ANG) and Rana ribonuclease (RR) although angiogenin has a very weak enzymatic activity and is a tumor-growth promoter while Rana ribonuclease P-30 has ribonuclease activity and antitumor effects. In contrast to the RMSD cluster tree, distance matrices based on the loop (dis)similarity measure correctly cluster the representatives of the five major groups of the Ribonuclease family as per Rosenberg et al [23]. Although the topology of the p-distance based cluster tree is somewhat different from the topology of the LHM based tree (with bootstrap support less than 0.001), it also produces a biologically meaningful clustering as judged from Rosenberg et al [23].

SH2 domains represent phosphor-tyrosyl peptide binding modules which are found in many signaling proteins. The specificity of phosphate interaction with a protein has been attributed to the hydrophobic pocket which is mostly formed by two loop regions [25]. Our analysis shows that indeed the loop regions have a much higher accuracy in clustering of functional subfamilies of SH2 domains. Comparing our cluster trees with the classification of Songyang et al [26] and cluster trees of SH2 phosphotyrosyl binding sites [25] we can see from Figure 3 that p-distance based and RMSD based distance matrices cluster correctly two representatives of the "1A" subfamily (vsrc, hck), but separate proteins from subfamily "1B" (csk, csk, syk) and "4" (shptp2 and shc). In contrast, these subfamilies ("1B" and "4" [26]) are very well supported by the cluster tree which is based on the LHM measure. The bootstrap calculations (see Methods) show that the LHM based topology is supported by the p-distance based clustering algorithm at less than the 0.001 level. Different partitions of this tree are supported at higher but still non-significant levels, namely 0.11 for the "1B" subfamily (csk, csk, syk) and 0.01 for the subfamily "4" (shptp2 and shc). This in turn indicates that the two cluster trees can be considered statistically different.

Figure 3
figure 3

Complete linkage cluster tree produced using fraction of non-identical residues (p-distance), RMSD (Å), and LHM (Å) is plotted between proteins from SH2 family (SH2). The classifications of SH2 domains according to [25, 26] are given in the parentheses: syk (1B, B), shptp2 (4, C), vsrc (1A, A), hck (1A, A), csk (1B, B), P85a (3, D) and shc (4,).


Here we have presented an analysis of how the structure of protein loops changes in evolution as homologous proteins diverge from each other. We showed that for the majority of protein families there exists a statistically significant linear correlation between measures of sequence similarity and average loop structural similarity. This in turn suggests that loops change in evolution via a stepwise insertion or deletion process and clearly one can not portray even longer loop regions as "irregular conformations" or "random coils". Indeed, our results imply that, in general, loops are under constant evolutionary constraints which, apparently, are weaker than those for a protein core but still strong enough to preserve the loop overall structure. Since loops do not contribute much to the protein core stability, these constraints predominantly arise from the importance of loops in interacting with ligands, other proteins and cells, as well as a possible role of loops in protein folding.

Modeling of insertion and deletion events in evolution poses a lot of difficulties and protein evolution is usually reconstructed based only on the aligned regions of proteins. We demonstrated that loop regions which usually correspond to the non-aligned protein regions can be very important in inferring the phylogenetic history of a protein family. Moreover, it was shown, that sometimes sequence and structure similarity measures comparing proteins in their core are not sensitive enough to detect subtle (dis)similarities between the subfamilies. Loop-based measures which emphasize the dissimilarities between different protein members can shed light on the evolutionary relationships between homologous proteins.


  1. Richardson JS: The anatomy and taxonomy of protein structure. Adv Protein Chem. 1981, 34: 167-339.

    Article  CAS  PubMed  Google Scholar 

  2. Wilmot CM, Thornton JM: Analysis and prediction of the different types of beta-turn in proteins. J Mol Biol. 1988, 203: 221-232. 10.1016/0022-2836(88)90103-9.

    Article  CAS  PubMed  Google Scholar 

  3. Kwasigroch JM, Chomilier J, Mornon JP: A global taxonomy of loops in globular proteins. J Mol Biol. 1996, 259: 855-872. 10.1006/jmbi.1996.0363.

    Article  CAS  PubMed  Google Scholar 

  4. Wintjens RT, Rooman MJ, Wodak SJ: Automatic classification and analysis of alpha alpha-turn motifs in proteins. J Mol Biol. 1996, 255: 235-253. 10.1006/jmbi.1996.0020.

    Article  CAS  PubMed  Google Scholar 

  5. Oliva B, Bates PA, Querol E, Aviles FX, Sternberg MJ: An automated classification of the structure of protein loops. J Mol Biol. 1997, 266: 814-830. 10.1006/jmbi.1996.0819.

    Article  CAS  PubMed  Google Scholar 

  6. Burke DF, Deane CM, Blundell TL: Browsing the SLoop database of structurally classified loops connecting elements of protein secondary structure. Bioinformatics. 2000, 16: 513-519. 10.1093/bioinformatics/16.6.513.

    Article  CAS  PubMed  Google Scholar 

  7. Panchenko AR, Madej T: Analysis of protein homology by assessing the (dis)similarity in protein loop regions. Proteins. 2004, 57: 539-547. 10.1002/prot.20237.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  8. Wood TC, Pearson WR: Evolution of protein sequences and structures. J Mol Biol. 1999, 291: 977-995. 10.1006/jmbi.1999.2972.

    Article  CAS  PubMed  Google Scholar 

  9. Koehl P, Levitt M: Sequence variations within protein families are linearly related to structural variations. J Mol Biol. 2002, 323: 551-562. 10.1016/S0022-2836(02)00971-3.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  10. Benner SA, Cohen MA, Gonnet GH: Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J Mol Biol. 1993, 229: 1065-1082. 10.1006/jmbi.1993.1105.

    Article  CAS  PubMed  Google Scholar 

  11. Pascarella S, Argos P: Analysis of insertions/deletions in protein structures. J Mol Biol. 1992, 224: 461-471. 10.1016/0022-2836(92)91008-D.

    Article  CAS  PubMed  Google Scholar 

  12. . []

  13. Marchler-Bauer A, Anderson JB, DeWeese-Scott C, Fedorova ND, Geer LY, He S, Hurwitz DI, Jackson JD, Jacobs AR, Lanczycki CJ, Liebert CA, Liu C, Madej T, Marchler GH, Mazumder R, Nikolskaya AN, Panchenko AR, Rao BS, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Vasudevan S, Wang Y, Yamashita RA, Yin JJ, Bryant SH: CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res. 2003, 31: 383-387. 10.1093/nar/gkg087.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  14. Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH: CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res. 2002, 30: 281-283. 10.1093/nar/30.1.281.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  15. Berman HM, Bhat TN, Bourne PE, Feng Z, Gilliland G, Weissig H, Westbrook J: The Protein Data Bank and the challenge of structural genomics. Nat Struct Biol. 2000, 7 Suppl: 957-959. 10.1038/80734.

    Article  CAS  PubMed  Google Scholar 

  16. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996, 6: 377-385. 10.1016/S0959-440X(96)80058-3.

    Article  CAS  PubMed  Google Scholar 

  17. Geer LY, Domrachev M, Lipman DJ, Bryant SH: CDART: protein homology by domain architecture. Genome Res. 2002, 12: 1619-1623. 10.1101/gr.278202.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  18. McLachlan AD: Gene duplications in the structural evolution of chymotrypsin. J Mol Biol. 1979, 128: 49-79. 10.1016/0022-2836(79)90308-5.

    Article  CAS  PubMed  Google Scholar 

  19. Preparata FP, Shamos MI: Computational geometry, an introduction. 1985, , Springer-Verlag,NY

    Chapter  Google Scholar 

  20. Chambers JM: Programming with Data. A Guide to the S Language. 1998, New York, Springer-Verlag

    Chapter  Google Scholar 

  21. Felsenstein J: PHYLIP - phylogeny inference package. Cladistics. 1989, 5: 164-166.

    Google Scholar 

  22. Lappalainen P, Kessels MM, Cope MJ, Drubin DG: The ADF homology (ADF-H) domain: a highly exploited actin-binding module. Mol Biol Cell. 1998, 9: 1951-1959.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  23. Rosenberg HF, Zhang J, Liao YD, Dyer KD: Rapid diversification of RNase A superfamily ribonucleases from the bullfrog, Rana catesbeiana. J Mol Evol. 2001, 53: 31-38.

    CAS  PubMed  Google Scholar 

  24. Zhang J, Dyer KD, Rosenberg HF: Human RNase 7: a new cationic ribonuclease of the RNase A superfamily. Nucleic Acids Res. 2003, 31: 602-607. 10.1093/nar/gkg157.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  25. Campbell SJ, Jackson RM: Diversity in the SH2 domain family phosphotyrosyl peptide binding site. Protein Eng. 2003, 16: 217-227. 10.1093/proeng/gzg025.

    Article  CAS  PubMed  Google Scholar 

  26. Songyang Z, Shoelson SE, Chaudhuri M, Gish G, Pawson T, Haser WG, King F, Roberts T, Ratnofsky S, Lechleider RJ: SH2 domains recognize specific phosphopeptide sequences. Cell. 1993, 72: 767-778. 10.1016/0092-8674(93)90404-E.

    Article  CAS  PubMed  Google Scholar 

  27. . []

Download references


We especially thank Stephen Bryant and Yuri Wolf for insightful discussions. This work has been supported by the NIH Intramural Research Program.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Anna R Panchenko.

Additional information

Authors' contributions

AP and TM contributed equally to this paper.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

Reprints and permissions

About this article

Cite this article

Panchenko, A.R., Madej, T. Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evol Biol 5, 10 (2005).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: