Running BLAST-Explorer
The entry page of BLAST-Explorer is a simplified BLAST form that receives a single FASTA-formatted query sequence as input and allows (i) the selection of BLASTN, BLASTP, TBLASTN, or BLASTX [4] as the alignment algorithm, (ii) the selection of a sequence database (GenBank NT for nucleotides; GenBank Non-Redundant Protein, Ensembl, PDB, RefSeq, UniProt and Swiss-Prot for proteins), (iii) the selection of a BLAST E-value threshold and (iv) the option of filtering out low-complexity sequence segments. BLAST searches report a maximum of 5,000 hits.
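The constraints of the entry form can be summarized in a short sketch. The Python below is illustrative only and is not taken from BLAST-Explorer: the function name `validate_request` and the database keys are hypothetical shorthand for the databases listed above. It encodes the standard BLAST rule that BLASTN and TBLASTN search nucleotide databases, whereas BLASTP and BLASTX search protein databases.

```python
# Sequence type of the database searched by each program (standard BLAST semantics).
PROGRAM_DB_TYPE = {
    "BLASTN": "nucleotide",
    "TBLASTN": "nucleotide",
    "BLASTP": "protein",
    "BLASTX": "protein",
}

# Hypothetical shorthand keys for the databases named in the text.
DB_TYPE = {
    "genbank_nt": "nucleotide",
    "genbank_nr": "protein",
    "ensembl": "protein",
    "pdb": "protein",
    "refseq": "protein",
    "uniprot": "protein",
    "swissprot": "protein",
}


def validate_request(fasta, program, database, evalue):
    """Return a list of problems; an empty list means the form is acceptable."""
    problems = []
    # The form accepts a single query, i.e. exactly one ">" header line.
    headers = [line for line in fasta.splitlines() if line.startswith(">")]
    if len(headers) != 1:
        problems.append("expected exactly one FASTA record")
    if program not in PROGRAM_DB_TYPE:
        problems.append("unknown program: " + program)
    elif database not in DB_TYPE:
        problems.append("unknown database: " + database)
    elif PROGRAM_DB_TYPE[program] != DB_TYPE[database]:
        problems.append(program + " needs a " + PROGRAM_DB_TYPE[program] + " database")
    if evalue <= 0:
        problems.append("E-value threshold must be positive")
    return problems
```

For instance, `validate_request(">q\nMKVLAT", "BLASTP", "genbank_nt", 1e-5)` reports that BLASTP needs a protein database, while the same query against `"swissprot"` returns an empty list.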
Small-scale selection mode
By default, the result page shows only the top-100 scoring BLAST hits, while the remaining hits are kept in memory and can be activated using the large-scale selection tools (next section). Small-scale selection tools apply only to the top-100 scoring BLAST hits. The central tool in this mode is the sequence similarity tree, which provides an approximate picture of the phylogenetic relationships between the query and the top BLAST hits (Fig. 1A). BLAST hits are renamed according to the species name. The similarity tree is annotated with meta-information including hit description (Fig. 1B), alignment coverage (Fig. 1C), and taxonomy-based coloring (Fig. 1D). The tree image allows navigation within the BLAST result page (clicking on an alignment coverage bar [Fig. 1C] leads to the corresponding pairwise alignment [Fig. 1E]), gives access to the database record (by clicking on the hit name), and allows the selection of hits either individually (check-boxes) or in bulk (by clicking on internal branches).
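The two selection mechanisms described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the BLAST-Explorer implementation: hits are assumed to be dictionaries with a `score` key, the tree is assumed to be a nested tuple whose leaves are hit names, and the function names `split_hits` and `leaves` are hypothetical.

```python
DISPLAY_LIMIT = 100  # number of hits shown on the result page by default


def split_hits(hits, limit=DISPLAY_LIMIT):
    """Rank hits by decreasing BLAST score and return (displayed, hidden)."""
    ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)
    return ranked[:limit], ranked[limit:]


def leaves(node):
    """Return the names of all leaves below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


# Toy tree: clicking an internal branch selects every hit in its subtree,
# whereas a check-box selects a single hit.
tree = (("Homo sapiens", "Pan troglodytes"), ("Mus musculus", ("Danio rerio", "Query")))
selected = set(leaves(tree[0]))  # bulk selection of the first clade
selected.add("Danio rerio")      # individual selection
```

Representing the selection as a set makes the bulk and individual operations commute: the final selection does not depend on the order of the clicks.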
A dropdown menu (Fig. 1F) gives access to additional small-scale selection tools:
o The top-panel shows the number of gap-free sites in the BLAST-reconstructed multiple-alignment of selected sequences (see supplementary data). This number is dynamically updated when BLAST hits are added or removed from the selection.
o The "score histogram" tool shows the BLAST score values ranked in decreasing order. A score threshold can be applied by clicking on the histogram (e.g., Fig. 1G).
o Two "Update tree" options allow redrawing the similarity tree by setting the appropriate number of top-scoring BLAST hits or using a user-defined sequence selection. The tree is generated by combining ClustalW [6] and TreeDyn [7] using either all sites of the BLAST-reconstructed multiple-alignment or gap-free sites only (N.B., the initial tree is computed using all sites).
o The "Add sequences to tree" option allows incorporating up to five external sequences (supplied by users) into the current hit sequence selection. The similarity tree is then recalculated to show the phylogenetic position of the external sequences relative to the BLAST hit sequences.
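The gap-free site count and the score threshold listed above can be illustrated with a short sketch. It assumes that the multiple alignment is already available as equal-length strings using "-" as the gap character; it does not reproduce the BLAST-based alignment reconstruction described in the supplementary data, and the function names are hypothetical.

```python
def gap_free_sites(alignment, gap="-"):
    """Count alignment columns in which no sequence carries a gap."""
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    # zip(*alignment) iterates over the columns of the alignment.
    return sum(1 for column in zip(*alignment) if gap not in column)


def apply_score_threshold(hits, threshold):
    """Keep the hits whose BLAST score is at least the chosen threshold."""
    return [hit for hit in hits if hit["score"] >= threshold]


alignment = ["MKV-LT", "MKVALT", "M-VALT"]
print(gap_free_sites(alignment))      # 4 gap-free columns
print(gap_free_sites(alignment[:2]))  # 5 once the third sequence is deselected
```

The second call shows why the count is updated dynamically: removing a gappy sequence from the selection can increase the number of sites usable for tree reconstruction.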
At the end of the selection process, selected sequences can be retrieved in FASTA format ("get selected sequence" button) or passed to one of the phylogenetic reconstruction pipelines available on the phylogeny.fr platform [5] ("One click mode" or "Advanced mode" buttons).
Large-scale selection mode
In the large-scale selection mode, several tools allow the sampling of homologous sequences among the entire set of BLAST hits (including those that are not shown in the top-100 BLAST subset) using global criteria. They are grouped in a dedicated panel (Fig. 1H) and comprise:
o A pull-down menu that allows changing the E-value threshold on BLAST hits.
o Buttons showing the distributions of the BLAST hits according to three BLAST alignment statistics (i.e., BLAST scores, percentage of similarity, and alignment coverage). Bulk selection among the BLAST hits can then be done by selecting intervals of the distribution histogram.
o The "selection on taxonomy" tool, which enables the selection of BLAST hits according to their taxonomic rank (e.g., Fig. 1I). The taxonomic information is presented as a hierarchical graph allowing users to adjust the level of detail that is relevant to their needs.
Following the application of the selection rules, the result page (i.e., the similarity tree and individual pairwise alignments) is updated to account for changes in the list of the top-100 best BLAST hits.
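The large-scale selection rules can be sketched as successive filters over the full hit list, followed by re-ranking to obtain the new top-100 subset. The code below is a hedged illustration, not the BLAST-Explorer implementation: the hit fields (`evalue`, `score`, `similarity`, `coverage`, `lineage`) and the function names are assumptions, and the taxonomy is simplified to a list of lineage names per hit.

```python
def in_interval(value, interval):
    """True if value lies within the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high


def select_hits(hits, max_evalue=None, score=None, similarity=None,
                coverage=None, taxon=None):
    """Keep hits that satisfy every criterion that is not None."""
    selected = []
    for hit in hits:
        if max_evalue is not None and hit["evalue"] > max_evalue:
            continue
        if score is not None and not in_interval(hit["score"], score):
            continue
        if similarity is not None and not in_interval(hit["similarity"], similarity):
            continue
        if coverage is not None and not in_interval(hit["coverage"], coverage):
            continue
        # A hit belongs to a taxon if the taxon appears in its lineage.
        if taxon is not None and taxon not in hit["lineage"]:
            continue
        selected.append(hit)
    return selected


def top_hits(hits, limit=100):
    """Return the best-scoring hits shown on the updated result page."""
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)[:limit]


hits = [
    {"name": "h1", "evalue": 1e-50, "score": 200.0, "similarity": 95.0,
     "coverage": 98.0, "lineage": ["Eukaryota", "Metazoa", "Mammalia"]},
    {"name": "h2", "evalue": 1e-20, "score": 90.0, "similarity": 60.0,
     "coverage": 70.0, "lineage": ["Eukaryota", "Fungi"]},
    {"name": "h3", "evalue": 1e-3, "score": 40.0, "similarity": 35.0,
     "coverage": 40.0, "lineage": ["Bacteria", "Proteobacteria"]},
]
eukaryotes = select_hits(hits, max_evalue=1e-10, taxon="Eukaryota")
print([hit["name"] for hit in top_hits(eukaryotes)])  # ['h1', 'h2']
```

Because each criterion is optional and applied independently, the rules can be combined in any order, which mirrors the multi-dimensional selection described in the text.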
Comparison with existing software
Several existing BLAST post-processors combine BLAST searches with automated phylogenetic analysis of the BLAST hits. However, most of them do not pursue the same goal and therefore differ in the nature of their results. The functionalities offered for interacting with the results also vary greatly: some applications allow filtering of the BLAST hits before phylogenetic reconstruction, others do not.
Phylogena is a standalone application for phylogenetic annotation of unknown sequences [8] that implements automated filtering of BLAST hits before phylogenetic reconstruction. In contrast with BLAST-Explorer, the hit filtering method is optimized for sequence annotation and does not enable interactive and progressive refinement of the sequence dataset. Furthermore, Phylogena does not allow retrieving the selected sequences for external analysis.
Phylogenie is also a standalone application, designed for automated phylome generation and analysis [9]. Because the principal strength of Phylogenie is to automatically produce a large number of phylogenetic analyses in batch, it does not allow interactive filtering of BLAST hits before phylogenetic reconstruction. Phylogenie is a command-line driven pipeline, requiring at least some familiarity with UNIX and command-line tools.
Phyloblast [10] and the NCBI BLAST server [11] are the two web services that have the most in common with BLAST-Explorer. They produce an enriched BLAST output and allow selection of hits using various criteria. The Phyloblast server is apparently no longer maintained. Phyloblast only allowed comparing a protein sequence against a protein database using BLASTP, whereas BLAST-Explorer allows nucleotide/nucleotide, protein/protein and translated nucleotide/protein comparisons. Its tools for selecting hits before phylogenetic reconstruction (selection based on species names and sequence description) are less versatile than those offered by BLAST-Explorer. The NCBI BLAST service also provides several tools for selecting and retrieving matching sequences from the BLAST output; a distance tree of the BLAST hits can also be calculated. Here again, the hit selection tools are more limited than in BLAST-Explorer (simple check boxes beside sequence descriptions). Furthermore, the image of the distance tree does not allow interactive selection of the BLAST hits, which makes selection on phylogenetic criteria less straightforward.
The principal strength of BLAST-Explorer is the flexibility of the sequence selection process and the richness of the information displayed on screen. However, BLAST-Explorer does not offer pre-defined automated methods of hit selection such as those found in Phylogena. Rather, BLAST hit selection is multi-dimensional and mainly human-driven through an interactive graphical interface, in order to accommodate a wide range of sequence selection strategies. Another feature that differentiates BLAST-Explorer from other software is that it is entirely web-based. Thus, no installation on a personal computer and no regular update of the sequence databases are required.
The BLAST-Explorer output includes a phylogenetic representation of the BLAST hits (i.e., the similarity tree) that is intended to help in the hit selection process. It is important to note that this tree is not optimized for phylogenetic accuracy. Rather, we opted for a fast tree reconstruction strategy that is nevertheless sufficiently robust to provide an approximate phylogenetic position of the BLAST hits. We therefore advise users to use external specialized software if they want to improve or confirm the accuracy of the phylogenetic tree.
Finally, it is important to note that, for some phylogenetic analyses, what matters is a correct distinction between orthologous and paralogous sequences.