Transcriptomics provides a robust framework for the relationships of the major clades of cladobranch sea slugs (Mollusca, Gastropoda, Heterobranchia), but fails to resolve the position of the enigmatic genus Embletonia

Background The soft-bodied cladobranch sea slugs represent roughly half of the biodiversity of marine nudibranch molluscs on the planet. Despite their global distribution from shallow waters to the deep sea, from tropical into polar seas, and their important role in marine ecosystems and for humans (as targets for drug discovery), the evolutionary history of cladobranch sea slugs is not yet fully understood. Results To enlarge the current knowledge on the phylogenetic relationships, we generated new transcriptome data for 19 species of cladobranch sea slugs and two additional outgroup taxa (Berthella plumula and Polycera quadrilineata). We complemented our taxon sampling with previously published transcriptome data, resulting in a final data set covering 56 species from all but one accepted cladobranch superfamilies. We assembled all transcriptomes using six different assemblers, selecting those assemblies that provided the largest amount of potentially phylogenetically informative sites. Quality-driven compilation of data sets resulted in four different supermatrices: two with full coverage of genes per species (446 and 335 single-copy protein-coding genes, respectively) and two with a less stringent coverage (667 genes with 98.9% partition coverage and 1767 genes with 86% partition coverage, respectively). We used these supermatrices to infer statistically robust maximum-likelihood trees. All analyses, irrespective of the data set, indicate maximal statistical support for all major splits and phylogenetic relationships at the family level. Besides the questionable position of Noumeaella rubrofasciata, rendering the Facelinidae as polyphyletic, the only notable discordance between the inferred trees is the position of Embletonia pulchra. Extensive testing using Four-cluster Likelihood Mapping, Approximately Unbiased tests, and Quartet Scores revealed that its position is not due to any informative phylogenetic signal, but caused by confounding signal. Conclusions Our data matrices and the inferred trees can serve as a solid foundation for future work on the taxonomy and evolutionary history of Cladobranchia. The placement of E. pulchra, however, proves challenging, even with large data sets and various optimization strategies. Moreover, quartet mapping results show that confounding signal present in the data is sufficient to explain the inferred position of E. pulchra, again leaving its phylogenetic position as an enigma. Supplementary Information The online version contains supplementary material available at 10.1186/s12862-021-01944-0.

and one pleurobranchid (Pleurobranchaea californica) [1][2][3][4]. RNASeq data of the published samples were downloaded from the NCBI Sequence Read Archive (SRA) (see Supplementary Table S2, Additional File 2). Based on recently published findings [5], the transcriptome assigned to the species Unidentia angelvaldesi in [4] and [6] is indeed a new and undescribed species. We therefore use the name Unidentia sp. 1. Further names have been changed in the course of our analyses and are listed in detail above.

Sample preparation and sequencing
Newly collected specimens were preserved in RNAlater (Qiagen) and stored at -80 °C until further processing (Supplementary Table S1

Quality checks and adapter trimming
All raw sequence reads were quality-checked prior to and after adapter trimming using FastQC v. 0.11.5 [7]. Adapter trimming and quality filtering were performed with

In silico estimation of average insert sizes
Some de novo transcriptome assemblers, such as SOAPdenovo-Trans [9], require the average insert size of the raw reads of an input library. The average insert sizes were estimated with the software BBMerge as implemented in the package BBTools version 37.28 [10,11], rounded to the closest integer and provided to SOAPdenovo-Trans as input (Supplementary Table S3, Additional File 2).
The assembly process of the 61 raw data sets resulted in a total of 366 (six assemblies per sample) transcriptome assemblies. The number of assembled contigs and total size of each assembly are provided in Supplementary Table S4, Additional File 2.

BinPacker
BinPacker was run with the following parameters: BinPacker -s fq -p pair -m FR -l R1_pd.fastq -r R2_pd.fastq -u both.fastq Although it is possible to pass both paired-end and single-end reads to BinPacker in the same run, the generated log files indicated that only paired-end reads had been used in the de novo assembly process, which we assume to be an unsolved issue in BinPacker.

IDBA-Tran
IDBA-Tran was run with the following parameters: IDBA-Tran requires input reads to be in FASTA format. Therefore, the included conversion software fq2fa was run with the following parameters for paired-end reads: 6 fq2fa --merge --filter R1_pd.fastq R2_pd.fastq R1_R2.fa And for single-end reads: Subsequently, all reads were combined into a single file.

Shannon
Shannon was run with the following parameters: Because Shannon can only process paired-end reads, no single reads were used in the assembly process.

SOAPdenovo-Trans
SOAPdenovo-Trans was run with the following parameters:

Trans-ABySS
Trans-ABySS was run with the following parameters: Trans-ABySS requires sequence headers to have an additional "/1" or "/2" at the end to identify forward and reverse reads. Therefore, all sequence headers were adjusted accordingly prior to the assembly process using a custom Perl script.

Contaminant removal
Following the identification of the best transcriptome assembly per species (see below), possible foreign contaminants (non-target sequences, adapter/linker sequences) within the assemblies of the newly sequenced samples were identified upon submission to NCBI

Quality assessment of assemblies using BUSCO
We checked the quality of our transcriptomes by using BUSCO version 3.0.0 [17] with the reference data set "metazoa".
The following settings used were: run_BUSCO.py -f -i assembly.fasta -c 12 -l metazoa_odb9 -m tran To further judge the quality of our assemblies, we additionally performed a second check after orthology assignment, see section 9.

Orthology assignment
In order to set up a reference database for the software Orthograph [18], which we used to infer orthology for our transcripts, we selected a set of clusters of orthologous sequences of 1,997 single-copy protein-coding genes (ortholog groups; OGs) from OrthoDB version 9 [19].
Setting the hierarchical level to "Lophotrochozoa", we considered the official gene sets version Jan-2013 (JGI genebuild) [22]. From this ortholog set, we excluded five genes because of inconsistent sequence headers. Thus, a total of 1,992 OGs were used to assign transcripts of each sample (Additional File 6).
Orthology assignment for transcripts included in the assembled transcriptomes was then performed using Orthograph version 0.6.2 [18]. An example Orthograph configuration file with the settings that were used is available in the supplementary material (Additional File 6).
In short, we used default settings allowing the reciprocal search of candidate orthologous transcripts against one of all official gene sets of the reference species. Furthermore, we allowed concatenation of transcripts in case they matched one OG, but did not overlap. We were able to successfully assign at least one transcript for 1,989 out of 1,992 OGs, for three OGs no transcript was assigned; these three OGs were excluded from further processing.

Quality assessment after orthology assignment
Since the BUSCO set "metazoa" is not very specific with respect to Cladobranchia, and since there is no specific sea-slug data set available yet, we additionally established a All alignments are available in Additional File 7.

Outlier check
Multiple sequence alignments were checked for outlier sequences (i.e. putatively misaligned or misassigned amino acid sequences) following the method described in [24]. We chose, however, to reimplement the method with some notable modifications. Our reimplementation now allows for arbitrary sets of reference taxa. Furthermore, we based the outlier identification on a formal definition of the interquartile range (IQR). The script is available from github [25] and has been run with the following command:

checker_complete.2.pl -a [ALIGNMENT] -s subjects.txt -k
Given the evolutionary distance of the three reference species, which we expect to be higher than the genetic distance within Cladobranchia, we considered their sequence dissimilarity as a cutoff for the selection of outlier sequences.

Removal of sequences from reference species and gap-only sites
After the removal of identified outlier sequences, we further removed all sequences from the reference species Biomphalaria glabrata, Crassostrea gigas, and Lottia gigantea from the alignments using the custom Python script remove_reference_sequences.py (available from: https://github.com/alexdonath/Embletonia). Subsequently, gap-only sites were removed from each multiple sequence alignment.

Filtering for ambiguous or randomly similar aligned sites, alignment masking, and concatenation into a supermatrix
The amino acid multiple sequence alignments were examined with the program Aliscore version 2.0 [26,27] with default settings in order to identify ambiguous or randomly similar aligned sites. All positions flagged by Aliscore per gene were discarded using the Perl script AliCUT version 2.31 [28]. For 248 genes masking was not necessary, so 1,519 gene 13 alignments were masked. The maximal proportion that was discarded from an MSA was 93.3%.
The resulting masked amino acid multiple sequence alignments were concatenated into a single supermatrix using FASconCAT-G version 1.04 [29] and a corresponding partition file based on the OGs was created. This supermatrix, which we hereinafter call "original unreduced data set", comprised 58 species, spanned a length of 771,739 positions (~ 71% of the unmasked superalignment), and contained 1,767 gene partitions.

Compilation, evaluation and optimization of data sets
Based on our original unreduced data set, we compiled another supermatrix, hereinafter called "original reduced data set", with the aid of the Perl script fasta2hypo (see Supplement of [24]) and kept only those gene partitions with sequence data for all 58 species. Thus, we ensured 100% partition coverage for each included species. The original reduced data set comprised 143,859 aligned amino acid positions and 364 gene partitions.
We analyzed both data sets with the software tool MARE version 1.2-rc [30] in order to assess the potential information content (IC) of each partition, the overall information content, and the coverage in terms of gene partitions. Subsequently, AliStat v. We repeated the concatenation into supermatrices and generated again i) a data set in which all MSAs were combined in a supermatrix, termed "unreduced data set" and ii) a reduced data set based on the same criterion as described above, termed "strict data set".
Further, we compiled iii) a data set that contained only those gene partitions for which at least one representative of each of the defined groups was present, termed "intermediate data set". Group definitions and included species for this data set are given in Supplementary Table S9, Additional File 2. Finally, we increased the overall information content (IC) of the strict data set using MARE with default settings. By discarding less informative genes as identified by MARE this resulted in iv) a strict selected optimal subset (SOS). All four data sets including the respective gene partition information are available in Additional File 9.
Note that we consider the strict data set as our main data set for two reasons: first, the amount of missing data is minimized and the overlap of all taxa is maximized following the rationale of Dell'Ampio and colleagues [36] (Supplementary Table S11, Additional File 2) and secondly, the amount of data violating SRH model assumptions was massively reduced (compare Supplementary Figures S1a and b

Phylogenetic tree inference
For all four final data sets (unreduced, intermediate, strict, and strict SOS), phylogenetic trees were calculated under the maximum likelihood optimality criterion using the program IQ-TREE version 1.6.12 [37]. The best fitting amino acid models were identified using ModelFinder [38], which is implemented in IQ-TREE. To this end, ModelFinder was run with an edge-linked partitioned approach [39]. Models were selected from all included nuclear models with equal rates (E), gamma rates (G), gamma and invariant sites (G+I), and free rates (R) with 2 -15 categories as well as the two free-rate models LG4X and LG4M, which by default only have four categories [40]. The model for each partition was chosen based on the corrected Akaike Information Criterion (AICc) [41]. Details about the selected substitution models are provided in Additional File 9.
Additionally, we ran a mixture model analysis for the unpartitioned strict data set using IQ-TREE version 1.6.12. ModelFinder was used to select the best mixture model. This was done by first performing an extended model selection on the unpartitioned strict data set to identify the best fitting amino acid exchange rate matrix. Models were selected from all included nuclear models with equal rates (E), gamma rates (G), gamma and invariant sites (G+I), and free rates (R) with 2 -8 categories as well as the two free-rate models LG4X and LG4M. In a next step, the three best-performing matrices were tested against and in combination with all protein mixture models (as implemented in IQ-TREE version 1.6.12), 10, 20, 30, and 60 classes of amino acid profiles (C10, C20, C30, and C60), and using empirical base frequencies (F). The final round of selecting the best mixture model tested the bestperforming amino acid exchange rate matrix in combination with 20, 30, and 60 classes of amino acid profiles (C20, C30, and C60) and using empirical (F) as well as optimized based frequencies (FO) and 4 -6 rate categories (R4 -R6). In each step, the best model was chosen based on the AICc.
For each data set, a total of 20 tree searches were performed, out of which ten were run using a parsimonious start tree and ten using a random start tree.
The best maximum likelihood tree was selected according to the best log-likelihood value.
Statistical support was inferred from non-parametric bootstrap replicates in batches of 100 replicates. For all data sets 100 bootstrap replicates were calculated. Since 100 replicates were not sufficient to achieve bootstrap convergence for the intermediate and strict data set, for both data sets 300 bootstrap replicates were calculated (see below). Convergence of bootstrap replicates was ensured with RAxML version 8.2.11 [42] using the following options: -I autoMRE -B 0.03 -m GTRGAMMA and 10,000 permutations. Testing for bootstrap convergence was conducted ten times independently with random start seeds. For the unreduced and the unpartitioned strict data set, bootstrap convergence was always achieved after 50 replicates, whereas bootstraps of the intermediate data set and the partitioned strict data set required 150 replicates each to reach convergence. For the strict SOS data set, bootstrap convergence was achieved four times after 100 BS replicates and six times after 50 replicates.
Statistical bootstrap support was mapped on the ML best tree of each analysis, i.e.
In addition to bootstrap support, we calculated support values based on the SH-like approximate likelihood ratio test [43] with 10,000 replicates and based on the approximate Bayes test [44] using default settings and the best inferred ML tree for each data set (unreduced, intermediate, partitioned strict, unpartitioned strict, and strict SOS) as input (Supplementary Figures S9, S11, S13, S15, and S17, Additional File 3).
All trees were rooted in Dendroscope v. 3.7. The five final data sets (unreduced, intermediate, partitioned strict, unpartitioned strict, and strict SOS) were tested for the presence of rogue taxa using RogueNaRok v. 1.0 [47] with default settings, to which the best ML tree inferred from each data set was provided as an input. No rogue taxa were identified in any of the data sets.

Phylogenetic discordance
To further analyse phylogenetic discordance, we applied the Quartet Sampling (QS) method [48]. The QS method aims to identify a lack of branch support due to low phylogenetic information, discordance because of lineage sorting or introgression, and misplaced or erroneous taxa (rogue taxa) by repeatedly and randomly sampling one taxon from each of the four subsets at a focal branch. For each quartet sampled, the likelihood is evaluated for all three possible topologies given the sequence data for the randomly selected quartet, and the frequencies, with which either the concordant or one of the two discordant topologies shows the best likelihood, are counted. These counts are then used to calculate three branch scores to quantify the relative support among the three possible topologies of four taxa (quartet concordance, QC), the disparity between the sampled proportions of the two discordant topologies (quartet differential, QD), and the proportion of replicates where the likelihood value of the best-likelihood quartet tree exceeds the second-best likelihood score by a given threshold (quartet informativeness, QI).
The QS analysis was conducted on the tree inferred from the (partitioned) strict data set with a maximum of 100 quartet replicates. The likelihood of the three possible topologies for each quartet sample was evaluated using RAxML v. 8.2.12 [42], using the same models used for ML tree inference with IQ-TREE (see above).
The detailed results of the QS analysis are given in the Supplementary Material (Supplementary Table S12, Additional File 2 and Additional File 10).

Investigation of the phylogenetic placement of Embletonia pulchra
Since the position of Embletonia pulchra was not stable between the ML trees calculated from the five data sets, we performed approximately unbiased (AU) tests [49] on the intermediate, partitioned strict, the unpartitioned strict, and the strict SOS data set as implemented in IQ-TREE version 1.6.12. Note that the position of E. pulchra was identical in the best ML trees from the unreduced and the partitioned strict data set as well as in the best ML trees from the intermediate, the unpartitioned strict, and the strict SOS data set. AU tests were performed with 100,000 RELL replicates [50] using the best tree inferred for each data set to optimize model parameters (option -te). Therefore, the best ML trees were modified manually by changing the position of Embletonia with the aid of Mesquite v. 3.61 [51].
In particular, we tested (see Fig. 2 Additionally, Four-cluster Likelihood Mapping (FcLM) [52] was performed on the strict data set, which we consider as our main data set, because it has full partition coverage for each included species and the least amount of missing data. We assessed the strict data set for conflicting signal in order to address the placement of Embletonia as sister to Proctonotoidea and to explore this data set for possible confounding signal that might originate from model violation of SRH conditions and/or non-randomly distributed data (for a rationale, see [24,53,54]). For this purpose, four groups were defined as follows: Groups and included species are provided in Supplementary Table S13, Additional File 2.
We performed FcLM analysing all possible quartets on the original data set (i.e. two analyses with Group 4 being modified) with partition boundaries and selected models as used for ML tree inference with IQ-TREE v. 1.6.9. We furthermore applied three permutations to identify putative confounding signal [24,54]. All data for the FcLM analyses are available in Additional File 12. Detailed results are provided in Supplementary Table   S14, Additional File 2.
Irrespective of whether or not the outgroup taxa were excluded from Group 4, the results remain ambiguous: The majority of quartets suggest a placement of Embletonia as sister to a clade Proctonotoidea + Aeolidia. Only less than 20% of the quartets supported Embletonia as sister to Proctonotoidea, as inferred in our best ML tree from the strict data set. However, Embletonia as sister to Proctonotoidea can be fully explained by confounding signal, as indicated by the permutation approach. This implies that confounding signal most likely outperformed phylogenetic signal and affected ML tree inference. However, the topology suggested by the majority of the quartets, i.e. Embletonia as sister to Proctonotoidea + Aeolidia, was rejected by the AU test. Therefore, we cannot draw conclusions and consider the placement of Embletonia as unresolved, which is also reflected in the negligible statistical support in the inferred ML trees from the strict and intermediate data sets.