Transcriptomics provides a robust framework for the relationships of the major clades of cladobranch sea slugs (Mollusca, Gastropoda, Heterobranchia), but fails to resolve the position of the enigmatic genus Embletonia

The soft-bodied cladobranch sea slugs represent roughly half of the biodiversity of marine nudibranch molluscs on the planet. Despite their global distribution from shallow waters to the deep sea, from tropical into polar seas, and their important role in marine ecosystems and for humans (as targets for drug discovery), the evolutionary history of cladobranch sea slugs is not yet fully understood. To enlarge the current knowledge on the phylogenetic relationships, we generated new transcriptome data for 19 species of cladobranch sea slugs and two additional outgroup taxa (Berthella plumula and Polycera quadrilineata). We complemented our taxon sampling with previously published transcriptome data, resulting in a final data set covering 56 species from all but one accepted cladobranch superfamilies. We assembled all transcriptomes using six different assemblers, selecting those assemblies that provided the largest amount of potentially phylogenetically informative sites. Quality-driven compilation of data sets resulted in four different supermatrices: two with full coverage of genes per species (446 and 335 single-copy protein-coding genes, respectively) and two with a less stringent coverage (667 genes with 98.9% partition coverage and 1767 genes with 86% partition coverage, respectively). We used these supermatrices to infer statistically robust maximum-likelihood trees. All analyses, irrespective of the data set, indicate maximal statistical support for all major splits and phylogenetic relationships at the family level. Besides the questionable position of Noumeaella rubrofasciata, rendering the Facelinidae as polyphyletic, the only notable discordance between the inferred trees is the position of Embletonia pulchra. Extensive testing using Four-cluster Likelihood Mapping, Approximately Unbiased tests, and Quartet Scores revealed that its position is not due to any informative phylogenetic signal, but caused by confounding signal. Our data matrices and the inferred trees can serve as a solid foundation for future work on the taxonomy and evolutionary history of Cladobranchia. The placement of E. pulchra, however, proves challenging, even with large data sets and various optimization strategies. Moreover, quartet mapping results show that confounding signal present in the data is sufficient to explain the inferred position of E. pulchra, again leaving its phylogenetic position as an enigma.


Background
Marine Heterobranchia (Gastropoda) have become a major focus in monitoring reef biodiversity [1][2][3][4][5]. They mainly prey on a high variety of marine sessile organisms, from algae to sponges, cnidarians, bryozoans and tunicates, and very often take up the chemical compounds of the food for their own defence. These "stolen" compounds have become of high interest for pharmacists in finding new drug leads for medical applications [6][7][8][9][10]. However, they are also of high interest because of various unique biological phenomena. Sacoglossans and some nudibranchs incorporate chloroplasts or dinoflagellates and thus can serve as model organisms in understanding the evolution of photosymbiosis [11][12][13]. Within marine Heterobranchia, the shell-less Nudibranchia have developed a variety of biological strategies that make them unique within Metazoa. Of particular interest is the sequestration of cnidocysts from the cnidarian prey, storing them in special morphological structures (cnidosacs) in exposed body areas, and the ability to mature the stolen cnidocysts (cleptocnides) in the cnidosac [14][15][16][17]. This unique defence system seems to have evolved only in one of the major nudibranch clades, the Cladobranchia, within which there are likely two independent origins [17].
Nudibranchia, with the two clades Cladobranchia and Anthobranchia, form a monophyletic group that is well explained by morphological features [18]. The sister group relationship to Pleurobranchomorpha (Pleurobranchida) as well as monophyly of Nudibranchia was shown with transcriptomic data by Zapata and colleagues [19] and was again later confirmed by additional data [20]. The monophyly of Nudibranchia has also been confirmed in various molecular analyses using larger taxon sets, albeit small gene sets (see review in [21]). However, few studies have used both morphological and molecular methods to obtain and explain phylogenetic relationships within Cladobranchia. A comprehensive study of Anthobranchia (Doridida) applying both molecular phylogenetic and ontogenetic data was published recently [22]. Similar studies are still lacking for Cladobranchia.
Cladobranchia comprise seven superfamilies (World Register of Marine Species [23]) and two unassigned families, Goniaeolididae and Heroidae. Two of the superfamilies, Aeolidioidea and Fionoidea, are usually united under the name Aeolidida, with the possession of the cnidosac as one shared character [18]. Dendronotoidea and Tritonoidea were united in former times under the name Dendronotacea [24], and Proctonotoidea and Arminoidea under the name Arminacea [25]. However, many morphological and molecular studies have contested the hypotheses of Dendronotacea and Arminacea (see review in [21]), but this is not completely disregarded according to ancestral state reconstruction [26]. The enigmatic Doridoxoidea were considered for some time as the sister taxon of all other Cladobranchia, representing the Dexiarchia concept in the sense of [27]. However, it has been shown that Doridoxidae is not a sister taxon to, but part of the Nudibranchia s. str. (Cladobranchia) [28,29].
Pola and Gosliner [30] aimed to resolve the phylogeny of Cladobranchia using one nuclear (Histone 3) and two mitochondrial genes (cytochrome c oxidase subunit I and 16S rRNA): the study resulted in a topology that primarily consisted of an unresolved polytomy on family level. Several further studies focussed on different parts of the cladobranch tree. Bleidissel [31] analysed the Aeolidida within the Cladobranchia, based on three genes (18S, 16S, and CO1), in order to investigate the evolution of the incorporation of algae from the genus Symbiodinium in certain sea slugs. In that study, for the first time, the paraphyly of the aeolidid family Facelinidae was shown. Recently, by the inclusion of the type species of the genus Facelina, Facelinidae sensu stricto was revealed and the name Myrrhinidae resurrected for the second "facelinid" clade [32]. Korshunova and colleagues [33] studied the relationships within the former Flabellinidae, including representatives of many Aeolidida. The authors provided much evidence for the paraphyly of the former Flabellinidae, which they then split into five different families. More recently, Korshunova and Martynov [26] were able to resurrect old genus names (e.g., Duvaucelia) and proposed a new genus Tritonicula after a careful analysis of the Tritoniidae. They also confirmed the inclusion of Doridoxa within the Cladobranchia. Detailed studies on the general taxonomic patterns of Fionoidea including the families Abronicidae, Calmidae, Cuthonellidae, Cuthonidae, Eubranchidae, Fionidae s. str., Tergipedidae, Trinchesiidae, Xenocratenidae, and Murmaniidae are presented in [34,35]. All molecular analyses of these studies are based on only a few genes.
Recent analyses, using a large transcriptomic data set, provided the first robust cladobranch tree that enabled the study of evolution of food preferences [36,37]. In a subsequent study, a broader data set with nearly 90 taxa was used to examine the evolution of the cnidosac [17], which is the main defence system of Aeolidida [35]. Within some taxa (e.g., members of the genus Phestilla of the family Trinchesiidae) cnidosacs are secondarily reduced [34]. Similar defence structures have evolved independently in Hancockia [14], a genus assigned to Dendronotida [17]. However, the authors based their interpretations on a phylogenetic tree with partly low statistical support. Moreover, a few taxa showed relatively long branches compared to other members of the family (Cerberilla) or even the same genus (Janolus). Therefore, bias due to possible long branch artefacts cannot be excluded. A reduced data set was used by Goodheart and Wägele [39] to study the taxonomic relationship of an enigmatic pelagic cladobranch, the genus Phylliroe, to analyse morphological traits enabling a shift from a benthic life style into a pelagic form. Robustly resolved and reliably inferred phylogenetic trees that are not affected by confounding signal, but driven by informative phylogenetic signal, are one prerequisite for answering questions about the evolutionary history of taxa and biological phenomena, such as the aforementioned evolution of the cnidosac and photosymbiosis. Therefore, only trees that reflect most likely the "true" history of species allow the inference of biological traits to understand biodiversity and its origin. Inferred trees resulting from methodological or computational inadequacy can lead to erroneous hypotheses (see, e.g., [40]). Taxa that diversified quickly and/or underwent rapid radiation events within a short period of time are especially difficult to analyse (see, e.g., [41] and several examples in [42]).
Here, we present a thorough study on 57 cladobranch and four outgroup transcriptomes in order to obtain a statistically highly supported phylogenetic tree and to check whether or not ambiguous splits in this tree might be based on confounding and thus erroneous signal. After discarding three species with a low coverage of the ortholog set as well as two species due to model violation, we compiled four final data sets. All data sets included 56 out of the original 61 species but differed in their alignment completeness and gene partition coverage. Based on the most complete data set, we comprehensively examined the ambiguously inferred position of Embletonia, which has been assigned to various groups in the past without any current consensus [15,17,31,43,44], for alternative topologies with approximately unbiased (AU) tests [45], Four-cluster Likelihood Mapping [46,47], and sampling puzzling [48] approaches.

Data preparation prior to phylogenetic analyses
A list with details on the 21 species with newly sequenced transcriptome data is provided in Additional file 2: Table S1. Accession numbers for all species are given in Additional file 2: Table S2. Higher species affiliation follows the World Register of Marine Species (WoRMS), with the exception of the clade of all aeolids, which we call Aeolidida.

Transcriptome sequencing and data processing
Paired-end sequencing resulted in approximately 7.5 Gbases of raw data per sample. For the newly generated transcriptomes, the number of complete read pairs ranged from 20,266,817 in Calmella cavolini to 43,524,035 in Facelina rubrovittata with a median of 24,882,673 (Hancockia cf. uncinata). After trimming of possible adapter sequences and sequence regions of low quality, the average read length of complete read pairs ranged from 118.1 bp in Hermissenda emurai to 139.6 bp in Doto sp. with a median of 133.8 bp in Polycera quadrilineata (Additional file 2: Table S3). Details on sequence processing are provided in Additional file 1. Transcriptome assembly using six different de novo assemblers per data set resulted in a total number of 366 assemblies, i.e., six assemblies for each of the 61 transcriptomic data sets (see Additional file 1 and Additional file 2: Table S4).

Evaluation of transcriptome assemblies, orthology prediction, and alignment procedures
Evaluation of assembled transcriptomes and subsequently applying BUSCO version 3.0.0 [49] with the Metazoa set including 978 orthologs revealed a median of 731 (75%) complete BUSCO genes per sample (maximum: 943 complete BUSCO genes [27 fragmented, 8 missing] in Caloria elegans; minimum: 158 complete BUSCO genes [123 fragmented, 697 missing] in Doris kerguelenensis). All quality assessment results of the transcriptomes using BUSCO are summarised in Additional file 2: Table S5.
We additionally evaluated the quality of all transcriptomes separately for each assembly method based on the results of orthology prediction and identified single-copy protein-coding genes with our custom-made ortholog set comprising 1992 orthologs (see "Methods" and Additional file 1). Results were ranked based on the cumulative length of transcripts that were successfully assigned to the reference genes used to identify singlecopy orthologs (OGs) in the transcriptomes (see Additional file 2: Table S6). The cumulative lengths ranged from 82,409 bp in Pseudobornella orientalis (the genus was recently resurrected by Korshunova and colleagues [50]) (IDBA-Tran, 458 genes successfully assigned) to 784,043 bp in Caloria elegans (Shannon, 1904 genes successfully assigned). The median was 472,305 bp for the cumulative length and 1577 for the number of successfully assigned genes. The best assembly (according to the largest cumulative length) out of the six available per sample was selected as the representative transcriptome for the respective species and was submitted to NCBI. Transcriptomes accepted by NCBI after removal of possible foreign sequence contamination were used for all further downstream analyses (see Additional file 2: Tables S2 and S7). In order to reduce the amount of missing data in subsequent analyses we excluded three samples for which less than 60% of OGs included in the search had been identified: Pseudobornella orientalis, Dermatobranchus sp., and Tritoniopsis frydis. Furthermore, we only kept OGs for which at least 50% of the investigated 58 species had a positive hit. This resulted in 1767 OGs that we subsequently used to generate multiple sequence alignments (MSAs) on amino acid level. Checking the MSAs for outlier sequences (i.e., putatively misaligned or misassigned amino acid sequences), we identified 897 sequences in 112 MSAs that were subsequently removed. Outliers were found in sequences from all remaining 58 species with the highest number of 30 outlier sequences in Limenandra confusa and the lowest number of eight outlier sequences in Doris kerguelenensis (median: 15 outliers, all details are provided in Additional file 1 and Additional file 2: Table S8).
Alignment masking resulted in masking of alignment sites in 1,519 out of 1,767 genes (Additional file 1) leaving ~ 71% of aligned unmasked sites for subsequent analyses.

Compilation, evaluation and optimization of data sets
Analysing the concatenated supermatrix using MARE v. 1.2-rc [51], AliStat v. 1.6 [52] for information content and data coverage, and SymTest v. 2.0.47 [53] for putative violation of stationary, (time-)reversible and homogenous (SRH) model conditions [54,55] using the implemented Bowker's matched pairs test of symmetry [56] led to the results shown in Additional file 3: Figs. S1 and S2.
With respect to the amount and distribution of missing data we initially compiled two data sets as described in the methods section. The data set allowing for the highest amount of missing data, termed "original unreduced data set", was not further reduced after concatenation and comprised 58 species, 771,739 aligned amino acid positions and 1767 gene partitions. The second data set with a full gene coverage for all 58 species (termed "original reduced data set") comprised 143,859 aligned amino acid positions and 364 gene partitions. Analysing both data sets for violation of SRH model conditions with SymTest revealed that two species strongly violated the SRH conditions: Calmella cavolini and Doris kerguelenensis (Additional file 3: Fig. S2). Therefore, the sequences belonging to these two species were removed entirely from all MSAs from the original unreduced data set. This newly created data set (termed "unreduced data set") spanned a superalignment length of 771,706 amino acid positions including 1767 gene partitions.
To reduce the amount of missing data, we compiled an "intermediate" data set featuring only those gene partitions for which at least one representative of the selected taxa was present (see "Methods", Additional file 1, and Additional file 2: Table S9). This data set (termed "intermediate data set") spanned a superalignment length of 271,732 amino acid positions and included 667 gene partitions. The third and most strict data set with full gene coverage for each of the 56 species (termed "strict data set") had a superalignment length of 170,140 amino acid positions and included 446 gene partitions. This data set was further examined in an additional analysis using mixture models (i.e., without partition information). Finally, we compiled a fourth data subset with increased overall information content discarding less informative genes (hereafter called "strict SOS data set") with all 56 taxa and an even further reduced number of 126,094 aligned sites and 335 gene partitions. Details on data matrix diagnostics are provided in Additional file 1, Additional file 2: Tables S10 and S11, and Additional file 3: Figs. S3-S8.

Phylogenetic relationships of sea slug taxa
All analyses irrespective of the data set indicate maximum statistical support for all major splits and phylogenetic relationships at the family level ( Fig. 1, Additional file 3: Figs. S9-S17). Notably, low statistical support was inferred with regard to the phylogenetic position of the genus Embletonia. In the following, we discuss taxa relationships using the names according to the latest changes [57] that are implemented in World Register of Marine Species [23,58], although we disagree with several assignments as discussed below.

Phylogenetic relationships of major taxa and sea slug families
Out of the seven accepted superfamilies of Cladobranchia, we were able to include members of six superfamilies, whereas a representative of the rare Doridoxoidea, recently confirmed as sister to the Arminoidea [26], was not available to us. We inferred Aeolidida, Aeolidioidea (sensu WoRMS), Proctonotoidea, and Dendronotoidea, with representatives of various families and genera, as being monophyletic. This was fully supported by the quartet scores [48] for Aeolidida, Aeolidioidea, and Proctonotoidea, and strongly supported for Dendronotoidea (see Quartet Sampling scores, splits 1-3 and indicate a BS support value of 100. The numbers represent splits that are discussed in the main text and the surrounding coloured circles represent Quartet Sampling (QS) scores for the corresponding split. QFreq. quartet frequencies, QC quartet concordance, QD quartet differential, QI quartet informativeness. Higher affiliation (except for the term Aeolidida) is according to the system as proposed in WoRMS 8 in Fig. 1 and Additional file 2: Table S12). Arminoidea and Tritonioidea are only represented by one genus each. Thus, former results on monophyly [26] still need to be tested in future genomic analyses.
Our analyses revealed the following ambiguities: Flabellina affinis (type species of Flabellinidae), which is currently regarded as a representative of Fionoidea [17], is inferred as sister taxon to Aeolidioidea with maximal statistical support. Quartet sampling, on the other hand, showed only medium support (split 4 in Fig. 1, Additional file 2: Table S12) with the large majority of quartets (67%) supporting the focal branch (Aeolidioidea + Flabellina affinis), but the strong skew in discordance [quartet differential (QD) = 0] indicates the possibility of a single different evolutionary history supported by all remaining quartets. Although the type species was included in our analyses, additional data of further members of the family should be included in future studies to address a possible taxonomic revision.
The family Flabellinopsidae is currently listed as a member of the Aeolidioidea in WoRMS [58] with Flabellinopsis iodinea (Flabellinopsidae) being sister to all remaining taxa in this large clade, confirming previous results [17,33,36,37]. Again, this position is statistically maximally supported by classical support values (BS, aLRT, aBayes) in our study and quartet sampling scores confirmed this position (split 5, Fig. 1) with strong support (94% of the non-uncertain quartets). Although a strong skew in discordance (QD = 0) indicates the possible presence of an alternative quartet relationship, this result is rather less meaningful due to the low number of discordant trees (5% of the non-uncertain quartets). Thus, our results on Flabellinidae and Flabellinopsidae partly contradict recent analyses and subsequent systematic assignments.
The majority of the family Facelinidae is inferred as being monophyletic, but the facelinid species Noumeaella rubrofasciata groups with Myrrhinidae in published analyses [17,37] as well as in our study with nearly maximal classical statistical support. However, quartet sampling only shows weak support for this relationship (38% of the non-uncertain quartets; see split 6 in Fig. 1 and Additional file 2: Table S12). In fact, the quartet frequencies show no clear signal since all three quartet topologies are roughly equally supported (27% of the non-uncertain quartets support the second possible quartet topology, 36% support the third; QD = 0.85). Thus, the assignment of this species to Facelinidae [57] or Myrrhinidae (our results) should be reconsidered in future studies. Interestingly, this species did not cluster with several other Noumeaella species in a three-gene analysis of Aeolidida by Schillo and colleagues [59].
Within Aeolidioidea, the families Myrrhinidaeexcluding Noumeaella rubrofasciata-and Aeolidiidae form a monophyletic sister group relationship in our study, thus confirming the results of [17,37], and [32]. This is also consistent with recent morphological and molecular analyses using all acknowledged families of Aeolidida [60].
Fionoidea in the sense of Bouchet and colleagues [57] is polyphyletic, mainly due to the position of Flabellina affinis and Embletonia pulchra, the latter is discussed below. Within Fionoidea, the family Trinchesiidae, represented here with three genera, is monophyletic. Fionidae and Eubranchidae are related to Trinchesiidae. Unidentia is sister to these three families. A recent analysis [60] inferred Unidentiidae as sister taxon to Embletoniidae and a clade Embletoniidae + Unidentiidae as sister to all other clades of Aeolidida. Previous analyses did not reveal a robust placement of Unidentiidae and inferred this clade as sister to members of the Aeolidioidea (Facelinidae, Babakinidae, and Aeolidiidae), which in turn are sister to a larger group of Coryphellidae, Flabellinidae, Flabellinopsidae, and Paracoryphellidae [33]. In a subsequent study [61], Unidentiidae was inferred as sister taxon to a clade of Aeolidiidae, Babakinidae, Coryphellidae, Facelinidae, Flabellinidae, Flabellinopsidae, and Paracoryphellidae. In contrast, our quartet sampling analyses do not unambiguously support the relationship of the Unidentiidae as sister to other Fionoidea with a rather weak quartet support (52% of the non-uncertain quartets). The support for the other two possible quartet topologies is almost similar (QD = 0.99), which indicates that no alternative history is favoured (see split 7 in Fig. 1 and Additional file 2: Table S12). In this context, the results of Goodheart and colleagues [17] are quite noteworthy, because in their study, Unidentiidae is the sister taxon of Embletonia while the clade Embletonia + Unidentiidae is sister to all remaining Fionoidea. Results by Martynov and colleagues [60] suggest a sister group relationship to other aeolidoidean families, which is in part compatible with our results (see below).
The family Samlidae, represented by Luisella babai, is considered as being part of Fionoidea [17,33]. In our study, however, it is inferred as sister to all remaining Aeolidida in all analyses with maximum classical statistical support as well as very strong quartet support (see split 8 in Fig. 1 and Additional file 2: Table S12): About 98% of the quartets support this relationship without evidence for alternative quartet topologies (QD = 1). Including more taxa representing more families of the Aeolidida is necessary to address and hopefully solve this incongruence of our data in comparison to published results.
With regard to Proctonotoidea, Tritonioidea, and Dendronotoidea, our results confirm the findings published by Goodheart and colleagues [17] with the family Embletoniidae being the only exception, as we will discuss below.

The phylogenetic position of Embletoniidae remains ambiguous
The monogeneric family Embletoniidae, which currently only comprises two recognized species, Embletonia pulchra and E. gracilis, has experienced a rich history since the first description of the genus Embletonia by Alder and Hancock [62], with Pterochilus pulcher Alder and Hancock, 1844 as type species. The authors considered this species as a link between cladobranch aeolids and panpulmonate sacoglossans, two taxa that are not closely related to each other, but show many convergent characters. Pruvot-Fol [43], who named the family for the first time, included members of Trinchesiidae, but assigned the whole clade as a "section" to the dendronotoid family Dotidae. The two recognized members of Embletonia share some characters with members of Fionoidea or Aeolidioidea, e.g., the reduction of the lateral teeth, the absence of rhinophoral sheaths [64], and the presence of a cnidosac at the end of the cerata, a synapomorphy of Aeolidida [18], which additionally favours a position within this clade. However, Martin and colleagues [15] and Goodheart and colleagues [17] have shown that this cnidosac differs to a great extent from the typical aeolidid cnidosac by lacking a proper sac-like structure with musculature around it, as well as a connection to the digestive gland, which is necessary for taking up sequestered cnidocysts. Nevertheless, cnidocysts were found in the structures investigated by Goodheart and colleagues [17]. The authors explain this atypical situation with a loss of characters or as constituting a transitional form in the evolution of the cnidosac. Most recently, Martynov and colleagues [60] provided evidence for paedomorphic processes (e.g., reduction of oral tentacles), which would explain a regressive evolution of Embletoniidae within Aeolidida. This phenomenon is known from various unrelated taxa inhabiting soft-bottom interstitial environments [60]. Embletonia feeds on hydrozoans, which is a typical food source of many aeolidids, but also of some dendronotoids. Unique to this genus are the cerata, which show bi-to quadrifid apices. Highly structured cerata are not known from any aeolidids. However, the digestive gland reaches far into these cerata, a character less pronounced in Proctonotoidea, and only present in one further non-aeolidid group, the genus Hancockia [65].
Embletonia also shares traits that are characteristic for non-aeolidid groups, a reason why Pruvot-Fol [43] included the genus into the family Dotidae (Dendronotoidea). This assignment to Dotidae, as well as grouping with Trinchesiidae was, however, rejected later by Schmekel [44], and the closer relationship to Dendronotoidea was emphasized by Miller and Willan [66]. The primary connecting character is the lack of oral tentacles, which are considered to be a synapomorphy of the traditional dendronotaceans [18,26]. Furthermore, their oral gland ducts do not open into the oral tube by two separate ducts, but fuse into one common duct, which is described for Proctonotoidea. Proctonotoidea mainly feed on bryozoans, however, a few members also rely on hydrozoan prey, similar to Embletonia.
Few studies addressed the phylogenetic relationship of Embletoniidae using molecular data [17,31,60]. Bleidissel [31] focussed on Aeolidida and included Embletonia, because of its putative assignment to this group. Bleidissel's analyses, based on three genes, inferred a sister group relationship of Embletoniidae with Notaeolidiidae, with the latter again being sister to all remaining Aeolidida. In the only study based on a large data set, Embletonia was inferred, along with Unidentia, within Aeolidida as sister to the remaining Fionoidea, thus excluding a closer relationship with Notaeolidia [17]. Martin and colleagues [15] included characters of the cnidosac into the morphological data matrix published by Wägele and Willan [18], and their analysis resulted in an assignment of Embletonia to Aeolidida (tree not shown in the publication). Likewise, our unpublished morphological analyses render Embletonia as a sister taxon to Aeolidida. However, it is more likely the lack of data that constrains the position than apomorphic characters of high phylogenetic information.
In our analyses comprising the unreduced and strict data set, Embletonia pulchra is inferred as sister to Proctonotoidea with full support in the unreduced data set (100 BS, 100 aLRT, 1 aBayes), but with negligible support in the strict data set (65 BS, 50.1 aLRT, 1 aBayes). When assuming that Embletonia is a sister taxon of the Proctonotoidea (see split 9 in Fig. 1 and position i in Fig. 2) and taking into consideration the studies on the evolution of prey preferences [37] and cnidocyst incorporation [17], we have to conclude that (1) feeding on Hydrozoa is an ancestral trait within Cladobranchia and has not changed in Embletonia (in contrast to Proctonotoidea) and (2) the evolution of the cnidosac might have started in the stemline of the clade Aeolidida/Proctonotoidea/Embletoniidae, with Janolus and Dirona probably representing a condition where the ability to store cnidocysts was lost due to a food switch to bryozoan prey. Both an independent evolution of cnidosacs and cnidocyst storage (in the genus Hancockia) as well as a loss or strong reduction of these complex structures has occurred within Dendronotoidea [17] and Aeolidioidea [34,60].  Table S12), which indicates a highly complex evolution or rapid radiation.
Morphological analyses of important characters, like the positions of the anus, jaws, and radula also contradict both relationships discussed above with apomorphic features lacking for both hypotheses [60]. Instead, Embletoniidae shows an uniserial radula with central teeth more similar to various aeolidids [60].

Evaluation of alternative positions of Embletoniidae and possible confounding signal
To gain more insights into one of the obtained positions of Embletonia and to investigate alternative positions (see Fig. 2), further analyses were conducted. Note, that we consider the strict data set as most reliable, since it has full gene coverage for all species, following the rationale of Dell' Ampio and colleagues [67] and Misof and colleagues [47], who showed that inferred positions with high statistical support can be simply due to nonphylogenetic signal, e.g., the distribution of missing data. In addition, the strict data set shows the highest completeness scores of all data sets (see Additional file 2: Table S11). However, we also performed some of the analyses on the intermediate data set, the unpartitioned strict data set, and the strict SOS data set as described in the following.
We applied approximately unbiased (AU) tests [45] for alternative positions of Embletonia on the intermediate data set, both strict data sets (partitioned and unpartitioned), and the strict SOS data set. An AU test always takes the complete tree topology into account and not only single splits. Further, it does not test whether or not confounding signal is inherent in the data set, e.g., due to non-randomly distributed data and/or among-lineage heterogeneity violating SRH conditions. We therefore also applied Four-cluster Likelihood Mapping (FcLM) [46] along with a permutation approach on the strict (partitioned) data set. By testing all three possible quartet topologies around Embletonia we evaluated whether or not there was an alternative signal. Further, we checked for any sign of confounding signal (see [47]). To this end, we defined four groups (Additional file 2: Table S13) considering group 4 as an outgroup. We performed separate analyses for two outgroup variations: first, with 19 species including Anthobranchia and Pleurobranchomorpha and second, only with the 15 remaining cladobranch species. We drew quartets on the original data set and on three artificial data sets, from which any existing phylogenetic signal was removed in three different ways (see "Methods", Additional file 1, and [47]): (a) by destroying the phylogenetic signal but leaving the distribution of missing data and the compositional heterogeneity, which can lead to violating SRH conditions, untouched; (b) by leaving the distribution of missing data untouched but making the data set completely homogenous (no SRH model violation possible), and (c) by randomizing the missing data distribution and making the data set completely homogenous. For all details see Additional file 1.
Interestingly, the results of the phylogenetic trees and the results of the FcLM (Additional file 2: Table S14) and AU tests (Additional file 2: Table S15) were quite contradicting: (i) Although the best ML trees of the unreduced data set and strict (partitioned) data set suggest that Embletonia is sister to Proctonotoidea and although the AU test was unable to reject this topology (p > 0.05), it received the lowest proportion of quartets (< 20%) in the FcLM approach. Thus, this relationship can only be explained by confounding signal (see original and permutation results in Additional file 2: Table S14).
(ii) Although the best ML trees of the intermediate data set, the unpartitioned strict data set analysed with a mixture model approach, and the strict SOS data set suggest Embletonia to be sister to all remaining Aeolidida, a position that is not rejected by the AU test (p > 0.05), the FcLM results indicate only minimal support for such a relationship: the proportion of supporting quartets, excluding those that can be explained by confounding signal, was only around 3%. This also implies that AU tests, irrespective of whether or not a topology for the data set is significantly rejected, should not only be used to check if the signal is confounding.
(iii) A sister group relationship of Embletonia to a clade Aeolidida + Proctonotoidea, which received strongest support in the FcLM analyses (8-16% of all quartets after excluding the proportion of supporting quartets that can be explained by confounding signal, see Additional file 2: Table S14), was equally rejected by all AU tests.
There is only very little signal that is not confounding (around 3-8%, compare quartets of original with permuted approaches, Additional file 2: Table S14), which would support either Embletonia + Aeolidida (position ii in Fig. 2) or Embletonia as sister to a clade Aeolidida + Proctonotoidea (position iii in Fig. 2). Thus, these results clearly indicate that the position of Embletonia as a sister taxon of Proctonotoidea is most likely not due to informative phylogenetic signal, but mainly due to confounding signal in our data set, and again leaves the phylogenetic position of Embletonia as an enigma.
In order to analyse further possibilities of putative relationships of Embletonia, we tested four alternative positions (iv-vii, see Fig. 2) of Embletonia, which have been discussed in the literature before, by applying the AU test on the (partitioned) strict data set (see Fig. 2 and see below). Note that none of these positions were inferred in any of our ML analyses.
(iv) Since Embletonia exhibits characters, which are shared with the Dendronotoidea, we analysed a putative sister group relationship with this superfamily.
(v) Although an assignment to Tritonioidea is very unlikely, because Embletonia does not share all the characters special for this superfamily, the position of the Arminoidea is variable within the various published phylogenies [17,68,69] when including this superfamily. Nevertheless, we tested this possibility.
The last two tests imply a closer relationship of Embletonia with Fionoidea, a relationship that was assumed in former times and reflects the current systematics [57]. Therefore, we tested (vi) a position of Embletonia as sister to Fionoidea and (vii) Embletonia as sister to Unidentiidae and this clade being again sister to the remaining Fionoidea in restricted sense [17,60].
Despite our extensive molecular data sets and tests, we still cannot unambiguously assign Embletonia to one of the superfamilies in our tree. Beyond only small putative phylogenetic signal as indicated by our FcLM analyses, which is also in line with the negligible support considering classical statistical support, a reason could be the lack of relevant taxa in our data set that could positively influence the position of Embletoniidae in the cladobranch tree (e.g., Doridomorpha, Curnonidae, Notaeolidiidae). Achieving congruence between morphological and molecular data within the Cladobranchia is a task for future research.

Conclusions
Due to the high number of orthologous single-copy genes that could be successfully extracted from the transcriptomes, the high information content and up to full gene coverage of the supermatrices, and the high resolution of all phylogenies, we conclude that the use of transcriptomic data is a valuable tool for analysing phylogenetic relationships within Cladobranchia. Nevertheless, analyses of large data sets can be error-prone to systematic bias and classical support values might be inflated as has been shown and discussed [70][71][72][73]. Beyond careful data processing prior to phylogenetic tree inference, additional thorough tests, e.g., AU tests, quartet approaches like FcLM and quartet sampling as well as checks for confounding signal on a variety of different data matrices become more and more indispensable. Our study has revealed that, despite previous efforts, the position of some families within this group, especially the Embletoniidae, requires further investigation and possibly taxonomic revision. In future studies, the present data set should be extended by increasing the number of groupspecific orthologous single-copy genes and by including Curnonidae, Notaolidiidae, and other relevant species to shed light on the relationships between families and superfamilies in Cladobranchia in order to draw a more complete image of the evolution of this enigmatic group.

Methods
An overview of the complete workflow is displayed in Fig. 3. Major steps are described here while all details and settings can be found in Additional file 1.

Taxon sampling and sampling of transcriptome data
For this study, we used recently published transcriptome data and generated new transcriptome data for 21 species. We collected 19 species of Cladobranchia and two more distantly related species of heterobranch sea slugs from different locations in the Mediterranean Sea and the Sea of Japan (Additional file 2: Table S1). The specimens were preserved in RNAlater (Qiagen) or IntactRNA (Evrogen) and stored at − 80 °C. The specimens collected on Elba island (Additional file 2: Table S1) were stored at − 20 °C for approximately two weeks and then transferred to − 80 °C until RNA extraction. RNA extraction of mostly whole organisms (for Armina tigrina only foot tissue was used) was performed using the Macherey & Nagel NucleoSpin RNA II kit. Preparation and amplification of the cDNA libraries were performed by StarSeq GmbH, Mainz using the Illumina TruSeq Stranded RNA HT kit. Paired-end sequencing was also conducted at StarSeq with a read length of 150 base pairs on an Illumina NextSeq 500 sequencing platform. Raw reads were Fig. 3 Analysis workflow. Schematic workflow representing all steps from NGS data to the testing of alternative topologies with major steps being highlighted in shades of gray submitted to the NCBI SRA database. All accession numbers are provided in Additional file 2: Table S2.
Our newly generated transcriptome samples were combined with the published transcriptome data of another 40 samples that we downloaded from the NCBI SRA database (Additional file 2: Table S2) [19,36,37,74]. The published data comprised 37 species of Cladobranchia as well as two dorids, Prodoris clavigera and Doris kerguelenensis, and one pleurobranchid, Pleurobranchaea californica (Additional file 2: Table S2).

De novo transcriptome assembly
All raw sequence reads of published and newly generated samples were quality-checked prior to and after adapter trimming using FastQC Version 0.11.5 [75]. Adapter trimming and quality filtering were performed with Trimmomatic v. 0.36 [76] using a custom adapter file (see Additional file 4). Reads that were shorter than 80 bp after trimming, were removed from the read set.

Orthology prediction and generation of data matrices
We designed a custom-made ortholog set by selecting all genes that were listed by OrthoDB version 9 [83] to be single-copy at the hierarchical level "Lophotrochozoa" and downloaded the respective table with the IDs of the ortholog groups (called OGs hereinafter). We additionally downloaded the official gene sets of three species with well-sequenced and annotated genomes, which we selected as reference species: Biomphalaria glabrata, Official Gene Set (OGS) version 1.2 vectorbase [84], Crassostrea gigas, OGS version Sep-2012 (ENA genebuild) [85], and Lottia gigantea, OGS version Jan-2013 (JGI genebuild) [86]. We excluded five genes from this set due to defective sequence headers, leading to a custom-made ortholog set comprising 1,992 orthologs. Orthology prediction was performed using Orthograph v.0.6.2 [87], for which we used the aforementioned ortholog set (Additional file 5). Details are provided in Additional file 1. To reduce the amount of missing data per species, three transcriptome assemblies that covered less than 60% of the ortholog set were excluded from further analyses: Pseudobornella orientalis (53% of the ortholog set missing), Dermatobranchus sp. (46% missing), and Tritoniopsis frydis (51% missing). We then removed all OGs for which less than 50% of the investigated species had a positive hit. This resulted in 1,767 OGs for further analyses.
The quality of all transcriptome assemblies was further assessed with BUSCO v3.0.0 using the meta-zoa_odb9 reference set genes comprising 978 BUSCO groups [49] and default settings (Additional file 2: Table S5). Because BUSCO's general Metazoa data set is not very specific for nudibranchs and since there is no way to easily compile a nudibranch-specific reference data set (R. Waterhouse, personal communication), we devised a method that makes use of the output generated by Orthograph. For each Orthograph run, we calculated the number of sequences that were assigned to OGs by Orthograph as well as the cumulative length of these sequences. With the aim to maximize the amount of data, the latter was used as a criterion to determine the best assembly for each species (for details see Additional file 1, Additional file 2: Table S6).
Multiple sequence alignments on amino acid level (as automatically produced by Orthograph) were generated using DIALIGN-TX version 1.0.2 [88] and checked for outlier sequences using a newly implemented version of the outlier script described in [47] (see Additional file 1 for details; unfiltered alignments are provided in Additional file 6). Sequences identified as outliers as well as all sequences belonging to the three reference taxa were removed from the alignments.
The amino acid multiple sequence alignments were examined with the program Aliscore version 2.0 [89,90] in order to identify ambiguous or randomly similar aligned sites. All positions flagged by Aliscore (~ 29% of the originally aligned sites, see Additional file 1) were discarded using AliCut version 2.31 [91] (Additional file 7). The resulting masked amino acid alignments were concatenated into a supermatrix along with the creation of a partition file using FASconCAT-G version 1.04 [92].

Compilation, evaluation and optimization of data sets
Based on the concatenated supermatrix, we compiled another matrix with the aid of the Perl script fasta2hypo (see Supplement of [47]) and kept only those gene partitions with sequence data for all 58 species, ensuring 100% partition coverage for each included species. Both amino acid supermatrices were analysed using the software tool MARE version 1.2-rc [51] in order to assess the potential information content (IC) of each gene partition, the overall information content of the matrices, and the coverage in terms of gene partitions. The tool AliStat version 1.6 [52] was used to calculate alignment diagnostics and the software SymTest version 2.0.47 [53][54][55] was used to analyse the compositional heterogeneity of the supermatrices in order to detect possible violations of stationary, (time-)reversible, and homogeneous (SRH) conditions [56].
To reduce especially among-lineage heterogeneity (see "Results and discussion"), we excluded the species Doris kerguelenensis and Calmella cavolini from our data (see Additional file 3: Figs. S2 and S8).
We repeated analyses with MARE, AliStat, and SymTest and compiled four final data sets, allowing different levels of missing data (Additional file 2: Table S11): (1) an unreduced data set with 56 species and all 1,767 gene partitions with 771,707 aligned amino-acid sites and allowing ~ 39% missing data; (2) an intermediate data set in which data for at least one representative of the defined groups (Additional file 2: Table S9) had to be present, which led to a data matrix of 56 species and 667 gene partitions (271,732 aligned sites) with 98% gene coverage and 18% of missing data, and (3) our most strict data set only including genes present in all 56 species. This led to a data matrix with 170,140 aligned sites, 446 gene partitions and less than 13% of missing data. Note that we used this data set without partitioning information in an additional analysis using mixture models (see below). Finally, we again increased the overall information content (IC) using MARE with default settings. By discarding less informative genes as identified by MARE this resulted in (4) a strict selected optimal subset (SOS) with all 56 taxa, 126,094 aligned sites, 335 gene partitions and circa 15% of missing data. Missing data can lead to confounding signal in phylogenetic inference [47,51,67]. We therefore consider our strict data set as most reliable. Details are provided in Additional file 1. All supermatrices are provided in Additional file 8.

Phylogenetic tree inference
For all four data sets, maximum likelihood (ML) trees were calculated using IQ-TREE version 1.6.12 [93]. The best fitting amino acid models for each partition were identified using ModelFinder [94], which was run using an edge-link partitioned approach [95]. Additionally, we performed a tree inference using a mixture model approach for the strict data set (i.e., without gene partitioning) using IQ-TREE. Again, ModelFinder was used to select the best mixture model. Out of 20 tree searches per data set, we selected the best ML tree according to the best log-likelihood. Statistical support was derived from non-parametric bootstrap replicates (BS) ensuring bootstrap convergence. Additionally, we calculated SHlike approximate likelihood ratio test (aLRT) support [96] and approximate Bayes test (aBayes) support [97]. The best ML tree of each of the three data sets was tested for the presence of rogue taxa using RogueNaRok v.1.0 [98]. Details for each step including used settings are provided in Additional file 1.

Testing for alternative topologies Quartet sampling
To analyse phylogenetic discordance, we applied the Quartet Sampling (QS) method [48], which aims to identify the lack of branch support due to low phylogenetic information, discordance due to lineage sorting or introgression, and misplaced or erroneous taxa (rogue taxa). Details on the analysis and interpretation of scores are provided in Additional file 1, Additional file 2: Table S12, and Additional file 9.

Testing the position of Embletonia
Since the inferred position of Embletonia pulchra was not stable comparing the best ML trees of our various data sets, we tested various possible topologies with AU tests (see Fig. 2) [45] as implemented in IQ-TREE version 1.6.12 (see "Results and discussion" Additional file 1 and Additional file 10). To further analyse whether or not the placement of Embletonia in our best tree inferred from the strict data set was influenced by confounding signal and violating SRH conditions, and whether or not there was putative phylogenetic signal for alternative positions of Embletonia, we additionally performed Four-cluster Likelihood Mapping (FcLM), which is outlined in the results section and in detail in Additional file 1 (see also Additional file 11). In summary, we tested the following seven alternative hypotheses concerning the position of Embletonia: (

Additional file 1. Additional text.
Additional file 2. Table S1: Sampling information for the species collected for this study. Table S2: NCBI accession numbers for all species used in this study. Table S3: Statistics of raw sequence reads before and after trimming. Table S4: Assembly statistics. Table S5: BUSCO results. Table S6: Results of the Quality Checker script and selection of the best assembly. Table S7: Information on sequences removed during contamination filtering. Table S8: Number of removed outlier sequences per species. Table S9: Group definitions to compile the intermediate data set .  Table S10: Information content and evolutionary rates of the orthologs included in the strict data set. Table S11: Supermatrix diagnostics of data sets used in this study. Additional file 3. Figure S1. Species-pairwise site-coverage of the original unreduced and reduced data sets. Heat maps indicate speciespairwise amino acid site-coverage of the sequences of 58 species in the original data sets inferred with AliStat. Low shared site-coverage is in shades of red and high shared site-coverage is in shades of green. AliStat scores are given in Additional file 2: Table S11. a) original unreduced data set. b) original reduced data set. Figure S2.  Figure S5. Heat map visualising the information content of the final strict data set calculated with MARE. The information content (IC) is colour-coded in shades of blue, with darker shades representing higher IC and white squares indicating missing data. Red squares indicate gene partitions with an IC = 0. Species are displayed in rows (x-axis) and gene partitions are displayed in columns (y-axis). Supermatrix diagnostics of MARE are provided in Additional file 2: Table S11. Figure S6. Heat map visualising the information content of the strict SOS data set calculated with MARE. The information content (IC) is colour-coded in shades of blue, with darker shades representing higher IC and white squares indicating missing data. Red squares indicate gene partitions with an IC = 0. Species are displayed in rows (x-axis) and gene partitions are displayed in columns (y-axis). Supermatrix diagnostics of MARE are provided in Additional file 2: Table S11. Figure S7. Speciespairwise site-coverage of the final unreduced, intermediate, strict, and strict SOS data set. Heat maps indicate species-pairwise amino acid site-coverage of the sequences of 56 species in the final data sets inferred with AliStat. Low shared site-coverage is in shades of red and high shared site-coverage is in shades of green. AliStat scores are given in Additional file 2: Table S11. a) unreduced data set. b) intermediate data set. c) strict data set. d) strict SOS data set. Figure S8.  Figure S9. Best ML tree of the strict data set with aLRT and aBayes support. The phylogram is identical to the phylogram in Fig. 1 without the alternative position of Embletonia pulchra. The first value displays branch support based on 10,000 SH-aLRT replicates, the second value displays support derived from the approximate Bayes test. Figure S10. Best ML tree of the intermediate data set with non-parametric bootstrap support. Statistical support was inferred from 300 non-parametric bootstrap replicates. Figure S11. Best ML tree of the intermediate data set with aLRT and aBayes support. The first value displays branch support based on 10,000 SH-aLRT replicates, the second value displays support derived from the approximate Bayes test. Figure  S12. Best ML tree of the unreduced data set with non-parametric bootstrap support. Statistical support was inferred from 100 non-parametric bootstrap replicates. Figure S13. Best ML tree of the unreduced data set with aLRT and aBayes support. The first value displays branch support based on 10,000 SH-aLRT replicates, the second value displays support derived from the approximate Bayes test. Figure S14. Best ML tree of the strict unpartitioned data set analysed with a mixture model approach with non-parametric bootstrap support. Statistical support was inferred from 100 non-parametric bootstrap replicates. Figure S15. Best ML tree of the strict unpartitioned data set analysed with a mixture model approach with aLRT and aBayes support. The first value displays branch support based on 10,000 SH-aLRT replicates, the second value displays support derived from the approximate Bayes test. Figure S16. Best ML tree of the strict SOS data set with non-parametric bootstrap support. Statistical support was inferred from 100 non-parametric bootstrap replicates. Figure S17.
Best ML tree of the strict SOS data set with aLRT and aBayes support. The first value displays branch support based on 10,000 SH-aLRT replicates, the second value displays support derived from the approximate Bayes test.
Additional file 4. Archive S1: Illumina adapters used for adapter trimming.
Additional file 5. Archive S3: This archive includes official gene sets of the three reference species Biomphalaria glabrata, Crassostrea gigas, and Lottia gigantea on translational and transcriptional level, the list of all orthologous sequence clusters (OGs) as required for Orthograph, and an exemplary Orthograph config file.
Additional file 6. Archive S4: Unmasked multiple sequence alignments on amino acid level including Doris kerguelenensis and Calmella cavolini prior to the removal of outliers.
Additional file 7. Archive S5: 1,767 multiple sequence alignments (FASTA format) on amino acid level, from which sequences belonging to Doris kerguelenensis and Calmella cavolini as well as ambiguously aligned sections and gap-only sites were removed. These served as the basis for compiling the final unreduced supermatrix.