Signal, bias, and the role of transcriptome assembly quality in phylogenomic inference

Background Phylogenomic approaches have great power to reconstruct evolutionary histories, however they rely on multi-step processes in which each stage has the potential to affect the accuracy of the final result. Many studies have empirically tested and established methodology for resolving robust phylogenies, including selecting appropriate evolutionary models, identifying orthologs, or isolating partitions with strong phylogenetic signal. However, few have investigated errors that may be initiated at earlier stages of the analysis. Biases introduced during the generation of the phylogenomic dataset itself could produce downstream effects on analyses of evolutionary history. Transcriptomes are widely used in phylogenomics studies, though there is little understanding of how a poor-quality assembly of these datasets could impact the accuracy of phylogenomic hypotheses. Here we examined how transcriptome assembly quality affects phylogenomic inferences by creating independent datasets from the same input data representing high-quality and low-quality transcriptome assembly outcomes. Results By studying the performance of phylogenomic datasets derived from alternative high- and low-quality assembly inputs in a controlled experiment, we show that high-quality transcriptomes produce richer phylogenomic datasets with a greater number of unique partitions than low-quality assemblies. High-quality assemblies also give rise to partitions that have lower alignment ambiguity and less compositional bias. In addition, high-quality partitions hold stronger phylogenetic signal than their low-quality transcriptome assembly counterparts in both concatenation- and coalescent-based analyses. Conclusions Our findings demonstrate the importance of transcriptome assembly quality in phylogenomic analyses and suggest that a portion of the uncertainty observed in such studies could be alleviated at the assembly stage. Supplementary Information The online version contains supplementary material available at 10.1186/s12862-021-01772-2.


Background
The genomics revolution has resulted in a transformation of the approaches that scientists use to estimate phylogeny by vastly increasing the number of available independent genetic markers [1,2], as well as the number of taxa included in phylogenetic analyses [3].However, for taxa that remain largely unrepresented in publicly available datasets, generating a large number of genetic markers, often accomplished as part of a de novo whole genome sequencing project, continues to be a challenge.Transcriptome sequencing is a more accessible method of generating a reduced representation of the nuclear genome that requires fewer sequenced reads and is therefore less expensive than whole genome sequencing (although it is not without its own challenges, see [4]).In addition, transcriptomes perform comparably to genomes in phylogenomic studies when used with robust methods of ortholog identification [5].For these reasons, data derived from transcriptome assemblies have become widely used in phylogenomic studies and have come to represent a mainstream approach to phylogenetic reconstruction [6][7][8][9][10].
The generation of a phylogenomic data matrix is a complex and critical process, as biases introduced at this point can propagate in downstream analyses in unpredictable ways.Phylogenomic data matrices are composed of multiple (often hundreds of ) partitions, alignments of orthologous loci that have been filtered and concatenated together (concatenation-based methods) or analyzed as separate gene trees to inform species trees (coalescent-based methods), resulting in data matrices that are highly dimensional.In addition, phylogenomic datasets are often comprised of an agglomeration of data from multiple research groups that may have leveraged different sequencing and assembly strategies.Therefore it is not surprising that there are still many questions concerning the best practices related to the generation and application of these massive new datasets to phylogenomics [11][12][13].Many researchers have addressed questions related to the most appropriate modeling schemes for different partitions of the data matrix [14][15][16][17][18][19].Some have considered the impact of incomplete lineage sorting in phylogenomic reconstruction and have leveraged this property of recently diverged lineages to inform species trees [20,21].Others have sought to examine differential phylogenetic signal among partitions in order to maximize phylogenomic performance [22,23].Increasingly, researchers have added the additional step of recoding the amino acid data matrix in an attempt to account for saturation and compositional heterogeneity ( [16,[22][23][24], although see [25]).While each of these issues is critical to consider in phylogenomic studies, collectively they deal with aspects of the analyses that occur after transcriptome datasets have been assembled.In most cases, biases introduced during the generation of the primary transcriptome assemblies are not explicitly addressed and may persist in influencing downstream inferences.
Whole transcriptome sequencing is itself a relatively new technology, having gained widespread popularity only in the past decade [26].Therefore, RNA-seq data are commonly treated inconsistently among different phylogenomic studies.While many genomics studies have investigated methodological impacts of read trimming [27,28], error correction [29][30][31], different approaches to transcriptome assembly [32], and quality assessment [33][34][35], researchers using transcriptome assemblies for phylogenomic applications have been slow to adopt many of these recommendations (but see [36][37][38][39]).Phylogenomics studies commonly provide few details regarding the nature and quality of the transcriptome assemblies used as input in phylogenomic workflows.
To date there has been no empirical study of how transcriptome assembly quality may affect downstream phylogenomic analyses, although many impacts are possible.Poor-quality assemblies may alter the accuracy of ortholog prediction, alignment quality, and phylogenetic signal.We predicted that in phylogenomic analyses, poor-quality assemblies would result in differences in the number and identity of orthogroups obtained as well as differences in the quality of the partition alignments compared to those from higher-quality transcriptomes.Here we examine the effects of transcriptome assembly quality on these metrics.Our research strategy is to eliminate as many variables that arise from phylogenomic workflows as possible so that we can attribute discrepancies in phylogenomic results to the differences in transcriptome assembly quality.We use a well-characterized quantitative metric (TransRate score, see "Methods"; [35]) to evaluate transcriptome assemblies and to systematically construct two separate phylogenomic datasets: one of high quality and one of intentionally low quality.We then perform identical phylogenetic analyses on each dataset, allowing the identification of discrepancies between them and the assessment of their relative phylogenomic performance.We find that high-quality transcriptomes produce larger phylogenomic datasets with partitions that have less alignment ambiguity, weaker compositional bias, and are more concordant with the constraint tree, in both concatenation-and coalescent-based analyses, than datasets derived from low-quality transcriptome assemblies.Our results indicate that a portion of the uncertainty in phylogenomic studies likely stems from issues related to the initial assemblies used in preparing phylogenomic data matrices.

Datasets chosen based on TransRate scores have different numbers of transcripts, but show little variation in BUSCO score
Our study design controls for several factors that could preclude direct comparison between empirical outcomes in phylogenomic analyses.We focus on the craniate phylogeny because there is little debate about the major relationships within the group and because RNA-seq read data are available from the same tissue type (liver) for a wide range of taxa.The read sets used in this study ranged in size from 13.7 million read-pairs (Calidris pugnax) to 46.4 million read-pairs (Ambystoma mexicanum).We prepared one high-quality dataset and one low-quality dataset from the same read sets using the Oyster River Protocol (ORP) [32], an assembly pipeline that creates five different transcriptome assemblies for each raw RNA-seq dataset, calculates quality scores for each one, and produces a merged transcriptome assembly consisting of the highest quality unique transcripts (Fig. 1).We leverage the ORP here to intentionally create low-quality transcriptome assemblies that represent realworld empirical outcomes, in addition to high-quality transcriptome assemblies, for each taxon.Reads assembled into significantly fewer transcripts in the high-quality dataset compared to the low-quality dataset (P < 0.001, Fig. 2a), with an average of 178,473 and 321,306 transcripts per assembly respectively.The BUSCO scores and numbers of orthogroups recovered from orthology analysis of each assembly were both higher on average in the high-quality dataset (Table 1).We compared the number of transcripts in each assembly with the number of orthogroups found for that assembly and identified a significant relationship between these measures in both datasets (linear regression: high-quality dataset, P = 0.001; low-quality dataset, P = 0.002; Fig. 2b).The high-quality dataset based on overall TransRate assembly scores had a median TransRate score of 0.47236 (ranging from 0.23542 to 0.68372), while the low-quality dataset's median TransRate score was 0.15943 (ranging from 0.09216 to 0.25281), and overall TransRate scores of the two datasets were significantly different from one another (P < 0.001; Fig. 2c).We did not find a significant relationship between the overall TransRate scores of assemblies and the number of orthogroups obtained for each assembly (linear regression: high-quality dataset, P = 0.43; low-quality dataset, P = 0.51; Fig. 2d).The number of orthogroups for each dataset was higher in the high-quality dataset, but still largely comparable to the low-quality dataset with the exception of two low-quality read datasets, Takifugu rubripes and Callorhinchus milii.Each of these datasets recovered much lower numbers of orthogroups than other taxa in the low-quality dataset.In addition to TransRate evaluations, the BUSCO scores for the low-quality T. rubripes and C. milii assemblies were also dramatically lower than all other BUSCO scores in both datasets (2.7% and 7.2% respectively, compared to the next lowest score: 42.9% for Notechis scutatus).However, the overall BUSCO scores for the high-and low-quality datasets were not significantly different (Wilcoxon rank sum: P = 0.24, Fig. 2e).We observed a significant relationship between BUSCO score and number of orthogroups recovered in both datasets (linear regression: high-quality dataset, P = 0.001; low-quality dataset, P = 0.001; Fig. 2f ).

High-quality assemblies result in a larger number of partitions after processing
Next, we isolated one-to-one orthologs that were present in 100% of taxa.After aligning and filtering these orthologs into partitions we observed that one major impact of assembly quality on phylogenomic data matrix construction is the scale of the resulting data.
We obtained 2016 data partitions from the high-quality dataset, whereas we recovered only 408 data partitions from the low-quality dataset.332 data partitions in both the high-and low-quality datasets included an identical reference sequence from the Mus musculus reference transcriptome, demonstrating that a majority of the Fig. 1 The phylogenomic pipeline used in this analysis from publicly available transcriptomic datasets to partition tree statistics.In the top flowchart red borders indicate bioinformatic tools used while pink ones depict datasets.The Oyster River Protocol is highlighted in yellow, and in the inset: darker blue borders represent steps of the protocol while the resulting transcriptome assemblies are outlined in lighter blue data partitions recovered from the low-quality dataset are also represented in the high-quality dataset (Fig. 3a).The high-quality dataset however, included many more unique sequence partitions (1684 unique partitions compared to 76, Fig. 3a).The distributions of alignment lengths between datasets differed significantly before alignment filtering (Wilcoxon rank sum, P = 0.02; Fig. 3b) with alignments in the high-quality dataset being longer on average, but not after alignment filtering (Wilcoxon rank sum, P = 0.79; Fig. 3c).

High-quality alignments possess reduced compositional bias and alignment ambiguity
In order to draw direct comparisons between the partitions derived from the high-and low-quality datasets, we examined the alignment statistics of the 332 partitions that were shared between them.The percentage of constant sites in each alignment was not significantly different between the high-and low-quality datasets (Wilcoxon rank sum, P = 0.37, Fig. 4a).Similarly, the percentage of parsimony-informative sites in the alignments did not differ significantly between the two datasets (Wilcoxon rank sum, P = 0.89, Fig. 4b).However, the number of sequences that failed the composition Chi 2 test [40] and the number of sequences with over 50% alignment ambiguity were significantly different between the two datasets (composition-Wilcoxon rank sum, P = 0.006, Fig. 4c; ambiguity-Wilcoxon rank sum, P < 0.001, Fig. 4d), and both of these metrics were higher in the low-quality dataset.For each species, we assembled the transcriptomic reads using the Oyster River Protocol.Of the five resulting transcriptome assemblies, we chose the one with the highest overall TransRate score and the one with the lowest overall TransRate score to use in the high-and low-quality datasets, respectively.We also quantified the number of transcripts in each assembly, calculated the complete BUSCO score, and inferred orthogroups using OrthoFinder

No bias in gene content in partitions from both highand low-quality datasets
Phylogenetic information content of a given phylogenomic data matrix could be impacted if the partitions themselves are drawn from a biased set of loci.In order to understand the genetic composition of phylogenomic datasets derived from high-and low-quality assemblies, we conducted gene ontology (GO) analysis of the recovered partitions.We did not observe enrichment for functional category in either the high-or low-quality datasets.

Partitions from high-quality assemblies recapitulate the constraint tree to a larger extent than those from low-quality assemblies in both concatenationand coalescent-based analyses
Finally, we sought to understand the impact of assembly quality on phylogenetic signal.We first compared the two datasets to a constraint tree representing the current view of craniate relationships [41,42] by using Robinson-Foulds (RF) distances and internode certainty all (ICA) values in concatenation analyses.RF distances reflect topological differences between partition subtrees and the constraint tree [43], whereas ICA values indicate the proportion of data partitions for the high-quality and low-quality datasets that support each node in our constraint tree [44].We found that the high-quality dataset had significantly lower RF values overall than the low-quality dataset (Wilcoxon rank sum, P < 0.001; Fig. 5), indicating a shorter distance to the constrained craniate tree for the partitions in the high-quality dataset.The partitions derived from the high-quality dataset possessed characteristically higher ICA values than those from the lowquality dataset, although the distributions of scores were not significantly different (Wilcoxon rank sum, P = 0.47; Fig. 6) likely due to low statistical power.We also investigated the relative performance of the two datasets in coalescent-based analyses using ASTRAL [20,45].Similarly, we found that the high-quality dataset produced gene trees with less discordance to the estimated species tree than their low-quality counterparts, with a normalized quartet score of 0.75 for the high-quality partitions compared to 0.73 for the lowquality partitions.Both datasets resolved the same topology in ASTRAL analyses (Fig. 7).
In summary, we find that datasets derived from high-quality transcriptome assemblies yield larger phylogenomic matrices than those from low-quality transcriptome assemblies.In addition to being more numerous, the data partitions in the high-quality dataset are also less compositionally biased, have less alignment ambiguity, and are less discordant with the constraint tree.

Discussion
Given the ubiquity of transcriptome usage phylogenomics, we sought to understand how sub-optimal data handing practices during the assembly process may affect downstream phylogenomic analyses.We observed a general trend in our analyses where more accurate transcriptome assemblies resulted in phylogenomic datasets with a greater number of unique data partitions, longer alignments, fewer ambiguous regions, less compositional bias, greater consistency with the known phylogeny in concatenation-based analyses, and higher normalized quartet scores in coalescent-based analyses.We did not uncover any functional biases in the GO terms associated with either dataset.

High-quality assemblies result in a larger number of partitions after phylogenomic processing
The most dramatic difference between the high-and low-quality phylogenomic data matrices is the number of orthogroups that contained all species.After estimating one-to-one orthologs, aligning the orthologs, and filtering the alignments, this difference led to ~ five times the number of data partitions in the high-quality dataset compared with the low-quality dataset.Transcriptomic assembly errors that are expected to pervade low-quality assemblies include the generation of chimeric transcripts, the generation of incomplete transcripts, or the failure to generate transcripts due to missing data [32,35].Our results from analyses of the low-quality assemblies indicate that incompletely assembled transcripts may be at least partially responsible for the differences in partition number because the partition alignments before filtering are significantly longer in the high-quality dataset, indicating fewer incompletely assembled transcripts in the latter.While OrthoFinder [46,47] may be somewhat robust to these issues, when more complete sequence information is provided in high-quality transcripts, OrthoFinder analyses identify significantly greater numbers of orthogroups that contain a high proportion of species and therefore greater numbers of orthologs.Missing transcripts could also impact the accuracy of downstream analyses and the establishment b a d c Fig. 4 Density plots of four alignment metrics for both datasets.Alignments created from low-quality transcriptome assemblies have similar percentages of constant and parsimony-informative sites, but higher compositional bias and ambiguity when compared to alignments from high-quality assemblies.a Percentage of constant sites in each partition alignment.b Percentage of parsimony-informative sites in each partition alignment.c Number of sequences that fail the composition test, normalized by partition alignment length.d Number of sequences that contain more than 50% gaps/ambiguity in each partition alignment of one-to-one orthologs because, depending on what data are missing, orthologs and paralogs could become conflated between taxa.Our results are consistent with this expectation because among partitions that are shared between high-and low-quality datasets, those from the high-quality dataset show more accurate phylogenetic signal, as measured by constraint tree analyses in concatenation analyses and in coalescent approaches (see below).
We identified two transcriptome assemblies within the low-quality dataset, Takifugu rubripes and Callorhinchus milii, which have dramatically lower BUSCO scores and number of orthogroups recovered than other taxa within the same dataset.We included these two taxa in the analysis despite their extreme BUSCO scores for a number of reasons.First, these taxa occupy important phylogenomic positions within the craniate tree and publicly available craniate liver transcriptome datasets are somewhat limited.Second, while the TransRate scores for these two taxa are below average for the low-quality dataset (Fig. 2c, d), they are well within the distribution of low-quality assembly TransRate scores, indicating that these two taxa yield assemblies that are contiguous and correctly assembled to a comparable extent to the other assemblies included in that dataset.While it is standard practice to deposit raw reads into public databases, the read-sets for these two species appeared to have been trimmed prior to public data deposition [48], making them shorter than the other read-sets.We identified average read length as the probable reason for the lack of genic completeness as measured by BUSCO for these two taxa.Due to this shorter read length, these two organisms performed especially poorly in rnaSPAdes with a kmer length of 75 (only reads of length k + 1 are used in assembly), which was subsequently the assembly used in the low-quality dataset for both of these Importantly, these two species' corresponding assemblies in the high-quality dataset were not outliers (Fig. 2c,  d), indicating that a robust assembly strategy can compensate for sub-optimal sequence reads.Therefore, by including these two taxa, we were able to represent a situation commonly encountered in phylogenomic studies that utilize publicly available data-the inclusion of reads of poor quality or that have been previously processed.The drastic difference in number of partitions in the lowquality dataset compared to the high-quality dataset is due in part to these two taxa having smaller and less complete assemblies than all others.However, when we relax the strict filtering to include orthogroups with up to two missing taxa (thereby giving the low-quality dataset the opportunity to exclude T. rubripes and C. milii) we find that the high-quality dataset still has over 1600 more partitions than the low-quality dataset, and therefore the inclusion of these taxa is not the only driving force behind the difference in partitions between the datasets.While there are fewer partitions in the low-quality dataset, it is still a sufficient number (408) for most downstream phylogenomic applications.Therefore, we conclude that while the situation encountered with the T. rubripes and C. milii RNA-seq data has an effect on some aspects of our phylogenomic analysis, their effects are only manifested in analyses of the low-quality assemblies and extend beyond data drop out.

Low-quality assemblies produce alignments with more compositional bias and alignment ambiguity than high-quality assemblies
In the process of making gene trees for each of the data partitions, IQ-TREE calculates a number of metrics about the partition alignments and the sequences within them [40].One such test is for compositional homogeneity, which measures the character composition of amino acids in each sequence against the character composition in the whole alignment.Here, we chose to assess changes in compositional heterogeneity using the simple Chi 2 test implemented in IQ-TREE [40,49].Heterogeneity or bias in amino acid composition can mislead phylogenetic inferences: distantlyrelated organisms that have high compositional bias may erroneously group together [50].The number of sequences failing the composition test-that is, the number of sequences with higher compositional heterogeneity than expected by chance-was higher in the partitions from the low-quality dataset.Because these partitions have direct counterparts in the high-quality dataset, this difference in compositional heterogeneity is directly attributable to a difference in assembly quality.Similarly, the partitions from the low-quality dataset also contained more sequences with over 50% gaps or ambiguity in the alignment.While global alignments often contain gaps because of insertions or deletions in the sequences, comparison of the two datasets implies that the greater number of gaps in the low-quality b a Fig. 6 Partitions derived from the high-quality dataset have higher internode certainty all (ICA) values than those derived from the low-quality dataset when compared to the constraint tree.a Density plot of ICA values.b Average ICA values for each node.Blue represents the high-quality dataset, red represents the low-quality dataset.Negative ICA values suggest that the node conflicts with at least one other node that has a higher support dataset also results from incorrect transcriptome assemblies rather than natural variation.
The low-quality dataset contained some partitions that the high-quality dataset did not have.These partitions could be unique transcripts only assembled in the lowquality dataset, or they could be the result of differential pruning of paralogous sequences between the two datasets, resulting in a different Mus identifying sequence in two partitions that represent the same gene family.They might also be erroneous or duplicate partitions that were misidentified during the OrthoFinder procedure as separate gene families due to poor assembly quality.In principle, differential data assembly quality could inject bias into the resulting orthogroups if some loci, perhaps short or highly expressed genes, were preferentially assembled among the different datasets, however our GO analyses showed no enrichment or depletion of GO terms in these partitions.

Partitions derived from high-quality assemblies perform better in both concatenation-and coalescent-based phylogenomic analyses
In this study, we used quantitative analyses to assess phylogenomic performance of the high-and low-quality transcriptome assemblies.We showed that the individual partitions included in the high-quality dataset were closer to the constraint tree by calculating RF distances.The high-quality dataset had significantly smaller RF Fig. tree analysis in ASTRAL reveals a similar pattern to concatenation analyses.ASTRAL analyses of gene trees from 332 shared partitions from the high-and low-quality datasets result in identical topologies.In addition to normalized quartet scores being higher for gene trees derived from the high-quality dataset, node support values for the high-quality dataset are marginally stronger than those from the low-quality dataset.Support values represent support for quadripartitions of the tree, and only those that were less than 1 are represented distances to the constraint tree in concatenation-based analyses (Wilcoxon rank sum, P < 0.001) and less discordance in coalescence-based analyses as indicated by normalized quartet score (Fig. 7).While the ICA values of the high-quality dataset were not significantly higher than those in the low-quality dataset, the trend shows that ICA values are generally higher among partitions from the high-quality dataset with a greater proportion of partitions falling above 0.6.This indicates that the gene trees estimated from the high-quality dataset partitions are more consistent with the constraint tree of craniates and show greater phylogenetic signal [51] than the lowquality dataset in concatenated analyses (Fig. 6b).

Limitations in data availability and statistical power do not affect our conclusions
Our research strategy was to eliminate as many variables as possible so that we could isolate the effects of assembly quality on phylogenomic performance.These variables include the type of tissue that RNA-seq datasets are derived from and the topology itself.We treat the craniate phylogeny, for which few arguments remain regarding the relationships of the taxa included [41,42], as a "known" parameter to constrain our analyses.In this way we were able assess how close a given analysis accords with that constraint in light of other perturbations like assembly quality.However, it is notable that phylogenomic trees based on the 332 data partitions that are common to both the high-quality and low-quality datasets, using either concatenation-or coalescent-based methods, fail to resolve the craniate phylogeny accurately (Fig. 7; Additional file 2: Figure S1).While this result has no bearing on any of the conclusions presented here, it is likely due to two factors.First, the magnitude of both datasets, 332 partitions, is far fewer than that included in recent well-resolved phylogenomic studies of craniates [41].Here, our utilization of only 332 partitions derives from the necessity that they be shared between the highand low-quality assemblies, and therefore directly comparable.Second, our taxon sampling is low compared to recent phylogenomic studies of craniates.This is due to the requirement of our study design that RNA-seq reads be derived from a homologous tissue (e.g.liver) across taxa, offering a different type of direct comparison.While we were able to represent most of the major lineages of craniates with RNA-seq data derived from liver tissue, it was not possible to provide greater taxon sampling given current publicly available data while also preserving taxonomic evenness in sampling across various vertebrate clades.
We also point out that some of the quantitative measures reported here (e.g.ICA) show clear trends that favor the high-quality dataset over the low-quality dataset but are not significantly different.This may be due to intrinsic differences in statistical power that make it unlikely that a significant difference would be identified between datasets for those measures that have fewer data points (RF distances yield one data point per gene tree (332) while ICA scores provide one data point per node [34]).However, we do not observe a single instance of the lowquality dataset being quantitatively or qualitatively better than the high-quality dataset in terms of phylogenetic signal for any of our measures.

Conclusions
Phylogenomic approaches leverage great power to resolve phylogenetic relationships, but they also include many analytical pitfalls associated with ortholog identification, alignment filtering, and model selection.While these pitfalls have been well-characterized, we chose to focus on transcriptome assembly quality-a more fundamental and largely overlooked aspect of phylogenomic analyses.We addressed this problem empirically using a study design that controls for variables including taxon selection, data type, data provenance, and phylogenetic uncertainty.We show that assembly quality, when all other factors are controlled, can have a dramatic impact on phylogenomic analyses in three ways.First, the richness and size of the dataset can differ profoundly when assembly errors are prevalent in the data.Second, alignments created from low-quality assemblies are more prone to ambiguity and compositional bias than their high-quality counterparts.And third, the partitions derived from high-quality assemblies have greater phylogenetic signal to resolve true evolutionary relationships than partitions derived from low-quality assemblies.We conclude that additional analytical interventions aimed at improving assembly quality, such as the Oyster River Protocol [32], are likely worth the additional effort.

Read selection and assembly
To understand the effects of transcriptome assembly quality on phylogenomic inference, we created two datasets, one of high and one of low quality, from publicly available transcriptomic reads (see Additional file 1 for more information on data availability).All read data are available on the European Nucleotide Archive (Table 1).We focused on craniates because there are few remaining disputes on the craniate phylogeny [41] and these well-established phylogenetic relationships serve as a comparison to the topologies found using our high-and low-quality transcriptome assemblies.Our research strategy was to assemble high-and low-quality transcriptomes from the same set of reads.We obtained Illumina-generated paired-end liver transcriptomic reads for 37 vertebrate species spanning the majority of the diversity contained within the clade as well as one craniate outgroup.We assembled each read set using the Oyster River Protocol (ORP) version 2.2.3 [32] on a Linux computer with 24 CPUs and 128 GB of RAM.In brief, this protocol begins by adapter-and quality-trimming reads using Trimmomatic version 0.38 [52] as per recommendations in MacManes [27], after which it corrects read errors using Rcorrector version 1.0.8[30] following recommendations from MacManes and Eisen [29].The ORP then assembles trimmed and corrected reads using three different assemblers: Trinity version 2.8.5 [53] with a kmer length of 25, Trans-ABySS version 2.0.1 [54] with a kmer length of 32, and rnaSPAdes version 3.14 [55] using kmer lengths of 55 and 75.The protocol continues by merging the resultant four assemblies and clustering them into isoform groups.The ORP then scores all transcripts using TransRate version 1.0.3 [35] which maps the read sets onto the assembly and, based on the mapping, detects assembly errors such as fragmentation, chimerism, and local misassembly.TransRate then uses this error information to assign quality scores to each transcript before integrating these individual scores into a score for the assembly as a whole.The ORP selects the member of each isoform group with the highest TransRate score and places it into a new file.Finally, the protocol uses cd-hit-est version 4.8.1 [56] and a 98% sequence identity threshold to reduce transcript redundancy.The assemblies produced by the ORP are therefore populated by the highest quality, non-redundant sequences produced by any of the five possible assembly strategies [32].A graphical summary of this protocol and our phylogenomic pipeline can be found in Fig. 1.

Quality analysis and high-and low-quality dataset construction
We evaluated each of the five assemblies generated from the ORP (from Trinity, TransABySS, rnaSPAdes at two kmer lengths, and the final ORP assembly) for each species in two main ways.We used BUSCO version 3.0.1 [57], which uses benchmarking universal single copy orthologs to measure the genic completeness of an assembly.In addition, because we were primarily interested in assessing the structural differences in the transcriptome assemblies arising from errors during the assembly process, we generated TransRate scores for each assembly.Of the five assemblies for each species, we chose the assembly with the highest overall TransRate score to be part of the high-quality dataset, and the one with the lowest overall score to be part of the low-quality dataset.We selected assemblies for each dataset regardless of which assembler produced them, resulting in datasets that contain transcriptomes from multiple different programs.This was done in part to simulate transcriptomic datasets in other studies that may be constructed from preexisting transcriptome assemblies, rather than those that have reassembled each dataset using the same program and to provide appropriate contrast between the high-and low-quality datasets.We performed all subsequent steps on both datasets in parallel.

Orthogroup inference, statistics, and data partition creation
We used TransDecoder version 5.5.0 [58] to translate all transcript sequences to amino acid sequences.The transcriptome assembly process assigns each new transcript a unique name so that it can be differentiated within the assembly.This means that the high-and low-quality assemblies do not share identical transcripts or names common to both assemblies, making the direct comparison of sequences impossible.To circumvent this issue, we added the Mus musculus reference transcriptome (release 96) [59] to both datasets just before the Trans-Decoder step so that a Mus sequence would be present in many orthogroups and partitions downstream.This created a common naming system by which we could compare the content of orthogroups and partitions derived from assemblies of high and low quality later in the analysis.
For each dataset (containing either the high-quality or low-quality transcriptome assemblies for the 38 craniate species plus the Mus reference transcriptome) we performed a separate OrthoFinder version 2.3.3 analysis [46,47].We then used linear regressions in R version 3.5.2[60] to evaluate the relationship between the total number of orthogroups found for each taxon and three other measures: the total number of transcripts in each assembly, the overall TransRate score, and the BUSCO complete score.We also plotted the distributions of these three measures for each dataset and performed Wilcoxon rank sum tests in R to determine if they were statistically different.
We filtered the resulting orthogroups so that we retained only those that had each taxon represented by at least one sequence.From these, we obtained one-toone orthologs using PhyloTreePruner [61].We realigned these sequences using MAFFT version 7.305b using the "auto" setting [62], and filtered the alignments for poorly aligned or divergent regions using Gblocks version 0.91b [63,64] with options "− b2 = 0.65 − b3 = 10 − b4 = 5 − b5 = a" in the script "gblocks_wrapper.pl"[65].Finally, we concatenated all sequences into a NEXUS file for each dataset.We measured the lengths of the alignments both before and after Gblocks and compared the content of both groups of partitions by using the Mus sequence headers as common identifiers that were present in both datasets and determined the numbers of unique and shared partitions.We then used IQ-TREE version 1.6.12under the LG model [40] to find individual gene trees for each partition in each dataset.

GO analysis and alignment metrics
To investigate the differences in content and qualities of the partitions between the two datasets, we separated the partitions into groups containing only those that were unique to each dataset, and only those that were shared between the two datasets.We used InterProScan version 5.31-70.0[66] to annotate the partitions unique to each dataset and then performed a gene ontology (GO) analysis with topGO version 2.32.0 [67] in R version 3.5.2[60] to check for any functional enrichment or depletion bias in the partitions of either dataset.For each partition common to both datasets, we extracted various alignment metrics from the log and information files generated while making partition trees in IQ-TREE.These included percent constant sites, percent parsimony-informative sites, number of sequences that failed the Chi 2 composition test (which we normalized by alignment length), and the number of sequences that contained more than 50% gaps or ambiguity.To test for significant differences, we performed Wilcoxon rank sum tests in R version 3.5.2[60] between the two datasets for each of these measures.

Constraint tree and comparisons of partition trees
The phylogenetic relationships among the 38 craniate species for which we obtained liver RNA-seq data are well-supported by previous work [41].Therefore, we used a tree that reflects the most well-supported hypothesized relationships for comparison against the partition trees.Using Mesquite version 3.6 [68], we constructed a constraint tree that reflects the widely accepted topology for craniates.We used the high-quality dataset NEXUS alignment file along with this topology to estimate the constraint tree topology with branch lengths in IQ-TREE using the LG model [40].We calculated RF distances [43] from the partition trees in each dataset to the constraint tree using phangorn version 2.5.5 [69] in R version 3.5.2[60].This metric measures the differences in topology (RF distance) from the partition trees to the constraint tree, with smaller numbers indicating less conflict between the two trees.We also calculated ICA values between the individual partition trees and the constraint tree using RAxML version 8.2.11 [70].The ICA refers to the degree of certainty for each internal node of the tree compared to the constraint tree when all other conflicting bipartitions are taken into account for that dataset.Numbers close to 1 show a lack of conflict between the partition tree and the constraint tree [44].We tested for significant differences between the two dataset distributions using a Wilcoxon rank sum test in R version 3.5.2[60] for both RF distances and ICA values.Finally, we created species trees using the 332 gene trees that were common to both the high-quality and low-quality datasets with a coalescent method implemented in ASTRAL version 5.7.4 [20,45].We calculated the normalized quartet score for each tree, which represents the percentage of quartet trees in the input trees that are satisfied by the species tree and ranges from 0-1, with higher numbers indicating less discordance.

Fig. 2
Fig.2Summary statistics for the high-and low-quality datasets produced.We selected high-and low-quality datasets based on TransRate score.This resulted in transcriptome assemblies with both high and low completeness, according to complete BUSCO score, in each dataset.Larger assembles in the low-quality dataset did not lead to higher BUSCO or TransRate scores.Dotted lines in density plots represent medians for each dataset.a Density plot of the total number of transcripts (in thousands) in each transcriptome.b Relationship between the total number of transcripts (in thousands) and the total number of orthogroups.c Density plot of overall TransRate scores for each assembly.d Relationship between the overall TransRate score and the total number or orthogroups.e Density plot of complete BUSCO score for each transcriptome assembly.f Relationship between BUSCO score and total number of orthogroups

Fig. 3
Fig.3Length of alignments and number of partitions for each dataset.a Venn diagram showing number of partitions unique to each dataset, and common between them.The number of partitions recovered through the phylogenomic analysis pipeline is fivefold higher when the dataset is made up of high-quality transcripts compared to lower-quality ones.b Density plot of alignment lengths of each partition before filtering with Gblocks.c Density plot of alignment lengths of each partition after filtering with Gblocks.While the lengths of the individual alignments are significantly different before Gblocks filtering, they are similar afterwards

Fig. 5
Fig. 5 Per partition Robinson-Foulds (RF) distances to the constraint tree are significantly shorter in the high-quality dataset compared with the low-quality dataset.a Density plot for all partitions from both datasets.b Density plot for only those 332 partitions that are shared between the two datasets

Table 1
Read set information and transcriptome assembly metrics