Separating the wheat from the chaff: mitigating the effects of noise in a plastome phylogenomic data set from Pinus L. (Pinaceae)
BMC Evolutionary Biology volume 12, Article number: 100 (2012)
Through next-generation sequencing, the amount of sequence data potentially available for phylogenetic analyses has increased exponentially in recent years. Simultaneously, the risk of incorporating ‘noisy’ data with misleading phylogenetic signal has also increased, and may disproportionately influence the topology of weakly supported nodes and lineages featuring rapid radiations and/or elevated rates of evolution.
We investigated the influence of phylogenetic noise in large data sets by applying two fundamental strategies, variable site removal and long-branch exclusion, to the phylogenetic analysis of a full plastome alignment of 107 species of Pinus and six Pinaceae outgroups. While high overall phylogenetic resolution resulted from inclusion of all data, three historically recalcitrant nodes remained conflicted with previous analyses. Close investigation of these nodes revealed dramatically different responses to data removal. Whereas topological resolution and bootstrap support for two clades peaked with removal of highly variable sites, the third clade resolved most strongly when all sites were included. Similar trends were observed using long-branch exclusion, but patterns were neither as strong nor as clear. When compared to previous phylogenetic analyses of nuclear loci and morphological data, the most highly supported topologies seen in Pinus plastome analysis are congruent for the two clades gaining support from variable site removal and long-branch exclusion, but in conflict for the clade with highest support from the full data set.
These results suggest that removal of misleading signal in phylogenomic datasets can result not only in increased resolution for poorly supported nodes, but may serve as a tool for identifying erroneous yet highly supported topologies. For Pinus chloroplast genomes, removal of variable sites appears to be more effective than long-branch exclusion for clarifying phylogenetic hypotheses.
The potential influence of phylogenetic ‘noise’, i.e. random or misleading signal, in molecular phylogenetic studies has been recognized for over 30 years [1–4]. Similarly, various strategies to identify and/or mitigate noise in datasets have been formulated, including measuring skewness in the distribution of phylogenetic trees [5, 6], quantifying incongruence between data partitions , but see, for example, [8, 9], likelihood mapping , increasing taxon sampling [11, 12], and profiling loci based on phylogenetic information content [13–15], among others. While the specific details of these strategies differ, ultimately the goal of each is to increase the accuracy of phylogenetic hypothesis generation by identifying and/or reducing the influence of misleading signal . Nonetheless, as next-generation technologies continue to bring about orders-of-magnitude increases in DNA sequence output and usher in an era of phylogenomics, the challenges associated with phylogenetic noise could temper gains in phylogenetic resolution resulting from increased taxon and sequence sampling [17–22]. Although genomic-scale data sets are relatively novel, it is clear that misleading signal inherent in these large datasets still impacts phylogenetic resolution, in particular in clades that have experienced rapid divergence or radiation events, as well as lineages with elevated rates of evolution and/or long periods of genetic isolation (i.e., “long branches”) [12, 20, 21, 23–26].
While chloroplast sequences are still the most commonly used markers in plant phylogenetic studies, many analyses tend to rely on relatively small portions of the chloroplast genome, and few studies to date have applied plastome-scale sequences to phylogenetic questions [27–34]. This is particularly true at low taxonomic levels , as the majority of plastome-level phylogenetic analyses have focused on clarifying plastid-based relationships at familial and ordinal levels. Considering the potential impact of phylogenetic noise in phylogenomic analyses [17–22, 35], it seems appropriate to explore such effects on plastome-scale datasets, particularly as they become widespread in plant phylogenetic analyses and more commonly applied to investigations at low taxonomic levels. Further, although representing a single linkage group, mutation rate varies between different regions of the plastome [31, 36, 37], and so the potential for misleading signal certainly exists when using full plastomes to delineate evolutionary events over varying time-scales.
The genus Pinus, consists of ca. 110 species distributed primarily throughout the northern hemisphere, and contains evolutionary patterns ranging from deep divergence events to apparent rapid and relatively shallow radiations. In addition, the moderate size of the genus facilitates thorough taxon sampling. Pinus is represented by a relatively well-documented fossil record reaching back over 100 million years [38–40] and has been the focus of a large body of phylogenetic work, including studies based in morphology [41–45], crossability [41, 46–48] and molecular data, including restriction fragment analyses [49, 50] and both nuclear [51–54] and chloroplast sequence data [31, 42, 44, 55–59]. The most recent molecular systematic treatment of Pinus recovered a well-supported systematic framework consisting of two subgenera (Pinus and Strobus ), four sections (sections Pinus and Trifoliae in subgenus Pinus, sections Parrya and Quinquefoliae in subgenus Strobus ) and 11 subsections (Figure 1) that is widely accepted today. However, while nearly complete plastome sequences for a subset of pine species support this framework and result in increased resolution across much of the genus , there remain a number of taxa with poor resolution and/or incongruence between chloroplast-based and nuclear- or morphology-based analyses. In particular, subsections Krempfianae and Contortae , as well as a clade of the two closely related species Pinus merkusii and P. latteri each demonstrate these conflicts (Figure 1). In the present study, we investigated poor and conflicting resolutions in these clades using highly variable alignment positions and long-branches as proxies for phylogenetic noise. Sequential removal of variable sites and long branches was applied to the phylogenetic analysis of a full-plastome alignment which included most of the world’s pine species and several Pinaceae outgroups. While responses to these treatments differed between these three clades, each case provided insight into both the general patterns of response to noise removal in a phylogenomic dataset as well as specific characteristics of the plastid-based Pinus evolutionary history.
Accessions used in study
A total of 113 accessions were included in the alignment and subsequent analyses described below, including 37 Pinus and Pinaceae accessions utilized by Cronn et al.  and Parks et al.  (GenBank FJ899555-FJ899583, EU998739-998746, NC_001631.1  and NC_004677.2) and the plastome sequence of Cathaya argyrophylla reported by Lin et al.  (GenBank AB547400.1) (Additional File 1). The 75 novel plastome accessions included in analyses were sequenced and assembled as follows:
Genomic DNA extraction, chloroplast enrichment and sequencing
Total genomic DNA was extracted from fresh-frozen leaf or mega-gametophyte tissues using the FastDNA extraction protocol (MP Biomedicals, Ohio, USA). In several cases (Pinus chiapensis , P. cembroides , Pinus dabeshanensis , P. discolor , P. douglasiana , P. edulis, P. hwangshanensis , P. massoniana , P. pumila , and P. sabiniana), genomic DNA yield was insufficient for sequence preparation, so extracts were amplified by whole genome amplification with random hexamer priming . Genomic libraries were prepared following the Illumina protocol , with fragmentation performed using a BioRuptor Sonicator (Diagenode, Inc., Denville, NJ, USA) (setting ‘high’ for 5–30 one minute cycles). Adapters ligated to genomic fragments carried unique 4 bp ‘barcodes’ at their 3´ ends for multiplex sequencing as described in Cronn et al. . Agarose gel size-selected (300–700 bp), adapter-ligated libraries were enriched through 12–18 cycles of PCR using Phusion DNA polymerase and HF Buffer (New England Biolabs, Ipswich, MA, USA) and standard Illumina paired-end primers, and quantified using a Nanodrop 1000 (ThermoFisher Scientific, Wilmington, DE, USA).
Solution-based enrichment of the chloroplast portion of genomic libraries followed the general methods of Gnirke et al. . Enrichments were performed as follows. Chloroplast probe pools were synthesized by first PCR-amplifying the entire plastome of a member of Pinus subgenus Pinus (Pinus thunbergii ), using the methods described in Cronn et al. . In addition, plastome regions unique to Pinus subgenus Strobus were amplified from P. koraiensis to account for regions not present in the Pinus subgenus Pinus plastome. PCR products were quantified using a Nanodrop 1000 and pooled in an equimolar mix. Pooled amplicons were blunted-ended and subsequently ligated into ‘concatemers’ (Quick Blunting Kit and Quick Ligation Kit, New England Biolabs, Ipswich, MA, USA) and purified with Agencourt AMPure beads (Beckman-Coulter Genomics, Danvers, MA, USA). Concatemer probe pools were denatured into single-stranded product using 0.4 N KOH, and then amplified and biotinylated in a single incubation of 18 h at 30°C in the presence of 5´-end biotinylated random hexamers, 0.4 mM biotin-14-dCTP stock, 1 mM dNTPs and φ29 DNA polymerase. After cleaning by ethanol precipitation, this procedure typically yielded pools consisting of 10-25 μg of large (tens of kbp in length) biotinylated chloroplast probe. Hybridization reactions were carried out in 40 μl volumes and contained 0.5 μg probe and 0.5–1 μg of either a single enriched genomic library or equimolar-pooled 4-plex genomic libraries; Denhardts solution (Invitrogen, Inc., Carlsbad, CA, USA) and lambda DNA (New England Biolabs, Ipswich, MA, USA) were used as blocking agents to minimize binding of non-target DNA to probes. Reactions were heated to 95°C for 10 min, and subsequently incubated at 65°C for 64–72 h. After incubation, hybridization products were captured using MagnaSphere streptavidin-coated paramagnetic beads (Promega, Inc., Madison, WI, USA). Capture reactions were incubated for 30 min at room temperature, after which the streptavidin-probe-target DNA complexes were captured through magnetization and then washed four times at 65°C in the presence of 0.1% SDS and 1X, 1X, 0.5X and 0.1X SSC for 15, 10, 10 and 10 min, respectively. Enriched hybrids were eluted from the paramagnetic beads in 50 μl dH2O at 80°C for 10 min and PCR-amplified over 12-18 cycles using Phusion-Flash PCR Master Mix (New England Biolabs, Ipswich, MA, USA) and standard Illumina paired end primers. After PCR enrichment, libraries were purified with Agencourt Ampure beads and subsequently quantified using the Nanodrop 1000, and size-confirmed using either gel electrophoresis or the Agilent 2100 BioAnalyzer (Agilent, Santa Clara, CA, USA).
The molarity of enriched libraries was estimated by their concentration and average fragment size, after which the libraries were submitted for sequencing singly or in barcode-specified multiplex pools ranging in size from four to 16 accessions. Individual samples or multiplex pools were submitted to the Oregon State University Center for Gene Research and Biocomputing (OSU CGRB) (http://www.cgrb.oregonstate.edu/) or the FAS Center for Systems Biology at Harvard University (http://sysbio.harvard.edu/csb/) for sequencing on Illumina GAIIx sequencers. Libraries were loaded at a concentration of 5–7 ρM and sequenced in 60 or 80 bp single-end sequencing reactions. Cluster formation, primer hybridization and sequencing reactions followed Illumina protocols . Image analysis, base-calling and error estimation were performed using the Illumina GA Pipeline version 1.5.
Plastome assembly from microreads
To initially determine enrichment of read pools, all reads containing Illumina adapter sequence were removed from read pools and the remaining reads were sorted by barcode using two Perl scripts, sort_fastq.pl and bcsort_fastq_se.pl (available at http://brianknaus.com). The proportion of reads representing the chloroplast was checked using the program BLAT  with default settings and a reference of either Pinus thunbergii or P. koraiensis for accessions in subgenus Pinus or Strobus , respectively.
Reference-guided assembly of microreads was facilitated using a pipeline of five scripts called "alignreads”, as described in Straub et al. . In this pipeline, assembly of microreads into contigs is performed by YASRA , which assembles contiguous sequences (contigs) by iteratively aligning sequence reads to a reference genome using the lastz alignment algorithm . The alignment of assembled contigs is then refined using NUCmer and Delta-Filter of the MUMmer 3.0 suite , and the resulting alignment information is paired with the original contigs and read depth information from YASRA, to be converted into an aligned consensus sequence using sumqual.py and qualtofa.py. The latter allows user-specified masking of contig positions based on read depth and base call proportion. Both sumqual.py and qualtofa.py are available for download at http://milkweedgenome.org; YASRA and MUMmer are available online at http://www.bx.psu.edu/miller_lab/ and http://mummer.sourceforge.net/, respectively.
For assembly of the novel plastome sequences reported in this paper, subsectional references reported in Parks et al.  were used (Additional File 1). The alignment of assembled contigs was checked and adjusted manually in BioEdit 7.0.9 . Aligned contig positions matching the reference were masked if fewer than five overlapping reads and less than 80% of all reads overlapping to form the contig at that position agreed with the reference; aligned positions called as SNPs were similarly masked, but required a minimum coverage depth of 20 aligned reads and 80% call proportion.
Alignment and quality screening of assemblies
Plastome assemblies were aligned in MAFFT v.6.240 , using gap opening and extension penalties of 2.0 and 0.1, respectively. Alignments were subsequently manually adjusted and annotated in BioEdit 7.0.9 where necessary. In particular, MAFFT appeared to have the most difficulty with insertions and deletions, ranging from inconsistent arrangement of variable length homopolymer repeats in noncoding regions to clear failure in alignment of larger repetitive elements with variable copy number, for example in the loci ycf 1 and ycf 2. As our taxon sampling was relatively deep, in most cases it was possible to rely on the assumed homology of closely related species or groups of species to guide manual adjustment and, as much as possible, the direction toward a most parsimonious solution was favored. The plastome of Cathaya argyrophylla was primarily aligned by hand due to structural rearrangements. The assemblies of exonic regions were checked and adjusted as necessary by translation to identify potential misassemblies, as represented by internal stop codons and/or frameshift mutations.
Novel plastome sequences were quality-screened at this point by level of completion and relative similarity to the subsectional reference used in their assembly. Specifically, assemblies were discarded from further analyses if they were estimated to be less than 80% complete, or if the pairwise distance to their subsectional assembly reference was greater than two times the standard deviation of all pairwise distances between assembled members of their subsection and the subsectional reference. The latter measure was taken to diminish bias resulting from poor assemblies, for example resulting from low coverage or capture of divergent paralogous copies of chloroplast regions residing in the nuclear or mitochondrial genome. In addition, several assemblies were discarded due to poor overall assembly quality as evidenced by highly divergent exon/protein sequences and divergence from Sanger-sequenced plastome regions of the same species. Previously published Pinus plastome sequences were used only if they exceeded 80% estimated sequence completion.
Phylogenetic analysis of full plastome alignment
Maximum likelihood (ML) and Bayesian phylogenetic analyses of the complete alignment (including the ca. 450 bp remnant of the IR common to members of Pinaceae) were completed through the Cipres Science Gateway (http://www.phylo.org/) using RAxML-HPC2  and MrBayes , both on the available teragrid. Maximum likelihood analysis in this case was performed under the GTRGAMMA model, with the number of bootstrap replicates automatically determined under the recommended autoMRE option, and gaps treated as missing data. Bayesian analyses were performed under the same model of evolution and with the same treatment of gapped positions. Each analysis consisted of two runs with four chains each (three hot and one cold chain), run for 10,000,000 generations with trees sampled every 1000 generations, and the first 25% of trees discarded as burn-in. Stationarity was evaluated by graphing –lnL of trees across all generations and by requiring the standard deviation of the two runs to be less than 0.05. All trees were combined from both runs past the point of stationarity to determine topology and support through the majority rule consensus tree using PAUP* v.4.0b10 . Parsimony analysis was performed with PAUP* v.4.0b10, under heuristic search with ten repetitions of random sequence addition, tree bisection and reconnection branch swapping and 100 bootstrap replicates; gapped positions were again treated as missing data.
Evaluation of the impact of variable site removal
Variable sites in the full plastome alignment were ranked using both tree-independent and tree-dependent methods – i.e. without and with the consideration of an underlying phylogenetic framework, respectively. In the tree-independent strategy, all alignment sites in the full plastome matrix were ranked based on their variability using the script sorter.pl , which measures the ‘observed variability’ (OV) of each position in an alignment as:
k, the number of all possible pairwise comparisons between accessions in an alignment, excluding accessions with a gap or masked base at the position considered; dij, the score of character variability (0 for match, 1 for mismatch) in each of k pairwise comparisons of accessions in the alignment (accessions with gaps and masked positions excluded, as noted for k above).
Starting with the highest variability sites, alignment positions were serially removed from the full alignment in 100 site partitions using the script sorter.pl, resulting in two series of data partitions. Following Goremykin et al. , the first series, An, consisted of all alignment positions except the most variable 100, 200, 300,…,20000 sites, while the second series, Bn, consisted of the most variable 100, 200, 300,…,20000 sites. Our notation differs slightly from Goremykin et al. in that we use the subscript for An to refer to the size of a given An partition, while the subscript of Bn refers also to the size of the corresponding An partition.
In the tree-dependent strategy, the variability of all sites in the full plastome matrix was measured using the program AIR-Identifier  as implemented in the University of Oslo Bioportal (https://www.bioportal.uio.no/). AIR-Identifier utilizes the baseml application of PAML [77, 78] to estimate site variability in a nucleotide alignment within a maximum likelihood framework. For these analyses, the maximum likelihood phylogenetic tree from analysis of the full alignment (see above) was used as the underlying phylogenetic framework. Variability of alignment sites was quantified under default settings as recommended by the site’s author (S. Kumar, pers. comm.) (Additional File 2) and after several difficulties due to the size of our data matrix, and corresponding series of An and Bn matrices were constructed as described above up to the most variable 20000 sites, with the exception that variable sites were extracted from the alignment using PAUP* v.4.0b10.
For each strategy, phylogenetic analyses on all An and Bn data partitions were run through the OSU CGRB GENOME Cloud computing resources (http://bioinfo.cgrb.oregonstate.edu) using RAxML-VI-HPC v.2.2.3  primarily under the GTRGAMMA model (some of the larger partitions were run under the GTRCAT model due to time constraints), with 100 bootstrap replicates and gaps treated as missing data. The highest likelihood tree from each An partition and its corresponding Bn partition were then compared using the branch score metric  and partition metric  as implemented in the treedist executable of Phylip v.3.69 . These comparisons allowed visual inspection of the effects of increasing variable site removal from the alignment. It was expected that the topology of corresponding An and Bn phylogenetic trees should converge as more sites are removed from An partitions and added to the Bn partitions, since variable sites partitions should increasingly contain accurate rather than misleading phylogenetic signal as the process of variable site removal progresses [35, 82]. In turn, median bootstrap values and the distribution of bootstrap values were also recorded for all An partitions to visually assess the point at which excess phylogenetic signal was being lost through variable site removal (as evidenced by a decrease in overall bootstrap support values). Trends in bootstrap support values were then visually compared to trends in branch score and partition metrics to identify whether a ‘window’ representing the highest signal to noise ratio in a range of An partitions existed. If present, such a window would be found in An partitions corresponding to low topological differences between phylogenetic trees generated from An and corresponding Bn partitions, yet high overall bootstrap support in trees generated from An partitions.
Trends in topology and bootstrap support were subsequently investigated over the process of variable site removal specifically for three taxa with historically recalcitrant phylogenetic positions: 1) the monotypic subsection Krempfianae, consisting of the morphologically distinctive flat-needled P. krempfii , 2) the southeast Asian clade consisting of P. merkusii and P. latteri , and 3) subsection Contortae , consisting of Pinus contorta , P. banksiana, P. clausa and P. virginiana , (Figure 1). For these analyses, bootstrap values for the nodes immediately ancestral to all three taxa were recorded for each An partition, as these nodes represented the resolution between disputed alternative placements of each taxon (Figure 1). In addition, bootstrap values supporting the monophyly of the P. merkusii /P. latteri clade and subsection Contortae were recorded for each An partition.
Evaluation of the impact of long-branch exclusion
As a general rule, the Pinus chloroplast phylogeny contains relatively long branches (substantial divergence) separating the two subgenera and four sections, but relatively short branches (low divergence) within subsections [31, 42]. As a result, to remove long branches it is necessary in most cases to remove entire clades at the subsectional level or higher. Because of this and due to the conflicting topologies of interest residing at the subsectional level, long branches were excluded in the following manners: 1) all six Pinaceae outgroups were removed prior to phylogenetic analyses, 2) only the subgenus of interest was included in the analyses, and 3) only the section of interest and one member of the sister section were included in analyses. For the most exclusive strategy, P. monophylla (EU998745.4), P. ponderosa (FJ899555.2) and Pinus thunbergii (NC_001631.1) were used as outgroups for sections Quinquefoliae , Pinus and Trifoliae respectively. Maximum likelihood phylogenetic analyses were performed as described above for the full alignment for each strategy of long-branch exclusion on each of three partition sizes of interest (full alignment, A136665, A133065, as discussed in Results).
Impact of noise-removal strategies on saturation
To gain further insight into the impact of variable site removal and long-branch exclusion on saturation in our data matrix (i.e., the history of multiple nucleotide state changes at individual sites), pairwise genetic distances between all accessions were determined in MEGA4  both without correction and with application of a Jukes-Cantor correction. The correlation of these values was determined by linear regression for each of three partition sizes of interest in the OV-based variable site removal analysis (full alignment, A136665, A133065, as discussed in Results) and for each strategy of long-branch exclusion. The slope of the regression line was taken as indicative of the level of saturation present in the dataset, such that higher values for corrected pairwise distances relative to uncorrected distances correspond to higher levels of saturation [18, 84, 85].
Sequence assembly and alignment
After quality/chastity filtering through the Illumina GA Pipeline v. 1.5 and removal of adapter sequences, read pools for successfully assembled plastome sequences averaged 1.77 ± 0.76 million reads per accession, while chloroplast reads accounted for 56.83 ± 13.85% of these reads on average (SRA047299.1, Additional File 1). Seventy-five novel assembled plastome sequences averaged 117157 ± 3634 bp in length, and were estimated to be 98.1 ± 2.5% complete on average after masking (GenBank JN854151-JN854220, JN854222-JN854226, Additional File 1). The alignment of all successfully assembled plastome sequences, including 107 Pinus accessions and six Pinaceae outgroups, resulted in 141265 aligned sites. Within this alignment, 1217 positions were composed entirely of masked bases or gaps and masked bases due to failure of called nucleotides to meet coverage requirements. The complete alignment is available through TreeBASE (http://www.treebase.org, study number 12640).
Variable sites were identified in nearly all coding and noncoding regions of the plastome, although they were unequally distributed between and among exons, introns and noncoding regions (Table 1, Figure 2). Highest average per-site OV was found in noncoding regions, followed by protein-coding exons, introns, and finally RNA-coding exons (Table 1). The higher variability of exons than introns was an unexpected result; however, previous work [31, 61] has shown that the loci ycf 1 and ycf2 are extremely variable in Pinus compared to other protein-coding loci. Because of this, OV calculations were also averaged for exons without ycf1 and ycf2. With the removal of either ycf 1 alone or both ycf1 and ycf2 positions, average per site OV for protein-coding exons fell below that of intronic regions (Table 1), although the difference between intronic regions and exonic regions with removal of only ycf1 was not significant. The distribution of rate values for alignment positions by AIR-Identifier was similar, although intron regions were significantly more highly variable than all three exon partitions, and variability of tRNA loci was significantly higher on average than rRNA and exon regions with the exclusion of ycf 1 and ycf 2 (Additional File 3).
Phylogenetic analysis of the full alignment
Our full alignment contained 42468 alignment patterns, and resulted in highly supported and almost completely congruent topologies in ML, Bayesian and parsimony analyses (Additional File 4). All major clades at the subgenus, sectional and subsectional levels as reported by Gernandt et al.  were recovered with 95–100% bootstrap support. Across the topology, average ML bootstrap support for 105 ingroup nodes was 89.7% (standard deviation = 18.5%). Only two minor topological conflicts were found between methods. In subsection Australes Pinus caribaea was placed sister to a clade of P. cubensis and P. occidentalis with low support in Bayesian analysis (<0.6 posterior probability), while ML and parsimony analyses recovered P. caribaea sister to P. palustris , again with low support (≤50% bootstrap support). In section Quinquefoliae , parsimony analysis recovered P. morrisonicola in a weakly supported clade with P. armandii (55% bootstrap support), while both Bayesian and ML methods recovered these species in a grade with variable support (43% bootstrap support, 0.97 posterior probability). For the three clades of interest, topology was consistent between methods, while support was in some cases variable. For example, section Quinquefoliae was recovered as subsection Strobus + (P. krempfii + subsection Gerardianae ) with weak to strong support (58–73% bootstrap / 1.0 posterior probability) for the position of P. krempfii . Section Pinus was recovered as subsection Pinus + (P. merkusii /P. latteri + subsection Pinaster ) with weak to moderate support for the position of P. merkusii /P. latteri (50–71% bootstrap / 0.52 posterior probability) but strong support for the monophyly of these two species (100% bootstrap, 1.0 posterior probability).Section Trifoliae was recovered as subsection Contortae + (subsection Australes + subsection Ponderosae) with high support (100% bootstrap / 1.0 posterior probability) for the monophyly and position of subsection Contortae .
Impact of variable site removal
Bootstrap support values showed clear trends throughout An partitions as variable sites were removed, with overall values consistently high (average value > 85%, median value ≥ 98%) until the most variable ca. 8.3 kbp had been removed (A133065) (Figure 3). Support steadily decreased from this point before leveling off at low values with removal of ≥18kbp of variable sites (mean/median bootstrap support <17% /<10%). Branch score metric values decreased relatively rapidly over removal of the first ca. 2kbp of highly variable alignment positions (A141265 to A139265), but subsequently rose again and plateaued at about half their maximal level from ca. A138565 to A136665, until decreasing rapidly again and remaining at low levels (Figure 3). Partition metric values showed an initial rapid decline before leveling off after the removal of the most variable 2.2 kbp (ca. A139165) (Figure 3), at which point values remained constant and relatively low until increasing again beyond removal of the most variable 8.3 kbp (A133065). The most highly variable 2kbp of alignment positions, corresponding to the first rapid decline in branch score and partition metric values, was dominated by positions wherein the majority of taxa contained gaps and masked bases (mean/median/standard deviation of gaps + masked bases per alignment position = 80.49/108/44.05). Visual inspection of An and Bn trees further revealed that the great majority of topological differences over the plateau of branch score metric values from ca. A138565 to A136665 involved changes in branch lengths associated with subgeneric and sectional level divisions. This is also indirectly evidenced by the consistently low values of partition metric scores over this range, which reflect the consistent branching orders between An and Bn partitions but do not reflect differences in branch lengths. It is likely that these branch length differences are largely responsible for the temporary increase in branch score metric values seen here. Such a result is not completely unexpected, as variable sites associated with internal divisions of large groups of taxa typically have relatively high OV scores .
Bootstrap support for the phylogenetic position of P. krempfii was moderate (59–84%) until removal of the most variable 5.7 kbp (ca. A135665), at which point bootstrap values steadily increased until peaking at 97–100% after removal of the most variable 6.3–7.8 kbp (ca. A135065 to A133665) (Figure 4). An partitions greater than 129.4 kbp in size recovered section Quinquefoliae as subsection Strobus + (P. krempfii + subsection Gerardianae); at An partition sizes smaller than this phylogenetic position was variable.
The monophyly of P. merkusii/P. latteri was highly supported until removal of the most variable 18.2 kbp (A123165) (Figure 4). Support for their resolution within section Pinus, however, was consistently moderate until removal of 7.2 kbp of the most variable sites (A134165). An partitions prior to this point recovered the P. merkusii/P. latteri clade alternately sister to subsection Pinaster and subsection Pinus. After this point, bootstrap support rapidly increased to a peak of 96–100% between removals of 7.6–9 kbp of the most variable sites (ca. A133765 to A132465), and all An partitions in this range recovered section Pinus as subsection Pinus + (P. merkusii/P. latteri + subsection Pinaster).
Monophyly of subsection Contortae was highly supported until removal of 15.3 kbp of the most variable sites (ca. A126065), while support for the phylogenetic position of the Contortae decreased fairly steadily after removal of only 4.2 kbp (ca. A137265) (Figure 4). Section Trifoliae was recovered as subsection Contortae + (subsection Australes + subsection Ponderosae) by all An partitions greater than 137 kbp in size; resolution based on An partitions less than 137 kbp in size was variable, although placement of subsection Contortae as sister to or nested within subsection Australes was supported by several partitions between A136665 and A133065.
Overall, trends were not as clear using the tree-based approach employed. In particular, bootstrap support values remained constant and high, with a median value of 100% in all An partitions (Additional File 5). An and Bn tree topological comparisons showed somewhat similar trends as seen in the OV-based strategy, with rapid initial decreases followed by consistently low values in branch score and partition metric values (Additional File 5); however due to the lack of change in bootstrap support values, it was not possible to identify a target window of variable site removal using this method.
Nonetheless, phylogenetic patterns in the three clades of interest were partly reflective of tree-independent results, suggesting an influence of highly variable sites in the placement of at least two of these clades. For example, P. krempfii was placed as sister to subsection Gerardianae by all An partitions, and reached high support with variable site removal. In this case removal of both 8.8-8.9 kbp and 14.7-20 kbp resulted in 100% bootstrap support (Additional File 5). P. merkusii/P. latteri also showed increased support with removal of variable sites and, similar to tree-independent results initially varied in phylogenetic placement as sister to subsection Pinus and subsection Pinaster (Additional File 6). Support for (subsection Pinus + P. merkusii/P. latteri) peaked at 97% with removal of 9.7–10.5 kbp, while support for (subsection Pinaster + P. merkusii/P. latteri) reached 100% with removal of 15.3–20 kbp. In contrast to tree-independent results, subsection Contortae was found as sister to (subsection Australes + subsection Ponderosae ) with 100% bootstrap support for all An partitions (Additional File 6).
Impact of long-branch exclusion
When all alignment sites were included in analyses, long-branch exclusion strategies had essentially no impact on the topology or support of subsection Krempfianae or Contortae , while support increased moderately for a monophyletic (Pinus merkusii/P. latteri + subsection Pinaster ) only with the most exclusive strategy (Table 2). When long-branch exclusion was used in combination with variable site removal (specifically partitions A136665 and A133065), trends were reflective of variable site removal alone for partition size A136665 in subsection Contortae and P. merkusii /P. latteri (Table 2). In the remaining cases, trends were either non-existent (P. krempfii and P. merkusii /P. latteri exclusion strategies applied to A133065) or counter to patterns seen with variable site removal alone (P. krempfii exclusion strategies applied to A136665, subsection Contortae exclusion strategies applied to A133065) (Table 2).
Impact of variable site and long-branch removal on saturation
Based on the slopes of regression lines of corrected vs. uncorrected pairwise distances, saturation decreased similarly both with OV-based variable site removal and long-branch exclusion strategies (Table 3, Additional File 7). The highest levels of saturation were observed with inclusion of all accessions, while the lowest values occurred with removal of the most variable 8.3 kbp of the alignment (A133065) and exclusion of at least the Pinaceae outgroups (Table 3).
As genome-scale datasets become increasingly common in evolutionary analyses, it is reasonable to expect challenges associated with highly variable or noisy data. Because of this, it is prudent to develop efficient strategies to identify and mitigate phylogenetic noise while simultaneously preserving sites and taxa carrying useful phylogenetic signal in order to most effectively capture information from large datasets. The benefit of developing such strategies has been demonstrated already, for example in placental mammals , early-diverging plant lineages [82, 86] and deep eukaryotic phylogeny . Our approach is similar to previous efforts, but focused on two fundamental and complementary strategies, variable site removal and long-branch exclusion, and explored the dynamics of tree topology and support values to measure their impact on an infrageneric phylogenetic analysis. While the two strategies employed were both utilized to counter the effect of putative phylogenetic noise, there are important contrasts between them. For example, the strict application of long-branch exclusion serves to minimize long-branch attraction artefacts, yet phylogenetic hypotheses may still be misled by evolutionary patterns at highly variable sites since all sites are still included in the analysis. In this case, removal of taxa could mask evolutionary patterns at some sites that otherwise might be more clearly interpreted [4, 87], while the inclusion of fast-evolving sites may still mislead phylogenetic analyses . On the other hand, removal of highly variable sites potentially diminishes the impact of misleading signal in an alignment and should increase the ability of applied models of sequence evolution to capture evolutionary patterns in phylogenetic analyses. The success of this strategy may be limited, however, as the inclusion of highly divergent taxa could still lead to long-branch artefacts when phylogenetic signal is minimal, and the broad application of variable site removal may have the unintended result of diminishing or erasing phylogenetic signal in some sub-clades . It is therefore likely that utilizing a combination of these two strategies is prudent in many cases , yet an overly conservative approach could still lead to the loss of essential phylogenetic signal.
With our dataset and strategies, removal of variable sites in a tree-independent manner appears to provide more information toward clarifying the plastid-based evolutionary relationships of three historically problematic clades than does the exclusion of long branches. Further, removal of up to several thousand of the most variable alignment positions as measured by the OV metric was useful in investigating conflicting or weakly supported phylogenetic resolution in these clades; however, it is important to note that this magnitude of variable site removal is specific to Pinus and is expected to vary in other taxa depending on their evolutionary histories and the amount and origin of the sequence utilized in phylogenetic analyses. Nonetheless, the targeted range of variable site removal in this study (from ca. 4.3 to 8.6 kbp of highly variable sites), corresponding to alignment partitions with lower branch score/partition metric values and high overall bootstrap support, is significant in two regards. First, as overall high levels of bootstrap support are maintained across this range of partitions (Figure 3), sites within this window are not required for the resolution of most intraspecific relationships within the genus Pinus. Second, as variable sites are removed within this range of partitions important changes are seen in the topology and support of three historically recalcitrant clades - support for the position of subsection Contortae as sister to the monophyletic grouping of subsections Australes and Ponderosae diminishes substantially (discussed below), while there is substantial increase in resolution for the consistent positions of P. krempfii and P. merkusii / P. latteri. Conversely, the long-branch exclusion strategies applied have little to no effect on the topology and support for these clades when applied to the full plastome alignment, suggesting that variable site removal is more effective in mitigating what we perceive to be the impact of phylogenetic noise in our data set.
Finally, the tree-dependent strategy applied to our dataset also did not have the same impact as the OV-based strategy in relation to the three clades of interest. Rather, it appeared that this procedure was more strongly influenced by positions primarily delimiting generic and subgeneric divisions, as evidenced in part by the relatively high variability categorization of rRNA positions (Additional File 3). Critically, as overall bootstrap support values did not change even with extensive variable site removal, it was not possible to identify a finite window of putative high signal to noise ratio as with the tree-independent strategy. It is possible that this method would be more efficient if combined with the exclusion of outgroups or if applied to clades consisting solely of more closely related taxa. Alternatively, it may be that tree-based methods are in many cases limited by the influence of the assumed phylogenetic hypothesis . The absence of change in overall bootstrap support values with tree-dependent variable site removal in our dataset supports the latter contention.
The specific changes in position and topological support shown in our analyses are also noteworthy because they highlight the disparate resolutions previously supported by different analyses or different types of data. For example, the unique morphological characteristics of Pinus krempfii (most notably its flat, paired needles) have led to a wide range of phylogenetic resolution, including placement outside the genus Pinus, in its own subgenus within Pinus[41, 91, 92], and within subgenus Strobus , section Parrya[93–95]. At least two recent morphological treatments have recognized some affinity of P. krempfii to P. gerardiana and P. bungeana of subsection Gerardianae, based on cuticular micromorphology and the morphology of tracheid and parenchyma cells [94, 96]. Molecular evidence to date strongly support a position within or sister to section Quinquefoliae of subgenus Strobus, although a consistent and clear relationship of P. krempfii to subsections Strobus and Gerardianae of section Quinquefoliae has proven elusive (Figure 1). Some analyses based on chloroplast sequence data suggest an affinity to subsection Gerardianae[56, 57], but support for this relationship is typically moderate to weak. Other reports based on chloroplast or nuclear sequence data show poor resolution [42, 44], place the species sister to section Quinquefoliae , or suggest inclusion within subsection Strobus [51, 52].
Pinus merkusii and P. latteri have demonstrated similarly ambiguous phylogenetic resolution relative to subsections Pinus and Pinaster of section Pinus (Figure 1), and again there is incongruence between molecular and morphological data. For example, Frankis  placed P. merkusii within subsection Pinaster based on cone morphology, while most molecular analyses place P. merkusii as sister to subsection Pinus [42, 51, 52, 57], albeit typically with low to moderate support. On the other hand, Wang et al. [56, 57] and Szmidt et al.  demonstrated a clear genetic separation of P. merkusii from sampled Asian members of subsection Pinus, and suggest a divergence between these groups possibly in the early Tertiary, although this timeframe is not in accordance with the age of section Pinus based on molecular clock calibrations [40, 98].
Finally, the position of subsection Contortae is strongly supported (up to 100% bootstrap support) as sister to subsections Ponderosae and Trifoliae (Figure 1) based on previous reports using chloroplast sequence data or chloroplast restriction fragment analyses [31, 42, 44, 49, 55, 59]. Alternatively, other lines of evidence suggest this highly supported topology may be incorrect. For example, hybridization is possible between some members of subsections Contortae and Australes, but not between members of subsections Contortae and Ponderosae[99, 100]. Similarly, the relatively shallow fossil record of subsection Contortae[101, 102] suggests a more recent derivation within its section. In turn, two reports based on nrITS and four low-copy nuclear loci place the Contortae either nested within subsection Australes with moderately high support (77–82% bootstrap support in Liston et al. ) or forming a polytomy with monophyletic subsections Ponderosae and Australes , respectively, while restriction fragment analysis including chloroplast, mitochondrial and nuclear DNA suggest a more derived position of subsection Contortae within section Trifoliae and some affinity to members of subsection Australes.
While our results cannot be considered conclusive by themselves, they certainly add important perspectives to Pinus evolutionary history as well as the use of plastome-scale sequences in plant phylogenomic analyses. For the genus Pinus as a whole, our dataset apparently represents the maximal resolution to be gained from the plastome, although various permutations of chloroplast loci may still prove useful at different levels of phylogenetic inquiry [for example see  and certainly plastome variation within-species warrants further interrogation for a number of species (i.e., P. ponderosa P. lambertiana, although see [98, 103]). From this point, the next target of phylogenetic interrogation will likely be larger unique portions of the nuclear genome, particularly as increases in sequence output continue to outpace increases in read length for next-generation sequencers  and progress is made on the sequencing and assembly of a representative pine nuclear genome . For the three specific clades investigated in this study, the patterns in response to variable site and long-branch removal from the full plastome alignment are intriguing and a measure of insight has been gained into the evolutionary histories and relationships of their plastomes, if not of the species themselves. In each case, decreasing the impact of phylogenetic noise by removing highly variable sites resulted in phylogenetic resolution more reflective of results based on nuclear and/or morphological data. At the same time, the impact of long-branch exclusion was less pronounced, suggesting that long-branch attraction artefacts are not prevalent at these levels of the Pinus phylogeny. The congruent results between model-based and parsimony methods for these clades also lend support to this conclusion, as methodological incongruence is another indication of possible long branch attraction artefacts . This result is somewhat counter-intuitive, as all three lineages investigated have relatively long branches in chloroplast-based phylogenetic reconstruction (Additional File 4) . It is possible that these long branches are not all reflective of the same biological processes. The long branches of P. krempfii and the P. merkusii /P. latteri clade likely are due to relatively long periods of neutral divergence from their sister lineages. In these cases, however, it appears that removal of highly variable sites unmasks the limited underlying signal more definitively supporting their plastid-based resolutions - P. krempfii as sister to subsection Gerardianae of section Quinquefoliae , and P. merkusii/P. latteri as sister to subsection Pinaster of section Pinus . For subsection Contortae, on the other hand, chloroplast-based support for an early divergence in section Trifoliae is clearly inflated by the phylogenetic noise of highly variable sites. In this case, the pronounced effect of variable site removal combined with the relatively long branch leading to subsection Contortae may instead be indicative of elevated rates of evolution or responses to selection in this lineage, and a position sister to or within subsection Australes could be the final resolution of this challenging group.
The promise of phylogenomics is still very much palpable and (to paraphrase Mark Twain) reports of its ‘demise’ [17, 18] are greatly exaggerated. As demonstrated in the current study, a full-plastome matrix provides greatly increased resolution into the evolutionary history of the genus Pinus . Still, in this and other cases it is equally premature to confirm phylogenetic results based on genome-scale datasets without investigating first for the presence of misleading signal [17–19, 22]. This is particularly important when trying to reconcile poorly supported topologies or conflicting phylogenetic results based on different sources or types of data , such as those represented by three historically recalcitrant taxa in the Pinus phylogeny. The present analysis and similar efforts [for example  also demonstrate not only the power of large (but well-managed) datasets to increase phylogenetic resolution, but the risk of relying on single sources of data, as inconsistencies between organellar- and nuclear-based analyses can remain even with greatly increased sampling. Fortunately, sequencing capacity and read length of next-generation platforms continue to increase [106–110], and combined with increasingly effective methods of genome interrogation [111–114] will make it easier to capture useful sequence data from what are currently less tractable genomes (such as plant nuclear and mitochondrial genomes). However, the development of analytical strategies to interrogate misleading signal present in large datasets will remain essential, as phylogenetic signal clearly is not always sufficient to overcome phylogenetic noise in identifying the relationships of certain taxa or organelles, even at genomic scales.
Fitch WM: Cautionary remarks on using gene expression events in parsimony procedures. Systematic Zool. 1979, 28: 375-379.
Fitch WM: Cladistic and other methods: problems, pitfalls, and potentials. Cladistics: Perspectives on the Reconstruction of Evolutionary History. Edited by: Duncan T, Stuessy TF. 1984, New York: Columbia University Press, 221-252.
Felsenstein J: Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool. 1978, 27: 401-410.
Hendy MD, Penny D: A framework for the quantitative study of evolutionary trees. Syst Biol. 1989, 38: 297-309.
Huelsenbeck JP: Tree-length distribution skewness: an indicator of phylogenetic information. Syst Biol. 1991, 40: 257-270.
Hillis DM, Huelsenbeck JP: Signal, noise, and reliability in molecular phylogenetic analyses. J Hered. 1992, 83: 189-195.
Farris JS, Källersjö M, Kluge AG, Bult C: Constructing a significance test for incongruence. Syst Biol. 1995, 44: 570-572.
Dowton M, Austin AD: Increased congruence does not necessarily indicate increased phylogenetic accuracy - the behavior of the incongruence length difference test in mixed-model analyses. Syst Biol. 2002, 51: 19-31.
Hipp AL, Hall JC, Sytsma KJ: Congruence versus phylogenetic accuracy: revisiting the incongruence length difference test. Syst Biol. 2004, 53: 81-89.
Strimmer K, von Haeseler A: Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proc Nat Acad Sci USA. 1997, 94: 6815-6819.
Pollock DD, Zwickl DJ, McGuire JA, Hillis DM: Increased taxon sampling is advantageous for phylogenetic inference. Syst Biol. 2002, 51: 664-671.
Soltis DE, Albert VA, Savolainen V, Hilu K, Qiu Y-L, Chase MW, Farris JS, Stefanovic S, Rice DW, Palmer JD, Soltis PS: Genome-scale data, Angiosperm relationships, and ‘ending incongruence’: a cautionary tale in phylogenetics. Trends Plant Sci. 2004, 9: 477-483.
Townsend JP, Leuenberger C: Taxon sampling and the optimal rates of evolution for phylogenetic inference. Syst Biol. 2011, 60: 358-365.
Townsend JP: Profiling phylogenetic informativeness. Syst Biol. 2007, 56: 222-231.
Klopfstein S, Kropf C, Quicke DLJ: An evaluation of phylogenetic informativeness profiles and the molecular phylogeny of Diplazontinae (Hymenoptera, Ichneumonidae). Syst Biol. 2010, 59: 226-241.
Bergsten J: A review of long-branch attraction. Cladistics. 2005, 21: 163-193.
Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005, 6: 361-375.
Jeffroy O, Brinkmann H, Delsuc F, Philippe H: Phylogenomics: the beginning of incongruence?. Trends Genet. 2006, 22: 225-231.
Philippe H, Delsuc F, Brinkmann H, Lartillot N: Phylogenomics. Ecol, Evol and Syst. 2005, 36: 541-562.
Degnan JH, Rosenberg NA: Discordance of species trees with their most likely gene trees. PLoS Genet. 2006, 2: 762-768.
Kubatko LS, Degnan JH: Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 2007, 56: 17-24.
Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, Baurain D: Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biology. 2011, 9: e1000602-
Stefanovic S, Rice D, Palmer J: Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots?. BMC Evol Biol. 2004, 4: 35-
Phillips MJ, Delsuc F, Penny D: Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 2004, 21: 1455-1458.
Brinkmann H, van der Giezen M, Zhou Y, de Raucourt GP, Philippe H: An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst Biol. 2005, 54: 743-757.
Rokas A, Carroll SB: Bushes in the tree of life. PLoS Biology. 2006, 4: e352-
Cai Z, Penaflor C, Kuehl JV, Leebens-Mack J, Carlson JE, dePamphilis CW, Boore JL, Jansen RK: Complete plastid genome sequences of Drimys, Liriodendron, and Piper: implications for the phylogenetic relationships of Magnoliids. BMC Evol Biol. 2006, 6: 77-
Jansen RK, Cai Z, Raubeson LA, Daniell H, dePamphilis CW, Leebens-Mack J, Muller KF, Guisinger-Bellian M, Haberle RC, Hansen AK: Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proc Natl Acad Sci USA. 2007, 104: 19369-
Jansen RK, Kaittanis C, Saski C, Lee SB, Tomkins J, Alverson AJ, Daniell H: Phylogenetic analyses of Vitis (Vitaceae) based on complete chloroplast genome sequences: effects of taxon sampling and phylogenetic methods on resolving relationships among rosids. BMC Evol Biol. 2006, 6: 32-
Moore MJ, Bell CD, Soltis PS, Soltis DE: Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms. Proc Natl Acad Sci USA. 2007, 104: 19363-19368.
Parks M, Cronn R, Liston A: Increasing phylogenetic resolution at low taxonomic levels using massively parallel sequencing of chloroplast genomes. BMC Biol. 2009, 7: 84-
Goremykin VV, Holland B, Hirsch-Ernst KI, Hellwig FH: Analysis of Acorus calamus chloroplast genome and its phylogenetic implications. Mol Biol Evol. 2005, 22: 1813-1822.
Leebens-Mack J, Raubeson LA, Cui L, Kuehl JV, Fourcade MH, Chumley TW, Boore JL, Jansen RK, dePamphilis CW: Identifying the basal angiosperm node in chloroplast genome phylogenies: sampling one's way out of the Felsenstein zone. Mol Biol Evol. 2005, 22: 1948-1963.
Lin C-P, Huang J-P, Wu C-S, Hsu C-Y, Chaw S-M: Comparative chloroplast genomics reveals the evolution of Pinaceae genera and subfamilies. Gen Biol and Evol. 2010, 2: 504-517.
Goremykin V, Nikiforova S, Bininda-Emonds O: Automated removal of noisy data in phylogenomic analyses. J Mol Evol. 2010, 71: 319-331.
Shaw J, Lickey EB, Beck JT, Farmer SB, Liu W, Miller J, Siripun KC, Winder CT, Schilling EE, Small RL: The tortoise and the hare II: relative utility of 21 noncoding chloroplast DNA sequences for phylogenetic analysis. Am J Bot. 2005, 92: 142-166.
Shaw J, Lickey EB, Schilling EE, Small RL: Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. Am J Bot. 2007, 94: 275-
Millar CI: Early evolution of pines. Ecology and biogeography of Pinus. Edited by: Richardson DM. 2000, Cambridge: Cambridge University Press, 69-91.
Klymiuk AA, Stockey RA, Rothwell GW: The first organismal concept for an extinct species of Pinaceae. Int J Plant Sci. 2011, 172: 294-313.
Willyard A, Syring J, Gernandt DS, Liston A, Cronn R: Fossil calibration of molecular divergence infers a moderate mutation rate and recent radiations for Pinus. Mol Biol Evol. 2007, 24: 90-101.
Little EL, Critchfield WB: Subdivision of the Genus Pinus (Pines). 1969, Washington, D.C: USDA
Gernandt DS, Lopez G, Garcia SO, Liston A: Phylogeny and classification of Pinus. Taxon. 2005, 54: 29-42.
Ortiz Garcia S: Evolucion y fiogenia en pinos y sus hongos endofitos: aspectos sistematicos de la coespeciacion PhD thesis. 1999, Universidad Nacional Autonoma de Mexico, Instituto de Ecologia
Gernandt DS, Magallon S, Geada Lopez G, Zeron Flores O, Willyard A, Liston A: Use of simultaneous analyses to guide fossil-based calibrations of Pinaceae phylogeny. Int J Plant Sci. 2008, 169: 1086-1099.
Frankis MP: Morphology and affinities of Pinus brutia. International Symposium on Pinus brutia Ten: 18-23 October 1993. 1993, Marmaris, Turkey: Ministry of Forestry, 11-18.
Critchfield WB: Crossability and relationships of the closed-cone pines. Silvae Genetica. 1966, 16: 89-97.
Critchfield WB: Interspecific hybridization inPinus: a summary review. Symposium on Interspecific and Interprovenance Hybridization in Forest Trees: 28-30 August 1975; Fredericton, New Brunswick. Edited by: Fowler DP, Yeatman CY. 1975, Ottawa: Canadian Forestry Service, 99-105.
Critchfield WB: Hybridization and classification of the white pines (Pinus section Strobus). Taxon. 1986, 35: 647-656.
Krupkin AB, Liston A, Strauss SH: Phylogenetic analysis of the hard pines (Pinus subgenus Pinus, Pinaceae) from chloroplast DNA restriction site analysis. Am J Bot. 1996, 83: 489-498.
Strauss SH, Doerksen AH: Restriction fragment analysis of pine phylogeny. Evolution. 1990, 44: 1081-1096.
Liston A, Gernandt DS, Vining TF, Campbell CS, Pinero D: Molecular phylogeny of Pinaceae and Pinus. Acta Hort (ISHS). 2003, 615: 107-114.
Liston A, Robinson WA, Piñero D, Alvarez-Buylla ER: Phylogenetics of Pinus (Pinaceae) based on nuclear ribosomal DNA internal transcribed spacer region sequences. Mol Phylogenet Evol. 1999, 11: 95-109.
Syring J, Willyard A, Cronn R, Liston A: Evolutionary relationships among Pinus (Pinaceae) subsections inferred from multiple low-copy nuclear loci. Am J Bot. 2005, 92: 2086-2100.
Palmé AE, Pyhäjärvi T, Wachowiak W, Savolainen O: Selection on nuclear genes in a Pinus phylogeny. Mol Biol Evol. 2009, 26: 893-905.
Geada Lopez G, Kamiya K, Harada K: Phylogenetic relationships of diploxylon pines (subgenus Pinus) based on plastid sequence data. Int J Plant Sci. 2002, 163: 737-747.
Wang XR, Szmidt AE, Nguyên HN: The phylogenetic position of the endemic flat-needle pine Pinus krempfii (Pinaceae) from Vietnam, based on PCR-RFLP analysis of chloroplast DNA. Plant Syst Evol. 2000, 220: 21-36.
Wang XR, Tsumura Y, Yoshimaru H, Nagasaka K, Szmidt AE: Phylogenetic relationships of Eurasian pines (Pinus, Pinaceae) based on chloroplast rbcL, matK, rpl20-rps18 spacer, and trnV intron sequences. Am J Bot. 1999, 86: 1742-1753.
Zhang ZY, Li DZ: Molecular phylogeny of section Parrya of Pinus (Pinaceae) based on chloroplast matK gene sequence data. Acta Bot Sin. 2004, 46: 171-179.
Eckert AJ, Hall BD: Phylogeny, historical biogeography, and patterns of diversification for Pinus (Pinaceae): phylogenetic tests of fossil-based hypotheses. Mol Phylogenet Evol. 2006, 40: 166-182.
Cronn R, Liston A, Parks M, Gernandt DS, Shen R, Mockler T: Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res. 2008, 36: e122-
Parks M, Liston A, Cronn R: Newly developed primers for complete ycf1 amplification in Pinus (Pinaceae) chloroplasts with possible family-wide utility. Am J Bot. 2011, 98 (7): e185-e188.
Wakasugi T, Tsudzuki J, Ito S, Nakashima K, Tsudzuki T, Sugiura M: Loss of all ndh genes as determined by sequencing the entire chloroplast genome of the black pine Pinus thunbergii. Proc Natl Acad Sci USA. 1994, 91: 9794-9798.
Pan X, Urban AE, Palejev D, Schulz V, Grubert F, Hu Y, Snyder M, Weissman SM: A procedure for highly specific, sensitive, and unbiased whole-genome amplification. Proc Natl Acad Sci USA. 2008, 105: 15499-15504.
Solexa, Inc: Protocol for whole genome sequencing using Solexa technology. Biotechniques Protocol Guide. 2007, 29-
Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature Biotech. 2009, 27: 182-189.
Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res. 2002, 4: 656-664.
Straub SC, Fishbein M, Livshultz T, Foster Z, Parks M, Weitemier K, Cronn RC, Liston A: Building a model: developing genomic resources for common milkweed (Asclepias syriaca) with low coverage genome sequencing. BMC Genomics. 2011, 12: 211-
Ratan A: Assembly algorithms for next-generation sequence data.PhD Thesis. 2009, The Pennsylvania State University, Computer Science Department
Harris RS: Improved pairwise alignment of genomic DNA.PhD Thesis. 2007, The Pennsylvania State University, Computer Science Department
Kurtz S, Phyllippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol. 2004, 5: R12-
Hall TA: BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 1999, 41: 95-98.
Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33: 511-518.
Stamatakis A: A rapid bootstrap algorithm for the RAxML web servers. Syst Biol. 2008, 57: 758-771.
Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003, 19: 1572-1574.
Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. 2000, Sunderland, Massachusetts: Sinauer Associates
Kumar S, Skjaeveland A, Orr R, Enger P, Ruden T, Mevik B-H, Burki F, Botnen A, Shalchian-Tabrizi K: AIR: A batch-oriented web program package for construction of supermatrices ready for phylogenomic analyses. BMC Bioinforma. 2009, 10: 357-
Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS. 1997, 13: 555-556.
Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007, 24: 1586-1591.
Kuhner MK, Felsenstein J: A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol. 1994, 11: 459-468.
Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131-147.
Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. 2005, Distributed by the author Department of Genome Sciences, University of Washington
Goremykin V, Viola R, Hellwig F: Removal of noisy characters from chloroplast genome-scale data suggests revision of phylogenetic placements of Amborella and Ceratophyllum. J Mol Evol. 2009, 68: 197-204.
Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24: 1596-1599.
Rodríguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF, Philippe H: Detecting and overcoming systematic errors in genome-scale phylogenies. Syst Biol. 2007, 56: 389-399.
Philippe H, Sörhannus U, Baroin A, Perasso R, Gasse F, Adoutte A: Comparison of molecular and paleontological data in diatoms suggests a major gap in the fossil record. J Evol Biol. 1994, 7: 247-265.
Zhong B, Deusch O, Goremykin VV, Penny D, Biggs PJ, Atherton RA, Nikiforova SV, Lockhart PJ: Systematic error in seed plant phylogenomics. Genome Biol Evol. 2011, 1340-1348.
Zwickl DJ, Hillis DM: Increased taxon sampling greatly reduces phylogenetic error. Syst Biol. 2002, 51: 588-598.
Kälersjö M, Albert VA, Farris JS: Homoplasy Increases Phylogenetic Structure. Cladistics. 1999, 15: 91-93.
Cummins CA, McInerney JO: A Method for Inferring the Rate of Evolution of Homologous Characters that Can Potentially Improve Phylogenetic Inference, Resolve Deep Divergence and Correct Systematic Biases. Syst Biol. 2011, 60: 833-844.
Chevalier A: Notes sur les coniferes de l'Indochine. Revue de Botanique Appliquee et d'Agriculture Tropicale. 1944, 24: 7-34.
Gaussen H: Les gymnospermes actuelles et fossiles. Fassicule VI Les Coniferales Chapter 11 Generalites, Genre Pinus Travaux du Toulous Universite Laboratoire Forestier. Volume Tome 2, Sect. 1, Vol. 1. 1960, 1-272.
De Ferre Y: Division du genre Pinus en quatre sous-genres. Academie des Sciences Compte Rendu. 1953, 236: 226-228.
Farjon A: Pines: drawings and descriptions of the genus. 1984, Leiden: W. Backhuys
Ickert-Bond S: Reexamination of wood anatomical features in Pinus krempfii (Pinaceae). IAWA J. 2001, 22: 355-365.
Van der Berg J: Holzer der niederrheinischen braunkohlenformation 2. Holzer der braunkohlengruben "Maria Theresia" zu Herzogenrath, "Zukunft West" zu Eschweiler und "Victor" Zulpich mitte zu Zulpich. Nebst einer systematisch-anatomischen bearbeitung der gattung Pinus L. Rev Palaeobot Palynol. 1973, 15: 73-275.
Ickert-Bond S: Cuticle Micromorphology of Pinus krempfii Lecomte (Pinaceae) and Additional Species from Southeast Asia. Int J Plant Sci. 2000, 161: 301-317.
Szmidt A, Wang XR, Changtragoon S: Contrasting patterns of genetic diversity in two tropical pines: Pinus kesiya (Royle ex Gordon) and P. merkusii (Jungh et De Vriese). TAG Theor Appl Genet. 1996, 92: 436-441.
Gernandt DS, Hernández-León S, Salgado-Hernández E, Rosa JAPdl: Phylogenetic relationships of Pinus subsection Ponderosae inferred from rapidly evolving cpDNA regions. Syst Bot. 2009, 34: 481-491.
Critchfield WB: Hybridization of the southern pines in California. Southern Forest Tree Improvement Committee Publications. 1963, 22: 40-48.
Saylor LC, Koenig RL: The slash x sand pine hybrid. Silvae Genetica. 1967, 16: 134-138.
Miller CN: Preserved cones of Pinus from the Neogene of Idaho and Oregon. Int J Plant Sci. 1992, 153: 147-154.
McKown AD, Stockey RA, Scheger CE: A new species of Pinus subgenus Pinus subsection Contortae from Pliocene sediments of Ch'ijee's Bluff, Yukon Territory, Canada. Int J Plant Sci. 2002, 163: 687-697.
Liston A, Parker-Defeniks M, Syring JV, Willyard A, Cronn R: Interspecific phylogenetic analysis enhances intraspecific phylogeographical inference: a case study in Pinus lambertiana. Mol Ecol. 2007, 16: 3926-3937.
Alkan C, Coe BP, Eichler EE: Genome structural variation discovery and genotyping. Nature Rev Genet. 2011, 12: 363-376.
Neale DB, Kremer A: Forest tree genomics: growing resources and applications. Nat Rev Genet. 2011, 12: 111-122.
Mardis ER: Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008, 9: 387-402.
Shendure J, Ji H: Next-generation DNA sequencing. Nature Biotech. 2008, 26: 1135-1145.
Schweiger M, Kerick M, Timmermann B, Isau M: The power of NGS technologies to delineate the genome organization in cancer: from mutations to structural variations and epigenetic alterations. Cancer Metastasis Rev. 2011, 30: 199-210.
Metzker ML: Sequencing technologies - the next generation. Nat Rev Genet. 2010, 11: 31-46.
Mardis ER: A decade's perspective on DNA sequencing technology. Nature. 2011, 470: 198-203.
Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ: Target-enrichment strategies for next-generation sequencing. Nat Meth. 2010, 7: 111-118.
Seeb JE, Carvalho G, Hauser L, Naish K, Roberts S, Seeb LW: Single-nucleotide polymorphism (SNP) discovery and applications of SNP genotyping in nonmodel organisms. Mol Ecol Resour. 2011, 11: 1-8.
Etter PD, Preston JL, Bassham S, Cresko WA, Johnson EA: Local de novo assembly of RAD paired-end contigs using short sequencing reads. PLoS One. 2011, 6: e18561-
Cronn R, Knaus B, Liston A, Maughan J, Parks M, Syring J, Udall J: Targeted Enrichment Strategies for Next-Generation Plant Biology. Am J Bot. 2012, 99: 291-311.
We would like to acknowledge the contributions of our collaborators on the Gymnosperm Tree of Life project, in particular Sarah Mathews for project leadership and organizing. We would also like to thank Zachary Foster and Brian Knaus for work on essential in-house scripts and computing assistance, Chris Sullivan and Mark Dasenko of the OSU CGRB for computing and sequencing assistance, and Vadim Goremykin and Surendra Kumar for providing assistance in adapting sorter.pl and AIR-Identifier to our data matrices.
This work was supported by the National Science Foundation grant number NSF ATOL-0629508.
Richard Cronn and Matthew Parks were invited speakers at Illumina User Group Meetings in 2009 and 2010, respectively, and received airfare, accommodation and meals for these events. The authors declare that they have no other competing interests, including funding, stocks/shares, patents or non-financial interests.
All authors conceived and developed the research plan. RC designed the hybridization methods, and MP and RC carried out template preparation and hybridization reactions. MP carried out post-sequencing assembly and analyses with contributions from AL. MP drafted the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 3: Average per site rate variability of plastome regions for full plastome alignment using tree-dependent methodology. Average per site category ranking for protein-coding exons, introns, rRNA and tRNA genes, and noncoding regions for full plastome alignment of 113 Pinus and Pinaceae species using tree-dependent methodology. Standard deviations are given in parentheses. Mean values with different superscript letters are significantly different at α < 0.05 in Tukey’s HSD test, following one-way ANOVA supporting different means at p < 0.0001. (DOC 29 KB)
Additional file 4: Phylogenetic relationships within genus Pinus as determined from full plastome alignment. a) Cladogram based on ML topology, showing support values below branches as ML bootstrap support / Bayesian posterior probability / parsimony bootstrap support. Support values are shown only for nodes with less than 100% bootstrap support and/or posterior probabilities less than 1.0; single values indicate either ML bootstrap support or Bayesian posterior probability. * indicates branch not supported in Bayesian or parsimony analysis. b) ML Phylogram with branch lengths determined from ML analysis; scale corresponds to probability of change per positions. (PDF 505 KB)
Additional file 5: Trends in bootstrap support values and topologies for likelihood analyses of alignment partitions (tree-dependent). For tree-dependent site variability analyses, the following are shown: a) Distributions of bootstrap support values for all nodes. Circles represent median bootstrap support for each An partition size. b) Distribution of branch score metric (triangles) and partition metric (circles) values for tests of topological congruence between An and corresponding Bn data partitions. Partition metric values shown are 0.1× actual value in order to fit on same scale with branch score values. (PDF 710 KB)
Additional file 6: Distribution of bootstrap support for phylogenetic position of three clades in genus Pinus (tree-dependent). Results for tree-dependent site variability analyses are shown for: a) subsection Krempfianae , b) Pinus merkusii / P. latteri and c) Subsection Contortae . In a), P. krempfii was found sister to subsection Gerardianae for all An partitions. In b), circles correspond to placement of P. merkusii /P. latteri as sister to subsection Pinaster and triangles as sister to subsection Pinus. In c), circles correspond to placement of subsection Contortae as sister to subsection Australes and triangles as basal to both subsections Australes and Contortae; squares represent variable phylogenetic placements not including those represented by circles or triangles. For b) and c), monophyly of P. merkusii/P. latteri and subsection Contortae was supported at 100% bootstrap support for all An partitions. (PDF 491 KB)
Additional file 7: Trends in corrected vs. uncorrected pairwise distances. All pairwise distance values calculated as uncorrected pairwise distance and with a Jukes-Cantor correction, plotted for: a) all accessions used in study, b) genus Pinus accessions only, c) subgenus Pinus accessions only, d) subgenus Strobus accessions only, e) subsection Quinquefoliae and Pinus monophylla only, f) subsection Pinus and Pinus ponderosa only, and g) subsection Trifoliae and Pinus thunbergii only. (PDF 3 MB)
About this article
Cite this article
Parks, M., Cronn, R. & Liston, A. Separating the wheat from the chaff: mitigating the effects of noise in a plastome phylogenomic data set from Pinus L. (Pinaceae). BMC Evol Biol 12, 100 (2012). https://doi.org/10.1186/1471-2148-12-100
- Phylogenetic noise