Strains and culture conditions
The 24 green algal strains that were selected for chloroplast genome sequencing are listed in Table 1. All of these strains, except Carteria sp. SAG 8–5, were obtained from the culture collections of algae at the University of Goettingen (SAG), the University of Texas at Austin (UTEX), the National Institute for Environmental Studies in Tsukuba (NIES), and Charles University in Prague (CAUP). Carteria sp. SAG 8–5 (=UTEX 2) was a gift from Dr. Mark Buchheim (University of Tulsa). All strains were grown in C medium [45] at 18 °C under alternating 12 h-light/12 h-dark periods.
Genome sequencing, assembly and annotation
As indicated in Table 1, 15 of the genomes analyzed were sequenced using the Roche 454 method and the remaining nine using the Illumina method. For 454 sequencing, shotgun libraries (700-bp fragments) of A + T-rich DNA fractions obtained as described previously [46] were constructed using the GS-FLX Titanium Rapid Library Preparation Kit of Roche 454 Life Sciences (Branford, CT, USA). Library construction and 454 GS-FLX DNA Titanium pyrosequencing were carried out by the “Plateforme d’Analyses Génomiques de l’Université Laval” [47]. Reads were assembled using Newbler v2.5 [48] with default parameters, and contigs were visualized, linked and edited using the CONSED 22 package [49]. Contigs of chloroplast origin were identified by BLAST searches against a local database of organelle genomes. Regions spanning gaps in the cpDNA assemblies were amplified by polymerase chain reaction (PCR) with primers specific to the flanking sequences. Purified PCR products were sequenced using Sanger chemistry with the PRISM BigDye Terminator Ready Reaction Cycle Sequencing Kit (Applied Biosystems, Foster City, CA, USA).
For Illumina sequencing, total cellular DNA was isolated using the EZNA HP Plant Mini Kit of Omega Bio-Tek (Norcross, GA, USA). Libraries of 700-bp fragments were constructed using the TruSeq DNA Sample Prep Kit (Illumina, San Diego, CA, USA) and paired-end reads were generated on the Illumina HiSeq 2000 (100-bp reads) or the MiSeq (300-bp reads) sequencing platforms by the Innovation Centre of McGill University and Genome Quebec [50] and the “Plateforme d’Analyses Génomiques de l’Université Laval” [47], respectively. Reads were assembled using Ray 2.3.1 [51] and contigs were visualized, linked and edited using the CONSED 22 package [49]. Identification of cpDNA contigs and gap filling were performed as described above for 454 sequence assemblies.
Genes were identified on the final assemblies using a custom-built suite of bioinformatics tools [52]. Genes coding for rRNAs and tRNAs were localized using RNAmmer [53] and tRNAscan-SE [54], respectively. Intron boundaries were determined by modeling intron secondary structures [55, 56] and by comparing intron-containing genes with intronless homologs.
Phylogenomic analyses of the amino acid data set
The chloroplast genomes of 73 core chlorophyte taxa were used to generate the analyzed amino acid and nucleotide data sets. The GenBank accession numbers of the genomes sequenced in this study are presented in Table 1; those of the remaining genomes are as follows: Pedinomonas minor UTEX LB 1350, [GenBank:NC_016733]; Pedinomonas tuberculata SAG 42.84, [GenBank:KM462867]; Marsupiomonas sp. NIES 1824, [GenBank:KM462870]; Pseudochloris wilhelmii SAG 1.80, [GenBank:KM462886]; Chlorella variabilis NC64A, [GenBank:NC_015359]; Chlorella vulgaris C-27, [GenBank:NC_001865]; Dicloster acuatus SAG 41.98, [GenBank:KM462885]; Marvania geminata SAG 12.88, [GenBank:KM462888]; Parachlorella kessleri SAG 211-11 g, [GenBank:NC_012978]; Botryococcus braunii SAG 807–1, [GenBank:KM462884]; Choricystis minor SAG 17.98, [GenBank:KM462878]; Coccomyxa subellipsoidea NIES 2166, [GenBank:NC_015084]; Elliptochloris bilobata CAUP H7103, [GenBank:KM462887]; Paradoxia multiseta SAG 18.84, [GenBank:KM462879]; Trebouxiophyceae sp. MX-AZ01, [GenBank:NC_018569]; Geminella minor SAG 22.88, [GenBank:KM462883]; Geminella terricola SAG 20.91, [GenBank:KM462881]; Gloeotilopsis sterilis UTEX 1704, [GenBank:KM462877]; Fusochloris perforata SAG 28.85, [GenBank:KM462882]; Microthamnion kuetzingianum UTEX 318, [GenBank:KM462876]; Oocystis solitaria SAG 83.80, [GenBank:FJ968739]; Planctonema lauterbornii SAG 68.94, [GenBank:KM462880]; “Chlorella” mirabilis SAG 38.88, [GenBank:KM462865]; Koliella longiseta UTEX 339, [GenBank:KM462868]; Pabia signiensis SAG 7.90, [GenBank:KM462866]; Stichococcus bacillaris UTEX 176, [GenBank:KM462864]; Prasiolopsis sp. SAG 84.81, [GenBank:KM462862]; Myrmecia israelensis UTEX 1181, [GenBank:KM462861]; Trebouxia aggregata SAG 219-1D, [GenBank:EU123962-EU124002]; Dictyochloropsis reticulata SAG 2150, [GenBank:KM462860]; Watanabea reniformis SAG 211-9b, [GenBank:KM462863]; Pleurastrosarcina brevispinosa UTEX 1176, [GenBank:KM462875]; “Koliella” corcontica SAG 24.84, [GenBank:KM462874]; Leptosira terrestris UTEX 333, [GenBank:NC_009681]; Lobosphaera incisa SAG 2007, [GenBank:KM462871]; Neocystis brevis CAUP D802, [GenBank:KM462873]; Parietochloris pseudoalveolaris UTEX 975, [GenBank:KM462869]; Xylochloris irregularis CAUP H7801, [GenBank:KM462872]; Oltmannsiellopsis viridis NIES 360, [GenBank:NC_008099]; Pseudendoclonium akinetum UTEX 1912, [GenBank:NC_008114]; Oedogonium cardiacum SAG 575-1b, [GenBank:NC_011031]; Floydiella terrestris UTEX 1709, [GenBank:NC_014346]; Stigeoclonium helveticum UTEX 441, [GenBank:NC_008372]; Schizomeris leibleinii UTEX LB 1228, [GenBank:NC_015645]; Scenedesmus obliquus UTEX 393, [GenBank:NC_008101]; Chlamydomonas moewusii UTEX 97, [GenBank:EF587443-EF587503]; Dunaliella salina CCAP 19/18, [GenBank:NC_016732]; Volvox carteri f. nagariensis UTEX 2908, [GenBank:GU084820]; and Chlamydomonas reinhardtii, [GenBank:NC_005353].
A total of 69 protein-coding genes were used to construct the amino acid data set (PCG-AA): atpA, B, E, F, H, I, ccsA, cemA, chlB, L, N, clpP, ftsH, infA, petA, B, D, G, L, psaA, B, C, J, M, psbA, B, C, D, E, F, H, I, J, K, L, M, N, T, Z, rbcL, rpl2, 5, 12, 14, 16, 20, 23, 32, 36, rpoA, B, C1, C2, rps2, 3, 4, 7, 8, 9, 11, 12, 14, 18, 19, tufA, ycf1, 3, 4, 12. This data set was prepared as follows: the deduced amino acid sequences from the 69 individual genes were aligned using MUSCLE 3.7 [57], the ambiguously aligned regions in each alignment were removed using TRIMAL 1.3 [58] with the options block = 6, gt = 0.7, st = 0.005 and sw = 3, and the protein alignments were concatenated using Phyutility 2.2.6 [59].
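The concatenation step can be illustrated with a minimal Python sketch. It is not the Phyutility implementation; the function name `concatenate`, the toy sequences and the choice to pad a taxon lacking a gene with gap characters are assumptions made for illustration only.

```python
def concatenate(alignments, taxa):
    """Join per-gene alignments into one supermatrix.

    alignments: ordered list of (gene_name, {taxon: aligned_sequence}).
    taxa: list of all taxon names expected in the supermatrix.
    Returns (supermatrix, partitions), where partitions holds
    (gene_name, start, end) with 1-based inclusive coordinates.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: aligned sequences differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon lacking this gene is padded with gaps.
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Toy data (hypothetical sequences, not taken from the study).
toy = [
    ("atpA", {"taxonA": "MK-L", "taxonB": "MKVL"}),
    ("rbcL", {"taxonA": "MSP"}),
]
matrix, parts = concatenate(toy, ["taxonA", "taxonB"])
```

The partition coordinates returned alongside the supermatrix are what a gene-partitioned analysis needs in order to apply a model to each gene separately.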
Phylogenies were inferred from the PCG-AA data set using the ML and Bayesian methods. ML analyses were carried out using RAxML 8.1.14 [60] and the GTR + Γ4 model of sequence evolution; in these analyses, the data set was partitioned by gene, with the model applied to each partition. Confidence of branch points was estimated by fast-bootstrap analysis (f = a) with 500 replicates. Bayesian analyses were performed with PhyloBayes 3.3f [61] using the site-heterogeneous CATGTR + Γ4 model [62]. Five independent chains were run for 2,000 cycles and consensus topologies were calculated from the saved trees using the BPCOMP program of PhyloBayes after a burn-in of 500 cycles. Under these conditions, the largest discrepancy observed across all bipartitions in the consensus topologies (maxdiff) was lower than 0.15, indicating that convergence between the chains was achieved.
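The maxdiff convergence statistic can be illustrated with a short Python sketch: for every bipartition, take the difference between its highest and lowest posterior frequency among the chains, then report the largest such difference. This is not the BPCOMP code; the function names and the representation of each sampled tree as a set of bipartitions (each a `frozenset` of taxon names, assumed to be normalized to the same side of the split by the caller) are assumptions.

```python
def bipartition_frequencies(trees):
    """Fraction of sampled trees that contain each bipartition."""
    counts = {}
    for tree in trees:
        for bip in tree:
            counts[bip] = counts.get(bip, 0) + 1
    n = len(trees)
    return {bip: c / n for bip, c in counts.items()}


def maxdiff(chains, burnin=0):
    """Largest between-chain discrepancy in bipartition frequency.

    chains: list of chains, each a list of sampled trees.
    burnin: number of initial samples discarded from every chain.
    """
    freqs = [bipartition_frequencies(chain[burnin:]) for chain in chains]
    all_bips = set().union(*freqs)
    return max(
        (
            max(f.get(b, 0.0) for f in freqs) - min(f.get(b, 0.0) for f in freqs)
            for b in all_bips
        ),
        default=0.0,
    )


# Toy example with two chains of four samples each (hypothetical data).
ab = frozenset({"A", "B"})
ac = frozenset({"A", "C"})
chain1 = [{ab}, {ab}, {ab}, {ab}]
chain2 = [{ab}, {ab}, {ac}, {ac}]
value = maxdiff([chain1, chain2])
```

A value near 0 means the chains agree on every bipartition; a value near 1 means at least one bipartition is strongly supported in one chain and absent from another.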
Phylogenomic analyses of nucleotide data sets
Two DNA data sets were constructed: PCG123degenRNA (all degenerated codon positions of the 69 protein-coding genes plus three rRNA genes and 26 tRNA genes) and PCG12RNA (first and second codon positions of the 69 protein-coding genes plus three rRNA genes and 26 tRNA genes). The PCG123degenRNA data set was prepared as follows. The multiple sequence alignment of each protein was converted into a codon alignment, the poorly aligned and divergent regions in each codon alignment were excluded using Gblocks 0.91b [63] with the -t = c, -b3 = 5, -b4 = 5 and -b5 = half options, and the individual gene alignments were concatenated using Phyutility 2.2.6 [59]. The Degen1.pl 1.2 script of Regier et al. [36] was applied to the resulting concatenated alignment (PCG123) and finally, the degenerated matrix was combined with the concatenated alignment of the following RNA genes: rrf, rrl, rrs, trnA (ugc), C (gca), D (guc), E (uuc), F (gaa), G (gcc), G (ucc), H (gug), I (cau), I (gau), K (uuu), L (uaa), L (uag), Me (cau), Mf (cau), N (guu), P (ugg), Q (uug), R (acg), R (ucu), S (gcu), S (uga), T (ugu), V (uac), W (cca), Y (gua). The latter genes were aligned using MUSCLE 3.7 [57], the ambiguously aligned regions in each alignment were removed using TRIMAL 1.3 [58] with the options block = 6, gt = 0.9, st = 0.4 and sw = 3, and the individual alignments were concatenated using Phyutility 2.2.6 [59]. To obtain the PCG12RNA data set, the third codon positions of the PCG123 alignment were excluded using Mesquite 3.03 [64] and the resulting alignment was merged with the filtered RNA gene alignment.
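Excluding third codon positions, the operation that yields the protein-coding part of PCG12RNA, can be sketched in a few lines of Python. The exclusion was carried out in Mesquite; the function below is only an illustration, and its name and the toy sequences are assumptions.

```python
def drop_third_positions(codon_alignment):
    """Keep the first and second position of every codon.

    codon_alignment: {taxon: sequence}, where each sequence is in frame
    and its length is a multiple of three (gaps included).
    """
    result = {}
    for taxon, seq in codon_alignment.items():
        if len(seq) % 3 != 0:
            raise ValueError(f"{taxon}: length is not a multiple of 3")
        # Slice two characters out of every three-character codon.
        result[taxon] = "".join(seq[i:i + 2] for i in range(0, len(seq), 3))
    return result


# Toy codon alignment (hypothetical sequences).
toy = {"taxonA": "ATGGCC---", "taxonB": "ATGGCTAAA"}
pcg12 = drop_third_positions(toy)
```

The resulting alignment is two thirds the length of the codon alignment, so any partition coordinates must be recomputed after this step.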
ML analyses of the PCG12RNA and PCG123degenRNA nucleotide data sets were carried out using RAxML 8.1.14 [60] and the GTR + Γ4 model of sequence evolution. In these analyses, the data sets were partitioned into 71 groups, with the model applied to each partition. The partitions included the 69 individual protein-coding genes, the concatenated rRNA genes and the concatenated tRNA genes. Confidence of branch points was estimated by fast-bootstrap analysis (f = a) with 500 replicates.
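RAxML reads partition definitions from a plain-text file (passed with its -q option) in which each line has the form `DNA, name = start-end`. A minimal Python sketch that writes such lines from an ordered list of partition lengths is shown here; the function name, the partition names and the lengths are hypothetical and do not reflect the actual alignment.

```python
def raxml_partition_lines(partition_lengths, datatype="DNA"):
    """Build RAxML partition-file lines from ordered (name, length) pairs.

    Coordinates are 1-based and inclusive, and partitions are assumed
    to be contiguous and in the same order as in the concatenated alignment.
    """
    lines = []
    start = 1
    for name, length in partition_lengths:
        end = start + length - 1
        lines.append(f"{datatype}, {name} = {start}-{end}")
        start = end + 1
    return lines


# Hypothetical lengths: two protein-coding genes, then the rRNA and tRNA blocks.
toy_lengths = [("atpA", 1000), ("atpB", 900), ("rRNA", 4500), ("tRNA", 1900)]
lines = raxml_partition_lines(toy_lengths)
```

For the analyses described above, the list would hold 71 entries: one per protein-coding gene, one for the concatenated rRNA genes and one for the concatenated tRNA genes.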
Nucleotide substitution saturation for each of the three codon positions of concatenated chlorophycean protein-coding genes was assessed using the test of Xia et al. [34] implemented in DAMBE [35]. This program was also employed to calculate AT-skew and GC-skew of chlorophycean sequences within the PCG12RNA and PCG123RNA data sets as a measure of nucleotide compositional differences.
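The skew measures follow the usual definitions, AT-skew = (A − T)/(A + T) and GC-skew = (G − C)/(G + C). The Python sketch below computes both for a single sequence. It does not reproduce DAMBE's handling of gaps or ambiguity codes; ignoring every character other than A, C, G and T is an assumption, as are the function name and the toy sequence.

```python
def nucleotide_skews(sequence):
    """Return (AT-skew, GC-skew) for one nucleotide sequence.

    Only unambiguous A, C, G and T are counted; gaps and ambiguity
    codes are ignored (an assumption of this sketch). A skew is
    returned as None when its denominator is zero.
    """
    seq = sequence.upper()
    a, t = seq.count("A"), seq.count("T")
    g, c = seq.count("G"), seq.count("C")
    at_skew = (a - t) / (a + t) if (a + t) else None
    gc_skew = (g - c) / (g + c) if (g + c) else None
    return at_skew, gc_skew


# Toy sequence (hypothetical): 3 A, 1 T, 3 G, 1 C.
at_skew, gc_skew = nucleotide_skews("AAATGGGC")
```

Both values range from −1 to 1; a positive AT-skew indicates more A than T on the analyzed strand, and a positive GC-skew indicates more G than C.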