The birds
The ruby-throated hummingbird (Archilochus colubris) tissue was provided by the Louisiana State University Museum of Natural Science and is sample LSUMNZ B-26279. The Australian little grebe (Tachybaptus novaehollandiae) is from the Australian Museum Sydney (sample EBU 9986). The New Zealand rail (takahe: Porphyrio hochstetteri) was provided by the Department of Conservation via Massey University Veterinary Pathology, and the New Caledonian kagu (Rhynochetos jubatus) sample was a gift from Christophe Lambert, New Caledonia. The common swift (Apus apus AM237310 EMBL) was provided by Stefan Gabrielsson (Katastrofhjälp fåglar och vilt, Kristianstad/Bromölla) and the greater flamingo (Phoenicopterus ruber roseus) came from the Auckland Zoological Park.
Genomic DNA was extracted from the hummingbird, kagu, rail, flamingo and grebe tissue at the AWC using 25–50 mg of liver and the High Pure™ PCR Template Preparation Kit (Protocol Vb; Boehringer Mannheim) according to the manufacturer's instructions. To minimize the possibility of obtaining nuclear copies of mitochondrial genes (numts), mitochondrial genomes were first amplified in 2–3 long overlapping fragments (3.5–12 kb in length) using the Expand™ Long Template PCR System (Roche). The products were excised from agarose gel using Eppendorf gel extraction columns. Long-range PCR products were then used as templates for multiple rounds of short-range PCR of overlapping fragments 0.5–3 kb in length. Primers were selected from a database maintained in our laboratory and described by Slack et al. [43]. Sequencing was performed using BigDye® Terminator Cycle Sequencing reagents according to the manufacturer's instructions (Applied Biosystems), and the nucleotide sequences were read on an ABI 3730 automated sequencer (Applied Biosystems). For each genome, overlapping sequence fragments were assembled and checked for ambiguity using Sequencher™ 4.2.2 (Gene Codes Corp.).
Where necessary, PCR products were cloned using standard techniques to resolve length heteroplasmy in control regions arising from microsatellite repeats [1]. At least three clones were sequenced for each region to guard against PCR errors. In all cases, overlaps between sequences were sufficient to ensure synonymy (usually ≥ 100 bp between sequences from short-range PCR, and a total of 1–4 kb between the different long-range products). Sequence identity was confirmed through BLAST searches of the NCBI database [44], confirmation of amino acid translation in coding regions and alignment with other species.
In addition to the six new bird mitochondrial genomes reported in this paper, 35 other complete avian mt genomes were included in the analyses: 29 Neoaves and six Galloanserae. The Galloanserae taxa are: chicken (Gallus gallus; GenBank accession number AP003317), Japanese quail (Coturnix japonica; AP003195), magpie goose (Anseranas semipalmata; AY309455), redhead duck (Aythya americana; AF090337), greater white-fronted goose (Anser albifrons; AF363031) and Australian brush-turkey (Alectura lathami; AY346091). The 29 Neoaves taxa are: rifleman (NZ wren, Acanthisitta chloris; AY325307), gray-headed broadbill (Smithornis sharpei; AF090340), fuscous flycatcher (Cnemotriccus fuscatus; AY596278), superb lyrebird (Menura novaehollandiae; AY542313), village indigobird (Vidua chalybeata; AF090341), rook (Corvus frugilegus; Y18522), ivory-billed aracari (Pteroglossus azara; DQ780882), woodpecker (Dryocopus pileatus; DQ780879), peregrine falcon (Falco peregrinus; AF090338), forest falcon (Micrastur gilvicollis; DQ780881), American kestrel (Falco sparverius; DQ780880), Eurasian buzzard (Buteo buteo; AF380305), osprey (Pandion haliaetus; DQ780884), Blyth's hawk eagle (Spizaetus alboniger; AP008239), turkey vulture (Cathartes aura; AY463690), blackish oystercatcher (Haematopus ater; AY074886), ruddy turnstone (Arenaria interpres; AY074885), southern black-backed gull (Larus dominicanus; AY293619), Oriental stork (Ciconia boyciana; AB026193), red-throated loon (Gavia stellata; AY293618), little blue penguin (Eudyptula minor; AF362763), black-browed albatross (Diomedea melanophris; AY158677), Kerguelen petrel (Pterodroma brevirostris; AY158678), white-faced heron (Ardea novaehollandiae; DQ780878), rockhopper penguin (Eudyptes chrysocome; NC_008138), great crested grebe (Podiceps cristatus; NC_008140), frigatebird (Fregata sp.; AP009192), Australian pelican (Pelecanus conspicillatus; DQ780883) and red-tailed tropicbird (Phaethon rubricauda; AP009043).
Paleognath taxa were not included because the paleognath/neognath division has been well established for mitochondrial genomes [1, 2]. Thus we rooted our Neoaves trees with the six Galloanserae sequences.
Phylogenetic Analysis
Nucleotide sequences for each gene were aligned separately in Se-Al v2 [45]. Protein-coding genes were aligned using translated amino acid sequences and RNA genes were aligned based on secondary structure. The resulting dataset has 12 protein-coding genes, two rRNA genes and 21 tRNAs (lacking tRNA-Phe because sequence data are missing in some taxa). Gaps, ambiguous sites adjacent to gaps, the NADH6 gene (light-strand encoded) and stop codons (often incomplete in the DNA sequence) were excluded from the alignment. The full analysed mtDNA dataset was 13,229 bp in length.
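As a rough illustration of the gap-exclusion step only, the following Python sketch drops every alignment column in which any taxon has a gap. The function name strip_gap_columns and the toy alignment are hypothetical and are not part of the original workflow; the exclusion of ambiguous sites adjacent to gaps is a matter of judgment and is not reproduced here.

```python
def strip_gap_columns(alignment, gap_char="-"):
    """Return a copy of the alignment without any gap-containing column.

    `alignment` maps taxon name -> aligned sequence; all sequences must
    have the same length.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only the column indices at which no taxon has a gap.
    keep = [i for i in range(length)
            if not any(s[i] == gap_char for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Toy alignment (hypothetical data): columns 3 and 4 contain gaps.
toy = {"kagu": "ACG-TA", "takahe": "AC-GTA", "swift": "ACGGTC"}
stripped = strip_gap_columns(toy)
```

In this toy case the two gapped columns are removed and the four remaining columns are kept for every taxon.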
In previous work [46–48] we found that RY-coding of the most variable partitions of the nucleotide data (especially the 3rd codon position) was advantageous. This recoding increases the proportion of changes on internal branches of the tree (that is, 'treeness'), reduces effective differences in nucleotide composition (relative compositional variability; RCV), and was shown to increase concordance between mitochondrial and nuclear datasets. RY-coding does improve the ML scores, but because RY-coding is not strictly nested within nucleotide-coding (M.A. Steel, pers. comm.) it is not valid to compare their respective ML scores directly. However, because of the better fit of the data to the model (higher treeness and lower RCV), this has been our preferred method of analysis of vertebrate mitochondrial data. Thus the trees reported here have the third codon positions of the 12 protein-coding genes recoded as R (instead of A and G) and Y (instead of C and T). The full data set is available [49]. Analysis used standard programs including ModelTest [50], PAUP*4.0b10 [51], MrBayes 3.1.2 [52], and consensus networks [39]. We ran 1000 unconstrained ML bootstrap replicates with PAUP*4.0b10 on the Helix computing cluster [53], plus a Bayesian analysis using chains of 10⁷ generations. For some runs, we constrained the seven 'Metaves' taxa to be monophyletic (see Figure 1) and used a Shimodaira-Hasegawa (SH) test [54] implemented in PAUP* to compare this ML tree with the unconstrained ML tree (RELL, one-tailed test, 1000 bootstrap replicates).
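The recoding of third codon positions can be sketched in a few lines of Python. This is a minimal illustration that assumes an in-frame coding sequence starting at codon position 1; the function name ry_code_third_positions and the example sequence are hypothetical, and characters other than A, C, G and T (gaps, ambiguity codes) are passed through unchanged.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_MAP = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def ry_code_third_positions(seq):
    """RY-code every third base of an in-frame coding sequence."""
    recoded = []
    for i, base in enumerate(seq.upper()):
        if i % 3 == 2:
            # Third codon position: recode, leaving unknown symbols as-is.
            recoded.append(RY_MAP.get(base, base))
        else:
            recoded.append(base)
    return "".join(recoded)


# Hypothetical three-codon example: ATG CCA GTT
example = ry_code_third_positions("ATGCCAGTT")
```

Only transversions (R to Y or Y to R) remain visible at the recoded positions, which is the point of the recoding; first and second positions keep their four-state coding.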
Site-stripping
The most serious problem for reconstructing deep-level phylogeny from mitochondrial sequences is substitution saturation [55, 56]. Aside from the direct effect of superimposed substitutions eroding phylogenetic signal, 'non-historical' biases (such as that derived from compositional non-stationarity) accumulate more rapidly at faster evolving sites. In a number of recent studies [56, 57] we have attempted to reduce these problems by identifying partitions among which the sites have (on average) high signal erosion and then either RY-coding them (using only information from the slower transversions) or excluding that partition altogether. This earlier approach may not be optimal, as some phylogenetically useful sites are excluded simply because they group under some prior definition (e.g. codon position) with many fast evolving sites. So here we test a noise reduction technique in which the information retained from the sequence is determined on a site-by-site basis. 'Noise reduction', in general, is a standard technique in many areas of science [58].
In an earlier noise reduction technique using site-stripping [38], sites were excluded from analysis if changes occurred at these sites within a few predefined closely-related taxa. In the present study, as a proxy for their utility, sites are scored as the average of their consistency and retention indices (CI and RI, respectively). The CI and RI are calculated on a consensus tree that retains only those relationships in the primary data matrix that are uncontroversial with respect to prior studies and that also receive a Bayesian posterior probability of 1.00 in the unstripped analyses. Any groupings that do not conform to these requirements are collapsed so as to avoid the circularity of increased support resulting from the exclusion of sites that might have been influenced by conflict with that grouping. The consensus tree we used was: ((((quail, chicken), brush turkey), ((goose, duck), magpie goose)),((((rook, indigobird), lyrebird), (broadbill, flycatcher), rifleman),((gull, turnstone), oystercatcher), flamingo,(great crested grebe, grebe), takahe, hummingbird, kagu, heron, (petrel, albatross), (little blue penguin, rockhopper penguin), stork, turkey vulture, tropicbird, ((falcon, kestrel), forest falcon), ((hawk eagle, buzzard), osprey), swift, pelican, frigatebird, (aracari, woodpecker), loon)).
Variants of the primary dataset were RY-coded and site-stripped at progressively higher threshold levels of site utility, (CI+RI)/2 = 0.08, 0.12, 0.16, 0.20, 0.24, using the Perl program site_strip_search.pl [49]. For each iteration, individual sites whose utility scores fall below the specified threshold level are RY-coded. When the resulting site utility score remains below the specified level, the site is excluded altogether. Bayesian inference analyses were carried out on each of these 'noise reduced' data matrices.
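The per-site decision described above can be sketched as follows. This is not the site_strip_search.pl program; it is a minimal Python illustration that assumes the parsimony step counts for each site on the consensus tree have already been obtained elsewhere. It uses the standard definitions CI = m/s and RI = (g − s)/(g − m), where m is the minimum possible number of steps, s the number of steps on the reference tree and g the maximum possible number of steps. The function names, the example step counts, and the handling of sites for which an index is undefined are all assumptions made for illustration.

```python
def site_utility(min_steps, obs_steps, max_steps):
    """Return (CI + RI) / 2 for one site, or None if undefined.

    CI = m / s and RI = (g - s) / (g - m). Constant sites (s == 0) and
    parsimony-uninformative sites (g == m) have no defined score here;
    returning None for them is an assumption of this sketch.
    """
    if obs_steps == 0 or max_steps == min_steps:
        return None
    ci = min_steps / obs_steps
    ri = (max_steps - obs_steps) / (max_steps - min_steps)
    return (ci + ri) / 2


def classify_site(nuc_steps, ry_steps, threshold):
    """Return 'keep', 'ry' or 'exclude' for one site.

    nuc_steps and ry_steps are (min, observed, max) step counts for the
    nucleotide-coded and RY-coded versions of the site. Sites with an
    undefined score are left in place (an assumption of this sketch).
    """
    nuc_score = site_utility(*nuc_steps)
    if nuc_score is None or nuc_score >= threshold:
        return "keep"      # utility is not below the threshold
    ry_score = site_utility(*ry_steps)
    if ry_score is None or ry_score >= threshold:
        return "ry"        # RY-coding rescues the site
    return "exclude"       # still below threshold after RY-coding


# Hypothetical step counts, evaluated at the 0.12 threshold.
decision_a = classify_site((2, 20, 22), (1, 4, 10), 0.12)
decision_b = classify_site((2, 20, 22), (1, 9, 10), 0.12)
decision_c = classify_site((1, 2, 5), (1, 1, 3), 0.12)
```

Running classify_site over all sites at each of the five thresholds would yield one recoded and stripped matrix per threshold, which is the input the Bayesian analyses require.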
The seventh intron of the β-fibrinogen gene
Fain and Houde's [3] dataset of the seventh intron of the β-fibrinogen gene (FGB-int7) was reduced to 35 sequences, corresponding to the 35 taxa common with our mitochondrial dataset [see Additional file 1]. The 35 taxa include pairs that represent equivalent branching patterns (the same position in a cladogram relative to the other taxa in each dataset) although the species are not always identical or, in some cases, even sister taxa. The alignment of the taxon-reduced FGB-int7 dataset was checked visually in Se-Al v2, with the Metaves group at the top, and the dataset was exported as a Mega file. In this format, the positions of the sites that potentially contribute to phylogenetic signal in the dataset could be examined and compared. To evaluate the utility of the intron positions, phylogenetic analyses were conducted by equally weighted maximum parsimony (MP), with indel characters treated as missing data, using PAUP* 4.0b10 [51].
We tested whether tree reconstruction from the 35-taxa FGB-int7 data is stable with respect to the internal reference tree generated during the Clustal X alignment. A 500 nucleotide dataset was simulated in Seqgen 1.3 [59] on the mitochondrial tree, under the ML-GTR+I+Γ optimisation for the original mt data. Only the 35 taxa common to both the FGB-int7 and mitochondrial datasets were included in the simulated tree. Insertion of the simulated data ahead of the FGB-int7 sequences should drive the alignment to conform more closely to the original mt tree. We then performed phylogenetic analyses using the combined alignment (simulated sequences plus FGB-int7 sequences) and using only the realigned FGB-int7 sequences.
In addition, we reversed the sequences of the FGB-int7 dataset and re-aligned them using Clustal X, as recommended by Landan and Graur [30]. This reversed dataset was also used for phylogenetic analyses. Landan and Graur [30] have shown that reversing the direction of the sequences before an alignment is made can result in quite different trees from the unreversed alignment if the true alignment is ambiguous. The change in result arises because many programs, when faced with tied values, may always take the first alternative (i.e., they do not break ties randomly). Conversely, an alignment that is robust to reversing the sequences supports the original alignment. Finally, using the primers FIB-BI7U and FIB-BI7L [60] we amplified the FGB-int7 from two birds: one Metaves (kagu) and one Coronaves (a New Zealand rail, weka, Gallirallus australis). The products were cloned and a total of 26 clones were sequenced to detect possible paralogous copies in the same genome. Thus we had a range of approaches to test the robustness of the β-fibrinogen seventh intron data.
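The sequence-reversal step is simple to express in code. The sketch below, in Python, removes any existing gap characters and reverses each sequence (a plain reversal, not a reverse complement) so that the result can be passed to an alignment program; a second helper reverses the rows of the resulting alignment so that its columns can be compared with those of the original alignment. The function names and the toy records are hypothetical, and the alignment itself (Clustal X in this study) is run externally and is not shown.

```python
def reverse_sequences(records, gap_char="-"):
    """Strip gaps and reverse each sequence prior to re-alignment."""
    return {name: seq.replace(gap_char, "")[::-1]
            for name, seq in records.items()}


def unreverse_alignment(aligned):
    """Reverse each aligned row so columns run in the original direction."""
    return {name: row[::-1] for name, row in aligned.items()}


# Toy records (hypothetical data).
records = {"kagu": "AC-GT", "weka": "ACCGT"}
reversed_records = reverse_sequences(records)

# After external alignment of the reversed sequences, the rows would be
# turned back round for comparison with the original alignment.
restored = unreverse_alignment({"kagu": "TG-CA", "weka": "TGCCA"})
```

If the restored alignment has the same columns as the original, the alignment is robust to the direction of the sequences in the sense used above.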