An invariants-based method for efficient identification of hybrid species from large-scale genomic data

Background Coalescent-based species tree inference has become widely used in the analysis of genome-scale multilocus and SNP datasets when the goal is inference of a species-level phylogeny. However, numerous evolutionary processes are known to violate the assumptions of a coalescence-only model and complicate inference of the species tree. One such process is hybrid speciation, in which a species shares its ancestry with two distinct species. Although many methods have been proposed to detect hybrid speciation, only a few have considered both hybridization and coalescence in a unified framework, and these are generally limited to the setting in which putative hybrid species must be identified in advance. Results Here we propose a method that can examine genome-scale data for a large number of taxa and detect those taxa that may have arisen via hybridization, as well as their potential “parental” taxa. The method is based on a model that considers both coalescence and hybridization together, and uses phylogenetic invariants to construct a test that scales well in terms of computational time for both the number of taxa and the amount of sequence data. We test the method using simulated data for up 20 taxa and 100,000bp, and find that the method accurately identifies both recent and ancient hybrid species in less than 30 s. We apply the method to two empirical datasets, one composed of Sistrurus rattlesnakes for which hybrid speciation is not supported by previous work, and one consisting of several species of Heliconius butterflies for which some evidence of hybrid speciation has been previously found. Conclusions The proposed method is powerful for detecting hybridization for both recent and ancient hybridization events. The computations required can be carried out rapidly for a large number of sequences using genome-scale data, and the method is appropriate for both SNP and multilocus data.

Hybridization is another evolutionary process that can cause variability in gene trees within the containing species tree. It generally refers to the interbreeding of individuals from distinct populations, resulting in the production of a hybrid species that shares genetic information with both parental species. Hybridization between distinct species can occur for many generations with fertile offspring, making it possible for a new species to be formed. If the hybridization does not result in the formation of a new lineage, the process is termed introgression or introgressive hybridization [19][20][21][22][23][24][25][26][27][28]. Despite the earlier belief that hybridization was rare, numerous recent studies have shown that hybrid speciation occurs in both plants and animals [27,[29][30][31][32][33][34][35][36][37][38]. Hybridization has been recognized as an important mechanism for the evolution of new species and recent estimates indicate that approximately 25% of plants and 10% of animals hybridize [26,27,27,28,39]. However, inference of hybridization cannot be based solely on observed gene tree variability since other processes (e.g., incomplete lineage sorting and gene duplication and loss) may contribute to disagreements in single-gene phylogenies [1].
Several models and methods have been developed to detect hybridization. Here we focus on methods specific to gene flow between species (hybridization) and not on methods that are concerned with gene flow within one species (admixture). One group of methods for detecting hybridization involves the identification and removal of hybrids prior to phylogenetic analysis, with the hybrids added to the inferred tree by connecting them to their parental species [40][41][42]. Joly et al. (2009) [43] developed a method and software (JML; [44]) for identifying introgressed sequences by proposing that for some hybridization events the minimum distance between two sequences will be smaller than for incomplete lineage sorting. Another test that was originally developed to test ancient admixture is based on a relative abundance of ABBA or BABA single nucleotide patterns that can be evaluated using Patterson's D-statistic [45][46][47]. However, Eaton and Ree (2013) [48] noted that Patterson's D-statistic does not utilize all the information from incongruent allele patterns in multiple taxa and proposed an extension to the method, which they termed partitioned D-statistic. Meng and Kubatko (2009) [49] proposed a model for detecting hybridization under the coalescent model and used both a maximum likelihood and a Bayesian framework for inference. An extension to that model was later provided by Kubatko (2009) [50] by utilizing gene tree densities for inference. Yu et al. (2014) [51] also proposed a likelihood method that accounts for both reticulate evolutionary events and incomplete lineage sorting by providing methods for computing the likelihood of a phylogenetic network under the coalescent model. This method, as well as some earlier variations of it, is implemented in the software PhyloNet [52].
In this paper we develop a method for detecting and quantifying the extent of hybridization using a coalescentbased model that is fast and accurate. At the heart of our method are special relations called phylogenetic invariants, which are functions (usually polynomials) in the site pattern probabilities that evaluate to zero on any probability distribution that is consistent with the tree topology and associated model. Invariants have been introduced by Cavender and Felsenstein (1987) [53] and Lake (1987) [54] as a means for phylogenetic reconstruction, and have recently been gaining popularity for use in phylogenetic tree inference [16,55,56]. Here we propose using a ratio between two linear invariants in site pattern probabilities to develop statistics that accurately identify hybrid taxa. Because these statistics are functions of site pattern probabilities across multi-locus or SNP data, they can be rapidly computed. In addition, we can derive the mean, variance, and asymptotic distribution of these invariants, enabling development of a hypothesis test for hybridization when the number of sites is large. We begin by giving the theoretical details of our model, and then evaluate the performance of several possible invariants-based statistics for four-taxon trees using simulation. The bestperforming of these statistics, which we call the Hils statistic, is then evaluated for larger trees using simulation, with hybridization events at various "depths" of the tree (i.e., hybridization between tip species and hybridization between ancestral species). Finally, we apply our method to several empirical data sets, including the Sistrurus rattlesnakes and Heliconius butterflies.

A Coalescent-based Model for Hybridization
We consider here the model originally proposed by Meng and Kubatko (2009) in which data arise along a phylogenetic species tree via an evolutionary process that allows for the possibility of both hybridization and incomplete lineage sorting, as modeled by the coalescent process. Hybridization cannot be modeled by a bifurcating phylogenetic tree, thus it is common to represent hybridization on a phylogeny by a horizontal line connecting two lineages of an otherwise-bifurcating phylogeny ( Fig. 1 the leftmost panel). This network represents the evolutionary history of the species as a whole, and depicts a hybrid origin for taxon H. We refer to species H as the hybrid species, and to species P 1 and P 2 as the parental species. The times labeled by τ i are speciation times, and in general we refer to the species network S γ together with its vector of speciation times τ by (S γ , τ ). The data arising along this phylogenetic species network are a collection of site patterns. Letting X Y ∈ {A, C, G, T} denote the nucleotide observed for species Y at a specific location in the DNA Here taxon H is a hybrid of taxa P 1 and P 2 sequence, we define a site pattern X = X O X P 1 X H X P 2 as an assignment of nucleotides to all species. We represent the site pattern probability on the species network (S γ , τ ) for a particular observation ijkl at the tips of the network by Our model defines the probability distribution on the space of all 4 4 = 256 site patterns under a model that allows both ILS and hybridization via a three-stage process. First, the hybrid species is assigned one of its two putative parents, with probability γ of selecting parental species P 2 and probability 1 − γ of selecting parental species P 1 (resulting in trees S 1 and S 2 in Fig. 1 being the "parental species trees", respectively). Next, a gene tree is generated along the parental species tree from step 1 through the standard coalescent process (see, e.g., [2-5, 7, 9, 57, 58]). Finally, a site pattern is generated along the gene tree from step 2 according to one of the standard Markov substitution models (e.g., the GTR+I+ model [59] or one of its sub-models). Combining steps 2 and 3, we see that the probability for site pattern ijkl for a given species tree S i , i ∈ {1, 2}, is given by where (G, t) represents a gene tree with topology G and branch lengths t, p ijkl|(G,t) is the probability of the particular observation ijkl at the tips of gene tree (G, t), and f ((G, t)|(S i , τ )) is the joint density of (G, t) conditional on the species tree (S i , τ ). A full description of the computations required for this model are given in Chifman and Kubatko (2015) [60], and we do not review them here. Finally, we write the site pattern probability on a hybrid species network as For our purposes, it suffices to view the collection of site patterns observed in an empirical data set as a sample of observations from the probability distribution defined by the {p ijkl|(S γ ,τ ) |i, j, k, l ∈ {A, C, G, T}}. We call data generated in this way "coalescent independent sites" and refer to this model as the "coalescent independent sites model".
Let N X be the number of sites with site pattern X observed in a sample of N sites generated from hybrid species network (S γ , τ ) under this coalescent-with-hybridization model.
When N is large, thep X are approximately normally distributed, and thus the sampling distributions of statistics based on thep X can be derived. We next describe how these ideas can be used to build tests for hybridization.

Invariants-based Hypothesis Tests for Hybridization
As mentioned in the Introduction, our tests are based on phylogenetic invariants, which are polynomials in the site patterns that evaluate to zero on one tree topology but do not evaluate to zero for at least one tree of a different topology.
Consider four linear relationships that arise on the hybrid phylogenetic species network (S γ , τ ) as described in the previous section: where i = j ∈ {A, C, G, T}. It can be shown that f 2 and f 4 are zero when evaluated on site pattern probabilities that correspond to the species tree S 1 , while f 1 and f 3 are nonzero (see [60] for details). Similarly, f 1 and f 3 are zero when evaluated on site pattern probabilities that correspond to tree S 2 , while f 2 and f 4 are not. However, when the site pattern probabilities correspond to the species network (S γ , τ ) with γ ∈ (0, 1), none of the four linear relations are zero.
What is special about these functions is that their ratio is a function of γ ∈ (0, 1): Notice that the last equality holds because p ijji|(S 2 ,τ ) − p ijij|(S 2 ,τ ) = p iijj|(S 1 ,τ ) − p ijij|(S 1 ,τ ) , which results from the symmetric roles of P 1 and P 2 leading to p ijji|(S 2 ,τ ) = p iijj|(S 1 ,τ ) and p ijij|(S 2 ,τ ) = p ijij|(S 1 ,τ ) . A full explanation about linear relations under the coalescent model on species trees that satisfy the molecular clock is provided in Chifman and Kubatko (2015), Section 3.1 [60]. Using a similar argument we find that If we consider cumulative site pattern probabilities then the results in Eqs. (4) and (5) still hold. By a cumulative site pattern we mean, for example, p ijji|(S γ ,τ ) = x =y∈{A,C,G,T} p xyyx| (S γ ,τ ) . Under the JC69 model [61], each of the terms in the sum will have the same value, regardless of the choice of x and y; under more complex models, these probabilities will vary depending on the particular x and y. We implement the JC69 version of the test here, though we use simulation to assess the performance under more complicated models.
Using the ratios in Eqs. (4) and (5) we construct formal significance tests of the following hypotheses: Here we consider the ratio f 1 f 2 to illustrate the procedure. First, we estimate this ratio using the site pattern probabilities observed in the sample, To use this estimator as a test statistic in a hypothesis test, we need the distribution of the statistic when the null hypothesis is true. We first consider distributional results for the numerator and denominator separately. Using standard results for the multinomial distribution, we have Now, using the fact that when the sample size N is large we havef 1 , we apply the Geary-Hinkley transformation [62,63] to the ratioˆf 1 The terms in the denominator on the left-hand side of the above equation depend on several unknown quantities, which we estimate by substituting the observed site pattern frequencies into Eqs. (7) - (11). We also multiply the expression in Eq. (12) byf 2 μ f 2 (which converges in probability to 1, and thus does not change the asymptotic distribution) to obtain the test statistic We call the statistic H the Hils statistic, in honor of Professor Matthew H. Hils a . Under the null hypothesis that γ = 0, the term in the numerator of (13) is 0, and the hypothesis test can be carried out by comparing the observed value of the test statistic computed with We note that γ = 1 also implies the absence of hybridization, and thus our hypothesis test should consider this situation as well. In fact, the symmetry in the model in Fig. 1 means that this case is already covered by the test above. To see this, note that when γ = 0,f 1 is close to 0, and the hypothesis test will fail to reject the null hypothesis, as could be expected from inspection of Eq. (13). When γ = 1, thenf 2 is close to 0. It is not obvious from Eq. (13) that the test statistic would be expected to be close to 0 in this case, but if one multiples both the numerator and the denominator of the test statistic in Eq. (13) bŷ , it can be observed that there is an equivalent version of the test statistic withf 2 , rather thanf 1 , in the numerator. Note that our conditionp ijij > max{p iijj ,p ijji } (see below) ensures that bothf 1 andf 2 are positive. Thus the test given above is sufficient to test for hybridization with either γ = 0 or γ = 1.

Extension to Larger Species Networks
The hypothesis test derived in the previous section deals with the case in which four taxa are specified, with one of the four taxa identified as the putative hybrid species. In many settings, however, primary interest is in searching over a large collection of species with the goal of identifying which species might have arisen via a process that involved hybridization at some point in the past. To address this, we consider a large collection of sequences, and suppose that an outgroup sequence can be identified. For each subset of four sequences consisting of three sequences plus the outgroup, we carry out the above test of hybridization for different assignments of the three ingroup sequences to the hybrid and parental taxa. Of the three possible choices for the hybrid taxon, we consider only two of those, eliminating from consideration the one for whichp ijij > max{p iijj ,p ijji }, since this implies that the two parental taxa are more closely related than either is to the putative hybrid. For a data set of n + 1 sequences with one outgroup sequence, this results in n 3 ×2 hypothesis tests. To handle the issue of multiple comparisons, we use the Bonferroni correction, which is conservative in this case because the tests are correlated. Thus, if an overall α-level test is desired, we report significant evidence of hybridization when the p-value computed for a particular comparison is smaller than α ( n 3 )×2 .
The simulation design for each study is described in the "Methods" section and all code used to carry out the simulations and empirical analyses in this paper is available at https://github.com/lkubatko/HilsTest.

Four-taxon simulation studies
Our results for the four-taxon simulation studies establish that the various tests behaved as we have expected ( Fig. 2 and Table 1). First, in all of the cases considered, the power increases as the sample size increases, reaching near 100% when alignments of length 500,000bp were used for many of the simulation conditions (Fig. 2). Second, we note that as the value of γ increases from 0 (no hybridization) to 0.5 (equal contribution from both parental species), the power to detect hybridization increases as well, with near 100% power for the "long" branch length setting when γ ≥ 0.3 for all three of the tests considered. Third, we note that all of the tests are more powerful for data simulated under the "long" branch length setting (Fig. 2e, f, g, and h) than for data generated under the "short" branch length setting (Fig. 2a, b, c, and d). Finally, we note that all tests appear to achieve the nominal 0.05 level when data are simulated under the null hypothesis (γ = 0). The ABBA-BABA test ( Fig. 2d and h) shows power similar to our test based on the ratio f 1 f 2 (Fig. 2a and e). One unexpected result of the simulations designed to address the power was that the test based on f 2 +f 4 . This is most likely due to the variance associated with estimating the various site pattern probabilities that contribute to each invariant. We return to this point in the discussion. Based on this observation, we report results for only the ratio The results of the four-taxon simulation studies designed to estimate γ using the ratio f 1 f 2 also matched our intuition about how the method should perform ( Table 1). As the sample size increases, the estimates become closer to the true values used to generate the data, and the variance decreases as the sample size increases. In general, the estimates obtained from the "long" branch length setting are slightly better than those obtained from data generated under the "short" branch length setting. Overall, the method seems to provide very reasonable estimates of γ .
The results of the second set of simulation studies are shown in Fig. 3. The results are in general consistent with the results of the first simulation study. In particular, the power increases as γ gets closer to 0.5 and as the sample size increases, and both tests are more powerful when the branch lengths are longer. The Hils test is slightly more powerful than the ABBA-BABA test over most of the simulation conditions examined, but from a practical viewpoint, little difference in performance of the two methods would be expected. While both tests show some decrease in power resulting from the violation of the molecular clock, both still perform well, particularly with sufficient data, suggesting that these methods have some degree of robustness to violation of the assumption of a molecular clock.

Simulation studies for larger species networks
For the 9-taxon simulations ( Fig. 4 and Table 2), we note first that for data generated under the coalescent independent sites model, when γ = 0 approximately 5% of the data sets give significant results, and thus the test appears to attain the desired significance level in this case. For the multilocus data sets, however, the type I error rate is larger than the specified 0.05 level, and thus the test appears to reject the null hypothesis more often than it should.  Table 1 Estimates of the parameter γ using the ratio f 1 f 2 for data simulated on the four-taxon hybrid species network in Fig. 1 with the "short" and "long" branch lengths settings "Short" branch length "Long" branch length  When γ > 0, we see that the test is powerful for both the shallow and the deep hybridization events and for both types of data, with the power above 90% in both cases when γ ≥ 0.2. Furthermore, the test almost always selects the correct assignment of hybrid and parental taxa, with the proportion of times that this is exclusively generated increasing toward 100% as γ increases for the coalescent independent sites data. One observation we made that is not reflected in the results in Table 2 is that for data simulated from the network involving the deep hybridization event, many sets appear as significant when some true relationship is detected. For example, it is common to have the hybrid correctly assigned, but the parental species assigned as belonging to a taxon from the sister clade of the true parent. This is especially true for the multilocus data sets with the deep hybridization event. In other words, this test is good at picking out the hybrid taxon, but not as good at unambiguously picking out its parents when the hybridization event occurs deeper in the network. This was not the case for the shallow event, where it often got exactly the correct relationships and only those in most cases.
The results for the 20-taxon networks are largely the same (Fig. 4 and Table 3). The test still demonstrates good power to detect the hybridization event, though the power does not rise above 90% for all settings until γ ≥ 0.3, rather than 0.2 as in the 9-taxon case. In addition, the proportion of data sets with "Correct Sets" decreases for the shallow hybridization events in this case, meaning that when a hybridization event is identified, it nearly always involved correct identification of which species was the hybrid and which were the parental species. Though there is a hint of an elevated type I error rate when multilocus data were simulated, the problem is not as dramatic as in the 9-taxon case. Overall, the method maintains its good ability to detect hybrid species.

Empirical data: Sistrurus rattlesnakes
Recall that this dataset contains two species, each containing three subspecies, as well as two outgroup species, for a total of eight tips in the species phylogeny of interest. When analyzing empirical data of this nature, for which several individuals are sampled within each species, our main interest will be in detecting individuals that show evidence of hybrid origin from parental individuals that are members of two different species. The current version of our software will output the test statistic for all assignments of hybrid and parental taxa for a given outgroup, but this output can easily be examined to consider only the comparisons of interest. For the rattlesnake data for a particular choice of outgroup, we can consider all choices of one individual allele from each of three subspecies, and for each such choice, one individual will be assigned to be the hybrid and the other two assigned to An additional practical issue that arose with our empirical data but was not observed with simulated data was that for some choices of three alleles, one or more of the site pattern frequencies p iijj , p ijij , and p ijji was observed to be 0. To correct for this, we added a small count (0.005) to each observed site pattern count in all cases before computing estimated site pattern frequencies and carrying out the test. With this modification, we find no evidence of hybrid origin for any of the sequences with any choice of outgroup sequence, consistent with other analyses in this group [64,65].

Empirical data: Heliconius butterflies
This dataset consists of 3 species with 4 individuals sampled per species, plus an outgroup. Thus, the number of comparisons of interest is 4 · 4 · 4 · 2 = 128 and the Bonferroni-corrected level of the tests is 0.05/128 = The columns labeled "False Pos." refer to the proportion of data sets for which a triplet of taxa were incorrectly identified as involving a hybridization event (false positives); the columns labeled "True Pos." refer to the proportion of data sets for which the correct triplet of taxa involving the hybridization event was identified and the hybrid taxon was correctly identified (true positives); and the columns labeled "True Sets" refer to the proportion of data sets for which the correct triplet of taxa was identified but the hybrid taxa was specified incorrectly. It is possible that both the correct triplet with the hybrid correctly specified and the correct triplet with the hybrid misspecified are identified as statistically significant in the analysis. We tally these separately because in the case of empirical data it would be ambiguous as to which is the hybrid taxon. For our simulated data, all data sets for which the true set was significant also had the triplet with the correct hybrid assignment found to be significant, and thus this proportion is always a fraction of the proportion of true positives

Discussion
We have proposed a method for detecting hybrid species using a model of hybrid speciation that incorporates coalescent stochasticity. The test is based on observed site pattern frequencies, which leads to several convenient properties. First, the computations required for the test can be carried out very rapidly, as all that is required is to obtain counts of observed site pattern frequencies for four taxa of interest. This computation is so rapid that there are essentially no limits on the length of sequences that can be handled by the method, and it is thus appropriate for genome-scale data. Second, observed site pattern frequencies arise from a multinomial distribution under the coalescent hybridization model used here, which allows Column headings are as in Table 2 derivation of the asymptotic distribution of the estimators of the site pattern frequencies. This ultimately leads to a null distribution for testing the hypothesis of interest that is asymptotically normally distributed which provides a straightforward test of the hypothesis of interest. Finally, we note that our method is derived under the assumption that each site has its own underlying gene tree, an experimental design that we call "coalescent independent sites". The method is thus clearly appropriate for genomewide SNP data, whether biallelic or not. We argue that the method is also appropriate for multilocus data, in that as the number of loci becomes large and provided that alignment lengths are not biased toward certain gene tree topologies, the proportion of sites observed from a particular gene tree will approach the proportion expected under the coalescent independent sites model. We thus carry out simulations for both multilocus and coalescent independent sites data, and we test our method on an empirical multilocus dataset. Our simulations show that the method is powerful for detecting hybridization for both recent and ancient hybridization events, although for ancient hybridization events it may be more difficult to pinpoint the precise parental species for the detected hybrids. In addition, the proportional contribution of the two parental species to the genome of the hybrid species can be estimated accurately and unbiasedly. The simulations also show that the method scales extremely well: for 20-taxon networks with 100,000 sites, computations can be completed in less than 30 s, while for a dataset with 13 sequences and over 248 million sites, the analysis took less than 20 min on an older desktop linux machine. While these analyses demonstrate that sequence length is not a computationally-limiting factor, they also suggest that larger numbers of taxa will be similarly unproblematic. Although adding taxa increases the number of hypothesis tests to be carried out, these are each done very rapidly (e.g., for 20 taxa, there are over 2200 tests being done in less than 30 s), and they could easily be carried out on separate processors, if necessary. To the extent of our knowledge, this method is thus the only technique available for exploratory hybrid identification for large numbers of sequences using genome-scale data.
The method is based on phylogenetic invariants, and we note that the particular choice of invariants used here was somewhat arbitrary. Indeed, the ABBA-BABA test [45][46][47] is based on the difference of ABBA and BABA patterns similar to our invariant f 2 and it too is useful in detecting hybridization. However their statistic is normalized by the total number of observations whereas our method is based on the ratio of two linear invariants leading to a function that depends only on the mixing parameter γ . Based on this crucial observation we were able to derive the Hils statistic for accurate detection of hybridization. We have also noticed that the ratio between f 3 and f 4 was not as powerful, thus it is possible that other invariants may be identified that work as well or better than the ones we have chosen here. It is also possible that invariants that operate on more than four taxa at a time could be determined, with potential improvements in the localization of hybrid and parental taxa for more ancient hybridization events. There is also a possibility that a set of linear invariants specific to species trees under the coalescent exists and can be classified, and if such a set exists, these species invariants may improve the performance. We suggest that exploring these directions is appealing, as site pattern-based methods provide the possibility of both rapid computation and convenient asymptotic distributions, making them suitable for processing the large genome-scale datasets that are becoming increasingly available. In fact, the performance of these methods improves with sequence length, since site pattern probabilities can be more accurately estimated, with little associated computational cost.

Conclusions
Classification of organisms and estimation of their phylogenetic relationships is central to many areas of biological research, but inference of these relationships comes with several challenges. Most notable are computational challenges arising from the abundance of available DNA sequence data and the need to model organismal evolution at two distinct levels -individual genes, and species as a whole, where the evolutionary histories of genes are constrained by the evolutionary history of the species. Additionally, several processes, such as incomplete lineage sorting (deep coalescence), hybridization, horizontal gene transfer, and gene duplication and loss, lead to the potential for incongruence in the evolutionary histories of the individual genes. The multispecies coalescent is commonly used to model incomplete lineage sorting and provides a model for the generation of gene trees within the containing species tree. We used this model to develop a method for detecting species that have arisen via hybridization and for quantifying the extent of hybridization in a formal statistical framework. We demonstrated the performance of our method using both simulated and empirical data. Our method is capable of processing genome-scale sequence datasets consisting of many taxa in a computationally efficient manner, thus providing researchers with an effective exploratory tool for hybrid identification.
For each simulated data set, we tested the null hypothesis that γ = 0 using the test statistics corresponding to the ratios in Eqs. (4) and (5) at level α = 0.05. We also applied the ABBA-BABA test [46]. We estimate the power of each test as the proportion of the 500 replicates for which the null hypothesis was rejected (when γ = 0, this gives an estimate of the level of the test). We also considered using each of the statistics to estimate the true hybridization parameter, γ . We report the mean of the estimated γ values, as well as the standard deviation and the mean squared error, for each parameter setting.
To evaluate the sensitivity of our test to the assumption of a molecular clock, we carried out a second set of simulations using model trees that violated the clock assumption. We considered violating the molecular clock in two ways. First, we extended the branch leading to species P1 by doubling its length, for both the short and the long branch length settings described above. Second, we extended the branch leading to the hybrid species by doubling its length, again for both branch length settings. As in the first set of simulation studies, we evaluate the power of our test and compare its performance to the ABBA-BABA test. Here, however, we consider only the Hils test based on the ratio f 1 f 2 , since this statistic showed superior performance in the first set of simulations.

Larger species networks
To examine the performance of our method for larger taxon samples, we considered networks containing 8 species and an outgroup, and networks containing 19 species and an outgroup. We also considered both recent hybridization and more ancient hybridization in each case (Fig. 4). For each model network, we generated 125 data sets containing 100,000 coalescent independent sites for γ = 0, 0.1, 0.2, 0.3, 0.4, and 0.5 as follows. First, 100000γ gene trees were generated from the species tree formed by connecting the hybrid taxon to the "left" parental lineage, and 100000(1 − γ ) gene trees were generated from the species tree formed by connecting the hybrid taxon to the "right" parental lineage. For each gene tree, one coalescent independent site was generated using Seq-Gen [66]  Each simulated data set was then given to our program with the outgroup specified, and the Hils statistic was computed for each possible combination of parents and hybrids. A cut-off for significance was determined using a Bonferroni correction with base level α = 0.05, and the putative hybrid and parents were reported for any statistic whose p-value fell below α/M, where M was the total number of comparisons. We summarized results by counting the number of "True Positives" (data sets for which the true hybrid and parental taxa are correctly identified), "True Sets" (data sets for which the true hybrid and parental taxa are identified, but their assignment to which is the hybrid and which are the parental taxa is ambiguous), and "False Positives" (data sets for which an incorrect set of taxa are identified as being subject to hybridization).
Because many of the genome-scale datasets being generated today are multilocus datasets (rather than being generated under the coalescent independent sites model used here), we also simulated data under multilocus n. These simulations proceeded exactly as described above, except that rather than simulating 100,000 coalescent independent sites, we simulated 1000 genes each of length 100bp. This choice was made to mimic the short read lengths generated by next-gen sequencing methods. We summarized these results in the same manner as described above. We justify application of our methodology to multilocus data in the Discussion section.

Empirical examples
We have also explored the performance of our method on two empirical data sets; the Sistrurus rattlesnakes and Heliconius butterflies. The Sistrurus rattlesnakes are found across North America and are currently classified into two species, Sistrurus catenatus and S. miliarius, each with three putative subspecies. The dataset consists of 19 genes sampled from 26 rattlesnakes: 18 individuals within the species Sistrurus catenatus (with subspecies S. c. catenatus (Sca, 9 individuals), S. c. edwardsii (Sce, 4 individuals), and S. c. tergeminus (Sct, 5 individuals)); six within species Sistrurus miliarius (with subspecies S. m. miliarius (Smm, 1 individual), S. m. barbouri (Smb, 3 individuals), and S. m. streckeri (Sms, 2 individuals)); and two outgroup species, Agkistrodon contortrix and A. piscivorus. These data were originally analyzed by [67] to determine species-level phylogenetic relationships. Prior to this analysis, the sequences were computationally phased, resulting in 52 sequences and 8,466 aligned nucleotide positions (data are available at TreeBase ID 11174). These data have been subsequently reanalyzed in several ways. For example, [16] used different methodology to infer the species phylogeny, and found agreement with the original analysis of Kubatko et al. (2011). Gerard et al. (2011) used a subset of the data to examine whether several specimens collected in Missouri and assigned to subspecies S. c. catenatus were actually hybrid species. They did not find evidence of hybridization, in agreement with other results using different data [64].
The Heliconius butterflies are a diverse group of tropical butterflies in the family Heliconii that are found throughout the southern United States and in Central and South America. We consider the study of   [68] in which genome-scale data for 31 individuals from seven distinct species were collected and evidence for gene flow between various species was assessed. We examine a subset of these data consisting of four individuals from each of the species Heliconius cydno, H. melpomene rosina, and H. m. melpomene, as well as one individual from the outgroup species H. hecale.  found evidence that H. m. rosina is a hybrid of H. m. melpomene and H. cydno. We obtained the aligned genome-wide data from the complete study of  from Dryad (http://datadryad.org/resource/doi:10.5061/dryad. dk712) [69], and extracted the 13 sequences of interest. The resulting aligned sequences consisted of 248,822,400 base pairs.