Patterns of intron gain and conservation in eukaryotic genes

Carmel, Liran; Rogozin, Igor B; Wolf, Yuri I; Koonin, Eugene V

doi:10.1186/1471-2148-7-192

Research article
Open access
Published: 12 October 2007

Patterns of intron gain and conservation in eukaryotic genes

Liran Carmel¹,
Igor B Rogozin¹,
Yuri I Wolf¹ &
…
Eugene V Koonin¹

BMC Evolutionary Biology volume 7, Article number: 192 (2007) Cite this article

6663 Accesses
55 Citations
Metrics details

Abstract

Background:

The presence of introns in protein-coding genes is a universal feature of eukaryotic genome organization, and the genes of multicellular eukaryotes, typically, contain multiple introns, a substantial fraction of which share position in distant taxa, such as plants and animals. Depending on the methods and data sets used, researchers have reached opposite conclusions on the causes of the high fraction of shared introns in orthologous genes from distant eukaryotes. Some studies conclude that shared intron positions reflect, almost entirely, a remarkable evolutionary conservation, whereas others attribute it to parallel gain of introns. To resolve these contradictions, it is crucial to analyze the evolution of introns by using a model that minimally relies on arbitrary assumptions.

Results:

We developed a probabilistic model of evolution that allows for variability of intron gain and loss rates over branches of the phylogenetic tree, individual genes, and individual sites. Applying this model to an extended set of conserved eukaryotic genes, we find that parallel gain, on average, accounts for only ~8% of the shared intron positions. However, the distribution of parallel gains over the phylogenetic tree of eukaryotes is highly non-uniform. There are, practically, no parallel gains in closely related lineages, whereas for distant lineages, such as animals and plants, parallel gains appear to contribute up to 20% of the shared intron positions. In accord with these findings, we estimated that ancestral introns have a high probability to be retained in extant genomes, and conversely, that a substantial fraction of extant introns have retained their positions since the early stages of eukaryotic evolution. In addition, the density of sites that are available for intron insertion is estimated to be, approximately, one in seven basepairs.

Conclusion:

We obtained robust estimates of the contribution of parallel gain to the observed sharing of intron positions between eukaryotic species separated by different evolutionary distances. The results indicate that, although the contribution of parallel gains varies across the phylogenetic tree, the high level of intron position sharing is due, primarily, to evolutionary conservation. Accordingly, numerous introns appear to persist in the same position over hundreds of millions of years of evolution. This is compatible with recent observations of a negative correlation between the rate of intron gain and coding sequence evolution rate of a gene, suggesting that at least some of the introns are functionally relevant.

Background

The presence of spliceosomal introns and the concurrent splicing machinery are one of the principal distinctive features of eukaryotic genomes [1–3]. Indeed, even early branching eukaryotes that were once suspected of being intronless have been shown in recent years to posses at least a few introns [4–7]. This suggests that the evolution of introns is tightly linked to the central aspects of the evolution of eukaryotes, and it even has been proposed that introns were the driving force behind the emergence of the eukaryotic nucleus and other features of the eukaryotic cell [8, 9].

Thanks to their ubiquity and potential major role in the evolution of eukaryotes, intron evolution has drawn considerable attention [1–3, 10]. It is generally accepted that introns are units of evolution such that their presence/absence pattern is a result of stochastic processes of loss and gain. The details of these processes, however, remain elusive. Only in recent years, with the accumulation of genomic data, evolution of introns has become amenable to a systematic, genome-wide analysis. However, different attempts to accomplish such a study led to incongruent conclusions regarding the prevalence, rates, and timing of intron loss and gain during the evolution of eukaryotes [11–18]. Apparently, these discrepancies are due, primarily, to incomplete underlying evolutionary models and biased estimation techniques [10].

Recently, we have obtained more conclusive results by developing a comprehensive probabilistic model of intron evolution, and by compiling a data set that is considerably larger than any one previously used [19, 20]. The probabilistic models used so far to study intron evolution can be classified into two groups: branch-specific and gene-specific ones. The branch-specific models assume that the processes of intron gain and loss along a branch are determined only by the properties of the branch, regardless of the particular gene in question [17, 18, 21]. Conversely, gene-specific models assume that these processes are determined only by the particular gene, independent on the branch in question [12]. Obviously, in reality, the characteristics of intron gain and loss processes vary considerably both across genes and across branches. Thus, each of the models employed thus far seem to describe only one facet of a more complex reality. By contrast, our model allows for the variability of intron gain and loss characteristics with respect to both genes and branches such that any of the previously suggested models can be shown to be a special case of this comprehensive model [20]. Moreover, this model is even more realistic in that it also includes rate variability between sites, with respect to both intron gain and intron loss. In order to estimate the model parameters, we devised an expectation-maximization (EM) algorithm that can also be used to reconstruct ancestral states [22]. Combining this algorithm with the profile likelihood technique allows one, in addition to all of the above, to compute confidence intervals of the model parameters.

Here, we describe, in detail, the developed model of intron evolution and the derivation of an improved version of the EM algorithm used to estimate its parameters. Previously, we have applied this comprehensive model to a set of 391 conserved genes from 19 eukaryotic species, to investigate evolutionary trends in gene structure both at the lineage level [20] and at the gene level [19]. The results of this analysis suggest that introns invaded eukaryotic genomes at early stages of eukaryogenesis in a nearly neutral process. At early times, during periods of major transitions in the eukaryotic evolution that led to population bottlenecks, these introns seem to have vastly proliferated in the ancient genomes. Gradually, a considerable fraction of the introns appear to become involved in various cellular functions, mostly, regulation of gene expression [19].

In the present work, we focus on the process of intron gain, and address the causes of the high level of intron position conservation between eukaryotic taxa. It had been already noticed that many intron positions are shared between distant eukaryotic taxa [11, 13]. For example, plants and animals share up to 25% of the intron positions [13]. However, these findings can be explained by either remarkable conservation of ancient introns or by parallel, independent, intron gains in the same positions, or (perhaps, most likely) by a combination of both these factors. The previous analyses that have attempted to quantify the relative contributions of evolutionary conservation and parallel gain in intron position sharing have differed widely, with estimates of the extent of parallel gain ranging from nearly 0% to nearly 100% [11–13, 18, 23]. Using our probabilistic model, we developed a rigorous measure for assessing the amount of parallel gain of introns. We found that, overall, parallel gain is responsible for ~8% of the shared intron positions, with the rest due to shared ancestry. However, we also demonstrate substantial heterogeneity in the extent of parallel gain, with almost none in closely related lineages, but up to ~20% in distant ones, such as plants and unikonts. On the whole, these results support the notion that intron positions are highly conserved during evolution.

Results and discussion

Notation

The primary input component in the study of intron evolution consists of G sets of aligned protein-coding sequences of orthologous genes from S species. Each nucleotide in these alignments is substituted by 0 or 1 depending on whether or not an intron is present following the respective position. We allow for missing data by using a third symbol (*) to indicate lack of knowledge about the presence or absence of an intron. Consequently, every site in the alignments, called pattern, is a vector of length S over the alphabet (0,1,*) and is denoted by ω. Let Ω be the total number of unique patterns in the entire set of G alignments, and let n_gpbe the count of the number of times pattern ω_p(p = 1, ..., Ω) is found in the multiple alignment of gene g. Assuming that the sites evolve independently, the set M_g= (n_{g 1}, ..., n_{g Ω}) fully characterizes the multiple alignment of the g th gene.

Let T be a rooted, bifurcating phylogenetic tree with S leaves (terminal nodes), describing the evolutionary relationships between the S species above. The total number of nodes in T is N = 2S - 1, and we index them by t = 0,1, ..., N - 1, with the convention that zero is the root node. The state of node t is described by the variable q_t, which can take the values 0 and 1 (and * in leaves). We use V_tfor the set of all leaves such that node t is among their ancestors. The entire collection of leaves is, obviously, V₀. The parent node of t is denoted P(t). We use the notations $q_{t}^{P}$ and $V_{t}^{P}$ for q_P(t)and V_P(t), respectively. Analogously, the two direct descendents of the node t are denoted L(t) and R(t), and we use the notations $q_{t}^{L}$ , $q_{t}^{R}$ , $V_{t}^{L}$ , and $V_{t}^{R}$ for q_L(t), q_R(t), V_L(t), and V_R(t), respectively. The branches are indexed by the node into which they are leading, and Δ_tdenotes the length (in time units) of the t th branch. Hereinafter we assume that the tree topology, as well as all the branch lengths Δ₁, ..., Δ_N-1, are known.

The probabilistic model

A bifurcating phylogenetic tree can be viewed as a graphical model depicting the probabilistic model

\Pr (q_{0}) \prod_{t = 1}^{N - 1} \Pr (q_{t} | q_{t}^{P}) .

(1)

We use the notation π_i= Pr(q₀ = i) to describe the prior probability of the root, and T_ij(g, t) = Pr(q_t= j| $q_{t}^{P}$ = i, g) to describe the transition probability for gene g along branch t. In our model, we assume that this transition probability depends on both the gene and the branch, and that it takes the explicit form

T (g, t) = (\begin{matrix} 1 - ξ_{t} (1 - e^{- η_{g} Δ_{t}}) & ξ_{t} (1 - e^{- η_{g} Δ_{t}}) \\ 1 - (1 - φ_{t}) e^{- θ_{g} Δ_{t}} & (1 - φ_{t}) e^{- θ_{g} Δ_{t}} \end{matrix}) .

(2)

Here, η_gand θ_gare non-negative parameters which determine, respectively, the intron gain and loss rates per site for gene g. That is, along branch t the gene's contribution to intron gain and retention probabilities per site is $1 - e^{- η_{g} Δ_{t}}$ and $e^{- θ_{g} Δ_{t}}$ , respectively. We assume that each branch is characterized by an intrinsic branch-specific intron gain coefficient, ξ_t, as well as an intrinsic branch-specific intron loss coefficient, ϕ_t, both of which are bounded, 0 ≤ ξ_t, ϕ_t≤ 1.

In other fields of molecular evolution, it had been long realized that the precision of the analysis significantly improves if one allows for rate variability across sites [24–26]. Typically, such rate variability is modeled by introducing a rate variable, r, which scales, for each site, the time units of the phylogenetic tree, Δ_t← r·Δ_t. This rate variable is a random variable that is distributed according to a distribution function with non-negative domain and unit mean, typically, the unit-mean gamma distribution. The rate variability captures rate variations among sites of the same gene. Specifically, there are fast-evolving sites (r >> 1), as well as slow-evolving ones (r << 1). In our model of intron evolution, we extend this idea by assuming that the gain and loss processes are subject to rate variability, independently of each other. Hence, a site can have any combination of gain and loss rates, for example, it can be fast to gain introns but slow to lose them. To implement this approach, we use two independent rate variables, r^ηand r^θ, that are used to scale, for each site, the gene-specific gain rate, η_g← r^η·η_g, and the gene-specific loss rate, θ_g← r^θ·θ_g, respectively. We further assume that the distributions of these rate variables are independent of the genes, and are explicitly given by

\begin{array}{l} r^{η} ~ ν δ (η) + (1 - ν) Γ (η; λ_{η}) \\ r^{θ} ~ Γ (θ; λ_{θ}) . \end{array}

(3)

Here, Γ(x; λ) is the unit-mean gamma distribution of the variable x with the shape parameter λ, δ(x) is the Dirac delta-function, and ν is the fraction of sites that are assumed to have zero gain rate. The existence of these zero sites reflects the notion that introns cannot be gained at any location within genes, but rather are preferentially inserted at specific locations, contingent on particular sequence motifs known as proto-splice sites [27–29], the density of other introns in the neighborhood, the chromatin exposure, and more. According to this interpretation, 1 - ν measures the density of potential intron insertion sites. Importantly, using the same value of ν for the entire tree does not mean that the proto-splice sites are constant throughout the evolution, or are identical for different lineages. It only means that, on the average, the fraction of potential insertion sites, whatever is their concrete nature, is similar across the lineages throughout the course of eukaryotic evolution. The incorporation of such invariant sites in a rate variability model appears natural for intron evolution and has proved beneficial also in other fields of molecular evolution [30–32]. By contrast, intron loss does not have an invariant counterpart because the assumption is that, once an intron is gained, it can always be lost. Therefore, the loss rate variable is assumed to be distributed according to a gamma distribution, which is by far the most popular distribution for describing rate variability [24, 33].

In practice, the rate distributions in (3) are rendered discrete [34]. We assume that the gain rate variable can take K_ηdiscrete values $r_{1}^{η} = 0, r_{2}^{η}, \dots, r_{K_{η}}^{η}$ with probabilities $f_{1}^{η} = ν, f_{2}^{η} \dots, f_{K_{η}}^{η}$ such that $\sum_{k = 1}^{K_{η}} f_{k}^{η} = 1$ . Analogously, we assume that the loss rate variable can take K_θdiscrete values $r_{1}^{θ}, \dots, r_{K_{θ}}^{θ}$ with probabilities $f_{1}^{θ}, \dots, f_{K_{θ}}^{θ}$ such that $\sum_{k = 1}^{K_{θ}} f_{k}^{θ} = 1$ . For a particular gain rate value $r_{k}^{η}$ , we denote the actual gain rate $r_{k}^{η}$ ·η_gby η_kg. Similarly, for a particular loss rate value $r_{k}^{θ}$ , we denote the actual loss rate $r_{k}^{θ}$ ·θ_gby θ_kg.

For notational clarity, we aggregate the model's parameters into a small number of sets. To this end, let Ξ_t= {ξ_t, ϕ_t} be the set of parameters that are specific to branch t, and let Ξ = (Ξ₁, ..., Ξ_N-1) be the set of all branch-specific parameters. Similarly, let Ψ_g= (η_g, θ_g) be the set of parameters that are specific for gene g, and let Ψ = (Ψ₁, ..., Ψ_G) be the set of all gene-specific parameters. Additionally, we denote by Λ = (π₀, ν, λ_η, λ_θ) the "global" parameters that determine the rate variability and the prior probability of an intron absent in the root. When the distinction between the different sets of parameters is irrelevant, we shall use Θ = (Ξ, Ψ, Λ) as the set of all the model's parameters. We achieve further succinctness in notations by denoting the actual gene-specific rate values for particular values $r_{k}^{η}$ and $r_{k'}^{θ}$ of the rate variables as Ψ_kk'g= (η_kg, θ_k'g).

The Expectation-Maximization algorithm

We estimate the parameters of the model using maximum likelihood. As the probability model includes observed random variables (state of the tree leaves) as well as hidden random variables (state of internal nodes in the tree and value of actual rate variables), the expectation-maximization algorithm is a natural tool to use [35]. This is a hill climbing iterative algorithm that requires two steps in each iteration – the expectation (E) step followed by the maximization (M) step. The details of the algorithm are given under Methods.

The total number of parameters in the model is 2G + 4S, where G is the number of genes and S is the number of species. For data sets in the hundreds of genes, this sums up to >1000 parameters. For infinite data, maximum likelihood estimators are known to be nonbiased and efficient. However, as each gene typically goes through a small number of intron-related events during its lifetime, the information content of our data is limited, and cannot straightforwardly support the estimation of such a large number of parameter. To overcome this, we adopt a two-phase approach in the data analysis. In the first phase, that we denoted homogeneous evolution, it is assumed that all the genes have identical intron gain and loss rates, formally, that θ_g= θ₀ and η_g= η₀ for any gene g. In the second phase, denoted heterogeneous evolution, all the global and branch-specific parameters are fixed, which allows for estimation of the gene-specific parameters θ_gand η_gthat can now take different values for different genes. Except for the rate variability within genes, the model of evolution under the homogeneous phase resembles the branch-specific models, and consequently the EM algorithm used in this part has similar structure to the EM algorithm developed by Nguyen et al. [18, 36].

Using simulations, we showed that this approach yields highly accurate evolutionary reconstructions: a relative error of 1%, 3%, and 11% in estimating the number of introns in internal nodes, the number of loss events along each branch, and the number of gain events along each branch, respectively [20]. The current analysis is pattern-centered rather than gene-centered and thus we have used the results of the homogeneous phase only. While slightly less accurate than the heterogeneous counterparts, the reconstructions after the homogeneous phase are still highly accurate: a relative error of 2%, 4%, and 12% in estimating the number of introns in internal nodes, the number of losses along each branch, and the number of gains along each branch, respectively [20]. Also notable is that, apart from the fraction of zero sites v, the rate variability within genes can be ignored without having any significant effect on the results (Ref. [20] and Additional file 1), indicating that those sites that are capable of gaining introns do so at comparable rates.

The EM algorithm was applied to the set of 391 orthologous genes from 19 eukaryotic species ([20] and see Methods), under the homogeneous evolution assumption, and the profile likelihood technique was used to estimate the 95% confidence interval for each parameter (Additional file 1). In computing the confidence intervals for a particular parameter, we allow all other parameters to vary. Sometimes, different combinations of parameters yield similar likelihood values, an observation that has already been made formal for simpler models [18]. This occasionally leads to wide single-parameter confidence intervals, especially for the intron gain and loss rates of deep nodes. Importantly, this does not reflect similarly large errors in the ensuing inference, as was demonstrated by an exhaustive simulation study [19, 20].

One in 7 nucleotides is a potential intron insertion site

The maximum-likelihood estimate for the fraction of zero sites is ν = 0.862, suggesting a potential intron insertion site every 7 nucleotides. Taking into account the 95% confidence interval for ν, [0.631, 0.928], the density of potential insertion sites is estimated to be 1 site per 3-14 nucleotides. Based on a smaller data set, Nguyen et al. [18] computed a 95% confidence interval of 1 potential insertion site in 9–14 nucleotides. Although this estimate falls within our confidence interval, our results suggest, generally, a denser population of potential insertion sites. The present estimate is also compatible with the high density of intron insertion sites (one site per 9.7 nucleotides) observed for some large protein families, e.g., small G proteins [37]. The problem of estimating the number of zero sites is of fundamental importance in maximum likelihood techniques, so both Csuros [17] and Nguyen et al. [18] developed different heuristic procedure to handle it. We took the alternative and, arguably, more natural approach of integrating this estimate into the model as part of the gain rate variability distribution (see above) such that additional, ad hoc computations are not required.

One should be cautious about identifying potential intron insertion sites with proto-splice sites. More realistically, the density of potential insertion sites reflects the overall average (across species) of the impact of multiple factors such as differential tendencies to be inserted into different proto-splice sites, local densities of pre-existing introns, and degree of chromatin exposure. Furthermore, this density estimation strongly depends on the specific data set as the parameter ν "absorbs" the information on zero sites. Thus, if intronless genes are added to the data, only this parameter will be heavily affected.

With the total number of sites N₀ = 289,902 our calculations yield 39,962 sites that can gain introns (95% confidence interval 20,891 to 106,901). Accordingly, the 5,755 sites that are actually occupied by introns in the genes analyzed here comprise 14.4% of all sites that can potentially gain introns (95% confidence interval 5.4% to 27.5%). Thus, even when data from 19 species are combined, the density of the sites actually occupied by introns is still far below the density of sites capable of hosting introns, which emphasizes the inadequacy of any analysis that considers only those sites in which introns are, actually, observed.

The number of shared intron positions is much greater than expected by chance

Numerous intron positions are shared even between taxa that had diverged during the early stages of the eukaryotic evolution [11, 13]. To quantify this conservation more precisely and to estimate how surprising it is, compared to the random expectation, we propose the following measure: let N_iand N_jbe the number of intron positions in species i and j, respectively; let S_ijbe the number of intron positions shared by the two species; and let N be the total number of sites capable of gaining introns. Then, approximately, we would expect that N_iN_j/N positions will be shared between the two species by chance alone. We define the conservation level

c_{i j} = \log (\frac{N S_{i j}}{N_{i} N_{j}})

as the log-ratio between the observed number and the number expected by chance. Positive values designate that S_ijexceeds random expectation, whereas negative values indicate that S_ijis below expectation. Even if we take for N its lower 95% confidence interval bound of 20,891 (see the preceding section), almost all pairs show a positive value, namely, a greater than random number of shared positions (Fig. 1). The only exception is the pair S. cerevisiae – P. falciparum (the two most intron-poor species in our set) that do not have even a single shared position.

The number of positions shared by chance between two species is a random variable distributed, approximately, according to a binomial distribution with the probability of success p = N_iN_j/N², and N experiments. Therefore, it is easy to associate a p-value with any observed number of shared positions, S_ijmeasuring how improbable it is to obtain by chance this value or a greater one. An overall significance level of 0.05 is equivalent to a Bonfferoni-corrected significance level of 0.0003 (overly conservative as we assume that all pairs are independent). The calculations indicate that only 20 species pairs out of 171, all involving the intron-poor S. cerevisiae or one of the apicomplexans, had a number of shared positions that was indistinguishable from the random expectation; all other pairs had a significant excess of shared intron positions (Additional file 2).

Twelve out of thirteen shared intron positions reflect common ancestry

The significant excess of shared intron positions over the random expectation can be explained by one of two factors or a combination thereof. The first explanation is that this observation reflects genuine evolutionary conservation; the implication is that, once an intron is gained, it is hard to lose, perhaps, due to functional importance of introns or because the deletion event itself is destructive. The second explanation is that different lineages gain introns at the same position, independently. The chance of such parallel gain is not necessarily as low as it seems at first sight because introns are, apparently, preferentially inserted into proto-splice sites (see above). Accordingly, regions of high sequence conservation might gain introns at exactly the same position.

The two explanations have opposing impacts on our understanding of the evolution of eukaryotic genes. If introns are persistent, many must have been gained early in the eukaryotic evolution, and the later evolution involved multiple losses [16, 38]. By contrast, if parallel gain is the dominant mechanism, many of the extant introns are evolutionarily young, and the eukaryotic evolution involves, primarily, multiple introns gains [12].

In order to estimate the extent of parallel gain, we examined all the sites that host introns in at least two species. Assuming that these sites are not invariant, the probability was estimated that the corresponding pattern had arisen from parallel gain, by computing the probability of the last common ancestor of the intron-bearing species to lack an intron. This approach assumes that, if the last common ancestor had an intron in a particular position, no parallel gain occurred, i.e., highly unlikely scenarios involving at least two gains and one loss at the same site are neglected. The probabilities of parallel gain for all the patterns observed in our data set are summarized in Additional file 3. Overall, the data set harbors 3176 sites with shared intron positions (741 unique patterns) out of which 317 positions (10%) are expected to result from parallel intron gain. As inferences involving the root of the tree are prone to significantly elevated standard errors [20], the calculation was repeated using only patterns that do not have the root as the last common ancestor of all intron-bearing species. This calculation yielded 2913 sites with shared intron positions (568 unique patterns) of which 229 (~7.9%) are expected to be due to parallel intron gain. Each of these calculations provides a rough estimate of the level of errors introduced into some of the recent studies that explicitly excluded the possibility of parallel gain [13, 16, 21].

Importantly, however, the contributions of parallel gains to the emergence of different patterns of intron sharing are widely different; in particular, some rare patterns are explained (almost) entirely by parallel gain and do not reflect evolutionary conservation (Additional file 3). For example, for the single site that harbors introns only in humans and N. crassa, the probability is >0.99 that it results from parallel gain. Considering somewhat more frequent patterns, 11 sites harbor introns in C. intestinalis, A. thaliana and O. saliva, with the probability of parallel gain ~0.875, and another 12 sites harbor introns only in C. elegans and C. intestinalis, with the probability of parallel gain ~0.8.

Generally, the distribution of parallel gains in comparisons of specific clades is, obviously, more informative than overall counting (Table 1). To obtain this information, for each internal node t (excluding the root of the tree), all patterns that have 1s in the two sub-clades stemming from t, $V_{t}^{L}$ and $V_{t}^{R}$ were tallied, and the probability that t is in state zero was computed. By this analysis, in which 728 unique patterns were included, we found that nearly 20% of the shared intron positions between plants and unikonts, thought to have diverged more than a billion and a half years ago [39], are due to parallel gain. In fungi and metazoa, diverged ~1.4 billion years ago, >10% of the shared positions are estimated to derive from parallel gains. In contrast, many recently diverged clades show almost no parallel gain (e.g., humans versus rodents, birds versus mammals, Aspergillus versus Neurospora, and flies versus mosquitoes). Table 2 lists all patterns for which the estimated contribution to parallel gain was greater than two sites. The estimated total number of sites with parallel gain is 248. Out of the 728 unique patterns, the 22 in this list (3%) account for 116 parallel gains, i.e., ~47% of the total estimated number.

Table 1 The estimated number of parallel gains on the branches stemming out of each of the internal nodes in the phylogenetic tree (excluding the root; see Additional file 4)

Full size table

Table 2 Patterns that are estimated to contribute more than two sites to the total count of parallel gains

Full size table

Given that most of the other studies report overall estimates on a smaller (8 species) data set [13], a direct comparison of their results with the present ones is not feasible due to the apparent non-uniformity in the extent of parallel gain in different parts of the tree. Nevertheless, an overall parallel gain of ~10% has been computed also for the smaller data set. Rogozin et al. [13] used simulations to assess the extent of parallel gain and showed that, under various assumptions regarding the density of proto-splice sites, parallel gains could be responsible to ~2–40% of the shared intron positions. Accounting more accurately for the likely density of proto-splice sites, Sverdlov et al. estimated the contribution of parallel gains to the observed sharing of intron positions to be in the range of 5–10% [23]. An even higher estimate, 18.5% parallel gains, was obtained by Nguyen et al. using a branch-specific maximum-likelihood model [18].

Intron retention

We conclude, therefore, that a substantial majority of the shared intron positions are due to evolutionary conservation, hence intron positions tend to be retained for long times. To further validate this conclusion, we explicitly computed the probability of an intron to survive along given paths in the phylogenetic tree. To this end, let B denote the set of branches that comprise a path in the tree. Then, intron retention probability along this path is

P_{B} = \sum_{k' = 1}^{K_{θ}} f_{k'}^{θ} \prod_{t \in B} (1 - φ_{t}) e^{- θ_{k' g} Δ_{t}},

where $(1 - φ_{t}) e^{- θ_{k' g} Δ_{t}}$ is the retention probability along branch t, and $f_{k'}^{θ}$ is the probability of being at loss-rate category k'.

In accord with the above conclusions on the high level of evolutionary conservation, this probability is, typically, high, even for very long paths in the phylogenetic tree (Table 3). For example, an intron present in the last common ancestor of the metazoa has a probability of 0.83 to be retained in humans whereas an intron present in the last common ancestor of multicellular life (AME) has a probability of 0.57 to be retained in extant plants.

Table 3 Probabilities of an intron to be retained along selected paths in the phylogenetic tree

Full size table

A complementary approach involves the computation of the probability of extant introns to be of ancient origin. To calculate this probability, we assumed that a site is known to host an intron in one species, and that no information is available on this site in other species (that is, their state is *). Then, we used our model to compute the probability that this intron was present in any of the ancestors of that species (Table 4 and Fig. 2). Clearly, these probabilities decay quite slowly with evolutionary time. For instance, an intron in C. elegans, H. sapiens and D. melanogaster has a probability of 0.44, 0.69, and 0.68, respectively, to have been present in the last common ancestor of metazoa.

Table 4 Probability of an intron in extant species to be inherited from an ancestral node

Full size table

Conclusion

Despite the extensive attention given to the evolution of eukaryotic gene structure over the last three decades, the fundamental characteristics of this process remain controversial. In particular, depending on the methods and data sets used, different researchers have reached opposite conclusions on the causes of the remarkably high fraction of shared introns in orthologous genes from distant eukaryotic species. Some attribute it (almost) entirely to a remarkable evolutionary conservation of intron positions and others, largely, to parallel gain of introns. To resolve these contradictions, it is important to analyze the evolution of introns by using a probabilistic model that minimally relies on arbitrary assumptions. To this end, we developed a model that allows for variability of intron gain and loss rates over branches of the phylogenetic tree, individual genes, and individual sites. Applying this model to an extended set of conserved eukaryotic genes, we found that parallel gain, on average, accounts for only ~8% of the shared intron positions. However, the distribution of parallel gains over the phylogenetic tree of eukaryotes is highly non-uniform such that there are, practically, no parallel gains in closely related lineages, whereas for distant lineages, such as animals and plants, parallel gains might have contributed up to 20% of the shared intron positions. Given the distinctly non-uniform distribution of the inferred gain events over the phylogenetic tree of eukaryotes [20], most of the recently diverged lineages have amassed very few gains during their separate evolution; by contrast, deeply diverged lineages, such as animals and plants appear to have gone through independent stages of extensive intron gain, which would explain the greater share of parallel gains.

We estimated that ancestral introns have a high probability to be retained in extant genomes, and conversely, many of the extant introns are ancient. The reasons for this remarkable endurance of a substantial fraction of the introns are not clear. One possibility is mechanistic, i.e., removing existing introns might be imprecise and hence potentially deleterious. Another possibility is functional, i.e., introns might have acquired many functional roles since entering eukaryotic genomes. The latter possibility is compatible with the recent observation of a negative correlation between the rate of intron gain and coding sequence evolution rate of a gene [19] which suggests that at least some of the introns are functionally relevant.

Methods

The data set

The methods and criteria used in compiling the data set have been described previously [19, 20]. Briefly, the analyzed data set consisted of the reliable alignments of 391 genes from 19 eukaryotic species (a total of 289,902 sites). These include 9 animals (Caenorhabditis elegans, Strongylocentrotus purpuratus, Ciona intestinalis, Danio rerio, Gallus gallus, Homo sapiens, rodents (Mus musculus and Rattus norvegicus combined), Drosophila melanogaster, Anopheles gambiae); 5 fungi (Cryptococcus neoformans, Schizosaccharomyces pombe, Saccharomyces cerevisiae, Aspergillus fumigatus, Neurospora crassa); two plants (Arabidopsis thaliana, Oryza sativa); two apicomplexans (Theileria parva, Plasmodium falciparum); and the amoebozoan Dictyostelium discoideum.

Phylogenetic tree topology

Throughout this paper we assumed the traditional "crown-group" tree topology (Additional file 4). Specifically, the root position is between the Apicomplexa and the common ancestor of multicellular eukaryotes (plants and animals) [40], as opposed to the alternative Unikont-Bikont division [41]. We furthermore assume the Coelomata topology (Deuterostomia and insects are grouped together to the exclusion of nematodes) [42, 43] as opposed to the Ecdysozoa topology (insects and nematodes are grouped together to the exclusion of Deuterostomia) [44, 45]. The results, however, are not sensitive to the exact tree topology, as explicitly shown elsewhere [19, 20].

The EM Algorithm

For each site, the S leaves form a set of observed random variables, their states being described by the corresponding pattern ω_p. The states of all the internal nodes, denoted σ, form a set of hidden random variables, that is, random variables whose states are not observed. In order to account for rate variability across sites, we associate with each pattern two hidden random variables, $ρ_{p}^{η}$ and $ρ_{p}^{θ}$ , that determine the value of the rate variables in that site. To sum up, the observed random variables are ω_p, and the hidden random variables are (σ, $ρ_{p}^{η}$ , $ρ_{p}^{θ}$ ).

We assume that sites within a gene, as well as the genes themselves, evolve independently. Therefore, the total likelihood can be decomposed as

L (M_{1}, \dots, M_{G} | Θ) = \prod_{g = 1}^{G} L (M_{g} | Ξ, Ψ_{g}, Λ) = \prod_{g = 1}^{G} \prod_{p = 1}^{Ω} L {(ω_{p} | Ξ, Ψ_{g}, Λ)}^{n_{g p}} .

and so

\log L (M_{1}, \dots, M_{G} | Θ) = \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} n_{g p} \log L (ω_{p} | Ξ, Ψ_{g}, Λ) .

(4)

According to the well-known EM paradigm [35], log L(M₁, ..., M_G|Θ) is guaranteed to increase as long as we maximize the auxiliary function

Q (Θ, Θ^{0}) = \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} n_{g p} Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}),

(5)

where

\begin{array}{l} Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \\ \sum_{σ, ρ_{p}^{η}, ρ_{p}^{θ}} \Pr (σ, ρ_{p}^{η}, ρ_{p}^{θ} | ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \log \Pr (ω_{p}, σ, ρ_{p}^{η}, ρ_{p}^{θ} | Ξ, Ψ_{g}, Λ) . \end{array}

(6)

If we replace the formal summing over all states of $ρ_{p}^{η}$ and $ρ_{p}^{θ}$ in (6) by a direct sum, we get

\begin{array}{l} Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \\ \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} \sum_{σ} \Pr (σ, ρ_{p}^{η} = k, ρ_{p}^{θ} = k' | ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \log \Pr (ω_{p}, σ, ρ_{p}^{η} = k, ρ_{p}^{θ} = k' | Ξ, Ψ_{g}, Λ) . \end{array}

(7)

Using our notational conventions, we can write the first term in (7) as

\begin{array}{l} \Pr (σ, ρ_{p}^{η} = k, ρ_{p}^{θ} = k' | ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \\ \Pr (ρ_{p}^{η} = k, ρ_{p}^{θ} = k' | ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (σ | ω_{p}, Ξ^{0}, Ψ_{g k k'}^{0}), \end{array}

(8)

and the second term as

\begin{array}{l} \log \Pr (ω_{p}, σ, ρ_{p}^{η} = k, ρ_{p}^{θ} = k' | Ξ, Ψ_{g}, Λ) = \log \Pr (ρ_{p}^{η} = k | Ξ, Ψ_{g}, Λ) + \\ + \log \Pr (ρ_{p}^{θ} = k' | Ξ, Ψ_{g}, Λ) + \log \Pr (ω_{p}, σ | Ξ, Ψ_{g k k'}) = \\ \log f_{k}^{η} + \log f_{k'}^{θ} + \log \Pr (ω_{p}, σ | Ξ, Ψ_{g k k'}) . \end{array}

(9)

Substituting (8) and (9) back into (7) gives

\begin{array}{l} Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \\ \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} [\Pr (ρ_{p}^{η} = k, ρ_{p}^{θ} = k' | ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0})] \cdot \\ \cdot [\sum_{σ} \Pr (σ | ω_{p}, Ξ^{0}, Ψ_{g k k'}^{0}) \cdot {\log f_{k}^{η} + \log f_{k'}^{θ} + \log \Pr (ω_{p}, σ | Ξ, Ψ_{g k k'})}] \end{array}

Denoting by w_gpkk'and Q_gpkk'the first and second square brackets, respectively, this expression becomes

Q_{g p} (Ξ, Ψ_{g}, Λ, Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) = \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} w_{g p k k'} Q_{g p k k'},

(10)

And, consequently.

Q (Θ, Θ^{0}) = \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} n_{g p} w_{g p k k'} Q_{g p k k'}

(11)

The E-step

In this step, the function Q(Θ, Θ⁰) or, equivalently, the set of coefficients w_gpkk'and Q_gpkk'is computed by using inward-outward recursion on the tree.

The inward (γ) recursion

Here we suggest a variation on the well-known Felsenstein's pruning algorithm [46]. Let us associate with each node t (except for the root) a vector $γ_{i}^{g p k k'} (t) = \Pr (V_{t} | q_{t}^{P} = i, Ξ^{0}, Ψ_{g k k'}^{0})$ . This is the probability of observing the nodes V_t(which are a subset of the pattern ω_p) for a gene g, when the gain and loss rate variables are $r_{k}^{η}$ and $r_{k'}^{θ}$ , respectively, and the parent node of t is known to be in state i. By definition, this function is initialized at all leaves (t ϵ V₀) by

γ (t \in V_{0}) = {\begin{array}{l} (\begin{matrix} 1 - ξ_{t} (1 - e^{- η_{g k} Δ_{t}}) \\ 1 - (1 - φ_{t}) e^{- θ_{g k'} Δ_{t}} \end{matrix}) & q_{t} = 0 \\ (\begin{matrix} ξ_{t} (1 - e^{- η_{g k} Δ_{t}}) \\ (1 - φ_{t}) e^{- θ_{g k'} Δ_{t}} \end{matrix}) & q_{t} = 1. \end{array}

(12)

Here and in the derivations to follow, we omit the superscript from γ. For all internal nodes (except for the root), γ is computed using the recursion

γ_{i} (t) = \sum_{j = 0}^{1} T_{i j} (g, t) {\tilde{γ}}_{j} (t),

(13)

where ${\tilde{γ}}_{j} (t)$ is defined as γ_j(L(t))γ _j(R(t)). This is easy to see, as

\begin{array}{l} γ_{i} (t) = \Pr (V_{t} | q_{t}^{P} = i) = \Pr (V_{t}^{L}, V_{t}^{R} | q_{t}^{P} = i) = \sum_{j = 0}^{1} \Pr (V_{t}^{L}, V_{t}^{R}, q_{t} = j | q_{t}^{P} = i) = \\ = \sum_{j = 0}^{1} \Pr (q_{t} = j | q_{t}^{P} = i) \cdot \Pr (V_{t}^{L} | q_{t} = j, q_{t}^{P} = i) \cdot \Pr (V_{t}^{R} | V_{t}^{L}, q_{t} = j, q_{t}^{P} = i) . \end{array}

(14)

The first term is, simply, the definition of T_ij(g, t). Given q_t, $V_{t}^{L}$ is independent of $q_{t}^{P}$ , thus the second term is just Pr( $V_{t}^{L}$ | q_t= j) = γ_j(t^L) . By similar argument, the third term is simply Pr( $V_{t}^{R}$ | q_t= j) = γ_j(t^R). Substituting these results in (14), we recover the recursion formula (13).

The γ-recursion allows for computing the likelihood of any observed pattern ω_p, given the values of the rate variables:

\begin{array}{l} \Pr (ω_{p} | Ξ^{0}, Ψ_{g k k'}^{0}) = \Pr (V_{0} | Ξ^{0}, Ψ_{g k k'}^{0}) = \Pr (V_{0}^{L}, V_{0}^{R} | Ξ^{0}, Ψ_{g k k'}^{0}) = \\ = \sum_{i = 0}^{1} \Pr (V_{0}^{L}, V_{0}^{R}, q_{0} = i | Ξ^{0}, Ψ_{g k k'}^{0}) = \\ = \sum_{i = 0}^{1} \Pr (q_{0} = i | Ξ^{0}, Ψ_{g k k'}^{0}) \cdot \Pr (V_{0}^{L} | q_{0} = i, Ξ^{0}, Ψ_{g k k'}^{0}) \cdot \Pr (V_{0}^{R} | V_{0}^{L}, q_{0} = i, Ξ^{0}, Ψ_{g k k'}^{0}) . \end{array}

Given q₀, $V_{0}^{R}$ is independent of $V_{0}^{L}$ , and so

\Pr (V_{0}^{R} | V_{0}^{L}, q_{0} = i, Ξ^{0}, Ψ_{g k k'}^{0}) = \Pr (V_{0}^{R} | q_{0} = i, Ξ^{0}, Ψ_{g k k'}^{0}),

and

\Pr (ω_{p} | Ξ^{0}, Ψ_{g k k'}^{0}) = \sum_{i = 0}^{1} π_{i} {\tilde{γ}}_{i} (0) .

(15)

One of the useful features of this recursion is that is allows to treat missing data fairly easily. Only a single option has to be added to the initialization phase (12),

\begin{matrix} γ (t \in V_{0}) = (\begin{matrix} 1 \\ 1 \end{matrix}) & q_{t} = * . \end{matrix}

(16)

The outward (α-β) recursion

Once the γ-recursion is computed, we can use it to compute a second, complementary, recursion. To this end, let us associate with each node t (except for the root node) a matrix $α_{i j}^{g p k k'} (t) = \Pr (q_{t} = j, q_{t}^{P} = i | ω_{p}, Ξ^{0}, Ψ_{g k k'}^{0})$ . It is convenient to define for each node t (except for the root node) a vector $β_{j}^{g p k k'} (t) = \sum_{i = 0}^{1} α_{i j}^{g p k k'} (t) = \Pr (q_{t} = j | ω_{p}, Ξ^{0}, Ψ_{g k k'}^{0})$ . Upon the computation of α, β is readily computed, too. Again, omitting the superscripts, α can be initialized from its definition on the two direct descendants of the root,

α (D (0)) = \frac{1}{\Pr (ω_{p} | Ξ^{0}, Ψ_{g k k'}^{0})} {\begin{array}{l} (\begin{matrix} \begin{matrix} π_{0} γ_{0} (\bar{D} (0)) T_{00} (g, D (0)) & 0 \end{matrix} \\ \begin{matrix} π_{1} γ_{1} (\bar{D} (0)) T_{10} (g, D (0)) & 0 \end{matrix} \end{matrix}) & D (0) \in V_{0}, q_{0}^{D} = 0 \\ (\begin{array}{l} \begin{matrix} 0 & π_{0} γ_{0} (\bar{D} (0)) T_{01} (g, D (0)) \end{matrix} \\ \begin{matrix} 0 & π_{1} γ_{1} (\bar{D} (0)) T_{11} (g, D (0)) \end{matrix} \end{array}) & D (0) \in V_{0}, q_{0}^{D} = 1 \\ (\begin{matrix} π_{0} γ_{0} (\bar{D} (0)) {\tilde{γ}}_{0} (D (0)) T_{00} (g, D (0)) & π_{0} γ_{0} (\bar{D} (0)) {\tilde{γ}}_{1} (D (0)) T_{01} (D (0)) \\ π_{1} γ_{1} (\bar{D} (0)) {\tilde{γ}}_{0} (D (0)) T_{10} (D (0)) & π_{1} γ_{1} (\bar{D} (0)) {\tilde{γ}}_{1} (D (0)) T_{11} (D (0)) \end{matrix}) & D (0) \notin V_{0} . \end{array}

(17)

Here, D(0) stands for any one of the direct descendants of the root, and $\bar{D} (0)$ is its sibling. For any other internal node, α is computed using the outward-recursion

α (t) = (\begin{matrix} β_{0} (P (t)) {\tilde{γ}}_{0} (t) T_{00} (g, t) / γ_{0} (t) & β_{0} (P (t)) {\tilde{γ}}_{1} (t) T_{01} (g, t) / γ_{0} (t) \\ β_{1} (P (t)) {\tilde{γ}}_{0} (t) T_{10} (g, t) / γ_{1} (t) & β_{1} (P (t)) {\tilde{γ}}_{1} (t) T_{11} (g, t) / γ_{1} (t) \end{matrix}) .

(18)

To prove this recursion, let us start with the definition of α,

\begin{array}{l} α_{i j} (t) = \Pr (q_{t} = j, q_{t}^{P} = i | ω_{p}) = \Pr (q_{t} = j, q_{t}^{P} = i | V_{0}) = \\ \Pr (q_{t}^{P} = i | V_{0}) \cdot \Pr (q_{t} = j | q_{t}^{P} = i, V_{0}) = β_{i} (P (t)) \cdot \Pr (q_{t} = j | q_{t}^{P} = i, V_{0}), \end{array}

(19)

and make the decomposition V₀ = V_t+ ${\bar{V}}_{t}$ , with ${\bar{V}}_{t}$ being the set of all leaves such that node t is not among their ancestors. But, given $q_{t}^{P}$ , the state of node t is independent on ${\bar{V}}_{t}$ , and therefore (19) becomes

α_{i j} (t) = β_{i} (P (t)) \cdot \Pr (q_{t} = j | q_{t}^{P} = i, V_{t})

(20)

From Bayes formula,

\begin{array}{l} \Pr (q_{t} = j | q_{t}^{P} = i, V_{t}) = \frac{\Pr (q_{t} = j, V_{t} | q_{t}^{P} = i)}{\Pr (V_{t} | q_{t}^{P} = i)} = \\ \frac{\Pr (q_{t} = j | q_{t}^{P} = i) \cdot \Pr (V_{t} | q_{t} = j, q_{t}^{P} = i)}{γ_{i} (t)} = \frac{T_{i j} (g, t)}{γ_{i} (t)} \cdot \Pr (V_{t} | q_{t} = j, q_{t}^{P} = i) . \end{array}

(21)

But, given q_t, V_tis independent of P(t) and therefore

\Pr (V_{t} | q_{t} = j, q_{t}^{P} = i) = \Pr (V_{t} | q_{t} = j) = {\tilde{γ}}_{j} (t) .

(22)

Combining (22) and (21) in (20), we get

α_{i j} (t) = \frac{{\tilde{γ}}_{j} (t) β_{i} (P (t))}{γ_{i} (t)} T_{i j} (g, p),

which is just another form of (18). Finally, for each leaf that is not a descendant of the root

α (t) = {\begin{matrix} \begin{matrix} \begin{matrix} (\begin{matrix} β_{0} (P (t)) & 0 \\ β_{1} (P (t)) & 0 \end{matrix}) & q_{t} = 0 \end{matrix} \\ \begin{matrix} (\begin{matrix} 0 & β_{0} (P (t)) \\ 0 & β_{1} (P (t)) \end{matrix}) & q_{t} = 1. \end{matrix} \end{matrix} & t \in V_{0}, P (t) \neq 0 \end{matrix}

(23)

When missing data are present, two simple modifications are required. First, we have to add to the initialization phase (17) an option

\begin{matrix} α (D (0)) = \frac{1}{\Pr (ω_{p} | Ξ^{0}, Ψ_{g k k'}^{0})} (\begin{matrix} π_{0} γ_{0} (\bar{D} (0)) T_{00} (g, D (0)) & π_{0} γ_{0} (\bar{D} (0)) T_{01} (D (0)) \\ π_{1} γ_{1} (\bar{D} (0)) T_{10} (D (0)) & π_{1} γ_{1} (\bar{D} (0)) T_{11} (D (0)) \end{matrix}) & D (0) \in V_{0}, q_{0}^{D} = * \end{matrix}

Second, we have to add to the finalization phase (23) an option

\begin{matrix} α (t) = (\begin{matrix} β_{0} (P (t)) T_{00} (g, t) & β_{0} (P (t)) T_{01} (g, t) \\ β_{1} (P (t)) T_{10} (g, t) & β_{1} (P (t)) T_{11} (g, t) \end{matrix}) & q_{t} = * \end{matrix} .

These inward-outward recursions are the phylogenetic equivalent of the backward-forward recursions known from hidden Markov models, and other versions of this method have been developed previously [47, 48]. The version developed here can be shown to be the realization of the junction tree algorithm [49] on rooted bifurcating trees. The junction tree algorithm is a scheme to compute marginal probabilities of maximal cliques on graphs by means of belief propagation on a modified junction tree. Indeed, the matrix α computes marginal probabilities of pairs (t, P(t)), but such pairs are nothing but maximal cliques on rooted bifurcating trees.

Computing the coefficients w_gpkk'

Here we show that the γ-recursion is sufficient to compute the coefficients w_gpkk'. From the definition, $w_{g p k k'} = \Pr (ρ_{p}^{η} = k, ρ_{p}^{θ} = k' | ω_{p}, Ξ^{0}, Ψ_{g}^{0}, Λ^{0})$ . Using the Bayes formula Pr(x, y | z) = Pr(x, y, z)/Σ_{x, y}Pr(x, y, z), we can rewrite it as

\begin{array}{l} w_{g p k k'} = \frac{\Pr (ρ_{p}^{η} = k, ρ_{p}^{θ} = k', ω_{p} | Ξ^{0}, Ψ_{g}^{0}, Λ^{0})}{\sum_{h, h'} \Pr (ρ_{p}^{η} = h, ρ_{p}^{θ} = h', ω_{p} | Ξ^{0}, Ψ_{g}^{0}, Λ^{0})} = \\ = \frac{\Pr (ρ_{p}^{η} = k | Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (ρ_{p}^{θ} = k' | Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (ω_{p} | Ξ^{0}, Ψ_{g k k'}^{0})}{\sum_{h, h'} \Pr (ρ_{p}^{η} = h | Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (ρ_{p}^{θ} = h' | Ξ^{0}, Ψ_{g}^{0}, Λ^{0}) \cdot \Pr (ω_{p} | Ξ^{0}, Ψ_{g h h'}^{0})} . \end{array}

But $\Pr (ρ_{p}^{η} = k | Ξ^{0}, Ψ_{g}^{0}, Λ^{0})$ is the current estimate of the probability of the gain rate variable to have the value $r_{k}^{η}$ , namely ${(f_{k}^{η})}^{0}$ . Similarly, $\Pr (ρ_{p}^{θ} = k' | Ξ^{0}, Ψ_{g}^{0}, Λ^{0})$ is just ${(f_{k'}^{θ})}^{0}$ . Therefore, the expression for the coefficients w_gpkk'reduces to

w_{g p k k'} = \frac{{(f_{k}^{η})}^{0} {(f_{k'}^{θ})}^{0} \Pr (ω_{p} | Ξ^{0}, Ψ_{g k k'}^{0})}{\sum_{h, h'} {(f_{h}^{η})}^{0} {(f_{h'}^{θ})}^{0} \Pr (ω_{p} | Ξ^{0}, Ψ_{g h h'}^{0})} .

(24)

The function $\Pr (ω_{p} | Ξ^{0}, Ψ_{g k k'}^{0})$ is the likelihood of observing pattern ω_pfor gain and loss rate variables $r_{k}^{η}$ and $r_{k'}^{θ}$ , respectively. This is readily computed upon completion of the γ-recursion, using (15).

Computing the coefficients Q_gpkk'

Here we show that those coefficients require the α, β-recursion. By definition,

Q_{g p k k'} = \sum_{σ} \Pr (σ | ω_{p}, Ξ^{0}, Ψ_{g k k'}^{0}) \cdot [\log f_{k}^{η} + \log f_{k'}^{θ} + \log \Pr (ω_{p}, σ | Ξ, Ψ_{g k k'})] .

The probability Pr(ω_p, σ | Ξ, Ψ_gkk') is the likelihood of a particular realization of the tree, thus from (1)

\log \Pr (ω_{p}, σ | Ξ, Ψ_{g k k'}) = \sum_{i = 0}^{1} δ (q_{0}, i) \cdot \log π_{i} + \sum_{i, j = 0}^{1} \sum_{t = 1}^{N - 1} δ (q_{t}, j) δ (q_{t}^{P}, i) \cdot \log T_{i j} (g, t) .

(25)

Here, δ(a, b) is the Kronecker delta function, which is 1 for a = b and 0 otherwise. Denote the expectation over $\Pr (σ | ω_{p}, Ξ^{0}, Ψ_{g k k'}^{0})$ by E_σ. Applying it to (25), we get

E_{σ} [\log \Pr (ω_{p}, σ | Ξ, Ψ_{g k k'})] = \sum_{i = 0}^{1} \log π_{i} \cdot E_{σ} [δ (q_{0}, i)] + \sum_{i, j = 0}^{1} \sum_{t = 1}^{N - 1} \log T_{i j} (g, t) \cdot E_{σ} [δ (q_{t}, j) δ (q_{t}^{P}, i)] .

But $E_{σ} [δ (q_{0}, i)] = \Pr (q_{0} = i | ω_{p}, Ξ^{0}, Ψ_{g k k'}^{0}) = β_{i} (0)$ , and similarly $E_{σ} [δ (q_{t}, j) δ (q_{t}^{P}, i)] = α_{i j} (t)$ .

Hence, Q_gpkk'is given by

\begin{array}{l} Q_{g p k k'} = \sum_{σ} \Pr (σ | ω_{p}, Ξ^{0}, Ψ_{g k k'}^{0}) [\log f_{k}^{η} + \log f_{k'}^{θ} + \log \Pr (ω_{p}, σ | Ξ, Ψ_{g k k'})] = \\ = \log f_{k}^{η} + \log f_{k'}^{θ} + \sum_{i = 0}^{1} β_{i} (0) \log π_{i} + \sum_{i, j = 0}^{1} \sum_{t = 1}^{N - 1} α_{i j} (t) \log T_{i j} (g, t) . \end{array}

(26)

The M-step

Substituting (26) in (11), we obtain an explicit form of the function whose maximization guarantees stepping up-hill in the likelihood landscape,

\begin{matrix} Q = \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} n_{g p} w_{g p k k'} (\log f_{k}^{η} + \log f_{k'}^{θ}) + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} n_{g p} w_{g p k k'} [β_{0}^{g p k k'} (0) \log π_{0} + β_{1}^{g p k k'} (0) \log π_{1}] + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} \sum_{t = 1}^{N - 1} n_{g p} w_{g p k k'} α_{00}^{g p k k'} (t) \log [1 - ξ_{t} (1 - e^{- η_{g k} Δ_{t}})] + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} \sum_{t = 1}^{N - 1} n_{g p} w_{g p k k'} α_{01}^{g p k k'} (t) [\log ξ_{t} + \log (1 - e^{- η_{g k} Δ_{t}})] + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} \sum_{t = 1}^{N - 1} n_{g p} w_{g p k k'} α_{10}^{g p k k'} (t) \log [1 - (1 - φ_{t}) e^{- θ_{g k'} Δ_{t}}] + \\ + \sum_{g = 1}^{G} \sum_{p = 1}^{Ω} \sum_{k = 1}^{K_{η}} \sum_{k' = 1}^{K_{θ}} \sum_{t = 1}^{N - 1} n_{g p} w_{g p k k'} α_{11}^{g p k k'} (t) [\log (1 - φ_{t}) - θ_{g k'} Δ_{t}] . \end{matrix}

(27)

Actually, any increase in Q is sufficient to guarantee an increase in the likelihood, suggesting that the precise maximization of Q is not particularly important. Therefore, we speed up the computations by performing low-tolerance maximization with respect to each of the parameters individually. Except for the parameters λ_ηand λ_θ, it is easy to differentiate Q twice with respect to any parameter. This lends itself to using simple zero-finding algorithms of which we chose the Newton-Raphson algorithm [50].

Maximizing Q with respect to the shape parameters λ_ηand λ_θis more involved, as Q depends on these parameters only through the discrete approximation of the rate variability distributions (3). In our implementation, we used Yang's quantile method [34] to compute the discrete levels of the gamma distributions such that each level has equal probability. Formally, $f_{1}^{η} = ν, f_{k}^{η} = (1 - ν) / (K_{η} - 1)$ for k = 2, ..., K_ηand $f_{k}^{θ} = 1 / K_{θ}$ for k = 1, ..., K_θ. To perform the maximization in this case, we used Brent's maximization algorithm that does not require derivatives [50].

Reconstruction of Ancestral States and Events

Given the α, β, γ-recursions on the tree with the final model parameters, it is straightforward to reconstruct the history of intron evolution, and to assign gains and losses to specific branches. The number of introns in an internal node t for a gene g, assuming gain rate variable $r_{k}^{η}$ and loss rate variable $r_{k'}^{θ}$ , given that the observed pattern is ω_p, is $n_{g p} ω_{g p k k'} β_{1}^{g p k k'} (t)$ . Similarly, the number of loss events along the branch t is $n_{g p} ω_{g p k k'} α_{10}^{g p k k'} (t)$ , and the number of gain events along this branch is $n_{g p} ω_{g p k k'} α_{01}^{g p k k'} (t)$ .

Confidence Intervals

In order to obtain confidence intervals on the model parameters, we used the profile likelihood technique. In brief, if ϑ is one parameter in the model, and $\bar{Θ}$ is the set of remaining parameters, then the profile likelihood of ϑ is defined as

L (ϑ) = \max_{\bar{Θ}} \Pr (ϑ, \bar{Θ}) .

That is, we compute the maximum likelihood under the constraint that the value of ϑ is given. If we denote the overall maximum likelihood by L(Θ) = max_Θ Pr(Θ), then the likelihood ratio

λ = - 2 \ln \frac{L (ϑ_{0})}{L (Θ)},

(28)

under the hypothesis that ϑ = ϑ₀ is distributed according to the χ² distribution with one degree of freedom. In order to find the 95% confidence interval of the parameter ϑ around its optimal value ϑ₀, we find the value of ϑ for which the likelihood ratio (28) exceeds the value 3.84 (95%-percentile of the χ²(1) distribution). This value is found numerically using Ridder's method [50].

References

Lynch M, Richardson AO: The evolution of spliceosomal introns. Curr Opin Genet Dev. 2002, 12 (6): 701-710. 10.1016/S0959-437X(02)00360-X.
Article CAS PubMed Google Scholar
Roy SW, Gilbert W: The evolution of spliceosomal introns: patterns, puzzles and progress. Nat Rev Genet. 2006, 7 (3): 211-221.
PubMed Google Scholar
Rodríguez-Trelles F, Tarrío R, Ayala FJ: Origins and evolution of spliceosomal introns. Annu Rev Genet. 2006, 40: 47-76. 10.1146/annurev.genet.40.110405.090625.
Article PubMed Google Scholar
Nixon JE, Wang A, Morrison HG, McArthur AG, Sogin ML, Loftus BJ, Samuelson J: A spliceosomal intron in Giardia lamblia. Proc Natl Acad Sci U S A. 2002, 99 (6): 3701-3705. 10.1073/pnas.042700299.
Article PubMed Central CAS PubMed Google Scholar
Simpson AG, MacQuarrie EK, Roger AJ: Eukaryotic evolution: early origin of canonical introns. Nature. 2002, 419 (6904): 270-10.1038/419270a.
Article CAS PubMed Google Scholar
Vanacova S, Yan W, Carlton JM, Johnson PJ: Spliceosomal introns in the deep-branching eukaryote Trichomonas vaginalis. Proc Natl Acad Sci U S A. 2005, 102 (12): 4430-4435. 10.1073/pnas.0407500102.
Article PubMed Central CAS PubMed Google Scholar
Russell AG, Shutt TE, Watkins RF, Gray MW: An ancient spliceosomal intron in the ribosomal protein L7a gene (Rpl7a) of Giardia lamblia. BMC Evol Biol. 2005, 5: 45-10.1186/1471-2148-5-45.
Article PubMed Central PubMed Google Scholar
Koonin EV: The origin of introns and their role in eukaryogenesis: A compromise solution to the introns-early versus introns-late debate?. Biol Direct. 2006, 1 (1): 22-10.1186/1745-6150-1-22.
Article PubMed Central PubMed Google Scholar
Martin W, Koonin EV: Introns and the origin of nucleus-cytosol compartmentalization. Nature. 2006, 440 (7080): 41-45. 10.1038/nature04531.
Article CAS PubMed Google Scholar
Rogozin IB, Sverdlov AV, Babenko VN, Koonin EV: Analysis of evolution of exon-intron structure of eukaryotic genes. Brief Bioinform. 2005, 6 (2): 118-134. 10.1093/bib/6.2.118.
Article CAS PubMed Google Scholar
Fedorov A, Merican AF, Gilbert W: Large-scale comparison of intron positions among animal, plant, and fungal genes. Proc Natl Acad Sci U S A. 2002, 99 (25): 16128-16133. 10.1073/pnas.242624899.
Article PubMed Central CAS PubMed Google Scholar
Qiu WG, Schisler N, Stoltzfus A: The evolutionary gain of spliceosomal introns: sequence and phase preferences. Mol Biol Evol. 2004, 21 (7): 1252-1263. 10.1093/molbev/msh120.
Article CAS PubMed Google Scholar
Rogozin IB, Wolf YI, Sorokin AV, Mirkin BG, Koonin EV: Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Curr Biol. 2003, 13 (17): 1512-1517. 10.1016/S0960-9822(03)00558-X.
Article CAS PubMed Google Scholar
Roy SW, Fedorov A, Gilbert W: Large-scale comparison of intron positions in mammalian genes shows intron loss but no gain. Proc Natl Acad Sci U S A. 2003, 100 (12): 7158-7162. 10.1073/pnas.1232297100.
Article PubMed Central CAS PubMed Google Scholar
Roy SW, Gilbert W: The pattern of intron loss. Proc Natl Acad Sci U S A. 2005, 102 (3): 713-718. 10.1073/pnas.0408274102.
Article PubMed Central CAS PubMed Google Scholar
Roy SW, Gilbert W: Complex early genes. Proc Natl Acad Sci U S A. 2005, 102 (6): 1986-1991. 10.1073/pnas.0408355101.
Article PubMed Central CAS PubMed Google Scholar
Csuros M: Likely scenarios of intron evolution. Comparative Genomics Lecture Notes in Computer Science. 2005, 3678: 47-60.
Article Google Scholar
Nguyen HD, Yoshihama M, Kenmochi N: New maximum likelihood estimators for eukaryotic intron evolution. PLoS Comput Biol. 2005, 1 (7): e79-10.1371/journal.pcbi.0010079.
Article PubMed Central PubMed Google Scholar
Carmel L, Rogozin IB, Wolf YI, Koonin EV: Evolutionarily conserved genes preferentially accumulate introns. Genome Res. 2007, 17: 1045-1050. 10.1101/gr.5978207.
Article PubMed Central CAS PubMed Google Scholar
Carmel L, Wolf YI, Rogozin IB, Koonin EV: Three distinct modes of intron dynamics in the evolution of eukaryotes. Genome Res. 2007, 17: 1034-1044. 10.1101/gr.6438607.
Article PubMed Central CAS PubMed Google Scholar
Roy SW, Gilbert W: Rates of intron loss and gain: implications for early eukaryotic evolution. Proc Natl Acad Sci U S A. 2005, 102 (16): 5773-5778. 10.1073/pnas.0500383102.
Article PubMed Central CAS PubMed Google Scholar
Carmel L, Rogozin IB, Wolf YI, Koonin EV: An expectation-maximization algorithm for analysis of evolution of exon-intron structure of eukaryotic genes. Comparative Genomics Lecture Notes in Computer Science. 2005, 3678: 35-46.
Article Google Scholar
Sverdlov AV, Rogozin IB, Babenko VN, Koonin EV: Conservation versus parallel gains in intron evolution. Nucleic Acids Res. 2005, 33 (6): 1741-1748. 10.1093/nar/gki316.
Article PubMed Central CAS PubMed Google Scholar
Nei M, Chakraborty R, Fuerst PA: Infinite allele model with varying mutation rate. Proc Natl Acad Sci U S A. 1976, 73 (11): 4164-4168. 10.1073/pnas.73.11.4164.
Article PubMed Central CAS PubMed Google Scholar
Uzzell T, Corbin KW: Fitting discrete probability distributions to evolutionary events. Science. 1971, 172 (988): 1089-1096. 10.1126/science.172.3988.1089.
Article CAS PubMed Google Scholar
Felsenstein J: Inferring Phylogenies. 2004, Sunderland, MA , Sinauer Associates
Google Scholar
Dibb NJ: Proto-splice site model of intron origin. J Theor Biol. 1991, 151 (3): 405-416. 10.1016/S0022-5193(05)80388-1.
Article CAS PubMed Google Scholar
Dibb NJ, Newman AJ: Evidence that introns arose at proto-splice sites. Embo J. 1989, 8 (7): 2015-2021.
PubMed Central CAS PubMed Google Scholar
Sverdlov AV, Rogozin IB, Babenko VN, Koonin EV: Reconstruction of ancestral protosplice sites. Curr Biol. 2004, 14 (16): 1505-1508. 10.1016/j.cub.2004.08.027.
Article CAS PubMed Google Scholar
Gu X, Fu YX, Li WH: Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol Biol Evol. 1995, 12 (4): 546-557.
CAS PubMed Google Scholar
Hasegawa M, Kishino H, Yano T: Man's place in Hominoidea as inferred from molecular clocks of DNA. J Mol Evol. 1987, 26 (1-2): 132-147. 10.1007/BF02111287.
Article CAS PubMed Google Scholar
Mayrose I, Friedman N, Pupko T: A Gamma mixture model better accounts for among site rate heterogeneity. Bioinformatics. 2005, 21 Suppl 2: ii151-ii158. 10.1093/bioinformatics/bti1125.
PubMed Google Scholar
Jin L, Nei M: Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol Biol Evol. 1990, 7 (1): 82-102.
CAS PubMed Google Scholar
Yang Z: Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 1994, 39 (3): 306-314. 10.1007/BF00160154.
Article CAS PubMed Google Scholar
Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977, 39: 138-
Google Scholar
Nguyen HD, Yoshihama M, Kenmochi N: Authors' reply. PLoS Comput Biol. 2006, 2: e83-10.1371/journal.pcbi.0020083.
Article PubMed Central Google Scholar
Stoltzfus A, Logsdon JM, Palmer JD, Doolittle WF: Intron "sliding" and the diversity of intron positions. Proc Natl Acad Sci U S A. 1997, 94 (20): 10739-10744. 10.1073/pnas.94.20.10739.
Article PubMed Central CAS PubMed Google Scholar
Roy SW: Intron-rich ancestors. Trends Genet. 2006, 22 (9): 468-471. 10.1016/j.tig.2006.07.002.
Article CAS PubMed Google Scholar
Hedges SB, Kumar S: Genomic clocks and evolutionary timescales. Trends Genet. 2003, 19: 200-206. 10.1016/S0168-9525(03)00053-2.
Article Google Scholar
Hedges SB: The origin and evolution of model organisms. Nat Rev Genet. 2002, 3 (11): 838-849. 10.1038/nrg929.
Article CAS PubMed Google Scholar
Stechmann A, Cavalier-Smith T: The root of the eukaryote tree pinpointed. Curr Biol. 2003, 13 (17): R665-6. 10.1016/S0960-9822(03)00602-X.
Article CAS PubMed Google Scholar
Wolf YI, Rogozin IB, Koonin EV: Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 2004, 14: 29-36. 10.1101/gr.1347404.
Article PubMed Central CAS PubMed Google Scholar
Rogozin IB, Wolf YI, Carmel L, Koonin EV: Ecdysozoan clade rejected by genome-wide analysis of rare amino acid replacements. Mol Biol Evol. 2007, 24 (4): 1080-1090. 10.1093/molbev/msm029.
Article CAS PubMed Google Scholar
Aguinaldo AM, Turbeville JM, Linford LS, Rivera MC, Garey JR, Raff RA, Lake JA: Evidence for a clade of nematodes, arthropods and other moulting animals. Nature. 1997, 387 (6632): 489-493. 10.1038/387489a0.
Article CAS PubMed Google Scholar
Philippe H, Lartillot N, Brinkmann H: Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol. 2005, 22 (5): 1246-1253. 10.1093/molbev/msi111.
Article CAS PubMed Google Scholar
Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17 (6): 368-376. 10.1007/BF01734359.
Article CAS PubMed Google Scholar
Friedman N, Ninio M, Pe'er I, Pupko T: A structural EM algorithm for phylogenetic inference. J Comput Biol. 2002, 9 (2): 331-353. 10.1089/10665270252935494.
Article CAS PubMed Google Scholar
Siepel A, Haussler D: Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004, 21 (3): 468-488. 10.1093/molbev/msh039.
Article CAS PubMed Google Scholar
Castillo E, Gutierrez JM, Hadi AS: Expert systems and probabilistic network models (Monographs in Computer Science). 1996, New York , Springer
Google Scholar
Press WH, Flannery BP, Teukolsky SA, Vetterling WT: Numerical recipes in C: The art of scientific computing. 1992, New York , Cambridge University Press, 2nd
Google Scholar

Download references

Acknowledgements

We thank Jacob Goldberger for useful discussions regarding aspects of the EM algorithm. This work was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.

Author information

Authors and Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, 20894, USA
Liran Carmel, Igor B Rogozin, Yuri I Wolf & Eugene V Koonin

Authors

Liran Carmel
View author publications
You can also search for this author in PubMed Google Scholar
Igor B Rogozin
View author publications
You can also search for this author in PubMed Google Scholar
Yuri I Wolf
View author publications
You can also search for this author in PubMed Google Scholar
Eugene V Koonin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Eugene V Koonin.

Additional information

Authors' contributions

LC contributed to the development of the probabilistic model, developed the EM algorithm, and drafted the manuscript; IBR collected the data and contributed to the development of the probabilistic model; YIW contributed to the development of the probabilistic model; EVK conceived of the study, provided the biological interpretation of the results, and finalized the manuscript. All authors read and approved the final manuscript.

Electronic supplementary material

12862_2007_483_MOESM1_ESM.doc

Additional file 1: Maximum likelihood estimation of the model parameters. Each parameter is associated with three values – the lower 95% confidence bound, the optimal value, and the upper 95% confidence bound. The confidence intervals were computed using the profile likelihood technique. (DOC 217 KB)

12862_2007_483_MOESM2_ESM.doc

Additional file 2: Comparison of the number of shared intron positions with the number expected by chance. All pairs of species (20), where the number of shared intron positions is not significantly greater than the number expected by chance alone. (DOC 55 KB)

12862_2007_483_MOESM3_ESM.doc

Additional file 3: Pattern-by-pattern analysis of parallel gains. The order of species in each pattern is Dicdi, Caeel, Strpu, Cioin, Danre, Galga, Homsa, Roden, Drome, Anoga, Cryne, Schpo, Sacce, Aspfu, Neucr, Arath, Orysa, Thepa, and Plafa. The frequency of a pattern is the number of times it was observed in our data. (DOC 550 KB)

12862_2007_483_MOESM4_ESM.doc

Additional file 4: The phylogenetic tree of eukaryotes used in the present study. Species and lineage abbreviations: Caeel (Caenorhabditis elegans), Strpu (Strongylocentrotus purpuratus), Cioin (Ciona intestinalis), Danre (Danio rerio), Galga (Gallus gallus), Homsa (Homo sapiens), roden (Mus musculus and Rattus norvegicus combined), Drome (Drosophila melanogaster), Anoga (Anopheles gambiae), cryne (Cryptococcus neoformans), Schpo (Schizosaccharomyces pombe), Sacce (Saccharomyces cerevisiae), Aspfu (Aspergillus fumigatus), Neucr (Neurospora crassa), Arath (Arabidopsis thaliana), Orysa (Oryza sativa), Thepa (Theileria parva), Plafa (Plasmodium falciparum), Dicdi (Dictyostelium discoideum), AME (Ancestor of Multicellular Eukaryotes). (DOC 36 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Carmel, L., Rogozin, I.B., Wolf, Y.I. et al. Patterns of intron gain and conservation in eukaryotic genes. BMC Evol Biol 7, 192 (2007). https://doi.org/10.1186/1471-2148-7-192

Download citation

Received: 18 June 2007
Accepted: 12 October 2007
Published: 12 October 2007
DOI: https://doi.org/10.1186/1471-2148-7-192

Patterns of intron gain and conservation in eukaryotic genes

Abstract

Background:

Results:

Conclusion:

Background

Results and discussion

Notation

The probabilistic model

The Expectation-Maximization algorithm

One in 7 nucleotides is a potential intron insertion site

The number of shared intron positions is much greater than expected by chance

Twelve out of thirteen shared intron positions reflect common ancestry

Intron retention

Conclusion

Methods

The data set

Phylogenetic tree topology

The EM Algorithm

The E-step

The inward (γ) recursion

The outward (α-β) recursion

Computing the coefficients w gpkk'

Computing the coefficients Qgpkk'

The M-step

Reconstruction of Ancestral States and Events

Confidence Intervals

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Electronic supplementary material

12862_2007_483_MOESM1_ESM.doc

12862_2007_483_MOESM2_ESM.doc

12862_2007_483_MOESM3_ESM.doc

12862_2007_483_MOESM4_ESM.doc

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Ecology and Evolution

Contact us

Computing the coefficients w_gpkk'

Computing the coefficients Q_gpkk'