A scale invariant clustering of genes
The Chromosome 7 Annotation Project demonstrated that the density of gene structures within the chromosome was heterogeneous [3]. Figure 1 provides the numbers of such structures contained within a sequence of equal-sized non-overlapping bins that spanned the physical length of chromosome 7. The high degree of heterogeneity in local gene density as demonstrated here seemed, at least on a qualitative level, to indicate clustering.
A quantitative test for clustering was performed upon these data. If the genes were randomly scattered throughout the chromosome, without clustering, their dispersal should reflect a Poisson distribution – the usual model for such randomness. To determine whether or not this was the case, the variance var(Z) and the mean E(Z) of the number of genes per bin, Z, were compared. At the scale of the 200 kb bins the variance and mean were, var(Z) = 6.6 and E(Z) = 2.3, with a variance/mean ratio of approximately 2.9. The variance should have equalled the mean with a Poisson distribution. This finding indicated that the genes were more dispersed than could be predicted by a random distribution, and thus there was clustering at this scale.
To determine whether this clustering persisted at other measurement scales, the variance and mean number of gene structures per bin were estimated for a range of bin sizes. Figure 2 provides these data on a log-log plot of variance versus mean. The logarithmically transformed points seemed to describe a linear relationship. Indeed the correlation coefficient squared, estimated between the transformed variance and mean estimates, was r2 = 0.997 thus substantiating a linear relationship. As well, the residuals between the logarithmically transformed variables and a trial linear relationship were essentially negligible and normally distributed about zero (Fig. 2 insert). It should be mentioned that the linear relationship tested here against the data in Fig. 2 was obtained not from the regression fit of the logarithmically transformed data, but from a statistical model that was fitted to the chromosome 7 data and that will be presented later in this article.
The strong linear relationship between the logarithmically transformed variances and means indicated that the variance and the mean were related by a power function,
var(Z) = a·E(Z)p,
where a and p were constants. If the gene structures had been randomly distributed along chromosome 7, as per a Poisson distribution, one would have expected an exponent p = 1. Since p = 1.51, this provided further confirmation of clustering, which now was evident over a range of measurement scales.
This variance to mean power function exhibited another property of note – scale invariance. This term, as used here, indicates that if one takes a small segment of a pattern and magnifies it to a larger scale, then the magnified portion should be statistically similar to the unmagnified portion. Specifically, if the measurement scale employed in the variance to mean relationship is increased by a factor c then
a·[c·E(Z)]p= (acp)·[E(Z)]p,
and this relationship would retain the form of a power function with exponent p. In fact, this variance to mean power function is the only possible scale invariant relationship that could exist between the variance and the mean [5]. As will be seen below the variance to mean power function implicated a specific probabilistic model to represent the distribution of gene structures along chromosome 7. Before we can discuss this model, it would be useful to consider a somewhat more conventional model, for the distribution of the number of genes within chromosomal segments.
An overdispersed Poisson distribution for conserved segments
Earlier in this article the distribution of genes within conserved segments was alluded to in the context of observations provided by the Human Genome Project [4]. Conserved segments are conventionally identified on the basis of the relative order of contiguous landmarks between the chromosomes of two different species. If one were to define the enumerative bins employed here on the basis of the limits of individual conserved segments, one might expect genes to be randomly distributed within these bins, in accordance with a Poisson distribution [6]. One might also expect some heterogeneity between different conserved segments, such that the mean number of genes per conserved segment would depend upon both the physical length of the segment and upon the local gene density. The distribution of the mean number of genes per conserved segment has been conventionally represented by a gamma distribution, giving an overdispersed Poisson distribution (i.e., a negative binomial distribution) for the actual number of genes per conserved segment [6].
How would a negative binomial distribution affect the variance to mean relationship as plotted in Fig. 2? With some simple calculation we have,
where ξ is a constant. In Fig. 2 the optimised least squares fit of this additional variance to mean relationship was plotted as a broken line (with ξ = 3.12). Larger deviations were seen between this relationship and the data, relative to those seen with the variance to mean power function.
This discrepancy between the negative binomial model and the variance to mean data was probably to be expected. In the analysis presented here, the estimates for the variance and the mean had been calculated from bins of uniform size rather than from individual conserved segments, as the negative binomial model would properly have required. These uniformly sized bins would presumably have contained variable numbers of conserved segments and, since the convolution of a random number of negative binomial distributions would not result in a negative binomial distribution, a negative binomial model would have been inappropriate. In the next section, a modification of this model will be presented that will account for a variable number of conserved segments per bin.
A scale invariant Poisson-gamma model for the number of gene structures per bin
One view of chromosomal structure represents chromosomes as mosaics formed from the random fragmentation and rearrangement of more primitive chromosomes [1, 7]. Possibly then, the number of genes within unit segments of vertebrate chromosomes might reflect this segmental structure. A modified model for gene distribution might thus have to account for the contribution of multiple genomic segments that would individually exhibit statistical behaviour related to the conventional gamma model for the mean number of genes.
In the initial specification of such a statistical model one might stipulate that the model be made as general as possible. The generalized linear models of Nelder and Wedderburn [8] provide a simple method to analyse a large variety of data, and these models can be further generalized into an even wider class of models called exponential dispersion models [5]. These latter models provide descriptions for a comprehensive range of normal and non-normal distributions that include the Poisson, gamma and Gaussian distributions. The reader is encouraged to refer to an excellent introduction to these models in the monograph provided by Jørgensen [5].
If one accepts the premise that some form of exponential dispersion model might describe the distribution of the number of genes per bin then, consequent to the finding of the variance to mean power function, one would be lead to a statistical model that uniquely exhibits such a power function, where its exponent p is constrained to range between the values of 1 and 2. This particular model is described by a scale invariant Poisson-gamma (PG) distribution for which its additive form has the cumulant generating function K* (s) [5],
Here s is a variable used to base the generating function upon, λ is the index parameter, θ is the canonical parameter, α is a parameter related to the power function exponent such that
and the cumulant function κ (θ) is given by,
The variance to mean power function,
var(Z) = λ1/(α-1)·E(Z)p,
follows as a consequence of this cumulant generating function.
There have been many alternative explanations for the variance to mean power function, some that represented approximations and others exact models [reviewed in [9]]. What would favour the choice of an exponential dispersion model over any of these alternatives? Much of the justification goes back to the theory of errors as developed by Gauss and others [5]. The Gaussian distribution, which has been widely applied to describe measurement error, provides a familiar description for errors of small magnitude. However, there exist processes with larger non-Gaussian errors which require a more generalized theory, such as the fluctuations apparent to the chromosome 7 data. Nelder and Wedderburn [8], with their generalized linear models, provided a simple means to analyze a large range of non-Gaussian data; exponential dispersion models represent an extension to their theory.
The utility of the exponential dispersion models becomes most apparent when one considers a class of these models characterized by the variance to mean power function. These Tweedie models, named after M.C.K. Tweedie who first studied them [10], serve as limiting distributions for a wider range of exponential dispersion models [5]. Much as the Gaussian model serves as a limiting distribution for a range of statistical processes, the Tweedie models, which include the Gaussian distribution as a special case, represent limiting distributions for a range of non-Gaussian processes. True, one may employ one of the alternative models to account for the variance to mean power function, but the burden would be then to develop a theory for the specific case which nonetheless would lack the generality offered by the exponential dispersion models. Moreover, such an alternative model, as indicated by the Tweedie convergence theorem, might be approximated by a Tweedie model. For these reasons the more general approach, allowed by the theory of exponential dispersion models, seems appropriate.
Granted these considerations in favour of exponential dispersion models, let us consider the PG distribution in more detail. It is difficult to describe the probability density function and its corresponding cumulative distribution function (CDF) for this distribution, since these expressions do not exist in closed form [5]. The probability density function p*(z; θ,λ,α) can be expressed in terms of the canonical statistic z such that,
p*(z; θ, λ, α) = c*(z; λ)·exp[θ·z - λκ(θ)].
where
The CDF P*(z; θ,λ,α) can then be expressed:
How well does the PG distribution fit the observed data? Figure 3 provides the empirical CDF, as obtained from a bin size of 200 kb and fitted to the theoretical PG CDF (Eq. 1). The fit was very good, with at most a 1.4% deviation between theory and observation. An analysis of the residuals (Fig. 3 insert) revealed that they were essentially negligible and normally distributed about zero. A Kolmogorov Smirnov test additionally confirmed an acceptable fit of the theoretical PG model to the empirical CDF.
Three parameters were derived from the regression of the PG CDF (Eq. 1) to the empirical CDF: α = -0.952, λ = 0.245 and θ = -0.691. These were the parameters employed to provide the theoretical variance to mean power function given in Fig. 2, with p = (α - 2)/(α - 1) = 1.51 and a = λ1/(α-1) = 2.06, and for which the agreement with the chromosome 7 data was also very good (χ2 = 0.231, d.f. = 200, P = 1). Thus two different tests of the PG distribution were provided here to confirm its agreement with the chromosome 7 data: the fit of the CDF and the fit of the variance to mean power function to the chromosome 7 data.