### Notation

Here, and for the rest of the paper, we use the following notation: *N* and *K* are respectively the number of species and the number of trees; \theta = (\beta_{0}, \beta_{1}, \dots, \sigma, \dots) is the vector containing the parameters to be estimated; \beta = (\beta_{0}, \beta_{1}, \dots) is the vector of linear regression parameters. *Y* is a data vector of length *N* and *X* is a design matrix containing predictors for the linear regression, so that \mathbb{E}(y_{i} \mid X, \theta) = \beta_{0} x_{i,0} + \beta_{1} x_{i,1} + \dots, and *Σ* is a scaled variance-covariance matrix calculated from an ultrametric tree. *Σ* should be scaled to a height of one rather than to the tree's maximum branch length, because the units of branch length may be meaningful for the phylogeny (e.g. millions of years, or number of mutations) but are unrelated to the units of the trait data, so relative lengths should be used. In this way, *σ*^{2} can be directly interpreted as the residual variance and *Σ* as a correlation structure. Other notation will be specified when needed. The distribution of *Σ* can be a distribution computed by Bayesian phylogenetic software (e.g. BEAST [35] or MrBayes [36]). In general, we will write:

\Sigma \sim \pi \left(\xi \right)

(5)

where *π* represents any relevant distribution with parameters *ξ*.
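The scaling of *Σ* to a height of one can be sketched as follows (a minimal Python illustration with a hypothetical three-species matrix; in practice the variance-covariance matrix would be computed from the tree, e.g. with the `vcv` function of the `ape` package in R):

```python
import numpy as np

# Hypothetical variance-covariance matrix from an ultrametric tree of height 3:
# entries are shared branch lengths between pairs of species.
vcv = np.array([[3.0, 1.0, 0.5],
                [1.0, 3.0, 0.5],
                [0.5, 0.5, 3.0]])

# In an ultrametric tree, every diagonal entry equals the tree height.
height = vcv.diagonal().max()

# Scaled matrix: unit diagonal, so sigma^2 is directly the residual variance
# and Sigma acts as a correlation structure.
Sigma = vcv / height
```

With this scaling, `Sigma` has ones on the diagonal and off-diagonal entries between zero and one, regardless of whether the tree was measured in millions of years or in expected substitutions.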

### Linear regression model

In order to illustrate the practical nature of our methods, we first give a simple example. One classic model for comparative analysis is a linear regression across a multispecies data set. To construct it, we used a multivariate normal for the likelihood and conjugate priors. The model can be specified as follows:

\begin{array}{ll}
Y \mid X, \beta, \sigma, \Sigma & \sim \mathcal{N}(X\beta, \sigma^{2}\Sigma) \\
\beta & \sim \mathcal{N}(0, 10^{6}) \\
\sigma^{-2} & \sim \Gamma(\epsilon, \epsilon) \\
\Sigma & \sim \pi(\xi)
\end{array}

(6)

The priors on the components of *β* are the usual non-informative conjugate univariate normal priors [33, p.578-583]. The Gamma prior (labelled Γ) on *σ*^{−2} is weakly informative for small variances [54], depending on the value of *ε*, but its conjugacy with the multivariate normal seems to help in avoiding autocorrelation in successive samples from the Markov chain Monte Carlo algorithm. This model is quick to converge and usually shows negligible autocorrelation.
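To make the likelihood in Equation 6 concrete, the following sketch (Python, not the BUGS/JAGS code used in our analyses; a hypothetical identity matrix stands in for the tree-derived *Σ*) simulates data from the model and recovers *β* by generalised least squares, which for fixed *σ* and *Σ* matches the centre of the posterior under vague priors on *β*:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
Sigma = np.eye(N)            # hypothetical: identity in place of a tree-derived matrix
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept and one predictor
beta_true = np.array([1.0, 2.0])
sigma = 0.5

# Simulate Y ~ N(X beta, sigma^2 Sigma) via the Cholesky factor of Sigma
L = np.linalg.cholesky(Sigma)
Y = X @ beta_true + sigma * (L @ rng.normal(size=N))

# Generalised least squares estimate of beta, accounting for the correlation structure
Si = np.linalg.inv(Sigma)
beta_hat = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ Y)
```

With a non-identity *Σ*, the same GLS formula down-weights the information contributed by closely related (highly correlated) species.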

### Measurement error model accounting for intraspecific variation

Comparative analyses frequently represent each species by a single value, such as a mean estimated from a small sample of individuals. Often, the intraspecific variance in trait values is not considered. Such variance can arise from sources including meaningful biological variation among individuals, inaccuracies of measurement, or poor sampling. Analyses that do not consider such "measurement error" may lead to biases or inaccuracies in evolutionary inferences [55].

Here we develop a model for the relationship between two traits across species, and incorporate variation across individuals within species, by using measurements from multiple individuals per species. The forms of intraspecific variation that this model can incorporate are: *(i)* natural variation across individuals that is not correlated between the two traits, and/or *(ii)* the observer's 'measurement errors' *sensu stricto*. Therefore this model does not incorporate phenotypic covariance *within* species (i.e. the situation where the values of each trait are correlated across individuals within the species), though it does incorporate phylogenetic covariance of the traits across species.

Here we focus on the situation where one measurement was taken per individual, and hence we treat natural variation and measurement error *sensu stricto* together. However, it would be possible to model these two variance components separately, if multiple observations (measurements) were made on each individual, for multiple individuals per species.

We take several *individual measurements*, and assume these to be normally distributed around an unknown species mean (which we call the *species level value*). The evolutionary relationship between two traits is modelled as a linear relation between the unobserved species mean values. This model is therefore a form of measurement error model [56]. We denote the *individual measurements* for each trait by the *N*×*n* matrices *W* and *V*, where *N* is again the number of species and *n* is the number of observations per species. The *species level* variables are defined as the corresponding (unobserved) vectors *X* and *Y*. The individual variances of trait measurements are assumed to be constant across species and are denoted respectively \sigma_{W}^{2} and \sigma_{V}^{2}. The residual standard deviation of the relation between *X* and *Y* will be designated *σ*_{R}. Note that both *X* and *Y* have the same phylogenetic correlation structure (*Σ*), and we assume that the measurements are uncorrelated within individuals. The model can then be written as follows:

\begin{array}{ll}
W_{ni} \mid X_{n}, \sigma_{W} & \sim \mathcal{N}(X_{n}, \sigma_{W}^{2}) \\
V_{ni} \mid Y_{n}, \sigma_{V} & \sim \mathcal{N}(Y_{n}, \sigma_{V}^{2}) \\
X \mid \mu_{0}, \sigma_{0}, \Sigma & \sim \mathcal{N}(\mu_{0}, \sigma_{0}^{2}\,\Sigma) \\
Y \mid X, \beta, \sigma_{R}, \Sigma & \sim \mathcal{N}(X\beta, \sigma_{R}^{2}\,\Sigma) \\
\beta & \sim \mathcal{N}(0, 10^{6}) \\
\sigma_{W}^{-2}, \sigma_{V}^{-2}, \sigma_{R}^{-2} & \sim \Gamma(\epsilon, \epsilon) \\
\Sigma & \sim \pi(\xi)
\end{array}

(7)

After initial experimentation, we found that a weakly informative prior on the species-level *X*, with sensible parameters covering the range thought to be biologically possible (such as *μ*_{0} = 0.5 and *σ*_{0} = 0.5 for a trait known to lie between zero and one), enabled the model to return more stable estimates for *X*, especially when some species have only a small number of individual measurements. Autocorrelation of the values sampled by the MCMC chain can become important for some datasets. In this case, we found that plausible starting values and a longer burn-in helped to ensure that the model converged on its equilibrium distribution, since autocorrelation becomes only a minor issue once convergence is reached. In OpenBUGS, this model failed to run for data sets with a large number of species *N*, probably because of memory issues. This problem was not encountered in JAGS, and does not appear to derive from theoretical problems in the estimation.
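The measurement layer of Equation 7 can be sketched as follows (Python, with hypothetical sizes and parameter values; for simplicity the species-level values are drawn independently here, whereas the model draws them with phylogenetic covariance *Σ*):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 20, 5                 # species and individuals per species (hypothetical)
mu0, sigma0 = 0.5, 0.5       # weakly informative species-level prior parameters
sigma_W = 0.2                # individual (measurement) standard deviation

# Unobserved species-level values X (independent draws for simplicity)
X = rng.normal(mu0, sigma0, size=N)

# Individual measurements W: n noisy observations around each species mean
W = X[:, None] + rng.normal(0.0, sigma_W, size=(N, n))

# The naive species value is the row mean; its sampling error shrinks as n
# grows, which is why species with few measurements benefit most from the
# weakly informative prior on X
X_naive = W.mean(axis=1)
```

The model estimates *X* jointly with the other parameters, rather than plugging in the row means, so the uncertainty in the species-level values propagates into the regression estimates.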

### Phylogenetic signal model

It is often of interest to quantify the strength of phylogenetic signal [57] in the values of a trait across present-day species and to compare this strength among traits or among trees of different lineages. For this purpose, we constructed a model to estimate Pagel's *λ* in the response trait *Y* [37, 58] simultaneously with the estimation of the linear regression parameters for the relation between *X* and *Y*. This model has been treated in a Bayesian context by [59, 60]. Unlike the Measurement Error model, we do not assume any phylogenetic signal in *X* [61]. In this model, *λ* is a coefficient that multiplies the off-diagonal elements of the variance-covariance matrix *Σ*. A *λ* close to zero implies that the phylogenetic signal in the data is low, suggesting independence in the error structure of the data points, whereas a *λ* close to one suggests good agreement with the Brownian Motion model of evolution, and thus correlation in the error structure. Since our variance-covariance matrices (in fact correlation matrices) have all diagonal elements equal to one, we can incorporate *λ* into the matrix using this simple calculation:

\Sigma_{\lambda} = \lambda\,\Sigma + (1 - \lambda)\,\mathbf{I}

(8)

where **I** is the identity matrix. The model can then be written as:

\begin{array}{ll}
Y \mid X, \beta, \sigma, \Sigma, \lambda & \sim \mathcal{N}(X\beta, \sigma^{2}\Sigma_{\lambda}) \\
\beta & \sim \mathcal{N}(0, 10^{6}) \\
\sigma^{-2} & \sim \Gamma(\epsilon, \epsilon) \\
\lambda & \sim \mathcal{U}(0, 1) \\
\Sigma & \sim \pi(\xi)
\end{array}

(9)

This model estimates the regression coefficients *β* as well as *λ*, which has a uniform prior (labelled \mathcal{U}). In our experiments, the MCMC sample for *β* showed a small degree of autocorrelation, but converged quickly. Using JAGS, some data sets with small values for *X* or *Y* showed a very large autocorrelation in the estimates of *λ*. This issue can be avoided by scaling the data values (by a factor of 10, for example) or, equivalently, by using a different prior for *σ*. We found that simulated data sets with fewer than 20 species have very low power to detect phylogenetic signal (as found in [57]). For simulated data (for which *λ* should be estimated as unity), *N* = 5 led to an estimate of *λ* of 0.40 with a standard deviation of 0.27, whereas for *N* = 10 this became 0.74 (0.24), and for *N* = 25 we obtained *λ* = 0.90 (0.10).
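The transform of Equation 8 is a one-liner; the sketch below (Python, with a hypothetical three-species correlation matrix) checks its two limiting cases:

```python
import numpy as np

def lambda_transform(Sigma, lam):
    """Pagel's lambda transform (Equation 8): multiply the off-diagonal
    elements of the correlation matrix by lam, keeping the unit diagonal."""
    return lam * Sigma + (1.0 - lam) * np.eye(Sigma.shape[0])

# Hypothetical three-species correlation matrix
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.3],
                  [0.3, 0.3, 1.0]])

# lambda = 1 recovers the Brownian Motion structure; lambda = 0 removes
# all phylogenetic correlation
assert np.allclose(lambda_transform(Sigma, 1.0), Sigma)
assert np.allclose(lambda_transform(Sigma, 0.0), np.eye(3))
```

Note that the transform only behaves this way because *Σ* is scaled to unit diagonal, as required in the Notation section.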

### Model checking

A fundamental part of statistical modelling is checking the goodness-of-fit of the model to the data: does the model adequately capture the properties of the data? This procedure is called "posterior checking" in the Bayesian framework [62, 63]. Of course, the first checks concern the relevance of the estimates and their distribution. To assess the performance of each model in capturing the properties of the data, we also performed posterior checks [62, 63] based on the posterior predictive distributions, checking the agreement between the observed data and replicates of the data simulated from the fitted model. This is a method for assessing the discrepancy between the model and the data, based on the distribution of a discrepancy test statistic *D*(*y*, *θ*). Since we are interested in the overall goodness-of-fit of the model, we used a function related to the *χ*^{2} function suggested by [63]. However, in our case the points are not independent, as they are related through a correlation structure. Thus, instead of using the *standardised* residuals to define the usual *χ*^{2}, we use the *normalised* residuals, defined by [64]:

\epsilon = \sigma^{-1} \left( \Sigma^{-\frac{1}{2}} \right)^{T} \left( Y - \mathbb{E}(Y \mid \theta) \right)

(10)

where ^{T} denotes the matrix transpose. *Y* is a column vector, so this matrix multiplication returns a column vector of residuals. Our discrepancy function can then simply be written:

D(y, \theta) = \sum_{i} \epsilon_{i}^{2}

(11)
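These normalised residuals can be computed directly (a Python sketch; here *Σ*^{1/2} is taken to be the lower Cholesky factor of *Σ*, one common choice of matrix square root):

```python
import numpy as np

def discrepancy(Y, mu, sigma, Sigma):
    """Chi-square-like discrepancy of Equations 10-11. With Sigma = L L^T
    (Cholesky), eps = sigma^-1 L^-1 (Y - E[Y|theta]) has unit covariance,
    and D is the sum of its squared elements."""
    L = np.linalg.cholesky(Sigma)
    eps = np.linalg.solve(L, Y - mu) / sigma   # normalised residuals
    return float(eps @ eps)
```

When *Σ* is the identity, this reduces to the usual sum of squared standardised residuals; the Cholesky solve is what removes the phylogenetic correlation among species.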

The essence of the posterior predictive check is to compute this distribution for hypothetical replicates of the data *Y*^{rep} and see if the value for the data *Y* is in agreement with this distribution. In order to simulate *Y*^{rep}, it is necessary to integrate over all the possible parameter values. One solution is to draw *L* parameter values directly from the MCMC posterior samples. We can then calculate:

\forall \, l \in 1, \dots, L : \quad D(y_{l}^{\text{rep}}, \theta_{l}) - D(y, \theta_{l})

(12)

and compute the *ppp-value* (for posterior predictive *p-value*) as P\left(D(y_{l}^{\text{rep}}, \theta_{l}) - D(y, \theta_{l}) > 0\right) (see example in Figure 3). If we are interested in other discrepancy measures *D*^{∗}(*Y*|*θ*) (for example *max*(*Y*), *mean*(*Y*), or *sd*(*Y*)), we can use the same draws to calculate them, allowing us to check different parts of the model at the same time. Another useful source of information is the *species ppp-value* (*sppp-value*) for each species or taxon, which we define as follows:

P\left( \left( \epsilon_{i}^{\text{rep}} \right)^{2} - \epsilon_{i}^{2} > 0 \right).

(13)

Discrepancy values are used to compare the dispersion of the replicates to the dispersion of the data, and to detect potential outliers or consistent over- and underdispersion (see examples in Figures 3 and 4). For example, a ppp-value close to zero indicates that most of the replicated datasets were less extreme than the observed dataset, and hence that the real data show greater discrepancy from the model's predictions than data generated under the model itself. A high ppp-value indicates that the model generates replicate datasets showing consistently greater discrepancy from the predicted values than does the observed dataset. For the sppp-value, a high value indicates consistent overdispersion within the replicate datasets.
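Given the MCMC output, the ppp-value is a simple Monte Carlo proportion. A minimal sketch follows (Python; the two discrepancy series are simulated here purely for illustration, standing in for *D* computed on the replicate and on the observed data at each posterior draw):

```python
import numpy as np

rng = np.random.default_rng(2)
L_draws = 1000

# Stand-ins for D(y_rep_l, theta_l) and D(y, theta_l) over L posterior draws;
# drawing both from the same distribution mimics a well-fitting model
D_rep = rng.chisquare(df=10, size=L_draws)
D_obs = rng.chisquare(df=10, size=L_draws)

# Equation 12: ppp = P(D(rep) - D(obs) > 0), estimated as a proportion
ppp = float(np.mean(D_rep - D_obs > 0))
```

For a well-fitting model the ppp-value lies far from both zero and one (here, near 0.5 by construction); the same loop over draws can accumulate any other discrepancy measure *D*^{∗} at negligible extra cost.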

For the Measurement Error model, we split the posterior checking into two parts. We assessed the estimates for the parameters of the linear relation, and for the species-level values *X* and *Y*, and we also assessed the distribution of the individual measurements (within species). Assessing the accuracy of the species-level values is difficult, because we have no theoretical expectation for them. However, we can compare the estimates with the means of the individual measurements, or compute a ppp-value using D(Y \mid \theta) = \sum_{i} (Y_{i} - \overline{V_{i}}) as a discrepancy statistic, \overline{V_{i}} being the mean of the individual measurements for species *i*. Afterwards, the residuals from the linear relationship were checked using the method described above (treating \overline{V} as the data "species value"). We checked the individual measurements for each species using standardised residuals to compute an sppp-value (in contrast to the normalised residuals used above, because we view the measurements as independent within a species, rather than correlated).

The ppp-value is not the probability of the model being true. Rather, it is the probability of observing data more extreme than the current data set, given the model assumptions, the posterior distribution of the parameters, and the discrepancy statistic. Therefore, our use of ppp-values is solely to assess how "surprising" the data appear to be under the model assumptions and the parameter estimates. If the ppp-value is very extreme (close to zero or one), this alerts us to possible structural problems in the model, since it means that the distribution of data simulated from the model differs from the data we actually observed in a particular aspect (distribution of residuals, mean, etc.). This can help to identify aspects of the model that fail to represent the data adequately and should be altered (see an example in our Real Data analysis with the Phylogenetic Signal model). Unlike classical p-values, Bayesian ppp-values are not necessarily uniformly distributed under the null hypothesis, and should not be compared across models or used to set a permissible type I error rate (false rejection of the model, [65]): there is no "critical value" such as 0.05 for ppp-values. For a detailed explanation of the interpretation of ppp-values, see [63].

If the interest is in comparing *β*_{i} to a particular value, one can simply give the posterior probability that *β*_{i} falls in any particular range of values. For example, one might want to know the probability of values less than zero, or greater than zero, or within a certain distance of zero. Bayesian inference enables us to make a direct statement about this probability, rather than accepting or rejecting a point hypothesis at an assumed significance level. The probability is equal to the proportion of the area under the posterior probability density function that falls in the range of interest. For example, if we were interested in whether *β*_{i} was greater than or less than zero, and the posterior distribution had only 1% of its area in a lower tail extending into negative numbers, then we would conclude that the probability that *β*_{i} is less than zero, given the data, is 0.01. By the same token, the probability that *β*_{i} is positive is 99%.
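In practice these probabilities are just proportions of posterior samples falling in the range of interest (a Python sketch with a hypothetical posterior sample for a slope; in a real analysis this would be the MCMC output for that coefficient):

```python
import numpy as np

# Hypothetical posterior sample for beta_i (not real MCMC output)
rng = np.random.default_rng(3)
beta_samples = rng.normal(loc=1.2, scale=0.5, size=10_000)

# P(beta_i > 0 | data): proportion of samples above zero
p_positive = float(np.mean(beta_samples > 0))

# P(|beta_i| < 0.5 | data): proportion of samples within 0.5 of zero
p_near_zero = float(np.mean(np.abs(beta_samples) < 0.5))
```

Any interval probability of interest can be read off the same sample in this way, without fitting anything further.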

### Implementation of models and data analysis

The general aim for our models is to estimate the posterior distribution of parameters of a model where the data are correlated through a phylogenetic relationship for which we have a prior distribution. The two main assumptions of our models are *(i)* that the phylogenetic trees are ultrametric, so that the correlation matrix is proportional to the variance-covariance matrix and *(ii)* that evolution can be modelled by a Brownian Motion (BM) process. These assumptions are common in the comparative analysis literature, for example [66], but can be relaxed in some situations, e.g. [57].

We used the statistical software R [67] for data handling, in association with OpenBUGS 3.0.3 [68] for Markov chain Monte Carlo (MCMC) computation. MCMC algorithms use iterative draws to sample from an unknown target distribution, and thereby learn about its properties. In our case, the target is the marginal posterior distribution of the parameters of our model. In each step, a new value is drawn for one parameter, conditional on the data and the current values of all the other parameters in the model. Over a sufficiently large number of iterations, the algorithm converges to the marginal distribution of the parameters; see [33, 69]. The samples following this initial period of convergence (the 'burn-in') can be used for inference on the parameters. The three kinds of algorithm currently used by OpenBUGS are the Gibbs sampler [70, 71], the Metropolis-Hastings algorithm [72, 73], and the slice sampler [74]. After a prior sensitivity analysis, we decided to use an inverse-Gamma(1,1) prior distribution for the variance components. This prior helped to avoid autocorrelation and had little influence on our results (see Figure 8 for an example). However, we do not recommend such an informative prior without a preliminary analysis of prior sensitivity.

MCMC algorithms sometimes exhibit excessive autocorrelation among successive values in the chain, leading to inefficient sampling of the full parameter space if the dependence among samples extends for more than a few iterations. If the autocorrelation is high for a parameter, it may be necessary to let the simulation run longer and take a subsample of the MCMC output. We discuss autocorrelation issues for each model, below, along with other features of their application.

We report 'ppp-values' (posterior predictive p-values) as an indicator of the probability of the observed data under the best-fit model, and 'sppp-values' (species-level ppp-values, one per species) as an indicator of under- or overdispersion in replicate datasets generated under the best-fit model (see "Model checking" section).