### Specimens

Sheds from 308 species of snakes were collected from museums, vivariums, private breeders and reptile shops. These samples were used for Scanning Electron Microscopy (SEM) imaging. The SEM images of 45 additional species from previous publications [15, 17] were also used. The 353 species scored for the present analysis represent around 32% of all snake genera (169 over 522 genera [18]) and 19 of the 26 snake families. Confocal microscopy was performed on skin biopsies obtained from African House Snake (*Boaedon fuliginosus*) embryos (day post-ovoposition 46) in the colony housed in Milinkovitch’s laboratory, University of Geneva, Switzerland. Maintenance of, and experiments on snakes were approved by the Geneva Canton ethical regulation authority (authorisations GE/82/14 and GE/73/16) and performed according to Swiss law. These guidelines meet international standards.

### Confocal microscopy

African house snake skin samples were laid flat face up on a Millipore nitrocellulose membrane (0.8mu AABP) to prevent curling and placed in an Eppendorf tube. Then, 250 μl of Gibco Dulbecco’s phosphate-buffered saline (+ CaCl2 and MgCl2) + 1 μl of 1000x Syto9 Green Fluorescent Nucleic Acid Stain (ThermoFisher Scientific S34854) [16] were added and the samples were incubated 1 h at 30 °C and 300 rpm in an Eppendorf Thermomix agitator. The skin was then peeled from the filter and mounted between a microscope slide and coverslip, which was glued on to prevent detachment in the inverted confocal microscope. As the Syto9 Green dye binds to nucleic acids, the nuclei appear very bright whereas the cytoplasm is less intensely labelled, allowing the visualisation of both the nuclei and the borders of the cells with high resolution. A Zeiss LSM 780 inverted microscope was used with a Zeiss Planapo 63X Oil 1.4NA lens and 1.518 refractive index oil. The Argon 488 nm excitation laser was used and the detector frame was set between 490 nm and 600 nm to capture the entire Syto9 emission spectrum.

### Scanning Electron microscopy and character description

A dorsal sample from each moult was excised using a scalpel blade and placed on a cylindrical specimen mount covered with a carbon adhesive tab (Electron Microscopy Sciences). The samples were then coated with gold and SEM was performed with a Jeol JSM-6510LV microscope using a beam size of 40 nm and a voltage of 10 kV. Pictures were taken at 3000x and 9000x magnifications at the cranial tip, in the middle, and at the caudal tip of a scale from each sample. When the cells were bigger than the image frame at 3000x, a new image was taken at 1000x at the tip of the scale to allow visualisation of cell shape.

We defined four nanomorphological characters, each with a number of mutually exclusive states (Fig. 3). Each of the 353 studied species was categorised using the nanostructures found at the tip of the scales, which are the most developmentally-derived [14]. The corresponding state matrix is provided in Additional file 1: Table S1 as well as in a MySQL relational database at https://snake-nanogratings.lanevol.org. Note that the latter also provides one representative SEM image of each of the corresponding species.

The characters and their states are (Fig. 3): ‘cell shape’ (‘wide’ = 0 or ‘polygonal’ = 1), ‘cell border’ (‘regular’ = 0;‘short digits’ = 1;‘mild digits’ = 2;‘long digits’ = 3 or ‘sawteeth’ = 4), ‘cell surface’ (‘smooth’ = 0; ‘holes’ = 1; ‘straight channels’ = 2 or ‘labyrinthine channels’ = 3), and ‘ridge’ (‘absence’ = 0 or ‘presence’ = 1). Ambiguity was prevented by grouping similar structures (holes and chains), by ignoring dim structures (pits, Fig. 4c), and by ignoring substantially larger microstructures (Fig. 4d,e). When a character’s state was unclear (e.g., cell shape or cell surface not visible), it was classified as ‘unknown’.

For the ‘cell shape’ to be categorised as ‘polygonal’, the ratio between the width (left-right axis) and length (cranial-caudal axis) of the cell must be ≤2 (Fig. 3a). Conversely, when the cell is more than twice wider than long, it is classified as ‘wide’. When a cell’s length varies a lot, as in ‘sawteeth’ cell border, an average is taken between the longest and shortest lengths (Fig. 3b).

For the ‘cell border’ character (Fig. 3b), the states ‘regular’ and ‘sawteeth’ are easily recognised while ‘digitations’ can show various lengths. The length of digits and the total cell length were measured in ImageJ for 10 cells and averaged. The digit length was normalised by the cell length. The states of digitations were defined as ‘short’, ‘mild’, and ‘long’ when the ratio of digit length versus cell length was < 0.3, between 0.3 and 0.5, and > 0.5, respectively. Note that the absolute size ranges of short, mild and long digitations overlap as they are 0.4–4.7 μm, 0.8-5 μm, and 1.2–5.5 μm, respectively. Digitations and sawteeth are easily differentiated because the former are small (< 5.5 μm), regular, cover the cranial border of the next cell, and are specifically oriented toward the caudal end of the scale. On the other hand, sawteeth are larger (5-10 μm), irregular, and the borders of adjacent cells imbricate with each other. Note that it is sometimes difficult to discern the cell’s borders as the previous clear layer’s imprint can leave marks on the current oberhautchen cell layer. Some species exhibit a dense network of ‘elevations’ (Fig. 1c) that could either be elevated border digitations or originate from the cell surface.

The ‘cell surface’ states were defined using functionalities from the OpenCV computer vision library [35]: all contours (i.e., closed curves connecting sharp changes in pixel intensity) in each SEM image were detected. First, we applied *Contrast Limited Adaptive Histogram Equalisation* [36] to account for inhomogeneity of contrasts across the image while preventing the over-amplification of noise (Additional file 15: Figure. S11a-b). Next, a local thresholding method was applied as follows: images were divided into small tiles (100 × 100 px) and a k-means clustering method (k = 3) was applied to group the pixels intensities of each tile independently (Additional file 15: Figure. S11c). The black and white thresholding for each tile was then set to the average of the cluster centres with the lowest and middle intensities to differentiate the darkest pixels from those in the two other clusters (Additional file 15: Figure. S11d). Artefacts at the tile borders were removed by applying Gaussian filtering on the threshold values. Then image intensities were inverted and contours were detected. The contours were then classified into shapes based on the ASR criterion: A is the area of the contour (normalised by image area), S is its ‘solidity’ (i.e., the ratio of the contour surface and the surface of its bounding convex polygon), and R (≥ 1) is the ratio of the minimum bounding rectangle sides. A contour was classified as a ‘hole’ if S > 0.8, R < 6 and A is less than two standard deviations (SD) away from the mean area of detected holes on that image (blue contours in Additional file 15: Figure. S11e). This resulted in the grouping of similar structures that could be qualified as roundish and elongated holes (Fig. 4a-b). For a contour to be sorted as a ‘straight channel’ it required S > 0.5 and R ≥ 6, but no constraint on the area (A) as channels can be very long (red contours in Additional file 15: Figure. S11e). The value of R = 6 separating the two classes was selected arbitrarily as the distribution for the bounding rectangle length/width ratio is continuous and strongly decreasing (Additional file 9: Figure. S5a). The classification was improved using a training set consisting of contours (from manually-selected images) clearly belonging to a certain class. Contours that were not labeled as a ‘hole’ or a ‘straight channel’ based on the ASR criterion were still classified if a similar shape (calculated using Hu moments [37] implemented inside OpenCV *‘Match Shapes’* procedure) was found in the training set. To identify ‘labyrinthine channels’, we recognised all contours that could not be classified into ‘holes’ or ‘straight channels’ but could be subdivided into multiple straight channels based on their convexity defects (green contours in Additional file 15: Figure. S11e). To avoid miss-classification due to image quality or cell border detection, all images were manually corrected by deleting, redrawing and automatically sorting or subdividing already existing contours. Finally, we made sure to subdivide all ‘labyrinthine channels’ into ‘straight channels’ to facilitate the process of image classification (Additional file 15: Figure. S11f). At the end of the procedure, the percentage of surface covered by each type of shapes on a SEM image was calculated and the final image state was decided using a simple decision tree. If the total area of every shape was less than 3% of the total surface of the image, the state was set to ‘smooth’. Otherwise, we set the state to ‘holes’ if they had the largest covering surface. If ‘channels’ had the largest covering area, we measured their orientation and classified the image as ‘labyrinthine channels’ if the angle SD was > 25°, or as ‘straight channels’ otherwise. Given that the standard deviation of channel orientation (Additional file 9: Figure. S5b) presents a gap between 22° and 26°, we can objectively set a threshold at 25° for separating ‘straight channels’ (R ≥ 6) from ‘labyrinthine channels’ (channel orientation angles with SD > 25°). Despite these criteria, some species exhibited structures that could not be unambiguously classified as ‘holes’ or ‘labyrinthine channels’ (Additional file 11: Figure. S7).

The ‘ridge’ character refers to inverted-gutter deformations of the cell surface that run along the cranial-caudal axis (Fig. 3d).

### Phylogeny

First, we pruned the time-calibrated tree among 9754 species of squamates [20] to keep the 353 species for which we have obtained nanomorphological data. Note however that some of these species are represented by skin samples from multiple subspecies. Given that no subspecies information is provided in [20], we selected only one subspecies using the two following successive criteria: (1) if some, but not all, subspecies within a species could not be scored for one or several characters, remove the corresponding subspecies, and (2) if the different subspecies in a species exhibit different character state(s), keep the subspecies with the state(s) most frequently exhibited by the most closely-related other species. When more than one subspecies per species remained, we arbitrarily kept only one of them. This procedure removed 13 subspecies from the dataset prior to character mapping and subsequent analyses. The corresponding Newick tree file among 340 species is provided as a Additional file 3(Suppl_File_3_Tonini_SnakesCommon.nwk). As the pruning process did not remove all of the polytomies present in the original time-calibrated tree [20], and some of our analyses (e.g.*,* models of character evolution) require a dichotomous tree, we resolved these ambiguous nodes with the function *multi2di* of the *ape* package [38] of the program R [39] while imposing the length of the newly-generated branches to an arbitrarily small value (10^{− 6} of the total height of the tree); this avoids problems generated by zero branch lengths in comparative analyses functions. Note that the reverse transformation can easily be performed with the *di2multi* function of the *ape* package [38]. The 353 snake (sub) species in our dataset are listed in Additional file 1: Table S1 and in the MySQL relational database at https://snake-nanogratings.lanevol.org; in both cases, the 13 subspecies not used for phylogenetic mapping are indicated with an asterisk.

### Models of character evolution

Without a priori information about the possible character evolutionary model, we fitted and compared different rate transition matrices of the Mk model with or without Pagel’s tree transformation coefficients λ, δ, and κ [21, 22]. The Mk model for discrete characters is based on the assumptions that *(i)* the trait value can evolve through a Markov process, i.e.*,* the probability of character substitution depends only on the current state and not on previous states, and *(ii)* every state of a particular trait is equally likely to change to any other states. This model is the analogue to a Brownian motion (BM) model for continuous traits as it supposes that evolutionary changes are independent across lineages and that the rate of change is constant over time and along all branches [21, 22, 24]. We focused on three forms of the rate matrix: equal rates (ER) for all substitutions, different but time-reversible (i.e., symmetric) rates (SYM), or with all rates being different (ARD).

When using continuous characters, coefficients can be used to transform the variance-covariance matrix and fit the model to the corresponding tree. On the other hand, when using discrete characters, one cannot define a covariance matrix, such that coefficients can only be applied by transforming the tree while keeping a constant-rate Mk model [21,22,23,24]. First, the λ coefficient compresses internal branches (without affecting tip branches) to measure the dependence of trait evolution among species. A value of λ = 1 leaves the tree unchanged, i.e., the model is assumed to maximally fit the original tree. On the other hand, a λ = 0 (making the tree a star-like phylogeny) indicates independence of evolution among species. Hence, comparing the fit of the model with various values of λ provides a good approach to test for the presence of ‘phylogenetic signal’ (see next subsection). Second, the coefficient δ constitutes a quantitative measure of how rates of substitution change across the tree (i.e., through time). If δ < 1, the evolutionary rate slows down through time, while δ > 1 indicates an acceleration of the rate. In terms of phylogenetic transformation, it affects the tree by modifying nodes’ heights. Third, the coefficient κ raises the branches length by a power κ. When κ = 1, the tree is left unchanged while all branch lengths are one when κ = 0. The parameter κ can be interpreted as character changes being more or less concentrated at speciation events, i.e., it tests if evolution of a trait was punctuated or gradual [21, 22, 26].

For each character, we fitted all combinations of models (ER, SYM, ARD) and coefficients (none, λ, δ, and κ) with the function *fitDiscrete* of the *geiger* package [40] using an internal optimisation procedure supported by the functions *optim* and *subplex* (i.e., general-purpose optimisation functions provided in R [39]). We used 10^{4} starting points during the optimisation procedure and performed comparisons among all fitted models using the sample-size corrected Akaike information criteria scores (AICc and AICw) [41] provided by the *fitDiscrete* [40] and *aic.w* [42] functions, respectively. A summary of the scores obtained with all models is provided in Additional file 4: Table S2.

### Phylogenetic signal

Heritable traits (whether continuous or discrete) observed in species related by a phylogenetic history do not represent data points that are statistically independent and identically-distributed [22, 23, 43]. Hence, specific statistical methods must be applied for comparative analysis of these traits (here, nanostructures). Many indices were developed to quantify the so-called ‘phylogenetic signal’ (i.e., how much closely-related species tend to resemble each other more than randomly-drawn species) with respect to trait evolution [43, 44]. Here, we use Pagel’s λ parameter [22, 23] because of its good statistical properties (low sensitivity to tree size and to errors in topology and branch lengths, high power of associated statistical tests) over broad types of tree transformations [44]. As discussed above, with discrete traits, the λ coefficient is used in the Mk model as a scaling factor of branch lengths [22]. The *phytools* package [42] provides the function *phylosig* that tests for the presence of phylogenetic signal through randomisation of the species in the tree [22, 23, 43, 44] but cannot be used with discrete data. Hence, we follow here the proposed alternative method of Revell [45]: we optimise the Mk model likelihood under a given λ-transformation of the tree, making it an analogue of the BM model but for discrete traits, and we compare the associated estimator λ_{ML} to the null hypothesis case (λ = 0, i.e., trait evolution and phylogeny are independent). The combination of a value of λ close to 1, and significantly better likelihood than with λ = 0, indicates the presence of significant phylogenetic signal.

### Phylogenetic mapping

Using the time-calibrated tree of squamates [20] and the best fitting model of character evolution (Additional file 4: Table S2), we performed ancestral state reconstruction and substitutional history estimation of the four nano-morphological characters and the simple life habit character discussed above. The phylogenetic mapping was determined with a continuous-time reversible Markov model implemented in the function *make.simmap* of the *phytools* [42] package in R [39]. If a state could not be scored for a particular character, the corresponding species was removed from the analysis. We followed the procedure described in [42, 46, 47], and implemented in the function *make.simmap*: estimating the posterior probabilities of ancestral states by sampling the associated posterior distribution approximated by a continuous-time Markov chain. The parameters used in the Markov algorithm were: 10^{4} simulations (nsim), a burnin of 10^{3} steps and a sample frequency of 100. We used the default option of prior distribution estimation at the root by sampling the conditional scaled likelihood distribution. Then, conditioned on this prior distribution at the root and the observed states at tip branches, the mutational history of the character was simulated by sampling with the appropriate posterior distribution.

### Correlations among characters

To detect potential linear relationships (i.e.*,* correlation) between traits [19, 27, 48, 49], we performed a comparative analysis among the four nano-morphological traits and the simple classification of life habit (aquatic, terrestrial, fossorial or arboreal) using both the phylogenetic generalised least-squares regression (PGLS; [27]) and the phylogenetic generalised linear mixed model (PGLMM; [28]), two methods based on different statistical perspectives (frequentist and Bayesian inferences, respectively). Note that we do not aim to infer the best values of the model parameters but wish to identify the presence or absence of correlation. The analysis was performed on the entire 10 (5 × 4/2) pairs of characters. For each pair of characters, we pruned the global tree such that it only includes the species for which both traits have been scored.

The PGLS regression is based on a common generalised least squares (GLS) model assuming that the response variable **Y** (or any of its transformations) can be expressed by a linear relationship of explanatory variable **X**, which are independent of each other and of the error [27, 48], i.e.*,* **g(Y) = Xβ +** ∈. However, the GLS takes into account dependence of the observations (here, the species) by down-weighting the estimator with the variance-covariance matrix. Consequently, the assumption of uncorrelated errors and homoscedasticity, i.e.*,* constant variance, is relaxed. The error allows grouping within-species variation (also called measurement error), error due to unknown or incomplete phylogenetic relationships, and error due to the stochastic nature of the evolutionary process along the phylogeny [48]. As we could not estimate the error due to unknown or incomplete phylogenetic relationships and to within-species variation, we assume that the error originates only from variation of evolutionary changes along a phylogeny. This error follows a multivariate normal distribution with an expectation of **0** and variance-covariance matrix σ^{2}**V**, where **V** (NxN matrix, N number of species) is known and relates to the tree structure and branch lengths. For a Brownian model, the variance is defined by **V**_{ij} = σ^{2}t_{ij}, where t_{ij} is the distance on the phylogeny between the root and the most recent common ancestor of taxa i and j, and σ^{2} (that can be estimated) represents the evolutionary rate [21]. One advantage of the PGLS method (function *pgls* in the *caper* package [50]) is that the matrix **V** can be adapted to multiple models of evolution by manipulating the branch lengths in the phylogenetic tree with one of Pagel’s coefficients. Note that we had to convert the discrete trait variable response into a pseudo-continuous variable in order to apply PGLS. It has been shown [26] that this approach exhibits good statistical performances and valid estimations. We used the *pgls* function with the R structure formula Var1 ~ Var2, where Var1**,** Var2 represents the tested pair of characters. Note that we fixed the parameters λ, δ, and κ to 1 as suggested by our statistical comparisons among all combinations of models (ER, SYM, ARD) and coefficients (none, λ, δ, and κ).

The PGLMM method, supported by the function *MCMCglmm* in the package of the same name [51], implements Bayesian inference with Markov Chain Monte Carlo (MCMC) approximation to fit the model. The generalised linear mixed model is an extension of the generalised linear model (with fixed **X** explanatory effects) but with the addition of random effects (denoted by the matrix **Z)** on the model. We can then write the model as **l = [X Z]ϑ +** ϵ, where **l** is the so-called latent variable (some transformation of the response variable **Y**), **X** and **Z** are the design matrices related to respectively fixed and random effects, **ϑ** (**[β a]**^{T}) is the vector of parameters, and ϵ is the residuals’ vector. Location effects **β** (fixed), **a** (random) and residuals ϵ are assumed to be drawn from a multivariate normal distribution. While fixed effects are assumed a priori to be independently-distributed with mean **β**_{0} and variance σ_{B}^{2}**I**, the random effects and residuals have zero mean as well variance σ_{a}^{2}**A** and σ_{e}^{2}**I**, respectively. In phylogenetic comparative analyses, **A** is equivalent to the **V** matrix defined in PGLS. Note that σ^{2} can also represent a matrix in the case of multiple responses models (i.e., with multi-state characters), such that the global variance matrix is the Kronecker product of **σ**^{2} with **I** or **A**.

The main advantage of PGLMM is to model multinomial logit (=log-odds) responses (a special case of multiple response models) with more than two states, hence, it is particularly adapted to our data [28]. Multinomial logit models usually reduces the J states problem to a J-1 problem by taking one of the levels as a reference. While in PGLS the variable response corresponds exactly at the data value, multinomial PGLMM looks on log-odds ratio of the probability that a species *i* has a trait’s state *j*, i.e.*, l*_{ij} *= log (*α_{ij}*/*α_{iref}*)* where α_{ij} represents these probabilities and *ref* is the reference state. The *MCMCglmm* function input for the fixed effect is defined as Var1 ~ Var2 for multinomial response with 2 states (Cell shape, Ridge) and Var1 ~ trait Var2 + trait – 1 for responses with more than 2 states (Cell borders, Cell surface). The latter formulation enables to obtain specific intercepts and regression coefficients for each trait of Var1 (‘trait’ indexes the latent variables, i.e.*,* the states of the characters) with respect to Var2. The (− 1) is imposed in order to remove global intercept and to get easier interpretable models [52]. For the random effect, we define the input random ~ animal for 2 states latent variables and random ~ us(trait): animal for > 2 states latent variables. ‘Animal’ corresponds to the species identifier, while us() is an internal function that fits the variance of the random effects by assuming the complete **σ**^{2} (J-1) x(J-1) variance matrix of the trait, i.e.*,* covariance between levels of the trait are permitted. Note that we use the *rcov* formula for the residual covariance structure. It is defined as rcov ~ us(trait): units for > 2 states response and random ~ units for 2 states, in order to estimate the fully parametrized covariance matrix. **‘**Units’ indexes the rows of the multiple response data (here as there is only one state for each species, such that the ‘units’ index is equivalent to the ‘animal’ index).

In the context of a Bayesian framework, prior distributions should be defined. For the fixed effect, the *MCMCglmm* function assumes a multivariate normal prior [52] where the mean μ is set to **0** and the variance parameter V is set to (1.7 + π^{2}/3) *I*, where *I* is the (J-1) xK identity matrix (J and K are the numbers of levels for Var1 and Var2, respectively). For the random effects and the residuals, the *MCMCglmm* function assumes an inverse-Whishart prior with the parameter *V* and *ν*. Here, for the residuals, we impose V = (*I* + *J*) / J, where **J** is the unit matrix and *ν* = *J*-1. Additionally, we arbitrarily fix the variance throughout the MCMC run [28]. On the other hand, we can estimate the variance of the random effect during the run.

In order to increase the mixing of the MCMC, we use the parameter expansion methods [52], i.e., we add (in the prior definition of the random effects) the parameters *αμ* = 0 and *α.V* = 10^{3}*I*. Other parameters of the MCMC were: 10^{7} MCMC iterations with a burnin of 10^{6} and thinning interval of 10^{4} iterations, and truncate latent variables when necessary (to prevent under/overflow). We used the *autocorr.plot* and *gewecke.plot* functions from the *coda* package [53] to evaluate Markov chain convergence [54].