A coarse-graining, ultrametric approach to resolve the phylogeny of prokaryotic strains with frequent homologous recombination

Background A frequent event in the evolution of prokaryotic genomes is homologous recombination, where a foreign DNA stretch replaces a genomic region similar in sequence. Recombination can affect the relative position of two genomes in a phylogenetic reconstruction in two different ways: (i) one genome can recombine with a DNA stretch that is similar to the other genome, thereby reducing their pairwise sequence divergence; (ii) one genome can recombine with a DNA stretch from an outgroup genome, increasing the pairwise divergence. While several recombination-aware phylogenetic algorithms exist, many of these cannot account for both types of recombination; some algorithms can, but do so inefficiently. Moreover, many of them reconstruct the ancestral recombination graph (ARG) to help infer the genome tree, and require that a substantial portion of each genome has not been affected by recombination, a sometimes unrealistic assumption. Methods Here, we propose a Coarse-Graining approach for Phylogenetic reconstruction (CGP), which is recombination-aware but forgoes ARG reconstruction. It accounts for the tendency of a higher effective recombination rate between genomes with a lower phylogenetic distance. It is applicable even if all genomic regions have experienced substantial amounts of recombination, and can be used on both nucleotide and amino acid sequences. CGP considers the local density of substitutions along pairwise genome alignments, fitting a model to the empirical distribution of substitution density to infer the pairwise coalescent time. Given all pairwise coalescent times, CGP reconstructs an ultrametric tree representing vertical inheritance. Results Based on simulations, we show that the proposed approach can reconstruct ultrametric trees with accurate topology, branch lengths, and root positioning. Applied to a set of E. coli strains, the reconstructed trees are most consistent with gene distributions when inferred from amino acid sequences, a data type that cannot be utilized by many alternative approaches. Conclusions The CGP algorithm is more accurate than alternative recombination-aware methods for ultrametric phylogenetic reconstructions.

resource, let us assume an upper bound l s cutoff ≤l s to x, where segments with x>l s cutoff SSPs are simplified to have l s cutoff SSPs. f(x|t) is normalized to unity in the following way: At t=0, the most recent common ancestor (MRCA) of XY splits into two lineages; the two have identical genomes, and thus f(x|0)=δ x,0 (Kronecker delta, i.e., this is non-zero only at x=0). When t>0, mutations and recombinations occur, and the evolution of f(x|t) is described by the following equation (Eq. (1) in the main text): The first term of this equation accounts for mutation, and the second term accounts for recombination; there is a factor 2 because mutation and recombination can equally occur on either X or Y. μ is the mutation rate per site, and l s μ is the mutation rate per segment; I(x|y) is the identity matrix; M(x|y)=δ x,y+1 +(δ x,lscutoff ×δ y,lscutoff ) is the mutation matrix (i.e., M(x|y)=0 for x≠y+1), as a segment with y SSPs jumps to x=y+1 SSPs after a mutation. For simplicity, it does not consider back mutation. ρ is the rate for a segment to be covered by a recombination stretch. The value of μ or ρ changes if time is rescaled, but not ρ/μ. P(x|y,θ,δ TE ,l s ) is the recombination matrix, which is the probability for a segment to change its number of SSPs from y to x during a recombination. P(x|y,θ,δ TE ,l s ) corresponds to Eq. (S4) in Dixit et al. (Dixit et al. 2015) and Eq. (3) in Dixit et al. (Dixit et al. 2017); these literatures describe the detailed derivation of P(x|y,θ,δ TE ,l s ). In brief, since a segment can recombine with its counterpart on another genome, we assume that each segment has its own phylogeny, which is different from the phylogeny of the genomes; the population structure at each segment is approximated by the coalescent model. For an attempted recombination between Y and an external donor D, we can use the coalescent model to calculate the probability distribution for the segment divergence δ between D and X, and then obtain x from x=l s δ. The expression of P(x|y,θ,δ TE ,l s ) is: B is a normalization constant; Θ(x) is the step function, which is 1 when x>0 and 0 when x≤0; δ xy is the Kronecker delta. The three terms of P(x|y,θ,δ TE ,l s ) (corresponding to Eq. (S1-S3) in Dixit et al. (Dixit et al. 2017) represent three different possible scenarios of recombination: 1. when x<y, recombination reduces divergence: 2. when x>y, recombination increases divergence: 3. when y=x, the recombination event either failed, or succeeded but did not change the divergence: 3 ( , , .
The normalization constant B is introduced to deal with possible numerical errors and is determined by the condition: We can calculate f(x|t) at different t, starting from Equation (1) with the boundary condition f(x|0)=δ x0 and using discrete time step t. Supplementary Figure S1 shows an example of SSP distributions of this model at different coalescent time t, with model parameters (μ, ρ, θ, δ TE , l s )=(1E-5, 0.01, 2%, 1%, 1000).

Monte Carlo algorithm implementing the CGP model to search for the optimal tree and model configuration
We developed a Monte Carlo algorithm, along with Metropolis acceptance and annealing, to search for the parameters (μ,ρ,θ,δ TE ) and coalescent time of all pairs of sequences that maximize the score in Eq. (3). Given the initial parameters (μ,ρ,θ,δ TE ), the algorithm rescales time, such that after rescaling l s μ=min(0.02, l s δ farthest /200); here δ farthest is the sequence divergence between the most divergent pair. This rescaling rule has two considerations: (i) l s μ (and also ρ) should be ≪1 to suppress numerical error; (ii) the branch lengths, which is the coalescent time t in f(x|t), should be numerically small to save computational resource.
The algorithm calculates the initial f(x|t), discretizing time t into steps; the coalescent time of genome-pair XY, t XY , is set to make f(x|t) best-fit g XY (x). This generates an initial t-matrix of the genome-pairs, which is turned into ultrametric tree by single linkage clustering.
The algorithm maps an n-leaves ultrametric tree to an n×n coalescent time matrix through single linkage clustering. Moving in the direction of a degree of freedom in this tree is like moving an internal node up or down, joining two nodes or splitting one node. Hence the matrix has ≤n-1 degrees of freedom despite n 2 entries. When performing local search, CGP algorithm generates a new tree by moving an internal node of the old tree up or down by one time-step. This may combine two internal nodes into one; for an internal node with c>2 children, a possible move involves moving the entire node upwards or downwards, or breaking it into two nodes, one with two children and the other with c-1 children (see Supplementary Figure S2 for an example).
Given the initial tree and model parameters, the algorithm proceeds to search for the optimal tree and model parameters. In each Monte Carlo step, one of the following moves is considered and selected according to Metropolis acceptance: 1. There is an n -2 /2 chance to mutate one of the parameters (ρ, θ, δ TE ). The algorithm considers one of six different parameter sets (ρ(1+ɛ), θ, δ TE ), (ρ(1-ɛ), θ, δ TE ), (ρ, θ(1+ɛ Absolute upper limits are imposed for some of the parameters: ρ<1,θ<100% and δ TE <100%. 2. There is an n -2 /2 chance for a random branch of the tree to be cut and grafted to a different part of the tree. In this move, a branch with the younger internal node Y and older internal node O is picked randomly. With their ages denoted as t Y and t O , this branch is then cut at the height t O , and the entire sub-clade is then grafted to another random branch on the tree that is present at height t O to generate a new ultrametric tree. 3. There is a 1-n -2 chance for an internal node to be picked randomly. One of the possible local moves of the node described above is selected randomly to generate a new tree. The simulation continues, and stops when the maximal score of the chain of Monte Carlo steps has not increased by more than 1 for the last 200,000 steps. It also makes use of annealing, starting with an initial temperature that has magnitude of a twenty-thousandth of the initial score, and reducing by half every 50,000 steps.

Preparing simulated and real genome sequences to test different phylogenetic algorithms
Fisher-Wright simulation We simulated Fisher-Wright populations that have micro and macro recombination. We set N e =1000, genome length 5Mb, and each site can have one of the four letters ATGC. In our simulation, mutation occurs at a rate μ per site, which randomly mutates a site to one of the other three nucleotides. A recombination-attempting DNA stretch initiates at a rate ρ ini per genome per time-step, which transfers a DNA stretch from a random donor that starts at a random site of the genome. The length of the stretch is drawn either from a micro distribution (geometric distribution, mean 100bp) or from a macro distribution (geometric distribution, mean 1kb), both with equal probability. The recombination attempt can either succeed (the donor stretch replaces the recipient stretch) or fail; the chance of success is exp(-δ/δ TE ), where δ is the divergence between the incoming stretch and the host stretch. Throughout a simulation, we recorded the history of genome inheritance and have full knowledge of the population phylogenetic tree. A simulation continues for at least 10,000 steps, and then stops once the MRCA of the population has emerged.
We performed Fisher-Wright simulation on three sets of parameters (μ, ρ, θ, δ TE ), representing prokaryotic species with low, intermediate and high levels of recombination; note that ρ=ρ ini L, where L is the average length of the transferred DNA stretch and is 550bp in our case. The parameters of the simulations are: 1. (5E-5, 0.01, 10%, 0.8%), r/m≈2 r/m is the ratio between substitutions contributed by mutations and by recombinations; these r/m values are estimated using Eq. (A2) of Dixit et al. (Dixit et al. 2017). We set the transfer efficiency δ TE =0.8%, which is close to previously reported values 2.4% (Fraser et al. 2007) and 0.8% (Dixit et al. 2015) for E. coli. A previous study reports the r/m values of different prokaryotic species, which range from 0.02 to 63.6 (Didelot et al. 2009); hence the different levels of recombination of our simulated genomes are consistent with what observed in nature.
We picked genomes from the simulated populations that can mimic inter-species recombination. If we pick random genomes into a test-group, then their MRCA will have an average age t root~1 000 because N e =1000. We would like to simulate the condition where a single lineage diverging from the rest of the population and forming its own subpopulation, so that the genomes in their subpopulation continue exchanging DNA among them and with the rest outside. Therefore, we need test-groups of genomes coming from a small clade of the population, and we constrained the MRCA of a genome-test-group to be small, i.e., t testgroup-root ≪1000.
We performed Fisher-Wright simulations, repeating each parameter set ten times to get ten populations. We picked 10 test-groups in each population; each test-group has ~10 genomes and a constraint on the age of their MRCA: t test-group-root~1 0, 20, …, 100; the true phylogenetic tree of the genomes in each test-group is also recorded. In sum, this generates 100 test-groups for each of the three parameter sets. See Additional File 4 for the sequences and their tree for each test-group.
A test-group with n genomes has n(n-1)/2 genome pairs, and we calculated the SSP distributions of these pairs to prepare for phylogenetic reconstruction using our model, with segment size l s =150 (and l s cutoff =100); the last segment is discarded as it is shorter than the others.
Real E. coli genomes We also made test-groups using different combinations of genome sequences chosen from 55 E. coli strains (see Supplementary Table S1 for their names and Genbank Accession IDs). We detected the orthologous gene families in these 55 genomes by performing the ProteinORTHO algorithm (v5.11) (Lechner et al. 2011) on their amino acid coding sequences (CDS), which identifies 1,636 orthologous gene families that have one copy in each of these 55 strains; see Additional File 2 for a list of these core genes and Additional File 4 for the genes to orthologous gene families mapping provided by ProteinORTHO. We aligned the nucleotide sequences of these 1,636 universal orthologous gene families using MAFFT (Katoh and Standley 2013) with options "--maxiterate 1000" and "--localpair"; we performed another alignment on their amino acid sequences in the same way. To evaluate the accuracy of different phylogenetic algorithms using GLOOME, we made 100 test-groups, each with ten strains randomly chosen from the 55 strains. We concatenated the alignment of all universal orthologous gene families to make a 'supergene' nucleotide sequence and a super-gene amino acid sequence for each strain (see Additional File 2 for the strains in each test-group, and Additional File 4 for their sequences). To prepare the pairwise SSP distributions, we removed all positions that have dashes in the alignment of each core gene, and divided the alignment into segments with length l s , ignoring the last segment if it has fewer than l s sites. For each pair of strains in a test-group, we collected their segments in different core genes, enumerated the SSPs on each segment to get their pairwise SSP distribution; note that each pair of strains has a SNP distribution obtained from their nucleotide sequences, and a SAP distribution from their amino acid sequences. We used segment sizes l s =150 (l s cutoff =100) for nucleotide sequences, and l s =50 (l s cutoff =50) for amino acid sequences.
Alternatively, we evaluated the consistency of the phylogenetic algorithms by comparing pairs of phylogenetic trees reconstructed from sequences of different core genes of identical strains. We made another 100 test-groups that are different from the 100 groups in the GLOOME-test, where each group has ten strains randomly chosen from the 55 strains. In each group, we randomly assigned a core gene to be either in set A or set B. We concatenated the gene sequences in set A (B) of a strain to make its super-gene A (supergene B); we then applied different phylogenetic algorithms to reconstruct tree A (tree B) from the super-gene A (super-gene B) of different strains in the group. For each pair of strains in a test group, we also prepared the SSP distribution A and SSP distribution B, using the nucleotide sequences (or amino acid sequences) of the genes in set A and set B, in the same way that is done in the GLOOME-test; we then reconstructed tree A and tree B using the CGP algorithm. See Additional File 4 for the strains in each test-group, the core genes of set A and set B, and also their sequences. See Additional File 3 for the unrooted SD, the unrooted BSD, and the rooted BSD between the pair of reconstructed trees in each test group.
Real E. coli genome sequences For the nucleotide sequences, we used the algorithms and settings same as the simulated sequences. For amino acid sequences, we used RAxML with the model PROTGAMMAAUTO; the resultant trees are mid-point-rooted by RAxML (option '-f I').

Evaluating how the UPGMA assumption perturbs the coalescent time inference in CGP algorithm
To evaluate how the UPGMA assumption perturbs the coalescent time inference, we fitted our theoretical model to the SSP distributions of real genomes across the 100 GLOOMEtest-groups. Based on the CGP phylogenetic reconstruction performed on the pairwise SSP distributions in each group, we obtained the optimal set of model parameters (μ, ρ, θ, δ TE ) and also the optimal tree that maximize the fit score. We then simulated the theoretical distribution using this optimal parameter set, and fitted it to the empirical SSP distribution of different pairs of genomes to infer their coalescent times. For a pair of genomes, we defined the directly inferred coalescent time, t direct , to be the time that minimizes the cross entropy between the theoretical distribution and the empirical distribution of the pair; alternatively, the coalescent time of the pair reported by the optimal tree is denoted as t tree . To quantify the perturbation caused by UPGMA assumption, we calculated Δ, the deviation between t direct and t tree , defined as = ( − )/( + ).

Estimating the CPU time of phylogenetic reconstruction performed by different algorithms
To assess and compare the CPU load of different phylogenetic reconstruction algorithms, we executed the linux shell script for the reconstruction of a phylogenetic tree within the Perl environment, and used the Perl package "Benchmark" to help record the time at the start and at the end of each phylogenetic reconstruction. We then applied the function "timediff" within package "Benchmark" to calculate the time difference between the start and the end, and then the function "timestr" to print out the time measurement. "timestr" outputs the "wallclock time", "user time", "system time", "user time of children", and "system time of children". The wallclock time corresponds to the time that physically elapsed between the start and the end of the script. However, not all the algorithms are executed in single-threadmode: RAxML uses at least two threads, and Gubbins calls RAxML and thus is also multithreaded. Therefore, we summed up the user time, system time, user time of children and system time of children, and used this sum to represent the CPU load. We performed time measurement for the phylogenetic reconstructions of the 100 GLOOME-test-groups using different algorithms, in a computer system with Linux distribution Debian GNU/Linux 10 (buster), kernel 4.19.98-1, and AMD EPYC 7601 CPU.

Evaluating the quality of a reconstructed tree
Trees reconstructed from simulated genome sequences We measured the unrooted symmetric distance (SD) (Robinson and Foulds 1981) between a reconstructed tree and the true tree to quantify the accuracy of topology. We measured the unrooted and rooted branch score distance (BSD) (Kuhner and Felsenstein 1994) between a reconstructed tree and the true tree to quantify the branch lengths and root positioning of the reconstructed tree; before calculating the branch score distance of two trees, we normalized each tree by dividing the length of each branch by the total branch length.
Trees reconstructed from real E. coli sequences We evaluated the accuracy of a reconstructed tree by comparing the tree to the phylogenetic signals encoded in the absence and presence of genes across different genomes. We considered each internal node of the tree to be an ancestral strain, and used the GLOOME algorithm (VR01.266) (Cohen et al. 2010) along with the default parameters of the online version of GLOOME (Evolutionary model: fixed gain/loss ratio, rate distribution Gamma), to reconstruct the presence and absence of different orthologous gene families in the ancestral strains. GLOOME takes a tree, and also the presence and absence of different orthologous gene families across the extant strains of the tree as input. As we have used Proteinortho (Lechner et al. 2011) to identify the orthologous gene families in the 55 E. coli genomes, an orthologous gene family is present in an extant strain if there is one or more copies there, and absent otherwise. The GLOOME output includes the GLOOME posterior likelihood (GPL) for the ancestral genome reconstruction. We used GPL to quantify the accuracy of a reconstructed phylogenetic tree, because the higher is the GPL, the more consistent is the tree with the phylogeny inferred from the distribution of orthologous gene families across different extant genomes.
Supplementary Figure S2. Illustration of possible local moves that split an internal node into two. For an internal node that has more than two children, possible moves comprise moving the node upwards or downwards by one time-step (not shown), and also moving one of its downstream branches upwards or downwards by one time-step and splitting the internal node into two (all 6 possible moves that split an internal node with 3 children are shown).
Supplementary Figure S3. An example of phylogenetic trees of 10 E. coli / Shigella strains reconstructed by CGP, RAxML, ClonalFrameML and Gubbins from the nucleotide sequences and also from the amino acid sequences. This group is one of the one hundred groups of real genome sequences prepared for the GLOOME-test.
Supplementary Figure S4. Distribution of CPU time for the reconstruction of a phylogenetic tree, by CGP, RAxML, ClonalFrameML, and Gubbins applied to nucleotide sequences, and by CGP and RAxML to amino acid sequences. Note that ClonalFrameML takes the RAxML tree as input, and rescales the tree's branch lengths. See Supplementary Text for details. Figure S5. Box plot showing the distribution of unrooted SD, unrooted BSD, rooted BSD, as well as the distribution of their z-score, of pairs of trees inferred from the same group of strains to evaluate the consistency of reconstruction by CGP, RAxML, ClonalFrameML, and Gubbins applied to nucleotide sequences, and also CGP and RAxML applied to amino acid sequences.

Supplementary
Supplementary Figure S6. Distribution of Δ-the perturbation on branch lengths / coalescent times caused by the UPGMA assumption in the CGP algorithm. See Supplementary Text for details. Supplementary Table S1. List of 55 genomes of E. coli and Shigella used in this study.