Holistic view of the seascape dynamics and environment impact on macro-scale genetic connectivity of marine plankton populations

Background Plankton seascape genomics studies have revealed different trends from large-scale weak differentiation to microscale structures. Previous studies have underlined the influence of the environment and seascape on species differentiation and adaptation. However, these studies have generally focused on a few single species, sparse molecular markers, or local scales. Here, we investigated the genomic differentiation of plankton at the macro-scale in a holistic approach using Tara Oceans metagenomic data together with a reference-free computational method. Results We reconstructed the FST-based genomic differentiation of 113 marine planktonic taxa occurring in the North and South Atlantic Oceans, Southern Ocean, and Mediterranean Sea. These taxa belong to various taxonomic clades spanning Metazoa, Chromista, Chlorophyta, Bacteria, and viruses. Globally, population genetic connectivity was significantly higher within oceanic basins and lower in bacteria and unicellular eukaryotes than in zooplankton. Using mixed linear models, we tested six abiotic factors influencing connectivity, including Lagrangian travel time, as proxies of oceanic current effects. We found that oceanic currents were the main population genetic connectivity drivers, together with temperature and salinity. Finally, we classified the 113 taxa into parameter-driven groups and showed that plankton taxa belonging to the same taxonomic rank such as phylum, class or order presented genomic differentiation driven by different environmental factors. Conclusion Our results validate the isolation-by-current hypothesis for a non-negligible proportion of taxa and highlight the role of other physicochemical parameters in large-scale plankton genetic connectivity. The reference-free approach used in this study offers a new systematic framework to analyse the population genomics of non-model and undocumented marine organisms from a large-scale and holistic point of view. Supplementary Information The online version contains supplementary material available at 10.1186/s12862-023-02160-8.


Introduction
Marine species from epipelagic plankton are drifting organisms that are abundant in the global ocean, play an active role in Earth's biogeochemical cycles, and form a complex trophic web with high taxonomic diversity based on fish resources [1][2][3][4][5][6][7][8][9] .Understanding the present connectivity between populations or communities of plankton is thus crucial to apprehend upheavals due to climate change in oceans [10,11].Due to their potentially high dispersal and large population size, planktonic species have long been thought to be homogenous and highly connected across oceans, but this assumption has been challenged by empirical studies over the past two decades [12] .Planktonic species are characterized by theoretically high population effective sizes [13,14], which reduces the power of genetic drift and makes selection and beneficial mutations stronger drivers of their evolution, as exemplified in the SAR11 Alphaproteobacteria [15] , but the balance between neutral evolution and selection is still debated [16,17].Furthermore, plankton evolution also seems to be strengthened by acclimation through variations in gene expression or changing phenotypes in response to environmental conditions [18][19][20][21].
Two major forces can affect gene flow between planktonic populations: abiotic factors, including marine currents, and biotic factors.First, as planktonic species are transported passively and continuously by marine currents, we could expect that the "isolation-by-current" shapes the genetic structure of populations.Conversely, cosmopolitan, panmictic and/or unstructured species have been reported multiple times in Copepoda [18,[22][23][24][25][26], Collodaria [25], and Cnidaria [26].Other studies have shown more complex patterns, with genetic structures mainly observed at the basin level in Copepoda [27] , Pteropoda [28] , diatoms [29] , and Cnidaria [30] or at the mesoscale in Chaetognatha [31] , Hexanauplia [32][33][34], Dinophyceae [35], and Macrocystis pyrifera [36] .Given the complexity of oceanic processes, classical landscape genomics frameworks have been adapted [37] to better model the dispersion and marine currents on populations over the seascape.In seascape genomics, the "isolation-by-currents" replaced the "isolation-bydistance" effect [38].In this context, modelling oceanic circulation at the macro-and meso-scales is a prerequisite for capturing water mass connectivity [38] .Successful approaches using data derived from larval dispersal models have been used in fish and coral [39][40][41] and the use of Lagrangian travel time estimates combined with genetic data has shown promising results in explaining gene flow [33,36].
Simultaneously, changing environmental conditions may lead to selective pressure that counteracts the effect of dispersion induced by marine currents, leading to higher differentiation.Some good examples are temperature-driven genetic structures from bacteria to cnidaria [15,30,42] and the effect of salinity and silicate in diatoms that can even favor speciation in estuaries [43][44][45].Biotic drivers based on competition and coevolution have also been reported to shape evolution [46] .These findings enhanced our understanding of plankton connectivity, but they focused on documented species with reference sequences or often used few molecular markers, such as mitochondrial (COI) or ribosomal genes (16S, 18S, 28S), and/or were restricted to mesoscale sampling.
Advances in environmental genomics realized by shotgun sequencing offer a new perspective for the population genomics of marine plankton species based on metagenomic data.Diversity in ocean microorganisms can now be better-understood, thanks to ambitious expeditions [47,48].Particularly, Tara Oceans data provide a unique dataset from many locations in all oceans worldwide, enabling global approaches to investigate plankton [49][50][51][52], but blind spots in terms of taxonomy or function are still an obstacle for further analyses due to the lack of reference genomes or transcriptomes, especially for eukaryotes.The first approach to address this issue relies on the use of metagenome-assembled genomes (MAGs) that enable the retrieval of numerous lineages from metagenomic samples, especially small-sized genomes found in viruses, prokaryotes, and protists [45,49,[53][54][55][56].The second method is single-cell sequencing after flow cytometric sorting, which allows genome reconstruction of small eukaryotic species [57].
An alternative method for studying plankton population genomics has been proposed based on single nucleotide polymorphisms (SNPs) calling directly from metagenomic data using an assembly-free strategy [58].The latter uses DiscoSNP + + [59], a SNP calling tool applied directly to raw high-throughput sequencing data without assembly.Its application to Tara Oceans metagenomic data generated 18 million SNPs and a proof of concept of their utility and robustness for population genomics has been demonstrated on the epilagic copepod Oithona nana using its genome as a reference to relocating SNPs [58].SNPs can also be directly clustered by species to bypass the use of genome references.To perform this, metaVaR was developed and applied as a proof-of-concept to simulated and real metagenomic datasets [60].This approach allows the profiling of the genomic differentiation of several species separately and opens gates for new investigations.
Here, we propose to study plankton connectivity from a holistic point of view, using metagenomic data extracted from samples gathered during Tara Oceans expeditions in the Mediterranean Sea, Atlantic Ocean, and Southern Ocean.We clustered the 18e 6 SNPs into 113 taxa that may correspond to complexes of closely related species.Minor allele frequencies of each species were used to estimate genomic differentiation using pairwise F ST .The genomic distances were modelled with environmental parameters, including Lagrangian travel times between sampling sites [61] to estimate the relative contribution of environmental factors, especially marine currents, to the genetic connectivity of plankton populations.

Taxonomy and biogeography of 113 plankton species based on reference-free SNPs
We used over 18e 6 SNPs called using a reference-free approach generated from 114 metagenomic samples collected from 35 Tara stations (Fig. 1A).The SNPs were clustered into groups SNPs belonging to the same taxon, and the minor allele frequencies of each SNPs were computed by population (Fig. 1B, Supplementary Table S1).
Fig. 1 Clustering of SNPs by taxon from the metagenomic dataset of Tara Oceans.A Worldmap showing the locations of the 35 Tara Oceans stations used in the study.Each circle is divided into four parts, depending on the detection of plankton taxa by SNPs clustering.Grey colour indicates that no species were retrieved.B Pipeline to cluster SNPs by species using metaVaR, with additional statistics by size fraction.From top to bottom: number of SNPs before and after filtering, number of SNPs clusters detected, and number of plankton taxa finally selected Most of the SNPs harbouring sequences used for taxonomic assignment did not show any signal due to lack of planktonic data in the public databases.However, we assigned 113 taxa showed various lineages spanning all plankton trophic levels with a predominance of Hexanauplia (46 taxa), Bacteria (24 taxa), and Eumetazoa (21 taxa, comprising three Cnidaria and one Echinodea) (Fig. 2A  and B).Among the Bacteria, we found nine Cyanobacteria, with eight taxa assigned to Synechococcus and one assigned to Prochlorococcus.Other notable eukaryotic taxa include Dinophycea (5), Haptophyta (4), Mamiellales (3), Collodaria (2), Ciliophora (2), Cryptophyta (1), and Pelagomonadaceae (1).Only four taxa presented very poor assignment (unclassified or eukaryotes) and one virus.In Mamiellales, two species were identified as Bathyccocus prasinos and were related to previously observed results from Tara Oceans (Supplementary Table S2).The number of SNPs per taxon ranged from 114 to 1,767.As expected, bacteria dominated the smaller size fractions, and Eumetazoan (Cnidaria, Copepods, and other Bilateria) were found in the larger size fractions.Most taxa (95.84%) were present at four to six stations, with a maximum of eight stations (Supplementary Figure S1).The number of taxa per station showed an important variation (Fig. 2C), from four to 43 taxa (in TARA_67/81/84/85 and TARA_150, respectively).Notably, stations from the Southern Ocean (TARA_82 to 85) contained fewer taxa (from 4 to 7), with four taxa (Gammaproteobacteria, Haptophyta, Flavobacteria, and Calanoida) being present solely in the Southern Ocean (SO).Finally, 36 taxa were present in only one basin, while the majority (80 taxa) occurred in the Northern Atlantic Ocean (NAO) and one other basin (Fig. 2D).

Global view of plankton genetic connectivity in the surface layer
Pairwise F ST was used to estimate the population genomic differentiation.As expected, the global population genetic connectivity was higher within than between basins for each size fraction, either separately or together (Fig. 3A).Overall, taxa occurring in NAO presented moderate differentiation from their populations in the Mediterranean Sea (MED) and the Southern Atlantic Ocean (SAO) (0.118 and 0.143, respectively) (Fig. 3B).SAO and MED presented relatively high differentiation (0.222).Finally, this analysis underlined the important global differentiation of the SO from other basins (0.201-0.555), but also a high differentiation within the SO (0.397).The population genetic connectivity was significantly different between size fractions (Kruskal-Wallis, p-value < 0.05), being higher in the 180-2000 µm and lower in 0.8-5 µm and (Fig. 3C).Population differentiation between the six larger taxonomic groups was related to the body size of the lineages, with differentiation being relatively lower in copepods and other animals than in unicellular eukaryotes, bacteria, and viruses (Fig. 3D). Figure 3E shows a large spectrum of population genomic differentiation patterns, with a maximum median pairwise-F ST between 0.03 and 1. Extreme cases with a median pairwise F ST of 1 were observed for 13 taxa, and a global F ST distribution strongly shifted to 1, as exemplified by Collodaria species (15_200_2) (Supplementary Figure S2).These 13 taxa illustrate the fact that our approach generates a non-negligible proportion of clusters of SNPs corresponding to complexes of closelyrelated species.

The relative role of the environmental actors in population genetic connectivity
We modelled the pairwise-F ST of each taxon as the response variable explained by six environmental factors Lagrangian times (Fig. 4A), temperature, salinity, nitrate, silicate, and phosphate (Fig. 4B) using a linear mixed model (LMM).The fixed part of the explained Fig. 3 Global view of genomic differentiation of plankton populations.A Distributions of the 113 taxa's pairwise F ST matrices.In red, pairwise F ST of populations belonging to the same basin; in blue to different basins.B Pairwise F ST matrix between basins.The values represent the mean of all the median-F ST between stations regrouped according to the basin they belonged to.C Distributions of the taxa's median pairwise F ST , according to their size fractions.Black diamonds correspond to the mean of the distributions.The bars on the top correspond to the comparisons done by pairwise Wilcoxon tests (p-values: * < 0.05, ** < 0.01, *** < 0.001, **** < 0.0001) D Distributions of the taxa's median pairwise F ST , according to their taxonomic group.Black diamonds correspond to the mean of the distributions.Each bar corresponds to taxonomic groups displaying no significant differences.E Scatter plot, each dot is a taxon.The size of each dot reflects the global median F ST of the taxa's F ST distribution (i.e., F ST computed over all the populations of a taxon) variance was low for each taxon, ranging from 0 to 14% and was not further analysed (Supplementary Table S1).Among all tested environmental variables, Lagrangian travel time, temperature and salinity were the major contributors to genomic differentiation and were highly correlated to the first three components (67% explained variance) (Fig. 5A).The variance contributions of nitrate, silicate and phosphate respectively followed the last three components.
The taxa were then clustered into eight groups using k-means based on their t-SNE coordinates (Fig. 5B).We identified the most important variables for each taxon in each cluster (Fig. 5C).Two clusters were linked to Lagrangian travel times, labelled as "Lagrangian" (14 species) and "Lagrangian 2" (13), the latter exhibiting a lower variance explained by Lagrangian.The largest cluster contained 24 taxa but was not linked to any parameter.The other taxa clusters were linked to a single environmental parameter, such as salinity for 16 taxa and temperature, silicate, phosphate and nitrate for 14, 13, 13 and 10 taxa, respectively.The clusters "Lagrangian", "Temperature" and "Salinity" presented clear differences between their respective drivers compared to the other parameters (Fig. 5C).The clusters "Phosphate" and "Silicate" showed a wider distribution of their respective driver among the taxa they contained, with respectively salinity and phosphate sharing a high proportion of explained variance.The "Nitrate" cluster also regrouped taxa for which a non-negligible part of variance was explained by Lagrangian travel time.Each cluster contained taxa assigned to almost all taxonomic groups and presented no particular visual enrichment (Fig. 5C).This absence of enrichment was clearer in copepods, which constituted most species (Fisher's Exact Test p-value = 0.348).
Among the taxa belonging to the "Lagrangian" cluster, we observed five taxa present in the Mediterranean Sea and Southern Atlantic and one in the Northern and Southern Atlantic.Two taxa were restrained to a single basin, the Southern Ocean and Northern Atlantic.Notably, the latter, Planctomycetales (9_200_1) shows population genomic differentiation linked to local marine barriers, with the population from TARA_148 being more isolated from the others (TARA_150, 151 and 152) (Fig. 6A).Another example of within-basin differentiation concerns the Mediterranean Gammaproteobacteria 7_300_4 from the "Lagrangian 2" cluster.The genomic differentiation clearly shows a pattern correlated to the Mediterranean marine currents (Fig. 6B), with less Fig. 4 Lagrangian travel times and other environmental parameters.A Minimum times retained for analyses.In grey, asymmetric times were not the minimum, thus the matrix accounts for the "direction" of currents between stations.B Measures of temperature, salinity, nitrate, phosphate, and silicate extracted from World Ocean Atlas (WOA) for the 35 Tara stations.On the right, colour scales for each parameter.For the world map of Tara stations, see Supplementary Figure S3 connectivity between TARA_7, 9 and TARA_23, 25, 18.In the SO, Gammaproteobacteria (12_100_16), Flavobacteria (7_100_6), Haptophyte (4_50_2) and Calanoid (5_20_1) were observed at stations TARA_82, 83, 84 and 85, where two main currents in the area were spotted: the Malvinas Current and the Antarctic Circumpolar Current (ACC) (Fig. 7A, Supplementary Figure S4).These four taxa presented among the highest global median F ST (0.35 to 0.84) and revealed very low connectivity between their populations (Fig. 7B).Particularly, Haptophyta species present genomic differentiation linked to both the ACC and the Malvinas Current.
Certain taxa display a clear correlation between their population genetic connectivity and a single environmental parameter that differs from the marine current.For example, in the "Phosphate" cluster, the Dinophyceae (8_10_11) population from TARA_70 was more isolated from the other NAO populations and the TARA_70 site is also characterized by a higher phosphate concentration (0.264 µmol.L −1 against 0.031-0.106µmol.L −1 ) (Fig. 6C).In the "Nitrate" cluster, the populations of a Mamiellales taxon (5_100_1) from TARA_146 and TARA_147 were highly connected and this correlated with the variation of nitrate concentration (Fig. 6D).In the "Temperature" cluster, the widely distributed Calanoida species (12_5_104), detected in the MED, NAO, and SAO presented a relatively higher genetic distance between populations from TARA_20 and 68 (F ST = 0.08) (Fig. 6E) and was linked to a higher temperature difference.We observed genomic differentiation along a silicate gradient for a cyanobacteria (8_100_13), showing high isolation of the TARA_151 population compared to populations from TARA_146, 147 and 150 (Fig. 6F).The genetic isolation of TARA_151 was correlated to a higher concentration of silicate in the North-East Atlantic.We also found a few taxa with a large proportion of the genomic differentiation explained by two factors, as for a Cnidaria (20_100_10) from the "Salinity" cluster, with temperature being also an important explaining factor (Fig. 6G).It was also the case for a cyanobacteria (7_7_9) from "Lagrangian 2" cluster which presented a high genomic differentiation between MED and NAO (Fig. 6H) that correlates with both Lagrangian travel times and salinity, the Mediterranean Sea presenting higher salinity than NAO.

Reference-free approach for non-model species population genomics
Thanks to our approach exploiting metagenomic data, the population connectivity was reconstructed for planktonic eukaryote taxa representing the different trophic levels of the epipelagic layer of oceans and enabled a realistic overview of the population structures of marine planktonic species lacking reference sequences.With hundreds of variants per taxon, we drew the silhouette of population structures across four oceanic areas using more markers than previous studies often based on few genetic markers, few samples, and limited to small geographic areas.It must be noted that for each taxon, most sequences did not show any taxonomic signal, an observation already made in other studies using Tara Oceans data [50,52].The level and quality of taxonomic assignment are both due to a lack of references in databases and the short length of the sequences supporting the variants, reducing the chance of matching a reference with an acceptable coverage and having a taxonomic assignment with a high resolution.Notwithstanding these technical limitations for taxonomic annotation, four notable taxonomic groups have been described and could be related to previous observations.First, we were able to detect a virus from the order Caudovirales which probably belongs to the bacteriophage family of Myoviridae.These viruses are known to be abundant compared to other viruses in oceans [62] , notably infect Cyanobacteria (i.e., Prochlorococcus and Synechococcus), and constitute most viral populations in GOV 2.0 [63] .Second, two Cyanobacteria (15_500_9 and 7_20_37), probably belonging to the Synechococcus genus, were detected in the same locations in the Mediterranean Sea, with clear F ST unimodal distributions (Supplementary Figure S2) and could be related to the ecotypes of Mediterranean Synechococcus [64] .Third, in protists, two species corresponding to Mamiellales (6_5_14 and 9_500_10) are respectively located in Tara stations where Bathycoccus prasinos and Bathycoccus spp.TOSAG39-1 were the most abundant (Supplementary Table S2), as described in a previous study using Tara Oceans metagenomic dataset [65] .Finally, copepods formed the largest group, with a predominance of calanoids over cyclopoids.Numerous copepods were expected considering their high abundance in oceans [66,67] and good representation in the Tara Oceans dataset.A limitation of our approach was the presence of a cluster of SNPs predicted to belong to a single taxon but harbouring extreme pairwise-F ST values, showing that some validated clusters of SNPs may refer to a complex of closely related species, as previously described for the cosmopolitan copepod Oithona similis [68].While the reference-free approach provides interesting results, the lack of reference sequences does not allow downstream analyses to provide functional annotation of the SNPs.Future reference-based methods, including MAGs or newly built genome assemblies, will greatly help to capture more polymorphisms, refine taxonomical assignment, and allow the identification of genes and their related functions impacted by nucleotide polymorphisms.
We showed that populations of smaller organisms, such as protists and bacteria, are more structured than those of zooplankton.These first two groups of organisms are not characterized by the same range of demographic parameters, such as population size, dispersal capacity, or generation time, leading to very different effects on their evolution.Moreover, these taxa experienced radically different demographic histories, limiting the comparisons from the use of F ST , an estimate affected by the population effective size, described as large among plankton organisms in the few studies that estimated this parameter [14,69,70].

Relative effects of environment and currents on macro-scale population genetic connectivity of plankton
Over 113 plankton taxa, Lagrangian travel time, salinity and temperature were the most important tested genomic differentiation drivers, while nitrate, silicate and phosphate had a relatively lower impact and this does not seem to be clade-specific.The effect of Lagrangian travel time on population differentiation illustrates the role of ocean currents and seascape dynamics in population genetic connectivity and validates the isolation-by-current hypothesis for a non-negligible number of plankton species.Here, we showed that populations belonging to different basins tend to be more differentiated than populations located in the same basin, which could be easily explained by relatively smaller connections within basins than between basins.While this trend has been observed several times [28,71,72], interesting patterns of population genetic connectivity remain between the basins.We observed the central role of the NAO, which connects its populations to both MED and SAO, and a slightly lower connection between MED and SAO.Some plankton populations from the SO were isolated from the other basins.This situation has already been observed in the copepod Metridia lucens [73] , as well as important differentiation within the SO.This area is characterized by differences in environmental conditions, and compared to the rest of the basins, with higher silicate, nitrate and phosphate concentrations on one hand, and lower salinity and temperature (Fig. 4B).Additionally, water masses are driven over thousands of kilometers by the complex Antarctic Circumpolar Current (ACC) [74] , which could favour long-range gene flow around the Antarctic.The Lagrangian data traced the northward Malvinas current (an ACC branch), which mixes warm water masses from the Brazil current with cold waters of the ACC in the Brazil-Malvinas confluence [75] , possibly favouring the isolation of plankton populations in the south of this area.This specific environment could explain why these species are both specific to the Austral Tara stations and are highly differentiated.
Salinity and temperature affect biogeography, community composition and population structure [15,28,44,51,76].The role of nutrients such as nitrate [77], silicate [25,78,79] , and phosphate [80] in marine microorganism metabolism and diversity has been well studied, but their impact on the population genetic structure has never been investigated at this scale [81][82][83].A large part of the genomic differentiation could not be explained in this study, suggesting missing parameters.The absence of key physicochemical parameters, such as metals [21,84], sulfur [85] and pH [19] could also enhance our understanding of plankton genomic differentiation.The contribution of biotic interactions between trophic levels, such as zooplankton grazing on phytoplankton, should also be examined [86].

A holistic view of plankton connectivity as a mosaic
By combining population genomics with environmental factors and seascape dynamics, we identified planktonic species groups with genomic differentiation driven by the same factors.These different groups of species allowed for the sketching of a mosaic of connectivity patterns in the seascape.This mosaic is underlined by the diversity of environmental conditions influencing the differentiation and shows that the living range of species is not correlated to their population structure, that is, cosmopolitan species do not necessarily present an absence of population structure and species with populations present in close locations can exhibit high differentiation (such as SO).Thus, we showed how population genomic analyses at different trophic levels are important to decipher the connectivity of plankton and can be complementary to the traditional metabarcoding approach that fails to quantify the connectivity and intra-species structure patterns.The next step would be to better capture the relative effects of evolutionary forces acting on plankton genomes, such as genetic drift and selection.Haplotype data could resolve this question, but in the framework of metagenomics, the latter remains a technical and computational challenge.
Global warming and its impact on ocean climate are expected to have a significant impact on marine plankton biogeography by restructuring plankton assemblages [87].In this context, the relative role of seascape dynamics and environmental factors in plankton population genetic connectivity can be expected to shift towards a major role of temperature.Our ability to model plankton adaptation and evolution at the molecular level, in response to global warming, will permit better projections of future plankton biogeography.

Single nucleotide polymorphisms from Tara Oceans metagenomic data
We used a set of 18e 6 SNPs produced in a previous study [58].SNPs were detected from metagenomic data generated from 35 Tara Oceans sampling sites corresponding to four distinct size fractions (0.8-5 µm, 5-20 µm, 20-180 µm, and 180-2000 µm) from the water surface layer, for a total of 114 samples (Fig. 1A).For further analyses, Tara stations were separated into four groups corresponding to the basins they belonged to: the Mediterranean Sea (MED; TARA_7 to TARA_30), Northern Atlantic Ocean (NAO; TARA_4, TARA_142 to TARA_152), Southern Atlantic Ocean (SAO; TARA_66 to TARA_81), and Southern Ocean (SO; TARA_82 to TARA_85).The full protocols for sampling, extraction and sequencing have been detailed in previous studies [88,89].All maps in the figures of the current study were generated with the rnaturalearth R package (https:// github.com/ ropen sci/ rnatu ralea rth).

Clustering of intra-species SNPs
To identify taxon specific SNPs, we used metaVaR version v0.2 [60].We discarded SNPs called from low-covered loci, repeated regions that present very high coverage, and SNPs from loci with non-null coverage in less than four samples.This was performed using metaVarFilter.plwith parameters -a 5 -b 5000 -c 4. The filtered SNPs were clustered based on the covariation of their loci depth of sequencing coverage using multiple density-based clustering instances [90,91], a total of 187 couples of parameters epsilon and minimum points (ε, MinPts) were tested with epsilon ε = {4, 5,6,7,8,9,10,12,15,18,20} and MinPts = {1,2,3,4,5,6,7,8,9,10,20,50,100,200,300,400, 500}.This clustering generated a set of clusters for each parameter couple (Supplementary Figure S5).To retain taxon specific SNPs, we selected non-overlapping clusters, that is, clusters sharing no SNPs and maximizing a score based on the distribution of the sequencing depth of coverage of the loci (expected to follow a negative binomial distribution).The selection was performed by applying a maximum-weighted independent set algorithm to the scored clusters.To avoid any allele frequency bias due to low depth of coverage, we selected only loci with a depth of coverage between 8 × and the maximum expected depth of coverage in all populations, based on the empirical depth of coverage distribution.Finally, only clusters with at least 100 SNPs, for which at least three samples presented a median depth of coverage over 8 × were retained, leading to a final set of 113 clusters corresponding to species (or complex of closely related species).For each species, we generated an allele frequency matrix for each biallelic locus.

Taxonomic assignment of species
Taxonomic assignment of each species was performed using three different methods (Supplementary Figure S5).For the first method, the short sequences supporting the variants (generated by DiscoSNP + +) were mapped on the downloaded NCBI non-redundant database with diamond v0.9.24.125 [92] using blastx and parameter -k 10, and the results were filtered based on the E-value (< 10 -5 ).The taxonomic ID and bit scores of each match were maintained.A fuzzy Lowest Common Ancestor (LCA) (see https:// github.com/ insti tut-de-genom ique/ fuzzy-lca-module) method was used to assign a taxonomy to each sequence using bitscore as a weight with -r 0.67 (i.e.taxa covering at least 67% of all bitscores) and -ftdp options.The highest phylogenetic rank was retained as the best assignment for each sequence.For the second method, the sequences were mapped using the blastn algorithm implemented in diamond on MATOU, a unigen catalog based on Tara Oceans metatranscriptomic data [50] , and the last method involved mapping the sequences on the MMETSP transcriptomic database [93] .The species (or a complex of closely related species) were assigned to the most probable taxon to offer three taxonomic assignment levels, from the most precise to the widest (Supplementary Table S1).The final set of taxa was first grouped into 24 taxonomic groups and finally merged into six reliable wider groups (Viruses, Bacteria, Unicellular Eukaryotes, Copepods, and other animals, and poor classification) (Fig. 2B).

Population genomics analysis
To investigate the genomic differentiation of each taxon, F ST was used and computed for each variant as follows: , where p and σ 2 are, respectively, the mean and variance of minor allele frequency across the considered populations [94].Two F ST calculations were performed, the global F ST was calculated using among all populations, allowing the analysis of the global F ST distribution.Then, a pairwise F ST was calculated between the populations and the median pairwise F ST was retained as a measure of genomic differentiation between the populations.We tested the effect of oceanic basins, taxonomy, and size fraction on the genomic differentiation of each species using a Kruskal-Wallis test.When the test was significant (p < 0.05), multiple comparison Wilcoxon tests were performed between groups.To estimate the genetic connectivity between and within basins, we regrouped Tara stations based on their locations (i.e., MED, NAO, SAO and SO) and computed the mean F ST between and within basins.

Lagrangian travel time estimation and environmental data
To estimate Lagrangian transport, we used a method based on drifter data [61] to compute the travel time of the most likely path between Tara stations back and forth.We used the public database of the Global Drifter Program (GDP), managed by the National Oceanographic and Atmospheric Administration (NOAA) (https:// www.aoml.noaa.gov/ phod/ gdp/), which contains information on drifters ranging from February 15, 1979, to September 31, 2019.We extracted the data for both drogued and undrogued drifters (i.e., drifters that lost their socks) to maximize the information.No drifters have been observed to exit the Mediterranean Sea through the Strait of Gibraltar.Therefore, to avoid missing data, we arbitrarily added 100 years to the travel times of pathways out of the Mediterranean Sea over the Strait of Gibraltar and added one year to the pathways going into the Mediterranean Sea, based on previous models of surface water [95,96].We used 450 rotations within the method to reduce the reliance on travel times on the grid system used.Two travel times are obtained by the method for each pair of stations, back and forth, resulting in an asymmetric travel time matrix between all possible station pairings.For our analyses, we retained only the minimum of the two travel times.

Variation partitioning of the genomic differentiation
To estimate the relative contribution of environmental parameters and Lagrangian travel time to the variance of the genetic connectivity, a LMM was applied using the R package MM4LMM [97] .The model applied was as follows: Y FST = µ + Zu + ε , where Y FST is the vector of observations of F ST values with a mean µ, Z is a known matrix of parameters relating the observations Y FST to u, is a vector of independent random effects of zero mean, and ε is a vector of random errors of 0 means and covariance matrix proportional to the identity (white noise).For each pairwise F ST matrix, the corresponding matrix of the minimum Lagrangian travel time is retrieved.Temperature, salinity, silicate, phosphate, and nitrate measurements were extracted for all the stations where the plankton species were present, and a Euclidean distance was computed between the stations for each of these parameters.The LMM was then applied to pairwise F ST values using the five environmental distances and Lagrangian travel times after scaling, adding a variance of 1 for each explicative variable.We considered these parameters as the independent variables.As a result, an estimate of the contribution of each parameter to the total variance of the pairwise F ST was obtained.Additionally, a fixed effect and proportion of unexplained variance were retrieved.After F ST variance decomposition, two principal component analyses (PCA) were performed.The first was performed on the variance explained by the six variables and the unexplained part of the variance over the 113 species.From this PCA, the unexplained F ST variance (Supplementary Figure S8) was high in most species, strongly contributing to the first component (37% explained variance).For clarity, a second PCA was performed by removing the unexplained part of the variance.For both PCAs, the correlation of the variables with the components and the contribution (i.e., the ratio of cos 2 of each variable to the total cos 2 of the components) of the variables to the components were extracted.PCAs were performed using the FactoMineR v2.3 R package [98].

Identification of species with similar environmental parameters-driven genetic connectivity
To identify taxa sharing similar environmental parameters that drive their genetic connectivity, the variance explained by each factor was used with dimensional reduction through t-distributed stochastic neighbour embedding (t-SNE) using the Rtsne R package [99] with a perplexity of 5 and 5,000 iterations, and we extracted the taxa coordinates.Subsequently, k-means clustering was performed to identify taxa with common patterns of explained variance, with K = 8 based on the observation of the t-SNE point density.To identify which set of parameters drives the differentiation of a cluster, we compared the distributions of the explained variance of each parameter within the cluster using the Kruskal-Wallis and Wilcoxon paired tests (p < 0.05).

Fig. 2
Fig. 2 Taxonomy and biogeography of the plankton species.A Distribution of the number of SNPs for each size fraction.On the top, pie charts represent the taxonomic composition of each size fraction.B Number of taxa assigned to the six wider taxonomic groups.C Number of plankton taxa according to the basins they were detected in: Northern Atlantic Ocean (NAO), SAO (Southern Atlantic Ocean), SO (Southern Ocean), and MED (Mediterranean Sea).D World map showing the number of taxa of each taxonomic group for each Tara station.The size of the circles corresponds to the number of species detected in each station.The colours of taxonomic groups are indicated on the bottom right of the panel

Fig. 5
Fig. 5 Variation partitioning of genomic differentiation for 113 plankton taxa.A PCA was performed on the proportion of variation explained by each parameter over the 113 plankton taxa.The colour corresponds to Pearson's correlation between the coordinates of species for a component and the variation explained by the parameters (p-values: * < 0.05, ** < 0.01, *** < 0.001, **** < 0.0001).The size of the circles represents the relative contribution (i.e. the ratio of the variable cos 2 on the total cos 2 of the component) of each variable to each component.B t-SNE and k-means (K = 8) clustering.Each dot represents a taxon.Each colour corresponds to a defined cluster obtained by k-means.The names of the clusters are linked to the following figure C Distributions of variation are explained by each factor by cluster and the taxonomic composition of each cluster.The boxplot colours are the same as in the previous figure.The asterisk * on the top of boxplots corresponds to parameters that significantly contribute the most to the genomic differentiation of the taxa included in the cluster, according to a pairwise Wilcoxon test (p-value < 0.05)

Fig. 6 Fig. 7
Fig. 6 Examples of genomic differentiation.From A to H Pairwise F ST matrices of plankton taxa mentioned in the respective titles.For each title are mentioned: the taxonomic assignment, the species ID, and the name of the cluster the species belongs to (clusters based on the abiotic parameters driving the population connectivity)