Numins contribute up to 10% of the total organelle insertions
To estimate the prevalence of nuclear mosaic insertions (numins), BLAST was first used to search the nuclear genomes of nine plants with their corresponding mitochondrial and plastid genomes. We defined the term fragment to describe a single nuclear locus with homology to organelle DNA (see Methods). The number of independent organelle insertions was inferred using concatenation of fragments (see Methods) [30]. In total, five classes of inferred insertions were identified: (a) simple numts (a single mitochondrial fragment); (b) simple nupts (a single plastid fragment); (c) complex numts (multiple mitochondrial fragments); (d) complex nupts (multiple plastid fragments); and (e) numins (both mitochondrial and plastid fragments).
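The concatenation and classification steps can be illustrated with a minimal sketch. It assumes that each fragment is reduced to a nuclear interval plus an organelle label, and that fragments on the same chromosome are merged into one inferred insertion when the gap between them is at most the concatenation distance. The `Fragment` type, the function names, and the merge rule are illustrative assumptions; the actual procedure is the one described in Methods.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    chrom: str
    start: int   # nuclear start, 0-based inclusive (assumed convention)
    end: int     # nuclear end, exclusive
    origin: str  # "mt" (mitochondrion) or "pt" (plastid)

def concatenate(fragments, max_dist=5000):
    """Group fragments on one chromosome whose gap is <= max_dist (assumed rule)."""
    insertions = []
    for frag in sorted(fragments, key=lambda f: (f.chrom, f.start, f.end)):
        last = insertions[-1] if insertions else None
        if (last and last[-1].chrom == frag.chrom
                and frag.start - max(f.end for f in last) <= max_dist):
            last.append(frag)
        else:
            insertions.append([frag])
    return insertions

def classify(insertion):
    """Assign one of the five classes (a)-(e) to an inferred insertion."""
    origins = {f.origin for f in insertion}
    if origins == {"mt", "pt"}:
        return "numin"
    kind = "numt" if origins == {"mt"} else "nupt"
    return ("simple " if len(insertion) == 1 else "complex ") + kind

# Toy coordinates, invented for illustration only.
toy = [
    Fragment("chr1", 100, 200, "mt"),
    Fragment("chr1", 300, 400, "pt"),
    Fragment("chr1", 10000, 10100, "mt"),
    Fragment("chr2", 0, 50, "pt"),
    Fragment("chr2", 60, 100, "pt"),
]
classes = [classify(ins) for ins in concatenate(toy)]
```

With these toy fragments, the first two fragments on chr1 form one mosaic insertion, the distant mitochondrial fragment stays on its own, and the two plastid fragments on chr2 are merged.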
Figure 1 shows the frequency of the five classes of organelle insertions, with a concatenation distance of 5 kb, in the nine analyzed plant genomes (listed in Additional file 1: Table S1). In total, we identified 1,782 numins across all genomes tested. The frequency of numins relative to all organelle insertions in each organism ranged from 2% (183 out of 7,963) in Z. mays to 10% (302 out of 2,939) in D. carota. Of note, changing the permitted concatenation distance has no effect on the trend of the results, as previously shown for single-origin insertions [30]. In the following, we present results for inferred insertions with a concatenation distance of up to 5 kb.
Simple numts and nupts are the most frequent organelle insertions in all studied organisms, with numts dominating the genomes of A. thaliana, C. sativus, O. sativa, S. bicolor, V. vinifera, Z. mays and P. dactylifera, and nupts dominating the genomes of D. carota and G. max. Single-organelle complex insertions are substantially more frequent than numins in all organisms. In addition, when numts are more frequent than nupts, complex numts are the most common type of complex insertion, and vice versa. Thus, simple nupts dominate in G. max with 63% of the total organelle insertions, while complex nupts dominate in G. max with 14% of the complex organelle insertions. Similarly, simple numts dominate in P. dactylifera with 39% of the total organelle insertions, while complex numts dominate in P. dactylifera with 25% of the complex organelle insertions. We also tested whether organelle fragments are joined at random, that is, whether a junction between two fragments occurs with comparable frequency regardless of the organelle origin of the fragments. Simulation of the organelle origin of all fragments shows that mitochondrial and plastid fragments are not joined randomly; pairs of fragments from the same origin (mitochondrion-mitochondrion or plastid-plastid) are overrepresented, while mosaic pairs are underrepresented, in the real data (p value < 0.05).
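One way to picture such a simulation is a label-permutation test, sketched below. The sketch assumes that each inferred insertion is reduced to the ordered list of organelle labels of its fragments, that the null model shuffles the labels across all fragments while keeping the insertion structure fixed, and that the p value is the share of shuffles with no more mosaic junctions than observed. These choices and the function names are assumptions made for illustration, not a description of the simulation actually used.

```python
import random

def count_mosaic_junctions(insertions):
    """Count adjacent fragment pairs whose organelle origins differ."""
    return sum(a != b for ins in insertions for a, b in zip(ins, ins[1:]))

def permutation_test(insertions, n_sim=1000, seed=0):
    """Empirical one-sided p value for a deficit of mosaic junctions."""
    rng = random.Random(seed)
    observed = count_mosaic_junctions(insertions)
    labels = [origin for ins in insertions for origin in ins]
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(labels)
        it = iter(labels)
        # Rebuild insertions of the original sizes from the shuffled labels.
        shuffled = [[next(it) for _ in ins] for ins in insertions]
        if count_mosaic_junctions(shuffled) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_sim + 1)
```

A small p value returned by `permutation_test` would correspond to mosaic pairs being underrepresented relative to random joining.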
Numins show NHEJ signatures
Signatures reflecting the mechanism of integration into the nuclear genome are of interest. Numts and nupts are known to integrate into nuclear DSBs via NHEJ. Two hallmarks of NHEJ are known: microhomology and blunt-end repair. We compared the junction signatures of complex numts and complex nupts to those of numins. Because NHEJ signatures can only be detected between nearby fragments of an inferred insertion, only inferred insertions with adjacent mitochondrial and plastid fragments were analyzed further. To estimate the length of microhomology, each of the corresponding organelle fragments was extended by 10 bp and the number of overlapping bases until the first mismatch was counted. In our dataset of inferred insertions with a distance of up to 5 kb, there are 3,206 dual-origin junctions. Of these junctions, 1,503 are up to 10 bp apart and were considered for the NHEJ analysis.
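The microhomology count can be sketched as follows, under one possible reading of the extension step: the left fragment is extended into the bases that follow it in its source organelle genome, the right fragment is extended into the bases that precede it in its source genome, and matches are counted outward from the junction until the first mismatch. The argument names, the assumption that the two fragments abut exactly at the junction, and the toy sequences are all illustrative.

```python
def common_prefix_len(a, b):
    """Number of leading positions at which two strings agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def microhomology(left_tail, left_ext, right_ext, right_head, max_ext=10):
    """
    left_tail:  last bases of the left fragment in the nuclear sequence
    left_ext:   organelle bases following the left fragment in its source genome
    right_ext:  organelle bases preceding the right fragment in its source genome
    right_head: first bases of the right fragment in the nuclear sequence
    Returns the number of junction bases shared by both organelle sources;
    0 corresponds to a blunt-end junction.
    """
    # Bases to the right of the junction that the left source would also explain.
    forward = common_prefix_len(left_ext[:max_ext], right_head[:max_ext])
    # Bases to the left of the junction that the right source would also explain.
    backward = common_prefix_len(right_ext[::-1][:max_ext], left_tail[::-1][:max_ext])
    return forward + backward
```

For instance, with the invented inputs `microhomology("GGC", "ATTCCG", "TGA", "ATTGAC")` the three bases ATT are shared by both sources, which mirrors the kind of 3 bp microhomology described for junction 3 in Fig. 2b.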
NHEJ signatures of complex numts and nupts were reported both between adjacent organelle fragments (inner junctions) and between the terminal organelle fragments and the nuclear genome [13, 14]. Analysis of the terminal junctions requires the availability of highly similar nuclear genomes [16, 17]. Since the similarity between the available plant genomes is not high enough, our analysis only considered the inner junctions while terminal junctions were not considered.
Figure 2 shows an example of a numin in the genome of Z. mays. Two plastid fragments and two mitochondrial fragments form an insertion of 3,573 bp. Detailed examples of two inner junctions are shown in Fig. 2b. Junction 1, between plastid fragment A and mitochondrial fragment B, shows blunt-end repair. In contrast, junction 3, between mitochondrial fragment C and plastid fragment D, shows three bases of microhomology. In this case, the bases ATT that appear in the nuclear genome are shared between the mitochondrial and the plastid genomes.
A comparison of microhomology length between junctions of dual origin and junctions of single organelle origin is shown in Fig. 3. The distribution of the number of microhomology bases is similar in the two sets, and both blunt-end repair and microhomology were identified. Statistical analysis using multinomial tests shows that, for most organisms, we could not reject the null hypothesis that junctions of dual origin have a microhomology distribution similar to that of single-origin junctions (p value > 0.05). However, in D. carota and V. vinifera the test suggests that the distributions are different. V. vinifera seems to have more blunt-end junctions in the single-organelle insertions, while dual-organelle junctions seem to include more junctions with 3 bp of microhomology. Our results suggest that NHEJ is a key mechanism in numins and that the mosaic insertion mechanism is similar to that of single-origin complex insertions.
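A multinomial comparison of this kind can be sketched as a Monte Carlo exact test. The sketch assumes that the single-origin junctions supply the null cell probabilities (one cell per microhomology length), that the dual-origin counts are the observed sample, and that the p value is the share of simulated samples that are no more probable than the observed one. The exact variant of the test used for Fig. 3 is not specified here, so the function names and these choices are assumptions.

```python
import math
import random

def log_multinomial_pmf(counts, probs):
    """Log-probability of `counts` under a multinomial with cell probabilities `probs`."""
    logp = math.lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        logp -= math.lgamma(c + 1)
        if c > 0:
            if p == 0:
                return float("-inf")
            logp += c * math.log(p)
    return logp

def multinomial_test(observed, null_probs, n_sim=2000, seed=0):
    """Monte Carlo p value: share of null samples no more probable than `observed`."""
    rng = random.Random(seed)
    n = sum(observed)
    cells = list(range(len(null_probs)))
    obs_lp = log_multinomial_pmf(observed, null_probs)
    hits = 0
    for _ in range(n_sim):
        counts = [0] * len(cells)
        for cell in rng.choices(cells, weights=null_probs, k=n):
            counts[cell] += 1
        # Small tolerance guards against floating-point ties.
        if log_multinomial_pmf(counts, null_probs) <= obs_lp + 1e-9:
            hits += 1
    return (hits + 1) / (n_sim + 1)
```

In practice the null probabilities would be the single-origin counts divided by their total; a cell that is empty in the single-origin set but occupied in the dual-origin set yields a log-probability of minus infinity, so such cells may need pooling before testing.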
While only a handful of numin events with NHEJ signatures were previously observed [17], our data suggest that NHEJ is a major mechanism in numins. Our results indicate that 1,128 out of 3,206 (35%) junctions between mitochondrial and plastid fragments show microhomology or blunt-end signatures.
NHEJ signatures do not reflect recurring insertion events
Two circumstances can explain the presence of NHEJ signatures between two organelle fragments either from a single origin or from a dual origin. (a) In the case of single insertions, complex numts and nupts are concatenated before or during the capture into DSBs (this is the current consensus). (b) In the case of recurrent insertions, DSB hotspots might exist in these loci [22] and complex insertions could result from multiple insertions into the same locus in the nuclear genome. That is, first an organelle fragment is captured in a DSB in the nuclear genome and later the nuclear genome undergoes a second DSB at the same locus where a second fragment is captured. If the latter mechanism is frequent, numins would provide an opportunity to detect it.
We looked for evidence of recurring events by analyzing numins in O. sativa subsp. japonica and comparing them to insertion events in O. sativa subsp. indica [17], but found no cases that would reflect a recurrent insertion mechanism. This suggests that the same mechanism operating for numts and nupts is also responsible for the integration of numins. Thus, it appears that the concatenation of mitochondrial and plastid fragments in numins occurs before or during integration into the DSB in the nuclear genome, which requires the coexistence of free DNA from both organelles somewhere in the cell, possibly in the nucleoplasm or in autophagosomes, prior to insertion.
Complex insertions show long homology
Complex insertions can potentially be the result of homologous recombination or single-strand annealing (SSA) between fragments either during or after integration. Such long homology could not be identified in previous studies of primates [16]. However, in the present sample of plant genomes, we identified cases with long homology by screening insertions for organelle fragments whose sequences overlap by at least 40 bases, a homology stretch that is too long for NHEJ. A fragment that was degraded in the nuclear genome can mistakenly be identified as two overlapping organelle fragments with long homology. To prevent false identification of insertions that occurred through long-homology mechanisms, we required that the overlapping fragments cover at least 100 base pairs that are not shared with other fragments.
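The screening criteria can be expressed as a small filter. The sketch assumes that the overlap is measured between the nuclear intervals of two fragments (half-open coordinates) and that the 100 bp criterion applies to each fragment of the pair separately; both points, as well as the function name, are interpretive assumptions rather than the documented procedure.

```python
def long_homology(frag_a, frag_b, min_overlap=40, min_unique=100):
    """
    frag_a, frag_b: (start, end) nuclear intervals, half-open.
    Returns the overlap length if the pair passes both criteria, else None.
    """
    (a_start, a_end), (b_start, b_end) = sorted([frag_a, frag_b])
    overlap = min(a_end, b_end) - b_start
    if overlap < min_overlap:
        return None  # short enough to be compatible with NHEJ, or no overlap
    unique_a = (a_end - a_start) - overlap
    unique_b = (b_end - b_start) - overlap
    if min(unique_a, unique_b) < min_unique:
        return None  # may be a single degraded fragment split in two
    return overlap
```

As a check against the numbers quoted for the Z. mays example in Fig. 4, two fragments of 5,534 bp and 3,057 bp placed so that they share 186 bp, for example `long_homology((0, 5534), (5348, 8405))`, pass both criteria.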
We identified 16 such events (Additional file 2: Table S2) in complex numts, complex nupts, and numins. An example of a complex numt in Z. mays chr4:156,747,273-156,755,671 is shown in Fig. 4. This numt is composed of two mitochondrial fragments that are 5,534 bp and 3,057 bp long, overlapping by 186 bp.
In addition to these 16 events, each of which is unique within its genome, the grape nuclear genome shows insertions with long homology that appear multiple times. The most extreme example is an insertion of four mitochondrial fragments, with overlaps between fragments of up to 440 bp, that appears at least 38 times in the nuclear genome. These copies are highly similar to each other and to the mitochondrial genome. It is unclear how insertions composed of the same fragments arose multiple times in the genome. It seems unlikely that the same insertion integrated independently multiple times. Interestingly, one of the four fragments of this insertion includes mitochondrial orf333, which encodes a reverse transcriptase LTR, suggesting that these might be duplicated copies of one or a few insertions.
NHEJ signatures are enriched in recent numins
Our finding of NHEJ signatures in 35% of numin junctions with adjacent fragments might underestimate the contribution of that mechanism, because numts and nupts are degraded by mutations after integration as part of nuclear genome evolution [19,20,21,22]. This process can give rise to longer distances between insertion fragments [32], damaging NHEJ signatures. Therefore, numins with NHEJ signatures should show a higher similarity to organelle DNA than numins without NHEJ signatures.
To test this, we labeled insertions as NHEJ or non-NHEJ. NHEJ insertions were defined as those that contain at least one junction with an overlap of zero to ten base pairs between fragments. Conversely, we labeled insertions as non-NHEJ if all of their fragments are separated by at least one base pair. Insertions with long homology between fragments were omitted to avoid contamination by SSA or gene conversion mechanisms.
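These labeling rules translate directly into a small function. The sketch assumes that each junction is summarized by a signed gap (next fragment start minus previous fragment end, so a negative value is an overlap) and that long homology means an overlap of at least 40 bp, as in the screening described earlier. Overlaps between 11 and 39 bp are not covered by the stated definitions, so the sketch reports them separately; that category and the function name are assumptions.

```python
def label_insertion(gaps, max_nhej_overlap=10, long_homology=40):
    """
    gaps: signed distances between consecutive fragments of one insertion;
          0 means abutting fragments, negative values are overlaps.
    """
    if not gaps:
        raise ValueError("an insertion needs at least two fragments")
    overlaps = [-g for g in gaps]
    if any(o >= long_homology for o in overlaps):
        return "omitted"       # possible SSA or gene conversion
    if any(0 <= o <= max_nhej_overlap for o in overlaps):
        return "NHEJ"          # at least one junction with 0-10 bp overlap
    if all(g >= 1 for g in gaps):
        return "non-NHEJ"      # every junction separated by >= 1 bp
    return "unclassified"      # overlaps of 11-39 bp only (not defined in the text)
```

For example, an insertion with gaps `[0, 250]` is labeled NHEJ because one junction is blunt, whereas `[5, 1200]` is labeled non-NHEJ.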
We calculated a p-distance for each numin, for a total of 1,563 insertions: 750 NHEJ insertions and 813 non-NHEJ insertions. The p-distance distributions for the seven organisms with at least 100 numins (Fig. 5) show that NHEJ signatures are enriched in recent numins. Indeed, a Mann-Whitney test shows that the p-distance of NHEJ insertions is significantly lower than that of non-NHEJ insertions for all organisms analyzed (one-tailed p value < 10⁻⁵). Thus, the number of NHEJ insertions identified in our data is probably an underestimate of the insertions integrated via this mechanism.
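The two quantities used in this comparison can be sketched in plain Python. The p-distance is taken here as the proportion of differing sites among aligned columns, with gapped columns skipped (the gap handling is an assumption). The Mann-Whitney test is written as a one-tailed normal approximation with a continuity correction and without a tie correction, which is a simplification; a statistics library would normally be preferred for real data. The function names are illustrative.

```python
import math

def p_distance(aligned_a, aligned_b):
    """Proportion of differing sites among ungapped aligned columns."""
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if "-" not in (x, y)]
    if not pairs:
        raise ValueError("no ungapped aligned sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def mann_whitney_less(x, y):
    """One-tailed p value for the alternative that x tends to be smaller than y."""
    n1, n2 = len(x), len(y)
    # U counts pairs in which the x value exceeds the y value (ties count 0.5).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mean + 0.5) / sd                      # continuity correction
    return 0.5 * math.erfc(-z / math.sqrt(2))      # standard normal CDF at z
```

Calling `mann_whitney_less(nhej_distances, non_nhej_distances)` on the per-insertion p-distances of one organism would then return a small value when the NHEJ insertions are systematically closer to the organelle sequence.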